# Spark - Working with arrays of structs

In this Scala notebook we load a JSON file (inferring its schema) that contains an array of structs, and then we:
 - Explode the array.
 - Modify an array of structs by renaming its fields and adding a new field.

In [1]:
import $ivy.`org.apache.spark::spark-sql:2.4.0`

[32mimport [39m[36m$ivy.$                                  [39m

In [2]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

[32mimport [39m[36morg.apache.spark.sql.SparkSession
[39m
[32mimport [39m[36morg.apache.spark.sql.types._
[39m
[32mimport [39m[36morg.apache.spark.sql.functions._[39m

In [4]:
val spark = SparkSession.builder().appName("Spark").master("local[*]").getOrCreate()

[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@2e2dfbc6

In [5]:
import spark.implicits._

[32mimport [39m[36mspark.implicits._[39m

In [6]:
val colorsDF = spark.read.option("multiLine", true).json("../files/colors.json")

[36mcolorsDF[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mpackage[39m.[32mDataFrame[39m = [colors: array<struct<code:struct<hex:string,rgb:struct<Blue: :bigint,Green: :bigint,Red: :bigint>>,color:string,old_tags:array<struct<old_tags_names:array<string>,old_tags_values:array<bigint>>>,type:string>>]

Once the JSON has been read, let's print the schema and show the DataFrame content to see what we have:

In [7]:
colorsDF.printSchema

root
 |-- colors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- code: struct (nullable = true)
 |    |    |    |-- hex: string (nullable = true)
 |    |    |    |-- rgb: struct (nullable = true)
 |    |    |    |    |-- Blue: : long (nullable = true)
 |    |    |    |    |-- Green: : long (nullable = true)
 |    |    |    |    |-- Red: : long (nullable = true)
 |    |    |-- color: string (nullable = true)
 |    |    |-- old_tags: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- old_tags_names: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- old_tags_values: array (nullable = true)
 |    |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- type: string (nullable = true)



In [8]:
colorsDF.show(false)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|colors                                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[[#000, [255, 255, 255]], black, [[[tag1, tag2, tag3],], [, [10, 22, 60]]], primary], [[#FFF, [0, 0, 0]], white, [[[tag4, tag5],], [, [5, 2]]],], [[#FF0, [0, 0, 255]], red, [[[tag6],], [, [100]]], primary]]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------

First, let's explode the _colors_ array to get one row per color:

In [9]:
val exploded_colors = colorsDF
    .withColumn("colors", explode($"colors"))
    .select("colors.color", "colors.type", "colors.code", "colors.old_tags")

[36mexploded_colors[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mpackage[39m.[32mDataFrame[39m = [color: string, type: string ... 2 more fields]

Let's see the new schema and content:

In [10]:
exploded_colors.printSchema

root
 |-- color: string (nullable = true)
 |-- type: string (nullable = true)
 |-- code: struct (nullable = true)
 |    |-- hex: string (nullable = true)
 |    |-- rgb: struct (nullable = true)
 |    |    |-- Blue: : long (nullable = true)
 |    |    |-- Green: : long (nullable = true)
 |    |    |-- Red: : long (nullable = true)
 |-- old_tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- old_tags_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- old_tags_values: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)



In [11]:
exploded_colors.show(false)

+-----+-------+-----------------------+-----------------------------------------+
|color|type   |code                   |old_tags                                 |
+-----+-------+-----------------------+-----------------------------------------+
|black|primary|[#000, [255, 255, 255]]|[[[tag1, tag2, tag3],], [, [10, 22, 60]]]|
|white|null   |[#FFF, [0, 0, 0]]      |[[[tag4, tag5],], [, [5, 2]]]            |
|red  |primary|[#FF0, [0, 0, 255]]    |[[[tag6],], [, [100]]]                   |
+-----+-------+-----------------------+-----------------------------------------+
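
One thing to keep in mind with `explode` is that it produces no output row for an empty or null array. The sketch below uses a small in-memory DataFrame (the `items` data and its column names are hypothetical, not part of `colors.json`) to compare it with `explode_outer`, and assumes Spark SQL 2.4 is on the classpath as loaded in the first cell:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, explode_outer}

// Reuses the notebook's session if one is already running.
val spark = SparkSession.builder().appName("Spark").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: row "b" holds an empty array.
val items = Seq(
  ("a", Seq("x", "y")),
  ("b", Seq.empty[String])
).toDF("id", "items")

// explode emits one row per array element, so "b" produces no rows.
val inner = items.select(col("id"), explode(col("items")).as("item"))

// explode_outer keeps "b" as a single row with a null item.
val outer = items.select(col("id"), explode_outer(col("items")).as("item"))

inner.show(false)
outer.show(false)
```

In this notebook every row has a non-empty `colors` array, so both functions would give the same result here.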



Now let's work with the `old_tags` array of structs (each struct contains `old_tags_names` and `old_tags_values`).

Let's try **renaming the struct and its fields**, using `new` instead of `old`, **and adding a new field**: `new_tag_description`.

In [12]:
exploded_colors
    .withColumn("new_tags",
        struct(
            col("old_tags.old_tags_names").as("new_tags_names"),
            col("old_tags.old_tags_values").as("new_tags_values"),
            array(lit("")).as("new_tag_description")
        )
    )
    .printSchema

root
 |-- color: string (nullable = true)
 |-- type: string (nullable = true)
 |-- code: struct (nullable = true)
 |    |-- hex: string (nullable = true)
 |    |-- rgb: struct (nullable = true)
 |    |    |-- Blue: : long (nullable = true)
 |    |    |-- Green: : long (nullable = true)
 |    |    |-- Red: : long (nullable = true)
 |-- old_tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- old_tags_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- old_tags_values: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |-- new_tags: struct (nullable = false)
 |    |-- new_tags_names: array (nullable = true)
 |    |    |-- element: array (containsNull = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |-- new_tags_values: array (nullable = true)
 |    |    |-- element: array (containsNull = true)
 |    |    |    |-- element: long (containsNull = tr

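This is not what we want. Instead of an array of structs, we get a struct of multiple arrays: selecting a field through an array of structs (`old_tags.old_tags_names`) returns an array holding that field's value for every element, so each field ends up as an array of arrays.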
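
The reason is how Spark resolves a field path that goes through an array of structs. A minimal sketch, using a hypothetical one-row DataFrame with an `items` column (not part of `colors.json`) and assuming Spark SQL 2.4 is available as in the first cell:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the notebook's session if one is already running.
val spark = SparkSession.builder().appName("Spark").master("local[*]").getOrCreate()

// Hypothetical data: a single row holding an array of two structs.
val df = spark.sql(
  "SELECT array(named_struct('name', 'a', 'value', 1), named_struct('name', 'b', 'value', 2)) AS items")

// Selecting a struct field through the array returns an array of that field's values.
val names = df.select(col("items.name").as("names"))

names.printSchema()
names.show(false)
```

Since `old_tags_names` is itself an array, the same path in the cell above yields an array of arrays.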

We need to use the [`arrays_zip`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions) function (new in Spark 2.4.0):

In [13]:
exploded_colors
    .withColumn("new_tags",
        arrays_zip(
            col("old_tags.old_tags_names").as("new_tags_names"),
            col("old_tags.old_tags_values").as("new_tags_values"),
            array(array(lit("Description"))).as("new_tag_description")
        )
    )
    .printSchema

root
 |-- color: string (nullable = true)
 |-- type: string (nullable = true)
 |-- code: struct (nullable = true)
 |    |-- hex: string (nullable = true)
 |    |-- rgb: struct (nullable = true)
 |    |    |-- Blue: : long (nullable = true)
 |    |    |-- Green: : long (nullable = true)
 |    |    |-- Red: : long (nullable = true)
 |-- old_tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- old_tags_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- old_tags_values: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |-- new_tags: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- 0: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- 1: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- 2: array (nullable = true)
 |    |    |    |-- element:

But as we can see, the fields of the struct **were not renamed**: they are called `0`, `1` and `2` instead of the aliases we passed.

To solve this, we have to create a schema for the new array of structs...

In [14]:
val new_tags_schema = ArrayType(new StructType()
    .add("new_tags_names", ArrayType(StringType))
    .add("new_tags_values",  ArrayType(LongType))
    .add("new_tag_description", ArrayType(StringType))
    )

[36mnew_tags_schema[39m: [32mArrayType[39m = [33mArrayType[39m(
  StructType(StructField(new_tags_names,ArrayType(StringType,true),true), StructField(new_tags_values,ArrayType(LongType,true),true), StructField(new_tag_description,ArrayType(StringType,true),true)),
  [32mtrue[39m
)

... and **apply the schema** by casting the array of structs that `arrays_zip` produces:

In [15]:
val new_colors = exploded_colors
    .withColumn("new_tags",
        arrays_zip(
            col("old_tags.old_tags_names").as("new_tags_names"),
            col("old_tags.old_tags_values").as("new_tags_values"),
            array(array(lit("Description"))).as("new_tag_description")
        ).cast(new_tags_schema)
    )

new_colors.printSchema
new_colors.show(false)

root
 |-- color: string (nullable = true)
 |-- type: string (nullable = true)
 |-- code: struct (nullable = true)
 |    |-- hex: string (nullable = true)
 |    |-- rgb: struct (nullable = true)
 |    |    |-- Blue: : long (nullable = true)
 |    |    |-- Green: : long (nullable = true)
 |    |    |-- Red: : long (nullable = true)
 |-- old_tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- old_tags_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- old_tags_values: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |-- new_tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- new_tags_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- new_tags_values: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- new_tag_description: array (n

[36mnew_colors[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mpackage[39m.[32mDataFrame[39m = [color: string, type: string ... 3 more fields]

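Here we can see that both structures hold the same `tags_names` and `tags_values` data. Note that only the first element of `new_tags` carries the description: the description array has a single element, so `arrays_zip` fills the remaining position with null.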

In [16]:
new_colors
    .select("old_tags", "new_tags")
    .show(false)

+-----------------------------------------+---------------------------------------------------------+
|old_tags                                 |new_tags                                                 |
+-----------------------------------------+---------------------------------------------------------+
|[[[tag1, tag2, tag3],], [, [10, 22, 60]]]|[[[tag1, tag2, tag3],, [Description]], [, [10, 22, 60],]]|
|[[[tag4, tag5],], [, [5, 2]]]            |[[[tag4, tag5],, [Description]], [, [5, 2],]]            |
|[[[tag6],], [, [100]]]                   |[[[tag6],, [Description]], [, [100],]]                   |
+-----------------------------------------+---------------------------------------------------------+
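
The null padding visible in the `new_tags` column can be pictured with plain Scala collections. This is only an analogy built on `zipAll`, not Spark code; `None` stands in for null, and the values mirror the first row of the table above:

```scala
// The three input arrays for the "black" row, with None standing in for null.
val names: Seq[Option[Seq[String]]] = Seq(Some(Seq("tag1", "tag2", "tag3")), None)
val values: Seq[Option[Seq[Long]]] = Seq(None, Some(Seq(10L, 22L, 60L)))
val descriptions: Seq[Option[Seq[String]]] = Seq(Some(Seq("Description")))

// zipAll pads the shorter side, which is what happens to the one-element description array.
val zipped = names
  .zipAll(values, None, None)
  .zipAll(descriptions, (None, None), None)
  .map { case ((n, v), d) => (n, v, d) }

zipped.foreach(println)
```

If every element should carry a description, the third array needs as many elements as `old_tags` has.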



Finally, we drop `old_tags` and show the expected result:

In [17]:
val final_colors = new_colors.drop("old_tags")

final_colors.printSchema
final_colors.show(false)

root
 |-- color: string (nullable = true)
 |-- type: string (nullable = true)
 |-- code: struct (nullable = true)
 |    |-- hex: string (nullable = true)
 |    |-- rgb: struct (nullable = true)
 |    |    |-- Blue: : long (nullable = true)
 |    |    |-- Green: : long (nullable = true)
 |    |    |-- Red: : long (nullable = true)
 |-- new_tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- new_tags_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- new_tags_values: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- new_tag_description: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

+-----+-------+-----------------------+---------------------------------------------------------+
|color|type   |code                   |new_tags                                                 |
+-----+-------+-----------------------+--

[36mfinal_colors[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mpackage[39m.[32mDataFrame[39m = [color: string, type: string ... 2 more fields]