# Spark SQL

In this section, we will study the Spark SQL API.

## Basic DataFrame Operations

Now we will look at some of the DataFrame operations. Among others, we can highlight the following ones:

    * show()
    * select()
    * filter()
    * groupBy()

First, we create a DataFrame:

In [1]:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row
import org.apache.spark.sql.{functions => F}


In [2]:
val schema = new StructType(Array(StructField("Name", StringType, true),
                                  StructField("Age", IntegerType, true)))

val rddData = sc.parallelize(List(("John", 25), ("Maria", 33), ("Irene", 75), ("John", 45)))

val rdd = rddData.map(x => Row(x._1, x._2))

val df = spark.createDataFrame(rdd, schema)

schema = StructType(StructField(Name,StringType,true), StructField(Age,IntegerType,true))
rddData = ParallelCollectionRDD[0] at parallelize at <console>:35
rdd = MapPartitionsRDD[1] at map at <console>:37
df = [Name: string, Age: int]


[Name: string, Age: int]
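
For small in-memory data, the same DataFrame can usually be built without an explicit `Row` RDD. The following is a minimal sketch, assuming a local session; the names `dfFromSeq` and the app name are illustrative and not part of the original notebook. Column names and types match the DataFrame above, although nullability flags may differ because they are inferred from the Scala types.

```scala
import org.apache.spark.sql.SparkSession

// In a notebook, `spark` usually exists already; getOrCreate() then returns that session.
val spark = SparkSession.builder().master("local[*]").appName("toDF-sketch").getOrCreate()
import spark.implicits._ // brings toDF into scope for local Seq collections

// Column types are inferred from the tuple element types (String, Int).
val dfFromSeq = Seq(("John", 25), ("Maria", 33), ("Irene", 75), ("John", 45)).toDF("Name", "Age")

dfFromSeq.printSchema()
```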

`show(n)` --> to show the first n rows of the DataFrame

In [3]:
df.show(2)

+-----+---+
| Name|Age|
+-----+---+
| John| 25|
|Maria| 33|
+-----+---+
only showing top 2 rows



`select()` --> to select some columns of the DataFrame

In [4]:
df.select("Name").show()

+-----+
| Name|
+-----+
| John|
|Maria|
|Irene|
| John|
+-----+
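
`select()` also accepts `Column` expressions, so a projection can compute a derived column. A small sketch under assumed names (`people`, `AgeNextYear` are illustrative only):

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// In a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._

val people = Seq(("John", 25), ("Maria", 33)).toDF("Name", "Age")

// Keep one column as it is and add one derived column under a new name.
val projected = people.select(F.col("Name"), (F.col("Age") + 1).alias("AgeNextYear"))

projected.show()
```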



`filter()` --> to filter the rows of the DataFrame according to a condition

In [5]:
df.filter(F.col("Age") > 30).show()

+-----+---+
| Name|Age|
+-----+---+
|Maria| 33|
|Irene| 75|
| John| 45|
+-----+---+
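
Conditions can be combined, and `filter()` also accepts a SQL expression string. The sketch below rebuilds the same four rows under the assumed name `people` and expresses one condition in both styles:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// In a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("filter-sketch").getOrCreate()
import spark.implicits._

val people = Seq(("John", 25), ("Maria", 33), ("Irene", 75), ("John", 45)).toDF("Name", "Age")

// Column-based condition: && combines predicates, === tests column equality.
val olderJohns = people.filter(F.col("Age") > 30 && F.col("Name") === "John")

// The same condition written as a SQL expression string.
val olderJohnsSql = people.filter("Age > 30 AND Name = 'John'")

olderJohns.show()
```

Both variants should keep only the row for John, aged 45.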



`groupBy()` --> to group the DataFrame by the values of one or several columns

In [6]:
df.groupBy("Name").count().show()

+-----+-----+
| Name|count|
+-----+-----+
|Irene|    1|
| John|    2|
|Maria|    1|
+-----+-----+
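
Besides `count()`, a grouped DataFrame can compute several aggregates at once with `agg()`. A minimal sketch on the same four rows; the names `people`, `stats` and the aliases are assumptions made for illustration:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// In a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("groupBy-sketch").getOrCreate()
import spark.implicits._

val people = Seq(("John", 25), ("Maria", 33), ("Irene", 75), ("John", 45)).toDF("Name", "Age")

// One output row per Name, with the row count, mean age and maximum age.
val stats = people
  .groupBy("Name")
  .agg(
    F.count(F.lit(1)).alias("n"),
    F.avg("Age").alias("AvgAge"),
    F.max("Age").alias("MaxAge"))
  .orderBy("Name")

stats.show()
```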



## Loading and Saving Data

In this section, we will explore how to load and save data in three different formats:

    * Parquet
    * CSV
    * JSON

### Parquet Format

Loading data

In [7]:
val parquetData = spark.read.parquet("../data/person.parquet")

parquetData = [Name: string, Age: int]


[Name: string, Age: int]

In [8]:
parquetData.show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



Saving data

In [9]:
parquetData.write.mode("overwrite").parquet("../data/person_write.parquet")

In [10]:
spark.read.parquet("../data/person_write.parquet").show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+
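
Parquet files store the schema together with the data, which is why no schema options were needed when reading. The sketch below illustrates this with a round trip through a temporary directory; the directory and the names `people` and `reloaded` are assumptions, not files from the original notebook.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("parquet-sketch").getOrCreate()
import spark.implicits._

val people = Seq(("Raul", 29), ("Javi", 34)).toDF("Name", "Age")

// Hypothetical output location: a fresh temporary directory on the local file system.
val outDir = Files.createTempDirectory("person_parquet").toUri.toString

people.write.mode("overwrite").parquet(outDir)

// Column names and types come back from the file metadata, without inference.
val reloaded = spark.read.parquet(outDir)
reloaded.printSchema()
```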



### CSV Format

Loading data

In [11]:
val csvData = spark.read.option("header", "true").option("inferSchema", "true").csv("../data/person.csv")

csvData = [Name: string, Age: int]


[Name: string, Age: int]

In [12]:
csvData.show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



In [13]:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val schema = new StructType(Array(StructField("Name", StringType, true), 
                                  StructField("Age", IntegerType, true)))

schema = StructType(StructField(Name,StringType,true), StructField(Age,IntegerType,true))


StructType(StructField(Name,StringType,true), StructField(Age,IntegerType,true))

In [14]:
val csvDataSchema = spark.read.schema(schema).csv("../data/person.csv")

csvDataSchema = [Name: string, Age: int]


[Name: string, Age: int]

In [15]:
csvDataSchema.show()

+----+----+
|Name| Age|
+----+----+
|null|null|
|Raul|  29|
|Javi|  34|
+----+----+
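
The `null` row in the output above is most likely the header line of the file: since the `header` option is not set in this read, `Name,Age` is parsed as a data row, and `Age` cannot be read as an integer. A minimal sketch of combining an explicit schema with the `header` option follows. The temporary file and its contents are assumptions that mirror the data shown earlier; they are not the original `person.csv`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// In a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("csv-sketch").getOrCreate()

// Hypothetical input: a small CSV file with a header line.
val csvPath = Files.createTempFile("person", ".csv")
Files.write(csvPath, "Name,Age\nRaul,29\nJavi,34\n".getBytes(StandardCharsets.UTF_8))

val personSchema = StructType(Seq(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true)))

// header=true skips the first line; the schema fixes the column types without inference.
val withHeader = spark.read
  .option("header", "true")
  .schema(personSchema)
  .csv(csvPath.toUri.toString)

withHeader.show()
```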



Saving data

In [16]:
csvData.write.mode("overwrite").option("header", "true").csv("../data/person_write.csv")

In [17]:
spark.read.option("inferSchema", "true").option("header", "true").csv("../data/person_write.csv").show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



### JSON Format

Loading data

In [18]:
val jsonData = spark.read.json("../data/person.json")

jsonData = [age: bigint, name: string]


[age: bigint, name: string]

In [19]:
jsonData.show()

+---+----+
|age|name|
+---+----+
| 29|Raul|
| 33|Javi|
+---+----+



Saving data

In [20]:
jsonData.write.mode("overwrite").json("../data/person_write.json")

In [21]:
spark.read.json("../data/person_write.json").show()

+---+----+
|age|name|
+---+----+
| 29|Raul|
| 33|Javi|
+---+----+
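
Note that schema inference for JSON reports `age` as `bigint` and lists the columns alphabetically. If a specific type or column order is wanted, an explicit schema can be passed to the reader. The following sketch assumes a temporary JSON Lines file with made-up contents that mirror the data above; `jsonSchema`, `inferred` and `typed` are illustrative names.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// In a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("json-sketch").getOrCreate()

// Hypothetical input: one JSON object per line.
val jsonLines = Seq("""{"name":"Raul","age":29}""", """{"name":"Javi","age":33}""")
val jsonPath = Files.createTempFile("person", ".json")
Files.write(jsonPath, jsonLines.mkString("\n").getBytes(StandardCharsets.UTF_8))

// Inferred schema: whole numbers become bigint, columns are sorted by name.
val inferred = spark.read.json(jsonPath.toUri.toString)

// Explicit schema: the declared types and column order are used instead.
val jsonSchema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))
val typed = spark.read.schema(jsonSchema).json(jsonPath.toUri.toString)

inferred.printSchema()
typed.printSchema()
```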



## User-Defined Functions

User-defined functions (UDFs) allow us to apply a custom function to one or several columns to produce a new one.

Let's check the following dataframe:

In [22]:
df.show()

+-----+---+
| Name|Age|
+-----+---+
| John| 25|
|Maria| 33|
|Irene| 75|
| John| 45|
+-----+---+



Now we are going to create a new column, "Young_Tag", with two possible values: 1 if age <= 30 and 0 if age > 30. In order to do that, we are going to create our `udf`:

In [23]:
/**
 * Returns 1 if age <= 30 and 0 if age > 30.
 *
 * @param age age of the person
 * @return young tag (0 or 1)
 */
val ageTag = (age: Int) => if (age <= 30) 1 else 0

val ageUdf = F.udf(ageTag)

ageTag: Int => Int = <function1>
ageUdf = UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))


UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))

In [24]:
df.withColumn("Young_Tag", ageUdf(F.col("Age"))).show()

+-----+---+---------+
| Name|Age|Young_Tag|
+-----+---+---------+
| John| 25|        1|
|Maria| 33|        0|
|Irene| 75|        0|
| John| 45|        0|
+-----+---+---------+
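
For a simple rule like this one, the same column can also be written with the built-in `when` / `otherwise` functions, which Spark can optimize, whereas a UDF is opaque to the optimizer. This is an alternative sketch, not the approach used above; `people` and `tagged` are assumed names.

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

// In a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("when-sketch").getOrCreate()
import spark.implicits._

val people = Seq(("John", 25), ("Maria", 33), ("Irene", 75), ("John", 45)).toDF("Name", "Age")

// Same rule as the UDF: 1 if Age <= 30, otherwise 0.
val tagged = people.withColumn("Young_Tag", F.when(F.col("Age") <= 30, 1).otherwise(0))

tagged.show()
```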

