# Spark SQL

In this section, we will study the Spark SQL API.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL").master("local[*]").getOrCreate()
sc = spark.sparkContext

## Basic DataFrame Operations

Now we will look at some of the DataFrame operations. Among others, we can highlight the following:

    * show()
    * select()
    * filter()
    * groupBy()

First, we create a DataFrame:

In [2]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as F

In [3]:
schema = StructType([StructField("Name", StringType(), True),
                     StructField("Age", IntegerType(), True)])

rdd_data = sc.parallelize([("John", 25), ("Maria", 33), ("Irene", 75), ("John", 45)])

df = spark.createDataFrame(rdd_data, schema)

`show(n)` --> to show the first n rows of the DataFrame

In [4]:
df.show(2)

+-----+---+
| Name|Age|
+-----+---+
| John| 25|
|Maria| 33|
+-----+---+
only showing top 2 rows



`select()` --> to select some columns of the DataFrame

In [5]:
df.select("Name").show()

+-----+
| Name|
+-----+
| John|
|Maria|
|Irene|
| John|
+-----+



`filter()` --> to filter the rows of the DataFrame according to a condition

In [6]:
df.filter(F.col("Age") > 30).show()

+-----+---+
| Name|Age|
+-----+---+
|Maria| 33|
|Irene| 75|
| John| 45|
+-----+---+



`groupBy()` --> to group the DataFrame by the values of one or several columns

In [7]:
df.groupBy("Name").count().show()

+-----+-----+
| Name|count|
+-----+-----+
|Irene|    1|
| John|    2|
|Maria|    1|
+-----+-----+



## Loading and Saving Data

In this section, we will explore how to load and save data in three different formats:

    * Parquet
    * CSV
    * JSON

### Parquet Format

Loading data

In [8]:
parquet_data = spark.read.parquet("../data/person.parquet")

In [9]:
parquet_data.show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



Saving data

In [10]:
parquet_data.write.mode("overwrite").parquet("../data/person_write.parquet")

In [11]:
spark.read.parquet("../data/person_write.parquet").show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



### CSV Format

Loading data

In [12]:
csv_data = spark.read.option("header", "true").option("inferSchema", "true").csv("../data/person.csv")

In [13]:
csv_data.show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



In [14]:
csv_data_bis = spark.read.csv("../data/person.csv", header=True, inferSchema=True)

In [15]:
csv_data_bis.show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



In [16]:
import pyspark.sql.types as T

In [17]:
schema = T.StructType([T.StructField("Name", T.StringType(), True),
                       T.StructField("Age", T.IntegerType(), True)])

In [18]:
csv_data_schema = spark.read.csv("../data/person.csv", header=True, schema=schema)

In [19]:
csv_data_schema.show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



Saving data

In [20]:
csv_data.write.mode("overwrite").csv("../data/person_write.csv", header=True)

In [21]:
spark.read.csv("../data/person_write.csv", header=True, inferSchema=True).show()

+----+---+
|Name|Age|
+----+---+
|Raul| 29|
|Javi| 34|
+----+---+



### JSON Format

Loading data

In [22]:
json_data = spark.read.json("../data/person.json")

In [23]:
json_data.show()

+---+----+
|age|name|
+---+----+
| 29|Raul|
| 33|Javi|
+---+----+



Saving data

In [24]:
json_data.write.mode("overwrite").json("../data/person_write.json")

In [25]:
spark.read.json("../data/person_write.json").show()

+---+----+
|age|name|
+---+----+
| 29|Raul|
| 33|Javi|
+---+----+



## User-Defined Functions

User-defined functions (UDFs) allow us to apply a custom Python function to one or several columns to produce a new one.

Let's check the following DataFrame:

In [26]:
df.show()

+-----+---+
| Name|Age|
+-----+---+
| John| 25|
|Maria| 33|
|Irene| 75|
| John| 45|
+-----+---+



Now we are going to create a new column, "Young_Tag", with two possible values: 1 if age <= 30 and 0 if age > 30. To do that, we define our own `udf`:

In [27]:
def age_tag(age):
    """
    Return 1 if age <= 30 and 0 if age > 30.

    :param age: age in years
    :return: young tag (0 or 1)
    """
    tag = 0
    if age <= 30:
        tag = 1
    return tag

# Declare the return type; without it, F.udf defaults to StringType
age_udf = F.udf(age_tag, IntegerType())

In [28]:
df.withColumn("Young_Tag", age_udf(F.col("Age"))).show()

+-----+---+---------+
| Name|Age|Young_Tag|
+-----+---+---------+
| John| 25|        1|
|Maria| 33|        0|
|Irene| 75|        0|
| John| 45|        0|
+-----+---+---------+

