## Module 2

Creating Datasets, organising raw data and working with structured APIs

This notebook runs on a Databricks cluster: **DBR 9.1 LTS | Spark 3.1.2 | Scala 2.12**

The notebook's default language is **Python**.

## Reading data with PySpark, SparkR and Scala

We will create a SparkSession, the entry point for working with DataFrames, using **PySpark**.

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession \
    .builder \
    .appName("Module 2 with Python") \
    .getOrCreate()

In [0]:
# spark is an existing SparkSession
df = spark.read.json("dbfs:/FileStore/module2/json2_1.json")

# Displays the content of the DataFrame to stdout
df.show()

### Some DataFrame operations using **PySpark**

In [0]:
# Print the schema in a tree format
df.printSchema()

In [0]:
# Select only the "gender" column
df.select("gender").show()

In [0]:
# Select name and gender for everybody, and increment the age by 1
# (several columns and expressions can be combined in one select list)
df.select(df['name'], df['gender'], df['age'] + 1).show()

In [0]:
# Row filtering: Select people older than 21
df.filter(df['age'] > 21).show()

In [0]:
# Row filtering: Select people between 7 and 31
df.filter((df['age'] > 7) & (df['age'] <= 31)).show()

In [0]:
# Count people by age
df.groupBy("age").count().show()

### Some DataFrame operations using **SparkR**

In [0]:
%r
library(SparkR)

In [0]:
%r
sparkR.session(appName = "Module 2 with R")

In [0]:
%r

df <- read.json("dbfs:/FileStore/module2/json2_1.json")

In [0]:
%r
head(df,4)

In [0]:
%r
showDF(df)

In [0]:
%r

# Print the schema in a tree format
printSchema(df)

In [0]:
%r

# Select only the "name" column
head(select(df, "name"))
 

In [0]:

%r

# Select everybody, but increment the age by 1
head(select(df, df$name, df$age + 1))


In [0]:
%r

# Select people older than 21
head(where(df, df$age > 21))


In [0]:
%r

# Row filtering: Select people between 7 and 31
head(where(df, df$age > 7 & df$age <= 31))


In [0]:
%r

# Count people by age
head(count(groupBy(df, "age")))



### Some DataFrame operations using **Scala**

In [0]:
%scala

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Module 2 with Scala")
  .getOrCreate()

In [0]:
%scala

val df = spark.read.json("dbfs:/FileStore/module2/json2_1.json")

// Displays the content of the DataFrame to stdout
df.show()

In [0]:
%scala

// This import is needed to use the $-notation
import spark.implicits._
df.printSchema()

In [0]:
%scala

// Select only the "name" column
df.select("name").show()


In [0]:
%scala

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()


In [0]:
%scala

// Select people older than 21
df.filter($"age" > 21).show()


In [0]:
%scala

//  Row filtering: Select people between 7 and 31
df.filter($"age" > 7 && $"age" <= 31).show()


In [0]:
%scala

// Count people by age
df.groupBy("age").count().show()


## Running SQL

### SQL with PySpark

In [0]:
# Register the DataFrame as a global temporary view
df.createGlobalTempView("people")


In [0]:
# Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()

In [0]:
# Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()

### SQL with Scala

In [0]:
%scala
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()


### SQL with R

In [0]:
%r
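# Register the DataFrame as a temporary view, then query it with SQL
createOrReplaceTempView(df, "people")
sqlDF <- sql("SELECT * FROM people")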

## Creating and organising Data sources and DataSets

### Loading data, bucketing data with PySpark

If you are running this outside the Databricks environment, submit the script with `spark-submit`:
```
  ./bin/spark-submit your/path/to/python.py
```

In [0]:
# Entry point for DataFrame functionality
from pyspark.sql import SparkSession

# Row objects, used when building DataFrames from Python data
from pyspark.sql import Row

# Import data types
from pyspark.sql.types import StringType, StructType, StructField

In [0]:
# parquet

df = spark.read.parquet("dbfs:/FileStore/module2/users.parquet")
df.select("name", "favorite_color").write.save("dbfs:/FileStore/module2/namesAndFavColors.parquet")

In [0]:
#json

df = spark.read.load("dbfs:/FileStore/module2/people.json", format="json")
df.select("name", "age").write.save("dbfs:/FileStore/module2/namesAndAges.parquet", format="parquet")

In [0]:
#csv
df = spark.read.load("dbfs:/FileStore/module2/people.csv",
                     format="csv", sep=";", inferSchema="true", header="true")

In [0]:
#ORC 
df = spark.read.orc("dbfs:/FileStore/module2/users.orc")
(df.write.format("orc")
    .option("orc.bloom.filter.columns", "favorite_color")
    .option("orc.dictionary.key.threshold", "1.0")
    .option("orc.column.encoding.direct", "name")
    .save("users_with_options.orc"))

In [0]:
#Parquet

df = spark.read.parquet("dbfs:/FileStore/module2/users.parquet")
(df.write.format("parquet")
    .option("parquet.bloom.filter.enabled#favorite_color", "true")
    .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
    .option("parquet.enable.dictionary", "true")
    .option("parquet.page.write-checksum.enabled", "false")
    .save("users_with_options.parquet"))

In [0]:
# Reading Parquet file directly with SQL
df = spark.sql("SELECT * FROM parquet.`dbfs:/FileStore/module2/users.parquet`")

#### Saving and bucketing with PySpark

In [0]:
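# Bucketing and sorting apply only to persistent tables. Delta tables do not support them, so write as Parquet.
people_df = spark.read.json("dbfs:/FileStore/module2/people.json")  # has the "name" and "age" columns
people_df.write.format("parquet").bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")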
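
The idea behind `bucketBy(42, "name")` can be sketched in plain Python: each row is assigned to one of 42 buckets by hashing the bucketing column, so equal names always land in the same bucket. This is only an illustration. The helper `bucket_for` is hypothetical, and CRC32 is used here as a stand-in hash, so the bucket ids will not match the ones Spark actually assigns.

```python
import zlib

NUM_BUCKETS = 42  # same bucket count as in bucketBy(42, "name")


def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Map a string to a bucket id in the range [0, num_buckets).

    ASSUMPTION: CRC32 stands in for Spark's own hash function, so these
    ids are illustrative and differ from Spark's real bucket assignment.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


names = ["Alyssa", "Ben", "Alyssa"]
buckets = [bucket_for(name) for name in names]

# Equal values always map to the same bucket
print(list(zip(names, buckets)))
```

Because rows with the same key share a bucket, later joins or aggregations on that key can avoid some shuffling.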

In [0]:
# Create and save a persistent table in the metastore, replacing it if it already exists
df.write.mode("overwrite").saveAsTable("people_bucketed")

In [0]:
%sql
SELECT * FROM people_bucketed

name,favorite_color,favorite_numbers
Alyssa,,"List(3, 9, 15, 20)"
Ben,red,List()


In [0]:
# Partitioning can be used with both save and saveAsTable when using the Dataset APIs
df.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")
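
To make the effect of `partitionBy("favorite_color")` more concrete, the following plain-Python sketch groups rows into Hive-style `column=value` directories, which is the layout Spark produces on disk. The function `partition_dirs` is a hypothetical helper written for this illustration, and the two rows mirror the sample output shown earlier.

```python
from collections import defaultdict


def partition_dirs(rows, column, base):
    """Group rows by the partition column into 'base/column=value' directories."""
    groups = defaultdict(list)
    for row in rows:
        value = row[column]
        # Spark stores rows with a null partition value under this directory name
        label = "__HIVE_DEFAULT_PARTITION__" if value is None else str(value)
        # The partition column lives in the directory name, not in the data files
        remaining = {k: v for k, v in row.items() if k != column}
        groups["%s/%s=%s" % (base, column, label)].append(remaining)
    return dict(groups)


rows = [
    {"name": "Alyssa", "favorite_color": None},
    {"name": "Ben", "favorite_color": "red"},
]
layout = partition_dirs(rows, "favorite_color", "namesPartByColor.parquet")

for directory, contents in layout.items():
    print(directory, contents)
```

A query that filters on `favorite_color` only has to read the matching directories.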

In [0]:
# Partitioning and bucketing can be combined for a single table (written as Parquet, since Delta does not support bucketing)
df = spark.read.parquet("dbfs:/FileStore/module2/users.parquet")
(df
    .write
    .format("parquet")
    .partitionBy("favorite_color")
    .bucketBy(42, "name")
    .saveAsTable("users_partitioned_bucketed"))

### Loading data, bucketing data with SparkR

In [0]:
%r

library(SparkR)

In [0]:
%r

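# Reading from Parquet (format given explicitly; overwrite the output if it already exists)
df <- read.df("dbfs:/FileStore/module2/users.parquet", "parquet")
write.df(select(df, "name", "favorite_color"), "dbfs:/FileStore/module2/namesAndFavColors.parquet", "parquet", "overwrite")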

In [0]:
%r
# Reading JSON
df <- read.df("dbfs:/FileStore/module2/people.json", "json")
namesAndAges <- select(df, "name", "age")
# Without mode "overwrite", write.df fails when the output path already exists
write.df(namesAndAges, "dbfs:/FileStore/module2/namesAndAges.parquet", "parquet", "overwrite")

In [0]:
%r
# Reading CSV
df <- read.df("dbfs:/FileStore/module2/people.csv", "csv", sep = ";", inferSchema = TRUE, header = TRUE)
namesAndAges <- select(df, "name", "age")

In [0]:
%r
# Write to ORC
df <- read.df("dbfs:/FileStore/module2/users.orc", "orc")
write.orc(df, "users_with_options.orc", orc.bloom.filter.columns = "favorite_color", orc.dictionary.key.threshold = 1.0, orc.column.encoding.direct = "name")

In [0]:
%r
# Save to Parquet file
df <- read.df("dbfs:/FileStore/module2/users.parquet", "parquet")
write.parquet(df, "users_with_options.parquet")

In [0]:
%r
df <- sql("SELECT * FROM parquet.`dbfs:/FileStore/module2/users.parquet`")

In [0]:
%r
head(df)

#### Saving and bucketing

Bucketing is not supported in SparkR.

### Loading data, bucketing data with Scala

We will disable the Delta format check on Databricks: `SET spark.databricks.delta.formatCheck.enabled=false`

In [0]:
%scala

val usersDF = spark.read.load("dbfs:/FileStore/module2/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")


In [0]:
%scala

// Reading JSON
val peopleDF = spark.read.format("json").load("dbfs:/FileStore/module2/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

In [0]:
%scala

// reading CSV

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("dbfs:/FileStore/module2/people.csv")

In [0]:
%scala

//Writing to ORC data
usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .option("orc.column.encoding.direct", "name")
  .save("users_with_options.orc")

In [0]:
%scala

// writing to Parquet file

usersDF.write.format("parquet")
  .option("parquet.bloom.filter.enabled#favorite_color", "true")
  .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
  .option("parquet.enable.dictionary", "true")
  .option("parquet.page.write-checksum.enabled", "false")
  .save("users_with_options.parquet")

In [0]:
%scala
// reading directly using SQL
val sqlDF = spark.sql("SELECT * FROM parquet.`dbfs:/FileStore/module2/users.parquet`")

#### Saving and bucketing

In [0]:
%scala
// bucketing and sorting
peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")

In [0]:
%scala
// Partitioning
usersDF.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")

In [0]:
%scala
// Partitioning and bucketing

usersDF
  .write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")

## Organizing data using PySpark

#### RDD API vs DataFrame API

In [0]:
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("SampleRDDAPI").getOrCreate())

In [0]:
# Create an RDD of (type, age) tuples, where type is either "Person" or "Dog"
dataRDD = spark.sparkContext.parallelize([("Person",20),("Dog",2),("Person",34),("Person",63), ("Dog", 6)])

In [0]:
# Use map and reduceByKey transformations with lambda functions
# to aggregate the ages and compute the average per type
agesRDD = (dataRDD
           .map(lambda x: (x[0], (x[1], 1)))
           .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
           .map(lambda x: (x[0], x[1][0] / x[1][1]))
          )

In [0]:
for data in agesRDD.collect():
  print(data)
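
To see what the `map` / `reduceByKey` / `map` chain computes without a cluster, here is a minimal plain-Python sketch of the same logic. It is not how Spark executes the job (there is no partitioning or shuffling here), and the helper name `reduce_by_key` is hypothetical, chosen only for this illustration.

```python
def reduce_by_key(pairs, func):
    """Combine values that share a key, similar in spirit to RDD.reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


data = [("Person", 20), ("Dog", 2), ("Person", 34), ("Person", 63), ("Dog", 6)]

# Step 1 (map): turn each (type, age) into (type, (age, 1))
with_counts = [(kind, (age, 1)) for kind, age in data]

# Step 2 (reduceByKey): add up the ages and the counts per type
sums = reduce_by_key(with_counts, lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Step 3 (map): divide the summed ages by the number of records
averages = {kind: total / count for kind, (total, count) in sums}

print(averages)  # {'Person': 39.0, 'Dog': 4.0}
```

Carrying the `(sum, count)` pair through the reduction is what allows the average to be computed in a single pass over the data.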

In [0]:
#Create a DataFrame
data_df = spark.createDataFrame([("Person",20),("Dog",2),("Person",34),("Person",63), ("Dog", 6)],["type", "age"])

In [0]:
avg_df = data_df.groupby("type").avg("age")
avg_df.show()

If you want to follow along with PySpark outside Databricks, you can install it locally with:
```
pip install pyspark
pip install pandas
pip install jupyterlab
```

### Create Schemas and Assign DataTypes with PySpark

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [0]:
#check data types: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
spark = (SparkSession.builder.appName("SQLBitsModule2Schemas").getOrCreate())

```
Data type: StructType | Value type in Scala: org.apache.spark.sql.Row | API: StructType(fields)
Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed.
```

In [0]:
data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

In [0]:
type(data)

In [0]:
#assign schema
schema = StructType([
  StructField("firstname", StringType(), True),
  StructField("middlename", StringType(), True),
  StructField("lastname", StringType(), True),
  StructField("id", StringType(), True),
  StructField("gender", StringType(), True),
  StructField("salary", IntegerType(), True)
  
])

In [0]:
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()

In [0]:
df.show(truncate=False)

### Read and Write Data  by using the DataFrame Reader and Writer

In [0]:
file_path = "dbfs:/FileStore/module2/fire_incidents.csv"
fire_df = (spark.read.format("csv")
          .option("header", True)
          .option("inferSchema", True)
          .load(file_path))

In [0]:
# select() is a transformation and is evaluated lazily;
# show() is an action that triggers the actual computation
fire_df.select("IncidentNumber", "IncidentDate", "City").show(10)
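
Lazy evaluation can be imitated in plain Python with a generator: building the pipeline does no work, and only the "action" pulls rows through it. This is a loose analogy rather than Spark's execution model. The functions `select_city` and `show` are hypothetical, and the rows are made up for the illustration.

```python
from itertools import islice

evaluated = []  # records which rows were actually processed


def select_city(rows):
    """Lazy 'transformation': a generator that does nothing until consumed."""
    for row in rows:
        evaluated.append(row["IncidentNumber"])
        yield row["City"]


def show(rows, n):
    """'Action': pulls at most n rows through the pipeline."""
    return list(islice(rows, n))


# Made-up rows, only for this sketch
rows = [{"IncidentNumber": i, "City": "city-%d" % i} for i in range(100)]

pipeline = select_city(rows)       # nothing has been evaluated yet
touched_before = len(evaluated)

first_rows = show(pipeline, 10)    # the action triggers the work
touched_after = len(evaluated)

print(touched_before, touched_after)  # 0 10
```

Only the ten requested rows are processed, even though the input has one hundred.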

In [0]:
#print schema or columns
#fire_df.printSchema()
fire_df.columns

In [0]:
#use data writer to store data to Parquet file
output_path = "dbfs:/FileStore/module2/output/fireincidents"
fire_df.write.format("parquet").mode("overwrite").save(output_path)
# Inspect the output folder to see the Parquet files that were written

In [0]:
spark.stop()

### Working with Structured operations in PySpark

1. Columns and Expressions
1. Filter and Where Conditions
1. Distinct, Drop Duplicates, Order By
1. Rows and Union
1. Adding, Renaming, Dropping Columns
1. Working with missing and "bad" data
1. User-defined functions

In [0]:
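from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, FloatType, DateType, BooleanType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Get a session again, because the previous section ended with spark.stop()
spark = SparkSession.builder.getOrCreate()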

#assign schema
per_schema = StructType([
  StructField("id", IntegerType(), True),
  StructField("first_name", StringType(), True),
  StructField("last_name", StringType(), True),
  StructField("fav_movies", ArrayType(StringType()), True),
  StructField("salary", FloatType(), True),
  StructField("image_url", StringType(), True),
  StructField("date_of_birth", DateType(), True),
  StructField("active", BooleanType(), True)
])

# Load the data (multiLine=True lets a single JSON record span multiple lines)
json_file_path = "dbfs:/FileStore/module2/persons.json"
persons_df = spark.read.json(json_file_path, schema=per_schema, multiLine=True)


In [0]:
persons_df.printSchema()


In [0]:
persons_df.show(7)
persons_df.show(5, truncate=False)

##### 1. Columns and Expressions

In [0]:
from pyspark.sql.functions import col, expr

In [0]:
persons_df.select(col("first_name"), col("last_name"), col("date_of_birth")).show(5)

In [0]:
persons_df.select(expr("first_name"), expr("last_name"), expr("date_of_birth")).show(5)