# Abstracting Data with DataFrames

## Introduction:

This notebook furthers the understanding of PySpark fundamentals through the data structure called the DataFrame. DataFrames in PySpark take advantage of the developments and improvements from Project Tungsten and the Catalyst Optimiser.

## Project Tungsten:

The following is a direct quote from https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html. It briefly describes the project. 

Project Tungsten will be the largest change to Spark’s execution engine since the project’s inception. It focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware. This effort includes three initiatives:

- __Memory Management and Binary Processing__: leveraging application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection.
- __Cache-aware computation__: algorithms and data structures to exploit memory hierarchy.
- __Code generation__: using code generation to exploit modern compilers and CPUs.

## Catalyst Optimiser:

The following is a direct quote from https://databricks.com/glossary/catalyst-optimizer. It briefly describes the project.

Catalyst is based on functional programming constructs in Scala and designed with these key two purposes:

- Easily add new optimization techniques and features to Spark SQL.
- Enable external developers to extend the optimizer (e.g. adding data source specific rules, support for new data types, etc.).


## Breakdown of this Notebook:
- Creating DataFrames
- Accessing underlying RDDs
- Performance optimisations
- Inferring the schema using reflection
- Specifying the schema programmatically
- Creating a temporary table
- Using SQL to interact with DataFrames
- Overview of the DataFrame transformations and actions

## 1 PySpark Machine Configuration:

Here each executor is limited to four processing cores ("executorCores" is set to 4), and this is set up by the following code.

In [1]:
%%configure
{
    "executorCores" : 4
}

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,,pyspark,idle,,,


In [2]:
from pyspark.sql.types import *

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,,pyspark,idle,,,✔



SparkSession available as 'spark'.



## 2 Setup the Correct Directory:

In [3]:
import os

# Change the Path:
path = '++++your working directory here++++/Datasets/'
os.chdir(path)
folder_pathway = os.getcwd()

# print(folder_pathway)


## 3 Creating DataFrames:

First, create some sample data and column names as follows.

In [4]:
sample_data = sc.parallelize(
    [(1, 'MacBook Pro', 2015, '15"', '16GB', '512GB SSD', 13.75, 9.48, 0.61, 4.02), 
    (2, 'MacBook', 2016, '12"', '8GB', '256GB SSD', 11.04, 7.74, 0.52, 2.03), 
    (3, 'MacBook Air', 2016, '13.3"', '8GB', '128GB SSD', 12.8, 8.94, 0.68, 2.96), 
    (4, 'iMac', 2017, '27"', '64GB', '1TB SSD', 25.6, 8.0, 20.3, 20.8)]
)

col_names = ['Id', 'Model', 'Year', 'ScreenSize', 'RAM', 'HDD', 'W', 'D', 'H', 'Weight']


Create the DataFrame from the sample data:

In [5]:
sample_df = spark.createDataFrame(sample_data, col_names)


In [6]:
# Inspect:
sample_df.take(2)


[Row(Id=1, Model='MacBook Pro', Year=2015, ScreenSize='15"', RAM='16GB', HDD='512GB SSD', W=13.75, D=9.48, H=0.61, Weight=4.02), Row(Id=2, Model='MacBook', Year=2016, ScreenSize='12"', RAM='8GB', HDD='256GB SSD', W=11.04, D=7.74, H=0.52, Weight=2.03)]

#### As shown above, a DataFrame, unlike an RDD of plain tuples, is a collection of Row(...) objects, and each Row(...) object holds named fields.

## 3.1 To have a look at the DataFrame Data:

This can be done by using the .show() method.

In [7]:
sample_df.show()


+---+-----------+----+----------+----+---------+-----+----+----+------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|
+---+-----------+----+----------+----+---------+-----+----+----+------+
|  1|MacBook Pro|2015|       15"|16GB|512GB SSD|13.75|9.48|0.61|  4.02|
|  2|    MacBook|2016|       12"| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|
|  3|MacBook Air|2016|     13.3"| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|
|  4|       iMac|2017|       27"|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|
+---+-----------+----+----------+----+---------+-----+----+----+------+

## 3.2 To have a look at the Schema:

DataFrames have a schema. Here, the column data types were inferred from the values in the original sample_data RDD.

The schema can be examined like so.

In [8]:
sample_df.printSchema()


root
 |-- Id: long (nullable = true)
 |-- Model: string (nullable = true)
 |-- Year: long (nullable = true)
 |-- ScreenSize: string (nullable = true)
 |-- RAM: string (nullable = true)
 |-- HDD: string (nullable = true)
 |-- W: double (nullable = true)
 |-- D: double (nullable = true)
 |-- H: double (nullable = true)
 |-- Weight: double (nullable = true)

## 3.3 Create DataFrame from a JSON file:

#### From an example source.

source: https://github.com/kavgan/spark-examples/blob/master/sample-data/description.json

In [9]:
sample_data_json_df = ( spark.read.json(folder_pathway + '/Datasets/' + 'description.json') )


In [10]:
sample_data_json_df.printSchema()


root
 |-- text: string (nullable = true)
 |-- title: string (nullable = true)

In [11]:
sample_data_json_df.show()


+--------------------+-----------+
|                text|      title|
+--------------------+-----------+
|Data (/ˈdeɪtə/ DA...|       Data|
|Big data is a ter...|   Big Data|
|Natural language ...|        NLP|
|Text mining, also...|Text Mining|
+--------------------+-----------+

#### From sample_data:

In [12]:
sample_data_json_df = ( spark.read.json(folder_pathway + '/Datasets/' + 'sample_data.json') )


In [13]:
sample_data_json_df.printSchema()


root
 |--  D: double (nullable = true)
 |--  H: double (nullable = true)
 |--  HDD: string (nullable = true)
 |--  Model: string (nullable = true)
 |--  RAM: string (nullable = true)
 |--  ScreenSize: string (nullable = true)
 |--  W: double (nullable = true)
 |--  Weight: double (nullable = true)
 |--  Year: long (nullable = true)
 |-- Id: long (nullable = true)

In [14]:
sample_data_json_df.show()


+----+----+---------+-----------+----+-----------+-----+-------+-----+---+
|   D|   H|      HDD|      Model| RAM| ScreenSize|    W| Weight| Year| Id|
+----+----+---------+-----------+----+-----------+-----+-------+-----+---+
|9.48|0.61|512GB SSD|MacBook Pro|16GB|        15"|13.75|   4.02| 2015|  1|
|7.74|0.52|256GB SSD|    MacBook| 8GB|        12"|11.04|   2.03| 2016|  2|
|8.94|0.68|128GB SSD|MacBook Air| 8GB|      13.3"| 12.8|   2.96| 2016|  3|
| 8.0|20.3|  1TB SSD|       iMac|64GB|        27"| 25.6|   20.8| 2017|  4|
+----+----+---------+-----------+----+-----------+-----+-------+-----+---+

## 3.4 Create a DataFrame from CSV file:

Unlike loading from a JSON file, reading a CSV file here takes two extra parameters, "header" and "inferSchema". The header parameter tells Spark to use the first line of the file as the column names. The inferSchema parameter tells Spark to infer the data type of each column from the data; without it, every column is read as a string.
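
As a rough illustration of what the two options mean, the following plain-Python sketch mimics them with the standard `csv` module rather than Spark. The inline data and the `infer_value` helper are hypothetical, and Spark's own inference rules cover more types than this. The point is only that the header line supplies the column names and that, without inference, every value stays a string.

```python
import csv
import io

# Hypothetical two-row CSV, used only for illustration.
raw = "Id,Model,Weight\n1,MacBook Pro,4.02\n2,MacBook,2.03\n"


def infer_value(text):
    """Try int, then float, and fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


# header=True analogue: the first line supplies the column names.
rows_as_strings = list(csv.DictReader(io.StringIO(raw)))

# inferSchema=True analogue: convert each value to the first type that fits.
rows_typed = [
    {name: infer_value(value) for name, value in row.items()}
    for row in rows_as_strings
]

print(rows_as_strings[0])  # every value is still a string
print(rows_typed[0])       # Id is an int, Weight is a float
```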

#### From an example source:

Source: https://github.com/kavgan/spark-examples/blob/master/sample-data/description.csv

In [15]:
sample_data_csv_df = ( spark.read.csv(folder_pathway + '/Datasets/' + 'description.csv', 
                                       header=True, inferSchema=True) )


In [16]:
sample_data_csv_df.printSchema()


root
 |-- title: string (nullable = true)
 |-- text: string (nullable = true)

In [17]:
sample_data_csv_df.show()


+-----------+--------------------+
|      title|                text|
+-----------+--------------------+
|       Data|Data (/ˈdeɪtə/ DA...|
|   Big Data|Big data is a ter...|
|        NLP|                text|
|Text Mining|Text mining, also...|
+-----------+--------------------+

#### From sample_data:

In [18]:
sample_data_csv_df = ( spark.read.csv(folder_pathway + '/Datasets/' + 'sample_data.csv', 
                                       header=True, inferSchema=True) )


In [19]:
sample_data_csv_df.printSchema()


root
 |-- Id: integer (nullable = true)
 |--  Model: string (nullable = true)
 |--  Year: integer (nullable = true)
 |--  ScreenSize: string (nullable = true)
 |--  RAM: string (nullable = true)
 |--  HDD: string (nullable = true)
 |--  W: double (nullable = true)
 |--  D: double (nullable = true)
 |--  H: double (nullable = true)
 |--  Weight: double (nullable = true)

In [20]:
sample_data_csv_df.show()


+---+-----------+-----+-----------+----+---------+-----+----+----+-------+
| Id|      Model| Year| ScreenSize| RAM|      HDD|    W|   D|   H| Weight|
+---+-----------+-----+-----------+----+---------+-----+----+----+-------+
|  1|MacBook Pro| 2015|        15"|16GB|512GB SSD|13.75|9.48|0.61|   4.02|
|  2|    MacBook| 2016|        12"| 8GB|256GB SSD|11.04|7.74|0.52|   2.03|
|  3|MacBook Air| 2016|      13.3"| 8GB|128GB SSD| 12.8|8.94|0.68|   2.96|
|  4|       iMac| 2017|        27"|64GB|  1TB SSD| 25.6| 8.0|20.3|   20.8|
+---+-----------+-----+-----------+----+---------+-----+----+----+-------+

## 4 Accessing the underlying RDDs:

Under the hood, DataFrames continue to use RDDs. This section describes the process of interacting with the underlying RDD of a DataFrame.

## 4.1 Import the Required Libraries:


In [21]:
import pyspark.sql as sql
import pyspark.sql.functions as f


## 4.2 Extract the Size and Type of the HDD:

The size and type of the HDD will be represented in separate columns. In addition, the minimum volume of a box needed to hold each computer will be calculated from its dimensions.

### To do this:
1 -> Begin by extracting the ".rdd" from the sample_df. \
2 -> Use the ".map()" transformation to add the HDD_size column to the schema. \
3 -> When working with the RDD, all the other columns must be retained. Each row is therefore converted into a dictionary with the ".asDict()" method, which is then unpacked with " ** " into a new "Row()" object.

    NOTE in Python: 
    - A single * preceding a list or tuple in a function call passes each element as a separate positional argument. 
    - A double ** preceding a dictionary passes each key as a keyword parameter name and the corresponding value as its argument. 
    

4 -> Inside this Row(), the second argument is where the name of the column "HDD_size" is passed and set to the desired value. Here, the ".HDD" column is split on the space and the first element is extracted, as it is the HDD size. \
5 -> The same is done for the "HDD_type" column, using the second element of the split. \
6 -> The 3rd ".map()" function adds the Volume information and follows the same method as the last two steps (Steps 4+5). \
7 -> The ".toDF()" method is used to convert the RDD back into a DataFrame. \
8 -> The ".select()" function is used to select the original columns followed by the new ones, which also fixes the column order. Additionally, the Volume column is rounded with the "f.round()" function and named Volume_cuIn (volume in cubic inches) with the ".alias()" method.
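
The unpacking used in step 3 can be tried in plain Python without Spark. In this minimal sketch, `show_positional` and `make_row` are hypothetical helpers; `make_row` merely stands in for `sql.Row` by collecting its keyword arguments into a dictionary.

```python
def show_positional(a, b, c):
    """Return the three positional arguments as a tuple."""
    return (a, b, c)


def make_row(**fields):
    """Hypothetical stand-in for sql.Row: collect keyword arguments into a dict."""
    return fields


values = [1, "MacBook Pro", 2015]
# * unpacks the list into separate positional arguments.
positional = show_positional(*values)

existing = {"Id": 1, "HDD": "512GB SSD"}
# ** unpacks the dict into keyword arguments; one extra keyword adds a new field.
extended = make_row(**existing, HDD_size=existing["HDD"].split(" ")[0])

print(positional)
print(extended)
```

The second call has the same shape as `sql.Row(**row.asDict(), HDD_size=...)`: every existing field is carried over and one new named field is appended.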



In [22]:
transformed_dat_sample = (
    sample_df
    .rdd
    .map(lambda row: sql.Row(**row.asDict(), HDD_size=row.HDD.split(' ')[0]))
    .map(lambda row: sql.Row(**row.asDict(), HDD_type=row.HDD.split(' ')[1]))
    .map(lambda row: sql.Row(**row.asDict(), Volume=row.H * row.D * row.W))
    .toDF()
    .select(
        sample_df.columns
        + ['HDD_size', 'HDD_type', f.round(f.col('Volume')).alias('Volume_cuIn')]
    )
)



In [23]:
# Inspect the RDD under the hood:
transformed_dat_sample.rdd.take(1)


[Row(Id=1, Model='MacBook Pro', Year=2015, ScreenSize='15"', RAM='16GB', HDD='512GB SSD', W=13.75, D=9.48, H=0.61, Weight=4.02, HDD_size='512GB', HDD_type='SSD', Volume_cuIn=80.0)]

#### As seen above, taking one element from the RDD of "transformed_dat_sample" produces a single-item list whose element is a Row(...).

In [24]:
# Inspect the original RDD:
sample_data.take(1)


[(1, 'MacBook Pro', 2015, '15"', '16GB', '512GB SSD', 13.75, 9.48, 0.61, 4.02)]

#### As seen above, "sample_data" also produces a single-item list; however, the item is a plain tuple rather than a Row(...).

In [25]:
# Show the whole DataFrame:
transformed_dat_sample.show()


+---+-----------+----+----------+----+---------+-----+----+----+------+--------+--------+-----------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|HDD_size|HDD_type|Volume_cuIn|
+---+-----------+----+----------+----+---------+-----+----+----+------+--------+--------+-----------+
|  1|MacBook Pro|2015|       15"|16GB|512GB SSD|13.75|9.48|0.61|  4.02|   512GB|     SSD|       80.0|
|  2|    MacBook|2016|       12"| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|   256GB|     SSD|       44.0|
|  3|MacBook Air|2016|     13.3"| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|   128GB|     SSD|       78.0|
|  4|       iMac|2017|       27"|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|     1TB|     SSD|     4157.0|
+---+-----------+----+----------+----+---------+-----+----+----+------+--------+--------+-----------+
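
The derived columns in the table above can be checked by hand. The sketch below repeats the split and the volume arithmetic in plain Python on values copied from the sample data; `derive` is a hypothetical helper name. Python's `round` and Spark's `f.round` can treat exact .5 ties differently, but none of these volumes is a tie.

```python
# (Model, HDD, W, D, H) copied from the sample data above.
computers = [
    ("MacBook Pro", "512GB SSD", 13.75, 9.48, 0.61),
    ("MacBook", "256GB SSD", 11.04, 7.74, 0.52),
    ("MacBook Air", "128GB SSD", 12.8, 8.94, 0.68),
    ("iMac", "1TB SSD", 25.6, 8.0, 20.3),
]


def derive(hdd, w, d, h):
    """Split the HDD string into size and type, and round the box volume."""
    hdd_size, hdd_type = hdd.split(" ")
    return hdd_size, hdd_type, float(round(h * d * w))


derived = {model: derive(hdd, w, d, h) for model, hdd, w, d, h in computers}

for model, values in derived.items():
    print(model, values)
```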

## 5 Performance Optimisations on User Defined Functions (UDFs):

DataFrames in PySpark brought performance improvements. However, with User Defined Functions (UDFs), PySpark had to switch constantly between the Python and JVM runtimes, which resulted in a large performance hit.

This was addressed when Spark adopted the improvements from Apache Arrow (https://arrow.apache.org). Arrow provides a single in-memory data format shared by all environments, which relieves Spark of the constant need to copy and convert objects. More details can be found at the link above.

## 5.1 Import the required Libraries:

NOTE: 
- pip install pyarrow
- A different version may be required. Try -> pip install pyarrow==0.14.1 

In [26]:
import pyspark.sql.functions as f 
import pandas as pd
from scipy import stats
import timeit

import sys
import os
# setup to work around with pandas udf
# see answers here https://stackoverflow.com/questions/58458415/pandas-scalar-udf-failing-illegalargumentexception
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"


## 5.2 Demonstration:

### What is happening here?

1 -> Firstly, create a Spark DataFrame with an "id" column that ranges from 0 up to (but not including) 1 million, and add a column named "val" filled with random values from "f.rand()". \
2 -> Cache the DataFrame with the ".cache()" method. \
3 -> Show the contents of the DataFrame with the ".show()" method. \
4 -> Secondly, apply the "f.pandas_udf(...)" decorator. This declares that the function is a vectorised UDF in PySpark. The first parameter, the return type, is set to "double". Note that this can be either a DDL-formatted type string or a pyspark.sql.types.DataType. The second parameter is the function type; "PandasUDFType.SCALAR" is used so that the function maps one column to one column. For an operation on whole groups of rows, each handed over as a pandas DataFrame, it would be set to "PandasUDFType.GROUPED_MAP". \
5 -> Next is the UDF itself, the "pandas_pdf()" function, which takes in a single column as a pandas.Series and returns a pandas.Series holding the normal PDF of each value. \
6 -> Finally, the DataFrame is transformed with the UDF, adding a new column, and shown with the ".show()" method.
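
For reference, the quantity the UDF returns for each value is the standard normal density, exp(-x²/2) / sqrt(2π). The sketch below computes it in plain Python, once from the formula and once with the standard library's `statistics.NormalDist`, as a stand-in for `stats.norm.pdf`. The helper name `standard_normal_pdf` and the sample points are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def standard_normal_pdf(x):
    """Density of the standard normal distribution at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


samples = [0.0, 0.5, 1.0]
from_formula = [standard_normal_pdf(x) for x in samples]
from_stdlib = [NormalDist(mu=0.0, sigma=1.0).pdf(x) for x in samples]

for x, a, b in zip(samples, from_formula, from_stdlib):
    print(f"x={x:.1f}  formula={a:.6f}  NormalDist={b:.6f}")
```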


In [27]:
# Make a BIG dataFrame:
df_big = (
    spark
    .range(0, 1000000)
    .withColumn('val', f.rand())
)

# Cache it:
df_big.cache()

# Inspect:
df_big.show(3)


+---+-------------------+
| id|                val|
+---+-------------------+
|  0|0.08198988133925589|
|  1| 0.8509392124864215|
|  2| 0.4077124736653549|
+---+-------------------+
only showing top 3 rows

In [28]:
# Declaration and User Defined Function (scalar pandas UDF returning doubles):
@f.pandas_udf('double', f.PandasUDFType.SCALAR)
def pandas_pdf(v):
    return pd.Series(stats.norm.pdf(v))


In [29]:
# Apply the UDF:
# (
#     df_big
#     .withColumn('probability', pandas_pdf(df_big.val))
#     .show(5)
# )


## 5.3 Compare the performance of the approaches:

#### Test 1 - Vectorised pandas UDF

The "test_pandas_pdf()" function uses pandas_pdf() to compute the normal PDF of each value, then counts the results with "f.count()" and prints the count with the ".show()" method.

In [30]:
def test_pandas_pdf():
    return (
        df_big
        .withColumn('probability', pandas_pdf(df_big.val))
        .agg(f.count(f.col('probability')))
        .show()
    )

# Run and time the function:
# %timeit -n 1 test_pandas_pdf()

print(timeit.timeit(test_pandas_pdf, number=1))


+------------------+
|count(probability)|
+------------------+
|           1000000|
+------------------+

1.5764408579999998

#### Test 2 - Row-by-Row version with Python to JVM conversion.

The test_pdf() function is similar but uses the "pdf()" function, a UDF that is computed row by row.

In [31]:
# Declaration and UDF (evaluated one row at a time):
@f.udf('double')
def pdf(v):
    return float(stats.norm.pdf(v))

# Same test as before, using the row-by-row UDF:
def test_pdf():
    return(
        df_big
        .withColumn('probability', pdf(df_big.val))
        .agg(f.count(f.col('probability')))
        .show()
    )


# Run and time the function:
# %timeit -n1 test_pdf()

print(timeit.timeit(test_pdf, number=1))


+------------------+
|count(probability)|
+------------------+
|           1000000|
+------------------+

77.427466586

#### Observation:

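Immediately, it can be seen that the vectorised UDF method with pandas is much faster, at about 1.6 seconds, than the row-by-row method, which took about 77.4 seconds (roughly 49 times longer). This is the performance boost from Apache Arrow. Note that each function was timed over a single run, so the exact figures will vary.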

## 6 Inferring the schema utilising Reflection:

It is also important to note that RDDs do not have a schema, whereas DataFrames do. This section creates a DataFrame by inferring the schema using reflection.

## 6.1 Import the required libraries:

In [32]:
import pyspark.sql as sql


## 6.2 Read in the CSV sample data :

The CSV data is first read in as an RDD, which is then used to create a DataFrame.

source: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html

In [33]:
# Load in the data:
sample_data_RDD = sc.textFile(folder_pathway + '/Datasets/' + 'sample_data.csv')

# Define the column header: .first() returns the first row of the data.
header = sample_data_RDD.first()


In [34]:
# Drop the header row, split each line on commas, and build a Row per record:
sample_data_RDD_row = (
    sample_data_RDD
    .filter(lambda row: row != header)
    .map(lambda row: row.split(','))
    .map(lambda row: 
        sql.Row(
            Id = int(row[0]),
            Model = row[1],
            Year = row[2],
            ScreenSize = row[3],
            RAM = row[4],
            HDD = row[5],
            W = float(row[6]),
            D = float(row[7]),
            H = float(row[8]),
            Weight = float(row[9])
        ))
)


In [35]:
# Inspect the RDD:
sample_data_RDD_row.take(1)


[Row(D=9.48, H=0.61, HDD='512GB SSD', Id=1, Model='MacBook Pro', RAM='16GB', ScreenSize='"15\\""', W=13.75, Weight=4.02, Year='2015')]

In [36]:
# Inspect the DataFrame:
sample_data_df = spark.createDataFrame(sample_data_RDD_row)
sample_data_df.show()


+----+----+---------+---+-----------+----+----------+-----+------+----+
|   D|   H|      HDD| Id|      Model| RAM|ScreenSize|    W|Weight|Year|
+----+----+---------+---+-----------+----+----------+-----+------+----+
|9.48|0.61|512GB SSD|  1|MacBook Pro|16GB|    "15\""|13.75|  4.02|2015|
|7.74|0.52|256GB SSD|  2|    MacBook| 8GB|    "12\""|11.04|  2.03|2016|
|8.94|0.68|128GB SSD|  3|MacBook Air| 8GB|  "13.3\""| 12.8|  2.96|2016|
| 8.0|20.3|  1TB SSD|  4|       iMac|64GB|    "27\""| 25.6|  20.8|2017|
+----+----+---------+---+-----------+----+----------+-----+------+----+
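
The inferred column types follow from the Python types of the Row fields. As the schema printed in section 3.2 shows, Python int values become long, float values become double, and str values become string. The lookup table below is a hand-written illustration of that mapping, not a PySpark API, and the field values are hypothetical. It shows why leaving Year as `row[2]` gives a string column, while `int(row[2])` would give a long column.

```python
# Hand-written illustration of the mapping seen in the printSchema() output
# of section 3.2. This dict is hypothetical and is not part of PySpark.
INFERRED_TYPE_NAMES = {int: "long", float: "double", str: "string"}

# One split CSV line, with hypothetical values shaped like the sample data.
fields = ["1", "MacBook Pro", "2015", "9.48"]

as_written = {
    "Id": int(fields[0]),
    "Model": fields[1],
    "Year": fields[2],          # left as a string, as in the cell above
    "D": float(fields[3]),
}
with_int_year = dict(as_written, Year=int(fields[2]))


def describe(record):
    """Map each field name to the type name its Python value would be inferred as."""
    return {name: INFERRED_TYPE_NAMES[type(value)] for name, value in record.items()}


print(describe(as_written))
print(describe(with_int_year))
```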

## 7 Specifying the Schema Programmatically:

## 7.1 Import the required Libraries:

In [37]:
import pyspark.sql.types as typ


## 7.2 Create the schema:

The schema below uses the following:
- "StructField()" is the programmatic way of adding a field to a schema in PySpark. The first parameter is the name of the column, and the second parameter is the data type of the column. The last parameter is a boolean that defines whether the column can contain null values (True) or not (False).

In [38]:
schema_pgm = typ.StructType(
    [typ.StructField('Id', typ.LongType(), False),
    typ.StructField('Model', typ.StringType(), True),
    typ.StructField('Year', typ.IntegerType(), True),
    typ.StructField('ScreenSize', typ.StringType(), True),
    typ.StructField('RAM', typ.StringType(), True),
    typ.StructField('HDD', typ.StringType(), True),
    typ.StructField('W', typ.DoubleType(), True),
    typ.StructField('D', typ.DoubleType(), True),
    typ.StructField('H', typ.DoubleType(), True),
    typ.StructField('Weight', typ.DoubleType(), True)]
)
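
What the three StructField parameters express can be sketched in plain Python. Here `Field` and `check_row` are hypothetical stand-ins written for this illustration; PySpark's own schema verification is more thorough than this.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for StructField: name, expected Python type, nullable flag."""
    name: str
    py_type: type
    nullable: bool


schema = [Field("Id", int, False), Field("Model", str, True), Field("W", float, True)]


def check_row(row, fields):
    """Return a list of problems found when comparing a tuple against the fields."""
    problems = []
    for value, field in zip(row, fields):
        if value is None:
            if not field.nullable:
                problems.append(f"{field.name} is not nullable")
        elif not isinstance(value, field.py_type):
            problems.append(f"{field.name} expects {field.py_type.__name__}")
    return problems


print(check_row((1, "iMac", 25.6), schema))
print(check_row((None, None, "25.6"), schema))
```

In the second call, the missing Id is reported because that field is declared non-nullable, the missing Model is accepted, and the string "25.6" is reported because W expects a float.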


## 7.3 Load in the CSV file:

In [39]:
sample_data_RDD = sc.textFile(folder_pathway + '/Datasets/' + 'sample_data.csv')


## 7.4 Create the underlying RDD:

In [40]:
# Define the headers:
header = sample_data_RDD.first()

# RDD:
sample_data_rdd = (
    sample_data_RDD
    .filter(lambda row: row != header)
    .map(lambda row: row.split(','))
    .map(lambda row: (
        int(row[0]),
        row[1],
        int(row[2]),
        row[3],
        row[4],
        row[5],
        float(row[6]),
        float(row[7]),
        float(row[8]),
        float(row[9]),
    ))
)


## 7.5 Convert the RDD into a DataFrame:

In [41]:
sample_data_schema_DF = spark.createDataFrame(sample_data_rdd, schema = schema_pgm)

# Inspect:
sample_data_schema_DF.show()


+---+-----------+----+----------+----+---------+-----+----+----+------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|
+---+-----------+----+----------+----+---------+-----+----+----+------+
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|
+---+-----------+----+----------+----+---------+-----+----+----+------+

## 8 Creating a Temporary Table:

PySpark DataFrames can also be manipulated with SQL queries. This section explores creating a temporary view to access the data inside the DataFrame through SQL queries.

This example will also be using the "sample_data_schema_DF" from the previous section.

## 8.1 Create the temp view:

The ".createTempView()" method creates a temporary view that can be queried; it requires only one parameter, the name of the view.

In [42]:
sample_data_schema_DF.createTempView('sample_data_view')


In [43]:
# 8.2 Use the temp view to extract data:
spark.sql('''
    SELECT Model, Year, RAM, HDD FROM sample_data_view
''').show()


+-----------+----+----+---------+
|      Model|Year| RAM|      HDD|
+-----------+----+----+---------+
|MacBook Pro|2015|16GB|512GB SSD|
|    MacBook|2016| 8GB|256GB SSD|
|MacBook Air|2016| 8GB|128GB SSD|
|       iMac|2017|64GB|  1TB SSD|
+-----------+----+----+---------+

## 8.3 Note: Update a View:

Once a temp view is created, calling ".createTempView()" again with the same name fails. However, Spark provides another method that either creates the view or replaces an existing one.

This is called ".createOrReplaceTempView()".

In [44]:
sample_data_schema_DF.createOrReplaceTempView('sample_data_view')


In [45]:
# Use this NEW temp view to extract the data:
spark.sql('''
    SELECT Model, Year, RAM, HDD, ScreenSize FROM sample_data_view
''').show()


+-----------+----+----+---------+----------+
|      Model|Year| RAM|      HDD|ScreenSize|
+-----------+----+----+---------+----------+
|MacBook Pro|2015|16GB|512GB SSD|    "15\""|
|    MacBook|2016| 8GB|256GB SSD|    "12\""|
|MacBook Air|2016| 8GB|128GB SSD|  "13.3\""|
|       iMac|2017|64GB|  1TB SSD|    "27\""|
+-----------+----+----+---------+----------+

## 9 Utilising SQL to interact with the DataFrames:

This section will explore the data within the DataFrame using SQL queries.

## 9.1 Extend the original data:

Here, the original DataFrame is extended with the form factor of each computer model.

To do this:
- Create a DataFrame with two columns, "Model" and "FormFactor". The RDD is converted to a DataFrame with the ".toDF()" method. The list passed here gives the column names; the column types are inferred automatically.
- Create the "models" view and replace the "sample_data_view".
- Append the "FormFactor" to the original data by joining both views on the "Model" column. The ".sql()" function, as seen below, takes in regular SQL expressions.

In [46]:
# Define the FormFactor:
models_df = sc.parallelize([
    ('MacBook Pro', 'Laptop'),
    ('MacBook', 'Laptop'),
    ('MacBook Air', 'Laptop'),
    ('iMac', 'Desktop')
]).toDF(['Model', 'FormFactor'])


In [47]:
# Create a temp View(s):
models_df.createOrReplaceTempView('models')

sample_data_schema_DF.createOrReplaceTempView('sample_data_view')


In [48]:
# SQL Query:
spark.sql('''
    SELECT a.*, b.FormFactor 
    FROM sample_data_view AS a 
    LEFT JOIN models AS b 
        ON a.Model == b.Model
    ORDER BY Weight DESC
''').show()


+---+-----------+----+----------+----+---------+-----+----+----+------+----------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|FormFactor|
+---+-----------+----+----------+----+---------+-----+----+----+------+----------+
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|   Desktop|
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|    Laptop|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|    Laptop|
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|    Laptop|
+---+-----------+----+----------+----+---------+-----+----+----+------+----------+

## 9.2 Perform some aggregations on the Data:

Count the number of devices for each FormFactor.

In [59]:
spark.sql('''
    SELECT b.FormFactor, COUNT(*) AS ComputerCnt 
    FROM sample_data_view AS a 
    LEFT JOIN models AS b 
        ON a.Model == b.Model 
    GROUP BY FormFactor
''').show()


+----------+-----------+
|FormFactor|ComputerCnt|
+----------+-----------+
|    Laptop|          3|
|   Desktop|          1|
+----------+-----------+
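
The join and aggregation are ordinary SQL, so the same query shape can be tried without a cluster. The sketch below uses Python's built-in `sqlite3` as a stand-in for Spark SQL, which is an assumption made purely for illustration. It reuses the view names as table names, loads only the Id and Model columns, and adds an ORDER BY so that the row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_data_view (Id INTEGER, Model TEXT)")
conn.execute("CREATE TABLE models (Model TEXT, FormFactor TEXT)")

conn.executemany(
    "INSERT INTO sample_data_view VALUES (?, ?)",
    [(1, "MacBook Pro"), (2, "MacBook"), (3, "MacBook Air"), (4, "iMac")],
)
conn.executemany(
    "INSERT INTO models VALUES (?, ?)",
    [("MacBook Pro", "Laptop"), ("MacBook", "Laptop"),
     ("MacBook Air", "Laptop"), ("iMac", "Desktop")],
)

# Same shape as the Spark SQL query: join on Model, then count rows per FormFactor.
counts = conn.execute("""
    SELECT b.FormFactor, COUNT(*) AS ComputerCnt
    FROM sample_data_view AS a
    LEFT JOIN models AS b
        ON a.Model = b.Model
    GROUP BY b.FormFactor
    ORDER BY ComputerCnt DESC
""").fetchall()

print(counts)
conn.close()
```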

## 10 Overview of the DataFrame Transformations:

Similar to RDDs, DataFrames (DF) in PySpark also have transformations and actions.  

## 10.1 DataFrame Transformation Type: .select()

The .select() function returns a new DF that contains only the specified column(s) of the original DF.

The equivalent SQL, run against the temporary view registered for this DF, would be:

SELECT Model, ScreenSize FROM sample_data_view;

In [60]:
sample_data_schema_DF.select('Model', 'ScreenSize').show()


+-----------+----------+
|      Model|ScreenSize|
+-----------+----------+
|MacBook Pro|    "15\""|
|    MacBook|    "12\""|
|MacBook Air|  "13.3\""|
|       iMac|    "27\""|
+-----------+----------+

## 10.2 DataFrame Transformation Type:  .filter()

The .filter() function returns a new DF that contains only the rows that satisfy the specified condition. In SQL syntax, it corresponds to "WHERE".

The equivalent SQL, run against the temporary view registered for this DF, would be:

SELECT * FROM sample_data_view WHERE Year > 2015

In [61]:
(
    sample_data_schema_DF
    .filter(sample_data_schema_DF.Year > 2015)
    .show()
)


+---+-----------+----+----------+----+---------+-----+----+----+------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|
+---+-----------+----+----------+----+---------+-----+----+----+------+
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|
+---+-----------+----+----------+----+---------+-----+----+----+------+

## 10.3 DataFrame Transformation Type:  .groupBy()

The .groupBy() function groups the rows of a DF by the value(s) of one or more columns, so that an aggregation (here .count()) can be applied to each group and returned as a new DF. In SQL syntax, it corresponds to "GROUP BY".

The equivalent SQL, run against the temporary view registered for this DF, would be:

SELECT RAM, COUNT(*) AS count FROM sample_data_view GROUP BY RAM

In [63]:
(
    sample_data_schema_DF
    .groupBy('RAM')
    .count()
    .show()
)


+----+-----+
| RAM|count|
+----+-----+
|64GB|    1|
|16GB|    1|
| 8GB|    2|
+----+-----+
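To make the group-then-count step concrete, here is a minimal plain-Python sketch (not PySpark) that counts the RAM values from the sample data using collections.Counter. The list of tuples is a hand-copied subset of the table above, used here only for illustration.

```python
from collections import Counter

# (Model, RAM) pairs copied from the sample data shown above.
ram_by_model = [
    ('MacBook Pro', '16GB'),
    ('MacBook', '8GB'),
    ('MacBook Air', '8GB'),
    ('iMac', '64GB'),
]

# Group by the RAM value and count the rows in each group.
ram_counts = Counter(ram for _model, ram in ram_by_model)
print(ram_counts)
```

In PySpark itself, .groupBy() on its own returns a GroupedData object, and it is the aggregation that follows (such as .count()) that produces a DF again.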

## 10.4 DataFrame Transformation Type: .orderBy()

The .orderBy() function returns a new DF with the rows sorted by the specified column(s). In SQL syntax, it corresponds to "ORDER BY".

The equivalent SQL, run against the temporary view registered for this DF, would be:

SELECT * FROM sample_data_view ORDER BY W

In [64]:
(
    sample_data_schema_DF
    .orderBy('W')
    .show()
)


+---+-----------+----+----------+----+---------+-----+----+----+------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|
+---+-----------+----+----------+----+---------+-----+----+----+------+
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|
+---+-----------+----+----------+----+---------+-----+----+----+------+

#### Similarly, the order of the sorting can be changed from ascending to descending:

The equivalent SQL, run against the temporary view registered for this DF, would be:

SELECT * FROM sample_data_view ORDER BY H DESC

In [65]:
(
    sample_data_schema_DF
    .orderBy(f.col('H').desc())
    .show()
)


+---+-----------+----+----------+----+---------+-----+----+----+------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|
+---+-----------+----+----------+----+---------+-----+----+----+------+
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|
+---+-----------+----+----------+----+---------+-----+----+----+------+
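The two orderings above can be mirrored in plain Python (not PySpark) with sorted(). This is only a sketch: the rows are reduced to the Id, W and H columns from the sample data.

```python
# Id, W (width) and H (height) values copied from the sample data above.
rows = [
    {'Id': 1, 'W': 13.75, 'H': 0.61},
    {'Id': 2, 'W': 11.04, 'H': 0.52},
    {'Id': 3, 'W': 12.8, 'H': 0.68},
    {'Id': 4, 'W': 25.6, 'H': 20.3},
]

# Ascending by W, like .orderBy('W').
by_width = sorted(rows, key=lambda row: row['W'])

# Descending by H, like .orderBy(f.col('H').desc()).
by_height_desc = sorted(rows, key=lambda row: row['H'], reverse=True)

print([row['Id'] for row in by_width])
print([row['Id'] for row in by_height_desc])
```

As with .orderBy(), sorted() leaves the original rows untouched and returns a new, sorted collection.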

## 10.5 DataFrame Transformation Type:  .withColumn()

The .withColumn() function returns a new DF in which the result of applying a function to a column, or a literal value (with the .lit() method), is stored as a new column. In SQL syntax, it corresponds to "AS", which assigns a name to the new column.

The equivalent SQL, run against the temporary view registered for this DF, would be:

SELECT *, SPLIT(HDD, ' ') AS HDD_Array FROM sample_data_view

In [66]:
# Split the HDD into Size and Type:
(
    sample_data_schema_DF
    .withColumn('HDDSplit', f.split(f.col('HDD'), ' '))
    .show()
)


+---+-----------+----+----------+----+---------+-----+----+----+------+------------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|    HDDSplit|
+---+-----------+----+----------+----+---------+-----+----+----+------+------------+
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|[512GB, SSD]|
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|[256GB, SSD]|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|[128GB, SSD]|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|  [1TB, SSD]|
+---+-----------+----+----------+----+---------+-----+----+----+------+------------+

In [67]:
# Perform the same with .select() and .alias():
(
    sample_data_schema_DF
    .select(
        f.col('*'), f.split(f.col('HDD'), ' ').alias('HDD_Array')
    )
    .show()
)


+---+-----------+----+----------+----+---------+-----+----+----+------+------------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|   HDD_Array|
+---+-----------+----+----------+----+---------+-----+----+----+------+------------+
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|[512GB, SSD]|
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|[256GB, SSD]|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|[128GB, SSD]|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|  [1TB, SSD]|
+---+-----------+----+----------+----+---------+-----+----+----+------+------------+
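Both cells above add a derived column and leave the original DF unchanged. A minimal plain-Python sketch of that idea follows. It is not PySpark, and with_column is a hypothetical helper written only for this illustration.

```python
# Id and HDD values copied from the sample data above.
rows = [
    {'Id': 1, 'HDD': '512GB SSD'},
    {'Id': 2, 'HDD': '256GB SSD'},
    {'Id': 3, 'HDD': '128GB SSD'},
    {'Id': 4, 'HDD': '1TB SSD'},
]


def with_column(rows, name, func):
    """Return new rows with one extra key; the input rows are not modified."""
    return [{**row, name: func(row)} for row in rows]


# Split the HDD text on a space, giving [size, type] for each row.
rows_split = with_column(rows, 'HDDSplit', lambda row: row['HDD'].split(' '))

for row in rows_split:
    print(row)
```

Building new dictionaries instead of editing the existing ones mirrors the way a DF transformation returns a new DF.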

## 10.6 DataFrame Transformation Type:  .join()

The .join() function joins two DataFrames together. The first parameter is the other DF to join with. The second parameter specifies the column(s), or the condition, to join on. The final parameter defines the type of join.

Types of Joins are:
- inner
- outer
- full
- full_outer
- left
- left_outer
- right
- right_outer
- left_semi
- left_anti

In SQL syntax, it corresponds to the "JOIN" statement.

The equivalent SQL, run against the temporary views registered for the two DFs, would be:

SELECT a.*, b.FormFactor FROM sample_data_view AS a LEFT JOIN models AS b ON a.Model == b.Model

In [69]:
# Define the FormFactor:
models_df = sc.parallelize([
    ('MacBook Pro', 'Laptop'),
    ('MacBook', 'Laptop'),
    ('MacBook Air', 'Laptop'),
    ('iMac', 'Desktop')
]).toDF(['Model', 'FormFactor'])

# Perform the .join() function:
(
    sample_data_schema_DF
    .join(
        models_df, sample_data_schema_DF.Model == models_df.Model, 'left'
    ).show()
    
)


+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|      Model|FormFactor|
+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|    MacBook|    Laptop|
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|MacBook Pro|    Laptop|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|MacBook Air|    Laptop|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|       iMac|   Desktop|
+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+

#### If the DataFrame has Models that are not listed:

In [70]:
# Define the FormFactor with missing data:
models_df = sc.parallelize([
    ('MacBook Pro', 'Laptop'),
    ('MacBook Air', 'Laptop'),
    ('iMac', 'Desktop')
]).toDF(['Model', 'FormFactor'])


In [71]:
(
    sample_data_schema_DF
    .join(
        models_df, sample_data_schema_DF.Model == models_df.Model, 'left'
    ).show()
    
)


+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|      Model|FormFactor|
+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+
|  2|    MacBook|2016|    "12\""| 8GB|256GB SSD|11.04|7.74|0.52|  2.03|       null|      null|
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|MacBook Pro|    Laptop|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|MacBook Air|    Laptop|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|       iMac|   Desktop|
+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+

#### The "RIGHT" join method keeps every record from the right DF, together with the matching records from the left DF:

In [72]:
(
    sample_data_schema_DF
    .join(
        models_df, sample_data_schema_DF.Model == models_df.Model, 'right'
    ).show()
    
)


+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|      Model|FormFactor|
+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|MacBook Pro|    Laptop|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|MacBook Air|    Laptop|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|       iMac|   Desktop|
+---+-----------+----+----------+----+---------+-----+----+----+------+-----------+----------+
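The difference between the "left" and "right" joins is easier to see when each side has a row with no partner. The plain-Python sketch below (not PySpark) models both joins on the Model key. The 'Mac mini' row is a hypothetical addition that does not appear in the notebook's data, and left_join and right_join are helper names invented for this illustration.

```python
# (Id, Model) rows standing in for sample_data_schema_DF.
samples = [(1, 'MacBook Pro'), (2, 'MacBook'), (3, 'MacBook Air'), (4, 'iMac')]

# (Model, FormFactor) rows. 'MacBook' is missing, as in the cell above, and
# 'Mac mini' is a hypothetical extra row with no match on the left side.
models = [('MacBook Pro', 'Laptop'), ('MacBook Air', 'Laptop'),
          ('iMac', 'Desktop'), ('Mac mini', 'Desktop')]


def left_join(left, right):
    """Keep every left row; use None for the right columns if unmatched."""
    joined = []
    for left_id, left_model in left:
        matches = [row for row in right if row[0] == left_model]
        if not matches:
            joined.append((left_id, left_model, None, None))
        for right_model, form_factor in matches:
            joined.append((left_id, left_model, right_model, form_factor))
    return joined


def right_join(left, right):
    """Keep every right row; use None for the left columns if unmatched."""
    joined = []
    for right_model, form_factor in right:
        matches = [row for row in left if row[1] == right_model]
        if not matches:
            joined.append((None, None, right_model, form_factor))
        for left_id, left_model in matches:
            joined.append((left_id, left_model, right_model, form_factor))
    return joined


left_result = left_join(samples, models)
right_result = right_join(samples, models)
print(left_result)
print(right_result)
```

In the left join the unmatched 'MacBook' row survives with empty right-hand columns, while in the right join it disappears and the unmatched 'Mac mini' row survives instead.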

#### Other JOIN methods: "SEMI" and "ANTI"

- "SEMI" join method keeps the records from the left DF that have a match in the right DF, but it only returns the columns of the left DF.

- "ANTI" join method is the opposite of the "SEMI" join. It keeps the records from the left DF that have no match in the right DF.

In [73]:
# "SEMI" JOIN:
(
    sample_data_schema_DF
    .join(
        models_df, sample_data_schema_DF.Model == models_df.Model, 'left_semi'
    ).show()
    
)


+---+-----------+----+----------+----+---------+-----+----+----+------+
| Id|      Model|Year|ScreenSize| RAM|      HDD|    W|   D|   H|Weight|
+---+-----------+----+----------+----+---------+-----+----+----+------+
|  1|MacBook Pro|2015|    "15\""|16GB|512GB SSD|13.75|9.48|0.61|  4.02|
|  3|MacBook Air|2016|  "13.3\""| 8GB|128GB SSD| 12.8|8.94|0.68|  2.96|
|  4|       iMac|2017|    "27\""|64GB|  1TB SSD| 25.6| 8.0|20.3|  20.8|
+---+-----------+----+----------+----+---------+-----+----+----+------+

In [74]:
# "ANTI" JOIN:
(
    sample_data_schema_DF
    .join(
        models_df, sample_data_schema_DF.Model == models_df.Model, 'left_anti'
    ).show()
    
)


+---+-------+----+----------+---+---------+-----+----+----+------+
| Id|  Model|Year|ScreenSize|RAM|      HDD|    W|   D|   H|Weight|
+---+-------+----+----------+---+---------+-----+----+----+------+
|  2|MacBook|2016|    "12\""|8GB|256GB SSD|11.04|7.74|0.52|  2.03|
+---+-------+----+----------+---+---------+-----+----+----+------+
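Both joins act as a filter on the left DF: the right DF only decides which rows pass. A minimal plain-Python sketch (not PySpark) of that idea, using the same Model values as above:

```python
# (Id, Model) rows standing in for sample_data_schema_DF.
samples = [(1, 'MacBook Pro'), (2, 'MacBook'), (3, 'MacBook Air'), (4, 'iMac')]

# (Model, FormFactor) rows with 'MacBook' missing, as in the cells above.
models = [('MacBook Pro', 'Laptop'), ('MacBook Air', 'Laptop'),
          ('iMac', 'Desktop')]

# Only the join keys of the right side matter; its other columns are dropped.
right_keys = {model for model, _form_factor in models}

# "SEMI": left rows whose key exists on the right.
left_semi = [row for row in samples if row[1] in right_keys]

# "ANTI": left rows whose key does not exist on the right.
left_anti = [row for row in samples if row[1] not in right_keys]

print(left_semi)
print(left_anti)
```

Because the test is simple membership, a left row appears at most once in the result even if several right rows share its key.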

## 10.7 DataFrame Transformation Type: .unionAll()

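The .unionAll() function appends the rows of one DF to another DF that has the same schema. Columns are matched by position rather than by name, and duplicate rows are kept, as with "UNION ALL" in SQL.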
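As a rough plain-Python illustration (not PySpark) of a union that keeps duplicates, which is the behaviour of "UNION ALL" in SQL that .unionAll() follows, consider two small batches of rows. The batches and the union_all helper are hypothetical and exist only for this sketch.

```python
# Two hypothetical batches of (Id, Model) rows with the same column layout.
batch_a = [(1, 'MacBook Pro'), (2, 'MacBook')]
batch_b = [(2, 'MacBook'), (3, 'MacBook Air')]


def union_all(first, second):
    """Append the rows of second to first, keeping any duplicates."""
    return list(first) + list(second)


combined = union_all(batch_a, batch_b)

# Dropping duplicates afterwards, similar in spirit to calling .distinct().
deduplicated = list(dict.fromkeys(combined))

print(combined)
print(deduplicated)
```

The row (2, 'MacBook') appears twice in the combined result and only once after the duplicates are removed.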

In [None]:
## 10.8 DataFrame Transformation Type: 

In [None]:
## 10.9 DataFrame Transformation Type: 

In [None]:
## 10.10 DataFrame Transformation Type: 

In [None]:
## 10.11 DataFrame Transformation Type: 

In [None]:
## 10.12 DataFrame Transformation Type: 

In [None]:
## 11 Overview of the DataFrame Actions:


