# Learning PySpark 
### Video series

### Packt Publishing

**Author**: Tomasz Drabas
**Date**:   2018-01-30





# Section 4: Spark DataFrames & Transformations

In this section we look at Spark DataFrames: how to create them, how to define their schema, and which transformations are available on them.

## Creating DataFrames
### From RDDs

In [1]:
simple_rdd = sc.parallelize([
      ['2017-02-01','Rachel', 19, 156, 'Sydney']
    , ['2018-01-01','Albert',  3,  45, 'New York']
    , ['2018-03-02','Jack',   61, 190, 'Krakow']
    , ['2017-12-31','Skye',    8,  82, 'Harbin']
])


SparkSession available as 'spark'.


In [2]:
simple_df = spark.createDataFrame(
    simple_rdd, 
    ['Date','Name', 'Age', 'Weight', 'Location']
)

In [3]:
simple_df.show()

+----------+------+---+------+--------+
|      Date|  Name|Age|Weight|Location|
+----------+------+---+------+--------+
|2017-02-01|Rachel| 19|   156|  Sydney|
|2018-01-01|Albert|  3|    45|New York|
|2018-03-02|  Jack| 61|   190|  Krakow|
|2017-12-31|  Skye|  8|    82|  Harbin|
+----------+------+---+------+--------+

### From JSON string

In [4]:
json_string = [
    '{"Date":"2017-02-01","Name":"Rachel","Age":19,"Weight":156,"Location":"Sydney"}', 
    '{"Date":"2018-01-01","Name":"Albert","Age":3 ,"Weight":45 ,"Location":"New York"}', 
    '{"Date":"2018-03-02","Name":"Jack"  ,"Age":61,"Weight":190,"Location":"Krakow"}', 
    '{"Date":"2017-12-31","Name":"Skye"  ,"Age":8 ,"Weight":82 ,"Location":"Harbin"}'
]

simple_df_json = spark.read.json(sc.parallelize(json_string))
simple_df_json.show()

+---+----------+--------+------+------+
|Age|      Date|Location|  Name|Weight|
+---+----------+--------+------+------+
| 19|2017-02-01|  Sydney|Rachel|   156|
|  3|2018-01-01|New York|Albert|    45|
| 61|2018-03-02|  Krakow|  Jack|   190|
|  8|2017-12-31|  Harbin|  Skye|    82|
+---+----------+--------+------+------+
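
The columns above come back as `Age, Date, Location, Name, Weight` rather than in the order used in the JSON strings. This is consistent with the fields being sorted by name when the schema is inferred. The plain-Python sketch below (no Spark session needed) only illustrates that ordering; it is not how Spark parses JSON internally.

```python
import json

# One of the JSON records from the cell above
record = '{"Date":"2017-02-01","Name":"Rachel","Age":19,"Weight":156,"Location":"Sydney"}'

# Keys in the order they appear in the string
keys_as_written = list(json.loads(record).keys())

# Keys sorted by name, which matches the header printed by .show()
keys_sorted = sorted(keys_as_written)

print(keys_as_written)  # ['Date', 'Name', 'Age', 'Weight', 'Location']
print(keys_sorted)      # ['Age', 'Date', 'Location', 'Name', 'Weight']
```

If the original order matters, selecting the columns explicitly, for example `simple_df_json.select('Date', 'Name', 'Age', 'Weight', 'Location')`, should restore it.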

### Reading data

In [5]:
sample_df = spark.read.csv(
    '../data/sample_data.csv'
    , header=True
)

In [6]:
sample_df.show(4)

+----------+-------+-------+------+-----+--------+------+
| OrderDate| Region|    Rep|  Item|Units|UnitCost| Total|
+----------+-------+-------+------+-----+--------+------+
|    1/6/16|   East|  Jones|Pencil|   95|    1.99|189.05|
|2017-03-02|Central| Kivell|Binder|   50|   19.99| 999.5|
|    2/9/16|Central|Jardine|Pencil|   36|    4.99|179.64|
|   2/26/16|Central|   Gill|   Pen|   27|   19.99|539.73|
+----------+-------+-------+------+-----+--------+------+
only showing top 4 rows

## Spark DataFrame schema
### Inferring schema from RDDs by reflection

In [7]:
simple_df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Weight: long (nullable = true)
 |-- Location: string (nullable = true)

### Programmatically specifying schema

In [8]:
import pyspark.sql.types as typ
import datetime as dt

schema = [
      ('Date', typ.DateType())
    , ('Name', typ.StringType())
    , ('Age',  typ.IntegerType())
    , ('Weight', typ.IntegerType())
    , ('Location', typ.StringType())
]

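# Wrap each (name, type) pair in a nullable StructField
schema = typ.StructType([typ.StructField(e[0], e[1], True) for e in schema])

simple_df_schema = spark.createDataFrame(
      simple_rdd
        # Parse the date string so the first element matches DateType
        .map(lambda row:
             [dt.datetime.strptime(row[0], '%Y-%m-%d').date()] + row[1:]
            )
    , schema=schema
)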

simple_df_schema.show()

+----------+------+---+------+--------+
|      Date|  Name|Age|Weight|Location|
+----------+------+---+------+--------+
|2017-02-01|Rachel| 19|   156|  Sydney|
|2018-01-01|Albert|  3|    45|New York|
|2018-03-02|  Jack| 61|   190|  Krakow|
|2017-12-31|  Skye|  8|    82|  Harbin|
+----------+------+---+------+--------+

In [9]:
simple_df_schema.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Location: string (nullable = true)

### Automatically inferring schema while reading data

Without `inferSchema`, every CSV column is read as a string. Setting `inferSchema=True` makes Spark take an extra pass over the data to infer the column types.

In [10]:
sample_df.printSchema()

root
 |-- OrderDate: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Rep: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Units: string (nullable = true)
 |-- UnitCost: string (nullable = true)
 |-- Total: string (nullable = true)

In [11]:
sample_df_inferred = spark.read.csv(
    '../data/sample_data.csv'
    , header=True
    , inferSchema=True
)

In [12]:
sample_df_inferred.printSchema()

root
 |-- OrderDate: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Rep: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Units: integer (nullable = true)
 |-- UnitCost: double (nullable = true)
 |-- Total: double (nullable = true)

In [13]:
import pyspark.sql.functions as f

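sample_df_inferred = (
    sample_df_inferred
    # 'M/d/yy' matches months and days written without a leading zero
    # ('1/6/16'); rows in another format, such as '2017-03-02', come back
    # as null in the output below
    .withColumn('OrderDate'
                , f.to_date('OrderDate', 'M/d/yy')
               )
)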

sample_df_inferred.show(4)

+----------+-------+-------+------+-----+--------+------+
| OrderDate| Region|    Rep|  Item|Units|UnitCost| Total|
+----------+-------+-------+------+-----+--------+------+
|2016-01-06|   East|  Jones|Pencil|   95|    1.99|189.05|
|      null|Central| Kivell|Binder|   50|   19.99| 999.5|
|2016-02-09|Central|Jardine|Pencil|   36|    4.99|179.64|
|2016-02-26|Central|   Gill|   Pen|   27|   19.99|539.73|
+----------+-------+-------+------+-----+--------+------+
only showing top 4 rows
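
The second row is `null` because its value (`2017-03-02`) does not match the pattern passed to `to_date`. One way to handle a column with mixed formats is to try several patterns in turn and keep the first that parses. The sketch below shows that fallback idea in plain Python with `datetime.strptime`; the function name `parse_order_date` and the list of formats are assumptions made for illustration, based only on the rows shown above.

```python
import datetime as dt

# Candidate formats, tried in order (an assumption based on the rows shown above)
FORMATS = ('%m/%d/%y', '%Y-%m-%d')

def parse_order_date(value):
    """Return a date for the first format that matches, or None if none does."""
    for fmt in FORMATS:
        try:
            return dt.datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None

print(parse_order_date('1/6/16'))      # 2016-01-06
print(parse_order_date('2017-03-02'))  # 2017-03-02
print(parse_order_date('n/a'))         # None
```

On the Spark side, the same idea could probably be expressed by wrapping two `f.to_date` calls, one per pattern, in `f.coalesce`, which returns the first non-null value.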

## .agg(...)

In [14]:
sample_df_inferred.agg(
    {
          'Total': 'avg'
    }
).show()

+------------------+
|        avg(Total)|
+------------------+
|456.46232558139553|
+------------------+
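
The dictionary form of `.agg(...)` maps each column name to a single aggregation. Because Python dictionary keys are unique, it cannot carry two aggregations for the same column, which is a likely reason the next cell switches to a list of tuples. A small plain-Python illustration (no Spark needed):

```python
# A dict literal that repeats a key keeps only the last value
dict_spec = {'Total': 'min', 'Total': 'max'}
print(dict_spec)  # {'Total': 'max'}

# A list of (column, aggregation) pairs keeps both entries
list_spec = [('Total', 'min'), ('Total', 'max')]
print(len(dict_spec), len(list_spec))  # 1 2
```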

In [15]:
aggregations = [
      ('Total', f.min,    'Total_min')
    , ('Total', f.max,    'Total_max')
    , ('Total', f.avg,    'Total_avg')
    , ('Total', f.stddev, 'Total_stddev')
]

(
    sample_df_inferred
    .agg(*[e[1](e[0]).alias(e[2]) for e in aggregations])
    .show()
)

+---------+---------+------------------+-----------------+
|Total_min|Total_max|         Total_avg|     Total_stddev|
+---------+---------+------------------+-----------------+
|     9.03|  1879.06|456.46232558139553|447.0221038416717|
+---------+---------+------------------+-----------------+
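
Note that `f.stddev` is an alias for `f.stddev_samp`, so `Total_stddev` is the sample standard deviation (dividing by n - 1); `f.stddev_pop` gives the population version. The difference can be seen with Python's `statistics` module. The numbers below are made up for the illustration and are not taken from `sample_data.csv`.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers, not from sample_data.csv

sample_sd = statistics.stdev(values)       # divides by n - 1
population_sd = statistics.pstdev(values)  # divides by n

print(population_sd)                               # 2.0
print(math.isclose(sample_sd, math.sqrt(32 / 7)))  # True
```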

## .sql(...)

In [16]:
sample_df_inferred.createOrReplaceTempView('sample_df_inferred')

(
    spark
    .sql('''
        SELECT 
              MIN(Total)    AS Total_min
            , MAX(Total)    AS Total_max
            , AVG(Total)    AS Total_avg
            , STDDEV(Total) AS Total_std
        FROM sample_df_inferred
    ''')
    .show()
)

+---------+---------+------------------+-----------------+
|Total_min|Total_max|         Total_avg|        Total_std|
+---------+---------+------------------+-----------------+
|     9.03|  1879.06|456.46232558139553|447.0221038416717|
+---------+---------+------------------+-----------------+

In [17]:
(
    sample_df_inferred
    .selectExpr(
          'MIN(Total) AS Total_min'
        , 'MAX(Total) AS Total_max'
    )
).show()

+---------+---------+
|Total_min|Total_max|
+---------+---------+
|     9.03|  1879.06|
+---------+---------+

## Creating temporary views

The view `sample_df_inferred` was already registered in cell 16, so calling `createTempView` with the same name raises an `AnalysisException`. `createOrReplaceTempView` overwrites the existing view instead.

In [18]:
sample_df_inferred.createTempView('sample_df_inferred')

"Temporary table 'sample_df_inferred' already exists;"
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 153, in createTempView
    self._jdf.createTempView(name)
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 71, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Temporary table 'sample_df_inferred' already exists;"



In [19]:
sample_df_inferred.createOrReplaceTempView('sample_df_inferred')

## Joining two DataFrames

In [20]:
regions = spark.createDataFrame(
    sc.parallelize([
        ('Central', 'Chicago')
        , ('West', 'Seattle')
        , ('East', 'Boston')
    ]),
    ['Region', 'Headquarters']
)

In [21]:
(
    sample_df_inferred.join(
        regions
        , on=['Region']
        , how='left_outer'
    )
    .orderBy('OrderDate')
    .show(4)
)

+-------+----------+-------+------+-----+--------+------+------------+
| Region| OrderDate|    Rep|  Item|Units|UnitCost| Total|Headquarters|
+-------+----------+-------+------+-----+--------+------+------------+
|Central|      null| Kivell|Binder|   50|   19.99| 999.5|     Chicago|
|   East|2016-01-06|  Jones|Pencil|   95|    1.99|189.05|      Boston|
|Central|2016-02-09|Jardine|Pencil|   36|    4.99|179.64|     Chicago|
|Central|2016-02-26|   Gill|   Pen|   27|   19.99|539.73|     Chicago|
+-------+----------+-------+------+-----+--------+------+------------+
only showing top 4 rows
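
With `how='left_outer'`, every row of the left DataFrame is kept, and rows without a match on the right get `null` in the right-hand columns. The plain-Python sketch below mimics that behaviour with a dictionary lookup. The `'North'` row is hypothetical and added only to show an unmatched key; the other values are taken from the cells above.

```python
# Lookup built from the regions DataFrame defined above
headquarters = {'Central': 'Chicago', 'West': 'Seattle', 'East': 'Boston'}

# (Region, Rep) pairs; the 'North' row is hypothetical, added to show a miss
orders = [('East', 'Jones'), ('Central', 'Kivell'), ('North', 'Doe')]

# Left outer join: every order is kept, unmatched regions get None
joined = [(region, rep, headquarters.get(region)) for region, rep in orders]

for row in joined:
    print(row)
```

A dictionary lookup only mirrors the case where each key appears once on the right side, as it does in `regions`; a real join would produce one output row per matching pair.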

## Descriptive statistics

In [22]:
sample_df_inferred.describe().show()

+-------+-------+--------+------+------------------+------------------+------------------+
|summary| Region|     Rep|  Item|             Units|          UnitCost|             Total|
+-------+-------+--------+------+------------------+------------------+------------------+
|  count|     43|      43|    43|                43|                43|                43|
|   mean|   null|    null|  null|49.325581395348834|20.308604651162792|456.46232558139553|
| stddev|   null|    null|  null|30.078247899067208| 47.34511769375187| 447.0221038416717|
|    min|Central| Andrews|Binder|                 2|              1.29|              9.03|
|    max|   West|Thompson|Pencil|                96|             275.0|           1879.06|
+-------+-------+--------+------+------------------+------------------+------------------+

In [23]:
numeric_columns = [e[0] 
         for e in sample_df_inferred.dtypes 
         if e[1] in ('int', 'double')
        ]

(
    sample_df_inferred
    .select(numeric_columns)
    .describe()
    .show()
)

+-------+------------------+------------------+------------------+
|summary|             Units|          UnitCost|             Total|
+-------+------------------+------------------+------------------+
|  count|                43|                43|                43|
|   mean|49.325581395348834|20.308604651162792|456.46232558139553|
| stddev|30.078247899067208| 47.34511769375187| 447.0221038416717|
|    min|                 2|              1.29|              9.03|
|    max|                96|             275.0|           1879.06|
+-------+------------------+------------------+------------------+
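
The filter `e[1] in ('int', 'double')` works for `sample_df_inferred`, but it would miss other numeric types; for instance, the `long` columns of `simple_df` are reported as `'bigint'` in `.dtypes`. A slightly more general helper could look like the sketch below. The name `numeric_column_names` and the set of type names are assumptions for illustration, and the `dtypes` lists are written out by hand from the `printSchema()` outputs above so that the block runs without a Spark session.

```python
# Type names as they appear in DataFrame.dtypes; this set is an assumption
# and can be extended (e.g. with decimal types) as needed
NUMERIC_TYPES = {'tinyint', 'smallint', 'int', 'bigint', 'float', 'double'}

def numeric_column_names(dtypes):
    """Return names of columns whose dtype string is in NUMERIC_TYPES."""
    return [name for name, dtype in dtypes if dtype in NUMERIC_TYPES]

# dtypes written out by hand from the printSchema() outputs above
sample_dtypes = [
    ('OrderDate', 'date'), ('Region', 'string'), ('Rep', 'string'),
    ('Item', 'string'), ('Units', 'int'), ('UnitCost', 'double'),
    ('Total', 'double'),
]
simple_dtypes = [
    ('Date', 'string'), ('Name', 'string'), ('Age', 'bigint'),
    ('Weight', 'bigint'), ('Location', 'string'),
]

print(numeric_column_names(sample_dtypes))  # ['Units', 'UnitCost', 'Total']
print(numeric_column_names(simple_dtypes))  # ['Age', 'Weight']
```

With a live session the helper would be called as `numeric_column_names(sample_df_inferred.dtypes)`.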

In [24]:
sample_df_inferred.agg(*
    [f.mean(f.col(e)).alias('mean_' + e) for e in numeric_columns] +
    [f.stddev(f.col(e)).alias('stddev_' + e) for e in numeric_columns]
).show()

+------------------+------------------+------------------+------------------+-----------------+-----------------+
|        mean_Units|     mean_UnitCost|        mean_Total|      stddev_Units|  stddev_UnitCost|     stddev_Total|
+------------------+------------------+------------------+------------------+-----------------+-----------------+
|49.325581395348834|20.308604651162792|456.46232558139553|30.078247899067208|47.34511769375187|447.0221038416717|
+------------------+------------------+------------------+------------------+-----------------+-----------------+

## .distinct()

In [25]:
sample_df_inferred.distinct().show(4)

+----------+-------+------+-------+-----+--------+------+
| OrderDate| Region|   Rep|   Item|Units|UnitCost| Total|
+----------+-------+------+-------+-----+--------+------+
|2016-11-25|Central|Kivell|Pen Set|   96|    4.99|479.04|
|2016-09-01|Central| Smith|   Desk|    2|   125.0| 250.0|
|2016-10-05|Central|Morgan| Binder|   28|    8.99|251.72|
|2017-05-31|Central|  Gill| Binder|   80|    8.99| 719.2|
+----------+-------+------+-------+-----+--------+------+
only showing top 4 rows

In [26]:
(
    sample_df_inferred
    .select('Region', 'Rep')
    .distinct()
    .orderBy('Region', 'Rep')
    .show()
)

+-------+--------+
| Region|     Rep|
+-------+--------+
|Central| Andrews|
|Central|    Gill|
|Central| Jardine|
|Central|  Kivell|
|Central|  Morgan|
|Central|   Smith|
|   East|  Howard|
|   East|   Jones|
|   East|  Parent|
|   West| Sorvino|
|   West|Thompson|
+-------+--------+