In [1]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Load and Query CSV with SQL") \
    .getOrCreate()

# Data cleaning review

There are many benefits of using Spark for data cleaning.
Which of the following are benefits?


- Spark offers high performance.
- Spark allows orderly data flows.
- Spark can use strictly defined schemas while ingesting data.

# Defining a schema

Creating a defined schema helps with data quality and import performance. As mentioned during the lesson, we'll create a simple schema to read in the following columns:

- Name
- Age
- City

In [2]:
# Import the types used to define schemas from the pyspark.sql.types module
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

# Define a new schema using the StructType class
people_schema = StructType([
  # Define a StructField for each field
  StructField('name', StringType(), False),
  StructField('age', IntegerType(), False),
  StructField('city', StringType(), False),
])

# Immutability review

You’ve just seen that immutability and lazy processing are fundamental to the way Spark handles data. But why would Spark use immutable DataFrames to begin with?

- To efficiently handle data throughout the cluster.

# Using lazy processing

Lazy processing operations will usually return in about the same amount of time regardless of the actual quantity of data. Remember that this is due to Spark not performing any transformations until an action is requested.

In [9]:
schema = StructType([
    StructField("Date", DateType(), nullable=True),
    StructField("Flight Number", StringType(), nullable=True),
    StructField("Destination Airport", StringType(), nullable=True),
    StructField("Actual elapsed time (Minutes)", IntegerType(), nullable=True),
])

In [10]:
import pyspark.sql.functions as F

# Load the CSV file
aa_dfw_df = spark.read.format('csv').options(Header=True).load('dataset/AA_DFW_2017_Departures_Short.csv')

# Add the airport column using the F.lower() function
aa_dfw_df = aa_dfw_df.withColumn('airport', F.lower(aa_dfw_df['Destination Airport']))

# Drop the Destination Airport column
aa_dfw_df = aa_dfw_df.drop(aa_dfw_df['Destination Airport'])

# Show the DataFrame
aa_dfw_df.show()

+-----------------+-------------+-----------------------------+-------+
|Date (MM/DD/YYYY)|Flight Number|Actual elapsed time (Minutes)|airport|
+-----------------+-------------+-----------------------------+-------+
|       01/01/2017|         0005|                          537|    hnl|
|       01/01/2017|         0007|                          498|    ogg|
|       01/01/2017|         0037|                          241|    sfo|
|       01/01/2017|         0043|                          134|    dtw|
|       01/01/2017|         0051|                           88|    stl|
|       01/01/2017|         0060|                          149|    mia|
|       01/01/2017|         0071|                          203|    lax|
|       01/01/2017|         0074|                           76|    mem|
|       01/01/2017|         0081|                          123|    den|
|       01/01/2017|         0089|                          161|    slc|
|       01/01/2017|         0096|                           84| 

In [11]:
aa_dfw_df.dtypes

[('Date (MM/DD/YYYY)', 'string'),
 ('Flight Number', 'string'),
 ('Actual elapsed time (Minutes)', 'string'),
 ('airport', 'string')]

# Saving a DataFrame in Parquet format

When working with Spark, you'll often start with CSV, JSON, or other data sources. These give you a lot of flexibility in the types of data you can load, but they are not optimal formats for Spark. Parquet is a columnar storage format, which lets Spark read only the columns a query needs and use predicate pushdown to skip data that cannot match a filter. This means Spark will only process the data necessary to complete the operations you define instead of reading the entire dataset. This gives Spark more flexibility in accessing the data and often drastically improves performance on large datasets.

In [12]:
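df1 = aa_dfw_df

# Load the 2016 data the same way as the 2017 data so that both DataFrames
# have the same columns in the same order; union() matches columns by position
df2 = spark.read.format('csv').options(Header=True).load('dataset/AA_DFW_2016_Departures_Short.csv')
df2 = df2.withColumn('airport', F.lower(df2['Destination Airport'])).drop('Destination Airport')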

# View the row count of df1 and df2
print("df1 Count: %d" % df1.count())
print("df2 Count: %d" % df2.count())


df1 Count: 139358
df2 Count: 140604


In [6]:

# Combine the DataFrames into one
df3 = df1.union(df2)

# Save the df3 DataFrame in Parquet format
# df3.write.format('parquet').save('dataset/AA_DFW_ALL.parquet')
# df3.write.parquet('AA_DFW_ALL', mode='overwrite')

# Read the Parquet file into a new DataFrame and run a count
# print(spark.read.parquet('dataset/AA_DFW_ALL.parquet').count())

# SQL and Parquet

Parquet files are well suited as a backing data store for SQL queries in Spark. While it is possible to run the same queries directly via Spark's Python functions, sometimes it's easier to run SQL queries alongside the Python options.

In [7]:
print(aa_dfw_df.dtypes)

[('Date (MM/DD/YYYY)', 'string'), ('Flight Number', 'string'), ('Actual elapsed time (Minutes)', 'string'), ('airport', 'string')]


In [15]:
# Use the 2017 DataFrame loaded earlier as flights_df (no Parquet file was written above)
# flights_df = spark.read.csv('dataset/AA_DFW_2017_Departures_Short.csv', header=True)
flights_df = aa_dfw_df

# Register the DataFrame as a temporary view so it can be queried with SQL
flights_df.createOrReplaceTempView('flights')

# Run a SQL query of the average flight duration
avg_duration = spark.sql('SELECT avg(CAST(`Actual elapsed time (Minutes)` AS INTEGER)) FROM flights').collect()[0]
# print('The average flight time is: %d' % avg_duration)
print(avg_duration)

Row(avg(CAST(Actual elapsed time (Minutes) AS INT))=151.99931112673832)
