# Saving a DataFrame in Parquet format

- When working with Spark, you'll often start with CSV, JSON, or other data sources. These give you a lot of flexibility in the types of data you can load, but they are not optimal formats for Spark. `Parquet` is a columnar storage format, which lets Spark read only the columns a query needs and use predicate pushdown to skip data that cannot match a filter. Spark therefore processes only the data necessary to complete the operations you define instead of reading the entire dataset. This gives Spark more flexibility in accessing the data and often drastically improves performance on large datasets.

- In this exercise, we're going to practice creating a new Parquet file and then processing some data from it.

- The `spark` object and the `df1` and `df2` DataFrames have been set up for you.
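
The predicate pushdown idea mentioned above can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not how Spark or Parquet is actually implemented: the function names `make_row_groups` and `count_longer_than` and the sample durations are hypothetical. It assumes a single column split into "row groups" that each record their minimum and maximum value, so a filter can skip any group whose statistics rule it out.

```python
# Toy model of predicate pushdown on one column.
# NOTE: illustrative only; names and data are hypothetical, and real
# Parquet readers are far more involved than this sketch.
from typing import Dict, List, Tuple


def make_row_groups(values: List[int], group_size: int) -> List[Dict]:
    """Split one column into row groups and record min/max statistics."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def count_longer_than(groups: List[Dict], threshold: int) -> Tuple[int, int]:
    """Count values > threshold, skipping groups that cannot match.

    Returns (matching_count, groups_actually_read).
    """
    matches = 0
    groups_read = 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no value here can match
        groups_read += 1
        matches += sum(1 for v in group["values"] if v > threshold)
    return matches, groups_read


# Hypothetical flight durations in minutes.
durations = [30, 45, 50, 55, 120, 130, 140, 150, 200, 210, 220, 230]
groups = make_row_groups(durations, group_size=4)
matches, groups_read = count_longer_than(groups, threshold=100)
print(f"{matches} matching rows, read {groups_read} of {len(groups)} row groups")
# prints: 8 matching rows, read 2 of 3 row groups
```

The first group is never scanned because its maximum (55) is below the threshold. A row-oriented text format such as CSV carries no such statistics, so every row has to be parsed before the filter can be applied.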

## Instructions

- View the row count of `df1` and `df2`.
- Combine `df1` and `df2` in a new DataFrame named `df3` with the `union` method.
- Save `df3` to a Parquet file named `AA_DFW_ALL.parquet`.
- Read the `AA_DFW_ALL.parquet` file and show the count.

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List any extra packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point (Spark 2.x)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# To run on YARN, specify .master("yarn"):
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()

sc = spark.sparkContext

In [3]:
!pwd

/home/talentum/test-jupyter/P3/M1/SM3/3_UnderstandingParquet


In [15]:
df1 = spark.read.format('csv').load('file:///home/talentum/test-jupyter/P3/M1/SM3/3_UnderstandingParquet/Dataset/AA_DFW_2017_Departures_Short.csv.gz') # AA_DFW_2017_Departures_Short.csv
df2 = spark.read.format('csv').load('file:///home/talentum/test-jupyter/P3/M1/SM3/3_UnderstandingParquet/Dataset/AA_DFW_2018_Departures_Short.csv.gz') # AA_DFW_2018_Departures_Short.csv

# View the row count of df1 and df2
print("df1 Count: %d" % df1.count())
print("df2 Count: %d" % df2.count())

# Combine the DataFrames into one
df3 = df1.union(df2)
df3 = df3.withColumnRenamed('_c3', 'flight_duration')

# Save location 
save_path = 'file:///home/talentum/parquet_temp'

# Save the df3 DataFrame in Parquet format
df3.write.parquet(f'{save_path}/AA_DFW_ALL.parquet', mode='overwrite')

# Read the Parquet file into a new DataFrame and run a count
print('New DataFrame :', spark.read.parquet(f'{save_path}/AA_DFW_ALL.parquet').count())



df1 Count: 139359
df2 Count: 119911
New DataFrame : 259270


In [1]:
# Check the sizes of the two source files
!ls ./Dataset -lh

# Check the size of the new Parquet output
!ls ~/parquet_temp/AA_DFW_ALL.parquet -lh

total 1.2M
-rwxrwx--- 1 talentum talentum 613K Jan  4 17:21 AA_DFW_2017_Departures_Short.csv.gz
-rwxrwx--- 1 talentum talentum 528K Jan  4 17:21 AA_DFW_2018_Departures_Short.csv.gz
total 524K
-rw-r--r-- 1 talentum talentum 262K Jan  4 17:53 part-00000-622d4b2f-90cd-4599-b51e-123bc3dcd3ab-c000.snappy.parquet
-rw-r--r-- 1 talentum talentum 259K Jan  4 17:53 part-00001-622d4b2f-90cd-4599-b51e-123bc3dcd3ab-c000.snappy.parquet
-rw-r--r-- 1 talentum talentum    0 Jan  4 17:53 _SUCCESS
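
As the listing shows, `AA_DFW_ALL.parquet` is a directory of part files rather than a single file, so comparing sizes means adding the parts together. The helper below is a minimal sketch using only the standard library. The name `data_file_bytes` is hypothetical, and the rule of skipping names that start with `.` or `_` is an assumption based on the common convention for checksum and marker files such as `_SUCCESS`.

```python
import os


def data_file_bytes(path: str) -> int:
    """Sum the sizes of the data files under a directory.

    Assumption: names starting with '.' or '_' are checksum or marker
    files (for example _SUCCESS) and are not counted.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.startswith((".", "_")):
                continue
            total += os.path.getsize(os.path.join(root, name))
    return total


# Paths taken from the cells above; the block is skipped if they do not exist.
parquet_dir = os.path.expanduser("~/parquet_temp/AA_DFW_ALL.parquet")
csv_dir = "./Dataset"
for label, folder in [("CSV (gzip)", csv_dir), ("Parquet", parquet_dir)]:
    if os.path.isdir(folder):
        print(f"{label}: {data_file_bytes(folder) / 1024:.0f} KiB")
```

Note that the source files here are gzip-compressed CSV, so the difference in size reflects both the compression codec and the file layout.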
