# First PySpark DataFrame Creation

In this notebook we:
* Create a SparkSession object
* Download a csv file from the web
* Read the csv as a PySpark DataFrame
* View the top 20 rows of the PySpark DataFrame
* Write the PySpark DataFrame in Parquet format to a folder named zones (a single part file here, because this small DataFrame has one partition)

In [1]:
import pyspark

In [2]:
pyspark.__file__

'/home/sanyashireen/spark/spark-3.2.3-bin-hadoop3.2/python/pyspark/__init__.py'

In [3]:
from pyspark.sql import SparkSession

In [4]:
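# Connect to Spark locally; local[*] runs it with as many worker threads as there are logical cores
# Create the PySpark SparkSession (getOrCreate reuses an existing session if one is already running)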
spark = SparkSession.builder.master("local[*]").appName('test').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/02/22 01:46:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
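
The startup log above points to `setLogLevel`. The following is a minimal sketch of calling it from PySpark and of checking what `local[*]` resolved to. It assumes `spark` is the session created in the previous cell; the helper names `quiet_logs` and `describe_session` are hypothetical, not part of PySpark.

```python
def quiet_logs(spark, level="ERROR"):
    """Lower log verbosity for the running session (hypothetical helper)."""
    # SparkContext.setLogLevel overrides the default "WARN" level reported in the startup log.
    spark.sparkContext.setLogLevel(level)


def describe_session(spark):
    """Collect a few properties of the session for a quick sanity check (hypothetical helper)."""
    sc = spark.sparkContext
    return {
        "version": spark.version,  # Spark version string
        "master": sc.master,  # "local[*]" for the session created above
        "app_name": sc.appName,  # "test" for the session created above
        # In local[*] mode this is typically the number of logical cores.
        "default_parallelism": sc.defaultParallelism,
    }
```

In the notebook this could be used as `quiet_logs(spark)` followed by `describe_session(spark)`.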


In [None]:
# Download the file from the web into the current folder
!wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
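
If `wget` is not available, or to avoid the duplicate copy that `wget` saves by default when the cell is re-run, the download can be done from Python. This is a sketch using only the standard library; the helper name `download_if_missing` is hypothetical, and the URL is the one from the cell above.

```python
import os
import urllib.request

# URL taken from the wget cell above.
ZONE_LOOKUP_URL = "https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv"


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest (hypothetical helper)."""
    if not os.path.exists(dest):
        # urlretrieve writes the response body straight to the destination path.
        urllib.request.urlretrieve(url, dest)
    return dest
```

A call such as `download_if_missing(ZONE_LOOKUP_URL, "taxi+_zone_lookup.csv")` leaves the file in the current folder, as the `wget` cell does.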

In [6]:
# Look at the first 10 lines of the file with the shell's head command
!head taxi+_zone_lookup.csv

"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"


In [5]:
# Read the csv as a PySpark DF object
df = spark.read.option("header", "true").csv('taxi+_zone_lookup.csv')
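
With only the header option set, Spark reads every column as a string, including LocationID. One way to get typed columns is to pass an explicit schema. The sketch below assumes `spark` is the session created earlier; the helper name `read_zones` is hypothetical, and the column types are an assumption based on the file preview (only the column names come from the CSV header).

```python
# Column names come from the CSV header; the types are an assumption.
ZONE_SCHEMA = "LocationID INT, Borough STRING, Zone STRING, service_zone STRING"


def read_zones(spark, path, schema=ZONE_SCHEMA):
    """Read the zone lookup CSV with an explicit schema (hypothetical helper)."""
    return (
        spark.read
        .option("header", "true")  # first line holds the column names
        .schema(schema)  # DataFrameReader.schema accepts a DDL-formatted string
        .csv(path)
    )
```

For example, `read_zones(spark, 'taxi+_zone_lookup.csv').printSchema()` should then report LocationID as an integer column rather than a string.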

In [7]:
# View the top 20 rows of the PySpark DF
df.show()

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|
|         7|       Queens|             Astoria|   Boro Zone|
|         8|       Queens|        Astoria Park|   Boro Zone|
|         9|       Queens|          Auburndale|   Boro Zone|
|        10|       Queens|        Baisley Park|   Boro Zone|
|        11|     Brooklyn|          Bath Beach|   Boro Zone|
|        12|    Manhattan|        Battery Park| Yellow Zone|
|        13|    Manhattan|   Battery Park City| Yellow Zone|
|        14|     Brookly
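
In the output above, long zone names are cut off with "..." because `show()` truncates strings longer than 20 characters by default. A small sketch of two ways to see the full values, assuming `df` is the DataFrame read above; the helper names are hypothetical.

```python
def show_untruncated(df, n=5):
    """Print the first n rows without cutting long strings (hypothetical helper)."""
    # truncate=False disables the default 20-character limit on string cells.
    df.show(n, truncate=False)


def first_rows(df, n=5):
    """Return the first n rows as plain dicts for inspection in Python (hypothetical helper)."""
    # DataFrame.take returns a list of Row objects; Row.asDict converts each one to a dict.
    return [row.asDict() for row in df.take(n)]
```

Both helpers pull only the first `n` rows to the driver, so they stay cheap on larger DataFrames.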

In [8]:
# Use Spark to write the PySpark DF to the folder 'zones' in Parquet format; Spark writes one part file per partition
# If the number of partitions is not set explicitly, this small DataFrame has a single partition, so one part file is written
df.write.parquet('zones')
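
Two details of this write step are worth a sketch: running `df.write.parquet('zones')` a second time fails because the folder already exists (the default save mode raises an error), and the number of part files follows the number of partitions. The helper below is hypothetical and assumes `df` is the DataFrame read above.

```python
def write_parquet(df, folder, num_partitions=None, mode="overwrite"):
    """Write df to folder as Parquet, optionally repartitioning first (hypothetical helper)."""
    if num_partitions is not None:
        # repartition(n) shuffles the rows into n partitions;
        # each non-empty partition typically becomes one part file.
        df = df.repartition(num_partitions)
    # mode="overwrite" replaces an existing folder; the default mode raises an error instead.
    df.write.mode(mode).parquet(folder)
    # Report how many partitions the written DataFrame had.
    return df.rdd.getNumPartitions()
```

For instance, `write_parquet(df, 'zones', num_partitions=4)` would be expected to leave four part files in the folder instead of one.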

In [11]:
# The folder 'zones' was created to hold the Parquet output
!ls -lh

total 28K
-rw-rw-r-- 1 sanyashireen sanyashireen 6.8K Feb 22 01:52 Untitled.ipynb
-rw-rw-r-- 1 sanyashireen sanyashireen  13K Aug 17  2016 taxi+_zone_lookup.csv
drwxr-xr-x 2 sanyashireen sanyashireen 4.0K Feb 22 01:54 zones
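
To check what Spark actually wrote, the folder can be inspected and read back. This is a sketch under the assumption that the output follows Spark's usual `part-...parquet` file naming; both helper names are hypothetical, and `spark` is the session created earlier.

```python
from pathlib import Path


def list_part_files(folder):
    """Return the sorted names of the Parquet part files inside folder (hypothetical helper)."""
    # Data files start with "part-"; the _SUCCESS marker and hidden .crc files do not match.
    return sorted(p.name for p in Path(folder).glob("part-*.parquet"))


def read_back(spark, folder):
    """Load the Parquet folder as a DataFrame; the schema is stored in the files themselves."""
    return spark.read.parquet(folder)
```

In the notebook, `list_part_files('zones')` shows how many part files were produced, and `read_back(spark, 'zones').show()` confirms the data round-trips.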
