# **Note:**

# **This notebook is a demo of some Spark DataFrame commands used in Ch. 10 of Asllani's book.**

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=b516a116eabfeb80a062821001aa7efdac39947d6bfb12e47431e597dce0242a
  Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.1


**Creating SparkSession object**

A Spark session object is created by calling SparkSession.builder.getOrCreate(). You also specify an application name (APP_NAME here) for the session with appName().

In [2]:
from pyspark.sql import SparkSession
APP_NAME = "ch10_DF_example"
spark = SparkSession.builder.appName(APP_NAME).getOrCreate()
spark

**Mounting your Google Drive**

Mount Google Drive in the notebook to access the data files that are stored there.

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


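**Import necessary libraries**

SparkSession is used to create Spark DataFrames. It was already imported in the SparkSession cell above, so the import is repeated here only for completeness.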

In [4]:
from pyspark.sql import SparkSession

**Checking which files exist in the path that we want to access**

In [5]:
import os
os.listdir('./gdrive/My Drive/Asllani/ch10_data')

['us_zip_codes.csv', 'customers.csv', 'customers.json']

# **Create a DataFrame from a JSON file**

Run the following commands to create a DataFrame named customersDF from the JSON file.

In [6]:
customersDF = spark.read.json('./gdrive/My Drive/Asllani/ch10_data/customers.json')
customersDF

DataFrame[Address: string, Age: bigint, CID: string, City: string, FirstName: string, LastName: string, State: string]

Print the schema of the loaded DataFrame

In [7]:
customersDF.printSchema()

root
 |-- Address: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- CID: string (nullable = true)
 |-- City: string (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- State: string (nullable = true)



Show the data from the dataframe

In [None]:
customersDF.show()

+-----------------+----+---+------+---------+--------+-----+
|          Address| Age|CID|  City|FirstName|LastName|State|
+-----------------+----+---+------+---------+--------+-----+
|55 West Point St.|  30| 01|  null|     Jane|   Smith| null|
|   67 W Point Dr.|null| 02|Hixson|     null|  Vaughn|   TN|
|   2736 N 3rd St.|  40| 03|  null|     Mary| McBride| null|
|   1842 Woods Rd.|null| 04|  null|  Richard|  Becher| null|
|   1842 Wood Rd..|null| 05|Hixson|      Pat|    null|   TN|
| 555 Eastside St.|  25| 06|  null|    Cindy| Wallace| null|
| 555 Eastside St.|  45| 07|  null|     Mike|    Long| null|
+-----------------+----+---+------+---------+--------+-----+



Printing the first 3 rows from the DataFrame

In [8]:
First3DF = customersDF.take(3)
First3DF

[Row(Address='55 West Point St.', Age=30, CID='01', City=None, FirstName='Jane', LastName='Smith', State=None),
 Row(Address='67 W Point Dr.', Age=None, CID='02', City='Hixson', FirstName=None, LastName='Vaughn', State='TN'),
 Row(Address='2736 N 3rd St.', Age=40, CID='03', City=None, FirstName='Mary', LastName='McBride', State=None)]

Select some columns from the dataframe

In [9]:
nameAgeDF = customersDF.select("FirstName","LastName", "Age")
nameAgeDF.show()

+---------+--------+----+
|FirstName|LastName| Age|
+---------+--------+----+
|     Jane|   Smith|  30|
|     null|  Vaughn|null|
|     Mary| McBride|  40|
|  Richard|  Becher|null|
|      Pat|    null|null|
|    Cindy| Wallace|  25|
|     Mike|    Long|  45|
+---------+--------+----+



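Create a new DataFrame named nameAgeOver30 that holds the rows of nameAgeDF whose Age is greater than 30. Rows with a null Age are excluded, because the comparison evaluates to null for them. The condition is written as "age>30" even though the column is named Age; this works because Spark resolves column names case-insensitively by default.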

In [10]:
nameAgeOver30 = nameAgeDF.where("age>30")
nameAgeOver30.show()

+---------+--------+---+
|FirstName|LastName|Age|
+---------+--------+---+
|     Mary| McBride| 40|
|     Mike|    Long| 45|
+---------+--------+---+



The same result can be achieved by using the following command. The whole process of selecting, querying, and showing the results is in one line.

In [11]:
customersDF.select("FirstName", "LastName", "Age").where("Age>30").show()

+---------+--------+---+
|FirstName|LastName|Age|
+---------+--------+---+
|     Mary| McBride| 40|
|     Mike|    Long| 45|
+---------+--------+---+



# **Create a DataFrame from a CSV file**

Without option('header', 'true'), columns are named _c0, _c1, _c2, etc., and the header line is read as a data row.

Without option('inferSchema', 'true'), all columns are read as strings.

In [12]:
file_path = './gdrive/My Drive/Asllani/ch10_data/customers.csv'
customerDF1 = spark.read.format('csv').option('inferSchema', 'true').load(file_path)
customerDF1.show()

+-----------------+---+------+---------+--------+-----+----+
|              _c0|_c1|   _c2|      _c3|     _c4|  _c5| _c6|
+-----------------+---+------+---------+--------+-----+----+
|          Address|CID|  City|FirstName|LastName|State| Age|
|55 West Point St.|  1|  null|     Jane|   Smith| null|  30|
|   67 W Point Dr.|  2|Hixson|     null|  Vaughn|   TN|null|
|   2736 N 3rd St.|  3|  null|     Mary| McBride| null|  40|
|   1842 Woods Rd.|  4|  null|  Richard|  Becher| null|null|
|   1842 Woods Rd.|  5|Hixson|      Pat|    null|   TN|null|
| 555 Eastside St.|  6|  null|    Cindy| Wallace| null|  25|
| 555 Eastside St.|  7|  null|     Mike|    Long| null|  45|
+-----------------+---+------+---------+--------+-----+----+



In [13]:
file_path = './gdrive/My Drive/Asllani/ch10_data/customers.csv'
customerDF1 = spark.read.format('csv').option('inferSchema', 'true').option('header', 'true').load(file_path)
customerDF1.show()

+-----------------+---+------+---------+--------+-----+----+
|          Address|CID|  City|FirstName|LastName|State| Age|
+-----------------+---+------+---------+--------+-----+----+
|55 West Point St.|  1|  null|     Jane|   Smith| null|  30|
|   67 W Point Dr.|  2|Hixson|     null|  Vaughn|   TN|null|
|   2736 N 3rd St.|  3|  null|     Mary| McBride| null|  40|
|   1842 Woods Rd.|  4|  null|  Richard|  Becher| null|null|
|   1842 Woods Rd.|  5|Hixson|      Pat|    null|   TN|null|
| 555 Eastside St.|  6|  null|    Cindy| Wallace| null|  25|
| 555 Eastside St.|  7|  null|     Mike|    Long| null|  45|
+-----------------+---+------+---------+--------+-----+----+



You can also create and populate DataFrames from a collection of direct user input data. For example, you can start by creating a custom List with these commands:

In [14]:
myCustList = [["Mary","Luthans"],["Joyce","Lari"],["Kathi","Burge"]]
myCustList

[['Mary', 'Luthans'], ['Joyce', 'Lari'], ['Kathi', 'Burge']]

The next step is to create a DataFrame from the myCustList list that you just created and store it in myCustDF.

In [15]:
myCustDF = spark.createDataFrame(myCustList)
myCustDF.show()

+-----+-------+
|   _1|     _2|
+-----+-------+
| Mary|Luthans|
|Joyce|   Lari|
|Kathi|  Burge|
+-----+-------+



Rename the columns

In [16]:
myCustDF.select("_1", "_2").withColumnRenamed("_1", "name").withColumnRenamed("_2", "surname").show()

+-----+-------+
| name|surname|
+-----+-------+
| Mary|Luthans|
|Joyce|   Lari|
|Kathi|  Burge|
+-----+-------+



# **Writing or Saving DataFrames into Output Files**

Use the following commands to create a new DataFrame named AddressDF that contains only the FirstName, LastName, and Address of the customers:

In [17]:
AddressDF = customersDF.select("FirstName","LastName", "Address")
AddressDF.show()

+---------+--------+-----------------+
|FirstName|LastName|          Address|
+---------+--------+-----------------+
|     Jane|   Smith|55 West Point St.|
|     null|  Vaughn|   67 W Point Dr.|
|     Mary| McBride|   2736 N 3rd St.|
|  Richard|  Becher|   1842 Woods Rd.|
|      Pat|    null|   1842 Wood Rd..|
|    Cindy| Wallace| 555 Eastside St.|
|     Mike|    Long| 555 Eastside St.|
+---------+--------+-----------------+



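Use the following command to save the records of AddressDF as CSV under the Asllani directory (in this notebook that is the mounted Google Drive rather than a cluster). Spark creates address_list3 as a folder that contains one or more part files. If the folder does not exist, it is created. If it already exists, the write fails with an error, because the default save mode refuses to write to an existing path.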

In [18]:
AddressDF.write.csv('./gdrive/My Drive/Asllani/address_list3')

# **Analyzing Data with DataFrames**

In [19]:
customersDF2 = spark.read.format('csv').option('inferSchema', 'true').option('header', 'true').load(file_path)
customersDF2.show()

+-----------------+---+------+---------+--------+-----+----+
|          Address|CID|  City|FirstName|LastName|State| Age|
+-----------------+---+------+---------+--------+-----+----+
|55 West Point St.|  1|  null|     Jane|   Smith| null|  30|
|   67 W Point Dr.|  2|Hixson|     null|  Vaughn|   TN|null|
|   2736 N 3rd St.|  3|  null|     Mary| McBride| null|  40|
|   1842 Woods Rd.|  4|  null|  Richard|  Becher| null|null|
|   1842 Woods Rd.|  5|Hixson|      Pat|    null|   TN|null|
| 555 Eastside St.|  6|  null|    Cindy| Wallace| null|  25|
| 555 Eastside St.|  7|  null|     Mike|    Long| null|  45|
+-----------------+---+------+---------+--------+-----+----+



Select the age column from the dataframe

In [20]:
customersDF2.select("Age").show()

+----+
| Age|
+----+
|  30|
|null|
|  40|
|null|
|null|
|  25|
|  45|
+----+



Other ways of referring to the same column in a select

In [21]:
customersDF2.select(customersDF2["Age"]).show()

+----+
| Age|
+----+
|  30|
|null|
|  40|
|null|
|null|
|  25|
|  45|
+----+



In [22]:
customersDF2.select(customersDF2.Age).show()

+----+
| Age|
+----+
|  30|
|null|
|  40|
|null|
|null|
|  25|
|  45|
+----+



Add a where condition to remove the rows that have a NULL value in the Age column

In [24]:
ageDF = (customersDF2.select("FirstName", "LastName", "Age")
                     .where(customersDF2.Age.isNotNull()))
ageDF.show()

+---------+--------+---+
|FirstName|LastName|Age|
+---------+--------+---+
|     Jane|   Smith| 30|
|     Mary| McBride| 40|
|    Cindy| Wallace| 25|
|     Mike|    Long| 45|
+---------+--------+---+



You can combine groupBy with the count aggregate function to count how many people live at the same address.

In [None]:
customersDF2.groupBy("Address").count().show()

+-----------------+-----+
|          Address|count|
+-----------------+-----+
|   67 W Point Dr.|    1|
|   1842 Woods Rd.|    2|
|   2736 N 3rd St.|    1|
| 555 Eastside St.|    2|
|55 West Point St.|    1|
+-----------------+-----+



# **Performing Joins with DataFrames**

Join the customersDF and zipCodeDF DataFrames

Load the ZIP code CSV file

In [25]:
file_path = './gdrive/My Drive/Asllani/ch10_data/us_zip_codes.csv'
zipCodeDF= (spark.read.format('csv').option('inferSchema', 'true')
                                    .option('header', 'true').load(file_path))
zipCodeDF.show()

+---+--------+---------+-------------+--------+-----------+----+-----------+----------+-------+-----------+-------------+--------------------+--------------------+--------------------+---------+--------+-------------------+
|zip|     lat|      lng|         city|state_id| state_name|zcta|parent_zcta|population|density|county_fips|  county_name|      county_weights|    county_names_all|     county_fips_all|imprecise|military|           timezone|
+---+--------+---------+-------------+--------+-----------+----+-----------+----------+-------+-----------+-------------+--------------------+--------------------+--------------------+---------+--------+-------------------+
|601|18.18004|-66.75218|     Adjuntas|      PR|Puerto Rico|true|       null|     17242|  111.4|      72001|     Adjuntas|{'72001':99.43,'7...|     Adjuntas|Utuado|         72001|72141|    false|   false|America/Puerto_Rico|
|602|18.36073|-67.17517|       Aguada|      PR|Puerto Rico|true|       null|     38442|  523.5|      720

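The inner join lists only the customers whose City value matches a city in zipCodeDF, and shows the zip for each of them. Customers with a null City are dropped, because a null key never matches in an equality join. The join key is written as "City" although the zipCodeDF column is named city; Spark resolves column names case-insensitively by default.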

In [26]:
customersDF.join(zipCodeDF, "City").show()

+------+--------------+----+---+---------+--------+-----+-----+--------+---------+--------+----------+----+-----------+----------+-------+-----------+-----------+--------------+----------------+---------------+---------+--------+----------------+
|  City|       Address| Age|CID|FirstName|LastName|State|  zip|     lat|      lng|state_id|state_name|zcta|parent_zcta|population|density|county_fips|county_name|county_weights|county_names_all|county_fips_all|imprecise|military|        timezone|
+------+--------------+----+---+---------+--------+-----+-----+--------+---------+--------+----------+----+-----------+----------+-------+-----------+-----------+--------------+----------------+---------------+---------+--------+----------------+
|Hixson|1842 Wood Rd..|null| 05|      Pat|    null|   TN|37343|35.16885|-85.20914|      TN| Tennessee|true|       null|     41655|  367.6|      47065|   Hamilton| {'47065':100}|        Hamilton|          47065|    false|   false|America/New_York|
|Hixson|67 W

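The left_outer join lists all customers and fills in the zipCodeDF columns, including the zip code, for those whose City matches a city in zipCodeDF. For customers without a match, such as those with a null City, the zipCodeDF columns are null.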

In [27]:
customersDF.join(zipCodeDF, "City", "left_outer").show()

+------+-----------------+----+---+---------+--------+-----+-----+--------+---------+--------+----------+----+-----------+----------+-------+-----------+-----------+--------------+----------------+---------------+---------+--------+----------------+
|  City|          Address| Age|CID|FirstName|LastName|State|  zip|     lat|      lng|state_id|state_name|zcta|parent_zcta|population|density|county_fips|county_name|county_weights|county_names_all|county_fips_all|imprecise|military|        timezone|
+------+-----------------+----+---+---------+--------+-----+-----+--------+---------+--------+----------+----+-----------+----------+-------+-----------+-----------+--------------+----------------+---------------+---------+--------+----------------+
|  null|55 West Point St.|  30| 01|     Jane|   Smith| null| null|    null|     null|    null|      null|null|       null|      null|   null|       null|       null|          null|            null|           null|     null|    null|            null|
