<a href="https://colab.research.google.com/github/sahilshah9111/PySpark_Delta/blob/main/PySpark_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**TASK 1: ENVIRONMENT SETUP**

Install Java (OpenJDK 8)

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Download Apache Spark 3.1.2 (prebuilt for Hadoop 2.7) and extract the archive

In [2]:
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz

In [3]:
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
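Before `SPARK_HOME` is pointed at the extracted directory, it can help to confirm that the archive unpacked where expected. The helper below is a minimal sketch and not part of the original notebook: `missing_spark_files` is a hypothetical name, and it only checks for the `bin/spark-submit` and `bin/pyspark` launchers and the `jars` directory that a Spark binary distribution ships.

```python
import os

def missing_spark_files(spark_home):
    """Return the expected Spark entries that are absent under spark_home."""
    # Hypothetical helper: these relative paths exist in a Spark binary distribution.
    expected = ["bin/spark-submit", "bin/pyspark", "jars"]
    return [rel for rel in expected
            if not os.path.exists(os.path.join(spark_home, rel))]

# Directory used for SPARK_HOME in this notebook; an empty list means nothing is missing.
print(missing_spark_files("/content/spark-3.1.2-bin-hadoop2.7"))
```

A non-empty result usually means the download or the `tar` step did not complete.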

Set the environment variables for Java, Spark, and Delta Lake

In [4]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"
# Delta Lake 1.0.x is the release line built for Spark 3.1.x (0.7.0 targets Spark 3.0.x).
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages io.delta:delta-core_2.12:1.0.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog pyspark-shell"
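The `PYSPARK_SUBMIT_ARGS` value packs a package list and several `--conf` entries into one long string, which is hard to audit by eye. The sketch below splits whatever value is currently set back into its parts so each setting can be checked separately. `parse_submit_args` is a hypothetical helper written for this illustration; it handles only the `--packages` and `--conf` flags used here and ignores everything else.

```python
import os
import shlex

def parse_submit_args(args):
    """Split a PYSPARK_SUBMIT_ARGS string into (packages, conf)."""
    tokens = shlex.split(args)
    packages, conf = [], {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "--packages" and i + 1 < len(tokens):
            # --packages takes one comma-separated list of Maven coordinates.
            packages += tokens[i + 1].split(",")
            i += 2
        elif tokens[i] == "--conf" and i + 1 < len(tokens):
            # Each --conf is key=value; split on the first "=" only.
            key, _, value = tokens[i + 1].partition("=")
            conf[key] = value
            i += 2
        else:
            i += 1  # e.g. the trailing "pyspark-shell" token
    return packages, conf

packages, conf = parse_submit_args(os.environ.get("PYSPARK_SUBMIT_ARGS", ""))
print("packages:", packages)
for key, value in conf.items():
    print(key, "=", value)
```

Run after the cell above, this should list the Delta package coordinate and the two `spark.sql.*` settings on separate lines.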

In [5]:
!pip install -q findspark
import findspark
findspark.init()

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = (SparkSession.builder.appName("PySpark Assign").getOrCreate())

**TASK 2**

In [7]:
people = spark.read.csv("people.csv", header=True, inferSchema=True)
people.show(3)

+---------+---------------+-------+-----+------+---------------+------------+------------+-------------------+
|person_ID|           name|  first| last|middle|          email|       phone|         fax|              title|
+---------+---------------+-------+-----+------+---------------+------------+------------+-------------------+
|     3130|Burks, Rosella |Rosella|Burks|  null|BurksR@univ.edu|963.555.1253|963.777.4065|         Professor |
|     3297| Avila, Damien | Damien|Avila|  null|AvilaD@univ.edu|963.555.1352|963.777.7914|         Professor |
|     3547|  Olsen, Robin |  Robin|Olsen|  null|OlsenR@univ.edu|963.555.1378|963.777.9262|Assistant Professor|
+---------+---------------+-------+-----+------+---------------+------------+------------+-------------------+
only showing top 3 rows



In [8]:
people.printSchema()

root
 |-- person_ID: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- middle: string (nullable = true)
 |-- email: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- fax: string (nullable = true)
 |-- title: string (nullable = true)



In [9]:
people = people.fillna('Unknown')

In [10]:
people.show(5)

+---------+-------------------+-------+------+-------+----------------+-----------------+------------+-------------------+
|person_ID|               name|  first|  last| middle|           email|            phone|         fax|              title|
+---------+-------------------+-------+------+-------+----------------+-----------------+------------+-------------------+
|     3130|    Burks, Rosella |Rosella| Burks|Unknown| BurksR@univ.edu|     963.555.1253|963.777.4065|         Professor |
|     3297|     Avila, Damien | Damien| Avila|Unknown| AvilaD@univ.edu|     963.555.1352|963.777.7914|         Professor |
|     3547|      Olsen, Robin |  Robin| Olsen|Unknown| OlsenR@univ.edu|     963.555.1378|963.777.9262|Assistant Professor|
|     1538|Moises, Edgar Estes|  Edgar|Moises|  Estes|MoisesE@univ.edu|963.555.2731x3565|963.777.8264|          Professor|
|     2941|Brian, Heath Pruitt|  Heath| Brian| Pruitt| BrianH@univ.edu|     963.555.2800|963.777.7249| Associate Curator |
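+---------+-------------------+-------+------+-------+----------------+-----------------+------------+-------------------+
only showing top 5 rows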

Write the DataFrame above to disk in Delta format

In [11]:
people.write.format("delta").save("people_delta")
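A Delta table on disk is a directory of Parquet data files plus a `_delta_log` folder of JSON commit files, where each line of a commit file is one action such as `commitInfo`, `protocol`, `metaData`, or `add`. The sketch below counts the actions in each commit so the result of the write can be inspected without Spark. `summarize_delta_log` is a hypothetical helper, not a Delta Lake API, and it assumes the table lives on the local filesystem.

```python
import json
import os
from collections import Counter

def summarize_delta_log(table_path):
    """Map each commit file in _delta_log to a count of its action types."""
    log_dir = os.path.join(table_path, "_delta_log")
    summary = {}
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue  # skip checksum files, checkpoints, and other non-commit entries
        actions = Counter()
        with open(os.path.join(log_dir, name)) as fh:
            for line in fh:
                line = line.strip()
                if line:
                    # Each line is a JSON object whose top-level key names the action.
                    actions.update(json.loads(line).keys())
        summary[name] = dict(actions)
    return summary

# Only inspect the table if the write above has already produced it.
if os.path.isdir(os.path.join("people_delta", "_delta_log")):
    print(summarize_delta_log("people_delta"))
```

After a single write, one commit file is expected; each `add` action in it corresponds to one Parquet data file.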

In [12]:
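from delta.tables import DeltaTable

# Handle to the Delta table written above
delta_table = DeltaTable.forPath(spark, "people_delta")

In [18]:
# Read the table's current contents as a DataFrame
delta_df = delta_table.toDF()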

In [20]:
delta_df.show()

+---------+--------------------+---------+--------+--------+------------------+-----------------+------------+-------------------+
|person_ID|                name|    first|    last|  middle|             email|            phone|         fax|              title|
+---------+--------------------+---------+--------+--------+------------------+-----------------+------------+-------------------+
|     3130|     Burks, Rosella |  Rosella|   Burks| Unknown|   BurksR@univ.edu|     963.555.1253|963.777.4065|         Professor |
|     3297|      Avila, Damien |   Damien|   Avila| Unknown|   AvilaD@univ.edu|     963.555.1352|963.777.7914|         Professor |
|     3547|       Olsen, Robin |    Robin|   Olsen| Unknown|   OlsenR@univ.edu|     963.555.1378|963.777.9262|Assistant Professor|
|     1538| Moises, Edgar Estes|    Edgar|  Moises|   Estes|  MoisesE@univ.edu|963.555.2731x3565|963.777.8264|          Professor|
|     2941| Brian, Heath Pruitt|    Heath|   Brian|  Pruitt|   BrianH@univ.edu|     963.555.2800|963.777.7249| Associate Curator |

In [22]:
print("Total number of records:", delta_df.count())

Total number of records: 40