<a href="https://colab.research.google.com/github/tiasaxena/PySpark/blob/main/Pyspark_00.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### **1. PySpark Basic Introduction**


In [None]:
! pip install pyspark



In [None]:
import pyspark

Before working with PySpark, we need to create a SparkSession, which is the entry point to the DataFrame API.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Pyspark-Demo').getOrCreate()

spark

In [None]:
# `.option("header", "true")` treats the first row of the CSV as the column names
print("Dataframe info:")

df_spark = spark.read.option('header', 'true').csv('/content/sample_data/california_housing_train.csv')
df_spark.printSchema()

print("---------------------------------------------------------")

print(f"Type of dataframe: {type(df_spark)}")

print("---------------------------------------------------------")

print(f"Head of dataframe: {df_spark.head(4)}")

Dataframe info:
root
 |-- longitude: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- housing_median_age: string (nullable = true)
 |-- total_rooms: string (nullable = true)
 |-- total_bedrooms: string (nullable = true)
 |-- population: string (nullable = true)
 |-- households: string (nullable = true)
 |-- median_income: string (nullable = true)
 |-- median_house_value: string (nullable = true)

---------------------------------------------------------
Type of dataframe: <class 'pyspark.sql.dataframe.DataFrame'>
---------------------------------------------------------
Head of dataframe: [Row(longitude='-114.310000', latitude='34.190000', housing_median_age='15.000000', total_rooms='5612.000000', total_bedrooms='1283.000000', population='1015.000000', households='472.000000', median_income='1.493600', median_house_value='66900.000000'), Row(longitude='-114.470000', latitude='34.400000', housing_median_age='19.000000', total_rooms='7650.000000', total_bedrooms='190

In [None]:
df_spark.show()

+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population| households|median_income|median_house_value|
+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|-114.310000|34.190000|         15.000000|5612.000000|   1283.000000|1015.000000| 472.000000|     1.493600|      66900.000000|
|-114.470000|34.400000|         19.000000|7650.000000|   1901.000000|1129.000000| 463.000000|     1.820000|      80100.000000|
|-114.560000|33.690000|         17.000000| 720.000000|    174.000000| 333.000000| 117.000000|     1.650900|      85700.000000|
|-114.570000|33.640000|         14.000000|1501.000000|    337.000000| 515.000000| 226.000000|     3.191700|      73400.000000|
|-114.570000|33.570000|         20.000000|1454.000000|    326.000000| 624.000000| 262.000000|     1.925000|    

#### **2. PySpark Dataframes**

**Contents:**
1. Pyspark Dataframes
2. Reading the Dataset
3. Checking the datatype of the columns
4. Selecting columns and indexing
5. Checking the description
6. Adding Columns
7. Dropping columns

In [None]:
# Start the session (getOrCreate() returns the already-running session if one exists)
from pyspark.sql import SparkSession

session = SparkSession.builder.appName("Dataframe-Demo").getOrCreate()
session

In [None]:
# 2. Read the dataset
df = spark.read.option('header', 'true').csv('/content/sample_data/california_housing_train.csv')
df

DataFrame[longitude: string, latitude: string, housing_median_age: string, total_rooms: string, total_bedrooms: string, population: string, households: string, median_income: string, median_house_value: string]

In [None]:
df.show(), df.printSchema()

+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population| households|median_income|median_house_value|
+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|-114.310000|34.190000|         15.000000|5612.000000|   1283.000000|1015.000000| 472.000000|     1.493600|      66900.000000|
|-114.470000|34.400000|         19.000000|7650.000000|   1901.000000|1129.000000| 463.000000|     1.820000|      80100.000000|
|-114.560000|33.690000|         17.000000| 720.000000|    174.000000| 333.000000| 117.000000|     1.650900|      85700.000000|
|-114.570000|33.640000|         14.000000|1501.000000|    337.000000| 515.000000| 226.000000|     3.191700|      73400.000000|
|-114.570000|33.570000|         20.000000|1454.000000|    326.000000| 624.000000| 262.000000|     1.925000|    

(None, None)

The schema shows every column as a string, because Spark reads CSV fields as strings by default. Pass **inferSchema=True** so that Spark scans the data and assigns a suitable datatype to each column.

In [None]:
type(df)

In [None]:
df = spark.read.csv('/content/sample_data/california_housing_train.csv', header=True, inferSchema=True)

# 3. Check the datatype of the columns
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)



In [None]:
# 4. Selecting columns and indexing
print(f"The column names are: \n {df.columns}")

The column names are: 
 ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']


In [None]:
# Select specific columns and display their values (show() prints the first 20 rows by default)
df.select(['total_rooms', 'latitude']).show(), type(df.select('longitude'))

+-----------+--------+
|total_rooms|latitude|
+-----------+--------+
|     5612.0|   34.19|
|     7650.0|    34.4|
|      720.0|   33.69|
|     1501.0|   33.64|
|     1454.0|   33.57|
|     1387.0|   33.63|
|     2907.0|   33.61|
|      812.0|   34.83|
|     4789.0|   33.61|
|     1497.0|   34.83|
|     3741.0|   33.62|
|     1988.0|    33.6|
|     1291.0|   34.84|
|     2478.0|   34.83|
|     1448.0|   32.76|
|     2556.0|   34.89|
|     1678.0|    33.6|
|       44.0|   32.79|
|     1388.0|   32.74|
|       97.0|   33.92|
+-----------+--------+
only showing top 20 rows



(None, pyspark.sql.dataframe.DataFrame)

In [None]:
# 5. Checking the description
df.describe().show() # .show() is used to display the data in the dataframe format, i.e, tabular format

+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|summary|          longitude|          latitude|housing_median_age|      total_rooms|   total_bedrooms|        population|       households|     median_income|median_house_value|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|  count|              17000|             17000|             17000|            17000|            17000|             17000|            17000|             17000|             17000|
|   mean|-119.56210823529375|  35.6252247058827| 28.58935294117647|2643.664411764706|539.4108235294118|1429.5739411764705|501.2219411764706| 3.883578100000021|207300.91235294117|
| stddev| 2.0051664084260357|2.1373397946570867|12.586936981660406|2179.947071452777|421.4994515798648| 1

**Note:** **withColumn()** returns a new DataFrame with the column added, or with an existing column of the same name replaced. The original DataFrame is left unchanged, so assign the result to a variable if you want to keep the new column.

In [None]:
# 6. Adding columns in PySpark dataframe
# The result is not assigned here, so `df` itself does not gain the new column
df.withColumn('# of Rooms after addition of extra room', df['total_rooms']+1)

DataFrame[longitude: double, latitude: double, housing_median_age: double, total_rooms: double, total_bedrooms: double, population: double, households: double, median_income: double, median_house_value: double, # of Rooms after addition of extra room: double]

In [None]:
# 7. Drop the columns
# drop() is a no-op for column names that are not in the schema
df.drop('# of Rooms after addition of extra room').show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|  -114.57|   33.57|              20.0|     1454.0|         326.0|     624.0|     262.0|        1.925|           65500.0|
|  -114.58|   33.63|    

#### **3. PySpark Handling Missing Values**

* Dropping Columns
* Dropping Rows
* Various Parameters in Dropping Functionalities
* Handling Missing Values by Mean, Median, Mode

In [None]:
from google.colab import files

# Upload pyspark_demo.csv from the local machine
uploaded = files.upload()

# Confirm the upload by listing whatever was received
if uploaded:
  print(f"Uploaded: {', '.join(uploaded)}")
else:
  print("Upload failed. Please try again.")

Saving pyspark_demo.csv to pyspark_demo (1).csv


In [None]:
from pyspark.sql import SparkSession

session_2 = SparkSession.builder.appName('session').getOrCreate()
session_2

In [None]:
df = spark.read.csv('/content/pyspark_demo.csv', header=True, inferSchema=True)
df.show()

+-----+----+----------+------+
| Name| Age|Experience|Salary|
+-----+----+----------+------+
|  Tia|  31|        10| 30000|
|  Ria|  30|         8| 25000|
|  Mia|  29|         4| 20000|
|  Dia|  24|         3| 20000|
|  Sia|  21|         1| 15000|
|  Kia|  23|         2| 18000|
|  Pia|NULL|      NULL| 40000|
|Mishi|  34|        10| 38000|
|Kishu|  36|      NULL|  NULL|
+-----+----+----------+------+



In [None]:
# 1. Drop the columns
df.drop('Name').show()

+----+----------+------+
| Age|Experience|Salary|
+----+----------+------+
|  31|        10| 30000|
|  30|         8| 25000|
|  29|         4| 20000|
|  24|         3| 20000|
|  21|         1| 15000|
|  23|         2| 18000|
|NULL|      NULL| 40000|
|  34|        10| 38000|
|  36|      NULL|  NULL|
+----+----------+------+



In [None]:
df.show()

+-----+----+----------+------+
| Name| Age|Experience|Salary|
+-----+----+----------+------+
|  Tia|  31|        10| 30000|
|  Ria|  30|         8| 25000|
|  Mia|  29|         4| 20000|
|  Dia|  24|         3| 20000|
|  Sia|  21|         1| 15000|
|  Kia|  23|         2| 18000|
|  Pia|NULL|      NULL| 40000|
|Mishi|  34|        10| 38000|
|Kishu|  36|      NULL|  NULL|
+-----+----+----------+------+



In [None]:
# 2. Drop rows
df.na.drop().show() # Drops every row that contains at least one NULL value (default how="any")

+-----+---+----------+------+
| Name|Age|Experience|Salary|
+-----+---+----------+------+
|  Tia| 31|        10| 30000|
|  Ria| 30|         8| 25000|
|  Mia| 29|         4| 20000|
|  Dia| 24|         3| 20000|
|  Sia| 21|         1| 15000|
|  Kia| 23|         2| 18000|
|Mishi| 34|        10| 38000|
+-----+---+----------+------+



The `how` parameter controls which rows are dropped:
1. `how="all"`: drop a row only if all of its values are NULL.
2. `how="any"`: drop a row if any of its values is NULL (the default).


We can also pass `thresh`, the minimum number of non-null values a row must have in order to be kept. When `thresh` is set, it overrides `how`.
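PySpark documents `thresh` as the minimum number of non-null values a row needs in order to be kept, and notes that it overrides `how` when both are given. The rule can be modelled in a few lines of plain Python. This is only a sketch of that documented behaviour, not Spark's implementation: the helper name `keep_row` and the tuple representation of rows are assumptions made for illustration, with `None` standing in for NULL.

```python
def keep_row(row, how="any", thresh=None):
    """Return True if the row would survive na.drop under the modelled rule."""
    non_null = sum(value is not None for value in row)
    if thresh is not None:
        # When thresh is given it takes precedence and `how` is ignored.
        return non_null >= thresh
    if how == "any":
        # Drop the row if any value is null, i.e. keep only complete rows.
        return non_null == len(row)
    if how == "all":
        # Drop the row only if every value is null.
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")


# Two rows from the demo table (Name, Age, Experience, Salary).
pia = ("Pia", None, None, 40000)
kishu = ("Kishu", 36, None, None)

print(keep_row(pia))                # False: the row has nulls and how="any"
print(keep_row(kishu, thresh=2))    # True: two non-null values meet thresh=2
print(keep_row(kishu, thresh=3))    # False: only two non-null values
print(keep_row(kishu, how="all"))   # True: the row is not entirely null
```

Every row in the demo table has at least two non-null values, which is why `thresh=2` and `thresh=1` both leave the table unchanged.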

In [None]:
df.na.drop(how="any", thresh=2).show() # thresh overrides how: keep rows with at least 2 non-null values
df.na.drop(how="all", thresh=1).show() # thresh overrides how: keep rows with at least 1 non-null value

+-----+----+----------+------+
| Name| Age|Experience|Salary|
+-----+----+----------+------+
|  Tia|  31|        10| 30000|
|  Ria|  30|         8| 25000|
|  Mia|  29|         4| 20000|
|  Dia|  24|         3| 20000|
|  Sia|  21|         1| 15000|
|  Kia|  23|         2| 18000|
|  Pia|NULL|      NULL| 40000|
|Mishi|  34|        10| 38000|
|Kishu|  36|      NULL|  NULL|
+-----+----+----------+------+

+-----+----+----------+------+
| Name| Age|Experience|Salary|
+-----+----+----------+------+
|  Tia|  31|        10| 30000|
|  Ria|  30|         8| 25000|
|  Mia|  29|         4| 20000|
|  Dia|  24|         3| 20000|
|  Sia|  21|         1| 15000|
|  Kia|  23|         2| 18000|
|  Pia|NULL|      NULL| 40000|
|Mishi|  34|        10| 38000|
|Kishu|  36|      NULL|  NULL|
+-----+----+----------+------+



With the **`subset`** parameter, only the listed columns are checked for nulls when deciding which rows to drop; nulls in other columns are ignored.

In [None]:
df.na.drop(how="any", subset=["Experience"]).show()

+-----+---+----------+------+
| Name|Age|Experience|Salary|
+-----+---+----------+------+
|  Tia| 31|        10| 30000|
|  Ria| 30|         8| 25000|
|  Mia| 29|         4| 20000|
|  Dia| 24|         3| 20000|
|  Sia| 21|         1| 15000|
|  Kia| 23|         2| 18000|
|Mishi| 34|        10| 38000|
+-----+---+----------+------+
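As a rough illustration of how `subset` narrows the null check, the plain-Python sketch below keeps a row when all of the named columns are non-null, which corresponds to `how="any"` restricted to a subset. The helper name `drop_nulls` and the dict-based rows are assumptions made for this sketch; it is not how Spark implements the operation.

```python
def drop_nulls(rows, subset):
    """Keep rows whose `subset` columns are all non-null; other columns are ignored."""
    return [row for row in rows if all(row[col] is not None for col in subset)]


rows = [
    {"Name": "Tia", "Age": 31, "Experience": 10, "Salary": 30000},
    {"Name": "Pia", "Age": None, "Experience": None, "Salary": 40000},
    {"Name": "Kishu", "Age": 36, "Experience": None, "Salary": None},
]

# Checking only Experience removes both Pia and Kishu.
print([row["Name"] for row in drop_nulls(rows, ["Experience"])])  # ['Tia']

# Checking only Age removes Pia but keeps Kishu, despite Kishu's other nulls.
print([row["Name"] for row in drop_nulls(rows, ["Age"])])  # ['Tia', 'Kishu']
```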



In [None]:
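# 3. Filling the missing values
# A string fill value only applies to string columns, so the integer columns
# Age and Experience are ignored and their NULLs remain in the output
df.na.fill('Missing Values', ['Age', 'Experience']).show()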

+-----+----+----------+------+
| Name| Age|Experience|Salary|
+-----+----+----------+------+
|  Tia|  31|        10| 30000|
|  Ria|  30|         8| 25000|
|  Mia|  29|         4| 20000|
|  Dia|  24|         3| 20000|
|  Sia|  21|         1| 15000|
|  Kia|  23|         2| 18000|
|  Pia|NULL|      NULL| 40000|
|Mishi|  34|        10| 38000|
|Kishu|  36|      NULL|  NULL|
+-----+----+----------+------+
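The NULLs survive in the output above because `na.fill` skips columns in `subset` whose datatype does not match the fill value: a string value is not written into integer columns. The sketch below models that matching rule in plain Python. It is a simplified illustration, not Spark's implementation: the helper `fill_nulls` and the `column_types` mapping are assumptions, and only exact `str`/`int` matches are modelled.

```python
def fill_nulls(rows, value, subset, column_types):
    """Replace None with `value`, but only in subset columns whose type matches the value."""
    targets = {col for col in subset if column_types[col] is type(value)}
    return [
        {col: (value if col in targets and cell is None else cell)
         for col, cell in row.items()}
        for row in rows
    ]


column_types = {"Name": str, "Age": int, "Experience": int, "Salary": int}
rows = [{"Name": "Pia", "Age": None, "Experience": None, "Salary": 40000}]

# A string value does not match the integer columns, so nothing changes.
unchanged = fill_nulls(rows, "Missing Values", ["Age", "Experience"], column_types)
print(unchanged)

# An integer value matches, so the nulls are replaced.
filled = fill_nulls(rows, 0, ["Age", "Experience"], column_types)
print(filled)
```

In the notebook itself, passing a numeric value such as `0` for these two columns would be the corresponding fix.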



In [None]:
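from pyspark.ml.feature import Imputer

# Impute nulls in the numeric columns with each column's mean;
# results go to new "<column>_imputed" columns, leaving the originals intact
numeric_cols = ['Age', 'Experience', 'Salary']
imputer = Imputer(
  inputCols=numeric_cols,
  outputCols=[f"{c}_imputed" for c in numeric_cols]
).setStrategy("mean")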

In [None]:
imputer.fit(df).transform(df).show()

+-----+----+----------+------+-----------+------------------+--------------+
| Name| Age|Experience|Salary|Age_imputed|Experience_imputed|Salary_imputed|
+-----+----+----------+------+-----------+------------------+--------------+
|  Tia|  31|        10| 30000|         31|                10|         30000|
|  Ria|  30|         8| 25000|         30|                 8|         25000|
|  Mia|  29|         4| 20000|         29|                 4|         20000|
|  Dia|  24|         3| 20000|         24|                 3|         20000|
|  Sia|  21|         1| 15000|         21|                 1|         15000|
|  Kia|  23|         2| 18000|         23|                 2|         18000|
|  Pia|NULL|      NULL| 40000|         28|                 5|         40000|
|Mishi|  34|        10| 38000|         34|                10|         38000|
|Kishu|  36|      NULL|  NULL|         36|                 5|         25750|
+-----+----+----------+------+-----------+------------------+--------------+
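The imputed values in the table can be checked by hand. The plain-Python sketch below recomputes each column's mean from the non-null values shown above. Truncating the mean to an integer with `int()` is an assumption inferred from the recorded output (28 for a mean of 28.5), not a statement about the Imputer's documented contract.

```python
# Non-null values of each column, copied from the demo table.
columns = {
    "Age": [31, 30, 29, 24, 21, 23, 34, 36],
    "Experience": [10, 8, 4, 3, 1, 2, 10],
    "Salary": [30000, 25000, 20000, 20000, 15000, 18000, 40000, 38000],
}

imputed = {}
for name, values in columns.items():
    column_mean = sum(values) / len(values)
    imputed[name] = int(column_mean)  # int() truncates toward zero
    print(f"{name}: mean={column_mean:.2f} -> imputed={imputed[name]}")

# Age: mean=28.50 -> imputed=28
# Experience: mean=5.43 -> imputed=5
# Salary: mean=25750.00 -> imputed=25750
```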

#### **4. PySpark Dataframes - Filter Operations**
* Filter Operation
* &, |, ==
* ~

In [None]:
from pyspark.sql import SparkSession

session_2 = SparkSession.builder.appName('session').getOrCreate()
session_2

In [None]:
data = spark.read.csv('/content/pyspark_demo.csv', header=True, inferSchema=True)

data.show()

+-----+----+----------+------+
| Name| Age|Experience|Salary|
+-----+----+----------+------+
|  Tia|  31|        10| 30000|
|  Ria|  30|         8| 25000|
|  Mia|  29|         4| 20000|
|  Dia|  24|         3| 20000|
|  Sia|  21|         1| 15000|
|  Kia|  23|         2| 18000|
|  Pia|NULL|      NULL| 40000|
|Mishi|  34|        10| 38000|
|Kishu|  36|      NULL|  NULL|
+-----+----+----------+------+



In [None]:
# Filter all the Names, Age of people with Salary >= 25,000
df.filter('Salary>=25000').select(['Name', 'Age']).show()

+-----+----+
| Name| Age|
+-----+----+
|  Tia|  31|
|  Ria|  30|
|  Pia|NULL|
|Mishi|  34|
+-----+----+



In [None]:
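# Salary between 2,500 and 38,000 inclusive; wrap each condition in parentheses when combining with &
df.filter((df['Salary']>=2500) & (df['Salary']<=38000)).select(['Name', 'Age']).show()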

+-----+---+
| Name|Age|
+-----+---+
|  Tia| 31|
|  Ria| 30|
|  Mia| 29|
|  Dia| 24|
|  Sia| 21|
|  Kia| 23|
|Mishi| 34|
+-----+---+



In [None]:
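# ~ negates the condition: Salary below 25,000 (a NULL Salary matches neither the condition nor its negation)
df.filter(~(df['Salary']>=25000)).select(['Name', 'Age']).show()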

+----+---+
|Name|Age|
+----+---+
| Mia| 29|
| Dia| 24|
| Sia| 21|
| Kia| 23|
+----+---+
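Kishu, whose Salary is NULL, appears in neither the `Salary>=25000` result nor its negation. This follows from SQL-style three-valued logic: a comparison with NULL is unknown, negating unknown is still unknown, and a filter keeps only rows where the condition is true. The plain-Python sketch below models this with `None` as NULL; the helper names `sql_ge` and `sql_not` are made up for illustration.

```python
def sql_ge(value, bound):
    """SQL-style >=: the result is None (unknown) when the value is NULL."""
    return None if value is None else value >= bound


def sql_not(condition):
    """SQL-style NOT: negating unknown is still unknown."""
    return None if condition is None else not condition


salaries = {
    "Tia": 30000, "Ria": 25000, "Mia": 20000, "Dia": 20000, "Sia": 15000,
    "Kia": 18000, "Pia": 40000, "Mishi": 38000, "Kishu": None,
}

# A filter keeps a row only when the condition evaluates to exactly True.
at_least = [name for name, pay in salaries.items() if sql_ge(pay, 25000) is True]
below = [name for name, pay in salaries.items() if sql_not(sql_ge(pay, 25000)) is True]

print(at_least)  # ['Tia', 'Ria', 'Pia', 'Mishi']
print(below)     # ['Mia', 'Dia', 'Sia', 'Kia']
```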



#### **5. PySpark GroupBy and Aggregate Functions**


In [None]:
from google.colab import files

# Upload pyspark_demo2.csv file from /Downloads
uploaded = files.upload()

if 'pyspark_demo2.csv' in uploaded:
  print("pyspark_demo2.csv uploaded successfully!")
else:
  print("Upload failed. Please try again.")

Saving pyspark_demo2.csv to pyspark_demo2.csv
pyspark_demo2.csv uploaded successfully!


In [None]:
from pyspark.sql import SparkSession

session = SparkSession.builder.appName('Agg').getOrCreate()
session

In [None]:
df = spark.read.csv('/content/pyspark_demo2.csv', header=True, inferSchema=True)
df.show()

+---------+------------+------+
|     Name| Departments|Salary|
+---------+------------+------+
|    Krish|Data Science| 10000|
|    Krish|         IOT|  5000|
|   Mahesh|    Big Data|  4000|
|    Krish|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|Sudhanshu|Data Science| 20000|
|Sudhanshu|         IOT| 10000|
|Sudhanshu|    Big Data|  5000|
|    Sunny|Data Science| 10000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+



In [None]:
# Display the total salary of each of the employees
df.groupBy('Name').sum('Salary').show()

+---------+-----------+
|     Name|sum(Salary)|
+---------+-----------+
|Sudhanshu|      35000|
|    Sunny|      12000|
|    Krish|      19000|
|   Mahesh|       7000|
+---------+-----------+



In [None]:
# Average salary in each department
df.groupBy('Departments').mean('Salary').show()

+------------+-----------+
| Departments|avg(Salary)|
+------------+-----------+
|         IOT|     7500.0|
|    Big Data|     3750.0|
|Data Science|    10750.0|
+------------+-----------+



In [None]:
# Number of employees working in each department
df.groupby('Departments').count().show()

+------------+-----+
| Departments|count|
+------------+-----+
|         IOT|    2|
|    Big Data|    4|
|Data Science|    4|
+------------+-----+



In [None]:
# Mean and total of the salaries across all rows
df.agg({'Salary': 'mean'}).show(), df.agg({'Salary': 'sum'}).show()

+-----------+
|avg(Salary)|
+-----------+
|     7300.0|
+-----------+

+-----------+
|sum(Salary)|
+-----------+
|      73000|
+-----------+



(None, None)

In [None]:
# Get the minimum salary of each employee
df.groupBy('Name').min('Salary').show()

+---------+-----------+
|     Name|min(Salary)|
+---------+-----------+
|Sudhanshu|       5000|
|    Sunny|       2000|
|    Krish|       4000|
|   Mahesh|       3000|
+---------+-----------+
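Conceptually, `groupBy` collects the rows that share a key and the aggregate function then reduces each group to a single value. The plain-Python sketch below reproduces the figures from the tables above using a dictionary of lists. It only models the idea: the helper `group_salaries` is an assumption for illustration and says nothing about how Spark distributes the work.

```python
from collections import defaultdict

# (Name, Department, Salary) rows copied from pyspark_demo2.csv as displayed above.
rows = [
    ("Krish", "Data Science", 10000), ("Krish", "IOT", 5000),
    ("Mahesh", "Big Data", 4000), ("Krish", "Big Data", 4000),
    ("Mahesh", "Data Science", 3000), ("Sudhanshu", "Data Science", 20000),
    ("Sudhanshu", "IOT", 10000), ("Sudhanshu", "Big Data", 5000),
    ("Sunny", "Data Science", 10000), ("Sunny", "Big Data", 2000),
]


def group_salaries(rows, key_index):
    """Collect the salaries of all rows that share the value at key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return dict(groups)


by_name = group_salaries(rows, 0)
by_department = group_salaries(rows, 1)

total_by_name = {name: sum(pay) for name, pay in by_name.items()}
min_by_name = {name: min(pay) for name, pay in by_name.items()}
mean_by_department = {dept: sum(pay) / len(pay) for dept, pay in by_department.items()}
count_by_department = {dept: len(pay) for dept, pay in by_department.items()}

print(total_by_name)
print(mean_by_department)
```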

