## Spark - a distributed data processing (cluster computing) system often used for large-scale data processing
### PySpark = Spark API + Python

# Topics:
- Loading Spark Session
- Reading Dataset
- Checking data types of columns
- Data Preprocessing - Adding columns, Dropping columns, Renaming columns

Installing PySpark API

In [1]:
!pip install pyspark



In [2]:
import pyspark

In [3]:
import pandas as pd
pd.read_csv('C:/Users/ASUS/OneDrive/Desktop/DatasetR/Spark_test.csv')

  Name  Age
0    A   21
1    B   34
2    C   50
3    D   31
4    E   25
5    F   29
6    G   26
7    H   31


In [4]:
type(pd.read_csv('C:/Users/ASUS/OneDrive/Desktop/DatasetR/Spark_test.csv'))

pandas.core.frame.DataFrame

We have to create a Spark session in order to work with Spark.

In [5]:
from pyspark.sql import SparkSession

In [6]:
spark = SparkSession.builder.appName('Practise').getOrCreate()

In [7]:
spark

When we run a session locally, Spark runs in local mode on a single machine, so there is effectively only one node. In a cloud environment, we can create clusters with multiple nodes and instances.

Reading dataset with Spark

In [8]:
#reading dataset using spark
df_pyspark = spark.read.csv('C:/Users/ASUS/OneDrive/Desktop/DatasetR/Spark_test.csv')

In [9]:
df_pyspark

DataFrame[_c0: string, _c1: string]

In [10]:
df_pyspark.show()

+----+---+
| _c0|_c1|
+----+---+
|Name|Age|
|   A| 21|
|   B| 34|
|   C| 50|
|   D| 31|
|   E| 25|
|   F| 29|
|   G| 26|
|   H| 31|
+----+---+



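Without any options, Spark treats the header row as data and names the columns _c0 and _c1. We want Name and Age as the column headers, so we set the header option.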

In [11]:
df_pyspark = spark.read.option('header','true').csv('C:/Users/ASUS/OneDrive/Desktop/DatasetR/Spark_test.csv')

In [12]:
df_pyspark.show()

+----+---+
|Name|Age|
+----+---+
|   A| 21|
|   B| 34|
|   C| 50|
|   D| 31|
|   E| 25|
|   F| 29|
|   G| 26|
|   H| 31|
+----+---+



In [13]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [14]:
df_pyspark.head(3)

[Row(Name='A', Age='21'), Row(Name='B', Age='34'), Row(Name='C', Age='50')]

In [15]:
#Checking schema
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)



Age is read as a string by default, which is wrong for a numeric column. We have to enable inferSchema so that Spark detects the column types from the data.

In [16]:
df_pyspark_1 = spark.read.option('header','true').csv('C:/Users/ASUS/OneDrive/Desktop/DatasetR/Spark_test.csv',inferSchema=True)

In [17]:
df_pyspark_1.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)



In [18]:
#More concise: pass header and inferSchema directly as keyword arguments
df_pyspark_1 = spark.read.csv('C:/Users/ASUS/OneDrive/Desktop/DatasetR/Spark_test.csv',header=True,inferSchema=True)
df_pyspark_1.show()

+----+---+
|Name|Age|
+----+---+
|   A| 21|
|   B| 34|
|   C| 50|
|   D| 31|
|   E| 25|
|   F| 29|
|   G| 26|
|   H| 31|
+----+---+



In [19]:
#Get the column names
df_pyspark_1.columns

['Name', 'Age']

In [20]:
#Pick only Name column
df_pyspark_1.select('Name').show()

+----+
|Name|
+----+
|   A|
|   B|
|   C|
|   D|
|   E|
|   F|
|   G|
|   H|
+----+



In [21]:
#Pick multiple columns
df_pyspark_1.select(['Name','Age']).show()

+----+---+
|Name|Age|
+----+---+
|   A| 21|
|   B| 34|
|   C| 50|
|   D| 31|
|   E| 25|
|   F| 29|
|   G| 26|
|   H| 31|
+----+---+



In [24]:
df_pyspark_1.dtypes

[('Name', 'string'), ('Age', 'int')]

In [27]:
#Summary statistics: count, mean, stddev, min and max (describe() does not report the median)
df_pyspark_1.describe().show()

+-------+----+-----------------+
|summary|Name|              Age|
+-------+----+-----------------+
|  count|   8|                8|
|   mean|null|           30.875|
| stddev|null|8.741322227541684|
|    min|   A|               21|
|    max|   H|               50|
+-------+----+-----------------+



In [32]:
#Adding new column to dataframe
df_pyspark_2 = df_pyspark_1.withColumn('Age After 2 Years',df_pyspark_1['Age']+2)
df_pyspark_2.show()

+----+---+-----------------+
|Name|Age|Age After 2 Years|
+----+---+-----------------+
|   A| 21|               23|
|   B| 34|               36|
|   C| 50|               52|
|   D| 31|               33|
|   E| 25|               27|
|   F| 29|               31|
|   G| 26|               28|
|   H| 31|               33|
+----+---+-----------------+



In [35]:
#Dropping the column added above
df_pyspark_3 = df_pyspark_2.drop('Age After 2 Years')
df_pyspark_3.show()

+----+---+
|Name|Age|
+----+---+
|   A| 21|
|   B| 34|
|   C| 50|
|   D| 31|
|   E| 25|
|   F| 29|
|   G| 26|
|   H| 31|
+----+---+



In [36]:
#Renaming Column
df_pyspark_4 = df_pyspark_3.withColumnRenamed('Name','New Name')
df_pyspark_4.show()

+--------+---+
|New Name|Age|
+--------+---+
|       A| 21|
|       B| 34|
|       C| 50|
|       D| 31|
|       E| 25|
|       F| 29|
|       G| 26|
|       H| 31|
+--------+---+

