In [1]:
import findspark
findspark.init()

The findspark library initializes and configures Spark within a Python environment. This is typically done when you want to run Apache Spark in local mode on your own machine or in a development environment.

**import findspark:** This imports the findspark Python library, which is a lightweight package that makes it easier to locate Spark within the system and set the necessary environment variables.

**findspark.init():** This call initializes the Spark environment. It locates the Spark installation on your machine, sets environment variables such as SPARK_HOME and PYSPARK_PYTHON, and adds PySpark to sys.path so that it can be imported from your Python environment.

**Note:**
For this to work, Spark must be installed on your machine, and the SPARK_HOME environment variable should point to the installation if findspark cannot discover it on its own. findspark is not required in a Spark cluster environment (such as Databricks) where Spark is already configured; its primary use is local development setups.

In [2]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.config("spark.driver.host", "localhost").getOrCreate()

#to build session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('rdddf').master('local').getOrCreate()

The configuration option spark.driver.host sets the hostname or IP address of the Spark driver. The driver is the main control program of a Spark application; it coordinates the execution of tasks on the worker nodes, and the executors use this address to reach it.

spark.driver.host: The Spark configuration property that holds the driver's hostname or IP address. The address the driver binds its listening sockets to is controlled separately by spark.driver.bindAddress, which defaults to the value of spark.driver.host.

"localhost": The value assigned to spark.driver.host. It makes the driver advertise itself as localhost, which suits local development or running Spark on a single machine, where the driver and the executors share the same host.



In [10]:
df= spark.read.format('csv').options(inferSchema=True, header= True, sep=',').load("data.csv")

Reader properties can be set one at a time with chained option() calls, or all together with a single options() call.\
e.g.

df = spark.read.format('csv').option("inferSchema", True).option("header", True).option("sep", ',').load("data.csv")\
or\
df = spark.read.format('csv').options(inferSchema=True, header=True, sep=',').load("data.csv")

In [13]:
display(df.show(5))  # show() prints the rows and returns None, so display() then prints None
print(df.count())

+------------+-------------------+-------------------+--------------------+-------------------------+------------------------+-----------------------------+-----------------------------+--------------------+--------------------+-------+
|collision_id|         crash_date|         crash_time|      on_street_name|number_of_persons_injured|number_of_persons_killed|contributing_factor_vehicle_1|contributing_factor_vehicle_2|  vehicle_type_code1|  vehicle_type_code2|borough|
+------------+-------------------+-------------------+--------------------+-------------------------+------------------------+-----------------------------+-----------------------------+--------------------+--------------------+-------+
|     4456867|2021-09-02 00:00:00|2023-11-30 19:56:00|MAJOR DEEGAN EXPR...|                        0|                       0|                  Unspecified|                         null|               Sedan|                null|   null|
|     4456988|2021-09-11 00:00:00|2023-11-30 15:45:0

None

1849


**To define the schema manually**

Example - 

-- The schema is defined first and then passed to the reader.\
-- When the schema is defined manually, inferSchema is not needed.

>from pyspark.sql.types import StructType,StructField,IntegerType,StringType
>
>
>schema_define= StructType([\
>StructField("id",IntegerType(),True),\
>StructField("name",StringType(),True),\
>StructField("roll_number",IntegerType(),True)\
>])

In StructField(column_name, data_type, nullable), the third argument sets nullability:\
True: Column allows null values.\
False: Column does not allow null values.


**df = spark.read.format('csv').options(header=True, sep=',').schema(schema_define).load("path/file.csv")**


#### **To check the schema of the dataframe use data_frame.printSchema()**

In [14]:
df.printSchema()

root
 |-- collision_id: integer (nullable = true)
 |-- crash_date: timestamp (nullable = true)
 |-- crash_time: timestamp (nullable = true)
 |-- on_street_name: string (nullable = true)
 |-- number_of_persons_injured: integer (nullable = true)
 |-- number_of_persons_killed: integer (nullable = true)
 |-- contributing_factor_vehicle_1: string (nullable = true)
 |-- contributing_factor_vehicle_2: string (nullable = true)
 |-- vehicle_type_code1: string (nullable = true)
 |-- vehicle_type_code2: string (nullable = true)
 |-- borough: string (nullable = true)



### **To read multiple csv files**

When you are reading multiple files into a DataFrame in PySpark, it is generally advisable that the files have the same schema, including the same number of columns and corresponding data types. This ensures consistency in your DataFrame and prevents issues that may arise from inconsistent or mismatched schemas.

Example -\
**df=spark.read.format('csv').options(header=True,sep=',').load(["path1/file1.csv","path2/file2.csv"])**

### **Read all files within a particular folder**
**df= spark.read.format('csv').options(inferSchema=True,header=True, sep=',').load("path/folder")**\
Spark reads every file directly inside the folder. By default, it does not traverse nested directories.

**df = spark.read.format('csv').options(inferSchema=True, header=True, sep=',',recursiveFileLookup =True).load("path/folder")**\
By adding recursiveFileLookup=True, you are telling PySpark to recursively search for files in subdirectories of the specified path.

Note:\
Keep in mind that while recursiveFileLookup allows for recursive directory discovery, it assumes that the data in those nested folders has the same schema as the top-level folder. If the nested folders have different schemas, you might need to handle them separately or implement a more dynamic schema resolution strategy based on your specific use case.


### **Pyspark filter condition**

filter()
1. **single and multiple conditions** - (df.id > 4) & (df.col2 > 32)
   1. wrap each condition in parentheses and combine them with & (and), | (or), and ~ (not).
2. **starts with** - df.column.startswith('char')
3. **ends with** - df.column.endswith('char')
4. **contains** - df.column.contains('string')
   1. contains - checks whether the string is present in the column value.
5. **like** - df.column.like('%a')
6. **null** value - df.cal.isNull() - returns the records whose column 'cal' is null.
7. **not null** - df.cal.isNotNull() - returns the records whose column 'cal' is not null.
8. **isin** - df.column.isin(char1,char2) - True if the column value equals char1 or char2.
   1. for a "not in" condition put ~ at the start. **~df.column.isin(char1,char2)** -> **~** negates the result, so this is False when the value is in the list.
9.  **operator** - df.column != 30  -> ==, !=, >, <, >=, <=



### **To create dataframe manually**


>from pyspark.sql.types import StructType,StructField,IntegerType,StringType\
>
>--data to be inserted in dataframe\
>employee_data= [(1,"a",34),(2,"b",54),(3,"c",22),(4,"d",81),(5,"e",75)]\
>
>--to define schema\
>employee_schema = StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),True),StructField("number",IntegerType(),False)])\
>\
>**df=spark.createDataFrame(data=employee_data,schema=employee_schema)**

<hr>

### **To add, drop, rename column in dataframe**

1.  **To add a column to the result alongside the existing DataFrame columns** - withColumn()\
    If the column is already present in the DataFrame, withColumn replaces it; a new column is created only if no column with that name exists.


In [19]:
from pyspark.sql.functions import to_date, col, date_format

df=df.withColumn("date",to_date(col("crash_date")))
df.show(5)

+------------+-------------------+-------------------+--------------------+-------------------------+------------------------+-----------------------------+-----------------------------+--------------------+--------------------+-------+----------+
|collision_id|         crash_date|         crash_time|      on_street_name|number_of_persons_injured|number_of_persons_killed|contributing_factor_vehicle_1|contributing_factor_vehicle_2|  vehicle_type_code1|  vehicle_type_code2|borough|      date|
+------------+-------------------+-------------------+--------------------+-------------------------+------------------------+-----------------------------+-----------------------------+--------------------+--------------------+-------+----------+
|     4456867|2021-09-02 00:00:00|2023-11-30 19:56:00|MAJOR DEEGAN EXPR...|                        0|                       0|                  Unspecified|                         null|               Sedan|                null|   null|2021-09-02|
|     44

withColumn("date", to_date(col("crash_date")))\
Adds a new column named "date" by applying the to_date function to convert the crash_date timestamp column to a date.

Several columns can be added by chaining multiple withColumn calls.

**df.select("col1","col2")**\
To get specific columns from dataframe

2.   **Add new column**  

**With a constant value, using the lit function** - lit()

In [31]:
from pyspark.sql.functions import lit
df.withColumn("new_column", lit("ABD")).show(5)


+------------+-------------------+-----------+----------+
|collision_id|         crash_date|date_column|new_column|
+------------+-------------------+-----------+----------+
|     4456867|2021-09-02 00:00:00| 2021/09/02|       ABD|
|     4456988|2021-09-11 00:00:00| 2021/09/11|       ABD|
|     4456859|2021-09-07 00:00:00| 2021/09/07|       ABD|
|     4456663|2021-06-25 00:00:00| 2021/06/25|       ABD|
|     4456624|2021-07-08 00:00:00| 2021/07/08|       ABD|
+------------+-------------------+-----------+----------+
only showing top 5 rows



**By calculation**

Here the new column holds the remainder of the collision_id value divided by 100.

In [32]:
df.withColumn("remainder",df.collision_id%100).show(5)

+------------+-------------------+-----------+---------+
|collision_id|         crash_date|date_column|remainder|
+------------+-------------------+-----------+---------+
|     4456867|2021-09-02 00:00:00| 2021/09/02|       67|
|     4456988|2021-09-11 00:00:00| 2021/09/11|       88|
|     4456859|2021-09-07 00:00:00| 2021/09/07|       59|
|     4456663|2021-06-25 00:00:00| 2021/06/25|       63|
|     4456624|2021-07-08 00:00:00| 2021/07/08|       24|
+------------+-------------------+-----------+---------+
only showing top 5 rows



**By concatenating two column values**

In [34]:
from pyspark.sql.functions import concat, col, lit
df = df.select("collision_id","crash_date").withColumn("id", concat(lit("id "),col("collision_id").cast("string")))
df.show(5)

+------------+-------------------+----------+
|collision_id|         crash_date|        id|
+------------+-------------------+----------+
|     4456867|2021-09-02 00:00:00|id 4456867|
|     4456988|2021-09-11 00:00:00|id 4456988|
|     4456859|2021-09-07 00:00:00|id 4456859|
|     4456663|2021-06-25 00:00:00|id 4456663|
|     4456624|2021-07-08 00:00:00|id 4456624|
+------------+-------------------+----------+
only showing top 5 rows



**cast() -** Converts a column from one data type to another - col(col_name).cast(datatype)

**col() -** In PySpark, the col function references a column of a DataFrame by name. It is a convenient way to refer to a column when performing operations or transformations on the data.

**Without col()**, the column has to be referenced through the DataFrame - df.col_name

**lit() -** In PySpark, the lit function (short for "literal") creates a column with a constant value. It is often used to add a new column to a DataFrame where all the values are the same.

3.  **To get the date in different format** - date_format()

In [24]:
from pyspark.sql.functions import date_format
df = df.select("collision_id","crash_date").withColumn("date_column", date_format(col("crash_date"), "yyyy/MM/dd"))
df.show(5)

+------------+-------------------+-----------+
|collision_id|         crash_date|date_column|
+------------+-------------------+-----------+
|     4456867|2021-09-02 00:00:00| 2021/09/02|
|     4456988|2021-09-11 00:00:00| 2021/09/11|
|     4456859|2021-09-07 00:00:00| 2021/09/07|
|     4456663|2021-06-25 00:00:00| 2021/06/25|
|     4456624|2021-07-08 00:00:00| 2021/07/08|
+------------+-------------------+-----------+
only showing top 5 rows



4.  **To rename the column** - withColumnRenamed()

withColumnRenamed returns a new DataFrame in which the column is renamed; the original DataFrame is not changed.\
df.withColumnRenamed("old_col","new_col")

In [25]:
df.withColumnRenamed("date_column","date").show(5)

# date_column name is changed to date.

+------------+-------------------+----------+
|collision_id|         crash_date|      date|
+------------+-------------------+----------+
|     4456867|2021-09-02 00:00:00|2021/09/02|
|     4456988|2021-09-11 00:00:00|2021/09/11|
|     4456859|2021-09-07 00:00:00|2021/09/07|
|     4456663|2021-06-25 00:00:00|2021/06/25|
|     4456624|2021-07-08 00:00:00|2021/07/08|
+------------+-------------------+----------+
only showing top 5 rows



5.  **To drop the column** - drop()

Drops the column from the DataFrame. Like withColumnRenamed, drop() returns a new DataFrame and leaves the original unchanged.
>df.drop("col_name")

To drop multiple columns, chain several drop() calls, e.g. drop().drop(), or unpack a list of column names:
>drop_col =["col1","col2"]\
>df.drop(*drop_col)

-- *drop_col unpacks the list into separate arguments




In [26]:
df.drop("date")

DataFrame[collision_id: int, crash_date: timestamp, date_column: string]

In [29]:
df.show(5)

+------------+-------------------+-----------+
|collision_id|         crash_date|date_column|
+------------+-------------------+-----------+
|     4456867|2021-09-02 00:00:00| 2021/09/02|
|     4456988|2021-09-11 00:00:00| 2021/09/11|
|     4456859|2021-09-07 00:00:00| 2021/09/07|
|     4456663|2021-06-25 00:00:00| 2021/06/25|
|     4456624|2021-07-08 00:00:00| 2021/07/08|
+------------+-------------------+-----------+
only showing top 5 rows



### **Joins in pyspark**

syntax
> &emsp;df1.join(df2, on_condition, how = "joining_type")

If how is not specified, the join is an inner join by default.

> df1.join(df2, df1.id == df2.user_id)

df1 - first DataFrame, id - column of the first DataFrame\
df2 - second DataFrame, user_id - column of the second DataFrame

df1.id == df2.user_id - the on condition that connects the two DataFrames.



**Types of join (joining_type)**
* **inner join (inner)** - returns the records that fulfill the on condition.
* **full join (full or full_outer)** - inner join + unmatched records from both the left and the right DataFrame
   * All rows from both DataFrames are included in the result. Where a row has no match, the columns from the other DataFrame contain null values.
* **left outer join (left or left_outer)** - inner join + records from the left DataFrame that do not match the condition
   * All rows from the left DataFrame (df1) are included in the result. If there is no match in the right DataFrame (df2), the result will contain null values for the columns from df2.
* **right outer join (right or right_outer)** - inner join + records from the right DataFrame that do not match the condition
  * All rows from the right DataFrame (df2) are included in the result. If there is no match in the left DataFrame (df1), the result will contain null values for the columns from df1.
* **left semi join (left_semi)** - Returns only the rows from the left DataFrame (df1) where there is at least one match in the right DataFrame (df2). Columns from the right DataFrame are not included in the result.
  * similar to an inner join, but the result contains only the columns of the left DataFrame.
* **left anti join (left_anti)** - Returns only the rows from the left DataFrame (df1) where there is no match in the right DataFrame (df2). Columns from the right DataFrame are not included in the result.
  * the result contains the rows of the left DataFrame that do not satisfy the condition.