# <div style="font-family: Trebuchet MS; background-color: #58D68D; color: #000000; padding: 12px; line-height: 1.5; font-size:"> Introduction 🎻</div>

### <div style="font-family: Trebuchet MS; background-color: #F4D03F; color: #000000; padding: 12px; line-height: 1.5;"> Hey Kagglers!! Today I am going to share a simple tool that you can leverage to speed up the big data processing in your own projects. Whether you are a fresher or an experienced practitioner, I believe it is important to get a basic understanding of the Spark ecosystem, as many data-centric companies continue to adopt this technology.<br><br> In this notebook, I have tried to compile all the basic functionalities to get you started with Spark effortlessly.</div>

<div style="font-family: Trebuchet MS; background-color: #F5B041; color: #000000; padding: 12px; line-height: 1;"><h3> Some basic guidelines that I have followed to make this notebook look interactive:</h3><h4><ul style="list-style-type:square"><li>Whenever there is a definition, I have highlighted it with a  <span style="background-color: #2E31FD;font-size: 25px">📣</span></li><br><li>Whenever there is a new function/method, I have highlighted it with a <span style="background-color: #2E31FD;font-size: 25px">🌼</span></li><br><li>Whenever there is a suggestion from my side, I have highlighted it with a <span style="background-color: #2E31FD;font-size: 25px">📌</span></li></ul></h4></div> 

### So what are you waiting for! Let's get started with the basics:

## <div style="padding: 12px"><span style="background-color: #2E31FD;font-size: 35px">📣</span> What is Apache Spark in technical terms?</div>

- Apache Spark is an open-source, distributed data processing and analytics framework designed for large-scale data processing tasks. 

- It provides a unified and flexible platform for performing various data processing operations, including batch processing, interactive queries, real-time stream processing, machine learning, and graph processing.

## <div style="padding: 12px"><span style="background-color: #2E31FD;font-size: 35px">📣</span> What is this Apache Spark with a simple analogy? </div>

- Apache Spark is like a supercharged engine for processing and analyzing really big piles of data. Imagine you have a massive amount of information, like a gigantic puzzle with millions of pieces. Trying to solve this puzzle on a single computer could take forever. But Spark lets you use many computers at once, like a team of puzzle solvers, to work on different parts of the puzzle together.

- These "puzzle solvers" (computers) can talk to each other and share their findings, making the work faster and more efficient. Spark also keeps everything organized and makes sure that even if one of the "puzzle solvers" takes a break or has a problem, the others can still continue working without losing progress.

- In simple words, Apache Spark helps you process huge amounts of data much faster by getting a bunch of computers to work together and collaborate on the job. It's like a team effort that makes solving big data problems much easier and quicker!
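
The puzzle-solver analogy above boils down to a split, process, combine pattern. Here is a toy sketch of that pattern in plain Python. It is not Spark and not how Spark is implemented: the helper names, the four "workers" and the use of threads are all assumptions made for illustration, and Python threads do not give true parallel speed-ups for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks roughly equal, contiguous pieces."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def partial_sum(chunk):
    """Work done by one 'puzzle solver' on its own piece of the data."""
    return sum(chunk)

data = list(range(1, 101))            # stand-in for a large dataset
chunks = split_into_chunks(data, 4)   # four hypothetical workers

# Each worker handles its own chunk independently
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)                 # combine the partial results
print(partials, total)
```

Spark applies the same idea across machines rather than threads, and also takes care of moving data between workers and recovering from failures.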

## <div style="padding: 12px"><span style="background-color: #2E31FD;font-size: 35px">📣</span> What is PySpark?</div>

- PySpark is the Python API for Spark. Its DataFrame interface will feel familiar if you have used Pandas.

- In simple words, PySpark is a special tool that combines the power of many computers with the simplicity of Python to help you handle really big piles of data without breaking a sweat!

## <div style="padding: 12px"><span style="background-color: #2E31FD;font-size: 35px">📣</span> Benefits of using PySpark over Pandas for Data Processing:</div>

#### 1. Scalability and Distributed Computing:

- PySpark is designed for processing large-scale data across clusters of machines. It can handle data sizes that may not fit in memory, as it utilizes distributed computing.
- Pandas, on the other hand, is designed for single-machine data processing and may struggle with extremely large datasets that exceed available memory.

#### 2. Performance:

- PySpark's in-memory processing and distributed computing can lead to better performance for certain operations on large datasets compared to pandas.
- While pandas is fast for single-machine operations, PySpark's parallel processing can provide significant performance gains for operations that can be parallelized across multiple nodes.

# <div style="font-family: Trebuchet MS; background-color: #B0E0E6; color: #000000; padding: 12px; line-height: 1.5;"> Importing Libraries 📚</div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import re
import os

## Suppressing warnings:
import warnings
warnings.filterwarnings("ignore")

In [None]:
!pip install pyspark

In [None]:
## importing essential spark libraries:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, count, when, regexp_replace, isnan, udf
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, FloatType

# <div style="font-family: Trebuchet MS; background-color: #B0E0E6; color: #000000; padding: 12px; line-height: 1.5;"> Getting Started with the Analysis 🔬</div>


#### The first step in your Spark adventure is to create a Spark session, the entry point to the Spark ecosystem. Once you have it, you can freely create and manipulate Spark RDDs and DataFrames (the typed Dataset API is available only in Scala and Java).

## <span style="background-color: #2E31FD;font-size: 35px">📣</span> What is an RDD?

You might be wondering what this new term is. Well, RDD stands for **Resilient Distributed Dataset**. It is the fundamental data structure of Spark.

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> SparkSession.builder

#### A SparkSession is created with the builder pattern, starting from SparkSession.builder (in PySpark this is an attribute, not a method call):

In [None]:
##  Creating a Spark session:
spark = SparkSession.builder.appName('Sample').getOrCreate()

In [None]:
## Quick glance at the object
spark

##### Here, the spark object acts as the gateway to the Spark ecosystem. 

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> read.csv(), show()

##### Next, to read the CSV data, we use **read.csv**:

In [None]:
df=spark.read.csv("/kaggle/input/food-delivery-dataset/train.csv",
                  header=True,
                  inferSchema=True)
#  Parameters:
## - inferSchema=True makes Spark scan the data to infer each column's type. If False, every
##     column is read as a string.
## - header=True tells Spark that the first row of the file contains the column names.

## Displaying the first 5 rows:
df.show(5)

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> toPandas(), head()

In [None]:
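## To convert a Spark dataframe into a pandas dataframe.
## Note: toPandas() collects the whole dataframe onto the driver, so use it only when the data fits in memory.
df.toPandas().head()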

#### As you can see above, Time_taken(min) is the target variable.

#### Now we have read the CSV file into Spark. Let's check what kind of object we got:

In [None]:
## Viewing the type
type(df)

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> printSchema()

#### Printing the schema of the dataframe

In [None]:
## Printing the attributes of the table:
df.printSchema()

In [None]:
## Returning the first 5 rows as a list of Row objects (column-value pairs)
df.head(5)

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> describe(), summary()

In [None]:
## Basic statistics of the data (df.summary().show() additionally reports the 25%, 50% and 75% percentiles):
df.describe().show()

#### NOTE: describe() gives a statistical summary of the dataframe, but it also includes the string columns. For those, mean and stddev are null, and min/max follow string ordering.

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> count(), columns

In [None]:
## Shape of the dataframe is:
df.count(),len(df.columns)

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> col(), isNull()

In [None]:
## Checking for null values:
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

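#### Looks like there are no null values. Keep in mind that isNull() only detects genuine nulls; missing values stored as placeholder strings (for example the text "NaN") would not be counted here.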

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> dtypes

In [None]:
## Checking the dtypes:
df.dtypes

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> select()

In [None]:
## To view a few selected columns:
df.select(["ID","Delivery_person_ID"]).show()

In [None]:
df.printSchema()

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> cast()

#### A column can hold various datatypes: integer, string, double, float, timestamp, etc.

#### To convert a column into:

1. double ---> use DoubleType()

2. int    ---> use IntegerType()

3. float  ---> use FloatType()

4. string ---> use StringType()

5. long   ---> use LongType()

#### Each of these is passed to the cast() method, and each type must first be imported from pyspark.sql.types.

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> withColumn()

#### In PySpark, withColumn() is a widely used DataFrame **transformation**. It returns a new DataFrame and can be used to:

- change the values of an existing column,

- convert the datatype of an existing column,

- create a new column, etc.

In [None]:
## Correcting the datatypes of some columns: Delivery_person_Age, Vehicle_condition, multiple_deliveries
df=df.withColumn('Delivery_person_Age',col('Delivery_person_Age').cast(IntegerType()))\
.withColumn('Vehicle_condition',col('Vehicle_condition').cast(IntegerType()))\
.withColumn('multiple_deliveries',col('multiple_deliveries').cast(IntegerType()))

In [None]:
## Checking after conversion:
df.dtypes

In [None]:
df.select(['Delivery_person_Age','Vehicle_condition','multiple_deliveries']).dtypes

In [None]:
## To display the PySpark dataframe as a pandas dataframe:
df.toPandas().head()

In [None]:
## Checking the numeric columns:
def get_num_cols(dataframe):
    ## df.dtypes is a list of (column name, type string) pairs
    return [name for name, dtype in dataframe.dtypes if dtype in ('double', 'int')]

num_cols = get_num_cols(df)  ### list of numeric columns (named differently from the function so it is not overwritten)

df.describe(num_cols).show()

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> distinct()

In [None]:
### There are 1320 unique IDs
df.select('Delivery_person_ID').distinct().count()  

### <span style="background-color: #2E31FD;font-size: 35px">🌼</span> orderBy()

In [None]:
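### Distinct delivery person IDs (show() prints only the first 20 rows by default):
df.select('Delivery_person_ID').distinct().show()
### Number of rows per delivery person ID, in ascending order of count:
df.groupBy('Delivery_person_ID').count().orderBy('count').show()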

<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
        <p style="padding: 10px;
              color:white;">
            Feature Engineering Overview
        </p>
    </div></h2>

As observed from the above dataset, we can extract the following:

1. City from Delivery_person_ID ----> city

2. Bucket cities into Zones - North, South, East, West  ----> city_zone

3. Time taken to pick up delivery using Time_Orderd and Time_Order_picked ----> pickup_time

4. Time of the day - Morning, Lunch, Evening, Night, Midnight ----> day_zone

5. To clean up target variable - Time_taken(min)

6. Bucket Age - Delivery_person_Age ----> life_stage

7. Features using Latitude and Longitude ----> geodesic
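
Items 4 and 6 in the list above are simple bucketing rules. As a rough sketch, they could be written as plain Python functions like the ones below. The function names, cut-off hours, age boundaries and bucket labels are all assumptions chosen for illustration, not values taken from this dataset.

```python
def day_zone(hour):
    """Map an hour of day (0-23) to a coarse time-of-day label.

    The cut-off hours are illustrative assumptions.
    """
    if hour is None or not 0 <= hour <= 23:
        return None
    if 6 <= hour < 11:
        return "Morning"
    if 11 <= hour < 15:
        return "Lunch"
    if 15 <= hour < 20:
        return "Evening"
    if 20 <= hour < 24:
        return "Night"
    return "Midnight"  # hours 0-5

def life_stage(age):
    """Bucket a delivery person's age; boundaries and labels are assumed."""
    if age is None:
        return None
    if age < 25:
        return "under_25"
    if age < 35:
        return "25_to_34"
    return "35_plus"

print(day_zone(12), life_stage(29))
```

Functions like these can be applied to a Spark column by registering them as UDFs, or the same rules can be expressed with chained when() conditions.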

<blockquote><p style="font-size:20px; color:#159364; font-family:verdana;">1. City from delivery id:</p></blockquote>

#### In order to apply a custom function to a particular column, we have to create the function and register it as a UDF (User Defined Function) in Spark
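
Before registering a function as a UDF, it can help to try its logic in plain Python first. The sketch below uses a hypothetical helper name and made-up sample IDs, and adds a guard for missing or non-matching values, since a UDF is called on every row, including nulls.

```python
import re

def extract_city_prefix(delivery_person_id):
    """Return the text before 'RES' in an ID, or None if the pattern is absent."""
    if delivery_person_id is None:
        return None
    match = re.search(r"(\S+)RES\S+", delivery_person_id)
    return match.group(1) if match else None

# Hypothetical sample IDs, made up for illustration
print(extract_city_prefix("INDORES13DEL02"))
print(extract_city_prefix("no-match"))
print(extract_city_prefix(None))
```

Returning None instead of raising keeps one malformed row from failing the whole Spark job; the affected rows simply end up as nulls in the new column.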

In [None]:
# Create custom function
def city_extract(x):
    ## Raw string so that \S reaches the regex engine unchanged
    return re.findall(r"(\S+)RES\S+",x)[0]

# Convert the function into a UDF using the udf function:
city_extract_UDF = udf(city_extract,StringType()) 

# Apply the function on the desired column:
df=df.withColumn("City",city_extract_UDF(col("Delivery_person_ID")))

## Having a glance at the new column:
df.select(['Delivery_person_ID','City']).show()

In [None]:
df.select("City").distinct().show(22)

<blockquote><p style="font-size:20px; color:#159364; font-family:verdana;">3. Getting Pickup time:</p></blockquote>

In [None]:
## Equivalent of value_counts() in pandas:
## Looks like there are ~1700 rows with missing values in this column.
df.groupBy('Time_Orderd').count().sort(col("count").desc()).show(10)

In [None]:
df.groupBy('Time_Order_picked').count().sort(col("count").desc()).show(10)

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp; A go-to approach is to calculate the average pickup time from the non-null rows and then impute the null rows with that average.
</div>
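
The suggestion above can be sketched in plain Python as follows. This is only an illustration of the logic, assuming the times are stored as "HH:MM:SS" strings and that missing values appear as unparsable text such as "NaN"; the helper names and sample rows are made up.

```python
def to_minutes(hhmmss):
    """Convert an 'HH:MM:SS' string to minutes since midnight, or None if unparsable."""
    try:
        hours, minutes, seconds = (int(part) for part in hhmmss.strip().split(":"))
    except (AttributeError, ValueError):
        return None
    return hours * 60 + minutes + seconds / 60

def pickup_minutes(ordered, picked):
    """Minutes between order and pickup, allowing pickup to fall after midnight."""
    start, end = to_minutes(ordered), to_minutes(picked)
    if start is None or end is None:
        return None
    return (end - start) % (24 * 60)

def impute_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# Made-up (ordered, picked) pairs; "NaN" stands in for a missing order time
rows = [("11:30:00", "11:45:00"), ("23:55:00", "00:05:00"), ("NaN", "19:10:00")]
pickup = impute_with_mean([pickup_minutes(o, p) for o, p in rows])
print(pickup)
```

The modulo step handles orders placed shortly before midnight and picked up shortly after it, which would otherwise give a negative duration.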

<blockquote><p style="font-size:20px; color:#159364; font-family:verdana;">5. Cleaning the target variable:</p></blockquote>

#### Use withColumnRenamed method to rename a column.

In [None]:
## Before transformation:
df.select("Time_taken(min)").show(5)

In [None]:
## Renaming the column:
df=df.withColumnRenamed('Time_taken(min)','time_taken')

## Removing the prefix (i.e. '(min)') from the column values with the help of a UDF:
def target_clean(x):
    ## Drop the '(min)' prefix and surrounding whitespace; pass nulls through
    return x.replace('(min)', '').strip() if x is not None else None

target_clean_udf=udf(target_clean,StringType())
df=df.withColumn("time_taken",target_clean_udf(col("time_taken")))
## Converting type:
df=df.withColumn("time_taken",col("time_taken").cast(IntegerType()))

In [None]:
## As you can see, the values have been cleaned and the type has been changed:
df.select("time_taken").show(5),df.select("time_taken").dtypes

In [None]:
## 7. Distance feature from latitude and longitude.
## Pandas draft, kept commented out; `train` is a pandas dataframe and this has not been ported to Spark yet.

# from geopy.distance import geodesic 

# restaurant_coordinates_train=train[['Restaurant_latitude','Restaurant_longitude']].to_numpy()
# delivery_location_coordinates_train=train[['Delivery_location_latitude','Delivery_location_longitude']].to_numpy()

# train['distance_diff_KM']=[geodesic(r,d).km for r,d in zip(restaurant_coordinates_train,delivery_location_coordinates_train)]
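
For item 7 of the feature list, a dependency-free alternative is the haversine formula, sketched below. Note that this is a swapped-in technique: it treats the Earth as a sphere, so it is an approximation and will differ slightly from an ellipsoidal geodesic distance. The function name and the radius constant are assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# One degree of longitude along the equator
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 3))
```

Because it only uses the standard math module, a function like this can be wrapped as a UDF taking the four coordinate columns as arguments.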