# <font color='blue'>Introduction to PySpark
---

- In this notebook, we will learn how to use Spark from Python through its Python API, PySpark
- To make sure PySpark is installed on your machine, we can import `pyspark` and check its version

In [9]:
# import library
import pyspark

In [2]:
# check pyspark version
pyspark.__version__

'3.5.0'

- Now that `pyspark` is imported, how do we use the functions and components inside it?
- To use them, we must first create a `SparkSession`
- `SparkSession` is a class that we can import with `from pyspark.sql import SparkSession`
- Then, to create a `SparkSession`, we can use this code snippet
  
```python
spark = (
    SparkSession.builder
    .appName("...")  # use any application name you like
    .getOrCreate()
)
```

In [10]:
from pyspark.sql import SparkSession

In [11]:
spark = SparkSession.builder.appName("Learn PySpark Data Pipeline").getOrCreate()

In [12]:
spark

### SparkContext UI
---

- When a SparkContext is created (this happens together with the SparkSession), we get a web-based user interface that shows information about Spark execution, such as run time and how a job is split into many tasks
- It helps us debug and check the performance of our Spark jobs
- By default, the SparkContext UI is available at http://localhost:4040
- The UI is only reachable once a SparkSession has been created, so make sure you create one first
- Once the SparkSession exists, you can also click `Spark UI` in the output of the `spark` cell

<center>
    <img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-data-ingestion-spark/spark_ui.png">
</center>

Try running the code below, then look at the Spark UI

In [7]:
df_rdd = spark.sparkContext\
            .parallelize([(1, 2, 3, 'a b c'),
                            (4, 5, 6, 'd e f'),
                            (7, 8, 9, 'g h i')])\
            .toDF(['col_1', 'col_2', 'col_3', 'col_4'])

df_rdd.show()

+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|
+-----+-----+-----+-----+
|    1|    2|    3|a b c|
|    4|    5|    6|d e f|
|    7|    8|    9|g h i|
+-----+-----+-----+-----+



The Spark UI will show the jobs that PySpark has run

<center>
    <img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-data-ingestion-spark/spark_ui_run.png">
</center>

# <font color='blue'>Pandas vs PySpark
---

- In the previous section, we learned that PySpark is similar to Pandas
- This is especially true for data processing, where we can:
    - Read Data
    - Select Data
    - Filter Data
    - Transform Data
    - etc
- We're going to learn how to do these with PySpark in this course
- But what's the difference between them?
- The main difference is that PySpark splits the work into tasks that can run in parallel!
- This affects how long it takes to read data
- In this section, we're going to time both libraries to see the speed difference
- The difference becomes visible when the data has many records

---
- We will try to read data that has more than 5 million rows
- Say we have flight transaction data; you can access it here: [dataset](https://drive.google.com/file/d/1GggQUGzWFeLXwITGKN0s_NNmkep9V-WQ/view?usp=drive_link)
- We will compare the read time of Pandas and PySpark
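
Both timing functions in this comparison follow the same start/stop pattern, so it can be factored into one helper. A minimal sketch using only the standard library; `timed` is a hypothetical helper name, and the `sum` call is a stand-in workload so the example runs without any data file.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[float, Any]:
    """Run func(*args, **kwargs) and return (elapsed_seconds, result)."""
    start = time.perf_counter()  # monotonic clock intended for measuring durations
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return elapsed, result


# Stand-in workload so the sketch runs without any data file
elapsed, total = timed(sum, range(1_000_000))
print(f"Elapsed: {elapsed:.4f} seconds, result: {total}")
```

The same helper could wrap `pd.read_csv` or a PySpark read. One thing to keep in mind when interpreting the numbers: Spark evaluates DataFrames lazily, so `spark.read...csv(path)` mostly sets up the read, while `pd.read_csv` loads the whole file. Timing an action such as `.count()` on the PySpark side likely gives a fairer comparison.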

**Pandas**
---

In [10]:
import pandas as pd
import time

In [23]:
DATA_PATH = "../data/"

In [24]:
# we're going to compare the speed 

def read_big_table_pandas(filename: str):
    start_time = time.time()
    
    df = pd.read_csv(DATA_PATH + filename)
    
    end_time = time.time()
    
    elapsed_time = end_time - start_time
    
    return elapsed_time, df

In [25]:
elapsed_time, df_pandas = read_big_table_pandas(filename = "bank_churners.csv")
print(f"Time to read CSV: {elapsed_time} seconds")

Time to read CSV: 0.04840540885925293 seconds


In [26]:
df_pandas.head()

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0
3,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76
4,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0


**PySpark**
---

In [27]:
def read_big_table_pyspark(filename: str):
    start_time = time.time()
    
    df = spark \
        .read \
        .option("header", "true")\
        .csv(DATA_PATH + filename)
    
    end_time = time.time()
    
    elapsed_time = end_time - start_time
    
    return elapsed_time, df

In [28]:
elapsed_time, df_pyspark = read_big_table_pyspark(filename = "bank_churners.csv")
print(f"Time to read CSV: {elapsed_time} seconds")

Time to read CSV: 0.3867511749267578 seconds


In [29]:
df_pyspark.show(5)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

# <font color='blue'>Read Data using PySpark
---

- Just like in Pandas, when using PySpark we can read data from various formats like:
    - File
    - Database
    - JSON
    - API
    - etc
- For more details on the data formats that PySpark can read, check this [documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html)
- PySpark loads the data it reads into a DataFrame

---
In this section, we're going to learn how to read data from a CSV file and a Postgres database

**Read csv file**
---

- Say we're using the bank churners data
- The file is located at `data/bank_churners.csv`
- To read data using PySpark, we use the `SparkSession` that we've already created
- So, for the rest of this course we will use the variable `spark`
- The code snippet looks like this

```python
df = spark.read.option("header", "true").FORMAT_DATA(...) 
```

- Replace `FORMAT_DATA` with the reader method that matches the format of our data
- In this case we're reading a `csv` file, so `FORMAT_DATA` becomes `csv`
- Then, we pass the path of the data file to `csv`

```python
df = spark.read.option("header", "true").csv(filename)
```

In [7]:
df = spark.read.option("header", "true").csv("../data/bank_churners.csv")

In [8]:
df.show()

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

**Read Database data**
---

- To read data from a database, we first need the database driver
- Spark runs on the JVM (it is written in Scala and Java), so the database driver must be a `jar` file
- If you're using the PySpark Docker setup from Pacmann's repo, the `jar` file is already included in that repo
- Next, we copy the `jar` file into our Docker container by running this command in the terminal
  
  `docker cp driver/postgresql-42.6.0.jar pyspark_container:/usr/local/spark-3.5.0-bin-hadoop3/jars/postgresql-42.6.0.jar`

- After the database driver is set up, we can start reading from the database with PySpark

- Say we already have a database running from our Docker Compose setup, we want to read the `payment` table, and this is our database config

```python
DB_URL = "jdbc:postgresql://pachotel_db_container:5432/pachotel"
DB_TABLE = "payment"
DB_USER = "postgres"
DB_PASS = "cobapassword"
```

In [2]:
# set variable for database

DB_URL = "jdbc:postgresql://pachotel_db_container:5432/pachotel"
DB_TABLE = "payment" 
DB_USER = "postgres"
DB_PASS = "cobapassword"

In [3]:
# set config
jdbc_url = DB_URL
table_name = DB_TABLE
connection_properties = {
    "user": DB_USER,
    "password": DB_PASS,
    "driver": "org.postgresql.Driver" # set driver postgres
}

- To read from the database using PySpark, we can use the `jdbc` method
- Then, we set these parameters:
    - `url`: JDBC URL of the database
    - `table`: name of the table that we want to access
    - `properties`: connection properties such as user, password, and driver
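
The same URL and properties are needed for both reading and writing, so it can help to build them in one place. A small sketch; `build_jdbc_config` is a hypothetical helper (not part of PySpark), and the host, database, and credentials are the example values from this notebook.

```python
from typing import Dict, Tuple


def build_jdbc_config(host: str, port: int, database: str,
                      user: str, password: str) -> Tuple[str, Dict[str, str]]:
    """Return (jdbc_url, connection_properties) for a PostgreSQL database."""
    jdbc_url = f"jdbc:postgresql://{host}:{port}/{database}"
    connection_properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver class
    }
    return jdbc_url, connection_properties


jdbc_url, connection_properties = build_jdbc_config(
    host="pachotel_db_container",
    port=5432,
    database="pachotel",
    user="postgres",
    password="cobapassword",
)
print(jdbc_url)
```

The two return values can then be passed as `url` and `properties` to `spark.read.jdbc(...)`.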

In [8]:
df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=connection_properties)

# Display the DataFrame
df.show()

+----------+--------------+--------+--------------+--------------+--------------------+--------------------+--------------------+
|payment_id|reservation_id|provider|account_number|payment_status|        payment_date|         expire_date|          created_at|
+----------+--------------+--------+--------------+--------------+--------------------+--------------------+--------------------+
|         1|             1|     Ovo|     038137149|       Success|2020-10-20 17:07:...|2020-10-28 18:10:...|2024-02-09 13:59:...|
|         2|             2|     BCA|     042103729|       Success|2015-02-28 16:44:...|2015-03-08 05:27:...|2024-02-09 13:59:...|
|         3|             3| Permata|     058689635|       Success|2022-04-03 13:27:...|2022-04-06 14:20:...|2024-02-09 13:59:...|
|         4|             4|     BNI|     107161965|       Success|2019-10-04 08:39:...|2019-10-08 02:01:...|2024-02-09 13:59:...|
|         5|             5|     BSI|     014334131|        Failed|                NULL|202

- We've already selected and filtered data based on the requirements given to us
- The next step is to **store the results**, that is, export the data
- With PySpark, we can save the output to a file or to a database
- To export the data, we use the `write` attribute followed by the output format we want

**Save to File**
---

- In this case, we will save the output as CSV
- When exporting to CSV with PySpark, we can choose how the output is laid out:
    - a single file: `output.csv`
    - several files, one per partition: `output_1.csv`, `output_2.csv`, `output_n.csv`

---
- Using the `payment` data that we read from the database, we're going to select and filter the data
- Selected columns:
    - `ACCOUNT_NUMBER`
    - `PAYMENT_STATUS`
    - `PAYMENT_DATE`
    - `EXPIRE_DATE`
    - `PROVIDER`

In [30]:
SELECTED_COLS = ["ACCOUNT_NUMBER", "PAYMENT_STATUS", "PAYMENT_DATE","EXPIRE_DATE","PROVIDER"]

In [31]:
new_df = df.select(SELECTED_COLS)

new_df.show()

+--------------+--------------+--------------------+--------------------+--------+
|ACCOUNT_NUMBER|PAYMENT_STATUS|        PAYMENT_DATE|         EXPIRE_DATE|PROVIDER|
+--------------+--------------+--------------------+--------------------+--------+
|     038137149|       Success|2020-10-20 17:07:...|2020-10-28 18:10:...|     Ovo|
|     042103729|       Success|2015-02-28 16:44:...|2015-03-08 05:27:...|     BCA|
|     058689635|       Success|2022-04-03 13:27:...|2022-04-06 14:20:...| Permata|
|     107161965|       Success|2019-10-04 08:39:...|2019-10-08 02:01:...|     BNI|
|     014334131|        Failed|                NULL|2020-06-14 12:02:...|     BSI|
|     043583173|       Waiting|                NULL|2020-03-25 03:35:...|     Ovo|
|     021381685|       Waiting|                NULL|2017-10-12 03:15:...|     BSI|
|     119636925|       Success|2019-07-06 00:27:...|2019-07-16 05:38:...| Permata|
|     128447299|       Success|2020-10-22 21:58:...|2020-11-07 21:05:...| Mandiri|
|   

Next, we're going to keep only the rows where `PROVIDER` is `BSI` or `Ovo` and `PAYMENT_STATUS` is `Waiting`

In [37]:
filter_new_df = new_df.filter("PROVIDER in ('BSI','Ovo') and PAYMENT_STATUS = 'Waiting'")
filter_new_df.show()

+--------------+--------------+------------+--------------------+--------+
|ACCOUNT_NUMBER|PAYMENT_STATUS|PAYMENT_DATE|         EXPIRE_DATE|PROVIDER|
+--------------+--------------+------------+--------------------+--------+
|     043583173|       Waiting|        NULL|2020-03-25 03:35:...|     Ovo|
|     021381685|       Waiting|        NULL|2017-10-12 03:15:...|     BSI|
|     067955907|       Waiting|        NULL|2018-01-23 00:35:...|     Ovo|
|     117651601|       Waiting|        NULL|2019-05-23 21:16:...|     BSI|
|     117451542|       Waiting|        NULL|2016-10-09 18:40:...|     BSI|
|     042230919|       Waiting|        NULL|2016-07-31 02:22:...|     BSI|
|     071579340|       Waiting|        NULL|2018-05-28 17:52:...|     Ovo|
|     041985553|       Waiting|        NULL|2015-08-12 03:03:...|     BSI|
|     065955077|       Waiting|        NULL|2020-05-07 08:32:...|     Ovo|
+--------------+--------------+------------+--------------------+--------+



To save the output as a single file, we can use this code snippet

```python
df.coalesce(numPartitions = 1).write.csv(filename)
```

---
Say we want to save the output into directory `data/output/`

In [39]:
new_df.coalesce(numPartitions = 1).write.csv("../data/output/filtered_data_single", header = True)

If we want the output split into one file per partition instead, we simply leave out the `coalesce` method

In [40]:
new_df.write.option("header", True).csv("../data/output/filtered_data_partition")
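
In both cases, with the default CSV writer, the path passed to `csv()` becomes a directory rather than a single file: Spark writes one `part-*.csv` file per partition plus a `_SUCCESS` marker. A sketch for locating the data files afterwards, using only the standard library; `list_part_files` is a hypothetical helper, and the directory below is simulated (with made-up file names) so the example runs without Spark.

```python
import tempfile
from pathlib import Path
from typing import List


def list_part_files(output_dir: str) -> List[Path]:
    """Return the CSV part files inside a Spark output directory, sorted by name."""
    return sorted(Path(output_dir).glob("part-*.csv"))


# Simulate the layout Spark produces (file names here are made up)
out_dir = Path(tempfile.mkdtemp()) / "filtered_data_partition"
out_dir.mkdir()
for name in ["part-00000-demo.csv", "part-00001-demo.csv", "_SUCCESS"]:
    (out_dir / name).write_text("")

part_files = list_part_files(str(out_dir))
print([p.name for p in part_files])
```

With `coalesce(numPartitions = 1)` the directory would contain exactly one part file.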

**Save to Database**
---

- Saving to a database works just like reading from one
- We must configure the connection first and then use the `jdbc` method

---
- When loading into the database, PySpark can create the table automatically
- But you can also create the table first and then load the data

Schema Table

```sql
CREATE TABLE public.payment_pyspark (
	"ACCOUNT_NUMBER" text NULL,
	"PAYMENT_STATUS" varchar NULL,
	"PAYMENT_DATE" timestamp NULL,
	"EXPIRE_DATE" timestamp NULL,
	"PROVIDER" varchar NULL
);
```

In [43]:
# set variable for database
DB_URL = "jdbc:postgresql://pachotel_db_container:5432/pachotel"
DB_TABLE = "payment" 
DB_USER = "postgres"
DB_PASS = "cobapassword"

In [44]:
url = DB_URL  # reuse the JDBC URL defined above
properties = {
    "user": DB_USER,
    "password": DB_PASS
}

In [46]:
new_df.write.jdbc(url = url,
                  table = "payment_pyspark",
                  mode = "overwrite",
                  properties = properties)

# <font color='blue'>Select Data using PySpark
---

- Selecting data in PySpark is similar to Pandas
- To get the data we need, we select only the columns that we want
- Say we want to work with the bank churners data

In [49]:
df_churns = spark \
            .read \
            .option("header", "true") \
            .csv("../data/bank_churners.csv")

df_churns.show(5)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

- To select columns, we can first store their names in a Python list for easier reuse
- Then, to select the data we can use the syntax `select(col_1, col_2, col_n)`; a list of column names is accepted as well

In [51]:
SELECTED_COLUMNS = ["Customer_Age","Gender","Dependent_count","Education_Level","Marital_Status","Income_Category"]

selected_df_churns = df_churns.select(SELECTED_COLUMNS)
selected_df_churns.show()

+------------+------+---------------+---------------+--------------+---------------+
|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|
+------------+------+---------------+---------------+--------------+---------------+
|          45|     M|              3|    High School|       Married|    $60K - $80K|
|          49|     F|              5|       Graduate|        Single| Less than $40K|
|          51|     M|              3|       Graduate|       Married|   $80K - $120K|
|          40|     F|              4|    High School|       Unknown| Less than $40K|
|          40|     M|              3|     Uneducated|       Married|    $60K - $80K|
|          44|     M|              2|       Graduate|       Married|    $40K - $60K|
|          32|     M|              0|    High School|       Unknown|    $60K - $80K|
|          37|     M|              3|     Uneducated|        Single|    $60K - $80K|
|          48|     M|              2|           NULL|        Sing

# <font color='blue'>Filter Data using PySpark
---

- To filter data in PySpark we can use the `.filter()` method
- The condition can be written as a column expression or as a SQL string
- In both styles we can use **comparison operators** (in a SQL string, equality is written `=` instead of `==`)

<center>

| **Comparison Operators** | **Description**          |
|--------------------------|--------------------------|
| <                        | Less than                |
| >                        | Greater than             |
| <=                       | Less than or equal to    |
| >=                       | Greater than or equal to |
| ==                       | Equal to                 |
| !=                       | Not equal to             |

</center>

- When filtering data in PySpark we can apply a single condition or multiple conditions
- To combine multiple conditions we use **boolean logic**: **and (`&`)** and **or (`|`)**

**Single Filter**
---

To filter on only one column, we can use this code snippet

```python
df_filtered = df.filter(df[col_1] > value)
```

In [54]:
df_filtered = df_churns.filter(df_churns['Gender'] =='F')
df_filtered.show(5)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

**Multiple Filter**
---

To apply multiple filters in PySpark, we combine the conditions with **boolean logic**

```python

# column-based style
filtered_df = df.filter((df[col_1] > value_1) & (df[col_2] == value_2))

# SQL-based style
filtered_df = df.filter("col_1 > value_1 AND col_2 = value_2")
```

In [55]:
df_filtered = df_churns.filter((df_churns['Gender'] =='M')& (df_churns['Education_Level'] =='Graduate'))
df_filtered.show(5)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

# <font color='blue'>Data Transformation using PySpark
---

- In this section, we're going to learn how to do Data Wrangling, also called Data Transformation, in PySpark
- Just like in Pandas, we can also do Data Transformation using PySpark
- There are many Data Transformation tasks that we can do with PySpark
- But in this session we will focus on:
    - Rename Column
    - Slicing Data
    - Create a New Column using the Existing Column
    - Impute Missing Values


Just like in the previous section, make sure we've already created `SparkSession`

In [1]:
# import SparkSession
from pyspark.sql import SparkSession
import pyspark

In [2]:
spark = SparkSession \
    .builder \
    .appName("Data Transformation using PySpark") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

In [3]:
spark

- We will use the same data as in the previous section
- The data is in `data/bank_churners.csv`
- We can also create a Python function to read the data

In [4]:
def read_csv_data(filename: str) -> pyspark.sql.dataframe.DataFrame:
    """Function to read csv file using PySpark"""
    df = spark \
        .read \
        .option("header", "true") \
        .csv(filename)

    return df

In [5]:
# define data path
DATA_PATH = "../data/"

In [6]:
df_data = read_csv_data(DATA_PATH + "bank_churners.csv")

df_data.show(5)  # show the first 5 rows

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

### **Rename Column**
---

- Data read directly from its source often has a problem: the column names don't follow a consistent naming convention
- Naming conventions differ from person to person; common choices are camelCase and snake_case
- In this course, we will use snake_case
- To rename columns in PySpark we can use two methods:
    - `withColumnRenamed()`: renames a single column
    - `withColumnsRenamed()`: renames multiple columns at once

**Syntax Rename One Column**

```python
df = df.withColumnRenamed(existing = old_column, new = new_column)
```

**Syntax Rename Multiple Columns**

```python
# initiate dictionary
RENAME_COLS = {
    "old_col_1": "new_col_1",
    "old_col_2": "new_col_2",
    "old_col_n": "new_col_n"
}

df = df.withColumnsRenamed(colsMap = RENAME_COLS)
```
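
When many columns need renaming, the dictionary can be generated instead of typed by hand. A sketch, assuming the goal is snake_case as chosen above; `to_snake_case` and `build_rename_map` are hypothetical helpers, and the rules used here (lowercase everything, insert `_` at lower-to-upper boundaries, collapse repeated underscores) are just one possible convention.

```python
import re
from typing import Dict, List


def to_snake_case(name: str) -> str:
    """Convert a column name such as 'Months_on_book' or 'creditLimit' to snake_case."""
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Replace spaces and dashes with underscores, then collapse repeated underscores
    name = re.sub(r"[\s\-]+", "_", name)
    name = re.sub(r"_+", "_", name)
    return name.strip("_").lower()


def build_rename_map(columns: List[str]) -> Dict[str, str]:
    """Map each column that needs a change to its snake_case name."""
    return {c: to_snake_case(c) for c in columns if c != to_snake_case(c)}


columns = ["Attrition_Flag", "Months_on_book", "creditLimit", "gender"]
rename_map = build_rename_map(columns)
print(rename_map)
```

With a PySpark DataFrame, the mapping could be built from `df.columns` and passed to `withColumnsRenamed()`.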

**Rename One Column**

Using the data in the variable `df_data`, we want to rename the column `Dependent_count` to `Dependent_Count`

In [8]:
df_data.withColumnRenamed(existing = "Dependent_count", new = "Dependent_Count").show()

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_Count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

**Rename Multiple Columns**

Next, we rename several columns of `df_data` at once by passing a dictionary to `withColumnsRenamed()`

In [15]:
rename_columns = {
                    "Months_on_book":"Months_On_Book",
                    "Contacts_Count_12_mon":"Contacts_Count_12_Mon",
                    "Months_Inactive_12_mon":"Months_Inactive_12_Mon"
                 }

In [16]:
df_data = df_data.withColumnsRenamed(colsMap = rename_columns)

In [17]:
df_data.show(5)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_On_Book|Total_Relationship_Count|Months_Inactive_12_Mon|Contacts_Count_12_Mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

### **Slicing Data**
---

- In the previous section we learned how to select data by column in PySpark
- But did you know that in Pandas we can also select data by row position, like this?

<center>
    <img src="https://www.boardinfinity.com/blog/content/images/2023/02/iloc-python.png" width=65%>
    <br>
    <a href="https://www.boardinfinity.com/blog/iloc-in-python/">img src</a>
</center>

- We can get the same result with PySpark data, but PySpark itself has no method for this kind of slicing
- So, we must convert the data to Pandas first
- Yes, a PySpark DataFrame can be converted to a Pandas DataFrame!
- Whenever we want to use a Pandas method on PySpark data, we must convert it to Pandas first
- So our workflow will look like this

<center>
    <img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-data-ingestion-spark/workflow_slicing_data.png">
</center>

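- To convert to Pandas, we can use the `.toPandas()` method; note that it collects all rows into the driver's memory, so it only suits data that fits there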

**Syntax**

```python
df = df.toPandas()
```

- After that, we can do data slicing like in pandas by using `.loc` or `.iloc`
    - `.iloc`: access data based on index position
    - `.loc`: access data based on labels (column or index name) in rows or columns

**Syntax**

```python
# slicing using .iloc
df.iloc[rows_start_index:rows_end_index, cols_start_index:cols_end_index]

# slicing using .loc
df.loc[rows_index_name_start:rows_index_name_end, cols_index_name_start:cols_index_name_end]
```
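
One detail that is easy to miss: `.iloc` excludes the end position, while `.loc` includes the end label. A small sketch with a made-up Pandas DataFrame (the values are illustrative only):

```python
import pandas as pd

df_small = pd.DataFrame(
    {"Customer_Age": [45, 49, 51, 40, 40], "Gender": ["M", "F", "M", "F", "M"]}
)

# .iloc is position based and end-exclusive: rows 1 and 2, first column only
by_position = df_small.iloc[1:3, 0:1]

# .loc is label based and end-inclusive: rows labelled 1, 2 and 3, both columns
by_label = df_small.loc[1:3, "Customer_Age":"Gender"]

print(by_position.shape)
print(by_label.shape)
```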

---
- In this section, we're going to use the dataset `data/bank_churners.csv`
- As the first step, we read the data with PySpark and then convert it to Pandas

In [67]:
# read data 
df_churners = spark.read.option("header", "true").csv("../data/bank_churners.csv")

df_churners.show(5)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

In [20]:
df_churners.count()

9776

In [24]:
df_churners = df_churners.toPandas()

In [25]:
df_churners.head()

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691,777,11914,1.335,1144,42,1.625,0.061
1,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256,864,7392,1.541,1291,33,3.714,0.105
2,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418,0,3418,2.594,1887,20,2.333,0.0
3,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313,2517,796,1.405,1171,20,2.333,0.76
4,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716,0,4716,2.175,816,28,2.5,0.0


---
- As you can see from the output above, the data is now a Pandas DataFrame!
- Now, we can slice the data using either `.loc` or `.iloc`
- Say we're going to slice the rows from position `9430` up to, but not including, `9776`

In [30]:
df_churners

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691,777,11914,1.335,1144,42,1.625,0.061
1,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256,864,7392,1.541,1291,33,3.714,0.105
2,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418,0,3418,2.594,1887,20,2.333,0
3,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313,2517,796,1.405,1171,20,2.333,0.76
4,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716,0,4716,2.175,816,28,2.5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9771,Attrited Customer,46,M,4,Uneducated,Single,,Gold,41,6,3,2,34516,1104,33412,0.591,1989,47,0.741,0.032
9772,Existing Customer,59,M,0,College,Married,Less than $40K,Blue,49,3,3,3,4250,1095,3155,0.947,1174,27,0.8,0.258
9773,Existing Customer,42,M,3,High School,Single,$80K - $120K,Silver,32,3,3,3,34516,1396,33120,0.688,3748,62,0.938,0.04
9774,Existing Customer,52,F,3,Uneducated,Married,Less than $40K,Blue,33,6,3,3,5677,1403,4274,0.951,2730,58,0.758,0.247


In [34]:
val = df_churners.iloc[9430:9776]
val

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
9430,Existing Customer,40,F,4,College,Married,Less than $40K,Blue,26,3,1,1,2972,0,2972,0.593,3840,69,0.769,0
9431,Attrited Customer,50,F,1,Uneducated,Single,Less than $40K,Blue,37,3,2,4,1838,0,1838,0.743,2461,47,1.136,0
9432,Existing Customer,33,F,3,College,Single,$40K - $60K,Blue,26,6,3,5,6316,1344,4972,0.823,2401,71,0.732,0.213
9433,Existing Customer,48,M,2,Graduate,Unknown,$40K - $60K,Blue,36,5,1,2,13068,1718,11350,0.607,937,28,0.556,0.131
9434,Existing Customer,44,Female,4,Uneducated,Married,Less than $40K,Blue,33,5,4,2,9671,704,8967,0.699,2955,66,0.941,0.073
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9771,Attrited Customer,46,M,4,Uneducated,Single,,Gold,41,6,3,2,34516,1104,33412,0.591,1989,47,0.741,0.032
9772,Existing Customer,59,M,0,College,Married,Less than $40K,Blue,49,3,3,3,4250,1095,3155,0.947,1174,27,0.8,0.258
9773,Existing Customer,42,M,3,High School,Single,$80K - $120K,Silver,32,3,3,3,34516,1396,33120,0.688,3748,62,0.938,0.04
9774,Existing Customer,52,F,3,Uneducated,Married,Less than $40K,Blue,33,6,3,3,5677,1403,4274,0.951,2730,58,0.758,0.247


---
- After we slice the data, we can convert it back to a PySpark DataFrame
- To do that, we can use the method `spark.createDataFrame(pandas_df)`

**Syntax**

```python
df_pyspark = spark.createDataFrame(df_pandas)
```

In [35]:
df_pyspark = spark.createDataFrame(val)

In [36]:
df_pyspark.show()

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

In [37]:
df_pyspark.count()

346

### **Create New Column using Existing Column**
---

- In Data Transformation, we can also create a new column using an existing column!
- Why do we create new columns? There are many reasons, like:
    - Enhancing data analysis
    - Data enrichment
    - Capturing more information
    - and many more

**Example Create New Column using Existing Column**

<center>
    <img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-data-ingestion-spark/feature_engineering.png">
</center>

---
- In this section, we're going to use `data/materi_10/superstore.csv`
- As the first step, we're going to read the CSV data

In [40]:
df_superstore = spark \
                .read \
                .option("header", "true") \
                .csv("../data/superstore.csv")

df_superstore.show(5)

+---------------+-----------+-------------+-----------+----------------+--------+------+-------------------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+-------------------+--------------+-------------+----------+------------+-------+
|       Category|       City|      Country|Customer.ID|   Customer.Name|Discount|Market|         Order.Date|      Order.ID|Order.Priority|     Product.ID|        Product.Name| Profit|Quantity|Region|Row.ID|Sales| Segment|          Ship.Date|     Ship.Mode|Shipping.Cost|     State|Sub.Category|weeknum|
+---------------+-----------+-------------+-----------+----------------+--------+------+-------------------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+-------------------+--------------+-------------+----------+------------+-------+
|Office Supplies|Los Angeles|United States|  LS-172304|Lycoris Saunders|     0.0|    US|201

- Before we generate new columns, this is our workflow for this case:
    1. Rename all the columns according to criteria.
    2. Cast data types.
    3. Create new columns based on existing columns.
    4. Create new columns by extracting from existing columns.

**1. Rename all the columns**

- We want to rename all the columns, so we will use the method `withColumnsRenamed()` (available since Spark 3.4).
- Because we will rename all the columns, we will pass the old and new names as a Python dictionary.
- These are the columns that we want to rename:
    - `Category`: `category`
    - `City`: `city`
    - `Country`: `country`
    - `Customer.ID`: `customer_id`
    - `Customer.Name`: `customer_name`
    - `Discount`: `discount`
    - `Market`: `market`
    - `Order.Date`: `order_date`
    - `Order.ID`: `order_id`
    - `Order.Priority`: `order_priority`
    - `Product.ID`: `product_id`
    - `Product.Name`: `product_name`
    - `Profit`: `profit`
    - `Quantity`: `quantity`
    - `Region`: `region`
    - `Row.ID`: `row_id`
    - `Sales`: `sales`
    - `Segment`: `segment`
    - `Ship.Date`: `ship_date`
    - `Ship.Mode`: `ship_mode`
    - `Shipping.Cost`: `shipping_cost`
    - `State`: `state`
    - `Sub.Category`: `sub_category`
    - `weeknum`: `week_num`

In [42]:
# create dictionary to rename columns
RENAME_COLS = {
    "Category": "category",
    "City": "city",
    "Country": "country",
    "Customer.ID": "customer_id",
    "Customer.Name": "customer_name",
    "Discount": "discount",
    "Market": "market",
    "Order.Date": "order_date",
    "Order.ID": "order_id",
    "Order.Priority": "order_priority",
    "Product.ID": "product_id",
    "Product.Name": "product_name",
    "Profit": "profit",
    "Quantity": "quantity",
    "Region": "region",
    "Row.ID": "row_id",
    "Sales": "sales",
    "Segment": "segment",
    "Ship.Date": "ship_date",
    "Ship.Mode": "ship_mode",
    "Shipping.Cost": "shipping_cost",
    "State": "state",
    "Sub.Category": "sub_category",
    "weeknum": "week_num"
}

In [43]:
# rename all the columns using .withColumnsRenamed()

df_superstore = df_superstore.withColumnsRenamed(colsMap = RENAME_COLS)

df_superstore.show(5)

+---------------+-----------+-------------+-----------+----------------+--------+------+-------------------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+-------------------+--------------+-------------+----------+------------+--------+
|       category|       city|      country|customer_id|   customer_name|discount|market|         order_date|      order_id|order_priority|     product_id|        product_name| profit|quantity|region|row_id|sales| segment|          ship_date|     ship_mode|shipping_cost|     state|sub_category|week_num|
+---------------+-----------+-------------+-----------+----------------+--------+------+-------------------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+-------------------+--------------+-------------+----------+------------+--------+
|Office Supplies|Los Angeles|United States|  LS-172304|Lycoris Saunders|     0.0|    US|

**2. Casting Data Types**

- There are some cases when the values in a column are not stored in the correct data type.
- To check the data type of each column, we can use the `printSchema()` method.

**Syntax**

```python
df.printSchema()
```

**Example Output**

<center>
    <img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-data-ingestion-spark/printSchema.png">
</center>

- The left side of the output shows the column names, and the right side shows the data type of each column.

In [44]:
df_superstore.printSchema()

root
 |-- category: string (nullable = true)
 |-- city: string (nullable = true)
 |-- country: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- discount: string (nullable = true)
 |-- market: string (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_priority: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- profit: string (nullable = true)
 |-- quantity: string (nullable = true)
 |-- region: string (nullable = true)
 |-- row_id: string (nullable = true)
 |-- sales: string (nullable = true)
 |-- segment: string (nullable = true)
 |-- ship_date: string (nullable = true)
 |-- ship_mode: string (nullable = true)
 |-- shipping_cost: string (nullable = true)
 |-- state: string (nullable = true)
 |-- sub_category: string (nullable = true)
 |-- week_num: string (nullable = true)



- It turns out that every column was read as `string`, so some columns are not in the correct data types
- So, we must cast the data
- To cast data types in PySpark, we can use the method `withColumn()` to select the column that we want to cast and the method `cast()` to convert it to the data type that we want

**Syntax**

```python
df_pyspark.withColumn(col_name, df_pyspark[col_name].cast(data_type))
```

---
- The columns that we want to cast are these columns:
    - `discount`: `int`
    - `profit`: `float`
    - `quantity`: `int`
    - `sales`: `int`
    - `shipping_cost`: `float`
    - `order_date`: `date`
    - `ship_date`: `date`

In [48]:
CAST_COLUMNS = {
    "discount": "int",
    "profit": "float",
    "quantity": "int",
    "sales": "int",
    "shipping_cost": "float"
}

for col_name, data_type in CAST_COLUMNS.items():
    df_superstore = df_superstore.withColumn(col_name, df_superstore[col_name].cast(data_type))

In [47]:
df_superstore.printSchema()

root
 |-- category: string (nullable = true)
 |-- city: string (nullable = true)
 |-- country: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- discount: integer (nullable = true)
 |-- market: string (nullable = true)
 |-- order_date: date (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_priority: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- profit: float (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- region: string (nullable = true)
 |-- row_id: string (nullable = true)
 |-- sales: integer (nullable = true)
 |-- segment: string (nullable = true)
 |-- ship_date: date (nullable = true)
 |-- ship_mode: string (nullable = true)
 |-- shipping_cost: float (nullable = true)
 |-- state: string (nullable = true)
 |-- sub_category: string (nullable = true)
 |-- week_num: string (nullable = true)



- Casting to the `date` data type is a little bit different
- We use a dedicated PySpark function called `to_date()`, which we must import first
- Then, we pass the format of our date strings to `to_date()`. In this case, we will use the `yyyy-MM-dd` format
- After that, we pass the result to the `withColumn()` method like before

**Syntax**

```python
from pyspark.sql.functions import to_date

df_pyspark.withColumn(col_name, to_date(df_pyspark[col_name], "yyyy-MM-dd"))
```

In [49]:
# import to_date method 
from pyspark.sql.functions import to_date

Now, we will cast `order_date` and `ship_date` to `date` data type

In [50]:
# cast order_date
df_superstore = df_superstore.withColumn("order_date", to_date(df_superstore["order_date"], "yyyy-MM-dd"))

# cast ship_date
df_superstore = df_superstore.withColumn("ship_date", to_date(df_superstore["ship_date"], "yyyy-MM-dd"))

In [54]:
df_superstore.show(5)

+---------------+-----------+-------------+-----------+----------------+--------+------+----------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+----------+--------------+-------------+----------+------------+--------+
|       category|       city|      country|customer_id|   customer_name|discount|market|order_date|      order_id|order_priority|     product_id|        product_name| profit|quantity|region|row_id|sales| segment| ship_date|     ship_mode|shipping_cost|     state|sub_category|week_num|
+---------------+-----------+-------------+-----------+----------------+--------+------+----------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+----------+--------------+-------------+----------+------------+--------+
|Office Supplies|Los Angeles|United States|  LS-172304|Lycoris Saunders|       0|    US|2011-01-07|CA-2011-130813|          High|OFF-PA-100020

In [55]:
df_superstore.printSchema()

root
 |-- category: string (nullable = true)
 |-- city: string (nullable = true)
 |-- country: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- discount: integer (nullable = true)
 |-- market: string (nullable = true)
 |-- order_date: date (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_priority: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- profit: float (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- region: string (nullable = true)
 |-- row_id: string (nullable = true)
 |-- sales: integer (nullable = true)
 |-- segment: string (nullable = true)
 |-- ship_date: date (nullable = true)
 |-- ship_mode: string (nullable = true)
 |-- shipping_cost: float (nullable = true)
 |-- state: string (nullable = true)
 |-- sub_category: string (nullable = true)
 |-- week_num: string (nullable = true)



**3. Create new columns based on existing columns**

---
- Now, we will create a new column called `total_sales`
- That column is generated by multiplying `quantity` by `sales`

So, our formula to calculate `total_sales` is:

$$
\text{total\_sales} = \text{quantity} \cdot \text{sales}
$$

- To generate the new column `total_sales`, we can use the `withColumn()` method
- Then, we can do the mathematical operation inside the method

**Syntax**

```python
df_pyspark = df_pyspark.withColumn(col_name, df_pyspark[col_a] * df_pyspark[col_b])
```

In [56]:
df_superstore = df_superstore.withColumn('total_sales', df_superstore['quantity']*df_superstore['sales'])

In [59]:
df_superstore.show(5)

+---------------+-----------+-------------+-----------+----------------+--------+------+----------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+----------+--------------+-------------+----------+------------+--------+-----------+
|       category|       city|      country|customer_id|   customer_name|discount|market|order_date|      order_id|order_priority|     product_id|        product_name| profit|quantity|region|row_id|sales| segment| ship_date|     ship_mode|shipping_cost|     state|sub_category|week_num|total_sales|
+---------------+-----------+-------------+-----------+----------------+--------+------+----------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+----------+--------------+-------------+----------+------------+--------+-----------+
|Office Supplies|Los Angeles|United States|  LS-172304|Lycoris Saunders|       0|    US|2011-01-07|CA-2011

**4. Create new columns by extracting from existing columns**

---
- For the last step, we will create new columns called `year_order`, `month_order`, and `day_order` by extracting data from the column `order_date`
- To do that, we can use functions that are already provided by PySpark called `year`, `month`, and `dayofmonth`

**Example**

```python
from pyspark.sql.functions import year, month, dayofmonth

# extract day
df_pyspark.withColumn(new_col_name, dayofmonth(df_pyspark[col_name]))

# extract month
df_pyspark.withColumn(new_col_name, month(df_pyspark[col_name]))

# extract year
df_pyspark.withColumn(new_col_name, year(df_pyspark[col_name]))
```

In [60]:
from pyspark.sql.functions import year, month, dayofmonth

In [62]:
df_superstore = df_superstore.withColumn('year_order', year(df_superstore['order_date']))
df_superstore = df_superstore.withColumn('month_order', month(df_superstore['order_date']))
df_superstore = df_superstore.withColumn('day_order', dayofmonth(df_superstore['order_date']))

In [63]:
df_superstore.show(5)

+---------------+-----------+-------------+-----------+----------------+--------+------+----------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+----------+--------------+-------------+----------+------------+--------+-----------+----------+-----------+---------+
|       category|       city|      country|customer_id|   customer_name|discount|market|order_date|      order_id|order_priority|     product_id|        product_name| profit|quantity|region|row_id|sales| segment| ship_date|     ship_mode|shipping_cost|     state|sub_category|week_num|total_sales|year_order|month_order|day_order|
+---------------+-----------+-------------+-----------+----------------+--------+------+----------+--------------+--------------+---------------+--------------------+-------+--------+------+------+-----+--------+----------+--------------+-------------+----------+------------+--------+-----------+----------+-----------+---------+
|Office

### **Impute Missing Values**
---

- There is a possibility that our data contains missing values.
- The common reasons for missing values are:
    - Errors during data collection
    - Corruption or errors in the database
    - Missing values on purpose

**Example of Missing Values**

<center>
    
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-intro-to-data-eng/7-1.png" alt="Drawing" width= 500px;/>
</center>

- Missing values in our data can affect the analysis results produced by the Business Team
- So, as Data Engineers, we must make sure that the data in our Data Pipeline does not contain missing values that can affect the analysis

---
In this section, we're going to handle missing values in the `data/materi_10/bank_churners.csv` data

In [69]:
df_churners.show(5)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

- To detect whether there are any missing values, we can use the `.isNull()` method and then count the rows where it is true

**Syntax**

```python
# create variable to store the columns name
get_cols_name = df_pyspark.columns

# iterate to each column data then check the missing values
for col in get_cols_name:
    # store the missing count value
    missing_count = df_pyspark.filter(df_pyspark[col].isNull()).count()

    # create branching to get the column name that contains missing values
    if missing_count > 0:
        print(f"Column {col} has {missing_count} missing values")
```

In [71]:
# get cols 
get_columns_name = df_churners.columns
print(get_columns_name)

['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']


In [74]:
for col in get_columns_name:
    missing_val = df_churners.filter(df_churners[col].isNull()).count()

    if missing_val > 0:
        print(f"Column {col} has {missing_val} missing values")

Column Education_Level has 109 missing values
Column Income_Category has 37 missing values


In [75]:
df_churners.printSchema()

root
 |-- Attrition_Flag: string (nullable = true)
 |-- Customer_Age: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Dependent_count: string (nullable = true)
 |-- Education_Level: string (nullable = true)
 |-- Marital_Status: string (nullable = true)
 |-- Income_Category: string (nullable = true)
 |-- Card_Category: string (nullable = true)
 |-- Months_on_book: string (nullable = true)
 |-- Total_Relationship_Count: string (nullable = true)
 |-- Months_Inactive_12_mon: string (nullable = true)
 |-- Contacts_Count_12_mon: string (nullable = true)
 |-- Credit_Limit: string (nullable = true)
 |-- Total_Revolving_Bal: string (nullable = true)
 |-- Avg_Open_To_Buy: string (nullable = true)
 |-- Total_Amt_Chng_Q4_Q1: string (nullable = true)
 |-- Total_Trans_Amt: string (nullable = true)
 |-- Total_Trans_Ct: string (nullable = true)
 |-- Total_Ct_Chng_Q4_Q1: string (nullable = true)
 |-- Avg_Utilization_Ratio: string (nullable = true)



---
- Ok, we got two columns that contain missing values: `Education_Level` and `Income_Category`
- As we can see, both columns contain categorical data
- For categorical data, we will impute (fill in) the missing values with the value `UNKNOWN`

In [76]:
df_churners[['Education_Level']].show(4)

+---------------+
|Education_Level|
+---------------+
|    High School|
|       Graduate|
|       Graduate|
|    High School|
+---------------+
only showing top 4 rows



In [77]:
df_churners[['Income_Category']].show(4)

+---------------+
|Income_Category|
+---------------+
|    $60K - $80K|
| Less than $40K|
|   $80K - $120K|
| Less than $40K|
+---------------+
only showing top 4 rows



To impute the missing values, we can use method `na.fill({col_name: impute_value})`

**Syntax**

```python
df_spark.na.fill({col_name_1: impute_value_1, col_name_2: impute_value_2})
```

In [78]:
df_churners = df_churners.na.fill({"Education_Level": "UNKNOWN", "Income_Category": "UNKNOWN"})
df_churners.show(4)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------

Now, we validate whether the missing values have been imputed

In [83]:
df_churners.filter("Education_Level = 'UNKNOWN'").show(1)

+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+-------------------+---------------------+
|   Attrition_Flag|Customer_Age|Gender|Dependent_count|Education_Level|Marital_Status|Income_Category|Card_Category|Months_on_book|Total_Relationship_Count|Months_Inactive_12_mon|Contacts_Count_12_mon|Credit_Limit|Total_Revolving_Bal|Avg_Open_To_Buy|Total_Amt_Chng_Q4_Q1|Total_Trans_Amt|Total_Trans_Ct|Total_Ct_Chng_Q4_Q1|Avg_Utilization_Ratio|
+-----------------+------------+------+---------------+---------------+--------------+---------------+-------------+--------------+------------------------+----------------------+---------------------+------------+-------------------+---------------+--------------------+---------------+--------------+--------