# <font color='blue'>Data Pipeline using PySpark</font>
---

**Outline**

1. Review:
    - Introduction to Apache Spark & PySpark I
    - Introduction to Apache Spark & PySpark II
2. Case Study: Data Pipeline Movie Data

### **Create Spark Session**
---

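We name the session `"Data Pipeline using PySpark"`. The `spark.jars.packages` setting pulls in the PostgreSQL JDBC driver, which is needed later to read from the database.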

In [1]:
# import library
import pyspark

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0") \
    .appName("Data Pipeline using PySpark") \
    .getOrCreate()

In [4]:
spark

### **1. Extract Process**
---

#### **1a**
---

- Read the CSV data in `data/ratings.csv` using PySpark
- Save it to the `df_ratings` variable

In [5]:
def read_data_csv(path: str, filename: str):
    """
    Read a CSV file with a header row into a Spark DataFrame.
    Returns None if reading fails.
    """
    try:
        # build the file location from the function arguments
        data = spark.read.option("header", "true").csv(path + filename)
        return data
    except Exception as e:
        print(f"ERROR : {e}")

In [6]:
DATA_PATH = '../data/'
filename = 'ratings.csv'
df_ratings = read_data_csv(DATA_PATH, filename)
df_ratings.show(3)

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    110|   1.0|1425941529|
|     1|    147|   4.5|1425942435|
|     1|    858|   5.0|1425941523|
+------+-------+------+----------+
only showing top 3 rows



In [7]:
df_ratings.count()

26024289
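
With only the `header` option set, Spark reads every CSV column as a string. The following is a minimal sketch of an alternative reader that supplies a schema at read time. The helper names `csv_path` and `read_csv_with_schema`, the `RATINGS_SCHEMA` string, and the column types in it are assumptions for illustration, not part of this notebook.

```python
import posixpath

# Assumed column types for ratings.csv (illustrative; adjust after profiling).
RATINGS_SCHEMA = "userId INT, movieId INT, rating DOUBLE, timestamp BIGINT"


def csv_path(data_dir: str, filename: str) -> str:
    """Join directory and file name with exactly one '/' between them."""
    return posixpath.join(data_dir, filename)


def read_csv_with_schema(spark, data_dir: str, filename: str, schema: str):
    """Read a CSV file with a header row, applying a DDL schema string."""
    return spark.read.csv(csv_path(data_dir, filename), header=True, schema=schema)


# Usage in the notebook (not executed here):
# df_ratings = read_csv_with_schema(spark, "../data", "ratings.csv", RATINGS_SCHEMA)
```

Passing the session as an argument keeps the helper independent of a global `spark` variable, and joining the path this way works whether or not the directory ends with a slash.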

#### **1b**
---

In [8]:
pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.1
Note: you may need to restart the kernel to use updated packages.


In [9]:
from dotenv import load_dotenv
import os

In [10]:
# Load .env file
load_dotenv()

# Access variables
db_host = os.getenv("POSTGRES_URL_MOVIE")  # JDBC URL, passed as `url` when reading the table
db_name = os.getenv("POSTGRES_DB_MOVIE")
db_user = os.getenv("POSTGRES_USER_MOVIE")
db_pass = os.getenv("POSTGRES_PASSWORD_MOVIE")
conn_properties = {
    "user": db_user,
    "password": db_pass,
    "driver": "org.postgresql.Driver" # set driver postgres
}
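
`os.getenv` returns `None` when a variable is not set, so a missing `.env` entry only shows up later as a connection error. A minimal sketch of an early check, assuming the four variable names used above; the helper names `missing_env_vars` and `require_env_vars` are hypothetical.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = (
    "POSTGRES_URL_MOVIE",
    "POSTGRES_DB_MOVIE",
    "POSTGRES_USER_MOVIE",
    "POSTGRES_PASSWORD_MOVIE",
)


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


def require_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Raise a clear error if any required variable is missing."""
    missing = missing_env_vars(env, required)
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

Calling `require_env_vars()` right after `load_dotenv()` would stop the notebook before any database call is attempted.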

In [11]:
TABLE_NAME = 'movies_metadata'
df_metadata = spark.read.jdbc(url=db_host, table=TABLE_NAME, properties=conn_properties)

In [12]:
df_metadata.show(3, truncate = False, vertical = True)

-RECORD 0----------------------
 adult                 | False
 belongs

In [13]:
df_metadata.count()

45466

### **2. Source to Target Mapping**
---

- After extracting the data, we see that the ratings and the movie metadata can be joined on the movie ID
- Before joining, we should profile both tables; the profile informs the next steps
- You can create your own Source to Target Mapping or refer to this [Spreadsheet](https://docs.google.com/spreadsheets/d/1spFcpnUdoiKW2dApInxDNkadywiBs2mdCt8CmwHZ1ME/edit?usp=sharing)
- The goal of Source to Target Mapping is to identify which columns are carried into the next process and which transformation rules apply to them
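
A Source to Target Mapping can also be kept as plain data next to the code. The sketch below is one possible layout, not the spreadsheet's actual structure: the two renames are the ones applied later in this notebook, while the `rule` labels and the `rename_map` helper are assumptions for illustration.

```python
# Each entry: source column, target column, and the rule to apply.
SOURCE_TO_TARGET = [
    {"source": "userId", "target": "user_id", "rule": "rename"},
    {"source": "movieId", "target": "movie_id", "rule": "rename"},
]


def rename_map(mapping):
    """Collect the source -> target pairs whose rule is a rename."""
    return {row["source"]: row["target"] for row in mapping if row["rule"] == "rename"}


print(rename_map(SOURCE_TO_TARGET))
```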

### **3. Transform Data**
---

- After mapping all the columns in the Source to Target Mapping, we can define the transformation rules
- These are the transformations we will apply in this case:
    - Join Data
    - Renaming Columns
    - Select Data based on Columns
    - Casting Data Type
    - Filter Data
    - Create New Columns using Existing Columns

#### **3a. Join Data**
---

- In this step, we join the movie ratings with the movie metadata
- Exploring the data in the previous step shows that each table has a column we can join on:
    - `df_ratings`: `movieId`
    - `df_metadata`: `id`

    <br>
    <center>
        <img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-data-ingestion-spark/join_data_case_week_5.png" width=50%>
    </center>
    <br>

- To join the data using PySpark, we can use this code

```python
df1.join(df2, df1.col == df2.col, JOIN_METHOD) # we can use inner, left, right, etc
```

- In this case, we will use an `inner` join and save the joined data in the `df_data` variable

In [15]:
df_data = df_ratings.join(df_metadata, df_ratings.movieId == df_metadata.id, "inner")
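
To make the effect of an `inner` join concrete, here is a small pure-Python sketch of the same semantics on a few rows. The rating rows come from the sample output above; the metadata rows, including the titles, are made up for illustration, and `inner_join` is a hypothetical helper rather than a Spark function.

```python
ratings = [
    {"userId": 1, "movieId": 110, "rating": 1.0},
    {"userId": 1, "movieId": 147, "rating": 4.5},
    {"userId": 1, "movieId": 858, "rating": 5.0},
]
# Made-up metadata rows: id 147 is deliberately absent, id 999 has no ratings.
metadata = [
    {"id": 110, "title": "movie A"},
    {"id": 858, "title": "movie B"},
    {"id": 999, "title": "movie C"},
]


def inner_join(left, right, left_key, right_key):
    """Pair every left row with every right row whose key matches."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[right_key], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[left_key], [])
    ]


joined = inner_join(ratings, metadata, "movieId", "id")
for row in joined:
    print(row)
```

Only rows with a match on both sides survive: the rating for movie 147 and the metadata for movie 999 are dropped, which is why the joined row count can be lower than the count of either input.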

In [16]:
df_data.show(5)

+------+-------+------+----------+-----+---------------------+------+--------------------+--------+------+---------+-----------------+--------------+--------------------+----------+--------------------+--------------------+--------------------+------------+-------+-------+--------------------+--------+--------------------+--------------+-----+------------+----------+
|userId|movieId|rating| timestamp|adult|belongs_to_collection|budget|              genres|homepage|    id|  imdb_id|original_language|original_title|            overview|popularity|         poster_path|production_companies|production_countries|release_date|revenue|runtime|    spoken_languages|  status|             tagline|         title|video|vote_average|vote_count|
+------+-------+------+----------+-----+---------------------+------+--------------------+--------+------+---------+-----------------+--------------+--------------------+----------+--------------------+--------------------+--------------------+------------+---

In [17]:
df_data.show(3, truncate=False, vertical=True)

-RECORD 0----------------------
 userId                | 429

#### **3b. Rename Columns**
---

- After joining the data, it turns out that some column names are not in the format we want
- The columns we want to rename are:
    - `userId`: `user_id`
    - `movieId`: `movie_id`
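
One way to apply such a mapping, sketched under the assumption that all column names are replaced in a single call to `DataFrame.toDF`. The helper `renamed_columns` is hypothetical. On Spark 3.4 and later, `DataFrame.withColumnsRenamed` accepts the same kind of dictionary directly.

```python
# Mapping from the list above: old name -> new name.
COLUMNS_RENAME = {
    "userId": "user_id",
    "movieId": "movie_id",
}


def renamed_columns(columns, mapping):
    """Return the column names with `mapping` applied; unmapped names are kept."""
    return [mapping.get(name, name) for name in columns]


# Usage with the joined DataFrame (not executed here):
# df_data = df_data.toDF(*renamed_columns(df_data.columns, COLUMNS_RENAME))
```

Computing the new names as a plain list keeps the renaming logic easy to check without a running Spark session.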

In [None]:
COLUMNS_RENAME = {
    "userId": "user_id",
    "movieId": "movie_id"
}

df_data = df_data.withRenames