# **PHASE 1: FOUNDATION (Days 1-4)**

## **DAY 3 (11/01/26) - PySpark Transformations Deep Dive**



### **Section 1 - Learn**:

### **_1. PySpark vs Pandas comparison_**

Choosing between **PySpark** and **Pandas** usually comes down to the size of your data and the environment you are working in. While they look similar in syntax, their "engines" are fundamentally different.

Here are the key comparison points:

* **Single Machine vs. Distributed:** **Pandas** runs on a single node (your laptop or one server) and is limited by that machine's RAM. **PySpark** is distributed; it splits data into partitions and processes them across a cluster of many machines.
* **Memory Management:** **Pandas** loads the entire dataset into memory at once; if your file is 20GB and your RAM is 16GB, the load will most likely fail with an out-of-memory error. **PySpark** can handle datasets far larger than the memory of any single machine by "spilling" to disk or using the combined RAM of the whole cluster.
* **Execution Strategy:** **Pandas** is **Eager**; it executes every command as soon as you run the cell. **PySpark** is **Lazy**; transformations only build a "plan" (DAG), and Spark waits until you call an action (like `.show()`, `.count()`, or `.write.save()`) before doing the actual work.
* **Performance Bottlenecks:** For small datasets (under 1–2 GB), **Pandas** is often faster because it doesn't have the "overhead" of coordinating multiple machines. For large data, **PySpark** wins because it parallelizes the workload.
* **Immutability:** **Pandas DataFrames** are mutable (you can change values in place). **PySpark DataFrames** are immutable; every transformation creates a *new* DataFrame, which is essential for tracking data lineage and recovering from errors.
* **Ecosystem & Visualization:** **Pandas** has deep integration with the Python data science stack (Scikit-Learn, Matplotlib, Seaborn). **PySpark** requires extra steps (like converting a sample back to Pandas) to use these visualization tools easily.
* **Fault Tolerance:** If a **Pandas** job fails halfway, you lose the progress in memory. **PySpark** is fault-tolerant; if one machine in the cluster fails, the "Driver" knows how to re-run only the missing tasks on a different machine.
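
The eager-versus-lazy and immutability points above can be modelled in a few lines of plain Python. This is only an analogy, not how Spark is implemented: `LazyPlan` and `expensive_double` are hypothetical names invented for the sketch, and a real Spark plan is optimized before it runs.

```python
class LazyPlan:
    """Toy stand-in for a lazily evaluated DataFrame (not a Spark API)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # recorded transformations, nothing executed yet

    def filter(self, predicate):
        # Transformation: return a NEW plan and leave this one untouched.
        return LazyPlan(self._rows, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: also only recorded.
        return LazyPlan(self._rows, self._steps + (("map", func),))

    def collect(self):
        # Action: only now is the recorded plan executed, step by step.
        rows = list(self._rows)
        for kind, fn in self._steps:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows


calls = []  # records every time the "expensive" function really runs


def expensive_double(x):
    calls.append(x)
    return x * 2


base = LazyPlan([1, 2, 3, 4])
plan = base.filter(lambda x: x % 2 == 0).map(expensive_double)

calls_before_action = len(calls)  # still 0: nothing has run yet
result = plan.collect()           # the action triggers the work
```

After `collect()`, `result` holds the doubled even numbers, while `base` still describes the original four rows. That mirrors how a PySpark transformation returns a new DataFrame instead of changing the existing one.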

#### **Comparison Table**

| Feature | Pandas | PySpark |
| --- | --- | --- |
| **Data Size** | Small/Medium (up to 5-10GB) | Big Data (TB to PB scale) |
| **Parallelism** | Mostly single-threaded, single machine | Multi-node parallelism |
| **Optimization** | Minimal | High (Catalyst & Tungsten) |
| **Learning Curve** | Low (Very intuitive) | Moderate (Requires Spark knowledge) |
| **Streaming** | No native support | Yes (Structured Streaming) |

---

### **_2. Joins (inner, left, right, outer)_**

In PySpark, joins allow you to combine two DataFrames based on a common key. Because Spark is a distributed system, performing joins often involves a "Shuffle," where data is moved across the network to ensure matching keys end up on the same machine.

Here are the six essential join types:

* **Inner Join (Default):** The most restrictive join. It returns only the rows where there is a **matching key in both** the left and right DataFrames. If a key exists in one but not the other, it is dropped.
* **Left Outer Join:** Keeps **all rows from the left** DataFrame. If there is no match in the right DataFrame, the resulting columns from the right side will contain `null`. This is the most common join for enriching a primary dataset.
* **Right Outer Join:** The mirror of the Left Join. It keeps **all rows from the right** DataFrame. If no match is found in the left side, those columns will contain `null`. (Often avoided in favor of Left Joins for better readability).
* **Full Outer Join:** The most inclusive join. It retains **all rows from both** DataFrames. It matches rows where keys exist in both and fills in `null` wherever a match is missing on either side.
* **Left Semi Join:** A unique "filtering" join. It returns only rows from the left DataFrame that have a **match in the right** DataFrame. Unlike an Inner Join, it *only* keeps columns from the left side and does not create duplicate rows if multiple matches exist in the right.
* **Left Anti Join:** The opposite of Semi Join. It returns only the rows from the left DataFrame that **do NOT have a match** in the right. It is extremely useful for finding discrepancies or "missing" records between two datasets.
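
To make the row-level behaviour of each join type concrete, here is a small plain-Python model that runs without a cluster. It is a sketch of the semantics only, not Spark's implementation: `join`, `customers`, and `orders` are hypothetical names, rows are simple `(key, value)` pairs, and null keys are ignored.

```python
def join(left, right, how="inner"):
    """Toy model of join semantics on (key, value) pairs; not Spark's algorithm."""
    left_keys = {k for k, _ in left}
    right_keys = {k for k, _ in right}

    if how == "left_semi":
        # Filter only: left columns, at most one output row per left row.
        return [(k, lv) for k, lv in left if k in right_keys]
    if how == "left_anti":
        # Left rows that have no match on the right.
        return [(k, lv) for k, lv in left if k not in right_keys]

    # Matched pairs: one output row per matching (left, right) combination.
    rows = [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]
    if how in ("left", "full"):
        rows += [(k, lv, None) for k, lv in left if k not in right_keys]
    if how in ("right", "full"):
        rows += [(k, None, rv) for k, rv in right if k not in left_keys]
    return rows


customers = [(1, "Ana"), (2, "Ben"), (3, "Cy")]
orders = [(2, "book"), (2, "pen"), (4, "lamp")]

inner_rows = join(customers, orders, "inner")
left_rows = join(customers, orders, "left")
full_rows = join(customers, orders, "full")
semi_rows = join(customers, orders, "left_semi")
anti_rows = join(customers, orders, "left_anti")
```

Customer 2 has two orders, so the inner join returns two rows for that key while the semi join returns that customer only once. This is the "no duplicate rows" difference described in the list above.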

#### **PySpark Join Syntax**

```python
# Generic syntax
df_joined = df_left.join(df_right, df_left.id == df_right.id, "join_type")

```

#### **Performance Tip: Broadcast Joins**

When joining a **large table** with a **very small table**, Spark can perform a "Broadcast Join." Instead of shuffling the big table across the network, it sends a copy of the small table to every machine. This significantly speeds up the process and reduces network overhead.

---

### **_3. Window functions (running totals, rankings)_**

In PySpark, **Window Functions** allow you to perform calculations across a specific "window" or group of rows related to the current row, without collapsing them into a single output row (unlike a `groupBy`).

#### **The Core Components**

To use a window function, you must define a **Window Specification** using three main parts:

* **`partitionBy`**: Groups the rows (e.g., "Group by Department").
* **`orderBy`**: Sorts the rows within that group (e.g., "Sort by Salary").
* **`rowsBetween`**/**`rangeBetween`**: Defines the "frame" or boundaries (e.g., "From the start of the group to the current row").

#### **Key Use Cases**

* **Ranking Functions:**
  * **`row_number()`**: Assigns a unique, sequential number to rows (1, 2, 3, 4).
  * **`rank()`**: Assigns numbers but leaves gaps if there is a tie (1, 2, 2, 4).
  * **`dense_rank()`**: Assigns numbers without leaving gaps for ties (1, 2, 2, 3).


* **Running Totals (Cumulative Sums):**
  * Uses `sum()` combined with a window that starts from the beginning of the partition up to the current row (`Window.unboundedPreceding, Window.currentRow`).


* **Lead and Lag (Time-series analysis):**
  * **`lag()`**: Fetches a value from a *previous* row (useful for calculating day-over-day growth).
  * **`lead()`**: Fetches a value from a *subsequent* row (useful for seeing "what happens next").


* **Moving Averages:**
  * Calculates the average of a sliding window, such as the "previous 7 days" of sales data to smooth out fluctuations.


* **Analytical Aggregates:**
  * Unlike a standard `groupBy`, a Window `avg()` or `max()` appends the result as a new column to **every row** in the original dataset, allowing you to compare an individual's value against the group average.
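
The difference between the three ranking functions is easiest to see on a tie. The sketch below reproduces the numbering in plain Python for one partition that is already sorted; `rank_rows` and `salaries` are hypothetical names, and in Spark you would use `F.row_number()`, `F.rank()`, and `F.dense_rank()` over a window instead.

```python
def rank_rows(sorted_values):
    """Return (value, row_number, rank, dense_rank) for values that are already sorted."""
    out = []
    rank = dense = 0
    previous = object()  # sentinel that is not equal to any real value
    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = row_number  # rank jumps to the row position, leaving gaps after ties
            dense += 1         # dense_rank always moves up by exactly one
            previous = value
        out.append((value, row_number, rank, dense))
    return out


salaries = [900, 800, 800, 700]  # one partition, sorted descending
ranked = rank_rows(salaries)

row_numbers = [r[1] for r in ranked]
ranks = [r[2] for r in ranked]
dense_ranks = [r[3] for r in ranked]
```

The two tied salaries share a rank in both `ranks` and `dense_ranks`; only `ranks` skips a number after the tie.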



#### **PySpark Example: Running Total**

```python
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# 1. Define the window
windowSpec = Window.partitionBy("Department").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# 2. Apply the window function
df.withColumn("Running_Total", F.sum("Sales").over(windowSpec)).show()
```
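
To see what a row frame actually selects, the plain-Python sketch below applies an aggregate over a window of offsets around each row of a single, already ordered partition. It is a model of the idea rather than Spark code: `over_rows`, `lag`, and `daily_sales` are hypothetical names. An offset pair of `(None, 0)` plays the role of `rowsBetween(Window.unboundedPreceding, Window.currentRow)`, and `(-2, 0)` plays the role of `rowsBetween(-2, Window.currentRow)`.

```python
def over_rows(values, start, end, agg):
    """Apply `agg` to the rows between offsets `start` and `end` around each row.

    Offsets are relative to the current row (0); None means "unbounded".
    """
    results = []
    for i in range(len(values)):
        lo = 0 if start is None else max(0, i + start)
        hi = len(values) if end is None else min(len(values), i + end + 1)
        results.append(agg(values[lo:hi]))
    return results


def lag(values, offset=1):
    """Value from `offset` rows earlier, or None when there is no such row."""
    return [values[i - offset] if i - offset >= 0 else None for i in range(len(values))]


daily_sales = [10, 20, 30, 40]  # one partition, already ordered by date

running_total = over_rows(daily_sales, None, 0, sum)
moving_avg_3 = over_rows(daily_sales, -2, 0, lambda w: sum(w) / len(w))
previous_day = lag(daily_sales)
```

Note how the first rows of `moving_avg_3` average fewer than three values because the frame is cut off at the start of the partition; Spark's row frames behave the same way.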

#### **Performance Note**

**Avoid Empty** `partitionBy()`: If you don't define a partition, Spark moves **all data** into a single partition, which is then processed by one task on one executor (not the Driver). This removes all parallelism and can cause a "bottleneck" or an Out of Memory (OOM) error on large datasets.

---

### **_4. User-Defined Functions (UDFs)_**

User-Defined Functions (UDFs) allow you to extend PySpark’s functionality by writing custom Python code to handle transformations that aren't possible with built-in functions.

However, they come with a significant "performance tax" that every developer should understand:

* **Custom Logic Extension:** Use UDFs only when Spark’s hundreds of built-in functions (like `when`, `substring`, or `date_add`) cannot satisfy your business logic.
* **The Serialization Bottleneck:** Standard Python UDFs are slow because Spark must serialize each row, move it from the **JVM** (where Spark lives) to a **Python process**, execute your code, and then move the result back.
* **"Black Box" Problem:** Spark’s **Catalyst Optimizer** cannot "see" inside a UDF. It cannot rewrite the logic or push a filter that depends on the UDF down to the data source ("predicate pushdown"), so a UDF can run several times slower than the equivalent native functions; the exact gap depends on the workload.
* **Pandas UDFs (Vectorized):** Introduced to reduce this overhead, **Pandas UDFs** use **Apache Arrow** to move data in "batches" rather than row-by-row. This cuts serialization cost substantially; Databricks reports speedups of up to 100x over standard UDFs in some benchmarks.
* **Type Hinting Requirement:** In newer Spark versions (3.0+), you should use Python type hints (e.g., `def func(s: pd.Series) -> pd.Series`) so Spark knows exactly how to handle the data types.
* **Unity Catalog Governance:** In modern Databricks environments, you can register UDFs in the **Unity Catalog**, allowing them to be shared across the entire organization and governed with specific permissions.
* **Best Practice Rule:** **Built-in Functions > Pandas UDFs > Standard UDFs.** Always search the `pyspark.sql.functions` module before writing a custom UDF.

#### **Code Comparison: Standard vs. Pandas UDF**

| Feature | Standard UDF (`@udf`) | Pandas UDF (`@pandas_udf`) |
| --- | --- | --- |
| **Data Handling** | Row-by-row | Batch-by-batch (Vectorized) |
| **Performance** | Slow (High overhead) | Fast (Uses Apache Arrow) |
| **Input Type** | Python primitives (int, str) | `pandas.Series` or `pandas.DataFrame` |
| **Best For** | Simple, non-vectorizable logic | Statistical models, ML, complex math |

---

### **Practice**

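```python
from pyspark.sql import functions as F
```

```python
def load_ecommerce_dataset(month_name):
    """Read one month of 2019 e-commerce events (e.g. "Oct") from the workspace volume."""
    # inferSchema=True makes Spark scan the file an extra time to detect column types
    return spark.read.csv(
        f"/Volumes/workspace/ecommerce/ecommerce_data/2019-{month_name}.csv",
        header=True,
        inferSchema=True,
    )
```

```python
# df_n = load_ecommerce_dataset("Nov")
df_o = load_ecommerce_dataset("Oct")
```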

#### **1. Joins** 

```python
# Creating a subset for purchases and views
purchases = df_o.filter(df_o.event_type == "purchase").select(
    "user_session", "product_id", F.col("event_time").alias("purchase_time")
)

views = df_o.filter(df_o.event_type == "view").select(
    "user_session", "product_id", F.col("event_time").alias("view_time"), "brand", "price"
)
```

```python
# Join to find views that resulted in a purchase in the same session for the same product
conversion_df = views.join(
    purchases,
    on=["user_session", "product_id"],
    how="inner"
).filter(F.col("purchase_time") >= F.col("view_time"))
```

```python
display(conversion_df.head(5))
```

| user_session | product_id | view_time | brand | price | purchase_time |
| --- | --- | --- | --- | --- | --- |
| 56fe5a0c-5744-4a2a-aa6d-d5b4ef99fe86 | 1002633 | 2019-10-01T04:53:26.000Z | apple | 360.08 | 2019-10-01T04:59:25.000Z |
| 56fe5a0c-5744-4a2a-aa6d-d5b4ef99fe86 | 1002633 | 2019-10-01T04:54:33.000Z | apple | 360.08 | 2019-10-01T04:59:25.000Z |
| 56fe5a0c-5744-4a2a-aa6d-d5b4ef99fe86 | 1002633 | 2019-10-01T04:58:42.000Z | apple | 360.08 | 2019-10-01T04:59:25.000Z |
| 646820dd-045d-4028-9929-75627384eaa9 | 1004250 | 2019-10-01T05:14:53.000Z | apple | 804.21 | 2019-10-01T05:15:05.000Z |
| 77b8992b-6844-48ac-8c45-e1dd2f53b567 | 21900034 | 2019-10-01T05:29:46.000Z | | 3.18 | 2019-10-01T05:33:11.000Z |


#### **2. Window Functions: Category Average Price**

```python
from pyspark.sql.window import Window
```

```python
# Define the Window: Group by category only (no ordering needed for simple averages)
category_window = Window.partitionBy("category_id")
```

```python
# Apply the window function
df_simple = df_o.withColumn(
    "avg_category_price",
    F.avg("price").over(category_window)
).withColumn(
    "price_diff",
    F.col("price") - F.col("avg_category_price")
)
```

```python
display(df_simple.select("product_id", "category_id", "price", "avg_category_price", "price_diff").head(10))
```

| product_id | category_id | price | avg_category_price | price_diff |
| --- | --- | --- | --- | --- |
| 8600174 | 2053013552226107603 | 10.27 | 63.36775169300209 | -53.097751693002095 |
| 8600067 | 2053013552226107603 | 17.99 | 63.36775169300209 | -45.3777516930021 |
| 8600067 | 2053013552226107603 | 17.99 | 63.36775169300209 | -45.3777516930021 |
| 8600134 | 2053013552226107603 | 102.81 | 63.36775169300209 | 39.44224830699791 |
| 8600236 | 2053013552226107603 | 457.89 | 63.36775169300209 | 394.5222483069979 |
| 8600135 | 2053013552226107603 | 16.71 | 63.36775169300209 | -46.65775169300209 |
| 8600235 | 2053013552226107603 | 463.73 | 63.36775169300209 | 400.3622483069979 |
| 8600094 | 2053013552226107603 | 69.47 | 63.36775169300209 | 6.102248306997907 |
| 8600204 | 2053013552226107603 | 30.96 | 63.36775169300209 | -32.40775169300209 |
| 8600227 | 2053013552226107603 | 13.77 | 63.36775169300209 | -49.597751693002095 |


#### **3. Create derived features**

```python
# Build one row per user: last activity, average price, and most frequent brand
user_profile = df_o.groupBy("user_id").agg(
    F.max("event_time").alias("last_active"),
    F.avg("price").alias("avg_interacted_price"),
    F.mode("brand").alias("favorite_brand")  # Requires Spark 3.4+
)
```

```python
# Calculate days since last activity, measured from the latest event
# in the dataset (not from today's date)
latest_event_time = df_o.select(F.max("event_time")).collect()[0][0]
user_profile = user_profile.withColumn(
    "days_since_active",
    F.datediff(F.lit(latest_event_time), F.col("last_active"))
)
```

```python
display(user_profile.head(10))
```

| user_id | last_active | avg_interacted_price | favorite_brand | days_since_active |
| --- | --- | --- | --- | --- |
| 522453283 | 2019-10-01T00:11:44.000Z | 136.2 | samsung | 30 |
| 554935393 | 2019-10-01T00:17:43.000Z | 225.5388888888889 | bosch | 30 |
| 547718922 | 2019-10-21T12:15:31.000Z | 292.96183333333335 | lenovo | 10 |
| 526949848 | 2019-10-21T15:54:20.000Z | 290.73660377358493 | luminarc | 10 |
| 555467783 | 2019-10-01T02:58:12.000Z | 15.42 | samsung | 30 |
| 548060580 | 2019-10-02T04:29:56.000Z | 159.7057142857143 | samsung | 29 |
| 516407514 | 2019-10-15T05:10:19.000Z | 224.0890909090909 | vivo | 16 |
| 513191866 | 2019-10-26T06:39:34.000Z | 435.6627272727273 | samsung | 5 |
| 513386589 | 2019-10-29T06:07:56.000Z | 250.3921739130435 | samsung | 2 |
| 512769367 | 2019-10-23T09:22:59.000Z | 195.60457627118643 | beko | 8 |


---

### **Resources**
- [PySpark docs (Databricks)](https://docs.databricks.com/pyspark/)
- [PySpark window functions](https://medium.com/@uzzaman.ahmed/pyspark-window-functions-a-comprehensive-guide-dc9bdad8c7ae)
- [PySpark UDFs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html)
- [PySpark joins](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html)

----