# **PHASE 1: FOUNDATION (Days 1-4)**

## **DAY 2 (10/01/26) – Apache Spark Fundamentals**



### **Section - 1 - Learn**:

### **_1. Spark architecture (driver, executors, DAG)_**

Spark's architecture follows a **master-worker model** (historically called master-slave): one driver process coordinates many executors, which lets Spark process massive datasets in parallel by splitting them into smaller partitions.

Here is the breakdown of its core components and how they work together:

##### 1. The Spark Driver (The Brain)

* **Central Coordinator:** It is the "master" process that runs your `main()` function and creates the `SparkSession` (or `SparkContext`).
* **Plan Creator:** It converts your high-level code (Python/SQL) into a logical execution plan.
* **Task Scheduler:** It breaks the plan into smaller units of work called **Tasks** and decides which executors should run them.
* **Metadata Keeper:** It tracks the location of data across the cluster and keeps tabs on the status of all running jobs.

##### 2. The Spark Executors (The Muscles)

* **Workhorses:** These are processes launched on **Worker Nodes**. Their only job is to execute the tasks assigned by the Driver.
* **Data Storage:** They store data in-memory (RAM) or on disk for fast access during processing (caching).
* **State Reporting:** They regularly send "heartbeats" back to the Driver to confirm they are still alive, and they report the success or failure of their tasks.
* **Isolation:** Each Spark application has its own dedicated executors, so a failure in one application doesn't crash another.

##### 3. The DAG (Directed Acyclic Graph)

* **The Blueprint:** When you write code, Spark doesn’t run it immediately (**Lazy Evaluation**). Instead, it builds a **DAG**—a step-by-step map of every transformation you want to perform.
* **Directed:** The data flows in one specific direction (from source to result).
* **Acyclic:** There are no loops; once a transformation is done, the data moves forward to the next stage.
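* **Optimization:** The **DAG Scheduler** splits the graph into stages at shuffle boundaries and pipelines consecutive narrow transformations (such as `map` and `filter`) inside one stage, so they run in a single pass over each partition. Rewrites such as merging two `filter` operations into one happen earlier, in the **Catalyst Optimizer**, and only for DataFrame/SQL code.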

##### The Execution Flow: From Code to Result

1. **Action Triggered:** You call an action such as `.collect()` or `.count()`, or write output with `df.write.save()`.
2. **DAG Created:** The Driver builds the DAG of all preceding transformations.
3. **Stages Divided:** The DAG is split into **Stages** based on "shuffles" (when data needs to move between machines).
4. **Tasks Launched:** Each stage is broken into **Tasks** (one per data partition).
5. **Execution:** Tasks are sent to Executors, which run the code and return results to the Driver.
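
The stage and task split in steps 3 and 4 can be sketched with a toy model in plain Python. This is not Spark's scheduler: the `Op` class, the `split_into_stages` and `count_tasks` helpers, and the partition count of 4 are hypothetical names and values chosen for illustration. The only rule it encodes is the one described above: a shuffle (wide) operation starts a new stage, and each stage runs one task per partition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    wide: bool  # True when the operation needs a shuffle (e.g. groupBy, join)


def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting at each shuffle."""
    stages = [[]]
    for op in ops:
        # A wide operation cannot be pipelined with what came before it,
        # so it opens a new stage (unless the current stage is still empty).
        if op.wide and stages[-1]:
            stages.append([])
        stages[-1].append(op.name)
    return stages


def count_tasks(stages, num_partitions):
    """One task per partition per stage (assumes a fixed partition count)."""
    return len(stages) * num_partitions


# Hypothetical plan: read -> filter -> groupBy -> orderBy
plan = [
    Op("read_csv", wide=False),
    Op("filter", wide=False),
    Op("groupBy", wide=True),
    Op("orderBy", wide=True),
]

stages = split_into_stages(plan)
total_tasks = count_tasks(stages, num_partitions=4)
print(stages)       # [['read_csv', 'filter'], ['groupBy'], ['orderBy']]
print(total_tasks)  # 12
```

Here `read_csv` and `filter` share a stage because neither moves data between machines, while `groupBy` and `orderBy` each force a shuffle. In real Spark the partition count usually changes after a shuffle, so the fixed count is a simplification.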

---

### **_2. DataFrames vs RDDs_**

In Apache Spark, **RDDs (Resilient Distributed Datasets)** are the foundational building blocks, while **DataFrames** are a more modern, optimized abstraction. Think of RDDs as the "low-level assembly language" and DataFrames as the "high-level SQL" of the Spark world.

##### **Key Differences**

* **Abstraction Level:** RDDs are a collection of **Java/Python objects** distributed across nodes with no inherent structure. DataFrames organize data into **named columns**, similar to a table in a relational database.
* **Performance Optimization:** DataFrames use the **Catalyst Optimizer** and **Tungsten execution engine** to automatically optimize queries. RDDs have no built-in optimization; the performance depends entirely on how efficiently you write your code.
* **Schema Awareness:** DataFrames are **schema-based**, meaning they understand data types (Integer, String, etc.) and can perform "predicate pushdown" (filtering data at the source). RDDs are "schema-less" and treat data as raw objects.
* **Ease of Use:** DataFrames provide a **declarative API** (like `df.select("name")`) which is much more concise. RDDs require functional programming constructs (like `rdd.map(lambda x: x[0])`), which often leads to more verbose code.
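* **Serialization Overhead:** DataFrames store rows in Tungsten's compact **binary format**, which can also be kept off-heap, so Spark can work on the data without creating a JVM object per record. RDDs hold individual objects that must be serialized and deserialized whenever data is shuffled or cached in serialized form, which adds CPU cost and Java Garbage Collection (GC) pressure.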
* **Type Safety:** RDDs provide **compile-time type safety** (in Scala/Java), meaning errors are caught before the code runs. DataFrames (in Python/Scala) are generally checked at **runtime**, though Datasets provide a middle ground for type safety in Scala.

##### **When to Use Each**

| Feature | Use RDD when... | Use DataFrame when... |
| --- | --- | --- |
| **Data Type** | Unstructured (media, raw text) | Structured/Semi-structured (JSON, CSV, Parquet) |
| **Control** | You need low-level, fine-grained control | You want the engine to optimize for you |
| **APIs** | Using functional transformations | Using SQL-like queries or Spark SQL |
| **Performance** | Performance is secondary to custom logic | Performance and speed are critical |
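
One way to see why the declarative API is easier to optimize is a toy planner in plain Python. This is only a sketch, not Catalyst: `columns_to_read`, the tuple form of the predicate, and the column list are hypothetical. The point is that a condition expressed as data (DataFrame style) can be inspected, whereas an arbitrary function (RDD style) is opaque to the engine.

```python
def columns_to_read(all_columns, select, where):
    """Return the columns a (toy) planner would have to read from the source."""
    if callable(where):
        # RDD style: the filter is an arbitrary function, so the planner
        # cannot tell which fields it touches and must read every column.
        return list(all_columns)
    # DataFrame style: the filter is a (column, operator, value) description,
    # so the planner knows exactly which column it depends on.
    filter_column, _operator, _value = where
    needed = set(select) | {filter_column}
    return [c for c in all_columns if c in needed]


columns = ["event_time", "event_type", "brand", "price", "user_id"]

# Declarative predicate: only the selected and filtered columns are needed.
declarative = columns_to_read(columns, select=["brand"], where=("price", ">", 500))

# Opaque predicate: same logic, but hidden inside a lambda.
opaque = columns_to_read(columns, select=["brand"], where=lambda row: row["price"] > 500)

print(declarative)  # ['brand', 'price']
print(opaque)       # ['event_time', 'event_type', 'brand', 'price', 'user_id']
```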

---

### **_3. Lazy evaluation_**

In Apache Spark, **Lazy Evaluation** means that Spark does not execute transformations immediately when you write them. Instead, it records them in a "to-do list" and only executes them when an **Action** is called.

Here are the key points to understand how it works:

* **Transformations vs. Actions:** Spark operations are split into two types. **Transformations** (like `filter()`, `map()`, `select()`) are lazy and just build the plan. **Actions** (like `show()`, `count()`, `collect()`, or writing output with `df.write.save()`) trigger the actual computation.
* **The Logical Plan (The Blueprint):** When you chain transformations together, Spark builds a **Directed Acyclic Graph (DAG)**. This is a map of all the steps needed to get from your raw data to the final result without actually moving any data yet.
* **Query Optimization:** Because Spark waits until the action is called, it can look at the entire DAG and optimize it. For example, if you filter a dataset and then select two columns, Spark's **Catalyst Optimizer** can push the filter toward the data source and prune unused columns, so a columnar format such as Parquet only reads the selected columns plus any column the filter needs. This can save a large amount of I/O.
* **Reduced Data Transfer:** By evaluating the whole plan at once, Spark avoids loading unnecessary data into RAM. It only pulls exactly what is needed to satisfy the final Action.
* **Fault Tolerance:** Since the DAG records the lineage of the data (the history of how it was built), if a machine fails mid-calculation, Spark can re-run only the missing pieces of the "blueprint" to reconstruct the lost data.
* **Efficiency in Iteration:** Lazy evaluation allows Spark to pipeline operations. Instead of passing over the data once for a "filter" and again for a "map," it can perform both operations in a single pass over the data.

##### **Example Scenario**

If you write:

1. Load 1TB file.
2. Filter for "Year = 2024".
3. Count the rows.

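Spark runs **no job over the data** during steps 1 and 2 (apart from an optional schema-inference pass if you request one, for example `inferSchema=True` on a CSV). The real work starts at step 3. Because it was "lazy," Spark sees the filter and the count together. If the data is stored in a columnar format such as Parquet, or is partitioned by year, Spark can skip everything that cannot contain 2024 records. A plain CSV file still has to be scanned in full, but the filter and the count run in a single pass.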
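
The scenario above can be imitated with a small plain-Python class. This is a toy model, not Spark's implementation: `LazyDataset`, `source`, and the `reads` list are hypothetical names. It shows two ideas from this section: transformations only record steps, and the action runs all recorded steps in a single pass over the data.

```python
class LazyDataset:
    """Toy lazy collection: transformations record steps, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that yields rows when invoked
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    def filter(self, predicate):
        # Transformation: nothing is read, the step is only recorded.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: again, only recorded.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def _run(self):
        # Single pass: every row flows through all recorded steps in order.
        for row in self._source():
            keep = True
            for kind, fn in self._steps:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                yield row

    def count(self):
        # Action: triggers the actual computation.
        return sum(1 for _ in self._run())

    def collect(self):
        # Action: triggers the computation and returns all rows.
        return list(self._run())


reads = []  # records every row pulled from the "file"


def source():
    for year in (2023, 2024, 2024, 2025):
        reads.append(year)
        yield {"year": year}


ds = LazyDataset(source).filter(lambda r: r["year"] == 2024).map(lambda r: r["year"])
reads_before_action = len(reads)  # still 0: no action has been called yet
matching = ds.count()             # the action reads the source once

print(reads_before_action, matching, len(reads))  # 0 2 4
```

Like a CSV scan in Spark, this toy source still reads every row once the action runs. Skipping data entirely needs help from the storage layout, such as partitioning or columnar statistics.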

---


### **_4. Notebook magic commands (%sql, %python, %fs)_**

In Databricks, **Magic Commands** are special commands, prefixed with `%` and placed at the start of a cell, that let you switch languages, interact with the file system, or manage the environment within a single notebook.

Here are the essential magic commands every user should know:

##### 1. Language Switchers

These allow you to use multiple languages in a single notebook, regardless of the notebook's "default" language.

* **`%python`**: Executes the cell as Python code.
* **`%sql`**: Runs a SQL query against your registered tables.
* **`%scala`**: Executes Scala code.
* **`%r`**: Executes R code.

##### 2. File System & OS Interaction

These are used to manage data files and the underlying virtual machine environment.

* **`%fs`**: Shorthand for `dbutils.fs`, the utility for working with the Databricks File System (DBFS). Used to list files (`ls`), copy (`cp`), or delete (`rm`).
  * *Example:* `%fs ls /databricks-datasets`


* **`%sh`**: Allows you to run **Shell commands** (Bash) on the driver node. Useful for installing libraries via `apt-get` or checking disk space.
  * *Example:* `%sh df -h`

##### 3. Environment & Documentation

* **`%md`**: Renders the cell as **Markdown**. This is how you create professional documentation, headers, and bullet points within your code.
* **`%pip`**: Installs Python libraries scoped to the current notebook session (notebook-scoped libraries), so other notebooks attached to the same cluster are not affected.
  * *Example:* `%pip install seaborn`


* **`%run`**: Allows you to call and execute **another notebook** from within your current one. This is great for modularizing code (e.g., running a "Configuration" notebook).

##### 4. Utility & Debugging

* **`%ls`**: A shortcut for listing files in the current working directory of the driver.
* **`%lsmagic`**: Lists every magic command available in your current Databricks environment.

----

### **Practice**

    import pyspark.sql.functions as F

    def load_ecommerce_dataset(month_name):
        """Read one month of 2019 e-commerce events, e.g. month_name="Oct"."""
        # `spark` is the SparkSession that Databricks notebooks provide.
        path = f"/Volumes/workspace/ecommerce/ecommerce_data/2019-{month_name}.csv"
        return spark.read.csv(path, header=True, inferSchema=True)

    # df_n = load_ecommerce_dataset("Nov")
    df_o = load_ecommerce_dataset("Oct")

    df_o.columns
    # ['event_time', 'event_type', 'product_id', 'category_id', 'category_code',
    #  'brand', 'price', 'user_id', 'user_session']

    # limit(10) keeps the result a DataFrame, which is what display() expects
    display(df_o.select("event_time", "category_code", "brand", "price").limit(10))

| event_time | category_code | brand | price |
| --- | --- | --- | --- |
| 2019-10-01 00:00:00 UTC | | shiseido | 35.79 |
| 2019-10-01 00:00:00 UTC | appliances.environment.water_heater | aqua | 33.2 |
| 2019-10-01 00:00:01 UTC | furniture.living_room.sofa | | 543.1 |
| 2019-10-01 00:00:01 UTC | computers.notebook | lenovo | 251.74 |
| 2019-10-01 00:00:04 UTC | electronics.smartphone | apple | 1081.98 |
| 2019-10-01 00:00:05 UTC | computers.desktop | pulser | 908.62 |
| 2019-10-01 00:00:08 UTC | | creed | 380.96 |
| 2019-10-01 00:00:08 UTC | | luminarc | 41.16 |
| 2019-10-01 00:00:10 UTC | apparel.shoes.keds | baden | 102.71 |
| 2019-10-01 00:00:11 UTC | electronics.smartphone | huawei | 566.01 |


    # Summary statistics for numeric price
    display(df_o.describe("price"))

| summary | price |
| --- | --- |
| count | 42448764.0 |
| mean | 290.32366068491405 |
| stddev | 358.2691553394021 |
| min | 0.0 |
| max | 999.82 |


    # Count total rows vs. unique users
    print(f"Total Events: {df_o.count()}")                               # Total Events: 42448764
    print(f"Unique Users: {df_o.select('user_id').distinct().count()}")  # Unique Users: 3022290


    # Filter only purchases
    purchase_df = df_o.filter(df_o.event_type == "purchase")

    # Filter for a specific brand (e.g., Apple) and price above 500
    premium_apple = df_o.filter((df_o.brand == "apple") & (df_o.price > 500))

    # limit(10) keeps the result a DataFrame, which is what display() expects
    display(purchase_df.limit(10))

| event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2019-10-01T00:02:14.000Z | purchase | 1004856 | 2053013555631882655 | electronics.smartphone | samsung | 130.76 | 543272936 | 8187d148-3c41-46d4-b0c0-9c08cd9dc564 |
| 2019-10-01T00:04:37.000Z | purchase | 1002532 | 2053013555631882655 | electronics.smartphone | apple | 642.69 | 551377651 | 3c80f0d6-e9ec-4181-8c5c-837a30be2d68 |
| 2019-10-01T00:06:02.000Z | purchase | 5100816 | 2053013553375346967 | | xiaomi | 29.51 | 514591159 | 0e5dfc4b-2a55-43e6-8c05-97e1f07fbb56 |
| 2019-10-01T00:07:07.000Z | purchase | 13800054 | 2053013557418656265 | furniture.bathroom.toilet | santeri | 54.42 | 555332717 | 1dea3ee2-2ded-42e8-8e7a-4e2ad6ae942f |
| 2019-10-01T00:09:26.000Z | purchase | 4804055 | 2053013554658804075 | electronics.audio.headphone | apple | 189.91 | 524601178 | 2af9b570-0942-4dcd-8f25-4d84fba82553 |
| 2019-10-01T00:09:54.000Z | purchase | 4804056 | 2053013554658804075 | electronics.audio.headphone | apple | 161.98 | 551377651 | 3c80f0d6-e9ec-4181-8c5c-837a30be2d68 |
| 2019-10-01T00:10:08.000Z | purchase | 1002524 | 2053013555631882655 | electronics.smartphone | apple | 515.67 | 524325294 | 0b74a829-f9d7-4654-b5b0-35bc9822c238 |
| 2019-10-01T00:10:56.000Z | purchase | 6200687 | 2053013552293216471 | appliances.environment.air_heater | oasis | 28.03 | 548691404 | b67cdbcb-b073-4271-b365-803c6fce53b0 |
| 2019-10-01T00:12:14.000Z | purchase | 4802036 | 2053013554658804075 | electronics.audio.headphone | apple | 171.56 | 533624186 | e5ac3caa-e6d5-4d6b-ae06-2c18cd9ca683 |
| 2019-10-01T00:14:14.000Z | purchase | 1004932 | 2053013555631882655 | electronics.smartphone | vivo | 463.31 | 555083442 | 83d12d1a-5452-4fa0-abbb-d9f492f8b562 |


    # limit(10) keeps the result a DataFrame, which is what display() expects
    display(premium_apple.limit(10))

| event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2019-10-01T00:00:04.000Z | view | 1004237 | 2053013555631882655 | electronics.smartphone | apple | 1081.98 | 535871217 | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |
| 2019-10-01T00:00:19.000Z | view | 1005135 | 2053013555631882655 | electronics.smartphone | apple | 1747.79 | 535871217 | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |
| 2019-10-01T00:00:20.000Z | view | 1003306 | 2053013555631882655 | electronics.smartphone | apple | 588.77 | 555446831 | 6ec635da-ea15-4a5d-96b4-c8ca9d38f89f |
| 2019-10-01T00:00:24.000Z | view | 1003306 | 2053013555631882655 | electronics.smartphone | apple | 588.77 | 555446831 | 6ec635da-ea15-4a5d-96b4-c8ca9d38f89f |
| 2019-10-01T00:00:43.000Z | view | 1005135 | 2053013555631882655 | electronics.smartphone | apple | 1747.79 | 535871217 | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |
| 2019-10-01T00:00:50.000Z | view | 1005105 | 2053013555631882655 | electronics.smartphone | apple | 1415.48 | 529755884 | 0b828fb6-99bd-4d26-beb3-3021f5d6102c |
| 2019-10-01T00:01:30.000Z | view | 1005115 | 2053013555631882655 | electronics.smartphone | apple | 975.57 | 514218020 | d7c4761f-de75-454b-9164-177db5e53695 |
| 2019-10-01T00:01:39.000Z | view | 1004258 | 2053013555631882655 | electronics.smartphone | apple | 735.05 | 513758741 | a8e9a7cf-2708-43b1-90ae-3a9b11bc4300 |
| 2019-10-01T00:01:43.000Z | view | 5100855 | 2053013553341792533 | electronics.clocks | apple | 617.52 | 554190174 | 46279e69-5c2a-4c3c-945b-460268bc683c |
| 2019-10-01T00:01:44.000Z | view | 1003317 | 2053013555631882655 | electronics.smartphone | apple | 957.53 | 514218020 | d7c4761f-de75-454b-9164-177db5e53695 |


    # Average price per category
    avg_price_cat = df_o.groupBy("category_code").agg(F.avg("price").alias("avg_price"))

    display(avg_price_cat)

| category_code | avg_price |
| --- | --- |
| stationery.cartrige | 25.680992125984243 |
| electronics.video.tv | 441.8972821728816 |
| accessories.wallet | 59.78795994677888 |
| appliances.kitchen.juicer | 110.28602360379338 |
| | 184.9124529313098 |
| construction.tools.welding | 222.5182847865556 |
| appliances.environment.air_heater | 51.41798860596159 |
| country_yard.furniture.hammok | 133.31488505747134 |
| apparel.shoes | 89.58188990457958 |
| electronics.audio.microphone | 142.35853790489384 |

---

### **Resources**
- [PySpark docs on Databricks](https://docs.databricks.com/pyspark/)
- [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [PySpark in an easy way](https://medium.com/the-researchers-guide/introduction-to-pyspark-a61f7217398e)

----