#DAY 2 (10/01/26) – Apache Spark Fundamentals

#Section 1: Notebook Setup & Dataset Upload

In [0]:
# Day 2 – Apache Spark Fundamentals

##**Topics Covered:**
## - Spark architecture (Driver, Executors, DAG)
## - DataFrames vs RDDs
## - Lazy Evaluation
## - Notebook magic commands (%sql, %python, %fs)

##**Dataset:** Sample e-commerce CSV uploaded to Databricks workspace


###Upload CSV to Databricks

In [0]:
data_path = "/Volumes/workspace/ecommerce/ecommerce_data/*.csv"

#Section 2: Spark Architecture (Driver, Executors, DAG)

##Load Data into DataFrame

In [0]:
df = spark.read.option("header", "true") \
               .option("inferSchema", "true") \
               .csv(data_path)


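###Lazy evaluation at load time

Transformations are lazy: the driver only builds a logical plan and does not process the data until an action is called. The read above is a partial exception, because `inferSchema` makes Spark scan the CSV files once to determine the column types.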

In [0]:
# Preview schema
df.printSchema()


### Parsed Logical Plan --> Analyzed Logical Plan --> Optimized Logical Plan --> Physical Plan

In [0]:
df.explain(True)

###Simple Transformations (Observe DAG)

In [0]:
# Select + Filter (Narrow transformations)
df_filtered = df.select("event_type", "category_code", "price") \
                .filter(df.event_type == "purchase") \
                .filter(df.price > 100)

#Lazy evaluation — no job yet


### Apply Action

In [0]:
# Trigger action
df_filtered.show(10)


In [0]:
#Wide Transformation Example (groupBy + orderBy)
df_grouped = df_filtered.groupBy("category_code") \
                        .count() \
                        .orderBy("count", ascending=False)

df_grouped.show(5)
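
The difference between narrow and wide transformations can be sketched without Spark. The block below is a toy model in plain Python, not Spark's implementation: the `partitions` data, the helper names, and the byte-sum routing rule are all assumptions made for illustration. A narrow step works on each partition independently, while a wide step has to route rows by key so that every row for a given key ends up in the same place (the shuffle).

```python
from collections import defaultdict

# Toy model: a "dataset" split into partitions (lists of rows).
# Rows are (category_code, price) tuples; all values are made up.
partitions = [
    [("electronics", 250.0), ("apparel", 120.0)],
    [("electronics", 310.0), ("furniture", 90.0)],
    [("apparel", 150.0), ("electronics", 80.0)],
]


def narrow_filter(parts, predicate):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def wide_count_by_key(parts, num_output_partitions=2):
    """Wide: rows with the same key must be brought together (a shuffle)."""
    shuffled = [defaultdict(int) for _ in range(num_output_partitions)]
    for part in parts:
        for key, _price in part:
            # Deterministic routing so every row for a key lands in one bucket.
            target = sum(key.encode()) % num_output_partitions
            shuffled[target][key] += 1
    return [dict(bucket) for bucket in shuffled]


filtered = narrow_filter(partitions, lambda row: row[1] > 100)
counts = wide_count_by_key(filtered)

# Combine the per-bucket results, as the driver would when collecting output.
merged = {key: n for bucket in counts for key, n in bucket.items()}
print(merged)
```

The filter never needs to look outside its own partition, which is why Spark can pipeline `select` and `filter` in one stage. The count needs data from every partition, which is why `groupBy` and `orderBy` introduce a stage boundary in the DAG.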


# Section 3: Lazy Evaluation

In [0]:
# Chain transformations without action
df_lazy = df.filter(df.price > 200) \
            .filter(df.event_type == "purchase") \
            .select("category_code", "price")

# At this point:
# - No DAG executed
# - Only logical plan built

# Trigger action
df_lazy.show(5)
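
To make the "plan first, run later" behaviour concrete, here is a minimal sketch in plain Python. `LazyFrame` is a hypothetical toy class and not part of Spark; it only records the steps and applies them when `collect()` is called, which mirrors how transformations wait for an action. The sample rows are made up.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataset (not Spark's implementation)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded steps; nothing has been executed yet

    def filter(self, predicate):
        # Transformation: record the step and return a new LazyFrame.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        # Transformation: record the step and return a new LazyFrame.
        return LazyFrame(self._rows, self._plan + (("select", columns),))

    def collect(self):
        """Action: only now is the recorded plan applied to the rows."""
        result = self._rows
        for op, arg in self._plan:
            if op == "filter":
                result = [row for row in result if arg(row)]
            else:
                result = [{c: row[c] for c in arg} for row in result]
        return result


calls = []  # records every time the price predicate is actually evaluated


def price_over_200(row):
    calls.append(row["price"])
    return row["price"] > 200


rows = [
    {"event_type": "purchase", "category_code": "electronics", "price": 250.0},
    {"event_type": "view", "category_code": "apparel", "price": 300.0},
    {"event_type": "purchase", "category_code": "apparel", "price": 50.0},
]

lazy = (
    LazyFrame(rows)
    .filter(price_over_200)
    .filter(lambda row: row["event_type"] == "purchase")
    .select("category_code", "price")
)

calls_before_action = len(calls)  # the predicate has not run yet
result = lazy.collect()           # the action triggers evaluation
print(calls_before_action, result)
```

Spark goes further than this toy: because the whole plan is known before anything runs, Catalyst can reorder and combine steps before execution.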


#Section 4: DataFrames vs RDDs (Free Edition)

In [0]:
# DataFrame transformation (preferred)
df_df = df.filter(df.event_type == "purchase") \
          .groupBy("category_code") \
          .count()

df_df.show(5)

# The DAG for this job is visible in the Spark UI
# The query is optimized by Catalyst and executed with Tungsten


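Conceptually, RDDs require you to write the map and reduce steps by hand, whereas DataFrame queries are declarative and are optimized by Catalyst before they run.

Prefer DataFrames in Databricks unless you need low-level control that only RDDs provide.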
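
The contrast can be illustrated in plain Python, without any Spark API. This is a conceptual sketch only: the `events` list is made-up data, and the "RDD-style" half imitates the shape of `rdd.filter(...).map(...).reduceByKey(...)` using built-ins rather than calling Spark.

```python
from collections import Counter
from functools import reduce

# Made-up (event_type, category_code) rows for illustration.
events = [
    ("purchase", "electronics"),
    ("view", "apparel"),
    ("purchase", "apparel"),
    ("purchase", "electronics"),
]

# RDD style: spell out every step (filter, map to (key, 1), reduce by key).
purchases = filter(lambda e: e[0] == "purchase", events)
pairs = map(lambda e: (e[1], 1), purchases)


def merge(acc, pair):
    """Fold one (key, count) pair into the running totals."""
    key, n = pair
    acc[key] = acc.get(key, 0) + n
    return acc


rdd_style = reduce(merge, pairs, {})

# DataFrame style: state what you want (count per category) and let the
# library decide how to compute it.
declarative = dict(Counter(cat for ev, cat in events if ev == "purchase"))

print(rdd_style, declarative)
```

Both halves produce the same counts. The second leaves the "how" to the library, and that is the property Catalyst relies on to optimize DataFrame queries.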

#Section 5: Notebook Magic Commands

##%python (PySpark)

In [0]:
%python
df.select("event_type", "category_code", "price").show(5)


##%sql (Spark SQL)

In [0]:
%python
df = spark.read.option("header","true") \
               .option("inferSchema","true") \
               .csv("/Volumes/workspace/ecommerce/ecommerce_data/*.csv")

# Create temporary view for SQL queries
df.createOrReplaceTempView("events")


In [0]:
%sql
SELECT category_code, COUNT(*) AS cnt
FROM events
WHERE event_type = 'purchase'
GROUP BY category_code
ORDER BY cnt DESC
LIMIT 5


##%fs (File System)

In [0]:
%fs
ls /Volumes/workspace/ecommerce/ecommerce_data
