# PySpark Demo Notebook

This notebook provides practical examples covering PySpark fundamentals, including RDDs, DataFrames, SparkSession setup, partitioning, caching, joins, broadcast variables, accumulators, and dynamic resource allocation.

## Setup SparkSession

Initialize a `SparkSession`, which serves as the entry point for DataFrame and SQL operations in Spark 2.x and later.

In [0]:
from pyspark.sql import SparkSession

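# Build or retrieve a SparkSession with dynamic executor allocation configured.
# Note: dynamic allocation needs a cluster manager; with a local master these settings are recorded but have no effect.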
spark = (
    SparkSession.builder
    .appName("PySpark Demo")  # name the Spark application
    .master("local[*]")  # run locally using all cores
    .config("spark.dynamicAllocation.enabled", "true")  # enable dynamic scaling of executors
    .config("spark.dynamicAllocation.initialExecutors", "2")  # start with 2 executors
    .config("spark.dynamicAllocation.minExecutors", "1")  # allow scaling down to 1 executor
    .config("spark.dynamicAllocation.maxExecutors", "4")  # allow scaling up to 4 executors
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  # remove idle executors after 60s
    .getOrCreate()  # reuse the active session if one exists (its app name and master are kept); otherwise create one
)

sc = spark.sparkContext  # get the low-level SparkContext from the SparkSession

# print out configuration details to verify
print(f"AppName: {sc.appName}")
print(f"Master: {sc.master}")
print(f"Dynamic Allocation Enabled: {spark.conf.get('spark.dynamicAllocation.enabled')}")

AppName: Databricks Shell
Master: local[8]
Dynamic Allocation Enabled: true


---
## RDD: Resilient Distributed Datasets
An RDD is Spark's low-level, immutable, distributed collection. It gives fine-grained control over transformations and actions.

In [0]:
# Create an RDD from a Python list
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# parallelize() builds an RDD from an existing Python collection (such as a list or range);
# numSlices sets the number of partitions
rdd = sc.parallelize(data, numSlices=3)
print("Number of partitions:", rdd.getNumPartitions())
print("Collect:", rdd.collect())

Number of partitions: 3
Collect: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


### Transformations & Actions

- Transformations are lazy (e.g., `map`, `filter`).
- Actions trigger execution (e.g., `collect`, `count`).

In [0]:
# Example: word count using RDD operations
text_data = sc.parallelize([
    "Apache Spark is fast",
    "PySpark brings Python to Spark",
    "Spark runs in memory",
    "Spark supports RDD and DataFrame APIs"
])

# Split each line into words; flatMap flattens the per-line lists into one RDD of words
words = text_data.flatMap(lambda line: line.split(" "))
# Pair each lowercased word with a count of 1
word_pairs = words.map(lambda w: (w.lower(), 1))
# Sum the counts for each word
counts = word_pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())

[('python', 1), ('spark', 4), ('brings', 1), ('apache', 1), ('fast', 1), ('runs', 1), ('in', 1), ('supports', 1), ('pyspark', 1), ('to', 1), ('memory', 1), ('and', 1), ('dataframe', 1), ('is', 1), ('rdd', 1), ('apis', 1)]


---
## DataFrames & Spark SQL
DataFrames are Spark's high-level API for structured data. They provide schema-based transformations and SQL querying.

In [0]:
# Reuses the SparkSession `spark` created in the setup cell

json_data = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Charlie", "age": 35}
]

# Create a DataFrame directly from a list of dicts; Spark infers the schema (Python ints become long)
df = spark.createDataFrame(json_data)

df.show()
df.printSchema()

+---+---+-------+
|age| id|   name|
+---+---+-------+
| 30|  1|  Alice|
| 25|  2|    Bob|
| 35|  3|Charlie|
+---+---+-------+

root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)



In [0]:
# DataFrame operations
df.select("name", "age").filter(df.age > 28).show()

df.groupBy("age").count().show()

# Spark SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age BETWEEN 26 AND 34").show()

+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|Charlie| 35|
+-------+---+

+---+-----+
|age|count|
+---+-----+
| 30|    1|
| 25|    1|
| 35|    1|
+---+-----+

+-----+
| name|
+-----+
|Alice|
+-----+



---
## Partitioning
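Spark divides data into partitions, which are its units of parallelism. `repartition` changes the partition count with a full shuffle and can increase or decrease it; `coalesce` reduces the count and, by default, avoids a shuffle by merging existing partitions.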

In [0]:
rdd_part = sc.parallelize(range(20), 4)
print("Original partitions:", rdd_part.getNumPartitions())

rdd_repart = rdd_part.repartition(6)
print("Repartitioned (6):", rdd_repart.getNumPartitions())

rdd_coalesce = rdd_part.coalesce(2)
print("Coalesced (2):", rdd_coalesce.getNumPartitions())

Original partitions: 4
Repartitioned (6): 6
Coalesced (2): 2


---
## Caching & Persistence
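Persist a dataset so that repeated actions reuse the stored result instead of recomputing it. For RDDs, `cache()` is shorthand for `persist()` with the memory-only storage level.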

In [0]:
# Cache the RDD in memory for faster reuse
rdd_cached = rdd_part.cache()

# First action triggers computation and caches the RDD in memory
print(rdd_cached.count())    # Computes and caches the RDD, then returns count

# Second action uses cached data, so no recomputation happens
print(rdd_cached.count())    # Returns count quickly from cached data

# Remove the RDD from cache to free up memory when no longer needed
rdd_cached.unpersist()

20
20
Out[7]: PythonRDD[25] at RDD at PythonRDD.scala:58

---
## Broadcast Variables & Accumulators
Broadcast variables ship a read-only value to each executor once instead of with every task. Accumulators let tasks add to a shared value that only the driver reads.

In [0]:
# Broadcast variable: efficiently share a read-only variable with all worker nodes
broadcast_list = sc.broadcast([2, 4, 6])
print(broadcast_list.value)  # Access the broadcasted value on the driver

# Accumulator: a variable used to aggregate information (like counters) across tasks
acc = sc.accumulator(0)  # Initialize accumulator with 0

def add_if_even(x):
    # If number is even, increment the accumulator by 1
    if x % 2 == 0:
        acc.add(1)
    return x

# Create an RDD of numbers from 0 to 9
rdd_nums = sc.parallelize(range(10))

# Run the function on each element of the RDD (side effect updates accumulator)
rdd_nums.foreach(add_if_even)

# Print the total count of even numbers computed by the accumulator
print("Even count via accumulator:", acc.value)

[2, 4, 6]
Even count via accumulator: 5


---
## Join Strategies
Compare a broadcast hash join with a sort-merge join using DataFrames.

In [0]:
from pyspark.sql.functions import broadcast

# Create a large DataFrame with keys from 1 to 1000
df_large = spark.range(1, 1001).withColumnRenamed("id", "key")

# Create a small DataFrame with 20 rows and key-value pairs
df_small = spark.createDataFrame([(i, f"val_{i}") for i in range(1, 21)], ["key", "value"])

# Broadcast hash join:
# Explicitly broadcast the small DataFrame to all worker nodes to optimize the join
broadcast_join = df_large.join(broadcast(df_small), on="key")
broadcast_join.explain(True)  # Print detailed physical and logical plans

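# Sort-merge join:
# Without a hint, Spark picks the strategy from size estimates, so a side estimated below
# spark.sql.autoBroadcastJoinThreshold can still be broadcast. The "merge" hint (Spark 3.0+)
# requests a sort-merge join, which suits joins where both sides are large.
merge_join = df_large.join(df_small.hint("merge"), on="key")
merge_join.explain(True)  # Print detailed physical and logical plans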

== Parsed Logical Plan ==
'Join UsingJoin(Inner,Buffer(key))
:- Project [id#61L AS key#63L]
:  +- Range (1, 1001, step=1, splits=Some(8))
+- ResolvedHint (strategy=broadcast)
   +- LogicalRDD [key#65L, value#66], false

== Analyzed Logical Plan ==
key: bigint, value: string
Project [key#63L, value#66]
+- Join Inner, (key#63L = key#65L)
   :- Project [id#61L AS key#63L]
   :  +- Range (1, 1001, step=1, splits=Some(8))
   +- ResolvedHint (strategy=broadcast)
      +- LogicalRDD [key#65L, value#66], false

== Optimized Logical Plan ==
Project [key#63L, value#66]
+- Join Inner, (key#63L = key#65L), rightHint=(strategy=broadcast)
   :- Project [id#61L AS key#63L]
   :  +- Range (1, 1001, step=1, splits=Some(8))
   +- Filter isnotnull(key#65L)
      +- LogicalRDD [key#65L, value#66], false

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [key#63L, value#66]
   +- BroadcastHashJoin [key#63L], [key#65L], Inner, BuildRight, false, true
      :- Project [id#61L AS key#63L]
   