## 🔑 Notebook Overview
1. Setup & Configuration  
2. Creating DataFrames (manual, CSV, JSON, Parquet)  
3. Data Inspection  
4. Column Selection, Renaming, Casting  
5. Filtering and Conditional Logic  
6. Column Expressions & Functions  
7. Complex Columns (arrays, structs, maps)  
8. Aggregations & groupBy  
9. Joins  
10. Window Functions  
11. Pivoting & Melting  
12. Missing Data Handling  
13. Sorting, Deduplication, Sampling  
14. Exploding Arrays & Nested JSON  
15. UDFs & Pandas UDFs  
16. Optimization & Explain Plans  
17. Writing Data  
18. Final Summary

## 1️⃣ Setup & Configuration

In [0]:
from pyspark.sql import SparkSession
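# NOTE: the functions wildcard import shadows Python built-ins such as sum, max,
# min, and round with Spark column functions for the rest of the notebook.
from pyspark.sql.functions import *
from pyspark.sql.types import *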
from pyspark.sql.window import Window

spark = SparkSession.builder\
        .appName("PySpark Data Transformations Masterclass")\
        .config("spark.sql.shuffle.partitions", "4")\
        .getOrCreate()

spark

<pyspark.sql.connect.session.SparkSession at 0xff7ef8213380>

## 2️⃣ Creating DataFrames

In [0]:
data = [
    ("A", "North", 10, 100.5, "2024-01-01"),
    ("B", "South", 20, 200.0, "2024-01-02"),
    ("C", "East", 30, 300.0, "2024-01-03"),
    ("D", "West", 40, 400.0, "2024-01-04"),
    ("E", "North", 50, 500.0, "2024-01-05"),
    ("F", "South", 60, 600.0, "2024-01-06"),
    ("G", "East", 70, None, "2024-01-07")
]

schema = StructType([
    StructField("category", StringType()),
    StructField("region", StringType()),
    StructField("value", IntegerType()),
    StructField("sales", DoubleType()),
    StructField("date", StringType())
])

df = spark.createDataFrame(data, schema)
df.show()

+--------+------+-----+-----+----------+
|category|region|value|sales|      date|
+--------+------+-----+-----+----------+
|       A| North|   10|100.5|2024-01-01|
|       B| South|   20|200.0|2024-01-02|
|       C|  East|   30|300.0|2024-01-03|
|       D|  West|   40|400.0|2024-01-04|
|       E| North|   50|500.0|2024-01-05|
|       F| South|   60|600.0|2024-01-06|
|       G|  East|   70| NULL|2024-01-07|
+--------+------+-----+-----+----------+
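
The overview also lists CSV, JSON, and Parquet sources, which the cell above does not exercise. The following is a minimal sketch, assuming a scratch directory that the Spark cluster can write to. The helper name `roundtrip_formats`, the `base_dir` argument, and the `SALES_DDL` constant are illustrative and not part of the original notebook. It writes the DataFrame in each format and reads every copy back.

```python
import os

# Hypothetical DDL string mirroring the StructType defined above.
SALES_DDL = "category STRING, region STRING, value INT, sales DOUBLE, date STRING"


def roundtrip_formats(spark, df, base_dir):
    """Write df as CSV, JSON, and Parquet under base_dir, then read each copy back."""
    csv_path = os.path.join(base_dir, "sales_csv")
    json_path = os.path.join(base_dir, "sales_json")
    parquet_path = os.path.join(base_dir, "sales_parquet")

    # Each writer produces a directory of part files, not a single file.
    df.write.mode("overwrite").csv(csv_path, header=True)
    df.write.mode("overwrite").json(json_path)
    df.write.mode("overwrite").parquet(parquet_path)

    return {
        # CSV and JSON carry no reliable types, so an explicit schema is supplied.
        "csv": spark.read.csv(csv_path, header=True, schema=SALES_DDL),
        "json": spark.read.json(json_path, schema=SALES_DDL),
        # Parquet stores its schema alongside the data.
        "parquet": spark.read.parquet(parquet_path),
    }
```

On a Spark Connect or Databricks session, `base_dir` probably needs to be a location the cluster can reach rather than a path local to the client.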



## 3️⃣ Inspecting Data

In [0]:
df.printSchema()
df.describe().show()
df.summary().show()
df.limit(5).toPandas()

root
 |-- category: string (nullable = true)
 |-- region: string (nullable = true)
 |-- value: integer (nullable = true)
 |-- sales: double (nullable = true)
 |-- date: string (nullable = true)

+-------+--------+------+------------------+-----------------+----------+
|summary|category|region|             value|            sales|      date|
+-------+--------+------+------------------+-----------------+----------+
|  count|       7|     7|                 7|                6|         7|
|   mean|    NULL|  NULL|              40.0|350.0833333333333|      NULL|
| stddev|    NULL|  NULL|21.602468994692867|186.9493023968441|      NULL|
|    min|       A|  East|                10|            100.5|2024-01-01|
|    max|       G|  West|                70|            600.0|2024-01-07|
+-------+--------+------+------------------+-----------------+----------+

+-------+--------+------+------------------+-----------------+----------+
|summary|category|region|             value|            sales|  

  category region  value  sales        date
0        A  North     10  100.5  2024-01-01
1        B  South     20  200.0  2024-01-02
2        C   East     30  300.0  2024-01-03
3        D   West     40  400.0  2024-01-04
4        E  North     50  500.0  2024-01-05


## 4️⃣ Select, Rename, and Cast Columns

In [0]:
df.select("category", "region", "sales").show()
df.withColumnRenamed("sales", "total_sales").show()
df.withColumn("sales", col("sales").cast("decimal(10,2)")).printSchema()

+--------+------+-----+
|category|region|sales|
+--------+------+-----+
|       A| North|100.5|
|       B| South|200.0|
|       C|  East|300.0|
|       D|  West|400.0|
|       E| North|500.0|
|       F| South|600.0|
|       G|  East| NULL|
+--------+------+-----+

+--------+------+-----+-----------+----------+
|category|region|value|total_sales|      date|
+--------+------+-----+-----------+----------+
|       A| North|   10|      100.5|2024-01-01|
|       B| South|   20|      200.0|2024-01-02|
|       C|  East|   30|      300.0|2024-01-03|
|       D|  West|   40|      400.0|2024-01-04|
|       E| North|   50|      500.0|2024-01-05|
|       F| South|   60|      600.0|2024-01-06|
|       G|  East|   70|       NULL|2024-01-07|
+--------+------+-----+-----------+----------+

root
 |-- category: string (nullable = true)
 |-- region: string (nullable = true)
 |-- value: integer (nullable = true)
 |-- sales: decimal(10,2) (nullable = true)
 |-- date: string (nullable = true)



## 5️⃣ Filtering and Conditional Logic

In [0]:
df.filter(col("value") > 30).show()
df.filter(col("region").isin("North", "East")).show()
df.filter(col("category").like("A%")).show()

df.withColumn("segment",
    when(col("value") < 30, "Low")
    .when(col("value") < 60, "Medium")
    .otherwise("High")
).show()

+--------+------+-----+-----+----------+
|category|region|value|sales|      date|
+--------+------+-----+-----+----------+
|       D|  West|   40|400.0|2024-01-04|
|       E| North|   50|500.0|2024-01-05|
|       F| South|   60|600.0|2024-01-06|
|       G|  East|   70| NULL|2024-01-07|
+--------+------+-----+-----+----------+

+--------+------+-----+-----+----------+
|category|region|value|sales|      date|
+--------+------+-----+-----+----------+
|       A| North|   10|100.5|2024-01-01|
|       C|  East|   30|300.0|2024-01-03|
|       E| North|   50|500.0|2024-01-05|
|       G|  East|   70| NULL|2024-01-07|
+--------+------+-----+-----+----------+

+--------+------+-----+-----+----------+
|category|region|value|sales|      date|
+--------+------+-----+-----+----------+
|       A| North|   10|100.5|2024-01-01|
+--------+------+-----+-----+----------+

+--------+------+-----+-----+----------+-------+
|category|region|value|sales|      date|segment|
+--------+------+-----+-----+---------

## 6️⃣ Column Expressions & Built-in Functions

In [0]:
df = df.withColumn("sales_tax", col("sales") * 0.08)
df = df.withColumn("net_sales", expr("sales - sales_tax"))
df = df.withColumn("info", concat_ws("-", col("category"), col("region")))
df = df.withColumn("year", year(to_date(col("date"))))
df.show()

+--------+------+-----+-----+----------+-----------------+---------+-------+----+
|category|region|value|sales|      date|        sales_tax|net_sales|   info|year|
+--------+------+-----+-----+----------+-----------------+---------+-------+----+
|       A| North|   10|100.5|2024-01-01|8.040000000000001|    92.46|A-North|2024|
|       B| South|   20|200.0|2024-01-02|             16.0|    184.0|B-South|2024|
|       C|  East|   30|300.0|2024-01-03|             24.0|    276.0| C-East|2024|
|       D|  West|   40|400.0|2024-01-04|             32.0|    368.0| D-West|2024|
|       E| North|   50|500.0|2024-01-05|             40.0|    460.0|E-North|2024|
|       F| South|   60|600.0|2024-01-06|             48.0|    552.0|F-South|2024|
|       G|  East|   70| NULL|2024-01-07|             NULL|     NULL| G-East|2024|
+--------+------+-----+-----+----------+-----------------+---------+-------+----+



## 7️⃣ Complex Columns (Arrays, Structs, Maps)

In [0]:

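# Parentheses replace the backslash continuations, which fail when followed by trailing spaces.
df_complex = (
    df.withColumn("tags", array(lit("retail"), col("region")))  # array<string>
      .withColumn("meta", struct(col("value"), col("sales")))  # struct<value, sales>
      .withColumn("mapping", create_map(lit("category"), col("category")))  # map<string, string>
)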
df_complex.show(truncate=False)

## 8️⃣ Aggregations and groupBy

In [0]:
df.groupBy("region").agg(
    count("*").alias("count"),
    avg("value").alias("avg_value"),
    sum("sales").alias("total_sales"),
    max("sales").alias("max_sales")
).orderBy(desc("total_sales")).show()

+------+-----+---------+-----------+---------+
|region|count|avg_value|total_sales|max_sales|
+------+-----+---------+-----------+---------+
| South|    2|     40.0|      800.0|    600.0|
| North|    2|     30.0|      600.5|    500.0|
|  West|    1|     40.0|      400.0|    400.0|
|  East|    2|     50.0|      300.0|    300.0|
+------+-----+---------+-----------+---------+



## 9️⃣ Joins (Advanced)

In [0]:
region_data = [("North", "Zone 1"), ("South", "Zone 2"), ("East", "Zone 3"), ("West", "Zone 4")]
region_df = spark.createDataFrame(region_data, ["region", "zone"])

inner_join = df.join(region_df, "region", "inner")
left_join = df.join(region_df, "region", "left")
anti_join = df.join(region_df, "region", "left_anti")

inner_join.show()
anti_join.show()

+------+--------+-----+-----+----------+-----------------+---------+-------+----+------+
|region|category|value|sales|      date|        sales_tax|net_sales|   info|year|  zone|
+------+--------+-----+-----+----------+-----------------+---------+-------+----+------+
| North|       A|   10|100.5|2024-01-01|8.040000000000001|    92.46|A-North|2024|Zone 1|
| South|       B|   20|200.0|2024-01-02|             16.0|    184.0|B-South|2024|Zone 2|
|  East|       C|   30|300.0|2024-01-03|             24.0|    276.0| C-East|2024|Zone 3|
|  West|       D|   40|400.0|2024-01-04|             32.0|    368.0| D-West|2024|Zone 4|
| North|       E|   50|500.0|2024-01-05|             40.0|    460.0|E-North|2024|Zone 1|
| South|       F|   60|600.0|2024-01-06|             48.0|    552.0|F-South|2024|Zone 2|
|  East|       G|   70| NULL|2024-01-07|             NULL|     NULL| G-East|2024|Zone 3|
+------+--------+-----+-----+----------+-----------------+---------+-------+----+------+

+------+--------+---

## 🔟 Window Functions (Rank, Lag, Cumulative)

In [0]:
windowSpec = Window.partitionBy("region").orderBy(col("sales").desc())

df_window = (df.withColumn("rank", rank().over(windowSpec))
             .withColumn("dense_rank", dense_rank().over(windowSpec))
             .withColumn("lag_sales", lag("sales").over(windowSpec))
             .withColumn("cum_sum", sum("sales").over(windowSpec)))

df_window.show()

+--------+------+-----+-----+----------+-----------------+---------+-------+----+----+----------+---------+-------+
|category|region|value|sales|      date|        sales_tax|net_sales|   info|year|rank|dense_rank|lag_sales|cum_sum|
+--------+------+-----+-----+----------+-----------------+---------+-------+----+----+----------+---------+-------+
|       C|  East|   30|300.0|2024-01-03|             24.0|    276.0| C-East|2024|   1|         1|     NULL|  300.0|
|       G|  East|   70| NULL|2024-01-07|             NULL|     NULL| G-East|2024|   2|         2|    300.0|  300.0|
|       E| North|   50|500.0|2024-01-05|             40.0|    460.0|E-North|2024|   1|         1|     NULL|  500.0|
|       A| North|   10|100.5|2024-01-01|8.040000000000001|    92.46|A-North|2024|   2|         2|    500.0|  600.5|
|       F| South|   60|600.0|2024-01-06|             48.0|    552.0|F-South|2024|   1|         1|     NULL|  600.0|
|       B| South|   20|200.0|2024-01-02|             16.0|    184.0|B-So
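
Because `windowSpec` has an `orderBy` but no explicit frame, Spark falls back to its default frame, `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which is why `cum_sum` behaves as a running total. Under that default, rows that tie on the ordering value share one cumulative sum. The sketch below first models both frames in plain Python for a single, already sorted partition (nulls are not modeled), then shows a row-based frame in PySpark. The names `running_totals`, `add_row_running_total`, and `cum_sum_rows` are illustrative.

```python
def running_totals(values):
    """Model ROWS and RANGE running totals for one partition sorted descending.

    Returns (rows_frame, range_frame) as two lists.
    """
    rows_frame, total = [], 0.0
    for v in values:
        total += v
        rows_frame.append(total)
    # RANGE frame: every row sees the total up to its last peer (tied value),
    # so later ties overwrite earlier entries in this lookup.
    total_at_last_peer = {v: t for v, t in zip(values, rows_frame)}
    range_frame = [total_at_last_peer[v] for v in values]
    return rows_frame, range_frame


def add_row_running_total(df):
    """Add a running total that advances one row at a time, even across ties."""
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    row_frame = (
        Window.partitionBy("region")
        .orderBy(F.col("sales").desc())
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
    return df.withColumn("cum_sum_rows", F.sum("sales").over(row_frame))
```

The sample data has no tied `sales` values within a region, so both frames should give the same result here. The distinction matters once duplicates appear.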

## 1️⃣1️⃣ Pivoting & Melting

In [0]:
pivot_df = df.groupBy("category").pivot("region").agg(sum("sales"))
pivot_df.show()

melted_df = pivot_df.selectExpr("category", 
                                "stack(4, 'East', East, 'West', West, 'North', North, 'South', South) as (region, total_sales)")
melted_df.show()

+--------+-----+-----+-----+-----+
|category| East|North|South| West|
+--------+-----+-----+-----+-----+
|       A| NULL|100.5| NULL| NULL|
|       B| NULL| NULL|200.0| NULL|
|       C|300.0| NULL| NULL| NULL|
|       D| NULL| NULL| NULL|400.0|
|       E| NULL|500.0| NULL| NULL|
|       F| NULL| NULL|600.0| NULL|
|       G| NULL| NULL| NULL| NULL|
+--------+-----+-----+-----+-----+

+--------+------+-----------+
|category|region|total_sales|
+--------+------+-----------+
|       A|  East|       NULL|
|       A|  West|       NULL|
|       A| North|      100.5|
|       A| South|       NULL|
|       B|  East|       NULL|
|       B|  West|       NULL|
|       B| North|       NULL|
|       B| South|      200.0|
|       C|  East|      300.0|
|       C|  West|       NULL|
|       C| North|       NULL|
|       C| South|       NULL|
|       D|  East|       NULL|
|       D|  West|      400.0|
|       D| North|       NULL|
|       D| South|       NULL|
|       E|  East|       NULL|
|       E|  We

## 1️⃣2️⃣ Missing Data Handling

In [0]:
df.na.drop(subset=["sales"]).show()
df.na.fill({"sales": 0, "region": "Unknown"}).show()
df.na.replace(["East", "West"], ["E", "W"], "region").show()

+--------+------+-----+-----+----------+-----------------+---------+-------+----+
|category|region|value|sales|      date|        sales_tax|net_sales|   info|year|
+--------+------+-----+-----+----------+-----------------+---------+-------+----+
|       A| North|   10|100.5|2024-01-01|8.040000000000001|    92.46|A-North|2024|
|       B| South|   20|200.0|2024-01-02|             16.0|    184.0|B-South|2024|
|       C|  East|   30|300.0|2024-01-03|             24.0|    276.0| C-East|2024|
|       D|  West|   40|400.0|2024-01-04|             32.0|    368.0| D-West|2024|
|       E| North|   50|500.0|2024-01-05|             40.0|    460.0|E-North|2024|
|       F| South|   60|600.0|2024-01-06|             48.0|    552.0|F-South|2024|
+--------+------+-----+-----+----------+-----------------+---------+-------+----+

+--------+------+-----+-----+----------+-----------------+---------+-------+----+
|category|region|value|sales|      date|        sales_tax|net_sales|   info|year|
+--------+-----

## 1️⃣3️⃣ Sorting, Deduplication, Sampling

In [0]:
df.orderBy(col("sales").desc()).show()
df.dropDuplicates(["region"]).show()
df.sample(withReplacement=False, fraction=0.5, seed=1).show()

+--------+------+-----+-----+----------+-----------------+---------+-------+----+
|category|region|value|sales|      date|        sales_tax|net_sales|   info|year|
+--------+------+-----+-----+----------+-----------------+---------+-------+----+
|       F| South|   60|600.0|2024-01-06|             48.0|    552.0|F-South|2024|
|       E| North|   50|500.0|2024-01-05|             40.0|    460.0|E-North|2024|
|       D|  West|   40|400.0|2024-01-04|             32.0|    368.0| D-West|2024|
|       C|  East|   30|300.0|2024-01-03|             24.0|    276.0| C-East|2024|
|       B| South|   20|200.0|2024-01-02|             16.0|    184.0|B-South|2024|
|       A| North|   10|100.5|2024-01-01|8.040000000000001|    92.46|A-North|2024|
|       G|  East|   70| NULL|2024-01-07|             NULL|     NULL| G-East|2024|
+--------+------+-----+-----+----------+-----------------+---------+-------+----+

+--------+------+-----+-----+----------+-----------------+---------+-------+----+
|category|regio

## 1️⃣4️⃣ Explode Arrays / Nested JSON Example

In [0]:
data_json = [
    ("A", ["p1", "p2", "p3"]),
    ("B", ["p2", "p4"]),
    ("C", ["p1"])
]
df_json = spark.createDataFrame(data_json, ["category", "products"])
df_json.withColumn("product", explode("products")).show()

+--------+------------+-------+
|category|    products|product|
+--------+------------+-------+
|       A|[p1, p2, p3]|     p1|
|       A|[p1, p2, p3]|     p2|
|       A|[p1, p2, p3]|     p3|
|       B|    [p2, p4]|     p2|
|       B|    [p2, p4]|     p4|
|       C|        [p1]|     p1|
+--------+------------+-------+



## 1️⃣5️⃣ UDFs & Pandas UDFs

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorize_sales(sales):
    # Spark passes SQL NULL to a Python UDF as None; missing sales count as "Low".
    return "High" if sales is not None and sales > 300 else "Low"

udf_sales = udf(categorize_sales, StringType())
df.withColumn("sales_category", udf_sales(col("sales"))).show()

+--------+------+-----+-----+----------+-----------------+---------+-------+----+--------------+
|category|region|value|sales|      date|        sales_tax|net_sales|   info|year|sales_category|
+--------+------+-----+-----+----------+-----------------+---------+-------+----+--------------+
|       A| North|   10|100.5|2024-01-01|8.040000000000001|    92.46|A-North|2024|           Low|
|       B| South|   20|200.0|2024-01-02|             16.0|    184.0|B-South|2024|           Low|
|       C|  East|   30|300.0|2024-01-03|             24.0|    276.0| C-East|2024|           Low|
|       D|  West|   40|400.0|2024-01-04|             32.0|    368.0| D-West|2024|          High|
|       E| North|   50|500.0|2024-01-05|             40.0|    460.0|E-North|2024|          High|
|       F| South|   60|600.0|2024-01-06|             48.0|    552.0|F-South|2024|          High|
|       G|  East|   70| NULL|2024-01-07|             NULL|     NULL| G-East|2024|           Low|
+--------+------+-----+-----+-
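
The heading also mentions Pandas UDFs, which the cell above does not show. The following is a hedged sketch of a vectorized equivalent, assuming pandas and PyArrow are available on the cluster. `categorize_batch` and `make_sales_category_udf` are illustrative names. The function receives a whole batch of `sales` values as a pandas Series instead of one value per call.

```python
def categorize_batch(sales):
    """Label a pandas Series of sales values.

    Missing values arrive as NaN, compare as False, and therefore map to "Low".
    """
    return sales.gt(300).map({True: "High", False: "Low"})


def make_sales_category_udf():
    """Wrap categorize_batch as a Series-to-Series pandas UDF.

    Imports are deferred so this module loads without Spark installed.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def sales_category(sales: pd.Series) -> pd.Series:
        return categorize_batch(sales)

    return sales_category


# Usage, assuming `df` and `col` from the cells above:
# df.withColumn("sales_category", make_sales_category_udf()(col("sales"))).show()
```

For a rule this simple, a `when(...).otherwise(...)` expression like the one in section 5 would avoid Python execution entirely. A UDF is mainly worth it when the logic cannot be expressed with built-in functions.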

## 1️⃣6️⃣ Performance & Optimization

In [0]:
df.explain(True)

optimized = df.select("region", "sales").filter(col("sales") > 200)
optimized.explain(True)

== Parsed Logical Plan ==
Project [category#11071, region#11072, value#11073, sales#11074, date#11075, sales_tax#12682, net_sales#12684, info#12686, year(to_date(date#11075, None, Some(Etc/UTC), true)) AS year#12688]
+- Project [category#11071, region#11072, value#11073, sales#11074, date#11075, sales_tax#12682, net_sales#12684, concat_ws(-, category#11071, region#11072) AS info#12686]
   +- Project [category#11071, region#11072, value#11073, sales#11074, date#11075, sales_tax#12682, (sales#11074 - sales_tax#12682) AS net_sales#12684]
      +- Project [category#11071, region#11072, value#11073, sales#11074, date#11075, (sales#11074 * 0.08) AS sales_tax#12682]
         +- LocalRelation [category#11071, region#11072, value#11073, sales#11074, date#11075]

== Analyzed Logical Plan ==
category: string, region: string, value: int, sales: double, date: string, sales_tax: double, net_sales: double, info: string, year: int
Project [category#11071, region#11072, value#11073, sales#11074, date#1
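
The summary table at the end lists `cache()`, which no cell demonstrates. The following is a minimal sketch, assuming the DataFrame is reused by more than one action. `cache_and_reuse` is an illustrative helper, not part of the original notebook.

```python
def cache_and_reuse(df):
    """Cache df, materialize it, reuse it for a second action, then release it."""
    cached = df.cache()  # lazy: only marks the DataFrame for caching
    total_rows = cached.count()  # the first action populates the cache
    # Later actions on `cached` can read from the cache instead of recomputing.
    non_null_sales = cached.filter("sales IS NOT NULL").count()
    cached.unpersist()  # free the cached data once it is no longer needed
    return total_rows, non_null_sales
```

Caching only pays off when the same DataFrame feeds several actions. For a single pass, it adds overhead without benefit.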

## 1️⃣7️⃣ Writing Data

In [0]:
df.write.format("delta").mode("overwrite").option("mergeSchema", "true").saveAsTable("default.transformed_data")

## ✅ Final Summary

| Concept | Method | SQL Equivalent |
|----------|--------|----------------|
| Column Selection | `select()` | SELECT columns |
| Conditional | `when()`, `filter()` | CASE WHEN, WHERE |
| Aggregation | `groupBy().agg()` | GROUP BY |
| Join | `join()` | JOIN |
| Window | `rank().over(Window.partitionBy(...))` | OVER (PARTITION BY ...) |
| Pivot | `pivot()` | PIVOT |
| Null Handling | `na.drop()`, `na.fill()` | IS NOT NULL, COALESCE |
| Sort | `orderBy()` | ORDER BY |
| User Functions | `udf()` | CREATE FUNCTION |
| Optimization | `cache()`, `explain()` | CACHE TABLE, EXPLAIN |
