- **Name:** 04.1_dataframe_csv_and_parquet
- **Author:** Shamas Imran
- **Description:** Reading and writing DataFrames in CSV and Parquet formats
- **Date:** 19-Aug-2025
<!--
REVISION HISTORY
Version          Date        Author           Description
01           19-Aug-2025   Shamas Imran       Read CSV files with header and schema  
                                              Saved DataFrames to Parquet  
                                              Compared CSV vs Parquet performance  
-->


In [None]:
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("DatapurProgram").getOrCreate()

In [None]:
# Define the root path for data files
rootPath = "/Volumes/datapurcatalog/default/datapurvolume/"
filepath = rootPath + "ncr_ride_bookings.csv"

try:
    # Read CSV file with header and infer schema automatically
    df_csv = spark.read.csv(filepath, header=True, inferSchema=True)
    print("CSV file contents:")
    display(df_csv)
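except Exception as e:
    # The read can fail for several reasons (missing file, wrong path, permissions),
    # so report the actual error together with the hint
    print(f"Could not read {filepath}: {e}")
    print("Upload 'ncr_ride_bookings.csv' to the volume to run this step.")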

## `show()` vs `display()`
| Feature              | `show()` (Spark)           | `display()` (Databricks)          |
|-----------------------|-----------------------------|-----------------------------------|
| Environment           | Works in any Spark session | Only in Databricks notebooks      |
| Output                | Text table (console-style) | Interactive UI table              |
| Row limit             | 20 by default (configurable) | 1000 by default                   |
| Interactivity         | ❌ None                     | ✅ Sorting, filtering, exporting, charts |
| Visualization support | ❌ No                       | ✅ Yes (built-in visualizations)   |
| Typical use case      | Quick inspection, debugging | Data exploration & visualization  |

In [None]:
rootPath = "/Volumes/datapurcatalog/default/datapurvolume/"
filepath = rootPath + "ncr_ride_bookings.csv"

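# Alternative: disable schema inference so every column is read as a string
# df = spark.read.option("header", "true").option("inferSchema", "false").csv(filepath)

# Same read as the previous cell, written with the option() builder style
df = spark.read.option("header", "true") \
               .option("inferSchema", "true") \
               .csv(filepath)

# df.printSchema()
df.show(10)       # Print the first 10 rows as a text table
print(df.schema)  # Print the inferred schema as a StructType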
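
To make the effect of `inferSchema` concrete, here is a plain-Python sketch (no Spark needed) that mimics the idea on a tiny in-memory CSV: without inference every column stays a string, while inference takes an extra pass over the values to pick a type per column. The sample rows, the column names, the `infer_type` helper, and the three-step type ladder (int, double, string) are all assumptions for illustration; Spark's real inference supports more types and runs on the cluster.

```python
import csv
import io

# Hypothetical sample data standing in for a small CSV file
SAMPLE_CSV = """booking_id,fare,city
1,250.5,Delhi
2,99,Noida
3,410.0,Gurgaon
"""


def infer_type(values):
    """Return the narrowest type name that fits every value in a column."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# inferSchema=false: one pass, every column is treated as a string
schema_without_inference = {name: "string" for name in rows[0]}

# inferSchema=true: an extra pass over the values decides each column's type
schema_with_inference = {
    name: infer_type([row[name] for row in rows]) for name in rows[0]
}

print(schema_without_inference)
print(schema_with_inference)
```

The extra pass is why inference is convenient for exploration but adds read cost on large files; an explicit `StructType` schema avoids that pass.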

In [None]:
# Import data types for schema definition
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define schema for student DataFrame
student_schema = StructType([
    StructField('StudentID', IntegerType(), False),
    StructField('StudentName', StringType(), True),
    StructField('StudentAge', IntegerType(), True)
])

# Sample student data
student_data = [
        (1, "Alice", 34), 
        (2, "Bob", 45), 
        (3, "Charlie", 29),
        (4, "Shamas", 40)
        ]

# Create DataFrame using schema and data
df_student = spark.createDataFrame(student_data, student_schema)
df_student.show()  # Display the DataFrame

In [None]:
# Define schema for course DataFrame
course_schema = StructType([
    StructField('CourseID', IntegerType(), False),
    StructField('CourseName', StringType(), True),
    StructField('CourseTitle', StringType(), True),
])

# Sample course data
course_data = [
        (1, "Physics", "1111"), 
        (2, "Chemistry", "2222"), 
        (3, "English", "3333"),
        (4, "Computer Science", "4444")
        ]

# Create DataFrame using schema and data
df_course = spark.createDataFrame(course_data, course_schema)
df_course.show()  # Display the DataFrame

In [None]:
# Define file paths for saving parquet files
rootPath = "/Volumes/datapurcatalog/default/datapurvolume/"
studentFilePath = rootPath + "student"
courseFilePath = rootPath + "course"

# Write student DataFrame to Parquet format
df_student.write.mode("overwrite").parquet(studentFilePath)
print("Parquet file written to " + studentFilePath)

# Write course DataFrame to Parquet format
df_course.write.mode("overwrite").parquet(courseFilePath)
print("Parquet file written to " + courseFilePath)

In [None]:
# Read Parquet file and display its contents
rootPath = "/Volumes/datapurcatalog/default/datapurvolume/"
filepath = rootPath + "student"
try:
    # Parquet stores its own schema, so header/inferSchema options do not apply
    df_parquet = spark.read.parquet(filepath)
    print("Parquet file contents:")
    display(df_parquet)
except Exception as e:
    print(f"Could not read {filepath}: {e}")
    print("Run the cell that writes the student Parquet data first.")

In [None]:
# Display course and student DataFrames
df_course.show()
df_student.show()

# Get the number of partitions for each DataFrame
# num_partitions_course = df_course.rdd.getNumPartitions()
# num_partitions_student = df_student.rdd.getNumPartitions()

# print(f"Number of partitions in course DataFrame: {num_partitions_course}")
# print(f"Number of partitions in student DataFrame: {num_partitions_student}")

In [None]:
# df = spark.read.parquet(
#     "Files/client_output_data/parquet/student.parquet/part-00000.snappy.parquet",
#     "Files/client_output_data/parquet/student.parquet/part-00001.snappy.parquet"
# )

# df = spark.read.parquet("Files/client_output_data/parquet/student/part-*.parquet")

In [None]:
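# Define the output folder for the student Parquet data
rootPath = "/Volumes/datapurcatalog/default/datapurvolume/"
studentFolderPath = rootPath + "student"

# coalesce(1) merges the data into one partition, so Spark writes a single part file.
# The CSV-only "header" option is omitted: Parquet keeps column names in its schema.
df_student.coalesce(1) \
    .write.mode("overwrite") \
    .parquet(studentFolderPath)
print("Parquet file written to " + studentFolderPath)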

In [0]:
# Partition count of the original DataFrame
print("Original:", df_student.rdd.getNumPartitions())

# coalesce(2): merge down to at most 2 partitions without a shuffle
df_coal = df_student.coalesce(2)
print("Coalesce(2):", df_coal.rdd.getNumPartitions())

# repartition(2): shuffle the rows into exactly 2 partitions
df_repart = df_student.repartition(2)
print("Repartition(2):", df_repart.rdd.getNumPartitions())

# repartition(20): shuffle the rows into exactly 20 partitions
df_repart = df_student.repartition(20)
print("Repartition(20):", df_repart.rdd.getNumPartitions())


# Spark Quick Concepts

| Concept       | One-liner Explanation |
|---------------|---------------------|
| coalesce(n)   | Reduces the number of partitions without shuffle (fast, may be uneven). |
| repartition(n)| Creates exactly n partitions with shuffle (balanced, more expensive). |
| Data Skew     | Some partitions have much more data than others, causing slow tasks. |
| Shuffle       | Moves data across partitions/executors to align for operations like join/groupBy. |


# Difference: `coalesce()` vs `repartition()`

| Feature             | `coalesce(n)` | `repartition(n)` |
|---------------------|---------------|------------------|
| Shuffle operation   | ❌ No shuffle (narrow dependency) | ✅ Always triggers shuffle |
| Increase partitions | ❌ Not possible: a larger `n` leaves the partition count unchanged | ✅ Efficient and balanced |
| Decrease partitions | ✅ Fast (just merges partitions) <br> ⚠️ May cause uneven distribution | ✅ Balanced distribution (reshuffle) |
| Performance         | Faster, less expensive | More expensive due to shuffle |
| Data distribution   | Can be skewed/uneven | Evenly distributed |
| Typical use case    | Write fewer output files quickly (e.g., `coalesce(1)`) | Prepare balanced data for joins, aggregations, or large writes |
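
The difference in data movement can be sketched in plain Python, with lists standing in for partitions. This is only a toy model under stated assumptions: `toy_coalesce` and `toy_repartition` are hypothetical helpers, not Spark APIs, and Spark's actual placement of rows differs in detail. What the model does preserve is the behaviour: coalescing merges whole partitions (so sizes can stay uneven and a larger `n` changes nothing), while repartitioning moves every row and balances the result.

```python
def toy_coalesce(partitions, n):
    """Merge whole neighbouring partitions into at most n; rows are never split up."""
    current = len(partitions)
    if n >= current:
        # Asking for more partitions is a no-op, as with DataFrame.coalesce
        return [list(part) for part in partitions]
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index * n // current].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Redistribute every row round-robin into exactly n partitions (a full shuffle)."""
    rows = [row for part in partitions for row in part]
    shuffled = [[] for _ in range(n)]
    for index, row in enumerate(rows):
        shuffled[index % n].append(row)
    return shuffled


def sizes(partitions):
    """Number of rows held by each partition."""
    return [len(part) for part in partitions]


# Four uneven partitions holding 14 rows in total
data = [list(range(10)), [10], [11], [12, 13]]

print("original      ", sizes(data))
print("coalesce(2)   ", sizes(toy_coalesce(data, 2)))
print("coalesce(8)   ", sizes(toy_coalesce(data, 8)))
print("repartition(2)", sizes(toy_repartition(data, 2)))
print("repartition(8)", sizes(toy_repartition(data, 8)))
```

In this model the coalesced result keeps the original imbalance, while the repartitioned result is even; the price of that balance in Spark is the shuffle.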

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema_person = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("country", StringType(), True),
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True)
])

data_person = [
    (1, "Ali",   "PK", 2024, 1),
    (2, "Ahmed", "PK", 2024, 2),
    (3, "Sara",  "PK", 2025, 1),
    (4, "John",  "US", 2024, 1),
    (5, "Maria", "US", 2025, 2),
    (6, "Chen",  "CN", 2024, 2)
]

df_person = spark.createDataFrame(data_person, schema=schema_person)

df_person.show()
df_person.printSchema()

In [0]:
# Write the partitioned output to its own folder under the volume root
rootPath = "/Volumes/datapurcatalog/default/datapurvolume/"
output_path = rootPath + "partitioned_person"
df_person.write.mode("overwrite") \
    .partitionBy("country", "year") \
    .parquet(output_path)

In [0]:
df = spark.read.parquet(output_path)
df.show()

In [0]:
# Filters on the partition columns (country, year) let Spark skip non-matching directories
df.filter("country = 'PK'").show()
df.filter("country = 'US' AND year = 2025").show()
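
The reason these filters are cheap is the directory layout that `partitionBy("country", "year")` produces: one `country=.../year=...` folder per distinct combination, with the partition columns encoded in the path rather than stored in the data files. The plain-Python sketch below reproduces that layout for the same six rows and shows which folders an equality filter would need to scan. The helpers `partition_dir` and `dirs_to_scan` are hypothetical names for illustration; Spark performs this pruning internally.

```python
from collections import defaultdict

# Same rows as data_person: (id, name, country, year, month)
rows = [
    (1, "Ali", "PK", 2024, 1),
    (2, "Ahmed", "PK", 2024, 2),
    (3, "Sara", "PK", 2025, 1),
    (4, "John", "US", 2024, 1),
    (5, "Maria", "US", 2025, 2),
    (6, "Chen", "CN", 2024, 2),
]


def partition_dir(country, year):
    """Directory name in the key=value style used by partitionBy output."""
    return f"country={country}/year={year}"


# Group rows by the directory they would be written to; the partition
# columns live in the path, so only the remaining columns are kept per row
layout = defaultdict(list)
for row_id, name, country, year, month in rows:
    layout[partition_dir(country, year)].append((row_id, name, month))


def dirs_to_scan(country=None, year=None):
    """Directories that survive pruning for equality filters on the partition columns."""
    return sorted(
        directory
        for directory in layout
        if (country is None or directory.startswith(f"country={country}/"))
        and (year is None or directory.endswith(f"/year={year}"))
    )


print(sorted(layout))
print(dirs_to_scan(country="PK"))
print(dirs_to_scan(country="US", year=2025))
```

A filter on a non-partition column such as `month` gets no such shortcut, because every folder may contain matching rows.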

# Data Skew and Shuffle: Good vs Bad

## 🔹 Data Skew
### ✅ Good
- None inherently — skew is usually **bad**.  
- The only "good" aspect is that **identifying skew** can help optimize partitioning and improve performance.  

### ❌ Bad
- Some partitions hold **far more data** than others → tasks are **unbalanced**.  
- Causes **stragglers** (a few slow tasks delay the entire job).  
- Leads to **memory pressure** and sometimes OOM errors.  
- Wastes resources (some executors sit idle while others are overloaded).  

---

## 🔹 Shuffle
### ✅ Good
- Enables **wide transformations** like `groupBy`, `reduceByKey`, `join`.  
- Redistributes data across partitions → ensures **correctness** of operations.  
- Necessary for **load balancing** in some cases (e.g., after skew fix).  
- Enables **parallelism** across nodes.  

### ❌ Bad
- **Expensive** operation → involves disk I/O, network I/O, and serialization.  
- Can generate **large intermediate files**.  
- Prone to **shuffle spill** → increases job runtime.  
- More shuffles → **slower job**, higher cluster cost.  

---

👉 **Rule of Thumb**:  
- **Data skew = harmful** → fix it (via `repartition`, key salting, map-side combine, etc.).  
- **Shuffle = necessary evil** → costly, but often required for correctness. Minimize shuffles, but never at the cost of wrong results.  
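
Key salting, one of the skew fixes listed above, can be sketched in plain Python. The idea is to append a small salt to each hot key so its rows spread over several partitions, aggregate per (key, salt), then combine the partial results per key. Everything here is an illustrative assumption: `toy_partition` is a toy stand-in for a hash partitioner (a stable key hash plus the salt, modulo the partition count), and the salt is derived from the row index to keep the run deterministic. In Spark the same pattern is usually written by adding a salt column and grouping first by key and salt, then by key.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 4

# Skewed input: 90 of 100 rows share the key "PK"
keys = ["PK"] * 90 + ["US"] * 5 + ["CN"] * 5


def toy_partition(key, salt=0):
    """Toy partitioner: stable key hash plus salt, modulo the partition count."""
    return (zlib.crc32(key.encode()) + salt) % NUM_PARTITIONS


# Without salting, every "PK" row lands in the same partition
unsalted_load = Counter(toy_partition(key) for key in keys)

# With salting, each row carries a salt, so one key spreads over several partitions
salted_rows = [(key, index % NUM_SALTS) for index, key in enumerate(keys)]
salted_load = Counter(toy_partition(key, salt) for key, salt in salted_rows)

# Stage 1: partial counts per (key, salt); stage 2: sum the partials per key
partial_counts = Counter(salted_rows)
final_counts = Counter()
for (key, _salt), count in partial_counts.items():
    final_counts[key] += count

print("max rows per partition, unsalted:", max(unsalted_load.values()))
print("max rows per partition, salted:  ", max(salted_load.values()))
print("final counts:", dict(final_counts))
```

The busiest partition shrinks from at least 90 rows to no more than 33 in this model, while the final counts match a direct count per key. The trade-off is a second aggregation stage, which is itself an extra shuffle in Spark.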
