- Name: 04.1_dataframe_csv_and_parquet
- Author: Shamas Imran
- Description: Reading and writing DataFrames in CSV and Parquet formats
- Date: 07-Oct-2025

In [None]:

filepath = "Files/client_input_data/csv/ncr_ride_bookings.csv"

try:
    df_csv = spark.read.csv(filepath, header=True, inferSchema=True)
    print("CSV file contents:")
    display(df_csv)
    # Count inside the try block so a failed load does not raise a NameError
    print(f"Total rows: {df_csv.count()}")
except Exception as e:
    print("Upload 'ncr_ride_bookings.csv' to run this step.")
    print(f"Details: {e}")

In [None]:
df_csv.show(100)

## `show()` vs `display()`
| Feature               | `show()` (Spark)             | `display()` (notebook)                   |
|-----------------------|------------------------------|------------------------------------------|
| Environment           | Works in any Spark session   | Notebook environments only (e.g., Databricks, Microsoft Fabric) |
| Output                | Text table (console-style)   | Interactive UI table                     |
| Row limit             | 20 by default (configurable) | Platform-dependent (commonly the first 1,000 rows) |
| Interactivity         | ❌ None                      | ✅ Sorting, filtering, exporting, charts |
| Visualization support | ❌ No                        | ✅ Yes (built-in visualizations)         |
| Typical use case      | Quick inspection, debugging  | Data exploration & visualization         |

In [None]:
filepath =  "Files/client_input_data/csv/ncr_ride_bookings.csv"

# df = spark.read.option("header", "true").option("inferSchema", "false").csv(filepath)

df = spark.read.option("header", "true") \
               .option("inferSchema", "true") \
               .csv(filepath)

# df.printSchema()
df.show(10)
print(df.schema)

In [1]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

student_schema = StructType([
    StructField('StudentID', IntegerType(), False),
    StructField('StudentName', StringType(), True),
    StructField('StudentAge', IntegerType(), True)
])

student_data = [
    (1, "Alice", 34),
    (2, "Bob", 45),
    (3, "Charlie", 29),
    (4, "Shamas", 40),
]

df_student = spark.createDataFrame(student_data, student_schema)
df_student.show()


course_schema = StructType([
    StructField('CourseID', IntegerType(), False),
    StructField('CourseName', StringType(), True),
    StructField('CourseTitle', StringType(), True),
])

course_data = [
    (1, "Physics", "1111"),
    (2, "Chemistry", "2222"),
    (3, "English", "3333"),
    (4, "Computer Science", "4444"),
]

df_course = spark.createDataFrame(course_data, course_schema)
df_course.show()

+---------+-----------+----------+
|StudentID|StudentName|StudentAge|
+---------+-----------+----------+
|        1|      Alice|        34|
|        2|        Bob|        45|
|        3|    Charlie|        29|
|        4|     Shamas|        40|
+---------+-----------+----------+

+--------+----------------+-----------+
|CourseID|      CourseName|CourseTitle|
+--------+----------------+-----------+
|       1|         Physics|       1111|
|       2|       Chemistry|       2222|
|       3|         English|       3333|
|       4|Computer Science|       4444|
+--------+----------------+-----------+



In [2]:
rootPath = "Files/client_output_data/parquet/"
studentFilePath = rootPath + "student"
courseFilePath = rootPath + "course"

df_student.write.mode("overwrite").parquet(studentFilePath)
print("Parquet file written to " + studentFilePath)

df_course.write.mode("overwrite").parquet(courseFilePath)
print("Parquet file written to " + courseFilePath)

Parquet file written to Files/client_output_data/parquet/student
Parquet file written to Files/client_output_data/parquet/course


In [3]:
df_course.rdd.getNumPartitions()

8

In [5]:
rootPath = "Files/client_output_data/parquet/"
studentFolderPath = rootPath + "student"

# coalesce(1) returns a single-partition DataFrame, so one part file is written
df_student.coalesce(1) \
    .write.mode("overwrite") \
    .parquet(studentFolderPath)
print("Parquet file written to " + studentFolderPath)

Parquet file written to Files/client_output_data/parquet/student


In [6]:
# coalesce() returns a new DataFrame; df_student keeps its original partitioning
df_student.rdd.getNumPartitions()

8

In [7]:
df = spark.read \
     .option("mergeSchema", "true") \
     .parquet("Files/client_output_data/parquet/course")

print(df.rdd.getNumPartitions())

df = spark.read \
     .option("mergeSchema", "true") \
     .parquet("Files/client_output_data/parquet/student")


print(df.rdd.getNumPartitions())

5
3


In [14]:

# df = spark.read.parquet(
#     "Files/client_output_data/parquet/student.parquet/part-00000-3aefd2b0-4179-4b5e-8fc2-c58e8005ac1f-c000.snappy.parquet",
#     "Files/client_output_data/parquet/student.parquet/part-00001.snappy.parquet"
# )

# df = spark.read.parquet(
#     "abfss://shamas_ws@onelake.dfs.fabric.microsoft.com/test_Lakehouse.Lakehouse/Files/client_output_data/parquet/student/part-00000-3aefd2b0-4179-4b5e-8fc2-c58e8005ac1f-c000.snappy.parquet",
#     "abfss://shamas_ws@onelake.dfs.fabric.microsoft.com/test_Lakehouse.Lakehouse/Files/client_output_data/parquet/student/part-00001-3aefd2b0-4179-4b5e-8fc2-c58e8005ac1f-c000.snappy.parquet",
#     "abfss://shamas_ws@onelake.dfs.fabric.microsoft.com/test_Lakehouse.Lakehouse/Files/client_output_data/parquet/student/part-00002-3aefd2b0-4179-4b5e-8fc2-c58e8005ac1f-c000.snappy.parquet"
#     )
# df.show()

# df = spark.read.parquet(
#     "abfss://shamas_ws@onelake.dfs.fabric.microsoft.com/test_Lakehouse.Lakehouse/Files/client_output_data/parquet/student/part-00002-3aefd2b0-4179-4b5e-8fc2-c58e8005ac1f-c000.snappy.parquet"
#     )
# df.show()
df = spark.read.parquet("Files/client_output_data/parquet/student/part-*.parquet")
df.show()

+---------+-----------+----------+
|StudentID|StudentName|StudentAge|
+---------+-----------+----------+
|        3|    Charlie|        29|
|        4|     Shamas|        40|
|        1|      Alice|        34|
|        2|        Bob|        45|
+---------+-----------+----------+



In [15]:
# Original
print("Original:", df_student.rdd.getNumPartitions())

# Coalesce
df_coal = df_student.coalesce(2)
print("Coalesce(2):", df_coal.rdd.getNumPartitions())

# Repartition down to 2 partitions (full shuffle)
df_repart = df_student.repartition(2)
print("Repartition(2):", df_repart.rdd.getNumPartitions())

# Repartition up to 20 partitions (full shuffle)
df_repart = df_student.repartition(20)
print("Repartition(20):", df_repart.rdd.getNumPartitions())

Original: 8
Coalesce(2): 2
Repartition(2): 2
Repartition(20): 20



# Spark Quick Concepts

| Concept       | One-liner Explanation |
|---------------|---------------------|
| coalesce(n)   | Reduces the number of partitions without shuffle (fast, may be uneven). |
| repartition(n)| Creates exactly n partitions with shuffle (balanced, more expensive). |
| Data Skew     | Some partitions have much more data than others, causing slow tasks. |
| Shuffle       | Moves data across partitions/executors to align for operations like join/groupBy. |


# Difference: `coalesce()` vs `repartition()`

| Feature             | `coalesce(n)` | `repartition(n)` |
|---------------------|---------------|------------------|
| Shuffle operation   | ❌ No shuffle; existing partitions are merged | ✅ Always triggers a full shuffle |
| Increase partitions | ❌ Not supported by `DataFrame.coalesce`; requesting more partitions leaves the count unchanged | ✅ Supported, at the cost of a shuffle |
| Decrease partitions | ✅ Fast (just merges partitions) <br> ⚠️ May cause uneven distribution | ✅ Balanced distribution (reshuffle) |
| Performance         | Faster, less expensive | More expensive due to shuffle |
| Data distribution   | Can be skewed/uneven | Evenly distributed |
| Typical use case    | Write fewer output files quickly (e.g., `coalesce(1)`) | Prepare balanced data for joins, aggregations, or large writes |
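
The difference in data distribution can be illustrated without a cluster. The following is a simplified pure-Python model, not Spark's actual implementation: `coalesce_model` and `repartition_model` are hypothetical helpers, and the input partitions are made up. It assumes coalesce merges whole partitions while repartition deals rows out round-robin.

```python
from itertools import cycle

def coalesce_model(partitions, n):
    """Merge whole partitions into at most n groups; no row-level redistribution."""
    n = min(n, len(partitions))  # like DataFrame.coalesce, never increases the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition_model(partitions, n):
    """Deal every row round-robin into exactly n partitions (models a full shuffle)."""
    out = [[] for _ in range(n)]
    targets = cycle(range(n))
    for part in partitions:
        for row in part:
            out[next(targets)].append(row)
    return out

# Hypothetical input: 4 partitions holding 10, 1, 1 and 0 rows.
parts = [list(range(10)), [10], [11], []]

coalesce_sizes = [len(p) for p in coalesce_model(parts, 2)]
repartition_sizes = [len(p) for p in repartition_model(parts, 2)]

print("coalesce(2) sizes:   ", coalesce_sizes)
print("repartition(2) sizes:", repartition_sizes)
```

In this model the coalesced result inherits the imbalance of its inputs, while the repartitioned result is even because every row is moved.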


In [16]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema_person = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("country", StringType(), True),
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True)
])

data_person = [
    (1, "Ali",   "PK", 2024, 1),
    (2, "Ahmed", "PK", 2024, 2),
    (3, "Sara",  "PK", 2025, 1),
    (4, "John",  "US", 2024, 1),
    (5, "Maria", "US", 2025, 2),
    (6, "Chen",  "CN", 2024, 2)
]

df_person = spark.createDataFrame(data_person, schema=schema_person)

df_person.show()
df_person.printSchema()


+---+-----+-------+----+-----+
| id| name|country|year|month|
+---+-----+-------+----+-----+
|  1|  Ali|     PK|2024|    1|
|  2|Ahmed|     PK|2024|    2|
|  3| Sara|     PK|2025|    1|
|  4| John|     US|2024|    1|
|  5|Maria|     US|2025|    2|
|  6| Chen|     CN|2024|    2|
+---+-----+-------+----+-----+

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)



In [23]:
output_path = "Files/client_output_data/parquet/partitioned_person"

# df_person.coalesce(1) \
#     .write.mode("overwrite") \
#     .partitionBy("country", "year") \
#     .parquet(output_path)
# print("Parquet files written to " + output_path)

df_person.write.mode("overwrite") \
    .partitionBy("country", "year") \
    .parquet(output_path)

In [26]:
output_path = "Files/client_output_data/parquet/partitioned_person"
df = spark.read.parquet(output_path)
df.show()

df.select("id", "name", "country", "year").show()

+---+-----+-----+-------+----+
| id| name|month|country|year|
+---+-----+-----+-------+----+
|  2|Ahmed|    2|     PK|2024|
|  5|Maria|    2|     US|2025|
|  6| Chen|    2|     CN|2024|
|  3| Sara|    1|     PK|2025|
|  4| John|    1|     US|2024|
|  1|  Ali|    1|     PK|2024|
+---+-----+-----+-------+----+

+---+-----+-------+----+
| id| name|country|year|
+---+-----+-------+----+
|  2|Ahmed|     PK|2024|
|  5|Maria|     US|2025|
|  6| Chen|     CN|2024|
|  3| Sara|     PK|2025|
|  4| John|     US|2024|
|  1|  Ali|     PK|2024|
+---+-----+-------+----+



In [27]:
df.filter("country = 'PK'").show()

+---+-----+-----+-------+----+
| id| name|month|country|year|
+---+-----+-----+-------+----+
|  2|Ahmed|    2|     PK|2024|
|  3| Sara|    1|     PK|2025|
|  1|  Ali|    1|     PK|2024|
+---+-----+-----+-------+----+



In [29]:
df.select("id", "name", "country", "year") \
    .filter("country = 'US' AND year = 2025") \
    .show()

+---+-----+-------+----+
| id| name|country|year|
+---+-----+-------+----+
|  5|Maria|     US|2025|
+---+-----+-------+----+



In [None]:
partition_path = "Files/client_output_data/parquet/partitioned_person/country=US/year=2025/"
df = spark.read.parquet(partition_path)
df.show()
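
With `partitionBy("country", "year")`, the partition values live in the folder names rather than inside the Parquet files. When a leaf folder is read directly, Spark's partition discovery starts at that folder, so `country` and `year` are typically missing from the result unless the `basePath` read option points at the table root. The sketch below is a simplified pure-Python illustration of how column values can be recovered from a Hive-style path; `parse_partition_path` is a hypothetical helper, not a Spark API.

```python
def parse_partition_path(path):
    """Extract column=value segments from a Hive-style partition path."""
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            values[column] = value
    return values

full_path = "Files/client_output_data/parquet/partitioned_person/country=US/year=2025/"
partition_values = parse_partition_path(full_path)

# Values come back as strings here; Spark additionally infers their types by default.
print(partition_values)
```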

# Data Skew and Shuffle: Good vs Bad

## 🔹 Data Skew
### ✅ Good
- None inherently — skew is usually **bad**.  
- The only "good" aspect is that **identifying skew** can help optimize partitioning and improve performance.  

### ❌ Bad
- Some partitions get **huge data**, while others have very little → tasks are **unbalanced**.  
- Causes **stragglers** (a few slow tasks delay the entire job).  
- Leads to **memory pressure**, sometimes OOM errors.  
- Inefficient resource utilization (some executors sit idle while others are overloaded).  

---

## 🔹 Shuffle
### ✅ Good
- Enables **wide transformations** like `groupBy`, `reduceByKey`, `join`.  
- Redistributes data across partitions → ensures **correctness** of operations.  
- Necessary for **load balancing** in some cases (e.g., after skew fix).  
- Enables **parallelism** across nodes.  

### ❌ Bad
- **Expensive** operation → involves disk I/O, network I/O, and serialization.  
- Can generate **large intermediate files**.  
- Prone to **shuffle spill** → increases job runtime.  
- More shuffles → **slower job**, higher cluster cost.  
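
One way to reduce shuffle cost is map-side combine: each partition pre-aggregates its own rows so fewer records cross the network. The following is a minimal pure-Python sketch of the idea for a count-per-key aggregation. The function names and the sample partitions are hypothetical, and it does not reproduce Spark internals.

```python
from collections import Counter

def shuffle_without_combine(map_partitions):
    """Every (key, 1) record crosses the shuffle boundary."""
    return [(key, 1) for part in map_partitions for key in part]

def shuffle_with_combine(map_partitions):
    """Each partition pre-aggregates locally, then sends one record per key."""
    records = []
    for part in map_partitions:
        records.extend(Counter(part).items())
    return records

def reduce_side(records):
    """Final aggregation after the shuffle."""
    totals = Counter()
    for key, count in records:
        totals[key] += count
    return dict(totals)

# Hypothetical input: two map-side partitions of country codes.
map_partitions = [["PK", "PK", "PK", "US"], ["PK", "US", "US", "CN"]]

plain = shuffle_without_combine(map_partitions)
combined = shuffle_with_combine(map_partitions)

print("records shuffled without combine:", len(plain))
print("records shuffled with combine:   ", len(combined))
print("same result:", reduce_side(plain) == reduce_side(combined))
```

The final totals are identical either way; only the number of records that must be moved changes.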

---

👉 **Rule of Thumb**:  
- **Data skew = almost always harmful** → fix it (via `repartition`, salting, map-side combine, etc.).  
- **Shuffle = necessary evil** → it costs performance but is sometimes essential for correctness. Minimize shuffles, but never avoid one at the cost of wrong results.  
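
Salting can be sketched in plain Python as follows. This is a toy model under stated assumptions: `toy_partitioner` is a deterministic stand-in for a hash partitioner (not Spark's real hash), `SALT_BUCKETS` is a hypothetical tuning value, and the salts are assigned round-robin to keep the example reproducible, whereas a random salt is more usual in practice.

```python
from collections import Counter

NUM_PARTITIONS = 4
SALT_BUCKETS = 4  # hypothetical value; in practice tuned to the degree of skew

def toy_partitioner(key):
    """Deterministic stand-in for a hash partitioner (not Spark's real hash)."""
    return sum(key.encode()) % NUM_PARTITIONS

def partition_sizes(keys):
    sizes = Counter(toy_partitioner(k) for k in keys)
    return [sizes.get(p, 0) for p in range(NUM_PARTITIONS)]

# Hypothetical skewed data: one hot key dominates.
keys = ["PK"] * 900 + ["US"] * 60 + ["CN"] * 40

# Salting: append a small suffix so the hot key becomes several distinct keys.
salted_keys = [f"{k}_{i % SALT_BUCKETS}" for i, k in enumerate(keys)]

unsalted_sizes = partition_sizes(keys)
salted_sizes = partition_sizes(salted_keys)

# Stage 1 aggregates per salted key; stage 2 strips the salt and combines.
partial = Counter(salted_keys)
final = Counter()
for salted_key, count in partial.items():
    final[salted_key.rsplit("_", 1)[0]] += count

print("unsalted partition sizes:", unsalted_sizes)
print("salted partition sizes:  ", salted_sizes)
print("final counts:            ", dict(final))
```

In this model the largest partition shrinks from 900 rows to 250, at the price of a second aggregation step that removes the salt.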
