#### Creating a Parquet Partitioned File

When we run a query on an unpartitioned Parquet table, Spark has to scan the whole dataset to return the matching rows, similar to a full table scan in a traditional database.

In PySpark, we can optimize query performance by dividing the data into partitions based on specific columns, using the partitionBy() method of the DataFrame writer. When a query filters on a partition column, Spark can read only the matching partitions instead of scanning the entire dataset.

_Example_:

df.write.**partitionBy**("gender", "salary").mode("overwrite").parquet("/path/file_name.parquet")


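In this example, the data is written in Parquet format to a directory that holds one subdirectory per gender value and, inside each of those, one subdirectory per salary value (for example gender=M/salary=5500). Queries that filter on these columns can then skip the subdirectories that do not match. Partitioning works best on columns with few distinct values; salary is used here only for illustration, since a column with many distinct values produces many small files.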
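To make the resulting layout concrete, here is a minimal plain-Python sketch of how Hive-style partition directories are named. The helper `partition_dir` and the three sample rows are hypothetical and exist only for illustration; they are not part of the Spark API. The sketch also ignores the part files Spark writes inside each directory and any escaping of special characters.

```python
def partition_dir(row: dict, partition_cols: list) -> str:
    """Build a Hive-style relative directory such as 'gender=M/salary=5500'."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)


# Hypothetical sample rows, used only to illustrate the naming scheme
rows = [
    {"firstname": "Aarav", "gender": "M", "salary": 5500},
    {"firstname": "Diya", "gender": "F", "salary": 6200},
    {"firstname": "Karan", "gender": "M", "salary": 7200},
]

# One directory per distinct (gender, salary) combination
dirs = sorted({partition_dir(r, ["gender", "salary"]) for r in rows})
for d in dirs:
    print(d)
```

The order of the columns passed to partitionBy() determines the nesting order of the directories, so the first column becomes the top level.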

In [0]:
file_path = "/Volumes/workspace/training/test_data/parquet"
file_name1 = "employee_partition.parquet"
full_path = f"{file_path}/{file_name1}"

In [0]:
# Create dataframe
data = [
    ("Aarav", "Kumar", "Patel", "1993-08-14", "M", 5500),
    ("Diya", "Rani", "Sharma", "1998-03-22", "F", 6200),
    ("Karan", "", "Mehta", "1989-11-10", "M", 7200),
    ("Meera", "Anand", "Nair", "1995-07-05", "F", 5800),
    ("Rohan", "", "Verma", "1990-12-30", "M", 5000),
    ("Sneha", "L.", "Reddy", "1996-04-18", "F", 6100),
    ("Vikram", "", "Singh", "1988-09-25", "M", 6800),
    ("Priya", "G.", "Iyer", "1992-01-16", "F", 6400),
    ("Aditya", "", "Khan", "1999-02-28", "M", 4700),
    ("Neha", "", "Chopra", "1997-10-12", "F", 5900)
]

columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]

df = spark.createDataFrame(data, columns)
df.show()

In [0]:
# Write the DataFrame as a Parquet dataset partitioned by gender and salary
df.write.partitionBy("gender", "salary").mode("overwrite").parquet(full_path)

In [0]:
# Read the partitioned Parquet dataset; gender and salary are recovered from the directory names
parDF = spark.read.parquet(full_path)
display(parDF)
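
The benefit of partitioning shows up when a read filters on a partition column. The plain-Python sketch below only imitates the idea of partition pruning: it selects directories by name from a list. The helpers `parse_partition` and `prune` and the directory list are hypothetical, and Spark's actual planning is more involved than this.

```python
def parse_partition(path: str) -> dict:
    """Turn 'gender=M/salary=5500' into {'gender': 'M', 'salary': '5500'}."""
    return dict(segment.split("=", 1) for segment in path.split("/"))


def prune(paths, **wanted):
    """Keep only the directories whose partition values match every filter."""
    return [
        p for p in paths
        if all(parse_partition(p).get(col) == str(val) for col, val in wanted.items())
    ]


# Hypothetical directory names, in the style produced by partitionBy("gender", "salary")
partition_dirs = [
    "gender=F/salary=5800",
    "gender=F/salary=6200",
    "gender=M/salary=5500",
    "gender=M/salary=7200",
]

# Only the gender=M directories need to be read for this filter
to_scan = prune(partition_dirs, gender="M")
print(to_scan)
```

In the notebook itself, the corresponding read would be along the lines of `spark.read.parquet(full_path).filter("gender = 'M'")`, where the filter on gender allows Spark to skip the gender=F directories.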