## Partitioning and bucketing in PySpark

Partitioning and bucketing are two ways to lay out data on disk in PySpark. Both aim to reduce the amount of data a query has to read or shuffle in distributed processing.

### Partitioning
Partitioning is a technique that organizes the data in a DataFrame or table by dividing it into separate files or directories based on the unique values of one or more columns.

**How It Works:**
- **Partition column:** Data is written into one subdirectory per distinct value of the partition column (for example, `Country=India/`).
- **Partition pruning:** When a query filters on the partition column, Spark skips the directories that cannot match, which improves performance.
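
To make the directory layout and pruning concrete, here is a minimal plain-Python sketch (no Spark involved). The helpers `partition_layout` and `prune` are hypothetical names invented for this illustration; they only mimic the idea and are not how Spark implements it.

```python
from collections import defaultdict

def partition_layout(rows, columns, partition_col):
    """Group rows under Hive-style 'col=value' directory names (illustration only)."""
    idx = columns.index(partition_col)
    layout = defaultdict(list)
    for row in rows:
        # The partition value is encoded in the directory name,
        # so it is dropped from the rows stored inside that directory.
        remaining = tuple(v for i, v in enumerate(row) if i != idx)
        layout[f"{partition_col}={row[idx]}"].append(remaining)
    return dict(layout)

def prune(layout, partition_col, wanted):
    """Keep only the directories that can match `partition_col == wanted`."""
    return {d: rows for d, rows in layout.items() if d == f"{partition_col}={wanted}"}

data = [("Rohish", 25, "India"), ("Bob", 30, "UK"), ("Naina", 35, "India"), ("David", 40, "USA")]
columns = ["Name", "Age", "Country"]

layout = partition_layout(data, columns, "Country")
print(sorted(layout))   # ['Country=India', 'Country=UK', 'Country=USA']

india_only = prune(layout, "Country", "India")
print(india_only)       # {'Country=India': [('Rohish', 25), ('Naina', 35)]}
```

A filter on `Country` touches one directory out of three here; a filter on `Age` would still have to look at all of them.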

**Advantages:**
- Reduces the amount of data scanned during query execution (pruning).
- Helps in organizing and managing large datasets.

**Disadvantages:**
- Over-partitioning can lead to many small files, which degrade performance (HDFS small file problem).
- Only effective if queries use filters on the partition column.
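
A rough back-of-the-envelope sketch of the small-file risk: each write task typically emits one file for every partition value it holds, so the worst case is roughly tasks × distinct values. The numbers below (200 tasks, 365 partition values, 10 GB of data) are hypothetical, chosen only for illustration.

```python
def worst_case_file_count(num_write_tasks, distinct_partition_values):
    """Upper bound when every task holds rows for every partition value."""
    return num_write_tasks * distinct_partition_values

def average_file_size_mb(total_size_mb, file_count):
    """Average size per file if the data were spread evenly."""
    return total_size_mb / file_count

# Hypothetical job: 200 write tasks, one partition per day of a year, 10 GB total.
files = worst_case_file_count(num_write_tasks=200, distinct_partition_values=365)
print(files)                                       # 73000
print(round(average_file_size_mb(10_240, files), 2))  # 0.14
```

Files of a fraction of a megabyte are far below typical block sizes, which is the situation the first bullet above warns about.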

**Example: Writing with Partitioning**

In [0]:
# Sample DataFrame
data = [("Rohish", 25, "India"), ("Bob", 30, "UK"), ("Naina", 35, "India"), ("David", 40, "USA")]
columns = ["Name", "Age", "Country"]
df = spark.createDataFrame(data, columns)

# Write the DataFrame with partitioning
df.write.format("parquet") \
    .partitionBy("Country") \
    .mode("overwrite") \
    .save("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/partition_output/")

**Output Directory Structure:**

In [0]:
%fs
ls dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/partition_output/

path,name,size,modificationTime
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/partition_output/Country=India/,Country=India/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/partition_output/Country=UK/,Country=UK/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/partition_output/Country=USA/,Country=USA/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/partition_output/_SUCCESS,_SUCCESS,0,1733491248000
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/partition_output/_committed_3496381552902929341,_committed_3496381552902929341,35,1733491247000


**When to Use Partitioning**
- When datasets are large and queries often filter data on specific columns.
- Common for time-series data (e.g., partition by year or month).
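
For the time-series case, the sketch below shows how dates could map to nested `year=.../month=...` directories. It is plain Python for illustration; the helper `time_partition_path` and the sample dates are hypothetical, and in PySpark you would derive `year` and `month` columns and pass both to `partitionBy`.

```python
from datetime import date

def time_partition_path(event_date):
    """Build a nested Hive-style partition path from a date (illustration only)."""
    return f"year={event_date.year}/month={event_date.month}"

# Hypothetical event dates.
events = [date(2024, 11, 30), date(2024, 12, 1), date(2024, 12, 6)]

paths = sorted({time_partition_path(d) for d in events})
print(paths)  # ['year=2024/month=11', 'year=2024/month=12']
```

Three events land in two directories, so a query for December 2024 only needs the `year=2024/month=12` directory.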

### Bucketing
Bucketing is another optimization technique where data is grouped into a fixed number of files (buckets) based on the hash of a column or columns. 

Unlike partitioning, bucketing doesn’t create directories but instead divides data into buckets within the same directory.

**How It Works**
- Data is distributed based on the hash of the bucket column into a predefined number of buckets.
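- Each bucket is stored as one or more files: every write task emits its own file for each bucket it holds, so the number of files can exceed the number of buckets.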
- Helps in query optimization when performing operations like joins and aggregations on bucketed columns.
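
The hash-then-modulo idea can be sketched in a few lines of plain Python. Note the assumption: Spark uses its own Murmur3-based hash, while this sketch substitutes `zlib.crc32` as a stand-in, so the bucket ids printed here will not match the ones Spark assigns.

```python
import zlib

def bucket_id(value, num_buckets):
    """Map a value to a bucket with hash(value) mod num_buckets.

    zlib.crc32 is a stand-in hash for illustration; Spark's real hash differs.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

countries = ["India", "UK", "India", "USA"]
assignments = [(country, bucket_id(country, 4)) for country in countries]
print(assignments)
```

The property that matters is determinism: both `"India"` rows always get the same bucket id, and every id falls in the range `0..3`.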

**Advantages:**
- Optimizes performance for joins and aggregations by co-locating data with the same hash.
- Allows better control over file size and data distribution.
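
Why co-location helps joins can be shown with a small plain-Python sketch: if both sides are bucketed with the same hash and the same bucket count, matching keys can only meet inside the same bucket, so each bucket pair can be joined independently. The helpers, the `capitals` lookup table, and the use of `zlib.crc32` (a stand-in for Spark's hash) are all assumptions made for this illustration.

```python
import zlib
from collections import defaultdict

def bucket_id(key, num_buckets):
    # Stand-in hash for illustration; Spark uses a different hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def bucketize(rows, key_index, num_buckets):
    """Group rows by the bucket id of their join key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[key_index], num_buckets)].append(row)
    return buckets

def bucketed_join(left, right, num_buckets):
    """Join on the first field, comparing rows only within the same bucket."""
    left_b = bucketize(left, 0, num_buckets)
    right_b = bucketize(right, 0, num_buckets)
    joined = []
    for b in range(num_buckets):
        for l_row in left_b.get(b, []):
            for r_row in right_b.get(b, []):
                if l_row[0] == r_row[0]:
                    joined.append((l_row[0], l_row[1], r_row[1]))
    return sorted(joined)

people = [("India", "Rohish"), ("UK", "Bob"), ("India", "Naina"), ("USA", "David")]
capitals = [("India", "New Delhi"), ("UK", "London"), ("USA", "Washington, D.C.")]  # hypothetical lookup table

result = bucketed_join(people, capitals, 4)
print(result)
```

The bucket-wise join returns the same rows as a full cross-comparison, but no row ever has to be compared with a row from a different bucket, which is the work a shuffle would otherwise do.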

**Disadvantages:**
- Requires specifying the number of buckets upfront.
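- Does not give directory-level pruning the way partitioning does; Spark can skip buckets only for filters on the bucket column itself (for example, equality filters).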

**Example: Writing with Bucketing**

In [0]:
# Sample DataFrame
data = [("Rohish", 25, "India"), ("Bob", 30, "UK"), ("Naina", 35, "India"), ("David", 40, "USA")]
columns = ["Name", "Age", "Country"]
df = spark.createDataFrame(data, columns)

# Write the DataFrame with bucketing
# Wrapped in parentheses so the inline comments are valid Python
# (a comment after a trailing backslash is a syntax error)
(
    df.write.format("parquet")
    .bucketBy(4, "Country")   # 4 buckets based on the hash of "Country"
    .sortBy("Age")            # Sort records within each bucket by "Age"
    .saveAsTable("bucketed_table")
)

**Output**
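- Bucketing distributes the data across up to 4 buckets based on the hash of the Country column; with three distinct countries, at most three buckets receive data.
- `bucketBy` works only with `saveAsTable`, so the data is stored as a table registered in the metastore, here under the warehouse directory `dbfs:/user/hive/warehouse/`. Spark's bucketing layout is not compatible with Hive's own bucketing.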

In [0]:
%fs
ls dbfs:/user/hive/warehouse/bucketed_table/

path,name,size,modificationTime
dbfs:/user/hive/warehouse/bucketed_table/_SUCCESS,_SUCCESS,0,1733491450000
dbfs:/user/hive/warehouse/bucketed_table/_committed_7924222626171180252,_committed_7924222626171180252,444,1733491450000
dbfs:/user/hive/warehouse/bucketed_table/_started_7924222626171180252,_started_7924222626171180252,0,1733491450000
dbfs:/user/hive/warehouse/bucketed_table/part-00001-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-17-1_00003.c000.snappy.parquet,part-00001-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-17-1_00003.c000.snappy.parquet,1100,1733491450000
dbfs:/user/hive/warehouse/bucketed_table/part-00003-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-19-1_00000.c000.snappy.parquet,part-00003-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-19-1_00000.c000.snappy.parquet,1057,1733491450000
dbfs:/user/hive/warehouse/bucketed_table/part-00005-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-21-1_00003.c000.snappy.parquet,part-00005-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-21-1_00003.c000.snappy.parquet,1092,1733491450000
dbfs:/user/hive/warehouse/bucketed_table/part-00007-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-23-1_00002.c000.snappy.parquet,part-00007-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-23-1_00002.c000.snappy.parquet,1079,1733491450000
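
The file names in the listing above appear to carry the bucket id as the zero-padded number between the underscore and `.c000`. That reading is an assumption based on the names shown, not something this notebook verifies. Under that assumption, a small plain-Python sketch can recover which bucket each file belongs to; `bucket_of` is a hypothetical helper.

```python
import re
from collections import Counter

# File names copied from the listing above.
file_names = [
    "part-00001-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-17-1_00003.c000.snappy.parquet",
    "part-00003-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-19-1_00000.c000.snappy.parquet",
    "part-00005-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-21-1_00003.c000.snappy.parquet",
    "part-00007-tid-7924222626171180252-b8316328-8dda-4c39-a59f-be7e40021a90-23-1_00002.c000.snappy.parquet",
]

# Assumed pattern: "_<bucket id>.c<counter>." near the end of the name.
BUCKET_SUFFIX = re.compile(r"_(\d+)\.c\d+\.")

def bucket_of(file_name):
    """Return the bucket id parsed from a file name, or None if absent."""
    match = BUCKET_SUFFIX.search(file_name)
    return int(match.group(1)) if match else None

bucket_ids = [bucket_of(name) for name in file_names]
print(bucket_ids)            # [3, 0, 3, 2]
print(Counter(bucket_ids))
```

Bucket 3 shows up in two files, which suggests that two separate write tasks each emitted their own file for that bucket, and bucket 1 received no data at all.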


**When to Use Bucketing:**
- When you frequently join or aggregate on specific columns.
- When partitioning is either impractical or leads to too many small files.

### Key Differences Between Partitioning and Bucketing
| **Feature**         | **Partitioning**                               | **Bucketing**                                |
|---------------------|-----------------------------------------------|---------------------------------------------|
| **Basis**           | Divides data by unique column values.         | Divides data by the hash of column values.  |
| **Output Structure**| Creates separate directories for partitions.  | Stores all buckets in the same directory.   |
| **Dynamic Pruning** | Supports dynamic partition pruning.           | Does not support dynamic pruning.           |
| **Granularity**     | Based on unique values (may lead to over-partitioning). | Fixed number of buckets (user-defined). |
| **Use Case**        | Filtering data on partition columns.          | Joining or aggregating on bucket columns.   |



### Combining Partitioning and Bucketing

You can use both techniques together on large datasets: partition on a column that queries filter on, then bucket within each partition on a column used in joins or aggregations.

**Example: Partitioning + Bucketing**

In [0]:
# Wrapped in parentheses so the inline comments are valid Python
(
    df.write.format("parquet")
    .partitionBy("Country")   # Partition by Country
    .bucketBy(4, "Age")       # Bucket within each partition by Age
    .saveAsTable("partitioned_bucketed_table")
)


In [0]:
%fs
ls dbfs:/user/hive/warehouse/partitioned_bucketed_table/

path,name,size,modificationTime
dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=India/,Country=India/,0,0
dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=UK/,Country=UK/,0,0
dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=USA/,Country=USA/,0,0
dbfs:/user/hive/warehouse/partitioned_bucketed_table/_SUCCESS,_SUCCESS,0,1733491677000


In [0]:
%fs
ls dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=India/

path,name,size,modificationTime
dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=India/_SUCCESS,_SUCCESS,0,1733491677000
dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=India/_committed_2687089059435324386,_committed_2687089059435324386,234,1733491677000
dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=India/_started_2687089059435324386,_started_2687089059435324386,0,1733491676000
dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=India/part-00001-tid-2687089059435324386-7386ccd7-db2c-4841-95a2-77d992f02857-41-1_00000.c000.snappy.parquet,part-00001-tid-2687089059435324386-7386ccd7-db2c-4841-95a2-77d992f02857-41-1_00000.c000.snappy.parquet,851,1733491676000
dbfs:/user/hive/warehouse/partitioned_bucketed_table/Country=India/part-00005-tid-2687089059435324386-7386ccd7-db2c-4841-95a2-77d992f02857-45-1_00002.c000.snappy.parquet,part-00005-tid-2687089059435324386-7386ccd7-db2c-4841-95a2-77d992f02857-45-1_00002.c000.snappy.parquet,843,1733491676000
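
The combined layout seen in the listings can be summarized as "directory from the partition column, bucket id from the hash of the bucket column". The plain-Python sketch below illustrates that mapping for the sample rows. `target_location` is a hypothetical helper, and `zlib.crc32` is a stand-in for Spark's hash, so the bucket ids will not match the ones in the file names above.

```python
import zlib

def bucket_id(value, num_buckets):
    # Stand-in hash for illustration; Spark uses a different hash function.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def target_location(row, columns, partition_col, bucket_col, num_buckets):
    """Return (partition directory, bucket id) for one row."""
    record = dict(zip(columns, row))
    directory = f"{partition_col}={record[partition_col]}"
    return directory, bucket_id(record[bucket_col], num_buckets)

data = [("Rohish", 25, "India"), ("Bob", 30, "UK"), ("Naina", 35, "India"), ("David", 40, "USA")]
columns = ["Name", "Age", "Country"]

locations = [target_location(row, columns, "Country", "Age", 4) for row in data]
for row, (directory, bucket) in zip(data, locations):
    print(row[0], "->", directory, "bucket", bucket)

# Each partition directory has its own set of buckets.
max_bucket_slots = len({directory for directory, _ in locations}) * 4
print(max_bucket_slots)  # 12
```

Because every partition gets its own buckets, the number of possible bucket slots grows as partitions × buckets, which is worth keeping in mind when choosing both settings.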
