In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

In [0]:
%fs ls /mnt/adls/target_tables/Dim/

path,name,size,modificationTime
dbfs:/mnt/adls/target_tables/Dim/categories_table/,categories_table/,0,0
dbfs:/mnt/adls/target_tables/Dim/customers_table/,customers_table/,0,0
dbfs:/mnt/adls/target_tables/Dim/material_table/,material_table/,0,0
dbfs:/mnt/adls/target_tables/Dim/products_table/,products_table/,0,0
dbfs:/mnt/adls/target_tables/Dim/roles_table/,roles_table/,0,0
dbfs:/mnt/adls/target_tables/Dim/suppliers_table/,suppliers_table/,0,0
dbfs:/mnt/adls/target_tables/Dim/users_table/,users_table/,0,0
dbfs:/mnt/adls/target_tables/Dim/warehouses_table/,warehouses_table/,0,0


# Narrow Transformations in Spark

A **narrow transformation** in Spark refers to operations on **RDDs (Resilient Distributed Datasets)** that only involve data from a **single partition**. These transformations do **not require shuffling** of data between partitions, making them more efficient than wide transformations, which involve data exchange across partitions.

## Key Characteristics:
- **Single Partition**: Narrow transformations process data within each partition independently.
- **No Shuffling**: Data does not need to be redistributed or shuffled across nodes in the cluster.
- **Partitioning Preserved**: The partitioning scheme of the input RDD is generally maintained after the transformation.

## Common Narrow Transformations:
1. **map()**: 
   - Applies a function to each element in the RDD and returns a new RDD with the transformed elements.
   
2. **filter()**: 
   - Filters out elements based on a condition, returning only the elements that satisfy the condition.
   
3. **flatMap()**: 
   - Similar to `map()`, but can return multiple values for each input element, which are then flattened into a single RDD.
   
4. **mapPartitions()**: 
   - Applies a function to entire partitions (not individual elements), but still operates within the same partition.
   
5. **union()**: 
   - Combines two RDDs into one without shuffling data; the result simply contains the partitions of both inputs.
   
6. **sample()**: 
   - Randomly samples data from the RDD without requiring a shuffle.
   
7. **zip()**: 
   - Combines two RDDs element-wise into pairs without shuffling.

## Why Narrow Transformations are Efficient:
- **Faster Execution**: Since no shuffling is required, narrow transformations are generally faster and require less network I/O and disk usage.
- **Lower Overhead**: The lack of data movement across partitions reduces computational and network overhead.

## Example:
```python
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 2)  # map is a narrow transformation
print(result.collect())  # Output: [2, 4, 6, 8, 10]
```


| **Narrow Transformation** | **Description**                                                       |
|---------------------------|-----------------------------------------------------------------------|
| **map()**                 | Applies a function to each element in the RDD.                      |
| **filter()**              | Returns a new RDD containing elements that meet a specified condition. |
| **flatMap()**             | Similar to `map`, but allows returning multiple elements for each input element. |
| **union()**               | Combines two RDDs into one, containing all elements from both.      |
| **sample()**              | Returns a random sample of the RDD.                                 |
| **zip()**                 | Combines two RDDs by pairing corresponding elements together.        |
| **coalesce()**            | Reduces the number of partitions by merging existing ones; no shuffle with the default `shuffle=False`. |
| **mapPartitions()**       | Applies a function to each partition of the RDD, allowing more efficient operations. |

**Note:** `distinct()` and `sortBy()` are not narrow transformations. `distinct()` has to compare rows across partitions and `sortBy()` has to range-partition the data, so both trigger a shuffle and belong with the wide transformations.


## Narrow Transformation


# Narrow Transformations in Spark DataFrames

A **narrow transformation** in Spark for **DataFrames** refers to operations that only involve a **single partition** and do not require shuffling of data between partitions. These transformations are more efficient than **wide transformations**, which involve data exchange across partitions.

## Key Characteristics:

- **Single Partition**: Narrow transformations work within a single partition of the DataFrame without involving other partitions.

- **No Shuffling**: Data does not need to be moved or shuffled across partitions, which reduces network I/O and speeds up execution.

- **Partitioning Preserved**: The partitioning of the input DataFrame is usually maintained after the transformation, unless explicitly altered.

## Common Narrow Transformations:

1. **select()**:

   - Selects a subset of columns from a DataFrame, creating a new DataFrame with only the specified columns.

2. **filter()**:

   - Filters rows based on a condition, returning only those that satisfy the given filter expression.

3. **map()** (using `rdd` API):

   - Similar to `map()` on RDDs, it allows applying a function to each row of the DataFrame and can return a transformed DataFrame or RDD.

4. **withColumn()**:

   - Adds a new column to the DataFrame, applying a transformation to an existing column or expression.

5. **drop()**:

   - Removes one or more columns from the DataFrame, creating a new DataFrame without the specified columns.

6. **distinct()**:

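   - Returns a new DataFrame with only the distinct rows. Removing duplicates requires comparing rows across partitions, so `distinct()` triggers a shuffle and is really a wide transformation, as the experiment later in this notebook shows.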

7. **limit()**:

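   - Returns a new DataFrame with at most the first `n` rows of the original DataFrame. Each partition is truncated locally, but Spark then gathers the surviving rows into a single partition to apply the global limit, so it is not strictly narrow.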

8. **alias()**:

   - Renames a column with a new alias in a DataFrame.

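9. **join()** (broadcast joins only):

   - When one side is small enough to be broadcast to every executor, the join runs inside each partition of the larger side without shuffling it. Other strategies, such as sort-merge join, shuffle both sides and are wide transformations.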

## Why Narrow Transformations are Efficient:

- **Faster Execution**: Since no shuffling is required, narrow transformations are faster and consume less memory and network I/O.

- **Lower Overhead**: The data is processed locally within each partition, reducing computational overhead and avoiding unnecessary data movement.

## Example:

```python
# Example with a DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NarrowTransformations").getOrCreate()

# Create a DataFrame
df = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Cathy")], ["id", "name"])

# Narrow transformation: using 'select' to pick columns
result = df.select("name")

# Show result
result.show()
# Output:
# +-----+
# | name|
# +-----+
# |Alice|
# |  Bob|
# |Cathy|
# +-----+
```


## Common Narrow Transformations in DataFrames:

| **Narrow Transformation**   | **Description**                                                        |
|-----------------------------|------------------------------------------------------------------------|
| **select()**                 | Selects a subset of columns from the DataFrame.                        |
| **filter()**                 | Filters rows based on a condition.                                    |
| **withColumn()**             | Adds or updates a column in the DataFrame with a specified transformation. |
| **drop()**                   | Removes one or more columns from the DataFrame.                        |
| **distinct()**               | Returns a DataFrame with distinct rows; removing duplicates requires a shuffle, so this is actually a wide transformation. |
| **limit()**                  | Limits the number of rows returned by the DataFrame to `n`.            |
| **alias()**                  | Renames a column with a new alias.                                     |
| **selectExpr()**             | Allows selecting columns and performing transformations via SQL expressions. |


In [0]:
##Select condition 
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df=df.select('CategoryID','CategoryName','ParentCategoryID','NumberOfProducts','LastUpdated')
df.show()

+----------+-------------+----------------+----------------+-------------------+
|CategoryID| CategoryName|ParentCategoryID|NumberOfProducts|        LastUpdated|
+----------+-------------+----------------+----------------+-------------------+
|      7250|Category_7250|            5132|             414|2024-11-25 08:43:41|
|      7251|Category_7251|            1793|              79|2024-11-02 08:43:41|
|      7252|Category_7252|            5501|             414|2024-11-07 08:43:41|
|      7253|Category_7253|            2682|              21|2024-11-10 08:43:41|
|      7254|Category_7254|            2434|               7|2024-11-15 08:43:41|
|      7255|Category_7255|            null|              70|2024-11-29 08:43:41|
|      7256|Category_7256|            7215|             365|2024-11-10 08:43:41|
|      7257|Category_7257|            null|             195|2024-11-05 08:43:41|
|      7258|Category_7258|            null|             170|2024-11-08 08:43:41|
|      7259|Category_7259|  

When running the `select()` cell, the Spark UI can show a discrepancy between the **logical row count** of the table (7000 records) and the **rows read** metric of the scan in the DAG (875 rows). The scan reads far fewer rows than the table contains, and the optimizations usually credited for this only partly explain it.

#### 1. **Column Pruning**:
- **What is Column Pruning?**
  - When you use the `select()` operation, Spark reads only the requested columns from the Parquet files. This is known as **column pruning**.
  - **Impact on Read Rows**: Column pruning reduces the bytes read and the memory used, but it **does not reduce the number of rows read**. On its own it cannot explain why 875 rows are read instead of 7000.

#### 2. **Efficient Data Loading**:
- **What is Efficient Data Loading?**
  - Spark employs further techniques to reduce the amount of data it reads, such as **partition pruning** and **predicate pushdown**.
    - **Partition Pruning**: If the dataset is partitioned on disk (e.g., by `CategoryID` or `LastUpdated`) and the query filters on that column, Spark skips the irrelevant directories.
    - **Predicate Pushdown**: If a `WHERE` clause or filter is applied, Spark can push it down to the Parquet reader, which can skip row groups whose statistics rule out a match.
  - **Impact on Read Rows**: Both techniques need a filter, and this cell has none, so neither applies here.

#### 3. **Partial Scan by `show()`**:
- `show()` only needs the first 20 rows, so Spark starts by scanning a single partition and launches more tasks only if that partition does not yield enough rows.
- The table is read as 8 partitions (see the `distinct()` cell below), and 7000 / 8 = 875, so the metric most likely reflects one partition being read in full.

#### Summary:
- The gap between the logical row count (7000 rows) and the "rows read" metric (875 rows) is most likely caused by `show()` scanning only one of the 8 partitions, not by column pruning.
- Column pruning, partition pruning, and predicate pushdown all reduce what Spark has to read, but only the last two can lower the row count, and only when a filter is present.
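
One way to reason about the 875 figure is to model how a `show()`-style action consumes partitions. The sketch below is a simplified plain-Python model, not Spark's actual implementation. It assumes the 7000 rows are spread evenly over the 8 partitions reported later in this notebook, and that every scanned partition is read in full. The function name `rows_scanned_by_take` is hypothetical.

```python
def rows_scanned_by_take(partition_sizes, n):
    """Simplified model: scan partitions in order until n rows have been collected."""
    scanned = 0
    collected = 0
    for size in partition_sizes:
        scanned += size      # assumption: a scanned partition is read in full
        collected += size
        if collected >= n:   # enough rows to display, stop scanning
            break
    return scanned

# Assumption: 8 equally sized partitions, 7000 / 8 = 875 rows each.
partition_sizes = [7000 // 8] * 8

# show() displays 20 rows and asks for one extra to know whether more exist.
rows_read = rows_scanned_by_take(partition_sizes, n=21)
print(rows_read)
```

Under these assumptions the first partition already supplies enough rows, so only 875 of the 7000 rows are scanned.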


In [0]:
##Filter condition 
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df=df.filter(df.ParentCategoryID==9166)
df.show()

+----------+------------+----------------+-----------+----------------+-----------+-----------+
|CategoryID|CategoryName|ParentCategoryID|Description|NumberOfProducts|CreatedDate|LastUpdated|
+----------+------------+----------------+-----------+----------------+-----------+-----------+
+----------+------------+----------------+-----------+----------------+-----------+-----------+



### Observation:

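A total of 4 jobs are generated in this step. Jobs are triggered by actions rather than by individual transformations, so they do not map one-to-one onto read, filter, and show. A more likely breakdown is:
1. One job launched by `spark.read.parquet` to read the Parquet metadata and infer the schema.
2. One job for `show()`, scanning the first partition with the filter applied.
3. Further `show()` jobs over additional partitions, because the filter matches no rows and the first scan cannot supply the 20 rows requested.

Although 4 jobs were created, only 3 of them appear as outputs in the UI.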
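
One possible explanation for the job count, offered as an assumption and not verified against this cluster's Spark UI: because the filter matches no rows, `show()` cannot collect 20 rows from the first partition and keeps launching jobs over more partitions. The plain-Python sketch below models that growth. It uses a scale-up factor of 4 (the default of `spark.sql.limit.scaleUpFactor`) and the 8 partitions reported by the `distinct()` cell. The function `take_rounds` is hypothetical and only mimics the case where every round comes back empty.

```python
def take_rounds(total_partitions, scale_up_factor=4):
    """Partitions scanned in each round when no round returns any rows."""
    rounds = []
    scanned = 0
    while scanned < total_partitions:
        # First round tries one partition; later rounds grow by the scale-up factor.
        to_try = 1 if scanned == 0 else scanned * scale_up_factor
        # Never try more partitions than are left.
        to_try = min(to_try, total_partitions - scanned)
        rounds.append(to_try)
        scanned += to_try
    return rounds

rounds = take_rounds(8)
print(rounds)  # [1, 4, 3]
```

In this model the empty filter result leads to three scan jobs. Adding one job for reading the Parquet metadata would give four, but that mapping is a guess and should be confirmed in the Spark UI.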

In [0]:
##withColumn condition 
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df=df.withColumn("NumberOfProducts2",df.NumberOfProducts*2)
df.show()

+----------+-------------+----------------+--------------------+----------------+-------------------+-------------------+-----------------+
|CategoryID| CategoryName|ParentCategoryID|         Description|NumberOfProducts|        CreatedDate|        LastUpdated|NumberOfProducts2|
+----------+-------------+----------------+--------------------+----------------+-------------------+-------------------+-----------------+
|      7250|Category_7250|            5132|This is a descrip...|             414|2024-01-01 08:43:41|2024-11-25 08:43:41|              828|
|      7251|Category_7251|            1793|This is a descrip...|              79|2024-08-26 08:43:41|2024-11-02 08:43:41|              158|
|      7252|Category_7252|            5501|This is a descrip...|             414|2024-11-21 08:43:41|2024-11-07 08:43:41|              828|
|      7253|Category_7253|            2682|This is a descrip...|              21|2024-03-26 08:43:41|2024-11-10 08:43:41|               42|
|      7254|Category

In [0]:
# execution plan
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
# explain() prints the plan and returns None, so its result is not assigned back to df
df.withColumn("NumberOfProducts2",df.NumberOfProducts*2).explain()

== Physical Plan ==
*(1) Project [CategoryID#262, CategoryName#263, ParentCategoryID#264, Description#265, NumberOfProducts#266, CreatedDate#267, LastUpdated#268, (NumberOfProducts#266 * 2) AS NumberOfProducts2#276]
+- *(1) ColumnarToRow
   +- FileScan parquet [CategoryID#262,CategoryName#263,ParentCategoryID#264,Description#265,NumberOfProducts#266,CreatedDate#267,LastUpdated#268] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[dbfs:/mnt/adls/target_tables/Dim/categories_table], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CategoryID:int,CategoryName:string,ParentCategoryID:int,Description:string,NumberOfProduct...




### Observation: 
Because `withColumn()` only adds a new column computed from existing ones, it produces a single job in our case, the same as `select()`. The new column is computed in the `Project` node of the execution plan.

In [0]:
##Drop condition 
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df=df.drop("NumberOfProducts")
df.show()

+----------+-------------+----------------+--------------------+-------------------+-------------------+
|CategoryID| CategoryName|ParentCategoryID|         Description|        CreatedDate|        LastUpdated|
+----------+-------------+----------------+--------------------+-------------------+-------------------+
|      7250|Category_7250|            5132|This is a descrip...|2024-01-01 08:43:41|2024-11-25 08:43:41|
|      7251|Category_7251|            1793|This is a descrip...|2024-08-26 08:43:41|2024-11-02 08:43:41|
|      7252|Category_7252|            5501|This is a descrip...|2024-11-21 08:43:41|2024-11-07 08:43:41|
|      7253|Category_7253|            2682|This is a descrip...|2024-03-26 08:43:41|2024-11-10 08:43:41|
|      7254|Category_7254|            2434|This is a descrip...|2024-09-30 08:43:41|2024-11-15 08:43:41|
|      7255|Category_7255|            null|This is a descrip...|2024-04-26 08:43:41|2024-11-29 08:43:41|
|      7256|Category_7256|            7215|This is a de

### Observation: 


In [0]:
##Distinct 
#AQE Enabled
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
print("Number of partitions before distinct:", df.rdd.getNumPartitions())
df=df.select(df.CategoryID, df.ParentCategoryID, df.NumberOfProducts).distinct()
print("Number of partitions after distinct:", df.rdd.getNumPartitions())
df.show()

Number of partitions before distinct: 8
Number of partitions after distinct: 1
+----------+----------------+----------------+
|CategoryID|ParentCategoryID|NumberOfProducts|
+----------+----------------+----------------+
|      7344|            3981|             250|
|      7407|            1159|              35|
|      7534|            null|              96|
|      7541|            4732|             186|
|      7582|            2113|              86|
|      7726|            null|             434|
|      7757|            null|             348|
|      7774|            null|             221|
|      7903|            null|             321|
|      7957|            3313|              31|
|      7279|            7855|             237|
|      7680|            9557|             330|
|      7721|            null|             297|
|      8016|            null|             363|
|      8089|            null|              46|
|      7302|            1409|             306|
|      7333|            9648

In [0]:
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.get("spark.sql.shuffle.partitions")

Out[8]: '200'

In [0]:
##Distinct 
#AQE Disabled
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
print("Number of partitions before distinct:", df.rdd.getNumPartitions())
df=df.select(df.CategoryID, df.ParentCategoryID, df.NumberOfProducts).distinct()
print("Number of partitions after distinct:", df.rdd.getNumPartitions())
df.show()

Number of partitions before distinct: 8
Number of partitions after distinct: 200
+----------+----------------+----------------+
|CategoryID|ParentCategoryID|NumberOfProducts|
+----------+----------------+----------------+
|      7344|            3981|             250|
|      7407|            1159|              35|
|      7534|            null|              96|
|      7541|            4732|             186|
|      7582|            2113|              86|
|      7726|            null|             434|
|      7757|            null|             348|
|      7774|            null|             221|
|      7903|            null|             321|
|      7957|            3313|              31|
|      4934|            null|              18|
|      5061|            1939|              29|
|      5100|            null|             461|
|      5147|            null|               4|
|      5180|            8783|             314|
|      5287|            8038|              82|
|      5516|            68

### Observation:

#### Run with AQE Enabled (Adaptive Query Execution):
When we run the code with AQE enabled, a shuffle operation occurs due to the `distinct()` function. The reasons for this behavior are as follows:

1. **Distributed Data:**
   - When your data is large and distributed across many partitions, Spark needs to ensure that the `distinct()` operation is applied across the entire dataset.
   - This requires shuffling the data, especially for operations that involve grouping, like `distinct()`. Without this shuffle, Spark can only remove duplicates inside each partition and cannot guarantee that a row is unique across the whole dataset.

2. **HashAggregate:**
   - The `distinct()` operation in Spark is essentially a `groupBy` operation.
   - For large datasets, Spark may need to perform a shuffle to aggregate the data correctly across partitions. This shuffle is necessary because:
     - Each partition holds a portion of the data.
     - Spark needs to combine these portions to eliminate duplicates globally, across all partitions, not just within a single partition.

3. **Adaptive Query Execution (AQE):**
   - AQE re-optimizes the query at runtime using statistics collected from completed shuffle stages.
   - AQE does not add or remove the shuffle that `distinct()` needs. Instead, it can coalesce small shuffle partitions or split skewed ones once it knows their actual sizes.
   - This is why the partition count drops from 8 to 1 in the output: the shuffled data is small enough to fit in a single coalesced partition.

#### Run with AQE Disabled:
When AQE is disabled, the shuffle still occurs, but its output is no longer coalesced and the partition count increases. The reasons are as follows:

- **No Dynamic Optimizations:**
  - With AQE disabled, Spark does not adjust the partition count at runtime based on the data size. Instead, it uses the statically configured value.
  - The number of shuffle partitions is set by `spark.sql.shuffle.partitions`, which defaults to **200** (as confirmed by the cell above), so the result has 200 partitions.

---

### Summary:
- **With AQE enabled**, Spark still performs a shuffle for `distinct()` to ensure global uniqueness of the data, and AQE then coalesces the shuffle output at runtime (leaving 1 partition in this case).
- **With AQE disabled**, Spark performs the same shuffle but keeps the configured shuffle partition count (**200** by default), regardless of the data size.
- `distinct()` is a wide transformation in both runs. The Spark UI shows 2 jobs with AQE enabled and 1 job with AQE disabled, most likely because AQE submits each shuffle stage as its own job so that it can re-plan between stages.


In [0]:
##Limit condition 
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df=df.limit(1000)
df.show()

+----------+-------------+----------------+--------------------+----------------+-------------------+-------------------+
|CategoryID| CategoryName|ParentCategoryID|         Description|NumberOfProducts|        CreatedDate|        LastUpdated|
+----------+-------------+----------------+--------------------+----------------+-------------------+-------------------+
|      7250|Category_7250|            5132|This is a descrip...|             414|2024-01-01 08:43:41|2024-11-25 08:43:41|
|      7251|Category_7251|            1793|This is a descrip...|              79|2024-08-26 08:43:41|2024-11-02 08:43:41|
|      7252|Category_7252|            5501|This is a descrip...|             414|2024-11-21 08:43:41|2024-11-07 08:43:41|
|      7253|Category_7253|            2682|This is a descrip...|              21|2024-03-26 08:43:41|2024-11-10 08:43:41|
|      7254|Category_7254|            2434|This is a descrip...|               7|2024-09-30 08:43:41|2024-11-15 08:43:41|
|      7255|Category_725

### Observation: 

In [0]:
##Alias condition 
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df=df.select(col('CategoryName').alias("Master Category Name"))
df.show()

+--------------------+
|Master Category Name|
+--------------------+
|       Category_7250|
|       Category_7251|
|       Category_7252|
|       Category_7253|
|       Category_7254|
|       Category_7255|
|       Category_7256|
|       Category_7257|
|       Category_7258|
|       Category_7259|
|       Category_7260|
|       Category_7261|
|       Category_7262|
|       Category_7263|
|       Category_7264|
|       Category_7265|
|       Category_7266|
|       Category_7267|
|       Category_7268|
|       Category_7269|
+--------------------+
only showing top 20 rows



### Observation: 

In [0]:
##selectExpr condition 
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df=df.selectExpr("*","NumberOfProducts*2 as NumberOfProducts2")
df.show()

+----------+-------------+----------------+--------------------+----------------+-------------------+-------------------+-----------------+
|CategoryID| CategoryName|ParentCategoryID|         Description|NumberOfProducts|        CreatedDate|        LastUpdated|NumberOfProducts2|
+----------+-------------+----------------+--------------------+----------------+-------------------+-------------------+-----------------+
|      7250|Category_7250|            5132|This is a descrip...|             414|2024-01-01 08:43:41|2024-11-25 08:43:41|              828|
|      7251|Category_7251|            1793|This is a descrip...|              79|2024-08-26 08:43:41|2024-11-02 08:43:41|              158|
|      7252|Category_7252|            5501|This is a descrip...|             414|2024-11-21 08:43:41|2024-11-07 08:43:41|              828|
|      7253|Category_7253|            2682|This is a descrip...|              21|2024-03-26 08:43:41|2024-11-10 08:43:41|               42|
|      7254|Category

In [0]:
##execution plan 
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
# explain() prints the plan and returns None, so its result is not assigned back to df
df.selectExpr("*","NumberOfProducts*2 as NumberOfProducts2").explain()

== Physical Plan ==
*(1) Project [CategoryID#663, CategoryName#664, ParentCategoryID#665, Description#666, NumberOfProducts#667, CreatedDate#668, LastUpdated#669, (NumberOfProducts#667 * 2) AS NumberOfProducts2#677]
+- *(1) ColumnarToRow
   +- FileScan parquet [CategoryID#663,CategoryName#664,ParentCategoryID#665,Description#666,NumberOfProducts#667,CreatedDate#668,LastUpdated#669] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[dbfs:/mnt/adls/target_tables/Dim/categories_table], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CategoryID:int,CategoryName:string,ParentCategoryID:int,Description:string,NumberOfProduct...




### Observation: 

### Extra Union condition

In [0]:
##Union
# single cell execution
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df_1=df.select(df.CategoryID, df.ParentCategoryID, df.NumberOfProducts).filter(df.ParentCategoryID=='1520')
df_2=df.select(df.CategoryID, df.ParentCategoryID, df.NumberOfProducts).filter(df.ParentCategoryID=='3071')
union_df=df_1.union(df_2)
union_df.show()

+----------+----------------+----------------+
|CategoryID|ParentCategoryID|NumberOfProducts|
+----------+----------------+----------------+
+----------+----------------+----------------+



In [0]:
##Union
# Multi cell execution
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df_1=df.select(df.CategoryID, df.ParentCategoryID, df.NumberOfProducts).filter(df.ParentCategoryID=='3071')
df_1.show()

+----------+----------------+----------------+
|CategoryID|ParentCategoryID|NumberOfProducts|
+----------+----------------+----------------+
+----------+----------------+----------------+



In [0]:
##Union
# Multi cell execution
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df_2=df.select(df.CategoryID, df.ParentCategoryID, df.NumberOfProducts).filter(df.ParentCategoryID=='3071')
df_2.show()

+----------+----------------+----------------+
|CategoryID|ParentCategoryID|NumberOfProducts|
+----------+----------------+----------------+
+----------+----------------+----------------+



In [0]:
##Union
# Multi cell execution
union_df=df_1.union(df_2)
union_df.show()

+----------+----------------+----------------+
|CategoryID|ParentCategoryID|NumberOfProducts|
+----------+----------------+----------------+
+----------+----------------+----------------+



### Observation:
#### Query Execution Behavior in DAG

While executing the `UNION` command in a single cell, if we check the DAG image, we can see that for **Filter 3**, we get 1 record as output, and for **Filter 8**, we get 0 records. Despite this, we still get **2 records** as output, which is the expected result.

However, if you execute all the commands in separate cells, the DAG image shows **1 record** for each of the filters. This discrepancy occurs due to improper logging happening at the backend. As a result, while the output and the jobs will not differ between both cases, the logging behavior leads to different interpretations of the DAG.


##### Note:
For DataFrames, `union()` changes the partition count, since the result holds the partitions of both inputs, but it does not shuffle any data, so it is still a narrow transformation.


# Wide Transformations

Wide transformations in PySpark are operations that require **shuffling data across partitions**. This means that the data needs to be moved between executor or worker nodes to perform the transformation.

These transformations are generally more expensive in terms of computation and network I/O because they involve redistributing data to ensure that rows with the same key or required data are grouped together. This often results in a high cost for large datasets.

### Some examples of wide transformations in Spark include:

- **groupBy**: Groups the data by a specified column or columns.
- **groupByKey()**: Groups the data by key (for RDD operations).
- **reduceByKey()**: Reduces the data by key using a given function.
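- **aggregate()**: Aggregates the data using an initial value and a function, producing a final result. Strictly speaking, this is an RDD action rather than a transformation: per-partition results are combined on the driver without a shuffle.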
- **aggregateByKey()**: Similar to `reduceByKey`, but allows for a more complex aggregation using separate functions for combining values within partitions and across partitions.
- **distinct()**: Removes duplicate values from the dataset, requiring all data to be shuffled to identify unique rows.
- **join()**: Joins two datasets based on a common key, which involves shuffling data to align the rows with the same key from both datasets.
- **repartition()**: Reshuffles the data into a specified number of partitions, requiring data to be redistributed across the cluster.

### Example of Wide Transformations

#### 1. `groupBy`

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("Wide Transformation Example").getOrCreate()

# Sample data
data = [("Alice", "HR", 1000),
        ("Bob", "Finance", 1500),
        ("Alice", "HR", 1100),
        ("Charlie", "Finance", 2000),
        ("Charlie", "HR", 1200)]

# Define schema
columns = ["name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Perform a wide transformation with groupBy
df_grouped = df.groupBy("name").agg({"salary": "avg"})

# Show the result of groupBy transformation
df_grouped.show()
```


# Wide Transformations Table

| Transformation        | Description                                                                                                     |
|-----------------------|-----------------------------------------------------------------------------------------------------------------|
| **groupBy**           | Groups the data by a specified column or columns. This operation requires shuffling data across partitions.    |
| **groupByKey()**      | Groups the data by key (for RDD operations). Requires shuffling data for proper key grouping.                 |
| **reduceByKey()**     | Reduces the data by key using a specified function. Shuffles data across partitions to aggregate values.       |
| **aggregate()**       | Aggregates the data using an initial value and a function. This is an RDD action that combines per-partition results on the driver without a shuffle. |
| **aggregateByKey()**  | Similar to `reduceByKey()`, but allows for more complex aggregation with different functions for combining values within partitions and across partitions. |
| **distinct()**        | Removes duplicate values from the dataset, requiring a shuffle to identify unique rows across all partitions. |
| **join()**            | Joins two datasets based on a common key, requiring data to be shuffled to align rows with the same key.       |
| **repartition()**     | Reshuffles the data into a specified number of partitions, requiring data redistribution across the cluster.    |



In [0]:
##groupBy
df=spark.read.parquet("dbfs:/mnt/adls/target_tables/Dim/categories_table/")
df = df.groupBy("ParentCategoryID").agg(sum(col("NumberOfProducts").cast(IntegerType())).alias("NumberOfProducts"))
df.show()

+----------------+----------------+
|ParentCategoryID|NumberOfProducts|
+----------------+----------------+
|            3794|             211|
|            4101|             344|
|            1829|             555|
|            5518|             255|
|            1591|              43|
|            9427|             293|
|            4900|             495|
|            1342|              39|
|            6397|             207|
|            9900|             413|
|            1088|             455|
|            7340|             303|
|            8086|              76|
|            5614|             880|
|            1395|              27|
|            8932|             214|
|            6393|             421|
|            7417|             305|
|            4161|             177|
|            1896|             120|
+----------------+----------------+
only showing top 20 rows



## Shuffle Read and Shuffle Write in Apache Spark

In distributed computing frameworks like Apache Spark, **shuffle** is a critical concept that directly affects performance. It occurs when Spark needs to reorganize or redistribute data across different nodes or partitions, often due to operations like `groupBy`, `join`, or `repartition`. These operations require data to be moved across the network between workers, and this process involves both **Shuffle Read** and **Shuffle Write**.

### 1. Shuffle Read

**Shuffle Read** refers to the amount of shuffle data that the tasks of a stage read at the start of that stage, that is, the data fetched from the shuffle output of the previous stage.

### What happens during a shuffle?
- When Spark executes operations like `groupBy`, `join`, or `repartition`, it might need to redistribute data across multiple workers (nodes) in the cluster. 
- For example, in a `groupBy` operation, Spark needs to collect all the data related to a specific key (e.g., `ParentCategoryID` in the example above) in the same partition to perform the aggregation. As a result, data from different partitions (which might be on different workers) must be moved or "shuffled" to the correct workers.

### How does it work?
- Suppose you have a dataset that is divided into multiple partitions across nodes. If you're performing a `groupBy` operation, Spark will redistribute the data so that all records belonging to the same `ParentCategoryID` are grouped together. For this to happen, Spark needs to **read** data from other partitions and bring it to the node where it will be processed.
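
The redistribution described above can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark's actual implementation: it assumes a simple `key % num_partitions` rule in place of Spark's hash partitioner, and the sample rows, the `target_partition` helper, and the partition count are made up for illustration.

```python
from collections import defaultdict

NUM_SHUFFLE_PARTITIONS = 4  # assumed value, chosen only for this example


def target_partition(key: int, num_partitions: int = NUM_SHUFFLE_PARTITIONS) -> int:
    """Pick the post-shuffle partition for a key (stand-in for a hash partitioner)."""
    return key % num_partitions


# Two input partitions holding hypothetical (ParentCategoryID, NumberOfProducts) rows.
input_partitions = [
    [(3794, 100), (4101, 44), (3794, 11)],
    [(4101, 300), (1829, 555), (3794, 100)],
]

# "Shuffle": route every row to the partition that owns its key.
shuffled = defaultdict(list)
for rows in input_partitions:
    for key, value in rows:
        shuffled[target_partition(key)].append((key, value))

# After the shuffle each key lives in exactly one partition, so a
# per-partition sum is also the global sum for that key.
totals = {}
for rows in shuffled.values():
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value

print(totals)
```

Rows for the same `ParentCategoryID` start out in different input partitions but end up together, which is why the aggregation can then run independently per partition.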

### Why does Shuffle Read matter?
- **Network and Disk I/O**: Shuffle involves significant network and disk input/output (I/O) because data has to be transferred between nodes and potentially written to disk temporarily. The more data being shuffled, the higher the cost in terms of network and disk bandwidth. This can lead to slow performance, especially when large datasets are involved.
- **Performance Bottleneck**: If a large amount of data is being shuffled, it can create a bottleneck, making the process slower. Network congestion and high disk usage during shuffle can lead to delays, increased latency, and overall inefficiency.
- **Shuffle Read Size**: The **Shuffle Read** value tells you how much data was read from other partitions or nodes during this redistribution. A large **Shuffle Read** size could indicate a very expensive operation, especially if Spark needs to shuffle a lot of data.
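
To make the Shuffle Read figure more tangible, here is a small pure-Python sketch of how a reading task's total is made up of pieces written by several upstream tasks, some on the same executor and some on remote ones. The record counts, executor names, and the `shuffle_read` helper are hypothetical; Spark's real bookkeeping is more detailed.

```python
# map_outputs[m][r] = number of records map task m wrote for reduce task r.
map_outputs = [
    [3, 1],  # map task 0
    [2, 4],  # map task 1
]
map_task_executor = ["exec-A", "exec-B"]     # where each map task ran (hypothetical)
reduce_task_executor = ["exec-A", "exec-B"]  # where each reduce task runs (hypothetical)


def shuffle_read(reducer: int) -> dict:
    """Split one reduce task's shuffle read into local and remote records."""
    local = remote = 0
    for m, outputs in enumerate(map_outputs):
        if map_task_executor[m] == reduce_task_executor[reducer]:
            local += outputs[reducer]   # already on this executor, no network hop
        else:
            remote += outputs[reducer]  # must be fetched over the network
    return {"local": local, "remote": remote}


reads = [shuffle_read(r) for r in range(len(reduce_task_executor))]
total_read = sum(r["local"] + r["remote"] for r in reads)
total_written = sum(sum(outputs) for outputs in map_outputs)
print(reads, total_read, total_written)
```

In this toy setup every written record is read exactly once, so the two totals match; the remote portion is the part that costs network bandwidth.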

---

### 2. Shuffle Write

**Shuffle Write** refers to the amount of data that the tasks of a stage write to temporary storage at the end of that stage, so that the tasks of the next stage can read it.

### What happens during shuffle write?
- Before data can be redistributed, each task in the upstream stage writes its output to temporary disk storage on its own node, organized by target partition, so that tasks in the next stage can fetch it.
- For example, in a `groupBy` aggregation, each task first processes its own input partition (including any partial aggregation) and writes the result to disk (this is **Shuffle Write**). Tasks in the following stage then fetch the pieces they need from those files (this is **Shuffle Read**) and complete the aggregation.

### How does it work?
- After a task has processed its input partition, Spark writes the results to disk as intermediate shuffle files. These files are stored temporarily in the local scratch directories of the node that ran the task, not in a distributed file system such as HDFS.
- The amount of data written to disk depends on how much data is being shuffled and how Spark decides to partition the results (i.e., how the data is redistributed).
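
The effect of pre-aggregation on the amount written can be sketched in plain Python. This is an illustration under stated assumptions, not Spark code: `shuffle_write`, `records_written`, the `key % NUM_REDUCERS` routing rule, and the sample rows are all hypothetical, and the pre-aggregation step stands in for the partial aggregation Spark performs before a shuffle for sums and similar aggregates.

```python
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # assumed number of post-shuffle partitions


def shuffle_write(map_partition, pre_aggregate):
    """Return {reducer_id: [(key, value), ...]} for one map task's output."""
    if pre_aggregate:
        # Combine rows that share a key before writing, so only one row per key is written.
        combined = Counter()
        for key, value in map_partition:
            combined[key] += value
        rows = list(combined.items())
    else:
        rows = list(map_partition)
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % NUM_REDUCERS].append((key, value))
    return buckets


map_partitions = [
    [(1, 10), (1, 20), (2, 5), (1, 30)],
    [(2, 7), (2, 8), (3, 1), (3, 2)],
]


def records_written(pre_aggregate):
    """Total number of records written by all map tasks."""
    return sum(
        len(rows)
        for part in map_partitions
        for rows in shuffle_write(part, pre_aggregate).values()
    )


raw_written = records_written(pre_aggregate=False)
combined_written = records_written(pre_aggregate=True)
print(raw_written, combined_written)
```

With these sample rows, combining values per key inside each map task halves the number of records written, which in turn halves what the next stage has to read.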

### Why does Shuffle Write matter?
- **Disk and Network I/O**: Shuffle Write mainly costs disk I/O, because the data is serialized and written to intermediate files. The network cost follows later, when tasks in the next stage fetch those files.
- **Temporary Storage**: Shuffle Write is typically written to temporary files, which means these operations can consume a lot of disk space and I/O bandwidth, leading to potential bottlenecks.
- **Performance Impact**: Writing large volumes of data during shuffle operations increases the overall time it takes for the job to finish. This is especially problematic when you have a high volume of intermediate data, as it leads to disk contention and slower performance.
- **Shuffle Write Size**: The **Shuffle Write** value indicates how much data Spark has written during the shuffle phase. A large **Shuffle Write** value suggests that a lot of data is being transferred and stored temporarily, which may be indicative of inefficient partitioning or too much data being shuffled.

---

## How Shuffle Read and Shuffle Write Affect Performance

1. **Network and Disk Overhead**:
   - Both Shuffle Read and Shuffle Write involve significant network and disk I/O. Since Spark is a distributed system, data needs to be moved between nodes in the cluster during the shuffle. This can cause high network traffic, leading to delays and inefficiencies, especially when the cluster is under heavy load.
   - If there is a lot of shuffle traffic (both reading and writing), it can saturate the network and disk, causing slowdowns.

2. **Data Partitioning**:
   - How the data is partitioned determines how much of it a shuffle has to move and how evenly the work is spread. For example, if your data is heavily skewed, certain post-shuffle partitions receive far more data than others, so a few tasks account for most of the Shuffle Read.
   - Adjusting the number of partitions can help: `repartition()` spreads the data evenly but itself triggers a full shuffle, while `coalesce()` reduces the number of partitions without a full shuffle.

3. **Skew in Data**:
   - If one key or partition holds significantly more data than the others (known as data skew), the shuffle still sends all rows for that key to the same partition. The task that handles it must therefore read and process far more shuffle data than the rest.
   - These long-running tasks become a bottleneck: the stage cannot finish until the slowest task completes, which affects overall job performance.

4. **Job Optimization**:
   - To optimize jobs that involve shuffling, you can:
     - **Adjust parallelism** by changing the number of partitions: `repartition()` can increase or decrease it (at the cost of a shuffle), while `coalesce()` can only decrease it.
     - **Avoid unnecessary shuffles**: Try to design the workflow to reduce unnecessary shuffling, such as using `map` and `filter` operations before `groupBy` to reduce the amount of data being shuffled.
     - **Broadcast joins**: In the case of joins, use broadcast joins if one of the datasets is small enough to be broadcast to all nodes, thereby avoiding a shuffle.

5. **Tuning**:
   - **Memory settings**: Ensure that the memory settings for your Spark job are adequate, especially for tasks that involve heavy shuffling. If there isn’t enough memory, Spark might spill data to disk, which can slow down the shuffle process.
   - **Shuffle Partition Size**: You can tune the number of shuffle partitions (`spark.sql.shuffle.partitions`, default 200) to control how data is distributed during the shuffle. Too many partitions produce many tiny tasks and small shuffle files, so scheduling overhead dominates; too few produce large partitions that are slow to process and more likely to spill to disk.
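
One rough way to pick a value for `spark.sql.shuffle.partitions` is to divide the expected shuffle size by a target size per partition. The sketch below does that arithmetic in plain Python. The `suggest_shuffle_partitions` helper is hypothetical, and the 128 MiB target is an assumption chosen for illustration, not a Spark default; a suitable target depends on the cluster and workload.

```python
import math


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * 1024**2,  # assumed target, not a Spark default
) -> int:
    """Estimate a partition count so each shuffle partition is near the target size."""
    if shuffle_bytes <= 0:
        return 1
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


ten_gib = 10 * 1024**3
n = suggest_shuffle_partitions(ten_gib)
print(n)

# In a Spark session the value could then be applied with:
# spark.conf.set("spark.sql.shuffle.partitions", str(n))
```

The shuffle size to plug in can be taken from the Shuffle Write figure of an earlier run of the same job.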

---

## Summary

- **Shuffle Read** represents the data being read from other partitions during the shuffle process, while **Shuffle Write** represents the data being written during the same process.
- Both processes involve significant disk and network I/O, and understanding how much data is shuffled can help in identifying performance bottlenecks.
- Large shuffle sizes (both read and write) can indicate inefficient partitioning or data skew, which may require optimizations such as adjusting the number of partitions, repartitioning the data, or avoiding unnecessary shuffling.
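
To make the data-skew point concrete, the following pure-Python sketch counts how many rows each post-shuffle partition would receive for a balanced and a skewed key distribution. It is a simplified model under stated assumptions: `partition_sizes` and `skew_ratio` are hypothetical helpers, the keys are synthetic, and `key % num_partitions` stands in for Spark's hash partitioner.

```python
def partition_sizes(keys, num_partitions):
    """Count how many rows each post-shuffle partition would receive."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[key % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


balanced_keys = list(range(1000))           # every key appears once
skewed_keys = [7] * 900 + list(range(100))  # one hot key dominates

balanced = partition_sizes(balanced_keys, 4)
skewed = partition_sizes(skewed_keys, 4)
print(balanced, skew_ratio(balanced))
print(skewed, skew_ratio(skewed))
```

Both datasets hold 1000 rows, but in the skewed case one partition receives most of them, so the task reading that partition does most of the work while the others sit idle.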
