# Unpivot in PySpark

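The **unpivot** operation transforms wide-format data, where each category has its own column, into long-format data, where each category/value pair becomes its own row. This is useful when a downstream step (grouping, filtering, plotting) expects one observation per row.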

## Sample Data

We start with a DataFrame where sales data is stored in separate columns for different regions (North, South, East, West).

In [1]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName('UnpivotExample').getOrCreate()

# Sample Data
data = [
    ('Product A', 100, 200, 150, 180),
    ('Product B', 90, 210, 160, 190),
    ('Product C', 120, 180, 140, 170)
]

columns = ['Product', 'North', 'South', 'East', 'West']
df = spark.createDataFrame(data, columns)

df.show()


+---------+-----+-----+----+----+
|  Product|North|South|East|West|
+---------+-----+-----+----+----+
|Product A|  100|  200| 150| 180|
|Product B|   90|  210| 160| 190|
|Product C|  120|  180| 140| 170|
+---------+-----+-----+----+----+



## Unpivoting Data

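We use the **stack()** function inside **selectExpr()** to transform the regional columns into rows. The first argument to `stack()` is the number of rows to produce per input row (here 4), followed by label/value pairs: a string literal with the region name, then the column holding that region's value. Each region name is now stored in a single column (`Region`), and the corresponding sales values are in another column (`Sales`).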

In [2]:
unpivot_expr = "stack(4, 'North', North, 'South', South, 'East', East, 'West', West) as (Region, Sales)"
df_unpivoted = df.selectExpr('Product', unpivot_expr)

df_unpivoted.show()


+---------+------+-----+
|  Product|Region|Sales|
+---------+------+-----+
|Product A| North|  100|
|Product A| South|  200|
|Product A|  East|  150|
|Product A|  West|  180|
|Product B| North|   90|
|Product B| South|  210|
|Product B|  East|  160|
|Product B|  West|  190|
|Product C| North|  120|
|Product C| South|  180|
|Product C|  East|  140|
|Product C|  West|  170|
+---------+------+-----+
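
To make the row expansion concrete, here is a minimal pure-Python sketch of what `stack()` does to each input row. It is an illustration only, not how Spark implements the function; the helper `stack_row` and the dictionary-based rows are hypothetical names introduced for this example.

```python
# Pure-Python illustration of the row expansion performed by stack().
# stack_row is a hypothetical helper, not part of PySpark.

def stack_row(row, id_col, value_cols):
    """Expand one wide row into one (id, label, value) tuple per value column."""
    return [(row[id_col], col, row[col]) for col in value_cols]

# Two of the sample rows, written as plain dictionaries
wide_rows = [
    {'Product': 'Product A', 'North': 100, 'South': 200, 'East': 150, 'West': 180},
    {'Product': 'Product B', 'North': 90, 'South': 210, 'East': 160, 'West': 190},
]
regions = ['North', 'South', 'East', 'West']

# Each wide row yields len(regions) long rows
long_rows = [t for row in wide_rows for t in stack_row(row, 'Product', regions)]

for t in long_rows:
    print(t)
```

Each input row produces four output rows, one per region, in the same order as the label/value pairs.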



## Conclusion

- **Unpivoting converts one column per region into one row per product and region**, which makes the data easier to group, filter, and visualize by region.
- The `stack()` function performs this transformation in a single `selectExpr()` expression.
- If the column names are dynamic, you can retrieve them using `df.columns` and build the `stack()` expression passed to `selectExpr` programmatically.
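
As a sketch of the dynamic case, the helper below builds the `stack()` expression from a list of column names. The function name `build_stack_expr` and its parameters are hypothetical, and the list `all_columns` stands in for `df.columns` so the snippet runs without a Spark session. It assumes the column names contain no quotes or backticks and that all value columns share one data type, since `stack()` is expected to reject mixed types.

```python
# Hypothetical helper: build a stack() expression from column names.
# Assumes names contain no quotes or backticks.

def build_stack_expr(value_cols, var_name='Region', value_name='Sales'):
    """Return a Spark SQL stack() expression that unpivots value_cols."""
    # Each pair is a string literal label followed by the backtick-quoted column
    pairs = ', '.join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({var_name}, {value_name})"

all_columns = ['Product', 'North', 'South', 'East', 'West']  # stands in for df.columns
id_cols = ['Product']

# Every column that is not an identifier is treated as a value column
value_cols = [c for c in all_columns if c not in id_cols]

unpivot_expr = build_stack_expr(value_cols)
print(unpivot_expr)
```

With a live DataFrame, the result could then be used as `df.selectExpr(*id_cols, unpivot_expr)`, replacing `all_columns` with `df.columns`.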