# Reshaping

In [1]:
import polars as pl

## Wide vs. Long DataFrames
- Wide and long describe two ways of organizing data in a table.
- A variable is a data attribute that can have multiple values.
- Wide `DataFrames` store the same variable across multiple columns.
- Wide `DataFrames` expand horizontally -- their number of columns grows.
- Long `DataFrames` store each variable in a single column.
- Long `DataFrames` expand vertically -- their number of rows grows.

### Example
- The `wide_store_sales.csv` dataset is a wide dataset.
- It stores the same attribute/variable (revenue) across multiple columns (Jan, Feb, etc).
- As more revenue values arrive for future months, the `DataFrame` will expand in width.

In [2]:
pl.read_csv("wide_store_sales.csv")

store_id,location,Jan,Feb,Mar,Apr,May,Jun
i64,str,i64,i64,i64,i64,i64,i64
101,"""Central""",12000,12500,13000,13500,14000,14500
102,"""North""",15000,14500,16000,15500,16500,17000
103,"""East""",10000,9500,11000,10500,11500,12000
104,"""South""",18000,19000,20000,21000,22000,23000
105,"""West""",13000,13500,14000,14500,15000,15500


## The unpivot Method to Convert a Wide DataFrame to a Long DataFrame
- The `unpivot` method transforms a `DataFrame` from a wide format to a long format.
- The `on` parameter accepts the column(s) whose values will be stacked into a single value column (here, the month columns).
- The `index` parameter accepts the identifier column(s). Polars keeps these columns and repeats their values once for each unpivoted column.
- The equivalent Pandas method is `melt`.

In [3]:
sales = pl.read_csv("wide_store_sales.csv")
sales

store_id,location,Jan,Feb,Mar,Apr,May,Jun
i64,str,i64,i64,i64,i64,i64,i64
101,"""Central""",12000,12500,13000,13500,14000,14500
102,"""North""",15000,14500,16000,15500,16500,17000
103,"""East""",10000,9500,11000,10500,11500,12000
104,"""South""",18000,19000,20000,21000,22000,23000
105,"""West""",13000,13500,14000,14500,15000,15500


In [4]:
sales.unpivot(
    index=["store_id", "location"], on=["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
)

store_id,location,variable,value
i64,str,str,i64
101,"""Central""","""Jan""",12000
102,"""North""","""Jan""",15000
103,"""East""","""Jan""",10000
104,"""South""","""Jan""",18000
105,"""West""","""Jan""",13000
…,…,…,…
101,"""Central""","""Jun""",14500
102,"""North""","""Jun""",17000
103,"""East""","""Jun""",12000
104,"""South""","""Jun""",23000


- The `index` parameter accepts either a single column name or a list of column names.
- If the `on` parameter is omitted, Polars will include all columns that are not provided to `index`.
- The dataset has 6 month columns x 5 rows, so the new `DataFrame` has 30 rows.

In [5]:
sales.unpivot(index=["store_id", "location"])

store_id,location,variable,value
i64,str,str,i64
101,"""Central""","""Jan""",12000
102,"""North""","""Jan""",15000
103,"""East""","""Jan""",10000
104,"""South""","""Jan""",18000
105,"""West""","""Jan""",13000
…,…,…,…
101,"""Central""","""Jun""",14500
102,"""North""","""Jun""",17000
103,"""East""","""Jun""",12000
104,"""South""","""Jun""",23000


- Use the `variable_name` parameter to rename the variable column (the one that will hold the former column names).
- Use the `value_name` parameter to rename the value column (the one that will hold the cell values).

In [6]:
sales.unpivot(index=["store_id", "location"], variable_name="month", value_name="sales")

store_id,location,month,sales
i64,str,str,i64
101,"""Central""","""Jan""",12000
102,"""North""","""Jan""",15000
103,"""East""","""Jan""",10000
104,"""South""","""Jan""",18000
105,"""West""","""Jan""",13000
…,…,…,…
101,"""Central""","""Jun""",14500
102,"""North""","""Jun""",17000
103,"""East""","""Jun""",12000
104,"""South""","""Jun""",23000


### Further Reading
- https://docs.pola.rs/user-guide/transformations/unpivot/
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unpivot.html

## The pivot Method to Convert a Long DataFrame to a Wide DataFrame
- The complementary `pivot` method converts a `DataFrame` from a long format to a wide format.
- The `student_grades.csv` dataset stores each variable in a single column.
- Student names, subjects, and score values are _not_ scattered across multiple columns.

In [7]:
grades = pl.read_csv("student_grades.csv")
grades

student,subject,score
str,str,i64
"""Alice""","""math""",88
"""Bob""","""math""",76
"""Charlie""","""math""",90
"""Diana""","""math""",67
"""Ethan""","""math""",95
…,…,…
"""Alice""","""writing""",100
"""Bob""","""writing""",50
"""Charlie""","""writing""",82
"""Diana""","""writing""",67


- A wider view can make it easy to parse a student's performance in all subjects.
- The `pivot` method transforms a long `DataFrame` to a wide one.
- The `index` parameter sets the columns whose values will be kept as unique row identifiers.
- The `on` parameter sets the columns whose distinct values will be extracted to new columns.
- The `values` parameter sets the columns whose values will be distributed in the cells of the new table.
- Polars will use a `null` if there is no value at the intersection of an identifier (student name) and a new column (subject).

In [8]:
grades.pivot(index="student", on="subject", values="score")

student,math,history,writing
str,i64,i64,i64
"""Alice""",88,52,100
"""Bob""",76,69,50
"""Charlie""",90,13,82
"""Diana""",67,26,67
"""Ethan""",95,100,null


### Further Reading
- https://docs.pola.rs/user-guide/transformations/pivot/#eager
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.pivot.html

## Pivot Tables I
- A pivot table reshapes data by turning unique values into new rows or columns, then summarizing corresponding values.
- The `student_grades_expanded.csv` dataset has multiple entries for the same student and subject.

In [9]:
grades = pl.read_csv("student_grades_expanded.csv")
grades.head(3)

student,subject,score
str,str,i64
"""Alice""","""math""",88
"""Bob""","""math""",76
"""Charlie""","""math""",90


- The dataset stores student grades over 2 years at the school.
- Thus, certain combinations of student and subject may appear twice.

In [10]:
grades.filter(pl.col("student") == "Alice")

student,subject,score
str,str,i64
"""Alice""","""math""",88
"""Alice""","""history""",52
"""Alice""","""writing""",100
"""Alice""","""math""",92
"""Alice""","""history""",59
"""Alice""","""writing""",16


- A plain `pivot` call fails here because the combinations of `student` and `subject` are not unique.
- Each combination of student name and subject appears twice (once per year), so Polars cannot choose a single score per combination.

In [11]:
# Raises an error: each (student, subject) pair has more than one score.
# grades.pivot(index="student", on="subject", values="score")

- The `aggregate_function` parameter sets how Polars reduces the duplicate values for each combination to a single value.
- An argument of `first` selects the first value that occurs for each combination.
- In this example, Polars will use the test score from the first occurrence of each student name + school subject.
- The result represents the first year of grades.

In [12]:
grades.pivot(index="student", on="subject", values="score", aggregate_function="first")

student,math,history,writing
str,i64,i64,i64
"""Alice""",88,52,100
"""Bob""",76,69,50
"""Charlie""",90,13,82
"""Diana""",67,26,67
"""Ethan""",95,100,null


- An argument of `last` selects the last value that occurs for each combination.
- The result represents the second year of grades.

In [13]:
grades.pivot(index="student", on="subject", values="score", aggregate_function="last")

student,math,history,writing
str,i64,i64,i64
"""Alice""",92,59,16
"""Bob""",100,62,25
"""Charlie""",38,null,88
"""Diana""",42,98,72
"""Ethan""",13,null,30


- There are complementary `max` and `min` arguments for choosing the largest or smallest value for every match.

In [14]:
grades.pivot(index="student", on="subject", values="score", aggregate_function="max")

student,math,history,writing
str,i64,i64,i64
"""Alice""",92,59,100
"""Bob""",100,69,50
"""Charlie""",90,13,88
"""Diana""",67,98,72
"""Ethan""",95,100,30


In [15]:
grades.pivot(index="student", on="subject", values="score", aggregate_function="min")

student,math,history,writing
str,i64,i64,i64
"""Alice""",88,52,16
"""Bob""",76,62,25
"""Charlie""",38,13,82
"""Diana""",42,26,67
"""Ethan""",13,100,30


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.pivot.html

## Pivot Tables II
- The aggregate functions in the previous lesson (`first`, `last`, `max`, and `min`) chose one value from a set of possible values.
- Other functions can perform aggregate operations across all values within each combination.

In [16]:
grades = pl.read_csv("student_grades_expanded.csv")
grades.head(3)

student,subject,score
str,str,i64
"""Alice""","""math""",88
"""Bob""","""math""",76
"""Charlie""","""math""",90


- The `sum` aggregation adds the values together.

In [17]:
grades.pivot(index="student", on="subject", values="score", aggregate_function="sum")

student,math,history,writing
str,i64,i64,i64
"""Alice""",180,111,116
"""Bob""",176,131,75
"""Charlie""",128,13,170
"""Diana""",109,124,139
"""Ethan""",108,100,30


- The `mean` aggregation takes the average of the values.

In [18]:
grades.pivot(index="student", on="subject", values="score", aggregate_function="mean")

student,math,history,writing
str,f64,f64,f64
"""Alice""",90.0,55.5,58.0
"""Bob""",88.0,65.5,37.5
"""Charlie""",64.0,13.0,85.0
"""Diana""",54.5,62.0,69.5
"""Ethan""",54.0,100.0,30.0


- The `len` aggregate function counts the number of rows in each group, including rows whose score is `null`.

In [19]:
grades.pivot(index="student", on="subject", values="score", aggregate_function="len")

student,math,history,writing
str,u32,u32,u32
"""Alice""",2,2,2
"""Bob""",2,2,2
"""Charlie""",2,2,2
"""Diana""",2,2,2
"""Ethan""",2,2,2


- One advantage of pivot tables is that we can look at data from different angles.
- Let's swap the axes: subjects as row values, students as columns.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.pivot.html

## The transpose Method
- The `transpose` method swaps the axes of the `DataFrame`.
- Current row values become new column headers, and current column headers become new row values.
- Let's start with the pivot table from the end of the previous lesson.

In [20]:
grades = pl.read_csv("student_grades_expanded.csv").pivot(
    index="subject", on="student", values="score", aggregate_function="mean"
)
grades

subject,Alice,Bob,Charlie,Diana,Ethan
str,f64,f64,f64,f64,f64
"""math""",90.0,88.0,64.0,54.5,54.0
"""history""",55.5,65.5,13.0,62.0,100.0
"""writing""",58.0,37.5,85.0,69.5,30.0


- The `column_names` parameter identifies the column whose values will become the new column names.
- Polars will arrange the values so they match the original intersection of row and column (but now inverted/transposed).
- Pass `True` to the `include_header` parameter to include the former column headers in a new column.
- The `header_name` parameter sets a custom name for the column of header values.

In [21]:
grades.transpose(column_names="subject", include_header=True, header_name="student")

student,math,history,writing
str,f64,f64,f64
"""Alice""",90.0,55.5,58.0
"""Bob""",88.0,65.5,37.5
"""Charlie""",64.0,13.0,85.0
"""Diana""",54.5,62.0,69.5
"""Ethan""",54.0,100.0,30.0


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.transpose.html