# Exercise - Aggregation and Grouping

Now it's your turn to practice what we learned in the previous notebook.

We will be using the same dataset as the previous exercise.

The dataset for this exercise is the [Dairy Supply Chain Sales dataset](https://zenodo.org/records/7853252), from the paper [Evaluating the Effect of Volatile Federated Timeseries on Modern DNNs: Attention over Long/Short Memory](https://ieeexplore.ieee.org/document/10176585) by Siniosoglou et al.

Run the first cell as is and then proceed with the exercise.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_excel('https://raw.githubusercontent.com/soltaniehha/Intro-to-Data-Analytics/main/data/MEVGAL-Dairy-Sales/product1-2022.xlsx')

df.loc[:, 'previous_year_daily_unit_returns_kg'] = df['previous_year_daily_unit_returns_kg'].replace([np.inf, -np.inf], pd.NA)
df.loc[df['daily_unit_sales']<0, 'daily_unit_sales'] = pd.NA
df = df.dropna(subset = ['daily_unit_sales'])
df[['percentage_difference_daily_unit_sales','percentage_difference_daily_unit_sales_kg']] = df[['percentage_difference_daily_unit_sales', 'percentage_difference_daily_unit_sales_kg']].apply(lambda x: x.fillna(x.mean()))
df[['daily_unit_sales_kg', 'daily_unit_returns_kg']] = df[['daily_unit_sales_kg', 'daily_unit_returns_kg']].apply(lambda x: x.fillna(x.median()))
df[['previous_year_daily_unit_sales_kg', 'previous_year_daily_unit_returns_kg']] = df[['previous_year_daily_unit_sales_kg', 'previous_year_daily_unit_returns_kg']].ffill()
for col in ['points_of_distribution', 'previous_year_points_of_distribution']:
    missing = df[col].isna()
    df.loc[missing, col] = np.random.choice(df[col].dropna(), size=missing.sum(), replace=True)
df.loc[:, 'previous_year_daily_unit_returns_kg'] = df['previous_year_daily_unit_returns_kg'].fillna(0)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 364 entries, 0 to 364
Data columns (total 13 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Day                                        364 non-null    int64  
 1   Month                                      364 non-null    object 
 2   Year                                       364 non-null    int64  
 3   daily_unit_sales                           364 non-null    float64
 4   previous_year_daily_unit_sales             364 non-null    float64
 5   percentage_difference_daily_unit_sales     364 non-null    float64
 6   daily_unit_sales_kg                        364 non-null    float64
 7   previous_year_daily_unit_sales_kg          364 non-null    float64
 8   percentage_difference_daily_unit_sales_kg  364 non-null    float64
 9   daily_unit_returns_kg                      364 non-null    float64
 10  previous_year_daily_unit_return

1. What is the average amount of yogurt, in kilograms, sold and returned this year compared to last year?

To practice, let's first create a subset of numerical columns: `daily_unit_sales_kg`, `previous_year_daily_unit_sales_kg`, `daily_unit_returns_kg`, `previous_year_daily_unit_returns_kg` before performing the aggregations.

Hint: Do not create a new dataframe; simply select (subset) the columns you want the statistics for.
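As a rough illustration of the hint, the sketch below selects columns on the fly and aggregates them in one expression. It uses a toy dataframe whose column names and values are made up for illustration, not the exercise data.

```python
import pandas as pd

# Toy data: column names and values are hypothetical, not from the exercise dataset.
toy = pd.DataFrame({
    "sales_kg": [10.0, 20.0, 30.0],
    "returns_kg": [1.0, 0.0, 2.0],
    "label": ["a", "b", "c"],
})

# Passing a list of column names selects a sub-frame on the fly,
# so no intermediate dataframe has to be assigned to a variable.
col_means = toy[["sales_kg", "returns_kg"]].mean()
print(col_means)
```

The result is a Series indexed by the selected column names, one mean per column.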

In [2]:
# Your code goes here


Column,Mean
daily_unit_sales_kg,1569.187912
previous_year_daily_unit_sales_kg,1816.796703
daily_unit_returns_kg,0.024513
previous_year_daily_unit_returns_kg,0.021489


2. How many more or fewer units of yogurt were sold this year compared to last year?

In [3]:
# Your code goes here


The difference in total sales is: -145864.00000000035


3. What was the maximum quantity of yogurt sold in one day last year?

In [4]:
# Your code goes here


4228.800000000001

4. What was the standard deviation observed in yogurt unit sales this year?

In [5]:
# Your code goes here


1183.207440684088

5. Which day had the highest average return across this year and the previous year, and what was the amount?

Hint: You need the average of `daily_unit_returns_kg` and `previous_year_daily_unit_returns_kg` for each day. To average two columns within each row, pass `axis=1` to the aggregation function.
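The row-wise aggregation described in the hint might look like the following sketch. The dataframe and its column names are hypothetical toy data; `idxmax` is one possible way to locate the row with the largest row-wise average.

```python
import pandas as pd

# Toy data: column names and values are hypothetical, not from the exercise dataset.
toy = pd.DataFrame({
    "Month": ["January", "January", "February"],
    "Day": [1, 2, 1],
    "returns_this_year": [1.0, 3.0, 2.0],
    "returns_last_year": [2.0, 1.0, 4.0],
})

# axis=1 aggregates across the columns of each row,
# giving one average per row instead of one per column.
row_avg = toy[["returns_this_year", "returns_last_year"]].mean(axis=1)

# idxmax returns the index label of the largest value.
best = row_avg.idxmax()
print(toy.loc[best, "Month"], toy.loc[best, "Day"], row_avg.loc[best])
```

Without `axis=1`, `mean()` would instead return one value per column, which does not answer a per-day question.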

In [6]:
# Your code goes here


Highest return day: March 13
Maximum average return: 0.17601993965630328


6. What is the total unit sales for each month, sorted in descending order?

In [7]:
# Your code goes here


Month,daily_unit_sales
June,102627.0
May,94959.0
September,83921.0
July,83679.0
October,81907.0
November,77869.0
August,77655.0
January,74106.0
February,69420.0
March,69013.0


7. What is the average daily unit sales for each month, sorted in descending order?

In [8]:
# Your code goes here


Month,daily_unit_sales
June,3420.9
May,3165.3
September,2797.366667
July,2699.322581
October,2642.16129
November,2595.633333
August,2505.0
February,2479.285714
January,2390.516129
March,2226.225806


8. How would you evaluate the two aggregation methods used above (sum and mean) for comparing sales across different months?

In [9]:
# Your explanation goes here


The sum method is best for understanding total sales volume, while the average method provides a more normalized view for comparing sales performance across months with different durations.
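A small sketch can make this difference concrete. In the toy data below (hypothetical month labels and values), month "A" has three recorded days and month "B" has only two, so the two aggregations rank the months differently.

```python
import pandas as pd

# Toy data: month "A" has three days, month "B" has two (hypothetical values).
toy = pd.DataFrame({
    "Month": ["A", "A", "A", "B", "B"],
    "units": [10, 10, 10, 12, 12],
})

# Compute both aggregations side by side for each month.
by_month = toy.groupby("Month")["units"].agg(["sum", "mean"])
print(by_month)

# "A" wins on total volume, while "B" wins on average daily sales.
print(by_month["sum"].idxmax(), by_month["mean"].idxmax())
```

The month with more recorded days has the larger total even though its typical day is weaker, which is why the mean is the fairer per-day comparison.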


9. What is the average percentage difference in daily unit sales compared to the previous year for each month?

Hint: Use `percentage_difference_daily_unit_sales`, the percentage difference between `daily_unit_sales` and `previous_year_daily_unit_sales`.

In [10]:
# Your code goes here


Month,percentage_difference_daily_unit_sales
April,-0.077647
August,0.145652
December,0.140123
February,-0.029261
January,-0.087711
July,-0.053859
June,-0.070052
March,-0.054208
May,0.332456
November,0.255204


Can you sort the results above in the natural order of the months instead of the default alphabetical ordering?
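One possible approach, sketched here on a toy Series with made-up values, is to reindex the alphabetically sorted result against an explicit list of month names in calendar order. The `month_order` list and the `monthly` Series are illustrative names, not part of the exercise.

```python
import pandas as pd

# Explicit calendar order; written out by hand so it does not depend on locale.
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Toy result indexed by month name in alphabetical order (hypothetical values).
monthly = pd.Series({"April": 4.0, "August": 8.0, "February": 2.0, "January": 1.0})

# Keep only the months that are present, in calendar order, then reindex.
ordered = monthly.reindex([m for m in month_order if m in monthly.index])
print(ordered)
```

An alternative would be to convert the `Month` column to an ordered `pd.Categorical` with the same list as its categories before grouping, so that sorting follows the calendar automatically.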