## Overview
This notebook logs a modelling choice and the assumptions behind it. The basic recipe is as follows:
1. Read the data representation produced by the data representation phase
2. Compute the quarterly revenue for each product
3. Compute the total quarterly revenue, which is the sum of the product revenues
4. Compute the contribution of each product towards the total revenue
5. Compute the cumulative contribution of the products, ordered from the largest contribution to the smallest
6. Analyze the cumulative contribution distribution
7. Trim the products used in the daily sales representation
8. Write the updated daily sales representation to be used for modelling
9. Log the modelling choice with its explanation: the contribution of products towards quarterly revenue shows power-law-like behavior. To focus on the products that contribute significantly to quarterly revenue, we drop the products that do not materially contribute to revenue and only add dimensionality to the problem.
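
The recipe above can be sketched end to end on a toy table. This is a minimal illustration, not the notebook's actual data: the stock codes `A` to `D`, their values, and the names `daily_sales`, `threshold`, and `trimmed` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the data representation: one column per stock code
# (hypothetical codes and values, chosen so the totals are easy to follow).
daily_sales = pd.DataFrame(
    {
        "A": [40.0, 30.0, 30.0],  # total 100
        "B": [20.0, 10.0, 20.0],  # total 50
        "C": [10.0, 10.0, 10.0],  # total 30
        "D": [5.0, 10.0, 5.0],    # total 20
    }
)

# Steps 2-3: revenue per product (largest first) and total revenue.
revenue = daily_sales.sum(axis=0).sort_values(ascending=False)
total = revenue.sum()

# Steps 4-5: percentage contribution and its running total.
contribution = revenue / total * 100
cum_contrib = contribution.cumsum()

# Step 7: keep products whose cumulative contribution stays within the threshold.
threshold = 80  # same cut-off value as used later in this notebook
keep = cum_contrib[cum_contrib <= threshold].index
trimmed = daily_sales[keep]

print(cum_contrib.round(1))
print(list(trimmed.columns))
```

In this toy case the cumulative contributions are roughly 50, 75, 90, and 100, so only `A` and `B` survive the cut.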


## Read the data

In [1]:
import pandas as pd
fp = "../../data/retail_q1_post_data_rep_prep.parquet"
df = pd.read_parquet(fp)

In [2]:
df.head()

Unnamed: 0,10002,10120,10123C,10124A,10125,10133,10134,10135,10138,11001,...,90214L,90214M,90214N,90214O,90214P,90214R,90214S,90214V,PADS,POST
0,2.55,6.3,0.0,0.0,0.0,0.0,0.0,1.25,0.0,0.0,...,0.0,0.0,2.5,0.0,0.0,0.0,1.25,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.5,0.0,3.38,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,108.0,212.0,0.0,0.0,27.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
4,10.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,477.0


## Compute product contribution to quarterly revenue

In [3]:
# Sum each product's column over all rows to get its revenue for the quarter, largest first
df_prod_contrib = df.sum(axis=0).sort_values(ascending=False)


In [4]:
df_prod_contrib = df_prod_contrib.to_frame()


In [5]:
df_prod_contrib = df_prod_contrib.reset_index()

In [6]:
df_prod_contrib.columns = ["Stock_Code", "Revenue"]

## Compute total revenue

In [7]:
total_revenue = df_prod_contrib["Revenue"].sum()

## Compute the cumulative contribution of each product towards the quarterly revenue

In [8]:
# Contribution of each product as a percentage of total revenue
df_prod_contrib["Contribution"] = (df_prod_contrib["Revenue"]/total_revenue) * 100

In [9]:
# Running total of the contributions; rows are already sorted from largest to smallest revenue
df_prod_contrib["Cum_Contrib"] = df_prod_contrib["Contribution"].cumsum()
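
The two columns computed above can be written compactly. The symbols here are introduced for illustration and do not appear in the notebook: $r_i$ is the quarterly revenue of product $i$, $c_{(i)}$ is the $i$-th largest contribution, and $C_k$ is the cumulative contribution of the top $k$ products.

$$
c_i = 100 \cdot \frac{r_i}{\sum_j r_j},
\qquad
C_k = \sum_{i=1}^{k} c_{(i)},
\qquad
c_{(1)} \ge c_{(2)} \ge \dots
$$

By construction $C_k$ is non-decreasing in $k$ and reaches 100 at the last product, assuming all revenues are non-negative.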

## Analyze the Contribution

In [10]:
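# Keep the top products whose cumulative contribution stays within 80% of quarterly revenue
df_prod_contrib = df_prod_contrib[df_prod_contrib["Cum_Contrib"] <= 80]
df_prod_contrib = df_prod_contrib.reset_index(drop=True)
# Stock codes of the retained products
in_demand_inventory = df_prod_contrib["Stock_Code"]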
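
One way to examine the cumulative contribution before settling on a cut-off is to count how many products survive at several thresholds. The sketch below uses made-up revenues with a steep, power-law-like drop-off; the product labels, the values, and the names `thresholds`, `retained`, and `share_of_catalogue` are assumptions for the example.

```python
import pandas as pd

# Hypothetical per-product revenues, already sorted from largest to smallest.
revenue = pd.Series(
    [500.0, 250.0, 120.0, 60.0, 30.0, 20.0, 10.0, 5.0, 3.0, 2.0],
    index=[f"P{i}" for i in range(10)],
)
cum_contrib = (revenue / revenue.sum() * 100).cumsum()

# Count the products whose cumulative contribution stays within each threshold.
thresholds = [60, 80, 90, 95]
retained = {t: int((cum_contrib <= t).sum()) for t in thresholds}

# Express the same counts as a fraction of the full catalogue.
share_of_catalogue = {t: n / len(revenue) for t, n in retained.items()}

for t in thresholds:
    print(f"<= {t}%: {retained[t]} products ({share_of_catalogue[t]:.0%} of catalogue)")
```

Because the filter uses `<=`, the product that crosses the threshold is excluded, so the retained products account for at most the threshold percentage of revenue, not at least.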

## Write the updated data representation to disk

In [11]:
df = df[in_demand_inventory]

In [12]:
fp = "../../data/retail_q1_post_mc.parquet"
df.to_parquet(fp, index=False)
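
The final step of the recipe, logging the modelling choice, has no code in this notebook. A minimal sketch of what such a record could look like follows; the record structure and every field name are assumptions, and only the file paths, the 80% cut-off, and the rationale come from the notebook.

```python
import json

# Hypothetical record layout; the field names are illustrative, not from the notebook.
modelling_choice = {
    "step": "trim_products_by_cumulative_contribution",
    "threshold_pct": 80,
    "input": "../../data/retail_q1_post_data_rep_prep.parquet",
    "output": "../../data/retail_q1_post_mc.parquet",
    "rationale": (
        "Product contributions to quarterly revenue show power-law-like behavior; "
        "products that do not materially contribute are dropped to reduce dimensionality."
    ),
}

# Serialise the record so it can be stored alongside the written data.
log_entry = json.dumps(modelling_choice, indent=2)
print(log_entry)
```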