## Overview
This notebook logs the assumptions behind a modelling choice. The basic recipe is as follows:
1. Read the data representation produced in the data representation phase
2. Compute the quarterly revenue for each product
3. Compute the total quarterly revenue, which is the sum of the product revenues
4. Compute the contribution of each product towards the total revenue
5. Compute the cumulative contribution of the products, ordered from the largest contribution to the smallest
6. Analyze the cumulative contribution distribution
7. Trim the products used for the daily sales representation
8. Write the updated daily sales representation to be used for modelling
9. Log the modelling choice with the explanation that the contribution of products towards quarterly revenue shows power-law-like behavior. To focus on the products that contribute significantly to the quarterly revenue, we drop the products that do not contribute materially to the revenue and only add dimensionality to the problem.
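
The trimming part of the recipe above (steps 2 to 7) can be sketched as a single function. This is a minimal illustration on made-up data, not the notebook's actual code: the function name `trim_low_contribution_products`, the `threshold_pct` parameter, and the product codes `A` to `D` are all hypothetical, and the 80% default simply mirrors the cut-off used later in this notebook.

```python
import pandas as pd


def trim_low_contribution_products(daily_sales: pd.DataFrame,
                                   threshold_pct: float = 80.0) -> pd.DataFrame:
    """Keep the top products whose cumulative revenue share stays <= threshold_pct.

    Hypothetical helper: rows are days, columns are product codes,
    values are daily revenue.
    """
    # Revenue per product over the whole period, largest first
    revenue = daily_sales.sum(axis=0).sort_values(ascending=False)
    # Percentage contribution of each product to total revenue
    contribution = revenue / revenue.sum() * 100
    # Running total of the contributions, in descending order of revenue
    cum_contrib = contribution.cumsum()
    # Products whose running total has not yet exceeded the threshold
    keep = cum_contrib[cum_contrib <= threshold_pct].index
    return daily_sales[keep]


# Toy data (assumed for illustration): two days, four products
toy_sales = pd.DataFrame({
    "A": [30.0, 20.0],  # total 50
    "B": [5.0, 5.0],    # total 10
    "C": [15.0, 10.0],  # total 25
    "D": [10.0, 5.0],   # total 15
})

trimmed = trim_low_contribution_products(toy_sales)
print(list(trimmed.columns))
```

With these toy numbers the cumulative shares are 50, 75, 90 and 100 percent, so only `A` and `C` are retained at the 80% threshold.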


## Read the data

In [1]:
import pandas as pd
fp = "../../kmds/examples/retail_q1_post_data_rep_prep.parquet"
df = pd.read_parquet(fp)

In [2]:
df.head()

Unnamed: 0,10002,10120,10123C,10124A,10125,10133,10134,10135,10138,11001,...,90214L,90214M,90214N,90214O,90214P,90214R,90214S,90214V,PADS,POST
0,2.55,6.3,0.0,0.0,0.0,0.0,0.0,1.25,0.0,0.0,...,0.0,0.0,2.5,0.0,0.0,0.0,1.25,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.5,0.0,3.38,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,108.0,212.0,0.0,0.0,27.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
4,10.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,477.0


## Compute product contribution to quarterly revenue

In [3]:
df_prod_contrib = df.sum(axis=0).sort_values(ascending=False)


In [4]:
df_prod_contrib = df_prod_contrib.to_frame()


In [5]:
df_prod_contrib = df_prod_contrib.reset_index()

In [6]:
df_prod_contrib.columns = ["Stock_Code", "Revenue"]

## Compute total revenue

In [7]:
total_revenue = df_prod_contrib["Revenue"].sum()

## Compute Cumulative Contribution of each product revenue towards the quarterly revenue

In [8]:
df_prod_contrib["Contribution"] = (df_prod_contrib["Revenue"]/total_revenue) * 100

In [9]:
df_prod_contrib["Cum_Contrib"] = df_prod_contrib["Contribution"].cumsum()
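
The two columns computed above can be written out as formulas. The symbols below are notation introduced here for illustration and do not appear in the notebook: $r_{(i)}$ is the quarterly revenue of the product ranked $i$-th by revenue (largest first), $n$ is the number of products, and $R$ is the total quarterly revenue.

```latex
R = \sum_{j=1}^{n} r_{(j)}, \qquad
\text{Contribution}_{(i)} = 100 \cdot \frac{r_{(i)}}{R}, \qquad
\text{Cum\_Contrib}_{(k)} = \sum_{i=1}^{k} \text{Contribution}_{(i)}, \qquad
\text{Cum\_Contrib}_{(n)} = 100 .
```

Because the products are sorted in descending order of revenue before the cumulative sum is taken, $\text{Cum\_Contrib}_{(k)}$ is the share of revenue explained by the top $k$ products.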

## Analyze the Contribution

In [10]:
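# Keep the top products whose cumulative contribution stays at or below 80% of quarterly revenue
df_prod_contrib = df_prod_contrib[df_prod_contrib["Cum_Contrib"] <= 80]
df_prod_contrib = df_prod_contrib.reset_index(drop=True)
# Stock codes of the retained products, ordered by descending revenue
in_demand_inventory = df_prod_contrib["Stock_Code"]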
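
One way to sanity-check the 80% cut-off is to see how many products survive at a few different thresholds and what share of revenue they actually cover. The sketch below uses made-up revenue figures and a hypothetical helper, `retained_summary`; it is not part of the notebook. It also shows that a `<=` filter excludes the product whose contribution crosses the threshold, so the retained share can be lower than the threshold itself.

```python
import pandas as pd

# Hypothetical per-product revenue, already sorted from largest to smallest
revenue = pd.Series({"A": 50.0, "C": 25.0, "D": 15.0, "B": 10.0})
cum_contrib = (revenue / revenue.sum() * 100).cumsum()


def retained_summary(cum_contrib: pd.Series, threshold_pct: float):
    """Return (number of products kept, revenue share they cover in percent)."""
    kept = cum_contrib[cum_contrib <= threshold_pct]
    n_kept = len(kept)
    # The last cumulative value is the total share covered by the kept products
    share = float(kept.iloc[-1]) if n_kept else 0.0
    return n_kept, share


for threshold in (60, 80, 95):
    n_kept, share = retained_summary(cum_contrib, threshold)
    print(f"threshold={threshold}%: kept {n_kept} products covering {share:.1f}% of revenue")
```

In this toy case the 80% threshold keeps two products covering 75% of revenue, because adding the third product would push the running total to 90%.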

## Write the updated data representation to disk

In [11]:
df = df[in_demand_inventory]

In [12]:
fp = "../../data/retail_q1_post_mc.parquet"
df.to_parquet(fp, index=False)

## Capturing and Tagging Meta Data in Data Representations
After creating the data representation required for modelling, the metadata related to it can be captured to facilitate understanding. The [woodwork library](https://woodwork.alteryx.com/en/v0.7.1/start.html) can provide this feature. The generated metadata can be reviewed and updated. Note how the semantic tags related to the sale of items for the quarter are updated. The resulting metadata can then be published to a tool like [ckan](https://ckan.org/) for wider dissemination.

In [13]:
import woodwork as ww
df.ww.init(name="q1_2010_retail_data_rep")

In [14]:
inv_items = df.columns.to_list()

In [15]:
semantic_tags_new = {i: "sale of {item} in Q1 2010 at the store".format(item=i) for i in inv_items}

In [16]:
df.ww.set_types(semantic_tags=semantic_tags_new)

In [17]:
df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
85123A,float64,Double,"['numeric', 'sale of 85123A in Q1 2010 at the store']"
85099B,float64,Double,"['sale of 85099B in Q1 2010 at the store', 'numeric']"
48138,float64,Double,"['sale of 48138 in Q1 2010 at the store', 'numeric']"
20685,float64,Double,"['sale of 20685 in Q1 2010 at the store', 'numeric']"
21843,float64,Double,"['sale of 21843 in Q1 2010 at the store', 'numeric']"
84879,float64,Double,"['numeric', 'sale of 84879 in Q1 2010 at the store']"
79323W,float64,Double,"['numeric', 'sale of 79323W in Q1 2010 at the store']"
POST,float64,Double,"['sale of POST in Q1 2010 at the store', 'numeric']"
37503,float64,Double,"['numeric', 'sale of 37503 in Q1 2010 at the store']"
21524,float64,Double,"['numeric', 'sale of 21524 in Q1 2010 at the store']"


In [18]:
fp = "../../data/retail_q1_post_mc_meta_data.csv"
df.ww.to_csv(fp, index=False)

In [19]:
from tagging.tag_types import *
from owlready2 import *
from utils.load_utils import *
from utils.path_utils import *
KNOWLEDGE_BASE = "../../kmds/examples/example_ml_kb_exp_workflow.xml"

In [20]:
onto2 = load_kb(KNOWLEDGE_BASE)

In [21]:
with onto2:
    insts = Workflow.instances()

In [22]:
the_workflow_instance = insts[0]

In [23]:
mc_obs_list = []
observation_count = 1

mc1 = ModellingChoiceObservation(namespace=onto2)
mc1.finding = (
    "contribution of products towards quarterly revenue shows a power law like behavior and "
    "so to facilitate understanding of the products that go significantly towards the generated quarterly "
    "revenue, we drop the products that materially do not contribute towards the revenue "
    "and simply add dimensionality to the problem."
)
mc1.finding_sequence = observation_count
mc1.modelling_choice_observation_type = ModellingChoiceTags.MODELLING_CHOICE_OBSERVATION.value
mc_obs_list.append(mc1)
the_workflow_instance.has_modeling_choice_observations = mc_obs_list
onto2.save(file=KNOWLEDGE_BASE, format="rdfxml")