## Overview
This notebook implements the model selection phase of the example. To explain the patterns of daily customer purchases
at the store we run PCA on the trimmed daily sales representation. There are many choices to explain variation and summarize
data. PCA is used here for illustration. Other choices such as support vector data description or matrix factorization based approaches
could be favored by some analysts or teams. The point here is that it is possible to capture annotations about rationale used
to make model selection decisions. This is done after developing the model and logging the observations to KMDS. The basic recipe in the note book is:
1. Read the data representation post the modelling choice decision to trim the representation to only use store inventory that accounts for daily variation - see the modelling choice notebook for details
2. Run PCA on the representation
3. Analyze the results
4. Log the observations with KMDS

## Read the Data

In [20]:
import pandas as pd
fp = "../../kmds/examples/retail_q1_post_mc.parquet"
df = pd.read_parquet(fp)

## Run PCA

In [21]:
from sklearn.decomposition import PCA

In [22]:
retail_model = PCA(n_components=30)
df_transf = pd.DataFrame(retail_model.fit_transform(df))

In [23]:
retail_model1 = PCA(n_components=30)
df_transf2 = pd.DataFrame(retail_model.fit_transform(df.T))

In [24]:
df_transf2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,5006.174477,341.610897,-200.697736,15.669141,-152.983853,296.658332,-864.457816,-281.164268,132.208532,399.034609,...,-64.516981,-65.770141,284.564540,-101.182537,118.515658,144.003202,16.216457,160.529757,11.321026,98.388183
1,1681.468123,-460.557186,742.892645,116.278171,-423.580728,-241.706336,-464.784752,222.875142,24.377983,355.046044,...,-176.298624,-23.306556,13.906266,-80.338396,364.009174,-135.874114,-75.225717,-88.657570,-261.386500,60.990368
2,957.268762,48.729052,374.344807,395.761571,-316.455294,-195.184350,-202.071455,143.193595,-262.381339,97.965645,...,371.258577,-53.846431,-404.413015,-213.458922,-73.725382,-62.535382,86.086513,13.322408,267.455125,101.238826
3,1410.624150,-342.474004,-87.784454,165.449873,-400.123700,-603.726885,567.299488,222.533982,-151.055197,-259.582349,...,28.413642,86.757575,157.901723,65.538219,-39.670882,40.536510,-202.170938,-6.942679,-224.676010,-242.475162
4,1677.514064,-239.063064,-520.944064,-200.777950,354.022596,49.174584,99.403145,21.647362,-70.935252,-273.928048,...,-178.514875,-111.922424,-284.335079,-46.288340,-219.910981,-0.384524,22.789406,-383.415033,-145.354641,-18.551486
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
727,-172.852246,2.100443,-27.136371,-9.633826,34.745692,8.778996,-77.545324,150.250846,27.828469,23.480760,...,-6.241302,14.290513,12.177529,-2.353906,-1.948029,-0.325350,5.944735,11.446884,-3.405799,4.693416
728,-159.148455,67.922798,-17.423884,-32.287761,-21.936119,74.065468,14.955490,-42.941924,49.578258,-10.807979,...,11.467326,-7.354815,-32.811369,-27.920144,-4.893528,10.471535,-30.858007,-7.486688,-26.053246,-22.923165
729,-177.155888,-5.836874,-21.174289,-34.276734,44.268164,-0.040906,47.146020,-18.386306,9.491912,26.346587,...,-68.035407,-185.079197,12.783916,230.010366,74.091690,-218.617910,241.896770,-59.197060,-14.254232,-117.082006
730,-182.902079,10.916400,-38.174114,-33.872690,21.798875,-1.602727,-1.868852,-52.008459,-15.501537,-6.420668,...,2.735047,15.658047,-43.002242,2.163355,11.021457,24.383129,35.462282,74.481338,-107.009932,19.560093


In [25]:
fp = "../../kmds/examples/retail_transformed_q1_product_data.parquet"
df_transf2.columns = ["c_" + str(i+1) for i in range(df_transf.shape[1])]
df_transf2.to_parquet(fp, index=False)

In [26]:
fp = "../../kmds/examples/retail_transformed_q1_sales_data.parquet"
df_transf.columns = ["c_" + str(i+1) for i in range(df_transf.shape[1])]
df_transf.to_parquet(fp, index=False)

In [27]:
 retail_model.explained_variance_ratio_.cumsum()

array([0.16927973, 0.25284967, 0.33298209, 0.39949308, 0.45306245,
       0.4975023 , 0.53758732, 0.57516627, 0.6030706 , 0.62816856,
       0.65140386, 0.67159258, 0.69093234, 0.70955167, 0.72759465,
       0.74498394, 0.75977597, 0.77413686, 0.78797363, 0.80042258,
       0.81132756, 0.82182917, 0.83156578, 0.84061998, 0.84909888,
       0.85701798, 0.86439814, 0.87146312, 0.8782515 , 0.88486271])

## Log the Observations to KMDS

In [28]:
from tagging.tag_types import *
from owlready2 import *
from utils.load_utils import *
from utils.path_utils import *
KNOWLEDGE_BASE = "../../kmds/examples/example_ml_kb_exp_workflow.xml"


In [29]:
onto2 = load_kb(KNOWLEDGE_BASE)
with onto2:
    insts = Workflow.instances()
the_workflow_instance = insts[0]

In [30]:
the_workflow_instance

example_ml_kb_exp_workflow.xml.retail_customer_modelling

In [31]:
from kmds.ontology.intent_types import IntentType
ms_obs_list = []
observation_count = 1

ms1 = ModelSelectionObservation(namespace=onto2)
ms1.finding = "For this iteration, PCA was the only modelling approach to explain the variance in daily\
product sales. This is sufficient to illustrate how a model selection workflow is logged with KMDS."
ms1.finding_sequence = observation_count
ms1.model_selection_observation_type = ModelSelectionTags.MODEL_SELECTION_STATEMENT.value
ms1.intent = IntentType.MODEL_SELECTION.value
ms_obs_list.append(ms1)
the_workflow_instance.has_model_selection_observations = ms_obs_list
onto2.save(file=KNOWLEDGE_BASE, format="rdfxml")