## Profile of Clusters
The purpose of this notebook is to profile the clusters in terms of the frequently purchased inventory items. Non-negative matrix factorization (NMF) is used for this purpose. NMF factorizes a non-negative matrix $\mathbf{X}$ as $\mathbf{X} \approx \mathbf{W}\mathbf{H}$, where $\mathbf{H}$ is a topic matrix (here, topics are defined over the inventory of frequently purchased items) and $\mathbf{W}$ is a topic mixing matrix that defines the mixture of topics for a particular week. The $\mathbf{H}$ matrix obtained here appears to be separable, that is, each topic contains **anchor** items: items that have non-zero weight in that topic and are not found in the other topics. [NIMFA](https://ai.stanford.edu/~marinka/nimfa/) used to have an explicit implementation of separable NMF, but it no longer appears to be maintained and there is a NumPy version error that needs to be fixed. The scikit-learn implementation works, and the solution it returns appears to have the separable property.
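
As a rough illustration of the factorization described above, the following sketch builds a small synthetic matrix from a separable topic matrix and factorizes it with scikit-learn. The names `H_true`, `W_true`, `X_toy` and the toy dimensions are assumptions made for illustration and are not part of the Olist data; NMF is also not guaranteed to recover the anchor structure exactly.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 topics over 6 items, observed for 30 "weeks".
# Items 0-1 appear only in topic 0 and items 4-5 only in topic 1 (anchor items).
H_true = np.array([
    [3.0, 2.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 3.0, 2.0],
])
W_true = rng.uniform(0.0, 5.0, size=(30, 2))
X_toy = W_true @ H_true

# Factorize X_toy into a mixing matrix (W_toy) and a topic matrix (H_toy).
model_toy = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W_toy = model_toy.fit_transform(X_toy)
H_toy = model_toy.components_

# Relative reconstruction error in the Frobenius norm.
rel_err = np.linalg.norm(X_toy - W_toy @ H_toy) / np.linalg.norm(X_toy)
print(W_toy.shape, H_toy.shape, rel_err)
```

Each row of `W_toy` gives the topic mixture for one week, and each row of `H_toy` gives the item weights for one topic.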

In [None]:
import pandas as pd

In [None]:
fp = "../data/olist_prepared/freq_prod_weekly_sale_SP_2017.parquet"
df = pd.read_parquet(fp)

In [None]:
X = df.values

In [None]:
from sklearn.decomposition import NMF

In [None]:
model = NMF(n_components=2, init='random', random_state=0)

In [None]:
W = model.fit_transform(X)
H = model.components_

In [None]:
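import numpy as np
# Reconstruct X from the factors and compute the mean, over weeks (rows),
# of the relative squared reconstruction error.
X_hat = W @ H
np.mean(np.sum((X_hat - X) ** 2, axis=1) / np.sum(X ** 2, axis=1))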

In [None]:
H

In [None]:
prod_list = df.columns.tolist()

In [None]:
c1 = H[0] > 0
c2 = H[1] > 0

In [None]:
df_prod = pd.DataFrame({"prod_id": prod_list})

## Separability
This section checks the separability of the basis matrix $\mathbf{H}$ by looking for items that have non-zero weight in only one topic. The basis matrix appears to have the separable property.
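
The check can be sketched as a small helper that, for each topic, lists the items that are active in that topic and in no other. The function name `anchor_items`, the tolerance `tol`, and the demo matrix `H_demo` are hypothetical and only illustrate the idea; this is not a full separable NMF test.

```python
import numpy as np

def anchor_items(H, item_ids, tol=0.0):
    """Map each topic index to the items that load on that topic only."""
    H = np.asarray(H)
    active = H > tol                      # (n_topics, n_items) boolean mask
    exclusive = active.sum(axis=0) == 1   # items active in exactly one topic
    return {
        k: [item for item, keep in zip(item_ids, active[k] & exclusive) if keep]
        for k in range(H.shape[0])
    }

# Hypothetical topic matrix: "a" is exclusive to topic 0, "b" to topic 1,
# "c" is shared by both topics, and "d" is inactive everywhere.
H_demo = np.array([
    [0.9, 0.0, 0.4, 0.0],
    [0.0, 0.7, 0.2, 0.0],
])
anchors = anchor_items(H_demo, ["a", "b", "c", "d"])
print(anchors)
```

If every topic ends up with a non-empty list, the topic matrix has at least one anchor item per topic, which is the property this section looks for.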

In [None]:
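# Select the product ids explicitly; set(df_prod[c1]) would only contain
# the column name "prod_id", because iterating a DataFrame yields its columns.
set_c1 = set(df_prod.loc[c1, "prod_id"])
set_c2 = set(df_prod.loc[c2, "prod_id"])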

In [None]:
fp = "../data/olist_raw/olist_products_dataset.csv"
dfp = pd.read_csv(fp)

In [None]:
dfp

In [None]:
c1 = set(df_prod[H[0] > 0]["prod_id"])
c2 = set(df_prod[H[1] > 0]["prod_id"])

In [None]:
# Products with non-zero weight in topic 1 only (sorted for a stable row order).
df_unique_c1 = pd.DataFrame({"product_id": sorted(c1.difference(c2))})

In [None]:
df_unique_c1 = pd.merge(df_unique_c1, dfp, on="product_id")
cols_needed = ["product_category_name"]
df_unique_c1 = df_unique_c1[cols_needed]

In [None]:
# Products with non-zero weight in topic 2 only (sorted for a stable row order).
df_unique_c2 = pd.DataFrame({"product_id": sorted(c2.difference(c1))})
df_unique_c2 = pd.merge(df_unique_c2, dfp, on="product_id")
cols_needed = ["product_category_name"]
df_unique_c2 = df_unique_c2[cols_needed]

## Signature Items of each Topic
The dataframes `df_unique_c1` and `df_unique_c2` contain the product categories of the signature (anchor) items of each topic.

In [None]:
df_unique_c2.head(20)

In [None]:
df_unique_c1.head(20)

## Profile Clusters

In [None]:
fpc = "../data/olist_prepared/SP_2017_cs_cluster_info.csv"
df_fpc = pd.read_csv(fpc)

In [None]:
df = pd.merge(df, df_fpc, on="woy")

In [None]:
df.cluster.unique()

In [None]:
c2_ind = df["cluster"] == 2
c1_ind = df["cluster"] == 1
c0_ind = df["cluster"] == 0

In [None]:
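# Mean weight of each topic (column of W) over the weeks in each cluster.
# W[mask][0] would pick the first week (row), not the first topic (column).
# This assumes the rows of the merged df are in the same order as the rows of W.
c2_mix = (W[c2_ind.to_numpy(), 0].mean(), W[c2_ind.to_numpy(), 1].mean())
c1_mix = (W[c1_ind.to_numpy(), 0].mean(), W[c1_ind.to_numpy(), 1].mean())
c0_mix = (W[c0_ind.to_numpy(), 0].mean(), W[c0_ind.to_numpy(), 1].mean())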

In [None]:
c1_mix

In [None]:
c0_mix
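
The per-cluster topic mixtures computed above can also be obtained in one step with a `groupby`. The sketch below uses small hypothetical stand-ins (`W_demo`, `clusters_demo`) rather than the notebook's `W` and `df["cluster"]`, and assumes the cluster labels are in the same row order as the mixing matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the mixing matrix W and the per-week cluster labels.
W_demo = np.array([
    [1.0, 0.0],
    [3.0, 2.0],
    [0.0, 4.0],
    [2.0, 2.0],
])
clusters_demo = pd.Series([0, 0, 1, 1], name="cluster")

# One row per cluster, one column per topic: mean topic weight within the cluster.
mix = (
    pd.DataFrame(W_demo, columns=["topic_0", "topic_1"])
    .groupby(clusters_demo.to_numpy())
    .mean()
)
print(mix)
```

Reading across a row of `mix` shows which topic dominates the weeks in that cluster.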