In [None]:
import pandas as pd
fp = "../data/olist_prepared/SP_CS_2017_NN_graph.csv"
df = pd.read_csv(fp)
week_cols = df.columns.tolist()

Context:
This dataset contains weekly purchases of frequently purchased inventory items in SP. Each row is an inventory item, and each column holds that item's sales revenue for one week. If two weeks sell a similar mix of items, those weeks have high cosine similarity. The weeks do not need to have the same revenue for these items, because cosine similarity compares the direction of the two revenue vectors and not their magnitude. Weeks with high cosine similarity are therefore weeks in which similar inventory items sold. This is useful for several reasons:
1. Demand planners know when a group of inventory items is in demand.
2. Price setters can set prices appropriately at that time.

Weeks with high cosine similarity signal an affinity for a group of inventory items (rows) during a particular group of weeks (columns). The presence of such weeks indicates that these affinities exist in our dataset. We will exploit this point later.
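
To make the point about revenue levels concrete, here is a minimal sketch on made-up numbers. The `revenue` array and the helper `cosine_similarity_between_columns` are hypothetical and are not part of the Olist data or of this notebook's pipeline. Week 1 sells the same items as week 0 at three times the revenue, and week 2 sells a different item entirely.

```python
import numpy as np

# Hypothetical toy data: rows are inventory items, columns are weeks.
revenue = np.array([
    [100.0, 300.0, 0.0],   # item A
    [50.0, 150.0, 0.0],    # item B
    [0.0, 0.0, 80.0],      # item C
])


def cosine_similarity_between_columns(matrix):
    """Return the week-by-week cosine similarity of an item-by-week matrix."""
    # Scale every week (column) to unit length, then take all dot products.
    norms = np.linalg.norm(matrix, axis=0)
    unit_columns = matrix / norms
    return unit_columns.T @ unit_columns


week_similarity = cosine_similarity_between_columns(revenue)
print(np.round(week_similarity, 3))
```

Weeks 0 and 1 come out with similarity 1 even though their revenue differs by a factor of three, while week 2 has similarity 0 with both because it shares no items with them.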

In [None]:
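# Collect every pair of distinct weeks whose cosine similarity exceeds 0.4.
# The similarity matrix is assumed to be symmetric, so scanning the upper
# triangle (j > i) visits each pair exactly once and skips the diagonal.
threshold = 0.4
hc_weeks = {}
n_weeks = len(week_cols)
for i in range(n_weeks):
    for j in range(i + 1, n_weeks):
        if df.iloc[i, j] > threshold:
            hc_weeks[f"week-{week_cols[i]},{week_cols[j]}"] = df.iloc[i, j]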

In [None]:
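# One row per week pair; name the value column instead of leaving it as 0.
df_corr_purch_weeks = pd.DataFrame.from_dict(
    hc_weeks, orient="index", columns=["cosine_similarity"]
)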

In [None]:
import plotly.express as px

fig = px.imshow(df, width=600, height=600)
fig.show()

The heatmap shows several square regions with a cosine similarity of about 0.4 as you move your eyes along the diagonal. These are groups of weeks that have high cosine similarity with one another. I see three clusters for sure, and I can make a case for a fourth.
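
One way to cross-check a visual cluster count is the eigengap heuristic, which is a different technique from the visual inspection used in this notebook. The sketch below runs on a made-up block-structured similarity matrix, not on the Olist data. The block sizes, the similarity values 0.6 and 0.05, and the helper `eigengap_cluster_count` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical similarity matrix with three groups of weeks (sizes 4, 3, 5):
# 0.6 within a group, 0.05 between groups, 1.0 on the diagonal.
sizes = [4, 3, 5]
block_id = np.repeat(np.arange(len(sizes)), sizes)
similarity = np.where(block_id[:, None] == block_id[None, :], 0.6, 0.05)
np.fill_diagonal(similarity, 1.0)


def eigengap_cluster_count(affinity, max_clusters=6):
    """Suggest a cluster count from the largest gap between consecutive
    small eigenvalues of the symmetric normalized graph Laplacian."""
    degree = affinity.sum(axis=1)
    inv_sqrt_degree = 1.0 / np.sqrt(degree)
    normalized = inv_sqrt_degree[:, None] * affinity * inv_sqrt_degree[None, :]
    laplacian = np.eye(len(affinity)) - normalized
    eigenvalues = np.linalg.eigvalsh(laplacian)  # sorted in ascending order
    gaps = np.diff(eigenvalues[: max_clusters + 1])
    # A large gap after the k-th smallest eigenvalue suggests k clusters.
    return int(np.argmax(gaps)) + 1


suggested_k = eigengap_cluster_count(similarity)
print(suggested_k)
```

On real data the gaps are rarely this clean, so the result is a hint to weigh against the heatmap and not a replacement for it.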

In [None]:
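from sklearn.cluster import SpectralClustering

# affinity="precomputed" treats df itself as the week-by-week affinity
# matrix, so df is expected to be square and symmetric.
clustering = SpectralClustering(n_clusters=4, affinity="precomputed",
        assign_labels="cluster_qr",
        random_state=0).fit(df)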

In [None]:
cluster_info = {"woy": week_cols, "cluster": clustering.labels_}

In [None]:
df_cluster_info = pd.DataFrame.from_dict(cluster_info, orient="columns")
df_cluster_info["cluster"] = df_cluster_info["cluster"].astype("str")
df_cluster_info["woy"] = df_cluster_info["woy"].astype(int)

In [None]:
fp = "../data/olist_prepared/SP_weekly_revenue.csv"
df_weekly_rev = pd.read_csv(fp)

In [None]:
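filter_2017 = df_weekly_rev["year"] == 2017
# Copy the filtered rows so the cast below modifies an independent frame
# and does not trigger pandas' SettingWithCopyWarning.
df_weekly_rev_2017 = df_weekly_rev[filter_2017].copy()
df_weekly_rev_2017["woy"] = df_weekly_rev_2017["woy"].astype(int)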

In [None]:
df_result = pd.merge(df_cluster_info, df_weekly_rev_2017, on="woy")

In [None]:
fig = px.violin(df_result, y="weekly_revenue", x="cluster", box=True, points="all")
fig.show()

In [None]:
fig = px.scatter(df_result, x='woy', y='weekly_revenue', text='cluster', color='cluster')

# Place each point's cluster label above its marker
fig.update_traces(textposition='top center')

fig.show()