## Clustering Module

There are two types of clustering models: time-series and cross-sectional. Cross-sectional (CS) clustering involves feature extraction followed by a clustering algorithm.
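To make the cross-sectional idea concrete, here is a minimal sketch using only NumPy, not the `sovai` implementation. The helper names (`extract_features`, `kmeans`), the three summary features, and the toy series are all assumptions chosen for illustration: each series is reduced to a small feature vector, and a plain k-means loop then groups those vectors.

```python
import numpy as np

def extract_features(series):
    """Summarise one time series as (mean, std, linear slope). Hypothetical feature set."""
    y = np.asarray(series, dtype=float)
    t = np.arange(len(y))
    slope = np.polyfit(t, y, 1)[0]  # first coefficient of a degree-1 fit is the slope
    return np.array([y.mean(), y.std(), slope])

def kmeans(X, init_centers, n_iter=10):
    """Plain Lloyd's k-means with caller-supplied starting centers (deterministic)."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, n_centers)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# Toy data (not real tickers): two flat series and two upward-trending series
t = np.arange(20)
toy = {
    "FLAT_A": np.full(20, 1.0),
    "FLAT_B": np.full(20, 1.1),
    "UP_A": 0.5 * t,
    "UP_B": 0.5 * t + 0.2,
}

X = np.vstack([extract_features(s) for s in toy.values()])
labels, centers = kmeans(X, init_centers=X[[0, 2]])
cluster_of = dict(zip(toy, labels.tolist()))
print(cluster_of)
```

The flat series end up in one cluster and the trending series in the other. In practice the features would usually be scaled before clustering so that no single feature dominates the distance.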

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sovai as sov

sov.token_auth(token="your_token_here")

#### Processed Dataset

In [3]:
# Load ratios - takes around 5 mins to load data 
df_accounting = sov.data("accounting/weekly")

In [4]:
# Choose long enough history for the model to train
df_mega = df_accounting.select_stocks("mega").date_range("2018-01-01")

### Panel Clustering

This clustering methodology takes multivariate panel datasets and represents them according to the centroids that capture the main patterns within the time series data. 

#### All Features (5 minutes)
The first thing we can do is calculate the clusters using all the features, compared across all the tickers in the dataset.

In [5]:
df_cluster = df_mega.cluster()

We will remove 0 tickers for which sufficient data does not exist, they are probably delisted


In [6]:
df_cluster.tail()

| ticker | date | Centroid 0 | Centroid 1 | Centroid 2 | Centroid 3 | Centroid 4 | Centroid 5 | Centroid 6 | Centroid 7 | labels |
|---|---|---|---|---|---|---|---|---|---|---|
| XOM | 2024-06-21 | 0.19 | 0.164 | 0.258 | 0.147 | 0.202 | 0.364 | 0.102 | 0.247 | Centroid 5 |
| XOM | 2024-06-28 | 0.193 | 0.166 | 0.262 | 0.15 | 0.203 | 0.366 | 0.104 | 0.251 | Centroid 5 |
| XOM | 2024-07-05 | 0.195 | 0.169 | 0.265 | 0.153 | 0.205 | 0.368 | 0.106 | 0.253 | Centroid 5 |
| XOM | 2024-07-12 | 0.197 | 0.171 | 0.269 | 0.156 | 0.206 | 0.37 | 0.107 | 0.255 | Centroid 5 |
| XOM | 2024-07-19 | 0.198 | 0.174 | 0.272 | 0.159 | 0.206 | 0.371 | 0.109 | 0.257 | Centroid 5 |
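In the rows shown, the `labels` value coincides with the centroid column that carries the highest score. The sketch below checks that on two of those rows, re-typed by hand. Whether `cluster()` always assigns labels this way is an assumption, not something the output confirms.

```python
import pandas as pd

centroid_cols = [f"Centroid {i}" for i in range(8)]

# Last two rows of the output above, copied by hand for illustration
index = pd.MultiIndex.from_tuples(
    [("XOM", "2024-07-12"), ("XOM", "2024-07-19")],
    names=["ticker", "date"],
)
values = [
    [0.197, 0.171, 0.269, 0.156, 0.206, 0.370, 0.107, 0.255],
    [0.198, 0.174, 0.272, 0.159, 0.206, 0.371, 0.109, 0.257],
]
sample = pd.DataFrame(values, index=index, columns=centroid_cols)
sample["labels"] = ["Centroid 5", "Centroid 5"]

# Name of the centroid column with the highest score in each row
strongest = sample[centroid_cols].idxmax(axis=1)

# Compare against the labels column
matches = (strongest == sample["labels"]).all()
print(strongest)
print(matches)
```

The same comparison can be run against the full `df_cluster` frame to see whether the pattern holds beyond these rows.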


#### Specific Features
We can also focus on a single feature from the tens available in the dataset, such as `total_debt`, `total_assets`, or, as below, `ebit`.

In [None]:
df_cluster_ebit = df_mega.cluster(features=["ebit"])
df_cluster_ebit.tail()

We can also specify our own selection of multiple features like ``features=["total_assets","total_debt","ebit"]``.


In [None]:
df_mega.cluster(features=["total_assets","total_debt","ebit"])

#### Downstream Calculations
Once the data is clustered in time series, many downstream calculations are possible. For one, you can compute each ticker's standard deviation of similarity across clusters at every date, and then take the standard deviation of that series over time.

In [None]:
import pandas as pd

def transform_df(merged_df):
    # Reset the index to have 'ticker' and 'date' as columns
    df_reset = merged_df.reset_index()

    # Identify the numerical columns (assuming they start with 'Centroid')
    centroid_columns = [col for col in df_reset.columns if col.startswith('Centroid') and col != 'Centroid labels']

    # Calculate the standard deviation across the centroid columns for each row
    df_reset['centroid_std'] = df_reset[centroid_columns].std(axis=1)

    # Pivot the table to have dates as index and tickers as columns
    transformed_df = df_reset.pivot(index='date', columns='ticker', values='centroid_std')

    return transformed_df

# Use the function
transformed_df = transform_df(df_cluster)
max_date = transformed_df.index.max()
sorted_df = (transformed_df.query("date == @max_date")
                           .T
                           .reset_index()
                           .sort_values(by=max_date, ascending=False)
                           .reset_index(drop=True))


Companies that are stable and stay within their accounting cluster over time:

In [None]:
transformed_df.std().sort_values(ascending=False).tail(10)

Companies that see-saw between accounting clusters over time:

In [None]:
transformed_df.std().sort_values(ascending=False).head(10)
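The two rankings above can be illustrated on toy data. The sketch below uses hypothetical tickers (`STEADY`, `SEESAW`) and made-up dispersion values shaped like `transformed_df`, with dates as the index and tickers as columns. It only shows why a low standard deviation over time marks a stable company.

```python
import pandas as pd

dates = pd.date_range("2024-01-05", periods=6, freq="W-FRI")

# Hypothetical per-date dispersion values, shaped like `transformed_df`
toy = pd.DataFrame(
    {
        "STEADY": [0.080, 0.081, 0.080, 0.079, 0.080, 0.081],
        "SEESAW": [0.040, 0.120, 0.050, 0.130, 0.045, 0.125],
    },
    index=dates,
)

# Standard deviation over time, one value per ticker, largest first
volatility = toy.std().sort_values(ascending=False)

most_unstable = volatility.index[0]   # head of the ranking
most_stable = volatility.index[-1]    # tail of the ranking
print(volatility)
```

Because the ranking is sorted in descending order, `.head()` surfaces the see-sawing names and `.tail()` the stable ones, matching the two cells above.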

#### Distance Cluster

We can use our built-in distance functionality to get the distances between the ticker-cluster combinations.

In [None]:
df_dist = df_cluster.drop(columns=["labels"]).distance(orient="time-series")
df_dist

Distance calculation for companies with similar clusters:

In [None]:
df_dist.sort_values(["AMZN"])[["AMZN"]].T

What about ebit clustering distance?

In [None]:
df_cluster_ebit.drop(columns=["labels"]).distance(orient="time-series").sort_values(["AMZN"])[["AMZN"]].T
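For intuition about what a ticker-by-ticker distance table looks like, here is a generic sketch. It is not the `sovai` `distance` implementation, whose metric is not documented here. It assumes each ticker has already been reduced to a vector of time-averaged centroid scores, and it uses plain Euclidean distance. The tickers and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical time-averaged centroid scores per ticker (not real data)
profiles = pd.DataFrame(
    {"Centroid 0": [0.30, 0.31, 0.10], "Centroid 1": [0.10, 0.12, 0.40]},
    index=["AAA", "BBB", "CCC"],
)

values = profiles.to_numpy()

# Pairwise differences via broadcasting, shape (n_tickers, n_tickers, n_centroids)
diff = values[:, None, :] - values[None, :, :]

# Euclidean distance between every pair of tickers
dist = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=2)),
    index=profiles.index,
    columns=profiles.index,
)

# Closest other ticker to AAA, analogous to sorting the AMZN column above
nearest_to_aaa = dist["AAA"].drop("AAA").idxmin()
print(dist.round(3))
```

Sorting one column of such a table, as done for `AMZN` above, lists the other tickers from most to least similar.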

#### Summary
This gives you a quick summary of the last six months of data.

In [None]:
df_mega.cluster("summary")

#### Visualization
Each colored line tracks how close a series is to the centroid of one cluster. The centroid is the average pattern of all time series assigned to that cluster. The plotted values are similarity scores (based on cross-correlation). Selecting different features shows how the shapes change over time.

In [None]:
df_mega.cluster("line_plot")

In [None]:
df_mega.cluster("scatter_plot")

In [None]:
df_mega.cluster("animation_plot")
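The similarity scores described in this section are based on cross-correlation. The sketch below shows one common way to compute such a score with NumPy: z-normalise both series, cross-correlate them over all lags, and keep the peak. The exact normalisation and lag handling inside `sovai` are not documented here, so treat `znorm`, `xcorr_similarity`, and the toy sine data as assumptions.

```python
import numpy as np

def znorm(x):
    """Rescale a series to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def xcorr_similarity(series, centroid):
    """Peak of the normalised cross-correlation over all lags (at most 1.0)."""
    a, b = znorm(series), znorm(centroid)
    cc = np.correlate(a, b, mode="full") / len(a)
    return float(cc.max())

# Toy cluster: three sine-shaped members that differ only in level and scale
t = np.linspace(0, 4 * np.pi, 100)
members = np.vstack([np.sin(t), np.sin(t) + 0.1, 1.2 * np.sin(t)])

# The centroid is the average pattern of the series assigned to the cluster
centroid = members.mean(axis=0)

sim_member = xcorr_similarity(np.sin(t), centroid)  # same shape as the centroid
sim_other = xcorr_similarity(t, centroid)           # a linear trend, unlike the centroid
print(round(sim_member, 3), round(sim_other, 3))
```

A member with the same shape as the centroid scores close to 1, while a series with a different shape scores noticeably lower, which is what separates the colored lines in the plots.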