In [None]:
!pip install sovai[full]

## Clustering Module

There are two types of clustering models: time-series and cross-sectional. Cross-sectional (CS) clustering involves feature extraction followed by a clustering algorithm.

In [2]:
import sovai as sov

sov.token_auth(token="visit https://sov.ai/profile for your token")

#### Processed Dataset

In [3]:
# Load ratios - takes around 5 mins to load data 
df_accounting = sov.data("accounting/weekly")

In [4]:
# Choose long enough history for the model to train
df_mega = df_accounting.select_stocks("mega").date_range("2018-01-01")
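
Before clustering, it can help to check how many weekly observations each ticker has, since tickers with too little history may be dropped. The sketch below runs on a toy frame rather than the real dataset. It assumes the panel is indexed by `(ticker, date)`, and the `MIN_WEEKS` threshold is a hypothetical value, not a sovai setting.

```python
import pandas as pd

# Toy (ticker, date) panel: "AAA" has a full year of weeks, "BBB" only five.
dates = pd.date_range("2018-01-05", periods=52, freq="W-FRI")
index = pd.MultiIndex.from_tuples(
    [("AAA", d) for d in dates] + [("BBB", d) for d in dates[:5]],
    names=["ticker", "date"],
)
df_toy = pd.DataFrame({"ebit": range(len(index))}, index=index)

MIN_WEEKS = 26  # hypothetical threshold chosen for illustration only

# Count observations per ticker and flag those with a short history.
weeks_per_ticker = df_toy.groupby(level="ticker").size()
short_history = weeks_per_ticker[weeks_per_ticker < MIN_WEEKS].index.tolist()
print(short_history)
```

The same two lines at the end should work on any frame that carries a `ticker` index level.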

### Panel Clustering

This clustering methodology takes a multivariate panel dataset and represents each ticker by its similarity to a set of centroids that capture the main patterns within the time series data.

#### All Features (5 minutes)
The first thing we can do is calculate the clusters using all the features, compared across all the tickers in the dataset.

In [5]:
df_cluster = df_mega.cluster()

We will remove 0 tickers for which sufficient data does not exist, they are probably delisted


In [6]:
df_cluster.tail()

ticker,date,Centroid 0,Centroid 1,Centroid 2,Centroid 3,Centroid 4,Centroid 5,Centroid 6,Centroid 7,labels
XOM,2024-06-28,0.203,0.196,0.217,0.145,0.21,0.401,0.226,0.148,Centroid 5
XOM,2024-07-05,0.206,0.199,0.221,0.148,0.212,0.404,0.23,0.15,Centroid 5
XOM,2024-07-12,0.208,0.202,0.225,0.151,0.213,0.407,0.233,0.152,Centroid 5
XOM,2024-07-19,0.209,0.205,0.229,0.154,0.214,0.408,0.236,0.154,Centroid 5
XOM,2024-07-26,0.21,0.208,0.233,0.157,0.215,0.409,0.239,0.155,Centroid 5
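
The output has one row per `(ticker, date)` pair, so a common next step is to reduce it to one label per ticker and see which companies share a cluster. The following is a minimal sketch on a toy frame with the same shape as the output above. The ticker names and values are made up, and it assumes the `labels` column and the `ticker` index level are named as shown.

```python
import pandas as pd

# Toy stand-in for the clustering output: (ticker, date) index plus "labels".
dates = pd.to_datetime(["2024-07-19", "2024-07-26"])
index = pd.MultiIndex.from_product(
    [["AAA", "BBB", "CCC"], dates], names=["ticker", "date"]
)
df_toy = pd.DataFrame(
    {
        "Centroid 0": [0.40, 0.41, 0.10, 0.11, 0.39, 0.42],
        "Centroid 1": [0.20, 0.21, 0.35, 0.36, 0.15, 0.16],
        "labels": ["Centroid 0"] * 2 + ["Centroid 1"] * 2 + ["Centroid 0"] * 2,
    },
    index=index,
)

# Most recent label per ticker.
latest_label = df_toy.groupby(level="ticker")["labels"].last()

# Number of tickers per cluster, and the members of each cluster.
cluster_sizes = latest_label.value_counts()
members = {
    label: sorted(group.index)
    for label, group in latest_label.groupby(latest_label)
}
print(members)
```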


#### Specific Features
We can also cluster on a single feature out of the dozens in the dataset, such as `total_debt`, `total_assets`, or, as below, `ebit`.

In [7]:
df_cluster_ebit = df_mega.cluster(features=["ebit"])
df_cluster_ebit.tail()

We will remove 0 tickers for which sufficient data does not exist, they are probably delisted


ticker,date,Centroid 0,Centroid 1,Centroid 2,Centroid 3,Centroid 4,Centroid 5,Centroid 6,Centroid 7,labels
XOM,2024-06-28,0.597,0.585,0.442,0.217,0.123,0.038,0.23,-0.133,Centroid 0
XOM,2024-07-05,0.614,0.588,0.446,0.219,0.117,0.054,0.241,-0.137,Centroid 0
XOM,2024-07-12,0.631,0.59,0.449,0.219,0.112,0.07,0.253,-0.141,Centroid 0
XOM,2024-07-19,0.646,0.591,0.453,0.219,0.107,0.084,0.264,-0.145,Centroid 0
XOM,2024-07-26,0.659,0.591,0.456,0.218,0.102,0.097,0.276,-0.148,Centroid 0


We can also specify our own selection of multiple features like ``features=["total_assets","total_debt","ebit"]``.


In [8]:
df_mega.cluster(features=["total_assets","total_debt","ebit"])

We will remove 0 tickers for which sufficient data does not exist, they are probably delisted


ticker,date,Centroid 0,Centroid 1,Centroid 2,Centroid 3,Centroid 4,Centroid 5,Centroid 6,Centroid 7,labels
AAPL,2018-01-05,0.003,0.005,0.006,0.004,0.004,0.001,0.002,0.006,Centroid 4
AAPL,2018-01-12,0.006,0.011,0.012,0.009,0.008,0.002,0.003,0.013,Centroid 4
AAPL,2018-01-19,0.010,0.018,0.019,0.014,0.012,0.003,0.005,0.021,Centroid 4
AAPL,2018-01-26,0.013,0.026,0.027,0.019,0.017,0.005,0.007,0.029,Centroid 4
AAPL,2018-02-02,0.016,0.034,0.034,0.025,0.022,0.006,0.009,0.038,Centroid 4
...,...,...,...,...,...,...,...,...,...,...
XOM,2024-06-28,0.087,0.044,0.183,0.481,0.654,0.183,0.358,0.339,Centroid 4
XOM,2024-07-05,0.091,0.045,0.186,0.487,0.667,0.185,0.358,0.339,Centroid 4
XOM,2024-07-12,0.095,0.046,0.190,0.493,0.680,0.186,0.357,0.339,Centroid 4
XOM,2024-07-19,0.098,0.048,0.195,0.499,0.692,0.187,0.355,0.339,Centroid 4


#### Downstream Calculations
Once the data is clustered over time, many downstream calculations are possible. For example, you can take the standard deviation of the similarity scores across centroids on each date, and then the standard deviation of that dispersion over time for each ticker.

In [9]:
import pandas as pd

def transform_df(merged_df):
    # Reset the index to have 'ticker' and 'date' as columns
    df_reset = merged_df.reset_index()

    # Identify the numerical columns (assuming they start with 'Centroid')
    centroid_columns = [col for col in df_reset.columns if col.startswith('Centroid') and col != 'Centroid labels']

    # Calculate the standard deviation across the centroid columns for each row
    df_reset['centroid_std'] = df_reset[centroid_columns].std(axis=1)

    # Pivot the table to have dates as index and tickers as columns
    transformed_df = df_reset.pivot(index='date', columns='ticker', values='centroid_std')

    return transformed_df

# Use the function
transformed_df = transform_df(df_cluster)
max_date = transformed_df.index.max()
sorted_df = (transformed_df.query("date == @max_date")
                           .T
                           .reset_index()
                           .sort_values(by=max_date, ascending=False)
                           .reset_index(drop=True))
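
To make the "standard deviation of the standard deviation" concrete, here is a small worked sketch on toy data. The tickers `STEADY` and `SHIFTY` and their values are invented for illustration. It repeats the same two steps as above: dispersion across centroids per row, then variation of that dispersion over time.

```python
import pandas as pd

dates = pd.to_datetime(["2024-07-12", "2024-07-19", "2024-07-26"])
index = pd.MultiIndex.from_product(
    [["STEADY", "SHIFTY"], dates], names=["ticker", "date"]
)
df_toy = pd.DataFrame(
    {
        # STEADY keeps the same profile; SHIFTY drifts toward Centroid 0.
        "Centroid 0": [0.2, 0.2, 0.2, 0.1, 0.5, 0.9],
        "Centroid 1": [0.4, 0.4, 0.4, 0.1, 0.1, 0.1],
    },
    index=index,
)

# Step 1: dispersion of similarity across centroids, per (ticker, date) row.
dispersion = df_toy.std(axis=1)

# Step 2: one column per ticker, then how much that dispersion moves over time.
stability = dispersion.unstack("ticker").std()
print(stability.sort_values())
```

A ticker whose similarity profile never changes scores close to zero, while a ticker that moves between centroids scores higher.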


Companies that are stable and stay within their accounting cluster over time (lowest values):

In [10]:
transformed_df.std().sort_values(ascending=False).tail(10)

ticker
WFC     0.013
WMT     0.013
BRK.B   0.012
JPM     0.012
BAC     0.012
BA      0.011
PFE     0.011
NVS     0.011
CSCO    0.009
TWX     0.006
dtype: float64

Companies that see-saw between accounting clusters over time (highest values):

In [11]:
transformed_df.std().sort_values(ascending=False).head(10)

ticker
TSM     0.028
AMZN    0.027
NVO     0.024
CRM     0.024
COST    0.023
TMO     0.023
CVX     0.023
NFLX    0.023
META    0.022
GOOGL   0.022
dtype: float64

#### Distance Cluster

We can use the built-in distance functionality to get pairwise distances between tickers based on their cluster similarity series.

In [12]:
df_dist = df_cluster.drop(columns=["labels"]).distance(orient="time-series")
df_dist

Unnamed: 0,AAPL,ABBV,ABT,ACN,ADBE,AIG,AMD,AMZN,ASML,AVGO,AZN,BA,BABA,BAC,BHP,BP,BRK.B,BUD,C,CMCSA,COST,CRM,CSCO,CVX,DHR,DIS,GE,GOOGL,HD,IBM,INTC,JNJ,JPM,KO,LIN,LLY,MA,MCD,META,MRK,MSFT,NFLX,NKE,NVDA,NVO,NVS,ORCL,PDD,PEP,PFE,PG,PTRCY,PYPL,QCOM,SAP,T,TM,TMO,TSLA,TSM,TTE,TWX,UNH,V,VZ,WFC,WMT,XOM
AAPL,0.000,0.114,0.066,0.102,0.112,0.242,0.047,0.090,0.087,0.106,0.052,0.492,0.154,0.142,0.644,0.718,0.114,0.641,0.164,0.245,0.097,0.103,0.856,0.466,0.092,0.261,0.165,0.080,0.102,0.224,0.311,0.070,0.097,0.141,0.309,0.078,0.088,0.117,0.094,0.082,0.071,0.128,0.095,0.079,0.086,0.483,0.249,0.077,0.101,0.064,0.062,0.139,0.117,0.052,0.169,0.153,0.099,0.066,0.065,0.063,0.117,0.838,0.093,0.095,0.099,0.396,0.163,0.276
ABBV,0.114,0.000,0.027,0.019,0.018,0.180,0.048,0.026,0.026,0.085,0.063,0.345,0.026,0.215,0.537,0.463,0.047,0.370,0.149,0.050,0.032,0.030,0.888,0.766,0.012,0.105,0.033,0.021,0.016,0.423,0.104,0.032,0.068,0.064,0.092,0.121,0.046,0.078,0.027,0.046,0.046,0.017,0.030,0.051,0.088,0.435,0.370,0.053,0.041,0.075,0.102,0.149,0.013,0.147,0.039,0.306,0.141,0.024,0.049,0.050,0.141,0.803,0.043,0.067,0.023,0.217,0.044,0.396
ABT,0.066,0.027,0.000,0.035,0.037,0.177,0.036,0.028,0.030,0.090,0.055,0.332,0.047,0.192,0.531,0.561,0.048,0.417,0.146,0.103,0.045,0.044,0.918,0.684,0.029,0.159,0.068,0.025,0.027,0.359,0.135,0.024,0.060,0.065,0.138,0.093,0.053,0.081,0.031,0.049,0.032,0.039,0.037,0.055,0.084,0.446,0.354,0.056,0.048,0.067,0.076,0.146,0.030,0.089,0.076,0.278,0.131,0.018,0.040,0.048,0.137,0.836,0.051,0.070,0.043,0.257,0.067,0.378
ACN,0.102,0.019,0.035,0.000,0.003,0.203,0.022,0.009,0.008,0.042,0.030,0.412,0.028,0.146,0.631,0.555,0.024,0.452,0.100,0.050,0.005,0.008,0.876,0.681,0.007,0.073,0.029,0.003,0.006,0.397,0.146,0.021,0.029,0.035,0.097,0.079,0.015,0.038,0.009,0.023,0.016,0.005,0.011,0.019,0.039,0.352,0.278,0.019,0.023,0.048,0.072,0.084,0.010,0.122,0.027,0.230,0.078,0.011,0.017,0.018,0.093,0.816,0.009,0.024,0.008,0.265,0.019,0.302
ADBE,0.112,0.018,0.037,0.003,0.000,0.215,0.029,0.008,0.008,0.046,0.037,0.399,0.019,0.157,0.627,0.536,0.020,0.426,0.106,0.041,0.008,0.006,0.897,0.708,0.010,0.069,0.024,0.006,0.007,0.432,0.137,0.022,0.030,0.036,0.082,0.086,0.018,0.041,0.008,0.028,0.019,0.002,0.012,0.023,0.046,0.342,0.293,0.022,0.023,0.060,0.076,0.093,0.007,0.137,0.021,0.250,0.088,0.013,0.021,0.023,0.105,0.809,0.012,0.029,0.010,0.254,0.014,0.325
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
V,0.095,0.067,0.070,0.024,0.029,0.239,0.023,0.027,0.022,0.014,0.017,0.548,0.075,0.068,0.723,0.672,0.031,0.562,0.052,0.111,0.012,0.024,0.843,0.541,0.045,0.077,0.067,0.021,0.040,0.376,0.242,0.037,0.011,0.027,0.165,0.036,0.005,0.014,0.028,0.017,0.021,0.037,0.020,0.008,0.006,0.314,0.166,0.008,0.020,0.037,0.060,0.042,0.049,0.104,0.068,0.148,0.025,0.032,0.010,0.013,0.057,0.825,0.009,0.000,0.027,0.380,0.041,0.195
VZ,0.099,0.023,0.043,0.008,0.010,0.221,0.027,0.014,0.013,0.046,0.030,0.428,0.029,0.148,0.637,0.550,0.025,0.451,0.101,0.056,0.010,0.012,0.902,0.680,0.015,0.075,0.029,0.011,0.013,0.400,0.162,0.023,0.033,0.044,0.107,0.082,0.018,0.043,0.016,0.025,0.026,0.013,0.016,0.024,0.042,0.358,0.274,0.022,0.022,0.049,0.076,0.089,0.014,0.130,0.031,0.229,0.080,0.017,0.022,0.022,0.095,0.792,0.014,0.027,0.000,0.273,0.030,0.306
WFC,0.396,0.217,0.257,0.265,0.254,0.502,0.321,0.255,0.273,0.422,0.354,0.193,0.210,0.555,0.558,0.288,0.287,0.142,0.481,0.241,0.295,0.270,0.722,0.932,0.240,0.350,0.251,0.279,0.246,0.617,0.168,0.294,0.380,0.336,0.215,0.480,0.338,0.426,0.263,0.331,0.302,0.240,0.329,0.344,0.400,0.489,0.654,0.349,0.340,0.397,0.379,0.496,0.218,0.435,0.240,0.632,0.502,0.256,0.342,0.326,0.495,0.725,0.308,0.380,0.273,0.000,0.266,0.719
WMT,0.163,0.044,0.067,0.019,0.014,0.243,0.056,0.022,0.022,0.057,0.061,0.421,0.027,0.173,0.667,0.534,0.029,0.430,0.110,0.036,0.024,0.020,0.880,0.749,0.034,0.059,0.044,0.024,0.027,0.504,0.145,0.049,0.039,0.042,0.069,0.106,0.032,0.044,0.020,0.046,0.038,0.013,0.030,0.039,0.062,0.298,0.315,0.039,0.040,0.088,0.104,0.098,0.022,0.186,0.026,0.283,0.105,0.034,0.039,0.046,0.130,0.805,0.027,0.041,0.030,0.266,0.000,0.339


Sorting by distance to a single ticker, here `AMZN`, shows which companies have the most similar cluster profiles:

In [13]:
df_dist.sort_values(["AMZN"])[["AMZN"]].T

Unnamed: 0,AMZN,META,ASML,GOOGL,TMO,ADBE,CRM,PYPL,NFLX,ACN,COST,BRK.B,HD,MSFT,UNH,VZ,NVDA,MA,DHR,TSLA,PDD,NKE,TSM,JNJ,AMD,PEP,WMT,ABBV,JPM,V,BABA,MRK,AZN,ABT,KO,GE,SAP,NVO,MCD,AVGO,PG,PFE,CMCSA,LLY,TM,AAPL,DIS,PTRCY,LIN,C,TTE,QCOM,INTC,BAC,T,AIG,WFC,ORCL,XOM,NVS,IBM,BA,BUD,BP,BHP,CVX,TWX,CSCO
AMZN,0.0,0.002,0.005,0.005,0.007,0.008,0.008,0.009,0.009,0.009,0.009,0.012,0.012,0.013,0.014,0.014,0.016,0.017,0.017,0.018,0.018,0.018,0.02,0.02,0.021,0.021,0.022,0.026,0.026,0.027,0.027,0.027,0.028,0.028,0.033,0.035,0.037,0.039,0.041,0.045,0.06,0.06,0.062,0.073,0.079,0.09,0.09,0.094,0.102,0.107,0.113,0.114,0.145,0.149,0.231,0.232,0.255,0.279,0.322,0.355,0.403,0.406,0.443,0.581,0.649,0.695,0.825,0.913
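
If only the closest few peers are needed, the sorted column can be trimmed with plain pandas. This is a minimal sketch on a toy symmetric distance matrix. The helper `nearest_peers` and the tickers are hypothetical and not part of sovai.

```python
import pandas as pd

tickers = ["AAA", "BBB", "CCC", "DDD"]
df_toy_dist = pd.DataFrame(
    [
        [0.00, 0.02, 0.40, 0.10],
        [0.02, 0.00, 0.35, 0.12],
        [0.40, 0.35, 0.00, 0.50],
        [0.10, 0.12, 0.50, 0.00],
    ],
    index=tickers,
    columns=tickers,
)

def nearest_peers(dist, ticker, k=2):
    """Return the k tickers with the smallest distance to `ticker`, excluding itself."""
    return dist[ticker].drop(ticker).nsmallest(k)

peers = nearest_peers(df_toy_dist, "AAA")
print(peers.index.tolist())
```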


What about distances based on the `ebit`-only clustering?

In [14]:
df_cluster_ebit.drop(columns=["labels"]).distance(orient="time-series").sort_values(["AMZN"])[["AMZN"]].T

Unnamed: 0,AMZN,PYPL,META,CRM,TMO,BRK.B,NFLX,GOOGL,NVDA,BUD,MSFT,ADBE,JPM,DIS,CMCSA,KO,ASML,NVO,COST,AMD,PG,BA,ABT,AAPL,LIN,HD,ACN,BHP,TM,JNJ,V,UNH,DHR,GE,MA,MRK,VZ,PDD,TSM,WMT,PFE,IBM,T,AVGO,QCOM,TSLA,PEP,MCD,NKE,TWX,ABBV,AZN,PTRCY,BAC,LLY,C,INTC,ORCL,XOM,WFC,AIG,TTE,CVX,CSCO,NVS,SAP,BABA,BP
AMZN,0.0,0.142,0.144,0.157,0.23,0.247,0.291,0.293,0.315,0.318,0.331,0.345,0.35,0.351,0.357,0.367,0.371,0.407,0.408,0.421,0.445,0.456,0.463,0.464,0.467,0.468,0.471,0.479,0.482,0.496,0.499,0.503,0.511,0.514,0.522,0.533,0.543,0.544,0.552,0.553,0.557,0.561,0.564,0.572,0.573,0.574,0.581,0.592,0.597,0.624,0.631,0.642,0.649,0.655,0.656,0.662,0.662,0.674,0.697,0.703,0.718,0.723,0.728,0.808,0.826,0.863,0.894,0.956


#### Summary
This gives you a quick summary of the last six months of data.

In [15]:
df_mega.cluster("summary")

We will remove 0 tickers for which sufficient data does not exist, they are probably delisted


ticker,Cluster,Divergence,Growing Centroid,Falling Centroid,Average Standard Deviation,Average Correlation
AAPL,Centroid 6,9.673,Centroid 6 (+9.67%),Centroid 7 (+0.96%),0.111,0.258
ABBV,Centroid 3,8.743,Centroid 3 (+8.74%),Centroid 7 (-0.54%),0.198,0.400
ABT,Centroid 2,7.717,Centroid 1 (+8.24%),Centroid 5 (+2.29%),0.151,0.344
ACN,Centroid 6,4.945,Centroid 6 (+4.94%),Centroid 0 (+0.03%),0.241,0.533
ADBE,Centroid 2,7.823,Centroid 2 (+7.82%),Centroid 4 (+1.66%),0.231,0.527
...,...,...,...,...,...,...
V,Centroid 0,8.838,Centroid 1 (+9.14%),Centroid 4 (+4.06%),0.202,0.548
VZ,Centroid 6,2.369,Centroid 5 (+2.60%),Centroid 7 (-5.01%),0.149,0.312
WFC,Centroid 0,1.609,Centroid 2 (+4.89%),Centroid 1 (+0.18%),0.064,0.052
WMT,Centroid 5,7.400,Centroid 5 (+7.40%),Centroid 1 (+1.78%),0.145,0.360


#### Visualization
Each colored line represents a ticker's similarity to one cluster centroid, where the centroid is the average pattern of all time series assigned to that cluster. The scores are similarities (based on cross-correlation), not distances. Selecting features shows you the different shapes over time.
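
For intuition about a cross-correlation-based similarity score, the sketch below computes the maximum normalized cross-correlation between two z-normalized series of equal length. The exact formula sovai uses is not stated here, so treat `ncc_similarity` as a hypothetical illustration of the general idea rather than the library's method.

```python
import numpy as np

def ncc_similarity(x, y):
    """Maximum normalized cross-correlation over all lags (1.0 = same shape)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # z-normalize so the score depends on shape, not level or scale.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    # Cross-correlation at every lag, scaled so identical series score 1.
    cc = np.correlate(x, y, mode="full") / len(x)
    return float(cc.max())

t = np.linspace(0, 4 * np.pi, 200)
wave = np.sin(t)
shifted = np.sin(t + 0.5)  # same shape, slightly shifted in time

same = ncc_similarity(wave, wave)
close = ncc_similarity(wave, shifted)
print(round(same, 3), round(close, 3))
```

Taking the maximum over lags means two series with the same shape but a small time offset still score as similar.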

In [16]:
df_mega.cluster("line_plot")

We will remove 0 tickers for which sufficient data does not exist, they are probably delisted


In [17]:
df_mega.cluster("scatter_plot")

In [18]:
df_mega.cluster("animation_plot")

We will remove 0 tickers for which sufficient data does not exist, they are probably delisted
