# t-SNE Visualization and Exploration

The first unsupervised technique I will be exploring is t-SNE. This is only for data exploration and visualization and will not be used for any of our clustering analysis. The main purpose of this will be to see whether or not we can find structure in our data and to potentially give us an idea of the number of clusters we may be looking for.

Benefits of t-SNE:  
- Captures local structure: Points close in high-dimensional space tend to stay close in 2D space  
- Often reveals well-separated clusters visually, even when PCA does not  
- Nonlinear: Can uncover manifold structures (PCA cannot)  
- Useful for conveying to readers that the engineered features separate entities meaningfully.  


Drawbacks:  
- Can be computationally expensive on large datasets  
- Sensitive to hyperparameters (perplexity, learning rate). Different runs can give different embeddings  
- Distances and axes are not interpretable. Only relative neighborhood structure matters  
- Cannot be used for out-of-sample embeddings without retraining  
- Global geometry is distorted (clusters can appear farther apart or closer together than they actually are)
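
To make two of these drawbacks concrete, here is a minimal sketch on synthetic data (random Gaussian points, not our dataset; `X_demo`, `emb_a`, and `emb_b` are illustrative names). It fits t-SNE twice with different seeds and checks whether scikit-learn's `TSNE` exposes a `transform` method for new points.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in data (an assumption; not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))

# Same data and hyperparameters, different seeds, random initialization
emb_a = TSNE(n_components=2, perplexity=10, init="random",
             random_state=0).fit_transform(X_demo)
emb_b = TSNE(n_components=2, perplexity=10, init="random",
             random_state=1).fit_transform(X_demo)

# Do the two runs produce different coordinates?
seeds_differ = not np.allclose(emb_a, emb_b)

# Is there a way to project unseen points without refitting?
has_transform = hasattr(TSNE, "transform")

print(f"Embeddings differ across seeds: {seeds_differ}")
print(f"TSNE has a transform method: {has_transform}")
```

With random initialization the two embeddings generally differ, and `TSNE` only offers `fit` and `fit_transform`, which is why a fixed `random_state` is used throughout this notebook.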

PCA assumes the data varies mostly along linear axes, whereas MDS assumes the important structure in the reduced data is nonlinear (curved).

Notes from Slack Channel Conversation:
'Just to summarize what Pete and I talked about as well, we essentially have two components to our unsupervised part:
Feature Engineering/reduction (2D space):
1. PCA used for feature engineering (2D features) and biplots 
2. t-SNE for visualization of local clustering of high dimensional space for our readers
Clustering for financial health of companies (and potentially feature engineering of a cluster category):
1. PCA to reduce dimensional space based on a SCREE plot (for reducing the computational load of the clustering: usually down to 20-50 features).
2. Cluster on the reduced dataset k-means/DBSCAN
3. Evaluation of Clustering techniques
Pete, did I capture this correctly?'
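
The clustering half of that plan is not part of this notebook, but as a rough sketch of what the three steps could look like, here is a version on synthetic blobs. The data, the 90% variance cutoff, and `n_clusters=4` are all assumptions for illustration, not choices we have made for the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide feature matrix (assumption)
X_demo, _ = make_blobs(n_samples=300, n_features=40, centers=4, random_state=6)
X_scaled = StandardScaler().fit_transform(X_demo)

# Step 1: PCA, keeping enough components to explain 90% of the variance.
# The cumulative curve below is what a scree plot would display.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Step 2: cluster on the reduced data
labels = KMeans(n_clusters=4, n_init=10, random_state=6).fit_predict(X_reduced)

# Step 3: evaluate the clustering
score = silhouette_score(X_reduced, labels)

print(f"Components kept: {X_reduced.shape[1]}")
print(f"Cumulative variance explained: {cum_var[-1]:.3f}")
print(f"Silhouette score: {score:.3f}")
```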

As a kind of summary:

PCA
- Linear method.
- Preserves directions of maximum variance.
- Fast, deterministic, scalable.
- Works best when data structure is close to linear.
- Output axes are interpretable (linear combinations of original features) BIPLOTS.

Manifold Learning (MDS and t-SNE)
- Nonlinear (classical MDS reduces to PCA if using Euclidean distances so we should try other distance metrics).
- Preserves pairwise distances (or dissimilarities).
- Can better capture nonlinear relationships if the distance measure is well chosen (think the surface of a sphere is actually 2D in 3D space just curved).
- More computationally expensive.
- Axes have no interpretability, only relative positions matter.

Comparison value
- If PCA and t-SNE/MDS give similar 2D maps, we gain confidence that the structure is not an artifact of the method
- If they are different, that tells us the data may have nonlinear structure (manifold-like), which PCA cannot capture
- Showing both in the report strengthens our narrative: PCA gives interpretable axes for clustering and variance analysis whereas MDS gives a geometry-preserving view.
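
One hedged way to put a number on this comparison is scikit-learn's `trustworthiness` score, which measures how well local neighborhoods survive the projection (values closer to 1 are better). The sketch below uses a synthetic swiss roll as a stand-in for nonlinear data; `n_neighbors=10` is an arbitrary choice, not something tuned for our dataset.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# Synthetic curved manifold in 3D (assumption; not the notebook's dataset)
X_demo, _ = make_swiss_roll(n_samples=300, random_state=6)

# 2D maps from a linear and a nonlinear method
emb_pca = PCA(n_components=2).fit_transform(X_demo)
emb_tsne = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=6).fit_transform(X_demo)

# How well does each 2D map preserve local neighborhoods?
t_pca = trustworthiness(X_demo, emb_pca, n_neighbors=10)
t_tsne = trustworthiness(X_demo, emb_tsne, n_neighbors=10)

print(f"PCA trustworthiness:   {t_pca:.3f}")
print(f"t-SNE trustworthiness: {t_tsne:.3f}")
```

A large gap between the two scores would be one sign of nonlinear structure that PCA is missing.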

# Import libraries
Alright, let's start importing our libraries that we will use to analyze and work with the data.

In [1]:
# Data Manipulation libraries
import numpy as np
import pandas as pd

# Unsupervised Learning Methods 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, QuantileTransformer
from sklearn.manifold import TSNE

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt

# Let's set our Random State here as well
random_state = 6

Now, let's bring in the datafile that we will be working with.

In [2]:
path = '../../datasets/X_train_filled_KPIs_QoQ.csv'
df = pd.read_csv(path)
print(df.shape)
df.head()

(1905, 284)


Unnamed: 0.1,Unnamed: 0,Ticker,Name,Sector,CapitalExpenditure_2024Q2,CapitalExpenditure_2024Q3,CapitalExpenditure_2024Q4,CapitalExpenditure_2025Q1,CashAndSTInvestments_2024Q2,CashAndSTInvestments_2024Q3,...,KPI_CurrentRatio_Rate,EPS_Rate,CurrentAssets_Rate,InterestExpense_Rate,CashFromOps_Rate,TotalAssets_Rate,TotalLiabilities_Rate,OperatingIncome_Rate,TotalDebt_Rate,KPI_NetProfitMargin_Rate
0,1602,VRNT,VERINT SYSTEMS INC,Information Technology,-3981000.0,-7440000.0,-7660000.0,-6337000.0,185460000.0,207845000.0,...,-0.441123,0.112,-16089800.0,-454200.0,13855500.0,250803500.0,111418500.0,10844400.0,78576850.0,0.02707
1,590,CCK,CROWN HOLDINGS INC,Materials,-84000000.0,-76000000.0,-149000000.0,-33000000.0,1414000000.0,1738000000.0,...,-0.053039,0.509,-303500000.0,-5000000.0,-155200000.0,-429800000.0,-506600000.0,-13100000.0,-426000000.0,0.020909
2,1846,NWPX,NWPX INFRASTRUCTURE INC,Industrials,-6064000.0,-5975000.0,-4190000.0,-3670000.0,4528000.0,5723000.0,...,0.317033,-0.143,-18326000.0,-410500.0,-3889800.0,-17607000.0,-26953200.0,-2909100.0,-16620800.0,-0.00916
3,694,SSD,SIMPSON MANUFACTURING INC,Industrials,-40210000.0,-45226000.0,-55509000.0,-50165000.0,354851000.0,339427000.0,...,-0.189649,-0.228,-53893800.0,5978650.0,-29823900.0,-1250600.0,-27934700.0,-14201300.0,-28682800.0,-0.011017
4,834,PRMB,PRIMO BRANDS CLASS A CORP,Consumer Staples,-47300000.0,-41300000.0,-57600000.0,-69500000.0,15500000.0,174800000.0,...,0.017002,-0.024851,236170000.0,-1020000.0,-35900000.0,2347870000.0,890330000.0,-1100000.0,530480000.0,-0.022474


## Dataset Separation
Alright, I think one of the first things we should do is identify three different datasets that we want to work with.  
1. Full Dataset (minus columns like Ticker)  
2. Raw Data Dataset (What would it look like if we just used the raw financial data)  
3. Engineered Dataset (Do we get better structure when we look at just the engineered features)

We can easily just split these into subdatasets if we pull out the relevant columns. So let's look at all of the columns first so that we can start creating the proper datasets.

In [3]:
complete_dataset = df.copy()
columns = complete_dataset.columns.tolist()
for column in sorted(columns):
    print(column)

CapitalExpenditure_2024Q2
CapitalExpenditure_2024Q3
CapitalExpenditure_2024Q4
CapitalExpenditure_2025Q1
CapitalExpenditure_QoQ_24Q2_24Q3
CapitalExpenditure_QoQ_24Q3_24Q4
CapitalExpenditure_QoQ_24Q4_25Q1
CapitalExpenditure_QoQ_Rate
CapitalExpenditure_Rate
CashAndSTInvestments_2024Q2
CashAndSTInvestments_2024Q3
CashAndSTInvestments_2024Q4
CashAndSTInvestments_2025Q1
CashAndSTInvestments_QoQ_24Q2_24Q3
CashAndSTInvestments_QoQ_24Q3_24Q4
CashAndSTInvestments_QoQ_24Q4_25Q1
CashAndSTInvestments_QoQ_Rate
CashAndSTInvestments_Rate
CashFromOps_2024Q2
CashFromOps_2024Q3
CashFromOps_2024Q4
CashFromOps_2025Q1
CashFromOps_QoQ_24Q2_24Q3
CashFromOps_QoQ_24Q3_24Q4
CashFromOps_QoQ_24Q4_25Q1
CashFromOps_QoQ_Rate
CashFromOps_Rate
CostOfRevenue_2024Q2
CostOfRevenue_2024Q3
CostOfRevenue_2024Q4
CostOfRevenue_2025Q1
CostOfRevenue_QoQ_24Q2_24Q3
CostOfRevenue_QoQ_24Q3_24Q4
CostOfRevenue_QoQ_24Q4_25Q1
CostOfRevenue_QoQ_Rate
CostOfRevenue_Rate
CurrentAssets_2024Q2
CurrentAssets_2024Q3
CurrentAssets_2024Q4
Current

Alright, let's start by identifying which columns to drop because they are unnecessary for the unsupervised learning part. This should be relatively few columns.
- Unnamed: 0 (appears to be a leftover index column)
- Ticker
- Name  


In [4]:
complete_dataset = complete_dataset.drop(columns=['Unnamed: 0','Ticker','Name'])
print(complete_dataset.shape)

(1905, 281)


Great. Now we can loop through all of the columns and treat a column as feature engineered if its name contains 'KPI', 'QoQ', or 'Rate'. We can then inspect both lists to make sure they make sense.

In [5]:
raw_columns = []
engineered_columns = []
for column in complete_dataset.columns:
    if ('KPI' not in column) and ('QoQ' not in column) and ('Rate' not in column):
        raw_columns.append(column)
    else:
        engineered_columns.append(column)
print(f'Raw Columns: {len(raw_columns)}')
print(f'Engineered Columns: {len(engineered_columns)}')

Raw Columns: 102
Engineered Columns: 179
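
The same split can be written with list comprehensions. Here is a small sketch on a toy set of column names (the name `EPS_QoQ_24Q2_24Q3` is assumed from the naming pattern above; `markers`, `engineered`, and `raw` are illustrative names).

```python
import pandas as pd

# Toy frame with only column names (no rows needed for this check)
toy = pd.DataFrame(columns=[
    "Sector",
    "EPS_2024Q2",
    "EPS_QoQ_24Q2_24Q3",
    "EPS_Rate",
    "KPI_CurrentRatio_Rate",
])

# A column counts as engineered if any marker appears in its name
markers = ("KPI", "QoQ", "Rate")
engineered = [c for c in toy.columns if any(m in c for m in markers)]
raw = [c for c in toy.columns if c not in engineered]

print(raw)         # ['Sector', 'EPS_2024Q2']
print(engineered)  # ['EPS_QoQ_24Q2_24Q3', 'EPS_Rate', 'KPI_CurrentRatio_Rate']
```

Because this is a plain substring match, a raw column whose name happened to contain "Rate" would be classified as engineered, which is why it is worth reviewing the resulting lists by eye.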


In [6]:
for column in raw_columns:
    print(column)

Sector
CapitalExpenditure_2024Q2
CapitalExpenditure_2024Q3
CapitalExpenditure_2024Q4
CapitalExpenditure_2025Q1
CashAndSTInvestments_2024Q2
CashAndSTInvestments_2024Q3
CashAndSTInvestments_2024Q4
CashAndSTInvestments_2025Q1
CashFromOps_2024Q2
CashFromOps_2024Q3
CashFromOps_2024Q4
CashFromOps_2025Q1
CostOfRevenue_2024Q2
CostOfRevenue_2024Q3
CostOfRevenue_2024Q4
CostOfRevenue_2025Q1
CurrentAssets_2024Q2
CurrentAssets_2024Q3
CurrentAssets_2024Q4
CurrentAssets_2025Q1
CurrentLiabilities_2024Q2
CurrentLiabilities_2024Q3
CurrentLiabilities_2024Q4
CurrentLiabilities_2025Q1
EPS_2024Q2
EPS_2024Q3
EPS_2024Q4
EPS_2025Q1
Exchange
IncomeTaxExpense_2024Q2
IncomeTaxExpense_2024Q3
IncomeTaxExpense_2024Q4
IncomeTaxExpense_2025Q1
InterestExpense_2024Q2
InterestExpense_2024Q3
InterestExpense_2024Q4
InterestExpense_2025Q1
Location
LongTermDebt_2024Q2
LongTermDebt_2024Q3
LongTermDebt_2024Q4
LongTermDebt_2025Q1
Market Value
NetIncome_2024Q2
NetIncome_2024Q3
NetIncome_2024Q4
NetIncome_2025Q1
OperatingIncome_20

Now, there are some raw columns that we want to add back to the engineered columns because they capture important characteristics of a company, so let's list these here.
- Sector  
- Exchange
- Location  
- Market Cap
- Market Value

So let's append those

In [7]:
add_back = ['Sector','Exchange','Location','Market Value','Market Cap']
engineered_columns = engineered_columns + add_back
print(f'Engineered Columns after adding back important raw columns: {len(engineered_columns)}')

Engineered Columns after adding back important raw columns: 184


Alright, now that we have these, we can build our individual dataframes for the t-SNE unsupervised learning and visualizations.

In [8]:
raw_data = complete_dataset[raw_columns]
engineered_data = complete_dataset[engineered_columns]

print(f'Full Dataset Shape: {complete_dataset.shape}')
print(f'Raw Data Shape: {raw_data.shape}')
print(f'Engineered Data Shape: {engineered_data.shape}')

Full Dataset Shape: (1905, 281)
Raw Data Shape: (1905, 102)
Engineered Data Shape: (1905, 184)


## Preprocessing and t-SNE Unsupervised Learning
Alright, now that we are ready to do some unsupervised learning, let's proceed in a hierarchical fashion: first the raw data, then the complete dataset, then just the engineered features. This will allow us to evaluate the clustering of the dataset as it was developed and see if we have improved the local structure in the dataset with our engineered features.

### Raw Fundamental Data

In [9]:
# First thing we need to do is pull out categorical columns and numerical columns for our preprocessing pipeline
X_raw = raw_data.copy()
cat_cols = ['Sector', 'Exchange', 'Market Cap']
num_cols = [c for c in X_raw.columns if c not in cat_cols]

def t_sne_pipeline(dataset, 
                   cat_cols = None, 
                   num_cols = None, 
                   scaler = QuantileTransformer(), 
                   metric='cosine',
                   random_state=random_state):
    
    #Let's define the length and pick a safe perplexity 
    n = len(dataset)
    perp = int(min(30, max(5, (n - 1) // 3)))
    
    # Let's address the preprocessing of our data columns
    preproc = ColumnTransformer(
        transformers=[
            ("num", Pipeline([
                ("scaler", scaler)
            ]), num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ],
    )
    
    # Now, lets create our pipeline
    pipe = Pipeline([
        ("prep", preproc),
        ("tsne", TSNE(
            n_components=2,
            perplexity=perp,
            learning_rate="auto",
            init="pca",
            metric=metric,
            random_state=random_state
        ))
    ])

    tsne_data = pipe.fit_transform(dataset)
    
    return tsne_data, pipe

X_raw_tsne, pipe = t_sne_pipeline(X_raw,cat_cols, num_cols)
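
To see what the `prep` step hands to t-SNE, here is a minimal sketch on a six-row toy frame. The values are made up, and `n_quantiles` and `sparse_threshold=0` are set only so the toy example stays quiet and returns a dense array; neither setting is used in the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

# Made-up values; only the column names echo the real dataset
toy = pd.DataFrame({
    "EPS_Rate": [0.1, -0.2, 0.5, 3.0, -1.5, 0.0],
    "TotalDebt_Rate": [1e6, -5e7, 2e8, 0.0, 3e5, -1e4],
    "Sector": ["Materials", "Industrials", "Industrials",
               "Materials", "Consumer Staples", "Materials"],
})

prep = ColumnTransformer(
    transformers=[
        # Rank-based scaling: each numeric column is mapped onto [0, 1]
        ("num", QuantileTransformer(n_quantiles=len(toy)),
         ["EPS_Rate", "TotalDebt_Rate"]),
        # One indicator column per category level
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sector"]),
    ],
    sparse_threshold=0,  # always return a dense array (toy-only setting)
)

features = prep.fit_transform(toy)
print(features.shape)  # (6, 5): 2 scaled numeric + 3 one-hot columns
print(np.round(features, 2))
```

Because the quantile transform is rank-based, the very different magnitudes of the two numeric columns end up on the same 0 to 1 scale before distances are computed.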

Alright, that should have run our raw data through the pipeline and output an array that we would expect to be (1905 x 2). Let's check that and then we can plot the data to see what we have.

In [10]:
print(f"The tsne output dataframe has dimensions of {X_raw_tsne.shape}\n")

def plot_tsne(dataset, tsne_data, title = 't-SNE'):
    plot_df = pd.DataFrame({
        "tsne1": tsne_data[:, 0],
        "tsne2": tsne_data[:, 1],
        # optional metadata for color/tooltip:
        "Sector": dataset["Sector"].values,
        "Market Cap": dataset['Market Cap'].values,
    })
    
    market_cap_order = ['Nano-Cap','Mirco-Cap','Small-Cap','Mid-Cap','Large-Cap','Macro-cap']
    
    chart = (
        alt.Chart(plot_df, title = alt.TitleParams(text=title,fontSize=20,offset=35))
        .mark_circle(size=28, opacity=0.3)
        .encode(
            x=alt.X("tsne1:Q", title = '', axis = None),
            y=alt.Y("tsne2:Q", title = '', axis = None),
            color=alt.Color("Sector:N"), 
            size = alt.Size("Market Cap:N", scale=alt.Scale(domain=market_cap_order),sort=market_cap_order), 
            tooltip=["Sector:N", "Market Cap:N"]
        ).properties(width = 600, height = 600)
    )
    
    return chart

raw_chart = plot_tsne(X_raw, X_raw_tsne, title = 't-SNE of Fundamental Financial Data')
raw_chart.properties(padding = 20,width = 600, height = 600
        ).configure_view(stroke=None).interactive()

The tsne output dataframe has dimensions of (1905, 2)



### Entire Dataset

Now, let's do the exact same analysis on the entire dataset.

In [11]:
# First thing we need to do is pull out categorical columns and numerical columns for our preprocessing pipeline
X_complete = complete_dataset.copy()
cat_cols = ['Sector', 'Exchange', 'Market Cap']
num_cols = [c for c in X_complete.columns if c not in cat_cols]

# Lets use our function to run t-SNE on the complete dataset
X_complete_tsne, pipe = t_sne_pipeline(X_complete,cat_cols, num_cols)

Once again, the output should have the same shape (1905, 2).

In [12]:
print(f"The tsne output dataframe has dimensions of {X_complete_tsne.shape}\n")

complete_chart = plot_tsne(X_complete, X_complete_tsne, title = 't-SNE of Complete Financial Dataset')
complete_chart.properties(padding = 20,width = 600, height = 600
        ).configure_view(stroke=None).interactive()

The tsne output dataframe has dimensions of (1905, 2)



### Engineered Features
With the complete dataset, we actually seem to lose a lot of the local structure: we can see some groupings, but far fewer identifiable clusters. We may need to tune the hyperparameters a bit more. Next, let's look at the engineered features on their own.

In [13]:
# First thing we need to do is pull out categorical columns and numerical columns for our preprocessing pipeline
X_eng = engineered_data.copy()
cat_cols = ['Sector', 'Exchange', 'Market Cap']
num_cols = [c for c in X_eng.columns if c not in cat_cols]

X_eng_tsne, pipe = t_sne_pipeline(X_eng,cat_cols,num_cols)

Once again, we should have the expected shape of (1905, 2).

In [14]:
print(f"The tsne output dataframe has dimensions of {X_eng_tsne.shape}\n")

cos_eng_chart = plot_tsne(X_eng, X_eng_tsne, title = 't-SNE of Engineered Feature Data')
cos_eng_chart.properties(padding = 20,width = 600, height = 600
    ).configure_view(stroke=None).interactive()

The tsne output dataframe has dimensions of (1905, 2)



## Distance metrics
So far we have used cosine distance (the default in `t_sne_pipeline`). Let's see what it would look like if we used Euclidean distance instead.

In [15]:
# First thing we need to do is pull out categorical columns and numerical columns for our preprocessing pipeline
X_eng = engineered_data.copy()
cat_cols = ['Sector', 'Exchange', 'Market Cap']
num_cols = [c for c in X_eng.columns if c not in cat_cols]

X_eng_euc_tsne, pipe = t_sne_pipeline(X_eng,cat_cols,num_cols,metric='euclidean')

In [16]:
print(f"The tsne output dataframe has dimensions of {X_eng_euc_tsne.shape}\n")

euc_eng_chart = plot_tsne(X_eng, X_eng_euc_tsne, title = 't-SNE of Engineered Feature Data (Euclidean Distance)')
euc_eng_chart.properties(padding = 20,width = 600, height = 600
    ).configure_view(stroke=None).interactive()

The tsne output dataframe has dimensions of (1905, 2)



It appears as though we have better separation and more clustering when we use cosine distance.
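
As a small illustration of how the two metrics differ, the sketch below compares three made-up vectors: `small`, `large` (the same direction at 100x the magnitude), and `other` (a different direction). These are not rows from our data, and after quantile scaling the effect on real features will be subtler than in this toy case.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Made-up vectors: same profile at two scales, plus a different profile
small = np.array([[1.0, 2.0, 3.0]])
large = 100 * small
other = np.array([[3.0, 1.0, -2.0]])
X_demo = np.vstack([small, large, other])

cos = cosine_distances(X_demo)
euc = euclidean_distances(X_demo)

# Cosine only looks at direction, so small and large are (nearly) identical
print(f"cosine    small-large: {cos[0, 1]:.3f}   small-other: {cos[0, 2]:.3f}")
# Euclidean also looks at magnitude, so small sits closer to other
print(f"euclidean small-large: {euc[0, 1]:.1f}   small-other: {euc[0, 2]:.1f}")
```

Under cosine distance, `small` is closest to `large`; under Euclidean distance, it is closest to `other`. That difference in neighborhoods is what changes the t-SNE picture.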

## Perplexity

Let's see if we can further improve this with different perplexities.
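
For reference, here is what the default heuristic inside `t_sne_pipeline` evaluates to for a few dataset sizes. The helper name `safe_perplexity` is hypothetical; the formula is copied from the pipeline function.

```python
def safe_perplexity(n_samples: int) -> int:
    """Heuristic from t_sne_pipeline: (n - 1) // 3, clamped to the range [5, 30]."""
    return int(min(30, max(5, (n_samples - 1) // 3)))

# Evaluate the heuristic for a tiny, a small, and our actual dataset size
examples = {n: safe_perplexity(n) for n in (10, 50, 1905)}
print(examples)  # {10: 5, 50: 16, 1905: 30}
```

For our 1,905 rows the heuristic always lands on the cap of 30, so every run so far has used perplexity 30. scikit-learn requires the perplexity to be smaller than the number of samples, which all of the values tried next satisfy.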

In [17]:
# First thing we need to do is pull out categorical columns and numerical columns for our preprocessing pipeline
X_eng = engineered_data.copy()
cat_cols = ['Sector', 'Exchange', 'Market Cap']
num_cols = [c for c in X_eng.columns if c not in cat_cols]

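# Number of rows, used by the default perplexity heuristic
n = len(X_eng)
# Grid of perplexity values to plot. The first entry is the heuristic default,
# which evaluates to 30 here, so it repeats the explicit 30 later in the list.
perps = [int(min(30, max(5, (n - 1) // 3))), 10,20,30,40,50,60,70,80]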

# Create a dict of our datasets so we can plot them
perp_models = {}

for idx, perp in enumerate(perps):
    # Let's address the preprocessing of our data columns
    preproc = ColumnTransformer(
        transformers=[
            ("num", Pipeline([
                ("scaler", StandardScaler())
            ]), num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ],
        #sparse_threshold=1.0
    )
    
    # Now, lets create our pipeline
    pipe = Pipeline([
        ("prep", preproc),
        ("tsne", TSNE(
            n_components=2,
            perplexity=perp,
            learning_rate="auto",
            init="pca",
            metric="cosine",
            random_state=random_state
        ))
    ])

    X_eng_tsne = pipe.fit_transform(X_eng)
    perp_models[idx]=X_eng_tsne    
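
A possible variation on this loop, sketched on synthetic data, keys the results by perplexity value instead of list position and also records the final KL divergence that scikit-learn stores in `kl_divergence_`. The data and the three perplexities here are placeholders, and KL values at different perplexities are not strictly comparable, so treat them as a rough diagnostic only.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in data (assumption; not the engineered feature matrix)
rng = np.random.default_rng(6)
X_demo = rng.normal(size=(120, 8))

embeddings = {}  # perplexity -> (n_samples, 2) array
kl = {}          # perplexity -> final KL divergence of the fit

for perp in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perp, init="pca", random_state=6)
    embeddings[perp] = tsne.fit_transform(X_demo)
    kl[perp] = float(tsne.kl_divergence_)

for perp, value in kl.items():
    print(f"perplexity={perp:>2}  shape={embeddings[perp].shape}  KL={value:.3f}")
```

Keying by perplexity also avoids storing the same setting twice when a value appears more than once in the grid.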
    

In [18]:
def plot_cos_chart(X_eng_tsne):
    plot_df = pd.DataFrame({
        "tsne1": X_eng_tsne[:, 0],
        "tsne2": X_eng_tsne[:, 1],
        # optional metadata for color/tooltip:
        "Sector": X_eng["Sector"].values,
        "Market Cap": X_eng['Market Cap'].values,
    })
    
    market_cap_order = ['Nano-Cap','Mirco-Cap','Small-Cap','Mid-Cap','Large-Cap','Macro-cap']
    
    cos_eng_chart = (
        alt.Chart(plot_df, title = alt.TitleParams(text='t-SNE of Engineered Feature Data',fontSize=20,offset=35))
        .mark_circle(size=28, opacity=0.3)
        .encode(
            x=alt.X("tsne1:Q", title = '', axis = None),
            y=alt.Y("tsne2:Q", title = '', axis = None),
            color=alt.Color("Sector:N"), 
            size = alt.Size("Market Cap:N", scale=alt.Scale(domain=market_cap_order),sort=market_cap_order), 
            tooltip=["Sector:N", "Market Cap:N"]
        )
        .properties(padding = 20, width = 600, height = 600
        )
        .configure_view(stroke=None)
        .interactive()
    )
    
    return cos_eng_chart

In [19]:
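# Sanity check: perp_models[0] uses the same perplexity as the earlier engineered-feature chart.
# Note that the loop above scales with StandardScaler rather than QuantileTransformer,
# so the embedding may not match that chart exactly.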
plot_cos_chart(perp_models[0])

In [20]:
plot_cos_chart(perp_models[1])

In [21]:
plot_cos_chart(perp_models[2])

In [22]:
plot_cos_chart(perp_models[3])

In [23]:
plot_cos_chart(perp_models[4])

In [24]:
plot_cos_chart(perp_models[5])

In [25]:
plot_cos_chart(perp_models[6])

In [26]:
plot_cos_chart(perp_models[7])

In [27]:
plot_cos_chart(perp_models[8])

In [29]:
complete_figure = raw_chart | cos_eng_chart | complete_chart
complete_figure.properties(padding = 10).configure_view(stroke=None)