# Supervised Learning Models  
The goal of this notebook is to create a machine learning pipeline that tests multiple baseline (no hyperparameter tuning) supervised learning models. This will let us identify the best baseline model to tune for maximum performance. We want to set this up so that it can be applied to many different subsets of our feature space, to see what works best and to reduce model complexity while maintaining forecasting performance on out-of-sample data.

## Import Libraries
There are a lot of different baseline models to import here. The goal is to produce a pipeline that runs the dataset through all of these models and outputs a box-and-whisker plot of the RMSE for each model.

In [1]:
# Data Manipulation libraries
import numpy as np
import pandas as pd

# Sci-Kit Learn Processing and Evaluating
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import root_mean_squared_error
from sklearn.feature_selection import SelectKBest, chi2, f_regression

# Supervised Learning Models  
from sklearn.linear_model import LinearRegression 
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso  
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor  
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.svm import SVR  
from sklearn.ensemble import RandomForestRegressor  
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt

# Let's set our Random State here as well
random_state = 6
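
As a rough illustration of the comparison described above, the sketch below cross-validates a few baseline models and collects the per-fold RMSE in one table. The helper name `compare_baselines`, the choice of 5 folds, and the synthetic data are assumptions made for this example only; they are not the notebook's actual pipeline or dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

random_state = 6


def compare_baselines(models, X, y, n_splits=5):
    """Return a DataFrame of per-fold RMSE with one column per model (hypothetical helper)."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    results = {}
    for name, model in models.items():
        # scikit-learn scorers are "higher is better", so RMSE comes back negated
        neg_rmse = cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        results[name] = -neg_rmse
    return pd.DataFrame(results)


# Synthetic stand-in data (NOT the notebook's dataset): a noisy linear target
rng = np.random.default_rng(random_state)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

demo_models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "DecisionTree": DecisionTreeRegressor(random_state=random_state),
}
rmse_by_model = compare_baselines(demo_models, X_demo, y_demo)
print(rmse_by_model.describe().loc[["mean", "std"]])
```

Because each column holds the fold-level RMSE for one model, calling `rmse_by_model.boxplot()` would give the box-and-whisker comparison directly.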

## Load Data Files
Now we need to load both the X_data files and the y_data files for comparisons.

In [17]:
root_path = '../../datasets/'
X_train_file = 'X_train_filled_KPIs_QoQ.csv'
y_train_file = 'y_train.csv'
X = pd.read_csv(root_path+X_train_file)
y = pd.read_csv(root_path+y_train_file)
print(X.shape, y.shape)
print(X.tail(),y.tail())


(1905, 284) (1968, 19)
      Unnamed: 0 Ticker  ... TotalDebt_Rate KPI_NetProfitMargin_Rate
1900        1644   HOPE  ...     -8915800.0                -0.010423
1901        1704   CLOV  ...    114188450.0                -0.010855
1902         674   FYBR  ...    117700000.0                 0.009983
1903        1599   AAMI  ...     12190000.0                 0.031616
1904         856   BOOT  ...     24055700.0                 0.002681

[5 rows x 284 columns]       Unnamed: 0  ...  TotalLiabilities_2025Q2
1963        1644  ...             1.632290e+10
1964        1704  ...             2.308080e+08
1965         674  ...             1.650400e+10
1966        1599  ...             5.854000e+08
1967         856  ...             9.224200e+08

[5 rows x 19 columns]


Alright, the first thing I notice is that the two files have different numbers of rows (1905 in X versus 1968 in y). That means we dropped rows in our X split but didn't drop them in our y. So let's address this first.

In [18]:
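# Keep only the y rows whose ID also appears in X
in_train = set(X['Unnamed: 0'])
y = y[y['Unnamed: 0'].isin(in_train)].copy()
# Reset both indexes so they run from 0 to n-1 again
X.reset_index(drop=True,inplace=True)
y.reset_index(drop=True,inplace=True)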
print(len(in_train))
print(y.shape)
print(X.tail(),y.tail())

1905
(1905, 19)
      Unnamed: 0 Ticker  ... TotalDebt_Rate KPI_NetProfitMargin_Rate
1900        1644   HOPE  ...     -8915800.0                -0.010423
1901        1704   CLOV  ...    114188450.0                -0.010855
1902         674   FYBR  ...    117700000.0                 0.009983
1903        1599   AAMI  ...     12190000.0                 0.031616
1904         856   BOOT  ...     24055700.0                 0.002681

[5 rows x 284 columns]       Unnamed: 0  ...  TotalLiabilities_2025Q2
1900        1644  ...             1.632290e+10
1901        1704  ...             2.308080e+08
1902         674  ...             1.650400e+10
1903        1599  ...             5.854000e+08
1904         856  ...             9.224200e+08

[5 rows x 19 columns]
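
The `isin` filter above keeps the matching rows but leaves them in y's original order, so it relies on X and y already being sorted the same way (the tails printed above suggest they are). A minimal sketch of an order-safe alternative is shown below, assuming `Unnamed: 0` is unique in both frames; the helper name `align_targets` and the toy frames are hypothetical.

```python
import pandas as pd


def align_targets(X, y, key="Unnamed: 0"):
    """Return the rows of y that match X on `key`, in X's row order (hypothetical helper)."""
    # A left merge preserves the order of the left frame's keys;
    # validate="one_to_one" raises if the key is duplicated on either side.
    return X[[key]].merge(y, on=key, how="left", validate="one_to_one")


# Toy frames (NOT the notebook's data): y has an extra row and a different order
X_demo = pd.DataFrame({"Unnamed: 0": [30, 10], "feat": [1.0, 2.0]})
y_demo = pd.DataFrame({"Unnamed: 0": [10, 20, 30], "target": [100.0, 200.0, 300.0]})

y_aligned = align_targets(X_demo, y_demo)
print(y_aligned)
```

With `how="left"`, an ID that is present in X but missing from y would show up as a row of NaN targets rather than being dropped silently, which makes such gaps easy to spot.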


Great. Now, similar to our unsupervised notebook, we need to do a little data cleaning and manipulation so that we have usable X and y data. Let's start by dropping the identifier columns that will not help us in identifying trends (Ticker, Name).

## Dataset Separation
Alright, I think one of the first things we should do is identify the different datasets that we want to work with.  
1. Full Dataset (minus columns like Ticker)  
2. Raw Data Dataset (What would it look like if we just used the raw financial data)  
3. Engineered Dataset (Do we get better structure when we look at just the engineered features)  
4. KPIs and PCA Dataset (Engineered data and data reduction dataset; this may end up being 2)

We can easily just split these into subdatasets if we pull out the relevant columns. So let's look at all of the columns first so that we can start creating the proper datasets.

In [20]:
complete_dataset = X.copy()
columns = complete_dataset.columns.tolist()
for column in sorted(columns):
    print(column)

CapitalExpenditure_2024Q2
CapitalExpenditure_2024Q3
CapitalExpenditure_2024Q4
CapitalExpenditure_2025Q1
CapitalExpenditure_QoQ_24Q2_24Q3
CapitalExpenditure_QoQ_24Q3_24Q4
CapitalExpenditure_QoQ_24Q4_25Q1
CapitalExpenditure_QoQ_Rate
CapitalExpenditure_Rate
CashAndSTInvestments_2024Q2
CashAndSTInvestments_2024Q3
CashAndSTInvestments_2024Q4
CashAndSTInvestments_2025Q1
CashAndSTInvestments_QoQ_24Q2_24Q3
CashAndSTInvestments_QoQ_24Q3_24Q4
CashAndSTInvestments_QoQ_24Q4_25Q1
CashAndSTInvestments_QoQ_Rate
CashAndSTInvestments_Rate
CashFromOps_2024Q2
CashFromOps_2024Q3
CashFromOps_2024Q4
CashFromOps_2025Q1
CashFromOps_QoQ_24Q2_24Q3
CashFromOps_QoQ_24Q3_24Q4
CashFromOps_QoQ_24Q4_25Q1
CashFromOps_QoQ_Rate
CashFromOps_Rate
CostOfRevenue_2024Q2
CostOfRevenue_2024Q3
CostOfRevenue_2024Q4
CostOfRevenue_2025Q1
CostOfRevenue_QoQ_24Q2_24Q3
CostOfRevenue_QoQ_24Q3_24Q4
CostOfRevenue_QoQ_24Q4_25Q1
CostOfRevenue_QoQ_Rate
CostOfRevenue_Rate
CurrentAssets_2024Q2
CurrentAssets_2024Q3
CurrentAssets_2024Q4
Current

Alright, let's start by identifying which columns to drop because they are unnecessary for the supervised learning part. This should be relatively few columns.
- Unnamed: 0 (the row identifier we used above to match X and y)
- Ticker
- Name  


In [21]:
complete_dataset = complete_dataset.drop(columns=['Unnamed: 0','Ticker','Name'])
print(complete_dataset.shape)

(1905, 281)


Great. Now we can loop through all of the columns and pull out the feature-engineered data: any column whose name contains 'KPI', 'QoQ', or 'Rate'. We can then investigate these columns to make sure they make sense.

In [22]:
raw_columns = []
engineered_columns = []
for column in complete_dataset.columns:
    if ('KPI' not in column) and ('QoQ' not in column) and ('Rate' not in column):
        raw_columns.append(column)
    else:
        engineered_columns.append(column)
print(f'Raw Columns: {len(raw_columns)}')
print(f'Engineered Columns: {len(engineered_columns)}')

Raw Columns: 102
Engineered Columns: 179


Now, there are some raw columns that we want to add back to the engineered columns, since they describe important characteristics of each company. Let's list them here.
- Sector  
- Exchange
- Location  
- Market Cap
- Market Value

So let's append those.

In [23]:
add_back = ['Sector','Exchange','Location','Market Value','Market Cap']
engineered_columns = engineered_columns + add_back
print(f'Engineered Columns after adding back important raw columns: {len(engineered_columns)}')

Engineered Columns after adding back important raw columns: 184


Alright, now we can build out all of our feature dataframes to test them all.

In [25]:
X_raw = complete_dataset[raw_columns]
X_eng = complete_dataset[engineered_columns]
X_tot = complete_dataset.copy()
#X_kpi = place holder for the KPI data
#X_pca = place holder for the PCA 

print(f'Full Dataset Shape: {X_tot.shape}')
print(f'Raw Data Shape: {X_raw.shape}')
print(f'Engineered Data Shape: {X_eng.shape}')
#print(f'KPI Data shape: {X_kpi.shape}')
#print(f'PCA reduced Data Shape: {X_pca.shape}')


Full Dataset Shape: (1905, 281)
Raw Data Shape: (1905, 102)
Engineered Data Shape: (1905, 184)
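
The shapes above are consistent with each other: 102 raw + 184 engineered - 5 added-back columns counted twice = 281. A small sketch of the same check in code is given below, using a handful of column names taken from the listing earlier; the helper name `check_column_split` is hypothetical.

```python
def check_column_split(all_columns, raw_columns, engineered_columns, add_back):
    """Report columns missing from both lists and any overlap beyond `add_back` (hypothetical helper)."""
    raw, eng = set(raw_columns), set(engineered_columns)
    return {
        "missing": set(all_columns) - (raw | eng),
        "unexpected_overlap": (raw & eng) - set(add_back),
    }


# Toy column list (a small subset of names, NOT the full 281 columns)
demo_columns = [
    "Sector",
    "CashFromOps_2025Q1",
    "CashFromOps_QoQ_Rate",
    "KPI_NetProfitMargin_Rate",
]
markers = ("KPI", "QoQ", "Rate")

# Same rule as the notebook: a column is engineered if its name contains any marker
demo_raw = [c for c in demo_columns if not any(m in c for m in markers)]
demo_eng = [c for c in demo_columns if any(m in c for m in markers)] + ["Sector"]

split_report = check_column_split(demo_columns, demo_raw, demo_eng, add_back=["Sector"])
print(split_report)
```

Both sets coming back empty means that every column landed in at least one list and that the only shared columns are the ones deliberately added back.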


### Set up our dependent variables (y1 and y2)
Now, let's get the two y variables that we want to predict: y1 will be total revenue and y2 will be net income, both for 2025Q2.

In [None]:
# Let's pull out the data we want to predict
y1_rev = y['Revenue_2025Q2']
y2_ear = y['NetIncome_2025Q2']
print(y1_rev.shape,y2_ear.shape)

(1905,) (1905,)


## Preprocessing
Now that we have all of our data set up, we need to work on the preprocessing steps so that machine-readable information is fed into our supervised models. We will use functions here so that we can easily apply them to any dataset we want.