# Supervised Learning Models  
The goal of this notebook is to create a machine learning pipeline that tests multiple supervised learning techniques in their base form (no hyperparameter tuning). This lets us identify the best baseline model, which we can then tune to maximize performance. We want to set the pipeline up so that it can be applied to many different subsets of our feature space, to see what works best and to reduce model complexity while maintaining forecasting accuracy on out-of-sample data.

## Import Libraries
There are going to be a lot of different baseline models that we need to import here. The goal will be to produce a pipeline that runs the dataset through all of these models and outputs a box-whisker plot showing the RMSE.

In [1]:
# Data Manipulation libraries
import numpy as np
import pandas as pd

# scikit-learn preprocessing and evaluation
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, QuantileTransformer
from sklearn.metrics import mean_squared_error, make_scorer


# Supervised Learning Models  
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression 
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso  
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor  
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.svm import SVR  
from sklearn.ensemble import RandomForestRegressor  
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt

# Let's set our Random State and k_folds here as well
random_state = 42
k_fold = 10

## Load Data Files
Now we need to load both the X_data files and the y_data files for comparisons.

In [2]:
root_path = '../../datasets/'
X_train_file = 'X_train_filled_KPIs_QoQ_PCA.csv'
y_train_file = 'y_train.csv'
X = pd.read_csv(root_path+X_train_file)
y = pd.read_csv(root_path+y_train_file)
print(X.shape, y.shape)
print(X.tail(),y.tail())


(1910, 326) (1974, 19)
     Ticker                               Name                  Sector  \
1905    CNS                 COHEN & STEERS INC              Financials   
1906    FBP                      FIRST BANCORP              Financials   
1907   RDDT                 REDDIT INC CLASS A           Communication   
1908    AGM  FEDERAL AGRICULTURAL MORTGAGE NON              Financials   
1909    EAT          BRINKER INTERNATIONAL INC  Consumer Discretionary   

      CapitalExpenditure_2024Q2  CapitalExpenditure_2024Q3  \
1905                 -4239000.0                 -1408000.0   
1906                 -3264000.0                 -2547000.0   
1907                 -1202000.0                 -1353000.0   
1908                 -3568000.0                   -66000.0   
1909                -58000000.0                -56500000.0   

      CapitalExpenditure_2024Q4  CapitalExpenditure_2025Q1  \
1905                 -1678000.0                 -1075000.0   
1906                 -3407000.0    

Let's put in a small section where we can filter out the Mega-Cap companies before changing the y-variables. An additional Micro-Cap filter is left commented out.

In [3]:
print(X['Market Cap'].unique())
print(X.shape)
X = X[(X['Market Cap'] != 'Mega-Cap')] # & (X['Market Cap'] != 'Micro-Cap')]
print(X.shape)
print(X['Market Cap'].unique())

['Small-Cap' 'Micro-Cap' 'Mid-Cap' 'Large-Cap' 'Mega-Cap']
(1910, 326)
(1903, 326)
['Small-Cap' 'Micro-Cap' 'Mid-Cap' 'Large-Cap']
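If more than one size bucket ever needs to be excluded (the commented-out condition hints at also dropping Micro-Cap), `isin` keeps the filter readable. This is a minimal sketch on made-up rows; the `exclude` list and the toy tickers are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

# Toy frame standing in for X; tickers and caps are made up for illustration.
toy = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC", "DDD"],
    "Market Cap": ["Mega-Cap", "Small-Cap", "Micro-Cap", "Large-Cap"],
})

exclude = ["Mega-Cap"]  # hypothetical list; add "Micro-Cap" here if desired

# ~isin(...) keeps every row whose cap bucket is NOT in the exclude list.
filtered = toy[~toy["Market Cap"].isin(exclude)]
print(filtered["Market Cap"].unique())
```

Note that the row index keeps its original labels after filtering, which matters later whenever two filtered frames are combined by index.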


Alright, the first thing that I notice is that X and y have different numbers of rows (1910 vs. 1974 when loaded, and X is down to 1903 after the filter). That means rows were dropped in our X split but not in our y. So let's address this first.

In [4]:
in_train = set(X['Ticker'])
y = y[y['Ticker'].isin(in_train)].copy()
print(len(in_train))
print(y.shape)
#print(X.tail(),y.tail())

1903
(1903, 19)
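Matching row counts do not by themselves guarantee matching row order, and scikit-learn pairs the rows of X and y by position. The sketch below reorders y to follow X's Ticker order and checks the result. It assumes Ticker is unique in both frames, which has not been verified here, and it uses toy data.

```python
import pandas as pd

# Toy stand-ins: y has an extra ticker and a different row order than X.
X_toy = pd.DataFrame({"Ticker": ["CNS", "FBP", "EAT"], "feat": [1.0, 2.0, 3.0]})
y_toy = pd.DataFrame({
    "Ticker": ["EAT", "ZZZ", "CNS", "FBP"],
    "Revenue_2025Q2": [30.0, 99.0, 10.0, 20.0],
})

# Index y by Ticker, then select rows in exactly the order X lists them.
# .loc raises a KeyError if a ticker in X is missing from y.
y_aligned = (
    y_toy.set_index("Ticker")
         .loc[X_toy["Ticker"].to_list()]
         .reset_index()
)

# Sanity check: same tickers, same order.
assert (y_aligned["Ticker"].values == X_toy["Ticker"].values).all()
print(y_aligned)
```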


### Set up our dependent variables (y1 and y2)
Now, let's get the y variables that we want to compare to. y1 will be total revenue and y2 will be net income; y3 is the quarter-over-quarter change in revenue relative to 2025Q1.

In [5]:
# Let's pull out the data we want to predict
y1_rev = y['Revenue_2025Q2']
y2_ear = y['NetIncome_2025Q2']
y3_rev_rat = (y['Revenue_2025Q2'] - X['Revenue_2025Q1'])/X['Revenue_2025Q1']
print(y1_rev.shape,y2_ear.shape, y3_rev_rat.shape)
print(y1_rev.isna().sum())
print(y2_ear.isna().sum())
print(y3_rev_rat.isna().sum())
print(y3_rev_rat[:5])

(1903,) (1903,) (1972,)
522
522
641
0         NaN
1    0.105149
2         NaN
3    0.078966
4         NaN
dtype: float64


So, it turns out that 522 of the values we are trying to predict are missing. Rows without a target cannot be used for training, so we need to remove them.


In [6]:
y = y[['Ticker','Revenue_2025Q2','NetIncome_2025Q2']]
print(y.shape)
y = y.dropna()  # reassign rather than inplace; inplace on a column subset can raise SettingWithCopyWarning
print(y.shape)
in_dependent = set(y['Ticker'])
X = X[X['Ticker'].isin(in_dependent)].copy()
print(X.shape)

(1903, 3)
(1381, 3)
(1381, 326)


Now we can reassign y1_rev and y2_ear.

In [7]:
y1_rev = y['Revenue_2025Q2']
y2_ear = y['NetIncome_2025Q2']
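# NOTE: pandas aligns this arithmetic on the row index, not on Ticker.
# X and y carry different indices, which produces the NaNs counted below.
y3_rev_rat = (y['Revenue_2025Q2'] - X['Revenue_2025Q1'])/X['Revenue_2025Q1']
y4_margin = y['NetIncome_2025Q2']/y['Revenue_2025Q2']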
print(y1_rev.isna().sum())
print(y2_ear.isna().sum())
print(y3_rev_rat.isna().sum())
print(y4_margin.isna().sum())


0
0
854
0
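y1_rev and y2_ear have no missing values, yet y3_rev_rat has 854. Earlier, y3_rev_rat also came out with 1972 rows even though X and y each had 1903, which is what happens when pandas aligns two Series on row labels that only partly overlap. The sketch below reproduces that effect on toy data and then pairs companies by Ticker instead. The toy frames and values are made up.

```python
import pandas as pd

# Toy stand-ins: same tickers, but different row labels, as after separate filtering.
X_toy = pd.DataFrame(
    {"Ticker": ["AAA", "BBB", "CCC"], "Revenue_2025Q1": [100.0, 200.0, 400.0]},
    index=[0, 1, 2],
)
y_toy = pd.DataFrame(
    {"Ticker": ["AAA", "BBB", "CCC"], "Revenue_2025Q2": [110.0, 190.0, 500.0]},
    index=[5, 6, 7],
)

# Label-based alignment: no shared row labels, so the result is the union
# of both indices and every value is NaN.
naive = (y_toy["Revenue_2025Q2"] - X_toy["Revenue_2025Q1"]) / X_toy["Revenue_2025Q1"]
print(naive.isna().sum(), len(naive))

# Indexing both sides by Ticker pairs each company with itself.
prev = X_toy.set_index("Ticker")["Revenue_2025Q1"]
curr = y_toy.set_index("Ticker")["Revenue_2025Q2"]
growth = (curr - prev) / prev
print(growth)
```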


Great. As in our unsupervised notebook, we need to do a little data cleaning and manipulation on X so that we have usable X and y data. Let's start by dropping the identifier columns that will not help us in identifying trends (Ticker, Name).

## Dataset Separation
Alright, I think one of the first things we should do is identify the different datasets that we want to work with.  
1. Full Dataset (minus columns like Ticker)  
2. Raw Data Dataset (what would it look like if we just used the raw financial data)  
3. Engineered Dataset (do we get better structure when we look at just the engineered features)  
4. KPIs and PCA Dataset (engineered data and data reduction dataset; this may end up being 2)

We can easily just split these into subdatasets if we pull out the relevant columns. So let's look at all of the columns first so that we can start creating the proper datasets.

In [8]:
complete_dataset = X.copy()
columns = complete_dataset.columns.tolist()
for column in sorted(columns):
    print(column)

CapitalExpenditure_2024Q2
CapitalExpenditure_2024Q3
CapitalExpenditure_2024Q4
CapitalExpenditure_2025Q1
CapitalExpenditure_QoQ_24Q2_24Q3
CapitalExpenditure_QoQ_24Q3_24Q4
CapitalExpenditure_QoQ_24Q4_25Q1
CapitalExpenditure_QoQ_Rate
CapitalExpenditure_Rate
CashAndSTInvestments_2024Q2
CashAndSTInvestments_2024Q3
CashAndSTInvestments_2024Q4
CashAndSTInvestments_2025Q1
CashAndSTInvestments_QoQ_24Q2_24Q3
CashAndSTInvestments_QoQ_24Q3_24Q4
CashAndSTInvestments_QoQ_24Q4_25Q1
CashAndSTInvestments_QoQ_Rate
CashAndSTInvestments_Rate
CashFromOps_2024Q2
CashFromOps_2024Q3
CashFromOps_2024Q4
CashFromOps_2025Q1
CashFromOps_QoQ_24Q2_24Q3
CashFromOps_QoQ_24Q3_24Q4
CashFromOps_QoQ_24Q4_25Q1
CashFromOps_QoQ_Rate
CashFromOps_Rate
Cluster
CostOfRevenue_2024Q2
CostOfRevenue_2024Q3
CostOfRevenue_2024Q4
CostOfRevenue_2025Q1
CostOfRevenue_QoQ_24Q2_24Q3
CostOfRevenue_QoQ_24Q3_24Q4
CostOfRevenue_QoQ_24Q4_25Q1
CostOfRevenue_QoQ_Rate
CostOfRevenue_Rate
CurrentAssets_2024Q2
CurrentAssets_2024Q3
CurrentAssets_2024Q4

Alright, let's start by identifying which columns to drop because they are unnecessary for the supervised learning part. This should be relatively few columns.
- Ticker
- Name  


In [9]:
complete_dataset = complete_dataset.drop(columns=['Ticker','Name'])
print(complete_dataset.shape)

(1381, 324)


Great. Now we can loop through all of the columns and pull out the feature-engineered data: any column whose name contains 'KPI', 'QoQ', 'Rate', 'PCA', or 'Cluster'. The PCA and Cluster columns are also collected into their own list, and the revenue columns into another. We can then investigate these columns to make sure they make sense.

In [10]:
raw_columns = []
engineered_columns = []
pca_columns = []
revenue_columns = []
for column in complete_dataset.columns:
    if ('KPI' not in column) and ('QoQ' not in column) and ('Rate' not in column) and ('PCA' not in column) and ('Cluster' not in column):
        raw_columns.append(column)
    else:
        engineered_columns.append(column)
for column in complete_dataset.columns:
    if ('PCA' not in column) and ('Cluster' not in column):
        continue
    else:
        pca_columns.append(column)
for column in complete_dataset.columns:
    if ('Revenue' in column) and ('CostOf' not in column):
        revenue_columns.append(column)
print(f'Raw Columns: {len(raw_columns)}')
print(f'Engineered Columns: {len(engineered_columns)}')
print(f'PCA Columns {len(pca_columns)}')
print(f'Revenue Columns {len(revenue_columns)}')

Raw Columns: 102
Engineered Columns: 222
PCA Columns 61
Revenue Columns 9
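The grouping relies on substring tests, so `'Rate' in column` also matches names such as `InterestRate_2024Q1`, which is why those columns appear in the engineered list. If that is not intended, matching whole underscore-separated tokens is one alternative. This is only a sketch: `ENGINEERED_TOKENS` and `is_engineered` are hypothetical names, and because the PCA column names are not shown in this notebook, the sketch assumes 'PCA' appears as its own token.

```python
ENGINEERED_TOKENS = {"KPI", "QoQ", "Rate", "PCA", "Cluster"}  # hypothetical name


def is_engineered(column: str) -> bool:
    """Return True if any underscore-separated token marks the column as engineered."""
    return any(token in ENGINEERED_TOKENS for token in column.split("_"))


sample = [
    "CapitalExpenditure_2024Q2",
    "CapitalExpenditure_QoQ_Rate",
    "KPI_CurrentRatio_2025Q1",
    "InterestRate_2024Q1",
    "Cluster",
]
for name in sample:
    substring_hit = any(tok in name for tok in ENGINEERED_TOKENS)
    print(f"{name}: substring={substring_hit}, token={is_engineered(name)}")
```

Only `InterestRate_2024Q1` is classified differently by the two approaches in this sample.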


In [11]:
for column in engineered_columns:
    print(column)

InterestRate_2024Q1
InterestRate_2024Q2
InterestRate_2024Q3
InterestRate_2024Q4
InterestRate_2025Q1
KPI_GrossProfitMargin_2024Q2
KPI_GrossProfitMargin_2024Q3
KPI_GrossProfitMargin_2024Q4
KPI_GrossProfitMargin_2025Q1
KPI_NetProfitMargin_2024Q2
KPI_NetProfitMargin_2024Q3
KPI_NetProfitMargin_2024Q4
KPI_NetProfitMargin_2025Q1
KPI_CurrentRatio_2024Q2
KPI_CurrentRatio_2024Q3
KPI_CurrentRatio_2024Q4
KPI_CurrentRatio_2025Q1
KPI_Leverage_2024Q2
KPI_Leverage_2024Q3
KPI_Leverage_2024Q4
KPI_Leverage_2025Q1
KPI_DebtToEquityRatio_2024Q2
KPI_DebtToEquityRatio_2024Q3
KPI_DebtToEquityRatio_2024Q4
KPI_DebtToEquityRatio_2025Q1
KPI_TotalAssetTurnover_2024Q3
KPI_TotalAssetTurnover_2024Q4
KPI_TotalAssetTurnover_2025Q1
KPI_ReturnOnEquity_2024Q3
KPI_ReturnOnEquity_2024Q4
KPI_ReturnOnEquity_2025Q1
KPI_ReturnOnAssets_2024Q3
KPI_ReturnOnAssets_2024Q4
KPI_ReturnOnAssets_2025Q1
KPI_NetProfitMargin_QoQ_24Q2_24Q3
KPI_NetProfitMargin_QoQ_24Q3_24Q4
KPI_NetProfitMargin_QoQ_24Q4_25Q1
OperatingIncome_QoQ_24Q2_24Q3
Operat

Now, there are some raw columns that we want to add back to the engineered columns because they describe important characteristics of the company. Let's list them here.
- Sector  
- Exchange
- Location  
- Market Cap
- Market Value

So let's append those.

In [12]:
add_back = ['Sector','Exchange','Location','Market Value','Market Cap']
engineered_columns = engineered_columns + add_back
print(f'Engineered Columns after adding back important raw columns: {len(engineered_columns)}')

Engineered Columns after adding back important raw columns: 227


Alright, now we can build out all of our feature dataframes to test them all.

In [13]:
raw_data = complete_dataset[raw_columns]
eng_data = complete_dataset[engineered_columns]
tot_data = complete_dataset.copy()
#kpi_data = place holder for the KPI data
pca_data = complete_dataset[pca_columns]
rev_data = complete_dataset[revenue_columns + add_back]

print(f'Full Dataset Shape: {tot_data.shape}')
print(f'Raw Data Shape: {raw_data.shape}')
print(f'Engineered Data Shape: {eng_data.shape}')
#print(f'KPI Data shape: {kpi_data.shape}')
print(f'PCA reduced Data Shape: {pca_data.shape}')
print(f'Rev reduced Data Shape: {rev_data.shape}')


Full Dataset Shape: (1381, 324)
Raw Data Shape: (1381, 102)
Engineered Data Shape: (1381, 227)
PCA reduced Data Shape: (1381, 61)
Rev reduced Data Shape: (1381, 14)


## Fundamental Data
Now that we have all of our data set up, we need to work on the preprocessing steps so that machine-readable information is fed into our supervised models.

### Scaler Set

In [14]:
# Let's set our scaler here
scaler = QuantileTransformer()
#scaler = StandardScaler()
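To see why the choice between the two scalers matters for skewed financial features, here is a small sketch on synthetic log-normal data. `QuantileTransformer` maps each feature to its empirical quantiles (a uniform [0, 1] range by default), while `StandardScaler` only centers and rescales, so extreme values stay extreme. The data and `n_quantiles=100` are assumptions chosen for this toy example.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

rng = np.random.default_rng(42)
# Synthetic feature with a heavy right tail, loosely resembling revenue figures.
skewed = rng.lognormal(mean=10, sigma=2, size=(100, 1))

qt = QuantileTransformer(n_quantiles=100, random_state=42)
ss = StandardScaler()

q_scaled = qt.fit_transform(skewed)
s_scaled = ss.fit_transform(skewed)

print("quantile range:", q_scaled.min(), q_scaled.max())
print("largest z-score after standard scaling:", s_scaled.max())
```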

## Aside
I wanted to see whether using only the revenue data predicts revenue better than the full dataset, so I came back up to the top of my notebook to run this section.

In [15]:
## Lets see how we do with just the rev data.
#X_rev = rev_data.copy()
## columns
#cat_cols = ['Sector', 'Exchange', 'Market Cap'] 
#num_cols = [c for c in X_rev.columns if c not in cat_cols]
#
## models (set seeds where applicable)
#models = [
#    ('DUMMY', DummyRegressor(strategy='median')),
#    ('LR', LinearRegression()),
#    ('RIDGE', Ridge()),
#    ('LASSO', Lasso()),
#    ('EN', ElasticNet()),
#    ('KNN', KNeighborsRegressor()),
#    ('DT', DecisionTreeRegressor(random_state=random_state)),
#    ('SVR', SVR()),
#    ('RFR', RandomForestRegressor(random_state=random_state)),
#    ('ETR', ExtraTreesRegressor(random_state=random_state)),
#    ('ABR', AdaBoostRegressor(random_state=random_state)),
#    ('GBR', GradientBoostingRegressor(random_state=random_state)),
#    ('MLP', MLPRegressor(random_state=random_state))
#]
#
## preprocessing
#preproc = ColumnTransformer([
#    ('num', Pipeline([
#        ('scale', scaler)
#    ]), num_cols),
#    ('cat', Pipeline([
#        ('ohe', OneHotEncoder(handle_unknown='ignore'))
#    ]), cat_cols),
#])
#
#cv = KFold(n_splits=k_fold, shuffle = True, random_state=random_state)
#scorer = 'neg_root_mean_squared_error'  # RMSE as negative; flip sign after
#
#names, kfold_results_rev, kfold_results_ear, kfold_results_rat = [], [], [], []
#
#for name, model in models:
#    pipe = Pipeline([('prep', preproc), ('model', model)])
#    rmse_rev = -cross_val_score(pipe, X_rev, y1_rev, cv=cv, scoring=scorer, n_jobs=-1)
#    rmse_ear = -cross_val_score(pipe, X_rev, y2_ear, cv=cv, scoring=scorer, n_jobs=-1)
#    
#    names.append(name)
#    kfold_results_rev.append(rmse_rev)
#    kfold_results_ear.append(rmse_ear)
#    
#
## optional: summary
#summary = pd.DataFrame({
#    'model': names,
#    'rev_rmse_mean': [r.mean() for r in kfold_results_rev],
#    'rev_rmse_std':  [r.std()  for r in kfold_results_rev],
#    'ear_rmse_mean': [r.mean() for r in kfold_results_ear],
#    'ear_rmse_std':  [r.std()  for r in kfold_results_ear],
#}).sort_values('rev_rmse_mean')
#summary

In [16]:
## Let's try a better looking chart in altair
#df_rev = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_rev)]),
#    "rmse":  np.concatenate(kfold_results_rev)
#})
#
#df_rev['rmse'] = df_rev['rmse']/1_000_000
#
#rev_chart = alt.Chart(df_rev).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Revenue'
#)
#
#rev_chart

## Continue with original notebook
Here we resume our original order of testing models.

In [17]:
## Lets attempt this first with the raw data
#X_raw = raw_data.copy()
## columns
#cat_cols = ['Sector', 'Exchange', 'Market Cap'] 
#num_cols = [c for c in X_raw.columns if c not in cat_cols]
#
## models (set seeds where applicable)
#models = [
#    ('DUMMY', DummyRegressor(strategy='median')),
#    ('LR', LinearRegression()),
#    ('RIDGE', Ridge()),
#    ('LASSO', Lasso()),
#    ('EN', ElasticNet()),
#    ('KNN', KNeighborsRegressor()),
#    ('DT', DecisionTreeRegressor(random_state=random_state)),
#    ('SVR', SVR()),
#    ('RFR', RandomForestRegressor(random_state=random_state)),
#    ('ETR', ExtraTreesRegressor(random_state=random_state)),
#    ('ABR', AdaBoostRegressor(random_state=random_state)),
#    ('GBR', GradientBoostingRegressor(random_state=random_state)),
#    ('MLP', MLPRegressor(random_state=random_state))
#]
#
## preprocessing
#preproc = ColumnTransformer([
#    ('num', Pipeline([
#        ('scale', scaler)
#    ]), num_cols),
#    ('cat', Pipeline([
#        ('ohe', OneHotEncoder(handle_unknown='ignore'))
#    ]), cat_cols),
#])
#
#cv = KFold(n_splits=k_fold, shuffle = True, random_state=random_state)
#scorer = 'neg_root_mean_squared_error'  # RMSE as negative; flip sign after
#
#names, kfold_results_rev, kfold_results_ear, kfold_results_rat, kfold_results_marg= [], [], [], [], []
#
#for name, model in models:
#    pipe = Pipeline([('prep', preproc), ('model', model)])
#    rmse_rev = -cross_val_score(pipe, X_raw, y1_rev, cv=cv, scoring=scorer, n_jobs=-1)
#    rmse_ear = -cross_val_score(pipe, X_raw, y2_ear, cv=cv, scoring=scorer, n_jobs=-1)
#    
#    rmse_marg = -cross_val_score(pipe, X_raw, y4_margin, cv=cv, scoring=scorer, n_jobs=-1)
#    names.append(name)
#    kfold_results_rev.append(rmse_rev)
#    kfold_results_ear.append(rmse_ear)
#    
#    kfold_results_marg.append(rmse_marg)
#
## optional: summary
#summary = pd.DataFrame({
#    'model': names,
#    'rev_rmse_mean': [r.mean() for r in kfold_results_rev],
#    'rev_rmse_std':  [r.std()  for r in kfold_results_rev],
#    'ear_rmse_mean': [r.mean() for r in kfold_results_ear],
#    'ear_rmse_std':  [r.std()  for r in kfold_results_ear],
#}).sort_values('rev_rmse_mean')
#summary

Alright, that should run through all of our models and give us output that we can use for box-and-whisker plots. Let's use a simple plot first to see if it worked; then we can build a nicer one.

In [18]:
#fig = plt.figure()
#fig.suptitle(f'Algorithm Comparison Predicting Revenue using Raw Financial Data: \n{k_fold}-fold Cross Validation')
#ax = fig.add_subplot(111)
#plt.boxplot(kfold_results_rev)
#ax.set_xticklabels(names)
#fig.set_size_inches(15,8)
#plt.show()

In [19]:
## Let's try a better looking chart in altair
#df_rev = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_rev)]),
#    "rmse":  np.concatenate(kfold_results_rev)
#})
#
#df_rev['rmse'] = df_rev['rmse']/1_000_000
#
#rev_chart = alt.Chart(df_rev).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Revenue'
#)


In [20]:
## Let's try a better looking chart in altair
#df_ear = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_ear)]),
#    "rmse":  np.concatenate(kfold_results_ear)
#})
#
#df_ear['rmse'] = df_ear['rmse']/1_000_000
#
#ear_chart = alt.Chart(df_ear).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Net Income'
#)

In [21]:
## Let's try a better looking chart in altair
#df_rat = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_rat)]),
#    "rmse":  np.concatenate(kfold_results_rat)
#})
#
#df_rat['rmse'] = df_rat['rmse']
#
#rat_chart = alt.Chart(df_rat).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (% Change of Revenue)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='QoQ Change in Revenue'
#)

In [22]:
## Let's try a better looking chart in altair
#df_marg = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_marg)]),
#    "rmse":  np.concatenate(kfold_results_marg)
#})
#
#df_marg['rmse'] = df_marg['rmse']  # margin is already a ratio; no unit conversion
#
#marg_chart = alt.Chart(df_marg).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (Net Income / Revenue)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Margin'
#)

In [23]:
#full_chart = (rev_chart | ear_chart ).properties(
#    title=alt.TitleParams('Raw Fundamental Feature Space', anchor='middle'))
#full_chart.configure_view(stroke=None)

## Engineered Only Data
Now that we have seen the performance of the fundamentals, let's look at the engineered data by itself. I have a feeling this will struggle to predict the absolute revenue value because the raw revenue figures are not included in this set, but I may be wrong. Let's take a look.

In [24]:
## Now let's repeat this with the engineered data
#X_eng = eng_data.copy()
## columns
#cat_cols = ['Sector', 'Exchange', 'Market Cap'] 
#num_cols = [c for c in X_eng.columns if c not in cat_cols]
#
## models (set seeds where applicable)
#models = [
#    ('DUMMY', DummyRegressor(strategy='median')),
#    ('LR', LinearRegression()),
#    ('RIDGE', Ridge()),
#    ('LASSO', Lasso()),
#    ('EN', ElasticNet()),
#    ('KNN', KNeighborsRegressor()),
#    ('DT', DecisionTreeRegressor(random_state=random_state)),
#    ('SVR', SVR()),
#    ('RFR', RandomForestRegressor(random_state=random_state)),
#    ('ETR', ExtraTreesRegressor(random_state=random_state)),
#    ('ABR', AdaBoostRegressor(random_state=random_state)),
#    ('GBR', GradientBoostingRegressor(random_state=random_state)),
#    ('MLP', MLPRegressor(random_state=random_state))
#]
#
## preprocessing
#preproc = ColumnTransformer([
#    ('num', Pipeline([
#        ('scale', scaler)
#    ]), num_cols),
#    ('cat', Pipeline([
#        ('ohe', OneHotEncoder(handle_unknown='ignore'))
#    ]), cat_cols),
#])
#
#cv = KFold(n_splits=k_fold, shuffle = True, random_state=random_state)
#scorer = 'neg_root_mean_squared_error'  # RMSE as negative; flip sign after
#
#names, kfold_results_rev, kfold_results_ear, kfold_results_rat, kfold_results_marg= [], [], [], [], []
#
#for name, model in models:
#    pipe = Pipeline([('prep', preproc), ('model', model)])
#    rmse_rev = -cross_val_score(pipe, X_eng, y1_rev, cv=cv, scoring=scorer, n_jobs=-1)
#    rmse_ear = -cross_val_score(pipe, X_eng, y2_ear, cv=cv, scoring=scorer, n_jobs=-1)
#    
#    rmse_marg = -cross_val_score(pipe, X_eng, y4_margin, cv=cv, scoring=scorer, n_jobs=-1)
#    names.append(name)
#    kfold_results_rev.append(rmse_rev)
#    kfold_results_ear.append(rmse_ear)
#    
#    kfold_results_marg.append(rmse_marg)
#
## optional: summary
#summary = pd.DataFrame({
#    'model': names,
#    'rev_rmse_mean': [r.mean() for r in kfold_results_rev],
#    'rev_rmse_std':  [r.std()  for r in kfold_results_rev],
#    'ear_rmse_mean': [r.mean() for r in kfold_results_ear],
#    'ear_rmse_std':  [r.std()  for r in kfold_results_ear],
#    
#}).sort_values('rev_rmse_mean')
#summary

In [25]:
## Let's try a better looking chart in altair
#df_rev = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_rev)]),
#    "rmse":  np.concatenate(kfold_results_rev)
#})
#
#df_rev['rmse'] = df_rev['rmse']/1_000_000
#
#rev_chart2 = alt.Chart(df_rev).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Revenue'
#)

In [26]:
## Let's try a better looking chart in altair
#df_ear = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_ear)]),
#    "rmse":  np.concatenate(kfold_results_ear)
#})
#
#df_ear['rmse'] = df_ear['rmse']/1_000_000
#
#ear_chart2 = alt.Chart(df_ear).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Net Income'
#)

In [27]:
## Let's try a better looking chart in altair
#df_rat = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_rat)]),
#    "rmse":  np.concatenate(kfold_results_rat)
#})
#
#df_rat['rmse'] = df_rat['rmse']
#
#rat_chart2 = alt.Chart(df_rat).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (% Change of Revenue)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='QoQ Change in Revenue'
#)

In [28]:
## Let's try a better looking chart in altair
#df_marg = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_marg)]),
#    "rmse":  np.concatenate(kfold_results_marg)
#})
#
#df_marg['rmse'] = df_marg['rmse']  # margin is already a ratio; no unit conversion
#
#marg_chart2 = alt.Chart(df_marg).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (Net Income / Revenue)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Margin'
#)

In [29]:
#full_chart2 = (rev_chart2 | ear_chart2 ).properties(
#    title=alt.TitleParams('Engineered Feature Space', anchor='middle'))
#full_chart2.configure_view(stroke=None)

## Full dataset

In [30]:
# Now let's run the same comparison on the full dataset
X_tot = tot_data.copy()
# columns
cat_cols = ['Sector', 'Exchange', 'Market Cap'] 
num_cols = [c for c in X_tot.columns if c not in cat_cols]

# models (set seeds where applicable)
models = [
    ('DUMMY', DummyRegressor(strategy='median')),
    ('LR', LinearRegression()),
    ('RIDGE', Ridge()),
    ('LASSO', Lasso()),
    ('EN', ElasticNet()),
    ('KNN', KNeighborsRegressor()),
    ('DT', DecisionTreeRegressor(random_state=random_state)),
    ('SVR', SVR()),
    ('RFR', RandomForestRegressor(random_state=random_state)),
    ('ETR', ExtraTreesRegressor(random_state=random_state)),
    ('ABR', AdaBoostRegressor(random_state=random_state)),
    ('GBR', GradientBoostingRegressor(random_state=random_state)),
    ('MLP', MLPRegressor(random_state=random_state))
]

# preprocessing
preproc = ColumnTransformer([
    ('num', Pipeline([
        ('scale', scaler)
    ]), num_cols),
    ('cat', Pipeline([
        ('ohe', OneHotEncoder(handle_unknown='ignore'))
    ]), cat_cols),
])

cv = KFold(n_splits=k_fold, shuffle = True, random_state=random_state)
scorer = 'neg_root_mean_squared_error'

names, kfold_results_rev, kfold_results_ear, kfold_results_rat = [], [], [], []
kfold_mae,kfold_r2,kfold_results_marg = [],[],[]

for name, model in models:
    pipe = Pipeline([('prep', preproc), ('model', model)])
    rmse_rev = -cross_val_score(pipe, X_tot, y1_rev, cv=cv, scoring=scorer, n_jobs=-1)
    mae_rev = -cross_val_score(pipe, X_tot, y1_rev, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)
    r2_rev = cross_val_score(pipe, X_tot, y1_rev, cv=cv, scoring='r2', n_jobs=-1)
    rmse_ear = -cross_val_score(pipe, X_tot, y2_ear, cv=cv, scoring=scorer, n_jobs=-1)

    rmse_marg = -cross_val_score(pipe, X_tot, y4_margin, cv=cv, scoring=scorer, n_jobs=-1)
    names.append(name)
    kfold_results_rev.append(rmse_rev)
    kfold_mae.append(mae_rev)
    kfold_r2.append(r2_rev)
    kfold_results_ear.append(rmse_ear)
    
    kfold_results_marg.append(rmse_marg)

# optional: summary
summary = pd.DataFrame({
    'model': names,
    'r2 mean': [r.mean() for r in kfold_r2],
    'r2 std': [r.std() for r in kfold_r2],
    'RMSE mean': [r.mean() for r in kfold_results_rev],
    'RMSE std':  [r.std()  for r in kfold_results_rev],
    'MAE mean': [r.mean() for r in kfold_mae],
    'MAE std':  [r.std()  for r in kfold_mae],
})
summary.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| model | DUMMY | LR | RIDGE | LASSO | EN | KNN | DT | SVR | RFR | ETR | ABR | GBR | MLP |
| r2 mean | -0.092587 | -0.143355 | 0.190855 | -0.059225 | 0.313151 | 0.32733 | 0.80069 | -0.092587 | 0.927669 | 0.954828 | 0.907761 | 0.916265 | -0.179437 |
| r2 std | 0.04141 | 0.608199 | 0.255008 | 0.485302 | 0.079432 | 0.229692 | 0.254583 | 0.04141 | 0.082469 | 0.061074 | 0.0958 | 0.155373 | 0.08943 |
| RMSE mean | 7515018720.811421 | 6808308654.728461 | 6149817672.875845 | 6665817502.017856 | 6159781898.504516 | 6114533103.322847 | 3234885276.679091 | 7515018693.383453 | 2055576854.055822 | 1705600275.451047 | 2304093319.57669 | 1695899954.73065 | 7732665932.626477 |
| RMSE std | 4143759842.073029 | 2741709409.260979 | 3106097124.472261 | 2783257326.092637 | 3940675007.99228 | 4147901731.513206 | 4010015614.189151 | 4143759847.349423 | 2469333323.253695 | 2288052459.713968 | 2813285594.060095 | 2249592150.668164 | 4114535423.928305 |
| MAE mean | 2425283435.816302 | 4031514222.291781 | 3488939269.604856 | 3962511421.053754 | 2521577197.403382 | 1550613445.313376 | 625450403.360734 | 2425283391.528574 | 351656922.522568 | 329487980.054444 | 1004463849.059654 | 327327938.104997 | 2674981028.969665 |
| MAE std | 862176442.798848 | 370582004.265227 | 471881561.845573 | 398728793.637475 | 654222626.97301 | 713482667.02617 | 723800806.027237 | 862176443.632048 | 299561636.749399 | 281418009.482717 | 303673822.097149 | 297321022.22374 | 876306329.546344 |
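The loop above calls `cross_val_score` once per metric, so every model is refit for each score. scikit-learn's `cross_validate` accepts several scorers at once and fits each fold only once. The sketch below wraps that in a function that could be reused across feature subsets. `evaluate_models` and `SCORING` are hypothetical names, the data is synthetic, the sketch uses `StandardScaler` rather than the notebook's `scaler` to stay small, and only two fast models are included.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SCORING = {
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}


def evaluate_models(X, y, models, cat_cols, n_splits=5, seed=42):
    """Cross-validate each (name, model) pair; return one summary row per model."""
    num_cols = [c for c in X.columns if c not in cat_cols]
    preproc = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for name, model in models:
        pipe = Pipeline([("prep", preproc), ("model", model)])
        scores = cross_validate(pipe, X, y, cv=cv, scoring=SCORING)
        rows.append({
            "model": name,
            "r2 mean": scores["test_r2"].mean(),
            "RMSE mean": -scores["test_rmse"].mean(),  # flip sign of neg_* scorers
            "MAE mean": -scores["test_mae"].mean(),
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "f1": rng.normal(size=60),
    "f2": rng.normal(size=60),
    "Sector": rng.choice(["Financials", "Energy"], size=60),
})
y_demo = 3 * X_demo["f1"] - 2 * X_demo["f2"] + rng.normal(scale=0.1, size=60)

demo_models = [("DUMMY", DummyRegressor(strategy="median")), ("RIDGE", Ridge())]
print(evaluate_models(X_demo, y_demo, demo_models, cat_cols=["Sector"]))
```

With a function like this, comparing the raw, engineered, and full feature sets becomes three calls instead of three copies of the same cell.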


In [31]:
summary = pd.DataFrame({
    'model': names,
    'r2 mean': [np.round(r.mean(),3) for r in kfold_r2],
    'r2 std': [np.round(r.std(),3) for r in kfold_r2],
    'RMSE mean': [str(int(r.mean()/1_000_000))+'M' for r in kfold_results_rev],
    'RMSE std':  [str(int(r.std()/1_000_000))+'M'  for r in kfold_results_rev],
    'MAE mean': [str(int(r.mean()/1_000_000))+'M' for r in kfold_mae],
    'MAE std':  [str(int(r.std()/1_000_000))+'M'  for r in kfold_mae],
    
})
summary.set_index('model')

| model | r2 mean | r2 std | RMSE mean | RMSE std | MAE mean | MAE std |
|---|---|---|---|---|---|---|
| DUMMY | -0.093 | 0.041 | 7515M | 4143M | 2425M | 862M |
| LR | -0.143 | 0.608 | 6808M | 2741M | 4031M | 370M |
| RIDGE | 0.191 | 0.255 | 6149M | 3106M | 3488M | 471M |
| LASSO | -0.059 | 0.485 | 6665M | 2783M | 3962M | 398M |
| EN | 0.313 | 0.079 | 6159M | 3940M | 2521M | 654M |
| KNN | 0.327 | 0.23 | 6114M | 4147M | 1550M | 713M |
| DT | 0.801 | 0.255 | 3234M | 4010M | 625M | 723M |
| SVR | -0.093 | 0.041 | 7515M | 4143M | 2425M | 862M |
| RFR | 0.928 | 0.082 | 2055M | 2469M | 351M | 299M |
| ETR | 0.955 | 0.061 | 1705M | 2288M | 329M | 281M |
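Several rows show a negative R², including the DUMMY baseline. That is expected: R² is 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the fold being scored. A constant prediction taken from the training fold (here the median) that differs from the test fold's mean gives SS_res > SS_tot, so R² drops below zero. A minimal sketch with made-up numbers:

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


train = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # skewed, like revenue
test = np.array([2.0, 5.0, 900.0])

# A median dummy predicts the training median for every test row.
median_pred = np.full_like(test, np.median(train))

print(r2(test, median_pred))  # negative: worse than predicting the test mean
print(r2(test, test))         # perfect predictions
```

SVR scoring identically to DUMMY in the table suggests it is also producing a near-constant prediction at this target scale, though that has not been checked here.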


In [32]:
#fig = plt.figure()
#fig.suptitle(f'Algorithm Comparison Predicting Revenue using the Full Dataset: \n{k_fold}-fold Cross Validation')
#ax = fig.add_subplot(111)
#plt.boxplot(kfold_results_rev)
#ax.set_xticklabels(names)
#fig.set_size_inches(15,8)
#plt.show()


In [45]:
# Let's try a better looking chart in altair
df_rev = pd.DataFrame({
    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_r2)]),
    "R2":  np.concatenate(kfold_r2)
})
df_rev['R2'] = df_rev['R2'].fillna(0)  # numeric 0 (not the string '0') keeps the column quantitative for 'R2:Q'
#eliminate dummy variable for R2 values
rev_chart3 = alt.Chart(df_rev).mark_boxplot(
    size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}
).encode(
    x=alt.X('model:N', sort=names, title='Regression Model',
            axis=alt.Axis(labelAngle=0, grid=False)),
    y=alt.Y('R2:Q', title='R²',
            scale=alt.Scale(domain=[-1, 1], nice=False, clamp=True),
            axis=alt.Axis(grid=False)),
    color=alt.Color('model:N', legend=None)
).properties(width=40*len(names), height=300, title='')
rev_chart3
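
The chart cells in this notebook each rebuild the same long-form DataFrame (one row per model per fold) before passing it to Altair. A minimal sketch of a reusable helper for that reshaping step is shown here; `folds_to_long` and the `toy_*` inputs are hypothetical names introduced for illustration, and the scores are toy values, not results from this notebook.

```python
import numpy as np
import pandas as pd

def folds_to_long(names, fold_scores, value_name):
    """Stack per-model fold scores into a two-column long-form DataFrame."""
    return pd.DataFrame({
        # Repeat each model name once per fold score it contributed
        'model': np.repeat(names, [len(s) for s in fold_scores]),
        value_name: np.concatenate(fold_scores),
    })

# Toy inputs for illustration only (not results from this notebook)
toy_names = ['M1', 'M2']
toy_scores = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]
toy_long = folds_to_long(toy_names, toy_scores, 'R2')
print(toy_long)
```

With a helper like this, a call such as `folds_to_long(names, kfold_r2, 'R2')` would produce the same frame as the hand-written version, assuming `names` and `kfold_r2` are aligned lists as they are in the cells above.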

In [35]:
# Let's try a better looking chart in altair
df_rev = pd.DataFrame({
    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_rev)]),
    "rmse":  np.concatenate(kfold_results_rev)
})

df_rev['rmse'] = df_rev['rmse']/1_000_000

rev_chart2 = alt.Chart(df_rev).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
    color = alt.Color('model:N', legend=None)
).properties(
    width=40*len(names), height=300,
    title=''
)

In [36]:
# Let's try a better looking chart in altair
df_ear = pd.DataFrame({
    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_ear)]),
    "rmse":  np.concatenate(kfold_results_ear)
})

df_ear['rmse'] = df_ear['rmse']/1_000_000

ear_chart3 = alt.Chart(df_ear).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
    color = alt.Color('model:N', legend=None)
).properties(
    width=40*len(names), height=300,
    title='Net Income'
)

In [37]:
## Let's try a better looking chart in altair
#df_rat = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_rat)]),
#    "rmse":  np.concatenate(kfold_results_rat)
#})
#
#df_rat['rmse'] = df_rat['rmse']
#
#rat_chart3 = alt.Chart(df_rat).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (% Change of Revenue)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='QoQ Change in Revenue'
#)

In [38]:
## Let's try a better looking chart in altair
#df_marg = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_marg)]),
#    "rmse":  np.concatenate(kfold_results_marg)
#})
#
#df_marg['rmse'] = df_marg['rmse']
#
#marg_chart3 = alt.Chart(df_marg).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (% Change of Revenue)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Margin'
#)

In [46]:
full_chart3 = (rev_chart3 | rev_chart2 ).properties(
    title=alt.TitleParams('Regression Model Evaluation using R² and Root Mean Squared Error', subtitle='10-Fold Cross Validation', fontSize = 20, anchor='middle'))
full_chart3.configure_view(stroke=None)

In [40]:
#(full_chart & full_chart2 & full_chart3).configure_view(stroke=None).properties(
#    title=alt.TitleParams('RMSE Comparing Feature Space and Predictor Variables Without Mega- and Micro-Cap using 5-Fold Cross Validation', anchor='middle'))

## PCA Only Data

In [41]:
## Let's attempt this with the PCA-only data
#X_pca = pca_data.copy()
## columns
##cat_cols = ['Sector', 'Exchange', 'Market Cap'] 
#num_cols = X_pca.columns
#
## models (set seeds where applicable)
#models = [
#    ('DUMMY', DummyRegressor(strategy='mean')),
#    ('LR', LinearRegression()),
#    ('RIDGE', Ridge()),
#    ('LASSO', Lasso()),
#    ('EN', ElasticNet()),
#    ('KNN', KNeighborsRegressor()),
#    ('DT', DecisionTreeRegressor(random_state=random_state)),
#    ('SVR', SVR()),
#    ('RFR', RandomForestRegressor(random_state=random_state)),
#    ('ETR', ExtraTreesRegressor(random_state=random_state)),
#    ('ABR', AdaBoostRegressor(random_state=random_state)),
#    ('GBR', GradientBoostingRegressor(random_state=random_state)),
#]
#
## preprocessing
#preproc = ColumnTransformer([
#    ('num', Pipeline([
#        ('scale', scaler)
#    ]), num_cols),
#    
#])
#
#cv = KFold(n_splits=k_fold, shuffle = True, random_state=random_state)
#scorer = 'neg_root_mean_squared_error'  # RMSE as negative; flip sign after
#
#names, kfold_results_rev, kfold_results_ear = [], [], []
#
#for name, model in models:
#    pipe = Pipeline([('prep', preproc), ('model', model)])
#    rmse_rev = -cross_val_score(pipe, X_pca, y1_rev, cv=cv, scoring=scorer, n_jobs=-1)
#    rmse_ear = -cross_val_score(pipe, X_pca, y2_ear, cv=cv, scoring=scorer, n_jobs=-1)
#    names.append(name)
#    kfold_results_rev.append(rmse_rev)
#    kfold_results_ear.append(rmse_ear)
#
## optional: summary
#summary = pd.DataFrame({
#    'model': names,
#    'rev_rmse_mean': [r.mean() for r in kfold_results_rev],
#    'rev_rmse_std':  [r.std()  for r in kfold_results_rev],
#    'ear_rmse_mean': [r.mean() for r in kfold_results_ear],
#    'ear_rmse_std':  [r.std()  for r in kfold_results_ear],
#}).sort_values('rev_rmse_mean')
#summary

In [42]:
## Let's try a better looking chart in altair
#df_rev = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_rev)]),
#    "rmse":  np.concatenate(kfold_results_rev)
#})
#
#df_rev['rmse'] = df_rev['rmse']/1_000_000
#
#rev_chart4 = alt.Chart(df_rev).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Revenue'
#)

In [43]:
## Let's try a better looking chart in altair
#df_ear = pd.DataFrame({
#    "model": np.concatenate([[n]*len(v) for n, v in zip(names, kfold_results_ear)]),
#    "rmse":  np.concatenate(kfold_results_ear)
#})
#
#df_ear['rmse'] = df_ear['rmse']/1_000_000
#
#ear_chart4 = alt.Chart(df_ear).mark_boxplot(size=33, opacity=0.5, median={'color': 'black', 'strokeWidth': 3}).encode(
#    x=alt.X('model:N', sort=names, title='Regression Model', axis = alt.Axis(labelAngle = 0, grid = False)),
#    y=alt.Y('rmse:Q', title='RMSE (millions of $)', axis = alt.Axis(grid = False)),
#    color = alt.Color('model:N', legend=None)
#).properties(
#    width=40*len(names), height=300,
#    title='Net Income'
#)

In [44]:
#full_chart4 = (rev_chart4 | ear_chart4).properties(
#    title=alt.TitleParams('Full Feature Space', anchor='middle'))
#full_chart4.configure_view(stroke=None)