# Train-Test Split
We are going to split the data into training and testing CSVs early to make sure that we have no data leakage. We do this even before the imputation step because we don't want any information from the test set to influence the training set. This is a quick notebook that runs through the code needed to create the train.csv and test.csv files that the later steps work with.
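
To make the leakage concern concrete, here is a minimal sketch of why the split comes before imputation. It is not the imputation step of this pipeline: the toy frame, the single `feature` column, and the choice of a median `SimpleImputer` are all assumptions made for illustration. The point is only that the fill value is learned from the training rows and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): one numeric feature with missing values.
toy = pd.DataFrame(
    {"feature": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0]}
)

# Split first, so the test rows never influence the imputation statistics.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

# Fit the imputer on the training rows only, then apply it to both sets.
imputer = SimpleImputer(strategy="median")
train_imputed = imputer.fit_transform(toy_train)
test_imputed = imputer.transform(toy_test)

# The fill value equals the median of the training rows alone.
print(imputer.statistics_[0], toy_train["feature"].median())
```

If the imputer were fit on the full frame instead, the test rows would shift the median and quietly leak into the training data.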

In [None]:
# File system libraries
import os
from google.colab import drive

# Data Manipulation Libraries
import numpy as np
import pandas as pd

# Stat Libraries
import scipy.stats as stats

# Machine Learning Libraries
#import pycaret #Not working with this version of python
from sklearn.model_selection import train_test_split

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt

In [None]:
# show decimals without scientific notation
pd.set_option('display.float_format', '{:,.2f}'.format)

In [None]:
# Mount the google drive
drive.mount('/content/drive')
# Navigate to the folder and set the file name
path = '/content/drive/MyDrive/Colab Notebooks/696 - Milestone II/696 - Milestone II - Shared/Pipeline Files'

os.chdir(path)
os.getcwd()
os.listdir()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


['Russell_3000.csv',
 'Makefile',
 'make_KPIs.py',
 'Russell_3000_Fundamentals.csv',
 'Russell_3000_With_Macro.csv',
 'Russell_3000_Cleaned.csv',
 'data_acquisition.py',
 'clean.py',
 'data_acquisition_macro.py']

In [None]:
filename = 'Russell_3000_Cleaned.csv'
dataset = pd.read_csv(filename)
print(dataset.shape)
print(dataset.head())

(2570, 131)
  Ticker                        Name                  Sector  Weight (%)  \
0   NVDA                 NVIDIA CORP  Information Technology        6.39   
1   MSFT              MICROSOFT CORP  Information Technology        5.99   
2   AAPL                   APPLE INC  Information Technology        5.62   
3   AMZN              AMAZON COM INC  Consumer Discretionary        3.60   
4   META  META PLATFORMS INC CLASS A           Communication        2.60   

   CapitalExpenditure_2024Q2  CapitalExpenditure_2024Q3  \
0                        NaN            -977,000,000.00   
1         -13,873,000,000.00         -14,923,000,000.00   
2          -2,151,000,000.00          -2,908,000,000.00   
3         -17,620,000,000.00         -22,620,000,000.00   
4          -8,173,000,000.00          -8,258,000,000.00   

   CapitalExpenditure_2024Q4  CapitalExpenditure_2025Q1  \
0            -813,000,000.00          -1,077,000,000.00   
1         -15,804,000,000.00         -16,745,000,000.00   

Alright, now that the data is imported properly, we are going to split it into two files: a train file that holds both the independent (X) variables and the dependent (y) variables we are trying to predict, and a test file with the same layout. Because X and y are split separately, we will need to merge them back together before saving each file. The first thing we need to do is separate the (X) columns from the (y) columns, so let's start there.

In [None]:
y_columns = []
X_columns = []
columns = dataset.columns
for column in columns:
    if '_2025Q2' in column:
        y_columns.append(column)
    else:
        X_columns.append(column)
# Quick sanity check
for column in X_columns:
    print(column)

Ticker
Name
Sector
Weight (%)
CapitalExpenditure_2024Q2
CapitalExpenditure_2024Q3
CapitalExpenditure_2024Q4
CapitalExpenditure_2025Q1
CashAndSTInvestments_2024Q2
CashAndSTInvestments_2024Q3
CashAndSTInvestments_2024Q4
CashAndSTInvestments_2025Q1
CashFromOps_2024Q2
CashFromOps_2024Q3
CashFromOps_2024Q4
CashFromOps_2025Q1
CostOfRevenue_2024Q2
CostOfRevenue_2024Q3
CostOfRevenue_2024Q4
CostOfRevenue_2025Q1
CurrentAssets_2024Q2
CurrentAssets_2024Q3
CurrentAssets_2024Q4
CurrentAssets_2025Q1
CurrentLiabilities_2024Q2
CurrentLiabilities_2024Q3
CurrentLiabilities_2024Q4
CurrentLiabilities_2025Q1
EPS_2024Q2
EPS_2024Q3
EPS_2024Q4
EPS_2025Q1
Exchange
IncomeTaxExpense_2024Q2
IncomeTaxExpense_2024Q3
IncomeTaxExpense_2024Q4
IncomeTaxExpense_2025Q1
InterestExpense_2024Q2
InterestExpense_2024Q3
InterestExpense_2024Q4
InterestExpense_2025Q1
Location
LongTermDebt_2024Q2
LongTermDebt_2024Q3
LongTermDebt_2024Q4
LongTermDebt_2025Q1
Market Value
NetIncome_2024Q2
NetIncome_2024Q3
NetIncome_2024Q4
NetIncome_20
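
Printing every column works, but an automated check is easier to rerun. The sketch below shows one way to do that on a hypothetical miniature frame that borrows the same naming pattern. Note that it uses `endswith` rather than the `in` test from the cell above, which is a slightly stricter variant and an assumption on my part, not the notebook's method.

```python
import pandas as pd

# Hypothetical miniature frame using the same naming pattern as the real data.
toy = pd.DataFrame(
    columns=["Ticker", "Sector", "EPS_2025Q1", "EPS_2025Q2", "NetIncome_2025Q2"]
)

target_suffix = "_2025Q2"  # the quarter being predicted
y_cols = [c for c in toy.columns if c.endswith(target_suffix)]
X_cols = [c for c in toy.columns if not c.endswith(target_suffix)]

# The two lists should be disjoint and together cover every column.
assert set(X_cols).isdisjoint(y_cols)
assert len(X_cols) + len(y_cols) == len(toy.columns)
print(y_cols)
```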

Alright, now that we have the columns, we can start splitting the data, stratifying on both sector and market cap. Let's start by creating the combined column.

In [None]:
# Create the stratum label
strat = (dataset['Sector'].astype(str) + ' | ' + dataset['Market Cap'].astype(str))

# We need to drop anything that is too small
val_cou = strat.value_counts()
val_cou

                                    count
Financials | Small-Cap                198
Industrials | Small-Cap               166
Health Care | Small-Cap               154
Financials | Micro-Cap                146
Health Care | Micro-Cap               146
Information Technology | Small-Cap    145
Consumer Discretionary | Small-Cap    137
Industrials | Mid-Cap                  93
Industrials | Micro-Cap                77
Financials | Mid-Cap                   75


In [None]:
# Based on this, each stratum needs at least 2 members to be stratified
to_strat = strat.isin(val_cou[val_cou >= 2].index)
# Now, let's keep only the rows that belong to those strata
dataset, strat = dataset.loc[to_strat], strat.loc[to_strat]

X = dataset[X_columns]
y = dataset[y_columns]

X_tr, X_te, y_tr, y_te, strat_tr, strat_te = train_test_split(X, y, strat, test_size = 0.2, random_state = 6, stratify = strat)
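
The reason for filtering out tiny strata is that `train_test_split` raises a `ValueError` when any class passed to `stratify` has only one member. The toy sketch below reproduces the same filter-then-split pattern on made-up data. The sector names, row counts, and `value` column are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: two well-populated strata and one singleton.
toy = pd.DataFrame({
    "Sector": ["Tech"] * 10 + ["Energy"] * 10 + ["Utilities"],
    "Market Cap": ["Small-Cap"] * 10 + ["Mid-Cap"] * 10 + ["Mega-Cap"],
    "value": range(21),
})
toy_strat = toy["Sector"] + " | " + toy["Market Cap"]

# A stratum with a single row cannot appear in both splits,
# so keep only strata with at least two rows.
counts = toy_strat.value_counts()
keep = toy_strat.isin(counts[counts >= 2].index)
toy, toy_strat = toy.loc[keep], toy_strat.loc[keep]

toy_tr, toy_te = train_test_split(
    toy, test_size=0.2, random_state=6, stratify=toy_strat
)
print(len(toy_tr), len(toy_te))
```

With 20 rows left after the filter, the 80/20 split puts 16 rows in train and 4 in test, two from each remaining stratum.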


Okay, so that should have stratified everything, but let's do a quick check to see whether the two splits are balanced.

In [None]:
# Training Set
print(pd.crosstab(dataset['Sector'].loc[strat_tr.index],
                  dataset['Market Cap'].loc[strat_tr.index], normalize="all").round(3))
# Testing Set
print(pd.crosstab(dataset['Sector'].loc[strat_te.index],
                  dataset['Market Cap'].loc[strat_te.index], normalize="all").round(3))

Market Cap              Large-Cap  Mega-Cap  Micro-Cap  Mid-Cap  Nano-Cap  \
Sector                                                                      
Communication                0.00      0.00       0.02     0.01      0.00   
Consumer Discretionary       0.01      0.00       0.03     0.02      0.00   
Consumer Staples             0.01      0.00       0.01     0.01      0.00   
Energy                       0.01      0.00       0.01     0.01      0.00   
Financials                   0.02      0.00       0.06     0.03      0.00   
Health Care                  0.01      0.00       0.06     0.02      0.01   
Industrials                  0.02      0.00       0.03     0.04      0.00   
Information Technology       0.01      0.00       0.03     0.02      0.00   
Materials                    0.00      0.00       0.01     0.01      0.00   
Real Estate                  0.00      0.00       0.01     0.01      0.00   
Utilities                    0.01      0.00       0.00     0.01      0.00   
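
Eyeballing two crosstabs gets harder as the number of strata grows, so a single summary number can help. The sketch below is one possible check, not something the notebook ran: it compares the share of each stratum in train and test and reports the largest absolute gap. The `labels` series is hypothetical stand-in data for the combined Sector and Market Cap label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical strata labels standing in for the combined stratum column.
labels = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 20, name="stratum")
lab_tr, lab_te = train_test_split(
    labels, test_size=0.2, random_state=6, stratify=labels
)

# Share of each stratum in the two splits, aligned on the stratum label.
shares = pd.concat(
    [lab_tr.value_counts(normalize=True), lab_te.value_counts(normalize=True)],
    axis=1,
    keys=["train", "test"],
).fillna(0.0)

# Largest absolute gap between train and test shares; near zero means balanced.
max_gap = (shares["train"] - shares["test"]).abs().max()
print(shares.round(3))
print(f"max gap: {max_gap:.3f}")
```

On real data the gap will not be exactly zero, since stratum sizes rarely divide evenly, but it should stay small.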

That looks pretty darn good in terms of the stratification. Now we can just export them to their own CSV files.

In [None]:
# If we want to export to four different files, we can do that.
#X_tr.to_csv("X_train.csv", index=True)
#y_tr.to_csv("y_train.csv", index=True)
#X_te.to_csv("X_test.csv",  index=True)
#y_te.to_csv("y_test.csv",  index=True)

In [None]:
# Or, if we want to merge them back together before saving, we can do that too.
train = pd.concat([X_tr,y_tr],axis = 1)
test = pd.concat([X_te,y_te],axis = 1)

train.to_csv('train.csv', index = True)
test.to_csv('test.csv',index = True)

Now that we have updated the pipeline and run the script, let's take a look at the shapes of all of our files to make sure they are what we expect.

In [22]:
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv')
X_test = pd.read_csv('X_test.csv')
y_test = pd.read_csv('y_test.csv')

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(2054, 114)
(2054, 19)
(514, 114)
(514, 19)
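
One thing worth noting about these shapes: 114 + 19 is 133 columns, two more than the 131 in the cleaned file. A likely explanation, assuming the pipeline script saved the files with `index=True` as in the export cells above, is that each file carries the saved index as an extra `Unnamed: 0` column when read back without `index_col`. The sketch below shows the effect on a hypothetical two-row frame, using an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Hypothetical small frame with a non-default index, like a post-split subset.
small = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[7, 9])

# Write with index=True, as the export cells do (kept in memory here).
csv_text = small.to_csv(index=True)

# Without index_col, the saved index comes back as an extra "Unnamed: 0" column.
naive = pd.read_csv(io.StringIO(csv_text))
# With index_col=0, the original index is restored and the column count matches.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.shape, restored.shape)
```

Passing `index_col=0` when reading the split files back in keeps the original row labels and avoids feeding the old index to a model as a feature.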


Looks good to me. I don't want to investigate further because I don't want to compromise the integrity of the test split.