# Feature Engineering (Part II)

Update 9-25-25: I am adjusting this notebook to work with the new KPI files from the GitHub Repository.

This notebook builds the function that calculates the quarter-over-quarter percentage changes and the rate-of-change features, then adds them back to the dataset. Consider adding calculations for percent of revenue.


In [1]:
# File system libraries
import os
#from google.colab import drive

# Data Manipulation Libraries
import numpy as np
import pandas as pd

# Stat Libraries
import scipy.stats as stats

# Machine Learning Libraries
#import pycaret #Not working with this version of python
import sklearn

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt


In [2]:
# show decimals without scientific notation
pd.set_option('display.float_format', '{:,.2f}'.format)

In [3]:
# Mount the google drive
# drive.mount('/content/drive')
# Navigate to the folder and set the file name
path = '../../datasets'

os.chdir(path)
os.getcwd()
os.listdir()

['.train_test.split',
 'Russell_3000.csv',
 'Russell_3000_Cleaned.csv',
 'Russell_3000_Fundamentals.csv',
 'Russell_3000_With_Macro.csv',
 'X_test.csv',
 'X_test_filled.csv',
 'X_test_filled_KPIs.csv',
 'X_train.csv',
 'X_train_filled.csv',
 'X_train_filled_KPIs.csv',
 'y_test.csv',
 'y_train.csv']

In [4]:
train_file = 'X_train_filled_KPIs.csv'
test_file =  'X_test_filled_KPIs.csv'
train_dataset = pd.read_csv(train_file)
test_dataset = pd.read_csv(test_file)
train_dataset.drop(columns=['Unnamed: 0'], inplace=True)
test_dataset.drop(columns=['Unnamed: 0'], inplace=True)
print(train_dataset.head())

  Ticker                             Name                  Sector  \
0   NSSC  NAPCO SECURITY TECHNOLOGIES INC  Information Technology   
1   ALGN             ALIGN TECHNOLOGY INC             Health Care   
2   UBSI            UNITED BANKSHARES INC              Financials   
3   RCUS            ARCUS BIOSCIENCES INC             Health Care   
4   CRNX    CRINETICS PHARMACEUTICALS INC             Health Care   

   CapitalExpenditure_2024Q2  CapitalExpenditure_2024Q3  \
0                -551,000.00                -680,000.00   
1             -53,450,000.00             -29,800,000.00   
2              -2,973,000.00              -1,585,000.00   
3              -1,215,000.00              -1,000,000.00   
4                -955,000.00                -528,000.00   

   CapitalExpenditure_2024Q4  CapitalExpenditure_2025Q1  \
0              -1,134,000.00                 -65,000.00   
1             -22,961,000.00             -25,289,000.00   
2              -2,798,000.00              -3,895,000.

In [5]:
print(train_dataset.shape)
print(test_dataset.shape)

(1974, 146)
(496, 146)


Now that everything is loaded, we need to look at quarterly changes. From the quarterly changes we can start calculating the rate of change and related features. The best way to do this, I think, is to design a function that outputs a series that we can add to the dataframe. As parameters, it will take our original dataframe, the feature, and the quarters over which to calculate the change. First, let's get a list of all the columns that we will need to calculate this for.

In [6]:
# Let's create a copy of the dataset so that we can merge data back to it later
original_train_df = train_dataset.copy()
columns = train_dataset.columns
for column in columns:
    print(column)

Ticker
Name
Sector
CapitalExpenditure_2024Q2
CapitalExpenditure_2024Q3
CapitalExpenditure_2024Q4
CapitalExpenditure_2025Q1
CashAndSTInvestments_2024Q2
CashAndSTInvestments_2024Q3
CashAndSTInvestments_2024Q4
CashAndSTInvestments_2025Q1
CashFromOps_2024Q2
CashFromOps_2024Q3
CashFromOps_2024Q4
CashFromOps_2025Q1
CostOfRevenue_2024Q2
CostOfRevenue_2024Q3
CostOfRevenue_2024Q4
CostOfRevenue_2025Q1
CurrentAssets_2024Q2
CurrentAssets_2024Q3
CurrentAssets_2024Q4
CurrentAssets_2025Q1
CurrentLiabilities_2024Q2
CurrentLiabilities_2024Q3
CurrentLiabilities_2024Q4
CurrentLiabilities_2025Q1
EPS_2024Q2
EPS_2024Q3
EPS_2024Q4
EPS_2025Q1
Exchange
IncomeTaxExpense_2024Q2
IncomeTaxExpense_2024Q3
IncomeTaxExpense_2024Q4
IncomeTaxExpense_2025Q1
InterestExpense_2024Q2
InterestExpense_2024Q3
InterestExpense_2024Q4
InterestExpense_2025Q1
Location
LongTermDebt_2024Q2
LongTermDebt_2024Q3
LongTermDebt_2024Q4
LongTermDebt_2025Q1
Market Value
NetIncome_2024Q2
NetIncome_2024Q3
NetIncome_2024Q4
NetIncome_2025Q1
Operat

The easy part is the quarters, so let's build a list of them.

In [7]:
quarters = ['_2024Q2','_2024Q3','_2024Q4','_2025Q1']

Now we need a list of unique feature names. We can get this by iterating through the columns, splitting each name on underscores, adding the base name to a set, and then converting the set to a list.

In [8]:
unique_columns = set()
for column in columns:
    column = column.replace(' ','_')
    words = column.split('_')
    if words[0] != 'KPI':
        unique_columns.add(words[0])
    else:
        unique_columns.add(str(words[0]) + '_' + str(words[1]))
unique_columns = list(unique_columns)
for column in unique_columns:
    print(column)


Exchange
CostOfRevenue
Unemployment
Revenue
OperatingIncome
EPS
Location
InterestRate
CapitalExpenditure
GDP
KPI_ReturnOnEquity
Sector
CashAndSTInvestments
KPI_Leverage
Inflation
TotalAssets
Ticker
LongTermDebt
IndustrialProd
KPI_TotalAssetTurnover
NetIncome
TotalDebt
KPI_ReturnOnAssets
IncomeTaxExpense
KPI_NetProfitMargin
CashFromOps
Market
OtherOperatingExpense
Name
GDPReal
KPI_WorkingCapital
TotalLiabilities
InterestExpense
KPI_DebtToEquityRatio
KPI_GrossProfitMargin
CurrentAssets
KPI_CashFlow
KPI_CurrentRatio
TotalEquity
CurrentLiabilities


Great, now we can drop all of the features that are not quarterly as we won't be calculating the differences for these.

In [9]:
drop = {'Name','Ticker','Exchange','Sector','Market','Location','Inflation','Unemployment','IndustrialProd','GDP','GDPReal','InterestRate'}

unique_columns = [c for c in unique_columns if c not in drop]
for c in unique_columns:
    print(c)

CostOfRevenue
Revenue
OperatingIncome
EPS
CapitalExpenditure
KPI_ReturnOnEquity
CashAndSTInvestments
KPI_Leverage
TotalAssets
LongTermDebt
KPI_TotalAssetTurnover
NetIncome
TotalDebt
KPI_ReturnOnAssets
IncomeTaxExpense
KPI_NetProfitMargin
CashFromOps
OtherOperatingExpense
KPI_WorkingCapital
TotalLiabilities
InterestExpense
KPI_DebtToEquityRatio
KPI_GrossProfitMargin
CurrentAssets
KPI_CashFlow
KPI_CurrentRatio
TotalEquity
CurrentLiabilities
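
As a cross-check on the split-and-drop approach, the quarterly base names can also be derived by matching the quarter suffix directly. This is a minimal sketch rather than the notebook's method: `QUARTER_SUFFIX`, `quarterly_bases`, and the `sample` list are hypothetical, and it assumes every quarterly column ends in `_YYYYQn`. Any other column that happens to carry that suffix would be picked up too, so a manual drop set may still be needed.

```python
import re

# Hypothetical pattern: a base name followed by a _YYYYQn quarter suffix
QUARTER_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<quarter>\d{4}Q[1-4])$")

def quarterly_bases(columns):
    """Return the sorted base names of columns that end in a quarter suffix."""
    bases = set()
    for name in columns:
        match = QUARTER_SUFFIX.match(name)
        if match:
            bases.add(match.group("base"))
    return sorted(bases)

# Small illustrative column list, not the full dataset
sample = ["Ticker", "Market Value", "EPS_2024Q2", "EPS_2024Q3",
          "KPI_CurrentRatio_2024Q2", "KPI_CurrentRatio_2025Q1"]
bases = quarterly_bases(sample)
print(bases)
```

Sorting the result also gives a stable order from run to run, which iterating over a set does not guarantee.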


Okay, now that we have a solid list of the raw features that we want to calculate quarterly information for, we can start to build the function. We could return a series, or we could have the function write the new columns directly into the dataset, which is likely more efficient. Thinking about the nesting: we iterate through each unique column, then iterate through the quarters to calculate the values, and add them to the dataframe before moving on to the next unique column.

In [10]:
def quarterly_changes(dataset, unique_columns, quarters):
    """Add quarter-over-quarter percent-change columns to dataset in place."""
    for column in unique_columns:
        for val in range(1, len(quarters)):
            prev_col = f'{column}{quarters[val-1]}'
            curr_col = f'{column}{quarters[val]}'
            # Skip pairs with a missing quarter (replaces the bare except fallback)
            if prev_col not in dataset.columns or curr_col not in dataset.columns:
                continue
            series_1 = dataset[prev_col]
            series_2 = dataset[curr_col]
            # Percent change relative to the earlier quarter; a zero base yields inf or NaN
            dataset[f'{column}_QoQ_{quarters[val-1][-4:]}_{quarters[val][-4:]}'] = (series_2 - series_1)/series_1

    print("Completed Quarterly Change Calculations")
    return dataset
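
Two edge cases are worth keeping in mind with this percent-change formula: a zero in the earlier quarter produces `inf` (or `NaN` for 0/0), and a negative base flips the sign of the result. The sketch below shows one common variant that divides by the absolute value of the base and maps zero bases to `NaN`. It is an alternative for comparison, not what `quarterly_changes` does; `pct_change_safe` and the toy frame are hypothetical, with the first row borrowing NSSC's capital expenditure figures from the output above.

```python
import numpy as np
import pandas as pd

def pct_change_safe(prev: pd.Series, curr: pd.Series) -> pd.Series:
    """Return (curr - prev) / |prev|, with zero bases mapped to NaN."""
    base = prev.abs().replace(0, np.nan)
    return (curr - prev) / base

# Toy data: a negative base, a zero base, and an ordinary positive base
toy = pd.DataFrame({
    "CapitalExpenditure_2024Q2": [-551000.0, 0.0, 100.0],
    "CapitalExpenditure_2024Q3": [-680000.0, 5.0, 110.0],
})
toy["CapitalExpenditure_QoQ_24Q2_24Q3"] = pct_change_safe(
    toy["CapitalExpenditure_2024Q2"], toy["CapitalExpenditure_2024Q3"]
)
print(toy)
```

With the notebook's formula the first row comes out at roughly +0.23; with the absolute-value base it is roughly -0.23, because the value itself moved further below zero. Which convention is preferable depends on how the feature will be interpreted downstream.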


In [11]:
# Run the function to see the output
train_dataset = quarterly_changes(train_dataset, unique_columns, quarters)
print(train_dataset.shape)
train_dataset.head()

Completed Quarterly Change Calculations
(1974, 227)


Unnamed: 0,Ticker,Name,Sector,CapitalExpenditure_2024Q2,CapitalExpenditure_2024Q3,CapitalExpenditure_2024Q4,CapitalExpenditure_2025Q1,CashAndSTInvestments_2024Q2,CashAndSTInvestments_2024Q3,CashAndSTInvestments_2024Q4,...,KPI_CashFlow_QoQ_24Q4_25Q1,KPI_CurrentRatio_QoQ_24Q2_24Q3,KPI_CurrentRatio_QoQ_24Q3_24Q4,KPI_CurrentRatio_QoQ_24Q4_25Q1,TotalEquity_QoQ_24Q2_24Q3,TotalEquity_QoQ_24Q3_24Q4,TotalEquity_QoQ_24Q4_25Q1,CurrentLiabilities_QoQ_24Q2_24Q3,CurrentLiabilities_QoQ_24Q3_24Q4,CurrentLiabilities_QoQ_24Q4_25Q1
0,NSSC,NAPCO SECURITY TECHNOLOGIES INC,Information Technology,-551000.0,-680000.0,-1134000.0,-65000.0,65341000.0,85596000.0,86019000.0,...,-0.01,-0.09,0.09,-0.11,-0.0,-0.02,-0.07,0.11,-0.12,0.04
1,ALGN,ALIGN TECHNOLOGY INC,Health Care,-53450000.0,-29800000.0,-22961000.0,-25289000.0,761429000.0,1041935000.0,1043887000.0,...,-0.82,0.06,-0.03,-0.01,0.05,-0.02,-0.01,0.01,-0.01,-0.02
2,UBSI,UNITED BANKSHARES INC,Financials,-2973000.0,-1585000.0,-2798000.0,-3895000.0,1857653000.0,1907579000.0,2290976000.0,...,-0.12,0.3,-0.03,-0.17,0.02,0.01,0.06,-0.14,-0.08,0.19
3,RCUS,ARCUS BIOSCIENCES INC,Health Care,-1215000.0,-1000000.0,-1000000.0,-1000000.0,156000000.0,201000000.0,150000000.0,...,0.32,0.02,-0.14,0.19,-0.11,-0.14,0.09,0.08,0.05,-0.15
4,CRNX,CRINETICS PHARMACEUTICALS INC,Health Care,-955000.0,-528000.0,-1029000.0,-1239000.0,302162000.0,317269000.0,264545000.0,...,0.37,0.02,0.41,-0.02,0.0,0.59,-0.05,-0.01,0.11,-0.04


In [12]:
# Need to do the same with the test
test_dataset = quarterly_changes(test_dataset, unique_columns, quarters)
print(test_dataset.shape)
test_dataset.head()

Completed Quarterly Change Calculations
(496, 227)


Unnamed: 0,Ticker,Name,Sector,CapitalExpenditure_2024Q2,CapitalExpenditure_2024Q3,CapitalExpenditure_2024Q4,CapitalExpenditure_2025Q1,CashAndSTInvestments_2024Q2,CashAndSTInvestments_2024Q3,CashAndSTInvestments_2024Q4,...,KPI_CashFlow_QoQ_24Q4_25Q1,KPI_CurrentRatio_QoQ_24Q2_24Q3,KPI_CurrentRatio_QoQ_24Q3_24Q4,KPI_CurrentRatio_QoQ_24Q4_25Q1,TotalEquity_QoQ_24Q2_24Q3,TotalEquity_QoQ_24Q3_24Q4,TotalEquity_QoQ_24Q4_25Q1,CurrentLiabilities_QoQ_24Q2_24Q3,CurrentLiabilities_QoQ_24Q3_24Q4,CurrentLiabilities_QoQ_24Q4_25Q1
0,TMO,THERMO FISHER SCIENTIFIC INC,Health Care,-301000000.0,-272000000.0,-480000000.0,-362000000.0,7073000000.0,4645000000.0,4009000000.0,...,-0.78,-0.06,0.02,0.07,0.03,0.01,-0.0,-0.01,-0.09,-0.01
1,SD,SANDRIDGE ENERGY INC,Energy,-2445000.0,-9986000.0,-12832000.0,-6736000.0,209908000.0,92697000.0,98128000.0,...,-0.22,-0.58,-0.01,0.01,0.05,0.03,0.01,0.27,0.03,0.02
2,KAR,OPENLANE INC,Industrials,-13000000.0,-13100000.0,-14000000.0,-11900000.0,60900000.0,132100000.0,143000000.0,...,2.75,0.0,0.03,0.01,-0.0,0.01,0.02,0.0,-0.01,0.06
3,APPN,APPIAN CORP CLASS A,Information Technology,-734000.0,-355000.0,-511000.0,-651000.0,120787000.0,99193000.0,118552000.0,...,2.24,-0.03,0.02,0.02,0.09,-0.34,-0.03,0.02,0.19,-0.07
4,CSL,CARLISLE COMPANIES INC,Industrials,-24900000.0,-19300000.0,-36600000.0,-29000000.0,1736300000.0,1530600000.0,753500000.0,...,-1.0,-0.06,0.06,-0.11,-0.08,-0.11,-0.12,-0.02,-0.38,-0.11


Let's now take these quarterly values, fit a line of best fit through them, and use its slope as a single rate-of-change feature across the four quarters in the data.

In [13]:
# Let's get all of the columns again
columns = train_dataset.columns
# Use sets to avoid duplicates
QoQ_columns = set()
QoQ_quarters = set()
for column in columns:
    if 'QoQ' in column:
        # Pull out the common Column Name
        QoQ_columns.add(column[:-10])
        # Let's also pull out the QoQ
        QoQ_quarters.add(column[-10:])
QoQ_columns = list(QoQ_columns)
QoQ_quarters = list(QoQ_quarters)
# Order will be important so let's sort them
QoQ_quarters.sort()
QoQ_quarters


['_24Q2_24Q3', '_24Q3_24Q4', '_24Q4_25Q1']

Now that we have a list of all of the features, we need to pull the data out in a function. The earlier version no longer works because of NaNs and other non-finite values, so we need to set up a safe way to do this.

In [14]:
def safe_slope(vals: list[float]) -> float:
    a = np.asarray(vals, float)
    a = a[np.isfinite(a)]
    n = a.size
    if n < 2:
        return np.nan
    t = np.arange(n, dtype=float)
    
    # OLS slope without lstsq
    St, Sy = t.sum(), a.sum()
    Stt, Sty = (t*t).sum(), (t*a).sum()
    denom = n*Stt - St*St
    if np.isclose(denom, 0.0):
        return 0.0
    
    return (n*Sty - St*Sy) / denom

def get_rate_data(row, cols, quarters):
    for col in cols:
        vals = [row.get(f"{col}{q}", np.nan) for q in quarters]
        row[f"{col}_Rate"] = safe_slope(vals)
    return row
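
As a sanity check, the closed-form slope can be compared against `np.polyfit` on a small series. The sketch below restates the same formula in a standalone helper (`ols_slope` is a hypothetical name used only here) so the block runs on its own. It also shows a consequence of dropping non-finite values before building the time index: a gap shortens the time axis, so the remaining points are treated as consecutive.

```python
import numpy as np

def ols_slope(values):
    """Closed-form OLS slope of values against 0..n-1, ignoring non-finite entries."""
    y = np.asarray(values, dtype=float)
    y = y[np.isfinite(y)]
    n = y.size
    if n < 2:
        return np.nan
    t = np.arange(n, dtype=float)
    denom = n * (t * t).sum() - t.sum() ** 2
    return (n * (t * y).sum() - t.sum() * y.sum()) / denom

# Compare with a degree-1 polynomial fit on complete data
y = [1.0, 3.0, 2.0, 5.0]
slope = ols_slope(y)
reference = np.polyfit(np.arange(len(y)), y, 1)[0]
print(slope, reference)

# With a gap, the two remaining points are treated as adjacent quarters
gap_slope = ols_slope([1.0, np.nan, 3.0])
print(gap_slope)
```

If gaps should keep their original spacing, the time index would need to be built before filtering; that is a design choice rather than a correction.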

Alright, now that we have the functions, let's apply them to our datasets.

In [15]:
train_dataset = train_dataset.apply(lambda r: get_rate_data(r, QoQ_columns, QoQ_quarters), axis=1)
train_dataset = train_dataset.apply(lambda r: get_rate_data(r, unique_columns, quarters), axis=1)
test_dataset = test_dataset.apply(lambda r: get_rate_data(r, QoQ_columns, QoQ_quarters), axis=1)
test_dataset = test_dataset.apply(lambda r: get_rate_data(r, unique_columns, quarters), axis=1)
train_dataset.head()

Unnamed: 0,Ticker,Name,Sector,CapitalExpenditure_2024Q2,CapitalExpenditure_2024Q3,CapitalExpenditure_2024Q4,CapitalExpenditure_2025Q1,CashAndSTInvestments_2024Q2,CashAndSTInvestments_2024Q3,CashAndSTInvestments_2024Q4,...,KPI_WorkingCapital_Rate,TotalLiabilities_Rate,InterestExpense_Rate,KPI_DebtToEquityRatio_Rate,KPI_GrossProfitMargin_Rate,CurrentAssets_Rate,KPI_CashFlow_Rate,KPI_CurrentRatio_Rate,TotalEquity_Rate,CurrentLiabilities_Rate
0,NSSC,NAPCO SECURITY TECHNOLOGIES INC,Information Technology,-551000.0,-680000.0,-1134000.0,-65000.0,65341000.0,85596000.0,86019000.0,...,-5411700.0,-146900.0,7700.0,0.0,0.01,-5583700.0,-139700.0,-0.2,-5640500.0,-172000.0
1,ALGN,ALIGN TECHNOLOGY INC,Health Care,-53450000.0,-29800000.0,-22961000.0,-25289000.0,761429000.0,1041935000.0,1043887000.0,...,3711800.0,-33452200.0,1056400.0,-0.0,-0.0,-13597500.0,-29903700.0,0.0,1663600.0,-17309300.0
2,UBSI,UNITED BANKSHARES INC,Financials,-2973000.0,-1585000.0,-2798000.0,-3895000.0,1857653000.0,1907579000.0,2290976000.0,...,-3812700.0,725466000.0,-2367200.0,-0.06,0.0,-11358750.0,18905300.0,0.03,139885100.0,-7546050.0
3,RCUS,ARCUS BIOSCIENCES INC,Health Care,-1215000.0,-1000000.0,-1000000.0,-1000000.0,156000000.0,201000000.0,150000000.0,...,-7400000.0,20000000.0,400000.0,0.03,0.0,-8400000.0,-24000000.0,-0.0,-39200000.0,-1000000.0
4,CRNX,CRINETICS PHARMACEUTICALS INC,Health Care,-955000.0,-528000.0,-1029000.0,-1239000.0,302162000.0,317269000.0,264545000.0,...,173171400.0,1305800.0,2164100.0,-0.01,0.0,174663800.0,-12630200.0,2.6,176153600.0,1492400.0
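
Row-wise `apply` can be slow on wide frames, so a vectorized variant may be worth considering. The sketch below computes the same least-squares slope with one matrix product per feature. It is an assumption-laden alternative, not a drop-in replacement: `rate_columns_vectorized` is a hypothetical helper, it needs at least two quarters, and any row with a missing quarter gets `NaN` instead of a slope over the remaining points.

```python
import numpy as np
import pandas as pd

def rate_columns_vectorized(df, bases, suffixes):
    """Return df plus <base>_Rate columns holding the OLS slope across suffixes."""
    t = np.arange(len(suffixes), dtype=float)
    centered = t - t.mean()
    # slope = sum((t - mean(t)) * y) / sum((t - mean(t))^2)
    weights = centered / (centered ** 2).sum()
    rates = {}
    for base in bases:
        cols = [f"{base}{s}" for s in suffixes]
        rates[f"{base}_Rate"] = df[cols].to_numpy(dtype=float) @ weights
    return pd.concat([df, pd.DataFrame(rates, index=df.index)], axis=1)

# Toy frame with one feature over four quarters
quarters_demo = ["_2024Q2", "_2024Q3", "_2024Q4", "_2025Q1"]
toy = pd.DataFrame(
    [[1.0, 3.0, 2.0, 5.0], [10.0, 20.0, 30.0, 40.0], [1.0, np.nan, 3.0, 4.0]],
    columns=[f"Revenue{q}" for q in quarters_demo],
)
out = rate_columns_vectorized(toy, ["Revenue"], quarters_demo)
print(out["Revenue_Rate"])
```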


In [16]:
missing = train_dataset['CapitalExpenditure_QoQ_24Q4_25Q1'].isna().sum()
total = train_dataset.shape[0]
ratio =(missing/total)*100
print(f'The percent of missing data in the rates is {np.round(ratio,2)}%')

The percent of missing data in the rates is 0.15%
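
The check above looks at a single column. A broader view, sketched below, reports the share of `NaN` and infinite values for every quarter-over-quarter column at once, since division by a zero base yields `inf` rather than `NaN`. The helper name `nonfinite_report`, the `_QoQ_` marker default, and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def nonfinite_report(df, marker="_QoQ_"):
    """Percent of NaN and inf values per column whose name contains marker."""
    cols = [c for c in df.columns if marker in c]
    values = df[cols].astype(float)
    report = pd.DataFrame({
        "pct_nan": values.isna().mean() * 100,
        "pct_inf": np.isinf(values).mean() * 100,
    })
    return report.sort_values("pct_nan", ascending=False)

# Toy frame: one column with a NaN and an inf, one clean column
toy = pd.DataFrame({
    "Ticker": ["A", "B", "C", "D"],
    "A_QoQ_24Q2_24Q3": [0.1, np.nan, np.inf, 0.2],
    "B_QoQ_24Q2_24Q3": [0.0, 0.5, 0.1, 0.3],
})
report = nonfinite_report(toy)
print(report)
```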


In [17]:
print(train_dataset.KPI_GrossProfitMargin_Rate.min())
print(train_dataset.KPI_GrossProfitMargin_Rate.max())

-9.024311153512485
94.65243828065637


## Final Section to Save new CSV

In [None]:
# Uncomment this block when we want to save to a CSV file
#train_dataset.to_csv("X_train_filled_KPIs_QoQ.csv", index=True)
#test_dataset.to_csv("X_test_filled_KPIs_QoQ.csv", index=True)
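
One detail to weigh before saving: writing with `index=True` stores the row index as an unnamed first column, which is why the load step earlier has to drop `Unnamed: 0`. The sketch below demonstrates the difference with an in-memory buffer; the two-row frame and its ticker values are made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({"Ticker": ["AAA", "BBB"], "Revenue_Rate": [1.5, -0.25]})

# index=True writes the index as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index, index=True)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False keeps only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Saving with `index=False` would remove the need for the drop on reload, provided downstream notebooks do not rely on that column being present.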