# Feature Engineering (Part II)
This notebook builds the functions that calculate the quarter-over-quarter (QoQ) percentage changes and the rate-of-change features, then adds the results back to the dataset. To do: consider adding calculations for percent of revenue.


In [1]:
# File system libraries
import os
from google.colab import drive

# Data Manipulation Libraries
import numpy as np
import pandas as pd

# Stat Libraries
import scipy.stats as stats

# Machine Learning Libraries
#import pycaret #Not working with this version of python
import sklearn

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt


In [2]:
# show decimals without scientific notation
pd.set_option('display.float_format', '{:,.2f}'.format)

In [3]:
# Mount the google drive
drive.mount('/content/drive')
# Navigate to the folder and set the file name
path = '/content/drive/MyDrive/Colab Notebooks/696 - Milestone II/696 - Milestone II - Shared/Dataset'

os.chdir(path)
os.getcwd()
os.listdir()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


['Russell_3000.csv',
 'Russell_3000_Quarter_Annual_0913.csv',
 'Russell_1000',
 'Russel_3000_Quarter_Annual_0913.ipynb',
 'Macroeconomics',
 'Russell_3000_Quarterly_0917_five_more_variable.csv',
 'Russel_3000_0917_five_more_variable.ipynb',
 'Russell_3000_Merged.csv',
 'Russell_3000_09-17_Merged_cleaned.csv',
 'clean_data.ipynb',
 'Russel_3000_09-17_merged_cleaned_KPIs.csv',
 'Russel_3000_09-18_merged_cleaned_KPIs_QoQ.csv',
 'Russell_3000_Quarterly_0918_two_more_variables.csv',
 'Russel_3000_0918_two_more_variables.ipynb']

In [4]:
filename = 'Russel_3000_09-17_merged_cleaned_KPIs.csv'
dataset = pd.read_csv(filename)
print(dataset.head())

  Ticker                        Name                  Sector  \
0   NVDA                 NVIDIA CORP  Information Technology   
1   MSFT              MICROSOFT CORP  Information Technology   
2   AAPL                   APPLE INC  Information Technology   
3   AMZN              AMAZON COM INC  Consumer Discretionary   
4   META  META PLATFORMS INC CLASS A           Communication   

   CashAndSTInvestments_2024Q2  CashAndSTInvestments_2024Q3  \
0                          NaN             8,563,000,000.00   
1            18,315,000,000.00            20,840,000,000.00   
2            25,565,000,000.00            29,943,000,000.00   
3            71,178,000,000.00            75,091,000,000.00   
4            32,045,000,000.00            43,852,000,000.00   

   CashAndSTInvestments_2024Q4  CashAndSTInvestments_2025Q1  \
0             9,107,000,000.00             8,589,000,000.00   
1            17,482,000,000.00            28,828,000,000.00   
2            30,299,000,000.00            28,16

In [5]:
dataset.shape

(2570, 122)

Now that everything is loaded, we need to look at quarterly changes. From the quarterly changes we can then calculate the rate of change. The best way to do this is probably a function whose output we can add straight into the dataframe. As parameters, it will take our original dataframe, the features, and the quarters over which to calculate the change. First, let's get a list of all the columns that we will need to calculate this for.

In [6]:
# Let's create a copy of the dataset so that we can merge data back to it later
original_df = dataset.copy()
columns = dataset.columns
for column in columns:
    print(column)

Ticker
Name
Sector
CashAndSTInvestments_2024Q2
CashAndSTInvestments_2024Q3
CashAndSTInvestments_2024Q4
CashAndSTInvestments_2025Q1
CashAndSTInvestments_2025Q2
CashFromOps_2024Q2
CashFromOps_2024Q3
CashFromOps_2024Q4
CashFromOps_2025Q1
CashFromOps_2025Q2
EPS_2024Q2
EPS_2024Q3
EPS_2024Q4
EPS_2025Q1
EPS_2025Q2
Exchange
Location
LongTermDebt_2024Q2
LongTermDebt_2024Q3
LongTermDebt_2024Q4
LongTermDebt_2025Q1
LongTermDebt_2025Q2
Market Value
NetIncome_2024Q2
NetIncome_2024Q3
NetIncome_2024Q4
NetIncome_2025Q1
NetIncome_2025Q2
Notional Value
OperatingIncome_2024Q2
OperatingIncome_2024Q3
OperatingIncome_2024Q4
OperatingIncome_2025Q1
OperatingIncome_2025Q2
Price
Quantity
Revenue_2024Q2
Revenue_2024Q3
Revenue_2024Q4
Revenue_2025Q1
Revenue_2025Q2
ShortTermDebtOrCurrentLiab_2024Q2
ShortTermDebtOrCurrentLiab_2024Q3
ShortTermDebtOrCurrentLiab_2024Q4
ShortTermDebtOrCurrentLiab_2025Q1
ShortTermDebtOrCurrentLiab_2025Q2
TotalAssets_2024Q2
TotalAssets_2024Q3
TotalAssets_2024Q4
TotalAssets_2025Q1
TotalAsse

The easy part is the quarters, so let's build that list first.

In [7]:
quarters = ['_2024Q2','_2024Q3','_2024Q4','_2025Q1','_2025Q2']

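Now we need to create a list of unique column names. We can do this by iterating through the columns, splitting each name on the underscore, adding the base name to a set, and then converting the set to a list. KPI columns keep their `KPI_` prefix.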

In [8]:
unique_columns = set()
for column in columns:
    words = column.split('_')
    if words[0] != 'KPI':
        unique_columns.add(words[0])
    else:
        unique_columns.add(str(words[0]) + '_' + str(words[1]))
unique_columns = list(unique_columns)
for column in unique_columns:
    print(column)


KPI_TotalAssetTurnover
CapitalExpenditure
KPI_ReturnOnAssets
Ticker
NetIncome
TotalAssets
Price
OtherOperatingExpense
Market Value
LongTermDebt
KPI_ReturnOnEquity
Location
CashAndSTInvestments
TotalLiabilities
Name
InterestExpense
EPS
KPI_Leverage
Revenue
CashFromOps
KPI_NetProfitMargin
ShortTermDebtOrCurrentLiab
IncomeTaxExpense
Sector
Quantity
Weight (%)
Exchange
OperatingIncome
KPI_GrossProfitMargin
TotalEquity
CostOfRevenue
KPI_DebtToEquityRatio
Notional Value


Great, now we can drop all of the features that are not quarterly as we won't be calculating the differences for these.

In [9]:
drop = {'Notional Value','Name','Ticker','Exchange','Price','Quantity','Sector','Market Value','Location','Weight (%)'}

unique_columns = [c for c in unique_columns if c not in drop]
for c in unique_columns:
    print(c)

KPI_TotalAssetTurnover
CapitalExpenditure
KPI_ReturnOnAssets
NetIncome
TotalAssets
OtherOperatingExpense
LongTermDebt
KPI_ReturnOnEquity
CashAndSTInvestments
TotalLiabilities
InterestExpense
EPS
KPI_Leverage
Revenue
CashFromOps
KPI_NetProfitMargin
ShortTermDebtOrCurrentLiab
IncomeTaxExpense
OperatingIncome
KPI_GrossProfitMargin
TotalEquity
CostOfRevenue
KPI_DebtToEquityRatio
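The two steps above (collecting base names, then dropping the non-quarterly ones) can also be sketched in a single pass by matching the quarter suffix directly. This is only an illustrative alternative, not the approach used in this notebook: the helper `split_quarterly`, the pattern `QUARTER_SUFFIX`, and the example column list are assumptions made up for the sketch.

```python
import re

# Assumed pattern: quarterly columns end in "_<4-digit year>Q<1-4>"
QUARTER_SUFFIX = re.compile(r"_(\d{4}Q[1-4])$")

def split_quarterly(columns):
    """Return (sorted base names, sorted quarter suffixes) for quarterly columns."""
    bases, quarters = set(), set()
    for col in columns:
        match = QUARTER_SUFFIX.search(col)
        if match:  # columns without a quarter suffix are skipped
            bases.add(col[:match.start()])
            quarters.add("_" + match.group(1))
    return sorted(bases), sorted(quarters)

# Hypothetical column names in the same style as the dataset
example_cols = [
    "Ticker", "Market Value", "Weight (%)",
    "Revenue_2024Q2", "Revenue_2024Q3",
    "KPI_ReturnOnAssets_2024Q3", "KPI_ReturnOnAssets_2025Q1",
]
bases, quarters = split_quarterly(example_cols)
print(bases)
print(quarters)
```

Because only names with a quarter suffix are kept, static columns such as `Ticker` never enter the list, and names with a `KPI_` prefix need no special case.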


Okay, now that we have a solid list of the raw features that we want to calculate quarterly changes for, we can start building the function. It could return a series, or it could add the new columns directly to the dataset, which is likely more efficient. In terms of nesting, we will iterate through each unique column and then through the quarters, calculating the values and adding them to the dataframe before moving on to the next unique column.

In [10]:
def quarterly_changes(dataset, unique_columns, quarters):
    # These KPIs start one quarter later (no column for the first quarter)
    late_start = {'KPI_ReturnOnAssets', 'KPI_ReturnOnEquity', 'KPI_TotalAssetTurnover'}
    for column in unique_columns:
        start = 2 if column in late_start else 1
        for val in range(start, len(quarters)):
            series_1 = dataset[column + quarters[val-1]]
            series_2 = dataset[column + quarters[val]]
            # New column uses the short quarter labels, e.g. Revenue_QoQ_24Q2_24Q3
            dataset[f'{column}_QoQ_{quarters[val-1][-4:]}_{quarters[val][-4:]}'] = (series_2 - series_1)/series_1

    # The columns are added to the dataframe that was passed in, which is also returned
    print("Completed Quarterly Change Calculations")
    return dataset


In [11]:
# Run the function to see the output
dataset = quarterly_changes(dataset, unique_columns, quarters)
print(dataset.shape)
dataset.head()

Completed Quarterly Change Calculations
(2570, 211)


Unnamed: 0,Ticker,Name,Sector,CashAndSTInvestments_2024Q2,CashAndSTInvestments_2024Q3,CashAndSTInvestments_2024Q4,CashAndSTInvestments_2025Q1,CashAndSTInvestments_2025Q2,CashFromOps_2024Q2,CashFromOps_2024Q3,...,TotalEquity_QoQ_24Q4_25Q1,TotalEquity_QoQ_25Q1_25Q2,CostOfRevenue_QoQ_24Q2_24Q3,CostOfRevenue_QoQ_24Q3_24Q4,CostOfRevenue_QoQ_24Q4_25Q1,CostOfRevenue_QoQ_25Q1_25Q2,KPI_DebtToEquityRatio_QoQ_24Q2_24Q3,KPI_DebtToEquityRatio_QoQ_24Q3_24Q4,KPI_DebtToEquityRatio_QoQ_24Q4_25Q1,KPI_DebtToEquityRatio_QoQ_25Q1_25Q2
0,NVDA,NVIDIA CORP,Information Technology,,8563000000.0,9107000000.0,8589000000.0,15234000000.0,,14488000000.0,...,0.2,0.06,,0.2,0.19,0.64,,,,
1,MSFT,MICROSOFT CORP,Information Technology,18315000000.0,20840000000.0,17482000000.0,28828000000.0,30242000000.0,37195000000.0,34180000000.0,...,0.06,0.07,0.02,0.08,0.01,0.1,-0.18,-0.05,-0.1,-0.06
2,AAPL,APPLE INC,Information Technology,25565000000.0,29943000000.0,30299000000.0,28162000000.0,36269000000.0,28858000000.0,26811000000.0,...,0.0,-0.01,0.11,0.29,-0.24,-0.0,0.23,-0.23,0.01,0.05
3,AMZN,AMAZON COM INC,Consumer Discretionary,71178000000.0,75091000000.0,78779000000.0,66207000000.0,57741000000.0,25281000000.0,25971000000.0,...,0.07,0.09,0.1,0.22,-0.22,0.05,,,,
4,META,META PLATFORMS INC CLASS A,Communication,32045000000.0,43852000000.0,43889000000.0,28750000000.0,12005000000.0,19370000000.0,24724000000.0,...,0.01,0.05,0.01,0.2,-0.14,0.12,,,,
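The change computed above is (current - previous) / previous, so its edge cases are worth keeping in mind: a missing quarter gives NaN, and a previous value of zero gives an infinite result. The sketch below shows this on a made-up three-row frame. The toy values and the final clean-up step are assumptions for illustration and are not applied anywhere in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values: a normal case, a zero base, and a missing base
toy = pd.DataFrame({
    "Revenue_2024Q2": [100.0, 0.0, np.nan],
    "Revenue_2024Q3": [110.0, 5.0, 50.0],
})

prev, curr = toy["Revenue_2024Q2"], toy["Revenue_2024Q3"]
qoq = (curr - prev) / prev  # same formula as quarterly_changes

# Optional clean-up (an assumption, not done in the notebook):
# treat infinite changes as missing so they do not distort later fits
qoq_clean = qoq.replace([np.inf, -np.inf], np.nan)

print(qoq.tolist())
print(qoq_clean.tolist())
```

A negative previous value is another case to watch, because dividing by it flips the sign of the change.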


Let's now take all of these QoQ changes for each feature, put them in time order, and fit a line of best fit so that we can use its slope as a single rate of change over the past 5 quarters.
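As a small worked example of that slope, the sketch below fits a straight line to four made-up QoQ values with `np.polyfit` and checks the result against the closed-form least-squares expression. The numbers are hypothetical and do not come from the dataset.

```python
import numpy as np

# Hypothetical QoQ changes for one company, oldest first
qoq_values = np.array([0.05, 0.02, 0.04, 0.01])
t = np.arange(len(qoq_values))  # time index 0, 1, 2, 3

# Slope of the degree-1 least-squares fit (first coefficient)
slope = np.polyfit(t, qoq_values, 1)[0]

# Same slope from the closed-form OLS expression
t_centered = t - t.mean()
slope_closed = (t_centered * (qoq_values - qoq_values.mean())).sum() / (t_centered ** 2).sum()

print(round(slope, 4), round(slope_closed, 4))
```

The slope here is about -0.01, meaning the QoQ change falls by roughly one percentage point per quarter on average. In other words, the feature measures how the growth rate itself is trending, not the growth rate.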

In [12]:
# Let's get all of the columns again
columns = dataset.columns
# Use sets to avoid duplicates
QoQ_columns = set()
QoQ_quarters = set()
for column in columns:
    if 'QoQ' in column:
        # Pull out the common Column Name
        QoQ_columns.add(column[:-10])
        # Let's also pull out the QoQ
        QoQ_quarters.add(column[-10:])
QoQ_columns = list(QoQ_columns)
QoQ_quarters = list(QoQ_quarters)
# Order will be important so let's sort them
QoQ_quarters.sort()
QoQ_quarters


['_24Q2_24Q3', '_24Q3_24Q4', '_24Q4_25Q1', '_25Q1_25Q2']

Now that we have a list of all of the features, we need a function that pulls out the QoQ values for each feature, row by row, and fits the slope.

In [13]:
def get_QoQ_rate_data(row, QoQ_columns, QoQ_quarters):
    for column in QoQ_columns:
        values = []
        try:
            for quarter in QoQ_quarters:
                values.append(row[column + quarter])
        except KeyError:
            # Some variables don't have the initial quarter, so skip it
            for quarter in QoQ_quarters[1:]:
                values.append(row[column + quarter])
        # Note: NaNs are not removed from values, so a missing QoQ value
        # propagates to a NaN slope for that feature
        y = np.array(values,float)
        t = np.arange(len(y))
        # Get the OLS slope
        b = np.polyfit(t,y,1)[0]
        row[column + '_Rate'] = b
    return row
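If rows with a missing quarter should still receive a slope from their remaining values, the fit can be restricted to the finite points while keeping their original time positions. The following is a minimal sketch under that assumption. The helper `nan_aware_slope` is hypothetical and is not part of this notebook, and using it would give results that differ from the output shown here.

```python
import numpy as np

def nan_aware_slope(values, min_points=2):
    """OLS slope over the finite entries of values, keeping their time index.

    Returns NaN when fewer than min_points finite values are available.
    """
    y = np.asarray(values, dtype=float)
    t = np.arange(len(y))
    mask = np.isfinite(y)  # drops NaN and +/-inf
    if mask.sum() < min_points:
        return np.nan
    return np.polyfit(t[mask], y[mask], 1)[0]

# First quarter missing: the fit uses t = 1, 2, 3 only
print(nan_aware_slope([np.nan, 0.2, 0.19, 0.64]))
# Only one usable value: no slope can be fitted
print(nan_aware_slope([np.nan, np.nan, np.nan, 0.1]))
```

Masking `t` together with `y` matters: dropping NaNs from the values alone would shift the remaining points to earlier time positions and change the slope whenever the gap is in the middle of the series.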

Now that we have the function, let's apply it to every row of our dataset.

In [14]:
dataset = dataset.apply(lambda row: get_QoQ_rate_data(row, QoQ_columns, QoQ_quarters), axis=1)
dataset.head()

Unnamed: 0,Ticker,Name,Sector,CashAndSTInvestments_2024Q2,CashAndSTInvestments_2024Q3,CashAndSTInvestments_2024Q4,CashAndSTInvestments_2025Q1,CashAndSTInvestments_2025Q2,CashFromOps_2024Q2,CashFromOps_2024Q3,...,KPI_NetProfitMargin_QoQ_Rate,IncomeTaxExpense_QoQ_Rate,Revenue_QoQ_Rate,KPI_DebtToEquityRatio_QoQ_Rate,ShortTermDebtOrCurrentLiab_QoQ_Rate,OtherOperatingExpense_QoQ_Rate,TotalAssets_QoQ_Rate,InterestExpense_QoQ_Rate,CostOfRevenue_QoQ_Rate,TotalEquity_QoQ_Rate
0,NVDA,NVIDIA CORP,Information Technology,,8563000000.0,9107000000.0,8589000000.0,15234000000.0,,14488000000.0,...,,,,,,,,,,
1,MSFT,MICROSOFT CORP,Information Technology,18315000000.0,20840000000.0,17482000000.0,28828000000.0,30242000000.0,37195000000.0,34180000000.0,...,-0.03,-0.02,0.02,0.03,0.05,0.07,0.03,0.06,0.01,-0.0
2,AAPL,APPLE INC,Information Technology,25565000000.0,29943000000.0,30299000000.0,28162000000.0,36269000000.0,28858000000.0,26811000000.0,...,0.0,-0.77,-0.09,-0.03,-0.03,-0.0,-0.03,-0.1,-0.09,0.02
3,AMZN,AMAZON COM INC,Consumer Discretionary,71178000000.0,75091000000.0,78779000000.0,66207000000.0,57741000000.0,25281000000.0,25971000000.0,...,-0.03,-0.17,-0.03,,,0.01,-0.0,-0.02,-0.06,-0.0
4,META,META PLATFORMS INC CLASS A,Communication,32045000000.0,43852000000.0,43889000000.0,28750000000.0,12005000000.0,19370000000.0,24724000000.0,...,-0.06,-0.07,-0.01,,,0.05,-0.02,-0.21,-0.0,-0.01


In [15]:
# Use one QoQ column as a sample to gauge how much data is missing
missing = dataset['CapitalExpenditure_QoQ_25Q1_25Q2'].isna().sum()
total = dataset.shape[0]
ratio = (missing/total)*100
print(f'The percent of missing data in the rates is {np.round(ratio,2)}%')

The percent of missing data in the rates is 5.99%
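The check above looks at a single column. To see missingness across every derived column at once, the same idea can be applied to all columns sharing a suffix. This is a sketch only: `missing_percent` is a hypothetical helper and the toy frame is made up, so on the real data it would be called on `dataset` with a suffix such as `'_Rate'`.

```python
import numpy as np
import pandas as pd

def missing_percent(df, suffix):
    """Percent of missing values per column whose name ends with suffix."""
    cols = [c for c in df.columns if c.endswith(suffix)]
    # isna().mean() is the fraction of missing rows in each column
    return df[cols].isna().mean().mul(100).sort_values(ascending=False)

# Made-up frame in the style of the rate columns
toy = pd.DataFrame({
    "Ticker": ["A", "B", "C", "D"],
    "Revenue_QoQ_Rate": [0.02, np.nan, -0.03, 0.01],
    "EPS_QoQ_Rate": [np.nan, np.nan, 0.10, 0.05],
})
report = missing_percent(toy, "_Rate")
print(report)
```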


## Final Section to Save new CSV

In [17]:
# Uncomment this block when we want to save to a CSV file
#dataset.to_csv("Russel_3000_09-18_merged_cleaned_KPIs_QoQ.csv", index=False)