# Troubleshooting QoQ Calculations
In the cleaning file we removed all raw NaN values, so the QoQ feature engineering file must be introducing new ones. I'm going to use this notebook to investigate why that is happening. Let's start by importing some data-manipulation libraries.


In [3]:
import numpy as np
import pandas as pd

Now, let's import the files that we will be working with.

In [5]:
path = '../../datasets/X_train_filled_KPIs_QoQ.csv'
df = pd.read_csv(path, index_col=0)
df.head()

| Ticker | Name | Sector | CapitalExpenditure_2024Q2 | CapitalExpenditure_2024Q3 | CapitalExpenditure_2024Q4 | CapitalExpenditure_2025Q1 | CashAndSTInvestments_2024Q2 | CashAndSTInvestments_2024Q3 | CashAndSTInvestments_2024Q4 | CashAndSTInvestments_2025Q1 | ... | KPI_ReturnOnEquity_Rate | TotalEquity_Rate | OperatingIncome_Rate | TotalDebt_Rate | CashFromOps_Rate | KPI_CashFlow_Rate | IncomeTaxExpense_Rate | KPI_NetProfitMargin_Rate | InterestExpense_Rate | KPI_GrossProfitMargin_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CAL | CALERES INC | Consumer Discretionary | -8754000.0 | -11482000.0 | -18520000.0 | -11358000.0 | 68348500.0 | 51753000.0 | 33685000.0 | 29636000.0 | ... | -0.027528 | 80650630.0 | 2017500.0 | 107459900.0 | -9265050.0 | -9265050.0 | -1271100.0 | -0.005052 | 237200.0 | 0.007471 |
| BLDR | BUILDERS FIRSTSOURCE INC | Industrials | -88107000.0 | -99578000.0 | -99672000.0 | -99974000.0 | 75569000.0 | 328103000.0 | 153624000.0 | 115371000.0 | ... | -0.021435 | 5835100.0 | -103900300.0 | 226483600.0 | -131576300.0 | -131576300.0 | -23858000.0 | -0.017014 | 3747400.0 | -0.007441 |
| PR | PERMIAN RESOURCES CORP CLASS A | Energy | -682937000.0 | -1277047000.0 | -541242000.0 | -537805000.0 | 47849000.0 | 272026000.0 | 479343000.0 | 702236000.0 | ... | -0.003852 | 358642100.0 | 20500400.0 | 48600000.0 | -20398600.0 | -20398600.0 | 1036300.0 | 0.0001 | 547600.0 | -0.000933 |
| AVTR | AVANTOR INC | Health Care | -45800000.0 | -40800000.0 | -27500000.0 | -28000000.0 | 272600000.0 | 285300000.0 | 261900000.0 | 315700000.0 | ... | 9e-05 | 239310000.0 | -1830000.0 | -389020000.0 | -58690000.0 | -58690000.0 | 6560000.0 | 0.022167 | -5990000.0 | -0.000522 |
| KRYS | KRYSTAL BIOTECH INC | Health Care | -1131000.0 | -1046000.0 | -801000.0 | -6204000.0 | 345786000.0 | 373966000.0 | 344865000.0 | 308770000.0 | ... | 0.002745 | 49787700.0 | 5177900.0 | 571000.0 | 9939700.0 | 9939700.0 | 2270300.0 | 0.072596 | -456900.0 | 0.011086 |


Let's grab all of the columns so that we can easily reference this list if we need it.

In [27]:
columns = df.columns.tolist()
for col in sorted(columns):
    print(col)

CapitalExpenditure_2024Q2
CapitalExpenditure_2024Q3
CapitalExpenditure_2024Q4
CapitalExpenditure_2025Q1
CapitalExpenditure_QoQ_24Q2_24Q3
CapitalExpenditure_QoQ_24Q3_24Q4
CapitalExpenditure_QoQ_24Q4_25Q1
CapitalExpenditure_QoQ_Rate
CapitalExpenditure_Rate
CashAndSTInvestments_2024Q2
CashAndSTInvestments_2024Q3
CashAndSTInvestments_2024Q4
CashAndSTInvestments_2025Q1
CashAndSTInvestments_QoQ_24Q2_24Q3
CashAndSTInvestments_QoQ_24Q3_24Q4
CashAndSTInvestments_QoQ_24Q4_25Q1
CashAndSTInvestments_QoQ_Rate
CashAndSTInvestments_Rate
CashFromOps_2024Q2
CashFromOps_2024Q3
CashFromOps_2024Q4
CashFromOps_2025Q1
CashFromOps_QoQ_24Q2_24Q3
CashFromOps_QoQ_24Q3_24Q4
CashFromOps_QoQ_24Q4_25Q1
CashFromOps_QoQ_Rate
CashFromOps_Rate
CostOfRevenue_2024Q2
CostOfRevenue_2024Q3
CostOfRevenue_2024Q4
CostOfRevenue_2025Q1
CostOfRevenue_QoQ_24Q2_24Q3
CostOfRevenue_QoQ_24Q3_24Q4
CostOfRevenue_QoQ_24Q4_25Q1
CostOfRevenue_QoQ_Rate
CostOfRevenue_Rate
CurrentAssets_2024Q2
CurrentAssets_2024Q3
CurrentAssets_2024Q4
Current

Alright, let's now take a look at which columns have NaN values so we can start investigating why.

In [16]:
print(f"The column with the most NaN values has {df.isna().sum().max()} missing values.")
df.iloc[:,-80:-40].isna().sum()


The column with the most NaN values has 275 missing values.


OperatingIncome_QoQ_24Q2_24Q3            0
OperatingIncome_QoQ_24Q3_24Q4            0
OperatingIncome_QoQ_24Q4_25Q1            0
TotalDebt_QoQ_24Q2_24Q3                  0
TotalDebt_QoQ_24Q3_24Q4                  0
TotalDebt_QoQ_24Q4_25Q1                  0
CashFromOps_QoQ_24Q2_24Q3                0
CashFromOps_QoQ_24Q3_24Q4                0
CashFromOps_QoQ_24Q4_25Q1                0
KPI_CashFlow_QoQ_24Q2_24Q3               0
KPI_CashFlow_QoQ_24Q3_24Q4               0
KPI_CashFlow_QoQ_24Q4_25Q1               0
IncomeTaxExpense_QoQ_24Q2_24Q3          30
IncomeTaxExpense_QoQ_24Q3_24Q4          25
IncomeTaxExpense_QoQ_24Q4_25Q1          15
KPI_NetProfitMargin_QoQ_24Q2_24Q3        0
KPI_NetProfitMargin_QoQ_24Q3_24Q4        0
KPI_NetProfitMargin_QoQ_24Q4_25Q1        0
InterestExpense_QoQ_24Q2_24Q3            9
InterestExpense_QoQ_24Q3_24Q4           12
InterestExpense_QoQ_24Q4_25Q1            6
KPI_GrossProfitMargin_QoQ_24Q2_24Q3    274
KPI_GrossProfitMargin_QoQ_24Q3_24Q4    275
KPI_GrossPr
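Slicing with `iloc` only shows one window of columns at a time, so here is a minimal sketch of a summary that lists just the columns containing NaN values, largest first. The helper name `nan_summary` and the toy frame are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd


def nan_summary(frame: pd.DataFrame) -> pd.Series:
    """Return NaN counts for columns with at least one NaN, largest first."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy stand-in for df (made-up values, not the real dataset).
toy = pd.DataFrame(
    {
        "A_QoQ": [0.1, np.nan, np.nan],
        "B_QoQ": [0.2, 0.3, 0.4],
        "C_QoQ": [np.nan, 0.5, 0.6],
    }
)

summary = nan_summary(toy)
print(summary)
```

On the real data this would be called as `nan_summary(df)`, which avoids having to guess which column window holds the problem columns.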

Alright, so it looks like the columns with the most missing values are the gross profit margin QoQ calculations. So, let's filter our dataframe with a boolean mask and then investigate the columns behind them.

In [18]:
missing_df = df[df['KPI_GrossProfitMargin_QoQ_24Q3_24Q4'].isna()]
missing_df.shape

(275, 282)

Let's also check whether this captures most of the other NaN values in the dataframe.

In [20]:
non_missing_df = df[~df['KPI_GrossProfitMargin_QoQ_24Q3_24Q4'].isna()]
print(f"Maximum number of missing values in any column after we set aside the gross profit margin rows is {non_missing_df.isna().sum().max()}")

Maximum number of missing values in any column after we set aside the gross profit margin rows is 25


Okay, so we will still have missing values after this that we will have to address.

In [62]:
columns_to_investigate = ['KPI_GrossProfitMargin_2024Q3','KPI_GrossProfitMargin_2024Q4','KPI_GrossProfitMargin_QoQ_24Q3_24Q4']
missing_df[columns_to_investigate].sample(10)

| Ticker | KPI_GrossProfitMargin_2024Q3 | KPI_GrossProfitMargin_2024Q4 | KPI_GrossProfitMargin_QoQ_24Q3_24Q4 |
|---|---|---|---|
| GSBC | 0.0 | 0.0 | NaN |
| TFC | 0.0 | 0.0 | NaN |
| FULT | 0.0 | 0.0 | NaN |
| SCHW | 0.0 | 0.0 | NaN |
| KEY | 0.0 | 0.0 | NaN |
| SPFI | 0.0 | 0.0 | NaN |
| CWBC | 0.0 | 0.0 | NaN |
| MTB | 0.0 | 0.0 | NaN |
| VOYA | 0.0 | 0.0 | NaN |
| ARI | -0.0 | 0.0 | NaN |


After iterating through about 10 samples of 10, I can comfortably say that nearly all of these NaN values come from companies whose gross profit margin is 0 in both quarters, so the QoQ calculation ends up dividing 0 by 0. We will be able to address these pretty easily with an `if denominator == 0` check. Let's start investigating the non-missing dataframe for the NaN values that are left.
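Before moving on, here is a rough sketch of what that zero-denominator check could look like in vectorized form. It assumes the QoQ change is computed as `(curr - prev) / prev`; the helper name `safe_qoq`, the choice of `0.0` as the fill value, and the tickers `XYZ` and `NEW` are all hypothetical. Whether a move from 0 to a non-zero value should also be filled with 0 is a modelling decision this sketch does not settle.

```python
import numpy as np
import pandas as pd


def safe_qoq(prev: pd.Series, curr: pd.Series, fill: float = 0.0) -> pd.Series:
    """QoQ change (curr - prev) / prev, returning `fill` wherever prev == 0."""
    # Turn zero denominators (including -0.0) into NaN so the division
    # never produces inf, then replace those positions with the placeholder.
    denom = prev.where(prev != 0)
    return ((curr - prev) / denom).fillna(fill)


# GSBC and ARI use the margins sampled above; XYZ and NEW are made up.
prev = pd.Series([0.0, -0.0, 0.25, 0.0], index=["GSBC", "ARI", "XYZ", "NEW"])
curr = pd.Series([0.0, 0.0, 0.30, 0.10], index=prev.index)

naive = (curr - prev) / prev    # 0/0 gives NaN, x/0 gives inf
guarded = safe_qoq(prev, curr)  # no NaN or inf left

print(pd.DataFrame({"naive": naive, "guarded": guarded}))
```

Note that `fillna` would also overwrite NaN values already present in the inputs, which is only acceptable here because the raw NaN values were removed in the cleaning file.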

In [64]:
non_missing_df.iloc[:,-40:].isna().sum()

KPI_DebtToEquityRatio_QoQ_Rate     0
CashAndSTInvestments_QoQ_Rate      1
CashFromOps_QoQ_Rate               0
OperatingIncome_QoQ_Rate           0
LongTermDebt_QoQ_Rate              0
InterestExpense_QoQ_Rate          14
TotalAssets_QoQ_Rate               0
TotalLiabilities_QoQ_Rate          0
EPS_QoQ_Rate                       0
CapitalExpenditure_QoQ_Rate        6
IncomeTaxExpense_QoQ_Rate         25
OtherOperatingExpense_QoQ_Rate     0
KPI_WorkingCapital_Rate            0
KPI_TotalAssetTurnover_Rate        0
KPI_CurrentRatio_Rate              0
KPI_Leverage_Rate                  0
CurrentAssets_Rate                 0
CashAndSTInvestments_Rate          0
KPI_DebtToEquityRatio_Rate         0
KPI_ReturnOnAssets_Rate            0
EPS_Rate                           0
TotalAssets_Rate                   0
LongTermDebt_Rate                  0
CurrentLiabilities_Rate            0
TotalLiabilities_Rate              0
NetIncome_Rate                     0
Revenue_Rate                       0
C

Alright, now the column with the most missing values is the income tax expense QoQ rate, so let's investigate it further. The important thing here is that the rate is the slope of the line of best fit through all of the QoQ values, so we need to look at every one of them.
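As a rough illustration of why every QoQ value matters, the sketch below computes an ordinary least-squares slope over three values and shows that a single NaN or inf makes the slope NaN. It assumes the rate is a plain least-squares slope against the period index 0, 1, 2; the helper name `ols_slope` and the sample numbers are illustrative, and the actual feature engineering code may differ.

```python
import numpy as np


def ols_slope(y) -> float:
    """Least-squares slope of y against x = 0, 1, ..., len(y) - 1."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    # inf - inf and 0 * inf evaluate to NaN; silence numpy's warning for them.
    with np.errstate(invalid="ignore"):
        x_c = x - x.mean()
        y_c = y - y.mean()
        return float((x_c * y_c).sum() / (x_c ** 2).sum())


clean = ols_slope([0.10, 0.20, 0.30])       # well-defined slope
with_nan = ols_slope([-1.0, np.nan, 0.5])   # one NaN poisons the mean
with_inf = ols_slope([-1.0, 0.5, np.inf])   # inf - inf turns into NaN

print(clean, with_nan, with_inf)
```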

In [65]:
missing_income = non_missing_df[non_missing_df['IncomeTaxExpense_QoQ_Rate'].isna()]
missing_income.shape

(25, 282)

In [69]:
columns_to_investigate = ['IncomeTaxExpense_QoQ_24Q2_24Q3','IncomeTaxExpense_QoQ_24Q3_24Q4','IncomeTaxExpense_QoQ_24Q4_25Q1','IncomeTaxExpense_QoQ_Rate']
missing_income[columns_to_investigate].head(10)

| Ticker | IncomeTaxExpense_QoQ_24Q2_24Q3 | IncomeTaxExpense_QoQ_24Q3_24Q4 | IncomeTaxExpense_QoQ_24Q4_25Q1 | IncomeTaxExpense_QoQ_Rate |
|---|---|---|---|---|
| PACS | -1.0 | NaN | inf | NaN |
| CLDT | NaN | NaN | NaN | NaN |
| MVST | NaN | NaN | NaN | NaN |
| NUVB | NaN | NaN | NaN | NaN |
| CHRS | NaN | NaN | NaN | NaN |
| CBLL | NaN | NaN | NaN | NaN |
| XRX | -1.0 | NaN | inf | NaN |
| NAGE | NaN | NaN | inf | NaN |
| SPR | NaN | inf | 25.046667 | NaN |
| NCMI | NaN | inf | -1.0 | NaN |


So here we have a bunch of NaN and inf values, which implies that the QoQ calculation is likely the culprit. Let's investigate that further by looking at the underlying quarterly values.
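One thing worth keeping in mind is that `isna()` does not count inf, so the NaN totals above understate how many QoQ values are unusable. A small sketch of counting the two separately, using a toy frame built from the PACS, CLDT and SPR rows shown above (the variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in built from three of the rows shown above.
toy = pd.DataFrame(
    {
        "IncomeTaxExpense_QoQ_24Q2_24Q3": [-1.0, np.nan, np.nan],
        "IncomeTaxExpense_QoQ_24Q3_24Q4": [np.nan, np.nan, np.inf],
        "IncomeTaxExpense_QoQ_24Q4_25Q1": [np.inf, np.nan, 25.046667],
    },
    index=["PACS", "CLDT", "SPR"],
)

numeric = toy.select_dtypes(include="number")
nan_counts = numeric.isna().sum()      # NaN only
inf_counts = np.isinf(numeric).sum()   # +inf and -inf, which isna() ignores

print(pd.DataFrame({"nan": nan_counts, "inf": inf_counts}))
```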

In [70]:
columns_to_investigate = ['IncomeTaxExpense_2024Q2','IncomeTaxExpense_2024Q3','IncomeTaxExpense_2024Q4','IncomeTaxExpense_2025Q1']
missing_income[columns_to_investigate].head(10)

| Ticker | IncomeTaxExpense_2024Q2 | IncomeTaxExpense_2024Q3 | IncomeTaxExpense_2024Q4 | IncomeTaxExpense_2025Q1 |
|---|---|---|---|---|
| PACS | -1474000.0 | 0.0 | 0.0 | 2500.0 |
| CLDT | 0.0 | 0.0 | 0.0 | 0.0 |
| MVST | 0.0 | 0.0 | 0.0 | 0.0 |
| NUVB | 0.0 | 0.0 | 0.0 | 0.0 |
| CHRS | 0.0 | 0.0 | 0.0 | 0.0 |
| CBLL | 0.0 | 0.0 | 0.0 | 0.0 |
| XRX | 841000.0 | 0.0 | 0.0 | 622500.0 |
| NAGE | 0.0 | 0.0 | 0.0 | 168000.0 |
| SPR | 0.0 | 0.0 | 300000.0 | 7814000.0 |
| NCMI | 0.0 | 0.0 | 200000.0 | 0.0 |
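These raw values line up with the NaN and inf pattern seen in the QoQ columns. As a quick check, the sketch below recomputes the QoQ change for three of the tickers, assuming the formula is `(curr - prev) / prev`. That formula is inferred from the numbers shown here, not taken from the feature engineering file itself.

```python
import numpy as np
import pandas as pd

# Raw IncomeTaxExpense values copied from the PACS, CLDT and SPR rows above.
raw = pd.DataFrame(
    [
        [-1474000.0, 0.0, 0.0, 2500.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 300000.0, 7814000.0],
    ],
    index=["PACS", "CLDT", "SPR"],
    columns=["2024Q2", "2024Q3", "2024Q4", "2025Q1"],
)

prev = raw.iloc[:, :-1].to_numpy()  # Q2, Q3, Q4
curr = raw.iloc[:, 1:].to_numpy()   # Q3, Q4, Q1

# 0/0 yields NaN and x/0 yields inf; silence numpy's warnings for both.
with np.errstate(divide="ignore", invalid="ignore"):
    qoq = pd.DataFrame(
        (curr - prev) / prev,
        index=raw.index,
        columns=["24Q2_24Q3", "24Q3_24Q4", "24Q4_25Q1"],
    )

print(qoq)
```

A zero in the earlier quarter gives NaN when the later quarter is also zero and inf when it is not, which is the pattern in the QoQ columns for these tickers.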
