# Data Cleaning Notebook

In [1]:
# reference source code
import sys, os
sys.path.append(os.path.abspath(".."))

In [2]:
# data cleaning and creation imports
import pandas as pd
import os
import glob
import itertools

# mitigate warnings
import warnings
warnings.filterwarnings("ignore")

# internal imports
from src.utils import nba_teams, team_map
from src.data_loader import *


### Utils Directory Checks

In [3]:
print(len(nba_teams))   # should be 30
print(team_map["LAL"])  # "Los Angeles Lakers"

30
Los Angeles Lakers


## Creating Compressed Data

In [4]:
# # THE CODE TO CREATE PROCESSED GZIP FILES
# root_dir = "/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/team-stats"

# for subdir, _, files in os.walk(root_dir):
#     for file in files:
#         if file.endswith(".csv"):
#             file_path = os.path.join(subdir, file)
#             gz_path = file_path + ".gz"

#             try:
#                 df = pd.read_csv(file_path)

#                 if df.empty:
#                     print(f"Skipping empty file: {file_path}")
#                     continue

#                 print(f"Compressing {file_path} -> {gz_path}")
#                 df.to_csv(gz_path, index=False, compression="gzip")

#             except pd.errors.EmptyDataError:
#                 print(f"Skipping empty file (no data): {file_path}")


## CSV Cleaning

I have already lightly cleaned each CSV file by hand in Microsoft Excel; the deeper cleaning is done here in this ipynb file with Python libraries.

In Microsoft Excel, I:
- eliminated unnecessary rows and columns from the dataset
- added in the columns necessary for aggregation and merging
- formatted everything correctly to prepare it for python data manipulation

In [5]:
team_records_df = load_team_records()

For team_records_df, the wins and losses need to be split into separate columns. This step has been moved out of the notebook and into data_loader.py.
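
As a rough illustration of that split, here is a minimal sketch. It assumes the raw records arrive as "W-L" strings in a column hypothetically named `Record`; the real column names and logic in data_loader.py may differ.

```python
import pandas as pd

# Toy input: "Record" is an assumed column name holding "W-L" strings.
records = pd.DataFrame({
    "Team": ["Atlanta Hawks", "Atlanta Hawks"],
    "Season": [2016, 2017],
    "Record": ["48-34", "43-39"],
})

# Split each "W-L" string into two integer columns, then drop the original.
records[["W", "L"]] = records["Record"].str.split("-", expand=True).astype(int)
records = records.drop(columns="Record")

print(records)
```

Casting to `int` right away keeps later arithmetic (win percentage, scaling) from silently operating on strings.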

In [6]:
team_stats_df = load_team_stats()
team_stats_df.head().shape

(5, 28)

### Optimal Dataframe Construction

I now want to merge the dataframes that contain the team records and the team stats (FG%, FGA, MIN). I will merge them on team and season.

In [7]:
# Merge on Team and Season
team_df = merge_team_data(team_stats_df, team_records_df)

print(team_df.head())
print("-----")
print(f"The shape of the dataframe: {team_df.shape}")

   Season           Team  GP   W   L   WIN%   Min    PTS   FGM   FGA  ...  \
0    2016  Atlanta Hawks  82  48  34  0.585  48.4  102.8  38.6  84.4  ...   
1    2017  Atlanta Hawks  82  43  39  0.524  48.5  103.2  38.1  84.4  ...   
2    2018  Atlanta Hawks  82  24  58  0.293  48.1  103.4  38.2  85.5  ...   
3    2019  Atlanta Hawks  82  29  53  0.354  48.4  113.3  41.4  91.8  ...   
4    2020  Atlanta Hawks  67  20  47  0.299  48.6  111.8  40.6  90.6  ...   

   Road_W  Road_L  E_W  E_L  W_W  W_L  Pre-ASG_W  Pre-ASG_L  Post-ASG_W  \
0      21      20   29   23   19   11         31         24          17   
1      20      21   30   22   13   17         32         24          15   
2       8      33   12   40   18   12         18         41          17   
3      29      12   16   36   13   17         19         39          14   
4      27       6   11   32   15    9         15         41           6   

   Post-ASG_L  
0          10  
1          11  
2           6  
3          10  
4     

In [8]:
draft_df = load_draft()
draft_df.head()

Unnamed: 0,Season,Team,FirstRoundPicks,SecondRoundPicks
0,2015,Atlanta Hawks,1,2
1,2015,Boston Celtics,2,2
2,2015,Brooklyn Nets,1,1
3,2015,Charlotte Hornets,1,1
4,2015,Chicago Bulls,1,0


In [9]:
coach_df = load_coaches()
coach_df.head(5)

Unnamed: 0,Coach,Yw/Franch,YOverall,CareerW,CareerL,CareerW%,Season,Team
0,Quin Snyder,3,11,458,363,0.558,2025,Atlanta Hawks
1,Joe Mazzulla,3,3,182,64,0.74,2025,Boston Celtics
2,Jordi Fernandez,1,1,26,56,0.317,2025,Brooklyn Nets
3,Billy Donovan,5,10,438,362,0.548,2025,Chicago Bulls
4,Charles Lee,1,1,19,63,0.232,2025,Charlotte Hornets


To avoid duplicate rows, I will turn the draft pick column into two columns: the number of First Round picks (Picks 1-30) and the number of Second Round picks (Picks 31-60). This process has been moved from the notebook and integrated within the "load_coaches" function in data_loader.py.
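
The pick-to-round aggregation described above could look roughly like the sketch below. The pick-level layout and the `Pk` column (overall pick number) are assumptions, and the team names are placeholders; the loader's actual implementation may differ.

```python
import pandas as pd

# Hypothetical pick-level data: one row per draft pick.
picks = pd.DataFrame({
    "Season": [2015, 2015, 2015, 2015],
    "Team": ["Team A", "Team A", "Team A", "Team B"],
    "Pk": [5, 35, 48, 12],
})

# Flag the round of each pick, then sum the flags per team-season.
picks["FirstRoundPicks"] = picks["Pk"].between(1, 30).astype(int)
picks["SecondRoundPicks"] = picks["Pk"].between(31, 60).astype(int)

draft_counts = (
    picks.groupby(["Season", "Team"], as_index=False)[
        ["FirstRoundPicks", "SecondRoundPicks"]
    ].sum()
)

print(draft_counts)
```

Collapsing to one row per team-season is what lets the later merge on `Team` and `Season` stay free of draft-related duplicates.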

In [10]:
# Check draft_df for missing teams
draft_2024 = draft_df[draft_df["Season"] == 2024]["Team"].unique()
draft_2025 = draft_df[draft_df["Season"] == 2025]["Team"].unique()

print("draft_df → Missing in 2024:", set(nba_teams) - set(draft_2024))
print("draft_df → Missing in 2025:", set(nba_teams) - set(draft_2025))


# Check coach_df for missing teams
coach_2024 = coach_df[coach_df["Season"] == 2024]["Team"].unique()
coach_2025 = coach_df[coach_df["Season"] == 2025]["Team"].unique()

print("coach_df → Missing in 2024:", set(nba_teams) - set(coach_2024))
print("coach_df → Missing in 2025:", set(nba_teams) - set(coach_2025))

draft_df → Missing in 2024: set()
draft_df → Missing in 2025: set()
coach_df → Missing in 2024: set()
coach_df → Missing in 2025: set()
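
Since the same coverage check recurs for several dataframes in this notebook, it could be wrapped in a small helper. The function name `missing_teams` and the three-team toy league are hypothetical; this is a sketch, not part of data_loader.py.

```python
import pandas as pd

def missing_teams(df, season, all_teams):
    """Return the teams in all_teams that have no row in df for the given season."""
    present = set(df.loc[df["Season"] == season, "Team"])
    return set(all_teams) - present

# Toy example with a hypothetical three-team league.
league = ["Team A", "Team B", "Team C"]
toy = pd.DataFrame({
    "Season": [2024, 2024, 2025, 2025, 2025],
    "Team": ["Team A", "Team B", "Team A", "Team B", "Team C"],
})

print("Missing in 2024:", missing_teams(toy, 2024, league))
print("Missing in 2025:", missing_teams(toy, 2025, league))
```

In the notebook, `all_teams` would be `nba_teams`, and the helper could be called in a loop over every season rather than only 2024 and 2025.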


I now want to merge the dataframes that contain each coach's tenure and each team's draft picks. I will merge them on team and season.

In [11]:
# Merge on Team and Season
front_office_df = merge_front_office(coach_df, draft_df)

print(front_office_df.head())
print("-----")
print(f"The shape of the dataframe: {front_office_df.shape}")

             Coach  Yw/Franch  YOverall  CareerW  CareerL  CareerW%  Season  \
0      Quin Snyder          3        11      458      363     0.558    2025   
1     Joe Mazzulla          3         3      182       64     0.740    2025   
2  Jordi Fernandez          1         1       26       56     0.317    2025   
3    Billy Donovan          5        10      438      362     0.548    2025   
4      Charles Lee          1         1       19       63     0.232    2025   

                Team  FirstRoundPicks  SecondRoundPicks  Coach_Count  
0      Atlanta Hawks                2                 0            1  
1     Boston Celtics                1                 1            1  
2      Brooklyn Nets                4                 1            1  
3      Chicago Bulls                1                 1            1  
4  Charlotte Hornets                1                 2            1  
-----
The shape of the dataframe: (328, 11)


I now need to add a column to this front office dataframe that counts the number of coaches each team has in each season. This process has been moved from the notebook and integrated within the "merge_front_office" function in data_loader.py.
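
One way to produce such a count is a grouped `transform`, sketched below with placeholder team and coach names. This is an illustration under those assumptions, not necessarily how merge_front_office computes `Coach_Count`.

```python
import pandas as pd

# Hypothetical coaching rows: Team A changed coaches during the 2021 season.
coaches = pd.DataFrame({
    "Season": [2021, 2021, 2022],
    "Team": ["Team A", "Team A", "Team A"],
    "Coach": ["Coach X", "Coach Y", "Coach Y"],
})

# Count distinct coaches per team-season and broadcast the count to every row.
coaches["Coach_Count"] = (
    coaches.groupby(["Team", "Season"])["Coach"].transform("nunique")
)

print(coaches)
```

`transform` keeps the original row count, so a season with two coaches still has two rows, each carrying `Coach_Count == 2`.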

In [12]:
# Check front_office_df for missing teams
fo_2024 = front_office_df[front_office_df["Season"] == 2024]["Team"].unique()
fo_2025 = front_office_df[front_office_df["Season"] == 2025]["Team"].unique()

print("Missing in 2024:", set(nba_teams) - set(fo_2024))
print("Missing in 2025:", set(nba_teams) - set(fo_2025))

Missing in 2024: set()
Missing in 2025: set()


I will now add the team payroll data to the front office dataframe. This will give my front office dataset that last bit of kick to make accurate predictions on the future of NBA teams!

In [13]:
team_payroll_df = load_payroll()
print(team_payroll_df.head())

                Team  Season     Payroll
0      Atlanta Hawks    2016  71661760.0
1     Boston Celtics    2016  77141919.0
2      Brooklyn Nets    2016  80258302.0
3  Charlotte Hornets    2016  76860006.0
4      Chicago Bulls    2016  87073838.0


It seems I have misspelled some of the team names and need to find out which ones. This is one of the errors to expect from manually entered data.

In [14]:
team_payroll_df.info

<bound method DataFrame.info of                    Team  Season      Payroll
0         Atlanta Hawks    2016   71661760.0
1        Boston Celtics    2016   77141919.0
2         Brooklyn Nets    2016   80258302.0
3     Charlotte Hornets    2016   76860006.0
4         Chicago Bulls    2016   87073838.0
..                  ...     ...          ...
415    Sacramento Kings    2029          NaN
416   San Antonio Spurs    2029   86198976.0
417     Toronto Raptors    2029  109924508.0
418           Utah Jazz    2029   53536209.0
419  Washington Wizards    2029          NaN

[420 rows x 3 columns]>
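
A set difference against the reference team list is one direct way to surface misspelled names. The sketch below uses a made-up typo and a three-team stand-in for `nba_teams`; the `difflib` suggestion step is an optional extra, not something the notebook itself does.

```python
import difflib
import pandas as pd

# Stand-in for nba_teams, shortened for illustration.
reference_teams = ["Atlanta Hawks", "Boston Celtics", "Brooklyn Nets"]

# Hypothetical payroll rows containing one misspelled team name.
payroll = pd.DataFrame({
    "Team": ["Atlanta Hawks", "Boston Celtcs", "Brooklyn Nets"],
    "Season": [2016, 2016, 2016],
})

# Names in the data that are not in the reference list, and vice versa.
unknown_names = sorted(set(payroll["Team"]) - set(reference_teams))
missing_names = sorted(set(reference_teams) - set(payroll["Team"]))

# Suggest the closest reference name for each unknown spelling.
suggestions = {
    name: difflib.get_close_matches(name, reference_teams, n=1)
    for name in unknown_names
}

print("Unknown:", unknown_names)
print("Missing:", missing_names)
print("Suggestions:", suggestions)
```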

In [15]:
front_office_df = merge_front_office(coach_df, draft_df, team_payroll_df)

print(front_office_df.head())
print("Shape:", front_office_df.shape)

             Coach  Yw/Franch  YOverall  CareerW  CareerL  CareerW%  Season  \
0      Quin Snyder          3        11      458      363     0.558    2025   
1     Joe Mazzulla          3         3      182       64     0.740    2025   
2  Jordi Fernandez          1         1       26       56     0.317    2025   
3    Billy Donovan          5        10      438      362     0.548    2025   
4      Charles Lee          1         1       19       63     0.232    2025   

                Team  FirstRoundPicks  SecondRoundPicks  Coach_Count  \
0      Atlanta Hawks                2                 0            1   
1     Boston Celtics                1                 1            1   
2      Brooklyn Nets                4                 1            1   
3      Chicago Bulls                1                 1            1   
4  Charlotte Hornets                1                 2            1   

       Payroll  
0  170057021.0  
1  195348491.0  
2  168312896.0  
3  165722496.0  
4  1685

In [16]:
front_office_df[front_office_df["Payroll"].isna()]

Unnamed: 0,Coach,Yw/Franch,YOverall,CareerW,CareerL,CareerW%,Season,Team,FirstRoundPicks,SecondRoundPicks,Coach_Count,Payroll


### Strength of Schedule Calculation

Next, I need to use the dataframe containing each team's head-to-head records against every other team to calculate each team's strength of schedule, which should give me a more accurate prediction of the number of wins each team will have in the 2025-26 season.

The formula I will use for the calculation is:
∑(Opposing Team Win Pct * Games vs the Opponent) / Total Games

I have implemented this formula in the calculate_sos function in data_loader.py.
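
To make the formula concrete, here is a small worked sketch on a hypothetical three-team league. The win percentages, head-to-head records, and the helper name `strength_of_schedule` are all invented for illustration; calculate_sos may be organized differently.

```python
import pandas as pd

# Hypothetical season win percentages.
win_pct = {"Team A": 0.600, "Team B": 0.500, "Team C": 0.400}

# Hypothetical head-to-head matrix of "W-L" strings (row team vs. column team).
head_to_head = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "Team A": [None, "1-2", "1-3"],
    "Team B": ["2-1", None, "2-2"],
    "Team C": ["3-1", "2-2", None],
}).set_index("Team")

def strength_of_schedule(team):
    """Sum(opponent win pct * games vs. opponent) / total games."""
    weighted, total_games = 0.0, 0
    for opponent, record in head_to_head.loc[team].dropna().items():
        wins, losses = (int(x) for x in record.split("-"))
        games = wins + losses
        weighted += win_pct[opponent] * games
        total_games += games
    return weighted / total_games

sos = {team: strength_of_schedule(team) for team in head_to_head.index}
print(sos)
```

For Team A this works out to (0.500 × 3 + 0.400 × 4) / 7, so a team that plays weaker opponents more often ends up with a lower SOS.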


In [17]:
team_sos_df = load_sos()
print(team_sos_df.head())

   Rk               Team Atlanta Hawks Boston Celtics Brooklyn Nets  \
0   1      Atlanta Hawks           NaN            2-1           2-1   
1   2     Boston Celtics           1-2            NaN           4-0   
2   3      Brooklyn Nets           1-2            0-4           NaN   
3   4      Chicago Bulls           2-2            1-3           2-1   
4   5  Charlotte Hornets           0-4            0-4           1-3   

  Chicago Bulls Charlotte Hornets Cleveland Cavaliers Dallas Mavericks  \
0           2-2               4-0                 2-1              0-2   
1           3-1               4-0                 2-2              1-1   
2           1-2               3-1                 0-4              1-1   
3           NaN               3-1                 0-4              0-2   
4           1-3               NaN                 0-4              1-1   

  Denver Nuggets  ... Orlando Magic Philadelphia 76ers Phoenix Suns  \
0            0-2  ...           2-2                3-0   

In [18]:
sos_df = calculate_sos(team_df, team_sos_df)

print(sos_df.head())
print(f"Shape: {sos_df.shape}")

                Team  Season       SOS
0      Atlanta Hawks    2025  0.495012
1     Boston Celtics    2025  0.479110
2      Brooklyn Nets    2025  0.507207
3      Chicago Bulls    2025  0.495756
4  Charlotte Hornets    2025  0.504085
Shape: (300, 3)


### Player Stats Calculations

Now that I have calculated the strength of schedule, I need to calculate a few more player stats that I can add to my final dataframe. The stats I will be using from these player stats are:
- average age
- average points of top 10 players
- production score: (Points + PLUS/MINUS) / Minutes Played
- injury rate: (82 × Roster Size) − Total Games Played
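
The four features listed above could be aggregated roughly as in the sketch below. The player-level column names (`Age`, `G`, `MP`, `PTS`, `PLUS_MINUS`) and all values are assumptions. The injury figure is computed exactly as the formula states, as a raw count of missed games; the injury_rate values printed in this notebook fall between 0 and 1, so the real function presumably applies a normalization that is not reproduced here.

```python
import pandas as pd

# Hypothetical player rows for one team-season; column names are assumptions.
players = pd.DataFrame({
    "Season": [2016, 2016, 2016],
    "Team": ["Team A", "Team A", "Team A"],
    "Age": [24, 27, 30],
    "G": [82, 70, 41],            # games played
    "MP": [2800, 2100, 900],      # total minutes played
    "PTS": [1640, 980, 328],      # total points
    "PLUS_MINUS": [120, -40, 15],
})

# Per-player metrics.
players["production_score"] = (players["PTS"] + players["PLUS_MINUS"]) / players["MP"]
players["ppg"] = players["PTS"] / players["G"]

# Aggregate to one row per team-season.
features = players.groupby(["Season", "Team"]).agg(
    avg_age=("Age", "mean"),
    avg_pts_top10=("ppg", lambda s: s.nlargest(10).mean()),
    avg_production_score=("production_score", "mean"),
    roster_size=("G", "count"),
    games_played=("G", "sum"),
).reset_index()

# (82 x roster size) - total games played, as stated in the list above.
features["games_missed"] = 82 * features["roster_size"] - features["games_played"]

print(features)
```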

First, I need to parse through my player-stats folder and gather all of the player statistics I scraped from Basketball Reference.

In [19]:
teams_2024 = front_office_df[front_office_df["Season"] == 2024]["Team"].unique()
teams_2025 = front_office_df[front_office_df["Season"] == 2025]["Team"].unique()

print("Missing in 2024:", set(nba_teams) - set(teams_2024))
print("Missing in 2025:", set(nba_teams) - set(teams_2025))

Missing in 2024: set()
Missing in 2025: set()


In [20]:
players_df = load_players()

Next, I need to build a new dataframe from the calculations described above and then add those results to the team dataframe. The formula logic is implemented in the calculate_player_features function in data_loader.py.

In [21]:
# Step 2: Aggregate into team-season features
player_stats_df = calculate_player_features(players_df)

print(player_stats_df.head())
print("Shape:", player_stats_df.shape)

   Season               Team    avg_age  avg_pts_top10  avg_production_score  \
0    2016      Atlanta Hawks  27.470588          10.52              0.377504   
1    2016     Boston Celtics  24.500000          11.20              0.428270   
2    2016      Brooklyn Nets  26.000000          11.39              0.389805   
3    2016  Charlotte Hornets  25.941176          11.78              0.395202   
4    2016      Chicago Bulls  27.562500          11.24              0.375663   

   injury_rate  
0     0.352941  
1     0.319360  
2     0.370875  
3     0.378049  
4     0.352134  
Shape: (300, 6)


Now that I have the strength of schedule for each team in each season and the player stats I need, I will merge both dataframes into the current team_df. The result is a front_office_df that holds payroll, draft picks, and coaches, alongside a team_df that holds team records, strength of schedule, and averaged player stats.

In [22]:
# Merge on SOS and current team_df
team_df = merge_team_data(team_df, sos_df)

# Merge Player Stats and current team_df
team_df = merge_team_data(team_df, player_stats_df)

# Check result
print(team_df.head())
print("-----")
print("The shape of the dataframe:", team_df.shape)

   Season           Team  GP   W   L   WIN%   Min    PTS   FGM   FGA  ...  \
0    2016  Atlanta Hawks  82  48  34  0.585  48.4  102.8  38.6  84.4  ...   
1    2017  Atlanta Hawks  82  43  39  0.524  48.5  103.2  38.1  84.4  ...   
2    2018  Atlanta Hawks  82  24  58  0.293  48.1  103.4  38.2  85.5  ...   
3    2019  Atlanta Hawks  82  29  53  0.354  48.4  113.3  41.4  91.8  ...   
4    2020  Atlanta Hawks  67  20  47  0.299  48.6  111.8  40.6  90.6  ...   

   W_L  Pre-ASG_W  Pre-ASG_L  Post-ASG_W  Post-ASG_L       SOS    avg_age  \
0   11         31         24          17          10  0.500061  27.470588   
1   17         32         24          15          11  0.489720  28.200000   
2   12         18         41          17           6  0.508341  25.500000   
3   17         19         39          14          10  0.499671  25.136364   
4    9         15         41           6           5  0.509597  25.761905   

   avg_pts_top10  avg_production_score  injury_rate  
0          10.52    

The code below is run to make sure that no teams are missing from the team dataframe.

In [23]:
teams_2024 = team_df[team_df["Season"] == 2024]["Team"].unique()
teams_2025 = team_df[team_df["Season"] == 2025]["Team"].unique()

print("Missing in 2024:", set(nba_teams) - set(teams_2024))
print("Missing in 2025:", set(nba_teams) - set(teams_2025))

Missing in 2024: set()
Missing in 2025: set()


To get a clean target variable and feature variables, I need to merge my two master datasets into one.

In [24]:
# Merge team_df and front_office_df
master_df_unscaled = pd.merge(
    team_df,
    front_office_df,
    on=["Team", "Season"],
    how="inner"
)

print(master_df_unscaled.shape)
print(master_df_unscaled.head())
print(master_df_unscaled.info())

(328, 56)
   Season           Team  GP   W   L   WIN%   Min    PTS   FGM   FGA  ...  \
0    2016  Atlanta Hawks  82  48  34  0.585  48.4  102.8  38.6  84.4  ...   
1    2017  Atlanta Hawks  82  43  39  0.524  48.5  103.2  38.1  84.4  ...   
2    2018  Atlanta Hawks  82  24  58  0.293  48.1  103.4  38.2  85.5  ...   
3    2019  Atlanta Hawks  82  29  53  0.354  48.4  113.3  41.4  91.8  ...   
4    2020  Atlanta Hawks  67  20  47  0.299  48.6  111.8  40.6  90.6  ...   

              Coach  Yw/Franch  YOverall  CareerW  CareerL  CareerW%  \
0  Mike Budenholzer          3         3      146      100     0.593   
1  Mike Budenholzer          4         4      189      139     0.576   
2  Mike Budenholzer          5         5      213      197     0.520   
3      Lloyd Pierce          1         1       29       53     0.354   
4      Lloyd Pierce          2         2       49      100     0.329   

   FirstRoundPicks  SecondRoundPicks  Coach_Count      Payroll  
0                1           
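
The merged frame has 328 rows, the same as front_office_df, while sos_df and player_stats_df each have 300. A likely reason is that a team-season with more than one coach appears more than once on the front office side. The sketch below shows, on toy data with placeholder names, how such repeated keys can be listed after a merge and how the `validate` argument of `pd.merge` can flag them up front.

```python
import pandas as pd

# Toy frames: Team A has two coaches in 2021, so its key appears twice on the right.
team = pd.DataFrame({
    "Team": ["Team A", "Team B"],
    "Season": [2021, 2021],
    "W": [41, 36],
})
front_office = pd.DataFrame({
    "Team": ["Team A", "Team A", "Team B"],
    "Season": [2021, 2021, 2021],
    "Coach": ["Coach X", "Coach Y", "Coach Z"],
})

merged = pd.merge(team, front_office, on=["Team", "Season"], how="inner")

# Rows whose (Team, Season) key occurs more than once after the merge.
dup_keys = merged[merged.duplicated(subset=["Team", "Season"], keep=False)]
print(len(team), "team rows ->", len(merged), "merged rows")
print(dup_keys)

# validate="one_to_one" raises MergeError when either side has repeated keys.
try:
    pd.merge(team, front_office, on=["Team", "Season"], validate="one_to_one")
    one_to_one = True
except pd.errors.MergeError:
    one_to_one = False
print("Merge is one-to-one:", one_to_one)
```

Whether repeated team-seasons should be kept or collapsed is a modeling decision; the check only makes the situation visible.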

Now that all of my dataframes are merged into one master dataframe, I need to preprocess the data. Specifically, I need to scale the shortened 2020 and 2021 COVID seasons to an 82-game basis, for the totals only.

In [25]:
# --- Step 1: Identify columns ---
id_cols = ["Season", "Team", "Coach", "Rk", "Pk", "Coach_Count", "GP"]
rate_cols = ["WIN%", "FG%", "3P%", "FT%", "SOS", "CareerW%", "injury_rate", "Payroll"]
career_cols = ["Yw/Franch", "YOverall", "CareerW", "CareerL"]  # career aggregates

# Columns eligible for scaling
counting_cols = [
    col for col in master_df_unscaled.columns
    if col not in id_cols + rate_cols + career_cols
]

# --- Step 2: Apply scaling ---
def scale_row(row):
    season_games = row["GP"]
    if season_games < 82:  # only scale shortened seasons
        scale_factor = 82 / season_games
        row[counting_cols] = row[counting_cols] * scale_factor
    return row

master_df = master_df_unscaled.apply(scale_row, axis=1)

# --- Step 3: Sanity check ---
print(master_df.loc[master_df["Season"].isin([2020, 2021]),
                          ["Season", "Team", "GP", "W", "L", "PTS", "avg_production_score"]].head())

    Season            Team  GP          W          L         PTS  \
4     2020   Atlanta Hawks  67  24.477612  57.522388  136.829851   
5     2021   Atlanta Hawks  72  46.694444  35.305556  129.491667   
6     2021   Atlanta Hawks  72  46.694444  35.305556  129.491667   
17    2020  Boston Celtics  72  54.666667  27.333333  129.491667   
18    2021  Boston Celtics  72  41.000000  41.000000  128.238889   

    avg_production_score  
4               0.466780  
5               0.471217  
6               0.471217  
17              0.481876  
18              0.451757  
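
In the output above, PTS for the 2020 Hawks rises from 111.8 to 136.83, so a per-game average appears to have been scaled along with the season totals. If only totals are meant to be adjusted, one option is an explicit list of total columns and a vectorized scale, as sketched below. The choice of `W` and `L` as the totals is an assumption made for illustration.

```python
import pandas as pd

# Toy rows using values shown earlier in this notebook.
df = pd.DataFrame({
    "Season": [2019, 2020],
    "Team": ["Atlanta Hawks", "Atlanta Hawks"],
    "GP": [82, 67],
    "W": [29, 20],
    "L": [53, 47],
    "PTS": [113.3, 111.8],  # per-game average, intentionally left unscaled
})

total_cols = ["W", "L"]  # assumed list of season totals

scaled = df.copy()
scaled[total_cols] = scaled[total_cols].astype(float)

# Scale totals to an 82-game basis for shortened seasons only.
short = scaled["GP"] < 82
factor = 82 / scaled.loc[short, "GP"]
scaled.loc[short, total_cols] = scaled.loc[short, total_cols].mul(factor, axis=0)

print(scaled)
```

Listing the totals explicitly is more verbose than excluding known rate columns, but it avoids rescaling any average or rate that was not anticipated in the exclusion list.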


In [26]:
# --- Create NWins column (next season's wins) ---
master_df = master_df.sort_values(by=["Team", "Season"]).reset_index(drop=True)

# Shift W column back one season within each team group
master_df["NWins"] = master_df.groupby("Team")["W"].shift(-1)

# For the latest season (e.g., 2025), NWins will naturally be NaN
print(master_df[["Season", "Team", "W", "NWins", "Coach", "FirstRoundPicks","SecondRoundPicks"]].tail(20))


     Season                Team          W      NWins          Coach  \
308    2017           Utah Jazz  51.000000  48.000000    Quin Snyder   
309    2018           Utah Jazz  48.000000  50.000000    Quin Snyder   
310    2019           Utah Jazz  50.000000  50.111111    Quin Snyder   
311    2020           Utah Jazz  50.111111  59.222222    Quin Snyder   
312    2021           Utah Jazz  59.222222  49.000000    Quin Snyder   
313    2022           Utah Jazz  49.000000  37.000000    Quin Snyder   
314    2023           Utah Jazz  37.000000  31.000000     Will Hardy   
315    2024           Utah Jazz  31.000000  17.000000     Will Hardy   
316    2025           Utah Jazz  17.000000        NaN     Will Hardy   
317    2016  Washington Wizards  41.000000  49.000000  Randy Wittman   
318    2017  Washington Wizards  49.000000  43.000000   Scott Brooks   
319    2018  Washington Wizards  43.000000  32.000000   Scott Brooks   
320    2019  Washington Wizards  32.000000  28.472222   Scott Br
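
The grouped shift used for NWins assumes one row per team-season. Where a team-season is repeated (the 2021 Hawks appear twice in the scaling output earlier), the first repeated row would receive that same season's wins rather than the next season's. A minimal sketch of the shift with a uniqueness check, on toy data with placeholder team names:

```python
import pandas as pd

# Toy data, deliberately unsorted.
toy = pd.DataFrame({
    "Team": ["Team A", "Team A", "Team A", "Team B", "Team B", "Team B"],
    "Season": [2023, 2025, 2024, 2023, 2024, 2025],
    "W": [37, 17, 31, 35, 15, 18],
})

toy = toy.sort_values(["Team", "Season"]).reset_index(drop=True)

# Guard: the shift is only meaningful with one row per team-season.
assert not toy.duplicated(subset=["Team", "Season"]).any()

# Next season's wins within each team; the latest season has no target yet.
toy["NWins"] = toy.groupby("Team")["W"].shift(-1)

print(toy)
```

Grouping by team before shifting keeps one team's first season from leaking into the previous team's last row.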

The last step before I can start my final cleaning checklist is to write the master dataframes to CSV files.

In [27]:
# # THE CODE TO CREATE RAW MASTER DATA CSV FILES
# # Define output folder
# output_dir = "/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats"
# os.makedirs(output_dir, exist_ok=True)

# # Save team_df
# team_path = os.path.join(output_dir, "team_df.csv")
# team_df.to_csv(team_path, index=False)
# print(f"team_df saved to {team_path}")

# # Save front_office_df
# front_office_path = os.path.join(output_dir, "front_office_df.csv")
# front_office_df.to_csv(front_office_path, index=False)
# print(f"front_office_df saved to {front_office_path}")

# # Save master_df
# master_path = os.path.join(output_dir, "master_df.csv")
# master_df.to_csv(master_path, index=False)
# print(f"master_df saved to {master_path}")

In [28]:
# # THE CODE TO CREATE PROCESSED MASTER DATA GZIP FILES
# # gzip file creation block
# root_dir = "/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats"

# for subdir, _, files in os.walk(root_dir):
#     for file in files:
#         if file.endswith(".csv"):
#             file_path = os.path.join(subdir, file)
#             gz_path = file_path + ".gz"

#             try:
#                 df = pd.read_csv(file_path)

#                 if df.empty:
#                     print(f"Skipping empty file: {file_path}")
#                     continue

#                 print(f"Compressing {file_path} -> {gz_path}")
#                 df.to_csv(gz_path, index=False, compression="gzip")

#             except pd.errors.EmptyDataError:
#                 print(f"Skipping empty file (no data): {file_path}")


## Final Data Cleaning

Now that my main dataframes are complete, I need to go through the basic data cleaning checklist:
- checking for NaN values
- checking for duplicate values
- verifying manually entered data (Season columns and Team names)

In [29]:
def cleaning_report(df, df_name="DataFrame"):
    print(f"--- Cleaning Report for {df_name} ---")
    
    # 1. Check for NaN values
    nan_summary = df.isna().sum()
    nan_summary = nan_summary[nan_summary > 0]
    if nan_summary.empty:
        print("No NaN values found")
    else:
        print("NaN values detected:")
        print(nan_summary)
    
    # 2. Check for duplicate rows
    dup_count = df.duplicated().sum()
    if dup_count == 0:
        print("No duplicate rows found")
    else:
        print(f"Found {dup_count} duplicate rows")

    # 3. Verify manually entered data (Season & Team)
    if "Season" in df.columns:
        seasons = df["Season"].unique()
        print(f"Unique seasons: {seasons}")
    else:
        print("No 'Season' column found")
        
    if "Team" in df.columns:
        teams = df["Team"].unique()
        print(f"Unique teams: {len(teams)} teams")
        print(sorted(teams))
    else:
        print("No 'Team' column found")
    
    print("-----------------------------------\n")

In [30]:
cleaning_report(front_office_df, "Front Office Data")
cleaning_report(sos_df, "Strength of Schedule")
cleaning_report(master_df, "Master Dataframe")

--- Cleaning Report for Front Office Data ---
No NaN values found
No duplicate rows found
Unique seasons: [2025 2024 2023 2022 2021 2020 2019 2018 2017 2016]
Unique teams: 30 teams
['Atlanta Hawks', 'Boston Celtics', 'Brooklyn Nets', 'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers', 'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons', 'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers', 'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies', 'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves', 'New Orleans Pelicans', 'New York Knicks', 'Oklahoma City Thunder', 'Orlando Magic', 'Philadelphia 76ers', 'Phoenix Suns', 'Portland Trail Blazers', 'Sacramento Kings', 'San Antonio Spurs', 'Toronto Raptors', 'Utah Jazz', 'Washington Wizards']
-----------------------------------

--- Cleaning Report for Strength of Schedule ---
No NaN values found
No duplicate rows found
Unique seasons: [2025 2024 2023 2022 2021 2020 2019 2018 2017 2016]
Unique teams: 30 t

In [31]:
# Filter to seasons other than 2025
non_2025 = master_df[master_df["Season"] != 2025]

# Select only rows with at least one NaN
rows_with_nans = non_2025[non_2025.isna().any(axis=1)]

print(rows_with_nans.head())   # Preview first few rows
print("Total rows with NaN (non-2025):", len(rows_with_nans))


Empty DataFrame
Columns: [Season, Team, GP, W, L, WIN%, Min, PTS, FGM, FGA, FG%, 3PM, 3PA, 3P%, FTM, FTA, FT%, OREB, DREB, REB, AST, TOV, STL, BLK, BLKA, PF, PFD, PLUS_MINUS, Rk, Home_W, Home_L, Road_W, Road_L, E_W, E_L, W_W, W_L, Pre-ASG_W, Pre-ASG_L, Post-ASG_W, Post-ASG_L, SOS, avg_age, avg_pts_top10, avg_production_score, injury_rate, Coach, Yw/Franch, YOverall, CareerW, CareerL, CareerW%, FirstRoundPicks, SecondRoundPicks, Coach_Count, Payroll, NWins]
Index: []

[0 rows x 57 columns]
Total rows with NaN (non-2025): 0


This cleaning report has proved to be an immensely useful function. It alerted me that I had 61 NaN values in my Front Office dataset, which resulted in 61 NaN values in my Master dataset. The cause turned out to be a spelling error from manual data entry, and it has been resolved!

Now that I have finished cleaning the dataframes, I will move on to the exploration notebook. In that notebook, I will explore the data through summary statistics and visualizations, with the core goal of understanding the data as well as possible.