# NBA Player Statistics Analysis (2024-2025 Season)

This notebook analyzes NBA player statistics for the 2024-2025 season using data scraped from Basketball-Reference.com. We load per-game, advanced, and shooting statistics, merge them into one table, and run exploratory data analysis (EDA) with visualizations.

## 1. Setup and Library Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np # For potential NaN handling or numeric operations

# Display plots inline
%matplotlib inline

# Set a style for plots
plt.style.use('fivethirtyeight')
sns.set_style('whitegrid')

## 2. Load Data

We will load the three CSV files: per-game stats, advanced stats, and shooting stats.

In [None]:
per_game_file = '../PyScripts/nba_per_game_stats_2024_25.csv'
advanced_file = '../PyScripts/nba_advanced_stats_2024_25.csv'
shooting_file = '../PyScripts/nba_shooting_stats_2024_25.csv'

try:
    per_game_df = pd.read_csv(per_game_file)
    print(f"Successfully loaded per_game_stats from {per_game_file}")
except FileNotFoundError:
    print(f"Error: {per_game_file} not found.")
    per_game_df = pd.DataFrame() # Create empty DataFrame if file not found

try:
    advanced_df = pd.read_csv(advanced_file)
    print(f"Successfully loaded advanced_stats from {advanced_file}")
except FileNotFoundError:
    print(f"Error: {advanced_file} not found.")
    advanced_df = pd.DataFrame()

try:
    shooting_df = pd.read_csv(shooting_file)
    print(f"Successfully loaded shooting_stats from {shooting_file}")
except FileNotFoundError:
    print(f"Error: {shooting_file} not found.")
    shooting_df = pd.DataFrame()
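
The three load blocks above repeat the same pattern. A minimal sketch of a shared helper follows. `load_csv_or_empty` and its `label` argument are hypothetical names, not part of the original notebook, and the helper keeps the same fallback to an empty DataFrame when a file is missing.

```python
from pathlib import Path

import pandas as pd


def load_csv_or_empty(path, label):
    """Read a CSV into a DataFrame, or return an empty DataFrame if the file is missing.

    Hypothetical helper for illustration; `label` is only used in the messages.
    """
    try:
        df = pd.read_csv(path)
        print(f"Successfully loaded {label} from {path}")
        return df
    except FileNotFoundError:
        print(f"Error: {path} not found.")
        return pd.DataFrame()


# A path that does not exist yields an empty DataFrame instead of raising.
missing_df = load_csv_or_empty(Path("does_not_exist_hypothetical.csv"), "demo_stats")
print(missing_df.empty)
```

With such a helper, each of the three loads would reduce to a single call, and the later `if not df.empty` guards would keep working unchanged.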

## 3. Initial Data Inspection and Cleaning

Let's inspect the first few rows, data types, and summary statistics for each DataFrame. We'll also check for missing values.

### 3.1 Per-Game Statistics

In [None]:
if not per_game_df.empty:
    print("Per Game Stats - Head:")
    display(per_game_df.head())
    print("\nPer Game Stats - Info:")
    per_game_df.info()
    print("\nPer Game Stats - Describe:")
    display(per_game_df.describe())
    print("\nPer Game Stats - Missing Values:")
    display(per_game_df.isnull().sum())

### 3.2 Advanced Statistics

In [None]:
if not advanced_df.empty:
    print("Advanced Stats - Head:")
    display(advanced_df.head())
    print("\nAdvanced Stats - Info:")
    advanced_df.info()
    print("\nAdvanced Stats - Describe:")
    display(advanced_df.describe())
    print("\nAdvanced Stats - Missing Values:")
    display(advanced_df.isnull().sum())

### 3.3 Shooting Statistics

In [None]:
if not shooting_df.empty:
    print("Shooting Stats - Head:")
    display(shooting_df.head())
    print("\nShooting Stats - Info:")
    shooting_df.info()
    print("\nShooting Stats - Describe:")
    display(shooting_df.describe())
    print("\nShooting Stats - Missing Values:")
    display(shooting_df.isnull().sum())

### 3.4 Discussion on Missing Values

Missing values are observed in several columns, particularly '3P%', 'FT%', 'FG%', and some advanced metrics. This typically happens for players with zero attempts in a category (e.g., zero 3PA leaves 3P% undefined, so it is stored as NaN). The 'Awards' column has many NaNs because most players don't receive awards.

**Strategy for handling NaNs for this EDA:**
*   For percentage stats (like 'FG%', '3P%', 'FT%'), NaNs resulting from zero attempts could be filled with 0.0, but only if the attempt count is kept alongside, so that a 0.0 from zero attempts is not mistaken for a player who missed every shot.
*   For other numeric stats, filling with 0 or the mean/median might be an option, but care must be taken not to skew the data. For this EDA, we will mostly filter out NaNs or use pandas' default handling where appropriate for calculations like `.mean()`.
*   The 'Awards' column NaNs mean no awards, which is fine for descriptive purposes.
*   When calculating specific metrics (e.g., top TS% players), we will filter for a minimum number of attempts or minutes played to ensure meaningful comparisons.
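
The trade-offs in the list above can be seen on a small example. The rows below are made up for illustration (they are not real player stats), and the 50-attempt cutoff is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; these are not real player stats.
demo = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3P": [2.0, 0.0, 0.1],   # made threes per game
    "3PA": [5.0, 0.0, 0.2],  # attempted threes per game
    "G": [70, 60, 10],
})

# 0 makes on 0 attempts has no defined percentage, so it becomes NaN.
demo["3P%"] = demo["3P"] / demo["3PA"].replace(0, np.nan)

# Series.mean() skips NaN by default, so player B is simply left out.
mean_skipping_nan = demo["3P%"].mean()

# Filling with 0.0 treats "never attempted" as "missed everything" and lowers the mean.
mean_filled_zero = demo["3P%"].fillna(0.0).mean()

# Filtering on estimated total attempts removes tiny samples such as player C.
demo["Total_3PA"] = demo["3PA"] * demo["G"]
qualified = demo[demo["Total_3PA"] >= 50]

print(mean_skipping_nan, mean_filled_zero)
print(qualified["Player"].tolist())
```

The two means differ only because of how player B is treated, which is why the notebook prefers filtering over blanket filling.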

## 4. Merge DataFrames

We will merge the three DataFrames into a single comprehensive DataFrame. We need to be careful about common columns. Let's list them out and decide on a merge strategy. 

Common columns might include: 'Player', 'Pos', 'Age', 'Team', 'G', 'GS', 'MP'.
- `per_game_df` has MP per game.
- `advanced_df` has total MP.
- `shooting_df` has total MP.

We'll use 'Player', 'Pos', 'Age', 'Team', 'G', and 'GS' as merge keys. Because MP means different things across the tables, we rename it before merging: `MP_per_game` for the per-game table, and `MP_total_adv` and `MP_total_shoot` for the two season-total columns, which are then consolidated into a single `MP_total`.

In [None]:
if not per_game_df.empty and not advanced_df.empty:
    # Merge per-game and advanced stats
    # Common columns: Player, Pos, Age, Team, G, GS
    # MP in per_game_df is minutes per game, MP in advanced_df is total minutes for the season.
    # We rename per_game_df's MP to 'MP_per_game' and advanced_df's MP to 'MP_total_adv'.
    merged_df = pd.merge(per_game_df.rename(columns={'MP': 'MP_per_game'}), 
                           advanced_df.rename(columns={'MP': 'MP_total_adv'}), 
                           on=['Player', 'Pos', 'Age', 'Team', 'G', 'GS'], 
                           how='outer', # Use outer to see all players, NaNs will indicate missing data in one table
                           suffixes=('', '_adv'))
    print("Shape after merging per-game and advanced: ", merged_df.shape)
else:
    # If one is empty, try to use the other or an empty df
    merged_df = per_game_df.rename(columns={'MP': 'MP_per_game'}) if not per_game_df.empty else pd.DataFrame()
    if merged_df.empty and not advanced_df.empty:
        merged_df = advanced_df.rename(columns={'MP': 'MP_total_adv'})
    print("One of per_game_df or advanced_df is empty. Merge might be incomplete.")

if not shooting_df.empty and not merged_df.empty:
    # Merge with shooting stats
    # shooting_df also has 'MP' (total minutes), 'G', 'GS', 'Age', 'Team', 'Pos'
    # We will use 'MP_total_shoot' for shooting_df's MP column.
    merged_df = pd.merge(merged_df, 
                           shooting_df.rename(columns={'MP': 'MP_total_shoot'}), 
                           on=['Player', 'Pos', 'Age', 'Team', 'G', 'GS'], 
                           how='outer', 
                           suffixes=('', '_shoot'))
    print("Shape after merging with shooting: ", merged_df.shape)
elif not shooting_df.empty and merged_df.empty: # if per_game and advanced were empty
    merged_df = shooting_df.rename(columns={'MP': 'MP_total_shoot'})
    print("Only shooting_df was available for merging.")

if not merged_df.empty:
    # Clean up duplicated columns after merge (e.g., if suffixes weren't perfectly handled or some columns were not in keys)
    # Example: If 'Awards_adv' exists and 'Awards' is preferred
    if 'Awards_adv' in merged_df.columns and 'Awards' in merged_df.columns:
        merged_df['Awards'] = merged_df['Awards'].fillna(merged_df['Awards_adv'])
        merged_df.drop(columns=['Awards_adv'], inplace=True, errors='ignore')
    if 'Awards_shoot' in merged_df.columns and 'Awards' in merged_df.columns:
        merged_df['Awards'] = merged_df['Awards'].fillna(merged_df['Awards_shoot'])
        merged_df.drop(columns=['Awards_shoot'], inplace=True, errors='ignore')
    
    # Select the primary total MP column. Let's prefer MP_total_adv, then MP_total_shoot.
    if 'MP_total_adv' in merged_df.columns and 'MP_total_shoot' in merged_df.columns:
        merged_df['MP_total'] = merged_df['MP_total_adv'].fillna(merged_df['MP_total_shoot'])
        merged_df.drop(columns=['MP_total_adv', 'MP_total_shoot'], inplace=True, errors='ignore')
    elif 'MP_total_adv' in merged_df.columns:
        merged_df.rename(columns={'MP_total_adv': 'MP_total'}, inplace=True)
    elif 'MP_total_shoot' in merged_df.columns:
        merged_df.rename(columns={'MP_total_shoot': 'MP_total'}, inplace=True)
        
    # Drop redundant Rk columns if they exist from merging
    rk_cols_to_drop = [col for col in merged_df.columns if 'Rk_' in col]
    merged_df.drop(columns=rk_cols_to_drop, inplace=True, errors='ignore')

    print("\nMerged DataFrame - Head:")
    display(merged_df.head())
    print("\nMerged DataFrame - Info:")
    merged_df.info()
    print("\nMerged DataFrame - Columns:")
    print(merged_df.columns.tolist())
else:
    print("Merged DataFrame is empty. Cannot proceed with EDA.")
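
Because the merge uses `how='outer'` on six key columns, a player whose key values differ between tables (for example, a different 'Pos') ends up as two half-empty rows instead of one. A minimal sketch for auditing this with the `indicator` parameter of `pd.merge` is shown below. The miniature frames and their values are made up for illustration.

```python
import pandas as pd

# Made-up miniature tables; names and values are illustrative only.
per_game_demo = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Team": ["X", "Y", "Z"],
    "Pos": ["PG", "SF", "C"],
    "PTS": [20.1, 11.3, 8.0],
})
advanced_demo = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Team": ["X", "Y", "Z"],
    "Pos": ["PG", "PF", "C"],  # B's position differs from the per-game table
    "PER": [22.0, 14.5, 16.2],
})

audit = pd.merge(
    per_game_demo,
    advanced_demo,
    on=["Player", "Team", "Pos"],
    how="outer",
    indicator=True,  # adds a '_merge' column: 'both', 'left_only', or 'right_only'
)

# Rows that did not find a partner in the other table.
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Player", "Team", "Pos", "_merge"]])
```

If the real merge produces more rows than the largest input table, inspecting the unmatched rows this way is one quick check before trusting the merged stats.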

## 5. Exploratory Data Analysis (EDA) & Visualization

Now we'll explore the merged data and visualize some key statistics.

### 5.1 Per-Game Stats Leaders

We'll look at top players in points, assists, and rebounds per game.

In [None]:
if not merged_df.empty and 'PTS' in merged_df.columns:
    top_10_pts = merged_df.sort_values(by='PTS', ascending=False).head(10)
    plt.figure(figsize=(12, 7))
    sns.barplot(x='PTS', y='Player', data=top_10_pts, palette='viridis', hue='Player', dodge=False, legend=False)
    plt.title('Top 10 Players by Points Per Game (PTS)')
    plt.xlabel('Points Per Game')
    plt.ylabel('Player')
    plt.tight_layout()
    plt.show()
else:
    print("'PTS' column not found or DataFrame is empty.")

In [None]:
if not merged_df.empty and 'AST' in merged_df.columns:
    top_10_ast = merged_df.sort_values(by='AST', ascending=False).head(10)
    plt.figure(figsize=(12, 7))
    sns.barplot(x='AST', y='Player', data=top_10_ast, palette='mako', hue='Player', dodge=False, legend=False)
    plt.title('Top 10 Players by Assists Per Game (AST)')
    plt.xlabel('Assists Per Game')
    plt.ylabel('Player')
    plt.tight_layout()
    plt.show()
else:
    print("'AST' column not found or DataFrame is empty.")

In [None]:
if not merged_df.empty and 'TRB' in merged_df.columns:
    top_10_trb = merged_df.sort_values(by='TRB', ascending=False).head(10)
    plt.figure(figsize=(12, 7))
    sns.barplot(x='TRB', y='Player', data=top_10_trb, palette='rocket', hue='Player', dodge=False, legend=False)
    plt.title('Top 10 Players by Rebounds Per Game (TRB)')
    plt.xlabel('Rebounds Per Game')
    plt.ylabel('Player')
    plt.tight_layout()
    plt.show()
else:
    print("'TRB' column not found or DataFrame is empty.")
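
If the scraped tables keep one row per team for traded players plus a combined multi-team row, the same player can appear more than once in a top-10 list. That layout is an assumption about the scraper output, not something this notebook verifies. Under that assumption, one option is to keep only the row with the most games for each player, as sketched here with made-up data:

```python
import pandas as pd

# Made-up rows; '2TM' stands in for a combined multi-team row (an assumption
# about how the scraper labels it).
demo = pd.DataFrame({
    "Player": ["A", "A", "A", "B"],
    "Team": ["2TM", "X", "Y", "Z"],
    "G": [70, 40, 30, 65],
    "PTS": [24.0, 25.5, 22.0, 27.1],
})

# Sort so the row with the most games comes first for each player, keep that
# row, then rank the remaining rows by points per game.
one_row_per_player = (
    demo.sort_values("G", ascending=False)
        .drop_duplicates(subset="Player", keep="first")
        .sort_values("PTS", ascending=False)
        .reset_index(drop=True)
)
print(one_row_per_player)
```

Keeping the highest-games row selects the combined row when one exists, since it covers all of the player's games.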

### 5.2 Advanced Stats Leaders

In [None]:
if not merged_df.empty and 'PER' in merged_df.columns:
    # Filter for players with substantial minutes for PER to be more meaningful
    # Using MP_per_game from per_game_df and G for total minutes
    if 'MP_per_game' in merged_df.columns and 'G' in merged_df.columns:
        merged_df_filtered_per = merged_df[merged_df['MP_per_game'] * merged_df['G'] > 500] # Example: > 500 total minutes
    else:
        merged_df_filtered_per = merged_df # No filter if MP or G not available
        
    top_10_per = merged_df_filtered_per.sort_values(by='PER', ascending=False).head(10)
    plt.figure(figsize=(12, 7))
    sns.barplot(x='PER', y='Player', data=top_10_per, palette='cubehelix', hue='Player', dodge=False, legend=False)
    plt.title('Top 10 Players by Player Efficiency Rating (PER) (min. 500 MP)')
    plt.xlabel('Player Efficiency Rating (PER)')
    plt.ylabel('Player')
    plt.tight_layout()
    plt.show()
else:
    print("'PER' column not found or DataFrame is empty.")

In [None]:
if not merged_df.empty and 'TS%' in merged_df.columns:
    # Filter for players with a reasonable number of field goal attempts
    # FGA (per game) is in per_game_df (original name)
    if 'FGA' in merged_df.columns and 'G' in merged_df.columns: # FGA is from per_game_df, G from common merge
        merged_df['Total_FGA'] = merged_df['FGA'] * merged_df['G']
        merged_df_filtered_ts = merged_df[merged_df['Total_FGA'] > 100] # Example: > 100 total FGA
    elif 'FGA_total_adv' in merged_df.columns: # If advanced_df had total FGA
        merged_df_filtered_ts = merged_df[merged_df['FGA_total_adv'] > 100]
    else:
        merged_df_filtered_ts = merged_df # No filter if FGA not available
        
    top_10_ts = merged_df_filtered_ts.sort_values(by='TS%', ascending=False).head(10)
    plt.figure(figsize=(12, 7))
    sns.barplot(x='TS%', y='Player', data=top_10_ts, palette='crest', hue='Player', dodge=False, legend=False)
    plt.title('Top 10 Players by True Shooting % (TS%) (min. 100 FGA)')
    plt.xlabel('True Shooting Percentage (TS%)')
    plt.ylabel('Player')
    plt.tight_layout()
    plt.show()
else:
    print("'TS%' column not found or DataFrame is empty.")
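
The PER and TS% cells share the same shape: apply a minimum-volume filter, sort, and take the top rows. A minimal sketch of that pattern as a reusable function is shown below. `top_n_qualified` is a hypothetical helper name, and the demo rows are made up.

```python
import pandas as pd


def top_n_qualified(df, stat, n=10, min_col=None, min_value=0):
    """Return the top `n` rows by `stat`, optionally requiring `min_col` > `min_value`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    qualified = df
    if min_col is not None and min_col in df.columns:
        qualified = df[df[min_col] > min_value]
    # Drop rows where the stat itself is missing so they can never rank.
    return (
        qualified.dropna(subset=[stat])
                 .sort_values(by=stat, ascending=False)
                 .head(n)
    )


# Made-up rows: player C has the best TS% but only 12 total attempts.
demo = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "TS%": [0.610, 0.655, 0.900, None],
    "Total_FGA": [900, 450, 12, 300],
})
leaders = top_n_qualified(demo, "TS%", n=2, min_col="Total_FGA", min_value=100)
print(leaders["Player"].tolist())
```

Without the threshold, player C would lead the list on a sample too small to be meaningful, which is the reason for the minimum-attempt filters in the cells above.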

### 5.3 Shooting Stats Analysis

In [None]:
if not merged_df.empty:
    # Mean FG% and 3P% across players (overall values from the per-game table).
    # These are unweighted player averages, not shot-weighted league percentages.
    # Coerce to numeric first; Series.mean() already skips NaNs.
    if 'FG%' in merged_df.columns:
        avg_fg_pct = pd.to_numeric(merged_df['FG%'], errors='coerce').mean()
        print(f"Average Player FG% (unweighted): {avg_fg_pct:.3f}")
    else:
        print("'FG%' (overall) not found in merged_df for average calculation.")
        
    if '3P%' in merged_df.columns: # This is overall 3P% from per_game_df
        avg_3p_pct = pd.to_numeric(merged_df['3P%'], errors='coerce').mean()
        print(f"Average Player 3P% (unweighted): {avg_3p_pct:.3f}")
    else:
        print("'3P%' (overall) not found in merged_df for average calculation.")
    
    # Histogram of Average Shot Distance (Dist.)
    if 'Dist.' in merged_df.columns:
        plt.figure(figsize=(10, 6))
        sns.histplot(merged_df['Dist.'].dropna(), kde=True, bins=20)
        plt.title('Distribution of Average Shot Distance (Dist.)')
        plt.xlabel('Average Shot Distance (ft)')
        plt.ylabel('Frequency')
        plt.show()
    else:
        print("'Dist.' column not found for histogram.")
        
    # Top 5 players by %FGA from 3-point range (among those with significant minutes)
    if '%FGA 3P' in merged_df.columns and 'MP_total' in merged_df.columns:
        significant_mp_players = merged_df[merged_df['MP_total'] > 500]
        top_5_3pa_pct = significant_mp_players.sort_values(by='%FGA 3P', ascending=False).head(5)
        print("\nTop 5 Players by Percentage of FGA from 3-Point Range (min. 500 MP):")
        display(top_5_3pa_pct[['Player', 'Team', '%FGA 3P', '3P%', 'MP_total']])
    else:
        print("'%FGA 3P' or 'MP_total' column not found for 3P analysis.")
        
    # Top 5 players by FG% on shots from 0-3 feet (min. significant attempts)
    # Need total FGA and %FGA 0-3 to estimate attempts from 0-3 feet
    if 'FG% 0-3' in merged_df.columns and '%FGA 0-3' in merged_df.columns and 'FGA' in merged_df.columns and 'G' in merged_df.columns:
        if 'Total_FGA' not in merged_df.columns: # Calculate if not already done
            merged_df['Total_FGA'] = merged_df['FGA'] * merged_df['G']
        merged_df['FGA_0_3_attempts'] = merged_df['%FGA 0-3'] * merged_df['Total_FGA']
        significant_attempts_0_3 = merged_df[merged_df['FGA_0_3_attempts'] > 50] # Example: > 50 attempts from 0-3ft
        top_5_fg_0_3 = significant_attempts_0_3.sort_values(by='FG% 0-3', ascending=False).head(5)
        print("\nTop 5 Players by FG% on Shots from 0-3 Feet (min. 50 attempts from 0-3ft):")
        display(top_5_fg_0_3[['Player', 'Team', 'FG% 0-3', 'FGA_0_3_attempts']])
    else:
        print("Required columns for 'FG% 0-3' analysis are missing.")
else:
    print("Merged DataFrame is empty, skipping Shooting Stats Analysis.")
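
The FG% and 3P% averages printed in this cell are means taken across players, so a player with a handful of attempts counts as much as a high-volume shooter. An attempt-weighted figure is closer to a league-wide percentage. The sketch below assumes the per-game table exposes made field goals as 'FG' alongside 'FGA' and 'G', and it uses made-up rows.

```python
import pandas as pd

# Made-up per-game rows for illustration only.
demo = pd.DataFrame({
    "Player": ["A", "B"],
    "G": [80, 10],
    "FG": [8.0, 1.0],    # made field goals per game (assumed column name)
    "FGA": [16.0, 4.0],  # field goal attempts per game
})
demo["FG%"] = demo["FG"] / demo["FGA"]

# Unweighted: every player counts equally, regardless of shot volume.
unweighted_fg_pct = demo["FG%"].mean()

# Weighted: approximate season totals from per-game values, then divide.
total_fg = (demo["FG"] * demo["G"]).sum()
total_fga = (demo["FGA"] * demo["G"]).sum()
weighted_fg_pct = total_fg / total_fga

print(f"{unweighted_fg_pct:.3f} vs {weighted_fg_pct:.3f}")
```

Per-game values are rounded in the source tables, so totals rebuilt this way are estimates, in the same way as the `Total_FGA` column used earlier.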

### 5.4 Correlations

In [None]:
if not merged_df.empty:
    # Select key stats for correlation matrix
    # From per-game: PTS, AST, TRB
    # From advanced: PER, WS, USG%, TS%
    # From shooting: %FGA 3P, Dist.
    correlation_cols = ['PTS', 'AST', 'TRB', 'PER', 'WS', 'USG%', 'TS%', '%FGA 3P', 'Dist.']
    
    # Filter out columns that might not exist if a file load failed
    existing_corr_cols = [col for col in correlation_cols if col in merged_df.columns]
    
    if len(existing_corr_cols) > 1:
        correlation_df = merged_df[existing_corr_cols].copy()
        
        # Convert all selected columns to numeric, coercing errors. This helps if some values were read as strings.
        for col in existing_corr_cols:
            correlation_df[col] = pd.to_numeric(correlation_df[col], errors='coerce')
        
        # Drop rows with NaNs for correlation calculation to be meaningful
        correlation_df.dropna(inplace=True)
        
        if not correlation_df.empty and len(correlation_df.columns) > 1:
            corr_matrix = correlation_df.corr()
            
            plt.figure(figsize=(12, 10))
            sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
            plt.title('Correlation Matrix of Key Player Statistics')
            plt.show()
        else:
            print("Not enough data or columns for correlation matrix after cleaning.")
    else:
        print("Not enough columns available for correlation analysis.")
else:
    print("Merged DataFrame is empty, skipping Correlation Analysis.")
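
A heatmap shows every pair twice and becomes hard to scan as columns are added. A minimal sketch for listing each pair once, ordered by absolute correlation, is given below. The demo values are made up, and the column names simply mirror a few of those used above.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({
    "PTS": [10.0, 20.0, 30.0, 25.0],
    "USG%": [15.0, 24.0, 33.0, 30.0],
    "Dist.": [14.0, 9.0, 12.0, 16.0],
})
corr_matrix = demo.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(upper_mask).stack().dropna()

# Order pairs by absolute correlation, strongest first.
ranked_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked_pairs)
```

The same steps could be applied to the notebook's `corr_matrix` to read off the strongest relationships without scanning the full grid.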

## 6. Conclusion

This notebook provided an initial exploratory data analysis of the 2024-2025 NBA player statistics. We loaded, inspected, cleaned, and merged data from three different sources. We then visualized leaders in various statistical categories and examined correlations between key metrics.

Further analysis could involve more advanced statistical modeling, player clustering, and in-depth investigation of specific player performances or trends.