<h1 style="text-align: center;"> Next Man Up: NBA, WNBA, G-League, NCAA Player Clustering Similarity </h1>

Last Modified: *7/28/2024*

### **Goal**: Implement a clustering model to group NBA, WNBA, and G League players based on traditional and advanced box score statistics in order to:

1. Identify potential role player replacements

2. Scout emerging talent

3. Provide contingency planning for injuries and trades

The final result will include a cluster plot for each league.


### Importing libraries ###

In [None]:
import os
import random
import pandas as pd
import numpy as np
import itertools
import time
import collections
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from io import BytesIO
import base64
from datetime import datetime

from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from matplotlib.patches import Polygon
from matplotlib.pyplot import figure
from matplotlib import pyplot as plt
from matplotlib.patches import Arc

from nba_api.stats.endpoints import commonplayerinfo, leaguegamefinder, boxscoreadvancedv2, BoxScoreDefensiveV2, PlayerGameLog
from nba_api.stats.static import players, teams

pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # Disabling pandas SetWithCopyWarnings
os.add_dll_directory(r"C:\Program Files\GTK3-Runtime Win64\bin")


from PIL import Image
from py_ball import synergy, image

# Clustering, scaling, imputation, and dimensionality-reduction utilities
from scipy.spatial import ConvexHull
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer


HEADERS = {'Connection': 'keep-alive',
           'Host': 'stats.nba.com',
           'Origin': 'http://stats.nba.com',
           'Upgrade-Insecure-Requests': '1',
           'Referer': 'stats.nba.com',
           'x-nba-stats-origin': 'stats',
           'x-nba-stats-token': 'true',
           'Accept-Language': 'en-US,en;q=0.9',
           "X-NewRelic-ID": "VQECWF5UChAHUlNTBwgBVw==",
           'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6)' +\
                         ' AppleWebKit/537.36 (KHTML, like Gecko)' + \
                         ' Chrome/81.0.4044.129 Safari/537.36'}



### User-defined Functions ###

In [None]:
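def k_closest_players(df, player_name, k):
    # Placeholder: should return the k nearest neighbors most similar to player_name
    # Raise instead of returning 0 so an accidental call fails loudly
    raise NotImplementedError("k_closest_players is not implemented yet")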


def generate_random_color():
    return f'#{random.randint(0, 0xFFFFFF):06x}'


def return_player_bio(player_id):
    # returns player position, height, and weight as a list using nba_api library

    player_info = commonplayerinfo.CommonPlayerInfo(player_id=player_id)
    player_info_dict = player_info.get_normalized_dict()

    # Extract relevant data
    position = player_info_dict['CommonPlayerInfo'][0]['POSITION']
    height = player_info_dict['CommonPlayerInfo'][0]['HEIGHT']
    weight = player_info_dict['CommonPlayerInfo'][0]['WEIGHT']

    return [position, height, weight]


def encode_image(image):
    # converts img file from PIL library to PNG, meant to retain quality of image when plotting
    
    buffer = BytesIO()
    image.save(buffer, format='PNG')
    encoded_image = base64.b64encode(buffer.getvalue()).decode()
    return f'data:image/png;base64,{encoded_image}'



def plotly_clusters(df, league, season_start, season_end):
    
    x_min = df['x'].min()
    x_max = df['x'].max()
    y_min = df['y'].min()
    y_max = df['y'].max()

    df['hover_text'] = df.apply(lambda row: f"Name: {row['PLAYER_NAME']}<br>PTS: {row['PTS']:.1f}<br>AST: {row['AST']:.1f}<br>REB: {row['REB']:.1f}", axis=1)

    # Create scatter plot
    fig = go.Figure()

    # Add scatter trace for hover text
    fig.add_trace(go.Scatter(
        x=df['x'],
        y=df['y'],
        mode='markers',
        marker=dict(size=10, color='rgba(255, 255, 255, 0)'),
        text=df['hover_text'],
        hoverinfo='text',
        showlegend=False
    ))
    
    if league == 'NBA' or league == 'WNBA':
        # Add images as scatter points
        for i, row in df.iterrows():
            fig.add_layout_image(
                dict(
                    source=row['IMAGE_ENCODED'],
                    x=row['x'],
                    y=row['y'],
                    xref="x",
                    yref="y",
                    sizex=10,  # Adjust size as needed
                    sizey=10,
                    xanchor="center",
                    yanchor="middle"
                )
            )
    else:
        # Initialize the color map with random colors
        unique_clusters = df['cluster'].unique()
        color_map = {cluster: generate_random_color() for cluster in unique_clusters}

        # Map the clusters to colors
        colors = df['cluster'].map(color_map)

        # Add scatter trace
        fig.add_trace(go.Scatter(
            x=df['x'], 
            y=df['y'], 
            mode='markers',
            marker=dict(color=colors),
            showlegend=False
        ))



    # Update layout for better display
    fig.update_layout(
        autosize=True,
        xaxis=dict(range=[x_min, x_max], visible = False),  # Adjust range to include all points
        yaxis=dict(range=[y_min, y_max], visible = False),  # Adjust range to include all points
        hovermode='closest',
        title=dict(
            text=(f"<b>{league} Player Clustering</b><br><sup>Regular season & Playoff data spanning from {season_start} to {season_end}</sup>"),
            x=0.12,  # Move title slightly to the right
            xanchor='left'
        )
    )

    
    if league == 'WNBA':
        with open("C:/Users/rsandan/Downloads/WNBA.png", "rb") as image_file:
            img_str = base64.b64encode(image_file.read()).decode()
        source = f"data:image/png;base64,{img_str}"
        x=-.03
        y=1.35
    elif league == 'NBA':
        source = "https://raw.githubusercontent.com/TGOlson/nba-logos/main/data/img/NBA.png"
        x=-.01
        y=1.35
    else:
        with open("C:/Users/rsandan/Downloads/gleague.png", "rb") as image_file:
            img_str = base64.b64encode(image_file.read()).decode()
        source = f"data:image/png;base64,{img_str}"
        x=0
        y=1.35

    # Add logo image
    fig.add_layout_image(
        dict(
            source=source,
            xref="paper", yref="paper",
            x=x, y=y,
            sizex=0.35, sizey=0.35,
            xanchor="left", yanchor="top"
        )
    )



    fig.show()

    return None


def season_string(start, end):
    # Split the input strings to get the start and end years
    start_year = int(start.split('-')[0])
    end_year = int(end.split('-')[0])
    
    # Create a list to store the season strings
    seasons = []
    
    # Loop through the range of years and generate the season strings
    for year in range(start_year, end_year + 1):
        next_year = str(year + 1)[-2:]  # Get the last two digits of the next year
        season = f"{year}-{next_year}"
        seasons.append(season)
    
    return seasons


def fetch_season_data(league_id, season_start, season_end, season_type):
    all_data = []

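    # WNBA seasons are single calendar years (2000, 2001, etc.).
    # Include season_end so both branches treat the end season as inclusive.
    if league_id == '10':
        seasons = [str(year) for year in range(season_start, season_end + 1)]
    else: 
        seasons = season_string(season_start, season_end)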

        
    for season in seasons:
        nba_gamefinder = leaguegamefinder.LeagueGameFinder(
            league_id_nullable=league_id,
            season_nullable=season,
            season_type_nullable=season_type,
            player_or_team_abbreviation='P'
        )
        games = nba_gamefinder.get_data_frames()[0]
        all_data.append(games)
        
        print(f'Finished scraping data for the {season} season ({season_type}). Amount of rows =', len(games))
        
        # Optional: Add a delay to avoid overloading the server
        lag = np.random.uniform(low=2, high=5)
        print(f'...waiting {round(lag, 1)} seconds')
        time.sleep(lag)
    
    return pd.concat(all_data, ignore_index=True)



def apply_tsne(df, num_of_clusters, numeric_columns, gleague = 'no'):
    X = np.array(df[numeric_columns])
    kmeans = KMeans(n_clusters = num_of_clusters, random_state = 42).fit(X)
    labels = kmeans.labels_
    # set verbose to 2 to see output messages from T-SNE
    # random_state keeps the embedding reproducible between runs
    data_dim = TSNE(n_components=2, perplexity=5, verbose=0, method='barnes_hut', random_state=42).fit_transform(X)

    data_point_table = pd.DataFrame({'x': data_dim[:, 0], 'y': data_dim[:, 1], 'cluster': labels})

    # Separate t-SNE fits do not share a coordinate system, so place each cluster
    # center at the mean embedded position of its members (works for any num_of_clusters)
    centers_2d = data_point_table.groupby('cluster')[['x', 'y']].mean()
    data_point_table['cluster_center_x'] = data_point_table['cluster'].map(centers_2d['x'])
    data_point_table['cluster_center_y'] = data_point_table['cluster'].map(centers_2d['y'])
    data_point_table['PLAYER_NAME'] = df['PLAYER_NAME'].values
    data_point_table['PTS'] = df['PTS'].values
    data_point_table['AST'] = df['AST'].values
    data_point_table['REB'] = df['REB'].values


    if gleague == 'no':
        # Add encoded images and player names to the dataframe
        data_point_table['IMAGE'] = df['IMAGE'].values
        data_point_table['IMAGE_ENCODED'] = data_point_table['IMAGE'].apply(encode_image)
    

    return data_point_table

def grab_player_headshots(df, league):
    player_data = []

    df_unique = df[['PLAYER_NAME', 'PLAYER_ID']].drop_duplicates()

    for index, row in df_unique.iterrows():
        player_id = row['PLAYER_ID']
        player_name = row['PLAYER_NAME']
        try:
            image_url = image.Headshot(league=league, player_id=player_id).image
        except Exception as e:
            print(f"Error processing {player_name} (ID: {player_id}): {e}")
            # Fall back to a placeholder string on error; rows carrying it are filtered out before plotting
            image_url = "url_to_default_image"

        player_data.append({
            'PLAYER_NAME': player_name,
            'PLAYER_ID': player_id,
            'IMAGE': image_url
        })


    return pd.DataFrame(player_data)


In [None]:
# League IDs
nba_league_id = '00'
g_league_id = '20'
wnba_league_id = '10'

# Season Parameters
season_start = '2022-23'
season_end = '2023-24'

In [None]:

# Fetch regular season and playoffs data (NBA)
nba_rs_data = fetch_season_data(league_id=nba_league_id, season_start=season_start, season_end=season_end, season_type='Regular Season')
nba_p_data = fetch_season_data(league_id=nba_league_id, season_start=season_start, season_end=season_end, season_type='Playoffs')

# Fetch regular season and playoffs data (G League)
gleague_rs_data = fetch_season_data(league_id=g_league_id, season_start=season_start, season_end=season_end, season_type='Regular Season')
gleague_p_data = fetch_season_data(league_id=g_league_id, season_start=season_start, season_end=season_end, season_type='Playoffs')

# Fetch regular season and playoffs data (WNBA)
wnba_rs_data = fetch_season_data(league_id=wnba_league_id, season_start=2015, season_end=2024, season_type='Regular Season')
wnba_p_data = fetch_season_data(league_id=wnba_league_id, season_start=2015, season_end=2024, season_type='Playoffs')

In [None]:
# Concatenate every df into one main df
master = pd.concat([nba_rs_data, nba_p_data, 
                    gleague_rs_data, gleague_p_data, 
                    wnba_rs_data, wnba_p_data], ignore_index=True)

### Data Transformation ### 

The next step is to create two new columns: `LEAGUE` (NBA, G League, or WNBA) to identify which league a row comes from, and `SEASON_TYPE` (regular season or playoffs).

In [None]:
master['SEASON_ID'].unique()

The first digit of `SEASON_ID` encodes the season type:
- 2 indicates regular season
- 4 indicates playoff data

In [None]:
unique_teams = master[['TEAM_ID', 'TEAM_NAME']].drop_duplicates()

# Set pandas option to display all rows
pd.set_option('display.max_rows', None)

# Display the unique teams
print(unique_teams.sort_values(by='TEAM_ID'))

# Optionally, reset the display option to default after displaying
pd.reset_option('display.max_rows')

Looking at `TEAM_ID`:
- NBA IDs start with 16106...
- WNBA IDs start with 16116...
- G League IDs start with 16127...
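
The prefix rules for `SEASON_ID` and `TEAM_ID` can also be applied with a vectorized lookup. This is a minimal sketch on toy rows, assuming the prefixes listed in this notebook; the IDs below are illustrative values shaped like the real ones. One difference from an if/else chain is that an unrecognized prefix becomes `NaN` instead of silently falling into a default league.

```python
import pandas as pd

# Toy rows; the IDs are illustrative, not pulled from the API.
toy = pd.DataFrame({
    "SEASON_ID": ["22023", "42023", "22023"],
    "TEAM_ID": [1610612737, 1611661313, 1612709890],
})

# First digit of SEASON_ID: '2' = regular season, '4' = playoffs.
season_map = {"2": "Regular Season", "4": "Playoffs"}
toy["SEASON_TYPE"] = toy["SEASON_ID"].astype(str).str[0].map(season_map)

# First five digits of TEAM_ID identify the league.
league_map = {"16106": "NBA", "16116": "WNBA", "16127": "G LEAGUE"}
toy["LEAGUE"] = toy["TEAM_ID"].astype(str).str[:5].map(league_map)

print(toy)
```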

In [None]:
# Make the appropriate changes

# Add new column to indicate season_type
master['SEASON_TYPE'] = ['Regular Season' if str(row)[0] == '2' else 'Playoffs' for row in master['SEASON_ID']]

# Add new column to indicate league
league_types = []
for row in master['TEAM_ID']:
    if str(row)[:5] == '16106':
        league_types.append('NBA')
    elif str(row)[:5] == '16116':
        league_types.append('WNBA')
    else:
        league_types.append('G LEAGUE')

master['LEAGUE'] = league_types

In [None]:
print("Number of unique players in dataset", len(master['PLAYER_ID'].unique()))
print("\nSplit by league:\n")

for league in master['LEAGUE'].unique():
    temp = master[master['LEAGUE'] == league]
    print("Number of", league, "players: ", len(temp['PLAYER_ID'].unique()))

In [None]:
# Verify which features have missing values
master.isnull().sum()

In [None]:
# All the NA WL are coming from G-League data
master[master['WL'].isna()]['LEAGUE'].unique()

In [None]:
# All the NA player names are coming from G-League data
master[master['PLAYER_NAME'].isna()]['LEAGUE'].unique()

Data pre-processing questions:
- Why is `PLAYER_NAME` missing for some rows that still have a `PLAYER_ID`?
- Why is the game result `WL` missing for some rows? Did those games go to overtime?
- There are many NA values in the percentage columns `FT_PCT`, `FG_PCT`, `FG3_PCT`. These mark games where a player had no attempts (and therefore no makes), so 0/0 = NA.
- There are many NA values for `PLUS_MINUS`.
- `REB` shouldn't have any NA values since `REB` = `OREB` + `DREB`. With no rebounds, 0 = 0 + 0 still works out.
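
Two of these questions can be checked directly rather than assumed. The sketch below uses a small made-up table (not the real `master` data) to show one way to confirm that missing percentages only occur on zero-attempt rows and that `REB` matches `OREB` + `DREB`.

```python
import numpy as np
import pandas as pd

# Made-up game rows for illustration only.
toy = pd.DataFrame({
    "FGM": [0, 3, 5],
    "FGA": [0, 7, 10],
    "FG_PCT": [np.nan, 0.429, 0.5],
    "OREB": [0, 1, 2],
    "DREB": [0, 4, 3],
    "REB": [0, 5, 5],
})

# Rows with a missing percentage should all have zero attempts.
missing_pct = toy["FG_PCT"].isna()
all_zero_attempts = (toy.loc[missing_pct, "FGA"] == 0).all()

# REB should equal OREB + DREB on every row.
reb_consistent = (toy["REB"] == toy["OREB"] + toy["DREB"]).all()

print(all_zero_attempts, reb_consistent)
```

If either check came back `False` on the real data, the corresponding cleanup step would need a second look before dropping or overwriting columns.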


To do:
- Exclude rows where `PLAYER_NAME` is NA
- Exclude rows where `WL` is NA
- Drop the percentage columns since they're redundant (percentages can be recomputed from makes and attempts, e.g. `FGM` / `FGA`)
- Ensure `REB` = `DREB` + `OREB`
- Drop `PLUS_MINUS`
- Convert stats like `PTS` to float data type
- [TENTATIVE] Add advanced metrics such as:
  - `TS_PERC`: True shooting percentage
  - `PPP`: Points Per Possession
  - `PER`: Player Efficiency Rating
- Add player headshots to dataframe using `grab_player_headshots` function for visual aid in clustering 

Why drop `PLUS_MINUS`? Take this excerpt from [blazersedge.com, written by Dave Deckard,](https://www.blazersedge.com/2024/3/31/24117295/nba-plus-minus-portland-trail-blazers-scoot-henderson-record) explaining the problem with plus minus: "The nice thing about plus/minus is that it indicates, at least partially, how a player’s performance might be affecting his team. If you notice a guy scoring 20 per game, but his plus/minus runs consistently negative, you might begin to suspect that his scoring isn’t as valuable as it seems on the surface.

On the other hand, maybe not. The problematic part of plus/minus is the statistical “noise” accompanying it. Team performance provides the baseline for the stat, but the stat is applied to a single player. Those do not match up.

If you send Damian Lillard onto the floor of an NBA game with four preschoolers, he’s going to have a horrible plus/minus even though he’s a fantastic player. His individual play-making and/or skills won’t be able to overcome the gravity of the team’s demise. Nor would the loss (and the terrible plus/minus stat on his boxscore line) be his fault." 

Since we're focusing on individual players rather than groups (e.g. starters or bench units), this stat won't provide much insight into an individual player's performance.

[TENTATIVE] Why add `TS_PERC`, `PPP`, `PER`? 
- I wanted to add true shooting percentage because, unlike effective field goal percentage, it takes free throws into account.
- I wanted to add Points Per Possession because it accounts for the number of possessions used, giving a clearer picture of scoring efficiency than raw points alone.
- I wanted to add Player Efficiency Rating because it's a good overall metric that takes both positive and negative stats into account.
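
As a rough illustration of the first two metrics, here is a sketch using the commonly cited box-score approximations: true shooting as `PTS / (2 * (FGA + 0.44 * FTA))`, and a simplified "possessions used" estimate of `FGA + 0.44 * FTA + TOV`. The helper name `add_shooting_metrics` is hypothetical, the possession estimate is an assumption (other definitions exist), and `PER` is left out because it depends on league and team pace adjustments that a single box score line does not contain.

```python
import numpy as np
import pandas as pd

def add_shooting_metrics(df):
    """Return a copy of df with TS_PERC and PPP columns (hypothetical helper)."""
    out = df.copy()
    # 0.44 is the conventional free-throw weight used in both estimates.
    scoring_attempts = out["FGA"] + 0.44 * out["FTA"]
    possessions = scoring_attempts + out["TOV"]
    # Swap zero denominators for NaN so idle games yield NaN instead of inf.
    out["TS_PERC"] = out["PTS"] / (2 * scoring_attempts.replace(0, np.nan))
    out["PPP"] = out["PTS"] / possessions.replace(0, np.nan)
    return out

# Toy rows: one active game and one game with no attempts at all.
toy = pd.DataFrame({
    "PTS": [20.0, 0.0],
    "FGA": [15.0, 0.0],
    "FTA": [5.0, 0.0],
    "TOV": [2.0, 0.0],
})
result = add_shooting_metrics(toy)
print(result[["TS_PERC", "PPP"]])
```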

Let's make the appropriate changes

In [None]:
# Exclude player data where the value of WL is Not Available
master = master[master['WL'].notna()]

# Exclude rows of player data where their name isn't available. I noticed that it's all g_league too.
master = master[master['PLAYER_NAME'].notna()]

# Exclude row where FGA is NA
master = master[master['FGA'].notna()]

# Drop percentage and plus minus columns
master = master.drop(columns=['FG_PCT', 'FG3_PCT', 'FT_PCT', 'PLUS_MINUS'])

# Ensure REB column is sum of offensive and defensive rebounds
master['REB'] = master['DREB'] + master['OREB']

# Map 'W' to 1 (win) and 'L' to 0 (loss)
result_list = [1 if i == 'W' else 0 for i in master['WL']]
master['WL'] = result_list

In [None]:
# Verify which features have missing values
master.isnull().sum()

### Adding Player Headshots (except G-League) ###

In [None]:
nba_headshots = grab_player_headshots(master[master['LEAGUE'] == 'NBA'], 'NBA')
nba_headshots.head()

In [None]:
wnba_headshots = grab_player_headshots(master[master['LEAGUE'] == 'WNBA'], 'WNBA')
wnba_headshots.head()

In [None]:
print("Count of NBA Players w/ valid headshots:", len(nba_headshots[nba_headshots['IMAGE'] != 'url_to_default_image']), "out of", len(nba_headshots))
print("Count of WNBA Players w/ valid headshots:", len(wnba_headshots[wnba_headshots['IMAGE'] != 'url_to_default_image']), "out of", len(wnba_headshots))

We're going to revisit these two datasets later when we begin visualizing clusters.

### Standardization: normalizing features ### 

Before going further, I want to check whether the data is approximately normally distributed, since that informs which scaler to use. To do so, I'll plot each league's per-player averages as histograms alongside a correlation matrix.

With scikit-learn, the two candidates are:
- `StandardScaler` (beneficial when data is approximately normally distributed or when using algorithms sensitive to the distribution of data)
- `MinMaxScaler` (beneficial when you need data within a specific range or when using algorithms that do not assume any particular distribution of data)
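
To make the comparison concrete, the sketch below applies both scalers to a synthetic right-skewed feature and measures its skewness with `scipy.stats.skew`. The data is an exponential draw standing in for something like points per game, not real player data. Both scalers are linear transformations, so each changes the location and spread of a feature but leaves the shape of its distribution as it was.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
# Right-skewed toy values (exponential draw), not real data.
toy = pd.DataFrame({"PTS": rng.exponential(scale=6.0, size=200)})

# Positive skewness means a long right tail.
skewness = stats.skew(toy["PTS"])

standard = StandardScaler().fit_transform(toy[["PTS"]])
minmax = MinMaxScaler().fit_transform(toy[["PTS"]])

print(f"skewness: {skewness:.2f}")
print(f"StandardScaler -> mean {standard.mean():.2f}, std {standard.std():.2f}")
print(f"MinMaxScaler   -> min {minmax.min():.2f}, max {minmax.max():.2f}")
```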

In [None]:
# Separate numeric and descriptive columns
numeric_columns = ["MIN", "PTS", "FGM", "FGA",
                   "FG3M", "FG3A", "FTM", "FTA", 
                   "OREB", "DREB", "REB", "AST", 
                   "STL", "BLK", "TOV", "PF"]

# Convert columns to float type
for col in numeric_columns:
    master[col] = master[col].astype(float)


descriptive_columns = master.select_dtypes(exclude=['float64']).columns

In [None]:
# Iteratively plot each league's histogram statistics and correlation matrix. 

for league in master['LEAGUE'].unique():
    df = master[master['LEAGUE'] == league]
    historical_player_avg = df[['PLAYER_NAME'] + list(numeric_columns)].groupby('PLAYER_NAME').mean()
    historical_player_avg.reset_index(inplace=True)

    # Compute correlation between all columns except the first one
    correlation_matrix = historical_player_avg.iloc[:, 1:].corr()
    
    # Create a figure with 2 subplots
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))

    # Plot the correlation matrix on the first subplot
    sns.heatmap(correlation_matrix, annot=True, cmap='BuGn', ax=axes[0])
    axes[0].set_title(f"Correlation matrix of Player Metrics - {league}", weight='bold')

    # Plot the histograms on the second subplot
    for col in numeric_columns:
        historical_player_avg[col].hist(ax=axes[1], bins=20, alpha=0.5, label=col)
    axes[1].set_title(f"Histograms of Player Metrics - {league}", weight='bold')
    axes[1].legend()

    plt.tight_layout()
    plt.show()

### Applying MinMaxScaler to numeric features ### 

Most of the features are not normally distributed (Personal Fouls and Minutes are the exceptions). Many show a right-skewed distribution (positive skewness). Because the data is not approximately normal, I'll use `MinMaxScaler`, which maps every feature to the range [0, 1].

The scaler is applied to the numeric features only, on a copy of `master`, so that raw and normalized statistics stay separate.

In [None]:
# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Create a copy of df to separate raw and normalized statistics
master_std = master.copy()

# Apply the scaler to the numeric columns
master_std[numeric_columns] = scaler.fit_transform(master_std[numeric_columns])

# Add "_STD" to the column names
master_std.rename(columns={col: f"{col}_STD" for col in numeric_columns}, inplace=True)

In [None]:
# Check if lengths match
assert len(master_std) == len(master), "Length mismatch between normalized data and original data."

The next step is to split the main table into three dataframes (NBA, WNBA, G League) of per-player averages over every game in the dataset, grouping by `PLAYER_NAME` (as we did when analyzing player distributions), and then attach the player headshots to the NBA and WNBA datasets. We aggregate because the clustering algorithm should see one row per player, not one row per game (side note: it would also be interesting to see how a player's performance varies from game to game). **Our goal is to see whose standardized performances are most similar.**
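
The `k_closest_players` function defined earlier is still a placeholder. One possible way to fill it in is a nearest-neighbor lookup over the scaled columns. The sketch below is an assumption about how that could look, not the notebook's implementation: it uses scikit-learn's `NearestNeighbors` with Euclidean distance, adds a hypothetical `feature_cols` parameter, and runs on made-up players.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def k_closest_players_sketch(df, player_name, k, feature_cols):
    """Return the k players closest to player_name in feature space (Euclidean)."""
    mask = df["PLAYER_NAME"] == player_name
    if not mask.any():
        raise ValueError(f"{player_name} not found")
    features = df[feature_cols].to_numpy()
    # Ask for k + 1 neighbors because the query player is its own nearest neighbor.
    model = NearestNeighbors(n_neighbors=min(k + 1, len(df))).fit(features)
    query = df.loc[mask, feature_cols].to_numpy()[:1]
    distances, indices = model.kneighbors(query)
    neighbors = df.iloc[indices[0]].assign(DISTANCE=distances[0])
    # Drop the query player and keep the k closest remaining rows.
    return neighbors[neighbors["PLAYER_NAME"] != player_name].head(k)

# Made-up players with two scaled features.
toy = pd.DataFrame({
    "PLAYER_NAME": ["A", "B", "C", "D"],
    "PTS_STD": [0.90, 0.85, 0.20, 0.10],
    "AST_STD": [0.30, 0.35, 0.80, 0.05],
})
closest = k_closest_players_sketch(toy, "A", 2, ["PTS_STD", "AST_STD"])
print(closest)
```

Measuring distance on the scaled columns rather than the raw ones keeps high-magnitude stats such as minutes or points from dominating the result.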

In [None]:
numeric_columns_std = [str(col + "_STD") for col in numeric_columns]

cluster_df = pd.concat([master_std, master[numeric_columns]], axis=1)

In [None]:
league_dfs = {}

for league in cluster_df['LEAGUE'].unique():
    # filter for each league
    df = cluster_df[cluster_df['LEAGUE'] == league]

    # Ensure PLAYER_NAME is included in the DataFrame before grouping
    df = df[['PLAYER_NAME'] + list(numeric_columns_std) + list(numeric_columns)]

    historical_player_avg = df.groupby('PLAYER_NAME').mean()
    historical_player_avg.reset_index(inplace=True)
    league_dfs[league] = historical_player_avg

# Look up each league by name rather than relying on the order returned by unique()
nba_cluster = league_dfs['NBA']
gleague_cluster = league_dfs['G LEAGUE']
wnba_cluster = league_dfs['WNBA']

In [None]:
# Merging the dataframes on PLAYER_NAME
nba_cluster = pd.merge(nba_cluster, nba_headshots, on='PLAYER_NAME', how='left')
nba_cluster = nba_cluster[nba_cluster['IMAGE'] != 'url_to_default_image']

In [None]:
# Merging the dataframes on PLAYER_NAME
wnba_cluster = pd.merge(wnba_cluster, wnba_headshots, on='PLAYER_NAME', how='left')
wnba_cluster = wnba_cluster[wnba_cluster['IMAGE'] != 'url_to_default_image']

*A note about Elbow Method and Silhouette Score:*
- I tried both methods to find the optimal k for clustering, and both pointed to k = 2. I suspect this is because each position is really a variation between guards and forwards. I'm going to use k = 5 instead, since each team has five players on the court at all times.
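
For reference, a k search of this kind can be sketched as below. The data is synthetic (three well-separated blobs with made-up centers), so the numbers say nothing about the player data; the point is only to show where inertia (for the elbow method) and the silhouette score come from.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled player averages.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.8,
    random_state=42,
)

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        "inertia": model.inertia_,  # within-cluster sum of squares (elbow method)
        "silhouette": silhouette_score(X, model.labels_),
    }

# Highest silhouette score is one common way to pick k.
best_k = max(results, key=lambda k: results[k]["silhouette"])

for k, scores in results.items():
    print(k, round(scores["inertia"], 1), round(scores["silhouette"], 3))
print("k with highest silhouette:", best_k)
```

Inertia keeps shrinking as k grows, so the elbow is read from where the decrease flattens, while the silhouette score can be compared directly across values of k.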

## Data Visualization: Clustering begins ##
Now we're going to visualize our players' standardized average stats against each other and see whose games are most similar.

### Method ###
- Use t-SNE to project the data onto a 2D plot so we can quickly see clusters of players who have similar playing styles or statistical profiles. For example:
  - Players who score a lot and have similar shooting percentages might end up grouped together, while players with different stats are placed far apart. A defense-first player with many blocks and steals should land far from a player who scores a lot but has few defensive stats.
- We're going to use the functions we defined earlier in the notebook:
   - `apply_tsne`: runs k-means and t-SNE on our data and appends the encoded player headshots
   - `plotly_clusters`: visualizes the clusters with the plotly library
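
Stripped of headshots and plotting, the core of the method is k-means for cluster labels plus t-SNE for 2D coordinates. The sketch below runs that on synthetic "players" drawn around three made-up profiles; every number in it is an assumption for illustration. Cluster centers are located as the mean of each cluster's embedded points, which is one option among several for marking centers on a t-SNE plot.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# 60 synthetic players with 4 scaled features, drawn around 3 made-up profiles.
profiles = np.array([
    [0.8, 0.2, 0.2, 0.5],
    [0.2, 0.8, 0.3, 0.4],
    [0.3, 0.3, 0.9, 0.6],
])
X = np.vstack([p + rng.normal(scale=0.05, size=(20, 4)) for p in profiles])

# Cluster in the original feature space, then embed to 2D for plotting only.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
embedding = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(X)

points = pd.DataFrame({"x": embedding[:, 0], "y": embedding[:, 1], "cluster": labels})
# Mean embedded position of each cluster's members.
centers_2d = points.groupby("cluster")[["x", "y"]].mean()
print(centers_2d)
```

Distances in a t-SNE plot mostly reflect local neighborhoods, so the gaps between far-apart groups should be read loosely.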

Note: 
- [According to the `py_ball` documentation](https://github.com/basketballrelativity/py_ball/wiki/Image), G League player headshots aren't available through the `image` module; only NBA and WNBA headshots are. So I'm going to plot G League players as colored dots rather than headshots.

### NBA ###

In [None]:
# Cluster on the scaled (_STD) columns so high-magnitude stats don't dominate the distances
nba_final = apply_tsne(nba_cluster, num_of_clusters = 5, numeric_columns=numeric_columns_std)

In [None]:
plotly_clusters(nba_final, "NBA", season_start = season_start, season_end=season_end)

### WNBA ###

In [None]:
# Cluster on the scaled (_STD) columns so high-magnitude stats don't dominate the distances
wnba_final = apply_tsne(wnba_cluster, num_of_clusters = 5, numeric_columns=numeric_columns_std)

In [None]:
# WNBA data was requested for 2015 through 2024, not the NBA/G League season range
plotly_clusters(wnba_final, "WNBA", season_start=2015, season_end=2024)

### G-League ###
*note: cluster plot won't show player headshots*

In [None]:
# Cluster on the scaled (_STD) columns so high-magnitude stats don't dominate the distances
gleague_final = apply_tsne(gleague_cluster, num_of_clusters = 5, numeric_columns=numeric_columns_std, gleague = 'yes')

In [None]:
plotly_clusters(gleague_final, "G-LEAGUE", season_start=season_start, season_end=season_end)

### ANALYSIS coming soon ###