# Exploratory Data Analysis
In this section, we take a look at the data to better understand the different features as well as any possible trends.

## Imports

In [None]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
all_stats_cleaned = pd.read_csv('all_stats_cleaned.csv')
all_stats_cleaned.head()

### Map Between ID and Abbr

In [None]:
team_id_to_abb = {} # dictionary to convert from team_id to a list of every team_abbreviation used by that team
team_abb_to_id = {} # dictionary to convert from team_abbreviation to team_id

teams = (all_stats_cleaned[['TEAM_ID', 'TEAM_ABBREVIATION']]).drop_duplicates()

for index, row in teams.iterrows() :
    if row['TEAM_ID'] not in team_id_to_abb.keys():
        team_id_to_abb[row['TEAM_ID']] = []
    team_id_to_abb[row['TEAM_ID']].append(row['TEAM_ABBREVIATION'])
    team_abb_to_id[row['TEAM_ABBREVIATION']] = row['TEAM_ID']

## Rolling Average vs. Actual Value

In [None]:
def rollingAvg(team_id, feature_name, rolling_window = 3, season = 0, plot = True) :
    """
    This function takes a team and a feature and calculates the rolling average (not including the game on a given date)
    and the actual value of that feature on the day. It can visualize this comparison and returns the RMSE between the
    rolling average and actual value.

    Inputs:
    team_id: integer from 0-29 representing a team (required)
    feature_name: name of a numeric column in the dataset (required)
    rolling_window: how many previous games to average over (default = 3)
    season: season we are looking at (defaults to the most recent season; out-of-range values fall back to it)
    plot: whether or not to plot the comparison (default = True)

    Output: RMSE between predicted value (rolling average) and actual value
    """
    if (season < all_stats_cleaned['SEASON_YEAR'].min()) or (season > all_stats_cleaned['SEASON_YEAR'].max()) :
        season = all_stats_cleaned['SEASON_YEAR'].max()
        
    data = all_stats_cleaned[(all_stats_cleaned['TEAM_ID'] == team_id) & (all_stats_cleaned['SEASON_YEAR'] == season)]
    data = data[['GAME_DATE', feature_name]].sort_values(by='GAME_DATE')
    data['SHIFTED'] = data[feature_name].shift(1)
    data['ROLLING_AVG'] = data['SHIFTED'].rolling(window = rolling_window).mean()
    if plot :
        plt.figure(figsize=(10,5))
        plt.plot(data['GAME_DATE'], data['ROLLING_AVG'], marker = 'o', linestyle = '-', label = "Rolling Avg")
        plt.plot(data['GAME_DATE'], data[feature_name], marker = 'o', linestyle = '-', label = "Actual Value")
        plt.xlabel("Date")
        plt.ylabel(feature_name)
        plt.title(f"Rolling Average vs. Actual {feature_name} (window = {rolling_window})")
        plt.legend()
        plt.show()

    # Calculate RMSE
    data = data.dropna()
    error = np.sqrt(np.mean((data[feature_name].values-data['ROLLING_AVG'].values) ** 2))
    return error

The function above takes a team ID and a feature (plus optional arguments for the rolling window, the season of interest, and whether to plot) and computes a rolling average over the preceding games. Because the series is shifted by one game before averaging, each rolling average acts as a prediction for the next game, and the graph plots that prediction against the actual result. The function returns the RMSE between the two for the chosen window. We can use this to see how predictive previous games are of a team's performance in an upcoming game and to decide what a good window might be.
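To make the shift-then-average step concrete, here is a minimal sketch on made-up numbers. The helper names `shifted_rolling_mean` and `rmse` and the values in `toy_fgm` are hypothetical and only mirror the logic of `rollingAvg`; they are not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

def shifted_rolling_mean(values, window):
    """Mean of the previous `window` values, excluding the current one."""
    s = pd.Series(values, dtype=float)
    # shift(1) moves every value down one row, so the average at row i
    # only sees rows i-window .. i-1 (the games before game i).
    return s.shift(1).rolling(window=window).mean()

def rmse(actual, predicted):
    """RMSE over the rows where a prediction exists."""
    actual = pd.Series(actual, dtype=float)
    mask = predicted.notna()
    return float(np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2)))

toy_fgm = [30, 40, 35, 45, 50]  # made-up values, not real data
pred = shifted_rolling_mean(toy_fgm, window=2)
print(pred.tolist())        # [nan, nan, 35.0, 37.5, 40.0]
print(rmse(toy_fgm, pred))  # about 7.22
```

The first two predictions are missing because a window of 2 needs two earlier games, which is why `rollingAvg` drops those rows before computing the error.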

Below, we test rolling windows from 1 to 19, with the option to adjust the season, team, and feature. This can be used later on when building test examples for the model.

In [None]:
min_error = None
best_window = 0
team = 0
feature = 'FGM'
season = 2000

for i in range(1, 20) :
    rmse = rollingAvg(team, feature, i, season, False)
    if min_error is None or rmse < min_error:
        min_error = rmse
        best_window = i

print("Best Window:", best_window)
rollingAvg(team, feature, best_window, season)

The code above finds that the best window for the `FGM` variable (for team 0 in the 2000 season) is 11. It then visualizes the predictions against the actual values and returns the RMSE between them.
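One caveat when comparing windows this way: a larger window has fewer games with a prediction, so each window's RMSE is computed over a slightly different set of games. A possible refinement, sketched below under the assumption that we want a like-for-like comparison, is to score every window on the rows where all candidates have a prediction. The function name `best_window_common_rows` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def best_window_common_rows(values, windows):
    """Return (best_window, rmse_by_window), scoring every window on the same rows."""
    s = pd.Series(values, dtype=float)
    preds = {w: s.shift(1).rolling(window=w).mean() for w in windows}
    # Keep only rows where every candidate window has a prediction.
    common = pd.concat(preds, axis=1).notna().all(axis=1)
    scores = {
        w: float(np.sqrt(np.mean((s[common] - p[common]) ** 2)))
        for w, p in preds.items()
    }
    best = min(scores, key=scores.get)
    return best, scores

toy_values = [10, 20, 10, 20, 10, 20, 10, 20]  # made-up alternating series
best, scores = best_window_common_rows(toy_values, windows=[1, 2])
print(best, scores)  # 2 {1: 10.0, 2: 5.0}
```

On this alternating series a window of 1 always predicts the wrong level, while a window of 2 predicts the midpoint, so the longer window wins.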

## Team vs. Team Performance
We look at a heatmap of head-to-head win percentages to see how each team performs against every other team.

In [None]:
def calculateWinMatrix(start_season = all_stats_cleaned['SEASON_YEAR'].min(), end_season = all_stats_cleaned['SEASON_YEAR'].max()) :
    """
    This function takes a range of seasons and calculates the win percentage of each team against every other team for all games
    occurring within the provided seasons. Each row holds one team's win percentages against its opponents.

    Inputs:
    start_season: first season to look at (default first recorded season)
    end_season: last season to consider (default most recent season)

    Output: NumPy matrix of win percentages, one row per team.
    """
    
    num_teams = len(team_id_to_abb)
    np_win_matrix = np.zeros((num_teams, num_teams))
    for team_one in range(num_teams):
        for team_two, team_two_abb in team_id_to_abb.items() :
            if team_one == team_two : continue
            games = wins = 0
            for x in team_two_abb :
                matches = all_stats_cleaned[(all_stats_cleaned['TEAM_ID'] == team_one) & (all_stats_cleaned['OPPONENT'] == x) & (all_stats_cleaned['SEASON_YEAR'].between(start_season, end_season)) ]
                games += len(matches)
                wins += len(matches[matches['WIN'] == 1])
        
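            # Leave the cell as NaN when the two teams never met in the range (avoids dividing by zero)
            np_win_matrix[team_one][team_two] = wins / games if games else np.nan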
    return np_win_matrix


In [None]:
np_win_matrix = calculateWinMatrix(2020, 2025)
# Order the tick labels by team ID so they line up with the matrix rows and columns
teams = [team_id_to_abb[i][-1] for i in sorted(team_id_to_abb)]
plt.figure(figsize =(20,15))
sns.heatmap(np_win_matrix, annot=True, cmap="coolwarm", xticklabels=teams, yticklabels=teams) 
plt.xlabel("Opponent")
plt.ylabel("Team")
plt.title("Head-to-Head Win Percentage")
plt.show()

Looking at the heatmap above, we can see that some teams perform far better or worse than others. For example, DET loses more games than it wins against nearly every team in the range from 2020 to 2025. The same goes for Washington. However, Washington fares especially well against MIN, even though MIN generally has win percentages above 50%. This suggests that a specific matchup can raise a team's probability of winning even when that team generally loses its games.
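As an aside, the same kind of head-to-head table can be built without nested loops. The sketch below uses `pivot_table` on a made-up frame; the team labels `AAA`, `BBB`, `CCC` and the helper name `win_matrix` are hypothetical, and it assumes one row per team per game with a 0/1 `WIN` column, as in the notebook's data.

```python
import pandas as pd

# Made-up games: one row per team per game, WIN is 1 for a win and 0 for a loss.
toy = pd.DataFrame({
    "TEAM_ABBREVIATION": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "CCC", "CCC"],
    "OPPONENT":          ["BBB", "BBB", "CCC", "AAA", "AAA", "CCC", "AAA", "BBB"],
    "WIN":               [1,     0,     1,     0,     1,     1,     0,     0],
})

def win_matrix(df):
    """Rows are teams, columns are opponents, cells are the mean of WIN."""
    return df.pivot_table(
        index="TEAM_ABBREVIATION", columns="OPPONENT", values="WIN", aggfunc="mean"
    )

m = win_matrix(toy)
print(m)
```

Pairs that never met, including a team against itself, come out as NaN rather than 0, which keeps them from reading as a 0% win rate in a heatmap.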

## Home Court Advantage
Another important factor is home-court advantage (i.e., an increased chance of winning due to playing at home). We want to observe how this affects teams. Since teams typically play half their games at home and half away, we can forgo calculating home wins / total home games and away wins / total away games and instead just look at what share of a team's wins came at home.

In [None]:
def homeWins(team_id) :
    """
    This function takes a team ID and calculates the percentages of wins that are at home each season. 
    Then, we graph the values on a bar graph.

    Inputs:
    team_id: team ID, required
    """
    start = all_stats_cleaned['SEASON_YEAR'].min()
    end = all_stats_cleaned['SEASON_YEAR'].max()

    years = list(range(start, end+1))
    win_percentages = []
    games = all_stats_cleaned[all_stats_cleaned['TEAM_ID'] == team_id]

    for i in years :
        games_in_season = games[games['SEASON_YEAR'] == i]
        wins_in_season = games_in_season[games_in_season['WIN'] == 1]
        home_wins = wins_in_season[wins_in_season['HOME'] == 1]
        if len(wins_in_season) == 0 : 
            win_percentages.append(0)
        else :
            win_percentages.append(len(home_wins) / len(wins_in_season))

    team_abb = team_id_to_abb[team_id][-1]
    plt.bar(years, win_percentages)
    plt.xlabel('Year')
    plt.ylabel('Share of Wins at Home')
    plt.title(f'Share of Wins at Home for {team_abb} over the Years')
    plt.show()

In [None]:
homeWins(10)

The bar graph above shows that for the Los Angeles Lakers (LAL), playing at home is associated with a slightly higher probability of winning. Looking at more teams, we see that this trend continues, although perhaps not as strongly as some may think. This indicates that the model may find whether the game is home or away to be a significant factor.
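If the home and away game counts are not balanced in some season, the share of wins at home can be misleading. A minimal sketch of the direct comparison (home win rate vs. away win rate per season) is shown below on made-up rows; the helper name `home_away_win_rate` is hypothetical, and it assumes the same 0/1 `HOME` and `WIN` columns used above.

```python
import pandas as pd

# Made-up games for a single team: four per season, two at home and two away.
toy = pd.DataFrame({
    "SEASON_YEAR": [2020] * 4 + [2021] * 4,
    "HOME":        [1, 1, 0, 0, 1, 1, 0, 0],
    "WIN":         [1, 1, 1, 0, 1, 0, 0, 0],
})

def home_away_win_rate(df):
    """Per-season win rate, split into away (HOME == 0) and home (HOME == 1) games."""
    rates = df.groupby(["SEASON_YEAR", "HOME"])["WIN"].mean().unstack("HOME")
    return rates.rename(columns={0: "away_win_rate", 1: "home_win_rate"})

rates = home_away_win_rate(toy)
print(rates)
```

In the toy 2020 season the team wins both home games and one of two away games, so the home win rate is 1.0 and the away win rate is 0.5.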

## Statistics in Games Won vs. Lost
We also want to see how a team's statistics compare between the games it won and the games it lost. We can do that by graphing a feature's season averages separately for a specific team's wins and losses, as follows.

In [None]:
def winLossAverages(team_id, feature) :
    """
    This function takes a team ID and a feature and calculates the average value of that feature 
    for each season separated into games won or lost. This allows us to see how a value could be 
    used to predict if a team will win or not.
    """
    start = all_stats_cleaned['SEASON_YEAR'].min()
    end = all_stats_cleaned['SEASON_YEAR'].max()

    years = list(range(start, end+1))
    avg_for_wins = []
    avg_for_losses = []

    for y in years :
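        # Restrict to the requested team as well as the season
        games = all_stats_cleaned[(all_stats_cleaned['TEAM_ID'] == team_id) & (all_stats_cleaned['SEASON_YEAR'] == y)]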
        wins = games[games['WIN'] == 1]
        losses = games[games['WIN'] == 0]

        avg_for_wins.append(wins[feature].mean())
        avg_for_losses.append(losses[feature].mean())

    plt.figure(figsize = (10, 5))
    plt.plot(years, avg_for_wins, linestyle = '-', label = "Games Won")
    plt.plot(years, avg_for_losses, linestyle = '-', label = "Games Lost")
    plt.xlabel("Seasons")
    plt.ylabel(f"Average {feature}")
    plt.title(f"Comparing Averages of {feature} For Games Won or Lost by {team_id_to_abb[team_id][-1]}")
    plt.legend()
    plt.show()

In [None]:
winLossAverages(1, 'FG_PCT')

Above we can see that field goal percentage is consistently higher in games that are won than in games that are lost. Thus, when we expect a team to have a higher field goal percentage, it has a higher likelihood of winning.
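The per-season averages for wins and losses can also be computed in one step with `groupby`. The sketch below runs on made-up rows; the helper name `win_loss_means` and the values are hypothetical, and it assumes the same `TEAM_ID`, `SEASON_YEAR`, and 0/1 `WIN` columns as the notebook.

```python
import pandas as pd

# Made-up rows for two teams; FG_PCT values are illustrative only.
toy = pd.DataFrame({
    "TEAM_ID":     [1, 1, 1, 1, 2, 2],
    "SEASON_YEAR": [2020, 2020, 2021, 2021, 2020, 2020],
    "WIN":         [1, 0, 1, 0, 1, 0],
    "FG_PCT":      [0.50, 0.40, 0.48, 0.44, 0.60, 0.30],
})

def win_loss_means(df, team_id, feature):
    """Season-by-season mean of `feature` for one team, split by games lost and won."""
    team_games = df[df["TEAM_ID"] == team_id]
    means = team_games.groupby(["SEASON_YEAR", "WIN"])[feature].mean().unstack("WIN")
    return means.rename(columns={0: "lost", 1: "won"})

means = win_loss_means(toy, team_id=1, feature="FG_PCT")
print(means)
```

The resulting frame has one row per season with `lost` and `won` columns, so it can be passed to `means.plot()` to draw both lines at once.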