# Basketball Playoffs Qualification

## Task description

Basketball tournaments are usually split into two parts. First, all teams play each other, aiming to win as many games as possible. Then, at the end of this first part of the season, a predetermined number of teams with the most wins qualify for the playoffs, where they play a series of knock-out matches for the trophy.

Over 10 years, data on players, teams, coaches, games and several other metrics were gathered and arranged in this dataset. The goal is to use this data to predict which teams will qualify for the playoffs in the next season.

## Data preparation

### Creating the database

First, we need to convert the CSV files to tables in an SQLite database, so we can analyze, manipulate and prepare data more easily. This was done with a few SQLite3 commands:

```
.mode csv
.import dataset/awards_players.csv awards_players
.import dataset/coaches.csv coaches
.import dataset/players.csv players
.import dataset/players_teams.csv players_teams
.import dataset/series_post.csv series_post
.import dataset/teams_post.csv teams_post
.import dataset/teams.csv teams
.save database.db
```
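
The same kind of import can also be scripted from Python with only the standard library, which may be convenient when the notebook has to rebuild the database on its own. The sketch below is an illustration rather than the commands used above: `load_csv` is a hypothetical helper, the in-memory CSV is toy data, and every column is stored as text, which is why later cells would still need numeric conversion.

```python
import csv
import io
import sqlite3


def load_csv(con, table, fileobj):
    """Hypothetical helper: create `table` from a CSV file object, all columns as TEXT."""
    reader = csv.reader(fileobj)
    header = next(reader)
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    marks = ", ".join("?" for _ in header)
    con.execute(f'CREATE TABLE "{table}" ({cols})')
    con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    con.commit()


# Toy data standing in for a file such as dataset/teams.csv
demo = io.StringIO("year,tmID,playoff\n1,ATL,N\n1,HOU,Y\n")
con = sqlite3.connect(":memory:")
load_csv(con, "teams", demo)

rows = con.execute("SELECT tmID, playoff FROM teams ORDER BY tmID").fetchall()
print(rows)
```

With real files, the same helper could be called once per CSV with `open(path, newline="")` as the file object.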

### Filtering unneeded rows and columns

Upon closer inspection of the dataset, we found some rows that had no effect or could negatively affect the training of our models, such as rows in the players table that correspond to current coaches and therefore carry no information about height, weight, etc.

## Model performance measures

### The Game Score measure
The Game Score measure, created by John Hollinger, estimates a player's productivity in a single game. We will start working on our model based on this measure, applying it to each player's stats for a whole season and dividing by the number of games played.
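
As a rough illustration of the measure, the sketch below applies the Game Score weights (the same ones used in the feature engineering cells of this notebook) to season totals and divides by games played. The function names and the stat line are made up for the example.

```python
def game_score(pts, fgm, fga, fta, ftm, oreb, dreb, stl, ast, blk, pf, tov):
    """Game Score for one stat line (a single game or season totals)."""
    return (pts + 0.4 * fgm - 0.7 * fga - 0.4 * (fta - ftm)
            + 0.7 * oreb + 0.3 * dreb + stl + 0.7 * ast + 0.7 * blk
            - 0.4 * pf - tov)


def season_game_score(totals, games_played):
    """Hypothetical helper: per-game Game Score from season totals."""
    return game_score(**totals) / games_played


# Made-up season totals for one player
totals = dict(pts=400, fgm=150, fga=350, fta=100, ftm=80, oreb=30,
              dreb=90, stl=40, ast=70, blk=10, pf=60, tov=50)
per_game = season_game_score(totals, games_played=30)
print(round(per_game, 2))
```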


## Data Preparation and Metrics

Import necessary packages

In [None]:
import sqlite3
import pandas as pd

Create dataframes based on the database and relations between data

In [None]:
con = sqlite3.connect("database.db")

# Player <-> Awards
pl_aw = pd.read_sql_query('''
    SELECT players_teams.playerID, players_teams.tmID,
        awards_players.award, awards_players.year
    FROM awards_players 
    LEFT JOIN players_teams
    ON (
        awards_players.playerID = players_teams.playerID 
        AND awards_players.year = players_teams.year
    )''', con)

# Coach <-> Awards
cc_aw = pd.read_sql_query('''
        SELECT playerID, award, c.year, c.tmID
        FROM awards_players
        INNER JOIN
        (
                SELECT coaches.coachID, teams.year, teams.tmID
                FROM teams
                INNER JOIN coaches
                ON (
                    coaches.tmID = teams.tmID
                    AND coaches.year = teams.year
                )    
        ) AS c
        ON (
            awards_players.playerID = c.coachID
            AND awards_players.year = c.year
        )
    ''', con)

# Players
pl = pd.read_sql_query("SELECT * FROM players", con)

# Teams
tm = pd.read_sql_query("SELECT * FROM teams", con)

# Player Teams
pt = pd.read_sql_query("SELECT * FROM players_teams", con)

# Player <-> Teams
pl_tm = pd.read_sql_query("SELECT * FROM players_teams INNER JOIN players ON players_teams.playerID = players.bioID", con)

# Teams <-> Post Season Results (aggregated)
tm_psa = pd.read_sql_query('''
    SELECT teams.year, teams.lgID, teams.tmID, franchID,
       confID, divID, rank, playoff, seeded, firstRound, semis,
       finals, name, o_fgm, o_fga, o_ftm, o_fta, o_3pm, o_3pa,
       o_oreb, o_dreb, o_reb, o_asts, o_pf, o_stl, o_to, o_blk,
       o_pts, d_fgm, d_fga, d_ftm, d_fta, d_3pm, d_3pa, d_oreb,
       d_dreb, d_reb, d_asts, d_pf, d_stl, d_to, d_blk, d_pts,
       tmORB, tmDRB, tmTRB, opptmORB, opptmDRB, opptmTRB, won,
       lost, GP, homeW, homeL, awayW, awayL, confW, confL,
       min, attend, arena,W, L
    FROM teams_post 
    INNER JOIN teams 
    ON (
        teams_post.tmID = teams.tmID 
        AND teams_post.year = teams.year
    )''', con)

# Coach <-> Teams
cc_tm = pd.read_sql_query("SELECT * FROM coaches INNER JOIN teams ON (coaches.tmID = teams.tmID AND coaches.year = teams.year)", con)

# Teams <-> Post Series Results
tm_pss = pd.read_sql_query('''
    SELECT winners.winnersID, winners.year, winners.winnersPlayoff, winners.winnersRank, losers.tmID, losers.playoff, losers.rank
    FROM
    (
        SELECT teams.tmID AS winnersID, teams.year AS year, teams.playoff AS winnersPlayoff, teams.rank AS winnersRank, series_post.tmIDLoser AS tmIDLoser
        FROM series_post 
        INNER JOIN teams
        ON
        (series_post.tmIDWinner = teams.tmID AND series_post.year = teams.year)
    ) AS winners
    JOIN teams AS losers
    ON
    (winners.tmIDLoser = losers.tmID AND winners.year = losers.year)
''', con)
cc_aw

## Data Pre-processing

### Column dropping

First, remove columns that contain only null values or a single unique value.

In [None]:
dataframes = [pl, pl_tm, tm_psa, cc_tm, tm]

for i in range(len(dataframes)):
    unique_counts = dataframes[i].nunique()
    dropped_columns = dataframes[i].columns[(dataframes[i].isna().sum() == len(dataframes[i])) | (unique_counts == 1)]
    
    print(f"Dropped columns in dataframe {i}: {list(dropped_columns)}")
    
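    # Drop in place so the original dataframes (pl, pl_tm, ...) are updated;
    # drop(..., inplace=True) returns None, so its result must not be assigned
    dataframes[i].drop(columns=dropped_columns, inplace=True)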
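
To make the selection rule concrete, here is a small sketch on toy data showing which columns the mask picks up. The dataframe and its values are invented for the example; only the masking logic mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame: `lgID` has a single unique value, `seeded` is entirely null
toy = pd.DataFrame({
    "tmID": ["ATL", "HOU", "NYL"],
    "lgID": ["WNBA", "WNBA", "WNBA"],
    "seeded": [np.nan, np.nan, np.nan],
    "won": [10, 20, 15],
})

all_null = toy.isna().all()        # True for columns with no values at all
constant = toy.nunique() == 1      # True for columns with one distinct value
to_drop = toy.columns[all_null | constant]

cleaned = toy.drop(columns=to_drop)
print(list(to_drop), list(cleaned.columns))
```

Note that `nunique()` ignores nulls by default, so an all-null column has zero unique values and is only caught by the null check.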

Now, we remove Player rows that have birth-date `0000-00-00`.

In [None]:
pl = pl.drop(pl[pl['birthDate'] == '0000-00-00'].index, axis = 0)
pl_tm = pl_tm.drop(pl_tm[pl_tm['birthDate'] == '0000-00-00'].index, axis = 0)

Lastly, when applicable, we remove the Team `name` attribute, since we already have access to the `tmID`.

In [None]:
tm_psa = tm_psa.drop(['name'], axis=1)
cc_tm = cc_tm.drop(['name'], axis=1)
tm = tm.drop(['name'], axis=1)

### Categorical encoding of Awards

Since not all awards are equal, it's useful to assign each one a score that signals its relevance to the algorithms. For example, the `Most Valuable Player` award is among the most valuable, while the `Kim Perrot Sportsmanship Award` recognises sportsmanship rather than playing skill.

In [None]:
print(pl_aw['award'].unique())

As we can see, the `Kim Perrot Sportsmanship Award` appears under two different spellings, so we merge them before encoding the awards.

In [None]:
pl_aw['award'] = pl_aw['award'].replace(['Kim Perrot Sportsmanship', 'Kim Perrot Sportsmanship Award'], 'Kim Perrot Sportsmanship Award')

print(pl_aw['award'].unique())

We can rank the awards as follows:

| Award                                    | Score |
|------------------------------------------|-------|
| 'All-Star Game Most Valuable Player'     |    7  |
| 'Coach of the Year'                      |   10  |
| 'Defensive Player of the Year'           |    7  |
| 'Kim Perrot Sportsmanship Award'         |    3  |
| 'Most Improved Player'                   |    8  |
| 'Most Valuable Player'                   |   10  |
| 'Rookie of the Year'                     |    8  |
| 'Sixth Woman of the Year'                |    7  |
| 'WNBA Finals Most Valuable Player'       |    7  |
| 'WNBA All-Decade Team'                   |    4  |
| 'WNBA All Decade Team Honorable Mention' |    3  |

In [None]:
award_scores = {
    'All-Star Game Most Valuable Player': 7,
    'Coach of the Year': 10,
    'Defensive Player of the Year': 7,
    'Kim Perrot Sportsmanship Award': 3,
    'Most Improved Player': 8,
    'Most Valuable Player': 10,
    'Rookie of the Year': 8,
    'Sixth Woman of the Year': 7,
    'WNBA Finals Most Valuable Player': 7,
    'WNBA All-Decade Team': 4,
    'WNBA All Decade Team Honorable Mention': 3
}

pl_aw['award_score'] = pl_aw['award'].map(award_scores)
cc_aw['award_score'] = cc_aw['award'].map(award_scores)

print(pl_aw)
print(cc_aw)
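
Because `Series.map` with a dictionary yields `NaN` for any value that is not a key, a quick check for unmapped award names can catch spelling variants like the one merged above. The sketch below uses a shortened, made-up score table and award list.

```python
import pandas as pd

# Shortened score table and toy award column, for illustration only
award_scores = {"Most Valuable Player": 10, "Rookie of the Year": 8}
awards = pd.Series([
    "Most Valuable Player",
    "Rookie of the Year",
    "Rookie Of The Year",   # spelling variant that is not in the table
])

scores = awards.map(award_scores)
unmapped = awards[scores.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every award received a score.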


Now, for the graphs of this metric:

In [None]:
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import numpy as np

In [None]:
team_award_scores = pl_aw.groupby(['tmID', 'year'])['award_score'].sum().reset_index()

# Merge team award scores with team wins
merged_team_data = pd.merge(team_award_scores, tm_psa[['tmID', 'year', 'won']], on=['tmID', 'year'], how='left')
merged_team_data['year'] = pd.to_numeric(merged_team_data['year'])
merged_team_data.sort_values(by='year', inplace=True)

merged_team_data['award_score'] = pd.to_numeric(merged_team_data['award_score'])
merged_team_data['won'] = pd.to_numeric(merged_team_data['won'])

# Create a scatter plot with regression line for each year
g = sns.FacetGrid(merged_team_data, col="year", col_wrap=4, height=4, sharex=False)
g.map(sns.scatterplot, 'award_score', 'won', alpha=0.7)
#g.map(sns.regplot, 'award_score', 'won', scatter=False, color='red')  # Add regression line
g.set_axis_labels("Team's Award Score", "Number of Wins")
g.set_titles("Year {col_name}")
g.add_legend(title='Year')
plt.show()


### Outliers

Now, we analyse the Z-score and IQR (via box plots) of the `weight` and `height` attributes.

In [None]:
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import numpy as np

In [None]:
# Outliers for Player's Weight + Height

# Convert Height and Weight Columns into numeric values
pl['height'] = pd.to_numeric(pl['height'], errors='coerce')
pl['weight'] = pd.to_numeric(pl['weight'], errors='coerce')

pl_tm['height'] = pd.to_numeric(pl_tm['height'], errors='coerce')
pl_tm['weight'] = pd.to_numeric(pl_tm['weight'], errors='coerce')

pl = pl.dropna(subset=['height', 'weight'])
pl_tm = pl_tm.dropna(subset=['height', 'weight'])

height_zscores = stats.zscore(pl['height'])
weight_zscores = stats.zscore(pl['weight'])

# Print the Z-scores, then draw the box plots
print("Height Z-Score\n", height_zscores)
print("\nWeight Z-Score\n", weight_zscores)

plt.title('Height Distribution')
plt.boxplot(pl['height'])
plt.show()

plt.title('Weight Distribution')
plt.boxplot(pl['weight'])
plt.show()

In [None]:
# Outlier for Team's played minutes

tm_psa['min'] = pd.to_numeric(tm_psa['min'], errors='coerce')
cc_tm['min'] = pd.to_numeric(cc_tm['min'], errors='coerce')

tm_psa = tm_psa.dropna(subset=['min'])
cc_tm = cc_tm.dropna(subset=['min'])


print("Team Played minutes Z-Score\n", stats.zscore(tm_psa['min']))
plt.title('Team played minutes Distribution')
plt.boxplot(tm_psa['min'])
plt.show()

In [None]:
# Outlier for Player's played minutes

pl_tm['minutes'] = pd.to_numeric(pl_tm['minutes'], errors='coerce')

print("Player Played minutes Z-Score\n", stats.zscore(pl_tm['minutes']))
plt.title('Player played minutes Distribution')
plt.boxplot(pl_tm['minutes'])
plt.show()

In [None]:
# Outlier Team points

tm_psa['o_pts'] = pd.to_numeric(tm_psa['o_pts'], errors='coerce')

print("Team's Offense Score Z-Score\n", stats.zscore(tm_psa['o_pts']))
plt.title("Team's Offense Score Distribution")
plt.boxplot(tm_psa['o_pts'])
plt.show()

tm_psa['d_pts'] = pd.to_numeric(tm_psa['d_pts'], errors='coerce')

print("Team's Defense Score Z-Score\n", stats.zscore(tm_psa['d_pts']))
plt.title("Team's Defense Score Distribution")
plt.boxplot(tm_psa['d_pts'])
plt.show()

In [None]:
# Outlier Player points

pl_tm['points'] = pd.to_numeric(pl_tm['points'], errors='coerce')

print("Player's Scores Z-Score\n", stats.zscore(pl_tm['points']))
plt.title("Player's Scores Distribution")
plt.boxplot(pl_tm['points'])
plt.show()

In this case, we didn't find it useful to remove the outliers, since they represent players that exceed average measures, as opposed to mistakes in the data collection process.
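
The box plots above flag points visually; the corresponding numeric fences can be computed directly. The sketch below assumes the usual 1.5 × IQR rule (also the default whisker length of matplotlib's `boxplot`) and uses made-up height values; `iqr_bounds` is a hypothetical helper.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Hypothetical helper: lower and upper fences of the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up heights (inches), with one unusually tall entry
heights = np.array([66, 68, 69, 70, 71, 72, 73, 74, 75, 90])
low, high = iqr_bounds(heights)
outliers = heights[(heights < low) | (heights > high)]
print(low, high, outliers)
```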

### Correlation between attributes

In [None]:
def plot_corr_mat(df, name, columns):
    correlation_matrix = df[columns].corr()
    
    plt.figure(figsize=(17, 10))

    mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

    cmap = sns.diverging_palette(220, 20, as_cmap=True)
    sns.heatmap(correlation_matrix, mask=mask, cmap=cmap, annot=True, fmt=".2f")
    plt.title(f'Correlation Matrix {name}')
    plt.tight_layout()
    plt.show()


In [None]:
dataframes = [pl_tm, tm_psa]
names = ['pl_tm', 'tm_psa']

selected_columns = [
	[
       'oRebounds', 'dRebounds', 'rebounds', 'assists', 'steals', 'blocks',
       'turnovers', 'PF', 'fgAttempted', 'fgMade', 'ftAttempted', 'ftMade',
       'threeAttempted', 'threeMade', 'dq', 'PostGP', 'PostGS', 'PostMinutes',
       'PostPoints', 'PostoRebounds', 'PostdRebounds', 'PostRebounds',
       'PostAssists', 'PostSteals', 'PostBlocks', 'PostTurnovers', 'PostPF',
       'PostfgAttempted', 'PostfgMade', 'PostftAttempted', 'PostftMade',
       'PostthreeAttempted', 'PostthreeMade', 'PostDQ'],
	
	[ 'o_fgm', 'o_fga', 'o_ftm', 'o_fta', 'o_3pm', 'o_3pa',
       'o_oreb', 'o_dreb', 'o_reb', 'o_asts', 'o_pf', 'o_stl', 'o_to', 'o_blk',
       'o_pts', 'd_fgm', 'd_fga', 'd_ftm', 'd_fta', 'd_3pm', 'd_3pa', 'd_oreb',
       'd_dreb', 'd_reb', 'd_asts', 'd_pf', 'd_stl', 'd_to', 'd_blk', 'd_pts',
       'min']
]


for i in range(0, len(dataframes)):
	df = dataframes[i]
	for column in selected_columns[i]:
		if (df[column].dtype == 'object'):  # Check if the column contains strings
			try:
				df[column] = pd.to_numeric(df[column])  # Try to convert to numeric
			except ValueError:
				pass
    
	
	plot_corr_mat(df, names[i], selected_columns[i])
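
Large heatmaps can be hard to read, so it may help to list the strongly correlated pairs as well. The sketch below is one possible way to do that, assuming a threshold of 0.9 on the absolute correlation; `strong_pairs` and the toy values are invented for the example.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.9):
    """Hypothetical helper: column pairs whose |correlation| >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones_like(corr, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold]


# Toy data: points closely follow field goals made, turnovers do not
toy = pd.DataFrame({
    "o_fgm": [10, 20, 30, 40],
    "o_pts": [21, 41, 59, 80],
    "o_to": [5, 1, 4, 2],
})
result = strong_pairs(toy)
print(result)
```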

## Feature Engineering

Create the dataframe, `df`, to be used with the models

In [None]:
df = tm
df['year'] = df['year'].astype(int)
df["playoff"] = df["playoff"].replace({"N": 0, "Y": 1})
df.sort_values(by=['year'], inplace=True)
df

Merge columns with performance data into a single performance indicator

Game Score, applied to the season and to the teams

In [None]:
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format

So that teams with a strong defensive performance have their strengths represented, a "defensive game score", `def_metric_game_score`, is also calculated, and the average of the two scores is used as the final metric.

In [None]:
for col in ['o_pts', 'o_fgm', 'o_fga', 'o_3pm', 'o_fta', 'o_ftm', 'o_oreb', 'o_dreb', 'o_stl', 'o_asts', 'o_blk', 'o_pf', 'o_to', 'GP']:
    df[col] = df[col].astype(int)

for col in ['d_pts', 'd_fgm', 'd_fga', 'd_3pm', 'd_fta', 'd_ftm', 'd_oreb', 'd_dreb', 'd_stl', 'd_asts', 'd_blk', 'd_pf', 'd_to', 'GP']:
    df[col] = df[col].astype(int)

df['off_metric_game_score'] = (df['o_pts'] + 0.4 * df['o_fgm'] - 0.7 * df['o_fga'] - 0.4 * (df['o_fta'] - df['o_ftm']) + 0.7 * df['o_oreb'] + 0.3 * df['o_dreb'] + df['o_stl'] + 0.7 * df['o_asts'] + 0.7 * df['o_blk'] - 0.4 * df['o_pf'] - df['o_to']) / df['GP']
df['def_metric_game_score'] = (df['d_pts'] + 0.4 * df['d_fgm'] - 0.7 * df['d_fga'] - 0.4 * (df['d_fta'] - df['d_ftm']) + 0.7 * df['d_oreb'] + 0.3 * df['d_dreb'] + df['d_stl'] + 0.7 * df['d_asts'] + 0.7 * df['d_blk'] - 0.4 * df['d_pf'] - df['d_to']) / df['GP']
df['metric_game_score'] = (df['off_metric_game_score'] + df['def_metric_game_score'])/2

print(df.sort_values(by='metric_game_score', ascending=False)['metric_game_score'])

Effective Field Goal (%) and Free Throw Rate calculation

In [None]:
df['eFG%'] = (df['o_fgm'] + 0.5 * df['o_3pm']) / df['o_fga']
df['FTA_rate'] = df['o_fta'] / df['o_fga']
df.sort_values(by='year', ascending=True)

df.columns

Plot to show relation between number of wins and effective field goal percentage.

In [None]:
copy = df.copy()

copy['eFG%'] = pd.to_numeric(copy['eFG%'], errors='coerce')  # Convert 'eFG%' to numeric type if needed
copy['won'] = pd.to_numeric(copy['won'], errors='coerce')  # Convert 'won' to numeric type if needed

copy.sort_values(by='won', inplace=True)

# Create a scatter plot for each year
g = sns.FacetGrid(copy, col="year", col_wrap=4, height=4, sharex=False)
g.map(sns.scatterplot, 'eFG%', 'won', alpha=0.7)
g.map(sns.regplot, 'eFG%', 'won', scatter=False, color='red')  # Add regression line
g.set_axis_labels("Effective Field Goal Percentage (eFG%)", "Number of Wins")
g.set_titles("Year {col_name}")
g.add_legend(title='Year')
plt.show()

Game Score, applied to the season and to the players

In [None]:
for col in ['points', 'fgMade', 'fgAttempted', 'ftAttempted', 'ftMade', 'oRebounds', 'dRebounds', 'steals', 'assists', 'blocks', 'PF', 'turnovers', 'GP']:
    pl_tm[col] = pl_tm[col].astype(int)

pl_tm['off_metric_game_score'] = (pl_tm['points'] + 0.4 * pl_tm['fgMade'] - 0.7 * pl_tm['fgAttempted'] - 0.4 * (pl_tm['ftAttempted'] - pl_tm['ftMade']) + 0.7 * pl_tm['oRebounds'] + 0.3 * pl_tm['dRebounds'] + pl_tm['steals'] + 0.7 * pl_tm['assists'] + 0.7 * pl_tm['blocks'] - 0.4 * pl_tm['PF'] - pl_tm['turnovers']) / pl_tm['GP']
print(pl_tm.sort_values(by='off_metric_game_score', ascending=False)['off_metric_game_score'])

mean_gs = pl_tm.groupby(['tmID', 'year'])["off_metric_game_score"].mean().reset_index().sort_values(by='off_metric_game_score', ascending=False)

for idx, x in mean_gs.iterrows():
    year_condition = df['year'] == int(x['year'])
    tmID_condition = df['tmID'] == str(x['tmID'])
    df.loc[year_condition & tmID_condition, "mean_player_game_score"] = x['off_metric_game_score']

df
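
The row-by-row assignment above can also be expressed as a single merge. The sketch below shows the idea on toy frames; it assumes `tmID` and `year` have the same dtype on both sides, which in this notebook would require converting `year` first.

```python
import pandas as pd

# Toy stand-ins for the team dataframe and the player-season dataframe
teams = pd.DataFrame({"tmID": ["ATL", "HOU"], "year": [1, 1]})
players = pd.DataFrame({
    "tmID": ["ATL", "ATL", "HOU"],
    "year": [1, 1, 1],
    "off_metric_game_score": [4.0, 6.0, 7.0],
})

# Average the player scores per team and season, then attach them to the teams
mean_gs = (
    players.groupby(["tmID", "year"], as_index=False)["off_metric_game_score"]
    .mean()
    .rename(columns={"off_metric_game_score": "mean_player_game_score"})
)
teams = teams.merge(mean_gs, on=["tmID", "year"], how="left")
print(teams)
```

A left merge keeps every team row and leaves `NaN` where no player data exists, which matches what the loop produces for teams it never visits.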

Plot between Game Score and Wins.

In [None]:
pl_tm['off_metric_game_score'] = pd.to_numeric(pl_tm['off_metric_game_score'], errors='coerce')  # Convert to numeric if needed

# Sum the 'off_metric_game_score' for each team's players
team_off_metric_scores = pl_tm.groupby(['tmID', 'year'])['off_metric_game_score'].sum().reset_index()

# Convert 'tmID' and 'year' columns to the same data type as in 'df'
team_off_metric_scores['tmID'] = team_off_metric_scores['tmID'].astype(copy['tmID'].dtype)
team_off_metric_scores['year'] = team_off_metric_scores['year'].astype(copy['year'].dtype)

# Merge with 'df' to get 'won' column
merged_data = pd.merge(team_off_metric_scores, copy[['tmID', 'year', 'won']], on=['tmID', 'year'], how='left')

# Sort the DataFrame by 'won'
merged_data.sort_values(by='won', inplace=True)

# Create a scatter plot for each year
g = sns.FacetGrid(merged_data, col="year", col_wrap=4, height=4, sharex=False)
g.map(sns.scatterplot, 'off_metric_game_score', 'won', alpha=0.7)
g.map(sns.regplot, 'off_metric_game_score', 'won', scatter=False, color='red')  # Add regression line
g.set_axis_labels("Sum of Player Off Metric Game Score", "Number of Wins")
g.set_titles("Year {col_name}")
g.add_legend(title='Year')
plt.show()

Create an award score for each `team-year` occurrence based on the award encoding previously employed

In [None]:
team_player_award_score = pl_aw.groupby(['tmID', 'year'])['award_score'].sum().reset_index()
team_coach_award_score = cc_aw.groupby(['tmID', 'year'])['award_score'].sum().reset_index()

df['award_score'] = 0

for idx, x in team_player_award_score.iterrows():
    year_condition = df['year'] == int(x['year'])
    tmID_condition = df['tmID'] == str(x['tmID'])
    df.loc[year_condition & tmID_condition, "award_score"] += x['award_score']

for idx, x in team_coach_award_score.iterrows():
    year_condition = df['year'] == int(x['year'])
    tmID_condition = df['tmID'] == str(x['tmID'])
    df.loc[year_condition & tmID_condition, "award_score"] += x['award_score']

df

Player retention calculation

In [None]:
pt['year'] = pt['year'].astype(int)
pt.sort_values(by=['year'], inplace=True)
grouped = pt.groupby(['tmID', 'year'])[['tmID', 'playerID', 'year']]
teams_dict = {}
res_dict = {}


for name, group in grouped:
    loop_df = pd.DataFrame(group)
    if name[0] in teams_dict.keys():
        teams_dict[name[0]][name[1]] = loop_df['playerID'].unique().tolist()
    else:
        teams_dict[name[0]] = {}
        teams_dict[name[0]][name[1]] = loop_df['playerID'].unique().tolist()

for x in teams_dict.keys():
    prev = None
    res_dict[x] = {}
    for y in teams_dict[x].keys():
        # First recorded season: there is no previous roster to compare with
        if prev is None:
            prev = y
            res_dict[x][y] = 0
            continue

        past = teams_dict[x][prev]
        present = teams_dict[x][y]
        prev = y

        # Count players from the previous roster who are still on the team
        count = sum(1 for player in past if player in present)

        # Ratio of the current roster that stayed from the previous season
        res_dict[x][y] = round(count / len(present), 2)


for key in res_dict.keys():
    for year in res_dict[key].keys():
        year_condition = df['year'] == int(year)
        tmID_condition = df['tmID'] == str(key)
        df.loc[year_condition & tmID_condition, "player_retention"] = res_dict[key][year]
df
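
To illustrate what the retention value means, here is a small worked example on invented rosters. `retention` is a hypothetical helper that follows the same definition as above: the share of the current roster that was already on the team in the previous season.

```python
def retention(previous_roster, current_roster):
    """Hypothetical helper: share of the current roster kept from last season."""
    if not current_roster:
        return 0.0
    stayed = set(previous_roster) & set(current_roster)
    return round(len(stayed) / len(current_roster), 2)


# Invented rosters for one team over three seasons
rosters = {1: ["a", "b", "c", "d"], 2: ["a", "b", "e"], 3: ["a", "e", "f", "g"]}

years = sorted(rosters)
by_year = {years[0]: 0}  # first season has nothing to compare with
for prev, cur in zip(years, years[1:]):
    by_year[cur] = retention(rosters[prev], rosters[cur])

print(by_year)
```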

## Scaling Features

Our features have different orders of magnitude. Since some models benefit from features on a common scale, we use a `MinMaxScaler` to rescale our engineered features to the range 0 to 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler

feature_cols = ["def_metric_game_score", "metric_game_score", "off_metric_game_score", "eFG%", "FTA_rate", "award_score", "player_retention", "mean_player_game_score"]

scaled_df = df[feature_cols]
scaler = MinMaxScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(scaled_df), index=scaled_df.index, columns=scaled_df.columns)
for col in feature_cols:
    df[col] = scaled_df[col]
df
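
For reference, min-max scaling with the default 0 to 1 range maps each value to `(x - min) / (max - min)` within its column. The sketch below reproduces that arithmetic by hand on toy values; `min_max` is a hypothetical helper and is not a replacement for the scaler used above.

```python
import pandas as pd


def min_max(col):
    """Hypothetical helper: rescale one column to the 0-1 range."""
    return (col - col.min()) / (col.max() - col.min())


# Toy features with very different magnitudes
toy = pd.DataFrame({"award_score": [0, 10, 25], "eFG%": [0.40, 0.45, 0.50]})
scaled = toy.apply(min_max)
print(scaled)
```

A constant column would divide by zero in this hand-written version, so it is only meant to show the formula.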

## Sliding Window

To predict playoff qualification for the following year, we must analyse the previous year's performance, so we shift the data by one year. This way, data recorded at the end of year 1 is used as learning data for the start of year 2.

Instead of shifting every relevant column one year ahead, we will instead increment the year value by one: this way, the year 1 results will be in the year 2 row, which mimics the effect we want. We will save year 10's results (which would be placed under year 11) and delete them from the dataframe, as they will only be used when making the final prediction (for year 11).

In [None]:
df['year'] = (df['year'].astype(int) + 1)
year_10_df = df[df['year'] == 11]
year_10_df.to_csv('dataset/teams_year_10.csv', index=None)
df = df[df['year'] < 11]
df
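
When shifting, it is worth checking that only the features move forward and that each `playoff` label stays attached to its own season. The sketch below shows one possible way to verify this on toy data: the label is kept keyed by its original year and merged back onto the shifted features. All values are invented, and this is an illustration rather than the cell above.

```python
import pandas as pd

# Toy seasons for a single team
seasons = pd.DataFrame({
    "tmID": ["ATL", "ATL", "ATL"],
    "year": [1, 2, 3],
    "metric_game_score": [0.2, 0.6, 0.9],
    "playoff": [0, 1, 0],
})

# Features move one year forward; labels keep their own year
features = seasons.drop(columns="playoff").copy()
features["year"] += 1
labels = seasons[["tmID", "year", "playoff"]]

# Each remaining row pairs last season's features with this season's outcome
shifted = features.merge(labels, on=["tmID", "year"], how="inner")
print(shifted)
```

Rows whose shifted year has no label yet (year 4 here) drop out of the inner merge, which corresponds to the rows set aside for the final prediction.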

## Creating and training the model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score, roc_auc_score
feature_selection_func = LinearSVC(dual="auto", penalty="l2")

### Train and test set balance

We can create a dictionary to hold the train/test split values in order to emulate different model training scenarios.

In [None]:
train_test = {
    5/9: 4/9,
    6/9: 3/9,
    8/9: 1/9,
}

### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', DecisionTreeClassifier(random_state=48))])

X_file, Y_file = df.drop("playoff", axis=1), df[["playoff"]]

for column in df.columns:
    if (column not in feature_cols + ["playoff"]):
        X_file.drop(column, axis=1, inplace=True)

dt_acc = 0
dt_prec = 0
dt_f1 = 0
dt_rec = 0
dt_roc_auc = 0

for train, test in train_test.items():
    # Fit the model to the training data
    x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, train_size=train, test_size=test, shuffle=False, random_state=48)
    trained_model = pipeline.fit(x_train, y_train)
    print(f"Train/test split: {train}/{test}")

    # Predict using the trained model
    y_prediction = trained_model.predict(x_test)

    dt_acc += accuracy_score(y_test, y_prediction)
    dt_prec += precision_score(y_test, y_prediction)
    dt_f1 += f1_score(y_test, y_prediction)
    dt_rec += recall_score(y_test, y_prediction)
    dt_roc_auc += roc_auc_score(y_test, y_prediction)

    print(accuracy_score(y_test, y_prediction))
    print(precision_score(y_test, y_prediction))
    print(f1_score(y_test, y_prediction))
    print(recall_score(y_test, y_prediction))
    print(roc_auc_score(y_test, y_prediction))
    
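    # Importances refer to the features kept by the selection step, so map them back through its mask
    selected_features = trained_model.feature_names_in_[trained_model['feature_selection'].get_support()]
    for name, importance in zip(selected_features, trained_model['classification'].feature_importances_):
        print(f"{name}: {round(importance * 100, 2)}%")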
    
    print("-" * 18)

dt_acc /= len(train_test.keys())
dt_prec /= len(train_test.keys())
dt_f1 /= len(train_test.keys())
dt_rec /= len(train_test.keys())
dt_roc_auc /= len(train_test.keys())

print("Overall:")
print(dt_acc)
print(dt_prec)
print(dt_f1)
print(dt_rec)
print(dt_roc_auc)
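
As a reminder of what the printed numbers mean, the sketch below derives the same metrics by hand from the confusion counts of a made-up prediction. `binary_metrics` is a hypothetical helper. Because the cells here pass hard 0/1 predictions to `roc_auc_score`, the value reported there reduces to the mean of recall and specificity, which the sketch computes as `balanced`.

```python
def binary_metrics(y_true, y_pred):
    """Hypothetical helper: metrics for 0/1 labels from confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced": (recall + specificity) / 2,
    }


# Made-up labels: 1 = qualified for the playoffs
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```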


### Naive Bayes Gaussian and Multinomial

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', GaussianNB())])

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in feature_cols + ["playoff"]):
        X_file.drop(column, axis=1, inplace=True)

nbg_acc = 0
nbg_prec = 0
nbg_f1 = 0
nbg_rec = 0
nbg_roc_auc = 0

for train, test in train_test.items():
    # Fit the model to the training data
    x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, train_size=train, test_size=test, shuffle=False, random_state=48)
    trained_model = pipeline.fit(x_train, y_train)
    print(f"Train/test split: {train}/{test}")

    # Predict using the trained model
    y_prediction = trained_model.predict(x_test)

    nbg_acc += accuracy_score(y_test, y_prediction)
    nbg_prec += precision_score(y_test, y_prediction)
    nbg_f1 += f1_score(y_test, y_prediction)
    nbg_rec += recall_score(y_test, y_prediction)
    nbg_roc_auc += roc_auc_score(y_test, y_prediction)

    print(accuracy_score(y_test, y_prediction))
    print(precision_score(y_test, y_prediction))
    print(f1_score(y_test, y_prediction))
    print(recall_score(y_test, y_prediction))
    print(roc_auc_score(y_test, y_prediction))
    
    print("-" * 18)

nbg_acc /= len(train_test.keys())
nbg_prec /= len(train_test.keys())
nbg_f1 /= len(train_test.keys())
nbg_rec /= len(train_test.keys())
nbg_roc_auc /= len(train_test.keys())

print("Overall:")
print(nbg_acc)
print(nbg_prec)
print(nbg_f1)
print(nbg_rec)
print(nbg_roc_auc)

print()

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', MultinomialNB())])

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in feature_cols + ["playoff"]):
        X_file.drop(column, axis=1, inplace=True)

nbm_acc = 0
nbm_prec = 0
nbm_f1 = 0
nbm_rec = 0
nbm_roc_auc = 0

for train, test in train_test.items():
    # Fit the model to the training data
    x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, train_size=train, test_size=test, shuffle=False, random_state=48)
    trained_model = pipeline.fit(x_train, y_train)
    print(f"Train/test split: {train}/{test}")

    # Predict using the trained model
    y_prediction = trained_model.predict(x_test)

    nbm_acc += accuracy_score(y_test, y_prediction)
    nbm_prec += precision_score(y_test, y_prediction)
    nbm_f1 += f1_score(y_test, y_prediction)
    nbm_rec += recall_score(y_test, y_prediction)
    nbm_roc_auc += roc_auc_score(y_test, y_prediction)

    print(accuracy_score(y_test, y_prediction))
    print(precision_score(y_test, y_prediction))
    print(f1_score(y_test, y_prediction))
    print(recall_score(y_test, y_prediction))
    print(roc_auc_score(y_test, y_prediction))
    
    print("-" * 18)

nbm_acc /= len(train_test.keys())
nbm_prec /= len(train_test.keys())
nbm_f1 /= len(train_test.keys())
nbm_rec /= len(train_test.keys())
nbm_roc_auc /= len(train_test.keys())

print("Overall:")
print(nbm_acc)
print(nbm_prec)
print(nbm_f1)
print(nbm_rec)
print(nbm_roc_auc)
# [print(f"{trained_model.feature_names_in_[idx]}: {x}") for idx, x in enumerate(trained_model.feature_importances_)]

### K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', KNeighborsClassifier())])


parameter_combinations = {
    'classification__n_neighbors': [*range(3, 16, 2)],
    'classification__weights': ['uniform', 'distance'],
    'classification__metric': ['euclidean', 'manhattan', 'minkowski'],
    'classification__algorithm': ['ball_tree', 'kd_tree', 'brute'],
    'classification__p': [1, 2]
}

pipeline = GridSearchCV(pipeline, parameter_combinations, cv=5, scoring='roc_auc')

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in feature_cols + ["playoff"]):
        X_file.drop(column, axis=1, inplace=True)

knn_acc = 0
knn_prec = 0
knn_f1 = 0
knn_rec = 0
knn_roc_auc = 0

for train, test in train_test.items():
    # Fit the model to the training data
    x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, train_size=train, test_size=test, shuffle=False, random_state=48)
    trained_model = pipeline.fit(x_train, y_train)
    print(f"Train/test split: {train}/{test}")
    print(pipeline.best_params_)

    # Predict using the trained model
    y_prediction = trained_model.predict(x_test)

    knn_acc += accuracy_score(y_test, y_prediction)
    knn_prec += precision_score(y_test, y_prediction)
    knn_f1 += f1_score(y_test, y_prediction)
    knn_rec += recall_score(y_test, y_prediction)
    knn_roc_auc += roc_auc_score(y_test, y_prediction)

    print(accuracy_score(y_test, y_prediction))
    print(precision_score(y_test, y_prediction))
    print(f1_score(y_test, y_prediction))
    print(recall_score(y_test, y_prediction))
    print(roc_auc_score(y_test, y_prediction))
    
    print("-" * 18)

knn_acc /= len(train_test.keys())
knn_prec /= len(train_test.keys())
knn_f1 /= len(train_test.keys())
knn_rec /= len(train_test.keys())
knn_roc_auc /= len(train_test.keys())

print("Overall:")
print(knn_acc)
print(knn_prec)
print(knn_f1)
print(knn_rec)
print(knn_roc_auc)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', RandomForestClassifier(random_state=48))])

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

parameter_combinations = {
    'classification__n_estimators': [50, 100, 200],
    'classification__min_samples_split': [2, 4, 9],
    'classification__min_samples_leaf': [2, 3, 5],
    'classification__max_features': ['sqrt', 'log2'],
    'classification__bootstrap': [True, False]
}

pipeline = GridSearchCV(pipeline, parameter_combinations, cv=5, scoring='roc_auc')

for column in df.columns:
    if (column not in feature_cols + ["playoff"]):
        X_file.drop(column, axis=1, inplace=True)

rf_acc = 0
rf_prec = 0
rf_f1 = 0
rf_rec = 0
rf_roc_auc = 0

for train, test in train_test.items():
    # Fit the model to the training data
    x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, train_size=train, test_size=test, shuffle=False, random_state=48)
    trained_model = pipeline.fit(x_train, y_train)
    print(f"Train/test split: {train}/{test}")
    print(pipeline.best_params_)

    # Predict using the trained model
    y_prediction = trained_model.predict(x_test)

    # Compute each metric once for this split
    split_acc = accuracy_score(y_test, y_prediction)
    split_prec = precision_score(y_test, y_prediction)
    split_f1 = f1_score(y_test, y_prediction)
    split_rec = recall_score(y_test, y_prediction)
    split_roc_auc = roc_auc_score(y_test, y_prediction)

    # Accumulate the running sums used for the overall averages
    rf_acc += split_acc
    rf_prec += split_prec
    rf_f1 += split_f1
    rf_rec += split_rec
    rf_roc_auc += split_roc_auc

    print(f"Accuracy: {split_acc}")
    print(f"Precision: {split_prec}")
    print(f"F1: {split_f1}")
    print(f"Recall: {split_rec}")
    print(f"Area under ROC curve: {split_roc_auc}")

    print("-" * 18)

# Average each metric over the number of train/test splits
n_splits = len(train_test)
rf_acc /= n_splits
rf_prec /= n_splits
rf_f1 /= n_splits
rf_rec /= n_splits
rf_roc_auc /= n_splits

print("Overall:")
print(f"Accuracy: {rf_acc}")
print(f"Precision: {rf_prec}")
print(f"F1: {rf_f1}")
print(f"Recall: {rf_rec}")
print(f"Area under ROC curve: {rf_roc_auc}")

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
import warnings

# Silence all warnings for the rest of the session, including the notice that
# l1_ratio is ignored unless penalty='elasticnet'
warnings.filterwarnings(action='ignore')

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', LogisticRegression(random_state=48))])

parameter_combinations = {
    'classification__C': [0.01, 0.1, 1, 10, 100],
    'classification__penalty': ['l1', 'l2', 'elasticnet'],
    'classification__l1_ratio': [0.2, 0.4, 0.6, 0.8],
    'classification__solver': ['saga'],
    'classification__max_iter': [1000, 3000, 5000]
}

pipeline = GridSearchCV(pipeline, parameter_combinations, cv=5, scoring='roc_auc')

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in feature_cols + ["playoff"]):
        X_file.drop(column, axis=1, inplace=True)

lr_acc = 0
lr_prec = 0
lr_f1 = 0
lr_rec = 0
lr_roc_auc = 0

for train, test in train_test.items():
    # Fit the model to the training data
    x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, train_size=train, test_size=test, shuffle=False, random_state=48)
    trained_model = pipeline.fit(x_train, y_train)
    print(f"Train/test split: {train}/{test}")
    print(pipeline.best_params_)

    # Predict using the trained model
    y_prediction = trained_model.predict(x_test)

    # Compute each metric once for this split
    split_acc = accuracy_score(y_test, y_prediction)
    split_prec = precision_score(y_test, y_prediction)
    split_f1 = f1_score(y_test, y_prediction)
    split_rec = recall_score(y_test, y_prediction)
    split_roc_auc = roc_auc_score(y_test, y_prediction)

    # Accumulate the running sums used for the overall averages
    lr_acc += split_acc
    lr_prec += split_prec
    lr_f1 += split_f1
    lr_rec += split_rec
    lr_roc_auc += split_roc_auc

    print(f"Accuracy: {split_acc}")
    print(f"Precision: {split_prec}")
    print(f"F1: {split_f1}")
    print(f"Recall: {split_rec}")
    print(f"Area under ROC curve: {split_roc_auc}")

    print("-" * 18)

# Average each metric over the number of train/test splits
n_splits = len(train_test)
lr_acc /= n_splits
lr_prec /= n_splits
lr_f1 /= n_splits
lr_rec /= n_splits
lr_roc_auc /= n_splits

print("Overall:")
print(f"Accuracy: {lr_acc}")
print(f"Precision: {lr_prec}")
print(f"F1: {lr_f1}")
print(f"Recall: {lr_rec}")
print(f"Area under ROC curve: {lr_roc_auc}")
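
The logistic regression grid above crosses `l1_ratio` with every penalty, although that parameter only has an effect when `penalty='elasticnet'`. As a possible alternative (not what the notebook does), `GridSearchCV` also accepts a list of grids, so `l1_ratio` can be varied for elastic net only. The sketch below uses `ParameterGrid` to count the candidates in each layout; the name `split_grid` is introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Single grid, as in the cell above: l1_ratio is crossed with every penalty
single_grid = {
    'classification__C': [0.01, 0.1, 1, 10, 100],
    'classification__penalty': ['l1', 'l2', 'elasticnet'],
    'classification__l1_ratio': [0.2, 0.4, 0.6, 0.8],
    'classification__solver': ['saga'],
    'classification__max_iter': [1000, 3000, 5000],
}

# Alternative layout: l1_ratio is only varied where it is actually used
split_grid = [
    {
        'classification__C': [0.01, 0.1, 1, 10, 100],
        'classification__penalty': ['l1', 'l2'],
        'classification__solver': ['saga'],
        'classification__max_iter': [1000, 3000, 5000],
    },
    {
        'classification__C': [0.01, 0.1, 1, 10, 100],
        'classification__penalty': ['elasticnet'],
        'classification__l1_ratio': [0.2, 0.4, 0.6, 0.8],
        'classification__solver': ['saga'],
        'classification__max_iter': [1000, 3000, 5000],
    },
]

n_single = len(ParameterGrid(single_grid))
n_split = len(ParameterGrid(split_grid))
print(n_single)  # 180 candidates
print(n_split)   # 90 candidates
```

Passing `split_grid` to `GridSearchCV` in place of the single dictionary would halve the number of fits per fold while covering the same set of distinct models.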

### Support Vector Machines

In [None]:
from sklearn.svm import SVC

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', SVC(random_state=48))])

parameter_combinations = {
    'classification__C': [0.1, 1, 10],
    'classification__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'classification__degree': [2, 3, 4],
    'classification__gamma': ['scale', 'auto', 0.1, 1],
    'classification__shrinking': [True, False],
    'classification__probability': [True, False]
}

pipeline = GridSearchCV(pipeline, parameter_combinations, cv=5, scoring='roc_auc')

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in feature_cols + ["playoff"]):
        X_file.drop(column, axis=1, inplace=True)
        
svc_acc = 0
svc_prec = 0
svc_f1 = 0
svc_rec = 0
svc_roc_auc = 0

for train, test in train_test.items():
    # Fit the model to the training data
    x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, train_size=train, test_size=test, shuffle=False, random_state=48)
    trained_model = pipeline.fit(x_train, y_train)
    print(f"Train/test split: {train}/{test}")
    print(pipeline.best_params_)

    # Predict using the trained model
    y_prediction = trained_model.predict(x_test)

    # Compute each metric once for this split
    split_acc = accuracy_score(y_test, y_prediction)
    split_prec = precision_score(y_test, y_prediction)
    split_f1 = f1_score(y_test, y_prediction)
    split_rec = recall_score(y_test, y_prediction)
    split_roc_auc = roc_auc_score(y_test, y_prediction)

    # Accumulate the running sums used for the overall averages
    svc_acc += split_acc
    svc_prec += split_prec
    svc_f1 += split_f1
    svc_rec += split_rec
    svc_roc_auc += split_roc_auc

    print(f"Accuracy: {split_acc}")
    print(f"Precision: {split_prec}")
    print(f"F1: {split_f1}")
    print(f"Recall: {split_rec}")
    print(f"Area under ROC curve: {split_roc_auc}")

    print("-" * 18)

# Average each metric over the number of train/test splits
n_splits = len(train_test)
svc_acc /= n_splits
svc_prec /= n_splits
svc_f1 /= n_splits
svc_rec /= n_splits
svc_roc_auc /= n_splits

print("Overall:")
print(f"Accuracy: {svc_acc}")
print(f"Precision: {svc_prec}")
print(f"F1: {svc_f1}")
print(f"Recall: {svc_rec}")
print(f"Area under ROC curve: {svc_roc_auc}")

### Gradient Boosted Trees

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', GradientBoostingClassifier(random_state=48))])

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in feature_cols + ["playoff"]):
        X_file.drop(column, axis=1, inplace=True)
        
gbt_acc = 0
gbt_prec = 0
gbt_f1 = 0
gbt_rec = 0
gbt_roc_auc = 0

for train, test in train_test.items():
    # Fit the model to the training data
    x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, train_size=train, test_size=test, shuffle=False, random_state=48)
    trained_model = pipeline.fit(x_train, y_train)
    print(f"Train/test split: {train}/{test}")

    # Predict using the trained model
    y_prediction = trained_model.predict(x_test)

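    # Compute each metric once for this split
    split_acc = accuracy_score(y_test, y_prediction)
    split_prec = precision_score(y_test, y_prediction)
    split_f1 = f1_score(y_test, y_prediction)
    split_rec = recall_score(y_test, y_prediction)
    split_roc_auc = roc_auc_score(y_test, y_prediction)

    # Accumulate the running sums used for the overall averages
    gbt_acc += split_acc
    gbt_prec += split_prec
    gbt_f1 += split_f1
    gbt_rec += split_rec
    gbt_roc_auc += split_roc_auc

    print(f"Accuracy: {split_acc}")
    print(f"Precision: {split_prec}")
    print(f"F1: {split_f1}")
    print(f"Recall: {split_rec}")
    print(f"Area under ROC curve: {split_roc_auc}")

    # The classifier only sees the columns kept by the feature selection step,
    # so pair its importances with the selected names, not with all input names
    selected_mask = trained_model['feature_selection'].get_support()
    selected_features = trained_model.feature_names_in_[selected_mask]
    for name, importance in zip(selected_features, trained_model['classification'].feature_importances_):
        print(f"{name}: {importance}")

    print("-" * 18)

# Average each metric over the number of train/test splits
n_splits = len(train_test)
gbt_acc /= n_splits
gbt_prec /= n_splits
gbt_f1 /= n_splits
gbt_rec /= n_splits
gbt_roc_auc /= n_splits

print("Overall:")
print(f"Accuracy: {gbt_acc}")
print(f"Precision: {gbt_prec}")
print(f"F1: {gbt_f1}")
print(f"Recall: {gbt_rec}")
print(f"Area under ROC curve: {gbt_roc_auc}")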
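
When a pipeline starts with `SelectFromModel`, the final classifier is trained on a subset of the input columns, so its `feature_importances_` array is shorter than the list of input features. The sketch below shows one way to recover the matching names with `get_support()`. It runs on synthetic data with hypothetical column names (`feat_0` ... `feat_5`), and a `RandomForestClassifier` stands in for this notebook's `feature_selection_func`, whose definition is not shown in this section.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Synthetic data with hypothetical feature names
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=48)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(6)])

# Stand-in selector: an assumption, not the notebook's feature_selection_func
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=48))

pipe = Pipeline([
    ('feature_selection', selector),
    ('classification', GradientBoostingClassifier(random_state=48)),
]).fit(X, y)

# Boolean mask over the input columns: True where the selector kept the feature
mask = pipe['feature_selection'].get_support()
selected = X.columns[mask]
importances = pipe['classification'].feature_importances_

for name, importance in zip(selected, importances):
    print(f"{name}: {importance:.3f}")
```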


## Model Comparison

In [None]:
dt_dict = {'Accuracy': dt_acc,
            'Precision': dt_prec,
            'F1': dt_f1,
            'Recall': dt_rec,
            'Area under ROC curve': dt_roc_auc}

nbg_dict = {'Accuracy': nbg_acc,
            'Precision': nbg_prec,
            'F1': nbg_f1,
            'Recall': nbg_rec,
            'Area under ROC curve': nbg_roc_auc}

nbm_dict = {'Accuracy': nbm_acc,
            'Precision': nbm_prec,
            'F1': nbm_f1,
            'Recall': nbm_rec,
            'Area under ROC curve': nbm_roc_auc}

knn_dict = {'Accuracy': knn_acc,
            'Precision': knn_prec,
            'F1': knn_f1,
            'Recall': knn_rec,
            'Area under ROC curve': knn_roc_auc}

rf_dict = {'Accuracy': rf_acc,
            'Precision': rf_prec,
            'F1': rf_f1,
            'Recall': rf_rec,
            'Area under ROC curve': rf_roc_auc}

lr_dict = {'Accuracy': lr_acc,
            'Precision': lr_prec,
            'F1': lr_f1,
            'Recall': lr_rec,
            'Area under ROC curve': lr_roc_auc}

svc_dict = {'Accuracy': svc_acc,
            'Precision': svc_prec,
            'F1': svc_f1,
            'Recall': svc_rec,
            'Area under ROC curve': svc_roc_auc}

gbt_dict = {'Accuracy': gbt_acc,
            'Precision': gbt_prec,
            'F1': gbt_f1,
            'Recall': gbt_rec,
            'Area under ROC curve': gbt_roc_auc}

comparison = pd.DataFrame({'Decision Tree': pd.Series(dt_dict),
                       'Gaussian Naive Bayes': pd.Series(nbg_dict),
                       'Multinomial Naive Bayes': pd.Series(nbm_dict),
                       'K-Nearest Neighbors': pd.Series(knn_dict),
                       'Random Forest': pd.Series(rf_dict),
                       'Logistic Regression': pd.Series(lr_dict),
                       'Support Vector Machines': pd.Series(svc_dict),
                       'Gradient Boosted Trees': pd.Series(gbt_dict),
                      })
comparison
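
A table shaped like `comparison` (metrics as rows, models as columns) can be easier to read once it is transposed and sorted by a single metric. The sketch below shows this on placeholder numbers that are for illustration only and are not results from this notebook; the model names are likewise hypothetical.

```python
import pandas as pd

# Placeholder scores for illustration only; not results from this notebook
example = pd.DataFrame({
    'Model A': pd.Series({'Accuracy': 0.70, 'F1': 0.72, 'Area under ROC curve': 0.68}),
    'Model B': pd.Series({'Accuracy': 0.75, 'F1': 0.74, 'Area under ROC curve': 0.77}),
    'Model C': pd.Series({'Accuracy': 0.65, 'F1': 0.69, 'Area under ROC curve': 0.71}),
})

# One row per model, best area under the ROC curve first
ranking = example.T.sort_values('Area under ROC curve', ascending=False)
print(ranking.round(3))
```

The same call on the real table would be `comparison.T.sort_values('Area under ROC curve', ascending=False)`.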

## Final prediction

In [None]:
pred_df = pd.read_csv('dataset/teams_year_10.csv')
year_11_df = pd.read_csv('dataset/teams_year_11.csv')

# Keep only the year-10 rows of teams that are also present in year 11;
# .copy() ensures later changes do not touch a view of the original frame
year_11_teams = year_11_df['franchID'].values
pred_df = pred_df[pred_df['franchID'].isin(year_11_teams)].copy()
pred_teams = pred_df['franchID'].values

pipeline = Pipeline([('feature_selection', SelectFromModel(feature_selection_func)),
                    ('classification', LogisticRegression(C=100, max_iter=1000, penalty='l1', solver='saga', random_state=48))])

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

# Keep only the chosen feature columns, in the same order for training and prediction
X_file = X_file[[column for column in X_file.columns if column in feature_cols]]
pred_df = pred_df[X_file.columns]

trained_model = pipeline.fit(X_file, Y_file)
predictions = trained_model.predict(pred_df)

# Match predictions to year-11 teams by franchID rather than by row position
year_11_df['playoff'] = year_11_df['franchID'].map(dict(zip(pred_teams, predictions)))
year_11_df['playoff'] = year_11_df['playoff'].replace({0: "N", 1: "Y"})
year_11_df = year_11_df[['name', 'playoff']]
year_11_df.to_csv("dataset/prediction.csv", index=False)
year_11_df