# Basketball Playoffs Qualification

## Task description

Basketball tournaments are usually split into two parts. First, all teams play each other, aiming to win as many games as possible. Then, at the end of this first part of the season, a predetermined number of teams with the most wins qualify for the playoffs, where they play a series of knockout matches for the trophy.

Over 10 years, data on players, teams, coaches, games and several other metrics were gathered and arranged in this dataset. The goal is to use this data to predict which teams will qualify for the playoffs in the next season.

## Data preparation

### Creating the database

First, we need to convert the CSV files to tables in an SQLite database, so we can analyze, manipulate and prepare the data more easily. This was done with a few `sqlite3` shell commands:

```
.mode csv
.import dataset/awards_players.csv awards_players
.import dataset/coaches.csv coaches
.import dataset/players.csv players
.import dataset/players_teams.csv players_teams
.import dataset/series_post.csv series_post
.import dataset/teams_post.csv teams_post
.import dataset/teams.csv teams
.save database.db
```
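The same import can also be scripted from Python with the standard-library `csv` and `sqlite3` modules, which may be convenient if the database has to be rebuilt often. The sketch below is an alternative under stated assumptions, not the method used for this project: the helper name `import_csv` is hypothetical, every column is stored as text (which, as far as we can tell, matches what `.import` does when it creates a table), and a tiny in-memory sample stands in for the real files.

```python
import csv
import io
import sqlite3


def import_csv(con, table, csv_file):
    """Create `table` from an open CSV file object, storing every column as TEXT."""
    reader = csv.reader(csv_file)
    header = next(reader)  # the first row holds the column names
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    con.execute(f'CREATE TABLE "{table}" ({columns})')
    con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    con.commit()


# Tiny stand-in for dataset/teams.csv (only three columns, for illustration).
sample = io.StringIO("year,tmID,playoff\n1,HOU,Y\n1,MIA,N\n")
con = sqlite3.connect(":memory:")
import_csv(con, "teams", sample)

rows = con.execute("SELECT tmID, playoff FROM teams ORDER BY tmID").fetchall()
print(rows)
```

With the real files, `csv_file` would be something like `open("dataset/teams.csv", newline="")`, and the connection would point at `database.db` instead of `:memory:`.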

### Filtering unneeded rows and columns

Upon closer inspection of the dataset, we found some rows that had no effect on, or could negatively affect, model training. One example is rows in the players table that correspond to current coaches and therefore carry no information about height, weight, etc.
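A filter of this kind could look like the sketch below. It is only an illustration: the column names `height` and `weight`, the convention that missing values are stored as `0`, and the three sample rows are all assumptions, so the condition would need to be adapted to the actual contents of the players table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE players (bioID TEXT, height TEXT, weight TEXT)")
con.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [
        ("player_a", "72.0", "150"),  # made-up player row
        ("coach_b", "0.0", "0"),      # made-up row with no physical data
        ("player_c", "75.0", "170"),  # made-up player row
    ],
)

# Delete rows whose height and weight are both missing (assumed to be stored as 0).
deleted = con.execute(
    "DELETE FROM players "
    "WHERE CAST(height AS REAL) = 0 AND CAST(weight AS REAL) = 0"
).rowcount
con.commit()

remaining = [row[0] for row in con.execute("SELECT bioID FROM players ORDER BY bioID")]
print(deleted, remaining)
```

The values are cast because columns imported from CSV are stored as text in this sketch.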

## Model performance measures

### The Game Score measure
The Game Score measure, created by John Hollinger, estimates a player's productivity in a single game. We start building our model on this measure, computing it from each player's stats for a whole season and dividing by the number of games played.
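Written as a plain function, the measure is a weighted sum of box-score totals divided by games played. The sketch below uses the same weights as the notebook cells in this section; the function name `game_score`, its argument names, and the sample season totals are made up for illustration.

```python
def game_score(pts, fgm, fga, fta, ftm, oreb, dreb, stl, ast, blk, pf, tov, games=1):
    """Hollinger's Game Score computed from totals, averaged over `games`."""
    total = (
        pts
        + 0.4 * fgm          # made field goals
        - 0.7 * fga          # attempted field goals
        - 0.4 * (fta - ftm)  # missed free throws
        + 0.7 * oreb
        + 0.3 * dreb
        + stl
        + 0.7 * ast
        + 0.7 * blk
        - 0.4 * pf           # personal fouls
        - tov                # turnovers
    )
    return total / games


# Made-up season totals for a player who appeared in 10 games.
per_game = game_score(
    pts=200, fgm=80, fga=150, fta=50, ftm=40,
    oreb=20, dreb=50, stl=10, ast=40, blk=10, pf=30, tov=20,
    games=10,
)
print(round(per_game, 2))
```

Dividing by `games` is what turns a single-game measure into a per-game season average, so players (or teams) with different numbers of games remain comparable.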


Import necessary packages

In [44]:
import sqlite3
import pandas as pd

Create dataframes based on the database and relations between data

In [45]:
con = sqlite3.connect("database.db")

# Player <-> Awards
pl_aw = pd.read_sql_query("SELECT * FROM awards_players INNER JOIN players ON awards_players.playerID = players.bioID", con)

# Player <-> Teams
pl_tm = pd.read_sql_query("SELECT * FROM players_teams INNER JOIN players ON players_teams.playerID = players.bioID", con)

# Teams <-> Post Season Results (aggregated)
tm_psa = pd.read_sql_query('''
    SELECT teams.year, teams.lgID, teams.tmID, franchID,
       confID, divID, rank, playoff, seeded, firstRound, semis,
       finals, name, o_fgm, o_fga, o_ftm, o_fta, o_3pm, o_3pa,
       o_oreb, o_dreb, o_reb, o_asts, o_pf, o_stl, o_to, o_blk,
       o_pts, d_fgm, d_fga, d_ftm, d_fta, d_3pm, d_3pa, d_oreb,
       d_dreb, d_reb, d_asts, d_pf, d_stl, d_to, d_blk, d_pts,
       tmORB, tmDRB, tmTRB, opptmORB, opptmDRB, opptmTRB, won,
       lost, GP, homeW, homeL, awayW, awayL, confW, confL,
       min, attend, arena,W, L
    FROM teams_post 
    INNER JOIN teams 
    ON (
        teams_post.tmID = teams.tmID 
        AND teams_post.year = teams.year
    )''', con)

# Coach <-> Teams
cc_tm = pd.read_sql_query("SELECT * FROM coaches INNER JOIN teams ON (coaches.tmID = teams.tmID AND coaches.year = teams.year)", con)

# Teams <-> Post Series Results
tm_pss = pd.read_sql_query('''
    SELECT winners.winnersID, winners.year, winners.winnersPlayoff, winners.winnersRank, losers.tmID, losers.playoff, losers.rank
    FROM
    (
        SELECT teams.tmID AS winnersID, teams.year AS year, teams.playoff AS winnersPlayoff, teams.rank AS winnersRank, series_post.tmIDLoser AS tmIDLoser
        FROM series_post 
        INNER JOIN teams
        ON
        (series_post.tmIDWinner = teams.tmID AND series_post.year = teams.year)
    ) AS winners
    JOIN teams AS losers
    ON
    (winners.tmIDLoser = losers.tmID AND winners.year = losers.year)
''', con)
df = pd.read_sql_query("SELECT * FROM teams", con)
df['year'] = df['year'].astype(int)
df.sort_values(by=['year'], inplace=True)
df

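| | year | lgID | tmID | franchID | confID | divID | rank | playoff | seeded | firstRound | ... | GP | homeW | homeL | awayW | awayL | confW | confL | min | attend | arena |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | WNBA | MIA | MIA | EA | | 6 | N | 0 | | ... | 32 | 9 | 7 | 4 | 12 | 9 | 12 | 6475 | 127721 | AmericanAirlines Arena |
| 24 | 1 | WNBA | DET | DET | EA | | 5 | N | 0 | | ... | 32 | 8 | 8 | 6 | 10 | 10 | 11 | 6425 | 107289 | The Palace of Auburn Hills |
| 89 | 1 | WNBA | PHO | PHO | WE | | 4 | Y | 0 | L | ... | 32 | 11 | 5 | 9 | 7 | 11 | 10 | 6425 | 161075 | US Airways Center |
| 129 | 1 | WNBA | UTA | SAS | WE | | 5 | N | 0 | | ... | 32 | 12 | 4 | 6 | 10 | 13 | 8 | 6400 | 103442 | EnergySolutions Arena |
| 99 | 1 | WNBA | POR | POR | WE | | 7 | N | 0 | | ... | 32 | 6 | 10 | 4 | 12 | 4 | 17 | 6525 | 133076 | Rose Garden Arena |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | 10 | WNBA | MIN | MIN | WE | | 5 | N | 0 | | ... | 34 | 9 | 8 | 5 | 12 | 7 | 13 | 6875 | 128127 | Target Center |
| 85 | 10 | WNBA | NYL | NYL | EA | | 7 | N | 0 | | ... | 34 | 8 | 9 | 5 | 12 | 9 | 13 | 6900 | 166604 | Madison Square Garden (IV) |
| 98 | 10 | WNBA | PHO | PHO | WE | | 1 | Y | 0 | W | ... | 34 | 12 | 5 | 11 | 6 | 13 | 7 | 6900 | 144884 | US Airways Center |
| 52 | 10 | WNBA | IND | IND | EA | | 1 | Y | 0 | W | ... | 34 | 14 | 3 | 8 | 9 | 17 | 5 | 6925 | 134964 | Conseco Fieldhouse |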


Merge columns with performance data into a single performance indicator

Game Score, applied to the season and to the teams

In [46]:
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format

In [47]:
for col in ['o_pts', 'o_fgm', 'o_fga', 'o_fta', 'o_ftm', 'o_oreb', 'o_dreb', 'o_stl', 'o_asts', 'o_blk', 'o_pf', 'o_to', 'GP']:
    df[col] = df[col].astype(int)

df['metric_game_score'] = (df['o_pts'] + 0.4 * df['o_fgm'] - 0.7 * df['o_fga'] - 0.4 * (df['o_fta'] - df['o_ftm']) + 0.7 * df['o_oreb'] + 0.3 * df['o_dreb'] + df['o_stl'] + 0.7 * df['o_asts'] + 0.7 * df['o_blk'] - 0.4 * df['o_pf'] - df['o_to']) / df['GP']
print(df.sort_values(by='metric_game_score', ascending=False)['metric_game_score'])
df.sort_values(by='year', ascending=True)

98    67.93
96    67.18
95    64.11
97    61.47
20    60.70
       ... 
133   37.70
7     37.65
114   36.95
63    31.70
119   29.60
Name: metric_game_score, Length: 142, dtype: float64


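| | year | lgID | tmID | franchID | confID | divID | rank | playoff | seeded | firstRound | ... | homeW | homeL | awayW | awayL | confW | confL | min | attend | arena | metric_game_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | WNBA | MIA | MIA | EA | | 6 | N | 0 | | ... | 9 | 7 | 4 | 12 | 9 | 12 | 6475 | 127721 | AmericanAirlines Arena | 31.70 |
| 34 | 1 | WNBA | HOU | HOU | WE | | 2 | Y | 0 | W | ... | 14 | 2 | 13 | 3 | 17 | 4 | 6475 | 196077 | Compaq Center | 59.68 |
| 76 | 1 | WNBA | NYL | NYL | EA | | 1 | Y | 0 | W | ... | 12 | 4 | 8 | 8 | 14 | 7 | 6425 | 231962 | Madison Square Garden (IV) | 45.84 |
| 2 | 1 | WNBA | CHA | CHA | EA | | 8 | N | 0 | | ... | 5 | 11 | 3 | 13 | 5 | 16 | 6475 | 90963 | Charlotte Coliseum | 44.13 |
| 132 | 1 | WNBA | WAS | WAS | EA | | 4 | Y | 0 | L | ... | 7 | 9 | 7 | 9 | 13 | 8 | 6400 | 244134 | Verizon Center | 47.71 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12 | 10 | WNBA | CHI | CHI | EA | | 5 | N | 0 | | ... | 12 | 5 | 4 | 13 | 10 | 12 | 6825 | 66852 | UIC Pavilion | 52.21 |
| 1 | 10 | WNBA | ATL | ATL | EA | | 2 | Y | 0 | L | ... | 12 | 5 | 6 | 11 | 10 | 12 | 6950 | 120737 | Philips Arena | 58.00 |
| 52 | 10 | WNBA | IND | IND | EA | | 1 | Y | 0 | W | ... | 14 | 3 | 8 | 9 | 17 | 5 | 6925 | 134964 | Conseco Fieldhouse | 54.41 |
| 111 | 10 | WNBA | SAC | SAC | WE | | 6 | N | 0 | | ... | 7 | 10 | 5 | 12 | 6 | 14 | 6850 | 131654 | ARCO Arena (II) | 52.52 |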


Game Score, applied to the season and to the players

In [48]:
for col in ['points', 'fgMade', 'fgAttempted', 'ftAttempted', 'ftMade', 'oRebounds', 'dRebounds', 'steals', 'assists', 'blocks', 'PF', 'turnovers', 'GP']:
    pl_tm[col] = pl_tm[col].astype(int)

pl_tm['metric_game_score'] = (pl_tm['points'] + 0.4 * pl_tm['fgMade'] - 0.7 * pl_tm['fgAttempted'] - 0.4 * (pl_tm['ftAttempted'] - pl_tm['ftMade']) + 0.7 * pl_tm['oRebounds'] + 0.3 * pl_tm['dRebounds'] + pl_tm['steals'] + 0.7 * pl_tm['assists'] + 0.7 * pl_tm['blocks'] - 0.4 * pl_tm['PF'] - pl_tm['turnovers']) / pl_tm['GP']
print(pl_tm.sort_values(by='metric_game_score', ascending=False)['metric_game_score'])


736    19.90
1576   17.77
1589   17.56
1587   17.26
735    17.21
        ... 
56     -1.90
287    -2.00
1663   -2.10
1612   -2.50
1141   -2.60
Name: metric_game_score, Length: 1876, dtype: float64


## Creating and training the model

In [69]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, f1_score

### Decision Tree Classifier

In [70]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Encode the target as 0/1; assigning the result back avoids relying on a chained in-place replace
df["playoff"] = df["playoff"].map({"N": 0, "Y": 1})

model = DecisionTreeClassifier(random_state=42)

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in ["playoff", "metric_game_score","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts"]):
        X_file.drop(column, axis=1, inplace=True)

# Fit the model to the training data
x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, test_size=0.19, shuffle=False)
trained_model = model.fit(x_train, y_train)
print(len(x_train), len(x_test))
# Predict using the trained model
y_prediction = trained_model.predict(x_test)

print(accuracy_score(y_test, y_prediction))
print(precision_score(y_test, y_prediction))
print(f1_score(y_test, y_prediction))

# Print each feature's importance; a plain loop avoids echoing a list of None values
for name, importance in zip(trained_model.feature_names_in_, trained_model.feature_importances_):
    print(f"{name}: {importance}")


115 27
0.5925925925925926
0.6666666666666666
0.6451612903225806
d_fgm: 0.0
d_fga: 0.07599240676662826
d_ftm: 0.04697712418300654
d_fta: 0.13203661565537952
d_3pm: 0.056568287037036986
d_3pa: 0.08147454696863698
d_oreb: 0.0
d_dreb: 0.03535801729260148
d_reb: 0.062175605536332175
d_asts: 0.0
d_pf: 0.06351163598801118
d_stl: 0.010485965219421101
d_to: 0.018790849673202617
d_blk: 0.030855429292929275
d_pts: 0.08119752929907599
metric_game_score: 0.3045759870877378
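The `115 27` line printed by each model cell follows from how the split is sized. To our understanding, when `test_size` is a float, `train_test_split` rounds the test count up and gives the remaining rows to training; with `shuffle=False` the test rows are simply the last ones. Because `df` was sorted by year, that means the most recent seasons are held out. The sketch below reproduces the arithmetic without scikit-learn; the helper name `split_sizes` is made up.

```python
import math


def split_sizes(n_samples, test_size):
    """Train/test sizes for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(142, 0.19)  # 142 team-season rows in df

# With shuffle=False, the split is a plain cut: first rows train, last rows test.
row_positions = list(range(142))
train_rows, test_rows = row_positions[:n_train], row_positions[n_train:]

print(n_train, n_test)
```

A fractional cut does not necessarily fall on a season boundary, so selecting the test rows by `year` instead could be a cleaner way to hold out whole seasons.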

### Naive Bayes: Gaussian and Multinomial

In [71]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

model = GaussianNB()

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in ["playoff", "metric_game_score","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts"]):
        X_file.drop(column, axis=1, inplace=True)

# Fit the model to the training data
x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, test_size=0.19, shuffle=False)
trained_model = model.fit(x_train, y_train)
print(len(x_train), len(x_test))
# Predict using the trained model
y_prediction = trained_model.predict(x_test)

print(accuracy_score(y_test, y_prediction))
print(precision_score(y_test, y_prediction))
print(f1_score(y_test, y_prediction))
print()
model = MultinomialNB()

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in ["playoff", "metric_game_score","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts"]):
        X_file.drop(column, axis=1, inplace=True)

# Fit the model to the training data
x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, test_size=0.19, shuffle=False)
trained_model = model.fit(x_train, y_train)
print(len(x_train), len(x_test))
# Predict using the trained model
y_prediction = trained_model.predict(x_test)

print(accuracy_score(y_test, y_prediction))
print(precision_score(y_test, y_prediction))
print(f1_score(y_test, y_prediction))

# Naive Bayes models do not expose feature_importances_, so no importances are listed for them.

115 27
0.6296296296296297
0.7142857142857143
0.6666666666666666

115 27
0.6666666666666666
0.7058823529411765
0.7272727272727272


### K-Nearest Neighbors

In [72]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in ["playoff", "metric_game_score","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts"]):
        X_file.drop(column, axis=1, inplace=True)

# Fit the model to the training data
x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, test_size=0.19, shuffle=False)
trained_model = model.fit(x_train, y_train)
print(len(x_train), len(x_test))
# Predict using the trained model
y_prediction = trained_model.predict(x_test)

print(accuracy_score(y_test, y_prediction))
print(precision_score(y_test, y_prediction))
print(f1_score(y_test, y_prediction))


115 27
0.6296296296296297
0.6875
0.6875


### Random Forest

In [73]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in ["playoff", "metric_game_score","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts"]):
        X_file.drop(column, axis=1, inplace=True)

# Fit the model to the training data
x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, test_size=0.19, shuffle=False)
trained_model = model.fit(x_train, y_train)
print(len(x_train), len(x_test))
# Predict using the trained model
y_prediction = trained_model.predict(x_test)

print(accuracy_score(y_test, y_prediction))
print(precision_score(y_test, y_prediction))
print(f1_score(y_test, y_prediction))


115 27
0.7037037037037037
0.7222222222222222
0.7647058823529411


### Logistic Regression

In [76]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in ["playoff", "metric_game_score","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts"]):
        X_file.drop(column, axis=1, inplace=True)

# Fit the model to the training data
x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, test_size=0.19, shuffle=False)
trained_model = model.fit(x_train, y_train)
print(len(x_train), len(x_test))
# Predict using the trained model
y_prediction = trained_model.predict(x_test)

print(accuracy_score(y_test, y_prediction))
print(precision_score(y_test, y_prediction))
print(f1_score(y_test, y_prediction))


115 27
0.8148148148148148
0.9230769230769231
0.8275862068965517


### Support Vector Machine

In [77]:
from sklearn.svm import SVC

model = SVC()

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in ["playoff", "metric_game_score","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts"]):
        X_file.drop(column, axis=1, inplace=True)

# Fit the model to the training data
x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, test_size=0.19, shuffle=False)
trained_model = model.fit(x_train, y_train)
print(len(x_train), len(x_test))
# Predict using the trained model
y_prediction = trained_model.predict(x_test)

print(accuracy_score(y_test, y_prediction))
print(precision_score(y_test, y_prediction))
print(f1_score(y_test, y_prediction))


115 27
0.5925925925925926
0.5925925925925926
0.7441860465116279
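The three SVC scores are worth reading together: accuracy and precision are identical, and the F1 score is noticeably higher than both. Working backwards from the printed numbers, they are consistent with the model predicting "qualifies" for all 27 test teams, 16 of which actually qualified. The sketch below checks this with a small helper (the name `classification_metrics` is ours, not scikit-learn's) that computes the metrics from confusion-matrix counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Counts inferred from the SVC output: every test team predicted as a playoff team.
accuracy, precision, recall, f1 = classification_metrics(tp=16, fp=11, fn=0, tn=0)
print(accuracy, precision, recall, f1)
```

In this situation the accuracy only reflects the share of playoff teams in the test set, so the SVC result says little about how well the features separate the two classes.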


### Gradient Boosted Trees

In [80]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()

X_file, Y_file = df.drop("playoff", axis=1), df["playoff"]

for column in df.columns:
    if (column not in ["playoff", "metric_game_score","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts"]):
        X_file.drop(column, axis=1, inplace=True)

# Fit the model to the training data
x_train, x_test, y_train, y_test = train_test_split(X_file, Y_file, test_size=0.19, shuffle=False)
trained_model = model.fit(x_train, y_train)
print(len(x_train), len(x_test))
# Predict using the trained model
y_prediction = trained_model.predict(x_test)

print(accuracy_score(y_test, y_prediction))
print(precision_score(y_test, y_prediction))
print(f1_score(y_test, y_prediction))

# Print each feature's importance; a plain loop avoids echoing a list of None values
for name, importance in zip(trained_model.feature_names_in_, trained_model.feature_importances_):
    print(f"{name}: {importance}")


115 27
0.6666666666666666
0.6842105263157895
0.742857142857143
d_fgm: 0.029979420510816108
d_fga: 0.03501961646066596
d_ftm: 0.057777198585370726
d_fta: 0.09859703224684972
d_3pm: 0.06775494821328029
d_3pa: 0.08009142661764256
d_oreb: 0.017492137573490305
d_dreb: 0.015338516449845184
d_reb: 0.006079363485992113
d_asts: 0.011168099198130397
d_pf: 0.040991546655702225
d_stl: 0.019790944896917757
d_to: 0.00434865571268638
d_blk: 0.039683136517949316
d_pts: 0.08151364430249583
metric_game_score: 0.3943743125721651
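Every model cell above repeats the same steps: select the features, split, fit, predict and score. One way to cut the repetition would be to loop over a dictionary of models that share a single split, as in the sketch below. It is not the notebook's code: it runs on synthetic data from `make_classification` (a stand-in with the same shape as the 142 x 16 feature matrix, not the real dataset), so its scores say nothing about the playoff problem, and the fixed `random_state` values are our own choice for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and target (NOT the real data).
X, y = make_classification(n_samples=142, n_features=16, random_state=0)

# One shared, unshuffled split, mirroring the settings used in the notebook.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.19, shuffle=False
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    y_prediction = model.fit(x_train, y_train).predict(x_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_prediction),
        "precision": precision_score(y_test, y_prediction, zero_division=0),
        "f1": f1_score(y_test, y_prediction, zero_division=0),
    }

for name, scores in results.items():
    print(name, scores)
```

With the real data, `X` and `y` would be the `X_file` and `Y_file` built in the cells above, and the remaining classifiers could be added to `models` in the same way.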