# Basketball Playoffs Qualification

## Task description

Basketball tournaments are usually split into two parts. First, all teams play each other, aiming to win as many games as possible. Then, at the end of this first part of the season, a predetermined number of teams with the most wins qualify for the playoffs, where they play a series of knockout matches for the trophy.

Over 10 years, data on players, teams, coaches, games and several other metrics was gathered and arranged in this dataset. The goal is to use this data to predict which teams will qualify for the playoffs in the next season.

## Data preparation

### Creating the database

First, we convert the CSV files into tables in an SQLite database, so that we can analyze, manipulate and prepare the data more easily. This was done with a few SQLite3 shell commands:

```
.mode csv
.import dataset/awards_players.csv awards_players
.import dataset/coaches.csv coaches
.import dataset/players.csv players
.import dataset/players_teams.csv players_teams
.import dataset/series_post.csv series_post
.import dataset/teams_post.csv teams_post
.import dataset/teams.csv teams
.save database.db
```
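
For readers who prefer to script this step, the same import can be sketched in Python with only the standard library. This is a minimal sketch, not the procedure used for this project: the helper names `import_csv` and `build_database` are hypothetical, each table is named after its CSV file, and every value is stored as text exactly as it appears in the file.

```python
import csv
import sqlite3
from pathlib import Path


def import_csv(con, csv_path, table):
    """Load one CSV file (with a header row) into a new SQLite table."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        # Fails if the table already exists, which avoids silent double imports
        con.execute(f'CREATE TABLE "{table}" ({columns})')
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    con.commit()


def build_database(dataset_dir, db_path):
    """Import every *.csv in dataset_dir, naming each table after its file."""
    con = sqlite3.connect(db_path)
    for csv_path in sorted(Path(dataset_dir).glob("*.csv")):
        import_csv(con, csv_path, csv_path.stem)
    return con
```

Under these assumptions, `build_database("dataset", "database.db")` would produce one table per CSV file in the `dataset` directory.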

### Filtering unneeded rows and columns

Upon closer inspection of the dataset, we found some rows that had no effect on, or could negatively affect, the training of our models. One example is rows in the players table that correspond to current coaches and therefore carry no information about height, weight, etc.
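
One way such rows could be removed is sketched below. The criterion is an assumption, not necessarily the filter applied here: it deletes every `players` row whose `bioID` never appears as a `playerID` in `players_teams`, using only the two columns that the join queries in this document rely on. The function name `drop_players_without_stats` is hypothetical.

```python
import sqlite3


def drop_players_without_stats(con):
    """Delete players rows that have no matching row in players_teams.

    Returns the number of rows removed.
    """
    cur = con.execute(
        """
        DELETE FROM players
        WHERE bioID NOT IN (
            SELECT playerID FROM players_teams WHERE playerID IS NOT NULL
        )
        """
    )
    con.commit()
    return cur.rowcount
```

The `IS NOT NULL` guard matters because `NOT IN` against a list containing `NULL` never evaluates to true, which would silently delete nothing.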

## Model performance measures

### The Game Score measure

The Game Score measure, created by John Hollinger, gives a rough estimate of a player's productivity in a single game. We start building our model on this measure, computing it for each player from a whole season's stats and dividing by the number of games played.
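
As a worked illustration, the formula can be written as a plain Python function. The function and parameter names are hypothetical, and the stat line in the example is made up for the arithmetic only; the weights are the same ones applied to the dataframe columns in this notebook.

```python
def game_score(pts, fgm, fga, ftm, fta, oreb, dreb, stl, ast, blk, pf, tov):
    """Hollinger's Game Score for one stat line (a single game or season totals)."""
    return (
        pts + 0.4 * fgm - 0.7 * fga
        - 0.4 * (fta - ftm)          # penalty for missed free throws
        + 0.7 * oreb + 0.3 * dreb + stl
        + 0.7 * ast + 0.7 * blk
        - 0.4 * pf - tov
    )


def game_score_per_game(games_played, **season_totals):
    """Season-total Game Score divided by games played (None if no games)."""
    if games_played <= 0:
        return None
    return game_score(**season_totals) / games_played


# Made-up single-game line: 20 points on 8/15 shooting, 3/4 free throws, etc.
single_game = game_score(pts=20, fgm=8, fga=15, ftm=3, fta=4, oreb=2, dreb=5,
                         stl=1, ast=4, blk=1, pf=3, tov=2)
print(round(single_game, 2))  # 16.5
```

Returning `None` for a player with zero games played avoids a division by zero; how such rows should be treated in the model is a separate decision.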


Import necessary packages

In [26]:
import sqlite3
import pandas as pd

Create dataframes based on the database and relations between data

In [9]:
con = sqlite3.connect("database.db")

# Player <-> Awards
pl_aw = pd.read_sql_query("SELECT * FROM awards_players INNER JOIN players ON awards_players.playerID = players.bioID", con)

# Player <-> Teams
pl_tm = pd.read_sql_query("SELECT * FROM players_teams INNER JOIN players ON players_teams.playerID = players.bioID", con)

# Teams <-> Post Season Results (aggregated)
tm_psa = pd.read_sql_query("SELECT * FROM teams_post INNER JOIN teams ON (teams_post.tmID = teams.tmID AND teams_post.year = teams.year)", con)

# Coach <-> Teams
cc_tm = pd.read_sql_query("SELECT * FROM coaches INNER JOIN teams ON (coaches.tmID = teams.tmID AND coaches.year = teams.year)", con)

# Teams <-> Post Series Results
tm_pss = pd.read_sql_query('''
    SELECT winners.winnersID, winners.year, winners.winnersPlayoff, winners.winnersRank, losers.tmID, losers.year, losers.playoff, losers.rank
    FROM
    (
        SELECT teams.tmID AS winnersID, teams.year AS year, teams.playoff AS winnersPlayoff, teams.rank AS winnersRank, series_post.tmIDLoser AS tmIDLoser
        FROM series_post 
        INNER JOIN teams
        ON
        (series_post.tmIDWinner = teams.tmID AND series_post.year = teams.year)
    ) AS winners
    JOIN teams AS losers
    ON
    (winners.tmIDLoser = losers.tmID AND winners.year = losers.year)
''', con)

df = pd.read_sql_query("SELECT * FROM teams", con)
df.sort_values(by=['year'], inplace=True)
df

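| | year | lgID | tmID | franchID | confID | divID | rank | playoff | seeded | firstRound | ... | GP | homeW | homeL | awayW | awayL | confW | confL | min | attend | arena |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 1 | WNBA | DET | DET | EA | | 5 | N | 0 | | ... | 32 | 8 | 8 | 6 | 10 | 10 | 11 | 6425 | 107289 | The Palace of Auburn Hills |
| 99 | 1 | WNBA | POR | POR | WE | | 7 | N | 0 | | ... | 32 | 6 | 10 | 4 | 12 | 4 | 17 | 6525 | 133076 | Rose Garden Arena |
| 34 | 1 | WNBA | HOU | HOU | WE | | 2 | Y | 0 | W | ... | 32 | 14 | 2 | 13 | 3 | 17 | 4 | 6475 | 196077 | Compaq Center |
| 13 | 1 | WNBA | CLE | CLE | EA | | 2 | Y | 0 | W | ... | 32 | 13 | 3 | 4 | 12 | 13 | 8 | 6500 | 137532 | Quicken Loans Arena |
| 53 | 1 | WNBA | LAS | LAS | WE | | 1 | Y | 0 | W | ... | 32 | 15 | 1 | 13 | 3 | 17 | 4 | 6450 | 105005 | Staples Center |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 127 | 9 | WNBA | SEA | SEA | WE | | 2 | Y | 0 | L | ... | 34 | 16 | 1 | 6 | 11 | 13 | 7 | 6800 | 140503 | KeyArena at Seattle Center |
| 117 | 9 | WNBA | SAS | SAS | WE | | 1 | Y | 0 | W | ... | 34 | 15 | 2 | 9 | 8 | 10 | 10 | 6850 | 135722 | AT&T Center |
| 97 | 9 | WNBA | PHO | PHO | WE | | 6 | N | 0 | | ... | 34 | 10 | 7 | 6 | 11 | 8 | 12 | 6825 | 144867 | US Airways Center |
| 140 | 9 | WNBA | WAS | WAS | EA | | 6 | N | 0 | | ... | 34 | 6 | 11 | 4 | 13 | 6 | 14 | 6825 | 154637 | Verizon Center |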
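
One thing to watch with the `SELECT *` joins above: both joined tables contain `tmID` and `year`, so the result carries each of those names twice, and the resulting dataframe ends up with duplicate column labels. The sketch below reproduces this on a tiny in-memory database and shows explicit aliases as one possible remedy. The table contents, the `W` and `L` columns of `teams_post`, the alias names, and the helper `column_names` are all assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE teams (tmID, year, rank)")
con.execute("CREATE TABLE teams_post (tmID, year, W, L)")
con.execute("INSERT INTO teams VALUES ('HOU', 1, 2)")
con.execute("INSERT INTO teams_post VALUES ('HOU', 1, 6, 0)")


def column_names(con, query):
    """Return the result column names SQLite reports for a query."""
    return [d[0] for d in con.execute(query).description]


# SELECT * keeps the key columns of both tables under the same names
star_cols = column_names(con, """
    SELECT * FROM teams_post
    INNER JOIN teams
    ON (teams_post.tmID = teams.tmID AND teams_post.year = teams.year)
""")

# Listing columns explicitly, with aliases, yields unique names
explicit_cols = column_names(con, """
    SELECT teams.tmID AS tmID, teams.year AS year, teams.rank AS rank,
           teams_post.W AS postW, teams_post.L AS postL
    FROM teams_post
    INNER JOIN teams
    ON (teams_post.tmID = teams.tmID AND teams_post.year = teams.year)
""")

print(star_cols)
print(explicit_cols)
```

With unique names, selecting a column such as `tm_psa['tmID']` returns a single series rather than a two-column frame.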


Merge columns with performance data into a single performance indicator

In [28]:
df = pd.read_csv('dataset/teams.csv')
pd.set_option('display.max_rows', None)

Game Score, applied to the season and to the teams

In [29]:
df['metric_game_score'] = (
    df['o_pts'] + 0.4 * df['o_fgm'] - 0.7 * df['o_fga']
    - 0.4 * (df['o_fta'] - df['o_ftm'])
    + 0.7 * df['o_oreb'] + 0.3 * df['o_dreb'] + df['o_stl']
    + 0.7 * df['o_asts'] + 0.7 * df['o_blk']
    - 0.4 * df['o_pf'] - df['o_to']
) / df['GP']
print(df.sort_values(by='metric_game_score', ascending=False)['metric_game_score'])

98    67.93
96    67.18
95    64.11
97    61.47
20    60.70
34    59.68
54    59.67
1     58.00
74    57.92
102   57.40
126   57.24
53    57.06
138   56.91
55    56.53
21    56.41
32    56.30
22    55.50
75    55.10
59    54.62
117   54.57
52    54.41
23    54.17
19    54.14
118   54.04
27    53.93
31    53.84
103   53.42
61    53.29
56    53.06
131   52.98
125   52.79
129   52.68
111   52.52
123   52.37
12    52.21
57    52.21
33    52.16
84    51.94
62    51.61
128   51.19
108   51.13
4     50.91
127   50.79
116   50.73
11    50.58
89    50.51
124   50.36
122   50.33
73    50.22
109   50.16
10    50.14
30    49.99
42    49.82
18    49.70
115   49.57
49    49.51
85    49.19
46    49.09
28    49.01
50    49.00
17    48.99
40    48.83
41    48.78
121   48.64
69    48.56
77    48.52
139   48.44
141   48.39
94    48.32
130   48.28
88    48.17
110   48.08
39    47.98
81    47.88
51    47.80
106   47.79
107   47.78
132   47.71
36    47.43
43    47.41
105   47.38
24    47.36
86    47.09
37    (output truncated)

Game Score, applied to the season and to the players

In [30]:
df = pd.read_csv('dataset/players_teams.csv')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format

df['metric_game_score'] = (
    df['points'] + 0.4 * df['fgMade'] - 0.7 * df['fgAttempted']
    - 0.4 * (df['ftAttempted'] - df['ftMade'])
    + 0.7 * df['oRebounds'] + 0.3 * df['dRebounds'] + df['steals']
    + 0.7 * df['assists'] + 0.7 * df['blocks']
    - 0.4 * df['PF'] - df['turnovers']
) / df['GP']
print(df.sort_values(by='metric_game_score', ascending=False)['metric_game_score'])


736    19.90
1576   17.77
1589   17.56
1587   17.26
735    17.21
278    17.16
732    16.92
54     16.52
1224   16.51
283    16.24
572    16.12
52     15.61
1590   15.58
279    15.58
733    15.50
1814   15.48
573    15.45
681    15.40
909    15.33
1340   15.17
737    15.05
680    15.02
734    14.96
282    14.92
1598   14.72
1580   14.72
904    14.70
738    14.52
907    14.44
609    14.21
1476   14.04
905    13.94
1588   13.92
1577   13.90
264    13.83
280    13.81
1277   13.69
576    13.67
103    13.63
906    13.62
1274   13.51
1864   13.35
1475   13.35
574    13.30
285    13.24
903    13.23
36     13.19
53     13.13
1276   13.11
1865   13.00
321    12.92
281    12.91
1756   12.88
51     12.80
682    12.78
1546   12.70
1647   12.67
1640   12.60
1646   12.58
500    12.55
1578   12.52
1248   12.38
683    12.29
608    12.26
1755   12.23
322    12.23
407    12.20
1479   12.14
38     12.13
1478   12.10
731    12.07
1225   12.07
406    12.06
415    12.02
910    11.90
782    11.89
817    11.89

## Creating and training the model

In [40]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, f1_score

# Reload the team table: `df` was last assigned the players_teams data above
df = pd.read_csv('dataset/teams.csv')

# Encode the target: 0 = missed the playoffs, 1 = qualified
df["playoff"] = df["playoff"].map({"N": 0, "Y": 1})
nFolds = 10
y_nCounts = [0]

# Numeric team columns used as features.
# NOTE: "rank", "won", "lost" and the home/away/conference splits describe the
# same season as the label, so they most likely leak the target. The perfect
# scores below should be read with that in mind.
feature_cols = ["year","rank","o_fgm","o_fga","o_ftm","o_fta","o_3pm","o_3pa","o_oreb","o_dreb","o_reb","o_asts","o_pf","o_stl","o_to","o_blk","o_pts","d_fgm","d_fga","d_ftm","d_fta","d_3pm","d_3pa","d_oreb","d_dreb","d_reb","d_asts","d_pf","d_stl","d_to","d_blk","d_pts","tmORB","tmDRB","tmTRB","opptmORB","opptmDRB","opptmTRB","won","lost","GP","homeW","homeL","awayW","awayL","confW","confL","min","attend"]
X_file = df[[col for col in df.columns if col in feature_cols]]
Y_file = df["playoff"]

# Create a decision tree classifier and a stratified 10-fold splitter
kf = StratifiedKFold(n_splits=nFolds, shuffle=True, random_state=42)
model = DecisionTreeClassifier()

for train_index, test_index in kf.split(X_file, Y_file):

    # Reset the per-position error counters when the fold size changes
    if len(y_nCounts) != len(test_index) and len(y_nCounts) != len(test_index)+1:
        y_nCounts = [0]*len(test_index)

    # Fit the model to the training data (fit() retrains from scratch each fold)
    x_train, y_train = X_file.iloc[train_index], Y_file.iloc[train_index]
    trained_model = model.fit(x_train, y_train)

    # Predict using the trained model
    x_test, y_test = X_file.iloc[test_index], Y_file.iloc[test_index]
    y_prediction = trained_model.predict(x_test)

    # Metrics gathering: count misclassifications per position within the fold
    y_nCounts = [count if y_pred == y_real
                 else count+1
                 for y_pred, y_real, count in zip(y_prediction, y_test, y_nCounts)]

    # scikit-learn metrics take the true labels first, then the predictions
    print(accuracy_score(y_test, y_prediction))
    print(precision_score(y_test, y_prediction))
    print(f1_score(y_test, y_prediction))


1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
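
Scores of 1.0 on every fold are consistent with the feature list containing same-season outcomes such as `rank`, `won` and `lost`, which largely determine playoff qualification for that season. Since the stated goal is to predict the next season, one possible alternative is to pair each team's stats from year t with its playoff label from year t + 1 and to evaluate on a later year than the ones used for training. The following is a minimal sketch of that idea using plain Python dictionaries; the function names and the toy rows are hypothetical, and `year` is assumed to be an integer.

```python
def next_season_pairs(rows, feature_keys):
    """Pair each team's season-t features with its playoff label in season t + 1."""
    by_team_year = {(r["tmID"], r["year"]): r for r in rows}
    pairs = []
    for (team, year), row in sorted(by_team_year.items()):
        following = by_team_year.get((team, year + 1))
        if following is None:
            continue  # no following season recorded for this team
        pairs.append({
            "tmID": team,
            "target_year": year + 1,
            "X": [row[k] for k in feature_keys],
            "y": 1 if following["playoff"] == "Y" else 0,
        })
    return pairs


def split_by_year(pairs, test_year):
    """Train on labels from earlier years, test on labels from test_year."""
    train = [p for p in pairs if p["target_year"] < test_year]
    test = [p for p in pairs if p["target_year"] == test_year]
    return train, test


# Toy rows with made-up values, for illustration only
rows = [
    {"tmID": "HOU", "year": 1, "won": 27, "playoff": "Y"},
    {"tmID": "HOU", "year": 2, "won": 19, "playoff": "N"},
    {"tmID": "HOU", "year": 3, "won": 24, "playoff": "Y"},
    {"tmID": "DET", "year": 1, "won": 14, "playoff": "N"},
    {"tmID": "DET", "year": 2, "won": 10, "playoff": "N"},
]
pairs = next_season_pairs(rows, ["won"])
train, test = split_by_year(pairs, test_year=3)
```

With this layout, using `won` or `rank` as a feature is no longer leakage, because those values come from the season before the one being predicted.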
