# Basketball Playoffs Qualification

## Task description

Basketball tournaments are usually split into two parts. First, all teams play each other, aiming to win as many games as possible. Then, at the end of this first part of the season, a predetermined number of teams with the most wins qualify for the playoffs, where they play a series of knockout matches for the trophy.

Over 10 years, data on players, teams, coaches, games and several other metrics was gathered and arranged into this dataset. The goal is to use this data to predict which teams will qualify for the playoffs in the next season.

## Data preparation

### Creating the database

First, we convert the CSV files to tables in an SQLite database so that we can analyze, manipulate and prepare the data more easily. This was done with a few SQLite3 shell commands:

```
.mode csv
.import dataset/awards_players.csv awards_players
.import dataset/coaches.csv coaches
.import dataset/players.csv players
.import dataset/players_teams.csv players_teams
.import dataset/series_post.csv series_post
.import dataset/teams_post.csv teams_post
.import dataset/teams.csv teams
.save database.db
```
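The same import could also be scripted from Python. The following is a minimal sketch, not the method used above: it assumes each CSV file has a header row, and the helper name `import_csvs` is hypothetical. Values are kept as text so that later comparisons against the string '0' behave the same way.

```python
import sqlite3
from pathlib import Path

import pandas as pd

# Table names taken from the import commands above.
TABLES = ["awards_players", "coaches", "players", "players_teams",
          "series_post", "teams_post", "teams"]


def import_csvs(csv_dir, db_path, tables=TABLES):
    """Load each <table>.csv found in csv_dir into a table of the same name."""
    con = sqlite3.connect(db_path)
    try:
        for name in tables:
            # dtype=str and keep_default_na=False keep every field as plain text.
            frame = pd.read_csv(Path(csv_dir) / f"{name}.csv",
                                dtype=str, keep_default_na=False)
            frame.to_sql(name, con, if_exists="replace", index=False)
        con.commit()
    finally:
        con.close()
```

Calling `import_csvs("dataset", "database.db")` would then rebuild the database from the files listed above.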

### Filtering unneeded rows and columns

Upon closer inspection of the dataset, we found some rows that either had no effect on or could negatively affect the training of our models. One example is rows in the players table that correspond to current coaches and therefore carry no information about height, weight, etc.
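One way to filter such rows is sketched below. It rests on assumptions that are not confirmed by the text: that the players table has `height` and `weight` columns, and that missing values are stored as zero. The function name is hypothetical.

```python
import pandas as pd


def drop_rows_without_body_data(players):
    """Drop rows where both height and weight are missing (assumed stored as 0)."""
    # Columns may hold text, so convert to numbers first; unparseable values count as 0.
    height = pd.to_numeric(players["height"], errors="coerce").fillna(0)
    weight = pd.to_numeric(players["weight"], errors="coerce").fillna(0)
    has_data = (height > 0) | (weight > 0)
    return players.loc[has_data].reset_index(drop=True)
```

Rows with only one of the two values missing are kept here, since they may still be useful after imputation.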

## Model performance measures

### The Game Score measure

The Game Score measure, created by John Hollinger, gives a rough estimate of a player's productivity in a single game. We will start building our model on this measure, computing it for each player from a whole season's stats and dividing the result by the number of games played.
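As an illustration, Hollinger's Game Score formula can be written as a small function. The parameter names and the `season_game_score` helper are illustrative and do not come from the dataset's column names. Because the formula is linear, applying it to season totals and dividing by games played gives the average per-game score.

```python
def game_score(pts, fgm, fga, ftm, fta, orb, drb, stl, ast, blk, pf, tov):
    """Hollinger's Game Score computed from box-score counts."""
    return (pts + 0.4 * fgm - 0.7 * fga - 0.4 * (fta - ftm)
            + 0.7 * orb + 0.3 * drb + stl + 0.7 * ast + 0.7 * blk
            - 0.4 * pf - tov)


def season_game_score(season_totals, games_played):
    """Average Game Score per game, given season totals as a dict of the stats above."""
    if games_played <= 0:
        # A player who did not play has no meaningful score; return 0 by convention.
        return 0.0
    return game_score(**season_totals) / games_played
```

For example, a player with 200 points, 80 of 150 field goals, 30 of 40 free throws, 20 offensive and 50 defensive rebounds, 10 steals, 40 assists, 10 blocks, 20 fouls and 30 turnovers over 10 games averages a Game Score of 15.9.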

### Basketball Power Index


Import necessary packages

In [1]:
import sqlite3
import pandas as pd

Create dataframes based on the database and relations between data

In [2]:
con = sqlite3.connect("database.db")

# Player <-> Awards
pl_aw = pd.read_sql_query("SELECT * FROM awards_players INNER JOIN players ON awards_players.playerID = players.bioID", con)

# Player <-> Teams
pl_tm = pd.read_sql_query("SELECT * FROM players_teams INNER JOIN players ON players_teams.playerID = players.bioID", con)

# Teams <-> Post Season Results (aggregated)
tm_psa = pd.read_sql_query("SELECT * FROM teams_post INNER JOIN teams ON (teams_post.tmID = teams.tmID AND teams_post.year = teams.year)", con)

# Coach <-> Teams
cc_tm = pd.read_sql_query("SELECT * FROM coaches INNER JOIN teams ON (coaches.tmID = teams.tmID AND coaches.year = teams.year)", con)


## Data Pre-processing

First, remove columns that carry no information, i.e. columns in which every value is the string '0'.

In [20]:
dataframes = [pl_aw, pl_tm, tm_psa, cc_tm]

for i, df in enumerate(dataframes):
    # Keep a column only if at least one of its values differs from the string '0'.
    kept = df.loc[:, (df != '0').any()]
    dropped_columns = df.columns.difference(kept.columns)
    print(f"Dropped columns in dataframe {i}: {list(dropped_columns)}")
    # Only the list entry is replaced; pl_aw, pl_tm, etc. still refer to the unfiltered frames.
    dataframes[i] = kept

Dropped columns in dataframe 0: ['firstseason', 'lastseason']
Dropped columns in dataframe 1: ['firstseason', 'lastseason']
Dropped columns in dataframe 2: ['opptmDRB', 'opptmORB', 'opptmTRB', 'seeded', 'tmDRB', 'tmORB', 'tmTRB']
Dropped columns in dataframe 3: ['opptmDRB', 'opptmORB', 'opptmTRB', 'seeded', 'tmDRB', 'tmORB', 'tmTRB']


Next, we analyse the z-score and interquartile range (IQR) of the values to find outliers.

In [23]:
# Outlier for Player's Weight + Height
# Outlier for Team's played minutes
# Outlier Player+Team's minutes
# Outlier Player+Team's points
# Players that have Birth-Date 0
# Feature engineering, number of awards per player/coach, number of awards per team
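The z-score and IQR checks planned above could look like the following sketch. The thresholds (3 standard deviations, 1.5 times the IQR) are common defaults rather than values chosen in this project, and the function names are hypothetical. Both functions expect a pandas Series and return a boolean mask that is `True` for outliers.

```python
import pandas as pd


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    values = pd.to_numeric(values, errors="coerce")  # columns may hold text
    z = (values - values.mean()) / values.std()
    return z.abs() > threshold


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    values = pd.to_numeric(values, errors="coerce")
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)
```

A call such as `iqr_outliers(df["weight"])` would then give a mask that can be used to inspect or drop the flagged rows; the column name here is only an example.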
