In [1]:
import numpy as np
import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text
import dotenv
import os

In [2]:
nba = pd.read_csv('ASA All NBA Raw Data.csv')


In [3]:
dotenv.load_dotenv()
postgrespassword = os.getenv('postgrespassword')

In [4]:
pd.set_option('display.max_rows', 81)
nba.head(1).T

Unnamed: 0,0
game_id,202202170BRK
game_date,2022-02-17
OT,0
H_A,A
Team_Abbrev,WAS
Team_Score,117
Team_pace,94.5
Team_efg_pct,0.627
Team_tov_pct,13.5
Team_orb_pct,22.9


## Database Normalization
### First normal form:

1. **All tables must have a primary key**: In this table, `game_id` and `player_id` together are unique on every row, and so they form the primary key.

2. **All the data must be atomic**: `Inactives` is non-atomic, because a single cell lists several inactive players.

3. **No repeating groups problem**: We can't solve the non-atomicity problem by creating separate columns if this leads to arbitrary ordering language in the column names (for example, `Inactive1`, `Inactive2`, etc.) and if it leads to a lot of missing data (there would be an `Inactive7` which would be missing any time a team has fewer than 7 inactive players).

Our solution to the non-atomic `Inactives` column is to cheat: because we have `is_inactive` for each player, `Inactives` is redundant. So we can just delete `Inactives`.

If we did not have `is_inactive`, we would have to create a new table called `Inactive` with two columns: `game_id`, and `inactive_player_id`, which contains the player ID of each inactive player in that game. Each inactive player in each game would get a new row. We wouldn't need to include team in this table because we could get the information about a player's team by joining this data to the player-game table. (A player plays for one team in one game, so if we know the game and player, we can look up the player's team.)
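Here is a minimal sketch of how that `Inactive` table could be built. The toy data and the storage format are assumptions: I am pretending `Inactives` holds a comma-separated string of player IDs, which may not match the real column.

```python
import pandas as pd

# Toy data (made up): one row per team per game, with a non-atomic Inactives cell
toy = pd.DataFrame({
    'game_id': ['g1', 'g1', 'g2'],
    'Team_Abbrev': ['WAS', 'BRK', 'GSW'],
    'Inactives': ['p7, p9', 'p4', None],   # assumed comma-separated format
})

inactive = (
    toy[['game_id', 'Inactives']]
    .dropna(subset=['Inactives'])                      # teams with no inactive players
    .assign(inactive_player_id=lambda d: d['Inactives'].str.split(','))
    .explode('inactive_player_id')                     # one row per inactive player
    .assign(inactive_player_id=lambda d: d['inactive_player_id'].str.strip())
    .drop(columns='Inactives')
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Every cell in `inactive` now holds a single value, and no `Inactive1`, `Inactive2`, ... columns are needed.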

In [5]:
nba = nba.drop(['Inactives'], axis=1)

### Functional Dependence
Let X and Y be columns in a data table. Y is functionally dependent on X if each value of X is associated with exactly one value of Y.

That's pretty abstract. So here are some guidelines that help me:

1. This use of "function" is the exact same as the concept of a function from algebra and pre-calculus. A correspondence f(x)=y is a function if each value of x has only one associated value of y.

2. X is either a primary key, or something that should be a primary key in another table.

For example, `game_date` (Y) is functionally dependent on `game_id` (X) because one `game_id` takes place on exactly one date.
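One way to test this in pandas is to count the distinct values of Y within each value of X. The sketch below uses a made-up toy table, and the helper name `is_functionally_dependent` is my own, not a pandas function.

```python
import pandas as pd

def is_functionally_dependent(df, x, y):
    """True if each value (or combination of values) of x has at most one value of y."""
    x = [x] if isinstance(x, str) else list(x)
    # nunique ignores missing values of y, so NaN does not count as a second value
    return bool(df.groupby(x)[y].nunique().le(1).all())

# Toy data (made up) shaped like the player-game table
toy = pd.DataFrame({
    'game_id': ['g1', 'g1', 'g2', 'g2'],
    'player_id': ['p1', 'p2', 'p1', 'p3'],
    'game_date': ['2022-02-17', '2022-02-17', '2020-03-07', '2020-03-07'],
    'minutes': [30, 12, 25, 18],
})

fd_date = is_functionally_dependent(toy, 'game_id', 'game_date')
fd_minutes = is_functionally_dependent(toy, 'game_id', 'minutes')
fd_minutes_full_key = is_functionally_dependent(toy, ['game_id', 'player_id'], 'minutes')
```

In the toy table, `game_date` depends on `game_id` alone, while `minutes` only depends on the full `game_id` and `player_id` pair. Passing this check on a sample is evidence, not proof: the dependence has to make sense for the data in general.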

### Second normal form:
In this table the primary key is a composite key consisting of two columns: `game_id` and `player_id`.

2NF is violated if any columns are functionally dependent on part of the primary key but not the entire primary key. This can only happen if the primary key is a composite key.

Here there are three columns that depend on `game_id` but not on `player_id`: `game_date`, `OT`, and `season`. We fix the 2NF violation by moving these columns to a new table keyed by `game_id`.

There is also one column, `player`, that depends on `player_id` but not `game_id`. We move it to a new table as well, keyed by `player_id`.

In [7]:
games = nba[['game_id', 'game_date', 'OT', 'season']].drop_duplicates()
players = nba[['player_id', 'player']].drop_duplicates()
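A quick sanity check on a split like this is to confirm that the key is unique in each new table and that joining the tables back together reproduces the original rows. This is a sketch on made-up toy data, and the `toy_*` names are mine.

```python
import pandas as pd

# Toy data (made up) with the same kinds of columns as the real table
toy = pd.DataFrame({
    'game_id': ['g1', 'g1', 'g2', 'g2'],
    'player_id': ['p1', 'p2', 'p1', 'p3'],
    'player': ['Ann', 'Bo', 'Ann', 'Cy'],
    'game_date': ['2022-02-17', '2022-02-17', '2020-03-07', '2020-03-07'],
    'OT': [0, 0, 1, 1],
    'season': [2022, 2022, 2020, 2020],
    'minutes': [30, 12, 25, 18],
})

toy_games = toy[['game_id', 'game_date', 'OT', 'season']].drop_duplicates()
toy_players = toy[['player_id', 'player']].drop_duplicates()

# If the functional dependencies hold, each key appears exactly once
games_key_ok = toy_games['game_id'].is_unique
players_key_ok = toy_players['player_id'].is_unique

# The player-game table keeps the keys plus columns that need the whole key
toy_player_game = toy.drop(columns=['game_date', 'OT', 'season', 'player'])

# Joining back should give the original rows (a lossless decomposition)
rebuilt = (toy_player_game
           .merge(toy_games, on='game_id')
           .merge(toy_players, on='player_id'))
sort_cols = ['game_id', 'player_id']
lossless = (rebuilt[toy.columns].sort_values(sort_cols).reset_index(drop=True)
            .equals(toy.sort_values(sort_cols).reset_index(drop=True)))
```

If `games_key_ok` came back `False` on the real data, it would mean some `game_id` has conflicting values of `game_date`, `OT`, or `season`, and the data would need cleaning before the split.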

### Third normal form:
3NF is violated if there are "transitive dependencies", that is, functional dependence of one column on another when neither column is part of the primary key.

In the main dataframe, `game_id` and `player_id` form the primary key. But many of the columns depend on `Team_Abbrev` together with `game_id`, and `Team_Abbrev` is not part of the primary key. We pull these columns out and create a new table:

In [14]:
# game_id is included so each row is identified by (game_id, Team_Abbrev)
team_game = nba[['game_id', 'H_A', 'Team_Abbrev', 'Team_Score','Team_pace', 
                 'Team_efg_pct','Team_tov_pct','Team_orb_pct',
                 'Team_ft_rate','Team_off_rtg']].drop_duplicates()
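For a team-game table to be usable, `game_id` and `Team_Abbrev` together should identify each row, and the player-game table should keep `Team_Abbrev` as the link back to it. A small sketch on made-up toy data, with only two of the team columns:

```python
import pandas as pd

# Toy data (made up): four players in one game, two per team
toy = pd.DataFrame({
    'game_id': ['g1', 'g1', 'g1', 'g1'],
    'player_id': ['p1', 'p2', 'p3', 'p4'],
    'Team_Abbrev': ['WAS', 'WAS', 'BRK', 'BRK'],
    'H_A': ['A', 'A', 'H', 'H'],
    'Team_Score': [117, 117, 103, 103],
    'minutes': [30, 12, 25, 18],
})

key = ['game_id', 'Team_Abbrev']
team_cols = ['H_A', 'Team_Score']

# One row per team per game
toy_team_game = toy[key + team_cols].drop_duplicates()
key_is_unique = not toy_team_game.duplicated(subset=key).any()

# The player-game table drops the team-level columns but keeps Team_Abbrev
toy_player_game = toy.drop(columns=team_cols)
```

Without `game_id` in the new table, there would be no way to tell which game a team's score or pace belongs to.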

In [13]:
# Need: python -m pip install --upgrade 'sqlalchemy<2.0'

Unnamed: 0,game_id,game_date,OT,H_A,Team_Abbrev,Team_Score,Team_pace,Team_efg_pct,Team_tov_pct,Team_orb_pct,...,pf_per_minute,ts,last_60_minutes_per_game_starting,last_60_minutes_per_game_bench,PG%,SG%,SF%,PF%,C%,active_position_minutes
0,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.061538,9.00,31.716667,22.017778,1.0,36.0,60.0,4.0,0.0,46.253586
1,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.099119,7.44,34.324000,18.475954,0.0,0.0,4.0,85.0,11.0,52.152590
2,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.000000,7.00,29.820290,16.051693,0.0,32.0,67.0,0.0,0.0,47.021807
3,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.048387,7.88,29.920833,14.603922,90.0,10.0,0.0,0.0,0.0,27.603314
4,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.000000,6.88,20.095833,14.538095,0.0,0.0,0.0,0.0,100.0,36.472537
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112118,202003070GSW,2020-03-07,0,H,GSW,118,90.9,0.606,7.0,18.9,...,0.107914,13.08,33.110667,19.232562,0.0,2.0,77.0,21.0,0.0,57.207786
112119,202003070GSW,2020-03-07,0,H,GSW,118,90.9,0.606,7.0,18.9,...,0.036079,6.00,25.470833,20.228571,5.0,45.0,43.0,7.0,0.0,58.202391
112120,202003070GSW,2020-03-07,0,H,GSW,118,90.9,0.606,7.0,18.9,...,0.150943,4.00,24.083333,13.228788,0.0,0.0,0.0,9.0,91.0,49.630640
112121,202003070GSW,2020-03-07,0,H,GSW,118,90.9,0.606,7.0,18.9,...,0.094340,12.64,34.783333,27.691667,0.0,44.0,48.0,8.0,0.0,58.923515
