# Building a Database of NBA Games, Teams, Players, and Box Scores

A box score is a tabular summary of the statistics recorded during a game, in this case a professional basketball game in the NBA. For example, the box score for a recent game between the Cleveland Cavaliers and the Milwaukee Bucks is available at https://www.espn.com/nba/boxscore/_/gameId/401360541.

For this lab we will work with a dataset that combines the box scores from all NBA games into a single table. The data were compiled by Stewart Gibson of Advanced Sports Analytics.

The raw data has 81 columns. Below is a data dictionary.

In [1]:
import pandas as pd
pd.read_csv(filepath_or_buffer = 'Dictionary--Box_Scores.csv', sep = '\t', index_col = 'Column No.')

Column No.,Column name,Data type,Varies By,Calculated from other columns?,Description
0,game_id,object,Game,No,Unique ID for one NBA game
1,game_date,object,Game,No,Date in YYYY-MM-DD format
2,OT,int64,Game,No,Boolean: did the game go into overtime?
3,H_A,object,Game and Team,No,Is the team playing at home or away in this game?
4,Team_Abbrev,object,Team,No,Team
...,...,...,...,...,...
76,SG%,float64,Game and Player,No,% of player's minutes playing the point guard ...
77,SF%,float64,Game and Player,No,% of player's minutes playing the point guard ...
78,PF%,float64,Game and Player,No,% of player's minutes playing the power forwar...
79,C%,float64,Game and Player,Yes,% of player's minutes playing the center position


## Goals

1. Construct a set of data tables that together form a database in third normal form. This requires separate tables for games, teams, players, and box score statistics, five tables in all:

    a. Information overall about the game: OT, date, location, etc

    b. Info about how the team overall did in the game

    c. Info about how the player did personally in the game

    d. Info about the team's total season stats so far

    e. Info about the player's total season so far

    Use the aggregation methods in pandas to construct these tables.

2. Create an ER diagram.

3. Initialize a local database.

4. Upload the data tables to the database.

5. Issue SQL queries to the database.
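
The upload and query goals can be sketched end to end on toy data. This is only an illustration under stated assumptions: an in-memory SQLite database from the standard library stands in for the local database (the notebook's imports suggest PostgreSQL through `psycopg2` and SQLAlchemy), and the frames `games_toy` and `team_game_toy`, along with all of their values, are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical toy tables; every value here is made up for illustration.
games_toy = pd.DataFrame({
    "game_id": ["G1", "G2"],
    "game_date": ["2022-01-01", "2022-01-02"],
    "OT": [0, 1],
})
team_game_toy = pd.DataFrame({
    "game_id": ["G1", "G1", "G2", "G2"],
    "Team_Abbrev": ["AAA", "BBB", "AAA", "CCC"],
    "pts": [101, 99, 110, 112],
})

# In-memory SQLite stands in for the local database (an assumption).
conn = sqlite3.connect(":memory:")
games_toy.to_sql("games", conn, index=False, if_exists="replace")
team_game_toy.to_sql("team_game", conn, index=False, if_exists="replace")

# Join the two tables on their shared key and total the points per game.
query = """
    SELECT g.game_id, g.game_date, g.OT, SUM(t.pts) AS total_pts
    FROM games AS g
    JOIN team_game AS t ON t.game_id = g.game_id
    GROUP BY g.game_id, g.game_date, g.OT
    ORDER BY g.game_id
"""
result = pd.read_sql_query(query, conn)
conn.close()
print(result)
```

With a PostgreSQL server, the same `to_sql` and `read_sql_query` calls would take a SQLAlchemy engine in place of the SQLite connection.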

## We load all box scores from all NBA games.

In [2]:
import numpy as np
import pandas as pd
import psycopg2
import dotenv
import os
from sqlalchemy import create_engine

In [3]:
nba = pd.read_csv('ASA All NBA Raw Data.csv')
nba.head(n = 5)

Unnamed: 0,game_id,game_date,OT,H_A,Team_Abbrev,Team_Score,Team_pace,Team_efg_pct,Team_tov_pct,Team_orb_pct,...,pf_per_minute,ts,last_60_minutes_per_game_starting,last_60_minutes_per_game_bench,PG%,SG%,SF%,PF%,C%,active_position_minutes
0,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.061538,9.0,31.716667,22.017778,1.0,36.0,60.0,4.0,0.0,46.253586
1,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.099119,7.44,34.324,18.475954,0.0,0.0,4.0,85.0,11.0,52.15259
2,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.0,7.0,29.82029,16.051693,0.0,32.0,67.0,0.0,0.0,47.021807
3,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.048387,7.88,29.920833,14.603922,90.0,10.0,0.0,0.0,0.0,27.603314
4,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.0,6.88,20.095833,14.538095,0.0,0.0,0.0,0.0,100.0,36.472537


In [4]:
pd.options.display.max_rows = None
nba.head(1).T.head(n = 5)

Unnamed: 0,0
game_id,202202170BRK
game_date,2022-02-17
OT,0
H_A,A
Team_Abbrev,WAS


## We drop column Inactives to ensure first-normal form.

In [5]:
pd.options.display.max_rows = 10
nba = nba.drop('Inactives', axis=1)
nba.head(1).T.head(n = 5)

Unnamed: 0,0
game_id,202202170BRK
game_date,2022-02-17
OT,0
H_A,A
Team_Abbrev,WAS
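
First normal form requires every cell to hold a single atomic value. The sketch below assumes, hypothetically, that a column like `Inactives` stores several player names in one comma-separated string; the real format of the column is not shown in this notebook. It illustrates an alternative to dropping such a column: moving it into its own table with one row per game and inactive player.

```python
import pandas as pd

# Hypothetical data: the real format of Inactives may differ.
raw_toy = pd.DataFrame({
    "game_id": ["G1", "G2"],
    "Inactives": ["Player A, Player B", "Player C"],
})

# Split the multi-valued cell into a list, then give each value its own row.
inactives_toy = (
    raw_toy.assign(inactive_player=raw_toy["Inactives"].str.split(", "))
           .explode("inactive_player")
           .drop(columns="Inactives")
           .reset_index(drop=True)
)
print(inactives_toy)
```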


## We filter the data to just the current NBA season

In [6]:
nba2022 = nba.query("season == 2022").copy()  # copy so later column assignments do not raise SettingWithCopyWarning
nba2022

Unnamed: 0,game_id,game_date,OT,H_A,Team_Abbrev,Team_Score,Team_pace,Team_efg_pct,Team_tov_pct,Team_orb_pct,...,pf_per_minute,ts,last_60_minutes_per_game_starting,last_60_minutes_per_game_bench,PG%,SG%,SF%,PF%,C%,active_position_minutes
0,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.061538,9.00,31.716667,22.017778,1.0,36.0,60.0,4.0,0.0,46.253586
1,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.099119,7.44,34.324000,18.475954,0.0,0.0,4.0,85.0,11.0,52.152590
2,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.000000,7.00,29.820290,16.051693,0.0,32.0,67.0,0.0,0.0,47.021807
3,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.048387,7.88,29.920833,14.603922,90.0,10.0,0.0,0.0,0.0,27.603314
4,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.000000,6.88,20.095833,14.538095,0.0,0.0,0.0,0.0,100.0,36.472537
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22619,202201130NOP,2022-01-13,0,A,LAC,89,97.0,0.444,14.3,10.2,...,0.451128,1.00,7.977756,4.313333,0.0,0.0,0.0,0.0,100.0,48.215347
22620,202201150SAS,2022-01-15,0,A,LAC,94,92.7,0.440,11.9,35.0,...,,0.00,7.997933,4.324242,0.0,0.0,0.0,0.0,100.0,57.995905
22621,202112220SAC,2021-12-22,0,A,LAC,105,95.5,0.555,11.9,15.4,...,0.000000,0.00,,,0.0,0.0,0.0,0.0,100.0,
22622,202112260LAC,2021-12-26,0,H,LAC,100,99.8,0.512,13.0,17.4,...,,0.00,2.712684,1.466667,0.0,0.0,0.0,0.0,100.0,41.567992


## Information overall about the game: OT, date, location, etc

In [7]:
game_info = nba2022[['game_id', 'game_date', 'OT']].drop_duplicates()
game_info

Unnamed: 0,game_id,game_date,OT
0,202202170BRK,2022-02-17,0
26,202202170CHO,2022-02-17,2
48,202202170LAC,2022-02-17,0
71,202202170MIL,2022-02-17,0
95,202202170NOP,2022-02-17,0
...,...,...,...
13978,202111190NOP,2021-11-19,0
13984,202111290LAC,2021-11-29,0
14001,202201130NOP,2022-01-13,0
14005,202201250PHI,2022-01-25,0
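
For `game_id` to serve as the primary key of the games table, each value must appear exactly once after `drop_duplicates()`. A minimal check on made-up rows follows; the frame `raw_toy` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical player-level rows: game-level columns repeat for each player.
raw_toy = pd.DataFrame({
    "game_id": ["G1", "G1", "G1", "G2", "G2"],
    "game_date": ["2022-01-01"] * 3 + ["2022-01-02"] * 2,
    "OT": [0, 0, 0, 1, 1],
    "player_id": ["p1", "p2", "p3", "p1", "p4"],
})

games_toy = raw_toy[["game_id", "game_date", "OT"]].drop_duplicates()

# If a game-level column varied within a game, drop_duplicates would keep
# several rows for that game and this check would be False.
key_is_unique = games_toy["game_id"].is_unique
print(len(games_toy), key_is_unique)
```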


## Info about how the team overall did in the game

In [8]:
# Flag a win when the team outscored its opponent; assign() returns a new frame
nba2022 = nba2022.assign(win=nba2022['Team_Score'] > nba2022['Opponent_Score'])
# Counting stats that are summed over a team's players to get team totals
stat_cols = ['fg', 'fga', 'fg3', 'fg3a', 'ft', 'fta', 'orb', 'drb',
             'ast', 'stl', 'blk', 'tov', 'pf', 'pts']
team_game = nba2022[['Team_Abbrev', 'H_A', 'win', 'game_id'] + stat_cols]
# H_A and win are constant within a team and game, so 'first' and 'mean' recover them
team_game = team_game.groupby(['game_id', 'Team_Abbrev']).agg(
    {'H_A': 'first', 'win': 'mean', **{col: 'sum' for col in stat_cols}})
# Keep the reset index so game_id and Team_Abbrev are ordinary columns
team_game = team_game.reset_index()
team_game

Unnamed: 0,game_id,Team_Abbrev,H_A,win,fg,fga,fg3,fg3a,ft,fta,orb,drb,ast,stl,blk,tov,pf,pts
0,202110190LAL,GSW,A,1.0,41,93,14,39,25,30,9,41,30,9,2,17,18,121
1,202110190LAL,LAL,H,0.0,45,95,15,42,9,19,5,40,21,7,4,17,25,114
2,202110190MIL,BRK,A,0.0,37,84,17,32,13,23,5,39,19,3,9,12,17,104
3,202110190MIL,MIL,H,1.0,48,105,17,45,14,18,13,41,25,8,9,7,19,127
4,202110200CHO,CHO,H,1.0,46,107,13,31,18,27,12,34,29,9,5,8,21,123
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1761,202202170LAC,LAC,H,1.0,51,92,18,35,22,26,12,38,34,12,5,16,16,142
1762,202202170MIL,MIL,H,0.0,42,95,14,44,22,28,12,32,20,5,1,7,22,120
1763,202202170MIL,PHI,A,1.0,44,88,12,34,23,27,9,37,21,2,2,12,23,123
1764,202202170NOP,DAL,A,1.0,44,82,19,40,18,25,7,31,18,5,4,8,22,125
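
One way to sanity-check a team-by-game table is to confirm that every game has exactly two teams and exactly one winner. The sketch below runs that check on hypothetical rows shaped like `team_game`; the frame name `team_game_toy` and its values are made up.

```python
import pandas as pd

# Hypothetical rows: one per team and game, with win stored as 1.0 or 0.0.
team_game_toy = pd.DataFrame({
    "game_id": ["G1", "G1", "G2", "G2"],
    "Team_Abbrev": ["AAA", "BBB", "AAA", "CCC"],
    "win": [1.0, 0.0, 0.0, 1.0],
})

per_game = team_game_toy.groupby("game_id").agg(
    n_teams=("Team_Abbrev", "nunique"),
    n_winners=("win", "sum"),
)

# Games that break either rule; an empty frame means the table is consistent.
problems = per_game[(per_game["n_teams"] != 2) | (per_game["n_winners"] != 1)]
print(problems)
```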


## Info about how the player did personally in the game

In [9]:
player_game = nba2022[['game_id', 'player_id', 'starter', 'minutes', 'fg', 'fga', 'fg3', 'fg3a', 
                   'ft', 'fta', 'orb', 'drb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts', 'usg_pct', 
                   'is_inactive', 'PG%', 'SG%', 'SF%', 'PF%', 'C%']]
player_game

Unnamed: 0,game_id,player_id,starter,minutes,fg,fga,fg3,fg3a,ft,fta,...,tov,pf,pts,usg_pct,is_inactive,PG%,SG%,SF%,PF%,C%
0,202202170BRK,kispeco01,1,32.500000,6,9,4,6,0,0,...,2,2,16,15.6,0,1.0,36.0,60.0,4.0,0.0
1,202202170BRK,kuzmaky01,1,30.266667,2,7,0,3,1,1,...,7,3,5,22.0,0,0.0,0.0,4.0,85.0,11.0
2,202202170BRK,caldwke01,1,25.433333,3,7,1,3,0,0,...,1,0,7,14.5,0,0.0,32.0,67.0,0.0,0.0
3,202202170BRK,netora01,1,20.666667,5,7,1,1,1,2,...,0,1,12,17.6,0,90.0,10.0,0.0,0.0,0.0
4,202202170BRK,bryanth01,1,14.066667,5,6,0,1,2,2,...,0,0,12,22.6,0,0.0,0.0,0.0,0.0,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22619,202201130NOP,gabriwe01,0,4.433333,1,1,1,1,0,0,...,2,2,3,31.0,0,0.0,0.0,0.0,0.0,100.0
22620,202201150SAS,gabriwe01,0,0.000000,0,0,0,0,0,0,...,0,0,0,0.0,0,0.0,0.0,0.0,0.0,100.0
22621,202112220SAC,wrighmo01,0,1.466667,0,0,0,0,0,0,...,0,0,0,0.0,0,0.0,0.0,0.0,0.0,100.0
22622,202112260LAC,wrighmo01,0,0.000000,0,0,0,0,0,0,...,0,0,0,0.0,0,0.0,0.0,0.0,0.0,100.0
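
Each row of the player-by-game table should be identified by the pair (`game_id`, `player_id`), and every `game_id` should also exist in the games table. A hedged sketch of both checks on hypothetical data, where the row with `G3` is deliberately left without a matching game:

```python
import pandas as pd

# Hypothetical tables; G3 is intentionally missing from games_toy.
games_toy = pd.DataFrame({"game_id": ["G1", "G2"]})
player_game_toy = pd.DataFrame({
    "game_id": ["G1", "G1", "G2", "G3"],
    "player_id": ["p1", "p2", "p1", "p2"],
    "pts": [10, 4, 7, 0],
})

# Composite key: no (game_id, player_id) pair may repeat.
has_duplicate_keys = player_game_toy.duplicated(["game_id", "player_id"]).any()

# Foreign key: rows whose game_id is missing from the games table.
orphans = player_game_toy[~player_game_toy["game_id"].isin(games_toy["game_id"])]
print(has_duplicate_keys, orphans["game_id"].tolist())
```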


## Info about the team's total season stats so far

In [10]:
teams = nba2022[['Team_Abbrev', 'fg', 'fga', 'fg3', 'fg3a',
                 'ft', 'fta', 'orb', 'drb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']]
# Season totals: sum every counting stat over all of a team's player rows
teams = teams.groupby('Team_Abbrev').sum()
# Collapse to one row per team and game first, so that games (not player rows) are counted
WL = nba2022.groupby(['Team_Abbrev', 'game_id']).agg({'win': 'mean'})
WL = WL.groupby('Team_Abbrev').agg({'win': ['sum', 'count']})
WL.columns = ['wins', 'totalgames']
WL['losses'] = WL['totalgames'] - WL['wins']
teams = pd.merge(teams, WL, on=['Team_Abbrev'], validate='one_to_one')
teams = teams.reset_index()
teams

Unnamed: 0,Team_Abbrev,fg,fga,fg3,fg3a,ft,fta,orb,drb,ast,stl,blk,tov,pf,pts,wins,totalgames,losses
0,ATL,2381,5093,734,1946,1014,1257,577,1996,1414,402,255,684,1073,6510,28.0,58,30.0
1,BOS,2379,5247,754,2195,1019,1252,642,2155,1416,434,362,809,1119,6531,34.0,60,26.0
2,BRK,2442,5231,660,1883,992,1231,606,2039,1464,414,313,766,1173,6536,31.0,59,28.0
3,CHI,2482,5139,662,1760,1019,1248,526,2032,1444,418,264,729,1098,6645,38.0,59,21.0
4,CHO,2528,5540,821,2306,952,1292,663,2045,1630,521,290,772,1199,6829,29.0,60,31.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,SAC,2434,5332,698,2024,1026,1341,610,2011,1400,433,276,814,1147,6592,22.0,60,38.0
26,SAS,2562,5498,657,1855,815,1097,649,2026,1653,455,302,721,1053,6596,23.0,59,36.0
27,TOR,2296,5186,703,1984,899,1186,749,1807,1252,526,273,683,1129,6194,32.0,57,25.0
28,UTA,2360,4990,843,2335,1034,1321,593,2080,1295,414,286,768,1098,6597,36.0,58,22.0
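
The win-loss record is built in two steps because the source rows are per player: counting rows directly would count players, not games. The sketch below shows the same idea on hypothetical rows, first collapsing to one row per team and game and then counting. It uses named aggregation rather than the notebook's dictionary form, and the frame `rows` is made up.

```python
import pandas as pd

# Hypothetical player-level rows; `win` repeats for every player on a team.
rows = pd.DataFrame({
    "Team_Abbrev": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "game_id": ["G1", "G1", "G2", "G2", "G1", "G1"],
    "win": [True, True, False, False, False, False],
})

# Step 1: one row per team and game.
per_team_game = rows.groupby(["Team_Abbrev", "game_id"], as_index=False)["win"].first()

# Step 2: count wins and games per team, then derive losses.
record = per_team_game.groupby("Team_Abbrev")["win"].agg(wins="sum", totalgames="count")
record["losses"] = record["totalgames"] - record["wins"]
print(record)
```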


## Info about the player's total season so far

In [11]:
players = nba2022[['player', 'player_id', 'starter', 'minutes', 'fg', 'fga', 'fg3', 'fg3a', 
                   'ft', 'fta', 'orb', 'drb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts', 'usg_pct', 
                   'did_not_play', 'is_inactive', 'ts_pct', 'PG%', 'SG%', 'SF%', 'PF%', 'C%']]
# Counts are summed over the season; per-game rates and position shares are averaged
players = players.groupby(['player', 'player_id']).agg({
    'starter': 'sum', 'minutes': 'mean',
    'fg': 'sum', 'fga': 'sum', 'fg3': 'sum', 'fg3a': 'sum', 'ft': 'sum', 'fta': 'sum',
    'orb': 'sum', 'drb': 'sum', 'ast': 'sum', 'stl': 'sum', 'blk': 'sum',
    'tov': 'sum', 'pf': 'sum', 'pts': 'sum',
    'usg_pct': 'mean', 'ts_pct': 'mean',
    'did_not_play': 'sum', 'is_inactive': 'sum',
    'PG%': 'mean', 'SG%': 'mean', 'SF%': 'mean', 'PF%': 'mean', 'C%': 'mean',
}).add_prefix('season_')
# Keep the reset index so player and player_id are ordinary columns
players = players.reset_index()
players

Unnamed: 0,player,player_id,season_starter,season_minutes,season_fg,season_fga,season_fg3,season_fg3a,season_ft,season_fta,...,season_pts,season_usg_pct,season_ts_pct,season_did_not_play,season_is_inactive,season_PG%,season_SG%,season_SF%,season_PF%,season_C%
0,Aaron Gordon,gordoaa01,53,31.760692,298,577,62,186,111,151,...,769,19.488679,0.600585,0,0,0.00,0.00,48.0,49.0,4.0
1,Aaron Henry,henryaa01,0,0.999020,1,5,0,1,0,0,...,2,6.835294,0.058824,11,0,0.00,11.00,69.0,21.0,0.0
2,Aaron Holiday,holidaa01,14,14.045000,109,227,27,73,28,35,...,273,16.510000,0.501420,6,0,98.74,1.26,0.0,0.0,0.0
3,Aaron Nesmith,nesmiaa01,1,8.357233,54,143,21,89,16,19,...,145,13.760377,0.293566,12,0,0.00,27.00,64.0,9.0,0.0
4,Aaron Wiggins,wiggiaa01,20,20.062719,90,194,27,88,36,51,...,243,11.392105,0.471711,4,0,9.00,70.00,20.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,Zach Collins,colliza01,0,13.630556,12,27,3,7,2,4,...,29,16.316667,0.360833,1,1,0.00,0.00,0.0,0.0,100.0
604,Zach LaVine,lavinza01,47,34.553901,411,853,135,338,199,229,...,1156,28.865957,0.596745,0,0,0.00,17.00,68.0,15.0,0.0
605,Zeke Nnaji,nnajize01,1,13.286054,95,176,37,76,38,62,...,265,11.875510,0.503306,10,1,0.00,0.00,3.0,53.0,43.0
606,Ziaire Williams,willizi02,19,20.498810,115,261,42,148,19,25,...,291,13.859524,0.495500,2,0,0.00,1.00,96.0,2.0,0.0
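
Note that `'minutes': 'mean'` averages over every row for a player, including games in which the player logged zero minutes. If minutes per game actually played is wanted instead, one option is to filter first. The toy rows below are hypothetical, and treating `did_not_play == 1` as the marker for such games is an assumption about how the column is encoded.

```python
import pandas as pd

# Hypothetical rows: p1 has one game flagged as did-not-play.
player_rows = pd.DataFrame({
    "player_id": ["p1", "p1", "p1", "p2", "p2"],
    "minutes": [30.0, 0.0, 20.0, 10.0, 14.0],
    "did_not_play": [0, 1, 0, 0, 0],
})

# Average over all rows, as in the season aggregation.
mean_all = player_rows.groupby("player_id")["minutes"].mean()

# Average over games played only (assumes did_not_play == 1 marks a missed game).
played = player_rows[player_rows["did_not_play"] == 0]
mean_played = played.groupby("player_id")["minutes"].mean()
print(mean_all, mean_played, sep="\n")
```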
