# Ethan Ooi | Establishing the Data

## Setup

In terminal:

```
git clone https://github.com/vys5hb/Design-Final.git
pip install kagglehub -q
```

## Import Libraries

In [1]:
import pandas as pd
import kagglehub

## Download Data
https://www.kaggle.com/datasets/ehallmar/nba-historical-stats-and-betting-data/data

This data was published on Kaggle by Evan Hallmark (username: ehallmar).

In [None]:
path = kagglehub.dataset_download("ehallmar/nba-historical-stats-and-betting-data")
print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/ehallmar/nba-historical-stats-and-betting-data?dataset_version_number=1...

100%|██████████| 36.5M/36.5M [00:01<00:00, 21.6MB/s]

Extracting files...

Path to dataset files: /Users/ethanooi/.cache/kagglehub/datasets/ehallmar/nba-historical-stats-and-betting-data/versions/1


## Read Data

In [3]:
bets = pd.read_csv(f'{path}/nba_betting_spread.csv')
games = pd.read_csv(f'{path}/nba_games_all.csv')
lines = pd.read_csv(f'{path}/nba_betting_money_line.csv')

In [4]:
bets.head()

Unnamed: 0,game_id,book_name,book_id,team_id,a_team_id,spread1,spread2,price1,price2
0,21000358,Pinnacle Sports,238,1610612749,1610612742,7.5,-7.5,-106.0,-104.0
1,21000358,5Dimes,19,1610612749,1610612742,7.5,-7.5,-110.0,-110.0
2,21000358,Bookmaker,93,1610612749,1610612742,7.5,-7.5,-110.0,-110.0
3,21000358,BetOnline,1096,1610612749,1610612742,7.5,-7.5,-110.0,-110.0
4,21000358,Bovada,999996,1610612749,1610612742,8.0,-8.0,-115.0,-105.0


In [5]:
games.head()

Unnamed: 0,game_id,game_date,matchup,team_id,is_home,wl,w,l,w_pct,min,...,ast,stl,blk,tov,pf,pts,a_team_id,season_year,season_type,season
0,20800741,2009-02-06,SAC vs. UTA,1610612762,f,W,29.0,22.0,0.569,240,...,19.0,5.0,4.0,18.0,26.0,111,1610612758,2008,Regular Season,2008-09
1,20800701,2009-01-31,POR vs. UTA,1610612762,f,L,26.0,22.0,0.542,240,...,17.0,6.0,0.0,15.0,22.0,108,1610612757,2008,Regular Season,2008-09
2,20800584,2009-01-16,MEM vs. UTA,1610612762,f,W,24.0,16.0,0.6,240,...,23.0,9.0,3.0,15.0,22.0,101,1610612763,2008,Regular Season,2008-09
3,20800558,2009-01-12,IND @ UTA,1610612762,t,W,23.0,15.0,0.605,240,...,24.0,10.0,6.0,8.0,20.0,120,1610612754,2008,Regular Season,2008-09
4,20800440,2008-12-27,HOU vs. UTA,1610612762,f,L,18.0,14.0,0.563,290,...,35.0,13.0,7.0,9.0,27.0,115,1610612745,2008,Regular Season,2008-09


In [6]:
lines.head()

Unnamed: 0,game_id,book_name,book_id,team_id,a_team_id,price1,price2
0,41100314,Pinnacle Sports,238,1610612759,1610612760,165.0,-183.0
1,41100314,5Dimes,19,1610612759,1610612760,165.0,-175.0
2,41100314,Bookmaker,93,1610612759,1610612760,160.0,-190.0
3,41100314,BetOnline,1096,1610612759,1610612760,165.0,-190.0
4,41100314,Bovada,999996,1610612759,1610612760,155.0,-175.0


## How to get the Data?

There are three ways to get the data.

1. Clone the repository with the command below and access the data in the "nba_data" folder, or unzip the "nba-historical-stats-and-betting-data.zip" file and access the data from there.
```
git clone https://github.com/vys5hb/Design-Final.git
```


2. Download the data directly from Kaggle: https://www.kaggle.com/datasets/ehallmar/nba-historical-stats-and-betting-data/data.
3. Run the code at the top of this notebook, which downloads the data automatically. The `pd.read_csv` calls in the Read Data section work directly with the downloaded path.
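
Since the data can live in more than one place depending on which option is used, a small check for a local copy before downloading may be convenient. The sketch below is only an illustration: `find_data_dir`, `REQUIRED_FILES`, and the candidate folder paths are hypothetical names chosen here, and the paths assume the repository was cloned next to the notebook. Only the three CSV file names come from the Read Data section.

```python
from pathlib import Path

# File names as read in the Read Data section of this notebook.
REQUIRED_FILES = (
    "nba_betting_spread.csv",
    "nba_games_all.csv",
    "nba_betting_money_line.csv",
)


def find_data_dir(candidates):
    """Return the first candidate folder that contains every required CSV, else None."""
    for candidate in candidates:
        folder = Path(candidate).expanduser()
        if all((folder / name).is_file() for name in REQUIRED_FILES):
            return folder
    return None


# Hypothetical local locations; adjust to wherever the repo was cloned or the zip was extracted.
local_dir = find_data_dir(["nba_data", "Design-Final/nba_data"])
if local_dir is None:
    print("No local copy found; use kagglehub.dataset_download(...) as in the Download Data cell.")
else:
    print("Using local data in:", local_dir)
```

If a folder is found, it can be used in place of `path` in the `pd.read_csv` calls.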


## Who produced the data? And how?

This data comes from Kaggle, where it was published by Evan Hallmark (username: ehallmar). The dataset hasn't been updated in 7 years, but it's a great resource for historical NBA data and is a large collection. I believe the data was bought from another source that regularly updates its own data file. The original data link can be found here: https://www.scottfreellc.com/shop/p/nba-historical-odds-data.


## Column Summary Tables

In [7]:
bet_info = pd.DataFrame({
    "Count": bets.count(),
    "Types": bets.dtypes,
    "Nulls": bets.isnull().sum()
}).reset_index()
bet_info

Unnamed: 0,index,Count,Types,Nulls
0,game_id,131690,int64,0
1,book_name,131690,object,0
2,book_id,131690,int64,0
3,team_id,131690,int64,0
4,a_team_id,131690,int64,0
5,spread1,131690,float64,0
6,spread2,131690,float64,0
7,price1,131690,float64,0
8,price2,131690,float64,0


In [8]:
games_info = pd.DataFrame({
    "Count": games.count(),
    "Types": games.dtypes,
    "Nulls": games.isnull().sum()
}).reset_index()
games_info

Unnamed: 0,index,Count,Types,Nulls
0,game_id,125624,int64,0
1,game_date,119376,object,6248
2,matchup,125624,object,0
3,team_id,125624,int64,0
4,is_home,125624,object,0
5,wl,125614,object,10
6,w,41000,float64,84624
7,l,41000,float64,84624
8,w_pct,41000,float64,84624
9,min,125624,int64,0


In [9]:
lines_info = pd.DataFrame({
    "Count": lines.count(),
    "Types": lines.dtypes,
    "Nulls": lines.isnull().sum()
}).reset_index()
lines_info

Unnamed: 0,index,Count,Types,Nulls
0,game_id,125286,int64,0
1,book_name,125286,object,0
2,book_id,125286,int64,0
3,team_id,125286,int64,0
4,a_team_id,125286,int64,0
5,price1,125286,float64,0
6,price2,125286,float64,0
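
The three cells above build the same count/dtype/null summary for each DataFrame, so the pattern could be wrapped in one helper. This is a minimal sketch, not part of the original notebook: the name `summarize_columns` and the tiny `demo` frame are made up for illustration.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its non-null count, dtype, and null count."""
    return pd.DataFrame({
        "Count": df.count(),         # non-null values per column
        "Types": df.dtypes,          # pandas dtype per column
        "Nulls": df.isnull().sum(),  # missing values per column
    }).reset_index()


# Tiny made-up frame, only to show the shape of the result.
demo = pd.DataFrame({"game_id": [1, 2, 3], "wl": ["W", None, "L"]})
print(summarize_columns(demo))
```

With this helper, `summarize_columns(bets)`, `summarize_columns(games)`, and `summarize_columns(lines)` would produce the same tables as the cells above.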
