# **Data Collection**

## Objectives

* Retrieve dataset from Kaggle, save the raw data, and remove unnecessary files.

## Inputs

* Credentials for Kaggle as a json file named `kaggle.json`.

## Outputs

* The output of this notebook is a directory named `outputs/datasets/raw/csv` containing our raw dataset as a CSV file.

## Additional Comments

* The steps here are not strictly necessary, as the data can be downloaded directly from the following link: <a href="https://www.kaggle.com/datasets/wyattowalsh/basketball">Wyatt Walsh's NBA Database</a>.

* Please select Python 3.8.18 for the kernel of this notebook. It was developed with this version of Python in mind, and we cannot guarantee that it behaves the same with a different kernel.

* This notebook was inspired by the Data Collection Jupyter notebook in the Churnometer walkthrough project.

---

#### Change working directory

The following cell changes the working directory to the parent folder using functions from the `os` library.

In [1]:
import os

current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
new_current_dir = os.getcwd()
print(f"You set the working directory to:\n{new_current_dir}")

You set the working directory to:
/workspace/pp5-ml-dashboard
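
Running the cell above a second time moves the working directory up one more level. A minimal sketch of a guard against that, assuming the notebooks live in a folder whose name you pass in (the function name `chdir_to_parent_once` and the folder name in the usage note are hypothetical, not part of the project):

```python
import os
from pathlib import Path


def chdir_to_parent_once(notebook_dir_name: str) -> Path:
    """Move to the parent folder only if we are still inside the notebook folder."""
    cwd = Path.cwd()
    # Only step up when the current folder is the notebook folder itself,
    # so re-running the cell leaves the working directory unchanged.
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()
```

For example, `chdir_to_parent_once("jupyter_notebooks")` would step up once from a folder with that name and do nothing on later calls.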


---

## Section 1: Data Retrieval

Here, we will collect the zip file containing the data. The dataset is updated over time, so depending on when you run this notebook, your copy may differ from the one we used. This should not affect the performance of the models, as our code excludes games after a certain date.

Install the kaggle package.

In [2]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73025 sha256=81cf7bfdf4169707370d8da4d7d70ecb20e811e709b4e610b767848

To download the dataset, you will need a Kaggle authentication token. You can get one by navigating to your account settings page on Kaggle, scrolling down to the API section, and clicking the "Create New Token" button. This invalidates your current token and downloads a new `kaggle.json` file to your download directory. Copy that JSON file to your current working directory, and make sure that it is named `kaggle.json`.

* It should appear greyed out, as the name is already listed in the `.gitignore` file.

In [4]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
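
If the download later fails with an authentication error, it can help to check the token file first. The following is a minimal sketch of the same setup in pure Python, assuming the token is a JSON object with `username` and `key` fields, which is the layout Kaggle uses for `kaggle.json`. The function name `check_kaggle_token` is hypothetical.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(config_dir: str) -> Path:
    """Validate kaggle.json in config_dir and restrict it to owner read/write."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"Expected a token at {token_path}")

    with token_path.open() as handle:
        token = json.load(handle)

    # Assumption: a valid token contains both of these fields.
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Same effect as `chmod 600 kaggle.json`.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # Point the kaggle package at the folder holding the token.
    os.environ["KAGGLE_CONFIG_DIR"] = str(Path(config_dir).resolve())
    return token_path
```

Calling `check_kaggle_token(os.getcwd())` would then stand in for the cell above, with an early error if the file is missing or incomplete.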

We can now download the zip file containing the datasets.

In [5]:
KaggleDatasetPath = "wyattowalsh/basketball"
DestinationFolder = "outputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading basketball.zip to outputs/datasets/raw
100%|████████████████████████████████████████| 697M/697M [00:24<00:00, 36.8MB/s]
100%|████████████████████████████████████████| 697M/697M [00:24<00:00, 30.0MB/s]


To unzip the dataset, execute the following cell. It also removes the zip file and your Kaggle token.

In [6]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  outputs/datasets/raw/basketball.zip
  inflating: outputs/datasets/raw/csv/common_player_info.csv  
  inflating: outputs/datasets/raw/csv/draft_combine_stats.csv  
  inflating: outputs/datasets/raw/csv/draft_history.csv  
  inflating: outputs/datasets/raw/csv/game.csv  
  inflating: outputs/datasets/raw/csv/game_info.csv  
  inflating: outputs/datasets/raw/csv/game_summary.csv  
  inflating: outputs/datasets/raw/csv/inactive_players.csv  
  inflating: outputs/datasets/raw/csv/line_score.csv  
  inflating: outputs/datasets/raw/csv/officials.csv  
  inflating: outputs/datasets/raw/csv/other_stats.csv  
  inflating: outputs/datasets/raw/csv/play_by_play.csv  
  inflating: outputs/datasets/raw/csv/player.csv  
  inflating: outputs/datasets/raw/csv/team.csv  
  inflating: outputs/datasets/raw/csv/team_details.csv  
  inflating: outputs/datasets/raw/csv/team_history.csv  
  inflating: outputs/datasets/raw/csv/team_info_common.csv  
  inflating: outputs/datasets/raw/nba.sqlite  
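
On systems without the `unzip` command, the extraction step can be done with Python's standard `zipfile` module instead. This is a minimal sketch rather than the notebook's own method; the function name `extract_and_remove` is hypothetical, and it does not delete `kaggle.json`, which you would still remove separately.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path: str, destination: str) -> list:
    """Extract zip_path into destination, then delete the archive."""
    archive = Path(zip_path)
    with zipfile.ZipFile(archive) as bundle:
        names = bundle.namelist()       # members contained in the archive
        bundle.extractall(destination)  # recreate them under destination
    archive.unlink()                    # remove the zip only after a successful extract
    return names
```

With the paths used in this notebook, the call would be `extract_and_remove("outputs/datasets/raw/basketball.zip", "outputs/datasets/raw")`.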


---

## Section 2: Removing unnecessary data

We will not be using all of the above data. You are free to peruse it and tinker as you like. If you do, please credit Wyatt Walsh for his work in collecting the data.

We will only be keeping `game.csv`. Feel free to take a look at the files before we prune the directory.

In [8]:
import os

current_dir = os.getcwd()
os.chdir(current_dir + "/outputs/datasets/raw/csv")
current_dir = os.getcwd()
csv_files = os.listdir(current_dir)
print(csv_files)

['common_player_info.csv', 'draft_combine_stats.csv', 'draft_history.csv', 'game.csv', 'game_info.csv', 'game_summary.csv', 'inactive_players.csv', 'line_score.csv', 'officials.csv', 'other_stats.csv', 'play_by_play.csv', 'player.csv', 'team.csv', 'team_details.csv', 'team_history.csv', 'team_info_common.csv']


You can load any of the listed files as a dataframe and inspect them.

In [9]:
import pandas as pd

games_df = pd.read_csv("game.csv")
games_df.describe()

Unnamed: 0,season_id,team_id_home,game_id,min,fgm_home,fga_home,fg_pct_home,fg3m_home,fg3a_home,fg3_pct_home,...,dreb_away,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away
count,65698.0,65698.0,65698.0,65698.0,65685.0,50251.0,50208.0,52480.0,47015.0,46624.0,...,46700.0,49973.0,49897.0,46849.0,47073.0,47013.0,62847.0,65698.0,65698.0,65698.0
mean,22949.338747,1609926000.0,25847470.0,221.003486,39.672269,83.992796,0.467321,5.735099,17.741146,0.346136,...,30.238073,42.119645,22.135419,7.854148,4.681537,15.19986,23.097284,100.991567,-3.627569,0.20133
std,5000.3055,33243130.0,6303760.0,67.903521,6.770802,9.164445,0.059423,4.537337,10.54581,0.151234,...,5.588675,6.867396,5.380805,3.031766,2.50082,4.299798,5.227208,14.418755,13.091395,0.400997
min,12005.0,45.0,10500000.0,0.0,4.0,0.0,0.14,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,-73.0,0.0
25%,21981.0,1610613000.0,21300530.0,240.0,35.0,78.0,0.427,2.0,10.0,0.261,...,26.0,37.0,18.0,6.0,3.0,12.0,20.0,92.0,-12.0,0.0
50%,21997.0,1610613000.0,26300070.0,240.0,40.0,84.0,0.467,5.0,16.0,0.348,...,30.0,42.0,22.0,8.0,4.0,15.0,23.0,101.0,-4.0,0.0
75%,22011.0,1610613000.0,28800690.0,240.0,44.0,89.0,0.506,9.0,24.0,0.42975,...,34.0,47.0,26.0,10.0,6.0,18.0,26.0,110.0,5.0,0.0
max,42022.0,1610617000.0,49800090.0,365.0,84.0,240.0,0.697,28.0,77.0,1.0,...,60.0,90.0,89.0,27.0,19.0,40.0,115.0,196.0,68.0,1.0


In [10]:
games_df.head()

Unnamed: 0,season_id,team_id_home,team_abbreviation_home,team_name_home,game_id,game_date,matchup_home,wl_home,min,fgm_home,...,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type
0,21946,1610610035,HUS,Toronto Huskies,24600001,1946-11-01 00:00:00,HUS vs. NYK,L,0,25.0,...,,,,,,,68.0,2,0,Regular Season
1,21946,1610610034,BOM,St. Louis Bombers,24600003,1946-11-02 00:00:00,BOM vs. PIT,W,0,20.0,...,,,,,,25.0,51.0,-5,0,Regular Season
2,21946,1610610032,PRO,Providence Steamrollers,24600002,1946-11-02 00:00:00,PRO vs. BOS,W,0,21.0,...,,,,,,,53.0,-6,0,Regular Season
3,21946,1610610025,CHS,Chicago Stags,24600004,1946-11-02 00:00:00,CHS vs. NYK,W,0,21.0,...,,,,,,22.0,47.0,-16,0,Regular Season
4,21946,1610610028,DEF,Detroit Falcons,24600005,1946-11-02 00:00:00,DEF vs. WAS,L,0,10.0,...,,,,,,,50.0,17,0,Regular Season


As you can see, the records stretch back at least to 1946. Many statistics were not tracked in the early seasons, hence all of the `NaN` values, so the data will need to be truncated.
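
One way to see where the missing values thin out is to compute, for each calendar year, the share of missing entries per column. The sketch below assumes a date column named `game_date`, as in the output above; the function name `missing_share_by_year` is hypothetical, and the code only illustrates the idea rather than the cut-off actually used later in the project.

```python
import pandas as pd


def missing_share_by_year(df: pd.DataFrame, date_column: str = "game_date") -> pd.DataFrame:
    """Return, per calendar year, the share of missing values in each column."""
    years = pd.to_datetime(df[date_column]).dt.year
    # isna() gives True/False per cell; the mean of booleans is the share of True.
    return df.drop(columns=[date_column]).isna().groupby(years).mean()
```

Calling `missing_share_by_year(games_df)` returns one row per year, where a value of `1.0` means the statistic was never recorded that year and `0.0` means it is always present.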

Now we remove the files we won't be using. (Feel free to keep some, such as `line_score.csv` and `other_stats.csv`, which we have not incorporated in our analysis.)

In [11]:
keeping = ["game.csv"]
for file in csv_files:
    if file in keeping:
        print(f"Keeping {file}.")
        continue
    try:
        os.remove(file)
    except FileNotFoundError:
        print(f"{file} has already been removed.")
    else:
        print(f"Successfully removed {file}.")

os.chdir(os.path.dirname(current_dir))
try:
    os.remove("nba.sqlite")
except FileNotFoundError:
    print("nba.sqlite has already been removed.")

Successfully removed common_player_info.csv.
Successfully removed draft_combine_stats.csv.
Successfully removed draft_history.csv.
Keeping game.csv.
Successfully removed game_info.csv.
Successfully removed game_summary.csv.
Successfully removed inactive_players.csv.
Successfully removed line_score.csv.
Successfully removed officials.csv.
Successfully removed other_stats.csv.
Successfully removed play_by_play.csv.
Successfully removed player.csv.
Successfully removed team.csv.
Successfully removed team_details.csv.
Successfully removed team_history.csv.
Successfully removed team_info_common.csv.
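
The pruning step can also be written without changing the working directory. The following is a minimal sketch using `pathlib`, with a hypothetical helper named `prune_directory`; it deletes only regular files and leaves subfolders alone.

```python
from pathlib import Path


def prune_directory(directory: str, keep: set) -> list:
    """Delete every regular file in directory whose name is not in keep."""
    removed = []
    for path in sorted(Path(directory).iterdir()):
        # Skip subfolders and anything listed in the keep set.
        if path.is_file() and path.name not in keep:
            path.unlink()
            removed.append(path.name)
    return removed
```

For this notebook, `prune_directory("outputs/datasets/raw/csv", {"game.csv"})` would remove the other CSV files and return their names. Running it again returns an empty list, since there is nothing left to remove.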


---

## Conclusions

We don't yet have any real conclusions. We have collected the data, and now we need to inspect it in order to see what kind of cleaning and preparation it will require.

## Next Steps

In the next notebook, we will begin cleaning the data and preparing it for our models. Above, we removed the files we deemed unnecessary; you may keep them if you like.
