# **Data Collection**

## Objectives

* Retrieve dataset from Kaggle, save the raw data, and remove unnecessary files.

## Inputs

* Your Kaggle credentials, stored in a JSON file named `kaggle.json`.

## Outputs

* This notebook produces a directory named `outputs/datasets/raw/csv`, which contains various CSV files. If you wish, you can also keep the database version of the files.

## Additional Comments

* The steps here are not strictly necessary, as the data can be downloaded directly from <a href="https://www.kaggle.com/datasets/wyattowalsh/basketball">Wyatt Walsh's NBA Database</a>.

* Please select Python 3.8.18 for the kernel of this notebook. It was developed with this version of Python in mind, and we cannot guarantee the performance with a different kernel.

* This notebook was inspired by the Data Collection Jupyter notebook in the Churnometer walkthrough project.

---

# Change working directory

The following cell changes the working directory to the parent folder using commands from the `os` library.

In [1]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
new_current_dir = os.getcwd()
print(f"You set the working directory to:\n{new_current_dir}")

You set the working directory to:
/workspace/pp5-ml-dashboard


The new working directory should be the root folder of the cloned repository.
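
Because the cell above moves up one level every time it runs, executing it twice leaves you outside the repository. The following is a minimal sketch of a more defensive variant that walks upward until it finds a marker entry. The function name `find_repo_root` and the choice of `.git` as the marker are assumptions for illustration and are not part of the original notebook.

```python
import os
from pathlib import Path


def find_repo_root(start, marker=".git"):
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    `marker` is a hypothetical choice; any file or folder that exists only
    at the repository root would work.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No '{marker}' found at or above {start}")


# Usage (safe to run more than once):
# os.chdir(find_repo_root(os.getcwd()))
```

Since the search starts from the current directory itself, re-running the cell after the directory has already been changed has no further effect.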

## Section 1: Data retrieval

Here, we will download the zip file containing the data. The dataset is updated regularly, so depending on when you run this notebook, your copy may differ from the one we used. This should not affect the performance of the models, as our code excludes games after a certain date.

Install the kaggle package.

In [2]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.0/59.0 kB 7.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 14.9 MB/s eta 0:00:00
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73025 sha256=719c066c2003bc9aead8c164d8c22a5eab7c188fabb2cc5f850fec2

To download the dataset, you will need your Kaggle authentication token. You can retrieve it by navigating to your Kaggle settings page, scrolling down to the API section, and clicking the "Create New Token" button. This will invalidate your current token and download a new `kaggle.json` file to your downloads directory. Copy that JSON file to your current working directory and make sure that it is named `kaggle.json`.

* It should appear greyed out because the name is already listed in the `.gitignore` file.

Now you are ready to execute the following cell.

In [2]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
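
Before running the cell above, it can help to confirm that the token file is well formed. The sketch below assumes the standard `kaggle.json` layout with a `username` and a `key` field; the helper name `check_kaggle_token` is hypothetical. It also applies the same owner-only permissions as `chmod 600`, using `os.chmod`.

```python
import json
import os
import stat


def check_kaggle_token(path="kaggle.json"):
    """Validate the token file and restrict it to owner read/write (0o600).

    Returns the username stored in the file.
    """
    with open(path, encoding="utf-8") as fh:
        token = json.load(fh)
    # Assumed layout: {"username": "...", "key": "..."}
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"{path} is missing fields: {sorted(missing)}")
    # Equivalent to `chmod 600 kaggle.json` on POSIX systems.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return token["username"]
```

Failing early here gives a clearer message than an authentication error from the download command later on.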

We can now download the zip file containing the datasets.

In [4]:
KaggleDatasetPath = "wyattowalsh/basketball"
DestinationFolder = "outputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading basketball.zip to inputs/datasets/raw
100%|███████████████████████████████████████▉| 696M/697M [00:20<00:00, 39.6MB/s]
100%|████████████████████████████████████████| 697M/697M [00:20<00:00, 36.3MB/s]


To unzip the dataset, execute the following cell. It also removes your Kaggle token file and the zip file.

In [5]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/basketball.zip
  inflating: inputs/datasets/raw/csv/common_player_info.csv  
  inflating: inputs/datasets/raw/csv/draft_combine_stats.csv  
  inflating: inputs/datasets/raw/csv/draft_history.csv  
  inflating: inputs/datasets/raw/csv/game.csv  
  inflating: inputs/datasets/raw/csv/game_info.csv  
  inflating: inputs/datasets/raw/csv/game_summary.csv  
  inflating: inputs/datasets/raw/csv/inactive_players.csv  
  inflating: inputs/datasets/raw/csv/line_score.csv  
  inflating: inputs/datasets/raw/csv/officials.csv  
  inflating: inputs/datasets/raw/csv/other_stats.csv  
  inflating: inputs/datasets/raw/csv/play_by_play.csv  
  inflating: inputs/datasets/raw/csv/player.csv  
  inflating: inputs/datasets/raw/csv/team.csv  
  inflating: inputs/datasets/raw/csv/team_details.csv  
  inflating: inputs/datasets/raw/csv/team_history.csv  
  inflating: inputs/datasets/raw/csv/team_info_common.csv  
  inflating: inputs/datasets/raw/nba.sqlite  
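
The same unzip-and-clean-up step can be done without shell commands, which may be useful on systems where `unzip` is unavailable. This is a sketch using the standard library's `zipfile` module. The function name `extract_and_delete` is hypothetical, and unlike the cell above it does not touch `kaggle.json`.

```python
import zipfile
from pathlib import Path


def extract_and_delete(zip_path, destination):
    """Extract every member of `zip_path` into `destination`, then delete the archive.

    Returns the list of member names that were extracted.
    """
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(destination)
    zip_path.unlink()  # only reached if extraction succeeded
    return names


# Usage, mirroring the shell cell:
# extract_and_delete("outputs/datasets/raw/basketball.zip", "outputs/datasets/raw")
```

As with the `&&` chain in the shell version, the archive is only deleted once extraction has completed without an error.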


## Section 2: Removing unnecessary data

We will not be using all of the above data. You are free to peruse it or use it for your own project, but please credit Wyatt Walsh for his work in collecting the data and keeping it up to date.

We will be keeping the following files (note: this list should be double-checked at the end of the project):
- `game.csv`
- `line_score.csv`
- `other_stats.csv`
- `team_history.csv`

Feel free to take a look at the files before we prune the directory.

In [2]:
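import os
# Move into the directory that holds the extracted CSV files.
current_dir = os.getcwd()
os.chdir(os.path.join(current_dir, 'outputs', 'datasets', 'raw', 'csv'))
current_dir = os.getcwd()
# List the files that are currently present.
csv_files = os.listdir(current_dir)
print(csv_files)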


['game.csv', 'line_score.csv', 'officials.csv', 'other_stats.csv', 'team_history.csv']


You can load any of the listed files as a DataFrame and inspect it. We have loaded `game.csv`, as this will be the file of primary interest.

In [20]:
import pandas as pd

games_df = pd.read_csv('game.csv')

games_df.describe()


          season_id  team_id_home       game_id           min      fgm_home  \
count  65698.000000  6.569800e+04  6.569800e+04  65698.000000  65685.000000   
mean   22949.338747  1.609926e+09  2.584747e+07    221.003486     39.672269   
std     5000.305500  3.324313e+07  6.303760e+06     67.903521      6.770802   
min    12005.000000  4.500000e+01  1.050000e+07      0.000000      4.000000   
25%    21981.000000  1.610613e+09  2.130053e+07    240.000000     35.000000   
50%    21997.000000  1.610613e+09  2.630007e+07    240.000000     40.000000   
75%    22011.000000  1.610613e+09  2.880069e+07    240.000000     44.000000   
max    42022.000000  1.610617e+09  4.980009e+07    365.000000     84.000000   

           fga_home   fg_pct_home     fg3m_home     fg3a_home  fg3_pct_home  \
count  50251.000000  50208.000000  52480.000000  47015.000000  46624.000000   
mean      83.992796      0.467321      5.735099     17.741146      0.346136   
std        9.164445      0.059423      4.537337    

In [None]:
games_df.head()

As you can see, the records stretch back a long way. Many statistics were not tracked in the older records, which is why there are so many `NaN` values, so the data will need to be truncated.
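
To see which columns are affected, you can count the empty cells per column. The sketch below uses only the standard library, so it also works on files that are too large to load comfortably; the helper name `count_missing` is hypothetical. With the DataFrame already loaded, `games_df.isna().sum()` gives comparable counts (pandas additionally treats markers such as `NA` as missing).

```python
import csv
from collections import Counter


def count_missing(csv_path):
    """Count empty cells per column in a CSV file that has a header row."""
    missing = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        for name in reader.fieldnames or []:
            missing[name] = 0  # make sure every column is reported
        for row in reader:
            for name, value in row.items():
                if name is None:
                    continue  # extra cells beyond the header are ignored
                # `value` is None when a row is shorter than the header.
                if value is None or value.strip() == "":
                    missing[name] += 1
    return dict(missing)


# Usage:
# count_missing("game.csv")
```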

Now we remove the files we won't be using.

In [3]:
# Files that are not needed for this project.
for_removal = [
    'officials.csv', 'team_info_common.csv', 'team.csv', 'team_details.csv',
    'inactive_players.csv', 'common_player_info.csv', 'draft_combine_stats.csv',
    'draft_history.csv', 'game_info.csv', 'game_summary.csv',
    'play_by_play.csv', 'player.csv',
]
for file in for_removal:
    try:
        os.remove(file)
    except FileNotFoundError:
        print(f"{file} has already been removed.")
    else:
        print(f"Successfully removed {file}.")

# Step up to outputs/datasets/raw to remove the SQLite database as well.
os.chdir(os.path.dirname(current_dir))
try:
    os.remove('nba.sqlite')
except FileNotFoundError:
    print("nba.sqlite has already been removed.")


Successfully removed officials.csv.
team_info_common.csv has already been removed.
team.csv has already been removed.
team_details.csv has already been removed.
inactive_players.csv has already been removed.
common_player.csv has already been removed.
draft_combine_stats.csv has already been removed.
draft_history.csv has already been removed.
game_info.csv has already been removed.
game_summary.csv has already been removed.
play_by_play.csv has already been removed.
player.csv has already been removed.
nba.sqlite has already been removed.
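
A hard-coded removal list has to match the archive's file names exactly, and it goes stale if the dataset gains new files. One alternative, sketched below, is to delete everything that is not on the keep list. The function name `prune_directory` and the `dry_run` flag are illustrative assumptions and are not part of the original notebook.

```python
import os

# The four files this project keeps.
KEEP = {"game.csv", "line_score.csv", "other_stats.csv", "team_history.csv"}


def prune_directory(directory, keep=KEEP, dry_run=True):
    """Remove regular files in `directory` whose names are not in `keep`.

    With dry_run=True nothing is deleted; the names that would be removed
    are only returned, so the list can be checked first.
    """
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name not in keep:
            if not dry_run:
                os.remove(path)
            removed.append(name)
    return removed


# Usage: inspect first, then delete.
# print(prune_directory("outputs/datasets/raw/csv"))
# prune_directory("outputs/datasets/raw/csv", dry_run=False)
```

Defaulting to a dry run is a deliberate choice here, since deleted files can only be restored by downloading the archive again.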


---

## Next Steps

In the next notebook, we will begin cleaning the files and preparing them for our models. In this notebook, we removed the files we deem unnecessary; you may keep them if you like.

## Conclusions

We don't yet have any real conclusions. We have collected the data, and now we need to inspect it in order to see what kind of cleaning and preparation it will require.

---

