<a href="https://colab.research.google.com/github/wyattowalsh/sports-analytics/blob/main/basketball/notebooks/data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align='center'> Basketball Data Collection </h1>

This notebook contains the work needed to collect the data that makes up the [***Kaggle Basketball Dataset*** (wyattowalsh/basketball)](https://www.kaggle.com/wyattowalsh/basketball) and that serves as the foundation for the [basketball-related projects](https://github.com/wyattowalsh/sports-analytics/tree/main/basketball) within my [sports analytics GitHub repository](https://github.com/wyattowalsh/sports-analytics).

One goal of the data collection component of this project is to produce a robust, organized dataset that can grow as large as possible. My approach to storing the files that make up the [***Basketball Dataset***](https://www.kaggle.com/wyattowalsh/basketball) is explained below.

<img src="https://unsplash.com/photos/Kv-gAzpUSRg/download?force=true">

## Overview

***Kaggle*** datasets accept many file formats, including `CSV`, `JSON`, `SQLite`, and archives, among others. The platform acts much like industrial cloud storage such as *Google Cloud Platform's* (**GCP**) ***Cloud Storage*** or *Amazon Web Services'* (**AWS**) ***S3***, albeit with a **100GB** storage capacity. ***Kaggle*** datasets, like these industrial solutions, can be treated as general object/file storage and, in certain data engineering paradigms, can serve as data lakes.

Many state-of-the-art (SOTA) data storage solutions appear to center on an organization-wide data lake (which itself allows for general object storage) fed by multiple inputs (*"tributaries"*), some streaming in continuously and others added on a routine schedule. One benefit of this paradigm is that the lake can store both structured (tabular) and unstructured (image, video, audio, text, etc.) data. This is useful because new techniques for extracting information from unstructured data may become available over time, so it seems worthwhile to hold onto all extracted data where possible.

***Kaggle*** datasets can serve as data lakes either through archives or simply by storing data files in their raw format. This provides a strong foundation for building what may one day become a <b><i>"big data"</i></b> collection.
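
As a minimal sketch of the archival route, the snippet below bundles a folder of raw files into a single zip archive using only the Python standard library. The helper name `archive_raw_files` and the folder layout are hypothetical and not part of this project.

```python
import zipfile
from pathlib import Path

def archive_raw_files(raw_dir, archive_path):
    """Bundle every file under raw_dir into one compressed zip archive.

    archive_path should live outside raw_dir so the archive does not
    try to include itself.
    """
    raw_dir = Path(raw_dir)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(raw_dir.rglob("*")):
            if path.is_file():
                # store paths relative to raw_dir so the archive unpacks cleanly
                archive.write(path, arcname=path.relative_to(raw_dir).as_posix())
    # reopen the archive to report what it contains
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()
```

The resulting archive can then be placed in the dataset folder alongside any other files.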

However, ***Kaggle*** datasets can be configured further to enable additional platform functionality and to improve storage efficiency. Structured data, whether structured upon extraction or through some pre-processing, can be stored in a ***SQLite*** database (`.sqlite` file type) rather than as individual files such as `CSVs` or `JSONs`. A single database file is then stored as one object within the dataset. One readily visible advantage of storing in ***SQLite*** is that histograms of the distributions of continuous variables are shown directly within ***Kaggle***.
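
To make the CSV-versus-SQLite idea concrete, here is a small sketch that loads CSV text into a SQLite table and queries it, using only the standard library. The in-memory database, the table schema, and the two sample rows are assumptions chosen for illustration; the notebook itself writes tables with pandas further below.

```python
import csv
import io
import sqlite3

# tiny stand-in for a CSV file that would otherwise be stored on its own
csv_text = "id,full_name,is_active\n51,Mahmoud Abdul-Rauf,0\n1627826,Ivica Zubac,1\n"

demo_conn = sqlite3.connect(":memory:")  # a real dataset would use a .sqlite file path
demo_conn.execute(
    "CREATE TABLE Player (id TEXT PRIMARY KEY, full_name TEXT, is_active INTEGER)"
)

# each CSV row becomes a dict keyed by column name, matching the named placeholders
rows = list(csv.DictReader(io.StringIO(csv_text)))
demo_conn.executemany(
    "INSERT INTO Player (id, full_name, is_active) VALUES (:id, :full_name, :is_active)",
    rows,
)
demo_conn.commit()

# once loaded, the data can be filtered with SQL instead of re-parsing files
active = demo_conn.execute(
    "SELECT full_name FROM Player WHERE is_active = 1"
).fetchall()
print(active)
```

Several such tables can live in the same database file, which is what lets one `.sqlite` object replace many separate files.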

As this project moves forward, I hope to gather a large collection of both structured and unstructured data, with the ***SQLite*** database (`basketball.sqlite`) housing the structured data in an efficient, useful format, similar to the [***European Soccer Database***](https://www.kaggle.com/hugomathien/soccer).

## View System Information

In [1]:
print("********************** CUDA Version ********************** \n - \n")
!nvcc --version
print("********************** CPU Info ********************** \n - \n")
!cat /proc/cpuinfo
print("********************** CPU Count ********************** \n - \n")
import os
print(os.cpu_count())
print("********************** GPU Info ********************** \n - \n")
!nvidia-smi
print("********************** Python Version ********************** \n - \n")
!python -V

********************** CUDA Version ********************** 
 - 

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
********************** CPU Info ********************** 
 - 

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2299.998
cache size	: 46080 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_singl
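
The cell above relies on shell commands that are only available on Linux machines with the NVIDIA tools installed. As a rough, portable alternative, the sketch below gathers similar details from the Python standard library; the helper name `system_summary` is hypothetical.

```python
import os
import platform
import shutil

def system_summary():
    """Collect basic environment details without relying on shell commands."""
    return {
        "python_version": platform.python_version(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        # shutil.which returns None when the tool is not on PATH
        "has_nvcc": shutil.which("nvcc") is not None,
        "has_nvidia_smi": shutil.which("nvidia-smi") is not None,
    }

for key, value in system_summary().items():
    print(f"{key}: {value}")
```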

## Prepare Development Environment

### Clone Project Repository and Install Dependencies

In [25]:
# remove sample data and clone repo
!rm -r sample_data/
!rm -r sports-analytics/
!git clone https://github.com/wyattowalsh/sports-analytics.git

# change directory to directory that contains this notebook
%cd /content/sports-analytics/basketball/notebooks/

# install dependencies
!pip install -r ../../dependencies/basketball/data_collection.txt

rm: cannot remove 'sample_data/': No such file or directory
rm: cannot remove 'sports-analytics/': No such file or directory
Cloning into 'sports-analytics'...
remote: Enumerating objects: 321, done.
remote: Counting objects: 100% (321/321), done.
remote: Compressing objects: 100% (229/229), done.
remote: Total 321 (delta 116), reused 223 (delta 51), pack-reused 0
Receiving objects: 100% (321/321), 108.19 KiB | 2.35 MiB/s, done.
Resolving deltas: 100% (116/116), done.
/content/sports-analytics/basketball/notebooks
Collecting dask
  Downloading https://files.pythonhosted.org/packages/2e/86/95faa4a9c1f7fbfa2df2ae9e7e1a11349cb97a81e2f38ff9dda301606882/dask-2021.3.0-py3-none-any.whl (925kB)
     |████████████████████████████████| 931kB 3.6MB/s 
Installing collected packages: dask
  Found existing installation: dask 2.12.0
    Uninstalling dask-2.12.0:
      Successfully uninstalled dask-2.12.0
Successfully installed dask-2021.3.0


Collecting distributed
  Downloading https://files.pythonhosted.org/packages/96/b7/f58dd1e30f940a8b38de10f5d92b2fce08f38dcba3eb1ddb017260588ed4/distributed-2021.3.0-py3-none-any.whl (675kB)
     |████████████████████████████████| 675kB 5.8MB/s 
Collecting cloudpickle>=1.5.0
  Downloading https://files.pythonhosted.org/packages/e7/e3/898487e5dbeb612054cf2e0c188463acb358167fef749c53c8bb8918cea1/cloudpickle-1.6.0-py3-none-any.whl
Installing collected packages: cloudpickle, distributed
  Found existing installation: cloudpickle 1.3.0
    Uninstalling cloudpickle-1.3.0:
      Successfully uninstalled cloudpickle-1.3.0
  Found existing installation: distributed 1.25.3
    Uninstalling distributed-1.25.3:
      Successfully uninstalled distributed-1.25.3
Successfully installed cloudpickle-1.6.0 distributed-2021.3.0


### Import Dependencies and Enable Tools

In [2]:
# nba_api dependencies
from nba_api.stats.static import players, teams
from nba_api.stats.endpoints import commonplayerinfo, playercareerstats

# datascience stack
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import sqlite3 as sql

# system utility stack
import os
import time
import urllib.request  # the request submodule must be imported explicitly for urlopen
from functools import partial

# # Upload kaggle.json to /content/
# from google.colab import files
# uploaded = files.upload()

# # Move and change permissions as needed, allowing for import
# !mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
# import kaggle

# # change directory to directory that contains this notebook
# %cd /content/sports-analytics/basketball/notebooks/

# # utilize Colab Monitor
# from urllib.request import urlopen
# exec(urlopen("http://colab-monitor.smankusors.com/track.py").read())
# _colabMonitor = ColabMonitor().start()

## Collect Data

### Connect to Database

In [3]:
conn = sql.connect('../data/basketball.sqlite')
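
Note that `sqlite3.connect` creates the database file when it is missing but does not create missing parent folders. A small sketch of a safer connection helper follows; the name `connect_database` is hypothetical.

```python
import os
import sqlite3

def connect_database(path):
    """Open (or create) a SQLite database, creating its parent folder first."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return sqlite3.connect(path)
```

With such a helper, the cell above could call `connect_database('../data/basketball.sqlite')` even on a fresh clone where `../data/` does not exist yet.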

### Players

#### Get Players DataFrame and Type ID as String

In [5]:
df_players = pd.DataFrame(players.get_players()).astype({'id': 'str'})
df_players

Unnamed: 0,id,full_name,first_name,last_name,is_active
0,76001,Alaa Abdelnaby,Alaa,Abdelnaby,False
1,76002,Zaid Abdul-Aziz,Zaid,Abdul-Aziz,False
2,76003,Kareem Abdul-Jabbar,Kareem,Abdul-Jabbar,False
3,51,Mahmoud Abdul-Rauf,Mahmoud,Abdul-Rauf,False
4,1505,Tariq Abdul-Wahad,Tariq,Abdul-Wahad,False
...,...,...,...,...,...
4496,1627790,Ante Zizic,Ante,Zizic,True
4497,78647,Jim Zoet,Jim,Zoet,False
4498,78648,Bill Zopf,Bill,Zopf,False
4499,1627826,Ivica Zubac,Ivica,Zubac,True


In [6]:
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4501 entries, 0 to 4500
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          4501 non-null   object
 1   full_name   4501 non-null   object
 2   first_name  4501 non-null   object
 3   last_name   4501 non-null   object
 4   is_active   4501 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 145.2+ KB


#### Add Dataframe as Table to Database, Unless it Already Exists

In [8]:
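try:
    # if_exists='fail' (the default) raises ValueError when the table already exists
    df_players.to_sql('Player', conn, index=False, if_exists='fail')
except ValueError:
    pass  # table already exists, so leave it untouched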
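
An alternative to catching the exception is to check for the table explicitly. The sketch below queries SQLite's built-in `sqlite_master` catalog; the helper name `table_exists` and the in-memory demo database are assumptions made for illustration.

```python
import sqlite3

def table_exists(conn, table_name):
    """Return True if a table with this name is present in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None

demo_conn = sqlite3.connect(":memory:")
before = table_exists(demo_conn, "Player")
demo_conn.execute("CREATE TABLE Player (id TEXT)")
after = table_exists(demo_conn, "Player")
print(before, after)  # False True
```

A guard such as `if not table_exists(conn, 'Player'):` placed before the `to_sql` call would make the intent explicit without hiding unrelated errors.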

### Teams

#### Get Teams DataFrame, Type ID as String and Parse Founding Year

In [9]:
df_teams = pd.DataFrame(teams.get_teams()).astype({'id': 'str'})
df_teams['year_founded'] = pd.to_datetime(df_teams['year_founded'], format='%Y').dt.year  # parse as a year, then keep only the integer year
df_teams.head()

Unnamed: 0,id,full_name,abbreviation,nickname,city,state,year_founded
0,1610612737,Atlanta Hawks,ATL,Hawks,Atlanta,Atlanta,1949
1,1610612738,Boston Celtics,BOS,Celtics,Boston,Massachusetts,1946
2,1610612739,Cleveland Cavaliers,CLE,Cavaliers,Cleveland,Ohio,1970
3,1610612740,New Orleans Pelicans,NOP,Pelicans,New Orleans,Louisiana,2002
4,1610612741,Chicago Bulls,CHI,Bulls,Chicago,Illinois,1966
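
The `year_founded` column ends up as an integer rather than a datetime because `.dt.year` extracts the year again after parsing. A minimal sketch of that round trip, using a few sample years and an assumed `errors='coerce'` variant for messy input:

```python
import pandas as pd

years = pd.Series([1949, 1946, 1970])

# parsing with format='%Y' yields timestamps at January 1 of each year,
# and .dt.year then returns the integer year again
parsed = pd.to_datetime(years, format="%Y")
print(parsed.dt.year.tolist())  # [1949, 1946, 1970]

# errors='coerce' turns unparseable values into NaT instead of raising
messy = pd.Series(["1949", "unknown"])
coerced = pd.to_datetime(messy, format="%Y", errors="coerce")
print(coerced.isna().tolist())  # [False, True]
```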


In [10]:
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            30 non-null     object
 1   full_name     30 non-null     object
 2   abbreviation  30 non-null     object
 3   nickname      30 non-null     object
 4   city          30 non-null     object
 5   state         30 non-null     object
 6   year_founded  30 non-null     int64 
dtypes: int64(1), object(6)
memory usage: 1.8+ KB


#### Add Dataframe as Table to Database, Unless it Already Exists

In [11]:
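try:
    # if_exists='fail' (the default) raises ValueError when the table already exists
    df_teams.to_sql('Team', conn, index=False, if_exists='fail')
except ValueError:
    pass  # table already exists, so leave it untouched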

### Common Player Information

In [None]:
import urllib.request  # needed explicitly for urllib.request.urlopen

PROXY_LIST_URL = "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=1000&country=all&ssl=yes&anonymity=all&simplified=true"
MAX_PROXY_REFRESHES = 6  # give up on a player after refreshing the proxy list this many times

# download the current proxy list, one "host:port" entry per line
def get_proxies():
    with urllib.request.urlopen(PROXY_LIST_URL) as response:
        return [line.decode().strip() for line in response.readlines() if line.strip()]

# define function to extract common player info for a single player
def get_common_player_info(player_id, proxies):
    res = None
    proxy_refreshes = 0
    # while no response
    while res is None:
        # try without a proxy first, then with each proxy in turn
        for proxy in [None] + ["http://" + p for p in proxies]:
            try:
                res = commonplayerinfo.CommonPlayerInfo(player_id=player_id, proxy=proxy, timeout=3).get_data_frames()
                print(player_id)
                break
            except Exception:
                # request failed (timeout, bad proxy, etc.), so move on to the next option
                continue
        else:
            # every option failed: stop after too many refreshes,
            # otherwise get more proxies and try without a proxy again
            if proxy_refreshes >= MAX_PROXY_REFRESHES:
                return None
            proxy_refreshes += 1
            print(player_id, ' failed {} times'.format(proxy_refreshes))
            proxies = get_proxies()

    # merge the common player info and player headline stats and drop timeframe
    res_df = pd.merge(res[0], res[1], how='left', left_on=['PERSON_ID', 'DISPLAY_FIRST_LAST'], right_on=['PLAYER_ID', 'PLAYER_NAME'])
    res_df = res_df.drop(['TimeFrame'], axis=1)
    return res_df

# get proxies
proxies = get_proxies()

# get common player info for each player in the db, skipping players that could not be fetched
player_ids = pd.read_sql('SELECT id FROM Player', conn)['id'].tolist()
dfs = [player_df for player_df in map(partial(get_common_player_info, proxies=proxies), player_ids) if player_df is not None]
df = pd.concat(dfs, ignore_index=True)
df.head()
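
The retry logic in the cell above (try directly, then fall back through a list of proxies, repeating a limited number of times) can be illustrated without any network access. The sketch below is a generic stand-in rather than the notebook's actual function: `first_successful` and `fake_fetch` are hypothetical names, and the "routes" are plain strings instead of real proxies.

```python
def first_successful(fetch, options, max_rounds=3):
    """Call fetch(option) for each option in order, repeating the whole list
    up to max_rounds times, and return the first result that does not raise.
    Returns None if every attempt fails."""
    for _ in range(max_rounds):
        for option in options:
            try:
                return fetch(option)
            except Exception:
                continue  # this option failed, so try the next one
    return None

# stand-in for a network call: only the "proxy-b" route works
def fake_fetch(route):
    if route != "proxy-b":
        raise ConnectionError(f"{route} unreachable")
    return f"data via {route}"

print(first_successful(fake_fetch, [None, "proxy-a", "proxy-b"]))  # data via proxy-b
print(first_successful(fake_fetch, [None, "proxy-a"]))  # None
```

Catching `Exception` rather than using a bare `except` keeps the loop interruptible with Ctrl+C, which matters for a job that iterates over thousands of players.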

In [6]:
list(dfs)

[]

In [3]:
df.info()

'34.78.118.194:3128'

In [None]:
# column cleanup notes for the merged frame:
#   person_id -> id (rename)
#   PLAYER_NAME x (drop)
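
One possible reading of the notes above is to rename `PERSON_ID` to `id` and drop the duplicated `PLAYER_NAME` column. The sketch below applies that reading to a one-row stand-in frame; the interpretation and the sample frame are assumptions, not the notebook's final schema.

```python
import pandas as pd

# tiny stand-in for the merged frame, limited to the columns involved
merged = pd.DataFrame({
    "PERSON_ID": ["51"],
    "DISPLAY_FIRST_LAST": ["Mahmoud Abdul-Rauf"],
    "PLAYER_ID": ["51"],
    "PLAYER_NAME": ["Mahmoud Abdul-Rauf"],
})

cleaned = (
    merged
    .drop(columns=["PLAYER_NAME"])        # duplicates DISPLAY_FIRST_LAST after the merge
    .rename(columns={"PERSON_ID": "id"})  # match the id column of the Player table
)
print(cleaned.columns.tolist())  # ['id', 'DISPLAY_FIRST_LAST', 'PLAYER_ID']
```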

#### Add Data to Table in Database

In [None]:
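try:
    # write the common player info collected above; the table name is an assumption
    df.to_sql('Common_Player_Info', conn, index=False, if_exists='fail')
except ValueError:
    pass  # table already exists, so leave it untouched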

#### Upload to Kaggle

In [None]:
!kaggle datasets version -p ../data -m "adding common player info"
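
The `kaggle datasets version` command expects a `dataset-metadata.json` file in the uploaded folder. As a hedged sketch, the helper below checks for that file before building the command; the name `build_version_command` is hypothetical, and the check for an `id` field reflects my understanding of the metadata format rather than anything verified in this notebook.

```python
import json
import os

def build_version_command(folder, message):
    """Return the kaggle CLI arguments for a new dataset version,
    after checking that the folder holds a dataset-metadata.json file."""
    metadata_path = os.path.join(folder, "dataset-metadata.json")
    if not os.path.isfile(metadata_path):
        raise FileNotFoundError(f"missing {metadata_path}")
    with open(metadata_path) as f:
        metadata = json.load(f)
    # assumption: the metadata identifies the dataset through an "id" field
    if "id" not in metadata:
        raise ValueError("dataset-metadata.json has no 'id' field")
    return ["kaggle", "datasets", "version", "-p", folder, "-m", message]
```

The returned list could be passed to `subprocess.run(..., check=True)` so that a failed upload raises instead of passing silently.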

### Team Details