<a href="https://colab.research.google.com/github/wyattowalsh/sports-analytics/blob/main/basketball/notebooks/data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align='center'> Basketball Data Collection </h1>

This notebook contains the work needed to collect the data that makes up the [***Kaggle Basketball Dataset*** (wyattowalsh/basketball)](https://www.kaggle.com/wyattowalsh/basketball) and serves as the foundation for the [basketball-related projects](https://github.com/wyattowalsh/sports-analytics/tree/main/basketball) within my [sports analytics GitHub repository](https://github.com/wyattowalsh/sports-analytics).

One of the goals for the data collection component of this project is to produce a `robust`, *organized* dataset that can grow to as **large a scale** as possible. You can find an explanation of my solution for storing the files related to the [***Basketball Dataset***](https://www.kaggle.com/wyattowalsh/basketball) below.

<img src="https://unsplash.com/photos/Kv-gAzpUSRg/download?force=true">

***Kaggle*** supports many file formats for datasets, including `CSV`, `JSON`, `SQLite`, and archives, among others. The platform acts much like industrial cloud storage such as *Google Cloud Platform's* (**GCP**) ***Cloud Storage*** or *Amazon Web Services'* (**AWS**) ***S3***, albeit with a **100GB** storage capacity. ***Kaggle*** datasets, like these industrial solutions, can be considered general object/file storage and, in certain data engineering paradigms, can serve as data lakes.

It seems that many state-of-the-art (SOTA) data storage solutions pivot around an organization-wide data lake (which itself allows for general object storage) fed by multiple inputs (*"tributaries"*), some streaming in continuously and others added on a routine schedule. One benefit of this paradigm is that the lake can store both structured (tabular) and unstructured (image, video, audio, text, etc.) data. This can prove useful because, as time progresses, new techniques for extracting useful information from unstructured data become available. Thus it also seems like a good idea to hold onto all extracted data, if possible.

***Kaggle*** datasets can serve as data lakes either through archiving or by simply storing data files in their raw format. This certainly serves as a strong foundation for building what may one day be a <b><i>"big data"</i></b> collection.

However, further work can be done in configuring ***Kaggle*** datasets to enable additional platform functionality as well as improved storage efficiency. Further, data lakes can be refined over time to enable extraction and analysis across all ingested data.




To facilitate the growth of what may one day be a "big data" collection, the dataset is stored within a ***SQLite*** database (`basketball.sqlite`).
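
As a rough illustration of why a single SQLite file is convenient here, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name `Demo_Player` and its columns are hypothetical stand-ins, not the dataset's actual schema.

```python
import sqlite3

# in-memory database; the notebook itself writes to ../data/basketball.sqlite
conn = sqlite3.connect(":memory:")

# hypothetical table and columns, for illustration only
conn.execute(
    "CREATE TABLE Demo_Player (id TEXT PRIMARY KEY, full_name TEXT, is_active INTEGER)"
)
rows = [("1", "Player One", 1), ("2", "Player Two", 0)]
conn.executemany("INSERT INTO Demo_Player VALUES (?, ?, ?)", rows)
conn.commit()

# tables stored in the same file can be filtered and joined with plain SQL
active = conn.execute(
    "SELECT full_name FROM Demo_Player WHERE is_active = 1"
).fetchall()
print(active)
```

Every table lives in one portable file, so new tables can be added over time without changing how the dataset is distributed.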

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Prepare-Development-Environment" data-toc-modified-id="Prepare-Development-Environment-1">Prepare Development Environment</a></span><ul class="toc-item"><li><span><a href="#Clone-GitHub-Repository-(if-necessary)" data-toc-modified-id="Clone-GitHub-Repository-(if-necessary)-1.1">Clone GitHub Repository (if necessary)</a></span></li><li><span><a href="#Install-Conda-package-manager-(if-necessary)" data-toc-modified-id="Install-Conda-package-manager-(if-necessary)-1.2">Install Conda package manager (if necessary)</a></span></li><li><span><a href="#Install-dependencies" data-toc-modified-id="Install-dependencies-1.3">Install dependencies</a></span></li></ul></li><li><span><a href="#Import-Dependencies,-Initialize-Kaggle,-Initialize-Dask" data-toc-modified-id="Import-Dependencies,-Initialize-Kaggle,-Initialize-Dask-2">Import Dependencies, Initialize Kaggle, Initialize Dask</a></span></li></ul></div>

## Prepare Development Environment

- ### Clone GitHub Repository (if necessary)
- ### Install Conda package manager (if necessary)
- ### Install dependencies

In [None]:
# remove sample data and clone repo
!rm -r sample_data 
!git clone https://github.com/wyattowalsh/sports-analytics.git

# change directory to directory that contains this notebook
%cd /content/sports-analytics/basketball/notebooks/


# install dependencies
! pip3 install -r ../../dependencies/basketball/data_collection.txt
! cat ../../dependencies/basketball/data_collection.txt | grep -v '^\-e' | cut -d = -f 1 | xargs -n1 pip3 install -U 
# install conda
# bash sports-analytics/project_resources/bash_scripts/install_conda_in_colab.sh 

Cloning into 'sports-analytics'...
remote: Enumerating objects: 184, done.
remote: Counting objects: 100% (184/184), done.
remote: Compressing objects: 100% (126/126), done.
remote: Total 184 (delta 61), reused 140 (delta 33), pack-reused 0
Receiving objects: 100% (184/184), 46.91 KiB | 6.70 MiB/s, done.
Resolving deltas: 100% (61/61), done.
/content/sports-analytics/basketball/notebooks
Collecting nbdime
  Downloading https://files.pythonhosted.org/packages/17/50/b2fb7829239560f3de10309461f35eef4f5c188ffb83d861783a6728a3c5/nbdime-2.1.0-py2.py3-none-any.whl (5.0MB)
     |████████████████████████████████| 5.0MB 7.0MB/s 
ERROR: Could not find a version that satisfies the requirement python (from -r ../../dependencies/basketball/data_collection.txt (line 3)) (from versions: none)
ERROR: No matching distribution found for python (from -r ../../dependencies/basketball/data_collection.txt (line 3))
Collecting nbdime
  Using cached https://files.pythonho

In [6]:
! cat ../../dependencies/basketball/data_collection.txt | grep -v '^\-e' | cut -d = -f 1 

nbdime ### Utilities 
pip
python
ipykernel
nb_conda
jupyter_contrib_nbextensions
numpy
pandas
seaborn
matplotlib
yapf
isort
nba_api
kaggle
dask_cuda
dask
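
The `pip3 install -r` call earlier fails on the `python` entry, which pip cannot resolve as a package. One way to work around this is to filter the requirement names before installing. The sketch below is a minimal version of that idea; the function name `pip_installable` and the `SKIP` set are hypothetical, and only `python` is known from the error output to need skipping.

```python
import re

# names assumed not to be installable with pip; "python" is taken from the error output
SKIP = {"python"}

def pip_installable(lines, skip=SKIP):
    """Return bare package names from requirement lines, minus skipped ones."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line or line.startswith("-e"):   # skip blanks and editable installs
            continue
        name = re.split(r"[=<>!~\s]", line, maxsplit=1)[0]  # drop version pins
        if name and name.lower() not in skip:
            names.append(name)
    return names

sample = ["nbdime ### Utilities", "pip", "python=3.8", "pandas==1.2.3", "", "-e ."]
print(pip_installable(sample))
```

The resulting names could then be passed to `pip3 install -U` one at a time, as the shell pipeline above does.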


In [4]:
! pip3 list --outdated --format=freeze

chardet==3.0.4
dlib==19.18.0
dopamine-rl==1.0.5
earthengine-api==0.1.255
fancyimpute==0.4.3
fastai==1.0.61
firebase-admin==4.4.0
fix-yahoo-finance==0.0.22
future==0.16.0
gast==0.3.3
GDAL==2.2.2
gdown==3.6.4
gensim==3.6.0
geopy==1.17.0
google==2.0.3
google-api-python-client==1.12.8
google-auth==1.27.1
google-auth-httplib2==0.0.4
google-cloud-bigquery==1.21.0
google-cloud-bigquery-storage==1.1.0
google-cloud-core==1.0.3
google-cloud-datastore==1.8.0
google-cloud-firestore==1.7.0
google-cloud-language==1.2.0
google-cloud-storage==1.18.1
google-cloud-translate==1.5.0
google-resumable-media==0.4.1
graphviz==0.10.1
grpcio==1.32.0
gspread==3.0.1
gspread-dataframe==3.0.8
gym==0.17.3
h5py==2.10.0
holoviews==1.13.5
html5lib==1.0.1
httpimport==0.5.18
httplib2==0.17.4
humanize==0.5.1
hyperopt==0.1.2
idna==2.10
imageio==2.4.1
imbalanced-learn==0.4.3
importlib-metadata==3.7.2
inflect==2.1.0
intervaltree==2.1.0
ipykernel==4.10.1
ipython==5.5.0
ipython-sql==0.3.9
jaxlib==0.1.62+cuda110
jsonschema==2.6

## Import Dependencies, Initialize Kaggle, Initialize Dask 

In [1]:
from nba_api.stats.static import players, teams
from nba_api.stats.endpoints import commonplayerinfo, playercareerstats
import pandas as pd 
import numpy as np
import os
import sqlite3 as sql
import matplotlib.pyplot as plt
import seaborn
import time
from requests.packages.urllib3.exceptions import ProxyError
import urllib.error
import urllib.request
import dask
from dask.distributed import Client, progress, LocalCluster
from dask_cuda import LocalCUDACluster

!cat /proc/cpuinfo
!nvidia-smi -L

from google.colab import files
uploaded = files.upload()
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
import kaggle

ModuleNotFoundError: ignored
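
The output above does not say which import raised the `ModuleNotFoundError`. One possibility, stated here as an assumption, is that a GPU-only package such as `dask_cuda` is missing on a CPU runtime. A minimal sketch of probing for optional modules before importing them, using the standard library's `importlib.util.find_spec` (the helper name `has_module` is hypothetical):

```python
import importlib.util

def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None

# treating dask_cuda as optional is an assumption, not something the notebook states
optional = ["dask_cuda"]
available = {name: has_module(name) for name in optional}
print(available)

# the real import could then be guarded, e.g.:
# if available["dask_cuda"]:
#     from dask_cuda import LocalCUDACluster
```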

In [6]:
conn = sql.connect('../data/basketball.sqlite')

In [7]:
df_players = pd.DataFrame(players.get_players()).astype({'id': 'str'})
df_players

Unnamed: 0,id,full_name,first_name,last_name,is_active
0,76001,Alaa Abdelnaby,Alaa,Abdelnaby,False
1,76002,Zaid Abdul-Aziz,Zaid,Abdul-Aziz,False
2,76003,Kareem Abdul-Jabbar,Kareem,Abdul-Jabbar,False
3,51,Mahmoud Abdul-Rauf,Mahmoud,Abdul-Rauf,False
4,1505,Tariq Abdul-Wahad,Tariq,Abdul-Wahad,False
...,...,...,...,...,...
4496,1627790,Ante Zizic,Ante,Zizic,True
4497,78647,Jim Zoet,Jim,Zoet,False
4498,78648,Bill Zopf,Bill,Zopf,False
4499,1627826,Ivica Zubac,Ivica,Zubac,True


In [8]:
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4501 entries, 0 to 4500
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          4501 non-null   object
 1   full_name   4501 non-null   object
 2   first_name  4501 non-null   object
 3   last_name   4501 non-null   object
 4   is_active   4501 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 145.2+ KB


In [9]:
df_players.to_sql('Player', conn)

In [10]:
df_teams = pd.DataFrame(teams.get_teams()).astype({'id': 'str'})
df_teams['year_founded'] =  pd.to_datetime(df_teams['year_founded'], format='%Y').dt.year
df_teams

Unnamed: 0,id,full_name,abbreviation,nickname,city,state,year_founded
0,1610612737,Atlanta Hawks,ATL,Hawks,Atlanta,Atlanta,1949
1,1610612738,Boston Celtics,BOS,Celtics,Boston,Massachusetts,1946
2,1610612739,Cleveland Cavaliers,CLE,Cavaliers,Cleveland,Ohio,1970
3,1610612740,New Orleans Pelicans,NOP,Pelicans,New Orleans,Louisiana,2002
4,1610612741,Chicago Bulls,CHI,Bulls,Chicago,Illinois,1966
5,1610612742,Dallas Mavericks,DAL,Mavericks,Dallas,Texas,1980
6,1610612743,Denver Nuggets,DEN,Nuggets,Denver,Colorado,1976
7,1610612744,Golden State Warriors,GSW,Warriors,Golden State,California,1946
8,1610612745,Houston Rockets,HOU,Rockets,Houston,Texas,1967
9,1610612746,Los Angeles Clippers,LAC,Clippers,Los Angeles,California,1970


In [11]:
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            30 non-null     object
 1   full_name     30 non-null     object
 2   abbreviation  30 non-null     object
 3   nickname      30 non-null     object
 4   city          30 non-null     object
 5   state         30 non-null     object
 6   year_founded  30 non-null     int64 
dtypes: int64(1), object(6)
memory usage: 1.8+ KB


In [12]:
df_teams.to_sql('Team', conn)
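
After writing tables, it can help to confirm what actually landed in the database. The sketch below lists each user table with its row count through SQLite's `sqlite_master` catalog. The helper name `table_row_counts` is hypothetical, and the demo uses an in-memory database with stand-in rows rather than the real data.

```python
import sqlite3

def table_row_counts(conn):
    """Map each user table in a SQLite connection to its row count."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    # names come straight from sqlite_master, so quoting them is enough here
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}

# demo on an in-memory database with stand-in tables
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Player (id TEXT)")
demo.execute("CREATE TABLE Team (id TEXT)")
demo.executemany("INSERT INTO Player VALUES (?)", [("1",), ("2",), ("3",)])
demo.execute("INSERT INTO Team VALUES ('10')")
print(table_row_counts(demo))
```

Called with this notebook's `conn`, the counts should line up with the `info()` outputs above (4501 players, 30 teams), assuming both writes succeeded.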

In [14]:
import multiprocessing

# number of CPUs visible to this runtime
multiprocessing.cpu_count()
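
The CPU count is useful for sizing the `LocalCluster(n_workers=8)` used later in this notebook. A minimal sketch of deriving a bounded worker count follows; `pick_worker_count` and its `cap` parameter are hypothetical names.

```python
import os

def pick_worker_count(cap=8):
    """Use the number of visible CPUs, bounded between 1 and cap."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(cap, cpus))

print(pick_worker_count())
```

Because the proxy checks are mostly waiting on the network, running more workers than CPUs may still be reasonable; the cap here is only a starting point.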

In [None]:
def get_proxies():
    # download a fresh list of HTTP proxies (one "host:port" per line)
    !wget -O http_proxies.txt "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all&simplified=true"

    with open('http_proxies.txt', 'r') as file:
        proxies = file.read().split('\n')
    print("Original number of proxies: ", len(proxies))

    def check_proxies(proxy):
        # return the proxy if it cannot be reached (dead), otherwise None
        try:
            urllib.request.urlopen("http://" + proxy, timeout = 10)
        except Exception:
            return proxy

    # check all proxies in parallel; reachable proxies yield None and are filtered out
    dead_proxies = [dask.delayed(check_proxies)(proxy) for proxy in proxies]
    dead_proxies = set(filter(None, dask.compute(*dead_proxies)))

    # keep only the reachable, non-empty entries
    proxies = [proxy for proxy in proxies if proxy and proxy not in dead_proxies]
    print("Number of proxies alive: ", len(proxies))
    return proxies
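
The check above opens the proxy's own address as a URL. A stricter test routes a request *through* the proxy. The sketch below does this with only the standard library. `TEST_URL`, `works_as_proxy`, and `filter_alive` are hypothetical names, and the thread pool is a swapped-in alternative to the dask calls used in this notebook.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "http://example.com"  # assumed test target; any stable URL would do

def works_as_proxy(proxy, url=TEST_URL, timeout=5):
    """Return True if a request to url succeeds when routed through proxy."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False

def filter_alive(proxies, check=works_as_proxy, max_workers=32):
    """Keep the non-empty proxies for which check returns True, preserving order."""
    candidates = [p for p in proxies if p]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, candidates))
    return [p for p, ok in zip(candidates, results) if ok]

# offline demo with a fake checker so no network access is needed
fake_check = lambda proxy: proxy.endswith(":8080")
print(filter_alive(["1.2.3.4:8080", "", "5.6.7.8:3128"], check=fake_check))
```

Passing the checker in as an argument keeps the filtering logic testable without touching the network.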

In [None]:
player_ids = df_players['id'].values
cluster = LocalCluster(n_workers=8)
c = Client(cluster)
proxies = get_proxies()
c.shutdown()
proxies

--2021-03-18 15:48:18--  https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all&simplified=true
Resolving api.proxyscrape.com (api.proxyscrape.com)... 151.139.128.11
Connecting to api.proxyscrape.com (api.proxyscrape.com)|151.139.128.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘http_proxies.txt’

http_proxies.txt        [ <=>                ]  24.70K  --.-KB/s    in 0.01s   

2021-03-18 15:48:19 (2.18 MB/s) - ‘http_proxies.txt’ saved [25295]

Original number of proxies:  1248


In [None]:
def get_quick_proxies():
    # download a fresh list of HTTP proxies, requested with a lower timeout setting
    !wget -O http_proxies.txt "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=3500&country=all&ssl=all&anonymity=all&simplified=true"

    with open('http_proxies.txt', 'r') as file:
        proxies = file.read().split('\n')
    print("Original number of proxies: ", len(proxies))

    def check_proxies(proxy):
        # return the proxy if it cannot be reached (dead), otherwise None
        try:
            urllib.request.urlopen("http://" + proxy, timeout = 2.5)
        except IOError:
            return proxy

    # check all proxies in parallel; reachable proxies yield None and are filtered out
    dead_proxies = [dask.delayed(check_proxies)(proxy) for proxy in proxies]
    dead_proxies = set(filter(None, dask.compute(*dead_proxies)))

    # keep only the reachable, non-empty entries
    proxies = [proxy for proxy in proxies if proxy and proxy not in dead_proxies]
    print("Number of proxies alive: ", len(proxies))
    return proxies

ProxyError: HTTPSConnectionPool(host='stats.nba.com', port=443): Max retries exceeded with url: /stats/commonplayerinfo?LeagueID=&PlayerID=76001 (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 500 Internal Server Error')))

In [None]:
cluster = LocalCluster(n_workers=8)
c = Client(cluster)

def get_common_player_info(player_id, proxies):
    custom_headers = {
    'Host': 'stats.nba.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    }
    dfs = []
    while len(dfs) < 2: 
        for proxy in proxies:
            try:
                res = commonplayerinfo.CommonPlayerInfo(player_id=player_id, proxy="http://" + proxy, timeout=100)
                dfs = res.get_data_frames()
                df = pd.merge(dfs[0], dfs[1], how='left', left_on=['PERSON_ID', 'DISPLAY_FIRST_LAST'], right_on=['PLAYER_ID', 'PLAYER_NAME'])
                df = df.drop(['TimeFrame'], axis=1)
                print(player_id)
                return df
            except Exception:
                # this proxy failed; try the next one
                continue
        print(player_id, "\n proxies failed; retrieving new proxies and attempting request again")
        proxies = get_quick_proxies()
            

common_player_info_dfs = []
for player_id in player_ids:
    common_player_info_df = dask.delayed(get_common_player_info)(int(player_id), proxies)
    common_player_info_dfs.append(common_player_info_df)

common_player_info_dfs = dask.persist(*common_player_info_dfs)
common_player_info_dfs = dask.compute(*common_player_info_dfs)
common_player_info_df = pd.concat(common_player_info_dfs)

common_player_info_df.head()
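
The `while` loop in `get_common_player_info` keeps refreshing proxies until a request succeeds, so it never ends if a player cannot be fetched. A bounded variant of the same rotation idea is sketched below. `fetch_with_rotation`, `refresh`, and `max_rounds` are hypothetical names, and the demo uses a fake fetch function rather than `nba_api`.

```python
def fetch_with_rotation(fetch, proxies, refresh, max_rounds=3):
    """Try fetch(proxy) for each proxy; refresh the list between rounds.

    Returns the first successful result, or None once max_rounds are used up.
    """
    for _ in range(max_rounds):
        for proxy in proxies:
            try:
                return fetch(proxy)
            except Exception:
                continue  # this proxy failed; move on to the next one
        proxies = refresh()  # every proxy failed; get a new list
    return None

# offline demo: only the proxy named "good" succeeds
def fake_fetch(proxy):
    if proxy != "good":
        raise ConnectionError("proxy failed")
    return {"via": proxy}

first = fetch_with_rotation(fake_fetch, ["bad1", "bad2"], refresh=lambda: ["bad3", "good"])
second = fetch_with_rotation(fake_fetch, ["bad1"], refresh=lambda: ["bad2"], max_rounds=2)
print(first, second)
```

In this notebook, `fetch` would wrap the `CommonPlayerInfo` request and `refresh` would be `get_quick_proxies`; callers would then need to handle a `None` result for players that could not be retrieved.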

In [None]:
c.shutdown()

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError


In [None]:
get_common_player_info(int(player_ids[0]), proxies)

76001 
 proxies failed; retrieving new proxies and attempting request again
--2021-03-18 14:02:29--  https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=1000&country=all&ssl=all&anonymity=all&simplified=true
Resolving api.proxyscrape.com (api.proxyscrape.com)... 151.139.128.11
Connecting to api.proxyscrape.com (api.proxyscrape.com)|151.139.128.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3676 (3.6K) [text/plain]
Saving to: ‘http_proxies.txt’


2021-03-18 14:02:30 (9.93 MB/s) - ‘http_proxies.txt’ saved [3676/3676]

Original number of proxies:  185


KeyboardInterrupt: 

In [None]:
pd.concat([Player_Attributes, Player_Attributes])

Unnamed: 0,PERSON_ID,FIRST_NAME,LAST_NAME,DISPLAY_FIRST_LAST,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FI_LAST,PLAYER_SLUG,BIRTHDATE,SCHOOL,COUNTRY,...,GAMES_PLAYED_FLAG,DRAFT_YEAR,DRAFT_ROUND,DRAFT_NUMBER,PLAYER_ID,PLAYER_NAME,PTS,AST,REB,ALL_STAR_APPEARANCES
0,76001,Alaa,Abdelnaby,Alaa Abdelnaby,"Abdelnaby, Alaa",A. Abdelnaby,alaa-abdelnaby,1968-06-24T00:00:00,Duke,USA,...,Y,1990,1,25,76001,Alaa Abdelnaby,5.7,0.3,3.3,0
0,76001,Alaa,Abdelnaby,Alaa Abdelnaby,"Abdelnaby, Alaa",A. Abdelnaby,alaa-abdelnaby,1968-06-24T00:00:00,Duke,USA,...,Y,1990,1,25,76001,Alaa Abdelnaby,5.7,0.3,3.3,0


In [None]:
Player_Attributes.info

<bound method DataFrame.info of    PERSON_ID FIRST_NAME  LAST_NAME DISPLAY_FIRST_LAST  \
0      76001       Alaa  Abdelnaby     Alaa Abdelnaby   

  DISPLAY_LAST_COMMA_FIRST DISPLAY_FI_LAST     PLAYER_SLUG  \
0          Abdelnaby, Alaa    A. Abdelnaby  alaa-abdelnaby   

             BIRTHDATE SCHOOL COUNTRY  ... GAMES_PLAYED_FLAG DRAFT_YEAR  \
0  1968-06-24T00:00:00   Duke     USA  ...                 Y       1990   

  DRAFT_ROUND  DRAFT_NUMBER PLAYER_ID     PLAYER_NAME  PTS  AST  REB  \
0           1            25     76001  Alaa Abdelnaby  5.7  0.3  3.3   

  ALL_STAR_APPEARANCES  
0                    0  

[1 rows x 38 columns]>

In [None]:
Player_Attributes