<a href="https://colab.research.google.com/github/wyattowalsh/sports-analytics/blob/main/basketball/notebooks/data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align='center'> Basketball Data Collection </h1>

This notebook contains the work needed to collect the data that makes up the [***Kaggle Basketball Dataset*** (wyattowalsh/basketball)](https://www.kaggle.com/wyattowalsh/basketball) and serves as the foundation for the [basketball-related projects](https://github.com/wyattowalsh/sports-analytics/tree/main/basketball) within my [sports analytics GitHub repository](https://github.com/wyattowalsh/sports-analytics).

One goal of the data collection component of this project is to produce a `robust`, *organized* dataset that can grow to as **large a scale** as possible. An explanation of my approach to storing the files related to the [***Basketball Dataset***](https://www.kaggle.com/wyattowalsh/basketball) follows below.

<img src="https://unsplash.com/photos/Kv-gAzpUSRg/download?force=true">

## Overview

***Kaggle*** supports many file formats for datasets, including `CSV`, `JSON`, `SQLite`, and archives, among others. The platform acts much like industrial cloud storage such as *Google Cloud Platform's* (**GCP**) ***Cloud Storage*** or *Amazon Web Services'* (**AWS**) ***S3***, albeit with a **100GB** storage capacity. ***Kaggle*** datasets, like these industrial solutions, can be treated as general object/file storage and, in certain data engineering paradigms, can serve as data lakes.

Many state-of-the-art (SOTA) data storage solutions appear to center on an organization-wide data lake (which itself allows for general object storage) fed by multiple inputs (*"tributaries"*), some streaming in continuously and others added on a routine schedule. One benefit of this paradigm is that the lake can store both structured (tabular) and unstructured (image, video, audio, text, etc.) data. This is useful because, over time, new techniques for extracting information from unstructured data may become available. It therefore seems like a good idea to hold onto all extracted data, if possible.

***Kaggle*** datasets can serve as data lakes either by archiving files or simply by storing data files in their raw format. This is a strong foundation for building what may one day become a <b><i>"big data"</i></b> collection.

However, ***Kaggle*** datasets can be configured further to enable additional platform functionality and improve storage efficiency. Structured data, whether structured at extraction or through pre-processing, can be stored in a ***SQLite*** database (`.sqlite` file type) rather than as individual files such as `CSVs` or `JSONs` within the dataset. A single database file is then stored as one object within the dataset. One readily apparent advantage of storing data in ***SQLite*** is that ***Kaggle*** directly displays histograms of the distributions of continuous variables.
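
To make the single-file idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `Team`, its columns, and the two sample rows are illustrative assumptions rather than the dataset's actual schema, and an in-memory database stands in for a file such as `basketball.sqlite`.

```python
import sqlite3

# An in-memory database stands in for a file such as basketball.sqlite.
conn = sqlite3.connect(":memory:")

# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE Team (id TEXT PRIMARY KEY, full_name TEXT, year_founded INTEGER)"
)
rows = [
    ("1610612738", "Boston Celtics", 1946),
    ("1610612741", "Chicago Bulls", 1966),
]
conn.executemany("INSERT INTO Team VALUES (?, ?, ?)", rows)
conn.commit()

# Every table lives in the same database and can be queried with plain SQL.
oldest = conn.execute(
    "SELECT full_name, year_founded FROM Team ORDER BY year_founded LIMIT 1"
).fetchone()
print(oldest)
```

Replacing `":memory:"` with a file path would persist the same tables to a single `.sqlite` file that can be uploaded as one object.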

As this project moves forward, I hope to gather a large collection of both structured and unstructured data. I hope that the ***SQLite*** database (`basketball.sqlite`) can house the structured data in an efficient, useful format, similar to the [***European Soccer Database***](https://www.kaggle.com/hugomathien/soccer).

## View System Information

In [2]:
print("********************** CUDA Version ********************** \n - \n")
!nvcc --version
print("********************** CPU Info ********************** \n - \n")
!cat /proc/cpuinfo
print("********************** CPU Count ********************** \n - \n")
import os
print(os.cpu_count())
print("********************** GPU Info ********************** \n - \n")
!nvidia-smi
print("********************** Python Version ********************** \n - \n")
!python -V

********************** CUDA Version ********************** 
 - 

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
********************** CPU Info ********************** 
 - 

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 49
model name	: AMD EPYC 7B12
stepping	: 0
microcode	: 0x1000065
cpu MHz		: 2249.996
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_l

## Prepare Development Environment

- ### Clone GitHub Repository (if necessary)
- ### Install Conda package manager (if necessary)
- ### Install dependencies

In [1]:
# remove sample data and clone repo
!rm -r sample_data/
!rm -r sports-analytics/
!git clone https://github.com/wyattowalsh/sports-analytics.git

# change directory to directory that contains this notebook
%cd /content/sports-analytics/basketball/notebooks/

# install conda
# !bash sports-analytics/project_resources/bash_scripts/install_conda_in_colab.sh 

# install dependencies
!pip3 install -r ../../dependencies/basketball/data_collection.txt
!pip3 install --upgrade --force-reinstall dask
!pip3 install --upgrade --force-reinstall dask-cuda
!pip3 install 'fsspec>=0.3.3'

# restart kernel
exit()

rm: cannot remove 'sports-analytics/': No such file or directory
Cloning into 'sports-analytics'...
remote: Enumerating objects: 209, done.
remote: Counting objects: 100% (209/209), done.
remote: Compressing objects: 100% (146/146), done.
remote: Total 209 (delta 71), reused 150 (delta 35), pack-reused 0
Receiving objects: 100% (209/209), 69.17 KiB | 1.47 MiB/s, done.
Resolving deltas: 100% (71/71), done.
/content/sports-analytics/basketball/notebooks
Collecting yapf
  Downloading https://files.pythonhosted.org/packages/5f/0d/8814e79eb865eab42d95023b58b650d01dec6f8ea87fc9260978b1bf2167/yapf-0.31.0-py2.py3-none-any.whl (185kB)
     |████████████████████████████████| 194kB 5.9MB/s 
Collecting isort
  Downloading https://files.pythonhosted.org/packages/cc/89/6888f573886e9dc0906ec98f1b15888de20919a142c355d7f57ebd977d36/isort-5.7.0-py3-none-any.whl (104kB)
     |████████████████████████████████| 112kB 36.1MB/s 
Collecting nba_api
  Downloading

Collecting fsspec>=0.3.3
  Downloading https://files.pythonhosted.org/packages/91/0d/a6bfee0ddf47b254286b9bd574e6f50978c69897647ae15b14230711806e/fsspec-0.8.7-py3-none-any.whl (103kB)
     |████████████████████████████████| 112kB 5.4MB/s 
Installing collected packages: fsspec
Successfully installed fsspec-

## Import Dependencies & Initialize Kaggle

In [1]:
# nba_api dependencies
from nba_api.stats.static import players, teams
from nba_api.stats.endpoints import commonplayerinfo, playercareerstats

# datascience stack
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import sqlite3 as sql
import dask
from dask.distributed import Client, progress, LocalCluster
from dask_cuda import LocalCUDACluster

# system utility stack
import os
import time
from requests.packages.urllib3.exceptions import ProxyError
import urllib.error
import urllib.request

# Upload kaggle.json to /content/
from google.colab import files
uploaded = files.upload()

# Move and change permissions as needed, allowing for import
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
import kaggle

# change directory to directory that contains this notebook
%cd /content/sports-analytics/basketball/notebooks/

Saving kaggle.json to kaggle.json
/content/sports-analytics/basketball/notebooks


## Collect Data

### Connect to Database

In [2]:
conn = sql.connect('../data/basketball.sqlite')

## Retrieve Players Table, Format (as needed), Add to Database (if needed)

In [3]:
df_players = pd.DataFrame(players.get_players()).astype({'id': 'str'})
df_players

Unnamed: 0,id,full_name,first_name,last_name,is_active
0,76001,Alaa Abdelnaby,Alaa,Abdelnaby,False
1,76002,Zaid Abdul-Aziz,Zaid,Abdul-Aziz,False
2,76003,Kareem Abdul-Jabbar,Kareem,Abdul-Jabbar,False
3,51,Mahmoud Abdul-Rauf,Mahmoud,Abdul-Rauf,False
4,1505,Tariq Abdul-Wahad,Tariq,Abdul-Wahad,False
...,...,...,...,...,...
4496,1627790,Ante Zizic,Ante,Zizic,True
4497,78647,Jim Zoet,Jim,Zoet,False
4498,78648,Bill Zopf,Bill,Zopf,False
4499,1627826,Ivica Zubac,Ivica,Zubac,True


In [4]:
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4501 entries, 0 to 4500
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          4501 non-null   object
 1   full_name   4501 non-null   object
 2   first_name  4501 non-null   object
 3   last_name   4501 non-null   object
 4   is_active   4501 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 145.2+ KB


In [5]:
df_players.to_sql('Player', conn)

ValueError: ignored
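
`DataFrame.to_sql` defaults to `if_exists='fail'` and raises a `ValueError` when the target table already exists, which is a likely cause of the error above. One way to honor the "if needed" in the heading is to check for the table first. The sketch below does so with the standard `sqlite3` module; the helper name `table_exists` and the illustrative `Player` columns are hypothetical and not part of the notebook.

```python
import sqlite3

def table_exists(conn, table_name):
    """Return True if a table with this name is already in the database."""
    cursor = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    )
    return cursor.fetchone() is not None

conn = sqlite3.connect(":memory:")  # stand-in for ../data/basketball.sqlite
before = table_exists(conn, "Player")
conn.execute("CREATE TABLE Player (id TEXT, full_name TEXT, is_active INTEGER)")
after = table_exists(conn, "Player")
print(before, after)
```

With such a check, the write could be guarded as `if not table_exists(conn, 'Player'): df_players.to_sql('Player', conn)`. Passing `if_exists='replace'` or `if_exists='append'` to `to_sql` are the built-in alternatives.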


## Retrieve Teams Table, Format (as needed), Add to Database (if needed)

In [4]:
df_teams = pd.DataFrame(teams.get_teams()).astype({'id': 'str'})
df_teams['year_founded'] = pd.to_datetime(df_teams['year_founded'], format='%Y').dt.year  # parse as a date, then keep the year as an integer
df_teams.head()

Unnamed: 0,id,full_name,abbreviation,nickname,city,state,year_founded
0,1610612737,Atlanta Hawks,ATL,Hawks,Atlanta,Atlanta,1949
1,1610612738,Boston Celtics,BOS,Celtics,Boston,Massachusetts,1946
2,1610612739,Cleveland Cavaliers,CLE,Cavaliers,Cleveland,Ohio,1970
3,1610612740,New Orleans Pelicans,NOP,Pelicans,New Orleans,Louisiana,2002
4,1610612741,Chicago Bulls,CHI,Bulls,Chicago,Illinois,1966


In [9]:
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            30 non-null     object
 1   full_name     30 non-null     object
 2   abbreviation  30 non-null     object
 3   nickname      30 non-null     object
 4   city          30 non-null     object
 5   state         30 non-null     object
 6   year_founded  30 non-null     int64 
dtypes: int64(1), object(6)
memory usage: 1.8+ KB


In [10]:
# to_sql raises ValueError if the 'Team' table already exists (default if_exists='fail')
df_teams.to_sql('Team', conn)

ValueError: ignored

## Get Proxy Server Addresses

### Define Function to Scrape New Proxy List and Return Proxies Tested to be Alive

In [5]:
def get_proxies():
    # download a fresh list of HTTP proxies, one "host:port" entry per line
    !wget -O http_proxies.txt "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all&simplified=true"

    with open('http_proxies.txt', 'r') as file:
        proxies = file.read().split('\n')
    print("Original number of proxies: ", len(proxies))

    def check_proxies(proxy):
        # return the proxy if it fails to respond within the timeout, otherwise None
        try:
            urllib.request.urlopen("http://" + proxy, timeout=2)
            print("alive proxy detected")
        except Exception:
            return proxy

    # build one lazy task per proxy so the checks run in parallel on the Dask cluster
    dead_proxies = [dask.delayed(check_proxies)(proxy) for proxy in proxies]
    dead_proxies = set(filter(None, dask.compute(*dead_proxies)))

    # keep only the proxies that responded, dropping blank entries as well
    proxies = [proxy for proxy in proxies if proxy and proxy not in dead_proxies]
    print("Number of proxies alive: ", len(proxies))
    return proxies
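
Note that `check_proxies` opens the proxy's own address rather than sending a request through it. A stricter liveness test would route a request via the proxy. The following is a minimal sketch of that idea using only `urllib.request`; the helper names `make_proxy_opener` and `proxy_is_alive` and the default `test_url` are assumptions for illustration, not part of the notebook.

```python
import urllib.request

def make_proxy_opener(proxy):
    """Build an opener that sends HTTP and HTTPS requests through 'host:port'."""
    handler = urllib.request.ProxyHandler(
        {"http": "http://" + proxy, "https": "http://" + proxy}
    )
    return urllib.request.build_opener(handler)

def proxy_is_alive(proxy, test_url="http://example.com", timeout=2.0):
    """Return True only if test_url can be fetched through the proxy in time."""
    opener = make_proxy_opener(proxy)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        # connection refused, timeout, bad gateway, malformed address, etc.
        return False
```

A function like `proxy_is_alive` could be wrapped in `dask.delayed` in the same way `check_proxies` is, with the result inverted since it reports live rather than dead proxies.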

### Create Dask Cluster with the Number of Workers Equal to the Number of CPU Cores

In [6]:
# Set n_workers to match the CPU count reported in the output of the first cell
cluster = LocalCluster(n_workers=4) 
c = Client(cluster)
c

0,1
Client  Scheduler: tcp://127.0.0.1:33605  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 4  Cores: 4  Memory: 27.39 GB


### Process `get_proxies()` with Dask, Then Shut Down the Cluster

In [7]:
proxies = get_proxies()
c.shutdown()
proxies

--2021-03-19 01:06:34--  https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all&simplified=true
Resolving api.proxyscrape.com (api.proxyscrape.com)... 151.139.128.11
Connecting to api.proxyscrape.com (api.proxyscrape.com)|151.139.128.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘http_proxies.txt’

http_proxies.txt        [<=>                 ]       0  --.-KB/s               http_proxies.txt        [ <=>                ]  21.49K  --.-KB/s    in 0s      

2021-03-19 01:06:34 (60.0 MB/s) - ‘http_proxies.txt’ saved [22006]

Original number of proxies:  1085
Number of proxies alive:  7


['59.125.123.129:81',
 '221.182.31.54:8080',
 '109.237.91.155:8080',
 '173.212.202.65:80',
 '60.205.132.71:80',
 '203.243.63.16:80',
 '5.189.133.231:80']

## Get Common Player Information

### Define Functions `get_quick_proxies()` & `get_common_player_info()`

Both functions are run by jobs on a ***Dask*** cluster. 

`get_quick_proxies()` retrieves a list of proxies (tested to be alive) more quickly than `get_proxies()` above: it requests the proxy list with a lower `timeout` parameter (3500 rather than 10000) and checks the proxies one at a time. It is used when every proxy found by `get_proxies()` fails to return a response from stats.nba.com. 

`get_common_player_info()` returns a dataframe of common player information for a given player. The paradigm here is to distribute jobs (each job collects common player info for one player) across a ***Dask*** cluster; since all outputs share the same format, they can easily be concatenated. 
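
The fan-out-and-concatenate pattern described above can be illustrated without a cluster. The sketch below swaps Dask for the standard library's `ThreadPoolExecutor` and uses a stand-in `fetch_player_rows` function (hypothetical, returning placeholder rows) in place of a real request; it only shows that identically shaped per-player results combine cleanly.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def fetch_player_rows(player_id):
    """Stand-in for a per-player request; returns rows with a fixed set of keys."""
    return [{"PERSON_ID": player_id, "SOURCE": "placeholder"}]

def collect_all(player_ids, max_workers=4):
    """Run one job per player and concatenate the per-player row lists."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as player_ids
        per_player = list(pool.map(fetch_player_rows, player_ids))
    return list(chain.from_iterable(per_player))

rows = collect_all([76001, 76002, 76003])
print(len(rows), rows[0])
```

Because every job returns rows with the same keys, the combined list has a uniform shape, which is the same property that lets the per-player dataframes be passed to `pd.concat`.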

In [20]:
def get_quick_proxies():
    # download a fresh list of HTTP proxies, one "host:port" entry per line
    !wget -O http_proxies.txt "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=3500&country=all&ssl=all&anonymity=all&simplified=true"

    with open('http_proxies.txt', 'r') as file:
        proxies = file.read().split('\n')
    print("Original number of proxies: ", len(proxies))

    def check_proxies(proxy):
        # return the proxy if it fails to respond within the timeout, otherwise None
        try:
            urllib.request.urlopen("http://" + proxy, timeout=2.5)
        except Exception:
            return proxy

    # check_proxies returns None for live proxies, so keep only the non-None results
    # (calling proxies.remove(None) would raise a ValueError)
    dead_proxies = {proxy for proxy in proxies if check_proxies(proxy) is not None}

    # keep only the proxies that responded, dropping blank entries as well
    proxies = [proxy for proxy in proxies if proxy and proxy not in dead_proxies]
    print("Number of proxies alive: ", len(proxies))
    return proxies
  
def get_common_player_info(player_id, proxies):
    # NOTE: these browser-like headers are defined but never passed to the request below
    custom_headers = {
    'Host': 'stats.nba.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    }
    dfs = []
    while len(dfs) < 2: 
        for proxy in proxies:
            try:
                res = commonplayerinfo.CommonPlayerInfo(player_id=player_id, proxy="http://" + proxy, timeout=100)
                dfs = res.get_data_frames()
                df = pd.merge(dfs[0], dfs[1], how='left', left_on=['PERSON_ID', 'DISPLAY_FIRST_LAST'], right_on=['PLAYER_ID', 'PLAYER_NAME'])
                df = df.drop(['TimeFrame'], axis=1)
                print(player_id)
                return df
            except Exception:
                # this proxy failed (or the response could not be processed); try the next one
                continue
        print(player_id, "\n proxies failed; retrieving new proxies and attempting request again")
        proxies = get_quick_proxies()
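
As written, the `while` loop above keeps requesting new proxy lists until a request succeeds. A bounded variant of the same rotate-then-refresh logic can be sketched as follows; `fetch`, `refresh`, `max_rounds`, and the function name `fetch_with_rotation` are hypothetical names introduced for illustration, with `fetch` standing in for the `CommonPlayerInfo` call and `refresh` for `get_quick_proxies()`.

```python
def fetch_with_rotation(fetch, proxies, refresh, max_rounds=3):
    """Try each proxy in turn, refresh the list between rounds, then give up."""
    for _ in range(max_rounds):
        for proxy in proxies:
            try:
                return fetch(proxy)
            except Exception:
                continue  # this proxy failed; try the next one
        proxies = refresh()  # every proxy failed; get a new list
    raise RuntimeError(f"all proxies failed after {max_rounds} rounds")

# Toy demonstration with fake callables (no network access involved).
def fake_fetch(proxy):
    if proxy != "good:80":
        raise ConnectionError(proxy)
    return f"fetched via {proxy}"

result = fetch_with_rotation(fake_fetch, ["bad:80"], lambda: ["worse:80", "good:80"])
print(result)
```

Raising after a fixed number of rounds lets the caller decide whether to skip the player or retry later, rather than leaving a worker looping indefinitely.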

### Extract Common Player Information for all Players

In [23]:
# Set n_workers to match the CPU count reported in the output of the first cell
cluster = LocalCluster(n_workers=4)
c = Client(cluster)           
player_ids = df_players['id'].values
common_player_info_dfs = []
for player_id in player_ids:
    common_player_info_df = dask.delayed(get_common_player_info)(int(player_id), proxies)
    common_player_info_dfs.append(common_player_info_df)

common_player_info_dfs = dask.persist(*common_player_info_dfs)
common_player_info_dfs = dask.compute(*common_player_info_dfs)
common_player_info_df = pd.concat(common_player_info_dfs)

common_player_info_df.head()
c.shutdown()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 37311 instead
  http_address["port"], self.http_server.port


TypeError: ignored

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
concurrent.futures._base.CancelledError


In [22]:
c.shutdown()

In [8]:
common_player_info_df

NameError: ignored