<a href="https://colab.research.google.com/github/wyattowalsh/sports-analytics/blob/main/basketball/notebooks/data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align='center'> Basketball Data Collection </h1>

This notebook contains the associated work necessary to collect the data that composes the [***Kaggle Basketball Dataset*** (wyattowalsh/basketball)](https://www.kaggle.com/wyattowalsh/basketball) and serves as the foundation for the [basketball related projects](https://github.com/wyattowalsh/sports-analytics/tree/main/basketball) within my [sports analytics GitHub repository](https://github.com/wyattowalsh/sports-analytics).

One of the goals for the data collection component of this project is to produce a `robust`, *organized* dataset that can grow to as **large of a scale** as possible. You can find an explanation of my solution for storing the files related to the [***Basketball Dataset***](https://www.kaggle.com/wyattowalsh/basketball) below.

<img src="https://unsplash.com/photos/Kv-gAzpUSRg/download?force=true">

## Overview

***Kaggle*** offers many formats of which one can save files to a dataset, which include: `CSV`, `JSON`, `SQLite`, and `Archives`, among others. The platform essentially acts similarly to industrial cloud solutions like *Google Cloud Platform's* (**GCP**) ***Cloud Storage*** or *Amazon Web Service's* (**AWS**) ***S3*** albeit with a **100GB** storage capacity. ***Kaggle*** datasets as well as these industrial solutions can be considered as broad object/file storage and in certain data engineering paradigms can serve as data lakes. 

It seems that many state-of-the-art (SOTA) data storage solutions pivot around an organizational-wide data lake (of which itself allows for general object storage) that has multiple inputs (*"tributaries"*) both streaming into and routinely added to the overall lake. One benefit of this paradigm is that the lake facilitates the storage of both structured (tabular) and unstructured (image, video, audio, text, etc) data. This can prove useful because, as time progresses, new techniques for extracting useful information from unstructured data can be utilized. Thus it also seems like a good idea to hold onto all extracted data, if possible. 

***Kaggle*** datasets can serve as data lakes through the archival process or simply by storing data files in their raw file format. This certainly serves as a strong foundation for building a &#8212; one day in the future &#8212; <b><i>"big data"</i></b> collection. 

However, there is further work that can be done in configuring ***Kaggle*** datasets to enable additional platform functionality as well as improved storage efficiency. Structured data, whether structured upon extraction or structured through some pre-processing, can be stored in a ***SQLite*** database (`.sqlite` file type) as opposed to storing individual files such as `CSVs` or `JSONs` within the dataset. Thus, a single database file is stored as an object within the dataset, enabling additional functionality. One easily discerned advantage with storing in ***SQLite*** is that histograms of the distribution of across continuous variables are given directly within ***Kaggle***. 

As this project moves forward, I hope to collect a large collection of both structured and unstructured data. I hope that the ***SQLite*** database (`basketball.sqlite`) can serve to house the structured data in an efficient, useful format, similarly to the [***European Soccer Database***](https://www.kaggle.com/hugomathien/soccer).

## View System Information

In [1]:
print("********************** CUDA Version ********************** \n - \n")
!nvcc --version
print("********************** CPU Info ********************** \n - \n")
!cat /proc/cpuinfo
print("********************** CPU Count ********************** \n - \n")
import os
print(os.cpu_count())
print("********************** GPU Info ********************** \n - \n")
!nvidia-smi
print("********************** Python Version ********************** \n - \n")
!python -V

********************** CUDA Version ********************** 
 - 

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
********************** CPU Info ********************** 
 - 

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2200.210
cache size	: 56320 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch

## Prepare Development Environment

- ### Clone GitHub Repository (if necessary)
- ### Install Conda package manager (if necessary)
- ### Install dependencies

In [2]:
# remove sample data and clone repo
!rm -r sample_data/
!rm -r sports-analytics/
!git clone https://github.com/wyattowalsh/sports-analytics.git

# change directory to directory that contains this notebook
%cd /content/sports-analytics/basketball/notebooks/

# install conda
# !bash sports-analytics/project_resources/bash_scripts/install_conda_in_colab.sh 

# install dependencies
!pip3 install -r ../../dependencies/basketball/data_collection.txt
!pip3 install --upgrade --force-reinstall dask
!pip3 install --upgrade --force-reinstall dask-cuda
!pip3 install 'fsspec>=0.3.3'

# restart kernel
exit()

rm: cannot remove 'sample_data/': No such file or directory
Cloning into 'sports-analytics'...
remote: Enumerating objects: 214, done.[K
remote: Counting objects: 100% (214/214), done.[K
remote: Compressing objects: 100% (151/151), done.[K
remote: Total 214 (delta 74), reused 150 (delta 35), pack-reused 0[K
Receiving objects: 100% (214/214), 87.55 KiB | 1.18 MiB/s, done.
Resolving deltas: 100% (74/74), done.
/content/sports-analytics/basketball/notebooks
Collecting cudf
  Using cached https://files.pythonhosted.org/packages/ee/8f/b8f7eb3c24d1062b419fec0b9e39b98a5954b3d0fc1539b8f830565ff06b/cudf-0.6.1.post1.tar.gz
Building wheels for collected packages: cudf
  Building wheel for cudf (setup.py) ... [?25lerror
[31m  ERROR: Failed building wheel for cudf[0m
[?25h  Running setup.py clean for cudf
Failed to build cudf
Installing collected packages: cudf
    Running setup.py install for cudf ... [?25l[?25herror
[31mERROR: Command errored out with exit status 1: /usr/bin/python3 -u



## Import Dependencies & Initialize Kaggle

In [1]:
# nba_api dependencies
from nba_api.stats.static import players, teams
from nba_api.stats.endpoints import commonplayerinfo, playercareerstats

# datascience stack
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import sqlite3 as sql
import dask
from dask.distributed import Client, progress, LocalCluster
from dask_cuda import LocalCUDACluster

import dask.array as da

# system utility stack
import os
import time
from requests.packages.urllib3.exceptions import ProxyError
import urllib.error
import urllib.request

# Upload kaggle.json to /content/
from google.colab import files
uploaded = files.upload()

# Move and change permissions as needed, allowing for import
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
import kaggle

# change directory to directory that contains this notebook
%cd /content/sports-analytics/basketball/notebooks/

# utilize Colab Monitor
from urllib.request import urlopen
exec(urlopen("http://colab-monitor.smankusors.com/track.py").read())
_colabMonitor = ColabMonitor().start()

Saving kaggle.json to kaggle.json
/content/sports-analytics/basketball/notebooks
Now live at : http://colab-monitor.smankusors.com/60545998c0794


## Collect Data

### Connect to Database

In [2]:
conn = sql.connect('../data/basketball.sqlite')

## Retrieve Players Table, Format (as needed), Add to Database (if needed)

In [3]:
df_players = pd.DataFrame(players.get_players()).astype({'id': 'str'})
df_players

Unnamed: 0,id,full_name,first_name,last_name,is_active
0,76001,Alaa Abdelnaby,Alaa,Abdelnaby,False
1,76002,Zaid Abdul-Aziz,Zaid,Abdul-Aziz,False
2,76003,Kareem Abdul-Jabbar,Kareem,Abdul-Jabbar,False
3,51,Mahmoud Abdul-Rauf,Mahmoud,Abdul-Rauf,False
4,1505,Tariq Abdul-Wahad,Tariq,Abdul-Wahad,False
...,...,...,...,...,...
4496,1627790,Ante Zizic,Ante,Zizic,True
4497,78647,Jim Zoet,Jim,Zoet,False
4498,78648,Bill Zopf,Bill,Zopf,False
4499,1627826,Ivica Zubac,Ivica,Zubac,True


In [4]:
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4501 entries, 0 to 4500
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          4501 non-null   object
 1   full_name   4501 non-null   object
 2   first_name  4501 non-null   object
 3   last_name   4501 non-null   object
 4   is_active   4501 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 145.2+ KB


In [5]:
df_players.to_sql('Player', conn)

ValueError: ignored


## Retrieve Teams Table, Format (as needed), Add to Database (if needed)

In [3]:
df_teams = pd.DataFrame(teams.get_teams()).astype({'id': 'str'})
df_teams['year_founded'] =  pd.to_datetime(df_teams['year_founded'], format='%Y').dt.year # convert year to datetime type
df_teams.head()

Unnamed: 0,id,full_name,abbreviation,nickname,city,state,year_founded
0,1610612737,Atlanta Hawks,ATL,Hawks,Atlanta,Atlanta,1949
1,1610612738,Boston Celtics,BOS,Celtics,Boston,Massachusetts,1946
2,1610612739,Cleveland Cavaliers,CLE,Cavaliers,Cleveland,Ohio,1970
3,1610612740,New Orleans Pelicans,NOP,Pelicans,New Orleans,Louisiana,2002
4,1610612741,Chicago Bulls,CHI,Bulls,Chicago,Illinois,1966


In [7]:
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            30 non-null     object
 1   full_name     30 non-null     object
 2   abbreviation  30 non-null     object
 3   nickname      30 non-null     object
 4   city          30 non-null     object
 5   state         30 non-null     object
 6   year_founded  30 non-null     int64 
dtypes: int64(1), object(6)
memory usage: 1.8+ KB


In [8]:
df_teams.to_sql('Team', conn)

ValueError: ignored

## Get Proxy Server Addresses

### Define Function to Scrape New Proxy List and Return Proxies Tested to be Alive

In [2]:
def get_proxies():
    !wget -O http_proxies.txt "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=5000&country=all&ssl=yes&anonymity=all&simplified=true"

    with open('http_proxies.txt', 'r') as file:
        proxies = file.read().split('\n')
    print("Original number of proxies: ", len(proxies))

    def check_proxies(proxy):
        try:
            urllib.request.urlopen("http://" + proxy, timeout = 15)
            print("alive proxy detected")
        except:
            return proxy

    dead_proxies = []
    for proxy in proxies:
        dead_proxy = dask.delayed(check_proxies)(proxy)
        dead_proxies.append(dead_proxy)

    dead_proxies = dask.persist(*dead_proxies)
    dead_proxies = list(filter(None, dask.compute(dead_proxies))) 

    [proxies.remove(proxy) for proxy in dead_proxies if proxy in proxies]
    if "" in proxies:
        proxies.remove("")
    print("Number of proxies alive: ", len(proxies))
    return proxies

### Create Dask Cluster with the Number of Workers Equal to the Number of CPU Cores

In [5]:
# Make sure to put appropiate number of workers given info provided in the output of the first cell
cluster = LocalCluster(n_workers=4) 
c = Client(cluster)
c

0,1
Client  Scheduler: tcp://127.0.0.1:35003  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 4  Cores: 4  Memory: 27.39 GB


## Process `get_proxies()` with Dask then Shutdown Cluster

In [6]:
proxies = get_proxies()
with open('valid_proxies.txt', 'w') as f:
    for proxy in proxies:
        f.write("%s\n" % proxy)
c.shutdown()
proxies

--2021-03-19 07:17:10--  https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=5000&country=all&ssl=yes&anonymity=all&simplified=true
Resolving api.proxyscrape.com (api.proxyscrape.com)... 151.139.128.11
Connecting to api.proxyscrape.com (api.proxyscrape.com)|151.139.128.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1410 (1.4K) [text/plain]
Saving to: ‘http_proxies.txt’


2021-03-19 07:17:10 (33.3 KB/s) - ‘http_proxies.txt’ saved [1410/1410]

Original number of proxies:  70
Number of proxies alive:  69


['165.232.167.184:8080',
 '51.222.150.4:3128',
 '198.50.163.192:3129',
 '62.210.203.211:8080',
 '45.92.94.83:8118',
 '167.172.109.12:33892',
 '51.158.172.165:8811',
 '157.230.103.189:46347',
 '176.113.73.104:3128',
 '62.23.15.92:6666',
 '51.158.172.165:8761',
 '190.112.136.169:8085',
 '192.99.144.208:8080',
 '165.232.169.223:8080',
 '62.171.144.29:3128',
 '139.255.74.125:8080',
 '68.183.8.103:8080',
 '157.230.103.189:40027',
 '146.59.237.207:3128',
 '138.68.141.150:8080',
 '103.11.106.218:8181',
 '3.114.241.246:8080',
 '94.242.59.20:3128',
 '103.24.126.179:83',
 '157.230.103.189:33208',
 '64.225.26.142:8080',
 '157.230.103.189:38450',
 '95.217.238.89:3128',
 '176.113.73.102:3128',
 '51.222.150.3:3128',
 '176.9.85.13:3128',
 '192.99.92.249:3128',
 '51.75.147.44:3128',
 '157.230.103.189:35142',
 '176.113.73.101:3128',
 '200.155.139.242:3128',
 '13.212.32.171:80',
 '64.227.6.108:3127',
 '35.236.167.177:3128',
 '188.166.125.206:45135',
 '94.130.179.24:8015',
 '201.49.58.234:80',
 '158.176.

## Get Common Player Information

### Define Functions `get_quick_proxies()` & `get_common_player_info()`

Each function utilizes a ***Dask*** cluster. 

`get_quick_proxies()` gets a list of proxies (tested to be alive) more quickly than the function above. This function is used in the case that all proxies found from the above function fail to return responses from stats.nba.com. 

`get_common_player_info()` returns dataframe of common player infomation for a certain player. The paradigm here is to distribute jobs (where each job is collecting common player info for a certain player) across a ***Dask*** cluster since all outputs will be the same and can be easily be concatenated. 

In [3]:
def get_quick_proxies():
    !wget -O http_proxies.txt "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=3500&country=all&ssl=yes&anonymity=all&simplified=true"

    with open('http_proxies.txt', 'r') as file:
        proxies = file.read().split('\n')
    print("Original number of proxies: ", len(proxies))

    def check_proxies(proxy):
        try:
            urllib.request.urlopen("http://" + proxy, timeout = 15)
            print("alive proxy detected")
        except:
            return proxy

    dead_proxies = []
    for proxy in proxies:
        dead_proxy = check_proxies(proxy)
        dead_proxies.append(dead_proxy)

    [proxies.remove(proxy) for proxy in dead_proxies if proxy in proxies]
    if "" in proxies:
        proxies.remove("")
    print("Number of proxies alive: ", len(proxies))
    return proxies


def get_common_player_info(player_id):
  with open('valid_proxies.txt', 'r') as file:
    proxies = file.read().split('\n')
  res_dfs = []
  i = 0
  # while response is empty
  while len(res_dfs) <= 0: 
    # try the request without a proxy
    try:
      res_dfs = commonplayerinfo.CommonPlayerInfo(player_id=player_id, timeout=5).get_data_frames()
      res_df = pd.merge(res_dfs[0], res_dfs[1], how='left', left_on=['PERSON_ID', 'DISPLAY_FIRST_LAST'], right_on=['PLAYER_ID', 'PLAYER_NAME'])
      res_df = res_df.drop(['TimeFrame'], axis=1)
      print("******* SUCCESS ******* \n ******* {} ******* \n".format(player_id))
      return res_df
    # if still fails, then try with proxy
    except:
      try:
        res_dfs = commonplayerinfo.CommonPlayerInfo(player_id=player_id, timeout=10).get_data_frames()
        res_df = pd.merge(res_dfs[0], res_dfs[1], how='left', left_on=['PERSON_ID', 'DISPLAY_FIRST_LAST'], right_on=['PLAYER_ID', 'PLAYER_NAME'])
        res_df = res_df.drop(['TimeFrame'], axis=1)
        print("******* SUCCESS ******* \n ******* {} ******* \n".format(player_id))
        return res_df
      # if still fails, move on to next proxy, unless out of proxies
      except:
        if (i + 1) < len(proxies):
          i = i + 1
        # if out of proxies, restart counter and get new proxies
        else:
          print("******* FAILURE ****** \n ****** {} ******* \n ******* COLLECTING NEW PROXIES AND TRYING REQUEST AGAIN ******".format(player_id))
          i = 0
          proxies = get_quick_proxies()
      

### Extract Common Player Information for all Players

In [None]:
# Make sure to put appropiate number of workers given info provided in the output of the first cell
def main():
    cluster = LocalCluster(n_workers=4) 
    with Client(address=cluster):
      conn = sql.connect('../data/basketball.sqlite')
      player_ids = pd.read_sql('SELECT id FROM Player', conn).values #pd.DataFrame(players.get_players()).astype({'id': 'str'})['id'].values

      dfs = []
      for player_id in player_ids:
        df = dask.delayed(get_common_player_info(player_id))
        dfs.append(df)

      dfs = dask.persist(*dfs)
      dfs = dask.compute(dfs)
      dfs = dask.dataframe.multi.concat(dfs)
      return dfs

if __name__ == "__main__":
  common_player_info_dfs = main()

common_player_info_dfs.head()

In [25]:
c.shutdown()

In [None]:
common_player_info_df