# Abstract

I am acting as an NBA consultant. Using each player's performance over the three seasons before he became a free agent, together with the contract he ended up signing (calculated per year), I want to predict the per-year contract value that this year's free agents are likely to command (the 2018-19 season is only 1-2 games from being over). Teams pursuing free agents in the summer of 2019 can use these predictions to decide whom to make their main target.

# Obtain the Data

*Describe your data sources here and explain why they are relevant to the problem you are trying to solve.*

First, I will scrape data from basketball-reference.com that has players' individual statistics per season, from the 2008-09 through the 2017-18 seasons. This will contain various player statistics that I can either use as my features or

I will also scrape the free agent lists for the 2011 through 2018 seasons, which will help me filter out players who were not impending free agents. These lists, along with the salary each player signed for, are available on spotrac.com. This data will supply my target, "y".
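
As a rough illustration of how "y" could be derived, the sketch below converts a total contract value into a salary per year. The column names `Player`, `Yrs`, and `Dollars` and the sample rows are hypothetical; the actual spotrac.com table may be laid out differently.

```python
import pandas as pd

# Hypothetical free-agent rows; real spotrac.com column names may differ.
free_agents = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Yrs": [4, 2],
    "Dollars": ["$80,000,000", "$15,000,000"],
})

def add_salary_per_year(df):
    """Return a copy of df with a numeric per-year salary column (the target "y")."""
    out = df.copy()
    # Strip "$" and "," so the contract total can be parsed as a number.
    total = out["Dollars"].str.replace(r"[$,]", "", regex=True).astype(float)
    out["salary_per_year"] = total / out["Yrs"]
    return out

targets = add_salary_per_year(free_agents)
print(targets[["Player", "salary_per_year"]])
```

Dividing by contract length puts short and long deals on the same scale, which is what "calculated per year" in the abstract calls for.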

*After completing this step, be sure to edit `references/data_dictionary` to include descriptions of where you obtained your data and what information it contains.*

In [3]:
%%writefile ../src/data/make_dataset.py

# imports
import pandas as pd
import time
from datetime import datetime
import os
import pickle

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException       
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

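def check_exists(driver,classname):
    """Return True if an element with the given class name is on the page."""
    try:
        # find_element(By...) works in both Selenium 3 and Selenium 4
        driver.find_element(By.CLASS_NAME, classname)
    except NoSuchElementException:
        return False
    return True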

def initialize_selenium(URL):
    # initialize selenium
    chromedriver = "/Applications/chromedriver" 
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)
    driver.get(URL)
    
    return driver  

# Generate dictionary to store our data per year
def data_to_dict(years):
    """
    Generate the year labels and an empty dictionary that will store our
    data per year in this format:
    
    Key (Year): Value (Data)
    
    years: int controlling how far back the labels go. Labels run from last
    calendar year down to (current year - years + 1), so years - 1 labels
    are produced.
    """
    data = {}
    CURRENT_YEAR = datetime.now().year
    years_label = range(CURRENT_YEAR-1,CURRENT_YEAR-years,-1)
    
    return years_label, data
    
def download_salary_data(URL,years):
    
    years_label, data = data_to_dict(years)
    driver = initialize_selenium(URL)

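    for i in years_label:
        time.sleep(2)
        df = pd.read_html(driver.current_url)[0]
        # page through the carousel until the "next" arrow is disabled
        while not check_exists(driver,'jcarousel-next-disabled'):
            next_button = driver.find_element(By.CLASS_NAME, 'jcarousel-next')
            next_button.click()
            time.sleep(5)
            temp = pd.read_html(driver.current_url)[0]
            # DataFrame.append was removed in pandas 2.0; concat is equivalent here
            df = pd.concat([df, temp], ignore_index=True)
        # select the (i-1)-(i) season in the dropdown for the next iteration
        year_select = Select(driver.find_element(By.CLASS_NAME, 'tablesm'))
        year_select.select_by_visible_text(str(i-1)+"-"+str(i))
        data[i]=df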
    
    driver.quit()
    
    return data

def download_rookie_data(URL, years):
    
    years_label, data = data_to_dict(years)
    driver = initialize_selenium(URL)
    wait = WebDriverWait(driver, 10)
    
    for i in years_label:
        df = pd.read_html(driver.current_url)[0]
        df.columns=df.columns.droplevel()
        df = df[['Player']]
        data[i]=df
        prev_year = driver.find_element(By.CSS_SELECTOR, "a.button2.prev")
        prev_year.click()
        time.sleep(10)
    
    driver.quit()
    
    return data
    
    
def download_player_data(URL, years, type_data):
    
    years_label, data = data_to_dict(years)
    driver = initialize_selenium(URL)
    wait = WebDriverWait(driver, 10)
    
    # get to the current season stats, this may have changed
    tab = driver.find_elements(By.ID, "header_leagues")
    hover = ActionChains(driver).move_to_element(tab[0])
    hover.perform()
    wait.until(EC.visibility_of_element_located((By.LINK_TEXT, type_data))).click()
    
    for i in years_label:
        df = pd.read_html(driver.current_url)[0]
        df = df[df.Rk != 'Rk']
        data[i]=df
        prev_year = driver.find_element(By.CSS_SELECTOR, "a.button2.prev")
        prev_year.click()
        time.sleep(10)
    
    driver.quit()
    
    return data

def download_fa_data(URL):
    # this function takes no `years` argument, so start from an empty dict
    data = {}
    driver = initialize_selenium(URL)

    for i in range(2018,2010,-1):
        year_select = Select(driver.find_element(By.NAME, 'year'))
        year_select.select_by_visible_text(str(i))
        submit = driver.find_element(By.CLASS_NAME, 'go')
        submit.click()
        time.sleep(10)
        df = pd.read_html(driver.current_url)[0]
        data[i]=df
    
    driver.quit()
    
    return data

def save_dataset(data,filename):
    with open(filename, 'wb') as w:
        pickle.dump(data,w)
        
def run():
    """
    Executes a set of helper functions that download data from one or more sources
    and saves those datasets to the data/raw directory.
    """
    data_fa = download_fa_data("https://www.spotrac.com/nba/free-agents/")
    data_reg = download_player_data("https://www.basketball-reference.com", 12, "Per G")
    data_adv = download_player_data("https://www.basketball-reference.com", 12, "Advanced")
    data_salary = download_salary_data("http://www.espn.com/nba/salaries/_/year/2019", 12)
    data_rookie = download_rookie_data("https://www.basketball-reference.com/leagues/NBA_2018_rookies.html", 12)
    save_dataset(data_fa, "data/raw/datafa.pickle")
    save_dataset(data_reg, "data/raw/regstats.pickle")
    save_dataset(data_adv, "data/raw/advstats.pickle")
    save_dataset(data_salary, "data/raw/salaries.pickle")
    save_dataset(data_rookie, "data/raw/rookies.pickle")

Overwriting ../src/data/make_dataset.py


# Scrub the Data

*Look through the raw data files and see what you will need to do to them in order to have a workable data set. If your source data is already well-formatted, you may want to ask yourself why it hasn't already been analyzed and what other people may have overlooked when they were working on it. Are there other data sources that might give you more insights on some of the data you have here?*

*The end goal of this step is to produce a [design matrix](https://en.wikipedia.org/wiki/Design_matrix), containing one column for every variable that you are modeling, including a column for the outputs, and one row for every observation in your data set. It needs to be in a format that won't cause any problems as you visualize and model your data.*
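
One possible way to get from per-season statistics to such a design matrix is sketched below: each free agent's statistics are averaged over the three seasons up to and including his free-agency year, then joined to the target. The column names (`Season`, `fa_year`, `PTS`, `salary_per_year`), the sample values, and the assumption that `fa_year` equals the end year of the player's final season are all hypothetical, not taken from the scraped data.

```python
import pandas as pd

# Hypothetical per-season stats, keyed by the year each season ended.
stats = pd.DataFrame({
    "Player": ["Player A"] * 4 + ["Player B"] * 3,
    "Season": [2015, 2016, 2017, 2018, 2016, 2017, 2018],
    "PTS": [10.0, 12.0, 14.0, 16.0, 8.0, 6.0, 7.0],
})
# Hypothetical free-agent list with the per-year salary target.
free_agents = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "fa_year": [2018, 2018],
    "salary_per_year": [20_000_000.0, 7_500_000.0],
})

def build_design_matrix(stats, free_agents, window=3):
    """One row per free agent: windowed stat averages plus the target column."""
    merged = stats.merge(free_agents[["Player", "fa_year"]], on="Player")
    # Keep only the `window` seasons ending at the free-agency year.
    in_window = (merged["Season"] <= merged["fa_year"]) & (
        merged["Season"] > merged["fa_year"] - window
    )
    features = (
        merged[in_window]
        .groupby(["Player", "fa_year"], as_index=False)["PTS"]
        .mean()
        .rename(columns={"PTS": "PTS_3yr_avg"})
    )
    return features.merge(free_agents, on=["Player", "fa_year"])

design = build_design_matrix(stats, free_agents)
print(design)
```

Only players who appear in the free-agent list survive the first merge, which is how the list filters out non-impending free agents.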

In [8]:
## %%writefile ../src/features/build_features.py

# imports
import re
import os
import pickle
import pandas as pd
import numpy as np
from functools import reduce

# helper functions go here
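# Each loader currently returns the raw data; cleaning steps are still to be written.
def clean_salaries_dataset(path, filename):
    with open(os.path.join(path, filename), "rb") as reader:
        money = pickle.load(reader)
    return money

def clean_stats_dataset(path, filename1, filename2):
    # filename1 holds the advanced stats, filename2 the per-game stats
    with open(os.path.join(path, filename2), "rb") as reader:
        stats = pickle.load(reader)
    with open(os.path.join(path, filename1), "rb") as reader:
        advs = pickle.load(reader)
    return stats, advs

def clean_rookie_dataset(path, filename):
    with open(os.path.join(path, filename), "rb") as reader:
        rookies = pickle.load(reader)
    return rookies
    
def clean_fa_dataset(path, filename):
    with open(os.path.join(path, filename), "rb") as reader:
        freeagents = pickle.load(reader)
    return freeagents
    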
    
def build_dataset(salaries, stats, rookies, freeagents):
    # TODO: combine the cleaned datasets into one table; placeholder until then
    data = None
    return data

def build_features(data):
    
    return data

def save_features(data,filename):
    with open(filename,"wb") as writer:
        pickle.dump(data,writer)

def run():
    """
    Executes a set of helper functions that read files from data/raw, cleans them,
    and converts the data into a design matrix that is ready for modeling.
    """
    salaries = clean_salaries_dataset('data/raw', "salaries.pickle")
    stats = clean_stats_dataset('data/raw', "advstats.pickle", "regstats.pickle")
    rookies = clean_rookie_dataset('data/raw','rookies.pickle')
    freeagents = clean_fa_dataset('data/raw','datafa.pickle')
    
    full_data = build_dataset(salaries, stats, rookies, freeagents)
    
    features = build_features(full_data)
    
    # save_features(features,'data/processed/data.pickle')


*Before moving on to exploratory analysis, write down some notes about challenges encountered while working with this data that might be helpful for anyone else (including yourself) who may work through this later on.*

# Explore the Data

*Before you start exploring the data, write out your thought process about what you're looking for and what you expect to find. Take a minute to confirm that your plan actually makes sense.*

*Calculate summary statistics and plot some charts to give you an idea what types of useful relationships might be in your dataset. Use these insights to go back and download additional data or engineer new features if necessary. Not now though... remember we're still just trying to finish the MVP!*
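
A first pass at summary statistics could look like the sketch below, which describes each numeric column and ranks features by their correlation with the target. The frame and its columns (`PTS_3yr_avg`, `AGE`, `salary_per_year`) are made-up stand-ins for the real design matrix.

```python
import pandas as pd

# Made-up design matrix; the real one comes from data/processed.
design = pd.DataFrame({
    "PTS_3yr_avg": [5.0, 10.0, 15.0, 20.0],
    "AGE": [30, 24, 28, 26],
    "salary_per_year": [5.0e6, 10.0e6, 15.0e6, 20.0e6],
})

# Restrict to numeric columns so describe() and corr() behave the same
# regardless of any text columns (such as player names) in the real data.
numeric = design.select_dtypes("number")

summary = numeric.describe()
corr_with_target = (
    numeric.corr()["salary_per_year"]
    .drop("salary_per_year")
    .sort_values(ascending=False)
)

print(summary)
print(corr_with_target)
```

Features with a correlation near zero are candidates to drop, while strongly correlated pairs of features may point to redundancy worth checking before modeling.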

In [None]:
%%writefile ../src/visualization/visualize.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from data/processed,
    calculates descriptive statistics for the population, and plots charts
    that visualize interesting relationships between features.
    """
    # data = load_features('data/processed')
    # describe_features(data, 'reports/')
    # generate_charts(data, 'reports/figures/')
    pass


*What did you learn? What relationships do you think will be most helpful as you build your model?*

# Model the Data

*Describe the algorithm or algorithms that you plan to use to train with your data. How do these algorithms work? Why are they good choices for this data and problem space?*

In [None]:
## %%writefile ../src/models/train_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from data/processed,
    split them into training and test sets, fit a model on the training set,
    and save the trained model to the models directory.
    """
    # data = load_features('data/processed/')
    # train, test = train_test_split(data)
    # save_train_test(train, test, 'data/processed/')
    # model = build_model()
    # model.fit(train)
    # save_model(model, 'models/')
    pass
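
The commented steps in `run()` above (split, build, fit) could be prototyped along the following lines. This is only a baseline sketch: ordinary least squares via `numpy.linalg.lstsq` stands in for whatever model is eventually chosen, and the synthetic, noiseless data (salary in millions as a linear function of points per game) is an assumption made purely so the result is easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 40 players, one feature, noiseless linear target.
pts = rng.uniform(2, 30, size=40)
salary_millions = 1.0 + 0.8 * pts

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(pts), pts])

# Shuffle the row indices, then hold out 25% of the rows for testing.
idx = rng.permutation(len(pts))
train_idx, test_idx = idx[:30], idx[30:]

# Fit ordinary least squares on the training rows only.
coef, *_ = np.linalg.lstsq(X[train_idx], salary_millions[train_idx], rcond=None)

# Predict on the held-out rows.
predictions = X[test_idx] @ coef
print("intercept, slope:", coef)
```

With real data the fit will not be exact, so the held-out predictions are what should be compared against the true salaries.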


In [None]:
## %%writefile ../src/models/predict_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that load the test data from
    data/processed and the trained model from models, generate predictions,
    evaluate them against the true values, and save the metrics to reports.
    """
    # test_X, test_y = load_test_data('data/processed')
    # trained_model = load_model('models/')
    # predictions = trained_model.predict(test_X)
    # metrics = evaluate(test_y, predictions)
    # save_metrics('reports/')
    pass
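
The `evaluate` step named in the comments above is not defined anywhere yet. A minimal sketch, assuming mean absolute error and root mean squared error are the metrics of interest (the notebook does not specify any), might be:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return MAE and RMSE between true and predicted values."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }

# Toy values in millions of dollars per year.
metrics = evaluate([10.0, 20.0, 5.0], [12.0, 17.0, 5.0])
print(metrics)
```

MAE reads directly as "dollars off per player on average", while RMSE penalizes large misses more heavily, which matters if a few max-contract players dominate the errors.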



_Write down any thoughts you may have about working with these algorithms on this data. What other ideas do you want to try out as you iterate on this pipeline?_

# Interpret the Model

_Write up the things you learned, and how well your model performed. Be sure to address the model's strengths and weaknesses. What types of data does it handle well? What types of observations tend to give it a hard time? What future work might you, or someone reading this, want to do, building on the lessons learned and tools developed in this project?_