### Yonatan Klausner, Jonathan Rawson, and Yair Fax 
## Final Project - HustleBall

The Moneyball theory is fairly simple: using statistical analysis, small-market teams (i.e., teams that do not spend a lot of money) can compete by buying assets that are undervalued by other teams and selling ones that are overvalued by other teams.
The key to Moneyball is figuring out how to know when a player is "undervalued". Here is a [link](http://grantland.com/features/the-economics-moneyball/) to read more about Moneyball.

For our project, we will be looking at statistics in the National Basketball Association (NBA). 
Among other statistics, we will mainly be looking at statistics that involve some sort of hustling. 
These statistics include screen assists, deflections, loose balls recovered, charges drawn, and contested shots, which we will explain in detail below. 
We would like to find some sort of correlation between a hustle score that we calculate ourselves and other statistics. Ideally, our findings will help teams target specific players based on their hustle scores. 

In [2]:
# Imports needed for project 
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import pandas as pd
import numpy as np
import pylab as pl
%matplotlib inline
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt
figsize(14, 7)
from sklearn.linear_model import LinearRegression
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# hustle stat - average mph, miles run, balls deflected, contested shots, stats per minute (steals, o rebounds, feet), 
# jump balls, Loose balls recovered, boxouts, screen assists, charges, team, average salary, all nba or all star teams 
# https://stats.nba.com/players/speed-distance/?sort=AVG_SPEED&dir=1

# To do:
# 1) change the way calculate hustle score 
#        a) add in distance and speed (https://stats.nba.com/players/speed-distance/?sort=DIST_MILES_DEF&dir=1&Season=2017-18&SeasonType=Regular%20Season)
# 2) change up scatter plots 
# 3) sort by position
# 4) trends over time - playoff teams, non-playoff teams, championship teams
# 5) predict current season outcomes
# 6) more detailed comparisons with wins
# 7) look at specific player hustle in a team as correlated with wins
# 8) ML 

# Problems:
# 1) Jrue Holiday 


# Reminders: 
# 1) GP >= 40 make sure not for current season



## Part 1: Get Data
In this section we get all the data that we will be using for our project.  
In the end, we will have five preliminary tables which we will explain in depth later on.

### Step 1: Define Functions for Getting Different Data

These functions get statistics from NBA.com. We use an automated browser ([link](https://webkit.org/blog/6900/webdriver-support-in-safari-10/) to read more) instead of a simple GET request because the HTML returned by the request is not fully populated; scripts on the page fill in the table's data after it loads. We run JavaScript in the browser to add ids to elements of the page so that we can find and "click" them to display the full table. We then pull the HTML from the page and pass it to Pandas to create a dataframe.

In [None]:
def getHustleStats(year):
    return getYearStats("https://stats.nba.com/players/hustle/?sort=MIN&dir=-1&Season=" + year + "&SeasonType=Regular%20Season", year)

def getRegularStats(year):
    return getYearStats("https://stats.nba.com/leaders/?Season=" + year + "&SeasonType=Regular%20Season", year)

def getYearStats(url, year):
    # Use an automated browser so that the webpage is rendered properly
    browser = webdriver.Safari()
    browser.get(url)
    
    # Edit HTML so that we can get the whole table
    browser.execute_script('document.getElementsByClassName("stats-table-pagination__select")[0].setAttribute("id", "btn")')
    browser.execute_script('document.getElementById("btn").children[0].setAttribute("id", "select-all")')
    # find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4 removed
    nextButton = browser.find_element(By.ID, 'btn')
    allButton = browser.find_element(By.ID, 'select-all')
    
    # Click on buttons to get whole table
    nextButton.click()
    allButton.click()
    
    # Get HTML and parse
    innerHTML = browser.execute_script("return document.body.innerHTML")
    root = BeautifulSoup(innerHTML, "lxml")
    table = pd.read_html(str(root.find("table")))
    table = table[0]
    
    # Add year column for later table merging
    table['year'] = year
    browser.close()
    return table

This function uses a simple get request for a specific year to get data about players' positions.

In [None]:
def getPositions(year):
    url = "https://www.basketball-reference.com/leagues/NBA_" + str(year) + "_per_game.html"
    r = requests.get(url)
    root = BeautifulSoup(r.content, "lxml")
    data = root.find("table")
    positions_table = pd.read_html(str(data))[0]
    positions_table['year'] = str(year - 1) + "-" + str(year - 2000)
    return positions_table
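
The year label built in `getPositions` has to match the `"2016-17"` style strings passed to the NBA.com functions, since the tables are later merged on `year`. A minimal sketch of that formatting is below; `season_label` is a hypothetical helper, not part of the notebook, and it zero-pads the suffix, which the expression above does not need for 2017 and 2018.

```python
def season_label(end_year):
    """Format the season that ends in `end_year`, e.g. 2017 -> '2016-17'."""
    start_year = end_year - 1
    # Zero-pad the two-digit suffix so that 2001 gives '2000-01', not '2000-1'
    return "{}-{:02d}".format(start_year, end_year % 100)

print(season_label(2017))
print(season_label(2018))
```

For the two seasons used here, this gives the same labels as `str(year - 1) + "-" + str(year - 2000)`.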

These functions read simple csv files stored on disk.

In [None]:
def getTeams(filename, year):
    teamTable = pd.read_csv(filename)
    teamTable['year'] = year
    return teamTable

def getSalaries(filename):
    salaryTable = pd.read_csv(filename)
    return salaryTable

### Step 2: Get Hustle Data on Players
This data consists of several statistics that are related to player hustling (see where we got the data from [here](https://stats.nba.com/players/hustle/?sort=MIN&dir=1&Season=2016-17&SeasonType=Regular%20Season)).  The statistics we will use from this data are:  

1. **Screen assists** - a screen is a blocking move by an offensive player, by standing beside or behind a defender, to free a teammate to shoot, receive a pass, or drive in to score. A screen assist is a screen that directly leads to a made field goal.  

2. **Deflections** - a deflection occurs when a defensive player redirects the intended direction of the ball.

3. **Loose Balls Recovered** - a loose ball is when neither team is in control of the ball.  A loose ball recovered is when a player recovers a loose ball.   

4. **Charges Drawn** - a charge, or player-control foul, occurs when a dribbler charges into a defender who has already established his position.  A drawn charge is when a defender takes a charge on an offensive player, meaning the defensive player is in position and the offensive player charges into him.  

5. **Contested Shots** - a contested shot is when a defender is within 4 ft of the person shooting.   

This function is used in cleaning our data. We remove periods and commas as well as some specific suffixes ("Jr.", "II", "III", and "IV"). We do this to make sure that different tables have the same values for players' names. For example, we want C.J. Watson to match CJ Watson, John Lucas III to match John Lucas, and Kelly Oubre Jr. to match Kelly Oubre.

In [None]:
def stripChars(series):
    for token in ['Jr.', '.', ',', 'III', 'II', 'IV']:  # literal tokens, not regex patterns
        series = series.str.replace(token, '', regex=False)
    return series.str.rstrip()
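
To see what this normalization does to individual names, here is a plain-Python sketch that applies the same literal replacements, in the same order, to ordinary strings. `normalize_name` is a hypothetical stand-in for `stripChars` that works on one string instead of a Pandas series.

```python
def normalize_name(name):
    # Same tokens, in the same order, as stripChars; each is removed literally
    for token in ("Jr.", ".", ",", "III", "II", "IV"):
        name = name.replace(token, "")
    # Removing a suffix leaves a trailing space behind
    return name.rstrip()

examples = ["C.J. Watson", "John Lucas III", "Kelly Oubre Jr."]
print([normalize_name(n) for n in examples])
```

Note that the tokens are removed wherever they appear in the string, not only at the end, so a name containing an uppercase "II" or "IV" elsewhere would also be altered.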

In [None]:
hustle16_17 = getHustleStats("2016-17") 
hustle17_18 = getHustleStats("2017-18")

In [None]:
hustleStats = pd.concat([hustle16_17, hustle17_18])  # DataFrame.append was removed in pandas 2.0
hustleStats['Player'] = stripChars(hustleStats['Player'])
hustleStats.head()

### Step 3: Get Data for Regular Statistics on Players 
This data consists of several common player statistics such as **MIN** (minutes per game), **PTS** (points per game), **FGM** (field goals made per game), **REB** (rebounds per game), etc. (see where we got the data from [here](https://stats.nba.com/leaders/?Season=2016-17&SeasonType=Regular%20Season)).

In [None]:
regular16_17 = getRegularStats("2016-17") 
regular17_18 = getRegularStats("2017-18")
regularStats = pd.concat([regular16_17, regular17_18])  # DataFrame.append was removed in pandas 2.0
regularStats.head()

### Step 4: Get Data for Positions of Players
This data consists of the positions for each player.  The five positions are:
1. **PG** - point guard
2. **SG** - shooting guard
3. **SF** - small forward 
4. **PF** - power forward
5. **C** - center

Typically, the **point guard** is the leader of the team when on the court. Being the point guard requires polished ball handling skills and the ability to facilitate the team during a play. The **shooting guard** is generally the best shooter and is usually capable of shooting from farther distances. Generally, they also have good ball-handling skills. The **small forward** often has an aggressive approach to the basket on offense. The **power forward** and the center are usually called "low post" players. They often act as their team's primary rebounders and shot blockers, and also generally get the ball closer to the hoop to take inside shots. The **center** is typically the larger of the two. To read more about the different positions in basketball, please click this [link](https://en.wikipedia.org/wiki/Basketball_positions).

In [None]:
positions16_17 = getPositions(2017)[['Player', 'Pos', 'year']]
positions17_18 = getPositions(2018)[['Player', 'Pos', 'year']]

In [None]:
positionsTable = pd.concat([positions16_17, positions17_18])  # DataFrame.append was removed in pandas 2.0
positionsTable.head()

### Step 5: Get Data for Team Wins 

This data consists of team statistics such as wins, losses, home record, road record, etc.  
We use this data to get the number of wins for a team in a given year.

In [None]:
teamsWins16_17 = getTeams('teams_2016_2017.csv', '2016-17') 
teamsWins17_18 = getTeams('teams_2017_2018.csv', '2017-18') 
teamsWins16_17.head()

### Step 6: Get Data for Salaries of Players

This data consists of player salaries ([link](https://data.world/datadavis/nba-salaries)). 

In [None]:
salaryTable = getSalaries('player_salaries.csv')
salaryTable.head()

## Part 2: Clean Data
After getting all the data needed for our project, we need to clean the data so that we are able to use each table easily (short, informative [read](https://medium.com/datadriveninvestor/data-cleaning-for-data-scientist-363fbbf87e5f)).   
First, we merge the hustleStats table and the positionsTable on the player and year columns in order to add the positions to the hustleStats table.  
We then print out any players in the merged table with a null position. 

In [None]:
hustleStatsTemp = pd.merge(hustleStats, positionsTable, on=['Player', 'year'], how='left')
hustleStatsTemp[pd.isnull(hustleStatsTemp['Pos'])]
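
A left merge keeps every row of the left table and leaves the right table's columns null when no matching key exists, which is why mismatched names show up as null positions. The idea can be sketched in plain Python as a set difference over `(Player, year)` keys. The keys below are made up for illustration, and `unmatched_keys` is a hypothetical helper, not part of the notebook.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys in the left table with no counterpart in the right table."""
    return sorted(set(left_keys) - set(right_keys))

# Illustrative (Player, year) keys; the real tables hold many more rows
hustle_keys = [("Nene", "2016-17"), ("CJ Watson", "2016-17"), ("Kelly Oubre", "2017-18")]
position_keys = [("Nene Hilario", "2016-17"), ("CJ Watson", "2016-17"), ("Kelly Oubre", "2017-18")]

print(unmatched_keys(hustle_keys, position_keys))
```

Any key listed here would end up with a null `Pos` after a left merge on the same two columns.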

As seen above, there are a few players whose position was not set properly.  
After looking closely into each case, we realized what was going wrong for each of them.
1. For a number of the players above, in one table they will have a roman numeral added to the end of their name (e.g. "II") whereas in the other table it will not have the roman numeral.
2. Some players in one table have suffixes such as "Jr." whereas in the other table they do not have those suffixes. 
3. A number of players will have dots in their names in one table (e.g. D.J.) and not in the other table (e.g. DJ). 
4. Taurean Prince - in one table he was named "Taurean Prince" and in the other table it had "Taurean Waller-Prince".
5. Nene - in one table he was named "Nene" and in the other table it had "Nene Hilario".  

First, we drop all duplicated rows with the same player and year.   
Then we strip certain suffixes and characters from player names, such as "Jr.", "II", and "III", because one table included them and the other did not. As a result, when merging the tables on the player names, the merge treated those players as two different players instead of the same player.  
Next, we take care of the Nene and Taurean Prince cases by renaming them in the table. 

In [None]:
positionsTable = positionsTable.drop_duplicates(subset=['Player', 'year'])
positionsTable['Player'] = stripChars(positionsTable['Player'])
positionsTable.loc[positionsTable['Player'] == 'Nene Hilario', 'Player'] = 'Nene'
positionsTable.loc[positionsTable['Player'] == 'Taurean Waller-Prince', 'Player'] = 'Taurean Prince'
positionsTable.head()

After cleaning the data some more, we then merge the hustleStats table and the positionsTable again.  
We then print the rows in the hustleStats that have a null position. 

In [None]:
hustleStats = pd.merge(hustleStats, positionsTable, on=['Player', 'year'], how='left')
hustleStats[pd.isnull(hustleStats['Pos'])]

In [None]:
hustleStats = hustleStats.dropna(subset=['Pos'])
hustleStats.head()

Next, we clean the hustleStats table. 
1. Drop any players that average fewer than 15 minutes per game. 
2. Drop any players that played fewer than 40 games in the season. 
3. Change the "TEAM" column to be named 'Team' 
4. Change any player with the two positions of SF and SG to just have a position of SF 
5. Drop unused columns 
6. Drop duplicated players within the same year 

In [None]:
hustleStats = hustleStats[hustleStats['MIN'] >= 15] # Delete players who did not average 15 minutes per game 
hustleStats = hustleStats[hustleStats['GP'] >= 40] # Delete players who did not play 40 games 
hustleStats = hustleStats.rename(columns = {'TEAM': 'Team'}) # Rename column 
hustleStats.loc[hustleStats['Pos'] == 'SF-SG', 'Pos'] = 'SF' # Change position of player from multiple positions to one 
# Drop unused columns 
hustleStats = hustleStats.drop(['ScreenAssists PTS', 'OFF Loose BallsRecovered', 'DEF Loose BallsRecovered', '% Loose BallsRecovered OFF', '% Loose BallsRecovered DEF'], axis=1)
hustleStats = hustleStats.drop_duplicates(subset=['Player', 'year'])
hustleStats.head()

In [None]:
teamWinsTable = pd.concat([teamsWins16_17, teamsWins17_18])  # DataFrame.append was removed in pandas 2.0
# Drop unused columns 
teamWinsTable = teamWinsTable.drop(['L', 'Win%', 'GB', 'Conf', 'Div', 'Home', 'Road', 'OT', 'Last 10', 'Streak'], axis=1)
teamWinsTable = teamWinsTable.rename(columns = {'W': 'Wins'}) # Rename column
teamWinsTable.head()

Next we clean the salary table.  
First, we keep only the rows from the 2016-17 and 2017-18 seasons. Then we drop a few columns and remove any duplicates. Lastly, we relabel the values in the year column to match our other tables.  

In [None]:
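# The labels assigned below treat 'year' as the season's starting year (2016 -> '2016-17'),
# so we keep 2016 onward; filtering on >= 2017 would drop every 2016-17 row
salaryTable = salaryTable[salaryTable['year'] >= 2016]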
salaryTable = salaryTable.drop(['team_name', 'season_end', 'team'], axis=1)
salaryTable = salaryTable.drop_duplicates(subset=['Player', 'year'])
salaryTable.loc[salaryTable['year'] == 2016, 'year'] = '2016-17'
salaryTable.loc[salaryTable['year'] == 2017, 'year'] = '2017-18'
salaryTable.head()

Now, we have 5 clean tables:   
1. **hustleStats** - players with different hustle stats
2. **regularStats** - players with normal statistics
3. **positionsTable** - players with their positions 
4. **teamWinsTable** - teams with their wins
5. **salaryTable** - players with their salaries 

## Part 3: Calculate Hustle Score

To calculate our hustle score, we combine screen assists, deflections, loose balls recovered, charges drawn, and contested shots. If we simply summed those statistics for each player, statistics with generally larger values, such as contested shots, would affect the overall hustle score more than the others. Therefore, we convert each statistic to a per-minute rate and normalize it by dividing by the maximum per-minute rate for that statistic across all players. The hustle score is the sum of the five normalized values.  
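
The normalization can be illustrated with a small worked example in plain Python. The two players and their numbers below are made up, only two of the five statistics are used to keep it short, and `hustle_score` is a hypothetical helper rather than the notebook's code.

```python
# Made-up per-game numbers for two players
players = {
    "A": {"MIN": 30.0, "Deflections": 3.0, "ContestedShots": 9.0},
    "B": {"MIN": 20.0, "Deflections": 1.0, "ContestedShots": 8.0},
}
stats = ["Deflections", "ContestedShots"]

# Largest per-minute rate of each statistic across all players
max_rate = {s: max(p[s] / p["MIN"] for p in players.values()) for s in stats}

def hustle_score(player):
    # Each per-minute rate is scaled into [0, 1] by its maximum, then summed
    return sum((player[s] / player["MIN"]) / max_rate[s] for s in stats)

for name, row in players.items():
    print(name, round(hustle_score(row), 2))
```

Player A leads in deflections per minute and player B in contested shots per minute, so each statistic contributes at most 1 to a score, regardless of how large its raw values are.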

In [None]:
maxScreenAssists = (hustleStats['ScreenAssists']/hustleStats['MIN']).max()
maxDeflections = (hustleStats['Deflections']/hustleStats['MIN']).max()
maxLooseBallsRecovered = (hustleStats['Loose BallsRecovered']/hustleStats['MIN']).max()
maxChargesDrawn = (hustleStats['ChargesDrawn']/hustleStats['MIN']).max()
maxContestedShots = (hustleStats['ContestedShots']/hustleStats['MIN']).max()
print("maxScreenAssists per minute: " + str(maxScreenAssists))
print("maxDeflections per minute: " + str(maxDeflections))
print("maxLooseBallsRecovered per minute: " + str(maxLooseBallsRecovered))
print("maxChargesDrawn per minute: " + str(maxChargesDrawn))
print("maxContestedShots per minute: " + str(maxContestedShots))

In [None]:
hustleStats['HustleScore'] = (hustleStats['ScreenAssists']/maxScreenAssists + hustleStats['Deflections']/maxDeflections + hustleStats['Loose BallsRecovered']/maxLooseBallsRecovered + hustleStats['ChargesDrawn']/maxChargesDrawn + hustleStats['ContestedShots']/maxContestedShots)/(hustleStats['MIN'])
hustleStats.head()

In [None]:
# Sort table by HustleScore
hustleStats = hustleStats.sort_values(by=['HustleScore'],ascending=False)
hustleStats.head(5)

## Part 4: Look for Potential Correlation Between HustleScore and Other Parameters

In [None]:
scatter = hustleStats.plot.scatter('AGE', 'HustleScore')

(m, b) = np.polyfit(hustleStats['AGE'], hustleStats['HustleScore'], 1)
# Calculate the actual best fit line 
regression_line = [(m*x)+b for x in hustleStats['AGE']]
scatter.plot(hustleStats['AGE'], regression_line, color='red') 
correlation = np.corrcoef(hustleStats['HustleScore'], hustleStats['AGE'])[0,1]

# Label the plot
scatter.set_title("Distribution of Hustle Scores Across Age (2016-2018) with r = " + str(round(correlation,2)))

# Label the plot
#scatter.set_title("Distribution of Hustle Scores Across Age (2016-2018) with (m, b) = (" + str(round(m,2)) + ", " + str(round(b, 2)) + ")")
scatter.set_xlabel("Age")
scatter.set_ylabel("Hustle Scores")

The above plot shows a slight negative correlation between age and hustle scores.  That is to say that as a player's age increases, a player's hustle score generally decreases slightly. 
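
For reference, the slope, intercept, and r value reported in these plots are the ordinary least-squares line and the Pearson correlation coefficient. The sketch below computes them by hand on made-up numbers; `fit_line_and_r` is a hypothetical helper meant to show what a degree-1 `np.polyfit` and `np.corrcoef` return, not a replacement for them.

```python
from math import sqrt

def fit_line_and_r(xs, ys):
    """Return (slope, intercept, Pearson r) for paired samples xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Made-up ages and hustle scores
ages = [20, 25, 30, 35]
scores = [2.0, 1.8, 1.7, 1.5]
m, b, r = fit_line_and_r(ages, scores)
print(round(m, 3), round(b, 2), round(r, 2))
```

The sign of r always matches the sign of the slope, while its magnitude describes how tightly the points cluster around the line rather than how steep the line is.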

In [None]:
scatter = hustleStats.plot.scatter('MIN', 'HustleScore')

(m, b) = np.polyfit(hustleStats['MIN'], hustleStats['HustleScore'], 1)
# Calculate the actual best fit line 
regression_line = [(m*x)+b for x in hustleStats['MIN']]
scatter.plot(hustleStats['MIN'], regression_line, color='red') 

correlation = np.corrcoef(hustleStats['HustleScore'], hustleStats['MIN'])[0,1]

# Label the plot
scatter.set_title("Distribution of Hustle Scores Across Minutes Per Game (2016-2018) with r = " + str(round(correlation,2)))

# Label the plot
# scatter.set_title("Distribution of Hustle Scores Across Minutes Per Game (2016-2018) with (m, b) = (" + str(round(m,2)) + ", " + str(round(b, 2)) + ")")
scatter.set_xlabel("Minutes Per Game")
scatter.set_ylabel("Hustle Scores")

The above plot shows a slight negative correlation between minutes per game played and hustle scores.  That is to say that as a player's minutes per game increases, a player's hustle score generally decreases slightly. 

In [None]:
teamHustleTable = pd.DataFrame(hustleStats.groupby(['Team', 'year'])['HustleScore'].sum())
teamHustleTable = teamHustleTable.sort_values(by=['HustleScore'],ascending=False)
teamHustleTable = teamHustleTable.reset_index()
teamHustleTable.head()

In [None]:
teamTable = pd.merge(teamHustleTable, teamWinsTable, on=['Team', 'year'])
teamTable.head()

In [None]:
# Convert type of wins column to int32 because cannot be of object type when plotting 
teamTable['Wins'] = teamTable['Wins'].astype('int32')
x = teamTable['HustleScore']
y = teamTable['Wins']
fig, ax = plt.subplots()
ax.scatter(x, y) 

# Get the slope and intercept for the best fit line 
(m, b) = np.polyfit(teamTable['HustleScore'], teamTable['Wins'], 1)
# Calculate the actual best fit line 
regression_line = [(m*x)+b for x in teamTable['HustleScore']]
ax.plot(teamTable['HustleScore'], regression_line, color='red') 

correlation = np.corrcoef(teamTable['HustleScore'], teamTable['Wins'])[0,1]

# Label the plot
ax.set_title("Distribution of Hustle Scores Across Wins (2016-2018) with r = " + str(round(correlation,2)))

# Label the plot
#ax.set_title("Distribution of Hustle Scores Across Team Wins (2016-2018) with (m, b) = (" + str(round(m,2)) + ", " + str(round(b, 2)) + ")")
ax.set_xlabel("Team Hustle Score")
ax.set_ylabel("Team Wins")

The above plot shows a positive correlation between team hustle scores and team wins. That is to say that teams with higher hustle scores generally win more games. 

In [None]:
x = hustleStats['Pos']
y = hustleStats['HustleScore']
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("Position")
ax.set_ylabel("Hustle Score")
ax.set_title("Distribution of Position Across Hustle Score (2016-2018)")

In [None]:
topTeams = teamTable[teamTable['Wins'] >= teamTable['Wins'].median()]
topTeams = topTeams[topTeams['HustleScore'] >= topTeams['HustleScore'].median()]
topTeams.head()

In [None]:
bottomTeams = teamTable[teamTable['Wins'] < teamTable['Wins'].median()]
bottomTeams = bottomTeams[bottomTeams['HustleScore'] < bottomTeams['HustleScore'].median()]
bottomTeams.head()

In [None]:
# Merge on year as well as team so each team-season is matched only with its own players
topTeamsPlayers = pd.merge(topTeams, hustleStats, how='left', on=['Team', 'year']).sort_values('Pos')
x = topTeamsPlayers['Pos']
y = topTeamsPlayers['HustleScore_y']
fig, ax = plt.subplots()
ax.set_yticks(range(1,3,1))
ax.scatter(x, y)

In [None]:
# Merge on year as well as team so each team-season is matched only with its own players
bottomTeamsPlayers = pd.merge(bottomTeams, hustleStats, how='left', on=['Team', 'year']).sort_values('Pos')
x = bottomTeamsPlayers['Pos']
y = bottomTeamsPlayers['HustleScore_y']
fig, ax = plt.subplots()
ax.set_yticks(range(1,3,1))
ax.scatter(x, y) 

### Step 1: Look into correlation between player salaries and player hustle score 

In [None]:
salaryHustle = pd.merge(hustleStats, salaryTable, on=['Player', 'year'], how='inner')
salaryHustle.head()

In [None]:
scatter = salaryHustle.plot.scatter('HustleScore', 'salary')

(m, b) = np.polyfit(salaryHustle['HustleScore'], salaryHustle['salary'], 1)
# Calculate the actual best fit line 
regression_line = [(m*x)+b for x in salaryHustle['HustleScore']]
scatter.plot(salaryHustle['HustleScore'], regression_line, color='red') 
correlation = np.corrcoef(salaryHustle['HustleScore'], salaryHustle['salary'])[0,1]

# Label the plot
scatter.set_title("Distribution of Hustle Scores Across Salary (2016-2018) with r = " + str(round(correlation,2)))

## Part 5: Predicting Points Per Game

Get table with all the hustle stats and the points for each player for linear regression.

In [None]:
hustleStatsWithPTS = hustleStats.merge(regularStats[['Player', 'PTS', 'year']], on=['Player', 'year'], how='left')
hustleStatsWithPTS = hustleStatsWithPTS.dropna()
hustleStatsWithPTS = hustleStatsWithPTS.sort_values(by=['year', 'PTS'], ascending=False)
hustleStatsWithPTS.head()

### Step 1: Player Point Prediction based on Base Hustle Scores
We try to predict how many points a player gets based on his base hustle stats, where base hustle stats are the individual statistics from the NBA's website that we used to calculate our hustle score.

In [None]:
baseHustleReg = LinearRegression()
baseHustleReg = baseHustleReg.fit(hustleStatsWithPTS[['ScreenAssists', 'Deflections', 'Loose BallsRecovered', 'ChargesDrawn', 'ContestedShots']], hustleStatsWithPTS['PTS'])

In [None]:
hustleStatsWithPTS['residuals'] = hustleStatsWithPTS['PTS'] - baseHustleReg.predict(hustleStatsWithPTS[['ScreenAssists', 'Deflections', 'Loose BallsRecovered', 'ChargesDrawn', 'ContestedShots']])

In [None]:
baseHustleViolin = sns.violinplot(x="Team", y="residuals", data=hustleStatsWithPTS)
baseHustleViolin.set_title("Residuals for predictions of points per game per player based on hustle stats by team")

In the above plot we see the residuals of our predictive model. Ideally, we want the widest part of each violin to be close to zero. This is true for some teams (e.g., CLE, DET, DEN, and CHI). However, many teams have very high residuals and very thin violins, showing that for many of their players the base hustle stats don't effectively predict the number of points they score in a game.

### Step 2: Player Point Prediction based on Calculated Hustle Score
We then try to predict the number of points each player scores per game based on our calculated hustle score. We postulate that the results of this regression will be better because in our hustle score we normalized each statistic. This is different from Step 1 because in Step 1 we used the raw hustle stats without normalization.

In [None]:
hustleReg = LinearRegression()
hustleReg.fit(hustleStatsWithPTS[['HustleScore']],hustleStatsWithPTS['PTS'])
hustleStatsWithPTS['hustleResidual'] = hustleStatsWithPTS['PTS'] - hustleReg.predict(hustleStatsWithPTS[['HustleScore']])

In [None]:
hustleViolin = sns.violinplot(x="Team", y="hustleResidual", data=hustleStatsWithPTS)
hustleViolin.set_title("Residuals for calculated hustle score predictive model by team")

We see higher residuals in this plot than in the previous one. This could be because our hustle score doesn't predict points well.

### Step 3: Using the regular statistics to predict player points

In [None]:
# Merge the team column into the regular stats table
regularStatsWithTeam = regularStats.merge(hustleStats[['Player', 'Team', 'year']], on=['Player', 'year']).drop(['#'], axis=1)
#regularStatsWithWins = regularStatsWithWins.groupby(['Team']).sum().merge(teamTable, on=['Team'])
regularStatsWithTeam = regularStatsWithTeam.drop(['GP', 'MIN'], axis=1)
regularStatsWithTeam.head()

In [None]:
regularStatsReg = LinearRegression()
regularStatsReg.fit(regularStatsWithTeam[['REB', 'AST', 'STL', 'BLK', 'TOV']], regularStatsWithTeam['PTS'])

In [None]:
regularStatsWithTeam['residuals'] = regularStatsWithTeam['PTS'] - regularStatsReg.predict(regularStatsWithTeam[['REB', 'AST', 'STL', 'BLK', 'TOV']])
violin = sns.violinplot(x="Team", y="residuals", data=regularStatsWithTeam)

The above plot shows, by team, the residuals from predicting each player's points based solely on their standard statistics. As with our earlier plots, some players' points are predicted well by their regular stats, while others' are not.

### Step 4: Incorporating the Hustle Score

In [None]:
# Merge calculated hustle stat into regular stats table
allStats = regularStatsWithTeam.merge(hustleStats[['Player', 'year', 'HustleScore']], on=['Player', 'year'])
allStats = allStats.dropna()
allStats.head()

In [None]:
# Perform the same regression as earlier but with the added hustle score column
hustleRegularStatsReg = LinearRegression()
hustleRegularStatsReg.fit(allStats[['REB', 'AST', 'STL', 'BLK', 'TOV', 'HustleScore']], allStats['PTS'])

In [None]:
allStats['residuals'] = allStats['PTS'] - hustleRegularStatsReg.predict(allStats[['REB', 'AST', 'STL', 'BLK', 'TOV', 'HustleScore']])
violin = sns.violinplot(x="Team", y="residuals", data=allStats)
violin.set_title("Residuals of predictive model using regular statistics with hustle score")

As we see here, the hustle score slightly improves the player points per game prediction.
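
The violin plots compare the models visually. One way to put a number on a claim like "slightly improves" is the coefficient of determination (R²), which is also what scikit-learn's `LinearRegression.score` returns. The sketch below computes it by hand; the points and predictions are made up for illustration, and `r_squared` is a hypothetical helper, not part of the notebook.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: what the model leaves unexplained
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: what predicting the mean would leave unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up points per game and predictions from two hypothetical models
pts = [10.0, 20.0, 30.0]
pred_without_hustle = [12.0, 20.0, 28.0]
pred_with_hustle = [11.0, 20.0, 29.0]

print(round(r_squared(pts, pred_without_hustle), 2))
print(round(r_squared(pts, pred_with_hustle), 2))
```

A higher value means smaller residuals overall. Because adding a feature never lowers R² on the data the model was fit to, a fair comparison would score both models on held-out players.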

## Part 6: Predicting Wins
We first try to see if a team's total hustle score predicts its wins.

In [None]:
hustleStatsWithWins = hustleStats.groupby(['Team', 'year']).sum().merge(teamWinsTable, on=['Team', 'year'])
hustleWinsReg = LinearRegression()
hustleWinsReg.fit(hustleStatsWithWins[['HustleScore']], hustleStatsWithWins['Wins'])

In [None]:
hustleStatsWithWins['residuals'] = hustleStatsWithWins['Wins'] - hustleWinsReg.predict(hustleStatsWithWins[['HustleScore']])
hustleStatsWithWins.boxplot(column=['residuals'], by='year')

As seen in the plot above, hustle score does an OK job of predicting wins for a team, especially in 2016-17.