# HustleBall — Do Hustle Statistics in the NBA Matter?
### By Yonatan Klausner, Yair Fax, and Jonathan Rawson

#### Published: December 4, 2018

## Table of Contents
1. **Introduction**  
    A. Previous Work
2. **Data Scraping**  
    A. Define Functions for Obtaining Data  
    B. Obtain Hustle Data on Players  
    C. Obtain Data for Regular Statistics on Players  
    D. Obtain Data for Positions of Players  
    E. Obtain Data for Team Wins  
    F. Obtain Data for Salaries of Players  
    G. Obtain Data for Team Statistics  
3. **Clean Data**  
    A. Clean Hustle and Position Data  
    B. Clean Team Wins Data  
    C. Clean Salary Data  
4. **Calculate HustleScore**  
    A. Statistics Used in Calculation  
    B. Calculation   
5. **Look for Potential Correlation between HustleScore and Other Statistics**    
    A. Age vs. HustleScore  
    B. Minutes Played vs. HustleScore  
    C. Position vs. HustleScore  
    D. HustleScore per Position between Top and Bottom Teams  
    E. Player Salary vs. HustleScore  
6. **Predictions**    
    A. Predicting Player Points with HustleScore    
    B. Predicting Player Points with Regular Statistics  
    C. Predicting Player Points with Regular Statistics and HustleScore  
    D. Predicting Team Wins with HustleScore  
7. **There's no "I" in Team**  
    A. Plotting Team HustleScore vs. Wins  
    B. Top Teams vs. Bottom Teams  
8. **Conclusion**

## 1. Introduction
As an [article](http://grantland.com/features/the-economics-moneyball/) in Grantland, a sports blog, puts it: "The Moneyball thesis is simple: Using statistical analysis, small-market teams can compete by buying assets that are undervalued by other teams and selling ones that are overvalued by other teams."
The key to Moneyball is figuring out how to know when a player is "undervalued."  

For our project, we will be looking at statistics from the National Basketball Association ([NBA](http://www.nba.com)). 
Among other statistics, we will mainly be looking at hustle statistics, which try to measure how much effort a player puts in, meaning how hard they play. These statistics include screen assists, deflections, loose balls recovered, charges drawn, contested shots, steals, and offensive rebounds, which we explain in detail below. We would like to find correlations between a hustle score that we calculate ourselves and other statistics. Ideally, our findings will help teams seek specific players due to their hustle scores. ([Read more](https://stats.nba.com/articles/dig-deeper-into-the-game-with-new-defensive-and-hustle-data/) about hustle statistics.)

### 1.A Previous Work
While research has been done related to normal statistics in the NBA (e.g. points per game, assists per game, field goal percentage, rebounds per game), little work has been done related to hustle statistics. Igor Stančin and Alan Jović of the University of Zagreb published a [paper](https://bib.irb.hr/datoteka/939840.MIPRO_2018_Stancin_Jovic_final_revised.pdf) looking into the differences between winning teams and losing teams for several hustle statistics. However, they did not calculate a total hustle score for a player or for a team. Mark White and Kennon Sheldon published a [paper](https://link-springer-com.proxy-um.researchport.umd.edu/article/10.1007/s11031-013-9389-7) looking at players' performance before, during, and after their contract years. While they did look at several normal NBA statistics, they did not incorporate advanced hustle statistics into their paper. Landon Drew Laporte wrote a [paper](https://repository.lib.ncsu.edu/bitstream/handle/1840.20/33510/etd.pdf?sequence=1) researching hustling in basketball. While his paper does go in depth on how to identify hustlers and how to predict hustling, he does not look for correlation with other statistics.  

Additionally, the hustle statistics from the NBA have only existed since the beginning of the 2016-2017 season. Therefore, not a lot of work has been done concerning them. In our project we try to find significance in the NBA's hustle statistics.

In [2]:
# Imports needed for project 
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import pandas as pd
import numpy as np
import pylab as pl
%matplotlib inline
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns

In [3]:
figsize(14, 7)

## 2. Data Scraping
In this section we get all the data that we will be using for our project.

### 2.A Define Functions for Obtaining Data

These functions get statistics from [NBA.com](http://nba.com). We use an automated browser (read more [here](https://webkit.org/blog/6900/webdriver-support-in-safari-10/)) instead of a simple get request because the HTML is not fully populated in the request. The scripts on the page populate the table's data. We use JavaScript in the browser to edit elements of the page so that we can simulate clicks to display the full table. We then pull the HTML from the page and pass it to Pandas to create a dataframe.

In [4]:
def getHustleStats(year):
    return getYearStats("https://stats.nba.com/players/hustle/?sort=MIN&dir=-1&Season=" + year + "&SeasonType=Regular%20Season", year)

def getRegularStats(year):
    return getYearStats("https://stats.nba.com/leaders/?Season=" + year + "&SeasonType=Regular%20Season", year)

def getYearStats(url, year):
    # Use an automated browser so that the webpage is rendered properly
    browser = webdriver.Safari()
    browser.get(url)
    
    # Edit HTML so that we can get the whole table
    browser.execute_script('document.getElementsByClassName("stats-table-pagination__select")[0].setAttribute("id", "btn")')
    browser.execute_script('document.getElementById("btn").children[0].setAttribute("id", "select-all")')
    # find_element_by_id was removed in Selenium 4; use the By locator imported above
    nextButton = browser.find_element(By.ID, 'btn')
    allButton = browser.find_element(By.ID, 'select-all')
    
    # Click on buttons to get whole table
    nextButton.click()
    allButton.click()
    
    # Get HTML and parse
    innerHTML = browser.execute_script("return document.body.innerHTML")
    root = BeautifulSoup(innerHTML, "lxml")
    table = pd.read_html(str(root.find("table")))
    table = table[0]
    
    # Add year column for later table merging
    table['year'] = year
    browser.close()
    return table

This function uses a simple get request for a specific year to get data about players' positions.

In [5]:
def getPositions(year):
    url = "https://www.basketball-reference.com/leagues/NBA_" + str(year) + "_per_game.html"
    r = requests.get(url)
    root = BeautifulSoup(r.content, "lxml")
    data = root.find("table")
    positions_table = pd.read_html(str(data))[0]
    # Add year column for later table merging
    positions_table['year'] = str(year - 1) + "-" + str(year - 2000)
    return positions_table
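
The season label built on the last line of `getPositions` can be checked on its own. The sketch below uses a hypothetical helper, `season_label`, which is not part of the notebook. It assumes that Basketball Reference URLs are keyed by the year in which a season ends, as the function above does, and it zero-pads the two-digit suffix so that earlier seasons would also format correctly.

```python
def season_label(end_year):
    """Return an NBA season label such as '2016-17' for the season ending in end_year."""
    start_year = end_year - 1
    # Zero-pad the two-digit ending year so that 2009 gives '2008-09'
    return f"{start_year}-{end_year % 100:02d}"

# The two seasons used in this project
print(season_label(2017))
print(season_label(2018))
```

These labels match the `year` strings passed to `getHustleStats` and `getRegularStats`, which is what allows the tables to be merged on `year` later.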

These functions read CSV files stored on our git repository.

In [6]:
def getTeams(filename, year):
    teamTable = pd.read_csv(filename)
    # Add year column for later table merging
    teamTable['year'] = year
    return teamTable

def getSalaries(filename):
    salaryTable = pd.read_csv(filename)
    return salaryTable

### 2.B Obtain Hustle Data on Players
This [data](https://stats.nba.com/players/hustle/?sort=MIN&dir=1&Season=2016-17&SeasonType=Regular%20Season) consists of several statistics that are related to player hustling. Due to the fact that these advanced statistics have only been available since the 2016-2017 season, we will only be using 2 seasons of data (2016-2017 and 2017-2018).

In [7]:
# Use our getHustleStats function to get the hustle stats for each year
hustle16_17 = getHustleStats("2016-17") 
hustle17_18 = getHustleStats("2017-18")


In [None]:
# Concatenate hustle tables together
hustleStats = pd.concat([hustle16_17, hustle17_18])  # DataFrame.append was removed in pandas 2.0
hustleStats.head()

### 2.C Obtain Data for Regular Statistics on Players 
This [data](https://stats.nba.com/leaders/?Season=2016-17&SeasonType=Regular%20Season) consists of several common player statistics, such as **MIN** (minutes per game), **PTS** (points per game), **FGM** (field goals made per game), and **REB** (rebounds per game).

In [None]:
# Get regular stats using our getRegularStats function and concatenate tables
regular16_17 = getRegularStats("2016-17") 
regular17_18 = getRegularStats("2017-18")
regularStats = pd.concat([regular16_17, regular17_18])
regularStats.head()

### 2.D Obtain Data for Positions of Players
This [data](https://www.basketball-reference.com/leagues/NBA_2017_per_game.html) consists of the positions for each player.  The five positions are:
1. **PG** - point guard
2. **SG** - shooting guard
3. **SF** - small forward 
4. **PF** - power forward
5. **C** - center

Typically, the **point guard** is the leader of the team when on the court. Being the point guard requires polished ball handling skills and the ability to facilitate the team during a play. The **shooting guard** is generally the best shooter and is usually capable of shooting from farther distances. Generally, they also have good ball-handling skills. The **small forward** often has an aggressive approach to the basket on offense. The **power forward** and the **center** are usually called "low post" players. They often act as their team's primary rebounders and shot blockers. In addition, they generally get the ball closer to the hoop to take inside shots. The **center** is typically the larger of the two. [Read more](https://www.basketballforcoaches.com/basketball-positions/) about the different positions in basketball.

In [None]:
# Get position data from our getPositions function
# We only need the player, position, and year from the table
positions16_17 = getPositions(2017)[['Player', 'Pos', 'year']]
positions17_18 = getPositions(2018)[['Player', 'Pos', 'year']]

In [None]:
# Concatenate tables together
positionsTable = pd.concat([positions16_17, positions17_18])
positionsTable.head()

### 2.E Obtain Data for Team Wins 

This [data](https://github.com/yklausner/yklausner.github.io/blob/master/teams_2016_2017.csv) consists of team statistics related to wins and losses. We use this data to get the number of wins for a team in a given year.

In [None]:
# Use our getTeams function to read from CSV files and concatenate together
teamsWins16_17 = getTeams('teams_2016_2017.csv', '2016-17') 
teamsWins17_18 = getTeams('teams_2017_2018.csv', '2017-18')
teamWinsTable = pd.concat([teamsWins16_17, teamsWins17_18])
teamWinsTable.head()

### 2.F Obtain Data for Salaries of Players

This [data](https://data.world/datadavis/nba-salaries) consists of player salaries. 

In [None]:
# Use our getSalaries function to read CSV file
salaryTable = getSalaries('player_salaries.csv')
salaryTable.head()

### 2.G Obtain Data for Team Statistics

This [data](https://github.com/yklausner/yklausner.github.io/blob/master/team_stats_2016_2017.csv) consists of aggregate player statistics for each team.

In [None]:
# Use our getTeams function to read CSV files and concatenate tables
teamStats_16_17 = getTeams("team_stats_2016_2017.csv", '2016-17') 
teamStats_17_18 = getTeams("team_stats_2017_2018.csv", '2017-18') 
teamStatsTable = pd.concat([teamStats_16_17, teamStats_17_18])
teamStatsTable.head()

## 3. Clean Data
### 3.A Clean Hustle and Position Data
After getting all the data needed for our project, we need to clean the data so that we are able to use each table easily (short, informative [read](https://medium.com/datadriveninvestor/data-cleaning-for-data-scientist-363fbbf87e5f) on cleaning data). 

First, we merge the hustleStats table and the positionsTable on the player and year columns in order to add the positions to the hustleStats table. We then print out any players in the merged table with a null position to see if any players were not assigned a position properly, meaning the two tables had different names for these players.

In [None]:
hustleStatsTemp = pd.merge(hustleStats, positionsTable, on=['Player', 'year'], how='left')
# Print out all players with null position
hustleStatsTemp[pd.isnull(hustleStatsTemp['Pos'])]

As seen above, there are a number of players whose position was not set properly.  After looking closely into each case, we realized what was going wrong.
1. For a number of the players, one table added a Roman numeral to the end of their name (e.g. "III") and the other did not. For example, John Lucas III and John Lucas.  
2. Some players in one table had suffixes such as "Jr." and not in the other table. For example, Kelly Oubre Jr. and Kelly Oubre.  
3. A number of players had non-alphabetical characters in their names in one table and not in the other table. For example, C.J. Watson and CJ Watson.   
4. Taurean Prince — In one table he was named "Taurean Prince" and in the other table he was named "Taurean Waller-Prince".
5. Nene — In one table he was named "Nene" and in the other table he was named "Nene Hilario". 

We also drop all duplicate rows.   
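
The first three cases above can be illustrated on a few names. This is a minimal sketch using only the standard library. `normalize_name` and `SUFFIX_PATTERN` are hypothetical names introduced for illustration and are not the notebook's cleaning function. Unlike a plain substring replacement, this version only removes a suffix at the end of the name, which is an assumption about where suffixes appear.

```python
import re

# Hypothetical pattern: a trailing generational suffix preceded by whitespace
SUFFIX_PATTERN = re.compile(r"\s+(Jr\.?|Sr\.?|II|III|IV)$")

def normalize_name(name):
    """Drop a trailing suffix, then remove periods and commas."""
    name = SUFFIX_PATTERN.sub("", name.strip())
    return name.replace(".", "").replace(",", "")

examples = ["John Lucas III", "Kelly Oubre Jr.", "C.J. Watson", "Nene"]
cleaned = [normalize_name(n) for n in examples]
print(cleaned)
```

Names that differ in other ways, such as "Taurean Waller-Prince" or "Nene Hilario", still need to be fixed by hand.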

In [None]:
# Function to clean player names
def stripChars(series):
    # Use pandas string methods to strip suffixes and punctuation from player names.
    # regex=False makes '.' a literal period; 'III' is removed before 'II'.
    for token in ['Jr.', '.', ',', 'III', 'II', 'IV']:
        series = series.str.replace(token, '', regex=False)
    return series.str.rstrip()

In [None]:
positionsTable = positionsTable.drop_duplicates(subset=['Player', 'year'])
# Clean player names 
positionsTable['Player'] = stripChars(positionsTable['Player'])
positionsTable.loc[positionsTable['Player'] == 'Nene Hilario', 'Player'] = 'Nene'
positionsTable.loc[positionsTable['Player'] == 'Taurean Waller-Prince', 'Player'] = 'Taurean Prince'
positionsTable.head()

After cleaning the data some more, we merge the hustleStats table and the positionsTable again.  
We then print the rows in hustleStats that have a null position to see which player names are still inconsistent. 

In [None]:
hustleStats['Player'] = stripChars(hustleStats['Player'])
hustleStats = pd.merge(hustleStats, positionsTable, on=['Player', 'year'], how='left')
hustleStats[pd.isnull(hustleStats['Pos'])]

The above 5 players appear in only one table. Because they are relatively insignificant, we drop them. We treat these players as missing at random ([MAR](https://www.theanalysisfactor.com/mar-and-mcar-missing-data/)): whether a position is missing depends on observed data (how the player's name is written in each source), not on the value of the missing position itself.

In [None]:
hustleStats = hustleStats.dropna(subset=['Pos'])
hustleStats.head()

In [None]:
# Get all position values
hustleStats['Pos'].unique()

Next, we clean the hustleStats table. 
1. Change the "TEAM" column to be named "Team"
2. Change any player with two positions to have one position
3. Drop unused columns 
4. Drop duplicate players within the same year 

In [None]:
hustleStats = hustleStats.rename(columns = {'TEAM': 'Team'}) # Rename column 
# Change position of player from multiple positions to one
hustleStats.loc[hustleStats['Pos'] == 'PG-SG', 'Pos'] = 'PG'
hustleStats.loc[hustleStats['Pos'] == 'SF-SG', 'Pos'] = 'SF'
hustleStats.loc[hustleStats['Pos'] == 'PF-C', 'Pos'] = 'PF'
# Drop unused columns 
hustleStats = hustleStats.drop(['ScreenAssists PTS', 'OFF Loose BallsRecovered', 'DEF Loose BallsRecovered', '% Loose BallsRecovered OFF', '% Loose BallsRecovered DEF'], axis=1)
hustleStats = hustleStats.drop_duplicates(subset=['Player', 'year'])
hustleStats.head()

In [None]:
# Verify that no player has multiple positions
hustleStats['Pos'].unique()

### 3.B Clean Team Wins Data

In [None]:
# Drop unused columns 
teamWinsTable = teamWinsTable.drop(['L', 'Win%', 'GB', 'Conf', 'Div', 'Home', 'Road', 'OT', 'Last 10', 'Streak'], axis=1)
teamWinsTable = teamWinsTable.rename(columns = {'W': 'Wins'}) # Rename column
teamWinsTable.head()

### 3.C Clean Salary Data  
First, we keep only the rows from the 2016-17 and 2017-18 seasons. Then we drop unused columns and remove any duplicates. Lastly, we rename the year values in the table to match the season labels used in our other tables.  

In [None]:
# Get data from desired seasons
# The renaming below maps 2016 to '2016-17', so the filter must keep 2016 as well
salaryTable = salaryTable[salaryTable['year'] >= 2016]

# Drop unused columns
salaryTable = salaryTable.drop(['team_name', 'season_end', 'team'], axis=1)
# Drop duplicates
salaryTable = salaryTable.drop_duplicates(subset=['Player', 'year'])

# Rename year values
salaryTable.loc[salaryTable['year'] == 2016, 'year'] = '2016-17'
salaryTable.loc[salaryTable['year'] == 2017, 'year'] = '2017-18'
salaryTable.head()

All our data is now clean.

## 4. Calculate HustleScore

### 4.A Statistics Used in Calculation

The statistics we will use to calculate our hustle score are:  

1. **Screen Assists** — A screen assist is a [screen](https://www.coachesclipboard.net/Screens.html) that directly leads to a made field goal.  

2. **Deflections** — A deflection occurs when a defensive player redirects the intended direction of the ball.

3. **Loose Balls Recovered** — A loose ball is when neither team is in control of the ball.  A loose ball recovered is when a player recovers a loose ball.   

4. **Charges Drawn** — A [charge](http://www.playsportstv.com/basketball/basketball-rules_fouls_the-charge), or player-control foul, occurs when a dribbler charges into a defender who has already established his position.  When a defender takes a charge on an offensive player, the defensive player is awarded a drawn charge.  

5. **Contested Shots** — A contested shot is when a defender is within 4 feet of the person shooting. 

6. **Steals** — A [steal](https://www.sportslingo.com/sports-glossary/s/steal-basketball/) is when a defender legally causes a turnover by his positive, defensive actions.   

7. **Offensive Rebounds** — An offensive rebound is when an offensive player secures a rebound.

These statistics are all hustle related. We believe that players who hustle more generally have higher values for these statistics. Therefore, these are the 7 statistics we use to calculate our hustle score.

### 4.B Calculation 

In order to calculate our hustle score, we will use the sum of screen assists, deflections, loose balls recovered, charges drawn, contested shots, steals, and offensive rebounds. However, if we just took the sum of those statistics for each player, certain statistics, such as contested shots, would affect the overall hustle score more than others because their values are generally higher. Therefore, we normalize each statistic by dividing its value by the maximum for that specific statistic. Finally, we scale the hustle score to a per-36-minute basis, which is a common convention for basketball statistics. ([Read more](https://bleacherreport.com/articles/657235-the-value-of-per-36-statistics-and-future-breakthrough-stars#slide0) on per 36-minute statistics.)
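
The calculation described above can be sketched on toy numbers. The three players and their per-game values below are made up for illustration, and only two of the seven statistics are used to keep the arithmetic easy to follow. The names `players`, `maxima`, and `hustle_score` are hypothetical and not part of the notebook.

```python
# Toy per-game numbers for three hypothetical players (not real NBA data)
players = {
    "A": {"MIN": 36.0, "Deflections": 2.0, "ContestedShots": 10.0},
    "B": {"MIN": 18.0, "Deflections": 1.0, "ContestedShots": 5.0},
    "C": {"MIN": 24.0, "Deflections": 4.0, "ContestedShots": 2.0},
}
stats = ["Deflections", "ContestedShots"]

# Maximum of each statistic across all players, used for normalization
maxima = {s: max(p[s] for p in players.values()) for s in stats}

def hustle_score(p):
    # Sum of normalized statistics, scaled to a per-36-minute basis
    normalized_sum = sum(p[s] / maxima[s] for s in stats)
    return 36 * normalized_sum / p["MIN"]

scores = {name: hustle_score(p) for name, p in players.items()}
print({name: round(score, 2) for name, score in scores.items()})
```

Player B produces half of player A's raw numbers in half the minutes, so the per-36 scaling gives both the same score. Player C has few contested shots but leads in deflections, and normalization lets that count as much as leading in contested shots would.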

In [None]:
# Merge offensive rebounds and steals into hustleStats table for HustleScore calculation
hustleStats = hustleStats.merge(regularStats[['Player', 'year', 'OREB', 'STL']], on=['Player', 'year'])
hustleStats.head()

In [None]:
# Calculate maximums to normalize aggregate hustle statistic
maxScreenAssists = hustleStats['ScreenAssists'].max()
maxDeflections = hustleStats['Deflections'].max()
maxLooseBallsRecovered = hustleStats['Loose BallsRecovered'].max()
maxChargesDrawn = hustleStats['ChargesDrawn'].max()
maxContestedShots = hustleStats['ContestedShots'].max()
maxOREB = hustleStats['OREB'].max()
maxSTL = hustleStats['STL'].max()
print("maxScreenAssists per game: \t\t" + str(maxScreenAssists))
print("maxDeflections per game: \t\t" + str(maxDeflections))
print("maxLooseBallsRecovered per game: \t" + str(maxLooseBallsRecovered))
print("maxChargesDrawn per game: \t\t" + str(maxChargesDrawn))
print("maxContestedShots per game: \t\t" + str(maxContestedShots))
print("maxOffensiveRebound per game: \t\t" + str(maxOREB))
print("maxSteals per game: \t\t\t" + str(maxSTL))

In [None]:
# Calculate and add HustleScore to hustleStats table 
hustleStats['HustleScore'] = 36*(hustleStats['ScreenAssists']/maxScreenAssists + hustleStats['Deflections']/maxDeflections + hustleStats['Loose BallsRecovered']/maxLooseBallsRecovered + hustleStats['ChargesDrawn']/maxChargesDrawn + hustleStats['ContestedShots']/maxContestedShots + hustleStats['OREB']/maxOREB + hustleStats['STL']/maxSTL)/(hustleStats['MIN'])
hustleStats.head()

## 5. Look for Potential Correlation between HustleScore and Other Statistics
### 5.A Age vs. HustleScore
We hypothesize that younger players hustle more.

In [None]:
scatter = hustleStats.plot.scatter('AGE', 'HustleScore')

(m, b) = np.polyfit(hustleStats['AGE'], hustleStats['HustleScore'], 1)
# Calculate the actual best fit line 
regression_line = [(m*x)+b for x in hustleStats['AGE']]
scatter.plot(hustleStats['AGE'], regression_line, color='red') 

# Calculate the correlation coefficient r 
correlation = np.corrcoef(hustleStats['HustleScore'], hustleStats['AGE'])[0,1]

# Label the plot
scatter.set_title("Distribution of Hustle Scores Across Age with r = " + str(round(correlation,2)))
scatter.set_xlabel("Age")
scatter.set_ylabel("Hustle Scores")

The above plot shows basically no correlation between age and hustle scores.
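
For readers unfamiliar with the correlation coefficient r shown in the plot titles, the sketch below computes Pearson's r by hand on made-up data. `pearson_r` is a hypothetical helper written for illustration. It should agree with `np.corrcoef(x, y)[0, 1]` up to floating-point error.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up ages and two made-up score series
ages = [20, 24, 28, 32, 36]
rising = [1.0, 2.0, 3.0, 4.0, 5.0]   # increases steadily with age
flat = [2.0, 4.0, 3.0, 4.0, 2.0]     # no linear trend with age

r_rising = pearson_r(ages, rising)
r_flat = pearson_r(ages, flat)
print(round(r_rising, 2), round(r_flat, 2))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0, as in the age plot above, indicates essentially none.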

### 5.B Minutes Played vs. HustleScore
We hypothesize that players who play fewer minutes per game hustle more because they will have more energy than players who play more.

In [None]:
scatter = hustleStats.plot.scatter('MIN', 'HustleScore')

(m, b) = np.polyfit(hustleStats['MIN'], hustleStats['HustleScore'], 1)
# Calculate the actual best fit line 
regression_line = [(m*x)+b for x in hustleStats['MIN']]
scatter.plot(hustleStats['MIN'], regression_line, color='red') 

# Calculate the correlation coefficient r 
correlation = np.corrcoef(hustleStats['HustleScore'], hustleStats['MIN'])[0,1]

# Label the plot
scatter.set_title("Distribution of Hustle Scores Across Minutes Per Game with r = " + str(round(correlation,2)))
scatter.set_xlabel("Minutes Per Game")
scatter.set_ylabel("Hustle Scores")

The above plot shows a slight negative correlation between minutes per game and hustle scores. That is to say that as a player's minutes per game increase, his hustle score generally decreases slightly. 

### 5.C Position vs. HustleScore
We hypothesize that some positions might have a higher hustle score than others.

In [None]:
positions = ['PG', 'SG', 'SF', 'PF', 'C']
hustlePerPos = []
for pos in positions:
    hustlePerPos.append(hustleStats[hustleStats['Pos'] == pos]['HustleScore'].median())
    
fig, ax = plt.subplots()
ax.bar(positions, hustlePerPos, color = ['#cccccc', '#9eadba', '#6f8eaa', '#416e99', '#165089'])
ax.set_xlabel("Positions")
ax.set_ylabel("Hustle Score")
ax.set_title("Distribution of Hustle Score Across Positions")

As seen in the above plot, centers generally have the highest hustle score.

### 5.D HustleScore per Position between Top and Bottom Teams
We hypothesize that there might be a difference in hustle score between top and bottom teams by position. We define a top team to have 48 or more wins.

In [None]:
# Define a top team as having greater than or equal to 48 wins
topTeams = teamWinsTable[teamWinsTable['Wins'] >= 48]
bottomTeams = teamWinsTable[teamWinsTable['Wins'] < 48]

In [None]:
# Import hustle stat into tables with wins
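# Merge on year as well as team so that each player is matched to his team's wins for the same season
topTeamsPlayers = pd.merge(topTeams, hustleStats, how='left', on=['Team', 'year']).sort_values('Pos')
bottomTeamsPlayers = pd.merge(bottomTeams, hustleStats, how='left', on=['Team', 'year']).sort_values('Pos')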

In [None]:
# Set width of bars
barWidth = 0.25
 
# Set height of bars
# Select the column before aggregating so non-numeric columns are not passed to median()
bars1 = topTeamsPlayers.groupby('Pos')['HustleScore'].median()
bars2 = bottomTeamsPlayers.groupby('Pos')['HustleScore'].median()

# Set position of bars on x axis
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
 
# Make the plot
plt.bar(r1, bars1, color='blue', width=barWidth, edgecolor='white', label='Top Teams')
plt.bar(r2, bars2, color='orange', width=barWidth, edgecolor='white', label='Bottom Teams')
 
# Add xticks on the middle of the group bars
plt.xlabel('Position')
plt.ylabel('Median Hustle Score')
plt.xticks([r + barWidth/2 for r in range(len(bars1))], ['C', 'PF', 'PG', 'SF', 'SG'])
plt.title("Hustle Score for Positions for Top vs. Bottom Teams")

# Create legend and show plot
plt.legend()
plt.show()

As seen in the above plot, centers on top teams generally hustle more than centers on bottom teams. In the other positions, players on bottom teams tend to hustle more. However, the differences are small.

### 5.E Player Salary vs. HustleScore
We explore the correlation between salary and HustleScore.

In [None]:
# Merge hustle stats and salaries into one table
salaryHustle = pd.merge(hustleStats, salaryTable, on=['Player', 'year'], how='inner')
salaryHustle.head()

In [None]:
scatter = salaryHustle.plot.scatter('HustleScore', 'salary')

(m, b) = np.polyfit(salaryHustle['HustleScore'], salaryHustle['salary'], 1)
# Calculate the actual best fit line 
regression_line = [(m*x)+b for x in salaryHustle['HustleScore']]
scatter.plot(salaryHustle['HustleScore'], regression_line, color='red') 

# Calculate the correlation coefficient r 
correlation = np.corrcoef(salaryHustle['HustleScore'], salaryHustle['salary'])[0,1]

# Label the plot
scatter.set_title("Distribution of Hustle Scores Across Salary with r = " + str(round(correlation,2)))

As seen in the above plot, there is basically no correlation between a player's hustle score and his salary.

## 6. Predictions
In the following section, we try to use our calculated hustle score to predict other statistics using Linear Regression models.

### 6.A Predicting Player Points with HustleScore
We predict the number of points each player scores per game based on our calculated hustle score.
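
As a rough illustration of what a one-variable linear regression and its residuals look like, the sketch below fits a least-squares line by hand on made-up numbers. `fit_line`, `hustle`, and `points` are hypothetical names and values, not the notebook's data. The fit itself is ordinary least squares, which is also what `LinearRegression` computes for a single feature.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up hustle scores and points per game for four players
hustle = [1.0, 2.0, 3.0, 4.0]
points = [6.0, 9.0, 14.0, 15.0]

slope, intercept = fit_line(hustle, points)
# Residual = actual value minus the value predicted by the fitted line
residuals = [y - (slope * x + intercept) for x, y in zip(hustle, points)]
print(round(slope, 2), round(intercept, 2))
print([round(r, 2) for r in residuals])
```

Residuals close to 0 mean the line predicts a player's points well. Residuals from a least-squares fit with an intercept always sum to zero, so it is their spread, not their average, that indicates how good the predictions are.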

In [None]:
# Get table with all the hustle stats and the points for each player
hustleStatsWithPTS = hustleStats.merge(regularStats[['Player', 'PTS', 'year']], on=['Player', 'year'], how='left')
hustleStatsWithPTS = hustleStatsWithPTS.dropna()
hustleStatsWithPTS = hustleStatsWithPTS.sort_values(by=['year', 'PTS'], ascending=False)
hustleStatsWithPTS.head()

In [None]:
# Fit linear regression model with one parameter and calculate residuals
hustleReg = LinearRegression()
hustleReg.fit(hustleStatsWithPTS[['HustleScore']],hustleStatsWithPTS['PTS'])
hustleStatsWithPTS['hustleResidual'] = hustleStatsWithPTS['PTS'] - hustleReg.predict(hustleStatsWithPTS[['HustleScore']])

In [None]:
hustleViolin = sns.violinplot(x="Team", y="hustleResidual", data=hustleStatsWithPTS)
hustleViolin.set_title("Residuals for calculated hustle score predictive model by team")

In the above plot, only SAC has a short violin with many points clustered around 0. Thus our calculated hustle score does not effectively predict player points.

### 6.B Predicting Player Points with Regular Statistics
We predict points using regular statistics to see if it is more effective than using our hustle statistics. The statistics we use are assists, blocks, turnovers, defensive rebounds, and field goal percentage.

In [None]:
# Merge the team column into the regular stats table
regularStatsWithTeam = regularStats.merge(hustleStats[['Player', 'Team', 'year']], on=['Player', 'year']).drop(['#'], axis=1)
regularStatsWithTeam = regularStatsWithTeam.drop(['GP', 'MIN'], axis=1)
regularStatsWithTeam.head()

In [None]:
# Fit linear regression model
regularStatsReg = LinearRegression()
regularStatsReg.fit(regularStatsWithTeam[['AST', 'BLK', 'TOV', 'DREB', 'FG%']], regularStatsWithTeam['PTS'])

In [None]:
# Calculate residuals and plot
regularStatsWithTeam['residuals'] = regularStatsWithTeam['PTS'] - regularStatsReg.predict(regularStatsWithTeam[['AST', 'BLK', 'TOV', 'DREB', 'FG%']])
regViolin = sns.violinplot(x="Team", y="residuals", data=regularStatsWithTeam)
regViolin.set_title("Residuals of Point Predictions based on Regular Statistics")

Some players' points are predicted well by their regular stats, while others are not.

### 6.C Predicting Player Points with Regular Statistics and HustleScore
We look to see if a combination of the hustle score with regular statistics predicts player points more effectively. 

In [None]:
# Merge calculated hustle stat into regular stats table
allStats = regularStatsWithTeam.merge(hustleStats[['Player', 'year', 'HustleScore']], on=['Player', 'year'])
allStats = allStats.dropna()
allStats.head()

In [None]:
# Perform the same regression as earlier but with the added hustle score column
hustleRegularStatsReg = LinearRegression()
hustleRegularStatsReg.fit(allStats[['AST', 'BLK', 'TOV', 'DREB', 'FG%', 'HustleScore']], allStats['PTS'])

In [None]:
allStats['residuals'] = allStats['PTS'] - hustleRegularStatsReg.predict(allStats[['AST', 'BLK', 'TOV', 'DREB', 'FG%', 'HustleScore']])
violin = sns.violinplot(x="Team", y="residuals", data=allStats)
violin.set_title("Residuals of predictive model using regular statistics with hustle score")

As we see here, the hustle score slightly improves the player points per game prediction for PHX but not for the rest of the teams.

### 6.D Predicting Team Wins with HustleScore
We look to see if team hustle score (sum of individual hustle scores) predicts team wins.

In [None]:
# Sum hustle score by team and add wins to table
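# as_index=False keeps Team and year as columns; numeric_only=True skips text columns such as Player
hustleStatsWithWins = hustleStats.groupby(['Team', 'year'], as_index=False).sum(numeric_only=True).merge(teamWinsTable, on=['Team', 'year'])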
hustleStatsWithWins.head()

In [None]:
# Fit linear regression model with one parameter
hustleWinsReg = LinearRegression()
hustleWinsReg.fit(hustleStatsWithWins[['HustleScore']], hustleStatsWithWins['Wins'])

In [None]:
# Calculate residuals and plot using boxplot
hustleStatsWithWins['residuals'] = hustleStatsWithWins['Wins'] - hustleWinsReg.predict(hustleStatsWithWins[['HustleScore']])
hustleStatsWithWins.boxplot(column=['residuals'], by='year')

As seen in the above boxplot, HustleScore does a reasonably good job of predicting wins for a team, especially in 2016-17. The median residual sits near 0, as shown by the green lines (the center line of a boxplot marks the median, not the mean). The middle 50 percent of residuals are within about 5 wins for each year.

## 7. There's no "I" in Team
6.D showed a correlation between team HustleScore and wins. We explore that further in this section and focus more on teams as opposed to individual players.

### 7.A Plotting Team HustleScore vs. Wins

In [None]:
teamWinsTable = hustleStatsWithWins[['HustleScore', 'Wins', 'Team', 'year']]
# Merge team HustleScore into table with general team stats
teamStatsTable = teamStatsTable.merge(teamWinsTable[['Team', 'year', 'Wins', 'HustleScore']], on=['Team', 'year'])
teamStatsTable.head()

In [None]:
# Convert type of wins column to int32 because cannot be of object type when plotting 
teamStatsTable['Wins'] = teamStatsTable['Wins'].astype('int32')
x = teamStatsTable['HustleScore']
y = teamStatsTable['Wins']
fig, ax = plt.subplots()
ax.scatter(x, y) 

# Calculate the slope and intercept for the best fit line 
(m, b) = np.polyfit(teamStatsTable['HustleScore'], teamStatsTable['Wins'], 1)
# Calculate the actual best fit line 
regression_line = [(m*x)+b for x in teamStatsTable['HustleScore']]
ax.plot(teamStatsTable['HustleScore'], regression_line, color='red') 

# Calculate the correlation coefficient r 
correlation = np.corrcoef(teamStatsTable['HustleScore'], teamStatsTable['Wins'])[0,1]

# Label the plot
ax.set_title("Distribution of Hustle Scores Across Wins with r = " + str(round(correlation,2)))
ax.set_xlabel("Team Hustle Score")
ax.set_ylabel("Team Wins")

As seen in the above plot with the high correlation coefficient r, there is a strong positive correlation between a team's HustleScore and its wins. This suggests that our HustleScore is in fact meaningful and predicts wins better than the other statistics that we have looked at.

### 7.B Top Teams vs. Bottom Teams
To further corroborate our results from 7.A we plot the differences between top and bottom teams for the HustleScore and for other statistics. If we see a significant difference between the top and bottom teams for the HustleScore but not for other statistics, then we have further shown that our HustleScore is meaningful.
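
The comparison described above reduces to a ratio of medians. The sketch below shows the idea on six made-up teams, using the same 48-win cutoff. `teams` and `median_ratio` are hypothetical names, and the numbers are not real team data.

```python
from statistics import median

# Made-up team rows (not real NBA data)
teams = [
    {"Team": "A", "Wins": 60, "HustleScore": 30.0},
    {"Team": "B", "Wins": 52, "HustleScore": 28.0},
    {"Team": "C", "Wins": 48, "HustleScore": 26.0},
    {"Team": "D", "Wins": 40, "HustleScore": 25.0},
    {"Team": "E", "Wins": 30, "HustleScore": 24.0},
    {"Team": "F", "Wins": 25, "HustleScore": 20.0},
]

def median_ratio(rows, stat, cutoff=48):
    """Median of stat for teams at or above the cutoff, divided by the median for the rest."""
    top = [row[stat] for row in rows if row["Wins"] >= cutoff]
    bottom = [row[stat] for row in rows if row["Wins"] < cutoff]
    return median(top) / median(bottom)

ratio = median_ratio(teams, "HustleScore")
print(round(ratio, 3))
```

A ratio near 1 means top and bottom teams look alike on that statistic. The further the ratio is from 1, the more the statistic separates the two groups.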

In [None]:
# Create new tables with top and bottom teams
topTeams = teamStatsTable[teamStatsTable['Wins'] >= 48]
bottomTeams = teamStatsTable[teamStatsTable['Wins'] < 48]
figsize(14, 12)

In [None]:
# Function for graphing a statistic for top vs. bottom teams
def graph(ax, stat, name):
    x = topTeams[stat].median()
    y = bottomTeams[stat].median()
    ax.bar(['Top Teams'], x)
    ax.bar(['Bottom Teams'], y)
    ratio = x/y
    ax.set_title(name + ", Ratio = " + str(round(ratio, 3)))
    ax.set_ylabel(name)
    
print("Differences between Top and Bottom Teams where Ratio = Top Teams Score / Bottom Teams Score")

fig, axes = plt.subplots(3, 3)
graph(axes[0, 0], 'PTS', 'Points per Game')
graph(axes[0, 1], 'AST', 'Assists Per Game')
graph(axes[0, 2], 'TOV', 'Turnovers')
graph(axes[1, 0], 'FG%', 'Field Goal Percentage')
graph(axes[1, 1], '3P%', '3 Point Percentage')
graph(axes[1, 2], 'FT%', 'Free Throw Percentage')
graph(axes[2, 0], 'ORB', 'Offensive Rebounds')
graph(axes[2, 1], 'DRB', 'Defensive Rebounds')
graph(axes[2, 2], 'TRB', 'Total Rebounds')

In [None]:
figsize(14, 6)

In [None]:
fig, ax = plt.subplots()
graph(ax, 'HustleScore', 'HustleScore')

As seen above, the ratio of the HustleScore between the top and bottom teams is almost 10% higher than the ratio for any of the other statistics we plotted. This suggests that our HustleScore is meaningful.

## 8. Conclusion


In our project, we tried to create a meaningful statistic to gauge player performance. To do this, we used advanced statistics from the NBA to create what we called a **HustleScore** for each player. We had trouble finding correlation between a player's HustleScore and other statistics for individual players.  However, we noticed that a team's HustleScore had a significant positive correlation with a team's overall wins.  

In terms of future work, there are other advanced statistics that could be incorporated into calculating the HustleScore for each player. Some of these [statistics](https://stats.nba.com/players/speed-distance/?sort=DIST_MILES_DEF&dir=1&Season=2017-18&SeasonType=Regular%20Season) include average MPH, distance traveled, and boxouts. These statistics might improve our correlations and predictive models.

Based on these findings, we would recommend that NBA general managers look more into a team's hustle when trying to create a championship team. Our statistics show that teams that hustle more tend to win more. Because wins are the most important team statistic, we conclude that our HustleScore is a meaningful statistic. Moneyball 2.0?!