<a href="https://colab.research.google.com/github/ssaltwick/ENEE324-Project/blob/master/ENEE324H_Final_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ENEE324H Final Project
## Sam Saltwick & Frank Lopez

### Introduction
Our project seeks to cluster basketball players into their respective positions based on a selected subset of their per-game average statistics.

To see the project in action, check out the 'Try It Out' section below.

### Setup
This section contains setup for the code to run, including package imports, data importing, and an enumeration definition.

In [0]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stat
import math
import pandas as pd
from enum import Enum
from mpl_toolkits import mplot3d
from itertools import combinations



class Position(Enum):
  PG = 0
  SG = 1
  SF = 2
  PF = 3
  C = 4
  
pos_names = {
    Position.PG : "PG",
    Position.SG : "SG",
    Position.SF : "SF",
    Position.PF : "PF",
    Position.C : "C",
}

# Data is hosted on a github repo
pg_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data-PG-clean.csv'
sg_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data-SG-clean.csv'
sf_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data-SF-clean.csv'
pf_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data-PF-clean.csv'
c_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data-C-Clean.csv'
test_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data-Test-Clean.csv'

urls = {
    Position.PG : pg_url,
    Position.SG : sg_url,
    Position.SF : sf_url,
    Position.PF : pf_url,
    Position.C : c_url
}

# Get test data
test_data = pd.read_csv(test_url).fillna(0)
test_data['Player'] = test_data['Player'].str.split('\\').str[0]




### Code

Below is all of our code for the project. Collapse this section to hide it. It must stay above the testing cells for them to run properly.

In [0]:
"""
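  Evaluates a player against a position's Gaussian likelihood.
  params: position = Position Enum
          player = numpy array of stats
          avgs, covs = dicts of per-position mean vectors / covariance matrices
          prior = dict of per-position priors (not used here; applied by the caller)

"""
def evaluate_likelihood(position, player, avgs, covs, prior):
  mu = avgs[position]
  sig = covs[position]
  diff = player - mu

  # Normalizing constant of an N-dimensional Gaussian, N = number of stats
  n = len(mu)
  c = 1 / (math.sqrt(2 * math.pi) ** n * math.sqrt(np.linalg.det(sig)))

  # Quadratic form (x - mu)^T Sigma^-1 (x - mu); solve() applies the
  # inverse covariance without forming it explicitly
  t = np.dot(diff, np.linalg.solve(sig, diff))

  return c * math.exp(-0.5 * t)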

In [0]:
def guess_position(player, avgs, covs, prior, pos):
  positions = {}
  
  for p in pos:
    positions[p] = evaluate_likelihood(p, player, avgs, covs, prior) * prior[p]
  
  v = list(positions.values())
  k = list(positions.keys())
  return k[v.index(max(v))]

In [0]:
def stage1(player, round1_stats, labels):
  


  
  positions = [Position.PG, Position.SG, Position.SF, Position.PF, Position.C]
  

  bins = {}
  
  for l in labels:
    tmp = [pos_names[x] for x in l]
    bins[''.join(tmp)] = []
    
  
  data = {}
  for p in positions:
    data[p] = pd.read_csv(urls[p]).fillna(0)[round1_stats]  



  total_players = 0
  for p in positions:
    total_players += data[p].shape[0]



  # Priors (each position's share of all training players), plus the
  # per-position mean vector and covariance matrix of the selected stats

  prior = {}
  avgs = {}
  covs = {}

  for p in positions:
    prior[p] = data[p].shape[0] / total_players

  for p in positions:
    avgs[p] = data[p].mean(0).to_numpy()
    covs[p] = data[p].cov().to_numpy()

  
  guess = guess_position(player[:,0], avgs, covs, prior, positions)
    
    
  for l in labels:
    tmp = [pos_names[x] for x in l]
    k = ''.join(tmp)
    if guess in l:
      bins[k].append(player)
       
  result = ""
  for k in bins.keys():
    if bins[k]:
      result = k
      
  return result
  
  

In [0]:
def stage2(labels, stats, test_player):
  
  positions = [Position.PG, Position.SG, Position.SF, Position.PF, Position.C]
  
  test_player = test_player[stats].to_numpy()
  
  bins = {}
  for p in labels:
    bins[p] = []
  
  pos_names = {
      Position.PG : "PG",
      Position.SG : "SG",
      Position.SF : "SF",
      Position.PF : "PF",
      Position.C : "C",
  }
  
  
  data = {}
  for p in positions:
    data[p] = pd.read_csv(urls[p]).fillna(0)[stats]  


  total_players = 0
  for p in positions:
    total_players += data[p].shape[0]



  # Priors (each position's share of all training players), plus the
  # per-position mean vector and covariance matrix of the selected stats
  # (total_players was already accumulated above, so it is not summed again)
  prior = {}
  avgs = {}
  covs = {}

  for p in positions:
    prior[p] = data[p].shape[0] / total_players

  for p in positions:
    avgs[p] = data[p].mean(0).to_numpy()
    covs[p] = data[p].cov().to_numpy()
  
  guess = guess_position(test_player.reshape(len(stats),), avgs, covs, prior, labels)

  bins[guess].append(test_player)
  

  
  return pos_names[guess]

In [0]:
def run_model(selected_player, test_data):
  # Retrieve selected player from test data
  
  
  test_player = test_data.loc[test_data['Player'].str.lower() == selected_player.lower()]
  
  if test_player.empty:
    print("Sorry, we do not have data on %s. Please try again" % selected_player)
    return -1
  
  player_name = test_player['Player'].tolist()[0]
  test_pos = test_player['Pos'].to_string().split()[1]
  status = 0

  # print("Here is %s's stats for the 2018-2019 season" % selected_player)
  # print(test_player)
  
  # Stage 1 - cluster into [SG, SF], [PG, PF, C] based on
  # 3P%, STL, ORB, Height
  s1_stats = ['3P%', 'STL', 'ORB', 'Height']
  groups = [[Position.SG, Position.SF], [Position.PG, Position.PF, Position.C]]
  s1_result = stage1(np.transpose(test_player[s1_stats].to_numpy()), s1_stats, groups)
  
  s1_worked = False
  if test_pos in s1_result:
    print("Stage 1 correctly categorized %s as %s" % (player_name, s1_result))
    s1_worked = True

  # Stage 2 - Take whichever bin they were placed into and split it into
  # positions using the stats listed for that bin in s2_stats
  
  s2_stats = {
      "SGSF" : ['3PA', '3P%', '2P%', 'AST', 'STL', 'BLK', 'ORB', 'Height'],
      "PGPFC" : ['3PA', '3P%', '2P%', 'ORB', 'Height']
  }

  s2_result = "Error"
  if s1_result == "PGPFC":
    s2_result = stage2([Position.PG, Position.PF, Position.C], s2_stats["PGPFC"], test_player)
  elif s1_result == "SGSF":
    s2_result = stage2([Position.SG, Position.SF], s2_stats["SGSF"], test_player)
 

  if s2_result == test_pos:
    print("Stage 2 correctly categorized %s as a %s" % (player_name, s2_result))
    status = 1
  else:
    print("%s was categorized as a %s in stage 2 but is actually a %s. Try another player." %(player_name, s2_result, test_pos))
    
  return status
  

  

In [0]:
def run_model_ncaa(selected_player, test_data):
  # Retrieve selected player from test data

  test_player = test_data.loc[test_data['Player'] == selected_player]
  
  status = 0

  # print("Here is %s's stats for the 2018-2019 season" % selected_player)
  # print(test_player)
  
  # Stage 1 - cluster into [SG, SF], [PG, PF, C] based on
  # 3P%, STL, ORB, Height
  s1_stats = ['3P%', 'STL', 'ORB', 'Height']
  groups = [[Position.SG, Position.SF], [Position.PG, Position.PF, Position.C]]
  s1_result = stage1(np.transpose(test_player[s1_stats].to_numpy()), s1_stats, groups)
  
  # Stage 2 - Take whichever bin they were placed into and split it into
  # positions using the stats listed for that bin in s2_stats
  
  s2_stats = {
      "SGSF" : ['3PA', '3P%', '2P%', 'AST', 'STL', 'BLK', 'ORB', 'Height'],
      "PGPFC" : ['3PA', '3P%', '2P%', 'ORB', 'Height']
  }

  s2_result = "Error"
  if s1_result == "PGPFC":
    s2_result = stage2([Position.PG, Position.PF, Position.C], s2_stats["PGPFC"], test_player)
  elif s1_result == "SGSF":
    s2_result = stage2([Position.SG, Position.SF], s2_stats["SGSF"], test_player)
 

  return s2_result
  

  

In [0]:
def umd_test():
  
  
  players = {
      "Anthony Cowan" : "PG",
      "Bruno Fernando": "C",
      "Jalen Smith" : "PF",
      "Eric Ayala" : "PG",
      "Darryl Morsell": "SF",
      "Aaron Wiggins": "SG",
      "Serrel Smith Jr": "SG",
      "Ricky Lindo": "SF",
      "Joshua Tomaic": "SF",
      "Ivan Bender": "PF",
  }
  total = len(players.keys())
  correct = 0
  umd_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Test-UMD.csv'
  umd_data = pd.read_csv(umd_url).fillna(0)
  umd_data['Player'] = umd_data['Player'].str.split('\\').str[0]
  
  for k,v in players.items():
    res = run_model_ncaa(k, umd_data)
    if res == v:
      print("Correctly categorized %s as a %s" % (k,v))
      correct += 1
    else:
      print("%s was categorized as a %s but is actually a %s" % (k,res,v))
  
  
  print("Categorized {:.2%} of UMD players correctly.".format(correct/total))


### Try it Out
To test out our model, enter an NBA player's name to see if the model guesses the correct position.


After entering a player, select **Runtime > Run all** and the results will be displayed below.
    
After you've 'Run all' once, you don't have to do it again. To try out different players type in their name and then click the **Play Button** to the upper left of the code below.

**NOTE: Spelling and Punctuation matter when entering player names**

Here are some example players to try out.

* James Harden
* LeBron James
* De'Aaron Fox
* Al-Farouq Aminu
* P.J. Tucker
* CJ McCollum

If you're having trouble with a specific player, refer to our raw data via the link in the *Data* section below.

In [0]:
selected_player = "dirk nowitzki" #@param {type:"string"}

_ = run_model(selected_player, test_data)




Stage 1 correctly categorized Dirk Nowitzki as PGPFC
Stage 2 correctly categorized Dirk Nowitzki as a PF


## Data 

For this project we used accumulated NBA player statistics from 2014 through the most recent season (2018-2019). We gathered the data from [basketball-reference.com](https://www.basketball-reference.com/). Our preliminary data included 11 stats (FG%, 3PA, 3P%, 2P%, FT%, ORB, TRB, AST, STL, BLK, PTS), and our final data added **Height** to those 11 statistics. Through exploratory data analysis we reduced these to a smaller set of statistics; this process is described in the Approach section. Height was not included in the tables provided by basketball-reference, so we pulled it from each player's basketball-reference information page. Our raw data can be found [here](https://docs.google.com/spreadsheets/d/1-6pwAv2Xz1BJTd7Pb0yBK6wmOlAbqYu_X7MPdkgKyEs/edit?usp=sharing).

We used the data from 2014-2018 to train our model for each position and then tested our model on each player's most recent season (2018-2019). To keep per-game averages from being dominated by small samples, we only used players who played 30 or more games while averaging at least 10 minutes per game.
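
The eligibility filter described above could be sketched as follows. This is a minimal illustration on made-up rows, not the actual cleaning script; the column names `G` (games played) and `MP` (minutes per game) are assumptions and may not match the cleaned CSV files.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real player data.
raw = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "G": [72, 12, 45, 30],          # games played (assumed column name)
    "MP": [31.5, 22.0, 8.4, 10.0],  # minutes per game (assumed column name)
})

MIN_GAMES = 30
MIN_MINUTES = 10.0

# Keep players with at least 30 games and at least 10 minutes per game.
eligible = raw[(raw["G"] >= MIN_GAMES) & (raw["MP"] >= MIN_MINUTES)]
print(eligible["Player"].tolist())
```

Both thresholds are inclusive here, so a player with exactly 30 games and 10.0 minutes per game is kept.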


## Approach

In order to cluster our players into their positions, we used Bayes' law to estimate the quantity P(Position | Player). Here we define a position as one of the 5 positions in basketball (PG, SG, SF, PF, C) and a player as a series of player statistics representing a player in the 2018-2019 season. For each player, we calculate this quantity 5 times, once for each position. Then, with 5 values for each player, we take the maximum value and select the associated position as our player's position.

Our application of Bayes' law resulted in the calculation P(Player | Position) * P(Position). We chose to ignore the denominator, P(Player), which is a summation over all positions: it is the same for every position, and since we only compare these values against each other and select the maximum, we do not need them to sum to 1. The P(Position) value, or the prior, was simply calculated as (# players in position / # players). Our likelihood function, P(Player | Position), was assumed to be a Gaussian distribution. This assumption is based on the fact that each statistic was found to be close to normally distributed across positions. According to Bayes' law, the product of our prior and our likelihood function gives us a value proportional to P(Position | Player).
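
The decision rule above can be illustrated with a small sketch. The player counts and likelihood values below are made up for illustration; only the rule itself (prior times likelihood, then take the maximum) comes from the text.

```python
# Hypothetical training counts per position (illustrative numbers only).
counts = {"PG": 90, "SG": 110, "SF": 100, "PF": 95, "C": 105}
total = sum(counts.values())

# Prior P(Position) = (# players in position) / (# players).
priors = {pos: n / total for pos, n in counts.items()}

# Hypothetical likelihood values P(Player | Position) for one player.
likelihoods = {"PG": 0.002, "SG": 0.010, "SF": 0.012, "PF": 0.004, "C": 0.001}

# Unnormalized posterior: proportional to P(Position | Player).
scores = {pos: likelihoods[pos] * priors[pos] for pos in counts}
best = max(scores, key=scores.get)

# Dividing by the common denominator P(Player) rescales every score
# by the same factor, so the winning position does not change.
evidence = sum(scores.values())
posteriors = {pos: s / evidence for pos, s in scores.items()}
assert max(posteriors, key=posteriors.get) == best
print(best)
```

Normalizing is only needed if the actual probabilities are of interest; for picking the most likely position, the unnormalized scores are enough.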

For our likelihood function, we used an N-dimensional Gaussian distribution whose mean and covariance were estimated as the sample mean and covariance of each position's data, accumulated over all years that we used. The dimension of this function (N) changes based on the number of statistics used to develop the model. This process is discussed below in regard to our two-stage process.
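
As a reference point, an N-dimensional Gaussian density can be evaluated as in the sketch below. It works in log space, which differs from the direct density evaluation in our code cells and is swapped in here only because it is less prone to underflow when N grows; the function name `gaussian_log_density` is hypothetical.

```python
import numpy as np

def gaussian_log_density(x, mu, sigma):
    """Log of the N-dimensional Gaussian density at x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse.
    quad = diff @ np.linalg.solve(sigma, diff)
    # slogdet is more stable than log(det(sigma)) for ill-conditioned matrices.
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

# Sanity check: a 2-D standard normal has density 1 / (2 * pi) at its mean.
value = np.exp(gaussian_log_density([0.0, 0.0], [0.0, 0.0], np.eye(2)))
print(np.isclose(value, 1.0 / (2.0 * np.pi)))
```

Adding the log of the prior to this value gives a score whose maximum picks the same position as the product of prior and likelihood.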

### Initial One-Stage Approach

Initially, we calculated the likelihood function of each player for each position and multiplied it by the probability of that position. This gave us a value for each position for each player proportional to the probability that each player belongs to each position. Each player is assigned to the position with the maximum value calculated above.

After running our test data (without height) through this model, we found an accuracy of **41.01%**. Seeking to improve this accuracy, we thought of a different way to cluster our data.

*Note:*
We only ran this method without including height in our data. After realizing the two-stage method described below provided more accurate results, we did not specifically perform this approach again after adding height to our data.

### Final Two-Stage Approach



Our final approach includes two stages. The first stage clusters players into "bins" that best fit their stats, while the second stage clusters each player into the position inside their respective bin that best fits their stats. 

The bins were determined through iterative testing, calculating the accuracy of all 51 partitions of the list of positions, [PG, SG, SF, PF, C], and finding the one that yielded the highest accuracy.

*Example Partitions include:* 

    [PG], [SG], [SF], [PF], [C]
    [PG, SG], [SF], [PF], [C]
    [PG, SG, SF], [PF], [C]
    [PG, SG, SF, PF], [C]
    [PG, SG, SF, PF, C]
    [PG, SG], [SF, PF, C]
    [PG, SG, SF], [PF, C]

To find the partition that produced the highest accuracy, we found the combination of statistics that yielded the highest percent accuracy for all 51 partitions. We did this through a two-stage process, as previously mentioned.
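
One way to enumerate candidate partitions is sketched below. This is a generic set-partition generator written for illustration, not the search code used for our results; the helper name `set_partitions` is hypothetical. A full enumeration of five items yields 52 partitions, so it should be read as an illustration rather than a reproduction of the exact 51 evaluated here.

```python
def set_partitions(items):
    """Yield every way to split `items` into non-empty, unordered bins."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing bin in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a new bin of its own.
        yield [[first]] + partition

positions = ["PG", "SG", "SF", "PF", "C"]
all_partitions = list(set_partitions(positions))
print(len(all_partitions))
for partition in all_partitions[:3]:
    print(partition)
```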

**Stage 1**

In Stage 1, we receive *m* values (*i1, i2, ..., im*) for each player, where *m* is the number of bins in the current partition. These values are proportional to the probability that each player belongs to each bin. Each player is assigned to the bin, *x*, that had the maximum value *ix*.

For each partition, we wanted to find the stats that clustered players into each bin with the highest accuracy. To do this, we iterated through every possible combination of stats (FG%, 3PA, 3P%, 2P%, FT%, ORB, TRB, AST, STL, BLK, PTS, and Height) for sizes 1-12 and returned the combination with the highest accuracy.
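
The exhaustive search over stat combinations can be sketched as follows. The scoring function here is a placeholder assumption; in the real search it would train the model on the chosen stats and return the resulting accuracy. The names `best_stat_subset` and `toy_score` are hypothetical.

```python
from itertools import combinations

stats = ["FG%", "3PA", "3P%", "2P%", "FT%", "ORB",
         "TRB", "AST", "STL", "BLK", "PTS", "Height"]

def best_stat_subset(stats, score):
    """Return (best_subset, best_score) over all non-empty subsets of stats."""
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(stats) + 1):
        for subset in combinations(stats, size):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Placeholder scorer (an assumption for illustration): rewards overlap with
# an arbitrary target set and penalizes every extra stat.
def toy_score(subset):
    target = {"3P%", "STL", "ORB", "Height"}
    chosen = set(subset)
    return len(chosen & target) - len(chosen - target)

n_subsets = sum(1 for k in range(1, len(stats) + 1)
                for _ in combinations(stats, k))
print(n_subsets)  # 4095 = 2**12 - 1 candidate subsets
print(best_stat_subset(stats, toy_score))
```

With 12 stats there are 4095 non-empty subsets per bin assignment, which is small enough to search exhaustively.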


**Stage 2**

Stage 2 works similarly to Stage 1. However, instead of clustering players into bins, we found the stats that cluster players in each bin into the positions with the highest accuracy using the same method as Stage 1.

Stage 2 generates *n* values (*j1, j2, ..., jn*), where *n* is the size of the current bin of the current partition. These values are proportional to the probability that each player belongs to each position inside the player's respective bin. Each player is assigned to the position inside their respective bin, *y*, that had the maximum value *jy*.

**Total Accuracy**

Our accuracy calculation is done by tallying the number of players that we clustered correctly and dividing it by the total number of players clustered.
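
A minimal sketch of this tally, using made-up predictions rather than our test data (the function name `accuracy` is hypothetical):

```python
def accuracy(predicted, actual):
    """Fraction of players whose predicted position matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one player")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical predictions for five players (illustrative only).
predicted = ["PG", "SG", "SF", "SF", "C"]
actual = ["PG", "SG", "SF", "PF", "C"]
print("{:.2%}".format(accuracy(predicted, actual)))  # 80.00%
```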

**Notes**

For the One-Stage Model and for both stages of the Two-Stage Model, we found that using too many stats led to less accurate results. The reason is that stats such as points, total rebounds, and field goal percentage had similar means and variances across all five positions, so they did little to separate the positions and decreased the accuracy.

We ran this method both with and without height included in our data set; including height provided more accurate results.

The one-stage approach was one of the 51 partitions of the two-stage approach.

## Results

### One-Stage Approach Without Height

Utilizing the One-Stage approach described above without height included in our data provided the following results:

**Ideal Statistics**

The statistics that yielded the highest accuracy (**41.01%**) in clustering players into all 5 positions were:

*  Three point percentage
*  Two point percentage
*  Assists
*  Free throw percentage
*  Steals
*  Blocks
*  Offensive Rebounds


### Two-Stage Approach Without Height

Utilizing the Two-Stage approach described above without height included in our data provided the following results:

**Ideal Partition**

Our ideal partition resulted in the following three bins:

    [PG, SF], [SG, PF], [C]
    
**Stage 1**

The statistics that yielded the highest accuracy (**59.27%**) in clustering players into these three bins were:

*  Three point attempts
*  Two point percentage
*  Assists
*  Free throw percentage
*  Steals
*  Blocks
*  Offensive rebounds

**Stage 2**

The statistics that yielded the highest accuracy in clustering players in the [PG, SF] bin into the *PG* and *SF* positions were:

*  Three point percentage
*  Offensive rebounds
*  Total Rebounds

The statistics that yielded the highest accuracy in clustering players in the [SG, PF] bin into the *SG* and *PF* positions were:

*  Two point percentage
*  Total rebounds

Since the [C] bin only holds one position (Center), all players clustered into that bin in Stage 1 are already clustered into a final position. Thus, Stage 2 has no effect on this bin.

**Total Accuracy**

Using the partition and stats for each stage mentioned above we achieved a peak accuracy of **48.31%** for this model.


### Two-Stage Approach With Height

Utilizing the Two-Stage approach described above with height included in our data provided the following results:

**Ideal Partition**

Our ideal partition resulted in the following two bins:

    [SG, SF], [PG, PF, C]
    
**Stage 1**

The statistics that yielded the highest accuracy (**81.74%**) in clustering players into these two bins were:

*  Three point percentage
*  Steals
*  Offensive Rebounds
*  Height

**Stage 2**

The statistics that yielded the highest accuracy in clustering players in the [SG, SF] bin into the *SG* and *SF* positions were:

*  Three point attempts
*  Three point percentage
*  Two point percentage
*  Assists
*  Steals
*  Blocks
*  Offensive rebounds
*  Height

The statistics that yielded the highest accuracy in clustering players in the [PG, PF, C] bin into the *PG*, *PF*, and *C* positions were: 

*  Three point attempts
*  Three point percentage
*  Two point percentage
*  Offensive rebounds
*  Height

**Total Accuracy**

Using the partition and stats for each stage mentioned above we achieved a peak accuracy of **70.79%**.



### NCAAM Extension

As an extension of our results, we decided to test on a league other than the NBA. For this experiment we gathered data for the UMD men's basketball team and ran our model with their data. While this is only a sample of 10 players, we found an accuracy of **70%** which aligns well with our NBA accuracy.

If we were to generalize this test to more NCAA teams, which wasn't done due to the lack of available data in a usable format, we could conclude more about this relationship. If we were to test on a larger sample of NCAAM basketball players and found that our model's accuracy remained at 70%, then we would conclude that the positional roles are similar across the two leagues. 

To see this test, hit **Runtime > Run all** and the output will appear below.



In [0]:
umd_test()

Correctly categorized Anthony Cowan as a PG
Correctly categorized Bruno Fernando as a C
Correctly categorized Jalen Smith as a PF
Eric Ayala was categorized as a SG but is actually a PG
Darryl Morsell was categorized as a SG but is actually a SF
Correctly categorized Aaron Wiggins as a SG
Correctly categorized Serrel Smith Jr as a SG
Correctly categorized Ricky Lindo as a SF
Correctly categorized Joshua Tomaic as a SF
Ivan Bender was categorized as a SF but is actually a PF
Categorized 70.00% of UMD players correctly.



### Analysis

**Without Height**

Our original method, a Single-Stage approach that clusters players directly into the 5 positions, resulted in only **41.01%** accuracy. Using the Two-Stage approach without height resulted in **48.31%** accuracy, an increase of just over 7 percentage points.

**Adding Height**

Originally, we were not planning to use height as a statistic in our model. However, after some preliminary analysis, we realized the exclusion of height was severely limiting the accuracy of our model. 

Without height, the maximum accuracy we achieved was **48.31%** using the Two-Stage approach. Adding height to our data yielded an accuracy of **70.79%**, an increase of over 22 percentage points.

**Discussion**

Considering that there are 5 possible positions, choosing a position uniformly at random for each player would give an expected accuracy of 20%. All three of our approaches perform much better than this, with the Two-Stage approach with height correctly clustering over 3.5 times as many players as random guessing would. Because of this comparison, we believe that our clustering process determines a player's position from their statistics considerably better than chance.
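
The comparisons in this analysis follow from simple arithmetic on the reported accuracies, as the sketch below shows. The variable names are illustrative, and the baseline assumes a uniform random guess among the five positions.

```python
one_stage = 41.01            # one-stage, without height (%)
two_stage_no_height = 48.31  # two-stage, without height (%)
two_stage_height = 70.79     # two-stage, with height (%)
uniform_baseline = 100 / 5   # uniform random guess among 5 positions (%)

# Differences are in percentage points, not relative percent.
gain_two_stage = round(two_stage_no_height - one_stage, 2)
gain_height = round(two_stage_height - two_stage_no_height, 2)

# How many times better than random guessing the best model is.
ratio_vs_random = two_stage_height / uniform_baseline

print(gain_two_stage, gain_height)
print("{:.1f}x".format(ratio_vs_random))
```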


### Further Work

Many further investigations into this topic are possible. The weakest point of the model is the assumption that each statistic is approximately normally distributed for each position. While this did appear to be true, plotting 3 statistics together with a Gaussian ellipsoid reveals that many players still fall outside of the model's predicted range. Developing a different likelihood function through other statistical methods could improve the accuracy of our model.

Another extension would be expanding our training data to include more NBA seasons. We only included 4 seasons of training data; including more (10+ seasons) could improve our accuracy. One reason this may not be the case is that the game of basketball has changed drastically over the years. As the game changes, the role of each position changes with it, which would skew our results and, over many years, could result in an underfitted model.

Alternatively, we could test our model on earlier seasons to examine how well our position models hold up for years before those they were trained on.
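
One hedged way to quantify how many players fall outside a Gaussian ellipsoid is to compare squared Mahalanobis distances against a chi-square quantile. The sketch below uses synthetic data as a stand-in for three stats of one position; the function name `fraction_inside_ellipsoid` is hypothetical, and this is not the plotting code referred to above.

```python
import numpy as np

# 95% quantile of the chi-square distribution with 3 degrees of freedom.
CHI2_95_DF3 = 7.815

def fraction_inside_ellipsoid(samples, threshold=CHI2_95_DF3):
    """Fraction of rows whose squared Mahalanobis distance is within threshold."""
    samples = np.asarray(samples, dtype=float)
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)
    diff = samples - mu
    # Row-wise quadratic form diff_i^T Sigma^{-1} diff_i.
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    return float(np.mean(d2 <= threshold))

# Synthetic stand-in for three stats of one position (not real player data).
rng = np.random.default_rng(0)
gaussian_like = rng.normal(size=(2000, 3))

inside = fraction_inside_ellipsoid(gaussian_like)
print(inside)
```

If the Gaussian assumption held for a position, roughly 95% of its players would be expected to fall inside this ellipsoid; a clearly different fraction on real data would point to a poor fit.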