<a href="https://colab.research.google.com/github/ssaltwick/ENEE324-Project/blob/master/ENEE324H_Final_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ENEE324H Final Project
## Sam Saltwick & Frank Lopez

### Introduction
Our project seeks to cluster basketball players into their respective positions based on a select grouping of their per game average statistics. 

To see the project in action, check out the 'Try it Out' section at the bottom.

### Setup
This section contains setup for the code to run, including package imports, data importing, and an enumeration definition.

In [0]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stat
import math
import pandas as pd
from enum import Enum
from mpl_toolkits import mplot3d
from itertools import combinations



class Position(Enum):
  PG = 0
  SG = 1
  SF = 2
  PF = 3
  C = 4
  
pos_names = {
    Position.PG : "PG",
    Position.SG : "SG",
    Position.SF : "SF",
    Position.PF : "PF",
    Position.C : "C",
}

# Data is hosted on a github repo
pg_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data%20-%20PG-clean.csv'
sg_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data%20-%20SG-clean.csv'
sf_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data%20-%20SF-clean.csv'
pf_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data%20-%20PF-clean.csv'
c_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data%20-%20C-Clean.csv'
test_url = 'https://raw.githubusercontent.com/ssaltwick/ENEE324-Project/master/data/Data%20-%20Test-Clean.csv'

urls = {
    Position.PG : pg_url,
    Position.SG : sg_url,
    Position.SF : sf_url,
    Position.PF : pf_url,
    Position.C : c_url
}

# Get test data
test_data = pd.read_csv(test_url).dropna()
test_data['Player'] = test_data['Player'].str.split('\\').str[0]




### Data 

For this project we used accumulated NBA player statistics from 2014 through the most recent season (2018-2019). We gathered the data from [basketball-reference.com](https://www.basketball-reference.com/). Our preliminary data included 11 stats (FG%, 3PA, 3P%, 2P%, FT%, ORB, TRB, AST, STL, BLK, PTS), which we reduced to a smaller set through exploratory data analysis. This process is described in the Approach section. Our raw data can be found [here](https://docs.google.com/spreadsheets/d/1-6pwAv2Xz1BJTd7Pb0yBK6wmOlAbqYu_X7MPdkgKyEs/edit?usp=sharing).

We used the data from 2014-2018 to train our model for each position and then tested the model on each player's most recent season (2018-2019). To keep the per-game averages from being dominated by small samples, we only used players who played 30 games or more while averaging at least 10 minutes per game.


### Approach

In order to cluster our players into their positions, we used Bayes' law to estimate the quantity P(Position | Player). Here we define a position as one of the 5 positions in basketball (PG, SG, SF, PF, C) and a player as a series of statistics representing that player in the 2018-2019 season. For each player, we calculate this quantity 5 times, once for each position. Then, with 5 values for each player, we take the maximum value and select the associated position as the player's position.

Our application of Bayes' law reduces to the calculation P(Player | Position) * P(Position). We chose to ignore the denominator, P(Player), which is the sum of this product over all positions. It is the same for every position, and since we only compare the values against each other and select the maximum, they do not need to sum to 1. The prior, P(Position), was simply calculated as (# players in position / # players). Our likelihood function, P(Player | Position), was assumed to be a Gaussian distribution. This assumption is based on the observation that each statistic was close to normally distributed within each position. According to Bayes' law, the product of our prior and our likelihood function gives us a value proportional to P(Position | Player).
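
To make the point about the denominator concrete, here is a minimal sketch with made-up numbers (the likelihood and prior values are hypothetical, not taken from our data). It shows that dividing every score by the same constant changes the values but not which position wins.

```python
import numpy as np

# Hypothetical numbers for one player, for illustration only.
positions = ["PG", "SG", "SF", "PF", "C"]
likelihood = np.array([0.020, 0.050, 0.030, 0.010, 0.005])  # P(Player | Position)
prior = np.array([0.20, 0.22, 0.19, 0.20, 0.19])            # P(Position)

# Unnormalized score: likelihood times prior.
score = likelihood * prior

# Dividing by P(Player), the sum of all scores, gives true posteriors.
posterior = score / score.sum()

# The scaling is identical for every position, so the winner is unchanged.
best_unnormalized = positions[int(np.argmax(score))]
best_normalized = positions[int(np.argmax(posterior))]
print(best_unnormalized, best_normalized)
```

Normalizing would only matter if we wanted to report an actual probability for each position rather than just pick the largest.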

For our likelihood function, we used an N-dimensional Gaussian distribution whose mean and covariance were taken as the mean and covariance of our dataset, accumulated over all the years that we used. The dimension of this function (N) changes based on the number of statistics used to develop the model. This process is discussed below in regard to our two-stage process.
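
As an illustration of this likelihood, the sketch below evaluates an N-dimensional Gaussian density by hand and compares it with SciPy's `multivariate_normal`. The data is synthetic (random draws standing in for one position's training rows), and the helper name `gaussian_density` is ours for this example only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, cov):
    """Density of an N-dimensional Gaussian at point x."""
    n = len(mu)
    diff = x - mu
    # Mahalanobis term uses the inverse covariance; solve() avoids forming it.
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

# Synthetic stand-in for one position's training rows (3 statistics).
rng = np.random.default_rng(0)
samples = rng.normal(loc=[5.0, 0.35, 0.5], scale=[2.0, 0.05, 0.3], size=(200, 3))
mu = samples.mean(axis=0)
cov = np.cov(samples, rowvar=False)

x = np.array([6.0, 0.36, 0.4])
manual = gaussian_density(x, mu, cov)
reference = multivariate_normal(mean=mu, cov=cov).pdf(x)  # SciPy reference value
print(manual, reference)
```

The two printed values should agree to floating-point precision, which is a quick way to sanity-check a hand-written density.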

Initially, we calculated the likelihood function of each player for each position and multiplied it by the probability of that position. This gave us a value for each position for each player, which we used to cluster the players into positions. Preliminary testing found that the best statistics to use for this method were 3-point attempts, 3-point percentage, 2-point percentage, and blocks. While we had 11 total statistics to utilize, adding more statistics generally led to poor results. Certain statistics, such as points, had stronger effects on the accuracy than others. When the points per game statistic was introduced into the model, it dropped the accuracy significantly every time. After running our test data through this model, we found an accuracy of **41.01%**. Seeking to improve this accuracy, we thought of a different way to cluster our data.

Our final approach includes two stages. The first stage clusters the data into 3 bins: [PG, SF], [SG, PF] and [C]. These bins were determined through iterative testing, calculating the accuracy of all 51 partitions of the list of positions and finding the best one. For each partition of the list, we also found the best combination of statistics to use. Since we found that using all of the statistics that we prepared usually resulted in worse accuracy, we tested each number of statistics and determined that the best combination for stage 1 was 3-point attempts, 2-point percentage, assists, free throw percentage, steals, blocks, and offensive rebounds (all per-game averages). Using this combination of 7 stats, we cluster each point in the test data into one of the three bins. At this stage, we were able to get up to **59.27%** accuracy in placing players into the correct bin. The three bins were then entered into stage two of our algorithm.
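
The search space described above can be sketched as follows. This is an illustrative enumeration, not the code we ran: `set_partitions` is a helper written for this example. A five-element set has 52 partitions when the single-bin and all-singletons cases are included, so the count of 51 quoted above presumably leaves one of them out.

```python
from itertools import combinations

def set_partitions(items):
    """Yield every way to split items into non-empty, unordered bins."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing bin in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a bin of its own.
        yield [[first]] + partition

positions = ["PG", "SG", "SF", "PF", "C"]
all_partitions = list(set_partitions(positions))
print(len(all_partitions))  # 52

# Candidate statistic subsets of one size, e.g. 7 of the 11 stats.
stats = ["FG%", "3PA", "3P%", "2P%", "FT%", "ORB", "TRB", "AST", "STL", "BLK", "PTS"]
seven_stat_subsets = list(combinations(stats, 7))
print(len(seven_stat_subsets))  # 330
```

Each (partition, statistic subset) pair would then be scored by running the classifier on the test data and keeping the most accurate one.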

Stage two works similarly to stage 1, except that it starts with our three bins and results in a cluster for each position. We reused the process from stage 1 to determine the optimal set of statistics to use with stage 2 for each bin. For the first bin, [PG, SF], we found the optimal statistics to be 3-point percentage, offensive rebounds, and total rebounds. For the second bin, [SG, PF], the best statistics to use were 2-point percentage and total rebounds. Since our last bin only contains Centers, it does not need to run through stage 2.

After stage two we are left with 5 values for each player, proportional to the probability that each player is the corresponding position. The position associated with the maximum value is chosen as that player's position. With this two-stage method, we found a total accuracy of **48.31%**.

We compute accuracy as the number of players clustered correctly divided by the total number of players clustered.
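
A minimal sketch of this calculation, using hypothetical labels rather than our real test set (the `accuracy` helper is named here for illustration):

```python
def accuracy(predicted, actual):
    """Fraction of players whose predicted position matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels for illustration only.
predicted = ["PG", "C", "SF", "SG", "PF", "C"]
actual = ["PG", "C", "SG", "SG", "C", "PF"]
print(f"{accuracy(predicted, actual):.2%}")  # 50.00%
```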

### Results

As previously discussed, our final process resulted in an overall accuracy of **48.31%** using the optimal combination of two-stage clustering and selected statistics. Our original method, a single-stage clustering directly into positions, resulted in only 41.01% accuracy. Our improved method increased the accuracy of the model by over 7 percentage points.

While our accuracy is low, we do believe that our clustering model works relatively well. Considering that there are 5 different options for position, guessing uniformly at random would be correct about 20% of the time. Our model does much better than this, correctly clustering almost 2.5 times as many players as random guessing would. Because of this comparison, we believe that our clustering process can work to determine a player's position from their statistics.
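
The comparison can be checked with a few lines of arithmetic. The two accuracy figures are the ones reported above; the 20% baseline assumes each of the five positions is guessed with equal probability.

```python
single_stage = 41.01   # percent, from the text
two_stage = 48.31      # percent, from the text
num_positions = 5

chance = 100 / num_positions
ratio = two_stage / chance
gain = two_stage - single_stage

print(f"chance baseline: {chance:.2f}%")
print(f"two-stage vs chance: {ratio:.2f}x")
print(f"gain over single stage: {gain:.2f} points")
```

This gives a ratio of about 2.42 and a gain of 7.30 percentage points. A baseline that always guessed the most common position could score slightly above 20% if the positions are not equally represented in the test data.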

### Further Work

Many further investigations into this topic are possible. The weakest point of the model is the assumption that each statistic is approximately normally distributed within each position. While this did appear to be true, plotting 3 statistics together with a Gaussian ellipsoid reveals that many players still fall outside of the model's predicted range. Developing a different likelihood function through other statistical methods could improve the accuracy of our model.

Another extension of this work would be extending our training data to include more NBA seasons. We only included 4 seasons of training data, but including more (10+ seasons) could improve our accuracy. One reason this may not be the case is that the game of basketball has changed drastically over the years. As the game changes, the role of each position changes with it, which would skew our results and over many years could result in an underfitted model.

Alternatively we could test our data on previous seasons to examine how well our position models hold up for years before they were trained on. We could also introduce players from leagues other than the NBA (NCAAM, NCAAW, NBA G-League) to evaluate if our models for positions fit players across all forms of basketball.


In [0]:
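def evaluate_likelihood(position, player, avgs, covs, prior):
  """
  Evaluates a player against a position's Gaussian likelihood.
  params: position = Position Enum
          player = numpy array of stats
          avgs, covs = per-position mean vectors and covariance matrices
          prior = unused here; the caller multiplies by the prior
  """
  mu = avgs[position]
  sig = covs[position]
  diff = player - mu

  # Normalizing constant of an N-dimensional Gaussian, N = number of stats
  p = math.sqrt(2 * math.pi) ** len(mu)
  c = 1 / (p * math.sqrt(np.linalg.det(sig)))

  # Quadratic term uses the inverse covariance; solve() avoids an explicit inverse
  t = np.dot(diff, np.linalg.solve(sig, diff))

  return c * math.exp(-0.5 * t)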

In [0]:
def guess_position(player, avgs, covs, prior, pos):
  """Returns the position in pos with the largest likelihood * prior."""
  positions = {}

  for p in pos:
    positions[p] = evaluate_likelihood(p, player, avgs, covs, prior) * prior[p]

  # Pick the position whose score is largest
  return max(positions, key=positions.get)

In [0]:
def stage1(player, round1_stats, labels):
  


  
  positions = [Position.PG, Position.SG, Position.SF, Position.PF, Position.C]
  

  bins = {}
  
  for l in labels:
    tmp = [pos_names[x] for x in l]
    bins[''.join(tmp)] = []
    
  
  data = {}
  for p in positions:
    data[p] = pd.read_csv(urls[p]).dropna()[round1_stats]  



  total_players = 0
  for p in positions:
    total_players += data[p].shape[0]



  # Per-position priors (share of training players), means, and covariances

  prior = {}
  avgs = {}
  covs = {}


  # Estimate all three from the training data

  for p in positions:
    prior[p] = data[p].shape[0] / total_players

  for p in positions:
    avgs[p] = data[p].mean(0).to_numpy()
    covs[p] = data[p].cov().to_numpy()

  
  guess = guess_position(player[:,0], avgs, covs, prior, positions)
    
    
  for l in labels:
    tmp = [pos_names[x] for x in l]
    k = ''.join(tmp)
    if guess in l:
      bins[k].append(player)
       
  result = ""
  for k in bins.keys():
    if bins[k]:
      result = k
      
  return result
  
  

In [0]:
def stage2(labels, stats, test_player):
  
  positions = [Position.PG, Position.SG, Position.SF, Position.PF, Position.C]
  
  test_player = test_player[stats].to_numpy()
  
  bins = {}
  for p in labels:
    bins[p] = []
  
  # pos_names is defined once in the Setup cell and reused here
  
  
  data = {}
  for p in positions:
    data[p] = pd.read_csv(urls[p]).dropna()[stats]  


  total_players = 0
  for p in positions:
    total_players += data[p].shape[0]



  # Per-position priors (share of training players), means, and covariances

  prior = {}
  avgs = {}
  covs = {}


  # Estimate all three from the training data

  for p in positions:
    prior[p] = data[p].shape[0] / total_players

  for p in positions:
    avgs[p] = data[p].mean(0).to_numpy()
    covs[p] = data[p].cov().to_numpy()
    
    
    

     
  

  
  
  
  guess = guess_position(test_player.reshape(len(stats),), avgs, covs, prior, labels)

  bins[guess].append(test_player)
  

  
  return pos_names[guess]

### Evaluate Model
Evaluates the model given a player

In [0]:
def run_model(selected_player, test_data):
  # Retrieve selected player from test data

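  test_player = test_data.loc[test_data['Player'] == selected_player]
  if test_player.empty:
    raise ValueError("%s was not found in the test data" % selected_player)

  # Actual position, taken from the first matching row
  test_pos = test_player['Pos'].iloc[0]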
  status = 0

  # print("Here is %s's stats for the 2018-2019 season" % selected_player)
  # print(test_player)
  
  
  # Stage 1 - cluster into [PG, SF], [SG, PF], [C] based on 
  # 3PA, 2P%, AST, FT%, STL, BLK, ORB
  s1_stats = ['3PA', '2P%', 'AST', 'FT%', 'STL', 'BLK', 'ORB']
  groups = [[Position.PG, Position.SF],[Position.SG, Position.PF],[Position.C]]
  s1_result = stage1(np.transpose(test_player[s1_stats].to_numpy()), s1_stats, groups)
  
  s1_worked = False
  if test_pos in s1_result:
    print("Stage 1 correctly categorized %s as %s" % (selected_player, s1_result))
    s1_worked = True

  # Stage 2 - Take whichever bin they were placed into and split based on 
  # [3P%, ORB, TRB], [2P%, TRB], [3PA]
  
  s2_stats = {
      "PGSF" : ['3P%', 'ORB', 'TRB'],
      "SGPF" : ['2P%', 'TRB'],
      "C" : ['3PA']
  }

  s2_result = "Error"
  if s1_result == "PGSF":
    s2_result = stage2([Position.PG, Position.SF], s2_stats["PGSF"], test_player)
  elif s1_result == "SGPF":
    s2_result = stage2([Position.SG, Position.PF], s2_stats["SGPF"], test_player)
  else:
    s2_result = 'C'

  if s2_result == test_pos:
    print("Stage 2 correctly categorized %s as a %s" % (selected_player, s2_result))
    status = 1
  else:
    print("%s was categorized as a %s in stage 2 but is actually a %s. Try another player." %(selected_player, s2_result, test_pos))
    
  return status
  

  

### Try it Out
To test out our model, enter an NBA player's name in the field below to see if it guesses the correct position.

In [331]:
selected_player = "James Harden" #@param {type:"string"}

_ = run_model(selected_player, test_data)
# all_names = test_data['Player']
# sample = all_names.shape[0]

# correct = 0;
# for n in all_names:
#   correct += run_model(n, test_data)

# print(correct / sample)




James Harden was categorized as a C in stage 2 but is actually a PG. Try another player.
