# Predicting Player Points Using K Nearest Neighbours in Python (NBA 2013 Players Dataset)

Statistics in basketball are collected and analyzed to evaluate and predict the performance of teams and individual players. Our dataset contains performance statistics for NBA players in the 2013-14 season. Here we attempt to predict the points scored by each player using the K Nearest Neighbours regression algorithm.
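As a rough intuition for what the algorithm does, K Nearest Neighbours regression predicts a value by averaging the targets of the `k` training rows closest to the query point. The snippet below is only a minimal sketch of that idea, assuming Euclidean distance and an unweighted mean; the helper `knn_predict` and the toy numbers are made up for illustration and are not part of the dataset or of scikit-learn.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict a target as the mean target of the k nearest training rows."""
    # Euclidean distance from the query point to every training row
    distances = np.sqrt(((x_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy data (hypothetical): one feature (minutes played) and a target (points)
x_toy = np.array([[100.0], [200.0], [300.0], [1000.0]])
y_toy = np.array([40.0, 80.0, 120.0, 400.0])

# The three nearest rows to 210 minutes are 200, 300 and 100 minutes
prediction = knn_predict(x_toy, y_toy, np.array([210.0]), k=3)
print(prediction)
```

The far-away row (1000 minutes) does not influence the prediction because it is not among the three nearest neighbours.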

Let us first import the necessary libraries and load our dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statistics as stat
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

In [2]:
players_df = pd.read_csv('../input/unsupervised-ml/nba_2013.csv')
print('Dataset Shape: ',players_df.shape)

Dataset Shape:  (481, 31)


In [3]:
# Check for duplicate player entries
players_df['player'].duplicated().value_counts()

False    481
Name: player, dtype: int64

The dataset has 481 rows and 31 columns, and no player name appears more than once.

## The Data

Explanation of the columns:
* `player` - Name of the player
* `pos` - Position of the player (Center(C), Power Forward(PF), Shooting Guard(SG), etc.)
* `age` - Age of the player
* `bref_team_id` - Player team
* `g`, `gs` - Number of Games Played, Number of Games Started by the player
* `mp` - Minutes Played
* `fg`, `fga`, `fg.` - Field Goals, Field Goal Attempts, Field Goal Percentage(FG%=FG/FGA)
* `x3p`, `x3pa`, `x3p.` - 3-Point Field Goals, 3-Point Field Goal Attempts, 3-Point Field Goal Percentage(3P%=3P/3PA)
* `x2p`, `x2pa`, `x2p.` - 2-Point Field Goals, 2-Point Field Goal Attempts, 2-Point Field Goal Percentage(2P%=2P/2PA)
* `efg.` - Effective Field Goal Percentage; eFG% = (FG + 0.5 * 3P) / FGA
* `ft`, `fta`, `ft.` - Free Throws, Free Throw Attempts, Free Throw Percentage(FT%=FT/FTA)
* `orb`, `drb`, `trb` - Offensive Rebounds, Defensive Rebounds, Total Rebounds(TRB=ORB+DRB)
* `ast`, `stl`, `blk` - Assists, Steals, Blocks
* `tov`, `pf` - Turnovers, Personal Fouls
* `pts` - Points
* `season`, `season_end` - Season(2013-14), Season end year(2013)
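The formulas in the list above show that several columns can be computed from others. As a small sketch, assuming a made-up stat line (the `toy` dataframe is hypothetical and not taken from the dataset), the derived columns can be recomputed like this:

```python
import pandas as pd

# Hypothetical stat line for one player (values made up for illustration)
toy = pd.DataFrame({'fg': [100], 'fga': [200], 'x3p': [20], 'x3pa': [60],
                    'orb': [30], 'drb': [70]})

# Derived columns, following the formulas in the column list
toy['fg.'] = toy['fg'] / toy['fga']                         # FG% = FG / FGA
toy['x3p.'] = toy['x3p'] / toy['x3pa']                      # 3P% = 3P / 3PA
toy['efg.'] = (toy['fg'] + 0.5 * toy['x3p']) / toy['fga']   # eFG%
toy['trb'] = toy['orb'] + toy['drb']                        # TRB = ORB + DRB

print(toy[['fg.', 'x3p.', 'efg.', 'trb']])
```

Because these columns carry no information beyond the counts they are built from, they are natural candidates for removal before modelling.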

In [4]:
print(players_df.dtypes)
print(players_df.head(5))

player           object
pos              object
age               int64
bref_team_id     object
g                 int64
gs                int64
mp                int64
fg                int64
fga               int64
fg.             float64
x3p               int64
x3pa              int64
x3p.            float64
x2p               int64
x2pa              int64
x2p.            float64
efg.            float64
ft                int64
fta               int64
ft.             float64
orb               int64
drb               int64
trb               int64
ast               int64
stl               int64
blk               int64
tov               int64
pf                int64
pts               int64
season           object
season_end        int64
dtype: object
          player pos  age bref_team_id   g  gs    mp   fg   fga    fg.  ...  \
0     Quincy Acy  SF   23          TOT  63   0   847   66   141  0.468  ...   
1   Steven Adams   C   20          OKC  81  20  1197   93   185  0.503  ...   
2    

### Dropping Irrelevant Columns

In [5]:
print(players_df.season.value_counts())
print(players_df.season_end.value_counts())

2013-2014    481
Name: season, dtype: int64
2013    481
Name: season_end, dtype: int64


Here,
* `player`, `pos`, `bref_team_id` are categorical features that we treat as irrelevant for this prediction
* `fg.`, `x3p.`, `x2p.`, `efg.`, `ft.`, `trb` can be computed from other features, and are hence redundant
* `season`, `season_end` have the same value in every row, and are hence uninformative

Hence, we drop the above columns from the dataframe.

In [6]:
drop_cols = ['player','pos','bref_team_id','fg.','x3p.','x2p.','efg.', 'ft.','trb','season','season_end']
players_df.drop(drop_cols, axis=1, inplace=True, errors='ignore')
players_df.columns

Index(['age', 'g', 'gs', 'mp', 'fg', 'fga', 'x3p', 'x3pa', 'x2p', 'x2pa', 'ft',
       'fta', 'orb', 'drb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts'],
      dtype='object')

### Missing Values

In [7]:
# Sum of null values in each column
players_df.isnull().sum()

age     0
g       0
gs      0
mp      0
fg      0
fga     0
x3p     0
x3pa    0
x2p     0
x2pa    0
ft      0
fta     0
orb     0
drb     0
ast     0
stl     0
blk     0
tov     0
pf      0
pts     0
dtype: int64

There are no missing values in the remaining columns.

### Splitting Train and Test Data

In [8]:
# The columns that we will be making predictions with.
x = players_df[['age', 'g', 'gs', 'mp', 'fg', 'fga', 'x3p', 'x3pa', 'x2p', 'x2pa', 
                'ft', 'fta', 'orb', 'drb', 'ast', 'stl', 'blk', 'tov', 'pf']]

# The column that we want to predict.
y = players_df['pts']

# Split 80% training data and 20% testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)
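To make the effect of `test_size` and `random_state` concrete, here is a small sketch on made-up arrays (the names `x_demo` and `y_demo` are hypothetical stand-ins for the feature matrix and target). It shows the resulting shapes and that repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features
x_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# 80% of the rows go to training, 20% to testing
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=5)
print(x_tr.shape, x_te.shape)

# The same random_state gives the same split on a second call
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(x_demo, y_demo, test_size=0.2, random_state=5)
print(np.array_equal(y_te, y_te2))
```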

### Model Training and Calculating Regression Score

In [9]:
# Train for k=1 to 10, and calculate R-square for prediction
reg_score = []
for k_value in range(1, 11):
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    # Keep the score as a float so that max() compares numbers, not strings
    r_square = metrics.r2_score(y_test, y_pred)
    print('Regression score is:', format(r_square, '.4f'), 'for k_value:', k_value)
    reg_score.append(r_square)
print()
# k values start at 1, list indices at 0
max_k_value = reg_score.index(max(reg_score)) + 1
print('Max Regression Score is:', format(max(reg_score), '.4f'), 'for k_value:', max_k_value)

Regression score is: 0.9744 for k_value: 1
Regression score is: 0.9804 for k_value: 2
Regression score is: 0.9840 for k_value: 3
Regression score is: 0.9839 for k_value: 4
Regression score is: 0.9842 for k_value: 5
Regression score is: 0.9836 for k_value: 6
Regression score is: 0.9841 for k_value: 7
Regression score is: 0.9819 for k_value: 8
Regression score is: 0.9817 for k_value: 9
Regression score is: 0.9819 for k_value: 10

Max Regression Score is: 0.9842 for k_value: 5
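The regression score reported here is the coefficient of determination, R-square, as returned by `metrics.r2_score`. As a minimal sketch of what that number means, the snippet below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with scikit-learn. The arrays `y_true` and `y_hat` are made-up values, not results from the dataset.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error left over after prediction
ss_res = ((y_true - y_hat) ** 2).sum()
# Total sum of squares: error of always predicting the mean
ss_tot = ((y_true - y_true.mean()) ** 2).sum()

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = metrics.r2_score(y_true, y_hat)
print(r2_manual, r2_sklearn)
```

A score close to 1 means the predictions leave little error compared with simply predicting the mean of the targets.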


### K-Fold Cross Validation

In [10]:
# Prepare cross validation
kfold = KFold(n_splits=10,random_state=5, shuffle=True)
knn = KNeighborsRegressor(n_neighbors = max_k_value)

# Train and evaluate on each split
k_reg_scores = []
for train_index, test_index in kfold.split(x):
    X_train, X_test, Y_train, Y_test = x.iloc[train_index], x.iloc[test_index], y.iloc[train_index], y.iloc[test_index]
    knn.fit(X_train, Y_train) 
    Y_pred = knn.predict(X_test)
    acc = format(metrics.r2_score(Y_test, Y_pred),'.4f')
    k_reg_scores.append(float(acc))

# Print mean R-square of folds
print('K-Fold Cross Validation with k=10:-')
print('Mean Regression Score is:',stat.mean(k_reg_scores),'for k_value =', max_k_value)

K-Fold Cross Validation with k=10:-
Mean Regression Score is: 0.97522 for k_value = 5
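The manual loop over folds can also be written more compactly with scikit-learn's `cross_val_score`, which fits and scores the model on each split. The snippet below is only a sketch on synthetic data (`x_demo` and `y_demo` are generated for illustration and are not the NBA dataset), so its scores are not comparable with the result above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: 200 rows, 3 features, target is a noisy linear combination
rng = np.random.default_rng(5)
x_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = x_demo @ np.array([2.0, 3.0, 1.0]) + rng.normal(0, 1, size=200)

# Same cross-validation setup as in the notebook: 10 shuffled folds
kfold = KFold(n_splits=10, shuffle=True, random_state=5)
knn = KNeighborsRegressor(n_neighbors=5)

# One R-square value per fold
scores = cross_val_score(knn, x_demo, y_demo, cv=kfold, scoring='r2')
print(len(scores), scores.mean())
```

On the notebook's data, passing `x` and `y` in place of the synthetic arrays should perform the same kind of evaluation, although the mean may differ slightly from the printed value because the loop above rounds each fold's score to four decimals.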


### View the Predictions

In [11]:
# Note: y_pred holds the predictions of the last model fitted in In [9] (k=10)
print('10 values of y_test:')
print(np.asarray(y_test.head(10)))
print()
print()
print('Corresponding values of y_pred:')
print(np.array(y_pred.tolist()[0:10]))

10 values of y_test:
[ 592  760  350  895  349  378  799  298 1096  987]


Corresponding values of y_pred:
[ 575.   801.7  369.3  932.3  314.3  452.5  769.2  383.8 1011.3 1022.4]
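To read the two arrays side by side, one option is to place them in a dataframe and add the absolute error per player. The sketch below simply copies the ten values printed above; the column names `actual`, `predicted` and `abs_error` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# The ten values printed above
y_true = np.array([592, 760, 350, 895, 349, 378, 799, 298, 1096, 987])
y_hat = np.array([575.0, 801.7, 369.3, 932.3, 314.3, 452.5, 769.2, 383.8, 1011.3, 1022.4])

# Side-by-side comparison with the absolute error of each prediction
comparison = pd.DataFrame({'actual': y_true, 'predicted': y_hat})
comparison['abs_error'] = (comparison['actual'] - comparison['predicted']).abs()

print(comparison)
print('Mean absolute error:', comparison['abs_error'].mean())
```

This covers only the first ten test rows, so the mean absolute error it prints is not a summary of the whole test set.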
