## Session 15, Assignment 1

In this assignment, students will be using the K-nearest neighbors algorithm to predict how many points NBA players scored in the 2013-2014 season.

### A look at the data

Before we dive into the algorithm, let’s take a look at our data. Each row in the data contains information on how a player performed in the 2013-2014 NBA season. 

Download the 'nba_2013.csv' file from this link: <br>
https://www.dropbox.com/s/b3nv38jjo5dxcl6/nba_2013.csv?dl=0 <br>
Here are some selected columns from the data: <br>
player - name of the player <br>
pos - the position of the player <br> 
g - number of games the player was in <br> 
gs - number of games the player started <br>
pts - total points the player scored <br>
There are many more columns in the data, mostly containing information about average player game performance over the course of the season. See this site for an explanation of the rest of them. 

We can read our dataset in and figure out which columns are present: <br>

In [1]:
import pandas as pd

# read_csv accepts a file path directly, so no explicit open() is needed.
nba = pd.read_csv("nba_2013.csv")
print(nba.columns.values)

['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']


### Checking for null values 

In [2]:
nba.isnull().any()

player          False
pos             False
age             False
bref_team_id    False
g               False
gs              False
mp              False
fg              False
fga             False
fg.              True
x3p             False
x3pa            False
x3p.             True
x2p             False
x2pa            False
x2p.             True
efg.             True
ft              False
fta             False
ft.              True
orb             False
drb             False
trb             False
ast             False
stl             False
blk             False
tov             False
pf              False
pts             False
season          False
season_end      False
dtype: bool

### Filling NaN values

In [3]:
# Fill each percentage column with its own mean. Assign the result back rather
# than calling fillna(inplace=True) on a selected column, which newer pandas
# versions warn about and may not apply to the original frame.
for col in ["fg.", "x2p.", "efg.", "x3p.", "ft."]:
    nba[col] = nba[col].fillna(nba[col].mean())
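
The mean-filling step can be checked on a small frame. The sketch below uses a made-up `toy` DataFrame (not the NBA data) and assumes the missing percentages come from players with zero attempts, where makes divided by attempts is 0/0:

```python
import pandas as pd

# Toy stand-in for the NBA frame; the values are invented for illustration.
toy = pd.DataFrame({
    "fg": [4, 0, 6],    # field goals made
    "fga": [8, 0, 10],  # field goals attempted
})

# A shooting percentage is makes / attempts, so zero attempts gives 0/0 = NaN.
toy["fg."] = toy["fg"] / toy["fga"]
missing_before = int(toy["fg."].isnull().sum())

# Fill the gap with the column mean; mean() skips NaN values by default.
toy["fg."] = toy["fg."].fillna(toy["fg."].mean())
missing_after = int(toy["fg."].isnull().sum())

print(toy)
```

Mean imputation keeps every row in the dataset, at the cost of giving a player with no attempts an average shooting percentage.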

### Selecting only the numeric columns from the dataset

In [4]:
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']
nba_numeric = nba[distance_columns]

### Normalize all of the numeric columns

In [5]:
nba_normalized = nba_numeric.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
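
K-nearest neighbors relies on distances between rows, so a column with a large range (such as total points) would otherwise dominate a column with a small range (such as age). A minimal sketch of the same min-max formula on made-up values shows the effect:

```python
import pandas as pd

# Invented values; only the scaling formula matches the notebook.
toy = pd.DataFrame({"age": [20, 25, 30], "pts": [0, 500, 2000]})

# (x - min) / (max - min) maps every column onto the range [0, 1].
toy_normalized = toy.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
print(toy_normalized)
```

Each column now runs from 0 to 1. Note that a column whose values are all identical would produce 0/0 under this formula, so such columns would need separate handling.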

### Selecting only the categorical columns from the dataset

In [6]:
nba_category = nba[['player', 'bref_team_id', 'season']]

### Concatenating numeric and categorical columns

In [7]:
nba = pd.concat([nba_category, nba_normalized], axis=1)

In [8]:
from sklearn.model_selection import train_test_split

# The columns that we will be making predictions with.
x_columns = nba[['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf']]

# The column that we want to predict.
y_column = nba["pts"]

x_train, x_test, y_train, y_test = train_test_split(x_columns, y_column, test_size=0.3, random_state=0)

### Creating the KNN model with a regressor, because we are predicting continuous values

In [9]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics
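
As a rough illustration of what the regressor does with its default settings (Euclidean distance, uniform weights), the sketch below predicts a value by averaging the targets of the k closest training rows and compares the result with `KNeighborsRegressor`. The helper `knn_predict` and the toy arrays are hypothetical and are not part of the assignment:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_predict(x_train, y_train, x_query, k):
    """Average the targets of the k training rows closest to x_query."""
    distances = np.sqrt(((x_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Made-up one-feature data set.
x_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 100.0])
query = np.array([1.2])

manual = knn_predict(x_toy, y_toy, query, k=2)
library = KNeighborsRegressor(n_neighbors=2).fit(x_toy, y_toy).predict([query])[0]
print(manual, library)
```

The two nearest rows to 1.2 are those at 1.0 and 2.0, so both approaches average their targets (10 and 20).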

#### Trying neighbour values from 1 to 20 to see which gives the highest regression (R²) score

In [10]:
for k_value in range(1, 21):
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    # r2_score is the coefficient of determination (1.0 is a perfect fit).
    print("Regression score is:", format(metrics.r2_score(y_test, y_pred), '.4f'), "for k_value:", k_value)

Regression score is: 0.9145 for k_value: 1
Regression score is: 0.9464 for k_value: 2
Regression score is: 0.9548 for k_value: 3
Regression score is: 0.9594 for k_value: 4
Regression score is: 0.9583 for k_value: 5
Regression score is: 0.9579 for k_value: 6
Regression score is: 0.9579 for k_value: 7
Regression score is: 0.9609 for k_value: 8
Regression score is: 0.9576 for k_value: 9
Regression score is: 0.9557 for k_value: 10
Regression score is: 0.9535 for k_value: 11
Regression score is: 0.9506 for k_value: 12
Regression score is: 0.9503 for k_value: 13
Regression score is: 0.9515 for k_value: 14
Regression score is: 0.9498 for k_value: 15
Regression score is: 0.9482 for k_value: 16
Regression score is: 0.9462 for k_value: 17
Regression score is: 0.9456 for k_value: 18
Regression score is: 0.9446 for k_value: 19
Regression score is: 0.9436 for k_value: 20
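
The loop above scores every k on the test set, so the test set is also used to choose k, and the score reported for the chosen k may be somewhat optimistic. One common alternative, sketched here on synthetic data rather than the NBA frame, is to pick k by cross-validation on the training data only. The arrays `x_demo` and `y_demo` are invented for the example:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem standing in for x_train / y_train.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = x_demo.sum(axis=1) + rng.normal(0, 0.05, size=200)

# Mean 5-fold cross-validated R² for each candidate k.
cv_scores = {}
for k_value in range(1, 21):
    knn = KNeighborsRegressor(n_neighbors=k_value)
    cv_scores[k_value] = cross_val_score(knn, x_demo, y_demo, cv=5, scoring="r2").mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Best k by 5-fold CV:", best_k)
```

In the notebook, `x_train` and `y_train` would take the place of the synthetic arrays, leaving `x_test` and `y_test` for a single final evaluation.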


#### Choosing K=8, as it gives the highest regression score (0.9609)

In [11]:
knn = KNeighborsRegressor(n_neighbors=8)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
# 'pts' was min-max normalized above, so the MSE is in normalized units, not raw points.
print("Mean Squared Error is:", format(metrics.mean_squared_error(y_test, y_pred), '.7f'))
print("Regression score is:", format(metrics.r2_score(y_test, y_pred), '.4f'))

Mean Squared Error is: 0.0011143
Regression score is: 0.9609
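
Because `pts` was min-max normalized before training, the MSE above is in normalized units. A small sketch of how it could be converted back to points: the root of the MSE is multiplied by the original range of `pts`. The helper `rmse_in_points` is hypothetical, and the range used in the example call is a placeholder, not the real range of the data:

```python
import math

def rmse_in_points(mse_normalized, pts_min, pts_max):
    """Undo min-max scaling on an error: RMSE scales by (max - min)."""
    return math.sqrt(mse_normalized) * (pts_max - pts_min)

# Placeholder range of 0 to 2000 points, chosen only for illustration.
example = rmse_in_points(0.0011143, pts_min=0, pts_max=2000)
print(example)
```

In the notebook, the actual range is still available from the unscaled frame as `nba_numeric["pts"].min()` and `nba_numeric["pts"].max()`.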


In [12]:
Test_With_Predicted = pd.DataFrame({'Actual Points': y_test.tolist(), 'Predicted Points': y_pred.tolist() })
Test_With_Predicted

| | Actual Points | Predicted Points |
|---|---|---|
| 0 | 0.168145 | 0.125723 |
| 1 | 0.276514 | 0.297243 |
| 2 | 0.422676 | 0.363189 |
| 3 | 0.007327 | 0.011088 |
| 4 | 0.381026 | 0.373939 |
| 5 | 0.160432 | 0.159130 |
| 6 | 0.100656 | 0.097570 |
| 7 | 0.271115 | 0.264221 |
| 8 | 0.032009 | 0.029985 |
| 9 | 0.009641 | 0.014317 |
