**Welcome to my Mini Project. This is NBA Salaries Prediction, version 3. I hope everybody can learn from it. If you find it helpful or spot any problems, please upvote it and share your ideas with me.**

# Contents

1. Introduction & Questions

2. Methods & Results
    
    2.1 Data Cleaning
    
    2.2 Exploratory Analysis
        2.2.1 Descriptive Statistics
        2.2.2 Statistical Inference
        2.2.3 Probability Distributions
    
    2.3 Modelling
        2.3.1 Feature Engineering
            (1) Creating Features
            (2) Pearson's R-Square Correlation
            (3) Multicollinearity Analysis
        2.3.2 Regression
            (1) Measure of Goodness: RMSE
            (2) Selection of Model: Multivariate, Cross Validation, and Bias/Variance Trade-off
            (3) Regularization: Ridge, Lasso, and ElasticNet
        2.3.3 Classification
            (1) Measure of Goodness: Accuracy Score and Confusion Matrix
            (2) Selection of Model: KNN, SVM, Naïve Bayes, Decision Tree, Logistic Regression
            (3) Comparison: Model Tuning, Learning Curve and Curse of Dimensionality

3. Recommendations & Discussions

# 1. Introduction & Questions

In the 2017-18 season, the NBA salary cap and luxury tax threshold reached 99 million and 119 million dollars respectively (Di, 2018). This means that the managers and coaches of all 30 teams have to focus on finding players who are able to take their teams to another level within their budgets. Therefore, a model that can predict players’ salaries from their performance data is needed in the league.

This report uses a dataset of NBA players’ salaries for the 2017-18 season. It comes from Kaggle, the largest online machine learning and data science community (Narayanan, Shi & Rubinstein, 2011). We believe that the dataset is suitable because it covers not only on-ball stats, such as points, rebounds, and assists, but also off-ball stats, which can capture players who have a big impact on the game without the ball in their hands. It also includes identity information, such as age and draft number.

In this report, the 3 key questions are: 
* What are the most important 4 features that influence the salary? 
* What are the most suitable regression and classification models to predict players’ salaries? And how do the models work? 
* What recommendations can be made?

In order to answer these questions, data cleaning, exploratory analysis, and data modelling methods will be applied. Firstly, the dataset is cleaned to make it easier to build models. Secondly, to get familiar with the dataset, this report uses descriptive statistics, statistical inference, and concepts of probability distributions to describe the candidate variables for modelling. Thirdly, we use regression and classification methods to select, build, and evaluate our models.

# 2. Methods & Results

In [None]:
# Importing all the packages that are necessary.

# Loading pandas and numpy for data cleaning and exploratory analysis.

import pandas as pd
from pandas import DataFrame
import numpy as np

# Defining the dataset's new name in this project.
salary_file_path = '../input/2017-18_NBA_salary.csv'
salary_data = pd.read_csv(salary_file_path)

## 2.1 Data Cleaning

Missing values and outliers make the modelling process difficult. Therefore, we use imputation to fill in the missing values. In this case, we fill each missing value with the median of its column, because the median is far less sensitive to outliers than the mean.

In [None]:
# make copy to avoid changing original data when imputing.

copy_data = salary_data.copy()

In [None]:
# First, we check whether the data is complete,
# because missing values have an adverse impact on building the regression model.

null_values_col = copy_data.isnull().sum()
null_values_col = null_values_col[null_values_col != 0].sort_values(ascending = False).reset_index()
null_values_col.columns = ["variable", "number of missing"]
null_values_col.head()

In [None]:
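# Using the median value of each column to fill the N/A values, because it is far less sensitive to outliers than the mean.

def fillWithMedian(data):
    # fillna(..., inplace=True) returns None, so fill in place and return the frame itself.
    # numeric_only=True skips the text columns when computing the medians.
    data.fillna(data.median(numeric_only=True), inplace=True)
    return data

copy_data = fillWithMedian(copy_data)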

In [None]:
copy_data.isnull().any()
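To illustrate why the median was preferred for imputation, here is a minimal sketch on made-up numbers (the `toy` Series is hypothetical and not taken from the dataset). A single extreme value pulls the mean far more than the median, so the imputed value differs a lot between the two choices.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme outlier.
toy = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0])

mean_fill = toy.fillna(toy.mean())      # mean of [1, 2, 3, 100] is 26.5
median_fill = toy.fillna(toy.median())  # median of [1, 2, 3, 100] is 2.5

print(mean_fill[3], median_fill[3])
```

The median-filled value stays close to the bulk of the data, which is the behaviour we want for skewed columns such as salary-related stats.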

## 2.2 Exploratory Analysis 

In [None]:
# read the data

copy_data.head(10)

In [None]:
copy_data.columns

### 2.2.1 Descriptive Statistics

Descriptive statistics help us understand the dataset. The measures used here include the mean, median, mode, standard deviation, and interquartile range. The first three describe the central tendency of the data, and the last two describe its dispersion.

In [None]:
copy_data.describe()

In addition, a histogram of salaries makes the distribution of NBA salaries clear. From the positively skewed histogram, we can read that more than 33% of NBA players earn less than 3 million dollars, while no more than 40 players earn more than 25 million dollars. This imbalance means that managers need to pay close attention to salaries.

In [None]:
# Matplotlib package for visualisation.

import matplotlib.pyplot as plt

copy_data.Salary.hist(bins=20, alpha=0.5)
plt.title("NBA Players' Salaries in 2017-18 Season Histogram")
plt.xlabel("Salary($)")
plt.ylabel("Frequency")

### 2.2.2 Statistical Inference

In order to select appropriate features for prediction, an independent two-sample t-test can be applied to check whether a feature makes a significant difference. Here we choose "Country" as our feature.
* H0: There is no significant difference in salaries between players from the USA and players from other countries.
* H1: There is a significant difference in salaries between players from the USA and players from other countries.
* (alpha level = .05, two-tailed test)

If |t-statistic| < |t-critical|, we retain the null hypothesis; otherwise, we reject it. In this case, |t-statistic| = 0.7033, which is smaller than |t-critical|, so we conclude that there is no significant difference in salaries between players from the USA and players from overseas.

In [None]:
# Extracting two columns: Salary and NBA_Country.
# sample variance - Why does Bessel's correction use N-1?
# https://en.wikipedia.org/wiki/Bessel%27s_correction#Proof_of_correctness_-_Alternate_3

# covariance
# https://blog.csdn.net/guomutian911/article/details/43317019

inf_data = pd.read_csv(salary_file_path,usecols=[2,1])
inf_data.head(10)

In [None]:
usa_data = inf_data[inf_data['NBA_Country'] == 'USA']
non_usa_data = inf_data[inf_data['NBA_Country'] != 'USA']
u_data = usa_data[['Salary']] 
n_data = non_usa_data[['Salary']]

In [None]:
u_mean = u_data.mean()
n_mean = n_data.mean()
u_stdev = u_data.std()
n_stdev = n_data.std()
u_count = u_data.count()
n_count = n_data.count()
degree_of_freedom = u_count + n_count - 2

In [None]:
standard_error = (u_stdev**2/u_count + n_stdev**2/n_count)**0.5
t_statistics = (u_mean - n_mean)/standard_error
print('the t-statistics is: {}'.format(t_statistics))
print('the degree of freedom is: {}'.format(degree_of_freedom))
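The cells above compute the t-statistic but not the critical value it is compared against. The following is a minimal sketch of that last step, assuming SciPy is available. The helper `welch_t_test` and the two sample lists are hypothetical. Note one assumption that differs from the cell above: because the standard error is unpooled, the sketch pairs it with the Welch-Satterthwaite degrees of freedom rather than n1 + n2 - 2.

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b, alpha=0.05):
    """Return (t, df, t_critical) for a two-tailed Welch t-test."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom (an assumption of this sketch)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    t_critical = stats.t.ppf(1 - alpha / 2, df)
    return t, df, t_critical

# Made-up samples, for illustration only.
group_a = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
group_b = [2.8, 3.9, 2.2, 4.1, 3.6]

t, df, t_crit = welch_t_test(group_a, group_b)
print(abs(t) < t_crit)  # True means the null hypothesis is retained
```

With the notebook's data, the two salary columns could be passed in place of the made-up lists.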

### 2.2.3 Probability Distributions

Bayes' Theorem is a fundamental result in probability. Here we can apply it to answer a question such as “what is the probability that a player’s salary is higher than 10 million dollars, given that the player is from the USA?”

In [None]:
# Bayes Theorem

usa_list = u_data['Salary'].values.tolist()
non_usa_list = n_data['Salary'].values.tolist()

usa_count = u_data['Salary'].count()
non_usa_count = n_data['Salary'].count()

usa_10m_count = 0
non_usa_10m_count = 0

for i in usa_list:
    if i > 10000000:
        usa_10m_count += 1

for i in non_usa_list:
    if i > 10000000:
        non_usa_10m_count += 1

# P(a|b) = (P(b|a))*P(a)/P(b)

probability = (usa_10m_count/(usa_10m_count+non_usa_10m_count)) * ((usa_10m_count + non_usa_10m_count)/
               (usa_count + non_usa_count)) / (usa_count/(usa_count + non_usa_count))
print(probability)
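As a sanity check on the formula above, the Bayes expression should give the same number as counting the conditional frequency directly. The sketch below shows this on made-up values; the `salary` and `is_usa` arrays are hypothetical and are not the notebook's data.

```python
import numpy as np

# Made-up salaries and country flags, for illustration only.
salary = np.array([12e6, 3e6, 25e6, 1e6, 15e6, 8e6, 11e6, 2e6])
is_usa = np.array([True, True, True, True, False, False, False, True])
high = salary > 10_000_000

p_high = high.mean()                        # P(high)
p_usa = is_usa.mean()                       # P(USA)
p_usa_given_high = is_usa[high].mean()      # P(USA | high)

bayes = p_usa_given_high * p_high / p_usa   # P(high | USA) via Bayes' theorem
direct = high[is_usa].mean()                # P(high | USA) counted directly

print(bayes, direct)
```

Both routes agree, which is why the long expression in the cell above reduces to the number of USA players above 10 million divided by the number of USA players.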

A probability distribution makes the behaviour of our variables easier to see. Continuing the "Country" problem, a Bernoulli distribution can be applied to compare the sizes of these two groups.

In [None]:
# Bernoulli distribution

u_country = usa_data[['NBA_Country']] 
n_country = non_usa_data[['NBA_Country']]
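# Count the rows in each group; calling int() on a one-element Series is deprecated in newer pandas.
bernoulli_count = [int(u_country['NBA_Country'].count()), int(n_country['NBA_Country'].count())]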
u_probability = bernoulli_count[0]/(bernoulli_count[0] + bernoulli_count[1])
n_probability = bernoulli_count[1]/(bernoulli_count[0] + bernoulli_count[1])

# Define the dataset
probability = [u_probability, n_probability]
bars = ('0', '1')
y_pos = np.arange(len(bars))
 
# Create bars
plt.bar(y_pos, probability)
 
# Create names on the x-axis
plt.xticks(y_pos, bars)
 
# Show graphic
plt.title("Bernoulli Distribution: Probability of USA Players and Overseas Players \n")
plt.xlabel("0-USA, 1-Overseas")
plt.ylabel("Probability")
plt.show()

Also, a player's age is one of the most important issues in the NBA, because a player can earn more if he can play longer. As age is a discrete value, we can build a probability mass function for it.

In [None]:
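# A probability mass function (PMF) assigns a probability to each discrete value; the probabilities sum to 1.
# When the variable is continuous, it becomes a probability density function (PDF).
# Note: kdeplot draws a kernel density estimate, i.e. a smoothed approximation of the PMF.

import seaborn as sns

plt.figure(figsize=(10,6))
sns.kdeplot(copy_data.Age, fill=True)  # `fill` replaces the deprecated `shade` argument
plt.xlim((15,45))
plt.title("PMF of Players' Age")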
plt.ylabel("Density")
plt.xlabel('Age')
plt.grid(True)
plt.show()
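The plot above is a smoothed density. For a discrete variable such as age, an exact empirical PMF can also be computed from relative frequencies. Here is a minimal sketch on made-up ages (the `ages` Series is hypothetical):

```python
import pandas as pd

# Made-up ages, for illustration only.
ages = pd.Series([19, 22, 22, 25, 25, 25, 30, 34])

# Relative frequency of each distinct age: an empirical PMF that sums to 1.
pmf = ages.value_counts(normalize=True).sort_index()
print(pmf)
```

On the notebook's data, the same call on `copy_data.Age` followed by `.plot(kind="bar")` would give a bar chart with one bar per age.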

## 2.3 Modelling

### 2.3.1 Feature Engineering

In [None]:
# Selecting Features

print(copy_data.columns)

#### (1) Creating Features

In order to make classification predictions, it is necessary to create discrete target variables from players' salaries. Here we create two columns named 'Binary' and 'Nominal' as below:

In [None]:
# Creating Features
# building binary categories in order to make classification predictions
# normal-0, star-1

conditions = [
    (copy_data['Salary'] < 10000000)]
choices = [0]
copy_data['Binary'] = np.select(conditions, choices, default=1)
copy_data.head(10)

# copy_data.drop(['USA/NOT'], axis=1, inplace=True)

# Then we build nominal categories
# 0 - edge players
# 1 - normal players
# 2 - all stars
# 3 - superstars

conditions = [
        (copy_data['Salary'] < 5000000),
        (copy_data['Salary'] <= 10000000),
        (copy_data['Salary'] <= 20000000)]
choices = [0, 1, 2]
copy_data['Nominal'] = np.select(conditions, choices, default=3)
copy_data.head(10)
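`np.select` assigns the choice of the first condition that is true, so the order of the conditions defines the bins. The sketch below applies the same thresholds to a few made-up salaries that sit on and around the boundaries (the `salary` array is hypothetical):

```python
import numpy as np

# Made-up salaries sitting on and around the thresholds.
salary = np.array([1e6, 5e6, 10e6, 15e6, 20e6, 30e6])

# The first matching condition wins, so each later condition only
# sees the values that the earlier ones did not capture.
conditions = [salary < 5e6, salary <= 10e6, salary <= 20e6]
nominal = np.select(conditions, [0, 1, 2], default=3)
binary = np.where(salary < 10e6, 0, 1)

print(nominal.tolist())
print(binary.tolist())
```

One detail this makes visible: a salary of exactly 10 million is labelled a star (1) in 'Binary' but a normal player (1) in 'Nominal', because the first uses `<` and the second uses `<=`.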

In [None]:
color_wheel = {0: "#0392cf", 
               1: "#7bc043"}
colors = copy_data['Binary'].map(lambda x: color_wheel.get(x))
print(copy_data.Binary.value_counts())
p=copy_data.Binary.value_counts().plot(kind="bar")

In [None]:
colors = copy_data['Nominal'].map(lambda x: color_wheel.get(x))
print(copy_data.Nominal.value_counts())
p=copy_data.Nominal.value_counts().plot(kind="pie")

#### (2) Pearson's R-Square Correlation

In order to choose features that are correlated with our target variable, the squared Pearson correlation (R-square) can be used to pick the 8 features most correlated with salary.

In [None]:
df = DataFrame(copy_data,columns=['Salary', 'NBA_Country', 'NBA_DraftNumber', 'Age', 'Tm', 'G',
       'MP', 'PER', 'TS', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%',
       'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM',
       'DBPM', 'BPM', 'VORP'])

'''
pandas.DataFrame.corr
method : {‘pearson’, ‘kendall’, ‘spearman’}
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation

min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
'''

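# numeric_only=True skips the text columns (NBA_Country, Tm), which newer pandas versions refuse to correlate.
corrmat = df.corr(method='pearson', numeric_only=True, min_periods=1)
r_square = corrmat ** 2

## Top 8 correlated variables
k = 9 # number of variables for heatmap: Salary itself plus its top 8 features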
cols = r_square.nlargest(k, 'Salary')['Salary'].index
cm = df[cols].corr()
cm_square = cm ** 2
f, ax = plt.subplots(figsize=(10, 10))
sns.set(font_scale=1.25)
hm = sns.heatmap(cm_square, cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

In [None]:
cm_square.columns

In [None]:
# Using scatter plots to detect the correlation value

variables = ['Salary', 'WS', 'VORP', 'OWS', 'MP', 'DWS', 'NBA_DraftNumber', 'Age', 'BPM']

sns.set()
sns.pairplot(df[variables], height = 2.5)  # `height` replaces the old `size` argument
plt.show()

#### (3) Multicollinearity Analysis

We now have 8 features, which may be multicollinear; this makes the model's coefficients unstable and can cause overfitting. Therefore, the variance inflation factor (VIF) can be used to detect multicollinearity. If a feature's VIF is larger than 10, we consider the multicollinearity very strong and the feature should not be included.

In [None]:
# https://etav.github.io/python/vif_factor_python.html
# https://onlinecourses.science.psu.edu/stat501/node/347/

x = df[['WS', 'VORP', 'OWS', 'MP', 'DWS', 'NBA_DraftNumber', 'Age', 'BPM']]

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif["features"] = x.columns

vif.round(1)

As expected, OWS, DWS, and WS have high variance inflation factors because they carry overlapping information (WS is the sum of OWS and DWS). Age, BPM, USG%, VORP, MP and PER also share similarly high VIFs, so some of them should be discarded. Therefore, we choose 'NBA_DraftNumber', 'Age', 'WS', 'BPM' as our features for modelling.

In [None]:
x = df[['NBA_DraftNumber', 'Age', 'WS', 'BPM']]
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif["features"] = x.columns

vif.round(1)
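To make the meaning of VIF concrete, the sketch below computes it by hand with NumPy: each column is regressed on the others, and VIF = 1 / (1 - R²). The helper `vif` and the synthetic columns `a`, `b`, `c` are hypothetical. The sketch includes an intercept in each regression; as far as we are aware, the statsmodels function does not add one by itself, so the numbers in the cells above can differ from an intercept-based calculation.

```python
import numpy as np

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # include an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + rng.normal(scale=0.05, size=200)  # nearly a + b, like WS = OWS + DWS

print(vif(np.column_stack([a, b, c])).round(1))  # all three are inflated
print(vif(np.column_stack([a, b])).round(1))     # independent columns stay near 1
```

Dropping the column that is almost a sum of the others brings the remaining VIFs back towards 1, which mirrors the feature reduction done above.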

### 2.3.2 Regression

#### (1) Measure of Goodness: RMSE

Root Mean Square Error (RMSE) measures how far the predicted points are from the real points. Unlike MSE, RMSE is expressed in the same units as the target variable, and because the errors are squared, it penalizes large errors more heavily than MAE, which makes it more sensitive when comparing different models.

In [None]:
from sklearn.metrics import mean_squared_error

# RMSE for testing data

def rmse_model(model, x_test, y_test):
    predictions = model.predict(x_test)
    rmse = np.sqrt(mean_squared_error(predictions, y_test))
    return(rmse)
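The difference between MAE and RMSE is easiest to see on a small worked example. In the sketch below (all numbers are made up), two sets of predictions have the same total absolute error, but one concentrates it in a single large miss:

```python
import numpy as np

y_true = np.array([10.0, 10.0, 10.0, 10.0])
pred_a = np.array([12.0, 12.0, 12.0, 12.0])  # four errors of 2
pred_b = np.array([10.0, 10.0, 10.0, 18.0])  # one error of 8

def mae(y, p):
    return np.mean(np.abs(y - p))

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

print(mae(y_true, pred_a), mae(y_true, pred_b))    # MAE is the same for both
print(rmse(y_true, pred_a), rmse(y_true, pred_b))  # RMSE is larger for pred_b
```

MAE cannot tell the two apart, while RMSE doubles for the predictions with one large miss. For salaries, where a few very large contracts exist, this sensitivity matters.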

#### (2) Selection of Model: Multivariate, Cross Validation, and Bias/Variance Trade-off

At first, we use multivariate linear regression to build our initial model. We then assume that this model neither overfits nor underfits. To check this assumption, we hold out 20% of the data as a test set (8:2) and split the remainder into a training set and a validation set, which gives 60:20:20 overall. Then we use the bias/variance trade-off graph to see whether the assumption holds.

In [None]:
from sklearn.model_selection import train_test_split

x = df[['NBA_DraftNumber', 'Age', 'WS', 'BPM']]
y = df[['Salary']]

x_model, x_test, y_model, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [None]:
# Hold-out validation

# Splitting the dataset into three parts, for training, validation, and testing respectively.

x_train, x_val, y_train, y_val = train_test_split(x_model, y_model, test_size=0.25, random_state=1)

In [None]:
print("the number of data for training:")
print(y_train.count())
print("the number of data for validation:")
print(y_val.count())
print("the number of data for testing:")
print(y_test.count())

In [None]:
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(x_train, y_train)

print(rmse_model(linear_regression, x_test, y_test))
print(linear_regression.coef_)
print(linear_regression.intercept_)

In [None]:
# Bias-Variance Trade-off

from sklearn.preprocessing import PolynomialFeatures

train_rmses = []
val_rmses = []
degrees = range(1,8)

for i in degrees:
    
    poly = PolynomialFeatures(degree=i, include_bias=False)
    x_train_poly = poly.fit_transform(x_train)

    poly_reg = LinearRegression()
    poly_reg.fit(x_train_poly, y_train)
    
    # training RMSE
    y_train_pred = poly_reg.predict(x_train_poly)
    train_poly_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    
    train_rmses.append(train_poly_rmse)
    
    # validation RMSE (reuse the transformer fitted on the training set)
    x_val_poly = poly.transform(x_val)
    y_val_pred = poly_reg.predict(x_val_poly)
    
    val_poly_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    val_rmses.append(val_poly_rmse)

    print('degree = %s, training RMSE = %.2f, validation RMSE = %.2f' % (i, train_poly_rmse, val_poly_rmse))
        
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(degrees, train_rmses,label= 'training set')
ax.plot(degrees, val_rmses,label= 'validation set')
ax.set_yscale('log')
ax.set_xlabel('Degree')
ax.set_ylabel('RMSE')
ax.set_title('Bias/Variance Trade-off')  
plt.legend()
plt.show()

As shown in the bias/variance trade-off figure above, when degree=1, both the training and the validation RMSE are quite low. But when degree>=4, the gap between the training RMSE and the validation RMSE is obvious. Here we retain the null hypothesis that the 1st-order polynomial model does not cause high bias.
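The comparison above relies on a single train/validation split. A k-fold variant averages the validation error over several splits, which makes the choice of degree less dependent on one particular split. Below is a minimal sketch on synthetic data, assuming scikit-learn; the data, the degrees tried, and the `cv_rmse` dictionary are all hypothetical and are not the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy linear signal in two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for degree in (1, 2, 6):
    # The pipeline refits the polynomial expansion inside every fold.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    cv_rmse[degree] = float(np.sqrt(-scores.mean()))
    print(degree, round(cv_rmse[degree], 3))
```

Since the synthetic signal is linear, we would typically expect the higher degrees to show no improvement, or a worse cross-validated RMSE, compared with degree 1.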

In [None]:
# RMSE for testing data

second_poly = PolynomialFeatures(degree=2, include_bias=False)
x_train_poly = second_poly.fit_transform(x_train)

second_reg = LinearRegression()
second_reg.fit(x_train_poly, y_train)

x_test_second_poly = second_poly.transform(x_test)  # reuse the transformer fitted on the training set
y_test_pred = second_reg.predict(x_test_second_poly)

print(rmse_model(second_reg, x_test_second_poly, y_test))
print(second_reg.coef_)
print(second_reg.intercept_)

#### (3) Regularization: Ridge, Lasso, and ElasticNet

There are 3 ways to address overfitting. The first is to increase the size of the dataset, the second is to choose a suitable model complexity, and the third is to use regularization to reduce the size of the coefficients. In this part, we focus on regularization and select degree=4 to test the effectiveness of three regularization methods.

Regularization can be thought of as a 'punishment'. When the model is too complex, the coefficients tend to be very large. So we add a penalty on the coefficients, weighted by λ (the `alpha` parameter in scikit-learn), to make them smaller than before.

In [None]:
# At first, we calculate the RMSE before regularization.

poly = PolynomialFeatures(degree=4, include_bias=False)
x_train_poly = poly.fit_transform(x_train)

poly_reg = LinearRegression()
poly_reg.fit(x_train_poly, y_train)

x_test_poly = poly.transform(x_test)  # reuse the transformer fitted on the training set
y_test_pred = poly_reg.predict(x_test_poly)

print(rmse_model(poly_reg, x_test_poly, y_test))

***Ridge***

In [None]:
# Ridge

# https://blog.csdn.net/hzw19920329/article/details/77200475
# https://www.kaggle.com/sflender/comparing-lin-regression-ridge-lasso
# https://www.kaggle.com/junyingzhang2018/ridge-regression-score-0-119

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rmse=[]
alpha=[1, 2, 5, 10, 20, 30, 40, 50, 75, 100]

for a in alpha:
    ridge = make_pipeline(PolynomialFeatures(4), Ridge(alpha=a))
    ridge.fit(x_train, y_train)
    predict=ridge.predict(x_val)
    rmse.append(np.sqrt(mean_squared_error(predict, y_val)))
print(rmse)
plt.scatter(alpha, rmse)

In [None]:
# Adjust alpha based on previous result

alpha=np.arange(20, 60, 2)
rmse=[]

for a in alpha:
    #ridge=Ridge(alpha=a, copy_X=True, fit_intercept=True)
    #ridge.fit(x_train, y_train)
    ridge = make_pipeline(PolynomialFeatures(4), Ridge(alpha=a))
    ridge.fit(x_train, y_train)
    predict=ridge.predict(x_val)
    rmse.append(np.sqrt(mean_squared_error(predict, y_val)))
print(rmse)
plt.scatter(alpha, rmse)

In [None]:
# Adjust alpha based on previous result

alpha=np.arange(20, 30, 0.2)
rmse=[]

for a in alpha:
    #ridge=Ridge(alpha=a, copy_X=True, fit_intercept=True)
    #ridge.fit(x_train, y_train)
    ridge = make_pipeline(PolynomialFeatures(4), Ridge(alpha=a))
    ridge.fit(x_train, y_train)
    predict=ridge.predict(x_val)
    rmse.append(np.sqrt(mean_squared_error(predict, y_val)))
print(rmse)
plt.scatter(alpha, rmse)

In [None]:
# Use alpha=24.6 to predict the test data

ridge = make_pipeline(PolynomialFeatures(4), Ridge(alpha=24.6))
ridge_model = ridge.fit(x_train, y_train)

predictions = ridge_model.predict(x_test)
print("Ridge RMSE is: " + str(rmse_model(ridge_model, x_test, y_test)))

***Lasso***

In [None]:
# Lasso

# https://www.kaggle.com/sflender/comparing-lin-regression-ridge-lasso

from sklearn.linear_model import Lasso

rmse=[]
alpha=[0.0001, 0.001, 0.01, 0.1, 1]

for a in alpha:
    lasso=make_pipeline(PolynomialFeatures(4), Lasso(alpha=a))
    lasso.fit(x_train, y_train)
    predict=lasso.predict(x_val)
    rmse.append(np.sqrt(mean_squared_error(predict, y_val)))
print(rmse)
plt.scatter(alpha, rmse)

In [None]:
lasso = make_pipeline(PolynomialFeatures(4), Lasso(alpha=0.0001))
lasso_model = lasso.fit(x_train, y_train)
predictions = lasso_model.predict(x_test)
print("RMSE in Testing : " + str(rmse_model(lasso_model, x_test, y_test)))

***ElasticNet***

In [None]:
# ElasticNet

# https://www.kaggle.com/jack89roberts/top-7-using-elasticnet-with-interactions

from sklearn.linear_model import ElasticNet, ElasticNetCV

rmse=[]
alpha=[0.000001, 0.00001, 0.0001, 0.001,0.01,0.1]

for a in alpha:
    elasticnet=make_pipeline(PolynomialFeatures(4), ElasticNet(alpha=a))
    elasticnet.fit(x_train, y_train)
    predict=elasticnet.predict(x_val)
    rmse.append(np.sqrt(mean_squared_error(predict, y_val)))
                             
print(rmse)
plt.scatter(alpha, rmse)

In [None]:
elasticnet=make_pipeline(PolynomialFeatures(4), ElasticNet(alpha=0.000001))
elasticnet_model = elasticnet.fit(x_train, y_train)
predictions = elasticnet_model.predict(x_test)
print("RMSE in Testing : " + str(rmse_model(elasticnet_model, x_test, y_test)))

In [None]:
# Comparison

print("For testing dataset\n")

print("Linear RMSE is: " + str(rmse_model(linear_regression, x_test, y_test)))
print("2nd Polynomial RMSE is: " + str(rmse_model(second_reg, x_test_second_poly, y_test)))

print("\nFor 4th order polynomial (RMSE = 128255850.32699986 before regularization)")
print("Ridge RMSE is: " + str(rmse_model(ridge_model, x_test, y_test)))
print("Lasso RMSE is: " + str(rmse_model(lasso_model, x_test, y_test)))
print("ElasticNet RMSE is: " + str(rmse_model(elasticnet_model, x_test, y_test)))

In [None]:
data = np.array([['','Parameter','RMSE'],
                ['1st-order Poly',1,4508432.2],
                ['2nd-order Poly',2,4136581.4],
                ['4th-order Poly',4,128255850.3],
                ['4th-order Lasso','<0.0001',10261503.6],
                ['4th-order Ridge',24.6,32093519.5],
                ['4th-order ElasticNet','<0.0001',10261504.7]])
                
regression_comparison = pd.DataFrame(data=data[1:,1:],
                                      index=data[1:,0],
                                    columns=data[0,1:])
regression_comparison

In [None]:
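#http://www.science.smith.edu/~jcrouser/SDS293/labs/lab10-py.html

# Note: the `normalize` argument used in this cell and the next two was removed in scikit-learn 1.2,
# so these cells need an older scikit-learn version to run as written.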

my_ridge = Ridge(alpha = 24.6, normalize = True)
my_ridge.fit(x_train, y_train) 
#pd.Series(my_ridge.coef_,index = ['NBA_DraftNumber', 'Age', 'WS', 'BPM'])
my_ridge.coef_

In [None]:
my_lasso = Lasso(alpha = 0.0001, normalize = True)
my_lasso.fit(x_train, y_train) 
my_lasso.coef_

In [None]:
my_elasticnet = ElasticNet(alpha = 0.0001, normalize = True)
my_elasticnet.fit(x_train, y_train) 
my_elasticnet.coef_

In [None]:
#https://www.zhihu.com/question/38121173

data = np.array([['','NBA_DraftNumber','Age', 'WS', 'BPM'],
                ['Ridge',-5063.8823843 , 23745.62461498, 69180.1194312 , 14071.36231572],
                ['Lasso',-71799.85643407,  478944.20547638, 1505350.67035793, -28969.97816668],
                ['ElasticNet',-71487.33915305,  473914.28999172, 1478049.57849672, -22200.94106611]])
                
regularization_comparison = pd.DataFrame(data=data[1:,1:],
                                      index=data[1:,0],
                                    columns=data[0,1:])
regularization_comparison


The performance of the different models is shown above, where the 2nd-order polynomial regression performs best. The 4th-order polynomial regression overfits.

Different regularization methods behave differently. Looking at the coefficients, we find that Ridge regularization drives the parameters to smaller values. When multicollinearity exists, Lasso can shrink some coefficients all the way to 0, while Ridge does not remove any feature. So if we want to do feature selection, we can choose Lasso; if we want to keep all features, we prefer Ridge.

In all, although the effect of regularization is significant, it is much better to choose the correct parameters and features in the first place.
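The contrast between Ridge and Lasso can be reproduced on a small synthetic problem. The sketch below assumes scikit-learn; the data and the alpha values are hypothetical and chosen only to make the effect visible. Only the first of five features carries signal.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first of five features matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=50.0).fit(X, y)   # L2 penalty: shrinks every coefficient
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can set coefficients to 0

print("OLS  ", ols.coef_.round(3))
print("Ridge", ridge.coef_.round(3))
print("Lasso", lasso.coef_.round(3))
```

Ridge keeps all five coefficients but makes them smaller overall, while Lasso keeps the informative feature and sets the four noise features exactly to zero, which is what makes it usable for feature selection.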

### 2.3.3 Classification

#### (1) Measure of Goodness: Accuracy Score and Confusion Matrix

Accuracy Score is straightforward: it tells us the proportion of predictions that the model gets right. However, if we want to know how well each target group is predicted, a Confusion Matrix is more suitable, because it compares the predicted values with the real values in each group.

In [None]:
# https://www.kaggle.com/pablovargas/naive-bayes-svm-spam-filtering
# for binary target variables

from sklearn import metrics

def confusion_matrix(model, x_test, y_test):
    model_confusion_test = metrics.confusion_matrix(y_test, model.predict(x_test))
    matrix = pd.DataFrame(data = model_confusion_test, columns = ['Predicted 0', 'Predicted 1'],
                 index = ['Actual 0', 'Actual 1'])
    return matrix
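A small made-up example shows why the confusion matrix adds information beyond accuracy when the classes are unbalanced. The labels below are hypothetical: 8 normal players (0) and 2 stars (1), with one star misclassified.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 8 "normal" players (0) and 2 "stars" (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

accuracy = metrics.accuracy_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
per_class = cm.diagonal() / cm.sum(axis=1)     # share of each actual class found

print(accuracy)
print(cm)
print(per_class)
```

The overall accuracy is 0.9, yet only half of the stars are identified. The per-class view is what reveals this.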

#### (2) Selection of Model: KNN, SVM, Naïve Bayes, Decision Tree, Logistic Regression

Five models are selected to fit our dataset. Here we use KNN and SVM to fit the 'Binary' target variable, and use the others to fit the 'Nominal' target variable.

In [None]:
print(copy_data.columns)

In [None]:
df = DataFrame(copy_data,columns=['Binary', 'Nominal', 'Age','NBA_DraftNumber','MP', 'PER', 'TS', '3PAr', 
        'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%','STL%', 'BLK%', 
        'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 
        'DBPM', 'BPM', 'VORP'])

In [None]:
x = df[['NBA_DraftNumber', 'Age', 'WS', 'BPM']]
y = df[['Binary']]

x_model, x_test, y_model, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
x_train, x_val, y_train, y_val = train_test_split(x_model, y_model, test_size=0.25, random_state=1)

#### (3) Comparison: Model Tuning, Learning Curve and Curse of Dimensionality

Model tuning is similar to the bias/variance trade-off: the aim is to find a balance that gives not only a high score on the training set but also a good ability to predict the testing set. After applying our models, it is essential to use model tuning techniques to find the parameters that best fit our dataset.

A learning curve shows how the score changes with the size of the dataset. It is also a good way to judge the adverse impact of overfitting, because a complex model performs better on a large dataset than on a small one.

The curse of dimensionality is one of the main causes of overfitting. If the number of dimensions is close to the number of samples, each sample or small group of samples may form its own class, so the model performs well on the training set but loses its ability to predict the testing set.
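For a distance-based model such as KNN, the curse of dimensionality has a concrete geometric side: as dimensions are added, the nearest and the farthest points become almost equally far away, so "nearest neighbour" carries less information. The sketch below illustrates this with uniformly random points; the helper `nearest_to_farthest_ratio` and the chosen dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=200):
    """Ratio of the nearest to the farthest distance from one query point."""
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dist = np.linalg.norm(points - query, axis=1)
    return dist.min() / dist.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 20, 200)}
for d, r in ratios.items():
    print(d, round(r, 3))
```

A ratio close to 0 means the nearest point is much closer than the farthest one; a ratio approaching 1 means all points look roughly equally distant.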

##### KNN

In [None]:
# Model Tuning

# 5-fold cross validation

from sklearn.model_selection import KFold, cross_val_score

def rmse_cv(model):
    # Pass the KFold splitter itself as cv; get_n_splits() would only return the integer 5
    # and discard the shuffling. cross_val_score fits the model on each fold internally.
    kf = KFold(5, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, x_model.values, y_model.values.ravel(), scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

In [None]:
# How to find K?

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold

train_scores = []
validation_scores = []

x_model_values = x_model.values
y_model_values = y_model.values

# 5-fold cross validation

kfold = KFold(5, shuffle=True, random_state=42)

for i in range(1,20):
    knn = KNeighborsClassifier(i)
    
    tr_scores = []
    va_scores = []
    
    for a, b in kfold.split(x_model_values):

        x_train_fold, y_train_fold = x_model_values[a], y_model_values[a]
        x_val_fold, y_val_fold = x_model_values[b], y_model_values[b]
        
        knn.fit(x_train_fold, y_train_fold.ravel())
        
        va_scores.append(knn.score(x_val_fold, y_val_fold))
        tr_scores.append(knn.score(x_train_fold, y_train_fold))
        
    validation_scores.append(np.mean(va_scores))
    train_scores.append(np.mean(tr_scores))

In [None]:
plt.title('k-NN Varying number of neighbours')
plt.plot(range(1,20),validation_scores,label="Validation")
plt.plot(range(1,20),train_scores,label="Train")
plt.legend()
plt.xticks(range(1,20))
plt.show()

In [None]:
# Learning Curve

# How KNN algorithm performs in both small-size data and big-size data 

# choose an acceptable color
# https://www.spycolor.com/ff8040

from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(KNeighborsClassifier(5), 
        x_model, 
        y_model,
        # Number of folds in cross-validation
        cv=5,
        # Evaluation metric
        scoring='accuracy',
        # Use all computer cores
        n_jobs=-1, 
        # 5 different sizes of the training set
        train_sizes=np.linspace(0.1, 1.0, 5))

# Create means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Create means and standard deviations of validation set scores
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)

# Draw lines
plt.plot(train_sizes, train_mean, '--', color="#ff8040",  label="Training score")
plt.plot(train_sizes, val_mean, color="#40bfff", label="Cross-validation score")

# Draw bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, color="#DDDDDD")

# Create plot
plt.title("Learning Curve \n k-fold=5, number of neighbours=5")
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), plt.legend(loc="best")
plt.tight_layout()
plt.show()

In [None]:
# curse of dimensionality

# With only one or two features the model is simple, but it may not separate our categories well.
# More features mean more evidence in different dimensions, but they could cause overfitting.

X = df[[ 'Age', 'NBA_DraftNumber','MP', 'PER', '3PAr',
        'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%','STL%', 'BLK%', 
        'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 
        'DBPM', 'BPM', 'VORP']]
Y = df[['Binary']]

X_model, X_test, Y_model, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
X_train, X_val, Y_train, Y_val = train_test_split(X_model, Y_model, test_size=0.25, random_state=1)

In [None]:
# [:, :2]extract columns

# convert[[1],[2],[3],...] to [1,2,3,4,0]
# x_train_values_list = np.array(x_train_values).tolist() 

'''
y_train_value = [j for i in y_train_values for j in i] - delete sublists to just one list

dimensionality = []
for i in range(10):

a = [item[:, :2] for item in list(x_train_values)]
print(a)
'''

d_train = []
d_val = []

X_train_values = X_train.values
Y_train_values = Y_train.values
X_val_values = X_val.values
Y_val_values = Y_val.values

for i in range(1,23):
    
    X_train_value = X_train_values[:,:i].tolist() # first i feature columns as a nested list
    X_val_value = X_val_values[:,:i].tolist()
    
    knn = KNeighborsClassifier(5)
    Knn = knn.fit(X_train_value, Y_train_values.ravel())

    d_train.append(Knn.score(X_train_value, Y_train_values))
    d_val.append(Knn.score(X_val_value, Y_val_values))

plt.title('k-NN Curse of Dimensionality')
plt.plot(range(1,23),d_val,label="Validation")
plt.plot(range(1,23),d_train,label="Train")
plt.xlabel('Number of Features')
plt.ylabel('Score (Accuracy)')
plt.legend()
plt.xticks(range(1,23))
plt.show()

In [None]:
# The best result is obtained at k = 5, so it is used for the final model.

# Set up a k-NN classifier with k = 5 neighbours
kfold = KFold(5, shuffle=True, random_state=42)
knn = KNeighborsClassifier(5)

# Refit on the training part of each fold. Every fit overwrites the previous one,
# so the model scored below is the one trained on the last fold's training part.
for m, _ in kfold.split(x_model_values):
    x_train_fold, y_train_fold = x_model_values[m], y_model_values[m]
    Knn = knn.fit(x_train_fold, y_train_fold.ravel())

print('When k=5, the testing score (accuracy) is: ')
print(Knn.score(x_test, y_test))

In [None]:
confusion_matrix(Knn, x_test, y_test)

##### SVM

In [None]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

classifier = SVC(gamma = 'auto')
svm_model = OneVsRestClassifier(classifier, n_jobs=1).fit(x_train, y_train)

print(svm_model.score(x_train,y_train))
print(svm_model.score(x_val,y_val))

In [None]:
#Tuning

# https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769
# https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72

#from sklearn.model_selection import GridSearchCV

#parameters = {"estimator__gamma":[0.0001, 0.001, 0.01, 0.3, 0.5, 0.1, 2, 5, 10, 100]}
#grid_search = GridSearchCV(svm_model, param_grid=parameters)
#grid_search.fit(x_train, y_train)
#print(grid_search.best_score_)
#print(grid_search.best_params_)

accuracy=[]
gamma=[0.0001, 0.001, 0.005, 0.01, 0.1, 0.2, 0.3, 0.5]

for a in gamma:
    classifier = SVC(C=1, 
        kernel='rbf', 
        degree=2, 
        gamma=a, 
        coef0=1,
        shrinking=True, 
        tol=0.5,
        probability=False, 
        cache_size=200, 
        class_weight=None,
        verbose=False, 
        max_iter=-1, 
        decision_function_shape='ovr', # None is rejected by recent scikit-learn versions
        random_state=None)
    svm_model = OneVsRestClassifier(classifier, n_jobs=1)
    svm_model.fit(x_train, y_train)
    predict=svm_model.predict(x_val)
    accuracy.append(svm_model.score(x_val,y_val))
print(accuracy)
plt.scatter(gamma, accuracy)

In [None]:
gamma=np.arange(0.0001, 0.005, 0.0003) 
accuracy=[]

for a in gamma:
    classifier = SVC(C=1, 
        kernel='rbf', 
        degree=2, 
        gamma=a, 
        coef0=1,
        shrinking=True, 
        tol=0.5,
        probability=False, 
        cache_size=200, 
        class_weight=None,
        verbose=False, 
        max_iter=-1, 
        decision_function_shape='ovr', # None is rejected by recent scikit-learn versions
        random_state=None)
    svm_model = OneVsRestClassifier(classifier, n_jobs=1)
    svm_model.fit(x_train, y_train)
    predict=svm_model.predict(x_val)
    accuracy.append(svm_model.score(x_val,y_val))
print(accuracy)
plt.scatter(gamma, accuracy)
plt.title("Finding Gamma")
plt.xlabel("Gamma")
plt.ylabel("Accuracy Score")
plt.show()

Gamma plays a role similar to k in KNN: it controls how closely the model follows the training data. The higher the gamma value, the harder the model tries to fit the training set exactly, so a gamma value that is too high causes overfitting.

The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. In other words, with low gamma, points far away from the plausible separation line are considered when calculating it, whereas with high gamma only the points close to the plausible line are considered.
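To make the "reach" of a training example concrete, here is a minimal sketch of the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), which is the kernel selected by `kernel='rbf'`. The helper name `rbf_kernel_value` and the two sample points are made up for illustration; they are not part of the notebook.

```python
import math

def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two hypothetical points that are 5 units apart (a 3-4-5 triangle).
x, z = (0.0, 0.0), (3.0, 4.0)

# Print the similarity of the two points for several gamma values.
for gamma in (0.0001, 0.0013, 0.1, 1.0):
    print(gamma, rbf_kernel_value(x, z, gamma))
```

With a small gamma the similarity stays close to 1 even for distant points, so each example influences a wide region. With a large gamma it drops towards 0 quickly, so only near neighbours matter.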

In [None]:
accuracy=[]
C=np.arange(1,10,1) 

for a in C:
    classifier = SVC(C=a, 
        kernel='rbf', 
        degree=2, 
        gamma=0.0013, 
        coef0=1,
        shrinking=True, 
        tol=0.5,
        probability=False, 
        cache_size=200, 
        class_weight=None,
        verbose=False, 
        max_iter=-1, 
        decision_function_shape='ovr', # None is rejected by recent scikit-learn versions
        random_state=None)
    svm_model = OneVsRestClassifier(classifier, n_jobs=1)
    svm_model.fit(x_train, y_train)
    predict=svm_model.predict(x_val)
    accuracy.append(svm_model.score(x_val,y_val))
print(accuracy)
plt.scatter(C, accuracy)
plt.title("Finding C")
plt.xlabel("C")
plt.ylabel("Accuracy Score")
plt.show()

The regularization parameter C tells the SVM optimization how much you want to avoid misclassifying each training example.

For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
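The trade-off can be sketched with the soft-margin objective, 0.5 * w^2 + C * (sum of hinge losses). The example below is only an illustration under simplifying assumptions: it uses a linear boundary on made-up 1-D data and compares two hand-picked boundaries instead of running the real optimizer. The names `soft_margin_objective`, `wide` and `narrow` are hypothetical.

```python
def soft_margin_objective(w, b, points, C):
    """Soft-margin SVM objective for a 1-D linear boundary f(x) = w*x + b."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in points)
    return 0.5 * w * w + C * hinge

# Hypothetical 1-D data: (x, label). The positive point at x = -0.5 is an outlier.
points = [(-2.0, -1), (-1.5, -1), (-0.5, 1), (1.5, 1), (2.0, 1)]

wide = (1.0, 0.0)    # boundary at x = 0, margin 1/|w| = 1, misclassifies the outlier
narrow = (2.0, 2.0)  # boundary at x = -1, margin 1/|w| = 0.5, classifies every point

# Compare which of the two candidate boundaries has the lower objective for each C.
for C in (0.1, 10.0):
    obj_wide = soft_margin_objective(*wide, points, C)
    obj_narrow = soft_margin_objective(*narrow, points, C)
    better = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C}: wide={obj_wide:.2f}, narrow={obj_narrow:.2f} -> {better} margin preferred")
```

With the small C the wide-margin boundary has the lower objective even though it misclassifies one point, and with the large C the narrow-margin boundary wins.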

In [None]:
classifier = SVC(C=1, # Regularization parameter
        kernel='rbf', # kernel type, rbf working fine here
        degree=2, # only used by the 'poly' kernel (default is 3); ignored by rbf
        gamma=0.0013, # kernel coefficient
        coef0=1, # only used by the 'poly' and 'sigmoid' kernels; ignored by rbf
        shrinking=True, # using shrinking heuristics
        tol=0.5, # stopping criterion tolerance 
        probability=False, # no need to enable probability estimates
        cache_size=200, # 200 MB cache size
        class_weight=None, # all classes are treated equally 
        verbose=False, # do not print the logs 
        max_iter=-1, # no limit, let it run
        decision_function_shape='ovr', # None is rejected by recent scikit-learn; one vs rest is also applied explicitly below
        random_state=None)
svm_model = OneVsRestClassifier(classifier, n_jobs=1).fit(x_train, y_train)

print(svm_model.score(x_train,y_train))
print(svm_model.score(x_val,y_val))

In [None]:
print(svm_model.score(x_test,y_test))

In [None]:
# Confusion Matrix

confusion_matrix(svm_model, x_test, y_test)

In [None]:
# Learning Curve

train_sizes, train_scores, val_scores = learning_curve(OneVsRestClassifier(classifier, n_jobs=1), 
        x_model, 
        y_model,
        # Number of folds in cross-validation
        cv=5,
        # Evaluation metric
        scoring='accuracy',
        # 5 training-set sizes, evenly spaced from 10% to 100% of the maximum training size
        train_sizes=np.linspace(0.1, 1.0, 5))

# Create means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Create means and standard deviations of validation set scores
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)

# Draw lines
plt.plot(train_sizes, train_mean, '--', color="#ff8040",  label="Training score")
plt.plot(train_sizes, val_mean, color="#40bfff", label="Cross-validation score")

# Draw bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, color="#DDDDDD")

# Create plot
plt.title("Learning Curve \n C=1, gamma=0.0013")
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), plt.legend(loc="best")
plt.tight_layout()
plt.show()

In [None]:
# curse of dimensionality

# With only one or two features the model is simple, but it cannot recognize and separate our categories.
# More features provide more evidence in different dimensions, but they can cause overfitting.

# https://thispointer.com/select-rows-columns-by-name-or-index-in-dataframe-using-loc-iloc-python-pandas/

d_train = []
d_val = []

for i in range(1,23):
    
    X_train_index = X_train.iloc[: , 0:i]
    X_val_index = X_val.iloc[: , 0:i]
    
    classifier = SVC(C=1, # Regularization parameter
                    kernel='rbf', # kernel type, rbf working fine here
                    degree=2, # only used by the 'poly' kernel (default is 3); ignored by rbf
                    gamma=0.0001, # kernel coefficient
                    coef0=1, # only used by the 'poly' and 'sigmoid' kernels; ignored by rbf
                    shrinking=True, # using shrinking heuristics
                    tol=0.5, # stopping criterion tolerance 
                    probability=False, # no need to enable probability estimates
                    cache_size=200, # 200 MB cache size
                    class_weight=None, # all classes are treated equally 
                    verbose=False, # do not print the logs 
                    max_iter=-1, # no limit, let it run
                    decision_function_shape='ovr', # None is rejected by recent scikit-learn; one vs rest is also applied explicitly below
                    random_state=None)
    svm_model = OneVsRestClassifier(classifier, n_jobs=1).fit(X_train_index, Y_train)

    d_train.append(svm_model.score(X_train_index, Y_train))
    d_val.append(svm_model.score(X_val_index, Y_val))

In [None]:
plt.title('SVM Curse of Dimensionality')
plt.plot(range(1,23),d_val,label="Validation")
plt.plot(range(1,23),d_train,label="Train")
plt.xlabel('Number of Features')
plt.ylabel('Score (Accuracy)')
plt.legend()
plt.xticks(range(1,23))
plt.show()

##### Naive Bayes

In Naive Bayes, we assume that the features are independent of each other given the class. We can also try a non-binary target variable here.
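As a rough illustration of what the independence assumption means in practice, the sketch below computes Gaussian Naive Bayes posteriors by hand: the class prior is multiplied by one Gaussian density per feature. The per-class means and variances, the priors, and the function names are hypothetical values chosen for the example, not estimates from the NBA dataset.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def naive_bayes_posteriors(sample, class_stats, priors):
    """Posterior P(class | sample), multiplying per-feature densities (independence assumption)."""
    scores = {}
    for label, stats in class_stats.items():
        score = priors[label]
        for value, (mean, var) in zip(sample, stats):
            score *= gaussian_pdf(value, mean, var)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Hypothetical per-class (mean, variance) for two features, e.g. Age and WS.
class_stats = {
    0: [(24.0, 9.0), (1.0, 1.0)],
    1: [(28.0, 9.0), (6.0, 4.0)],
}
priors = {0: 0.5, 1: 0.5}

posteriors = naive_bayes_posteriors((29.0, 5.0), class_stats, priors)
print(posteriors)
```

Because each feature contributes its own factor, strongly correlated features effectively count the same evidence twice, which is why low collinearity helps this model.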

In [None]:
# NB assumes that the features are not correlated with each other. Therefore, if the collinearity
# of our features is low, the model will perform better.

x = df[['NBA_DraftNumber', 'Age', 'WS', 'BPM']]
y = df[['Nominal']]

x_model, x_test, y_model, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
x_train, x_val, y_train, y_val = train_test_split(x_model, y_model, test_size=0.25, random_state=1)

In [None]:
# https://medium.com/machine-learning-101/chapter-1-supervised-learning-and-naive-bayes-classification-part-1-theory-8b9e361897d5
# https://blog.csdn.net/li8zi8fa/article/details/76176597
# GaussianNB,MultinomialNB, BernoulliNB

# http://www.cnblogs.com/lesliexong/p/6907642.html

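# Gaussian is for continuous features

# Note that for discrete samples, i.e. frequency-based estimates, if a feature fn never appears in class ci
# of the training set, the term P(fn|ci) is 0, which makes the whole estimate 0 and ignores the other features.
# Such an estimate is clearly inaccurate, so the counts are usually smoothed to ensure no zero probability appears.
# For example, Laplace smoothing adds 1 to the count of every partition under each class; when the training set
# is large enough, this does not affect the result. Lidstone smoothing instead adds a number between 0 and 1.

# Unlike multinomial naive Bayes, which computes P(fn|ci) from feature frequencies, the Bernoulli model
# only considers the binary question of whether a feature occurs or not.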

from sklearn.naive_bayes import GaussianNB

gaussian = GaussianNB()
nb_model = gaussian.fit(x_train, y_train.values.ravel())

print(nb_model.score(x_train,y_train))

In [None]:
train_score = []
val_score = []
a = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1]

#for i in np.arange(1,20):
for i in a:
    gaussian = GaussianNB(priors=None, var_smoothing=i)
    nb_model = gaussian.fit(x_train, y_train.values.ravel())
    train_score.append(nb_model.score(x_train, y_train))
    val_score.append(nb_model.score(x_val, y_val))

In [None]:
plt.plot(a,train_score)
plt.plot(a,val_score)
plt.legend(['Training Accuracy','Validation Accuracy'])
plt.title('Naive Bayes Tuning')
plt.xlabel('Variance Smoothing')
plt.ylabel('Accuracy')

In [None]:
gaussian = GaussianNB(priors=None, var_smoothing=0.1)
nb_model = gaussian.fit(x_train, y_train.values.ravel())

print(nb_model.score(x_test, y_test))

In [None]:
# https://www.kaggle.com/diegosch/classifier-evaluation-using-confusion-matrix

# 0 - edge players
# 1 - normal players
# 2 - all stars
# 3 - superstars

from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

y_predict = nb_model.predict(x_test)
cm = confusion_matrix(y_test, y_predict) 

# Transform to df for easier plotting
cm_df = pd.DataFrame(cm,
                     index = ['edge players','normal players', 'all stars', 'superstars'], 
                     columns = ['edge players','normal players', 'all stars', 'superstars'])

plt.figure(figsize=(5.5,4))
sns.heatmap(cm_df, annot=True)
plt.title('Naive Bayes \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_predict)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

In [None]:
# Learning Curve

train_sizes, train_scores, val_scores = learning_curve(OneVsRestClassifier(GaussianNB(priors=None, var_smoothing=0.1)), 
        x_model, 
        y_model,
        # Number of folds in cross-validation
        cv=5,
        # Evaluation metric
        scoring='accuracy',
        # 5 training-set sizes, evenly spaced from 10% to 100% of the maximum training size
        train_sizes=np.linspace(0.1, 1.0, 5))

# Create means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Create means and standard deviations of validation set scores
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)

# Draw lines
plt.plot(train_sizes, train_mean, '--', color="#ff8040",  label="Training score")
plt.plot(train_sizes, val_mean, color="#40bfff", label="Cross-validation score")

# Draw bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, color="#DDDDDD")

# Create plot
plt.title("NB Learning Curve \n ")
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), plt.legend(loc="best")
plt.tight_layout()
plt.show()

In [None]:
d_train = []
d_val = []

# X has 22 feature columns, so iterate over the actual column count instead of a hard-coded 23
for i in range(1, X_train.shape[1] + 1):
    
    X_train_index = X_train.iloc[: , 0:i]
    X_val_index = X_val.iloc[: , 0:i]
    
    # Fit the classifier created in this iteration on the first i features
    classifier = GaussianNB(priors=None, var_smoothing=0.1)
    nb_model = classifier.fit(X_train_index, Y_train.values.ravel())

    d_train.append(nb_model.score(X_train_index, Y_train))
    d_val.append(nb_model.score(X_val_index, Y_val))

In [None]:
n_features = range(1, len(d_val) + 1)
plt.title('Naive Bayes Curse of Dimensionality')
plt.plot(n_features,d_val,label="Validation")
plt.plot(n_features,d_train,label="Train")
plt.xlabel('Number of Features')
plt.ylabel('Score (Accuracy)')
plt.legend()
plt.xticks(n_features)
plt.show()

##### Decision Tree

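There are three common algorithms for building a decision tree. ID3 works with nominal attributes, C4.5 extends it to continuous features, and CART builds binary trees for both classification and regression on continuous features. scikit-learn's `DecisionTreeClassifier` uses an optimized version of CART, which suits the continuous features in our case.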
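How a tree handles a continuous feature can be sketched in a few lines: candidate thresholds are tried between sorted values, and the one with the lowest weighted Gini impurity is kept (Gini is the default criterion of `DecisionTreeClassifier`). This is a simplified, single-feature illustration and not scikit-learn's implementation; the helper names and the toy data are assumptions.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, weighted impurity)."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (t, impurity)
    return best

# Hypothetical draft numbers and binary salary labels (1 = high salary).
draft_number = [1, 3, 5, 8, 20, 35, 48, 60]
high_salary = [1, 1, 1, 1, 0, 0, 0, 0]

print(best_threshold(draft_number, high_salary))  # (14.0, 0.0)
```

A full tree repeats this search over every feature at every node, and `max_depth` and `min_samples_leaf` limit how far the splitting goes.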

In [None]:
# https://blog.csdn.net/app_12062011/article/details/52136117

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(x_train, y_train)
print(decision_tree_model.score(x_train,y_train))
print(decision_tree_model.score(x_val,y_val))

In [None]:
plt.bar(range(len(x_train.columns.values)), decision_tree_model.feature_importances_)
plt.xticks(range(len(x_train.columns.values)),x_train.columns.values, rotation= 45)
plt.title('Feature Importance')

In [None]:
#Model Tuning

#https://www.kaggle.com/drgilermo/stephen-curry-s-decision-tree

train_score = []
val_score = []
for depth in np.arange(1,20):
    decision_tree = tree.DecisionTreeClassifier(max_depth = depth,min_samples_leaf = 5)
    decision_tree.fit(x_train, y_train)
    train_score.append(decision_tree.score(x_train, y_train))
    val_score.append(decision_tree.score(x_val, y_val))

plt.plot(np.arange(1,20),train_score)
plt.plot(np.arange(1,20),val_score)
plt.legend(['Training Accuracy','Validation Accuracy'])
plt.title('Decision Tree Tuning')
plt.xlabel('Depth')
plt.ylabel('Accuracy')

In [None]:
train_score = []
val_score = []
for leaf in np.arange(1,30):
    decision_tree = tree.DecisionTreeClassifier(max_depth = 5, min_samples_leaf = leaf)
    decision_tree.fit(x_train, y_train)
    train_score.append(decision_tree.score(x_train, y_train))
    val_score.append(decision_tree.score(x_val, y_val))

plt.plot(np.arange(1,30),train_score)
plt.plot(np.arange(1,30),val_score)
plt.legend(['Training Accuracy','Validation Accuracy'])
plt.title('Decision Tree Tuning')
plt.xlabel('Minimum Samples Leaf')
plt.ylabel('Accuracy')

In [None]:
my_decision_tree_model = DecisionTreeClassifier(max_depth = 5, min_samples_leaf = 6)
my_decision_tree_model.fit(x_train, y_train)
print(my_decision_tree_model.score(x_train,y_train))
print(my_decision_tree_model.score(x_val,y_val))

In [None]:
print(my_decision_tree_model.score(x_test,y_test))

In [None]:
y_predict = my_decision_tree_model.predict(x_test)
cm = confusion_matrix(y_test, y_predict) 

# Transform to df for easier plotting
cm_df = pd.DataFrame(cm,
                     index = ['edge players','normal players', 'all stars', 'superstars'], 
                     columns = ['edge players','normal players', 'all stars', 'superstars'])

plt.figure(figsize=(5.5,4))
sns.heatmap(cm_df, annot=True)
plt.title('Decision Tree \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_predict)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

In [None]:
# Learning Curve

train_sizes, train_scores, val_scores = learning_curve(OneVsRestClassifier(DecisionTreeClassifier(max_depth = 5, min_samples_leaf = 6)), 
        x_model, 
        y_model,
        # Number of folds in cross-validation
        cv=5,
        # Evaluation metric
        scoring='accuracy',
        # 5 training-set sizes, evenly spaced from 10% to 100% of the maximum training size
        train_sizes=np.linspace(0.1, 1.0, 5))

# Create means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Create means and standard deviations of validation set scores
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)

# Draw lines
plt.plot(train_sizes, train_mean, '--', color="#ff8040",  label="Training score")
plt.plot(train_sizes, val_mean, color="#40bfff", label="Cross-validation score")

# Draw bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, color="#DDDDDD")

# Create plot
plt.title("Decision Tree Learning Curve \n ")
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), plt.legend(loc="best")
plt.tight_layout()
plt.show()

In [None]:
# Curse of Dimensionality

d_train = []
d_val = []

# X has 22 feature columns, so iterate over the actual column count instead of a hard-coded 23
for i in range(1, X_train.shape[1] + 1):
    
    X_train_index = X_train.iloc[: , 0:i]
    X_val_index = X_val.iloc[: , 0:i]
    
    classifier = DecisionTreeClassifier(max_depth = 5, min_samples_leaf = 6)
    dt_model = classifier.fit(X_train_index, Y_train.values.ravel())

    d_train.append(dt_model.score(X_train_index, Y_train))
    d_val.append(dt_model.score(X_val_index, Y_val))

In [None]:
n_features = range(1, len(d_val) + 1)
plt.title('Decision Tree Curse of Dimensionality')
plt.plot(n_features,d_val,label="Validation")
plt.plot(n_features,d_train,label="Train")
plt.xlabel('Number of Features')
plt.ylabel('Score (Accuracy)')
plt.legend()
plt.xticks(n_features)
plt.show()

##### Logistic Regression

In [None]:
# logistic regression (LR)

#https://www.kaggle.com/joparga3/2-tuning-parameters-for-logistic-regression

from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train.values.ravel())

print(logistic_model.score(x_train,y_train))
print(logistic_model.score(x_val,y_val))

In [None]:
#https://www.kaggle.com/joparga3/2-tuning-parameters-for-logistic-regression

train_score = []
val_score=[]

for i in np.arange(1,80):
    
    logistic_model = LogisticRegression(penalty = 'l2', C = i,random_state = 0)
    
    logistic_model.fit(x_train,y_train.values.ravel()) 
    
    train_score.append(logistic_model.score(x_train, y_train))
    val_score.append(logistic_model.score(x_val,y_val))

    
plt.plot(np.arange(1,80),train_score)
plt.plot(np.arange(1,80),val_score)
plt.legend(['Training Accuracy','Validation Accuracy'])
plt.title('Logistic Regression Tuning')
plt.xlabel('C')
plt.ylabel('Accuracy')

In [None]:
my_logistic_regression_model = LogisticRegression(penalty = 'l2', C = 50, random_state = 0)
my_logistic_regression_model.fit(x_train, y_train.values.ravel())
print(my_logistic_regression_model.score(x_train,y_train))
print(my_logistic_regression_model.score(x_val,y_val))

In [None]:
print(my_logistic_regression_model.score(x_test,y_test))

In [None]:
y_predict = my_logistic_regression_model.predict(x_test)
cm = confusion_matrix(y_test, y_predict) 

# Transform to df for easier plotting
cm_df = pd.DataFrame(cm,
                     index = ['edge players','normal players', 'all stars', 'superstars'], 
                     columns = ['edge players','normal players', 'all stars', 'superstars'])

plt.figure(figsize=(5.5,4))
sns.heatmap(cm_df, annot=True)
plt.title('Logistic Regression \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_predict)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

In [None]:
# Learning Curve

train_sizes, train_scores, val_scores = learning_curve(OneVsRestClassifier(LogisticRegression(penalty = 'l2', C = 50, random_state = 0)), 
        x_model, 
        y_model,
        # Number of folds in cross-validation
        cv=5,
        # Evaluation metric
        scoring='accuracy',
        # 5 training-set sizes, evenly spaced from 10% to 100% of the maximum training size
        train_sizes=np.linspace(0.1, 1.0, 5))

# Create means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Create means and standard deviations of validation set scores
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)

# Draw lines
plt.plot(train_sizes, train_mean, '--', color="#ff8040",  label="Training score")
plt.plot(train_sizes, val_mean, color="#40bfff", label="Cross-validation score")

# Draw bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, color="#DDDDDD")

# Create plot
plt.title("Logistic Regression Learning Curve \n ")
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), plt.legend(loc="best")
plt.tight_layout()
plt.show()

In [None]:
# Curse of Dimensionality

d_train = []
d_val = []

# X has 22 feature columns, so iterate over the actual column count instead of a hard-coded 23
for i in range(1, X_train.shape[1] + 1):
    
    X_train_index = X_train.iloc[: , 0:i]
    X_val_index = X_val.iloc[: , 0:i]
    
    classifier = LogisticRegression(penalty = 'l2', C = 50, random_state = 0)
    lr_model = classifier.fit(X_train_index, Y_train.values.ravel())

    d_train.append(lr_model.score(X_train_index, Y_train))
    d_val.append(lr_model.score(X_val_index, Y_val))

In [None]:
n_features = range(1, len(d_val) + 1)
plt.title('Logistic Regression Curse of Dimensionality')
plt.plot(n_features,d_val,label="Validation")
plt.plot(n_features,d_train,label="Train")
plt.xlabel('Number of Features')
plt.ylabel('Score (Accuracy)')
plt.legend()
plt.xticks(n_features)
plt.show()

Through model tuning, learning curves, the curse of dimensionality, and confusion matrices, we can learn about the characteristics of each model.

Firstly, model tuning is quite similar to the bias-variance trade-off. The most suitable point is not the one with the highest training score, but a balanced point that performs reasonably well on both the training and validation sets. Note also that the scale of the tuned parameters differs a lot between models: some take very large values, while others are quite small, as in Naive Bayes.

Secondly, all models share similar trends in the learning curve. When the training set is small, the training score is very high but the validation score is very low, which indicates overfitting. As the training set grows, the training score decreases and the validation score increases, so the gap between the two narrows. However, as the size of this dataset only reaches 300, we cannot tell how the curves would behave on a much larger dataset (more than 10 thousand observations).

Thirdly, different models have different sensitivities to dimensionality. In our models, the curse of dimensionality is obvious in SVM, Naive Bayes, and Logistic Regression, where the high-dimensional features cause overfitting. In the other models the "curse" does not appear, perhaps because the number of features is not large enough.

Lastly, the confusion matrix tells us the performance on different target groups. For example, SVM does better at predicting label "0", while KNN performs better at predicting label "1". Similar differences appear in the other comparisons, which makes the confusion matrix an important way to see the details of our models' predictions.
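The point about per-group performance can be made concrete with a small sketch that derives accuracy and per-class recall from a confusion matrix laid out as in scikit-learn (rows are true labels, columns are predicted labels). The two matrices below are invented for illustration and are not the results of the models above.

```python
def per_class_recall(cm):
    """Recall per true class: diagonal count divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

def accuracy(cm):
    """Overall accuracy: correct predictions (diagonal) over all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

# Hypothetical binary confusion matrices (rows = true label, columns = predicted label).
cm_a = [[40, 5],
        [10, 12]]
cm_b = [[35, 10],
        [5, 17]]

for name, cm in (("model A", cm_a), ("model B", cm_b)):
    print(name, "accuracy:", round(accuracy(cm), 3), "recall per class:", per_class_recall(cm))
```

Both hypothetical models reach the same overall accuracy, yet model A is better on label 0 and model B is better on label 1, which is exactly the kind of detail a single accuracy score hides.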

# 3. Recommendations & Discussions

## Recommendations

* What are the four most important features that influence salary?

Through our modelling process, the four most important features are draft number, age, WS, and BPM.

* What are the most suitable regression and classification models to predict players’ salaries? And how do the models work?

The most suitable regression model is 2nd-order polynomial regression, which has an RMSE of about 4.1m. The most suitable binary classification model is SVM, with an accuracy score of 0.896. The Decision Tree model performs best on the Nominal target classification, with an accuracy score of 0.691.

* What recommendations can be made?

Firstly, through our exploratory analysis and modelling process, the difference between the salaries of overseas and USA players is not significant, but the number of USA players is almost 3 times more than that of overseas players. This difference can be an opportunity to recruit more overseas players for promotion.

Secondly, the correlations between players' salaries and their stats are not so strong, which means that some players are overpaid or underpaid. In fact, this phenomenon is quite common in the real NBA market, and teams' managers should take note of it.

## Discussions

In this mini-project, there are some technical limitations, such as normalization methods, model selection, and stacking skills.

According to Andrew Ng's open course, normalization rescales features with large values into a small range, which may make the loss function more accurate. Maybe that is the reason why my model's RMSE reaches over a million. But I have not applied it because I think it would affect the meaning of the real data, and I also think it is meaningless because our RMSE is a relative value, not an absolute one.
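One way to check how feature scaling interacts with RMSE is the small experiment below, a sketch on made-up data and not a rerun of the models in this report. It fits a one-feature least-squares line on a raw feature and on its z-score standardized version. For plain least squares with an intercept, standardizing the feature leaves the predictions unchanged, so the RMSE stays in the units of the target (dollars). The function names and the numbers are hypothetical.

```python
import math
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def rmse(xs, ys, slope, intercept):
    """Root mean squared error, in the units of the target ys."""
    squared_errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(statistics.fmean(squared_errors))

# Hypothetical data: minutes played and salary in dollars.
minutes = [300.0, 900.0, 1500.0, 2100.0, 2700.0]
salary = [1.2e6, 3.5e6, 9.0e6, 14.0e6, 26.0e6]

# Z-score standardization of the feature only; the target stays in dollars.
mu, sigma = statistics.fmean(minutes), statistics.pstdev(minutes)
minutes_std = [(m - mu) / sigma for m in minutes]

rmse_raw = rmse(minutes, salary, *fit_line(minutes, salary))
rmse_std = rmse(minutes_std, salary, *fit_line(minutes_std, salary))
print(rmse_raw, rmse_std)
```

The two values agree up to floating-point error. Scaling tends to matter more for distance-based or regularized models such as KNN, SVM, Ridge and Lasso, where the size of each feature directly affects the fit.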

Model selection is important. At the beginning of this report, we planned to apply other advanced models, such as random forests, AdaBoost, and so on. However, the mathematical concepts made me realize that it is meaningless to apply them if I cannot understand the basic algorithms behind them. Therefore, I selected Decision Tree after reading the slides, and Logistic Regression after watching Andrew Ng's videos. Although I could not apply the advanced models this time, I believe I will understand them in the near future.

Also, multiple regression is something I could not implement in code, because I could not find any example in others' work. Similarly, stacking is not yet within my ability. But I believe that I will get them done in the future.

## References

Di, H. (2018). Value Creation: Comparative Netnographic Study of Two NBA Online Communities.

Narayanan, A., Shi, E., & Rubinstein, B. I. (2011, July). Link prediction by de-anonymization: How we won the kaggle social network challenge. In Neural Networks (IJCNN), The 2011 International Joint Conference on (pp. 1825-1834). IEEE.

Rosen, J., Arcidiacono, P., & Kimbrough, K. (2016). Determining NBA Free Agent Salary from Player Performance.

**---------Ending--------**