### Topic: NBA salary prediction

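An NBA player's salary presumably depends, at least in part, on his on-court performance. Below, we use a set of on-court statistics to see whether linear regression can successfully predict NBA players' salaries.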

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### **Load the data and remove NaN values**  [Data source](https://www.kaggle.com/aishjun/nba-salaries-prediction-in-20172018-season#2017-18_NBA_salary.csv)

In [2]:
dataset = pd.read_csv('./2017-18_NBA_salary.csv')
dataset.head()

Unnamed: 0,Player,Salary,NBA_Country,NBA_DraftNumber,Age,Tm,G,MP,PER,TS%,...,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
0,Zhou Qi,815615,China,43,22,HOU,16,87,0.6,0.303,...,18.2,19.5,-0.4,0.1,-0.2,-0.121,-10.6,0.5,-10.1,-0.2
1,Zaza Pachulia,3477600,Georgia,42,33,GSW,66,937,16.8,0.608,...,19.3,17.2,1.7,1.4,3.1,0.16,-0.6,1.3,0.8,0.7
2,Zach Randolph,12307692,USA,19,36,SAC,59,1508,17.3,0.529,...,12.5,27.6,0.3,1.1,1.4,0.046,-0.6,-1.3,-1.9,0.0
3,Zach LaVine,3202217,USA,13,22,CHI,24,656,14.6,0.499,...,9.7,29.5,-0.1,0.5,0.4,0.027,-0.7,-2.0,-2.6,-0.1
4,Zach Collins,3057240,USA,10,20,POR,62,979,8.2,0.487,...,15.6,15.5,-0.4,1.2,0.8,0.038,-3.7,0.9,-2.9,-0.2


In [3]:
dataset.isnull().any()  # check which columns contain any null values

Player             False
Salary             False
NBA_Country        False
NBA_DraftNumber    False
Age                False
Tm                 False
G                  False
MP                 False
PER                False
TS%                 True
3PAr                True
FTr                 True
ORB%               False
DRB%               False
TRB%               False
AST%               False
STL%               False
BLK%               False
TOV%                True
USG%               False
OWS                False
DWS                False
WS                 False
WS/48              False
OBPM               False
DBPM               False
BPM                False
VORP               False
dtype: bool

In [4]:
dataset.shape

(485, 28)

In [5]:
dataset = dataset.dropna(axis = 0) # drop every row that contains a NaN
dataset.shape

(483, 28)

In [6]:
dataset.isnull().any().any()

False
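`isnull().any()` only says whether a column has missing values, not how many. The shape change above, from 485 to 483 rows, implies that two rows contained NaN. As a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the dataset), `isnull().sum()` gives per-column counts and `isnull().any(axis=1)` marks the rows that `dropna(axis=0)` would remove:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up, not from the CSV.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "TS%": [0.55, np.nan, 0.60],
    "TOV%": [12.0, np.nan, 15.5],
})

nan_counts = toy.isnull().sum()                  # number of NaN values per column
rows_with_nan = toy.isnull().any(axis=1).sum()   # rows that dropna(axis=0) would remove
cleaned = toy.dropna(axis=0)                     # keep only fully populated rows

print(nan_counts)
print(rows_with_nan, cleaned.shape)
```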

### Select the features we want

In [7]:
salary_df = dataset.loc[:, ['Salary']]  # Salary is the prediction target
# Drop the target plus the country, player-name and team columns, which are not useful as features here.
feature_df = dataset.drop(['NBA_Country', 'Salary', 'Player', 'Tm'], axis=1)
feature_df.head()

Unnamed: 0,NBA_DraftNumber,Age,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,...,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
0,43,22,16,87,0.6,0.303,0.593,0.37,6.5,16.8,...,18.2,19.5,-0.4,0.1,-0.2,-0.121,-10.6,0.5,-10.1,-0.2
1,42,33,66,937,16.8,0.608,0.004,0.337,11.0,25.0,...,19.3,17.2,1.7,1.4,3.1,0.16,-0.6,1.3,0.8,0.7
2,19,36,59,1508,17.3,0.529,0.193,0.14,7.0,23.8,...,12.5,27.6,0.3,1.1,1.4,0.046,-0.6,-1.3,-1.9,0.0
3,13,22,24,656,14.6,0.499,0.346,0.301,1.4,14.4,...,9.7,29.5,-0.1,0.5,0.4,0.027,-0.7,-2.0,-2.6,-0.1
4,10,20,62,979,8.2,0.487,0.387,0.146,4.9,18.3,...,15.6,15.5,-0.4,1.2,0.8,0.038,-3.7,0.9,-2.9,-0.2
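Note that `Unnamed: 0` is still among the features. pandas gives that name to a CSV column whose header is empty, which is typically a row index written out when the file was saved. A minimal sketch, assuming the first column really is such an index, reads it with `index_col=0` so it cannot leak into the feature matrix. The toy CSV text below is made up for illustration, and whether this changes the score on the real data is not shown in the original:

```python
import io
import pandas as pd

# Toy CSV mimicking the file layout: an unnamed leading column holding the row index.
csv_text = ",Player,Salary,Age,Tm\n0,A,1000000,22,HOU\n1,B,2000000,30,GSW\n"

raw = pd.read_csv(io.StringIO(csv_text))                   # default read
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index

raw_columns = list(raw.columns)          # contains 'Unnamed: 0'
indexed_columns = list(indexed.columns)  # the index is no longer a column

# Same kind of drop as in the cell above, now without the index among the features.
features = indexed.drop(["Salary", "Player", "Tm"], axis=1)
print(raw_columns, indexed_columns, list(features.columns))
```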


### Linear regression

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

model = LinearRegression()
# Hold out 20% of the rows for testing; random_state fixes the split for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(feature_df, salary_df, test_size=0.2, random_state=4)

model.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:
Y_pred = model.predict(x_test)

In [10]:
model.score(x_test,y_test)

0.4391598877185242
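For a `LinearRegression`, `model.score` returns the coefficient of determination R², not a classification accuracy: R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand; `r2_score_manual` is a hypothetical helper name and the numbers are toy values, not salaries from the dataset:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, y_true)          # exact predictions
mean_only = r2_score_manual(y_true, [2.5] * 4)     # always predicting the mean
print(perfect, mean_only)
```

Read this way, a score of about 0.44 means the model explains roughly 44% of the variance in test-set salaries. A model that always predicts the mean scores 0, and worse models can score below 0.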

### The score (R²) is only about 0.44, which is not great, so below we switch to a makeshift classification approach

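The approach: split salaries into buckets of 5 million each, assign both the predicted and the actual salary to a bucket, and then count how many predictions land in the correct bucket.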

In [11]:
ytest = y_test.values

In [12]:
NewPredict=[]
NewReal=[]
def ReValue(dataset,NewDataSet):
    # For each salary, append the lower bound of its 5,000,000-wide bucket to NewDataSet.
    for i in dataset:
        if i>=30000000:
            NewDataSet.append(30000000)
        elif i>=25000000:
            NewDataSet.append(25000000)
        elif i>=20000000:
            NewDataSet.append(20000000)
        elif i>=15000000:
            NewDataSet.append(15000000)
        elif i>=10000000:
            NewDataSet.append(10000000)
        elif i>=5000000:
            NewDataSet.append(5000000)
        else :
            NewDataSet.append(1000000)  # label for everything below 5,000,000
            
ReValue(Y_pred,NewPredict)
ReValue(ytest,NewReal)
assert(len(NewReal)==len(NewPredict))
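The same bucketing can be written without the `if`/`elif` chain. The sketch below is an assumed vectorized equivalent of `ReValue`, not the notebook's own code: it divides by the bucket width, clips the bucket index to the range 0 to 6 (so negative predictions fall into the lowest bucket and anything at or above 30,000,000 into the highest), and keeps the original 1,000,000 label for the lowest bucket. `bucket_salaries` and `BIN_WIDTH` are hypothetical names, and the inputs are made-up numbers:

```python
import numpy as np

BIN_WIDTH = 5_000_000  # bucket width used in the notebook

def bucket_salaries(values):
    """Map each salary to the lower edge of its 5M bucket, capped at 30M."""
    v = np.asarray(values, dtype=float).ravel()
    idx = np.clip(np.floor(v / BIN_WIDTH), 0, 6).astype(int)  # bucket index 0..6
    edges = idx * BIN_WIDTH
    # ReValue labels the lowest bucket 1_000_000 rather than 0; mirror that here.
    return np.where(idx == 0, 1_000_000, edges)

# Made-up predicted and actual salaries for illustration.
pred = bucket_salaries([-200_000, 4_999_999, 5_000_000, 12_300_000, 31_000_000])
real = bucket_salaries([800_000, 6_000_000, 5_500_000, 14_000_000, 26_000_000])

accuracy = np.mean(pred == real)  # share of samples whose buckets match
print(pred.tolist(), real.tolist(), accuracy)
```

Because `ravel()` flattens the input, this also accepts the column-shaped arrays that `model.predict` and `y_test.values` produce when the target is a one-column DataFrame.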

In [13]:
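error=0
# Compare every test sample, from index 0 through the last one.
# The output recorded below came from range(1, len(NewPredict)-1), which skipped
# the first and last samples, so its error count may be low by up to 2.
for i in range(len(NewPredict)):
    if(NewReal[i]!=NewPredict[i]):
        error+=1

print(f"{len(NewPredict)} test samples, {error} predicted in the wrong bucket, accuracy {100 *(len(NewPredict)-error)/len(NewPredict)} %")

97 test samples, 40 predicted in the wrong bucket, accuracy 58.76288659793814 %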
