# Nonlinear Regression Models

Content:

1. K Nearest Neighbors (KNN)
2. Support Vector Regression (SVR)
3. Naive Bayes
4. Artificial Neural Networks
5. Classification and Regression Tree / Decision Tree (CART)


In [32]:

# Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import neighbors
from sklearn.svm import SVR

In [4]:
# stop warnings

from warnings import filterwarnings
filterwarnings('ignore')

## <b>1. KNN Model</b>
    
KNN is a nonparametric method: it makes predictions from the similarity between observations.\
Euclidean distance is used to find the k nearest observations, whose values are then averaged.

1) Define the number of neighbors (k). \
2) Compute the distance between the unknown observation and each known observation. \
3) Sort the distances and select the k closest observations. \
4) - For a classification problem, predict the most frequent class among them. \
   - For a regression problem, predict the mean of their values.
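
The steps above can be sketched from scratch for the regression case. This is only a minimal illustration with NumPy on made-up toy data; the function name `knn_predict` is hypothetical, and it is not how `KNeighborsRegressor` is implemented internally.

```python
import numpy as np

def knn_predict(X_known, y_known, x_new, k=3):
    """Predict a value for x_new as the mean target of its k nearest neighbors."""
    # Step 2: Euclidean distance from x_new to every known observation
    distances = np.sqrt(((X_known - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4 (regression): average the neighbors' target values
    return y_known[nearest].mean()

# Toy data, made up for illustration only
X_known = np.array([[1.0], [2.0], [3.0], [10.0]])
y_known = np.array([10.0, 20.0, 30.0, 100.0])

# The three nearest points to 2.1 are 2.0, 3.0 and 1.0, so the mean is 20.0
prediction = knn_predict(X_known, y_known, np.array([2.1]), k=3)
print(prediction)
```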
    


In [11]:
import pandas as pd
df = pd.read_csv("/Users/User/Hitters.csv")
df

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
1,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475.0,N
2,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480.0,A
3,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500.0,N
4,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
317,497,127,7,65,48,37,5,2703,806,32,379,311,138,N,E,325,9,3,700.0,N
318,492,136,5,76,50,94,12,5511,1511,39,897,451,875,A,E,313,381,20,875.0,A
319,475,126,3,61,43,52,6,1700,433,7,217,93,146,A,W,37,113,7,385.0,A
320,573,144,9,85,60,78,8,3198,857,97,470,420,332,A,E,1314,131,12,960.0,A


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AtBat      322 non-null    int64  
 1   Hits       322 non-null    int64  
 2   HmRun      322 non-null    int64  
 3   Runs       322 non-null    int64  
 4   RBI        322 non-null    int64  
 5   Walks      322 non-null    int64  
 6   Years      322 non-null    int64  
 7   CAtBat     322 non-null    int64  
 8   CHits      322 non-null    int64  
 9   CHmRun     322 non-null    int64  
 10  CRuns      322 non-null    int64  
 11  CRBI       322 non-null    int64  
 12  CWalks     322 non-null    int64  
 13  League     322 non-null    object 
 14  Division   322 non-null    object 
 15  PutOuts    322 non-null    int64  
 16  Assists    322 non-null    int64  
 17  Errors     322 non-null    int64  
 18  Salary     263 non-null    float64
 19  NewLeague  322 non-null    object 
dtypes: float64

In [13]:
df.nunique()

AtBat        247
Hits         144
HmRun         36
Runs          96
RBI          103
Walks         89
Years         22
CAtBat       314
CHits        288
CHmRun       146
CRuns        261
CRBI         262
CWalks       248
League         2
Division       2
PutOuts      232
Assists      161
Errors        29
Salary       150
NewLeague      2
dtype: int64

In [18]:
df.isnull().sum()

AtBat         0
Hits          0
HmRun         0
Runs          0
RBI           0
Walks         0
Years         0
CAtBat        0
CHits         0
CHmRun        0
CRuns         0
CRBI          0
CWalks        0
League        0
Division      0
PutOuts       0
Assists       0
Errors        0
Salary       59
NewLeague     0
dtype: int64

In [19]:
# remove missing values
df.dropna(inplace =True)
df.shape

(263, 20)

In [22]:
# convert categorical data into numeric using One Hot Encoding

dms = pd.get_dummies(df[["League", "Division", "NewLeague"]])
dms.head()

Unnamed: 0,League_A,League_N,Division_E,Division_W,NewLeague_A,NewLeague_N
1,0,1,0,1,0,1
2,1,0,0,1,1,0
3,0,1,1,0,0,1
4,0,1,1,0,0,1
5,1,0,0,1,1,0


In [27]:
y=df["Salary"]
X_=df.drop(["Salary","League","Division","NewLeague"], axis=1)

In [51]:
X=pd.concat([X_, dms[["League_N","Division_W","NewLeague_N"]]], axis=1).astype("float64")
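
Keeping one dummy column per two-level category, as done above, avoids carrying redundant columns. As a hedged alternative sketch, `pd.get_dummies` can drop the first level itself through `drop_first=True`; the toy frame below is made up for illustration and only mimics two of the categorical columns.

```python
import pandas as pd

# Toy frame, made up for illustration; it mimics two categorical columns
toy = pd.DataFrame({"League": ["A", "N", "A"], "Division": ["W", "E", "W"]})

# drop_first=True drops the first level of each category (A and E here),
# leaving one indicator column per original column
dummies = pd.get_dummies(toy, drop_first=True).astype("float64")
print(list(dummies.columns))
```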

***HoldOut / train_test_split***

In [52]:
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)

- Let's build the KNN model

In [53]:
# Fit a KNN regressor with the default hyperparameters

from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor().fit(X_train, y_train)
knn_model

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [54]:
knn_model.n_neighbors

5

In [55]:
knn_model.metric

'minkowski'

In [56]:
y_pred = knn_model.predict(X_test)

In [57]:
np.sqrt(mean_squared_error(y_test, y_pred))

426.6570764525201

***Model Tuning***

In [63]:
RMSE =[]

for k in range(1,10):
    knn_model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    RMSE.append(rmse)
    print("for k = ",k,"RMSE value=",rmse)

for k =  1 RMSE value= 455.03925390751965
for k =  2 RMSE value= 415.99629571490965
for k =  3 RMSE value= 420.6765370082348
for k =  4 RMSE value= 428.8564674588792
for k =  5 RMSE value= 426.6570764525201
for k =  6 RMSE value= 423.5071669008732
for k =  7 RMSE value= 414.9361222421057
for k =  8 RMSE value= 413.7094731463598
for k =  9 RMSE value= 417.84419990871265


***Model Tuning / K-Fold Cross Validation*** / GridSearchCV

We use the ***GridSearchCV()*** function to optimize hyperparameters.
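
When `scoring` is not given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which is R² for regressors. The sketch below shows one way to rank by RMSE instead; it runs on synthetic data from `make_regression` (an assumption for illustration, standing in for `X_train` and `y_train`), and the names `X_demo`, `y_demo` and `search` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data stands in for X_train / y_train (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

param_grid = {"n_neighbors": np.arange(1, 11)}

# scoring is set explicitly so that candidates are ranked by (negated) RMSE
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
cv_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_k, cv_rmse)
```

The score is negated because `GridSearchCV` always maximizes, so a lower RMSE has to appear as a higher score.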



In [66]:
knn_params = {"n_neighbors": np.arange(1,30,1)} # candidate k values from 1 to 29
knn = KNeighborsRegressor()

In [67]:
# Fit a new model with cross-validation to find the optimal hyperparameter values
knn_cv_model = GridSearchCV(knn,knn_params, cv=10).fit(X_train,y_train)

In [68]:
knn_cv_model.best_params_

{'n_neighbors': 8}

In [70]:
knn_cv_model.best_params_['n_neighbors']

8

***Final Model***

In [71]:

knn_tuned = KNeighborsRegressor(n_neighbors=8).fit(X_train, y_train) # equivalent to the commented-out line below

#knn_tuned = KNeighborsRegressor(n_neighbors=knn_cv_model.best_params_['n_neighbors']).fit(X_train, y_train)

In [72]:
y_pred = knn_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

413.7094731463598

## <b>2. Support Vector Regression (SVR)</b>

It is a powerful and flexible modelling technique that is robust to outliers.
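
One common explanation for this robustness is the epsilon-insensitive loss used by epsilon-SVR: errors smaller than `epsilon` cost nothing and larger errors grow linearly rather than quadratically. The sketch below compares it with squared error on made-up residuals; the function name `epsilon_insensitive_loss` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger errors grow linearly."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Made-up residuals: a small, a medium and an outlier-sized error
residuals = np.array([0.05, 1.0, 10.0])
zeros = np.zeros_like(residuals)

eps_loss = epsilon_insensitive_loss(residuals, zeros, epsilon=0.1)
sq_loss = residuals ** 2

print(eps_loss)  # the outlier contributes 9.9
print(sq_loss)   # the outlier contributes 100.0
```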

In [73]:
?SVR  # C is an influential hyperparameter in SVR

Init signature:
SVR(
    kernel='rbf',
    degree=3,
    gamma='scale',
    coef0=0.0,
    tol=0.001,
    C=1.0,
    epsilon=0.1,
    shrinking=True,
    cache_size=200,
    verbose=False,
    max_iter=-1,
)
Docstring:
Epsilon-Support Vector Regression.

The free parameters in the model are C and epsilon.

The implementat

In [77]:
# Let's fit a baseline model with the default hyperparameters

from sklearn.svm import SVR

svr_model = SVR().fit(X_train, y_train)
y_pred = svr_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

460.0032657244849

In [78]:
# use a linear kernel instead of the default 'rbf'

svr_model = SVR(kernel="linear").fit(X_train, y_train)
y_pred = svr_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

370.04084185624924

In [79]:
svr_model.intercept_  # unlike KNN, a linear SVR has an intercept and coefficients

array([-80.15196151])

In [80]:
svr_model.coef_

array([[ -1.21839037,   6.09602969,  -3.67574533,   0.14217075,
          0.51435919,   1.28388986,  12.55922537,  -0.08693755,
          0.46597184,   2.98259944,   0.52944523,  -0.79820799,
         -0.16015534,   0.30872794,   0.28842348,  -1.79560067,
          6.41868985, -10.74313783,   1.33374317]])

***Model Tuning***

In [82]:
svr_params = {"C":[0.1, 0.5, 1, 3]}

svr_cv_model = GridSearchCV(svr_model, svr_params, cv=5, verbose =2, n_jobs=-1 ).fit(X_train, y_train)

# verbose=2 prints progress details / n_jobs=-1 uses all CPU cores to speed up the search

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.9min finished


In [83]:
svr_cv_model.best_params_

{'C': 0.5}

***Final Model***

In [88]:
svr_tuned = SVR(kernel="linear", C=svr_cv_model.best_params_["C"]).fit(X_train, y_train)
y_pred = svr_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

367.9874739022889

## <b>3. Naive Bayes</b>
    
    coming soon!
   

## <b>4. Artificial Neural Network (ANN) / Multi-layer Perceptron (MLP) / Deep Learning (DL)</b>

An ANN is loosely modelled on the human brain. Because an ANN has a functional form, fitting it means estimating its coefficients (weights). \
The goal is to find the coefficients that predict with minimum error.
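
To make the idea of coefficients concrete, the sketch below fits a small `MLPRegressor` and inspects the shapes of its learned weight matrices (`coefs_`) and bias vectors (`intercepts_`). It uses synthetic data from `make_regression` and small layer sizes chosen only for illustration; the names `X_demo`, `y_demo` and `mlp` are hypothetical.

```python
import warnings
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# Synthetic data with 4 features (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence convergence warnings in this short demo
    mlp = MLPRegressor(hidden_layer_sizes=(8, 4), max_iter=200, random_state=0)
    mlp.fit(X_demo, y_demo)

# One weight matrix and one bias vector per layer transition:
# 4 inputs -> 8 hidden units -> 4 hidden units -> 1 output
weight_shapes = [w.shape for w in mlp.coefs_]
bias_shapes = [b.shape for b in mlp.intercepts_]
print(weight_shapes)
print(bias_shapes)
```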

In [103]:
from sklearn.neural_network import MLPRegressor

mlp_model = MLPRegressor().fit(X_train, y_train)
y_pred = mlp_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

364.34113126339435

***Model Tuning / CV***

In [104]:
mlp_params ={"alpha":[0.1, 0.01, 0.02, 0.001, 0.0001]
            ,"hidden_layer_sizes":[(10,20), (5,5), (100,100)]}

mlp_cv_model = GridSearchCV(MLPRegressor(), mlp_params, cv=10, verbose=2, n_jobs=-1 )

In [105]:
mlp_cv_model = mlp_cv_model.fit(X_train, y_train)

Fitting 10 folds for each of 15 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:   24.6s finished


In [106]:
mlp_cv_model.best_params_

{'alpha': 0.1, 'hidden_layer_sizes': (100, 100)}

***Final Model***

In [107]:
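# Note: the grid search above selected alpha=0.1, whereas this cell uses alpha=0.0001 (the MLPRegressor default)
mlp_tuned = MLPRegressor(alpha=0.0001, hidden_layer_sizes =(100,100)).fit(X_train, y_train)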
y_pred = mlp_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

343.2049478944078

## <b>5. Decision Trees / Classification and Regression Trees (CART)</b>

The goal is to convert complex structures in the dataset into simple decision structures.

In [111]:
from sklearn.tree import DecisionTreeRegressor

cart_model = DecisionTreeRegressor().fit(X_train, y_train)
cart_model

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [112]:
y_pred = cart_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

446.6238326904319

***Model Tuning***

In [114]:
cart_params = {"max_depth":[2,3,4,5,10,20], 
               "min_samples_split":[2,5,10,30,50]}

cart_cv_model = GridSearchCV(DecisionTreeRegressor(), cart_params, cv=10, verbose=2, n_jobs=-1)
cart_cv_model = cart_cv_model.fit(X_train, y_train)

Fitting 10 folds for each of 30 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    0.9s finished


In [115]:
cart_cv_model.best_params_

{'max_depth': 4, 'min_samples_split': 50}

***Final Model***

In [116]:
cart_tuned = DecisionTreeRegressor(max_depth =4, min_samples_split=50).fit(X_train, y_train)
y_pred = cart_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

361.0876906511434

In [127]:
# Let's visualize the decision tree
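
The source leaves this cell empty. One possible way to inspect a fitted tree, sketched here under assumptions, is `sklearn.tree.export_text`, which prints the split rules as indented text. The example uses synthetic data and hypothetical feature names (`f0`, `f1`, `f2`) rather than the Hitters data.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data and feature names are assumptions for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
feature_names = ["f0", "f1", "f2"]

# A shallow tree keeps the printed rules short
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Each indented line is a split rule; leaf lines show the predicted value
tree_text = export_text(tree, feature_names=feature_names)
print(tree_text)
```

For a graphical view, `sklearn.tree.plot_tree` draws the same structure with matplotlib.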

In [128]:
?GridSearchCV

Init signature:
GridSearchCV(
    estimator,
    param_grid,
    scoring=None,
    n_jobs=None,
    iid='deprecated',
    refit=True,
    cv=None,
    verbose=0,
    pre_dispatch='2*n_jobs',
    error_score=nan,
    return_train_score=False,
)
Docstring:
Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV impleme