
#Linear Models

$\hat{y}(w,x) = w_0 + w_1x_1 + \dots + w_px_p$
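
As a quick illustration of the formula above, the following sketch evaluates $\hat{y}$ for a single sample. The function name `predict` and all numeric values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def predict(w0, w, x):
    """Return w0 + w[0]*x[0] + ... + w[p-1]*x[p-1]."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return w0 + sum(wj * xj for wj, xj in zip(w, x))

# Hypothetical intercept, weights, and feature vector (not from the dataset)
w0 = 1.0
w = [2.0, -0.5]
x = [3.0, 4.0]

y_hat = predict(w0, w, x)  # 1.0 + 2.0*3.0 + (-0.5)*4.0
print(y_hat)
```

The intercept $w_0$ is kept separate from the weights here, which mirrors how scikit-learn stores them in `intercept_` and `coef_`.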

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/weibb123/ScikitLearn_Tutorial/main/boston.csv')
df.head(5)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


##Split the data into train and test

In [42]:
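from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Features are every column except the target, MEDV
X = df.drop('MEDV', axis=1)
y = df['MEDV']

# Preprocess: rescale each feature to the [0, 1] range.
# Note: the scaler is fitted on all rows before the split, so the
# test rows also influence the minima and maxima used for scaling.
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# Default split: 75% train, 25% test
xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=1)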
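
The cell above fits the scaler on every row before splitting. A common alternative is to learn the scaling statistics from the training rows only and reuse them for the test rows. The sketch below shows that idea with plain NumPy on a hypothetical toy matrix; it is not the notebook's preprocessing, only an illustration of what min-max scaling computes.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[4.0, 60.0], [0.0, 10.0]])

# "Fit": learn the per-feature minimum and range from the training rows only
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

# "Transform": apply the same training statistics to both sets
X_train_scaled = (X_train - col_min) / col_range
X_test_scaled = (X_test - col_min) / col_range

print(X_train_scaled)
print(X_test_scaled)
```

With this approach the scaled test values may fall outside [0, 1], because the test rows were not used to determine the minimum and maximum.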

In [54]:
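# Create evaluation functions
from sklearn import metrics
from sklearn.model_selection import cross_val_score


def cross_val(model):
    # Mean 10-fold cross-validation score on the full dataset
    # (cross_val_score uses R^2 by default for regressors)
    scores = cross_val_score(model, X, y, cv=10)
    return scores.mean()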

def evaluate(true, predicted):
    # Return MAE, MSE, RMSE and R^2 for one set of predictions
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = metrics.r2_score(true, predicted)
    return mae, mse, rmse, r2_square

def print_evaluate(true, predicted):
    # Print the same four metrics in a readable block
    mae, mse, rmse, r2_square = evaluate(true, predicted)
    print('MAE:', mae)
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('R2 Square', r2_square)
    print('__________________________________')
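
To make the four metrics used by the helper functions above concrete, here is a minimal pure-Python sketch that computes them from their definitions. The function `regression_metrics` and the sample values are hypothetical and exist only for illustration; the notebook itself relies on `sklearn.metrics`.

```python
import math

def regression_metrics(true, predicted):
    """Compute MAE, MSE, RMSE and R^2 from their definitions."""
    n = len(true)
    errors = [t - p for t, p in zip(true, predicted)]
    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target
    mean_true = sum(true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Hypothetical targets and predictions
true = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

mae, mse, rmse, r2 = regression_metrics(true, predicted)
print(mae, mse, rmse, r2)
```

RMSE is often the easiest of these to interpret because it is expressed in the same units as the target, while $R^2$ compares the model against always predicting the mean.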

#1.1.1. Ordinary Least Squares

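Linear regression assumes a linear relationship between the features and the target. Its fit method takes the arrays X and y and stores the coefficients $w$ of the linear model in its coef_ attribute. Ordinary least squares chooses $w$ to minimize the residual sum of squares between the observed and predicted targets:

$$\min_{w} \|Xw-y\|_2^2$$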
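
The least-squares objective above can also be solved directly with NumPy, which may help show what `LinearRegression` estimates. This is a sketch on hypothetical noiseless data, not the notebook's model; the names `X_toy`, `y_toy`, and `X_design` are assumptions introduced for the example.

```python
import numpy as np

# Hypothetical noiseless data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1]

# Prepend a column of ones so the intercept w_0 is estimated with the weights
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# Solve min_w ||X_design w - y||_2^2
w, residuals, rank, singular_values = np.linalg.lstsq(X_design, y_toy, rcond=None)
intercept, coef = w[0], w[1:]

print(intercept, coef)
```

Because the toy data contains no noise, the recovered intercept and weights should match the generating values up to floating-point error.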

In [50]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
# fitting this model to the training data
reg.fit(xtrain, ytrain)

# make prediction
y_model = reg.predict(xtest)

In [52]:
reg.coef_

array([-10.18072577,   5.7129978 ,   1.0448317 ,   2.42854641,
       -10.31905505,  15.01628506,   0.67107567, -16.18284737,
         7.03303653,  -5.5937189 ,  -9.36369973,   2.48951602,
       -20.20069882])

These are the weights the model learned, one for each of the 13 feature columns.

In [56]:
#Prediction on test/train sets
test_pred = reg.predict(xtest)
train_pred = reg.predict(xtrain)


print('Test set evaluation:\n_____________________________________')
print_evaluate(ytest, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(ytrain, train_pred)

Test set evaluation:
_____________________________________
MAE: 3.574868126127539
MSE: 21.897765396049476
RMSE: 4.679504823808762
R2 Square 0.7789410172622859
__________________________________
Train set evaluation:
_____________________________________
MAE: 3.2505130022869873
MSE: 22.47798382187789
RMSE: 4.741095213331819
R2 Square 0.7168057552393374
__________________________________


#1.1.2. Ridge regression
$$\min_{w} \|Xw-y\|_2^2+\alpha\|w\|_2^2$$
Ridge regression addresses some of the problems of ordinary least squares by imposing a penalty on the size of the coefficients. This is also called regularization.

The complexity parameter $\alpha \geq 0$ controls the amount of shrinkage: the larger the value of $\alpha$, the greater the shrinkage, and the more robust the coefficients become to collinearity.
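
The shrinkage effect described above can be seen with the closed-form ridge solution $w = (X^TX + \alpha I)^{-1}X^Ty$. The sketch below is a simplified illustration without an intercept term, on hypothetical data with two nearly collinear features; `ridge_closed_form` is a name introduced for this example, not part of scikit-learn.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min_w ||Xw - y||_2^2 + alpha * ||w||_2^2 (no intercept term)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical data with two nearly collinear features
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X_toy = np.column_stack([x1, x2])
y_toy = x1 + x2 + 0.1 * rng.normal(size=100)

# Norm of the coefficient vector for increasing penalty strength
alphas = (0.0, 0.7, 10.0, 100.0)
norms = [np.linalg.norm(ridge_closed_form(X_toy, y_toy, a)) for a in alphas]

for a, n in zip(alphas, norms):
    print(a, n)
```

The coefficient norm does not increase as $\alpha$ grows, which is the shrinkage the penalty term is meant to produce.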

In [64]:
reg = linear_model.Ridge(alpha=0.7)
# fitting this model to the training data
reg.fit(xtrain, ytrain)

# make prediction
y_model = reg.predict(xtest)

In [65]:
#Prediction on test/train sets
test_pred = reg.predict(xtest)
train_pred = reg.predict(xtrain)


print('Test set evaluation:\n_____________________________________')
print_evaluate(ytest, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(ytrain, train_pred)

Test set evaluation:
_____________________________________
MAE: 3.5930415034165835
MSE: 22.671072050748094
RMSE: 4.761414921086808
R2 Square 0.771134449818522
__________________________________
Train set evaluation:
_____________________________________
MAE: 3.2088512237189897
MSE: 22.696456757340012
RMSE: 4.764079843720087
R2 Square 0.7140532718115924
__________________________________


#1.1.3. Lasso regression
Lasso is a linear model that estimates sparse coefficients. Its $\ell_1$ penalty can drive some coefficients exactly to zero, which effectively reduces the number of features the model depends on.
$$\min_{w}\frac{1}{2n_{\text{samples}}}\|Xw-y\|_2^2+\alpha\|w\|_1$$
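
To show how the $\ell_1$ penalty produces exact zeros, here is a minimal coordinate-descent sketch for the objective above. It omits the intercept and uses a fixed number of sweeps, so it is a teaching aid rather than a substitute for `linear_model.Lasso`; the function names and toy data are assumptions made for this example.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1 / (2 n)) * ||Xw - y||_2^2 + alpha * ||w||_1 (no intercept)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution removed
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Hypothetical data: only the first of three features influences y
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + 0.1 * rng.normal(size=200)

w_lasso = lasso_coordinate_descent(X_toy, y_toy, alpha=0.1)
print(w_lasso)
```

The soft-thresholding step is what sets a coefficient exactly to zero whenever its correlation with the current residual is smaller than $\alpha$.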

In [66]:
#1.1.3. Lasso regression
reg = linear_model.Lasso(alpha=0.1)
# fitting this model to the training data
reg.fit(xtrain, ytrain)

# make prediction
y_model = reg.predict(xtest)

In [67]:
#Prediction on test/train sets
test_pred = reg.predict(xtest)
train_pred = reg.predict(xtrain)


print('Test set evaluation:\n_____________________________________')
print_evaluate(ytest, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(ytrain, train_pred)

Test set evaluation:
_____________________________________
MAE: 4.1218465832498925
MSE: 29.947615036775744
RMSE: 5.472441414649931
R2 Square 0.6976774024328232
__________________________________
Train set evaluation:
_____________________________________
MAE: 3.517609787604138
MSE: 26.75666982668098
RMSE: 5.1726849726888435
R2 Square 0.6628997082691072
__________________________________


In [68]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

model.fit(xtrain, ytrain)


RandomForestRegressor(n_jobs=-1, random_state=42)

In [70]:
#Prediction on test/train sets
test_pred = model.predict(xtest)
train_pred = model.predict(xtrain)


print('Test set evaluation:\n_____________________________________')
print_evaluate(ytest, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(ytrain, train_pred)

Test set evaluation:
_____________________________________
MAE: 2.2958897637795284
MSE: 8.921631559055129
RMSE: 2.9869100353132714
R2 Square 0.9099357052587104
__________________________________
Train set evaluation:
_____________________________________
MAE: 0.8113350923482849
MSE: 1.4660157414248032
RMSE: 1.2107913699002002
R2 Square 0.9815300507380935
__________________________________


In [73]:
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           "max_features": [0.5, 1, "sqrt", "auto"],
           "max_samples": [379]}

# Instantiate RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=2,
                              cv=5,
                              verbose=True)

# Fit the RandomizedSearchCV model
rs_model.fit(xtrain, ytrain)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


10 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
8 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py", line 386, in fit
    n_samples=X.shape[0], max_samples=self.max_samples
  File "/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py", line 111, in _get_n_samples_bootstrap
    raise ValueError(msg.format(n_samples, max_samples))
ValueError: `max_samples` must be in range 1 to 303 but got value 379

-----------------------------

RandomizedSearchCV(cv=5,
                   estimator=RandomForestRegressor(n_jobs=-1, random_state=42),
                   n_iter=2,
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [379],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
                   verbose=True)
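
The `ValueError` in the output above follows from the fold sizes. With `cv=5`, each candidate is trained on roughly four fifths of the 379 rows in `xtrain`, so an integer `max_samples` of 379 is larger than any training fold. With every score set to NaN, the search has no basis for ranking one candidate above another. The sketch below reproduces the arithmetic; `kfold_train_sizes` is a hypothetical helper written for illustration, and the fraction 0.8 is an assumed value, not one taken from the notebook.

```python
def kfold_train_sizes(n_samples, n_splits):
    """Training-set size for each fold of an unshuffled K-fold split."""
    # The first n_samples % n_splits validation folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    val_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [n_samples - v for v in val_sizes]

n_train = 379  # rows in xtrain: 506 rows minus the default 25% test split
train_sizes = kfold_train_sizes(n_train, n_splits=5)
print(train_sizes)

# Every training fold is smaller than 379, so max_samples=379 is rejected
print(all(size < n_train for size in train_sizes))

# One possible fix (an assumption, not the notebook's choice): pass a float,
# which scikit-learn reads as a fraction of each fold's training rows
max_samples_fraction = 0.8
```

A float such as `"max_samples": [0.8]` in the parameter grid stays valid whatever the fold size is, whereas any integer must not exceed the smallest training fold.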

In [74]:
# Find the best model hyperparameters
rs_model.best_params_

{'max_depth': 3,
 'max_features': 'sqrt',
 'max_samples': 379,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 90}

In [76]:
# Rebuild a model with the hyperparameters reported by the randomized search
ideal_model = RandomForestRegressor(n_estimators=90,
                                    min_samples_leaf=1,
                                    min_samples_split=2,
                                    max_features='sqrt',
                                    max_depth=3,
                                    max_samples=379,
                                    random_state=42) # random state so result is consistent

In [77]:
ideal_model.fit(xtrain, ytrain)

RandomForestRegressor(max_depth=3, max_features='sqrt', max_samples=379,
                      n_estimators=90, random_state=42)

In [78]:
#Prediction on test/train sets
test_pred = ideal_model.predict(xtest)
train_pred = ideal_model.predict(xtrain)


print('Test set evaluation:\n_____________________________________')
print_evaluate(ytest, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(ytrain, train_pred)

Test set evaluation:
_____________________________________
MAE: 3.486659645937973
MSE: 23.66009593778766
RMSE: 4.864164464508541
R2 Square 0.7611502066586373
__________________________________
Train set evaluation:
_____________________________________
MAE: 2.7889549790050743
MSE: 15.720193482598846
RMSE: 3.9648699200098414
R2 Square 0.8019453899391522
__________________________________
