# Regression Models

In this exercise, we analyze scikit-learn's Boston Housing dataset with several regression models and evaluate them using MSE, MAE and R2 scores.

In [2]:
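#load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older release
from sklearn.datasets import load_boston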
from sklearn.model_selection import train_test_split
import numpy as np

boston = load_boston()
y = boston.target
X = boston.data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [4]:
#Import our libraries
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

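Note that random forest, AdaBoost, bagging and decision trees can all be used for regression as well as classification, and scikit-learn provides a regressor class for each. Logistic regression, by contrast, should only be used for classification, so we use linear regression here instead.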
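To make the distinction concrete, the following is a minimal sketch on synthetic data (the names `X_toy`, `y_toy` and `logistic_accepts_continuous` are made up for this illustration and are not part of the exercise). It fits a linear regression to a continuous target and then shows that scikit-learn's logistic regression refuses the same target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: 20 samples, 2 features, continuous target
rng = np.random.RandomState(10)
X_toy = rng.rand(20, 2)
y_toy = 3.0 * X_toy[:, 0] + rng.rand(20)

# A regressor accepts a continuous target without complaint
LinearRegression().fit(X_toy, y_toy)

# A classifier expects discrete class labels and raises ValueError otherwise
try:
    LogisticRegression().fit(X_toy, y_toy)
    logistic_accepts_continuous = True
except ValueError as err:
    logistic_accepts_continuous = False
    print('LogisticRegression rejected the continuous target:', err)
```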

In [5]:
#Instantiate the models
ran_reg = RandomForestRegressor()
ada_reg = AdaBoostRegressor()
bag_reg = BaggingRegressor()
tree_reg = DecisionTreeRegressor()
lin_reg = LinearRegression()

In [6]:
#Fit the models
ran_reg.fit(X_train, y_train)
ada_reg.fit(X_train, y_train)
bag_reg.fit(X_train, y_train)
tree_reg.fit(X_train, y_train)
lin_reg.fit(X_train, y_train)



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [7]:
#Predictions
preds_ran = ran_reg.predict(X_test)
preds_ada = ada_reg.predict(X_test)
preds_bag = bag_reg.predict(X_test)
preds_tree = tree_reg.predict(X_test)
preds_lin = lin_reg.predict(X_test)

In [10]:
#To see our performance
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

#Write a function to print out metrics
def print_metrics(actual, predict, model=None):
    '''Print MSE, MAE and R2 for a set of predictions, labelled with the model name if given.'''
    if model is None:
        print('MSE for this model is:', mean_squared_error(actual, predict))
        print('MAE for this model is:', mean_absolute_error(actual, predict))
        print('R2 Score for this model is:', r2_score(actual, predict))
        print('\n')
    else:
        print('MSE for ' + model + ' is: ', mean_squared_error(actual, predict))
        print('MAE for ' + model + ' is: ', mean_absolute_error(actual, predict))
        print('R2 Score for ' + model + ' is: ', r2_score(actual, predict))
        print('\n')
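For reference, the three metrics can be written out by hand. The sketch below uses plain Python so the formulas stay visible; the function names `mse`, `mae` and `r2` and the toy values are assumptions for illustration only and are not the scikit-learn implementations.

```python
def mse(actual, predicted):
    # Mean squared error: average of the squared residuals
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    # Mean absolute error: average of the absolute residuals
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    # R2: 1 minus (residual sum of squares / total sum of squares)
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical toy values: the residuals are 1, 0, -1 and -2
toy_actual = [3.0, 5.0, 7.0, 9.0]
toy_predicted = [2.0, 5.0, 8.0, 11.0]

print('MSE:', mse(toy_actual, toy_predicted))  # 6 / 4 = 1.5
print('MAE:', mae(toy_actual, toy_predicted))  # 4 / 4 = 1.0
print('R2:', r2(toy_actual, toy_predicted))    # 1 - 6 / 20, about 0.7
```

MSE penalises large residuals more heavily than MAE because the errors are squared, which is why the two metrics can rank models differently.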

In [11]:
print_metrics(y_test, preds_ran, 'random forest')
print_metrics(y_test, preds_ada, 'adaboost')
print_metrics(y_test, preds_bag, 'bagging')
print_metrics(y_test, preds_tree, 'decision tree')
print_metrics(y_test, preds_lin, 'linear regression')

MSE for random forest is:  21.512811764705877
MAE for random forest is:  3.0143137254901964
R2 Score for random forest is:  0.7942947128602192


MSE for adaboost is:  18.06265073874272
MAE for adaboost is:  3.2844005817205666
R2 Score for adaboost is:  0.8272851174752309


MSE for bagging is:  21.760065686274512
MAE for bagging is:  3.1869607843137255
R2 Score for bagging is:  0.7919304733786952


MSE for decision tree is:  28.095980392156868
MAE for decision tree is:  3.1696078431372543
R2 Score for decision tree is:  0.7313465214470865


MSE for linear regression is:  34.41396845313853
MAE for linear regression is:  4.061419182954704
R2 Score for linear regression is:  0.6709339839115631




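With the default settings, AdaBoost has the lowest mean squared error (about 18.06) and the highest R2 score (about 0.827) of the five models, although random forest has the lowest mean absolute error (about 3.01). Because no random_state was set on the models, these numbers will vary somewhat from run to run.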
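A single train/test split with unseeded models gives a fairly noisy ranking. One way to make the comparison more stable is to fix the random seeds and average R2 over several folds. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's bundled diabetes dataset as a stand-in (not the Boston data from this exercise), and the seed, the `n_estimators` value and the names `X_demo`, `y_demo` and `cv_r2` are arbitrary choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in dataset that ships with scikit-learn (assumption: not the Boston data)
X_demo, y_demo = load_diabetes(return_X_y=True)

# Seeding the randomized models makes the results repeatable
models = {
    'random forest': RandomForestRegressor(n_estimators=50, random_state=10),
    'adaboost': AdaBoostRegressor(random_state=10),
    'linear regression': LinearRegression(),
}

# Five shuffled folds, with the same splits for every model
cv = KFold(n_splits=5, shuffle=True, random_state=10)

cv_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='r2')
    cv_r2[name] = scores.mean()
    print('Mean R2 for ' + name + ' is: ', scores.mean(), '+/-', scores.std())
```

The spread across folds gives a rough sense of whether a gap between two models is larger than the variation introduced by the split itself.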