## Build Models

In this notebook, I read in the preprocessed data and compare three regression models (a decision tree, a random forest, and XGBoost) by their mean absolute error (MAE) on the test set.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

In [2]:
# Import train data
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv')

# Import test data
X_test = pd.read_csv('X_test.csv')
y_test = pd.read_csv('y_test.csv')

In [9]:
# Define function to compute mean absolute error for Decision Tree Regressor
def score_DTR(X_train, X_test, y_train, y_test):
    model = DecisionTreeRegressor()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_absolute_error(y_test, predictions)

# Print mean absolute error
print("Decision Tree Regressor mean absolute error:") 
print(score_DTR(X_train, X_test, y_train, y_test))

Decision Tree Regressor mean absolute error:
4.609177466666665
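
One caveat on the number above: `DecisionTreeRegressor()` is created without a `random_state`, so the fitted tree, and with it the MAE, can vary slightly between runs. Below is a minimal sketch of pinning the seed. It uses synthetic data from `make_regression` as a stand-in for the notebook's CSV files; the data, the seed value `0`, and the helper name `score_dtr_seeded` are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed data (assumption, not the real dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def score_dtr_seeded(X_train, X_test, y_train, y_test, seed=0):
    # Hypothetical variant of score_DTR with a fixed random_state
    model = DecisionTreeRegressor(random_state=seed)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))

# Two independent fits with the same seed give the same MAE
mae_first = score_dtr_seeded(X_tr, X_te, y_tr, y_te)
mae_second = score_dtr_seeded(X_tr, X_te, y_tr, y_te)
print(mae_first == mae_second)
```

The same `random_state` argument is accepted by `RandomForestRegressor`, where run-to-run variation is usually more noticeable because of bootstrap sampling.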


In [4]:
# Use GridSearchCV to find the best value of n_estimators for the Random Forest Regressor
parameter = {
    'n_estimators' : [10, 50, 100, 200],
}

RFR = RandomForestRegressor()
RFR_CV = GridSearchCV(RFR, parameter).fit(X_train, np.ravel(y_train))
RFR_CV.best_params_

{'n_estimators': 200}
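
Because no `scoring` argument is passed, `GridSearchCV` ranks the candidates with the estimator's default `score` method, which is R² for regressors, while the rest of this notebook reports MAE. If the search should optimize the same metric, one option is `scoring='neg_mean_absolute_error'`. Below is a minimal sketch on synthetic data; the data, the reduced grid, `cv=3`, and the seeds are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption, not the real dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {'n_estimators': [10, 50]},          # reduced grid for illustration
    scoring='neg_mean_absolute_error',   # rank candidates by (negated) MAE
    cv=3,
)
search.fit(X_demo, y_demo)

best_n = search.best_params_['n_estimators']
best_cv_mae = -search.best_score_        # flip the sign back to a plain MAE
print(best_n, best_cv_mae)
```

scikit-learn negates error metrics so that a higher score is always better, which is why `best_score_` has to be negated to read it as an MAE.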

In [10]:
# Define function to compute mean absolute error for Random Forest Regressor
def score_RFR(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor(n_estimators=200)
    model.fit(X_train, np.ravel(y_train))
    predictions = model.predict(X_test)
    return mean_absolute_error(y_test, predictions)

# Print mean absolute error
print("Random Forest Regressor mean absolute error:") 
print(score_RFR(X_train, X_test, y_train, y_test))

Random Forest Regressor mean absolute error:
3.6346501080000033


In [6]:
# Use GridSearchCV to find the best n_estimators and learning_rate for the XGBoost Regressor
parameters = {
    'n_estimators' : [200, 300, 400, 500],
    'learning_rate' : [0.05, 0.075, 0.09, 0.115]
}

XGB = XGBRegressor()
XGB_CV = GridSearchCV(XGB, parameters).fit(X_train, y_train)
XGB_CV.best_params_

{'learning_rate': 0.115, 'n_estimators': 200}

In [11]:
# Define function to compute mean absolute error for XGBoost Regressor
def score_XGB(X_train, X_test, y_train, y_test):
    model = XGBRegressor(n_estimators=200, learning_rate=0.115)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_absolute_error(y_test, predictions)

# Print mean absolute error
print("XGBoost Regressor mean absolute error:") 
print(score_XGB(X_train, X_test, y_train, y_test))

XGBoost Regressor mean absolute error:
3.123175854500325


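We see that the XGBoost Regressor has the best (lowest) mean absolute error on the test set (about 3.12), followed by the Random Forest Regressor (about 3.63) and finally the Decision Tree Regressor (about 4.61). Note that both grid searches picked values at the edge of their grids (`n_estimators=200` for the random forest; `learning_rate=0.115` and `n_estimators=200` for XGBoost), so a wider grid might find better settings.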
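
The three `score_*` functions above differ only in the model they construct, so the comparison can also be written as one helper plus a dictionary of models. This is a sketch under stated assumptions: the helper name `score_model` and the synthetic data are hypothetical, and `XGBRegressor` is left out so the example runs with scikit-learn alone (it could be added to the dictionary in the same way).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the train/test CSV files (assumption, not the real dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def score_model(model, X_train, X_test, y_train, y_test):
    # Fit any scikit-learn style regressor and return its test-set MAE
    model.fit(X_train, np.ravel(y_train))
    return mean_absolute_error(y_test, model.predict(X_test))

models = {
    'Decision Tree Regressor': DecisionTreeRegressor(random_state=0),
    'Random Forest Regressor': RandomForestRegressor(n_estimators=200, random_state=0),
}

results = {name: score_model(model, X_tr, X_te, y_tr, y_te)
           for name, model in models.items()}

# Print the models from lowest (best) to highest MAE
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name} mean absolute error: {mae:.3f}")
```

Keeping the models in one dictionary means every candidate is scored by exactly the same code path, and adding another model is a one-line change.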