**DS 301: Applied Data Modeling and Predictive Analysis**

# Lab 7 – Random Forests, AdaBoost

Nok Wongpiromsarn, 9 October 2020

**Instructions:**
Perform regression with 'SalePrice' as the output.
1. Select at least 2 features of your choice. Explain why you select these features.
2. Prepare the data using Pipeline and ColumnTransformer. Explain the reasoning behind having each transformation in the Pipeline. Hint: Consider, e.g., StandardScaler, OneHotEncoder, etc.
3. Train the following models
   - RandomForestRegressor
   - AdaBoostRegressor
   - XGBRegressor
4. Evaluate each of the above models based on RMSE.

**Get the data and allocate some for testing**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("datasets/house-price.csv")
data_train, data_test = train_test_split(data, test_size=0.2, random_state=42)

### 1. Select at least 2 features of your choice. Explain why you select these features.

In [None]:
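# Pick the features whose absolute correlation with SalePrice exceeds 0.55,
# i.e. the features most strongly (linearly) related to the output.
# NOTE: the original cell did not show how data_encoded and abs_corr were built.
# The next two lines are an assumed reconstruction: one-hot encode, then correlate.

data_encoded = pd.get_dummies(data_train, dtype=float)
abs_corr = data_encoded.corr()["SalePrice"].abs()

attribs_encoded = data_encoded.columns[abs_corr > 0.55]
attribs_encoded = attribs_encoded[(attribs_encoded != "SalePrice")]
attribs_encoded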

In [None]:
# Convert attribs_encoded back to the original attributes (before one-hot encoding).
# Note that the following code assumes that each encoded attribute name is obtained from the original
# attribute name by appending "_" followed by the category value, and that the original attribute
# names do not include "_".

attribs = []

for a in attribs_encoded:
    index = a.find('_')
    if index > 0:
        a = a[:index]
    if a not in attribs:
        attribs.append(a)
        
# Print selected attributes and their corresponding types
print("Selected {} attributes".format(len(attribs)))
print("  {:15} {:10} {:^10}".format("Column", "Dtype", "Null Count"))
print("  {:15} {:10} {:^10}".format("------", "-----", "----------"))
for attr in attribs:
    print("  {:15} {:10} {:^10}".format(attr, str(data_train[attr].dtype), data_train[attr].isnull().sum()))
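
The name mapping above can be checked on a handful of names. The following is a minimal sketch; the encoded names in the list and the helper `original_attribute` are illustrative assumptions, not output from the lab's data.

```python
def original_attribute(encoded_name):
    """Strip the one-hot suffix, e.g. 'ExterQual_TA' -> 'ExterQual'."""
    index = encoded_name.find("_")
    return encoded_name[:index] if index > 0 else encoded_name


# Illustrative encoded names (assumed, not taken from the dataset)
encoded = ["OverallQual", "ExterQual_TA", "ExterQual_Gd", "KitchenQual_TA"]

# dict.fromkeys drops duplicates while keeping the first-seen order
attribs_demo = list(dict.fromkeys(original_attribute(a) for a in encoded))
print(attribs_demo)  # ['OverallQual', 'ExterQual', 'KitchenQual']
```

An original attribute name that itself contains "_" would be cut at its first underscore, which is why the assumption stated in the comment matters.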

### 2. Prepare the data using Pipeline and ColumnTransformer

In [None]:
# Separate the selected features based on their types

# is_numeric_dtype also accepts float columns, which a strict 'int64' check would miss
num_attribs = [a for a in attribs if pd.api.types.is_numeric_dtype(data[a])]
cat_attribs = [a for a in attribs if data[a].dtype == 'object']

# Ensure that we've covered all the selected features
assert len(num_attribs) + len(cat_attribs) == len(attribs)

print(num_attribs)
print(cat_attribs)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# We need to scale the numerical features and convert the categorical features to numerical ones.
# There are no missing values in this case, so SimpleImputer is not strictly needed.
# I'll still add SimpleImputer to illustrate how you may use Pipeline together with ColumnTransformer
# to create a complete transformer.
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())
])

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_attribs), # Apply StandardScaler to numerical features
    ('cat', cat_transformer, cat_attribs),  # Apply cat_transformer to categorical features
])
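
To see what each transformation contributes, the same structure can be run on a tiny table. This is a sketch under assumptions: the columns `area` and `quality` and their values are made up, and `handle_unknown="ignore"` is an optional setting (not used in the cell above) that keeps the encoder from failing on categories that appear only in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical column, one categorical column with a gap
toy = pd.DataFrame({
    "area": [1000, 1500, 2000, 2500],
    "quality": pd.Series(["Gd", "TA", np.nan, "Gd"], dtype=object),
})

toy_cat_transformer = Pipeline(steps=[
    # Replace the missing category with the literal string 'missing'
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    # One 0/1 column per category seen during fit
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["area"]),          # zero mean, unit variance
    ("cat", toy_cat_transformer, ["quality"]),    # impute, then one-hot encode
])

toy_out = toy_preprocessor.fit_transform(toy)

# 1 scaled numerical column + 3 one-hot columns ('Gd', 'TA', 'missing')
print(toy_out.shape)
```

Scaling matters little for tree-based models such as the three used in this lab, but keeping it in the preprocessor makes the same pipeline reusable with models that are sensitive to feature scale.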

### 3. Train the following models

- RandomForestRegressor
- AdaBoostRegressor
- XGBRegressor
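
One possible way to train the three models is to put a copy of the preprocessor in front of each regressor. The sketch below is not the lab's reference solution: the helper `fit_and_predict`, the hyperparameters, and the synthetic `toy` data are all assumptions made for illustration. `xgboost` is a separate package from scikit-learn, so the sketch skips that model when it is not installed.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

try:
    from xgboost import XGBRegressor  # third-party package, installed separately
except ImportError:
    XGBRegressor = None  # the sketch then trains only the two scikit-learn models


def fit_and_predict(preprocessor, X_train, y_train, X_test, random_state=42):
    """Fit each regressor behind its own copy of the preprocessor.

    Returns a dict mapping a short model name to its test-set predictions.
    """
    regressors = {
        "rnd": RandomForestRegressor(n_estimators=100, random_state=random_state),
        "adb": AdaBoostRegressor(n_estimators=50, random_state=random_state),
    }
    if XGBRegressor is not None:
        regressors["xgb"] = XGBRegressor(n_estimators=100, random_state=random_state)

    predictions = {}
    for name, regressor in regressors.items():
        model = Pipeline(steps=[
            ("preprocessor", clone(preprocessor)),  # unfitted copy per model
            ("regressor", regressor),
        ])
        model.fit(X_train, y_train)
        predictions[name] = model.predict(X_test)
    return predictions


# Synthetic stand-in for the house-price data (hypothetical columns and values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.uniform(500, 3000, size=200),
    "quality": rng.choice(["Fa", "TA", "Gd"], size=200),
})
toy["price"] = 100 * toy["area"] + rng.normal(0, 5000, size=200)

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["quality"]),
])

toy_pred = fit_and_predict(
    toy_preprocessor,
    toy_train[["area", "quality"]], toy_train["price"],
    toy_test[["area", "quality"]],
)
for name, y_pred in toy_pred.items():
    print(name, y_pred.shape)
```

In this notebook the call would be along the lines of `pred = fit_and_predict(preprocessor, data_train[attribs], data_train["SalePrice"], data_test[attribs])`, with `y_test = data_test["SalePrice"]` and `y_pred_rnd`, `y_pred_adb`, `y_pred_xgb` taken from `pred["rnd"]`, `pred["adb"]`, `pred["xgb"]`.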

### 4. Evaluate each of the above models based on RMSE

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse_rnd = sqrt(mean_squared_error(y_test, y_pred_rnd))
print("RMSE RandomForestRegressor: {}".format(rmse_rnd))

rmse_adb = sqrt(mean_squared_error(y_test, y_pred_adb))
print("RMSE AdaBoostRegressor: {}".format(rmse_adb))

rmse_xgb = sqrt(mean_squared_error(y_test, y_pred_xgb))
print("RMSE XGBRegressor: {}".format(rmse_xgb))
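
For reference, RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as 'SalePrice'. The following is a minimal sketch of the same computation done by hand; the helper `rmse` and the numbers are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical values: the errors are 1 and 7, so the mean squared error is (1 + 49) / 2 = 25
print(rmse([200, 300], [199, 293]))  # 5.0
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, which is worth keeping in mind when comparing the three models.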