# Random Forest Model with Automatic Feature Engineering

In this notebook, we try to improve model performance by giving the model more features. The features are generated automatically with **Featuretools**, an open-source library for automated feature engineering.

## 1. Loading Data and Packages

In [1]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Importing the dataset
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas versions)
dataset = pd.read_csv('housing.csv', sep=r'\s+', names=column_names)
dataset.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677082 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


## 2. Data Preprocessing

In [3]:
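# Drop rows where MEDV sits at its 50.0 ceiling (the maximum in the summary above)
dataset = dataset[~(dataset['MEDV'] >= 50.0)]
# Drop the row(s) at the maximum RM value (8.78)
dataset = dataset[~(dataset['RM'] >= 8.78)]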
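
The two filters above can also be combined into a single boolean mask, which makes it easy to check how many rows are removed. The sketch below is only an illustration on a made-up toy frame (`toy` and its values are assumptions, not the housing data).

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "RM":   [6.5, 8.78, 7.1, 6.0],
    "MEDV": [24.0, 21.9, 50.0, 50.0],
})

capped_medv = toy["MEDV"] >= 50.0   # rows sitting at the MEDV ceiling
max_rm = toy["RM"] >= 8.78          # row(s) at the RM maximum

# Combine both conditions in one mask; ~ keeps the rows that match neither.
filtered = toy[~(capped_medv | max_rm)]

removed = len(toy) - len(filtered)
print(f"removed {removed} of {len(toy)} rows")
```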

**Restrict to the Most Important Features**

These are the eight features that were needed to reach a cumulative feature importance of 97% in the previous random forest notebook on feature reduction. We keep only these features (plus the target, `MEDV`) in order to speed up the model.

In [4]:
y = dataset['MEDV']
dataset = dataset.drop(['ZN', 'CHAS', 'RAD', 'TAX', 'B'], axis=1)
X = dataset.drop('MEDV', axis = 1)
dataset.head()

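| | CRIM | INDUS | NOX | RM | AGE | DIS | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 2.31 | 0.538 | 6.575 | 65.2 | 4.09 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 7.07 | 0.469 | 6.421 | 78.9 | 4.9671 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 7.07 | 0.469 | 7.185 | 61.1 | 4.9671 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 2.18 | 0.458 | 6.998 | 45.8 | 6.0622 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 2.18 | 0.458 | 7.147 | 54.2 | 6.0622 | 18.7 | 5.33 | 36.2 |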
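
The selection rule described above (keep the most important features until their cumulative importance reaches 97%) could be sketched as follows. This is not the code from the previous notebook; it is a minimal illustration on synthetic data, and the names `X_demo`, `y_demo`, and `threshold` are assumptions introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): only f0 and f1 drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = 3.0 * X_demo["f0"] + 2.0 * X_demo["f1"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Sort importances in descending order and accumulate them.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
cumulative = importances.cumsum()

# Keep the smallest prefix of features whose cumulative importance reaches the threshold.
threshold = 0.97  # mirrors the 97% figure mentioned above
n_keep = int(np.searchsorted(cumulative.values, threshold)) + 1
n_keep = min(n_keep, len(cumulative))  # guard against floating-point shortfall
selected = cumulative.index[:n_keep].tolist()
print(selected)
```

Sorting before accumulating matters here: it guarantees that the kept set is the smallest one that reaches the threshold.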


## 3. Implementation of Automatic Feature Engineering

**Feature Engineering using Featuretools**

In [5]:
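# Creating the EntitySet 'es'
# Note: entity_from_dataframe and normalize_entity are the pre-1.0 Featuretools names
# (renamed to add_dataframe and normalize_dataframe in Featuretools 1.0 and later).
import featuretools as ft
es = ft.EntitySet(id='prices')

# Adding the dataframe as entity 'boston'; make_index=True creates the new index column 'UNIQUE'
es.entity_from_dataframe(entity_id='boston', dataframe=dataset, index='UNIQUE', make_index=True)
# Splitting the unique LSTAT values into a parent entity 'lstat',
# so that rows sharing an LSTAT value can be aggregated
es.normalize_entity(base_entity_id='boston', new_entity_id='lstat', index='LSTAT')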

Entityset: prices
  Entities:
    boston [Rows: 489, Columns: 10]
    lstat [Rows: 442, Columns: 1]
  Relationships:
    boston.LSTAT -> lstat.LSTAT

In [6]:
# Building new features with Deep Feature Synthesis (DFS)
feature_matrix, feature_names = ft.dfs(entityset=es, target_entity='boston', max_depth=2, verbose=1, n_jobs=2)

Built 58 features
EntitySet scattered to workers in 2.865 seconds
Elapsed: 00:03 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks


In [7]:
# Printing the new columns
feature_matrix.columns

Index(['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'PTRATIO', 'LSTAT', 'MEDV',
       'lstat.SUM(boston.CRIM)', 'lstat.SUM(boston.INDUS)',
       'lstat.SUM(boston.NOX)', 'lstat.SUM(boston.RM)',
       'lstat.SUM(boston.AGE)', 'lstat.SUM(boston.DIS)',
       'lstat.SUM(boston.PTRATIO)', 'lstat.SUM(boston.MEDV)',
       'lstat.STD(boston.CRIM)', 'lstat.STD(boston.INDUS)',
       'lstat.STD(boston.NOX)', 'lstat.STD(boston.RM)',
       'lstat.STD(boston.AGE)', 'lstat.STD(boston.DIS)',
       'lstat.STD(boston.PTRATIO)', 'lstat.STD(boston.MEDV)',
       'lstat.MAX(boston.CRIM)', 'lstat.MAX(boston.INDUS)',
       'lstat.MAX(boston.NOX)', 'lstat.MAX(boston.RM)',
       'lstat.MAX(boston.AGE)', 'lstat.MAX(boston.DIS)',
       'lstat.MAX(boston.PTRATIO)', 'lstat.MAX(boston.MEDV)',
       'lstat.SKEW(boston.CRIM)', 'lstat.SKEW(boston.INDUS)',
       'lstat.SKEW(boston.NOX)', 'lstat.SKEW(boston.RM)',
       'lstat.SKEW(boston.AGE)', 'lstat.SKEW(boston.DIS)',
       'lstat.SKEW(boston.PTRATIO)', 

In [8]:
feature_matrix.head()

| UNIQUE | CRIM | INDUS | NOX | RM | AGE | DIS | PTRATIO | LSTAT | MEDV | lstat.SUM(boston.CRIM) | ... | lstat.MIN(boston.MEDV) | lstat.MEAN(boston.CRIM) | lstat.MEAN(boston.INDUS) | lstat.MEAN(boston.NOX) | lstat.MEAN(boston.RM) | lstat.MEAN(boston.AGE) | lstat.MEAN(boston.DIS) | lstat.MEAN(boston.PTRATIO) | lstat.MEAN(boston.MEDV) | lstat.COUNT(boston) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 2.31 | 0.538 | 6.575 | 65.2 | 4.09 | 15.3 | 4.98 | 24.0 | 0.00632 | ... | 24.0 | 0.00632 | 2.31 | 0.538 | 6.575 | 65.2 | 4.09 | 15.3 | 24.0 | 1 |
| 1 | 0.02731 | 7.07 | 0.469 | 6.421 | 78.9 | 4.9671 | 17.8 | 9.14 | 21.6 | 0.02731 | ... | 21.6 | 0.02731 | 7.07 | 0.469 | 6.421 | 78.9 | 4.9671 | 17.8 | 21.6 | 1 |
| 2 | 0.02729 | 7.07 | 0.469 | 7.185 | 61.1 | 4.9671 | 17.8 | 4.03 | 34.7 | 0.02729 | ... | 34.7 | 0.02729 | 7.07 | 0.469 | 7.185 | 61.1 | 4.9671 | 17.8 | 34.7 | 1 |
| 3 | 0.03237 | 2.18 | 0.458 | 6.998 | 45.8 | 6.0622 | 18.7 | 2.94 | 33.4 | 0.03237 | ... | 33.4 | 0.03237 | 2.18 | 0.458 | 6.998 | 45.8 | 6.0622 | 18.7 | 33.4 | 1 |
| 4 | 0.06905 | 2.18 | 0.458 | 7.147 | 54.2 | 6.0622 | 18.7 | 5.33 | 36.2 | 0.13569 | ... | 29.4 | 0.067845 | 3.115 | 0.484 | 6.8465 | 43.65 | 4.59725 | 17.65 | 32.8 | 2 |
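
To make the generated columns less opaque: each `lstat.<AGG>(boston.<COL>)` feature aggregates a column over all rows that share the same `LSTAT` value and writes the result back onto every row of that group. Row 4 above, for example, shares its `LSTAT` value with one other row, so its `lstat.COUNT(boston)` is 2. The sketch below reproduces this idea in plain pandas on a made-up toy frame; it approximates what the aggregation features represent and is not Featuretools' internal implementation.

```python
import pandas as pd

# Toy child table (made-up values): rows 1 and 2 share the same LSTAT value.
boston = pd.DataFrame({
    "LSTAT": [4.98, 5.33, 5.33],
    "CRIM":  [0.10, 0.20, 0.40],
})

# Aggregate CRIM per LSTAT group, mirroring the SUM, MEAN and COUNT features.
grouped = boston.groupby("LSTAT")["CRIM"].agg(["sum", "mean", "count"])
grouped.columns = [
    "lstat.SUM(boston.CRIM)",
    "lstat.MEAN(boston.CRIM)",
    "lstat.COUNT(boston)",
]

# Broadcast the group-level values back onto each original row.
features = boston.join(grouped, on="LSTAT")
print(features)
```

For a row whose `LSTAT` value is unique, every aggregate simply repeats the row's own value, which is why most generated columns in the output above duplicate the original ones.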


In [9]:
# Checking if there are any null values present
feature_matrix.isnull().sum()

CRIM                            0
INDUS                           0
NOX                             0
RM                              0
AGE                             0
DIS                             0
PTRATIO                         0
LSTAT                           0
MEDV                            0
lstat.SUM(boston.CRIM)          0
lstat.SUM(boston.INDUS)         0
lstat.SUM(boston.NOX)           0
lstat.SUM(boston.RM)            0
lstat.SUM(boston.AGE)           0
lstat.SUM(boston.DIS)           0
lstat.SUM(boston.PTRATIO)       0
lstat.SUM(boston.MEDV)          0
lstat.STD(boston.CRIM)        400
lstat.STD(boston.INDUS)       400
lstat.STD(boston.NOX)         400
lstat.STD(boston.RM)          400
lstat.STD(boston.AGE)         400
lstat.STD(boston.DIS)         400
lstat.STD(boston.PTRATIO)     400
lstat.STD(boston.MEDV)        400
lstat.MAX(boston.CRIM)          0
lstat.MAX(boston.INDUS)         0
lstat.MAX(boston.NOX)           0
lstat.MAX(boston.RM)            0
lstat.MAX(bost

In [10]:
# Listing the column names which contain null values
feature_matrix.columns[feature_matrix.isnull().any()].tolist()

['lstat.STD(boston.CRIM)',
 'lstat.STD(boston.INDUS)',
 'lstat.STD(boston.NOX)',
 'lstat.STD(boston.RM)',
 'lstat.STD(boston.AGE)',
 'lstat.STD(boston.DIS)',
 'lstat.STD(boston.PTRATIO)',
 'lstat.STD(boston.MEDV)',
 'lstat.SKEW(boston.CRIM)',
 'lstat.SKEW(boston.INDUS)',
 'lstat.SKEW(boston.NOX)',
 'lstat.SKEW(boston.RM)',
 'lstat.SKEW(boston.AGE)',
 'lstat.SKEW(boston.DIS)',
 'lstat.SKEW(boston.PTRATIO)',
 'lstat.SKEW(boston.MEDV)']

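**Dropping the columns that contain null values.** Each `STD` column is null in 400 of the 489 rows (about 82%), because most `LSTAT` values occur only once and the standard deviation of a single value is undefined. The `SKEW` columns, which need at least three rows per group, contain nulls as well.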

In [11]:
# Dropping every column that contains at least one null value
# (the STD and SKEW columns listed above)
null_columns = feature_matrix.columns[feature_matrix.isnull().any()].tolist()
feature_matrix = feature_matrix.drop(columns=null_columns)
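
The reason these columns are mostly null can be checked with a small pandas example. It assumes that the `STD` and `SKEW` features behave like pandas' sample standard deviation and skewness, which are undefined for groups with fewer than two and three rows respectively. The toy values below are made up.

```python
import pandas as pd

# Groups of size 1, 2 and 3 (made-up values).
groups = pd.DataFrame({
    "LSTAT": [4.98, 5.33, 5.33, 7.0, 7.0, 7.0],
    "RM":    [6.5, 7.1, 6.6, 6.0, 6.2, 6.9],
})

# Sample std needs at least 2 rows per group, skew at least 3; smaller groups yield NaN.
stats = groups.groupby("LSTAT")["RM"].agg(["count", "std", "skew"])
print(stats)
```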

In [12]:
# Checking again that no null values remain
feature_matrix.isnull().sum()

CRIM                          0
INDUS                         0
NOX                           0
RM                            0
AGE                           0
DIS                           0
PTRATIO                       0
LSTAT                         0
MEDV                          0
lstat.SUM(boston.CRIM)        0
lstat.SUM(boston.INDUS)       0
lstat.SUM(boston.NOX)         0
lstat.SUM(boston.RM)          0
lstat.SUM(boston.AGE)         0
lstat.SUM(boston.DIS)         0
lstat.SUM(boston.PTRATIO)     0
lstat.SUM(boston.MEDV)        0
lstat.MAX(boston.CRIM)        0
lstat.MAX(boston.INDUS)       0
lstat.MAX(boston.NOX)         0
lstat.MAX(boston.RM)          0
lstat.MAX(boston.AGE)         0
lstat.MAX(boston.DIS)         0
lstat.MAX(boston.PTRATIO)     0
lstat.MAX(boston.MEDV)        0
lstat.MIN(boston.CRIM)        0
lstat.MIN(boston.INDUS)       0
lstat.MIN(boston.NOX)         0
lstat.MIN(boston.RM)          0
lstat.MIN(boston.AGE)         0
lstat.MIN(boston.DIS)         0
lstat.MI

**Assigning the new feature matrix to X and y sets.**

In [13]:
y = feature_matrix['MEDV']
X = feature_matrix.drop('MEDV', axis = 1)
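
One thing worth checking at this point: `X` still contains columns that were aggregated from the target, such as `lstat.MEAN(boston.MEDV)`. In the `feature_matrix.head()` output, that column equals `MEDV` for every row whose `lstat.COUNT(boston)` is 1, so the model may be seeing the target directly and the scores reported below may be optimistic. A minimal sketch of how such columns could be excluded is shown here on a made-up toy frame; `fm`, `leaky`, `X_demo`, and `y_demo` are names introduced for illustration only.

```python
import pandas as pd

# Toy feature matrix (made-up values) using the column naming seen above.
fm = pd.DataFrame({
    "RM": [6.5, 7.1],
    "MEDV": [24.0, 34.7],
    "lstat.MEAN(boston.RM)": [6.5, 7.1],
    "lstat.MEAN(boston.MEDV)": [24.0, 34.7],
    "lstat.MAX(boston.MEDV)": [24.0, 34.7],
})

target = "MEDV"
# Any generated column whose name mentions the target was computed from it.
leaky = [col for col in fm.columns if col != target and target in col]

y_demo = fm[target]
X_demo = fm.drop(columns=[target] + leaky)
print(X_demo.columns.tolist())
```

An alternative would be to remove `MEDV` from the dataframe before it is added to the EntitySet, so that no target-derived features are generated in the first place.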

In [14]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

## 4. Fitting the Random Forest Regression Model

In [15]:
# Fitting the Random Forest Regression model to the training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=1000, random_state=42)
regressor.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

In [16]:
# Predicting on the test set
y_pred = regressor.predict(X_test)

In [17]:
# Calculating the R² score (labelled 'accuracy' in the printout) on the training and test sets
from sklearn.metrics import r2_score
train_score = round(regressor.score(X_train, y_train)*100,2)
test_score = round(r2_score(y_test, y_pred)*100,2)
print('----------------Model Performance---------------')
print("Train_accuracy :" + str(train_score))
print("Test_accuracy :" + str(test_score))

----------------Model Performance---------------
Train_accuracy :99.36
Test_accuracy :94.23
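
The "accuracy" values above are R² scores, that is, the fraction of the target's variance explained by the model: R² = 1 − SS_res / SS_tot. The short example below computes this by hand on made-up numbers and compares it with `r2_score`; the arrays `y_true` and `y_hat` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```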


## 5. Applying K-Fold Cross Validation

We use 10-fold cross-validation on the training set to evaluate the model's performance on 10 different validation folds. This gives a more reliable estimate of the R² score than a single train/test split, and we report the mean across folds.

In [18]:
# Applying K-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=regressor, X=X_train, y=y_train, cv=10)
mean_accuracy = round(accuracies.mean()*100,2)
print(accuracies)
print("Mean_test_accuracy :" + str(mean_accuracy))

[0.93416882 0.93127763 0.95289943 0.97747262 0.96309391 0.94878883
 0.97743575 0.9787996  0.95984424 0.93424281]
Mean_test_accuracy :95.58
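
For readers who want to see what `cross_val_score` does with `cv=10`, the sketch below writes the loop out by hand: the data is split into 10 folds, the model is refit on 9 of them, and it is scored (R²) on the remaining one. It runs on synthetic data from `make_regression`, so `X_demo`, `y_demo`, and the smaller forest are assumptions for illustration, not the notebook's data or settings.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_train / y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# For a regressor, an integer cv corresponds to an unshuffled KFold split.
kfold = KFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_demo, y_demo, cv=10)
print("mean R^2:", manual_scores.mean(), "std:", manual_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the score varies from fold to fold.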
