The goal of this assignment is to compare the performance of a decision tree model and a random forest model on the same dataset. I will also measure the run time of the random forest model.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

In [3]:
df = pd.read_csv('Concrete_Data_Yeh.csv')
df.describe()

statistic,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


The first eight variables are the independent variables describing the concrete mixture. The final variable, csMPa, is the mixture's compressive strength in megapascals and is the outcome variable that I will be trying to predict.

In [38]:
tree_reg = tree.DecisionTreeRegressor(
            max_features=5,
            max_depth=15,
            random_state = 1337
)

X = df.drop(columns='csMPa')
y = df['csMPa']

cross_val_score(tree_reg, X, y, cv=10)

array([0.23762593, 0.48700336, 0.55961454, 0.37505609, 0.34249729,
       0.5737482 , 0.76730104, 0.78042022, 0.90055833, 0.90463208])

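After trying several combinations of max_features and max_depth, this is the best decision tree that I could come up with. The ten values are the per-fold R² scores, which is what cross_val_score reports by default for a regressor.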
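The manual search over max_features and max_depth could also be automated. The following is a minimal sketch using scikit-learn's GridSearchCV, not the procedure actually used here: the grid values are illustrative assumptions, and the data is a synthetic stand-in (X_demo, y_demo are hypothetical names) so that the cell runs without the CSV file.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with 8 features, like the concrete data (assumption:
# replace X_demo and y_demo with X and y to search on the real dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=1337
)

# Hypothetical grid; these are not the values tried in the notebook.
param_grid = {"max_features": [3, 5, 8], "max_depth": [5, 10, 15]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1337),
    param_grid,
    cv=5,  # each candidate is scored by mean R^2 over 5 folds
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best_params_ attribute holds the combination with the highest mean cross-validated score, which replaces comparing the score arrays by eye.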

In [41]:
import time
start_time = time.time()

rf_reg = ensemble.RandomForestRegressor(
          n_estimators=100,        
          max_features=5,
          max_depth=15,
          random_state = 1337  
)

X = df.drop(columns='csMPa')
y = df['csMPa']

print(cross_val_score(rf_reg, X, y, cv=10))

print("--- %s seconds ---" % (time.time() - start_time))

[0.55946842 0.6407802  0.73398373 0.61594115 0.64970423 0.74799193
 0.83321135 0.84850091 0.92621118 0.93264186]
--- 2.276378870010376 seconds ---


Let's adjust some of the parameters, such as adding more trees, and see how that affects the scores and the run time.

In [42]:
import time
start_time = time.time()

rf_reg = ensemble.RandomForestRegressor(
          n_estimators=1000,        
          max_features=5,
          max_depth=15,
          random_state = 1337  
)

X = df.drop(columns='csMPa')
y = df['csMPa']

print(cross_val_score(rf_reg, X, y, cv=10))

print("--- %s seconds ---" % (time.time() - start_time))

[0.57368063 0.6470704  0.75873551 0.63100782 0.6292313  0.73083054
 0.84231994 0.83281608 0.92995526 0.9328583 ]
--- 22.35339117050171 seconds ---
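Averaging the fold scores makes the two forests easier to compare. This small sketch only uses the numbers printed above; the variable names are my own.

```python
from statistics import mean

# Fold scores and wall-clock times copied from the two runs above.
scores_100 = [0.55946842, 0.6407802, 0.73398373, 0.61594115, 0.64970423,
              0.74799193, 0.83321135, 0.84850091, 0.92621118, 0.93264186]
scores_1000 = [0.57368063, 0.6470704, 0.75873551, 0.63100782, 0.6292313,
               0.73083054, 0.84231994, 0.83281608, 0.92995526, 0.9328583]
seconds_100 = 2.276378870010376
seconds_1000 = 22.35339117050171

mean_100 = mean(scores_100)
mean_1000 = mean(scores_1000)

print(f"100 trees:  mean R^2 = {mean_100:.3f} in {seconds_100:.1f} s")
print(f"1000 trees: mean R^2 = {mean_1000:.3f} in {seconds_1000:.1f} s")
print(f"gain = {mean_1000 - mean_100:.3f}, "
      f"slowdown = {seconds_1000 / seconds_100:.1f}x")
```

The mean score moves from about 0.749 to about 0.751, while the run takes roughly ten times as long.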


The scores barely changed, which is less improvement than I had hoped for, especially considering that the run took an extra 20 seconds. Let's do some more adjusting.

In [44]:
import time
start_time = time.time()

rf_reg = ensemble.RandomForestRegressor(
          n_estimators=100,        
          max_features=8,
          max_depth=50,
          random_state = 1337  
)

X = df.drop(columns='csMPa')
y = df['csMPa']

print(cross_val_score(rf_reg, X, y, cv=10))

print("--- %s seconds ---" % (time.time() - start_time))

[0.54604327 0.64513522 0.74471016 0.61349418 0.58353017 0.69748696
 0.83687588 0.82744116 0.93162957 0.92143744]
--- 3.098667621612549 seconds ---


In [45]:
import time
start_time = time.time()

rf_reg = ensemble.RandomForestRegressor(
          n_estimators=1000,        
          max_features=5,
          max_depth=50,
          random_state = 1337  
)

X = df.drop(columns='csMPa')
y = df['csMPa']

print(cross_val_score(rf_reg, X, y, cv=10))

print("--- %s seconds ---" % (time.time() - start_time))

[0.57014595 0.65073925 0.75493294 0.62854001 0.62251322 0.73419087
 0.84318389 0.83431701 0.93061619 0.93252292]
--- 22.417426824569702 seconds ---


In [46]:
import time
start_time = time.time()

rf_reg = ensemble.RandomForestRegressor(
          n_estimators=10000,        
          max_features=5,
          max_depth=50,
          random_state = 1337  
)

X = df.drop(columns='csMPa')
y = df['csMPa']

print(cross_val_score(rf_reg, X, y, cv=10))

print("--- %s seconds ---" % (time.time() - start_time))

[0.58124062 0.64294215 0.75440488 0.62939186 0.62179999 0.73293101
 0.84122658 0.83626925 0.93094262 0.93218269]
--- 241.66370797157288 seconds ---


It appears that I have reached the best performance these models can deliver with the data still in an unprocessed state. Notably, the random forest reached a point where the scores were no longer improving while the run time kept increasing dramatically: going from 100 to 10,000 trees raised the run time from about 2 seconds to about 242 seconds, while the mean fold score stayed near 0.75.
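The repeated timing cells could be folded into one helper that sweeps the number of trees. This is a sketch under assumptions rather than the code that produced the results above: timed_cv is a hypothetical helper, the forest sizes are kept small so it runs quickly, and the data is a synthetic stand-in for the concrete dataset.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def timed_cv(n_estimators, X, y, cv=5):
    """Return (mean R^2, elapsed seconds) for one forest size."""
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_features=5,
        max_depth=15,
        random_state=1337,
    )
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), time.perf_counter() - start


# Synthetic stand-in (assumption: pass X and y from the notebook instead).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=1337
)

# Small forest sizes chosen only to keep the demonstration fast.
results = {n: timed_cv(n, X_demo, y_demo) for n in (10, 50, 100)}
for n, (score, seconds) in results.items():
    print(f"{n:>4} trees: mean R^2 = {score:.3f}, {seconds:.2f} s")
```

Printing the score next to the elapsed time for each forest size makes the point of diminishing returns visible in a single table.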