### The purpose of this notebook is to check a number of different types of models on the Dublin Bus data for August.

### The main purpose here is not to test the accuracy of the models, but rather to check how much data we can process with different types of models before we encounter memory issues. We will also compare the models in terms of training time and the size of the pickled models.

<br>

# 1. Setup & Data Load

Import required modules and packages.

In [68]:
# import time so that run time of various tasks can be tracked
import time

# import math for mathematical functions
import math

# import pandas and numpy for data analysis
import pandas as pd
import numpy as np

# import from sklearn for machine learning
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.svm import LinearSVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor

# import pickle so that models can be saved to file
import pickle

Set the max number of columns & rows to display.

In [2]:
pd.set_option('display.max_columns', 250)
pd.set_option('display.max_rows', 5700)

Import the prepared data:

In [3]:
df = pd.read_hdf('/data_analytics/data/all_routes_aug_prepared.hdf')

# 2. Split Test & Training Data

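We will use out-of-time sampling to split our test and training data: the rows are ordered chronologically, the earliest 70% are used for training and the most recent 30% are held out for testing.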

First we ensure that the data is sorted by date and time:

In [4]:
df = df.sort_values(by=['dayofservice', 'actualtime_arr_stop_first'])

Data is then split between training and test data:

In [7]:
df_train, df_test = train_test_split(df, test_size=0.3, shuffle=False)
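
The effect of `shuffle=False` on a chronologically sorted frame can be checked on a small example. This is only a sketch: the toy frame, its dates and its values are hypothetical stand-ins for the real bus data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: ten rows on ten consecutive days (not the real data)
toy = pd.DataFrame({
    "dayofservice": pd.date_range("2018-08-01", periods=10, freq="D"),
    "time_diff": range(10),
})

# Scramble the rows, then sort by date as the notebook does before splitting
toy = toy.sample(frac=1, random_state=0).sort_values(by="dayofservice")

# With shuffle=False the split simply cuts the sorted frame in two
toy_train, toy_test = train_test_split(toy, test_size=0.3, shuffle=False)

# Every training row should precede every test row in time
split_is_chronological = toy_train["dayofservice"].max() < toy_test["dayofservice"].min()
print(split_is_chronological)
```

If the sort step were skipped, the same call would still cut the frame at the same position, but the two halves would no longer be separated in time.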

# 3. Prepare Features

In [9]:
# Prepare the descriptive & target features for the training data
X_train = df_train[['actualtime_arr_stop_first','segment_means','rain','temp','rhum','msl','weekday','bank_holiday','day_of_week_0','day_of_week_1','day_of_week_2','day_of_week_3','day_of_week_4','day_of_week_5','day_of_week_6','hour_0.0','hour_1.0','hour_4.0','hour_5.0','hour_6.0','hour_7.0','hour_8.0','hour_9.0','hour_10.0','hour_11.0','hour_12.0','hour_13.0','hour_14.0','hour_15.0','hour_16.0','hour_17.0','hour_18.0','hour_19.0','hour_20.0','hour_21.0','hour_22.0','hour_23.0']]
y_train = df_train.time_diff

In [10]:
# Prepare the descriptive & target features for the test data
X_test = df_test[['actualtime_arr_stop_first','segment_means','rain','temp','rhum','msl','weekday','bank_holiday','day_of_week_0','day_of_week_1','day_of_week_2','day_of_week_3','day_of_week_4','day_of_week_5','day_of_week_6','hour_0.0','hour_1.0','hour_4.0','hour_5.0','hour_6.0','hour_7.0','hour_8.0','hour_9.0','hour_10.0','hour_11.0','hour_12.0','hour_13.0','hour_14.0','hour_15.0','hour_16.0','hour_17.0','hour_18.0','hour_19.0','hour_20.0','hour_21.0','hour_22.0','hour_23.0']]
y_test = df_test.time_diff

In [11]:
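# standardise the features (zero mean, unit variance); the scaler is fitted on
# the training data only and then applied to both the training and test data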
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
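
As a rough illustration of what the scaling step does, the sketch below fits a `StandardScaler` on synthetic training data and applies it to synthetic test data. The arrays are randomly generated and purely hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the test set is drawn with a slightly different mean
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(1000, 2))
toy_test = rng.normal(loc=55.0, scale=10.0, size=(300, 2))

# Fit on the training data only, then reuse the same statistics for the test data
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Training columns end up with mean ~0 and standard deviation ~1
print(toy_train_scaled.mean(axis=0).round(3), toy_train_scaled.std(axis=0).round(3))

# Test columns are scaled with the training statistics, so their mean need not be 0
print(toy_test_scaled.mean(axis=0).round(3))
```

Fitting on the training data alone keeps information about the test period out of the preprocessing step.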

In [12]:
# drop dataframes that are no longer being used to free up memory
del df_train
del df_test
del df
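
Since memory is the main concern of this notebook, it can help to measure how much a dataframe occupies before deleting it. The sketch below uses a hypothetical toy frame; `gc.collect()` is an optional extra step and is not part of the original notebook.

```python
import gc

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the prepared bus data
toy = pd.DataFrame(np.zeros((100_000, 5)))

# memory_usage(deep=True) reports bytes per column (plus the index)
size_mb = toy.memory_usage(deep=True).sum() / 1024**2
print(f"toy frame uses about {size_mb:.2f} MB")

# Drop the only reference, then ask the garbage collector to run
del toy
gc.collect()
```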

# 4. Linear Regression

## 4.1 Train the Model

Train a model using linear regression from scikit-learn:

In [13]:
start = time.time()
linreg = linear_model.LinearRegression().fit(X_train, y_train)
end = time.time()
print(end - start)

24.313337087631226
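
The same start/end timing pattern is repeated for every model below. One possible way to package it is a small context manager; `timed` and `timings` are hypothetical names introduced here, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring intervals.

```python
import time
from contextlib import contextmanager

# Hypothetical store for the measured durations, keyed by label
timings = {}

@contextmanager
def timed(label):
    """Record and print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start
        print(f"{label}: {timings[label]:.3f} s")

# Usage with a stand-in workload; a model's fit() call would go here instead
with timed("toy workload"):
    total = sum(i * i for i in range(100_000))
```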


## 4.2 Test on the Test Data

In [14]:
# make predictions on the test data and time the prediction step
start = time.time()
linreg_predicted = (linreg.predict(X_test))
end = time.time()
print(end - start)

5.586925506591797


In [15]:
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, linreg_predicted))
print()
print("Root Mean Squared Error: ", math.sqrt(metrics.mean_squared_error(y_test, linreg_predicted)))
print()
print("R Squared:", metrics.r2_score(y_test, linreg_predicted))

Mean Absolute Error:  19.335854368219916

Root Mean Squared Error:  38.96768930928269

R Squared: 0.6201653187308931
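
The three metrics printed above are computed the same way for every model in this notebook. A minimal sketch of a helper that bundles them is shown below; `regression_report` is a hypothetical name and the input values are made up for illustration.

```python
import math

from sklearn import metrics

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R squared for one set of predictions."""
    return {
        "mae": metrics.mean_absolute_error(y_true, y_pred),
        "rmse": math.sqrt(metrics.mean_squared_error(y_true, y_pred)),
        "r2": metrics.r2_score(y_true, y_pred),
    }

# Made-up example: one of three predictions is off by 1
report = regression_report([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(report)
```

RMSE is at least as large as MAE for the same predictions and weights large errors more heavily, which is why the two are reported side by side.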


## 4.3 Pickle the Model

Save the model to check the size:

In [17]:
filename = '/data_analytics/linreg_model_size_check.sav'
with open(filename, 'wb') as f:
    pickle.dump(linreg, f)

File size: 1.1K
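
The file sizes in this notebook were noted by hand. They could also be read from Python, as in the sketch below. The dictionary is a hypothetical stand-in for a fitted model, and a temporary directory is used instead of the `/data_analytics` path.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model
toy_model = {"coef": [0.1, 0.2, 0.3], "intercept": 1.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model_size_check.sav")

    # The with-block closes the file, so the size on disk is final when read
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    size_on_disk = os.path.getsize(path)

# The size can also be estimated without touching the disk
size_in_memory = len(pickle.dumps(toy_model))

print(f"on disk: {size_on_disk} bytes, serialised in memory: {size_in_memory} bytes")
```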

# 5. Random Forest

## 5.1 Train the Model

In [18]:
# specify the random forest parameters
rfr = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1, max_depth=6)

In [19]:
# Fit model on the training data
start = time.time()
random_forest = rfr.fit(X_train_scaled, y_train)
end = time.time()
print(end - start)

1979.8052804470062


## 5.2 Test the Model

In [20]:
# make predictions on the test data and time the prediction step
start = time.time()
rf_predicted = (random_forest.predict(X_test_scaled))
end = time.time()
print(end - start)

18.546390771865845


In [21]:
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, rf_predicted))
print()
print("Root Mean Squared Error: ", math.sqrt(metrics.mean_squared_error(y_test, rf_predicted)))
print()
print("R Squared:", metrics.r2_score(y_test, rf_predicted))

Mean Absolute Error:  19.40559333863481

Root Mean Squared Error:  38.769509441373806

R Squared: 0.6240189816983333


## 5.3 Pickle the Model

Save the model to check the size:

In [22]:
filename = '/data_analytics/rf_model_size_check.sav'
with open(filename, 'wb') as f:
    pickle.dump(random_forest, f)

File size: 53M

# 6. Neural Networks

## 6.1 Train the Model

In [61]:
# specify the neural net parameters
nn = MLPRegressor(
    hidden_layer_sizes=(10,),  activation='relu', solver='adam', alpha=0.001, batch_size='auto',
    learning_rate='constant', learning_rate_init=0.01, power_t=0.5, max_iter=1000, shuffle=True,
    random_state=9, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True,
    early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

In [62]:
# Fit model on the training data
start = time.time()
neural_net = nn.fit(X_train, y_train)
end = time.time()
print(end - start)

343.93044781684875


## 6.2 Test the Model

In [63]:
# make predictions on the test data and time the prediction step
start = time.time()
nn_predicted = (neural_net.predict(X_test))
end = time.time()
print(end - start)

6.9637932777404785


In [64]:
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, nn_predicted))
print()
print("Root Mean Squared Error: ", math.sqrt(metrics.mean_squared_error(y_test, nn_predicted)))
print()
print("R Squared:", metrics.r2_score(y_test, nn_predicted))

Mean Absolute Error:  20.520370561013433

Root Mean Squared Error:  39.46567474015634

R Squared: 0.6103951337172383


## 6.3 Pickle the Model

Save the model to check the size:

In [65]:
filename = '/data_analytics/nn_model_size_check.sav'
with open(filename, 'wb') as f:
    pickle.dump(neural_net, f)

File size: 14K

# 7. SVM

## 7.1 Train the Model

In [28]:
# specify the SVM parameters
svm = LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=0, tol=1e-05, verbose=0)

In [29]:
# Fit model on the training data
start = time.time()
svm_model = svm.fit(X_train_scaled, y_train)
end = time.time()
print(end - start)

2448.409593820572




## 7.2 Test the Model

In [30]:
# make predictions on the test data and time the prediction step
start = time.time()
svm_predicted = (svm_model.predict(X_test_scaled))
end = time.time()
print(end - start)

1.280200481414795


In [31]:
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, svm_predicted))
print()
print("Root Mean Squared Error: ", math.sqrt(metrics.mean_squared_error(y_test, svm_predicted)))
print()
print("R Squared:", metrics.r2_score(y_test, svm_predicted))

Mean Absolute Error:  19.207791961375786

Root Mean Squared Error:  39.280773617126314

R Squared: 0.6140372669805653


## 7.3 Pickle the Model

Save the model to check the size:

In [32]:
filename = '/data_analytics/svm_model_size_check.sav'
with open(filename, 'wb') as f:
    pickle.dump(svm_model, f)

File size: 928 bytes

# 8. Gradient Tree Boosting

## 8.1 Train the Model

In [33]:
# specify the GTB parameters
gtb = GradientBoostingRegressor()

In [34]:
# Fit model on the training data
start = time.time()
gtb_model = gtb.fit(X_train_scaled, y_train)
end = time.time()
print(end - start)

857.5822958946228


## 8.2 Test the Model

In [35]:
# make predictions on the test data and time the prediction step
start = time.time()
gtb_predicted = (gtb_model.predict(X_test_scaled))
end = time.time()
print(end - start)

6.654131174087524


In [36]:
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, gtb_predicted))
print()
print("Root Mean Squared Error: ", math.sqrt(metrics.mean_squared_error(y_test, gtb_predicted)))
print()
print("R Squared:", metrics.r2_score(y_test, gtb_predicted))

Mean Absolute Error:  18.967087292723583

Root Mean Squared Error:  38.25283692673184

R Squared: 0.6339734349591741


## 8.3 Pickle the Model

Save the model to check the size:

In [37]:
filename = '/data_analytics/gtb_model_size_check.sav'
with open(filename, 'wb') as f:
    pickle.dump(gtb_model, f)

File size: 133K

# 9. Extra Trees

## 9.1 Train the Model

In [48]:
# specify the ET parameters
et = ExtraTreesRegressor(n_estimators=100, max_depth=6)

In [49]:
# Fit model on the training data
start = time.time()
et_model = et.fit(X_train, y_train)
end = time.time()
print(end - start)

876.3205444812775


## 9.2 Test the Model

In [50]:
# make predictions on the test data and time the prediction step
start = time.time()
et_predicted = (et_model.predict(X_test))
end = time.time()
print(end - start)

16.305891036987305


In [51]:
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, et_predicted))
print()
print("Root Mean Squared Error: ", math.sqrt(metrics.mean_squared_error(y_test, et_predicted)))
print()
print("R Squared:", metrics.r2_score(y_test, et_predicted))

Mean Absolute Error:  20.97022269221333

Root Mean Squared Error:  40.07958681840747

R Squared: 0.5981797869483765


## 9.3 Pickle the Model

Save the model to check the size:

In [52]:
filename = '/data_analytics/et_model_size_check.sav'
with open(filename, 'wb') as f:
    pickle.dump(et_model, f)

# 10. AdaBoost

## 10.1 Train the Model

In [69]:
# specify the AdaBoost parameters (library defaults)
ab = AdaBoostRegressor()

In [75]:
# Fit model on the training data
start = time.time()
ab_model = ab.fit(X_train_scaled, y_train)
end = time.time()
print(end - start)

555.0543534755707


## 10.2 Test the Model

In [76]:
# make predictions on the test data and time the prediction step
start = time.time()
ab_predicted = (ab_model.predict(X_test_scaled))
end = time.time()
print(end - start)

6.7231903076171875


In [77]:
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, ab_predicted))
print()
print("Root Mean Squared Error: ", math.sqrt(metrics.mean_squared_error(y_test, ab_predicted)))
print()
print("R Squared:", metrics.r2_score(y_test, ab_predicted))

Mean Absolute Error:  24.98505485153132

Root Mean Squared Error:  46.60751752788408

R Squared: 0.45662800519344526


## 10.3 Pickle the Model

Save the model to check the size:

In [78]:
filename = '/data_analytics/ab_model_size_check.sav'
with open(filename, 'wb') as f:
    pickle.dump(ab_model, f)

File size: 15K
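
To compare the models side by side, the figures recorded in the cells above can be collected into one table. The sketch below copies those outputs by hand, rounded, so it is only as accurate as that transcription. The column names are hypothetical, and no pickled file size was recorded for Extra Trees, so that entry is left empty.

```python
import pandas as pd

# Values transcribed (and rounded) from the outputs recorded in this notebook
results = pd.DataFrame(
    {
        "train_seconds": [24.3, 1979.8, 343.9, 2448.4, 857.6, 876.3, 555.1],
        "predict_seconds": [5.59, 18.55, 6.96, 1.28, 6.65, 16.31, 6.72],
        "mae": [19.34, 19.41, 20.52, 19.21, 18.97, 20.97, 24.99],
        "rmse": [38.97, 38.77, 39.47, 39.28, 38.25, 40.08, 46.61],
        "r2": [0.620, 0.624, 0.610, 0.614, 0.634, 0.598, 0.457],
        # Sizes as noted in the notebook; Extra Trees was not recorded
        "pickle_size": ["1.1K", "53M", "14K", "928", "133K", None, "15K"],
    },
    index=[
        "Linear Regression",
        "Random Forest",
        "Neural Network",
        "Linear SVR",
        "Gradient Boosting",
        "Extra Trees",
        "AdaBoost",
    ],
)

# Order the models from lowest to highest test RMSE
print(results.sort_values("rmse"))
```

On these figures, gradient boosting has the lowest test error, while linear regression trains in a fraction of the time of any other model at a similar error level.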