# [Spring 2022] CS 4361 / 5361 - Ensembles

## **Before you start**

Make a copy of this Colab by clicking on File > Save a Copy in Drive.

After making a copy, add your student ID, last name, and first name to the title.

Credit: This notebook was originally created by Dr. Olac Fuentes. It was adapted by Dr. Diego Aguirre on 02/28/2022.


In [1]:
student_name = "Salvador Robles Herrera"
student_id = "80683116"

In this assignment, we will be predicting the running times of GPU operations under various parameter settings using ensembles of trees.


Let's load the libraries we will be using.

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error, mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from google.colab import files

Let's upload the dataset we will be working with (you'll find the file on Blackboard).

In [7]:
from google.colab import files
import os

if not os.path.isfile("gpu_running_time.csv"):
  files.upload() # If this doesn't work, try using Google Chrome

In [8]:
df = pd.read_csv('gpu_running_time.csv')
df

Unnamed: 0,MWG,NWG,KWG,MDIMC,NDIMC,MDIMA,NDIMB,KWI,VWM,VWN,STRM,STRN,SA,SB,Run1 (ms),Run2 (ms),Run3 (ms),Run4 (ms)
0,128,128,16,16,32,32,32,8,2,4,1,0,1,1,13.29,13.25,13.36,13.37
1,128,128,16,16,32,32,32,8,2,4,1,1,1,1,13.29,13.36,13.38,13.65
2,128,128,16,16,32,32,32,2,2,2,1,1,1,1,13.78,13.76,13.73,13.69
3,128,128,16,16,32,32,32,8,2,2,1,1,1,1,14.34,14.44,14.43,14.58
4,128,64,16,16,16,16,32,2,2,2,1,1,1,1,14.61,14.69,14.80,14.78
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241595,128,128,32,8,8,16,32,8,2,2,0,0,1,1,3322.83,3313.44,3359.22,3342.30
241596,128,128,32,8,8,32,8,8,2,2,0,0,1,1,3324.15,3324.11,3332.74,3300.80
241597,128,128,32,8,8,16,16,8,2,2,0,0,1,1,3325.87,3340.98,3333.41,3341.08
241598,128,128,32,8,8,32,16,8,2,2,0,0,1,1,3333.92,3335.08,3354.68,3317.04


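We convert the dataframe to a numpy array and extract the features (X) and target (y). Note what the slices below actually select: in the displayed rows the leading number is the row index, so `data[:,:15]` keeps the 14 tuning parameters plus the first timing column (Run1, the 13.29 in the last column of the printed X), and y is the mean of the remaining three runs (Run2 to Run4). X therefore contains a measurement of the very quantity being predicted, which likely makes the errors reported below smaller than they would be when predicting from the parameters alone.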



In [9]:
data = df.to_numpy()
X = data[:,:15]
print(X[:100])
y = np.mean(data[:,15:],axis=1)
print(y[:100])
print('X shape', X.shape)
print('y', y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4361)

[[128.   128.    16.   ...   1.     1.    13.29]
 [128.   128.    16.   ...   1.     1.    13.29]
 [128.   128.    16.   ...   1.     1.    13.78]
 ...
 [ 64.    64.    16.   ...   1.     1.    16.25]
 [ 64.   128.    16.   ...   1.     1.    16.25]
 [128.   128.    32.   ...   1.     1.    16.26]]
[13.32666667 13.46333333 13.72666667 14.48333333 14.75666667 14.62333333
 14.86333333 14.86       15.03       14.98       14.99333333 14.98
 15.19666667 15.25       15.68333333 15.51       15.39333333 15.46333333
 15.47       15.39       15.44666667 15.41       15.48666667 15.42333333
 15.40666667 15.55       15.53       15.45666667 15.43333333 15.18666667
 15.36666667 15.42333333 15.4        15.54333333 15.38333333 15.57333333
 15.68666667 15.65333333 15.60666667 15.73666667 15.65333333 15.83666667
 15.83666667 15.96333333 15.90333333 15.80666667 15.88       15.99
 15.90666667 15.91333333 15.86       15.96333333 15.9        15.94
 15.91666667 15.95333333 15.93666667 15.94666667 15.97333333 
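Since the last column of the printed X holds a Run1 timing, one variant worth comparing is to select columns by name, so that the features contain only the tuning parameters and the target averages all four runs. The sketch below uses a tiny synthetic stand-in for the CSV (the values are made up, and `toy`, `X_params`, and `y_all_runs` are illustrative names, not part of the assignment); only the column names come from the dataframe shown above.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for gpu_running_time.csv (values are made up for illustration).
rng = np.random.default_rng(0)
param_cols = ["MWG", "NWG", "KWG", "MDIMC", "NDIMC", "MDIMA", "NDIMB",
              "KWI", "VWM", "VWN", "STRM", "STRN", "SA", "SB"]
run_cols = ["Run1 (ms)", "Run2 (ms)", "Run3 (ms)", "Run4 (ms)"]
toy = pd.DataFrame(rng.integers(1, 5, size=(6, len(param_cols))), columns=param_cols)
base = rng.uniform(10, 20, size=6)
for c in run_cols:
    toy[c] = base + rng.normal(0, 0.1, size=6)

# Select columns by name so no timing column can slip into the features.
X_params = toy[param_cols].to_numpy()
y_all_runs = toy[run_cols].mean(axis=1).to_numpy()
print(X_params.shape, y_all_runs.shape)
```

Results obtained with this variant would not be comparable to the numbers recorded in this notebook, which were produced with the slices shown in the cell above.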

As a baseline, we evaluate the error obtained when the mean of y_train is used as the prediction for every example in the test set.

In [6]:
pred = np.mean(y_train)+np.zeros_like(y_test)
print(np.mean(y_train))
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

217.9919387811416
Mean squared error = 134935.89
Mean absolute error =  216.03
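The same constant-mean baseline is available in scikit-learn as `DummyRegressor(strategy="mean")`. The sketch below checks on synthetic data (the arrays `X_tr`, `X_te`, `y_tr`, `y_te` are made up for illustration) that it matches the manual construction used in the cell above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Synthetic stand-in data; the feature values are ignored by the dummy model.
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
y_tr, y_te = rng.uniform(10, 500, size=80), rng.uniform(10, 500, size=20)

# DummyRegressor with strategy="mean" always predicts the training-set mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
dummy_pred = baseline.predict(X_te)

# Manual version, written the same way as in the notebook cell.
manual_pred = np.mean(y_tr) + np.zeros_like(y_te)
print(np.allclose(dummy_pred, manual_pred))
```

Using the estimator keeps the baseline on the same `fit`/`predict` interface as the other models, which can simplify an evaluation loop.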


Now let's evaluate a decision tree regressor for this task. 

In [7]:
model = DecisionTreeRegressor()
model.fit(X_train,y_train)
pred = model.predict(X_test)

print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error =  9.09
Mean absolute error =   1.05


We'd expect a random forest to perform better than a single tree. Let's see if that is the case.

In [8]:
model = RandomForestRegressor()
model.fit(X_train,y_train)
pred = model.predict(X_test)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error =  4.86
Mean absolute error =   0.80


k-NN can also be used for regression. Let's see how well it works on this problem. Warning: it takes a while to run!

In [9]:
model = KNeighborsRegressor(algorithm='brute',n_jobs=-1)
model.fit(X_train,y_train)
pred = model.predict(X_test)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error = 11.49
Mean absolute error =   1.63
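k-NN relies on distances, so features with large numeric ranges dominate the neighbor search. `StandardScaler` is imported at the top of this notebook but never used; the sketch below shows, on synthetic data only, how it could be combined with k-NN in a pipeline. Whether scaling helps on the GPU dataset is not tested here, and the toy arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales; only the small-scale one drives y.
X_toy = np.column_stack([rng.uniform(0, 1, 400), rng.uniform(0, 1000, 400)])
y_toy = 10 * X_toy[:, 0]
X_tr, X_te, y_tr, y_te = X_toy[:300], X_toy[300:], y_toy[:300], y_toy[300:]

# Without scaling, distances are dominated by the large-range (irrelevant) feature.
raw_knn = KNeighborsRegressor().fit(X_tr, y_tr)
# The pipeline learns the scaling on the training data and reuses it at predict time.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor()).fit(X_tr, y_tr)

mae_raw = np.mean(np.abs(raw_knn.predict(X_te) - y_te))
mae_scaled = np.mean(np.abs(scaled_knn.predict(X_te) - y_te))
print(mae_raw, mae_scaled)
```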


## Exercises

Let's build our own ensembles of decision tree regressors to see if we can surpass the performance of random forests using the same number of individual trees (the sklearn implementation uses 100 by default). So, instead of using sklearn's RandomForestRegressor, we will build the ensemble ourselves by building 100 trees (use a loop).

1. Build an ensemble that uses 100 trees with random splitting instead of best splitting to solve the problem. Print the mean squared error and mean absolute error evaluated on the test dataset. Each tree should be an instance of the DecisionTreeRegressor class - use the [splitter](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) argument to tell sklearn how to choose the split at each node.

In [15]:
# Your code goes here
def random_splitting():
  # Fit 100 trees; with splitter="random" each node picks the best of
  # randomly drawn thresholds instead of searching for the optimal one.
  E = []
  for i in range(100):
    model = DecisionTreeRegressor(splitter="random")
    model.fit(X_train, y_train)
    E.append(model)
  return E

# Average the predictions of all trees
# (named `pred` rather than `sum`, which would shadow the Python builtin)
ensembles_1 = random_splitting()
pred = np.mean([m.predict(X_test) for m in ensembles_1], axis=0)

print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error =  3.63
Mean absolute error =   0.66
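The trees above are fit without a seed, so the reported errors will vary slightly between runs. A minimal sketch of a reproducible version follows, assuming one seed per tree; the helper names `fit_random_split_trees` and `average_predictions` and the toy data are hypothetical, introduced only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_split_trees(X, y, n_trees=100):
    """Fit n_trees random-split trees, seeding each one so the run is reproducible."""
    return [DecisionTreeRegressor(splitter="random", random_state=seed).fit(X, y)
            for seed in range(n_trees)]

def average_predictions(models, X):
    """Average the per-model predictions (one row per model) into one vector."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Synthetic data standing in for X_train / X_test.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 4))
y_toy = X_toy[:, 0] * 100 + rng.normal(0, 1, size=200)

trees = fit_random_split_trees(X_toy[:150], y_toy[:150], n_trees=10)
ens_pred = average_predictions(trees, X_toy[150:])
print(ens_pred.shape)
```

The same `average_predictions` helper would work for any list of fitted regressors that were trained on the same feature columns.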


2. Build an ensemble that uses bagging and 100 trees to solve the problem. Print the mean squared error and mean absolute error evaluated on the test dataset. Each tree should be an instance of the DecisionTreeRegressor class.

In [10]:
# Your code goes here
def ensemble_bagging():
  # Fit 100 trees, each on a bootstrap sample: n row indices drawn with replacement.
  E = []
  n = X_train.shape[0]
  for i in range(100):
    model = DecisionTreeRegressor()
    random_vector = np.random.randint(low = 0, high = n, size = n)
    model.fit(X_train[random_vector], y_train[random_vector])
    E.append(model)
  return E

# Average the predictions of all trees
# (named `pred` rather than `sum`, which would shadow the Python builtin)
ensembles_2 = ensemble_bagging()
pred = np.mean([m.predict(X_test) for m in ensembles_2], axis=0)

print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error =  4.89
Mean absolute error =   0.80
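Drawing n indices with replacement from n rows leaves each tree with only about 1 - 1/e (roughly 63%) of the distinct training rows, which is what makes the bagged trees differ from one another. The short sketch below checks that fraction numerically; the sample size and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Bootstrap sample: n indices drawn uniformly with replacement from range(n).
boot_idx = rng.integers(low=0, high=n, size=n)

# Fraction of distinct rows that ended up in the sample.
unique_fraction = np.unique(boot_idx).size / n

# Rows never drawn are "out of bag" and could serve as a per-tree validation set.
oob_mask = np.ones(n, dtype=bool)
oob_mask[boot_idx] = False
print(unique_fraction, oob_mask.mean())
```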


3. Build an ensemble that uses random attribute selection and 100 regression trees to solve the problem. Print the mean squared error and mean absolute error evaluated on the test dataset. Each tree should be an instance of the DecisionTreeRegressor class.

In [16]:
# Your code goes here
def ensemble_random_attribute():
  # Each tree is trained on a random subset of the columns (each column kept
  # with probability 0.5). The column subset differs per tree, so we return
  # the test predictions rather than the trees themselves.
  P = []
  for i in range(100):
    model = DecisionTreeRegressor()
    boolean_vector = np.random.choice(a=[True, False], size=(X_train.shape[1],))
    if not boolean_vector.any():
      # Guard: an all-False mask would leave no features to fit on
      boolean_vector[np.random.randint(X_train.shape[1])] = True
    model.fit(X_train[:,boolean_vector], y_train)
    P.append(model.predict(X_test[:,boolean_vector]))
  return P

# Average the predictions of all trees
# (named `pred` rather than `sum`, which would shadow the Python builtin)
predictions_3 = ensemble_random_attribute()
pred = np.mean(predictions_3, axis=0)

print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error = 15261.64
Mean absolute error =  64.28
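One possible explanation for the much larger error here, offered as a hypothesis rather than a verified diagnosis: each tree drops roughly half of the columns, so any tree that loses a highly predictive column predicts poorly, and the plain average carries those errors into the ensemble. Another common reading of random attribute selection samples the candidate attributes at every split instead of once per tree, which `DecisionTreeRegressor` supports through `max_features`. The sketch below contrasts the two on synthetic data where only one column matters; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 6))
y_toy = 100 * X_toy[:, 0] + rng.normal(0, 1, size=300)  # only column 0 matters
X_tr, X_te, y_tr, y_te = X_toy[:200], X_toy[200:], y_toy[:200], y_toy[200:]

# Per-tree subsets (as in the exercise): a tree that drops column 0 cannot recover y.
per_tree = []
for seed in range(50):
    mask = np.random.default_rng(seed).random(X_tr.shape[1]) < 0.5
    mask[seed % X_tr.shape[1]] = True  # guarantee at least one column
    t = DecisionTreeRegressor(random_state=seed).fit(X_tr[:, mask], y_tr)
    per_tree.append(t.predict(X_te[:, mask]))

# Per-split subsets: every tree can still reach column 0 at some node.
per_split = [
    DecisionTreeRegressor(max_features=0.5, random_state=seed).fit(X_tr, y_tr).predict(X_te)
    for seed in range(50)
]

mse_per_tree = np.mean((np.mean(per_tree, axis=0) - y_te) ** 2)
mse_per_split = np.mean((np.mean(per_split, axis=0) - y_te) ** 2)
print(mse_per_tree, mse_per_split)
```

On this toy problem the per-split variant gives the lower error; whether the same holds on the GPU dataset would need to be checked by running it.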


## Submission Instructions

1. File > Download .ipynb
2. Go to Blackboard, find the submission page, and upload the .ipynb file you just downloaded.