<img src="https://www.mmu.edu.my/fci/wp-content/uploads/2021/01/FCI_wNEW_MMU_LOGO.png" style="height: 80px;" align=left>  



---



### For Google Colab Use Only
Skip this section if you are using Jupyter Notebook etc.

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
# drive_path = '/content/drive/MyDrive/Trimester/2310/TDS3301/Tutorials/Tutorial 9/' #set your google drive path

---

# Artificial Neural Networks
A [Multi-layer Perceptron (MLP)](https://scikit-learn.org/stable/modules/neural_networks_supervised.html) is a supervised learning algorithm that learns a function by training on a dataset. Given a set of features / attributes and a target, it can learn a non-linear function approximator for either classification or regression. It differs from logistic regression in that there can be one or more non-linear layers, called hidden layers, between the input and output layers.

Some advantages of MLP are:
* Capability to learn non-linear models.
* Capability to learn models in real-time (on-line learning) using `partial_fit`.

Some disadvantages of MLP include:
* MLPs with hidden layers have a non-convex loss function with more than one local minimum. Different random weight initializations can therefore lead to different validation accuracy.
* MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers, and iterations.
* MLP is sensitive to feature scaling.
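
The scaling and initialization points above can be sketched as follows. This is a minimal illustration, not part of the original tutorial: it trains scikit-learn's `MLPClassifier` inside a pipeline with `StandardScaler`, and fixes `random_state` so the weight initialization is repeatable. The synthetic dataset from `make_classification`, the hidden layer size, and the seeds are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption: stands in for a real dataset)
X_mlp, y_mlp = make_classification(n_samples=300, n_features=8, random_state=0)
X_mlp_tr, X_mlp_te, y_mlp_tr, y_mlp_te = train_test_split(
    X_mlp, y_mlp, test_size=0.2, random_state=0)

# The scaler is part of the pipeline, so it is fitted on the training data only
mlp_demo = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
)
mlp_demo.fit(X_mlp_tr, y_mlp_tr)
print("Test accuracy:", mlp_demo.score(X_mlp_te, y_mlp_te))
```

Changing `random_state` in `MLPClassifier` is a quick way to see how much the result depends on the initial weights.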

There are various Python packages that provide methods to implement MLP, but for this lab, we will continue to focus on the popular scikit-learn package. <br>
*Note: The neural network implementation in scikit-learn is not intended for large-scale applications and offers no GPU support. For GPU-based implementations, other packages are available, e.g. TensorFlow.*<br>

The following experiments let you try out a simple MLP classifier on the `diabetes.csv` dataset.

---

## Data Preprocessing

The following steps of preprocessing are replicated from the previous tutorial.

In [3]:
#import the relevant libraries
import pandas as pd #data reading
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler # feature scaling
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import numpy as np
#code to ensure path
# try:
#   drive_path
# except NameError:
#   drive_path = ''

# load the finaldf.csv dataset
newdf = pd.read_csv("finaldf.csv")
newdf.head()

Unnamed: 0,latitude,longitude,mag,gap,rms,horizontalError,nst,dmin,depth,magNst,label
0,52.0999,178.5218,0.096154,0.383041,0.093407,0.091361,0.059166,0.263682,0.125824,0.018551,0
1,7.1397,126.738,0.365385,0.280702,0.252747,0.061756,0.100572,0.267839,0.1206,0.051266,0
2,19.1631,-66.5251,0.255769,0.695906,0.115385,0.015591,0.079869,0.263356,0.039979,0.020808,1
3,-4.7803,102.7675,0.326923,0.523392,0.274725,0.109325,0.066067,0.257594,0.098096,0.008399,0
4,53.3965,-166.9417,0.076923,0.532164,0.164835,0.020609,0.070667,0.256754,0.019529,0.023064,1


The `finaldf.csv` dataset contains fully numerical data, which we can confirm by checking the data type of each column. The class attribute, `label`, is stored as an integer.<br>


In [4]:
# check the data types
newdf.dtypes

latitude           float64
longitude          float64
mag                float64
gap                float64
rms                float64
horizontalError    float64
nst                float64
dmin               float64
depth              float64
magNst             float64
label                int64
dtype: object

In [5]:
df = newdf.drop(['latitude','longitude'], axis=1)

In [6]:
df = df[df['label']==1]
df

Unnamed: 0,mag,gap,rms,horizontalError,nst,dmin,depth,magNst,label
2,0.255769,0.695906,0.115385,0.015591,0.079869,0.263356,0.039979,0.020808,1
4,0.076923,0.532164,0.164835,0.020609,0.070667,0.256754,0.019529,0.023064,1
5,0.038462,0.347953,0.093407,0.034258,0.070667,0.251830,0.060065,0.043370,1
6,0.288462,0.230994,0.076923,0.135418,0.061466,0.295832,0.818512,0.009527,1
7,0.288462,0.213450,0.170330,0.105211,0.118974,0.301169,0.378085,0.041113,1
...,...,...,...,...,...,...,...,...,...
26409,0.423077,0.371345,0.653846,0.065971,0.197185,0.264389,0.104366,0.307346,1
26413,0.365385,0.277778,0.543956,0.146558,0.121275,0.295272,0.813206,0.039985,1
26414,0.442308,0.198830,0.307692,0.125684,0.132776,0.333452,0.134132,0.038857,1
26418,0.365385,0.181287,0.219780,0.149669,0.093671,0.288019,0.568264,0.047882,1


Then, we split the dataset into features and target, and create the training and testing sets. Here the target is the continuous `mag` column, so this is a regression task. <br>
*Note*: The `train_test_split` function does not provide a direct train/val/test split, so a validation set, when needed, is created by splitting the dataset twice. Only the first split is performed here.

In [7]:
X = df.drop('mag',axis=1) #Features
y = df['mag'] # Target variable
print(X.shape)
print(y.shape)

(14333, 8)
(14333,)


In [8]:
#First split to get 20% as test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

print('Training set: ', y_train.shape)
print('Testing set: ', y_test.shape)


Training set:  (11466,)
Testing set:  (2867,)
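
If a validation set is also wanted, `train_test_split` can be applied a second time to the non-test portion. The sketch below uses placeholder arrays (an assumption, not the tutorial data) and hypothetical variable names ending in `_demo`. It holds out 20% for testing, then takes 25% of the remaining 80% so that the validation set is also 20% of the whole.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 100 samples with 8 features
X_demo = np.arange(800).reshape(100, 8)
y_demo = np.arange(100)

# First split: hold out 20% as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2)

# Second split: 25% of the remaining 80% gives a 20% validation set overall
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=2)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 60 20 20
```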


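Normalization keeps any single feature from dominating during model training. The features shown above already appear to lie in the [0, 1] range, and tree-based models are insensitive to feature scale, so we fit a `RandomForestRegressor` baseline directly, with the out-of-bag score enabled.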

In [9]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(oob_score=True)
rfr.fit(X_train,y_train)
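
With `oob_score=True`, each training sample is scored using only the trees whose bootstrap sample did not include it, and the result (R² for a regressor) is stored in the fitted model's `oob_score_` attribute. The cell above enables the option without reading the value. The following sketch shows how it could be inspected; the synthetic data from `make_regression`, the number of trees, and the seed are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption: stands in for X_train, y_train)
X_oob, y_oob = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

rfr_oob = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
rfr_oob.fit(X_oob, y_oob)

# R-squared estimated from out-of-bag samples, available without a separate validation set
print("OOB R-squared:", rfr_oob.oob_score_)
```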

In [11]:
y_pred = rfr.predict(X_test)
pred_mag = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
pred_mag

Unnamed: 0,Actual,Predicted
23810,0.384615,0.394808
15323,0.019231,0.111596
26072,0.461538,0.429231
7213,0.032692,0.031731
17209,0.076923,0.076731
...,...,...
10197,0.161538,0.117135
13723,0.288462,0.372115
10375,0.365385,0.378462
15927,0.205769,0.286885


In [12]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test,y_pred)
mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Absolute Error: ', mae)
print('Mean Squared Error: ', mse)
print('R-squared : ', r2)

Mean Absolute Error:  0.03918802101700482
Mean Squared Error:  0.0032286836984461737
R-squared :  0.8686287089034599
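
For reference, the three metrics are simple functions of the prediction errors: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below recomputes them with NumPy on toy values (an assumption, not taken from the dataset) and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (assumption), not taken from the dataset
y_true_demo = np.array([0.2, 0.4, 0.6, 0.8])
y_hat_demo = np.array([0.25, 0.35, 0.65, 0.70])

errors = y_true_demo - y_hat_demo
mae_manual = np.mean(np.abs(errors))
mse_manual = np.mean(errors ** 2)
ss_res = np.sum(errors ** 2)                                  # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)      # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# The manual values agree with the scikit-learn implementations
print(np.isclose(mae_manual, mean_absolute_error(y_true_demo, y_hat_demo)))
print(np.isclose(mse_manual, mean_squared_error(y_true_demo, y_hat_demo)))
print(np.isclose(r2_manual, r2_score(y_true_demo, y_hat_demo)))
```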


In [13]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rfr, X, y, cv=5)  
print("Best Cross-Validation Score:", max(scores))
print("Mean Score:", scores.mean())
print("Standard Deviation:", scores.std())

Best Cross-Validation Score: 0.8141543782116973
Mean Score: 0.6609310557809684
Standard Deviation: 0.27355311678942423
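
For a regressor, `cross_val_score` reports R² by default and, when `cv` is an integer, uses `KFold` without shuffling. If the rows of a file follow some ordering, unshuffled folds can differ noticeably from one another, which is one possible reason (not verified here) for a large standard deviation across folds. The sketch below passes a shuffled `KFold` explicitly and prints every fold score; the synthetic data, tree count, and seeds are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: stands in for X, y)
X_cv, y_cv = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Shuffle the rows before forming the 5 folds
cv_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
scores_cv = cross_val_score(
    RandomForestRegressor(n_estimators=25, random_state=0),
    X_cv, y_cv, cv=cv_shuffled, scoring="r2")

print("Scores per fold:", scores_cv)
print("Mean:", scores_cv.mean(), "Std:", scores_cv.std())
```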


## RandomizedSearchCV

In [14]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'n_estimators': [25, 50, 100, 150, 200, 300],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 5, 10, 15, 20,30],
    'max_leaf_nodes': [None, 5, 10, 15, 20],
}

random_search = RandomizedSearchCV(RandomForestRegressor(),
                                   param_grid)
random_search.fit(X_train, y_train)
print(random_search.best_estimator_)
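
By default, `RandomizedSearchCV` samples `n_iter=10` parameter combinations, evaluates each with 5-fold cross-validation, and then refits the best combination on the full training data, so `best_estimator_` is already fitted. Because the sampling is random, results vary between runs unless `random_state` is set. The sketch below makes these settings explicit; the synthetic data, the reduced parameter lists, and the chosen values of `n_iter` and `cv` are assumptions to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (assumption: stands in for X_train, y_train)
X_rs, y_rs = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Reduced search space (assumption), 2 * 2 * 2 = 8 possible combinations
param_dist_demo = {
    'n_estimators': [10, 25],
    'max_features': ['sqrt', None],
    'max_depth': [None, 5],
}

search_demo = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_dist_demo,
    n_iter=4,          # number of sampled combinations
    cv=3,              # folds per combination
    random_state=0,    # makes the sampled combinations repeatable
)
search_demo.fit(X_rs, y_rs)

print(search_demo.best_params_)
print("Best mean CV R-squared:", search_demo.best_score_)
# best_estimator_ is refitted on all of X_rs (refit=True by default)
print(search_demo.best_estimator_.predict(X_rs[:3]))
```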

In [None]:
best_estimator = random_search.best_estimator_

rfr1 = RandomForestRegressor(**best_estimator.get_params())
rfr1.fit(X_train, y_train)

In [None]:
y_pred_rs = rfr1.predict(X_test)
pred_mag_rs = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred_rs})
pred_mag_rs

Unnamed: 0,Actual,Predicted
5990,0.134615,0.112662
19383,0.365385,0.355413
17486,0.365385,0.348682
24384,0.788462,0.607810
10326,0.365385,0.363118
...,...,...
7688,0.096154,0.098797
4291,0.346154,0.341439
16809,0.000000,0.090304
12181,0.000000,0.010924


In [None]:
# evaluate the tuned model's predictions (y_pred_rs), not the baseline's (y_pred)
mae_rs = mean_absolute_error(y_test, y_pred_rs)
mse_rs = mean_squared_error(y_test, y_pred_rs)
r2_rs = r2_score(y_test, y_pred_rs)

print('Mean Absolute Error: ', mae_rs)
print('Mean Squared Error: ', mse_rs)
print('R-squared : ', r2_rs)

Mean Absolute Error:  0.03496729222718916
Mean Squared Error:  0.0027312593229721124
R-squared :  0.8801763787938929


In [None]:
from sklearn.model_selection import cross_val_score
scores1 = cross_val_score(rfr1, X, y, cv=5)  
print("Best Cross-Validation Score:", max(scores1))
print("Mean Score:", scores1.mean())
print("Standard Deviation:", scores1.std())

Best Cross-Validation Score: 0.8294666804532921
Mean Score: 0.7513903462912983
Standard Deviation: 0.11324135897120724

## GridSearchCV


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [25, 50, 100],
    'max_features': ['log2', None],
    'max_depth': [5, 10, 15, 20],
    'max_leaf_nodes': [None, 5, 10, 15, 20],
}

rf_grid = GridSearchCV(estimator = RandomForestRegressor(), param_grid = param_grid,verbose=2, n_jobs = 4)
rf_grid.fit(X_train, y_train)
print(rf_grid.best_estimator_)

Fitting 5 folds for each of 120 candidates, totalling 600 fits
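
The candidate count in the log line above follows from the grid itself: `GridSearchCV` tries every combination, so the number of candidates is the product of the list lengths, and each candidate is fitted once per fold (5 folds by default). As a small check, `ParameterGrid` can enumerate the same grid without fitting anything; `param_grid_demo` below is simply a copy of the grid from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid used in the GridSearchCV cell
param_grid_demo = {
    'n_estimators': [25, 50, 100],
    'max_features': ['log2', None],
    'max_depth': [5, 10, 15, 20],
    'max_leaf_nodes': [None, 5, 10, 15, 20],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 3 * 2 * 4 * 5
n_folds = 5  # GridSearchCV default when cv is not given
print(n_candidates, n_candidates * n_folds)  # 120 600
```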
