In [1]:
from main import *
import numpy as np
import torch
from torch import nn
from torch.autograd import Variable
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from Load_Ceserian_Dataset import readCeserianFile
from keras.utils import to_categorical
import torch.nn.functional as F

Using TensorFlow backend.


## Introduction

In this notebook, the online shoppers purchasing intentions dataset will be used to train a neural network. 

A neural network can be described as a parameterised mathematical function that is fitted to a problem by adjusting its weights. The algorithm is loosely modelled on the brain: it is built from artificial neurons that learn to recognise patterns. The earliest result of this idea was the well-known perceptron, a model architecture for binary classification. However, as the dataset explored here is more complex, a multilayer perceptron is used instead, which consists of the following 3 layers:
1. Input - consists of one neuron per input attribute
2. Hidden - transforms the values from the previous layer with a weighted summation followed by an activation function, e.g. Rectified Linear Unit (ReLU)
3. Output - receives the values from the last hidden layer and produces the label

For example, given one hidden layer with one neuron, the function 

$$
  f(x) = W_{2}g(W^{T}_{1}x+b_{1})+b_{2}
$$

can be learned by initialising the model parameters 
$W_{1}\in\mathbb{R}^{m}$
and
$W_{2},b_{1},b_{2}\in\mathbb{R}$,
where $W_{1}$ and $W_{2}$ are the weights of the input and hidden layer respectively, and $b_{1}$ and $b_{2}$ are the biases added to the hidden and output layer. The weighted inputs are passed through the activation function $g$, which determines the output of each unit (1). Lastly, cross entropy is used to measure how well the model predicts for a given set of parameters. Model optimisation is also explored in the sections below.
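
As a minimal sketch of the single-hidden-neuron case, the snippet below evaluates the function in the form given by the scikit-learn documentation cited as (1), $f(x) = W_{2}g(W^{T}_{1}x+b_{1})+b_{2}$, with ReLU as $g$. The parameter values and the helper name `one_hidden_neuron_mlp` are made up for illustration and do not come from the trained model.

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z)
    return np.maximum(0.0, z)

def one_hidden_neuron_mlp(x, W1, b1, W2, b2):
    # f(x) = W2 * g(W1^T x + b1) + b2, with g = ReLU
    hidden = relu(W1 @ x + b1)
    return W2 * hidden + b2

# Illustrative values only (m = 3 input attributes)
x = np.array([1.0, 2.0, 3.0])
W1 = np.array([0.5, -0.25, 0.5])
b1, W2, b2 = 0.5, 2.0, -1.0

output = one_hidden_neuron_mlp(x, W1, b1, W2, b2)
print(output)  # W1.x = 1.5, plus b1 gives 2.0, ReLU keeps 2.0, then 2.0 * 2.0 - 1.0 = 3.0
```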

## Preprocessing Data

The CSV file was first loaded into a pandas dataframe for easy inspection and retrieval. As noted in the previous learning tasks, several attributes are categorical and need to be encoded before they can be used to fit and train the model. The data was then split into features and labels in preparation for the train-test split.
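
The implementation of `preprocess_df` is not shown in this notebook, so the following is only a sketch of what factorizing categorical variables could look like, assuming pandas' `pd.factorize` is used. The helper name `factorize_categoricals` and the toy frame are hypothetical.

```python
import pandas as pd

def factorize_categoricals(frame):
    # Hypothetical helper: replace each non-numeric column with integer codes,
    # numbered in order of first appearance.
    encoded = frame.copy()
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            codes, _ = pd.factorize(encoded[column])
            encoded[column] = codes
    return encoded

# Toy data that only mimics the shape of the real dataset
toy = pd.DataFrame({
    "Month": ["Feb", "Feb", "Mar", "May"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor", "Other"],
    "BounceRates": [0.2, 0.0, 0.05, 0.02],
})
encoded = factorize_categoricals(toy)
print(encoded)
```

Integer codes impose an arbitrary ordering on the categories, which is a design trade-off; one-hot encoding is a common alternative when that ordering is undesirable.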

In [2]:
data_frame_os = read_data_return_frame("online_shoppers_intention.csv")
preprocess_df(data_frame_os) # function preprocess_df factorizes the categorical variables
data_frame_os # return factorized dataset

Unnamed: 0,ProductRelated_Duration,ProductRelatedAve,BounceRates,ExitRates,SpecialDay,Month,Region,VisitorType,Weekend,Revenue
0,0.000000,0.000000,0.200000,0.200000,0.0,0,1,0,0,False
1,64.000000,32.000000,0.000000,0.100000,0.0,0,1,0,0,False
2,0.000000,0.000000,0.200000,0.200000,0.0,0,9,0,0,False
3,2.666667,1.333333,0.050000,0.140000,0.0,0,2,0,0,False
4,627.500000,62.750000,0.020000,0.050000,0.0,0,1,0,1,False
...,...,...,...,...,...,...,...,...,...,...
12325,1783.791667,33.656447,0.007143,0.029031,0.0,9,1,0,1,False
12326,465.750000,93.150000,0.000000,0.021333,0.0,7,1,0,1,False
12327,184.250000,30.708333,0.083333,0.086667,0.0,7,1,0,1,False
12328,346.000000,23.066667,0.000000,0.021053,0.0,7,3,0,0,False


In [3]:
print(data_frame_os)
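# The first 9 columns are the features; np.float was removed in NumPy 1.24, so np.float64 is used
features = np.array(data_frame_os.iloc[:, 0:9]).astype(np.float64)
labels = np.array(data_frame_os['Revenue']).astype(np.float64)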

       ProductRelated_Duration  ProductRelatedAve  BounceRates  ExitRates  \
0                     0.000000           0.000000     0.200000   0.200000   
1                    64.000000          32.000000     0.000000   0.100000   
2                     0.000000           0.000000     0.200000   0.200000   
3                     2.666667           1.333333     0.050000   0.140000   
4                   627.500000          62.750000     0.020000   0.050000   
...                        ...                ...          ...        ...   
12325              1783.791667          33.656447     0.007143   0.029031   
12326               465.750000          93.150000     0.000000   0.021333   
12327               184.250000          30.708333     0.083333   0.086667   
12328               346.000000          23.066667     0.000000   0.021053   
12329                21.250000           7.083333     0.000000   0.066667   

       SpecialDay  Month  Region  VisitorType  Weekend  Revenue  
0        

## Multi-Layer Perceptron Classifier

Building the model with PyTorch made development simple, because the model can inherit from the library's base class for neural network modules. As the backward step is handled entirely by PyTorch, we can focus on creating the forward step. First the layers needed to be created: in this instance 3 layers, each of which applies a linear transformation to its incoming data. The choice of 3 was largely arbitrary, since the numbers of layers and nodes are both model hyperparameters, but research on similar datasets suggests that greater depth can give better generalisation across a wider variety of tasks. Furthermore, deep architectures tend to encode useful priors over the function space that the model can learn (2). 

Returning to the forward function: it defines the network structure, including the activation functions. There are many activation functions available, such as the commonly used Sigmoid or Tanh, but for this model ReLU was chosen, defined as $\max(0,z)$. Like the Sigmoid, ReLU is non-linear and takes real values as inputs, but whereas the Sigmoid squashes its output into the range (0, 1), ReLU outputs 0 for negative inputs and the input itself otherwise. It generally gives better performance because it mitigates the vanishing gradient problem (3). Finally, the Softmax activation function, defined as:
$$
  \sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{K}e^{z_{j}}}
$$
where:

$\vec{z}$ = input vector

$e^{z_{i}}$ = standard exponential function applied to the $i$-th element of the input vector

$K$ = number of classes in the multi-class classifier

$e^{z_{j}}$ = standard exponential function applied to the $j$-th element of the input vector, summed over all $K$ classes to normalise the output

was then used for the third layer, as it assigns decimal probabilities to each target class, which can help training converge more quickly (4). 
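
To make the Softmax definition concrete, here is a small NumPy sketch for $K=2$ classes. The input scores are made up, and subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) avoids overflow and does not change the result
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

scores = np.array([1.5, -0.5])   # illustrative output-layer scores for 2 classes
probs = softmax(scores)
print(probs, probs.sum())        # two positive values that sum to 1
```

With two classes, Softmax reduces to the Sigmoid of the difference between the two scores, which is one way to sanity-check the output.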

In [4]:
class Model(nn.Module):
    def __init__(self, input_dim):
        super(Model, self).__init__()
        self.layer1 = nn.Linear(input_dim, 50)
        self.layer2 = nn.Linear(50, 20)
        self.layer3 = nn.Linear(20, 2)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
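        # dim=-1 normalises over the class dimension and avoids the implicit-dim warning.
        # Note: nn.CrossEntropyLoss applies log-softmax itself and expects raw scores.
        x = F.softmax(self.layer3(x), dim=-1)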
        return x

Using Scikit-learn's train_test_split, 75% of the data was allocated to training and the remaining 25% to testing. As the dataset was already found to exhibit an imbalance in the distribution of the target class, stratified sampling is the best way forward, since it preserves the same percentage of each target class in both splits.

In [5]:
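# stratify=labels keeps the proportion of each Revenue class equal in both splits
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.25, stratify=labels, random_state=21, shuffle=True)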
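
As a sketch of what stratification preserves, the example below splits synthetic data with 16% positive labels and checks the class proportion in each split. The data and the 16% figure are invented for illustration; only `train_test_split` and its `stratify` argument are taken from Scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(1000, 3))     # synthetic features
toy_labels = np.array([1] * 160 + [0] * 840)  # 16% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.25, stratify=toy_labels, random_state=21
)

# Both splits should keep roughly the original 16% share of positives
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```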

The next step is to train the model with performance in mind. By defining a loss function that evaluates how well the model does, we can set the goal of minimising this loss. As briefly mentioned above, cross entropy was used. In essence, this loss function measures the performance of the model by summing, for each observation, the separate loss of each class (5). This is defined as:
$$
  -\sum_{c=1}^{M}y_{o,c}\log(p_{o,c})
$$
where:

${M}$ = the number of classes

$y_{o,c}$ = 1 if the class label $c$ is the correct classification for observation $o$, and 0 otherwise

$p_{o,c}$ = the predicted probability that observation $o$ is of class $c$
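
The formula can be checked against PyTorch's `nn.CrossEntropyLoss`, the loss used in the training cell. The sketch below uses made-up raw scores for two observations. Because $y_{o,c}$ is 1 only for the correct class, the sum reduces to the negative log-probability of that class, and the built-in loss averages it over the observations. Note that `nn.CrossEntropyLoss` takes raw scores and applies log-softmax internally.

```python
import torch
from torch import nn

# Illustrative raw scores for 2 observations and M = 2 classes
scores = torch.tensor([[2.0, -1.0], [0.3, 0.8]])
targets = torch.tensor([0, 1])  # index of the correct class per observation

# Manual version: -log(p_{o,c}) for the correct class c, averaged over observations
log_probs = torch.log_softmax(scores, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

builtin = nn.CrossEntropyLoss()(scores, targets)
print(manual.item(), builtin.item())  # the two values should agree
```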


The model can now be optimised. Adaptive Moment Estimation (ADAM) was chosen over the better-known Stochastic Gradient Descent (SGD): the two methods are similar, but differ in how the learning rate is adapted. ADAM stores both an exponentially decaying average of past squared gradients and an exponentially decaying average of past gradients, and uses them to scale the update of each parameter individually, so parameters that would otherwise receive small updates are given relatively larger ones. In comparison, SGD applies the same learning rate to all weights in the model, which can slow training down and require manual tuning, whereas ADAM tends to converge faster (6).

To use ADAM, the decaying average of past squared gradients and past gradients must be calculated as follows:
$$
  v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g_{t}^{2}
$$
$$
  m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}
$$

where $m_{t}$ is the first moment, denoting the mean of the gradients $g_{t}$, and $v_{t}$ is the second moment, denoting their uncentred variance. However, as both are initialised at 0, they are biased towards 0 during the initial steps, so to counter the bias, corrected estimates are calculated for the first and second moments:
$$
  \hat{v}_{t}=\frac{v_{t}}{1-\beta_{2}^{t}}
$$
$$
  \hat{m}_{t}=\frac{m_{t}}{1-\beta_{1}^{t}}
$$

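The parameters are then updated, which gives the final ADAM update rule, where $\eta$ is the learning rate and $\epsilon$ is a small constant that prevents division by zero:
$$
  \theta_{t}=\theta_{t-1}-\frac{\eta}{\sqrt{\hat{v}_{t}}+\epsilon}\hat{m}_{t}
$$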
$$
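
The equations above can be traced by hand for a single step. The NumPy sketch below is an illustration rather than PyTorch's implementation: it applies one ADAM update to a toy parameter vector, using the learning rate from this notebook (0.05) and assuming the commonly used defaults $\beta_{1}=0.9$, $\beta_{2}=0.999$ and $\epsilon=10^{-8}$. The function name `adam_step` and the toy loss are hypothetical.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    # Decaying averages of past gradients (m) and past squared gradients (v)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-corrected estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = 2 * theta  # gradient of the toy loss sum(theta ** 2)

theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)  # each parameter moves by about lr towards 0: roughly [0.95, -1.95]
```

On the first step the bias correction makes $\hat{m}_{1}=g_{1}$ and $\hat{v}_{1}=g_{1}^{2}$, so every parameter moves by approximately $\eta$ regardless of the size of its gradient, which illustrates the per-parameter scaling.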

On a final note, the number of epochs also needs to be stated, as neural networks are trained with an epoch-based algorithm. An epoch is one pass of the algorithm over the entire training set, not to be confused with an iteration, which is one update computed from a single batch. As ADAM is an iterative process, using a single epoch is not advised: one pass is insufficient to update the weights adequately and can therefore lead to underfitting (6). In light of this, the number of epochs was set to 100. This value is a heuristic choice, but it gives the model the chance to readjust its parameters so that it is not biased towards the last few data points. 
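
The difference between epochs and iterations can be shown with simple arithmetic. In the sketch below the sample count and the batch size of 64 are hypothetical. When the whole training set is passed through the model at once, each epoch is a single iteration, whereas mini-batches give many more parameter updates for the same number of epochs.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    # One iteration processes one batch; the last batch may be smaller
    return math.ceil(num_samples / batch_size)

num_samples = 9000  # hypothetical training-set size
epochs = 100

full_batch_updates = iterations_per_epoch(num_samples, num_samples) * epochs
mini_batch_updates = iterations_per_epoch(num_samples, 64) * epochs

print(full_batch_updates)  # 100 updates: one per epoch
print(mini_batch_updates)  # 14100 updates: 141 per epoch
```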

In [6]:
# Training
model = Model(features_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
epochs = 100

## Results

During training, the printed results confirm the reduction of the loss function over the first few epochs, after which it plateaus. Within each epoch, the function zero_grad is used to clear old gradients from the last step so that gradients from previous backward calls do not accumulate. Calling backward() then computes the derivative of the loss with respect to the parameters using backpropagation, and a final step call tells the optimiser to update the parameters based on those gradients. Predictions are then made on the test set. Given the unbalanced nature of the dataset, F1 score is a more informative measure than accuracy, although accuracy is what the final cell reports.

In [7]:
def print_(loss):
    print ("The loss calculated: ", loss)

In [8]:
x_train, y_train = Variable(torch.from_numpy(features_train)).float(), Variable(torch.from_numpy(labels_train)).long()

for epoch in range(1, epochs + 1):
    print("Epoch #", epoch)
    y_pred = model(x_train)
    loss = loss_fn(y_pred, y_train)
    print_(loss.item())

    # Zero gradients
    optimizer.zero_grad()
    loss.backward()  # Gradients
    optimizer.step()  # Update

Epoch # 1
The loss calculated:  0.9568246603012085
Epoch # 2
The loss calculated:  0.4685142934322357
Epoch # 3
The loss calculated:  0.46760451793670654
Epoch # 4
The loss calculated:  0.46757015585899353
Epoch # 5
The loss calculated:  0.4675697088241577
Epoch # 6
The loss calculated:  0.4675697088241577
Epoch # 7
The loss calculated:  0.4675697088241577
Epoch # 8
The loss calculated:  0.4675697088241577
Epoch # 9
The loss calculated:  0.4675697088241577
Epoch # 10
The loss calculated:  0.4675697088241577
Epoch # 11
The loss calculated:  0.4675697088241577
Epoch # 12
The loss calculated:  0.4675697088241577
Epoch # 13
The loss calculated:  0.4675697088241577
Epoch # 14
The loss calculated:  0.4675697088241577
Epoch # 15
The loss calculated:  0.4675697088241577
Epoch # 16
The loss calculated:  0.4675697088241577
Epoch # 17
The loss calculated:  0.4675697088241577
Epoch # 18
The loss calculated:  0.4675697088241577
Epoch # 19
The loss calculated:  0.4675697088241577
Epoch # 20
The loss calculated:  0.4675697088241577
Epoch # 21
The loss calculated:  0.4675697088241577
Epoch # 22
The loss calculated:  0.4675697088241577
Epoch # 23
The loss calculated:  0.4675697088241577
Epoch # 24
The loss calculated:  0.4675697088241577
Epoch # 25
The loss calculated:  0.4675697088241577
Epoch # 26
The loss calculated:  0.4675697088241577
Epoch # 27
The loss calculated:  0.4675697088241577
Epoch # 28
The loss calculated:  0.4675697088241577
Epoch # 29
The loss calculated:  0.4675697088241577
Epoch # 30
The loss calculated:  0.4675697088241577
Epoch # 31
The loss calculated:  0.4675697088241577
Epoch # 32
The loss calculated:  0.4675697088241577
Epoch # 33
The loss calculated:  0.4675697088241577
Epoch # 34
The loss calculated:  0.4675697088241577
Epoch # 35
The loss calculated:  0.4675697088241577
Epoch # 36
The loss calculated:  0.4675697088241577
Epoch # 37
The loss cal

In [9]:
x_test = Variable(torch.from_numpy(features_test)).float()
pred = model(x_test)

pred = pred.detach().numpy()

print("The accuracy is", accuracy_score(labels_test, np.argmax(pred, axis=1)))

# Checking for first value
np.argmax(model(x_test[0]).detach().numpy(), axis=0)

The accuracy is 0.8439831333117094

0
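
Since the dataset is unbalanced, accuracy on its own can be misleading: a classifier that always predicts the majority class would score an accuracy close to the majority share without finding a single positive case. The sketch below illustrates this on made-up labels with 15% positives, using Scikit-learn's `accuracy_score` and `f1_score`. It does not use the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 85 negatives and 15 positives
y_true = np.array([0] * 85 + [1] * 15)
y_majority = np.zeros_like(y_true)  # always predict the majority class

acc = accuracy_score(y_true, y_majority)
f1 = f1_score(y_true, y_majority, zero_division=0)

print(acc)  # 0.85, although no positive case is found
print(f1)   # 0.0, because there are no true positives
```

Applying `f1_score` to `labels_test` and `np.argmax(pred, axis=1)` in the same way would give the F1 score for the model trained above.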

## Bibliography 

1. 1.17. Neural network models (supervised) — scikit-learn 0.24.1 documentation [Internet]. Scikit-learn.org. 2021 [cited 27 March 2021]. Available from: https://scikit-learn.org/stable/modules/neural_networks_supervised.html#mathematical-formulation

2. Sze V, Chen Y, Yang T, Emer J. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE. 2017;105(12):2295-2329.

3. Activation Functions in Neural Networks [Internet]. Medium. 2021 [cited 27 March 2021]. Available from: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

4. Multi-Class Neural Networks: Softmax  |  Machine Learning Crash Course [Internet]. Google Developers. 2021 [cited 27 March 2021]. Available from: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax

5. sklearn.metrics.log_loss — scikit-learn 0.24.1 documentation [Internet]. Scikit-learn.org. 2021 [cited 27 March 2021]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

6. Zhang Z. Improved Adam Optimizer for Deep Neural Networks. 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS). 2018.