# Recurrent Neural Network & Classification
Yun Xing. 2023-5-14

The objective is to detect security breaches by predicting suspicious access from the provided log-file data with an RNN model.

### Data Processing

In [1]:
import sys
import os
import json
import pandas as pd
import numpy as np
import optparse

from keras.callbacks import TensorBoard
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout

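# Import Embedding from tensorflow.keras; the older keras.layers.embeddings path no longer works here
from tensorflow.keras.layers import Embedding

# Use the tensorflow.keras preprocessing utilities for padding and tokenizing
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer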
from collections import OrderedDict

In [2]:
# b) We will read the data in slightly differently than before:
dataframe = pd.read_csv("dev-access.csv", engine='python', quotechar='|', header=None)
dataframe.head()

Unnamed: 0,0,1
0,"{""timestamp"":1502738402847,""method"":""post"",""qu...",0
1,"{""timestamp"":1502738402849,""method"":""post"",""qu...",0
2,"{""timestamp"":1502738402852,""method"":""post"",""qu...",0
3,"{""timestamp"":1502738402852,""method"":""post"",""qu...",0
4,"{""timestamp"":1502738402853,""method"":""post"",""qu...",0


In [3]:
# c) convert to a numpy.ndarray type. 
dataset = dataframe.values

# d) Check the shape of the dataset
dataset.shape

(26773, 2)

In [4]:
# e) Store all rows and the 0th index as the feature data: 
X = dataset[:,0]

# f) Store all rows and index 1 as the target variable: 
Y = dataset[:,1]

In [5]:
# g) Clean up the predictors: remove fields that are not useful for prediction, such as timestamp and source.

for index, item in enumerate(X):
    # Parse the JSON log entry (preserving key order) so the unwanted fields can be dropped
    reqJson = json.loads(item, object_pairs_hook=OrderedDict)
    del reqJson['timestamp']
    del reqJson['headers']
    del reqJson['source']
    del reqJson['route']
    del reqJson['responsePayload']
    X[index] = json.dumps(reqJson, separators=(',', ':'))

In [6]:
# h) Tokenize the data, i.e. convert each log entry into a sequence of integer ids.
#    We tokenize every character (thus char_level=True).

tokenizer = Tokenizer(filters='\t\n', char_level=True)
tokenizer.fit_on_texts(X)

# Vocabulary size (+1 because index 0 is reserved for padding); used later as the Embedding input_dim
num_words = len(tokenizer.word_index)+1
X = tokenizer.texts_to_sequences(X)



In [7]:
# i) Pad (or truncate) every sequence to the same length, since each observation has a different length

from tensorflow.keras.preprocessing import sequence

max_log_length = 1024
X_processed = sequence.pad_sequences(X, maxlen=max_log_length)
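
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings, which pad with zeros at the front and truncate from the front. The helper `pad_pre` is hypothetical and only mimics that behaviour for a single sequence; it is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: pad/truncate one sequence the way the defaults do."""
    seq = list(seq)[-maxlen:]                      # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq     # fill the front with the pad value

short = pad_pre([5, 3, 8], maxlen=6)
long = pad_pre([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)

print(short)  # [0, 0, 0, 5, 3, 8]
print(long)   # [3, 4, 5, 6, 7, 8]
```

With `max_log_length = 1024`, shorter log entries are therefore left-padded with zeros, and only the last 1024 characters of longer entries are kept.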


In [8]:
# j) Create your train set to be 75% of the data and your test set to be 25%

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_processed, Y, test_size=0.25, random_state=42)


### Model 1 - RNN
The first model is a minimal RNN with only an embedding layer, a simple RNN layer, and a dense layer. The next model adds a few more layers.

In [9]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from keras.layers import SimpleRNN, Dense


model = keras.Sequential()

model.add(tf.keras.layers.Embedding(input_dim = num_words, output_dim = 32, input_length = max_log_length))

model.add(SimpleRNN(units=32, activation='relu'))

model.add(Dense(units=1, activation='sigmoid'))
          
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [10]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1024, 32)          2016      
                                                                 
 simple_rnn (SimpleRNN)      (None, 32)                2080      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 4,129
Trainable params: 4,129
Non-trainable params: 0
_________________________________________________________________
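
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, simple RNN, LSTM, and dense layers; the helper names are hypothetical, and the vocabulary size of 63 is inferred from the summary (2016 / 32) rather than printed in the notebook.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per token id
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + bias
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    # Four weight sets: input gate, forget gate, output gate, candidate
    return 4 * simple_rnn_params(input_dim, units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size = 2016 // 32  # inferred from the Embedding row of the summary

model1_total = (
    embedding_params(vocab_size, 32)
    + simple_rnn_params(32, 32)
    + dense_params(32, 1)
)
print(model1_total)         # 4129
print(lstm_params(32, 64))  # 24832, the LSTM fed by the 32-dim embedding
print(lstm_params(64, 64))  # 33024, an LSTM fed by a 64-unit LSTM
```

The last two values match the LSTM rows in the summaries of Model 2 and Model 3.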


In [11]:
# Y was sliced from an object-dtype array, so cast both arrays to a numeric dtype before fitting
X_train = np.asarray(X_train).astype('float32')
Y_train = np.asarray(Y_train).astype('float32')

model.fit(X_train, Y_train, validation_split=0.25, epochs=3, batch_size=128)


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7feac0ee3b50>

In [12]:
X_test = np.asarray(X_test).astype('float32')
Y_test = np.asarray(Y_test).astype('float32')

test_loss, test_acc = model.evaluate(X_test, Y_test, batch_size=128)
print('Test loss:', test_loss)
print('Test accuracy:', test_acc)

Test loss: 0.08795161545276642
Test accuracy: 0.977591872215271


### Model 2 - LSTM + Dropout Layers:

In [13]:
from keras.layers import LSTM, Dropout

model2 = keras.Sequential()

model2.add(tf.keras.layers.Embedding(input_dim = num_words, output_dim = 32, input_length = max_log_length))

model2.add(LSTM(units = 64, recurrent_dropout = 0.5))

model2.add(Dropout(0.5))

model2.add(Dense(units=1, activation='sigmoid'))
          
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])



In [14]:
model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 1024, 32)          2016      
                                                                 
 lstm (LSTM)                 (None, 64)                24832     
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 26,913
Trainable params: 26,913
Non-trainable params: 0
_________________________________________________________________


In [15]:
model2.fit(X_train, Y_train, validation_split=0.25, epochs=3, batch_size=128)


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7feaadcf26d0>

In [16]:
test_loss2, test_acc2 = model2.evaluate(X_test, Y_test, batch_size=128)
print('Test loss:', test_loss2)
print('Test accuracy:', test_acc2)

Test loss: 0.09789314866065979
Test accuracy: 0.9755004644393921


### Model 3: Customized Recurrent Neural Net 

In [21]:
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam, Adamax


In [23]:
model3 = Sequential()

model3.add(Embedding(input_dim=num_words, output_dim=32, input_length=max_log_length))
model3.add(LSTM(units=64, recurrent_dropout=0.5, return_sequences=True))
model3.add(Dropout(0.5))
model3.add(LSTM(units=64, recurrent_dropout=0.5)) # Adding an additional LSTM
model3.add(Dense(units=1, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])


In [24]:
model3.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 1024, 32)          2016      
                                                                 
 lstm_9 (LSTM)               (None, 1024, 64)          24832     
                                                                 
 dropout_5 (Dropout)         (None, 1024, 64)          0         
                                                                 
 lstm_10 (LSTM)              (None, 64)                33024     
                                                                 
 dense_6 (Dense)             (None, 1)                 65        
                                                                 
Total params: 59,937
Trainable params: 59,937
Non-trainable params: 0
_________________________________________________________________


In [25]:
model3.fit(X_train, Y_train, validation_split=0.25, epochs=3, batch_size=128)


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fea895d57c0>

In [26]:
test_loss3, test_acc3 = model3.evaluate(X_test, Y_test, batch_size=128)
print('Test loss:', test_loss3)
print('Test accuracy:', test_acc3)

Test loss: 0.08145732432603836
Test accuracy: 0.9765461683273315


## Discussions

#### 1. Difference between the relu activation function and the sigmoid activation function:

1. ReLU (Rectified Linear Unit) is an activation function that maps any negative number to 0 and any other number to itself. It works well in networks with many layers because its gradient is 1 for positive inputs, which helps mitigate vanishing gradients when training deep networks. This allows the network to learn more complex relationships between the inputs and outputs.

2. Sigmoid maps any real number to a value strictly between 0 and 1. It is useful for binary classification tasks, where the output should be a probability between 0 and 1. When the input is very negative, the output is close to 0; when the input is very positive, the output is close to 1; and an input of 0 gives exactly 0.5. The drawback is saturation: for inputs far from zero the curve is almost flat, so the gradient is close to 0 and learning slows down. This is why sigmoid is usually kept for the output layer rather than the hidden layers.
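
The two functions can be compared numerically with a small sketch. It uses only the standard library; the function names are illustrative and are not taken from the notebook.

```python
import math

def relu(x):
    # Negative inputs become 0, everything else passes through unchanged
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: largest at x = 0, tiny for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  relu={relu(x):5.1f}  "
          f"sigmoid={sigmoid(x):.5f}  sigmoid'={sigmoid_grad(x):.5f}")
```

The sigmoid gradient peaks at 0.25 for an input of 0 and is nearly 0 at ±10, while ReLU keeps a gradient of 1 for every positive input.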


#### 2. What one epoch actually is (epoch was a parameter used in the .fit() method):

The number of epochs is a hyperparameter that defines the number of times the learning algorithm works through the entire training dataset.

One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. During each epoch, the network makes predictions on the training samples one batch at a time, compares them with the actual outputs, and adjusts its weights and biases after each batch to improve its predictions.

Too few epochs may result in an underfit model that has not yet learned the patterns in the data, while too many epochs may result in an overfit model that has memorized the training data and does not generalize well to new data.
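
As a worked example, the number of weight updates per epoch can be estimated from the settings used in this notebook (26773 rows, a 25% test split, `validation_split=0.25`, `batch_size=128`, `epochs=3`). The rounding rules below are assumptions about how scikit-learn and Keras split the data, so treat the exact figures as approximate.

```python
import math

n_rows = 26773
test_size = 0.25
validation_split = 0.25
batch_size = 128
epochs = 3

# Assumed rounding: the test set is rounded up, the training set gets the rest
n_test = math.ceil(n_rows * test_size)
n_train_full = n_rows - n_test

# Assumed rounding: Keras holds out the last fraction of the rows for validation
n_fit = math.floor(n_train_full * (1 - validation_split))
n_val = n_train_full - n_fit

steps_per_epoch = math.ceil(n_fit / batch_size)  # one weight update per batch
total_updates = steps_per_epoch * epochs

print(n_test, n_train_full, n_fit, n_val)  # 6694 20079 15059 5020
print(steps_per_epoch, total_updates)      # 118 354
```

Under these assumptions, one epoch corresponds to 118 batch updates, and the three epochs used here give 354 updates in total.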


#### 3. How dropout works:

Dropout is a regularization technique for neural network models that prevents overfitting by ignoring randomly selected neurons during training. They are "dropped out" randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to the neuron on the backward pass.

In training, each neuron in the layer that has dropout applied has a probability p of being "dropped out," i.e. its output is set to zero. The value of p is a hyperparameter that can be tuned, but is typically set to 0.5. Importantly, the dropout is applied independently to each example in the training batch, and each time the model is trained on a new batch, a new set of neurons is randomly dropped out.

In testing, the full network is used and no neurons are dropped. To keep the expected value of each output the same at training and test time, the activations have to be rescaled. In the original formulation this happens at test time: the outputs of the dropout layer are multiplied by the keep probability 1 - p, so with a dropout rate of 0.5 they are multiplied by 0.5. Keras instead uses the "inverted dropout" technique: the units that survive during training are scaled up by 1 / (1 - p), and nothing needs to change at test time.
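
A minimal sketch of inverted dropout, the variant in which the rescaling is done during training so that inference needs no adjustment. The function name and the fixed seed are illustrative assumptions; this is not the Keras implementation.

```python
import random

def inverted_dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not training:
        return list(activations)  # inference: the layer is an identity
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
acts = [1.0] * 10000

train_out = inverted_dropout(acts, rate=0.5, training=True, rng=rng)
test_out = inverted_dropout(acts, rate=0.5, training=False, rng=rng)

mean_train = sum(train_out) / len(train_out)
print(sorted(set(train_out)))   # surviving units are scaled from 1.0 to 2.0
print(round(mean_train, 2))     # close to 1.0, the mean without dropout
```

Because the surviving activations are doubled when `rate=0.5`, the average output during training stays close to the average at test time.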

#### 4. Why problems such as this are better modeled with RNNs than CNNs. What type of problem will CNNs outperform RNNs on?

CNNs are commonly used in solving problems related to spatial data, such as images. RNNs are better suited to analyzing temporal, sequential data, such as text or videos. RNNs are able to capture the temporal dependencies and long-term patterns in sequential data by maintaining a memory of the previous inputs in a sequence. A CNN has a different architecture from an RNN. CNNs are "feed-forward neural networks" that use filters and pooling layers, whereas RNNs feed results back into the network.

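In this problem, an RNN works better because it involves text data: each log entry is processed as a sequence of characters, and the order of those characters carries the signal.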

#### 5. Why this RNN problem can be solved using an LSTM:

LSTM networks mitigate the vanishing-gradient problem of plain RNNs. Each LSTM cell contains three gates: the input gate, the forget gate, and the output gate. The cell state is updated additively: the forget gate scales the previous cell state, and the input gate adds new candidate content to it. As a result, the gradient flowing back through the cell state is scaled at each step by that step's forget-gate activation, rather than being multiplied repeatedly by the same recurrent weights and activation derivatives. The network can learn to keep the forget gate close to 1 where information must be retained, so the product of the per-step gradient factors over T time steps is much less likely to converge to zero, and the gradients do not vanish.
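
The effect can be illustrated with a scalar toy example. The sketch below compares the per-step gradient factor of a plain tanh RNN with the direct cell-state path of an LSTM over 50 steps. All weights are hypothetical values chosen for illustration, and the LSTM figure ignores the indirect gradient paths through the gates, so this is a simplification rather than a full backpropagation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.1, forget_bias=4.0):
    """One step of a scalar LSTM cell with hypothetical shared weight w."""
    f = sigmoid(w * x + w * h_prev + forget_bias)  # forget gate
    i = sigmoid(w * x + w * h_prev)                # input gate
    o = sigmoid(w * x + w * h_prev)                # output gate
    g = math.tanh(w * x + w * h_prev)              # candidate content
    c = f * c_prev + i * g                         # additive cell-state update
    h = o * math.tanh(c)
    return h, c, f

T = 50
x = 1.0

# Plain RNN: h_t = tanh(w_rec * h_{t-1} + x), so dh_t/dh_{t-1} = w_rec * (1 - h_t^2)
w_rec = 0.5
h, rnn_path = 0.0, 1.0
for _ in range(T):
    h = math.tanh(w_rec * h + x)
    rnn_path *= w_rec * (1.0 - h * h)

# LSTM: along the direct cell-state path, dc_t/dc_{t-1} = f_t
h, c, lstm_path = 0.0, 0.0, 1.0
for _ in range(T):
    h, c, f = lstm_step(x, h, c)
    lstm_path *= f

print(f"plain RNN factor after {T} steps: {rnn_path:.3e}")
print(f"LSTM cell-path factor after {T} steps: {lstm_path:.3f}")
```

With the forget gate biased towards 1, the LSTM factor stays well away from zero after 50 steps, while the plain RNN factor collapses to a vanishingly small number.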