# Recurrent Neural Network & Classification

In this assignment, the objective is to detect security breaches by predicting suspicious access with an RNN model trained on the provided logfile data.

The logfile data includes request information such as LogID, Timestamp, Method, Path, Status Code, Source, Remote Address, and User Agent. The last value in each row is the target variable: 1 for a breach and 0 for no breach.

The data will be encoded with the Keras Tokenizer class, which converts strings to integers. The neural network is then trained on these numeric inputs instead of raw text.

Three RNN models will be built:
  1. RNN - Minimal network with Embedding layer, LSTM layer, and Dense layer
  2. RNN + Dropout Layers & Activation Function
  3. RNN w. 5 layers

## 1. Data Pre-Processing

The data used here is a logfile of HTTP requests. One way of encoding this information would be to extract each feature and build a dataframe with one column per feature. However, not every log entry has every feature, so we would end up with a very sparse matrix. Instead, we encode each logfile row as a sequence of integers corresponding to the characters found in the row. This allows us to train the RNN on numeric data and lets it detect patterns directly from the HTTP requests rather than from hand-extracted features.

In [1]:
import sys
import os
import json
import pandas as pd
import numpy as np
import optparse

from keras.callbacks import TensorBoard
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from collections import OrderedDict

from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [2]:
df = pd.read_csv("data/dev-access.csv", engine='python', quotechar='|', header=None).values

X = df[:,0]
y = df[:,1]

A sample row contains many fields from the HTTP request and its response:

In [3]:
df[0]

array(['{"timestamp":1502738402847,"method":"post","query":{},"path":"/login","statusCode":401,"source":{"remoteAddress":"88.141.113.237","referer":"http://localhost:8002/enter"},"route":"/login","headers":{"host":"localhost:8002","accept-language":"en-us","accept-encoding":"gzip, deflate","connection":"keep-alive","accept":"*/*","referer":"http://localhost:8002/enter","cache-control":"no-cache","x-requested-with":"XMLHttpRequest","content-type":"application/json","content-length":"36"},"requestPayload":{"username":"Carl2","password":"bo"},"responsePayload":{"statusCode":401,"error":"Unauthorized","message":"Invalid Login"}}',
       0], dtype=object)

Many of these fields are unnecessary. We are interested only in the features that help distinguish a fraudulent request.

In [4]:
for index, item in enumerate(X):
    # Parse each row (preserving key order) and drop the fields we don't need
    reqJson = json.loads(item, object_pairs_hook=OrderedDict)
    del reqJson['timestamp']
    del reqJson['headers']
    del reqJson['source']
    del reqJson['route']
    del reqJson['responsePayload']
    X[index] = json.dumps(reqJson, separators=(',', ':'))

In [5]:
X[0]

'{"method":"post","query":{},"path":"/login","statusCode":401,"requestPayload":{"username":"Carl2","password":"bo"}}'

The Tokenizer class will split each row into its individual characters and assign each character an integer index. The model can then be trained on the numeric vectors instead of characters.

In [6]:
tokenizer = Tokenizer(filters='\t\n', char_level=True)
tokenizer.fit_on_texts(X)

# vocabulary size (+1 for the padding index 0), needed later by the Embedding layer
num_words = len(tokenizer.word_index)+1
X = tokenizer.texts_to_sequences(X)

In [7]:
dict(list(tokenizer.word_index.items())[0:10])

{'"': 1,
 ',': 10,
 ':': 4,
 'a': 5,
 'e': 2,
 'o': 7,
 'r': 9,
 's': 6,
 't': 3,
 'u': 8}

What does a data row look like now?

In [8]:
X[0][0:10]

[18, 1, 20, 2, 3, 14, 7, 11, 1, 4]

18 means '{' and 1 means '"', as defined by the word index. So this is just the first row, now expressed as numbers.
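The character-to-index mapping can be illustrated without Keras. The following is a minimal sketch, assuming lowercase folding and frequency-ranked indices starting at 1 (index 0 is left free for padding). The helper names `build_char_index` and `texts_to_sequences` and the two sample rows are hypothetical, and ties between equally frequent characters may be broken differently than in the real Tokenizer.

```python
from collections import Counter

def build_char_index(texts, filters="\t\n"):
    """Map each character to an integer, most frequent character first."""
    counts = Counter(
        ch for text in texts for ch in text.lower() if ch not in filters
    )
    # Sort by descending frequency, then by character to make ties deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Indices start at 1 so that 0 stays available as the padding value.
    return {ch: rank for rank, (ch, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, char_index):
    """Replace every known character with its integer index."""
    return [
        [char_index[ch] for ch in text.lower() if ch in char_index]
        for text in texts
    ]

# Hypothetical rows, much shorter than the real log entries.
sample_rows = ['{"method":"post"}', '{"method":"get"}']
char_index = build_char_index(sample_rows)
encoded = texts_to_sequences(sample_rows, char_index)

# The most frequent character (the double quote) receives index 1.
print(char_index['"'])
print(encoded[0])
```

As in the real output above, the double quote ends up with index 1 because it is the most common character in JSON-formatted rows.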

In [9]:
# Need to pad our data as each observation has a different length
max_log_length = 1024
X_processed = sequence.pad_sequences(X, maxlen=max_log_length)

In [10]:
np.unique(X_processed[0], return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 32, 40],
       dtype=int32),
 array([909,  22,   7,   6,   7,   7,   7,   7,   4,   5,   5,   4,   3,
          4,   2,   2,   2,   2,   3,   3,   2,   2,   1,   1,   1,   1,
          1,   1,   1,   1,   1]))

The first row is now mostly zeros from padding: 909 of its 1024 entries are 0.
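The zeros sit at the front of the row because `pad_sequences` pads (and truncates) at the start of each sequence by default. A minimal pure-Python sketch of that behavior, assuming a positive `maxlen` and using a hypothetical helper name `pad_left`:

```python
def pad_left(seq, maxlen, value=0):
    """Left-pad or left-truncate a sequence to exactly maxlen items."""
    trimmed = seq[-maxlen:]  # keep only the last maxlen items
    return [value] * (maxlen - len(trimmed)) + trimmed

row = [18, 1, 20, 2, 3]      # the first few indices of X[0]
padded = pad_left(row, maxlen=8)
print(padded)                # [0, 0, 0, 18, 1, 20, 2, 3]

# A row longer than maxlen loses its earliest items instead.
print(pad_left([1, 2, 3, 4], maxlen=2))  # [3, 4]
```

For the first row this accounts for the counts above: 1024 - 909 = 115 non-zero entries, one per character of the cleaned request.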

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.25, random_state=123)

## Model 1: Base RNN

The first model is a minimal RNN with only an embedding layer, an LSTM layer, and a Dense layer. The next model adds a few more layers.

In [12]:
model1 = Sequential() 

model1.add(Embedding(input_dim = num_words, output_dim = 32, input_length = max_log_length))
model1.add(LSTM(units=64, recurrent_dropout=0.5))
model1.add(Dense(units=1, activation='relu'))

model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model1.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 26,913
Trainable params: 26,913
Non-trainable params: 0
_________________________________________________________________
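The parameter counts in the summary can be reproduced by hand. The sketch below assumes a vocabulary size of 63, which is inferred from the summary (2016 embedding parameters / 32 dimensions) rather than printed anywhere in the notebook.

```python
embedding_dim = 32
lstm_units = 64
num_words = 63  # assumption: inferred from 2016 / 32 in the summary

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = num_words * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)     # 2016 24832 65
print(embedding_params + lstm_params + dense_params)   # 26913
```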


In [13]:
model1.fit(X_train, y_train, batch_size=128, epochs=3, validation_split=0.25, verbose=1)

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x114bfceb8>

In [14]:
score1 = model1.evaluate(X_test, y_test, batch_size=128)
print('Test loss:', score1[0])
print('Test accuracy:', score1[1])

Test loss: 0.5695606596013272
Test accuracy: 0.6057663578078182


## Model 2: Enhanced RNN

Base RNN + Dropout Layer & New Activation Function

Now we will add dropout layers to our RNN and switch the output activation function from ReLU to sigmoid. This is a new model, so it gets a different name (model2) than the base model.

In [15]:
model2 = Sequential() 

model2.add(Embedding(input_dim = num_words, output_dim = 32, input_length = max_log_length))
model2.add(Dropout(rate=0.5))
model2.add(LSTM(units=64, recurrent_dropout=0.5))
model2.add(Dropout(rate=0.5))
model2.add(Dense(units=1, activation='sigmoid'))

model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024, 32)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
Total params: 26,913
Trainable params: 26,913
Non-trainable params: 0
_________________________________________________________________


In [16]:
model2.fit(X_train, y_train, batch_size=128, epochs=3, validation_split=0.25, verbose=1)

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x12e5ddfd0>

In [17]:
score2 = model2.evaluate(X_test, y_test, batch_size=128)
print('Test loss:', score2[0])
print('Test accuracy:', score2[1])

Test loss: 0.18731713233098793
Test accuracy: 0.948760083550156


## Model 3: BYO RNN

In [18]:
model3 = Sequential() 

model3.add(Embedding(input_dim = num_words, output_dim = 32, input_length = max_log_length))
model3.add(LSTM(units=64, recurrent_dropout=0.5))
model3.add(Dense(units=1, activation='tanh'))
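# NOTE: the next line adds the Dropout layer to model2, not model3, so model3
# ends up without dropout (the summary below lists only four layers).
model2.add(Dropout(rate=0.5))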
model3.add(Dense(units=1, activation='softmax'))

model3.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
model3.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
lstm_3 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 2         
Total params: 26,915
Trainable params: 26,915
Non-trainable params: 0
_________________________________________________________________


In [19]:
model3.fit(X_train, y_train, batch_size=128, epochs=3, validation_split=0.25, verbose=1)

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x114bfcfd0>

In [20]:
score3 = model3.evaluate(X_test, y_test, batch_size=128)
print('Test loss:', score3[0])
print('Test accuracy:', score3[1])

Test loss: 8.059311297319525
Test accuracy: 0.494472661960791
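One likely contributor to the near-chance accuracy of Model 3 is its output layer. Softmax normalizes across the units of a layer, so with a single unit the output is always 1.0 regardless of the input, and the model predicts "breach" for every row. A small sketch with a hand-written `softmax` (not the Keras implementation) shows the effect:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# With a single unit, the result does not depend on the logit at all.
for logit in (-5.0, 0.0, 3.2):
    print(softmax([logit]))  # [1.0]

# With two units the outputs vary with the logits and sum to 1.
print(softmax([0.0, 1.0]))
```

A sigmoid on the single output unit, as in Model 2, is the usual choice for a binary target.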


## Conceptual Questions: 

### 5) Explain the difference between the relu activation function and the sigmoid activation function.

The ReLU activation function outputs z when z is positive and 0 otherwise: max(0, z). The sigmoid activation function is the logistic sigmoid of z: 1/(1+exp(-z)), which squashes any input into the range (0, 1). ReLU has the advantage that it does not saturate for positive inputs, i.e. it never becomes flat there, so gradient descent works quite well. However, for negative inputs the gradient is exactly zero, so a unit whose inputs stay negative stops learning; careful weight initialization helps to avoid this. The sigmoid function saturates at both very large and very small values of z, so its gradient is only useful when z is close to zero. Gradient descent can therefore stall, and nowadays ReLU is generally preferred to sigmoid for hidden layers.
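The saturation behavior is easy to see numerically. This is a small illustrative sketch in plain Python with hand-written functions (not the Keras implementations):

```python
import math

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, relu(z), relu_grad(z), round(sigmoid(z), 5), round(sigmoid_grad(z), 5))
```

The sigmoid gradient peaks at 0.25 for z = 0 and shrinks towards zero on both sides, while the ReLU gradient stays at 1 for every positive z and is 0 for every negative z.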

### 6) In regards to question 5, which of these activation functions performed the best (they were used in Model 1 & Model 2) ? Why do you think that is?

Model 2 outperformed Model 1 by a wide margin (about 95% vs. 61% test accuracy). My first instinct was to attribute this to the dropout layers rather than to the sigmoid function. However, dropout is a regularizer and shouldn't change the accuracy on the test set this much, so the activation function must be the main cause. We learned in the textbooks that ReLU is preferred for modern neural network problems because it doesn't saturate at large values of z, but that advice applies to hidden layers. Here the activation sits on the single output unit: sigmoid maps the output into (0, 1), which is what binary cross-entropy expects as a probability, whereas ReLU produces unbounded values and outputs exactly 0 (with zero gradient) whenever its input is negative. That makes ReLU a poor fit for the output of a binary classifier, which likely explains most of the gap.

### 7) Explain how dropout works (you can look at the keras code) for (a) training, and (b) test data sets.

Dropout is a regularization technique that helps prevent overfitting by randomly dropping nodes at each training iteration. The hyperparameter p controls the dropout rate. In this assignment we set it to 50% across the board, so each node has a 50/50 chance of being included in any given training iteration. The network is forced to adapt to missing nodes, which helps prevent it from memorizing the training data.

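At test time no nodes are dropped. In the original formulation, each connection is weighted by (1-p) at test time to account for the fact that the network was trained with fewer active connections than there now are. Keras implements the equivalent "inverted" form: it scales the kept activations by 1/(1-p) during training and leaves the layer as a no-op at test time.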
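The training and test behavior can be sketched in a few lines. This is a simplified illustration of inverted dropout on a plain list, assuming a hypothetical `dropout` helper and a seeded random generator; it is not the Keras layer itself.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: scale kept values during training, no-op otherwise."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Each value is kept with probability `keep` and scaled by 1 / keep,
    # so the expected value of every entry is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # seeded only to make the sketch repeatable
activations = [1.0] * 8

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(train_out)  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
print(test_out)   # identical to the input
```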

### 8) Explain why problems such as this are better modeled with RNNs than CNNs.

The unique thing about RNNs is that the output of a recurrent layer at one time step is fed back as an input to the same layer at the next time step. For that reason they are the preferred method for time series or other sequences where the previous values in the sequence matter for the prediction. In this task, it's not just the metadata that matters but the order in which it appears in the sequence, which helps to identify hacking attempts. For this reason the RNN is better suited for detection.
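The feedback loop can be shown with a single scalar recurrent unit. This is a toy sketch with hypothetical fixed weights (`w_x`, `w_h`, `b`), not a trained model; it only demonstrates that the same inputs in a different order lead to a different final state.

```python
import math

def rnn_final_state(inputs, w_x=0.5, w_h=0.9, b=0.0):
    """Run a one-unit RNN over a sequence and return the last hidden state."""
    h = 0.0
    for x in inputs:
        # The previous hidden state h is fed back in at every step.
        h = math.tanh(w_x * x + w_h * h + b)
    return h

early = rnn_final_state([1.0, 0.0, 0.0])  # signal at the start
late = rnn_final_state([0.0, 0.0, 1.0])   # same signal at the end

print(early, late)  # the two final states differ
```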

### 9) Explain what RNN problem is solved using LSTM and briefly describe how.

Training RNNs on long sequences is difficult: the network has to be unrolled over hundreds of time steps, which is slow, and the gradients tend to vanish, so long-term patterns are hard to learn. The most common workaround is to simply cap the input sequence, either by looking only at recent data for time series or at a fixed number of inputs. The problem is that information is then lost. We need a way to carry both recent and long-term information at each step. The Long Short-Term Memory (LSTM) cell is one technique to solve this problem.

An LSTM cell carries two state vectors, one long-term (the cell state) and one short-term (the hidden state). Through its gates, the cell can learn to recognize an important input, store it in the long-term state, keep using it over many time steps, and drop it once it is no longer useful. This makes the LSTM cell well suited to RNNs trained on long sequences.
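A scalar version of the cell makes the gating concrete. The sketch below uses hand-picked, hypothetical parameters (forget gate almost fully open, input gate almost closed) to show how a stored value survives 100 time steps; a real LSTM learns these parameters and works on vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM; p maps gate name to (w_input, w_recurrent, bias)."""
    def pre(gate):
        w, u, b = p[gate]
        return w * x + u * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value for the cell state
    c = f * c_prev + i * g            # long-term state
    h = o * math.tanh(c)              # short-term state
    return h, c

# Hypothetical parameters: remember almost everything, write almost nothing.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with a value already stored in the cell state
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)

print(c)  # still close to the stored value of 1.0
```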
