# Classification of HTTP request strings

## Task
Given the [CSIC 2010 dataset](http://www.isi.csic.es/dataset/) containing HTTP requests labelled as 'normal' and 'anomalous', build a classifier able to distinguish between normal and anomalous (potentially malicious) requests.

## Approach
A brief literature survey reveals many approaches that have been used for this task. Our preferred approach is based on the recent work of [Althubiti et al.](https://digitalcommons.kennesaw.edu/ccerp/2017/practice/2?utm_source=digitalcommons.kennesaw.edu%2Fccerp%2F2017%2Fpractice%2F2&utm_medium=PDF&utm_campaign=PDFCoverPages) (2017), which shows that a simple logistic regression built on five features extracted from each HTTP request should give excellent results. Those five features are:
1. Length of the request
2. Length of the arguments
3. Number of arguments
4. Length of the path
5. Number of 'special' chars in the path

## Parsing of the input data
Combine the two data sets labelled as 'normal' into the file 'input_data/normalTrafficAll.txt'

In [None]:
# Libraries needed for the parsing step
import re
import numpy as np
from urlparse import urlparse, parse_qs

In [None]:
files = ['input_data/normalTrafficTest.txt','input_data/normalTrafficTraining.txt']
with open('input_data/normalTrafficAll.txt', 'w') as outfile:
    for input_file in files:
        with open(input_file) as infile:
            for line in infile:
                outfile.write(line)

Extracting the individual HTTP requests can be done by splitting the contents of the input file on the strings 'GET ', 'POST ' and 'PUT ', which mark the beginning of each request. We have implemented a class called `dataset` which can parse any file from the CSIC 2010 data set:

In [None]:
class dataset:

        # Read-in the dataset and parse it into individual HTTP requests.
        def __init__(self, path_to_file):

                self.HTTP_requests = []
                self.n_requests = 0
                self.path_to_file = path_to_file

                with open(path_to_file, 'r') as input_file: data = input_file.read()

                # Split the raw data into individual methods (GET, POST, PUT) request strings
                methods_requests = re.split('(GET |POST |PUT )', data)
                methods_requests.pop(0)

                only_methods = methods_requests[::2]
                only_requests = methods_requests[1:][::2]

                self.HTTP_requests =  [method + request for method,request in zip(only_methods,only_requests)]

                self.n_requests = len(self.HTTP_requests)
                print "\nFound ", self.n_requests, " HTTP requests in file: ", path_to_file

        # Generate the HTTP features for all HTTP requests and label them.
        def extract_labelled_HTTP_features(self,label):

                X = []
                y = []

                for HTTP_request in self.HTTP_requests:
                        X.append(extract_features(HTTP_request))
                        y.append(label)

                print "\nData from",self.path_to_file,"have been exported"
                print "Label assigned:", label

                return X, y
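
The splitting step in `__init__` relies on a detail of `re.split`: when the pattern contains a capturing group, the matched delimiters are kept in the result, so methods and request remainders alternate in the returned list. Below is a minimal sketch of that behaviour, written for Python 3 (the notebook cells use Python 2). The sample text and the helper name `split_requests` are illustrative assumptions, not part of the data set or of the class above.

```python
import re

def split_requests(raw):
    """Split raw text into request strings, each starting with its method."""
    # The capturing group keeps the delimiters in the result:
    # ['', 'GET ', rest, 'POST ', rest, ...]
    parts = re.split(r'(GET |POST |PUT )', raw)
    parts = parts[1:]  # drop whatever precedes the first method
    methods, remainders = parts[0::2], parts[1::2]
    return [m + r for m, r in zip(methods, remainders)]

# Made-up two-request input, for illustration only.
sample = (
    "GET http://localhost:8080/index.jsp HTTP/1.1\nHost: localhost:8080\n\n"
    "POST http://localhost:8080/login.jsp HTTP/1.1\nContent-Length: 9\n\nuser=anna\n"
)
parsed = split_requests(sample)
print(len(parsed))
```

One caveat of this approach is that a header or body containing the literal text `GET `, `POST ` or `PUT ` would also trigger a split. Anchoring the pattern to the start of a line would be one possible guard.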


Parsing of the input file is performed by the constructor `__init__`. The method `extract_labelled_HTTP_features` can be applied to extract the features for all HTTP requests in the data set. This method uses the function `extract_features` which extracts from each HTTP request the five features listed above. This function looks as follows:

In [None]:
# Extract the five features from a single HTTP_request as described in Althubiti et al paper.
# These are formed from the URI only: the rest of the HTTP request is not useful.
def extract_features(HTTP_request, debug = False):
        features = np.zeros(5)

        # Extract the method and the URI
        first_line = re.match('(.*) (http.*) HTTP',HTTP_request)
        method = first_line.group(1)
        uri_string = first_line.group(2)

        # Strip line breaks so the request body directly follows the headers
        stripped_request = re.sub(r'User-Agent:.*Connection: close|[\n\r]','',HTTP_request)

        if (method != 'GET'):
                post_query = re.search('Content-Length:\s*\d+(.*)',stripped_request)
                arguments = re.sub('\n|\r','',post_query.group(1))
                uri_string = uri_string + '?' + arguments

        # Parse the URI into path and arguments
        uri = urlparse(uri_string)

        path = uri.path
        arguments = parse_qs(uri.query)

        #1. Length of the request
        features[0] = len(uri_string)

        #2. Length of the arguments
        features[1] = 0
        for parameter, value in arguments.iteritems():
                features[1] += len(parameter+(''.join(value)))

        #3. Number of arguments
        features[2] = len(arguments)

        #4. Length of the path
        features[3] = len(path)

        #5. Number of special chars in the path
        features[4] = len(re.findall('\W',path))

        if (debug):
                print "\n" + method
                print path
                print arguments
                print features

        return features

The function `extract_features` only needs the part of the request containing the URL and the parameters. The rest of the HTTP request is discarded because it does not describe the user's behavior. We build the full URI consisting of the URL and the parameters and parse it using the `urlparse` library. Finally, we construct the five features listed by Althubiti et al. Optionally, the results of the feature extraction can be printed to standard output using the `debug` parameter.
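
As a worked example of the five features, the following sketch computes them for a single URI. It is written for Python 3 (`urllib.parse` instead of the Python 2 `urlparse` module). The function name `five_features` and the example URI are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse, parse_qs

def five_features(uri_string):
    """Return the five URI-based features as a list of ints."""
    uri = urlparse(uri_string)
    arguments = parse_qs(uri.query)  # dict: name -> list of values
    args_length = sum(len(name) + len(''.join(values))
                      for name, values in arguments.items())
    return [
        len(uri_string),                    # 1. length of the request (full URI)
        args_length,                        # 2. length of the arguments
        len(arguments),                     # 3. number of arguments
        len(uri.path),                      # 4. length of the path
        len(re.findall(r'\W', uri.path)),   # 5. non-word characters in the path
    ]

# Made-up URI in the style of the data set.
example = 'http://localhost:8080/tienda1/publico/anadir.jsp?id=2&nombre=Jam'
print(five_features(example))
```

Note that `parse_qs` discards parameters with empty values unless `keep_blank_values=True` is passed, and merges repeated parameter names under one key. Both behaviours affect features 2 and 3.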

Now we use the class `dataset` to parse both the normal and the anomalous data sets. We have labelled the normal HTTP requests as 0 and the anomalous requests as 1.

In [None]:
normal_requests = dataset('input_data/normalTrafficAll.txt')
X_normal, y_normal = normal_requests.extract_labelled_HTTP_features(0)

anomalous_requests = dataset('input_data/anomalousTrafficTest.txt')
X_anomalous, y_anomalous = anomalous_requests.extract_labelled_HTTP_features(1)

X = np.concatenate((X_normal,X_anomalous),axis=0)
y = np.concatenate((y_normal,y_anomalous),axis=0)

print X
print y

Let's illustrate the feature extraction which just took place on a single HTTP request:

In [None]:
print normal_requests.HTTP_requests[1]

In [None]:
extract_features(normal_requests.HTTP_requests[1],True)

The printout shows all the values which have been used to construct the features. Note that parsing with the `urlparse` library also takes care of capturing the Spanish Unicode characters.

## Fitting of the logistic regression model

In [None]:
# We use the `sklearn` library to perform the fitting:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

Split the whole data set into 60% for fitting and 40% for testing:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=0)

### 1. Linear model using all training data

In [None]:
model1 = LogisticRegression(C=0.01)
model1.fit(X_train, y_train)

In [None]:
y_pred = model1.predict(X_test)
print(classification_report(y_test, y_pred))

The low value of recall for the anomalous class shows that this model is rather bad at finding all anomalous requests, i.e. the rate of false negatives is very high. On the other hand, the model is good at identifying normal requests.
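
For reference, the precision and recall reported by `classification_report` reduce to simple ratios of confusion-matrix counts. The sketch below shows this on made-up labels. The helper `precision_recall` and the toy data are assumptions for illustration, not results from this notebook.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: 4 anomalous (1) and 4 normal (0) requests.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

A low recall for class 1 therefore means many anomalous requests are predicted as normal (false negatives), independently of how well normal requests are recognised.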

### 2. Non-linear model using all training data

Model 1 separates the two classes using a simple hyperplane. Let's see if we can improve the model by using a more complicated non-linear boundary which uses all unique quadratic terms:

In [None]:
Xquad_train = X_train
Xquad_test = X_test
col = np.zeros(len(y))
for col_i in range(0,len(X_train[0,:])):
        for col_j in range(0,col_i+1):
                col = X_train[:,col_i] * X_train[:,col_j]
                Xquad_train = np.column_stack((Xquad_train,col))

                col_test = X_test[:,col_i] * X_test[:,col_j]
                Xquad_test = np.column_stack((Xquad_test,col_test))

In [None]:
model2 = LogisticRegression(C=0.01)
model2.fit(Xquad_train, y_train)

y_pred = model2.predict(Xquad_test)
print(classification_report(y_test, y_pred))

Indeed, increasing the flexibility of the decision boundary improves all metrics in the classification report.
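
The quadratic expansion used for Model 2 appends every unique product of two features to the original feature vector. A compact way to express the same idea for a single row is sketched below in Python 3. The helper name `add_quadratic_terms` is an assumption, and the column order differs from the nested loop above, although the set of terms is the same.

```python
from itertools import combinations_with_replacement

def add_quadratic_terms(row):
    """Append all unique products x_i * x_j (i <= j) to a feature row."""
    products = [a * b for a, b in combinations_with_replacement(row, 2)]
    return list(row) + products

row = [1.0, 2.0, 3.0]
expanded = add_quadratic_terms(row)
print(expanded)

# Five original features give 5 linear + 15 quadratic = 20 columns.
print(len(add_quadratic_terms([0.0] * 5)))
```

In scikit-learn, `PolynomialFeatures(degree=2, include_bias=False)` should produce the same set of columns and may be more convenient for whole matrices.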

### 3. Linear model using unique training data

The parsed data set contains pairs of equivalent GET and POST calls which are not distinguished by the features of the model. We can therefore train another model from data not containing the duplicates.

In [None]:
Xy = np.column_stack((X_train,y_train))
Xy = np.asarray(np.unique(Xy, axis=0))

n_samples = len(Xy)
n_features = len(Xy[0])-1

print "Number of unique samples = ", n_samples
print "Number of features = ", n_features

y_train = Xy[:,n_features]
X_train = np.delete(Xy,n_features,1)

In [None]:
model3 = LogisticRegression(C=0.01)
model3.fit(X_train, y_train)

y_pred = model3.predict(X_test)
print(classification_report(y_test, y_pred))

The performance of this model is poor for both classes. Apparently, the redundant data points put more weight on the important regions of the feature space and result in a more accurate decision boundary. We could also construct a model including all unique quadratic features (this is done in the script `main.py`), but it does not improve the results.

## Discussion
The performance of our models is significantly worse than the logistic regression models of Althubiti et al, see Table 3 ibid. The regularization parameter `C` in `LogisticRegression(C=0.01)` has only a minor effect on the final model. Possible causes of the underperformance of our model are:
1. The regression solver is stuck in a local minimum: this seems unlikely since the L2-regularized logistic loss is convex and all the available solvers return very similar results. Scaling of the training data also doesn't improve the fit.
2. Althubiti et al perform some additional cleaning of the training data which is not described in their paper.
3. The use of a high-order decision boundary in the Althubiti paper.
4. A mistake in our function `extract_features`: we have tried including also the features from Table 1 of Althubiti et al to see if the performance improves but the results were almost identical.

## Conclusions
Our logistic regression models perform poorly in comparison with the one reported by Althubiti et al. The reason for that remains unknown. Our best model is the non-linear Model 2 including all data points, which has a recall of approx. 0.5 for the anomalous class. This means that it would correctly spot only about half of all anomalous requests. As it stands, the model is clearly deficient and should be fixed before it is applied in practice.

## Practical application
Once the reason for the underperformance of our classifier relative to that of Althubiti et al is found, the classifier can be used in practice. Nevertheless, blindly applying this classifier as a hard rule may not be desirable, depending on the purpose of the web application, because of the incidence of false positives. For example, on an e-shop server a false positive classification could mean that a particular expensive product with a long name would never be sold, which is clearly unacceptable. This behavior arises because the classifier is purely statistical and therefore blind to the actual intent of the request.

In order to reduce the incidence of false positives the HTTP request classification could be split into two stages:
1. Pre-screening based on the statistical classifier. Requests labelled as normal are allowed.
2. Further screening of the requests labelled as anomalous. Here we would apply a different classifier which would try to infer the intent of the request as belonging to a set of allowed operations, e.g. putting a certain item in the shopping cart. This would require a non-statistical algorithm analyzing the type of parameters supplied in the request. With the intent inferred, the web server could create its own HTTP request and compare it with the one supplied by the user. If they match, the request would be allowed; otherwise it would be rejected. This approach ensures that the request supplied by the user is always compared to a request which is free of any malicious code.
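
The two stages above can be sketched as a small control flow in Python 3. Everything here is hypothetical: `screen_request`, `flag_long` and `rebuild_cart_request` are placeholder names, the length threshold merely stands in for the statistical classifier, and the single "add to cart" pattern stands in for a real intent-inference step.

```python
import re

def screen_request(request, is_anomalous, rebuild_from_intent):
    """Two-stage screening; returns 'allow' or 'reject'."""
    # Stage 1: statistical pre-screening. Requests labelled normal pass.
    if not is_anomalous(request):
        return 'allow'
    # Stage 2: infer the intent, regenerate the request server-side,
    # and allow only if it matches what the user sent.
    canonical = rebuild_from_intent(request)  # None if no allowed operation fits
    if canonical is not None and canonical == request:
        return 'allow'
    return 'reject'

def flag_long(request):
    # Toy stand-in for the statistical classifier.
    return len(request) > 10

def rebuild_cart_request(request):
    # Toy intent inference: recognise only "add item <digits> to cart".
    match = re.match(r'/cart/add\?id=(\d+)', request)
    return '/cart/add?id=' + match.group(1) if match else None

for candidate in ['/home', '/cart/add?id=42', "/cart/add?id=42' OR '1'='1"]:
    print(candidate, screen_request(candidate, flag_long, rebuild_cart_request))
```

Because the comparison is made against a request generated by the server, any extra payload appended by the user makes the two strings differ and the request is rejected.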

### Function for classification of HTTP requests
For simplicity we implement here the decision function `normal_or_anomalous` for Model 1 which therefore constitutes our predictor of normal vs anomalous HTTP requests:

In [None]:
print model1.coef_
print model1.intercept_

In [None]:
# Predicts whether a given HTTP request string is normal (=0) or anomalous (=1)
# The decision boundary has been determined by Model 1.
def normal_or_anomalous(HTTP_request):

        features = extract_features(HTTP_request)
        theta = [0.15609435, -0.16087162, -0.23866712, -0.17492764, -0.0141425]

        p = 1.0/(1+np.exp(-(np.dot(theta,features)-4.12389027)))
        if p >= 0.5:
                p = 1
        else:
                p = 0

        return p

This function has been included in the stand-alone module `http_requests` and can be used to analyze individual HTTP requests in practice, e.g.:

In [None]:
print normal_or_anomalous(normal_requests.HTTP_requests[0])
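
The decision rule in `normal_or_anomalous` thresholds the logistic probability at 0.5, which is equivalent to checking the sign of the linear score `theta . x + intercept`. The Python 3 sketch below illustrates this with the coefficients quoted above. The function names and the two feature vectors are made up for illustration.

```python
import math

THETA = [0.15609435, -0.16087162, -0.23866712, -0.17492764, -0.0141425]
INTERCEPT = -4.12389027

def linear_score(features):
    """Linear part of the logistic model: theta . x + intercept."""
    return sum(t * x for t, x in zip(THETA, features)) + INTERCEPT

def classify_by_probability(features):
    """Threshold the logistic probability at 0.5."""
    p = 1.0 / (1.0 + math.exp(-linear_score(features)))
    return 1 if p >= 0.5 else 0

def classify_by_sign(features):
    """Same decision without evaluating the exponential."""
    return 1 if linear_score(features) >= 0 else 0

long_request = [100, 20, 2, 30, 5]   # made-up feature vector
short_request = [40, 10, 1, 30, 3]   # made-up feature vector
for x in (long_request, short_request):
    print(classify_by_probability(x), classify_by_sign(x))
```

The sign form avoids the exponential altogether, which sidesteps overflow for strongly negative scores.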