sweetmanC edited this page May 8, 2016 · 7 revisions

Introduction

Intrusion Detection Systems (IDS) are a second-level defence for computer systems. An anomaly IDS builds a ‘normal’ profile for system operation, and anything that deviates from normal behaviour can be flagged as a potential attack. Using normal system call data and a series of pre-defined attack data, this project develops a Python implementation of an Intrusion Detection System based on two machine learning techniques, Decision Trees and Cluster Analysis. It describes and implements both methods, and reports the computational cost and the false alarm rate of each implementation.

An example of the system call data being utilised for training is as follows (1-2-tv762):

setpgrp ioctl setpgrp ioctl ioctl setpgrp ioctl close close close ioctl ioctl close close close close close close execve open mmap open mmap mmap munmap mmap close open mmap mmap munmap mmap mmap close open mmap mmap munmap mmap mmap close open mmap close open mmap mmap munmap mmap close open mmap mmap munmap mmap close open mmap mmap munmap mmap close open mmap mmap munmap mmap close close munmap close close close close close close close close close close close close close close close close close open stat open open close ioctl ioctl ioctl ioctl close open pipe vfork close ioctl ioctl close lstat close unlink close close close close exit
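A trace like the one above is simply a whitespace-separated sequence of system call names. As a minimal sketch (the function name `parse_trace` is illustrative and not taken from the project code), it could be loaded into a Python list as follows:

```python
from collections import Counter


def parse_trace(text):
    """Split a whitespace-separated trace into a list of system call names."""
    return text.split()


# The first ten calls of the example trace above.
sample = "setpgrp ioctl setpgrp ioctl ioctl setpgrp ioctl close close close"
calls = parse_trace(sample)

# Count how often each system call occurs in this fragment.
counts = Counter(calls)
print(len(calls), dict(counts))
```

Keeping the calls in a list preserves their order, which is what both detection methods rely on.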

These system calls represent a trace of system operation; there is no frame of reference for when or where within the system operation and machine execution space the trace was taken. Utilising 400 traces of normal system execution and machine learning tools, an IDS should be able to determine accurately whether other traces match normal system execution. The data provided to assist in algorithm development and analysis contains the following:

/Attack Data – 10 attacks/
	/1-1-ffb/
	/1-1-format/
	/2-2-ipsweep/
	/2-5-ftpwrite/
	/3-1-ffb/
	/3-3-ftpwrite/
	/3-4-warez/
	/3-5-warezmaster/
	/4-1-warezclient/
	/4-2-spy/
/Training Data – 100 traces/
/Validation Data – 400 traces/

Attack Data – 10 attacks. This folder contains ten sub-folders, each representing an attack profile. Within each attack profile is a series of traces linked to that profile. However, not every trace within an attack profile is an attack trace. The IDS is required to determine which traces are normal and which traces should raise the alarm. This analysis only requires an alarm to be flagged if more than one trace within an attack profile does not conform to normal system operation.
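The alarm rule for an attack profile can be sketched in a few lines. The function name and the list-of-booleans input are assumptions made for illustration; the per-trace verdicts would come from whichever detection method is in use.

```python
def profile_raises_alarm(trace_is_normal):
    """Return True when more than one trace in the profile is not normal.

    trace_is_normal: one boolean per trace in the attack profile,
    True if the detector judged that trace to be normal.
    """
    abnormal = sum(1 for ok in trace_is_normal if not ok)
    return abnormal > 1


# One abnormal trace is not enough to flag the profile; two are.
print(profile_raises_alarm([True, False, True]))
print(profile_raises_alarm([False, True, False]))
```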

Training Data – 100 traces. This is the data source for the normal system operation traces. These traces represent a portion of normal system operation, and the IDS is required to learn normal operating behaviour from this data.

Validation Data – 400 traces. This data cannot be used to train the IDS. It is a set of system traces used to validate the algorithms developed during this analysis. The data itself is an extended set of normal system operating traces, used to determine the accuracy of the implemented algorithm, expressed as its False Alarm Rate (FAR).
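Because every validation trace is normal, any trace the detector flags is a false alarm. A minimal sketch of the FAR calculation under that reading (the function name and the example verdicts are hypothetical, not project results):

```python
def false_alarm_rate(flagged):
    """Fraction of normal traces that were wrongly flagged as attacks.

    flagged: one boolean per validation trace, True if the detector
    raised an alarm on that (known normal) trace.
    """
    if not flagged:
        raise ValueError("at least one validation trace is required")
    return sum(flagged) / len(flagged)


# Made-up verdicts for eight normal traces, two of them wrongly flagged.
example = [False, False, True, False, False, False, True, False]
print(false_alarm_rate(example))
```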

# Implementation Methods

The two methods chosen to solve this problem are decision trees and cluster analysis. They were chosen for their ability to compare data sets and determine whether system traces match the trained normal profile, while maintaining a low resource and computational footprint.

The decision tree method builds a tree of normal operating system calls, so that when the algorithm checks a new trace it simply steps through the tree to compare the order of system calls. Because only ‘normal’ data is used for training, every leaf node in the decision tree ends in ‘normal’, meaning that the order of system calls leading to it is deemed normal. Without a specific set of ‘attack’ traces to add to the tree (leaf nodes ending in ‘attack’), no probability can be attached to a classification; with such data, a shortened sequence of system calls might, for example, carry a 40% chance of ending in an attack because its branch extends deeper than the sampled sequence. As a result, every check is a simple pass or fail: a sequence is normal if it exists within the decision tree, and an attack if it does not.
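The pass-or-fail lookup described above can be sketched with nested dictionaries. This is an illustration under stated assumptions, not the project's implementation: traces are cut into fixed-length sliding windows, and `WINDOW`, `build_tree` and `is_normal` are hypothetical names.

```python
WINDOW = 3  # hypothetical window length, not taken from the project


def windows(calls, size):
    """Return every run of `size` consecutive system calls in the trace."""
    return [tuple(calls[i:i + size]) for i in range(len(calls) - size + 1)]


def build_tree(traces, size=WINDOW):
    """Build a tree (nested dicts) holding every window seen in normal traces."""
    root = {}
    for calls in traces:
        for window in windows(calls, size):
            node = root
            for call in window:
                node = node.setdefault(call, {})
    return root


def is_normal(tree, calls, size=WINDOW):
    """Step through the tree; fail as soon as a call has no matching branch."""
    for window in windows(calls, size):
        node = tree
        for call in window:
            if call not in node:
                return False
            node = node[call]
    return True


normal_traces = [["open", "mmap", "close", "exit"]]
tree = build_tree(normal_traces)
print(is_normal(tree, ["open", "mmap", "close"]))
print(is_normal(tree, ["open", "close", "mmap"]))
```

In this sketch a trace shorter than the window produces no windows and therefore passes; a real implementation would need to decide how to treat such traces.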

The cluster analysis method takes a simplified data comparison approach to the IDS. The implemented method stores the normal system call traces as cluster data within a list or two-dimensional matrix, and then creates separate clusters out of the system call traces being tested. The algorithm then compares the clusters against each other, and if there is no match, the tested cluster is not deemed to be normal operating system calls. Both methods were chosen for their perceived accuracy and proficiency, and for the feasibility of implementing them successfully.
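A minimal sketch of this comparison follows, assuming that a "cluster" is a fixed-length run of consecutive system calls and that a match means exact equality. Both assumptions, along with the names `CLUSTER_SIZE`, `build_normal_clusters` and `unmatched_clusters`, are illustrative rather than taken from the project.

```python
CLUSTER_SIZE = 3  # hypothetical cluster length, not taken from the project


def make_clusters(calls, size=CLUSTER_SIZE):
    """Cut a trace into overlapping clusters of `size` consecutive calls."""
    return [calls[i:i + size] for i in range(len(calls) - size + 1)]


def build_normal_clusters(traces, size=CLUSTER_SIZE):
    """Collect the distinct clusters of all normal traces in a 2-D list."""
    matrix = []
    for calls in traces:
        for cluster in make_clusters(calls, size):
            if cluster not in matrix:
                matrix.append(cluster)
    return matrix


def unmatched_clusters(matrix, calls, size=CLUSTER_SIZE):
    """Return the clusters of a tested trace that match no normal cluster."""
    return [c for c in make_clusters(calls, size) if c not in matrix]


normal = build_normal_clusters([["open", "mmap", "close", "exit"]])
print(unmatched_clusters(normal, ["open", "mmap", "close", "unlink"]))
```

A tested trace would be treated as normal only when `unmatched_clusters` returns an empty list.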

# Conclusion

Machine learning is an important factor in the design of an Intrusion Detection System, as is the ability to develop a reliable method of determining normal operating behaviour while allowing for some variation.

Of the two machine learning techniques that were implemented, decision trees and cluster analysis, the decision tree implementation proved far more accurate. It misclassified 1.25% of traces, compared with 18.5% for the cluster implementation.

The clustering implementation, however, was far more efficient in terms of computational cost, with a local order of growth of 0.5 compared with 1 for the decision tree. The execution time of the decision tree algorithm grew steadily to nearly 2 seconds, whereas the clustering method remained within 1 second.
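One common way to read a local order of growth is as the slope of running time against input size on a log-log scale, estimated from two measurements. The sketch below assumes that reading; the timings are made-up values chosen only to show how exponents of 0.5 and 1 arise, and are not the project's measurements.

```python
import math


def local_order_of_growth(n1, t1, n2, t2):
    """Estimate b in t ~ a * n**b from two (size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Hypothetical timings: input size grows 4x in both cases.
slow_growth = local_order_of_growth(100, 0.5, 400, 1.0)    # time doubles
linear_growth = local_order_of_growth(100, 0.5, 400, 2.0)  # time grows 4x
print(round(slow_growth, 2), round(linear_growth, 2))
```

An exponent near 1 means running time scales roughly linearly with input size, while an exponent near 0.5 means it scales roughly with the square root of the input size.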

These two algorithms demonstrate the trade-off between performance and accuracy in the implementation of a decision support system. While the clustering implementation is far superior in performance, its False Alarm Rate is significantly higher, and these inaccuracies reduce its overall reliability.

With its good performance, lower memory usage and significantly smaller False Alarm Rate, the decision tree implementation is the better machine learning tool for this application.
