### Applying NLP to the description field of the AIS data

All stopwords are removed from the descriptions and the remaining words are vectorized. The target variable used for training is the Object Type identified for each record, such as a malicious IP address, email, or URL. The model therefore tries to predict the observable of the attack from its description. This is a first attempt at extracting meaningful information from the description field of the AIS data.

In [114]:
import sklearn.feature_extraction
import pandas as pd
import os
import re

df = pd.read_csv('ais_data.csv', encoding = "ISO-8859-1")
df2 = pd.DataFrame(df[['stix_header.description.0', 'indicators.observable.object.properties.xsi:type.0']])
df2 = df2.fillna('')

# I am only using the rows which contain this observable for training the model
df2 = df2[df2['indicators.observable.object.properties.xsi:type.0'] != '']

#### 186 of the 225 rows are AddressObjectType, so the dataset is skewed

In [152]:
df2['indicators.observable.object.properties.xsi:type.0'].describe()

count                   225
unique                    5
top       AddressObjectType
freq                    186
Name: indicators.observable.object.properties.xsi:type.0, dtype: object
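
To make the skew concrete, here is a minimal sketch that turns the per-class row counts into shares. The counts are copied from the support column of the classification report further down in this notebook; the variable names (`class_counts`, `class_shares`, `majority_baseline`) are illustrative and not part of the original analysis.

```python
# Per-class row counts, copied from the support column of the
# classification report further down in this notebook.
class_counts = {
    'AddressObjectType': 186,
    'DomainNameObjectType': 22,
    'EmailMessageObjectType': 1,
    'FileObjectType': 4,
    'URIObjectType': 12,
}

total = sum(class_counts.values())

# Share of each class, largest first.
class_shares = sorted(
    ((name, count / float(total)) for name, count in class_counts.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, share in class_shares:
    print('%-24s %5.1f%%' % (name, 100 * share))

# A classifier that always answers with the most frequent class
# already reaches this accuracy, so it is the baseline to beat.
majority_baseline = class_shares[0][1]
```

On the real frame, `df2['indicators.observable.object.properties.xsi:type.0'].value_counts(normalize=True)` should give the same shares directly.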

### Extract IP addresses from text

In [116]:
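def extract_ip(text):
    """Return all dotted-quad IPv4 candidates in text as a comma-separated string."""
    try:
        # Search the text directly; encoding it to bytes first breaks under Python 3
        ip_candidates = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", text)
        return ', '.join(ip_candidates)
    except TypeError:
        # Non-string input (e.g. NaN) has nothing to search
        return None

df2['ip'] = df2['stix_header.description.0'].apply(extract_ip)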

df2.head()

Unnamed: 0,stix_header.description.0,indicators.observable.object.properties.xsi:type.0,ip
6,"On 06 January 2015, an organization in the aer...",URIObjectType,178.79.186.55
9,A trusted third party has found these IP addre...,AddressObjectType,
13,An organization in the Aerospace sector observ...,URIObjectType,
15,"On 5 April 2016, a trusted third-party provide...",DomainNameObjectType,
20,,DomainNameObjectType,
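
The regular expression used for extraction accepts any four groups of one to three digits, so it would also match strings such as `999.1.1.1`. A possible refinement, sketched below, checks that every octet lies in 0..255. The helper name `extract_valid_ips` and the sample sentence are hypothetical; only the address `178.79.186.55` comes from the data shown above.

```python
import re

# Same dotted-quad pattern as in extract_ip.
IP_PATTERN = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def extract_valid_ips(text):
    """Return dotted-quad candidates whose four octets are all in 0..255."""
    valid = []
    for candidate in IP_PATTERN.findall(text):
        if all(0 <= int(octet) <= 255 for octet in candidate.split('.')):
            valid.append(candidate)
    return valid

# Hypothetical sentence: one real-looking address, one impossible one.
sample = "Beaconing to 178.79.186.55 was seen; 999.1.1.1 is not an address."
print(extract_valid_ips(sample))
```

This is still only a syntactic check: it cannot tell an IP address apart from, say, a four-part version number.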


### Extract length of the description field

In [117]:
df2['description_length'] = df2['stix_header.description.0'].apply(len)
df2.head()

Unnamed: 0,stix_header.description.0,indicators.observable.object.properties.xsi:type.0,ip,description_length
6,"On 06 January 2015, an organization in the aer...",URIObjectType,178.79.186.55,291
9,A trusted third party has found these IP addre...,AddressObjectType,,295
13,An organization in the Aerospace sector observ...,URIObjectType,,252
15,"On 5 April 2016, a trusted third-party provide...",DomainNameObjectType,,241
20,,DomainNameObjectType,,0


### Let's remove stopwords. We can import a list of English stopwords from NLTK. Stopwords are very common English words (such as "i", "me", "my") that carry little meaning on their own.

In [149]:
from nltk.corpus import stopwords
stopwords.words('english')[0:10] # Show some stop words

[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your']

In [150]:
import string

# Build the stop-word set once; calling stopwords.words() for every word is slow
STOPWORDS = set(stopwords.words('english'))

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the remaining words as a list
    """
    # Keep only the characters that are not punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)

    # Split into words and drop the stopwords (case-insensitive)
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]

In [151]:
df2['stix_header.description.0'].head(5).apply(text_process)

6     [06, January, 2015, organization, aerospace, s...
9     [trusted, third, party, found, IP, addresses, ...
13    [organization, Aerospace, sector, observed, re...
15    [5, April, 2016, trusted, thirdparty, provided...
20                                                   []
Name: stix_header.description.0, dtype: object
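
Note the token `thirdparty` in the output above: deleting punctuation characters glues the two halves of a hyphenated word together, and it does the same to the octets of an IP address. The sketch below contrasts deleting punctuation with replacing it by a space. Both helper names and the sample sentence are hypothetical, and neither variant is presented as the right choice for this data.

```python
import string

def strip_punctuation(text):
    """Delete punctuation characters, as text_process does."""
    return ''.join(ch for ch in text if ch not in string.punctuation)

def punctuation_to_spaces(text):
    """Alternative: replace each punctuation character with a space."""
    return ''.join(' ' if ch in string.punctuation else ch for ch in text)

# Hypothetical sentence in the style of the descriptions.
sample = "a trusted third-party saw 178.79.186.55"

deleted = strip_punctuation(sample).split()
spaced = punctuation_to_spaces(sample).split()

print(deleted)
print(spaced)
```

Neither variant keeps the IP address intact as a token, which is one reason to extract addresses with a regular expression before the text is cleaned.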

Next, each description is converted into a vector of word counts. Each vector has as many dimensions as there are unique words in the description corpus. We will first use scikit-learn's **CountVectorizer**, which converts a collection of text documents to a matrix of token counts.

We can picture this as a two-dimensional matrix in which one dimension is the entire vocabulary (one row per word) and the other is the documents (here, one column per description). scikit-learn stores the transpose of this picture: one row per document and one column per word.

For example:

<table border="1">
<tr>
<th></th> <th>Description 1</th> <th>Description 2</th> <th>...</th> <th>Description N</th> 
</tr>
<tr>
<td><b>Word 1 Count</b></td><td>0</td><td>1</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word 2 Count</b></td><td>0</td><td>0</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>...</b></td> <td>1</td><td>2</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word N Count</b></td> <td>0</td><td>1</td><td>...</td><td>1</td>
</tr>
</table>


Most words do not appear in most descriptions, so we can expect the matrix to consist largely of zero counts. Because of this, scikit-learn outputs a [Sparse Matrix](https://en.wikipedia.org/wiki/Sparse_matrix).
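
To illustrate the idea without any library, here is a plain-Python sketch of a bag-of-words count matrix and its sparse form. It is not how scikit-learn implements `CountVectorizer`; the three tokenized documents are made up for the example, and the matrix is laid out with one row per document.

```python
from collections import Counter

# Hypothetical, already-tokenized descriptions (not taken from the AIS data).
docs = [
    ['malicious', 'IP', 'addresses', 'observed'],
    ['phishing', 'URL', 'observed'],
    [],  # an empty description yields an all-zero row
]

# Vocabulary: every unique token, sorted, mapped to a column index.
all_tokens = sorted(set(token for doc in docs for token in doc))
vocabulary = {token: idx for idx, token in enumerate(all_tokens)}

# Dense count matrix: one row per document, one column per vocabulary term.
counts = []
for doc in docs:
    row = [0] * len(vocabulary)
    for token, n in Counter(doc).items():
        row[vocabulary[token]] = n
    counts.append(row)

# Sparse view: keep only the non-zero cells as (row, column) -> count.
sparse = {(r, c): n
          for r, row in enumerate(counts)
          for c, n in enumerate(row) if n}

# Fraction of cells that are non-zero.
density = len(sparse) / float(len(docs) * len(vocabulary))
print('%d non-zero cells out of %d' % (len(sparse), len(docs) * len(vocabulary)))
```

Storing only the non-zero cells is what makes the representation practical once the vocabulary grows to hundreds or thousands of terms.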

In [121]:
from sklearn.feature_extraction.text import CountVectorizer

In [122]:
bow_transformer = CountVectorizer(analyzer=text_process).fit(df2['stix_header.description.0'])

# Print total number of vocab words
print len(bow_transformer.vocabulary_)

253


## These are the features (vocabulary terms) that the vectorizer has extracted

In [123]:
', '.join(bow_transformer.get_feature_names())

u'0060, 045, 06, 10, 10063500, 10070166, 1015, 16, 2015, 20150826, 2016, 20161015, 20161017, 2016INDICATOR1WFQ1U, 2016INDICATOR3HIA6T, 2016INDICATOR47JTFP, 2016INDICATOR8TTBY, 2016INDICATOROZFTC, 25, 31, 32bit, 4, 42, 5, APT, Access, According, Active, Aerospace, Aircraft, Analysis, April, Association, Attempts, CIG, CISA, CSEC, Canada, Center, Circular, Colorado, Conficker, Cyber, DGAs, Dated, Directed, Directory, Donald, DustySky, EXEPROXY, Election, Environment, FIDELIS, IP, IPs, IPv4, Implant, Intrusion, January, Java, July, MIFR, Malicious, March, Multiplatform, None, OPM, Operations, Owners, PHP, Phishing, Pilots, PresidentElect, Presidential, Remote, Reported, Results, Scans, Secretary, Security, Spear, State, States, Submission, Test, Tool, Traversal, Treasury, Trojans, Trump, UK, URL, USCERT, United, Webform, ability, activity, actor, additional, address, addresses, advanced, aerospace, algorithm, alleged, analysis, artifact, artifacts, associated, available, aviation, aware, 

In [129]:
messages_bow = bow_transformer.transform(df2['stix_header.description.0'])
print 'Shape of Sparse Matrix: ', messages_bow.shape
print 'Amount of Non-Zero occurrences: ', messages_bow.nnz
# nnz divided by the number of cells is the share of non-zero entries, i.e. the density
print 'density: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))

Shape of Sparse Matrix:  (225, 253)
Amount of Non-Zero occurrences:  422
density: 0.74%


In [130]:
df2['stix_header.description.0'].describe()

count     225
unique     32
top          
freq      185
Name: stix_header.description.0, dtype: object

### Applying TF-IDF

In [132]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(messages_bow)

In [133]:
messages_tfidf = tfidf_transformer.transform(messages_bow)
print messages_tfidf.shape

(225, 253)
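
As a rough sketch of what the transformer computes with its default settings: each raw count is multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and each row is then scaled to unit L2 norm. The small count matrix and all names below are hypothetical, and the code is a plain-Python illustration rather than scikit-learn's implementation.

```python
import math

# Hypothetical count matrix: 3 documents x 3 terms.
counts = [
    [2, 1, 0],
    [0, 1, 0],
    [0, 1, 3],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears.
doc_freq = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for d in doc_freq]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]

for row in tfidf:
    print(['%.3f' % v for v in row])
```

The middle term occurs in every document, so it receives the lowest idf and contributes least to the rows where rarer terms are present. An all-zero row, such as an empty description, stays all-zero.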


### Training the Model

In [146]:
from sklearn.naive_bayes import MultinomialNB

threat_detect_model = MultinomialNB().fit(messages_tfidf, df2['indicators.observable.object.properties.xsi:type.0'])

In [147]:
all_predictions = threat_detect_model.predict(messages_tfidf)

In [148]:
from sklearn.metrics import classification_report
print classification_report(df2['indicators.observable.object.properties.xsi:type.0'], all_predictions)

                        precision    recall  f1-score   support

     AddressObjectType       0.83      1.00      0.91       186
  DomainNameObjectType       0.00      0.00      0.00        22
EmailMessageObjectType       0.00      0.00      0.00         1
        FileObjectType       0.00      0.00      0.00         4
         URIObjectType       0.00      0.00      0.00        12

           avg / total       0.68      0.83      0.75       225



## On the training data, the model achieves a weighted-average precision of 68% and recall of 83%

### NOTE: Right now the model predicts AddressObjectType for every row, because the data is heavily skewed towards that class and has few rows for the other target values. In addition, 185 of the 225 descriptions are empty, so most rows carry no text features at all. The model might perform better on the entire AIS data.
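
The averages in the report can be reproduced by hand from the class supports alone, assuming (as the per-class rows of the report indicate) that every row is labelled AddressObjectType. The sketch below does that arithmetic; the function and variable names are illustrative.

```python
# Support per class, from the classification report above.
support = {
    'AddressObjectType': 186,
    'DomainNameObjectType': 22,
    'EmailMessageObjectType': 1,
    'FileObjectType': 4,
    'URIObjectType': 12,
}
total = sum(support.values())
predicted_class = 'AddressObjectType'  # assumed to be the only label the model outputs

def scores(label):
    """Precision, recall and F1 for one class under the constant prediction."""
    if label != predicted_class:
        return 0.0, 0.0, 0.0  # never predicted, so nothing is recovered
    precision = support[label] / float(total)  # all rows receive this label
    recall = 1.0                                # every true member is found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Support-weighted averages, as in the report's "avg / total" row.
weighted = [
    sum(scores(label)[i] * n for label, n in support.items()) / total
    for i in range(3)
]
print('precision %.2f  recall %.2f  f1 %.2f' % tuple(weighted))
```

The 68% and 83% figures therefore say little about the descriptions themselves: a constant prediction of the majority class yields the same numbers.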