### **PIP Installations**
Install the packages that are not already available on Google Colab:



1.   http://scikit.ml/ --> scikit-multilearn: a BSD-licensed library for multi-label classification, built on top of the well-known scikit-learn ecosystem.
2.   https://pypi.org/project/neattext/ --> NeatText: a simple NLP package for cleaning and preprocessing textual data for NLP and ML.



In [1]:
!pip install scikit-multilearn
!pip install neattext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


----

### **Package Imports**

In [2]:
# pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, 
# built on top of the Python programming language. --> https://pandas.pydata.org/ 
import pandas as pd

# The fundamental package for scientific computing with Python --> https://numpy.org/
import numpy as np

# Split arrays or matrices into random train and test subsets. --> https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

# Convert a collection of raw documents to a matrix of TF-IDF features. --> https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer

# kNN classification method adapted for multi-label classification --> http://scikit.ml/api/skmultilearn.adapt.mlknn.html
from skmultilearn.adapt import MLkNN

# Compute the average Hamming loss. The Hamming loss is the fraction of labels that are 
# incorrectly predicted. --> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html

# In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of 
# labels in y_true. --> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

# Build a text report showing the main classification metrics.
from sklearn.metrics import hamming_loss, accuracy_score, classification_report

# Transform between iterable of iterables and a multilabel format.
from sklearn.preprocessing import MultiLabelBinarizer

# Transforms a multi-label classification problem with L labels into L single-label separate binary classification problems using the same base classifier provided  
# in the constructor. The prediction output is the union of all per label classifiers. --> http://scikit.ml/api/skmultilearn.problem_transform.br.html
from skmultilearn.problem_transform import BinaryRelevance

# Each model makes a prediction in the order specified by the chain using all of the available features provided to the model plus the predictions of models that are 
# earlier in the chain. --> http://scikit.ml/api/skmultilearn.problem_transform.cc.html
from skmultilearn.problem_transform import ClassifierChain

# Transform multi-label problem to a multi-class problem --> http://scikit.ml/api/skmultilearn.problem_transform.lp.html
from skmultilearn.problem_transform import LabelPowerset

# NeatText:a simple NLP package for cleaning textual data and text preprocessing. Simplifying Text Cleaning For NLP & ML. --> https://pypi.org/project/neattext/
import neattext as nt
import neattext.functions as nfx
from neattext.functions import clean_text

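# Naive Bayes classifiers; MultinomialNB serves as the base classifier for BinaryRelevance further down.
from sklearn.naive_bayes import GaussianNB, MultinomialNB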

---

----

### **Data Load**
The data file is uploaded to the Google Colab session storage, so it is only available while the session is running. After the session ends, it has to be uploaded again.

In [3]:
# Load the data and show the first 5 rows.
# Read a comma-separated values (csv) file into DataFrame. --> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
df = pd.read_csv('dataset2.csv', header=0, sep=',')
df.head(5)

Unnamed: 0,description,technique_id,Unnamed: 2
0,It was possible for double OGNL evaluation in ...,"1156, 1058, 1130, 1050, 1160, 1152, 1031, 1062...",
1,A buffer overflow in glibc 2.5 (released on Se...,,
2,A memory leak in glibc 2.1.1 (released on May ...,,
3,"The ""stub_send_ret_submit()"" function (drivers...",,
4,"The ""stub_recv_cmd_submit()"" function (drivers...",,


---

----

### **Data Manipulation**

**Drop unneeded columns**

*   A column "Unnamed: 2" was created automatically, which points to a formatting problem in the downloaded CSV file.
*   We drop it and show the first 2 rows to confirm the result.

In [4]:
df.drop('Unnamed: 2', axis=1, inplace=True)
df.head(2)

Unnamed: 0,description,technique_id
0,It was possible for double OGNL evaluation in ...,"1156, 1058, 1130, 1050, 1160, 1152, 1031, 1062..."
1,A buffer overflow in glibc 2.5 (released on Se...,


**Drop rows with NaN values** 

Rows with NaN in column "technique_id" carry no labels and cannot be used for training, so we drop every row that contains one.


In [5]:
df = df.dropna(subset=['technique_id'])
df.head(2)

Unnamed: 0,description,technique_id
0,It was possible for double OGNL evaluation in ...,"1156, 1058, 1130, 1050, 1160, 1152, 1031, 1062..."
6,The vhci_hcd driver in the Linux Kernel before...,"1148, 1049, 1018, 1124, 1046, 1016, 1126, 1082..."


Confirm that no NaN values remain in our data.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8077 entries, 0 to 27470
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   description   8077 non-null   object
 1   technique_id  8077 non-null   object
dtypes: object(2)
memory usage: 189.3+ KB
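
A more direct way to confirm this is to count the missing values per column. The sketch below uses a hypothetical three-row frame (not the real dataset) and also shows why the index jumps from 0 to 6 in the output above: `dropna` keeps the original row labels.

```python
import pandas as pd

# Hypothetical toy frame with one unlabeled row (illustration only).
toy = pd.DataFrame({
    "description": ["first text", "second text", "third text"],
    "technique_id": ["1156, 1058", None, "1148"],
})

# Drop rows without labels, exactly as done for the real data.
toy = toy.dropna(subset=["technique_id"])

# Count remaining missing values per column; all zeros means the data is clean.
missing = toy.isna().sum()
print(missing)

# The original row labels survive, so the index now has a gap.
print(list(toy.index))
```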


The loaded "technique_id" column holds all technique IDs of a row as one comma-separated string. We need to split it into individual attack technique IDs and encode each ID as its own binary column for multi-label classification to work.

In [7]:
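# Split the comma-separated string into a set of individual technique IDs.
# Caveat: the IDs are separated by ", ", so splitting on ',' alone leaves a leading space on
# every ID except the first; ' 1027' and '1027' then end up as separate label columns.
df['technique_id'] = df['technique_id'].str.split(',').apply(set).tolist()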

mlb = MultiLabelBinarizer(sparse_output=True)

df = df.join(
            pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(df.pop('technique_id')),
                index=df.index,
                columns=mlb.classes_))

df.head(2)

Unnamed: 0,description,1007,1012,1014,1015,1016,1018,1027,1031,1033,...,1023,1027.1,1038,1044,1081,1130,1134,1148,1156,1208
0,It was possible for double OGNL evaluation in ...,0,0,1,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
6,The vhci_hcd driver in the Linux Kernel before...,1,0,0,0,1,1,0,0,1,...,0,0,0,0,0,0,0,1,0,0
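
The duplicated column "1027" / "1027.1" in the output above is consistent with IDs that differ only by a leading space. The sketch below reproduces the effect on two hypothetical rows in the same `"id, id, id"` format and shows that stripping whitespace after the split yields one label per ID. It uses plain Python only; it is an illustration, not the code that produced the results in this notebook.

```python
# Hypothetical rows in the same format as the technique_id column.
rows = ["1156, 1058, 1027", "1027, 1148"]

# Splitting on ',' alone keeps the space that follows each comma.
naive = [set(row.split(",")) for row in rows]

# Stripping each piece gives a single canonical spelling per ID.
stripped = [{piece.strip() for piece in row.split(",")} for row in rows]

# The label vocabulary is the union of all per-row sets.
naive_labels = sorted(set().union(*naive))
clean_labels = sorted(set().union(*stripped))

print(naive_labels)   # ' 1027' and '1027' are counted as different labels
print(clean_labels)   # one entry per technique ID
```

Applying the same stripping before `MultiLabelBinarizer` would change the number of label columns, so the scores reported later would need to be recomputed.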


---

----

## **Prepare Data**


In [8]:
X = df["description"]
y = df.loc[ : , df.columns != 'description']

In [9]:
# Scan the percentage of noise (unclean data) in each description
X.apply(lambda x: nt.TextFrame(x).noise_scan())

0        {'text_noise': 10.855263157894738, 'text_lengt...
6        {'text_noise': 8.018867924528301, 'text_length...
7        {'text_noise': 9.82905982905983, 'text_length'...
10       {'text_noise': 8.655462184873949, 'text_length...
16       {'text_noise': 8.176100628930817, 'text_length...
                               ...                        
27458    {'text_noise': 7.03125, 'text_length': 128, 'n...
27460    {'text_noise': 10.666666666666668, 'text_lengt...
27461    {'text_noise': 9.420289855072465, 'text_length...
27464    {'text_noise': 11.11111111111111, 'text_length...
27470    {'text_noise': 11.39240506329114, 'text_length...
Name: description, Length: 8077, dtype: object

In [10]:
# Show the stopwords contained in each row of column "description"
X.apply(lambda x: nt.TextExtractor(x).extract_stopwords())

0        [it, was, for, in, and, in, and, in, to, an, w...
6        [the, in, the, before, and, to, that, a, is, o...
7        [the, in, before, and, before, does, not, this...
10       [a, in, the, of, for, the, could, an, to, a, o...
16                                  [in, the, in, all, of]
                               ...                        
27458                                [before, because, of]
27460                                     [before, during]
27461                   [before, in, the, of, the, during]
27464                               [before, to, via, the]
27470                                    [before, to, the]
Name: description, Length: 8077, dtype: object

In [11]:
# Show the numbers contained in each row of column "description"
X.apply(lambda x: nt.TextExtractor(x).extract_numbers())

0                  [4, 4, 5, 4, 4, 4, 5, 0, 4, 5, 2, 4, 5]
6                                    [4, 14, 8, 4, 4, 114]
7                           [5, 2, 9, 2, 5, 3, 5, 3, 4, 2]
10       [6, 5, 3, 4, 9000, 6, 6, 9000, 5, 3, 4, 6, 7, ...
16                                 [3, 0, 0, 4, 380, 7743]
                               ...                        
27458                                      [74, 0, 0, 416]
27460                                      [74, 0, 8, 449]
27461                                      [74, 0, 8, 447]
27464                                      [74, 0, 8, 444]
27470                                      [74, 0, 8, 409]
Name: description, Length: 8077, dtype: object

In [12]:
# Clean the text by removing emails, numbers, stopwords, emojis, etc.
# clean_text() is a simplified method that takes True/False flags for what to remove from a text;
# here it is called with its default settings.
X = X.apply(lambda x: nt.TextExtractor(x).clean_text())

In [13]:
# Remove any stopwords that are still present in the cleaned text
X = X.apply(nfx.remove_stopwords)

In [14]:
X

0        possible double ognl evaluation certain redire...
6        vhcihcd driver linux kernel version allows all...
7        gui component aka pulseui pulse secure desktop...
10       vulnerability ipv subsystem cisco ios xr softw...
16       highly predictable session tokens httpd server...
                               ...                        
27458    cpanel allows apache http server configuration...
27460      cpanel allows ftp access account suspension sec
27461    cpanel allows arbitrary filewrite operations c...
27464    cpanel allows demo accounts execute arbitrary ...
27470    cpanel allows local users disable clamav daemo...
Name: description, Length: 8077, dtype: object

----

-----

## **TfidfVectorizer**

TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. Each document becomes a vector with one entry per vocabulary term: the term's frequency in that document, scaled down according to how many documents in the collection contain the term.

TF-IDF (term frequency/inverse document frequency) is therefore a measure of how important a term is within a document: terms that are frequent in one document but rare across the collection get the highest weights. The vectorizer takes in raw texts and returns one feature vector per input text.

https://www.egochi.com/tfidfvectorizer/
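
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three hypothetical documents. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is our understanding of scikit-learn's default settings; tokenization is a plain whitespace split, so the numbers are illustrative rather than a replacement for TfidfVectorizer.

```python
import math
from collections import Counter

# Hypothetical mini-corpus (not taken from the dataset).
docs = [
    "buffer overflow in kernel",
    "memory leak in kernel",
    "buffer overflow in parser",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the second document, "memory" (one document) outweighs "kernel" (two documents), which outweighs "in" (all documents). With `max_df=0.5`, as used in this notebook, a term that occurs in more than half of the documents is dropped from the vocabulary altogether.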

In [17]:
tfidf = TfidfVectorizer(lowercase=False, max_df=0.5, max_features=60000, ngram_range=(1,2))

In [18]:
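# Learn the vocabulary and build the TF-IDF matrix; toarray() converts the sparse result
# into a dense array, which needs considerably more memory.
Xfeatures = tfidf.fit_transform(X).toarray()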

In [19]:
# Hold out 30% of the samples for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(Xfeatures, y, test_size=0.30, random_state=42)

In [20]:
print(df['description'].shape)
print(X_train.shape)

(8077,)
(5653, 60000)
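
The training shape printed above can be checked by hand. The sketch below assumes that `train_test_split` rounds the test share up when `test_size` is a float and gives the remainder to the training set; this is our understanding of scikit-learn's behavior, not something stated in this notebook.

```python
import math

n_samples = 8077   # rows left after dropping NaN labels
test_size = 0.30   # fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The result agrees with the 5653 training rows reported above.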


In [21]:
# Binary relevance: transform the multi-label problem into one independent binary classification problem per label
binary_rel_clf = BinaryRelevance(MultinomialNB())
binary_rel_clf.fit(X_train,y_train)
br_prediction = binary_rel_clf.predict(X_test)

In [22]:
br_prediction.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]])
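
Conceptually, binary relevance trains one classifier per label column and stacks the per-label predictions side by side. The sketch below spells this out with plain scikit-learn on a hypothetical toy matrix. It illustrates the idea only and is not the scikit-multilearn implementation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 6 samples, 3 non-negative features, 2 labels.
X = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 3, 1],
    [0, 2, 0],
    [1, 1, 4],
    [0, 0, 3],
])
Y = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [1, 1],
    [0, 0],
])

# One independent binary classifier per label column.
models = [MultinomialNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Stack the per-label predictions back into a label matrix.
pred = np.column_stack([model.predict(X) for model in models])
print(pred)
```

Because every label is predicted in isolation, correlations between techniques are ignored; ClassifierChain and LabelPowerset, both imported earlier, are the usual alternatives when such correlations matter.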

In [23]:
# Subset accuracy: share of test samples whose complete label set is predicted exactly
accuracy_score(y_test, br_prediction)

0.4141914191419142

In [24]:
# Hamming loss: share of individual labels that are predicted incorrectly
hamming_loss(y_test, br_prediction)

0.09107029123965028
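
The two scores measure different things, which is why a subset accuracy of about 0.41 can coexist with a Hamming loss of about 0.09. The sketch below computes both by hand with NumPy on a hypothetical 4x3 label matrix, following the definitions quoted in the import comments.

```python
import numpy as np

# Hypothetical ground truth and predictions: 4 samples, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# Hamming loss: fraction of individual label cells that are wrong.
hamming = np.mean(y_true != y_pred)

# Subset accuracy: fraction of rows where every label matches.
subset_acc = np.mean(np.all(y_true == y_pred, axis=1))

print(hamming, subset_acc)
```

Here only 2 of 12 label cells are wrong, yet half of the rows count as misses for subset accuracy because a single wrong label invalidates the whole row.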