
### Source: https://colab.research.google.com/drive/1gvtxPb1Acd0QDAxEeumljPS3EZ5T3QWd?usp=sharing

### Multilabel Classification with Python

##### Multilabel Dataset Examples
+ https://sci2s.ugr.es/keel/multilabel.php#sub10



![](multi-class_vs_multi_label_classification_jcharistech.png)

#### Solution for Multi-Label Problem
+ Methods for solving Multi-label Classification Problems
    + Problem Transformation
    + Adapted Algorithm
    + Ensemble approaches

#### Problem Transformation
+ This approach transforms the multi-label problem into one or more single-label problems, using one of:
    - Binary Relevance: treats each label as a separate binary classification problem, training one classifier per label.
    - Classifier Chains: the first classifier is trained on the input features alone; each subsequent classifier is trained on the input features plus the label outputs of all previous classifiers in the chain.
    - Label Powerset: transforms the problem into a single multi-class problem, where one multi-class classifier is trained on every unique label combination found in the training data.
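
The target transformations behind Binary Relevance and Label Powerset can be illustrated on a toy label matrix. This is a minimal sketch in plain Python, not the scikit-multilearn implementation; the matrix `Y` and the variable names are made up for illustration.

```python
# Toy label matrix: 4 samples, 3 labels (rows are samples, columns are labels).
Y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]

# Binary Relevance: one independent binary target vector per label (column).
binary_targets = [list(column) for column in zip(*Y)]

# Label Powerset: every distinct label combination becomes one class id,
# numbered in order of first appearance.
combo_to_class = {}
powerset_targets = []
for row in Y:
    combo = tuple(row)
    if combo not in combo_to_class:
        combo_to_class[combo] = len(combo_to_class)
    powerset_targets.append(combo_to_class[combo])

print(binary_targets)    # [[1, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
print(powerset_targets)  # [0, 1, 0, 2]
```

Binary Relevance ends up with three separate binary problems, while Label Powerset ends up with one three-class problem. With many labels the number of distinct combinations can grow quickly, which is the usual trade-off of Label Powerset.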

        
#### Adapted Algorithm
+ Adapts the learning algorithm itself to perform multi-label classification directly (for example, MLkNN from `skmultilearn.adapt`), rather than transforming the problem into a set of simpler sub-problems.
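
To give a feel for what an adapted algorithm looks like, the sketch below predicts a whole label vector at once with a per-label majority vote over the k nearest neighbours. It is a deliberately simplified stand-in: MLkNN additionally estimates prior and posterior probabilities from neighbour counts, which is omitted here. The function name and the toy data are hypothetical.

```python
import math

def knn_multilabel_predict(X_train, Y_train, x, k=3):
    """Predict a label vector for x by per-label majority vote of its k nearest neighbours."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    neighbours = order[:k]
    n_labels = len(Y_train[0])
    prediction = []
    for j in range(n_labels):
        votes = sum(Y_train[i][j] for i in neighbours)
        # A label is switched on when more than half of the neighbours carry it.
        prediction.append(1 if votes * 2 > k else 0)
    return prediction

# Hypothetical toy data: two clusters with different dominant labels.
X_train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
Y_train = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [1, 1]]

print(knn_multilabel_predict(X_train, Y_train, [0.05, 0.05]))  # [1, 0]
print(knn_multilabel_predict(X_train, Y_train, [0.95, 0.95]))  # [0, 1]
```

No problem transformation takes place here: a single model produces all labels for a sample in one pass.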
   

In [225]:
# Load EDA Pkgs
import pandas as pd
import numpy as np

In [226]:
# ML Pkgs
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.metrics import accuracy_score,hamming_loss,classification_report

In [227]:
### Split Dataset into Train and Test
from sklearn.model_selection import train_test_split
# Feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer

In [228]:
# Multi Label Pkgs
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.adapt import MLkNN


In [229]:
# !pip install scikit-multilearn==0.2.0

In [230]:
# Load Dataset
df = pd.read_csv("data/report1666743279291_with_incident_title_with_username.csv")

In [231]:
df.head()

Unnamed: 0,Event Receive Time,DayOfWeek(Event Receive Time),HourOfDay(Event Receive Time),Event Type,Event Name,Incident Title,Incident Reporting Device,Incident Source,Incident Target,Host IP,...,Incident Resolution,Incident Tag Name,Incident Comments,Incident Category,Attack Technique,Attack Tactic,user_A,user_B,user_C,user_D
0,2022-10-25 17:14:00,2,17,PH_RULE_Linux_Discovery_of_Network_Environment...,Linux: Discovery of Network Environment via Bu...,Linux Discovery of Network Environment via Bui...,sp14816.fortinet.com,,"hostName:sp14816,",,...,1,,,4,"[{""name"": ""System Network Configuration Discov...",Discovery,0,1,1,0
1,2022-10-25 17:14:00,2,17,PH_RULE_ANOMALY_TRAFFIC_NET_INTF,Sudden Increase in Network Interface Traffic,Sudden Increase in Network Interface Traffic o...,ussvnplesx56.fortinet-us.com,,"hostName:ussvnplesx56.fortinet-us.com, hostIpA...",172.30.54.56,...,1,,,4,"[{""name"": ""Network Denial of Service: Direct N...",Impact,1,1,0,0
2,2022-10-25 17:14:00,2,17,PH_RULE_ESX_CPU_WARN,ESX CPU Warning,ESX CPU Warning on ussvnplesx58.fortinet-us.com,ussvnplesx58.fortinet-us.com,,"hostIpAddr:172.30.54.58, hostName:ussvnplesx58...",172.30.54.58,...,1,,,2,"[{""name"": ""Endpoint Denial of Service: OS Exha...",Impact,1,1,1,1
3,2022-10-25 17:14:00,2,17,PH_RULE_ESX_CPU_WARN,ESX CPU Warning,ESX CPU Warning on ussvnplesx57.fortinet-us.com,ussvnplesx57.fortinet-us.com,,"hostIpAddr:172.30.54.57, hostName:ussvnplesx57...",172.30.54.57,...,1,,,2,"[{""name"": ""Endpoint Denial of Service: OS Exha...",Impact,0,1,1,0
4,2022-10-25 17:14:00,2,17,PH_RULE_ESX_CPU_WARN,ESX CPU Warning,ESX CPU Warning on ussvnplesx51.fortinet-us.com,ussvnplesx51.fortinet-us.com,,"hostIpAddr:172.30.54.51, hostName:ussvnplesx51...",172.30.54.51,...,1,,,2,"[{""name"": ""Endpoint Denial of Service: OS Exha...",Impact,0,0,0,1


In [232]:
df.dtypes

Event Receive Time                object
DayOfWeek(Event Receive Time)      int64
HourOfDay(Event Receive Time)      int64
Event Type                        object
Event Name                        object
Incident Title                    object
Incident Reporting Device         object
Incident Source                   object
Incident Target                   object
Host IP                           object
Host Name                         object
Incident Status                    int64
Incident Resolution                int64
Incident Tag Name                float64
Incident Comments                float64
Incident Category                  int64
Attack Technique                  object
Attack Tactic                     object
user_A                             int64
user_B                             int64
user_C                             int64
user_D                             int64
dtype: object

### Text Preprocessing
+ Use `neattext` to scan the titles for noise and to remove stopwords
+ Install it with `pip install neattext`
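
Conceptually, stopword removal filters out very common words that carry little signal. The sketch below is a simplified stand-in, not neattext's implementation: the stopword set is a tiny hypothetical list, and tokenization is plain whitespace splitting.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"of", "via", "on", "in", "and"}

def remove_stopwords_simple(text):
    """Drop whitespace-separated tokens that appear in STOPWORDS (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)

title = "Sudden Increase in Network Interface Traffic on host1"
print(remove_stopwords_simple(title))
# Sudden Increase Network Interface Traffic host1
```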

In [233]:
import neattext as nt
import neattext.functions as nfx

In [234]:
# Explore For Noise
df['Incident Title'].apply(lambda x:nt.TextFrame(x).noise_scan())

0        {'text_noise': 5.88235294117647, 'text_length'...
1        {'text_noise': 6.521739130434782, 'text_length...
2        {'text_noise': 8.51063829787234, 'text_length'...
3        {'text_noise': 8.51063829787234, 'text_length'...
4        {'text_noise': 8.51063829787234, 'text_length'...
                               ...                        
14704    {'text_noise': 13.513513513513514, 'text_lengt...
14705    {'text_noise': 10.588235294117647, 'text_lengt...
14706    {'text_noise': 13.91304347826087, 'text_length...
14707    {'text_noise': 9.411764705882353, 'text_length...
14708    {'text_noise': 10.679611650485436, 'text_lengt...
Name: Incident Title, Length: 14709, dtype: object

In [235]:
# Extract the stopwords found in each title
df['Incident Title'].apply(lambda x:nt.TextExtractor(x).extract_stopwords())

0                  [of, via, on]
1                  [in, on, and]
2                           [on]
3                           [on]
4                           [on]
                  ...           
14704    [two, and, because, of]
14705    [two, and, because, of]
14706    [two, and, because, of]
14707    [two, and, because, of]
14708    [two, and, because, of]
Name: Incident Title, Length: 14709, dtype: object

In [236]:
# Preview the titles with stopwords removed
df['Incident Title'].apply(nfx.remove_stopwords)

0        Linux Discovery Network Environment Built-in T...
1        Sudden Increase Network Interface Traffic vmni...
                               ...                        
14704    devices merged FSM.JianD.700.58.35.com FSM_Jia...
14705    devices merged co200-rp Centos8.58134 Overlapp...
14706    devices merged FSM-REPO-JYU-CENTOS8-OFFLINEREP...
14707    devices merged ch58221 ylsp58221.fqdn Overlapp...
14708    devices merged sp14811.fortinet.com ml14811.fo...
Name: Incident Title, Length: 14709, dtype: object

In [237]:
corpus = df['Incident Title'].apply(nfx.remove_stopwords)

### Feature Engineering
+ Build numeric features from the text
+ Options include TF-IDF, CountVectorizer, and bag-of-words (BoW); this notebook uses TF-IDF
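
As a rough picture of what TF-IDF produces, the sketch below computes the weights by hand for three short titles. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is what `TfidfVectorizer` documents as its defaults. Tokenization is simplified to whitespace splitting, so the numbers will not match the library on real text. The sample titles are made up.

```python
import math
from collections import Counter

# Made-up sample titles, already lower-cased.
docs = [
    "esx cpu warning",
    "esx memory warning",
    "sudden increase network traffic",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

for term, weight in zip(vocabulary, matrix[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first title, `cpu` receives a higher weight than `esx` or `warning` because it occurs in only one document, which is the behaviour that makes TF-IDF useful for separating otherwise similar titles.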

In [238]:
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

In [239]:
tfidf

In [240]:
# Build Features
Xfeatures = tfidf.fit_transform(corpus).toarray()

In [241]:
Xfeatures


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [242]:
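# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
df_tfidfvect = pd.DataFrame(data=Xfeatures, columns=tfidf.get_feature_names_out())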
df_tfidfvect.head()
df_tfidfvect.to_csv('/Users/yzhao/Downloads/df_tfidfvect.csv', index=False)



In [243]:
df.head()

Unnamed: 0,Event Receive Time,DayOfWeek(Event Receive Time),HourOfDay(Event Receive Time),Event Type,Event Name,Incident Title,Incident Reporting Device,Incident Source,Incident Target,Host IP,...,Incident Resolution,Incident Tag Name,Incident Comments,Incident Category,Attack Technique,Attack Tactic,user_A,user_B,user_C,user_D
0,2022-10-25 17:14:00,2,17,PH_RULE_Linux_Discovery_of_Network_Environment...,Linux: Discovery of Network Environment via Bu...,Linux Discovery of Network Environment via Bui...,sp14816.fortinet.com,,"hostName:sp14816,",,...,1,,,4,"[{""name"": ""System Network Configuration Discov...",Discovery,0,1,1,0
1,2022-10-25 17:14:00,2,17,PH_RULE_ANOMALY_TRAFFIC_NET_INTF,Sudden Increase in Network Interface Traffic,Sudden Increase in Network Interface Traffic o...,ussvnplesx56.fortinet-us.com,,"hostName:ussvnplesx56.fortinet-us.com, hostIpA...",172.30.54.56,...,1,,,4,"[{""name"": ""Network Denial of Service: Direct N...",Impact,1,1,0,0
2,2022-10-25 17:14:00,2,17,PH_RULE_ESX_CPU_WARN,ESX CPU Warning,ESX CPU Warning on ussvnplesx58.fortinet-us.com,ussvnplesx58.fortinet-us.com,,"hostIpAddr:172.30.54.58, hostName:ussvnplesx58...",172.30.54.58,...,1,,,2,"[{""name"": ""Endpoint Denial of Service: OS Exha...",Impact,1,1,1,1
3,2022-10-25 17:14:00,2,17,PH_RULE_ESX_CPU_WARN,ESX CPU Warning,ESX CPU Warning on ussvnplesx57.fortinet-us.com,ussvnplesx57.fortinet-us.com,,"hostIpAddr:172.30.54.57, hostName:ussvnplesx57...",172.30.54.57,...,1,,,2,"[{""name"": ""Endpoint Denial of Service: OS Exha...",Impact,0,1,1,0
4,2022-10-25 17:14:00,2,17,PH_RULE_ESX_CPU_WARN,ESX CPU Warning,ESX CPU Warning on ussvnplesx51.fortinet-us.com,ussvnplesx51.fortinet-us.com,,"hostIpAddr:172.30.54.51, hostName:ussvnplesx51...",172.30.54.51,...,1,,,2,"[{""name"": ""Endpoint Denial of Service: OS Exha...",Impact,0,0,0,1


In [244]:
y = df[['user_A', 'user_B', 'user_C', 'user_D']]

In [245]:
# Split Data 
X_train,X_test,y_train,y_test = train_test_split(Xfeatures,y,test_size=0.3,random_state=42)

In [246]:
# Building Our Model
# Estimator + Multilabel Estimator

### Binary Relevance Classification
+ Convert our multi-label problem into independent binary classification problems, one per label

![](binary_relevance_multilabel_ml_jcharistech.png)

In [247]:
# Binary Relevance: fit one binary classifier per label
# MultinomialNB is the base estimator for each label
binary_rel_clf = BinaryRelevance(MultinomialNB())

In [248]:
binary_rel_clf.fit(X_train,y_train)

In [249]:
# Predictions
br_prediction = binary_rel_clf.predict(X_test)

In [250]:
# Convert the sparse prediction matrix to a dense array to inspect it
br_prediction.toarray()

array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0],
       ...,
       [1, 0, 1, 1],
       [1, 1, 1, 1],
       [1, 0, 0, 0]])

In [251]:
# Accuracy (subset accuracy: every label of a sample must match)
accuracy_score(y_test,br_prediction)

0.061862678450033994

In [252]:
# Hamming loss: fraction of individual labels predicted incorrectly
# Lower is better
hamming_loss(y_test,br_prediction)

0.49648765012463175
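
The two metrics measure different things. For multi-label input, `accuracy_score` reports subset accuracy (a sample counts only if all of its labels are right), while `hamming_loss` counts wrong label slots individually. A minimal plain-Python sketch of both, with hypothetical function names and toy data:

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)

def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that are wrong."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = len(y_true) * len(y_true[0])
    return wrong / total

# Toy data: 3 samples, 4 labels.
y_true = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

print(subset_accuracy(y_true, y_pred))      # 0.3333333333333333
print(hamming_loss_manual(y_true, y_pred))  # 0.25
```

Only one of the three samples matches exactly, yet just 3 of the 12 label slots are wrong, so subset accuracy is typically the harsher of the two numbers.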

#### Classifier Chains
+ Preserves label correlations by passing earlier labels to later classifiers in the chain

![](classifier_chains_multilabel_jcharistech.png)
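
The chaining idea can be sketched by building the training set each link would see. This is an illustration of the general technique, assuming the common setup in which true labels extend the features during training and earlier predictions take their place at prediction time. It is not scikit-multilearn's code, and `chain_training_sets` is a hypothetical helper.

```python
def chain_training_sets(X, Y):
    """Build one (features, target) pair per label, in chain order."""
    sets = []
    for j in range(len(Y[0])):
        # Link j sees the original features plus the values of labels 0..j-1.
        features = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        target = [y[j] for y in Y]
        sets.append((features, target))
    return sets

# Toy data: 2 samples, 2 features, 3 labels.
X = [[0.2, 0.7], [0.9, 0.1]]
Y = [[1, 0, 1], [0, 1, 1]]

for link, (features, target) in enumerate(chain_training_sets(X, Y)):
    print(link, features, target)
# 0 [[0.2, 0.7], [0.9, 0.1]] [1, 0]
# 1 [[0.2, 0.7, 1], [0.9, 0.1, 0]] [0, 1]
# 2 [[0.2, 0.7, 1, 0], [0.9, 0.1, 0, 1]] [1, 1]
```

Each link's feature vector grows by one column, which is how information about earlier labels reaches later classifiers.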

In [253]:
def build_model(model,mlb_estimator,xtrain,ytrain,xtest,ytest):
    # Wrap the base estimator in the multi-label strategy and fit it
    clf = mlb_estimator(model)
    clf.fit(xtrain,ytrain)
    # Predict
    clf_predictions = clf.predict(xtest)
    # Subset accuracy and Hamming loss on the test set
    acc = accuracy_score(ytest,clf_predictions)
    ham = hamming_loss(ytest,clf_predictions)
    # "hamming_score" holds the Hamming loss, so lower is better
    result = {"accuracy:":acc,"hamming_score":ham}
    return result

In [254]:
clf_chain_model = build_model(MultinomialNB(),ClassifierChain,X_train,y_train,X_test,y_test)

In [255]:
clf_chain_model

{'accuracy:': 0.06072966236120553, 'hamming_score': 0.49745071380013595}

#### LabelPowerset
![](labelPowerset_multilabel_ml_jcharistech.png)

In [256]:
clf_labelP_model = build_model(MultinomialNB(),LabelPowerset,X_train,y_train,X_test,y_test)

In [257]:
clf_labelP_model

{'accuracy:': 0.06299569453886245, 'hamming_score': 0.4917856333559937}

In [258]:
### Apply to a Sample Title

In [259]:
ex1 = df['Incident Title'].iloc[0]

In [260]:
# Vectorize the example with the already fitted TF-IDF vectorizer
vec_example = tfidf.transform([ex1])

In [261]:
# Make our prediction
binary_rel_clf.predict(vec_example).toarray()

array([[1, 1, 0, 0]])
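
To read such a prediction row, it can be mapped back to label names. The sketch below assumes the columns follow the order used when `y` was built (`user_A` to `user_D`); the variable names are illustrative.

```python
# Column order taken from the label selection used to build y.
label_names = ['user_A', 'user_B', 'user_C', 'user_D']

# One predicted row, copied from the output above.
predicted_row = [1, 1, 0, 0]

# Keep the names whose indicator is switched on.
predicted_labels = [name for name, flag in zip(label_names, predicted_row) if flag == 1]
print(predicted_labels)  # ['user_A', 'user_B']
```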