Team ID: C16
Sem: 5
School: Computer Science and Engineering

Topic ID: YDMACP13
Title: IoT Malware Classification

Problem Statement: Classification of IoT Malware based on system calls into five different classes.

Team Members: 
Name: Shivam Ralli   USN: 01FE17BCS188 (Team Leader)
      Shashi Prakash USN: 01FE17BCS184
      Sagar Hotapeti USN: 01FE17BCS163
      Ayush Nalavade USN: 01FE16BCS048

In [1]:
import pandas as pd
import numpy as np
from sklearn.utils import resample
from sklearn.metrics import roc_auc_score,precision_recall_curve,roc_curve
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [2]:
pd.set_option('display.max_columns', 50)

In [3]:
train = pd.read_csv('CDMC2019 Task2 Df.csv')
train.drop('Unnamed: 0' , axis =  1 ,inplace = True)
train.index = range(1,len(train) + 1)
train.head(10)

Unnamed: 0,Commands,malware_value
1,execve ioctl ioctl prctl gettimeofday getpid g...,2
2,execve ioctl ioctl time getpid time getpid soc...,2
3,execve ioctl ioctl prctl time getpid time getp...,2
4,execve ioctl ioctl time getpid time getpid soc...,2
5,execve ioctl ioctl prctl time getpid time getp...,2
6,execve ioctl ioctl prctl gettimeofday getpid g...,2
7,execve ioctl ioctl time getpid time getpid soc...,2
8,execve ioctl ioctl prctl gettimeofday getpid g...,2
9,execve ioctl ioctl prctl gettimeofday getpid g...,2
10,execve ioctl ioctl prctl time getpid time getp...,2


## Adding new features:

In [4]:
train['length'] = train.Commands.apply(lambda x: len(str(x).split()))

In [5]:
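# restrict to the numeric columns; 'Commands' is text and has no correlation
train[['malware_value', 'length']].corr()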

Unnamed: 0,malware_value,length
malware_value,1.0,-0.248536
length,-0.248536,1.0


## Test train split:

In [6]:
X , y  = train.drop(['malware_value'], axis=1).to_numpy() , train.malware_value.to_numpy()

In [7]:
X_train,X_test,y_train,y_test=train_test_split(X , y,test_size=.3,random_state=42)

# Resampling Theory:

A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and / or adding more examples from the minority class (over-sampling).

![](https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/resampling.png)
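As a minimal sketch of the two strategies, the snippet below resamples a small made-up dataframe with `sklearn.utils.resample`. The toy data, the `resample_to` helper and its parameters are assumptions for illustration only and are not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.utils import resample

# Toy, made-up data: class "a" is the majority, class "b" the minority.
toy = pd.DataFrame({
    "feature": range(10),
    "label": ["a"] * 8 + ["b"] * 2,
})


def resample_to(df, label_col, n_samples, replace, random_state=27):
    """Hypothetical helper: bring every class in df to exactly n_samples rows."""
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) == n_samples:
            # Class is already at the target size, keep it untouched.
            parts.append(group)
        else:
            parts.append(resample(group, replace=replace,
                                  n_samples=n_samples,
                                  random_state=random_state))
    return pd.concat(parts)


counts = toy["label"].value_counts()

# Under-sampling: shrink every class to the minority size (without replacement).
under = resample_to(toy, "label", counts.min(), replace=False)

# Over-sampling: grow every class to the majority size (with replacement).
over = resample_to(toy, "label", counts.max(), replace=True)

print(under["label"].value_counts().to_dict())
print(over["label"].value_counts().to_dict())
```

Under-sampling discards majority rows, while over-sampling duplicates minority rows, so the over-sampled frame contains repeated samples.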

## Undersampling the training set down to the smallest class:

In [10]:
train_under_df=pd.concat([pd.DataFrame(X_train), pd.Series(y_train)],axis=1, ignore_index= True)
train_under_df.columns=['Commands', 'Length', 'malware_class']
train_under_df.head()

Unnamed: 0,Commands,Length,malware_class
0,execve ioctl ioctl time getpid time getpid soc...,471,2
1,execve ioctl ioctl time getpid time getpid soc...,532,2
2,execve ioctl ioctl fork exit EXIT time getpid ...,756,4
3,execve uname brk brk set_tls set_tid_address s...,3611,5
4,execve ioctl ioctl time getpid time getpid soc...,532,2


In [9]:
pd.DataFrame(train_under_df.malware_class.value_counts())

Unnamed: 0,malware_class
2,1581
1,1062
5,218
3,28
4,27


#### Since the smallest class (class 4) has only 27 samples, we undersample every other class down to 27 samples.

In [10]:
# Generating individual dataframes for each malware class:

Class1= train_under_df[train_under_df.malware_class==1]
Class2= train_under_df[train_under_df.malware_class==2]
Class3= train_under_df[train_under_df.malware_class==3]
Class4= train_under_df[train_under_df.malware_class==4]
Class5= train_under_df[train_under_df.malware_class==5]

In [11]:
downsampled = pd.DataFrame()

In [12]:
Classes= [Class1, Class2, Class3, Class5]
for i in Classes:
    i = resample(i,
                replace = False, # sample without replacement
                n_samples = len(Class4), # match minority n
                random_state = 27) # reproducible results
    downsampled = pd.concat([downsampled,i])
    
# append the untouched minority class (Class4) to the downsampled classes
downsampled = pd.concat([downsampled, Class4])

# checking counts
#downsampled.malware_class.value_counts()

In [13]:
downsampled.head()

Unnamed: 0,Commands,Length,malware_class
39,execve rt_sigprocmask rt_sigaction rt_sigactio...,27194,1
1549,execve ioctl ioctl open open open open time ge...,219,1
119,execve ioctl ioctl access geteuid prctl time g...,14739,1
1090,execve mmap cacheflush readlink cacheflush mma...,33490,1
2482,execve open open open open open open open rt_s...,115,1


In [14]:
y_train_down = downsampled.malware_class

## Oversampling the training set up to the largest class:

In [15]:
train_over_df=pd.concat([pd.DataFrame(X_train), pd.Series(y_train)],axis=1, ignore_index= True)
train_over_df.columns=['Commands', 'Length', 'malware_class']
train_over_df.head()

Unnamed: 0,Commands,Length,malware_class
0,execve ioctl ioctl time getpid time getpid soc...,471,2
1,execve ioctl ioctl time getpid time getpid soc...,532,2
2,execve ioctl ioctl fork exit EXIT time getpid ...,756,4
3,execve uname brk brk set_tls set_tid_address s...,3611,5
4,execve ioctl ioctl time getpid time getpid soc...,532,2


In [16]:
pd.DataFrame(train_over_df.malware_class.value_counts())

Unnamed: 0,malware_class
2,1581
1,1062
5,218
3,28
4,27


#### Since the largest class (class 2) has 1581 samples, we oversample every other class up to 1581 samples.

In [17]:
# Generating individual dataframes for each malware class:

Class1= train_over_df[train_over_df.malware_class==1]
Class2= train_over_df[train_over_df.malware_class==2]
Class3= train_over_df[train_over_df.malware_class==3]
Class4= train_over_df[train_over_df.malware_class==4]
Class5= train_over_df[train_over_df.malware_class==5]

In [18]:
upsampled = pd.DataFrame()

In [19]:
Classes= [Class1, Class4, Class3, Class5]
for i in Classes:
    i = resample(i,
                replace = True, # sample with replacement
                n_samples = len(Class2), # match majority n
                random_state = 27) # reproducible results

    upsampled = pd.concat([upsampled,i])

# append the untouched majority class (Class2) to the upsampled classes
upsampled = pd.concat([upsampled, Class2])




In [20]:
upsampled.head()

Unnamed: 0,Commands,Length,malware_class
2876,execve ioctl ioctl access geteuid prctl time g...,20877,1
1492,execve ioctl ioctl access geteuid prctl time g...,37275,1
2147,execve ioctl ioctl access geteuid32 prctl gett...,19153,1
2882,execve brk brk set_tls ioctl ioctl access gete...,11635,1
380,execve brk brk set_tls ioctl ioctl access gete...,7289,1


In [21]:
y_train_up = upsampled.malware_class

## TF-IDF and Feature Union

In [8]:
from sklearn.model_selection import KFold
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import mean_squared_log_error
import eli5



In [11]:
# We need a custom preprocessor to pick the correct field out of each row,
# but also want the default scikit-learn preprocessing (e.g. lowercasing).
def get_features(df):
    # note: this casts both columns to str in place on the caller's dataframe
    df[['Commands', 'Length']] = df[['Commands', 'Length']].astype(str)
    default_preprocessor = CountVectorizer().build_preprocessor()

    def build_preprocessor(field):
        field_idx = list(df.columns).index(field)
        return lambda x: default_preprocessor(x[field_idx])

    vectorizer = FeatureUnion([
        # each length value becomes a single numeric token
        ('Length', CountVectorizer(
            token_pattern=r'\d+',  # raw string avoids an invalid-escape warning
            preprocessor=build_preprocessor('Length'))),
        # TF-IDF over 2- to 5-grams of consecutive system calls
        ('Commands', TfidfVectorizer(
            ngram_range=(2, 5), sublinear_tf=True,
            preprocessor=build_preprocessor('Commands'))),
    ])
    # fit() returns the fitted vectorizer itself, so x_train_fit and
    # vectorizer refer to the same object
    x_train_fit = vectorizer.fit(df[['Commands', 'Length']].values)
    x_train_trans = vectorizer.transform(df[['Commands', 'Length']].values)

    return vectorizer, x_train_fit, x_train_trans

In [12]:
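# train_under_df still holds the full, unresampled training split
without_train = train_under_df

In [13]:
# the resampled variants are needed by the cross validation and test cells below
vectorizer_without, x_trainfit_without, x_train_trans_without = get_features(without_train)
vectorizer_under, x_trainfit_under, x_train_trans_under = get_features(downsampled)
vectorizer_up, x_trainfit_up, x_train_trans_up = get_features(upsampled)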
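To make the n-gram features more concrete, here is a small sketch of `TfidfVectorizer` applied to two made-up system call traces. The traces and the narrower `ngram_range=(2, 3)` are assumptions chosen to keep the vocabulary short; the notebook itself uses `(2, 5)` on the real `Commands` column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up traces for illustration; real traces are far longer.
traces = [
    "execve ioctl ioctl time getpid socket connect",
    "execve brk brk open read close",
]

# Each feature is a run of 2 or 3 consecutive system calls.
vec = TfidfVectorizer(ngram_range=(2, 3), sublinear_tf=True)
X = vec.fit_transform(traces)

# vocabulary_ maps each n-gram to its column index in X.
for ngram in sorted(vec.vocabulary_):
    print(ngram)

print(X.shape)
```

Because single calls such as `execve` appear in almost every trace, only sequences of calls are kept as features; the order of calls is what separates one malware family from another in this representation.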


## Checking for cross validation scores:

Three different kinds of models have been considered: 
- Linear Model: Logistic Regression
- Ensemble Model: LGBM Classification
- Instance based Model: K Nearest Neighbors

In [26]:
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier

In [27]:
models = [LogisticRegression(), LGBMClassifier(), KNeighborsClassifier()]

In [28]:
def cv_df_formation(x_val, y_val):
    from sklearn.model_selection import cross_val_score
    CV = 5

    # collect one (model, fold, accuracy) row per fold and per model
    entries = []
    for model in models:
        model_name = model.__class__.__name__
        accuracies = cross_val_score(model, x_val, y_val, scoring='accuracy', cv=CV)
        for fold_idx, accuracy in enumerate(accuracies):
            entries.append((model_name, fold_idx, accuracy))

    cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
    return cv_df

In [29]:
cv_df_without = cv_df_formation(x_train_trans_without, y_train)



In [30]:
cv_df_without

Unnamed: 0,model_name,fold_idx,accuracy
0,LogisticRegression,0,0.97099
1,LogisticRegression,1,0.958974
2,LogisticRegression,2,0.951973
3,LogisticRegression,3,0.982788
4,LogisticRegression,4,0.977625
5,LGBMClassifier,0,0.979522
6,LGBMClassifier,1,0.981197
7,LGBMClassifier,2,0.97084
8,LGBMClassifier,3,0.989673
9,LGBMClassifier,4,0.984509
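Per-fold accuracies are easier to compare once they are reduced to one mean and spread per model. The sketch below does this for the fold scores listed in the table above; the `summary` name is an assumption, and the same `groupby` call should work directly on `cv_df_without`.

```python
import pandas as pd

# Per-fold accuracies copied from the table above (no resampling).
cv_df = pd.DataFrame({
    "model_name": ["LogisticRegression"] * 5 + ["LGBMClassifier"] * 5,
    "accuracy": [0.97099, 0.958974, 0.951973, 0.982788, 0.977625,
                 0.979522, 0.981197, 0.97084, 0.989673, 0.984509],
})

# Mean and sample standard deviation of the accuracy, per model.
summary = cv_df.groupby("model_name")["accuracy"].agg(["mean", "std"])
print(summary.round(4))
```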


In [31]:
cv_df_under = cv_df_formation(x_train_trans_under, y_train_down)



In [32]:
cv_df_under

Unnamed: 0,model_name,fold_idx,accuracy
0,LogisticRegression,0,0.966667
1,LogisticRegression,1,0.966667
2,LogisticRegression,2,0.96
3,LogisticRegression,3,1.0
4,LogisticRegression,4,1.0
5,LGBMClassifier,0,0.966667
6,LGBMClassifier,1,0.966667
7,LGBMClassifier,2,0.96
8,LGBMClassifier,3,1.0
9,LGBMClassifier,4,0.96


In [33]:
cv_df_over = cv_df_formation(x_train_trans_up, y_train_up)



In [34]:
cv_df_over

Unnamed: 0,model_name,fold_idx,accuracy
0,LogisticRegression,0,0.988013
1,LogisticRegression,1,0.983544
2,LogisticRegression,2,0.987975
3,LogisticRegression,3,0.992405
4,LogisticRegression,4,0.991139
5,LGBMClassifier,0,0.995584
6,LGBMClassifier,1,0.991772
7,LGBMClassifier,2,0.993038
8,LGBMClassifier,3,0.998734
9,LGBMClassifier,4,0.996835


## Applying LGBM on X_Test for all three. (Without parameter tuning)


In [15]:
from lightgbm import LGBMClassifier

In [16]:
model_without, model_down, model_up = LGBMClassifier(), LGBMClassifier(), LGBMClassifier()

In [17]:
model_without.fit(x_train_trans_without,y_train)
model_down.fit(x_train_trans_under, y_train_down)
model_up.fit(x_train_trans_up, y_train_up)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

Converting X_test into a dataframe so that its columns can be cast to str and then transformed

In [18]:
X_test = pd.DataFrame(X_test)
X_test.columns = ['Commands', 'Length']
X_test[['Commands', 'Length']]= X_test[['Commands', 'Length']].astype(str)

In [19]:
X_test_trans = x_trainfit_without.transform(X_test[['Commands', 'Length']].values)

In [20]:
y_pred_without = model_without.predict(x_trainfit_without.transform(X_test[['Commands', 'Length']].values))
y_pred_down = model_down.predict(x_trainfit_under.transform(X_test.values))
y_pred_up = model_up.predict(x_trainfit_up.transform(X_test.values))

### Accuracy Metrics for each malware_class

In [39]:
from sklearn import metrics

**Metrics for model without resampling**

In [40]:
labels_list = ['1','2','3','4', '5']
print(metrics.classification_report(y_test, y_pred_without,target_names= labels_list))

              precision    recall  f1-score   support

           1       0.99      0.97      0.98       479
           2       0.98      0.99      0.99       684
           3       1.00      1.00      1.00         6
           4       0.92      0.92      0.92        13
           5       1.00      1.00      1.00        69

    accuracy                           0.98      1251
   macro avg       0.98      0.98      0.98      1251
weighted avg       0.98      0.98      0.98      1251



**Metrics for model with downsampling**

In [41]:
print(metrics.classification_report(y_test, y_pred_down,target_names= labels_list))

              precision    recall  f1-score   support

           1       0.98      0.94      0.96       479
           2       0.97      0.98      0.97       684
           3       1.00      1.00      1.00         6
           4       0.59      1.00      0.74        13
           5       1.00      1.00      1.00        69

    accuracy                           0.97      1251
   macro avg       0.91      0.98      0.94      1251
weighted avg       0.97      0.97      0.97      1251



**Metrics for model with upsampling**

In [42]:
print(metrics.classification_report(y_test, y_pred_up,target_names= labels_list))

              precision    recall  f1-score   support

           1       0.99      0.96      0.97       479
           2       0.98      0.99      0.98       684
           3       1.00      1.00      1.00         6
           4       0.92      0.92      0.92        13
           5       1.00      1.00      1.00        69

    accuracy                           0.98      1251
   macro avg       0.98      0.98      0.98      1251
weighted avg       0.98      0.98      0.98      1251



### Significance of the metric scores:

Precision: True Positives / (True Positives + False Positives). Of all the samples the model assigned to a malware class, this is the proportion that actually belong to that class.

Recall: True Positives / (True Positives + False Negatives). Of all the samples that actually belong to a malware class, this is the proportion the model detected.

F-1 score: The harmonic mean of precision and recall, given by the formula: 2*(Precision * Recall)/(Precision+Recall)
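As a worked example, the class 4 row of the downsampled model can be reproduced from raw counts, using precision = TP / (TP + FP) and recall = TP / (TP + FN). A support of 13 with recall 1.00 means 13 true positives and 0 false negatives. The 9 false positives are an assumption inferred from the reported precision of 0.59, not a number taken from a confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw prediction counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 4, downsampled model: tp and fn follow from support and recall;
# fp = 9 is inferred from the reported precision (assumption).
p, r, f1 = precision_recall_f1(tp=13, fp=9, fn=0)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.59 recall=1.00 f1=0.74
```

The example shows why downsampling lowers the class 4 F1 score: every class 4 sample is found, but a handful of samples from the large classes are also labelled as class 4, and with only 13 true members those few errors weigh heavily on precision.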

## Finding the feature weights of each model:

#### Model Without any resampling:

In [43]:
eli5.show_weights(model_without, vec=vectorizer_without, top=10, feature_filter=lambda x: x != '<BIAS>')

Weight,Feature
0.2695,Commands__connect _newselect
0.2243,Commands__brk socket
0.1865,Commands__brk brk brk brk stat64
0.0555,Commands___newselect read
0.0304,Commands__getsockname close
0.0248,Commands__exit rt_sigaction socket
0.0247,Commands__exit rt_sigaction
0.0241,Commands__nanosleep time connect
0.0107,Commands__bind listen
0.0097,Commands__chdir rt_sigaction socket


In [44]:
eli5.show_prediction(model_without, doc=without_train.values[3], vec=vectorizer_without)

Contribution?,Feature
0.185,Commands__exit setsid chdir
0.062,Commands__chdir rt_sigaction socket
0.046,Commands__fork write exit exit setsid
0.028,Commands__sigchld write
0.017,Commands__fcntl connect
0.017,Commands__rt_sigaction socket
0.013,Commands__rt_sigaction nanosleep close
0.011,Commands__exit setsid
0.003,Commands__fork write
0.003,Commands__gettimeofday getpid

Contribution?,Feature
0.467,Commands__recvfrom _newselect
0.343,Commands__sendto sendto
0.121,Commands__close brk
0.042,Commands__close open read
0.01,Commands__exit rt_sigaction socket
0.003,Commands__gettimeofday brk
0.002,Commands__brk brk open open
0.002,Commands__cacheflush mmap cacheflush
0.001,Commands__execve brk brk set_tls
0.0,Commands__getpid time

Contribution?,Feature
0.0,Commands__ioctl ioctl
0.0,Commands__getpid socket
0.0,Commands__ioctl read
0.0,Commands__connect close
0.0,Commands__close fork
0.0,Commands___newselect getsockopt rt_sigprocmask
0.0,Commands__read close open ioctl
0.0,Commands__ioctl brk
0.0,Commands__wait4 sigchld
0.0,Commands__connect getsockname

Contribution?,Feature
0.021,Commands__fcntl fcntl
0.001,Commands__connect rt_sigprocmask
0.001,Commands__connect rt_sigprocmask rt_sigaction rt_sigprocmask
0.001,Commands__ioctl ioctl open
0.0,Commands__close open read close
0.0,Commands__close close socket
0.0,Commands__open open open socket
0.0,Commands__fcntl socket
0.0,Commands__socket connect
0.0,Commands__cacheflush cacheflush readlink

Contribution?,Feature
11.082,Commands: Highlighted in text (sum)
0.0,Length__427
0.0,Length__357
-0.0,Commands___newselect ioctl close nanosleep
-0.0,Commands__close munmap uname
-0.001,Commands__close open fcntl64
-6.749,<BIAS>


#### Model with downsampling:

In [45]:
eli5.show_weights(model_down, vec=vectorizer_under, top=15, feature_filter=lambda x: x != '<BIAS>')

Weight,Feature
0.2159,Commands___newselect read
0.2091,Commands__brk brk brk brk stat64
0.1435,Commands__brk socket
0.1142,Commands__connect _newselect
0.0758,Commands__ioctl time time
0.0756,Commands__getppid open
0.0501,Commands__open open
0.0243,Commands___newselect getsockopt
0.0143,Commands__getppid times
0.0143,Commands__ioctl brk brk


In [46]:
eli5.show_prediction(model_down, doc=downsampled.values[5], vec=vectorizer_under)

Contribution?,Feature
7.421,Commands: Highlighted in text (sum)
0.264,Commands__socket ioctl
-0.399,Commands__open open
-3.34,<BIAS>

Contribution?,Feature
1.819,Commands__open open
0.032,Commands__rt_sigprocmask rt_sigaction
0.002,Commands__execve ioctl
0.001,Commands__time getpid
0.001,Commands__getpid getppid
0.0,Commands__rt_sigaction rt_sigprocmask
-0.0,Commands__socket ioctl
-0.0,Commands__read read close open
-0.0,Commands__ioctl ioctl open
-0.0,Commands__open fstat64

Contribution?,Feature
0.012,Commands: Highlighted in text (sum)
0.0,Commands__open open
-0.0,Commands__exit exit
-0.0,Commands__read close open
-0.0,Commands__close open
-0.019,Commands__rt_sigprocmask rt_sigaction
-2.271,Commands___newselect read
-4.499,<BIAS>

Contribution?,Feature
0.861,Commands__fcntl fcntl
0.656,Commands__rt_sigprocmask rt_sigaction rt_sigaction
0.008,Commands__close close
0.003,Commands__exit exit fork
0.002,Commands__rt_sigaction rt_sigprocmask nanosleep
0.002,Commands__rt_sigaction fork
0.001,Commands__brk read
0.001,Commands__read read close open
0.001,Commands__time getpid
0.0,Commands___newselect getsockopt rt_sigprocmask

Contribution?,Feature
0.008,Commands: Highlighted in text (sum)
0.0,Commands__open open
-0.0,Commands__exit exit
-0.0,Commands__read read close rt_sigaction
-0.0,Commands__time getpid getppid
-0.0,Commands__getpid getppid
-0.0,Commands__fork exit
-0.0,Commands__exit setsid
-0.001,Commands__read read close open
-2.26,Commands__brk brk brk brk stat64



#### Model with upsampling:

In [47]:
eli5.show_weights(model_up, vec=vectorizer_up, top=10, feature_filter=lambda x: x != '<BIAS>')

KeyboardInterrupt: 

## Applying Hyperopt.

In [49]:
from hyperopt import fmin, tpe, hp, anneal, Trials
from sklearn.model_selection import KFold, cross_val_score

In [46]:
random_state=42
n_iter=100

num_folds=5
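# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise an error if it is set without shuffling, so it is omitted here
kf = KFold(n_splits=num_folds)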

In [47]:

def gb_mse_cv_LGB(params, random_state=random_state, cv=kf, X=x_train_trans_without, y=y_train):
    # the function receives one sampled set of hyperparameters in "params";
    # hyperopt samples floats, so integer-valued parameters are cast explicitly
    params = {'n_estimators': int(params['n_estimators']),
              'max_depth': int(params['max_depth']),
              'learning_rate': params['learning_rate'],
              'num_leaves': int(params['num_leaves']),
              'min_child_samples': int(params['min_child_samples']),
              'reg_alpha': params['reg_alpha'],
              'reg_lambda': params['reg_lambda'],
              'subsample_freq': int(params['subsample_freq']),
              'subsample': params['subsample']
             }

    # we use these params to create a new LGBMClassifier
    model = LGBMClassifier(random_state=random_state, **params)

    # and then run cross validation with the same folds as before;
    # fmin minimizes, so the mean accuracy is negated
    score = -cross_val_score(model, X, y, cv=cv, scoring="accuracy", n_jobs=-1).mean()

    return score



In [50]:
space={'n_estimators': hp.quniform('n_estimators', 100, 2000, 1),
       'max_depth': hp.quniform('max_depth', 2, 25, 1),
       'learning_rate': hp.loguniform('learning_rate', 0, 4),
       'num_leaves': hp.quniform('num_leaves', 11, 450, 1),
       'min_child_samples': hp.quniform('min_child_samples', 10, 40, 1),
       'reg_alpha': hp.loguniform('reg_alpha', 0, 4),
       'reg_lambda': hp.loguniform('reg_lambda', 0, 5),
       'subsample_freq': hp.quniform('subsample_freq', 2, 20, 1),
       'subsample': hp.quniform('subsample', 0, 1, 0.1)
      }

# trials will contain logging information
trials = Trials()

best=fmin(fn=gb_mse_cv_LGB, # function to optimize
          space=space, 
          algo=anneal.suggest, # optimization algorithm (simulated annealing), hyperopt will select its parameters automatically
          max_evals=n_iter, # maximum number of iterations
          trials=trials, # logging
          rstate=np.random.RandomState(random_state) # fixing random state for the reproducibility
         )

# computing the score on the test set
from sklearn.metrics import accuracy_score

# refit a classifier with the best hyperparameters found by the search
model = LGBMClassifier(random_state=random_state, n_estimators=int(best['n_estimators']),
                       max_depth=int(best['max_depth']), learning_rate=best['learning_rate'],
                       num_leaves=int(best['num_leaves']), min_child_samples=int(best['min_child_samples']),
                       reg_lambda=best['reg_lambda'], reg_alpha=best['reg_alpha'], subsample=best['subsample'],
                       subsample_freq=int(best['subsample_freq'])
                      )
model.fit(x_train_trans_without, y_train)
tpe_test_score = accuracy_score(y_test, model.predict(X_test_trans))
print(tpe_test_score)

100%|██████████| 100/100 [2:27:11<00:00, 88.32s/it, best loss: -0.983195681289504]  
0.9856115107913669
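One detail of the search space is worth checking when reusing it: `hp.loguniform(label, low, high)` samples `exp(uniform(low, high))`, so its bounds are given in log space. The sketch below uses only the standard library to show which values the bounds `(0, 4)` correspond to, and how to derive bounds for a target range. The helper names and the 0.01 to 0.3 target range are hypothetical and do not come from this notebook.

```python
import math


def loguniform_range(low, high):
    """Value range covered by exp(uniform(low, high))."""
    return math.exp(low), math.exp(high)


def loguniform_args(min_value, max_value):
    """Log-space bounds that make loguniform cover [min_value, max_value]."""
    return math.log(min_value), math.log(max_value)


# Bounds used for learning_rate and reg_alpha in the search space above.
lo_val, hi_val = loguniform_range(0, 4)
print(f"loguniform(0, 4) covers {lo_val:.1f} to {hi_val:.1f}")

# Hypothetical target range, shown only to illustrate the conversion.
low, high = loguniform_args(0.01, 0.3)
print(f"use low={low:.3f}, high={high:.3f} to cover 0.01 to 0.3")
```

With bounds `(0, 4)` every sampled learning rate is at least 1, which is much larger than the LightGBM default of 0.1 shown in the model output earlier, so narrowing this range may be worth trying in a future search.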
