## A Machine Learning approach for Malware Detection

A Machine Learning approach for classifying a file as Malicious or Benign.

This approach trains five classification algorithms and compares their accuracy on a held-out test set before choosing one for prediction. The models tried are RandomForest, DecisionTree, AdaBoost, Gaussian Naive Bayes, and Gradient Boosting.

To classify an unseen file, its characteristics must first be extracted. Python's pefile library (the pefile.PE class) is used to build the feature vector, and the already trained model then predicts the class of the file.

Resources

https://github.com/sophos/SOREL-20M

https://learn.microsoft.com/en-us/windows/win32/debug/pe-format

https://resources.infosecinstitute.com/topic/2-malware-researchers-handbook-demystifying-pe-file/

https://axcheron.github.io/pe-format-manipulation-with-pefile/

In [1]:
import os
import pandas
import numpy
import pickle
import pefile
import joblib
import sklearn.ensemble as ek
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn import tree
from sklearn.linear_model import LinearRegression

Loading the initial dataset delimited by |

In [2]:
dataset = pandas.read_csv('./data.csv',sep='|', low_memory=False)

In [3]:
dataset.head()

| | Name | md5 | Machine | SizeOfOptionalHeader | Characteristics | MajorLinkerVersion | MinorLinkerVersion | SizeOfCode | SizeOfInitializedData | SizeOfUninitializedData | ... | ResourcesNb | ResourcesMeanEntropy | ResourcesMinEntropy | ResourcesMaxEntropy | ResourcesMeanSize | ResourcesMinSize | ResourcesMaxSize | LoadConfigurationSize | VersionInformationSize | legitimate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | memtest.exe | 631ea355665f28d4707448e442fbf5b8 | 332 | 224 | 258 | 9 | 0 | 361984 | 115712 | 0 | ... | 4 | 3.262823 | 2.568844 | 3.537939 | 8797.0 | 216 | 18032 | 0 | 16 | 1 |
| 1 | ose.exe | 9d10f99a6712e28f8acd5641e3a7ea6b | 332 | 224 | 3330 | 9 | 0 | 130560 | 19968 | 0 | ... | 2 | 4.250461 | 3.420744 | 5.080177 | 837.0 | 518 | 1156 | 72 | 18 | 1 |
| 2 | setup.exe | 4d92f518527353c0db88a70fddcfd390 | 332 | 224 | 3330 | 9 | 0 | 517120 | 621568 | 0 | ... | 11 | 4.426324 | 2.846449 | 5.271813 | 31102.272727 | 104 | 270376 | 72 | 18 | 1 |
| 3 | DW20.EXE | a41e524f8d45f0074fd07805ff0c9b12 | 332 | 224 | 258 | 9 | 0 | 585728 | 369152 | 0 | ... | 10 | 4.364291 | 2.669314 | 6.40072 | 1457.0 | 90 | 4264 | 72 | 18 | 1 |
| 4 | dwtrig20.exe | c87e561258f2f8650cef999bf643a731 | 332 | 224 | 258 | 9 | 0 | 294912 | 247296 | 0 | ... | 2 | 4.3061 | 3.421598 | 5.190603 | 1074.5 | 849 | 1300 | 72 | 18 | 1 |


In [4]:
dataset.describe()

| | Machine | SizeOfOptionalHeader | Characteristics | MajorLinkerVersion | MinorLinkerVersion | SizeOfCode | SizeOfInitializedData | SizeOfUninitializedData | AddressOfEntryPoint | BaseOfCode | ... | ResourcesNb | ResourcesMeanEntropy | ResourcesMinEntropy | ResourcesMaxEntropy | ResourcesMeanSize | ResourcesMinSize | ResourcesMaxSize | LoadConfigurationSize | VersionInformationSize | legitimate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | ... | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 | 138047.0 |
| mean | 4259.069274 | 225.845632 | 4444.145994 | 8.619774 | 3.819286 | 242595.6 | 450486.7 | 100952.5 | 171956.1 | 57798.45 | ... | 22.0507 | 4.000127 | 2.434541 | 5.52161 | 55450.93 | 18180.82 | 246590.3 | 465675.0 | 12.363115 | 0.29934 |
| std | 10880.347245 | 5.121399 | 8186.782524 | 4.088757 | 11.862675 | 5754485.0 | 21015990.0 | 16352880.0 | 3430553.0 | 5527658.0 | ... | 136.494244 | 1.112981 | 0.815577 | 1.597403 | 7799163.0 | 6502369.0 | 21248600.0 | 26089870.0 | 6.798878 | 0.457971 |
| min | 332.0 | 224.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 332.0 | 224.0 | 258.0 | 8.0 | 0.0 | 30208.0 | 24576.0 | 0.0 | 12721.0 | 4096.0 | ... | 5.0 | 3.458505 | 2.178748 | 4.828706 | 956.0 | 48.0 | 2216.0 | 0.0 | 13.0 | 0.0 |
| 50% | 332.0 | 224.0 | 258.0 | 9.0 | 0.0 | 113664.0 | 263168.0 | 0.0 | 52883.0 | 4096.0 | ... | 6.0 | 3.729824 | 2.458492 | 5.317552 | 2708.154 | 48.0 | 9640.0 | 72.0 | 15.0 | 0.0 |
| 75% | 332.0 | 224.0 | 8226.0 | 10.0 | 0.0 | 120320.0 | 385024.0 | 0.0 | 61578.0 | 4096.0 | ... | 13.0 | 4.233051 | 2.696833 | 6.502239 | 6558.429 | 132.0 | 23780.0 | 72.0 | 16.0 | 1.0 |
| max | 34404.0 | 352.0 | 49551.0 | 255.0 | 255.0 | 1818587000.0 | 4294966000.0 | 4294941000.0 | 1074484000.0 | 2028711000.0 | ... | 7694.0 | 7.999723 | 7.999723 | 8.0 | 2415919000.0 | 2415919000.0 | 4294903000.0 | 4294967000.0 | 26.0 | 1.0 |


Number of malicious files vs Legitimate files in the training set

In [5]:
dataset.groupby(dataset['legitimate']).size()

legitimate
0    96724
1    41323
dtype: int64
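
The counts above show that the classes are imbalanced, which matters when reading accuracy figures. The short arithmetic sketch below uses only the two printed counts; the variable names are illustrative and not part of the notebook. It derives the share of each class and the accuracy that a trivial predictor would reach by always guessing the most frequent class.

```python
# Class counts copied from the groupby output above.
counts = {0: 96724, 1: 41323}  # 0 = malicious, 1 = legitimate

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a baseline that always predicts the most frequent class.
majority_baseline = max(counts.values()) / total

print(total)
print(round(shares[1], 5))
print(round(majority_baseline, 5))
```

Any classifier that scores close to this baseline is doing little better than guessing the majority class.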

Dropping the file name, the MD5 (message digest) and the label column to form the feature matrix X; the label column becomes the target y

In [6]:
X = dataset.drop(['Name','md5','legitimate'],axis=1).values
y = dataset['legitimate'].values

ExtraTreesClassifier

ExtraTreesClassifier fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.

In [7]:
extratrees = ek.ExtraTreesClassifier().fit(X,y)
model = SelectFromModel(extratrees, prefit=True)
X_new = model.transform(X)
nbfeatures = X_new.shape[1]

ExtraTreesClassifier helps select the features that are useful for classifying a file as either malicious or legitimate.

In this run, 13 features are selected. Because the trees are randomized and no random_state is set, the number can differ between runs.

In [8]:
nbfeatures

13
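
For tree ensembles, SelectFromModel with its default threshold keeps the features whose importance is at least the mean importance. The pure-Python sketch below illustrates that rule on made-up numbers; `select_above_mean` and the sample values are hypothetical, and this is not scikit-learn's implementation.

```python
def select_above_mean(names, importances):
    """Keep features whose importance is at least the mean importance."""
    threshold = sum(importances) / len(importances)
    # The original column order is preserved; nothing is sorted here.
    return [n for n, w in zip(names, importances) if w >= threshold]

# Hypothetical importances, for illustration only (not the notebook's values).
names = ["Machine", "SizeOfCode", "DllCharacteristics", "ImageBase"]
importances = [0.30, 0.05, 0.45, 0.20]

selected = select_above_mean(names, importances)
print(selected)
```

Note that the selected columns keep their original order rather than being sorted by importance. In scikit-learn, the boolean mask returned by the selector's get_support() method can be used to recover the selected column names in that same order.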

Train/test split

train_test_split divides the dataset into random train and test subsets. This is a single hold-out split rather than k-fold cross-validation. test_size=0.2 sets the proportion of the dataset to include in the test split.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y ,test_size=0.2)
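
The call above produces one hold-out split. For comparison, k-fold cross-validation rotates the held-out part so that every sample is tested exactly once. The sketch below generates fold indices in plain Python to show the idea; `kfold_indices` is a hypothetical helper, and for real use scikit-learn provides KFold and cross_val_score in sklearn.model_selection.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled, near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(kfold_indices(10, 5))
for train_idx, test_idx in splits:
    print(len(train_idx), sorted(test_idx))
```

With k=5, each fold holds out 20% of the samples, the same proportion as test_size=0.2.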

In [10]:
features = []
index = numpy.argsort(extratrees.feature_importances_)[::-1][:nbfeatures]

The features identified by ExtraTreesClassifier

In [11]:
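for f in range(nbfeatures):
    # Offset 2 skips 'Name' and 'md5', which were dropped when building X
    name = dataset.columns[2+index[f]]
    print("%d. feature %s (%f)" % (f + 1, name, extratrees.feature_importances_[index[f]]))
    # Save the same feature name that was just printed
    features.append(name)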

1. feature DllCharacteristics (0.163204)
2. feature Characteristics (0.101031)
3. feature Machine (0.095665)
4. feature VersionInformationSize (0.087507)
5. feature Subsystem (0.061024)
6. feature SectionsMaxEntropy (0.059663)
7. feature ResourcesMaxEntropy (0.049287)
8. feature ImageBase (0.047794)
9. feature SizeOfOptionalHeader (0.042024)
10. feature MajorSubsystemVersion (0.038406)
11. feature ResourcesMinEntropy (0.031099)
12. feature MajorOperatingSystemVersion (0.019566)
13. feature SizeOfStackReserve (0.018639)
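
Several of the selected features are entropy measures (SectionsMaxEntropy, ResourcesMaxEntropy, ResourcesMinEntropy). The describe() output shows these columns ranging from 0 to 8, which is consistent with Shannon entropy in bits per byte. The sketch below shows that calculation; how the dataset's entropy columns were actually computed is an assumption here, and `shannon_entropy` is an illustrative name.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    # Sum p * log2(1/p) over the observed byte values.
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())

constant = shannon_entropy(b"\x00" * 64)        # one repeated byte value
uniform = shannon_entropy(bytes(range(256)))    # every byte value once
print(constant, uniform)
```

Values near 8 indicate data that looks random, which is often the case for compressed or encrypted content.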


Building the machine learning models below

In [12]:
model = { "DecisionTree":tree.DecisionTreeClassifier(max_depth=10),
         "RandomForest":ek.RandomForestClassifier(n_estimators=50),
         "Adaboost":ek.AdaBoostClassifier(n_estimators=50),
         "GradientBoosting":ek.GradientBoostingClassifier(n_estimators=50),
         "GNB":GaussianNB(),
         #"LinearRegression":LinearRegression()   
}

In [13]:
results = {}
for algo in model:
    clf = model[algo]
    clf.fit(X_train,y_train)
    score = clf.score(X_test,y_test)
    print ("%s : %s " %(algo, score))
    results[algo] = score

DecisionTree : 0.9915972473741398 
RandomForest : 0.9940963419051069 
Adaboost : 0.985729808040565 
GradientBoosting : 0.9881926838102137 
GNB : 0.6977906555595799 


In [14]:
winner = max(results, key=results.get)
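
As a small worked example of the selection step, the sketch below applies the same `max(results, key=results.get)` expression to the scores printed above. The `ranking` list is an illustrative addition and not part of the notebook.

```python
# Scores copied from the output of the training loop above.
results = {
    "DecisionTree": 0.9915972473741398,
    "RandomForest": 0.9940963419051069,
    "Adaboost": 0.985729808040565,
    "GradientBoosting": 0.9881926838102137,
    "GNB": 0.6977906555595799,
}

# The key with the highest score wins.
winner = max(results, key=results.get)
# Full ranking, best first.
ranking = sorted(results, key=results.get, reverse=True)

print(winner)  # RandomForest
print(ranking)
```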

In [15]:
joblib.dump(model[winner],'./classifier.pkl')

['./classifier.pkl']

In [16]:
open('./features.pkl', 'wb').write(pickle.dumps(features))

251

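Calculating the false positive and false negative rates on the full dataset. Because the label 1 means legitimate, a false positive here is a malicious file predicted as legitimate, and a false negative is a legitimate file predicted as malicious. The rates are computed on X_new, which includes the training rows, so they are likely to be optimistic.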

In [17]:
clf = model[winner]
res = clf.predict(X_new)
mt = confusion_matrix(y, res)
print("False positive rate : %f %%" % ((mt[0][1] / float(sum(mt[0])))*100))
print('False negative rate : %f %%' % ( (mt[1][0] / float(sum(mt[1]))*100)))

False positive rate : 0.093048 %
False negative rate : 0.210537 %
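
The two rates come straight from the rows of the confusion matrix. The sketch below repeats the calculation on a small made-up matrix so the arithmetic is easy to follow; `error_rates` and the counts are hypothetical. Applying the same function to a matrix built from X_test and y_test would give an estimate on data the model has not seen.

```python
def error_rates(matrix):
    """Return (false_positive_rate, false_negative_rate) for a 2x2 matrix.

    Rows are true labels and columns are predicted labels, ordered [0, 1],
    which matches the layout returned by sklearn.metrics.confusion_matrix.
    """
    (tn, fp), (fn, tp) = matrix
    return fp / (tn + fp), fn / (fn + tp)

# Hypothetical counts for illustration only (not the notebook's matrix).
example = [[990, 10],
           [5, 495]]
fpr, fnr = error_rates(example)
print("False positive rate : %f %%" % (fpr * 100))
print("False negative rate : %f %%" % (fnr * 100))
```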


In [18]:
# Load the classifier
clf = joblib.load('./classifier.pkl')
# Load the list of feature names
with open('./features.pkl', 'rb') as fh:
    features = pickle.load(fh)


Testing with unseen file

To classify an unseen file, its characteristics must first be extracted. Python's pefile library is used to build the feature vector, and the already trained model then predicts the class of the file. Each cell below runs malware_test.py on one sample.
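
The malware_test.py script is not shown here, so the following is only a sketch of how a feature vector might be assembled, assuming the header fields are read from a parsed pefile.PE object. The function names are hypothetical, only a handful of fields are covered, and the demo uses a stand-in object with made-up values so that it runs without pefile or a sample file.

```python
from types import SimpleNamespace

def header_features(pe):
    """Collect a few header fields from a parsed PE object into a dict."""
    return {
        "Machine": pe.FILE_HEADER.Machine,
        "SizeOfOptionalHeader": pe.FILE_HEADER.SizeOfOptionalHeader,
        "Characteristics": pe.FILE_HEADER.Characteristics,
        "ImageBase": pe.OPTIONAL_HEADER.ImageBase,
        "Subsystem": pe.OPTIONAL_HEADER.Subsystem,
        "DllCharacteristics": pe.OPTIONAL_HEADER.DllCharacteristics,
    }

def to_vector(info, feature_names):
    """Order the extracted values to match a saved list of feature names."""
    return [info[name] for name in feature_names]

def extract_from_file(path):
    """Parse a real file; requires the third-party pefile package."""
    import pefile  # imported lazily so the stub demo below runs without it
    return header_features(pefile.PE(path))

# Stand-in object with made-up values, used only to exercise the functions.
stub = SimpleNamespace(
    FILE_HEADER=SimpleNamespace(
        Machine=332, SizeOfOptionalHeader=224, Characteristics=258),
    OPTIONAL_HEADER=SimpleNamespace(
        ImageBase=4194304, Subsystem=2, DllCharacteristics=33088),
)
vector = to_vector(header_features(stub), ["DllCharacteristics", "Machine"])
print(vector)
```

The order of the vector has to match the column order the classifier was trained on, which is why the values are looked up by name from the saved feature list.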


In [29]:
%run malware_test.py "./samples/mal.exe"

The file mal.exe is malicious


In [30]:
%run malware_test.py "./samples/TLS_Static.exe"

The file TLS_Static.exe is malicious


In [31]:
%run malware_test.py "./samples/putty.exe"

The file putty.exe is benign


In [32]:
%run malware_test.py "./samples/SharpToken.exe"

The file SharpToken.exe is malicious


In [33]:
%run malware_test.py "./samples/psftp.exe"

The file psftp.exe is benign
