# Cloud-based PE Malware Detection API:

The purpose of this term project is to demonstrate your practical skills in implementing and deploying machine learning models for malware classification. The technical implementation of this project is comprised of three main tasks that need to be completed sequentially:

**Task 1 - Training:** In this task, you will be creating and training a deep neural network based on the MalConv architecture to classify PE files as malware or benign. As for the dataset, you will be using the EMBER-2017 v2 ( https://github.com/endgameinc/ember ).

If you explore the EMBER repository, you will find that it comes with a sample implementation of MalConv ( https://github.com/endgameinc/ember/tree/master/malconv ). This sample is a wonderful resource to base your implementation on. However, note that this code is 2 years (i.e., a lifetime in ML) old, and does not precisely conform to the requirements of this project.

**Implementation:** The model must be implemented in Python 3.x using TensorFlow (1.x or 2.x) and Keras, and needs to be coded and documented in a Jupyter Notebook. Additionaly, add textual description blocks to the notebook to document and explain the different parts of your code.

**Training:** This model may take a long time to train on your personal computers (from 7-8 hours to a couple of days, depending on the config), unless you already have a powerful NVIDIA GPU (1080 TI or better). Alternatively, you can use the cloud platforms to speed up the training: Google Colab or AWS Sagemaker.

# Deep Neural Network Model on EMBER Malware Dataset:

The EMBER dataset is a collection of features from PE files that serve as a benchmark dataset for researchers. <br>
In this notebook, the EMBER-2017 v2 dataset is used which contains features from 1.1 million PE files scanned in or before 2017.

In [0]:
# Importing required modules

import pandas as pd

## Dataset Extraction:
To use the dataset in this notebook, simple download and upload didn't work as the URL to download the dataset is detected as untrused by Google. So, downloading the EMBER 2017 v2 dataset to Colab notebook using wget command and double unzipping it to get the 

In [0]:
!wget https://pubdata.endgame.com/ember/ember_dataset_2017_2.tar.bz2 --no-check-certificate

--2020-04-28 02:36:25--  https://pubdata.endgame.com/ember/ember_dataset_2017_2.tar.bz2
Resolving pubdata.endgame.com (pubdata.endgame.com)... 64.250.189.21
Connecting to pubdata.endgame.com (pubdata.endgame.com)|64.250.189.21|:443... connected.
  Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 1751237573 (1.6G) [application/octet-stream]
Saving to: ‘ember_dataset_2017_2.tar.bz2.1’

      ember_dataset   0%[                    ]   5.33M  6.38MB/s               ^C


In [0]:
# Decompressing a .bz2 file
!bzip2 -d ember_dataset_2017_2.tar.bz2

In [0]:
# Extracting from tar file
!tar -xvf ember_dataset_2017_2.tar

ember_2017_2/
ember_2017_2/train_features_1.jsonl
ember_2017_2/train_features_0.jsonl
ember_2017_2/train_features_3.jsonl
ember_2017_2/test_features.jsonl
ember_2017_2/train_features_5.jsonl
ember_2017_2/train_features_4.jsonl
ember_2017_2/train_features_2.jsonl


All the required dataset files are extracted.

Now to work with the EMBER dataset, we need to clone its github repository whihc can be done by following code:

In [0]:
!git clone https://github.com/endgameinc/ember 

Cloning into 'ember'...
remote: Enumerating objects: 23, done.[K
remote: Counting objects: 100% (23/23), done.[K
remote: Compressing objects: 100% (17/17), done.[K
remote: Total 215 (delta 9), reused 15 (delta 6), pack-reused 192[K
Receiving objects: 100% (215/215), 11.35 MiB | 13.80 MiB/s, done.
Resolving deltas: 100% (90/90), done.


In [0]:
!mv ember ember-master

In [0]:
!cp -r ember-master/* .

In [0]:
!pip install -r requirements.txt
!python setup.py install

Collecting lief>=0.9.0
[?25l  Downloading https://files.pythonhosted.org/packages/f7/38/e6bf942cf2ee073bf81fa3324bca35409175312b7b72d71919c8fc8e547b/lief-0.10.1-cp36-cp36m-manylinux1_x86_64.whl (3.5MB)
[K     |████████████████████████████████| 3.5MB 6.6MB/s 
Installing collected packages: lief
Successfully installed lief-0.10.1
running install
running bdist_egg
running egg_info
creating ember.egg-info
writing ember.egg-info/PKG-INFO
writing dependency_links to ember.egg-info/dependency_links.txt
writing requirements to ember.egg-info/requires.txt
writing top-level names to ember.egg-info/top_level.txt
writing manifest file 'ember.egg-info/SOURCES.txt'
reading manifest file 'ember.egg-info/SOURCES.txt'
writing manifest file 'ember.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/ember
copying ember/__init__.py -> build/lib/ember
copying ember/features.py -> build/lib/

The LIEF project is used to extract features from PE files included in the EMBER dataset. Raw features are extracted to JSON format. Vectorized features can be produced from these raw features and saved in binary format from which they can be converted to CSV, dataframe, or any other format. 

In [0]:
import ember
ember.create_vectorized_features("/content/ember_2017_2/")
ember.create_metadata("/content/ember_2017_2/")

In [0]:
import ember
data_path = '/content/ember_2017_2/'
emberdf = ember.read_metadata(data_path)
emberdf.head()

  mask |= (ar1 == a)


Unnamed: 0,sha256,appeared,subset,label
0,0abb4fda7d5b13801d63bee53e5e256be43e141faa077a...,2006-12,train,0
1,d4206650743b3d519106dea10a38a55c30467c3d9f7875...,2006-12,train,0
2,c9cafff8a596ba8a80bafb4ba8ae6f2ef3329d95b85f15...,2007-01,train,0
3,7f513818bcc276c531af2e641c597744da807e21cc1160...,2007-02,train,0
4,ca65e1c387a4cc9e7d8a8ce12bf1bcf9f534c9032b9d95...,2007-02,train,0


In [0]:
X_train0, y_train0, X_test0, y_test0 = ember.read_vectorized_features(data_path)



In [0]:
X_train0

In [0]:
#shape of the dataset
X_train0.shape, y_train0.shape, X_test0.shape, y_test0.shape

((900000, 2381), (900000,), (200000, 2381), (200000,))

## Data Preprocessing:

It is known that the EMBER train dataset has three sample categories, namels unlabled, benign and malicious. They are represented as -1, 0 and 1 respectively. But it can be seen that the test dataset has only benign and malicious samples. In this project, I am ignoring the unlabled samples from the train dataset for the better performance of the model.

In [0]:
import pandas as pd
# Creating dataframes of X_train & y_train
X_train0 = pd.DataFrame(X_train0)
y_train0 = pd.DataFrame(y_train0)
X_train0.shape, y_train0.shape

((600000, 2381), (600000, 1))

In [0]:
#Unique labels in the train dataset 
y_train0[0].unique()

array([0., 1.], dtype=float32)

In [0]:
# Combining features and lables of train dataset
X_train0[2381] = y_train0[0]
X_train0.shape, y_train0.shape

((600000, 2382), (600000, 1))

In [0]:
#Checking the presence of unique lables in the combined dataframe
X_train0[2381].unique()

array([0., 1.], dtype=float32)

In [0]:
# Removing the unlabeled rows from the dataframe

X_train0.drop(X_train0[(X_train0[2381] == -1)].index, inplace = True)
y_train0.drop(y_train0[(y_train0[0] == -1)].index, inplace = True)

In [0]:
X_train0.shape, y_train0.shape

((600000, 2382), (600000, 1))

In [0]:
#reconstructing the X_train dataframe
X_train0.drop([2381], axis =1, inplace=True)
X_train0.shape, y_train0.shape

((600000, 2381), (600000, 1))

The dataset is huge and takes lot to time for vectorizing and creating metadata for every runtime execution. So, create pickle files for the training and testing samples to store them in the system. By downloading and storing these pickle files, one can avoid the execution of the former lines of code.

In [0]:
#Pickling the datasets
pd.DataFrame(X_train0).to_pickle("./X_train.pkl")
pd.DataFrame(y_train0).to_pickle("./y_train.pkl")
pd.DataFrame(X_test0).to_pickle("./X_test.pkl")
pd.DataFrame(y_test0).to_pickle("./y_test.pkl")

I faced network fialure error while downloading the pickle file to store it in my system. The alternate solution for this error is to upload the pickle files to the Google Drive by executing the following code:

In [0]:
# To upload the files, mount the Google drive in the colab. Go the URL obtained after executing the below lines of code 
#and do the necessary allows for the Google access. Then at the end an authorization code is displayed.
#Copy the code and paste it in the box below "Enter your authorization code:"

from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
# Copying pickle files to Google Drive
!cp ./X_test.pkl ./gdrive/My\ Drive/Pickle_Files
!cp ./y_test.pkl ./gdrive/My\ Drive/Pickle_Files
!cp ./X_train.pkl ./gdrive/My\ Drive/Pickle_Files
!cp ./y_train.pkl ./gdrive/My\ Drive/Pickle_Files

In [0]:
# Extracting training data from pickle files
X_trainp = pd.read_pickle("/content/gdrive/My Drive/Pickle_Files/X_train.pkl")
y_trainp = pd.read_pickle("/content/gdrive/My Drive/Pickle_Files/y_train.pkl")

In [0]:
# Extracting testing data from pickle files
X_testp =pd.read_pickle("/content/gdrive/My Drive/Pickle_Files/X_test.pkl")
y_testp = pd.read_pickle("/content/gdrive/My Drive/Pickle_Files/y_test.pkl")

In [0]:
#Shape of the dataset
X_trainp.shape, y_trainp.shape, X_testp.shape, y_testp.shape

((900000, 2381), (900000, 1), (200000, 2381), (200000, 1))

At this point of execution, I can see that the above lines of code used most of the 25GB RAM availbale in Colab. So, even though the datasets are pickled, the RAM crashes. The alternative for this is to create HDF5 files. The h5py package is a Pythonic interface to the HDF5 binary data format.

In [0]:
import h5py

# Loading X_train data to HDF5 file
h50 = h5py.File('X_train0.h5', 'w')
h50.create_dataset('X_train0', data=X_train0)
h50.close()

In [0]:
# Loading y_train data to HDF5 file
h51 = h5py.File('y_train0.h5', 'w')
h51.create_dataset('y_train0', data=y_train0)
h51.close()

In [0]:
#Loading X_test data to HDF5 file
h52 = h5py.File('X_test0.h5', 'w')
h52.create_dataset('X_test0', data=X_test0)
h52.close()

In [0]:
#Loading y_test data to HDF5 file
h53 = h5py.File('y_test0.h5', 'w')
h53.create_dataset('y_test0', data=y_test0)
h53.close()

In [0]:
#Storing all the h5 files to GDrive
!cp ./X_train0.h5 ./gdrive/My\ Drive/Pickle_Files
!cp ./y_train0.h5 ./gdrive/My\ Drive/Pickle_Files
!cp ./X_test0.h5 ./gdrive/My\ Drive/Pickle_Files
!cp ./y_test0.h5 ./gdrive/My\ Drive/Pickle_Files

In [0]:
#reading the X_train data from h5 files 
import h5py
Xh5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/X_train0.h5','r')
X_train = Xh5['X_train0']
X_train.shape 

(600000, 2381)

In [0]:
# Reading y_train data from h5 files
import h5py
yh5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/y_train0.h5','r')
y_train = yh5['y_train0']
y_train.shape 

(600000, 1)

In [0]:
# Reading X_test data from h5 files
import h5py
Xth5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/X_test0.h5','r')
X_test = Xth5['X_test0']
X_test.shape 

(200000, 2381)

In [0]:
# Reading y_test data from h5 files
import h5py
yth5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/y_test0.h5','r')
y_test = yth5['y_test0']
y_test.shape

(200000,)

**The features of this dataset are scaled on different scalars and among them I picked RobustScalar to do the feature scaling.** 

In [0]:
# Scaling the features inorder to improve the performance of the model
from sklearn.preprocessing import RobustScaler

rs = RobustScaler()
Xtrain_rs = rs.fit_transform(X_train)
Xtest_rs = rs.fit_transform(X_test)

In [0]:
#Loading scaled X_train data to HDF5 file
h54 = h5py.File('Xtrain_rs.h5', 'w')
h54.create_dataset('Xtrain_rs', data=Xtrain_rs)
h54.close()

#Storing the h5 files to GDrive
!cp ./Xtrain_rs.h5 ./gdrive/My\ Drive/Pickle_Files

In [0]:
#Loading scaled X_test data to HDF5 file
h55 = h5py.File('Xtest_rs.h5', 'w')
h55.create_dataset('Xtest_rs', data=Xtest_rs)
h55.close()

#Storing the h5 files to GDrive
!cp ./Xtest_rs.h5 ./gdrive/My\ Drive/Pickle_Files

In [0]:
# Reading Xtrain_rs data from h5 files
import h5py
Xrsh5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/Xtrain_rs.h5','r')
Xtrain_rs = Xrsh5['Xtrain_rs']
Xtrain_rs.shape

(600000, 2381)

In [0]:
# Reading Xtest_rs data from h5 files
import h5py
Xtrsh5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/Xtest_rs.h5','r')
Xtest_rs = Xtrsh5['Xtest_rs']
Xtest_rs.shape

(200000, 2381)

## Model Arcitecture & Training:

In [0]:
#Function for the model
def myModel():

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.models import Sequential
    from keras import regularizers
    tf.compat.v1.disable_eager_execution()
    
    #Model architecture
    model = Sequential()
    model.add(layers.InputLayer(input_shape=(2381,))) 
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(units = 1000, activation = tf.nn.relu, activity_regularizer=regularizers.l2(0.01)))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(units = 1, activation=tf.nn.sigmoid))
    print(model.summary())
    
    #model compilation
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    model.save('my_model.h5')
    
    return model

In [0]:
model = myModel()

Using TensorFlow backend.


Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 2381)              0         
_________________________________________________________________
dense (Dense)                (None, 1000)              2382000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 1001      
Total params: 2,383,001
Trainable params: 2,383,001
Non-trainable params: 0
_________________________________________________________________
None


In [0]:
#Training the model on 1 epoch
history = model.fit(Xtrain_rs, y_train,
                batch_size=256, shuffle="batch",
                epochs=1, 
                validation_split=0.2)

Train on 480000 samples, validate on 120000 samples
Epoch 1/1


In [0]:
history = model.fit(Xtrain_rs, y_train,
                batch_size=256, shuffle="batch",
                epochs=30, 
                validation_split=0.2)

Train on 480000 samples, validate on 120000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


## Model Testing:

In [0]:
# testing the model

score =model.evaluate(Xtest_rs,y_test)
print("Training accuracy:", score[1])

Training accuracy: 0.4422149956226349


Now, lets save the model for future use.

In [0]:
# Save the model
#model.save('my_model.h5')
model.save_weights('my_model_weights.h5')

#Storing the model to GDrive
!cp ./my_model.h5 ./gdrive/My\ Drive/Pickle_Files
!cp ./my_model_weights.h5 ./gdrive/My\ Drive/Pickle_Files

In [0]:
# save neural network structure to JSON (no weights)
model_json = model.to_json()
with open("mymodeljson.json", "w") as json_file:
    json_file.write(model_json)

model.save_weights("my_model-weights.h5")

The below set of code is a a function that takes a PE file as its argument, runs it through the trained model, and returns the output i.e., 1 for Malware or ) for Benign.

In [0]:
!wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Windows-x86_64.exe

--2020-04-28 03:17:13--  https://repo.anaconda.com/archive/Anaconda3-2020.02-Windows-x86_64.exe
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 488908696 (466M) [application/octet-stream]
Saving to: ‘Anaconda3-2020.02-Windows-x86_64.exe’


2020-04-28 03:17:15 (201 MB/s) - ‘Anaconda3-2020.02-Windows-x86_64.exe’ saved [488908696/488908696]



In [0]:
def testPE(pe):
  import ember
  import numpy as np
  import tensorflow as tf
  from sklearn.preprocessing import RobustScaler
  rs = RobustScaler()
  
  #opening the downloaded PE file
  testpe = open(pe, "rb").read()
  #Feature extractor class of the ember project 
  extract = ember.PEFeatureExtractor() 
  data = extract.feature_vector(testpe) #vectorizing the extracted features
  scaled_data = rs.fit_transform([data])
  Xdata = np.reshape(scaled_data,(1, 2381))

  model = tf.keras.models.load_model('my_model.h5')
  pred = model.predict_classes(Xdata)

  return pred

In [0]:
testPE("Anaconda3-2020.02-Windows-x86_64.exe")



array([[0]], dtype=int32)

The model predicted that Anaconda PE file as Benign