In [None]:
!mkdir data
%cd data
!wget https://nextcloud.mpi-klsb.mpg.de/index.php/s/pJrRGzm2So2PMZm/download -O train.tar.gz 
!tar xzf train.tar.gz
!wget https://nextcloud.mpi-klsb.mpg.de/index.php/s/zN3yeWzQB3i5WqE/download -O test.tar.gz 
!tar xzf test.tar.gz
%cd ..
!wget https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/application_vocab_5521.pkl

# ML in Cybersecurity: Task 3

## Team
  * **Team name**:  *R2D2C3P0BB8*
  * **Members**:  <br/> **Navdeeppal Singh (s8nlsing@stud.uni-saarland.de)** <br/> **Shahrukh Khan (shkh00001@stud.uni-saarland.de)** <br/> **Mahnoor Shahid (mash00001@stud.uni-saarland.de)**


## Logistics
  * **Due date**: 9th December 2021, 23:59:59
  * Email the completed notebook to: `mlcysec_ws2022_staff@lists.cispa.saarland`
  * Complete this in **teams of 3**
  * Feel free to use the forum to discuss.
  
## Timeline
  * 26-Nov-2021: hand-out
  * **09-Dec-2021**: Email completed notebook
  
  
## About this Project
In this project, you will explore an application of ML to a popular task in cybersecurity: malware classification.
You will be presented with precomputed behaviour analysis reports of thousands of program binaries, many of which are malware.
Your goal is to train a malware detector using these behavioural reports.


## A Note on Grading
The grading for this project will depend on:
 1. Vectorizing Inputs
   * Obtaining a reasonable vectorized representation of the input data (a file containing a sequence of system calls)
   * Understanding the influence these representations have on your model
 1. Classification Model  
   * Following a clear ML pipeline
   * Obtaining reasonable performances (>60\%) on held-out test set
   * Choice of evaluation metric
   * Visualizing loss/accuracy curves
 1. Analysis
   * Which methods (input representations/ML models) work better than the rest and why?
   * Which hyper-parameters and design-choices were important in each of your methods?
   * Quantifying influence of these hyper-parameters on loss and/or validation accuracies
   * Trade-offs between methods, hyper-parameters, design-choices
   * Anything else you find interesting (this part is open-ended)


## Grading Details
 * 40 points: Vectorizing input data (each input = behaviour analysis file in our case)
 * 40 points: Training a classification model
 * 15 points: Analysis/Discussion
 * 5 points: Clean code
 
## Filling-in the Notebook
You'll be submitting this very notebook, filled in with your code and analysis. Make sure you submit one that has been executed in order, so that results and graphs are already visible upon opening it.

The notebook you submit **should compile** (or should be self-contained and sufficiently commented). Check tutorial 1 on how to set up the Python3 environment.


**The notebook is your project report. So, to make the report readable, omit code for techniques/models/things that did not work. You can use the final summary to provide a report about these.**

It is extremely important that you **do not** re-order the existing sections. Apart from that, the code blocks that you need to fill-in are given by:
```
#
#
# ------- Your Code -------
#
#
```
Feel free to break this into multiple-cells. It's even better if you interleave explanations and code-blocks so that the entire notebook forms a readable "story".


## Code of Honor
We encourage discussing ideas and concepts with other students to help you learn and better understand the course content. However, the work you submit and present **must be original** and demonstrate your effort in solving the presented problems. **We will not tolerate** blatantly using existing solutions (such as from the internet), improper collaboration (e.g., sharing code or experimental data between groups) and plagiarism. If the honor code is not met, no points will be awarded.

 
 ## Versions
  * v1.1: Updated deadline
  * v1.0: Initial notebook
  
  ---

In [1]:
import time 
 
import numpy as np 
import matplotlib.pyplot as plt 

import json 
import pickle 
import sys 
import csv 
import os 
import os.path as osp 
import shutil 
import pathlib
from pathlib import Path

from IPython.display import display, HTML
 
%matplotlib inline 
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots 
plt.rcParams['image.interpolation'] = 'nearest' 
plt.rcParams['image.cmap'] = 'gray' 
 
# for auto-reloading external modules 
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython 
%load_ext autoreload
%autoreload 2

In [2]:
# Some suggestions of our libraries that might be helpful for this project
from collections import Counter          # an even easier way to count
from multiprocessing import Pool         # for multiprocessing
from tqdm import tqdm                    # fancy progress bars

# Load other libraries here.
from sklearn.metrics import recall_score
# Keep it minimal! We should be easily able to reproduce your code.

# We preload pytorch as an example
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, TensorDataset

# Setup

  * Download the datasets: [train](https://nextcloud.mpi-klsb.mpg.de/index.php/s/pJrRGzm2So2PMZm) (128M) and [test](https://nextcloud.mpi-klsb.mpg.de/index.php/s/zN3yeWzQB3i5WqE) (92M)
  * Unpack them under `./data/train` and `./data/test`
  * Hint: you can execute shell scripts from notebooks using the `!` prefix, e.g., `! wget <url>`

In [3]:
# Check that you are prepared with the data
try:
    print("# train examples (Should be 13682) : ", len(os.listdir('./data/train')))
    print("# test examples (Should be 10000) : ", len(os.listdir('./data/test')))

except OSError:
    print("You don't have the data!")
# ! printf '# train examples (Should be 13682) : '; ls data/train | wc -l
# ! printf '# test  examples (Should be 10000) : '; ls data/test | wc -l

Now that you're set, let's briefly look at the data you have been handed.
Each file encodes the behavior report of a program (potentially a malware), using an encoding scheme called "The Malware Instruction Set" (MIST for short).
At this point, we highly recommend you briefly read-up Sec. 2 of the [MIST](http://www.mlsec.org/malheur/docs/mist-tr.pdf) documentation.

You will find each file named as `filename.<malwarename>`:
```
» ls data/train | head
00005ecc06ae3e489042e979717bb1455f17ac9d.NothingFound
0008e3d188483aeae0de62d8d3a1479bd63ed8c9.Basun
000d2eea77ee037b7ef99586eb2f1433991baca9.Patched
000d996fa8f3c83c1c5568687bb3883a543ec874.Basun
0010f78d3ffee61101068a0722e09a98959a5f2c.Basun
0013cd0a8febd88bfc4333e20486bd1a9816fcbf.Basun
0014aca72eb88a7f20fce5a4e000c1f7fff4958a.Texel
001ffc75f24a0ae63a7033a01b8152ba371f6154.Texel
0022d6ba67d556b931e3ab26abcd7490393703c4.Basun
0028c307a125cf0fdc97d7a1ffce118c6e560a70.Swizzor
...
```
and within each file, you will see a sequence of individual system calls monitored during the run-time of the binary - a malware named 'Basun' in this case:
```
» head data/train/000d996fa8f3c83c1c5568687bb3883a543ec874.Basun
# process 000006c8 0000066a 022c82f4 00000000 thread 0001 #
02 01 | 000006c8 0000066a 00015000
02 02 | 00006b2c 047c8042 000b9000
02 02 | 00006b2c 047c8042 00108000
02 02 | 00006b2c 047c8042 00153000
02 02 | 00006b2c 047c8042 00091000
02 02 | 00006b2c 047c8042 00049000
02 02 | 00006b2c 047c8042 000aa000
02 02 | 00006b2c 047c8042 00092000
02 02 | 00006b2c 047c8042 00011000
...
```
(**Note**: Please ignore the first line that begins with `# process ...`.)
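
To make the format concrete, the following minimal sketch turns a MIST report into a flat token sequence. It assumes, purely for illustration, that header lines start with `#`, that `|` only separates the leading pair of fields from the arguments, and that whitespace-separated fields serve as tokens. `tokenize_mist` is a hypothetical helper name and not part of the handed-out code.

```python
def tokenize_mist(report_text):
    """Turn a MIST report into a flat list of whitespace-separated tokens."""
    tokens = []
    for line in report_text.splitlines():
        line = line.strip()
        # Skip empty lines and header lines such as '# process ... #'
        if not line or line.startswith("#"):
            continue
        # Drop the '|' separator and split the remaining fields
        tokens.extend(line.replace("|", " ").split())
    return tokens


# First lines of the example report shown above
sample_report = """# process 000006c8 0000066a 022c82f4 00000000 thread 0001 #
02 01 | 000006c8 0000066a 00015000
02 02 | 00006b2c 047c8042 000b9000"""

tokens = tokenize_mist(sample_report)
print(tokens)
```

Whether every field or only the leading fields of each line should become tokens is a design choice that the vectorization step has to settle.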

Your task in this project is to train a malware detector, which given the sequence of system calls (in the MIST-formatted file like above), predicts one of 10 classes: `{ Agent, Allaple, AutoIt, Basun, NothingFound, Patched, Swizzor, Texel, VB, Virut }`, where `NothingFound` roughly represents no malware is present.
In terms of machine learning terminology, your malware detector $F: X \rightarrow Y$ should learn a mapping from the MIST-encoded behaviour report (the input $x \in X$) to the malware class $y \in Y$.

Consequently, you will primarily tackle two challenges in this project:
  1. "Vectorizing" the input data i.e., representing each input (file) as a tensor
  1. Training an ML model
  

### Some tips:
  * Begin with an extremely simple representation/ML model and get above chance-level classification performance
  * Choose your evaluation metric wisely
  * Save intermediate computations (e.g., a token to index mapping). This will avoid you parsing the entire dataset for every experiment
  * Try using `multiprocessing.Pool` to parallelize your `for` loops
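
The tip about the evaluation metric matters because the classes need not be balanced. As a small illustration with made-up labels (not the project data), the sketch below compares plain accuracy with macro-averaged recall for a classifier that always predicts the majority class. `accuracy` and `macro_recall` are hypothetical helpers written out by hand so that the example has no dependencies.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction equals the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recall over the classes in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Positions whose ground truth is this class
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy, made-up labels: 8 samples of class 'A', 1 of 'B', 1 of 'C'
y_true = ["A"] * 8 + ["B", "C"]
# A degenerate classifier that always predicts the majority class
y_pred = ["A"] * 10

print("accuracy    :", accuracy(y_true, y_pred))
print("macro recall:", macro_recall(y_true, y_pred))
```

Accuracy rewards the degenerate classifier (0.8 here), whereas macro recall averages the per-class recalls (1/3 here) and exposes the two classes that are never detected.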

---

# 1. Vectorize Data

## 1.a. Load Raw Data
## => We store each file's lines as a single string instead of a list of lines, which saves memory when loading the entire dataset

In [4]:
def load_content(filepath):
    '''Given a filepath, returns (content, classname), where content = the file's lines joined into a single string'''
    ## load file content; the context manager closes the file even if reading fails
    with open(filepath, "r") as file:
        file_lines = file.read()
    ## keep the lines as one string instead of a list of lines to save memory when loading the entire dataset
    lines = "\n".join(file_lines.splitlines())

    ## extracting label: the class name is the file extension
    label = str(filepath).split(".")[-1]
    return lines, label


def load_data(data_path, nworkers=10):
    '''Returns each data sample as a tuple (x, y), x = file content as a single string (i.e., syscalls), y = malware program class'''
    file_paths = [f"{data_path}/{filename}" for filename in os.listdir(data_path)]
    ## the context manager releases the worker processes once all files are read
    with Pool(processes=nworkers) as pool:
        raw_data_samples = pool.map(load_content, file_paths)
    return raw_data_samples

def pickle_file(out_path, file_content):
    with open(out_path, 'wb') as wf:
        pickle.dump(file_content, wf)
        
def unpickle_file(in_path):
    with open(in_path, "rb") as rf:
        return pickle.load(rf)

In [5]:
train_path = './data/train'
test_path = './data/test'
n_workers = 10

In [6]:
project_mode = 'eval'    # trainval, traintest, debug, eval
np.random.seed(123) 


## in trainval mode we use test_raw_samples variable to hold validation dataset
train_raw_samples, test_raw_samples = [], []
 
if project_mode == 'trainval':
    print('=> Loading training data ... ')
    train_raw_samples = load_data(Path(train_path), nworkers=n_workers)
    # To perform the same split across multiple runs
    np.random.seed(123)          
    # Split data into train and validation set
    np.random.shuffle(train_raw_samples)
    train_raw_samples, test_raw_samples = train_raw_samples[:int(len(train_raw_samples)*0.8)], train_raw_samples[int(len(train_raw_samples)*0.8):]

elif project_mode == 'traintest':
    ## loading train and test set
    print('=> Loading training data ... ')
    train_raw_samples = load_data(Path(train_path), nworkers=n_workers)
    print('=> Loading testing data ... ')
    test_raw_samples = load_data(Path(test_path), nworkers=n_workers)
    
elif project_mode == 'debug':
    # Optional, use a small subset of the training and validation data for fast debugging
    print('=> Loading training data ... ')
    train_raw_samples = load_data(Path(train_path), nworkers=n_workers)[:100]
    print('=> Loading testing data ... ')
    test_raw_samples = load_data(Path(test_path), nworkers=n_workers)[:100]

elif project_mode == 'eval':
    ## load only test set for evaluating the model
    print('=> Loading testing data ... ')
    test_raw_samples = load_data(Path(test_path), nworkers=n_workers)

else:
    raise ValueError('Unrecognized mode')
    
print('=> # Train samples = ', len(train_raw_samples))
print('=> # Test  samples = ', len(test_raw_samples))

## 1.b. Vectorize: Setup

Make one pass over the inputs to identify relevant features/tokens.

Suggestion:
  - identify tokens (e.g., unigrams, bigrams)
  - create a token -> index (int) mapping. Note that you might have a >10K unique tokens. So, you will have to choose a suitable "vocabulary" size.
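
The two suggestions above can be sketched on a toy corpus as follows. This is only an illustration under stated assumptions: tokens are whitespace-separated, n-grams are joined with `_`, and the last index is reserved for an unknown token. `ngrams` and `build_vocab` are hypothetical helper names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined with '_') of a token list."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(documents, vocab_size, n=1, unk_token="_ukn_"):
    """Map the (vocab_size - 1) most frequent n-grams to indices.

    The last index is reserved for tokens outside the vocabulary.
    """
    counter = Counter()
    for doc in documents:
        # Accumulate n-gram counts over all documents
        counter.update(ngrams(doc.split(), n))
    most_common = counter.most_common(vocab_size - 1)
    token_to_idx = {tok: i for i, (tok, _) in enumerate(most_common)}
    token_to_idx[unk_token] = len(token_to_idx)
    return token_to_idx


# Toy documents, made up for illustration
docs = ["02 01 02 02 02 02", "02 02 03 01"]
unigram_vocab = build_vocab(docs, vocab_size=3, n=1)
bigram_vocab = build_vocab(docs, vocab_size=3, n=2)
print(unigram_vocab)
print(bigram_vocab)
```

A larger `n` captures more ordering information but grows the number of distinct tokens quickly, which is one reason the vocabulary size has to be capped.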

In [7]:
# Feel free to edit anything in this block

def get_key_idx_map(counter, vocab_size, ukn_token='_ukn_'):
    """counter is a mapping: token -> count
    build vectorizer using the (vocab_size - 1) most common elements,
    the last index is reserved for the unknown token"""
    key_to_idx, idx_to_key = dict(), dict()
    
    ## order tokens by (count, token) in descending order so that the most frequent come first
    most_common = sorted(counter.items(), key=lambda item: (item[1], item[0]), reverse=True)
    for idx, (key, value) in enumerate(tqdm(most_common[:vocab_size-1])):
        ## perform mapping for token
        key_to_idx[key] = idx
        idx_to_key[idx] = key
    ## perform mapping for unk token at the end
    key_to_idx[ukn_token] = vocab_size - 1
    idx_to_key[vocab_size - 1] = ukn_token
    
    return key_to_idx, idx_to_key

def preprocess(data):
    """concatenating all sys calls to single string for tokenization
    removing extraneous information such as lines with '# process', white spaces and '|' characters"""
    for i, (X,y) in enumerate(tqdm(data)):
        example = ""
        for line in X.split("\n"):
            ## skip lines containing '# process'
            if "# process" in line:
                continue
            ## remove '|' separators and extraneous white spaces
            example += line.replace("|","").replace("  ", " ").strip() + " "
        example = example.strip()
        ## assign preprocessed sample
        data[i] = (example, y)
    return data
        
def count_words(data):
    """
    count token occurrences over the whole dataset for building vocabulary later
    """
    counter = Counter()
    for X,y in tqdm(data):
        ## add this document's token counts to the running totals
        counter.update(X.split())
    return dict(counter)

In [8]:
"""
Preprocessing both train and test set and
Creating token counter for building vocabulary on train set
"""
train_raw_samples = preprocess(train_raw_samples)
train_counter = count_words(train_raw_samples)
test_raw_samples = preprocess(test_raw_samples)

In [9]:
## Code for finding appropriate threshold for setting `MAX_VOCAB_SIZE`
def choose_vocab_size(min_frequency_threshold=10):
    count = 0
    for value,key in sorted([(value,key) for (key,value) in train_counter.items()], reverse=True):
        if value > min_frequency_threshold:
            count+=1
    print(f"Number of tokens are {count} for min. frequency threshold={min_frequency_threshold}")
#choose_vocab_size(10)

In [10]:
## sorting the counter by count values in descending order
train_counter = {key:value for value, key in  sorted([(value,key) for (key,value) in train_counter.items()], reverse=True)}

In [11]:
# Feel free to edit anything in this block
## By keeping a minimum count threshold of 10 we get 5520 most frequent tokens in train dataset
## adding one to MAX_VOCAB_SIZE for _ukn_ token
MAX_VOCAB_SIZE = 5520 + 1
token_to_idx, idx_to_token = {}, {}
# Path for vocab for saving and loading
vocab_path = 'application_vocab_{}.pkl'.format(MAX_VOCAB_SIZE)

## check if vocab already exists on file system otherwise create one
if os.path.isfile(vocab_path):
    ## load the pickled vocabulary once and read both mappings from it
    vocab = unpickle_file(vocab_path)
    token_to_idx, idx_to_token = vocab['token_to_idx'], vocab['idx_to_token']
    
else:
    token_to_idx, idx_to_token = get_key_idx_map(train_counter, MAX_VOCAB_SIZE)
    with open(vocab_path, 'wb') as wf:
        dct = {'token_to_idx': token_to_idx,
              'idx_to_token': idx_to_token}
        pickle.dump(dct, wf)

## 1.c. Vectorize Data

Use the (token $\rightarrow$ index) mapping you created before to vectorize your data
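
As a worked example of what such a vectorization can look like, the sketch below computes bag-of-words, count and TF-IDF vectors for two toy documents. The vocabulary, the documents and the helper names `to_ids` and `count_vector` are made up for illustration, and the smoothed form `log((1 + N) / (1 + df) + 1)` is only one possible choice of inverse document frequency.

```python
import math
from collections import Counter

vocab = {"02": 0, "01": 1, "_ukn_": 2}  # toy vocabulary (assumed)
docs = ["02 01 02", "02 ff"]            # toy documents (assumed)


def to_ids(doc):
    """Map tokens to vocabulary ids; unseen tokens map to '_ukn_'."""
    return [vocab.get(tok, vocab["_ukn_"]) for tok in doc.split()]


def count_vector(doc):
    """Number of occurrences of every vocabulary term in the document."""
    counts = Counter(to_ids(doc))
    return [counts.get(i, 0) for i in range(len(vocab))]


counts = [count_vector(d) for d in docs]
# Bag of words: 1 if the term occurs at all, 0 otherwise
bow = [[1 if c > 0 else 0 for c in row] for row in counts]
# Term frequency: count divided by the document length
tf = [[c / sum(row) for c in row] for row in counts]

# Document frequency and a smoothed inverse document frequency
n_docs = len(docs)
df = [sum(row[j] for row in bow) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d) + 1) for d in df]
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

print("counts:", counts)
print("bow   :", bow)
print("tfidf :", tfidf)
```

The three representations share the same shape (one entry per vocabulary term) and differ only in how strongly frequent and widespread terms are weighted.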

In [12]:
def sample_to_idx(sample):
    """
    Maps each document's tokens to their ids in the vocabulary
    """
    ## tokens missing from the vocabulary are mapped to the '_ukn_' id
    ukn_idx = token_to_idx['_ukn_']
    return [token_to_idx.get(token, ukn_idx) for token in sample.split(' ')]


## define mapping for labels
label_encodings = {'Virut': 0,
 'Swizzor': 1,
 'Agent': 2,
 'Patched': 3,
 'Allaple': 4,
 'Texel': 5,
 'Basun': 6,
 'AutoIt': 7,
 'NothingFound': 8,
 'VB': 9}

In [13]:
def vectorize_raw_samples_bow(raw_samples, vocab_length, nworkers=10):
    """
    BAG-OF-WORDS Vectorizer which vectorizes examples by placing '1' for
    term i occurring in document j, hence producing a vector for each document
    raw_samples: List of (document, label) tuples to vectorize
    vocab_length: Size of the vocabulary
    nworkers: unused, kept for a uniform interface
    """
    vectorized_samples = []
    labels = []
    lengths = []
    ## errors (e.g. an unexpected label) are raised instead of silently truncating the dataset
    for X,y in tqdm(raw_samples):
        ## map labels to ids
        label = label_encodings[y]
        ## map tokens to ids
        X_idx = sample_to_idx(X)

        ## initializing an all-zero placeholder vector with one entry per vocabulary term
        vector_sample = [0] * vocab_length

        ## creating Bag of Words Vectors 
        for token_id in set(X_idx):
            vector_sample[token_id] = 1
        sequence_length = len(X_idx)

        ## append sample to respective lists
        vectorized_samples.append(vector_sample)
        labels.append(label)
        lengths.append(sequence_length)
            
    return (torch.DoubleTensor(vectorized_samples), torch.LongTensor(labels), torch.LongTensor(lengths))



def vectorize_raw_samples_count_vectorizer(raw_samples, vocab_length, nworkers=10):
    """ 
    Count Vectorizer is similar to BAG-OF-WORDS Vectorizer, however this vectorizer places the 'count/frequency' of
    term i occurring in document j, hence producing a vector for each document
    raw_samples: List of (document, label) tuples to vectorize
    vocab_length: Size of the vocabulary
    nworkers: unused, kept for a uniform interface
    """
    vectorized_samples = []
    labels = []
    lengths = []
    ## errors (e.g. an unexpected label) are raised instead of silently truncating the dataset
    for X,y in tqdm(raw_samples):
        ## map labels to ids
        label = label_encodings[y]
        ## map tokens to ids
        X_idx = sample_to_idx(X)

        ## initializing an all-zero placeholder vector with one entry per vocabulary term
        vector_sample = [0] * vocab_length
        ## creating Count Vectors 
        for token_id, count in Counter(X_idx).items():
            vector_sample[token_id] = count
        sequence_length = len(X_idx)

        ## append sample to respective lists
        vectorized_samples.append(vector_sample)
        labels.append(label)
        lengths.append(sequence_length)
    
    return (torch.DoubleTensor(vectorized_samples), torch.LongTensor(labels), torch.LongTensor(lengths))



def vectorize_raw_samples_tfidf(raw_samples, vocab_length, nworkers=10):
    """ 
    TF-IDF Vectorizer vectorizes examples by computing term frequencies and inverse document
    frequencies for term i occurring in document j, hence producing a vector for each document
    raw_samples: List of (document, label) tuples to vectorize
    vocab_length: Size of the vocabulary
    nworkers: unused, kept for a uniform interface
    """
    labels = []
    lengths = []
    n_docs = len(raw_samples)
    tf = np.zeros(shape=(n_docs, vocab_length)) ## term frequency vector for each sample
    ## errors (e.g. an unexpected label) are raised instead of silently truncating the dataset
    for idx, (X,y) in enumerate(tqdm(raw_samples)):
        ## map labels to ids
        label = label_encodings[y]
        ## map tokens to ids
        X_idx = sample_to_idx(X)
        sequence_length = len(X_idx)
        ## compute term frequencies 'tf => # of times term in the doc / total words in the doc'
        for token_id, count in Counter(X_idx).items():
            tf[idx, token_id] = count / sequence_length
        ## append sample to respective lists
        labels.append(label)
        lengths.append(sequence_length)

    # compute idf
    # 1. document frequency: number of docs containing term t
    doc_freq = (tf > 0).sum(axis=0)
    # 2. compute idf scores 'idf(t) => log( (1 + # of docs) / (1 + # of docs with term t) + 1 )'
    idf = np.log((1 + n_docs) / (1 + doc_freq) + 1)

    # compute tf-idf => tf * idf (idf is broadcast over all documents)
    tf_idf = tf * idf
    
    return (torch.DoubleTensor(tf_idf), torch.LongTensor(labels), torch.LongTensor(lengths))

In [14]:
## select vectorization_method from {'BOW','COUNT_VEC', 'TF_IDF'}
vectorization_method = "BOW" 
train_data, test_data = None, None

if vectorization_method == "BOW":
    print(f'=> {vectorization_method} Processing: Train')
    train_data = vectorize_raw_samples_bow(train_raw_samples, vocab_length=MAX_VOCAB_SIZE)
    print()
    print(f'=> {vectorization_method} Processing: Test')
    test_data = vectorize_raw_samples_bow(test_raw_samples, vocab_length=MAX_VOCAB_SIZE)

elif vectorization_method == "COUNT_VEC":
    print(f'=> {vectorization_method} Processing: Train')
    train_data = vectorize_raw_samples_count_vectorizer(train_raw_samples, vocab_length=MAX_VOCAB_SIZE)
    print()
    print(f'=> {vectorization_method} Processing: Test')
    test_data = vectorize_raw_samples_count_vectorizer(test_raw_samples, vocab_length=MAX_VOCAB_SIZE)
        
elif vectorization_method == "TF_IDF":
    print(f'=> {vectorization_method} Processing: Train')
    train_data = vectorize_raw_samples_tfidf(train_raw_samples, vocab_length=MAX_VOCAB_SIZE)
    print()
    print(f'=> {vectorization_method} Processing: Test')
    test_data = vectorize_raw_samples_tfidf(test_raw_samples, vocab_length=MAX_VOCAB_SIZE)
else:
    print("Please choose one of the following vectorization method: {'BOW','COUNT_VEC', 'TF_IDF'}")


In [15]:
#test_data = unpickle_file('./test_v5701_count_vec.pkl')
#train_data = unpickle_file('./train_v5701_count_vec.pkl')
#vocab = unpickle_file('./application_vocab_5701.pkl')
#len(vocab['token_to_idx'])

In [16]:

# Suggestions: 
#
# (a) You can use torch.utils.data.TensorDataset to represent the tensors you created previously
# trainset = TensorDataset(train_x, train_y)
# testset = TensorDataset(test_x, test_y)
#
# (b) Store your datasets to disk so that you do not need to precompute it every time

"""
Standard Pytorch Dataset class for loading datasets.
"""
class MalwareDataset(Dataset):

    def __init__(self, data_tensor, target_tensor, length_tensor):
        """
        initializes  and populates the the length, data and target tensors, and raw texts list
        """
        assert data_tensor.size(0) == target_tensor.size(0) == length_tensor.size(0)
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor
        self.length_tensor = length_tensor

    def __getitem__(self, index):
        """
        returns the tuple of data tensor, targets, lengths of sequences tensor
        """
        return self.data_tensor[index], self.target_tensor[index], self.length_tensor[index]

    def __len__(self):
        """
        returns the length of the data tensor.
        """
        return self.data_tensor.size(0)

## instantiate train and test datasets
malware_testset = MalwareDataset(test_data[0], test_data[1], test_data[2])
malware_trainset = MalwareDataset(train_data[0], train_data[1], train_data[2])

# 2. Train Model

You will now train an ML model on the vectorized datasets you created previously.

_Note_: Although we often refer to each input as a 'vector' for simplicity, each of your inputs can also be higher dimensional tensors.

## 2.a. Helpers

In [17]:
# Feel free to edit anything in this block
## temporarily upload files to cloud for moving them around: !curl --upload-file ./train_v5700_l2000.pkl https://transfer.sh/train_v5700_l2000.pkl

def evaluate_preds(y_gt, y_pred):
    recall = recall_score(y_gt, y_pred, average='macro')
    return recall


def another_helper(question):
    return 42


def save_model(model, out_path):
    """save the model parameters (state dict) to out_path"""
    torch.save(model.state_dict(), out_path)


In [18]:
pickle_file(f'test_v5521_{vectorization_method}.pkl', test_data)
!curl --upload-file ./test_v5521_BOW.pkl https://transfer.sh/test_v5521_BOW.pkl
    
#pickle_file(f'train_v5521_{vectorization_method}.pkl', train_data)
#pickle_file(f'test_v5521_{vectorization_method}.pkl', test_data)
#application_vocab_5701
# !wget https://transfer.sh/uzTwuQ/test_v5701_bow.pkl
#!wget https://transfer.sh/WTptcx/application_vocab_5701.pkl
# !wget https://transfer.sh/bqXwhn/train_v5701_bow.pkl
# train_data = unpickle_file('train_v5701_bow.pkl')

## 2.b. Define Model

Describe your model here.

In [None]:
# Feel free to edit anything in this block

class MalwareNet(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(MalwareNet, self).__init__()
        # Layer definitions
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )

    def forward(self, x):
        # Forward pass
        x = self.layers(x)
        return x
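
To get a feeling for the size of this fully connected architecture, the parameter count can be worked out by hand. The sketch below assumes an input dimension of 5521 (the vocabulary size used above) and 10 output classes; `count_mlp_parameters` is a hypothetical helper, and the calculation needs no deep learning library.

```python
def count_mlp_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # One weight per input/output pair plus one bias per output
        total += fan_in * fan_out + fan_out
    return total


# Assumed sizes: 5521-dimensional input, four hidden layers, 10 output classes
layer_sizes = [5521, 1024, 512, 256, 128, 10]
n_params = count_mlp_parameters(layer_sizes)
first_layer = count_mlp_parameters(layer_sizes[:2])

print(f"{n_params:,} trainable parameters in total")
print(f"{first_layer / n_params:.1%} of them in the first layer")
```

Under these assumptions the first layer holds the large majority of the parameters, so the vocabulary size is the main driver of model size.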

## 2.c. Set Hyperparameters

In [None]:
# Define your hyperparameters here

in_dims = train_data[0][0].shape[0]
out_dims = len(label_encodings)

# Optimization
n_epochs = 100
batch_size = 512
lr = 0.0001

In [None]:
train_data[1].shape ####### distribution, hyperparameter tuning

## 2.d. Train your Model

In [None]:
# Feel free to edit anything in this block

model = MalwareNet(input_dim=in_dims, output_dim=out_dims)
model.train()

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Data Loaders
trainloader = DataLoader(malware_trainset, batch_size=batch_size, shuffle=True)
testloader = DataLoader(malware_testset, batch_size=batch_size, shuffle=False)

In [None]:

# Example:
# for epoch in range(n_epochs):
#     ... train ...
#     ... validate ...

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = nn.CrossEntropyLoss() ## since we are doing multiclass classification
for epoch in range(n_epochs):
    ## switch back to training mode after the validation pass of the previous epoch
    model.train()
    y_true = list()
    y_pred = list()
    total_loss = 0
    for batch, targets, lengths in trainloader:
        
        ## perform forward pass  
        batch = batch.float().to(device)
        targets = targets.to(device)
        pred = model(batch) 
        preds = torch.max(pred, 1)[1]
        
        ## accumulate predictions per batch for the epoch
        y_pred += preds.cpu().tolist()
        y_true += targets.cpu().tolist()
        
        ## compute loss and perform backward pass
        loss = criterion(pred, targets) ## compute loss 
        optimizer.zero_grad()
        loss.backward() 
        optimizer.step()
        
        ## accumulate train loss (sum of the per-batch losses)
        total_loss += loss.item() 
        
    print(f"[{epoch+1}/{n_epochs}] Train loss: {total_loss} Recall score: {evaluate_preds(y_true, y_pred)}")
    
    ## init placeholder for predictions and groundtruth
    y_true_val = list()
    y_pred_val = list()
    ## perform validation pass in evaluation mode and without tracking gradients
    model.eval()
    with torch.no_grad():
        for batch, targets, lengths in testloader:
            ## perform forward pass  
            batch = batch.float().to(device)
            pred = model(batch) 
            preds = torch.max(pred, 1)[1]
            
            ## accumulate predictions per batch for the epoch
            y_pred_val += preds.cpu().tolist()
            y_true_val += targets.tolist()
    print(f"[{epoch+1}/{n_epochs}] Validation Recall score: {evaluate_preds(y_true_val, y_pred_val)}")
        

## 2.e. Evaluate model
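
One possible way to look beyond a single aggregate score is a confusion matrix with per-class recall. The sketch below uses made-up predictions for three classes and is not a result of this notebook; `confusion_matrix` and `per_class_recall` are hypothetical helpers written in plain Python, and in practice the integer labels would be the ids from `label_encodings`.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] = number of samples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Recall of class c = correctly predicted c / all samples whose true class is c."""
    recalls = []
    for c, row in enumerate(matrix):
        support = sum(row)
        # Classes without any sample get a recall of 0.0 by convention here
        recalls.append(row[c] / support if support else 0.0)
    return recalls


# Made-up ground truth and predictions, for illustration only
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
recalls = per_class_recall(matrix)
for row in matrix:
    print(row)
for c, r in enumerate(recalls):
    print(f"class {c}: recall = {r:.2f}")
```

The off-diagonal entries show which classes are confused with each other, which helps to answer the question of which malware families are hard to detect.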

In [None]:
#
#
# ------- Your Code -------
#
# 

## 2.f. Save Model + Data

In [None]:
#
#
# ------- Your Code -------
#
# 

---

# 3. Analysis

## 3.a. Summary: Main Results

Summarize your approach and results here

## 3.b. Discussion

Enter your final summary here.

For instance, you can address:
- What was the performance you obtained with the simplest approach?
- Which vectorized input representations helped more than the others?
- Which malwares are difficult to detect and why?
- Which approach do you recommend to perform malware classification?