Our blog site: https://rcgonzalez9061.github.io/m2v-adversarial-hindroid/

# m2vDroid: Perturbation-resilient metapath-based Android Malware Detection

## INTRODUCTION

Over the past decade, malware has become a persistent problem for the Android operating system. In 2018, Symantec reported that it blocked more than 10,000 malicious Android apps per day, while nearly three quarters of Android devices remained on older versions of Android. With billions of active Android devices, millions are only a swipe away from becoming victims. Naturally, automated machine learning-based detection systems have become a common response to this threat. However, many of these models have been shown to be vulnerable to adversarial attacks, notably attacks that add redundant code to malware to confuse detectors.

First, we introduce a new model that extends the [HinDroid detection system](https://www.cse.ust.hk/~yqsong/papers/2017-KDD-HINDROID.pdf) by employing node embeddings learned with [metapath2vec](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf). We believe that node embeddings will improve the model's performance beyond that of HinDroid. Second, we intend to break both models using a method similar to that proposed in [Android HIV](https://ieeexplore.ieee.org/document/8782574). That is, we train an adversarial model that perturbs malware with the goal of evading our models. We then measure the performance of each model after repeatedly feeding adversarial examples back into them. We believe that our model will outperform our HinDroid implementation in its ability to label malware even after adversarial examples have been added.

## m2vDroid
m2vDroid is another malware detection model that we implemented, based largely on HinDroid. However, it uses node embeddings instead of the bag-of-APIs model that HinDroid uses.

### Preliminaries
There are a few concepts that we should introduce before we get into details:

- *Definition 1)* A **Heterogeneous Information Network (HIN)** is a graph whose nodes and edges can have different types.

- *Definition 2)* A **Metapath** is a path within a HIN that follows certain node types. For example, let us define a HIN with node types $\{a, b\}$. The metapath $a \rightarrow b \rightarrow a$ is a path that begins on a node of type $a$, proceeds to a node of type $b$ which shares an edge with the previous, and continues likewise onto an $a$ node connected to the $b$ node.
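
To make the two definitions concrete, here is a minimal sketch of a toy HIN with node types `a` and `b`, together with a check that a path follows the metapath $a \rightarrow b \rightarrow a$. The node names, the `node_type` map, and the `follows_metapath` helper are illustrative assumptions and are not part of m2vDroid.

```python
# Toy HIN: every node has a type, and edges are undirected (illustrative data).
node_type = {"a1": "a", "a2": "a", "b1": "b"}
edges = {("a1", "b1"), ("b1", "a2")}


def connected(u, v):
    """Return True if u and v share an edge in either direction."""
    return (u, v) in edges or (v, u) in edges


def follows_metapath(path, metapath):
    """Return True if `path` matches `metapath` type by type and every
    consecutive pair of nodes in `path` shares an edge."""
    if len(path) != len(metapath):
        return False
    types_match = all(node_type[n] == t for n, t in zip(path, metapath))
    edges_exist = all(connected(u, v) for u, v in zip(path, path[1:]))
    return types_match and edges_exist


print(follows_metapath(["a1", "b1", "a2"], ["a", "b", "a"]))  # True
print(follows_metapath(["a1", "a2", "b1"], ["a", "b", "a"]))  # False
```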

### HIN Construction
Our heterogeneous information network contains 4 types of nodes which we define as: 
- $Apps$: Android apps determined by name.
- $APIs$: APIs determined by their smali representation, e.g. `Lpackage/Class;->method()V`
- $Packages$: the package an API originates from.
- $Methods$: Methods (or "functions") that API calls appear in.

$Apps$ and $APIs$ share an edge if the $API$ is used within the $App$. Likewise, $APIs$ and $Methods$ share an edge if the $Method$ contains the $API$, and $Packages$ and $APIs$ share an edge if the $API$ originates from the $Package$. With this representation, we believe we retain more information about the apps we aim to represent than HinDroid does.
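
The edge rules above can be sketched as follows. This is a simplified illustration, not the project's actual pipeline: the `apps` input format, the `package_of` helper, and the choice to qualify method names by app are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical parsed input: app -> method -> list of API calls (smali-style names).
apps = {
    "app_one": {
        "Lcom/example/Main;->onCreate": [
            "Landroid/util/Log;->d",
            "Ljava/net/URL;->openConnection",
        ],
    },
    "app_two": {
        "Lcom/other/Net;->fetch": ["Ljava/net/URL;->openConnection"],
    },
}


def package_of(api):
    """'Landroid/util/Log;->d' -> 'android/util' (assumed naming scheme)."""
    cls = api.split(";->")[0][1:]  # drop the leading 'L' of the class descriptor
    return cls.rsplit("/", 1)[0]


def build_hin(apps):
    """Build an undirected adjacency map whose nodes are (type, name) tuples."""
    adj = defaultdict(set)

    def add_edge(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for app, methods in apps.items():
        for method, api_calls in methods.items():
            for api in api_calls:
                add_edge(("app", app), ("api", api))                   # App uses API
                add_edge(("method", f"{app}:{method}"), ("api", api))  # Method contains API
                add_edge(("package", package_of(api)), ("api", api))   # API comes from Package
    return adj


hin = build_hin(apps)
neighbors = hin[("api", "Ljava/net/URL;->openConnection")]
print(sorted(n for n in neighbors if n[0] == "app"))
# [('app', 'app_one'), ('app', 'app_two')]
```

In this sketch every edge touches an $API$ node, so any metapath between two apps has to pass through at least one API.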

### Feature extraction (Metapath2vec)
To generate our features, we apply the metapath2vec algorithm to the $App$ nodes of our HIN. That is, we 1) perform random walks starting from each app, following one or more designated metapaths, to generate a corpus, and then 2) pass this corpus into the [word2vec](https://arxiv.org/pdf/1301.3781.pdf) model to transform each $App$ into a vector. This results in a node embedding for each $App$ in our data, which we have visualized below:

<img src="../data/out/all-apps/2D-plot.png">
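
As a rough illustration of the random-walk step described in this section, the sketch below generates a small corpus of metapath-guided walks over a toy HIN. The graph, the `metapath_walk` and `build_corpus` helpers, and all parameter values are assumptions chosen for the example, not the settings used in m2vDroid.

```python
import random

# Toy HIN as an adjacency map; nodes are (type, name) tuples (illustrative data).
hin = {
    ("app", "A"): {("api", "x"), ("api", "y")},
    ("app", "B"): {("api", "y")},
    ("api", "x"): {("app", "A")},
    ("api", "y"): {("app", "A"), ("app", "B")},
}


def metapath_walk(hin, start, metapath, walk_length, rng):
    """Walk from `start`, moving at each step to a random neighbor whose type
    is the next one in `metapath`, treated as a cycle (its first and last
    types are assumed to be equal)."""
    cycle = metapath[:-1]  # e.g. ["app", "api"] for app -> api -> app
    walk = [start]
    step = 0
    while len(walk) < walk_length:
        step += 1
        next_type = cycle[step % len(cycle)]
        candidates = sorted(n for n in hin[walk[-1]] if n[0] == next_type)
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))
    return walk


def build_corpus(hin, metapath, walks_per_app=2, walk_length=5, seed=0):
    """Run several walks from every node of the metapath's first type and
    return them as lists of string tokens."""
    rng = random.Random(seed)
    corpus = []
    for node in sorted(hin):
        if node[0] != metapath[0]:
            continue
        for _ in range(walks_per_app):
            walk = metapath_walk(hin, node, metapath, walk_length, rng)
            corpus.append([f"{t}:{name}" for t, name in walk])
    return corpus


corpus = build_corpus(hin, ["app", "api", "app"])
for sentence in corpus:
    print(" ".join(sentence))
```

Each walk plays the role of a sentence. One possible way to carry out the second step, assuming the gensim library (version 4 or later) is available, would be `Word2Vec(sentences=corpus, vector_size=128, sg=1)`, after which the vectors for the `app:` tokens serve as the app embeddings. The dimension of 128 is an arbitrary placeholder.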

## ADVERSARIAL ATTACK
TODO

## EXPERIMENT SETUP
We evaluate multiple models:
- Our HinDroid implementation
- Our improved model (and possible variations)
    - random forest and a gradient-boosted model 

For each model, we:

1. Train on normal data
2. Train the Android HIV attack against these models and output perturbed source code, perturbing only the malware
3. Retrain the models on the original code pool plus the perturbed code
4. Repeat if necessary (or possible)
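
The four steps above could be organized as a loop like the one sketched below. Everything here is a hypothetical placeholder: `train_fn`, `attack_fn`, and `evaluate_fn` stand in for the real classifier training, the Android HIV-style attack, and the evaluation code, and the toy stand-ins exist only so the sketch runs end to end.

```python
def adversarial_training_loop(train_fn, attack_fn, evaluate_fn, benign, malware, rounds=3):
    """Hypothetical driver: train, attack, retrain, and repeat.

    Returns the final model and the evasion rate observed in each round."""
    malware_pool = list(malware)
    history = []
    model = train_fn(benign, malware_pool)             # step 1: train on normal data
    for _ in range(rounds):
        adversarial = attack_fn(model, malware)        # step 2: perturb malware only
        history.append(evaluate_fn(model, adversarial))
        if not adversarial:                            # step 4: stop when nothing evades
            break
        malware_pool.extend(adversarial)               # step 3: original pool + perturbed
        model = train_fn(benign, malware_pool)
    return model, history


# Toy stand-ins: the "model" is just the set of malware samples it has seen.
def toy_train(benign, malware_pool):
    return frozenset(malware_pool)


def toy_attack(model, malware):
    # "Perturb" each sample by appending padding; keep only samples that evade.
    return [m + "+pad" for m in malware if m + "+pad" not in model]


def toy_evaluate(model, samples):
    # Fraction of samples the model fails to flag.
    return sum(s not in model for s in samples) / len(samples) if samples else 0.0


model, history = adversarial_training_loop(
    toy_train, toy_attack, toy_evaluate, benign=["ok1"], malware=["m1", "m2"]
)
print(history)  # [1.0, 0.0]
```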

## RESULTS

- Initial performance of models on normal data
- Performance after Android HIV trained on data
- Performance of models 

## REFERENCES