In [2]:
import os
# move to project root
os.chdir('/home/rcgonzal/DSC180Malware/m2v-adversarial-hindroid/')

import pandas as pd
import numpy as np

from src.model.model import M2VDroid
from src.model.hindroid import Hindroid
from src.data.hindroid_etl import make_models
from src.utils import find_apps

%load_ext autoreload
%autoreload 2

# Purpose
This notebook walks through how to use this package in some detail. All paths are relative to the project directory unless they begin with the root indicator `/`.

# Data Selection
We assume you already have access to Android apps decompiled into their Smali representations. If not, look into using Apktool and Smali to decompile Android APKs (we may provide a script in the future). What we do provide is the `find_apps` function, which, given a directory, recursively looks for decompiled apps and returns a DataFrame with their locations. This is how the `app_list.csv` file begins.

In [2]:
find_apps('test/testdata/')

app,app_dir
testapp1,test/testdata/testapp1
testapp2,test/testdata/testapp2
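
The source of `find_apps` is not shown in this notebook, so the following is only a minimal sketch of the idea under one assumption: a directory counts as a decompiled app if it contains a `smali` subfolder, which is what Apktool typically produces. The helper name `find_decompiled_apps` is hypothetical and is not part of this package.

```python
import os
import tempfile

def find_decompiled_apps(root):
    """Return {app_name: app_dir} for every directory under root that looks decompiled.

    Assumption: a decompiled app is any directory with a 'smali' subfolder.
    """
    apps = {}
    for dirpath, dirnames, _ in os.walk(root):
        if "smali" in dirnames:
            # Treat this directory as one app and do not descend further into it.
            apps[os.path.basename(os.path.normpath(dirpath))] = dirpath
            dirnames.clear()
    return apps

# Build a throwaway directory tree to try the helper on.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("testapp1", "testapp2"):
        os.makedirs(os.path.join(tmp, "nested", name, "smali"))
    os.makedirs(os.path.join(tmp, "not_an_app"))
    found = find_decompiled_apps(tmp)
    print(sorted(found))
```

The resulting mapping could be turned into a table like the one above with `pd.Series(found, name='app_dir')`.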


In some cases, such as the file `data/out/all-apps/app_list.csv`, we add more columns to this table, such as which category an app comes from and whether it is malware, so that we can label our examples.

In [3]:
all_apps = pd.read_csv('data/out/all-apps/app_list.csv', dtype=str, index_col='app')
all_apps

app,app_dir,category,malware
com.kaktus.hyungkaktus,/teams/DSC180A_FA20_A00/a04malware/random-apps...,random-apps,0
com.wedup.duduamzaleg,/teams/DSC180A_FA20_A00/a04malware/random-apps...,random-apps,0
com.dublin_mobile123.cheat_gta_5,/teams/DSC180A_FA20_A00/a04malware/random-apps...,random-apps,0
com.appall.optimizationbox,/teams/DSC180A_FA20_A00/a04malware/random-apps...,random-apps,0
live.wallpaper.t910001560,/teams/DSC180A_FA20_A00/a04malware/random-apps...,random-apps,0
...,...,...,...
com.nytimes.android,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps,0
com.tinytouchtales.alchi,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps,0
com.mycelium.wallet,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps,0
com.aceviral.smashycity,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps,0


**Aside:** `all-apps` is a special folder in our `data/out` directory because it houses all apps, along with their API data in `app-data`! When parsed in our ETL, each app is extracted into its own `.csv` containing every API call made within it, which makes it easy to pick and choose apps just by knowing their names (or md5s for malware).

With that said, let's return to selecting our data. We want to split our data into two stratified halves, so that each half contains the same share of benign apps and malware. We also have a category, `random-apps`, whose labels we do not know and which we must drop from our dataset.

In [4]:
all_apps = all_apps[all_apps.category != 'random-apps']
training_sample = (
    all_apps.groupby('malware')
    .apply(lambda x: x.sample(frac=0.5, random_state=42)) # perform stratified sample
    .drop(columns='malware').reset_index().drop(columns='malware').set_index('app') # reset the index
)
training_sample

app,app_dir,category
com.hcg.cok.gp,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
com.glu.wrestling,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
com.tmusic.christmassongs,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
com.han.dominoes,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
com.jetappfactory.jetaudio,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
...,...,...
7280d6d74716513369c3a8b8f1d94676,/teams/DSC180A_FA20_A00/a04malware/malware/Ban...,malware
5d59c7c74c7133d94b8a257d749c823a,/teams/DSC180A_FA20_A00/a04malware/malware/Fak...,malware
3a54c9c23e49c0c67185d22ad2cbfc58,/teams/DSC180A_FA20_A00/a04malware/malware/Fak...,malware
ba6633d214a4e85cb157acb8da9054c1,/teams/DSC180A_FA20_A00/a04malware/malware/Fak...,malware


In [5]:
testing_sample = all_apps[['app_dir', 'category']].loc[all_apps.index.difference(training_sample.index)]
testing_sample

app,app_dir,category
00268453be254779f0c7590de47db944,/teams/DSC180A_FA20_A00/a04malware/malware/Dro...,malware
002a7270ec52ec68ea3d979c85261308,/teams/DSC180A_FA20_A00/a04malware/malware/Ban...,malware
0030e0003b7226e9142683e49b41a423,/teams/DSC180A_FA20_A00/a04malware/malware/Fak...,malware
00335946abb79777f9fe2d0d96651e03,/teams/DSC180A_FA20_A00/a04malware/malware/Vid...,malware
0038be31cfed95e13a33d87142eada70,/teams/DSC180A_FA20_A00/a04malware/malware/Fak...,malware
...,...,...
org.edx.mobile,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
org.mozilla.firefox,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
org.videolan.vlc,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
pps.christmas.photo.frames,/teams/DSC180A_FA20_A00/a04malware/popular-app...,popular-apps
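
As a quick sanity check on this kind of split, the sketch below repeats the idea on a toy table and confirms that the two halves are disjoint and keep the label balance. It is not the cell used above: the toy data is made up, the `malware` column is kept so the counts can be inspected, and it uses `DataFrameGroupBy.sample` (pandas 1.1 or newer) instead of `groupby(...).apply(...)`.

```python
import pandas as pd

# Toy stand-in for app_list.csv: 6 benign apps and 4 malware apps (made-up data).
apps = pd.DataFrame(
    {"malware": ["0"] * 6 + ["1"] * 4},
    index=pd.Index([f"app{i}" for i in range(10)], name="app"),
)

# Sample half of each label group; the original 'app' index is preserved.
train = apps.groupby("malware", group_keys=False).sample(frac=0.5, random_state=42)

# Everything not drawn for training becomes the test half.
test = apps.loc[apps.index.difference(train.index)]

print(train["malware"].value_counts().to_dict())
print(test["malware"].value_counts().to_dict())
print("overlap:", len(train.index.intersection(test.index)))
```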


In [6]:
# create two separate directories for each sample and save both to their respective directory
os.makedirs('data/out/train-half', exist_ok=True)
os.makedirs('data/out/test-half', exist_ok=True)
training_sample.to_csv('data/out/train-half/app_list.csv')
testing_sample.to_csv('data/out/test-half/app_list.csv')

Now we must train a model on the training set. To do that, we must run the ETL pipeline on that directory. We therefore set `config/etl-params/etl-params.json` as shown below and then execute `python run.py data`. *This may take a few hours to run, especially the random walks!*

```json
{
    "outfolder": "data/out/train-half",
    "parse_params": {
        "nprocs": 16
    },
    "feature_params": {
        "redo": false,
        "walk_args": {
            "nprocs": 16,
            "length": 60,
            "n": 3,
            "metapaths": [
                ["app", "api", "app"],
                ["app", "api", "method", "api", "app"],
                ["app", "api", "package", "api", "app"],
                ["app", "api", "package", "api", "method", "api", "app"],
                ["app", "api", "method", "api", "package", "api", "app"]
            ]
        },
        "w2v_args": {
            "size": 128,
            "window": 7,
            "min_count": 0,
            "negative": 5,
            "sg": 1,
            "workers": 16,
            "iter": 5
        }
    },
    "hindroid_params": {
        "redo": false
    }
}
```
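
To give a feel for what the `metapaths` and `length` entries in `walk_args` describe, here is a minimal sketch of a metapath-guided random walk on a toy graph. It is an illustration only and not this package's implementation: the graph, the `type:name` node naming, and the function `metapath_walk` are all hypothetical, and the exact meaning of `n` in the config is not covered here.

```python
import random

# Toy heterogeneous graph: node -> list of neighbours.
# The node type is the prefix before ':' (an assumption made for this sketch).
graph = {
    "app:a1": ["api:x", "api:y"],
    "app:a2": ["api:y"],
    "api:x": ["app:a1", "method:m1"],
    "api:y": ["app:a1", "app:a2", "method:m1"],
    "method:m1": ["api:x", "api:y"],
}

def node_type(node):
    return node.split(":", 1)[0]

def metapath_walk(graph, start, metapath, length, rng):
    """Walk from start, moving only to neighbours of the next type in the metapath."""
    # The metapath starts and ends on the same type, so drop the last entry and cycle.
    cycle = metapath[:-1]
    walk = [start]
    while len(walk) < length:
        wanted = cycle[len(walk) % len(cycle)]
        candidates = [n for n in graph[walk[-1]] if node_type(n) == wanted]
        if not candidates:
            break  # dead end: no neighbour of the required type
        walk.append(rng.choice(candidates))
    return walk

rng = random.Random(0)
walk = metapath_walk(
    graph, "app:a1", ["app", "api", "method", "api", "app"], length=9, rng=rng
)
print(walk)
```

Walks like this one form the "sentences" that a word2vec-style model, configured through `w2v_args`, can then embed.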

In [14]:
%time !python run.py data

2021-02-16 01:24:54.385458: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-16 01:24:54.385500: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-02-16 01:24:56.891521: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-16 01:24:56.894470: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-16 01:24:56.923293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:61:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidt

In [7]:
%time make_models('data/out/train-half/')

Fitting models:


  0%|          | 0/30 [00:00<?, ?it/s]

	Fitting AAT model...


100%|██████████| 30/30 [00:28<00:00,  1.07it/s]
  0%|          | 0/30 [00:00<?, ?it/s]

	Fitting ABAT model...


100%|██████████| 30/30 [06:26<00:00, 12.88s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

	Fitting APAT model...


100%|██████████| 30/30 [00:50<00:00,  1.70s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

	Fitting ABPBTAT model...


100%|██████████| 30/30 [45:00<00:00, 90.02s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

	Fitting APBPTAT model...


100%|██████████| 30/30 [28:04<00:00, 56.16s/it]


              acc    recall        f1
kernel                               
AAT      1.000000  1.000000  1.000000
ABAT     0.997603  0.999275  0.998732
APAT     1.000000  1.000000  1.000000
ABPBTAT  1.000000  1.000000  1.000000
APBPTAT  0.988699  0.998187  0.994042
CPU times: user 8h 6min 28s, sys: 38min 55s, total: 8h 45min 24s
Wall time: 1h 21min 26s
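
For reference, the `acc`, `recall`, and `f1` columns can be read with the standard binary-classification definitions. How `make_models` computes them, and on which data, is not shown here, so the sketch below only illustrates those definitions on made-up labels, assuming malware (`1`) is the positive class. The helper `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, recall, and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    acc = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": acc, "recall": recall, "f1": f1}

# Made-up labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```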


From here, we can create the models we will use. Note that we included `"hindroid_params"` in the config file, so we also fitted a Hindroid model on the data. We will also describe how to use that class, though both classes work in largely the same way.

In [3]:
m2vDroid = M2VDroid('data/out/train-half/',
                    classifier_args={'max_depth':3, 'n_jobs':-1})

In [None]:
# also saves output table to a folder
m2vDroid.fit_predict('data/out/test-half/', 
                     walk_args={
                         "nprocs": 16,
                         "length": 60,
                         "n": 3,
                         "metapaths": [
                             ["app", "api", "app"],
                             ["app", "api", "method", "api", "app"],
                             ["app", "api", "package", "api", "app"],
                             ["app", "api", "package", "api", "method", "api", "app"],
                             ["app", "api", "method", "api", "package", "api", "app"]
                         ]
                     },
                     w2v_args={
                         "size": 128,
                         "window": 7,
                         "min_count": 0,
                         "negative": 5,
                         "sg": 1,
                         "workers": 16,
                         "iter": 5
                     })

Computing new edges


In [None]:
hindroid = Hindroid('data/out/train-half/')
hindroid.fit_predict('data/out/test-half/')