# Hindroid

This repository contains a replication of the [Hindroid](https://www.cse.ust.hk/~yqsong/papers/2017-KDD-HINDROID.pdf) paper (DOI: [10.1145/3097983.3098026](https://doi.org/10.1145/3097983.3098026)), together with a plan for future implementation work.

## What Is Hindroid

The main task of Hindroid is to use graph-based machine learning to classify Android apps as benign or malicious. Hindroid is designed as an intelligent Android malware detection system based on a structured heterogeneous information network (HIN), a graph whose nodes and edges can have different types.


## What is the Data

### APK and Smali

The paper uses static analysis to identify malware, working from code extracted from the [.apk](https://en.wikipedia.org/wiki/Android_application_package) files of apps. Because .apk files can be reverse-engineered, we decompile them to [Smali code](https://limbenjamin.com/articles/analysing-smali-code.html) with [ApkTool](https://ibotpeaches.github.io/Apktool/). We then apply techniques similar to those used in Natural Language Processing to extract features, in particular the nodes and edges of the network.

The paper focuses mainly on API calls in the Smali code. An [API](https://en.wikipedia.org/wiki/Application_programming_interface) (Application Programming Interface) is an interface or communication protocol between parts of a computer program, intended to simplify the implementation and maintenance of software. Android apps use API calls to access operating system functionality and system resources. Through API calls, an app can request system permissions and perform low-level actions such as sending HTTP requests to an unknown server.

#### Example

API calls via Smali

```smali
invoke-virtual {v2, v3}, Ljava/lang/Runtime;->exec(Ljava/lang/String;)Ljava/lang/Process;
```

```smali
invoke-virtual {v2}, Ljava/lang/Process;->getInputStream()Ljava/io/InputStream;
```

### Data Design & Collection

#### Abstract

The data we use to replicate the paper consists of:

- Benign Android applications from [APKPure](http://apkpure.com)
- Malicious Android applications from our private source.

The benign apps come from APKPure, an online app platform similar to the Play Store. We use APKPure instead of the Google Play Store because it is more scraping-friendly: the Play Store requires a Google account even to "purchase" a free app. We can use APKPure's sitemap to sample our benign apps. More importantly, APKPure is an APK recommendation site whose editors pre-screen apps, which reduces the chance of malicious apps ending up in our benign sample.
The malicious Android applications come from a private source, so that the data cannot be used maliciously.

With the benign and malicious samples, we have both positive and negative labels for our classification task, so we can apply ML algorithms for binary classification.

Portions of the project focused on learning graph techniques will also use examples from other languages (for example, Python and Java source code).

Under the `utils` folder, we build Python utility functions to download APKs and convert them to Smali code.

#### Pros

- Using Smali code as our data is appropriate for the following reasons:
  - Static analysis is a novel and secure way to perform malware detection. Unlike traditional detection that runs an APK in a virtual machine or on a real device, it never executes the APK, so the malware cannot damage our devices while we analyze it.
  - Static analysis is efficient for mass malware detection over many apps, in business use as well as personal use: instead of running each file, we only scan its code.
  - Static analysis holds up well under iteration. By feeding in new data and tuning parameters, the classifier can follow malware trends and keep detecting new samples.
- Using APKPure as our benign data source is appropriate for the following reasons:
  - APKPure is a secondary app store whose editors screen apps before release, while we take the Play Store to apply significantly less screening on app release. APKPure's samples are therefore more trustworthy and make good benign samples.
  - APKPure is scraping-friendly. Compared to the Play Store, which requires a Google account to download and purchase apps, APKPure does not require an account to download APKs. Moreover, APKPure lists a sitemap in its robots.txt, which makes it easier to sample our dataset.
- The benign Android applications and malware samples are a good match for our classification task.
  - With balanced positive and negative APK samples, we can build a robust classifier to tell malware from benign apps.

#### Cons

- Limitation of the Benign Sample
  - Although APKPure is more trustworthy than the Google Play Store, it is still questionable whether every app on APKPure is benign. If a large share of the samples we label as benign are actually malicious, our classifier becomes less robust or even invalid. We must stay aware that not every app on APKPure is benign.
  - Since we can only download free apps from APKPure, our data design has a significant limitation: we cannot access paid apps, so our sample differs from the real-world population. Even though paid apps are unlikely to be malware, we should not ignore them.
- Limitation of the Malicious Sample
  - Apps on APKPure are updated over time, but our malicious sample comes from a historical database. There is a time gap between the benign and malicious samples, and keeping the malicious sample up to date is not easy.
  - The malicious sample is much smaller than the benign sample, so balancing the two is not easy.
- Limitation of Only Detecting API Calls
  - The paper only targets API calls. Malicious apps that use only non-suspicious API calls cannot be detected by our classifier. The paper also does not analyze the relationships between methods and classes.
  - Repeated use of a specific API call is not reflected in the paper's feature extraction, which can make the classifier less accurate.

#### Past Efforts

- Traditional Approach
  - The traditional approach to detecting malware or security threats is to scan an app's signature and compare it against a database of identified malicious apps. This approach is harder to iterate on because the malware database must be kept up to date.
- Dynamic Analysis
  - Others use dynamic analysis to detect malware. Because this method needs a running virtual machine to execute the apps, it raises security concerns and is more computationally heavy.
- Static Analysis
  - Rather than relating API calls through a structured heterogeneous information network, some earlier work used ML to construct similarities between apps and identify malware from them.

### Data Ingestion Process

#### Data Accessibility

- Data Origin
  - Benign Android applications from [APKPure](http://apkpure.com)
  - Malicious Android applications from our private source.
- Legal Issues
  - According to APKPure's [Term of Use](https://apkpure.com/terms.html)

    ```Note: APKPure.com is NOT associated or affiliated with Google, Google Play or Android in any way. Android is a trademark of Google Inc. All the apps & games are property and trademark of their respective developer or publisher and for HOME or PERSONAL use ONLY. Please be aware that APKPure.com ONLY SHARE THE ORIGINAL APK FILE FOR FREE APPS. ALL THE APK FILE IS THE SAME AS IN GOOGLE PLAY WITHOUT ANY CHEAT, UNLIMITED GOLD PATCH OR ANY OTHER MODIFICATIONS.```

    it specifies that APKPure's data is for personal use only. Since our project is a personal capstone project with no commercial purpose, we believe our use of the data raises no legal issues.
  - According to APKPure's [robots.txt](https://apkpure.com/robots.txt), [sitemap.xml](https://apkpure.com/sitemap.xml) is provided for crawlers, so using it does not violate the site's scraping rules.

#### Data Privacy

*subject to change

- We follow APKPure's [Privacy Policy](https://apkpure.com/privacy-policy.html). If necessary, we will provide our information as the policy requests.
- The data we collect is published by APKPure, so we do not expect privacy concerns. Regardless, we will still anonymise our data with the following steps:
  - anonymise the APK URL with SHA-256 hashing.
  - anonymise the app name with a two-way (reversible) mapping.
  - anonymise APK file names, if necessary, with SHA-256 hashing.
  - anonymise the APK developer with a two-way (reversible) mapping.
  - anonymise the APK signature, if necessary, with SHA-256 hashing.
  - anonymise the APK category with a two-way (reversible) mapping.
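
The SHA-256 steps above can be sketched with Python's standard `hashlib`. This is a minimal illustration, not the repository's code: the function names `anonymise` and `build_lookup` and the optional `salt` parameter are assumptions. Note that SHA-256 is a one-way hash, so recovering the original value requires keeping a private lookup table.

```python
import hashlib


def anonymise(value: str, salt: str = "") -> str:
    """Return the hex SHA-256 digest of salt + value (one-way)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def build_lookup(values, salt: str = "") -> dict:
    """Map each digest back to its original value.

    Hypothetical helper: keep this table private if the original
    values ever need to be recovered.
    """
    return {anonymise(v, salt): v for v in values}
```

With an empty salt the output is the plain SHA-256 digest; a non-empty salt makes the digests harder to match against public URL lists.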

#### Data Schemas

- Since we feed the data into an ML pipeline for classification, we preprocess it and store it in a designed data schema of the following form:

  ``` source
├── datasets
│   ├── external
│   ├── interim
│   │   ├── b_features
│   │   └── m_features
│   ├── processed
│   │   ├── matrices
│   │   │   ├── A.npz
│   │   │   ├── B.npz
│   │   │   ├── P.npz
│   │   │   └── ref
│   │   │       ├── api_ref.json
│   │   │       └── app_ref.json
│   │   └── results
│   │       └── results.csv
│   └── raw
│       ├── apps
│       └── smali
├── metadata
│   └── metadata.csv
└── tests
    ├── external
    ├── interim
    │   ├── b_features
    │   └── m_features
    ├── processed
    │   ├── matrices
    │   │   ├── A.npz
    │   │   ├── B.npz
    │   │   ├── P.npz
    │   │   └── ref
    │   │       ├── api_ref.json
    │   │       └── app_ref.json
    │   └── results
    │       └── results.csv
    └── raw
        ├── apps
        └── smali

  ```

  Since APKs are fairly large and we are only interested in each app's API calls, we may keep only the `AndroidManifest.xml` file and the smali folders. For each app, after extracting the Smali code, we delete the .apk file.

- We create one overall `metadata.csv` that stores each app's features according to its sitemap entry.

  The metadata consists of the following columns:

  - `loc`: the URL of the specific app
  - `lastmod`: the datetime of the app's last update
  - `changefreq`: how frequently the entry is expected to change, used to check for updates
  - `priority`: the priority group of the app
  - `sitemap_url`: the URL in sitemap.xml

  The metadata is the map we sample from, so different sampling strategies can be built on top of it.
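
As a rough sketch of how these columns could be filled, the snippet below parses sitemap XML with the standard library. It assumes the file follows the standard sitemaps.org schema; the function name `sitemap_to_rows` is hypothetical, and storing the sitemap's own URL in `sitemap_url` is one possible reading of that column.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def sitemap_to_rows(xml_text, sitemap_url):
    """Turn each <url> entry of a sitemap into one metadata row (dict)."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", NS):
        # Missing fields become empty strings.
        row = {
            field: (url.findtext(f"sm:{field}", default="", namespaces=NS) or "").strip()
            for field in FIELDS
        }
        # Assumption: sitemap_url records which sitemap the entry came from.
        row["sitemap_url"] = sitemap_url
        rows.append(row)
    return rows
```

The resulting list of dicts can be written to `metadata.csv` with `csv.DictWriter`.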

### Data Ingestion Pipeline

#### Data Sampling

  Get the list of APK URLs to download from `sitemap.xml`.

- Initialize `metadata.csv` from `sitemap.xml`

    Initializing the metadata gives us a hint about what data to sample.

- Naive sampling
  
    Randomly sample the same number of APKs from APKPure as there are apps in the malware sample.
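
Naive sampling can be sketched as follows. The function name `naive_sample` and the `seed` parameter are illustrative assumptions; the sketch simply draws as many benign URLs as there are malware samples, without replacement.

```python
import random


def naive_sample(benign_urls, n_malware, seed=0):
    """Draw as many benign URLs as there are malware samples."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    pool = list(benign_urls)
    k = min(n_malware, len(pool))  # cannot draw more than we have
    return rng.sample(pool, k)
```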
  
#### Data Ingesting

- Download the APKs listed in a given `app-url.json`.
- Decompile each APK to Smali using apktool.
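
The decompilation step might be wrapped as below. This is a sketch that assumes `apktool` is installed and on the `PATH`; the helper names `apktool_command` and `decompile` are hypothetical. It uses apktool's `d` (decode) command with `-o` for the output directory and `-f` to overwrite an existing one.

```python
import subprocess
from pathlib import Path


def apktool_command(apk_path, out_dir):
    """Build the apktool decode command for one APK."""
    return ["apktool", "d", "-f", "-o", str(out_dir), str(apk_path)]


def decompile(apk_path, out_root):
    """Decode an APK into <out_root>/<apk name>/ and return that directory."""
    out_dir = Path(out_root) / Path(apk_path).stem
    # check=True raises CalledProcessError if apktool fails.
    subprocess.run(apktool_command(apk_path, out_dir), check=True)
    return out_dir
```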

## Feature Extraction

### API Call Extraction

  Each app's Smali code is parsed into API calls, which are grouped into a .csv file. For example, Instagram's Smali code is extracted as `instagram.csv` with the following columns: `block`, `invocation`, `package`, `method_name`, `app`.

- API call of an app: `package` + '->' + `method_name`
- Method name of the API call: `method_name`
- Code block containing the API call: `block`
- Package used by the API call: `package`
- Invocation type of the API call: `invocation`
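
The extraction can be sketched with a regular expression over `invoke-*` lines. This is an illustration rather than the repository's implementation: it assumes a code block is the body between `.method` and `.end method`, numbers blocks per file, and uses the hypothetical name `extract_api_calls`.

```python
import re

# invoke-<kind> {registers}, <class>-><method>(
INVOKE_RE = re.compile(
    r"^\s*invoke-(?P<invocation>[\w/]+)\s+\{[^}]*\},\s*"
    r"(?P<package>\[*L[^;]+;)->(?P<method_name>[^\s(]+)\("
)


def extract_api_calls(smali_text, app):
    """Return one row per API call found inside method bodies."""
    rows = []
    block_id = -1
    in_block = False
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            block_id += 1  # assumption: each method is one code block
            in_block = True
        elif stripped.startswith(".end method"):
            in_block = False
        elif in_block:
            match = INVOKE_RE.match(stripped)
            if match:
                rows.append({
                    "block": block_id,
                    "invocation": match["invocation"],
                    "package": match["package"],
                    "method_name": match["method_name"],
                    "app": app,
                })
    return rows
```

Each row maps directly onto the columns listed above, and the API call identifier is `row["package"] + "->" + row["method_name"]`.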

## ML Deployment

### Baseline Model

- The baseline model lives in `src/models/baseline.py`.

- It uses the results from EDA to build the following feature engineering pipeline for each app:

#### 1. Preprocess

- numerical:
  - `api`: number of unique APIs
  - `block`: number of unique blocks
  - `package`: number of packages
- categorical:
  - `most_invo`: the most called invocation
  - `most_api`: the most called API
  - `package`: the most used package
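
These per-app features can be sketched with `collections.Counter`. The input is assumed to be a list of row dicts with the `block`, `invocation`, `package`, and `method_name` keys from the extraction step. The function names are hypothetical, and the categorical package feature is stored under the illustrative key `most_package` so it does not clash with the numerical `package` count.

```python
from collections import Counter


def most_common_or_none(values):
    """Return the most frequent value, or None for an empty input."""
    counts = Counter(values)
    return counts.most_common(1)[0][0] if counts else None


def baseline_features(rows):
    """Summarise one app's API-call rows into baseline features."""
    apis = [r["package"] + "->" + r["method_name"] for r in rows]
    packages = [r["package"] for r in rows]
    return {
        # numerical features
        "api": len(set(apis)),
        "block": len({r["block"] for r in rows}),
        "package": len(set(packages)),
        # categorical features
        "most_invo": most_common_or_none(r["invocation"] for r in rows),
        "most_api": most_common_or_none(apis),
        "most_package": most_common_or_none(packages),  # hypothetical key
    }
```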
        
#### 2. Column Transformation

- Columns are transformed according to their data type:
  - `numerical`: standardized with scikit-learn's `StandardScaler`
  - `categorical`: converted to dummy variables with scikit-learn's `OneHotEncoder`
        
#### 3. Classifier

- Logistic Regression
- SVM
- Random Forest

Their performance and results are shown below.

#### 4. Results

- With 40 samples, calling `run.main('baseline-test')` from a notebook (after appending `'../'` to `sys.path` and importing `run`) gives the following training and testing metrics.

Training metrics:

| method | f1 | acc | tp | fp | tn | fn |
|---|---|---|---|---|---|---|
| LogisticRegression | 0.888889 | 0.884615 | 12 | 3 | 11 | 0 |
| SVC | 0.888889 | 0.884615 | 12 | 3 | 11 | 0 |
| RandomForestClassifier | 1.000000 | 1.000000 | 12 | 0 | 14 | 0 |

Testing metrics:

| method | f1 | acc | tp | fp | tn | fn |
|---|---|---|---|---|---|---|
| LogisticRegression | 1.000000 | 1.000000 | 8 | 0 | 6 | 0 |
| SVC | 0.941176 | 0.928571 | 8 | 1 | 5 | 0 |
| RandomForestClassifier | 1.000000 | 1.000000 | 8 | 0 | 6 | 0 |

- Calling `run.main('baseline')`, which reads from `data/datasets`, currently fails with:

  `AnalysisException: 'Path does not exist: file:/datasets/home/home-03/87/887/shy166/dsc180a/hindroid/notebooks/data/datasets/interim/b_features/adolygu-revision.csv;'`

##### Observations

- The baseline model already gives decent results. The main error the classifiers make is the false positive (wrongly flagging an app as malware), so this is a fairly strict baseline for classifying malware.

- On large data, the baseline model will be computationally heavy, mainly because of the cost of the preprocessing step.
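
The `f1` and `acc` columns in the reported metrics follow directly from the confusion counts: accuracy is `(tp + tn) / (tp + fp + tn + fn)` and F1 is `2*tp / (2*tp + fp + fn)`. The helper below is a small sketch for checking a row by hand; its name is an assumption.

```python
def f1_and_accuracy(tp, fp, tn, fn):
    """Compute (f1, accuracy) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, accuracy
```

For example, the LogisticRegression training row (`tp=12, fp=3, tn=11, fn=0`) gives F1 = 24/27 and accuracy = 23/26, matching the reported 0.888889 and 0.884615.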

### Hindroid Model

#### 1. Preprocess w/ Matrix Construction

  We use Hindroid's method to construct our feature matrices, described as follows:

##### A Matrix

  $a_{ij}$ is defined as:
  
  "If $app_i$ contains $api_j$ , then $a_{ij} = 1$; otherwise, $a_{ij} = 0$."
  
###### Implementation Detail
- We use Spark to build the adjacency matrix: first we assign an id to every unique API call and app with `StringIndexer`, then we output the unique (app id, API id) pairs and feed them into a sparse coordinate matrix to construct the APP x API matrix A.
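
The same idea can be shown on a small scale with NumPy. This is an in-memory stand-in for the Spark pipeline, not the repository's code: a dictionary plays the role of the id assignment, and a dense array replaces the sparse coordinate matrix. The name `build_incidence` is hypothetical.

```python
import numpy as np


def build_incidence(pairs):
    """Build a 0/1 matrix from (row_key, col_key) pairs.

    Returns the matrix plus the key -> index maps for rows and columns.
    """
    pairs = list(pairs)
    row_ids, col_ids = {}, {}
    for row_key, col_key in pairs:
        # Assign the next free id on first sight (like an indexer).
        row_ids.setdefault(row_key, len(row_ids))
        col_ids.setdefault(col_key, len(col_ids))
    matrix = np.zeros((len(row_ids), len(col_ids)), dtype=np.int8)
    for row_key, col_key in pairs:
        matrix[row_ids[row_key], col_ids[col_key]] = 1  # duplicates stay 1
    return matrix, row_ids, col_ids
```

Feeding it (app, API) pairs yields A. The same helper applied to (API, block) or (API, package) pairs yields the intermediate matrices used for B and P.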
        
##### B Matrix
  
  $b_{ij}$ is defined as:
  
  "If $api_i$ and $api_j$ co-exist in the same code block, then $b_{ij} = 1$; otherwise, $b_{ij} = 0$."

###### Implementation Detail
- Using the same approach as for A, we construct an API x Block matrix.
- We take the product of the (API x Block) matrix and its transpose (Block x API) to get a square matrix, then set every nonzero entry to 1. This gives the API x API matrix B.
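
A minimal NumPy sketch of this step, assuming a dense 0/1 API x Block matrix (the project itself uses Spark, and the name `cooccurrence` is hypothetical): the product with the transpose counts shared blocks, and thresholding positive counts to 1 gives the binary relation.

```python
import numpy as np


def cooccurrence(incidence):
    """Turn an API x Group 0/1 matrix into an API x API 0/1 matrix."""
    wide = np.asarray(incidence, dtype=np.int64)  # avoid small-int overflow
    counts = wide @ wide.T  # counts[i, j] = number of groups shared by i and j
    return (counts > 0).astype(np.int8)
```

The diagonal is 1 for every API that appears in at least one group. Passing an API x Package matrix instead produces a package-based relation in the same way.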

##### P Matrix

  $p_{ij}$ is defined as:
  
  "If $api_i$ and $api_j$ have the same package name, then $p_{ij} = 1$; otherwise, $p_{ij} = 0$."
  
###### Implementation Detail

- Similar to the B matrix, we first generate an API x Package matrix, then take the product with its transpose and set every nonzero entry to 1.

#### 2. Fit into Precomputed Kernel SVM

- With these matrices, we can build classifiers based on different meta-paths (kernel paths). Specifically, we implement $AA^T$, $ABA^T$, $APA^T$, and $APBP^TA^T$ as kernel functions for an SVM. We use scikit-learn's built-in SVC for the classifier (`sklearn.svm.SVC(kernel='precomputed')`).
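
The four kernels can be sketched with NumPy as below. This is an illustration under the assumption of small dense matrices; the function name and the split into `A_left` and `A_right` are not taken from the repository. The split is there because a precomputed-kernel SVC expects a (train x train) matrix for `fit` and a (test x train) matrix for `predict`.

```python
import numpy as np


def metapath_kernels(A_left, A_right, B, P):
    """Compute the four meta-path kernels between two sets of apps.

    A_left, A_right: App x API matrices sharing the same API columns.
    B, P: API x API matrices.
    """
    A_left = np.asarray(A_left, dtype=np.int64)
    A_right = np.asarray(A_right, dtype=np.int64)
    B = np.asarray(B, dtype=np.int64)
    P = np.asarray(P, dtype=np.int64)
    return {
        "AA": A_left @ A_right.T,
        "ABA": A_left @ B @ A_right.T,
        "APA": A_left @ P @ A_right.T,
        "APBPA": A_left @ P @ B @ P.T @ A_right.T,
    }
```

For training, call `metapath_kernels(A_train, A_train, B, P)` and pass one of the resulting matrices to `fit`. For prediction, call `metapath_kernels(A_test, A_train, B, P)` and pass the matching matrix to `predict`.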

#### 3. Why Hindroid?

- Hindroid is an intuitive and reasonable approach. From our EDA, we see significant differences between malware and benign apps in their API calls, packages, and blocks. Malware tends to call certain API calls frequently, and some packages are frequently used by malware. Code blocks are also a good proxy for app complexity, since malware is in general less complex than benign apps.

#### 4. Explanation of Approach
- Every app contains a great many API calls, packages, and blocks, so we need to handle the data once it grows large. Because memory is limited, we use distributed packages rather than pandas: Spark and Dask. Specifically, we use Dask to handle the task stream that extracts API calls, blocks, etc. from smali files, and PySpark for matrix construction, since Spark handles adjacency-matrix construction easily.

- For the B and P matrices, we first compute the API x Block and API x Package matrices to reduce the computational cost, since our dataset contains a very large number of distinct API calls while the numbers of unique blocks and packages are much smaller.

#### 5. Experimental Results

The results are read with pandas from `training.csv` and `testing.csv` under `../data/tests/processed/results/` (loaded as `test_train`, `test_test`) and `../data/datasets/processed/results/` (loaded as `real_train`, `real_test`). The recorded notebook cell that loads them ends with:

`FileNotFoundError: [Errno 2] File b'../data/tests/processed/results/training.csv' does not exist: b'../data/tests/processed/results/training.csv'`

For our 53-sample dataset, the displayed training metrics (`test_train`) are:

| method | f1 | acc | tp | fp | tn | fn |
|---|---|---|---|---|---|---|
| AA | 1.0 | 1.0 | 14 | 0 | 21 | 0 |
| ABA | 1.0 | 1.0 | 14 | 0 | 21 | 0 |
| APA | 1.0 | 1.0 | 14 | 0 | 21 | 0 |
| APBPA | 1.0 | 1.0 | 14 | 0 | 21 | 0 |

The displayed testing metrics (`test_test`) are:

| method | f1 | acc | tp | fp | tn | fn |
|---|---|---|---|---|---|---|
| AA | 0.75 | 0.777778 | 6 | 4 | 8 | 0 |
| ABA | 0.857143 | 0.888889 | 6 | 2 | 10 | 0 |
| APA | 0.75 | 0.777778 | 6 | 4 | 8 | 0 |
| APBPA | 0.857143 | 0.888889 | 6 | 2 | 10 | 0 |

The `real_train` and `real_test` results were not displayed: that cell failed with `NameError: name 'real_train' is not defined`.