In [None]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

from sklearn import preprocessing
import torch
from sklearn import svm
from sklearn import tree
import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pickle
import numpy as np
import seaborn as sns

**Static Analysis**
In this first step, I analyze a set of features to test the following hypothesis: malware and benign samples differ in the permissions they use.

For this approach, I wrote a script that extracts the permissions of each application and writes them to a CSV file. The script maps each APK (Android Application Package) against a list of permissions.
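
The extraction script itself is not shown here, so the following is only a minimal sketch of the mapping step. It assumes the permissions have already been read from each APK's manifest; the APK names, the permission lists, and the `apk` column are hypothetical and do not reflect the real layout of `train.csv`. The `;` delimiter matches the one used when the dataset is loaded below.

```python
import csv
import io

# Hypothetical input: permissions already extracted from each APK's manifest.
apk_permissions = {
    "sample_a.apk": ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    "sample_b.apk": ["android.permission.INTERNET"],
    "sample_c.apk": ["android.permission.READ_CONTACTS", "android.permission.SEND_SMS"],
}

def build_permission_rows(apk_permissions):
    """Return (header, rows) where each row holds one 0/1 flag per permission."""
    # The sorted union of all permissions gives a stable column order.
    columns = sorted({p for perms in apk_permissions.values() for p in perms})
    rows = []
    for apk, perms in sorted(apk_permissions.items()):
        granted = set(perms)
        rows.append([apk] + [1 if col in granted else 0 for col in columns])
    return ["apk"] + columns, rows

header, rows = build_permission_rows(apk_permissions)

# Write to an in-memory buffer; use open("permissions.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

Each row is one application and each column one permission, so the result is a binary matrix that a classifier can consume directly once a label column is added.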

For the following analysis, we are going to explore the **Malgenome dataset**.

In [None]:
import pandas as pd
df = pd.read_csv("../input/datasetandroidpermissions/train.csv", sep=";")

pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 

In [None]:
df = df.astype("int64")
df.type.value_counts()

`type` is the label that indicates whether an application is malware (1) or benign (0). As the counts above show, this dataset is balanced.

In [None]:
df.shape

*Let's get the top 10 permissions used by the malware samples and by the benign samples*

*Malicious*

In [None]:
# Drop the label column first so that `type` itself is not counted as a permission
df[df.type == 1].drop(columns="type").sum(axis=0).sort_values(ascending=False)[:10]

*Benign*

In [None]:
# Drop the label column first so that `type` itself is not counted as a permission
df[df.type == 0].drop(columns="type").sum(axis=0).sort_values(ascending=False)[:10]

In [None]:
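import matplotlib.pyplot as plt

# Each panel has its own permission names on the x-axis, so the axes are not shared
fig, axs = plt.subplots(nrows=2, figsize=(8, 10))
benign_top = df[df.type == 0].drop(columns="type").sum(axis=0).sort_values(ascending=False)[:10]
malware_top = df[df.type == 1].drop(columns="type").sum(axis=0).sort_values(ascending=False)[:10]
benign_top.plot.bar(ax=axs[0], title="Benign")
malware_top.plot.bar(ax=axs[1], color="red", title="Malicious")
fig.tight_layout()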

The outputs above suggest that malware and benign applications differ in the permissions they use.
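
One way to put a number on that difference is to compare, for each permission, the fraction of malware samples that use it with the fraction of benign samples that do. The sketch below does this in plain Python on a tiny made-up table; the rows and the shortened permission names are hypothetical and are not taken from the dataset.

```python
# Toy rows (hypothetical): one dict per app; "type" is 1 for malware, 0 for benign.
apps = [
    {"type": 1, "SEND_SMS": 1, "INTERNET": 1, "CAMERA": 0},
    {"type": 1, "SEND_SMS": 1, "INTERNET": 1, "CAMERA": 0},
    {"type": 0, "SEND_SMS": 0, "INTERNET": 1, "CAMERA": 1},
    {"type": 0, "SEND_SMS": 0, "INTERNET": 1, "CAMERA": 0},
]

def usage_rate(apps, label):
    """Fraction of apps with the given label that use each permission."""
    subset = [a for a in apps if a["type"] == label]
    perms = [k for k in subset[0] if k != "type"]
    return {p: sum(a[p] for a in subset) / len(subset) for p in perms}

malware_rate = usage_rate(apps, 1)
benign_rate = usage_rate(apps, 0)

# Positive gap: more common in malware. Negative gap: more common in benign apps.
gap = {p: malware_rate[p] - benign_rate[p] for p in malware_rate}
for perm, diff in sorted(gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{perm}: {diff:+.2f}")
```

On the real DataFrame, the per-class usage rates can be obtained with `df.groupby("type").mean()`, since every permission column holds 0/1 values.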

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:330], df['type'], test_size=0.20, random_state=42)

**Naive Bayes algorithm**

Naive Bayes is a classification algorithm based on Bayes' theorem. It can be
used for both binary and multi-class classification problems. Its central idea
is to treat each feature independently: the method evaluates the probability
of each feature on its own, regardless of any correlations, and makes its
prediction using Bayes' theorem. That is why the method is called "naive":
in real-world problems, features often have some level of correlation with
each other.
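
To make the independence assumption concrete, here is a minimal from-scratch sketch of the Bernoulli variant of naive Bayes, which suits 0/1 permission flags. This is an illustration only: the notebook itself fits scikit-learn's `GaussianNB`, and the toy data, the function names, and the smoothing constant `alpha` are assumptions made for this example.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and P(x_j = 1 | class) with Laplace smoothing."""
    model = {}
    n_features = len(X[0])
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = len(rows) / len(X)
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (prior, probs)
    return model

def predict_bernoulli_nb(model, x):
    """Pick the class with the highest log posterior, treating features as independent."""
    scores = {}
    for label, (prior, probs) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods are simply added up.
        for xj, p in zip(x, probs):
            score += math.log(p if xj else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data (hypothetical): columns are [SEND_SMS, INTERNET, CAMERA]; label 1 = malware.
X = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 1, 0, 0, 0]

model = fit_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 1, 0]))
print(predict_bernoulli_nb(model, [0, 1, 1]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters with several hundred permission columns.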



In [None]:
# Naive Bayes algorithm
gnb = GaussianNB()
gnb.fit(X_train, y_train)

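# pred
pred = gnb.predict(X_test)

# accuracy (scikit-learn metrics expect y_true first, then y_pred)
accuracy = accuracy_score(y_test, pred)
print("naive_bayes")
print(accuracy)
# Swapping the arguments would swap precision and recall in the report
print(classification_report(y_test, pred))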

**K-Nearest Neighbors**

K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms,
yet it is often accurate. KNN is a non-parametric algorithm, meaning that it
makes no assumptions about the underlying data distribution. In real-world
problems, data rarely obeys general theoretical assumptions, which makes
non-parametric algorithms a good fit for such problems. The KNN model
representation is as simple as the dataset itself: no explicit training is
required, because the entire training set is stored.
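
The idea can be sketched in a few lines of plain Python: store the training rows, then classify a new sample by majority vote among its k closest rows. The toy data and function names below are hypothetical. The sketch uses Hamming distance; on 0/1 vectors this ranks neighbors the same way as the Euclidean distance that `KNeighborsClassifier` uses by default, although ties may be broken differently.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training rows closest to x."""
    # sorted() is stable, so rows at equal distance keep their original order.
    neighbors = sorted(zip(X_train, y_train), key=lambda pair: hamming(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): columns are [SEND_SMS, INTERNET, CAMERA]; label 1 = malware.
X_toy = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1]]
y_toy = [1, 1, 1, 0, 0, 0]

print(knn_predict(X_toy, y_toy, [1, 1, 1], k=3))
print(knn_predict(X_toy, y_toy, [0, 0, 1], k=3))
```

Note that "fitting" here amounts to keeping `X_toy` and `y_toy` around; all the work happens at prediction time, which is why KNN can become slow on large training sets.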

In [None]:
# kneighbors algorithm: try several neighborhood sizes (k = 3, 6, 9, 12)
for i in range(3, 15, 3):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)
    # accuracy (scikit-learn metrics expect y_true first, then y_pred)
    accuracy = accuracy_score(y_test, pred)
    print("kneighbors {}".format(i))
    print(accuracy)
    print(classification_report(y_test, pred))
    print("")

**Decision Tree**

As the name implies, decision trees are tree-shaped data structures. The
training dataset is used to build the tree, which is then used to make
predictions on the test data. The goal of the algorithm is to achieve the
most accurate result with the fewest decisions.
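
Building the tree comes down to repeatedly choosing the feature whose split leaves the purest groups of labels. The sketch below shows one such step using Gini impurity, the default criterion of scikit-learn's `DecisionTreeClassifier`. It is a simplified illustration on hypothetical toy data, not the library's implementation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(X, y):
    """Return (feature_index, weighted_impurity) of the best single binary split."""
    best = (None, gini(y))  # start from the impurity of the unsplit data
    for j in range(len(X[0])):
        left = [t for x, t in zip(X, y) if x[j] == 0]
        right = [t for x, t in zip(X, y) if x[j] == 1]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best[1]:
            best = (j, weighted)
    return best

# Toy data (hypothetical): columns are [SEND_SMS, INTERNET, CAMERA]; label 1 = malware.
X_toy = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1]]
y_toy = [1, 1, 1, 0, 0, 0]

feature, impurity = best_split(X_toy, y_toy)
print(feature, impurity)
```

In this toy table the first column separates the two classes perfectly, so a single decision is enough. A full tree applies the same search recursively to each resulting group until the groups are pure or a stopping rule is reached.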

In [None]:
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict on the held-out test split
pred = clf.predict(X_test)
# accuracy (scikit-learn metrics expect y_true first, then y_pred)
accuracy = accuracy_score(y_test, pred)
print(clf)
print(accuracy)
print(classification_report(y_test, pred))