
In [None]:
# Download the two Kaggle datasets used in this notebook (requires `kagglehub`).
# Each call returns the local directory that holds the downloaded files.
# NOTE: this environment differs from Kaggle's Python environment, so some
# libraries may be missing, and the "../input/..." paths used in later cells
# may need to point at these download directories instead.
import kagglehub
xwolf12_datasetandroidpermissions_path = kagglehub.dataset_download('xwolf12/datasetandroidpermissions')
xwolf12_network_traffic_android_malware_path = kagglehub.dataset_download('xwolf12/network-traffic-android-malware')

print('Data source import complete.')


# Android Malware Analysis

### Packages

In [None]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

from sklearn import preprocessing
import torch
from sklearn import svm
from sklearn import tree
import pandas as pd
import joblib  # sklearn.externals.joblib was removed from scikit-learn
import pickle
import numpy as np
import seaborn as sns

### Exploratory

In [None]:
import pandas as pd
df = pd.read_csv("../input/datasetandroidpermissions/train.csv", sep=";")

In [None]:
df = df.astype("int64")
df.type.value_counts()

`type` is the label: it marks whether an application is malware (1) or benign (0). The counts above show that the two classes are balanced.
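
To put a number on "balanced", the class shares can be computed directly. The sketch below uses a small hypothetical label series in place of `df["type"]`; the values and the `balance_ratio` name are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy labels standing in for df["type"] (1 = malware, 0 = benign).
labels = pd.Series([1, 0, 1, 0, 1, 0, 0, 1], name="type")

# Relative frequency of each class; shares near 0.5 indicate a balanced dataset.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Ratio of the rarest class to the most common one (1.0 means perfectly balanced).
balance_ratio = class_share.min() / class_share.max()
print(f"balance ratio: {balance_ratio:.2f}")
```

On the real data, the same two lines can be applied to `df.type` after loading.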

In [None]:
df.shape

*Let's get the 10 most frequently used permissions for the malware and the benign samples.*

*Malicious*

In [None]:
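# Sum each 0/1 column over the malware rows. [1:11] skips the top entry, which
# is expected to be the `type` label itself (it sums to the malware row count).
df[df.type == 1].sum(axis=0).sort_values(ascending=False)[1:11]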

*Benign*

In [None]:
# For benign rows `type` sums to 0, so it does not appear among the top entries.
df[df.type == 0].sum(axis=0).sort_values(ascending=False)[:10]

In [None]:
import matplotlib.pyplot as plt

# Top-10 permission counts: benign (top panel) and malware (bottom panel, red).
fig, axs = plt.subplots(nrows=2, sharex=True)

df[df.type == 0].sum(axis=0).sort_values(ascending=False)[:10].plot.bar(ax=axs[0])
df[df.type == 1].sum(axis=0).sort_values(ascending=False)[1:11].plot.bar(ax=axs[1], color="red")

The outputs above show how the permissions requested by malware differ from those requested by benign applications.
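
One way to quantify that difference is to compare, for each permission, the share of malware apps that request it with the share of benign apps that do. The sketch below does this on a hypothetical toy frame; the permission names and values are placeholders, and `rate_gap` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy frame: permission columns and values are placeholders.
toy = pd.DataFrame({
    "SEND_SMS": [1, 1, 1, 0, 0, 0],
    "INTERNET": [1, 1, 1, 1, 1, 0],
    "VIBRATE":  [0, 0, 1, 1, 1, 1],
    "type":     [1, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 column within a class is the share of apps requesting it.
usage_rate = toy.groupby("type").mean()

# Positive gap: more common in malware. Negative gap: more common in benign apps.
rate_gap = (usage_rate.loc[1] - usage_rate.loc[0]).sort_values(ascending=False)
print(rate_gap)
```

Unlike raw counts, these rates stay comparable even when the two classes differ in size.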

### Modeling

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:330], df['type'], test_size=0.20, random_state=42)

*Naive Bayes algorithm*

In [None]:
# Naive Bayes algorithm
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the held-out set
pred = gnb.predict(X_test)

# Evaluate; scikit-learn metrics expect (y_true, y_pred) in that order
accuracy = accuracy_score(y_test, pred)
print("naive_bayes")
print(accuracy)
print(classification_report(y_test, pred, labels=None))

*kneighbors algorithm*

In [None]:
# k-nearest neighbors for k = 3, 6, 9, 12
for i in range(3, 15, 3):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)

    # Metrics take (y_true, y_pred); swapping them flips precision and recall
    accuracy = accuracy_score(y_test, pred)
    print("kneighbors {}".format(i))
    print(accuracy)
    print(classification_report(y_test, pred, labels=None))
    print("")

*Decision Tree*

In [None]:
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict on the held-out split
pred = clf.predict(X_test)

# Metrics take (y_true, y_pred)
accuracy = accuracy_score(y_test, pred)
print(clf)
print(accuracy)
print(classification_report(y_test, pred, labels=None))

The results above show that several classifiers can be trained to detect malware from its requested permissions. This is only a first approximation: the hyperparameters were not tuned, and no other steps were taken to improve the results.
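
As a pointer for that missing step, here is a minimal sketch of a cross-validated grid search over a decision tree. It runs on synthetic data from `make_classification`, not on the permission dataset, and the grid values and the choice of F1 as the scoring metric are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the permission matrix (assumption: not the real data).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical search grid; the values are only illustrative.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}

# 5-fold cross-validation on the training split, selecting by F1 score.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1",
)
search.fit(X_tr, y_tr)

# score() uses the same metric as `scoring`, so this is the held-out F1.
held_out_f1 = search.score(X_te, y_te)
print(search.best_params_)
print(f"held-out F1: {held_out_f1:.3f}")
```

The test split is touched only once, after the search, so the reported score is not biased by the hyperparameter selection.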

# Dynamic Analysis

In [None]:
import pandas as pd
data = pd.read_csv("../input/network-traffic-android-malware/android_traffic.csv", sep=";")
data.head()

In [None]:
data.columns

In [None]:
data.shape

In [None]:
data.type.value_counts()

In this case the dataset is unbalanced, so accuracy alone is not a reliable measure; the models below are also evaluated with Cohen's kappa score.
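
A small sketch of why accuracy can mislead on unbalanced data: a hypothetical "model" that always predicts the majority class still scores high accuracy, while Cohen's kappa, which corrects for chance agreement, does not reward it. The labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels: 90 samples of the majority class, 10 of the minority class.
y_true = [0] * 90 + [1] * 10

# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)       # high despite learning nothing
kappa = cohen_kappa_score(y_true, y_pred)  # agreement beyond chance

print(acc)
print(kappa)
```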

### Data Cleaning and Processing

In [None]:
data.isna().sum()

In [None]:
data = data.drop(['duracion','avg_local_pkt_rate','avg_remote_pkt_rate'], axis=1).copy()

In [None]:
data.describe()

Next, we look for outliers in the data.

In [None]:
sns.boxplot(x=data.tcp_urg_packet)

In [None]:
data.loc[data.tcp_urg_packet > 0].shape[0]

This column will not be used in the analysis: only two rows are non-zero. Those rows may be worth a closer look in future work.
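
The same check can be run on every column at once to spot features that are almost always zero. This is a sketch on a hypothetical toy table; the column names mirror the dataset but the values do not, and the 20% cut-off is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical toy traffic table; column names mirror the dataset, values do not.
toy = pd.DataFrame({
    "tcp_packets":    [12, 40, 7, 95, 33, 18],
    "tcp_urg_packet": [0, 0, 0, 1, 0, 0],
    "udp_packets":    [0, 2, 0, 1, 3, 0],
})

# Share of non-zero entries per column.
nonzero_share = (toy != 0).mean()

# Hypothetical cut-off: flag columns where at most 20% of the rows are non-zero.
sparse_cols = nonzero_share[nonzero_share <= 0.20].index.tolist()
print(sparse_cols)
```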

In [None]:
data = data.drop(columns=["tcp_urg_packet"]).copy()
data.shape

In [None]:
sns.pairplot(data)

Several features contain many outliers. I will skip an in-depth analysis and simply keep the subset of the data that falls below fixed per-feature thresholds.
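
The thresholds in the next cell are chosen by hand. As a possible alternative, and not the method used in this notebook, the sketch below derives an upper bound per column from Tukey's rule (Q3 + 1.5 × IQR). The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one extreme value in each column.
toy = pd.DataFrame({
    "tcp_packets":  [10, 12, 9, 11, 13, 10, 12, 11, 10, 50000],
    "external_ips": [1, 2, 1, 3, 2, 1, 2, 90, 1, 2],
})

# Upper fence per column: Q3 + 1.5 * IQR (Tukey's rule).
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Keep only the rows that stay at or below the fence in every column.
filtered = toy[(toy <= upper_fence).all(axis=1)].copy()
print(upper_fence)
print(filtered.shape)
```

A data-driven fence adapts to each feature's spread, but it can be too aggressive on heavy-tailed traffic counts, so hand-picked limits remain a reasonable choice.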

In [None]:
data=data[data.tcp_packets<20000].copy()
data=data[data.dist_port_tcp<1400].copy()
data=data[data.external_ips<35].copy()
data=data[data.vulume_bytes<2000000].copy()
data=data[data.udp_packets<40].copy()
data=data[data.remote_app_packets<15000].copy()

In [None]:
# Count fully duplicated rows
data.duplicated().sum()

In [None]:
data=data.drop('source_app_packets.1',axis=1).copy()

In [None]:
scaler = preprocessing.RobustScaler()
scaledData = scaler.fit_transform(data.iloc[:,1:11])
# Column names follow the dataset's own spelling (e.g. `vulume_bytes`)
scaledData = pd.DataFrame(
    scaledData,
    columns=['tcp_packets', 'dist_port_tcp', 'external_ips', 'vulume_bytes', 'udp_packets',
             'source_app_packets', 'remote_app_packets', 'source_app_bytes',
             'remote_app_bytes', 'dns_query_times'],
)

From [6] we concluded that the best network features are:

+ (R1) TCP packets: the number of TCP packets sent and received during communication.
+ (R2) Different TCP packets: the total number of packets different from TCP.
+ (R3) External IPs: the number of external addresses (IPs) the application tried to communicate with.
+ (R4) Volume of bytes: the number of bytes sent from the application to external sites.
+ (R5) UDP packets: the total number of UDP packets transmitted in a communication.
+ (R6) Source application packets: the number of packets sent from the application to a remote server.
+ (R7) Remote application packets: the number of packets received from external sources.
+ (R8) Source application bytes: the volume (in bytes) of the communication between the application and the server.
+ (R9) Remote application bytes: the volume (in bytes) of the data sent from the server to the emulator.
+ (R10) DNS queries: the number of DNS queries.
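
These ten features are the ones scaled with `RobustScaler` above. As a sketch of what that scaler computes with its default settings, the example below reproduces it by hand on one hypothetical feature: subtract the median, then divide by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Same result by hand: subtract the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

Because the median and IQR barely move when a few extreme values are present, this scaling is less distorted by outliers than standardizing with the mean and standard deviation.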


### Modeling

In [None]:
X_train, X_test, y_train, y_test = train_test_split(scaledData.iloc[:,0:10], data.type.astype("str"), test_size=0.25, random_state=45)

In [None]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)
pred = gnb.predict(X_test)
## accuracy
accuracy = accuracy_score(y_test,pred)
print("naive_bayes")
print(accuracy)
print(classification_report(y_test,pred, labels=None))
print("cohen kappa score")
print(cohen_kappa_score(y_test, pred))

In [None]:
# k-nearest neighbors for k = 3, 6, 9, 12
for i in range(3, 15, 3):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)

    # Metrics take (y_true, y_pred); swapping them flips precision and recall
    accuracy = accuracy_score(y_test, pred)
    print("kneighbors {}".format(i))
    print(accuracy)
    print(classification_report(y_test, pred, labels=None))
    print("cohen kappa score")
    print(cohen_kappa_score(y_test, pred))
    print("")

In [None]:
# Random forest with 250 trees, each limited to depth 50
rdF = RandomForestClassifier(n_estimators=250, max_depth=50, random_state=45)
rdF.fit(X_train, y_train)
pred = rdF.predict(X_test)
cm = confusion_matrix(y_test, pred)

accuracy = accuracy_score(y_test,pred)
print(rdF)
print(accuracy)
print(classification_report(y_test,pred, labels=None))
print("cohen kappa score")
print(cohen_kappa_score(y_test, pred))
print(cm)