# Android Malware Classification

## Section 1: Cleaning the Dataset

This notebook uses machine learning to classify Android apps as malicious or benign. The features are based on the permissions requested by each app.

Pandas is used to manipulate the training data before it is fed into the model.

Pandas (Python Data Analysis Library) "is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."

Scikit-learn is an open-source machine learning library for Python. It is used here to train a model and, later, to test the trained model.


In [2]:
import pandas as pd
import numpy as np
# Keep printed tables short: at most 10 rows and 5 columns
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 5)



Open the training data, a CSV file containing a table of app IDs, the permissions associated with each app, and each app's classification.



In [3]:
# open csv as dataframe
df = pd.read_csv('data/appdata.csv')

Display the DataFrame.

Pandas provides DataFrame functionality for reading, accessing, and manipulating data in memory. You can think of a DataFrame as a table of indexed values.
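As a small illustration of the table analogy, the sketch below builds a toy DataFrame by hand. The permission names and values are hypothetical stand-ins, not rows from appdata.csv.

```python
import pandas as pd

# Hypothetical toy data: two apps and two permission flags (1 = requested)
toy = pd.DataFrame(
    {
        "a_id": [3, 7],
        "INTERNET": [1, 0],
        "SEND_SMS": [0, 1],
        "classification": ["benign", "malware"],
    }
)

# Columns are addressed by label, rows by index value
sms_flags = toy["SEND_SMS"].tolist()
first_label = toy.loc[0, "classification"]
print(toy.shape)
```

Each column holds one feature for every app, and each row holds every feature for one app.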

In [4]:
df

     a_id  ACCESS_ALL_DOWNLOADS  ...  WRITE_USER_DICTIONARY  classification
0       3                     0  ...                      0          benign
1       7                     0  ...                      0          benign
2       9                     0  ...                      0          benign
3      11                     0  ...                      0          benign
4      14                     0  ...                      0          benign
...   ...                   ...  ...                    ...             ...
193   204                     0  ...                      0         malware
194   206                     0  ...                      0         malware
195   207                     0  ...                      0         malware
196   211                     0  ...                      0         malware


The classification column contains strings, and scikit-learn expects numerical arrays as inputs. We therefore need to convert the classification column to numerical values.

In [5]:
# check the unique values in the column
df['classification'].unique()

array(['benign', 'malware'], dtype=object)

Map benign to 0 and malware to 1.

In [6]:
df['class'] = df['classification'].map({'benign': 0, 'malware': 1}).astype(int)
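One caveat with this pattern: `Series.map` with a dictionary returns NaN for any value that is not a key, and casting NaN to `int` raises an error. The sketch below shows a check that could be run before the cast. The stray label `"unknown"` is hypothetical and does not appear in the dataset above.

```python
import pandas as pd

# "unknown" is a hypothetical stray label used only to show the failure mode
labels = pd.Series(["benign", "malware", "unknown"])
mapped = labels.map({"benign": 0, "malware": 1})

# Labels missing from the dictionary become NaN, so count them before casting
n_unmapped = int(mapped.isna().sum())

# Cast only the rows that mapped cleanly
clean = mapped.dropna().astype(int)
print(n_unmapped, clean.tolist())
```

In this notebook the `unique()` check already confirms that only the two expected labels are present, so the direct cast is safe.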


In [7]:
df

     a_id  ACCESS_ALL_DOWNLOADS  ...  classification  class
0       3                     0  ...          benign      0
1       7                     0  ...          benign      0
2       9                     0  ...          benign      0
3      11                     0  ...          benign      0
4      14                     0  ...          benign      0
...   ...                   ...  ...             ...    ...
193   204                     0  ...         malware      1
194   206                     0  ...         malware      1
195   207                     0  ...         malware      1
196   211                     0  ...         malware      1


In [8]:
# Split the rows by class and summarise each subset
dt_malware = df[df['class'] == 1]
dt_malware.info()
print("\n")
dt_benign = df[df['class'] == 0]
dt_benign.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140 entries, 16 to 197
Columns: 211 entries, a_id to class
dtypes: int32(1), int64(209), object(1)
memory usage: 231.3+ KB


<class 'pandas.core.frame.DataFrame'>
Int64Index: 58 entries, 0 to 117
Columns: 211 entries, a_id to class
dtypes: int32(1), int64(209), object(1)
memory usage: 95.8+ KB
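The two summaries show 140 malware rows and 58 benign rows. A shorter way to read off the class balance is `value_counts`, sketched here on a stand-in frame that only reproduces those two counts; it is not the real dataset.

```python
import pandas as pd

# Stand-in frame with the class sizes reported above (140 malware, 58 benign)
toy = pd.DataFrame({"class": [1] * 140 + [0] * 58})

counts = toy["class"].value_counts()
shares = toy["class"].value_counts(normalize=True)
print(counts)
print(shares.round(3))
```

Roughly 71% of the rows are malware, which is worth keeping in mind when reading the accuracy figure later on.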


Drop the string classification column, since the numeric class column now carries the label.

In [9]:
df = df.drop(['classification'], axis=1)


In [10]:
# review our dataset
cols = df.columns.tolist()
print(cols)

['a_id', 'ACCESS_ALL_DOWNLOADS', 'ACCESS_ASSISTED_GPS', 'ACCESS_BLUETOOTH_PRINTER', 'ACCESS_BLUETOOTH_SHARE', 'ACCESS_BROWSER', 'ACCESS_CACHE_FILESYSTEM', 'ACCESS_CELL_ID', 'ACCESS_CHECKIN_PROPERTIES', 'ACCESS_COARSE_LOCATION', 'ACCESS_COARSE_UPDATES', 'ACCESS_DEV_STORAGE', 'ACCESS_DOWNLOAD_DATA', 'ACCESS_DOWNLOAD_MANAGER', 'ACCESS_DOWNLOAD_MANAGER_ADVANCED', 'ACCESS_DRM', 'ACCESS_FINE_LOCATION', 'ACCESS_FM_RECEIVER', 'ACCESS_GPS', 'ACCESS_LGDRM', 'ACCESS_LOCATION', 'ACCESS_LOCATION_EXTRA_COMMANDS', 'ACCESS_MOCK_LOCATION', 'ACCESS_MTP', 'ACCESS_NETWORK_STATE', 'ACCESS_NETWORK_LOCATION', 'ACCESS_OBEX', 'ACCESS_SURFACE_FLINGER', 'ACCESS_UPLOAD_DATA', 'ACCESS_UPLOAD_MANAGER', 'ACCESS_WIFI_STATE', 'ACCESS_WIMAX_STATE', 'ACCOUNT_MANAGER', 'ADD_SYSTEM_SERVICE', 'ASEC_ACCESS', 'ASEC_CREATE', 'ASEC_DESTROY', 'ASEC_MOUNT_UNMOUNT', 'AUTHENTICATE_ACCOUNTS', 'BACKUP', 'BACKUP_DATA', 'BATTERY_STATS', 'BILLING', 'BIND_APPWIDGET', 'BIND_DEVICE_ADMIN', 'BIND_INPUT_METHOD', 'BIND_REMOTEVIEWS', 'BIND_WA

In [11]:
df.head(2)

   a_id  ACCESS_ALL_DOWNLOADS  ...  WRITE_USER_DICTIONARY  class
0     3                     0  ...                      0      0
1     7                     0  ...                      0      0


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Columns: 210 entries, a_id to class
dtypes: int32(1), int64(209)
memory usage: 324.1 KB


Convert the processed training data from a Pandas DataFrame into a numerical (NumPy) array.

In [13]:
train_data = df.values

In [14]:
type(train_data)

numpy.ndarray

In [15]:
# train_data[0:,0]
# model = model.fit(train_data[0:,2:], train_data[0:,0])
# 
# train_data[0:,1:209]
# train_data[0:,209]

## Section 2: Machine Learning & Training the Model

In [16]:
# import sklearn and use random forest
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)

Using our training data (train_data), we train (or "fit") the model. The first input is the set of feature columns (columns 1 to 208, which omits the a_id column in column 0), and the second input is the class column (column 209).

In [17]:
# Features: columns 1..208 (skips a_id in column 0); label: column 209 (class)
model = model.fit(train_data[:, 1:209], train_data[:, 209])
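The positional slices above depend on the exact column order of the frame. As an alternative sketch, the features and label can be selected by column name, which keeps working if columns are added or reordered. The toy frame and its permission names are hypothetical; only the `a_id` and `class` column names come from the notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy frame laid out like the training data: ID, features, label
toy = pd.DataFrame(
    {
        "a_id": [3, 7, 9, 11],
        "INTERNET": [1, 1, 0, 1],
        "SEND_SMS": [0, 1, 0, 1],
        "class": [0, 1, 0, 1],
    }
)

# Select by name instead of by position
X = toy.drop(columns=["a_id", "class"])
y = toy["class"]

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
preds = clf.predict(X)
```

Fixing `random_state` is an addition here, made so that repeated runs of the sketch build the same forest.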

### Load test data

In [19]:
df_test = pd.read_csv('data/appTest.csv')

In [20]:
df_test.head(10)

   a_id  ACCESS_ALL_DOWNLOADS  ...  WRITE_TASKS  WRITE_USER_DICTIONARY
0     3                     0  ...            0                      0
1     7                     0  ...            0                      0
2     9                     0  ...            0                      0
3    11                     0  ...            0                      0
4    14                     0  ...            0                      0
5    19                     0  ...            0                      0
6    25                     0  ...            0                      0
7    27                     0  ...            0                      0
8    29                     0  ...            0                      0
9    32                     0  ...            0                      0


In [21]:
test_data = df_test.values
test_data.shape

(198, 209)

Run the model on the test data, omitting the a_id column from the input.

In [22]:
output = model.predict(test_data[:,1:])
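Because the model was fitted on a plain array, it matches features purely by position, so the test columns must be in the same order as the training columns. The sketch below shows one way to enforce that with `DataFrame.reindex`. The column names and the one-row test frame are hypothetical.

```python
import pandas as pd

# Hypothetical feature order used at training time
train_cols = ["INTERNET", "SEND_SMS", "READ_CONTACTS"]

# Hypothetical test frame: different column order, one feature missing
test = pd.DataFrame({"a_id": [3], "SEND_SMS": [1], "INTERNET": [0]})

# Reorder to the training layout; absent permissions are filled with 0
aligned = test.reindex(columns=train_cols, fill_value=0)
row = aligned.iloc[0].tolist()
print(aligned)
```

Filling a missing permission with 0 treats it as "not requested", which is an assumption that should be checked against how the CSV files were produced.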


Attach the app ID from the test data to the output produced by the classifier.

In [23]:
result = np.c_[test_data[:,0].astype(int), output.astype(int)]
df_result = pd.DataFrame(result[:,0:2], columns=['AppId', 'classification'])

In [24]:
df_result.head(10)

   AppId  classification
0      3               1
1      7               0
2      9               0
3     11               0
4     14               0
5     19               0
6     25               0
7     27               0
8     29               0
9     32               0


In [25]:
df_result


     AppId  classification
0        3               1
1        7               0
2        9               0
3       11               0
4       14               0
...    ...             ...
193    204               1
194    206               1
195    207               1
196    211               1


In [26]:
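# Attach the known labels from the training frame. The assignment aligns on the
# row index, so it assumes df and df_test list the same apps in the same order.
df_result['original_classification'] = df['class']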
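The assignment above pairs rows by index, which is correct only while both frames share the same row order. A more defensive sketch joins on the app ID instead. The two small frames are hypothetical; the column names follow the notebook.

```python
import pandas as pd

# Hypothetical predictions and ground truth, deliberately in different row order
preds = pd.DataFrame({"AppId": [3, 7, 9], "classification": [1, 0, 0]})
truth = pd.DataFrame({"a_id": [9, 3, 7], "class": [1, 0, 1]})

# Join on the ID so each prediction is paired with its own label
merged = preds.merge(truth, left_on="AppId", right_on="a_id", how="left")
merged = merged.drop(columns="a_id").rename(
    columns={"class": "original_classification"}
)
print(merged)
```

A left join keeps every prediction row and leaves the label empty for any app ID that has no match, which makes mismatches visible rather than silently misaligned.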

In [27]:
# pd.set_option('display.max_columns', None)
df_result

     AppId  classification  original_classification
0        3               1                        0
1        7               0                        0
2        9               0                        0
3       11               0                        0
4       14               0                        0
...    ...             ...                      ...
193    204               1                        1
194    206               1                        1
195    207               1                        1
196    211               1                        1


In [28]:
# Columns of the array: 0 = AppId, 1 = predicted class, 2 = original class
y_pred = df_result.values

In [29]:
from sklearn.metrics import classification_report

target_names = ['benign', 'malware']
# First argument: true labels (column 2); second: predicted labels (column 1)
print(classification_report(y_pred[:, 2:], y_pred[:, 1:2], target_names=target_names))

             precision    recall  f1-score   support

     benign       1.00      0.97      0.98        58
    malware       0.99      1.00      0.99       140

avg / total       0.99      0.99      0.99       198



In [30]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Rows are true classes, columns are predicted classes (order: benign, malware)
print(confusion_matrix(y_pred[:, 2:], y_pred[:, 1:2]))

[[ 56   2]
 [  0 140]]
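The figures in the classification report can be recomputed by hand from this matrix, which is a useful check on how to read it. In scikit-learn's layout, rows are true classes and columns are predicted classes, so two benign apps were flagged as malware and no malware was missed.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[56, 2], [0, 140]])

# Precision divides correct predictions by everything predicted as that class
precision = cm.diagonal() / cm.sum(axis=0)

# Recall divides correct predictions by everything truly in that class
recall = cm.diagonal() / cm.sum(axis=1)

# Accuracy is the share of all apps that sit on the diagonal
accuracy = cm.trace() / cm.sum()

print(precision.round(2), recall.round(2), round(accuracy, 4))
```

Benign recall is 56/58 and malware precision is 140/142, which round to the 0.97 and 0.99 shown in the report, and 196/198 gives the accuracy of about 0.9899.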


In [31]:
accuracy_score(y_pred[:,2:], y_pred[:,1:2], normalize=True)

0.98989898989898994
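One caution when reading this score: the app IDs shown for appTest.csv match those in appdata.csv, and the true labels are taken from the training frame, so the model appears to be scored on apps it was fitted on. If so, the figure measures fit to the training data rather than performance on unseen apps. A minimal sketch of a held-out evaluation follows. The data is synthetic and the labelling rule is invented purely so the example runs; it is not the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic binary "permission" flags and an invented label rule (illustration only)
X = rng.randint(0, 2, size=(198, 20))
y = (X[:, 0] | X[:, 1]).astype(int)

# Hold out a quarter of the apps; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
held_out_acc = accuracy_score(y_test, clf.predict(X_test))
print(held_out_acc)
```

Applied to the real frame, the same split would be made before fitting, and only the held-out part would be passed to `classification_report`.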