# Android Malware

Our dataset for this project comes from the Drebin paper (https://www.mlsec.org/docs/2014-ndss.pdf), a well-known source of Android malware data.
It is a public dataset, which we downloaded from https://figshare.com/articles/dataset/Android_malware_dataset_for_machine_learning_2/5854653/1

There are other similar Android malware datasets, for example:
- https://github.com/DefenseDroid/DefenseDroid
- https://www.unb.ca/cic/datasets/maldroid-2020.html
    - file: feature_vectors_syscalls_frequency_5_Cat.csv
    - This is a very comprehensive and clean dataset. The problem is that it is too clean to demonstrate the preprocessing steps in our project.
- https://www.unb.ca/cic/datasets/andmal2020.html
    - This dataset is similar to MalDroid 2020. However, it is split across multiple CSV files, and we would need to spend considerable effort
      combining them into a dataset that can be used for our assignment.

Hence, we originally chose the Drebin dataset because it was easier to work with for our purpose of demonstrating the ML process.
However, after running exploratory data analysis, we discovered that the Drebin dataset is flawed: it contains a lot of duplicates.
We also found a paper that highlights the same issue: https://ieeexplore.ieee.org/document/9609892


# Import Dependencies

In [144]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt

# Drebin Dataset Exploratory Data Analysis

Let's do a basic analysis of our dataset to gain some initial insights.

In [145]:
url="https://raw.githubusercontent.com/raymondsamalo/25S1-C-NYP-ITI111-Applied-Machine-Learning/refs/heads/main/drebin-215-dataset-5560malware-9476-benign.csv"
df=pd.read_csv(url) # read our data frame
print("show size")
print(df.shape)
print("show features")
print(df.describe())
print("show headers")
df.info()

show size
(15036, 216)
show features
           transact  onServiceConnected   bindService  attachInterface  \
count  15036.000000        15036.000000  15036.000000     15036.000000   
mean       0.426443            0.446595      0.442671         0.413208   
std        0.494576            0.497156      0.496719         0.492426   
min        0.000000            0.000000      0.000000         0.000000   
25%        0.000000            0.000000      0.000000         0.000000   
50%        0.000000            0.000000      0.000000         0.000000   
75%        1.000000            1.000000      1.000000         1.000000   
max        1.000000            1.000000      1.000000         1.000000   

       ServiceConnection  android.os.Binder      SEND_SMS  \
count       15036.000000       15036.000000  15036.000000   
mean            0.444932           0.486898      0.236632   
std             0.496975           0.499845      0.425029   
min             0.000000           0.000000      0.0



From this basic analysis, we found that our data consists mostly of 0s and 1s, which indicate whether or not the application uses a given Android feature or function call.
```
Columns: 216 entries, transact to class
dtypes: int64(214), object(2)

```

We have 216 columns: 2 are of type object, while the rest are int64 (numeric).

That is quite a large number of features. Hence, we need to simplify by filtering out columns that do not have a strong correlation with our label.
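
As a rough illustration of this kind of filtering, the sketch below keeps only the features whose absolute Pearson correlation with the label meets a threshold. The toy data, the column names (`feat_a`, `feat_b`, `feat_c`, `label`), and the 0.5 cutoff are assumptions made for this example; they are not taken from the Drebin data or from the project's actual selection step.

```python
import pandas as pd

# Toy binary features and label (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "feat_a": [1, 1, 1, 0, 0, 0],
    "feat_b": [1, 0, 1, 0, 1, 0],
    "feat_c": [0, 0, 1, 1, 1, 1],
    "label":  [1, 1, 1, 0, 0, 0],
})

threshold = 0.5  # assumed cutoff; tune it for the real data

# Absolute correlation of each feature column with the label
features = toy.drop(columns="label")
abs_corr = features.corrwith(toy["label"]).abs()

# Keep only the features that meet the cutoff
selected = abs_corr[abs_corr >= threshold].index.tolist()
print(abs_corr)
print(selected)
```

Taking the absolute value matters here: a feature that is strongly negatively correlated with the label is just as informative as a positively correlated one.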

Pandas also reported an issue with column 92:
```
DtypeWarning: Columns (92) have mixed types. Specify dtype option on import or set low_memory=False.
```

Let's take a deeper look at column 92.



# Preprocessing Data Cleanup

## Clean Up Column 92 and Object Columns


In [146]:
column_names_index = df.columns
print(column_names_index[92]) # output TelephonyManager.getSimCountryIso
print(df['TelephonyManager.getSimCountryIso'].unique()) # output array(['0', '1', '?', 1, 0], dtype=object)

TelephonyManager.getSimCountryIso
['0' '1' '?' 1 0]


We discovered that column 92, `TelephonyManager.getSimCountryIso`, contains the values `['0' '1' '?' 1 0]`.

We need to handle the unknown '?' entries and also convert '0' and '1' to integers.

However, df.info() shows that we have two object columns.
Let's check the other object column to gain deeper insight into our data.

In [147]:
object_columns = df.select_dtypes(include=['object']).columns.tolist()
print(f"Object columns: {object_columns}") # Object columns: ['TelephonyManager.getSimCountryIso', 'class']
print(f"{'Column':40}Values")
for i in object_columns:
    print(f"{i:40}{df[i].unique()}")

Object columns: ['TelephonyManager.getSimCountryIso', 'class']
Column                                  Values
TelephonyManager.getSimCountryIso       ['0' '1' '?' 1 0]
class                                   ['S' 'B']


We discover that besides 'TelephonyManager.getSimCountryIso', the other object (string) column is 'class'.
For the 'class' column, the values are ['S' 'B'].
We do not need to handle missing values for the 'class' column, but we do need to encode 'S' (suspicious, i.e. malware) and 'B' (benign) as numbers.
We shall do this by storing integer 1 for malware and 0 for benign in a new column.

Now that we know the two columns we need to handle, let's preprocess them.

In [148]:
get_sim_country_column='TelephonyManager.getSimCountryIso'
# encode the label: 'S' (malware) -> 1, 'B' (benign) -> 0
df['malware']=(df['class']=='S').astype(int)
df.drop('class',axis=1,inplace=True)
before_values=df[get_sim_country_column].unique()
# convert '0'/'1' strings to numbers; unparseable entries such as '?' become NaN
df[get_sim_country_column] = pd.to_numeric(df[get_sim_country_column], errors='coerce')
after_values=df[get_sim_country_column].unique()

print(f"Check values of {get_sim_country_column}")
print(f"before {before_values} -> after {after_values}")

print("Check column values")
all_zeros_or_ones_nan = df.isin([0, 1, np.nan]).all().all()
print(f"is all values are in [0,1, NaN]? {all_zeros_or_ones_nan}")


Check values of TelephonyManager.getSimCountryIso
before ['0' '1' '?' 1 0] -> after [ 0.  1. nan]
Check column values
is all values are in [0,1, NaN]? True


Our data now consists only of 0, 1, or NaN values.
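
An alternative worth noting, sketched here on a tiny in-memory CSV rather than the real file, is to tell pandas at load time that '?' marks a missing value, which should let the column be parsed as numeric directly. The column name `feature_x` and the sample rows are made up for illustration; `na_values` is a standard `read_csv` parameter.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (hypothetical column name and rows)
csv_text = "feature_x,class\n0,S\n1,B\n?,B\n1,S\n"

# Treat '?' as missing while parsing instead of cleaning it up afterwards
toy = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(toy["feature_x"].dtype)         # numeric dtype, not object
print(toy["feature_x"].isna().sum())  # number of '?' entries turned into NaN
```

This is not what the notebook does; it keeps the explicit `pd.to_numeric` step, which has the advantage of showing the before and after values side by side.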

## Check for Duplicate Data

First, let's check for duplicate rows.


In [149]:

duplicated_row_count = df.duplicated().sum()
total_row_count = df.shape[0]
duplicated_row_percentage = (duplicated_row_count/total_row_count*100)
print(f"Total rows count: {total_row_count}")
print(f"Duplicated rows count: {duplicated_row_count}")
print(f"Duplicated rows percentage: {duplicated_row_percentage}")


Total rows count: 15036
Duplicated rows count: 7775
Duplicated rows percentage: 51.70923117850492


During this analysis, we were surprised to find that the Drebin data has a large number of duplicates: 7775 of the 15036 rows (about 51.7%).
More than half of the rows in Drebin are duplicates. We searched the internet and found a paper that mentions the same problem: https://ieeexplore.ieee.org/document/9609892
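
To make the duplicate count concrete, here is a small sketch on made-up rows (the columns `f1`, `f2` and their values are assumptions for illustration). By default `duplicated()` flags every repeat of a row except its first occurrence, so the number of rows left after `drop_duplicates()` equals the total minus the duplicate count (for the figures above, 15036 - 7775 = 7261).

```python
import pandas as pd

# Five toy rows, of which only two are distinct (hypothetical values)
toy = pd.DataFrame({
    "f1":      [1, 1, 0, 1, 0],
    "f2":      [0, 0, 1, 0, 1],
    "malware": [1, 1, 0, 1, 0],
})

# Repeats are flagged; the first occurrence of each row is not
dup_count = int(toy.duplicated().sum())

# drop_duplicates keeps exactly those first occurrences
unique_rows = toy.drop_duplicates()

print(f"total={len(toy)} duplicates={dup_count} remaining={len(unique_rows)}")
```

Note that the comparison covers all columns, including the label, so two rows with identical features but different labels are not treated as duplicates of each other.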

We decided to continue with the Drebin dataset with the duplicates removed.

In [150]:
df_no_duplicates = df.drop_duplicates()
print("Non Duplicated rows:\n",df_no_duplicates.shape[0])
print("Check whether we have sufficient malware vs non malware data")
print(df_no_duplicates["malware"].value_counts())

df = df_no_duplicates # replace our data frame with non-duplicated rows

Non Duplicated rows:
 7261
Check whether we have sufficient malware vs non malware data
malware
0    5540
1    1721
Name: count, dtype: int64


After removing the duplicate rows, our dataset is still reasonably sufficient for our needs:
- benign samples: 5540
- malware samples: 1721

Roughly 24% of the data is malware and 76% is benign.
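
The class balance can be checked with a little arithmetic. The sketch below rebuilds a label column from the two counts reported above (the reconstruction via `pd.Series` is only for illustration) and lets `value_counts(normalize=True)` compute the shares.

```python
import pandas as pd

# Counts taken from the value_counts() output after de-duplication
benign_count, malware_count = 5540, 1721

# Rebuild a label column with the same class counts (0 = benign, 1 = malware)
labels = pd.Series([0] * benign_count + [1] * malware_count, name="malware")

# Share of each class, in percent, rounded to one decimal place
shares = (labels.value_counts(normalize=True) * 100).round(1)
print(shares)
```

On the real data frame the same check would be `df["malware"].value_counts(normalize=True)`.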

Let's deal with the missing data next.

## Handle Missing Data

In [151]:

print("List all column with missing data")
print(df.columns[df.isna().any()].tolist()) # returns ['TelephonyManager.getSimCountryIso'] as the only column with NaN values
rows_with_nan = df[df.isnull().any(axis=1)].index
no_of_rows = df.shape[0]
print("List all rows with missing data")
print(rows_with_nan)
no_of_rows_missing_data=rows_with_nan.shape[0]
print(f"Missing data is {no_of_rows_missing_data} out of {no_of_rows}")

List all column with missing data
['TelephonyManager.getSimCountryIso']
List all rows with missing data
Index([176, 2109], dtype='int64')
Missing data is 2 out of 7261


Given that only 2 of the 7261 rows have missing data, let us simply remove them.

In [152]:
df = df.dropna() # drop rows with missing values; reassigning avoids modifying a possible copy in place
rows_with_nan = df[df.isnull().any(axis=1)].index
no_of_rows_missing_data=rows_with_nan.shape[0]
no_of_rows = df.shape[0]
print(f"No of rows with missing data is now {no_of_rows_missing_data} out of {no_of_rows}")

No of rows with missing data is now 0 out of 7259
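
One follow-up that may be useful: `pd.to_numeric(..., errors='coerce')` produces a float column once NaN is present, and the column stays float after the NaN rows are dropped. The sketch below, on a toy series mirroring the values seen in `TelephonyManager.getSimCountryIso`, shows one way to cast back to integers. Whether the project needs this cast is an assumption; many models accept float 0.0/1.0 features as they are.

```python
import pandas as pd

# Toy series with the same mix of values as the original column
raw = pd.Series(["0", "1", "?", 1, 0], dtype=object)

# '?' cannot be parsed, so it becomes NaN and the dtype becomes float64
numeric = pd.to_numeric(raw, errors="coerce")

# With the NaN entries gone, the values can be cast back to integers
cleaned = numeric.dropna().astype("int64")

print(numeric.dtype, cleaned.dtype)
print(cleaned.tolist())
```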
