### SC1015 DSAI Mini Project
>Part 1: Data Cleaning and Preparation

>In this section we clean and prepare the dataset so that we can draw meaningful insights from it and answer the question we posed.

>Question: What are the variables that help to predict the type of malware attack that happened on an Android device?

>Dataset: https://www.kaggle.com/datasets/subhajournal/android-malware-detection?select=Android_Malware.csv

>Table of Contents:

>1. Preliminary Feature Selection

>2. Splitting the Dataset into the different types of possible Malware Attacks

>3. Encoding Categorical Variables (Label)

>4. Conversion of Final cleaned and prepared DataFrames to Pickle Files

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

In [2]:
# Read in Android malware data from CSV file
aData = pd.read_csv('Android_Malware.csv')
# Print the contents of the DataFrame
print(aData)


        Unnamed: 0                                  Flow ID     Source IP  \
0                0    172.217.6.202-10.42.0.211-443-50004-6   10.42.0.211   
1                1    172.217.6.202-10.42.0.211-443-35455-6   10.42.0.211   
2                2    131.253.61.68-10.42.0.211-443-51775-6   10.42.0.211   
3                3    131.253.61.68-10.42.0.211-443-51775-6   10.42.0.211   
4                4    131.253.61.68-10.42.0.211-443-51776-6   10.42.0.211   
...            ...                                      ...           ...   
355625         405      172.217.7.14-10.42.0.211-80-38405-6  172.217.7.14   
355626         406         10.42.0.211-10.42.0.1-7632-53-17   10.42.0.211   
355627         407  10.42.0.211-104.192.110.245-45970-443-6   10.42.0.211   
355628         408        10.42.0.211-10.42.0.1-51982-53-17   10.42.0.211   
355629         409         10.42.0.211-10.42.0.1-9320-53-17   10.42.0.211   

         Source Port   Destination IP   Destination Port   Protocol  \
0   

In [3]:
aData.shape

(355630, 86)

In [4]:
aData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 355630 entries, 0 to 355629
Data columns (total 86 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Unnamed: 0                   355630 non-null  int64  
 1   Flow ID                      355629 non-null  object 
 2    Source IP                   355630 non-null  object 
 3    Source Port                 355630 non-null  int64  
 4    Destination IP              355630 non-null  object 
 5    Destination Port            355630 non-null  float64
 6    Protocol                    355630 non-null  float64
 7    Timestamp                   355630 non-null  object 
 8    Flow Duration               355630 non-null  int64  
 9   Total Fwd Packets            355630 non-null  int64  
 10  Total Backward Packets       355630 non-null  int64  
 11  Total Length of Fwd Packets  355630 non-null  float64
 12  Total Length of Bwd Packets  355630 non-null  float64
 13 

### 1. Preliminary Feature Selection
>Initially, we saw that our dataset had a large number of variables (85, plus the Unnamed: 0 index column). It is not practical to use all of them. Hence, we decided to pick a subset of the variables and work with them.

>The variables were chosen carefully. Since our question relates to the type of malware attack, we picked variables that are common indicators when evaluating whether a malware attack has taken place:

>['Total Fwd Packets', 'Total Backward Packets', 'Total Length of Fwd Packets', 'Total Length of Bwd Packets'] : They provide information about the volume and amount of traffic going in both directions.

>['Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Std', 'Bwd Packet Length Max', 'Bwd Packet Length Mean',  'Bwd Packet Length Std'] : They provide information about the size and variability of the packets being transmitted.

>['Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std'] : They give information about the flow of traffic and can be used to detect anomalies.

>['FIN Flag Count', 'SYN Flag Count', 'RST Flag Count', 'PSH Flag Count', 'ACK Flag Count', 'URG Flag Count'] : They provide information about specific TCP flags and can be used to identify certain types of attacks.

>['Label'], which takes the values ['Android_Adware', 'Android_Scareware', 'Android_SMS_Malware', 'Benign'] : This is the response variable. All the variables listed above are the predictors that we use in this study.
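
>The grouping above can also be kept in code, so that each predictor stays tied to the reason it was chosen. The following is a minimal sketch, not the notebook's own selection cell: the names FEATURE_GROUPS and select_features are illustrative, and stripping whitespace from the headers is an assumption based on the leading spaces visible in some column names of the raw file.

```python
import pandas as pd

# Feature groups as described above; the dict name and its keys are illustrative.
FEATURE_GROUPS = {
    "volume": ['Total Fwd Packets', 'Total Backward Packets',
               'Total Length of Fwd Packets', 'Total Length of Bwd Packets'],
    "packet_size": ['Fwd Packet Length Max', 'Fwd Packet Length Mean',
                    'Fwd Packet Length Std', 'Bwd Packet Length Max',
                    'Bwd Packet Length Mean', 'Bwd Packet Length Std'],
    "flow": ['Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std'],
    "tcp_flags": ['FIN Flag Count', 'SYN Flag Count', 'RST Flag Count',
                  'PSH Flag Count', 'ACK Flag Count', 'URG Flag Count'],
}
RESPONSE = 'Label'


def select_features(df, groups=FEATURE_GROUPS, response=RESPONSE):
    """Return a copy of df restricted to the grouped predictors plus the response."""
    # Assumption: some raw headers carry stray whitespace, so normalise them first.
    df = df.rename(columns=lambda name: name.strip())
    wanted = [col for cols in groups.values() for col in cols] + [response]
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df[wanted].copy()


# Toy one-row frame standing in for the real data (all values are made up).
toy_columns = [c for cols in FEATURE_GROUPS.values() for c in cols] + ['Label', ' Source IP']
toy = pd.DataFrame([[0] * len(toy_columns)], columns=toy_columns)
selected = select_features(toy)
```

>Failing loudly on a missing column makes a renamed or misspelled header visible immediately instead of at modelling time.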

In [5]:
s_Data = pd.DataFrame(aData[[
        'Total Fwd Packets',
        'Total Backward Packets',
        'Total Length of Fwd Packets',
        'Total Length of Bwd Packets',
        'Fwd Packet Length Max',
        'Fwd Packet Length Mean',
        'Fwd Packet Length Std',
        'Bwd Packet Length Max',
        'Bwd Packet Length Mean',
        'Bwd Packet Length Std',
        'Flow Bytes/s',
        'Flow Packets/s',
        'Flow IAT Mean',
        'Flow IAT Std',
        'FIN Flag Count',
        'SYN Flag Count',
        'RST Flag Count',
        'PSH Flag Count',
        'ACK Flag Count',
        'URG Flag Count',
        'Label']])      
s_Data

Unnamed: 0,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Std,...,Flow Packets/s,Flow IAT Mean,Flow IAT Std,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,Label
0,1,1,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,54.014638,3.702700e+04,0.000000e+00,0.0,0.0,0.0,0.0,1.0,1.0,Android_Adware
1,1,1,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,54.565793,3.665300e+04,0.000000e+00,0.0,0.0,0.0,0.0,1.0,1.0,Android_Adware
2,8,12,1011.0,11924.0,581,126.375000,207.799311,1460.0,993.666667,656.474376,...,37.446241,2.811047e+04,4.314810e+04,0.0,0.0,0.0,1.0,0.0,0.0,Android_Adware
3,3,0,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,322.268772,4.654500e+03,5.137131e+03,0.0,0.0,0.0,0.0,1.0,0.0,Android_Adware
4,8,6,430.0,5679.0,218,53.750000,99.538578,1460.0,946.500000,710.412204,...,0.703854,1.530038e+06,5.377887e+06,0.0,0.0,0.0,1.0,0.0,0.0,Android_Adware
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355625,1,1,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,15.783949,1.267110e+05,0.000000e+00,0.0,0.0,0.0,0.0,1.0,1.0,Benign
355626,1,1,30.0,140.0,30,30.000000,0.000000,140.0,140.000000,0.000000,...,41.656253,4.801200e+04,0.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,Benign
355627,11,8,339.0,6335.0,213,30.818182,71.272461,1460.0,791.875000,720.377369,...,0.948671,1.112668e+06,4.629064e+06,0.0,0.0,0.0,1.0,0.0,0.0,Benign
355628,1,1,32.0,48.0,32,32.000000,0.000000,48.0,48.000000,0.000000,...,5.748349,3.479260e+05,0.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,Benign


In [6]:
s_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 355630 entries, 0 to 355629
Data columns (total 21 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Total Fwd Packets            355630 non-null  int64  
 1   Total Backward Packets       355630 non-null  int64  
 2   Total Length of Fwd Packets  355630 non-null  float64
 3   Total Length of Bwd Packets  355630 non-null  float64
 4   Fwd Packet Length Max        355630 non-null  int64  
 5   Fwd Packet Length Mean       355630 non-null  float64
 6   Fwd Packet Length Std        355630 non-null  float64
 7   Bwd Packet Length Max        355630 non-null  float64
 8   Bwd Packet Length Mean       355630 non-null  float64
 9   Bwd Packet Length Std        355630 non-null  float64
 10  Flow Bytes/s                 355630 non-null  float64
 11  Flow Packets/s               355630 non-null  float64
 12  Flow IAT Mean                355630 non-null  float64
 13 

In [7]:
s_Data['Label'].value_counts()

Android_Adware         147443
Android_Scareware      117082
Android_SMS_Malware     67397
Benign                  23708
Name: Label, dtype: int64

### 2. Splitting the Dataset into the different types of possible Malware Attacks
>For our exploratory analysis, it is best to split the dataset into separate DataFrames, one for each type of malware attack:

>DataFrame containing variables relating to Android_Adware attacks

>DataFrame containing variables relating to Android_Scareware attacks

>DataFrame containing variables relating to Android_SMS_Malware attacks

>DataFrame containing variables relating to Benign traffic, used as the control

>From this point on, further data cleaning and preparation is done separately for each DataFrame.
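
>As an alternative to filtering once per label, the same split can be sketched with a single groupby. This is a sketch on made-up toy rows, not the notebook's own cell, and the helper name split_by_label is hypothetical.

```python
import pandas as pd


def split_by_label(df, label_col='Label'):
    """Return {label value: sub-DataFrame}, one entry per distinct label."""
    # .copy() gives each part its own data, so later edits do not touch df.
    return {label: group.copy() for label, group in df.groupby(label_col)}


# Toy rows standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    'Total Fwd Packets': [1, 8, 3, 2],
    'Label': ['Android_Adware', 'Benign', 'Android_Adware', 'Android_Scareware'],
})
parts = split_by_label(toy)
sizes = {label: len(frame) for label, frame in parts.items()}
```

>With its default settings, groupby leaves out rows whose label is missing, so this variant assumes every row has a Label value.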

In [8]:
# Define lists of labels to filter the original dataset by
android_aw = ['Android_Adware']
android_sw = ['Android_Scareware']
android_sms_m = ['Android_SMS_Malware']
benign = ['Benign']

# Filter the original dataset to create separate DataFrames for each label
aw_prelimData = s_Data[s_Data['Label'].isin(android_aw)]
sw_prelimData = s_Data[s_Data['Label'].isin(android_sw)]
sms_prelimData = s_Data[s_Data['Label'].isin(android_sms_m)]
b_Data = s_Data[s_Data['Label'].isin(benign)]

### Check for NaN or Infinite values to be removed

In [9]:
has_nan = aw_prelimData.isna().any().any()
has_inf = aw_prelimData.isin([np.inf]).any().any()

print(f"aw_prelimData Has NaN values: {has_nan}, Has Inf values: {has_inf}")

aw_prelimData Has NaN values: False, Has Inf values: False


In [10]:
has_nan = sw_prelimData.isna().any().any()
has_inf = sw_prelimData.isin([np.inf]).any().any()

print(f"sw_prelimData Has NaN values: {has_nan}, Has Inf values: {has_inf}")

sw_prelimData Has NaN values: False, Has Inf values: False


In [11]:
has_nan = sms_prelimData.isna().any().any()
has_inf = sms_prelimData.isin([np.inf]).any().any()

print(f"sms_prelimData Has NaN values: {has_nan}, Has Inf values: {has_inf}")

sms_prelimData Has NaN values: True, Has Inf values: False


In [12]:
has_nan = b_Data.isna().any().any()
has_inf = b_Data.isin([np.inf]).any().any()

print(f"b_Data Has NaN values: {has_nan}, Has Inf values: {has_inf}")

b_Data Has NaN values: False, Has Inf values: False


In [13]:
sms_prelimData = sms_prelimData.dropna()
# Re-check the DataFrame that was just cleaned (sms_prelimData, not sw_prelimData)
has_nan = sms_prelimData.isna().any().any()
print(f"sms_prelimData Has NaN values: {has_nan}")

sms_prelimData Has NaN values: False
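
>The four checks above repeat the same lines for each DataFrame, and isin([np.inf]) matches only positive infinity. The following is a minimal sketch of a reusable check that also catches negative infinity. The helper name nan_inf_report and the toy frames are illustrative only.

```python
import numpy as np
import pandas as pd


def nan_inf_report(df):
    """Return (has_nan, has_inf) for df; infinities are checked in numeric columns only."""
    numeric = df.select_dtypes(include=[np.number])
    has_nan = bool(df.isna().any().any())
    # np.isinf is True for both +inf and -inf.
    has_inf = bool(np.isinf(numeric.to_numpy(dtype=float)).any())
    return has_nan, has_inf


# Toy frames standing in for the per-label DataFrames (values are made up).
frames = {
    'clean': pd.DataFrame({'Flow Bytes/s': [1.0, 2.0], 'Label': ['Benign', 'Benign']}),
    'with_nan': pd.DataFrame({'Flow Bytes/s': [1.0, np.nan], 'Label': ['Benign', 'Benign']}),
    'with_neg_inf': pd.DataFrame({'Flow Bytes/s': [1.0, -np.inf], 'Label': ['Benign', 'Benign']}),
}
report = {name: nan_inf_report(frame) for name, frame in frames.items()}

for name, (has_nan, has_inf) in report.items():
    print(f"{name} Has NaN values: {has_nan}, Has Inf values: {has_inf}")
```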


### Merging the Dataframes (Different Types of Malware + Benign)
>We now merge the DataFrames into 3 datasets, each containing one type of malware attack together with Benign (as the negative control), for the EDA and Machine Learning processes further down the line.

>Because the amount of Benign data is disproportionately small compared with each type of malware attack, we will control the volume and class ratio of the merged datasets with a random resampling process before they are used for EDA/Machine Learning.
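
>The resampling itself is not shown in this part. As one possible illustration, the sketch below randomly downsamples every class to the size of the smallest one. The choice of downsampling, the function name balance_by_downsampling and the fixed random_state are all assumptions, not the method used later in the project.

```python
import pandas as pd


def balance_by_downsampling(df, label_col='Label', random_state=0):
    """Randomly downsample each class to the size of the smallest class."""
    smallest = df[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle the result so the classes are interleaved rather than stacked.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy imbalanced frame: 6 malware rows and 2 benign rows (values are made up).
toy = pd.DataFrame({
    'Total Fwd Packets': range(8),
    'Label': ['Android_Adware'] * 6 + ['Benign'] * 2,
})
balanced = balance_by_downsampling(toy)
counts = balanced['Label'].value_counts().to_dict()
```

>Downsampling discards majority-class rows. Upsampling the Benign class with sample(..., replace=True) is the opposite trade-off and keeps all malware rows.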

In [14]:
frames_aw_benign = [b_Data, aw_prelimData]
frames_sw_benign = [b_Data, sw_prelimData]
frames_sms_benign = [b_Data, sms_prelimData]

In [15]:
aw_finalData = pd.concat(frames_aw_benign)
sw_finalData = pd.concat(frames_sw_benign)
sms_finalData = pd.concat(frames_sms_benign)

In [16]:
aw_finalData['Label'].value_counts()

Android_Adware    147443
Benign             23708
Name: Label, dtype: int64

In [17]:
sw_finalData['Label'].value_counts()

Android_Scareware    117082
Benign                23708
Name: Label, dtype: int64

In [18]:
sms_finalData['Label'].value_counts()

Android_SMS_Malware    67396
Benign                 23708
Name: Label, dtype: int64

### 3. Encoding Categorical Variables (Label)
>The categorical variable that we are trying to predict is Label, which takes the values ['Android_Adware', 'Android_Scareware', 'Android_SMS_Malware', 'Benign']. To do any form of analysis on it, we must first encode it as integers. For this, we use LabelEncoder from the sklearn.preprocessing library.

In [19]:
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d
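# Subclass LabelEncoder so that classes_ keeps the order of first appearance
# (via pandas.Series.unique()) instead of the alphabetical order of the original LabelEncoder.
# Note: fit_transform() is not overridden, so call fit() and transform() separately.
class MyLabelEncoder(LabelEncoder):

    def fit(self, y):
        # Ensure y is 1-dimensional (warns if a column vector is passed)
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self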

le = MyLabelEncoder()

In [20]:
le.fit(aw_finalData['Label'])
le.classes_

array(['Benign', 'Android_Adware'], dtype=object)
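
>The class order above puts Benign first because classes are taken in order of first appearance, so Benign becomes 0 and the malware type becomes 1. As a pandas-only sketch of the same idea (an alternative, not the encoder used in this notebook), pd.factorize also assigns codes in order of first appearance, and an explicit mapping makes the intended codes visible in the code itself. The toy labels and the mapping dict are illustrative.

```python
import pandas as pd

# Toy labels in the same order as the merged frame: Benign rows first.
labels = pd.Series(['Benign', 'Benign', 'Android_Adware', 'Android_Adware'])

# Option 1: codes follow the order in which each label first appears.
codes, classes = pd.factorize(labels)

# Option 2: an explicit mapping, independent of row order.
mapping = {'Benign': 0, 'Android_Adware': 1}
encoded = labels.map(mapping)

# Any label missing from the mapping would become NaN, so check for that.
assert encoded.notna().all()
```

>The explicit mapping does not depend on which rows come first, which matters if the merged frame is shuffled or resampled before encoding.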

In [21]:
# Keep an independent copy with the original string labels before encoding
aw_not_encodeData = aw_finalData.copy()
aw_not_encodeData.to_pickle('aw_not_encodeData.pickle')
aw_not_encodeData

Unnamed: 0,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Std,...,Flow Packets/s,Flow IAT Mean,Flow IAT Std,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,Label
331922,6,8,1076.0,4575.0,821,179.333333,321.621931,1418.0,571.875000,679.532284,...,92.682087,11619.53846,14541.15588,0.0,0.0,0.0,1.0,0.0,0.0,Benign
331923,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,5730.659026,349.00000,0.00000,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331924,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,16806.722690,119.00000,0.00000,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331925,1,1,31.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,53.973823,37055.00000,0.00000,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331926,6,7,1313.0,307.0,753,218.833333,331.306152,168.0,43.857143,75.366722,...,72.736632,14893.91667,18532.64075,0.0,0.0,0.0,1.0,0.0,0.0,Benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147438,2,0,62.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,125000.000000,16.00000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,Android_Adware
147439,1,1,35.0,139.0,35,35.000000,0.000000,139.0,139.000000,0.000000,...,64.491165,31012.00000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,Android_Adware
147440,2,0,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,5000.000000,400.00000,0.00000,1.0,0.0,0.0,0.0,0.0,0.0,Android_Adware
147441,2,0,62.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,142857.142900,14.00000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,Android_Adware


In [22]:
aw_finalData['Label'] = le.transform(aw_finalData['Label'])
aw_finalData

Unnamed: 0,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Std,...,Flow Packets/s,Flow IAT Mean,Flow IAT Std,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,Label
331922,6,8,1076.0,4575.0,821,179.333333,321.621931,1418.0,571.875000,679.532284,...,92.682087,11619.53846,14541.15588,0.0,0.0,0.0,1.0,0.0,0.0,0
331923,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,5730.659026,349.00000,0.00000,0.0,1.0,0.0,0.0,1.0,0.0,0
331924,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,16806.722690,119.00000,0.00000,0.0,1.0,0.0,0.0,1.0,0.0,0
331925,1,1,31.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,53.973823,37055.00000,0.00000,0.0,1.0,0.0,0.0,1.0,0.0,0
331926,6,7,1313.0,307.0,753,218.833333,331.306152,168.0,43.857143,75.366722,...,72.736632,14893.91667,18532.64075,0.0,0.0,0.0,1.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147438,2,0,62.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,125000.000000,16.00000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,1
147439,1,1,35.0,139.0,35,35.000000,0.000000,139.0,139.000000,0.000000,...,64.491165,31012.00000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,1
147440,2,0,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,5000.000000,400.00000,0.00000,1.0,0.0,0.0,0.0,0.0,0.0,1
147441,2,0,62.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,142857.142900,14.00000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,1


In [23]:
le.fit(sw_finalData['Label'])
le.classes_

array(['Benign', 'Android_Scareware'], dtype=object)

In [24]:
# Keep an independent copy with the original string labels before encoding
sw_not_encodeData = sw_finalData.copy()
sw_not_encodeData.to_pickle('sw_not_encodeData.pickle')
sw_not_encodeData

Unnamed: 0,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Std,...,Flow Packets/s,Flow IAT Mean,Flow IAT Std,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,Label
331922,6,8,1076.0,4575.0,821,179.333333,321.621931,1418.0,571.875000,679.532284,...,92.682087,1.161954e+04,1.454116e+04,0.0,0.0,0.0,1.0,0.0,0.0,Benign
331923,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,5730.659026,3.490000e+02,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331924,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,16806.722690,1.190000e+02,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331925,1,1,31.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,53.973823,3.705500e+04,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331926,6,7,1313.0,307.0,753,218.833333,331.306152,168.0,43.857143,75.366722,...,72.736632,1.489392e+04,1.853264e+04,0.0,0.0,0.0,1.0,0.0,0.0,Benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
264520,1,1,38.0,151.0,38,38.000000,0.000000,151.0,151.000000,0.000000,...,6.843362,2.922540e+05,0.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,Android_Scareware
264521,1,1,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,7.965620,2.510790e+05,0.000000e+00,0.0,0.0,0.0,0.0,1.0,1.0,Android_Scareware
264522,1,1,34.0,139.0,34,34.000000,0.000000,139.0,139.000000,0.000000,...,13.541512,1.476940e+05,0.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,Android_Scareware
264523,2,2,826.0,34.0,414,413.000000,1.414214,17.0,17.000000,0.000000,...,0.241014,5.532192e+06,9.411234e+06,0.0,0.0,0.0,0.0,0.0,0.0,Android_Scareware


In [25]:
sw_finalData['Label'] = le.transform(sw_finalData['Label'])
sw_finalData

Unnamed: 0,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Std,...,Flow Packets/s,Flow IAT Mean,Flow IAT Std,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,Label
331922,6,8,1076.0,4575.0,821,179.333333,321.621931,1418.0,571.875000,679.532284,...,92.682087,1.161954e+04,1.454116e+04,0.0,0.0,0.0,1.0,0.0,0.0,0
331923,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,5730.659026,3.490000e+02,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,0
331924,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,16806.722690,1.190000e+02,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,0
331925,1,1,31.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,53.973823,3.705500e+04,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,0
331926,6,7,1313.0,307.0,753,218.833333,331.306152,168.0,43.857143,75.366722,...,72.736632,1.489392e+04,1.853264e+04,0.0,0.0,0.0,1.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
264520,1,1,38.0,151.0,38,38.000000,0.000000,151.0,151.000000,0.000000,...,6.843362,2.922540e+05,0.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,1
264521,1,1,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,7.965620,2.510790e+05,0.000000e+00,0.0,0.0,0.0,0.0,1.0,1.0,1
264522,1,1,34.0,139.0,34,34.000000,0.000000,139.0,139.000000,0.000000,...,13.541512,1.476940e+05,0.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,1
264523,2,2,826.0,34.0,414,413.000000,1.414214,17.0,17.000000,0.000000,...,0.241014,5.532192e+06,9.411234e+06,0.0,0.0,0.0,0.0,0.0,0.0,1


In [26]:
le.fit(sms_finalData['Label'])
le.classes_

array(['Benign', 'Android_SMS_Malware'], dtype=object)

In [27]:
# Keep an independent copy with the original string labels before encoding
sms_not_encodeData = sms_finalData.copy()
sms_not_encodeData.to_pickle('sms_not_encodeData.pickle')
sms_not_encodeData

Unnamed: 0,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Std,...,Flow Packets/s,Flow IAT Mean,Flow IAT Std,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,Label
331922,6,8,1076.0,4575.0,821,179.333333,321.621931,1418.0,571.875000,679.532284,...,92.682087,1.161954e+04,1.454116e+04,0.0,0.0,0.0,1.0,0.0,0.0,Benign
331923,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,5730.659026,3.490000e+02,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331924,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,16806.722690,1.190000e+02,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331925,1,1,31.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,53.973823,3.705500e+04,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,Benign
331926,6,7,1313.0,307.0,753,218.833333,331.306152,168.0,43.857143,75.366722,...,72.736632,1.489392e+04,1.853264e+04,0.0,0.0,0.0,1.0,0.0,0.0,Benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
331917,3,1,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.396622,3.361725e+06,5.800886e+06,0.0,0.0,0.0,1.0,0.0,0.0,Android_SMS_Malware
331918,2,0,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,200000.000000,1.000000e+01,0.000000e+00,0.0,0.0,0.0,0.0,1.0,0.0,Android_SMS_Malware
331919,2,0,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,250000.000000,8.000000e+00,0.000000e+00,0.0,0.0,0.0,0.0,1.0,0.0,Android_SMS_Malware
331920,3,3,768.0,317.0,768,256.000000,443.405007,317.0,105.666667,183.020035,...,5.849601,2.051422e+05,2.116620e+05,0.0,0.0,0.0,1.0,0.0,0.0,Android_SMS_Malware


In [28]:
sms_finalData['Label'] = le.transform(sms_finalData['Label'])
sms_finalData

Unnamed: 0,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Std,...,Flow Packets/s,Flow IAT Mean,Flow IAT Std,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,Label
331922,6,8,1076.0,4575.0,821,179.333333,321.621931,1418.0,571.875000,679.532284,...,92.682087,1.161954e+04,1.454116e+04,0.0,0.0,0.0,1.0,0.0,0.0,0
331923,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,5730.659026,3.490000e+02,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,0
331924,2,0,23.0,0.0,23,11.500000,16.263456,0.0,0.000000,0.000000,...,16806.722690,1.190000e+02,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,0
331925,1,1,31.0,0.0,31,31.000000,0.000000,0.0,0.000000,0.000000,...,53.973823,3.705500e+04,0.000000e+00,0.0,1.0,0.0,0.0,1.0,0.0,0
331926,6,7,1313.0,307.0,753,218.833333,331.306152,168.0,43.857143,75.366722,...,72.736632,1.489392e+04,1.853264e+04,0.0,0.0,0.0,1.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
331917,3,1,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.396622,3.361725e+06,5.800886e+06,0.0,0.0,0.0,1.0,0.0,0.0,1
331918,2,0,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,200000.000000,1.000000e+01,0.000000e+00,0.0,0.0,0.0,0.0,1.0,0.0,1
331919,2,0,0.0,0.0,0,0.000000,0.000000,0.0,0.000000,0.000000,...,250000.000000,8.000000e+00,0.000000e+00,0.0,0.0,0.0,0.0,1.0,0.0,1
331920,3,3,768.0,317.0,768,256.000000,443.405007,317.0,105.666667,183.020035,...,5.849601,2.051422e+05,2.116620e+05,0.0,0.0,0.0,1.0,0.0,0.0,1


### 4. Conversion of Final cleaned and prepared DataFrames to Pickle Files
>The final step is to save the DataFrames as pickle files, so that they can be loaded later for exploratory data analysis (EDA)/visualization to gather relevant insights, and for the machine learning techniques used to answer our question (What are the factors that help predict the type of malware attack that happened on an Android device?).

In [29]:
aw_finalData.to_pickle('aw_finalData.pickle')
sw_finalData.to_pickle('sw_finalData.pickle')
sms_finalData.to_pickle('sms_finalData.pickle')
b_Data.to_pickle('benign_Data.pickle')
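
>When the pickle files are loaded in a later part, a quick round-trip check confirms that nothing was lost on the way. The sketch below uses a toy DataFrame and a temporary directory, and the file name toy_finalData.pickle is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for one of the final DataFrames (values are made up).
frame = pd.DataFrame({'Total Fwd Packets': [6, 2], 'Label': [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_finalData.pickle')  # hypothetical file name
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes, column by column.
round_trip_ok = restored.equals(frame)
print(f"Round trip preserved the DataFrame: {round_trip_ok}")
```

>Pickle files can execute code when loaded, so only load pickles that were produced by a trusted source such as this notebook.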