# Case Study: Feature Selection

This hands-on case study presents a feature selection mechanism based on Random Forest.

## Dataset: Android Malware Detection

Android malware dataset (CIC-AndMal2017)

We propose our new Android malware dataset here, named CICAndMal2017. In this approach, we run both our malware and benign applications on real smartphones to avoid runtime behaviour modification by advanced malware samples that are able to detect the emulator environment. We collected more than 10,854 samples (4,354 malware and 6,500 benign) from several sources. We have collected over six thousand benign apps from the Google Play market published in 2015, 2016, and 2017.

We installed 5,000 of the collected samples (426 malware and 5,065 benign) on real devices. Our malware samples in the CICAndMal2017 dataset are classified into four categories:

* Adware
* Ransomware
* Scareware
* SMS Malware

Our samples come from 42 unique malware families. The family kinds of each category and the numbers of the captured samples are as follows:

Adware

* Dowgin family, 10 captured samples
* Ewind family, 10 captured samples
* Feiwo family, 15 captured samples
* Gooligan family, 14 captured samples
* Kemoge family, 11 captured samples
* Koodous family, 10 captured samples
* Mobidash family, 10 captured samples
* Selfmite family, 4 captured samples
* Shuanet family, 10 captured samples
* Youmi family, 10 captured samples

Ransomware

* Charger family, 10 captured samples
* Jisut family, 10 captured samples
* Koler family, 10 captured samples
* LockerPin family, 10 captured samples
* Simplocker family, 10 captured samples
* Pletor family, 10 captured samples
* PornDroid family, 10 captured samples
* RansomBO family, 10 captured samples
* Svpeng family, 11 captured samples
* WannaLocker family, 10 captured samples

Scareware

* AndroidDefender family, 17 captured samples
* AndroidSpy.277 family, 6 captured samples
* AV for Android family, 10 captured samples
* AVpass family, 10 captured samples
* FakeApp family, 10 captured samples
* FakeApp.AL family, 11 captured samples
* FakeAV family, 10 captured samples
* FakeJobOffer family, 9 captured samples
* FakeTaoBao family, 9 captured samples
* Penetho family, 10 captured samples
* VirusShield family, 10 captured samples

SMS Malware

* BeanBot family, 9 captured samples
* Biige family, 11 captured samples
* FakeInst family, 10 captured samples
* FakeMart family, 10 captured samples
* FakeNotify family, 10 captured samples
* Jifake family, 10 captured samples
* Mazarbot family, 9 captured samples
* Nandrobox family, 11 captured samples
* Plankton family, 10 captured samples
* SMSsniffer family, 9 captured samples
* Zsone family, 10 captured samples

In order to acquire a comprehensive view of our malware samples, we created a specific scenario for each malware category. We also defined three states of data capturing in order to overcome the stealthiness of an advanced malware:

* Installation: The first state of data capturing, which occurs immediately after installing malware (1-3 min).
* Before restart: The second state of data capturing, which occurs 15 min before rebooting phones.
* After restart: The last state of data capturing, which occurs 15 min after rebooting phones.

For feature extraction and selection, we captured network traffic features (.pcap files), and extracted more than 80 features by using CICFlowMeter-V3 during all three mentioned states (installation, before restart, and after restart).

License

The CICAndMal2017 dataset is publicly available for researchers. If you are using our dataset, you should cite our related research paper that outlines the details of the dataset and its underlying principles:

    Arash Habibi Lashkari, Andi Fitriah A. Kadir, Laya Taheri, and Ali A. Ghorbani, “Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification”, In the proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, Quebec, Canada, 2018.

[Download Dataset](https://www.unb.ca/cic/datasets/andmal2017.html)



## Imports

In [1]:
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import f1_score 

* Helper functions (partitioning)
* Removing labels
* Reading the dataset (../datasets/TotalFeatures-ISCXFlowMeter.csv)
* Exploring the dataset (head, describe, info)
* Splitting the dataset

### Helper Functions

In [2]:
# Helper that splits the dataset into training (60%), validation (20%) and test (20%) sets
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)
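The two-step scheme used by the helper can be checked on a small example. The following is a minimal sketch, assuming a hypothetical toy DataFrame (`toy`, with made-up `feature` and `label` columns) rather than the real dataset; it repeats the same two calls to `train_test_split` with stratification enabled, so that each partition keeps the original class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset: 80% benign, 20% malware.
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["benign"] * 80 + ["malware"] * 20,
})

# Step 1: keep 60% for training, set aside 40%.
train, rest = train_test_split(
    toy, test_size=0.4, random_state=42, shuffle=True, stratify=toy["label"])

# Step 2: halve the remaining 40% into validation and test sets.
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, shuffle=True, stratify=rest["label"])

print(len(train), len(val), len(test))
print(train["label"].value_counts(normalize=True).to_dict())
```

With the helper above, the equivalent call would be `train_val_test_split(df, stratify='label')`; stratifying tends to matter when one class is much rarer than the others.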

In [3]:
def remove_labels(df, label_name):
    # Separate the input features (X) from the label column (y)
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return X, y

In [4]:
def evaluate_resul(y_pred, y, y_prep_pred, y_prep, metric):
    # scikit-learn metrics expect (y_true, y_pred), in that order
    print(metric.__name__, "WITHOUT preparation", metric(y, y_pred, average='weighted'))
    print(metric.__name__, "WITH preparation", metric(y_prep, y_prep_pred, average='weighted'))

## 1. Reading the Dataset

In [5]:
df = pd.read_csv('datasets/datasets/TotalFeatures-ISCXFlowMeter.csv')

## 2. Exploring the Dataset

In [6]:
df.head(10)

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward,calss
0,1020586,668,1641,35692,2276876,52,52,679,1390,53.431138,...,0.0,-1,0.0,2,4194240,1853440,1640,668,32,benign
1,80794,1,1,75,124,75,124,75,124,75.0,...,0.0,-1,0.0,2,0,0,0,1,0,benign
2,998,3,0,187,0,52,-1,83,-1,62.333333,...,0.0,-1,0.0,4,101888,-1,0,3,32,benign
3,189868,9,9,1448,6200,52,52,706,1390,160.888889,...,0.0,-1,0.0,2,4194240,2722560,8,9,32,benign
4,110577,4,6,528,1422,52,52,331,1005,132.0,...,0.0,-1,0.0,2,155136,31232,5,4,32,benign
5,261876,7,6,1618,882,52,52,730,477,231.142857,...,0.0,-1,0.0,2,4194240,926720,3,7,32,benign
6,14,2,0,104,0,52,-1,52,-1,52.0,...,0.0,-1,0.0,3,5824,-1,0,2,32,benign
7,29675,1,1,71,213,71,213,71,213,71.0,...,0.0,-1,0.0,2,0,0,0,1,0,benign
8,806635,4,0,239,0,52,-1,83,-1,59.75,...,0.0,-1,0.0,5,107008,-1,0,4,32,benign
9,56620,3,2,1074,719,52,52,592,667,358.0,...,0.0,-1,0.0,3,128512,10816,1,3,32,benign


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631955 entries, 0 to 631954
Data columns (total 80 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration                 631955 non-null  int64  
 1   total_fpackets           631955 non-null  int64  
 2   total_bpackets           631955 non-null  int64  
 3   total_fpktl              631955 non-null  int64  
 4   total_bpktl              631955 non-null  int64  
 5   min_fpktl                631955 non-null  int64  
 6   min_bpktl                631955 non-null  int64  
 7   max_fpktl                631955 non-null  int64  
 8   max_bpktl                631955 non-null  int64  
 9   mean_fpktl               631955 non-null  float64
 10  mean_bpktl               631955 non-null  float64
 11  std_fpktl                631955 non-null  float64
 12  std_bpktl                631955 non-null  float64
 13  total_fiat               631955 non-null  int64  
 14  tota

In [8]:
df['calss'].value_counts()

calss
benign            471597
asware            155613
GeneralMalware      4745
Name: count, dtype: int64

In [9]:
df.describe()

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,min_idle,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward
count,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,...,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0
mean,21952450.0,6.728514,10.431934,954.0172,12060.42,141.475727,44.357688,263.675901,183.248084,174.959706,...,19973270.0,20312280.0,20752380.0,466387.5,2.360896,962079.6,310451.9,9.733144,6.72471,19.965713
std,190057800.0,174.161354,349.424019,82350.4,482471.6,157.68088,89.099554,289.644383,371.863224,162.024811,...,189798600.0,189790200.0,189972100.0,6199704.0,3.04181,1705655.0,664795.6,347.877923,174.13813,14.914261
min,-18.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,...,-1.0,0.0,-1.0,0.0,2.0,-1.0,-1.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,69.0,0.0,52.0,-1.0,52.0,-1.0,52.0,...,-1.0,0.0,-1.0,0.0,2.0,0.0,-1.0,0.0,1.0,0.0
50%,24450.0,1.0,0.0,184.0,0.0,52.0,-1.0,83.0,-1.0,83.0,...,-1.0,0.0,-1.0,0.0,2.0,87616.0,-1.0,0.0,1.0,32.0
75%,1759751.0,3.0,1.0,427.0,167.0,108.0,52.0,421.0,115.0,356.0,...,1013498.0,1291379.0,1306116.0,0.0,2.0,304640.0,90496.0,1.0,3.0,32.0
max,44310760000.0,48255.0,74768.0,40496440.0,103922200.0,1390.0,1390.0,1500.0,1390.0,1390.0,...,44310720000.0,44300000000.0,44310720000.0,847000000.0,2269.0,4194240.0,4194240.0,74524.0,48255.0,44.0


In [10]:
# Copy the dataset and convert the output variable 'calss' from categorical to numeric

X = df.copy()
X['calss'] = X['calss'].factorize()[0]
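`factorize` returns both the integer codes and the unique labels, and the codes are assigned in order of first appearance. A minimal sketch, assuming a hypothetical four-row sample that reuses the class names from the dataset, shows how to recover which integer stands for which class:

```python
import pandas as pd

# Hypothetical sample; the real column has 631,955 rows.
labels = pd.Series(["benign", "asware", "benign", "GeneralMalware"])

# factorize() returns (codes, uniques); codes follow order of first appearance.
codes, uniques = labels.factorize()

# Build an explicit code -> class name lookup for later interpretation.
mapping = dict(enumerate(uniques))
print(list(codes))
print(mapping)
```

Keeping this mapping around makes it easier to read per-class metrics later, since the model only sees the integer codes.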

In [11]:
# Compute the correlation of every feature with the output variable

corr_matrix = X.corr()
corr_matrix['calss'].sort_values(ascending=False)

calss                     1.000000
flow_fin                  0.286175
min_seg_size_forward      0.258352
Init_Win_bytes_forward    0.129425
std_fpktl                 0.123758
                            ...   
furg_cnt                       NaN
burg_cnt                       NaN
flow_urg                       NaN
flow_cwr                       NaN
flow_ece                       NaN
Name: calss, Length: 80, dtype: float64
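The `NaN` entries at the bottom of this listing most likely come from columns with zero variance: the Pearson coefficient divides by the standard deviation, so it is undefined for a constant column. A minimal sketch, assuming a hypothetical toy DataFrame in which `flow_urg` is constant, reproduces the effect and lists such columns:

```python
import pandas as pd

# Hypothetical toy data: 'flow_urg' never changes, 'duration' does.
toy = pd.DataFrame({
    "flow_urg": [0, 0, 0, 0],
    "duration": [1, 5, 2, 8],
    "label": [0, 1, 0, 1],
})

# Correlation of each column with the label; the constant column yields NaN.
corr = toy.corr()["label"]
print(corr)

# Columns with a single distinct value carry no information for the model.
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
print(constant_cols)
```

Columns found this way can usually be dropped before training without affecting the result.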

In [12]:
X.corr()

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward,calss
duration,1.000000,0.004837,0.004011,0.001673,0.003518,-0.064100,-0.027231,0.008761,0.042925,-0.043746,...,0.998901,0.999458,0.047582,0.016532,0.027610,0.029712,0.003785,0.004838,0.082955,0.067066
total_fpackets,0.004837,1.000000,0.924622,0.425756,0.904007,-0.018958,0.005252,0.024685,0.086255,-0.007910,...,0.001614,0.002267,0.017229,0.016089,0.050201,0.059224,0.902713,0.999866,0.018198,0.018377
total_bpackets,0.004011,0.924622,1.000000,0.156780,0.997268,-0.017667,0.006912,0.018170,0.086886,-0.016104,...,0.000922,0.001617,0.016230,-0.000493,0.048190,0.058435,0.997580,0.924746,0.015124,0.019430
total_fpktl,0.001673,0.425756,0.156780,1.000000,0.090082,-0.003099,0.000803,0.021278,0.022088,0.022409,...,0.000335,0.000609,0.009896,0.001657,0.013283,0.015991,0.088422,0.425789,0.005477,0.000679
total_bpktl,0.003518,0.904007,0.997268,0.090082,1.000000,-0.014926,0.005966,0.012560,0.079905,-0.017328,...,0.000812,0.001452,0.014336,-0.000293,0.043571,0.053134,0.999616,0.904129,0.012139,0.019838
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Init_Win_bytes_backward,0.029712,0.059224,0.058435,0.015991,0.053134,-0.268444,0.038319,0.429893,0.593143,-0.030004,...,0.026959,0.029512,0.097316,-0.052507,0.811204,1.000000,0.056761,0.059242,0.333701,0.069405
RRT_samples_clnt,0.003785,0.902713,0.997580,0.088422,0.999616,-0.016659,0.006156,0.015727,0.084280,-0.017595,...,0.000893,0.001560,0.015200,-0.000437,0.046784,0.056761,1.000000,0.902834,0.014299,0.019679
Act_data_pkt_forward,0.004838,0.999866,0.924746,0.425789,0.904129,-0.018947,0.005264,0.024705,0.086278,-0.007893,...,0.001617,0.002269,0.017233,0.000734,0.050220,0.059242,0.902834,1.000000,0.018229,0.018391
min_seg_size_forward,0.082955,0.018198,0.015124,0.005477,0.012139,-0.686154,-0.189824,-0.074763,0.217989,-0.524024,...,0.077943,0.079324,0.048803,0.052177,0.394743,0.333701,0.014299,0.018229,1.000000,0.258352


## 3. Splitting the Dataset

In [13]:
# Split the dataset into training, validation and test sets

train_set, val_set, test_set = train_val_test_split(X)

In [14]:
X_train, y_train = remove_labels(train_set, 'calss')
X_val, y_val = remove_labels(val_set, 'calss')
X_test, y_test = remove_labels(test_set, 'calss')

## 4. Random Forest

In [15]:
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train, y_train)

RandomForestClassifier parameters:
n_estimators: 50
criterion: 'gini'
max_depth: None
min_samples_split: 2
min_samples_leaf: 1
min_weight_fraction_leaf: 0.0
max_features: 'sqrt'
max_leaf_nodes: None
min_impurity_decrease: 0.0
bootstrap: True

In [16]:
# Predict on the validation set
y_pred = clf_rnd.predict(X_val)

In [17]:
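# Evaluate model performance on the validation set.
# Note: scikit-learn's signature is f1_score(y_true, y_pred); with average='weighted'
# the argument order determines the class weights. The score below was obtained
# with the order shown here.
print("F1 Score:", f1_score(y_pred, y_val, average='weighted'))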

F1 Score: 0.9322833436207185
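Because the classes are heavily imbalanced, a single weighted score can hide poor performance on the rare class. The sketch below, which uses small hypothetical label vectors rather than the notebook's predictions, shows how to obtain one F1 value per class and why the argument order matters for the weighted average: the weights are taken from the class frequencies of the first argument.

```python
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions for three classes.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_hat = [0, 0, 0, 1, 1, 1, 2]

# One F1 value per class (classes 0, 1 and 2, in that order).
per_class = f1_score(y_true, y_hat, average=None)

# Weighted average using the supports of the true labels (documented order).
weighted = f1_score(y_true, y_hat, average="weighted")

# Swapping the arguments weights each class by the predicted frequencies instead.
weighted_swapped = f1_score(y_hat, y_true, average="weighted")

print(per_class)
print(weighted, weighted_swapped)
```

The per-class values are the same in both orders; only the weights, and therefore the averaged score, change.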


## 5. Feature Importance

In [18]:
# Importance score of each feature, in the same order as the columns of X_train (not sorted)
clf_rnd.feature_importances_

array([0.03148474, 0.00447086, 0.00355324, 0.02287126, 0.01121304,
       0.01725066, 0.00933473, 0.0214129 , 0.01070152, 0.02036135,
       0.01232613, 0.01051379, 0.00522225, 0.01905194, 0.00459829,
       0.01360064, 0.00596092, 0.01764871, 0.00500794, 0.01829403,
       0.00468597, 0.00595802, 0.00276679, 0.00953712, 0.00590933,
       0.        , 0.        , 0.00287043, 0.00402655, 0.02736208,
       0.01791751, 0.02924532, 0.02833102, 0.02540786, 0.01716949,
       0.02337682, 0.01503398, 0.019862  , 0.03657175, 0.02918993,
       0.00832129, 0.03273876, 0.00610444, 0.00374368, 0.01128868,
       0.00891973, 0.        , 0.        , 0.        , 0.01230706,
       0.01945172, 0.01976295, 0.00334435, 0.00118137, 0.00076241,
       0.00102501, 0.00499646, 0.01028637, 0.00267208, 0.00155135,
       0.00243639, 0.00269669, 0.0095307 , 0.00235958, 0.00207977,
       0.00916491, 0.00897294, 0.01111258, 0.00177268, 0.01228588,
       0.00814881, 0.00937185, 0.00173232, 0.00154932, 0.14223

In [19]:
# Map each feature name to its importance to identify the features that matter most for classification.
# X_train.columns is used so that the label column 'calss' is not among the names.

feature_importances = dict(zip(X_train.columns, clf_rnd.feature_importances_))

In [20]:
feature_importances_sorted = pd.Series(feature_importances).sort_values(ascending=False)
feature_importances_sorted.head(20)

Init_Win_bytes_forward     0.142233
max_flowiat                0.036572
flow_fin                   0.032739
Init_Win_bytes_backward    0.032202
duration                   0.031485
flowPktsPerSecond          0.029245
mean_flowiat               0.029190
flowBytesPerSecond         0.028331
fPktsPerSecond             0.027362
min_flowpktl               0.025408
mean_flowpktl              0.023377
total_fpktl                0.022871
max_fpktl                  0.021413
mean_fpktl                 0.020361
min_flowiat                0.019862
fAvgSegmentSize            0.019763
avgPacketSize              0.019452
total_fiat                 0.019052
mean_fiat                  0.018294
bPktsPerSecond             0.017918
dtype: float64
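The scores above are impurity-based and computed on the training data, which can favor features with many distinct values. As a cross-check, one option is permutation importance measured on held-out data. The following is a minimal sketch on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `model` and `ranking` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the validation split and measure the score drop.
result = permutation_importance(model, X_va, y_va, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
print(ranking)
```

If both rankings broadly agree on the top features, that gives some extra confidence in the selection.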

## Reducing the Number of Features

In [21]:
# Extract the 10 features most relevant to the algorithm

columns = list(feature_importances_sorted.head(10).index)
columns

['Init_Win_bytes_forward',
 'max_flowiat',
 'flow_fin',
 'Init_Win_bytes_backward',
 'duration',
 'flowPktsPerSecond',
 'mean_flowiat',
 'flowBytesPerSecond',
 'fPktsPerSecond',
 'min_flowpktl']

In [22]:
X_train_reduced = X_train[columns].copy()
X_val_reduced = X_val[columns].copy()
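The manual top-10 selection can also be expressed with scikit-learn's `SelectFromModel`, which wraps the estimator and keeps the highest-ranked features. This is a sketch of an alternative, not the approach used in this notebook, and it runs on synthetic data (`X_demo`, `y_demo` are illustrative names).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real data: 20 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=20, random_state=0),
    threshold=-np.inf,  # disable the threshold so that only max_features decides
    max_features=10,    # keep the 10 features with the highest importance
)
selector.fit(X_demo, y_demo)

# Reduced matrix and the indices of the columns that were kept.
X_demo_reduced = selector.transform(X_demo)
selected_idx = np.flatnonzero(selector.get_support())
print(X_demo_reduced.shape, selected_idx)
```

One advantage of this form is that the selector can be placed inside a `Pipeline`, so the same columns are applied to validation and test data automatically.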

In [23]:
X_train_reduced.head(10)

Unnamed: 0,Init_Win_bytes_forward,max_flowiat,flow_fin,Init_Win_bytes_backward,duration,flowPktsPerSecond,mean_flowiat,flowBytesPerSecond,fPktsPerSecond,min_flowpktl
508881,0,490,0,0,490,4081.632653,490.0,679591.8367,2040.816327,73
208326,0,-1,0,-1,0,0.0,0.0,0.0,0.0,422
107213,0,-1,0,-1,0,0.0,0.0,0.0,0.0,436
466726,0,23933,0,0,23933,83.566623,23933.0,21267.70568,41.783312,54
230085,0,-1,0,-1,0,0.0,0.0,0.0,0.0,422
472961,4194240,60224201,1,1145472,60365946,0.132525,8623707.0,22.993096,0.066263,52
482372,62912,212,1,-1,212,9433.962264,212.0,636792.4528,9433.962264,52
619993,107008,30839880,2,-1,30839880,0.064851,30800000.0,5.382641,0.064851,83
65344,0,-1,0,-1,0,0.0,0.0,0.0,0.0,436
46666,0,-1,0,-1,0,0.0,0.0,0.0,0.0,365


In [24]:
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train_reduced, y_train)

RandomForestClassifier parameters:
n_estimators: 50
criterion: 'gini'
max_depth: None
min_samples_split: 2
min_samples_leaf: 1
min_weight_fraction_leaf: 0.0
max_features: 'sqrt'
max_leaf_nodes: None
min_impurity_decrease: 0.0
bootstrap: True


In [25]:
# Predict on the validation set
y_pred = clf_rnd.predict(X_val_reduced)

In [26]:
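# Evaluate model performance on the validation set.
# Note: scikit-learn's signature is f1_score(y_true, y_pred); with average='weighted'
# the argument order determines the class weights. The score below was obtained
# with the order shown here.
print("F1 Score:", f1_score(y_pred, y_val, average='weighted'))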

F1 Score: 0.9270918565338361


As the previous cell shows, the model's performance drops only slightly (weighted F1 score from about 0.932 to about 0.927) after discarding 69 of the 79 available features. In return, training and prediction become substantially cheaper, since the model has far fewer columns to process.
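The notebook does not measure the speed-up itself. A rough way to do so is to time the `fit` call with `time.perf_counter`, as in the sketch below. It uses synthetic data with 79 columns and an arbitrary 10-column subset, so the numbers only illustrate the measurement and say nothing about the real dataset; `fit_seconds` is a hypothetical helper.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with the same number of features as the real dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=79, n_informative=10, random_state=0)


def fit_seconds(X, y):
    """Hypothetical helper: wall-clock seconds needed to fit the forest."""
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1).fit(X, y)
    return time.perf_counter() - start


t_full = fit_seconds(X_demo, y_demo)
# Arbitrary 10-column subset, used here only to compare training cost.
t_reduced = fit_seconds(X_demo[:, :10], y_demo)

print(f"79 features: {t_full:.3f} s, 10 features: {t_reduced:.3f} s")
```

Timings vary between runs and machines, so repeating the measurement several times gives a more reliable comparison.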

In [27]:
# With fewer features, training and prediction become faster
# Features whose correlation with the output variable is close to zero tend to contribute little
# Features whose correlation with the output variable is high tend to be useful