# Case Study: Feature Selection

This case study presents a feature selection mechanism based on Random Forest.

## Dataset: Android Malware Detection

We propose our new Android malware dataset here, named CICAndMal2017. In this approach, we run both our malware and benign applications on real smartphones to avoid runtime behaviour modification by advanced malware samples that are able to detect the emulator environment. We collected more than 10,854 samples (4,354 malware and 6,500 benign) from several sources. We have collected over six thousand benign apps from the Google Play market published in 2015, 2016, and 2017.

We installed 5,000 of the collected samples (426 malware and 5,065 benign) on real devices. Our malware samples in the CICAndMal2017 dataset are classified into four categories:

    Adware
    Ransomware
    Scareware
    SMS Malware

Our samples come from 42 unique malware families. The families in each category and the number of captured samples are as follows:

Adware

    Dowgin family, 10 captured samples
    Ewind family, 10 captured samples
    Feiwo family, 15 captured samples
    Gooligan family, 14 captured samples
    Kemoge family, 11 captured samples
    Koodous family, 10 captured samples
    Mobidash family, 10 captured samples
    Selfmite family, 4 captured samples
    Shuanet family, 10 captured samples
    Youmi family, 10 captured samples

Ransomware

    Charger family, 10 captured samples
    Jisut family, 10 captured samples
    Koler family, 10 captured samples
    LockerPin family, 10 captured samples
    Simplocker family, 10 captured samples
    Pletor family, 10 captured samples
    PornDroid family, 10 captured samples
    RansomBO family, 10 captured samples
    Svpeng family, 11 captured samples
    WannaLocker family, 10 captured samples

Scareware

    AndroidDefender family, 17 captured samples
    AndroidSpy.277 family, 6 captured samples
    AV for Android family, 10 captured samples
    AVpass family, 10 captured samples
    FakeApp family, 10 captured samples
    FakeApp.AL family, 11 captured samples
    FakeAV family, 10 captured samples
    FakeJobOffer family, 9 captured samples
    FakeTaoBao family, 9 captured samples
    Penetho family, 10 captured samples
    VirusShield family, 10 captured samples

SMS Malware

    BeanBot family, 9 captured samples
    Biige family, 11 captured samples
    FakeInst family, 10 captured samples
    FakeMart family, 10 captured samples
    FakeNotify family, 10 captured samples
    Jifake family, 10 captured samples
    Mazarbot family, 9 captured samples
    Nandrobox family, 11 captured samples
    Plankton family, 10 captured samples
    SMSsniffer family, 9 captured samples
    Zsone family, 10 captured samples

In order to acquire a comprehensive view of our malware samples, we created a specific scenario for each malware category. We also defined three states of data capturing in order to overcome the stealthiness of an advanced malware:

    Installation: The first state of data capturing which occurs immediately after installing malware (1-3 min).
    Before restart: The second state of data capturing which occurs 15 min before rebooting phones.
    After restart: The last state of data capturing which occurs 15 min after rebooting phones.

For feature extraction and selection, we captured network traffic features (.pcap files) and extracted more than 80 features using CICFlowMeter-V3 during all three states mentioned above (installation, before restart, and after restart).

License

The CICAndMal2017 dataset is publicly available for researchers. If you are using our dataset, you should cite our related research paper that outlines the details of the dataset and its underlying principles:

    Arash Habibi Lashkari, Andi Fitriah A. Kadir, Laya Taheri, and Ali A. Ghorbani, “Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification”, In the proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, Quebec, Canada, 2018.


## Imports


In [1]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import f1_score  

## Helper Functions

In [10]:
### Homework
# 1.- Helper functions (partitioning)
# 2.- Label removal (remove_labels)
# 3.- Reading the dataset (../datasets/TotalFeatures-ISCXFlowMeter.csv)
# 4.- Dataset visualization
    # head, describe, info
# 5.- Splitting the dataset

In [2]:
# 1.- Helper functions (partitioning): stratified train/test split
def particionar_datos(X, y, test_size=0.3, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    return X_train, X_test, y_train, y_test
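
If a separate validation set is wanted in addition to the test set, one possible approach is to call `train_test_split` twice. The sketch below is only illustrative: the helper name `train_val_test_split`, the 60/20/20 proportions, and the toy data are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=42):
    """Hypothetical helper: stratified split into train, validation and test sets."""
    # First carve the test set out of the full data.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Then split the remainder; rescale val_size so it refers to the full data.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, random_state=random_state, stratify=y_rest
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data standing in for the real feature matrix and labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["benign"] * 60 + ["malware"] * 40)

X_tr, X_va, X_te, y_tr, y_va, y_te = train_val_test_split(X_demo, y_demo)
print(len(X_tr), len(X_va), len(X_te))
```

With a three-way split, hyperparameters and feature subsets can be compared on the validation set while the test set is kept for a final, one-off evaluation.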


In [3]:
# 2.- Label removal (remove_labels): separate the features from the label column
def remove_labels(df, label_column='Label'):
    X = df.drop(columns=[label_column])
    y = df[label_column]
    return X, y

## 1.- Reading the Dataset

In [None]:
# 3.- Reading the dataset (../datasets/TotalFeatures-ISCXFlowMeter.csv)
ruta_dataset = "datasets/datasets/TotalFeatures-ISCXFlowMeter.csv"
# Read the dataset
df = pd.read_csv(ruta_dataset)
print(f"Filas: {df.shape[0]}, Columnas: {df.shape[1]}")

In [5]:
# 4.- Dataset visualization
display(df.head())

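| | duration | total_fpackets | total_bpackets | total_fpktl | total_bpktl | min_fpktl | min_bpktl | max_fpktl | max_bpktl | mean_fpktl | ... | mean_idle | max_idle | std_idle | FFNEPD | Init_Win_bytes_forward | Init_Win_bytes_backward | RRT_samples_clnt | Act_data_pkt_forward | min_seg_size_forward | calss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1020586 | 668 | 1641 | 35692 | 2276876 | 52 | 52 | 679 | 1390 | 53.431138 | ... | 0.0 | -1 | 0.0 | 2 | 4194240 | 1853440 | 1640 | 668 | 32 | benign |
| 1 | 80794 | 1 | 1 | 75 | 124 | 75 | 124 | 75 | 124 | 75.0 | ... | 0.0 | -1 | 0.0 | 2 | 0 | 0 | 0 | 1 | 0 | benign |
| 2 | 998 | 3 | 0 | 187 | 0 | 52 | -1 | 83 | -1 | 62.333333 | ... | 0.0 | -1 | 0.0 | 4 | 101888 | -1 | 0 | 3 | 32 | benign |
| 3 | 189868 | 9 | 9 | 1448 | 6200 | 52 | 52 | 706 | 1390 | 160.888889 | ... | 0.0 | -1 | 0.0 | 2 | 4194240 | 2722560 | 8 | 9 | 32 | benign |
| 4 | 110577 | 4 | 6 | 528 | 1422 | 52 | 52 | 331 | 1005 | 132.0 | ... | 0.0 | -1 | 0.0 | 2 | 155136 | 31232 | 5 | 4 | 32 | benign |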


In [6]:
display(df.describe())

| | duration | total_fpackets | total_bpackets | total_fpktl | total_bpktl | min_fpktl | min_bpktl | max_fpktl | max_bpktl | mean_fpktl | ... | min_idle | mean_idle | max_idle | std_idle | FFNEPD | Init_Win_bytes_forward | Init_Win_bytes_backward | RRT_samples_clnt | Act_data_pkt_forward | min_seg_size_forward |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | ... | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 | 631955.0 |
| mean | 21952450.0 | 6.728514 | 10.431934 | 954.0172 | 12060.42 | 141.475727 | 44.357688 | 263.675901 | 183.248084 | 174.959706 | ... | 19973270.0 | 20312280.0 | 20752380.0 | 466387.5 | 2.360896 | 962079.6 | 310451.9 | 9.733144 | 6.72471 | 19.965713 |
| std | 190057800.0 | 174.161354 | 349.424019 | 82350.4 | 482471.6 | 157.68088 | 89.099554 | 289.644383 | 371.863224 | 162.024811 | ... | 189798600.0 | 189790200.0 | 189972100.0 | 6199704.0 | 3.04181 | 1705655.0 | 664795.6 | 347.877923 | 174.13813 | 14.914261 |
| min | -18.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 | ... | -1.0 | 0.0 | -1.0 | 0.0 | 2.0 | -1.0 | -1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 1.0 | 0.0 | 69.0 | 0.0 | 52.0 | -1.0 | 52.0 | -1.0 | 52.0 | ... | -1.0 | 0.0 | -1.0 | 0.0 | 2.0 | 0.0 | -1.0 | 0.0 | 1.0 | 0.0 |
| 50% | 24450.0 | 1.0 | 0.0 | 184.0 | 0.0 | 52.0 | -1.0 | 83.0 | -1.0 | 83.0 | ... | -1.0 | 0.0 | -1.0 | 0.0 | 2.0 | 87616.0 | -1.0 | 0.0 | 1.0 | 32.0 |
| 75% | 1759751.0 | 3.0 | 1.0 | 427.0 | 167.0 | 108.0 | 52.0 | 421.0 | 115.0 | 356.0 | ... | 1013498.0 | 1291379.0 | 1306116.0 | 0.0 | 2.0 | 304640.0 | 90496.0 | 1.0 | 3.0 | 32.0 |
| max | 44310760000.0 | 48255.0 | 74768.0 | 40496440.0 | 103922200.0 | 1390.0 | 1390.0 | 1500.0 | 1390.0 | 1390.0 | ... | 44310720000.0 | 44300000000.0 | 44310720000.0 | 847000000.0 | 2269.0 | 4194240.0 | 4194240.0 | 74524.0 | 48255.0 | 44.0 |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631955 entries, 0 to 631954
Data columns (total 80 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration                 631955 non-null  int64  
 1   total_fpackets           631955 non-null  int64  
 2   total_bpackets           631955 non-null  int64  
 3   total_fpktl              631955 non-null  int64  
 4   total_bpktl              631955 non-null  int64  
 5   min_fpktl                631955 non-null  int64  
 6   min_bpktl                631955 non-null  int64  
 7   max_fpktl                631955 non-null  int64  
 8   max_bpktl                631955 non-null  int64  
 9   mean_fpktl               631955 non-null  float64
 10  mean_bpktl               631955 non-null  float64
 11  std_fpktl                631955 non-null  float64
 12  std_bpktl                631955 non-null  float64
 13  total_fiat               631955 non-null  int64  
 14  tota

In [8]:
print(df.columns)


Index(['duration', 'total_fpackets', 'total_bpackets', 'total_fpktl',
       'total_bpktl', 'min_fpktl', 'min_bpktl', 'max_fpktl', 'max_bpktl',
       'mean_fpktl', 'mean_bpktl', 'std_fpktl', 'std_bpktl', 'total_fiat',
       'total_biat', 'min_fiat', 'min_biat', 'max_fiat', 'max_biat',
       'mean_fiat', 'mean_biat', 'std_fiat', 'std_biat', 'fpsh_cnt',
       'bpsh_cnt', 'furg_cnt', 'burg_cnt', 'total_fhlen', 'total_bhlen',
       'fPktsPerSecond', 'bPktsPerSecond', 'flowPktsPerSecond',
       'flowBytesPerSecond', 'min_flowpktl', 'max_flowpktl', 'mean_flowpktl',
       'std_flowpktl', 'min_flowiat', 'max_flowiat', 'mean_flowiat',
       'std_flowiat', 'flow_fin', 'flow_syn', 'flow_rst', 'flow_psh',
       'flow_ack', 'flow_urg', 'flow_cwr', 'flow_ece', 'downUpRatio',
       'avgPacketSize', 'fAvgSegmentSize', 'fHeaderBytes', 'fAvgBytesPerBulk',
       'fAvgPacketsPerBulk', 'fAvgBulkRate', 'bVarianceDataBytes',
       'bAvgSegmentSize', 'bAvgBytesPerBulk', 'bAvgPacketsPerBulk',
     

## 3.- Splitting the Dataset

In [9]:
# 5.- Splitting the dataset
# Remove the label column
X, y = remove_labels(df, label_column='calss')
# Partition the data
X_train, X_test, y_train, y_test = particionar_datos(X, y)
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


Training set size: (442368, 79)
Test set size: (189587, 79)
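
Because `particionar_datos` passes `stratify=y`, both subsets should keep roughly the class ratio of the full dataset. A minimal way to check this is sketched below on toy labels; the 80/20 benign/malware ratio is an assumption made for illustration and is not taken from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels standing in for the `calss` column (assumed 80% benign, 20% malware).
y_demo = pd.Series(["benign"] * 800 + ["malware"] * 200, name="calss")
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify=y, each subset keeps (almost exactly) the original class ratio.
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions)
```

On the real data the same check would use `y`, `y_train` and `y_test` in place of the toy series.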


## 4.- Random Forest

In [12]:
from sklearn.ensemble import RandomForestClassifier

In [13]:
clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train, y_train)

| Parameter | Value |
| --- | --- |
| n_estimators | 50 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |

In [15]:
# Predict on the held-out test set (only a train/test split was created above,
# so there is no separate validation set)
y_pred = clf_rnd.predict(X_test)

In [16]:
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
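
To make the `average='weighted'` option concrete, here is a tiny hand-made example (the labels are invented and unrelated to the dataset). The weighted F1 is the mean of the per-class F1 scores, each weighted by the number of true samples of that class.

```python
from sklearn.metrics import f1_score

# Hand-made labels: three benign samples and one malware sample.
y_true = ["benign", "benign", "benign", "malware"]
y_pred = ["benign", "benign", "malware", "malware"]

# Per-class F1: benign = 0.8 (precision 1, recall 2/3), malware = 2/3 (precision 1/2, recall 1).
per_class = f1_score(y_true, y_pred, average=None, labels=["benign", "malware"])

# Weighted F1: (3 * 0.8 + 1 * 2/3) / 4, since the supports are 3 and 1.
weighted = f1_score(y_true, y_pred, average="weighted")

print("Per-class F1:", per_class)
print("Weighted F1:", weighted)
```

When the classes are imbalanced, the weighted score is dominated by the majority class, so it can be worth inspecting the per-class values as well.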

## 5.- Feature Importance

In [17]:
clf_rnd.feature_importances_

array([0.03207279, 0.00348929, 0.00296119, 0.02388163, 0.01247875,
       0.01866934, 0.00813463, 0.02290893, 0.00988901, 0.01952097,
       0.01200288, 0.01096947, 0.00479843, 0.0199861 , 0.00573239,
       0.01169399, 0.0059963 , 0.01557586, 0.00416844, 0.01932286,
       0.00543832, 0.00432754, 0.00331563, 0.01125258, 0.00424211,
       0.        , 0.        , 0.00436922, 0.00265784, 0.02549355,
       0.014814  , 0.02926473, 0.02872515, 0.02655634, 0.01567924,
       0.02310853, 0.01519079, 0.02150446, 0.03683107, 0.02976509,
       0.00741727, 0.03519243, 0.00516403, 0.00352552, 0.0123077 ,
       0.00929566, 0.        , 0.        , 0.        , 0.01267757,
       0.01992017, 0.01930678, 0.0031483 , 0.00104878, 0.00090775,
       0.00106618, 0.00507932, 0.01084196, 0.00307037, 0.0013725 ,
       0.00235969, 0.00207799, 0.01052795, 0.00164308, 0.00287675,
       0.00802909, 0.00883883, 0.01197598, 0.00205637, 0.01272119,
       0.00635449, 0.00989741, 0.00163963, 0.00188489, 0.14413

In [20]:
# The features that matter most for correct classification can be extracted.
# Use the training columns so that names and scores line up (the label is excluded).
feature_importances = dict(zip(X_train.columns, clf_rnd.feature_importances_))

In [None]:
feature_importances_sorted = pd.Series(feature_importances).sort_values(ascending=False)
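
The ranking step can be tried in isolation on synthetic data. The sketch below assumes a dataset generated with `make_classification` (20 features, of which the first 5 are informative because `shuffle=False`); the names `X_demo`, `clf` and `importances_sorted` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data: 20 features, the first 5 informative.
X_arr, y_arr = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0,
    shuffle=False, random_state=42,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_demo, y_arr)

# Pair each importance with its column name and rank from most to least important.
importances = pd.Series(clf.feature_importances_, index=X_demo.columns)
importances_sorted = importances.sort_values(ascending=False)

print(importances_sorted.head(5))
print("Sum of importances:", importances.sum())
```

The impurity-based importances returned by `feature_importances_` are normalized, so they add up to 1 and can be read as relative shares.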

## Reducing the Number of Features

In [21]:
# Extract the 10 most relevant features for the algorithm
columns = list(feature_importances_sorted.head(10).index)

In [None]:
X_train_reduced = X_train[columns].copy()
X_test_reduced = X_test[columns].copy()

In [None]:
X_train_reduced.head(10)

In [None]:
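# Retrain with the same settings on the reduced feature set and evaluate on the test split
clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train_reduced, y_train)
y_pred = clf_rnd.predict(X_test[columns])
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))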

As the previous cell shows, the model's predictive performance worsens slightly after removing 69 of the 79 available features; on the other hand, training and prediction times improve substantially.
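
The trade-off between predictive quality and training time can be measured explicitly. The following sketch does so on synthetic data rather than on the malware dataset; the data generator settings, the helper `fit_and_score`, and all variable names are assumptions made for illustration, and the numbers it prints say nothing about the real dataset.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 79 features (as in this notebook), 10 of them informative.
X_arr, y_arr = make_classification(
    n_samples=3000, n_features=79, n_informative=10, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(79)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.3, random_state=42, stratify=y_arr
)


def fit_and_score(cols):
    """Train on the given columns; return (weighted test F1, fit seconds, model)."""
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    start = time.perf_counter()
    clf.fit(X_tr[cols], y_tr)
    elapsed = time.perf_counter() - start
    score = f1_score(y_te, clf.predict(X_te[cols]), average="weighted")
    return score, elapsed, clf


# Baseline with every feature, then a model restricted to the 10 top-ranked ones.
f1_full, t_full, clf_full = fit_and_score(list(X_demo.columns))
ranking = pd.Series(clf_full.feature_importances_, index=X_demo.columns)
top10 = list(ranking.sort_values(ascending=False).head(10).index)
f1_top10, t_top10, _ = fit_and_score(top10)

print(f"all 79 features: F1={f1_full:.3f}, fit time={t_full:.2f}s")
print(f"top 10 features: F1={f1_top10:.3f}, fit time={t_top10:.2f}s")
```

Note that the ranking is derived from the model fitted on the training split only, so the test split plays no part in choosing the features.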