# 1. Filter-based feature selection

Supervised feature selection techniques use the target variable (y) to identify and remove irrelevant variables.

Filter-based feature selection methods use statistical techniques to score the relationship between each input variable and the target variable. These scores are then the basis for choosing (filtering) the input variables that go into the model.

For numerical features and a categorical target, a suitable statistical test is the ANOVA F-test, available in scikit-learn as `f_classif`.
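To make the ANOVA score concrete, the following is a minimal sketch on synthetic data (the arrays `informative` and `noise` are invented for illustration and are not part of the EuroSAT features). It computes the F statistic per column with `f_classif` and cross-checks it against SciPy's one-way ANOVA.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 3 classes, 100 samples each.
y = np.repeat([0, 1, 2], 100)
informative = rng.normal(loc=y.astype(float), scale=1.0)  # class means 0, 1, 2
noise = rng.normal(size=y.size)                            # same mean in every class
X = np.column_stack([informative, noise])

# f_classif returns one F statistic and one p-value per column of X.
f_scores, p_values = f_classif(X, y)

# The same statistic, computed column by column with SciPy's one-way ANOVA.
f_check = np.array([
    f_oneway(*(X[y == c, j] for c in np.unique(y))).statistic
    for j in range(X.shape[1])
])

print("F scores:", f_scores)
print("p-values:", p_values)
```

A column whose mean differs between classes gets a high F score and a small p-value, while a column with the same distribution in every class scores close to 1.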

**Objective:** Select the most important features for prediction.


**Feature information**
This dataset contains images from the EuroSAT dataset. There are 10 folders:
* 0 AnnualCrop
* 1 Forest
* 2 HerbaceousVegetation
* 3 Highway
* 4 Industrial
* 5 Pasture
* 6 PermanentCrop
* 7 Residential
* 8 River
* 9 SeaLake


**Number of instances:** 27000


# 2. Drive authentication

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 3. Importing modules

In [2]:
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# 4. Reading the file

In [3]:
path = r'/content/drive/Shareddrives/Data Science para Geociencias/4. Selección de caracteristicas/Resultados'
train_path = os.path.join(path,'EUROSAT_TRAIN_FEAT.csv')

In [4]:
train_df = pd.read_csv(train_path)
train_df.head()

Unnamed: 0,histogram_0,histogram_1,histogram_2,histogram_3,histogram_4,histogram_5,histogram_6,histogram_7,histogram_8,histogram_9,histogram_10,histogram_11,histogram_12,histogram_13,histogram_14,histogram_15,histogram_16,histogram_17,histogram_18,histogram_19,histogram_20,histogram_21,histogram_22,histogram_23,histogram_24,histogram_25,histogram_26,histogram_27,histogram_28,histogram_29,histogram_30,histogram_31,histogram_32,histogram_33,histogram_34,histogram_35,histogram_36,histogram_37,histogram_38,histogram_39,...,histogram_493,histogram_494,histogram_495,histogram_496,histogram_497,histogram_498,histogram_499,histogram_500,histogram_501,histogram_502,histogram_503,histogram_504,histogram_505,histogram_506,histogram_507,histogram_508,histogram_509,histogram_510,histogram_511,hal_0,hal_1,hal_2,hal_3,hal_4,hal_5,hal_6,hal_7,hal_8,hal_9,hal_10,hal_11,hal_12,hum_0,hum_1,hum_2,hum_3,hum_4,hum_5,hum_6,label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012692,8.819525,0.606407,11.202328,0.411772,112.869246,35.989785,4.587024,6.990196,0.002292,2.686005,-0.132405,0.783513,0.002973,6.369633e-10,1.516469e-13,7.302349e-13,9.142081e-26,8.576981e-18,-2.251498e-25,Forest
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01678,6.737624,0.523612,7.070943,0.395274,113.67399,21.546147,4.192914,6.515763,0.002552,2.556843,-0.087996,0.659967,0.002942,2.693041e-10,9.678997e-15,7.353245e-13,1.652979e-26,1.1911830000000001e-17,5.979174999999999e-26,Forest
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025098,4.159946,0.587513,5.044265,0.489949,105.225032,16.017115,4.011338,5.961651,0.003509,2.247017,-0.124186,0.72792,0.003155,4.341931e-10,1.849652e-12,2.115307e-13,-1.0710650000000001e-25,3.1512589999999995e-19,7.768625e-26,Forest
3,0.0,0.0,0.0,0.000718,0.00395,0.001436,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000359,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022486,50.834805,0.7447,99.422595,0.436835,117.787398,346.855576,4.579942,6.767477,0.001086,2.938769,-0.190203,0.866096,0.002811,4.023647e-09,7.724173e-12,3.490115e-12,-1.568084e-23,6.745461e-17,9.082255e-24,Forest
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00764,12.731279,0.589228,15.496675,0.334405,113.976952,49.255423,4.815528,7.530149,0.001799,2.969903,-0.106116,0.746747,0.002936,4.838209e-10,3.726097e-12,5.626374e-12,2.496585e-23,1.228961e-16,-6.352761e-24,Forest


# 5. Normalization

#### a) Scaling

In [5]:
scaler = MinMaxScaler(feature_range=(0, 1))
train_df.loc[:, train_df.columns != 'label'] = scaler.fit_transform(train_df.loc[:, train_df.columns != 'label'])
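As a rough illustration of what the scaling step does, the sketch below applies `MinMaxScaler` to a tiny made-up matrix and reproduces the result by hand with `(x - min) / (max - min)` per column. The matrix `X` and the helper variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix (illustrative only); the third column is constant.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 30.0, 5.0],
              [4.0, 20.0, 5.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Manual version: (x - min) / (max - min), column by column.
col_min, col_max = X.min(axis=0), X.max(axis=0)
# A constant column has zero range; divide by 1 instead so it maps to 0.
col_range = np.where(col_max > col_min, col_max - col_min, 1.0)
manual = (X - col_min) / col_range

print(scaled)
```

Note that a constant column is mapped to all zeros rather than producing a division by zero, so it remains constant after scaling.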

#### b) Encoding

In [6]:
le = LabelEncoder()
train_df['label'] = le.fit_transform(train_df.label.values)
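`LabelEncoder` assigns integer codes in the sorted order of the labels it actually sees during fitting, so the codes need not match the folder numbers listed in section 1. The sketch below uses a short made-up label list (`labels`, `le_demo`, and `mapping` are names chosen for this example) to show how the mapping can be inspected and reversed.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels (illustrative only); not every class has to be present.
labels = ["River", "Forest", "SeaLake", "Forest", "Highway"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ holds the sorted unique labels; the position is the integer code.
mapping = {name: int(code) for code, name in enumerate(le_demo.classes_)}

# inverse_transform recovers the original strings from the codes.
decoded = le_demo.inverse_transform(codes)

print(mapping)
```

Keeping the fitted encoder (here `le_demo`, in the notebook `le`) around makes it possible to translate predictions back into class names later.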

In [7]:
train_df.head()

Unnamed: 0,histogram_0,histogram_1,histogram_2,histogram_3,histogram_4,histogram_5,histogram_6,histogram_7,histogram_8,histogram_9,histogram_10,histogram_11,histogram_12,histogram_13,histogram_14,histogram_15,histogram_16,histogram_17,histogram_18,histogram_19,histogram_20,histogram_21,histogram_22,histogram_23,histogram_24,histogram_25,histogram_26,histogram_27,histogram_28,histogram_29,histogram_30,histogram_31,histogram_32,histogram_33,histogram_34,histogram_35,histogram_36,histogram_37,histogram_38,histogram_39,...,histogram_493,histogram_494,histogram_495,histogram_496,histogram_497,histogram_498,histogram_499,histogram_500,histogram_501,histogram_502,histogram_503,histogram_504,histogram_505,histogram_506,histogram_507,histogram_508,histogram_509,histogram_510,histogram_511,hal_0,hal_1,hal_2,hal_3,hal_4,hal_5,hal_6,hal_7,hal_8,hal_9,hal_10,hal_11,hal_12,hum_0,hum_1,hum_2,hum_3,hum_4,hum_5,hum_6,label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02428,0.005422,0.600306,0.001954,0.447481,0.115357,0.001613,0.462514,0.491858,0.177397,0.326815,0.78352,0.764189,0.576694,0.00129,8.8e-05,0.000462,0.902864,0.06377,0.818836,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032223,0.004092,0.514067,0.001221,0.42712,0.117643,0.000959,0.40726,0.447951,0.197881,0.302561,0.858441,0.629037,0.568891,0.000545,6e-06,0.000465,0.902864,0.063773,0.818836,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.048387,0.002445,0.580627,0.000862,0.543958,0.093642,0.000708,0.381803,0.396671,0.273576,0.244384,0.797386,0.703374,0.623777,0.000879,0.001069,0.000134,0.902864,0.063763,0.818836,0
3,0.0,0.0,0.0,0.00075,0.0041,0.001614,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000377,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043312,0.032265,0.744353,0.017589,0.47841,0.129328,0.015698,0.461521,0.471247,0.082033,0.374278,0.686013,0.85453,0.534996,0.008148,0.004465,0.002208,0.902856,0.06382,0.818851,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014464,0.007921,0.582412,0.002715,0.352002,0.118504,0.002214,0.494551,0.541829,0.138432,0.380124,0.827871,0.72397,0.567159,0.00098,0.002154,0.003559,0.902877,0.063867,0.818826,0


# 6. Feature selection

Defining the feature selector

In [8]:
fs = SelectKBest(score_func=f_classif, k=50)
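Conceptually, `SelectKBest` scores every column with `score_func` and keeps the `k` columns with the highest scores. The following sketch checks this on synthetic data from `make_classification`; the dataset, the value `k=5`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic 3-class problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_redundant=0, n_classes=3, random_state=0,
)

selector = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)

# Indices of the 5 largest F scores, sorted for comparison.
top_k = np.sort(np.argsort(selector.scores_)[-5:])

# Indices that the selector reports as kept.
selected = selector.get_support(indices=True)

print("top-k by score:", top_k)
print("selected:      ", selected)
```

After fitting, `selector.scores_` and `selector.pvalues_` can also be used to rank all features, not only the ones that were kept.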

Applying feature selection

In [9]:
np_X = train_df.iloc[:,:-1].to_numpy()
print(np_X.shape)

(18000, 532)


In [10]:
np_Y = train_df.iloc[:,-1].to_numpy()
print(np_Y.shape)

(18000,)


In [11]:
X_selec = fs.fit_transform(np_X, np_Y)

(Cell output truncated in the export: a long run of feature indices, spanning 55 to 482 in the part that survives, with its leading message missing. It is consistent with the warning that scikit-learn's `f_classif` emits when it lists constant features.)
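`f_classif` cannot produce a meaningful score for a column that takes a single value, because both the between-class and within-class variances are zero. The all-zero histogram columns in the previews above suggest that some bins may be constant, although that is an assumption about this dataset. One possible way to handle it, sketched here on made-up data, is to drop zero-variance columns with `VarianceThreshold` before scoring.

```python
import warnings

import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): two classes, three columns.
y_demo = np.repeat([0, 1], 50)
X_demo = np.column_stack([
    rng.normal(loc=y_demo.astype(float)),  # informative column
    rng.normal(size=y_demo.size),          # noise column
    np.zeros(y_demo.size),                 # constant column, e.g. an always-empty bin
])

# Scoring the raw matrix: the constant column gets a non-finite score
# and scikit-learn emits warnings, silenced here for the demonstration.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    raw_scores, _ = f_classif(X_demo, y_demo)

# Drop zero-variance columns first, then keep the single best feature.
pipe = make_pipeline(
    VarianceThreshold(threshold=0.0),
    SelectKBest(score_func=f_classif, k=1),
)
X_reduced = pipe.fit_transform(X_demo, y_demo)

print("raw scores:", raw_scores)
print("reduced shape:", X_reduced.shape)
```

Whether this extra step changes the final selection depends on the data; it mainly keeps the scoring step free of undefined values.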

Visualizing the selected attributes

In [12]:
attr_names = train_df.columns.values.tolist()

In [13]:
# Boolean mask: True where the feature was selected
mask = fs.get_support()
# Names of the K best features
new_features = []

# zip stops at the shorter sequence, so the trailing 'label' name is ignored
for selected, feature in zip(mask, attr_names):
    if selected:
        new_features.append(feature)

new_train_df = pd.DataFrame(X_selec, columns=new_features)
new_train_df['label'] = train_df['label']
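The same result can be obtained without an explicit loop by indexing the column names directly with the boolean mask. This is only a sketch of that alternative on a small made-up DataFrame; `demo_df`, `feature_cols`, and `kept` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Small synthetic frame (illustrative only) with the label in the last column.
demo_df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["a", "b", "c", "d"])
demo_df["label"] = np.tile([0, 1, 2], 20)
demo_df["b"] += demo_df["label"]  # make column "b" depend on the class

feature_cols = demo_df.columns[:-1]  # every column except 'label'
selector = SelectKBest(score_func=f_classif, k=2).fit(
    demo_df[feature_cols].to_numpy(), demo_df["label"].to_numpy()
)

# Boolean indexing on the column Index returns the selected names in order.
kept = feature_cols[selector.get_support()]

# Selecting by name keeps the original index and dtypes of the DataFrame.
reduced_df = demo_df[list(kept) + ["label"]]

print(list(kept))
```

Selecting columns by name from the original DataFrame also avoids rebuilding the frame from a NumPy array.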

In [14]:
new_train_df.head()

Unnamed: 0,histogram_3,histogram_4,histogram_6,histogram_7,histogram_11,histogram_12,histogram_13,histogram_21,histogram_135,histogram_146,histogram_154,histogram_162,histogram_194,histogram_195,histogram_196,histogram_197,histogram_198,histogram_199,histogram_202,histogram_203,histogram_204,histogram_210,histogram_218,histogram_226,histogram_234,histogram_259,histogram_260,histogram_261,histogram_262,histogram_263,histogram_323,histogram_324,histogram_325,histogram_326,histogram_327,hal_0,hal_1,hal_2,hal_3,hal_4,hal_5,hal_6,hal_7,hal_8,hal_9,hal_10,hal_11,hal_12,hum_0,hum_1,label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003104,0.643229,0.765667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02428,0.005422,0.600306,0.001954,0.447481,0.115357,0.001613,0.462514,0.491858,0.177397,0.326815,0.78352,0.764189,0.576694,0.00129,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067486,0.011473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.38062,0.922194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032223,0.004092,0.514067,0.001221,0.42712,0.117643,0.000959,0.40726,0.447951,0.197881,0.302561,0.858441,0.629037,0.568891,0.000545,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02312,0.025536,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00138,0.488279,0.872001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.048387,0.002445,0.580627,0.000862,0.543958,0.093642,0.000708,0.381803,0.396671,0.273576,0.244384,0.797386,0.703374,0.623777,0.000879,0
3,0.00075,0.0041,0.0,0.0,0.0,0.0,0.000377,0.0,0.0,0.0,0.000359,0.0,0.003726,0.004668,0.002105,0.0,0.0,0.0,0.015725,0.010414,0.0,0.038784,0.528966,0.847494,0.0,0.002643,0.001214,0.0,0.0,0.0,0.002234,0.002342,0.0,0.0,0.0,0.043312,0.032265,0.744353,0.017589,0.47841,0.129328,0.015698,0.461521,0.471247,0.082033,0.374278,0.686013,0.85453,0.534996,0.008148,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006844,0.815076,0.579313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014464,0.007921,0.582412,0.002715,0.352002,0.118504,0.002214,0.494551,0.541829,0.138432,0.380124,0.827871,0.72397,0.567159,0.00098,0


In [None]:
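# By default, to_csv also writes the DataFrame index as an unnamed first column;
# pass index=False to omit it.
new_train_df.to_csv(os.path.join(path, 'Eurosat_fs_50_train.csv'))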
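If a held-out split is processed later, a common practice is to reuse the scaler and selector fitted on the training data and call only `transform` on the new data, so that both splits end up with the same columns and scaling. The sketch below illustrates this with synthetic arrays; `X_test`, `scaler_demo`, and `selector_demo` are hypothetical and no test file is referenced in this notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic train/test splits (illustrative only).
y_train = np.repeat([0, 1, 2], 40)
X_train = rng.normal(size=(120, 10))
X_train[:, 0] += y_train  # make the first column informative
X_test = rng.normal(size=(30, 10))

# Fit both steps on the training data only.
scaler_demo = MinMaxScaler().fit(X_train)
selector_demo = SelectKBest(score_func=f_classif, k=3).fit(
    scaler_demo.transform(X_train), y_train
)

# Reuse the fitted objects on the test data: transform, never fit.
X_test_sel = selector_demo.transform(scaler_demo.transform(X_test))

print(X_test_sel.shape)
```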