[View in Colaboratory](https://colab.research.google.com/github/Kneeplay/Classification_RandomForest/blob/master/SPY_Random_Forest_clasificador.ipynb)

Loading data from a CSV file in Google Colaboratory:

In [6]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving SPYV3-18VAR.csv to SPYV3-18VAR.csv
User uploaded file "SPYV3-18VAR.csv" with length 426316 bytes


Load the dataset into a DataFrame named spy, then display the header and the first 5 rows:

In [25]:
import pandas as pd
import io
# Read only the target (CLASIFICADOR) and the 18 selected predictor columns.
cols = ['CLASIFICADOR','1','31','42','46','47','48','60','68','76','77',
        '93','171','173','191','221','225','237','FECHA.month']
spy = pd.read_csv(io.StringIO(uploaded[fn].decode('utf-8')), sep=',', usecols=cols)

spy.head()

Unnamed: 0,CLASIFICADOR,1,31,42,46,47,48,60,68,76,77,93,171,173,191,221,225,237,FECHA.month
0,1,2.17,141.82,0.19,1.2847,27.33,1.3908,-58.5,0.97,15.01,-0.34,2.4,-0.08,10.96,9.48,23.43,1039427977,4.7209,3
1,1,2.16,142.01,0.19,1.2799,26.03,1.3792,-43.7,0.95,0.0,-0.07,7.6,-0.07,10.96,9.48,23.49,544544364,1.6001,3
2,1,2.13,141.98,0.19,1.2831,25.5,1.3709,-28.2,0.99,15.09,-0.07,0.2,-0.07,10.99,9.48,22.92,507977208,-1.4107,3
3,1,2.1,141.97,0.19,1.2845,24.9,1.3661,-14.2,0.99,14.9,-0.02,-1.0,-0.06,10.97,9.48,22.75,1330374457,-0.2309,3
4,1,2.07,141.98,0.19,1.2945,24.71,1.3641,-10.8,0.97,14.38,0.21,-0.5,-0.06,10.95,9.48,22.75,977047937,-0.8835,3
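
As a minimal, self-contained sketch of the loading step: `files.upload()` returns a dict that maps each file name to its raw bytes, so the bytes are decoded and wrapped in `io.StringIO` before `pd.read_csv` reads them. The tiny CSV, its `unused` column, and the file name below are made up for illustration; only the decode and `usecols` pattern mirrors the notebook.

```python
import io

import pandas as pd

# Stand-in for the dict returned by files.upload(): file name -> raw bytes.
# The file name and its contents are hypothetical.
uploaded = {
    "example.csv": b"CLASIFICADOR,1,31,unused\n1,2.17,141.82,x\n0,2.16,142.01,y\n"
}

# Pick the file explicitly rather than relying on a leftover loop variable.
name = "example.csv"
text = uploaded[name].decode("utf-8")

# usecols keeps only the listed columns; the others are never loaded.
df = pd.read_csv(io.StringIO(text), sep=",", usecols=["CLASIFICADOR", "1", "31"])
print(df.head())
```

Selecting the file by name avoids silently reading the wrong file when more than one is uploaded.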


Unused attributes are dropped and attributes categorized by letters are factorized:
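
The notebook shows no code for this step (the `usecols` argument above already restricts which columns are read). A minimal sketch of how it could look with pandas, assuming hypothetical `rating` and `notes` columns that are not part of the dataset:

```python
import pandas as pd

# Hypothetical frame: `rating` is letter-coded and `notes` is not needed.
df = pd.DataFrame({
    "CLASIFICADOR": [1, 0, 1, 1],
    "rating": ["B", "A", "B", "C"],
    "notes": ["w", "x", "y", "z"],
})

# Drop attributes that will not be used as predictors.
df = df.drop(columns=["notes"])

# factorize maps each distinct letter to an integer code,
# numbered in order of first appearance.
codes, uniques = pd.factorize(df["rating"])
df["rating"] = codes

print(df)
print(list(uniques))
```

The integer codes carry no ordering information, which is usually acceptable for tree-based models such as Random Forest.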

Splitting the dataset into train and test sets:

In [26]:
p_train = 0.80 # Fraction of rows used for training. Change it to obtain different splits.

train = spy[:int((len(spy))*p_train)]
test = spy[int((len(spy))*p_train):]

print("Examples used for training: ", len(train))
print("Examples used for testing: ", len(test))
print("\n")

features = spy.columns[1:]
x_train = train[features]
y_train = train['CLASIFICADOR']

x_test = test[features]

Examples used for training:  2273
Examples used for testing:  569
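
The split is positional rather than random: the first 80% of the rows go to train and the remainder to test, so row order is preserved. A minimal sketch of the same slicing in plain Python, with a hypothetical helper `ordered_split` and placeholder rows standing in for the DataFrame:

```python
def ordered_split(rows, p_train):
    """Split a sequence into a leading train part and a trailing test part."""
    cut = int(len(rows) * p_train)  # int() truncates toward zero
    return rows[:cut], rows[cut:]


# 2842 placeholder rows, matching the 2273 + 569 examples reported above.
rows = list(range(2842))
train_rows, test_rows = ordered_split(rows, 0.80)

print(len(train_rows), len(test_rows))
```

Because both slices use the same cut index, every row lands in exactly one of the two sets.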




Using RandomizedSearchCV to find the best hyperparameter settings:

In [27]:
import numpy as np
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Training data
X, y = x_train, y_train

# Build the classifier
clf = RandomForestClassifier(n_estimators=512, n_jobs=-1)


# Helper that prints the top-ranked results
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# Parameters and distributions to sample from
param_dist = {"max_depth": [12,11,10,9,8,7,6,5,4,3,2,None],
              "max_features": sp_randint(1, 18),
              "min_samples_split": sp_randint(2, 95),
              "min_samples_leaf": sp_randint(1, 95),
              "bootstrap": [True, False], 'class_weight':['balanced', None],
              "criterion": ["gini", "entropy"]}

# Run the search
n_iter_search = 80
random_search = RandomizedSearchCV(clf, scoring= 'f1', param_distributions=param_dist,
                                   n_iter=n_iter_search)

random_search.fit(X, y)
report(random_search.cv_results_)




Model with rank: 1
Mean validation score: 0.860 (std: 0.001)
Parameters: {'bootstrap': False, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 3, 'max_features': 7, 'min_samples_leaf': 53, 'min_samples_split': 91}

Model with rank: 2
Mean validation score: 0.860 (std: 0.007)
Parameters: {'bootstrap': True, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 3, 'max_features': 15, 'min_samples_leaf': 83, 'min_samples_split': 11}

Model with rank: 3
Mean validation score: 0.860 (std: 0.000)
Parameters: {'bootstrap': False, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 3, 'max_features': 5, 'min_samples_leaf': 46, 'min_samples_split': 39}

Model with rank: 3
Mean validation score: 0.860 (std: 0.000)
Parameters: {'bootstrap': False, 'class_weight': None, 'criterion': 'gini', 'max_depth': 2, 'max_features': 17, 'min_samples_leaf': 93, 'min_samples_split': 74}

Model with rank: 3
Mean validation score: 0.860 (std: 0.000)
Parameters: {'bootstrap': True, 'class_we
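
Unlike an exhaustive grid, the randomized search evaluates only `n_iter` candidate settings (80 here), each drawn from `param_dist`. The sketch below imitates that drawing step in plain Python to show which values are reachable. It is an illustration, not scikit-learn's implementation, and `space`, `sample_candidate`, and the seed are assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the draw is repeatable

# Mirror of param_dist: lists are sampled uniformly, and sp_randint(low, high)
# draws integers from low up to, but not including, high.
space = {
    "max_depth": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, None],
    "max_features": range(1, 18),
    "min_samples_split": range(2, 95),
    "min_samples_leaf": range(1, 95),
    "bootstrap": [True, False],
    "class_weight": ["balanced", None],
    "criterion": ["gini", "entropy"],
}


def sample_candidate(space, rng):
    """Draw one value per hyperparameter, independently of the others."""
    return {name: rng.choice(list(values)) for name, values in space.items()}


candidates = [sample_candidate(space, rng) for _ in range(80)]
print(candidates[0])
```

With 18 predictors, the upper bound of 18 on `max_features` means at most 17 features are considered per split.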

Creating the Random Forest model with the parameters obtained:

In [0]:
clf_rf = RandomForestClassifier(n_estimators = 1024, criterion = 'entropy', 
                                max_depth = 3, max_features = 7, 
                                min_samples_leaf = 53, min_samples_split = 91, 
                                bootstrap=False, oob_score=False, n_jobs=-1, 
                                class_weight=None)

clf_rf.fit(x_train, y_train) # Build the model

preds_rf = clf_rf.predict(x_test) # Predict on the test set

Displaying the results:

In [24]:
from sklearn.metrics import classification_report

print("Random Forest: \n" 
      +classification_report(y_true=test['CLASIFICADOR'], y_pred=preds_rf))

# Confusion matrix

print("Confusion matrix:\n")
matriz = pd.crosstab(test['CLASIFICADOR'], preds_rf, rownames=['actual'], colnames=['preds'])
print(matriz)

Random Forest: 
             precision    recall  f1-score   support

          0       0.19      0.27      0.22        90
          1       0.85      0.78      0.82       479

avg / total       0.75      0.70      0.72       569

Confusion matrix:

preds     0    1
actual          
0        24   66
1       103  376
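
The per-class figures in the report can be cross-checked by hand against the confusion matrix. The sketch below recomputes precision, recall, and F1 from the four counts shown above; `class_metrics` is a hypothetical helper, not part of scikit-learn.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted.
matrix = {
    0: {0: 24, 1: 66},
    1: {0: 103, 1: 376},
}


def class_metrics(matrix, label):
    """Return (precision, recall, f1) for one class."""
    tp = matrix[label][label]
    predicted = sum(row[label] for row in matrix.values())  # column total
    actual = sum(matrix[label].values())                     # row total
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for label in (0, 1):
    p, r, f = class_metrics(matrix, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))

total = sum(sum(row.values()) for row in matrix.values())
accuracy = (matrix[0][0] + matrix[1][1]) / total
print("accuracy", round(accuracy, 2))
```

The counts make the weak spot visible: 66 of the 90 actual class-0 examples are predicted as class 1, which is why recall for class 0 is so low.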


Feature importances:

In [30]:
print("Feature importances:\n")
print(pd.DataFrame({'Indicator': features ,
              'Importance': clf_rf.feature_importances_}),"\n")
print("Maximum RF importance:" , max(clf_rf.feature_importances_), "\n")

Feature importances:

      Indicator  Importance
0             1    0.021092
1            31    0.259254
2            42    0.043690
3            46    0.027469
4            47    0.018996
5            48    0.159873
6            60    0.000250
7            68    0.021149
8            76    0.179013
9            77    0.006010
10           93    0.015967
11          171    0.066723
12          173    0.074541
13          191    0.034626
14          221    0.020436
15          225    0.000201
16          237    0.000068
17  FECHA.month    0.050642 

Maximum RF importance: 0.2592539658376424 
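
The table is easier to read once sorted. The sketch below ranks the printed importances and accumulates them, using the values copied from the output above; the variable names are assumptions made for illustration.

```python
# Importances copied from the output above, keyed by feature name.
importances = {
    "1": 0.021092, "31": 0.259254, "42": 0.043690, "46": 0.027469,
    "47": 0.018996, "48": 0.159873, "60": 0.000250, "68": 0.021149,
    "76": 0.179013, "77": 0.006010, "93": 0.015967, "171": 0.066723,
    "173": 0.074541, "191": 0.034626, "221": 0.020436, "225": 0.000201,
    "237": 0.000068, "FECHA.month": 0.050642,
}

# Sort from most to least important.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

# Print the top five with a running total.
cumulative = 0.0
for name, value in ranked[:5]:
    cumulative += value
    print(f"{name:>12}  {value:.6f}  cumulative {cumulative:.3f}")
```

The printed values sum to 1, so each one can be read as a share of the total: indicators 31, 76, and 48 together account for roughly 60% of it, while 60, 225, and 237 contribute almost nothing.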

