# Grid search of SVM hyperparameters with Doc2vec
### By **Néstor Suat** in 2019

**Description:** Searching for suitable hyperparameters for the **SVM** model using **Doc2vec** as the embedding.

**Input:**
* Train and Test set
* Doc2vec model (DBOW or DMM or both of them)
* Hyperparameters

**Output:**
* The best model with parameters
* Metrics: confusion matrix, accuracy, recall, precision and F1-score

***

## 0. Loading and cleaning data

### Importing libraries

Because this notebook lives outside the project root, the root directory (`../../../` relative to this file) must be added to the import path so that the preprocessing library can be imported.

In [1]:
import pandas as pd
import numpy as np
import sys
sys.path.insert(0, '../../../')

from classes.doc2vec.preprocessing import Preprocessing as doc2vec

### Importing datasets

In [2]:
train = pd.read_csv("../../../data/v1/7030/train70.tsv", delimiter = "\t", quoting = 3)
train['dataset'] = 99 # marker: 99 = train
test = pd.read_csv("../../../data/v1/7030/test30.tsv", delimiter = "\t", quoting = 3)
test['dataset'] = 100 # marker: 100 = test
dataset = pd.concat([train,test])
dataset = dataset.reset_index(drop=True)
print(dataset.shape) # (3804, 3)
dataset.head(5)

(3804, 3)


| | text | label | dataset |
|---|---|---|---|
| 0 | 📢#Atención: se presenta siniestro vial entre u... | 1 | 99 |
| 1 | 📢#Atención: a esta hora se presentan disturbio... | 0 | 99 |
| 2 | Incidente vial entre taxi 🚖 y‍ motocicleta 🏍️ ... | 1 | 99 |
| 3 | @chemabernal @Moniva0517 @MartinSantosR La grá... | 0 | 99 |
| 4 | RT @CaracolRadio: #CaracolEsMás \| ¡Atención! F... | 1 | 99 |


### Preprocessing

In [3]:
#Preprocessing
#directory = "../../../data/v1/doc2vec/"
directory = "../../../data/v1/doc2vec/v2/"
file = "5_clean_stem_dataset_propuesta1_5050"
type_clean = 5 # Must match the numeric prefix of 'file'

In [4]:
clean = doc2vec(dataset)
clean.fit_clean(type_clean)

embendding = clean.feature_extraction_dbow(directory, file)

### Train & Test set
The train and test sets were merged for preprocessing; here they are split again using the dataset marker (99 = train, 100 = test).

In [5]:
vecs_train = embendding[embendding[:,0] == 99.0,:] #train = 99
vecs_test = embendding[embendding[:,0] == 100.0,:] #test = 100

X_train = vecs_train[:,2:]
y_train = vecs_train[:,1]
X_test = vecs_test[:,2:]
y_test = vecs_test[:,1]

In [6]:
print("Size vecs_train", vecs_train.shape)
print("Size vecs_test", vecs_test.shape)
print("Size: \n * X_train: %s \n * y_train: %s \n * X_test: %s \n * y_test: %s" % (X_train.shape, y_train.shape, X_test.shape, y_test.shape))

Size vecs_train (2662, 202)
Size vecs_test (1142, 202)
Size: 
 * X_train: (2662, 200) 
 * y_train: (2662,) 
 * X_test: (1142, 200) 
 * y_test: (1142,)
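
The split above depends on the layout of the array returned by `feature_extraction_dbow`: judging from the indexing, column 0 holds the dataset marker, column 1 the label, and the remaining 200 columns the document vector. A minimal sketch of the same slicing on a toy array (the values and the 3-dimensional vectors are made up for illustration):

```python
import numpy as np

# Toy stand-in for the embedding matrix: [marker, label, v0, v1, v2]
toy = np.array([
    [99.0, 1.0, 0.1, 0.2, 0.3],
    [99.0, 0.0, 0.4, 0.5, 0.6],
    [100.0, 1.0, 0.7, 0.8, 0.9],
])

toy_train = toy[toy[:, 0] == 99.0]   # rows marked as train
toy_test = toy[toy[:, 0] == 100.0]   # rows marked as test

# Features start at column 2; column 1 is the label
X_toy_train, y_toy_train = toy_train[:, 2:], toy_train[:, 1].astype(int)
X_toy_test, y_toy_test = toy_test[:, 2:], toy_test[:, 1].astype(int)

print(X_toy_train.shape, y_toy_train.shape, X_toy_test.shape, y_toy_test.shape)
```

Casting the labels to `int` is optional here; it only avoids carrying float class labels such as `1.0` into the classifier reports.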


## 0. Importing libraries

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from pprint import pprint
from time import time

## 1. Random Search Training

## 2. GridSearchCV

### Support Vector Machine
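
The SVM search itself does not appear in this part of the notebook. As a rough sketch of what it could look like, the example below tunes `C`, `kernel`, and `gamma` of an `SVC` inside a `Pipeline` and then reports the metrics listed in the introduction. The synthetic data from `make_classification` and the grid values are assumptions chosen for illustration, not the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic binary data standing in for the Doc2vec vectors (assumption)
X_svm, y_svm = make_classification(n_samples=200, n_features=20, random_state=100)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_svm, y_svm, test_size=0.3, random_state=100, stratify=y_svm)

svm_pipeline = Pipeline([('clf', SVC())])

# Illustrative grid; the 'clf__' prefix routes each parameter to the pipeline step
svm_parameters = {
    'clf__C': [0.1, 1, 10],
    'clf__kernel': ['linear', 'rbf'],
    'clf__gamma': ['scale', 0.01],
}

svm_search = GridSearchCV(svm_pipeline, svm_parameters, scoring='f1', cv=3)
svm_search.fit(Xs_train, ys_train)

# predict() uses the best estimator, refit on the whole training split
ys_pred = svm_search.predict(Xs_test)
svm_confusion = confusion_matrix(ys_test, ys_pred)

print(svm_search.best_params_)
print(svm_confusion)
print(classification_report(ys_test, ys_pred))
```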

**2.2. Configuring the log file where the results (info) will be saved**

In [None]:
import logging  # Logging setup to record the grid search progress in a file

logger = logging.getLogger("gridsearch")
hdlr = logging.FileHandler("gridsearch_doc2vec.log")
formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
hdlr.setFormatter(formatter)
logger.addHandler(hdlr)
logger.setLevel(logging.INFO)
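
Re-running the logging cell in a notebook attaches a second `FileHandler` to the same logger, so every message ends up written twice. One possible guard, sketched here with a hypothetical helper `get_file_logger` (not part of the original notebook), is to add the handler only when the logger has none:

```python
import logging
import os
import tempfile

def get_file_logger(name, path):
    """Return a logger that writes to `path`, adding the handler only once."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # skip if an earlier run of the cell already configured it
        handler = logging.FileHandler(path)
        handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
        logger.addHandler(handler)
    return logger

# Demo path in the temp directory; the notebook itself logs to gridsearch_doc2vec.log
log_path = os.path.join(tempfile.gettempdir(), "gridsearch_doc2vec_demo.log")
demo_logger = get_file_logger("gridsearch_demo", log_path)
demo_logger = get_file_logger("gridsearch_demo", log_path)  # simulates re-running the cell
demo_logger.info("grid search started")
```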

**2.3. Starting model training**

### Grid Search Random Forest

In [None]:
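logger.info("#####Starting model training######")
logger.info(__doc__)
pipeline = Pipeline([
  # max_features='auto' was deprecated and later removed in scikit-learn;
  # for classifiers it was equivalent to 'sqrt'
  ('clf', RandomForestClassifier(random_state=100, bootstrap=False, max_features='sqrt'))
])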

"""parameters = {'clf__n_estimators': [500, 600, 800],
               'clf__max_features': ['log2', 'auto'],
               'clf__max_depth': [30, 40, 70, 100, None],
               'clf__min_samples_split': [2, 5, 10],
               'clf__min_samples_leaf': [1, 2, 4],
               'clf__bootstrap': [True, False],
             }  """
parameters = {'clf__n_estimators': [600,1000, 1200, 1600, 2000],                             
               'clf__max_depth': [40, 50, 100, 110],
               'clf__min_samples_split': [2, 4, 5, 10],
               'clf__min_samples_leaf': [1, 2, 4],               
             }    

scores = ['accuracy', 'f1']  
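
The cell above only defines the pipeline, the grid, and the metric names; the search itself is not shown. A minimal sketch of how such a grid could be run once per metric follows. The synthetic data from `make_classification`, the deliberately tiny grid, and the `demo_`-prefixed names are assumptions made so the example runs in seconds, not the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary data standing in for the Doc2vec vectors (assumption)
X_demo, y_demo = make_classification(n_samples=120, n_features=20, random_state=100)

demo_pipeline = Pipeline([
    ('clf', RandomForestClassifier(random_state=100, bootstrap=False)),
])

# Tiny grid for illustration only
demo_parameters = {
    'clf__n_estimators': [10, 20],
    'clf__max_depth': [5, None],
}

best_by_score = {}
for score in ['accuracy', 'f1']:
    # One exhaustive search per metric, each with 3-fold cross-validation
    search = GridSearchCV(demo_pipeline, demo_parameters, scoring=score, cv=3)
    search.fit(X_demo, y_demo)
    best_by_score[score] = (search.best_score_, search.best_params_)
    print("%s: best CV score %.3f with %s" % (score, search.best_score_, search.best_params_))
```

With the full grid defined in the notebook (5 × 4 × 4 × 3 = 240 combinations per metric, each fitted once per fold), passing `n_jobs=-1` to `GridSearchCV` would spread the fits across CPU cores.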

In [None]:
{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
 'max_features': ['auto', 'sqrt', 'log2'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4, 7],
 'bootstrap': [True, False]}
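
The dictionary above spans 10 × 3 × 12 × 3 × 4 × 2 = 8,640 combinations, a space that is usually sampled with `RandomizedSearchCV` rather than enumerated in full. The sketch below shows that approach on a reduced version of the grid; the synthetic data, the smaller value lists, and `n_iter=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary data standing in for the Doc2vec vectors (assumption)
X_rand, y_rand = make_classification(n_samples=120, n_features=20, random_state=100)

# Reduced search space with the same keys as the dictionary above
random_grid = {
    'n_estimators': [10, 20, 30],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=100),
    param_distributions=random_grid,
    n_iter=5,          # number of sampled combinations
    scoring='f1',
    cv=3,
    random_state=100,  # makes the sampled combinations reproducible
)
random_search.fit(X_rand, y_rand)
print(random_search.best_params_)
```

The best region found this way can then be refined with a narrower `GridSearchCV`.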