# Grid search of hyperparameters for SVM or NB with **TFIDF**
### By **Néstor Suat** in 2019

**Description:** Searching for suitable hyperparameters for the **SVM** or **Naive Bayes** model, using **TFIDF** as the embedding.

**Input:**
* Train and Test set
* Hyperparameters

**Output:**
* The best model with parameters
* Metrics: confusion matrix, accuracy, recall, precision and F1-score
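
As a rough illustration of the metrics listed above, the following sketch computes them with scikit-learn. The labels `y_true` and `y_pred` are made-up values, not results from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical labels, only to illustrate the metrics (not results from this notebook).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Rows are the true class, columns are the predicted class.
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
# average="binary" reports the scores of the positive class (label 1).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```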

***

## 0. Loading and cleaning the data

### Importing libraries

Because this notebook lives outside the project root, the relative path to the root (`../../../` in the cell below) has to be added to `sys.path` before the preprocessing library can be imported.

In [2]:
import pandas as pd
import numpy as np
import sys
sys.path.insert(0, '../../../')

from classes.tfidf.preprocessing import Preprocessing as tfidf

### Importing datasets

In [6]:
train = pd.read_csv("../../../data/v1/7030/train70.tsv", delimiter = "\t", quoting = 3)
test = pd.read_csv("../../../data/v1/7030/test30.tsv", delimiter = "\t", quoting = 3)

print(train.shape, test.shape)  # 2662 + 1142 = 3804 rows in total

(2662, 2) (1142, 2)
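
The `quoting = 3` argument used above is the numeric value of `csv.QUOTE_NONE`, which tells pandas to treat quote characters as ordinary text. A minimal sketch of the effect, using a made-up two-row TSV string rather than the project files:

```python
import csv
import io

import pandas as pd

# Hypothetical TSV content: the first tweet starts with an unbalanced quote.
raw = 'text\tlabel\n"Accidente en la vía\t1\nTráfico normal\t0\n'

# quoting=csv.QUOTE_NONE (== 3) keeps the quote as a literal character,
# so every physical line stays a separate row.
df = pd.read_csv(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)
print(df)
```

This matters for tweets, where stray quotation marks are common and would otherwise be interpreted as field delimiters.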


In [7]:
train.head()

| | text | label |
|---|---|---|
| 0 | 📢#Atención: se presenta siniestro vial entre u... | 1 |
| 1 | 📢#Atención: a esta hora se presentan disturbio... | 0 |
| 2 | Incidente vial entre taxi 🚖 y‍ motocicleta 🏍️ ... | 1 |
| 3 | @chemabernal @Moniva0517 @MartinSantosR La grá... | 0 |
| 4 | RT @CaracolRadio: #CaracolEsMás \| ¡Atención! F... | 1 |


In [8]:
test.head()

| | text | label |
|---|---|---|
| 0 | ¿Cómo se encuentra el tráfico en la ciudad? 🔴 ... | 0 |
| 1 | RT @GuavioNoticias: 🚨En horas de la madrugada ... | 1 |
| 2 | Incidente vial entre moto 🏍️ y taxi 🚕, en la ... | 1 |
| 3 | Los supervisores de Asobel prestaron apoyo en ... | 1 |
| 4 | Paso a un carril en la vía Bogotá-Villavicenci... | 1 |


### Preprocessing

In [18]:
type_clean = 4  # Must match the 'file' prefix
# TFIDF
max_df = 0.5
max_features = None
min_df = 0.001
ngram_range = (1, 2)
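
To make the effect of these TFIDF settings concrete, here is a small sketch that passes the same values directly to scikit-learn's `TfidfVectorizer`. It assumes the project's `Preprocessing` class forwards them to a vectorizer of this kind, which the source does not confirm, and the corpus is a made-up example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the tweets used in this notebook.
corpus = [
    "incidente vial entre taxi y moto",
    "incidente vial en la autopista",
    "trafico lento en la ciudad",
    "lluvia fuerte en la ciudad",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    max_df=0.5,          # drop terms found in more than 50% of the documents
    min_df=0.001,        # drop terms found in fewer than 0.1% of the documents
    max_features=None,   # no cap on vocabulary size
)
X_tfidf = vectorizer.fit_transform(corpus)
vocabulary = vectorizer.vocabulary_

# "en", "la" and "en la" occur in 3 of 4 documents, so max_df removes them.
print(sorted(vocabulary))
print(X_tfidf.shape)
```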

In [10]:
clean = tfidf(train)
clean.fit_clean(type_clean)
train.head()

| | text | label | clean |
|---|---|---|---|
| 0 | 📢#Atención: se presenta siniestro vial entre u... | 1 | atención se presenta siniestro vial entre... |
| 1 | 📢#Atención: a esta hora se presentan disturbio... | 0 | atención a esta hora se presentan disturb... |
| 2 | Incidente vial entre taxi 🚖 y‍ motocicleta 🏍️ ... | 1 | incidente vial entre taxi y motocicleta ... |
| 3 | @chemabernal @Moniva0517 @MartinSantosR La grá... | 0 | la gráfica dice que la deuda lp como ... |
| 4 | RT @CaracolRadio: #CaracolEsMás \| ¡Atención! F... | 1 | rt caracol esmás atención fuerte ac... |


In [11]:
clean = tfidf(test)
clean.fit_clean(type_clean)
test.head()

| | text | label | clean |
|---|---|---|---|
| 0 | ¿Cómo se encuentra el tráfico en la ciudad? 🔴 ... | 0 | cómo se encuentra el tráfico en la ciudad ... |
| 1 | RT @GuavioNoticias: 🚨En horas de la madrugada ... | 1 | rt en horas de la madrugada se presenta u... |
| 2 | Incidente vial entre moto 🏍️ y taxi 🚕, en la ... | 1 | incidente vial entre moto y taxi en ... |
| 3 | Los supervisores de Asobel prestaron apoyo en ... | 1 | los supervisores de asobel prestaron apoyo en ... |
| 4 | Paso a un carril en la vía Bogotá-Villavicenci... | 1 | paso a un carril en la vía bogotá villavicenci... |


In [12]:
# Drop posts whose cleaned text is null, since they add no value after the cleaning step
train = train[~train['clean'].isnull()]
test = test[~test['clean'].isnull()]
print(train.shape, test.shape)  # still 3804 rows in total: no rows were dropped

(2662, 3) (1142, 3)
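
In this run the shapes are unchanged, so no post ended up with a null `clean` value. The sketch below shows what the filter does when one does, using a made-up three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second row lost all of its text during cleaning.
df = pd.DataFrame({
    "text": ["Incidente vial", "🚖🏍️", "Tráfico lento"],
    "label": [1, 0, 0],
    "clean": ["incidente vial", np.nan, "tráfico lento"],
})

# notna() selects the same rows as ~isnull(), which is the form used above.
kept = df[df["clean"].notna()]

print(df.shape, "->", kept.shape)
```

Note that the filter keeps the original index labels, so `kept` has a gap in its index unless `reset_index(drop=True)` is applied afterwards.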


### Train & Test set

In [13]:
X, y = train.clean, train.label
X_test, y_test = test.clean, test.label

## 1. GridSearchCV

### Support Vector Machine

1.1. Importing libraries

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

from pprint import pprint
from time import time
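
The imports above point to a TF-IDF plus classifier pipeline tuned with `GridSearchCV`. A minimal sketch of that idea for the SVM case follows. The corpus, the labels, and the parameter grid are hypothetical and chosen only so the example runs quickly; they are not the values used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus and labels, only for illustration.
texts = [
    "incidente vial entre taxi y moto",
    "choque de bus en la autopista",
    "accidente de transito en la avenida",
    "siniestro vial con moto y camion",
    "disturbios en el centro de la ciudad",
    "lluvia fuerte esta tarde",
    "la deuda publica sigue creciendo",
    "concierto gratuito en el parque",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so validation folds never influence the vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Step names prefix the parameter names: "<step>__<parameter>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(len(search.cv_results_["params"]), "combinations evaluated")
```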

## 1. Random Search Training

In [19]:
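# Caution: `clean` was last assigned to tfidf(test) in cell In [11], so this call
# appears to build the embedding from the test split; use tfidf(train) if the
# training split is intended.
embedding, vectorizer = clean.feature_extraction(ngram_range=ngram_range, max_df=max_df, min_df=min_df, max_features=max_features)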

In [20]:
X_train = embedding[:,1:]
X_train=X_train.astype('float')

y_train = embedding[:,0]
y_train=y_train.astype('int')
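
The slicing above implies that `embedding` is a 2-D array whose first column is the label and whose remaining columns are the TFIDF weights. That layout is inferred from the code, not documented in the source. A small sketch with a made-up array:

```python
import numpy as np

# Hypothetical array laid out as the slicing above implies:
# column 0 holds the label, the remaining columns hold TF-IDF weights.
embedding = np.array([
    [1, 0.0, 0.7, 0.3],
    [0, 0.5, 0.0, 0.5],
    [1, 0.2, 0.8, 0.0],
])

X_train = embedding[:, 1:].astype(float)  # features: every column but the first
y_train = embedding[:, 0].astype(int)     # labels: the first column, as integers

print(X_train.shape, y_train)
```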

In [21]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# Note: 'auto' behaved like 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_features = ['auto', 'sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 7]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


In [5]:
random_grid

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
 'max_features': ['auto', 'sqrt', 'log2'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4, 7],
 'bootstrap': [True, False]}
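
To see why a randomized search is used here instead of an exhaustive one, the sketch below counts the combinations in this grid with scikit-learn's `ParameterGrid`. The grid is rebuilt inline so the block is self-contained; `n_iter` and `cv` mirror the values used in the search cell of this notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as above, rebuilt so this block runs on its own.
random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4, 7],
    "bootstrap": [True, False],
}

total = len(ParameterGrid(random_grid))  # 10 * 3 * 12 * 3 * 4 * 2
n_iter, cv = 100, 3

print("combinations in the full grid:", total)
print("fits for an exhaustive search:", total * cv)
print("fits for the randomized search:", n_iter * cv)
print(f"share of the grid sampled: {n_iter / total:.2%}")
```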

In [22]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=100, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 11.9min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None

In [23]:
rf_random.best_params_

{'n_estimators': 1600,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 50,
 'bootstrap': False}
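
Once a search has finished, the tuned model is available as `best_estimator_` and can be scored on held-out data. The following is a minimal sketch of that step on synthetic data with a deliberately small grid. The data, the grid, and the variable names are assumptions for illustration; it does not reproduce the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the TF-IDF matrix (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100, stratify=y_demo
)

# A deliberately small grid so the example finishes in seconds.
small_grid = {
    "n_estimators": [10, 25, 50],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=100),
    param_distributions=small_grid,
    n_iter=5,
    cv=3,
    random_state=100,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr
# using best_params_, so it can be used directly for prediction.
best_rf = search.best_estimator_
y_pred = best_rf.predict(X_te)

print(search.best_params_)
print(classification_report(y_te, y_pred))
```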

1.2. Configuring the file where the results will be saved (info)

**1.3. Starting model training**

### Support Vector Machine (**SVM**) Model

### Naive Bayes (**NB**) Model

### Random Forest (**RF**) Model

### Gridsearch