# Grid search of hyperparameters for SVM or NB with **TF-IDF**
### By **Néstor Suat** in 2019

**Description:** Searching for suitable hyperparameters for the **SVM** or **Naive Bayes** model, using **TF-IDF** as the embedding.

**Input:**
* Train and Test set
* Hyperparameters

**Output:**
* The best model with parameters
* Metrics: confusion matrix, accuracy, recall, precision and F1-score
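
The metrics listed above can all be computed with `sklearn.metrics`. The following is a minimal sketch on hypothetical toy labels (`y_true` and `y_pred` here are made up for illustration and are not the notebook's data):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical toy labels, used only to illustrate the calls
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)

# Each score treats label 1 as the positive class (the scikit-learn default)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(cm)
print(accuracy, precision, recall, f1)
```

With 3 true positives, 3 true negatives, 1 false positive and 1 false negative, all four scores equal 0.75 in this toy case.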

***

## 0. Loading and cleaning the data

### Importing libraries

Because this notebook sits outside the project root, the root directory (`../../../` here) must be added to `sys.path` before the preprocessing library can be imported.

In [1]:
import pandas as pd

import sys
sys.path.insert(0, '../../../')

from classes.tfidf.preprocessing import Preprocessing as tfidf

### Loading the datasets

In [2]:
train = pd.read_csv("../../../data/v1/7030/train70.tsv", delimiter = "\t", quoting = 3)
test = pd.read_csv("../../../data/v1/7030/test30.tsv", delimiter = "\t", quoting = 3)

print(train.shape, test.shape)  # 2662 + 1142 = 3804 rows in total

(2662, 2) (1142, 2)


In [3]:
train.head()

Unnamed: 0,text,label
0,📢#Atención: se presenta siniestro vial entre u...,1
1,📢#Atención: a esta hora se presentan disturbio...,0
2,Incidente vial entre taxi 🚖 y motocicleta 🏍️ ...,1
3,@chemabernal @Moniva0517 @MartinSantosR La grá...,0
4,RT @CaracolRadio: #CaracolEsMás | ¡Atención! F...,1


In [4]:
test.head()

Unnamed: 0,text,label
0,¿Cómo se encuentra el tráfico en la ciudad? 🔴 ...,0
1,RT @GuavioNoticias: 🚨En horas de la madrugada ...,1
2,"Incidente vial entre moto 🏍️ y taxi 🚕, en la ...",1
3,Los supervisores de Asobel prestaron apoyo en ...,1
4,Paso a un carril en la vía Bogotá-Villavicenci...,1


### Preprocessing

In [5]:
type_clean = 6  # Must be the same as 'file' (prefix)

In [6]:
clean = tfidf(train)
clean.fit_clean(type_clean)
train.head()

Unnamed: 0,text,label,clean
0,📢#Atención: se presenta siniestro vial entre u...,1,atención se presentar siniestro vial entr...
1,📢#Atención: a esta hora se presentan disturbio...,0,atención a este hora se presentar disturb...
2,Incidente vial entre taxi 🚖 y motocicleta 🏍️ ...,1,incidente vial entrar taxi y motocicleta ...
3,@chemabernal @Moniva0517 @MartinSantosR La grá...,0,lo gráfico decir que lo deuda lp comer ...
4,RT @CaracolRadio: #CaracolEsMás | ¡Atención! F...,1,rt caracol esmás atención fuerte ac...


In [7]:
clean = tfidf(test)
clean.fit_clean(type_clean)
test.head()

Unnamed: 0,text,label,clean
0,¿Cómo se encuentra el tráfico en la ciudad? 🔴 ...,0,cómo se encontrar el tráfico en lo ciudad ...
1,RT @GuavioNoticias: 🚨En horas de la madrugada ...,1,rt en hora de lo madrugar se presentar un...
2,"Incidente vial entre moto 🏍️ y taxi 🚕, en la ...",1,incidente vial entrar moto y taxi en...
3,Los supervisores de Asobel prestaron apoyo en ...,1,lo supervisor de asobel prestar apoyar en uno ...
4,Paso a un carril en la vía Bogotá-Villavicenci...,1,pasar a uno carril en lo vía bogotá villavicen...


In [8]:
# Drop posts whose cleaned text is null: they carry no information after cleaning
train = train[train['clean'].notnull()]
test = test[test['clean'].notnull()]
print(train.shape, test.shape)  # still 2662 + 1142 = 3804 rows: nothing was dropped

(2662, 3) (1142, 3)


### Train & Test set

In [9]:
X, y = train.clean, train.label
X_test, y_test = test.clean, test.label

## 1. GridSearchCV

### Support Vector Machine

1.1. Importing libraries

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

from pprint import pprint
from time import time

1.2. Configuring the log file where the results (info) will be saved

In [11]:
import logging  # Set up logging to record the grid search results in a file

logger = logging.getLogger("gridsearch")
hdlr = logging.FileHandler("gridsearch_tfidf.log")
formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
hdlr.setFormatter(formatter)
logger.addHandler(hdlr)
logger.setLevel(logging.INFO)
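
Re-running a logging setup cell in a notebook attaches another handler to the same logger, so every message ends up written more than once. A minimal sketch of one way to guard against that follows; the helper name `get_file_logger`, the logger name `gridsearch_demo` and the temporary log path are assumptions made for illustration:

```python
import logging
import os
import tempfile


def get_file_logger(name, path):
    """Return a logger writing to `path`, attaching the file handler only once."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Skip the setup when a handler is already attached (e.g. the cell was re-run)
    if not logger.handlers:
        handler = logging.FileHandler(path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Hypothetical log location in a temporary directory, for illustration only
log_path = os.path.join(tempfile.mkdtemp(), "gridsearch_demo.log")

first = get_file_logger("gridsearch_demo", log_path)
second = get_file_logger("gridsearch_demo", log_path)  # simulates re-running the cell
second.info("logged once")

print(len(second.handlers))
```

Because `logging.getLogger` returns the same object for the same name, checking `logger.handlers` is enough to make the setup idempotent.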

**1.3. Training the model**

### Support Vector Machine (**SVM**) Model

### Naive Bayes (**NB**) Model

In [12]:
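# Works around the error:
# "A sparse matrix was passed, but dense data is required.
#  Use X.toarray() to convert to a dense numpy array"
# GaussianNB needs dense input, while TfidfVectorizer returns a sparse matrix.
from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(TransformerMixin, BaseEstimator):
    """Convert the sparse TF-IDF matrix into a dense NumPy array."""

    def fit(self, X, y=None, **fit_params):
        # Stateless: nothing to learn
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns an ndarray; todense() returns np.matrix,
        # which recent scikit-learn releases reject
        return X.toarray()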
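
As an alternative to a hand-written transformer class, the same sparse-to-dense step can be expressed with scikit-learn's `FunctionTransformer`. This is only a sketch under assumptions: the toy corpus, the labels and the names `to_dense` and `demo_pipeline` are hypothetical and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense NumPy array."""
    return X.toarray()


# accept_sparse only has an effect when validate=True;
# it is set here to make the intent explicit
demo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("to_dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("clf", GaussianNB()),
])

# Hypothetical toy corpus, for illustration only
docs = [
    "accidente vial en la autopista",
    "concierto en el parque",
    "choque entre taxi y moto",
    "festival de cine en la ciudad",
]
labels = [1, 0, 1, 0]

demo_pipeline.fit(docs, labels)
predictions = demo_pipeline.predict(["choque en la autopista"])
print(predictions)
```

Defining `to_dense` as a module-level function rather than a lambda keeps the pipeline picklable, which matters when the grid search runs with `n_jobs=-1`.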

In [13]:
logger.info("##### Starting model training #####")
logger.info(__doc__)
pipeline = Pipeline([
  ('tfidf', TfidfVectorizer()),  
  ('to_dense', DenseTransformer()),
  ('clf', GaussianNB())
])

parameters = {
    'tfidf__ngram_range': ((1,1),(1,2),(1,3)),
    'tfidf__max_df': (0.2,0.3, 0.35, 0.4, 0.45, 0.5),
    'tfidf__min_df': (0.001,0.01, 0.1),
    'tfidf__max_features': (None, 600, 800, 1000, 1200, 2000),                
}
scores = ['accuracy', 'f1']   
#scores = ['accuracy']
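
The cost of the search follows directly from the grid: the number of candidates is the product of the number of options per parameter, and each candidate is fitted once per fold. A minimal sketch that checks this with scikit-learn's `ParameterGrid`, repeating the grid under the illustrative name `param_grid` and assuming 5-fold cross-validation:

```python
from sklearn.model_selection import ParameterGrid

# Same values as the notebook's grid, repeated so the block is self-contained
param_grid = {
    "tfidf__ngram_range": ((1, 1), (1, 2), (1, 3)),
    "tfidf__max_df": (0.2, 0.3, 0.35, 0.4, 0.45, 0.5),
    "tfidf__min_df": (0.001, 0.01, 0.1),
    "tfidf__max_features": (None, 600, 800, 1000, 1200, 2000),
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 6 * 3 * 6
n_folds = 5                                    # assumed number of CV folds
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 324 1620
```

Adding one more option to any parameter multiplies the total, so it is worth checking this count before launching a long run.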

### Gridsearch

In [14]:
try:    
    #logger.info("Comenzando tuning")
    for score in scores:
        logger.info("# Tuning hyper-parameters for %s" % score)
        logger.info(" ")
    
        logger.info("Performing grid search...")
        print("pipeline:", [name for name, _ in pipeline.steps])
        logger.info("parameters:")
        pprint(parameters)
        t0 = time()        
        grid_search = GridSearchCV(pipeline, parameters, cv=5, scoring=score, n_jobs=-1,verbose=1)                
        grid_search.fit(X, y)        
        logger.info("done in %0.3fs" % (time() - t0))
        logger.info(" ")
        
        logger.info("Best parameters set found on development set:")
        logger.info(" ")
        logger.info(grid_search.best_params_)
        logger.info(" ")
        ##Old start
        logger.info("--")
        logger.info("Best score: %0.3f" % grid_search.best_score_)    
        logger.info("Best parameters set:")
        best_parameters = grid_search.best_estimator_.get_params()    
        for param_name in sorted(parameters.keys()):
            logger.info("\t%s: %r" % (param_name, best_parameters[param_name]))
        logger.info("--")
        logger.info(" ")
        ##Old end
        
        logger.info("Grid scores on development set:")
        logger.info(" ")
        means = grid_search.cv_results_['mean_test_score']
        stds = grid_search.cv_results_['std_test_score']
        for mean, std, params in sorted(zip(means, stds, grid_search.cv_results_['params']),key = lambda t: t[0],reverse=True):
            logger.info("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
        logger.info(" ")
    
        logger.info("Detailed classification report:")
        logger.info(" ")
        logger.info("The model is trained on the full development set.")
        logger.info("The scores are computed on the full evaluation set.")
        logger.info(" ")
        y_true, y_pred = y_test, grid_search.predict(X_test)
        logger.info(classification_report(y_true, y_pred))
        logger.info(" ")
    
except Exception:
    # logger.exception records the message together with the full traceback
    logger.exception('Unhandled exception:')
    

pipeline: ['tfidf', 'to_dense', 'clf']
{'tfidf__max_df': (0.2, 0.3, 0.35, 0.4, 0.45, 0.5),
 'tfidf__max_features': (None, 600, 800, 1000, 1200, 2000),
 'tfidf__min_df': (0.001, 0.01, 0.1),
 'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3))}
Fitting 5 folds for each of 324 candidates, totalling 1620 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   14.3s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   30.8s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:   57.4s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1620 out of 1620 | elapsed:  2.1min finished


pipeline: ['tfidf', 'to_dense', 'clf']
{'tfidf__max_df': (0.2, 0.3, 0.35, 0.4, 0.45, 0.5),
 'tfidf__max_features': (None, 600, 800, 1000, 1200, 2000),
 'tfidf__min_df': (0.001, 0.01, 0.1),
 'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3))}
Fitting 5 folds for each of 324 candidates, totalling 1620 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done 376 tasks      | elapsed:   28.6s
[Parallel(n_jobs=-1)]: Done 876 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 1576 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 1620 out of 1620 | elapsed:  2.1min finished
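
The log shows the full grid being fitted twice, once per scoring metric. `GridSearchCV` also accepts a list of metrics together with a `refit` metric, which evaluates every candidate on all metrics in a single pass. The sketch below illustrates this on a hypothetical toy corpus; `MultinomialNB` is swapped in for `GaussianNB` only because it accepts sparse input directly and keeps the example short, and the reduced grid and `cv=2` are likewise assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only
docs = [
    "accidente vial en la autopista", "concierto en el parque",
    "choque entre taxi y moto", "festival de cine en la ciudad",
    "siniestro vial con motocicleta", "partido de futbol esta noche",
    "incidente vial entre bus y taxi", "nueva exposicion en el museo",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

demo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),  # swapped in for GaussianNB, see the note above
])

search = GridSearchCV(
    demo_pipeline,
    {"tfidf__ngram_range": [(1, 1), (1, 2)]},
    cv=2,
    scoring=["accuracy", "f1"],  # both metrics computed in one search
    refit="f1",                  # metric used to select and refit the best model
)
search.fit(docs, labels)

# Multi-metric results are stored under one key per metric
for metric in ("accuracy", "f1"):
    print(metric, search.cv_results_[f"mean_test_{metric}"])
print(search.best_params_)
```

With this form, `best_params_`, `best_score_` and `predict` refer to the metric named in `refit`, while the scores for the other metric remain available in `cv_results_`.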


In [15]:
print("hello")

hello
