# Model (Train & Test) Doc2vec
### By **Néstor Suat** in 2019

**Description:** Training and testing classifiers on the embeddings of previously trained doc2vec models. The notebook was set up for SVM; the run recorded below evaluates a Random Forest.

**Input:**
* Train and Test set
* Doc2vec model (DBOW or DMM or both of them)

**Output:**
* Metrics: confusion matrix, accuracy, recall, precision and F1-score

***

## 0. Loading and cleaning data

### Importing libraries

Because this notebook sits outside the project root, the root directory (`../../../` relative to this file) must be added to the import path before the preprocessing library can be imported.

In [1]:
import pandas as pd

import sys
sys.path.insert(0, '../../../')

from classes.doc2vec.preprocessing import Preprocessing as doc2vec
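
The relative path in the cell above only works when the notebook is launched from its own directory. As a minimal sketch, assuming the notebook sits three levels below the project root, the same step can be written with an absolute path. `add_to_sys_path` is a hypothetical helper and not part of the project.

```python
import sys
from pathlib import Path


def add_to_sys_path(path):
    """Put `path` (resolved to an absolute path) at the front of sys.path once."""
    resolved = str(Path(path).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Assumption: the working directory is the notebook's folder,
# three levels below the project root.
project_root = add_to_sys_path(Path.cwd() / ".." / ".." / "..")
print(project_root)
```

Resolving the path first makes the entry independent of later changes to the working directory, and the membership check avoids duplicate entries when the cell is run more than once.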

### Importing datasets

In [2]:
train = pd.read_csv("../../../data/v1/7030/train70.tsv", delimiter = "\t", quoting = 3)
train['dataset'] = 99 # marker: train rows = 99
test = pd.read_csv("../../../data/v1/7030/test30.tsv", delimiter = "\t", quoting = 3)
test['dataset'] = 100 # marker: test rows = 100
dataset = pd.concat([train,test])
dataset = dataset.reset_index(drop=True)
print(dataset.shape) # (3804, 3)
dataset.head(5)

(3804, 3)


| | text | label | dataset |
|---|---|---|---|
| 0 | 📢#Atención: se presenta siniestro vial entre u... | 1 | 99 |
| 1 | 📢#Atención: a esta hora se presentan disturbio... | 0 | 99 |
| 2 | Incidente vial entre taxi 🚖 y motocicleta 🏍️ ... | 1 | 99 |
| 3 | @chemabernal @Moniva0517 @MartinSantosR La grá... | 0 | 99 |
| 4 | RT @CaracolRadio: #CaracolEsMás \| ¡Atención! F... | 1 | 99 |


### Preprocessing

In [3]:
#Preprocessing
#directory = "../../../data/v1/doc2vec/"
directory = "../../../data/v1/doc2vec/v2/"
file = "5_clean_stem_dataset_propuesta1_5050"
type_clean = 5 # Must match the numeric prefix of 'file'

#Model SVM
kernel='rbf'
gamma=0.1
C=2

In [4]:
clean = doc2vec(dataset)
clean.fit_clean(type_clean)

embendding = clean.feature_extraction_dbow(directory, file)

### Train & Test set
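The train and test sets were merged for preprocessing; here they are split again. In the embedding matrix, column 0 holds the dataset marker (99 = train, 100 = test), column 1 holds the label, and the remaining 200 columns hold the document vector.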

In [5]:
vecs_train = embendding[embendding[:,0] == 99.0,:] #train = 99
vecs_test = embendding[embendding[:,0] == 100.0,:] #test = 100

X_train = vecs_train[:,2:]
y_train = vecs_train[:,1]
X_test = vecs_test[:,2:]
y_test = vecs_test[:,1]
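
The split above relies on the column layout of the embedding matrix. The following sketch shows the same idea on a toy list of rows in plain Python, rather than the NumPy boolean masks used in the notebook. The rows and the helper `split_by_marker` are made up for illustration.

```python
# Toy rows laid out like the embedding matrix:
# [dataset marker, label, feature_1, feature_2]
TRAIN, TEST = 99.0, 100.0  # same marker values as in the notebook

rows = [
    [TRAIN, 1.0, 0.12, -0.40],
    [TRAIN, 0.0, 0.05, 0.33],
    [TEST, 1.0, -0.21, 0.18],
]


def split_by_marker(rows, marker):
    """Return (X, y) for the rows whose first column equals `marker`."""
    selected = [row for row in rows if row[0] == marker]
    X = [row[2:] for row in selected]  # feature columns
    y = [row[1] for row in selected]   # label column
    return X, y


X_train, y_train = split_by_marker(rows, TRAIN)
X_test, y_test = split_by_marker(rows, TEST)
print(len(X_train), len(X_test))
```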

In [6]:
X = embendding[:,2:]
y = embendding[:,1]

In [7]:
print("Size vecs_train", vecs_train.shape)
print("Size vecs_test", vecs_test.shape)
print("Size: \n * X_train: %s \n * y_train: %s \n * X_test: %s \n * y_test: %s" % (X_train.shape, y_train.shape, X_test.shape, y_test.shape))

Size vecs_train (2662, 202)
Size vecs_test (1142, 202)
Size: 
 * X_train: (2662, 200) 
 * y_train: (2662,) 
 * X_test: (1142, 200) 
 * y_test: (1142,)


## 1. Model

### Support Vector Machine

In [8]:
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn import model_selection

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

### Support Vector Machine (**SVM**) Model

### Naive Bayes (**NB**) Model

### Random Forest (**RF**) Classifier

In [41]:
n_estimators = 1000
min_samples_split = 5
min_samples_leaf = 1
max_features = 'sqrt'
max_depth = 100
bootstrap = False

In [42]:
#classifier = RandomForestClassifier(n_estimators=100,random_state=100,n_jobs=-1)
classifier = RandomForestClassifier(n_estimators=n_estimators,
                                    min_samples_split=min_samples_split,
                                    min_samples_leaf=min_samples_leaf,
                                    max_features=max_features,
                                    max_depth=max_depth,
                                    bootstrap=bootstrap,
                                    random_state=100,n_jobs=-1)

#classifier.fit(X_train, y_train)

Cross-validation on the training data only

Cross-validation on **all the data**

In [43]:
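# shuffle keeps its default (False), so random_state would have no effect;
# recent scikit-learn versions raise a ValueError when it is set without shuffle=True.
skfold = StratifiedKFold(n_splits=10)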

scores = model_selection.cross_val_score(classifier, X, y, cv=skfold)
print("Accuracy: %0.6f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

scores = model_selection.cross_val_score(classifier, X, y, cv=skfold, scoring='f1_macro')
print("F1-score: %0.6f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

#scores = model_selection.cross_val_score(classifier, X, y, cv=skfold, scoring='recall_macro')
#print("Recall: %0.6f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

#scores = model_selection.cross_val_score(classifier, X, y, cv=skfold, scoring='precision_macro')
#print("Precision: %0.6f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.934806 (+/- 0.02)
F1-score: 0.934691 (+/- 0.02)
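
Stratified folds keep roughly the same class proportions in every fold. The sketch below illustrates that idea by dealing the samples of each class round-robin over the folds. It is not scikit-learn's actual fold assignment, and both `stratified_fold_ids` and the toy labels are assumptions made for the example.

```python
from collections import Counter, defaultdict


def stratified_fold_ids(labels, n_splits):
    """Assign each sample a fold id so every class is spread evenly over the folds."""
    fold_of = [None] * len(labels)
    seen_per_class = defaultdict(int)
    for i, label in enumerate(labels):
        # Deal the samples of each class round-robin over the folds.
        fold_of[i] = seen_per_class[label] % n_splits
        seen_per_class[label] += 1
    return fold_of


labels = [0] * 6 + [1] * 4  # toy labels, not the notebook's data
folds = stratified_fold_ids(labels, n_splits=2)
for k in range(2):
    test_labels = [y for y, f in zip(labels, folds) if f == k]
    print(k, Counter(test_labels))
```

Each of the two folds ends up with the same 6:4 class ratio as the full label list.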


In [25]:
scores

array([0.95024507, 0.92138839, 0.92353117, 0.94470583, 0.93142508,
       0.93675626, 0.93937923, 0.92335246, 0.94997194, 0.92353117])
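
The `(+/- ...)` figure printed by the cross-validation cell is twice the standard deviation of the per-fold scores. As a small sketch, the same summary can be recomputed from the ten scores listed above with the standard library. `pstdev` is used here because it is the population standard deviation, which is also what NumPy's `std()` computes by default.

```python
from statistics import mean, pstdev

# Per-fold scores copied from the notebook output above
scores = [0.95024507, 0.92138839, 0.92353117, 0.94470583, 0.93142508,
          0.93675626, 0.93937923, 0.92335246, 0.94997194, 0.92353117]

avg = mean(scores)
spread = 2 * pstdev(scores)  # two population standard deviations
print("%0.6f (+/- %0.2f)" % (avg, spread))
```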

In [44]:
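# Fit on the training split and predict the held-out test split
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Confusion matrix (rows: true class, columns: predicted class).
# The *_svm names are kept as-is; the classifier evaluated here is the Random Forest defined above.
cm_svm = confusion_matrix(y_test, y_pred)

# Recall, precision and F1 are computed for the positive class (label 1)
metrics_svm = []
metrics = {}
metrics['accuracy'] = accuracy_score(y_test, y_pred)
metrics['recall'] = recall_score(y_test, y_pred)
metrics['precision'] = precision_score(y_test, y_pred)
metrics['f1'] = f1_score(y_test, y_pred)
metrics_svm.append(metrics)
metrics_svm = pd.DataFrame(metrics_svm)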

In [45]:
print(metrics_svm)
print(cm_svm)

   accuracy    recall  precision        f1
0  0.936953  0.893238   0.976654  0.933086
[[568  12]
 [ 60 502]]
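
The four metrics can be checked by hand against the confusion matrix. The sketch below assumes scikit-learn's convention, in which rows are true classes and columns are predicted classes, ordered as labels 0 then 1.

```python
# Confusion matrix from the notebook output: rows are true classes, columns are predictions
cm = [[568, 12],
      [60, 502]]

tn, fp = cm[0]  # true label 0: predicted 0, predicted 1
fn, tp = cm[1]  # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy=%.6f recall=%.6f precision=%.6f f1=%.6f"
      % (accuracy, recall, precision, f1))
```

The 60 false negatives account for the gap between recall and precision: the model rarely raises a false alarm but misses about one in nine positive documents.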
