<h1 align="center"> Experiments using the Distributed Memory Doc2vec Embeddings algorithm </h1>

---



Libraries used to generate and use the Doc2vec word vectors

In [None]:
from google.colab import drive
import gensim
from gensim.models import Doc2Vec
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from imblearn.over_sampling import SMOTE
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report



Google Drive is used to access the dataset and the word-vector files

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive



Reads the files containing the data. Helper methods generate the split of the whole dataset.

In [None]:
def read_corpus(train_filepath, test_filepath):
    """Read the train and test CSV files and return texts and labels as lists."""
    train_data = pd.read_csv(train_filepath, sep=',')
    test_data = pd.read_csv(test_filepath, sep=',')

    # 'text' holds the tweet, 'prediction' holds the class label
    X_train = train_data['text'].tolist()
    y_train = train_data['prediction'].tolist()
    X_test = test_data['text'].tolist()
    y_test = test_data['prediction'].tolist()

    return X_train, X_test, y_train, y_test


def split_data_set(X, y, size):
    """Stratified hold-out split; `size` is the fraction reserved for testing."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=size, random_state=42)

    return X_train, X_test, y_train, y_test


def prepare_corpus(tweets: list):
    """Tokenize each tweet and tag it with its position in the list."""
    for i, line in enumerate(tweets):
        tokens = gensim.utils.simple_preprocess(line)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])



Method that creates the embeddings used for classification

In [None]:
def train_doc2vec_embeddings(vec_size, option, data):
    """Train a Doc2Vec model on `data`; option=1 selects Distributed Memory (PV-DM), option=0 selects DBOW."""
    documents = list(prepare_corpus(data))
    #model = Doc2Vec(vector_size=vec_size, window=4, min_count=1, epochs=300, sample=1e-4, workers=5)
    model = Doc2Vec(vector_size=vec_size, dm=option, window=4, min_count=1, epochs=300, sample=1e-4, workers=5)
    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
    # Optional: persist the trained model to Google Drive
    # file_name = str(vec_size) + 'doc2vec.model'
    # model.save('/content/drive/MyDrive/Projects/IDPT2021/saved_models/' + file_name)
    return model
'''
vector_size = 50
option = 1

train_and_save(vector_size, option)
print('Vector size ' + str(vector_size) + ' done!')
'''

Code that takes the trained Doc2vec model and uses it to infer new vectors for the dataset used in classification.

The classifiers are evaluated with a 90/10 hold-out, using accuracy and balanced accuracy as metrics.
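As a reading aid, the two metrics named above can be written out by hand. This is a minimal sketch in plain Python, not the scikit-learn implementation the notebook calls; the function names and the toy labels are made up for illustration. Balanced accuracy is taken here as the mean of per-class recall, which is how scikit-learn defines `balanced_accuracy_score` by default.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for label in sorted(set(y_true)):
        # Positions of the samples that truly belong to this class
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy, made-up labels: 2 samples of class 0 and 8 samples of class 1
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

When one class is much larger than the other, as in the test sets reported in this notebook (for example 304 versus 1286 samples), the two numbers can differ noticeably, which is a reason to report both.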

In [None]:

def infer_d2v_embeddings(d2v_model, X):
    """Replace each text in X (in place) with its inferred Doc2Vec vector and return X."""
    for i in range(len(X)):
        model_vector = d2v_model.infer_vector(gensim.utils.simple_preprocess(X[i]))
        X[i] = model_vector

    return X


def run_classifier(classifier, X_train, X_test, y_train, y_test):
    """Fit the classifier and return accuracy, balanced accuracy and the classification report."""
    classifier.fit(X_train, y_train)

    pred = classifier.predict(X_test)

    acc_score = accuracy_score(y_test, pred)
    balanced_acc_score = balanced_accuracy_score(y_test, pred)
    classif_report = classification_report(y_test, pred)

    return acc_score, balanced_acc_score, classif_report
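
`infer_d2v_embeddings` overwrites the list it receives, so the original texts are no longer available after the call. A non-mutating variant could look like the sketch below. The names `infer_embeddings`, `tokenize` and `FakeModel` are hypothetical and not part of the notebook or of gensim; the only assumption about the model is that it exposes `infer_vector(tokens)`, as gensim's `Doc2Vec` does.

```python
def infer_embeddings(d2v_model, texts, tokenize=str.split):
    """Return a new list of vectors; `texts` itself is left unchanged."""
    return [d2v_model.infer_vector(tokenize(text)) for text in texts]


class FakeModel:
    """Stand-in for a trained Doc2Vec model, used only to make the sketch runnable."""

    def infer_vector(self, tokens):
        # A real model returns a dense vector; here the token count is enough
        return [float(len(tokens))]


texts = ['bom dia', 'uma frase maior']
vectors = infer_embeddings(FakeModel(), texts)

print(vectors)  # [[2.0], [3.0]]
print(texts)    # ['bom dia', 'uma frase maior']
```

In the notebook, `gensim.utils.simple_preprocess` would be passed as `tokenize` and the trained `d2v_model` would take the place of `FakeModel`.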

<h1 align="center"> <b>Organization</b> </h1>
<h2 align="center"> <font color="red"> <b>1 - Hold-out experiments: 90 train / 10 test</b> </font> </h2>


1.   Distributed Memory === Hold-out 90 train / 10 test
2.   Distributed Memory === Hold-out 80 train / 20 test
3.   Distributed Memory === Hold-out 70 train / 30 test

*    Word-vector size [50, 100, 300]
*    Datasets - Twitter translation-quality corpus [75%, 25%, mean value]
*    The embeddings are generated from the training sets of the datasets and used to infer vectors for the test sets. All three configurations (vector sizes) are applied to every dataset.
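
The organization above amounts to a grid of 3 hold-out splits, 3 datasets and 3 vector sizes. The sketch below enumerates that grid in plain Python. The directory layout and file-name pattern are copied from the paths used in the cells of this notebook; `build_experiment_grid` is a hypothetical helper, not something the notebook defines.

```python
from itertools import product

# Base folder and naming pattern taken from the notebook's file paths
BASE_DIR = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado'

SPLITS = ['90-10', '80-20', '70-30']
DATASETS = ['75', '25', 'media']
VECTOR_SIZES = [50, 100, 300]


def build_experiment_grid(base_dir=BASE_DIR):
    """Hypothetical helper: one dict per (split, dataset, vector size) combination."""
    grid = []
    for split, dataset, vec_size in product(SPLITS, DATASETS, VECTOR_SIZES):
        grid.append({
            'split': split,
            'dataset': dataset,
            'vec_size': vec_size,
            'train_fp': f'{base_dir}/{split}/train_{dataset}_tweets.csv',
            'test_fp': f'{base_dir}/{split}/test_{dataset}_tweets.csv',
        })
    return grid


grid = build_experiment_grid()
print(len(grid))  # 27
print(grid[0]['train_fp'])
```

Each entry carries what one experiment cell needs: the two file paths and the vector size, with `model_option = 1` fixed for Distributed Memory.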

---

---






<h2 align="center"> <font color="red"> <b>1 - Hold-out experiments: 90 train / 10 test</b> </font> </h2>

---
<h2 align="center"> Distributed Memory size 50</h2>


<h4 align="center"> Set of the 75% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_75_tweets.csv'

vec_size = 50
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.41      0.82      0.55       304
           1       0.94      0.73      0.82      1286

    accuracy                           0.74      1590
   macro avg       0.68      0.77      0.68      1590
weighted avg       0.84      0.74      0.77      1590

Acuracia == 0.7427672955974842
Acuracia balanceada == 0.7706474584595235


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_25_tweets.csv'

vec_size = 50
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.53      0.80      0.64       424
           1       0.92      0.76      0.83      1279

    accuracy                           0.77      1703
   macro avg       0.72      0.78      0.74      1703
weighted avg       0.82      0.77      0.78      1703

Acuracia == 0.7721667645331768
Acuracia balanceada == 0.782888865121631


<h4 align="center"> Set based on the mean of the best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_media_tweets.csv'

vec_size = 50
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.46      0.76      0.57       361
           1       0.92      0.75      0.83      1290

    accuracy                           0.75      1651
   macro avg       0.69      0.76      0.70      1651
weighted avg       0.82      0.75      0.77      1651

Acuracia == 0.7546941247728649
Acuracia balanceada == 0.7552481264360411


<h2 align="center"> Distributed Memory size 100</h2>

<h4 align="center"> Hold-out 90 train / 10 test </h4>

<h4 align="center"> Set of the 75% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_75_tweets.csv'

vec_size = 100
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.42      0.86      0.57       304
           1       0.96      0.72      0.82      1286

    accuracy                           0.75      1590
   macro avg       0.69      0.79      0.69      1590
weighted avg       0.85      0.75      0.77      1590

Acuracia == 0.7484276729559748
Acuracia balanceada == 0.7904738274535483


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_25_tweets.csv'

vec_size = 100
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.53      0.79      0.64       424
           1       0.92      0.77      0.84      1279

    accuracy                           0.78      1703
   macro avg       0.73      0.78      0.74      1703
weighted avg       0.82      0.78      0.79      1703

Acuracia == 0.7751027598355843
Acuracia balanceada == 0.780901942850399


<h4 align="center"> Set based on the mean of the best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_media_tweets.csv'

vec_size = 100
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.49      0.78      0.60       361
           1       0.93      0.78      0.85      1290

    accuracy                           0.78      1651
   macro avg       0.71      0.78      0.73      1651
weighted avg       0.83      0.78      0.79      1651

Acuracia == 0.7777104784978801
Acuracia balanceada == 0.7779563658227576


<h2 align="center"> Distributed Memory size 300</h2>

<h4 align="center"> Hold-out 90 train / 10 test </h4>

<h4 align="center"> Set of the 75% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_75_tweets.csv'

vec_size = 300
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.48      0.81      0.60       304
           1       0.95      0.79      0.86      1286

    accuracy                           0.80      1590
   macro avg       0.71      0.80      0.73      1590
weighted avg       0.86      0.80      0.81      1590

Acuracia == 0.7968553459119497
Acuracia balanceada == 0.8003166693951052


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_25_tweets.csv'

vec_size = 300
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.60      0.76      0.67       424
           1       0.91      0.83      0.87      1279

    accuracy                           0.81      1703
   macro avg       0.76      0.80      0.77      1703
weighted avg       0.83      0.81      0.82      1703

Acuracia == 0.8126834997064005
Acuracia balanceada == 0.7956733960788942


<h4 align="center"> Set based on the mean of the best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_media_tweets.csv'

vec_size = 300
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.55      0.72      0.62       361
           1       0.91      0.83      0.87      1290

    accuracy                           0.81      1651
   macro avg       0.73      0.78      0.75      1651
weighted avg       0.84      0.81      0.82      1651

Acuracia == 0.8098122350090854
Acuracia balanceada == 0.7785501084412377


<h2 align="center"> <font color="red"> <b>2 - Hold-out experiments: 80 train / 20 test</b> </font> </h2>



2.   Distributed Memory === Hold-out 80 train / 20 test

*    Word-vector size = 50 
*    Word-vector size = 100 
*    Word-vector size = 300 

<h4 align="center"> Set of the 75% best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_75_tweets.csv'

vec_size = 50
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.42      0.82      0.55       621
           1       0.94      0.72      0.82      2558

    accuracy                           0.74      3179
   macro avg       0.68      0.77      0.69      3179
weighted avg       0.84      0.74      0.77      3179

Acuracia == 0.74268637936458
Acuracia balanceada == 0.772434117838136


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_75_tweets.csv'

vec_size = 100
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.43      0.84      0.57       621
           1       0.95      0.73      0.83      2558

    accuracy                           0.75      3179
   macro avg       0.69      0.79      0.70      3179
weighted avg       0.85      0.75      0.78      3179

Acuracia == 0.7540106951871658
Acuracia balanceada == 0.7873968063314359


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_75_tweets.csv'

vec_size = 300
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.50      0.83      0.62       621
           1       0.95      0.79      0.87      2558

    accuracy                           0.80      3179
   macro avg       0.72      0.81      0.74      3179
weighted avg       0.86      0.80      0.82      3179

Acuracia == 0.8018244731047499
Acuracia balanceada == 0.8128397034216799


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_25_tweets.csv'

vec_size = 50
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.52      0.77      0.62       858
           1       0.91      0.76      0.83      2547

    accuracy                           0.76      3405
   macro avg       0.71      0.76      0.72      3405
weighted avg       0.81      0.76      0.78      3405

Acuracia == 0.7627019089574155
Acuracia balanceada == 0.7637073370288918


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_25_tweets.csv'

vec_size = 100
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.53      0.77      0.63       858
           1       0.91      0.77      0.83      2547

    accuracy                           0.77      3405
   macro avg       0.72      0.77      0.73      3405
weighted avg       0.81      0.77      0.78      3405

Acuracia == 0.7712187958883994
Acuracia balanceada == 0.7701731915512835


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_25_tweets.csv'

vec_size = 300
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.62      0.71      0.66       858
           1       0.90      0.85      0.88      2547

    accuracy                           0.82      3405
   macro avg       0.76      0.78      0.77      3405
weighted avg       0.83      0.82      0.82      3405

Acuracia == 0.8179148311306902
Acuracia balanceada == 0.7832236471812444


<h4 align="center"> Set based on the mean of the best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_media_tweets.csv'

vec_size = 50
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.46      0.75      0.57       744
           1       0.91      0.75      0.82      2558

    accuracy                           0.75      3302
   macro avg       0.69      0.75      0.70      3302
weighted avg       0.81      0.75      0.76      3302

Acuracia == 0.7468201090248334
Acuracia balanceada == 0.7469944597173531


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_media_tweets.csv'

vec_size = 100
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.49      0.76      0.59       744
           1       0.92      0.77      0.84      2558

    accuracy                           0.77      3302
   macro avg       0.70      0.76      0.71      3302
weighted avg       0.82      0.77      0.78      3302

Acuracia == 0.7655966081162932
Acuracia balanceada == 0.7643556584024818


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_media_tweets.csv'

vec_size = 300
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.56      0.76      0.64       744
           1       0.92      0.83      0.87      2558

    accuracy                           0.81      3302
   macro avg       0.74      0.79      0.76      3302
weighted avg       0.84      0.81      0.82      3302

Acuracia == 0.8119321623258631
Acuracia balanceada == 0.7928321016923503


<h2 align="center"> <font color="red"> <b>3 - Hold-out experiments: 70 train / 30 test</b> </font> </h2>



3.   Distributed Memory === Hold-out 70 train / 30 test

*    Word-vector size = 50 
*    Word-vector size = 100 
*    Word-vector size = 300 

<h4 align="center"> Set of the 75% best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_75_tweets.csv'

vec_size = 50
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.41      0.84      0.55       925
           1       0.95      0.71      0.81      3843

    accuracy                           0.73      4768
   macro avg       0.68      0.77      0.68      4768
weighted avg       0.84      0.73      0.76      4768

Acuracia == 0.7321728187919463
Acuracia balanceada == 0.7731095498308613


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_75_tweets.csv'

vec_size = 100
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.42      0.84      0.56       925
           1       0.95      0.72      0.82      3843

    accuracy                           0.75      4768
   macro avg       0.69      0.78      0.69      4768
weighted avg       0.85      0.75      0.77      4768

Acuracia == 0.7458053691275168
Acuracia balanceada == 0.781976918370361


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_75_tweets.csv'

vec_size = 300
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.49      0.83      0.62       925
           1       0.95      0.79      0.86      3843

    accuracy                           0.80      4768
   macro avg       0.72      0.81      0.74      4768
weighted avg       0.86      0.80      0.82      4768

Acuracia == 0.7997063758389261
Acuracia balanceada == 0.81172043237617


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_25_tweets.csv'

vec_size = 50
#DM model = 1
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.52      0.76      0.62      1316
           1       0.90      0.76      0.82      3792

    accuracy                           0.76      5108
   macro avg       0.71      0.76      0.72      5108
weighted avg       0.80      0.76      0.77      5108

Accuracy == 0.7580266249021144
Balanced accuracy == 0.7571426968309543


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_25_tweets.csv'

vec_size = 100
# model_option = 1 selects the DM (Distributed Memory) model
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.55      0.78      0.64      1316
           1       0.91      0.78      0.84      3792

    accuracy                           0.78      5108
   macro avg       0.73      0.78      0.74      5108
weighted avg       0.82      0.78      0.79      5108

Accuracy == 0.7795614722004699
Balanced accuracy == 0.7785932309902146


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_25_tweets.csv'

vec_size = 300
# model_option = 1 selects the DM (Distributed Memory) model
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.64      0.74      0.68      1316
           1       0.90      0.85      0.88      3792

    accuracy                           0.82      5108
   macro avg       0.77      0.80      0.78      5108
weighted avg       0.84      0.82      0.83      5108

Accuracy == 0.8247846515270164
Balanced accuracy == 0.7966479582676055
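
The three runs on this set differ only in `vec_size`, so their scores can be compared side by side. The sketch below simply collects the values printed by the cells above into a list and picks the row with the highest balanced accuracy; the `results` list and its layout are illustrative assumptions, not part of the notebook. Each configuration was run once here, so small differences may partly reflect run-to-run variation.

```python
# (vec_size, accuracy, balanced accuracy) as printed by the three runs above.
results = [
    (50, 0.7580266249021144, 0.7571426968309543),
    (100, 0.7795614722004699, 0.7785932309902146),
    (300, 0.8247846515270164, 0.7966479582676055),
]

# Print a small comparison table.
print(f"{'vec_size':>8} {'accuracy':>9} {'balanced':>9}")
for vec_size, acc, bal in results:
    print(f"{vec_size:>8} {acc:>9.4f} {bal:>9.4f}")

# Select the configuration with the highest balanced accuracy.
best_vec_size, _, best_bal = max(results, key=lambda row: row[2])
print(f"best balanced accuracy: vec_size={best_vec_size} ({best_bal:.4f})")
```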


<h4 align="center"> Set of the average of the best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_media_tweets.csv'

vec_size = 50
# model_option = 1 selects the DM (Distributed Memory) model
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.48      0.78      0.59      1132
           1       0.92      0.75      0.82      3821

    accuracy                           0.75      4953
   macro avg       0.70      0.76      0.71      4953
weighted avg       0.82      0.75      0.77      4953

Accuracy == 0.7528770442156268
Balanced accuracy == 0.7611899045908652


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_media_tweets.csv'

vec_size = 100
# model_option = 1 selects the DM (Distributed Memory) model
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.50      0.77      0.61      1132
           1       0.92      0.77      0.84      3821

    accuracy                           0.77      4953
   macro avg       0.71      0.77      0.72      4953
weighted avg       0.82      0.77      0.79      4953

Accuracy == 0.7714516454673935
Balanced accuracy == 0.7726069572744263


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_media_tweets.csv'

vec_size = 300
# model_option = 1 selects the DM (Distributed Memory) model
model_option = 1

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.56      0.76      0.64      1132
           1       0.92      0.82      0.87      3821

    accuracy                           0.81      4953
   macro avg       0.74      0.79      0.76      4953
weighted avg       0.84      0.81      0.82      4953

Accuracy == 0.8069856652533818
Balanced accuracy == 0.7900424518399805
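
The f1-score column of the report follows from the precision and recall columns: F1 is their harmonic mean, and the macro average is the unweighted mean of the per-class F1 values. The sketch below recomputes both from the rounded precision and recall in the report above, so the last digit can differ from what scikit-learn derives from the unrounded values. The helper `f1` and the `per_class` dictionary are illustrative names, not part of the notebook.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) per class, copied from the report above.
per_class = {0: (0.56, 0.76), 1: (0.92, 0.82)}

f1_scores = {label: f1(p, r) for label, (p, r) in per_class.items()}

# Macro F1: unweighted mean over classes, ignoring class support.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for label, score in f1_scores.items():
    print(f"class {label}: f1 ~ {score:.2f}")
print(f"macro f1 ~ {macro_f1:.2f}")
```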
