<h1 align="center"> Experiments with the Distributed Memory Doc2vec embedding algorithm </h1>

---



Libraries used to generate and apply the Doc2vec embeddings

In [None]:
from google.colab import drive
import gensim
from gensim.models import Doc2Vec
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import plot_confusion_matrix
from imblearn.over_sampling import SMOTE
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report



Google Drive is used to access the data set and the embedding files

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive



Reading the data files, plus helper methods to split the full data set.

In [None]:
def read_corpus(train_filepath, test_filepath):
    
    train_data = pd.read_csv(train_filepath, sep=',')
    test_data = pd.read_csv(test_filepath, sep=',')
    
    X_train = []
    y_train = []
    X_train.extend(train_data['text'].values)
    y_train.extend(train_data['prediction'].values)
    
    X_test = []
    y_test = []
    X_test.extend(test_data['text'].values)
    y_test.extend(test_data['prediction'].values)
   
    return X_train, X_test, y_train, y_test  

def split_data_set(X, y, size):
    
    X_train, X_test, y_train, y_test = train_test_split(X,y,stratify = y, test_size = size, random_state = 42)

    return X_train, X_test, y_train, y_test


def prepare_corpus(tweets: list):
    for i, line in enumerate(tweets):
        tokens = gensim.utils.simple_preprocess(line)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])
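
The `split_data_set` helper above passes `stratify = y` to `train_test_split`, so each class keeps roughly the same share in the training and test parts. The idea can be sketched in plain Python as follows. `stratified_holdout` is a hypothetical name used only for illustration, the data is a toy example, and this is not scikit-learn's implementation.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(X, y, test_size, seed=42):
    """Split X, y so that each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    # Draw the test share separately from every class.
    test_indices = set()
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_indices.update(indices[:n_test])

    X_train = [x for i, x in enumerate(X) if i not in test_indices]
    X_test = [x for i, x in enumerate(X) if i in test_indices]
    y_train = [label for i, label in enumerate(y) if i not in test_indices]
    y_test = [label for i, label in enumerate(y) if i in test_indices]
    return X_train, X_test, y_train, y_test


# Toy data: 80 texts of class 1 and 20 of class 0.
texts = [f"tweet {i}" for i in range(100)]
labels = [1] * 80 + [0] * 20

X_tr, X_te, y_tr, y_te = stratified_holdout(texts, labels, test_size=0.1)
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

With a 10% test share, both parts keep the 80/20 class ratio of the toy data.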



Method that creates the embeddings used in classification

In [None]:
def train_doc2vec_embeddings(vec_size, option, data):
 
    documents = list(prepare_corpus(data))
    #model = Doc2Vec(vector_size=vec_size, window=4, min_count=1, epochs=300, sample=1e-4, workers=5)
    # dm=1 trains Distributed Memory (PV-DM); dm=0 trains Distributed Bag of Words (PV-DBOW).
    model = Doc2Vec(vector_size = vec_size, dm = option, window=4, min_count=1, epochs=300, sample=1e-4, workers=5)
    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
    '''
    file_name = str(vec_size)+'doc2vec.model'
    model.save('/content/drive/MyDrive/Projects/IDPT2021/saved_models/'+ file_name)
    '''    
    return model
'''
vector_size = 50
option = 1

train_and_save(vector_size, option)
print('Vector size ' + str(vector_size) + ' done!')
'''

Code that takes the trained Doc2vec model and uses it to infer new embeddings for the data set used in classification.

The classifiers are evaluated with a 90/10 hold-out, using accuracy and balanced accuracy as metrics.

In [None]:

def infer_d2v_embeddings(d2v_model, X):
    # Infer one vector per raw text and return a new list,
    # instead of overwriting the input texts in place.
    return [d2v_model.infer_vector(gensim.utils.simple_preprocess(text)) for text in X]

def run_classifier(classifier, X_train, X_test, y_train, y_test):
    
    classifier.fit(X_train,y_train)
    
    pred = classifier.predict(X_test)
    
    acc_score = accuracy_score(y_test,pred)
    ballanced_acc_score = balanced_accuracy_score(y_test, pred)
    classif_report = classification_report(y_test, pred)
    
    return acc_score, ballanced_acc_score, classif_report
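
The two metrics returned by `run_classifier` can disagree noticeably on imbalanced data. Accuracy is the share of correct predictions, while balanced accuracy is the mean of the per-class recalls. A minimal sketch in plain Python, with toy labels that are not taken from the data set and helper names chosen only for illustration:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for label in sorted(set(y_true)):
        positions = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in positions if y_pred[i] == label)
        recalls.append(hits / len(positions))
    return sum(recalls) / len(recalls)


# Toy example: 8 samples of class 1, 2 of class 0, and a model that always predicts 1.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10

acc = accuracy(y_true, y_pred)               # 8 of 10 predictions are correct
bal_acc = balanced_accuracy(y_true, y_pred)  # recall is 1.0 for class 1 and 0.0 for class 0
print(acc, bal_acc)
```

This is why the reports in this notebook show balanced accuracy below plain accuracy whenever recall on the minority class 0 is lower than on class 1.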

<h1 align="center"> <b>Organization</b> </h1>
<h2 align="center"> <font color="red"> <b>1 - Hold-out experiments: 90 train / 10 test</b> </font> </h2>


1.   Distributed Memory === Hold-out 90 train / 10 test

*    Embedding vector size = 50 
*    Embedding vector size = 100 
*    Embedding vector size = 300 
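
The experiment cells in this notebook repeat the same steps for each hold-out split, translation subset, and vector size. As a sketch, the combinations and their file paths could be enumerated in one place. The directory and file name patterns are the ones used in the cells of this notebook; `experiment_grid` and the constant names are hypothetical and introduced only for illustration.

```python
from itertools import product

# Path pattern and values as they appear in the experiment cells of this notebook.
BASE_DIR = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado'
SPLITS = ['90-10', '80-20', '70-30']
SUBSETS = ['75', '25', 'media']
VECTOR_SIZES = [50, 100, 300]


def experiment_grid():
    """Yield one (split, subset, vec_size, train_fp, test_fp) tuple per experiment."""
    for split, subset, vec_size in product(SPLITS, SUBSETS, VECTOR_SIZES):
        train_fp = f'{BASE_DIR}/{split}/train_{subset}_tweets.csv'
        test_fp = f'{BASE_DIR}/{split}/test_{subset}_tweets.csv'
        yield split, subset, vec_size, train_fp, test_fp


grid = list(experiment_grid())
print(len(grid))
print(grid[0])
```

Each tuple carries everything a cell needs before calling `read_corpus` and `train_doc2vec_embeddings`.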


---

<h2 align="center"> Distributed Memory size 50</h2>


<h4 align="center"> Set of the 75% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_75_tweets.csv'

vec_size = 50
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.84      0.71      0.77       304
           1       0.93      0.97      0.95      1286

    accuracy                           0.92      1590
   macro avg       0.89      0.84      0.86      1590
weighted avg       0.92      0.92      0.92      1590

Acuracia == 0.9188679245283019
Acuracia balanceada == 0.8405781902267333


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_25_tweets.csv'

vec_size = 50
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.83      0.61      0.71       424
           1       0.88      0.96      0.92      1279

    accuracy                           0.87      1703
   macro avg       0.86      0.79      0.81      1703
weighted avg       0.87      0.87      0.87      1703

Acuracia == 0.8731650029359953
Acuracia balanceada == 0.7862753920368213


<h4 align="center"> Set of the average of the best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_media_tweets.csv'

vec_size = 50
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.85      0.60      0.70       361
           1       0.90      0.97      0.93      1290

    accuracy                           0.89      1651
   macro avg       0.87      0.78      0.82      1651
weighted avg       0.89      0.89      0.88      1651

Acuracia == 0.8891580860084797
Acuracia balanceada == 0.7834428482466878


<h2 align="center"> Distributed Memory size 100</h2>

<h4 align="center"> Hold-out: 90 train / 10 test </h4>

<h4 align="center"> Set of the 75% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_75_tweets.csv'

vec_size = 100
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.83      0.74      0.78       304
           1       0.94      0.96      0.95      1286

    accuracy                           0.92      1590
   macro avg       0.89      0.85      0.87      1590
weighted avg       0.92      0.92      0.92      1590

Acuracia == 0.9213836477987422
Acuracia balanceada == 0.8521808750102317


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_25_tweets.csv'

vec_size = 100
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.84      0.64      0.73       424
           1       0.89      0.96      0.92      1279

    accuracy                           0.88      1703
   macro avg       0.86      0.80      0.82      1703
weighted avg       0.88      0.88      0.87      1703

Acuracia == 0.8796241926012919
Acuracia balanceada == 0.8000354050186613


<h4 align="center"> Set of the average of the best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_media_tweets.csv'

vec_size = 100
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.85      0.63      0.72       361
           1       0.90      0.97      0.93      1290

    accuracy                           0.89      1651
   macro avg       0.87      0.80      0.83      1651
weighted avg       0.89      0.89      0.89      1651

Acuracia == 0.8933979406420351
Acuracia balanceada == 0.7971279177135004


<h2 align="center"> Distributed Memory size 300</h2>

<h4 align="center"> Hold-out: 90 train / 10 test </h4>

<h4 align="center"> Set of the 75% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_75_tweets.csv'

vec_size = 300
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.85      0.73      0.79       304
           1       0.94      0.97      0.95      1286

    accuracy                           0.92      1590
   macro avg       0.90      0.85      0.87      1590
weighted avg       0.92      0.92      0.92      1590

Acuracia == 0.9245283018867925
Acuracia balanceada == 0.8516130187443726


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_25_tweets.csv'

vec_size = 300
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.85      0.66      0.74       424
           1       0.89      0.96      0.93      1279

    accuracy                           0.89      1703
   macro avg       0.87      0.81      0.84      1703
weighted avg       0.88      0.89      0.88      1703

Acuracia == 0.8866705813270699
Acuracia balanceada == 0.8102447740717247


<h4 align="center"> Set of the average of the best translations</h4>

In [None]:

train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/90-10/test_media_tweets.csv'

vec_size = 100  # as run: 100, although this section is headed "size 300"
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.82      0.66      0.73       361
           1       0.91      0.96      0.93      1290

    accuracy                           0.89      1651
   macro avg       0.86      0.81      0.83      1651
weighted avg       0.89      0.89      0.89      1651

Acuracia == 0.8927922471229558
Acuracia balanceada == 0.8087096566385363


<h2 align="center"> <font color="red"> <b>2 - Hold-out experiments: 80 train / 20 test</b> </font> </h2>



2.   Distributed Memory === Hold-out 80 train / 20 test

*    Embedding vector size = 50 
*    Embedding vector size = 100 
*    Embedding vector size = 300 

<h4 align="center"> Set of the 75% best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_75_tweets.csv'

vec_size = 50
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.84      0.69      0.76       621
           1       0.93      0.97      0.95      2558

    accuracy                           0.91      3179
   macro avg       0.89      0.83      0.85      3179
weighted avg       0.91      0.91      0.91      3179

Acuracia == 0.9144385026737968
Acuracia balanceada == 0.8297734114438741


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_75_tweets.csv'

vec_size = 100
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.85      0.73      0.78       621
           1       0.94      0.97      0.95      2558

    accuracy                           0.92      3179
   macro avg       0.89      0.85      0.87      3179
weighted avg       0.92      0.92      0.92      3179

Acuracia == 0.9216734822271154
Acuracia balanceada == 0.8489016177342655


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_75_tweets.csv'

vec_size = 300
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.89      0.72      0.80       621
           1       0.93      0.98      0.96      2558

    accuracy                           0.93      3179
   macro avg       0.91      0.85      0.88      3179
weighted avg       0.93      0.93      0.93      3179

Acuracia == 0.9282793331236238
Acuracia balanceada == 0.8487385726822108


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_25_tweets.csv'

vec_size = 50
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.86      0.58      0.69       858
           1       0.87      0.97      0.92      2547

    accuracy                           0.87      3405
   macro avg       0.87      0.77      0.81      3405
weighted avg       0.87      0.87      0.86      3405

Acuracia == 0.8701908957415565
Acuracia balanceada == 0.7741124207555303


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_25_tweets.csv'

vec_size = 100
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.85      0.59      0.70       858
           1       0.87      0.96      0.92      2547

    accuracy                           0.87      3405
   macro avg       0.86      0.78      0.81      3405
weighted avg       0.87      0.87      0.86      3405

Acuracia == 0.8698972099853157
Acuracia balanceada == 0.7770076409652382


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_25_tweets.csv'

vec_size = 300
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.88      0.65      0.75       858
           1       0.89      0.97      0.93      2547

    accuracy                           0.89      3405
   macro avg       0.88      0.81      0.84      3405
weighted avg       0.89      0.89      0.88      3405

Acuracia == 0.8883994126284875
Acuracia balanceada == 0.8086971920894183


<h4 align="center"> Set of the average of the best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_media_tweets.csv'

vec_size = 50
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.83      0.61      0.70       744
           1       0.89      0.96      0.93      2558

    accuracy                           0.88      3302
   macro avg       0.86      0.79      0.81      3302
weighted avg       0.88      0.88      0.88      3302

Acuracia == 0.8834039975772259
Acuracia balanceada == 0.7860617543948145


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_media_tweets.csv'

vec_size = 100
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.82      0.66      0.73       744
           1       0.91      0.96      0.93      2558

    accuracy                           0.89      3302
   macro avg       0.86      0.81      0.83      3302
weighted avg       0.89      0.89      0.89      3302

Acuracia == 0.8909751665657177
Acuracia balanceada == 0.808105185502787


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/80-20/test_media_tweets.csv'

vec_size = 300
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.88      0.68      0.77       744
           1       0.91      0.97      0.94      2558

    accuracy                           0.91      3302
   macro avg       0.89      0.83      0.85      3302
weighted avg       0.90      0.91      0.90      3302

Acuracia == 0.9061175045427013
Acuracia balanceada == 0.825503690719396


<h2 align="center"> <font color="red"> <b>3 - Hold-out experiments: 70 train / 30 test</b> </font> </h2>



3.   Distributed Memory === Hold-out 70 train / 30 test

*    Embedding vector size = 50 
*    Embedding vector size = 100 
*    Embedding vector size = 300 

<h4 align="center"> Set of the 75% best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_75_tweets.csv'

vec_size = 50
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.82      0.68      0.74       925
           1       0.93      0.96      0.94      3843

    accuracy                           0.91      4768
   macro avg       0.87      0.82      0.84      4768
weighted avg       0.91      0.91      0.91      4768

Acuracia == 0.9091862416107382
Acuracia balanceada == 0.8234066853738985


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_75_tweets.csv'

vec_size = 100
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.84      0.73      0.78       925
           1       0.94      0.97      0.95      3843

    accuracy                           0.92      4768
   macro avg       0.89      0.85      0.87      4768
weighted avg       0.92      0.92      0.92      4768

Acuracia == 0.9213506711409396
Acuracia balanceada == 0.8494223966355114


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_75_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_75_tweets.csv'

vec_size = 300
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Relatório========')
print(report)
print("Acuracia == " + str(acc_score))
print("Acuracia balanceada == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.89      0.74      0.81       925
           1       0.94      0.98      0.96      3843

    accuracy                           0.93      4768
   macro avg       0.92      0.86      0.88      4768
weighted avg       0.93      0.93      0.93      4768

Acuracia == 0.9324664429530202
Acuracia balanceada == 0.8596015218966039


<h4 align="center"> Set of the 25% best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_25_tweets.csv'

vec_size = 50
# dm: 1 = Distributed Memory (PV-DM), 0 = PV-DBOW
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.85      0.52      0.65      1316
           1       0.85      0.97      0.91      3792

    accuracy                           0.85      5108
   macro avg       0.85      0.75      0.78      5108
weighted avg       0.85      0.85      0.84      5108

Accuracy == 0.8537588097102584
Balanced accuracy == 0.745707248021751


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_25_tweets.csv'

vec_size = 100
# model_option: 1 selects the DM model
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.83      0.63      0.72      1316
           1       0.88      0.96      0.92      3792

    accuracy                           0.87      5108
   macro avg       0.86      0.79      0.82      5108
weighted avg       0.87      0.87      0.87      5108

Accuracy == 0.8731401722787784
Balanced accuracy == 0.7947330325882036


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_25_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_25_tweets.csv'

vec_size = 300
# model_option: 1 selects the DM model
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.87      0.67      0.76      1316
           1       0.89      0.96      0.93      3792

    accuracy                           0.89      5108
   macro avg       0.88      0.82      0.84      5108
weighted avg       0.89      0.89      0.88      5108

Accuracy == 0.8888018794048551
Balanced accuracy == 0.8184299372859836
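
The three cells above differ only in `vec_size`. As a sketch of how the repetition could be avoided, the hypothetical helper `run_grid` below loops over the vector sizes. It is not part of the notebook: it receives the training, inference and evaluation steps as arguments, which are assumed to be the notebook's `train_doc2vec_embeddings`, `infer_d2v_embeddings` and a wrapper around `run_classifier`.

```python
def run_grid(vec_sizes, model_option, train_fn, infer_fn, eval_fn, data):
    """Run one experiment per vector size and collect the scores.

    train_fn, infer_fn and eval_fn are stand-ins (assumptions) for the
    notebook's train_doc2vec_embeddings, infer_d2v_embeddings and
    run_classifier; data is the tuple returned by read_corpus.
    """
    X_train, X_test, y_train, y_test = data
    results = {}
    for vec_size in vec_sizes:
        # Train a fresh embedding model for this vector size.
        model = train_fn(vec_size, model_option, X_train)
        # Embed both splits with the same model.
        train_vecs = infer_fn(model, X_train)
        test_vecs = infer_fn(model, X_test)
        # Store whatever the evaluation step returns, keyed by size.
        results[vec_size] = eval_fn(train_vecs, test_vecs, y_train, y_test)
    return results


# Hypothetical usage inside the notebook:
# results = run_grid(
#     [50, 100, 300], 0,
#     train_doc2vec_embeddings, infer_d2v_embeddings,
#     lambda tr, te, ytr, yte: run_classifier(
#         LinearSVC(max_iter=10000), tr, te, ytr, yte),
#     read_corpus(train_fp, test_fp),
# )
```

Passing the steps in as arguments keeps the loop independent of the notebook's helper definitions, and a new classifier is created for every vector size.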


<h4 align="center"> Set of the average of the best translations</h4>

In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_media_tweets.csv'

vec_size = 50
# model_option: 1 selects the DM model
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.84      0.61      0.71      1132
           1       0.89      0.97      0.93      3821

    accuracy                           0.89      4953
   macro avg       0.87      0.79      0.82      4953
weighted avg       0.88      0.89      0.88      4953

Accuracy == 0.8855239248940037
Balanced accuracy == 0.7902783853042005


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_media_tweets.csv'

vec_size = 100
# model_option: 1 selects the DM model
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.84      0.68      0.75      1132
           1       0.91      0.96      0.93      3821

    accuracy                           0.90      4953
   macro avg       0.87      0.82      0.84      4953
weighted avg       0.89      0.90      0.89      4953

Accuracy == 0.8962245103977388
Balanced accuracy == 0.8199050856203813


In [None]:
train_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/train_media_tweets.csv'
test_fp = '/content/drive/MyDrive/Projects/IDPT2021/data/balanceado/70-30/test_media_tweets.csv'

vec_size = 300
# model_option: 1 selects the DM model
model_option = 0

X_train, X_test, y_train, y_test = read_corpus(train_fp, test_fp)

d2v_model = train_doc2vec_embeddings(vec_size, model_option, X_train)

X_train = infer_d2v_embeddings(d2v_model, X_train)
X_test = infer_d2v_embeddings(d2v_model, X_test)

classifier = LinearSVC(max_iter=10000)

acc_score, ballanced_acc_scores, report = run_classifier(classifier, X_train, X_test, y_train, y_test)

print('======Report========')
print(report)
print("Accuracy == " + str(acc_score))
print("Balanced accuracy == " + str(ballanced_acc_scores))

              precision    recall  f1-score   support

           0       0.86      0.68      0.76      1132
           1       0.91      0.97      0.94      3821

    accuracy                           0.90      4953
   macro avg       0.89      0.82      0.85      4953
weighted avg       0.90      0.90      0.90      4953

Accuracy == 0.9016757520694528
Balanced accuracy == 0.8221948308723503
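
To compare the runs on the 25% set and the average set side by side, the balanced accuracies printed above can be collected in one place. The sketch below uses those values rounded to four decimals. The dictionary keys and the helper `gain` are illustrative names, not part of the notebook.

```python
# Balanced accuracy per (dataset, vec_size), rounded from the outputs above.
balanced_acc = {
    "25": {50: 0.7457, 100: 0.7947, 300: 0.8184},
    "media": {50: 0.7903, 100: 0.8199, 300: 0.8222},
}


def gain(scores, small=50, large=300):
    """Absolute improvement when moving from `small` to `large` dimensions."""
    return scores[large] - scores[small]


for name, scores in balanced_acc.items():
    # vec_size with the highest balanced accuracy for this dataset.
    best = max(scores, key=scores.get)
    print(f"{name}: best vec_size={best}, gain 50->300 = {gain(scores):.4f}")
```

On both sets the balanced accuracy grows with the vector size, and the improvement from 50 to 300 dimensions is larger on the 25% set than on the average set.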
