## Testes de valência para o projeto final de IA369Y 2 Semestre 2018

Passos para tratar os dados com valência, testar e escolher um classificador para utilizar no projeto final de IA369Y.

1) Remover espaços duplos, quebras de linha, números e links do dataset e das frases a serem testadas.

2) Remover stopwords e aplicar o stemmer.

3) Treinar os classificadores.

4) Realizar as predições com os classificadores.

5) Avaliar as medidas obtidas com os classificadores.

In [1]:
import sys
sys.path.append('../..')

import csv
import codecs
import copy
import re
from random import shuffle

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from nltk import word_tokenize
import gensim

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, MinMaxScaler
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix

from utils import tokenizer, load_six_emotions, load_3_emotions

%matplotlib inline

np.warnings.filterwarnings('ignore')
np.random.seed(12345)

## Datasets

Para validar, serão utilizados dois datasets.

Os dos datasets foram obtidos do site minerando dados.

O primeiro deles tem tweets de política de Minas Gerais com rótulos de valência: positivo, negativo e neutro. Foi feito um tratamento para eliminar tweets repetidos e dessa forma sobraram 3016 tweets.

O segundo contém 2123 títulos de notícias com rótulos de valência: positivo, negativo e neutro.

In [2]:
#Download dos datasets
!ls -la
!rm -f *.csv
!rm -f *.txt
!wget https://raw.githubusercontent.com/rdenadai/sentiment-analysis-2018-president-election/edgarbanhesse/material-apoio/tweets_mg_tratados.csv
!wget https://raw.githubusercontent.com/rdenadai/sentiment-analysis-2018-president-election/edgarbanhesse/material-apoio/titulo_noticias.txt
!wget https://raw.githubusercontent.com/rdenadai/sentiment-analysis-2018-president-election/edgarbanhesse/material-apoio/50_tweets_mg.csv
!ls -la

total 1152
drwxr-xr-x 3 rdenadai rdenadai    4096 Nov 22 16:52 .
drwxr-xr-x 5 rdenadai rdenadai    4096 Nov 22 15:59 ..
-rw-r--r-- 1 rdenadai rdenadai   21708 Nov 22 16:52 emotion_regular_supervised_ml.ipynb
drwxr-xr-x 2 rdenadai rdenadai    4096 Nov 22 16:06 .ipynb_checkpoints
-rw-r--r-- 1 rdenadai rdenadai 1142042 Nov 22 16:05 valence_regular_supervised_ml.ipynb
--2018-11-22 16:57:38--  https://raw.githubusercontent.com/rdenadai/sentiment-analysis-2018-president-election/edgarbanhesse/material-apoio/tweets_mg_tratados.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.92.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.92.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 295061 (288K) [text/plain]
Saving to: ‘tweets_mg_tratados.csv’


2018-11-22 16:57:38 (32.6 MB/s) - ‘tweets_mg_tratados.csv’ saved [295061/295061]

--2018-11-22 16:57:39--  https://raw.githubusercontent.com/rdenadai/sentiment-analys

In [3]:
#Carregando os datasets
def carregar(filename):
    frases = []
    with open(filename, 'r', encoding='utf-8') as h:
        reader = csv.reader(h, delimiter='|')
        for row in reader:
            frase = tokenizer(row[0]).strip()
            valencia = row[1].upper()
            if len(frase) > 5:
                frases.append((valencia, frase))
    return frases

frases = carregar('tweets_mg_tratados.csv')
frases += carregar('titulo_noticias.txt')

shuffle(frases)

print(frases[:5])

[('NEUTRO', 'aéci govern min fez mais viagens pra rio janeir cad indign'), ('NEUTRO', 'concórd indic pap invest nest mês conf'), ('NEGATIVO', 'bat caminhã carr deix cinc mort sul min ger estad min'), ('NEGATIVO', 'obras evit rodízi águ paul som milhõ'), ('POSITIVO', 'filhob menor aprend roub tráfic drog uberlând')]


In [5]:
#Carrega os datasets em separado
tweets_mg = []
titulo_noticias = []

tweets_mg = carregar('tweets_mg_tratados.csv')
titulo_noticias = carregar('titulo_noticias.txt')

print(tweets_mg[:5])
print('-' * 20)
print(titulo_noticias[:5])

[('NEUTRO', 'bom band mort'), ('NEUTRO', 'fóruns region govern vã eleg nov prefeit vereador'), ('NEGATIVO', 'govern min ger compr mais dois helicópter'), ('POSITIVO', 'polic milit faz prisõ aprend armas fog drog bail funk'), ('POSITIVO', 'cab políc milit anos folg imped roub pad noit dest')]
--------------------
[('POLARIDADE', '\ufeffnotic'), ('POSITIVO', 'diretor petrobr neg organiz crimin estatal notíc brasil'), ('NEUTRO', 'tom cautel janet yelen pression bols fortalec dól'), ('POSITIVO', 'bovesp caminh nov máxim ano quart alta segu'), ('NEGATIVO', 'após abrir estável ibovesp pass registr qued petrobr val')]


In [9]:
#all_datasets
afrases = []
avalencias = []
for valencia, frase in frases:
    afrases.append(frase)
    avalencias.append(valencia)
    
print(afrases[:5])
print(avalencias[:5])
print('-' * 20)


#tweets_mg
atweets_mg = []
aval_tweets_mg = []
for valencia, frase in tweets_mg:
    atweets_mg.append(frase)
    aval_tweets_mg.append(valencia)

print(atweets_mg[:5])
print(aval_tweets_mg[:5])
print('-' * 20)

#titulo_noticias
atitulo_noticias = []
aval_titulo_noticias = []
for valencia, frase in titulo_noticias:
    atitulo_noticias.append(frase)
    aval_titulo_noticias.append(valencia)

print(atitulo_noticias[:5])
print(aval_titulo_noticias[:5])

['aéci govern min fez mais viagens pra rio janeir cad indign', 'concórd indic pap invest nest mês conf', 'bat caminhã carr deix cinc mort sul min ger estad min', 'obras evit rodízi águ paul som milhõ', 'filhob menor aprend roub tráfic drog uberlând']
['NEUTRO', 'NEUTRO', 'NEGATIVO', 'NEGATIVO', 'POSITIVO']
--------------------
['bom band mort', 'fóruns region govern vã eleg nov prefeit vereador', 'govern min ger compr mais dois helicópter', 'polic milit faz prisõ aprend armas fog drog bail funk', 'cab políc milit anos folg imped roub pad noit dest']
['NEUTRO', 'NEUTRO', 'NEGATIVO', 'POSITIVO', 'POSITIVO']
--------------------
['\ufeffnotic', 'diretor petrobr neg organiz crimin estatal notíc brasil', 'tom cautel janet yelen pression bols fortalec dól', 'bovesp caminh nov máxim ano quart alta segu', 'após abrir estável ibovesp pass registr qued petrobr val']
['POLARIDADE', 'POSITIVO', 'NEUTRO', 'POSITIVO', 'NEGATIVO']


In [10]:
def run_ml_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    print(f'Modelo   : {model.__class__.__name__}')
    print(f'Acurácia : {np.round(model.score(X_test, y_test) * 100, 2)}%')
    print('-' * 20)

def split_data(X, y):
    test_size = .3
    random_state = 0
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return {
        'X_train': X_train,
        'y_train': y_train,
        'X_test': X_test,
        'y_test': y_test
    }

## Classificadores

In [11]:
classifiers = (
    MultinomialNB(),
    ComplementNB(),
    LogisticRegression(multi_class='auto', solver='lbfgs'),
    RandomForestClassifier(n_estimators=50, min_samples_split=5, random_state=0),
    KNeighborsClassifier(n_neighbors=8, algorithm='auto'),
    MLPClassifier(hidden_layer_sizes=(100, 25), max_iter=500, random_state=0),
    LinearSVC(max_iter=500),
    SVC(gamma='auto', max_iter=500),
)

## TF-IDF

In [12]:
vec_tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = vec_tfidf.fit_transform(afrases)

vec_tfidf_tmg = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf_tmg = vec_tfidf_tmg.fit_transform(atweets_mg)

vec_tfidf_tn = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf_tn = vec_tfidf_tn.fit_transform(atitulo_noticias)

In [13]:
print("\nall_datasets")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_tfidf, avalencias))
    except:
        pass

print("\ntweets_mg")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_tfidf_tmg, aval_tweets_mg))
    except:
        pass

print("\ntitulo_noticias")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_tfidf_tn, aval_titulo_noticias))
    except:
        pass      


all_datasets
Modelo   : MultinomialNB
Acurácia : 60.87%
--------------------
Modelo   : ComplementNB
Acurácia : 62.17%
--------------------
Modelo   : LogisticRegression
Acurácia : 66.47%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 64.39%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 61.52%
--------------------
Modelo   : MLPClassifier
Acurácia : 64.32%
--------------------
Modelo   : LinearSVC
Acurácia : 66.86%
--------------------
Modelo   : SVC
Acurácia : 58.46%
--------------------

tweets_mg
Modelo   : MultinomialNB
Acurácia : 62.18%
--------------------
Modelo   : ComplementNB
Acurácia : 61.07%
--------------------
Modelo   : LogisticRegression
Acurácia : 65.18%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 63.74%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 61.07%
--------------------
Modelo   : MLPClassifier
Acurácia : 62.74%
--------------------
Modelo   : LinearSVC
Acurácia : 64.29%
--------------

## LSA (usando TF-IDF)

In [0]:
#all_datasets
svd = TruncatedSVD(n_components=100, n_iter=50, random_state=0)
normalizer = MinMaxScaler(copy=False)
lsa = make_pipeline(svd, normalizer)
X_svd = lsa.fit_transform(X_tfidf)

print("\nall_datasets")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_svd, avalencias))
    except Exception as e:
        print(e)

#tweets_mg
X_svd = lsa.fit_transform(X_tfidf_tmg)

print("\ntweets_mg")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_svd, aval_tweets_mg))
    except:
        pass


#titulo_noticias
X_svd = lsa.fit_transform(X_tfidf_tn)

print("\ntitulo_noticias")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_svd, aval_titulo_noticias))
    except:
        pass


all_datasets
Modelo   : MultinomialNB
Acurácia : 42.13%
--------------------
multi_class should be either multinomial or ovr, got auto
Modelo   : RandomForestClassifier
Acurácia : 59.88%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 55.27%
--------------------
Modelo   : MLPClassifier
Acurácia : 57.22%
--------------------
Modelo   : LinearSVC
Acurácia : 61.44%
--------------------
Modelo   : SVC
Acurácia : 42.78%
--------------------

tweets_mg
Modelo   : MultinomialNB
Acurácia : 45.17%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 62.15%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 58.27%
--------------------
Modelo   : MLPClassifier
Acurácia : 62.71%
--------------------
Modelo   : LinearSVC
Acurácia : 63.82%
--------------------
Modelo   : SVC
Acurácia : 44.95%
--------------------

titulo_noticias
Modelo   : MultinomialNB
Acurácia : 46.47%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 61.54%
------------

## LDA (usando TF-IDF)

In [0]:
#all_datasets
lda = LatentDirichletAllocation(n_components=200, max_iter=50, random_state=0, n_jobs=5)
normalizer = MinMaxScaler(copy=False)
lda = make_pipeline(lda, normalizer)
X_lda = lda.fit_transform(X_tfidf)

print("\nall_datasets")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_lda, avalencias))
    except:
        pass


#tweets_mg
X_lda = lda.fit_transform(X_tfidf_tmg)

print("\ntweets_mg")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_lda, aval_tweets_mg))
    except:
        pass


#titulo_noticias
X_lda = lda.fit_transform(X_tfidf_tn)

print("\ntitulo_noticias")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_lda, aval_titulo_noticias))
    except:
        pass      


all_datasets
Modelo   : MultinomialNB
Acurácia : 54.36%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 50.26%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 50.78%
--------------------
Modelo   : MLPClassifier
Acurácia : 52.8%
--------------------
Modelo   : LinearSVC
Acurácia : 54.1%
--------------------
Modelo   : SVC
Acurácia : 51.82%
--------------------

tweets_mg
Modelo   : MultinomialNB
Acurácia : 56.6%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 54.38%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 55.94%
--------------------
Modelo   : MLPClassifier
Acurácia : 55.72%
--------------------
Modelo   : LinearSVC
Acurácia : 56.16%
--------------------
Modelo   : SVC
Acurácia : 55.72%
--------------------

titulo_noticias
Modelo   : MultinomialNB
Acurácia : 51.49%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 47.88%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 48.98%
--

## Count

In [0]:
#all_datasets
vec_count = CountVectorizer(ngram_range=(1, 2))
X_count = vec_count.fit_transform(afrases)

print("\nall_datasets")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_count, avalencias))
    except:
        pass

      
#tweets_mg
vec_count = CountVectorizer(ngram_range=(1, 2))
X_count_tmg = vec_count.fit_transform(atweets_mg)

print("\ntweets_mg")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_count_tmg, aval_tweets_mg))
    except:
        pass
      
      
#titulo_noticias
vec_count = CountVectorizer(ngram_range=(1, 2))
X_count_tn = vec_count.fit_transform(atitulo_noticias)

print("\ntitulo_noticias")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_count_tn, aval_titulo_noticias))
    except:
        pass      


all_datasets
Modelo   : MultinomialNB
Acurácia : 64.37%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 65.02%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 41.68%
--------------------
Modelo   : MLPClassifier
Acurácia : 65.15%
--------------------
Modelo   : LinearSVC
Acurácia : 65.8%
--------------------
Modelo   : SVC
Acurácia : 43.37%
--------------------

tweets_mg
Modelo   : MultinomialNB
Acurácia : 62.15%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 63.26%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 49.28%
--------------------
Modelo   : MLPClassifier
Acurácia : 61.38%
--------------------
Modelo   : LinearSVC
Acurácia : 61.49%
--------------------
Modelo   : SVC
Acurácia : 43.51%
--------------------

titulo_noticias
Modelo   : MultinomialNB
Acurácia : 66.09%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 64.68%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 37.36%


## LSA (usando Count)

In [0]:
#all_datasets
svd = TruncatedSVD(n_components=100, n_iter=50, random_state=0)
normalizer = MinMaxScaler(copy=False)
lda = make_pipeline(svd, normalizer)
X_svd = lda.fit_transform(X_count)

print("\nall_datasets")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_svd, avalencias))
    except:
        pass
      

#tweets_mg
X_svd = lda.fit_transform(X_count_tmg)

print("\ntweets_mg")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_svd, aval_tweets_mg))
    except:
        pass
      
      
#titulos_noticias
X_svd = lda.fit_transform(X_count_tn)

print("\ntitulos_noticias")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_svd, aval_titulo_noticias))
    except:
        pass      


all_datasets
Modelo   : MultinomialNB
Acurácia : 42.39%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 59.49%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 55.66%
--------------------
Modelo   : MLPClassifier
Acurácia : 59.82%
--------------------
Modelo   : LinearSVC
Acurácia : 60.79%
--------------------
Modelo   : SVC
Acurácia : 46.88%
--------------------

tweets_mg
Modelo   : MultinomialNB
Acurácia : 49.94%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 62.71%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 60.16%
--------------------
Modelo   : MLPClassifier
Acurácia : 63.15%
--------------------
Modelo   : LinearSVC
Acurácia : 63.6%
--------------------
Modelo   : SVC
Acurácia : 58.6%
--------------------

titulos_noticias
Modelo   : MultinomialNB
Acurácia : 46.47%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 60.6%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 46.15%
-

## LDA (usando Count)

In [0]:
#all_datasets
lda = LatentDirichletAllocation(n_components=200, max_iter=50, random_state=0, n_jobs=5)
normalizer = MinMaxScaler(copy=False)
lda = make_pipeline(lda, normalizer)
X_lda = lda.fit_transform(X_count)

print("\nall_datasets")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_lda, avalencias))
    except:
        pass


#tweets_mg
X_lda = lda.fit_transform(X_count_tmg)

print("\ntweets_mg")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_lda, aval_tweets_mg))
    except:
        pass
      
      
#titulo_noticias
X_lda = lda.fit_transform(X_count_tn)

print("\ntitulo_noticias")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X_lda, aval_titulo_noticias))
    except:
        pass


all_datasets
Modelo   : MultinomialNB
Acurácia : 54.81%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 53.32%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 52.41%
--------------------
Modelo   : MLPClassifier
Acurácia : 54.1%
--------------------
Modelo   : LinearSVC
Acurácia : 54.29%
--------------------
Modelo   : SVC
Acurácia : 48.63%
--------------------

tweets_mg
Modelo   : MultinomialNB
Acurácia : 61.71%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 59.93%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 56.71%
--------------------
Modelo   : MLPClassifier
Acurácia : 57.71%
--------------------
Modelo   : LinearSVC
Acurácia : 59.49%
--------------------
Modelo   : SVC
Acurácia : 57.05%
--------------------

titulo_noticias
Modelo   : MultinomialNB
Acurácia : 49.76%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 46.31%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 43.49%


## Count + TF-IDF + Word2Vec

In [0]:
#all_datasets
# Count
vec_count = CountVectorizer()
X_count = vec_count.fit_transform(afrases)
weights_count = pd.DataFrame(np.round(X_count.toarray().T, 8), index=vec_count.get_feature_names())

# TF-IDF
vec_tfidf = TfidfVectorizer()
X_tfidf = vec_tfidf.fit_transform(afrases)
weights_tfidf = pd.DataFrame(np.round(X_tfidf.toarray().T, 8), index=vec_tfidf.get_feature_names())

# Word2Vec preprocessing
frases_w2v = []
for frase in afrases:
    bigram = []
    p_frase = word_tokenize(frase)
    for m, palavra in enumerate(p_frase):
        next_p = None
        try:
            next_p = p_frase[m+1]
        except:
            pass
        bigram += [f'{palavra}']
#         if next_p:
#             bigram += [f'{palavra} {next_p}']
    frases_w2v += [bigram]

# Word2Vec
model = gensim.models.Word2Vec(
    sentences=frases_w2v,
    sg=1,
    hs=1,
    size=1,
    window=25,
    min_count=1,
    seed=0,
    workers=10)
model.train(frases_w2v, total_examples=len(frases_w2v), epochs=1000)

(39902405, 45879000)

In [0]:
#all_datasets
r_words = {}
for word in vec_count.get_feature_names():
    idx = weights_count.index.get_loc(word)
    w2c_val = .1
    try:
        w2c_val = model.wv[word]
    except:
        pass
    r_words[word] = (weights_tfidf.iloc[idx].values + weights_count.iloc[idx].values) * w2c_val
lwor = list(r_words.keys())
X = np.asarray(list(r_words.values()))
weights = pd.DataFrame(X, index=lwor)
X = X.T

normalizer = Normalizer(copy=False)
X = normalizer.fit_transform(X)

print("\nall_datasets")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X, avalencias))
    except:
        pass


all_datasets
Modelo   : RandomForestClassifier
Acurácia : 63.2%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 60.4%
--------------------
Modelo   : MLPClassifier
Acurácia : 62.35%
--------------------
Modelo   : LinearSVC
Acurácia : 65.08%
--------------------
Modelo   : SVC
Acurácia : 52.47%
--------------------


In [53]:
#tweets_mg
# Count
vec_count = CountVectorizer()
X_count = vec_count.fit_transform(atweets_mg)
weights_count = pd.DataFrame(np.round(X_count.toarray().T, 8), index=vec_count.get_feature_names())

# TF-IDF
vec_tfidf = TfidfVectorizer()
X_tfidf = vec_tfidf.fit_transform(atweets_mg)
weights_tfidf = pd.DataFrame(np.round(X_tfidf.toarray().T, 8), index=vec_tfidf.get_feature_names())

# Word2Vec preprocessing
frases_w2v = []
for frase in atweets_mg:
    bigram = []
    p_frase = word_tokenize(frase)
    for m, palavra in enumerate(p_frase):
        next_p = None
        try:
            next_p = p_frase[m+1]
        except:
            pass
        bigram += [f'{palavra}']
#         if next_p:
#             bigram += [f'{palavra} {next_p}']
    frases_w2v += [bigram]

# Word2Vec
model = gensim.models.Word2Vec(
    sentences=frases_w2v,
    sg=1,
    hs=1,
    size=1,
    window=25,
    min_count=1,
    seed=0,
    workers=10)
model.train(frases_w2v, total_examples=len(frases_w2v), epochs=1000)

(23296133, 29024000)

In [54]:
#tweets_mg
r_words = {}
for word in vec_count.get_feature_names():
    idx = weights_count.index.get_loc(word)
    w2c_val = .1
    try:
        w2c_val = model.wv[word]
    except:
        pass
    r_words[word] = (weights_tfidf.iloc[idx].values + weights_count.iloc[idx].values) * w2c_val
lwor = list(r_words.keys())
X = np.asarray(list(r_words.values()))
weights = pd.DataFrame(X, index=lwor)
X = X.T

normalizer = Normalizer(copy=False)
X = normalizer.fit_transform(X)

print("\ntweets_mg")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X, aval_tweets_mg))
    except:
        pass


tweets_mg
Modelo   : RandomForestClassifier
Acurácia : 64.48%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 59.6%
--------------------
Modelo   : MLPClassifier
Acurácia : 61.04%
--------------------
Modelo   : LinearSVC
Acurácia : 62.38%
--------------------
Modelo   : SVC
Acurácia : 43.51%
--------------------


In [55]:
#titulo_noticias
# Count
vec_count = CountVectorizer()
X_count = vec_count.fit_transform(atitulo_noticias)
weights_count = pd.DataFrame(np.round(X_count.toarray().T, 8), index=vec_count.get_feature_names())

# TF-IDF
vec_tfidf = TfidfVectorizer()
X_tfidf = vec_tfidf.fit_transform(atitulo_noticias)
weights_tfidf = pd.DataFrame(np.round(X_tfidf.toarray().T, 8), index=vec_tfidf.get_feature_names())

# Word2Vec preprocessing
frases_w2v = []
for frase in atitulo_noticias:
    bigram = []
    p_frase = word_tokenize(frase)
    for m, palavra in enumerate(p_frase):
        next_p = None
        try:
            next_p = p_frase[m+1]
        except:
            pass
        bigram += [f'{palavra}']
#         if next_p:
#             bigram += [f'{palavra} {next_p}']
    frases_w2v += [bigram]

# Word2Vec
model = gensim.models.Word2Vec(
    sentences=frases_w2v,
    sg=1,
    hs=1,
    size=1,
    window=25,
    min_count=1,
    seed=0,
    workers=10)
model.train(frases_w2v, total_examples=len(frases_w2v), epochs=1000)

(14144745, 16855000)

In [57]:
#titulo_noticias
r_words = {}
for word in vec_count.get_feature_names():
    idx = weights_count.index.get_loc(word)
    w2c_val = .1
    try:
        w2c_val = model.wjv[word]
    except:
        pass
    r_words[word] = (weights_tfidf.iloc[idx].values + weights_count.iloc[idx].values) * w2c_val
lwor = list(r_words.keys())
X = np.asarray(list(r_words.values()))
weights = pd.DataFrame(X, index=lwor)
X = X.T

normalizer = Normalizer(copy=False)
X = normalizer.fit_transform(X)

print("\ntitulo_noticias")
for classifier in classifiers:
    try:
        run_ml_model(classifier, **split_data(X, aval_titulo_noticias))
    except:
        pass


titulo_noticias
Modelo   : MultinomialNB
Acurácia : 61.22%
--------------------
Modelo   : RandomForestClassifier
Acurácia : 62.95%
--------------------
Modelo   : KNeighborsClassifier
Acurácia : 59.81%
--------------------
Modelo   : MLPClassifier
Acurácia : 64.36%
--------------------
Modelo   : LinearSVC
Acurácia : 67.5%
--------------------
Modelo   : SVC
Acurácia : 46.47%
--------------------
