[View in Colaboratory](https://colab.research.google.com/github/rdenadai/TxtP-Study-Notebooks/blob/master/notebooks/text_classification_example.ipynb)

## Analysis and Validation of Texts in Portuguese


### References:

 - Libraries:
  - [NLTK](http://www.nltk.org/howto/portuguese_en.html)
  - [spaCy](https://spacy.io/usage/spacy-101)
 
 - Data:
  - [Frases para Face](https://www.frasesparaface.com.br/outras-frases/)
 
 - Tutorials
  - [Utilizando processamento de linguagem natural para criar uma sumarização automática de textos](https://medium.com/@viniljf/utilizando-processamento-de-linguagem-natural-para-criar-um-sumariza%C3%A7%C3%A3o-autom%C3%A1tica-de-textos-775cb428c84e)
  - [Latent Semantic Analysis (LSA) for Text Classification Tutorial](http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/)
  - [Machine Learning :: Cosine Similarity for Vector Space Models (Part III)](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/)
  - [My Notes for Singular Value Decomposition with Interactive Code ](https://towardsdatascience.com/my-notes-for-singular-value-decomposition-with-interactive-code-feat-peter-mills-7584f4f2930a)
  - [https://plot.ly/ipython-notebooks/principal-component-analysis/](https://plot.ly/ipython-notebooks/principal-component-analysis/)
 
 - Topic Modelling
  - [Topic Modeling with LSA, PLSA, LDA & lda2Vec](https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05)
  - [Integrating Topics and Syntax (HMM-LDA)](http://psiexp.ss.uci.edu/research/papers/composite.pdf)
 
 - Others
  - [PANAS-t: A Pychometric Scale for Measuring Sentiments on Twitter](https://arxiv.org/abs/1308.1857)
  - [Um Método de Identificação de Emoções em Textos Curtos para o Português do Brasil](http://www.ppgia.pucpr.br/~paraiso/Projects/Emocoes/Emocoes.html)
  - [An Introduction to Latent Semantic Analysis](http://lsa.colorado.edu/papers/dp1.LSAintro.pdf)
  - [Unsupervised Emotion Detection from Text using Semantic and Syntactic Relations](http://www.cse.yorku.ca/~aan/research/paper/Emo_WI10.pdf)
  - [An Efficient Method for Document Categorization Based on Word2vec and Latent Semantic Analysis](https://ieeexplore.ieee.org/document/7363382)
  - [Sentiment Classification of Documents Based on Latent Semantic Analysis](https://link.springer.com/chapter/10.1007/978-3-642-21802-6_57)
  - [Applying latent semantic analysis to classify emotions in Thai text](https://ieeexplore.ieee.org/document/5486137)
  - [Text Emotion Classification Research Based on Improved Latent Semantic Analysis Algorithm](https://www.researchgate.net/publication/266651993_Text_Emotion_Classification_Research_Based_on_Improved_Latent_Semantic_Analysis_Algorithm)



### Installation

In [2]:
!pip install -U spacy
!python -m spacy download en
!python -m spacy download pt
# !pip install feedparser

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.0.12)

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')


    Linking successful
    /usr/local/lib/python3.6/dist-packages/pt_core_news_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/pt

    You can now load the model via spacy.load('pt')



In [3]:
# Download Oplexicon
!rm -rf wget-log*
!rm -rf oplexicon_v3.0
!wget -O oplexicon_v3.0.zip https://github.com/rdenadai/sentiment-analysis-2018-president-election/blob/master/dataset/oplexicon_v3.0.zip?raw=true
!unzip oplexicon_v3.0.zip
!ls -lh


Redirecting output to ‘wget-log’.
Archive:  oplexicon_v3.0.zip
  inflating: oplexicon_v3.0/lexico_v3.0.txt  
  inflating: oplexicon_v3.0/README   
total 120K
drwxr-xr-x 2 root root 4.0K Oct 10 12:20 oplexicon_v3.0
-rw-r--r-- 1 root root 102K Oct 10 12:20 oplexicon_v3.0.zip
drwxr-xr-x 2 root root 4.0K Sep 28 23:32 sample_data
-rw-r--r-- 1 root root 1.6K Oct 10 12:20 wget-log


### Imports

In [4]:
import nltk

nltk.download('rslp')
nltk.download('averaged_perceptron_tagger')
nltk.download('floresta')
nltk.download('mac_morpho')
nltk.download('machado')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('words')

import concurrent.futures
import codecs
import re
import pprint
from random import shuffle
from string import punctuation
import copy
from unicodedata import normalize

import numpy as np
from scipy.sparse.linalg import svds
from scipy.linalg import svd
import pandas as pd
import spacy
from spacy.lang.pt.lemmatizer import LOOKUP

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
from sklearn.utils.extmath import randomized_svd

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import floresta as flt
from nltk.corpus import machado as mch
from nltk.corpus import mac_morpho as mcm


nlp = spacy.load('pt')
pp = pprint.PrettyPrinter(indent=4)
stemmer = nltk.stem.RSLPStemmer()

[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Package rslp is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package floresta to /root/nltk_data...
[nltk_data]   Package floresta is already up-to-date!
[nltk_data] Downloading package mac_morpho to /root/nltk_data...
[nltk_data]   Package mac_morpho is already up-to-date!
[nltk_data] Downloading package machado to /root/nltk_data...
[nltk_data]   Package machado is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk

### Functions

In [0]:
def normalization(x, a, b):
    # Linearly rescale x to the interval [a, b]
    return (b - a) * (x - np.min(x)) / np.ptp(x) + a

def normalization_01(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))


def remover_acentos(txt):
    return normalize('NFKD', txt).encode('ASCII', 'ignore').decode('ASCII')


def load_oplexicon_data(filename):
    spacy_conv = {
        'adj': 'ADJ',
        'n': 'NOUN',
        'vb': 'VERB',
        'det': 'DET',
        'emot': 'EMOT',
        'htag': 'HTAG'
    }
    
    data = {}
    with codecs.open(filename, 'r', 'UTF-8') as hf:
        lines = hf.readlines()
        for line in lines:
            info = line.lower().split(',')
            if len(info[0].split()) <= 1:
                info[1] = [spacy_conv.get(tag) for tag in info[1].split()]
                word, tags, sent = info[:3]
                if 'HTAG' not in tags and 'EMOT' not in tags:
                    word = remover_acentos(word.replace('-se', ''))
                    word = LOOKUP.get(word, word)
                    # stem = stemmer.stem(word)
                    if word in data:
                        data[word] += [{
                            'word': [word],
                            'tags': tags,
                            'sentiment': sent
                        }]
                    else:
                        data[word] = [{
                            'word': [word],
                            'tags': tags,
                            'sentiment': sent
                        }]
    return data
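
The two small helpers above (accent stripping and min-max scaling) can be tried in isolation. The following is a minimal sketch, not the notebook's own code: the names `strip_accents` and `rescale` are hypothetical, and `rescale` assumes the goal is to map values onto an arbitrary interval `[a, b]`.

```python
import numpy as np
from unicodedata import normalize


def strip_accents(text):
    # Decompose accented characters (NFKD), then drop the non-ASCII combining marks.
    return normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')


def rescale(x, a, b):
    # Hypothetical helper: linearly map the values of x onto the interval [a, b].
    x = np.asarray(x, dtype=float)
    return (b - a) * (x - np.min(x)) / np.ptp(x) + a


print(strip_accents('informações'))      # informacoes
print(rescale([2.0, 4.0, 6.0], -1, 1))   # values -1, 0 and 1
```

Note that `rescale` divides by the range of `x`, so a constant input would produce a division by zero; a real pipeline would need to guard against that case.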

### Usage

In [6]:
frase = u"Gostaria de saber mais informações sobre a Amazon. Uma excelente loja de produtos online!".lower()
doc = nlp(remover_acentos(frase))
pp.pprint([(w.text, w.pos_) for w in doc])

[   ('gostaria', 'VERB'),
    ('de', 'ADP'),
    ('saber', 'VERB'),
    ('mais', 'ADV'),
    ('informacoes', 'NOUN'),
    ('sobre', 'ADP'),
    ('a', 'DET'),
    ('amazon', 'NOUN'),
    ('.', 'PUNCT'),
    ('uma', 'DET'),
    ('excelente', 'ADJ'),
    ('loja', 'NOUN'),
    ('de', 'ADP'),
    ('produtos', 'NOUN'),
    ('online', 'ADJ'),
    ('!', 'PUNCT')]


In [7]:
opx = load_oplexicon_data('oplexicon_v3.0/lexico_v3.0.txt')
print('Oplexicon size: ', len(opx))
print('Examples: ')

view = opx.items()
pp.pprint(list(view)[:7])

Oplexicon size:  15958
Examples: 
[   ('ab-rogar', [{'sentiment': '-1', 'tags': ['VERB'], 'word': ['ab-rogar']}]),
    ('ababadar', [{'sentiment': '0', 'tags': ['VERB'], 'word': ['ababadar']}]),
    (   'ababelar',
        [   {'sentiment': '-1', 'tags': ['VERB'], 'word': ['ababelar']},
            {'sentiment': '1', 'tags': ['VERB'], 'word': ['ababelar']}]),
    ('abacanar', [{'sentiment': '1', 'tags': ['VERB'], 'word': ['abacanar']}]),
    ('abacinar', [{'sentiment': '1', 'tags': ['VERB'], 'word': ['abacinar']}]),
    (   'abafar',
        [   {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafar']},
            {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafar']},
            {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafar']},
            {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafar']},
            {'sentiment': '-1', 'tags': ['VERB'], 'word': ['abafar']}]),
    (   'abafante',
        [   {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafante']},
            {'s

In [0]:
ALEGRIA = np.array(['abundante', 'acalmar', 'aceitável', 'aclamar', 'aconchego', 'adesão', 'admirar', 'adorar', 'afável', 'afeição', 'afeto', 'afortunado', 'agradar', 'ajeitar', 'alívio', 'amabilidade', 'amado', 'amar', 'amável', 'amenizar', 'ameno', 'amigável', 'amistoso', ' amizade', ' amor', ' animação', ' ânimo', 'anseio', 'ânsia', 'ansioso', 'apaixonado', 'apaziguar', 'aplausos', 'apoiar', 'aprazer', 'apreciar', 'aprovação', 'aproveitar', 'ardor', 'armirar', 'arrumar', 'atração', 'atraente', 'atrair', 'avidamente', 'avidez', 'ávido', 'belo', 'bem-estar', 'beneficência', 'beneficiador', 'benefício', 'benéfico', 'benevocência', 'benignamente', 'benígno', 'bom', 'bondade', 'bondoso', 'bonito', 'brilhante', 'brincadeira', 'calma', 'calor', 'caridade', 'caridoso', 'carinho', 'cativar', 'charme', 'cheery', 'clamar', 'cofortar', 'coleguismo', 'comédia', 'cômico', 'comover', 'compaixão', 'companheirismo', 'compatibilidade', 'compatível', 'complacência', 'completar', 'compreensão', 'conclusão', 'concretização', 'condescendência', 'confiança', 'confortante', 'congratulação', 'conquistar', 'consentir', 'consideração', 'consolação', 'contentamento', 'coragem', 'cordial', 'considerar', 'consolo', 'contente', 'cuidadoso', 'cumplicidade', 'dedicação', 'deleitado', 'delicadamente', 'delicadeza', 'delicado', 'desejar', 'despreocupação', 'devoção', 'devoto', 'diversão', 'divertido', 'encantar', 'elogiado', 'emoção', 'emocionante', 'emotivo', 'empatia', 'empático', 'empolgação', 'enamorar', 'encantado', 'encorajado', 'enfeitar', 'engraçado', 'entendimento', 'entusiasmadamente', 'entusiástico', 'esperança', 'esplendor', 'estima', 'estimar', 'estimulante', 'euforia', 'eufórico', 'euforizante', 'exaltar', 'excelente', 'excitar', 'expansivo', 'extasiar', 'exuberante', 'exultar', 'fã', 'facilitar', 'familiaridade', 'fascinação', 'fascínio', 'favor', 'favorecer', 'favorito', 'felicidade', 'feliz', 'festa', 'festejar', 'festivo', 'fidelidade', 'fiel', 'filantropia', 'filantrópico', 
'fraterno', 'ganhar', 'generosidade', 'generoso', 'gentil', 'glória', 'glorificar', 'gostar', 'gostoso', 'gozar', 'gratificante', 'grato', 'hilariante', 'honra', 'humor', 'impressionar', 'incentivar', 'incentivo', 'inclinação', 'incrível', 'inspirar', 'interessar', 'interesse', 'irmandade', 'jovial', 'jubilante', 'júbilo', 'lealdade', 'legítimo', 'leveza', 'louvar', 'louvável', 'louvavelmente', 'lucrativo', 'lucro', 'maravilhoso', 'melhor', 'obter', 'obteve', 'ode', 'orgulho', 'paixão', 'parabenizar', 'paz', 'piedoso', 'positivo', 'prazenteiro', 'prazer', 'predileção', 'preencher', 'preferência', 'preferido', 'promissor', 'prosperidade', 'proteção', 'proteger', 'revigorar', 'simpático', 'vantajoso', 'protetor', 'risada', 'sobrevivência', 'vencedor', 'proveito', 'risonho', 'sobreviver', 'veneração', 'provilégio', 'romântico', 'sorte', 'ventura', 'querer', 'romantismo', 'sortudo', 'vida', 'radiante', 'saciar', 'sucesso', 'vigor', 'realizar', 'saciável', 'surpreender', 'virtude', 'recomendável', 'satisfação', 'tenro', 'virtuoso', 'reconhecer', 'satisfatoriamente', 'ternura', 'vitória', 'recompensa', 'satisfatório', 'torcer', 'vitorioso', 'recrear', 'satisfazer', 'tranquilo', 'viver', 'recreativo', 'satisfeito', 'tranquilo', 'vivo', 'recreação', 'sedução', 'triunfo', 'zelo', 'regozijar', 'seduzir', 'triunfal', 'zeloso', 'respeitar', 'sereno', 'triunfante', 'ressuscitar', 'simpaticamente', 'vantagem',])
DESGOSTO = np.array(['abominável', 'adoentado', 'amargamente', 'antipatia', 'antipático', 'asco', 'asqueroso', 'aversão', 'chatear', 'chateação', 'desagrado', 'desagradável', 'desprezível', 'detestável', 'doente', 'doença', 'enfermidade', 'enjoativo', 'enjoo', 'enjôo', 'feio', 'fétido', 'golfar', 'grave', 'gravidade', 'grosseiro', 'grosso', 'horrível', 'ignóbil', 'ilegal', 'incomodar', 'incômdo', 'indecente', 'indisposição', 'indisposto', 'inescrupuloso', 'maldade', 'maldoso', 'malvado', 'mau', 'nauseabundo', 'nauseante', 'nausear', 'nauseoso', 'nojento', 'nojo', 'náusea', 'obsceno', 'obstruir', 'obstrução', 'ofensivo', 'patético', 'perigoso', 'repelente', 'repelir', 'repugnante', 'repulsa', 'repulsivo', 'repulsão', 'rude', 'sujeira', 'sujo', 'terrivelmente', 'terrível', 'torpe', 'travesso', 'travessura', 'ultrajante', 'vil', 'vomitar', 'vômito',])
MEDO = np.array(['abominável', 'afugentar', 'alarmar', 'alerta', 'ameaça', 'amedrontar', 'angustia', 'angústia', 'angustiadamente', 'ansiedade', 'ansioso', 'apavorar', 'apreender', 'apreensão', 'apreensivo', 'arrepio', 'assombrado', 'assombro', 'assustado', 'assustadoramente', 'atemorizar', 'aterrorizante', 'brutal', 'calafrio', 'chocado', 'chocante', 'consternado', 'covarde', 'cruel', 'crueldade', 'cruelmente', 'cuidado', 'cuidadosamente', 'cuidadoso', 'defender', 'defensor', 'defesa', 'derrotar', 'desconfiado', 'desconfiança', 'desencorajar', 'desespero', 'deter', 'envergonhado', 'escandalizado', 'escuridão', 'espantoso', 'estremecedor', 'estremecer', 'expulsar', 'feio', 'friamente', 'fugir', 'hesitar', 'horrendo', 'horripilante', 'horrível', 'horrivelmente', 'horror', 'horrorizar', 'impaciência', 'impaciente', 'impiedade', 'impiedoso', 'indecisão', 'inquieto', 'insegurança', 'inseguro', 'intimidar', 'medonho', 'medroso', 'monstruosamente', 'mortalha', 'nervoso', 'pânico', 'pavor', 'premonição', 'preocupar', 'presságio', 'pressentimento', 'recear', 'receativamente', 'receio', 'receoso', 'ruim', 'suspeita', 'suspense', 'susto', 'temer', 'tenso', 'terror', 'tremor', 'temeroso', 'terrificar', 'timidamente', 'vigiar', 'temor', 'terrível', 'timidez', 'vigilante', 'tensão', 'terrivelmente', 'tímido',])
RAIVA = np.array(['abominação', 'aborrecer', 'adredido', 'agredir', 'agressão', 'agressivo', 'amaldiçoado', 'amargor', 'amargura', 'amolar', 'angústia', 'animosidade', 'antipatia', 'antipático', 'asco', 'assassinar', 'assassinato', 'assediar', 'assédio', 'atormentar', 'avarento', 'avareza', 'aversão', 'beligerante', 'bravejar', 'chateação', 'chato', 'cobiçoso', 'cólera', 'colérico', 'complicar', 'contraiedade', 'contrariar', 'corrupção', 'corrupto', 'cruxificar', 'demoníaco', 'demônio', 'descaso', 'descontente', 'descontrole', 'desenganar', 'desgostar', 'desgraça', 'desprazer', 'desprezar', 'destruição', 'destruir', 'detestar', 'diabo', 'diabólico', 'doido', 'encolerizar', 'energicamente', 'enfurecido', 'enfuriante', 'enlouquecer', 'enraivecer', 'escandalizar', 'escândalo', 'escoriar', 'exasperar', 'execração', 'ferir', 'frustração', 'frustrar', 'fúria', 'furioso', 'furor', 'ganância', 'ganancioso', 'guerra', 'guerreador', 'guerrilha', 'hostil', 'humilhar', 'implicância', 'implicar', 'importunar', 'incomodar', 'incômodo', 'indignar', 'infernizar', 'inimigo', 'inimizade', 'injúria', 'injuriado', 'injustiça', 'insulto', 'malícia', 'odiável', 'repulsivo', 'inveja', 'malicioso', 'ódio', 'resmungar', 'ira', 'malignidade', 'odioso', 'ressentido', 'irado', 'malígno', 'ofendido', 'revolta', 'irascibilidade', 'maltratar', 'ofensa', 'ridículo', 'irascível', 'maluco', 'opressão', 'tempestuoso', 'irritar', 'malvadeza', 'opressivo', 'tirano', 'louco', 'malvado', 'oprimir', 'tormento', 'loucura', 'matar', 'perseguição', 'torturar', 'magoar', 'mesquinho', 'perseguir', 'ultrage', 'mal', 'misantropia', 'perturbar', 'ultrajar', 'maldade', 'misantrópico', 'perverso', 'vexatório', 'maldição', 'molestar', 'provocar', 'vigoroso', 'maldito', 'moléstia', 'rabugento', 'vingança', 'maldizer', 'mortal', 'raivoso', 'vingar', 'maldoso', 'morte', 'rancor', 'vingativo', 'maleficência', 'mortífero', 'reclamar', 'violência', 'maléfico', 'mortificar', 'repressão', 'violento', 'malevolência', 
'nervoso', 'reprimir', 'zangar', 'malévolo', 'odiar', 'repulsa',])
SURPRESA = np.array(['admirar', 'afeição', 'apavorante', 'assombro', 'chocado', 'chocante', 'desconcertar', 'deslumbrar', 'embasbacar', 'emudecer', 'encantamento', 'enorme', 'espanto', 'estupefante', 'estupefato', 'estupefazer', 'expectativa', 'fantasticamente', 'fantástico', 'horripilante', 'imaginário', 'imenso', 'impressionado', 'incrível', 'maravilha', 'milagre', 'mistério', 'misterioso', 'ótimo', 'pasmo', 'perplexo', 'prodígio', 'sensacional', 'surpreendente', 'surpreender', 'suspense', 'susto', 'temor', 'tremendo',])
TRISTEZA = np.array(['abandonar', 'abatido', 'abominável', 'aborrecer', 'abortar', 'afligir', 'aflito', 'aflição', 'agoniar', 'amargo', 'amargor', 'amargura', 'ansiedade', 'arrepender', 'arrependidamente', 'atrito', 'azar', 'cabisbaixo', 'choro', 'choroso', 'chorão', 'coitado', 'compassivo', 'compunção', 'contristador', 'contrito', 'contrição', 'culpa', 'defeituoso', 'degradante', 'deplorável', 'deposição', 'depravado', 'depressivo', 'depressão', 'deprimente', 'deprimir', 'derrota', 'derrubar', 'desalentar', 'desamparo', 'desanimar', 'desapontar', 'desconsolo', 'descontente', 'desculpas', 'desencorajar', 'desespero', 'desgaste', 'desgosto', 'desgraça', 'desistir', 'desistência', 'deslocado', 'desmoralizar', 'desolar', 'desonra', 'despojado', 'desprazer', 'desprezo', 'desumano', 'desânimo', 'discriminar', 'disforia', 'disfórico', 'dissuadir', 'doloroso', 'dor', 'dó', 'enfadado', 'enlutar', 'entediado', 'entristecedor', 'entristecer', 'envergonhar', 'errante', 'erro', 'errôneo', 'escurecer', 'escuridão', 'escuro', 'esquecido', 'estragado', 'execrável', 'extirpar', 'falsidade', 'falso', 'falta', 'fraco', 'fraqueza', 'fricção', 'frieza', 'frio', 'funesto', 'fúnebre', 'grave', 'horror', 'humilhar', 'inconsolável', 'indefeso', 'infelicidade', 'infeliz', 'infortúnio', 'isolar', 'lacrimejante', 'lacrimoso', 'lamentar', 'lastimoso', 'luto', 'lutoso', 'lágrima', 'lástima', 'lúgubre', 'magoar', 'martirizar', 'martírio', 'mau', 'melancolia', 'melancólico', 'menosprezar', 'miseravelmente', 'misterioso', 'mistério', 'miséria', 'morre', 'morte', 'mortificante', 'mágoa', 'negligentemente', 'nocivo', 'obscuro', 'opressivo', 'opressão', 'oprimir', 'pena', 'penalizar', 'penitente', 'penoso', 'penumbra', 'perder', 'perturbado', 'perverso', 'pervertar', 'pesaroso', 'pessimamente', 'piedade', 'pobre', 'porcamente', 'prejudicado', 'prejudicial', 'prejuízo', 'pressionar', 'pressão', 'quebrar', 'queda', 'queixoso', 'rechaçar', 'remorso', 'repressivo', 'repressão', 'reprimir', 'ruim', 
'secreto', 'servil', 'sobrecarga', 'sobrecarregado', 'sofrer', 'sofrimento', 'solidão', 'sombrio', 'soturno', 'sujo', 'suplicar', 'suplício', 'só', 'timidez', 'torturar', 'trevas', 'triste', 'tristemente', 'tédio', 'tímido', 'vazio',])

emotion_words = {
    'ALEGRIA': ALEGRIA,
    'DESGOSTO': DESGOSTO,
    'MEDO': MEDO,
    'RAIVA': RAIVA,
    'SURPRESA': SURPRESA,
    'TRISTEZA': TRISTEZA,
}
for key, values in emotion_words.items():
    for i, word in enumerate(values):
        # Look up the lemma of the trimmed word, then drop whitespace and accents
        lemma = LOOKUP.get(word.strip().lower(), word.strip().lower())
        emotion_words[key][i] = remover_acentos(''.join(lemma.split()))
# pp.pprint(emotion_words)

In [27]:
stpwords = stopwords.words('portuguese') + list(punctuation)
rms = ['um', 'não', 'mais', 'muito']
for rm in rms:
    stpwords.remove(rm)

def tokenize_frases(frase):
    return word_tokenize(remover_acentos(frase.lower()))

def rm_stop_words_tokenized(frase):
    frase = nlp(remover_acentos(frase.lower()))
    clean_frase = []
    for palavra in frase:
        if palavra.pos_ != 'PUNCT':
            palavra = palavra.lemma_
            if palavra not in stpwords and not palavra.isdigit():
                clean_frase.append(palavra)
    return ' '.join(clean_frase)

def generate_corpus(frases, tokenize=False):
    print('Iniciando processamento...')
    tokenized_frases = frases
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as procs:
        if tokenize:
            print('Executando processo de tokenização das frases...')
            tokenized_frases = procs.map(tokenize_frases, frases, chunksize=25)
        print('Executando processo de remoção das stopwords...')
        tokenized_frases = procs.map(rm_stop_words_tokenized, tokenized_frases, chunksize=25)
    print('Filtro e finalização...')
    return tokenized_frases


frases_originais = [
    'Bom dia SENADOR, agora está claro porque o pedágio não baixava,o judiciário não se manifestava quando era provocado e as CPIs só serviram prá corrupção,deu no que deu 🙄',
    'Não basta apenas retirar o candidato preferencial da maioria dos eleitores brasileiros. Tem que impedir também que esses mesmos eleitores possam comparecer às urnas. Que democracia é essa, minha gente? Poder judiciário comprometido até os cabelos com o golpe de destrói o país.',
    'Deus abençoe o dia de todos você, tenham um bom trabalho e bom estudo a todos. E pra aqueles que não trabalha e nem estuda, boa curtição em sua cama 🙂',
    'Aprenda a ter amor próprio que nem essa banana q fez uma tatuagem dela mesma.',
    'Estou muito feliz hoje',
    'Dias chuvosos me deixam triste',
    'Hoje o dia esta excelente',
    'Tem certas coisas que eu não como, acho bem nojento ficar mastigando lingua de boi por exemplo',
    'É de se admirar àqueles que conseguem realizar boas ações sem desejar nada em troca',
    
    'Quando a tristeza bater na sua porta, abra um belo sorriso e diga: Desculpa, mas hoje a felicidade chegou primeiro!',
    'Feliz é aquele que vê a felicidade dos outros sem ter inveja. O sol é para todos e a sombra pra quem merece.',
    'Minha meta é ser feliz, não perfeito.',
    'Ser feliz até onde der, até onde puder. Sem adiar, ser feliz o tanto que durar.',
    'Nunca deixe ninguém dizer que você não pode fazer alguma coisa. Se você tem um sonho, tem que correr atrás dele. As pessoas não conseguem vencer, e dizem que você também não vai vencer. Se quer alguma coisa, corre atrás.',
    
    'O maior problema em acreditar nas pessoas erradas, é que um dia você acaba não acreditando em mais ninguém.',
    'Não me deixe ir, posso nunca mais voltar.',
    'Não me arrependo de ter conhecido ninguém, só me arrependo de ter perdido tanto tempo com algumas pessoas.',
    'Se existe uma coisa que eu me arrependo é de ter confiado em algumas pessoas.',
    'Prefiro que enxerguem em mim erros com arrependimento do que uma falsa perfeição.',
    
    'Mesmo sabendo que um dia a vida acaba, a gente nunca está preparado para perder alguém.',
    'A morte é uma pétala que se solta da flor e deixa uma eterna saudade no coração.',
    'Mãe é imortal, porque quando ela parte para outro mundo fica vivendo nas lágrimas que escorrem em nosso rosto eternamente.',
    'Só existem dois motivos pra uma pessoa se preocupar com você: Ou ela te ama muito, ou você tem algo que ela queira muito!',
    'Ser feliz nao é ter uma vida perfeita, mas sim reconhecer que vale a pena viver apesar de todos os desafios e perdas.',
    'Minha maravilhosa vida é uma merda.'
]

# N = 10000

frases = copy.deepcopy(frases_originais)
frases += [' '.join(f).replace('_', ' ') for f in flt.sents()[:500]]
frases += [' '.join(f).replace('_', ' ') for f in mch.sents()[:500]]
frases += [' '.join(f).replace('_', ' ') for f in mcm.sents()[:500]]

frases = list(generate_corpus(frases, tokenize=False))
# print(frases)

ldocs = [f'D{i}' for i in range(len(frases))]

Iniciando processamento...
Executando processo de remoção das stopwords...
Filtro e finalização...


In [34]:
print('Tf-Idf:')
vec_tfidf = TfidfVectorizer(max_df=5, sublinear_tf=False, use_idf=True, ngram_range=(1, 2))
X_tfidf = vec_tfidf.fit_transform(frases)
print("   Actual number of tfidf features: %d" % X_tfidf.get_shape()[1])
weights_tfidf = pd.DataFrame(np.round(X_tfidf.toarray().T, 3), index=vec_tfidf.get_feature_names(), columns=ldocs)
display(weights_tfidf.head(15))

Tf-Idf:
   Actual number of tfidf features: 19120


term,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1515,D1516,D1517,D1518,D1519,D1520,D1521,D1522,D1523,D1524
00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00 preco,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000 000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000 centimetro,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000 ha,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000 pessoa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
022 90,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
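
To make the weights in the table above less opaque, here is a minimal sketch of how a TF-IDF matrix can be computed by hand with NumPy. It assumes the smoothed IDF and L2 row normalisation that scikit-learn's `TfidfVectorizer` applies by default; the toy `counts` matrix is invented for illustration, with documents as rows (the notebook transposes the matrix before displaying it).

```python
import numpy as np

# Toy term counts: rows are documents, columns are terms (hypothetical data).
counts = np.array([
    [1, 1, 0],
    [1, 0, 2],
    [1, 0, 0],
], dtype=float)

n_docs = counts.shape[0]
df = np.count_nonzero(counts, axis=0)        # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse document frequency
tfidf = counts * idf                         # raw term frequency times idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalise each document

print(np.round(idf, 3))
print(np.round(tfidf, 3))
```

The first term occurs in every document, so its IDF is exactly 1 and it contributes the least to telling documents apart.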


In [28]:
print('Count:')
vec_count = CountVectorizer()
X_count = vec_count.fit_transform(frases)
print("   Actual number of count features: %d" % X_count.get_shape()[1])
weights_count = pd.DataFrame(X_count.toarray().T, index=vec_count.get_feature_names(), columns=ldocs)
display(weights_count.head(15))

Count:
   Actual number of count features: 4889


term,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1515,D1516,D1517,D1518,D1519,D1520,D1521,D1522,D1523,D1524
00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
022,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
05,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
077,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
084,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0h00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
108,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
USE = True
X = X_count if USE else X_tfidf
weights = weights_count if USE else weights_tfidf
vectorizer = vec_count if USE else vec_tfidf

print('SVD: ')
AC = copy.deepcopy(X.toarray().T)
u, s, v = np.linalg.svd(AC, full_matrices=False)
print('Original and SVD equals: ', np.allclose(AC, np.dot(u, np.dot(np.diag(s), v))))

# print(AC)
# print(u.astype(np.float16))
# print('-' * 20)
# print(np.diag(s.astype(np.float16)))
# print('-' * 20)
# print(v.astype(np.float16))

SVD: 
Original and SVD equals:  True
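
The check above confirms that the full decomposition reproduces the matrix exactly. Latent Semantic Analysis usually goes one step further and keeps only the first `k` singular values. The sketch below illustrates that truncation on a small random matrix; the matrix, the seed and the choice `k = 2` are arbitrary assumptions, not values used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 3, size=(8, 5)).astype(float)   # toy term-document matrix

# np.linalg.svd returns the third factor already transposed (V^T).
u, s, vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions to keep (arbitrary choice)
A_k = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]          # rank-k approximation of A

err = np.linalg.norm(A - A_k)                        # Frobenius norm of the residual
expected = np.sqrt(np.sum(s[k:] ** 2))               # energy of the discarded singular values
print(err, expected)
```

The two printed numbers agree: the reconstruction error of a truncated SVD equals the square root of the sum of the squared singular values that were dropped, which gives a simple way to judge how much information a given `k` discards.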


In [30]:
print('Matriz U:')
print(u.shape)
weights_um = pd.DataFrame(u, index=vectorizer.get_feature_names(), columns=ldocs)
display(weights_um.head(15))

Matriz U:
(4889, 1525)


term,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1515,D1516,D1517,D1518,D1519,D1520,D1521,D1522,D1523,D1524
00,-0.00047,0.000545,0.000932,0.002942,0.000894,-0.000274,4.7e-05,-0.000456,0.000121,0.002148,...,0.03401,0.012801,-0.040311,-0.076908,-0.01375,-0.001184,0.074333,-0.03849,-0.029697,0.008273
000,-0.003072,0.001925,-0.001826,-0.0005,0.001803,0.016402,0.004345,0.0016,0.002052,0.003629,...,0.000209,-0.000115,0.001354,0.00908,0.009074,-0.006153,-0.00091,0.010828,0.026622,0.011433
022,-0.001007,-0.0011,-0.000664,0.003132,0.001608,-0.001247,0.000167,-0.001069,0.000508,-0.000496,...,0.002585,-0.031763,-0.048917,0.041925,0.040436,0.017542,-0.001256,0.059372,-0.007325,0.033947
05,-0.000602,9.7e-05,-1e-05,0.000152,-0.003799,-0.000118,-0.001551,0.001747,0.000536,0.001116,...,-0.010953,0.014219,0.015937,0.007512,-0.024319,0.01122,0.00302,-0.018258,0.010514,-0.00846
077,-0.000181,0.000129,-1.5e-05,0.000234,0.000238,0.001307,0.000256,0.000384,-0.000621,0.002519,...,0.002955,0.01925,-0.014693,-0.051652,0.017776,0.026258,-0.006377,-0.015598,0.015813,0.00819
084,-0.000678,-0.001472,-0.001625,0.00044,0.00103,0.000469,0.001155,-0.000312,-0.000632,0.003638,...,0.003842,-0.003286,-0.004667,-0.029838,-0.00755,-0.008294,-0.008312,-0.013563,0.003519,0.004332
0h00,-0.000537,-0.00158,-0.001507,0.00027,0.000824,-0.000559,0.000484,-0.000337,-0.00016,0.000281,...,-0.004584,0.007769,0.008972,0.020034,-0.001323,0.003249,-0.006971,0.00948,0.003094,-0.000978
10,-0.001317,0.002853,-0.000107,0.003164,0.002268,0.002093,0.001845,-0.001212,-7.4e-05,0.012872,...,-0.010149,-0.019601,-0.030645,0.03792,0.027412,-0.010959,-0.018078,0.023812,0.040379,0.016714
100,-0.000201,-1.2e-05,4.7e-05,0.000172,-0.000201,0.001402,-0.00115,0.001252,0.000384,0.001618,...,-0.017128,0.017699,-0.012826,0.020451,-0.021062,0.003651,0.002396,-0.022317,-0.011141,-0.008945
108,-9.3e-05,0.000218,-2.1e-05,0.000445,0.000231,0.000424,0.000362,-0.000123,-3.7e-05,0.003141,...,0.014702,0.019918,0.034423,-0.023843,0.022375,0.003871,0.010895,0.03747,-0.020221,0.042735


In [31]:
print('Matriz VT:')
weights_vm = pd.DataFrame(v.T, columns=ldocs)
display(weights_vm.head(15))

Matriz VT:


Unnamed: 0,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1515,D1516,D1517,D1518,D1519,D1520,D1521,D1522,D1523,D1524
0,-0.04763,-0.058177,0.056207,-0.044432,0.021808,-0.041579,0.0269,-0.009721,0.020173,0.006205,...,0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,-2.993553e-17,0.0,0.0
1,-0.026527,-0.017766,0.059225,-0.025276,0.006634,0.024682,0.034415,-0.011329,-0.039928,5.1e-05,...,1.03978e-17,1.472189e-17,-1.237468e-18,-4.220551e-17,-1.1233980000000001e-17,-1.930489e-17,-7.316166000000001e-17,-1.0698170000000001e-17,6.632021e-18,2.3096660000000002e-17
2,-0.045769,0.023192,0.041277,-0.049197,0.015113,0.048819,0.024926,-0.048745,-0.055261,-0.008221,...,1.064741e-17,-1.5965660000000002e-17,6.47173e-18,-4.300706e-17,4.8116000000000005e-17,-5.1719550000000006e-17,-1.1608360000000002e-17,5.4621380000000005e-17,-1.001424e-17,2.091746e-18
3,-0.017893,0.002939,0.006051,-0.002905,-0.064452,0.021874,0.005251,-0.031322,-0.049428,-0.010372,...,7.18369e-17,6.656306e-18,2.0244880000000003e-17,4.451114e-18,8.497167000000001e-17,-1.117202e-16,2.6030610000000002e-17,1.23819e-16,-2.523236e-16,1.589248e-16
4,-0.004193,-0.001969,0.002822,0.001587,-0.001279,0.014599,-0.007254,0.009585,-0.00279,0.009456,...,2.552417e-17,-2.4674e-16,-1.260061e-16,1.247464e-16,2.918106e-16,9.603981000000001e-17,-3.8848250000000006e-17,-1.3610100000000001e-17,-2.08809e-16,-1.01155e-16
5,-0.002639,0.00074,0.001295,0.000451,0.000469,0.006629,-0.004026,-0.002289,0.004184,0.000502,...,-5.514965000000001e-17,-7.528357000000001e-17,5.629802000000001e-17,-1.321244e-16,-2.1974670000000003e-17,-2.066563e-16,-1.2846110000000001e-17,-4.8837510000000007e-17,-8.02683e-17,-2.029459e-16
6,-0.00251,0.001775,-0.000384,0.001662,0.000316,0.006238,-0.006483,-0.002621,0.000785,0.00417,...,-9.518657000000001e-17,-1.8739970000000003e-17,1.869867e-16,-1.216888e-16,-1.665364e-17,-7.156058e-17,1.152033e-16,2.295127e-19,8.503517e-17,-7.849299000000001e-18
7,-0.023907,-0.012135,0.052841,-0.027918,0.001203,0.014124,0.022972,0.035955,-0.070534,-0.010704,...,-8.690306000000001e-17,3.510845e-17,2.3636860000000002e-17,-4.648372e-17,1.674276e-16,-6.820820000000001e-17,-8.138836e-17,-3.209031e-17,-2.153314e-17,8.896874000000001e-17
8,-0.003159,-0.001263,0.003013,-0.001457,-0.000688,0.004189,-0.007134,-0.001433,-0.005224,0.002954,...,1.066186e-16,-8.489699000000001e-17,1.287091e-16,2.4132080000000002e-17,2.653758e-18,5.4411880000000006e-17,-5.465631e-17,-2.464898e-18,1.7855890000000002e-17,1.220153e-16
9,-0.025789,0.034328,-0.00727,-0.020609,0.01905,0.029119,-0.030188,-0.011163,0.014654,0.054358,...,-7.183451e-17,3.1215580000000006e-17,4.137352e-17,-1.4133090000000003e-17,-2.720535e-17,8.102328e-18,-1.201365e-16,-4.510928e-17,2.4041930000000002e-17,7.0951e-17


In [32]:
SIMPLE = USE

ws = np.zeros((len(ldocs), len(emotion_words)))
terms = weights.index.values  # Index.get_values() was removed in pandas 1.0

for i, doc in enumerate(ldocs):
    for k, (key, values) in enumerate(emotion_words.items()):
        for value in values:
            try:
                if SIMPLE:
                    # Exact match of the emotion word against the vocabulary.
                    index = weights.index.get_loc(value)
                    ws[i][k] += u[index][i] * weights.iloc[index].values[i]
                else:
                    # Substring match: every term that contains the emotion word.
                    # A conditional comprehension keeps position 0, which
                    # filter(None, ...) would drop because 0 is falsy.
                    indexes = [e for e, term in enumerate(terms) if value in term]
                    weight_sum = [u[index][i] * weights.iloc[index].values[i] for index in indexes]
                    ws[i][k] += np.sum(weight_sum)
            except (KeyError, IndexError):
                # Emotion word not in the vocabulary, or position out of range.
                pass

ws = ws / len(ldocs)

df = pd.DataFrame(ws, columns=emotion_words.keys())
# Show only the documents with a non-zero score for at least one emotion.
display(df[(df != 0).any(axis=1)])

| | ALEGRIA | DESGOSTO | MEDO | RAIVA | SURPRESA | TRISTEZA |
|---|---|---|---|---|---|---|
| 0 | -2.305113e-05 | 0.000000 | 0.000000e+00 | -5.324163e-06 | 0.000000 | -2.263428e-05 |
| 2 | 3.271124e-05 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 |
| 3 | -7.940492e-06 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 |
| 4 | 9.946928e-07 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 |
| 5 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | -1.651026e-06 |
| 6 | -4.381039e-06 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 |
| 7 | 0.000000e+00 | 0.000002 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 |
| 8 | -4.101853e-05 | 0.000000 | 0.000000e+00 | 0.000000e+00 | -0.000003 | 0.000000e+00 |
| 9 | -2.835185e-08 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 2.805809e-06 |
| 10 | 4.784073e-06 | 0.000000 | 0.000000e+00 | 4.358943e-07 | 0.000000 | 0.000000e+00 |
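
The substring lookup used when `SIMPLE` is false can be sketched in isolation. The vocabulary below is made up for illustration (in the notebook it would come from `weights.index`). One detail worth noting: filtering positions with `filter(None, ...)` drops position 0, because 0 is falsy in Python, so a conditional comprehension is the safer form.

```python
# Hypothetical vocabulary, for illustration only.
vocabulary = ["feliz", "felicidade", "tristeza", "infeliz"]
emotion_word = "feliz"

# filter(None, ...) treats position 0 as falsy and silently drops it.
lossy = list(filter(None, [i if emotion_word in term else None
                           for i, term in enumerate(vocabulary)]))

# A conditional comprehension keeps every matching position, including 0.
matches = [i for i, term in enumerate(vocabulary) if emotion_word in term]

print(lossy)    # [3]
print(matches)  # [0, 3]
```

With the comprehension, a term stored at the first row of the vocabulary still contributes to the emotion score.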


In [33]:
dtframe = np.zeros((len(ldocs), len(emotion_words)))
for i in range(len(ldocs)):
    for k in range(len(emotion_words)):
        a = [ws[:, k]]
        b = [v.T[i]]
        # cosine_similarity returns a 1x1 matrix; take the scalar explicitly.
        dtframe[i][k] = cosine_similarity(a, b)[0, 0]
dtframe = np.round(normalization(dtframe, -100, 100), 2)

for i, frase in enumerate(frases_originais[9:25]):
    print(f' D{i+9} - {frase}')

df = pd.DataFrame(dtframe[9:25].T, index=emotion_words.keys(), columns=ldocs[9:25])
display(df.head(15))

 D9 - Quando a tristeza bater na sua porta, abra um belo sorriso e diga: Desculpa, mas hoje a felicidade chegou primeiro!
 D10 - Feliz é aquele que vê a felicidade dos outros sem ter inveja. O sol é para todos e a sombra pra quem merece.
 D11 - Minha meta é ser feliz, não perfeito.
 D12 - Ser feliz até onde der, até onde puder. Sem adiar, ser feliz o tanto que durar.
 D13 - Nunca deixe ninguém dizer que você não pode fazer alguma coisa. Se você tem um sonho, tem que correr atrás dele. As pessoas não conseguem vencer, e dizem que você também não vai vencer. Se quer alguma coisa, corre atrás.
 D14 - O maior problema em acreditar nas pessoas erradas, é que um dia você acaba não acreditando em mais ninguém.
 D15 - Não me deixe ir, posso nunca mais voltar.
 D16 - Não me arrependo de ter conhecido ninguém, só me arrependo de ter perdido tanto tempo com algumas pessoas.
 D17 - Se existe uma coisa que eu me arrependo é de ter confiado em algumas pessoas.
 D18 - Prefiro que enxerguem em mim err

| | D9 | D10 | D11 | D12 | D13 | D14 | D15 | D16 | D17 | D18 | D19 | D20 | D21 | D22 | D23 | D24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALEGRIA | -11.3 | 17.6 | -8.05 | -24.98 | 48.76 | -35.4 | -14.23 | 21.68 | 4.68 | -43.72 | 5.85 | 3.49 | -29.5 | 8.03 | 18.96 | -3.55 |
| DESGOSTO | 35.4 | -11.36 | -15.61 | 14.93 | -19.87 | 21.27 | -9.17 | -33.03 | 17.87 | 5.92 | -17.6 | 17.39 | 3.27 | 48.56 | 30.24 | -0.81 |
| MEDO | 26.49 | 30.4 | -6.8 | -19.02 | -20.28 | -8.23 | 26.31 | -2.73 | -13.07 | 5.17 | -16.6 | -5.55 | -9.67 | 48.27 | 32.0 | -13.89 |
| RAIVA | -19.63 | 4.45 | 29.02 | 34.07 | 15.4 | -8.79 | -14.71 | 0.54 | 29.69 | -25.82 | 2.23 | 13.36 | 4.2 | 2.11 | 40.49 | -10.57 |
| SURPRESA | -3.11 | 7.51 | -19.38 | -1.92 | 4.48 | -34.38 | -18.81 | 44.76 | 15.93 | -28.56 | -22.3 | 9.56 | 3.84 | -14.25 | 29.96 | -18.97 |
| TRISTEZA | -9.99 | 2.0 | 11.61 | 22.3 | 33.32 | -6.62 | 2.86 | 43.49 | 0.78 | 47.97 | -38.26 | 77.74 | 3.4 | 28.07 | 52.11 | 7.19 |
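
The scores above come from a cosine similarity followed by a rescaling to the range -100 to 100. The `normalization` helper is defined elsewhere in the notebook, so the sketch below only assumes it behaves like a min-max rescale; `cosine` and `rescale` are hypothetical names, and the vectors are toy data rather than the notebook's `ws` and `v`.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rescale(x, low, high):
    """Min-max rescale of every value in x to [low, high].

    Assumption: this mirrors what the notebook's normalization() does.
    """
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # All values equal: place them at the middle of the target range.
        return np.full_like(x, (low + high) / 2)
    return low + (x - x.min()) * (high - low) / span

# Toy data: 2 emotion profiles and 3 document vectors in a 3-D space.
emotion_profiles = np.array([[1.0, 0.0, 1.0],
                             [0.0, 1.0, 0.0]])
doc_vectors = np.array([[1.0, 0.0, 1.0],
                        [0.0, 2.0, 0.0],
                        [1.0, 1.0, 1.0]])

# One row per document, one column per emotion.
scores = np.array([[cosine(d, e) for e in emotion_profiles] for d in doc_vectors])
scaled = np.round(rescale(scores, -100, 100), 2)
print(scaled)
```

Because the rescale is applied to the whole matrix at once, the values are relative to the strongest and weakest similarity in the batch: -100 and 100 mark the extremes of this particular set of documents, not an absolute absence or presence of an emotion.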
