[View in Colaboratory](https://colab.research.google.com/github/rdenadai/TxtP-Study-Notebooks/blob/master/notebooks/text_classification_example.ipynb)

## Analysis and Validation of Portuguese Texts


### References:

 - [NLTK](http://www.nltk.org/howto/portuguese_en.html)
 - [spaCy](https://spacy.io/usage/spacy-101)
 - [Utilizando processamento de linguagem natural para criar uma sumarização automática de textos](https://medium.com/@viniljf/utilizando-processamento-de-linguagem-natural-para-criar-um-sumariza%C3%A7%C3%A3o-autom%C3%A1tica-de-textos-775cb428c84e)
 - [Latent Semantic Analysis (LSA) for Text Classification Tutorial](http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/)
 - [Topic Modeling with LSA, PLSA, LDA & lda2Vec](https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05)

### Installation

In [1]:
!pip install -U spacy
!python -m spacy download en
!python -m spacy download pt
!pip install feedparser

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.0.12)
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.1MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')

Collecting pt_core_news_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-2.0.0/pt_core_news_sm-2.0.0.tar.gz#egg=pt

In [2]:
# Download Oplexicon
!rm -rf wget-log*
!rm -rf oplexicon_v3.0
!wget -O oplexicon_v3.0.zip https://github.com/rdenadai/sentiment-analysis-2018-president-election/blob/master/dataset/oplexicon_v3.0.zip?raw=true
!unzip oplexicon_v3.0.zip
!ls -lh


Redirecting output to ‘wget-log’.
Archive:  oplexicon_v3.0.zip
  inflating: oplexicon_v3.0/lexico_v3.0.txt  
  inflating: oplexicon_v3.0/README   
total 120K
drwxr-xr-x 2 root root 4.0K Sep 27 11:21 oplexicon_v3.0
-rw-r--r-- 1 root root 102K Sep 27 11:21 oplexicon_v3.0.zip
drwxr-xr-x 2 root root 4.0K Sep 26 00:53 sample_data
-rw-r--r-- 1 root root 1.6K Sep 27 11:21 wget-log


### Imports

In [3]:
import nltk

nltk.download('rslp')
nltk.download('averaged_perceptron_tagger')
nltk.download('floresta')
nltk.download('mac_morpho')
nltk.download('machado')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('words')

import concurrent.futures
import codecs
import re
import pprint
from random import shuffle
from string import punctuation

import numpy as np
import pandas as pd
import spacy

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import floresta as flt
from nltk.corpus import machado as mch
from nltk.corpus import mac_morpho as mcm

nlp = spacy.load('pt')
pp = pprint.PrettyPrinter(indent=4)
stemmer = nltk.stem.RSLPStemmer()

[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Unzipping stemmers/rslp.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package floresta to /root/nltk_data...
[nltk_data]   Unzipping corpora/floresta.zip.
[nltk_data] Downloading package mac_morpho to /root/nltk_data...
[nltk_data]   Unzipping corpora/mac_morpho.zip.
[nltk_data] Downloading package machado to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


### Functions

In [0]:
def load_oplexicon_data(filename):
    spacy_conv = {
        'adj': 'ADJ',
        'n': 'NOUN',
        'vb': 'VERB',
        'det': 'DET',
        'emot': 'EMOT',
        'htag': 'HTAG'
    }
    
    data = {}
    with codecs.open(filename, 'r', 'UTF-8') as hf:
        lines = hf.readlines()
        for line in lines:
            info = line.lower().split(',')
            if len(info[0].split()) <= 1:
                info[1] = [spacy_conv.get(tag) for tag in info[1].split()]
                word, tags, sent = info[:3]
                if 'HTAG' not in tags and 'EMOT' not in tags:
                    word = word.replace('-se', '')
                    stem = stemmer.stem(word)
                    # Group every entry under its stem; setdefault creates the list on first use.
                    data.setdefault(stem, []).append({
                        'word': [word],
                        'tags': tags,
                        'sentiment': sent
                    })
    return data
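
To make the parsing rule concrete, the sketch below applies the same split-and-map logic to single lines. The sample lines and their fourth field are assumptions modelled on the entries printed further down in this notebook, and `parse_lexicon_line` is a hypothetical helper. Stemming is left out so the snippet runs without NLTK data, and a length guard for malformed lines is added that the original function does not have.

```python
# Tag conversion copied from load_oplexicon_data above.
SPACY_CONV = {'adj': 'ADJ', 'n': 'NOUN', 'vb': 'VERB',
              'det': 'DET', 'emot': 'EMOT', 'htag': 'HTAG'}

def parse_lexicon_line(line):
    """Return (word, tags, sentiment), or None for skipped entries."""
    info = line.lower().strip().split(',')
    if len(info) < 3 or len(info[0].split()) > 1:
        return None  # malformed line (extra guard) or multi-word expression
    word, raw_tags, sent = info[:3]
    tags = [SPACY_CONV.get(tag) for tag in raw_tags.split()]
    if 'HTAG' in tags or 'EMOT' in tags:
        return None  # hashtags and emoticons are ignored
    return word.replace('-se', ''), tags, sent

# Hypothetical sample lines (assumed format: word,tag,polarity,source).
print(parse_lexicon_line('abafada,adj,-1,a'))
print(parse_lexicon_line('bem feito,adj,1,a'))
```

The first call yields a `(word, tags, sentiment)` tuple; the second is skipped because the entry has more than one word.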

### Usage

In [5]:
frase = u"Gostaria de saber mais informações sobre a Amazon. Uma excelente loja de produtos online!".lower()
doc = nlp(frase)
pp.pprint([(w.text, w.pos_) for w in doc])

[   ('gostaria', 'VERB'),
    ('de', 'ADP'),
    ('saber', 'VERB'),
    ('mais', 'DET'),
    ('informações', 'NOUN'),
    ('sobre', 'ADP'),
    ('a', 'DET'),
    ('amazon', 'NOUN'),
    ('.', 'PUNCT'),
    ('uma', 'DET'),
    ('excelente', 'ADJ'),
    ('loja', 'NOUN'),
    ('de', 'ADP'),
    ('produtos', 'NOUN'),
    ('online', 'ADJ'),
    ('!', 'PUNCT')]


In [6]:
opx = load_oplexicon_data('oplexicon_v3.0/lexico_v3.0.txt')
print('Oplexicon size: ', len(opx))
print('Examples: ')

view = opx.items()
pp.pprint(list(view)[:7])

Oplexicon size:  10687
Examples: 
[   ('ab-rog', [{'sentiment': '-1', 'tags': ['VERB'], 'word': ['ab-rogar']}]),
    ('ababad', [{'sentiment': '0', 'tags': ['VERB'], 'word': ['ababadar']}]),
    (   'ababel',
        [   {'sentiment': '-1', 'tags': ['VERB'], 'word': ['ababelar']},
            {'sentiment': '1', 'tags': ['VERB'], 'word': ['ababelar']}]),
    ('abaçan', [{'sentiment': '1', 'tags': ['VERB'], 'word': ['abaçanar']}]),
    ('abacin', [{'sentiment': '1', 'tags': ['VERB'], 'word': ['abacinar']}]),
    (   'abaf',
        [   {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafada']},
            {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafadas']},
            {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafado']},
            {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafados']},
            {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafante']},
            {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafantes']},
            {'sentiment': '-1', 'tags': [
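
One way a stem-keyed dictionary like this might be queried is sketched below. `lookup_sentiment` is a hypothetical helper, not part of the notebook, and `sample_opx` holds only a few entries copied from the output above.

```python
# A few entries copied from the printed output; the real dictionary has 10687 stems.
sample_opx = {
    'ababel': [
        {'sentiment': '-1', 'tags': ['VERB'], 'word': ['ababelar']},
        {'sentiment': '1', 'tags': ['VERB'], 'word': ['ababelar']},
    ],
    'abaf': [
        {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafada']},
        {'sentiment': '-1', 'tags': ['ADJ'], 'word': ['abafado']},
    ],
}

def lookup_sentiment(lexicon, stem, pos):
    """Collect the distinct polarities recorded for a stem with the given POS tag."""
    entries = lexicon.get(stem, [])
    return sorted({int(e['sentiment']) for e in entries if pos in e['tags']})

print(lookup_sentiment(sample_opx, 'abaf', 'ADJ'))     # [-1]
print(lookup_sentiment(sample_opx, 'ababel', 'VERB'))  # [-1, 1]
print(lookup_sentiment(sample_opx, 'abaf', 'NOUN'))    # []
```

Note that a stem such as `ababel` can carry conflicting polarities, so any downstream scoring would need a rule for resolving them.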

In [82]:
# stpwords = set(stopwords.words('portuguese') + list(punctuation))
stpwords = set(list(punctuation))

def tokenize_frases(frase):
    return word_tokenize(frase.lower())

def rm_stop_words_tokenized(frase):
    clean_frase = []
    for palavra in frase:
        if palavra not in stpwords and not palavra.isdigit():
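            # The hyphen is escaped so it matches literally instead of forming the range `)`..`_`, which would also strip digits.
            clean_frase.append(re.sub(r'[\"\'!@#$%&*\(\)\-_=+{}\[\]:;>.<,|\\`´]', '', palavra.lower()))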
        else:
            clean_frase.append(None)
    return ' '.join(filter(None, clean_frase))

def generate_corpus(frases, tokenize=False):
    print('Iniciando processamento...')
    tokenized_frases = frases
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as procs:
        if tokenize:
            print('Executando processo de tokenização das frases...')
            tokenized_frases = procs.map(tokenize_frases, frases, chunksize=25)
        print('Executando processo de remoção das stopwords...')
        tokenized_frases = procs.map(rm_stop_words_tokenized, tokenized_frases, chunksize=25)
    print('Filtro e finalização...')
    return tokenized_frases

frases = [
    'Bom dia SENADOR, agora está claro porque o pedágio não baixava,o judiciário não se manifestava quando era provocado e as CPIs só serviram prá corrupção,deu no que deu 🙄',
    'Não basta apenas retirar o candidato preferencial da maioria dos eleitores brasileiros. Tem que impedir também que esses mesmos eleitores possam comparecer às urnas. Que democracia é essa, minha gente? Poder judiciário comprometido até os cabelos com o golpe de destrói o país.',
    'Deus abençoe o dia de todos você, tenham um bom trabalho e bom estudo a todos. E pra aqueles que não trabalha e nem estuda, boa curtição em sua cama 🙂',
    'Aprenda a ter amor próprio que nem essa banana q fez uma tatuagem dela mesma.'
]

N = 10000
frases = flt.sents()[:N] + mch.sents()[:N] + mcm.sents()[:N]

frases = generate_corpus(frases, tokenize=False)

Iniciando processamento...
Executando processo de remoção das stopwords...
Filtro e finalização...
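
The per-sentence cleaning step can be tried on a single tokenized sentence without the process pool. This is a minimal sketch: `clean_tokens` is a hypothetical single-process stand-in for `rm_stop_words_tokenized`, the token list is made up, and the hyphen in the character class is escaped so that it is matched literally.

```python
import re
from string import punctuation

PUNCT = set(punctuation)
# Same symbols as the notebook's pattern, with the hyphen escaped.
CLEAN_RE = re.compile(r'[\"\'!@#$%&*()\-_=+{}\[\]:;>.<,|\\`´]')

def clean_tokens(tokens):
    """Lowercase tokens, drop punctuation and pure numbers, strip stray symbols."""
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if tok in PUNCT or tok.isdigit():
            continue
        tok = CLEAN_RE.sub('', tok)
        if tok:  # skip tokens that became empty after stripping
            kept.append(tok)
    return ' '.join(kept)

print(clean_tokens(['Bom', 'dia', ',', 'SENADOR', '!', '2018', 'CPIs']))
# bom dia senador cpis
```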


In [81]:
# d = feedparser.parse('http://rss.uol.com.br/feed/tecnologia.xml')
# for entry in d['entries']:
#     pp.pprint(entry['link'])

stpwords = set(stopwords.words('portuguese') + list(punctuation))
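def rm_stop_words(palavra):
    # Lowercase first so capitalized stopwords ("O", "De") are also removed.
    palavra = palavra.lower()
    if palavra not in stpwords and not palavra.isdigit():
        # The hyphen is escaped so it matches literally instead of forming a character range.
        return re.sub(r'[\"\'!@#$%&*\(\)\-_=+{}\[\]:;>.<,|\\`´]', '', palavra)
    return None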

# Machado + Mac_Morpho
print('Iniciando corpus...')
N = 100000
palavras = flt.words()[:N] + mch.words()[:N] + mcm.words()[:N]
print('Corpus criado...')
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as procs:
    print('Executando processo de remoção das stopwords...')
    palavras = procs.map(rm_stop_words, palavras, chunksize=50)
print('Filtro e finalização...')
palavras = filter(None, palavras)

Iniciando corpus...
Corpus criado...
Executando processo de remoção das stopwords...
Filtro e finalização...
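
Because `filter` returns a lazy iterator in Python 3, `palavras` can be traversed only once after this cell. The sketch below shows that behaviour on a made-up list; wrapping the result in `list(...)` would be one way to keep the words around for reuse.

```python
words = ['casa', None, 'livro', '', 'mesa']  # made-up sample

lazy = filter(None, words)   # lazy iterator, nothing is computed yet
first_pass = list(lazy)      # consumes the iterator
second_pass = list(lazy)     # the iterator is already exhausted

print(first_pass)   # ['casa', 'livro', 'mesa']
print(second_pass)  # []
```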


In [83]:
print('Tf-Idf:')
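# NOTE: an integer max_df is an absolute count, so max_df=2 drops every term found in more than 2 sentences.
vectorizer = TfidfVectorizer(max_df=2, use_idf=True, ngram_range=(1, 3))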

X_tfidf = vectorizer.fit_transform(frases)
print("   Actual number of tfidf features: %d" % X_tfidf.get_shape()[1])

Tf-Idf:
   Actual number of tfidf features: 543386
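
Per the scikit-learn documentation, an integer `max_df` is an absolute document count: terms whose document frequency is strictly higher are ignored, while `min_df` sets the lower bound. The pure-Python sketch below illustrates that filtering rule only; it is not scikit-learn's implementation, and the toy sentences and the `filter_by_df` helper are assumptions for illustration.

```python
from collections import Counter

docs = [
    'o judiciário não se manifestava',
    'o pedágio não baixava',
    'o dia de todos',
    'bom dia senador',
]

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc.split()))

def filter_by_df(df, min_df=1, max_df=None):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    upper = max(df.values()) if max_df is None else max_df
    return sorted(t for t, n in df.items() if min_df <= n <= upper)

print(filter_by_df(df, max_df=2))  # drops 'o', which occurs in 3 documents
print(filter_by_df(df, min_df=2))  # ['dia', 'não', 'o']
```

With `max_df=2` only rare terms survive, which is the opposite of what `min_df=2` would select.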


In [84]:
weights = np.asarray(X_tfidf.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'term': vectorizer.get_feature_names(), 'weight': weights})
weights_df = weights_df.sort_values(by='weight', ascending=True)
display(weights_df.head(10))

|        | term | weight |
|--------|------|--------|
| 219976 | faculdades não | 2e-06 |
| 263553 | ininterrupto | 2e-06 |
| 263554 | ininterrupto que | 2e-06 |
| 263555 | ininterrupto que vista | 2e-06 |
| 9705 | agora expostas | 2e-06 |
| 1533 | absoluto que | 2e-06 |
| 1534 | absoluto que desse | 2e-06 |
| 215836 | excluía da razão | 2e-06 |
| 215835 | excluía da | 2e-06 |
| 215834 | excluía | 2e-06 |


In [85]:
print("LSA:")

# Project the tf-idf vectors onto the first 100 SVD components.
# This gives far fewer features than the original tf-idf matrix; in LSA they are
# meant to be stronger features, though no classifier accuracy is measured here.
svd = TruncatedSVD(100)
lsa = make_pipeline(svd, Normalizer(copy=False))

# Run SVD on the training data, then project the training data.
X_lsa = lsa.fit_transform(X_tfidf)

explained_variance = svd.explained_variance_ratio_.sum()
print("   Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))

LSA:
   Explained variance of the SVD step: 0%
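
The printed value goes through `int(...)`, so any ratio below 1% shows up as 0%. To see what the ratio measures, the sketch below recomputes it with plain NumPy on a small random matrix: the variance of the projected data divided by the total feature variance. The matrix and the `explained_variance_ratio` helper are made up for illustration, and this is a simplified reading of the idea rather than scikit-learn's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # made-up stand-in for a small tf-idf matrix

def explained_variance_ratio(X, k):
    """Share of the total feature variance captured by each of the top-k SVD components."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] * S[:k]  # rows projected onto the top-k components
    return X_k.var(axis=0) / X.var(axis=0).sum()

ratio_2 = explained_variance_ratio(X, 2)
ratio_4 = explained_variance_ratio(X, 4)
print(ratio_2.sum())  # two components keep only part of the variance
print(ratio_4.sum())  # all four components recover the total
```

With 100 components against several hundred thousand tf-idf features, a very small ratio is plausible, since each retained component can only account for a tiny share of the total variance.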
