[View in Colaboratory](https://colab.research.google.com/github/rdenadai/ia369y/blob/master/notebooks/T2/T2_Sentiment_Analysis_Extra_dataset.ipynb)

# IA369 - Y Affective Computing

Exercise assigned by Prof. Paula to the students of her graduate course.

### Team

- Edgar Lopes Banhesse RA 993396
- Rodolfo De Nadai RA 208911

## T2 - Sentiment Analysis in Text

This notebook only generates a new dataset based on the Movie Reviews corpus and SentiWordNet.

In [1]:
import re
import pprint
import copy
import warnings
from collections import namedtuple
import nltk
import numpy as np
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

# Silence warnings through the standard library module
# (np.warnings is not available in recent NumPy releases)
warnings.filterwarnings('ignore')

# Download some of the datasets distributed with NLTK
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('movie_reviews')
nltk.download('sentence_polarity')
nltk.download('sentiwordnet')
nltk.download('stopwords')
nltk.download('words')

from nltk.corpus import wordnet as wn
from nltk.corpus import movie_reviews
from nltk.corpus import sentiwordnet as wdn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.util import ngrams

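# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the sklearn.feature_extraction.stop_words module was removed in later releases
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS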

pp = pprint.PrettyPrinter(indent=4)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package sentence_polarity to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping corpora/sentence_polarity.zip.
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


Let's load every sentence in the Movie Reviews corpus shipped with the NLTK library.

In [2]:
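# movie_reviews.categories() returns ['neg', 'pos']
neg, pos = movie_reviews.categories()

new_phrases = []
# Negative reviews. Note: [1:] skips the first sentence of each negative
# review, whereas the positive reviews below keep every sentence.
for ids in movie_reviews.fileids(neg):
    for phrase in movie_reviews.sents(ids)[1:]:
        # Keep only sentences with more than 3 tokens
        if len(phrase) > 3:
            new_phrases.append({
                'type': 'neg',
                'phrase': ' '.join(phrase).lower(),
                'over_score': 0.0
            })
# Positive reviews
for ids in movie_reviews.fileids(pos):
    for phrase in movie_reviews.sents(ids):
        if len(phrase) > 3:
            new_phrases.append({
                'type': 'pos',
                'phrase': ' '.join(phrase).lower(),
                'over_score': 0.0
            })
pp.pprint(new_phrases[:3])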

[   {'over_score': 0.0, 'phrase': 'they get into an accident .', 'type': 'neg'},
    {   'over_score': 0.0,
        'phrase': 'one of the guys dies , but his girlfriend continues to see '
                  'him in her life , and has nightmares .',
        'type': 'neg'},
    {'over_score': 0.0, 'phrase': "what ' s the deal ?", 'type': 'neg'}]


Since we are using SentiWordNet, let's download the dataset and load all of it into a Python dictionary.

In [3]:
!rm -rf SentiWordNet_3.0.0_20130122.txt
!wget https://raw.githubusercontent.com/rdenadai/ia369y/master/datasets/SentiWordNet_3.0.0_20130122.txt
!ls -lh


Redirecting output to ‘wget-log’.
total 13M
drwxr-xr-x 2 root root 4.0K Sep 20 00:09 sample_data
-rw-r--r-- 1 root root  13M Sep 23 18:32 SentiWordNet_3.0.0_20130122.txt
-rw-r--r-- 1 root root  897 Sep 23 18:32 wget-log


In [4]:
senti_word_net = {}
with open('SentiWordNet_3.0.0_20130122.txt') as fh:
    for line in fh:
        # Skip the comment lines of the file
        if line.startswith('#'):
            continue
        # Expected columns: POS, ID, PosScore, NegScore, SynsetTerms, Gloss
        data = line.strip().split("\t")
        if len(data) != 6:
            continue
        pos_score = float(data[2].strip())
        neg_score = float(data[3].strip())
        # Keep only synsets that carry some polarity
        if pos_score > 0 or neg_score > 0:
            # Named pos_tag so it does not overwrite the `pos` category from cell 2
            pos_tag = data[0].strip()
            uid = int(data[1].strip())
            lemmas = [lemma.name() for lemma in wn.synset_from_pos_and_offset(pos_tag, uid).lemmas()]
            for lemma in lemmas:
                if lemma in senti_word_net:
                    # Lemma already seen in another synset: keep the highest scores
                    entry = senti_word_net[lemma]
                    entry['pos_score'] = max(pos_score, entry['pos_score'])
                    entry['neg_score'] = max(neg_score, entry['neg_score'])
                    entry['obj_score'] = 1 - (entry['pos_score'] + entry['neg_score'])
                else:
                    senti_word_net[lemma] = {
                        'pos': pos_tag,
                        'id': uid,
                        'pos_score': pos_score,
                        'neg_score': neg_score,
                        'obj_score': 1 - (pos_score + neg_score),
                        'SynsetTerms': list(lemmas)
                    }
print('SentiWordNet size : ', len(senti_word_net))
print('-' * 10)
pp.pprint(next(iter(senti_word_net.items())))

SentiWordNet size :  39822
----------
(   'able',
    {   'SynsetTerms': ['able'],
        'id': 1740,
        'neg_score': 0.0,
        'obj_score': 0.75,
        'pos': 'a',
        'pos_score': 0.25})
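
To make the merge rule of the cell above concrete, here is a minimal sketch on made-up rows. The lemmas and scores are hypothetical and are not taken from SentiWordNet, and `merge_scores` is an illustrative helper, not part of the notebook. When a lemma appears in more than one synset, the highest positive and the highest negative score are kept independently, so the derived objective score can drop below zero.

```python
# Hypothetical (lemma, pos_score, neg_score) rows; not real SentiWordNet data.
rows = [
    ('fine', 0.5, 0.0),
    ('fine', 0.25, 0.625),
    ('dull', 0.0, 0.75),
]

def merge_scores(rows):
    """Keep, per lemma, the highest positive and the highest negative score seen."""
    merged = {}
    for lemma, pos_score, neg_score in rows:
        entry = merged.setdefault(lemma, {'pos_score': 0.0, 'neg_score': 0.0})
        entry['pos_score'] = max(entry['pos_score'], pos_score)
        entry['neg_score'] = max(entry['neg_score'], neg_score)
        # Recomputed after every update, as in the cell above
        entry['obj_score'] = 1 - (entry['pos_score'] + entry['neg_score'])
    return merged

merged = merge_scores(rows)
print(merged['fine'])  # {'pos_score': 0.5, 'neg_score': 0.625, 'obj_score': -0.125}
```

In this toy case the two maxima come from different rows, which is why `obj_score` ends up negative for `fine`.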


Let's vectorize all the sentences loaded from Movie Reviews with TF-IDF, so that each term (unigrams up to trigrams) gets a weight averaged over the whole corpus.

In [5]:
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3))
transformed_weights = vectorizer.fit_transform([phrase['phrase'] for phrase in new_phrases])
# Mean TF-IDF weight of each term over all sentences
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()

# Map each term to its mean weight (vocabulary_ maps term -> column index)
tfidf_word_weights = {term: weights[idx] for term, idx in vectorizer.vocabulary_.items()}
print('TfIdf size : ', len(tfidf_word_weights))
print('-' * 10)
pp.pprint(next(iter(tfidf_word_weights.items())))

TfIdf size :  956437
----------
('accident', 0.00022434362084179348)
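
As a rough illustration of what a "mean TF-IDF weight per term" means, the sketch below recomputes it by hand on a hypothetical three-sentence corpus using only the standard library. It is simplified on purpose: whitespace tokenization, unigrams only, no stop-word removal. The smoothed IDF and the L2 normalization are meant to mirror what we understand to be the `TfidfVectorizer` defaults, so treat the exact values as illustrative rather than as a reimplementation of scikit-learn. `mean_tfidf` is a name introduced here.

```python
import math
from collections import Counter

def mean_tfidf(docs):
    """Average, over all documents, the L2-normalized TF-IDF weight of each term."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    totals = Counter()
    for tokens in tokenized:
        raw = {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        if norm == 0:
            continue  # empty document contributes nothing
        for term, w in raw.items():
            totals[term] += w / norm
    # Terms absent from a document count as 0 there, hence the division by n_docs
    return {term: total / n_docs for term, total in totals.items()}

docs = ['good movie', 'bad movie', 'good good plot']  # hypothetical toy corpus
weights = mean_tfidf(docs)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 4))
```

Terms that occur in many sentences accumulate a larger mean weight, while a term seen in a single sentence is divided by the full corpus size. This is why the values printed for the real corpus are so small.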


The cell below computes a simple, approximate valence for each sentence.

To do so, we take the positive and negative scores that SentiWordNet gives for each word and accumulate them over the consecutive word pairs (bigrams) of each sentence. A pair contributes only when both words point in the same direction, and its contribution is boosted by a factor of 1.25 when both words are strictly positive or both strictly negative.

Finally, we apply a correction whose sign follows the class (positive or negative) the sentence had in the Movie Reviews dataset, and whose magnitude grows with the TF-IDF weights of the words in the sentence.
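
The pairwise rule described above can be sketched in isolation as follows. The lexicon is hypothetical (the scores are not real SentiWordNet values) and `bigram_valence` is a name introduced only for this illustration; it is a simplified restatement of the rule, not the notebook's exact cell.

```python
# Hypothetical lexicon: word -> (pos_score, neg_score). Not real SentiWordNet values.
LEXICON = {
    'great': (0.75, 0.0),
    'fun': (0.5, 0.0),
    'boring': (0.0, 0.5),
    'awful': (0.0, 0.75),
}

def bigram_valence(words, lexicon=LEXICON, boost=1.25):
    """Sum the valence of consecutive word pairs that agree in polarity."""
    total = 0.0
    for first, second in zip(words, words[1:]):
        pos1, neg1 = lexicon.get(first, (0.0, 0.0))
        pos2, neg2 = lexicon.get(second, (0.0, 0.0))
        v1, v2 = pos1 - neg1, pos2 - neg2
        if v1 >= 0 and v2 >= 0:
            # Boost only when both words are strictly positive
            factor = boost if pos1 > 0 and pos2 > 0 else 1.0
        elif v1 <= 0 and v2 <= 0:
            # Boost only when both words are strictly negative
            factor = boost if neg1 > 0 and neg2 > 0 else 1.0
        else:
            continue  # mixed polarity: the pair contributes nothing
        total += (v1 + v2) * factor
    return total

print(bigram_valence(['great', 'fun', 'movie']))  # 2.0625
print(bigram_valence(['boring', 'awful']))        # -1.5625
print(bigram_valence(['great', 'awful']))         # 0.0
```

Two side effects of this rule are worth keeping in mind: a word in the middle of a sentence belongs to two pairs and can therefore be counted twice, and a pair of opposite polarity is dropped entirely instead of cancelling out.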

In [6]:
n_new_phrases = copy.deepcopy(new_phrases)

wordnet_lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stwords = set(ENGLISH_STOP_WORDS)

for i, phrase in enumerate(n_new_phrases):
    # Drop single-character tokens (mostly punctuation)
    words = [word for word in phrase['phrase'].split() if len(word) > 1]
    stem_words = [stemmer.stem(word) for word in words]
    lemm_words = [wordnet_lemmatizer.lemmatize(word) for word in words]
    # Per token, keep the stem only when it is longer than the lemma; otherwise keep the lemma
    words = [stem if len(stem) > len(lemm) else lemm for stem, lemm in zip(stem_words, lemm_words)]
    # Bigrams; pad_right adds a final (last_word, None) pair
    grams = list(ngrams(words, 2, pad_right=True))

    n_grams = []
    for gram in grams:
        v_grams = []
        for word in filter(None, gram):
            word_v = senti_word_net.get(word, None)
            pos_score = 0.0
            neg_score = 0.0
            if word_v:
                pos_score = word_v.get('pos_score')
                neg_score = word_v.get('neg_score')
            v_grams.append((word, pos_score, neg_score))
        n_grams.append(v_grams)
    
    ovr = 0.0
    for n_gram in n_grams:
        g1 = n_gram[0]
        word1, pos1, neg1 = g1
        try:
            g2 = n_gram[1]
            word2, pos2, neg2 = g2
            if pos1 - neg1 >= 0 and pos2 - neg2 >= 0:
                pos_db = 1.0
                if pos1 > 0 and pos2 > 0:
                    pos_db = 1.25
                ovr += ((pos1 - neg1) + (pos2 - neg2)) * pos_db
            elif pos1 - neg1 <= 0 and pos2 - neg2 <= 0:
                neg_db = 1.0
                if neg1 > 0 and neg2 > 0:
                    neg_db = 1.25
                ovr += ((pos1 - neg1) + (pos2 - neg2)) * neg_db
        except IndexError:
            pass

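    # Sum the mean TF-IDF weight of each distinct word in the sentence
    tfidf = 0.0
    for word in set(words):
        tfidf += tfidf_word_weights.get(word, 0)
    corr = 1 + (tfidf * len(words))
    # The sign of the correction follows the Movie Reviews class of the sentence
    corr = corr if phrase['type'] == 'pos' else -corr
    phrase['over_score'] = corr + ovr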

# Normalize the scores to the [-100, 100] range
scores = np.array([m['over_score'] for m in n_new_phrases])
a, b, mmin, mmax = -100, 100, np.min(scores), np.max(scores)
# The lower bound mirrors the largest magnitude; both bounds are padded by 0.25
gt = np.max([np.abs(mmin), mmax])
mmin = -gt + (-.25)
mmax += .25
scores = np.floor(a + (((scores - mmin) * (b-a)) / (mmax - mmin)))

for item, score in zip(n_new_phrases, scores):
    item['over_score'] = score

print('-' * 20)
print('Sentences:')
pp.pprint(n_new_phrases[:5])

--------------------
Sentences:
[   {   'over_score': -5.0,
        'phrase': 'they get into an accident .',
        'type': 'neg'},
    {   'over_score': 1.0,
        'phrase': 'one of the guys dies , but his girlfriend continues to see '
                  'him in her life , and has nightmares .',
        'type': 'neg'},
    {'over_score': -4.0, 'phrase': "what ' s the deal ?", 'type': 'neg'},
    {   'over_score': -2.0,
        'phrase': 'watch the movie and " sorta " find out .',
        'type': 'neg'},
    {   'over_score': 2.0,
        'phrase': 'critique : a mind - fuck movie for the teen generation that '
                  'touches on a very cool idea , but presents it in a very bad '
                  'package .',
        'type': 'neg'}]
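
The rescaling step at the end of the cell above can be reproduced on a handful of made-up scores. The sketch below uses only the standard library (so it returns integers where the notebook's `np.floor` returns floats), and `rescale` is a name introduced only for this illustration. The lower bound is taken as minus the largest absolute score, while the upper bound is the observed maximum, each padded by 0.25.

```python
import math

def rescale(scores, a=-100, b=100, pad=0.25):
    """Min-max rescale to [a, b] with the bounds used in the cell above."""
    largest = max(abs(min(scores)), max(scores))
    lower = -largest - pad      # mirrors the largest magnitude
    upper = max(scores) + pad   # observed maximum, padded
    return [math.floor(a + (s - lower) * (b - a) / (upper - lower)) for s in scores]

# Hypothetical raw scores
print(rescale([-3.75, 0.0, 1.75]))  # [-92, 33, 91]
```

Because the two bounds are not symmetric, a raw score of 0 is not guaranteed to land on 0 after rescaling (here it maps to 33). That may help explain why some sentences labelled `neg` end up with a small positive `over_score` in the output above.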
