<h1 align="center"> Applications in Natural Language Processing </h1>
<h2 align="center"> Lecture 05 - Word Vectors</h2>
<h3 align="center"> Prof. Fernando Vieira da Silva MSc.</h3>

<h2>1. Introduction</h2>

<p>Word vectors, such as those produced by Word2Vec, are representations that have become widely used in recent years. These techniques usually rely on a simple artificial neural network trained to predict, given a word, the probability that other words occur nearby. The trained network itself is not used afterwards; instead, the weights learned by the hidden layer serve as the representation of the given word. We will study a few different architectures.</p>


<h2>2. Skip-Gram Architecture </h2>
<p> Figure 1 below, taken from (McCormick, 2016), shows how the samples used to train a Word2Vec neural network with the Skip-Gram architecture are obtained from the sentence “The quick brown fox jumps over the lazy dog”. In this example, the word used as input to the network is highlighted in blue, while the expected output is one of the words inside the highlighted box (the window chosen to define proximity). </p>
The resulting training samples are listed in the rightmost column.

![Figure 1: Training samples for the Skip-gram architecture](http://mccormickml.com/assets/word2vec/training_data.png)
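The sampling procedure in Figure 1 can be sketched in a few lines of plain Python. This is only an illustration: the function name `skipgram_pairs` and the window size of 2 are assumptions made here, not part of the lecture code.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) training pairs for Skip-Gram."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context positions lie at most `window` tokens to the left or right.
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)

# First pairs: the input word "the" with each of its two right-hand neighbours.
print(pairs[:2])  # [('the', 'quick'), ('the', 'brown')]
```

Each pair becomes one training example: the first element is fed to the network as input and the second is the expected output.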

The neural network model in this architecture resembles the one shown in Figure 2. In this second example the vocabulary has 10,000 words and the input shown in the figure is the word “ants”. Note that the word is represented as a one-hot vector, that is, a vector with 10,000 dimensions (one per vocabulary word) in which a single position holds 1 and all the others hold 0. There is one hidden layer and one output layer. In the output layer, each neuron returns the probability that a given word appears nearby (there are also 10,000 output neurons, activated by the softmax function). The network is trained on every occurrence of each word in the corpus, and the hidden-layer weights associated with a word are taken as the representation of that word (in this example, a vector of 300 elements).

![Figure 2: Architecture of the Skip-Gram neural network](http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png)
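To make the role of the hidden layer concrete, here is a minimal forward pass in plain Python. It is a sketch under stated assumptions: the five-word vocabulary, the 3-dimensional hidden layer and the random weights are made up for illustration (the text uses 10,000 words and 300 dimensions), and no training takes place.

```python
import math
import random

vocab = ["ants", "car", "dog", "fox", "quick"]  # toy vocabulary (hypothetical)
embedding_dim = 3                               # the example in the text uses 300

rng = random.Random(0)
# Hidden-layer weights: one row per vocabulary word.
W_hidden = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]
# Output-layer weights: one column per vocabulary word.
W_output = [[rng.uniform(-1, 1) for _ in vocab] for _ in range(embedding_dim)]


def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]


def vec_mat(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vector, matrix)) for c in range(n_cols)]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


hidden = vec_mat(one_hot("ants"), W_hidden)
probabilities = softmax(vec_mat(hidden, W_output))

# Multiplying a one-hot vector by W_hidden selects the row that belongs to "ants".
row = W_hidden[vocab.index("ants")]
print(all(abs(h - r) < 1e-12 for h, r in zip(hidden, row)))
# The softmax output is a probability distribution over the vocabulary.
print(abs(sum(probabilities) - 1.0) < 1e-9)
```

This is why the hidden-layer weight matrix can be read as a lookup table of word vectors once training is finished.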


<h2>3. CBOW (Continuous Bag of Words) Architecture</h2>

<p>In the CBOW architecture, the input is the vector representation of the words surrounding the word we want to represent, and the output is that word itself, as illustrated in the figure below.</p>
    
![](https://cdn-images-1.medium.com/max/800/1*UVe8b6CWYykcxbBOR6uCfg.png)
<p fontsize="small">Source: https://arxiv.org/pdf/1301.3781.pdf Mikolov et al.</p>

In short, the Skip-Gram model tries to predict the context given the word, whereas the CBOW model tries to predict the word given the context.
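The contrast can be seen in how training samples are built. The sketch below produces CBOW samples from the same sentence used for Skip-Gram; the function name `cbow_samples` and the window size of 2 are assumptions made for illustration.

```python
def cbow_samples(tokens, window=2):
    """Return (context_words, target_word) training samples for CBOW."""
    samples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        samples.append((left + right, target))
    return samples


sentence = "the quick brown fox jumps over the lazy dog".split()
samples = cbow_samples(sentence)

# The whole context is the input and the middle word is the expected output.
print(samples[3])  # (['quick', 'brown', 'jumps', 'over'], 'fox')
```

Skip-Gram would turn the same position into four separate samples, each with "fox" as the input and one context word as the output.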

Mikolov, the author of Word2Vec, compared the two architectures as follows:
* Skip-gram: works well with a small amount of training data and represents even rare words or phrases well.
* CBOW: several times faster to train than Skip-gram, with slightly better accuracy for frequent words.

<p>The following article gives an interesting example in which a Skip-Gram neural network is built with Keras: https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-skip-gram.html. For simplicity, however, we will train with the Gensim library, which already provides suitable models for both Skip-Gram and CBOW.</p>


<p>Let's generate Word2Vec representations for the play Hamlet. First, we tokenize the corpus, as we did in previous lectures.</p>

In [1]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

hamlet_raw = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')
sentences = sent_tokenize(hamlet_raw)

# Build the stopword set once instead of reloading the list for every token.
stop_words = set(stopwords.words('english'))

# Tokenize each sentence, drop stopwords and punctuation, then lowercase.
# The stopword test runs before lowercasing, so capitalized stopwords
# such as "The" are kept (see the first sentence in the output below).
all_words = [
    [w.lower() for w in word_tokenize(sent)
     if w not in stop_words and w not in string.punctuation]
    for sent in sentences
]

print(all_words[:10])

[['the', 'tragedie', 'hamlet', 'william', 'shakespeare', '1599', 'actus', 'primus'], ['scoena', 'prima'], ['enter', 'barnardo', 'francisco', 'two', 'centinels'], ['barnardo'], ['who', "'s"], ['fran'], ['nay', 'answer', 'stand', 'vnfold', 'selfe', 'bar'], ['long', 'liue', 'king', 'fran'], ['barnardo'], ['bar']]


In [2]:
from gensim.models import Word2Vec

# sg selects the training algorithm: 1 for skip-gram, 0 for CBOW.
word2vec = Word2Vec(all_words, min_count=2, sg=1)

# gensim 3.x API; gensim 4.x exposes the vocabulary as word2vec.wv.key_to_index.
vocabulary = word2vec.wv.vocab
print(vocabulary)

{'the': <gensim.models.keyedvectors.Vocab object at 0x7fbe83525080>, 'tragedie': <gensim.models.keyedvectors.Vocab object at 0x7fbe83527e10>, 'hamlet': <gensim.models.keyedvectors.Vocab object at 0x7fbe83527da0>, 'actus': <gensim.models.keyedvectors.Vocab object at 0x7fbe83527dd8>, 'enter': <gensim.models.keyedvectors.Vocab object at 0x7fbe8370b080>, 'barnardo': <gensim.models.keyedvectors.Vocab object at 0x7fbe8370bf60>, 'francisco': <gensim.models.keyedvectors.Vocab object at 0x7fbe837cab70>, 'two': <gensim.models.keyedvectors.Vocab object at 0x7fbe784b0048>, 'who': <gensim.models.keyedvectors.Vocab object at 0x7fbe77fde668>, "'s": <gensim.models.keyedvectors.Vocab object at 0x7fbe77fde6a0>, 'fran': <gensim.models.keyedvectors.Vocab object at 0x7fbe77fde6d8>, 'nay': <gensim.models.keyedvectors.Vocab object at 0x7fbe77fde710>, 'answer': <gensim.models.keyedvectors.Vocab object at 0x7fbe77fde748>, 'stand': <gensim.models.keyedvectors.Vocab object at 0x7fbe77fde780>, 'vnfold': <gensim.m

In [3]:
v1 = word2vec.wv['heart']
print(v1)

[-0.10965656 -0.19440058  0.04478817  0.02354403  0.1460341  -0.08431838
  0.3867209  -0.05071728 -0.02770275  0.09110805  0.21167877 -0.08531615
 -0.19137006 -0.12802282  0.0774199   0.11014211  0.01992117  0.09093858
 -0.15589936  0.10439043 -0.01269653 -0.0226512   0.15009551 -0.09626579
 -0.0617141   0.05230551  0.12016147  0.07233952 -0.1115278  -0.23280461
 -0.3027934  -0.10698824  0.0474533  -0.20318016 -0.17140846 -0.11836509
  0.05690295  0.02470967 -0.0430086   0.04991573  0.10011757 -0.18146518
  0.26836756 -0.0764969   0.12393732  0.16229035 -0.12622204 -0.13056064
  0.09709316  0.03007673  0.13926437 -0.03828734  0.1577552  -0.1924679
  0.0334581   0.07286196 -0.32882804 -0.16254403 -0.03900281 -0.14916731
  0.0092638  -0.15682232 -0.00069207  0.00703817  0.03289991 -0.03965494
 -0.10843748 -0.27450758 -0.04931876  0.09796052  0.03460863  0.2573241
  0.23109251 -0.26298776  0.27077448 -0.18982199  0.2208644  -0.32345068
  0.36670202  0.14779823 -0.0544466   0.08337989  0.0

In [4]:
sim_words = word2vec.wv.most_similar('truth')
print(sim_words)

[('himselfe', 0.999485969543457), ('thing', 0.9994436502456665), ('his', 0.9994401931762695), ('three', 0.9994359016418457), ('where', 0.9994220733642578), ('which', 0.9994214773178101), ('deere', 0.9994214177131653), ('yet', 0.9994207620620728), ('state', 0.9994165897369385), ('make', 0.9994117021560669)]
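The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking in plain Python; the three 3-dimensional vectors are hypothetical values chosen for illustration, not vectors from the Hamlet model.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors, used only for illustration.
toy_vectors = {
    "king": [0.9, 0.1, 0.3],
    "queen": [0.8, 0.2, 0.35],
    "apple": [-0.2, 0.9, -0.4],
}


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


ranking = most_similar("king", toy_vectors)
print([word for word, _ in ranking])  # ['queen', 'apple']
```

A similarity close to 1 means the two vectors point in nearly the same direction. When every neighbour scores around 0.999, as in the output above, the vectors are barely distinguishable from one another.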


<h2>4. GloVe (Global Vectors) Architecture</h2>

<p>GloVe is an unsupervised learning method that generates vector representations of words based on the co-occurrence probability between pairs of words. It builds on the difference vector between the vector representations of those words, as the image below shows. </p>

![GloVe - Vector distance](https://nlp.stanford.edu/projects/glove/images/man_woman.jpg)

<p>The intuition behind GloVe is that the co-occurrence probability of two words is high when they share a context (they are naturally associated) and low when they are unrelated. In the example below, taken from a real corpus of 6 billion words, what matters is the ratio between the two probabilities: for probe words related to both target words (water) or to neither of them (fashion) the ratio stays close to 1, and these two cases are noise for the model.</p>

![GloVe - Intuition about word co-occurrence](https://nlp.stanford.edu/projects/glove/images/table.png)

<p>The goal of GloVe is to learn word vectors such that the dot product between the vectors of two words equals the logarithm of their co-occurrence probability. This is done with a simple model fitted by weighted least squares. In this way the vectors capture the meaning of the words.</p>
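For reference, the least-squares objective can be written as in Pennington et al. (2014). The notation is the paper's, not the lecture's: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The weighting function $f$ keeps very frequent pairs from dominating the sum and gives pairs that never co-occur a weight of zero. In this formulation the target is $\log X_{ij}$, and the bias terms absorb the normalization that turns counts into probabilities.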

Let's look at a practical example that uses the glove_python package, with code inspired by https://medium.com/@japneet121/word-vectorization-using-glove-76919685ee0b


In [5]:
!pip install glove_python

Collecting glove_python
  Downloading https://files.pythonhosted.org/packages/3e/79/7e7e548dd9dcb741935d031117f4bed133276c2a047aadad42f1552d1771/glove_python-0.1.0.tar.gz (263kB)
    100% |████████████████████████████████| 266kB 7.7MB/s 
Building wheels for collected packages: glove-python
  Building wheel for glove-python (setup.py) ... done
  Stored in directory: /tmp/.cache/pip/wheels/88/4b/6d/10c0d2ad32c9d9d68beec9694a6f0b6e83ab1662a90a089a4b
Successfully built glove-python
Installing collected packages: glove-python
Successfully installed glove-python-0.1.0


In [6]:
from glove import Corpus, Glove

# Build the word-word co-occurrence matrix with a context window of 10 tokens.
corpus = Corpus()
corpus.fit(all_words, window=10)

# Create a Glove object that learns embeddings from the co-occurrence matrix.
# The model is trained by gradient descent, so we set the learning rate
# along with the number of components (the embedding dimension).
glove = Glove(no_components=5, learning_rate=0.05)

glove.fit(corpus.matrix, epochs=30, no_threads=1, verbose=True)
# Attach the word-to-id dictionary so that words can be looked up by name.
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')

Performing 30 training epochs with 1 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29


In [7]:
print(corpus.dictionary)



In [8]:
print(corpus.matrix)

  (0, 1)	1.2000000476837158
  (0, 1)	1.0
  (0, 2)	1.2750000953674316
  (0, 2)	0.8333333730697632
  (0, 3)	0.3333333432674408
  (0, 4)	0.25
  (0, 5)	0.20000000298023224
  (0, 6)	0.1666666716337204
  (0, 7)	0.1428571492433548
  (0, 10)	0.3333333432674408
  (0, 13)	0.8333333730697632
  (0, 16)	1.7666666507720947
  (0, 16)	0.20000000298023224
  (0, 19)	0.5
  (0, 19)	0.1428571492433548
  (0, 20)	0.5
  (0, 22)	2.2666666507720947
  (0, 22)	0.5166666507720947
  (0, 24)	0.36666667461395264
  (0, 24)	0.625
  (0, 25)	0.2678571343421936
  (0, 26)	4.724999904632568
  (0, 26)	5.0333333015441895
  (0, 27)	0.20000000298023224
  (0, 27)	0.2666666805744171
  :	:
  (4756, 4758)	0.1666666716337204
  (4756, 4759)	0.1428571492433548
  (4757, 4758)	0.5
  (4757, 4759)	0.3333333432674408
  (4758, 4759)	1.0
  (4759, 4760)	0.10000000149011612
  (4760, 4761)	0.5
  (4760, 4762)	0.25
  (4761, 4762)	0.5
  (4764, 4765)	0.3333333432674408
  (4766, 4767)	0.3333333432674408
  (4766, 4768)	0.20000000298023224
  (4767, 47
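The fractional entries in the matrix above (0.5, 0.333..., 0.25, 0.2, ...) are consistent with each co-occurrence being weighted by the inverse of the distance between the two words. The sketch below builds such a matrix in plain Python. It is an illustration of that weighting scheme, not glove_python's actual implementation; the function name `cooccurrence` and the choice to skip pairs of identical words are assumptions.

```python
from collections import defaultdict


def cooccurrence(sentences, window=10):
    """Build distance-weighted co-occurrence counts (upper triangle only)."""
    word_ids = {}
    counts = defaultdict(float)
    for sentence in sentences:
        # Assign ids in order of first appearance.
        ids = [word_ids.setdefault(w, len(word_ids)) for w in sentence]
        for i, left in enumerate(ids):
            for j in range(i + 1, min(len(ids), i + window + 1)):
                right = ids[j]
                if left == right:
                    continue  # assumption: a word is not paired with itself
                key = (min(left, right), max(left, right))
                # Closer words contribute more: weight is 1 / distance.
                counts[key] += 1.0 / (j - i)
    return word_ids, dict(counts)


word_ids, counts = cooccurrence([["to", "be", "or", "not", "to", "be"]])
print(word_ids)
print(counts[(word_ids["to"], word_ids["or"])])  # 0.5 + 0.5 = 1.0
```

Words that appear next to each other add 1 to their entry, words two positions apart add 0.5, and so on, so nearby context counts more than distant context.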

In [9]:
# Let's look at the vector of a single word:
print(glove.word_vectors[glove.dictionary['heart']])


[-0.00283001  0.16988025  0.27931247  0.17671957 -0.25639263]


In [10]:
glove.most_similar('truth')

[('ophe', 0.9910646621285898),
 ('abhorred', 0.9888344465913119),
 ('commission', 0.9877386240000242),
 ('poem', 0.9784020566777817)]

<b>References:</b>

<p>https://stackabuse.com/implementing-word2vec-with-gensim-library-in-python/</p>
<p>McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</p>
<p>Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. </p>
<p>https://medium.com/@japneet121/word-vectorization-using-glove-76919685ee0b</p>
<p>Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/pdf/1301.3781.pdf </p>

<p><b>Exercise 5:</b> Retrain your sentiment classification model (positive and negative) using the corpus available in NLTK (see the statement of Exercise 4). This time, however, use the Skip-Gram, CBOW and GloVe representations. Which model gives the best result? How do the results compare with those obtained with the LSA model used previously?
</p>

Hint: the code below implements a fit/transform transformer for Word2Vec.

In [11]:
import nltk
import numpy as np
from gensim.models import Word2Vec
from sklearn.base import BaseEstimator, TransformerMixin


class Word2VecTransformer(BaseEstimator, TransformerMixin):
    """Represent each document as the mean of its Word2Vec word vectors."""

    # Values for gensim's `sg` argument. gensim trains skip-gram for any
    # non-zero value, so CBOW must be 0.
    ALGO_SKIP_GRAM = 1
    ALGO_CBOW = 0

    # BaseEstimator derives get_params/set_params from this signature, so
    # `algo` survives the cloning done by RandomizedSearchCV.
    def __init__(self, algo=1):
        self.algo = algo

    def fit(self, X, y=None):
        X = [nltk.word_tokenize(x) for x in X]

        self.word2vec = Word2Vec(X, min_count=2, sg=self.algo)

        # Store the vector size so that documents without any in-vocabulary
        # word can be mapped to a zero vector of the right shape.
        self.num_dim = self.word2vec.wv.vector_size

        return self

    def transform(self, X, y=None):
        X = [nltk.word_tokenize(x) for x in X]
        wv = self.word2vec.wv

        # Average the vectors of the known words in each document.
        return np.array([np.mean([wv[w] for w in words if w in wv]
                                 or [np.zeros(self.num_dim)], axis=0)
                         for words in X])
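The core of `transform` is mean pooling: a document vector is the average of the vectors of its known words, with a zero vector as fallback when no word is in the vocabulary. A dependency-free sketch of that step follows; the helper name `mean_vector` and the 2-dimensional toy vectors are hypothetical.

```python
def mean_vector(tokens, vectors, num_dim):
    """Average the vectors of known tokens; fall back to zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * num_dim
    # zip(*known) iterates over the dimensions of the stacked vectors.
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical 2-dimensional word vectors, used only for illustration.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

print(mean_vector(["good", "movie", "unseen"], toy_vectors, 2))  # [0.5, 0.5]
print(mean_vector(["unseen"], toy_vectors, 2))                   # [0.0, 0.0]
```

Averaging discards word order, so two sentences containing the same words in a different order receive the same vector.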


**Skip-Gram**

In [12]:
import nltk
nltk.download('sentence_polarity')
from nltk.corpus import sentence_polarity

[nltk_data] Downloading package sentence_polarity to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package sentence_polarity is already up-to-date!


In [13]:
# No loop over fileids() is needed: sents(categories=...) already returns
# every sentence of the requested category.
x_data_pol_pos = [' '.join(sent) for sent in sentence_polarity.sents(categories=['pos'])]
y_data_pos = [1] * len(x_data_pol_pos)

x_data_pol_neg = [' '.join(sent) for sent in sentence_polarity.sents(categories=['neg'])]
y_data_neg = [0] * len(x_data_pol_neg)

x_data_full = x_data_pol_pos[:500] + x_data_pol_neg[:500]
print(len(x_data_full))
y_data_full = y_data_pos[:500] + y_data_neg[:500]
print(len(y_data_full))

1000
1000


In [14]:
import numpy as np

x_data = np.array(x_data_full, dtype=object)
print(x_data.shape)
y_data = np.array(y_data_full)
print(y_data.shape)

(1000,)
(1000,)


In [15]:
train_indexes = np.random.rand(len(x_data)) < 0.80

print(len(train_indexes))
print(train_indexes[:10])

1000
[False False  True  True  True  True False  True  True False]
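Because the mask is drawn without a seed, every run produces a different split, so the accuracies of different runs are not measured on the same test sentences. A seeded variant is sketched below in plain Python; the function name `split_mask` and the seed value are assumptions made for illustration.

```python
import random


def split_mask(n_items, train_fraction=0.8, seed=42):
    """Return a reproducible boolean mask: True marks a training item."""
    rng = random.Random(seed)
    return [rng.random() < train_fraction for _ in range(n_items)]


mask = split_mask(1000)
train_size = sum(mask)
print(train_size, 1000 - train_size)

# Calling the function again with the same seed yields the same split.
print(mask == split_mask(1000))  # True
```

With NumPy the same idea applies: create a generator with a fixed seed and draw the random numbers from it.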


In [16]:
x_data_train = x_data[train_indexes]
y_data_train = y_data[train_indexes]

print(len(x_data_train))
print(len(y_data_train))

792
792


In [17]:
x_data_test = x_data[~train_indexes]
y_data_test = y_data[~train_indexes]

print(len(x_data_test))
print(len(y_data_test))

208
208


In [18]:
# tokenizer from lecture 4
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import wordnet

stopwords_list = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

def my_tokenizer(doc):
    words = word_tokenize(doc)
    
    pos_tags = pos_tag(words)
    
    non_stopwords = [w for w in pos_tags if not w[0].lower() in stopwords_list]
    
    non_punctuation = [w for w in non_stopwords if not w[0] in string.punctuation]
    
    lemmas = []
    for w in non_punctuation:
        if w[1].startswith('J'):
            pos = wordnet.ADJ
        elif w[1].startswith('V'):
            pos = wordnet.VERB
        elif w[1].startswith('N'):
            pos = wordnet.NOUN
        elif w[1].startswith('R'):
            pos = wordnet.ADV
        else:
            pos = wordnet.NOUN
        
        lemmas.append(lemmatizer.lemmatize(w[0], pos))

    return lemmas

In [19]:
from sklearn.decomposition import TruncatedSVD

class SVDDimSelect(object):
    def fit(self, X, y=None):
        try:
            self.svd_transformer = TruncatedSVD(n_components=round(X.shape[1]/2))
            self.svd_transformer.fit(X)

            # Find the smallest number of components whose cumulative
            # explained variance reaches 50%.
            cumulative_variance = 0.0
            k = 0
            for var in sorted(self.svd_transformer.explained_variance_ratio_, reverse=True):
                k += 1
                cumulative_variance += var
                if cumulative_variance >= 0.5:
                    break

            self.svd_transformer = TruncatedSVD(n_components=k)
        except Exception as ex:
            print(ex)

        # Refit with the selected number of components and return the
        # transformer itself, as scikit-learn expects from fit().
        self.svd_transformer.fit(X)
        return self

    def transform(self, X, Y=None):
        return self.svd_transformer.transform(X)

    def get_params(self, deep=True):
        return {}

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10, weights='uniform')

my_pipeline = Pipeline([('Word2Vec', Word2VecTransformer()),\
                       ('clf', clf)])

In [21]:
from sklearn.model_selection import RandomizedSearchCV
import scipy
par = {'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']}
hyperpar_selector = RandomizedSearchCV(my_pipeline, par, cv=3, scoring='accuracy', n_jobs=1, n_iter=20)

In [22]:
print(x_data_train)

['effective but too-tepid biopic'
 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one ."
 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .'
 'perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .'
 "steers turns in a snappy screenplay that curls at the edges ; it's so clever you want to hate it . but he somehow pulls it off ."
 'what really surprises about wisegirls is its low-key quality and genuine tenderness .'
 '( wendigo is ) why we go to the cinema : to be fed through the eye , the heart , the mind .'
 'ultimately , it ponders the reasons we need stories so much .'
 'illuminating if overly talky documentary .'
 'offers a breath of the fresh air of true sophistication .'
 'a thoughtful ,

In [23]:
hyperpar_selector.fit(X=x_data_train, y=y_data_train)



RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('Word2Vec', <__main__.Word2VecTransformer object at 0x7fbe741fa630>), ('clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=10, p=2,
           weights='uniform'))]),
          fit_params=None, iid='warn', n_iter=20, n_jobs=1,
          param_distributions={'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring='accuracy', verbose=0)

In [24]:
print("Best score: %0.3f" % hyperpar_selector.best_score_)
print("Best parameters set:")
best_parameters = hyperpar_selector.best_estimator_.get_params()
for param_name in sorted(par.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.510
Best parameters set:
	clf__n_neighbors: 11
	clf__weights: 'distance'


In [25]:
from sklearn.metrics import accuracy_score
y_pred = hyperpar_selector.predict(x_data_test)
accSKPGRAM=accuracy_score(y_data_test, y_pred)
print(accSKPGRAM)


0.5240384615384616




In [26]:
import pickle
string_obj = pickle.dumps(hyperpar_selector)

In [27]:
model_file = open('model.pkl', 'wb')
model_file.write(string_obj)
model_file.close()

In [28]:
model_file = open('model.pkl', 'rb')
model_content = model_file.read()
obj_classifier = pickle.loads(model_content)
model_file.close()
res = obj_classifier.predict(["what's up bro?"])
print(res)

[0]
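The save and load steps above can also be written with context managers, which close the file even if an error occurs. The sketch below uses a plain dictionary as a stand-in for the fitted search object and a temporary directory instead of the working directory; both are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted object (hypothetical values).
model = {"n_neighbors": 11, "weights": "distance"}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# pickle.dump writes straight to the file; no intermediate bytes object is needed.
with open(path, "wb") as model_file:
    pickle.dump(model, model_file)

with open(path, "rb") as model_file:
    restored = pickle.load(model_file)

print(restored == model)  # True
```

Unpickling a custom class such as `Word2VecTransformer` requires the class definition to be importable in the process that loads the file.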




In [29]:
res = obj_classifier.predict(x_data_test)
print(accuracy_score(y_data_test, res))



0.5240384615384616


In [30]:
res = obj_classifier.predict(x_data_test)
print(res)



[0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 1
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1
 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0
 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0
 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 0 1 0
 0 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0]


In [31]:
# Test sentences that the classifier labeled as positive (1).
positive = [x_data_test[i] for i in range(len(res)) if res[i] == 1]
for txt in positive:
    print("%s\n" % txt)

take care of my cat offers a refreshingly different slice of asian cinema .

one of the greatest family-oriented , fantasy-adventure movies ever .

an utterly compelling 'who wrote it' in which the reputation of the most famous author who ever lived comes into question .

a masterpiece four years in the making .

the movie's ripe , enrapturing beauty will tempt those willing to probe its inscrutable mysteries .

scores a few points for doing what it does with a dedicated and good-hearted professionalism .

a masterful film from a master filmmaker , unique in its deceptive grimness , compelling in its fatalist worldview .

the film makes a strong case for the importance of the musicians in creating the motown sound .

behind the snow games and lovable siberian huskies ( plus one sheep dog ) , the picture hosts a parka-wrapped dose of heart .

everytime you think undercover brother has run out of steam , it finds a new way to surprise and amuse .

manages to be original , even though it 

In [32]:
# Test sentences that the classifier labeled as negative (0).
negative = [x_data_test[i] for i in range(len(res)) if res[i] == 0]
for txt in negative:
    print("%s\n" % txt)

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .

offers that rare combination of entertainment and education .

this is a film well worth seeing , talking and singing heads and all .

with a cast that includes some of the top actors working in independent film , lovely & amazing involves us because it is so incisive , so bleakly amusing about how we go about our lives .

cantet perfectly captures the hotel lobbies , two-lane highways , and roadside cafes that permeate vincent's days

like most bond outings in recent years , some of the stunts are so outlandish that they border on being cartoonlike . a heavy reliance on cgi technology 

In [33]:
res2 = obj_classifier.predict(["unfortunately the story and the actors are served with a hack script"])
print(res2)

[1]




**CBOW**

In [34]:
import nltk
nltk.download('sentence_polarity')
from nltk.corpus import sentence_polarity

[nltk_data] Downloading package sentence_polarity to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package sentence_polarity is already up-to-date!


In [35]:
import nltk

# No loop over fileids() is needed: sents(categories=...) already returns
# every sentence of the requested category.
x_data_pol_pos = [' '.join(sent) for sent in sentence_polarity.sents(categories=['pos'])]
y_data_pos = [1] * len(x_data_pol_pos)

x_data_pol_neg = [' '.join(sent) for sent in sentence_polarity.sents(categories=['neg'])]
y_data_neg = [0] * len(x_data_pol_neg)

x_data_full = x_data_pol_pos[:500] + x_data_pol_neg[:500]
print(len(x_data_full))
y_data_full = y_data_pos[:500] + y_data_neg[:500]
print(len(y_data_full))

1000
1000


In [36]:
import numpy as np

x_data = np.array(x_data_full, dtype=object)
print(x_data.shape)
y_data = np.array(y_data_full)
print(y_data.shape)

(1000,)
(1000,)


In [37]:
train_indexes = np.random.rand(len(x_data)) < 0.80

print(len(train_indexes))
print(train_indexes[:10])

1000
[ True  True  True  True  True False  True  True  True  True]


In [38]:
x_data_train = x_data[train_indexes]
y_data_train = y_data[train_indexes]

print(len(x_data_train))
print(len(y_data_train))

818
818


In [39]:
x_data_test = x_data[~train_indexes]
y_data_test = y_data[~train_indexes]

print(len(x_data_test))
print(len(y_data_test))

182
182


In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10, weights='uniform')

# sg=0 selects CBOW; gensim treats any non-zero value as skip-gram.
my_pipeline = Pipeline([('Word2Vec', Word2VecTransformer(algo=0)),
                        ('clf', clf)])

In [43]:
from sklearn.model_selection import RandomizedSearchCV
import scipy

par = {'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']}


hyperpar_selector = RandomizedSearchCV(my_pipeline, par, cv=3, scoring='accuracy', n_jobs=1, n_iter=20)

In [44]:
print(x_data_train)

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'
 'effective but too-tepid biopic'
 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one ."
 'offers that rare combination of entertainment and education .'
 'perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .'
 "steers turns in a snappy screenplay that curls at the edges ; it's so clever you want to hate it . but he somehow pulls it off ."
 'take care of my cat offers a ref

In [45]:
hyperpar_selector.fit(X=x_data_train, y=y_data_train)



RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('Word2Vec', <__main__.Word2VecTransformer object at 0x7fbe65b53e10>), ('clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=10, p=2,
           weights='uniform'))]),
          fit_params=None, iid='warn', n_iter=20, n_jobs=1,
          param_distributions={'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring='accuracy', verbose=0)

In [46]:
print("Best score: %0.3f" % hyperpar_selector.best_score_)
print("Best parameters set:")
best_parameters = hyperpar_selector.best_estimator_.get_params()
for param_name in sorted(par.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.543
Best parameters set:
	clf__n_neighbors: 38
	clf__weights: 'distance'


In [47]:
from sklearn.metrics import accuracy_score

y_pred = hyperpar_selector.predict(x_data_test)

accCBOW=accuracy_score(y_data_test, y_pred)
print(accCBOW)

0.4945054945054945




In [48]:
import pickle

string_obj = pickle.dumps(hyperpar_selector)

In [49]:
model_file = open('model.pkl', 'wb')

model_file.write(string_obj)

model_file.close()


In [50]:
model_file = open('model.pkl', 'rb')
model_content = model_file.read()

obj_classifier = pickle.loads(model_content)

model_file.close()

res = obj_classifier.predict(["what's up bro?"])

print(res)

[0]




In [51]:
res = obj_classifier.predict(x_data_test)
print(accuracy_score(y_data_test, res))



0.4945054945054945


In [52]:
res = obj_classifier.predict(x_data_test)

print(res)

[1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 1 0 1
 0 0 1 1 0 1 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 1 0 1 0 1 1 0 0 0 0 0
 1 1 0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1
 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0]




In [53]:
# Test sentences that the classifier labeled as positive (1).
positive = [x_data_test[i] for i in range(len(res)) if res[i] == 1]

for txt in positive:
    print("%s\n" % txt)

the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .

this is a film well worth seeing , talking and singing heads and all .

the movie's ripe , enrapturing beauty will tempt those willing to probe its inscrutable mysteries .

though it is by no means his best work , laissez-passer is a distinguished and distinctive effort by a bona-fide master , a fascinating film replete with rewards to be had by all willing to make the effort to reap them .

manages to be original , even though it rips off many of its ideas .

you'd think by now america would have had enough of plucky british eccentrics with hearts of gold . yet the act is still charming here .

son of the bride may be a good half-hour too long but comes replete with a flattering sense of mystery and quietness .

a taut , intelligent psychological drama .

painful to watch , but viewers willing to take a chance will be rewarded with two of the year

In [54]:
# Test sentences the classifier assigned to class 0
informal = [x_data_test[i] for i in range(len(res)) if res[i] == 0]

for txt in informal:
    print("%s\n" % txt)

( wendigo is ) why we go to the cinema : to be fed through the eye , the heart , the mind .

illuminating if overly talky documentary .

offers a breath of the fresh air of true sophistication .

spiderman rocks

behind the snow games and lovable siberian huskies ( plus one sheep dog ) , the picture hosts a parka-wrapped dose of heart .

singer/composer bryan adams contributes a slew of songs  a few potential hits , a few more simply intrusive to the story  but the whole package certainly captures the intended , er , spirit of the piece .

whether or not you're enlightened by any of derrida's lectures on " the other " and " the self , " derrida is an undeniably fascinating and playful fellow .

just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .

the invincible werner herzog is alive and well and living in la

at heart the movie is a deftly wrought suspense yarn whose richer shadings work as coloring rathe

In [55]:
res2 = obj_classifier.predict(["unfortunately the story and the actors are served with a hack script"])

print(res2)

[0]




**GloVe**

In [56]:
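import nltk
from nltk.corpus import sentence_polarity

# The for loop is not needed: sents(categories=...) already returns all sentences of the category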
# x_data_pos = []
# for fileid in nltk.corpus.sentence_polarity.fileids():
#     x_data_pos.extend([' '.join(sent) for sent in sentence_polarity.sents(categories=['pos'])])

x_data_pol_pos = []
x_data_pol_pos.extend([' '.join(sent) for sent in sentence_polarity.sents(categories=['pos'])])

y_data_pos = [1] * len(x_data_pol_pos)

# x_data_neg = []
# for fileid in nltk.corpus.sentence_polarity.fileids():
#     x_data_neg.extend([' '.join(sent) for sent in sentence_polarity.sents(categories=['neg'])])

x_data_pol_neg = []
x_data_pol_neg.extend([' '.join(sent) for sent in sentence_polarity.sents(categories=['neg'])])
y_data_neg = [0] * len(x_data_pol_neg)

x_data_full = x_data_pol_pos[:500] + x_data_pol_neg[:500]
print(len(x_data_full))
y_data_full = y_data_pos[:500] + y_data_neg[:500]
print(len(y_data_full))

1000
1000


In [57]:
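import numpy as np
from glove import Corpus, Glove  # glove-python package


class GloveTransformer(object):
    """Represent each sentence as the mean of its GloVe word vectors."""

    def __init__(self, algo=2):
        self.algo = algo  # not used by this class

    def fit(self, X, y=None):
        # Tokenize each sentence into a list of words
        X = [nltk.word_tokenize(x) for x in X]

        # Build the word co-occurrence matrix with a 10-word window
        self.corpus = Corpus()
        self.corpus.fit(X, window=10)

        self.num_dim = 100
        self.glove = Glove(no_components=self.num_dim, learning_rate=0.05)

        # Train on this transformer's own co-occurrence matrix
        self.glove.fit(self.corpus.matrix, epochs=30, no_threads=1, verbose=True)
        self.glove.add_dictionary(self.corpus.dictionary)

        return self

    def transform(self, X, Y=None):
        X = [nltk.word_tokenize(x) for x in X]

        # Average the vectors of the known words; fall back to a zero vector
        # when none of the words in a sentence is in the vocabulary
        return np.array([np.mean([self.glove.word_vectors[self.glove.dictionary[w]]
                                  for w in words if w in self.glove.dictionary]
                                 or [np.zeros(self.num_dim)], axis=0)
                         for words in X])

    def get_params(self, deep=True):
        return {}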
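
The `transform` step above represents a sentence as the average of the vectors of its known words, with a zero vector as fallback when no word is in the vocabulary. The sketch below isolates that idea under stated assumptions: `mean_sentence_vector` is a hypothetical helper, the two-dimensional embeddings are invented, and plain lists replace NumPy arrays so that the example has no dependencies.

```python
def mean_sentence_vector(tokens, word_vectors, num_dim):
    """Average the vectors of the known tokens; zero vector if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * num_dim
    # zip(*known) groups the values of each dimension across all known words
    return [sum(values) / len(known) for values in zip(*known)]


# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
}

print(mean_sentence_vector(["good", "movie", "unseen"], toy_vectors, 2))  # [0.5, 0.5]
print(mean_sentence_vector(["unseen"], toy_vectors, 2))                   # [0.0, 0.0]
```

One consequence of averaging is that word order is discarded: two sentences made of the same words map to the same vector.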

In [58]:
import numpy as np

x_data = np.array(x_data_full, dtype=object)
#x_data = np.array(x_data_full)
print(x_data.shape)
y_data = np.array(y_data_full)
print(y_data.shape)

(1000,)
(1000,)


In [59]:
from sklearn.pipeline import Pipeline
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10, weights='uniform')

my_pipeline = Pipeline([('gt', GloveTransformer()),
                        ('clf', clf)])

In [60]:
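from sklearn.model_selection import RandomizedSearchCV

# Search space: number of neighbors and neighbor weighting scheme
par = {'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']}

# Evaluate 20 randomly chosen parameter combinations with 3-fold cross-validation
hyperpar_selector = RandomizedSearchCV(my_pipeline, par, cv=3, scoring='accuracy', n_jobs=1, n_iter=20)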
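
The search space above contains 59 values of `n_neighbors` and 2 weighting schemes, and `n_iter=20` means only 20 of those combinations are evaluated. The sketch below illustrates the idea of random search with the standard library only; it is not scikit-learn's implementation, and the fixed seed and variable names are assumptions made for the example.

```python
import itertools
import random

param_space = {
    "n_neighbors": range(1, 60),
    "weights": ["uniform", "distance"],
}

# Every possible combination in the grid: 59 * 2 = 118 candidates
names = sorted(param_space)
all_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*(param_space[n] for n in names))
]

# Random search evaluates only a random subset of the candidates
rng = random.Random(0)  # fixed seed (assumption) so the sketch is repeatable
sampled = rng.sample(all_candidates, 20)

print(len(all_candidates), len(sampled))
print(sampled[0])
```

In the actual search, each sampled candidate would be scored by cross-validation and the best one kept, so roughly one sixth of the grid is explored here.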

In [61]:
print(x_data_train)

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'
 'effective but too-tepid biopic'
 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one ."
 'offers that rare combination of entertainment and education .'
 'perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .'
 "steers turns in a snappy screenplay that curls at the edges ; it's so clever you want to hate it . but he somehow pulls it off ."
 'take care of my cat offers a ref

In [62]:
hyperpar_selector.fit(X=x_data_train, y=y_data_train)

Performing 30 training epochs with 1 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29
Performing 30 training epochs with 1 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('gt', <__main__.GloveTransformer object at 0x7fbe65b43278>), ('clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=10, p=2,
           weights='uniform'))]),
          fit_params=None, iid='warn', n_iter=20, n_jobs=1,
          param_distributions={'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring='accuracy', verbose=0)

In [63]:
print("Best score: %0.3f" % hyperpar_selector.best_score_)
print("Best parameters set:")
best_parameters = hyperpar_selector.best_estimator_.get_params()
for param_name in sorted(par.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.551
Best parameters set:
	clf__n_neighbors: 2
	clf__weights: 'uniform'


In [64]:
from sklearn.metrics import accuracy_score

# Predict with the best estimator found by the search
y_pred = hyperpar_selector.predict(x_data_test)

accGLOVE = accuracy_score(y_data_test, y_pred)
print(accGLOVE)

0.5


**Conclusion**

Because we used the same dataset/corpus and followed essentially the same steps for every model, the comparison is valid.

In [65]:
print('LSA exercise 4: ', 0.5792079207920792*100)
print('Skip-Gram:', accSKPGRAM*100)
print('CBOW:', accCBOW*100)
print('GLOVE:', accGLOVE*100)

LSA exercise 4:  57.920792079207914
Skip-Gram: 52.40384615384615
CBOW: 49.45054945054945
GLOVE: 50.0


Comparing the four models on this positive/negative sentiment analysis task, LSA performed best, at about 57.9% accuracy, while Skip-Gram, CBOW, and GloVe all stayed close to 50%.
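
The comparison can also be made explicit by ranking the reported accuracies against a chance baseline. This is a minimal sketch: the accuracy values are the ones printed above, while the 0.5 baseline is an assumption that holds only if the two classes are balanced in the test data.

```python
# Test accuracies reported above
accuracies = {
    "LSA": 0.5792079207920792,
    "Skip-Gram": 0.5240384615384615,
    "CBOW": 0.4945054945054945,
    "GloVe": 0.5,
}

# Assumption: with two balanced classes, random guessing scores about 0.5
chance_level = 0.5

# Sort the models from best to worst accuracy
ranking = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)

for name, acc in ranking:
    print("%-10s %.1f%%  (%+.1f points vs. chance)"
          % (name, acc * 100, (acc - chance_level) * 100))
```

Because each figure comes from a single train/test split of a small dataset, gaps of a few points may not be stable across splits.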