
# Chapter 13 - Word Embeddings

This lecture is based on:

*   The paper "*Distributed Representations of Words and Phrases and their Compositionality*" by Mikolov et al. (2013), available [here](https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf).
*   The word2vec documentation of the [Gensim](https://radimrehurek.com/gensim/models/word2vec.html) library.



## 13.1. Principles of Word Embeddings

In [None]:
!pip install keras
!pip install tensorflow
!pip install -U gensim

In [2]:
from keras.models import Sequential
from keras.layers import Dense,Embedding,Activation,Dropout,SimpleRNN,BatchNormalization,RNN,Flatten,Input,LSTM,Bidirectional
from keras.utils import to_categorical
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas as pd
import numpy as np
import gensim
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

Training a Word2Vec model

In [3]:
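from gensim.models import Word2Vec

# Toy corpus: each inner list is one tokenized sentence.
corpus = [["hello", "world", "hi", "earth", "sunshine", "law"], ["cat", "say", "meow"], ["dog", "say", "woof"]]
# vector_size: embedding dimension; window: context size; min_count=1 keeps every word.
model = Word2Vec(sentences=corpus, vector_size=5, window=5, min_count=1, workers=4)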

In [4]:
word_vectors = model.wv

In [5]:
len(word_vectors[0])

5

In [6]:
word_vectors['world'] - word_vectors['hello']

array([ 0.15723471, -0.24314074,  0.2750364 ,  0.08315043,  0.01469049],
      dtype=float32)

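We can continue training the model on new sentences. Note, however, that `train` only updates words that are already in the vocabulary: the output `(0, 3)` below reports 0 effectively trained words out of 3 raw words, because `dear`, `bear`, and `cream` were never added to the vocabulary. New words must first be registered with `build_vocab(..., update=True)`.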

In [7]:
model.train([["dear", "bear", "cream"]], total_examples=1, epochs=1)  # total_examples = number of sentences



(0, 3)

### Loading Models Trained on Other Corpora

In [8]:
import gensim.downloader

print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [9]:
word_vectors = gensim.downloader.load('word2vec-google-news-300')



In [10]:
word_vectors.most_similar('car')

[('vehicle', 0.7821096181869507),
 ('cars', 0.7423831224441528),
 ('SUV', 0.7160962224006653),
 ('minivan', 0.6907036900520325),
 ('truck', 0.6735789775848389),
 ('Car', 0.6677608489990234),
 ('Ford_Focus', 0.667320191860199),
 ('Honda_Civic', 0.6626849174499512),
 ('Jeep', 0.651133120059967),
 ('pickup_truck', 0.6441438794136047)]

In [12]:
word_vectors['airplane']-word_vectors['flight']

array([ 2.53417969e-01, -2.23144531e-01,  3.80859375e-02,  2.54394531e-01,
       -1.92871094e-01, -1.28906250e-01, -3.61328125e-02,  2.16796875e-01,
        2.08007812e-01,  1.16699219e-01,  1.44042969e-01, -1.10351562e-01,
        9.17968750e-02, -2.11425781e-01, -6.05468750e-02,  1.19140625e-01,
       -1.39404297e-01,  1.58508301e-01,  2.34375000e-02, -4.05273438e-02,
        5.56640625e-02,  5.85021973e-02,  1.89697266e-01, -2.05078125e-02,
        4.58526611e-03, -7.69042969e-02,  6.83593750e-02, -4.39453125e-03,
       -7.51953125e-02, -1.25976562e-01, -2.45117188e-01, -1.58203125e-01,
       -1.97753906e-01,  2.93212891e-01,  2.24243164e-01,  1.37695312e-01,
        2.87658691e-01, -9.76562500e-02, -3.07617188e-02,  1.98242188e-01,
        7.30438232e-02,  5.85937500e-03, -1.48437500e-01,  7.95288086e-02,
       -5.29785156e-02,  1.97265625e-01, -1.09863281e-01, -2.58789062e-02,
        1.17187500e-02, -1.85546875e-01, -1.45019531e-01,  2.28515625e-01,
       -1.41601562e-02, ...

In [15]:
result = (word_vectors['airplane']+word_vectors['flight'])-word_vectors['ship']
print(result)

[-0.19482422 -0.46826172 -0.14648438  0.12548828 -0.33544922 -0.52368164
 -0.171875   -0.4296875   0.19042969 -0.03369141  0.33666992 -0.21191406
  0.4609375  -0.12548828 -0.05224609  0.22460938  0.22924805  0.25360107
  0.19677734  0.24560547 -0.29492188 -0.12802124 -0.31689453 -0.31933594
 -0.32746124  0.02954102 -0.39038086 -0.22509766  0.10644531  0.4794922
 -0.28808594 -0.3334961  -0.05810547 -0.08178711 -0.22473145 -0.01464844
 -0.07659912  0.2529297   0.17871094  0.5859375   0.05122375 -0.00195312
  0.6191406   0.16119385 -0.04638672 -0.56640625  0.4267578  -0.1743164
  0.78808594 -0.04882812 -0.5805664  -0.07519531 -0.32470703  0.23022461
 -0.07110596 -0.45117188  0.19335938 -0.04370117  0.19580078  0.2446289
  0.12402344 -0.46661377  0.05938721 -0.05102539 -0.00793457 -0.06005859
 -0.10253906  0.19628906  0.47387695 -0.01782227 -0.171875   -0.31030273
 -0.17602539  0.09411621 -0.25        0.10791016  0.21543121 -0.2866211
  0.09283447  0.0625     -0.11743164  0.10601807  0.202...

Using the `most_similar` method, we can retrieve the most similar words through an arithmetic operation: the positive vectors are added and the negative vectors are subtracted. The resulting vector is then compared with the vectors of all other words in the embedding using cosine similarity. The 10 most similar words are shown below:

In [11]:
word_vectors.most_similar(positive=['airplane','flight'],negative=['ship'],topn=10) 

[('plane', 0.6277297735214233),
 ('jet', 0.5784463882446289),
 ('flights', 0.5631440877914429),
 ('airliner', 0.5585241913795471),
 ('aircraft', 0.5546182990074158),
 ('jetliner', 0.550014853477478),
 ('NOTE_Expedia_Expedia.com', 0.5478827357292175),
 ('airplanes', 0.5451778173446655),
 ('Flight', 0.5407993197441101),
 ('airline', 0.5332231521606445)]
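
The add, subtract, and rank procedure behind `most_similar` can be sketched with NumPy alone. This is an illustrative approximation, not gensim's implementation, which differs in details. The table `toy_vectors` and the function `most_similar_sketch` are hypothetical, and the vector values are invented for the example.

```python
import numpy as np

# Hypothetical toy embedding table (values invented for illustration only).
toy_vectors = {
    "airplane": np.array([0.90, 0.10, 0.00]),
    "flight":   np.array([0.80, 0.30, 0.10]),
    "plane":    np.array([0.85, 0.20, 0.05]),
    "ship":     np.array([0.10, 0.90, 0.00]),
    "boat":     np.array([0.20, 0.80, 0.10]),
    "banana":   np.array([0.00, 0.10, 0.90]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def most_similar_sketch(vectors, positive, negative, topn=3):
    # Add the unit-normalized positive vectors and subtract the negative ones.
    query = sum(unit(vectors[w]) for w in positive) \
        - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    # The query words themselves are excluded from the ranking.
    exclude = set(positive) | set(negative)
    # The dot product of two unit vectors is their cosine similarity.
    scores = [(w, float(np.dot(unit(v), query)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar_sketch(toy_vectors, ["airplane", "flight"], ["ship"])
print(ranking)
```

With these toy values, `plane` ranks first, since it points in nearly the same direction as `airplane` and `flight` and away from `ship`.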

In [13]:
word_vectors.similarity('orange','apple')

0.39203462

In [14]:
word_vectors.similarity('orange','blue')

0.6421891
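
The `similarity` method returns the cosine similarity between two word vectors, the same measure mentioned above for `most_similar`. For two vectors $u$ and $v$ it is defined as:

$$
\operatorname{sim}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

Values closer to 1 indicate vectors pointing in more similar directions. In the two calls above, `orange` is therefore closer to `blue` (about 0.64) than to `apple` (about 0.39) in this model.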