# _Word Embeddings_

**Distributed semantic representation** is based on the distributional hypothesis, which states that the meaning of a word is given by the contexts in which it occurs [2]. Word vectors built on this hypothesis can be used as features in a variety of applications, such as document classification [3], question answering [4], and named entity recognition [5]. Representing words as continuous vectors has a long history [6], [7], [8].

Many different types of models have been proposed for estimating continuous word representations, including Latent Semantic Analysis (**LSA**) and Latent Dirichlet Allocation (**LDA**).

Distributed word representations learned by neural networks, however, perform significantly better than LSA at preserving linear regularities among words [1], [9]. As for LDA, it is known to be computationally expensive on large datasets.

In this notebook, we train Word2Vec on a Brazilian Portuguese (PT-BR) Wikipedia corpus.


In [1]:
#imports
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

In [2]:
# Create the directories for the dataset and the model.
# -p creates missing parent directories and does not fail if a directory already exists.
!mkdir -p data
!mkdir -p model/word2vec


## NNLM (_Neural Network Language Model_)

An interesting **NNLM** architecture was presented in [10], in which word vectors are first learned using a neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without building the full NNLM. Mikolov et al. (2013) [1] directly extend this architecture, focusing only on the first step, in which the word vectors are learned using a simple model.

The goal is to produce numeric vectors such that words that are similar according to their contexts end up "close" to each other in the vector space, as the figure below illustrates. According to Mikolov et al. (2013) [1], the vectors in each language were projected down to 2 dimensions using **PCA** and manually rotated to emphasize the similarity. The figure shows English words on the right and Spanish words on the left.

<img src='images/word2vec_similaridade.png' width='600'>

The next cell defines a helper function that reads the corpus we will use to train Word2Vec.

The corpus is a [Wikipedia dump](https://www.dropbox.com/s/g2414vow56azzj2/wiki.pt-br.text.zip?dl=0). Download it and save it in the "data" folder.

In [2]:
import gzip

def read_input(input_file):
    """Read a gzip-compressed text file and return its lines as a list."""
    documents = []

    print("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            # progress message every 10,000 lines (the log calls each line a "review")
            if (i % 10000 == 0):
                print("read {0} reviews".format(i))
            # no pre-processing here: each raw line (bytes) is stored as read
            documents.append(line)
    return documents
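
`read_input` keeps every line of the corpus in memory at once. For a corpus with tens of millions of lines, a streaming alternative may be preferable. The sketch below is only an illustration: `GzipSentences` is a hypothetical class that is not part of the original notebook, and it assumes a UTF-8 text file with one whitespace-tokenized sentence per line. It re-opens the file on every pass, so it can be iterated more than once.

```python
import gzip
import os
import tempfile


class GzipSentences:
    """Restartable iterable that yields one list of tokens per line of a gzip text file."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be read several times.
        with gzip.open(self.path, "rt", encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demonstration file, created only for this example.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("o gato dorme\n\nword2vec aprende vetores\n")

sentences = GzipSentences(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a second pass sees the same sentences
print(first_pass)
```

Because gensim's `Word2Vec` accepts any restartable iterable of token lists, an object like this could in principle be passed to it in place of an in-memory list.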

## Word2Vec

Starting from the premise that basic techniques such as n-gram counting have already reached their limits, Mikolov et al. (2013) [1] propose using neural-network-based language models to learn distributed representations of words. The main goal of the techniques proposed by Mikolov et al. (2013) [1] is to learn high-quality word vectors from huge datasets with billions of words. Surprisingly, the similarity of word representations was found to go beyond simple syntactic regularities. Using a simple algebraic operation on the word vectors, it was shown, for example, that:

> vector(**king**) - vector(**man**) + vector(**woman**) = a vector that is close to the vector representation of the word **queen**.

Mikolov et al. (2013) [1] propose two model architectures for learning distributed word representations that try to minimize computational complexity: the _Continuous Bag-of-Words_ (CBOW) model and the Skip-gram model.
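
The vector arithmetic above can be sketched with a few hand-made vectors. Everything in this example is an assumption made for illustration: the 2-dimensional vectors are invented (real embeddings have hundreds of dimensions and are learned from data), and `analogy` is a hypothetical helper, not a gensim function. It computes the offset vector and returns the nearest remaining word by cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded, as is usual in analogy evaluation.
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))


# Invented vectors: axis 0 loosely encodes "royalty", axis 1 loosely encodes "male".
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.85, 0.15],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.2, 0.5],
}

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```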

* **CBOW** – In CBOW, the architecture is similar to the _feedforward_ NNLM, but the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix). Thus, all words are projected into the same position. This architecture is called a _bag-of-words_ model because the order of the words does not influence the projection. CBOW uses a continuous distributed representation of the context. The architecture is shown in the figure below, where the weight matrix between the input and the projection layer is shared for all word positions (in the same way as in the NNLM).
    
    
* **Skip-gram** – The Skip-gram architecture is similar to CBOW, but instead of predicting the current word from its context, Skip-gram tries to maximize the classification of a word based on another word in the same sentence. More precisely, each current word is used as input to a log-linear classifier that predicts words within a certain range before and after the current word. Increasing the range improves the quality of the resulting word vectors, but it also increases the computational complexity. The distance between a context word and the current word indicates how related they are: the more distant a word is, the less related it is to the current word, so it can be given a smaller weight.
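
To make the shared projection layer of CBOW more concrete, the sketch below combines the vectors of the context words into a single projection vector and checks that shuffling the context does not change it. It covers the projection step only, not the output classifier or training. The vocabulary, the random vectors, and the name `project_context` are assumptions made for illustration.

```python
import random

random.seed(0)
dim = 4
vocab = ["the", "cat", "sleeps", "on", "sofa"]
# Hypothetical, randomly initialized input vectors: one row of the projection matrix per word.
input_vectors = {word: [random.uniform(-0.5, 0.5) for _ in range(dim)] for word in vocab}


def project_context(context_words, use_mean=True):
    """Sum the context word vectors and, if use_mean is true, divide by their count."""
    projection = [0.0] * dim
    for word in context_words:
        for k, value in enumerate(input_vectors[word]):
            projection[k] += value
    if use_mean:
        projection = [value / len(context_words) for value in projection]
    return projection


context = ["the", "cat", "on", "sofa"]
shuffled = ["sofa", "on", "cat", "the"]

original_projection = project_context(context)
shuffled_projection = project_context(shuffled)
# Equal up to floating-point rounding: the order of the context words plays no role.
same = all(abs(a - b) < 1e-12 for a, b in zip(original_projection, shuffled_projection))
print(same)
```

Choosing between the sum and the mean here corresponds to the two ways of combining context vectors that CBOW implementations typically offer.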

<img src='images/CBOW_Skip-Gram.png' width='500'>
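
The difference between the two architectures shows up in the training examples each one extracts from a sentence. The following is a minimal sketch, assuming a fixed symmetric window and whitespace tokenization. The function names are hypothetical, and the sketch only builds the examples; it does not train anything.

```python
def context_window(tokens, index, window):
    """Words up to `window` positions before and after tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: one (context words, target word) example per position."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: one (input word, context word) pair per neighbor."""
    return [
        (tokens[i], neighbor)
        for i in range(len(tokens))
        for neighbor in context_window(tokens, i, window)
    ]


tokens = "the cat sleeps on the sofa".split()
cbow = cbow_examples(tokens, window=2)
skipgram = skipgram_examples(tokens, window=2)

print(cbow[2])       # (['the', 'cat', 'on', 'the'], 'sleeps')
print(skipgram[:2])  # [('the', 'cat'), ('the', 'sleeps')]
```

With a larger window, Skip-gram produces more pairs per word, which is one way to see why increasing the range raises the computational cost.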

### Parameters

The next code cell uses the following parameters:

* **sg**: defines the training algorithm. By default, CBOW is used (sg = 0). The alternative is Skip-gram (sg = 1).

* **size**: dimensionality of the word vectors (this is the gensim 3.x name; gensim 4.x calls it **vector_size**).

* **window**: the number of words before and after the target word that are used as context.

* **LineSentence**: a helper class (not a `Word2Vec` parameter) that reads a file, given as a path or an open file object, in which each line is one sentence.

* **min_count**: ignores all words with a total frequency lower than **min_count**.

* **max_vocab_size**: limits RAM usage while the vocabulary is built; if there are more unique words than **max_vocab_size**, the infrequent ones are pruned. Every 10 million word types need about 1 GB of RAM.

* **sample**: threshold that configures which higher-frequency words are randomly downsampled. The default is 1e-3 and the useful range is (0, 1e-5).

* **workers**: how many worker threads (machine cores) are used for training.

* **hs**: if 1, hierarchical softmax is used to train the model. If 0 (default) and **negative** is non-zero, negative sampling is used.

* **negative**: if > 0, negative sampling is used. The value indicates how many "noise words" are drawn (usually between 5 and 20). If **negative** is set to 0, negative sampling is not used.

* **cbow_mean**: if 0, uses the sum of the context word vectors. If 1 (default), uses the mean. Applies only when CBOW is used.

* **hashfxn**: hash function used to randomly initialize the weights.

* **iter**: number of iterations (epochs) over the corpus. The default is 5 (this is the gensim 3.x name; gensim 4.x calls it **epochs**).
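
Two of these parameters, **sample** and **negative**, correspond to formulas described in the NIPS paper by Mikolov et al. listed under the readings at the end of this notebook. The sketch below illustrates them under stated assumptions: the counts are invented, the function names are hypothetical, and library implementations may use a slightly different expression for subsampling. A word that makes up a fraction f of the corpus is discarded with probability 1 - sqrt(sample / f), and noise words are drawn from the unigram distribution raised to the power 3/4.

```python
import math
from collections import Counter


def discard_probability(word_fraction, sample=1e-3):
    """1 - sqrt(sample / f), clipped at 0, where f is the word's share of all tokens."""
    return max(0.0, 1.0 - math.sqrt(sample / word_fraction))


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power` and normalized so they sum to 1."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Invented token counts, for illustration only.
counts = Counter({"de": 50000, "casa": 900, "vetor": 100})
total_tokens = sum(counts.values())

for word, count in counts.items():
    fraction = count / total_tokens
    print(word, round(discard_probability(fraction), 3))

noise = noise_distribution(counts)
print({word: round(p, 3) for word, p in noise.items()})
```

Only words that are more frequent than the threshold are ever discarded, and the 3/4 power flattens the distribution so that rare words are drawn as noise somewhat more often than their raw frequency would suggest.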


In [None]:
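# train model
import gzip
from gensim.utils import to_unicode
documents = read_input('/home/jessica/Documentos/UFSCar/Pesquisa/Projeto/sense2vec-master/bin/corpora/corpora/corpora_tokenized/corpora_tokenized.tar.gz')
# Word2Vec expects an iterable of token lists; LineSentence takes a file path or
# file object, not a list of lines, so each line is decoded and split here instead.
sentences = [to_unicode(line).split() for line in documents]
%time model = Word2Vec(sentences, size=300, window=5, min_count=15, workers=multiprocessing.cpu_count())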

reading file /home/jessica/Documentos/UFSCar/Pesquisa/Projeto/sense2vec-master/bin/corpora/corpora/corpora_tokenized/corpora_tokenized.tar.gz...this may take a while
read 0 reviews
read 10000 reviews
read 20000 reviews
...
read 42150000 reviews
read 42160000 reviews


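* **init_sims(replace=True)**: L2-normalizes the word vectors and, with `replace=True`, keeps only the normalized vectors so that the model needs less memory. After this call the model can no longer be trained further.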

In [12]:
# trim unneeded model memory = use (much) less RAM
model.init_sims(replace=True)

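The next cell saves the word vectors in the word2vec text format at the path given in the code (the commented-out line would instead save the full model to a path stored in `outp`).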

In [13]:
#model.save(outp)
model.wv.save_word2vec_format('/home/jessica/Documentos/UFSCar/Pesquisa/Projeto/portuguese_word_embeddings/word2vec/word2vec_s300.txt', binary=False)
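
With `binary=False`, the saved file is plain text: a header line with the vocabulary size and the vector dimensionality, followed by one line per word containing the word and its values separated by spaces. The sketch below parses that layout by hand to show what the file contains. `read_word2vec_text` is a hypothetical helper and the sample content is invented; with gensim, one would normally load such a file with `KeyedVectors.load_word2vec_format` instead.

```python
import io


def read_word2vec_text(handle):
    """Parse the word2vec text format into a dict that maps each word to a list of floats."""
    vocab_size, dim = (int(value) for value in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if parts == [""]:  # skip blank lines
            continue
        word, values = parts[0], [float(value) for value in parts[1:]]
        if len(values) != dim:
            raise ValueError("expected {0} values for {1!r}, got {2}".format(dim, word, len(values)))
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header announces {0} words, found {1}".format(vocab_size, len(vectors)))
    return vectors


# Invented sample with 2 words of dimensionality 3, standing in for a real file.
sample_file = io.StringIO("2 3\nrei 0.1 0.2 0.3\nrainha 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample_file)
print(vectors["rainha"])
```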

## Further reading

I suggest the following complementary readings on Word2Vec.

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of word2vec from Chris McCormick 
* [First word2vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [NIPS paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for word2vec also from Mikolov et al.
* An [implementation of word2vec](http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/) from Thushan Ganegedara
* TensorFlow [word2vec tutorial](https://www.tensorflow.org/tutorials/word2vec)
* [Deep Learning with Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) by Gensim

## References

[1] [Efficient estimation of word representations in vector space.](https://arxiv.org/abs/1301.3781)

[2] [Multimodal distributional semantics.](https://www.jair.org/media/4135/live-4135-7609-jair.pdf)

[3] [Machine learning in automated text categorization.](http://delivery.acm.org/10.1145/510000/505283/p1-sebastiani.pdf?ip=200.137.216.145&id=505283&acc=ACTIVE%20SERVICE&key=344E943C9DC262BB%2E0ACEC6856BE69272%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1518017578_f5561e072809aadaea8bb04a71a5b21c)

[4] [Quantitative evaluation of passage retrieval algorithms for question answering.](http://delivery.acm.org/10.1145/870000/860445/p41-tellex.pdf?ip=200.137.216.145&id=860445&acc=ACTIVE%20SERVICE&key=344E943C9DC262BB%2E0ACEC6856BE69272%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1518017629_0f0d78efd0501b7ad05a74e586cd7ef8)

[5] [Word representations: a simple and general method for semi-supervised learning.](http://delivery.acm.org/10.1145/1860000/1858721/p384-turian.pdf?ip=200.137.216.145&id=1858721&acc=OPEN&key=344E943C9DC262BB%2E0ACEC6856BE69272%2E4D4702B0C3E38B35%2E6D218144511F3437&__acm__=1518017678_cec1b87c9c6e3f9ccd8e61f591acaa26)

[6] [Distributed representations.](https://web.stanford.edu/~jlmcc/papers/PDP/Chapter3.pdf)

[7] [Learning internal representations by back-propagating errors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition](http://lia.disi.unibo.it/Courses/SistInt/articoli/nnet1.pdf)

[8] [Finding structure in time.](http://onlinelibrary.wiley.com/doi/10.1207/s15516709cog1402_1/epdf)

[9] [Combining heterogeneous models for measuring relational similarity.](http://www.aclweb.org/anthology/N13-1120)

[10] [Neural network based language models for highly inflective languages.](http://ieeexplore.ieee.org/abstract/document/4960686/)