# A Quick Taste of Natural Language Processing

- Bag Of Words
- Word Embeddings
- Word2Vec
- RNN

In [24]:
import re
import nltk
import numpy as np

# Bag of words

In [4]:
sentences = ["The quick brown fox jumps over the lazy dog",
             "never jump over the lazy dog quickly"]

In [5]:
text = " ".join(sentences)
text

'The quick brown fox jumps over the lazy dog never jump over the lazy dog quickly'

In [13]:
words = re.split(r"\ ", text)
words = list(map(lambda x: x.lower(), words))
uniques = np.unique(words)
corpus = {w: i for i,w in enumerate(uniques)}
corpus

{'brown': 0,
 'dog': 1,
 'fox': 2,
 'jump': 3,
 'jumps': 4,
 'lazy': 5,
 'never': 6,
 'over': 7,
 'quick': 8,
 'quickly': 9,
 'the': 10}

In [22]:
def count_words(phrase, word):
    # Count whole tokens; str.count would also match substrings,
    # e.g. "jump" inside "jumps" or "quick" inside "quickly".
    return phrase.lower().split(" ").count(word)

matr_sentence = []
for sentence in sentences:
    # One count per vocabulary word, in the order of `uniques`
    vector_sentence = [count_words(sentence, w) for w in uniques]
    matr_sentence.append(vector_sentence)
np.array(matr_sentence)

array([[1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 2],
       [0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]])
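
The same kind of matrix can also be built with `collections.Counter`. This is a minimal sketch, assuming plain whitespace tokenization; the names `tokenized`, `vocab`, and `bow` are illustrative and not part of the notebook. Because it compares whole tokens, "jump" is never counted inside "jumps".

```python
from collections import Counter

sentences = ["The quick brown fox jumps over the lazy dog",
             "never jump over the lazy dog quickly"]

# Lowercase and split on whitespace (assumed tokenization)
tokenized = [s.lower().split() for s in sentences]

# Sorted vocabulary shared by all sentences
vocab = sorted({w for tokens in tokenized for w in tokens})

# One row per sentence, one column per vocabulary word
bow = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    bow.append([counts[w] for w in vocab])

for row in bow:
    print(row)
```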

# Word embeddings

# $$ W : \text{words} \rightarrow \mathbb{R}^n $$

### J. R. Firth 1957
- "You shall know a word by the company it keeps"

<img src="Pictures/Screenshot from 2020-08-22 16-43-45.png">

[king] - [man] + [woman] ≈ [queen]

### Find most similar
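X = vector("biggest") - vector("big") + vector("small"); the answer is the word whose vector is closest to X by cosine distance (ideally "smallest").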
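
The search for the word closest to X can be sketched with hand-made vectors. The 2-D embeddings in `toy` below are invented for illustration (axis 0 roughly "size", axis 1 roughly "superlative"); real embeddings come from a trained model, and the helper names `cosine` and `analogy` are assumptions of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from any trained model
toy = {
    "big":      [1.0, 0.0],
    "biggest":  [1.0, 1.0],
    "small":    [-1.0, 0.0],
    "smallest": [-1.0, 1.0],
    "large":    [0.9, 0.1],
    "tiny":     [-0.9, -0.1],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c),
    skipping the three query words themselves."""
    x = [va - vb + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(x, vectors[w]))

result = analogy("biggest", "big", "small", toy)
print(result)  # smallest
```

Excluding the query words matters in practice, because the nearest vector to X is often one of the inputs.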

### More examples of semantic relationships
Mikolov et al., "Efficient Estimation of Word Representations in Vector Space"
 - https://arxiv.org/pdf/1301.3781.pdf

In [None]:
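# The ./BIN folder holds the FastText files wiki.simple.bin and wiki.simple.vec
# You can download them from: https://github.com/facebookresearch/fastText

# Note: gensim.models.wrappers exists only in gensim < 4.0. In gensim >= 4.0,
# use gensim.models.fasttext.load_facebook_model('./BIN/wiki.simple.bin') instead.
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('./BIN/wiki.simple')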

In [43]:
# Dimensionality of the embedding of a given word
print(1, "\n", len(model.wv['brain']))
# Test if a word is in the model's vocabulary
print(2, "\n", 'brain' in model.wv.vocab)
# Give the most similar words
print(3, "\n", model.wv.most_similar('brain'))
# Compute cosine similarity between two words
print(4, "\n", model.wv.similarity('brain', 'synapse'))
# Compute cosine similarity between two groups of words (result not printed)
model.wv.n_similarity(['sushi', 'shop'],
    ['japanese', 'restaurant'])
# Do arithmetic with word vectors
print(5, "\n", model.wv.most_similar(positive=['king', 'woman'],
                            negative=['man']))

1 
 300
2 
 True
3 
 [('brainstem', 0.7807184457778931), ('blood–brain', 0.7781210541725159), ('brains', 0.7710486650466919), ('midbrain', 0.760841429233551), ('brainy', 0.7459008693695068), ('forebrain', 0.7418298721313477), ('hindbrain', 0.7350934743881226), ('braint', 0.7265611886978149), ('braine', 0.6931329965591431), ('brained', 0.6897463798522949)]
4 
 0.42670387
5 
 [('queen', 0.5129120349884033), ('kingship', 0.4836469888687134), ('kingz', 0.4800509810447693), ('kings', 0.4712149500846863), ('adulyadej', 0.46895354986190796), ('regnant', 0.4502074420452118), ('kingkong', 0.44608789682388306), ('womans', 0.4455949664115906), ('noblewoman', 0.4404858648777008), ('bhumibol', 0.4334842562675476)]


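The embeddings W(word) can then feed a downstream model R, for example one that outputs 1 when the word sequence is a valid sentence:

```python
R(W("The"), W("quick"), W("brown"), W("fox"), ...) = 1
```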

In [38]:
wvects = list(map(lambda x: model.wv[x], sentences[0].split(" ")))
wvects = np.array(wvects)
print(wvects.shape)

(9, 300)


### Problems of distributed representation
- Similarity and relatedness are not the same:
    - example: "man" and "male" are similar
    - example: "keyboard" and "computer" are related but dissimilar
- Word ambiguity (Mexican Spanish is a good example):
    - Multiple embeddings based on supervised disambiguation, Trask et al.:  https://arxiv.org/abs/1511.06388

### Commonly used pre-trained word embeddings (2016)
- Word2Vec  https://code.google.com/archive/p/word2vec/
- Multilingual Word2Vec  https://github.com/Kyubyong/wordvectors
- GloVe  http://nlp.stanford.edu/projects/glove/
- FastText  https://github.com/icoxfog417/fastTextJapaneseTutorial
- LexVec  https://github.com/alexandres/lexvec
- Meta-Embeddings  http://cistern.cis.lmu.de/meta-emb/

# Training word embeddings

<img src="Pictures/Screenshot from 2020-08-22 16-43-56.png">

# Basic idea of Word2Vec

- Continuous Bag of Words (CBOW): starts from the source context words, aggregates and transforms them, and predicts the target word.
- Skip-gram model: takes each target word as input and predicts its surrounding context words.
<img src="Pictures/Screenshot from 2020-08-22 16-44-03.png">
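
The difference between the two models shows up in how training examples are built from a sentence. The sketch below is an illustration under assumptions: the helper `context_windows` and the window size of 1 are hypothetical choices, not taken from the original Word2Vec code.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for each position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "the quick brown fox jumps".split()

# CBOW: all context words together (input) -> target word (output)
cbow_examples = [(context, target)
                 for target, context in context_windows(tokens, window=1)]

# Skip-gram: target word (input) -> one context word at a time (output)
skipgram_pairs = [(target, c)
                  for target, context in context_windows(tokens, window=1)
                  for c in context]

print(cbow_examples[1])    # (['the', 'brown'], 'quick')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```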

The weight matrix W has V rows (the number of words in the vocabulary) and N columns (the number of nodes in the hidden layer); row i of W is the embedding of word i.
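
To see why a row of W acts as a word's embedding, consider feeding a one-hot input through the matrix: the product simply selects one row. The sizes and values below are toy assumptions chosen for readability, and the helpers `one_hot` and `hidden_layer` are hypothetical names.

```python
# Toy sizes (assumed): V = 4 vocabulary words, N = 3 hidden nodes
V, N = 4, 3

# Toy weight matrix with V rows and N columns (values are arbitrary)
W = [[0.1 * (i * N + j) for j in range(N)] for i in range(V)]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def hidden_layer(x, W):
    """Compute x^T W for an input vector x of length V."""
    return [sum(x[i] * W[i][j] for i in range(len(W)))
            for j in range(len(W[0]))]

# Word number 2 as a one-hot vector: the hidden activation equals row 2 of W
h = hidden_layer(one_hot(2, V), W)
print(h == W[2])  # True
```

This is why implementations usually replace the matrix product with a direct row lookup.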

## Example: training a Word2Vec model with TensorFlow

https://www.tensorflow.org/tutorials/text/word_embeddings

# Why can't a conventional neural network perform well for NLP?

<img src="Pictures/Screenshot from 2020-08-22 16-44-12.png">

<img src="Pictures/Screenshot from 2020-08-22 16-44-21.png">

<img src="Pictures/Screenshot from 2020-08-22 16-44-27.png">

# But what is an RNN?

https://explained.ai/rnn/index.html

- Basically, each layer in an RNN vectorizes data; the layer h (for hidden) does this.
- What exactly is h (sometimes called s) in the recurrence relation representing an RNN, $h_t = W h_{t-1} + U x_t$ (leaving off the nonlinearity)? The variable name h is typically used because it represents the hidden state of the RNN.
- An RNN takes a variable-length input record of symbols (e.g., stock price sequence, document, sentence, or word) and generates a fixed-length vector in high-dimensional space, called an embedding, that somehow meaningfully represents or encodes the input record.
- The vector is only associated with a single input record and is only meaningful in the context of a classification or regression problem; the RNN is just a component of a surrounding model. For example, the h vector is often passed through a final linear layer V (multiclass logistic regressor) to get model predictions.
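
The idea that inputs of any length map to a fixed-length h can be sketched in a few lines. This is a minimal illustration, assuming the common recurrence h_t = tanh(W h_{t-1} + U x_t) with a final linear layer V; the parameter values are arbitrary toy numbers, not trained weights, and `matvec` and `rnn_embed` are hypothetical helper names.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_embed(inputs, W, U):
    """Run h_t = tanh(W h_{t-1} + U x_t) over the sequence and
    return the final hidden state h."""
    h = [0.0] * len(W)  # initial hidden state
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(W, h), matvec(U, x))]
        h = [math.tanh(p) for p in pre]
    return h

# Toy parameters (assumed): hidden size 2, input size 3, 2 output classes
W = [[0.5, -0.1], [0.2, 0.3]]
U = [[0.1, 0.0, -0.2], [0.0, 0.4, 0.1]]
V = [[1.0, -1.0], [0.5, 0.5]]

short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq + [[0.0, 0.0, 1.0]] * 3

h_short = rnn_embed(short_seq, W, U)
h_long = rnn_embed(long_seq, W, U)
logits = matvec(V, h_long)  # final linear layer on top of h

print(len(h_short), len(h_long), len(logits))  # 2 2 2
```

Both sequences yield a hidden vector of the same size, which is what lets a single final layer V sit on top of inputs of different lengths.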

# RNN architectures

<img src="Pictures/Screenshot from 2020-08-22 16-44-33.png">

<img src="Pictures/Screenshot from 2020-08-22 16-44-40.png">

<img src="Pictures/Screenshot from 2020-08-22 16-44-56.png">

<img src="Pictures/Screenshot from 2020-08-22 16-45-02.png">

# Problem with training an RNN

<img src="Pictures/Screenshot from 2020-08-22 16-45-11.png">

<img src="Pictures/Screenshot from 2020-08-22 16-45-20.png">

<img src="Pictures/Screenshot from 2020-08-22 16-45-26.png">