# Word Embeddings

In Notebook 1, we represented a document with a vector whose size equals the size of the vocabulary. We encoded each word with one-hot encoding, which assigns a value at the index corresponding to the word in the vocabulary and leaves all other elements zero. This technique has several drawbacks. It makes the vector space sparse. It also cannot capture that two different words are synonyms, are similar, or share some other relation. For example, the words `cat` and `dog` are as different from each other as `electrical` and `poem`. These weaknesses can undermine downstream tasks. To solve these issues, researchers came up with [dense representations](https://web.stanford.edu/~jurafsky/slp3/6.pdf), in contrast with the sparseness of one-hot encoding. Several approaches to dense representation have been studied since the 1990s, culminating in the invention of [Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) in 2013. The technique seemingly outperforms (the claim is still controversial) the dense representation techniques from the 1990s on many downstream tasks. Our experiments center on pre-trained Word2Vec, exploring different ways of using it and how they perform.
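To make the contrast concrete, here is a minimal sketch comparing cosine similarity under one-hot encoding and under a dense representation. The four-word vocabulary and the 3-d "dense" vectors are hand-picked for illustration only; they are not taken from Word2Vec or any other trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vocabulary (assumption, for illustration only).
vocab = ["cat", "dog", "electrical", "poem"]

def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hand-picked dense vectors (assumption): not from any trained model.
toy_dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "poem": [0.1, 0.2, 0.9],
}

# With one-hot vectors, every pair of distinct words is orthogonal.
one_hot_cat_dog = cosine(one_hot("cat"), one_hot("dog"))
one_hot_electrical_poem = cosine(one_hot("electrical"), one_hot("poem"))

# With dense vectors, related words can sit closer than unrelated ones.
dense_cat_dog = cosine(toy_dense["cat"], toy_dense["dog"])
dense_cat_poem = cosine(toy_dense["cat"], toy_dense["poem"])

print(one_hot_cat_dog, one_hot_electrical_poem)
print(dense_cat_dog > dense_cat_poem)
```

Under one-hot encoding both pairs get the same similarity, so `cat` is no closer to `dog` than `electrical` is to `poem`; a dense space has room to express the difference.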

    
**Prerequisite**

1. Download [Google Word2Vec Model](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) to this directory (3M vocab, cased, 300 d) and run 

    ```
    gunzip GoogleNews-vectors-negative300.bin.gz
    ```

2. Download [Stanford GloVe Model](http://nlp.stanford.edu/data/glove.840B.300d.zip) (2.2M vocab, cased, 300d) to this directory and run the following commands.

    ```
    unzip glove.840B.300d.zip
    python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
    ```

GloVe is also available in spaCy's `en_core_web_md`. See the [documentation](https://spacy.io/models/en#en_core_web_md). In this notebook, we will not use GloVe from spaCy because of its many limitations.

If you already have these files, or you don't want to save them in this directory, you can either change the constants `PRETRAINED_WV_MODEL_PATH` and `PRETRAINED_GLOVE_MODEL_PATH` or create symbolic links.
    
```
ln -s /path/to/your/word2vec ./GoogleNews-vectors-negative300.bin
ln -s /path/to/your/glove ./glove.840B.300d.w2vformat.txt

```

In [1]:
%load_ext autoreload
%autoreload

from lib.dataset import download_tfds_imdb_as_text, download_tfds_imdb_as_text_tiny
from lib.word_emb import run_pipeline
import gensim

PRETRAINED_WV_MODEL_PATH = "./GoogleNews-vectors-negative300.bin"
PRETRAINED_GLOVE_MODEL_PATH = "./glove.840B.300d.w2vformat.txt"


In [2]:
word_emb_models = {
    "word2vec": gensim.models.KeyedVectors.load_word2vec_format(PRETRAINED_WV_MODEL_PATH, binary=True),
    "glove": gensim.models.KeyedVectors.load_word2vec_format(PRETRAINED_GLOVE_MODEL_PATH, binary=False) 
}

dataset  = download_tfds_imdb_as_text()
tiny_dataset = download_tfds_imdb_as_text_tiny()

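# Experiment 1 - Effect of text preprocessing

In this experiment, we compare three preprocessing variants with the Word2Vec embeddings: the plain spaCy tokenizer, the tokenizer plus lowercasing, and the tokenizer plus lowercasing with stop words and number-like tokens ignored.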

In [3]:
# approximate running time: 16 mins
    
print("Simple SpaCy tokenizer")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"])

print("Simple SpaCy tokenizer and lowercase")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], lower=True)

print("Simple SpaCy tokenizer, lowercase, ignore stop words and numbers")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], lower=True, ignore=["like_num", "is_stop"])


    

Simple SpaCy tokenizer
Best parameters set found on development set:  {'C': 1000}
Best F1 on development set: 0.85
F1 on test set: 0.85
time: 318.11
Simple SpaCy tokenizer and lowercase
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
time: 318.46
Simple SpaCy tokenizer, lowercase, ignore stop words and numbers
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.84
time: 277.41


**Experiment 1 Discussion**
```
Simple SpaCy tokenizer
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
Simple SpaCy tokenizer and lowercase
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
Simple SpaCy tokenizer, lowercase, ignore stop words and numbers
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.84
F1 on test set: 0.84
954.2723577022552
```

# Experiment 2 - Embeddings

In this experiment, we use two different word embeddings, [Word2Vec](https://arxiv.org/pdf/1310.4546.pdf) and [GloVe](https://nlp.stanford.edu/projects/glove/). The high-level intuition behind both is similar: they estimate dense representations of words from co-occurrence, i.e. words that can replace each other in context are similar. However, their models are very different. In a nutshell, GloVe estimates embeddings directly from the co-occurrence matrix, while Word2Vec is a learning-based model that learns to predict neighboring words from the center word (skip-gram) or the other way around (CBOW). For more information, see [this](https://www.quora.com/How-is-GloVe-different-from-word2vec).
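The two kinds of input can be sketched in a few lines. This is only an illustration of the data each approach starts from, not an implementation of either model: it omits the training objectives, negative sampling, and the distance weighting GloVe applies to its counts. The function names and the example sentence are assumptions.

```python
from collections import Counter

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs of the kind a skip-gram model is trained on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cooccurrence_counts(tokens, window=2):
    """Windowed co-occurrence counts, the kind of statistic GloVe is fit to."""
    return Counter(skipgram_pairs(tokens, window))

tokens = "the cat sat on the mat".split()

# Word2Vec (skip-gram) sees the pairs one at a time as prediction examples.
pairs = skipgram_pairs(tokens, window=1)

# GloVe first aggregates the whole corpus into one table of counts.
counts = cooccurrence_counts(tokens, window=1)

print(pairs[:3])
print(counts[("the", "cat")], counts[("cat", "the")])
```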

We will use pre-trained Word2Vec and GloVe. The pre-trained Word2Vec model has 3M words and was trained on roughly 100B tokens from a Google News dataset. Its vectors have 300 features. For more information, see [this](https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/). The pre-trained GloVe model has 2.2M words and was trained on 840B tokens from Common Crawl. Its vectors also have 300 features. In sum:
- both were trained on a very large corpus (100B vs 840B tokens)
- both were trained on a general corpus (Google News vs Common Crawl)
- both have 300 features


Also note that the embeddings in this experiment differ not only in the model (Word2Vec vs GloVe) but also in the data they were trained on.



In [4]:
# approximate running time: 13 mins

print("Word2Vec")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"])

print("GloVe")
_, _ = run_pipeline(dataset, word_emb_models["glove"])

    

Word2Vec
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
time: 313.31
GloVe
Best parameters set found on development set:  {'C': 1000}
Best F1 on development set: 0.85
F1 on test set: 0.84
time: 339.56


**Experiment 2 Discussion**


```
Word2Vec
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
GloVe
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 1000}
Best F1 on development set: 0.86
F1 on test set: 0.85
742.7005350589752
```

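# Experiment 3 - Effect of tfidf

In this experiment, we compare the default pipeline (`norm`) with the same pipeline run with `tfidf=True` (`norm + idf`).

Weighting by idf is useful for IR but may not be for classification.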
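How `run_pipeline(..., tfidf=True)` combines idf with the embeddings is not shown in this notebook. One common way to do it, sketched below as an assumption, is to weight each token's vector by its idf before averaging. The idf formula is the smoothed form `log((1 + N) / (1 + df)) + 1` that scikit-learn's `TfidfTransformer` uses by default; the helper names, documents and 2-d vectors are hypothetical.

```python
import math

def idf_table(tokenized_docs):
    """Smoothed idf per token: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log((1 + n_docs) / (1 + d)) + 1.0 for tok, d in df.items()}

def idf_weighted_average(tokens, vectors, idf):
    """Average of token vectors weighted by idf; unknown tokens are skipped."""
    known = [t for t in tokens if t in vectors and t in idf]
    if not known:
        return None
    dim = len(vectors[known[0]])
    total_weight = sum(idf[t] for t in known)
    return [sum(idf[t] * vectors[t][k] for t in known) / total_weight
            for k in range(dim)]

# Toy corpus and toy 2-d vectors (assumptions, for illustration only).
docs = [
    ["the", "movie", "is", "good"],
    ["the", "movie", "is", "boring"],
    ["the", "plot", "is", "good"],
]
vectors = {"the": [0.0, 0.0], "good": [1.0, 0.0], "boring": [0.0, 1.0]}

idf = idf_table(docs)
pooled = idf_weighted_average(["the", "boring"], vectors, idf)
print(idf["the"], idf["good"], idf["boring"])
print(pooled)
```

Rare tokens such as `boring` pull the document vector harder than tokens that appear in every document, which is the intended effect in retrieval; whether that helps a sentiment classifier is exactly what the experiment checks.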

In [5]:
# approximate running time: 12 mins

print("norm")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"])

print("norm + idf")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], tfidf=True)



norm
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.85
F1 on test set: 0.85
time: 316.04
norm + idf
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.84
F1 on test set: 0.83
time: 333.35


In [6]:
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], tfidf=True)

Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.84
F1 on test set: 0.83
time: 330.85


**Discussion**
```
norm
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
norm + idf
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.83
F1 on test set: 0.83
677.7894566059113
```

# Experiment 4 - Pooling

While reading the first three experiments in this notebook, you may have wondered: word embeddings (Word2Vec and GloVe) are dense representations of words, not documents, so how do we obtain a vector that represents a document? To do so, we have to pool the word embeddings, an idea similar to the pooling layer in a CNN. In the first three experiments, we simply averaged the embeddings of the tokens to get the document vector. Although this technique is very simple, it is widely used, not only in academia but also in industrial NLP libraries: the spaCy [doc vector](https://spacy.io/api/doc#vector) and [BERT-AS-A-SERVICE](https://github.com/hanxiao/bert-as-service#speech_balloon-faq) also pool a document vector by averaging.


However, averaging is not the only way to pool a document vector. Let's step back to the fundamentals. What did we do in Notebook 1? We used one-hot encoding to encode each word and then summed the vectors up. Although our word representation is now an embedding (dense) instead of a one-hot encoding (sparse), we can still do the same thing. Averaging is more popular because it eliminates the effect of document length. For example, the two documents `cat cat dog dog` and `cat dog` map to the same point in vector space. Another technique is to use a log, as presented in this [book](https://nlp.stanford.edu/IR-book/pdf/06vect.pdf). The idea is to reduce the effect of tokens that occur many times. For example, a document like `dog dog dog cat` leans toward `dog` in vector space if we average, but it leans toward `dog` to a lesser degree if we use log pooling. However, the log technique was introduced in the context of information retrieval, where a query vector is compared with document vectors. As our problem is text classification, this technique may not work.
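The three pooling variants can be sketched as follows. This is not the code behind `run_pipeline`; in particular, the log variant here weights each distinct token by `1 + log(count)`, the sublinear term-frequency scaling from the book linked above, and whether the pipeline does exactly this is an assumption. The 2-d vectors are toy values.

```python
import math
from collections import Counter

def pool(tokens, vectors, how="average"):
    """Pool token vectors into one document vector by sum, average or log."""
    if how not in ("sum", "average", "log"):
        raise ValueError("unknown pooling: %s" % how)
    # Count only tokens that have a vector; the rest are skipped.
    counts = Counter(t for t in tokens if t in vectors)
    if not counts:
        return None
    dim = len(next(iter(vectors.values())))
    if how == "log":
        # Dampen repeated tokens: weight 1 + log(count) instead of count.
        weights = {t: 1.0 + math.log(c) for t, c in counts.items()}
    else:
        weights = {t: float(c) for t, c in counts.items()}
    pooled = [sum(w * vectors[t][k] for t, w in weights.items())
              for k in range(dim)]
    if how == "average":
        n = sum(counts.values())
        pooled = [x / n for x in pooled]
    return pooled

# Toy vectors (assumption): one axis per animal, to make the lean visible.
vectors = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}

avg_long = pool("cat cat dog dog".split(), vectors, "average")
avg_short = pool("cat dog".split(), vectors, "average")
sum_long = pool("cat cat dog dog".split(), vectors, "sum")
sum_short = pool("cat dog".split(), vectors, "sum")

avg_vec = pool("dog dog dog cat".split(), vectors, "average")
log_vec = pool("dog dog dog cat".split(), vectors, "log")
print(avg_long, avg_short, sum_long, sum_short)
print(avg_vec, log_vec)
```

Averaging maps `cat cat dog dog` and `cat dog` to the same vector while summing does not, and the dog-to-cat ratio of `dog dog dog cat` is smaller under log pooling than under averaging.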

One may speculate that averaging and summing are pretty much the same, since averaging only multiplies the summed vector by a constant. This is true for information retrieval, since `sim(q, d)` and `sim(q, c x d)` are the same when `sim` is cosine similarity and `c` is a positive constant. For classification, however, we draw a boundary in vector space, and averaging multiplies each document vector by a different constant (documents have different lengths), so it can change the decision boundary.
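A small numeric check of this argument, with made-up vectors and a hypothetical linear classifier (`w`, `b` are not learned from anything): scaling a document vector leaves its cosine similarity to a query unchanged, but it can move the vector across a fixed linear decision boundary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def linear_score(w, b, x):
    """Decision function of a linear classifier: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

q = [1.0, 2.0]                    # query vector (toy values)
d = [2.0, 1.0]                    # document vector (toy values)
c = 5.0                           # positive scaling constant
scaled_d = [c * x for x in d]

# Cosine similarity ignores the scale of d.
cos_original = cosine(q, d)
cos_scaled = cosine(q, scaled_d)

# Hypothetical classifier with a nonzero bias: the sign of the score flips.
w, b = [1.0, 1.0], -4.0
score_original = linear_score(w, b, d)
score_scaled = linear_score(w, b, scaled_d)

print(math.isclose(cos_original, cos_scaled))
print(score_original, score_scaled)
```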

Note that all these variations are bag-of-words representations, which do not take word position into account. In other words, `The movie is not good. It is boring` and `The movie is not boring. It is good` are represented by the same vector.
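This can be checked directly by comparing token counts. The regex tokenizer below is a simplification used only for this sketch; the notebook itself tokenizes with spaCy.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased token counts; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

bag_a = bag_of_words("The movie is not good. It is boring")
bag_b = bag_of_words("The movie is not boring. It is good")

# Sum, average and log pooling depend only on these counts,
# so equal bags imply equal document vectors.
same_bag = bag_a == bag_b
print(same_bag)
```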

In this experiment, we will try three pooling techniques: sum, average and log.



    

In [7]:
# approximate running time: 60 mins

print("norm")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="norm")

print("sum")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="sum")

print("log")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="log")
    


norm
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
time: 311.83
sum
Best parameters set found on development set:  {'C': 0.001}
Best F1 on development set: 0.85
F1 on test set: 0.85
time: 2076.04
log
Best parameters set found on development set:  {'C': 1}
Best F1 on development set: 0.85
F1 on test set: 0.85
time: 1185.65


**Discussion Exp1-4**

```

norm
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.85
F1 on test set: 0.85
sum
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 0.001}
Best F1 on development set: 0.85
F1 on test set: 0.85
log
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 0.001}
Best F1 on development set: 0.85
F1 on test set: 0.84
3642.0309517383575
```

**Conclusion**

- general-purpose, not domain specific
- too many missing words

In [8]:
type(word_emb_models["word2vec"])

gensim.models.keyedvectors.Word2VecKeyedVectors

In [9]:
print("%0.2f" %2.34501)

2.35


In [10]:
set(word_emb_models["word2vec"].vocab)

{'McCaskell',
 'Teresa_Strattard',
 'MERCYHURST',
 'EDAPS_Consortium',
 'fingerpicked_guitar',
 'Samhan',
 'Hazard_Identification',
 'Parithi_Ilamvazhuthi',
 'mudslide_prone_mountain',
 'Cecafa_Kagame_Cup',
 'strum_guitars',
 'OSAA_4A',
 'GeoCortex_Essentials_insights',
 'Columbanus',
 'Bibiani_Ghana',
 'cataloger_retailer',
 'Hadith_sayings',
 'COACH_MILES',
 'Omar_Leary',
 'Zimdars',
 'Motosport',
 'disentangling_itself',
 '0w_####m',
 'LOSS_BEFORE_INCOME_TAXES',
 'Dante_Walkup',
 'Barkha',
 'girl_mother_Nixzaliz',
 'Nissha',
 'Walior',
 'non_tariff_barriers',
 'FEATHER_RIVER',
 'Nomlaki',
 'Charging_Cradle',
 'bio_diesels',
 'CRATE',
 'Minister_Atzo_Nicolai',
 'Mazze',
 'Alumni_Awards_Banquet',
 'inseparably_linked',
 'Indridson',
 'Qorivva',
 'Rasho_Nesterovic_Maceo_Baston',
 'Referee_Alex_Prus',
 'Volrath',
 'Jeanette_Liebold_Ricker',
 'Gien',
 '£_7bn',
 'ayurvedic_medicines',
 'Wheelchair_racer',
 'plush_leather_couches',
 'By_AAyles',
 'SPLENDA_®_Sucralose',
 'critic',
 'Fathi_T

In [11]:
print("Word2Vec")
_, dense = run_pipeline(dataset, word_emb_models["word2vec"])

Word2Vec
Best parameters set found on development set:  {'C': 1000}
Best F1 on development set: 0.85
F1 on test set: 0.85
time: 314.68


**Conclusion**

Both Word2Vec and GloVe are generic word embeddings. They were trained on general corpora that are not specific to any domain: Word2Vec on the Google News corpus and GloVe on the Common Crawl corpus.

From the experiment, we can see that
- tokenizer: unlike with the one-hot vector representation, lowercasing and lemmatization do not help much, and that makes sense. Recall why lemmatization is useful for one-hot vectors: it can group words like "good" and "best" together as "good", which reduces the sparsity of the vector space and makes the model more confident when it sees the word "good". This is not the case for word embeddings. The pre-trained models have vectors for words like "good", "better" and "best", and those vectors are similar enough to represent the shared idea (positive sentiment) while still capturing subtle differences (best > better > good). Thus, with word embeddings it is better to leave these words in their original form.
- tfidf: TODO (tfidf makes things worse. Why? Did I do something wrong?)
- pooling function: although the log function performs slightly better, the differences are not significant. The log function should work well for long documents because it reduces the effect of words that occur many times (Manning). One assumption we can make is that the IMDB reviews are not long enough (270 tokens) to show the effect of the log function. Sum and norm give similar results in information retrieval tasks or any other task that relies on cosine similarity, but this is not the case for logistic regression, as shown in the experiment. Note that when we normalize, the constant applied to each training instance is its own magnitude, so the constants differ from instance to instance. Also note that normalization in this context is different from feature normalization.


One would expect word embeddings (dense representation) to achieve higher performance than one-hot vectors (sparse representation). However, the experiments show that the best F1 achieved by word embeddings is about 0.86 (row 20). Possible explanations (these are assumptions):
- out-of-vocabulary (OOV) words
- bias toward the training corpora
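The OOV assumption could be checked by measuring the fraction of tokens that have no embedding. The sketch below uses a plain `set` as a stand-in vocabulary and two made-up tokenized reviews; the function name is hypothetical. With the models loaded above, gensim's `KeyedVectors` supports the `in` operator, so a loaded model could be passed in place of the set.

```python
def oov_rate(tokenized_docs, vocab):
    """Fraction of tokens (counted with repetition) that are missing from vocab."""
    total = 0
    missing = 0
    for doc in tokenized_docs:
        for tok in doc:
            total += 1
            if tok not in vocab:
                missing += 1
    return missing / total if total else 0.0

# Toy data (assumption): "zzz" stands for a token without an embedding.
toy_vocab = {"good", "movie"}
toy_docs = [["good", "movie"], ["zzz", "movie"]]

rate = oov_rate(toy_docs, toy_vocab)
print(rate)
```

Running this separately for each preprocessing variant (cased, lowercased, stop words removed) would show how much of each review the classifier never sees.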