# Word Embeddings

Prerequisite - If you are not familiar with word embeddings, read [this](https://web.stanford.edu/~jurafsky/slp3/6.pdf).

Study
- model token effect
- pooling technique effect
- model changes
    - GloVe
    - Word2Vec
    - train on train data 
    - transfer learning
- for train on train data
    - effect of span
    - effect of dim
    - effect of epoch (can it overfit?)

- deal with unknown
    - zero
    - random
    - average
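
The three unknown-word strategies listed above can be sketched as follows. This is a minimal pure-Python illustration, not the code in `lib.word_emb`: the helper name `oov_vector`, the toy vectors, the random range, and the reading of "average" as the mean of all known vectors are assumptions made for the example.

```python
import random

def oov_vector(strategy, known_vectors, dim, seed=0):
    """Return a stand-in vector for an out-of-vocabulary word (hypothetical helper)."""
    if strategy == "zero":
        # The unknown word contributes nothing to the document vector.
        return [0.0] * dim
    if strategy == "random":
        # Small random values; the range is an arbitrary choice for this sketch.
        rng = random.Random(seed)
        return [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    if strategy == "average":
        # Assumption: "average" means the mean of all known word vectors.
        n = len(known_vectors)
        return [sum(vec[i] for vec in known_vectors) / n for i in range(dim)]
    raise ValueError(f"unknown strategy: {strategy}")

# Toy 2-dimensional embeddings, for illustration only.
toy_vectors = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "film": [0.0, 2.0]}
known = list(toy_vectors.values())

zero_vec = oov_vector("zero", known, dim=2)
random_vec = oov_vector("random", known, dim=2)
average_vec = oov_vector("average", known, dim=2)
```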
    


In [1]:
%load_ext autoreload
%autoreload

from lib.dataset import download_tfds_imdb_as_text, download_tfds_imdb_as_text_tiny
from lib.word_emb import run_pipeline
import gensim


In [None]:
word_emb_models = {
    "word2vec": gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True),
    "glove": gensim.models.KeyedVectors.load_word2vec_format('./glove.840B.300d.w2vformat.txt', binary=False) 
}

dataset  = download_tfds_imdb_as_text()
tiny_dataset = download_tfds_imdb_as_text_tiny()

# Experiment 1 - Effect of text preprocessing

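In this experiment, we compare three preprocessing settings, all using the pre-trained Word2Vec embeddings: SpaCy tokenization only; tokenization plus lowercasing; and tokenization plus lowercasing with stop words and number-like tokens ignored.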

In [None]:
def exp1(dataset):
    
    print("Simple SpaCy tokenizer")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"])

    print("Simple SpaCy tokenizer and lowercase")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"], lower=True)
    
    print("Simple SpaCy tokenizer, lowercase, ignore stop words and numbers")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"], lower=True, ignore=["like_num", "is_stop"])

# approximate running time: 16 mins
import time
now = time.time()
exp1(dataset)
print(time.time()-now)

    

**Experiment 1 Discussion**
```
Simple SpaCy tokenizer
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
Simple SpaCy tokenizer and lowercase
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
Simple SpaCy tokenizer, lowercase, ignore stop words and numbers
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.84
F1 on test set: 0.84
954.2723577022552
```

# Experiment 2 - Embeddings

In this experiment, we use two different word embeddings, [Word2Vec](https://arxiv.org/pdf/1310.4546.pdf) and [GloVe](https://nlp.stanford.edu/projects/glove/). The high-level intuition behind both is similar: each estimates dense word representations from co-occurrence, so words that can replace one another in context end up with similar vectors. Their models, however, are very different. In a nutshell, GloVe estimates embeddings directly from a co-occurrence matrix, while Word2Vec is a learning-based model that learns to predict neighboring words from a center word (skip-gram) or the other way around (CBOW). For more information, see [this](https://www.quora.com/How-is-GloVe-different-from-word2vec).
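
To make the skip-gram idea concrete, the sketch below enumerates the (center, context) pairs for a toy sentence. It only shows how training pairs are formed, not Word2Vec training itself; the function name `skipgram_pairs` and the window size of 1 are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """List (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "movie", "was", "great"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

CBOW uses the same windows but predicts the center word from its context words instead.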

We use pre-trained Word2Vec and GloVe. The pre-trained Word2Vec model has 3M words, trained on roughly 100B tokens from a Google News dataset. Its vectors have 300 features. For more information, see [this](https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/). The pre-trained GloVe model has 2.2M words, trained on 840B tokens from Common Crawl. Its vectors also have 300 features. In sum:
- both were trained on a very large corpus (100B vs 840B tokens)
- both were trained on a general corpus (Google News vs Common Crawl)
- both have 300 features


Also note that the embeddings in this experiment differ not only in the model (Word2Vec vs GloVe) but also in the data they were trained on.



In [None]:
def exp2(dataset):
    print("Word2Vec")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"])

    print("GloVe")
    _, _ = run_pipeline(dataset, word_emb_models["glove"])
    

# approximate running time: 13 mins
import time
now = time.time()
exp2(dataset)
print(time.time()-now)

    

**Experiment 2 Discussion**


```
Word2Vec
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
GloVe
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 1000}
Best F1 on development set: 0.86
F1 on test set: 0.85
742.7005350589752
```

# Experiment 3 - Effect of tfidf

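In this experiment, we compare the default run against the same run with IDF weighting enabled (`tfidf=True`).

TF-IDF weighting is useful for information retrieval but may not be for classification.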
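
As a rough illustration of IDF-weighted averaging of word vectors, here is a minimal sketch. It is not the implementation behind `run_pipeline(..., tfidf=True)`, which is not shown in this notebook; the helper names, the toy vectors, and the unsmoothed formula idf(t) = log(N / df(t)) are assumptions.

```python
import math

def idf_weights(docs):
    """Compute idf(t) = log(N / df(t)) over a list of tokenized documents."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for token in set(doc):  # count each token once per document
            df[token] = df.get(token, 0) + 1
    return {token: math.log(n_docs / count) for token, count in df.items()}

def weighted_average(tokens, vectors, weights):
    """Weighted mean of the vectors of known tokens; all zeros if nothing has weight."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in vectors and token in weights:
            w = weights[token]
            weight_sum += w
            for i, x in enumerate(vectors[token]):
                total[i] += w * x
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]

# Toy corpus and 2-dimensional embeddings, for illustration only.
docs = [["the", "film", "was", "great"], ["the", "film", "was", "dull"]]
vectors = {"the": [1.0, 1.0], "film": [1.0, 0.0], "great": [0.0, 1.0], "dull": [0.0, -1.0]}

idf = idf_weights(docs)
doc_vec = weighted_average(docs[0], vectors, idf)
```

In this toy corpus, words that occur in every document ("the", "film") get an IDF of zero, so the document vector is driven entirely by "great".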

In [None]:
def exp3(dataset):
    print("norm")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"])
    
    print("norm + idf")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"], tfidf=True)


    

# approximate running time: 12 mins
import time
now = time.time()
exp3(dataset)
print(time.time()-now)

**Experiment 3 Discussion**
```
norm
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
norm + idf
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.83
F1 on test set: 0.83
677.7894566059113
```

# Experiment 4 - Effect of pooling
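
Pooling turns the word vectors of a document into one fixed-length vector. The definitions of `"norm"`, `"sum"`, and `"log"` live in `lib.word_emb` and are not shown here, so the sketch below is only one plausible reading: it assumes `"sum"` adds the word vectors and `"norm"` scales that sum to unit L2 length. The `"log"` option is left out because its definition cannot be inferred from this notebook. The function name `pool` is hypothetical.

```python
import math

def pool(word_vectors, how="sum"):
    """Combine word vectors into one document vector (illustrative only)."""
    dim = len(word_vectors[0])
    summed = [sum(vec[i] for vec in word_vectors) for i in range(dim)]
    if how == "sum":
        return summed
    if how == "norm":
        # Assumption: "norm" means the sum rescaled to unit L2 length.
        length = math.sqrt(sum(x * x for x in summed))
        return summed if length == 0.0 else [x / length for x in summed]
    raise ValueError(f"unsupported pooling in this sketch: {how}")

# Toy 2-dimensional word vectors, for illustration only.
word_vectors = [[3.0, 0.0], [0.0, 4.0], [3.0, 4.0]]
sum_vec = pool(word_vectors, "sum")
norm_vec = pool(word_vectors, "norm")
```

Under these assumptions, the summed vector's magnitude grows with document length, while the normalized vector always has unit length.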

In [None]:
def exp4(dataset):
    print("norm")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="norm")
    
    print("sum")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="sum")
    
    print("log")
    _, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="log")
    
# approximate running time: 60 mins
import time
now = time.time()
exp4(dataset)
print(time.time()-now)

**Discussion Exp1-4**

```

norm
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.85
F1 on test set: 0.85
sum
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 0.001}
Best F1 on development set: 0.85
F1 on test set: 0.85
log
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 0.001}
Best F1 on development set: 0.85
F1 on test set: 0.84
3642.0309517383575
```

**Conclusion**

- pre-trained embeddings are general, not domain specific
- too many missing words
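
One way to quantify the missing-words problem is the out-of-vocabulary rate of a tokenized document. The sketch below is a minimal illustration with a toy vocabulary; the function name `oov_rate` is hypothetical and no figures from the experiments above are implied.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens that have no entry in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)

# Toy vocabulary and document, for illustration only.
toy_vocab = {"the", "film", "was", "great"}
tokens = ["the", "film", "was", "meh", "!"]
rate = oov_rate(tokens, toy_vocab)
print(f"OOV rate: {rate:.2f}")
```

A loaded gensim `KeyedVectors` object should be usable in place of the set, since it supports the `in` operator for membership checks.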