# Word Embeddings

Prerequisite - If you are not familiar with word embeddings, read [this](https://web.stanford.edu/~jurafsky/slp3/6.pdf).

Study
- model token effect
- pooling technique effect
- model changes
    - GloVe
    - Word2Vec
    - train on train data 
    - transfer learning
- for train on train data
    - effect of span
    - effect of dim
    - effect of epoch (can it overfit?)

- deal with unknown words
    - zero
    - random
    - average
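
The three options for unknown words listed above can be sketched as follows. This is a minimal illustration under assumptions, not the code in `lib.word_emb`: `oov_vector`, `lookup`, the toy vectors, and the random range of ±0.25 are all hypothetical.

```python
import random


def oov_vector(strategy, dim, known_vectors, seed=0):
    """Return a stand-in vector for a word missing from the embedding vocabulary."""
    if strategy == "zero":
        # Missing words contribute nothing to the document vector.
        return [0.0] * dim
    if strategy == "random":
        # Small random values; the range is an arbitrary assumption.
        rng = random.Random(seed)
        return [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    if strategy == "average":
        # Component-wise mean of all known vectors.
        n = len(known_vectors)
        return [sum(vec[i] for vec in known_vectors) / n for i in range(dim)]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy 3-dimensional vocabulary (hypothetical values, not real embeddings).
toy_vectors = {"good": [1.0, 0.0, 2.0], "bad": [-1.0, 2.0, 0.0]}


def lookup(word, strategy):
    """Return the stored vector, or a stand-in if the word is unknown."""
    if word in toy_vectors:
        return toy_vectors[word]
    return oov_vector(strategy, 3, list(toy_vectors.values()))


zero_vec = lookup("meh", "zero")
avg_vec = lookup("meh", "average")
rand_vec = lookup("meh", "random")
```

With a fixed seed the random stand-in is reproducible, which keeps repeated runs comparable.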
    
    
    
    
1. Download the [Google Word2Vec Model](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) (3M vocab, cased, 300d) to this directory and run the following command.

```
gunzip GoogleNews-vectors-negative300.bin.gz
```

2. Download [Stanford GloVe Model](http://nlp.stanford.edu/data/glove.840B.300d.zip) (2.2M vocab, cased, 300d) to this directory and run the following commands.

```
unzip glove.840B.300d.zip
python -m gensim.scripts.glove2word2vec --input  glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
```
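
As far as we understand it, the conversion above mainly prepends the `<vocab_size> <dim>` header line that the word2vec text format expects. The sketch below shows that idea on a two-word toy input. `glove_to_word2vec_text` is a hypothetical helper, not part of gensim, and it keeps everything in memory, so it is only suitable for illustration.

```python
import io


def glove_to_word2vec_text(glove_lines):
    """Prepend the '<vocab_size> <dim>' header used by the word2vec text format."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    # The first field of each row is the token; the rest are vector components.
    dim = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dim}"] + rows


# Toy GloVe-style input: one token per line followed by its vector.
toy_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
converted = glove_to_word2vec_text(toy_glove)
```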

Note: GloVe vectors are also available in spaCy's `en_core_web_md` model; see the [documentation](https://spacy.io/models/en#en_core_web_md). This notebook does not use the spaCy version because of its limitations.

If you already have these files, or do not want to save them in this directory, you can either
- change the constants `PRETRAINED_WV_MODEL_PATH` and `PRETRAINED_GLOVE_MODEL_PATH`, or
- create symbolic links:
    - `ln -s /path/to/your/word2vec ./GoogleNews-vectors-negative300.bin` 
    - `ln -s /path/to/your/glove ./glove.840B.300d.w2vformat.txt`


In [29]:
%load_ext autoreload
%autoreload

from lib.dataset import download_tfds_imdb_as_text, download_tfds_imdb_as_text_tiny
from lib.word_emb import run_pipeline
import gensim

PRETRAINED_WV_MODEL_PATH = "./GoogleNews-vectors-negative300.bin"
PRETRAINED_GLOVE_MODEL_PATH = "./glove.840B.300d.w2vformat.txt"




In [4]:
word_emb_models = {
    "word2vec": gensim.models.KeyedVectors.load_word2vec_format(PRETRAINED_WV_MODEL_PATH, binary=True),
    "glove": gensim.models.KeyedVectors.load_word2vec_format(PRETRAINED_GLOVE_MODEL_PATH, binary=False) 
}

dataset  = download_tfds_imdb_as_text()
tiny_dataset = download_tfds_imdb_as_text_tiny()

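# Experiment 1 - Effect of text preprocessing

In this experiment, we compare three preprocessing settings using the pre-trained Word2Vec embeddings: the plain spaCy tokenizer, the tokenizer plus lowercasing, and the tokenizer plus lowercasing with stop words and numbers ignored.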

In [None]:
# approximate running time: 16 mins
    
print("Simple SpaCy tokenizer")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"])

print("Simple SpaCy tokenizer and lowercase")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], lower=True)

print("Simple SpaCy tokenizer, lowercase, ignore stop words and numbers")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], lower=True, ignore=["like_num", "is_stop"])
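
To make the three settings concrete, here is a rough sketch that does not depend on spaCy. The `lower` and `ignore` arguments mirror the ones passed to `run_pipeline`, but how `run_pipeline` applies them is not shown in this notebook, so the behavior below is an assumption. `STOP_WORDS`, `like_num`, and the regex tokenizer are hypothetical stand-ins for spaCy's tokenizer and its `is_stop` and `like_num` token attributes.

```python
import re

# Tiny hypothetical stop word list; spaCy ships a much larger one.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def like_num(token):
    """Rough stand-in for spaCy's like_num: digits with optional separators."""
    return re.fullmatch(r"\d+([.,]\d+)*", token) is not None


def preprocess(text, lower=False, ignore=()):
    """Tokenize, optionally lowercase, and drop tokens matching ignored attributes."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if lower:
        tokens = [t.lower() for t in tokens]
    kept = []
    for t in tokens:
        if "like_num" in ignore and like_num(t):
            continue
        if "is_stop" in ignore and t.lower() in STOP_WORDS:
            continue
        kept.append(t)
    return kept


sample = "The movie is 10 times better"
plain = preprocess(sample)
lowered = preprocess(sample, lower=True)
filtered = preprocess(sample, lower=True, ignore=["like_num", "is_stop"])
```

Note that the Google News Word2Vec model is cased, so lowercasing changes which vectors are looked up.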


    

**Experiment 1 Discussion**
```
Simple SpaCy tokenizer
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
Simple SpaCy tokenizer and lowercase
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
Simple SpaCy tokenizer, lowercase, ignore stop words and numbers
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.84
F1 on test set: 0.84
954.2723577022552
```
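
The logs report F1 scores. As a reminder of what that number measures, here is a small sketch of binary F1 on toy labels. Whether `run_pipeline` reports binary or averaged F1 is not stated in this notebook, so binary F1 on the positive class is an assumption, and `f1_score_binary` is a hypothetical helper.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
f1 = f1_score_binary(y_true, y_pred)
```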

# Experiment 2 - Embeddings

In this experiment, we compare two word embeddings, [Word2Vec](https://arxiv.org/pdf/1310.4546.pdf) and [GloVe](https://nlp.stanford.edu/projects/glove/). The high-level intuition behind both is similar: they estimate dense representations of words from co-occurrence, so words that can replace each other in context end up with similar vectors. Their models, however, are very different. In a nutshell, GloVe estimates embeddings directly from a global co-occurrence matrix, while Word2Vec is a learning-based model that learns to predict neighboring words from a center word (skip-gram) or the center word from its neighbors (CBOW). For more information, see [this](https://www.quora.com/How-is-GloVe-different-from-word2vec).
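
The difference in training input can be illustrated on a toy sentence. The sketch below builds the (center, context) pairs that a skip-gram model would see one at a time, and then aggregates the same pairs into co-occurrence counts, which is closer to what GloVe starts from. It is a simplification under assumptions: real GloVe also weights counts by distance, and neither model is actually trained here. `skipgram_pairs` and `cooccurrence_counts` are hypothetical helper names.

```python
from collections import Counter


def skipgram_pairs(tokens, window=1):
    """(center, context) pairs within a symmetric window around each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cooccurrence_counts(tokens, window=1):
    """Aggregate the same pairs into global counts (unweighted)."""
    return Counter(skipgram_pairs(tokens, window))


tokens = "the cat saw the cat".split()
pairs = skipgram_pairs(tokens)
counts = cooccurrence_counts(tokens)
```

Here `("the", "cat")` appears twice in `pairs`, so a skip-gram model sees it as two separate training examples, while the count-based view stores a single entry with value 2.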

We use pre-trained Word2Vec and GloVe models. The pre-trained Word2Vec model has 3M words and was trained on roughly 100B tokens from a Google News dataset. Each vector has 300 dimensions. For more information, see [this](https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/). The pre-trained GloVe model has 2.2M words and was trained on 840B tokens from Common Crawl. Its vectors also have 300 dimensions. In sum:
- both were trained on a very large corpus (100B vs 840B tokens)
- both were trained on a general corpus (Google News vs Common Crawl)
- both have 300 dimensions


Also note that the embeddings in this experiment differ not only in the model (Word2Vec vs GloVe) but also in the data they were trained on.



In [None]:
# approximate running time: 13 mins

print("Word2Vec")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"])

print("GloVe")
_, _ = run_pipeline(dataset, word_emb_models["glove"])

    

**Experiment 2 Discussion**


```
Word2Vec
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
GloVe
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 1000}
Best F1 on development set: 0.86
F1 on test set: 0.85
742.7005350589752
```

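# Experiment 3 - Effect of TF-IDF

In this experiment, we compare pooled word vectors with and without IDF weighting (`tfidf=True`).

IDF weighting is useful for information retrieval but may not help classification.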
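
One way to combine IDF with word vectors is a weighted average, where rarer words count more. The sketch below shows that idea on toy data. It is an assumption about what `tfidf=True` roughly does, not the actual code in `lib.word_emb`; `idf_weights`, `weighted_average`, the smoothed IDF formula, and the toy vectors are all hypothetical.

```python
import math


def idf_weights(tokenized_docs):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def weighted_average(doc, vectors, weights):
    """Average the vectors of known words, weighting each by its IDF."""
    dim = len(next(iter(vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for word in doc:
        if word in vectors:
            w = weights.get(word, 1.0)
            total = [t + w * v for t, v in zip(total, vectors[word])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


docs = [["great", "movie"], ["bad", "movie"]]
toy_vectors = {"great": [1.0, 0.0], "bad": [-1.0, 0.0], "movie": [0.0, 1.0]}
idf = idf_weights(docs)
doc_vec = weighted_average(docs[0], toy_vectors, idf)
```

In this toy corpus "movie" occurs in every document and gets the lowest weight, so the document vector leans toward "great".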

In [None]:
# approximate running time: 12 mins

print("norm")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"])

print("norm + idf")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], tfidf=True)




**Experiment 3 Discussion**
```
norm
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.85
F1 on test set: 0.85
norm + idf
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.83
F1 on test set: 0.83
677.7894566059113
```

# Experiment 4 - Effect of pooling
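
This notebook does not spell out what the three `polling` options (`norm`, `sum`, `log`) compute, so the sketch below is only one plausible reading, stated as an assumption: `sum` adds the word vectors, `norm` scales that sum to unit length, and `log` dampens repeated words by weighting each distinct word with `1 + log(count)`. The function `pool` and the toy vectors are hypothetical and are not taken from `lib.word_emb`.

```python
import math
from collections import Counter


def pool(doc, vectors, polling="norm"):
    """Combine the word vectors of a document into one vector."""
    if polling not in ("norm", "sum", "log"):
        raise ValueError(f"unknown polling: {polling}")
    dim = len(next(iter(vectors.values())))
    counts = Counter(w for w in doc if w in vectors)
    total = [0.0] * dim
    for word, count in counts.items():
        # "log" dampens repeated words; otherwise every occurrence counts fully.
        weight = 1.0 + math.log(count) if polling == "log" else float(count)
        total = [t + weight * v for t, v in zip(total, vectors[word])]
    if polling == "norm":
        # Scale the summed vector to unit L2 length.
        length = math.sqrt(sum(t * t for t in total))
        return [t / length for t in total] if length else total
    return total


toy_vectors = {"good": [3.0, 0.0], "film": [0.0, 4.0]}
doc = ["good", "good", "film"]
sum_vec = pool(doc, toy_vectors, polling="sum")
norm_vec = pool(doc, toy_vectors, polling="norm")
log_vec = pool(doc, toy_vectors, polling="log")
```

Under this reading, `sum` grows with document length while `norm` does not, which may be related to the very different best values of `C` in the logs.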

In [None]:
# approximate running time: 60 mins

print("norm")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="norm")

print("sum")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="sum")

print("log")
_, _ = run_pipeline(dataset, word_emb_models["word2vec"], polling="log")
    


**Experiment 4 Discussion**

```

norm
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.85
F1 on test set: 0.85
sum
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 0.001}
Best F1 on development set: 0.85
F1 on test set: 0.85
log
Load tokenized document from disk
Load tokenized document from disk
Best parameters set found on development set:  {'C': 0.001}
Best F1 on development set: 0.85
F1 on test set: 0.84
3642.0309517383575
```

**Conclusion**

- pre-trained embeddings are general-purpose, not domain specific
- too many words are missing from the pre-trained vocabulary
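
The point about missing words could be quantified with an out-of-vocabulary rate. The sketch below computes it on toy data. `oov_rate` is a hypothetical helper, and in this notebook the vocabulary would presumably come from something like `word_emb_models["word2vec"].vocab`.

```python
def oov_rate(tokenized_docs, vocab):
    """Fraction of tokens that have no entry in the embedding vocabulary."""
    total = missing = 0
    for doc in tokenized_docs:
        for token in doc:
            total += 1
            if token not in vocab:
                missing += 1
    return missing / total if total else 0.0


# Toy data: one of the four tokens is unknown.
toy_docs = [["great", "movie"], ["meh", "movie"]]
toy_vocab = {"great", "movie"}
rate = oov_rate(toy_docs, toy_vocab)
```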

In [5]:
type(word_emb_models["word2vec"])

gensim.models.keyedvectors.Word2VecKeyedVectors

In [28]:
print("%0.2f" %2.34501)

2.35


In [34]:
set(word_emb_models["word2vec"].vocab)

{'infantry_artillery',
 'schols',
 '##g_1a',
 'Ashtabula_Cuyahoga',
 'Volpone',
 'Daimer.com_spokesperson_Matthew_Baratta',
 'oxide_heap_leach',
 'discodermolide',
 'Alder',
 'Camphill',
 'ovenproof_plate',
 'Concalves',
 'Softly_spoken',
 'zydeco_blues',
 'Anton_Caputo',
 'Wymans',
 'Eric_Cevis',
 'Valspar_Corp',
 'SP2_Windows_Server',
 'Nicholas_Minucci',
 'StarVision',
 'Toa_Reinsurance_Company',
 'SL###_SL###',
 'Marie_Callender_Cheesy_Chicken',
 'Cedar_waxwings',
 'Machesky',
 'Gyorgy_Barta',
 'Ganatra',
 'Karunakaran',
 'ª_ª',
 'PRC_NetDragon_Websoft',
 'losing_hands_feetA',
 'Sex_Discrimination_Commissioner',
 'tapioca_balls',
 'recessional_hymn',
 'remarks_Engvall_poked',
 'terrorist_mastermind_Noordin',
 'Buruku',
 'Bayno',
 'proband',
 'Shireen_Jinnah_Colony',
 'Offi_ce',
 'Nicole_Danna',
 'band_Shoe_Suede',
 'BCII',
 'Roger_Moret',
 'Ms_Shabangu',
 'Apparent_Robbery',
 'Superintendent_Danilo_Abarzosa',
 'cookie_tin',
 'Janine_Remillard',
 'taught_bilingually',
 'Harley_David

In [None]:
print("Word2Vec")
_, dense = run_pipeline(dataset, word_emb_models["word2vec"])