# More On Word Embeddings

As we found at the end of Notebook 2, we use only about 80% of the information on average and discard the rest. In this experiment, we will explore how to use the information in the dataset more efficiently, i.e. reduce the fraction of discarded information. Although we still focus on word embeddings, we will tackle the problem differently this time.

First, let's recall from Notebook 2 why the efficiency is not very high. In that experiment, we vectorized text by looking up the pre-trained word embedding for each token and then pooling them into a single vector representing the text. For tokens without a corresponding word embedding, we randomly picked one from the pre-trained word embeddings (or we could simply ignore them). Because approximately 20% of the tokens are not present in our pre-trained word embeddings, we effectively discard those tokens, and the model never learns from that information.
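
To make the loss of information concrete, here is a minimal sketch of the lookup-and-pool step. It assumes mean pooling and the "ignore unknown tokens" option; the function name `pool_text` and the toy 2-dimensional vectors are made up for illustration and are not the code from Notebook 2.

```python
from typing import Dict, List, Tuple


def pool_text(
    tokens: List[str], embeddings: Dict[str, List[float]], dim: int
) -> Tuple[List[float], float]:
    """Average the vectors of known tokens.

    Returns the pooled vector and the fraction of tokens that had a vector.
    Tokens without a vector are skipped, so their information is lost.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim, 0.0
    pooled = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    return pooled, len(known) / len(tokens)


# Toy embeddings (hypothetical values, for illustration only).
toy_embeddings = {
    "a": [0.0, 0.0],
    "really": [0.5, 0.5],
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
}

tokens = ["a", "really", "good", "movie", "imho"]
vector, coverage = pool_text(tokens, toy_embeddings, dim=2)
print(vector)    # [0.375, 0.375]
print(coverage)  # 0.8, because "imho" has no vector and is dropped
```

In this toy example one of five tokens has no vector, so only 80% of the tokens contribute to the pooled representation.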

One way to solve this problem is to train word embeddings on the train set (we can't use both the train and test sets to train word embeddings because that would leak data). This way, every token in the train set gets a corresponding vector. We may still observe unknown tokens in the test set, but far fewer than what we observed in Notebook 2.
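
The effect on coverage can be sketched as follows: build the vocabulary from the train set only, then measure the fraction of unknown tokens in each split. The helpers `build_vocab` and `unknown_rate` and the toy documents are hypothetical. The `min_count` argument is included because a frequency cut-off (as many Word2Vec implementations apply) would leave rare training tokens without a vector as well.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocab(docs: Iterable[List[str]], min_count: int = 1) -> Set[str]:
    """Collect tokens that occur at least `min_count` times in `docs`."""
    counts = Counter(token for doc in docs for token in doc)
    return {token for token, n in counts.items() if n >= min_count}


def unknown_rate(docs: Iterable[List[str]], vocab: Set[str]) -> float:
    """Fraction of token occurrences in `docs` that are not in `vocab`."""
    total = 0
    unknown = 0
    for doc in docs:
        for token in doc:
            total += 1
            unknown += token not in vocab
    return unknown / total if total else 0.0


# Toy tokenized documents (made up for illustration).
train_docs = [["great", "movie"], ["boring", "plot", "great", "cast"]]
test_docs = [["great", "plot"], ["awful", "cast"]]

# The vocabulary comes from the train set only, to avoid data leakage.
vocab = build_vocab(train_docs)
print(unknown_rate(train_docs, vocab))  # 0.0
print(unknown_rate(test_docs, vocab))   # 0.25, only "awful" is unseen
```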

This [paper](https://www.aclweb.org/anthology/Q15-1016.pdf) discusses various aspects of word embeddings on different tasks, as well as hyperparameter tuning. However, the authors did not benchmark on a text classification task like the one in our experiment.

Now let's discuss the experiments.


There are several variations to implement. First, let's discuss training word embeddings without transfer learning.

- Window Size: Generally speaking, the window size is the size of the context under consideration. For example, take the sentence `... We can address this issue by introducing ...  ` and focus on the word `this`. If the context size is 2, the model will try to encode the meaning of `this` by considering some of the words `can`, `address`, `issue`, `by`. It will not take `we` into account because it is outside the considered context. For more details, see the [original paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). The intuition is that training the model with a smaller window gives embeddings that encode more syntactic meaning. It will learn that words like `is` and `was` are similar because they are replaceable within a small window. A larger window tends to create embeddings that encode the broader idea or topic of words. See [this paper](https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf) for more details. Which one is better depends on the downstream task. For example, one might expect a smaller window to perform better on a downstream task like analogy, while a larger window may be better for text classification. We will run the experiment and see if this is the case.

- Epochs
- Dimension
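
The window-size example above can be reproduced with a few lines of plain Python. This is only a sketch of how the candidate context words are selected; the helper `context_words` is hypothetical, and actual Word2Vec training typically samples a smaller effective window at each position, which is why the model sees "some" of these words.

```python
from typing import List


def context_words(tokens: List[str], center: int, window: int) -> List[str]:
    """Return up to `window` tokens on each side of position `center`."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right


sentence = "we can address this issue by introducing".split()
center = sentence.index("this")

print(context_words(sentence, center, window=2))
# ['can', 'address', 'issue', 'by']
print(context_words(sentence, center, window=3))
# ['we', 'can', 'address', 'issue', 'by', 'introducing']
```

With `window=2` the word `we` is excluded; it only enters the context once the window grows to 3.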

Then, with transfer learning.



In [17]:
%load_ext autoreload
%autoreload

from lib.dataset import download_tfds_imdb_as_text, download_tfds_imdb_as_text_tiny
from lib.word_emb import run_pipeline, train_or_load_wv, train_or_load_wv_transfer




The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
dataset  = download_tfds_imdb_as_text() # tuple of (X_train, X_test, y_train, y_test)
tiny_dataset = download_tfds_imdb_as_text_tiny() # first 100 samples from dataset

PRETRAINED_WV_MODEL_PATH = "./GoogleNews-vectors-negative300.bin"

## Baseline

Before starting the experiment, let's train the Word2Vec model with all default settings, pass it to the same text classification `pipeline` discussed in Notebook 2, and save the result as the baseline.

Hyperparameters:
- Dimension `dim = 300`
- Window Size `window = 5`
- Epochs `iter = 5`


In [6]:
model_train = train_or_load_wv(dataset[0]) 
_, vectorizer = run_pipeline(dataset, model_train)

Best parameters set found on development set:  {'C': 1000}
Best F1 on development set: 0.87 2
F1 on test set: 0.86


We get an F1 score of 0.86 on the test set, slightly better than what we got from the experiment in Notebook 2.

## Window Size

Now let's increase the window size from 5 to 15 and 30.

In [5]:
# approximate running time: 12 mins

model_train_window_15 = train_or_load_wv(dataset[0], window=15)
_, _ = run_pipeline(dataset, model_train_window_15)

model_train_window_30 = train_or_load_wv(dataset[0], window=30)
_, _ = run_pipeline(dataset, model_train_window_30)


Load Word2Vec from disk!
Loading tokenized document from disk...
Finished loading tokenized document in 0.37s!
Loading tokenized document from disk...
Finished loading tokenized document in 0.35s!
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.89
F1 on test set: 0.88
Load Word2Vec from disk!
Loading tokenized document from disk...
Finished loading tokenized document in 0.37s!
Loading tokenized document from disk...
Finished loading tokenized document in 0.36s!
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.89
F1 on test set: 0.88
681.8944387435913



Here we see that the test F1 improves from 0.86 to 0.88 when we change the window size from 5 to 15. Further increasing it to 30 doesn't help. We can say that at window=15, the embeddings already capture a "broad enough" concept for text classification.

## Epochs

In [7]:
# approximate running time: 40 mins

model_train_iter_2 = train_or_load_wv(dataset[0], iter=2, window=15)
_, _ = run_pipeline(dataset, model_train_iter_2)

model_train_iter_10 = train_or_load_wv(dataset[0], iter=10, window=15)
_, _ = run_pipeline(dataset, model_train_iter_10)

# variable renamed to match the value actually passed (iter=30)
model_train_iter_30 = train_or_load_wv(dataset[0], iter=30, window=15)
_, _ = run_pipeline(dataset, model_train_iter_30)



Best parameters set found on development set:  {'C': 1000}
Best F1 on development set: 0.88 2
F1 on test set: 0.87
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.89 2
F1 on test set: 0.87
Best parameters set found on development set:  {'C': 10}
Best F1 on development set: 0.89 2
F1 on test set: 0.87
3352.4832718372345


## Dimension 


In [8]:
# approximate running time: 40 mins

model_train_dim_100 = train_or_load_wv(dataset[0], size=100, window=15)
_, _ = run_pipeline(dataset, model_train_dim_100)

model_train_dim_500 = train_or_load_wv(dataset[0], size=500, window=15)
_, _ = run_pipeline(dataset, model_train_dim_500)

model_train_dim_1000 = train_or_load_wv(dataset[0], size=1000, window=15)
_, _ = run_pipeline(dataset, model_train_dim_1000)



Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.88
F1 on test set: 0.87
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.89
F1 on test set: 0.88
Best parameters set found on development set:  {'C': 100}
Best F1 on development set: 0.89
F1 on test set: 0.88
3940.248155117035


# Transfer Learning

When training our word embeddings on the train data, we should be aware that they may be of lower quality, i.e. capture less semantics, than the pre-trained word embeddings used in Notebook 2. While those vectors were trained on several billion tokens, our word embeddings are trained on a much smaller dataset. We can say that our embeddings will capture token meanings that are more specific to the domain, i.e. movie reviews. But we can also say that the embeddings of more common tokens will capture less semantics than those from the pre-trained Word2Vec.

We can address this issue by introducing transfer learning. At a high level, it is a general term for training a model on one dataset (generally larger) and then using the parameters of this trained model, partially or entirely, to train another model on another dataset (generally smaller). The new dataset and the new model can differ slightly from the original, e.g. in the prediction targets. A typical use case is training an image classifier on one domain and then "transferring" the knowledge to another domain. We can apply a similar technique by "transferring" the pre-trained Word2Vec knowledge to our model. For a formal definition and examples, see [this](https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a).

Let's dig a little deeper. How do we actually transfer the parameters from Word2Vec to our model?

The obvious approach is to start training the Word2Vec model from the parameters of the pre-trained Word2Vec instead of small random parameters. The second question is then: what is the vocabulary? We can combine the vocabulary from our training set with the pre-trained vocabulary. However, the pre-trained vocabulary is much larger (3M vs 100k), as we discussed in Notebook 0. If we augment the 100k vocabulary with 3M words and train the Word2Vec model, the effect of the 100k will be very slight. In fact, this does not necessarily matter if we only want to learn embeddings for the 100k words from the training set, regardless of how many embeddings are transferred from Word2Vec. In the experiment, we will see the effect of the size of the augmented vocabulary.
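
One possible way to build the initial embedding table is sketched below. It is an illustration under assumptions, not the implementation behind `train_or_load_wv_transfer`: the function `init_embeddings` and its `n_extra` argument are hypothetical (the notebook's `n_transfer` may behave differently), and `pretrained` is assumed to be a dict ordered from most to least frequent word.

```python
import random
from typing import Dict, List, Set, Tuple


def init_embeddings(
    train_vocab: List[str],
    pretrained: Dict[str, List[float]],
    n_extra: int,
    dim: int,
    seed: int = 0,
) -> Tuple[Dict[str, List[float]], Set[str]]:
    """Build an initial embedding table for further training.

    Train-set words found in `pretrained` start from the pre-trained vector,
    the remaining train-set words start from small random values, and up to
    `n_extra` additional pre-trained words augment the vocabulary.
    Returns the table and the set of words whose vectors were transferred.
    """
    rng = random.Random(seed)
    table: Dict[str, List[float]] = {}
    transferred: Set[str] = set()

    for word in train_vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
            transferred.add(word)
        else:
            table[word] = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

    added = 0
    for word, vec in pretrained.items():
        if added >= n_extra:
            break
        if word not in table:
            table[word] = list(vec)
            transferred.add(word)
            added += 1

    return table, transferred


# Toy data (made-up vectors, for illustration only).
pretrained = {
    "the": [0.1, 0.2],
    "movie": [0.3, 0.4],
    "stock": [0.5, 0.6],
    "market": [0.7, 0.8],
}
train_vocab = ["movie", "the", "spoilerific"]

table, transferred = init_embeddings(train_vocab, pretrained, n_extra=1, dim=2)
print(sorted(table))        # ['movie', 'spoilerific', 'stock', 'the']
print(sorted(transferred))  # ['movie', 'stock', 'the']
```

With `n_extra=0` the table contains only the train-set vocabulary, which corresponds to not augmenting at all.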

We can also choose whether to freeze the embeddings transferred from the pre-trained Word2Vec. If we freeze them, only the embeddings that are not transferred, i.e. those for vocabulary that appears in the train data but not in the pre-trained Word2Vec, will be trained. We will experiment with both options.
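
Freezing can be pictured as multiplying each word's update by a per-word lock factor. The sketch below is a simplified illustration with a hypothetical `apply_update` helper and made-up numbers. It follows the convention used by gensim's lock factors, where 0.0 blocks updates and 1.0 allows them; whether the `lockf` argument of `train_or_load_wv_transfer` uses the same convention is an assumption.

```python
from typing import Dict, List


def apply_update(
    table: Dict[str, List[float]],
    lock: Dict[str, float],
    word: str,
    update: List[float],
    lr: float,
) -> None:
    """Move `word`'s vector along `update`, scaled by its lock factor.

    A lock factor of 0.0 freezes the vector; 1.0 applies the full update.
    Words without an entry in `lock` are treated as trainable.
    """
    factor = lock.get(word, 1.0)
    table[word] = [w + lr * factor * u for w, u in zip(table[word], update)]


# "movie" was transferred and is frozen; "spoilerific" is new and trainable.
table = {"movie": [0.3, 0.4], "spoilerific": [0.0, 0.0]}
lock = {"movie": 0.0, "spoilerific": 1.0}

apply_update(table, lock, "movie", [1.0, -1.0], lr=0.5)
apply_update(table, lock, "spoilerific", [1.0, -1.0], lr=0.5)

print(table["movie"])        # [0.3, 0.4], unchanged
print(table["spoilerific"])  # [0.5, -0.5]
```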

Lastly, how many epochs should we train for? Generally, we train for only a few epochs in transfer learning, since most of the parameters are already trained and we only need to introduce them to the new dataset. We will also experiment with different numbers of epochs.



## Number of augmented vocabularies

In [None]:

embeddings = train_or_load_wv_transfer(
    dataset[0], 
    PRETRAINED_WV_MODEL_PATH, 
    lockf=1, 
    window=15,
    n_transfer=0
)
_, vectorizer = run_pipeline(dataset, embeddings)
vectorizer.print_stat()


embeddings = train_or_load_wv_transfer(
    dataset[0], 
    PRETRAINED_WV_MODEL_PATH, 
    lockf=1, 
    window=15,
    n_transfer=50000
)
_, vectorizer = run_pipeline(dataset, embeddings)
vectorizer.print_stat()


embeddings = train_or_load_wv_transfer(
    dataset[0], 
    PRETRAINED_WV_MODEL_PATH, 
    lockf=1, 
    window=15,
    n_transfer=100000
)
_, vectorizer = run_pipeline(dataset, embeddings)
vectorizer.print_stat()


embeddings = train_or_load_wv_transfer(
    dataset[0], 
    PRETRAINED_WV_MODEL_PATH, 
    lockf=1, 
    window=15,
    n_transfer=500000
)
_, vectorizer = run_pipeline(dataset, embeddings)
vectorizer.print_stat()

## Freeze / Not Freeze

In [None]:
embeddings = train_or_load_wv_transfer(
    dataset[0], 
    PRETRAINED_WV_MODEL_PATH, 
    lockf=1, 
    window=15,
    n_transfer=100000
)
_, vectorizer = run_pipeline(dataset, embeddings)
vectorizer.print_stat()


embeddings = train_or_load_wv_transfer(
    dataset[0], 
    PRETRAINED_WV_MODEL_PATH, 
    lockf=0, 
    window=15,
    n_transfer=100000
)
_, vectorizer = run_pipeline(dataset, embeddings)
vectorizer.print_stat()

## Epochs

In [None]:
embeddings = train_or_load_wv_transfer(
    dataset[0], 
    PRETRAINED_WV_MODEL_PATH, 
    lockf=1, 
    window=15,
    n_transfer=500000,
    iter=1
)
_, vectorizer = run_pipeline(dataset, embeddings)
vectorizer.print_stat()


embeddings = train_or_load_wv_transfer(
    dataset[0], 
    PRETRAINED_WV_MODEL_PATH, 
    lockf=1, 
    window=15,
    n_transfer=500000,
    iter=3
)
_, vectorizer = run_pipeline(dataset, embeddings)
vectorizer.print_stat()