# Fun with Word Embeddings

In this activity, we will explore a few pre-trained word embeddings and learn to train our own Word2Vec and FastText word embeddings. 

We will use the Gensim implementation of Word2Vec. 
[Gensim](https://radimrehurek.com/gensim/) is an open-source Python library for natural language processing, with a focus on topic modeling. 
To prepare for this activity, you need to install Gensim. 
You can do this by opening your terminal and running the following command:
```
pip install --upgrade gensim
```

In [1]:
!pip install --upgrade gensim  # lazy installation :D



In this notebook, a lot will be happening behind the scenes 😬. 
We can track events and display information through basic [Logging](https://docs.python.org/3/howto/logging.html).

In the following cell, we also specify how we want the information to be displayed, using the format string `'%(asctime)s : %(levelname)s : %(message)s'`. 

For a full set of things that can appear in format strings, you can refer to the documentation for [LogRecord](https://docs.python.org/3/library/logging.html#logrecord-attributes) attributes, but for simple usage, you just need the `levelname` (severity), `message` (event description, including variable data) and perhaps to display when the event occurred with `asctime`. 

In [2]:
import logging
from pprint import pprint as print

# Tracking events and display information through Logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
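To see what this format string produces, here is a minimal sketch that sends one record through the same format. The logger name `embedding_demo`, the in-memory buffer and the message are assumptions made for illustration; they are not part of the notebook.

```python
import io
import logging

# Same format string as in the cell above
FORMAT = '%(asctime)s : %(levelname)s : %(message)s'

# Hypothetical logger writing to an in-memory buffer, so the demo
# does not interfere with the notebook's root logger
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter(FORMAT))

demo_logger = logging.getLogger("embedding_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

demo_logger.info("loaded %d documents", 2225)         # kept: level is INFO
demo_logger.debug("below INFO, so this is filtered")  # dropped: level is DEBUG

logged = buffer.getvalue()
```

`logged` holds a single line of the form `<timestamp> : INFO : loaded 2225 documents`, which is the same layout as the Gensim messages shown in the outputs of this notebook.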

## 3. FastText

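As mentioned before, a `word2vec` model cannot produce vectors for words that do not appear in the training corpus. FastText addresses this by also learning vectors for character n-grams, so it can assemble a vector for an unseen word from its subwords. 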

Here, we’ll learn to work with the fastText library for training word-embedding models, and performing similarity operations & vector lookups analogous to Word2Vec. 
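FastText's handling of unseen words rests on character n-grams: each word is wrapped in the boundary markers `<` and `>` and split into overlapping substrings. The sketch below only illustrates that splitting. The helper `char_ngrams` is hypothetical, and the n-gram range of 3 to 6 is assumed to match Gensim's defaults (`min_n=3`, `max_n=6`). The real implementation additionally hashes n-grams into a fixed number of buckets, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

ngrams_king = set(char_ngrams("king"))
ngrams_kingdom = set(char_ngrams("kingdom"))

# Subwords shared by the two words
shared = sorted(ngrams_king & ngrams_kingdom)
# shared == ['<ki', '<kin', '<king', 'ing', 'kin', 'king']
```

Because `kingdom` shares these subwords with `king`, a model that has seen only one of the two words can still place the other near it.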

In the following block of code, we import the `FastText` model from the Gensim library, then:
1. Set the path to the corpus file. As above, we use the BBC News articles as the training corpus;
2. Initialise the `FastText` model. As before, we use 100-dimensional vectors;
3. Build the vocabulary from the corpus;
4. Finally, train the FastText model on the corpus.

In [3]:
from gensim.models.fasttext import FastText

# 1. Set the corpus file names/path
corpus_file = './bbcNews.txt'

# 2. Initialise the Fast Text model
bbcFT = FastText(vector_size=100) 

# 3. build the vocabulary
bbcFT.build_vocab(corpus_file=corpus_file)

# 4. train the model
bbcFT.train(
    corpus_file=corpus_file, epochs=bbcFT.epochs,
    total_examples=bbcFT.corpus_count, total_words=bbcFT.corpus_total_words,
)

print(bbcFT)

2022-10-01 21:36:05,999 : INFO : FastText lifecycle event {'params': 'FastText<vocab=0, vector_size=100, alpha=0.025>', 'datetime': '2022-10-01T21:36:05.999746', 'gensim': '4.2.0', 'python': '3.9.12 (main, Apr  5 2022, 01:53:17) \n[Clang 12.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
2022-10-01 21:36:06,000 : INFO : collecting all words and their counts
2022-10-01 21:36:06,001 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-10-01 21:36:06,104 : INFO : collected 18345 word types from a corpus of 426118 raw words and 2225 sentences
2022-10-01 21:36:06,104 : INFO : Creating a fresh vocabulary
2022-10-01 21:36:06,137 : INFO : FastText lifecycle event {'msg': 'effective_min_count=5 retains 10485 unique words (57.15% of original 18345, drops 7860)', 'datetime': '2022-10-01T21:36:06.137930', 'gensim': '4.2.0', 'python': '3.9.12 (main, Apr  5 2022, 01:53:17) \n[Clang 12.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'p

<gensim.models.fasttext.FastText object at 0x7fcf00e417c0>
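The log reports that `effective_min_count=5` retains 10485 of the 18345 word types: words seen fewer than five times are dropped when the vocabulary is built. The following is a minimal sketch of that filtering on a made-up toy corpus, assuming one sentence per line and whitespace tokenisation. The sentences and the lower threshold of 2 are illustrative assumptions, not the notebook's data.

```python
from collections import Counter

# Hypothetical toy corpus: one sentence per line, tokens split on whitespace
toy_corpus = [
    "the match was won by the home team",
    "the team won the cup",
    "the cup final was a close match",
]
min_count = 2  # the log above shows 5; lowered here to suit the tiny corpus

counts = Counter(word for line in toy_corpus for word in line.split())
vocab = {word for word, count in counts.items() if count >= min_count}

retained_pct = 100 * len(vocab) / len(counts)
dropped = len(counts) - len(vocab)

# The same arithmetic applied to the numbers in the log line
log_pct = round(100 * 10485 / 18345, 2)  # 57.15
log_dropped = 18345 - 10485              # 7860
```

Here 6 of the 11 word types survive. Words that are dropped get no whole-word entry in the vocabulary.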


We can retrieve the `KeyedVectors` from the model as follows:

In [4]:
bbcFT_wv = bbcFT.wv
print(bbcFT_wv)

<gensim.models.fasttext.FastTextKeyedVectors object at 0x7fcf00e410d0>


And look up the vector for `king`:

In [5]:
bbcFT_wv['king']

array([ 0.25515214, -0.15216236, -0.91329324, -0.1548347 ,  0.25573623,
        0.6096976 , -0.02978956,  0.4597498 ,  0.5173004 ,  0.21684076,
       -0.8100364 ,  0.45270357, -0.76255137,  0.03837309,  0.29864702,
        0.3910782 , -0.09769247, -0.6677153 ,  0.19720758, -0.6545355 ,
       -0.6917739 ,  0.2979505 ,  0.5610619 ,  0.06513355, -0.35859495,
       -0.55487525, -0.45098144,  0.02784753, -0.14543949,  0.6603091 ,
       -0.60943335,  0.5215059 ,  0.2755244 , -0.13164893,  0.08852942,
       -0.03730462,  1.2275649 ,  0.90195924, -1.1044537 ,  0.04230173,
       -0.20181541,  0.22362003, -0.28896233, -0.48851252,  0.01154309,
       -0.12411773, -0.5494902 , -0.3262416 ,  0.9721059 ,  0.09267493,
        0.5495732 ,  0.8440249 ,  0.50472564,  0.12608802,  0.43247995,
       -0.11829105, -1.2220495 , -0.33981144, -0.51818687, -0.01814928,
       -0.5242976 , -0.94597554, -0.8754663 , -0.164685  , -0.91298574,
        0.87168354, -0.59012735,  0.0038243 , -0.7854779 ,  1.03
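
A single vector like the one above is mostly useful when compared with other vectors. The similarity operations on `KeyedVectors`, such as `similarity` and `most_similar`, are based on cosine similarity. The sketch below computes it in plain Python. The function `cosine_similarity` and the three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, for illustration only
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)  # close to 1: same direction
sim_ac = cosine_similarity(vec_a, vec_c)  # 0: no shared direction
```

With the trained model, a call such as `bbcFT_wv.similarity('king', 'queen')` would apply the same measure to the 100-dimensional vectors.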

### Save the model

As we did with the trained Word2Vec model, we can save our trained FastText model using the standard Gensim methods. 


In [6]:
# Save the model
bbcFT.save("bbcFT.model")

2022-10-01 21:36:18,007 : INFO : FastText lifecycle event {'fname_or_handle': 'bbcFT.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-10-01T21:36:18.007755', 'gensim': '4.2.0', 'python': '3.9.12 (main, Apr  5 2022, 01:53:17) \n[Clang 12.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'saving'}
2022-10-01 21:36:18,008 : INFO : storing np array 'vectors_ngrams' to bbcFT.model.wv.vectors_ngrams.npy
2022-10-01 21:36:19,445 : INFO : not storing attribute buckets_word
2022-10-01 21:36:19,446 : INFO : not storing attribute vectors
2022-10-01 21:36:19,446 : INFO : not storing attribute cum_table
2022-10-01 21:36:19,454 : INFO : saved bbcFT.model


## 4. Summary

In this activity, we have played with a few pre-trained word embeddings, including `Word2Vec` trained on the Google News dataset, and GloVe. 
We have also learnt to train our own `Word2Vec` and `FastText` models using our BBC News dataset. 
I hope you had a lot of fun in this activity. 

The semantic embeddings seem amazing, though so far we have not really explored how to use them. 
In the next activity, we will try to use these word embeddings for a text classification task. Get ready! 😉

## 5. Exercise
* There are multiple pre-trained models available through Gensim; see the **Pretrained models** section in https://radimrehurek.com/gensim/models/word2vec.html. Gensim's downloader API (`gensim.downloader`) also provides pre-trained GloVe vectors. In this activity, we did not use the Gensim-provided version; instead, we demonstrated how to load the pre-trained GloVe word embeddings from the original source. You can explore the other pre-trained models available in Gensim.

## References
[1] [Word Embeddings— Fun with Word2Vec and Game of Thrones](https://medium.com/@khulasaandh/word-embeddings-fun-with-word2vec-and-game-of-thrones-ea4c24fcf1b8)  
[2] [Gensim Word2Vec Tutorial – Full Working Example](http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.YJzBmmYza3f)  
[3] [Word2vec Tutorial](https://rare-technologies.com/word2vec-tutorial/)  
[4] [Word2Vec Model -- gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html)   
[5] [Using pre-trained word embeddings](https://nlp.stanford.edu/projects/glove/)  
[6] [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)  
[7] [Using pre-trained word embeddings](https://keras.io/examples/nlp/pretrained_word_embeddings/)  