**Table of contents**<a id='toc0_'></a>    
- [Embeddings](#toc1_)    
  - [Finding similar words](#toc1_1_)    
  - [Combining words to create new meaning](#toc1_2_)    
    - [How does it work? Cosine similarity](#toc1_2_1_)    
- [Resources](#toc2_)    
- [Acknowledgements](#toc3_)    

# <a id='toc1_'></a>[Embeddings](#toc0_)

In [None]:
# You know the drill
# !pip install gensim

In [3]:
import gensim
import gensim.downloader as api

In [None]:
# The model I'd like to use but takes a very long time to download...
# model_nope = api.load("word2vec-google-news-300")

In [2]:
# ...so we will try a smaller model
model = api.load("glove-wiki-gigaword-100")

> Each word is represented by an embedding, i.e. a fixed-length vector of abstract numeric features (100 of them in this model):

In [4]:
model['boat']

array([-0.0083229,  0.3696   , -0.24436  , -0.89288  , -0.23607  ,
       -0.41659  ,  0.48309  ,  0.91159  ,  0.14498  , -0.096963 ,
        0.55061  ,  1.0376   ,  0.32243  , -0.01381  ,  0.0098052,
       -0.66346  ,  0.27949  , -0.74099  , -0.29602  ,  0.64596  ,
        1.1305   ,  0.54629  ,  0.49664  , -0.87378  ,  0.42399  ,
        0.35015  , -1.9151   ,  0.010363 ,  0.35684  , -0.32398  ,
       -0.66927  ,  0.43628  , -0.20924  ,  0.28862  ,  0.63752  ,
       -0.18789  , -0.079442 ,  0.30494  ,  0.8829   , -0.3143   ,
       -1.2595   , -0.72301  ,  0.077278 , -0.045894 ,  1.0251   ,
        0.25472  , -0.2566   ,  0.18428  ,  0.34037  ,  0.53185  ,
       -0.070906 ,  0.57464  ,  0.5131   ,  1.1666   ,  0.1848   ,
       -1.4466   , -0.41846  ,  0.011812 ,  2.1553   ,  0.52012  ,
       -0.9029   ,  0.43183  ,  0.60584  ,  0.72845  , -0.32243  ,
        0.73929  , -0.68845  ,  0.25407  , -0.20834  , -0.059242 ,
       -0.45655  , -0.27773  ,  0.7168   ,  0.075051 ,  0.4167

In [5]:
len(model['boat'])

100

## <a id='toc1_1_'></a>[Finding similar words](#toc0_)

> Because every word is a vector, you can search for "similar" words, i.e. words whose vectors lie close together in the embedding space (the scores below are cosine similarities):

In [6]:
model.most_similar('boat')

[('boats', 0.8819124102592468),
 ('ship', 0.8258436322212219),
 ('vessel', 0.8200733661651611),
 ('ferry', 0.77042156457901),
 ('sailing', 0.7610524296760559),
 ('ships', 0.7407434582710266),
 ('capsized', 0.7383140921592712),
 ('barge', 0.7286107540130615),
 ('fishing', 0.7228102684020996),
 ('yacht', 0.7208371162414551)]

In [7]:
model.most_similar('book')

[('books', 0.847648561000824),
 ('novel', 0.8181166648864746),
 ('published', 0.8023924231529236),
 ('story', 0.7941390872001648),
 ('author', 0.7937875390052795),
 ('wrote', 0.7930577397346497),
 ('essay', 0.7821518182754517),
 ('biography', 0.7754694819450378),
 ('written', 0.760090172290802),
 ('fiction', 0.7549652457237244)]

## <a id='toc1_2_'></a>[Combining words to create new meaning](#toc0_)

In [8]:
# King is to man what ? is to woman:
model.most_similar(positive=['woman','king'], negative=['man'], topn=1)

[('queen', 0.7698540687561035)]

In [9]:
# Boy is to man what ? is to woman:
model.most_similar(positive=['woman', 'boy'], negative=['man'], topn=1)

[('girl', 0.9095936417579651)]

In [11]:
# Grain is to beer what grape is to ? :
# (first attempt, with 'barley' instead of 'grain', kept for reference)
# model.most_similar(positive=['grape', 'beer'], negative=['barley'], topn=1)
model.most_similar(positive=['grape', 'beer'], negative=['grain'], topn=1)

[('champagne', 0.6877999305725098)]

In [12]:
# Madrid is to Spain what Lisbon is to ? :
model.most_similar(positive=['lisbon', 'spain'], negative=['madrid'], topn=1)

[('portugal', 0.806252121925354)]
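
One way to picture what `most_similar` does with `positive` and `negative` words is plain vector arithmetic: add the positive vectors, subtract the negative ones, then pick the remaining word whose vector points in the most similar direction. The sketch below uses made-up 2-D vectors (`toy_vectors`) and hypothetical helpers (`cosine`, `analogy`) instead of the GloVe model. gensim's own implementation may differ in details such as vector normalisation, so treat this as an illustration of the idea only.

```python
import math

# Toy 2-D vectors invented for illustration (NOT taken from the GloVe model):
# axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "boat":  [-0.7, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, positive, negative):
    """Hypothetical helper: add positive vectors, subtract negative ones,
    and return the closest word that is not part of the query."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    query_words = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in query_words]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# King is to man what ? is to woman:
result = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result)
```

With these toy vectors, `woman + king - man` lands exactly on the vector chosen for `queen`, so `queen` is the word returned. Excluding the query words from the candidates matters: otherwise one of the inputs would often be its own nearest neighbour.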

### <a id='toc1_2_1_'></a>[How does it work? Cosine similarity](#toc0_)

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([model['king']], [model['ufo']])

array([[-0.02140226]], dtype=float32)
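
Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. It ranges from -1 (opposite directions) through 0 (unrelated, orthogonal) to 1 (same direction), and it ignores vector magnitude. The function below is a minimal from-scratch sketch of that definition; `cosine_sim` is a hypothetical name, and the inputs are small made-up vectors rather than word embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors chosen to show the three reference cases.
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])    # parallel: close to 1
orthogonal = cosine_sim([1.0, 0.0], [0.0, 3.0])        # right angle: 0
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])        # reversed: close to -1

print(same_direction, orthogonal, opposite)
```

Read this way, the value near zero for `king` and `ufo` above says the two vectors are roughly orthogonal, i.e. the words share almost nothing in this embedding space.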

In [15]:
cosine_similarity([model['woman']],[model['man']])

array([[0.8323495]], dtype=float32)

In [17]:
cosine_similarity([model['king']],[model['crown']])

array([[0.6647526]], dtype=float32)

In [18]:
cosine_similarity([model['king']],[model['monarch']])

array([[0.6977891]], dtype=float32)

In [19]:
cosine_similarity([model['king'] - model['queen']], [model['son'] - model['daughter']])

array([[0.70172185]], dtype=float32)

In [23]:
cosine_similarity([model['russia'] - model['moscow']], [model['poland'] - model['krakow']])

array([[0.6088005]], dtype=float32)

In [27]:
# Lancut is to Poland what ? is to Russia.
# 'lancut' is not in the model's vocabulary, so this raises a KeyError:
model.most_similar(positive=['russia', 'lancut'], negative=['poland'], topn=1)

KeyError: "Key 'lancut' not present in vocabulary"
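
The `KeyError` above can be avoided by checking the words against the vocabulary before querying. The sketch below uses a hypothetical helper, `missing_words`, and a small stand-in set so that it runs without downloading the model. With the loaded model, and assuming gensim 4.x, the `model.key_to_index` mapping could be passed as the vocabulary instead.

```python
def missing_words(words, vocabulary):
    """Return the words that are absent from the vocabulary, in input order."""
    return [w for w in words if w not in vocabulary]

# Stand-in vocabulary for illustration only; with the real model one could
# pass `model.key_to_index` (gensim 4.x) instead of this set.
toy_vocabulary = {"russia", "poland", "moscow", "krakow"}

absent = missing_words(["russia", "lancut", "poland"], toy_vocabulary)
if absent:
    print("Not in vocabulary:", absent)
```

Only when `absent` is empty is it safe to call `most_similar` with those words.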

# <a id='toc2_'></a>[Resources](#toc0_)

- [Word embeddings and Word2Vec, StatQuest (16 min)](https://www.youtube.com/watch?v=viZrOnJclY0)
- [Cosine similarity, StatQuest (10 min)](https://www.youtube.com/watch?v=e9U0QAFbfLI)
- Understanding transformers, by StatQuest:
    - [Recurrent Neural Networks (RNNs) (17 min)](https://youtu.be/AsNTP8Kwu80?si=yLu1H5CEVd4dX7hm)  
    - [Long Short-Term Memory Networks (LSTMs) (21 min)](https://youtu.be/YCzL96nL7j0?si=-i22L3FuAC6BWLSU)  
    - [seq2seq Encoder-Decoder (17 min)](https://youtu.be/L8HKweZIOmg?si=ROsiu2V4A8EfWPjN)  
    - [Attention for Neural Networks (16 min)](https://youtu.be/PSs6nxngL6k?si=Kk3U02Px3ij98RiX)  
    - [Transformer Neural Networks (36 min)](https://youtu.be/zxQyTK8quyY?si=StA0ZLl702j3br4T)

- [BERTopic library and documentation](https://maartengr.github.io/BERTopic/)

# <a id='toc3_'></a>[Acknowledgements](#toc0_)

Thank you, David Henriques, for your awesome lesson structure & content!