**Table of contents**<a id='toc0_'></a>    
- [Embeddings](#toc1_)    
  - [Finding similar words](#toc1_1_)    
  - [Combining words to create new meaning](#toc1_2_)    
    - [How does it work? Cosine similarity](#toc1_2_1_)    
- [Resources](#toc2_)    
- [Acknowledgements](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Embeddings](#toc0_)

In [None]:
# You know the drill
# !pip install gensim

In [None]:
import gensim
import gensim.downloader as api

In [None]:
# The model I'd like to use, but it takes a very long time to download...
# model_nope = api.load("word2vec-google-news-300")

In [None]:
# ...so we will try a smaller model
model = api.load("glove-wiki-gigaword-100")

> Each word has an embedding: a fixed-length vector of abstract numeric features (100 of them in this model) that represents the word:

In [None]:
model['boat']

In [None]:
len(model['boat'])

## <a id='toc1_1_'></a>[Finding similar words](#toc0_)

> This lets you search for "similar" words, i.e. words whose vectors lie close together in the embedding space:

In [None]:
model.most_similar('boat')

In [None]:
model.most_similar('book')
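
To make the idea of "close together" concrete, here is a minimal sketch of a nearest-neighbour search over word vectors. The 3-dimensional `toy_vectors` and the helper `toy_most_similar` are made up for illustration (they are not part of gensim), and ranking by cosine similarity is an assumption that matches what `most_similar` is documented to do.

```python
import numpy as np

# Hypothetical 3-dimensional toy embeddings (the GloVe model above uses 100 dimensions).
toy_vectors = {
    "boat":  np.array([0.9, 0.1, 0.0]),
    "ship":  np.array([0.8, 0.2, 0.1]),
    "canoe": np.array([0.7, 0.0, 0.3]),
    "book":  np.array([0.0, 0.9, 0.2]),
    "novel": np.array([0.1, 0.8, 0.3]),
}

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = toy_most_similar("boat", toy_vectors)
print(neighbours)  # 'ship' and 'canoe' rank above 'book' and 'novel'
```

A real model does the same comparison against its whole vocabulary, which is why the query word itself is left out of the results.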

## <a id='toc1_2_'></a>[Combining words to create new meaning](#toc0_)

In [None]:
# King is to man what ? is to woman:
model.most_similar(positive=['woman','king'], negative=['man'], topn=1)

In [None]:
# Boy is to man what ? is to woman:
model.most_similar(positive=['woman', 'boy'], negative=['man'], topn=1)

In [None]:
# Grain is to beer what grape is to ? :
# (alternative with the more specific 'barley' instead of 'grain')
# model.most_similar(positive=['grape', 'beer'], negative=['barley'], topn=1)
model.most_similar(positive=['grape', 'beer'], negative=['grain'], topn=1)

In [None]:
# Madrid is to Spain what Lisbon is to ? :
model.most_similar(positive=['lisbon', 'spain'], negative=['madrid'], topn=1)

### <a id='toc1_2_1_'></a>[How does it work? Cosine similarity](#toc0_)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([model['king']], [model['ufo']])
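
Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. It ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction) and ignores how long the vectors are. Here is a minimal NumPy sketch of that formula; the helper name `cosine` is my own, not scikit-learn's.

```python
import numpy as np

def cosine(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine([1, 2], [2, 4])    # close to 1: length does not matter
perpendicular = cosine([1, 0], [0, 1])     # 0: nothing in common
opposite = cosine([1, 2], [-1, -2])        # close to -1

print(same_direction, perpendicular, opposite)
```

The scikit-learn function used in this notebook computes the same quantity, but for whole batches of vectors at once, which is why each vector is wrapped in a list and the result comes back as a 2D array.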

In [None]:
cosine_similarity([model['woman']], [model['man']])

In [None]:
cosine_similarity([model['king']], [model['crown']])

In [None]:
cosine_similarity([model['king']], [model['monarch']])

In [None]:
cosine_similarity([model['king'] - model['queen']], [model['son'] - model['daughter']])

In [None]:
cosine_similarity([model['russia'] - model['moscow']], [model['poland'] - model['krakow']])
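
The two cells above compare difference vectors rather than words. If two word pairs express the same relation, their offsets should point in roughly the same direction, so the cosine between them should be high. Here is a small sketch on made-up vectors (all values and names are hypothetical):

```python
import numpy as np

# Hypothetical toy embeddings; dimensions loosely mean [royalty, male, female].
toy_vectors = {
    "king":     np.array([0.9, 0.8, 0.1]),
    "queen":    np.array([0.9, 0.1, 0.8]),
    "man":      np.array([0.1, 0.9, 0.1]),
    "son":      np.array([0.3, 0.9, 0.2]),
    "daughter": np.array([0.2, 0.3, 0.8]),
}

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

gender_offset = toy_vectors["king"] - toy_vectors["queen"]
same_relation = cosine(gender_offset, toy_vectors["son"] - toy_vectors["daughter"])
other_relation = cosine(gender_offset, toy_vectors["king"] - toy_vectors["man"])

print(round(same_relation, 3))   # high: both offsets encode male vs. female
print(round(other_relation, 3))  # near 0: king - man encodes royalty instead
```

With real embeddings the scores are far less clean, so a moderate positive value is already a sign that the model has picked up the relation.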

In [None]:
# Lancut is to Poland what ? is to Russia :
model.most_similar(positive=['russia', 'lancut'], negative=['poland'], topn=1)

# <a id='toc2_'></a>[Resources](#toc0_)

- [Word embeddings and Word2Vec, StatQuest (16 min)](https://www.youtube.com/watch?v=viZrOnJclY0)
- [Cosine similarity, StatQuest (10 min)](https://www.youtube.com/watch?v=e9U0QAFbfLI)
- Understanding transformers, by StatQuest:
    - [Recurrent Neural Networks (RNNs) (17 min)](https://youtu.be/AsNTP8Kwu80?si=yLu1H5CEVd4dX7hm)  
    - [Long Short-Term Memory Networks (LSTMs) (21 min)](https://youtu.be/YCzL96nL7j0?si=-i22L3FuAC6BWLSU)  
    - [seq2seq Encoder-Decoder (17 min)](https://youtu.be/L8HKweZIOmg?si=ROsiu2V4A8EfWPjN)  
    - [Attention for Neural Networks (16 min)](https://youtu.be/PSs6nxngL6k?si=Kk3U02Px3ij98RiX)  
    - [Transformer Neural Networks (36 min)](https://youtu.be/zxQyTK8quyY?si=StA0ZLl702j3br4T)

- [BERTopic library and documentation](https://maartengr.github.io/BERTopic/)

# <a id='toc3_'></a>[Acknowledgements](#toc0_)

Thank you, David Henriques, for your awesome lesson structure & content!