# Classroom 6 - Working with word embeddings

So far we've seen a couple of key Python libraries for doing specific tasks in NLP. For example, ```scikit-learn``` provides a whole host of fundamental machine learning algorithms; ```spaCy``` allows us to do robust linguistic analysis.

Today, we're going to meet ```gensim``` which is the best way to work with (static) word embeddings like word2vec. You can find the documentation [here](https://radimrehurek.com/gensim/).

In [2]:
import gensim
import gensim.downloader
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

## Choose a language

I've taken the word2vec model that we're going to use from a public resource containing lots of different embedding models for lots of different languages. You can access that resource [here](http://vectors.nlpl.eu/repository/).

I've saved an English and a Danish model in the ```cds-lang-shared``` drive, but feel free to experiment with more!

You can download them to UCloud by getting the URL and using the following code at the command line:

```curl http://some-url.example --output some.file```


In [8]:
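# English and Danish embeddings from http://vectors.nlpl.eu/repository/ (English CoNLL17 corpus)
# Assumption: the .bin files are in word2vec binary format (KeyedVectors.load() fails on them with an UnpicklingError)
model_en = gensim.models.KeyedVectors.load_word2vec_format("/work/431868/word2vec_models/english/english_word2vec.bin", binary=True)
model_dk = gensim.models.KeyedVectors.load_word2vec_format("/work/431868/word2vec_models/danish/danish_word2vec.bin", binary=True)
# the cells below use `model`; point it at the language you want to explore
model = model_en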

I've outlined a couple of tasks for you below to experiment with. Use these just as jumping-off points to explore the nature of word embeddings and how they work.

Work in small groups on these tasks and make sure to discuss the issues and compare results - preferably across languages!

### Task 1: Finding polysemy

Find a polysemous word (for example, "leaves" or "scoop") such that the top-10 most similar words (according to cosine similarity) contain related words from both meanings. An example is given for you below in English.

Are there certain words for which polysemy is more of a problem?
- e.g., "cool" has ambiguous meanings --> its neighbours include both "toasty" and "coolness"
- The higher the value, the more similar the words are

In [12]:
#model_en.most_similar("leaves")
model_en.most_similar("cool") # the higher the value, the more similar the word
# these scores are cosine similarities (not cosine distances)


[('Cool', 0.5849887132644653),
 ('cooler', 0.553480327129364),
 ('coolest', 0.5431790947914124),
 ('coolness', 0.5307180881500244),
 ('toasty', 0.5274037718772888),
 ('warm', 0.5268723368644714),
 ('hi_dev_ur', 0.5175032615661621),
 ('hot', 0.515114963054657),
 ('cooled', 0.5131838321685791),
 ('cools', 0.511819064617157)]
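
To make the ranking above a bit more concrete, here is a minimal sketch of the idea on a toy vocabulary. The vectors and the helper `toy_most_similar` are made up for illustration; this is not how ```gensim``` is implemented internally, just the same principle: score every other word by cosine similarity and sort from high to low.

```python
import numpy as np

# Toy vocabulary with made-up 2-d vectors (not taken from any real model)
toy_vectors = {
    "cool":  np.array([1.0, 1.0]),
    "cold":  np.array([1.0, 0.8]),   # "temperature" sense
    "hip":   np.array([0.6, 1.0]),   # "fashionable" sense
    "piano": np.array([-1.0, 0.2]),  # unrelated word
}

def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue  # skip the query word itself
        scores.append((other, float(query @ (vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = toy_most_similar("cool", toy_vectors)
for neighbour, score in ranked:
    print(neighbour, round(score, 3))
```

In this toy setup both senses of "cool" end up near the top of the list, which is the same effect you are looking for in Task 1.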

### Task 2: Synonyms and antonyms

In the lecture, we saw that _cosine similarity_ can also be thought of as _cosine distance_, which is simply ```1 - cosine similarity```. So the higher the cosine distance, the further away two words are from each other and so they have less "in common".
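
As a quick illustration of that relationship, here is a small sketch using ```numpy``` and made-up three-dimensional vectors. The function names and the toy vectors are my own assumptions for the example; they don't come from the word2vec model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """Cosine distance is simply 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Toy "embeddings" (made up for illustration)
happy = [1.0, 2.0, 0.0]
joyful = [2.0, 4.0, 0.0]  # points in the same direction as happy
table = [0.0, 0.0, 3.0]   # orthogonal to happy

print(cosine_distance(happy, joyful))  # very close to 0: same direction
print(cosine_distance(happy, table))   # 1: nothing "in common"
```

Note that only the direction of the vectors matters here, not their length: `joyful` is twice as long as `happy` but the distance between them is still (approximately) zero.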

Find three words ```(w1,w2,w3)``` where ```w1``` and ```w2``` are synonyms and ```w1``` and ```w3``` are antonyms, but where: 

```Cosine Distance(w1,w3) < Cosine Distance(w1,w2)```

For example, w1="happy" is closer to w3="sad" than to w2="cheerful".

Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened. Are there any inconsistencies?

You should use the ```model.distance(w1, w2)``` function here in order to compute the cosine distance between two words. I've given a starting example below.

In [13]:
model.distance("happy", "sad") # antonyms
# the antonyms have a smaller cosine distance than the synonyms (this is very counter-intuitive)

0.4645386338233948

In [14]:
model.distance("happy","cheerful") #synonyms

0.6162261962890625

MY TAKE :)

In [15]:
model.distance("virgin", "slut") # antonyms
# here the antonyms have a larger cosine distance than the synonyms --> this is actually intuitive

0.6018030047416687

In [16]:
model.distance("slut","whore") #synonyms

0.32027411460876465

In [17]:
model.distance("white", "black") # antonyms

0.19077849388122559

In [None]:
model.distance("light","white") #synonyms

**Question:** What should the following cell print? Why?

In [None]:
model.distance("happy", "sad") < model.distance("happy","cheerful")

### Task 3: Word analogies

We saw in the lecture that we can use "arithmetic" on word embeddings in order to perform word analogy tasks.

For example:

```man::king as woman::queen```

So we can say that if we take the vector for ```king``` and subtract the vector for ```man```, we're removing the gender component from the ```king```. If we then add ```woman``` to the resulting vector, we should be left with a vector similar to ```queen```.

NB: It might not be _exactly_ the vector for ```queen```, but it should at least be _close_ to it.

```gensim``` has some quirky syntax that allows us to perform this kind of arithmetic.

In [None]:
model.most_similar(positive=['king', 'woman'], 
                   negative=['man'])[0]
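
If you want to see the arithmetic spelled out, here is a rough sketch on a toy vocabulary. The 2-d vectors and the helper `toy_analogy` are invented for this example, and it is only my approximation of what the call above does (add the positive vectors, subtract the negative ones, then rank the remaining words by cosine similarity); the real ```gensim``` implementation may differ in its details.

```python
import numpy as np

# Made-up 2-d vectors: axis 0 is roughly "royalty", axis 1 roughly "gender"
toy = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.7, 0.1]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_analogy(positive, negative, vectors):
    """Return (word, cosine similarity) pairs for the analogy, best first."""
    # add the positive words and subtract the negative ones
    target = np.zeros_like(next(iter(vectors.values())))
    for w in positive:
        target = target + unit(vectors[w])
    for w in negative:
        target = target - unit(vectors[w])
    target = unit(target)
    # score every word that was not part of the query
    scores = [
        (w, float(target @ unit(v)))
        for w, v in vectors.items()
        if w not in positive and w not in negative
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

analogy = toy_analogy(positive=["king", "woman"], negative=["man"], vectors=toy)
print(analogy[0][0])
```

Notice that the query words themselves are left out of the ranking. Without that step, the closest vector to `king - man + woman` would often just be `king` again.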

Try to find at least three analogies which correctly hold - where "correctly" here means that the closest vector corresponds to the word that you as a native speaker think it should.

### Task 3b: Wrong analogies

Can you find any analogies which _should_ hold but don't? Why don't they work? Are there any similarities or trends?

In [None]:
# your code here

### Task 4: Exploring bias

As we spoke briefly about in the lecture, word embeddings tend to display bias of the kind found in the training data.

Using some of the techniques you've worked on above, can you find some clear instances of bias in the word embedding models that you're exploring?

In [None]:
model.most_similar(positive=['doctor', 'woman'], 
                   negative=['man'])

### Task 5: Dimensionality reduction and visualizing

In the following cell, I've written a short bit of code which takes a given subset of words and plots them on a simple scatter plot. Remember that the word embeddings are 300d (or 100d here, depending on which language you're using), so we need to perform some kind of dimensionality reduction on the embeddings to get them down to 2D.

Here, I'm using an algorithm implemented via ```scikit-learn``` called Principal Component Analysis (PCA). PCA is a kind of dimensionality reduction algorithm which takes big vectors and tries to make them smaller while keeping as much information as possible.

(```maths```: An alternative approach might also be to use Singular Value Decomposition or SVD, which works in a similar but ever-so-slightly different way to PCA. You can read more [here](https://jeremykun.com/2016/04/18/singular-value-decomposition-part-1-perspectives-on-linear-algebra/) and [here](https://jonathan-hui.medium.com/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491) - the maths is a bit mind-bending, just FYI.)
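
For the curious, here is a minimal sketch of how the two ideas connect, using only ```numpy```. The data is random and stands in for real embeddings, and the variable names are my own. It is meant as an illustration of the principle rather than a replacement for the ```scikit-learn``` class; for example, the signs of the components may come out flipped compared with ```PCA```.

```python
import numpy as np

rng = np.random.default_rng(0)
# 8 made-up "embeddings" with 50 dimensions each (random, for illustration only)
X = rng.normal(size=(8, 50))

# 1. centre the data so every dimension has mean 0
X_centred = X - X.mean(axis=0)

# 2. decompose the centred data with SVD
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# 3. project onto the first two principal directions (rows of Vt)
X_2d = X_centred @ Vt[:2].T

# share of the total variance captured by each component
explained = S**2 / np.sum(S**2)

print(X_2d.shape)     # (8, 2)
print(explained[:2])  # how much information the 2D plot keeps
```

The `explained` values are worth a look when you answer the question below: if the first two components only capture a small share of the variance, the 2D plot is throwing away most of what the embeddings encode.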

Experiment with plotting certain subsets of words by changing the ```words``` list. 

**Question:** How useful do you find these plots? Do they show anything meaningful?


In [None]:
# the list of words we want to plot
words = ["man", "woman", "doctor", "nurse", "king", "queen", "boy", "girl"]

# an empty list for vectors
X = []
# get vectors for subset of words
for word in words:
    X.append(model[word])

# Use PCA for dimensionality reduction to 2D
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection
plt.scatter(result[:, 0], result[:, 1])

# for each word in the list of words
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))

plt.show()

### Bonus tasks

If you run out of things to explore with these embeddings, try some of the following tasks:

- Take the code above and write a script which loads a word2vec model, takes a list of words as an input and produces the visualisation above.