# Generating incorrect answer suggestions
Using word embeddings we're going to find the most similar words to an answer.

## Importing the word embeddings
Unfortunately our beloved *spacy* does not offer most similar words. We'll use **gensim** for that.

Make sure you download the embeddings file first. The instructions are in the *README* in the **data** folder.

In [1]:
import gensim
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

In [2]:
glove_file = '../data/embeddings/glove.6B.300d.txt'
tmp_file = '../data/embeddings/word2vec-glove.6B.300d.txt'

In [3]:
import os

if not os.path.isfile(glove_file):
    print("Glove embeddings not found. Please download and place them in the following path: " + glove_file)

In [None]:
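from gensim.scripts.glove2word2vec import glove2word2vec

# Convert the GloVe text file to word2vec format (adds the header line gensim expects).
glove2word2vec(glove_file, tmp_file)

# Load the converted vectors.
model = KeyedVectors.load_word2vec_format(tmp_file)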


## Similar words examples

In [None]:
model.most_similar(positive=['koala'], topn=10)

It seems to be working fine. Though what exactly is a probo?

![image.png](https://i.gyazo.com/8e982abd6da0025cb985b388c07507a8.png)

Ok.

At this point we assume that we have our answer, the sentence it's in, the entire text, and the title. Let's explore some words.

*__Oxygen__ is a chemical element with symbol O and atomic number 8.*  

In [None]:
model.most_similar(positive=['oxygen'], topn=10)

That was easy. Let's try something more difficult.

*the oldest portuguese university was first established in **lisbon** before moving to coimbra.*

In [None]:
model.most_similar(positive=['lisbon'], topn=10)

Seems like we are getting closer to *football teams* rather than *cities with old universities*. Let's add some more words from the sentence.

In [None]:
model.most_similar(positive=['lisbon', 'university'], topn=10)

Great! But now the words are getting too close to university. It would be good if we could add more weight to the original answer.

I can do it manually by taking the closest words to the original answer and counting how many of them also occur among the closest words to the joint query.
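
The counting idea above could be sketched roughly like this. It assumes both neighbour lists have the `(word, score)` shape that `model.most_similar` returns; the helper name `rerank_by_answer_overlap` and the toy scores are made up for illustration, and this is only one possible way to favour the original answer.

```python
def rerank_by_answer_overlap(answer_neighbours, joint_neighbours):
    """Order joint candidates so that those also close to the answer come first.

    Both arguments are lists of (word, score) pairs. Hypothetical helper,
    not part of gensim.
    """
    answer_words = {word for word, _ in answer_neighbours}
    # sorted() is stable, so candidates keep their joint ranking inside each group.
    return sorted(
        (word for word, _ in joint_neighbours),
        key=lambda word: word not in answer_words,
    )


# Toy data standing in for real model output (the scores are invented).
answer_neighbours = [("porto", 0.71), ("benfica", 0.65), ("madrid", 0.60)]
joint_neighbours = [("college", 0.70), ("porto", 0.66), ("faculty", 0.64), ("madrid", 0.61)]

reranked = rerank_by_answer_overlap(answer_neighbours, joint_neighbours)
print(reranked)
```

With the real model, the two lists would come from `model.most_similar(positive=['lisbon'])` and `model.most_similar(positive=['lisbon', 'university'])`.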

In [None]:
model.most_similar(positive=['lisbon', 'coimbra'], topn=10)

Using another city really makes a difference and shows some good candidates. I think it'll be a good idea to use words in the sentence that are next to the answer.
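
Picking the words next to the answer could look something like the sketch below. The function name, the window size, and the tiny stop-word list are all assumptions for illustration; a real pipeline would probably use a proper tokenizer and stop-word list.

```python
# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "was", "is", "in", "to", "of", "before", "and"}


def neighbouring_words(sentence, answer, window=2, stop_words=STOP_WORDS):
    """Return up to `window` non-stop-words on each side of `answer`."""
    # Naive whitespace tokenisation with punctuation stripped from the edges.
    tokens = [token.strip(".,;:!?*\"'()").lower() for token in sentence.split()]
    tokens = [token for token in tokens if token and token not in stop_words]
    answer = answer.lower()
    if answer not in tokens:
        return []
    index = tokens.index(answer)
    return tokens[max(0, index - window):index] + tokens[index + 1:index + 1 + window]


sentence = "the oldest portuguese university was first established in lisbon before moving to coimbra."
context = neighbouring_words(sentence, "lisbon", window=2)
print(context)
```

The returned words could then be appended to the `positive` list, after checking that each one is in the model's vocabulary.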

### Words with the same stem

In [None]:
model.most_similar(positive=['write'], topn=10)

We could just remove all similar words that have the same stem as the original answer.

Additionally, the incorrect answers should be the same part of speech as the answer. For example, with **write**, the words *read*, *publish*, and *tell* are good candidates, but *books* could easily be discarded for being a noun.
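
A minimal sketch of the same-stem filter follows. The `crude_stem` function is a deliberately simple stand-in for a real stemmer (for example NLTK's `PorterStemmer`), and both function names are hypothetical. Part-of-speech filtering is not shown here, since it would need a tagger.

```python
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip one common English suffix. A rough stand-in for a real stemmer."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def filter_same_stem(answer, candidates, stem=crude_stem):
    """Drop candidates that share a stem with the answer."""
    answer_stem = stem(answer)
    return [word for word in candidates if stem(word) != answer_stem]


# Made-up candidate list, not actual model output.
candidates = ["writing", "read", "writes", "publish", "writer", "tell"]
filtered = filter_same_stem("write", candidates)
print(filtered)
```

Passing the stemmer in as a parameter makes it easy to swap the crude version for a real one later.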

### Numbers

In [None]:
model.most_similar(positive=['1944'], topn=10)

Not that bad. They seem to gravitate around the events of WW2. It seems better than random numbers or the closest numbers if we need a multiple-choice question. But I think it may be a better question if you have to input the number yourself, and you get a better score the closer you are to the correct answer.
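
Scoring a typed-in number by how close it is could be sketched as follows. The linear falloff and the default `tolerance` of 10 are assumptions chosen for illustration, not something decided in this notebook.

```python
def closeness_score(guess, correct, tolerance=10):
    """Score in [0, 1]: 1.0 for the exact answer, falling linearly to 0.0
    once the guess is `tolerance` or more away from the correct value."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    distance = abs(guess - correct)
    return max(0.0, 1.0 - distance / tolerance)


print(closeness_score(1944, 1944))  # exact answer
print(closeness_score(1939, 1944))  # five years off
print(closeness_score(1900, 1944))  # far outside the tolerance
```

The tolerance would probably need to depend on the question, since being five years off matters more for a recent date than for an ancient one.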

### Names

In [None]:
model.most_similar(positive=['bush'], topn=10)

In [None]:
model.most_similar(positive=['euclid'], topn=10)

In [None]:
model.most_similar(positive=['atanasov'], topn=10)

I expected it to be a lot worse. Names of famous people get us other names of people with the same profession - US presidents and Greek mathematicians come up pretty easily.

But with some lesser-known figures, like a general in a certain battle, it wouldn't work. In those cases it would be good to find other names in the same text or, if we're working with a textbook, to use the names from other topics.
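
The fallback described above might look like this sketch. It assumes the names in the text have already been extracted somehow (for example with a named-entity recognizer); the function name and the example names are placeholders.

```python
def name_distractors(answer, embedding_candidates, names_in_text, count):
    """Collect up to `count` distractors, preferring embedding neighbours and
    falling back to other names found in the same text."""
    answer = answer.lower()
    distractors = []
    for name in list(embedding_candidates) + list(names_in_text):
        if len(distractors) >= count:
            break
        name = name.lower()
        # Skip the answer itself and anything already chosen.
        if name != answer and name not in distractors:
            distractors.append(name)
    return distractors


# Placeholder names; the empty list stands for a name missing from the embeddings.
names_in_text = ["Ivanov", "Petrov", "Atanasov", "Dimitrov"]
fallback = name_distractors("Atanasov", [], names_in_text, count=3)
print(fallback)
```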

# Function

We'll keep it simple. We just need *count* distractors (incorrect answers).

In [None]:
def generate_distractors(answer, count):
    """Return up to `count` distractors: the words closest to `answer` in the embeddings."""
    answer = answer.lower()

    # Extract the closest words to the answer.
    try:
        closest_words = model.most_similar(positive=[answer], topn=count)
    except Exception:
        # The word is not in the vocabulary, or the embeddings failed to load.
        return []

    # most_similar returns (word, similarity) pairs; keep only the words.
    distractors = [word for word, _ in closest_words]

    return distractors

In [None]:
generate_distractors('oxygen', 4)

In [None]:
generate_distractors('bulgaria', 6)

In [None]:
generate_distractors('bulgaria', 15)