# Semantic Textual Similarity

In this assignment, you will work on the [Semantic Textual Similarity](https://alt.qcri.org/semeval2017/task1/) (STS) shared task organized as part of the 2017 edition of SemEval. The goal of the task is to measure the meaning similarity of sentence pairs in English, Spanish and Arabic, as well as cross-lingual pairings of English with sentences in Arabic, Spanish and Turkish. In particular, you will focus on solving STS for English sentence pairs. The task involves producing real-valued similarity scores ranging from 0 for completely dissimilar sentences to 5 for completely equivalent sentences, as can be seen in the following examples:

| Sentence 1                                              | Sentence 2                                            | Score |
| :---                                                    | :----                                                 | ---:  |
| The bird is bathing in the sink.                        | Birdie is washing itself in the water basin.          |   5   |
| John said he is considered a witness but not a suspect. | “He is not a suspect anymore.” John said.             |   3   |
| The black dog is running through the snow.              | A race car driver is driving his car through the mud. |   0   |


You will try two different approaches to solve STS: Cosine Similarity and Word Mover's Distance. You will work with [gensim](https://radimrehurek.com/gensim/), a **Python** library that provides functionality to obtain text representations based on word embeddings and calculate their semantic similarity. Specifically, you will use the following objects and functions:

In [21]:
import spacy
import numpy as np
import pandas as pd
import gensim.downloader as api
from sklearn.preprocessing import minmax_scale
from scipy.stats import pearsonr

The data for the assignment consists of 250 sentence pairs with their corresponding similarity scores. The dataset can be loaded into a `DataFrame` as follows:

In [22]:
data = pd.read_csv("data.tsv", sep="\t")
data

Unnamed: 0,sent1,sent2,gold
0,A person is on a baseball team.,A person is playing basketball on a team.,2.4
1,Our current vehicles will be in museums when e...,The car needs to some work,0.2
2,A woman supervisor is instructing the male wor...,A woman is working as a nurse.,1.0
3,A bike is next to a couple women.,A child next to a bike.,2.0
4,The group is eating while taking in a breathta...,A group of people take a look at an unusual tree.,2.2
...,...,...,...
245,A brown dog is jumping.,A brown dog is jumping,5.0
246,the man is catching a ball,A man is kicking a ball.,3.0
247,Two men are sitting in the room.,Two men are standing in a room.,3.0
248,A group of teenagers in red shirts are smiling.,A group of people are wearing orange shirts.,1.2


## Data Pre-processing

In order to apply the methods provided by the **gensim** library, you first need to pre-process and convert the text into a proper format. In this exercise, you will have to tokenize and lemmatize the input sentences. For this, you will work with the [English pipeline optimized for CPU](https://spacy.io/models/en#en_core_web_sm) that can be loaded as follows:

In [23]:
nlp = spacy.load("en_core_web_sm")

You must complete the code for the `process_sentences` function, which takes as input a `list` of sentences and a **spaCy** pipeline. The function should tokenize each sentence in the list, obtain the lemma of each token, and filter out stop words and punctuation.
Check the [spaCy 101](https://spacy.io/usage/spacy-101) and the [Token](https://spacy.io/api/token) documentation to learn how to apply the pipeline for this exercise.
The `process_sentences` function must return a `list` where each item corresponds to a sentence in the form of a `list` of filtered lemmas as in the following examples:

> <pre>
*** Processed 1st Sentences ***
[['person', 'baseball', 'team'], ['current', 'vehicle', 'museum', 'aircraft'], ['woman', 'supervisor', 'instruct', 'male', 'worker'], ['bike', 'couple', 'woman'], ['group', 'eat', 'take', 'breathtaking', 'view'], ['boy', 'raise', 'hand'], ['man', 'gray', 'beard', 'shave', 'lecture', 'hall'], ['sky', 'little', 'cloud'], ['young', 'boy', 'jump', 'barefoot', 'outside', 'yard'], ['dog', 'forest']]
>
> *** Processed 2nd Sentences ***
[['person', 'play', 'basketball', 'team'], ['car', 'need', 'work'], ['woman', 'work', 'nurse'], ['child', 'bike'], ['group', 'people', 'look', 'unusual', 'tree'], ['man', 'raise', 'hand'], ['man', 'beard', 'sit', 'grass'], ['Lady', 'ready', 'Rock', 'Climbing', 'watch', 'Clouds'], ['teen', 'ride', 'bike', 'people', 'walk', 'courtyard'], ['dog', 'forest']]
</pre>
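The filtering rule described above can be tried out without loading spaCy. The sketch below is only an illustration: `FakeToken` is a hypothetical stand-in that mimics the three `Token` attributes used in this exercise (`lemma_`, `is_stop`, `is_punct`), and the flag values are assigned by hand rather than produced by the real pipeline.

```python
from typing import NamedTuple


class FakeToken(NamedTuple):
    """Hypothetical stand-in for a spaCy Token (hand-assigned attributes)."""
    lemma_: str
    is_stop: bool
    is_punct: bool


def keep_lemmas(tokens):
    """Keep the lemma of every token that is neither a stop word nor punctuation."""
    return [t.lemma_ for t in tokens if not t.is_stop and not t.is_punct]


# "The bird is bathing in the sink." with made-up lemmas and flags.
sentence = [
    FakeToken("the", True, False),
    FakeToken("bird", False, False),
    FakeToken("be", True, False),
    FakeToken("bathe", False, False),
    FakeToken("in", True, False),
    FakeToken("the", True, False),
    FakeToken("sink", False, False),
    FakeToken(".", False, True),
]

print(keep_lemmas(sentence))
```

With a real pipeline, the same list comprehension would run over the tokens of `nlp(sentence)`. A sentence made up only of stop words and punctuation would produce an empty list, which is worth keeping in mind for the later steps.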


In [24]:
def process_sentences(sentences, nlp):
    """Return one list of filtered lemmas per input sentence."""
    processed = []

    for sentence in sentences:
        doc = nlp(sentence)
        # Keep the lemma of every token that is neither a stop word nor punctuation.
        lemmas = [token.lemma_ for token in doc
                  if not token.is_stop and not token.is_punct]
        processed.append(lemmas)

    return processed

The `process_sentences` function can now be applied to the list of 1st sentences and the list of 2nd sentences of the pairs, respectively. The resulting `sent1_processed` and `sent2_processed` will have the same size, and items with the same index will correspond to sentences from the same pair, e.g., `sent1_processed[1]` and `sent2_processed[1]` contain the 1st and 2nd sentences of the same pair.

In [25]:
sent1_processed = process_sentences(data["sent1"].values, nlp)
sent2_processed = process_sentences(data["sent2"].values, nlp)
print("*** Processed 1st Sentences ***")
print(sent1_processed[:10])
print("")
print("*** Processed 2nd Sentences ***")
print(sent2_processed[:10])
print("")

*** Processed 1st Sentences ***
[['person', 'baseball', 'team'], ['current', 'vehicle', 'museum', 'aircraft'], ['woman', 'supervisor', 'instruct', 'male', 'worker'], ['bike', 'couple', 'woman'], ['group', 'eat', 'take', 'breathtaking', 'view'], ['boy', 'raise', 'hand'], ['man', 'gray', 'beard', 'shave', 'lecture', 'hall'], ['sky', 'little', 'cloud'], ['young', 'boy', 'jump', 'barefoot', 'outside', 'yard'], ['dog', 'forest']]

*** Processed 2nd Sentences ***
[['person', 'play', 'basketball', 'team'], ['car', 'need', 'work'], ['woman', 'work', 'nurse'], ['child', 'bike'], ['group', 'people', 'look', 'unusual', 'tree'], ['man', 'raise', 'hand'], ['man', 'beard', 'sit', 'grass'], ['Lady', 'ready', 'Rock', 'Climbing', 'watch', 'Clouds'], ['teen', 'ride', 'bike', 'people', 'walk', 'courtyard'], ['dog', 'forest']]



## Cosine Similarity

The first technique you will apply consists of calculating the similarity of two sentences by computing the cosine of the angle between their sentence embeddings, which can be obtained by, for example, averaging the embeddings of the words in each sentence. The [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) module of **gensim** is a straightforward way to obtain and operate with word embeddings. The library provides access to different [word embedding models](https://github.com/RaRe-Technologies/gensim-data). In this assignment, you will work with the 300-dimensional version of the GloVe embeddings pre-trained on the union of *Wikipedia* and the *GigaWord* corpus. Therefore, your first step is to download and load them.

> **Note!** Loading a large word embedding model may take some time. Besides, the first time the **gensim** library loads a model it has to download it first, which makes the process take even longer. For debugging, you can replace the large model `glove-wiki-gigaword-300` with a smaller one like `glove-twitter-25`.

You must complete the code for the `load_model` function. The function takes the name of the word embedding model you want to obtain, and it should download, load and return that model. For this, you should use the [downloader](https://radimrehurek.com/gensim/downloader.html) API. The function should return a [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) structure that contains embeddings of 300 dimensions for a vocabulary of size 400000.

> Model shape: (400000, 300) 

In [26]:
def load_model(model_name):

    model = api.load(model_name)

    return model

In [27]:
model = load_model("glove-wiki-gigaword-300")
print("Model shape: (%s, %s)" % model.vectors.shape)

Model shape: (400000, 300)


Once the word embedding model has been loaded, it can be used to obtain a vector representation of the sentences and calculate their cosine similarity.
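To make the idea concrete, here is a minimal pure-Python sketch of "average the word vectors, then take the cosine". The 3-dimensional `toy_vectors` are invented for illustration and the helper names `mean_vector` and `cosine` are hypothetical; they are not part of **gensim**, whose own methods may apply additional normalization.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this illustration.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "run": [0.1, 0.9, 0.2],
    "car": [0.0, 0.2, 0.9],
}


def mean_vector(words, vectors):
    """Average the vectors of the words that are in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no in-vocabulary words to average")
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


close = cosine(mean_vector(["dog", "run"], toy_vectors),
               mean_vector(["puppy", "run"], toy_vectors))
far = cosine(mean_vector(["dog", "run"], toy_vectors),
             mean_vector(["car"], toy_vectors))
print(close > far)  # the pair with related words gets the higher score
```

Because the average ignores word order, two sentences with the same lemmas in a different order receive the same sentence embedding.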

You must complete the code for the `cosine_similarity` function. The function takes as input the list of 1st sentences, the list of 2nd sentences and the word embedding model obtained by the `load_model` function. The `cosine_similarity` function must obtain an embedding for each sentence by averaging the word embeddings of its lemmas, and calculate the cosine similarity for each pair of sentences. Check the [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) documentation to learn which methods return the mean vector of a list of words and compute the cosine similarities between vectors. The function must return a `list` of real-valued scores where each item corresponds to a sentence pair. This list can then be stored in the `prediction` column of `data`:


|     | sent1                                                                                          | sent2                                                                                                 |   gold |   prediction |
|----:|:-----------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------|-------:|-------------:|
|   0 | A person is on a baseball team.                                                                | A person is playing basketball on a team.                                                             |    2.4 |     0.902706 |
|   1 | Our current vehicles will be in museums when everyone has their own aircraft.                  | The car needs to some work                                                                            |    0.2 |     0.597748 |
|   2 | A woman supervisor is instructing the male workers.                                            | A woman is working as a nurse.                                                                        |    1   |     0.761507 |
|   3 | A bike is next to a couple women.                                                              | A child next to a bike.                                                                               |    2   |     0.761056 |
|   4 | The group is eating while taking in a breathtaking view.                                       | A group of people take a look at an unusual tree.                                                     |    2.2 |     0.767762 |
| ... | ... | ... | ... | ... |
| 245 | A brown dog is jumping.                                                                        | A brown dog is jumping                                                                                |    5   |     1.000000 |
| 246 | the man is catching a ball                                                                     | A man is kicking a ball.                                                                              |    3   |     0.874216 |
| 247 | Two men are sitting in the room.                                                               | Two men are standing in a room.                                                                       |    3   |     0.916421 |
| 248 | A group of teenagers in red shirts are smiling.                                                | A group of people are wearing orange shirts.                                                          |    1.2 |     0.769548 |
| 249 | The woman is waiting for her date.                                                             | The woman is on her way to a date.                                                                    |    3.2 |     0.880878 |

In [63]:
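def cosine_similarity(sent1_processed, sent2_processed, model):
    """Return the cosine similarity of the mean word vectors of each sentence pair."""
    similarities = []

    for lemmas1, lemmas2 in zip(sent1_processed, sent2_processed):
        # Sentence embedding: mean of the word vectors of the lemmas.
        vec1 = model.get_mean_vector(lemmas1)
        vec2 = model.get_mean_vector(lemmas2)

        # cosine_similarities compares one vector against a collection of vectors,
        # so wrap vec2 in a list and take the single result.
        cosine_sim = model.cosine_similarities(vec1, [vec2])[0]
        similarities.append(cosine_sim)

    return similarities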

In [64]:
data["prediction"] = cosine_similarity(sent1_processed, sent2_processed, model)
data

Unnamed: 0,sent1,sent2,gold,prediction
0,A person is on a baseball team.,A person is playing basketball on a team.,2.4,0.902706
1,Our current vehicles will be in museums when e...,The car needs to some work,0.2,0.597748
2,A woman supervisor is instructing the male wor...,A woman is working as a nurse.,1.0,0.761507
3,A bike is next to a couple women.,A child next to a bike.,2.0,0.761056
4,The group is eating while taking in a breathta...,A group of people take a look at an unusual tree.,2.2,0.767762
...,...,...,...,...
245,A brown dog is jumping.,A brown dog is jumping,5.0,1.000000
246,the man is catching a ball,A man is kicking a ball.,3.0,0.874216
247,Two men are sitting in the room.,Two men are standing in a room.,3.0,0.916421
248,A group of teenagers in red shirts are smiling.,A group of people are wearing orange shirts.,1.2,0.769548


Cosine similarity lies between -1 and 1 in general, and the values shown above all fall between 0 and 1, while the manually annotated scores range from 0 to 5. Although not strictly necessary, scaling the predicted similarity scores to the range used by the annotators can help visualize and interpret the results.
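As a rough illustration of what min-max scaling does, the sketch below maps the smallest observed value to 0 and the largest to 5, with everything else placed linearly in between. The helper `min_max_scale` is a hypothetical pure-Python version written for this example; it is not the **sklearn** implementation, and its handling of constant input is an assumption.

```python
def min_max_scale(values, low=0.0, high=5.0):
    """Linearly map values so that min(values) -> low and max(values) -> high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Assumption for this sketch: constant input is mapped to the lower bound.
        return [low for _ in values]
    span = vmax - vmin
    return [round(low + (v - vmin) * (high - low) / span, 1) for v in values]


toy_scores = [0.5, 0.75, 1.0]
print(min_max_scale(toy_scores))
```

Note that the result depends on the minimum and maximum of the values being scaled: the lowest-scoring pair always ends up at 0, even if its raw cosine similarity is well above 0.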

You must complete the code for the `scale_similarities` function. The function takes a list of real values and should scale them to the `[0, 5]` range. The function should also round the resulting values to one decimal place. You can use the [minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html) function of the **sklearn** library and the [round](https://numpy.org/doc/1.13/reference/generated/numpy.round_.html) function of **numpy**. The `scale_similarities` function must return the `list` of scaled real-valued scores, which can replace the values in the `prediction` column of `data`:


|     | sent1                                                                                          | sent2                                                                                                 |   gold |   prediction |
|----:|:-----------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------|-------:|-------------:|
|   0 | A person is on a baseball team.                                                                | A person is playing basketball on a team.                                                             |    2.4 |          4.4 |
|   1 | Our current vehicles will be in museums when everyone has their own aircraft.                  | The car needs to some work                                                                            |    0.2 |          2.5 |
|   2 | A woman supervisor is instructing the male workers.                                            | A woman is working as a nurse.                                                                        |    1   |          3.5 |
|   3 | A bike is next to a couple women.                                                              | A child next to a bike.                                                                               |    2   |          3.5 |
|   4 | The group is eating while taking in a breathtaking view.                                       | A group of people take a look at an unusual tree.                                                     |    2.2 |          3.6 |
| ... | ... | ... | ... | ... |
| 245 | A brown dog is jumping.                                                                        | A brown dog is jumping                                                                                |    5   |          5   |
| 246 | the man is catching a ball                                                                     | A man is kicking a ball.                                                                              |    3   |          4.2 |
| 247 | Two men are sitting in the room.                                                               | Two men are standing in a room.                                                                       |    3   |          4.5 |
| 248 | A group of teenagers in red shirts are smiling.                                                | A group of people are wearing orange shirts.                                                          |    1.2 |          3.6 |
| 249 | The woman is waiting for her date.                                                             | The woman is on her way to a date.                                                                    |    3.2 |          4.3 |

In [65]:
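def scale_similarities(similarities):
    """Scale the scores to the [0, 5] range and round them to one decimal place."""
    scaled_values = minmax_scale(similarities, feature_range=(0, 5))
    rounded_values = np.round(scaled_values, 1)

    # Convert the NumPy array to a plain list, as the exercise asks for a list.
    return rounded_values.tolist()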

In [66]:
data["prediction"] = scale_similarities(data["prediction"].values)
data

Unnamed: 0,sent1,sent2,gold,prediction
0,A person is on a baseball team.,A person is playing basketball on a team.,2.4,4.4
1,Our current vehicles will be in museums when e...,The car needs to some work,0.2,2.5
2,A woman supervisor is instructing the male wor...,A woman is working as a nurse.,1.0,3.5
3,A bike is next to a couple women.,A child next to a bike.,2.0,3.5
4,The group is eating while taking in a breathta...,A group of people take a look at an unusual tree.,2.2,3.6
...,...,...,...,...
245,A brown dog is jumping.,A brown dog is jumping,5.0,5.0
246,the man is catching a ball,A man is kicking a ball.,3.0,4.2
247,Two men are sitting in the room.,Two men are standing in a room.,3.0,4.5
248,A group of teenagers in red shirts are smiling.,A group of people are wearing orange shirts.,1.2,3.6


The STS task can be evaluated using Pearson's correlation, which measures the linear relationship between the predicted scores and the manual annotations. The predictions obtained with cosine similarity should result in the following correlation:


> Pearson correlation: 0.688

In [69]:
def evaluate(data):
    score = pearsonr(data["gold"], data["prediction"])[0]
    print("Pearson correlation: %.3f" % score)

In [70]:
evaluate(data)

Pearson correlation: 0.688
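For reference, Pearson's correlation can be written out in a few lines. The sketch below uses a hypothetical pure-Python helper `pearson` (not the **scipy** function) and only the first five gold scores and unscaled cosine similarities from the tables above, so its value is not comparable to the correlation over all 250 pairs. It also checks that a linear rescaling of the predictions leaves the correlation unchanged.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


# First five rows of the tables above (gold score, unscaled cosine similarity).
gold = [2.4, 0.2, 1.0, 2.0, 2.2]
raw = [0.902706, 0.597748, 0.761507, 0.761056, 0.767762]

# Min-max scaling to [0, 5] without rounding is a linear transformation.
lo, hi = min(raw), max(raw)
scaled = [5 * (r - lo) / (hi - lo) for r in raw]

print(abs(pearson(gold, raw) - pearson(gold, scaled)) < 1e-9)
```

This is why the scaling step is described as not strictly necessary: apart from the small effect of rounding to one decimal place, it does not change the evaluation score.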


## Word Mover's Distance

Applying cosine similarity to sentences requires selecting some strategy to aggregate the embeddings of the words, be it the average, sum or other. Alternatively, this constraint can be avoided by applying techniques that work directly on the word embeddings such as Word Mover's Distance. In the last part of this assignment, you will try the Word Mover's Distance using **gensim**.

> **Note!** *Similarity* refers to how related two vectors are while *distance* refers to how far apart they are. Therefore, the values returned by Word Mover's Distance must be updated accordingly to reflect *similarity*.
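The intuition behind Word Mover's Distance can be sketched on a toy case. The code below is a simplified illustration, not the **gensim** algorithm: it assumes both sentences have the same number of words with equal weights, in which case the cheapest way to "move" one sentence onto the other is a one-to-one matching that can be found by trying all permutations. The 2-dimensional `toy_vectors`, the Euclidean ground cost, and the helper names `toy_wmd` and `to_similarity` are all assumptions made for this example.

```python
import itertools
import math

# Hypothetical 2-dimensional embeddings, invented for this illustration.
toy_vectors = {
    "dog": [0.9, 0.1],
    "puppy": [0.8, 0.2],
    "run": [0.1, 0.9],
    "jump": [0.2, 0.8],
}


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_wmd(words1, words2, vectors):
    """Cheapest average word-to-word travel cost over all one-to-one matchings."""
    if len(words1) != len(words2):
        raise ValueError("this simplified sketch needs sentences of equal length")
    n = len(words1)
    best = math.inf
    for matching in itertools.permutations(words2):
        cost = sum(euclidean(vectors[a], vectors[b])
                   for a, b in zip(words1, matching)) / n
        best = min(best, cost)
    return best


def to_similarity(distance):
    """Map a distance in [0, inf] to a similarity in [0, 1]; smaller distance, higher score."""
    return 0.0 if math.isinf(distance) else 1.0 / (1.0 + distance)


d = toy_wmd(["dog", "run"], ["jump", "puppy"], toy_vectors)
print(to_similarity(d))
```

Here "dog" is matched with "puppy" and "run" with "jump" regardless of word order, and no sentence-level averaging is needed. The mapping `1 / (1 + d)` is one possible conversion among several; since it is not linear, the choice of conversion can affect the Pearson correlation.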

You must complete the code for the `word_movers_similarity` function. The function takes as input the list of 1st sentences, the list of 2nd sentences and the word embedding model obtained by the `load_model` function. The `word_movers_similarity` function must compute the Word Mover's Distance between each pair of sentences. Check the [Word Mover's Distance](https://radimrehurek.com/gensim/auto_examples/tutorials/run_wmd.html) and the [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) documentation to learn which method runs this algorithm on two lists of words. The function should convert the distance values into similarities and return them as a `list` of real-valued scores where each item corresponds to a sentence pair. This list can now be used as the new values of the `prediction` column of `data`:

|     | sent1                                                                                          | sent2                                                                                                 |   gold |   prediction |
|----:|:-----------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------|-------:|-------------:|
|   0 | A person is on a baseball team.                                                                | A person is playing basketball on a team.                                                             |    2.4 |          2.2 |
|   1 | Our current vehicles will be in museums when everyone has their own aircraft.                  | The car needs to some work                                                                            |    0.2 |          0.5 |
|   2 | A woman supervisor is instructing the male workers.                                            | A woman is working as a nurse.                                                                        |    1   |          0.8 |
|   3 | A bike is next to a couple women.                                                              | A child next to a bike.                                                                               |    2   |          1.4 |
|   4 | The group is eating while taking in a breathtaking view.                                       | A group of people take a look at an unusual tree.                                                     |    2.2 |          0.9 |
|  ... | ... | ... | ... | ... |
| 245 | A brown dog is jumping.                                                                        | A brown dog is jumping                                                                                |    5   |          5   |
| 246 | the man is catching a ball                                                                     | A man is kicking a ball.                                                                              |    3   |          2.6 |
| 247 | Two men are sitting in the room.                                                               | Two men are standing in a room.                                                                       |    3   |          2.9 |
| 248 | A group of teenagers in red shirts are smiling.                                                | A group of people are wearing orange shirts.                                                          |    1.2 |          1.4 |
| 249 | The woman is waiting for her date.                                                             | The woman is on her way to a date.                                                                    |    3.2 |          2.7 |

In [80]:
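def word_movers_similarity(sent1_processed, sent2_processed, model):
    """Return a similarity in [0, 1] derived from the Word Mover's Distance of each pair."""
    wm_similarity = []

    for lemmas1, lemmas2 in zip(sent1_processed, sent2_processed):
        dist = model.wmdistance(lemmas1, lemmas2)

        # An infinite distance is mapped to the lowest similarity; every other
        # distance d >= 0 is mapped to 1 / (1 + d), so identical sentences score 1.
        if np.isinf(dist):
            similarity = 0.0
        else:
            similarity = 1 / (1 + dist)

        wm_similarity.append(similarity)

    return wm_similarity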

In [81]:
data["prediction"] = word_movers_similarity(sent1_processed, sent2_processed, model)
data["prediction"] = scale_similarities(data["prediction"].values)
data

Unnamed: 0,sent1,sent2,gold,prediction
0,A person is on a baseball team.,A person is playing basketball on a team.,2.4,2.2
1,Our current vehicles will be in museums when e...,The car needs to some work,0.2,0.5
2,A woman supervisor is instructing the male wor...,A woman is working as a nurse.,1.0,0.8
3,A bike is next to a couple women.,A child next to a bike.,2.0,1.4
4,The group is eating while taking in a breathta...,A group of people take a look at an unusual tree.,2.2,0.9
...,...,...,...,...
245,A brown dog is jumping.,A brown dog is jumping,5.0,5.0
246,the man is catching a ball,A man is kicking a ball.,3.0,2.6
247,Two men are sitting in the room.,Two men are standing in a room.,3.0,2.9
248,A group of teenagers in red shirts are smiling.,A group of people are wearing orange shirts.,1.2,1.4


On this dataset, Word Mover's Distance should provide an improvement over cosine similarity.

> Pearson correlation: 0.739

In [82]:
evaluate(data)

Pearson correlation: 0.739
