<div class="alert alert-block alert-info">
    <h1>Natural Language Processing</h1>
    <h3>General Information:</h3>
    <p>Please do not add or delete any cells. Answers belong into the corresponding cells (below the question). If a function is given (either as a signature or a full function), you should not change the name, arguments or return value of the function.<br><br> If you encounter empty cells underneath the answer that can not be edited, please ignore them, they are for testing purposes.<br><br>When editing an assignment there can be the case that there are variables in the kernel. To make sure your assignment works, please restart the kernel and run all cells before submitting (e.g. via <i>Kernel -> Restart & Run All</i>).</p>
    <p>Code cells where you are supposed to give your answer often include the line  ```raise NotImplementedError```. This makes it easier to automatically grade answers. If you edit the cell please outcomment or delete this line.</p>
    <h3>Submission:</h3>
    <p>Please submit your notebook via the web interface (in the main view -> Assignments -> Submit). The assignments are due on <b>Wednesday at 15:00</b>.</p>
    <h3>Group Work:</h3>
    <p>You are allowed to work in groups of up to two people. Please enter the UID (your username here) of each member of the group into the next cell. We apply plagiarism checking, so do not submit solutions from other people except your team members. If an assignment has a copied solution, the task will be graded with 0 points for all people with the same solution.</p>
    <h3>Questions about the Assignment:</h3>
    <p>If you have questions about the assignment please post them in the LEA forum before the deadline. Don't wait until the last day to post questions.</p>
    
</div>

In [1]:
'''
Group Work:
Enter the UID (i.e. student2s) of each team member into the variables. 
If you work alone please leave the second variable empty, or extend the list if necessary.
'''
member1 = 'Syed Mushrraf Ali (sali2s, 9040658)'
member2 = 'Shalaka Satheesh (ssathe2s, 9040760)'


# Word2Vec and FastText Embeddings

In this assignment we will work on Word2Vec embeddings and FastText embeddings.

I prepared three dictionaries for you:

- ```word2vec_yelp_vectors.pkl```: A dictionary with 300 dimensional word2vec embeddings trained on the Google News Corpus, contains only words that are present in our Yelp reviews (key is the word, value is the embedding)
- ```fasttext_yelp_vectors.pkl```: A dictionary with 300 dimensional FastText embeddings trained on the English version of Wikipedia, contains only words that are present in our Yelp reviews (key is the word, value is the embedding)
- ```tfidf_yelp_vectors.pkl```: A dictionary with 400 dimensional TfIdf embeddings trained on the Yelp training dataset from last assignment (key is the word, value is the embedding)

In the next cell we load those into the dictionaries ```w2v_vectors```, ```ft_vectors``` and ```tfidf_vectors```.

In [2]:
import pickle

with open('/srv/shares/NLP/word2vec_yelp_vectors.pkl', 'rb') as f:
    w2v_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/fasttext_yelp_vectors.pkl', 'rb') as f:
    ft_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/tfidf_yelp_vectors.pkl', 'rb') as f:
    tfidf_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/reviews_train.pkl', 'rb') as f:
    train = pickle.load(f)
    
with open('/srv/shares/NLP/reviews_test.pkl', 'rb') as f:
    test = pickle.load(f)
    
reviews = train + test

## Creating a vector model with helper functions [30 points]

In the next cell we have the class ```VectorModel``` with the methods:

- ```vector_size```: Returns the vector size of the model
- ```embed```: Returns the embedding for a word. Returns None if there is no embedding present for the word
- ```cosine_similarity```: Calculates the cosine similarity between two vectors
- ```most_similar```: Given a word returns the ```top_n``` most similar words from the model, together with the similarity value, **sorted by similarity (descending)**.
- ```most_similar_vec```: Given a vector returns the ```top_n``` most similar words from the model, together with the similarity value, **sorted by similarity (descending)**.

Your task is to complete these methods.

Example output:
```
model = VectorModel(w2v_vectors)

vector_good = model.embed('good')
vector_tomato = model.embed('tomato')

print(model.cosine_similarity(vector_good, vector_tomato)) # Prints: 0.05318105

print(model.most_similar('tomato')) 
'''
[('tomatoes', 0.8442263), 
 ('lettuce', 0.70699364),
 ('strawberry', 0.6888598), 
 ('strawberries', 0.68325955), 
 ('potato', 0.67841727)]
'''

print(model.most_similar_vec(vector_good)) 
'''
[('good', 1.0), 
 ('great', 0.72915095), 
 ('bad', 0.7190051), 
 ('decent', 0.6837349), 
 ('nice', 0.68360925)]
'''

```

In [3]:
from typing import List, Tuple, Dict
import numpy as np

   
class VectorModel:
    
    def __init__(self, vector_dict: Dict[str, np.ndarray]):
         self.vector_dict = vector_dict
        
    def embed(self, word: str) -> np.ndarray:
        try:
            embedding = self.vector_dict[word]
            return embedding
        except:
            return None
    
    def vector_size(self) -> int:
        return np.size(list(self.vector_dict.values())[0])
    
    def cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1, ord=2) * np.linalg.norm(vec2, ord=2))
        return similarity

    def most_similar(self, word: str, top_n: int=5) -> List[Tuple[str, float]]:
        
        cosines = {key: self.cosine_similarity(self.embed(word), neighbour) for key, neighbour in self.vector_dict.items()}
        cosines_sorted = dict(sorted(cosines.items(), key=lambda item: item[1], reverse=True))

        most_similar = []
        
        for i in range(top_n):
            most_similar.append((list(cosines_sorted.keys())[1], list(cosines_sorted.values())[1]))
            del cosines_sorted[list(cosines_sorted.keys())[1]]
        
        return most_similar
    
    def most_similar_vec(self, vec: np.ndarray, top_n: int=5) -> List[Tuple[str, float]]:
        
        cosines = {key: self.cosine_similarity(vec, neighbour) for key, neighbour in self.vector_dict.items()}
        cosines_sorted = dict(sorted(cosines.items(), key=lambda item: item[1], reverse=True))
    
        most_similar = []
        
        for i in range(top_n):
            most_similar.append((list(cosines_sorted.keys())[0], list(cosines_sorted.values())[0]))
            del cosines_sorted[list(cosines_sorted.keys())[0]]
        
        return most_similar
    
    def embedding_used(self) -> str:
        '''
        Function to return the name of the embedding used
        '''
        if self.vector_dict == w2v_vectors:
            return 'word2vec'
        elif self.vector_dict == ft_vectors:
            return 'fasttext'
        elif self.vector_dict == tfidf_vectors:
            return 'TFIDF'
        

# model = VectorModel(w2v_vectors) 
# model.vector_size()

# vector_good = model.embed('good')
# vector_tomato = model.embed('tomato')

# print(model.cosine_similarity(vector_good, vector_tomato)) 
# print(model.most_similar('tomato')) 
# print(model.most_similar_vec(vector_good)) 


In [4]:
# This is a test cell, please ignore it


## Investigating similarity A) [10 points]

We now want to find the most similar words for a given input word for each model (Word2Vec, FastText and TfIdf).

Your input words are: ```['good', 'tomato', 'restaurant', 'beer', 'wonderful']```.

For each model and input word print the top three most similar words.

In [5]:
input_words = ['good', 'tomato', 'restaurant', 'beer', 'wonderful', 'dinner']

def get_similar_words(model: VectorModel, input_words: List, top_n: int):
    print("Top", top_n, "words using", model.embedding_used(), "are:")
    print()
    for word in input_words:
        print('For', word+':', model.most_similar(word, top_n))
        print()
    
model_1 = VectorModel(w2v_vectors)
model_2 = VectorModel(ft_vectors)
model_3 = VectorModel(tfidf_vectors)

get_similar_words(model_1, input_words, 3)
print("="*80)
get_similar_words(model_2, input_words, 3)
print("="*80)
get_similar_words(model_3, input_words, 3)

Top 3 words using word2vec are:

For good: [('great', 0.72915095), ('bad', 0.7190051), ('decent', 0.6837349)]

For tomato: [('tomatoes', 0.8442263), ('lettuce', 0.70699364), ('strawberry', 0.6888598)]

For restaurant: [('restaurants', 0.7722894), ('diner', 0.7280216), ('steakhouse', 0.7269855)]

For beer: [('beers', 0.8409688), ('drinks', 0.66893125), ('ale', 0.63828725)]

For wonderful: [('fantastic', 0.8047919), ('great', 0.76478696), ('fabulous', 0.7614761)]

For dinner: [('dinners', 0.7902064), ('brunch', 0.7900513), ('breakfast', 0.7007028)]

Top 3 words using fasttext are:

For good: [('excellent', 0.7223856825801253), ('decent', 0.7202461451724537), ('bad', 0.6704173041669614)]

For tomato: [('eggplant', 0.7518509618329048), ('spinach', 0.7422800959168397), ('onions', 0.7328857483500282)]

For restaurant: [('restaurants', 0.8384667264823358), ('bistro', 0.7845601578005464), ('bakery', 0.7155727705943096)]

For beer: [('beers', 0.7944971406865431), ('brewed', 0.7929903321082488),

## Investigating similarity B) [10 points]

Comment on the output from the previous task. Let us look at the output for the word ```wonderful```. How do the models differ for this word? Can you reason why the TfIdf model shows so different results?

fasttext and word2vec perform similarly. fasttext is just an extension of word2vec.

## Investigating similarity C) [10 points]

Instead of just finding the most similar word to a single word, we can also find the most similar word given a list of positive and negative words.

For this we just sum up the positive and negative words into a single vector by calculating a weighted mean. For this we multiply each positive word with a factor of $+1$ and each negative word with a factor of $-1$. Then we get the most similar words to that vector.

You are given the following examples:

```
inputs = [
    {
        'positive': ['good', 'wonderful'],
        'negative': ['bad']
    },
    {
        'positive': ['tomato', 'lettuce'],
        'negative': ['strawberry', 'salad']
    }    
]
```

In [6]:
inputs = [
    {
        'positive': ['good', 'wonderful'],
        'negative': ['bad']
    },
    {
        'positive': ['tomato', 'lettuce'],
        'negative': ['strawberry', 'fruit']
    },
    {
        'positive': ['ceasar', 'chicken'],
        'negative': []
    }    
]

def similarity_given_posneg(model, inputs: List, top_n: int) -> List:
    inputs_vec_values = []
    for posneg in inputs:
        posneg_vec_values = {}
        for key, words in posneg.items():
            embeddings = []
            for word in words:
                embeddings.append(model.embed(word))
            if key == 'positive':
                posneg_vec_values[key] = embeddings
            elif key == 'negative':
                try:
                    posneg_vec_values[key] = (embeddings * -1)
                # ELEMENT WISE MULTIPLICATION ???
#                 posneg_vec_values[key] = np.dot(embeddings, -1)
                except:
                    posneg_vec_values[key] = embeddings

        inputs_vec_values.append(posneg_vec_values)
    
    summed_vec = []
    for i, posneg in enumerate(inputs_vec_values):
        summed_vec_inter = []
        for key, words in posneg.items():
            for word in words:
                if word is not None:
                    # if summed_vec_inter != []:
                    try:
                        summed_vec_inter = summed_vec_inter + word
                    except:
                        summed_vec_inter = word

        summed_vec.append(summed_vec_inter)
    
    print("Top", top_n, "words using", model.embedding_used(), "are:")
    print()
    for i, vector in enumerate(summed_vec):
        print('For input', str(i)+':', model.most_similar_vec(vector, top_n))
        print()


model_1 = VectorModel(w2v_vectors)
model_2 = VectorModel(ft_vectors)
model_3 = VectorModel(tfidf_vectors)

similarity_given_posneg(model_1, inputs, 5)
print("="*80)
similarity_given_posneg(model_2, inputs, 5)
print("="*80)
similarity_given_posneg(model_3, inputs, 5)

Top 5 words using word2vec are:

For input 0: [('wonderful', 0.9038065), ('good', 0.86836797), ('great', 0.84323597), ('fantastic', 0.82130545), ('nice', 0.75373894)]

For input 1: [('lettuce', 0.9304779), ('tomato', 0.9169305), ('tomatoes', 0.86696106), ('spinach', 0.7767467), ('broccoli', 0.7444947)]

For input 2: [('chicken', 1.0), ('meat', 0.6799131), ('pork', 0.6541997), ('turkey', 0.62825197), ('shrimp', 0.6004993)]

Top 5 words using fasttext are:

For input 0: [('wonderful', 0.901771055347115), ('good', 0.8790438652506161), ('excellent', 0.7178267179872835), ('decent', 0.665085234039435), ('lovely', 0.6603127457802108)]

For input 1: [('lettuce', 0.9264417334730983), ('tomato', 0.9140209476178994), ('spinach', 0.7932212016028506), ('eggplant', 0.7772270057499169), ('onions', 0.7737536660800143)]

For input 2: [('chicken', 0.8072479914245532), ('ceasar', 0.7750985549424643), ('beef', 0.6371025365057209), ('pork', 0.6124103144819651), ('hamburgers', 0.6056037232373332)]

Top 5 wo

## Investigating similarity D) [15 points]

We can use our model to find out which word does not match given a list of words.

For this we build the mean vector of all embeddings in the list.  
Then we calculate the cosine similarity between the mean and all those vectors.

The word that does not match is then the word with the lowest cosine similarity to the mean.

Example:

```
model = VectorModel(w2v_vectors)
doesnt_match(model, ['potato', 'tomato', 'beer']) # -> 'beer'
```

In [7]:
def doesnt_match(model, words):
    
    sum_vector = []
    for word in words:
        try:
            sum_vector = sum_vector + model.embed(word)
        except:
            sum_vector = model.embed(word)
    mean_vector = sum_vector / len(words)

    cosine_similarities = dict()
    for word in words:
        cosine_similarities[word] = model.cosine_similarity(mean_vector, model.embed(word))
    
    return list(dict(sorted(cosine_similarities.items(), key=lambda item: item[1], reverse=False)).items())[0][0]
    
model_tfidf = VectorModel(tfidf_vectors)
print(doesnt_match(model_tfidf, ['vegetable', 'strawberry', 'tomato', 'lettuce']))

# model = VectorModel(w2v_vectors)
# print(doesnt_match(model, ['potato', 'tomato', 'beer']))
# print(doesnt_match(model, ['vegetable', 'strawberry', 'tomato', 'lettuce']))

vegetable


In [8]:
# This is a test cell, please ignore it


## Document Embeddings A) [15 points]

Now we want to create document embeddings similar to the last assignment. For this you are given the function ```bagOfWords```. In the context of Word2Vec and FastText embeddings this is also called ```SOWE``` for sum of word embeddings.

Take the yelp reviews (```reviews```) and create a dictionary containing the document id as a key and the document embedding as a value.

Create the document embeddings from the Word2Vec, FastText and TfIdf embeddings.

Store these in the variables ```ft_doc_embeddings```, ```w2v_doc_embeddings``` and ```tfidf_doc_embeddings```

In [9]:
def bagOfWords(model: VectorModel, doc: List[str]) -> np.ndarray:
    '''
    Create a document embedding using the bag of words approach
    
    Args:
        model     -- The embedding model to use
        doc       -- A document as a list of tokens
        
    Returns:
        embedding -- The embedding for the document as a single vector 
    '''
    embeddings = [np.zeros(model.vector_size())]
    n_tokens = 0
    for token in doc:
        embedding = model.embed(token)
        if embedding is not None:
            n_tokens += 1
            embeddings.append(embedding)
    if n_tokens > 0:
        return sum(embeddings)/n_tokens
    return sum(embeddings)


ft_doc_embeddings = dict()
w2v_doc_embeddings = dict()
tfidf_doc_embeddings = dict()

model_1 = VectorModel(w2v_vectors)
model_2 = VectorModel(ft_vectors)
model_3 = VectorModel(tfidf_vectors)

# WHAT IS THE CATCH HERE???
for doc in reviews:
    w2v_doc_embeddings[doc['id']] = bagOfWords(model_1, doc['tokens'])
    ft_doc_embeddings[doc['id']] = bagOfWords(model_2, doc['tokens'])
    tfidf_doc_embeddings[doc['id']] = bagOfWords(model_3, doc['tokens'])

In [10]:
# This is a test cell, please ignore it


## Document Embeddings B) [10 points]

Create a vector model from each of the document embedding dictionaries. Call these ```model_w2v_doc```, ```model_ft_doc``` and ```model_tfidf_doc```.

Now find the most similar document (```top_n=1```) for document $438$ with each of these models. Print the text for each of the most similar reviews.

In [13]:
# First find the text for review 438
def find_doc(doc_id, reviews):
    for review in reviews:
        if review['id'] == doc_id:
            return review['text']
    
doc_id = 438

# Print it
print('Source document:')
print(find_doc(doc_id, reviews))


# Create the models
model_w2v_doc =  VectorModel(w2v_doc_embeddings)
model_ft_doc = VectorModel(ft_doc_embeddings)
model_tfidf_doc = VectorModel(tfidf_doc_embeddings)

model_w2v_doc.most_similar_vec(model_w2v_doc.vector_dict[438], 2)

print('='*70)
print('Similar document:')
print(find_doc(model_w2v_doc.most_similar_vec(model_w2v_doc.vector_dict[438], 2)[1][0], reviews))

Source document:
Absolutely ridiculously amazing! Chicken Tikka masala was perfect. Best I've ever had!
Similar document:
I think I've been spoiled by eating delicious quesadillas quite frequently because the chicken quesadilla I ate was sub par.  It was greasy and the quality of chicken was not impressive. I gave an extra star because you can choose the fillings and those were fresh.
