1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

# Prediction-Based Word Vectors

More recently, prediction-based word vectors such as word2vec and GloVe (which also draws on co-occurrence counts) have demonstrated better performance. Here, we shall explore the embeddings produced by GloVe.

Run the following cell to load the GloVe vectors into memory.

In [1]:
import gensim.downloader as api
import pprint
wv_from_bin = api.load("glove-wiki-gigaword-200")



### Words with Multiple Meanings
Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, "leaves" has both "go_away" and "a_structure_of_a_plant" meanings in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?

**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__.
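To make the ranking concrete, here is a minimal sketch of a top-k search by cosine similarity. It is not GenSim's implementation; `top_k_cosine`, `toy_vocab`, and `toy_vectors` are hypothetical names, and the 2-D vectors are made up by hand for illustration.

```python
import numpy as np

def top_k_cosine(query, vocab, vectors, k=3):
    """Rank every other word in `vocab` by cosine similarity to `query`."""
    # Scale each row to unit length so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_vec = unit[vocab.index(query)]
    sims = unit @ query_vec
    order = np.argsort(-sims)  # indices sorted from most to least similar
    ranked = [(vocab[i], float(sims[i])) for i in order if vocab[i] != query]
    return ranked[:k]

# Hypothetical toy vectors (assumption: chosen by hand, not taken from GloVe).
toy_vocab = ["head", "chief", "arm", "banana"]
toy_vectors = np.array([
    [1.0, 1.0],    # head
    [1.0, 0.2],    # chief
    [0.1, 1.0],    # arm
    [-1.0, 0.1],   # banana
])

print(top_k_cosine("head", toy_vocab, toy_vectors, k=2))
```

In this toy setup "head" sits between "chief" and "arm", so both appear among its nearest neighbours, which is the situation the task asks you to find in the real vectors.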

In [2]:
### CODE HERE
word = "head"
result = wv_from_bin.most_similar(word)
print(f"The most similar words to <{word}> are : {result}")

The most similar words to <head> are : [('heads', 0.7668997645378113), ('headed', 0.6344295144081116), ('chief', 0.6314131617546082), ('body', 0.6098024249076843), ('assistant', 0.6064105033874512), ('director', 0.6037707328796387), ('deputy', 0.5836146473884583), ('hand', 0.5738338232040405), ('left', 0.5574275255203247), ('arm', 0.5565925240516663)]


### SOLUTION
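The word "head" has several meanings, for example:
1. the body part (or, figuratively, the mind)
2. head teacher
3. the top of something
4. to go towards
5. the boss of a business
6. to hit with the head
7. a title

In the top 10 above, the leader sense appears in "chief", "director", "deputy", and "assistant", while the body-part sense appears in "body", "hand", and "arm".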

-----------------------------------------------------------------------------

Many polysemous or homonymic words might not yield both meanings in the top-10 most similar words, for several reasons:

1. Frequency and Distribution: The word embeddings are trained based on the distributional properties of words in a corpus. If one meaning of a polysemous word is much more frequent in the training data than others, the embeddings may prioritize that meaning over others, leading to a bias in the similar words retrieved.

2. Contextual Ambiguity: Polysemous words often exhibit different meanings depending on the context. Word embeddings capture co-occurrence patterns in the training data, but they might not capture all contextual nuances. As a result, similar words retrieved may be contextually biased towards one meaning over others.

3. Semantic Interference: Words with multiple meanings can introduce semantic interference, where the different meanings of the word influence each other's representation in the embeddings. This interference can make it challenging for the model to disambiguate between the meanings effectively.

4. Training Data Limitations: The training data might not sufficiently represent all senses or meanings of polysemous words. If certain senses are underrepresented or absent in the training data, the embeddings may not accurately capture the semantic relationships between different meanings of the word.

5. Model Limitations: While word embeddings are effective at capturing semantic relationships between words, they have inherent limitations. GloVe assigns exactly one dense vector to each word form, so all senses of a polysemous or homonymic word have to share a single point in the space, and that point cannot sit close to the neighbours of every sense at once.
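The frequency argument in point 1 can be illustrated with a toy model. This sketch assumes, as a simplification that GloVe does not guarantee, that a word's single vector behaves like a frequency-weighted mix of one vector per sense. The names `sense_common`, `sense_rare`, and `mixed_vector` are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sense directions (assumption: orthogonal, purely illustrative).
sense_common = np.array([1.0, 0.0])
sense_rare = np.array([0.0, 1.0])

def mixed_vector(share_common):
    """Word vector modelled as a weighted mix of its two senses."""
    return share_common * sense_common + (1.0 - share_common) * sense_rare

for share in (0.5, 0.8, 0.95):
    word = mixed_vector(share)
    print(share,
          round(cosine(word, sense_common), 3),
          round(cosine(word, sense_rare), 3))
```

As the dominant sense takes a larger share, the similarity to the rare sense shrinks towards zero, so neighbours of the rare sense are less likely to reach the top 10.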

### Synonyms & Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.
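As a quick illustration of this definition, here is a minimal sketch in NumPy. The function names are hypothetical and the vectors are toy examples, not GloVe vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """Cosine distance as defined above: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

print(cosine_distance(a, a))    # 0.0 (same direction)
print(cosine_distance(a, b))    # 1.0 (orthogonal)
print(cosine_distance(a, -a))   # 2.0 (opposite direction)
```

Cosine distance therefore ranges from 0 to 2, and a smaller value means the two vectors point in more similar directions.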

Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance.

In [3]:
w1 = "mother"
w2 = "mama"
w3 = "father"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

Synonyms mother, mama have cosine distance: 0.6049132645130157
Antonyms mother, father have cosine distance: 0.20632314682006836


### SOLUTION
The counter-intuitive result may have happened because of:
1. Contextual Usage: The GloVe model is trained on a large corpus of text and learns to place words that are used in similar contexts close together. If “mother” and “father” appear in similar contexts in the training corpus more often than “mother” and “mama” do, the vectors for “mother” and “father” will end up closer together.

2. Frequency of Words: “Father” is more commonly used than “mama” in many corpora. The model may therefore have a more reliable representation for “father”, leading to a closer association with “mother”.

### Analogies with Word Vectors
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy will have the highest cosine similarity (largest returned numerical value).

In [4]:
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659887433052063),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]


Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using **only** vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?

### SOLUTION
We have the analogy "man : grandfather :: woman : x". First, let's break it into vectors:

- m: the vector representing "man".
- g: the vector representing "grandfather".
- w: the vector representing "woman".
- x: the vector representing the unknown word in place of "x".

Then, we want to find x:
1. First we add the vector for "woman" to the vector for "grandfather", so we will have: g + w
2. Then we subtract the vector for "man" from the previous: g + w - m

This expression applies the offset from "man" to "woman", $w - m$, to the "grandfather" vector: $g + (w - m) = g + w - m$. In other words, it is the point we reach by starting at "grandfather" and making the same move that takes "man" to "woman".
By maximizing the cosine similarity between $x$ and $g + w - m$, we look for the word that relates to "woman" as "grandfather" relates to "man", which is the answer to the analogy.
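The arithmetic above can be sketched on a tiny hand-made vocabulary. This is an illustration only: the 2-D vectors in `toy` are hypothetical (axis 0 loosely stands for gender, axis 1 for generation), `solve_analogy` is a made-up helper, and details of GenSim's own implementation, such as normalisation of the input vectors, are left out.

```python
import numpy as np

# Hypothetical toy vectors (assumption: hand-picked, not taken from GloVe).
toy = {
    "man":         np.array([1.0, 0.0]),
    "woman":       np.array([-1.0, 0.0]),
    "grandfather": np.array([1.0, 2.0]),
    "grandmother": np.array([-1.0, 2.0]),
    "mother":      np.array([-1.0, 1.0]),
}

def solve_analogy(vectors, m, g, w):
    """Return the word whose vector has the highest cosine similarity to g + w - m."""
    target = vectors[g] + vectors[w] - vectors[m]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (m, g, w):
            continue  # skip the query words themselves
        sim = float(vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(solve_analogy(toy, "man", "grandfather", "woman"))
```

In this toy space "grandmother" coincides with $g + w - m$, while "mother" is only slightly further away in angle, which hints at why related family words also score highly in the real output.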

### Finding Analogies
a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function gives us words like "granddaughter", "daughter", or "mother"?

### SOLUTION
The `most_similar` function in GenSim ranks words by the cosine similarity between their vectors and a query built from the `positive` and `negative` lists. In the analogy "man : grandfather :: woman : x", it looks for the word x whose vector is closest to "grandfather" shifted by the difference between "woman" and "man".

In this case, "grandmother" is the correct answer because it represents the female counterpart of "grandfather", which maintains the generational and familial relationship established in the analogy. However, other words like "granddaughter", "daughter", or "mother" are also retrieved because they share certain semantic similarities or associations with the given words.

To investigate:
1. Granddaughter: It represents a familial relationship, and while it's not exactly the same as "grandfather", it's still within the familial hierarchy.
2. Daughter: This is a direct familial relationship with "woman", but it's not as directly linked to "grandfather" as "grandmother" would be.
3. Mother: Again, a direct familial relationship, but in this case, it's one step further removed from "grandfather" compared to "grandmother".

These words appear in the results because they share semantic relationships with the given words "man", "woman", and "grandfather", but "grandmother" is the closest match in terms of the specific familial relationship being described in the analogy.

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

**Note**: You may have to try many analogies to find one that works!

In [10]:
x, y, a, b = 'paris', 'france', 'beijing', 'china'
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b

### SOLUTION
The analogy is: "Paris : France :: Beijing : China."

This analogy holds because both Paris and Beijing are capital cities, and they each correspond to their respective countries, France and China. So, the relationship between Paris and France is analogous to the relationship between Beijing and China.
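Since many candidate analogies may need to be tried, a small helper can wrap the check used in the cell above. This is a sketch: `analogy_holds` is a hypothetical name, and it only assumes that the object passed in exposes GenSim's `most_similar(positive=..., negative=..., topn=...)` method.

```python
def analogy_holds(kv, x, y, a, b):
    """Return True if b is the top-ranked answer to the analogy x : y :: a : ?."""
    # Same query as in the cell above: similar to a and y, dissimilar from x.
    top_word, _score = kv.most_similar(positive=[a, y], negative=[x], topn=1)[0]
    return top_word == b

# Example use with the vectors loaded earlier (commented out, needs the GloVe download):
# print(analogy_holds(wv_from_bin, "paris", "france", "beijing", "china"))
```

Returning a boolean instead of asserting makes it easier to loop over a list of candidate analogies and collect both the ones that hold and the ones that fail.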

### Incorrect Analogy
a. Below, we expect to see the intended analogy "hand : glove :: foot : **sock**", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did?

In [11]:
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]


### SOLUTION
Here the problem is that "foot" is treated as a unit of measurement (as in "square foot") rather than as a part of the body, so the results are dominated by terms such as "45,000-square".

b. Find another example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the **incorrect** value of b according to the word vectors (in the previous example, this would be **'45,000-square'**).

In [12]:
# Intended analogy in the form x : y :: a : b is red : apple :: blue : sky
x, y, a, b = "red", "apple", "blue", "sky"
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

[('chips', 0.5595942735671997),
 ('ibm', 0.5587440133094788),
 ('chip', 0.5563934445381165),
 ('intel', 0.5496072769165039),
 ('sony', 0.5326699614524841),
 ('macintosh', 0.5229740738868713),
 ('microsoft', 0.5217815637588501),
 ('iphone', 0.5164424777030945),
 ('hewlett', 0.5043824315071106),
 ('itunes', 0.5014412999153137)]


### SOLUTION
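Here the intended analogy is "red : apple :: blue : sky", but the word vectors return **'chips'** as the value of b.

Again the prediction is not what we expected, because the vector for "apple" mostly reflects the company brand rather than the fruit, as neighbours such as "ibm", "macintosh", and "iphone" show.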

### Guided Analysis of Bias in Word Vectors

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Run the cell below to examine (a) which terms are most similar to "woman" and "profession" and most dissimilar to "man", and (b) which terms are most similar to "man" and "profession" and most dissimilar to "woman". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it is reflecting gender bias.

In [13]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176787376404),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.4875852167606354),
 ('respected', 0.485920250415802),
 ('practice', 0.482104629278183),
 ('regarded', 0.4778572618961334),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.49884122610092163),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211804032325745),
 ('vocation', 0.4788965880870819),
 ('teacher', 0.47160351276397705),
 ('practicing', 0.46937814354896545),
 ('educator', 0.46524327993392944),
 ('physicians', 0.4628995358943939),
 ('professionals', 0.4601394236087799)]


### SOLUTION
We can observe the gender bias inherent in word embeddings:
1. For terms similar to "man" and "profession" while being dissimilar to "woman", we see words such as "reputation", "skill", "business", "respected", and "regarded". These terms are more abstract and less directly related to specific professions. This result suggests that the model associates "man" with qualities like reputation, skill, and being respected in professional settings rather than specific professions themselves.
2. For terms similar to "woman" and "profession" while being dissimilar to "man", we see words like "nursing", "teaching", "teacher", "educator", and "physicians". These terms are more specific and directly related to certain professions. This result suggests that the model associates "woman" more with specific professions like nursing and teaching rather than abstract qualities or a diverse range of professions.

The difference between the lists reflects gender bias in societal perceptions of professions. Historically, certain professions like nursing and teaching have been associated more with women, while others like business and medicine have been associated more with men. This bias gets reflected in the word embeddings due to the patterns present in the training data, perpetuating stereotypes and potentially reinforcing societal biases when used in applications.

### Independent Analysis of Bias in Word Vectors

Use the `most_similar` function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.

In [14]:
A = 'father'
B = 'mother'
word = 'doctor'
pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))

[('physician', 0.6719361543655396),
 ('surgeon', 0.6208168268203735),
 ('dr.', 0.5724585056304932),
 ('brother', 0.5710500478744507),
 ('son', 0.5303334593772888),
 ('he', 0.5294877290725708),
 ('medical', 0.528836190700531),
 ('uncle', 0.5231919884681702),
 ('himself', 0.5133481621742249),
 ('pharmacist', 0.5111744403839111)]

[('nurse', 0.7208659648895264),
 ('doctors', 0.6413154602050781),
 ('patient', 0.6289440393447876),
 ('woman', 0.6113752126693726),
 ('hospital', 0.6000143885612488),
 ('pregnant', 0.5975667238235474),
 ('nurses', 0.572587788105011),
 ('physician', 0.5669365525245667),
 ('medical', 0.5617853403091431),
 ('patients', 0.5472391843795776)]


### SOLUTION
We can observe bias in the associations of the word "doctor" with "father" and "mother":
- When exploring the association between "doctor" and "father", the top similar words include "physician", "surgeon", "dr.", and "brother". The first three are medical titles, and they sit alongside male terms such as "brother", "son", and "he", indicating a strong association between "doctor" and male roles.
- When exploring the association between "doctor" and "mother", the top similar words include "nurse", "doctors", "patient", and "woman". Here, "nurse" stands out as the most prominent term, reflecting a bias in the model associating women more with nursing roles rather than being doctors.

This example reflects the societal bias that traditionally portrays men as more likely to be doctors while women are more likely to be nurses, perpetuating gender stereotypes in professional roles.

### Thinking About Bias

a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias.

### SOLUTION
One explanation of how bias gets into word vectors is through the biases present in the training data used to train the models. Word embedding models like Word2Vec and GloVe are trained on large corpora of text, which reflect the language usage patterns of the people who wrote them. If the training data contains biases or reflects societal stereotypes, these biases get encoded into the word vectors.

A real-world example of this source of bias can be seen in the representation of gender roles in professions. If the training data predominantly contains examples where men are associated with certain professions like "doctor" or "engineer", while women are associated with others like "nurse" or "teacher", the resulting word vectors will reflect and reinforce these stereotypes. This can perpetuate gender biases in applications that utilize these word embeddings, such as natural language processing systems or recommendation algorithms, potentially leading to biased outcomes or reinforcing societal inequalities.

b. What is one method you can use to mitigate bias exhibited by word vectors?  Briefly describe a real-world example that demonstrates this method.


### SOLUTION
One method to mitigate bias exhibited by word vectors is through debiasing techniques during or after the training process. This involves identifying and neutralizing biased associations present in the word vectors.

One common debiasing method is to identify a gender direction in the embedding space and neutralize it by removing that component from the vectors of words that should be gender-neutral. For example, in the context of professions, if the word vector for "nurse" is closer to female gender words and "doctor" is closer to male gender words, a debiasing algorithm could adjust the word vectors such that the gender associations are minimized.

A real-world example demonstrating this method is the work by Bolukbasi et al. (2016) titled "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings". They proposed a method to debias word embeddings by neutralizing gender-specific associations in word vectors. By applying this method, they were able to mitigate gender bias in word embeddings, reducing the association between gender-neutral professions and gender-specific terms.
Debiased word vectors can then be used in various Natural Language Processing (NLP) tasks such as machine translation, sentiment analysis, or information retrieval, helping to ensure that the outcomes of these tasks are less biased. For example, a job recommendation system using debiased word vectors would be less likely to show gender-stereotyped job ads to users.
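The neutralizing idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the full procedure from Bolukbasi et al.: it assumes the gender direction is estimated from a single "he" minus "she" difference, and the 3-D vectors and the `neutralize` helper are hypothetical.

```python
import numpy as np

def neutralize(vec, direction):
    """Remove from `vec` its component along `direction` (a vector projection)."""
    unit = direction / np.linalg.norm(direction)
    return vec - (vec @ unit) * unit

# Hypothetical toy vectors (assumption: hand-picked, not taken from GloVe).
he = np.array([1.0, 0.2, 0.0])
she = np.array([-1.0, 0.2, 0.0])
nurse = np.array([-0.6, 0.5, 0.7])

gender_direction = he - she            # simplified estimate of the gender axis
debiased_nurse = neutralize(nurse, gender_direction)

print(nurse @ gender_direction)            # non-zero: "nurse" leans towards "she"
print(debiased_nurse @ gender_direction)   # zero up to rounding: no gender component left
```

After the projection, the debiased vector is orthogonal to the estimated gender direction while its other components are unchanged, so the rest of the word's meaning is preserved as far as this toy model goes.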