1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

# Prediction-Based Word Vectors

More recently, prediction-based word vectors such as word2vec and GloVe (which also makes use of co-occurrence counts) have demonstrated better performance. Here, we explore the embeddings produced by GloVe.

Run the following cell to load the GloVe vectors into memory.

In [None]:
import gensim.downloader as api
import pprint
wv_from_bin = api.load("glove-wiki-gigaword-200")



### Words with Multiple Meanings
Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, "leaves" has both the "go_away" and the "a_structure_of_a_plant" meanings in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?

**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__.
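To make the ranking concrete, here is a minimal pure-Python sketch of what a `most_similar`-style lookup computes. The function `most_similar_sketch` and the 2-D `toy_vectors` are hypothetical and invented for illustration; they are neither the GenSim implementation nor real GloVe values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(vectors, word, topn=10):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical 2-D vectors, invented for illustration only.
toy_vectors = {
    "bat": [1.0, 1.0],
    "cricket": [1.0, 0.2],
    "cave": [0.3, 1.0],
    "bank": [-1.0, 0.1],
}

print(most_similar_sketch(toy_vectors, "bat", topn=3))
```

In this toy space "bat" sits between "cricket" and "cave", so both appear near the top of its list. With real vectors, a single sense often dominates the neighbours instead.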

In [None]:
wv_from_bin.most_similar("bat")

[('bats', 0.691724419593811),
 ('batting', 0.6160588264465332),
 ('balls', 0.5692734122276306),
 ('batted', 0.5530908107757568),
 ('toss', 0.5506128668785095),
 ('wicket', 0.5495278835296631),
 ('pitch', 0.5489361882209778),
 ('bowled', 0.5452010631561279),
 ('hitter', 0.5353438854217529),
 ('batsman', 0.5348091125488281)]

In [None]:
wv_from_bin.most_similar("kind")

[('sort', 0.9320126175880432),
 ('something', 0.8606709241867065),
 ('thing', 0.8369148969650269),
 ('really', 0.8218013644218445),
 ('what', 0.79334956407547),
 ('nothing', 0.791183590888977),
 ('think', 0.7800678610801697),
 ('anything', 0.7773789167404175),
 ('you', 0.7736413478851318),
 ('seems', 0.7729967832565308)]

In [None]:
wv_from_bin.most_similar("right")

[('left', 0.716508150100708),
 ('if', 0.6925000548362732),
 ("n't", 0.6774845719337463),
 ('back', 0.6770386099815369),
 ('just', 0.6740819811820984),
 ('but', 0.667771577835083),
 ('out', 0.6671877503395081),
 ('put', 0.665894091129303),
 ('hand', 0.6634083390235901),
 ('want', 0.6615420579910278)]

In [None]:
wv_from_bin.most_similar("bank")

[('banks', 0.7625691294670105),
 ('banking', 0.6818838119506836),
 ('central', 0.6283639073371887),
 ('financial', 0.6166563034057617),
 ('credit', 0.6049750447273254),
 ('lending', 0.5980608463287354),
 ('monetary', 0.5963003039360046),
 ('bankers', 0.5913101434707642),
 ('loans', 0.5802939534187317),
 ('investment', 0.5740203261375427)]

### SOLUTION

In [None]:
wv_from_bin.most_similar("lie")

[('lying', 0.6884376406669617),
 ('lies', 0.6648906469345093),
 ('lay', 0.49874207377433777),
 ('beneath', 0.4979451894760132),
 ('hide', 0.4929002821445465),
 ('exist', 0.49129071831703186),
 ('sit', 0.48945313692092896),
 ('truth', 0.4867425262928009),
 ('bare', 0.48414283990859985),
 ('these', 0.48072928190231323)]

### Synonyms & Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.
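As a quick check of this definition, the sketch below computes cosine distance for three hand-picked pairs of 2-D vectors. The helper name `cosine_distance` is our own, and the vectors are arbitrary illustrations rather than word embeddings.

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Same direction (length is ignored), orthogonal, and opposite vectors.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0
```

Cosine distance therefore ranges from 0 (same direction) to 2 (opposite direction), and smaller values mean the vectors are more similar.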

Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance.

In [None]:
w1 = "fast"   # "male"   # "good"
w2 = "quick"  # "man"    # "nice"
w3 = "slow"   # "female" # "bad"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

Synonyms fast, quick have cosine distance: 0.3328641653060913
Antonyms fast, slow have cosine distance: 0.2522680163383484


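### SOLUTION:
Antonyms such as "fast" and "slow" tend to appear in the same contexts (both describe speed and modify the same kinds of nouns) more often than "fast" and "quick" do. Because the vectors are learned from the contexts in which words occur, words that share contexts end up close together, whether their meanings agree or are opposite.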
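The shared-context explanation can be illustrated with a toy count-based sketch. The seven-sentence `toy_corpus` is invented for this example, and raw same-sentence counts are a much-simplified stand-in for the co-occurrence statistics that GloVe is trained on.

```python
from collections import Counter
import math

def context_counts(sentences, target):
    """Count the words that share a sentence with `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def count_cosine_distance(c1, c2):
    """Cosine distance between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)  # a Counter returns 0 for missing keys
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return 1.0 - dot / (norm1 * norm2)

# Hypothetical corpus: "fast" and "slow" share most of their contexts.
toy_corpus = [
    "the car is fast", "the car is slow",
    "the train is fast", "the train is slow",
    "a fast reply", "a quick reply", "a quick snack",
]
fast, quick, slow = (context_counts(toy_corpus, w) for w in ("fast", "quick", "slow"))

print("fast vs quick:", count_cosine_distance(fast, quick))
print("fast vs slow: ", count_cosine_distance(fast, slow))
```

In this toy corpus the antonym pair is closer than the synonym pair purely because of shared contexts, which mirrors the pattern in the GloVe distances above.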

### Analogies with Word Vectors
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy will have the highest cosine similarity (largest returned numerical value).

In [None]:
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659887433052063),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]


Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using **only** vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?

### SOLUTION: g + w - m
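The expression can be checked on a toy example. The sketch below builds the target vector from the raw vectors and ranks the remaining words by cosine similarity to it. The 2-D vectors are hypothetical (axis 0 loosely stands for gender, axis 1 for generation), and `analogy_sketch` is our own name. GenSim's implementation also length-normalizes the input vectors before combining them; that step is omitted here.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy_sketch(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + a for t, a in zip(target, vectors[word])]
    for word in negative:
        target = [t - a for t, a in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)  # input words are not candidates
    scores = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical vectors: [gender, generation]
toy = {
    "man": [1.0, 1.0],
    "woman": [-1.0, 1.0],
    "grandfather": [1.0, 3.0],
    "grandmother": [-1.0, 3.0],
    "mother": [-1.0, 2.0],
    "father": [1.0, 2.0],
}

# Target is g + w - m = [1, 3] + [-1, 1] - [1, 1] = [-1, 3]
print(analogy_sketch(toy, positive=["woman", "grandfather"], negative=["man"]))
```

Here "grandmother" coincides with the target, while "mother" points in almost the same direction and so still scores highly.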

### Finding Analogies
a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function also gives us words like "granddaughter", "daughter", or "mother".

### SOLUTION:
These words lie close to "woman" and "grandfather" and far from "man" in the vector space. They are all female family terms, so they are near "woman" and share the family-relation component of "grandfather", which gives them a high cosine similarity to g + w - m as well.

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

**Note**: You may have to try many analogies to find one that works!

In [None]:
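# Analogy in the form x : y :: a : b, i.e. iran : tehran :: japan : tokyo
x, y, a, b = "iran", "tehran", "japan", "tokyo"
# b should be the top-ranked word closest to y - x + a
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b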

### SOLUTION: iran:tehran :: japan:tokyo
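Because several candidates usually have to be tried, a small helper can make the search less repetitive. This is a sketch under the assumption that `model` is any object exposing GenSim's `most_similar(positive=..., negative=..., topn=...)` interface; the names `analogy_top_answer` and `analogy_holds` are our own.

```python
def analogy_top_answer(model, x, y, a):
    """Return the model's top-ranked completion b for x : y :: a : b."""
    # b should be close to y - x + a, so y and a are positive and x is negative.
    return model.most_similar(positive=[a, y], negative=[x], topn=1)[0][0]

def analogy_holds(model, x, y, a, b):
    """True if the intended word b is ranked first."""
    return analogy_top_answer(model, x, y, a) == b
```

For example, `analogy_holds(wv_from_bin, "iran", "tehran", "japan", "tokyo")` performs the same check as the assertion in the cell above.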

### Incorrect Analogy
a. Below, we expect to see the intended analogy "hand : glove :: foot : **sock**", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did.

In [None]:
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]


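### SOLUTION:
The word "foot" is also a unit of measurement, and that sense appears to dominate its vector. The query looks for words close to "foot" and "glove" but far from "hand", which pushes the result away from the body-part sense, so the top results are measurement tokens such as "45,000-square" (presumably from phrases like "45,000-square-foot") rather than clothing.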

b. Find another example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the **incorrect** value of b according to the word vectors (in the previous example, this would be **'45,000-square'**).

In [None]:
# Intended analogy in the form x : y :: a : b, i.e. day : night :: good : bad
x, y, a, b = "day", "night", "good", "bad"
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

[('tonight', 0.6419892907142639),
 ('sure', 0.6270460486412048),
 ('really', 0.6245638728141785),
 ('terrific', 0.6219825148582458),
 ('pretty', 0.6166312098503113),
 ("n't", 0.609019935131073),
 ('excellent', 0.6084809899330139),
 ('always', 0.6062433123588562),
 ('something', 0.6053240299224854),
 ("'re", 0.6035588383674622)]


### SOLUTION:
Intended analogy: day:night :: good:bad. According to the word vectors, the incorrect value of b is **'tonight'**.

### Guided Analysis of Bias in Word Vectors

It's important to be cognizant of the biases (gender, race, sexual orientation, etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Run the cell below to examine (a) which terms are most similar to "woman" and "profession" and most dissimilar to "man", and (b) which terms are most similar to "man" and "profession" and most dissimilar to "woman". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it reflects gender bias.

In [None]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176787376404),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.4875852167606354),
 ('respected', 0.485920250415802),
 ('practice', 0.482104629278183),
 ('regarded', 0.4778572618961334),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.49884122610092163),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211804032325745),
 ('vocation', 0.4788965880870819),
 ('teacher', 0.47160351276397705),
 ('practicing', 0.46937814354896545),
 ('educator', 0.46524327993392944),
 ('physicians', 0.4628995358943939),
 ('professionals', 0.4601394236087799)]


### SOLUTION:
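There is a gender bias here. The female-associated list contains specific caregiving and teaching occupations such as "nursing", "teaching", "teacher", and "educator", whereas the male-associated list contains none of these and instead has status-related words such as "reputation", "skill", "respected", and "business". The vectors therefore encode the stereotype that nursing and teaching are women's professions.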

### Independent Analysis of Bias in Word Vectors

Use the `most_similar` function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.

In [None]:

A = "flowers"
B = "insects"
word = "pleasant"
pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))


[('lovely', 0.6010177135467529),
 ('beautiful', 0.5075923800468445),
 ('quiet', 0.5043780207633972),
 ('bright', 0.4798499345779419),
 ('bouquet', 0.4636004567146301),
 ('happy', 0.4616834223270416),
 ('wonderful', 0.4579712152481079),
 ('festive', 0.4563849866390228),
 ('nice', 0.44969430565834045),
 ('sweet', 0.4492822289466858)]

[('enjoyable', 0.4733288586139679),
 ('unpleasant', 0.4511060416698456),
 ('termites', 0.4320865571498871),
 ('pleasurable', 0.4303135573863983),
 ('surroundings', 0.42313987016677856),
 ('rodents', 0.4011191725730896),
 ('mosquitoes', 0.3891984522342682),
 ('pests', 0.38524946570396423),
 ('arthropods', 0.38287049531936646),
 ('harmless', 0.3790086805820465)]


### SOLUTION:
Word vectors can capture societal stereotypes and associations. For instance, the association between “flowers” and “pleasant” versus “insects” and “unpleasant” reflects cultural biases.

### Thinking About Bias

a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias.

### SOLUTION:
Bias can enter word vectors primarily through the data they are trained on. If the training data contains societal biases, these can be reflected in the word vectors. For example, if a corpus has more sentences associating men with technology and women with domestic roles, the resulting word vectors will likely mirror these biases.

A real-world example of this is seen in a study where word embeddings were found to associate male names more closely with career-oriented words and female names with family-oriented words, reflecting gender stereotypes present in the training data. This type of bias can perpetuate stereotypes and affect decisions made by AI systems using these word vectors, such as resume filtering in hiring processes.

b. What is one method you can use to mitigate bias exhibited by word vectors?  Briefly describe a real-world example that demonstrates this method.

### SOLUTION:
One method to mitigate bias in word vectors is through debiasing techniques such as Hard Debiasing. This approach involves identifying bias directions in the word vector space and then neutralizing and equalizing vectors to reduce gender bias.

A real-world example of this method in action is the modification of the Word2Vec model to reduce gender bias. Researchers identified gender stereotypes in the model and applied debiasing techniques to adjust the vectors, resulting in a less biased representation of words related to gender. This helps in creating more fair and equitable NLP applications, such as job recommendation systems that are less likely to perpetuate gender biases.
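The "neutralize" step mentioned above can be sketched in a few lines: it removes the component of a word vector that lies along the bias direction. This is a simplified illustration with hypothetical 3-D vectors. It takes the bias direction to be a single difference `he - she`, whereas the published Hard Debiasing method estimates the direction from several definitional pairs and also includes an equalize step that is not shown here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neutralize(vector, bias_direction):
    """Remove the component of `vector` that lies along `bias_direction`."""
    scale = dot(vector, bias_direction) / dot(bias_direction, bias_direction)
    return [a - scale * b for a, b in zip(vector, bias_direction)]

# Hypothetical vectors, invented for illustration only.
he = [1.0, 0.0, 0.5]
she = [-1.0, 0.0, 0.5]
nurse = [-0.6, 0.8, 0.2]

# Assumption: one definitional pair is enough to define the gender direction.
gender_direction = [a - b for a, b in zip(he, she)]

debiased_nurse = neutralize(nurse, gender_direction)
print("before:", dot(nurse, gender_direction))
print("after: ", dot(debiased_nurse, gender_direction))
```

After neutralizing, the toy "nurse" vector has no component along the gender direction, while its other coordinates are unchanged.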