1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

# Prediction-Based Word Vectors

More recently, prediction-based word vectors such as word2vec and GloVe (which also makes use of co-occurrence counts) have demonstrated better performance. Here, we shall explore the embeddings produced by GloVe.

Run the following cell to load the GloVe vectors into memory.

In [9]:
import gensim.downloader as api
import pprint
wv_from_bin = api.load("glove-wiki-gigaword-200")



### Words with Multiple Meanings
Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, "leaves" has both the "go_away" and "a_structure_of_a_plant" meanings in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?

**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__.
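To make the ranking described in the note concrete, here is a minimal sketch of a top-n lookup by cosine similarity. It is not GenSim's implementation: `toy_vectors` and `toy_most_similar` are hypothetical names, and the 3-dimensional vectors are invented for illustration rather than taken from GloVe.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (NOT real GloVe values).
toy_vectors = {
    "mouse": [0.9, 0.8, 0.1],
    "rat": [1.0, 0.2, 0.0],
    "keyboard": [0.1, 1.0, 0.1],
    "banana": [0.0, 0.1, 1.0],
}

def toy_most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

neighbours = toy_most_similar("mouse", toy_vectors, topn=2)
print(neighbours)
```

In this toy vocabulary the query "mouse" sits between an animal-like vector and a device-like vector, so both appear among its nearest neighbours, which is the situation the task asks you to find in the real embeddings.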

In [18]:
### CODE HERE

def get_most_similar_words(word):
    # Return the 10 nearest neighbours of `word` as (neighbour, cosine similarity) pairs.
    return wv_from_bin.most_similar(word, topn=10)


In [None]:
word1="mouse"
get_most_similar_words(word1)

[('mice', 0.6580958962440491),
 ('keyboard', 0.5548278093338013),
 ('rat', 0.5433949828147888),
 ('rabbit', 0.5192376971244812),
 ('cat', 0.5077415704727173),
 ('cursor', 0.5058691501617432),
 ('trackball', 0.5048902630805969),
 ('joystick', 0.49841049313545227),
 ('mickey', 0.47242844104766846),
 ('clicks', 0.4722806215286255)]

In [None]:
word2="paper"
get_most_similar_words(word2)

[('newspaper', 0.671421229839325),
 ('papers', 0.6713257431983948),
 ('printed', 0.6686532497406006),
 ('sheet', 0.6124283671379089),
 ('printing', 0.6033082604408264),
 ('newspapers', 0.5930173397064209),
 ('print', 0.5892402529716492),
 ('piece', 0.5870198607444763),
 ('published', 0.581505298614502),
 ('book', 0.5597691535949707)]

In [None]:
word3="run"
get_most_similar_words(word3)

[('running', 0.7378345727920532),
 ('runs', 0.7364052534103394),
 ('ran', 0.696038544178009),
 ('went', 0.6395289897918701),
 ('start', 0.637183427810669),
 ('allowed', 0.6334168314933777),
 ('out', 0.6328096389770508),
 ('go', 0.6265833377838135),
 ('going', 0.6221196055412292),
 ('first', 0.6087011098861694)]

In [None]:
word4="book"
get_most_similar_words(word4)

[('books', 0.8452467918395996),
 ('author', 0.7746455669403076),
 ('novel', 0.7485204935073853),
 ('published', 0.7451642751693726),
 ('memoir', 0.7047821283340454),
 ('wrote', 0.6971326470375061),
 ('written', 0.6967507004737854),
 ('essay', 0.6844283938407898),
 ('biography', 0.681260347366333),
 ('autobiography', 0.6770558953285217)]

The word "mouse" works: its top 10 contain the animal meaning ("mice", "rat", "rabbit", "cat") as well as the computer-device meaning ("keyboard", "cursor", "trackball", "joystick", "clicks"). Other words did not work; "book", for example, returns only publishing-related neighbours.

**Possible reasons:**
* Semantic Distance: When the meanings of a word are very dissimilar, a single vector cannot lie close to both of them, so its nearest neighbours tend to reflect only one meaning.
* Data Bias: The embeddings may have been trained on a corpus in which one meaning of the word predominates, which pulls the vector towards that meaning.
* Ambiguity: Some words have meanings that are hard to separate cleanly and are therefore difficult to represent with a single vector.
* Model Limitations: GloVe assigns one static vector to each word, so it has no mechanism for disambiguating meanings by context.


### SOLUTION

### Synonyms & Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.
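As a minimal sketch of this relationship, the snippet below computes cosine distance as 1 minus cosine similarity for a few hand-picked 2D vectors. The helper names are hypothetical and the vectors are arbitrary; with the real embeddings you would call `wv_from_bin.distance` instead.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Arbitrary illustrative vectors.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # anti-parallel vectors

print(same_direction, orthogonal, opposite)
```

Since cosine similarity lies between -1 and 1, cosine distance lies between 0 (same direction) and 2 (opposite direction), and vector length plays no role.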

Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance.

In [None]:
w1 = "fast"
w2 = "quick"
w3 = "slow"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

Synonyms fast, quick have cosine distance: 0.3328641653060913
Antonyms fast, slow have cosine distance: 0.2522680163383484


### SOLUTION

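Antonyms describe the same property from opposite ends, so they appear in the same kinds of contexts, whereas synonyms are often used in different situations. For example, swapping "sad" for "happy" in a sentence usually leaves a well-formed sentence whose meaning is simply reversed, but swapping "cheerful" for "happy" may not fit the sentence at all. Because the vectors are learned from the contexts in which words occur, the antonym can therefore end up closer than the synonym.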
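The explanation above can be illustrated with a small count-based sketch. This is not how GloVe is trained; it only shows, on a hypothetical six-sentence corpus invented for this example, that words sharing contexts get similar context-count vectors regardless of whether they are synonyms or antonyms.

```python
import math
from collections import Counter

# Hypothetical toy corpus: "fast" and "slow" share contexts, "quick" does not.
toy_corpus = [
    "the fast car",
    "the slow car",
    "a fast train",
    "a slow train",
    "a quick reply",
    "a quick glance",
]

def context_counts(target, sentences):
    """Count the words that co-occur with `target` in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def cosine_distance(c1, c2):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(c1[key] * c2[key] for key in c1)  # Counter yields 0 for missing keys
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return 1.0 - dot / (norm1 * norm2)

fast = context_counts("fast", toy_corpus)
quick = context_counts("quick", toy_corpus)
slow = context_counts("slow", toy_corpus)

fast_quick = cosine_distance(fast, quick)  # synonym pair
fast_slow = cosine_distance(fast, slow)    # antonym pair
print("fast/quick:", fast_quick)
print("fast/slow: ", fast_slow)
```

In this corpus the antonym pair ends up closer than the synonym pair, mirroring the pattern seen with the real vectors.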


### Analogies with Word Vectors
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy will have the highest cosine similarity (largest returned numerical value).

In [None]:
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659887433052063),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]


Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using **only** vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?

### SOLUTION
We are looking for the word vector $x$ that maximizes cosine similarity with the result of the operation ('woman' + 'grandfather' - 'man').

So, the expression with which we maximize $x$'s cosine similarity is:

* $w + g - m$

In other words, we want $x \approx w + g - m$, or equivalently $x - w \approx g - m$: the offset from 'woman' to the answer should match the offset from 'man' to 'grandfather'.

This expression adds the vectors for 'woman' and 'grandfather' and then subtracts the vector for 'man'. The target word is the one whose vector has the highest cosine similarity with the result.

When we compute the cosine similarity between $w + g - m$ and the other word vectors, we are essentially finding words that are semantically similar to 'woman' and 'grandfather' but different from 'man'. This is in line with the output above, where words like 'grandmother', 'daughter', and 'mother' are returned as similar words.

**2D example**

Let's consider a simplified 2D example to visualize the vectors. Assume we have a two-dimensional space where each word vector is represented by a point. For simplicity, we'll represent the vectors for 'man', 'woman', 'grandfather', and the answer ('x') on a coordinate plane.

Let's arbitrarily position these vectors:<br>

'man' at point M(1, 1)<br>
'woman' at point W(3, 2)<br>
'grandfather' at point G(2, 5)<br>
To find the answer vector, we perform the operation $x = w + g - m$:

$$x = w + g - m = (3, 2) + (2, 5) - (1, 1) = (3 + 2 - 1,\ 2 + 5 - 1) = (4, 6)$$

So, the resulting vector 'x' lies at point X(4, 6).

Drawn on a 2D coordinate plane, X is displaced from G by the same offset that separates W from M, since $x - g = w - m = (2, 1)$. The four points M, W, X, and G therefore form a parallelogram, which intuitively corresponds to a word that relates to 'grandfather' the way 'woman' relates to 'man'.<br>
Also, by using the `most_similar` function with $x$ as the input, we can find the word whose vector has the highest cosine similarity to the vector (4, 6).
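The worked example can be checked with a short sketch. The points M, W, and G are the ones chosen above; the candidate words and their coordinates are hypothetical, invented only to show the final ranking step. As a simplification, this sketch combines the raw vectors and skips any normalization that a library routine may apply first.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Points from the worked example.
m = (1, 1)  # man
w = (3, 2)  # woman
g = (2, 5)  # grandfather

# x = w + g - m, computed component by component.
x = tuple(wi + gi - mi for wi, gi, mi in zip(w, g, m))

# Hypothetical candidate answers with invented coordinates.
candidates = {
    "grandmother": (4.2, 5.8),
    "mother": (3.5, 3.0),
    "table": (5.0, 1.0),
}

# The answer is the candidate with the highest cosine similarity to x.
best = max(candidates, key=lambda word: cosine_similarity(x, candidates[word]))
print(x, best)
```

The input words themselves are left out of the candidate set here, which matters in practice because the query words often lie very close to the combined vector.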







### Finding Analogies
a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function gives us words like "granddaughter", "daughter", or "mother"?

Because they all belong to the category of female family members, they are likely to occur in similar contexts.<br>
Proximity in Vector Space: In the word embedding space, vectors for words describing family relations tend to be clustered together. "granddaughter", "daughter", and "mother" are family terms that share the female component with "woman" and the kinship component with "grandfather", so they lie close to the expected answer vector even though they are not the exact answer.

### SOLUTION

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

**Note**: You may have to try many analogies to find one that works!

In [None]:
x, y, a, b = "waitress","waiter","female","male"
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b

### SOLUTION
waitress : waiter :: female : male, i.e. the word closest to waiter + female - waitress is "male".

### Incorrect Analogy
a. Below, we expect to see the intended analogy "hand : glove :: foot : **sock**", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did?

In [None]:
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]


### SOLUTION
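I think the words involved are either rare in the training corpus in the intended sense or are mostly used in contexts with different meanings. In particular, the results suggest that "foot" appears far more often as a unit of measurement (as in "45,000-square foot") than as a body part, so the measurement sense dominates the result.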



b. Find another example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the **incorrect** value of b according to the word vectors (in the previous example, this would be **'45,000-square'**).

In [17]:

x, y, a, b = "white", "black", "sun", "moon"
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

[('sky', 0.5087384581565857),
 ('bright', 0.46601271629333496),
 ('moon', 0.4504569172859192),
 ('jiazheng', 0.44230249524116516),
 ('shine', 0.43356406688690186),
 ('shines', 0.43294209241867065),
 ('earth', 0.4307621717453003),
 ('solar', 0.4234519898891449),
 ('blue', 0.420571506023407),
 ('light', 0.41411274671554565)]


### SOLUTION
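Intended analogy: white : black :: sun : moon, computed as sun + black - white.<br>
The incorrect value of b according to the word vectors is "sky". I expected the model to pick up on the contrast in the first pair and return "moon", which only ranks third.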


### Guided Analysis of Bias in Word Vectors

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Run the cell below, to examine (a) which terms are most similar to "woman" and "profession" and most dissimilar to "man", and (b) which terms are most similar to "man" and "profession" and most dissimilar to "woman". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it is reflecting gender bias.

In [None]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176787376404),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.4875852167606354),
 ('respected', 0.485920250415802),
 ('practice', 0.482104629278183),
 ('regarded', 0.4778572618961334),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.49884122610092163),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211804032325745),
 ('vocation', 0.4788965880870819),
 ('teacher', 0.47160351276397705),
 ('practicing', 0.46937814354896545),
 ('educator', 0.46524327993392944),
 ('physicians', 0.4628995358943939),
 ('professionals', 0.4601394236087799)]


### SOLUTION

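The top result for the "man" query is "reputation", while the top result for the "woman" query is "professions". The male-associated words are mostly abstract and status-related (skill, respected, regarded), while the female-associated words name specific occupations such as teacher, nursing, and educator.
This difference between the lists reflects gender bias in the word vectors: men are linked to professional standing, while women are linked to stereotypically female jobs.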


### Independent Analysis of Bias in Word Vectors

Use the `most_similar` function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.

In [None]:

A ="man"
B ="woman"
word ="worker"
pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))


[('workers', 0.611325740814209),
 ('employee', 0.5983108878135681),
 ('working', 0.5615329742431641),
 ('laborer', 0.5442320108413696),
 ('unemployed', 0.536851704120636),
 ('job', 0.5278826355934143),
 ('work', 0.5223963856697083),
 ('mechanic', 0.5088937282562256),
 ('worked', 0.5054520964622498),
 ('factory', 0.4940454363822937)]

[('employee', 0.6375863552093506),
 ('workers', 0.6068920493125916),
 ('nurse', 0.5837947130203247),
 ('pregnant', 0.5363885164260864),
 ('mother', 0.5321308970451355),
 ('employer', 0.5127025842666626),
 ('teacher', 0.5099576711654663),
 ('child', 0.5096741318702698),
 ('homemaker', 0.5019454956054688),
 ('nurses', 0.4970572590827942)]


### SOLUTION
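The words related to "man" are all about work and jobs (laborer, mechanic, factory), while the words related to "woman" include "pregnant", "mother", "child", and "homemaker", even though the query is about work. The vectors thus associate women with family and domestic roles.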


In [14]:
# another example
A = "small"
B = "mouse"
word = "tiny"

pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))


[('large', 0.680227518081665),
 ('smaller', 0.623263418674469),
 ('larger', 0.551127016544342),
 ('few', 0.541064441204071),
 ('huge', 0.526732325553894),
 ('sized', 0.515216588973999),
 ('relatively', 0.5117579698562622),
 ('sizable', 0.5114010572433472),
 ('nearby', 0.5099323987960815),
 ('largest', 0.5029460787773132)]

[('mice', 0.5267527103424072),
 ('trackball', 0.4856228232383728),
 ('cursor', 0.47818639874458313),
 ('keyboard', 0.46513333916664124),
 ('white-footed', 0.45868802070617676),
 ('joystick', 0.4565153419971466),
 ('microcebus', 0.4375380277633667),
 ('reepicheep', 0.43316853046417236),
 ('peromyscus', 0.4318021237850189),
 ('rabbit', 0.43032732605934143)]


In [16]:
# another example
A = "running"
B = "march"
word = "spring"

pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))


[('run', 0.5648393630981445),
 ('plenty', 0.5220548510551453),
 ('ran', 0.5006933212280273),
 ('catching', 0.4874459505081177),
 ('always', 0.4865102767944336),
 ('winter', 0.4798910617828369),
 ('summer', 0.4785727858543396),
 ('runs', 0.47564056515693665),
 ('start', 0.46902260184288025),
 ('like', 0.4679262638092041)]

[('july', 0.7441548109054565),
 ('april', 0.7351762056350708),
 ('june', 0.7311972975730896),
 ('september', 0.7232730388641357),
 ('october', 0.7174134850502014),
 ('august', 0.7155402302742004),
 ('february', 0.7122297883033752),
 ('january', 0.7020894885063171),
 ('december', 0.7004657983779907),
 ('november', 0.6892157196998596)]


When "spring" is combined with "running", the results are verbs related to "run" and "start"; when it is combined with "march", the names of the other months are the most related.

### Thinking About Bias

a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias.

### SOLUTION

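Bias comes from how GloVe embeddings are built: they are learned from the co-occurrence statistics of the training corpus, so they depend on what that corpus contains and on how the model is trained.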

Word embeddings are typically trained on large corpora of text, which often reflect the biases and stereotypes present in society. As a result, the patterns learned by the model during training may encode and reinforce these biases in the resulting word vectors.

For example, consider a word embedding model trained on news articles from various sources. If certain demographics are overrepresented or underrepresented in these articles, the model may learn biased associations between words and concepts. For instance, if news articles predominantly mention men in positions of power and women in domestic roles, the resulting word vectors may reflect and reinforce these gender stereotypes. As a consequence, words like "doctor" might be more closely associated with male pronouns, while words like "nurse" might be more closely associated with female pronouns, even though these professions are not inherently gendered.



b. What is one method you can use to mitigate bias exhibited by word vectors?  Briefly describe a real-world example that demonstrates this method.


### SOLUTION

One approach is modifying the word vectors to reduce or eliminate biased correlations, such as gender or racial stereotypes. For example, debiasing algorithms may identify gender-neutral words and adjust their embeddings to be equidistant from gendered words like "he" and "she." By doing so, the resulting word vectors are less likely to reflect biased associations present in the training data. A real-world example of this method is the work done by Bolukbasi et al. (2016), who proposed a debiasing algorithm to mitigate gender bias in word embeddings. Their algorithm identified gender-specific word pairs and applied transformations to the word vectors to reduce the gender bias while preserving the semantic relationships between words. This approach has been applied to various natural language processing tasks to promote fairness and reduce discrimination in machine learning systems.<br>
Article link: https://arxiv.org/abs/1607.06520
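The idea of making a gender-neutral word equidistant from "he" and "she" can be sketched as a projection. This is a simplified illustration in the spirit of the neutralize step, not the full published algorithm: the bias direction is taken from a single word pair as a simplifying assumption, no renormalization is applied, and all vectors are hypothetical toy values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

def cosine_similarity(u, v):
    return dot(normalize(u), normalize(v))

def neutralize(word_vec, bias_direction):
    """Remove the component of `word_vec` that lies along `bias_direction`."""
    direction = normalize(bias_direction)
    projection = dot(word_vec, direction)
    return [a - projection * d for a, d in zip(word_vec, direction)]

# Hypothetical toy vectors (NOT real GloVe values).
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
doctor = [0.4, 0.5, 0.7]  # leans towards "he" before debiasing

# Assumption: approximate the gender direction with a single pair, he - she.
gender_direction = [a - b for a, b in zip(he, she)]

debiased_doctor = neutralize(doctor, gender_direction)

print("before:", cosine_similarity(doctor, he), cosine_similarity(doctor, she))
print("after: ", cosine_similarity(debiased_doctor, he),
      cosine_similarity(debiased_doctor, she))
```

After the projection, the toy "doctor" vector has no component along the gender direction and is equally similar to "he" and "she", while its other components are unchanged.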