Let's create an interesting NLP project

The most commonly used models for word embeddings are [word2vec](https://github.com/dav/word2vec/) and [GloVe](https://nlp.stanford.edu/projects/glove/), both of which are unsupervised approaches based on the distributional hypothesis (words that occur in the same contexts tend to have similar meanings).

Word2Vec word embeddings are vector representations of words that are typically learnt by an unsupervised model fed with large amounts of text (e.g. Wikipedia, science, news, articles, etc.). These representations capture semantic similarity between words, among other properties. Word2Vec embeddings are learnt in such a way that the [distance](https://en.wikipedia.org/wiki/Euclidean_distance) between vectors of words with close meanings ("king" and "queen", for example) is smaller than the distance between vectors of words with completely different meanings ("king" and "carpet", for example).
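
To make the distance idea concrete, here is a minimal sketch with hand-made 3-dimensional vectors. The numbers are invented for illustration only (real embeddings have hundreds of dimensions), and toy_vectors and euclidean_distance are hypothetical names, not part of any library.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, invented for illustration only;
# real Word2Vec or GloVe vectors have hundreds of dimensions.
toy_vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.8, 0.9, 0.2]),
    "carpet": np.array([0.1, 0.0, 0.9]),
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

d_king_queen = euclidean_distance(toy_vectors["king"], toy_vectors["queen"])
d_king_carpet = euclidean_distance(toy_vectors["king"], toy_vectors["carpet"])

# Words with close meanings should end up closer together.
print(f"king-queen:  {d_king_queen:.3f}")
print(f"king-carpet: {d_king_carpet:.3f}")
```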

![Alt text](https://developers.google.com/machine-learning/crash-course/images/linear-relationships.svg)
Image from [developers.google.com](https://developers.google.com/machine-learning/crash-course/embeddings/translating-to-a-lower-dimensional-space)

Word2Vec vectors even support some arithmetic. For example, the following operation uses the word2vec vector of each word:

**king - man + woman ≈ queen**
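
The arithmetic above can be tried on toy data. This is only a sketch under stated assumptions: the 3-dimensional vectors are invented so that the dimensions loosely mean "royalty", "male" and "female", and solve_analogy is a hypothetical helper, not a library function.

```python
import numpy as np

# Invented toy vectors: [royalty, male, female]. Not real embeddings.
toy = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, vectors):
    """Return the word d that best completes a : b :: c : d."""
    target = vectors[b] - vectors[a] + vectors[c]
    # The input words are excluded, otherwise one of them often wins.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = solve_analogy("man", "king", "woman", toy)
print(answer)
```

With real embeddings the result of the arithmetic rarely lands exactly on a word vector, so the nearest remaining word is taken as the answer.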

Another word embedding method is GloVe (“Global Vectors”). It is based on matrix factorization of the word-context matrix. It first constructs a large (words x contexts) matrix of co-occurrence information: for each word (the rows), it counts how frequently that word appears in some context (the columns) in a large corpus. This matrix is then factorized into a lower-dimensional (words x features) matrix, in which each row stores the vector representation of one word. In general, this is done by minimizing a “reconstruction loss”, which looks for the lower-dimensional representations that explain most of the variance in the high-dimensional data.
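
The two steps (count co-occurrences, then factorize) can be sketched on a tiny corpus. Note the assumptions: the corpus and the window size are made up, and a truncated SVD of the log counts stands in for the factorization. The actual GloVe training objective is a weighted least-squares fit, which this sketch does not implement.

```python
import numpy as np

# Made-up corpus and window size, for illustration only.
corpus = [
    "the king rules the land",
    "the queen rules the land",
    "the dog chases the cat",
]
window = 2

vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}

# Step 1: count how often each word appears near each context word.
cooc = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    tokens = line.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[word], index[tokens[j]]] += 1

# Step 2: factorize log(1 + counts) and keep the top-k components.
# SVD is a simple stand-in here, not the GloVe optimizer.
k = 2
u, s, _ = np.linalg.svd(np.log1p(cooc))
embeddings = u[:, :k] * s[:k]  # one k-dimensional row per word

print(embeddings.shape)
print(embeddings[index["king"]], embeddings[index["queen"]])
```

In this corpus "king" and "queen" occur in identical contexts, so their rows in the co-occurrence matrix, and therefore their low-dimensional vectors, coincide.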

In [60]:
!wget https://nlp.stanford.edu/data/glove.42B.300d.zip #https://nlp.stanford.edu/data/glove.6B.zip

--2022-08-08 15:17:17--  https://nlp.stanford.edu/data/glove.42B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.42B.300d.zip [following]
--2022-08-08 15:17:17--  https://downloads.cs.stanford.edu/nlp/data/glove.42B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1877800501 (1.7G) [application/zip]
Saving to: ‘glove.42B.300d.zip’


2022-08-08 15:23:11 (5.08 MB/s) - ‘glove.42B.300d.zip’ saved [1877800501/1877800501]



In [61]:
# !unzip /content/glove.6B.zip
!unzip /content/glove.42B.300d.zip

Archive:  /content/glove.42B.300d.zip
  inflating: glove.42B.300d.txt      


In [2]:
from gensim.scripts.glove2word2vec import glove2word2vec

In [62]:
# glove_input_file = r"/content/glove.6B.300d.txt"
glove_input_file = r"/content/glove.42B.300d.txt"
glove_output_file = r"word2vec.txt"

In [63]:
glove2word2vec(glove_input_file, glove_output_file)

(1917494, 300)
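
The tuple printed above is the vocabulary size and the vector dimension. To my understanding, the conversion mainly prepends that pair as a header line, which the word2vec text format expects and GloVe files lack. A minimal pure-Python sketch of the idea on a made-up two-word file (add_word2vec_header and the file names are hypothetical, not part of gensim):

```python
import os
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dim>' header."""
    # Reads everything into memory: fine for a toy file, but a multi-GB
    # file would need a streaming two-pass version.
    with open(glove_path, encoding="utf-8") as src:
        lines = src.readlines()
    vocab_size = len(lines)
    dim = len(lines[0].split()) - 1  # the first token is the word itself
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dim}\n")
        dst.writelines(lines)
    return vocab_size, dim

# Tiny made-up GloVe-style file: one word and its vector per line.
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "toy_glove.txt")
w2v_path = os.path.join(tmp_dir, "toy_word2vec.txt")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("king 0.9 0.8 0.1\nqueen 0.8 0.9 0.2\n")

shape = add_word2vec_header(glove_path, w2v_path)
print(shape)
```

Newer gensim releases (4.0 and later) also document a no_header=True argument on KeyedVectors.load_word2vec_format, which should make the conversion step unnecessary; check the installed version before relying on it.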

In [5]:
from gensim.models import KeyedVectors

In [64]:
model = KeyedVectors.load_word2vec_format(glove_output_file, binary=False)

In [65]:
model.similarity('go','went')

0.7681654
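
The score returned by model.similarity is a cosine similarity: the cosine of the angle between the two word vectors. A minimal sketch of the formula, assuming NumPy and invented vectors (cosine_similarity is a hypothetical helper, not the gensim implementation):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented vectors, for illustration only.
a = np.array([0.6, 0.7, 0.1])
b = np.array([0.5, 0.8, 0.2])

print(cosine_similarity(a, b))        # similar directions give a value near 1
print(cosine_similarity(a, 2 * a))    # vector length does not matter
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal
```

Because only the direction counts, a frequent word with a long vector and a rare word with a short one can still score as highly similar.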

The print_related function prints the words closest to a given word, ranked by cosine similarity.

In [115]:
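def print_related(word = 'india', topn = 7) :
  # Print the topn words closest to `word`, with similarity as a percentage.
  try :
    word = word.lower()  # the vocabulary of this GloVe file is lowercase
    top_wd = model.most_similar(word, topn = topn)
    print(f"{word} related Word : ")
    for itm in top_wd :
      print(f"{itm[0]} {round(itm[1]*100, 2)}%")
  except KeyError :  # raised when the word is not in the vocabulary
    print(f"{word} is misspelled or not in the vocabulary...")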

In [117]:
print_related()

india related Word : 
delhi 76.45%
pakistan 72.87%
indian 72.85%
mumbai 71.2%
bangalore 68.81%
chennai 67.87%
lanka 66.8%


Surprisingly enough:

vector("France") - vector("Paris") = answer_vector - vector("Rome")

Therefore:

vector("France") - vector("Paris") + vector("Rome") = answer_vector

We'll look for words close to answer_vector. The answer_vector won't match "Italy" exactly, but it should be close.

In [133]:
def print_analogy(word1 = 'man', word2 = 'king', word3 = 'woman', topn = 1, percentage = False) :
  # Solve word1 : word2 :: word3 : ?  by searching near word2 - word1 + word3.
  try :
    word1 = word1.lower()
    word2 = word2.lower()
    word3 = word3.lower()
    top_wd = model.most_similar(positive=[word2, word3], negative=[word1], topn = topn)
    print(f"{word1} : {word2} :: {word3} : {top_wd[0][0]}")
    if percentage :
      for itm in top_wd :
        print(f"{itm[0]} - {round(itm[1]*100, 2)}%")
  except KeyError :  # raised when one of the words is not in the vocabulary
    print(f"{word1} | {word2} | {word3} : a word is misspelled or not in the vocabulary...")

In [134]:
# 'Paris', 'France', 'Rome'
# 'man', 'king', 'woman'
# 'walk', 'walked' , 'go'
# 'do', 'done' , 'go'
# 'quick', 'quickest' , 'far'

In [135]:
print_analogy('Paris', 'France', 'Rome', 5, percentage=True)

paris : france :: rome : italy
italy - 72.5%
europe - 62.27%
greece - 59.68%
portugal - 59.45%
spain - 59.32%


The word that does not belong in a group of words is the odd one out. The function below wraps the model's doesnt_match method, which finds exactly that word.

In [142]:
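def doesnt_match(words) :
  # Print the word from the list that is least similar to the others.
  try :
    words = [w.lower() for w in words]
    print(model.doesnt_match(words))
  except ValueError :  # raised when none of the words is in the vocabulary
    print(f"{words} : words are misspelled or not in the vocabulary...")
  # print(model.doesnt_match("breakfast robot dinner lunch".split()))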
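
One common way to find the odd one out, and to my understanding roughly what gensim does, is to average the normalized vectors of the group and pick the word least similar to that average. A minimal sketch with invented vectors (odd_one_out is a hypothetical helper):

```python
import numpy as np

# Invented toy vectors, for illustration only.
toy = {
    "breakfast": np.array([0.90, 0.10, 0.10]),
    "lunch":     np.array([0.80, 0.20, 0.10]),
    "dinner":    np.array([0.85, 0.15, 0.05]),
    "robot":     np.array([0.10, 0.90, 0.30]),
}

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group mean."""
    unit = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    scores = unit @ mean  # cosine similarity of each word to the mean
    return words[int(np.argmin(scores))]

odd = odd_one_out(["breakfast", "robot", "dinner", "lunch"], toy)
print(odd)
```

This also hints at why results on real data can look surprising: the answer depends on the group average, so a word that pulls the average away from the others may not be the one a person would pick.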

The similarity function prints how similar two words are to each other, as a percentage.

In [151]:
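def similarity(word1 = 'woman', word2 = 'man') :
  # Print the cosine similarity of the two words as a percentage.
  try :
    word1 = word1.lower()
    word2 = word2.lower()
    print(f"{word1} and {word2} are similar to {round(model.similarity(word1, word2)*100, 2 )} %")
  except KeyError :  # raised when one of the words is not in the vocabulary
    print(f"{word1} | {word2} : a word is misspelled or not in the vocabulary...")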

In [150]:
similarity()

woman and man are similar to 80.48 %


In [129]:
model.doesnt_match(['tea','water', 'mango'])

'mango'

In [173]:
# spain called although arms roots
# print_related("spain")

In [178]:
print_related('roots')

roots related Word : 
root 61.04%
rooted 58.54%
growing 57.53%
grow 55.44%
stems 53.49%
tradition 53.38%
origins 53.08%


In [154]:
# 'Paris', 'France', 'Rome'
# 'man', 'king', 'woman'
# 'walk', 'walked' , 'go'
# 'do', 'done' , 'go'
# 'quick', 'quickest' , 'far'
# print_analogy('Paris', 'France', 'Rome', 5, percentage=True)

In [187]:
print_analogy('quick', 'quickest' , 'far', 5, percentage=True)

quick : quickest :: far : fastest
fastest - 58.34%
arguably - 56.54%
safest - 55.08%
cleanest - 53.26%
shortest - 52.17%


In [155]:
# 'breakfast', 'robot', 'dinner', 'lunch'
# 'Spain', 'Russia', 'Canada', 'Africa'
# 'banana', 'apple', 'rice', 'grape'
# 'car', 'plane', 'road', 'train'
# 'king','queen', 'prince', 'man'
# doesnt_match([])

In [192]:
doesnt_match(['car', 'plane', 'road', 'train'])

plane


In [156]:
# man woman
# boy girl
# cat dog
# india pakistan
# similarity('king', 'queen')

In [197]:
similarity('you', 'me')

you and me are similar to 77.18 %
