<a href="https://colab.research.google.com/github/sujayrittikar/NLP/blob/main/Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Typically, each word is represented by one vector, and each vector has 300 dimensions.

- We can perform vector arithmetic with the word vectors.
Example: new_vector = king_vector - man_vector + woman_vector, which should land close to queen_vector.

In [1]:
import spacy



In [3]:
import spacy.cli
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [4]:
nlp = spacy.load('en_core_web_lg')

In [8]:
# Word to Vec
nlp(u'lion').vector.shape

(300,)

In [7]:
# Document to Vec
nlp(u'The quick brown fox jumped').vector.shape

(300,)
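
Both shapes are `(300,)` because spaCy's default `Doc.vector` is the average of the token vectors, so a document of any length collapses to one vector of the same size. A minimal sketch of that averaging, using made-up 3-dimensional vectors (the `toy_vectors` values and the `average_vector` helper are illustrative assumptions, not spaCy data or API):

```python
# Toy 3-dimensional vectors; real en_core_web_lg vectors have 300 dimensions.
toy_vectors = {
    "quick": [1.0, 0.0, 2.0],
    "brown": [3.0, 2.0, 0.0],
    "fox": [2.0, 4.0, 1.0],
}

def average_vector(words, vectors):
    """Return the element-wise mean of the vectors for `words`."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]

doc_vector = average_vector(["quick", "brown", "fox"], toy_vectors)
print(doc_vector)       # [2.0, 2.0, 1.0]
print(len(doc_vector))  # 3, the same size as each word vector
```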

In [9]:
tokens = nlp(u'lion cat pet')

In [10]:
for token1 in tokens:
  for token2 in tokens:
    print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.3854507803916931
lion pet 0.20031584799289703
cat lion 0.3854507803916931
cat cat 1.0
cat pet 0.732966423034668
pet lion 0.20031584799289703
pet cat 0.732966423034668
pet pet 1.0


In [11]:
tokens = nlp(u'like love hate')

In [12]:
for token1 in tokens:
  for token2 in tokens:
    print(token1.text, token2.text, token1.similarity(token2))

like like 1.0
like love 0.5212638974189758
like hate 0.5065140724182129
love like 0.5212638974189758
love love 1.0
love hate 0.5708349943161011
hate like 0.5065140724182129
hate love 0.5708349943161011
hate hate 1.0


- Note that words used in similar contexts get similar vectors, even when their everyday meanings are opposite: "love" and "hate" score 0.57 above, higher than "like" and "love" (0.52).

In [13]:
len(nlp.vocab.vectors)

514157

In [17]:
tokens = nlp(u'like love dargle')

In [18]:
# is_oov is True when the token is out of vocabulary (it has no vector)
for token in tokens:
  print(token.text, token.has_vector, token.vector_norm, token.is_oov)

like True 50.609623 False
love True 58.563564 False
dargle False 0.0 True
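
Since `dargle` has no vector (its norm is 0.0), any similarity computed with it carries no information. One option is to separate such tokens out before comparing anything. A hedged sketch with a toy lookup table (the `toy_vectors` dict and the `split_by_vocab` helper are hypothetical stand-ins for checking `token.has_vector`):

```python
# Toy lookup table; "dargle" is deliberately missing (out of vocabulary).
toy_vectors = {"like": [3.0, 4.0], "love": [4.0, 3.0]}

def split_by_vocab(words, vectors):
    """Separate words that have a vector from those that do not."""
    known = [w for w in words if w in vectors]
    unknown = [w for w in words if w not in vectors]
    return known, unknown

known, unknown = split_by_vocab(["like", "love", "dargle"], toy_vectors)
print(known)    # ['like', 'love']
print(unknown)  # ['dargle']
```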


In [19]:
from scipy import spatial

def cosine_similarity(vec1, vec2):
  # scipy returns the cosine *distance*; subtract it from 1 to get the similarity
  return 1 - spatial.distance.cosine(vec1, vec2)
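
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out with the standard library only; `cosine_similarity_plain` is a hypothetical name chosen so it does not overwrite the scipy-based function above, and the input vectors are toy values.

```python
import math

def cosine_similarity_plain(vec1, vec2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

print(cosine_similarity_plain([3.0, 4.0], [4.0, 3.0]))  # 0.96
print(cosine_similarity_plain([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
print(cosine_similarity_plain([1.0, 2.0], [2.0, 4.0]))  # same direction, close to 1.0
```

This plain version divides by zero for an all-zero vector, which is one more reason to skip out-of-vocabulary tokens before comparing.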

In [20]:
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

In [21]:
# king - man + woman ----> new_vector, expected to be similar to queen, princess, etc.
new_vector = king - man + woman

In [23]:
computed_similarities = []

for word in nlp.vocab:
  # keep only lowercase alphabetic entries that actually have a vector
  if word.has_vector and word.is_lower and word.is_alpha:
    similarity = cosine_similarity(new_vector, word.vector)
    computed_similarities.append((word, similarity))

In [28]:
computed_similarities = sorted(computed_similarities, key=lambda item: item[1], reverse=True)

In [29]:
print([t[0].text for t in computed_similarities[:10]])

['king', 'and', 'that', 'where', 'she', 'they', 'woman', 'there', 'should', 'these']
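
The top hit is `king` itself, one of the words used to build `new_vector`. A common step in analogy searches is to drop the query words from the candidates before ranking. The sketch below shows the idea on hand-made toy vectors (the `toy_vectors` values and the `cosine` and `rank_by_similarity` helpers are assumptions for illustration, not spaCy data or API). It only addresses the query word appearing in its own results and does not by itself explain the remaining entries in the list above.

```python
import math

# Hand-made 3-dimensional toy vectors, NOT values from en_core_web_lg.
toy_vectors = {
    "king": [4.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [2.0, 0.0, 2.0],
    "apple": [0.0, 1.0, 1.0],
}

def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def rank_by_similarity(target, vectors, exclude=()):
    """Return words sorted from most to least similar to `target`."""
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return [w for w, _ in sorted(scored, key=lambda item: item[1], reverse=True)]

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in zip(
    toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]

print(rank_by_similarity(target, toy_vectors))
# ['king', 'queen', 'woman', 'apple', 'man']
print(rank_by_similarity(target, toy_vectors, exclude={"king", "man", "woman"}))
# ['queen', 'apple']
```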
