<a href="https://colab.research.google.com/github/sambit9238/spacy_snippets/blob/master/spacy_word_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

spaCy can compare two objects and predict how similar they are.

**Doc.similarity(), Span.similarity() and Token.similarity()**

Each method takes another object and returns a similarity score (typically between 0 and 1, where higher means more similar).

Important: similarity requires a model that includes word vectors, for example:

*✅ en_core_web_md (medium model)*

*✅ en_core_web_lg (large model)*

*🚫 NOT en_core_web_sm (small model)*
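
Before relying on similarity scores, it can help to confirm that the loaded pipeline actually carries a vector table. The following is a minimal sketch, assuming that `nlp.vocab.vectors_length` reports the width of the vector table and is 0 when no vectors are present; the helper name `has_word_vectors` is hypothetical and not part of spaCy.

```python
def has_word_vectors(nlp):
    """Return True if the loaded pipeline appears to ship with word vectors.

    Assumption: nlp.vocab.vectors_length is the dimensionality of the
    vector table and is 0 when the model has no word vectors.
    """
    return nlp.vocab.vectors_length > 0


# Usage, assuming spaCy and the medium model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_md")
#   print(has_word_vectors(nlp))
```

The helper only reads an attribute from the object it is given, so it can be called on any pipeline after `spacy.load`.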

In [1]:
!pip3 install -U spacy

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.3.2)


In [2]:
!python3 -m spacy download en_core_web_md

Collecting en_core_web_md==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.3.1/en_core_web_md-2.3.1.tar.gz (50.8MB)
     |████████████████████████████████| 50.8MB 75.3MB/s
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.3.1-cp36-none-any.whl size=50916643 sha256=a3b27c800fc02e872d2263ab038f3611ffd9c381aea4f448332885849abc2b5f
  Stored in directory: /tmp/pip-ephem-wheel-cache-yhz19lgs/wheels/6e/65/3a/34cdc26d4084d1d1f1e2ec9914964759ea17aa382c53a57d9f
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
  Found existing installation: en-core-web-md 2.2.5
    Uninstalling en-core-web-md-2.2.5:
      Successfully uninstalled en-core-web-md-2.2.5
Successfully installed en-core-web-md-2.3.1
✔ Download and installation successful
You can now load the model via 

**How does spaCy predict similarity?**
1. Similarity is determined using word vectors.
2. Word vectors are multi-dimensional representations of word meaning.
3. They are generated using an algorithm like Word2Vec and lots of text.
4. They can be added to spaCy's statistical models.
5. The default measure is cosine similarity, but this can be adjusted.
6. Doc and Span vectors default to the average of their token vectors.
7. Short phrases work better than long documents with many irrelevant words.
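
The two mechanics in the list above, cosine similarity and averaging token vectors, can be sketched in plain Python. This is an illustration only, not spaCy's internal code: the three-dimensional `toy_vectors` are invented for the example (real model vectors here have 300 dimensions), the helper names are hypothetical, and returning 0.0 for a zero-length vector is a choice made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Sketch-level choice: no direction means no measurable similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Component-wise mean, mirroring how a Doc or Span vector is built."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Invented toy vectors, for illustration only.
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "policy": [0.0, 0.1, 0.9],
}

# A "span" vector is the average of its token vectors.
food_phrase = average_vector([toy_vectors["pizza"], toy_vectors["pasta"]])

print(cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"]))
print(cosine_similarity(food_phrase, toy_vectors["policy"]))
```

Because the phrase vector is a plain average, every extra unrelated token pulls it away from the words that matter, which is why short, focused phrases tend to compare better than long documents.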

In [3]:
import spacy
# Load a larger model with vectors
nlp = spacy.load("en_core_web_md")

In [6]:
%%time
# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8627204117787385
CPU times: user 24.5 ms, sys: 0 ns, total: 24.5 ms
Wall time: 29.3 ms


In [7]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.7369546


In [9]:
# Compare two spans
doc = nlp("I like pizza and love pasta")
span1 = doc[1:3]
span2 = doc[4:]
print(span1.similarity(span2))

0.7655973


In [14]:
# Compare a document with a token
doc = nlp("What is your return policy")
token = nlp("return")[0]

print(doc.similarity(token))

0.655107452021617


In [18]:
%%time
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells pizza, burgers")

print(span.similarity(doc))

0.7851726361329234
CPU times: user 25 ms, sys: 0 ns, total: 25 ms
Wall time: 26.8 ms


In [19]:
# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

In [20]:
len(bananas_vector)

300