## spaCy &mdash; Industrial-Strength NLP in Python
Tutorial sources: [here](https://nlpforhackers.io/complete-guide-to-spacy/) and [here](https://github.com/explosion/spacy-notebooks)

[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

![spaCy](images/spacy.png)

### It's really FAST ###
Written in Cython, it was specifically designed to be as fast as possible.

### It's really ACCURATE ###
spaCy's dependency parser is one of the best-performing in the world:
[It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool](https://aclweb.org/anthology/P/P15/P15-1038.pdf)

### Batteries included ###
- **Index preserving tokenization** (details about this later)
- **Models** for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
- Supports **8 languages** out of the box
- Easy and **beautiful visualizations**
- Pretrained **word vectors**

### Extensible ###
It plays nicely with the tools you already know and love: **Scikit-Learn**, **TensorFlow**, **gensim**

### DeepLearning Ready ###
It also has its own deep learning framework that’s especially designed for NLP tasks:
[Thinc](https://github.com/explosion/thinc)

## Installation
```bash
# Install spaCy
pip install spacy

# Install spaCy English model
python -m spacy download en
```

## Load spaCy resources

In [33]:
# Import spacy and English models
import spacy

nlp = spacy.load('en')

## Language Processing Pipelines

- When you call nlp on a text, spaCy first **tokenizes** the text to produce a **Doc object**.
- The Doc is then processed in several different steps – this is also referred to as the **processing pipeline**. 
- The pipeline used by the default models consists of a **tagger**, a **parser** and an **entity recognizer**.

![pipeline](images/pipeline.svg)

More info [here](https://spacy.io/usage/processing-pipelines)
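
To make the flow above concrete, here is a toy model of a pipeline in plain Python: a tokenizer creates the document, then each component annotates it and hands it on. The names `toy_nlp`, `tokenizer` and `make_component` are hypothetical and the "doc" is just a dict; this is only a sketch of the idea, not how spaCy implements it.

```python
def tokenizer(text):
    """Turn raw text into a toy 'doc': a dict holding tokens and annotations."""
    return {"text": text, "tokens": text.split(), "annotations": []}


def make_component(name):
    """Build a stand-in pipeline component that only records that it ran."""
    def component(doc):
        doc["annotations"].append(name)
        return doc  # every component returns the doc for the next one
    return component


# Same order as the default pipeline described above.
pipeline = [make_component(name) for name in ("tagger", "parser", "ner")]


def toy_nlp(text):
    """Tokenize first, then run each component over the doc in order."""
    doc = tokenizer(text)
    for component in pipeline:
        doc = component(doc)
    return doc


doc = toy_nlp("Garfield is a cat.")
print(doc["tokens"])
print(doc["annotations"])
```

The key design point is that every step receives and returns the same document object, so later components can build on earlier annotations.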

## Tokenization
The **Token** class exposes a lot of word-level attributes:

In [34]:
doc = nlp("Next week I'll   be in Madrid.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Next	0	next	False	False	Xxxx	ADJ	JJ
week	5	week	False	False	xxxx	NOUN	NN
I	10	-PRON-	False	False	X	PRON	PRP
'll	11	will	False	False	'xx	VERB	MD
  	15	  	False	True	  	SPACE	_SP
be	17	be	False	False	xx	VERB	VB
in	20	in	False	False	xx	ADP	IN
Madrid	23	madrid	False	False	Xxxxx	PROPN	NNP
.	29	.	True	False	.	PUNCT	.
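
The second column above (`token.idx`) is what "index preserving" means: every token remembers the character offset where it starts in the original string, so it can always be mapped back to the source text, even across the run of three spaces. A minimal sketch of the idea, using a hypothetical `tokenize_with_offsets` helper built on a plain whitespace regex (this is not spaCy's tokenizer, which also splits off `'ll` and the final period):

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace and keep each token's start offset (toy tokenizer)."""
    return [(match.group(), match.start()) for match in re.finditer(r"\S+", text)]


text = "Next week I'll   be in Madrid."
tokens = tokenize_with_offsets(text)

for token, idx in tokens:
    # The offset always points back at the exact characters in the original string.
    assert text[idx:idx + len(token)] == token
    print(f"{token}\t{idx}")
```

Even with this much cruder tokenizer, the offsets for `be` (17), `in` (20) and `Madrid` (23) agree with the spaCy output, because both count characters in the untouched input string.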


## Sentence detection

In [35]:
# Print sentences (one sentence per line)
doc = nlp("Garfield is a cat. Snoopy is a dog.")

for sent in doc.sents:
    print(sent)

Garfield is a cat.
Snoopy is a dog.


## Part Of Speech Tagging

In [36]:
# For each token, print corresponding part of speech tag
for token in doc:
    print('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' % (token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop))

Garfield	garfield	PROPN	NNP	nsubj	Xxxxx	True	False
is	be	VERB	VBZ	ROOT	xx	True	True
a	a	DET	DT	det	x	True	True
cat	cat	NOUN	NN	attr	xxx	True	False
.	.	PUNCT	.	punct	.	False	False
Snoopy	snoopy	NOUN	NN	nsubj	Xxxxx	True	False
is	be	VERB	VBZ	ROOT	xx	True	True
a	a	DET	DT	det	x	True	True
dog	dog	NOUN	NN	attr	xxx	True	False
.	.	PUNCT	.	punct	.	False	False


## Named Entity Recognition
Doing NER with spaCy is super easy and the pretrained model performs pretty well:

In [37]:
doc = nlp("Next week I'll be in London.")
for ent in doc.ents:
    print("%s\t-->\t%s" % (ent.text, ent.label_))

Next week	-->	DATE
London	-->	GPE


Common entity types include *ORG* (organization), *PERSON*, *LOC* (non-GPE location), *DATE*, *TIME*, *MONEY*, and *GPE* (geo-political entity). See the complete list [here](https://spacy.io/usage/linguistic-features#entity-types).

In [38]:
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print("%s\t-->\t%s" % (ent.text, ent.label_))

2	-->	CARDINAL
9 a.m.	-->	TIME
30%	-->	PERCENT
just 2 days	-->	DATE
WSJ	-->	ORG


**displaCy** comes in handy for a better visualization:

In [39]:
from spacy import displacy

text = """But Google is starting from behind. The company made a late push into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption."""

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)

## Chunking
spaCy automatically detects noun phrases as well:

In [40]:
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, '\t', chunk.label_, '\t', chunk.root.text)

Wall Street Journal 	 NP 	 Journal
an interesting piece 	 NP 	 piece
crypto currencies 	 NP 	 currencies


Notice how the chunker also computes the *root* of each phrase, i.e. its main word.
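
One way to think about the root: it is the word in the chunk whose syntactic head lies outside the chunk. The sketch below checks this against the three chunks above. The `heads` table is copied by hand from the Dependency Parsing output for the same sentence, and `chunk_root` is a hypothetical helper, not spaCy's own implementation of `chunk.root`.

```python
# Head of each token, copied from the dependency parse of this sentence.
# The ROOT token "published" is its own head.
heads = {
    "Wall": "Street", "Street": "Journal", "Journal": "published",
    "just": "published", "published": "published",
    "an": "piece", "interesting": "piece", "piece": "published",
    "on": "piece", "crypto": "currencies", "currencies": "on",
}


def chunk_root(chunk_tokens):
    """Return the first token whose head is outside the chunk (toy stand-in)."""
    inside = set(chunk_tokens)
    for token in chunk_tokens:
        if heads[token] not in inside:
            return token
    return None


chunks = ["Wall Street Journal", "an interesting piece", "crypto currencies"]
roots = {chunk: chunk_root(chunk.split()) for chunk in chunks}

for chunk, root in roots.items():
    print(chunk, "->", root)
```

Keying the table on the word itself only works because no word repeats in this sentence; real code would use token positions instead.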

## Dependency Parsing

In [41]:
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
 
for token in doc:
    print("{0}/{1}<--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Wall/NNP<--compound-- Street/NNP
Street/NNP<--compound-- Journal/NNP
Journal/NNP<--nsubj-- published/VBD
just/RB<--advmod-- published/VBD
published/VBD<--ROOT-- published/VBD
an/DT<--det-- piece/NN
interesting/JJ<--amod-- piece/NN
piece/NN<--dobj-- published/VBD
on/IN<--prep-- piece/NN
crypto/JJ<--compound-- currencies/NNS
currencies/NNS<--pobj-- on/IN
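
Each line above reads `token/tag <--dep-- head/head_tag`: every token points at its head, and the root (`published`) points at itself. Following those links in reverse gives the tree. The sketch below rebuilds it from triples copied by hand from that output; `render` is a hypothetical helper, and no spaCy model is involved.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
arcs = [
    ("Wall", "compound", "Street"),
    ("Street", "compound", "Journal"),
    ("Journal", "nsubj", "published"),
    ("just", "advmod", "published"),
    ("published", "ROOT", "published"),
    ("an", "det", "piece"),
    ("interesting", "amod", "piece"),
    ("piece", "dobj", "published"),
    ("on", "prep", "piece"),
    ("crypto", "compound", "currencies"),
    ("currencies", "pobj", "on"),
]

# Invert the head links: map each head to its direct children.
# Keying on the word works here only because no word repeats.
children = defaultdict(list)
root = None
for token, dep, head in arcs:
    if token == head:  # the root is its own head
        root = token
    else:
        children[head].append((token, dep))


def render(token, dep="ROOT", depth=0):
    """Return the subtree under `token` as a list of indented lines."""
    lines = ["  " * depth + f"{token} ({dep})"]
    for child, child_dep in children[token]:
        lines.extend(render(child, child_dep, depth + 1))
    return lines


tree = render(root)
print("\n".join(tree))
```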


If this doesn't help you visualize the dependency tree, **displaCy** comes in handy:

In [42]:
from spacy import displacy

doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})

## Word Vectors
spaCy also ships with word vectors, but we'll need to download a larger model to get them:
```bash
python -m spacy download en_core_web_lg
```

In [43]:
nlp = spacy.load('en_core_web_lg')
print(nlp.vocab['banana'].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

As the word-vector arithmetic at the end of the next section shows, the closest vector in the vocabulary to "king" - "man" + "woman" is still "King", but "Queen" comes right after :)

## Computing Similarity

Based on the word embeddings, spaCy offers a similarity interface for all of its building blocks: Token, Span, Doc and Lexeme. Here's how to use that similarity interface:

In [44]:
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']
 
print("'dog' is %s similar to 'animal' and %s similar to 'fruit'" % (dog.similarity(animal), dog.similarity(fruit)))
print("'banana' is %s similar to 'animal' and %s similar to 'fruit'" % (banana.similarity(animal), banana.similarity(fruit)))

'dog' is 0.66185343 similar to 'animal' and 0.23552851 similar to 'fruit'
'banana' is 0.24272855 similar to 'animal' and 0.67148364 similar to 'fruit'
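
By default these scores are cosine similarities between the underlying vectors: the dot product divided by the product of the vector lengths, so vectors pointing in the same direction score 1. A dependency-free sketch of that computation follows. The 3-d `toy_vectors` are made up purely for illustration and are not taken from any spaCy model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-d vectors, chosen only to illustrate the computation.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "animal": [0.8, 0.3, 0.1],
    "fruit": [0.1, 0.9, 0.2],
}

dog_animal = cosine_similarity(toy_vectors["dog"], toy_vectors["animal"])
dog_fruit = cosine_similarity(toy_vectors["dog"], toy_vectors["fruit"])

print(round(dog_animal, 3), round(dog_fruit, 3))
```

Because the score depends only on direction, not on vector length, it stays comparable across words whose vectors differ in magnitude.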


Let’s now use this technique on entire texts:

In [45]:
target = nlp("Cats are beautiful animals.")
 
doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")
doc4 = nlp("Snoopy is a very smart dog.")
doc5 = nlp("Tomorrow it will rain a lot in Berlin.")
 
print(target.similarity(doc1))
print(target.similarity(doc2))
print(target.similarity(doc3))
print(target.similarity(doc4))
print(target.similarity(doc5))

0.8901766262114666
0.9115828449161616
0.7822956256736615
0.7133323899064792
0.6526212010025575
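
A `Doc` vector defaults to the average of its token vectors, and `similarity` compares those averages. That is likely part of why even the unrelated Berlin sentence scores 0.65: common words shared across sentences pull every average in a similar direction. A rough sketch with made-up 2-d vectors (the `toy_vectors` values and the `doc_vector` helper are hypothetical, not taken from spaCy):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up 2-d word vectors: animal-like words lean toward the first axis,
# weather-like words toward the second, function words sit in between.
toy_vectors = {
    "cats": [1.0, 0.0], "dogs": [0.9, 0.1], "animals": [0.8, 0.2],
    "awesome": [0.6, 0.4], "are": [0.5, 0.5], "will": [0.5, 0.5],
    "tomorrow": [0.1, 0.9], "rain": [0.0, 1.0],
}


def doc_vector(sentence):
    """Toy document vector: the mean of the word vectors."""
    return mean_vector([toy_vectors[word] for word in sentence.lower().split()])


target = doc_vector("Cats are animals")
related = doc_vector("Dogs are awesome")
unrelated = doc_vector("Tomorrow will rain")

sim_related = cosine_similarity(target, related)
sim_unrelated = cosine_similarity(target, unrelated)
print(sim_related > sim_unrelated)
```

Averaging discards word order entirely, so treat document similarity as a coarse topical signal rather than a measure of meaning.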


### "king" - "man" + "woman" = "queen"?
There’s a really famous example of word embedding math: "king" - "man" + "woman" = "queen". Let’s test that out:

In [46]:
from scipy import spatial
 
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)
 
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector
 
# We now need to find the closest vector in the vocabulary to the result of "king" - "man" + "woman"
maybe_queen = king - man + woman

computed_similarities = []
 
for word in nlp.vocab:
    if word.has_vector:  # Ignore words without vectors
        similarity = cosine_similarity(maybe_queen, word.vector)
        computed_similarities.append((word, similarity))
 
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])

['King', 'KING', 'king', 'KIng', 'Queen', 'QUEEN', 'queen', 'Prince', 'PRINCE', 'prince']
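
The top hits are all spellings of "king" because the computed vector stays very close to one of its own inputs. Analogy evaluations commonly exclude the query words from the candidates for exactly this reason. The sketch below shows the effect on a tiny made-up vocabulary: the 3-d `toy_vectors` and the `most_similar` helper are hypothetical and only illustrate the filtering step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-d vectors; the analogy deliberately does not hold exactly.
toy_vectors = {
    "king": [1.0, 0.5, 0.0],
    "queen": [1.0, 0.5, 0.6],
    "prince": [0.6, 0.8, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 1.0, 0.2],
}


def most_similar(query, exclude=()):
    """Rank vocabulary words by cosine similarity to `query`, skipping `exclude`."""
    scored = [
        (word, cosine_similarity(query, vector))
        for word, vector in toy_vectors.items()
        if word not in exclude
    ]
    return [word for word, _ in sorted(scored, key=lambda item: -item[1])]


# king - man + woman, computed component by component.
query = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

ranked_all = most_similar(query)
ranked_filtered = most_similar(query, exclude={"king", "man", "woman"})

print(ranked_all)
print(ranked_filtered)
```

Lowercasing candidates before ranking would additionally collapse case variants such as "King" and "KING" into a single entry.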


-----
# Exercises

### 1. How many sentences are there in the following text?

*Hint: doc.sents is not a list, but a 'generator'. Convert it to a list first!*

In [50]:
# Count the sentences in the text below
text = "The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee."

# your code here

### 2. Print the sentences (one sentence per line, preceded by the sentence number)

In [51]:
# your code here

-----

## Do you want to learn more?

Visit the [spaCy website](https://spacy.io/)!