# Natural Language Processing with spaCy
### Introduction

In this tutorial we will explore some of the really cool problems that Natural Language Processing (NLP) tackles. We will do this by exploring the spaCy library and seeing some of the interesting things it can do for us when we want a computer to process text in different ways. spaCy fills a gap that NLTK leaves through various design decisions. Unlike NLTK, spaCy isn't aimed at research, so it makes different design choices; for example, it uses Cython to speed up processes such as word tokenization and part-of-speech tagging. It also builds a syntactic tree for each sentence for better understanding. But as we shall see, there are many other things that this library lets us do.

### Tutorial content

In this tutorial, we will explore basic Natural Language Processing with [spaCy](https://spacy.io/).

We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-Libraries)
- [Creating a spaCy Doc](#Creating-a-spaCy-Doc)
- [Looking at Tokens](#Looking-at-Tokens)
- [Visualizers](#Visualizers)
- [Similarities](#Similarities)
- [Training](#Training)
- [Conclusion](#Conclusion)
- [References](#References)

### Installing the Libraries

To begin using spaCy, we can install the package using pip:
    
    $ pip install -U spacy
    
We can also install with conda using the following command:

    $ conda install -c conda-forge spacy

After finishing the installation, we now need to download the model for English.

    $ python -m spacy download en

spaCy also supports other languages and models that have been trained on different inputs that you can find [here](https://spacy.io/models/#available-models).

After installing spaCy, make sure the following works for you.

In [1]:
import spacy

# Load the tokenizer, tagger, parser, Named Entity Recognition, and word vectors for English
nlp = spacy.load('en')

# Process the sentence into a spacy document object
doc = nlp(u'You have installed the library!')

print(doc)
print(type(doc))

You have installed the library!
<class 'spacy.tokens.doc.Doc'>


Let's also install the Wikipedia API so we can work with the articles there.

    $ pip install wikipedia

In [2]:
import wikipedia

### Creating a spaCy Doc
We'll now look at what we can do with a document in spaCy. We'll begin by using the Wikipedia API to get an article about Alan Turing. We can use the `page` function to get a `WikipediaPage` object. From this object we take the plain text of the article (its `content` attribute), which is easier for spaCy to work with.

We then let the spaCy model for English process the document so that a syntactic tree is built. The default pipeline uses the following components as described in the picture below:
[<img src="https://spacy.io/assets/img/pipeline.svg">](https://spacy.io/assets/img/pipeline.svg)
1. Tokenizer - Splits the document into meaningful pieces according to the Penn Treebank standard. spaCy assumes no multi-word tokens and allows merging afterwards.
2. Tagger - Assigns a part-of-speech tag to each token, describing its grammatical role.
3. Parser - Assigns dependencies between tokens.
4. Named Entity Recognizer - Detects and labels named entities.

spaCy also allows for [custom components](https://spacy.io/usage/processing-pipelines#section-custom-components) that you can add to the pipeline, but for now we'll stick with the default.
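To make the pipeline idea concrete, here is a minimal plain-Python sketch of how a document might flow through a sequence of components. It does not use spaCy: `ToyDoc`, `toy_tokenizer`, `toy_tagger`, and `run_pipeline` are hypothetical names invented for this illustration, and the tagging rule is deliberately naive.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a processed document (not spaCy's Doc)."""
    text: str
    tokens: list = field(default_factory=list)
    tags: list = field(default_factory=list)


def toy_tokenizer(text):
    # Naive whitespace split; spaCy's real tokenizer is far richer.
    return ToyDoc(text=text, tokens=text.split())


def toy_tagger(doc):
    # Illustrative rule only: capitalized tokens become PROPN, the rest X.
    doc.tags = ["PROPN" if tok[:1].isupper() else "X" for tok in doc.tokens]
    return doc


def run_pipeline(text, components):
    # The tokenizer creates the document; each later component receives
    # the document, adds its annotations, and passes it on.
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("Alan Turing led Hut 8", [toy_tagger])
print(list(zip(doc.tokens, doc.tags)))
```

The point of the sketch is the shape of the loop: every component takes a document and returns it, so adding a custom step amounts to adding one more callable to the list.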

In [3]:
# Get the WikipediaPage article
article_wiki = wikipedia.page('Alan Turing')

# Process the article
article_parsed = nlp(article_wiki.content)
print(type(article_parsed))

<class 'spacy.tokens.doc.Doc'>


In [4]:
# Prints out some of the processed tokens and perceived sentences.
sents = article_parsed.sents
wordlist = [] # <class 'spacy.tokens.token.Token'>
sentlist = [] # <class 'spacy.tokens.span.Span'>

for i in range(15):
    wordlist.append(article_parsed[i])
    sentlist.append(next(sents))

print("Iterating through words: ", wordlist)
print("Iterating through sentences: ", sentlist)

Iterating through words:  [Alan, Mathison, Turing,  , (;, 23, June, 1912, –, 7, June, 1954, ), was, an]
Iterating through sentences:  [Alan Mathison Turing  , (; 23 June 1912 – 7 June 1954) was an English computer scientist, mathematician, logician, cryptanalyst, philosopher, and theoretical biologist.
, Turing was highly influential in the development of theoretical computer science, providing a formalisation of the concepts of algorithm and computation with the Turing machine, which can be considered a model of a general purpose computer., Turing is widely considered to be the father of theoretical computer science and artificial intelligence.
, During the Second World War, Turing worked for the Government Code and Cypher School, (GC&CS) at Bletchley Park, Britain's codebreaking centre that produced Ultra intelligence., For a time he led Hut 8, the section which was responsible for German naval cryptanalysis., Here he devised a number of techniques for speeding the breaking of German

As we can see, this whole preprocessing step, as well as gaining an understanding of each token, is very simple. In fact, it takes far fewer lines of code and is [much more efficient](https://spacy.io/usage/facts-figures) for most basic NLP tasks than other popular libraries such as NLTK and Stanford's CoreNLP, with [comparable accuracy](https://spacy.io/usage/facts-figures).

### Looking at Tokens

Now let's look at the information that each of our `Token` objects holds. A [Token](https://spacy.io/api/token) represents a single word, punctuation symbol, or similar unit of text. Let's look at a sample sentence and see the different information that we can get from it.

In [5]:
doc = nlp(u'Jimmy wanted to throw a rock into the pond, but he didn\'t have any.')

print('{}\t{}\t{}\t{}\t{}\t{}\t{}\t'.format('Token', 'Lemma', 'Part of Speech', 'POS Tag', 'Dependency', 'Shape', 'In Stop Word List?'))
for token in doc:
    print('{}\t{}\t{}\t\t{}\t{}\t\t{}\t{}\t'.format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_stop))

Token	Lemma	Part of Speech	POS Tag	Dependency	Shape	In Stop Word List?	
Jimmy	jimmy	PROPN		NNP	nsubj		Xxxxx	False	
wanted	want	VERB		VBD	ROOT		xxxx	False	
to	to	PART		TO	aux		xx	True	
throw	throw	VERB		VB	xcomp		xxxx	False	
a	a	DET		DT	det		x	True	
rock	rock	NOUN		NN	dobj		xxxx	False	
into	into	ADP		IN	prep		xxxx	True	
the	the	DET		DT	det		xxx	True	
pond	pond	NOUN		NN	pobj		xxxx	False	
,	,	PUNCT		,	punct		,	False	
but	but	CCONJ		CC	cc		xxx	True	
he	-PRON-	PRON		PRP	nsubj		xx	True	
did	do	VERB		VBD	aux		xxx	True	
n't	not	ADV		RB	neg		x'x	False	
have	have	VERB		VB	conj		xxxx	True	
any	any	DET		DT	dobj		xxx	True	
.	.	PUNCT		.	punct		.	False	


spaCy gives us a lot of information to understand what purpose each word has. It gives us a trained model to work with that can help classify each token in one of many ways.
From the table of tokens, we see that the words have been lemmatized, that is, reduced to their base form. In addition, we have parts of speech at varying levels of specificity, as well as dependencies, shapes, etc. In the code above, we only looked at a few of the many features that are available to us; more can be found in the documentation of the [Token](https://spacy.io/api/token) class. However, just this information can take us a long way in working with text.
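The "Shape" column in the table is the least self-explanatory feature, so here is a rough sketch of how such a shape string can be computed. This is an approximation written for this tutorial, not spaCy's own code: the function name `word_shape` and the `max_run` parameter are assumptions, chosen so that the results match the table above (uppercase becomes `X`, lowercase becomes `x`, digits become `d`, and long runs are cut off after four characters).

```python
def word_shape(text, max_run=4):
    """Approximate a token's shape: X for uppercase, x for lowercase, d for digits."""
    shape = []
    last_char = None
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation is kept as-is
        # Track how long the current run of identical shape characters is.
        if shape_char == last_char:
            run_length += 1
        else:
            run_length = 1
            last_char = shape_char
        # Truncate long runs so that "wanted" and "throw" share the shape "xxxx".
        if run_length <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Jimmy", "wanted", "n't", "1912"]:
    print(word, "->", word_shape(word))
```

Shapes like these are useful as features because they let a model generalize over words it has never seen: any capitalized five-letter-or-longer word looks like `Xxxxx`.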

### Visualizers
spaCy also supports dependency and entity visualization through displaCy. We demonstrate both below using the `style` argument. displaCy visualizations are normally shown using the `serve` function, but since we are using Jupyter, we want to use `render`, which returns markup that is rendered in the cell immediately and is easier to export.

In [6]:
from spacy import displacy

# Visualize dependencies
doc1 = nlp(u'I haven\'t eaten anything since yesterday.')
displacy.render(doc1, style='dep', jupyter=True, options={'distance': 120})

# Explain what each dependency means
print("{:<15}{:<10}{}".format("Entity", "Label", "Explanation of Label"))
for tok in doc1:
    print("{:<15}{:<10}{}".format(tok.text, tok.dep_, spacy.explain(tok.dep_)))

# We can also customize our visualizations
doc2 = nlp(u'My glass is half full.')
options2 = {'compact': True, 'bg': '#d3d9e2', 'color': 'blue', 'font': 'Arial'}
displacy.render(doc2, style='dep', jupyter=True, options=options2)

Entity         Label     Explanation of Label
I              nsubj     nominal subject
have           aux       auxiliary
n't            neg       negation modifier
eaten          ROOT      None
anything       dobj      direct object
since          prep      prepositional modifier
yesterday      pobj      object of preposition
.              punct     punctuation


In the above visualization, we are able to view both part-of-speech information and how the different tokens depend on each other, which is very informative when using algorithms to deal with words. For example, a computer could look at this information and see that the word "eaten" is negated, or that "I" is the nominal subject of the verb "eaten". This can make it easier for a computer to figure out what the sentence means.
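As a small illustration of how a program could read this information, the sketch below encodes the parse as plain tuples and looks up the subject and negation of the root verb. The dependency labels are taken from the table above; the head indices are an assumption made for this example (they are what the arcs would typically look like, but they are not printed in the output), and `children_with_label` is a hypothetical helper, not a spaCy function.

```python
# (token text, dependency label, index of the head token).
# Labels come from the table above; head indices are assumed for illustration.
toy_parse = [
    ("I", "nsubj", 3),
    ("have", "aux", 3),
    ("n't", "neg", 3),
    ("eaten", "ROOT", 3),      # the root points at itself
    ("anything", "dobj", 3),
    ("since", "prep", 3),
    ("yesterday", "pobj", 5),
    (".", "punct", 3),
]


def children_with_label(parse, head_index, label):
    """Return the texts of tokens attached to head_index with the given label."""
    return [
        text
        for i, (text, dep, head) in enumerate(parse)
        if head == head_index and dep == label and i != head_index
    ]


# Find the root verb, then ask which tokens modify it.
root_index = next(i for i, (_, dep, _) in enumerate(toy_parse) if dep == "ROOT")
root_text = toy_parse[root_index][0]
subjects = children_with_label(toy_parse, root_index, "nsubj")
is_negated = bool(children_with_label(toy_parse, root_index, "neg"))

print("verb:", root_text, "| subject:", subjects, "| negated:", is_negated)
```

With a real spaCy `Doc`, the same questions can be asked through each token's `dep_` and `head` attributes instead of hand-written tuples.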

In [7]:
# Visualize Named entities
rawdoc2 = """Turing was highly influential in the development of theoretical computer science, 
providing a formalisation of the concepts of algorithm and computation with the Turing machine, 
which can be considered a model of a general purpose computer. Turing is widely considered to 
be the father of theoretical computer science and artificial intelligence. During the Second 
World War, Turing worked for the Government Code and Cypher School, (GC&CS) at Bletchley Park, 
Britain's codebreaking centre that produced Ultra intelligence. For a time he led Hut 8, the 
section which was responsible for German naval cryptanalysis. Here he devised a number of 
techniques for speeding the breaking of German ciphers, including improvements to the pre-war 
Polish bombe method, an electromechanical machine that could find settings for the Enigma 
machine. Turing played a pivotal role in cracking intercepted coded messages that enabled the 
Allies to defeat the Nazis in many crucial engagements, including the Battle of the Atlantic, 
and in so doing helped win the war.
""".replace('\n','')
doc2 = nlp(rawdoc2)

# Colors for certain labels
colors2 = {'NORP': 'linear-gradient(90deg, #d142f4, #f49b41)'}
# Options list, setting ents to None means all entities are highlighted
options2 = {'ents': None, 'colors': colors2}

displacy.render(doc2, style='ent', jupyter=True, options=options2)

# Explain what each label means
print("{:<30}{}\t{}".format("Entity", "Label", "Explanation of Label"))
for ent in doc2.ents:
    print("{:<30}{}\t{}".format(ent.text, ent.label_, spacy.explain(ent.label_)))

Entity                        Label	Explanation of Label
the Second World War          EVENT	Named hurricanes, battles, wars, sports events, etc.
the Government Code           EVENT	Named hurricanes, battles, wars, sports events, etc.
Cypher School                 ORG	Companies, agencies, institutions, etc.
GC&CS                         PERSON	People, including fictional
Bletchley Park                GPE	Countries, cities, states
Britain                       GPE	Countries, cities, states
Ultra                         NORP	Nationalities or religious or political groups
German                        NORP	Nationalities or religious or political groups
German                        NORP	Nationalities or religious or political groups
Polish                        NORP	Nationalities or religious or political groups
Enigma                        PRODUCT	Objects, vehicles, foods, etc. (not services)
Allies                        ORG	Companies, agencies, institutions, etc.
Nazis               

As we can see, each entity is given a label that helps explain what it is. Not all of the labels are accurate, as we can see from 'GC&CS' being labeled as a `PERSON`. Still, the accuracy is pretty good, and information like this helps a computer extract facts from sentences.
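Once entities are labeled, a common next step is to group them by label so that questions such as "which places are mentioned?" become a dictionary lookup. The sketch below does this in plain Python on the pairs printed above (the truncated final row is left out); `group_by_label` is a hypothetical helper name. With a real `Doc`, the pairs could be built from `ent.text` and `ent.label_` as in the loop above.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above.
entities = [
    ("the Second World War", "EVENT"),
    ("the Government Code", "EVENT"),
    ("Cypher School", "ORG"),
    ("GC&CS", "PERSON"),
    ("Bletchley Park", "GPE"),
    ("Britain", "GPE"),
    ("Ultra", "NORP"),
    ("German", "NORP"),
    ("German", "NORP"),
    ("Polish", "NORP"),
    ("Enigma", "PRODUCT"),
    ("Allies", "ORG"),
]


def group_by_label(pairs):
    """Map each label to the distinct entity texts that carry it, in order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:  # skip repeated mentions
            grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)
print("Places:", by_label["GPE"])
print("People:", by_label["PERSON"])
```

Grouping like this also makes mislabeled entities easy to spot, since 'GC&CS' ends up as the only entry under `PERSON`.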

### Similarities

We can use spaCy's models to compare different objects. This allows us to do things such as finding similar sentences. This can be done with any `Doc`, `Span`, or `Token` object using the `similarity` method, which estimates semantic similarity as a cosine similarity. We can see some examples below. We'll also load a model with a larger vocabulary so that we have better estimates, using the following command:

    $ python -m spacy download en_core_web_md
    
And then load it.

In [8]:
nlplg = spacy.load('en_core_web_md')

Now let's look at some words and their similarities.

In [9]:
dog = nlplg('dog')
cat = nlplg('cat')
lion = nlplg('lion')
truck = nlplg('truck')
print("Similarity between dog and cat: ", dog.similarity(cat))
print("Similarity between dog and lion: ", dog.similarity(lion))
print("Similarity between dog and truck: ", dog.similarity(truck))
print("Similarity between cat and lion: ", cat.similarity(lion))

Similarity between dog and cat:  0.8016853893732596
Similarity between dog and lion:  0.4742449314321914
Similarity between dog and truck:  0.355217429684785
Similarity between cat and lion:  0.5265437205262408


As we can see from the numbers, this model is pretty good at capturing shared features between words. The words "dog" and "cat" are pretty similar in that they are both pets. Dogs and lions are less similar, and dogs and trucks even less so. Cats and lions are slightly more related than dogs and lions, as they are both felines. These simple associations that we take for granted as humans can be trained into the spaCy model and used to let computers understand how concepts relate.
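To see what a cosine similarity score actually measures, here is a small standard-library sketch. The three-dimensional vectors are made up for illustration (real word vectors have hundreds of dimensions and different values), and returning `0.0` for a zero vector is a choice made for this sketch rather than a statement about spaCy's behavior.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # sketch-level convention for vectors with no direction
    return dot / (norm1 * norm2)


# Hypothetical toy vectors, not taken from any spaCy model.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.1],
    "truck": [0.1, 0.2, 0.9],
}

dog_cat = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
dog_truck = cosine_similarity(toy_vectors["dog"], toy_vectors["truck"])
print("dog vs cat:  ", round(dog_cat, 3))
print("dog vs truck:", round(dog_truck, 3))
```

Because the score depends only on the angle between the vectors and not on their length, two words score highly when their vectors point in a similar direction.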

We can also compare documents and bigger pieces of text, which makes the `similarity` method useful for information retrieval. For example, we can compare a single word against whole documents.

In [10]:
question = nlplg('European')
doc1 = nlplg(wikipedia.page('Lego').content)
doc2 = nlplg(wikipedia.page('Greece').content)
doc3 = nlplg(wikipedia.page('Bicycle').content)
print("Similarity with doc1: ", question.similarity(doc1))
print("Similarity with doc2: ", question.similarity(doc2))
print("Similarity with doc3: ", question.similarity(doc3))

Similarity with doc1:  0.34316940126082063
Similarity with doc2:  0.3901161487616457
Similarity with doc3:  0.31983017325749513


As we can see from this output, European is most similar to the article about Greece (about 0.39, compared with 0.34 for Lego and 0.32 for Bicycle). The margins are small, but even with only a single word as the query, the most relevant article ranks first. With a little more effort, this can be expanded into a standalone search tool that searches for documents and then searches inside them for the relevant information.
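The ranking step of such a search tool is simple once the scores exist. The sketch below sorts the three scores from the output above (rounded to four decimals); `rank_documents` and its `top_n` parameter are hypothetical names for this illustration.

```python
# Similarity scores copied from the output above, rounded to four decimals.
scores = {"Lego": 0.3432, "Greece": 0.3901, "Bicycle": 0.3198}


def rank_documents(scores, top_n=None):
    """Return (title, score) pairs sorted from most to least similar."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]  # top_n=None keeps the whole list


for title, score in rank_documents(scores):
    print("{:<10}{}".format(title, score))
```

In a fuller tool, the `scores` dictionary would be filled by calling `similarity` between the query and each candidate document.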

We can also see how different words are related. To start, we'll download the English vector information with the following command:

    $ python -m spacy download en_vectors_web_lg

Now we can load the model.

In [11]:
nlpvec = spacy.load('en_vectors_web_lg')

In [12]:
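# Uses the similarity function to find the words most related to the given word
def most_similar_words(aword):
    # Keep reasonably frequent vocabulary entries with the same casing as the query
    queries = [w for w in aword.vocab if w.is_lower == aword.is_lower and w.prob >= -15]
    # Rank the candidates from most to least similar
    simrank = sorted(queries, key=lambda w: aword.similarity(w), reverse=True)
    simset = set()
    topsims = []
    i = 0
    # Collect the top 20 distinct lowercased words, stopping early if we run out
    while len(topsims) < 20 and i < len(simrank):
        curword = simrank[i].lower_
        if curword not in simset:
            topsims.append(curword)
            simset.add(curword)
        i += 1
    return topsims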

print(most_similar_words(nlpvec.vocab[u'Obama']))

['obama', 'barack', 'mccain', 'clinton', 'hillary', 'palin', 'biden', 'gop', 'democrats', 'republicans', 'democrat', 'america', 'cnn', 'pelosi', 'republican', 'osama', 'iraq', 'reagan', 'george', 'congress']


An even cooler feature that we get straight out of the box with spaCy is the vector representation of words. What this means is that each word is given a multi-dimensional vector representation of its meaning, which can be created with algorithms such as [word2vec](https://en.wikipedia.org/wiki/Word2vec). We can use this information to help understand similarities and differences between different words and concepts. Here we use cosine similarity to compare the vector embeddings of the words, which are available through the `vector` attribute of each `Token` (and of the `Doc` and `Lexeme` objects used in the code below). We use this to see what different combinations of words give us.

In [13]:
# Define a cosine similarity function
import numpy as np
cosine = lambda v1, v2: np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Define a vector similarity function
def most_similar_vecs(invec, usedwords):
    aword = nlplg('vocabulary')
    queries = [w for w in aword.vocab if w.prob >= -15 and w.has_vector and w.lower_ not in usedwords]
    simrank = sorted(queries, key=lambda w: cosine(invec, w.vector), reverse=True)
    return simrank[0]

# Example 1: king - man + woman = queen 
king = nlplg('king')
man = nlplg('man')
woman = nlplg('woman')
combo = king.vector - man.vector + woman.vector
print("king - man + woman = ", most_similar_vecs(combo, ['king','man','woman']).lower_)

# Example 2: king - man + boy = prince
king = nlplg('king')
man = nlplg('man')
boy = nlplg('boy')
combo = king.vector - man.vector + boy.vector
print("king - man + boy = ", most_similar_vecs(combo, ['king','man','boy']).lower_)

king - man + woman =  queen
king - man + boy =  prince


This is a really cool result: we can do arithmetic on the vectors that encode the different traits of each word and gain more understanding about the words themselves.
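Why this arithmetic works is easier to see on vectors small enough to read. In the sketch below every word gets a hand-made three-dimensional vector whose axes stand for "royal", "female", and "adult", each set to +1 or -1. These vectors are entirely hypothetical (real embeddings are learned, and their dimensions are not labeled like this), and `analogy` is an illustrative helper name.

```python
import math

# Hypothetical hand-made vectors: (royal, female, adult), each +1 or -1.
toy = {
    "king":   (1, -1, 1),
    "queen":  (1, 1, 1),
    "man":    (-1, -1, 1),
    "woman":  (-1, 1, 1),
    "boy":    (-1, -1, -1),
    "prince": (1, -1, -1),
}


def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the input words, as the cell above does with `usedwords`.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print("king - man + woman =", analogy("king", "man", "woman", toy))
print("king - man + boy   =", analogy("king", "man", "boy", toy))
```

Subtracting `man` from `king` cancels the shared "male adult" traits and leaves "royal"; adding `woman` or `boy` then supplies the remaining traits.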

### Training
spaCy is extremely flexible, so we can easily train new components or add to our existing models. As an example, a common way of training a new entity type can be found [here](https://spacy.io/usage/examples#new-entity-type); a simpler version is shown below.

In [14]:
import random

# New entity label
LABEL = 'FOOD'

# Training data
TRAIN_DATA = [
    ("Burgers are very tasty.", {
        'entities': [(0, 7, 'FOOD')]
    }),
    ("Burgers are very tasty and they are great for your bulk.", {
        'entities': [(0, 7, 'FOOD')]
    }),
    ("Do I eat them?", {
        'entities': []
    }),
    ("I like burgers. Do you?", {
        'entities': [(7, 14, 'FOOD')]
    }),
    ("burgers are goood", {
        'entities': [(0, 7, 'FOOD')]
    }),
    ("yum they are tasty", {
        'entities': []
    }),
    ("how would I survive without burgers", {
        'entities': [(28, 35, 'FOOD')]
    }),
    ("burgers!", {
        'entities': [(0, 7, 'FOOD')]
    })
]

def train(model=None, new_model_name='animal', output_dir=None, n_iter=20):
    """Creates an entity recognizer and trains the entity."""
    
    # Create blank model
    nlp = spacy.blank('en')

    # Add entity recognizer to the model
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
    
    # Add new entity label
    ner.add_label(LABEL)   
    optimizer = nlp.begin_training()

    # Disable other pipes during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer, drop=0.35,
                           losses=losses)
            print("losses: ", losses['ner'])

    return nlp

entrec = train()
# test the trained model
test_text = 'Do you like burgers?'
doc = entrec(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.label_, ent.text)

losses:  29.8425466609112
losses:  19.768894641538864
losses:  0.6083590924211353
losses:  1.2559197037586947
losses:  1.999995067574418
losses:  6.938959127561024e-06
losses:  7.774443399531807e-10
losses:  0.7661760759705846
losses:  1.4613375916554925
losses:  5.808917630956588e-07
losses:  0.011225481546843802
losses:  8.21031993301105e-06
losses:  1.8904445171356201
losses:  2.2261234293325513e-16
losses:  1.5845715999603278
losses:  1.649967823160589e-20
losses:  4.826274426037046e-16
losses:  9.086029890861538e-14
losses:  1.37945756468473e-15
losses:  3.7093330535683384e-14
Entities in 'Do you like burgers?'
FOOD burgers
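The `(start, end, label)` tuples in `TRAIN_DATA` are character offsets into each sentence, and counting them by hand is error-prone. A small helper can compute them instead. This is a sketch under simple assumptions: `entity_offsets` is a hypothetical name, it only finds the first occurrence of the phrase, and the case-insensitive match assumes plain ASCII text.

```python
def entity_offsets(text, phrase, label):
    """Return (start, end, label) for the first case-insensitive match of phrase."""
    start = text.lower().find(phrase.lower())
    if start == -1:
        raise ValueError("{!r} not found in {!r}".format(phrase, text))
    return (start, start + len(phrase), label)


# Rebuild a few of the training examples above without counting characters.
texts = [
    "Burgers are very tasty.",
    "I like burgers. Do you?",
    "how would I survive without burgers",
]
examples = [
    (text, {"entities": [entity_offsets(text, "burgers", "FOOD")]})
    for text in texts
]

for text, annotations in examples:
    print(annotations["entities"], text)
```

The end offset is exclusive, so `text[start:end]` gives back exactly the entity text, which is a quick way to sanity-check any annotation.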


### Conclusion

Overall, Natural Language Processing is a very important part of data science. It allows us to understand data through the realm of words instead of just numbers. spaCy provides a lot of cool functionality and makes many of the tasks we would want to do, such as preprocessing and understanding text, much easier than other libraries do. Thus, spaCy is a very useful library to learn and will hopefully prove useful to you at some point.

### References

More detail about the libraries can be found at the following links.

1. spaCy: https://spacy.io/
2. Wikipedia Python API: https://wikipedia.readthedocs.io/en/latest/
3. The spaCy tokenizer: https://explosion.ai/blog/how-spacy-works