# CSCE 670 Spotlight: spaCy

### Submitted by Sourjya Banerjee. 

## Introduction

spaCy is a free, open-source Python library for Natural Language Processing (NLP). Its major characteristic is that it has been built for industrial use: it is designed to handle large amounts of data for information extraction problems, or to pre-process data for deep learning applications. In this notebook, we explore spaCy and the features it offers for various natural language processing tasks.

## Installation

To install spaCy, we can use the pip installer for Python. The relevant command is `pip install -U spacy`.

Alternatively, if you are using Anaconda, you could use the command `conda install -c conda-forge spacy`.

spaCy can also be run with GPUs, by specifying the CUDA version you are using. The example for cuda92 is `pip install -U spacy[cuda92]`.

Once this is installed, you can activate it by calling spacy.prefer_gpu() anywhere in the script before the model is loaded.
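
As a minimal sketch of that ordering (assuming spaCy is installed; spacy.blank("en") is only a stand-in for whichever model you actually load), the GPU request comes first and the pipeline is created afterwards. prefer_gpu() returns a boolean, so a script can fall back to the CPU without failing:

```python
import spacy

# Ask for the GPU before any pipeline is created. prefer_gpu() returns
# True if a GPU was activated and False otherwise, without raising.
gpu_active = spacy.prefer_gpu()
print("Running on GPU" if gpu_active else "Running on CPU")

# Stand-in for loading a trained model: a blank English pipeline
# needs no separate download.
nlp = spacy.blank("en")
doc = nlp("spaCy is ready.")
print(len(doc))
```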

## Comparison with NLTK

NLTK and spaCy are the two most common libraries used for natural language processing (NLP) in Python. NLTK is researcher-oriented, while spaCy is clearly industry-oriented. spaCy does not offer the variety of algorithms that NLTK does, but it generally runs faster than NLTK on the same tasks.
One major difference is that NLTK behaves like a string processing library, taking strings as input and returning strings as output. spaCy uses an object-oriented approach and returns document objects whose words and sentences are objects themselves.

## NLP using spaCy

spaCy supports a rich set of NLP tasks. In this survey, we go over a few of the most common ones and the features spaCy offers for each.

When we first use spaCy, we need to specify the language. In this notebook, we use English for our analysis. The language model is imported and loaded into a variable, conventionally named nlp. Calling nlp on a string then creates a Doc object, on which we can do our analysis.
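
A minimal sketch of this workflow, assuming spaCy is installed. It uses spacy.blank("en"), which builds an English pipeline containing only the tokenizer, so it runs without downloading a trained model. The sample sentence is made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Calling nlp on a string returns a Doc, a sequence of Token objects.
doc = nlp("Hello, world!")

for token in doc:
    # Each Token carries attributes such as its index, text and flags.
    print(token.i, token.text, token.is_punct)

# Slicing a Doc gives a Span, which is also an object rather than a string.
span = doc[0:2]
print(type(span).__name__, span.text)
```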

### 1. Tokenization

Tokenization is one of the most common NLP tasks: given a document, we split it into its constituent tokens. The tokens in a text document are words, punctuation marks, spaces, symbols and other possible elements. Let's see how we can do tokenization in spaCy. We can do it in two ways: by calling nlp on the text to create a Doc object, or by using the Tokenizer class directly. Both approaches are shown here. We use English as the language for this case.

In [12]:
import spacy
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
nlp = English()
def tokenize_text(text):
    
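    # Approach 1: the pipeline's own tokenizer, with English punctuation rules
    doc = nlp(text)
    print("Approach 1")
    for token in doc:
        print(token.text)
    print("\n")
    # Approach 2: a bare Tokenizer built from the vocab only (whitespace splitting)
    print("Approach 2")
    tokenizer = Tokenizer(nlp.vocab)
    tokens = tokenizer(text)
    # enumerate yields (index, token) tuples, which is what gets printed
    for indexed_token in enumerate(tokens):
        print(indexed_token)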
    
tokenize_text("This is a really long sentence, and I added some $1 bills to improve variation!")

Approach 1
This
is
a
really
long
sentence
,
and
I
added
some
$
1
bills
to
improve
variation
!


Approach 2
(0, This)
(1, is)
(2, a)
(3, really)
(4, long)
(5, sentence,)
(6, and)
(7, I)
(8, added)
(9, some)
(10, $1)
(11, bills)
(12, to)
(13, improve)
(14, variation!)


For Approach 2, each printed item is an (index, token) tuple, because the loop iterates over enumerate(tokens). We also see that the outputs differ between the two approaches. Approach 1 calls nlp, which uses the English tokenizer with its punctuation rules and exceptions, so the dollar sign and 1 become separate tokens, and sentence and , are split apart. Approach 2 builds a bare Tokenizer from the vocabulary alone, with no prefix, suffix or infix rules, so it splits only on whitespace: it keeps `$1` together, but it also leaves "sentence," as a single token for the 6th element. Thus, either approach can be used depending on our particular use case.
For example, if you were interested in sentiment analysis, the second tokenizer could be useful, as it keeps symbols and punctuation attached to the text, as in "variation!". However, if you were interested purely in the words present in the text and not the symbols or punctuation, the first approach would be the better choice, since the punctuation then ends up in separate tokens.
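
The difference between the two outputs can be mimicked in plain Python. This is only an illustration of the idea, not spaCy's tokenization algorithm: str.split stands in for a whitespace-only tokenizer, and a small regular expression (an assumption made for this sketch) stands in for punctuation-aware rules.

```python
import re

text = "I added some $1 bills, to improve variation!"

# Whitespace-only splitting keeps symbols and punctuation attached.
whitespace_tokens = text.split()

# Punctuation-aware splitting: runs of word characters, or any single
# character that is neither a word character nor whitespace.
rule_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(rule_tokens)
```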

### 2. Part-of-speech (POS) tagging and Visualization

Part-of-speech tagging is an important NLP functionality: given a sentence, we try to figure out the part of speech that each token represents (whether it is a noun, a verb, etc.). This is useful in sentiment analysis, and for working out the context and meaning of a word. Let's see how we can do part-of-speech tagging in spaCy. The output gives each token and its corresponding part of speech.

We would also like to introduce the displacy module, a visualization package integrated with spaCy. It can be used to generate visualizations of a text and its corresponding analysis, which is useful in presentations.

To use displacy, you need to import it from spacy and pass the Doc object to its render function, along with the style you want to present it in.

In [14]:
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I enjoy hiking, running and writing code in Python. Python is a very expressive language.")

for token in doc:
    print(token.text, token.pos_)
displacy.render(doc, style="dep")

I PRON
enjoy VERB
hiking VERB
, PUNCT
running VERB
and CCONJ
writing VERB
code NOUN
in ADP
Python PROPN
. PUNCT
Python PROPN
is AUX
a DET
very ADV
expressive ADJ
language NOUN
. PUNCT


Thus, we see displacy can be used to generate visualizations of the text data, showing the annotated information alongside. It is the visualization module built into spaCy and is very useful.

Another thing to note is that in the previous cell, we used a model called "en_core_web_sm". This is one of the many pre-trained, language-specific machine learning models available for spaCy. To use a model, you first need to download it, and then load it as shown in the cell.
The command for downloading it is `python -m spacy download en_core_web_sm`.

A full list of available models is at https://spacy.io/models. It contains both language-specific and multi-language models. For our discussion going forward, we will be using the en_core_web_sm model.

### 3. Document pre-processing for model, to remove unnecessary information

Thus, we can see that each token can be tagged with its corresponding part of speech using the Doc object created by spaCy. The Doc object provides a lot of useful functionality, and its tokens expose many similar attributes.
To explore this further, let's write a small program that pre-processes some text by removing stopwords and lemmatizing what remains. According to Wikipedia, "lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form".

In [13]:
def pre_process(text):
    doc = nlp(text)
    new_text = ""
    for token in doc:
        if token.is_stop:
            continue
        new_text = new_text + " " + token.lemma_
    print(new_text)

text = "This is a really lovely place, and I enjoy visiting it"
pre_process(text)

 lovely place , enjoy visiting


Thus, we see the function keeps the most informative words of the text and reduces it to a basic form. This is the part of a document that is typically indexed when building search engines. We use the is_stop attribute of the token object to filter out the stopwords, and then use the lemma_ attribute to keep the lemmatized form of each remaining word. Note that in the output shown, "visiting" was not reduced to "visit": the result of lemma_ depends on the pipeline and lemmatization data that were loaded when the cell ran.
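
A variant of the same idea, sketched under a few assumptions. It collects the tokens in a list instead of concatenating strings, which avoids the leading space, and it also drops punctuation. It uses spacy.blank("en") so that it runs without a downloaded model; a blank pipeline has no lemmatizer, so this sketch keeps the lowercased token text rather than lemma_. The function name filter_tokens is hypothetical.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer and stop word list only

def filter_tokens(text):
    """Return the lowercased tokens that are neither stop words nor punctuation."""
    doc = nlp(text)
    return [token.lower_ for token in doc
            if not token.is_stop and not token.is_punct]

kept = filter_tokens("This is a really lovely place, and I enjoy visiting it")
print(" ".join(kept))
```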

### 4. Named Entity Recognition

Named Entity Recognition is the process of finding named entities within a document or text. Named entities are meaningful names associated with a particular entity, such as companies like Google and Microsoft, or countries like India. If a name occurs in the text, spaCy can recognize the name and the type of entity it is associated with. This is very useful for extracting context and information from the text, which can then be used for further analysis.
We also use displacy to display our data and its analysis alongside.

A list of labels and their meanings is available at https://spacy.io/api/annotation


In [15]:
def named_entity_recognition(text):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    displacy.render(doc, style="ent")

#Text taken from Wikipedia, Texas A&M University page notable alumni section
text = """Many Aggies have attained local, national, and international prominence. 
        Jorge Quiroga and Martin Torrijos have served as heads of state for Bolivia and Panama, respectively, 
        and Rick Perry is the current United States Secretary of Energy, 
        and former Governor of Texas and 2012 US Presidential candidate. 
        Robert Gates, United States Secretary of Defense in the George W. Bush and Obama administrations, 
        is a past president of the university. 
        Congressmen Joe Barton, Bill Flores, Jeb Hensarling, and Louie Gohmert, 
        and former Austin, Texas, mayor Will Wynn are all graduates"""
named_entity_recognition(text)


Jorge Quiroga PERSON
Martin Torrijos PERSON
Bolivia GPE
Panama GPE
Rick Perry PERSON
United States GPE
Energy ORG
Texas GPE
2012 DATE
US GPE
Robert Gates PERSON
United States GPE
Defense ORG
Obama LOC
Joe Barton PERSON
Bill Flores PERSON
Jeb Hensarling PERSON
Louie Gohmert PERSON
Austin GPE
Texas GPE
Will Wynn PERSON


Thus, we see a clear visualization of all the entities found in the text. However, the model does not recognize Aggies. There are also a couple of errors, as it does not identify Obama correctly. This is probably because the full name Barack Obama is not present in the sentence, so the model treats Obama as a non-geopolitical location (LOC).
With spaCy, we can add new custom entities and labels to our model, so that it classifies them correctly. To do this, we can use the EntityRuler component, which is added to the pipeline by calling nlp.add_pipe(). Let's see how this can be done. We'll also add a custom entity for Aggies.

In [17]:
from spacy import displacy
from spacy.pipeline import EntityRuler

def named_entity_recognition_fixed(text):
    nlp = spacy.load("en_core_web_sm")
    # spaCy v2 calling convention: build the component, then pass the object to add_pipe
    ruler = EntityRuler(nlp, overwrite_ents=True)
    patterns = [{"label": "PERSON", "pattern": "Obama"},
                {"label": "TAMU students", "pattern": "Aggies"}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    displacy.render(doc, style="ent")

#Text taken from Wikipedia, Texas A&M University page notable alumni section
text = """Many Aggies have attained local, national, and international prominence. 
        Jorge Quiroga and Martin Torrijos have served as heads of state for Bolivia and Panama, respectively, 
        and Rick Perry is the current United States Secretary of Energy, 
        and former Governor of Texas and 2012 US Presidential candidate. 
        Robert Gates, United States Secretary of Defense in the George W. Bush and Obama administrations, 
        is a past president of the university. 
        Congressmen Joe Barton, Bill Flores, Jeb Hensarling, and Louie Gohmert, 
        and former Austin, Texas, mayor Will Wynn are all graduates."""
named_entity_recognition_fixed(text)


Aggies TAMU students
Jorge Quiroga PERSON
Martin Torrijos PERSON
Bolivia GPE
Panama GPE
Rick Perry PERSON
United States GPE
Energy ORG
Texas GPE
2012 DATE
US GPE
Robert Gates PERSON
United States GPE
Defense ORG
Obama PERSON
Joe Barton PERSON
Bill Flores PERSON
Jeb Hensarling PERSON
Louie Gohmert PERSON
Austin GPE
Texas GPE
Will Wynn PERSON


Thus, we have now defined a custom label, TAMU students, for Aggies, and we have labelled Obama as a person rather than a non-geopolitical location. Note that to overwrite an existing label, we had to pass overwrite_ents=True when initializing the EntityRuler. Without it, we would not have been able to overwrite the label for Obama. This flexibility of rule-based labelling makes spaCy a very powerful package.
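
The cell above follows the spaCy v2 calling convention. As a hedged sketch for readers on spaCy v3, where components are added by their registered name, the same rules could be written as below. It uses spacy.blank("en") so that it runs without a trained model, which means only the rule-based entities appear; with en_core_web_sm loaded instead, the ruler would be appended after the statistical NER component. The label TAMU_STUDENTS and the shortened sentence are illustrative choices.

```python
import spacy

nlp = spacy.blank("en")

# In spaCy v3, add_pipe takes the component's registered name and
# returns the component. overwrite_ents lets the rules replace
# entities set by earlier components.
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Obama"},
    {"label": "TAMU_STUDENTS", "pattern": "Aggies"},
])

doc = nlp("Many Aggies served in the Bush and Obama administrations.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```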

### 5. Noun chunks

A noun chunk is a phrase built around a noun: it consists of the noun and the words describing that noun. Common examples would be "the beautiful starry sky" or "the deep blue sea". spaCy can extract these noun chunks from the text, and they can then be used to understand the meaning of the sentence. Let's see how that can be done. We use text taken from https://en.wikipedia.org/wiki/Alps

In [9]:
def noun_chunks(text):
    doc = nlp(text)
    for chunk in doc.noun_chunks:
        print(chunk.text)
        
text = """The highest portion of the range is divided by the glacial trough of the Rhône valley, 
        from Mont Blanc to the Matterhorn and Monte Rosa on the southern side, and the Bernese Alps on the northern. 
        The peaks in the easterly portion of the range, in Austria and Slovenia, are smaller than those 
        in the central and western portions."""
noun_chunks(text)

The highest portion
the range
the glacial trough
the Rhône valley
Mont Blanc
the Matterhorn and Monte Rosa
the southern side
the Bernese Alps
the northern
The peaks
the easterly portion
the range
Austria
Slovenia
the central and western portions


We see the noun chunks have now been extracted successfully. If we want, we can also get additional information such as the root of the noun chunk (i.e. the noun), along with the dependency relation between the root and its head. All of these are available as attributes of the chunk object. This is one of the main advantages of using spaCy: it returns objects that carry various useful attributes of the operation performed. Depending on our usage, we can feed those attributes into our models.
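
A sketch of reading those attributes, assuming spaCy v3. To keep it runnable without a downloaded model, the part-of-speech tags and dependency arcs are written by hand here, which is an assumption made purely for illustration; with a trained pipeline such as en_core_web_sm they are predicted by the tagger and parser.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations (an assumption for this sketch); a trained
# pipeline would predict them.
words = ["The", "deep", "blue", "sea", "sparkles"]
pos = ["DET", "ADJ", "ADJ", "NOUN", "VERB"]
heads = [3, 3, 3, 4, 4]  # index of each token's head; the root points to itself
deps = ["det", "amod", "amod", "nsubj", "ROOT"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

for chunk in doc.noun_chunks:
    # chunk.root is the noun, root.dep_ its relation to root.head
    print(chunk.text, "|", chunk.root.text, "|", chunk.root.dep_, "|", chunk.root.head.text)
```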

We have thus seen that spaCy has a large amount of built-in functionality that allows us to implement NLP tasks successfully.


### Word embedding and similarity

spaCy can also quantify how similar words are by using word embeddings, which are distributed vector representations of the text. You can use the embeddings shipped with a model to estimate the similarity between words. The spaCy docs suggest that, for best results, we should use a larger model that includes more word vectors. We therefore download and use the en_core_web_lg model, following the reference at https://spacy.io/usage/vectors-similarity, and see what results we get.

In [18]:
def word_similarity(text):
    nlp = spacy.load("en_core_web_lg") 
    tokens = nlp(text)
    for token1 in tokens:
        for token2 in tokens:
            print(token1.text, token2.text, token1.similarity(token2))
            
word_similarity("wallet bag bottle ")

wallet wallet 1.0
wallet bag 0.6803649
wallet bottle 0.34340772
bag wallet 0.6803649
bag bag 1.0
bag bottle 0.50149715
bottle wallet 0.34340772
bottle bag 0.50149715
bottle bottle 1.0


The output score measures how similar two words are, based on their word vectors. Here we can see that the model identifies a fair amount of similarity between wallet and bag, but not much between bottle and wallet. bottle and bag have some similarity, possibly because the two words often appear in similar contexts.
Thus, we can use the model to estimate the similarity of words present in the text.
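
By default, spaCy computes similarity as the cosine similarity between two vectors. The formula can be sketched in plain Python. The three-dimensional vectors below are invented purely for illustration and are not taken from en_core_web_lg, whose vectors have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for this example.
wallet = [0.9, 0.1, 0.3]
bag = [0.8, 0.3, 0.4]
bottle = [0.1, 0.9, 0.2]

print(cosine_similarity(wallet, wallet))
print(cosine_similarity(wallet, bag))
print(cosine_similarity(wallet, bottle))
```

A word compared with itself scores 1.0, and the score is symmetric, which matches the pattern in the output table above.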

## Conclusion

We have thus explored spaCy and tried out various common NLP tasks using the package. We find that it is a very useful package with a lot of built-in and customizable functionality. It is faster than NLTK, but does not have as many algorithms implemented. Thus, it is well suited to industrial use, while NLTK is still widely used in the research space.
I hope this spotlight can serve as a gentle introduction to the spaCy package and the features it has to offer for language processing.

### References used

1) https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2

2) https://spacy.io/

3) https://www.wikipedia.org/