<a href="https://colab.research.google.com/github/rahiakela/applied-nlp-in-enterprise/blob/main/01-introduction-to-nlp/perform_nlp_tasks_using_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Perform NLP Tasks using SpaCy

Let’s now use SpaCy for our NLP tasks.

First, install spaCy. For more resources on how to install spaCy, please visit the [official SpaCy website](https://spacy.io/usage).

If you haven't installed spaCy already, these commands will get you everything you need. If you're running them in a notebook, prefix each line with a `!` character, as we've done before.

In [None]:
%%shell

pip install -U "spacy[cuda110,transformers,lookups]==3.0.3"
pip install -U spacy-lookups-data==1.0.0
pip install cupy-cuda110==8.5.0
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_sm

In [None]:
from google.colab import files
files.upload() # upload kaggle.json file

In [9]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json

# download the Jeopardy questions dataset from Kaggle
kaggle datasets download -d tunguz/200000-jeopardy-questions
unzip -qq 200000-jeopardy-questions.zip
rm -rf 200000-jeopardy-questions.zip

kaggle.json
Downloading 200000-jeopardy-questions.zip to /content
 70% 8.00M/11.5M [00:00<00:00, 81.1MB/s]
100% 11.5M/11.5M [00:00<00:00, 73.3MB/s]




## SpaCy Pretrained Language Model

SpaCy has pretrained language models for out-of-the-box use. Pretrained models are models that have been trained on lots of data already and are ready for us to perform inference with.

These pretrained language models will help us solve the basic NLP tasks, but more advanced users are welcome to fine-tune the pretrained models on more specific data of their choosing. This will deliver even better performance for their specific tasks at hand.

Fine-tuning is the process of taking a pretrained model and training it further on a more specific corpus of text that is relevant to the user's domain. (Taking a model developed for one task and using it as a starting point for a model on a second task is known as transfer learning.)

For example, if we worked in finance, we may decide to fine-tune a generic pretrained language model on financial documents to generate a finance-specific language model. This finance-specific language model would have even better performance on finance-related NLP tasks versus the generic pretrained language model.

SpaCy breaks out its pretrained language models into two groups: 

- core models 
- starter models

The core models are general-purpose models and will help us solve the basic NLP tasks. 

The starter models are base models useful for transfer learning; these models have pretrained weights which you could use to initialize and fine-tune for your own models. 

Think of the core models as ready-to-go models and the starter models as do-it-yourself starter kits.

We will use the ready-to-go core models to perform the basic NLP tasks. 

Let's first load the core model:

In [5]:
# Import spacy and load the language model downloaded earlier
import spacy
nlp = spacy.load("en_core_web_sm")

Now, let’s perform the first of the NLP tasks: tokenization.


## Tokenization

Tokenization is where all NLP work begins; before the machine can process any of the text it sees, it must break the text into bite-size tokens. Tokenization will segment text into words, punctuation marks, etc.

SpaCy automatically runs the entire NLP pipeline when you run a language model on the data (i.e., `nlp(SENTENCE)`), but to isolate just the tokenizer, we will invoke just the tokenizer using `nlp.tokenizer(SENTENCE)`.

Then, we will print the length of the tokens and the individual tokens.

In [6]:
# Tokenization
sentence = nlp.tokenizer("We live in Paris.")

# Length of sentence
print("The number of tokens: ", len(sentence))

# Print individual words (i.e., tokens)
print("The tokens: ")
for words in sentence:
    print(words)

The number of tokens:  5
The tokens: 
We
live
in
Paris
.


The number of tokens is 5, and the individual tokens are `"We"`, `"live"`, `"in"`, `"Paris"`, `"."`. The period at the end of the sentence is its own token.
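To build some intuition for what a tokenizer does, here is a minimal sketch that uses only Python's `re` module. It is a toy and not spaCy's algorithm (spaCy applies language-specific rules and exceptions); `toy_tokenize` is a hypothetical helper name introduced for illustration.

```python
import re

def toy_tokenize(text):
    # Runs of word characters become one token; every other
    # non-whitespace character (e.g., punctuation) is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("We live in Paris.")
print(len(tokens), tokens)
```

This toy reproduces the five tokens above, but it breaks down quickly: it splits "man's" into three pieces (`man`, `'`, `s`), which is one reason production tokenizers rely on richer rules.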

Note that the spaCy tokenizer will treat new lines (`"\n"`), tabs (`"\t"`), and whitespace characters beyond a single space (`" "`) as tokens.

Let's try the tokenizer on a slightly more complex example.

We will load in publicly available Jeopardy Questions and then run the entire SpaCy language model on a few of the questions.

In [10]:
import pandas as pd

# Import Jeopardy Questions
data = pd.read_csv("JEOPARDY_CSV.csv")

# Lowercase and strip whitespace from column names
data.columns = [col.lower().strip() for col in data.columns]

# Reduce size of data (copy so that adding a column later does not
# trigger pandas' SettingWithCopyWarning)
data = data[0:1000].copy()

# Tokenize Jeopardy Questions
data["question_tokens"] = data["question"].apply(lambda x: nlp(x))

We have now run the full spaCy pipeline, tokenizer included, on each of the first 1,000 Jeopardy questions.

To make sure everything worked right, let’s view the first question and the tokens created.

In [11]:
# View first question
example_question = data.question[0]
example_question_tokens = data.question_tokens[0]
print("The first question is:")
print(example_question)

The first question is:
For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory


In [12]:
# Print individual tokens of first question
print("The tokens from the first question are:")
for tokens in example_question_tokens:
    print(tokens)

The tokens from the first question are:
For
the
last
8
years
of
his
life
,
Galileo
was
under
house
arrest
for
espousing
this
man
's
theory


This is the first basic NLP task machines perform; now we can move on to the other NLP tasks. Well done!

## Part-of-speech Tagging

After tokenization, machines need to tag each token with relevant metadata such as the part-of-speech of each token. This is what we will perform now.

Since we applied the entire SpaCy language model to the Jeopardy questions, the tokens generated already have a lot of the meaningful attributes/metadata we care about.

SpaCy uses pre-loaded statistical models to predict the part-of-speech of each token. We loaded the English language statistical model earlier using the following code: `spacy.load("en_core_web_sm")`.

Let's take a look at the Part-of-speech (POS) Tagging attributes for the tokens in the first question.

In [None]:
# Print Part-of-speech tags for tokens in the first question
print("Here are the Part-of-speech tags for each token in the first question:")
for token in example_question_tokens:
    print(token.text, token.pos_, spacy.explain(token.pos_))

Here are the Part-of-speech tags for each token in the first question:
For ADP adposition
the DET determiner
last ADJ adjective
8 NUM numeral
years NOUN noun
of ADP adposition
his PRON pronoun
life NOUN noun
, PUNCT punctuation
Galileo PROPN proper noun
was AUX auxiliary
under ADP adposition
house NOUN noun
arrest NOUN noun
for ADP adposition
espousing VERB verb
this DET determiner
man NOUN noun
's PART particle
theory NOUN noun


The first token "For" is marked as an adposition (e.g., in, to, during), the second token "the" is a determiner (e.g., a, an, the), the third token "last" is an adjective, the fourth token "8" is a numeral, the fifth token "years" is a noun, and so on.
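Once tokens carry part-of-speech tags, downstream code can filter or count on them. The sketch below works on the (token, tag) pairs copied from the output above rather than calling spaCy, so it runs on its own; the variable names are ours and purely illustrative.

```python
from collections import Counter

# (token, POS tag) pairs copied from the output above
tagged = [
    ("For", "ADP"), ("the", "DET"), ("last", "ADJ"), ("8", "NUM"),
    ("years", "NOUN"), ("of", "ADP"), ("his", "PRON"), ("life", "NOUN"),
    (",", "PUNCT"), ("Galileo", "PROPN"), ("was", "AUX"), ("under", "ADP"),
    ("house", "NOUN"), ("arrest", "NOUN"), ("for", "ADP"),
    ("espousing", "VERB"), ("this", "DET"), ("man", "NOUN"),
    ("'s", "PART"), ("theory", "NOUN"),
]

# Count how often each tag occurs
tag_counts = Counter(pos for _, pos in tagged)
print(tag_counts.most_common(3))

# Keep only nouns and proper nouns, a common first step for keyword extraction
nouns = [text for text, pos in tagged if pos in {"NOUN", "PROPN"}]
print(nouns)
```

With a live pipeline, the same list could be built from `token.text` and `token.pos_` for each token in the parsed question.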

Figure 1-2 displays the full list of all possible POS tags, including descriptions and examples of each. (Please visit the [spaCy POS documentation](https://spacy.io/api/annotation) for more.)

![Part-of-speech Tags](https://github.com/nlpbook/nlpbook/blob/main/images/hulp_0102.png?raw=1)

Now that we have used the tokenizer to create tokens for each sentence and part-of-speech tagging to tag each token with meaningful attributes, let's label each token's relationship with other tokens in the sentence. In other words, let's find the inherent structure among the tokens given the part-of-speech metadata we have generated.

## Dependency Parsing

Dependency parsing is the process of finding these relationships among the tokens. Once we have performed this step, we will be able to visualize the relationships using a dependency parsing graph.

First, let's view the dependency parsing tags for each of the tokens in the first question.

In [None]:
# Print Dependency Parsing tags for tokens in the first question
for token in example_question_tokens:
    print(token.text, token.dep_, spacy.explain(token.dep_))

For prep prepositional modifier
the det determiner
last amod adjectival modifier
8 nummod numeric modifier
years pobj object of preposition
of prep prepositional modifier
his poss possession modifier
life pobj object of preposition
, punct punctuation
Galileo nsubj nominal subject
was ROOT None
under prep prepositional modifier
house compound compound
arrest pobj object of preposition
for prep prepositional modifier
espousing pcomp complement of preposition
this det determiner
man poss possession modifier
's case case marking
theory dobj direct object


The first token "For" is marked as a prepositional modifier, the second token "the" is a determiner, the third token "last" is an adjectival modifier, the fourth token "8" is a numeric modifier, the fifth token "years" is the object of preposition, and so on.

Figures 1-3 and 1-4 list all the possible syntactic dependency tags, including descriptions and examples of each. (Please visit the [spaCy documentation](https://spacy.io/api/annotation) for more.)

![Syntactic Dependency Parsing Labels Part 1](https://github.com/nlpbook/nlpbook/blob/main/images/hulp_0103.png?raw=1)

![Syntactic Dependency Parsing Labels Part 2](https://github.com/nlpbook/nlpbook/blob/main/images/hulp_0104.png?raw=1)

These tags help define the relationships among the tokens; using these tags, we can understand the relationship structure among the tokens that make up the sentence.

Dependency parsing is hard to unpack, so let’s use spaCy’s built-in visualizer to get a better sense of the dependencies across the tokens.

In [None]:
# Visualize the dependency parse
from spacy import displacy

displacy.render(example_question_tokens, style='dep',
                jupyter=True, options={'distance': 120})

Figure 1-5 displays the first part of the sentence parsed.

![Dependency Parsing Example - Part 1](https://github.com/nlpbook/nlpbook/blob/main/images/hulp_0105.png?raw=1)

Notice the importance of "For" and "years" in the prepositional phrase -- multiple tokens map to these two.

Figure 1-6 displays the second part of the sentence parsed.

![Dependency Parsing Example - Part 2](https://github.com/nlpbook/nlpbook/blob/main/images/hulp_0106.png?raw=1)

The token "was" connects to the nominal subject "Galileo" and two prepositional phrases: "under house arrest" and "for espousing this man's theory".

These figures show how certain tokens can be grouped together and how the groups of tokens are related to one another. This is an essential step in natural language processing. First, the machine breaks the sentence apart into tokens. Then it assigns metadata to each token (e.g., part of speech), and then it connects the tokens based on their relationship to one another.
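To make the idea of "connecting tokens" concrete, here is a small sketch that stores a parse as (text, label, head index) triples and walks it to recover a phrase. The labels come from the output above; the head indices are written by hand from the description of the parse (they are not produced by spaCy here), and `subtree` is a hypothetical helper.

```python
# (text, dependency label, index of head token) for "Galileo was under house arrest".
# Head indices are hand-written for illustration, not generated by a parser.
tokens = [
    ("Galileo", "nsubj", 1),
    ("was", "ROOT", 1),        # the root points at itself
    ("under", "prep", 1),
    ("house", "compound", 4),
    ("arrest", "pobj", 2),
]

def subtree(index):
    """Return the indices of a token and all of its descendants, in sentence order."""
    found = {index}
    changed = True
    while changed:
        changed = False
        for i, (_, _, head) in enumerate(tokens):
            if head in found and i not in found:
                found.add(i)
                changed = True
    return sorted(found)

# The prepositional phrase headed by "under"
phrase = " ".join(tokens[i][0] for i in subtree(2))
print(phrase)
```

Starting from the root ("was") instead returns every token, which is why the root is a natural entry point for walking a sentence.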

Let's move on to chunking, which is another form of grouping of related tokens.

## Chunking

Let’s perform chunking on the following sentence: "My parents live in New York City".

In [None]:
# Print tokens for example sentence without chunking
for token in nlp("My parents live in New York City."):
    print(token.text)

My
parents
live
in
New
York
City
.


Chunking groups related tokens into a single unit, or chunk.

With chunking, the spaCy language model will identify "My parents" and "New York City" as noun chunks much like humans would when parsing a sentence in their head.

In [None]:
# Print chunks for example sentence
for chunk in nlp("My parents live in New York City.").noun_chunks:
    print(chunk.text)

My parents
New York City


By grouping related tokens into chunks, the machine will have an easier time processing the sentence. Instead of viewing each token in isolation, the machine now recognizes that certain tokens are related to others, a necessary step in natural language processing.
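As a rough illustration of how chunking can build on part-of-speech tags, the sketch below groups consecutive "nominal" tokens into chunks. It is a toy: the tags are assigned by hand (an assumption, standing in for `token.pos_`), `toy_noun_chunks` is a hypothetical helper, and spaCy's own `noun_chunks` relies on the dependency parse rather than on adjacent tags alone.

```python
from itertools import groupby

# Tokens with hand-assigned POS tags (an assumption for illustration)
tagged = [
    ("My", "PRON"), ("parents", "NOUN"), ("live", "VERB"), ("in", "ADP"),
    ("New", "PROPN"), ("York", "PROPN"), ("City", "PROPN"), (".", "PUNCT"),
]

# Tags that may appear inside a noun phrase in this toy grammar
NOMINAL = {"DET", "ADJ", "NUM", "PRON", "NOUN", "PROPN"}

def toy_noun_chunks(tagged_tokens):
    chunks = []
    # Group consecutive tokens by whether their tag is nominal
    for is_nominal, group in groupby(tagged_tokens, key=lambda t: t[1] in NOMINAL):
        if is_nominal:
            chunks.append(" ".join(text for text, _ in group))
    return chunks

print(toy_noun_chunks(tagged))
```

On this sentence the toy happens to agree with spaCy, but it would wrongly merge two adjacent noun phrases, which is where the dependency parse earns its keep.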

## Lemmatization

Now, let’s go a step further and perform lemmatization. If you recall, lemmatization is the process of converting words into their base (or canonical) forms: for example, "horses" to "horse", "slept" to "sleep", and "biggest" to "big". Just like part-of-speech tagging, dependency parsing, and chunking, lemmatization helps the machine "process" the tokens. With lemmatization, the machine is able to simplify the tokens by converting some of them into their most basic forms.

Stemming is a related concept, but stemming is simpler. Stemming reduces words to their word stems, often using a rule-based approach.

Lemmatization is a more difficult process but generally results in better outputs; stemming sometimes creates outputs that are nonsensical (non-words). In fact, spaCy does not even support stemming; it supports only lemmatization.
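The difference is easy to see with a toy comparison. The sketch below pits a naive suffix stripper against a tiny lookup table; neither is a real library's implementation (it is not the Porter algorithm, and the lookup table is hand-written), and both function names are hypothetical.

```python
def toy_stem(word):
    # Strip the first matching suffix; no dictionary, no context
    for suffix in ("ing", "est", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A tiny hand-written lookup table standing in for a real lemmatizer
TOY_LEMMAS = {
    "horses": "horse",
    "slept": "sleep",
    "biggest": "big",
    "espousing": "espouse",
}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for word in ["horses", "slept", "biggest", "espousing"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer yields non-words such as "hors", "bigg", and "espous" and misses the irregular "slept" entirely, while the lookup returns proper base forms for all four.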

We will create a DataFrame to store and view the original and lemmatized versions of tokens side-by-side.

In [None]:
# Print Lemmatization for tokens in the first question
lemmatization = pd.DataFrame(
    [(token.text, token.lemma_) for token in example_question_tokens],
    columns=["original", "lemmatized"],
)

lemmatization

| | original | lemmatized |
|---|---|---|
| 0 | For | for |
| 1 | the | the |
| 2 | last | last |
| 3 | 8 | 8 |
| 4 | years | year |
| 5 | of | of |
| 6 | his | his |
| 7 | life | life |
| 8 | , | , |
| 9 | Galileo | Galileo |


As you can see, words such as "years" are lemmatized to their base forms; the same happens to "was" and "espousing", which fall beyond the first ten rows shown above. The other tokens are already in their base forms, so the lemmatized output is the same as the original. Where possible, lemmatization reduces tokens to their simplest forms, which makes sentences easier for the machine to parse.

## Named Entity Recognition

Combined, everything we've done so far (tokenization, part-of-speech tagging, dependency parsing, chunking, and lemmatization) makes it possible for machines to perform more complex NLP tasks.

One example of a complex NLP task is named entity recognition (NER). Named entity recognition parses notable entities in natural language and labels them with their appropriate class label. For example, NER labels names of people with the label "Person" and names of cities with the label "Location." 

NER is possible only because the machine is able to perform text classification using the metadata generated by the earlier NLP tasks we've covered. Without the metadata from the earlier NLP tasks, the machine would have a very difficult time performing NER because it would not have enough features to classify names of people as "Person," names of cities as "Location," etc.

NER is a valuable NLP task because many organizations need to process large volumes of documents, and the simple act of labeling notable entities with the appropriate class label is a meaningful first step in analyzing the textual information, particularly for information retrieval tasks (e.g., finding information that you need as quickly as possible).

These documents include contracts, leases, real estate purchase agreements, financial reports, news articles, etc. Before named entity recognition, humans would have had to label such entities by hand (at many companies, they still do). Now, NER provides an algorithmic way to perform this task.

SpaCy's NER model is able to label many types of notable entities ("real-world objects"). Figure 1-7 displays the current set of entity types the spaCy model is able to recognize.

![spaCy NER Entity Types](https://github.com/nlpbook/nlpbook/blob/main/images/hulp_0107.png?raw=1)

It's very important to note that NER is, at its very core, a classification model. Using the context of tokens around the token of interest, the NER model predicts the entity type of the token of interest. NER is a statistical model, and the corpus of data the model has trained on matters a lot. Developers in enterprise will often fine-tune the base NER models on their particular corpus of documents to achieve better performance than the base NER model.

Let's try the spaCy NER model. We will perform NER on the first sentence describing George Washington, the first president of the United States, from his [Wikipedia article](https://en.wikipedia.org/wiki/George_Washington).

Here's the sentence: George Washington was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States from 1789 to 1797.

As you can see above, there are several real-world objects to recognize here including George Washington and the United States.

In [None]:
# Print NER results
example_sentence = "George Washington was an American political leader, \
military general, statesman, and Founding Father who served as the \
first president of the United States from 1789 to 1797.\n"

print(example_sentence)

print("Text Start End Label")
doc = nlp(example_sentence)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

George Washington was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States from 1789 to 1797.

Text Start End Label
George Washington 0 17 PERSON
American 25 33 NORP
first 119 124 ORDINAL
the United States 138 155 GPE
1789 to 1797 161 173 DATE



There are four elements to the output. First, the text that comprises the entity; note that the text could be a single token or a set of tokens that makes up the entire entity. Second, the start position of the text in the sentence, as a character offset. Third, the end position, which is the offset of the character just after the entity. Fourth, the label of the entity.
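Because the start and end positions are character offsets, we can check them with plain string slicing. The sketch below copies the sentence and the (start, end, label) values from the output above and does not call spaCy at all; the `spans` list is our own name for that copied data.

```python
example_sentence = (
    "George Washington was an American political leader, "
    "military general, statesman, and Founding Father who served as the "
    "first president of the United States from 1789 to 1797."
)

# (start, end, label) triples copied from the NER output above
spans = [
    (0, 17, "PERSON"),
    (25, 33, "NORP"),
    (119, 124, "ORDINAL"),
    (138, 155, "GPE"),
    (161, 173, "DATE"),
]

# Slicing with [start:end] recovers each entity because end is exclusive
for start, end, label in spans:
    print(label, repr(example_sentence[start:end]))
```

This is handy in practice: storing only the offsets and the label is enough to highlight or extract entities from the original text later.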

To make the value of NER even more apparent, let’s use spaCy’s built-in visualizer to visualize this sentence with the relevant entity labels.

In [None]:
# Visualize NER results
displacy.render(doc, style='ent', jupyter=True)

As you can see in Figure 1-8, the spaCy NER model does a great job labeling the entities. "George Washington" is a person and the text starts at index 0 and ends at index 17. His nationality is "American". "first" is labeled as an ordinal number, "the United States" is a geopolitical entity, and "1789 to 1797" is a date.

![Visualize NER Results](https://github.com/nlpbook/nlpbook/blob/main/images/hulp_0108.png?raw=1)

The sentence is beautifully rendered with color-coded labels based on the entity type. This is a powerful and meaningful NLP task; you could see how doing this machine-driven labeling at scale without humans could add a lot of value to enterprises that work with a lot of textual data. Of course, to train such a model in the first place, you do need to have a lot of humans that annotate textual data. And you may need humans in the loop to deal with edge cases in production. You are never really human-free, but perhaps you could ultimately get to a process that is mostly human-free.

## Named Entity Linking

Another complex yet very useful NLP task in enterprise is named entity linking (NEL). Named entity linking resolves a textual entity to a unique identifier in a knowledge base. In other words, NEL resolves the entity in your source text to a canonical version in a knowledge database. Let’s try to link all entities that are named persons to Google’s Knowledge Graph. We will make a Google Knowledge Graph API call to perform this named entity linking. (You will need your own [Google Knowledge Graph API key](https://developers.google.com/knowledge-graph) to perform this API call on your own machine. We will perform this using our own API key for illustrative purposes.)

Here is the function to perform this API call.

In [None]:
# Import libraries
import requests

# Define Google Knowledge Graph API Result function
def returnGraphResult(query, key, entityType):
    # Only link entities that spaCy labeled as persons
    if entityType != "PERSON":
        return "no_match", "no_match"
    # Pass the query as params so requests builds and encodes the URL
    resp = requests.get(
        "https://kgsearch.googleapis.com/v1/entities:search",
        params={"query": query, "key": key},
    )
    items = resp.json().get("itemListElement", [])
    if not items:
        return "no_match", "no_match"
    # Use the top-ranked result; it may lack a detailed description
    detail = items[0]["result"].get("detailedDescription", {})
    return detail.get("url", "no_match"), detail.get("articleBody", "no_match")

Let’s perform entity linking on our George Washington example.

In [None]:
# Print Wikipedia descriptions and urls for entities
# Un-comment and run the lines below after you obtain your own
# Google Knowledge Graph API key and assign it to a variable named key
# for token in doc.ents:
#     url, description = returnGraphResult(token.text, key, token.label_)
#     print(token.text, token.label_, url, description)

Here is the output.

- George Washington: PERSON https://en.wikipedia.org/wiki/George_Washington George Washington was an American political leader, military general, statesman, and Founding Father, who also served as the first President of the United States from 1789 to 1797.
- American: NORP no_match no_match
- first: ORDINAL no_match no_match
- the United States: GPE no_match no_match
- 1789 to 1797: DATE no_match no_match

As you can see, George Washington is a PERSON and is linked successfully to the "George Washington" Wikipedia URL and description. The rest are not of entity type PERSON and are not linked. If desired, we could link the other named entities, such as the United States, to relevant Wikipedia articles, too.

Named entity linking has many use cases in enterprise, especially since the need to link information to a taxonomy comes up over and over again (e.g., linking stock tickers, pharmaceutical drugs, publicly traded companies, consumer products, etc. to canonical versions in a taxonomy or knowledge base).
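When the taxonomy is your own rather than a public knowledge graph, the core of entity linking can be sketched as an alias lookup. Everything in the block below is hypothetical (the knowledge base, its identifiers, and the `toy_link` helper); a production linker would also use the surrounding context to disambiguate mentions.

```python
# A hypothetical knowledge base: canonical ID -> known surface forms (aliases)
KNOWLEDGE_BASE = {
    "person/george_washington": ["George Washington", "President Washington"],
    "place/united_states": ["the United States", "United States", "USA"],
}

# Invert it into a lookup table from normalized alias to canonical ID
ALIAS_TO_ID = {
    alias.lower(): entity_id
    for entity_id, aliases in KNOWLEDGE_BASE.items()
    for alias in aliases
}

def toy_link(mention):
    # Normalize the mention, then look it up; None means no match
    return ALIAS_TO_ID.get(mention.strip().lower())

for mention in ["George Washington", "the United States", "1789 to 1797"]:
    print(mention, "->", toy_link(mention))
```

An exact-match lookup like this cannot tell whether a bare "Washington" means the person, the state, or the city, which is precisely the ambiguity real entity linkers are built to resolve.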

## Conclusion

In this chapter, we defined NLP and covered its origins, including some of the commercial applications that are popular in enterprise today. Then, we defined some basic NLP tasks and performed them using a very performant NLP library known as SpaCy. You should spend more time using SpaCy, including reviewing documentation that is available online, to hone what you have learned in this chapter.

While the tasks we performed are very basic, when combined together, NLP tasks such as tokenization, part-of-speech tagging, dependency parsing, chunking, and lemmatization make it possible for machines to perform even more complex NLP tasks such as named entity recognition and entity linking. We hope our walkthrough of these tasks helped you build some intuition on just how machines are able to unpack and process natural language, demystifying some of the space.

Today, most complex NLP applications do not require practitioners to perform these tasks manually; rather, neural networks learn to perform these "tasks" on their own. In the next chapter, we will dive into some of the state-of-the-art approaches using the Transformer architecture and large, pretrained language models from fastai and Hugging Face to show just how easy it is to get up and running with NLP today. Later in the book, we will return to the basics (which we just teased you with briefly in this chapter) and help you build more of your foundational knowledge of NLP.