# L11 - Natural Language Processing

Parts of this notebook were adapted from Chapters [15](https://github.com/cltl/python-for-text-analysis/blob/master/Chapters/Chapter%2015%20-%20Off%20to%20analyzing%20text.ipynb) and [19](https://github.com/cltl/python-for-text-analysis/blob/master/Chapters/Chapter%2019%20-%20More%20about%20Natural%20Language%20Processing%20Tools%20(spaCy).ipynb) of VU Amsterdam's [Python for Text Analysis](https://github.com/cltl/python-for-text-analysis) course.

Natural Language Processing (NLP) is a field of computational science dealing with natural (human-generated) language texts. Python has a wide range of NLP libraries, the most popular of which are perhaps [NLTK](https://www.nltk.org/) and [SpaCy](https://spacy.io/). We will focus on them in this lecture, but they are not the only game in town:

* [Stanford's CoreNLP](http://stanfordnlp.github.io/CoreNLP/) is a very powerful system
  that is able to process English, German, Spanish, French, Chinese and Arabic - it is originally written in Java, but there are also Python wrappers available, such as [py-corenlp](https://github.com/smilli/py-corenlp)
* [Textblob](http://textblob.readthedocs.io/en/dev/) is a general NLP library that builds on NLTK
* [Gensim](http://radimrehurek.com/gensim/) is mostly used for training semantic word vectors
* [Corpkit](http://corpkit.readthedocs.io/en/latest/) is a module for corpus building and corpus management. Includes an interface to the Stanford CoreNLP parser
* [Huggingface's Transformers](https://huggingface.co/transformers/) is considered state-of-the-art when working with [Transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) models - supports both [TensorFlow](https://www.tensorflow.org/) and [PyTorch](https://pytorch.org/)

## Setting Up

Installing `spacy` and `nltk` requires a bit more effort than we are used to from other packages. Both need additional downloaded resources to function well, and `spacy` is additionally optimized for speed.

### SpaCy

You should be able to install `spacy` by executing the following cell:

In [None]:
!pip install spacy==3.0.6

import spacy
print("Yay, it worked!")

If this didn't work, here are a few steps you can try in a terminal with your environment activated:

1. Try upgrading `pip` with `pip install --upgrade pip`
2. Try running `pip install numpy==1.19.5` (if `numpy` is not already installed)
3. Try running `pip install Cython==0.29.23` (if `Cython` is not already installed)
4. Try installing `wheel` with `pip install wheel`
5. Try installing `setuptools` with `pip install setuptools`
6. Finally, try running `pip install spacy==3.0.6`

There is also a quite comprehensive installation guide for different OS from `spacy` itself [here](https://spacy.io/usage).
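If you are unsure whether an installation step succeeded, a quick check from Python can help. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of `spacy` or `nltk`.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None

# Report which of the two libraries are available in the active environment.
for name in ("spacy", "nltk"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Note that this only tells you whether the package can be found, not whether its language resources have been downloaded.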

Before you can use `spacy`, you have to download the trained models and resources for the language that you want to work with. `spacy` has models for a wide range of languages such as Chinese, Greek, Lithuanian, German, Japanese, Russian and many more. We will use the smallest English resource set `en_core_web_sm`. To download it, run the following cell:

In [None]:
!python -m spacy download en_core_web_sm

Let's try loading the downloaded resources:

In [None]:
import spacy

spacy.load("en_core_web_sm")

### NLTK

`nltk` should be easier to set up, since unlike `spacy` it does not use [Cython](https://cython.org/) for optimization.

In [None]:
!pip install nltk==3.6.2

import nltk

`nltk` does not have a single resource set for a language. Instead, you need to download the resources for each step that you want to do. Let's start with all resources for `nltk`'s primary tutorial, the [NLTK book](https://www.nltk.org/book/).

In [None]:
import nltk

nltk.download("book")

## The NLP Pipeline

NLP is really just a collection of tasks that operate on natural language. There are different motivations for solving these tasks:

**Researchers** often want to solve them because they believe a system that can correctly reason over a text (e.g. answer questions) necessarily has to have some sort of "understanding" of its content. In recent years, neural networks have solved tasks like question answering to such a high degree that some say we have succeeded in teaching computers Natural Language Understanding (NLU) - however, [many others disagree and urge the NLP community to rethink its approach to understanding](https://www.aclweb.org/anthology/2020.acl-main.463/).

**Companies** often want to solve these tasks because it enables them to provide a service to their customers, such as a chatbot, a voice assistant, a translation service, or detailed analysis of documents. Large companies like [Google](https://research.google/research-areas/natural-language-processing/) sometimes also publish research on NLP because they want to advance the community and add to their own reputation by achieving state-of-the-art results.

The tasks range from trivial for humans to very difficult and often build on each other (e.g. many tasks operate on tokens). Generally, one can distinguish between **text analysis**, where some information is drawn from an existing text, and **text generation**, where a new text is produced.


Here are some low-level tasks, which often produce features that are the basis for more sophisticated tasks:

* **Tokenization:** splitting texts into individual words (tokens)
* **Sentence splitting:** splitting texts into sentences
* **Part-of-speech (POS) tagging:** identifying the parts of speech of words in context (verbs, nouns, adjectives, etc.)
* **Morphological analysis:** separating words into morphemes and identifying their classes (e.g. tense/aspect of verbs)
* **Stemming:** identifying the stems of words in context by removing inflectional/derivational affixes, such as 'troubl' for 'trouble/troubling/troubled'
* **Lemmatization:** identifying the lemmas (dictionary forms) of words in context, such as 'go' for 'go/goes/going/went'
* **Word Sense Disambiguation (WSD):** assigning the correct meaning to words in context
* **Stop words recognition:** identifying commonly used words (such as 'the', 'a(n)', 'in', etc.) in text, possibly to ignore them in other tasks
* **Named Entity Recognition (NER):** identifying people, locations, organizations, etc. in text
* **Constituency/dependency parsing:** analyzing the grammatical structure of a sentence
* **Semantic Role Labeling (SRL):** analyzing the semantic structure of a sentence (*who* does *what* to *whom*, *where* and *when*)
* **Sentiment Analysis:** determining whether a text is mostly positive or negative
* **Static Word Representation and Semantic Similarity:** representing the meaning of words as vectors of real-valued numbers where each entry captures a dimension of the word's meaning and where semantically similar words have similar vectors (very popular these days)
* **Contextualized Word Representation:** representing the meaning of a word in the specific context in which it was used
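To make the difference between stemming and lemmatization from the list above more concrete, here is a deliberately crude sketch. The suffix list and the lookup table are assumptions made up for this illustration; real stemmers and lemmatizers are far more careful.

```python
# Toy illustration only: not the algorithm used by nltk or spacy.
SUFFIXES = ("ing", "ed", "es", "e", "s")  # assumed suffix list, checked in order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A lemmatizer needs lexical knowledge; here it is a tiny hand-written table.
TOY_LEMMAS = {"goes": "go", "going": "go", "went": "go"}

def toy_lemmatize(word):
    """Look the word up in the table and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ("trouble", "troubling", "troubled"):
    print(word, "->", toy_stem(word))
for word in ("go", "goes", "going", "went"):
    print(word, "->", toy_lemmatize(word))
```

The stem 'troubl' is not a dictionary word, while a lemma always is. Irregular forms like 'went' cannot be reached by cutting off suffixes, which is why lemmatization relies on a lexicon.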

Here are some high-level tasks, which are solved based on features produced by low-level tasks:
* **Question Answering:** generating an answer to a question
* **Common-Sense Reasoning:** reasoning over the contents of a text
* **Machine Translation:** generating a translation in a different natural language
* **Paraphrase Generation:** generating a text of similar length with the same meaning
* **Abstractive Summarization:** generating a shorter text that summarizes a longer one

This list is non-comprehensive. If you want to learn more about these and other high-level topics, have a look at [nlpprogress.com](http://nlpprogress.com/), which has a summary of each task and the methods that currently perform best on it for many languages.  

If you want to see a demonstration of what state-of-the-art NLP models can do, check out this [text autocompletion demo with GPT-2](https://transformer.huggingface.co/doc/gpt2-large) by Huggingface and this [DALL-E text-to-image demo](https://openai.com/blog/dall-e/) by OpenAI, which is not strictly NLP because it also involves computer vision.

## The NLP Pipeline with NLTK

Let's look at an example of a simple NLP pipeline with `nltk`. In the following cell, you can observe how we split raw text into tokens and sentences, perform part-of-speech tagging and lemmatize one of the tokens. Don't worry about the details just yet - we will go through them step by step.

In [None]:
import nltk

text = "I have an awesome cat. It's sitting on the mat that I bought yesterday."

# tokenization
tokens = nltk.word_tokenize(text)

# sentence splitting
sentences = nltk.sent_tokenize(text)

# POS tagging
tagged_tokens = nltk.pos_tag(tokens)

# lemmatization
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
lemma = lmtzr.lemmatize(tokens[8], 'v')  # tokens[8] is the verb form "sitting"

# print all information
print(tokens)
print(sentences)
print(tagged_tokens)
print(lemma)

### Tokenization and Sentence Splitting
#### `word_tokenize()`

Now, let's try tokenizing a story from a text file in `data/charlie.txt`. First, we will open and read the file and assign the file contents to the variable `content`. If you are unsure about how to read and write files in Python, have a look at [Chapter 14 of the Python for Text Analysis course](https://github.com/cltl/python-for-text-analysis/blob/master/Chapters/Chapter%2014%20-%20Reading%20and%20writing%20text%20files.ipynb). Then, we can call the `word_tokenize()` function from the `nltk` module as follows:

In [None]:
with open("data/charlie.txt") as infile:
    content = infile.read()

tokens = nltk.word_tokenize(content)
print(type(tokens), len(tokens))
print(tokens)

As you can see, we now have a list of all words in the text. The punctuation marks are also in the list, but as separate tokens.
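To see why a dedicated tokenizer is useful, compare two naive alternatives. This is a sketch with the standard library only; the regular expression is an assumed toy pattern and not what `word_tokenize()` does internally.

```python
import re

text = "I have an awesome cat. It's sitting on the mat."

# Splitting on whitespace leaves punctuation glued to the words.
whitespace_tokens = text.split()

# Assumed toy pattern: a run of word characters, or one non-space symbol.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The toy pattern separates punctuation, but it also breaks *It's* into three pieces, whereas `word_tokenize()` keeps the clitic *'s* together as one token.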

#### `sent_tokenize()`

Another thing that NLTK can do for you is to split a text into sentences by using the `sent_tokenize()` function. We use it on the entire text (as a string):

In [None]:
with open("data/charlie.txt") as infile:
    content = infile.read()

sentences = nltk.sent_tokenize(content)

print(type(sentences), len(sentences))
print(sentences)
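Sentence splitting may look trivial, but a simple rule quickly runs into trouble. The sketch below uses an assumed toy rule (split after `.`, `!` or `?` when whitespace follows); `sent_tokenize()` instead relies on a pretrained sentence tokenizer.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works for simple sentences ...
print(naive_sentence_split("I have a cat. It sits on the mat."))

# ... but an abbreviation is mistaken for the end of a sentence.
print(naive_sentence_split("Dr. Smith arrived. He sat down."))
```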

We can now do all sorts of cool things with these lists. For example, we can search for all words that have certain letters in them and add them to a list. Let's say we want to find all present participles in the text. We know that present participles end with *-ing*, so we can do something like this:

In [None]:
# open and read in file as a string, assign it to the variable `content`
with open("data/charlie.txt") as infile:
    content = infile.read()
    
# split up entire text into tokens using word_tokenize():
tokens = nltk.word_tokenize(content)

# create an empty list to collect all words having the present participle -ing:
present_participles = []

# looking through all tokens
for token in tokens:
    # checking if a token ends with the present participle suffix -ing
    if token.endswith("ing"):
        # if the condition is met, add it to the list we created above (present_participles)
        present_participles.append(token)
        
# print the list to inspect it
print(present_participles)

This looks good! We now have a list of words like *boiling*, *sizzling*, etc. However, one word in the list is not actually a present participle (*ceiling*): other words can end with *-ing* too. So if we want to find all present participles, we have to come up with a smarter solution.

### Part-of-speech (POS) tagging

Once again, `nltk` comes to the rescue. Using the function `pos_tag()`, we can label each word in the text with its part of speech. 

To do pos-tagging, you first need to tokenize the text. We have already done this above, but we will repeat the steps here, so you get a sense of what an NLP pipeline may look like.

#### `pos_tag()`

To see how `pos_tag()` can be used, we can (as always) look at the documentation by using the `help()` function. As we can see, `pos_tag()` takes a tokenized text as input and returns a list of tuples in which the first element corresponds to the token and the second to the assigned pos-tag.

In [None]:
# as always, we can start by reading the documentation:
help(nltk.pos_tag)

In [None]:
# open and read in file as a string, assign it to the variable `content`
with open("data/charlie.txt") as infile:
    content = infile.read()
    
# split up entire text into tokens using word_tokenize():
tokens = nltk.word_tokenize(content)

# apply pos tagging to the tokenized text
tagged_tokens = nltk.pos_tag(tokens)

# inspect pos tags
print(tagged_tokens)

#### Working with POS tags

As we saw above, `pos_tag()` returns a list of tuples: The first element is the token, the second element indicates the part of speech (POS) of the token. 

This POS tagger uses the POS tag set of the Penn Treebank Project, which can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). For example, all tags starting with a V are used for verbs. 

We can now use this, for example, to identify all the verbs in a text:


In [None]:
# open and read in file as a string, assign it to the variable `content`
with open("data/charlie.txt") as infile:
    content = infile.read()
    
# apply tokenization and POS tagging
tokens = nltk.word_tokenize(content)
tagged_tokens = nltk.pos_tag(tokens)

# list of verb tags (i.e. tags we are interested in)
verb_tags = ["VBD", "VBG", "VBN", "VBP", "VBZ"]

# create an empty list to collect all verbs:
verbs = []

# iterating over all tagged tokens
for token, tag in tagged_tokens:
 
    # checking if the tag is any of the verb tags
    if tag in verb_tags:
        # if the condition is met, add it to the list we created above 
        verbs.append(token)
        
# print the list to inspect it
print(verbs)
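The same idea solves the earlier *ceiling* problem: instead of looking at the spelling, we can filter on the tag `VBG`, which the Penn Treebank tag set uses for gerunds and present participles. The sketch below works on a small hand-written list in the `(token, tag)` format, so the tags are an assumption for illustration and not actual tagger output.

```python
from collections import Counter

# Hand-written sample in the (token, tag) format; not actual tagger output.
sample_tagged = [
    ("The", "DT"), ("water", "NN"), ("was", "VBD"), ("boiling", "VBG"),
    ("under", "IN"), ("the", "DT"), ("ceiling", "NN"), (".", "."),
]

# Present participles and gerunds carry the tag VBG.
present_participles = [token for token, tag in sample_tagged if tag == "VBG"]

# All verb tags share the prefix "VB", so a prefix test catches every verb form.
verbs = [token for token, tag in sample_tagged if tag.startswith("VB")]

# Counting the tags gives a quick overview of a text.
tag_counts = Counter(tag for _, tag in sample_tagged)

print(present_participles)
print(verbs)
print(tag_counts)
```

With a real text, `sample_tagged` would simply be replaced by the output of `nltk.pos_tag(tokens)`.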

### Lemmatization

We can also use `nltk` to lemmatize words.

The lemma of a word is the form of the word which is usually used in dictionary entries. This is useful for many NLP tasks, as it generalizes better than the surface form in which a word appears. To a computer, `cat` and `cats` are two completely different tokens, even though we know they are both forms of the same lemma.



#### The WordNet lemmatizer

We will use `nltk`'s `WordNetLemmatizer` and its `lemmatize()` method for this. [WordNet](https://en.wikipedia.org/wiki/WordNet) is an ontology-like lexical database of English words and their semantic relations. In the code below, we loop through the list of verbs, lemmatize each of them, and add the lemmas to a new list called `verb_lemmas`. Again, we show all the processing steps (see the comments in the code below):

In [None]:

with open("data/charlie.txt") as infile:
    content = infile.read()
    
tokens = nltk.word_tokenize(content)
tagged_tokens = nltk.pos_tag(tokens)

verb_tags = ["VBD", "VBG", "VBN", "VBP", "VBZ"]
verbs = []

for token, tag in tagged_tokens:
    if tag in verb_tags:
        verbs.append(token)

print(verbs)

# instantiate a lemmatizer object
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

# create list to collect all the verb lemmas:
verb_lemmas = []
        
for verb in verbs:
    
    # for this lemmatizer, we need to indicate the POS of the word (in this case, v = verb)
    lemma = lmtzr.lemmatize(verb, "v") 
    verb_lemmas.append(lemma)
    
print(verb_lemmas)

**Note about this lemmatizer:** 

We need to pass a POS tag to the `WordNetLemmatizer` in WordNet format ("n" for noun, "v" for verb, "a" for adjective). If we do not indicate the POS tag, the WordNet lemmatizer assumes the word is a noun (this is the default value for its part-of-speech argument). See the examples below:

In [None]:
test_nouns = ('building', 'applications', 'leafs')
for n in test_nouns:
    print(f"Noun in inflected form: {n}")
    default_lemma = lmtzr.lemmatize(n) # without specifying POS n is interpreted as a noun!
    print(f"Default lemmatization: {default_lemma}")
    verb_lemma = lmtzr.lemmatize(n, 'v')
    print(f"Lemmatization as a verb: {verb_lemma}")
    noun_lemma = lmtzr.lemmatize(n, 'n')
    print(f"Lemmatization as a noun: {noun_lemma}")
    print()

In [None]:
test_verbs = ('grew', 'standing', 'plays')

for v in test_verbs:
    print(f"Verb in conjugated form: {v}")
    default_lemma = lmtzr.lemmatize(v) # without specifying POS v is interpreted as a noun!
    print(f"Default lemmatization: {default_lemma}")
    verb_lemma = lmtzr.lemmatize(v, 'v')
    print(f"Lemmatization as a verb: {verb_lemma}")
    noun_lemma = lmtzr.lemmatize(v, 'n')
    print(f"Lemmatization as a noun: {noun_lemma}")
    print()
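Since `pos_tag()` returns Penn Treebank tags while the lemmatizer expects WordNet-style letters, a small conversion step is needed when the two are combined. Here is one possible sketch; `penn_to_wordnet` is a hypothetical helper name, and the mapping by tag prefix is a simplifying assumption.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if penn_tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if penn_tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if penn_tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else, like the lemmatizer's default

for tag in ("NNS", "VBD", "JJ", "RB", "DT"):
    print(tag, "->", penn_to_wordnet(tag))
```

With the objects from the cells above, each tagged token could then be lemmatized with `lmtzr.lemmatize(token, penn_to_wordnet(tag))`.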

### Ontological Semantics: WordNet Synsets

The steps that we have done with `nltk` so far - **sentence splitting**, **tokenization**, **POS tagging**, and **lemmatization** - are all very nice, but arguably **they only scratch the surface** of the actual meaning of the text.

What we would really be interested in is learning about the [semantics](https://en.wikipedia.org/wiki/Semantics) of the text, the actual content. This is the holy grail of NLP that is still the subject of a lot of research. However, one way to get some idea what the text is about is looking up the words in the lexical database [WordNet](https://wordnet.princeton.edu/).

In [None]:
from nltk.corpus import wordnet

In WordNet, English nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms, so-called *synsets*, each expressing a distinct concept. That way, it is possible to look up which words could in general "mean" the same concept, although that is not a guarantee that they do mean the same thing in this specific context.

In [None]:
print(wordnet.synsets("dish"))

This gives you a list of `Synset` objects that denote concepts which might be meant by the word "dish".

In [None]:
synsets = wordnet.synsets("dog")

print(type(synsets[0]))
print(synsets[0].name())
print(type(synsets[0].name()))

`Synset.name()` gives you the name as a `str`, but there's much more.

In [None]:
print(synsets[0].definition())
print(synsets[0].examples())
print(synsets[0].lemmas())

So if you would like to know if two words could possibly mean the same thing, you could check for intersection of the synsets:

In [None]:
a = wordnet.synsets("play")
b = wordnet.synsets("drama")

a_synset_names = {synset.name() for synset in a}
b_synset_names = {synset.name() for synset in b}

synsets_of_both = a_synset_names.intersection(b_synset_names)
print(synsets_of_both)


Note that it can sometimes be sensible to narrow down the search space by giving a POS tag, as in the case of "play":

In [None]:
print("Number of synsets without POS:", len(wordnet.synsets("play")))
print("Number of synsets with POS:", len(wordnet.synsets("play", pos="v"))) # v stands for verb

There's a lot more to the WordNet functionality of `nltk`. To learn more, see [this](https://www.nltk.org/howto/wordnet.html) how-to.

If you would like to learn more about `nltk` in general, check out its [documentation](https://www.nltk.org/api/nltk.html) and the [book](https://www.nltk.org/book/).

## The NLP Pipeline with SpaCy

`spacy` has mostly the same functions that `nltk` has, but they are accessed with a different syntax. First, a set of pretrained models is loaded and stored in an `nlp` object.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

The `nlp` object can be used to perform all steps of the NLP pipeline at once on a text.

In [None]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")
print(doc)

`doc` is now a Python object of the class `Doc`. It is a container for accessing linguistic annotations and a sequence of `Token` objects.

### Doc, Token and Span Objects

At this point, there are three important types of objects to remember:

* A `Doc` is a sequence of `Token` objects.
* A `Token` object represents an individual token — i.e. a word, punctuation symbol, whitespace, etc. It has attributes representing linguistic annotations. 
* A `Span` object is a slice from a `Doc` object and a sequence of `Token` objects.

Since `Doc` is a sequence of `Token` objects, we can iterate over all of the tokens in the text as shown below, or select a single token from the sequence: 

In [None]:
# iterate over the tokens
for token in doc:
    print(token)
print()

# select one single token by index
first_token = doc[0]
print("First token:", first_token)

Linguistic features are then available as attributes using an object-oriented syntax:

In [None]:
for token in doc:
    print("\t".join([token.text, token.pos_, token.lemma_, token.shape_, token.dep_]))

A full list of linguistic features can be found [here](https://spacy.io/usage/linguistic-features).

Notice that some of the attributes end with an underscore. For example, tokens have both `lemma` and `lemma_` attributes. The `lemma` attribute represents the id of the lemma (integer), while the `lemma_` attribute represents the unicode string representation of the lemma. In practice, you will mostly use the `lemma_` attribute.

In [None]:
for token in doc:
    print(token.lemma, token.lemma_)

You can also use `spacy.explain` to find out more about certain labels:

In [None]:
spacy.explain("npadvmod")

You can create a `Span` object from the slice `doc[start : end]`. For instance, `doc[2:5]` produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. `doc[start : end : step]`) are not supported, as `Span` objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.

In [None]:
# create a span
a_slice = doc[2:5]
print(a_slice, type(a_slice))

# iterate over span
for token in a_slice:
    print(token.lemma_, token.pos_)
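If the slicing rules are not fresh in your mind, the same index arithmetic can be tried on a plain Python list. The list below merely stands in for the tokens of a `Doc`; it is an illustration and not a `spacy` object.

```python
# A plain list standing in for the tokens of a Doc (illustration only).
tokens = ["I", "have", "an", "awesome", "cat", "."]

middle = tokens[2:5]    # items at positions 2, 3 and 4 (the end is exclusive)
last_two = tokens[-2:]  # negative start: the last two items
first_three = tokens[:3]  # open-ended: everything before position 3

print(middle)
print(last_two)
print(first_three)
```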

### Doc Attributes

If you call the `dir()` function on a `Doc` object, you will see that it has a range of methods and attributes. You can read more about them in the [documentation](https://spacy.io/api/doc). Below, we highlight two of them: `text` and `sents`.

In [None]:
dir(doc)

First of all, `text` simply gives you the whole document as a string:

In [None]:
print(doc.text)
print(type(doc.text))

`sents` can be used to get all the sentences. Notice that it will create a so-called 'generator'. For now, you don't have to understand exactly what a generator is (if you like, you can read more about them online). Just remember that we can use generators to iterate over an object in a fast and efficient way.

In [None]:
# get all the sentences as a generator 
print(doc.sents, type(doc.sents))

# we can use the generator to loop over the sentences; each sentence is a span of tokens
for sentence in doc.sents:
    print(sentence, type(sentence))
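For the curious, here is a small sketch of how a generator behaves in plain Python. `toy_sentences` is a made-up function that splits on `". "`; it only serves to show the behaviour of generator objects and has nothing to do with how `spacy` finds sentences.

```python
def toy_sentences(text):
    """Yield sentences one at a time (toy splitter on '. ')."""
    for part in text.split(". "):
        yield part

sents = toy_sentences("I have a cat. It sits on the mat")

print(type(sents))  # a generator object, not a list

try:
    sents[0]  # generators do not support indexing
except TypeError as error:
    print("Indexing failed:", error)

first_pass = list(sents)   # consuming the generator produces all items
second_pass = list(sents)  # the same generator object is now exhausted

print(first_pass)
print(second_pass)
```

A generator produces its items on demand, so it cannot be indexed, and one generator object can only be consumed once.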

If you find this difficult to comprehend, you can also simply convert it to a list and then loop over the list. Remember that this is less efficient, though.

In [None]:
# you can also store the sentences in a list and then loop over the list 
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence, type(sentence))

The benefit of converting it to a list is that we can use indices to select certain sentences. For example, in the following we only print some information about the tokens in the second sentence.

In [None]:
# print some information about the tokens in the second sentence.
sentences = list(doc.sents)
for token in sentences[1]:
    data = '\t'.join([token.orth_,
                      token.lemma_,
                      token.pos_,
                      token.tag_,
                      str(token.i),    # turn index into string
                      str(token.idx)]) # turn index into string
    print(data)

Can you deduce from this output what the difference between `Token.i` and `Token.idx` is?

### Named Entity Recognition

`spacy` also automatically runs a Named Entity Recognition algorithm over the text when `nlp` is applied.

In [None]:
# here's a slightly longer text from the Wikipedia page about Harry Potter
harry_potter = (
    "Harry Potter is a series of fantasy novels written by British author J. K. Rowling. "
    "The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, "
    "all of whom are students at Hogwarts School of Witchcraft and Wizardry. "
    "The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, "
    "overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."
)

doc = nlp(harry_potter)

for token in doc:
    print(token.text, "\t", token.ent_type_)

As an alternative, you can simply access all entities that are mentioned in a `Doc` with `Doc.ents`:

In [None]:
print(doc.ents)
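Each entity in `Doc.ents` is a `Span` with a `text` and a `label_` attribute, so the entities can easily be grouped by type. The sketch below uses hand-written `(text, label)` pairs, which are an assumption for illustration and not actual model output; with a real `Doc` they could be collected as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# Hand-written (text, label) pairs for illustration; not actual model output.
sample_entities = [
    ("Harry Potter", "PERSON"),
    ("British", "NORP"),
    ("J. K. Rowling", "PERSON"),
    ("Hogwarts School of Witchcraft and Wizardry", "ORG"),
]

# Collect the entity texts under their label.
entities_by_label = defaultdict(list)
for text, label in sample_entities:
    entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(label, texts)
```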

### Statistical Semantics: GloVe Word Vectors

Just like `nltk` has WordNet synsets, `spacy` has an interface that lets you access semantic information about each token. `spacy` makes a numeric vector for each token available via the [`Token.vector`](https://spacy.io/api/token#vector) attribute. The larger English models (`en_core_web_md` and `en_core_web_lg`) ship with 300-dimensional pretrained static word vectors of the same kind as [GloVe vectors](https://nlp.stanford.edu/projects/glove/), which are trained on massive corpora and encode word co-occurrence statistics. Note that the small model `en_core_web_sm` that we loaded does not include static word vectors, so for it `Token.vector` falls back to lower-dimensional context-sensitive tensors produced by the pipeline. Here's how to access them:

In [None]:
harry = doc[0]

print(harry.text)
print(harry.vector)
print(type(harry.vector))
print(harry.vector.shape)

As you can see, they are simply `np.ndarray`s, which we already know how to handle. These vectors could be used as input to a neural network that learns to solve an NLP task such as sentiment analysis, the task of correctly classifying the sentiment expressed in a text.
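A common way to compare two word vectors is cosine similarity, which measures the angle between them. Here is a minimal sketch in plain Python; the three-dimensional vectors are made up for illustration, whereas real word vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors; not taken from any trained model.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print("cat vs. dog:", cat_dog)
print("cat vs. car:", cat_car)
```

Values close to 1 mean that the vectors point in a similar direction. The function assumes that neither vector consists only of zeros, since that would lead to a division by zero.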

If you would like to know more about where these word vectors come from and how they are trained, check out [this](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) blog post by Chris Olah from 2014. It might be a bit outdated, but I think it's still a good way to get started with semantic word representations.

If you would like to learn more about `spacy` in general, check out its [documentation](https://spacy.io/usage/linguistic-features).

## Homework

You can find the assignment link for `2021-homework11` in an announcement on StudIP. It will make use of both `nltk` and `spacy`, so your time scrolling through this notebook was not wasted.

>Good luck and have a good week!