# Lab 1 - Introduction to the [Natural Language Toolkit](https://www.nltk.org/) (NLTK)

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we will use the [Natural Language Toolkit](https://www.nltk.org/) to perform various Natural Language Processing tasks, including sentence splitting, stop word recognition, and Named Entity Recognition (NER). The Natural Language Toolkit is a Python package that provides easy access to [popular corpora and lexical resources](https://www.nltk.org/book/ch02.html#tab-corpora). It also contains a wide range of text processing modules (hence the name **toolkit**). NLTK is well suited for getting started with Natural Language Processing because it allows you to study each NLP task separately, which means that you can analyze the input, the algorithm, and the output. NLTK is an open source, community-driven project.

**Main goal of this notebook**: The most important goal of this notebook is to show you how to perform various NLP tasks using NLTK. It is important that you can use the code snippets from this notebook on other language data.

**At the end of this notebook, you will be able to perform the following NLP tasks using NLTK**:
* **Sentence splitting**: *nltk.tokenize.sent_tokenize*
* **Tokenization**: *nltk.word_tokenize*
* **Part-of-speech (POS) tagging**: *nltk.pos_tag* 
* **Stop words recognition** 
* **Stemming and lemmatization**
     * *nltk.stem.porter.PorterStemmer*
     * *nltk.stem.snowball.SnowballStemmer*
     * *nltk.stem.wordnet.WordNetLemmatizer*
* **Constituency/dependency parsing**: *nltk.RegexpParser*
* **Named Entity Recognition (NER)**: *nltk.chunk.ne_chunk*

**If you want to learn more about these topics, you might find the following links useful (information from these blogs is used in this notebook):**
* [NLTK book](https://www.nltk.org/book/)
* [official NLTK website](https://www.nltk.org/)
* [an introduction to NLTK](https://www.pythonforengineers.com/introduction-to-nltk-natural-language-processing-with-python/)
* [another introduction to NLTK](https://nlpforhackers.io/introduction-nltk/)
* [yet another introduction to NLTK](https://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk)
* [introduction to tokenization from Stanford](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)
* [introduction to part of speech tagging](http://aritter.github.io/courses/5525_slides/pos1.pdf)
* [introduction to stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
* [comparison stemming and lemmatization](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/)
* [introduction to Named Entity Recognition](https://www.codementor.io/bofinbabu/introduction-to-named-entity-recognition-ner-k584v86r6)
* [introduction to Named Entity Recognition using NLTK](https://nlpforhackers.io/named-entity-extraction/)

## Getting started ([NLTK Chapter 1, Section 1.2](https://www.nltk.org/book/ch01.html))

Please try to import the NLTK module by running the cell below:

In [1]:
from pprint import pprint ### needed to print python data more elegantly
import nltk  

If you get an error such as *ModuleNotFoundError: No module named 'nltk'*, please install the module, for example with `conda install -c anaconda nltk`, and try again.
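If you are unsure whether the package is available in the environment your notebook uses, you can check without triggering an error. The sketch below uses only the standard library; the helper name `module_available` is our own and is not part of NLTK.

```python
import importlib.util

def module_available(name):
    """Return True if the module `name` can be found in the current environment."""
    return importlib.util.find_spec(name) is not None

print('nltk installed:', module_available('nltk'))
```

If this prints `False`, install the package in the same environment that runs the notebook kernel and restart the kernel.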

### Downloading data sets
The first time you import NLTK on your local machine, you will need to download the data sets that we will use in this course. When you run the cell below, you will get a pop-up window in which you can select the data sets to download. The minimal data set you need is `book`. Take your time to check out the different tabs and get an idea of what is there. Make sure you have sufficient disk space to store what you want.

If you have already run the download command, you can skip the next step because the data are already on your local drive. If you need another data set, rerun the command and take your pick.

**Tip:** comment out *nltk.download()* after you've used it, so that you can use *Restart kernel and run all cells*.

In [5]:
#nltk.download() 

Note: When you run `nltk.download()`, a star will appear to the left of the code cell. A window should open that looks like [this](https://i.stack.imgur.com/hw89E.jpg). If the window does not appear, you can simply replace `nltk.download()` with `nltk.download('book')`.

Please run the following cell to check that you can import the Brown corpus (which is part of the *book* data set)

In [6]:
from nltk.corpus import brown

Now that you have everything installed, we can get started with some examples of text processing using NLTK!

## Sentence splitting ([NLTK Chapter 3, Section 3.8](https://www.nltk.org/book/ch03.html))
Consider the following input that is given to a computer:

In [7]:
a_text = '''Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U.S. Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U.S. Open. H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $500 million. It's the first group action of its kind in Britain.'''
print(a_text)

Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U.S. Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U.S. Open. H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $500 million. It's the first group action of its kind in Britain.


Before the computer can apply most kinds of NLP tasks, it has to know what the separate sentences are.

Let's try splitting the text using a **dot**:

In [8]:
dot_splitted_text = a_text.split('.')
for index, sentence in enumerate(dot_splitted_text, 1):
    print(f'SENTENCE: {index} {sentence}')

SENTENCE: 1 Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U
SENTENCE: 2 S
SENTENCE: 3  Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U
SENTENCE: 4 S
SENTENCE: 5  Open
SENTENCE: 6  H
SENTENCE: 7 J
SENTENCE: 8  Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd
SENTENCE: 9  for about $500 million
SENTENCE: 10  It's the first group action of its kind in Britain
SENTENCE: 11 


This clearly did not work. Many abbreviations, such as **U.S.**, have dots in them. However, sentences normally start with a capital letter. What would happen if we split a text using a dot followed by a space followed by a capital letter? This should work, right?

For this we need to be able to define a pattern. We are going to use the Regular Expressions package *re* to define a pattern. This is explained in [NLTK Chapter 3, Section 3.4](https://www.nltk.org/book/ch03.html).

In [9]:
import re
splitted_using_dot_space_capital = re.split(r'\. [A-Z]', a_text)  # raw string, so that \. reaches the regex engine unchanged
for index, sentence in enumerate(splitted_using_dot_space_capital, 1):
    print(f'SENTENCE: {index} {sentence}')

SENTENCE: 1 Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U.S
SENTENCE: 2 yder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U.S
SENTENCE: 3 pen
SENTENCE: 4 .J
SENTENCE: 5 einz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $500 million
SENTENCE: 6 t's the first group action of its kind in Britain.


Unfortunately not. The pattern also matches inside the following sequences in our text, which are not sentence boundaries:
* **U.S. Ryder**
* **H.J. Heinz Company**

In conclusion, sentence splitting is not as easy as it looks. Luckily, NLTK contains models that are more complex than what we've just seen. Let's see how NLTK's sentence tokenizer performs on our text.

In [10]:
from nltk.tokenize import sent_tokenize

In [11]:
nltk_sentence_splitted = sent_tokenize(a_text)
for index, sentence in enumerate(nltk_sentence_splitted, 1):
    print(f'SENTENCE: {index} {sentence}')

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\zakar/nltk_data'
    - 'C:\\Users\\zakar\\Anaconda3\\nltk_data'
    - 'C:\\Users\\zakar\\Anaconda3\\share\\nltk_data'
    - 'C:\\Users\\zakar\\Anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\zakar\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


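If the cell raises a *LookupError* stating that the `punkt` resource was not found, the sentence tokenizer model has not been downloaded yet. Run `nltk.download('punkt')` (or download the `book` data set as described above) and rerun the cell. Some newer NLTK versions may ask for a differently named resource, such as `punkt_tab`; in that case, download the name given in the error message.

Once the cell runs, you will see that the model is not perfect. It correctly determines that *U.S. Ryder Cup* is not the end of the sentence. However, it states that **H.J.** is the end of a sentence.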

## Tokenization ([NLTK Chapter 3, Section 3.7](https://www.nltk.org/book/ch03.html))

One of the first steps of Natural Language Processing is tokenization. It is generally defined as chopping a text into pieces, which are called tokens.

The most naive way to apply tokenization is to split a text using spaces. Let's try this. Please run the following cell.

In [None]:
example_sentence = "I'll refuse to permit you to obtain the refuse permit."
tokenized_using_spaces = example_sentence.split(' ')
print(tokenized_using_spaces)

Think about the output of the cell above. Is splitting on spaces actually the same as tokenizing?

..

..

Well, yes and no. Tokenizing using spaces works for most tokens. However, it does not work for expressions such as **I'll**, and the final period stays attached to **permit.**

Let's try a real tokenizer....

In [None]:
tokenized_using_tokenizer = nltk.word_tokenize(example_sentence)
print(tokenized_using_tokenizer)

Please note that **I'll** is now correctly tokenized: it is split into the two tokens **I** and **'ll**, and the final period is a token of its own.
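To get a feeling for what a tokenizer has to do, here is a minimal regular-expression sketch that produces the same tokens for this one sentence. It is a simplification for illustration only and is not the algorithm that `nltk.word_tokenize` uses.

```python
import re

example_sentence = "I'll refuse to permit you to obtain the refuse permit."

# Three alternatives, tried in this order at each position:
#   \w+      a run of letters or digits           -> I, refuse, permit
#   '\w+     an apostrophe followed by letters    -> 'll
#   [^\w\s]  any single remaining punctuation     -> .
token_pattern = re.compile(r"\w+|'\w+|[^\w\s]")
tokens = token_pattern.findall(example_sentence)
print(tokens)
```

Such a simple pattern quickly runs into its limits. For example, it splits *don't* into *don* and *'t*, whereas Penn Treebank style tokenization yields *do* and *n't*.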

## Part of speech tagging ([Chapter 5, Section 1 Using a Tagger](https://www.nltk.org/book/ch05.html))
Now that we've established the tokens in a text, a useful next step is to determine the part of speech of each token.
The part of speech is the syntactic category of a token. 

| the | red   | clown  | behaved  | weirdly  |
|---|---|---|---|---|
| determiner | adjective | noun | verb | adverb |

We can replace tokens with another token with the same part of speech, and the sentence would still be grammatical. For example:
* The **blue** clown behaved weirdly.
* The red **cow** behaved weirdly.
* The red clown **walked** weirdly.

NLTK also provides a method to automatically tag each token in a text with a part of speech tag.

In [None]:
nltk.pos_tag(['I', "'ll", 'refuse', 'to', 'permit', 'you', 'to', 'obtain', 'the', 'refuse', 'permit', '.'])

Please note that each token has now been tagged with a part of speech tag. You might be surprised to see **VB** instead of **verb**. The main reason is that there is no single set of part of speech labels; there are [many](https://www.sketchengine.eu/tagsets/english-part-of-speech-tagset/)! The most popular tagset in NLP is the [Penn Treebank POS tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). This is also the default one used in NLTK.
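Penn Treebank tags are more fine-grained than the categories in the table above: **VBD**, for example, is a past tense verb. The sketch below maps tags to coarse categories by looking at their first two letters. The sentence is tagged by hand (it is not tagger output), and the mapping `COARSE` is a small, incomplete assumption made for this example.

```python
# Hand-tagged sentence using Penn Treebank tags (not tagger output).
tagged = [('the', 'DT'), ('red', 'JJ'), ('clown', 'NN'),
          ('behaved', 'VBD'), ('weirdly', 'RB')]

# Small, incomplete mapping from tag prefixes to coarse categories
# (an assumption for this example; the full tagset has many more tags).
COARSE = {'DT': 'determiner', 'JJ': 'adjective', 'NN': 'noun',
          'VB': 'verb', 'RB': 'adverb'}

def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category via its first two letters."""
    return COARSE.get(tag[:2], 'other')

for token, tag in tagged:
    print(f'{token:10}{tag:5}{coarse_category(tag)}')
```

Tags that share a prefix, such as **NN** and **NNS** or **VB** and **VBD**, end up in the same coarse category.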

**Tasks**: 
* Make sure you know what each tag means. 
* Try some other sentences to get an idea of how the tagger works and where it fails.

## Removing Stop words
An important step in preprocessing data is to remove words that are likely irrelevant to the type of NLP task that you want to perform.
It's not uncommon to remove the most commonly used words, so-called *stop words*. NLTK actually keeps lists of stop words for many languages.
We show how to remove stop words for English text.

In [None]:
from nltk.corpus import stopwords

In [None]:
english_stopwords = stopwords.words('english')
set_english_stopwords = set(english_stopwords) # membership tests are faster on a set than on a list

In [None]:
print(english_stopwords)

In [None]:
a_sentence = ['the', 'rain', 'on', 'the', 'roof', 'was', 'soothing']

In [None]:
without_stopwords = []

for token in a_sentence:
    if token not in set_english_stopwords:
        without_stopwords.append(token)

print(without_stopwords)

Yes! We've managed to remove the stopwords!
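The loop above can also be written as a list comprehension. One thing to keep in mind is that the words in NLTK's English stop word list are lowercase, so a capitalized token such as *The* is only recognized if you lowercase it before the lookup. The sketch below uses a tiny hand-made stop word set (an assumption, so that it runs without the NLTK data); in the notebook you would use `set_english_stopwords` instead.

```python
# Tiny hand-made stop word set so the example runs without NLTK data
# (an assumption); in the notebook you would use set_english_stopwords.
toy_stopwords = {'the', 'on', 'was'}

a_sentence = ['The', 'rain', 'on', 'the', 'roof', 'was', 'soothing']

# Compare the lowercased token with the stop words, but keep the original token.
without_stopwords = [token for token in a_sentence
                     if token.lower() not in toy_stopwords]
print(without_stopwords)
```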

### Questions:
* What are stopwords and why would you want to remove these from a text?
* How would you make a stop word list automatically?

#### Cleaning up text

The text you want to analyze can sometimes be messy. Punctuation can be attached to words at the end of a sentence, e.g., **data.** in the example sentence below, or strange characters can be attached to words, e.g., the underscore in **works_** in the example sentence below. It is important to clean your text before analyzing it.

In [None]:
messy_sentence = "The point of this example is to _learn how basic text cleaning works_ on *very simple* data."

In [None]:
import string

As an example, we will remove all occurrences of the following characters from our example sentence.

In [None]:
print(string.punctuation)

We first tokenize our example sentence:

In [None]:
tokenized_messy_sentence = nltk.word_tokenize(messy_sentence)
print(tokenized_messy_sentence)

Now we clean the tokens of these unwanted characters:

In [None]:
table = {ord(char): '' for char in string.punctuation} # in case you're interested, this is called a 'dict comprehension'

cleaned_messy_sentence = []
for messy_word in tokenized_messy_sentence:
    
    cleaned_word = messy_word.translate(table) # the translate method allows us to remove all unwanted characters
    print()
    print('OLD', messy_word)
    print('NEW', cleaned_word)
    cleaned_messy_sentence.append(cleaned_word)

print(cleaned_messy_sentence)

As a result of cleaning out the asterisks, which were tokens by themselves, we've now ended up with some empty strings. If we want to remove them, we can add each token from `cleaned_messy_sentence` to a new list `cleaned_sentence`, unless it equals an empty string:

In [None]:
cleaned_sentence = [token for token in cleaned_messy_sentence if token != ''] # this is known as a 'list comprehension'
print(cleaned_sentence)
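The two steps, stripping the characters and dropping the empty strings, can also be combined. The sketch below builds the translation table with `str.maketrans`, whose third argument lists the characters to delete. The token list is written by hand as a stand-in for the tokenizer output (an assumption, so that the block runs on its own).

```python
import string

# Hand-written tokens standing in for the tokenizer output (an assumption).
tokens = ['_learn', 'how', 'cleaning', 'works_', 'on',
          '*', 'very', 'simple', '*', 'data', '.']

# The third argument of str.maketrans lists the characters to delete.
remove_punctuation = str.maketrans('', '', string.punctuation)

# Strip punctuation from every token, then keep only the non-empty results.
stripped = (token.translate(remove_punctuation) for token in tokens)
cleaned_sentence = [token for token in stripped if token]
print(cleaned_sentence)
```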

## Stemming and lemmatizing ([NLTK book Chapter 3, Section 3.6](https://www.nltk.org/book/ch03.html))
NLTK has various modules for stripping inflection from words (stemming) or finding the lemma (the form you can find in a dictionary). Below is a script that tokenizes an example text and then stems and lemmatizes its words.

In [None]:
raw="SHUT UP! Enough already, Ballstein! Who cares about Derek Zoolander anyway? The man has only one look, for Christ's sake! Blue Steel? Ferrari? Le Tigra?"

In [None]:
# Stemming and Lemmatizing
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()
tokens = nltk.word_tokenize(raw)

porterlemmas = []
wordnetlemmas = []
snowballlemmas = []

for word in tokens:
    porterlemmas.append(porter.stem(word))
    snowballlemmas.append(snowball.stem(word))
    wordnetlemmas.append(wordnet.lemmatize(word))

print('Porter')
print(porterlemmas)
print('Snowball')
print(snowballlemmas)
print('Wordnet')
print(wordnetlemmas)

### Question:
* What difference do you notice between the three lists?
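To make the distinction between the two approaches concrete, here is a toy sketch. A stemmer applies rules to chop off endings, while a lemmatizer looks words up in a dictionary. Both functions, the suffix list, and the lookup table are our own illustrations; they are not the Porter or Snowball algorithm, nor WordNet.

```python
# Toy rule list and toy dictionary (assumptions made for this example).
SUFFIXES = ('ing', 'ed', 'es', 's')
LEMMA_TABLE = {'cares': 'care', 'has': 'have', 'looking': 'look', 'men': 'man'}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned lowercased."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for word in ['cares', 'has', 'looking', 'men']:
    print(f'{word:10}{toy_stem(word):10}{toy_lemmatize(word)}')
```

The rule-based function is cheap but can produce the wrong form (*cares* becomes *car*) and cannot handle irregular forms such as *men*. The dictionary-based function returns proper words but only knows what is in its table.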

## Named Entity Recognition (NER) ([NLTK Chapter 7, Section 5](https://www.nltk.org/book/ch07.html))
In Named Entity Recognition, the goal is to determine which noun phrases refer to named entities.
Named entities can be persons, locations, organizations, etc. (see [NLTK Chapter 7, Section 5](https://www.nltk.org/book/ch07.html) for more information on the task).

In [None]:
from nltk.chunk import ne_chunk

text = '''In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices. Samsung, which is the world's top mobile phone maker, is appealing the ruling. A similar case in the UK found in Samsung's favour and ordered Apple to publish an apology making clear that the South Korean firm had not copied its iPad when designing its own devices.'''
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    
    tokens = nltk.word_tokenize(sentence)
    tokens_pos_tagged = nltk.pos_tag(tokens)
    tokens_pos_tagged_and_named_entities = ne_chunk(tokens_pos_tagged)
    print()
    print('ORIGINAL SENTENCE', sentence)
    print('NAMED ENTITY RECOGNITION OUTPUT', tokens_pos_tagged_and_named_entities)

Please observe that for textual mentions such as **August** and **Samsung**, a named entity label is assigned.
The most frequently used named entity labels are:
* ORGANIZATION (e.g., Georgia-Pacific Corp.)
* PERSON (e.g., Eddy Bonte, President Obama)
* LOCATION (e.g., Murray River, Mount Everest)
* DATE (e.g., June, 2008-06-29)
* TIME (e.g., two fifty a m, 1:30 p.m.)
* MONEY (e.g., 175 million Canadian Dollars, GBP 10.40)
* PERCENT (e.g., twenty pct, 18.75 %)
* FACILITY (e.g., Washington Monument, Stonehenge)
* GPE (=Geo-Political Entity, e.g., South East Asia, Midlothian)

Please try to understand the output from NLTK regarding named entity recognition. 
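The output of `ne_chunk` is an `nltk.Tree`: tokens outside an entity are plain `(token, tag)` tuples, and each named entity is a subtree whose label is the entity type. The sketch below collects the entities from such a tree. The tree is built by hand to mimic that shape; its labels are illustrative and were not produced by running NLTK, and `extract_entities` is our own helper name.

```python
from nltk import Tree

# Hand-built tree that mimics the shape of ne_chunk output; the labels are
# illustrative and were not produced by running NLTK.
chunked = Tree('S', [
    Tree('ORGANIZATION', [('Samsung', 'NNP')]),
    ('lost', 'VBD'), ('a', 'DT'),
    Tree('GPE', [('US', 'NNP')]),
    ('patent', 'NN'), ('case', 'NN'), ('to', 'TO'),
    Tree('ORGANIZATION', [('Apple', 'NNP')]),
])

def extract_entities(tree):
    """Return (label, text) pairs for every labelled chunk directly under the root."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # chunks are subtrees, plain tokens are tuples
            text = ' '.join(token for token, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

print(extract_entities(chunked))
```

You can call the same function on `tokens_pos_tagged_and_named_entities` inside the loop above to list the entities NLTK found in each sentence.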

### Task
* What do you think of the performance of the NER module in NLTK?

## Constituency/dependency parsing ([NLTK Book Chapter 7, Section 2.1](https://www.nltk.org/book/ch07.html))
Please consider the following sentence.
- **the cat saw the dog.**

As a speaker of English, you immediately start to parse the sentence. You determine that **the cat** is the subject, **saw** is the main verb, and **the dog** is the direct object. With **constituency/dependency parsing**, we attempt to teach computers to parse sentences just like humans do.

We will use a module called **RegexpParser**, which is part of NLTK.

In [None]:
sentence = [("the", "DT"), ("little", "JJ"),
            ("dog", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN")]

We then create a very simple grammar, that we can extend later on.

In [None]:
grammar = 'NP: {<DT><NN>}'
constituent_parser = nltk.RegexpParser(grammar)

Our grammar now only contains one rule, which states that a noun phrase (NP) consists of a determiner (DT) followed by a singular noun (NN). 
The tags come from the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

Let's try to parse the tagged sentence defined above, 'the little dog saw the cat'. We can inspect the parse result as a visualized tree structure, as well as by printing it.

In [None]:
constituent_structure = constituent_parser.parse(sentence)
print(constituent_structure) # print the sentence structure
constituent_structure.draw() # visualize the parse tree structure

Please note that **the cat** has now been identified as a noun phrase (NP). However, **the little dog** has not been identified, because our rule does not allow adjectives between the determiner and the noun. Let's fix that!

In [None]:
grammar = 'NP: {<DT><JJ>*<NN>}'
constituent_parser = nltk.RegexpParser(grammar)

We've now changed the rule for an NP. A noun phrase is now defined as:
* a determiner (DT), followed by zero or more adjectives (JJ), followed by a singular noun (NN)

The star (`*`) means "zero or more", so the adjective is optional and may also be repeated.

In [None]:
constituent_structure = constituent_parser.parse(sentence)
print(constituent_structure) # print the sentence structure
constituent_structure.draw() # visualize the parse tree structure
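Besides printing or drawing the tree, you will often want the chunks themselves. A minimal sketch, using the same hand-tagged sentence and grammar as above, collects the text of every NP with `Tree.subtrees` and `Tree.leaves`. The variable names are our own.

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN")]

# Same grammar as above: determiner, zero or more adjectives, singular noun.
np_parser = nltk.RegexpParser('NP: {<DT><JJ>*<NN>}')
tree = np_parser.parse(sentence)

# Visit only the subtrees labelled NP and join the tokens of their leaves.
noun_phrases = [
    ' '.join(token for token, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')
]
print(noun_phrases)
```

This needs no downloaded NLTK data, because the sentence is already tagged.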

You can continue to extend the grammar. Try to understand the following grammar:

In [None]:
constituent_parser = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')

In [None]:
tokens = ['In', 'the', 'house', 'the', 'yellow', 'cat', 'saw', 'the', 'dog']
tagged = nltk.pos_tag(tokens)
print(tagged)
constituent_structure = constituent_parser.parse(tagged)
print(constituent_structure)
constituent_structure.draw()

### Save tree structure to file
There are at least two ways of saving the tree structure to a file:
1. you can drag the image to your Desktop or File Explorer/Finder.
2. you can use the code snippet below (might need Ghostscript installation):

Please convert the .ps file to PDF.

If you get the following error:
```
===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
```

**Solution**:
* Download Ghostscript and add it to your PATH
* How to download: https://wiki.scribus.net/canvas/Installation_and_Configuration_of_Ghostscript
* On Windows it is not added to Path automatically; add C:\Program Files\gs\gs9.19\bin
* Kernel needs to be restarted to reload PATH. Probably, even Anaconda needs to be restarted to know the new environment variable.

# End of this notebook