<a href="https://colab.research.google.com/github/scskalicky/SNAP-CL/blob/main/03_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A Gentle Introduction to the Natural Language Toolkit**

There are many different NLP/CL packages and libraries to choose from. We are going to work with one called the Natural Language Toolkit (NLTK). NLTK provides built-in functions for performing common NLP tasks, such as tokenising a text (splitting it into words), counting the frequency of words, part-of-speech tagging (e.g., labelling words as nouns, verbs, etc.), performing sentiment analysis, and much more.

We are only going to scratch the surface of NLTK, as a means of showing you how to get going with some basic text analytics.

> *You can learn Python and computational linguistics at the same time using the free NLTK book at https://www.nltk.org/book/*










## **Loading NLTK**

We need to tell Python to load NLTK. To do so, we type `import nltk` in a code cell and run it, see below:


In [None]:
# load the NLTK resource into the notebook
import nltk 

We also need to download some extra resources in order to use the NLTK functions in this notebook. Run the code cell below to download those resources. Because these notebooks are hosted on virtual servers, you will need to repeat this step each time you load this notebook. Fortunately, it does not take very long. Different functions require different resources, and Colab will tell you if a resource is missing when you try to use an NLTK function. The exact resource names can depend on your NLTK version (for example, newer releases may ask for `punkt_tab` rather than `punkt`); the error message will name the resource to download.

In [None]:
# download resources necessary for tokenizing and part of speech tagging.
nltk.download(['punkt', 'averaged_perceptron_tagger', 'tagsets'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

## **Tokenizing a Text**

We can now use NLTK to split a string into separate words or *tokens*. We will do so using the `nltk.word_tokenize()` function. This function expects a string as the input, which you place inside the `()` at the end of the function, like so:

In [None]:
nltk.word_tokenize("These pretzels are making me thirsty!")

['These', 'pretzels', 'are', 'making', 'me', 'thirsty', '!']

Note that the output shows each word from the sentence separated by commas, and also that the punctuation mark "!" is treated as a separate word. The output is in the form of a Python `list`, another data structure which can be used to hold strings as well as other value types. 

Do you remember how you set a string to a variable? You can do the same thing with the results from functions, such as `nltk.word_tokenize()`. Consider below:

In [None]:
# save a string to a variable
pretzels_raw = 'These pretzels are making me thirsty!'

# save the tokenized version of the string held in pretzels_raw to a different variable
pretzels_tokenized = nltk.word_tokenize(pretzels_raw)

# inspect contents of tokenized version
pretzels_tokenized

['These', 'pretzels', 'are', 'making', 'me', 'thirsty', '!']

We can thus query the length of our document, in words, using the `len()` function.

In [None]:
# how many words in our example? 
len(pretzels_tokenized)

7

**Your Turn**

Try tokenising some text and measuring the length using `len()`. You should explore feeding a raw string to the function as well as saving a string to a variable first. 

In [None]:
# tokenize some text!

### **Preprocessing**

Of course, the punctuation is being counted as a "word", which we may not think is appropriate. 

This raises an important question regarding the computational analysis of text - how should texts be prepared before an analysis? Removing all of the punctuation from a text is a form of *pre-processing* and is a common step in many natural language processing tasks. Other stages of pre-processing can include converting all words to lower case or removing so-called "*stopwords*", which are highly frequent *function* words such as *the*, *a*, *and*, and so on. Many existing NLP libraries / frameworks have options to conduct pre-processing automatically.

Below, I have written a function which performs two stages of pre-processing: lowercasing and removing punctuation. Running the code cell will load the function into the notebook's memory so that you can use that same function in other code cells. 

In [None]:
# define a string containing punctuation markers we do not want
punctuation = '!.,\'";:-'

# define a function to pre-process text
def preprocess(text):
  # lower case the text and save results to a variable
  lower_case = text.lower()
  # remove punctuation from lower_case and save to a variable
  # don't worry too much if you don't understand the code in this line. 
  lower_case_no_punctuation = ''.join([character for character in lower_case if character not in punctuation])
  # return the new text to the user
  return lower_case_no_punctuation

In the next code cell, I use the `preprocess` function on a string which contains uppercase letters and two punctuation marks, "!" and ".". The results show how all the letters are now lowercase, and the punctuation has been removed.

In [None]:
# test our function on a string
preprocess('HELLO! wOrld.')

'hello world'
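
The character filter inside `preprocess` can also be written with Python's built-in `str.translate` method. The cell below is a minimal sketch of that alternative, not a replacement for the function above: the name `preprocess_translate` is made up for illustration, and the cell repeats the same `punctuation` string so that it runs on its own.

```python
# same punctuation markers as in the preprocess cell above
punctuation = '!.,\'";:-'

# build a translation table which deletes every punctuation character
# (with three arguments, str.maketrans maps each character in the third argument to None)
remove_punctuation = str.maketrans('', '', punctuation)

# hypothetical alternative to preprocess(), named here for illustration only
def preprocess_translate(text):
  # lower case the text, then delete the punctuation characters
  return text.lower().translate(remove_punctuation)

# should give the same result as preprocess('HELLO! wOrld.')
print(preprocess_translate('HELLO! wOrld.'))
```

Both versions do the same job; the list comprehension spells out the logic one character at a time, while `str.translate` hands the work to a built-in method.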

Before moving on, try out the preprocess function on some strings of your choice. You might want to try saving your string to a variable and then using the preprocess function, like this: 

> `my_variable = 'some string'`   
> `preprocess(my_variable)`

You might also want to try saving the results of preprocess to another variable, like this:

> `new_variable = preprocess(my_variable)`

In [None]:
# Play with the preprocess() function here
#preprocess()

We can now use the preprocess function to process a text before sending it to be tokenized, such as seen below.

In [None]:
# save a string to a variable
mood_ring = "I can't feel a thing. I keep looking at my mood ring."

# pre-process the string using the preprocess function, and save results to a variable
mood_ring_preprocessed = preprocess(mood_ring)

# tokenize the preprocessed text
mood_ring_tokenized = nltk.word_tokenize(mood_ring_preprocessed)

The next cell shows you a comparison between the original string and the processed version. This provides a glimpse of the "NLP pipeline" we are building. 

In [None]:
# compare the original input and the eventual output
print(f'Input\n{mood_ring}\n\nOutput\n{mood_ring_tokenized}')

Input
I can't feel a thing. I keep looking at my mood ring.

Output
['i', 'cant', 'feel', 'a', 'thing', 'i', 'keep', 'looking', 'at', 'my', 'mood', 'ring']
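
Stopword removal, mentioned earlier as another pre-processing stage, could be added to the pipeline as one more filtering step. The cell below is a small sketch under stated assumptions: `example_stopwords` is a tiny hand-picked set invented for this example (NLTK provides fuller lists through `nltk.corpus.stopwords`, which needs its own download), and the token list is copied from the output above so that the cell runs without NLTK.

```python
# the tokens produced above, copied here so the cell runs on its own
mood_ring_tokenized = ['i', 'cant', 'feel', 'a', 'thing', 'i', 'keep',
                       'looking', 'at', 'my', 'mood', 'ring']

# a tiny, hand-picked stopword set (an assumption for this example, not an official list)
example_stopwords = {'i', 'a', 'at', 'my', 'the', 'and'}

# keep only the tokens which are not in the stopword set
content_tokens = [token for token in mood_ring_tokenized if token not in example_stopwords]

print(content_tokens)
# ['cant', 'feel', 'thing', 'keep', 'looking', 'mood', 'ring']
```

Whether to remove stopwords depends on the question being asked: they carry little topical content, but they matter for analyses of grammar or style.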


**Your Turn**

Play with the preprocess function to compare the before and after of different text.

In [None]:
# try the preprocess function here


## **Types and Tokens**

Now that we can preprocess and tokenize a text, we can start querying properties of the texts. In this section we will consider how to count the frequency of different words in a text, as well as the overall lexical diversity of a text. Let's define some terms first:

- A ***type*** is a unique word.
- A ***token*** is an individual occurrence of a type.


For example - you might have three dogs: two Labradoodles and a Samoyed. If we sorted our dogs into types and tokens, we would have three tokens (three dogs), but only two types: Labradoodle and Samoyed.

When we used `nltk.word_tokenize()`, we split our string into a series of tokens. 

We saw that we can also query the number of tokens by measuring the length of the tokenized list using `len()`.

For example, the number of tokens in our preprocessed example from above is 12, which we can confirm by manually counting the tokens.


In [None]:
mood_ring_tokenized

['i',
 'cant',
 'feel',
 'a',
 'thing',
 'i',
 'keep',
 'looking',
 'at',
 'my',
 'mood',
 'ring']

In [None]:
len(mood_ring_tokenized)

12

How can we figure out the number of types in that same example? We **could** manually count the number of types, which is 11 (because the token "i" occurs twice).

We can also use a built-in Python function, `set()`, which returns a data container that only allows one of any value to exist in the container. In other words, it returns an object where repeated values are not allowed. This means we can simply use `set()` to ask for the unique values in our example. 




In [None]:
# What are the unique values among our tokens? 
set(mood_ring_tokenized)

{'a',
 'at',
 'cant',
 'feel',
 'i',
 'keep',
 'looking',
 'mood',
 'my',
 'ring',
 'thing'}

We can then wrap `set()` inside `len()` to query how many types there are in our text. We see the answer is 11, which is one fewer than the number of tokens. 

In [None]:
# what is the length of the set of our tokens?
len(set(mood_ring_tokenized))

11
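
The dog example from the start of this section can be checked with the same two functions. The sketch below uses a hand-written list (the variable name `dogs` is just for illustration), so it does not need NLTK at all.

```python
# three dogs: two Labradoodles and a Samoyed
dogs = ['Labradoodle', 'Labradoodle', 'Samoyed']

# number of tokens: every item in the list counts
print(len(dogs))
# 3

# number of types: set() keeps only one copy of each value
print(len(set(dogs)))
# 2
```

Note that `set()` compares strings exactly, so `'Samoyed'` and `'samoyed'` would count as two different types. This is one reason lowercasing is a common pre-processing step.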

**Your Turn**

Compare the results of `len()` and `set()` on different strings of your choosing. 

In [None]:
# compare len() and set() here. 


### **Measuring Lexical Diversity**

We can now use this information to assess our text for a very basic measure of sophistication: lexical diversity. This is also known as a type-token ratio, and provides a measure of how many repeated words there are in a text. You can read more about it in [Chapter 1 of the NLTK book](https://www.nltk.org/book/ch01.html).

To calculate lexical diversity, we can use the following formula:

> `number of types / number of tokens`

In the code cell below, I create a function which calculates this value.

In [None]:
# define a function to calculate lexical diversity
def lexical_diversity(tokens):
  # return the number of types divided by the number of tokens
  return len(set(tokens))/len(tokens)

Let's explore what the lexical diversity of our example is:


In [None]:
lexical_diversity(mood_ring_tokenized)

0.9166666666666666

We get a result of about 0.917. In other words, the number of types is about 91.7% of the number of tokens, which indicates a very high lexical diversity (very few words are repeated).

Of course, such measures are relatively meaningless on such a short amount of text - the true use of lexical diversity would be to compare much larger texts against one another. One might also want to consider further pre-processing. 

Nonetheless, try the lexical diversity function on some examples yourself to see how repeating words influence the overall score. 

> ***Important!*** You need to feed a list of tokens to `lexical_diversity()`, otherwise you will get the diversity based on **characters** in the string, not words!

In [None]:
# Lexical diversity of 50%
lexical_diversity(nltk.word_tokenize('hello world hello world'))

0.5

In [None]:
# Lexical diversity of 100%
lexical_diversity(nltk.word_tokenize('hello world'))

1.0
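
To see why the warning above matters, the sketch below calls the same function once on a list of tokens and once on a raw string. It redefines `lexical_diversity` so the cell runs on its own, and it uses Python's built-in `str.split()` as a stand-in for `nltk.word_tokenize()` (an assumption made only so that the example does not depend on NLTK).

```python
# same function as defined above, repeated so this cell is self-contained
def lexical_diversity(tokens):
  return len(set(tokens))/len(tokens)

text = 'hello world hello world'

# a list of tokens: 2 types / 4 tokens
print(lexical_diversity(text.split()))
# 0.5

# a raw string: Python loops over the characters instead,
# so this is 8 unique characters / 23 characters in total
print(lexical_diversity(text))
```

The second call runs without any error, which is exactly why this mistake is easy to miss. Also note that an empty list of tokens would raise a `ZeroDivisionError`, because the function divides by `len(tokens)`.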

**Your Turn**

Try out the lexical diversity function on some text. Remember that the function expects a list of tokens as input, so tokenize your string first.

In [None]:
# Play with lexical diversity on your own examples



## **Word Frequency**

We can also query the frequency of tokens in a text using an NLTK function. We will again feed a list of tokens to this function. The syntax for this function is:

> `nltk.FreqDist(tokens)`

Consider the following example:

In [None]:
# will this become stuck in your head?
turtles = """teenage mutant ninja turtles, 
            teenage mutant ninja turtles, 
            teenage mutant ninja turtles, 
            heroes in a halfshell, turtle power!"""


# save the frequency distribution to a variable
turtle_fdist = nltk.FreqDist(nltk.word_tokenize(turtles))

# inspect the results
turtle_fdist

FreqDist({',': 4, 'teenage': 3, 'mutant': 3, 'ninja': 3, 'turtles': 3, 'heroes': 1, 'in': 1, 'a': 1, 'halfshell': 1, 'turtle': 1, ...})

The resulting frequency distribution behaves like another Python data object called a dictionary (`dict`), which stores key:value pairs. In this case, our keys are the words, and the values are the frequencies.

We can query a dictionary for specific key:value pairs using the following syntax:

> `dictionary['key']`

For example:

In [None]:
# how frequent is "turtles?"
turtle_fdist['turtles']

3

In [None]:
# how frequent is "turtle?"

turtle_fdist['turtle']

1

We can also ask for the most frequent N terms from a frequency distribution using the `.most_common()` method. We can specify the number of top results we want by putting a number in the parentheses `()` used by `.most_common()`. Below I ask for the number one most common word in our example:

In [None]:
turtle_fdist.most_common(1)

[(',', 4)]

As it turns out, the most common "word" was a comma. Yet another example of why pre-processing is an important step in text analytics and NLP. 
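
One way to keep the comma out of the counts is to remove punctuation before counting. The cell below is a rough sketch of that idea which runs without NLTK: it uses `collections.Counter` from the Python standard library (the class which `nltk.FreqDist` is built on) and `str.split()` in place of `nltk.word_tokenize()`. Both substitutions are assumptions made to keep the example self-contained, and the names `turtle_tokens` and `turtle_counts` are made up for illustration.

```python
from collections import Counter

turtles = """teenage mutant ninja turtles, 
            teenage mutant ninja turtles, 
            teenage mutant ninja turtles, 
            heroes in a halfshell, turtle power!"""

# same punctuation markers used by preprocess() earlier in the notebook
punctuation = '!.,\'";:-'

# remove the punctuation characters, then split the text on whitespace
turtles_clean = ''.join([character for character in turtles if character not in punctuation])
turtle_tokens = turtles_clean.split()

# count the tokens; Counter supports the same [] lookup and .most_common() method
turtle_counts = Counter(turtle_tokens)

print(turtle_counts['turtles'])
# 3

# the four most frequent words each occur three times
print(turtle_counts.most_common(4))
```

In the notebook itself you could get the same effect by calling `preprocess()` on the text before passing it to `nltk.word_tokenize()` and `nltk.FreqDist()`.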


**Your turn**

Take this opportunity to make your own frequency distributions using `nltk.FreqDist()`. Remember to supply the function with a list of tokens - if you're curious you can see what happens if you supply a raw string!

In [None]:
# Play with FreqDist here. 



## **Parts of Speech**

Words are classified into different word categories, such as nouns, verbs, adjectives, pronouns, etc. These annotations are called parts of speech (POS) and are another source of information used in NLP applications. 


You can think of the part of speech tags as additional information about a word - which can then also be counted and compared, but is also critical information for building and understanding grammars of languages. Tagging is a fundamental part of the NLP pipeline and usually the step which occurs after tokenization.

Today, tagging (and tokenization) is often done using neural language models which represent words as numerical features in a vector space. We aren't going to get into that - we'll just use the built-in NLTK part-of-speech tagging function.

The function expects tokens:

> `nltk.pos_tag(tokens)`

The results will be a list of `(word, tag)` pairs (which incidentally introduces you to another Python data structure, the tuple).


In [None]:
# Part of Speech (POS) tags for our example
nltk.pos_tag(mood_ring_tokenized)

[('i', 'NN'),
 ('cant', 'VBP'),
 ('feel', 'VB'),
 ('a', 'DT'),
 ('thing', 'NN'),
 ('i', 'NN'),
 ('keep', 'VBP'),
 ('looking', 'VBG'),
 ('at', 'IN'),
 ('my', 'PRP$'),
 ('mood', 'NN'),
 ('ring', 'NN')]

You can see in the output that the POS tags are represented as strings such as "NN" and "VBP". These all stand for different parts of speech. You can view the built-in help for what each tag means by running the code in the next cell or by [going here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

In [None]:
# full list of tags, with definitions and examples
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

Part of speech tags help make sense of words in the context of other words. One way tags are helpful is to distinguish different meanings/uses of words which can be used in different parts of speech. For example:

In [None]:
# what pos tag does the word "comb" have in this example?
nltk.pos_tag(nltk.word_tokenize('Quick, comb the desert for droids!'))

[('Quick', 'NNP'),
 (',', ','),
 ('comb', 'VBZ'),
 ('the', 'DT'),
 ('desert', 'NN'),
 ('for', 'IN'),
 ('droids', 'NNS'),
 ('!', '.')]

In [None]:
# and what pos tag does the word "comb" have in this example?
nltk.pos_tag(nltk.word_tokenize('Where is my comb?'))

[('Where', 'WRB'), ('is', 'VBZ'), ('my', 'PRP$'), ('comb', 'NN'), ('?', '.')]

So, adding POS tag information provides more information about a text, which becomes useful for more advanced NLP applications such as information extraction, text prediction, and so on. Because the tags are stored as strings, you can use your knowledge of Python to search or filter through the list in order to find specific words associated with specific tags.

We could even feed this to a frequency distribution and see how frequently certain words appear with certain POS tags. 

In [None]:
nltk.FreqDist(nltk.pos_tag(nltk.word_tokenize('Where is my comb? Please comb the desert for droids!')))

FreqDist({('Where', 'WRB'): 1, ('is', 'VBZ'): 1, ('my', 'PRP$'): 1, ('comb', 'NN'): 1, ('?', '.'): 1, ('Please', 'NNP'): 1, ('comb', 'VBZ'): 1, ('the', 'DT'): 1, ('desert', 'NN'): 1, ('for', 'IN'): 1, ...})
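
Because each item in the tagged output is a `(word, tag)` tuple, a list comprehension is enough to filter words by tag. The sketch below copies the tagged mood ring output from earlier in this section into a variable (the name `mood_ring_tagged` is made up here), so the cell runs without calling NLTK again.

```python
# the tagged output from the mood ring example, copied so this cell runs on its own
mood_ring_tagged = [('i', 'NN'), ('cant', 'VBP'), ('feel', 'VB'), ('a', 'DT'),
                    ('thing', 'NN'), ('i', 'NN'), ('keep', 'VBP'), ('looking', 'VBG'),
                    ('at', 'IN'), ('my', 'PRP$'), ('mood', 'NN'), ('ring', 'NN')]

# each item is a (word, tag) tuple, which the loop unpacks into two names
nouns = [word for word, tag in mood_ring_tagged if tag == 'NN']
print(nouns)
# ['i', 'thing', 'i', 'mood', 'ring']

# keep every word whose tag starts with 'VB' (VB, VBP, VBG, and so on)
verbs = [word for word, tag in mood_ring_tagged if tag.startswith('VB')]
print(verbs)
# ['cant', 'feel', 'keep', 'looking']
```

The results also show the limits of the tagger on pre-processed text: the lowercased pronoun "i" has been tagged as a noun, so filtered lists like these are worth checking by eye.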

**Your Turn**

Use the part of speech function to tag the part of speech of some text. Remember that you need to provide `nltk.pos_tag()` a list of tokens.

In [None]:
# Look at different POS tags here.


# **A full NLP Pipeline**

Let's combine everything we've done into a full NLP pipeline which reads in raw text (as a string) and then provides information about that text. To do so, I will create a new function which calls the `preprocess` and `lexical_diversity` functions we used above, and then outputs the length, the lexical diversity, and the top 5 most frequent word:pos_tag combinations.

In [None]:
def pipeline(string_input):
  # first lowercase the string and clear punctuation using our preprocess function (defined above)
  preprocess_string = preprocess(string_input)

  # now use NLTK to tokenize the preprocessed text
  tokenized_string = nltk.word_tokenize(preprocess_string)

  # calculate lexical diversity using our lexical_diversity function (defined above)
  ld = lexical_diversity(tokenized_string)

  # pos tag the tokens
  pos_tagged_string = nltk.pos_tag(tokenized_string)

  # calculate frequency of words and tags
  fdist = nltk.FreqDist(pos_tagged_string)

  # output some information about the text
  print(f"""
  Length:\t{len(tokenized_string)}\n
  Lexical Diversity:\t{ld}\n
  Top 5 Frequent Words:\t{fdist.most_common(5)}
  """)

I then apply the function to a longer text below, and quickly get statistics such as total length, lexical diversity, and the top five most frequent words. 

In [None]:
george = """The sea was angry that day, my friends - like an old man trying to send back soup in a deli. 
I got about fifty feet out and suddenly, the great beast appeared before me. 
I tell you, he was ten stories high if he was a foot. 
As if sensing my presence, he let out a great bellow."""

# get info about this text.
pipeline(george)


  Length:	58

  Lexical Diversity:	0.7931034482758621

  Top 5 Frequent Words:	[(('was', 'VBD'), 3), (('a', 'DT'), 3), (('he', 'PRP'), 3), (('the', 'DT'), 2), (('my', 'PRP$'), 2)]
  


**Your Turn**

You can modify the function above (or remake your own) to provide information most relevant to your research interests or questions! Once you learn more Python, you can add options to either keep or remove punctuation, calculate frequency from only the words, or anything else. What sorts of information might you want to ask from your texts? 
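
As one possible starting point, the cell below sketches a variant of the pipeline with an option to keep or remove punctuation, and with frequencies calculated from the words only. Everything new here is an assumption for illustration: the name `pipeline_summary`, the `remove_punctuation` parameter, returning a dictionary instead of printing, and the use of `str.split()` and `collections.Counter` in place of the NLTK functions so that the cell runs on its own.

```python
from collections import Counter

# same punctuation markers used by preprocess() earlier in the notebook
punctuation = '!.,\'";:-'

# hypothetical variant of pipeline(): returns a dictionary instead of printing
def pipeline_summary(string_input, remove_punctuation=True):
  # lower case the text
  text = string_input.lower()

  # optionally remove punctuation
  if remove_punctuation:
    text = ''.join([character for character in text if character not in punctuation])

  # str.split() stands in for nltk.word_tokenize() so the sketch runs without NLTK
  tokens = text.split()

  return {
    'length': len(tokens),
    # avoid dividing by zero if the input contains no tokens
    'lexical_diversity': len(set(tokens)) / len(tokens) if tokens else 0.0,
    # frequency of the words only (no POS tags)
    'top_words': Counter(tokens).most_common(5),
  }

summary = pipeline_summary('Hello world. Hello again, world!')
print(summary['length'])
# 5
print(summary['lexical_diversity'])
# 0.6
```

Returning a dictionary rather than printing makes it easier to compare several texts later, for example by collecting the summaries in a list. To use NLTK's tokenizer instead, the `text.split()` line could be swapped for `nltk.word_tokenize(text)`.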