<a href="https://colab.research.google.com/github/scskalicky/SNAP-CL/blob/main/03_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A Gentle Introduction to the Natural Language Toolkit**

There are many different NLP/CL packages and libraries to choose from. We are going to work with one called NLTK, the Natural Language Toolkit. NLTK provides built-in functions for performing common NLP tasks, such as tokenising a text (splitting it into words), counting the frequency of words, part-of-speech tagging (e.g., labelling words as nouns, verbs, etc.), performing sentiment analysis, and much more. 

We will only scratch the surface of NLTK, as a way to show you how to get started with some basic text analytics.

> *You can learn Python and computational linguistics at the same time using the free NLTK book at https://www.nltk.org/book/*










## **Loading NLTK**

We need to tell Python to load NLTK. To do so, we type `import nltk` in a code cell and run it, as shown below:


In [None]:
# load the NLTK resource into the notebook
import nltk 

We also need to download some extra resources in order to use the NLTK functions in this notebook. Run the code cell below to download those resources. Because these notebooks are hosted on virtual servers, you will need to repeat this step each time you load this notebook. Fortunately, it does not take very long. Different functions require different resources, and Colab will tell you if a resource is missing when you try to use an NLTK function. 

In [None]:
# download resources necessary for tokenizing and part of speech tagging.
nltk.download(['punkt', 'averaged_perceptron_tagger'])

## **Tokenizing a Text**

We can now use NLTK to split a string into separate words or *tokens*. We will do so using the `nltk.word_tokenize()` function. This function expects a string as the input, which you place inside the `()` at the end of the function, like so:

In [None]:
nltk.word_tokenize("These pretzels are making me thirsty!")

Note that the output shows each word from the sentence separated by commas, and also that the punctuation mark "!" is treated as a separate word. The output is in the form of a Python `list`, another data structure which can be used to hold strings as well as other value types. 

Do you remember how you set a string to a variable? You can do the same thing with the results from functions, such as `nltk.word_tokenize()`. Consider below:

In [None]:
# save a string to a variable
pretzels_raw = 'These pretzels are making me thirsty!'

# save the tokenized version of the string held in pretzels_raw to a different variable
pretzels_tokenized = nltk.word_tokenize(pretzels_raw)

# inspect contents of tokenized version
pretzels_tokenized

We can thus query the length of our document, in words, using the `len()` function.

In [None]:
# how many words in our example? 
len(pretzels_tokenized)

### **Preprocessing**

Of course, the punctuation is being counted as a "word", which we may not think is appropriate. 

This raises an important question regarding the computational analysis of text: how should texts be prepared before an analysis? Removing all of the punctuation from a text is a form of *pre-processing* and is common in many natural language processing tasks. Other stages of pre-processing can include converting all words to lower case or removing so-called "*stopwords*", which are highly frequent *function* words such as *the*, *a*, *and*, and so on. Many existing NLP libraries and frameworks have options to conduct pre-processing automatically. 

Below, I have written a function which performs two stages of pre-processing: lowercasing and removing punctuation. Running the code cell will load the function into the notebook's memory so that you can use that same function in other code cells. 

In [None]:
# define a string containing punctuation markers we do not want
punctuation = '!.,\'";:-'

# define a function to pre-process text
def preprocess(text):
  # lower case the text and save results to a variable
  lower_case = text.lower()
  # remove punctuation from lower_case and save to a variable
  # don't worry too much if you don't understand the code in this line. 
  lower_case_no_punctuation = ''.join([character for character in lower_case if character not in punctuation])
  # return the new text to the user
  return lower_case_no_punctuation

In the next code cell, I use the `preprocess` function on a string which contains uppercase letters and two punctuation marks, "!" and ".". The results show that all the letters are now lowercase and the punctuation has been removed. 

In [None]:
# test our function on a string
preprocess('HELLO! wOrld.')
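
If the single line inside `preprocess` that removes punctuation looks mysterious, it may help to see it unpacked into an ordinary `for` loop. The cell below is only a sketch of one equivalent way to write it. The name `preprocess_with_loop` is hypothetical (it is not used anywhere else in this notebook), and the `punctuation` string is repeated so the cell runs on its own.

```python
# the same punctuation markers used by preprocess()
punctuation = '!.,\'";:-'

# hypothetical name: a step-by-step version of preprocess()
def preprocess_with_loop(text):
  # lower case the text
  lower_case = text.lower()
  # start with an empty list and keep only the characters we want
  kept_characters = []
  for character in lower_case:
    if character not in punctuation:
      kept_characters.append(character)
  # glue the kept characters back together into one string
  return ''.join(kept_characters)

print(preprocess_with_loop('HELLO! wOrld.'))
```

Both versions look at the text one character at a time and keep a character only if it does not appear in `punctuation`.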

Before moving on, try out the preprocess function on some strings of your choice. You might want to try saving your string to a variable and then using the preprocess function, like this: 

> `my_variable = 'some string'`   
> `preprocess(my_variable)`

You might also want to try saving the results of preprocess to another variable, like this:

> `new_variable = preprocess(my_variable)`

In [None]:
# Play with the preprocess() function here
# (replace the example string with one of your own)
preprocess('Replace THIS string with your own!')

We can now use the preprocess function to process a text before sending it to be tokenized, as shown below.

In [None]:
# save a string to a variable
mood_ring = "I can't feel a thing. I keep looking at my mood ring."

# pre-process the string using the preprocess function, and save results to a variable
mood_ring_preprocessed = preprocess(mood_ring)

# tokenize the preprocessed text
mood_ring_tokenized = nltk.word_tokenize(mood_ring_preprocessed)

The next cell shows you a comparison between the original string and the processed version. This provides a glimpse of the "NLP pipeline" we are building. 

In [None]:
# compare the original input and the eventual output
print(f'Input\n{mood_ring}\n\nOutput\n{mood_ring_tokenized}')

## **Types and Tokens**

Now that we can preprocess and tokenize a text, we can start querying properties of the texts. In this section we will consider how to count the frequency of different words in a text, as well as the overall lexical diversity of a text. Let's define some terms first:

- A ***type*** is a unique word.
- A ***token*** is an individual occurrence of a type.


For example, you might have three dogs: two Labradoodles and a Samoyed. If we sorted our dogs into types and tokens, we would have three tokens (three dogs), but only two types: Labradoodle and Samoyed.

When we used `nltk.word_tokenize()`, we split our string into a series of tokens. 

We saw that we can also query the number of tokens by measuring the length of the tokenized list using `len()`.

For example, the number of tokens in our preprocessed example from above is 12, which we can confirm by manually counting the tokens.


In [None]:
mood_ring_tokenized

In [None]:
len(mood_ring_tokenized)

How can we figure out the number of types in that same example? We **could** manually count the number of types, which is 11 (because the token "i" occurs twice).

We can also use a built-in Python function, `set()`, which returns a data container that only allows one of any value to exist in the container. In other words, it returns an object where repeated values are not allowed. This means we can simply use `set()` to ask for the unique values in our example. 




In [None]:
# What are the unique values among our tokens? 
set(mood_ring_tokenized)
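
To connect `set()` back to the dogs example from the start of this section, here is a small sketch using a hand-made list. The variable names are made up for illustration, and nothing here depends on NLTK.

```python
# three dogs (tokens), but only two kinds of dog (types)
dogs = ['Labradoodle', 'Labradoodle', 'Samoyed']

# every item in the list counts as a token
number_of_tokens = len(dogs)

# set() keeps one copy of each value, so its length is the number of types
number_of_types = len(set(dogs))

print(number_of_tokens, number_of_types)
```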

We can then wrap `set()` inside `len()` to query how many types there are in our text. We see the answer is 11, which is one fewer than the number of tokens. 

In [None]:
# what is the length of the set of our tokens?
len(set(mood_ring_tokenized))
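
The start of this section also mentions counting how frequent each word is. One possible way to do that, assuming we use `Counter` from Python's built-in `collections` module rather than an NLTK function, is sketched below. The token list is typed out by hand (and assumed to match the tokenized mood ring example) so that the cell runs on its own.

```python
from collections import Counter

# hand-typed copy of the tokens from the mood ring example (an assumption)
tokens = ['i', 'cant', 'feel', 'a', 'thing', 'i',
          'keep', 'looking', 'at', 'my', 'mood', 'ring']

# Counter tallies how many times each value occurs in the list
word_counts = Counter(tokens)

# how many times does the token "i" occur?
print(word_counts['i'])

# the number of different keys is the number of types
print(len(word_counts))
```

The number of keys in the `Counter` is the same value we get from `len(set(tokens))`, because both keep one entry per type.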

### **Measuring Lexical Diversity**

We can now use this information to assess our text for a very basic measure of sophistication: lexical diversity. This is also known as a type-token ratio, and provides a measure of how much word repetition there is in a text. You can read more about it in [Chapter 1 of the NLTK book](https://www.nltk.org/book/ch01.html).

To calculate lexical diversity, we can use the following formula:

> `number of types / number of tokens`

In the code cell below, I create a function which calculates this value.

In [None]:
# define a function to calculate lexical diversity
def lexical_diversity(tokens):
  # return the number of types (unique tokens) divided by the number of tokens
  return len(set(tokens))/len(tokens)

Let's explore what the lexical diversity of our example is:


In [None]:
lexical_diversity(mood_ring_tokenized)

We get a result of about 0.917. In other words, the number of types is about 91.7% of the number of tokens, which indicates a very high lexical diversity.

Of course, such measures are relatively meaningless on such a short amount of text - the true use of lexical diversity would be to compare much larger texts against one another. One might also want to consider further pre-processing. 

Nonetheless, try the lexical diversity function on some examples yourself to see how repeated words influence the overall score. 

> ***Important!*** You need to feed a list of tokens to `lexical_diversity()`, otherwise you will get the diversity based on **characters** in the string, not words!

In [None]:
# play with lexical diversity here
lexical_diversity(nltk.word_tokenize('hello world hello'))
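
To see why the warning above matters, the sketch below feeds the same text to `lexical_diversity()` twice: once as a raw string and once as a list of words. To keep the cell free of downloads, it assumes Python's built-in `split()` as a stand-in for `nltk.word_tokenize()`, and it repeats the function definition so it runs on its own.

```python
# same function as defined earlier in the notebook
def lexical_diversity(tokens):
  return len(set(tokens)) / len(tokens)

text = 'hello world hello'

# a raw string is treated as a sequence of characters (spaces included)
from_characters = lexical_diversity(text)

# a list of words gives the type-token ratio we actually want
from_tokens = lexical_diversity(text.split())

print(round(from_characters, 3), round(from_tokens, 3))
```

The two numbers differ because the first divides unique characters by total characters, while the second divides unique words by total words.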

Recall the two queries used above, `len(tokens)` and `len(set(tokens))`. Which one gives us the total number of tokens in a text, and which one gives us the total number of types? Both are built-in Python functions, so we can test this out without needing to rely on NLTK objects.

In [None]:
# download the list of stopwords
nltk.download('stopwords')




In [None]:
from nltk.corpus import stopwords
print(stopwords.words('english'))
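
A list of stopwords becomes useful once we filter tokens against it. The cell below is a minimal sketch of that idea. To keep the logic visible and the cell self-contained, it assumes a tiny hand-picked set called `example_stopwords` (a made-up name) in place of the full NLTK list, and a hand-typed copy of the mood ring tokens.

```python
# tiny hand-picked stand-in for the full NLTK stopword list (an assumption)
example_stopwords = {'i', 'a', 'at', 'my'}

# hand-typed copy of the tokens from the mood ring example
tokens = ['i', 'cant', 'feel', 'a', 'thing', 'i',
          'keep', 'looking', 'at', 'my', 'mood', 'ring']

# keep only the tokens which are not in the stopword set
content_tokens = [token for token in tokens if token not in example_stopwords]

print(content_tokens)
```

To try this with the real list, you could swap `example_stopwords` for `set(stopwords.words('english'))` after running the two cells above.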





One of the most basic pieces of information we can ask from a text is the distribution of different words in a text (e.g., how frequent is each word, how diverse is a text's vocabulary), as well as basic properties of those words (e.g., noun, verb). 

To do so, we first need to understand how to separate a text into words. Remember, since Python sees any string as a sequence of characters (including whitespace and punctuation), this sequence does **not** understand words in the way that we do. 

We must therefore think of ways to split strings into separate words.

We can use built-in methods to do so, such as the `split()` method that every Python string has. But instead we will now shift to using a library called NLTK, the Natural Language Toolkit. 
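
For comparison, here is a small sketch of what the built-in `split()` method does with the pretzels sentence used earlier in this notebook. It splits on whitespace only, so punctuation stays attached to the neighbouring word.

```python
pretzels_raw = 'These pretzels are making me thirsty!'

# split() with no arguments breaks the string wherever there is whitespace
pretzels_split = pretzels_raw.split()

print(pretzels_split)

# the last item still has the "!" attached to it
print(pretzels_split[-1])
```

This is one reason to prefer a dedicated tokenizer such as `nltk.word_tokenize()`, which treats the "!" as its own token.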
