<a href="https://colab.research.google.com/github/scskalicky/SNAP-CL/blob/main/03_Frequency_Tokenisation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Using NLTK to Tokenize and Tag Text**

One of the most basic pieces of information we can ask of a text is the distribution of different words in it (e.g., how frequent is each word, how diverse is the text's vocabulary), as well as basic properties of those words (e.g., noun, verb).

To do so, we first need to understand how to separate a text into words. Remember, Python sees any string as a sequence of characters (including whitespace and punctuation), so Python does **not** understand words in the way that we do.

We must therefore think of ways to split strings into separate words.

We can use built-in methods to do so, such as Python's `str.split()` method. But instead we will now shift to using a library called NLTK - the Natural Language Toolkit.
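Before moving to NLTK, it may help to see what the built-in approach does. The cell below is a minimal sketch using only core Python; the variable names `sentence` and `whitespace_tokens` are chosen here for illustration.

```python
# a plain string: Python sees only characters, not words
sentence = "These pretzels are making me thirsty!"

# split() with no arguments cuts the string wherever there is whitespace
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# the final item still has the punctuation mark attached to it
print(whitespace_tokens[-1])
```

Splitting on whitespace alone leaves punctuation attached to the neighbouring word, which is one reason to use a dedicated tokenizer.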

> *You can learn Python and computational linguistics at the same time using their free book at https://www.nltk.org/book/*


## **NLTK**

We need to tell Python to load NLTK. To do so, we type `import nltk` in a code cell, see below:





In [None]:
# load the NLTK resource into the notebook
import nltk 

We also need to download some extra resources in order to use the NLTK functions. Run the code cell below to download those resources. Because these notebooks are hosted on virtual servers, you will need to repeat this step each time you start a new session.

In [None]:
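# download resources necessary for tokenizing and part of speech tagging.
# note: recent NLTK releases may instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng';
# if you see a LookupError later, download the resource named in the error message.
nltk.download(['punkt', 'averaged_perceptron_tagger'])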

## **Tokenizing a text**

We can now split a string into separate words or *tokens*. We will do so using the `nltk.word_tokenize()` function. Put the string inside the `()` at the end of the function, like so:

In [None]:
nltk.word_tokenize("These pretzels are making me thirsty!")

Note that the output shows each word from the sentence separated by commas, and also that the punctuation mark "!" is treated as a separate word. The output is in the form of a Python `list`, another data structure which can be used to hold strings as well as other data types. 

Do you remember how you assigned a string to a variable? You can do the same thing with the results of functions, such as `nltk.word_tokenize()`. Consider the example below.

In [None]:
# save a string to a variable
pretzels_raw = 'These pretzels are making me thirsty!'

# save the tokenized version of the string to a different variable
pretzels_tokenized = nltk.word_tokenize(pretzels_raw)

# inspect contents of tokenized version
pretzels_tokenized

We can thus query the length of our document, in words, using the `len()` function.

In [None]:
# how many words in our example? 
len(pretzels_tokenized)

Of course, the punctuation is being counted as a "word", which we may not think is appropriate. 
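One possible way to leave punctuation out of the count is sketched below. It assumes the tokens look like the tokenized pretzels example; the list is typed by hand so the cell runs without NLTK, and the names `tokens` and `word_tokens` are illustrative only.

```python
# tokens typed by hand to mirror the tokenized pretzels example
tokens = ['These', 'pretzels', 'are', 'making', 'me', 'thirsty', '!']

# keep a token only if it contains at least one letter or digit
word_tokens = [token for token in tokens
               if any(character.isalnum() for character in token)]

# compare the count with and without punctuation tokens
print(len(tokens))
print(len(word_tokens))
```

This is only one definition of a "word"; whether tokens such as numbers or hyphenated forms should count is a decision that depends on the analysis.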

This raises an important question regarding the computational analysis of text - how should texts be prepared before an analysis? Removing all of the punctuation from a text is a form of *pre-processing* and is a common step in many natural language processing tasks. Other stages of pre-processing can include converting all words to lower case or removing so-called "*stopwords*", which are highly frequent *function* words such as *the*, *a*, *and*, and so on. Many existing NLP libraries and frameworks have options to conduct pre-processing automatically.

Below, I have written a function which performs two stages of pre-processing: lowercasing and removing punctuation.

In [None]:
# define a string containing punctuation markers we do not want
punctuation = '!.,\'";:-'

# define a function to pre-process text
def preprocess(text):
  # lower case the text and save results to a variable
  lower_case = text.lower()
  # remove punctuation from lower_case and save to a variable
  # don't worry too much if you don't understand the code in this line. 
  lower_case_no_punctuation = ''.join([character for character in lower_case if character not in punctuation])
  # return the new text to the user
  return lower_case_no_punctuation

In [None]:
# test our function on a string
preprocess('HELLO! wOrld.')
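The list comprehension inside `preprocess()` can be hard to read at first. As an alternative, here is a sketch of the same two steps using Python's built-in `str.maketrans()` and `str.translate()`. The names `remove_punctuation` and `preprocess_translate` are made up for this example and are not part of the notebook's own code.

```python
# the same punctuation markers used by preprocess()
punctuation = '!.,\'";:-'

# build a translation table that deletes every character in punctuation
remove_punctuation = str.maketrans('', '', punctuation)

def preprocess_translate(text):
    # lower case the text, then delete the unwanted characters
    return text.lower().translate(remove_punctuation)

# test the alternative on the same string
print(preprocess_translate('HELLO! wOrld.'))
```

Note that only the characters listed in `punctuation` are removed. A question mark, for example, is not in the string and would be left in the text.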

Before moving on, try out the preprocess function on some strings of your choice. You might want to try saving your string to a variable and then using the preprocess function, like this: 

> `preprocess(variable)`

You might also want to try saving the results of preprocess to another variable, like this:

> `new_variable = preprocess(variable)`

In [None]:
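# Play with the preprocess() function here
# replace the example string with your own (calling preprocess() with no argument raises a TypeError)
preprocess('Type your own string here!')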

## **Vocabulary - Tokens and Types**
What is the difference between a token and a type?

- A token is an individual occurrence of a word.
- A type is a distinct word, counted once no matter how many times it occurs.

For example - you might have three dogs: two Labradoodles and a Samoyed.

You would have three tokens (three dogs) but only two types: Labradoodle and Samoyed.

NLTK asks you to think about types and tokens by introducing you to the `set()` and `len()` functions. Do you remember these?

Which one will give us the total number of all the tokens in a text, and which one will give us the total number of types in a text? We can test this out without needing to rely on the NLTK objects.
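The dog example can be tried out with a plain Python list. This is a minimal sketch; the variable names `dogs`, `number_of_tokens`, and `number_of_types` are chosen here for illustration.

```python
# three dogs: each item in the list is one token
dogs = ['Labradoodle', 'Labradoodle', 'Samoyed']

# len() counts every item, so it gives the number of tokens
number_of_tokens = len(dogs)

# set() keeps one copy of each distinct item, so its length is the number of types
number_of_types = len(set(dogs))

print(number_of_tokens)
print(number_of_types)
```

The same two calls should work on any list of tokens, including the output of `nltk.word_tokenize()`.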

In [None]:
# download the list of stopwords
nltk.download('stopwords')




In [None]:
from nltk.corpus import stopwords
print(stopwords.words('english'))
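A stopword list like this one can be used to filter a list of tokens. The sketch below uses a tiny hand-written list so that it runs without any downloads; `example_stopwords` and `remove_stopwords` are hypothetical names made up for this example, not NLTK functions.

```python
# a tiny hand-written stopword list, for illustration only
example_stopwords = ['the', 'a', 'and', 'are', 'me', 'these']

def remove_stopwords(tokens, stopword_list):
    # turn the list into a set so membership checks are fast
    stopword_set = set(stopword_list)
    # keep tokens whose lower-cased form is not a stopword
    return [token for token in tokens if token.lower() not in stopword_set]

tokens = ['These', 'pretzels', 'are', 'making', 'me', 'thirsty']
print(remove_stopwords(tokens, example_stopwords))
```

In the notebook, you could pass `stopwords.words('english')` as the second argument instead of the hand-written list.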
