# Python for Linguists

Notebook 7: Introduction to NLTK

Venelin Kovatchev

University of Barcelona 2020

In this notebook we will start working with NLTK.

We will load a corpus and try to process it.

We will apply simple transformations like sentence segmentation and tokenization.

We will calculate basic statistics using loops as well as NLTK's built-in functions.

In [None]:
# Import nltk
import nltk

In [None]:
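# Download the resources needed for today
nltk.download('reuters')
nltk.download('punkt')
# Assumption: newer NLTK releases (3.9 and later) look for 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')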

In [None]:
# Import all resources needed for today
from nltk.corpus import reuters
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize, regexp_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk import bigrams

### Corpora in NLTK

NLTK contains many corpora for various languages

We can access them with the `import` statement

Most corpora contain multiple files (e.g. news articles, tweets, web pages, etc.)

You can get a list of all filenames and then work with one or more of the files

You can also use the whole corpus by not specifying any filename

In [None]:
# Get the identifiers for all files in the reuters corpus
print(reuters.fileids())

In [None]:
# We will start by working with just one file, 'test/14826'
c_fid = 'test/14826'

# Let's get the raw version of this file
test_raw = reuters.raw(c_fid)

# Let's see the test file
# We can see that it doesn't contain any strange code or hashtags
# All characters are readable, so we don't need to clean the file
print(test_raw)

### Tokenization

Tokenization is an important task in NLP and CL.

During tokenization, we convert a sequence of characters (a string) into a sequence of tokens (a list of strings).

There are several ways to tokenize a corpus:

- some corpora come pre-tokenized so we can just use the tokenized version
- as we already saw in previous classes, the .split() method converts a string into a list, separating on whitespace
- some libraries, such as NLTK, have functions that perform a "smarter" tokenization
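
To make the difference between whitespace splitting and a "smarter" tokenizer concrete, here is a minimal sketch that uses only the standard library. The regular expression is a simplified stand-in chosen for illustration, not the rule set that NLTK applies, and the example sentence is invented.

```python
import re

text = "Prices rose, analysts said."

# Whitespace splitting keeps punctuation glued to the neighbouring word
split_tokens = text.split()

# A simple regex tokenizer (an illustrative assumption, not NLTK's rules):
# match either a run of word characters or a single punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(split_tokens)
print(regex_tokens)
```

Note how `rose,` and `said.` stay as single items after `.split()`, while the regex version separates the comma and the full stop into tokens of their own.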

In [None]:
# Let's get the tokenized version of this file
test_tok = reuters.words(c_fid)

# Let's look at the first 30 tokens
print(test_tok[0:30])

In [None]:
# Let's get the tokenized and sentence segmented version of this file
test_sent = reuters.sents(c_fid)

# Let's look at the first two sentences
# Remember, slicing excludes the end index, so to take the first two we need 0:2
print(test_sent[0:2])

In [None]:
# NLTK has a tool that can automatically tokenize a corpus
manual_tok = word_tokenize(test_raw)

# Now let's use the split() method to separate words
split_tok = test_raw.split()

In [None]:
# Let's compare two tokenizations next to each other
def comp_tok(corp_1, corp_2, num_tok=30):
    # Let's zip the first num_tok tokens of the two corpora
    mix_corp = zip(corp_1[0:num_tok], corp_2[0:num_tok])
    # Let's go through them token by token
    for word_1, word_2 in mix_corp:
        # Print the two words
        print(word_1, word_2)

In [None]:
# Let's compare the pre-tokenized corpus with the automatically tokenized one
comp_tok(manual_tok,test_tok,10)


In [None]:
# Now let's compare the automatically tokenized and the split
comp_tok(manual_tok,split_tok,25)


In [None]:
# NLTK also has a tool that can automatically separate a corpus by sentences
manual_sent = sent_tokenize(test_raw)

# Observe the first two sentences
print(manual_sent[0:2])
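
Sentence segmentation is harder than it looks, which is why a dedicated tool is useful. The sketch below uses a naive rule (split after `.`, `!` or `?` when followed by whitespace) on an invented example; it is only an illustration of the problem, not how `sent_tokenize` works internally.

```python
import re

text = "Mr. Smith visited Tokyo. He met Dr. Tanaka there."

# Naive rule (an assumption for illustration): split after ., ! or ?
# whenever the punctuation mark is followed by whitespace
naive_sents = re.split(r"(?<=[.!?])\s+", text)

print(naive_sents)
print(len(naive_sents))
```

The text has two sentences, but the naive rule produces four pieces because it also splits after the abbreviations `Mr.` and `Dr.`. You can run `sent_tokenize` on the same string to compare the result.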

### Basic statistics 

Basic corpus statistics include things like:

- number of tokens
- number of unique tokens (types)
- the frequency of each token
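
As a quick preview, the three statistics listed above can be computed on a tiny hand-made token list. This sketch uses `collections.Counter` from the standard library instead of NLTK, and the token list is invented for illustration.

```python
from collections import Counter

tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "bird"]

# Number of tokens: every item in the list counts
num_tokens = len(tokens)

# Number of types: duplicates are removed by converting to a set
num_types = len(set(tokens))

# Frequency of each token: Counter maps each token to its count
freqs = Counter(tokens)

print(num_tokens, num_types)
print(freqs["the"])
print(freqs.most_common(1))
```

The list has 8 tokens but only 6 types, because "the" occurs three times.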


In [None]:
# To get the number of tokens in a list, we can use the "len" function

num_tokens_1 = len(manual_tok)
num_tokens_2 = len(test_tok)

print("The pre-tokenized document has " + str(num_tokens_2) + " tokens")
print("The word_tokenize document has " + str(num_tokens_1) + " tokens")


In [None]:
# To get the number of types (unique tokens) we can either go element by element and count,
# or we can use the set() function
# A set is a special data type (similar to a list), where no duplicate elements are allowed

# Let's see how it works
my_test_list = ["some","element","some","other","element"]
print(my_test_list)
my_test_set = set(my_test_list)
print(my_test_set)

In [None]:
# Using the set, we can easily see the number of unique tokens in the corpus
num_utokens_1 = len(set(manual_tok))
num_utokens_2 = len(set(test_tok))

print("The pre-tokenized document has " + str(num_utokens_2) + " unique tokens")
print("The word_tokenize document has " + str(num_utokens_1) + " unique tokens")

In [None]:
# Now there is one more pre-processing step we can do here
# In the corpus some words can be capitalized differently - "Asia" "ASIA", etc
# Depending on the task, we might want to keep this distinction or not
# If we want to ignore case, we need to "lowercase" all words

l_manual_tok = [word.lower() for word in manual_tok]
l_test_tok = [word.lower() for word in test_tok]

# Remember the [X for X in ...]
# this is the same as:
# 
# l_manual_tok = []
# for word in manual_tok:
#     l_manual_tok.append(word.lower())

num_ultokens_1 = len(set(l_manual_tok))
num_ultokens_2 = len(set(l_test_tok))

print("The pre-tokenized document has " + str(num_ultokens_2) + " unique lowercased tokens")
print("The word_tokenize document has " + str(num_ultokens_1) + " unique lowercased tokens")

In [None]:
# The next step is counting the frequency of every word
# We did this a few classes ago by manually going through the list and adding the words to a dictionary
# The following code is copied from the previous exercise

# We start by creating the dictionary:
freq_dict = {}

# We loop through the manual_tok list
for word in manual_tok:
    # For every word in the tokenized corpus we check if it is in the dictionary
    if word not in freq_dict:
        # if it is NOT in the vocabulary, we add it with a default frequency of 1 (we just saw the word)
        freq_dict[word] = 1
    # If the word is already in the vocabulary, we increase its count by 1
    else:
        freq_dict[word] += 1

# We can then see the frequency of each word
print(freq_dict["of"])

In [None]:
# NLTK has a tool that can count the frequency of the words much more easily

# First we create the frequency distribution with this command
fd = FreqDist(manual_tok)

# Then we can look up the frequency of any word we are interested in
# We get the same result as in the manual version
print(fd["of"])


In [None]:
# We can also see the most frequent words in the corpus using the most_common() method
print(fd.most_common(10))


In [None]:
# Finally, we can observe the distribution of words on a plot
fd.plot(10)

### N-Grams

N-grams are an important unit in NLP and CL

An n-gram is a sequence of n consecutive words, where n is fixed in advance

The most common n-grams are bi-grams (two elements) and tri-grams (three elements).
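
To see what an n-gram list looks like before using NLTK, here is a minimal pure-Python sketch. `ngrams_from` is a hypothetical helper written for this illustration, not an NLTK function, and the token list is invented.

```python
def ngrams_from(tokens, n):
    """Return all n-grams of a token list as tuples (hypothetical helper)."""
    # Zip the list with copies of itself shifted by 1, 2, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["we", "load", "a", "corpus", "today"]

toy_bigrams = ngrams_from(tokens, 2)
toy_trigrams = ngrams_from(tokens, 3)

print(toy_bigrams)
print(toy_trigrams)
```

A list of 5 tokens yields 4 bigrams and 3 trigrams: in general, a list of N tokens contains N - n + 1 n-grams.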

In [None]:
# NLTK has a function that calculates the bigrams from a list of tokens
# Observe the following
test_bigr = list(bigrams(manual_tok))

print(test_bigr[0:5])

In [None]:
# Now, instead of working with just one document, let's load 20 documents to have a little more text
# Remember, if we don't give any file id, it loads the whole corpus. However, working with the whole corpus can be slow.
reuters_20 = reuters.raw(['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843', 'test/14844', 'test/14849', 'test/14852', 'test/14854', 'test/14858', 'test/14859', 'test/14860', 'test/14861', 'test/14862', 'test/14863'])

# Let's tokenize it using word_tokenize
# N.B.: this might take some time
reuters_tok = word_tokenize(reuters_20)

# Task 1
# 
# Calculate the statistics for the whole corpus:
# - number of tokens
# - number of types
# - the 20 most frequent words

In [None]:
# Task 2
# 
# Calculate the number of bigrams in reuters_20
# Calculate the number of unique bigrams in reuters_20
# Print the 5 most frequent bigrams in reuters_20

### Advanced Topics

Language models, regular expressions
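
Before reading the larger example, it may help to see the core idea of a trigram (Markov) language model on a toy text: for every pair of consecutive words, store the words that were seen after it, then generate text by repeatedly sampling from those lists. This is a simplified sketch under stated assumptions: the sentence, the name `toy_model` and the fixed number of generation steps are all invented for illustration.

```python
import random
from collections import defaultdict

tokens = "the cat sat on the mat and the cat ran on the road".split()

# Map each pair of consecutive words to the list of words seen after it
# (toy_model is a hypothetical name used only in this sketch)
toy_model = defaultdict(list)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    toy_model[(w1, w2)].append(w3)

# Words that followed "the cat" in the toy text
print(toy_model[("the", "cat")])

# Generate up to 4 more words, starting from "the cat"
words = ["the", "cat"]
for _ in range(4):
    followers = toy_model.get(tuple(words[-2:]))
    # Stop if the last two words were never followed by anything
    if not followers:
        break
    words.append(random.choice(followers))

print(" ".join(words))
```

Because the next word is chosen at random, the generated text changes from run to run. A word that follows a pair more often appears more often in the list, so it is also more likely to be chosen.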

In [None]:
# Advanced tasks

# The following code reads a corpus of Donald Trump's speeches and generates "pseudo" Trump sentences

# Import numpy
import numpy as np

# Import codecs
import codecs

# Import random
import random

def get_markov_stats(trigrams):
    """
    Input: 
        (list) trigrams
    
    Output: 
        (dict: key = str, value = list) 
            key: string of first two words in trigram  
            value: list of all possible third words in trigram
    """
    # Initialize
    markov_stats = {}
    
    for words in trigrams:
        words_list = list(words)
        word_12 = " ".join([words_list[0], words_list[1]])
        word_3 = words_list[2]
        
        # Check if there is an entry for the current pair of words
        if word_12 in markov_stats:
            # If there is, append the third word to its list
            markov_stats[word_12].append(word_3)

        # If there isn't, create it with the corresponding value
        else:
            markov_stats[word_12] = [word_3]
    return markov_stats

def generate_sentence(corpus, stats):
    """
    Input:
        corpus: (list) corpus of words
        stats: output of get_markov_stats
    Output:
        prints sentence according to rules given in assignment
    """
    
    # Get first two words of sentence, excluding punctuation and lowercase starts
    candidates = list(nltk.bigrams(corpus[:-1]))
    first_bigram = list(random.choice(candidates))
    while any(p in first_bigram for p in (".", "!", "?", ",", "’")) or first_bigram[0].islower():
        first_bigram = list(random.choice(candidates))
    
    new_speech = first_bigram

    # Generate a sentence of minimum length 5
    length = 1
    while True:
        # Get next word from previous two words
        next_word = np.random.choice(stats[" ".join(new_speech[-2:])])
        
        # Is sentence shorter than 5 words?
        if length < 5: 
            # If it finds punctuation restart
            if "." in next_word or "!" in next_word or "?" in next_word:
                return(generate_sentence(corpus = corpus, stats = stats))  
            # If no punctuation append word to existing sentence
            else:
                new_speech.append(next_word)
                length += 1
                
        # Is sentence at least 5 words long?
        elif length >= 5:
            # If includes punctuation return
            if "." in next_word or "!" in next_word or "?" in next_word:
                new_speech.append(next_word)
                return(" ".join(new_speech))
            # If no punctuation append word to existing sentence
            else:
                new_speech.append(next_word)



In [None]:
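# Read the corpus (the file is closed automatically at the end of the "with" block)
with codecs.open('speeches.txt', 'r', 'utf8') as corpus_file:
    raw_corpus = corpus_file.read()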

# Tokenize the corpus
corpus = nltk.word_tokenize(raw_corpus)

# Generate list of trigrams
trump_trigr = list(nltk.trigrams(corpus))
    
# Generate model
markov_stats = get_markov_stats(trump_trigr)
   
# Generate and print sentences
for i in range(5):
    sentence = generate_sentence(corpus, markov_stats)
    print(sentence)    

In [None]:
# Regular expression tokenizer

# The following code uses a regular expression grammar to process the corpus:

pattern = r'''(?x)     # set flag to allow verbose regexps
(?:[A-Z]\.)+       # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)*       # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\.             # ellipsis
| [][.,;"'?():_`-]   # these are separate tokens; includes ], [ and -
'''

regex_tok = regexp_tokenize(reuters_20,pattern)

# Let's compare the regexp tokenizer and the word_tokenizer
comp_tok(regex_tok,reuters_tok)

In [None]:
# Task 3 - advanced
# Try to improve the regex tokenizer
# Think of patterns that you want to match