# Python for Linguists

Notebook 7: Introduction to NLTK

Venelin Kovatchev

University of Barcelona 2020

In this notebook we will start working with NLTK.

We will load a corpus and try to process it.

We will apply simple transformations like sentence segmentation and tokenization.

We will calculate basic statistics using loops as well as NLTK's built-in functions.

In [None]:
# Import nltk
import nltk

In [None]:
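# Download the resources needed for today
# Called without arguments, this opens the interactive NLTK downloader;
# select the resources used below (the reuters corpus, the Punkt tokenizer models, stopwords)
nltk.download()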

In [None]:
# Import all resources needed for today
from nltk.corpus import reuters
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk import bigrams

In [None]:
# Get the identifiers for all files in the reuters corpus
print(reuters.fileids())

In [None]:
# We will start by working with just one file, 'test/14826'
c_fid = 'test/14826'

# Let's get the raw version of this file
test_raw = reuters.raw(c_fid)

print(test_raw)

In [None]:
# Let's get the tokenized version of this file
test_tok = reuters.words(c_fid)

print(test_tok[0:30])

In [None]:
# Let's get the tokenized and sentence-segmented version of this file
test_sent = reuters.sents(c_fid)

print(test_sent[0])

In [None]:
# NLTK has a tool that can automatically tokenize a corpus
manual_tok = word_tokenize(test_raw)

# Let's see the tokenized corpus
# Compare it with the test_tok corpus
print(manual_tok[0:30])

In [None]:
# Now let's use the split function to separate words
split_tok = test_raw.split()

# Let's see and compare this version as well
print(split_tok[0:30])
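
To see why `split` and a tokenizer can disagree, here is a small sketch on a made-up sentence. It uses only the standard library; the regular expression is a rough stand-in chosen for illustration and is an assumption, not how `word_tokenize` works internally.

```python
import re

# Toy text (made up for illustration, not taken from the Reuters corpus)
toy_raw = "Prices rose 5%, analysts said. Exports fell."

# str.split() only cuts on whitespace, so punctuation stays attached to words
toy_split = toy_raw.split()

# Rough regex tokenizer: runs of word characters, or single punctuation marks
toy_regex = re.findall(r"\w+|[^\w\s]", toy_raw)

print(toy_split)
print(toy_regex)
```

The two lists differ in length because the regex version treats each punctuation mark as its own token, which is the kind of difference to look for when comparing `split_tok` with `manual_tok`.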

In [None]:
# NLTK also has a tool that can automatically separate a corpus by sentences
manual_sent = sent_tokenize(test_raw)

# Observe the first two sentences
print(manual_sent[0:2])

In [None]:
# NLTK has a tool that can count the frequency of the words in a corpus
fd = FreqDist(test_tok)
print(fd.most_common(10))
fd.plot(10)


In [None]:
# NLTK has a function that calculates the bigrams from a list of tokens
# Observe the following
test_bigr = list(bigrams(test_tok))

print(test_bigr[0:5])
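
For intuition about what a bigram list contains, the same pairing can be sketched in plain Python. The token list below is made up for illustration, and `zip` is used here as an equivalent way to pair neighbours, not as a description of NLTK's implementation.

```python
# Toy token list (made up for illustration)
toy_tokens = ["the", "bank", "raised", "the", "rate"]

# Pair each token with its successor; n tokens give n - 1 bigrams
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))

print(toy_bigrams)
```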

In [None]:
# Task 1
# 
# Tokenize the reuters corpus using "split" and "word_tokenize"
# Compare the result with the already tokenized version of the corpus
# 
# First write your code for a single file id ('test/14826')
# After that run the code on the whole corpus (without providing any id)
# 
# Compare the results
# Count the number of tokens in each "corpus"
# Count the number of "types" in each "corpus" (types = unique tokens)
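
As a reminder of the token/type distinction, here is a minimal sketch on a toy sentence (made up for illustration). It is not a solution for the Reuters corpus, only the counting idea.

```python
# Toy token list (made up for illustration)
toy_tokens = "the cat saw the dog and the dog saw the cat".split()

# Tokens: every occurrence counts
n_tokens = len(toy_tokens)

# Types: each distinct token counts once, so a set removes the repetitions
n_types = len(set(toy_tokens))

print(n_tokens, n_types)
```

Note that the comparison is case-sensitive: "The" and "the" would be two different types unless the tokens are lowercased first.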


In [None]:
# Task 2
# 
# For the corpus created with word_tokenize, calculate the most frequent words
# 
#    - by counting manually (as we did in the previous classes)
#    - by using FreqDist
#
#    Compare the most frequent words - are the numbers the same?
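
The two counting approaches can be sketched on a toy list (made up for illustration). `collections.Counter` stands in for `FreqDist` here so the sketch runs without NLTK data; `FreqDist` is a subclass of `Counter`, so `most_common` behaves the same way.

```python
from collections import Counter

# Toy token list (made up for illustration)
toy_tokens = ["to", "be", "or", "not", "to", "be"]

# Manual counting with a dictionary and a loop
manual_counts = {}
for tok in toy_tokens:
    manual_counts[tok] = manual_counts.get(tok, 0) + 1

# Sort the (word, count) pairs by count, highest first, and keep the top 2
manual_top = sorted(manual_counts.items(), key=lambda kv: kv[1], reverse=True)[:2]

# The same counts using a Counter
counter_counts = Counter(toy_tokens)
counter_top = counter_counts.most_common(2)

print(manual_top)
print(counter_top)
```

When several words share the same count, the order among them can depend on how the sorting is done, so compare the counts themselves rather than only the order of the output.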

In [None]:
# Task 3
# 
# For the corpus created with word_tokenize and for the pre-tokenized corpus, obtain a list of bigrams
# Calculate and compare the frequency distribution of the bigrams for the two corpora
# 

In [None]:
# Advanced Tasks
# 
# Task 3 - calculate the frequency of words in the corpus, ignoring multiple repetitions of the word in a sentence
# e.g.: ["this is not that, rather is this!","it is really disappointing, but it is what it is"]
#       
#       In this corpus:
#           the frequency of "this" is 1 (1 in sent 1, 0 in sent 2)
#           the frequency of "is" is 2 (1 in sent 1, 1 in sent 2)
#           the frequency of "it" is 1 (0 in sent 1, 1 in sent 2)
#       We ignore multiple repetitions within the same sentence
#
#       In this task you can either use the existing corpus that is sentence segmented and tokenized
#       or use sent_tokenize and word_tokenize
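
The worked example above can be checked with a short sketch. The tokenization here is deliberately crude (it only removes the two punctuation marks that occur in the toy sentences) and is an assumption made for illustration; on the real corpus you would use the tokenized sentences instead.

```python
from collections import Counter

# The two toy sentences from the task description
toy_sents = [
    "this is not that, rather is this!",
    "it is really disappointing, but it is what it is",
]

sent_freq = Counter()
for sent in toy_sents:
    # Crude tokenization: drop the punctuation we know is there, then split
    tokens = sent.replace(",", " ").replace("!", " ").split()
    # A set keeps each word once per sentence, so repetitions are ignored
    sent_freq.update(set(tokens))

print(sent_freq["this"], sent_freq["is"], sent_freq["it"])
```

The key step is converting each sentence to a set before counting, so a word contributes at most 1 per sentence.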

In [None]:
# Advanced Tasks
# 
# Task 4 - similar to the advanced Task 3 above, calculate the frequency of bigrams, ignoring multiple repetitions within the same sentence