Introduction to NLP course (2017-2018).

Notebook 3: Basic corpus statistics.

by Venelin Kovatchev, University of Barcelona

In [1]:
# Import section

# Import nltk
import nltk
from nltk import FreqDist
from nltk import bigrams, trigrams
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Import regular expressions
import re

# Import corpora
from nltk.corpus import gutenberg

1) Basic corpus statistics:

- number of tokens in the corpus
- number of types (unique tokens) in the corpus
- average length (e.g. of tokens or sentences) in the corpus

For reference, see chapter 3 of the NLTK book.
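
The three statistics above can be sketched in plain Python on a short token list. This is a minimal illustration, not the course's own solution: the sample sentence is reused from later in this notebook, the variable names are chosen for the example, and "average length" is taken here to mean average token length in characters.

```python
# Sample token list (a whitespace-tokenised sentence, not the Gutenberg corpus)
tokens = 'the student who is carrying a laptop is going to the university'.split()

# Number of tokens: every occurrence counts
num_tokens = len(tokens)

# Number of types: each distinct token counts once
num_types = len(set(tokens))

# Average token length in characters (one possible reading of "average length")
avg_token_length = sum(len(t) for t in tokens) / float(num_tokens)

print("Tokens: {}".format(num_tokens))
print("Types: {}".format(num_types))
print("Average token length: {:.2f}".format(avg_token_length))
```

The same expressions should work on `gutenberg.words()`, since `len()`, `set()` and iteration are all that is needed.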

2) Basic frequency statistics:

- what are the most frequent tokens in the corpus
- what is the frequency of a specific token


In [2]:
# Obtain the corpus
corpus = gutenberg.words()

# Generate the frequency distribution of the corpus
fdist = FreqDist(corpus)

# Print the 10 most frequent words in the corpus:
print("\nThe 10 most frequent words in the corpus are: \n {}".format(fdist.most_common(10)))

# Print the frequency of man and woman
print("\nThe frequency of 'man' is: {}".format(fdist['man']))
print("\nThe frequency of 'woman' is: {}".format(fdist['woman']))


The 10 most frequent words in the corpus are: 
 [(u',', 186091), (u'the', 125748), (u'and', 78846), (u'.', 73746), (u'of', 70078), (u':', 47406), (u'to', 46443), (u'a', 32504), (u'in', 31959), (u'I', 30221)]

The frequency of 'man' is: 5372

The frequency of 'woman' is: 932
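
The output above is dominated by punctuation and function words, and the counts are case-sensitive. A possible refinement, sketched here with `collections.Counter` (which `FreqDist` subclasses in NLTK 3), is to lowercase the tokens and keep only alphabetic ones before counting. The token list below is invented for the illustration.

```python
from collections import Counter

# Invented sample tokens, including punctuation and mixed case
tokens = ['The', 'man', ',', 'the', 'woman', 'and', 'the', 'child', '.']

# Raw counts: case-sensitive, punctuation included
raw_counts = Counter(tokens)

# Normalised counts: lowercase, alphabetic tokens only
word_counts = Counter(t.lower() for t in tokens if t.isalpha())

print("Raw count of 'the': {}".format(raw_counts['the']))
print("Normalised count of 'the': {}".format(word_counts['the']))
print("Most frequent normalised token: {}".format(word_counts.most_common(1)))

# A token that never occurs has frequency 0 rather than raising an error
print("Count of 'laptop': {}".format(word_counts['laptop']))
```

Whether to normalise depends on the question being asked: for stylistic analysis, punctuation and capitalisation may be exactly what is of interest.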


3) Bigrams and Trigrams

- (linear) sequences of tokens
- not a linguistic unit per se
- implemented natively in nltk

In [3]:
# Obtain the list of all bigrams in the corpus
bigr_list = list(bigrams(corpus))

# Obtain the list of all trigrams in the corpus
trigr_list = list(trigrams(corpus))

# Print the first 5 bigrams and trigrams
print("\nThe first 5 bigrams are: {}".format(bigr_list[:5]))
print("\nThe first 5 trigrams are: {}".format(trigr_list[:5]))


The first 5 bigrams are: [(u'[', u'Emma'), (u'Emma', u'by'), (u'by', u'Jane'), (u'Jane', u'Austen'), (u'Austen', u'1816')]

The first 5 trigrams are: [(u'[', u'Emma', u'by'), (u'Emma', u'by', u'Jane'), (u'by', u'Jane', u'Austen'), (u'Jane', u'Austen', u'1816'), (u'Austen', u'1816', u']')]
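
To see what the n-gram functions do, the sliding window can be sketched in plain Python. The helper `ngrams_sketch` is a hypothetical name for this illustration and is not part of NLTK; the token list is the opening of the corpus as shown in the output above.

```python
def ngrams_sketch(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # zip over n copies of the list, each shifted one position further right;
    # zip stops at the shortest copy, so no incomplete n-gram is produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

bigr = ngrams_sketch(tokens, 2)
trigr = ngrams_sketch(tokens, 3)

print("Bigrams: {}".format(bigr))
print("Trigrams: {}".format(trigr))
```

A sequence of k tokens yields k - 1 bigrams and k - 2 trigrams, which is why the trigram list is one element shorter.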


4) Co-occurrence: elements that appear together within a predefined context, either with a certain frequency or with a certain degree of statistical association

- depends on the definition of context
- can use surface (linear) context or linguistic (sentence, dependency) context

In [4]:
# Sample sentence
my_sent = 'the student who is carrying a laptop is going to the university'.split()

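# Calculate the co-occurrence statistics based on surface windows of 5 and 7
# NB: NLTK's window_size counts the target word itself, so pairing each word
# with the 5 tokens that follow it requires window_size=6 (and 7 requires 8).
# Pairs are ordered: the second element always occurs after the first.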
cooc_5 = BigramCollocationFinder.from_words(my_sent, window_size=6)
cooc_7 = BigramCollocationFinder.from_words(my_sent, window_size=8)

# Print the co-occurrence statistics for "student" with window 5
print("\nCo-occurrence of 'student' in a window of 5")
for pair,freq in cooc_5.ngram_fd.items():
    if 'student' in pair:
        print(pair,freq)
        
# Print the co-occurrence statistics for "student" with window 7
print("\nCo-occurrence of 'student' in a window of 7")
for pair,freq in cooc_7.ngram_fd.items():
    if 'student' in pair:
        print(pair,freq)


Co-occurrence of 'student' in a window of 5
(('the', 'student'), 1)
(('student', 'carrying'), 1)
(('student', 'who'), 1)
(('student', 'is'), 1)
(('student', 'laptop'), 1)
(('student', 'a'), 1)

Co-occurrence of 'student' in a window of 7
(('the', 'student'), 1)
(('student', 'carrying'), 1)
(('student', 'who'), 1)
(('student', 'is'), 2)
(('student', 'a'), 1)
(('student', 'going'), 1)
(('student', 'laptop'), 1)
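
The counts above can be reproduced without NLTK, which makes the role of the window explicit. The sketch below assumes that each word is paired only with the tokens that follow it, which is consistent with the output shown; `window_cooccurrences` is a hypothetical helper written for this illustration.

```python
from collections import Counter

def window_cooccurrences(tokens, span):
    """Count ordered pairs (w1, w2) where w2 is one of the `span` tokens after w1."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # The slice is simply shorter near the end of the sentence
        for w2 in tokens[i + 1:i + 1 + span]:
            counts[(w1, w2)] += 1
    return counts

sent = 'the student who is carrying a laptop is going to the university'.split()

manual_5 = window_cooccurrences(sent, 5)
manual_7 = window_cooccurrences(sent, 7)

# Keep only the pairs that involve "student"
student_5 = {pair: freq for pair, freq in manual_5.items() if 'student' in pair}
student_7 = {pair: freq for pair, freq in manual_7.items() if 'student' in pair}

print("Window of 5: {}".format(student_5))
print("Window of 7: {}".format(student_7))
```

With a span of 7 the second "is" falls inside the window of "student", which is why that pair is counted twice.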


5) Filtering co-occurrence

- by frequency
- using statistical association
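
Statistical association can be illustrated with pointwise mutual information (PMI), which compares how often a pair is observed with how often it would be expected if the two words were independent. The sketch below is a plain-Python approximation and not NLTK's own code: `windowed_pmi` is a hypothetical helper, and dividing the pair count by `window_size - 1` is an assumption about how the windowed counts are normalised. For the sample sentence with `window_size=6` it yields values of about 1.263 and 0.263, in line with the PMI scores that NLTK reports for the window-5 finder in this notebook.

```python
import math
from collections import Counter

def windowed_pmi(tokens, window_size):
    """Score every ordered pair within the window by pointwise mutual information."""
    span = window_size - 1              # number of following tokens paired with each word
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1:i + 1 + span]:
            pair_counts[(w1, w2)] += 1

    total = len(tokens)
    scores = {}
    for (w1, w2), freq in pair_counts.items():
        # Assumption: the pair count is scaled down by the number of window positions
        joint = freq / float(span)
        expected = word_counts[w1] * word_counts[w2] / float(total)
        scores[(w1, w2)] = math.log(joint / expected, 2)
    return scores

sent = 'the student who is carrying a laptop is going to the university'.split()
pmi_scores = windowed_pmi(sent, window_size=6)

print("PMI('carrying', 'a') = {:.4f}".format(pmi_scores[('carrying', 'a')]))
print("PMI('a', 'is') = {:.4f}".format(pmi_scores[('a', 'is')]))
```

On a single sentence almost every pair occurs once, so the scores mostly reflect how frequent the individual words are; PMI becomes informative only on a corpus large enough for pair counts to vary.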

In [5]:
# Filter only the co-occurrences with frequency of 2 or above
cooc_7.apply_freq_filter(2)

# Print the co-occurrence statistics for "student" with window 7 and frequency of 2 or above
print("\nCo-occurrence of 'student' in a window of 7 and frequency minimum of 2")
for pair,freq in cooc_7.ngram_fd.items():
    if 'student' in pair:
        print(pair,freq)

# Load the pre-built association measures
bigram_measures = nltk.collocations.BigramAssocMeasures()
# Score the window-5 co-occurrences by pointwise mutual information (PMI);
# score_ngrams returns (pair, score) tuples sorted from highest to lowest score
print("\nCo-occurrence in a window of 5 after applying MI")
cooc_5.score_ngrams(bigram_measures.pmi)



Co-occurrence of 'student' in a window of 7 and frequency minimum of 2
(('student', 'is'), 2)

Co-occurrence in a window of 5 after applying MI


[(('is', 'going'), 1.2630344058337943),
 (('who', 'is'), 1.2630344058337943),
 (('a', 'going'), 1.2630344058337941),
 (('a', 'laptop'), 1.2630344058337941),
 (('a', 'to'), 1.2630344058337941),
 (('carrying', 'a'), 1.2630344058337941),
 (('carrying', 'going'), 1.2630344058337941),
 (('carrying', 'laptop'), 1.2630344058337941),
 (('carrying', 'to'), 1.2630344058337941),
 (('going', 'to'), 1.2630344058337941),
 (('going', 'university'), 1.2630344058337941),
 (('laptop', 'going'), 1.2630344058337941),
 (('laptop', 'to'), 1.2630344058337941),
 (('laptop', 'university'), 1.2630344058337941),
 (('student', 'a'), 1.2630344058337941),
 (('student', 'carrying'), 1.2630344058337941),
 (('student', 'laptop'), 1.2630344058337941),
 (('student', 'who'), 1.2630344058337941),
 (('to', 'university'), 1.2630344058337941),
 (('who', 'a'), 1.2630344058337941),
 (('who', 'carrying'), 1.2630344058337941),
 (('who', 'laptop'), 1.2630344058337941),
 (('a', 'is'), 0.2630344058337941),
 (('a', 'the'), 0.2630344