### NLTK 101

- Tokenization
- Frequency Distributions
- Corpus - Isolated, Categorical, Overlapping, Temporal etc.
- Web Scraping using BeautifulSoup
- Regular Expressions for Tokenization
- Bigrams and Collocations
- Stemming
- Lemmatization
- POS Tagging

In [1]:
import nltk

In [2]:
text = "Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991."

In NLTK, the **sent_tokenize** function splits the given text into **sentences**.

In [3]:
sentences = nltk.sent_tokenize(text)
len(sentences)

2

In NLTK, the **word_tokenize** function splits the given text into **words** (tokens); punctuation marks count as tokens too.

In [4]:
words = nltk.word_tokenize(text)
len(words)

22

In [5]:
words[:5]

['Python', 'is', 'an', 'interpreted', 'high-level']

**Frequency Distribution of Words**

In [6]:
wordfreq = nltk.FreqDist(words)

In [7]:
wordfreq.most_common(2)

[('programming', 2), ('.', 2)]
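Punctuation tends to dominate such counts, as the tie between 'programming' and '.' shows. A minimal sketch of filtering it out before counting follows. It assumes the tokens are already available as a list; the list here is typed in by hand to mirror the 22 tokens above, so no tokenizer data is needed, and the names `content_tokens` and `content_freq` are illustrative only.

```python
import nltk

# Hand-written token list (an assumption for this sketch), mirroring the
# output of word_tokenize on the example sentence.
tokens = ['Python', 'is', 'an', 'interpreted', 'high-level', 'programming',
          'language', 'for', 'general-purpose', 'programming', '.',
          'Created', 'by', 'Guido', 'van', 'Rossum', 'and', 'first',
          'released', 'in', '1991', '.']

# Keep only tokens containing at least one letter or digit, lowercased.
content_tokens = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

content_freq = nltk.FreqDist(content_tokens)
top_word, top_count = content_freq.most_common(1)[0]
print(top_word, top_count)
```

Whether to lowercase or to drop numbers as well depends on the analysis; this sketch only removes pure punctuation.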

In [8]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [9]:
text1.findall("<tri.*r>")

triangular; triangular; triangular; triangular


In [10]:
type(text1)

nltk.text.Text

In [11]:
n_words = len(text1)
n_words

260819

In [12]:
n_unique_words = len(set(text1))
n_unique_words

19317

#### Extracting unique lowercase words from text1

In [13]:
text1_lcw = [word.lower() for word in set(text1)]
len(text1_lcw)

19317

In [14]:
n_unique_words_lc = len(set(text1_lcw))
n_unique_words_lc

17231

In [15]:
word_coverage1 = n_words / n_unique_words
word_coverage1

13.502044830977896

In [16]:
word_coverage2 = n_words / n_unique_words_lc
word_coverage2

15.136614241773549
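Both ratios above give the average number of times each distinct word is used, with and without case folding. The sketch below wraps that calculation in a helper. `tokens_per_type` and the sample list are hypothetical names introduced only for illustration.

```python
def tokens_per_type(tokens, lowercase=False):
    """Return the average number of occurrences per distinct token."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    distinct = set(tokens)
    if not distinct:
        return 0.0  # avoid dividing by zero on empty input
    return len(tokens) / len(distinct)


# Tiny hand-made sample: 8 tokens, 6 distinct as written, 5 after lowercasing.
sample = ['The', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
case_sensitive = tokens_per_type(sample)
case_folded = tokens_per_type(sample, lowercase=True)
print(round(case_sensitive, 2), round(case_folded, 2))
```

Lowercasing merges 'The' and 'the', so the ratio rises, just as it does for text1.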

In [17]:
big_words = [word for word in set(text1) if len(word) > 18]
big_words

['uninterpenetratingly']

In [18]:
sun_words = [word for word in set(text1) if word.startswith('Sun')]
sun_words

['Sunset', 'Sunda', 'Sunday']

In [19]:
text1_freq = nltk.FreqDist(text1)
text1_freq.most_common(3)

[(',', 18713), ('the', 13721), ('.', 6862)]

In [20]:
text1_freq['Sunday']

7

In [21]:
large_uncommon_words = [word for word in text1 if word.isalpha() and len(word) > 7]

In [22]:
text1_uncommon_freq = nltk.FreqDist(large_uncommon_words)
text1_uncommon_freq.most_common(3)

[('Queequeg', 252), ('Starbuck', 196), ('something', 119)]

In [23]:
from nltk.corpus import genesis

**genesis** is an **isolated** corpus: it holds a set of individual text collections.

In [24]:
genesis.fileids()

['english-kjv.txt',
 'english-web.txt',
 'finnish.txt',
 'french.txt',
 'german.txt',
 'lolcat.txt',
 'portuguese.txt',
 'swedish.txt']

In [25]:
for fileid in genesis.fileids():
    # Raw: Characters
    n_chars = len(genesis.raw(fileid))
    n_words = len(genesis.words(fileid))
    n_sents = len(genesis.sents(fileid))
    print(int(n_chars/n_words), int(n_words/n_sents), fileid)

4 30 english-kjv.txt
4 19 english-web.txt
5 15 finnish.txt
4 23 french.txt
4 23 german.txt
4 20 lolcat.txt
4 27 portuguese.txt
4 30 swedish.txt


In [26]:
# from nltk.corpus import PlaintextCorpusReader
# # The first argument is the path to the directory holding the text files
# wordlists = PlaintextCorpusReader('.', '.*')
# wordlists.fileids()
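To try a custom corpus without touching real files, the sketch below writes two small text files to a temporary directory and reads them back with `PlaintextCorpusReader`. The file names and contents are made up for illustration. It relies on the reader's default word tokenizer and avoids `sents()`, which would need sentence tokenizer data.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical file names and contents, used only for this sketch.
documents = {
    'a.txt': 'Python is cool.',
    'b.txt': 'NLTK reads plain text.',
}

with tempfile.TemporaryDirectory() as root:
    for name, body in documents.items():
        with open(os.path.join(root, name), 'w', encoding='utf-8') as fh:
            fh.write(body)

    # The second argument is a regular expression selecting the files.
    reader = PlaintextCorpusReader(root, r'.*\.txt')
    file_ids = sorted(reader.fileids())
    # Read the words while the temporary directory still exists.
    words_a = list(reader.words('a.txt'))

print(file_ids)
print(words_a)
```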

In [27]:
items = ['apple', 'apple', 'kiwi', 'cabbage', 'cabbage', 'potato']
nltk.FreqDist(items)

FreqDist({'apple': 2, 'cabbage': 2, 'kiwi': 1, 'potato': 1})

A **Conditional Frequency Distribution** (`ConditionalFreqDist`) takes a list of (condition, sample) tuples, as in the example below.

In [28]:
c_items = [('F','apple'), ('F','apple'), ('F','kiwi'), ('V','cabbage'), ('V','cabbage'), ('V','potato')]
cfd = nltk.ConditionalFreqDist(c_items)
cfd.conditions()

['F', 'V']

In [29]:
cfd['V']

FreqDist({'cabbage': 2, 'potato': 1})
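Conceptually, a conditional frequency distribution behaves like a dictionary that maps each condition to its own frequency distribution. The standard-library sketch below mimics that idea. It illustrates the concept and is not NLTK's implementation; `counts_by_condition` is a name chosen for this example.

```python
from collections import Counter, defaultdict

pairs = [('F', 'apple'), ('F', 'apple'), ('F', 'kiwi'),
         ('V', 'cabbage'), ('V', 'cabbage'), ('V', 'potato')]

# One Counter per condition, created on first use.
counts_by_condition = defaultdict(Counter)
for condition, sample in pairs:
    counts_by_condition[condition][sample] += 1

print(sorted(counts_by_condition))
print(counts_by_condition['V'].most_common(1))
```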

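- **brown** is a **categorical** corpus - each text collection is tagged with a single category (genre)
- **reuters** is an **overlapping** corpus - each text collection is tagged with one or more categories
- **inaugural** is a **temporal** corpus - its text collections are ordered in time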

In [30]:
from nltk.corpus import brown
cfd_brown = nltk.ConditionalFreqDist([(genre, word) for genre in brown.categories() for word in brown.words(categories=genre)])
print(cfd_brown)
cfd_brown.conditions()

<ConditionalFreqDist with 15 conditions>


['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [31]:
cfd_brown.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'])

           leadership    worship   hardship 
government         12          3          2 
     humor          1          0          0 
   reviews         14          1          2 


In [32]:
cfd_brown.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'], cumulative = True)

           leadership    worship   hardship 
government         12         15         17 
     humor          1          1          1 
   reviews         14         15         17 
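With `cumulative=True`, each cell holds the running total of the counts in its row, read left to right. The short sketch below reproduces the government row from the plain counts using `itertools.accumulate`; the variable names are illustrative.

```python
from itertools import accumulate

samples = ['leadership', 'worship', 'hardship']
government_counts = [12, 3, 2]  # row taken from the non-cumulative table

# Running sum across the sample columns.
cumulative_counts = list(accumulate(government_counts))
print(dict(zip(samples, cumulative_counts)))
```

The order of `samples` matters here: a different column order gives different intermediate totals, although the last column always equals the row sum.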


In [33]:
news_fd = cfd_brown['news']
news_fd.most_common(3)

[('the', 5580), (',', 5188), ('.', 4030)]

In [34]:
news_fd['the']

5580

In [35]:
from nltk.corpus import names
nt = [(fid, name) for fid in names.fileids() for name in names.words(fid)]
print("(fid, name): {}".format(nt[:5]))
nt = [(fid.split('.')[0], name[-1]) for fid in names.fileids() for name in names.words(fid)]
print("(fid, last char in name): {}".format(nt[:5]))
cfd_names = nltk.ConditionalFreqDist(nt)
print(cfd_names.conditions())

(fid, name): [('female.txt', 'Abagael'), ('female.txt', 'Abagail'), ('female.txt', 'Abbe'), ('female.txt', 'Abbey'), ('female.txt', 'Abbi')]
(fid, last char in name): [('female', 'l'), ('female', 'l'), ('female', 'e'), ('female', 'y'), ('female', 'i')]
['female', 'male']


In [36]:
cfd_names['female'] > cfd_names['male']

False

In [37]:
cfd_names.tabulate(samples=['a', 'e'])

          a    e 
female 1773 1432 
  male   29  468 


In [38]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
content1 = request.urlopen(url).read()

In [39]:
url = "http://www.bbc.com/news/health-42802191"
html_content = request.urlopen(url).read()

### BeautifulSoup
- A module for scraping the required text from web pages.

In [40]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

In [41]:
inner_body = soup.find_all('div', attrs={'class':'story-body__inner'})

In [42]:
inner_text = [elm.text for elm in inner_body[0].find_all(['h1', 'h2', 'p', 'li'])]

In [43]:
text_content2 = '\n'.join(inner_text)
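Live pages change over time, so the class name used above may no longer match anything. The sketch below runs the same extraction steps on an inline HTML string, which makes them easy to test offline. The markup is invented for illustration and only borrows the `story-body__inner` class from the cells above.

```python
from bs4 import BeautifulSoup

# Invented markup for this sketch; not the real BBC page.
html = """
<html><body>
  <div class="story-body__inner">
    <h1>Headline</h1>
    <p>First paragraph.</p>
    <ul><li>Point one</li></ul>
  </div>
  <div class="sidebar"><p>Unrelated text.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', attrs={'class': 'story-body__inner'})

# Guard against a missing container instead of indexing blindly.
if container is None:
    pieces = []
else:
    pieces = [elm.get_text(strip=True)
              for elm in container.find_all(['h1', 'h2', 'p', 'li'])]

text_content = '\n'.join(pieces)
print(text_content)
```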

Third-party libraries such as **pywin32** and **pypdf** are required to access binary formats such as Microsoft Word or PDF documents.

In [44]:
text_content1 = content1.decode('unicode_escape')  # Converts bytes to unicode
tokens1 = nltk.word_tokenize(text_content1)
tokens1[3:8]

['Project', 'Gutenberg', 'EBook', 'of', 'Crime']

In [45]:
tokens2 = nltk.word_tokenize(text_content2)
print(len(tokens2))
print(tokens2[:5])

751
['Smokers', 'need', 'to', 'quit', 'cigarettes']


### Regular Expressions for Tokenization

In [46]:
import re
tokens2_2 = re.findall(r'\w+', text_content2)
print(len(tokens2_2))
print(tokens2_2[:5])

668
['Smokers', 'need', 'to', 'quit', 'cigarettes']


In [47]:
# Alternatively use regexp_tokenize function from NLTK
pattern = r'\w+'
tokens2_3 = nltk.regexp_tokenize(text_content2, pattern)
len(tokens2_3)

668

In [48]:
input_text2 = nltk.Text(tokens2)
type(input_text2)

nltk.text.Text

With **word_tokenize**, punctuation characters are treated as words.

In [49]:
nltk.word_tokenize('Python is cool!!!')

['Python', 'is', 'cool', '!', '!', '!']

With **regexp_tokenize**, punctuation characters are not treated as words.

In [50]:
nltk.regexp_tokenize('Python is cool!!!', r'\w+')

['Python', 'is', 'cool']

In [51]:
s = 'Python is cool!!!'
print(re.findall(r'\s\w+\b', s))

[' is', ' cool']
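The pattern decides what counts as a token. The last cell returns `' is'` and `' cool'` with a leading space because `\s` is part of the match, and `'Python'` is skipped because no whitespace precedes it. The sketch below compares `\w+` with a slightly richer pattern that keeps hyphenated words together. The sentence and the second pattern are assumptions made for illustration.

```python
import nltk

sentence = 'Python is a high-level, general-purpose language!'

# \w+ breaks hyphenated words apart at the hyphen.
plain = nltk.regexp_tokenize(sentence, r'\w+')

# A word, optionally followed by further hyphen-joined parts.
# The group is non-capturing so whole matches are returned.
hyphen_aware = nltk.regexp_tokenize(sentence, r'\w+(?:-\w+)*')

print(plain)
print(hyphen_aware)
```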


### Bigrams and Collocations

In [52]:
s = 'Python is an awesome language.'
tokens = nltk.word_tokenize(s)
list(nltk.bigrams(tokens))

[('Python', 'is'),
 ('is', 'an'),
 ('an', 'awesome'),
 ('awesome', 'language'),
 ('language', '.')]

In [53]:
eng_tokens = genesis.words('english-kjv.txt')
eng_bigrams = nltk.bigrams(eng_tokens)
filtered_bigrams = [(w1, w2) for w1, w2 in eng_bigrams if len(w1) >=5 and len(w2) >= 5]

In [54]:
eng_bifreq = nltk.FreqDist(filtered_bigrams)
eng_bifreq.most_common(3)

[(('their', 'father'), 19), (('lived', 'after'), 16), (('seven', 'years'), 15)]

In [55]:
eng_tokens = genesis.words('english-kjv.txt')
eng_bigrams = nltk.bigrams(eng_tokens)
eng_cfd = nltk.ConditionalFreqDist(eng_bigrams)
eng_cfd['living'].most_common(2)

[('creature', 7), ('thing', 4)]

#### Generating Most Frequent Next Word

In [56]:
def generate(cfd, word, n=5):
    n_words = []
    for i in range(n):
        n_words.append(word)
        word = cfd[word].max()
    return n_words

In [57]:
generate(eng_cfd, 'living')

['living', 'creature', 'that', 'he', 'said']
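`generate` assumes that every word it reaches has at least one follower; calling `max()` on an empty frequency distribution raises an error. The sketch below is a more defensive variant that stops at such dead ends. The name `generate_safe` and the toy token list are hypothetical.

```python
import nltk


def generate_safe(cfd, word, n=5):
    """Follow the most frequent next word up to n steps, stopping at dead ends."""
    words = []
    for _ in range(n):
        words.append(word)
        # Stop if the word was never seen as the first element of a bigram.
        if word not in cfd or not cfd[word]:
            break
        word = cfd[word].max()
    return words


# Toy data: 'i' is followed by 'think' twice and by 'am' once.
toy_tokens = ['i', 'think', 'i', 'think', 'i', 'am']
toy_cfd = nltk.ConditionalFreqDist(nltk.bigrams(toy_tokens))

looping = generate_safe(toy_cfd, 'i')
dead_end = generate_safe(toy_cfd, 'am')
print(looping)
print(dead_end)
```

Because the most frequent follower is always chosen, the output can fall into a loop, as the toy example shows.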

In [58]:
list(nltk.ngrams(tokens, 4))

[('Python', 'is', 'an', 'awesome'),
 ('is', 'an', 'awesome', 'language'),
 ('an', 'awesome', 'language', '.')]

In [59]:
gen_text = nltk.Text(eng_tokens)
gen_text.collocations()

said unto; pray thee; thou shalt; thou hast; thy seed; years old;
spake unto; thou art; LORD God; every living; God hath; begat sons;
seven years; shalt thou; little ones; living creature; creeping thing;
savoury meat; thirty years; every beast


In [60]:
list(nltk.trigrams(nltk.word_tokenize('Python is cool!!!')))

[('Python', 'is', 'cool'),
 ('is', 'cool', '!'),
 ('cool', '!', '!'),
 ('!', '!', '!')]

In [61]:
text6

<Text: Monty Python and the Holy Grail>

#### TODO: Make the code below generic.

In [62]:
text6_bigrams = nltk.bigrams(text6)
filtered_text6_bigrams = [(w1, w2) for w1, w2 in text6_bigrams if w1 =='clop' and w2 == 'clop']
nltk.FreqDist(filtered_text6_bigrams)

FreqDist({('clop', 'clop'): 26})

In [63]:
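# text6_bigrams is a generator that the previous cell already consumed,
# so nothing is left to iterate over and the FreqDist below is empty.
temp = [(w1, w2) for w1, w2 in text6_bigrams]
nltk.FreqDist(temp)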

FreqDist({})
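The empty result comes from the fact that `nltk.bigrams` returns a generator, which can be iterated only once. The sketch below shows the effect on a tiny token list and offers one possible generic version of the bigram count. `bigram_count` is a hypothetical helper, not part of NLTK.

```python
import nltk


def bigram_count(tokens, first, second):
    """Count how often `first` is immediately followed by `second`."""
    # Build a fresh list of bigrams on every call.
    pairs = list(nltk.bigrams(tokens))
    return nltk.FreqDist(pairs)[(first, second)]


tokens = ['clop', 'clop', 'clop', 'whinny']

lazy_pairs = nltk.bigrams(tokens)
first_pass = list(lazy_pairs)   # consumes the generator
second_pass = list(lazy_pairs)  # nothing left to iterate over

print(len(first_pass), len(second_pass))
print(bigram_count(tokens, 'clop', 'clop'))
```

Calling `bigram_count(text6, 'clop', 'clop')` would apply the same idea to the text6 corpus.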

### Stemming

In [64]:
from nltk import PorterStemmer
porter = PorterStemmer()
porter.stem('builders')

'builder'

In [65]:
from nltk import LancasterStemmer
lancaster = LancasterStemmer()
lancaster.stem('builders')

'build'

In [66]:
print(len(set(text1)))
lc_words = [word.lower() for word in text1] 
print(len(set(lc_words)))

19317
17231


In [67]:
p_stem_words = [porter.stem(word) for word in set(lc_words)]
len(set(p_stem_words))

10927

In [68]:
l_stem_words = [lancaster.stem(word) for word in set(lc_words)]
len(set(l_stem_words))

9036
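The Lancaster stemmer leaves fewer distinct stems than Porter because it strips affixes more aggressively. The sketch below prints both stemmers' output side by side for a small, hand-picked word family, which makes the difference easy to inspect. The word list is an assumption for illustration.

```python
from nltk import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Hand-picked related words for this sketch.
words = ['builders', 'builder', 'building', 'builds']

for word in words:
    print(word, porter.stem(word), lancaster.stem(word))

# Size of the vocabulary after stemming with each algorithm.
porter_vocab = {porter.stem(w) for w in words}
lancaster_vocab = {lancaster.stem(w) for w in words}
print(len(words), len(porter_vocab), len(lancaster_vocab))
```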

A **lemma** is a lexical entry in a lexical resource such as a dictionary.

Several lemmas can share the same spelling. These are known as homonyms.

For example, the two lemmas listed below are homonyms.

1. saw [verb] - Past tense of see
2. saw [noun] - Cutting instrument

NLTK comes with **WordNetLemmatizer**. This lemmatizer removes affixes only if the resulting word is found in the lexical resource **WordNet**.

**WordNetLemmatizer** is mainly used to build a vocabulary of words that are valid lemmas.

In [69]:
wnl = nltk.WordNetLemmatizer()
wnl_stem_words = [wnl.lemmatize(word) for word in set(lc_words)]
len(set(wnl_stem_words))

15168

How many words in the text6 corpus end with 'ing'?

In [70]:
ing = [word for word in text6 if word.endswith('ing')]
len(ing)

281

**POS Tagging**

Categorizing words into their parts of speech and labeling them accordingly is called POS tagging.

A **POS tagger** processes a sequence of words and attaches a part-of-speech tag to each word.

**pos_tag** is the simplest way to tag text in NLTK; it applies NLTK's recommended pre-trained tagger.

In [71]:
words = nltk.word_tokenize(text)
nltk.pos_tag(words)
# 'Python', 'is' and 'interpreted' are tagged as proper noun (NNP), third-person singular present verb (VBZ) and adjective (JJ), respectively.

[('Python', 'NNP'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('interpreted', 'JJ'),
 ('high-level', 'NN'),
 ('programming', 'NN'),
 ('language', 'NN'),
 ('for', 'IN'),
 ('general-purpose', 'JJ'),
 ('programming', 'NN'),
 ('.', '.'),
 ('Created', 'VBN'),
 ('by', 'IN'),
 ('Guido', 'NNP'),
 ('van', 'NN'),
 ('Rossum', 'NNP'),
 ('and', 'CC'),
 ('first', 'RB'),
 ('released', 'VBN'),
 ('in', 'IN'),
 ('1991', 'CD'),
 ('.', '.')]

In [72]:
# Help command - Remove the argument for full details
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


**Tagging Text**

A list of tagged words can be constructed from a string.

A tagged word or token is represented as a tuple of the word and its tag.

In the input text, each word and its tag are separated by /.

In [73]:
text = 'Python/NN is/VB awesome/JJ ./.'
[nltk.tag.str2tuple(word) for word in text.split()]

[('Python', 'NN'), ('is', 'VB'), ('awesome', 'JJ'), ('.', '.')]
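The conversion also works in the other direction. The sketch below round-trips the tagged string through `str2tuple` and `tuple2str`, and shows the `sep` argument and the result for a token without a tag. The `'|'` separator is chosen arbitrarily for illustration.

```python
from nltk.tag import str2tuple, tuple2str

tagged_text = 'Python/NN is/VB awesome/JJ ./.'

# String -> list of (word, tag) tuples.
tagged_tokens = [str2tuple(item) for item in tagged_text.split()]

# List of tuples -> string again.
round_trip = ' '.join(tuple2str(token) for token in tagged_tokens)

# A different separator, and a token that carries no tag at all.
custom = str2tuple('fly|NN', sep='|')
untagged = str2tuple('fly')

print(round_trip)
print(custom, untagged)
```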

In [74]:
# brown corpus is an example of Tagged Corpora
brown_tagged = brown.tagged_words()
brown_tagged[:3]

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]

**DefaultTagger** assigns the same specified tag to every word or token of the given text.

In [75]:
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(words)

[('Python', 'NN'), ('is', 'NN'), ('awesome', 'NN'), ('.', 'NN')]

#### Unigram Tagger

**UnigramTagger** gives us the flexibility to create our own taggers.

Unigram taggers are built on statistical information, i.e., they tag each word or token with the **most likely** tag for that particular word.

We build a unigram tagger through a process known as **training**.

We then use the tagger to tag the words in a test set and evaluate its performance.

In [76]:
# Custom tagger
defined_tags = {'is':'BEZ', 'over':'IN', 'who': 'WPS'}

In [77]:
baseline_tagger = nltk.UnigramTagger(model=defined_tags)
baseline_tagger.tag(words)

[('Python', None), ('is', 'BEZ'), ('awesome', None), ('.', None)]
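Words missing from the lookup table are tagged `None`. One common remedy is to chain taggers with the `backoff` argument, so that unknown words fall through to a simpler tagger. The sketch below combines the lookup table with a `DefaultTagger`; choosing `'NN'` as the fallback tag is an assumption for illustration.

```python
import nltk

words = ['Python', 'is', 'awesome', '.']
defined_tags = {'is': 'BEZ', 'over': 'IN', 'who': 'WPS'}

# Unknown words are handed to the fallback tagger instead of getting None.
fallback = nltk.DefaultTagger('NN')
lookup_tagger = nltk.UnigramTagger(model=defined_tags, backoff=fallback)

tagged = lookup_tagger.tag(words)
print(tagged)
```

Note that the fallback also tags the full stop as `'NN'`; a more careful setup would handle punctuation separately.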

Let's take the tagged sentences of the **brown** corpus that belong to the **government** genre.

Let's also compute the training set size as 80% of the sentences.

In [78]:
brown_tagged_sents = brown.tagged_sents(categories='government')
brown_sents = brown.sents(categories='government')
print(len(brown_sents))
train_size = int(len(brown_sents)*0.8)
train_size

3032


2425

In [79]:
train_sents = brown_tagged_sents[:train_size]
test_sents = brown_tagged_sents[train_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

0.7799495586380832
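The score is the fraction of test tokens whose predicted tag matches the gold tag. The sketch below trains a unigram tagger on two hand-made sentences and computes that fraction by hand, which shows why unseen words lower the score. The tiny data set is invented for illustration.

```python
import nltk

# Invented training and test data for this sketch.
train_sents = [
    [('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]
test_sents = [
    [('the', 'AT'), ('dog', 'NN'), ('sleeps', 'VBZ'), ('soundly', 'RB')],
]

tagger = nltk.UnigramTagger(train_sents)

correct = 0
total = 0
for sent in test_sents:
    words = [word for word, _ in sent]
    predicted = tagger.tag(words)
    # Compare gold and predicted tags token by token.
    for (_, gold_tag), (_, predicted_tag) in zip(sent, predicted):
        correct += gold_tag == predicted_tag
        total += 1

accuracy = correct / total
print(accuracy)
```

'soundly' never occurs in the training sentences, so it is tagged `None` and counts as an error.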

In [80]:
# Predict tagging for a sentence taken from the test set
unigram_tagger.tag(brown_sents[3000])

[('The', 'AT'),
 ('first', 'OD'),
 ('step', 'NN'),
 ('is', 'BEZ'),
 ('a', 'AT'),
 ('comprehensive', 'JJ'),
 ('self', None),
 ('study', 'NN'),
 ('made', 'VBN'),
 ('by', 'IN'),
 ('faculty', None),
 (',', ','),
 ('by', 'IN'),
 ('outside', 'IN'),
 ('consultants', 'NNS'),
 (',', ','),
 ('or', 'CC'),
 ('by', 'IN'),
 ('a', 'AT'),
 ('combination', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('two', 'CD'),
 ('.', '.')]

In [81]:
tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)

('fly', 'NN')


Which tag occurs most often in the text collections of the **news** genre of the **brown** corpus?

In [82]:
type(brown_tagged)

nltk.corpus.reader.util.ConcatenatedCorpusView

In [83]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
nltk.FreqDist(tags).max()

'NN'

In [84]:
s = 'Python is awesome'
print(nltk.pos_tag(nltk.word_tokenize(s)))

[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ')]


**Summary**

Tokenizing text using functions **word_tokenize** and **sent_tokenize**.

Computing Frequencies with **FreqDist** and **ConditionalFreqDist**.

Generating bigrams and collocations with **bigrams** and **collocations**.

Stripping word affixes using **PorterStemmer** and **LancasterStemmer**.

Tagging words to their parts of speech using **pos_tag**.