Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit
Steven Bird, Ewan Klein, and Edward Loper
http://www.nltk.org/book/

# Chapter 3. Processing Raw Text

## 3.1 Accessing Text from the Web and from Disk

### Electronic Books

In [10]:
%matplotlib inline

In [11]:
import nltk, re, pprint
from nltk import word_tokenize

In [12]:
from urllib import request

In [13]:
# url = "http://www.gutenberg.org/files/2554/2554.txt"
url = "http://www.gutenberg.org/files/2554/2554-0.txt"

In [40]:
response = request.urlopen(url)

In [41]:
raw = response.read().decode('utf-8-sig')

In [42]:
type(raw)
# <class 'str'>

str

In [43]:
len(raw)
# 1176893

1176966

In [44]:
raw[:75]
# 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

In [45]:
tokens = word_tokenize(raw)

In [46]:
type(tokens)
# <class 'list'>

list

In [47]:
len(tokens)
# 254354

257727

In [48]:
tokens[:10]
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

In [49]:
text = nltk.Text(tokens)

In [50]:
type(text)
# <class 'nltk.text.Text'>

nltk.text.Text

In [51]:
text[1024:1062]
# ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
#  'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
#  'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly',
#  ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.']

['an',
 'exceptionally',
 'hot',
 'evening',
 'early',
 'in',
 'July',
 'a',
 'young',
 'man',
 'came',
 'out',
 'of',
 'the',
 'garret',
 'in',
 'which',
 'he',
 'lodged',
 'in',
 'S.',
 'Place',
 'and',
 'walked',
 'slowly',
 ',',
 'as',
 'though',
 'in',
 'hesitation',
 ',',
 'towards',
 'K.',
 'bridge',
 '.',
 'He',
 'had',
 'successfully']

In [53]:
text.collocation_list()
# Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
# Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
# woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
# great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;
# Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market

['Katerina Ivanovna',
 'Pyotr Petrovitch',
 'Pulcheria Alexandrovna',
 'Avdotya Romanovna',
 'Rodion Romanovitch',
 'Marfa Petrovna',
 'Sofya Semyonovna',
 'old woman',
 'Project Gutenberg-tm',
 'Porfiry Petrovitch',
 'Amalia Ivanovna',
 'great deal',
 'young man',
 'Nikodim Fomitch',
 'Ilya Petrovitch',
 'Project Gutenberg',
 'Andrey Semyonovitch',
 'Hay Market',
 'Dmitri Prokofitch',
 'Good heavens']

In [54]:
raw.find("PART I")
# 5338

5335

In [64]:
raw.rfind("End of Project Gutenberg’s Crime")
# 1157743

1157811

In [65]:
raw = raw[5335:1157811]

In [66]:
raw.find("PART I")
# 0

0
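
The slice offsets in `In [65]` were copied by hand from the `find` and `rfind` results, so they go stale whenever the Gutenberg file changes (the book's offsets in the comments already differ from the ones observed here). A minimal sketch of computing them instead, assuming a hypothetical helper named `trim_between` that is not part of NLTK:

```python
def trim_between(raw, start_marker, end_marker):
    """Return raw from the first start_marker up to (not including) the last end_marker.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    start = raw.find(start_marker)   # first occurrence, or -1 if absent
    end = raw.rfind(end_marker)      # last occurrence, or -1 if absent
    # str.find/rfind signal "not found" with -1, which would silently
    # produce a wrong slice, so fail loudly instead.
    if start == -1:
        raise ValueError("start marker not found: %r" % start_marker)
    if end == -1:
        raise ValueError("end marker not found: %r" % end_marker)
    if end < start:
        raise ValueError("end marker occurs before start marker")
    return raw[start:end]


# Small stand-in string; the real input would be the downloaded book.
sample = "header\r\nPART I\r\nbody\r\nEnd of Project Gutenberg\r\nlicense"
trimmed = trim_between(sample, "PART I", "End of Project Gutenberg")
print(repr(trimmed))
```

With the download above, `trim_between(raw, "PART I", "End of Project Gutenberg’s Crime")` should give the same result as the hard-coded slice, provided both markers are present. Note that the end marker uses a typographic apostrophe (`’`); a plain `'` would not match.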

### Dealing with HTML

In [74]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"

In [75]:
html = request.urlopen(url).read().decode('utf8')

In [76]:
html[:60]
# '<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [78]:
# raw = nltk.clean_html(html)
# NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

In [79]:
from bs4 import BeautifulSoup

In [83]:
# raw = BeautifulSoup(html).get_text()
raw = BeautifulSoup(html, "lxml").get_text()

In [89]:
tokens = word_tokenize(raw)

In [90]:
tokens
# ['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', ...]

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '12:51'

In [92]:
tokens = tokens[110:390]

In [93]:
text = nltk.Text(tokens)

In [94]:
text.concordance('gene')
# Displaying 5 of 5 matches:
# hey say too few people now carry the gene for blondes to last beyond the next
# blonde hair is caused by a recessive gene . In order for a child to have blond
# have blonde hair , it must have the gene on both sides of the family in the g
# ere is a disadvantage of having that gene or by chance . They do n't disappear
# des would disappear is if having the gene was a disadvantage and I do not thin

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
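
When `bs4` or `lxml` is not installed, the standard library's `html.parser` can do a rough version of the same job. The following is a minimal sketch, not a replacement for `get_text()`: `TextExtractor` and `html_to_text` are hypothetical names, and the output will generally differ from BeautifulSoup's in whitespace and in how malformed markup is handled.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside <script>/<style>.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces and collapse whitespace runs. This can
    # split a word that is broken up by inline tags (e.g. "<b>H</b>ello").
    return " ".join(" ".join(parser.parts).split())


page = ("<html><head><title>BBC NEWS</title><style>p {}</style></head>"
        "<body><p>Blondes <b>to die out</b></p>"
        "<script>var x = 1;</script></body></html>")
print(html_to_text(page))
```

The resulting string can be passed to `word_tokenize` in the same way as the BeautifulSoup output.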


### Processing Search Engine Results

### Processing RSS Feeds

In [96]:
# feedparser is a third-party package and must be installed first (e.g. pip install feedparser)
import feedparser

In [118]:
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")

In [119]:
llog['feed']['title']
# 'Language Log'

'Language Log'

In [120]:
len(llog.entries)
# 15

13

In [121]:
post = llog.entries[2]

In [122]:
post.title
# "He's My BF"

'Emoji in Chinese music video lyric'

In [123]:
content = post.content[0].value

In [124]:
content[:70]
# '<p>Today I was chatting with three of our visiting graduate students f'

'<p>From Charles Belov:</p>\n<p style="padding-left: 40px;">I thought I '

In [125]:
# raw = BeautifulSoup(content).get_text()
raw = BeautifulSoup(content, "lxml").get_text()

In [126]:
word_tokenize(raw)
# ['Today', 'I', 'was', 'chatting', 'with', 'three', 'of', 'our', 'visiting',
# 'graduate', 'students', 'from', 'the', 'PRC', '.', 'Thinking', 'that', 'I',
# 'was', 'being', 'au', 'courant', ',', 'I', 'mentioned', 'the', 'expression',
# 'DUI4XIANG4', '\u5c0d\u8c61', '("', 'boy', '/', 'girl', 'friend', '"', ...]

['From',
 'Charles',
 'Belov',
 ':',
 'I',
 'thought',
 'I',
 'was',
 'going',
 'to',
 'be',
 'sending',
 'you',
 'a',
 'case',
 'of',
 'Google',
 'Translate',
 'munging',
 'a',
 'song',
 'lyric',
 'when',
 'translating',
 'it',
 'from',
 'Chinese',
 'to',
 'English',
 '.',
 'Instead',
 ',',
 'I',
 "'m",
 'sending',
 'you',
 'a',
 'case',
 'of',
 'a',
 'Chinese',
 'music',
 'video',
 'making',
 'use',
 'of',
 'an',
 'emoji',
 'in',
 'the',
 'song',
 'lyrics',
 '.',
 'The',
 'song',
 'in',
 'question',
 'is',
 'gǎibiàn',
 '改變',
 '(',
 'Changes',
 ')',
 'by',
 'Taiwan',
 'rocker',
 'Zhāng',
 'Zhènyuè',
 '張震嶽',
 'A-Yue',
 '.',
 'If',
 'I',
 'copy',
 'the',
 'lyrics',
 'from',
 'Rock',
 'Records',
 "'",
 'posting',
 'on',
 'YouTube',
 ',',
 'Google',
 'Translate',
 'translates',
 'the',
 'line',
 'in',
 'question',
 '``',
 'Wǒ',
 'xiǎng',
 'dàbiàn',
 '我想大便',
 "''",
 'as',
 '``',
 'I',
 'want',
 'to',
 'have',
 'a',
 'bowel',
 'movement',
 '.',
 "''",
 'Now',
 'I',
 'am',
 'familiar',
 'wit

### Reading Local Files

In [136]:
f = open('document.txt')

In [137]:
raw = f.read()
raw

'Hello\nWorld!\n'

In [138]:
import os
os.listdir('.')

['.gitignore',
 '.ipynb_checkpoints',
 'document.txt',
 'groucho_grammar.cfg',
 'iu_mien_samp.xml',
 'mygrammar.cfg',
 'nltk-book-chap-x.ipynb',
 'nltk-book-chap01.ipynb',
 'nltk-book-chap02-1.ipynb',
 'nltk-book-chap02-2.ipynb',
 'nltk-book-chap02-3.ipynb',
 'nltk-book-chap02-4.ipynb',
 'nltk-book-chap02-5.ipynb',
 'nltk-book-chap03-1.ipynb',
 'nltk-book-chap03-2.ipynb',
 'nltk-book-chap03-3.ipynb',
 'nltk-book-chap03-4.ipynb',
 'nltk-book-chap04-1.ipynb',
 'nltk-book-chap04-2.ipynb',
 'nltk-book-chap05-1.ipynb',
 'nltk-book-chap05-2.ipynb',
 'nltk-book-chap05-3.ipynb',
 'nltk-book-chap05-4.ipynb',
 'nltk-book-chap05-5.ipynb',
 'nltk-book-chap05-6.ipynb',
 'nltk-book-chap06-1.ipynb',
 'nltk-book-chap06-2.ipynb',
 'nltk-book-chap06-3.ipynb',
 'nltk-book-chap06-4_5.ipynb',
 'nltk-book-chap07-1.ipynb',
 'nltk-book-chap07-2.ipynb',
 'nltk-book-chap07-3.ipynb',
 'nltk-book-chap07-4.ipynb',
 'nltk-book-chap07-5.ipynb',
 'nltk-book-chap07-6.ipynb',
 'nltk-book-chap08-1_2.ipynb',
 'nltk-book-

In [141]:
f = open('document.txt', 'r')
for line in f:
    print(line.strip())

Hello
World!


In [142]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
# The book uses open(path, 'rU').read(). Under the Python 3.6 used here that gives
# DeprecationWarning: 'U' mode is deprecated
# (the mode was removed in Python 3.11). Plain 'r' already applies
# universal newline handling in Python 3.
with open(path, 'r') as f:
    raw = f.read()
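
The cells in this section open files without closing them and rely on the platform's default encoding. A minimal sketch of a more defensive pattern, assuming the file is UTF-8 (an assumption; adjust `encoding` to match the actual file) and using a hypothetical helper named `read_lines`:

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Return the lines of a text file with surrounding whitespace stripped."""
    # The with-statement closes the file even if reading raises an error.
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f]


# Demonstrate on a throwaway copy of the two-line document used above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "document.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Hello\nWorld!\n")
    demo_lines = read_lines(demo_path)

print(demo_lines)
```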

### Extracting Text from PDF, MSWord, and Other Binary Formats

Text in PDF and MSWord files has to be extracted with third-party libraries such as pypdf (PDF) and pywin32 (MSWord, Windows only).
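
As a rough illustration for the PDF case, the sketch below uses pypdf's `PdfReader` and `extract_text()`. It assumes pypdf is installed (`pip install pypdf`); the function name `extract_pdf_text` and the file name in the usage note are hypothetical. Extraction quality depends heavily on how the PDF was produced, and scanned pages typically yield no text at all.

```python
def extract_pdf_text(path):
    """Return the text of every page in a PDF, joined by newlines."""
    # Imported here so the rest of a notebook still runs without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Guard against pages for which no text could be extracted.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

A call such as `raw = extract_pdf_text("some_document.pdf")` would then feed into `word_tokenize(raw)` like any other string.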

### Capturing User Input

In [145]:
s = input("Enter some text: ")
# Enter some text: On an exceptionally hot evening early in July

Enter some text: On an exceptionally hot evening early in July


In [146]:
print("You typed", len(word_tokenize(s)), "words.")
# You typed 8 words.

You typed 8 words.


### The NLP Pipeline

In [147]:
raw = open('document.txt').read()

In [148]:
type(raw)
# <class 'str'>

str

In [149]:
tokens = word_tokenize(raw)

In [150]:
type(tokens)
# <class 'list'>

list

In [151]:
words = [w.lower() for w in tokens]

In [152]:
type(words)
# <class 'list'>

list

In [153]:
vocab = sorted(set(words))

In [154]:
type(vocab)
# <class 'list'>

list

In [155]:
vocab.append('blog')

In [156]:
raw.append('blog')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# AttributeError: 'str' object has no attribute 'append'

AttributeError: 'str' object has no attribute 'append'

In [157]:
query = 'Who knows?'

In [158]:
beatles = ['john', 'paul', 'george', 'ringo']

In [159]:
query + beatles
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: cannot concatenate 'str' and 'list' objects

TypeError: must be str, not list
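
The steps in this section (read a string, tokenize it, normalize case, build a sorted vocabulary) can be collected into one function. The sketch below is an illustration under stated assumptions: `regex_tokenize` is a simple stand-in so the block runs without NLTK's tokenizer models, and it does not reproduce `word_tokenize`; `build_vocab` is a hypothetical name. The last lines show one way to avoid the `TypeError` above, by converting the list to a string before concatenating.

```python
import re


def regex_tokenize(raw):
    # Stand-in tokenizer (an assumption, not NLTK's algorithm):
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", raw)


def build_vocab(raw, tokenize=regex_tokenize):
    tokens = tokenize(raw)                # str -> list of str
    words = [w.lower() for w in tokens]   # normalize case
    return sorted(set(words))             # deduplicate, then sort


vocab = build_vocab("Hello\nWorld!\n")
print(vocab)       # ['!', 'hello', 'world']

query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']
# A list has to be turned into a string before it can be joined to one.
combined = query + ' ' + ' '.join(beatles)
print(combined)    # Who knows? john paul george ringo
```

Passing `tokenize=word_tokenize` swaps in NLTK's tokenizer without changing the rest of the function.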