# Processing Raw Text


1. How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
2. How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
3. How can we write programs to produce formatted output and save it in a file?


In [1]:
import nltk, re, pprint
from nltk import word_tokenize # tokenization
from urllib import request # text from web
from bs4 import BeautifulSoup # html parse

## Accessing Text from the Web and from Disk

In [2]:
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

str

In [3]:
print("The length of this string is {:d}".format(len(raw)))

The length of this string is 1176812


In [4]:
raw[:75]

'\ufeffThe Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\r'
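The leading `'\ufeff'` in the output above is a Unicode byte order mark (BOM). A minimal sketch, assuming the downloaded bytes are UTF-8 with a BOM: Python's `utf-8-sig` codec drops the mark while decoding. The `data` bytes below are a hypothetical stand-in for what `response.read()` returns.

```python
# Hypothetical stand-in for the bytes returned by response.read()
data = b'\xef\xbb\xbfThe Project Gutenberg eBook'

with_bom = data.decode('utf8')           # keeps the BOM as '\ufeff'
without_bom = data.decode('utf-8-sig')   # drops a leading BOM if present

print(repr(with_bom[:4]))
print(repr(without_bom[:3]))
```

Decoding this way would also keep the mark out of the first token produced by `word_tokenize`.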

## Tokenization

In [5]:
tokens = word_tokenize(raw)
type(tokens)

list

In [6]:
print("The length of the tokens is {:d}".format(len(tokens)))

The length of the tokens is 257058


In [7]:
print(tokens[:50])

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by', 'Fyodor', 'Dostoevsky', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away']


## Create NLTK text from list of tokens

In [8]:
text = nltk.Text(tokens)
type(text)

nltk.text.Text

### Wrapping the tokens in an NLTK Text object lets us call Text methods such as collocations() and concordance() on them.

In [9]:
text.collocations()

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Project Gutenberg; Ilya
Petrovitch; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens


### Recall that a collocation is a sequence of words that occur together unusually often.

## Subset text to what we need
### We cannot reliably detect where the content begins and ends automatically, so we inspect the file manually to find unique strings that mark the beginning and the end of the content. We then trim *raw* so that it begins with "PART I" and stops just before the phrase that marks the end of the content.

In [10]:
start = raw.find("PART I")
end = raw.find("End of Project Gutenberg's Crime")
raw = raw[start:end]
raw.find("PART I")

0
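Note that `str.find` returns -1 when a marker is missing, and slicing with -1 silently drops only the last character. A minimal sketch of a guarded trim, assuming we would rather fail loudly; `trim_between` and the `demo` string are hypothetical names used only for illustration.

```python
def trim_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Raises ValueError if either marker is missing, instead of silently
    slicing with the -1 that str.find() returns.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found: {!r}".format(start_marker))
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found: {!r}".format(end_marker))
    return text[start:end]

# Hypothetical miniature document
demo = "header PART I some content End of book footer"
body = trim_between(demo, "PART I", "End of book")
print(repr(body))
```

The same helper could be applied to `raw` with whatever marker strings manual inspection of the file turns up.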

## Dealing with HTML

In [11]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

### To get text out of HTML, we will use a Python library called *BeautifulSoup*.

In [12]:
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
print(tokens[:50])

['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in', '200', "years'", 'NEWS', 'SPORT', 'WEATHER', 'WORLD', 'SERVICE', 'A-Z', 'INDEX', 'SEARCH', 'You', 'are', 'in', ':', 'Health', 'News', 'Front', 'Page', 'Africa', 'Americas', 'Asia-Pacific', 'Europe', 'Middle', 'East', 'South', 'Asia', 'UK', 'Business', 'Entertainment', 'Science/Nature', 'Technology', 'Health', 'Medical', 'notes', '--', '--', '--', '--', '--', '--']


### This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content and select the tokens of interest, and initialize a text as before.
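One way to make the trial and error less manual is to look up the positions of tokens that bracket the content. This is only a sketch: `demo_tokens` and the marker tokens `'Blondes'` and `'Related'` are hypothetical, and a real page needs markers chosen by inspecting its own token list.

```python
# Hypothetical token list standing in for the tokenized page
demo_tokens = ['NEWS', 'SPORT', 'Blondes', 'to', 'die', 'out', '.',
               'Related', 'stories']

start = demo_tokens.index('Blondes')   # first token of the article body
end = demo_tokens.index('Related')     # first token after the body
content = demo_tokens[start:end]       # slice excludes the end marker
print(content)
```

`list.index` raises `ValueError` if a marker is absent, so a bad guess is noticed immediately.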

In [13]:
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
er's Polio campaign launched in Iraq Gene defect explains high blood pressure 
er's Polio campaign launched in Iraq Gene defect explains high blood pressure 


## Reading Local Files

In [14]:
# the with-statement closes the file automatically
with open('data/crimeandpunishment.txt') as f:
    raw = f.read()
tokens = word_tokenize(raw)
words = [w.lower() for w in tokens]
vocab = sorted(set(words))
print(vocab[:50])

['!', "'", "''", "'_i", "'_is", "'_please", "'_seems_", "'again", "'an", "'and", "'answered", "'at", "'ave", "'be", "'because", "'behold", "'blame", "'but", "'campany", "'can", "'catch", "'change", "'cinq", "'clasped", "'come", "'could", "'d", "'defend", "'destroyers", "'disgraceful", "'do", "'dounia", "'eliminate", "'everybody", "'evidence", "'extraordinary", "'filka", "'for", "'from", "'general", "'genteel", "'germans", "'give", "'go", "'good", "'government", "'happiness", "'he", "'here", "'honourable"]


In [15]:
print("The vocabulary of 'crimeandpunishment.txt' has {:d} words".format(len(vocab)))

The vocabulary of 'crimeandpunishment.txt' has 10092 words
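The count above includes punctuation marks and tokens with stray quote characters, as the first 50 entries show. A minimal sketch, assuming that only purely alphabetic tokens should count as words; `demo_tokens` is a hypothetical stand-in for the tokenized file.

```python
# Hypothetical tokens, including punctuation and a quote-prefixed token
demo_tokens = ['The', 'the', "'and", 'cat', '!', 'Cat', "'", 'sat']

# Keep alphabetic tokens only, then lowercase and deduplicate
word_vocab = sorted({w.lower() for w in demo_tokens if w.isalpha()})
print(word_vocab)
```

Note that `isalpha` also discards hyphenated words and numbers, which may or may not be what an analysis needs.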


## Capturing User Input

In [16]:
s = input("Enter some text: ")
print(s)
print("You typed {:d} words.".format(len(word_tokenize(s))))

For the word of god is alive and active. sharper than any double-edged sword it penetrates even to dividing soul and spirit, joints and marrow. It judges the thoughts and attitudes of the heart.
You typed 38 words.


## Regular Expressions for Detecting Word Patterns

In [17]:
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [117]:
print("There are {:d} lowercase words in nltk.corpus.words.words('en')".format(len(wordlist)))

There are 210687 lowercase words in nltk.corpus.words.words('en')


In [24]:
from random import sample

# 50 random English words
print(sample(wordlist, 50))

['lesser', 'intermediate', 'nonprofessorial', 'muckweed', 'pararhotacism', 'melodramatist', 'nontan', 'uncondensed', 'novemcostate', 'unridably', 'unschool', 'ionizer', 'criocephalus', 'enumerate', 'hexagyn', 'satyrlike', 'wartweed', 'scabland', 'nonfinite', 'isolysis', 'spirometer', 'reservation', 'duotriacontane', 'porch', 'unbraided', 'foliose', 'sitiology', 'prickspur', 'reinstallation', 'pseudochromia', 'bucketful', 'jaculator', 'iconomatography', 'sizable', 'nonfacial', 'palliopedal', 'nepionic', 'hogger', 'broking', 'sindon', 'meritless', 'assertiveness', 'permoralize', 'palatine', 'reharden', 'reformableness', 'hypocrize', 'atomology', 'humic', 'unprofuse']


### Words ending with *'ed'*

In [25]:
wordlistsub1 = [w for w in wordlist if re.search('ed$', w)]
print("There are {:d} lowercase words in nltk.corpus.words.words('en') that end in 'ed'".format(len(wordlistsub1)))

There are 9148 lowercase words in nltk.corpus.words.words('en') that end in 'ed'


### The . wildcard symbol matches any single character. Suppose we have room in a crossword puzzle for an 8-letter word with j as its third letter and t as its sixth letter. In place of each blank cell we use a period:

In [26]:
wordlistsub2 = [w for w in nltk.corpus.words.words('en') if re.search('^..j..t..$', w)]
print("There are {:d} words in nltk.corpus.words.words('en') that contain the above pattern".format(len(wordlistsub2)))
print(wordlistsub2)

There are 13 words in nltk.corpus.words.words('en') that contain the above pattern
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']


### Finally, the ? symbol specifies that the previous character is optional.

In [27]:
aList = ['e-mail','email', 'apple','dog','bear']
wordlistsub3 = [w for w in aList if re.search('^e-?mail$', w)]
print(wordlistsub3)

['e-mail', 'email']


### Ranges and Closures
Two or more words that are entered with the same sequence of keystrokes on a T9 phone keypad are known as textonyms. Square brackets [] define a character set: the pattern matches any single character listed inside them.

In [28]:
# e.g. [mno] matches a single character that is m, n, or o
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']
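The four matches above are textonyms because they share one keypad sequence. A minimal sketch of grouping words by that sequence, assuming the standard phone keypad layout and lowercase alphabetic input; `t9_code`, `textonyms`, and `demo_words` are hypothetical names for illustration.

```python
from collections import defaultdict

# Standard phone keypad layout: digit -> letters
KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

def t9_code(word):
    """Map a lowercase alphabetic word to its keypad digit sequence."""
    return ''.join(LETTER_TO_DIGIT[ch] for ch in word)

def textonyms(words):
    """Group words by digit sequence, keeping groups with 2+ words."""
    groups = defaultdict(list)
    for w in words:
        groups[t9_code(w)].append(w)
    return {code: ws for code, ws in groups.items() if len(ws) > 1}

demo_words = ['gold', 'golf', 'hold', 'hole', 'dog']
print(textonyms(demo_words))
```

Unlike the regular expression, this approach finds every textonym group in a list in one pass rather than one keypad sequence at a time.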

### '+' simply means one or more instances of the preceding item, which could be an individual character or a range.
### '*' means zero or more instances of the preceding item.
### Both '+' and '*' are referred to as **closures**.
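A small sketch of the difference between the two closures, using a hypothetical `demo_words` list rather than a corpus. With '+' every letter must appear at least once; with '*' any of them may be absent, so even the empty string matches.

```python
import re

demo_words = ['mine', 'miiinnne', 'me', 'in', '', 'mean']

# '+' requires at least one m, then i, then n, then e
plus = [w for w in demo_words if re.search('^m+i+n+e+$', w)]

# '*' allows each of m, i, n, e to occur zero or more times, in order
star = [w for w in demo_words if re.search('^m*i*n*e*$', w)]

print(plus)
print(star)
```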

In [31]:
len(nltk.corpus.treebank.words())

100676

In [30]:
wsj = sorted(set(nltk.corpus.treebank.words()))
print("A list of {:d} words".format(len(wsj)))

A list of 12408 words


### Finding decimals in a list of words

In [32]:
# find decimals in list of words
print([w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)][:50])

['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99', '1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', '1.20', '1.24', '1.25', '1.26', '1.28', '1.35', '1.39', '1.4', '1.457', '1.46', '1.49', '1.5', '1.50', '1.55', '1.56', '1.5755', '1.5805', '1.6', '1.61', '1.637', '1.64']


### Finding alphabetic words that end with a dollar sign

In [33]:
# find words made of letters followed by a literal '$'
[w for w in wsj if re.search(r'^[A-Za-z]+\$$', w)]

['C$', 'US$']

### Finding 4-digit integers in a list of words

In [34]:
print([w for w in wsj if re.search('^[0-9]{4}$', w)][:50])

['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', '1934', '1948', '1953', '1955', '1956', '1961', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1975', '1976', '1977', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2005']


### Finding words made of an integer, a hyphen, and 3 to 5 lowercase letters in a list of words.

In [35]:
# {3,5} means between 3 and 5 repeats of the preceding item, so this matches
# an integer, a hyphen, then 3 to 5 lowercase letters
print([w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)][:20])

['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', '14-hour', '15-day', '150-point', '190-point', '20-point', '20-stock', '21-month', '237-seat', '240-page', '27-year', '30-day', '30-point', '30-share', '30-year']


### Finding 3-word sequences separated by hyphen where the first word has 5+ characters, second word has 2 or 3 characters, and third word has no more than 6 characters

In [36]:
# Finding 3-word sequences separated by hyphen where the first word has 5+ characters, second word has 2 or 3 characters, and third word has no more than 6 characters
print([w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)])

['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting', 'savings-and-loan']


### Finding words that end in 'ed' or 'ing'

The braced expressions, like {3,5}, specify the number of repeats of the previous item. The pipe character indicates a choice between the material on its left or its right. Parentheses indicate the scope of an operator: they can be used together with the pipe (or disjunction) symbol like this: «w(i|e|ai|oo)t», matching wit, wet, wait, and woot.

In [37]:
# Finding words that end in 'ed' or 'ing'
print([w for w in wsj if re.search('(ed|ing)$', w)][:50])

['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', 'Alfred', 'Allied', 'Annualized', 'Anything', 'Arbitrage-related', 'Arbitraging', 'Asked', 'Assuming', 'Atlanta-based', 'Baking', 'Banking', 'Beginning', 'Beijing', 'Being', 'Bermuda-based', 'Betting', 'Boeing', 'Broadcasting', 'Bucking', 'Buying', 'Calif.-based', 'Change-ringing', 'Citing', 'Concerned', 'Confronted', 'Conn.based', 'Consolidated', 'Continued', 'Continuing', 'Declining', 'Defending', 'Depending', 'Designated', 'Determining', 'Developed', 'Died', 'During', 'Encouraged', 'Encouraging', 'English-speaking', 'Estimated', 'Everything', 'Excluding', 'Exxon-owned']
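To see why the parentheses matter, here is a small sketch on a hypothetical `demo_words` list. Without them, the pipe splits the whole pattern, so 'ed|ing$' means "contains ed anywhere" or "ends with ing".

```python
import re

demo_words = ['wit', 'wet', 'wait', 'woot', 'wat', 'edit', 'walked', 'walking']

# Parentheses limit the alternation to the middle of the word
grouped = [w for w in demo_words if re.search('^w(i|e|ai|oo)t$', w)]

# No parentheses: "contains 'ed'" OR "ends with 'ing'"
unscoped = [w for w in demo_words if re.search('ed|ing$', w)]

# With parentheses, the '$' anchor applies to both alternatives
scoped = [w for w in demo_words if re.search('(ed|ing)$', w)]

print(grouped)
print(unscoped)
print(scoped)
```

The word 'edit' appears only in the unscoped result, because it contains 'ed' but does not end with it.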
