### 3. Processing Raw Text

The goal of this chapter is to answer the following questions:

1. How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
2. How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
3. How can we write programs to produce formatted output and save it in a file?


### Accessing Text from the Web and from Disk

In [1]:
from urllib import request
import nltk, re, pprint
from nltk import word_tokenize

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


In [2]:
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

str

In [3]:
len(raw)

1176967

In [4]:
raw[:75]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

In [5]:
tokens = word_tokenize(raw)
type(tokens)
len(tokens)
tokens[:10]

list

257085

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']
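
The '\ufeff' at the start of raw, and of the first token, is a byte order mark (BOM) carried at the beginning of the file. As a small sketch, assuming the downloaded bytes begin with a UTF-8 BOM as the output above suggests, Python's 'utf-8-sig' codec drops it while decoding. The byte string below is a made-up stand-in for the result of response.read():

```python
# Hypothetical stand-in for the first bytes returned by response.read().
data = b'\xef\xbb\xbfThe Project Gutenberg EBook'

with_bom = data.decode('utf-8')         # keeps the BOM as the character '\ufeff'
without_bom = data.decode('utf-8-sig')  # strips a leading BOM if one is present

print(repr(with_bom[:4]))     # '\ufeffThe'
print(repr(without_bom[:3]))  # 'The'
```

Decoding this way would keep the BOM out of the first token; whether it matters depends on what you do with the tokens afterwards.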

In [6]:
text = nltk.Text(tokens)
type(text)

nltk.text.Text

In [7]:
text[1024:1062]

['I',
 'CHAPTER',
 'I',
 'On',
 'an',
 'exceptionally',
 'hot',
 'evening',
 'early',
 'in',
 'July',
 'a',
 'young',
 'man',
 'came',
 'out',
 'of',
 'the',
 'garret',
 'in',
 'which',
 'he',
 'lodged',
 'in',
 'S.',
 'Place',
 'and',
 'walked',
 'slowly',
 ',',
 'as',
 'though',
 'in',
 'hesitation',
 ',',
 'towards',
 'K.',
 'bridge']

In [8]:
text.collocations()

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens


Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:

In [9]:
raw.find("PART I")

5336

In [10]:
raw.rfind("End of Project Gutenberg's Crime")

-1

In [11]:
raw[5336:-1].find("PART I")

0

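The find() and rfind() ("reverse find") methods give us the index values to use for slicing the string. Both return -1 when the search string is not found. Here find() locates "PART I" at index 5336, but rfind() returns -1, which means this edition of the file does not contain the end marker we searched for. The slice raw[5336:-1] therefore begins with "PART I" but drops only the final character, not the footer. To remove the footer as well, inspect the end of the file for the phrase that actually marks the end of the content, pass that phrase to rfind(), and overwrite raw with the slice between the two indexes.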
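
The slicing step can be sketched on a small made-up string. The helper name trim and the sample text are assumptions for illustration only. The point is that find() and rfind() return -1 when a marker is absent, so the result should be checked before it is used as a slice index:

```python
def trim(raw, start_marker, end_marker):
    """Return raw from start_marker up to (but not including) end_marker."""
    start = raw.find(start_marker)   # index of the first occurrence, or -1
    end = raw.rfind(end_marker)      # index of the last occurrence, or -1
    if start == -1:
        start = 0                    # no start marker: keep from the beginning
    if end == -1:
        end = len(raw)               # no end marker: keep through the end
    return raw[start:end]

# Made-up miniature "book" with a header and a footer.
sample = "HEADER stuff\nPART I\nOnce upon a time.\nEnd of Project\nlicense"
body = trim(sample, "PART I", "End of Project")
print(repr(body))   # 'PART I\nOnce upon a time.\n'
```

Falling back to the full string when a marker is missing is one possible design choice; raising an error instead would make a changed header or footer easier to notice.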

#### Dealing with HTML

Much of the text on the web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access it as described in the section on reading local files below. However, if you're going to do this often, it's easiest to get Python to do the work directly. The first step is the same as before, using urlopen. For fun we'll pick a BBC News story called "Blondes 'to die out in 200 years'", an urban legend passed along by the BBC as established scientific fact:

In [12]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:600]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\r\n<html>\r\n<head>\r\n<title>BBC NEWS | Health | Blondes \'to die out in 200 years\'</title>\r\n<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">\r\n<meta name="OriginalPublicationDate" content="2002/09/27 11:51:55">\r\n<meta name="UKFS_URL" content="/1/hi/health/2284783.stm">\r\n<meta name="IFS_URL" content="/2/hi/health/2284783.stm">\r\n<meta name="HTTP-EQUIV" content="text/html;charset=iso-8859-1">\r\n<meta name="Headline" conte'

In [13]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
tokens

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '12:51'

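This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content, select the tokens of interest, and initialize a text as before: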

In [14]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
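
To make the tag-stripping step less of a black box, here is a standard-library sketch of the general idea: walk the HTML and keep only the text nodes. It uses html.parser.HTMLParser rather than BeautifulSoup, it is not how BeautifulSoup is implemented, and the class name TextExtractor and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

sample_html = ("<html><head><title>Demo</title>"
               "<style>p {color: red}</style></head>"
               "<body><p>Blonde hair is caused by a <b>recessive</b> gene.</p>"
               "</body></html>")

parser = TextExtractor()
parser.feed(sample_html)
parser.close()
text_only = ' '.join(parser.parts)
print(text_only)   # Demo Blonde hair is caused by a recessive gene.
```

Like get_text(), this keeps navigation text and anything else that is visible on the page, so selecting the tokens of interest is still a separate step.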


#### Processing Search Engine Results

The web can be thought of as a huge corpus of unannotated text. Web search engines provide an efficient means of searching this large quantity of text for relevant linguistic examples. The main advantage of search engines is size: since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in. Furthermore, you can make use of very specific patterns, which would only match one or two examples in a smaller corpus, but which might match tens of thousands of examples when run on the web. A second advantage of web search engines is that they are very easy to use. Thus, they provide a very convenient tool for quickly checking a theory, to see if it is reasonable.

Google Hits for Collocations: The number of hits for collocations involving the words absolutely or definitely, followed by one of adore, love, like, or prefer. (Liberman, in LanguageLog, 2005).


| Google hits | adore | love | like | prefer |
|---|---|---|---|---|
| absolutely | 289,000 | 905,000 | 16,200 | 644 |
| definitely | 1,460 | 51,000 | 158,000 | 62,600 |
| ratio (absolutely : definitely) | 198:1 | 18:1 | 1:10 | 1:97 |
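
The ratio row can be reproduced from the two rows of hit counts. This short sketch assumes that each ratio compares the absolutely count with the definitely count and is rounded to the nearest whole number, which matches the figures in the table:

```python
# Hit counts copied from the table: (absolutely, definitely) for each verb.
hits = {
    'adore':  (289_000, 1_460),
    'love':   (905_000, 51_000),
    'like':   (16_200, 158_000),
    'prefer': (644, 62_600),
}

ratios = {}
for verb, (absolutely, definitely) in hits.items():
    if absolutely >= definitely:
        ratios[verb] = f"{round(absolutely / definitely)}:1"
    else:
        ratios[verb] = f"1:{round(definitely / absolutely)}"

print(ratios)
# {'adore': '198:1', 'love': '18:1', 'like': '1:10', 'prefer': '1:97'}
```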

Unfortunately, search engines have some significant shortcomings. First, the allowable range of search patterns is severely restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards. Second, search engines give inconsistent results, and can give widely different figures when used at different times or in different geographical regions. When content has been duplicated across multiple sites, search results may be boosted. Finally, the markup in the result returned by a search engine may change unpredictably, breaking any pattern-based method of locating particular content (a problem which is ameliorated by the use of search engine APIs).
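
By contrast, a local corpus can be searched with any pattern we care to write. As a minimal sketch, assuming a made-up sample string in place of a real corpus, a regular expression can count the same adverb and verb combinations that the table above reports:

```python
import re
from collections import Counter

# Made-up sample text standing in for a local corpus.
sample = ("I absolutely adore this book. We definitely prefer tea. "
          "They absolutely love jazz, and I definitely like it too. "
          "She absolutely loves it.")

# An adverb, some whitespace, then one of four verbs as a whole word.
pattern = re.compile(
    r'\b(absolutely|definitely)\s+(adore|love|like|prefer)\b',
    re.IGNORECASE)

counts = Counter((adverb.lower(), verb.lower())
                 for adverb, verb in pattern.findall(sample))

for pair, n in sorted(counts.items()):
    print(pair, n)
```

Because of the trailing \b, "absolutely loves" is not counted. Whether inflected forms should match is exactly the kind of decision that a search engine makes for you and a local search leaves in your hands.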


#### Processing Feeds

The blogosphere is an important source of text, in both formal and informal registers. With the help of a Python library called the Universal Feed Parser, available from https://pypi.python.org/pypi/feedparser, we can access the content of a blog, as shown below:

In [15]:
!pip install feedparser



In [16]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']

{'language': 'en-US',
 'title': 'Language Log',
 'title_detail': {'type': 'text/plain',
  'language': 'en-US',
  'base': 'https://languagelog.ldc.upenn.edu/nll/wp-atom.php',
  'value': 'Language Log'},
 'subtitle': '',
 'subtitle_detail': {'type': 'text/plain',
  'language': 'en-US',
  'base': 'https://languagelog.ldc.upenn.edu/nll/wp-atom.php',
  'value': ''},
 'updated': '2021-07-30T12:07:07Z',
 'updated_parsed': time.struct_time(tm_year=2021, tm_mon=7, tm_mday=30, tm_hour=12, tm_min=7, tm_sec=7, tm_wday=4, tm_yday=211, tm_isdst=0),
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'https://languagelog.ldc.upenn.edu/nll'},
  {'rel': 'self',
   'type': 'application/atom+xml',
   'href': 'https://languagelog.ldc.upenn.edu/nll/?feed=atom'}],
 'link': 'https://languagelog.ldc.upenn.edu/nll',
 'id': 'https://languagelog.ldc.upenn.edu/nll/?feed=atom',
 'guidislink': False}

In [17]:
llog['feed']['title']

'Language Log'

In [18]:
llog['feed']['language']


'en-US'

In [19]:
llog['feed']['title_detail']

{'type': 'text/plain',
 'language': 'en-US',
 'base': 'https://languagelog.ldc.upenn.edu/nll/wp-atom.php',
 'value': 'Language Log'}

In [20]:
llog['feed']['subtitle_detail']

{'type': 'text/plain',
 'language': 'en-US',
 'base': 'https://languagelog.ldc.upenn.edu/nll/wp-atom.php',
 'value': ''}

In [21]:
llog.entries

[{'authors': [{'name': 'Victor Mair',
    'href': 'https://ealc.sas.upenn.edu/people/prof-victor-h-mair'}],
  'author_detail': {'name': 'Victor Mair',
   'href': 'https://ealc.sas.upenn.edu/people/prof-victor-h-mair'},
  'href': 'https://ealc.sas.upenn.edu/people/prof-victor-h-mair',
  'author': 'Victor Mair',
  'title': 'Excepted for publication',
  'title_detail': {'type': 'text/html',
   'language': 'en-US',
   'base': 'https://languagelog.ldc.upenn.edu/nll/wp-atom.php',
   'value': 'Excepted for publication'},
  'links': [{'rel': 'alternate',
    'type': 'text/html',
    'href': 'https://languagelog.ldc.upenn.edu/nll/?p=51665&utm_source=rss&utm_medium=rss&utm_campaign=excepted-for-publication'},
   {'rel': 'replies',
    'type': 'text/html',
    'href': 'https://languagelog.ldc.upenn.edu/nll/?p=51665&utm_source=rss&utm_medium=rss&utm_campaign=excepted-for-publication#comments',
    'count': '21',
    'thr:count': '21'},
   {'rel': 'replies',
    'type': 'application/atom+xml',
    

In [22]:
len(llog.entries)

13

In [23]:
post = llog.entries[2]
post.title

'Better PR for bats'

In [24]:
content = post.content[0].value

In [25]:
content[:70]

'<p>A link from Michael Glazer, with the note "Bats have been getting a'

In [26]:
raw = BeautifulSoup(content, 'html.parser').get_text()
word_tokenize(raw)

['A',
 'link',
 'from',
 'Michael',
 'Glazer',
 ',',
 'with',
 'the',
 'note',
 '``',
 'Bats',
 'have',
 'been',
 'getting',
 'a',
 'bad',
 'name',
 'recently',
 'epidemiologically',
 ',',
 'so',
 'it',
 '’',
 's',
 'nice',
 'to',
 'hear',
 'them',
 'mentioned',
 'in',
 'a',
 'positive',
 'way',
 "''",
 ':',
 '``',
 'Nathan',
 'Ruiz',
 ',',
 '``',
 'Young',
 'bats',
 'offer',
 'hope…',
 "''",
 ',',
 'WaPo',
 '7/27/2021',
 '.',
 'Well',
 ',',
 'OK',
 ',',
 'the',
 'full',
 'headline',
 'makes',
 'the',
 'real',
 'context',
 'clear',
 ':',
 '``',
 'Young',
 'bats',
 'offer',
 'hope',
 'as',
 'Orioles',
 'fall',
 'to',
 'Marlins',
 "''",
 '.',
 'But',
 'as',
 'Michael',
 'observes',
 ',',
 'Zoologically',
 ',',
 'we',
 '’',
 've',
 'got',
 'three',
 'of',
 'the',
 'Vertebrata',
 'subphylum',
 '’',
 's',
 'seven',
 'Classes',
 'here',
 '.',
 'Stuff',
 'in',
 'Sidewinders',
 'and',
 'Sharks',
 ',',
 'and',
 'there',
 '’',
 's',
 'another',
 'two',
 '.',
 'Jawless',
 'fishes',
 'and',
 'amph

With some further work, we can write programs to create a small corpus of blog posts, and use this as the basis for our NLP work.
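
One way such a program might look is sketched below. To stay self-contained it does not fetch a feed: the list posts is a hypothetical stand-in for (title, HTML content) pairs taken from feed entries, the helper strip_tags is a crude substitute for a real HTML parser, and the file naming scheme is an arbitrary choice:

```python
import html
import re
import tempfile
from pathlib import Path

# Hypothetical stand-ins for (title, HTML content) pairs from feed entries.
posts = [
    ("First post", "<p>Young bats offer <i>hope</i> &amp; more.</p>"),
    ("Second post", "<p>Another <b>short</b> post.</p>"),
]

def strip_tags(markup):
    """Crude tag removal; a real program would use an HTML parser."""
    return html.unescape(re.sub(r'<[^>]+>', '', markup)).strip()

# Write one plain-text file per post into a temporary directory.
corpus_dir = Path(tempfile.mkdtemp())
for i, (title, content) in enumerate(posts):
    path = corpus_dir / f"post{i:02d}.txt"
    path.write_text(title + "\n\n" + strip_tags(content), encoding='utf-8')

files = sorted(p.name for p in corpus_dir.iterdir())
first = (corpus_dir / "post00.txt").read_text(encoding='utf-8')
print(files)
print(first)
```

A directory of plain-text files like this can then be read back with the file-handling techniques described in the next section.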

#### Reading Local Files

To read a local file, we use Python's built-in open() function, followed by the read() method. If you have a file called document.txt, you can load its contents like this:

In [31]:
file = open('document.txt')
raw = file.read()

In [34]:
print(raw)

Your Turn: Create a file called document.txt using a text editor, and type in a few lines of text, and save it as plain text. If you are using IDLE, select the New Window command in the File menu, typing the required text into this window, and then saving the file as document.txt inside the directory that IDLE offers in the pop-up dialogue box. Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print(f.read()).


In [40]:
f = open('document.txt','r')
for line in f:
    print(line.strip())

Your Turn: Create a file called document.txt using a text editor, and type in a few lines of text, and save it as plain text. If you are using IDLE, select the New Window command in the File menu, typing the required text into this window, and then saving the file as document.txt inside the directory that IDLE offers in the pop-up dialogue box. Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print(f.read()).
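
The cells above leave the file objects open. A common alternative, sketched here, is a with statement, which closes the file automatically when the block ends. The temporary directory and the two sample lines are made up so that the example runs on its own, and the explicit encoding='utf-8' is an assumption about how the file was saved:

```python
import os
import tempfile

# Create a throwaway document.txt so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'document.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write("Time flies like an arrow.\nFruit flies like a banana.\n")

# Read the whole file into a single string.
with open(path, encoding='utf-8') as f:
    raw = f.read()

# Read it line by line, removing the trailing newline from each line.
with open(path, encoding='utf-8') as f:
    lines = [line.strip() for line in f]

print(repr(raw))
print(lines)
```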


In [41]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'r').read()

In [42]:
print(raw)

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teach them by what
name a whale-fish is to be called in our tongue leaving out, through
ignorance, the letter H, which almost alone maketh the signification
of the word, you deliver that which is not true." --HACKLUYT

"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness
or rolling; for in Dan. HVALT is arched or vaulted." --WEBSTER'S
DICTIONARY

"WHALE. ... It is more immediately from the Dut. and Ger. WALLEN;
A.S. WALW-IAN, to roll, to wallow." --RICHARDSON'S DICTIONARY

KETOS,               GRE

Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a line of input, call the Python function input(). After saving the input to a variable, we can manipulate it just as we have done for other strings.

In [43]:
s = input("Enter some text: ")

Enter some text: Hello How are you there


In [44]:
s

'Hello How are you there'

In [45]:
print("You typed", len(word_tokenize(s)), "words")

You typed 5 words
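
Note that a tokenizer which separates punctuation, as the token lists earlier in this chapter show, reports more tokens than there are whitespace-separated words once the input contains punctuation. The sketch below illustrates the difference with a rough regular expression as a stand-in for word_tokenize. The function names are made up, and the regular expression will not agree with NLTK's tokenizer on every input:

```python
import re

def count_tokens(s):
    """Count word and punctuation tokens with a rough regular expression."""
    return len(re.findall(r"\w+|[^\w\s]", s))

def count_words(s):
    """Count whitespace-separated words."""
    return len(s.split())

# This string could instead come from input("Enter some text: ").
s = "Hello, how are you there?"
print("tokens:", count_tokens(s))   # tokens: 7
print("words:", count_words(s))     # words: 5
```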
