# Processing Raw Text in Python 

In [1]:
import nltk, re, pprint
from nltk import word_tokenize

### Accessing Different Types of Text

#### Electronic Books

In [4]:
from urllib import request
url = 'https://www.gutenberg.org/files/2554/2554-0.txt'
response = request.urlopen(url)
raw = response.read().decode('utf8')

In [23]:
raw[:1000]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Crime and Punishment\r\n\r\nAuthor: Fyodor Dostoevsky\r\n\r\nRelease Date: March 28, 2006 [EBook #2554]\r\nLast Updated: October 27, 2016\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***\r\n\r\n\r\n\r\n\r\nProduced by John Bickers; and Dagny\r\n\r\n\r\n\r\n\r\n\r\nCRIME AND PUNISHMENT\r\n\r\nBy Fyodor Dostoevsky\r\n\r\n\r\n\r\nTranslated By Constance Garnett\r\n\r\n\r\n\r\n\r\nTRANSLATOR’S PREFACE\r\n\r\nA few words about Dostoevsky himself may help the English reader to\r\nunderstand his work.\r\n\r\nDostoevsky was the son of a doctor. His pa
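The string above starts with `\ufeff` (a byte order mark) and uses `\r\n` line endings, and both carry over into later processing. The following is a minimal sketch of one way to tidy them up. `clean_raw` is a hypothetical helper name, and `sample` is a short stand-in for the downloaded text so that the example runs without network access.

```python
def clean_raw(text):
    """Drop a leading byte order mark and normalize Windows line endings."""
    if text.startswith('\ufeff'):
        text = text[1:]
    return text.replace('\r\n', '\n')

# Stand-in for the first characters of the downloaded book (assumption: no network here)
sample = '\ufeffThe Project Gutenberg EBook of Crime and Punishment\r\n\r\nTitle: Crime and Punishment\r\n'

cleaned = clean_raw(sample)
print(repr(cleaned[:20]))

# Decoding the raw bytes with 'utf-8-sig' instead of 'utf8' also strips the mark
decoded = sample.encode('utf8').decode('utf-8-sig')
print(repr(decoded[:20]))
```

Cleaning before tokenizing would keep the first token from coming out as `'\ufeffThe'`.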

In [14]:
# Tokenize the Text into words
tokens = word_tokenize(raw)

In [15]:
len(tokens)

257726

In [16]:
tokens[:150]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by',
 'Fyodor',
 'Dostoevsky',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.',
 'You',
 'may',
 'copy',
 'it',
 ',',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'www.gutenberg.org',
 'Title',
 ':',
 'Crime',
 'and',
 'Punishment',
 'Author',
 ':',
 'Fyodor',
 'Dostoevsky',
 'Release',
 'Date',
 ':',
 'March',
 '28',
 ',',
 '2006',
 '[',
 'EBook',
 '#',
 '2554',
 ']',
 'Last',
 'Updated',
 ':',
 'October',
 '27',
 ',',
 '2016',
 'Language',
 ':',
 'English',
 'Character',
 'set',
 'encoding',
 ':',
 'UTF-8',
 '***',
 'START',
 'OF',
 'THIS',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'CRIME',
 'AND',
 'PUNISHMENT',
 '*

In [17]:
# Wrap the tokens in an nltk.Text to get its higher-level methods (e.g. concordance, similar)
text = nltk.Text(tokens)
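As a rough illustration of what a `Text` object offers beyond a plain list, the sketch below builds one from a handful of tokens taken from the novel's opening. The token list is split on whitespace by hand, so this example does not need the tokenizer models or a download; it is an assumption-laden toy, not the notebook's full data.

```python
import nltk

# A few tokens from the novel's opening, split by hand for illustration
tokens = ("His garret was under the roof of a high , five-storied house . "
          "The landlady who provided him with garret , dinners , and attendance , "
          "lived on the floor below .").split()
text = nltk.Text(tokens)

# How often a token occurs
garret_count = text.count('garret')
print(garret_count)

# Print every occurrence of a word together with its surrounding context
text.concordance('garret')

# Frequency distribution over the tokens
freq = nltk.FreqDist(tokens)
print(freq.most_common(3))
```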

In [19]:
text[1024:]

['an',
 'exceptionally',
 'hot',
 'evening',
 'early',
 'in',
 'July',
 'a',
 'young',
 'man',
 'came',
 'out',
 'of',
 'the',
 'garret',
 'in',
 'which',
 'he',
 'lodged',
 'in',
 'S.',
 'Place',
 'and',
 'walked',
 'slowly',
 ',',
 'as',
 'though',
 'in',
 'hesitation',
 ',',
 'towards',
 'K.',
 'bridge',
 '.',
 'He',
 'had',
 'successfully',
 'avoided',
 'meeting',
 'his',
 'landlady',
 'on',
 'the',
 'staircase',
 '.',
 'His',
 'garret',
 'was',
 'under',
 'the',
 'roof',
 'of',
 'a',
 'high',
 ',',
 'five-storied',
 'house',
 'and',
 'was',
 'more',
 'like',
 'a',
 'cupboard',
 'than',
 'a',
 'room',
 '.',
 'The',
 'landlady',
 'who',
 'provided',
 'him',
 'with',
 'garret',
 ',',
 'dinners',
 ',',
 'and',
 'attendance',
 ',',
 'lived',
 'on',
 'the',
 'floor',
 'below',
 ',',
 'and',
 'every',
 'time',
 'he',
 'went',
 'out',
 'he',
 'was',
 'obliged',
 'to',
 'pass',
 'her',
 'kitchen',
 ',',
 'the',
 'door',
 'of',
 'which',
 'invariably',
 'stood',
 'open',
 '.',
 'And',
 'eac

### Subsetting the Document

In [29]:
# Locate the start of the novel body ('PART I') and the start of the Project Gutenberg footer
start_1 = raw.find('PART I')
end = raw.find('End of Project Gutenberg')
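One caveat with this approach: `str.find` returns `-1` when the marker is missing, and `-1` is a valid slice index, so a missing marker silently yields the wrong span instead of an error. Here is a minimal sketch of a stricter variant. `slice_between` is a hypothetical helper, and unlike the cell above it searches for the end marker only after the start marker.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f'start marker not found: {start_marker!r}')
    # Search for the end marker only after the start position
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f'end marker not found: {end_marker!r}')
    return text[start:end]

# Short stand-in for the downloaded book
sample = ('Header\r\nPART I\r\nOn an exceptionally hot evening\r\n'
          'End of Project Gutenberg footer')
body = slice_between(sample, 'PART I', 'End of Project Gutenberg')
print(repr(body))
```

Note that `find` returns the first occurrence, so `'PART I'` also matches the beginning of a string such as `'PART II'` if that happens to come first.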

In [30]:
# Tokenize the selected span (PART I up to the Gutenberg footer) and wrap it in a Text
chap1 = nltk.Text(word_tokenize(raw[start_1:end]))

In [37]:
# Beginning (not displayed: a cell only shows the value of its last expression)
chap1[:100]
# Ending; the -1 stop index leaves out the final token
chap1[-20:-1]

['unknown',
 'life',
 '.',
 'That',
 'might',
 'be',
 'the',
 'subject',
 'of',
 'a',
 'new',
 'story',
 ',',
 'but',
 'our',
 'present',
 'story',
 'is',
 'ended']

### Processing "Search Engine Results"

The web can be thought of as a huge corpus of unannotated text, and web search engines provide an efficient way to search it for relevant linguistic examples. Their main advantage is size: because you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in. You can also use very specific patterns that would match only one or two examples in a smaller corpus but might match tens of thousands of examples on the web. A second advantage is ease of use, which makes search engines a convenient tool for quickly checking whether a theory is reasonable.
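To make the idea of a "very specific pattern" concrete, the sketch below runs one over a tiny in-memory corpus with a regular expression. The sentences are made up for illustration, and this is a local stand-in for the idea rather than a call to any search engine.

```python
import re

# Tiny made-up corpus; a web-scale collection would contain far more matches
documents = [
    'He walked slowly, as though in hesitation, towards the bridge.',
    'The room was as small as a cupboard.',
    'She was as quiet as a mouse and as pale as a sheet.',
]

# A very specific pattern: "as <word> as a <word>"
pattern = re.compile(r'\bas (\w+) as a (\w+)')

matches = [m.group(0) for doc in documents for m in pattern.finditer(doc)]
print(matches)
```

On a corpus this small the pattern finds only a few hits, which is the limitation the paragraph above describes.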

### Processing "Blogosphere Data"

In [38]:
import feedparser

ModuleNotFoundError: No module named 'feedparser'
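The error above means the third-party `feedparser` package is not installed in this environment; it is distributed on PyPI and can be installed with `pip install feedparser`. The following is a hedged sketch of checking for the package before using it. `module_available` is a hypothetical helper, and `SAMPLE_FEED` is a made-up Atom document so the example needs no network access.

```python
import importlib.util

def module_available(name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None

# A made-up Atom feed used purely for illustration
SAMPLE_FEED = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <summary>Hello, blogosphere.</summary>
  </entry>
</feed>"""

if module_available('feedparser'):
    import feedparser
    # feedparser.parse accepts a URL, a file, or a string of feed XML
    feed = feedparser.parse(SAMPLE_FEED)
    print(feed['feed'].get('title'))
    print([entry.get('title') for entry in feed.entries])
else:
    print('feedparser is not installed; try: pip install feedparser')
```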