## 3. Processing Raw Text
##### By Ruben Seoane, adaptation of Chapter 3 from http://www.nltk.org/book_1ed/
_"Please note that the original book uses the urllib library, whereas this notebook runs on Python 3.6.4 and uses **urllib3** instead. Methods such as urlopen() and read().decode() have therefore been replaced with their urllib3 equivalents; the documentation can be found at http://urllib3.readthedocs.io/en/latest/user-guide.html. Project Gutenberg has also slightly changed its link structure, so the URL is updated here."_

### Objectives:
1. Write programs to access local and web files.
2. Split documents into individual words and punctuation symbols for further analysis.
3. Write programs to produce formatted output and export it to a file.

In [19]:
## All exercises require the following code to start:
import nltk, re, pprint
from nltk import word_tokenize

### 3.1 Accessing Text from Web and a Local Disk
#### Electronic Books
We'll work with texts from the free _Project Gutenberg_ library.
In this case we choose document #2554, an English translation of _Crime and Punishment_.

In [24]:
import urllib3

# A PoolManager handles connection pooling for every request we make
http = urllib3.PoolManager()
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = http.request('GET', url)
# response.data holds raw bytes; decode them into a str
raw = response.data.decode('utf8')
type(raw)

str

In [25]:
len(raw)

1176965

In [26]:
raw[:75]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'
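
The leading `'\ufeff'` in the output above is a byte order mark (BOM) that survives a plain `utf8` decode, which is why the first token later shows up as `'\ufeffThe'`. As a minimal sketch, assuming a short made-up byte string in place of the real `response.data`, Python's `utf-8-sig` codec drops the mark while decoding:

```python
# Illustrative bytes standing in for the start of the downloaded file
# (an assumption; in the notebook the bytes come from response.data).
sample_bytes = "\ufeffThe Project Gutenberg EBook".encode("utf8")

with_bom = sample_bytes.decode("utf8")          # BOM is kept as '\ufeff'
without_bom = sample_bytes.decode("utf-8-sig")  # BOM is stripped if present

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

The rest of this notebook keeps the plain `utf8` decode, so the outputs shown here still contain the mark.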

In [27]:
# Tokenizing: Breaking up the string into a list of words and punctuation signs:
tokens = word_tokenize(raw)
type(tokens)

list

In [28]:
len(tokens)

257726

In [29]:
tokens[:10]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

In [30]:
# Let's create an NLTK Text object to store our tokenized text so we can perform further processing
text = nltk.Text(tokens)
type(text)

nltk.text.Text

In [31]:
text[1024:1062]

['an',
 'exceptionally',
 'hot',
 'evening',
 'early',
 'in',
 'July',
 'a',
 'young',
 'man',
 'came',
 'out',
 'of',
 'the',
 'garret',
 'in',
 'which',
 'he',
 'lodged',
 'in',
 'S.',
 'Place',
 'and',
 'walked',
 'slowly',
 ',',
 'as',
 'though',
 'in',
 'hesitation',
 ',',
 'towards',
 'K.',
 'bridge',
 '.',
 'He',
 'had',
 'successfully']

In [32]:
# Collocations are expressions of multiple words which commonly co-occur.
text.collocations()

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens
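
To build some intuition for what a collocation is, the sketch below simply counts adjacent word pairs in a made-up token list. This is a simplification and an assumption on my part: it ranks by raw frequency only, whereas NLTK applies its own filtering and scoring, so the results will not match `text.collocations()`. The helper name `top_bigrams` is hypothetical.

```python
from collections import Counter

def top_bigrams(tokens, n=3):
    """Return the n most frequent adjacent pairs of alphabetic tokens."""
    pairs = [
        (a.lower(), b.lower())
        for a, b in zip(tokens, tokens[1:])
        if a.isalpha() and b.isalpha()  # skip pairs that involve punctuation
    ]
    return Counter(pairs).most_common(n)

# Made-up sample sentence, not taken from the novel
sample_tokens = "the old woman saw the young man and the old woman left".split()
print(top_bigrams(sample_tokens, n=2))
```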


_Project Gutenberg_ appears as a collocation. Each text obtained from the project contains a header with the name of the text, author, names of people who scanned the document, worked on the license, etc. In order to correctly detect where the content begins and ends, we inspect it manually:

In [33]:
raw.find('PART I')

5336

In [35]:
raw.rfind("End of Project Gutenberg’s Crime and Punishment, by Fyodor Dostoevsky")

1157810

In [36]:
# Let's delimit the text to the found positions:
## rfind() is for "reverse find"
raw = raw[5336:1157810]
raw.find("PART I")

0
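
The offsets above were copied by hand from the `find()` and `rfind()` results. A minimal sketch of the same trimming step without hard-coded numbers is shown below; the helper name `trim_between` and the sample string are made up for illustration. Note that `find()` and `rfind()` return -1 when the marker is missing, so the sketch checks for that rather than slicing silently.

```python
def trim_between(raw, start_marker, end_marker):
    """Return raw from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)   # first occurrence, or -1
    end = raw.rfind(end_marker)      # last occurrence, or -1
    if start == -1 or end == -1 or end < start:
        raise ValueError("start or end marker not found in the expected order")
    return raw[start:end]

# Made-up miniature document with a header and a footer
sample = "License header\nPART I\nStory text here.\nEnd of Project Gutenberg\nFooter"
body = trim_between(sample, "PART I", "End of Project Gutenberg")
print(body.find("PART I"))
```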

#### Dealing with HTML
We'll pick a BBC News story called _Blondes to die out in 200 years_, an urban legend passed along by the BBC as established scientific fact.

In [38]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
r = http.request('GET',url)
html = r.data.decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

To see the full HTML content we can use _**print(html)**_. To extract the article's text from it, we will use the _**BeautifulSoup**_ library, whose latest version can be installed with **pip install bs4**.
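
Before handing the page to a library, the basic idea of extracting text from HTML (keep the text nodes, discard the tags) can be sketched with the standard library alone. This is not how BeautifulSoup is implemented; the class name `TextExtractor` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

# Made-up markup standing in for a downloaded page
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Blondes</h1><p>A <b>recessive</b> gene.</p></body></html>"
)
parser = TextExtractor()
parser.feed(sample_html)
parser.close()
extracted = " ".join(parser.chunks)
print(extracted)
```

For real pages, a dedicated library copes far better with malformed markup than this sketch does.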

In [51]:
from bs4 import BeautifulSoup
# Name the parser explicitly so BeautifulSoup does not warn about guessing one.
# "lxml" requires the lxml package; "html.parser" is the built-in alternative.
raw = BeautifulSoup(html, "lxml").get_text()
tokens = word_tokenize(raw)
tokens

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '12:51'

The output still contains unwanted material from the site's navigation menus and related stories. Through trial and error you can find the start and end indexes of the content of interest.

In [52]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
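
A concordance lists every occurrence of a word together with the words around it. The following is a minimal sketch of that idea on a made-up token list. It measures context in tokens rather than characters and is not NLTK's implementation; the helper name `kwic` (keyword in context) is hypothetical.

```python
def kwic(tokens, word, width=3):
    """Return (left, match, right) context triples for each occurrence of word."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Made-up token list echoing one of the matches shown above
sample_tokens = "blonde hair is caused by a recessive gene . In order for a child".split()
for left, match, right in kwic(sample_tokens, "gene"):
    print(f"{left:>20} {match} {right}")
```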


#### Processing RSS Feeds
We will use a Python library called _Universal Feed Parser_.

In [63]:
import feedparser
blog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom") 
print(blog['feed']['title'])

Language Log


In [55]:
len(blog.entries)

13

In [56]:
post = blog.entries[2]
post.title

'Graphic antipairs'

In [57]:
content = post.content[0].value
content[:70]

'<p>Currently on the internet in China, there is a flurry of discussion'
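
To see roughly where the feed title, the entries, and each entry's HTML content come from, here is a stdlib-only sketch that reads a tiny Atom document with `xml.etree.ElementTree`. The sample feed is invented, and this is not how `feedparser` works internally; it only illustrates the structure being navigated.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Invented two-entry feed used only for illustration
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <content type="html">&lt;p&gt;Hello feed&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Second post</title>
    <content type="html">&lt;p&gt;More text&lt;/p&gt;</content>
  </entry>
</feed>"""

root = ET.fromstring(sample_feed)
feed_title = root.findtext("atom:title", namespaces=ATOM)
entries = root.findall("atom:entry", ATOM)
titles = [entry.findtext("atom:title", namespaces=ATOM) for entry in entries]
# The content is escaped HTML, which still needs tag stripping afterwards
first_content = entries[0].findtext("atom:content", namespaces=ATOM)

print(feed_title, len(entries), titles)
print(first_content)
```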

In [64]:
# Name the parser explicitly so BeautifulSoup does not warn about guessing one
raw = BeautifulSoup(content, "lxml").get_text()
word_tokenize(raw)

['Currently',
 'on',
 'the',
 'internet',
 'in',
 'China',
 ',',
 'there',
 'is',
 'a',
 'flurry',
 'of',
 'discussion',
 'on',
 'characters',
 'that',
 'are',
 'mirror',
 ',',
 'flipped',
 ',',
 'reversed',
 ',',
 'or',
 'inverted',
 'images',
 'of',
 'each',
 'other',
 '.',
 'Here',
 'are',
 'some',
 'of',
 'the',
 'examples',
 'that',
 'have',
 'been',
 'cited',
 '(',
 'except',
 'for',
 'the',
 'last',
 'two',
 'sets',
 ',',
 'which',
 'were',
 'added',
 'by',
 'me',
 'to',
 'illustrate',
 'other',
 'types',
 'of',
 'minimal',
 'differences',
 ')',
 ':',
 'chǎng',
 '厂',
 '(',
 '``',
 'factory',
 "''",
 ')',
 '||',
 'yí',
 ',',
 'jí',
 '乁',
 ',',
 'ancient',
 'form',
 'of',
 'yí',
 '移',
 '(',
 '``',
 'move',
 ';',
 'shift',
 "''",
 ')',
 'or',
 'jí',
 '及',
 '(',
 '``',
 'and',
 ';',
 'reach',
 'to',
 "''",
 ')',
 '移',
 'piàn',
 '片',
 '(',
 '``',
 'sheet',
 ';',
 'piece',
 ';',
 'slice',
 "''",
 ')',
 '||',
 'pán',
 '爿',
 '(',
 '``',
 'half',
 'of',
 'a',
 'tree',
 'trunk',
 "''",
 '