# WWC - Accessing Text Corpora and Lexical Resources


In [1]:
import nltk

### GUTENBERG CORPUS

NLTK includes a selection of texts from the Project Gutenberg Electronic Text Archive, which contains some 25,000 free electronic books hosted at *http://www.gutenberg.org*.

In [2]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Give the first text a short name, and find out how many words it contains.

In [3]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

In [4]:
len(emma)

192427

In [5]:
emma[:10]

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

Perform some concordancing on the text of Emma to find the contexts in which "surprize" occurs.

In [6]:
emmaText = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

In [7]:
emmaText.concordance("surprize")

Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the my
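
To make the idea of a concordance concrete, here is a minimal plain-Python sketch of a keyword-in-context display. It is not how **nltk.Text.concordance()** is implemented; the helper name **kwic** and the sample tokens are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, hit, right) context triples for each match of keyword."""
    keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Hypothetical token list, shaped like the output of gutenberg.words().
sample = ["You", "surprize", "me", "!", "Emma", "must", "do", "Harriet", "good",
          ".", "It", "was", "an", "agreeable", "surprize", "to", "her", "."]

# Right-align the left context so the keyword lines up in one column.
for left, hit, right in kwic(sample, "surprize"):
    print(f"{left:>20} [{hit}] {right}")
```

The **width** parameter counts tokens on each side, whereas the NLTK display above is cut to a fixed number of characters.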

When we defined **emma**, we called the **words()** function of the gutenberg object in NLTK's corpus package. Because such long names are cumbersome to type, Python provides another version of the **import** statement:

In [18]:
from nltk.corpus import gutenberg

In [19]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Let's write a program that displays general information about each text by looping over the **fileid** values listed earlier and computing three statistics for each one:
1. Average word length
2. Average sentence length
3. Average number of times each vocabulary item appears in the text (the lexical diversity score)

In [20]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))  # raw() returns the file contents as one string, not split into tokens; len() counts its characters (spaces included)
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))  # sents() divides the text into sentences
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print(int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)

4 24 26 austen-emma.txt
4 26 16 austen-persuasion.txt
4 28 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 20 12 carroll-alice.txt
4 20 11 chesterton-ball.txt
4 22 11 chesterton-brown.txt
4 18 10 chesterton-thursday.txt
4 20 24 edgeworth-parents.txt
4 25 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 36 12 whitman-leaves.txt
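
The three statistics can be reproduced on a tiny text without NLTK, which helps show what each column means. This is a sketch under simplifying assumptions: the sample string is made up, tokenization is a crude regular expression, and word length here counts letters only, whereas the loop above divides the raw character count (spaces and punctuation included) by the number of tokens.

```python
import re

# Made-up sample text; any string works here.
raw = "The cat sat. The cat ran away. A dog barked at the cat!"

# Crude tokenization: alphabetic words only, sentences split on ., ! or ?
words = re.findall(r"[A-Za-z]+", raw)
sents = [s for s in re.split(r"[.!?]", raw) if s.strip()]
vocab = {w.lower() for w in words}

avg_word_len = sum(len(w) for w in words) / len(words)   # letters per word
avg_sent_len = len(words) / len(sents)                    # words per sentence
lexical_diversity = len(words) / len(vocab)               # uses per vocabulary item

print(round(avg_word_len, 2), round(avg_sent_len, 2), round(lexical_diversity, 2))
```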


**sents()** divides the text up into sentences, where each sentence is a list of words. Check the next example:

In [21]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')

In [22]:
macbeth_sentences[1] 

['Actus', 'Primus', '.']

In [23]:
macbeth_sentences[1037]

['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']

Let's find the longest sentence in Macbeth.

In [24]:
longest_len = max([len(s) for s in macbeth_sentences])

In [25]:
longest_len

158

In [26]:
[s for s in macbeth_sentences if len(s) == longest_len]

[['Doubtfull',
  'it',
  'stood',
  ',',
  'As',
  'two',
  'spent',
  'Swimmers',
  ',',
  'that',
  'doe',
  'cling',
  'together',
  ',',
  'And',
  'choake',
  'their',
  'Art',
  ':',
  'The',
  'mercilesse',
  'Macdonwald',
  '(',
  'Worthie',
  'to',
  'be',
  'a',
  'Rebell',
  ',',
  'for',
  'to',
  'that',
  'The',
  'multiplying',
  'Villanies',
  'of',
  'Nature',
  'Doe',
  'swarme',
  'vpon',
  'him',
  ')',
  'from',
  'the',
  'Westerne',
  'Isles',
  'Of',
  'Kernes',
  'and',
  'Gallowgrosses',
  'is',
  'supply',
  "'",
  'd',
  ',',
  'And',
  'Fortune',
  'on',
  'his',
  'damned',
  'Quarry',
  'smiling',
  ',',
  'Shew',
  "'",
  'd',
  'like',
  'a',
  'Rebells',
  'Whore',
  ':',
  'but',
  'all',
  "'",
  's',
  'too',
  'weake',
  ':',
  'For',
  'braue',
  'Macbeth',
  '(',
  'well',
  'hee',
  'deserues',
  'that',
  'Name',
  ')',
  'Disdayning',
  'Fortune',
  ',',
  'with',
  'his',
  'brandisht',
  'Steele',
  ',',
  'Which',
  'smoak',
  "'",
  'd',
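
An alternative to the two-step approach is to pass **key=len** to **max()**. The sketch below uses a few hypothetical token lists in place of the Macbeth sentences. Note one difference: **max()** returns only the first sentence of maximal length, while the list comprehension above returns every sentence that ties for longest.

```python
# Hypothetical sentences, each a list of tokens like those returned by sents().
sentences = [
    ["Actus", "Primus", "."],
    ["Good", "night", ",", "and", "better", "health"],
    ["Enter", "Macbeth", "."],
]

longest = max(sentences, key=len)  # first sentence with the greatest length
print(len(longest), longest)
```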
 

### WEB and CHAT Text

NLTK contains a small collection of web text that includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews.

In [27]:
from nltk.corpus import webtext

In [28]:
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65])

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se
grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb


There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators.
The corpus contains over 10,000 anonymized posts.
It is organized into 15 files, each collected on a specific date from an age-specific chatroom (teens, 20s, 30s, ...).
The filename contains the date, chatroom, and number of posts; for example, '10-19-20s_706posts.xml' holds 706 posts collected from the 20s chatroom on 10/19.

In [29]:
from nltk.corpus import nps_chat

In [30]:
chatroom = nps_chat.posts('10-19-20s_706posts.xml')

In [31]:
chatroom[123]

['i',
 'do',
 "n't",
 'want',
 'hot',
 'pics',
 'of',
 'a',
 'female',
 ',',
 'I',
 'can',
 'look',
 'in',
 'a',
 'mirror',
 '.']
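
Since the filename encodes the date, chatroom, and number of posts, those fields can be pulled apart with a regular expression. The pattern below is an assumption inferred from the single filename used above, and **parse_chat_fileid** is a hypothetical helper, not part of NLTK.

```python
import re

def parse_chat_fileid(fileid):
    """Split a chat fileid into (date, chatroom, number of posts)."""
    # Assumed layout: MM-DD-<room>_<count>posts.xml
    m = re.fullmatch(r"(\d{2}-\d{2})-([^_]+)_(\d+)posts\.xml", fileid)
    if m is None:
        raise ValueError(f"unexpected fileid: {fileid!r}")
    date, room, posts = m.groups()
    return date, room, int(posts)

print(parse_chat_fileid("10-19-20s_706posts.xml"))  # ('10-19', '20s', 706)
```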

### BROWN CORPUS

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.
This corpus contains text from 500 sources, and the sources have been categorized by genre (news, editorial, romance, and so on).
For a complete list, see *http://icame.uib.no/brown/bcm-los.html*

You can access the corpus as a list of words or a list of sentences, or by category.

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as **stylistics**.

In [32]:
from nltk.corpus import brown

In [33]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [34]:
brown.words(categories='news')[:15]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced']

In [35]:
brown.words(fileids=['cg22'])[:15] # 'cg22' is a file in the belles_lettres category

['Does',
 'our',
 'society',
 'have',
 'a',
 'runaway',
 ',',
 'uncontrollable',
 'growth',
 'of',
 'technology',
 'which',
 'may',
 'end',
 'our']

In [36]:
brown.sents(categories=['news','editorial','reviews'])[1]

['The',
 'jury',
 'further',
 'said',
 'in',
 'term-end',
 'presentments',
 'that',
 'the',
 'City',
 'Executive',
 'Committee',
 ',',
 'which',
 'had',
 'over-all',
 'charge',
 'of',
 'the',
 'election',
 ',',
 '``',
 'deserves',
 'the',
 'praise',
 'and',
 'thanks',
 'of',
 'the',
 'City',
 'of',
 'Atlanta',
 "''",
 'for',
 'the',
 'manner',
 'in',
 'which',
 'the',
 'election',
 'was',
 'conducted',
 '.']


Let's compare genres in their usage of **modal verbs**. 

In [37]:
from nltk.corpus import brown

In [38]:
news_text = brown.words(categories='news')

In [39]:
fdist = nltk.FreqDist([w.lower() for w in news_text])

In [40]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [41]:
for m in modals:
    print(m + ':', fdist[m])

can: 94
could: 87
may: 93
might: 38
must: 53
will: 389


Let's run the same comparison for six different genres using **conditional frequency distributions** in NLTK.

In [42]:
genres = ['news','religion','hobbies','science_fiction','romance','humor']

In [43]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [44]:
cfd = nltk.ConditionalFreqDist((genre,word)
                              for genre in brown.categories()
                              for word in brown.words(categories=genre))

In [45]:
cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 


Observe that the most frequent modal in "news" is **will**, while the most frequent modal in "romance" is **could**.
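
The "news" row of the table differs slightly from the single-genre counts earlier (for example, **can** is 93 here and 94 there). The earlier cell lowercased every token before counting, while the conditional frequency distribution counts tokens exactly as they appear. The sketch below illustrates both the idea of a conditional frequency distribution and the effect of case folding, using plain Python. The (genre, word) pairs are made up, and **conditional_counts** is a hypothetical stand-in, not the NLTK class.

```python
from collections import Counter, defaultdict

# Made-up (genre, word) pairs standing in for the Brown Corpus.
pairs = [
    ("news", "will"), ("news", "Will"), ("news", "can"), ("news", "May"),
    ("romance", "could"), ("romance", "could"), ("romance", "Can"),
]

def conditional_counts(pairs, fold_case=False):
    """Map each condition (genre) to a Counter of its samples (words)."""
    table = defaultdict(Counter)
    for genre, word in pairs:
        table[genre][word.lower() if fold_case else word] += 1
    return table

exact = conditional_counts(pairs)                    # case-sensitive counts
folded = conditional_counts(pairs, fold_case=True)   # lowercased before counting
print(exact["news"]["will"], folded["news"]["will"])
```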

### REUTERS CORPUS

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and grouped into two sets, called "training" and "test".

For instance, the text with fileid 'test/14826' is a document drawn from the test set.

This split is for training and testing algorithms that automatically detect the topic of a document.
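
Because the set name is the first component of each fileid, the two sets can be separated with a simple prefix test. A minimal sketch, using a short illustrative list of fileids rather than the full corpus:

```python
# A few fileids in the style used by the Reuters Corpus; the list is illustrative.
fileids = ["test/14826", "test/14828", "training/9865", "training/9880"]

# The part before the slash names the set a document belongs to.
train_ids = [f for f in fileids if f.startswith("training/")]
test_ids = [f for f in fileids if f.startswith("test/")]

print(len(train_ids), "training documents;", len(test_ids), "test documents")
```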

In [46]:
from nltk.corpus import reuters

In [47]:
reuters.fileids()

['test/14826',
 'test/14828',
 'test/14829',
 'test/14832',
 'test/14833',
 'test/14839',
 'test/14840',
 'test/14841',
 'test/14842',
 'test/14843',
 'test/14844',
 'test/14849',
 'test/14852',
 'test/14854',
 'test/14858',
 'test/14859',
 'test/14860',
 'test/14861',
 'test/14862',
 'test/14863',
 'test/14865',
 'test/14867',
 'test/14872',
 'test/14873',
 'test/14875',
 'test/14876',
 'test/14877',
 'test/14881',
 'test/14882',
 'test/14885',
 'test/14886',
 'test/14888',
 'test/14890',
 'test/14891',
 'test/14892',
 'test/14899',
 'test/14900',
 'test/14903',
 'test/14904',
 'test/14907',
 'test/14909',
 'test/14911',
 'test/14912',
 'test/14913',
 'test/14918',
 'test/14919',
 'test/14921',
 'test/14922',
 'test/14923',
 'test/14926',
 'test/14928',
 'test/14930',
 'test/14931',
 'test/14932',
 'test/14933',
 'test/14934',
 'test/14941',
 'test/14943',
 'test/14949',
 'test/14951',
 'test/14954',
 'test/14957',
 'test/14958',
 'test/14959',
 'test/14960',
 'test/14962',
 'test/149

In [48]:
reuters.categories()[:10]

['acq',
 'alum',
 'barley',
 'bop',
 'carcass',
 'castor-oil',
 'cocoa',
 'coconut',
 'coconut-oil',
 'coffee']

The corpus methods accept a single fileid or a list of fileids.

In [49]:
reuters.categories('training/9865')

['barley', 'corn', 'grain', 'wheat']

In [50]:
reuters.categories(['training/9865', 'training/9880'])

['barley', 'corn', 'grain', 'money-fx', 'wheat']
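
Documents and topics are in a many-to-many relationship: a document can carry several topics, and a topic can cover many documents. The toy lookup below sketches that relationship and the "single fileid or list of fileids" convention. It is not NLTK code: the first mapping entry copies the output above, while the second is an assumption chosen only to be consistent with the combined result.

```python
# Toy document-to-topic mapping. The first entry copies the output above;
# the second is an assumption consistent with the combined result.
doc_topics = {
    "training/9865": ["barley", "corn", "grain", "wheat"],
    "training/9880": ["money-fx"],
}

def categories(fileids):
    """Accept one fileid or a list of fileids; return the sorted union of topics."""
    if isinstance(fileids, str):
        fileids = [fileids]
    return sorted({topic for f in fileids for topic in doc_topics[f]})

def fileids_for(topic):
    """Inverse lookup: the documents labelled with a given topic."""
    return sorted(f for f, topics in doc_topics.items() if topic in topics)

print(categories("training/9865"))
print(categories(["training/9865", "training/9880"]))
print(fileids_for("barley"))
```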

In [51]:
reuters.fileids('barley')[:10]

['test/15618',
 'test/15649',
 'test/15676',
 'test/15728',
 'test/15871',
 'test/15875',
 'test/15952',
 'test/17767',
 'test/17769',
 'test/18024']

We can also specify the words or sentences we want in terms of files or categories.
Remember that titles are stored in uppercase by convention, so the first few words of each text are its title.

In [52]:
reuters.words('training/9865')[:10]

['FRENCH',
 'FREE',
 'MARKET',
 'CEREAL',
 'EXPORT',
 'BIDS',
 'DETAILED',
 'French',
 'operators',
 'have']

In [53]:
reuters.words(['training/9865', 'training/9880'])

['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]

In [54]:
reuters.words(categories = 'barley')

['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]

In [55]:
reuters.words(categories = ['barley','corn'])

['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

### ANNOTATED CORPORA

Many text corpora contain linguistic annotations representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth.
For information on how to download free corpora, samples, and data packages, see http://nltk.org/data

### CORPORA IN OTHER LANGUAGES

NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora.

Let's use the Universal Declaration of Human Rights corpus, which is available in over 300 languages, and create a conditional frequency distribution to examine the differences in word lengths for a group of languages.

In [56]:
from nltk.corpus import udhr

In [57]:
languages = ['Chickasaw', 'English','German_Deutsch', 'Greenlandic_Inuktikut',
             'Hungarian_Magyar', 'Ibibio_Efik']

In [58]:
cfd = nltk.ConditionalFreqDist((lang, len(word))
                              for lang in languages
                              for word in udhr.words(lang + '-Latin1'))

In [59]:
# cfd.plot(cumulative=True)
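
A cumulative word-length distribution answers the question "how many words have at most n letters?" for each language. The sketch below computes those running totals in plain Python for two short illustrative phrases; the phrases and the helper **cumulative_lengths** are assumptions, and real input would come from **udhr.words()**.

```python
from collections import Counter
from itertools import accumulate

# Short illustrative phrases; real input would come from udhr.words(...).
texts = {
    "English": "all human beings are born free and equal".split(),
    "German": "alle Menschen sind frei und gleich geboren".split(),
}

def cumulative_lengths(words):
    """Map each word length n to the number of words of length <= n."""
    lengths = Counter(len(w) for w in words)
    sizes = sorted(lengths)
    return dict(zip(sizes, accumulate(lengths[n] for n in sizes)))

for lang, words in texts.items():
    print(lang, cumulative_lengths(words))
```

The last value in each dictionary equals the total number of words, which is a quick sanity check on the running totals.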

### LOADING YOUR OWN CORPUS

If you have your own collection of text files, you can load them using NLTK's PlaintextCorpusReader.

Instructions:

1. Set the value of corpus_root to the location of your files.
2. The second parameter of the PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all the fileids, like '[abc]/.*\.txt'.

In [60]:
from nltk.corpus import PlaintextCorpusReader

In [84]:
corpus_root = '/Users/dianaamador/Corpus/'

In [91]:
wordlists = PlaintextCorpusReader(corpus_root,'.*')

In [92]:
wordlists.fileids()

['.DS_Store', 'Im_the_shooter_HP.txt', 'Orlando_NYT.txt', 'UTF.txt']
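
The catch-all pattern '.*' also picks up '.DS_Store', a macOS metadata file rather than a text. The sketch below shows how a narrower pattern would filter the list. It uses **re.fullmatch** as a stand-in and assumes the pattern is applied to the whole fileid; how PlaintextCorpusReader applies the pattern internally is not shown here.

```python
import re

# The fileids listed above, including the macOS metadata file.
found = [".DS_Store", "Im_the_shooter_HP.txt", "Orlando_NYT.txt", "UTF.txt"]

def select(fileids, pattern):
    """Keep the fileids that the pattern matches in full."""
    return [f for f in fileids if re.fullmatch(pattern, f)]

print(select(found, r".*"))       # everything, including .DS_Store
print(select(found, r".*\.txt"))  # only the .txt files
```

Under that assumption, passing a pattern such as '.*\.txt' as the second argument should leave the metadata file out of the corpus.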

In [93]:
wordlists.words('UTF.txt')

['UTF', '-', '8', 'is', 'a', 'character', 'encoding', ...]

In [94]:
wordlists.sents('UTF.txt')

[['UTF', '-', '8', 'is', 'a', 'character', 'encoding', 'capable', 'of', 'encoding', 'all', 'possible', 'characters', ',', 'or', 'code', 'points', ',', 'defined', 'by', 'Unicode', 'and', 'originally', 'designed', 'by', 'Ken', 'Thompson', 'and', 'Rob', 'Pike', '.'], ['The', 'encoding', 'is', 'variable', '-', 'length', 'and', 'uses', '8', '-', 'bit', 'code', 'units', '.']]

In [95]:
wordlists.words('Im_the_shooter_HP.txt')

['‘', 'I', '’', 'm', 'the', 'shooter', '.', 'It', '’', ...]

In [96]:
len(wordlists.sents('Im_the_shooter_HP.txt'))

25

Basic corpus functionality defined in NLTK (see http://www.nltk.org/howto):

| Example | Description |
| --- | --- |
| fileids() | The files of the corpus |
| fileids([categories]) | The files of the corpus corresponding to these categories |
| categories() | The categories of the corpus |
| categories([fileids]) | The categories of the corpus corresponding to these files |
| raw() | The raw content of the corpus |
| raw(fileids=[f1,f2,f3]) | The raw content of the specified files |
| raw(categories=[c1,c2]) | The raw content of the specified categories |
| words() | The words of the whole corpus |
| words(fileids=[f1,f2,f3]) | The words of the specified fileids |
| words(categories=[c1,c2]) | The words of the specified categories |
| sents() | The sentences of the whole corpus |
| sents(fileids=[f1,f2,f3]) | The sentences of the specified fileids |
| sents(categories=[c1,c2]) | The sentences of the specified categories |
| abspath(fileid) | The location of the given file on disk |
| encoding(fileid) | The encoding of the file (if known) |
| open(fileid) | Open a stream for reading the given corpus file |
| root() | The path to the root of the locally installed corpus |
| readme() | The contents of the README file of the corpus |
