We'll use the Natural Language Toolkit (NLTK) to prepare a text, and this afternoon we'll use it again to do some analysis.

In [1]:
import nltk
import re

To begin with, we will import the texts from our corpus.

In [2]:
from nltk.corpus import PlaintextCorpusReader

We tell it where we want the texts to come from. If you don't know the path to your UNC files, right-click a file and look at its location. My files are on my desktop. For this notebook to work, you'll need to click into the cell below and change the root location to match your own machine.

In [3]:
corpus_root = '/Users/senderle/Dropbox/Shared/cwp/nasn/input-texts'

Now we are going to read in the files from that folder.

In [4]:
# The second argument is a regular expression, so the dot is escaped to match a literal '.'
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')
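
The file pattern given to PlaintextCorpusReader is read as a regular expression rather than a shell-style wildcard, so an unescaped dot matches any character. Here is a small standard-library sketch of the difference. The filenames other than the Webb file are made up for illustration, and we assume the whole filename has to match the pattern.

```python
import re

# Mostly hypothetical filenames, for illustration only
names = ['neh-webb-webb.txt', 'notes_txt', 'readme.md']

loose = re.compile(r'.*.txt')    # '.' before 'txt' matches any character
strict = re.compile(r'.*\.txt')  # '\.' matches only a literal dot

loose_hits = [n for n in names if loose.fullmatch(n)]
strict_hits = [n for n in names if strict.fullmatch(n)]

print(loose_hits)   # includes 'notes_txt'
print(strict_hits)  # only the real .txt file
```

With a folder that holds nothing but .txt files both patterns pick up the same files, so the escaped version mainly protects against surprises.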

Let's see if we have the files we want by asking for the file ids.

In [5]:
wordlists.fileids()

['church-hatcher-hatcher.txt',
 'fpn-ball-ball.txt',
 'fpn-brownw-brown.txt',
 'fpn-bruce-bruce.txt',
 'fpn-burton-burton.txt',
 'fpn-burtont-burton.txt',
 'fpn-ferebee-ferebee.txt',
 'fpn-grandy-grandy.txt',
 'fpn-hortonlife-horton.txt',
 'fpn-hortonpoem-hortonpoem.txt',
 'fpn-hughes-hughes.txt',
 'fpn-jackson-jackson.txt',
 'fpn-jacobs-jacobs.txt',
 'fpn-jones-jones.txt',
 'fpn-lane-lane.txt',
 'fpn-mason-mason.txt',
 'fpn-northup-northup.txt',
 'fpn-robinson-robinson.txt',
 'fpn-roper-roper.txt',
 'fpn-steward-steward.txt',
 'fpn-veney-veney.txt',
 'fpn-washeducation-washing.txt',
 'fpn-washington-washing.txt',
 'fpn-williams-williams.txt',
 'nc-jones85-jones85.txt',
 'nc-omarsaid-omarsaid.txt',
 'neh-aaron-aaron.txt',
 'neh-adams-adams.txt',
 'neh-adamsh-adamsh.txt',
 'neh-aga-aga.txt',
 'neh-albert-albert.txt',
 'neh-aleckson-aleckson.txt',
 'neh-alexander-alexander.txt',
 'neh-allen-allen.txt',
 'neh-allinson-allinson.txt',
 'neh-anderson-anderson.txt',
 'neh-andersonr-andersonr.

When you use NLTK to read in text, it does the work of tokenizing (splitting the raw string into separate word and punctuation tokens). We can see this by using the "words" function. (There are other functions as well, like "sents" for sentences.)

In [6]:
wordlists.words('neh-webb-webb.txt')

['THIS', 'book', 'was', 'composed', 'by', 'WILLIAM', ...]
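
To get a feel for what the tokenizer is doing, here is a standard-library sketch. It assumes the reader's default tokenizer splits text into runs of word characters and runs of punctuation, which is our understanding of NLTK's WordPunctTokenizer; the function name simple_tokenize is our own. The sample sentence is pieced together from the opening tokens of the Webb text.

```python
import re

# Assumed pattern: runs of word characters, or runs of non-space punctuation
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_tokenize(text):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "THIS book was composed by WILLIAM WEBB, and written by his wife,"
tokens = simple_tokenize(sample)
print(tokens)
```

Notice that each comma comes out as its own token, just as in the NLTK output.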

We can make a list of these words to use with other functions.

In [7]:
webb = wordlists.words('neh-webb-webb.txt')

In [8]:
webb[0:20]

['THIS',
 'book',
 'was',
 'composed',
 'by',
 'WILLIAM',
 'WEBB',
 ',',
 'and',
 'written',
 'by',
 'his',
 'wife',
 ',',
 'containing',
 'the',
 'life',
 'he',
 'has',
 'went']

However, this awesome list of words has a few idiosyncrasies that might affect our analysis.

In [9]:
webb.count('THIS')

1

In [10]:
webb.count('this')

77

In [11]:
webb.count('This')

8

So the first thing we want to do is make THIS, this, and This all into this. To do this, we will take each text in its wordlist form and lowercase every word, starting with Webb's narrative.

In [12]:
webblow = [w.lower() for w in webb]

In [13]:
webblow[0:20]

['this',
 'book',
 'was',
 'composed',
 'by',
 'william',
 'webb',
 ',',
 'and',
 'written',
 'by',
 'his',
 'wife',
 ',',
 'containing',
 'the',
 'life',
 'he',
 'has',
 'went']

In [14]:
webblow.count('this')

86

We might also want to remove the punctuation. 

In [15]:
webbclean = [w for w in webblow if w.isalpha() ]

In [16]:
webbclean[1:20]

['book',
 'was',
 'composed',
 'by',
 'william',
 'webb',
 'and',
 'written',
 'by',
 'his',
 'wife',
 'containing',
 'the',
 'life',
 'he',
 'has',
 'went',
 'through',
 'his']
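
The two cleaning steps can be tried on a small hand-made token list to see exactly what survives. The tokens below are invented for illustration. Note that isalpha() drops numbers as well as punctuation.

```python
# Invented tokens, in the style of the tokenizer output above
tokens = ['THIS', 'book', ',', 'This', '1857', 'this', '.', 'Mr']

lowered = [w.lower() for w in tokens]           # merge case variants
cleaned = [w for w in lowered if w.isalpha()]   # keep letters-only tokens

print(cleaned)
print(cleaned.count('this'))
```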

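Now we apply the same cleaning to the whole corpus. This time we also remove stopwords, the very common words (like "the" and "and") that tell us little about what a text is about.

In [17]: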
words = wordlists.words()

In [18]:
from nltk.corpus import stopwords

In [19]:
eng_stopwords = stopwords.words('english')

In [20]:
clean_corpus = [w.lower() for w in words if w.lower() not in eng_stopwords and w.isalpha()]

In [21]:
clean_corpus[1:20]

['image',
 'spine',
 'image',
 'john',
 'jasper',
 'frontispiece',
 'image',
 'title',
 'page',
 'image',
 'title',
 'page',
 'verso',
 'image',
 'contents',
 'illustrations',
 'introduction',
 'reader',
 'stay']
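
On a corpus of several million tokens, the stopword test runs once per token, and looking a word up in a list means scanning through it. One possible speed-up, sketched here with a tiny stand-in stopword list (not NLTK's actual list), is to convert the list to a set first and to lowercase each token only once. The function name clean_tokens is our own.

```python
# Stand-in for stopwords.words('english'); the real list is much longer
stopword_list = ['this', 'the', 'and', 'a', 'of', 'was', 'by', 'his']
stopword_set = set(stopword_list)  # set membership tests are fast

def clean_tokens(tokens, stop):
    """Lowercase tokens, keeping only alphabetic non-stopwords."""
    cleaned = []
    for w in tokens:
        lw = w.lower()
        if lw.isalpha() and lw not in stop:
            cleaned.append(lw)
    return cleaned

sample = ['THIS', 'book', 'was', 'composed', 'by', 'WILLIAM', 'WEBB', ',']
result = clean_tokens(sample, stopword_set)
print(result)
```

In the notebook, this would amount to building set(eng_stopwords) once and testing against that instead of the list.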

You have reached the end of preparing texts.
We will return to this notebook this afternoon to do some work analyzing texts!





Welcome back!  Now we are going to do some simple analysis with NLTK. 

First, how long is our text?

In [22]:
len(webbclean)

30455

In [23]:
len(clean_corpus)

5397959

How diverse is the vocabulary in our texts?

In [24]:
def lexical_diversity(text):
    # Ratio of distinct tokens to total tokens
    return len(set(text)) / len(text)

In [25]:
lexical_diversity(clean_corpus)

0.012332809493365918

In [26]:
lexical_diversity(webb)

0.06797858571596659

In [27]:
lexical_diversity(wordlists.words('neh-craft-craft.txt'))

0.13553580412544022
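
Lexical diversity here is the number of distinct tokens divided by the total number of tokens. Longer texts tend to repeat words more, so this ratio usually falls as a text gets longer, which is worth keeping in mind when comparing a single narrative with the whole corpus. The sketch below shows the basic ratio on a toy list, plus one possible length-controlled variant that averages the ratio over fixed-size chunks. The function chunked_diversity and the chunk size are our own choices for illustration, not part of NLTK.

```python
def lexical_diversity(text):
    # Ratio of distinct tokens to total tokens
    return len(set(text)) / len(text)

def chunked_diversity(text, size):
    """Mean lexical diversity over consecutive full chunks of `size` tokens."""
    chunks = [text[i:i + size] for i in range(0, len(text) - size + 1, size)]
    if not chunks:
        raise ValueError('text is shorter than one chunk')
    return sum(lexical_diversity(c) for c in chunks) / len(chunks)

toy = ['a', 'b', 'a', 'c', 'a', 'b', 'a', 'd']
overall = lexical_diversity(toy)       # 4 distinct tokens out of 8
per_chunk = chunked_diversity(toy, 4)  # two chunks, 3 distinct out of 4 in each

print(overall)
print(per_chunk)
```

The same eight tokens score lower as one text than as two chunks of four, which is the length effect in miniature.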

Next we are going to make a frequency distribution table. This is a count of how many times each word occurs in the text.

In [28]:
from nltk.book import *


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [29]:
fdistwebb = FreqDist(webb)

In [30]:
fdistwebb.most_common(20)

[(',', 1911),
 ('I', 1770),
 ('the', 1478),
 ('.', 1403),
 ('and', 1231),
 ('to', 1201),
 ('was', 589),
 ('that', 562),
 ('a', 546),
 ('me', 472),
 ('in', 452),
 ('of', 437),
 ('had', 405),
 ('he', 357),
 ('it', 355),
 ('they', 326),
 ('would', 276),
 ('for', 269),
 ('told', 262),
 ('them', 245)]

In [31]:
fdistcorpus = FreqDist(clean_corpus)

In [32]:
fdistcorpus.most_common(20)

[('one', 41109),
 ('would', 38655),
 ('man', 27246),
 ('time', 25739),
 ('said', 24563),
 ('could', 23764),
 ('mr', 22791),
 ('upon', 19182),
 ('people', 17568),
 ('day', 17227),
 ('made', 17070),
 ('men', 16892),
 ('great', 16320),
 ('slave', 16211),
 ('us', 15682),
 ('master', 15502),
 ('many', 15078),
 ('good', 15054),
 ('god', 15025),
 ('well', 14824)]

We can also count the hapaxes, the words that occur only once in the corpus.

In [33]:
len(fdistcorpus.hapaxes())

21683
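
A frequency distribution is essentially a tally of tokens. As a rough picture of what most_common and hapaxes report, here is a sketch that uses collections.Counter from the standard library in place of FreqDist, on a toy token list.

```python
from collections import Counter

tokens = ['master', 'told', 'me', 'master', 'said', 'me', 'master']
counts = Counter(tokens)

# Most frequent tokens with their counts, like FreqDist.most_common
top_two = counts.most_common(2)

# Tokens that occur exactly once, like FreqDist.hapaxes
hapaxes = sorted(w for w, n in counts.items() if n == 1)

print(top_two)
print(hapaxes)
```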

Finally, we are going to add some metadata and then add conditions to our frequency distribution table.

In [34]:
import csv
metadata_path = '/Users/senderle/Dropbox/Shared/cwp/nasn/North American Slave Narratives 2016-05-09 - Main Data.csv'
with open(metadata_path) as metadata_file:
    metadata_reader = csv.DictReader(metadata_file)
    metadata = list(metadata_reader)
gender_map = {row['Filename']: row['Sex'] for row in metadata}
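
csv.DictReader turns each row into a dictionary keyed by the header row, which is what makes the row['Filename'] and row['Sex'] lookups work. Here is a small sketch with an in-memory file. The data rows and the Title column are invented; only the column names Filename and Sex come from the cell above.

```python
import csv
import io

# Invented rows standing in for the real metadata file
sample_csv = io.StringIO(
    "Filename,Sex,Title\n"
    "example-a.txt,male,First Example\n"
    "example-b.txt,female,Second Example\n"
)

rows = list(csv.DictReader(sample_csv))
sample_map = {row['Filename']: row['Sex'] for row in rows}

print(sample_map)

# .get avoids a KeyError for a file that has no metadata row
fallback = sample_map.get('missing.txt', 'unknown')
print(fallback)
```

If a file id in the corpus has no matching Filename in the metadata, a plain gender_map[fid] lookup will raise a KeyError, so it can be worth checking for missing ids before going further.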

Now that we have our metadata in and mapped, we will make the conditional frequency distribution. (You'll see that we are also cleaning the corpus in the process.)

In [35]:
gender_conditional_freq = nltk.ConditionalFreqDist((gender_map[fid], w.lower()) 
                                                   for fid in wordlists.fileids() 
                                                   for w in wordlists.words(fid)
                                                   if w.lower() not in eng_stopwords and w.isalpha())
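
A conditional frequency distribution keeps one frequency table per condition, built up from (condition, word) pairs. The idea can be sketched with a dictionary of Counters. The file ids, labels, and tokens below are invented, and this illustrates the idea rather than NLTK's own implementation.

```python
from collections import Counter, defaultdict

# Invented stand-ins for gender_map and the tokenized files
labels = {'a.txt': 'male', 'b.txt': 'female'}
texts = {
    'a.txt': ['The', 'master', 'said', ',', 'master'],
    'b.txt': ['God', 'said', 'the', 'Lord', '.'],
}
stop = {'the'}

# One Counter per condition, created on first use
cond_counts = defaultdict(Counter)
for fid, tokens in texts.items():
    for w in tokens:
        if w.lower() not in stop and w.isalpha():
            cond_counts[labels[fid]][w.lower()] += 1

print(cond_counts['male'].most_common(1))
print(cond_counts['female']['said'])
```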

What are the most common words used by authors classified as male vs. authors classified as female?

In [36]:
gender_conditional_freq['male'].most_common(20)

[('one', 34058),
 ('would', 31600),
 ('man', 24449),
 ('time', 21388),
 ('mr', 19580),
 ('could', 19372),
 ('said', 18578),
 ('upon', 16411),
 ('men', 15272),
 ('people', 14676),
 ('made', 14666),
 ('slave', 14654),
 ('day', 14222),
 ('great', 13879),
 ('master', 13288),
 ('us', 12870),
 ('two', 12381),
 ('many', 12275),
 ('good', 12081),
 ('well', 11997)]

In [37]:
gender_conditional_freq['female'].most_common(20)

[('would', 6660),
 ('one', 6612),
 ('said', 5675),
 ('could', 4181),
 ('time', 4093),
 ('go', 3370),
 ('god', 3276),
 ('little', 2966),
 ('mr', 2911),
 ('good', 2824),
 ('day', 2801),
 ('people', 2730),
 ('lord', 2728),
 ('well', 2716),
 ('came', 2691),
 ('old', 2688),
 ('see', 2648),
 ('upon', 2646),
 ('many', 2625),
 ('went', 2607)]
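
Raw counts like these are hard to compare directly. The counts in the male list are several times larger across the board, which suggests there is simply much more text in that group. One way to compare, sketched here with invented counts, is to divide each count by the total for its group. The function name relative_freq is our own.

```python
from collections import Counter

# Invented counts, for illustration only
male = Counter({'one': 300, 'god': 50, 'master': 150})
female = Counter({'one': 60, 'god': 30, 'master': 10})

def relative_freq(counts, word):
    """Share of all counted tokens in `counts` accounted for by `word`."""
    return counts[word] / sum(counts.values())

for word in ['one', 'god']:
    print(word, relative_freq(male, word), relative_freq(female, word))
```

In this toy data "one" has the same share in both groups even though the raw counts differ by a factor of five, while "god" takes up a larger share of the female group. In the notebook, something like gender_conditional_freq['male'].freq('god') should give the corresponding proportion, since each condition holds a FreqDist.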