We'll use the Natural Language Toolkit (NLTK) to prepare a text now and, this afternoon, to do some analysis.

In [1]:
import nltk
import re

To begin with, we will import the text of a corpus. This time we are using a subset because NLP is resource intensive. We could run the entire Slave Narrative corpus; however, it takes more time to process than we have. The subset is of women-authored texts published in the state of New York after the Civil War.

In [2]:
from nltk.corpus import PlaintextCorpusReader

We tell the corpus reader where the texts live. Use the path to the women_ny_postwar folder of texts. For this notebook to work, you'll need to click into the cell below and change the root location to match your own machine.

In [3]:
corpus_root = '/Users/kalle/Desktop/HILT 2016/data/women_ny_postwar'

Now we are going to read in the files from that folder.

In [4]:
corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')
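
The second argument to the reader is a regular expression rather than a shell-style wildcard, so an unescaped dot matches any character. The sketch below uses only Python's `re` module and made-up file names to show the difference; it assumes the reader compares the pattern against the whole file name.

```python
import re

# Hypothetical file names, used only to illustrate the pattern.
names = ['narrative.txt', 'narrative_txt', 'notes.md']

loose = '.*.txt'      # the dot before "txt" matches any character
strict = r'.*\.txt'   # the escaped dot matches only a literal "."

loose_hits = [n for n in names if re.fullmatch(loose, n)]
strict_hits = [n for n in names if re.fullmatch(strict, n)]

print(loose_hits)
print(strict_hits)
```

In a folder that holds nothing but .txt files, both patterns select the same files; the escaped form simply states the intent more precisely.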

We will tokenize the text, then remove punctuation and numbers while keeping capital letters.

In [5]:
words_corpus = corpus.words()

In [6]:
words_corpus = [w for w in words_corpus if w.isalpha()]
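
To see what the `isalpha()` filter keeps and drops, here is a small sketch on an invented sentence. It does not call NLTK; the regular expression is an assumption that approximates a tokenizer which splits runs of word characters from runs of punctuation.

```python
import re

# Assumption: this pattern approximates a word/punctuation tokenizer.
TOKEN_PATTERN = r"\w+|[^\w\s]+"

# An invented sentence, not a quotation from the corpus.
sentence = "In 1861, its heaven-defying wrongs."

tokens = re.findall(TOKEN_PATTERN, sentence)

# Keep only tokens made up entirely of letters.
words = [t for t in tokens if t.isalpha()]

print(tokens)
print(words)
```

Note that the hyphenated word ends up as two separate tokens, and that the year disappears along with the punctuation.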

Next we tag each word in the text with its part of speech.

In [7]:
pos_corpus = nltk.pos_tag(words_corpus)

Let's look at a sample of what we have produced.

In [8]:
pos_corpus[539:569]

[('it', 'PRP'),
 ('is', 'VBZ'),
 ('and', 'CC'),
 ('will', 'MD'),
 ('remain', 'VB'),
 ('forever', 'RB'),
 ('impossible', 'JJ'),
 ('to', 'TO'),
 ('adequately', 'RB'),
 ('portray', 'VB'),
 ('its', 'PRP$'),
 ('unspeakable', 'JJ'),
 ('horrors', 'NNS'),
 ('its', 'PRP$'),
 ('heartbreaking', 'NN'),
 ('sorrows', 'VBZ'),
 ('its', 'PRP$'),
 ('fathomless', 'JJ'),
 ('miseries', 'NNS'),
 ('of', 'IN'),
 ('hopeless', 'JJ'),
 ('grief', 'NN'),
 ('its', 'PRP$'),
 ('intolerable', 'JJ'),
 ('shames', 'NNS'),
 ('and', 'CC'),
 ('its', 'PRP$'),
 ('heaven', 'NN'),
 ('defying', 'NN'),
 ('and', 'CC')]

What does this mean? The results are in tuples -- pairs of words and tags. The tags have a key, which you can find here: http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/. NN is a singular or mass noun -- "grief." NNS is a plural noun -- "miseries." JJ is an adjective -- "fathomless." VB is a verb in its base form -- "portray." RB is an adverb -- "adequately." The tagger is not always right: in the sample above, "sorrows" is tagged VBZ (a verb), although here it is a plural noun.
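
Because each result is a (word, tag) tuple, we can unpack the pairs in a loop or filter on the tag. A minimal sketch, using a handful of pairs copied from the sample output above rather than the full `pos_corpus`:

```python
# A few (word, tag) pairs copied from the sample output.
sample = [
    ('impossible', 'JJ'),
    ('to', 'TO'),
    ('adequately', 'RB'),
    ('portray', 'VB'),
    ('its', 'PRP$'),
    ('unspeakable', 'JJ'),
    ('horrors', 'NNS'),
]

# Unpack each tuple into its word and its tag.
for word, tag in sample:
    print(f'{word:12} {tag}')

# Keep only the words whose tag marks an adjective.
adjectives = [word for word, tag in sample if tag == 'JJ']
print(adjectives)
```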

Just as we could create a term frequency distribution table, we can also create a tag frequency table. 

In [9]:
tag_fd = nltk.FreqDist(tag for (word, tag) in pos_corpus)

We can sort by how common a part of speech is.

In [10]:
tag_fd.most_common(10)

[('NN', 31566),
 ('IN', 29540),
 ('DT', 22574),
 ('PRP', 20743),
 ('VBD', 16170),
 ('NNP', 16144),
 ('JJ', 14141),
 ('RB', 11010),
 ('CC', 10835),
 ('VB', 10569)]
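
Raw counts are easier to compare once they are turned into shares of the total. The sketch below uses `collections.Counter` from the standard library as a stand-in for the frequency distribution, on a few pairs copied from the sample output; the variable names are illustrative only.

```python
from collections import Counter

# (word, tag) pairs copied from the sample output.
tagged = [
    ('its', 'PRP$'), ('unspeakable', 'JJ'), ('horrors', 'NNS'),
    ('its', 'PRP$'), ('fathomless', 'JJ'), ('miseries', 'NNS'),
    ('of', 'IN'), ('hopeless', 'JJ'), ('grief', 'NN'),
]

# Count how often each tag occurs, ignoring the words.
tag_counts = Counter(tag for word, tag in tagged)
total = sum(tag_counts.values())

# Share of all tokens that carry each tag.
tag_share = {tag: count / total for tag, count in tag_counts.items()}

print(tag_counts.most_common(1))
print(round(tag_share['JJ'], 2))
```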

Given the prevalence of "little" in women's narratives that we saw in our discussion on Tuesday, I would like to know more about the kinds of adjectives that these writers were using. Adjectives are denoted by JJ (base form), JJR (comparative), and JJS (superlative).

First, we make a frequency distribution for the words in the corpus. 

In [11]:
word_fd = nltk.FreqDist(word for (word, tag) in pos_corpus)

In [12]:
word_fd.most_common(10)

[('the', 11677),
 ('and', 8606),
 ('to', 7742),
 ('of', 7015),
 ('I', 5388),
 ('a', 4277),
 ('in', 4079),
 ('was', 3750),
 ('that', 2874),
 ('her', 2685)]

(Oh hai, stopwords.)

Then we are going to use a conditional frequency distribution again. This time the condition is the part-of-speech tag, so we can look up the words counted under JJ.

In [13]:
cfd_pos = nltk.ConditionalFreqDist((tag, word) for (word, tag) in pos_corpus)

In [14]:
cfd_pos['JJ'].most_common(10)

[('old', 445),
 ('good', 371),
 ('many', 349),
 ('great', 314),
 ('little', 305),
 ('poor', 278),
 ('other', 277),
 ('own', 236),
 ('much', 216),
 ('last', 215)]
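
The cell above looks only at JJ. If we also wanted comparatives (JJR) and superlatives (JJS), one option is to add the three sets of counts together. This is a sketch with invented data, using `defaultdict` and `Counter` from the standard library as a stand-in for the conditional frequency distribution:

```python
from collections import Counter, defaultdict

# Hypothetical tagged words, invented for illustration.
tagged = [
    ('old', 'JJ'), ('older', 'JJR'), ('oldest', 'JJS'),
    ('old', 'JJ'), ('poor', 'JJ'), ('poorer', 'JJR'),
    ('grief', 'NN'),
]

# Group word counts by tag: the tag is the condition.
by_tag = defaultdict(Counter)
for word, tag in tagged:
    by_tag[tag][word] += 1

# Merge the counts for all three adjective tags.
all_adjectives = Counter()
for tag in ('JJ', 'JJR', 'JJS'):
    all_adjectives.update(by_tag[tag])

print(all_adjectives.most_common(1))
```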

Now we're going to store these word frequency counts in a file for use elsewhere. We'll store them in a CSV file that many different kinds of software can open. Python provides support for CSV files in the `csv` library, so first we import it.

In [24]:
import csv

Let's save word frequency counts per document for the 200 most common adjectives to our CSV. First, we create a conditional frequency distribution in which the condition is the filename in which the word appears.

In [25]:
corpus_fileids = list(corpus.fileids())

In [26]:
fileid_frequency = []

In [27]:
for fileid in corpus.fileids():
    for word in corpus.words(fileid):
        fileid_frequency.append((fileid, word))

In [28]:
csv_data = nltk.ConditionalFreqDist(fileid_frequency)

In [29]:
# most_common() returns (word, count) pairs; we keep only the words.
common_adjectives = [word for word, count in cfd_pos['JJ'].most_common(200)]

In [30]:
# In Python 3, newline='' keeps the csv module from writing blank rows on Windows.
with open('common-adjectives.csv', 'w', newline='') as output:
    csv_writer = csv.writer(output)
    header = ['Filename']
    header.extend(common_adjectives)
    csv_writer.writerow(header)
    for fileid in corpus.fileids():
        row = [fileid]
        row.extend(csv_data[fileid][w] for w in common_adjectives)
        csv_writer.writerow(row)
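
To check the layout, or to pick the counts up again in another notebook, the file can be read back with `csv.DictReader`. The sketch below writes a tiny table with the same shape (a Filename column followed by one column per adjective) to an in-memory buffer instead of touching the real file; the file names and counts are made up.

```python
import csv
import io

# Hypothetical rows in the same layout as common-adjectives.csv.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Filename', 'old', 'good'])
writer.writerow(['first.txt', 12, 7])
writer.writerow(['second.txt', 3, 9])

# Rewind and read the rows back as dictionaries keyed by column name.
buffer.seek(0)
counts = {}
for row in csv.DictReader(buffer):
    filename = row.pop('Filename')
    # CSV values come back as strings, so convert them to integers.
    counts[filename] = {word: int(n) for word, n in row.items()}

print(counts['first.txt']['old'])
```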

In [31]:
# Condition on the word this time, so each word maps to the tags it received.
cfd_pos_new = nltk.ConditionalFreqDist(pos_corpus)

In [32]:
# Write one row per (word, tag) pair, with how often that pairing occurred.
with open('corpus-pos.csv', 'w', newline='') as output:
    csv_writer = csv.writer(output)
    header = ['ID', 'Word', 'POS', 'Count']
    csv_writer.writerow(header)
    row_id = 0
    for word in cfd_pos_new:
        for pos, count in cfd_pos_new[word].items():
            row = [row_id, word, pos, count]
            csv_writer.writerow(row)
            row_id += 1