# NLTK corpus reader

## Background

In this exercise, we'll modify the custom NLTK corpus reader introduced in the textbook to accommodate multiple categories for each novel and to work with plaintext documents rather than HTML. But first, let's see how we would use the stock `PlaintextCorpusReader` class.

In [9]:
# Imports
import os
from   nltk.corpus.reader import PlaintextCorpusReader

# Where are the corpus texts on your system
text_dir = os.path.join('..', 'data', 'texts')

# Regex to identify text files (raw string, so '\.' reaches the regex engine intact)
text_pattern = r'.+\.txt'

# Initialize the corpus reader
plain_corpus = PlaintextCorpusReader(text_dir, text_pattern)

A couple of things to notice here:

* We've once again used `os.path.join()` to create a file path that's portable across Mac, Windows, etc. Make sure the directories in question are the right ones for wherever you're keeping your copy of the corpus. Note, too, that `'..'` indicates the parent directory of the current directory (wherever the notebook lives). `'.'` (single period rather than double) indicates the current directory.
* We use a regular expression to identify the file names that should be included in the corpus. This isn't especially relevant yet, but it will allow us to have files in the corpus directory that *aren't* included in the corpus itself.
* We'll say a bit about regular expressions in class. A full treatment is beyond the scope of the course, but there are [many](https://developers.google.com/edu/python/regular-expressions) [overviews](https://docs.python.org/3/howto/regex.html) [online](https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial). In the present case, the `text_pattern` regex will match any string (here, a file name) that ends in `.txt` and contains one or more characters before that suffix.
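
To make the pattern's behavior concrete, here is a minimal sketch that tests it against a few names using Python's `re` module directly. Only the first name comes from the corpus; the others are made up for illustration. `re.fullmatch` is used as an approximation of how the corpus reader applies the pattern to a whole file name.

```python
import re

text_pattern = r'.+\.txt'

candidates = [
    'A-Alcott-Little_Women-1868-F.txt',  # a real corpus file name
    'notes.md',                          # hypothetical: wrong extension
    '.txt',                              # hypothetical: nothing before the suffix
    'README.txt.bak',                    # hypothetical: does not end in .txt
]

# Map each candidate name to whether the whole name matches the pattern
matches = {name: bool(re.fullmatch(text_pattern, name)) for name in candidates}

for name, ok in matches.items():
    print(f"{name!r:40} {ok}")
```

Only the first name matches, which is how extra files can sit in the corpus directory without being read into the corpus.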

NLTK corpus reader objects have some useful convenience methods, like `.fileids()`...

In [10]:
plain_corpus.fileids()

['A-Alcott-Little_Women-1868-F.txt',
 'A-Cather-Antonia-1918-F.txt',
 'A-Chesnutt-Marrow-1901-M.txt',
 'A-Chopin-Awakening-1899-F.txt',
 'A-Crane-Maggie-1893-M.txt',
 'A-Davis-Life_Iron_mills-1861-F.txt',
 'A-Dreiser-Sister_Carrie-1900-M.txt',
 'A-Freeman-Pembroke-1894-F.txt',
 'A-Gilman-Herland-1915-F.txt',
 'A-Harper-Iola_Leroy-1892-F.txt',
 'A-Hawthorne-Scarlet_Letter-1850-M.txt',
 'A-Howells-Silas_Lapham-1885-M.txt',
 'A-James-Golden_Bowl-1904-M.txt',
 'A-Jewett-Pointed_Firs-1896-F.txt',
 'A-London-Call_Wild-1903-M.txt',
 'A-Melville-Moby_Dick-1851-M.txt',
 'A-Norris-Pit-1903-M.txt',
 'A-Stowe-Uncle_Tom-1852-F.txt',
 'A-Twain-Huck_Finn-1885-M.txt',
 'A-Wharton-Age_Innocence-1920-F.txt',
 'B-Austen-Pride_Prejudice-1813-F.txt',
 'B-Bronte_C-Jane_Eyre-1847-F.txt',
 'B-Bronte_E-Wuthering_Heights-1847-F.txt',
 'B-Burney-Evelina-1778-F.txt',
 'B-Conrad-Heart_Darkness-1902-M.txt',
 'B-Dickens-Bleak_House-1853-M.txt',
 'B-Disraeli-Sybil-1845-M.txt',
 'B-Eliot-Middlemarch-1869-F.txt',
 'B-F

... and `.paras()`. Here, we print the second sentence of the 101st paragraph of *Mrs Dalloway*:

In [3]:
print(plain_corpus.paras(fileids=['B-Woolf-Mrs_Dalloway-1925-F.txt'])[100][1])

['No', 'vulgar', 'jealousy', 'could', 'separate', 'her', 'from', 'Richard', '.']


Note that the text has been fully parsed into paragraphs, sentences, and words, which are returned as nested lists.
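
The nesting can be sketched with a small hand-built example. The list below is a toy stand-in for what `.paras()` returns, not real corpus output. NLTK returns list-like objects rather than plain lists, but indexing follows the same paragraph, sentence, word order.

```python
# Toy stand-in for the return value of .paras(): a list of paragraphs,
# each a list of sentences, each a list of word tokens.
paras = [
    [['First', 'paragraph', '.'], ['It', 'has', 'two', 'sentences', '.']],
    [['Second', 'paragraph', '.']],
]

first_para = paras[0]        # a list of sentences
second_sent = paras[0][1]    # a list of tokens
third_word = paras[0][1][2]  # a single token (a string)

print(len(paras), "paragraphs")
print(second_sent)
print(third_word)

# Flatten paragraphs to sentences, then sentences to words
sents = [sent for para in paras for sent in para]
words = [word for sent in sents for word in sent]
print(len(sents), "sentences;", len(words), "tokens")
```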

NLTK also contains a `CategorizedPlaintextCorpusReader` class, which is similar to `PlaintextCorpusReader`, but handles categories for us as well. Categorized corpus readers allow users to specify category labels in place of the `fileids=` argument, which is useful for subsetting the corpus according to divisions of interest.

In many cases, the NLTK built-in corpus readers will be all you need. But we're going to use a modified version of the textbook's `HTMLCorpusReader` class, both because it's instructive to see how `CorpusReader` methods are implemented and because we'll need some non-standard features in future work.

## Exercise

**NB. Look for '`# YOUR CODE HERE`' comments in the code blocks below to find the spots where you need to make changes or to create code of your own.** In every case, the amount of code for you to add is no more than a few lines.

Using the examples in chapter two and the (extensive) starter code below, write a custom NLTK-based corpus reader that ingests our course corpus (40 novels) and labels each of them by the author's gender and national origin. Name your corpus reader `TMNCorpusReader`. When complete, your corpus reader should successfully execute the code examples given further down in the notebook.

NLTK corpora support multiple category labels for each text, but there's a limitation to the way NLTK treats these labels: they can only be combined via logical OR, not AND. In other words, if you select multiple categories, you get all the texts that belong to any one of them. So we need to label our texts both restrictively ("American female", 10 texts) and at higher levels ("American", 20 texts).
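
The OR behavior, and why a combined label gets around it, can be illustrated without NLTK at all. The sketch below uses plain Python sets over a toy category map. The four file names are placeholders, not actual corpus files, and `files_in_any` is a hypothetical helper that mimics the OR semantics described above.

```python
# Toy category map: file name -> list of labels (placeholder names)
toy_map = {
    'am_f.txt': ['American', 'female', 'American female'],
    'am_m.txt': ['American', 'male', 'American male'],
    'br_f.txt': ['British', 'female', 'British female'],
    'br_m.txt': ['British', 'male', 'British male'],
}

def files_in_any(cat_map, categories):
    """Return files carrying at least one of the given labels (logical OR)."""
    wanted = set(categories)
    return sorted(f for f, labels in cat_map.items() if wanted & set(labels))

# Selecting two labels yields their union, not their intersection
either = files_in_any(toy_map, ['American', 'female'])
print(either)

# A single combined label recovers the intersection
both = files_in_any(toy_map, ['American female'])
print(both)
```

Asking for `'American'` and `'female'` together returns every file except the British male one; only the combined label isolates the American female text.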

The example in the textbook uses directory structure to assign categories. That specific approach won't work for us, because the same text can belong to multiple categories (we're not going to explore cumbersome work-arounds like symlinks or multiple copies of each file). But our files *are* named in ways that indicate the categories to which they belong. For example:

```
A-Alcott-Little_Women-1868-F.txt
```

The format is:

```
nationality-author-title-year-gender.txt
```

Multi-word titles or author names are joined with an underscore; fields are separated with a hyphen.
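
As a quick illustration of the naming scheme, the sketch below splits one file name into its five fields. `parse_name` and the `NovelInfo` tuple are hypothetical names introduced here for illustration only.

```python
import os
from collections import namedtuple

# Field names follow the format string given above
NovelInfo = namedtuple('NovelInfo', 'nationality author title year gender')

def parse_name(file_name):
    """Split 'nationality-author-title-year-gender.txt' into named fields."""
    stem, _ext = os.path.splitext(file_name)  # drop the '.txt' extension
    return NovelInfo(*stem.split('-'))        # fields are hyphen-separated

info = parse_name('A-Alcott-Little_Women-1868-F.txt')
print(info)
print(info.title.replace('_', ' '))  # underscores join multi-word titles
```

Note that every field, including the year, comes back as a string.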

So ... we'll use a dictionary to associate multiple labels with each input text, then pass that dictionary to the corpus reader using the `cat_map` keyword.

In [12]:
# More imports, in addition to those above
from   glob import glob

# We're going to read just the file names to create the category map
file_paths = glob(os.path.join(text_dir, '*.txt')) # glob lets us use wildcards in paths
file_names = [os.path.split(i)[1] for i in file_paths] # split filenames from paths

category_map = {} # Dict to hold filename:[categories] mappings

for file in file_names:
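    # Drop the extension and split on hyphens. (str.rstrip('.txt') strips any
    # trailing '.', 't', or 'x' characters, not the suffix, so avoid it here.)
    parsed = os.path.splitext(file)[0].split('-')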
    nation = parsed[0]
    gender = parsed[4]
    category_map[file] = [nation, gender, nation+gender]

print("Category labels for _Little Women_:", 
      category_map['A-Alcott-Little_Women-1868-F.txt'])
assert(category_map['A-Alcott-Little_Women-1868-F.txt']==['A', 'F', 'AF'])

Category labels for _Little Women_: ['A', 'F', 'AF']


Notice that each text in the corpus has three labels:

* Author nationality
* Author gender
* Combined nationality and gender

This will let us subset the corpus at any desired level of granularity.

In [14]:
DOC_PATTERN = r'.+\.txt'        # Documents are just files that end in '.txt'
CAT_PATTERN = r'([a-z_\s]+)/.*' # We won't use this, but fall back to directory-based labels
                                # if no other labels are supplied

import codecs
import nltk.data
from   nltk.tokenize import *
from   nltk.corpus.reader.util import *
from   nltk.corpus.reader.api import *
from   nltk.corpus.reader import PlaintextCorpusReader

class TMNCorpusReader(CategorizedCorpusReader, PlaintextCorpusReader):
    """
    A corpus reader for categorized text documents to enable preprocessing.
    """
    
    def __init__(
        self, 
        root, 
        fileids=DOC_PATTERN,
        word_tokenizer=WordPunctTokenizer(),
        sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'),
        para_block_reader=read_blankline_block,
        encoding='utf8', 
        **kwargs
    ):
        """
        Initialize the corpus reader.  Categorization arguments
        (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
        the ``CategorizedCorpusReader`` constructor.  The remaining
        arguments are passed to the ``PlaintextCorpusReader`` constructor.
        """
        # Add the default category pattern if not passed into the class.
        if not any(key.startswith('cat_') for key in kwargs.keys()):
            kwargs['cat_pattern'] = CAT_PATTERN

        # Initialize the NLTK corpus reader objects
        CategorizedCorpusReader.__init__(self, kwargs)
        PlaintextCorpusReader.__init__(
            self, root, fileids,
            word_tokenizer=word_tokenizer,
            sent_tokenizer=sent_tokenizer,
            para_block_reader=para_block_reader,
            encoding=encoding,  # by keyword: the third positional slot is word_tokenizer
        )

    def resolve(self, fileids, categories):
        """
        Returns a list of fileids or categories depending on what is passed
        to each internal corpus reader function. Implemented similarly to
        the NLTK ``CategorizedPlaintextCorpusReader``.
        """
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")

        if categories is not None:
            return self.fileids(categories)
        return fileids

    def docs(self, fileids=None, categories=None):
        """
        Yields the complete text of each document in turn, closing each file
        after it is read so that only one document is in memory at a time.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)

        # Create a generator, loading one document into memory at a time.
        for path, encoding in self.abspaths(fileids, include_encoding=True):
            with codecs.open(path, 'r', encoding=encoding) as f:
                yield f.read()

    def sizes(self, fileids=None, categories=None):
        """
        Yields the size on disk, in bytes, of each file.
        This function is used to detect oddly large files in the corpus.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)

        # Create a generator, getting every path and computing filesize
        for path in self.abspaths(fileids):
            yield os.path.getsize(path)
            
    # Code below this line is extra, not (yet) covered in the textbook.
    # You can leave it as-is. It provides some standard corpus methods.
    # We're using PlaintextCorpusReader methods, but providing category resolution
    def raw(self, fileids=None, categories=None):
        """
        Returns raw text as a string.
        """
        return PlaintextCorpusReader.raw(self, self.resolve(fileids, categories))

    def words(self, fileids=None, categories=None):
        """
        Returns a list of words.
        """
        return PlaintextCorpusReader.words(self, self.resolve(fileids, categories))

    def sents(self, fileids=None, categories=None):
        """
        Returns a list of tokenized sentences.
        """
        return PlaintextCorpusReader.sents(self, self.resolve(fileids, categories))

    def paras(self, fileids=None, categories=None):
        """
        Returns a list of paragraphs, each a list of tokenized sentences.
        """
        return PlaintextCorpusReader.paras(self, self.resolve(fileids, categories))

In [15]:
corpus = TMNCorpusReader(text_dir, cat_map=category_map)

print("Categories in the corpus:\n", corpus.categories())
print("\nThe first five fileids:\n", corpus.fileids()[:5])

woolf = corpus.paras(fileids=['B-Woolf-Mrs_Dalloway-1925-F.txt'])
print("\nThe 101st paragraph of Mrs Dalloway:\n", woolf[100])
print("\nThe second sentence of that paragraph:\n", woolf[100][1])

sizes = [i for i in corpus.sizes(categories=['AF'])]
print("\nAmerican-female subcorpus file sizes in bytes:\n", sizes)

print("\nThe first 300 characters of the British-male subcorpus:")
for doc in corpus.docs(categories=['BM']):
    print(doc[:300])
    break

print("\nThe 1,000,001-1,000,020th characters of the corpus:")
print(corpus.raw()[1000000:1000020])

print("\nTotal words in the corpus:")
print(len(corpus.words()))

Categories in the corpus:
 ['A', 'AF', 'AM', 'B', 'BF', 'BM', 'F', 'M']

The first five fileids:
 ['A-Alcott-Little_Women-1868-F.txt', 'A-Cather-Antonia-1918-F.txt', 'A-Chesnutt-Marrow-1901-M.txt', 'A-Chopin-Awakening-1899-F.txt', 'A-Crane-Maggie-1893-M.txt']

The 101st paragraph of Mrs Dalloway:
 [['Millicent', 'Bruton', ',', 'whose', 'lunch', 'parties', 'were', 'said', 'to', 'be', 'extraordinarily', 'amusing', ',', 'had', 'not', 'asked', 'her', '.'], ['No', 'vulgar', 'jealousy', 'could', 'separate', 'her', 'from', 'Richard', '.'], ['But', 'she', 'feared', 'time', 'itself', ',', 'and', 'read', 'on', 'Lady', 'Bruton', "'", 's', 'face', ',', 'as', 'if', 'it', 'had', 'been', 'a', 'dial', 'cut', 'in', 'impassive', 'stone', ',', 'the', 'dwindling', 'of', 'life', ';', 'how', 'year', 'by', 'year', 'her', 'share', 'was', 'sliced', ';', 'how', 'little', 'the', 'margin', 'that', 'remained', 'was', 'capable', 'any', 'longer', 'of', 'stretching', ',', 'of', 'absorbing', ',', 'as', 'in', 'the', 'y

Note that the only truly slow part of the code above is the bit that counts all the words. If this were production code, you'd want to calculate that value once and store the result.

## Mini problems

What fraction of the total corpus words are contained in the British subcorpus?

In [10]:
total_words = len(corpus.words())
british_words = len(corpus.words(categories=['B']))
print(f"Fraction British words: {round(british_words/total_words,3)}")

Fraction British words: 0.609


How many books in the corpus are written by female authors?

In [11]:
print("Female-authored books:", len(corpus.fileids(categories=['F'])))

Female-authored books: 20


Print words 2-4 of the second sentence of the fifth paragraph of _Middlemarch_.

In [12]:
print("Indicated words from Middlemarch:", 
      corpus.paras(fileids=['B-Eliot-Middlemarch-1869-F.txt'])[4][1][1:4])

Indicated words from Middlemarch: ['hand', 'and', 'wrist']


List the file names of books by American men.

In [13]:
print("Files by American men:\n", corpus.fileids(categories=['AM']))

Files by American men:
 ['A-Chesnutt-Marrow-1901-M.txt', 'A-Crane-Maggie-1893-M.txt', 'A-Dreiser-Sister_Carrie-1900-M.txt', 'A-Hawthorne-Scarlet_Letter-1850-M.txt', 'A-Howells-Silas_Lapham-1885-M.txt', 'A-James-Golden_Bowl-1904-M.txt', 'A-London-Call_Wild-1903-M.txt', 'A-Melville-Moby_Dick-1851-M.txt', 'A-Norris-Pit-1903-M.txt', 'A-Twain-Huck_Finn-1885-M.txt']


In [16]:
# Or, more legibly:
print("Volumes by American men:\n")
for file in corpus.fileids(categories=['AM']):
    print(file)

Volumes by American men:

A-Chesnutt-Marrow-1901-M.txt
A-Crane-Maggie-1893-M.txt
A-Dreiser-Sister_Carrie-1900-M.txt
A-Hawthorne-Scarlet_Letter-1850-M.txt
A-Howells-Silas_Lapham-1885-M.txt
A-James-Golden_Bowl-1904-M.txt
A-London-Call_Wild-1903-M.txt
A-Melville-Moby_Dick-1851-M.txt
A-Norris-Pit-1903-M.txt
A-Twain-Huck_Finn-1885-M.txt
