# Working with corpus data

## Background

In this exercise, we'll make use of the (further) modified NLTK corpus reader from the previous exercise and of a new version that deals with saved (pickled) corpora. Your tasks -- outlined below -- will be to put these readers to use to begin characterizing the corpora.

First, you'll need to update the file locations below to match your own system:

In [1]:
import os

# Locations of the corpus texts and the pickled output on your system
text_dir = os.path.join('..', 'data', 'texts')
pickle_dir = os.path.join('..', 'data', 'pickled')
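Before moving on, it can help to confirm that the path actually resolves. Here is a minimal sketch, assuming the texts are `.txt` files sitting directly inside `text_dir`. The helper name `count_files` is made up for illustration and is not part of the exercise code.

```python
import os
from glob import glob

def count_files(directory, extension):
    """Hypothetical helper: count files with the given extension in a directory."""
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"No such directory: {os.path.abspath(directory)}")
    return len(glob(os.path.join(directory, f'*.{extension}')))

# Same layout as the cell above; adjust to match your own system
text_dir = os.path.join('..', 'data', 'texts')

try:
    print(count_files(text_dir, 'txt'), "text files found in", text_dir)
except FileNotFoundError as err:
    # Report the problem instead of failing later inside the corpus reader
    print(err)
```

If the count is zero or the directory is missing, fix the paths before running the cells that follow.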

## Existing code

Now, review the code in the next (long) cell to make sure you understand it. I'm not going to make you mess with it just for the sake of messing with it, but you should know how you would, for instance, change the `tokenize()` or `process()` methods to return lowercase tokens or remove punctuation, etc. 

In [2]:
DOC_PATTERN = r'.+\.txt'        # Documents are just files that end in '.txt'
PKL_PATTERN = r'.+\.pickle'     # Pickled files end in .pickle
CAT_PATTERN = r'([a-z_\s]+)/.*' # We won't use this, but fall back to directory-based labels
                                # if no other labels are supplied

import codecs
import time
import nltk
import os
import pickle
from   glob import glob
from   nltk.corpus.reader.api import CorpusReader
from   nltk.corpus.reader.api import CategorizedCorpusReader
from   nltk import pos_tag, sent_tokenize, wordpunct_tokenize

def make_cat_map(path, extension):
    """
    Takes a directory path and file extension (e.g., 'txt').
    Returns a dictionary of file:category mappings from standard file names:
      nation-author-title-year-gender
    """
    file_paths = glob(os.path.join(path, f'*.{extension}'))
    file_names = [os.path.split(i)[1] for i in file_paths]
    category_map = {} # Dict to hold filename:[categories] mappings
    for file in file_names:
        parsed = os.path.splitext(file)[0].split('-') # drop extension (rstrip() strips characters, not a suffix), then split on hyphens
        nation = parsed[0]
        gender = parsed[4]
        category_map[file] = [nation, gender, nation+gender]
    return category_map

class TMNCorpusReader(CategorizedCorpusReader, CorpusReader):
    """
    A corpus reader for categorized text documents to enable preprocessing.
    """
    
    def __init__(
        self, 
        root, 
        fileids=DOC_PATTERN,
        encoding='utf8', 
        **kwargs
    ):
        """
        Initialize the corpus reader.  Categorization arguments
        (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
        the ``CategorizedCorpusReader`` constructor.  The remaining
        arguments are passed to the ``CorpusReader`` constructor.
        """
        # Add the default category pattern if not passed into the class.
        if not any(key.startswith('cat_') for key in kwargs.keys()):
            # First, try to build a cat_map from standard-style filenames
            try: 
                kwargs['cat_map'] = make_cat_map(root, 'txt')
            # On error, fall back to dir names for categories    
            except Exception as e:
                print(type(e), e, "\nUnable to build category map from file names.\nFalling back to categories by directory name.")
                kwargs['cat_pattern'] = CAT_PATTERN

        # Initialize the NLTK corpus reader objects
        CategorizedCorpusReader.__init__(self, kwargs)
        CorpusReader.__init__(self, root, fileids, encoding)
        
    def resolve(self, fileids, categories):
        """
        Returns a list of fileids or categories depending on what is passed
        to each internal corpus reader function. Implemented similarly to
        the NLTK ``CategorizedPlaintextCorpusReader``.
        """
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")

        if categories is not None:
            return self.fileids(categories)
        return fileids

    def docs(self, fileids=None, categories=None):
        """
        Yields the complete text of each document, closing each file after
        reading it so that only one document is in memory at a time.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)

        # Create a generator, loading one document into memory at a time.
        for path, encoding in self.abspaths(fileids, include_encoding=True):
            with codecs.open(path, 'r', encoding=encoding) as f:
                yield f.read()

    def sizes(self, fileids=None, categories=None):
        """
        Yields the size on disk (in bytes) of each file.
        This function is used to detect oddly large files in the corpus.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)

        # Create a generator, getting every path and computing filesize
        for path in self.abspaths(fileids):
            yield os.path.getsize(path)
            
    def paras(self, fileids=None, categories=None):
        """
        Uses splitlines() to parse the paragraphs from plain text.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)
        
        for doc in self.docs(fileids):
            for par in doc.splitlines():
                if len(par) > 0:
                    yield par

    def sents(self, fileids=None, categories=None):
        """
        Uses NLTK's ``sent_tokenize`` to extract sentences from the
        paragraphs.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)
        
        for paragraph in self.paras(fileids):
            for sentence in sent_tokenize(paragraph):
                yield sentence

    def words(self, fileids=None, categories=None):
        """
        Uses NLTK's ``wordpunct_tokenize`` to extract tokens from the
        sentences.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)
        
        for sentence in self.sents(fileids):
            for token in wordpunct_tokenize(sentence):
                yield token

    def describe(self, fileids=None, categories=None):
        """
        Performs a single pass of the corpus and
        returns a dictionary with a variety of metrics
        concerning the state of the corpus.
        """
        started = time.time()

        # Structures to perform counting.
        counts  = nltk.FreqDist()
        tokens  = nltk.FreqDist()

        # Perform single pass over paragraphs, tokenize and count
        for para in self.paras(fileids, categories):
            counts['paras'] += 1

            for sent in sent_tokenize(para):
                counts['sents'] += 1

                for word in wordpunct_tokenize(sent):
                    counts['words'] += 1
                    tokens[word] += 1

        # Compute the number of files and categories in the corpus
        n_fileids = len(self.resolve(fileids, categories) or self.fileids())
        n_topics  = len(self.categories(self.resolve(fileids, categories)))

        # Return data structure with information
        return {
            'files':  n_fileids,
            'categories': n_topics,
            'paragraphs':  counts['paras'],
            'sentences':  counts['sents'],
            'words':  counts['words'],
            'vocabulary_size':  len(tokens),
            'lexical_diversity': float(counts['words']) / float(len(tokens)),
            'paras_per_doc':  float(counts['paras']) / float(n_fileids),
            'words_per_doc':  float(counts['words']) / float(n_fileids),
            'sents_per_para':  float(counts['sents']) / float(counts['paras']),
            'secs':   time.time() - started,
        }
    
class PickledCorpusReader(CategorizedCorpusReader, CorpusReader):

    def __init__(self, root, fileids=PKL_PATTERN, **kwargs):
        """
        Initialize the corpus reader.  Categorization arguments
        (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
        the ``CategorizedCorpusReader`` constructor.  The remaining arguments
        are passed to the ``CorpusReader`` constructor.
        """
        # Add the default category pattern if not passed into the class.
        if not any(key.startswith('cat_') for key in kwargs.keys()):
            # First, try to build a cat_map from standard-style filenames
            try: 
                kwargs['cat_map'] = make_cat_map(root, 'pickle')
            # On error, fall back to dir names for categories    
            except Exception as e:
                print(type(e), e, "\nUnable to build category map from file names.\nFalling back to categories by directory name.")
                kwargs['cat_pattern'] = CAT_PATTERN

        CategorizedCorpusReader.__init__(self, kwargs)
        CorpusReader.__init__(self, root, fileids)

    def resolve(self, fileids, categories):
        """
        Returns a list of fileids or categories depending on what is passed
        to each internal corpus reader function. This primarily bubbles up to
        the high level ``docs`` method, but is implemented here similar to
        the nltk ``CategorizedPlaintextCorpusReader``.
        """
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")

        if categories is not None:
            return self.fileids(categories)
        return fileids

    def docs(self, fileids=None, categories=None):
        """
        Returns the document loaded from a pickled object for every file in
        the corpus. Similar to the ``TMNCorpusReader``, this uses a generator
        to achieve memory safe iteration.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)

        # Create a generator, loading one document into memory at a time.
        for path, enc, fileid in self.abspaths(fileids, True, True):
            with open(path, 'rb') as f:
                yield pickle.load(f)

    def paras(self, fileids=None, categories=None):
        """
        Returns a generator of paragraphs where each paragraph is a list of
        sentences, which is in turn a list of (token, tag) tuples.
        """
        for doc in self.docs(fileids, categories):
            for paragraph in doc:
                yield paragraph

    def sents(self, fileids=None, categories=None):
        """
        Returns a generator of sentences where each sentence is a list of
        (token, tag) tuples.
        """
        for paragraph in self.paras(fileids, categories):
            for sentence in paragraph:
                yield sentence

    def tagged(self, fileids=None, categories=None):
        """
        Returns a generator of (token, tag) tuples.
        """
        for sent in self.sents(fileids, categories):
            for token in sent:
                yield token

    def words(self, fileids=None, categories=None):
        """
        Returns a generator of tokens, with the part-of-speech tags dropped.
        """
        for token in self.tagged(fileids, categories):
            yield token[0]
            
class Preprocessor(object):
    """
    The preprocessor wraps a corpus object (usually a `TMNCorpusReader`),
    tokenizes and part-of-speech tags each document, and writes the results
    to a target directory in a format that can be read by the
    `PickledCorpusReader`.
    """

    def __init__(self, corpus, target=None, **kwargs):
        """
        The corpus is the `TMNCorpusReader` to preprocess and pickle.
        The target is the directory on disk to output the pickled corpus to.
        """
        self.corpus = corpus
        self.target = target

    def fileids(self, fileids=None, categories=None):
        """
        Helper function to access the fileids of the corpus
        """
        fileids = self.corpus.resolve(fileids, categories)
        if fileids:
            return fileids
        return self.corpus.fileids()

    def abspath(self, fileid):
        """
        Returns the absolute path to the target fileid from the corpus fileid.
        """
        # Find the directory, relative from the corpus root.
        parent = os.path.relpath(
            os.path.dirname(self.corpus.abspath(fileid)), self.corpus.root
        )

        # Compute the name parts to reconstruct
        basename  = os.path.basename(fileid)
        name, ext = os.path.splitext(basename)

        # Create the pickle file extension
        basename  = name + '.pickle'

        # Return the path to the file relative to the target.
        return os.path.normpath(os.path.join(self.target, parent, basename))

    def tokenize(self, fileid):
        """
        Segments, tokenizes, and tags a document in the corpus. Returns a
        generator of paragraphs, which are lists of sentences, which in turn
        are lists of part of speech tagged words.
        """
        for paragraph in self.corpus.paras(fileids=fileid):
            yield [
                pos_tag(wordpunct_tokenize(sent))
                for sent in sent_tokenize(paragraph)
            ]

    def process(self, fileid):
        """
        For a single file does the following preprocessing work:
            1. Checks the location on disk to make sure no errors occur.
            2. Gets all paragraphs for the given text.
            3. Segments the paragraphs with the sent_tokenizer
            4. Tokenizes the sentences with the wordpunct_tokenizer
            5. Tags the sentences using the default pos_tagger
            6. Writes the document as a pickle to the target location.
        This method is called multiple times from the transform runner.
        """
        # Compute the outpath to write the file to.
        target = self.abspath(fileid)
        parent = os.path.dirname(target)

        # Make sure the directory exists
        if not os.path.exists(parent):
            os.makedirs(parent)

        # Make sure that the parent is a directory and not a file
        if not os.path.isdir(parent):
            raise ValueError(
                "Please supply a directory to write preprocessed data to."
            )

        # Create a data structure for the pickle
        document = list(self.tokenize(fileid))

        # Open and serialize the pickle to disk
        with open(target, 'wb') as f:
            pickle.dump(document, f, pickle.HIGHEST_PROTOCOL)

        # Clean up the document
        del document

        # Return the target fileid
        return target

    def transform(self, fileids=None, categories=None):
        """
        Transform the wrapped corpus, writing out the segmented, tokenized,
        and part of speech tagged corpus as a pickle to the target directory.
        Returns the list of paths to the pickled files that were written.
        """
        # Make the target directory if it doesn't already exist
        if not os.path.exists(self.target):
            os.makedirs(self.target)

        # Resolve the fileids to start processing and return the list of 
        # target file ids to pass to downstream transformers. 
        return [
            self.process(fileid)
            for fileid in self.fileids(fileids, categories)
        ]
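The review note above mentions changing `tokenize()` to return lowercase tokens or to remove punctuation. Here is one way that might look, as a sketch rather than a required change. The helper `normalize_tokens` is hypothetical; it works on a plain list of tokens so that it runs without NLTK. In `Preprocessor.tokenize()` you would apply it to the output of `wordpunct_tokenize(sent)`.

```python
import string

# ASCII punctuation only; curly quotes and similar characters are NOT covered
PUNCT = set(string.punctuation)

def normalize_tokens(tokens, lowercase=True, drop_punct=True):
    """Hypothetical helper: lowercase tokens and drop all-punctuation tokens."""
    out = []
    for tok in tokens:
        # A token like '--' or '."' consists entirely of punctuation characters
        if drop_punct and all(ch in PUNCT for ch in tok):
            continue
        out.append(tok.lower() if lowercase else tok)
    return out

# Tokens as a wordpunct-style tokenizer might return them
tokens = ['He', 'got', 'religion', 'at', 'a', 'camp', '-', 'meeting', ',',
          'four', 'years', 'ago', '.']
print(normalize_tokens(tokens))
```

One design point to weigh: lowercasing before `pos_tag()` would likely change the tagger's output (capitalization is a cue for proper nouns), so it may be safer to normalize after tagging, or at analysis time as in task 4 below.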

## Load, preprocess, and pickle to disk

Execute this code. Again, be sure you understand what it's doing ...

In [3]:
# Initialize our corpus reader
corpus = TMNCorpusReader(text_dir, r'.+\.txt')

In [4]:
# Get descriptive stats for full corpus
description = corpus.describe()
for key in description:
    print(key, description[key])

files 40
categories 8
paragraphs 98477
sentences 295138
words 6488322
vocabulary_size 64215
lexical_diversity 101.04059799112358
paras_per_doc 2461.925
words_per_doc 162208.05
sents_per_para 2.9970246859672818
secs 26.308953046798706
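Note how `lexical_diversity` is defined in `describe()`: total words divided by vocabulary size, i.e. the average number of times each distinct token is used. That is the inverse of the more familiar type-token ratio. A small sketch on a toy sentence, where the function name `lexical_stats` is an illustrative assumption:

```python
from collections import Counter

def lexical_stats(tokens):
    """Hypothetical helper reproducing two of the describe() metrics."""
    counts = Counter(tokens)
    n_words = sum(counts.values())   # total tokens
    n_types = len(counts)            # distinct tokens
    return {
        'words': n_words,
        'vocabulary_size': n_types,
        'lexical_diversity': n_words / n_types,  # same definition as describe()
        'type_token_ratio': n_types / n_words,   # the inverse measure
    }

toy = "the cat sat on the mat and the dog sat too".split()
stats = lexical_stats(toy)
print(stats)

# The corpus-level figure printed above can be re-derived from its parts
print(round(6488322 / 64215, 2))
```

Because vocabulary grows more slowly than word count, this ratio tends to rise with corpus size. Keep that in mind when comparing subcorpora that contain different numbers of words.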


In [5]:
%%time
# Compare stats for A/B and M/F categories
a_desc = corpus.describe(categories=['A'])
b_desc = corpus.describe(categories=['B'])
f_desc = corpus.describe(categories=['F'])
m_desc = corpus.describe(categories=['M'])

print("American/British comparison")
for key in a_desc.keys():
    print(key, round(a_desc[key]/b_desc[key], 2))

print("\nMale/female comparison")
for key in m_desc.keys():
    print(key, round(m_desc[key]/f_desc[key], 2))

American/British comparison
files 1.0
categories 1.0
paragraphs 0.71
sentences 0.67
words 0.64
vocabulary_size 0.96
lexical_diversity 0.67
paras_per_doc 0.71
words_per_doc 0.64
sents_per_para 0.96
secs 0.61

Male/female comparison
files 1.0
categories 1.0
paragraphs 1.25
sentences 1.37
words 1.19
vocabulary_size 1.2
lexical_diversity 0.99
paras_per_doc 1.25
words_per_doc 1.19
sents_per_para 1.1
secs 1.23
CPU times: user 51.2 s, sys: 167 ms, total: 51.4 s
Wall time: 51.6 s


In [6]:
# Initialize preprocessor
preproc = Preprocessor(corpus, pickle_dir)

In [7]:
%%time
# Perform preprocessing and save output to disk
processed = preproc.transform()

CPU times: user 5min 32s, sys: 9.14 s, total: 5min 41s
Wall time: 6min 16s
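Each pickled file holds one document as nested lists: paragraphs, then sentences, then `(token, tag)` tuples. The sketch below round-trips a toy document through `pickle` and flattens it the way `paras()`, `sents()`, and `tagged()` do. The tags are hand-written for illustration, not tagger output, and the temporary file stands in for your `pickle_dir`.

```python
import os
import pickle
import tempfile

# Toy document: list of paragraphs -> list of sentences -> list of (token, tag)
document = [
    [  # paragraph 1
        [('He', 'PRP'), ('got', 'VBD'), ('religion', 'NN'), ('.', '.')],
    ],
    [  # paragraph 2
        [('Tom', 'NNP'), ('is', 'VBZ'), ('good', 'JJ'), ('.', '.')],
        [('Yes', 'UH'), ('.', '.')],
    ],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pickle')
    # Write the document the same way Preprocessor.process() does
    with open(path, 'wb') as f:
        pickle.dump(document, f, pickle.HIGHEST_PROTOCOL)
    # Read it back the same way PickledCorpusReader.docs() does
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# Flatten paragraphs -> sentences -> (token, tag) pairs, then drop the tags
tagged = [pair for para in loaded for sent in para for pair in sent]
words = [token for token, tag in tagged]
print(len(loaded), "paragraphs,", sum(len(p) for p in loaded), "sentences,",
      len(words), "tokens")
```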


In [4]:
# Show that we can work with the pickled versions
pcorpus = PickledCorpusReader(pickle_dir)
print("Categories in the corpus:\n", pcorpus.categories())
print("\nBritish-female-authored files:\n", pcorpus.fileids(categories=['BF']))
print("\nA bit of one pickled text")
for doc in pcorpus.docs(fileids=['A-Stowe-Uncle_Tom-1852-F.pickle']):
    print(doc[11])
    break

Categories in the corpus:
 ['A', 'AF', 'AM', 'B', 'BF', 'BM', 'F', 'M']

British-female-authored files:
 ['B-Austen-Pride_Prejudice-1813-F.pickle', 'B-Bronte_C-Jane_Eyre-1847-F.pickle', 'B-Bronte_E-Wuthering_Heights-1847-F.pickle', 'B-Burney-Evelina-1778-F.pickle', 'B-Eliot-Middlemarch-1869-F.pickle', 'B-Gaskell-North_South-1855-F.pickle', 'B-Mitford-Our_Village-1826-F.pickle', 'B-Radcliffe-Mysteries_Udolpho-1794-F.pickle', 'B-Shelley-Frankenstein-1818-F.pickle', 'B-Woolf-Mrs_Dalloway-1925-F.pickle']

A bit of one pickled text
[[('“', 'JJ'), ('No', 'NNP'), (';', ':'), ('I', 'PRP'), ('mean', 'VBP'), (',', ','), ('really', 'RB'), (',', ','), ('Tom', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), (',', ','), ('steady', 'JJ'), (',', ','), ('sensible', 'JJ'), (',', ','), ('pious', 'JJ'), ('fellow', 'NN'), ('.', '.')], [('He', 'PRP'), ('got', 'VBD'), ('religion', 'NN'), ('at', 'IN'), ('a', 'DT'), ('camp', 'NN'), ('-', ':'), ('meeting', 'NN'), (',', ','), ('four', 'CD'), ('years', 'NNS')
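The category labels shown above (`'A'`, `'BF'`, and so on) come from the file-name convention `nation-author-title-year-gender`. Here is a standalone sketch of that parsing for a single file name. `parse_categories` is a hypothetical helper that mirrors what `make_cat_map()` does per file, with an added check on the number of fields.

```python
import os

def parse_categories(filename):
    """
    Hypothetical helper: derive [nation, gender, nation+gender] from a name
    of the form nation-author-title-year-gender.ext
    """
    # splitext removes the extension as a suffix; str.rstrip() would instead
    # strip any trailing characters drawn from the set it is given
    stem = os.path.splitext(filename)[0]
    parts = stem.split('-')
    if len(parts) != 5:
        raise ValueError(f"Unexpected file name format: {filename}")
    nation, author, title, year, gender = parts
    return [nation, gender, nation + gender]

print(parse_categories('B-Austen-Pride_Prejudice-1813-F.pickle'))
print(parse_categories('A-Stowe-Uncle_Tom-1852-F.txt'))
```

Each file therefore belongs to three categories at once, which is why the corpus reports eight categories in total: two nations, two genders, and four nation-gender combinations.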

## Finally, your code!

So, now we've read the corpus, preprocessed it, saved the tokenized and part-of-speech-tagged version to disk, and loaded the processed version (as `pcorpus`) for investigation.

Your tasks:

1. Print a list of the 20 most frequently occurring words in the British female ('BF') corpus.
2. Print a list of the 10 most frequently occurring parts of speech in the full corpus. Note that the part-of-speech tags are from the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). You can get more info about the tags via `nltk.help.upenn_tagset()`.
3. Devise a method to assess which parts of speech are under- and over-represented in the American corpus relative to the British corpus. This could be simple, or it might be more complex (especially if you know a bit about statistics). Simple is OK. Remember that the subcorpora *do not* contain the same number of words (this may or may not matter, depending on how you approach the problem). Print a list of the 5 parts of speech that are most overrepresented in the American corpus according to your metric.
4. Print the same list as in task 1, but using lemmatized, lowercase versions of the words and removing stopwords and punctuation. You do not need to reprocess the corpus; you can transform the tokenized output that we're already using. To lemmatize, use `nltk.stem.wordnet.WordNetLemmatizer()`. You'll need to translate between Penn Treebank tags and WordNet tags; there's a function for this included below. You can get a list of English stopwords (high-frequency words like 'the' and 'an' that are not generally informative on their own) from `nltk.corpus.stopwords.words('english')` and a list of punctuation from `string.punctuation`.

In [5]:
# 1. Top 20 words in BF corpus
from collections import Counter
words_bf = Counter()
for token in pcorpus.tagged(categories='BF'):
    words_bf[token[0]] += 1
print("Top 20 words in BF corpus:")
for word in words_bf.most_common(n=20):
    print(word[0], '\t', word[1])

Top 20 words in BF corpus:
, 	 129573
the 	 66885
. 	 57577
to 	 47804
and 	 47011
of 	 41668
I 	 34279
a 	 29904
' 	 24965
in 	 21795
her 	 20872
was 	 20242
; 	 19381
that 	 19251
he 	 16491
she 	 15132
you 	 14959
it 	 14804
had 	 14703
his 	 13635


In [6]:
# 2. Top 10 PoS in full corpus
pos_all = Counter()
for token in pcorpus.tagged():
    pos_all[token[1]] += 1
print("Top 10 parts of speech in full corpus:")
for pos in pos_all.most_common(n=10):
    print(pos[0], '\t', pos[1])

Top 10 parts of speech in full corpus:
NN 	 873979
IN 	 653926
DT 	 501505
PRP 	 495040
, 	 402550
JJ 	 384182
VBD 	 374716
RB 	 328877
NNP 	 288752
. 	 251017


In [7]:
# 3. Under/overrepresentation of PoS in A corpus vs. B corpus.
import numpy as np
pos_a = Counter()
pos_b = Counter()
for token in pcorpus.tagged(categories='A'):
    pos_a[token[1]] += 1
for token in pcorpus.tagged(categories='B'):
    pos_b[token[1]] += 1
wc_a = np.sum(list(pos_a.values()))
wc_b = np.sum(list(pos_b.values()))

pos_ratios = Counter()
wordcount_ratio = wc_b/wc_a
for tag in pos_b:
    # pos_b[tag] is never zero here; tags that occur only in A are skipped
    pos_ratios[tag] = pos_a[tag]/pos_b[tag] * wordcount_ratio
print("Top 10 PoS enriched in A corpus relative to B:")
for tag in pos_ratios.most_common(10):
    print(tag[0], '\t', round(tag[1], 2))
print("\nTop 10 PoS deficient in A corpus relative to B:")
for tag in pos_ratios.most_common()[:-11:-1]:
    print(tag[0], '\t', round(tag[1], 2))

Top 10 PoS enriched in A corpus relative to B:
$ 	 5.83
FW 	 2.58
'' 	 1.51
RP 	 1.48
SYM 	 1.25
CD 	 1.16
NNPS 	 1.1
JJ 	 1.09
EX 	 1.09
NNS 	 1.08

Top 10 PoS deficient in A corpus relative to B:
) 	 0.4
( 	 0.4
WP$ 	 0.62
MD 	 0.72
VBZ 	 0.85
WP 	 0.85
VBN 	 0.85
WDT 	 0.85
TO 	 0.87
PRP$ 	 0.87
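A possible refinement of the ratio above, offered as a sketch rather than the expected answer: take the log of the ratio so that over- and under-representation are symmetric around zero, and add a small constant to every count so that tags missing from one subcorpus don't have to be skipped. The helper name `log_ratios`, the smoothing value `alpha`, and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter

def log_ratios(counts_a, counts_b, alpha=1.0):
    """
    Hypothetical helper: log2 ratio of smoothed relative frequencies.
    Positive = over-represented in A; negative = over-represented in B.
    """
    tags = set(counts_a) | set(counts_b)
    # Add alpha to every tag count, so the totals grow by alpha * number of tags
    total_a = sum(counts_a.values()) + alpha * len(tags)
    total_b = sum(counts_b.values()) + alpha * len(tags)
    scores = {}
    for tag in tags:
        freq_a = (counts_a.get(tag, 0) + alpha) / total_a
        freq_b = (counts_b.get(tag, 0) + alpha) / total_b
        scores[tag] = math.log2(freq_a / freq_b)
    return scores

# Toy counts, invented for illustration only
toy_a = Counter({'NN': 500, 'VBD': 300, 'RP': 40, '$': 3})
toy_b = Counter({'NN': 1000, 'VBD': 600, 'RP': 40, 'MD': 50})

scores = log_ratios(toy_a, toy_b)
for tag, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(tag, '\t', round(score, 2))
```

Smoothing does not stop very rare tags (like `$` in the output above) from topping the list. If that bothers you, one option is to ignore tags that fall below some minimum count in both subcorpora.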


In [44]:
# Using chi-squared statistic
import pandas as pd
from sklearn.feature_selection import chi2
from collections import Counter

dpm = pd.DataFrame(index=sorted(pos_all.keys()))
labels = []
for nation in ['A', 'B']:
    for fileid in pcorpus.fileids(categories=nation):
        pos = Counter()
        for token in pcorpus.tagged(fileids=fileid):
            pos[token[1]] += 1
        df = pd.Series(pos, name=fileid)
        df.sort_index(inplace=True)
        df = df/df.sum()
        df = pd.DataFrame(df)
        labels.append(nation)
        dpm = dpm.join(df)
dpm.fillna(0, inplace=True)
dpm = dpm.T
dpm.head()

Unnamed: 0,$,'',(,),",",.,:,CC,CD,DT,...,VBD,VBG,VBN,VBP,VBZ,WDT,WP,WP$,WRB,``
A-Alcott-Little_Women-1868-F.pickle,4e-06,0.015878,0.000118,3.9e-05,0.077302,0.034055,0.003976,0.043827,0.004395,0.068154,...,0.060621,0.01662,0.016327,0.021927,0.009619,0.003893,0.004286,0.000157,0.005709,0.0
A-Cather-Antonia-1918-F.pickle,0.0,0.014392,5.1e-05,0.0,0.052536,0.047359,0.012598,0.037385,0.005115,0.078603,...,0.080377,0.01512,0.017406,0.01429,0.005197,0.00325,0.003434,8.2e-05,0.007237,0.0
A-Chesnutt-Marrow-1901-M.pickle,0.0,0.01866,0.0,0.0,0.061304,0.039306,0.008001,0.026679,0.004998,0.090023,...,0.053366,0.011067,0.02803,0.0168,0.009561,0.006014,0.004617,0.000399,0.004073,0.0
A-Chopin-Awakening-1899-F.pickle,0.0,0.00547,1.3e-05,0.0,0.045042,0.052318,0.013518,0.034914,0.004868,0.08044,...,0.075847,0.020467,0.021893,0.012144,0.007158,0.007394,0.004345,0.000209,0.006478,0.0
A-Crane-Maggie-1893-M.pickle,0.0,0.018938,7e-05,0.0,0.046577,0.059296,0.007757,0.026591,0.003774,0.087669,...,0.074251,0.016842,0.0181,0.011775,0.008002,0.002306,0.004682,0.00028,0.004612,0.0


In [58]:
chisq, _ = chi2(dpm, labels)
chisq = pd.DataFrame(chisq, index=dpm.columns, columns=['chi-sq'])
print("Most distinctive PoS tags")
display(chisq.sort_values(by='chi-sq', ascending=False).head(10))

Most distinctive PoS tags


| | chi-sq |
|---|---|
| `''` | 0.021468 |
| `MD` | 0.013908 |
| `:` | 0.0079 |
| `RP` | 0.006745 |
| `NN` | 0.005711 |
| `VBD` | 0.005057 |
| `FW` | 0.004629 |
| `VB` | 0.00392 |
| `(` | 0.00391 |
| `JJ` | 0.003515 |
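The scores above are tiny, likely because the features passed to `chi2` are per-file proportions rather than counts, and the chi-squared statistic scales with the magnitude of its inputs. One alternative, sketched here with the standard library only, is to pool raw counts by nation and compute a 2x2 Pearson chi-squared per tag (this tag vs. all other tags, A vs. B). The function name `chi2_2x2` and the toy counts are assumptions for illustration.

```python
def chi2_2x2(a, b, total_a, total_b):
    """
    Pearson chi-squared statistic for the 2x2 table
        [[a, total_a - a],    # corpus A: this tag, all other tags
         [b, total_b - b]]    # corpus B: this tag, all other tags
    """
    c = total_a - a
    d = total_b - b
    n = total_a + total_b
    # Product of the four marginal totals
    denom = (a + b) * (c + d) * total_a * total_b
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Toy counts, invented for illustration only
toy_a = {'NN': 500, 'VBD': 300, 'MD': 20}
toy_b = {'NN': 1000, 'VBD': 600, 'MD': 90}
total_a = sum(toy_a.values())
total_b = sum(toy_b.values())

for tag in toy_a:
    stat = chi2_2x2(toy_a[tag], toy_b[tag], total_a, total_b)
    print(tag, '\t', round(stat, 2))
```

With millions of tokens, nearly every difference will produce a large statistic, and tokens within a text are not independent draws. Treat the result as a ranking aid rather than a formal significance test.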


In [55]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

Examine the list of over-represented parts of speech. Do you feel you've found a good metric? Why or why not? What else might you try? Enter a sentence or two of reflection here ...


In [12]:
# 4. Top 20 lemmatized, lowercase, non-punctuation, non-stopwords in BF corpus
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
import string

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words('english')).union(set(string.punctuation))

def wordnet_tag(penn_tag):
    """
    For an input Penn Treebank PoS tag, return a WordNet PoS tag or None.
    """
    if penn_tag.startswith('N'):
        return wordnet.NOUN
    elif penn_tag.startswith('V'):
        return wordnet.VERB
    elif penn_tag.startswith('J'):
        return wordnet.ADJ
    elif penn_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None # input tag is other or not recognized

def get_lemma(token, pos):
    '''
    Return the lemmatized version of an English word.
    '''
    pos = wordnet_tag(pos)
    if pos:
        lemma = lemmatizer.lemmatize(token, pos)
    else:
        lemma = lemmatizer.lemmatize(token)
    return lemma

lemmas_bf = Counter()
for token in pcorpus.tagged(categories='BF'):
    word = token[0].lower()
    pos = token[1]
    if word not in stop:
        lemma = get_lemma(word, pos)
        lemmas_bf[lemma] += 1
        
print("Top 20 lowercased, non-stopword lemmas in the BF corpus:")
for lem in lemmas_bf.most_common(n=20):
    print(lem[0], '\t', lem[1])

Top 20 lowercased, non-stopword lemmas in the BF corpus:
say 	 9625
mr 	 7660
-- 	 6979
would 	 6072
." 	 5072
go 	 4789
could 	 4430
one 	 4185
know 	 4026
make 	 3922
come 	 3902
think 	 3884
see 	 3532
," 	 3520
look 	 3383
.' 	 3146
take 	 2873
like 	 2751
good 	 2673
time 	 2593


Examine this list. Are there other words or symbols we might want to add to our list of stopwords? No need to write this up, but we'll discuss it in class.
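As a starting point for that discussion, here is one possible filter, sketched on a toy list. Multi-character punctuation tokens such as `--` and `."` get through because `string.punctuation` only contains single characters, so the sketch drops any token with no letter or digit in it. The extra stopwords are illustrative choices to debate, not a recommendation.

```python
from collections import Counter

def has_alnum(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Hypothetical additions to the stopword list
extra_stop = {'mr', 'would', 'could', 'one'}

# A toy stand-in for the lemma stream from the cell above
toy_lemmas = ['say', 'mr', '--', 'would', '."', 'go', ',"', 'say', ".'", 'time']

kept = Counter(
    lemma for lemma in toy_lemmas
    if has_alnum(lemma) and lemma not in extra_stop
)
print(kept.most_common())
```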