Working with a corpus. This corpus template was taken from the animorphs corpus work. You could get way more involved with it. Might be good to think about what of this is the bare minimum and what could go further. Also note that this won't run in cell with the main piece
Q for Brandon: what do you want to keep from here
____

When working with a corpus of texts, it can quickly become confusing to keep track of which step in an NLP pipeline you are on. Say you want to run a frequency distribution: did you remember to tokenize the text? To pull out the stopwords? While this is simple enough if you are working with a small group of texts in a discrete time period, it quickly becomes challenging when working with a large body of texts or over a longer period of time. Matters become more complicated if you want to switch between corpus-level analysis and text-level analysis. The realities of your project may quickly mean that manually performing each step in your pipeline becomes redundant, hard to keep track of, or a waste of time. This is where classes come in. Classes allow you to store the qualities of your corpus (its "attributes") together with instructions for things you want to execute on those attributes (called "methods"). Ultimately, using classes allows you to more easily organize text-level and corpus-level functions, is easier to grasp when working at scale, and allows you to store your parameters so they can be imported as a module (a file that contains Python definitions and statements).
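
To make "attributes" and "methods" concrete before the full template, here is a minimal sketch. The class name `TinyText` and its `most_common` method are hypothetical and are not part of the template; tokenizing with `str.split` is a simplifying assumption so the snippet runs without NLTK.

```python
from collections import Counter


class TinyText:
    """Hypothetical minimal class: stores a text and counts its words."""

    def __init__(self, raw_text):
        # attributes: data stored on the object when it is created
        self.raw_text = raw_text
        self.tokens = raw_text.lower().split()  # naive whitespace tokenization

    def most_common(self, n=3):
        # method: an instruction that works on the stored attributes
        return Counter(self.tokens).most_common(n)


sample = TinyText("the cat saw the dog and the dog saw the cat")
print(sample.tokens[:3])      # read an attribute
print(sample.most_common(1))  # call a method
```

Because the tokens are computed once in `__init__`, every later method can rely on them already being there, which is the bookkeeping problem the paragraph above describes.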

Classes can be as simple or as complex as you want them to be. In the following template, we define a "Corpus" class and a "Text" class. Each class is given the attributes we want it to contain, along with sample methods that an NLP project might commonly run on those attributes.

The first code block is a class code template. This script can be saved as a file in your working directory and updated as necessary. The subsequent blocks of code can be used in the interpreter to import the class as a module or to reload it after any changes are made. The module has the same name as the file, minus the .py extension. We have named this file class_practice.py, so the module is class_practice.

Working with classes happens through a back-and-forth between writing/tweaking your code and interacting with it within the interpreter. You can run the file from the command line as a one-off, or import it into the interpreter as a module.

In [3]:
import os
import nltk
import string


class Corpus(object):
    # rather than enter the data bit by bit, we create a constructor that takes in the data at one time
    def __init__(self, corpus_dir):
        # all the attributes we want the class to have
        self.dir = corpus_dir  # where corpus_dir is the corpus' filepath
        # classes may contain methods we define ourselves; all_files is defined below
        self.filenames = self.all_files()
        # this attribute combines three lists: NLTK's built-in English stopwords,
        # every punctuation character in string.punctuation, and the quotation-mark
        # tokens (`` and '') that nltk.word_tokenize produces
        self.stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation) + ['``', "''"]
        # for testing, limit to the first three texts
        self.texts = [Text(fn, self.stopwords) for fn in self.filenames[0:3]]

    def all_files(self):
        """given the corpus_dir, return the filenames in it"""
        texts = []
        for (root, _, files) in os.walk(self.dir):
            for fn in files:
                path = os.path.join(root, fn)
                texts.append(path)
        return texts
    
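class Text(object):
    def __init__(self, fn, stopwords):
        self.filename = fn  # path to this text's file
        self.raw_text = self.get_text()  # the file's contents as one string
        self.raw_tokens = nltk.word_tokenize(self.raw_text)  # tokens, before any cleaning
        self.cleaned_tokens = self.clean_tokens(stopwords)  # lowercased, stopwords removed
        self.nltk_text = nltk.Text(self.cleaned_tokens)  # wrapper that provides NLTK's text methods

    def get_text(self):
        """read the file and return its contents as a string"""
        with open(self.filename) as fin:
            return fin.read()

    def clean_tokens(self, stopwords):
        """lowercase each token and drop stopwords; the check uses the lowercased token so that 'The' is dropped too"""
        return [token.lower() for token in self.raw_tokens if token.lower() not in stopwords]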
        
# this is what runs if you run the file as a one-off event: $ python3 class_practice.py
def main():
    corpus_dir = 'corpus/'
    print('This is being run from the command line.')  # anything that you might want to jump to, such as a graph, FreqDist, etc. would go here

# this guard lets you import the classes as a module without running main(). Python sets the
# special built-in variable __name__ to "__main__" only when the file is run as the main program.
if __name__ == "__main__":
    main()
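
Because each Text stores its own `cleaned_tokens`, switching between text-level and corpus-level analysis comes down to choosing which tokens to count. The following is a minimal sketch under stated assumptions: `FakeText`, `text_counts`, and `corpus_counts` are hypothetical names that are not part of the template, and `collections.Counter` stands in for `nltk.FreqDist` (which is itself a `Counter` subclass) so the snippet runs without NLTK or its data.

```python
from collections import Counter


class FakeText:
    """Hypothetical stand-in for the Text class: only holds cleaned tokens."""

    def __init__(self, cleaned_tokens):
        self.cleaned_tokens = cleaned_tokens


def text_counts(text):
    """Text-level analysis: token counts for a single text."""
    return Counter(text.cleaned_tokens)


def corpus_counts(texts):
    """Corpus-level analysis: token counts summed over every text."""
    total = Counter()
    for text in texts:
        total.update(text.cleaned_tokens)
    return total


texts = [FakeText(["sing", "goddess", "rage"]), FakeText(["rage", "achilles"])]
per_text = text_counts(texts[0])
whole_corpus = corpus_counts(texts)
print(per_text["rage"], whole_corpus["rage"])
```

In the real template, functions like these could become methods on Text and Corpus respectively, with `corpus_counts` looping over `self.texts`.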

The payoff of organizing your project within classes is that you can use them as a module from the interpreter. To do so:

In [None]:
# import the script as a module, using the file name without the extension
import class_practice
# instantiate the Corpus class from the class_practice module and store it in a variable named this_corpus
# Corpus needs the path to the corpus directory ('corpus/' here, as in main())
this_corpus = class_practice.Corpus('corpus/')

# replace "self" with "this_corpus" to access attributes and call methods
this_corpus.dir # will show the directory of the corpus
this_corpus.filenames # returns all the filenames in the corpus

# to work with the Text class, instantiate the particular text you want to use
# Text needs a stopword list as well as a filename; here we reuse the corpus's list
illiad = class_practice.Text('corpus/illiad.xml', this_corpus.stopwords)

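As you make changes to your class_practice file, you have to reload it in Python and re-instantiate your classes. A second plain import statement will not pick up the changes, because Python keeps already-imported modules cached. Reloading makes sure you are running the most up-to-date version of your file.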

In [None]:
import importlib

importlib.reload(class_practice)

# re-instantiate the corpus or text; both constructors need their arguments again
this_corpus = class_practice.Corpus('corpus/')
illiad = class_practice.Text('corpus/illiad.xml', this_corpus.stopwords)
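
To see why re-instantiating matters: objects created before a reload keep the behavior of the class as it was when they were created. Here is a minimal sketch, assuming a throwaway module named `reload_demo` (a hypothetical name) written to a temporary directory so that nothing in your project is touched.

```python
import importlib
import os
import sys
import tempfile

# Skip bytecode caching so the edited file is always re-read from source.
sys.dont_write_bytecode = True

# Write the first version of a throwaway module (hypothetical name: reload_demo).
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "reload_demo.py")
with open(module_path, "w") as fout:
    fout.write("class Text:\n    def label(self):\n        return 'old'\n")

sys.path.insert(0, module_dir)
importlib.invalidate_caches()
import reload_demo

before = reload_demo.Text()  # instance built from the first version

# Edit the file, as you would in your editor, then reload the module.
with open(module_path, "w") as fout:
    fout.write("class Text:\n    def label(self):\n        return 'new version'\n")
importlib.reload(reload_demo)

after = reload_demo.Text()   # instance built from the reloaded class

print(before.label())  # the instance created before the reload
print(after.label())   # the instance created after the reload
```

The `before` object still answers with the old method, while `after` uses the edited one. The same applies to `this_corpus` and `illiad`: they only pick up your edits once they are created again.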