## Dictionaries
If you want to store information that you will look up with a key, and the order doesn't matter, use a dictionary (written with {} curly brackets):

In [None]:
ages = {"Paul":34, "Padma":25, "Gorp":155}

print(ages)

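Do not rely on the order of the entries. (Since Python 3.7, dictionaries keep the order in which items were added, but you still cannot look items up by position.) To get information out, use the keys: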

In [None]:
gorpage = ages["Gorp"]

print(gorpage)

You can add data to a dictionary easily.

In [None]:
ages["Blobl"] = 20
print(ages)
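
A few other dictionary operations are often useful alongside the ones above. The sketch below reuses the `ages` dictionary from this section; the name "Zorp" is made up here purely to show what happens when a key is missing.

```python
ages = {"Paul": 34, "Padma": 25, "Gorp": 155}

# Looking up a missing key with [] raises a KeyError,
# so check for the key first or use .get() with a fallback value
print("Zorp" in ages)          # False
print(ages.get("Zorp", 0))     # 0

# Add a new entry, then loop over every key/value pair
ages["Blobl"] = 20
for name, age in ages.items():
    print(name, "is", age)

# The oldest person is the key with the largest value
oldest = max(ages, key=ages.get)
print(oldest)                  # Gorp
```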

# The Natural Language Toolkit
The Natural Language Toolkit (NLTK) is a Python library that contains a large number of functions and materials useful to linguists and anyone interested in using Python to study natural language documents. Today we will introduce the basic functionality of the library, including using the built-in corpora (there are many of them), using concordances, preparing a collection of your own texts for use with NLTK, lemmatization, and part of speech tagging.

In [None]:
# To use NLTK, you will need to import the library at the top of your script
import nltk
# The first time, you will need to run nltk.download()
# to download extra materials.
# Downloading just the book materials is enough.

## Included Corpora
If you have downloaded the extra materials for NLTK, they include many useful corpora. We will explore some of them here, so you know what is available. It is also possible to load your own corpora to use with NLTK functionality, which we will discuss in a moment.

### Gutenberg Sample
Project Gutenberg is a project that makes many public domain texts freely available. You can check it out at www.gutenberg.org.
Some of these texts are preloaded in NLTK:

In [None]:
from nltk.corpus import gutenberg
print(gutenberg.fileids())

### Twitter Sample

In [None]:
from nltk.corpus import twitter_samples
twitter_samples.fileids()

### Shakespeare Sample

In [None]:
from nltk.corpus import shakespeare
print(shakespeare.fileids())

In total, NLTK includes over 100 corpora prepackaged. You can find the full list at
http://www.nltk.org/nltk_data/

## Creating an NLTK Text Object
In practice, you will probably need to create your own NLTK text object to be able to use all of the NLTK functions easily. This is relatively straightforward.

### Initial Cleaning
In order for the computer to do much with a document, it will need to be cleaned and then tokenized.  

We will first load a text and then clean it up a bit:

In [None]:
# Load a text stored in this directory
# (the "with" block closes the file automatically once it has been read)
with open("holmes.txt", "r") as f:
    text = f.read()

# Print the first 200 characters to be sure that it has loaded properly
print(text[0:200])

In [None]:
# We don't want this preamble, so we will get rid of it (and the text at the end of the document).
# Most Gutenberg ebooks contain some boilerplate that ends with
# *** START OF THIS PROJECT GUTENBERG EBOOK
# The final boilerplate starts with
# *** END OF THIS PROJECT GUTENBERG EBOOK
# So we will search for these two phrases and keep only the text between them:
start_marker = '*** START OF THIS PROJECT GUTENBERG EBOOK'
end_marker = '*** END OF THIS PROJECT GUTENBERG EBOOK'
start = text.find(start_marker)
end = text.find(end_marker)
# find() returns -1 when a phrase is missing, so check before slicing
if start != -1 and end != -1:
    # Begin after the start-marker line so the marker itself is dropped
    text = text[text.find('\n', start) + 1:end]

print("Start of text", text[0:100])
print("End of text", text[-100:])
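
The slicing step above can also be wrapped in a small reusable helper. This is only a sketch: the function name `strip_gutenberg_boilerplate` and the sample string are made up for illustration, and it assumes the two marker phrases quoted in the comments above appear at the start of their own lines.

```python
START_MARKER = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THIS PROJECT GUTENBERG EBOOK"

def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the start and end marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = raw_text.find(START_MARKER)
    if start == -1:
        body_start = 0
    else:
        # Skip to the end of the marker line so the marker itself is dropped
        newline = raw_text.find("\n", start)
        body_start = len(raw_text) if newline == -1 else newline + 1

    end = raw_text.find(END_MARKER, body_start)
    if end == -1:
        end = len(raw_text)

    return raw_text[body_start:end].strip()

# A made-up miniature "ebook" to try the helper on
sample = (
    "Some licence text\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "To Sherlock Holmes she is always the woman.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "More licence text\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Unlike a bare slice, the helper leaves the text alone when a marker cannot be found, which avoids silently cutting off the wrong part of a file.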

### Tokenization
NLTK can break texts apart into individual sentences or words (or both, if you like).

NLTK comes with a tokenizer that makes this easy (this works best in English, but there are many more tokenizers available online).

In [None]:
# This breaks the text into sentences
sentences = nltk.sent_tokenize(text)
print(sentences[1000:1010])

In [None]:
# Alternatively, you can break the document into words:
# Note that this considers punctuation marks "words"
words = nltk.word_tokenize(text)
print(words[1000:1010])

In [None]:
# You can do both at the same time to make a list of sentences, which are each a list of words
sent_word_tokens = []
for sentence in nltk.sent_tokenize(text):
    tokenized_sent = nltk.word_tokenize(sentence)
    sent_word_tokens.append(tokenized_sent)
print(sent_word_tokens[1000])

In [None]:
# This code can be condensed into a single line:
sent_word_tokens = [nltk.word_tokenize(sentence) for sentence in nltk.sent_tokenize(text)]
print(sent_word_tokens[1000])
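
To get a feel for what tokenization produces without downloading anything, here is a rough stand-in that uses only Python's built-in `re` module. It is not how NLTK's tokenizers work internally (they handle abbreviations, contractions and similar cases far better); the sample sentence and the variable names are made up for illustration.

```python
import re

sample = "Holmes laughed. 'It is quite a pretty little problem,' said he."

# Naive sentence split: break after ., ! or ? when followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

# Naive word split: runs of letters/digits, or single punctuation marks
naive_words = re.findall(r"\w+|[^\w\s]", sample)

# Both at once: a list of sentences, each a list of tokens
naive_sent_words = [re.findall(r"\w+|[^\w\s]", s) for s in naive_sentences]

print(naive_sentences)
print(naive_words[:6])
print(naive_sent_words[0])
```

As with `nltk.word_tokenize`, punctuation marks come out as separate tokens, which is why later steps filter them out before counting words.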

### Create the Text Object
From here, creating the text object is easy.

In [None]:
# We will generate the Text object with just the words, because
# the functions are easier to use
novel = nltk.Text(words)
print(novel)

## Text Objects
NLTK text objects contain many useful functions that will let you quickly visualize certain information about the text. We will discuss several of the most useful here.

In [None]:
# Get a concordance for a specific term
novel.concordance('Holmes', width=100, lines=50)

In [None]:
# Get similar words based on context
# Note that the results will make better sense with a very large corpus
novel.similar("jewel")

In [None]:
# Make a dispersion plot with a list of words.
# This will make visualizing relations easy
# You can give it a single word, or a list of words
novel.dispersion_plot(["murder","detective","computer"])

In [None]:
# You can generate a list of collocations very easily
novel.collocations()

## Basic information
Sometimes we will want to get some basic information about a text. How long is it in characters, words, and sentences (you can see a description of this in http://www.nltk.org/book/ch02.html)? What is the lexical diversity (the ratio of unique words to total words)?

In [None]:
# Get the length in characters. This is VERY easy:
charlength = len(text)
print(charlength)

Getting the length in words is slightly more complicated, because we don't really want to consider punctuation marks words (unless we do). As such, we will want to do a bit of editing first.

In [None]:
# First let's look at the contents of the "words" list:
print(words[:100])
print(len(words))

In [None]:
# Let's get rid of the punctuation marks and make everything lowercase:
filtered_words = []
# Go through each word in the words list
for word in words:
    # check if the "word" is alphanumeric
    if word.isalnum():
        # if it is, append a lowercase version of the word to the filtered_words list:
        filtered_words.append(word.lower())
# This can be done in one line with:
# filtered_words = [word.lower() for word in words if word.isalnum()]
print(filtered_words[:100])


Now getting the number of words is straightforward:

In [None]:
total_words = len(filtered_words)
print(total_words)

Getting unique words is also easy. The "set" object type is a default Python object type and only allows one of each value (we will turn it back to a list for easy use):

In [None]:
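# A set keeps only one copy of each value (note that a set has no defined order)
unique_words = list(set(filtered_words))

# The same result can be built with a loop, but this is slow on long texts,
# because "word not in unique_words" rescans the list every time:
# unique_words = []
# for word in filtered_words:
#     if word not in unique_words:
#         unique_words.append(word)
print(unique_words[:100])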

### Lexical diversity
Now that we have the length in words and the unique words, calculating the lexical diversity of the text is easy:

In [None]:
lexical_diversity = len(unique_words)/total_words
print(lexical_diversity)
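
The same calculation can be packaged as a function so it can be reused on any list of tokens. This is a minimal sketch; the function name and the sample tokens are made up for illustration, and returning 0.0 for an empty list is simply one reasonable choice.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample_tokens = ["the", "dog", "chased", "the", "cat", "and", "the", "bird"]
print(lexical_diversity(sample_tokens))   # 6 unique / 8 total = 0.75
```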

### Average word length
There are several ways to calculate this. One could, as is done in the NLTK book, simply divide the number of characters in the text by the number of words. However, that character count includes spaces and punctuation, and we have removed punctuation from our word list, which will skew the results. To avoid this, we can use a loop to calculate this information:

In [None]:
total_word_length = 0
for word in filtered_words:
    # add together the length of each word
    total_word_length = total_word_length + len(word)
avg_word_length = total_word_length/len(filtered_words)
print("refined word length:", avg_word_length)
print("NLTK approach:", charlength/len(words))
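
A tiny worked example shows why the two numbers differ. The sentence and its hand-written token list below are made up for illustration; the token list mimics the shape of the output of `nltk.word_tokenize`.

```python
raw = "It was a dark and stormy night."
tokens = ["It", "was", "a", "dark", "and", "stormy", "night", "."]

# Approach 1: all characters (spaces and punctuation included) over all tokens
rough = len(raw) / len(tokens)          # 31 / 8 = 3.875

# Approach 2: only the characters inside real words, over real words
real_words = [t for t in tokens if t.isalnum()]
refined = sum(len(w) for w in real_words) / len(real_words)   # 24 / 7

print("rough:", rough)
print("refined:", round(refined, 2))
```

The rough figure is inflated by the spaces and the full stop, while the refined figure counts only the letters that belong to words.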

### Average sentence length
Calculating average sentence length is done in much the same way (here we will use the sentence and word tokenized object):

In [None]:
wordsinsentences = 0
for sent in sent_word_tokens:
    for word in sent:
        if word.isalnum():
            wordsinsentences += 1
            
avgsentencelength = wordsinsentences/len(sent_word_tokens)
print(avgsentencelength)

### Frequency Distribution
You can quickly get word frequencies by using the FreqDist class.

In [None]:
frequencies = nltk.FreqDist(filtered_words)
print(frequencies['and'])

In [None]:
# you can see the most common words (or words that only occur once)
#print(frequencies.most_common(50))
print(frequencies.hapaxes())
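
`FreqDist` is built on top of Python's own `collections.Counter`, so a plain `Counter` is a convenient way to see the idea on a small scale. The token list below is made up for illustration, and the hapax list is computed by hand here rather than with NLTK.

```python
from collections import Counter

tokens = ["the", "dog", "saw", "the", "cat", "and", "the", "cat", "ran"]
counts = Counter(tokens)

print(counts["the"])            # 3
print(counts.most_common(2))    # [('the', 3), ('cat', 2)]

# Words that occur exactly once (what FreqDist.hapaxes() returns)
hapaxes = [word for word, n in counts.items() if n == 1]
print(hapaxes)                  # ['dog', 'saw', 'and', 'ran']
```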

## Creating and using Corpus objects
Working with a single text is useful, but it is often more useful to deal with multiple texts. As mentioned at the start of this lesson, NLTK comes with a number of built-in corpora, but it is just as easy to build your own corpus. Just create a folder that contains the texts you are interested in. I've provided a set of texts downloaded from the most popular section of Project Gutenberg and placed them in a folder called "my_corpus":

In [None]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Just give the corpus reader your folder as the first parameter
# and a regular expression that matches files of interest.
# We don't have time to go into regular expressions, but I've provided one
# that will simply match all files in the folder.
my_corpus = PlaintextCorpusReader('my_corpus','.+')


### Getting information from Corpus object
Now that we have the corpus object, we can extract information from it.

In [None]:
# Print a list of the files in the corpus
print(my_corpus.fileids())

In [None]:
# raw gives you access to the text as a string (we will use frederickdouglas.txt)
print(my_corpus.raw('frederickdouglas.txt')[40000:40050])

In [None]:
# words provides the words (this will be similar to the TextObject used above!)
print(my_corpus.words('frederickdouglas.txt'))

In [None]:
# sents provides the sentences
print(my_corpus.sents('frederickdouglas.txt'))

In [None]:
# The following example is taken from nltk.org/book/ch02.html.
# print information for each text in the corpus

# here you might run into an esoteric AssertionError related to check1 vs check2
# this appears to be a bug in NLTK related to byte order marks (BOM). Copying the offending
# text directly from the source into a new file seems to fix it.
for fileid in my_corpus.fileids():
    if fileid != '.DS_Store':
        num_chars = len(my_corpus.raw(fileid))
        num_words = len(my_corpus.words(fileid))
        num_sents = len(my_corpus.sents(fileid))
        num_vocab = len(set(w.lower() for w in my_corpus.words(fileid)))
        print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

## Removing Stopwords
Often we do not care about the most common words in the corpus. NLTK comes with stopword lists in a variety of languages:

In [None]:
stopwords = nltk.corpus.stopwords.words('english')
my_function_words = ["an","of"]
print(stopwords)

In [None]:
# removing these stopwords from a text is easy:
no_stop_words = [word for word in filtered_words if word not in stopwords]
print(no_stop_words[:50])

In [None]:
# the NLTK book suggests calculating the content fraction
print(len(no_stop_words)/len(filtered_words))
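
The filtering and the content fraction can be checked by hand on a small example. The stopword set and token list below are made up for illustration and are much shorter than NLTK's real English list.

```python
# A tiny hand-picked stopword set, used here only for illustration
tiny_stopwords = {"the", "a", "of", "and", "in", "to"}

tokens = ["the", "hound", "of", "the", "baskervilles", "ran", "to", "the", "moor"]

# Keep only the tokens that are not stopwords
content_words = [t for t in tokens if t not in tiny_stopwords]

# Fraction of the text that is made up of content words
content_fraction = len(content_words) / len(tokens)   # 4 / 9

print(content_words)       # ['hound', 'baskervilles', 'ran', 'moor']
print(round(content_fraction, 2))
```

Storing the stopwords in a set rather than a list makes each `not in` check faster, which can matter on a full novel; wrapping NLTK's list in `set(...)` would have the same effect.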

## Stemming words in a document
Often you will want to stem the words in your document so that 'whale' and 'whales' are counted together instead of cluttering up your counts. Similarly, you may want "runs" and "running" to be counted with "run". NLTK comes with an extensive toolbox for doing this:

In [None]:
# import the SnowballStemmer (one of the included stemmers)
from nltk.stem.snowball import SnowballStemmer
# instantiate a stemmer object
stemmer = SnowballStemmer("english")

stemmed_words = []
for word in filtered_words:
    stemmed_words.append(stemmer.stem(word))
    
# or do this:
# stemmed_words = [stemmer.stem(word) for word in filtered_words]
print(stemmed_words[:50])

You can skip stopwords when using the stemmer by adding a simple parameter when you instantiate the original stemmer object:

In [None]:
stemmer = SnowballStemmer("english", ignore_stopwords=True)
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print(stemmed_words[:50])
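
To see the effect of stemming on counts, it helps to try the stemmer on a handful of words. The sketch below uses the same `SnowballStemmer` class as above on a made-up word list and assumes NLTK is installed; the plain English stemmer should not need any downloaded data.

```python
from collections import Counter
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

sample_words = ["whale", "whales", "run", "runs", "running"]
stems = [stemmer.stem(word) for word in sample_words]
print(stems)                    # ['whale', 'whale', 'run', 'run', 'run']

# Counting stems instead of raw words merges the variants together
stem_counts = Counter(stems)
print(stem_counts["whale"], stem_counts["run"])   # 2 3
```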

## Lemmatization
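NLTK also includes a lemmatizer, which looks each word up in WordNet and returns its dictionary form (lemma) rather than a chopped-off stem. This often provides better results, but it is slower. Note that `lemmatize` treats every word as a noun unless you tell it the part of speech: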

In [None]:
lemmafinder = nltk.WordNetLemmatizer()
lemmas = [lemmafinder.lemmatize(word) for word in filtered_words]
print(lemmas[:50])

## Part of Speech Tagging
Some corpora built into NLTK are already tagged by part of speech (such as the Brown corpus), but you may want to do this yourself. This works best at the sentence level.

In [None]:
pos_tagged_sentences = []
for sent in sent_word_tokens:
    pos_tagged_sentences.append(nltk.pos_tag(sent))
print(pos_tagged_sentences[1000])

You can see that the sentence has been tagged with its part of speech. To (TO: to), smoke (VB: verb, base form), he (PRP: personal pronoun), answered (VBD: verb past tense).

NLTK uses the Penn Treebank Project part of speech tags, which you can find listed here:
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
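
Once a sentence is tagged, each token is a (word, tag) pair, which is easy to count or filter with ordinary Python. The sketch below hard-codes the four pairs described above instead of calling the tagger, so it runs without any downloaded models; the variable names are made up for illustration.

```python
from collections import Counter

# (word, tag) pairs in the same shape that nltk.pos_tag returns;
# these four are the ones described in the text above
tagged = [("To", "TO"), ("smoke", "VB"), ("he", "PRP"), ("answered", "VBD")]

# Count how often each tag occurs
tag_counts = Counter(tag for word, tag in tagged)
print(tag_counts)

# Keep only the verbs: Penn Treebank verb tags all start with "VB"
verbs = [word for word, tag in tagged if tag.startswith("VB")]
print(verbs)   # ['smoke', 'answered']
```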