In [46]:
import nltk, re, pprint
from nltk import word_tokenize
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

**First, load in the file or files below.**  Start by taking a look at your text.  An easy way to get started is to read it in and then run it through the sentence tokenizer to divide it up, even if this division is not fully accurate.  You may have to do a bit of work to figure out which will be the "opening phrase" that Wolfram Alpha shows.  Below, write the code to read in the text and split it into sentences, and then print out the **opening phrase**.

In [47]:
# Use a context manager so the file is closed once it has been read.
with open("../tal_stories/tal_text.txt", "r") as tal:
    tal_text = tal.read()
sentences = sent_tokenizer.tokenize(tal_text)
print(sentences[1])

I listen to a lot of podcasts.


**Next, tokenize.**  Look at the several dozen sentences to see what kind of tokenization issues you'll have.  Write a regular expression tokenizer, using the nltk.regexp_tokenize() as seen in class, or using something more sophisticated if you prefer, to do a nice job of breaking your text up into words.  You may need to make changes to the regex pattern that is given in the book to make it work well for your text collection. 

*How you break up the words will have effects down the line for how you can manipulate your text collection.  You may want to refine this code later.*

In [49]:
pattern = r'''(?x)         # set flag to allow verbose regexps
      <
    | :
    | (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
    | \w+(?:[-']\w+)*      # words with optional internal hyphens or apostrophes
    | \$?\d+(?:\.\d+)?%?   # currency and percentages, e.g. $12.40, 82%
    | \.\.\.               # ellipsis
    | [.,;"'?():-_`]+      # punctuation runs; note ":-_" is a range (":" through "_"), so ">" matches here too
'''
tokens = nltk.regexp_tokenize(tal_text,pattern)
tokens[0:10]

['<', 'EPISODE', 'NUMBER', ':', '496', '>', '<', 'EPISODE', 'NAME', ':']
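The last alternative in the pattern deserves a second look: inside a character class, `:-_` is read as a range from `:` to `_`, which covers `<`, `>`, `[`, `]` and the uppercase letters as well. That is why `>` comes out as a token, and it also means a punctuation run can swallow a following capital letter. The sketch below is one possible variant, not the pattern used in this notebook: it uses plain `re.findall` so it runs without NLTK, the sample string is made up, and `fixed_pattern` and `ranged_class` are hypothetical names.

```python
import re

# Hypothetical sample line in the style of the transcript markup.
sample = "<EPISODE NUMBER: 496> (Applause.) It's a $12.40 fee..."

# Same alternatives as the notebook pattern, but with non-capturing groups,
# explicit < > : delimiters, and the hyphen moved to the end of the class
# so that it is a literal "-" instead of a range operator.
fixed_pattern = r'''(?x)
      [<>:]                    # transcript markup delimiters
    | (?:[A-Z]\.)+             # abbreviations, e.g. U.S.A.
    | \w+(?:[-']\w+)*          # words with optional internal hyphens or apostrophes
    | \$?\d+(?:\.\d+)?%?       # currency and percentages
    | \.\.\.                   # ellipsis
    | [.,;"'?()_`-]+           # punctuation runs
'''
tokens_fixed = re.findall(fixed_pattern, sample)
print(tokens_fixed)

# The original class, shown on its own: the ":-_" range also matches "A".
ranged_class = r'''[.,;"'?():-_`]+'''
print(re.findall(ranged_class, "(Applause"))
```

Switching to a variant like this would change the token counts in the rest of the notebook, so the later cells would need to be rerun.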

**Compute word counts.** Now compute your frequency distribution using a FreqDist over the words. Let's not do lowercasing or stemming yet.  You can run this over the whole collection together, or sentence by sentence. Write the code for computing the FreqDist below and show the most common 20 words that result.

In [50]:
fd = nltk.FreqDist(tokens)
fd.most_common(20)

[('.', 5963),
 (',', 5220),
 ('the', 3293),
 ('to', 2313),
 ('a', 1966),
 ('of', 1710),
 (':', 1689),
 ('>', 1665),
 ('<', 1665),
 ('I', 1551),
 ('that', 1432),
 ('and', 1420),
 ('in', 1197),
 ('And', 1162),
 ('you', 965),
 ('was', 887),
 ('SUBJECT', 818),
 ('it', 806),
 ('is', 784),
 ('he', 679)]

**Normalize the text.** Now adjust the output by normalizing the text: things you can try include removing stopwords, removing very short words, lowercasing the text, improving the tokenization, and/or doing other adjustments to bring content words higher up in the results.  The goal is to dig deeper into the collection to find interesting but relatively frequent words.  Show the code for these changes below.  

In [51]:
import string
from nltk.corpus import stopwords

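# Tokens to drop: punctuation characters, single digits, the ellipsis, and English stopwords.
exclude = set(string.punctuation) | set(string.digits) | {'...'} | set(stopwords.words('english'))
# Keep tokens that are not all-caps (this drops transcript markup such as SUBJECT),
# are not in the exclusion set, and are at least 6 characters long.
filtered_tokens = [t for t in tokens if not t.isupper() and t not in exclude and len(t) > 5]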

**Show adjusted word counts.** Show the most frequent 20 words that result from these adjustments.

In [52]:
fd2 = nltk.FreqDist(filtered_tokens)
fd2_most_common = fd2.most_common(20)
fd2_most_common

[('people', 300),
 ('patent', 227),
 ('really', 149),
 ('Ventures', 120),
 ('Intellectual', 120),
 ('actually', 118),
 ("didn't", 111),
 ('company', 111),
 ('patents', 109),
 ('something', 106),
 ('things', 105),
 ('called', 102),
 ("that's", 101),
 ('companies', 101),
 ("you're", 99),
 ('little', 90),
 ("That's", 83),
 ('number', 81),
 ("they're", 80),
 ('started', 77)]
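Because the text was not lowercased, case variants such as "that's" and "That's" are counted separately in the list above. A minimal sketch of what merging them would look like, using only three counts copied from that list; it uses `collections.Counter` (which `nltk.FreqDist` subclasses) so it runs without NLTK, and `shown` and `merged` are hypothetical names:

```python
from collections import Counter

# Counts copied from the top-20 list above (only these three entries are used).
shown = Counter({'people': 300, "that's": 101, "That's": 83})

# Fold every key to lowercase and add up the counts of the variants.
merged = Counter()
for word, count in shown.items():
    merged[word.lower()] += count

print(merged.most_common())
```

On the full token list the merged totals would be higher wherever other capitalized variants exist, so these figures are only a lower bound.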

**Creating a table.**
Python provides an easy way to line columns up in a table.  You can specify a width for a string, such as %6s, which produces a string padded to width 6.  It is right-justified by default, but a minus sign after the % switches it to left-justified, so %-3d means left-justify an integer with width 3.  *AND* if you don't know the width in advance, you can make it a variable by using an asterisk in place of the number, as in '%\*s' or '%-\*d'; the width is then passed as an extra argument just before the value.  Check out this example (this is just FYI):

In [53]:
print('%-16s' % 'Info type', '%-16s' % 'Value')
print('%-16s' % 'number of words', '%-16d' % 100000)

Info type        Value           
number of words  100000          
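The asterisk form mentioned above is not used in that example, so here is a small sketch of it. The widths are arbitrary values chosen for illustration; each `*` consumes one integer from the argument tuple before the value it formats.

```python
width = 16  # hypothetical column width, decided at run time

# '%-*s' left-justifies a string in `width` columns;
# '%*s' and '%*d' right-justify in the width given just before the value.
header = '%-*s%*s' % (width, 'Info type', 8, 'Value')
row = '%-*s%*d' % (width, 'number of words', 8, 100000)

print(header)
print(row)
```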


**Word Properties Table** Next there is a table of word properties, which you should compute (skip unique word stems, since we will do stemming in class on Wed).  Make a table that prints out:
1. number of words
2. number of unique words
3. average word length
4. longest word

You can make your table look prettier than the example I showed above if you like!

You can decide for yourself if you want to eliminate punctuation and function words (stop words) or not.  It's your collection!  


In [55]:
print('%-16s' % 'Info type', '%-16s' % 'Value')
print('%-16s' % 'number of words', '%-16d' % len(filtered_tokens))
print('%-16s' % 'number of unique words', '%-16d' % len(set(filtered_tokens)))
import statistics
word_lengths = [len(x) for x in filtered_tokens]
print('%-16s' % 'average word length', '%-16d' % round(statistics.mean(word_lengths), 2))
print('%-16s' % 'longest word', '%-16s' % max(filtered_tokens, key=len))

Info type        Value           
number of words  20444           
number of unique words 5385            
average word length 7               
longest word     cosmetically-altered
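Two things are worth noticing in the table above: labels longer than 16 characters push their values out of line, and `%d` truncates a float, so the average word length prints as a whole number. The following is a sketch of one way to handle both, assuming the label width is computed from the labels themselves. The first two values are copied from the table; the average is a made-up number used only to show the formatting, and `rows`, `label_width`, and `format_row` are hypothetical names.

```python
rows = [
    ('number of words', 20444),         # value shown in the table above
    ('number of unique words', 5385),   # value shown in the table above
    ('average word length', 7.68),      # hypothetical value, for illustration only
]

# Size the label column from the longest label instead of hard-coding 16.
label_width = max(len(label) for label, _ in rows) + 2

def format_row(label, value):
    # Floats get two decimals; '%d' would silently truncate 7.68 to 7.
    value_text = '%.2f' % value if isinstance(value, float) else '%d' % value
    return '%-*s%s' % (label_width, label, value_text)

table = [format_row(label, value) for label, value in rows]
print('\n'.join(table))
```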


**Most Frequent Words List.** Next is the most frequent words list.  This table shows the percent of the total as well as the most frequent words, so compute this number as well.  

In [56]:
count_of_words = len(filtered_tokens)
print('%-16s' % 'Word', '%-16s' % 'Percent')
for item in fd2_most_common:
    print('%-16s' % item[0], '%-16s' % str(round(((item[1]/count_of_words) * 100), 2)))

Word             Percent         
people           1.47            
patent           1.11            
really           0.73            
Ventures         0.59            
Intellectual     0.59            
actually         0.58            
didn't           0.54            
company          0.54            
patents          0.53            
something        0.52            
things           0.51            
called           0.5             
that's           0.49            
companies        0.49            
you're           0.48            
little           0.44            
That's           0.41            
number           0.4             
they're          0.39            
started          0.38            
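As a quick check of the percentages, each one is the word's count divided by the total number of filtered words, times 100. The sketch below reproduces three rows from the figures already shown (20444 filtered words; counts from the top-20 list) and uses `'%.2f'` so that trailing zeros are kept, which `str(round(x, 2))` drops (0.5 instead of 0.50). The names `total`, `counts`, and `percents` are hypothetical.

```python
total = 20444  # number of filtered words reported in the word properties table
counts = {'people': 300, 'patent': 227, 'called': 102}  # counts from the top-20 list

# Percent of total, always formatted with two decimals.
percents = {word: '%.2f' % (100 * n / total) for word, n in counts.items()}

for word, pct in percents.items():
    print('%-16s%s' % (word, pct))
```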


**Most Frequent Capitalized Words List** We haven't lower-cased the text so you should be able to compute this. Don't worry about whether capitalization comes from proper nouns, start of sentences, or elsewhere. You need to make a different FreqDist to do this one.  Write the code here for the new FreqDist and the List itself.  Show the list here.

In [57]:
# Same filter as for the adjusted word counts, restricted to tokens that start with an uppercase letter.
exclude = set(string.punctuation) | set(string.digits) | {'...'} | set(stopwords.words('english'))
capitalized_words = [t for t in tokens if t[0].isupper() and not t.isupper() and t not in exclude and len(t) > 5]
fd3 = nltk.FreqDist(capitalized_words)
fd3_most_common = fd3.most_common(20)
fd3_most_common

[('Ventures', 120),
 ('Intellectual', 120),
 ("That's", 83),
 ('Anthony', 68),
 ('Because', 61),
 ("There's", 52),
 ("They're", 45),
 ('Crawford', 43),
 ('Schoolcraft', 43),
 ('American', 43),
 ('Public', 41),
 ('Adrian', 38),
 ("Crawford's", 37),
 ('Foxconn', 36),
 ('International', 31),
 ('Sadeem', 29),
 ('Research', 28),
 ('Detkin', 28),
 ('Myhrvold', 25),
 ("You're", 25)]

**Sentence Properties Table** This summarizes number of sentences and average sentence length in words and characters (you decide if you want to include stopwords/punctuation or not).  Print those out in a table here.

In [58]:
print('%-16s' % 'Info type', '%-16s' % 'Value')
print('%-16s' % 'Number of Sentences', '%-16d' % len(sentences))
print('%-16s' % 'Average Length in Words', '%-16d' % statistics.mean([len(nltk.regexp_tokenize(sentence,pattern)) for sentence in sentences]))
print('%-16s' % 'Average Length in Characters', '%-16d' % statistics.mean([len(sentence) for sentence in sentences]))

Info type        Value           
Number of Sentences 6778            
Average Length in Words 15              
Average Length in Characters 72              


**Reflect on the Output** (Write a brief paragraph below answering these questions.) What does it tell you about your collection?  What does it fail to tell you?  How does your collection perhaps differ from others?

It is interesting to see some of the top words from the This American Life corpus. Since I primarily focused on the episodes that were tagged as technology related, the most frequent word lists surface some of the hot-topic items, such as Foxconn, patents, and intellectual (property). I think this shows the kinds of technology topics that a more general audience, such as that of This American Life, may be interested in. However, since most of the top words come from the two episodes about patents, I may need to expand my chosen subset of episodes to a more diverse set.

**Compare to Another Collection** Now do the same analysis on another collection in NLTK.  
If your collection is a book, you can compare against another book.   Or you can contrast against an entirely different collection  (Brown corpus, presidential inaugural addresses, etc) to see the difference.
The list of collections is here: http://www.nltk.org/nltk_data/
Reflect on the similarities to or differences from your text collection.
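Since the same steps are run on both collections, one option is to wrap them in a helper and call it once per token list, which avoids accidentally passing the wrong variable when cells are copied. This is only a sketch of that idea, not the code used below: `word_properties` is a hypothetical name, it uses `collections.Counter` so it runs without NLTK, and the demo tokens are made up.

```python
from collections import Counter

def word_properties(tokens, exclude=frozenset(), min_len=6):
    """Filter tokens and return summary statistics plus the top words."""
    # Same filter as in the cells above: drop all-caps tokens, excluded
    # tokens, and anything shorter than min_len characters.
    kept = [t for t in tokens
            if not t.isupper() and t not in exclude and len(t) >= min_len]
    counts = Counter(kept)
    return {
        'number of words': len(kept),
        'number of unique words': len(counts),
        'average word length': sum(map(len, kept)) / len(kept) if kept else 0.0,
        'longest word': max(kept, key=len, default=''),
        'most common': counts.most_common(3),
    }

# Made-up tokens, for illustration only.
demo = word_properties(['patent', 'patent', 'people', 'SUBJECT', 'the', 'company'])
print(demo)
```

With a helper like this, the two collections would be summarized by calling it once with `tokens` and once with the tokens of the second collection, passing the same `exclude` set.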


In [62]:
state_of_union = nltk.corpus.state_union.raw()
# Reuse the tokenizer `pattern` defined earlier so both collections are tokenized the same way.
sou_tokens = nltk.regexp_tokenize(state_of_union, pattern)

**Compute word counts.**

In [63]:
# Count the State of the Union tokens (sou_tokens), not the This American Life `tokens` list.
sou_fd = nltk.FreqDist(sou_tokens)
sou_fd.most_common(20)

**Show Adjusted Word Counts.**

In [68]:
# Same exclusion set and filter as for the This American Life tokens.
exclude = set(string.punctuation) | set(string.digits) | {'...'} | set(stopwords.words('english'))
sou_filtered_tokens = [t for t in sou_tokens if not t.isupper() and t not in exclude and len(t) > 5]

sou_fd2 = nltk.FreqDist(sou_filtered_tokens)
sou_fd2_most_common = sou_fd2.most_common(20)
sou_fd2_most_common

[('people', 1264),
 ('Congress', 1014),
 ('America', 880),
 ('American', 768),
 ('Federal', 565),
 ('Americans', 559),
 ('program', 521),
 ('economic', 504),
 ('States', 496),
 ('Government', 485),
 ('country', 480),
 ('United', 474),
 ('freedom', 457),
 ('economy', 430),
 ('national', 429),
 ('government', 419),
 ('pplause', 415),
 ('security', 410),
 ('nations', 404),
 ('budget', 403)]

**Word Properties Table**

In [69]:
print('%-16s' % 'Info type', '%-16s' % 'Value')
print('%-16s' % 'number of words', '%-16d' % len(sou_filtered_tokens))
print('%-16s' % 'number of unique words', '%-16d' % len(set(sou_filtered_tokens)))
import statistics
sou_word_lengths = [len(x) for x in sou_filtered_tokens]
print('%-16s' % 'average word length', '%-16d' % round(statistics.mean(sou_word_lengths), 2))
print('%-16s' % 'longest word', '%-16s' % max(sou_filtered_tokens, key=len))

Info type        Value           
number of words  114601          
number of unique words 11777           
average word length 8               
longest word     competition-restricting


**Most Frequent Word List**

In [71]:
count_of_words = len(sou_filtered_tokens)
print('%-16s' % 'Word', '%-16s' % 'Percent')
for item in sou_fd2_most_common:
    print('%-16s' % item[0], '%-16s' % str(round(((item[1]/count_of_words) * 100), 2)))

Word             Percent         
people           1.1             
Congress         0.88            
America          0.77            
American         0.67            
Federal          0.49            
Americans        0.49            
program          0.45            
economic         0.44            
States           0.43            
Government       0.42            
country          0.42            
United           0.41            
freedom          0.4             
economy          0.38            
national         0.37            
government       0.37            
pplause          0.36            
security         0.36            
nations          0.35            
budget           0.35            


**Most Frequent Capitalized Word List**

In [72]:
# Same filter, restricted to tokens that start with an uppercase letter.
exclude = set(string.punctuation) | set(string.digits) | {'...'} | set(stopwords.words('english'))
sou_capitalized_words = [t for t in sou_tokens if t[0].isupper() and not t.isupper() and t not in exclude and len(t) > 5]
sou_fd3 = nltk.FreqDist(sou_capitalized_words)
sou_fd3_most_common = sou_fd3.most_common(20)
sou_fd3_most_common

[('Congress', 1014),
 ('America', 880),
 ('American', 768),
 ('Federal', 565),
 ('Americans', 559),
 ('States', 496),
 ('Government', 485),
 ('United', 474),
 ('President', 347),
 ('Nation', 285),
 ("America's", 192),
 ('Soviet', 170),
 ('Security', 137),
 ('Europe', 128),
 ('Tonight', 115),
 ('Nations', 115),
 ('Social', 112),
 ('Members', 100),
 ('Speaker', 91),
 ('Administration', 88)]

**Sentence Properties Table**

In [73]:
sou_sentences = sent_tokenizer.tokenize(state_of_union)
print(sou_sentences[1])

print('%-16s' % 'Info type', '%-16s' % 'Value')
print('%-16s' % 'Number of Sentences', '%-16d' % len(sou_sentences))
print('%-16s' % 'Average Length in Words', '%-16d' % statistics.mean([len(nltk.regexp_tokenize(sentence,pattern)) for sentence in sou_sentences]))
print('%-16s' % 'Average Length in Characters', '%-16d' % statistics.mean([len(sentence) for sentence in sou_sentences]))

Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt.
Info type        Value           
Number of Sentences 17736           
Average Length in Words 21              
Average Length in Characters 115             


**Reflection**

I chose to compare my collection to the State of the Union addresses because both are transcripts of spoken word, so I thought it would be interesting to see the similarities and differences between the two. First, the most frequent word in both corpora is "people", although this may partly be due to the fact that I only kept words of 6 or more characters. Second, the average number of words per sentence is very different: 15 in the This American Life corpus, compared to 21 in the State of the Union. This may be because This American Life includes dialogue between multiple people, with short responses that each count as a single sentence (such as a guest answering a question with yes or no).