## Practice One: NLTK Basics

NLTK is a powerful computational linguistics library for Python. It ships with the full text of several well-known books, so you can explore computational linguistics concepts easily, using material you might already be familiar with. Run the import below and look over the materials the library makes available.

In [1]:
from nltk.book import *


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Concordance

In the simplest terms, a concordance is an alphabetical list of the words used in a book or set of books, with information about where each word occurs and, usually, how it is used. NLTK's `concordance` method shows every occurrence of a given word together with some of the surrounding text.

In this example, we look at how the word "impressive" is used in three of the texts.

In [2]:
text1.concordance("impressive")
text4.concordance("impressive")
text6.concordance("impressive")

Displaying 2 of 2 matches:
was a most unwonted hour , yet so impressive was the cry , and so deliriously 
 show you the diamond in its most impressive lustre , he lays it against a glo
Displaying 3 of 3 matches:
e early fathers . One of the most impressive evidences of that wisdom is to be
s himself to their service . This impressive ceremony adds little to the solem
aracter and a culture which is an impressive contribution to human progress . 
no matches
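To make the idea concrete, here is a minimal pure-Python sketch of a keyword-in-context lookup. It is not NLTK's implementation: the function name `concordance`, the `width` parameter (counted in tokens rather than characters), and the case-insensitive matching are all assumptions made for illustration.

```python
def concordance(tokens, target, width=3):
    """Return (left context, match, right context) for each occurrence of target.

    Hypothetical helper: context is measured in tokens and matching
    ignores case, which is an assumption of this sketch.
    """
    target = target.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Toy token list built from fragments of the output above
tokens = "This impressive ceremony adds little to the most impressive lustre".split()
hits = concordance(tokens, "impressive", width=2)
for left, word, right in hits:
    print(f"{left:>12} [{word}] {right}")
```

Each hit keeps the matched word in the middle, which is what makes a concordance easy to scan by eye.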


## Similar

In [3]:
text1.similar("impressive")
## Prints words that appear in contexts similar to those of "impressive"

it there extreme nor spent roughly ahab manifold


In [4]:
text1.common_contexts(["pretty", "very"])

and_soon a_large a_little be_much a_sharp
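The `a_large` style of output suggests that a context is the pair of words on either side of the target. Under that assumption, a rough sketch of finding shared contexts could look like the following. The helper names are hypothetical, and this is a simplification rather than NLTK's actual algorithm.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    lowered = [t.lower() for t in tokens]
    contexts = defaultdict(set)
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return contexts


def common_contexts(tokens, words):
    """Return the contexts shared by every word in `words`, sorted."""
    contexts = contexts_by_word(tokens)
    shared = set.intersection(*(contexts[w.lower()] for w in words))
    return sorted(shared)


# Toy sentence: "very" and "pretty" both appear between "a" and "large"
tokens = "a very large whale and a pretty large boat and a very small fish".split()
print(common_contexts(tokens, ["pretty", "very"]))
```

Words that share many contexts tend to play a similar role in the text, which is the intuition behind `similar`.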


## The Knights who say "ni"

[Where in Monty Python is the section about the "Knights who say 'ni'"?](http://www.montypython.net/scripts/HG-niscenes.php)

In [5]:
%matplotlib notebook
text6.dispersion_plot(["knights", 'Arthur', 'grail', 'lady',"ni"])
## Shows where each word occurs across the length of the text

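A dispersion plot is essentially a picture of word offsets: for each word, the positions in the token list where it occurs. The sketch below computes those offsets for a toy token list. The function name `word_offsets` is hypothetical, and exact, case-sensitive matching is an assumption of this sketch.

```python
def word_offsets(tokens, targets):
    """For each target word, list the token positions where it occurs (exact match)."""
    return {t: [i for i, tok in enumerate(tokens) if tok == t] for t in targets}


# Toy token list standing in for a real text
tokens = "Arthur met the knights and the knights said ni ni to Arthur".split()
offsets = word_offsets(tokens, ["Arthur", "knights", "ni"])
for word, positions in offsets.items():
    print(word, positions)
```

Plotting each list of positions along a horizontal axis, one row per word, gives the same kind of view: clusters of marks show where in the text a word is concentrated.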

In [6]:
## Total Length of Each Text. 
print(len(text1))
print(len(text6))
print( "\n" )

## Occurrences of "Ahab" in Moby Dick
print(text1.count("Ahab"))

## Occurrences of "Arthur" in the Holy Grail.
print(text6.count("Arthur"))

260819
16967


501
36


In [7]:
### Percentage of the tokens in Moby Dick that are "Ahab"

100* text1.count("Ahab")/float(len(text1))


0.19208723290864546

In [8]:
## Sorted list of the distinct tokens (the vocabulary) in Moby Dick
sort = sorted(set(text1))
# print(len(sort))
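The last few cells boil down to two small calculations: what share of the tokens a word accounts for, and how many distinct tokens a text has relative to its length. A minimal sketch on a toy token list follows. The helper names `percentage` and `lexical_diversity` are assumptions for illustration, not NLTK functions.

```python
def percentage(count, total):
    """Share of `total` represented by `count`, as a percentage."""
    return 100 * count / total


def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens."""
    return len(set(tokens)) / len(tokens)


tokens = "the whale the sea the ship".split()

# "the" is 3 of the 6 tokens
print(percentage(tokens.count("the"), len(tokens)))

# 4 distinct tokens out of 6
print(lexical_diversity(tokens))

# The same calculation with the counts printed above for "Ahab" in Moby Dick
print(percentage(501, 260819))
```

In Python 3 the `/` operator already performs true division, so no `float()` conversion is needed.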

# Frequency Distributions

In [9]:
### identify the most commonly used words. 

fdist = FreqDist(text1)
print(fdist)

fdist.most_common(50)

#fdist["whale"]

<FreqDist with 19317 samples and 260819 outcomes>


[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767),
 ('have', 760),
 ('there', 715),
 ('But', 705),
 ('or', 697),
 ('were', 680),
 ('now', 646),
 ('which', 640),
 ('?', 637),
 ('me', 627),
 ('like', 624)]

### Other NLTK Functions
    Example                        Description

    fdist = FreqDist(samples)      create a frequency distribution containing the given samples
    fdist[sample] += 1             increment the count for this sample
    fdist['monstrous']             count of the number of times a given sample occurred
    fdist.freq('monstrous')        frequency of a given sample
    fdist.N()                      total number of sample outcomes (tokens counted)
    fdist.most_common(n)           the n most common samples and their frequencies
    for sample in fdist:           iterate over the samples
    fdist.max()                    sample with the greatest count
    fdist.tabulate()               tabulate the frequency distribution
    fdist.plot()                   graphical plot of the frequency distribution
    fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
    fdist1 |= fdist2               update fdist1 with counts from fdist2
    fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
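In NLTK 3, `FreqDist` is built on Python's `collections.Counter`, so several rows of the table can be tried without any corpus data. The sketch below uses a plain `Counter` on a toy token list. The comments naming the `FreqDist` counterparts are rough analogies, not a claim that the two behave identically in every case.

```python
from collections import Counter

tokens = "the whale and the sea and the ship".split()

counts = Counter(tokens)                    # compare: FreqDist(samples)
total = sum(counts.values())                # compare: fdist.N()
the_count = counts["the"]                   # compare: fdist['the']
the_freq = counts["the"] / total            # compare: fdist.freq('the')
top_two = counts.most_common(2)             # compare: fdist.most_common(2)
most_frequent = max(counts, key=counts.get) # compare: fdist.max()

print(total, the_count, the_freq, top_two, most_frequent)

# Two ways of combining plain Counters give different results:
first, second = Counter("aab"), Counter("abb")
print(first + second)   # adds the counts for each sample
print(first | second)   # keeps the larger count for each sample
```

The last two lines are worth running before relying on `|=` to merge counts, since adding and taking the maximum are not the same operation.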

# List Comprehensions

In [10]:
import nltk
sent = "That isn't a problem, Bob."
a= sent.split()
b = nltk.word_tokenize(sent)

print(a)
print(b)

['That', "isn't", 'a', 'problem,', 'Bob.']
['That', 'is', "n't", 'a', 'problem', ',', 'Bob', '.']


    [x for x in array]

Is the same as

    dest = []
    for x in array:
        dest.append(x)

In [11]:
array = b
[len(x) for x in array]

[4, 2, 3, 1, 7, 1, 3, 1]

In [12]:
##You can put conditionals in them too
[x for x in array if len(x) == 3]

["n't", 'Bob']

In [13]:
## Look at how complicated these can get!
[x.upper() for x in array if len(x) > 3 and x.startswith('T')]

['THAT']
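A comprehension with a condition and a transformation unrolls into a loop with an `if` and an `append`. The sketch below writes the previous cell both ways, using the tokenized sentence from above as a literal list so that it runs without NLTK. The names `shouted` and `shouted_loop` are just illustrative.

```python
array = ['That', 'is', "n't", 'a', 'problem', ',', 'Bob', '.']

# Comprehension: filter, then transform
shouted = [x.upper() for x in array if len(x) > 3 and x.startswith('T')]

# Equivalent explicit loop
shouted_loop = []
for x in array:
    if len(x) > 3 and x.startswith('T'):
        shouted_loop.append(x.upper())

print(shouted)
print(shouted == shouted_loop)
```

Reading a comprehension right to left (loop, then condition, then expression) often makes the longer ones easier to follow.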

In [14]:
sent2 = "This isn't a fish, Mary."
array2 =nltk.word_tokenize(sent2)

[x+y for x in array for y in array2]


['ThatThis',
 'Thatis',
 "Thatn't",
 'Thata',
 'Thatfish',
 'That,',
 'ThatMary',
 'That.',
 'isThis',
 'isis',
 "isn't",
 'isa',
 'isfish',
 'is,',
 'isMary',
 'is.',
 "n'tThis",
 "n'tis",
 "n'tn't",
 "n'ta",
 "n'tfish",
 "n't,",
 "n'tMary",
 "n't.",
 'aThis',
 'ais',
 "an't",
 'aa',
 'afish',
 'a,',
 'aMary',
 'a.',
 'problemThis',
 'problemis',
 "problemn't",
 'problema',
 'problemfish',
 'problem,',
 'problemMary',
 'problem.',
 ',This',
 ',is',
 ",n't",
 ',a',
 ',fish',
 ',,',
 ',Mary',
 ',.',
 'BobThis',
 'Bobis',
 "Bobn't",
 'Boba',
 'Bobfish',
 'Bob,',
 'BobMary',
 'Bob.',
 '.This',
 '.is',
 ".n't",
 '.a',
 '.fish',
 '.,',
 '.Mary',
 '..']
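With two `for` clauses, the comprehension behaves like nested loops: the first clause is the outer loop and the second is the inner one, which is why the output above has 8 × 8 = 64 entries. A small sketch with shortened lists (chosen here only to keep the output readable) shows the equivalence, along with `itertools.product` as an alternative.

```python
from itertools import product

array = ['That', 'is']
array2 = ['This', 'fish']

# Comprehension with two for clauses
pairs = [x + y for x in array for y in array2]

# Equivalent nested loops: the first clause is the outer loop
pairs_loop = []
for x in array:
    for y in array2:
        pairs_loop.append(x + y)

# The same pairing expressed with itertools.product
pairs_product = [x + y for x, y in product(array, array2)]

print(pairs)
```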

# Conditional Frequency Distributions

In [15]:
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre)
)

In [16]:
print(cfd["romance"].most_common(50))
print(cfd["news"].most_common(50))


[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993), ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690), ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496), ('with', 460), ('you', 456), ('for', 410), ('at', 402), ('He', 366), ('on', 362), ('him', 339), ('said', 330), ('!', 316), ('--', 291), ('be', 289), ('as', 282), (';', 264), ('have', 258), ('but', 252), ('not', 250), ('would', 244), ('She', 232), ('The', 230), ('out', 217), ('were', 214), ('up', 211), ('all', 209), ('from', 202), ('could', 193), ('me', 193), ('like', 185), ('been', 179), ('so', 174), ('there', 169)]
[('the', 5580), (',', 5188), ('.', 4030), ('of', 2849), ('and', 2146), ('to', 2116), ('a', 1993), ('in', 1893), ('for', 943), ('The', 806), ('that', 802), ('``', 732), ('is', 732), ('was', 717), ("''", 702), ('on', 657), ('at', 598), ('with', 545), ('be', 526), ('by', 497), ('as', 481), ('he', 451), ('sai

In [17]:
%matplotlib notebook
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
## Condition on language; each sample is the length of a word
cfd = nltk.ConditionalFreqDist((lang, len(word)) for lang in languages for word in udhr.words(lang + '-Latin1'))
## Cumulative counts of words by length (lengths 0 to 9) for two of the languages
cfd.tabulate(conditions=['English', 'German_Deutsch'], samples=range(10), cumulative=True)
cfd.plot()

                  0    1    2    3    4    5    6    7    8    9 
       English    0  185  525  883  997 1166 1283 1440 1558 1638 
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275 



### More happy charts from your book!

    Example                                 Description

    cfdist = ConditionalFreqDist(pairs)     create a conditional frequency distribution from a list of pairs
    cfdist.conditions()                     the conditions
    cfdist[condition]                       the frequency distribution for this condition
    cfdist[condition][sample]               frequency for the given sample for this condition
    cfdist.tabulate()                       tabulate the conditional frequency distribution
    cfdist.tabulate(samples, conditions)    tabulation limited to the specified samples and conditions
    cfdist.plot()                           graphical plot of the conditional frequency distribution
    cfdist.plot(samples, conditions)        graphical plot limited to the specified samples and conditions
    cfdist1 < cfdist2                       test if samples in cfdist1 occur less frequently than in cfdist2
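Conceptually, a conditional frequency distribution is one frequency distribution per condition, built from `(condition, sample)` pairs. The pure-Python sketch below mimics that structure with a dictionary of counters. It is an approximation for illustration, not NLTK's implementation, and `conditional_freq` and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Build {condition: Counter of samples} from (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Toy (genre, word) pairs standing in for a real corpus
pairs = [
    ("news", "the"), ("news", "the"), ("news", "said"),
    ("romance", "she"), ("romance", "the"),
]
cfd = conditional_freq(pairs)

print(sorted(cfd))                   # the conditions
print(cfd["news"].most_common(1))    # most common sample for one condition
print(cfd["romance"]["she"])         # count of one sample under one condition
```

Swapping the sample for `len(word)` turns the same structure into the word-length comparison used in the language example above.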