# GCDRI Jan 2017:  Text Analysis and Text Classification with NLTK and scikit-learn
- Jupyter Notebook by Rachel Rakov

## Welcome!  Let's get started by importing some data!
- A large collection of textual data is called a *corpus* (pluralized as *corpora*) 
- I will be using the term corpus or corpora throughout this workshop

In [1]:
import nltk
import matplotlib
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Let's take a look at some text!

In [2]:
print text3[:100]
#print (text3[:100])

[u'In', u'the', u'beginning', u'God', u'created', u'the', u'heaven', u'and', u'the', u'earth', u'.', u'And', u'the', u'earth', u'was', u'without', u'form', u',', u'and', u'void', u';', u'and', u'darkness', u'was', u'upon', u'the', u'face', u'of', u'the', u'deep', u'.', u'And', u'the', u'Spirit', u'of', u'God', u'moved', u'upon', u'the', u'face', u'of', u'the', u'waters', u'.', u'And', u'God', u'said', u',', u'Let', u'there', u'be', u'light', u':', u'and', u'there', u'was', u'light', u'.', u'And', u'God', u'saw', u'the', u'light', u',', u'that', u'it', u'was', u'good', u':', u'and', u'God', u'divided', u'the', u'light', u'from', u'the', u'darkness', u'.', u'And', u'God', u'called', u'the', u'light', u'Day', u',', u'and', u'the', u'darkness', u'he', u'called', u'Night', u'.', u'And', u'the', u'evening', u'and', u'the', u'morning', u'were', u'the']


### Concordance
#### Shows each occurrence of a word in its surrounding context; by default each line is 79 characters wide (roughly 40 on either side of the word)

In [11]:
word = "death"
#width = int
text1.concordance(word)

print 

Displaying 25 of 90 matches:
e ." -- IBID . " HISTORY OF LIFE AND DEATH ." " The sovereignest thing on earth
 BLACKSTONE . " Soon to the sport of death the crews repair : Rodmond unerring 
 both sides , and of which the wight Death is the only glazier ." True enough ,
al and savage could ever have gone a death - harvesting with such a hacking , h
arly sells the sailors deliriums and death . Abominable are the tumblers into w
ed me tightly , as though naught but death should part us twain . I now strove 
why the Life Insurance Companies pay death - forfeitures upon immortals ; in wh
 immortal by brevet . Yes , there is death in this business of whaling -- a spe
ely mistaken this matter of Life and Death . Methinks that what they call my sh
 him , as over the man who bleeds to death , for conscience is the wound , and 
leg , what it is to have the fear of death ; how , then , can ' st thou prate i
in Ahab , did ' st thou not think of Death and the Judgment then ?" " Hear him 
ent we thou

In [12]:
text3.concordance(word)
print 

# concordance() prints its matches and returns None, so there is nothing useful to assign
text6.concordance(word)

Displaying 7 of 7 matches:
sh for she said , Let me not see the death of the child . And she sat over agai
c was comforted after his mother ' s death . Then again Abraham took a wife , a
wife . And it came to pass after the death of Abraham , that God blessed his so
n or his wife shall surely be put to death . Then Isaac sowed in that land , an
ilistines had stopped them after the death of Abrah and he called their names a
bless thee before the LORD before my death . Now therefore , my son , obey my v
nd that he may bless thee before his death . And Jacob said to Rebekah his moth

Displaying 10 of 10 matches:
 ?! OLD MAN : Seek you the Bridge of Death . ARTHUR : The Bridge of Death , whi
ge of Death . ARTHUR : The Bridge of Death , which leads to the Grail ? OLD MAN
son Herbert , has just fallen to his death . GUESTS : Oh ! Oh no ! FATHER : But
p clap clap ] For , since the tragic death of her father -- GUEST # 2 : He ' s 
over , suddenly felt the icy hand of death upon him . BRIDE ' S
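
To make the idea concrete, here is a minimal plain-Python sketch of a keyword-in-context display. It is not NLTK's implementation: the function name `simple_concordance` and its `window` parameter are made up for illustration, and it measures context in tokens rather than characters. It uses `print()` with parentheses so it also runs under Python 3.

```python
def simple_concordance(tokens, word, window=4):
    """Return one context string per occurrence of `word` (case-insensitive)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append("{} [{}] {}".format(left, token, right))
    return lines

# Tokens taken from one of the Moby Dick matches shown above
tokens = "Yes , there is death in this business of whaling".split()
for line in simple_concordance(tokens, "death"):
    print(line)
```

The match is bracketed with up to four tokens on either side, which is the same idea as the fixed-width lines that `concordance` prints.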

## Common contexts
### Takes a list of words and shows the contexts (the word before and the word after) that they share within a single text

In [13]:
text1.common_contexts(["pretty", "very"])
#requires a list as an argument
## across the same text

a_sharp a_large and_soon a_little be_much
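
Each item in that output is a pair of neighbors, such as `a_large`, that surrounds both words somewhere in the text. A rough plain-Python sketch of the idea follows. It is an illustration under simplifying assumptions (exact-case matching, a hypothetical function name `simple_common_contexts`), not NLTK's own code.

```python
def simple_common_contexts(tokens, words):
    """Return the (previous, next) word pairs shared by every word in `words`."""
    contexts = {w: set() for w in words}
    for i in range(1, len(tokens) - 1):
        if tokens[i] in contexts:
            contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return set.intersection(*contexts.values())

# Made-up toy text, not from Moby Dick
tokens = "it was a pretty large ship and a very large whale".split()
shared = simple_common_contexts(tokens, ["pretty", "very"])
print(sorted(shared))
```

Here both "pretty" and "very" occur between "a" and "large", so that pair is the one shared context.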


## Dispersion plots
### Used to see where particular words appear in your corpus

In [14]:
%matplotlib notebook
#text6.dispersion_plot(["knights", 'ARTHUR', 'ROBIN', 'grail', 'lady',"ni"])
#text1.dispersion_plot(["Starbuck", 'whale', 'Ahab', 'Ishmael', 'sea',"death"])
text3.dispersion_plot(["God", "apple", "garden", "woman"])

##Shows location of word in a text

[Interactive dispersion plot of "God", "apple", "garden", and "woman" across text3]

## Great!  Let's count things now!

In [15]:
#How many words are in the texts?
print len(text1)
print len(text6)
print "\n"


# How many times are particular words in the text?
print text1.count("Ahab")
print text6.count("Arthur")
print text6.count("ARTHUR")

260819
16967


501
36
225


In [45]:
## How can I get percentages of words in a text?
print 100* text1.count("the")/float(len(text1))

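# Note: this divides text3's count by len(text2), so the second figure below is not
# the share of "the" within text3; use float(len(text3)) for that.
print 100* text3.count("the")/float(len(text2))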

5.26073637273
1.70297225518


## Frequency Distributions

In [16]:
from nltk import FreqDist
fdist = FreqDist(text1)
#print fdist

#fdist.most_common(100)

#fdist["whale"]

In [17]:
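## shows how many times "whale" occurs in the text
## (only the last expression in a cell is displayed, so wrap this in print to see it)
fdist["whale"] 

# shows the relative frequency of "whale" in the text (a proportion, not a percentage)
fdist.freq("whale")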

# shows the most common token
#fdist.max()



0.003473673313677301

### A happy chart from "Natural Language Processing with Python" (Bird, Klein & Loper)
    # Example                      # Description

    fdist = FreqDist(samples)      create a frequency distribution containing the given samples
    fdist[sample] += 1             increment the count for this sample
    fdist['monstrous']             count of the number of times a given sample occurred
    fdist.freq('monstrous')        frequency of a given sample
    fdist.N()                      total number of samples
    fdist.most_common(n)           the n most common samples and their frequencies
    for sample in fdist:           iterate over the samples
    fdist.max()                    sample with the greatest count
    fdist.tabulate()               tabulate the frequency distribution
    fdist.plot()                   graphical plot of the frequency distribution
    fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
    fdist1 |= fdist2               update fdist1 with counts from fdist2
    fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
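
If you want to see what these operations compute without loading a corpus, the standard library's `collections.Counter` behaves much like a `FreqDist` for the basics. The sketch below is only an analogy on a handful of tokens: the variable names are illustrative, and `Counter` has no `freq()` or `N()` methods, so those are computed by hand.

```python
from collections import Counter

# Tokens from the opening of text3, as printed earlier
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

counts = Counter(tokens)

total = sum(counts.values())              # like fdist.N()
the_count = counts["the"]                 # like fdist["the"]
the_freq = counts["the"] / float(total)   # like fdist.freq("the")
top = counts.most_common(1)               # like fdist.most_common(1)

print(total)
print(the_count)
print(round(the_freq, 4))
print(top)
```

Note that the relative frequency is a proportion between 0 and 1; multiply by 100 for a percentage.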

## Word Tokenization

In [18]:
from nltk import word_tokenize

In [19]:
paragraph = "Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies"\
" a small unregarded yellow sun.  Orbiting this at a distance at roughly nintey-eight million miles is an utterly "\
"insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think "\
"digital watches are a pretty neat idea."
print paragraph

Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies a small unregarded yellow sun.  Orbiting this at a distance at roughly nintey-eight million miles is an utterly insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.


In [20]:
p = word_tokenize(paragraph)
print p

['Far', 'out', 'in', 'the', 'uncharted', 'backwaters', 'of', 'the', 'unfashionable', 'end', 'of', 'the', 'Western', 'Spiral', 'arm', 'of', 'the', 'Galaxy', 'lies', 'a', 'small', 'unregarded', 'yellow', 'sun', '.', 'Orbiting', 'this', 'at', 'a', 'distance', 'at', 'roughly', 'nintey-eight', 'million', 'miles', 'is', 'an', 'utterly', 'insignificant', 'little', 'blue-green', 'planet', 'whose', 'ape-descended', 'life', 'forms', 'are', 'so', 'amazingly', 'primitive', 'that', 'they', 'still', 'think', 'digital', 'watches', 'are', 'a', 'pretty', 'neat', 'idea', '.']
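
Notice that the tokenizer splits punctuation off as its own token but keeps hyphenated words such as `blue-green` together. A very rough regular-expression sketch of that behavior is below. It is an assumption-laden stand-in, not how `word_tokenize` works internally; for example, `word_tokenize` also splits contractions like "don't" into "do" and "n't", which this sketch does not attempt.

```python
import re

def simple_tokenize(text):
    """Split text into words (keeping internal hyphens) and single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

sentence = "They still think digital watches are a pretty neat idea."
print(simple_tokenize(sentence))
print(simple_tokenize("an insignificant little blue-green planet"))
```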


## Text comparisons using Frequency Distributions

In [21]:
from nltk.corpus import brown

In [22]:
cats = brown.categories()
for i in cats:
    print i

adventure
belles_lettres
editorial
fiction
government
hobbies
humor
learned
lore
mystery
news
religion
reviews
romance
science_fiction


In [7]:
romance_sent = brown.sents(categories=["romance"])
print romance_sent[:5]

[[u'They', u'neither', u'liked', u'nor', u'disliked', u'the', u'Old', u'Man', u'.'], [u'To', u'them', u'he', u'could', u'have', u'been', u'the', u'broken', u'bell', u'in', u'the', u'church', u'tower', u'which', u'rang', u'before', u'and', u'after', u'Mass', u',', u'and', u'at', u'noon', u',', u'and', u'at', u'six', u'each', u'evening', u'--', u'its', u'tone', u',', u'repetitive', u',', u'monotonous', u',', u'never', u'breaking', u'the', u'boredom', u'of', u'the', u'streets', u'.'], [u'The', u'Old', u'Man', u'was', u'unimportant', u'.'], [u'Yet', u'if', u'he', u'were', u'not', u'there', u',', u'they', u'would', u'have', u'missed', u'him', u',', u'as', u'they', u'would', u'have', u'missed', u'the', u'sounds', u'of', u'bees', u'buzzing', u'against', u'the', u'screen', u'door', u'in', u'early', u'June', u';', u';'], [u'or', u'the', u'smell', u'of', u'thick', u'tomato', u'paste', u'--', u'the', u'ripe', u'smell', u'that', u'was', u'both', u'sweet', u'and', u'sour', u'--', u'rising', u'up', 

In [23]:
news = brown.words(categories=["news"])  #get all of the words from the "news" category
romance = brown.words(categories=["romance"]) # get all of the words from the "romance" category

## Build some frequency distributions!
fdist_news = FreqDist(w.lower() for w in news)
fdist_romance = FreqDist(w.lower() for w in romance)

modals = ["can", "could", "might", "may", "would", "must", "will"]

print "word:\t news \t \t romance"
print "_________________________________"
for m in modals:
    print m +":","\t", "%f \t %f"  %(fdist_news.freq(m)*100, fdist_romance.freq(m)*100)


word:	 news 	 	 romance
_________________________________
can: 	0.093482 	 0.112822
could: 	0.086521 	 0.278484
might: 	0.037791 	 0.072834
may: 	0.092488 	 0.015709
would: 	0.244645 	 0.352746
must: 	0.052708 	 0.065694
will: 	0.386857 	 0.069978


## Part-of-Speech (POS) tagging

In [24]:
from nltk import pos_tag  ### tags a list of tokens with parts of speech
from nltk import pos_tag_sents ### tags a list of sentences (each one a list of tokens)

In [25]:
paragraph = "Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies"\
" a small unregarded yellow sun.  Orbiting this at a distance at roughly nintey-eight million miles is an utterly "\
"insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think "\
"digital watches are a pretty neat idea."
p = word_tokenize(paragraph)
print p

['Far', 'out', 'in', 'the', 'uncharted', 'backwaters', 'of', 'the', 'unfashionable', 'end', 'of', 'the', 'Western', 'Spiral', 'arm', 'of', 'the', 'Galaxy', 'lies', 'a', 'small', 'unregarded', 'yellow', 'sun', '.', 'Orbiting', 'this', 'at', 'a', 'distance', 'at', 'roughly', 'nintey-eight', 'million', 'miles', 'is', 'an', 'utterly', 'insignificant', 'little', 'blue-green', 'planet', 'whose', 'ape-descended', 'life', 'forms', 'are', 'so', 'amazingly', 'primitive', 'that', 'they', 'still', 'think', 'digital', 'watches', 'are', 'a', 'pretty', 'neat', 'idea', '.']


In [26]:
paragraph_POS = pos_tag(p)
print paragraph_POS

[('Far', 'CD'), ('out', 'RP'), ('in', 'IN'), ('the', 'DT'), ('uncharted', 'JJ'), ('backwaters', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('unfashionable', 'JJ'), ('end', 'NN'), ('of', 'IN'), ('the', 'DT'), ('Western', 'JJ'), ('Spiral', 'NNP'), ('arm', 'NN'), ('of', 'IN'), ('the', 'DT'), ('Galaxy', 'NNP'), ('lies', 'VBZ'), ('a', 'DT'), ('small', 'JJ'), ('unregarded', 'JJ'), ('yellow', 'JJ'), ('sun', 'NN'), ('.', '.'), ('Orbiting', 'VBG'), ('this', 'DT'), ('at', 'IN'), ('a', 'DT'), ('distance', 'NN'), ('at', 'IN'), ('roughly', 'RB'), ('nintey-eight', 'JJ'), ('million', 'CD'), ('miles', 'NNS'), ('is', 'VBZ'), ('an', 'DT'), ('utterly', 'JJ'), ('insignificant', 'JJ'), ('little', 'JJ'), ('blue-green', 'JJ'), ('planet', 'NN'), ('whose', 'WP$'), ('ape-descended', 'JJ'), ('life', 'NN'), ('forms', 'NNS'), ('are', 'VBP'), ('so', 'RB'), ('amazingly', 'RB'), ('primitive', 'VBP'), ('that', 'IN'), ('they', 'PRP'), ('still', 'RB'), ('think', 'VBP'), ('digital', 'JJ'), ('watches', 'NNS'), ('are', 'VBP'), (

In [27]:
### Because not all of these tags are intuitive ###
nltk.help.upenn_tagset("JJ")

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


## Feature extraction using the Brown Corpus
#### Can we train a computer to predict whether a sentence belongs in the news corpus or the romance corpus?

### Part of speech - number of nouns in sentences
Let's start by getting all of the sentences from the news and romance categories of the Brown corpus

In [28]:
news_sent = brown.sents(categories=["news"])
romance_sent = brown.sents(categories=["romance"])

print len(news_sent)
print len(romance_sent)

4623
4431


Next, let's part-of-speech tag each word in each sentence!

Note: We use `pos_tag_sents` because we are tagging a list of sentences, rather than a single list of words

In [29]:
news_pos = pos_tag_sents(news_sent)
romance_pos = pos_tag_sents(romance_sent)

Now let's create a function that will count how many nouns are in each sentence of the corpus

In [30]:
def countNouns(tagged_sents):
    """Return a list with the number of nouns in each POS-tagged sentence."""
    all_noun_counts = []
    for sentence in tagged_sents:
        noun_count = 0
        for word, tag in sentence:
            if tag[:2] == "NN":  ## matches NN, NNS, NNP, and NNPS: singular, plural, and proper nouns
                noun_count += 1
        all_noun_counts.append(noun_count)
    return all_noun_counts

news_counts = countNouns(news_pos)
romance_counts = countNouns(romance_pos)

           
        

In [31]:
print romance_counts[:20]
print news_counts[:20]

[2, 11, 2, 4, 8, 8, 3, 1, 3, 3, 4, 1, 4, 6, 3, 1, 3, 0, 5, 4]
[11, 13, 16, 9, 5, 5, 11, 1, 5, 9, 3, 6, 5, 11, 20, 5, 5, 5, 12, 1]
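
To check the counting logic by hand, here it is applied to two short tagged fragments copied from the `pos_tag` output shown earlier. This is a compact sketch of the same idea written as a list comprehension; the name `count_nouns_per_sentence` is made up for this example.

```python
def count_nouns_per_sentence(tagged_sents):
    """Return the number of NN* tags (NN, NNS, NNP, NNPS) in each tagged sentence."""
    return [sum(1 for _, tag in sent if tag.startswith("NN"))
            for sent in tagged_sents]

# Two fragments copied from the pos_tag output shown earlier
tagged = [
    [("the", "DT"), ("uncharted", "JJ"), ("backwaters", "NNS"), ("of", "IN"),
     ("the", "DT"), ("unfashionable", "JJ"), ("end", "NN")],
    [("a", "DT"), ("small", "JJ"), ("unregarded", "JJ"), ("yellow", "JJ"),
     ("sun", "NN"), (".", ".")],
]
print(count_nouns_per_sentence(tagged))
```

The first fragment contains "backwaters" (NNS) and "end" (NN); the second contains only "sun" (NN).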


## Machine learning:  Building train and test sets
### Separating data, getting labels, and aligning them with features

In [32]:
import pandas as pd
import sklearn

## Create training and testing labels

In [33]:
cats = ["news", "romance"] #define what categories of brown corpus we want
print cats    
text = [brown.sents(categories=cat) for cat in cats]
test_sets_int = 500 ## specify how many test sentences we will have per category

######## create labels for test and training sets ##############
### find how many sentences there are, subtract test_sets_int for the correct number of training and testing labels
lengths = []
for i in range(len(cats)):
    start_length = len(text[i])
    print start_length
    length = start_length - test_sets_int
    #print length
    lengths.append(length)

print lengths

#### concatenate the labels together #############
train_labels = ["news"]*lengths[0]+["romance"]*lengths[1]
test_labels = ["news"]*test_sets_int+["romance"]*test_sets_int

print train_labels[:10]
print train_labels[-10:]

['news', 'romance']
4623
4431
[4123, 3931]
['news', 'news', 'news', 'news', 'news', 'news', 'news', 'news', 'news', 'news']
['romance', 'romance', 'romance', 'romance', 'romance', 'romance', 'romance', 'romance', 'romance', 'romance']


## Create training and testing data

#### First, let's separate out the training data from the testing data

In [34]:
#### take the first 500 count features from each dataset and use them as test - use the rest as train ######
news_values_test = news_counts[:test_sets_int]
news_values_train = news_counts[test_sets_int:]
romance_values_test = romance_counts[:test_sets_int]
romance_values_train = romance_counts[test_sets_int:]
print len(news_values_test)
print len(news_values_train)
print len (romance_values_test)
print len(romance_values_train)


500
4123
500
3931


In [66]:
###### concatenate the lists of data together ######
train_features = news_values_train+romance_values_train
test_features = news_values_test+romance_values_test
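
The important invariant in the cells above is that features and labels stay in the same order: the i-th feature must belong to the i-th label. One way to make that harder to get wrong is to build both lists in a single pass. The helper below is a hypothetical sketch (the name `split_by_category` is not part of the workshop code), shown on the first 20 noun counts printed earlier with 5 test items per category instead of 500.

```python
def split_by_category(features_by_cat, n_test):
    """Hold out the first n_test items of each category; keep labels aligned."""
    train_x, train_y, test_x, test_y = [], [], [], []
    for label, values in features_by_cat:
        test_x += values[:n_test]
        test_y += [label] * len(values[:n_test])
        train_x += values[n_test:]
        train_y += [label] * len(values[n_test:])
    return train_x, train_y, test_x, test_y

# First 20 noun counts per category, as printed earlier
news_sample = [11, 13, 16, 9, 5, 5, 11, 1, 5, 9, 3, 6, 5, 11, 20, 5, 5, 5, 12, 1]
romance_sample = [2, 11, 2, 4, 8, 8, 3, 1, 3, 3, 4, 1, 4, 6, 3, 1, 3, 0, 5, 4]

train_x, train_y, test_x, test_y = split_by_category(
    [("news", news_sample), ("romance", romance_sample)], n_test=5)

print("{} {} {} {}".format(len(train_x), len(train_y), len(test_x), len(test_y)))
print(test_x)
print(test_y)
```

Taking the first sentences of each category as the test set is simple but not random; a shuffled split is a common alternative when sentence order might matter.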

## Introduction to pandas DataFrames
So now we have both train_features and train_labels, as well as test_features and test_labels.  Let's manage this data!

In [67]:
#### create two DataFrames - one for train, one for testing ####
train_data = pd.DataFrame(train_features, columns=["number of nouns"])
test_data = pd.DataFrame(test_features, columns=["number of nouns"])

In [68]:
train_data

|   | number of nouns |
|---|---|
| 0 | 2 |
| 1 | 11 |
| 2 | 10 |
| 3 | 7 |
| 4 | 8 |
| 5 | 12 |
| 6 | 4 |
| 7 | 11 |
| 8 | 2 |
| 9 | 2 |


Let's add our labels to our dataframe!

In [69]:
##### you can add columns to DataFrames like this! ####
train_data["labels"] = train_labels
test_data["labels"] = test_labels

In [70]:
test_data

|   | number of nouns | labels |
|---|---|---|
| 0 | 11 | news |
| 1 | 13 | news |
| 2 | 16 | news |
| 3 | 9 | news |
| 4 | 5 | news |
| 5 | 5 | news |
| 6 | 11 | news |
| 7 | 1 | news |
| 8 | 5 | news |
| 9 | 9 | news |


## Text classification using scikit learn

#### Separate the dataframe into data and labels, for both train and test sets

In [71]:
####### We use the naming conventions of sklearn here ########
X_train = train_data["number of nouns"]
y_train = train_data["labels"]


X_test = test_data["number of nouns"]
y_test = test_data["labels"]

### Note that because we only have one feature, we need to reshape our data

##### sklearn will tell you when your data needs to be reshaped, and will tell you how to do it

In [72]:
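## Series.reshape is not available in newer pandas versions, so reshape the underlying NumPy array instead
X_train = X_train.values.reshape(-1,1)
X_test = X_test.values.reshape(-1,1)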
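
Why the reshape? scikit-learn expects the feature data as a two-dimensional structure with one row per sample and one column per feature. With a single feature we have a flat sequence of numbers, so `reshape(-1, 1)` turns it into a single column, where `-1` means "work out the number of rows for me". The plain-Python sketch below shows the same transformation on a list; `to_column` is a made-up helper name used only for illustration.

```python
def to_column(values):
    """Turn a flat list of n values into n rows with one feature each."""
    return [[v] for v in values]

counts = [11, 13, 16, 9, 5]      # one noun count per sentence
column = to_column(counts)

print(column)
print("{} samples x {} feature".format(len(column), len(column[0])))
```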

In [73]:
print len(train_features)
print len(train_labels)

8054
8054


## Classification with Linear SVC

### Step 1:  Import your classifier

In [74]:
from sklearn.svm import LinearSVC

### Step 2: Create an instance of your classifier 

In [75]:
classifier = LinearSVC()

### Step 3: Fit, predict, and score

In [76]:
classifier.fit(X_train,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [77]:
predictions = classifier.predict(X_test)

In [78]:
for i in range(len(y_test)):
    print predictions[i], y_test[i]

news news
news news
news news
news news
news news
news news
news news
romance news
news news
news news
romance news
news news
news news
news news
news news
news news
news news
news news
news news
romance news
news news
news news
news news
news news
romance news
romance news
news news
romance news
news news
news news
news news
romance news
romance news
news news
romance news
romance news
news news
news news
news news
news news
news news
romance news
romance news
news news
news news
news news
news news
news news
romance news
romance news
news news
news news
news news
news news
romance news
news news
news news
news news
news news
news news
news news
romance news
news news
news news
news news
romance news
news news
news news
news news
news news
news news
news news
romance news
news news
news news
news news
news news
news news
romance news
news news
news news
romance news
news news
news news
news news
news news
news news
romance news
romance news
romance news
news news
news news
news news
r

In [79]:
classifier.score(X_test, y_test)

0.67400000000000004

## Evaluate your model

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(y_test, predictions)
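
A confusion matrix breaks the single accuracy score down by class: each row is a true label, each column is a predicted label, and by default the labels appear in sorted order. The sketch below builds one by hand so you can see where each number comes from. It uses made-up toy labels, not the workshop's results, and `simple_confusion` is a hypothetical helper rather than scikit-learn's implementation.

```python
from collections import Counter

def simple_confusion(y_true, y_pred):
    """Return (labels, matrix) with rows = true label, columns = predicted label."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))
    matrix = [[pairs[(t, p)] for p in labels] for t in labels]
    return labels, matrix

# Made-up toy labels, not the workshop's results
y_true = ["news", "news", "news", "romance", "romance", "romance"]
y_pred = ["news", "romance", "news", "romance", "romance", "news"]

labels, matrix = simple_confusion(y_true, y_pred)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / float(len(y_true))

print(labels)
print(matrix)
print(round(accuracy, 3))
```

The diagonal holds the correct predictions, so dividing its sum by the number of test items gives the same kind of accuracy figure that `score` reports.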

# Extra materials
We may not have time to get to these things, but here are some additional materials to help you in your classification future!
- Changing classifiers
- Adding additional features

## Classification with k-nearest neighbors
What if we used the classifier we described before, k-nearest neighbors?  How well do we do?

### Remember our steps from before:

In [None]:
from sklearn.neighbors import KNeighborsClassifier  ### import your classifier

classifier = KNeighborsClassifier(n_neighbors=3)  ### Create a new instance of your classifier 

classifier.fit(X_train, y_train)  ### fit

predictions = classifier.predict(X_test) ### predict


In [None]:
classifier.score(X_test, y_test) ### score

In [None]:
confusion_matrix(y_test, predictions) ### evaluate
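
For intuition about what k-nearest neighbors does with our single feature: to classify a sentence, it finds the k training sentences with the closest noun counts and takes a majority vote over their labels. Below is a minimal sketch of that idea on made-up toy data. It is not scikit-learn's implementation (which is far more efficient and may break ties differently), and `knn_predict` is a hypothetical name.

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Predict the majority label among the k training points closest to x."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up toy noun counts and labels
train_x = [1, 2, 3, 4, 10, 11, 12, 13]
train_y = ["romance"] * 4 + ["news"] * 4

print(knn_predict(train_x, train_y, 2))
print(knn_predict(train_x, train_y, 12))
```

With two classes, an odd k such as 3 avoids tied votes.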

## Let's add another feature - number of modal verbs in a sentence!

In [None]:
### Count how many of these modal words appear in each sentence
### (each modal counts at most once per sentence, and matching is case-sensitive)

def count_modals(sents):
    modal_words = ["can", "could", "might", "may", "would", "must", "will"]
    modal_features = []
    for sent in sents:
        modals_count = 0
        for word in modal_words:
            if word in sent:
                modals_count += 1
        modal_features.append(modals_count)
    print len(modal_features)   
    return modal_features

news_modals = count_modals(brown.sents(categories="news"))
romance_modals = count_modals(brown.sents(categories="romance"))
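
One thing to be aware of: `word in sent` only checks whether a modal appears at least once, so a sentence that uses "will" twice still contributes 1, and a capitalized "Will" at the start of a sentence is missed. If you would rather count every occurrence regardless of case, a possible variant is sketched below. It is an alternative for comparison, not the feature used in this workshop, and the names `MODAL_WORDS` and `count_modal_tokens` are made up.

```python
MODAL_WORDS = {"can", "could", "might", "may", "would", "must", "will"}

def count_modal_tokens(sents):
    """Count every modal token in each sentence, ignoring case."""
    return [sum(1 for w in sent if w.lower() in MODAL_WORDS) for sent in sents]

# Made-up toy sentences, already tokenized
sents = [
    ["Will", "she", "say", "she", "will", "come", "?"],
    ["They", "neither", "liked", "nor", "disliked", "him", "."],
]
print(count_modal_tokens(sents))
```

Ignoring case has its own cost: names such as "Will" or "May" would be counted as modals too.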


In [None]:
print news_modals[:5]

## Create training and test sets of modal features
Second verse, same as the first!

In [None]:
###### create feature vectors of modal counts #####
news_modals_test = news_modals[:test_sets_int]
news_modals_train = news_modals[test_sets_int:]
romance_modals_test = romance_modals[:test_sets_int]
romance_modals_train = romance_modals[test_sets_int:]
print len(news_modals_test)
print len(news_modals_train)
print len (romance_modals_test)
print len(romance_modals_train)

In [None]:
### concatenate the modal features #####
modal_features_train = news_modals_train+romance_modals_train
modal_features_test = news_modals_test+romance_modals_test

In [None]:
print modal_features_train[:10]


### Adding columns to existing DataFrames
You can insert a column into a DataFrame at a specific position!

In [None]:
train_data.insert(1, "number of modals", modal_features_train)

In [None]:
test_data.insert(1, "number of modals", modal_features_test)

### Splitting DataFrames with more than one feature
Select the feature columns by their position, or select a single column by its name

In [None]:
X_train = train_data[train_data.columns[:2]]
y_train = train_data["labels"]


X_test = test_data[test_data.columns[:2]]
y_test = test_data["labels"]

In [None]:
print X_train

## NOTE:  Because we have more than one feature, we no longer need to reshape our data!

### Classify like before!  (No need to re-import your classifier)

In [None]:
classifier = LinearSVC()
classifier.fit(X_train,y_train)
predictions = classifier.predict(X_test)
classifier.score(X_test, y_test)

### Evaluate

In [None]:
confusion_matrix(y_test, predictions)

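Adding the modal count doesn't actually improve things by much, but that shouldn't be too much of a surprise: when we compared modal frequencies earlier, each modal made up well under 1% of the words in either category, so many sentences are likely to contain no modal at all.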