# NLP with NLTK (.. and a little sklearn) 

Natural Language Processing with Natural Language Toolkit (`NLTK`): est. 2001

[nltk](http://www.nltk.org/) is a Python package for NLP.

TOKENIZATION: (nltk tokenize) <br>
POS TAGGING: (nltk pos_tag, TextBlob tags) <br>
SENTIMENT ANALYSIS: (TextBlob) <br>
STEMMING: (nltk stem) <br>
WORD COUNTS: (TextBlob word_counts) <br>
CHUNKING: (nltk chunk) <br>


*GOAL:*
- Total number of different words
- Word repetition count
- Count of words starting with a specific letter
- Number of times letters or numbers are mentioned
- Count of pronouns
- Number of words outside of early childhood vocabulary
- How many characters are involved?
- Any way to count self-regulation? Kindness?
- Rhyming? Counting phonemes.
- Words per minute?
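
Several of these goals need nothing beyond the standard library. The sketch below is a minimal illustration, not the notebook's method: it assumes a simple regex definition of a "word" and a hand-written pronoun list, and `sample` is a hypothetical two-line excerpt standing in for the full lyrics.

```python
import re
from collections import Counter

# Hypothetical two-line sample; the full lyrics would work the same way.
sample = "Skin is ever so lovely no matter the color you're in\nLet's hear it for skin"

# Assumption: a word is a run of letters, optionally with internal apostrophes.
words = re.findall(r"[a-z]+(?:'[a-z]+)*", sample.lower())

counts = Counter(words)        # word repetition count
n_distinct = len(counts)       # total number of different words
starts_with_s = [w for w in counts if w.startswith("s")]

# Assumption: a small hand-written pronoun list; POS tags are more reliable.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "me", "him", "her", "us", "them"}
n_pronouns = sum(counts[w] for w in PRONOUNS)

print(n_distinct, counts["skin"], starts_with_s, n_pronouns)
```

Note that this regex keeps "you're" as one token, so it is not counted as the pronoun "you". That is one reason a tokenizer plus POS tags is a better fit for the pronoun goal.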

In [1]:
from __future__ import print_function

In [24]:
#!pip install nltk
import nltk
import pandas as pd

Much of NLTK depends on additional data that you have to download separately. Use `nltk.download()` to get at least the following:

 * averaged_perceptron_tagger (in models)
 * maxent_treebank_pos_tagger (in models)
 * punkt (in models)
 * maxent_ne_chunker (in models)
 * words (in corpora)
 * movie_reviews and stopwords (in corpora), for the movie review section

You can install these and continue without restarting your kernel.
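
If you would rather skip the interactive downloader, the packages can also be fetched by id. The helper below is a hypothetical convenience wrapper (`download_required` is not part of NLTK). It assumes the ids match your NLTK version; newer releases may ask for differently named resources, in which case the `LookupError` message names the one it wants.

```python
# Ids for the packages listed above, plus the corpora used later in this notebook.
# Assumption: these ids are valid for the installed NLTK version.
REQUIRED = [
    "averaged_perceptron_tagger",
    "maxent_treebank_pos_tagger",
    "punkt",
    "maxent_ne_chunker",
    "words",
    "movie_reviews",
    "stopwords",
]

def download_required(download, names=REQUIRED):
    """Call `download(name, quiet=True)` for each id; return a name -> success mapping."""
    return {name: bool(download(name, quiet=True)) for name in names}

# Usage (needs network access):
#   import nltk
#   status = download_required(nltk.download)
```

Passing the download function in as an argument keeps the helper importable even where NLTK is not installed.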

In [25]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


SystemExit: 0

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


## Sentence tokenization

In [3]:
from nltk.tokenize import sent_tokenize

text = """Skin!
Covered all over with beautiful skin
Skin!
Covered all over from ankle to chin
Lovely skin on kneeses and noses and ev'rywhere
Skin on tummies and toeses and under your hair it's even there
Oh skin!
Wonderful colors and beautiful tones
Skin!
Think of without it you're nothing but bones
Skin is ever so lovely no matter the color you're in
Let's hear it for skin

Beautiful skin

Skin!
Covered all over with beautiful skin
Skin!
Covered all over from ankle to chin
Without skin for touching and rubbing how much we'd miss
There'd be no hands for shaking and scrubbing
And just think of this no cheeks to kiss
Oh skin!
Wonderful colors and beautiful tones
Skin!
Keeping the rain off our muscles and bones
Skin is ever so lovely no matter the color you're in
Let's hear it for skin
Beautiful skin"""

sentences = sent_tokenize(text)
print(sentences)


['Skin!', 'Covered all over with beautiful skin\nSkin!', "Covered all over from ankle to chin\nLovely skin on kneeses and noses and ev'rywhere\nSkin on tummies and toeses and under your hair it's even there\nOh skin!", 'Wonderful colors and beautiful tones\nSkin!', "Think of without it you're nothing but bones\nSkin is ever so lovely no matter the color you're in\nLet's hear it for skin\n\nBeautiful skin\n\nSkin!", 'Covered all over with beautiful skin\nSkin!', "Covered all over from ankle to chin\nWithout skin for touching and rubbing how much we'd miss\nThere'd be no hands for shaking and scrubbing\nAnd just think of this no cheeks to kiss\nOh skin!", 'Wonderful colors and beautiful tones\nSkin!', "Keeping the rain off our muscles and bones\nSkin is ever so lovely no matter the color you're in\nLet's hear it for skin\nBeautiful skin"]
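
The output shows that `sent_tokenize` only breaks at the exclamation marks, because the lyrics have no other sentence-ending punctuation, so several lines end up merged into one "sentence". For lyrics, splitting on line breaks may be a more natural unit. A minimal sketch, using a hypothetical excerpt of the lyrics:

```python
# Hypothetical excerpt of the lyrics; the full `text` string would work the same way.
lyrics = """Skin!
Covered all over with beautiful skin
Skin!

Beautiful skin"""

# Treat each non-empty line as one unit instead of relying on punctuation.
lines = [line.strip() for line in lyrics.splitlines() if line.strip()]
print(lines)
# ['Skin!', 'Covered all over with beautiful skin', 'Skin!', 'Beautiful skin']
```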


## Word tokenization

In [4]:
# TreebankWordTokenizer assumes the input has already been segmented into sentences.
# It separates words from punctuation.

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentences[5])

['Covered', 'all', 'over', 'with', 'beautiful', 'skin', 'Skin', '!']

In [5]:
from nltk.tokenize import word_tokenize
words = word_tokenize(sentences[5])
words

['Covered', 'all', 'over', 'with', 'beautiful', 'skin', 'Skin', '!']

In [6]:
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(sentences[5])

['Covered', 'all', 'over', 'with', 'beautiful', 'skin', 'Skin', '!']

Demo of different tokenizers: http://text-processing.com/demo/tokenize/
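
The three tokenizers agree on this sentence because it contains no contractions. They differ on text like "Let's": the Treebank-style tokenizers keep `'s` as one token, while `wordpunct_tokenize` splits on every run of punctuation. The sketch below approximates the latter with a plain regex. It assumes this pattern mirrors the one behind `wordpunct_tokenize`, so treat it as an illustration rather than a drop-in replacement.

```python
import re

# Assumption: runs of word characters, or runs of non-space punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

line = "Let's hear it for skin!"
tokens = WORDPUNCT.findall(line)
print(tokens)
# ['Let', "'", 's', 'hear', 'it', 'for', 'skin', '!']
```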

## Part of speech tagging

In [7]:
from nltk.tag import pos_tag

for i in sentences:
    words=pos_tag(word_tokenize(i))
    print(words)

[('Skin', 'NN'), ('!', '.')]
[('Covered', 'NNP'), ('all', 'DT'), ('over', 'IN'), ('with', 'IN'), ('beautiful', 'JJ'), ('skin', 'NN'), ('Skin', 'NNP'), ('!', '.')]
[('Covered', 'NNP'), ('all', 'DT'), ('over', 'IN'), ('from', 'IN'), ('ankle', 'NN'), ('to', 'TO'), ('chin', 'VB'), ('Lovely', 'RB'), ('skin', 'VBN'), ('on', 'IN'), ('kneeses', 'NNS'), ('and', 'CC'), ('noses', 'NNS'), ('and', 'CC'), ("ev'rywhere", 'JJ'), ('Skin', 'NNP'), ('on', 'IN'), ('tummies', 'NNS'), ('and', 'CC'), ('toeses', 'NNS'), ('and', 'CC'), ('under', 'IN'), ('your', 'PRP$'), ('hair', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('even', 'RB'), ('there', 'EX'), ('Oh', 'NNP'), ('skin', 'NN'), ('!', '.')]
[('Wonderful', 'JJ'), ('colors', 'NNS'), ('and', 'CC'), ('beautiful', 'JJ'), ('tones', 'NNS'), ('Skin', 'NNP'), ('!', '.')]
[('Think', 'NN'), ('of', 'IN'), ('without', 'IN'), ('it', 'PRP'), ('you', 'PRP'), ("'re", 'VBP'), ('nothing', 'NN'), ('but', 'CC'), ('bones', 'NNS'), ('Skin', 'NNP'), ('is', 'VBZ'), ('ever', 'RB'), ('so

### Some POS tags:
WP: wh-pronoun ("who", "what")  
VBZ: verb, 3rd person sing. present ("takes")  
VBG: verb, gerund/present participle ("taking")  
TO: to ("to go", "to him")   
DT: determiner ("the", "this")  
NN: noun, singular or mass ("door")  
.: Punctuation (".", "?")  

- [All tags](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
- [Breakdown](https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk)
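
Once tokens are tagged, the "count of pronouns" goal reduces to filtering on the tag. A minimal sketch, using a few `(word, tag)` pairs copied from the output above rather than calling the tagger:

```python
from collections import Counter

# A few (word, tag) pairs copied from the pos_tag output above.
tagged = [
    ("under", "IN"), ("your", "PRP$"), ("hair", "NN"),
    ("it", "PRP"), ("'s", "VBZ"), ("even", "RB"), ("there", "EX"),
    ("Oh", "NNP"), ("skin", "NN"), ("!", "."),
]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Penn Treebank pronoun tags: personal, possessive, wh-, possessive wh-.
PRONOUN_TAGS = {"PRP", "PRP$", "WP", "WP$"}
pronouns = [word for word, tag in tagged if tag in PRONOUN_TAGS]

print(tag_counts["NN"], pronouns)
# 2 ['your', 'it']
```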

## Chunking
Extracting phrases

In [8]:
# ne_chunk is the named-entity chunker; it takes POS-tagged tokens as input

from nltk.chunk import ne_chunk

for i in sentences:
    words = word_tokenize(i)
    tags = pos_tag(words)
    tree = ne_chunk(tags)
    print(tree)

(S (GPE Skin/NN) !/.)
(S
  (GPE Covered/NNP)
  all/DT
  over/IN
  with/IN
  beautiful/JJ
  skin/NN
  (PERSON Skin/NNP)
  !/.)
(S
  (GPE Covered/NNP)
  all/DT
  over/IN
  from/IN
  ankle/NN
  to/TO
  chin/VB
  Lovely/RB
  skin/VBN
  on/IN
  kneeses/NNS
  and/CC
  noses/NNS
  and/CC
  ev'rywhere/JJ
  Skin/NNP
  on/IN
  tummies/NNS
  and/CC
  toeses/NNS
  and/CC
  under/IN
  your/PRP$
  hair/NN
  it/PRP
  's/VBZ
  even/RB
  there/EX
  Oh/NNP
  skin/NN
  !/.)
(S
  (GPE Wonderful/JJ)
  colors/NNS
  and/CC
  beautiful/JJ
  tones/NNS
  (PERSON Skin/NNP)
  !/.)
(S
  Think/NN
  of/IN
  without/IN
  it/PRP
  you/PRP
  're/VBP
  nothing/NN
  but/CC
  bones/NNS
  (PERSON Skin/NNP)
  is/VBZ
  ever/RB
  so/RB
  lovely/JJ
  no/DT
  matter/NN
  the/DT
  color/NN
  you/PRP
  're/VBP
  in/IN
  (GPE Let/NNP)
  's/POS
  hear/VB
  it/PRP
  for/IN
  skin/JJ
  Beautiful/NNP
  skin/NN
  (PERSON Skin/NNP)
  !/.)
(S
  (GPE Covered/NNP)
  all/DT
  over/IN
  with/IN
  beautiful/JJ
  skin/NN
  (PERSON Skin/NNP)
  

In [31]:
tree.draw()

# TextBlob

In [9]:
#!pip install textblob
from textblob import TextBlob

sesame = TextBlob(text)

In [10]:
sesame.tags

[('Skin', 'NN'),
 ('Covered', 'NNP'),
 ('all', 'DT'),
 ('over', 'IN'),
 ('with', 'IN'),
 ('beautiful', 'JJ'),
 ('skin', 'NN'),
 ('Skin', 'NNP'),
 ('Covered', 'NNP'),
 ('all', 'DT'),
 ('over', 'IN'),
 ('from', 'IN'),
 ('ankle', 'NN'),
 ('to', 'TO'),
 ('chin', 'VB'),
 ('Lovely', 'RB'),
 ('skin', 'VBN'),
 ('on', 'IN'),
 ('kneeses', 'NNS'),
 ('and', 'CC'),
 ('noses', 'NNS'),
 ('and', 'CC'),
 ("ev'rywhere", 'JJ'),
 ('Skin', 'NNP'),
 ('on', 'IN'),
 ('tummies', 'NNS'),
 ('and', 'CC'),
 ('toeses', 'NNS'),
 ('and', 'CC'),
 ('under', 'IN'),
 ('your', 'PRP$'),
 ('hair', 'NN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('even', 'RB'),
 ('there', 'EX'),
 ('Oh', 'NNP'),
 ('skin', 'NN'),
 ('Wonderful', 'JJ'),
 ('colors', 'NNS'),
 ('and', 'CC'),
 ('beautiful', 'JJ'),
 ('tones', 'NNS'),
 ('Skin', 'NNP'),
 ('Think', 'NN'),
 ('of', 'IN'),
 ('without', 'IN'),
 ('it', 'PRP'),
 ('you', 'PRP'),
 ("'re", 'VBP'),
 ('nothing', 'NN'),
 ('but', 'CC'),
 ('bones', 'NNS'),
 ('Skin', 'NNP'),
 ('is', 'VBZ'),
 ('ever', 'RB'),
 

In [11]:
sesame.noun_phrases

WordList(['skin', 'covered', 'beautiful skin', 'skin', 'covered', 'lovely skin', 'skin', 'oh', 'wonderful', 'beautiful tones', 'skin', 'think', 'skin', 'beautiful', 'skin', 'covered', 'beautiful skin', 'skin', 'covered', 'oh', 'wonderful', 'beautiful tones', 'skin', 'keeping', 'skin', 'beautiful'])

### How do you really feel? TextBlob: Sentiment Analysis

In [12]:
TextBlob(text).sentiment

Sentiment(polarity=0.7865384615384615, subjectivity=0.8423076923076922)

In [13]:
sesame.sentences

[Sentence("Skin!"), Sentence("Covered all over with beautiful skin
 Skin!"), Sentence("Covered all over from ankle to chin
 Lovely skin on kneeses and noses and ev'rywhere
 Skin on tummies and toeses and under your hair it's even there
 Oh skin!"), Sentence("Wonderful colors and beautiful tones
 Skin!"), Sentence("Think of without it you're nothing but bones
 Skin is ever so lovely no matter the color you're in
 Let's hear it for skin
 
 Beautiful skin
 
 Skin!"), Sentence("Covered all over with beautiful skin
 Skin!"), Sentence("Covered all over from ankle to chin
 Without skin for touching and rubbing how much we'd miss
 There'd be no hands for shaking and scrubbing
 And just think of this no cheeks to kiss
 Oh skin!"), Sentence("Wonderful colors and beautiful tones
 Skin!"), Sentence("Keeping the rain off our muscles and bones
 Skin is ever so lovely no matter the color you're in
 Let's hear it for skin
 Beautiful skin")]

In [14]:
sesame.words

WordList(['Skin', 'Covered', 'all', 'over', 'with', 'beautiful', 'skin', 'Skin', 'Covered', 'all', 'over', 'from', 'ankle', 'to', 'chin', 'Lovely', 'skin', 'on', 'kneeses', 'and', 'noses', 'and', "ev'rywhere", 'Skin', 'on', 'tummies', 'and', 'toeses', 'and', 'under', 'your', 'hair', 'it', "'s", 'even', 'there', 'Oh', 'skin', 'Wonderful', 'colors', 'and', 'beautiful', 'tones', 'Skin', 'Think', 'of', 'without', 'it', 'you', "'re", 'nothing', 'but', 'bones', 'Skin', 'is', 'ever', 'so', 'lovely', 'no', 'matter', 'the', 'color', 'you', "'re", 'in', 'Let', "'s", 'hear', 'it', 'for', 'skin', 'Beautiful', 'skin', 'Skin', 'Covered', 'all', 'over', 'with', 'beautiful', 'skin', 'Skin', 'Covered', 'all', 'over', 'from', 'ankle', 'to', 'chin', 'Without', 'skin', 'for', 'touching', 'and', 'rubbing', 'how', 'much', 'we', "'d", 'miss', 'There', "'d", 'be', 'no', 'hands', 'for', 'shaking', 'and', 'scrubbing', 'And', 'just', 'think', 'of', 'this', 'no', 'cheeks', 'to', 'kiss', 'Oh', 'skin', 'Wonderful',

In [15]:
sesame.sentences[0].words

WordList(['Skin'])

### Stemming

In [16]:
stemmer = nltk.stem.porter.PorterStemmer()
for word in TextBlob(text).words:
    print(stemmer.stem(word))

skin
cover
all
over
with
beauti
skin
skin
cover
all
over
from
ankl
to
chin
love
skin
on
knees
and
nose
and
ev'rywher
skin
on
tummi
and
toes
and
under
your
hair
it
's
even
there
Oh
skin
wonder
color
and
beauti
tone
skin
think
of
without
it
you
're
noth
but
bone
skin
is
ever
so
love
no
matter
the
color
you
're
in
let
's
hear
it
for
skin
beauti
skin
skin
cover
all
over
with
beauti
skin
skin
cover
all
over
from
ankl
to
chin
without
skin
for
touch
and
rub
how
much
we
'd
miss
there
'd
be
no
hand
for
shake
and
scrub
and
just
think
of
thi
no
cheek
to
kiss
Oh
skin
wonder
color
and
beauti
tone
skin
keep
the
rain
off
our
muscl
and
bone
skin
is
ever
so
love
no
matter
the
color
you
're
in
let
's
hear
it
for
skin
beauti
skin


To see different nltk stemmers in effect:
http://text-processing.com/demo/stem/

In [21]:
sesame.word_counts.items()

dict_items([('skin', 19), ('covered', 4), ('all', 4), ('over', 4), ('with', 2), ('beautiful', 6), ('from', 2), ('ankle', 2), ('to', 3), ('chin', 2), ('lovely', 3), ('on', 2), ('kneeses', 1), ('and', 10), ('noses', 1), ("ev'rywhere", 1), ('tummies', 1), ('toeses', 1), ('under', 1), ('your', 1), ('hair', 1), ('it', 4), ('s', 3), ('even', 1), ('there', 2), ('oh', 2), ('wonderful', 2), ('colors', 2), ('tones', 2), ('think', 2), ('of', 2), ('without', 2), ('you', 3), ('re', 3), ('nothing', 1), ('but', 1), ('bones', 2), ('is', 2), ('ever', 2), ('so', 2), ('no', 4), ('matter', 2), ('the', 3), ('color', 2), ('in', 2), ('let', 2), ('hear', 2), ('for', 4), ('touching', 1), ('rubbing', 1), ('how', 1), ('much', 1), ('we', 1), ('d', 2), ('miss', 1), ('be', 1), ('hands', 1), ('shaking', 1), ('scrubbing', 1), ('just', 1), ('this', 1), ('cheeks', 1), ('kiss', 1), ('keeping', 1), ('rain', 1), ('off', 1), ('our', 1), ('muscles', 1)])

In [17]:
for word, count in sesame.word_counts.items():
    print("%15s %i" % (word, count))

           skin 19
        covered 4
            all 4
           over 4
           with 2
      beautiful 6
           from 2
          ankle 2
             to 3
           chin 2
         lovely 3
             on 2
        kneeses 1
            and 10
          noses 1
     ev'rywhere 1
        tummies 1
         toeses 1
          under 1
           your 1
           hair 1
             it 4
              s 3
           even 1
          there 2
             oh 2
      wonderful 2
         colors 2
          tones 2
          think 2
             of 2
        without 2
            you 3
             re 3
        nothing 1
            but 1
          bones 2
             is 2
           ever 2
             so 2
             no 4
         matter 2
            the 3
          color 2
             in 2
            let 2
           hear 2
            for 4
       touching 1
        rubbing 1
            how 1
           much 1
             we 1
              d 2
           miss 1
        

In [None]:
def get_count(item):
    # Sort key: the count in a (word, count) pair
    return item[1]

# Print the words from most to least frequent
for word, count in sorted(sesame.word_counts.items(), key=get_count, reverse=True):
    print("%15s %i" % (word, count))
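
An alternative to sorting with a key function is `collections.Counter`, whose `most_common` method returns the pairs ordered by count. A minimal sketch, using a handful of counts copied from the output above:

```python
from collections import Counter

# A handful of counts copied from the word_counts output above.
word_counts = {"skin": 19, "covered": 4, "beautiful": 6, "and": 10, "kneeses": 1}

# most_common sorts by count, highest first; the argument limits how many are returned.
top3 = Counter(word_counts).most_common(3)
for word, count in top3:
    print("%15s %i" % (word, count))
```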

## Movie Reviews 
(without stopwords!)

In [None]:
#nltk.download()

In [27]:
import nltk
from textblob import TextBlob
from nltk.corpus import movie_reviews

fileids = movie_reviews.fileids()[:100]
doc_words = [movie_reviews.words(fileid) for fileid in fileids]
documents = [' '.join(words) for words in doc_words]
print(documents[0:1])

['plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what \' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn \' t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it \' s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no

##### Top bigrams in reviews

In [29]:
from nltk.util import ngrams

from collections import Counter
from operator import itemgetter

from nltk.corpus import stopwords
stop = stopwords.words('english')
stop += ['.', ',', '(', ')', "'", '"']
stop = set(stop)

counter = Counter()

n = 2
for doc in documents:
    words = TextBlob(doc).words
    words = [w for w in words if w not in stop]
    bigrams = ngrams(words, n)
    counter += Counter(bigrams)

for phrase, count in counter.most_common(30):
    print('%20s %i' % (" ".join(phrase), count))

     special effects 20
         ghosts mars 18
         first movie 14
           prinze jr 12
         monkey bone 12
         even though 11
           hong kong 11
        fight scenes 11
            want see 10
           van damme 10
         jackie chan 10
         every scene 10
         movies like 9
          romeo must 9
            must die 9
            big john 9
              sci fi 9
           years ago 8
         sounds like 8
         screen time 8
      john carpenter 8
           two hours 8
            year old 8
         action film 8
         big gorilla 8
            one best 8
      freddie prinze 8
           dr moreau 8
               ho ho 8
         spice girls 8
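
Because stopwords are removed before the bigrams are formed, words that were not adjacent in the original text can end up paired, which explains entries such as "want see" and "romeo must". The sketch below shows the same counting pattern with only the standard library. The two documents and the tiny stopword set are hypothetical stand-ins for the reviews.

```python
from collections import Counter

# Hypothetical toy documents standing in for the movie reviews.
docs = [
    "the special effects were great",
    "great special effects and great fight scenes",
]
stop = {"the", "were", "and"}  # tiny illustrative stopword set

counter = Counter()
for doc in docs:
    words = [w for w in doc.split() if w not in stop]
    # Pair each word with its successor: the same pairs ngrams(words, 2) would yield.
    counter.update(zip(words, words[1:]))

for phrase, count in counter.most_common(3):
    print('%20s %i' % (" ".join(phrase), count))
```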


### Using Sklearn With Text
scikit-learn can turn text into numeric features for modeling.

Another option is `spaCy`, which is optimized for performance in production applications. `NLTK` is generally slower, but it bundles a wider range of algorithms and corpora.

`CountVectorizer` converts a collection of text documents to a matrix of token counts.
This implementation produces a sparse representation.
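
To make "matrix of token counts" concrete, here is a minimal hand-rolled version restricted to single words, applied to three short quotes. It assumes the tokenization mirrors `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters) and it builds a dense list of lists instead of a sparse matrix.

```python
import re
from collections import Counter

docs = [
    "That it should come to this!",
    "This above all: to thine own self be true.",
    "Something is rotten in the state of Denmark.",
]

def tokenize(doc):
    # Assumption: mirrors CountVectorizer's defaults (lowercase, tokens of 2+ word characters).
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in tokenize(d)}))}

# One row per document, one column per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in vocab])

print(len(vocab), matrix[0])
```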


In [30]:
from sklearn.feature_extraction.text import CountVectorizer

text = ['That it should come to this!', 'This above all: to thine own self be true.', 'Something is rotten in the state of Denmark.']

# CountVectorizer is a class, so `vectorizer` below is an instance of that class.
vectorizer = CountVectorizer(ngram_range=(1,2))

# call `fit` to build the vocabulary
vectorizer.fit(text)

# then, use `get_feature_names_out` to return the tokens
# (`get_feature_names` was removed in scikit-learn 1.2)
print(list(vectorizer.get_feature_names_out()))

# finally, call `transform` to convert text to a bag of words
x = vectorizer.transform(text)

['above', 'above all', 'all', 'all to', 'be', 'be true', 'come', 'come to', 'denmark', 'in', 'in the', 'is', 'is rotten', 'it', 'it should', 'of', 'of denmark', 'own', 'own self', 'rotten', 'rotten in', 'self', 'self be', 'should', 'should come', 'something', 'something is', 'state', 'state of', 'that', 'that it', 'the', 'the state', 'thine', 'thine own', 'this', 'this above', 'to', 'to thine', 'to this', 'true']


In [31]:
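print('Sparse Matrix')
# The sparse matrix stores only the non-zero entries, printed as (row, column) count.
print(type(x))
print(x)

print('Matrix')
# toarray() expands it into a dense NumPy array with explicit zeros.
x_back = x.toarray()
print(type(x_back))
print(x_back)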

Sparse Matrix
<class 'scipy.sparse.csr.csr_matrix'>
  (0, 6)	1
  (0, 7)	1
  (0, 13)	1
  (0, 14)	1
  (0, 23)	1
  (0, 24)	1
  (0, 29)	1
  (0, 30)	1
  (0, 35)	1
  (0, 37)	1
  (0, 39)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (1, 3)	1
  (1, 4)	1
  (1, 5)	1
  (1, 17)	1
  (1, 18)	1
  (1, 21)	1
  (1, 22)	1
  (1, 33)	1
  (1, 34)	1
  (1, 35)	1
  (1, 36)	1
  (1, 37)	1
  (1, 38)	1
  (1, 40)	1
  (2, 8)	1
  (2, 9)	1
  (2, 10)	1
  (2, 11)	1
  (2, 12)	1
  (2, 15)	1
  (2, 16)	1
  (2, 19)	1
  (2, 20)	1
  (2, 25)	1
  (2, 26)	1
  (2, 27)	1
  (2, 28)	1
  (2, 31)	1
  (2, 32)	1
Matrix
<class 'numpy.ndarray'>
[[0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1
  0 1 0 1 0]
 [1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1
  1 1 1 0 1]
 [0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 0
  0 0 0 0 0]]
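
The two printouts hold the same information: the sparse form lists 43 non-zero cells, while the dense form spells out all 3 × 41 = 123. The sketch below shows the relationship with plain Python structures. It is an illustration of the idea, not how SciPy stores a CSR matrix internally.

```python
# (row, column) -> count entries, copied from the first few lines of the sparse printout above.
entries = {(0, 6): 1, (0, 7): 1, (0, 13): 1, (1, 0): 1, (2, 8): 1}
n_rows, n_cols = 3, 41  # 3 documents, 41 features

# Start from all zeros and fill in only the stored entries.
dense = [[0] * n_cols for _ in range(n_rows)]
for (row, col), value in entries.items():
    dense[row][col] = value

stored = len(entries)
total = n_rows * n_cols
print(stored, "stored entries versus", total, "dense cells")
```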


In [32]:
# `get_feature_names` was removed in scikit-learn 1.2; use `get_feature_names_out`
pd.DataFrame(x_back, columns=vectorizer.get_feature_names_out())

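| | above | above all | all | all to | be | be true | come | come to | denmark | in | ... | the | the state | thine | thine own | this | this above | to | to thine | to this | true |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |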


In [33]:
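# TF: how often a term appears in a given document
# IDF: inverse of how many documents in the corpus contain the term

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# movie_reviews.fileids() lists every 'neg' review before any 'pos' review, so the
# first 100 documents loaded earlier are all negative. Build a balanced sample instead.
train_ids = movie_reviews.fileids('neg')[:50] + movie_reviews.fileids('pos')[:50]
train_docs = [' '.join(movie_reviews.words(fileid)) for fileid in train_ids]
classes = np.array(['neg']*50 + ['pos']*50)

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1,2))
doc_vectors = vectorizer.fit_transform(train_docs) # remember, this is our movie review dataset

model = MultinomialNB().fit(doc_vectors, classes)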
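
To see what the TF and IDF parts contribute, here is a minimal hand computation on a hypothetical toy corpus. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is what scikit-learn documents as its default. Treat it as a sketch of the idea rather than a re-implementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [["skin", "skin", "bones"], ["skin", "rain"], ["skin", "bones", "rain", "rain"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF; assumption: matches scikit-learn's default (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    tf = Counter(doc)  # raw term counts in this document
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 normalization
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(docs[0])
print(vec)
```

A term that appears in every document ("skin" here) gets the minimum IDF of 1, so its weight comes only from how often it occurs in the document at hand.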

In [34]:
# `text` was reassigned in the CountVectorizer cell, so recover the lyrics from the TextBlob.
song_text = sesame.raw
print(song_text)

In [35]:
song_vector = vectorizer.transform([song_text])
model.predict(song_vector)