# Lecture Code - Introduction to Natural Language Processing

<p><a name="sections"></a></p>


## Sections

- <a href="#prepare">Prepare Our Environment</a><br>
- <a href="#corpus">Creating Corpus</a><br>
    - <a href="#stemming">Stemming</a><br>
    - <a href="#ex1">Exercise: Lemmatization</a><br>
- <a href="#pos">POS Tag</a><br>
    - <a href="#ex2">Exercise: Lemmatization with POS Tag</a><br>
- <a href="#chunk">Chunk</a><br>
    - <a href="#ex3">Exercise: Syntax Tree/ Chunking</a><br>
- <a href="#classify">Text Classification</a><br>
    - <a href="#ex4">Exercise: Classify the Testing Set</a><br>
- <a href="#lda">Brief Introduction to LDA</a><br>

<p><a name="prepare"></a></p>
## Prepare Our Environment

In [1]:
from __future__ import print_function
from bs4 import BeautifulSoup
# urlopen lives in different modules in Python 2 and 3; try the Python 3 location first
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
import re
import nltk
from nltk import *
import numpy as np

### Download everything from NLTK

In [2]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/rheineke/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/rheineke/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /Users/rheineke/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/rheineke/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /Users/rheineke/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /Users/rheineke/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-dat

True

<p><a name="corpus"></a></p>
## Creating Corpus

The first step is to gather a data set to work on. In this case, we will use the text of The Great Gatsby by F. Scott Fitzgerald, which is freely available on Project Gutenberg.

- Scrape the text of the novel, clean it up using regular expressions, and put the words into a list.

In [3]:
url = "http://gutenberg.net.au/ebooks02/0200041.txt"
text = urlopen(url).read()
soup = BeautifulSoup(text, 'html.parser')
cleantext = soup.get_text()  # strip any HTML markup
cleantext = re.sub(r'\s+', ' ', cleantext).strip()  # collapse runs of whitespace
cleantext = cleantext.lower()
cleantext = re.sub(r'[.:\',\-!;"()?]', '', cleantext).strip()  # drop punctuation
corpus = cleantext.split(" ")
# corpus

**Cleaning the corpus: tokenization**

In [4]:
for x in range(0, len(corpus)): 
    if corpus[x] == "chapter":
        break_number_1 = x 
        break
for x in range((len(corpus)-1), 0, -1): 
    if corpus[x] == "end":
        break_number_2 = x + 1 
        break

corpus = corpus[break_number_1 : break_number_2]
# corpus

<p><a name="stemming"></a></p>
### Stemming
- This is probably the first time you'll see a Python object that represents a **task** (here, stemming) rather than a piece of data.

In [5]:
stemmer = nltk.stem.PorterStemmer() # Create our stemmer 
stemmed_corpus = [stemmer.stem(word) for word in corpus]

**Potential problems**
- What is the result of the code below? What kind of problem do you suspect could arise? How can we fix it?

In [6]:
to_be_stemmed = ["monkeys","possesses", "possess", "dogs", "eating",
"constitutional", "worthy", "relatable", "had", "was", "ties", "ive"]
[stemmer.stem(word) for word in to_be_stemmed]

['monkey',
 'possess',
 'possess',
 'dog',
 'eat',
 'constitut',
 'worthi',
 'relat',
 'had',
 'wa',
 'tie',
 'ive']
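
One way to look at the result is to group the words by the stem they were mapped to. The sketch below hard-codes the stems printed above instead of calling the stemmer, so it runs without NLTK; the variable names are only illustrative.

```python
from collections import defaultdict

words = ["monkeys", "possesses", "possess", "dogs", "eating", "constitutional",
         "worthy", "relatable", "had", "was", "ties", "ive"]
# Stems copied from the output of the cell above
stems = ["monkey", "possess", "possess", "dog", "eat", "constitut",
         "worthi", "relat", "had", "wa", "tie", "ive"]

# Group the original words by stem to see which ones are conflated
groups = defaultdict(list)
for word, stem in zip(words, stems):
    groups[stem].append(word)

merged = {stem: ws for stem, ws in groups.items() if len(ws) > 1}
print(merged)
```

Merging `possesses` and `possess` is the intended effect. The output above also shows the side effects: stems such as `wa`, `worthi` and `constitut` are not words, and irregular forms such as `had` and `was` are not reduced to `have` and `be`.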

<p><a name="ex1"></a></p>
### Exercise: Lemmatization

- To reduce some of these problems, we might want to add more steps before stemming. For example, we can lemmatize in advance. Look up the documentation, find the function you need for lemmatization, then apply it (in the correct way) to the words below.
    - 'tips', 'criteria', 'minima'
    - 'was', 'wore', 'caught'


In [7]:
lmtizer=nltk.stem.WordNetLemmatizer()

In [8]:
#### Your code here
print(lmtizer.lemmatize('tips'))
print(lmtizer.lemmatize('criteria'))
print(lmtizer.lemmatize('minima'))

tip
criterion
minimum


In [9]:
print(lmtizer.lemmatize('was'))
print(lmtizer.lemmatize('wore'))
print(lmtizer.lemmatize('caught'))

wa
wore
caught


In [10]:
print(lmtizer.lemmatize('was'))
print(lmtizer.lemmatize('was', 'v')+'\n')

print(lmtizer.lemmatize('wore'))
print(lmtizer.lemmatize('wore', 'v')+'\n')

print(lmtizer.lemmatize('caught'))
print(lmtizer.lemmatize('caught', 'v'))

wa
be

wore
wear

caught
catch


It is crucial to decide what class each word belongs to (noun, verb or adjective). We will see how this can be done.

<p><a name="pos"></a></p>
## POS Tag

- In order to look up definitions for all of the parts of speech, you can run this command. 

In [11]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

### A different way to tokenize

In [12]:
sentence = """At eight o'clock on Thursday morning Arthur didn't
feel very good."""
tokens=nltk.word_tokenize(sentence)
tokens

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 'Arthur',
 "didn't",
 'feel',
 'very',
 'good',
 '.']

**Compare to the old tokenization**

In [13]:
clean_sentence = re.sub(r'\s+', ' ', sentence).strip()
clean_sentence = clean_sentence.lower()
clean_sentence = re.sub(r'[.:\',\-!;"()?]', "", clean_sentence).strip()
old_tokens = clean_sentence.split(" ")
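
To make the comparison concrete, the sketch below puts the two token lists side by side. The `nltk_tokens` list is copied from the `nltk.word_tokenize` output shown earlier, so the snippet runs without NLTK data.

```python
import re

sentence = """At eight o'clock on Thursday morning Arthur didn't
feel very good."""

# Tokens copied from the nltk.word_tokenize output shown earlier
nltk_tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
               'Arthur', "didn't", 'feel', 'very', 'good', '.']

# The regex-based tokenization used for the corpus
clean = re.sub(r'\s+', ' ', sentence).strip().lower()
clean = re.sub(r'[.:\',\-!;"()?]', '', clean).strip()
old_tokens = clean.split(" ")

print(old_tokens)
print(len(nltk_tokens), len(old_tokens))
```

The regex version drops capitalization, apostrophes and the sentence-final period, all of which a POS tagger may be able to use as cues.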

In [14]:
tagged = nltk.pos_tag(tokens)
tagged

[('At', 'IN'),
 ('eight', 'CD'),
 ("o'clock", 'NN'),
 ('on', 'IN'),
 ('Thursday', 'NNP'),
 ('morning', 'NN'),
 ('Arthur', 'NNP'),
 ("didn't", 'NN'),
 ('feel', 'VB'),
 ('very', 'RB'),
 ('good', 'JJ'),
 ('.', '.')]

<p><a name="ex2"></a></p>
### Exercise: Lemmatization with POS Tag

- Consider our `to_be_stemmed` list again.

In [15]:
print(to_be_stemmed)

['monkeys', 'possesses', 'possess', 'dogs', 'eating', 'constitutional', 'worthy', 'relatable', 'had', 'was', 'ties', 'ive']


If we only need very basic tags (noun, verb and adjective), try passing `tagset='universal'` to the `pos_tag` function.

In [16]:
#### Your code here
tagged = nltk.pos_tag(to_be_stemmed, tagset='universal')

- Now lemmatize the words above, using their tags.

In [17]:
tagged

[('monkeys', 'NOUN'),
 ('possesses', 'VERB'),
 ('possess', 'ADJ'),
 ('dogs', 'NOUN'),
 ('eating', 'VERB'),
 ('constitutional', 'ADJ'),
 ('worthy', 'NOUN'),
 ('relatable', 'NOUN'),
 ('had', 'VERB'),
 ('was', 'VERB'),
 ('ties', 'NOUN'),
 ('ive', 'ADJ')]

In [18]:
#### Your code here
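mapping = {'NOUN': 'n', 'VERB': 'v', 'ADJ': 'a'}
# Pair each word with its WordNet POS letter; lists (unlike lazy `map` objects) can be reused in Python 3
tagged = [(word, mapping[tag]) for word, tag in tagged]
to_be_stemmed2 = [lmtizer.lemmatize(word, pos) for word, pos in tagged]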
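
The `mapping` dictionary covers the three universal tags that occur here. With the default Penn Treebank tags, a common approach is to map on the prefix of the tag instead. The helper below is only a sketch: `penn_to_wordnet` is a hypothetical name, and falling back to noun is an assumption that mirrors the lemmatizer's default part of speech.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (sketch)."""
    if tag.startswith('J'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('RB'):
        return 'r'   # adverbs: RB, RBR, RBS
    return 'n'       # nouns and everything else


for tag in ['NNP', 'VBD', 'JJ', 'RB', 'IN']:
    print(tag, '->', penn_to_wordnet(tag))
```

A tagged token `(word, tag)` could then be lemmatized with `lmtizer.lemmatize(word, penn_to_wordnet(tag))`.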

- Stem the lemmatized words and compare the result with the old stemming method.

In [19]:
#### Your code here
print([stemmer.stem(word) for word in to_be_stemmed])
print([stemmer.stem(word) for word in to_be_stemmed2])

['monkey', 'possess', 'possess', 'dog', 'eat', 'constitut', 'worthi', 'relat', 'had', 'wa', 'tie', 'ive']
['monkey', 'possess', 'possess', 'dog', 'eat', 'constitut', 'worthi', 'relat', 'have', 'be', 'tie', 'ive']


<p><a name="chunk"></a></p>
## Chunk

Defining patterns over POS tags helps, for example, to find phrases in a corpus, especially noun phrases. Let's consider the sentence below (this example is from [the NLTK book](http://www.nltk.org/book/ch07.html)):

In [20]:
sentence = 'The little yellow dog barked at the cat.'
sentence

'The little yellow dog barked at the cat.'

- As before we tokenize and tag the tokens.

In [21]:
tokens   = nltk.word_tokenize(sentence.lower())
tagged   = nltk.pos_tag(tokens)
print(tagged)

[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN'), ('.', '.')]


- Then we may use the `RegexpParser` (regular expression parser) class to do **chunking**:

In [22]:
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print(result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN)
  ./.)
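
Conceptually, the tag pattern `<DT>?<JJ>*<NN>` is a regular expression over the sequence of tags rather than over characters. The sketch below imitates that idea with the standard `re` module for this one grammar. It is an illustration, not NLTK's actual implementation, and `chunk_noun_phrases` is a hypothetical helper.

```python
import re


def chunk_noun_phrases(tagged):
    """Return the word lists matched by the pattern <DT>?<JJ>*<NN>."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>"
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    # Character offset at which each token's tag starts
    starts, offset = [], 0
    for _, tag in tagged:
        starts.append(offset)
        offset += len(tag) + 2
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        first = starts.index(match.start())      # token where the match begins
        n_tokens = match.group().count("<")      # one "<" per matched tag
        chunks.append([word for word, _ in tagged[first:first + n_tokens]])
    return chunks


tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN'), ('.', '.')]
chunks = chunk_noun_phrases(tagged)
print(chunks)
```

Because the closing `>` is part of the pattern, `<NN>` matches the tag `NN` only, not `NNS` or `NNP`.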


- The result is a tree and can be displayed with its `draw()` method (the drawing is not shown inline here):

In [23]:
result.draw()

<p><a name="ex3"></a></p>
### Exercise: Syntax Tree/ Chunking

Try chunking:
- `'We live in New York City'` with `grammar = "NP: {<NNP>+}"`.
- `'I went to a baseball game'` with `grammar = "NP: {<NN>+}"`.
- Identify the subject of the sentence: `'the young tall guy looked at the window'`.

You can visualize the structure with the `draw()` function.

In [24]:
#### Your code here
sentence = 'We live in New York City'
tokens   = nltk.word_tokenize(sentence)
tagged   = nltk.pos_tag(tokens)
grammar = "NP: {<NNP>+}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print(result)

(S We/PRP live/VBP in/IN (NP New/NNP York/NNP City/NNP))


In [25]:
sentence = 'I went to a baseball game.'
tokens   = nltk.word_tokenize(sentence)
tagged   = nltk.pos_tag(tokens)
grammar = "NP: {<NN>+}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print(result)

(S I/PRP went/VBD to/TO a/DT (NP baseball/NN game/NN) ./.)


In [26]:
sentence = 'the young tall guy looked at the window'
tokens   = nltk.word_tokenize(sentence)
tagged   = nltk.pos_tag(tokens)
grammar = "NP: {<DT>?<JJ>*<NN>}"

cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print(result)

(S
  (NP the/DT young/JJ tall/NN)
  (NP guy/NN)
  looked/VBD
  at/IN
  (NP the/DT window/NN))


Try the same processing for this sentence:

- ***Time flies like an arrow; fruit flies like a banana***

In [27]:
#### Your code here

sentence = 'Time flies like an arrow; fruit flies like a banana'
tokens   = nltk.word_tokenize(sentence)
tagged   = nltk.pos_tag(tokens)
grammar = "NP: {<NN>}"

cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print(result)

(S
  Time/NNP
  flies/NNS
  like/IN
  an/DT
  (NP arrow/NN)
  ;/:
  fruit/CC
  flies/NNS
  like/IN
  a/DT
  (NP banana/NN))


<p><a name="classify"></a></p>
## Text Classification

### Creating Text

In [28]:
pos_tweets = [('I love this book', 'positive'),
              ('This food is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the party', 'positive'),
              ('He is my best friend', 'positive')]
neg_tweets = [('I do not like this book', 'negative'),
              ('This food is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the party', 'negative'),
              ('He is my enemy', 'negative')]

### Tokenization

In [29]:
tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))
tweets

[(['love', 'this', 'book'], 'positive'),
 (['this', 'food', 'amazing'], 'positive'),
 (['feel', 'great', 'this', 'morning'], 'positive'),
 (['excited', 'about', 'the', 'party'], 'positive'),
 (['best', 'friend'], 'positive'),
 (['not', 'like', 'this', 'book'], 'negative'),
 (['this', 'food', 'horrible'], 'negative'),
 (['feel', 'tired', 'this', 'morning'], 'negative'),
 (['not', 'looking', 'forward', 'the', 'party'], 'negative'),
 (['enemy'], 'negative')]

### Creating a Test Set

In [30]:
test_tweets = [
(['feel', 'happy', 'this', 'morning'], 'positive'), (['larry', 'friend'], 'positive'),
(['not', 'like', 'that', 'man'], 'negative'), (['house', 'not', 'great'], 'negative'),
(['your', 'song', 'annoying'], 'negative')]

### Creating a Set of All the Words

In [31]:
def get_words_in_tweets(tweets): 
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words) 
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = get_word_features(get_words_in_tweets(tweets))

### Creating Feature Columns

In [32]:
def extract_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words) 
    return features

In [33]:
training_set = nltk.classify.apply_features(extract_features, tweets) 
classifier = nltk.NaiveBayesClassifier.train(training_set)
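
To see roughly what a naive Bayes classifier does with these boolean `contains(word)` features, here is a from-scratch sketch in plain Python. It is a simplified Bernoulli naive Bayes with add-one smoothing, written for illustration; NLTK's estimator may smooth differently, so the probabilities need not agree exactly. The function names `train` and `classify` are hypothetical.

```python
import math
from collections import defaultdict

# Training tweets, tokenized as in the cell output above
tweets = [
    (['love', 'this', 'book'], 'positive'),
    (['this', 'food', 'amazing'], 'positive'),
    (['feel', 'great', 'this', 'morning'], 'positive'),
    (['excited', 'about', 'the', 'party'], 'positive'),
    (['best', 'friend'], 'positive'),
    (['not', 'like', 'this', 'book'], 'negative'),
    (['this', 'food', 'horrible'], 'negative'),
    (['feel', 'tired', 'this', 'morning'], 'negative'),
    (['not', 'looking', 'forward', 'the', 'party'], 'negative'),
    (['enemy'], 'negative'),
]


def train(labelled_docs):
    """Count, per label, how many documents contain each word."""
    vocab = sorted({w for words, _ in labelled_docs for w in words})
    n_docs = defaultdict(int)                            # label -> documents
    n_docs_with = defaultdict(lambda: defaultdict(int))  # label -> word -> documents
    for words, label in labelled_docs:
        n_docs[label] += 1
        for w in set(words):
            n_docs_with[label][w] += 1
    return vocab, n_docs, n_docs_with


def classify(model, words):
    """Return the label with the highest log posterior."""
    vocab, n_docs, n_docs_with = model
    present = set(words)
    total = sum(n_docs.values())
    scores = {}
    for label in n_docs:
        score = math.log(n_docs[label] / total)          # log prior
        for w in vocab:                                  # unseen words are ignored
            # Add-one smoothed P(word present | label)
            p = (n_docs_with[label][w] + 1) / (n_docs[label] + 2)
            score += math.log(p if w in present else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(tweets)
print(classify(model, ['larry', 'friend']))
```

Note that absent words also contribute evidence: a tweet that lacks `not`, `enemy` and `horrible` looks slightly more positive for that reason alone.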

In [34]:
tweet = 'Larry is my friend'
print(classifier.classify(extract_features(tweet.split())))

positive


<p><a name="ex4"></a></p>
### Exercise: Classify the Testing Set

In [35]:
#### Your code here
test_set=[extract_features(lst) for lst in [tup[0] for tup in test_tweets]]
print('Predicted: ' + str([classifier.classify(instance) for instance in test_set]))
print('Actual:    ' + str([tup[1] for tup in test_tweets]))

Predicted: ['positive', 'positive', 'negative', 'negative', 'positive']
Actual:    ['positive', 'positive', 'negative', 'negative', 'negative']
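
The comparison can be summarized as an accuracy score. The sketch below uses the two lists printed above, copied in by hand so that it runs on its own.

```python
# Lists copied from the output above
predicted = ['positive', 'positive', 'negative', 'negative', 'positive']
actual = ['positive', 'positive', 'negative', 'negative', 'negative']

# Fraction of test tweets whose predicted label matches the true label
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)

# Positions of the test tweets that were labelled incorrectly
misclassified = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
print(accuracy, misclassified)
```

The one miss is the last test tweet, `['your', 'song', 'annoying']`: none of its words occur in the training tweets, so the classifier has no word from that tweet to learn from.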


<p><a name="lda"></a></p>
## Brief Introduction to Latent Dirichlet Allocation

In natural language processing, latent Dirichlet allocation (LDA) is an unsupervised model that explains sets of observations through unobserved groups, which account for why some parts of the data are similar.

For text, the unobserved groups are **topics**: a topic is a latent variable inferred by LDA and used to characterise a document.

Below we demonstrate, with the [20 newsgroups dataset](http://scikit-learn.org/stable/datasets/), how we can:
 - Extract **term frequency** features from the corpus.
 - Fit the LDA model, return the **topics**, and then inspect each topic.
 - Finally, inspect the relation between the topics and particular documents.

In [36]:
from time import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()
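
The slice `argsort()[:-n_top_words - 1:-1]` in `print_top_words` is compact but easy to misread. Python lists follow the same slicing rules, so the idea can be sketched without NumPy; `top_indices` is a hypothetical helper and the weights are made up.

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Indices ordered by ascending weight, like NumPy's argsort()
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the last element and stop after n_top entries
    return ascending[:-n_top - 1:-1]


topic = [0.1, 5.0, 2.0, 9.0]            # toy word weights for one topic
vocab = ['apple', 'ball', 'cat', 'dog']
best = top_indices(topic, 2)
print(best, [vocab[i] for i in best])
```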

We first load the data set.

In [37]:
n_features = 1000
n_topics = 10
n_top_words = 20

print("Loading dataset...")
remove = ('headers', 'footers', 'quotes')
%time dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=remove)
data_samples = dataset.data


Loading dataset...
CPU times: user 2.64 s, sys: 123 ms, total: 2.76 s
Wall time: 2.8 s


Then we compute the term frequency for the selected words:

In [38]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
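
As a rough picture of what a term-frequency matrix contains, the sketch below builds one by hand for two toy documents. It ignores everything else `CountVectorizer` does here (its own tokenization rules, stop words, `min_df`/`max_df` filtering and `max_features`), and all names are illustrative.

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the cat"]
tokenized = [doc.lower().split() for doc in docs]

# One column per distinct word, in alphabetical order
vocabulary = sorted({w for words in tokenized for w in words})
column = {w: j for j, w in enumerate(vocabulary)}

# Sparse view: only non-zero counts are stored, keyed by (row, column)
sparse_counts = {}
for i, words in enumerate(tokenized):
    for w, count in Counter(words).items():
        sparse_counts[(i, column[w])] = count

# Dense view: one row per document, one column per word
dense = [[sparse_counts.get((i, j), 0) for j in range(len(vocabulary))]
         for i in range(len(docs))]

print(vocabulary)
print(dense)
```

The `(row, column)  count` pairs of the sparse view correspond to the lines printed when a SciPy sparse matrix is displayed.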

In [39]:
print('fit_transform returns: {}'.format(type(tf)))
print(tf)

fit_transform returns: <class 'scipy.sparse.csr.csr_matrix'>
  (0, 703)	1
  (0, 422)	1
  (0, 502)	1
  (0, 554)	1
  (0, 146)	1
  (0, 573)	1
  (0, 424)	1
  (0, 748)	1
  (0, 762)	1
  (0, 229)	1
  (0, 760)	1
  (0, 745)	1
  (0, 897)	1
  (0, 914)	1
  (0, 535)	1
  (0, 444)	1
  (0, 985)	1
  (0, 497)	2
  (0, 714)	1
  (0, 589)	4
  (0, 854)	1
  (0, 303)	1
  (0, 860)	1
  (0, 872)	1
  (1, 321)	1
  :	:
  (11313, 758)	1
  (11313, 747)	1
  (11313, 840)	1
  (11313, 637)	1
  (11313, 210)	1
  (11313, 560)	1
  (11313, 646)	2
  (11313, 529)	1
  (11313, 421)	3
  (11313, 550)	2
  (11313, 773)	1
  (11313, 995)	1
  (11313, 434)	1
  (11313, 931)	1
  (11313, 191)	1
  (11313, 207)	1
  (11313, 954)	1
  (11313, 516)	1
  (11313, 540)	1
  (11313, 321)	1
  (11313, 508)	2
  (11313, 88)	1
  (11313, 359)	1
  (11313, 422)	1
  (11313, 303)	1


We can call the `toarray()` method on this sparse matrix object to get a familiar NumPy array:

In [40]:
print(type(tf.toarray()))
print(tf.toarray().shape)
print(tf.toarray()[11313, 300: 310])

<class 'numpy.ndarray'>
(11314, 1000)
[0 0 0 1 0 0 0 0 0 0]


In [41]:
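# scikit-learn >= 1.0 provides get_feature_names_out(); older releases use get_feature_names()
list(tf_vectorizer.get_feature_names_out()[300:310])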

['development',
 'device',
 'devices',
 'did',
 'didn',
 'die',
 'difference',
 'different',
 'difficult',
 'digital']

Then we fit the LDA (Latent Dirichlet Allocation) model:

In [42]:
print("Fitting LDA models with tf features, n_features=%d..."
      % n_features)
# The `n_topics` argument was renamed `n_components` in scikit-learn 0.19
kwds = dict(n_components=n_topics, max_iter=5, learning_method='online',
            learning_offset=50., random_state=0)
lda = LatentDirichletAllocation(**kwds)
t0 = time()
lda.fit(tf)
print("Fit done in %0.3fs." % (time() - t0))

Fitting LDA models with tf features, n_features=1000...
Fit done in 19.837s.


The fitted model stores the topics in an array, `lda.components_`:

In [43]:
lda.components_

array([[  1.03197977e-01,   3.71351204e+02,   1.11548432e-01, ...,
          5.56365604e+00,   1.72149288e+02,   7.44101756e+01],
       [  1.17592487e-01,   1.04091218e-01,   1.00069609e-01, ...,
          1.16151717e+02,   7.11178897e+00,   1.00269719e-01],
       [  9.18690074e+00,   7.36980863e+00,   8.92878098e+00, ...,
          2.87613192e+00,   1.49034051e-01,   1.00075326e-01],
       ..., 
       [  1.19368426e-01,   1.16239615e-01,   1.05411754e-01, ...,
          1.00057552e-01,   1.06259394e-01,   1.00161900e-01],
       [  1.05681809e-01,   4.58678916e+01,   1.00492572e-01, ...,
          3.01590736e+02,   6.37732768e+01,   1.88298644e+02],
       [  1.06484645e+03,   2.48463781e+02,   2.14078518e+02, ...,
          1.31786491e-01,   4.90524664e+01,   2.93634961e+01]])

This array has 10 rows and 1000 columns, because we asked for 10 topics and each topic is represented by a weight for each of the one thousand words; normalizing a row gives that topic's **distribution** over the words:

In [44]:
lda.components_.shape

(10, 1000)
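
The rows of `components_` are unnormalized weights, so they do not sum to one. A minimal sketch of the normalization, on a made-up 2 x 2 matrix in plain Python (`normalize_rows` is a hypothetical helper):

```python
def normalize_rows(matrix):
    """Divide each row by its sum so that every row sums to 1."""
    return [[value / sum(row) for value in row] for row in matrix]


toy_components = [[1.0, 3.0],   # topic 0: weights for two words
                  [2.0, 2.0]]   # topic 1
topic_word = normalize_rows(toy_components)
print(topic_word)
```

On the real array the same step can be written with NumPy as `lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]`.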

With the `print_top_words` function, we extract the highest-weighted words from each topic:

In [45]:
print("\nTopics in LDA model:\n")
# scikit-learn >= 1.0 provides get_feature_names_out(); older releases use get_feature_names()
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:

Topic #0:
people gun armenian armenians war turkish states israel said children jews 000 state new guns israeli vs military years american

Topic #1:
government people law mr use president don think right public make state going privacy private security know new rights want

Topic #2:
space program output entry data nasa use science research build section center launch time high earth year rules long satellite

Topic #3:
key car chip used keys bike use bit clipper number phone like cars just engine ground des algorithm good secret

Topic #4:
edu file com available mail ftp files information image send list use version server email pub software cs code window

Topic #5:
god people does jesus say think believe don know just way like true question life time christian did point bible

Topic #6:
windows use drive thanks does problem know card like using db scsi dos disk bit need pc memory mac work

Topic #7:
ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75

#### Inspection

We then randomly select a document from the dataset and inspect the mixture of topics that `lda` assigns to it:

In [46]:
indx = np.random.choice(tf.shape[0], 1)[0]
text = data_samples[indx]

print(text)
print('-'*88)
print(lda.transform(tf[indx,:] ))

Anyone have the AL individual stats or where i can find them?
----------------------------------------------------------------------------------------
[[ 0.0333454   0.35310095  0.03333861  0.03333575  0.03334173  0.03333927
   0.03334062  0.03333778  0.03334672  0.38017317]]
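
The row returned by `lda.transform` is the document's topic mixture: one weight per topic, summing to one. A small sketch for picking out the dominant topics, using the numbers printed above (copied by hand):

```python
# Topic mixture copied from the output above
mixture = [0.0333454, 0.35310095, 0.03333861, 0.03333575, 0.03334173,
           0.03333927, 0.03334062, 0.03333778, 0.03334672, 0.38017317]

# Topic indices ordered from largest to smallest weight
ranked = sorted(range(len(mixture)), key=lambda k: mixture[k], reverse=True)
top_two = ranked[:2]

print(top_two, [round(mixture[k], 3) for k in top_two])
print(round(sum(mixture), 6))
```

For this document almost all of the weight falls on two topics, while the remaining eight each sit near 0.033.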


In [47]:
topic_lst= [0, 1, -2] 

for topic in topic_lst:
    print(' '.join([tf_feature_names[i] for i in lda.components_[topic].argsort()[:-n_top_words - 1:-1]]))
    print('-'*88)

people gun armenian armenians war turkish states israel said children jews 000 state new guns israeli vs military years american
----------------------------------------------------------------------------------------
government people law mr use president don think right public make state going privacy private security know new rights want
----------------------------------------------------------------------------------------
just don like think know good time ve people said year did didn got ll going way game really team
----------------------------------------------------------------------------------------
