# <center>Natural Language Processing Using NLTK (I)</center>

References:
 - http://www.nltk.org/book_1ed/
 - https://web.stanford.edu/class/cs124/lec/Information_Extraction_and_Named_Entity_Recognition.pdf

## 1. NLTK installation
 1. Install the NLTK package with `pip install nltk`.
 2. Open your Python editor (Jupyter Notebook, Spyder, etc.) and run the commands below, uncommenting `nltk.download()` the first time. In the downloader, select "all packages" to install the data included with NLTK, including corpora and books. Downloading all the data may take a few minutes.

In [2]:
import nltk
#nltk.download()

## 2. NLP Objectives and Basic Steps

 - Objectives:
   * Split documents into tokens, phrases, or segments
   * Clean up tokens and annotate tokens
   * Extract features from tokens for further text mining tasks
 - Basic processing steps:
   * Tokenization: split documents into individual words, phrases, or segments
   * Remove stop words and filter tokens
   * POS (part of speech) Tagging
   * Normalization: Stemming, Lemmatization
   * Named Entity Recognition (NER)
   * Term Frequency and Inverse Document Frequency (TF-IDF)
   * Create document-to-term matrix (bag of words)
 - NLP packages: NLTK, Gensim, spaCy


In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re    # import re module
import nltk

In [4]:
# Exercise 2.1. Load the text for analysis

text='''`strange days' chronicles the last two days of 1999 in los angeles. 
 as the locals gear up for the new millenium , lenny nero (ralph fiennes) goes about his business of peddling erotic memory clips. 
 he pines for his ex-girlfriend, faith (juliette lewis), but doesn't notice that another friend, mace (angela bassett) really cares for him. 
 this film features good performances, impressive film-making technique and breath-taking crowd scenes. 
 director kathryn bigelow knows her stuff and does not hesitate to use it. 
 but as a whole, this is an unsatisfying movie. 
 the problem is that the writers, james cameron and jay cocks , were too ambitious, aiming for a film with social relevance, thrills, and drama. 
 not that ambitious film-making should be discouraged; just that when it fails to achieve its goals, it fails badly and obviously. 
 the film just ends up preachy, unexciting and uninvolving.'''

text


"`strange days' chronicles the last two days of 1999 in los angeles. \n as the locals gear up for the new millenium , lenny nero (ralph fiennes) goes about his business of peddling erotic memory clips. \n he pines for his ex-girlfriend, faith (juliette lewis), but doesn't notice that another friend, mace (angela bassett) really cares for him. \n this film features good performances, impressive film-making technique and breath-taking crowd scenes. \n director kathryn bigelow knows her stuff and does not hesitate to use it. \n but as a whole, this is an unsatisfying movie. \n the problem is that the writers, james cameron and jay cocks , were too ambitious, aiming for a film with social relevance, thrills, and drama. \n not that ambitious film-making should be discouraged; just that when it fails to achieve its goals, it fails badly and obviously. \n the film just ends up preachy, unexciting and uninvolving."

## 3. Tokenization
 - **Definition**: the process of breaking a stream of textual content up into words, terms, symbols, or some other meaningful elements called tokens.
    * Word (Unigram)
    * Bigram (Two consecutive words)
    * Trigram (Three consecutive words)
    * Sentence
 - Different methods exist:
    * Split by regular expression patterns
    * NLTK's word tokenizer
    * NLTK's regular expression tokenizer (customizable)
 - No single method is perfect for every tokenization task.

### 3.1. Unigram

In [5]:
# Exercise 3.1.1. Simply split the text by one or more non-word characters

# \W+: one or more non-words
tokens = re.split(r"\W+", text)   

# get the number of tokens

print(len(tokens))                   
print(tokens)                     

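# Pros: no punctuation, just words
# Cons: hyphenated words and contractions (breath-taking, film-making,
# doesn't) are split into two tokens, and an empty string appears
# because the text starts with a non-word character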

150
['', 'strange', 'days', 'chronicles', 'the', 'last', 'two', 'days', 'of', '1999', 'in', 'los', 'angeles', 'as', 'the', 'locals', 'gear', 'up', 'for', 'the', 'new', 'millenium', 'lenny', 'nero', 'ralph', 'fiennes', 'goes', 'about', 'his', 'business', 'of', 'peddling', 'erotic', 'memory', 'clips', 'he', 'pines', 'for', 'his', 'ex', 'girlfriend', 'faith', 'juliette', 'lewis', 'but', 'doesn', 't', 'notice', 'that', 'another', 'friend', 'mace', 'angela', 'bassett', 'really', 'cares', 'for', 'him', 'this', 'film', 'features', 'good', 'performances', 'impressive', 'film', 'making', 'technique', 'and', 'breath', 'taking', 'crowd', 'scenes', 'director', 'kathryn', 'bigelow', 'knows', 'her', 'stuff', 'and', 'does', 'not', 'hesitate', 'to', 'use', 'it', 'but', 'as', 'a', 'whole', 'this', 'is', 'an', 'unsatisfying', 'movie', 'the', 'problem', 'is', 'that', 'the', 'writers', 'james', 'cameron', 'and', 'jay', 'cocks', 'were', 'too', 'ambitious', 'aiming', 'for', 'a', 'film', 'with', 'social', 'r

In [6]:
# Exercise 3.1.2 NLTK's word tokenizer: 

# break down text into words and punctuations

# invoke NLTK's word tokenizer
tokens = nltk.word_tokenize(text)    
print(len(tokens) )                   
print (tokens)       

# Pros: words are well tokenized,
# e.g. breath-taking and film-making are each captured as one word
# Note: doesn't becomes does + n't
# Cons: punctuation is kept as tokens and needs to be removed

175
['`', 'strange', 'days', "'", 'chronicles', 'the', 'last', 'two', 'days', 'of', '1999', 'in', 'los', 'angeles', '.', 'as', 'the', 'locals', 'gear', 'up', 'for', 'the', 'new', 'millenium', ',', 'lenny', 'nero', '(', 'ralph', 'fiennes', ')', 'goes', 'about', 'his', 'business', 'of', 'peddling', 'erotic', 'memory', 'clips', '.', 'he', 'pines', 'for', 'his', 'ex-girlfriend', ',', 'faith', '(', 'juliette', 'lewis', ')', ',', 'but', 'does', "n't", 'notice', 'that', 'another', 'friend', ',', 'mace', '(', 'angela', 'bassett', ')', 'really', 'cares', 'for', 'him', '.', 'this', 'film', 'features', 'good', 'performances', ',', 'impressive', 'film-making', 'technique', 'and', 'breath-taking', 'crowd', 'scenes', '.', 'director', 'kathryn', 'bigelow', 'knows', 'her', 'stuff', 'and', 'does', 'not', 'hesitate', 'to', 'use', 'it', '.', 'but', 'as', 'a', 'whole', ',', 'this', 'is', 'an', 'unsatisfying', 'movie', '.', 'the', 'problem', 'is', 'that', 'the', 'writers', ',', 'james', 'cameron', 'and', '

In [7]:
# Exercise 3.1.3 remove leading or trailing punctuation

import string

string.punctuation

tokens=[token.strip(string.punctuation) for token in tokens]

# remove empty tokens
tokens=[token.strip() for token in tokens if token.strip()!='']
print(len(tokens) )
print(tokens)  


'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

144
['strange', 'days', 'chronicles', 'the', 'last', 'two', 'days', 'of', '1999', 'in', 'los', 'angeles', 'as', 'the', 'locals', 'gear', 'up', 'for', 'the', 'new', 'millenium', 'lenny', 'nero', 'ralph', 'fiennes', 'goes', 'about', 'his', 'business', 'of', 'peddling', 'erotic', 'memory', 'clips', 'he', 'pines', 'for', 'his', 'ex-girlfriend', 'faith', 'juliette', 'lewis', 'but', 'does', "n't", 'notice', 'that', 'another', 'friend', 'mace', 'angela', 'bassett', 'really', 'cares', 'for', 'him', 'this', 'film', 'features', 'good', 'performances', 'impressive', 'film-making', 'technique', 'and', 'breath-taking', 'crowd', 'scenes', 'director', 'kathryn', 'bigelow', 'knows', 'her', 'stuff', 'and', 'does', 'not', 'hesitate', 'to', 'use', 'it', 'but', 'as', 'a', 'whole', 'this', 'is', 'an', 'unsatisfying', 'movie', 'the', 'problem', 'is', 'that', 'the', 'writers', 'james', 'cameron', 'and', 'jay', 'cocks', 'were', 'too', 'ambitious', 'aiming', 'for', 'a', 'film', 'with', 'social', 'relevance', '

In [8]:
# Exercise 3.1.4 NLTK's regular expression tokenizer (customizable)

# Pattern can be customized to your need

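# a word is defined as:
# (1) must start with a word character
# (2) then contain zero or more word characters, "-",
#     or "'" in the middle
# (3) must end with a word character
# e.g. film-making, doesn't
# Note: the pattern needs at least two characters, so
# single-character words such as "a" are dropped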

pattern=r'\w[\w\'-]*\w'                        

# call NLTK's regular expression tokenization
tokens=nltk.regexp_tokenize(text, pattern)

print(len(tokens))
print (tokens)

141
['strange', 'days', 'chronicles', 'the', 'last', 'two', 'days', 'of', '1999', 'in', 'los', 'angeles', 'as', 'the', 'locals', 'gear', 'up', 'for', 'the', 'new', 'millenium', 'lenny', 'nero', 'ralph', 'fiennes', 'goes', 'about', 'his', 'business', 'of', 'peddling', 'erotic', 'memory', 'clips', 'he', 'pines', 'for', 'his', 'ex-girlfriend', 'faith', 'juliette', 'lewis', 'but', "doesn't", 'notice', 'that', 'another', 'friend', 'mace', 'angela', 'bassett', 'really', 'cares', 'for', 'him', 'this', 'film', 'features', 'good', 'performances', 'impressive', 'film-making', 'technique', 'and', 'breath-taking', 'crowd', 'scenes', 'director', 'kathryn', 'bigelow', 'knows', 'her', 'stuff', 'and', 'does', 'not', 'hesitate', 'to', 'use', 'it', 'but', 'as', 'whole', 'this', 'is', 'an', 'unsatisfying', 'movie', 'the', 'problem', 'is', 'that', 'the', 'writers', 'james', 'cameron', 'and', 'jay', 'cocks', 'were', 'too', 'ambitious', 'aiming', 'for', 'film', 'with', 'social', 'relevance', 'thrills', 'and
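
The pattern above requires a word character at both ends, so one-letter words never match. The sketch below compares it with a variant that adds a single-character fallback. The variant pattern and the sample sentence are assumptions made for illustration; `re.findall` is used so the sketch runs without NLTK, and `nltk.regexp_tokenize` is expected to behave the same way for a pattern without capturing groups.

```python
import re

# Hypothetical sample sentence, not taken from the review text.
sample = "this is a film-making lesson, isn't it?"

# Pattern from the exercise: a word character at both ends,
# so only tokens of two or more characters can match.
two_plus = r"\w[\w'-]*\w"

# Assumed variant: fall back to a single word character
# when the longer alternative cannot match.
one_plus = r"\w[\w'-]*\w|\w"

tokens_two_plus = re.findall(two_plus, sample)
tokens_one_plus = re.findall(one_plus, sample)

print(tokens_two_plus)   # "a" is missing
print(tokens_one_plus)   # "a" is kept
```

The order of the alternatives matters: the longer alternative is tried first, so multi-character words are still captured whole.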

In [9]:
# Exercise 3.1.5 Use NLTK's regular expression tokenizer 
# to define sentences, i.e. 
# (1) starts with a word character 
#     (so the leading ` of the text is dropped), 
# (2) contains any number of characters in the middle, 
#     as long as they are not "!?."
# (3) ends with !?.

pattern=r'\w[^!?.]*[?!.]'  
tokens=nltk.regexp_tokenize(text, pattern)

print(len(tokens))
print (tokens)

9
["strange days' chronicles the last two days of 1999 in los angeles.", 'as the locals gear up for the new millenium , lenny nero (ralph fiennes) goes about his business of peddling erotic memory clips.', "he pines for his ex-girlfriend, faith (juliette lewis), but doesn't notice that another friend, mace (angela bassett) really cares for him.", 'this film features good performances, impressive film-making technique and breath-taking crowd scenes.', 'director kathryn bigelow knows her stuff and does not hesitate to use it.', 'but as a whole, this is an unsatisfying movie.', 'the problem is that the writers, james cameron and jay cocks , were too ambitious, aiming for a film with social relevance, thrills, and drama.', 'not that ambitious film-making should be discouraged; just that when it fails to achieve its goals, it fails badly and obviously.', 'the film just ends up preachy, unexciting and uninvolving.']


### 3.2. Sentence

In [10]:
# Exercise 3.2.1. Segmentation by sentences

sentences = nltk.sent_tokenize(text)
len(sentences)
sentences

# what patterns can be used to segment 
# text into sentences?

9

["`strange days' chronicles the last two days of 1999 in los angeles.",
 'as the locals gear up for the new millenium , lenny nero (ralph fiennes) goes about his business of peddling erotic memory clips.',
 "he pines for his ex-girlfriend, faith (juliette lewis), but doesn't notice that another friend, mace (angela bassett) really cares for him.",
 'this film features good performances, impressive film-making technique and breath-taking crowd scenes.',
 'director kathryn bigelow knows her stuff and does not hesitate to use it.',
 'but as a whole, this is an unsatisfying movie.',
 'the problem is that the writers, james cameron and jay cocks , were too ambitious, aiming for a film with social relevance, thrills, and drama.',
 'not that ambitious film-making should be discouraged; just that when it fails to achieve its goals, it fails badly and obviously.',
 'the film just ends up preachy, unexciting and uninvolving.']
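
One possible answer to the question above is to split on whitespace that follows sentence-ending punctuation. The sketch below is a minimal illustration with a hypothetical sample string; it also shows where such a simple rule breaks down, which is one reason `nltk.sent_tokenize` relies on a pre-trained Punkt model rather than a single pattern.

```python
import re

# Hypothetical sample; "dr." is an abbreviation, not a sentence end.
sample = "the film ends badly. was it too ambitious? dr. smith thinks so!"

# Split on whitespace preceded by ., ! or ?
# The lookbehind keeps the punctuation attached to its sentence.
simple_sentences = re.split(r"(?<=[.!?])\s+", sample)

for s in simple_sentences:
    print(repr(s))
```

The rule wrongly splits after "dr.", so it yields four pieces for three sentences.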

### 3.3 Phrases: Bigrams (2 consecutive words),  Trigrams (3 consecutive words), or in general n-grams
 - Why bigrams and trigrams?
 - How to get bigrams or trigrams:
    1. First tokenize text into unigrams
    2. Slice through the list of unigrams to get bigrams
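
The slicing step can be sketched in plain Python as follows. The helper name `make_ngrams` and the short token list are hypothetical; `nltk.bigrams` and `nltk.trigrams` give the same tuples for this kind of input.

```python
def make_ngrams(tokens, n):
    """Return all n-grams (as tuples) of consecutive tokens."""
    # Build n copies of the list, each shifted by one more position,
    # and zip them; zip stops when the shortest slice runs out.
    return list(zip(*(tokens[i:] for i in range(n))))

unigrams = ["the", "last", "two", "days", "of", "1999"]

sample_bigrams = make_ngrams(unigrams, 2)
sample_trigrams = make_ngrams(unigrams, 3)

print(sample_bigrams)
print(sample_trigrams)
```

A list of `k` tokens produces `k - n + 1` n-grams, and an empty list when `n` is larger than `k`.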

In [12]:
# Exercise 3.3.1. Get bigrams from the text                       

# bigrams are formed from unigrams
# nltk.bigrams returns a generator, so wrap it in list()
tokens = nltk.word_tokenize(text)
bigrams=list(nltk.bigrams(tokens))  # tokens come from nltk.word_tokenize above
print(bigrams)

# trigrams
list(nltk.trigrams(tokens))

[('`', 'strange'), ('strange', 'days'), ('days', "'"), ("'", 'chronicles'), ('chronicles', 'the'), ('the', 'last'), ('last', 'two'), ('two', 'days'), ('days', 'of'), ('of', '1999'), ('1999', 'in'), ('in', 'los'), ('los', 'angeles'), ('angeles', '.'), ('.', 'as'), ('as', 'the'), ('the', 'locals'), ('locals', 'gear'), ('gear', 'up'), ('up', 'for'), ('for', 'the'), ('the', 'new'), ('new', 'millenium'), ('millenium', ','), (',', 'lenny'), ('lenny', 'nero'), ('nero', '('), ('(', 'ralph'), ('ralph', 'fiennes'), ('fiennes', ')'), (')', 'goes'), ('goes', 'about'), ('about', 'his'), ('his', 'business'), ('business', 'of'), ('of', 'peddling'), ('peddling', 'erotic'), ('erotic', 'memory'), ('memory', 'clips'), ('clips', '.'), ('.', 'he'), ('he', 'pines'), ('pines', 'for'), ('for', 'his'), ('his', 'ex-girlfriend'), ('ex-girlfriend', ','), (',', 'faith'), ('faith', '('), ('(', 'juliette'), ('juliette', 'lewis'), ('lewis', ')'), (')', ','), (',', 'but'), ('but', 'does'), ('does', "n't"), ("n't", 'no

[('`', 'strange', 'days'),
 ('strange', 'days', "'"),
 ('days', "'", 'chronicles'),
 ("'", 'chronicles', 'the'),
 ('chronicles', 'the', 'last'),
 ('the', 'last', 'two'),
 ('last', 'two', 'days'),
 ('two', 'days', 'of'),
 ('days', 'of', '1999'),
 ('of', '1999', 'in'),
 ('1999', 'in', 'los'),
 ('in', 'los', 'angeles'),
 ('los', 'angeles', '.'),
 ('angeles', '.', 'as'),
 ('.', 'as', 'the'),
 ('as', 'the', 'locals'),
 ('the', 'locals', 'gear'),
 ('locals', 'gear', 'up'),
 ('gear', 'up', 'for'),
 ('up', 'for', 'the'),
 ('for', 'the', 'new'),
 ('the', 'new', 'millenium'),
 ('new', 'millenium', ','),
 ('millenium', ',', 'lenny'),
 (',', 'lenny', 'nero'),
 ('lenny', 'nero', '('),
 ('nero', '(', 'ralph'),
 ('(', 'ralph', 'fiennes'),
 ('ralph', 'fiennes', ')'),
 ('fiennes', ')', 'goes'),
 (')', 'goes', 'about'),
 ('goes', 'about', 'his'),
 ('about', 'his', 'business'),
 ('his', 'business', 'of'),
 ('business', 'of', 'peddling'),
 ('of', 'peddling', 'erotic'),
 ('peddling', 'erotic', 'memory'),
 

### 3.4. Collocation
 - Most bigrams or trigrams may sound odd. However, we need to pay attention to frequent bigrams or trigrams
 - **Collocation**: an expression consisting of two or more words that correspond to some conventional way of saying things, e.g. red wine, United States, graduate students etc.
    - Collocations are not fully compositional in that there is usually an element of meaning added to the combination.
 - Question: how to find collocations?

In [13]:
# Exercise 3.4.1. Get collocation

from nltk.collocations import BigramCollocationFinder

# bigram association measures
bigram_measures = nltk.collocations.BigramAssocMeasures()

# construct bigrams using words from our example
finder = BigramCollocationFinder.from_words(tokens) # tokens are created in Exercise 3.3.1

# the corpus is too small to yield meaningful collocations
finder.nbest(bigram_measures.raw_freq, 10)  

[('.', 'the'),
 ('it', 'fails'),
 ("'", 'chronicles'),
 ('(', 'angela'),
 ('(', 'juliette'),
 ('(', 'ralph'),
 (')', ','),
 (')', 'goes'),
 (')', 'really'),
 (',', 'aiming')]

In [14]:
# construct bigrams using words from a large built-in NLTK corpus

finder = BigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))

finder.nbest(bigram_measures.raw_freq, 10) 

# Note that the most frequent bigrams are mostly
# punctuation and stop words. How to fix it?

[(',', 'and'),
 (',', '"'),
 ('of', 'the'),
 ("'", 's'),
 ('in', 'the'),
 ('said', ','),
 ('said', 'to'),
 ('.', 'He'),
 ('the', 'land'),
 ('.', 'The')]

In [20]:
# Exercise 3.4.2. Find collocations with a word filter

import string

# reuse the finder built from the genesis corpus above and drop
# bigrams that contain a stop word or a punctuation-only token
stop_words = nltk.corpus.stopwords.words('english')

finder.apply_word_filter(lambda w: w.lower() in stop_words
                         or w.strip(string.punctuation)=='')

finder.nbest(bigram_measures.raw_freq, 10) 

# better?
# several of the top bigrams follow the pattern "xxx said"

[('God', 'said'),
 ('one', 'hundred'),
 ('Jacob', 'said'),
 ('Yahweh', 'God'),
 ('Yahweh', 'said'),
 ('years', 'old'),
 ('seven', 'years'),
 ('Joseph', 'said'),
 ('every', 'man'),
 ('five', 'years')]

### 3.4.1 How to find collocations - PMI
- By **frequency** (perhaps with filter)
- **Pointwise Mutual Information (PMI)**
  - given two words $w_1, w_2$, $$PMI(w_1,w_2)=\log{\frac{p(w_1,w_2)}{p(w_1)*p(w_2)}}$$
  - Some observations:
    - if $w_1$ and $w_2$ are independent, $p(w_1,w_2)=p(w_1)*p(w_2)$, so $PMI(w_1,w_2)=\log{1}=0$
    - if $w_2$ always occurs together with $w_1$, i.e. $p(w_1,w_2)=p(w_2)$, then $PMI(w_1,w_2)=\log{\frac{1}{p(w_1)}}=-\log{p(w_1)}$. In this case, what if $w_1$ appears just once in the corpus? 
    - PMI favors less frequent collocations: the rarer $w_1$ is, the larger $-\log{p(w_1)}$ becomes
    - how to fix it?
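
To make the bias toward rare pairs concrete, the sketch below computes PMI from raw counts in plain Python. The function name `pmi_scores`, the toy corpus, and the choice of log base 2 are assumptions for illustration; the probability estimates are simple relative frequencies and may differ slightly from what NLTK computes.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI (log base 2) for every pair of adjacent tokens."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), c12 in bigram_counts.items():
        p12 = c12 / n_bigrams                 # p(w1, w2)
        p1 = unigram_counts[w1] / n_unigrams  # p(w1)
        p2 = unigram_counts[w2] / n_unigrams  # p(w2)
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return scores

# Hypothetical toy corpus: "new york" occurs twice, "los angeles" once.
toy = "new york is big and new york is old and los angeles is big".split()
scores = pmi_scores(toy)

for pair in [("new", "york"), ("los", "angeles")]:
    print(pair, round(scores[pair], 3))
```

Both pairs always occur together, yet the pair seen only once gets the higher PMI, which is the behavior the frequency filter in the next exercise is meant to counter.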


In [36]:
# Exercise 3.4.1.1 Metrics for Collocations

from nltk.collocations import *

# load a built-in NLTK corpus as a list of words
words=nltk.corpus.genesis.words('english-web.txt')

# construct bigrams using words from a NLTK corpus
finder = BigramCollocationFinder.from_words(words)

# find top-n bigrams by pmi
finder.nbest(bigram_measures.pmi, 10) 


[('Allon', 'Bacuth'),
 ('Ashteroth', 'Karnaim'),
 ('Ben', 'Ammi'),
 ('En', 'Mishpat'),
 ('Jegar', 'Sahadutha'),
 ('Salt', 'Sea'),
 ('Whoever', 'sheds'),
 ('appoint', 'overseers'),
 ('aromatic', 'resin'),
 ('cutting', 'instrument')]

In [38]:
# 3.4.1.2 filter bigrams by frequency
# keep only bigrams that appear 5+ times
finder.apply_freq_filter(5)
finder.nbest(bigram_measures.pmi, 10) 

[('burnt', 'offering'),
 ('Paddan', 'Aram'),
 ('living', 'creature'),
 ('young', 'lady'),
 ('little', 'ones'),
 ('Be', 'fruitful'),
 ('still', 'alive'),
 ('savory', 'food'),
 ('creeping', 'thing'),
 ('find', 'favor')]

### 3.4.2 How to find collocations - NPMI and others
- **Normalized Pointwise Mutual Information (NPMI)**
   - If $w_1$ and $w_2$ always occur together, i.e., $p(w_1)=p(w_2)=p(w_1,w_2)$, PMI reaches the maximum: $$PMI(w_1,w_2)=-\log{p(w_1)}=-\log{p(w_2)}=-\log{p(w_1,w_2)}$$
   - Normalized PMI is the PMI divided by this upper bound, so it ranges from -1 to 1:
   $$NPMI(w_1,w_2)=\frac{\log{\frac{p(w_1,w_2)}{p(w_1)*p(w_2)}}}{-\log{p(w_1,w_2)}}$$
   
- Another simple method by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf):

$$Score(w_1, w_2)=\frac{count(w_1,w_2)-\delta}{count(w_1)*count(w_2)}$$

  where $\delta$ is a discounting coefficient (the minimum collocation frequency) that keeps pairs of very infrequent words from scoring high
- Both methods are implemented in the gensim package
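
Both scores can be computed directly from counts. The sketch below is a plain-Python illustration of the two formulas above, not gensim's implementation; the function names and the example counts are hypothetical.

```python
import math

def npmi(c12, c1, c2, n):
    """Normalized PMI from counts; n is the corpus size.

    Assumes 0 < c12 < n so that both logarithms are defined and nonzero.
    """
    p12, p1, p2 = c12 / n, c1 / n, c2 / n
    pmi = math.log(p12 / (p1 * p2))
    return pmi / -math.log(p12)

def mikolov_score(c12, c1, c2, delta):
    """Score from Mikolov et al. (2013); delta discounts rare pairs."""
    return (c12 - delta) / (c1 * c2)

# Words that always occur together: NPMI reaches its maximum of 1.
always_together = npmi(c12=10, c1=10, c2=10, n=1000)

# Independent words, p(w1, w2) = p(w1) * p(w2): NPMI is 0.
independent = npmi(c12=10, c1=100, c2=100, n=1000)

# A pair seen once scores below zero when delta = 5,
# while a frequent pair keeps a positive score.
rare_pair = mikolov_score(c12=1, c1=1, c2=1, delta=5)
frequent_pair = mikolov_score(c12=20, c1=25, c2=30, delta=5)

print(always_together, independent, rare_pair, frequent_pair)
```

Unlike raw PMI, neither score rewards a pair simply for being rare: NPMI is bounded, and the discount $\delta$ pushes rare pairs below zero.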

### 3.5. Vocabulary 
 - Vocabulary: the set of unique tokens (unigrams/phrases)  
 - Dictionary: typically, the vocabulary of a text can be represented as a dictionary 
    * Key: word, Value: count of the word
    * **nltk.FreqDist()**: a convenient class for calculating the frequency of words/phrases
        - Counts the frequency of each item in the list passed to it 
        - Returns an object similar to a dictionary
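
The same word-to-count dictionary can be sketched with the standard library alone. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the calls shown here should carry over; the short token list is a hypothetical example.

```python
from collections import Counter

# Hypothetical token list, not the review text.
sample_tokens = ["the", "film", "fails", "and", "the",
                 "film", "ends", "the", "day"]

word_counts = Counter(sample_tokens)

# The vocabulary is the set of unique tokens (the dictionary keys).
vocabulary = set(word_counts)

# most_common(k) returns the k most frequent (token, count) pairs.
top_two = word_counts.most_common(2)

print(len(vocabulary), "unique tokens out of", len(sample_tokens))
print(top_two)
```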

In [23]:
# 3.5.1 Get token frequency

# get unigram frequency 
# recall, you can also build the dictionary with
# {token: tokens.count(token) for token in set(tokens)}

word_dist=nltk.FreqDist(tokens)
print("word_dist:", word_dist)

# get the most frequent items
print("top 10 words:", word_dist.most_common(10))

# what kind of words usually have high frequency?

# it behaves as a dictionary
for word in word_dist:
    print(word,":", word_dist[word])
    

word_dist: <FreqDist with 115 samples and 175 outcomes>
top 10 words: [(',', 13), ('.', 9), ('the', 6), ('and', 6), ('for', 4), ('that', 4), ('(', 3), (')', 3), ('film', 3), ('it', 3)]
` : 1
strange : 1
days : 2
' : 1
chronicles : 1
the : 6
last : 1
two : 1
of : 2
1999 : 1
in : 1
los : 1
angeles : 1
. : 9
as : 2
locals : 1
gear : 1
up : 2
for : 4
new : 1
millenium : 1
, : 13
lenny : 1
nero : 1
( : 3
ralph : 1
fiennes : 1
) : 3
goes : 1
about : 1
his : 2
business : 1
peddling : 1
erotic : 1
memory : 1
clips : 1
he : 1
pines : 1
ex-girlfriend : 1
faith : 1
juliette : 1
lewis : 1
but : 2
does : 2
n't : 1
notice : 1
that : 4
another : 1
friend : 1
mace : 1
angela : 1
bassett : 1
really : 1
cares : 1
him : 1
this : 2
film : 3
features : 1
good : 1
performances : 1
impressive : 1
film-making : 2
technique : 1
and : 6
breath-taking : 1
crowd : 1
scenes : 1
director : 1
kathryn : 1
bigelow : 1
knows : 1
her : 1
stuff : 1
not : 2
hesitate : 1
to : 2
use : 1
it : 3
a : 2
whole : 1
is : 2
an : 1


### 3.5.1 Stop words and word filtering

 - Stop words: a set of commonly used words, such as "and" and "the", that carry very little meaning and cannot differentiate one text from another.
 - Stop words are typically ignored in NLP processing and by search engines.
 - Stop words are usually application specific. You can define your own stop words!

In [24]:
# Exercise 3.5.1.1
# get NLTK English stop words
# You can modify this list by adding or removing stop words

from nltk.corpus import stopwords
import string

stop_words = stopwords.words('english')
stop_words+=["film", "films"]
#print (stop_words)

# filter stop words out of the dictionary
# by creating a new dictionary

filtered_dict={word: word_dist[word] \
                     for word in word_dist \
                     if word not in stop_words and
                        word not in string.punctuation}

print("\nsort dictionary without stop words by frequency")
print(sorted(filtered_dict.items(), key=lambda item:-item[1]))

print(len(filtered_dict))


sort dictionary without stop words by frequency
[('days', 2), ('film-making', 2), ('ambitious', 2), ('fails', 2), ('strange', 1), ('chronicles', 1), ('last', 1), ('two', 1), ('1999', 1), ('los', 1), ('angeles', 1), ('locals', 1), ('gear', 1), ('new', 1), ('millenium', 1), ('lenny', 1), ('nero', 1), ('ralph', 1), ('fiennes', 1), ('goes', 1), ('business', 1), ('peddling', 1), ('erotic', 1), ('memory', 1), ('clips', 1), ('pines', 1), ('ex-girlfriend', 1), ('faith', 1), ('juliette', 1), ('lewis', 1), ("n't", 1), ('notice', 1), ('another', 1), ('friend', 1), ('mace', 1), ('angela', 1), ('bassett', 1), ('really', 1), ('cares', 1), ('features', 1), ('good', 1), ('performances', 1), ('impressive', 1), ('technique', 1), ('breath-taking', 1), ('crowd', 1), ('scenes', 1), ('director', 1), ('kathryn', 1), ('bigelow', 1), ('knows', 1), ('stuff', 1), ('hesitate', 1), ('use', 1), ('whole', 1), ('unsatisfying', 1), ('movie', 1), ('problem', 1), ('writers', 1), ('james', 1), ('cameron', 1), ('jay', 1)

### 3.5.2 Positive/negative words: sentiment analysis

In [40]:
# Exercise 3.5.2.1
# Find positive words 
# (requires positive-words.txt, one word per line,
#  in the working directory)

with open("positive-words.txt",'r') as f:
    positive_words=[line.strip() for line in f]

#positive_words
#print(positive_words)
positive_tokens=[token for token in tokens \
                 if token in positive_words]

print(positive_tokens)

FileNotFoundError: [Errno 2] No such file or directory: 'positive-words.txt'

- **Naive sentiment analysis**:
  - Find positive/negative words
  - If more positive words than negative, then positive
  - Otherwise, negative
- Note the sentence: 
  -  "the problem is that the writers, james cameron and jay cocks , were **<font color="red">too ambitious</font>**, aiming for a film with social relevance, thrills, and drama. **<font color="red">not that ambitious</font>** film-making should be discouraged; just that when it fails to achieve its goals ..."
- How to deal with **negation**?
- Some useful rules:
    - Negative sentiment: 
      - negative words not preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - positive words preceded by a negation within $n$ (e.g. three) words in the same sentence.
    - Positive sentiment (in a similar fashion):
      - positive words not preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - negative words preceded by a negation within $n$ (e.g. three) words in the same sentence.


In [None]:
# Exercise 3.5.2.2
# check if a positive word is directly preceded by a negation word
# e.g. not, too, n't, no, cannot

# this is not an exhaustive list of negation words!
negations=['not', 'too', 'n\'t', 'no', 'cannot', 'neither','nor']
tokens = nltk.word_tokenize(text)  

#print(tokens)

positive_tokens=[]
for idx, token in enumerate(tokens):
    if token in positive_words:
        # keep the word unless the token right before it is a negation
        if idx==0 or tokens[idx-1] not in negations:
            positive_tokens.append(token)


print(positive_tokens)

# what if a positive word is preceded 
# by a negation within N words? 
# e.g. 'does not make any customer happy'
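
One way to approach the question above is to look back over a window of the preceding `window` tokens instead of only the previous one. The sketch below applies the negation rules listed earlier to the tokens of a single sentence. The function name, the tiny word lists, and the sample sentences are hypothetical stand-ins; a real run would use full sentiment lexicons and process one sentence at a time.

```python
def split_by_sentiment(tokens, positive_words, negative_words,
                       negations, window=3):
    """Return (positive, negative) sentiment words for one sentence.

    A sentiment word flips polarity when a negation appears among
    the `window` tokens immediately before it.
    """
    positive, negative = [], []
    for idx, token in enumerate(tokens):
        if token not in positive_words and token not in negative_words:
            continue
        preceding = tokens[max(0, idx - window):idx]
        negated = any(t in negations for t in preceding)
        # positive and not negated, or negative and negated
        is_positive = (token in positive_words) != negated
        (positive if is_positive else negative).append(token)
    return positive, negative

# Hypothetical miniature lexicons.
pos_words = {"happy", "good"}
neg_words = {"bad"}
neg_markers = {"not", "n't", "no"}

sentence = "does not make any customer happy".split()

# "not" is four tokens before "happy": missed with window=3,
# caught with window=4.
narrow = split_by_sentiment(sentence, pos_words, neg_words, neg_markers, window=3)
wide = split_by_sentiment(sentence, pos_words, neg_words, neg_markers, window=4)
double = split_by_sentiment("not bad at all".split(), pos_words, neg_words, neg_markers)

print(narrow, wide, double)
```

The window size is a trade-off: a small window misses distant negations, while a large one may attach a negation to an unrelated word.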