# <center>Natural Language Processing Using NLTK (I)</center>

References:
 - http://www.nltk.org/book_1ed/
 - https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
 - https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
 - http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
 - https://web.stanford.edu/class/cs124/lec/Information_Extraction_and_Named_Entity_Recognition.pdf

## 1. NLTK installation
 1. Install the NLTK package using: pip install nltk 
 2. Open your Python editor (Jupyter Notebook, Spyder, etc.) and run the commands below. In the downloader, select "all packages" to install the data included in NLTK, including corpora and books. Downloading all the data may take a few minutes.

In [None]:
import nltk
#nltk.download()

## 2. NLP Objectives and Basic Steps

 - Objectives:
   * Split documents into tokens or segments
   * Clean up tokens and annotate tokens
   * Extract features from tokens for further text mining tasks
 - Basic processing steps:
   * Tokenization: split documents into individual words or segments
   * Remove stop words and filter tokens
   * POS (part of speech) Tagging
   * Normalization: Stemming, Lemmatization
   * Named Entity Recognition (NER)
   * Term Frequency and Inverse Document Frequency (TF-IDF)
   * Create document-to-term matrix (bag of words)


In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re    # import re module
import nltk

In [None]:
# Exercise 2.1. Load the text for analysis

text='''`strange days' chronicles the last two days of 1999 in los angeles. 
 as the locals gear up for the new millenium , lenny nero (ralph fiennes) goes about his business of peddling erotic memory clips. 
 he pines for his ex-girlfriend, faith (juliette lewis), but doesn't notice that another friend, mace (angela bassett) really cares for him. 
 this film features good performances, impressive film-making technique and breath-taking crowd scenes. 
 director kathryn bigelow knows her stuff and does not hesitate to use it. 
 but as a whole, this is an unsatisfying movie. 
 the problem is that the writers, james cameron and jay cocks , were too ambitious, aiming for a film with social relevance, thrills, and drama. 
 not that ambitious film-making should be discouraged; just that when it fails to achieve its goals, it fails badly and obviously. 
 the film just ends up preachy, unexciting and uninvolving.'''

text


## 3. Tokenization
 - **Definition**: the process of breaking a stream of textual content up into words, terms, symbols, or some other meaningful elements called tokens.
    * Word (Unigram)
    * Bigram (Two consecutive words)
    * Trigram (Three consecutive words)
    * Sentence
 - Different methods exist:
    * Split by regular expression patterns
    * NLTK's word tokenizer
    * NLTK's regular expression tokenizer (customizable)
 - No single method is perfect for every tokenization task. 
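
The token types listed above can be illustrated with a small sketch. It uses only the standard library; the helper name `ngrams` and the sample sentence are assumptions made for this illustration (NLTK also provides `nltk.bigrams` and `nltk.util.ngrams` for the same purpose).

```python
import re

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# hypothetical sample sentence, split into words with a simple pattern
sample = "this is an unsatisfying movie"
words = re.findall(r"\w+", sample)

unigrams = ngrams(words, 1)   # single words
bigrams = ngrams(words, 2)    # two consecutive words
trigrams = ngrams(words, 3)   # three consecutive words

print(bigrams)
print(trigrams)
```

A sentence of five words yields four bigrams and three trigrams, since each n-gram needs n consecutive positions.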

### 3.1. Unigram

In [None]:
# Exercise 3.1.1. Simply split the text by one or more non-word characters

# \W+: one or more non-word characters
tokens = re.split(r"\W+", text)   

# get the number of tokens

print(len(tokens))                   
print(tokens)                     

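# Pros: no punctuation, just words
# Cons: hyphenated words and contractions (breath-taking, film-making,
# doesn't) are each split into two tokens; punctuation at the very start
# or end of the text also leaves an empty-string token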

In [None]:
# Exercise 3.1.2 NLTK's word tokenizer: 

# break text down into words and punctuation marks

# invoke NLTK's word tokenizer
tokens = nltk.word_tokenize(text)    
print(len(tokens))                   
print(tokens)       

# Pros: words are well tokenized, 
# e.g. breath-taking and film-making are each kept as one word
# doesn't becomes does + n't
# Cons: punctuation tokens still need to be removed

In [None]:
# remove leading and trailing punctuation from each token

import string

string.punctuation

tokens=[token.strip(string.punctuation) for token in tokens]

# remove empty tokens
tokens=[token.strip() for token in tokens if token.strip()!='']
print(len(tokens) )
print(tokens)  


In [None]:
# Exercise 3.1.3 NLTK's regular expression tokenizer (customizable)

# The pattern can be customized to your needs

# Here a word is defined as a word character,
# followed by zero or more word characters, "-" or "'",
# and ending with a word character

# e.g. film-making, doesn't
# Note: the pattern needs at least two characters to match,
# so one-letter words such as "a" are dropped

pattern=r'\w[\w\-\']*\w'                        


# call NLTK's regular expression tokenizer
tokens=nltk.regexp_tokenize(text, pattern)

print(len(tokens))
print(tokens)

In [None]:
# Exercise 3.1.4 Use NLTK's regular expression tokenizer 
# to extract sentences (i.e. each starts with a non-space character 
# and ends with ".", "!" or "?")
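
One possible answer to the sentence exercise is sketched below. The pattern and the sample string are assumptions chosen for illustration, and the sketch uses the standard `re` module so that it runs without NLTK data; the same pattern string can be passed to `nltk.regexp_tokenize(text, sentence_pattern)`.

```python
import re

# a sentence starts with a character that is neither whitespace nor an
# end mark, continues with any characters except end marks,
# and stops at the first ".", "!" or "?"
sentence_pattern = r"[^\s.!?][^.!?]*[.!?]"

# hypothetical sample text
sample = "this film features good performances. but is it good? not really!"
sentences = re.findall(sentence_pattern, sample)

for sentence in sentences:
    print(sentence)
```

This simple rule treats every period as a sentence boundary, so abbreviations (e.g. "dr.") and decimal numbers would be split incorrectly.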



### 3.2. Vocabulary 
 - Vocabulary: the set of unique tokens  
 - Dictionary: typically, the vocabulary of a text can be represented as a dictionary 
    * Key: word
    * Value: count of the word
 - The counts show which words are used most frequently (often stop words)
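
As a minimal sketch of the word-to-count dictionary described above, the standard library's `collections.Counter` builds the counts in a single pass. The token list and variable names here are hypothetical and chosen only for illustration.

```python
from collections import Counter

# hypothetical token list
sample_tokens = ["the", "film", "fails", "badly", "and",
                 "the", "film", "ends", "the"]

sample_vocabulary = set(sample_tokens)   # unique tokens
word_counts = Counter(sample_tokens)     # maps word -> count

print(len(sample_vocabulary))
print(word_counts.most_common(2))        # the two most frequent words
```

`Counter` is a `dict` subclass, so `word_counts["film"]` looks up a count just like an ordinary dictionary.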

In [None]:
# Exercise 3.2.1 
# Get vocabulary and dictionary of text

vocabulary = set(tokens)
# set() converts a list to a set, dropping duplicates
print(vocabulary)

# tokens.count(word) returns the number of times word occurs in tokens (list)
dictionary = {word: tokens.count(word) for word in vocabulary}
# a dict keeps insertion order rather than sorting by key,
# so sort the items explicitly to list them by word
print("\nsort by word")
print(sorted(dictionary.items()))

# find the frequent words by sorting the dictionary by value
# sorted(iterable, key) sorts an iterable by the comparison key
# lambda: an anonymous function defined without a name
# lambda item: -item[1] sorts the items by their 2nd element in descending order
print("\nsort by frequency")
print(sorted(dictionary.items(), key=lambda item: -item[1]))

# what kind of words usually have high frequency?

### 3.3. Stop words and word filtering

 - Stop words: a set of commonly used words, such as "and" and "the", that carry very little meaning and cannot differentiate one text from another. 
 - Stop words are typically ignored in NLP processing and by search engines.
 - Stop words are usually application specific. You can define your own stop words!
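
Because stop words are application specific, a custom list can be as simple as a Python set. The sketch below assumes a hypothetical stop word set and token list; a set is used because membership tests on it are fast.

```python
# hypothetical, application-specific stop words
custom_stop_words = {"the", "and", "a", "of", "is", "film"}

# hypothetical token list
sample_tokens = ["the", "film", "is", "preachy", "and", "unexciting"]

# keep only the tokens that are not stop words
content_tokens = [token for token in sample_tokens
                  if token.lower() not in custom_stop_words]

print(content_tokens)
```

Here "film" is treated as a stop word on the assumption that the texts are movie reviews, where the word appears in almost every document.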

In [None]:
# Exercise 3.3.1
# get NLTK English stop words
# You can modify this list by adding or removing stop words

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words+=["film", "films"]
print (stop_words)

# filter stop words out of the dictionary
# by creating a new dictionary

filtered_dictionary={word: dictionary[word] \
                     for word in dictionary \
                     if word not in stop_words}
print("\nsort dictionary without stop words by frequency")
print(sorted(filtered_dictionary.items(), key=lambda item:-item[1]))

print(len(filtered_dictionary))

In [None]:
# Exercise 3.3.2
# Find positive words 

with open("positive-words.txt",'r') as f:
    positive_words=[line.strip() for line in f]

#print(positive_words)
positive_tokens=[token for token in tokens \
                 if token in positive_words]

print(positive_tokens)

#### Can you use positive/negative words to determine the sentiment?

- Naive sentiment analysis:
  - Find positive/negative words
  - If there are more positive words than negative words, label the text positive
  - Otherwise, label it negative
- Note the sentence: 
  -  "the problem is that the writers, james cameron and jay cocks , were **<font color="red">too ambitious</font>**, aiming for a film with social relevance, thrills, and drama. **<font color="red">not that ambitious</font>** film-making should be discouraged; just that when it fails to achieve its goals, it fails badly and obviously. the film just ends up preachy, unexciting and uninvolving."
- How to deal with negation?
- Some useful rules:
    - Negative sentiment: 
      - negative words not preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - positive words preceded by a negation within $n$ (e.g. three) words in the same sentence.
    - Positive sentiment (in a similar fashion):
      - positive words not preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - negative words preceded by a negation within $n$ (e.g. three) words in the same sentence.
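
The rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not a complete sentiment analyzer: the function name `classify_tokens`, the tiny word lists, and the sample sentences are all hypothetical, and the function expects the tokens of one sentence at a time so that the window never crosses a sentence boundary.

```python
def classify_tokens(tokens, positive_words, negative_words, negations, n=3):
    """Split the sentiment words of ONE sentence into positive and negative hits."""
    positive_hits, negative_hits = [], []
    for idx, token in enumerate(tokens):
        if token not in positive_words and token not in negative_words:
            continue
        # the n tokens before the sentiment word, clipped at the sentence start
        window = tokens[max(0, idx - n):idx]
        negated = any(word in negations for word in window)
        # a negation flips the polarity of the sentiment word
        is_positive = (token in positive_words) != negated
        (positive_hits if is_positive else negative_hits).append(token)
    return positive_hits, negative_hits

# hypothetical word lists and sentences
positive_words = {"ambitious", "good"}
negative_words = {"preachy", "badly"}
negations = {"not", "too", "n't", "no", "cannot", "neither", "nor"}

sentence_1 = ["the", "writers", "were", "too", "ambitious"]
sentence_2 = ["the", "film", "is", "not", "preachy"]
sentence_3 = ["good", "performances"]

for sentence in (sentence_1, sentence_2, sentence_3):
    print(classify_tokens(sentence, positive_words, negative_words, negations))
```

Comparing the total number of positive and negative hits over all sentences then gives the naive sentiment label described earlier.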


In [None]:
# Exercise 3.3.3
# check if a positive word is immediately preceded by a negation word
# e.g. not, too, n't, no, cannot

negations=['not', 'too', 'n\'t', 'no', 'cannot', 'neither','nor']
tokens = nltk.word_tokenize(text)  

print(tokens)

positive_tokens=[]
for idx, token in enumerate(tokens):
    # keep a positive word unless the token right before it is a negation
    if token in positive_words and (idx == 0 or tokens[idx-1] not in negations):
        positive_tokens.append(token)


print(positive_tokens)

# what if the positive word is preceded by a negation within 3 words in a sentence?