
### Objectives:

1) Get acquainted with the basics of NLP

2) Compare documents or words

3) Diagnose word similarity using Word2vec

4) Conduct Sentiment Analysis

### Objective 1: Get acquainted with the basics of NLP

**Introduction - Natural Language Processing**

Natural Language Processing (NLP) is experiencing significant growth; it is being used everywhere from search engines to voice interfaces such as Amazon Alexa or Siri and is increasingly playing a central role in our daily lives. 

In this session on NLP, we will cover a number of key concepts, including a) Tokenization, b) Analyzing Sentence Structures, c) Information Extraction and Text Classification, d) Sentiment Analysis, and e) Building a Custom Corpus.

As we step through the different topics, we will make use of the Natural Language Toolkit (NLTK) and spaCy to get hands-on with text processing. **NLTK** and **spaCy** (a more recent introduction) are comprehensive Python libraries for natural language processing and text mining/analytics.

Some terms that you need to be aware of:

Corpus (singular): a body of text. The plural of the term is corpora. An example is a collection of architectural journals.

Lexicon: a collection of words and their meanings. Different fields have different lexicons: there is a lexicon for doctors, one for financial professionals, one for mechanics, etc.



### Tokenize a body of text

A document is composed of sentences, and sentences are made up of words. The first thing you want to do when you receive a document is break it into pieces such as sentences, words, and punctuation marks. The process of decomposing a document into its constituent parts is called tokenization.

In text processing, we start off by making sense of the smallest or most granular units (i.e. words) and then progressively (as needed) work our way up the hierarchy to decipher the sentence, paragraph, document and the corpus.

The following tokenizers are often used:

1) Sentence Tokenizer - breaks the body of text into sentences

2) Line Tokenizer - breaks the body of text into lines

3) Space Tokenizer - splits the body of text based on the space character

4) Tweet Tokenizer - tweets contain special characters, special words, hashtags, and emojis, so NLTK provides a dedicated tokenizer, TweetTokenizer, for text that contains such special strings

5) Word Tokenizer - operates similarly to the Space Tokenizer but also separates punctuation marks from words
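
To make the differences concrete, the sketch below imitates three of these strategies using only the Python standard library. It is an approximation, not NLTK's implementation: the function names are made up for illustration, and the regular expression in rough_word_tokenize is a crude stand-in that splits off every punctuation mark, so it will break up abbreviations such as "Mr." that a more careful tokenizer can keep whole.

```python
import re

def line_tokenize(text):
    """Split text into non-empty lines (roughly what a line tokenizer does)."""
    return [line for line in text.splitlines() if line.strip()]

def space_tokenize(text):
    """Split on the space character only; newlines stay inside tokens."""
    return text.split(" ")

def rough_word_tokenize(text):
    """Separate runs of word characters from individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Mr. Jones loves pizzas.\nHe has a Ph.D. in Pieology."

print(line_tokenize(text))
print(space_tokenize(text))
print(rough_word_tokenize(text))
```

Note that the space-based split leaves "pizzas.\nHe" as a single token, because a newline is not a space, and that punctuation stays glued to the words. Both are reasons a word tokenizer is usually the more useful starting point.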

In [1]:
# Installing NLTK
# Reference: http://www.nltk.org/install.html
!pip install -U nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/50/09/3b1755d528ad9156ee7243d52aa5cd2b809ef053a0f31b53d92853dd653a/nltk-3.3.0.zip (1.4MB)
    100% |████████████████████████████████| 1.4MB 6.4MB/s 
Requirement not upgraded as not directly required: six in /usr/local/lib/python3.6/dist-packages (from nltk) (1.11.0)
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /content/.cache/pip/wheels/d1/ab/40/3bceea46922767e42986aef7606a600538ca80de6062dc266c
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.3


In [2]:
# Import the NLTK package
import nltk

# Get all the data associated with NLTK – could take a while to download all the data
nltk.download('all')

# Import the relevant packages
from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import sent_tokenize, word_tokenize

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /content/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /content/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to /content/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to /content/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package cess_esp to /content/nltk_data...
[nltk_data]    |   Package cess_esp is already up-to

In [0]:
# Raw Text
text = "Mr. Jones loves pizzas.\nHe has a Ph.D. in Pieology."

# Initializing Line Tokenizer
lineTokenizer = LineTokenizer()

# Output from LineTokenizer - breaks body of text into lines
print("Line Tokenizer output :", lineTokenizer.tokenize(text))

# Initializing Space Tokenizer
spaceTokenizer = SpaceTokenizer()

# Output from SpaceTokenizer - splits the body of text on the space character
print("Space Tokenizer output :", spaceTokenizer.tokenize(text))

# Output from Sentence Tokenizer - breaks the body of text into sentences
tokenize_sentences = nltk.sent_tokenize(text)
for sent in tokenize_sentences:
    print("Sentence:")
    print(sent)
    print()

# Output from Word Tokenizer - breaks up the words and punctuation marks
print("Word Tokenizer output :", word_tokenize(text))

# Output from Tweet Tokenizer - use this tokenizer when dealing with tweets
# as they contain special strings (e.g. #)
tweetTokenizer = TweetTokenizer()
print("Tweet Tokenizer output :",tweetTokenizer.tokenize(
    "This is a awesome #gesture: :-) :-D"))

### Filter out Stop Words

Stop words are words that do not add much to the semantic content of text; they are effectively filler words. Words such as ‘when’, ‘nor’, ‘or’, ‘and’, and ‘the’ are considered stop words, and it is usually best to filter them out while processing text. The Natural Language Toolkit comes with a stop word list, which we will use to remove them.
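
A quick way to see why stop words are treated as filler is to count word frequencies before and after removing them. The sketch below uses only the standard library. STOP_WORDS is a tiny hand-picked set assumed for illustration (NLTK's English list is far longer), and content_words is a hypothetical helper name.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption, not NLTK's list).
STOP_WORDS = {"the", "and", "to", "if", "was", "in", "up"}

def content_words(text):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

sentence = ("The dog ran up the steps and entered the room "
            "to check if the owner was in the room")

# The most frequent word in the raw sentence is a stop word.
print(Counter(sentence.lower().split()).most_common(1))

# After filtering, only the words that carry meaning remain.
print(content_words(sentence))
```

In this sentence "the" alone accounts for five of the nineteen words, so removing stop words shortens the token list considerably without losing the gist.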

In [5]:
# Print the English stop words
en_stopwords = set(nltk.corpus.stopwords.words('english'))
print(en_stopwords)

# Tokenize the sentence
from nltk.tokenize import word_tokenize
rawText = "The dog ran up the steps and entered the owner's room to check if the owner was in the room"
tokens = word_tokenize(rawText)

# Print only the words that are not deemed as stop words
not_stop_tokens = [token for token in tokens if token.lower() not in en_stopwords]
print(not_stop_tokens)

{'then', 'than', "you'd", 're', 'at', 'myself', "couldn't", 'yourselves', 'its', 'be', 'were', 'or', 'up', 'hers', 'haven', 'because', 've', 'theirs', 'with', 'in', 'few', 'their', 'have', 'are', 'the', 'shouldn', 'if', 'i', 'how', 'our', 'these', 'wouldn', 'same', 'each', "aren't", 'why', 'them', 'his', 'over', 'was', 'don', 'but', 'below', "should've", 'too', 'll', 'me', "weren't", 'mightn', "you'll", "wouldn't", 'is', 'against', 'wasn', 'yourself', 'doing', 'about', 'do', "shan't", "doesn't", 'an', 'through', 'mustn', "mustn't", 'hasn', 'for', 'you', 'this', 'once', 'above', 'off', 'has', "you're", 'until', 'that', 'all', 'so', "hasn't", 'm', 'own', 'as', 'again', 'him', "haven't", 'ours', 'during', 'aren', 'will', 'who', 'whom', 'd', 'weren', 'couldn', "needn't", 'he', 'didn', 'hadn', 'very', 'when', "you've", 'out', 'she', 'being', 'a', 'by', 'more', 'does', 'now', 'on', 's', 'other', 'into', 'himself', "she's", 'ourselves', 'only', 'did', 'further', 't', 'we', 'to', "that'll", 't

### Utilize stemming to derive the base form of a word

Stemming is a technique for reducing a word to its base form by stripping suffixes. Many variations of a word carry the same meaning, for example drive and driving. A stemming algorithm removes the suffix and returns the stem of the word, which in this simple example is drive. Search engines use stemming when indexing words: storing every form of a word would be redundant and inefficient, so a search engine can store only the stems, which greatly shrinks the index and lets a query match documents that use a different form of the same word.

In this tutorial, we will cover two stemming algorithms:

1)	Porter Stemming Algorithm

2)	Lancaster Stemming Algorithm

The Lancaster Stemmer is greedier than the Porter Stemmer: it tries to strip as many characters as possible from the end of a word. The Porter Stemmer, however, is usually the default choice for stemming.

One drawback of stemming is that it discards information: the stem is often not a real word, and distinct words can collapse to the same stem, which makes it harder to recover the meaning of the text.
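
The indexing benefit and the information loss can both be seen in a small sketch. The code below is not the Porter or Lancaster algorithm: toy_stem is a deliberately naive suffix stripper, and toy_stem, build_index, and the sample documents are all made up for illustration.

```python
from collections import defaultdict

def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index

docs = {
    1: "posting daily updates",
    2: "she posted twice",
    3: "caring for old cars",
}
index = build_index(docs)

# One index entry now serves several word forms.
print(sorted(index["post"]))

# The cost: unrelated words can collapse to the same stem.
print(toy_stem("caring"), toy_stem("cars"))
```

A query for "post" finds both documents 1 and 2 through a single entry, while "caring" and "cars" end up sharing the stem "car". Real stemmers add many more rules to limit such collisions, but they cannot avoid them entirely.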

In [6]:
# Import the Stemmers and Word Tokenizer
from nltk import PorterStemmer, LancasterStemmer, word_tokenize

rawText = "My name is Thomson Comer, commander-in-chief of the Machine Learning program at Lambda school. I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program beginning in April 2018."

tokens = word_tokenize(rawText)

pStemmer = PorterStemmer()
porter_Stems = [pStemmer.stem(t) for t in tokens]
print(porter_Stems)

# You will notice that suffixes such as "ing" have been removed.
# In some cases, a trailing 'e' has been removed as well.

lStemmer = LancasterStemmer()
lancaster_Stems = [lStemmer.stem(t) for t in tokens]
print(lancaster_Stems)

# You will notice that suffixes such as "ing" have been removed.
# However, in the case of the Lancaster Stemmer, trailing 'e', 'um', 'er', and 'a' have been removed as well.

['My', 'name', 'is', 'thomson', 'comer', ',', 'commander-in-chief', 'of', 'the', 'machin', 'learn', 'program', 'at', 'lambda', 'school', '.', 'I', 'am', 'creat', 'the', 'curriculum', 'for', 'the', 'machin', 'learn', 'program', 'and', 'will', 'be', 'teach', 'the', 'full-tim', 'machin', 'learn', 'program', 'begin', 'in', 'april', '2018', '.']
['my', 'nam', 'is', 'thomson', 'com', ',', 'commander-in-chief', 'of', 'the', 'machin', 'learn', 'program', 'at', 'lambd', 'school', '.', 'i', 'am', 'cre', 'the', 'curricul', 'for', 'the', 'machin', 'learn', 'program', 'and', 'wil', 'be', 'teach', 'the', 'full-time', 'machin', 'learn', 'program', 'begin', 'in', 'april', '2018', '.']


### Understand how Lemmatization differs from Stemming

A lemma is the base form of a word. Unlike a stem, which is obtained by stripping suffixes, a lemma is a dictionary-matched base form: the lemmatizer removes a suffix only if it can find the resulting word in a dictionary. Although lemmatization is better than stemming at uncovering the base form of a word, it is not perfect.

In the lemmatization example in this tutorial, we will use WordNet, a lexical database for the English language created by Princeton University and distributed as one of the NLTK corpora.

As you step through the next snippet of code, you will see that the lemmatizer always returns real words, whereas the stemmer produces truncated forms such as "machin" and "creat". Note, however, that the lemmatize method treats every token as a noun unless a part of speech is supplied, which is why verb forms such as "creating" and "driving" come back unchanged in the output.
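
The dictionary-lookup idea, and the role the part of speech plays in it, can be sketched without WordNet. Everything in the block below is hypothetical: LEMMAS is a hand-written mini-dictionary and toy_lemmatize is an illustrative helper. The one detail borrowed from NLTK is the default: WordNetLemmatizer.lemmatize also takes an optional pos argument that defaults to noun.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
# A real lemmatizer consults a full lexical database such as WordNet.
LEMMAS = {
    ("driving", "v"): "drive",
    ("creating", "v"): "create",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("programs", "n"): "program",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary lemma, or the word unchanged if it is not listed."""
    return LEMMAS.get((word.lower(), pos), word)

# Looked up as a noun (the default), "driving" is not found and is returned as-is.
print(toy_lemmatize("driving"))

# Looked up as a verb, the dictionary entry is found.
print(toy_lemmatize("driving", "v"))

# The same surface form can have different lemmas depending on its role.
print(toy_lemmatize("meeting", "n"), toy_lemmatize("meeting", "v"))
```

Because unknown words are returned untouched, a lookup-based approach never invents a non-word the way a stemmer can, but it is only as good as its dictionary and the part-of-speech information it is given.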

In [7]:
# Import WordNet
from nltk.corpus import wordnet

# Import packages
from nltk import word_tokenize, PorterStemmer, WordNetLemmatizer

# Raw Text
rawText = "My name is Thomson Comer, commander-in-chief of the Machine Learning program at Lambda school. I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program beginning in April 2018. We are driving full steam forward to have enough content ready for the launch."

# Break the body of text into tokens
tokens = word_tokenize(rawText)

# Apply the Porter Stemmer
pStemmer = PorterStemmer()
porter_Stems = [pStemmer.stem(t) for t in tokens]
print(porter_Stems)

# Apply the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()
word_lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(word_lemmas)

# The WordNet Lemmatizer is essentially doing a match against the WordNet lexical database

['My', 'name', 'is', 'thomson', 'comer', ',', 'commander-in-chief', 'of', 'the', 'machin', 'learn', 'program', 'at', 'lambda', 'school', '.', 'I', 'am', 'creat', 'the', 'curriculum', 'for', 'the', 'machin', 'learn', 'program', 'and', 'will', 'be', 'teach', 'the', 'full-tim', 'machin', 'learn', 'program', 'begin', 'in', 'april', '2018', '.', 'We', 'are', 'drive', 'full', 'steam', 'forward', 'to', 'have', 'enough', 'content', 'readi', 'for', 'the', 'launch', '.']
['My', 'name', 'is', 'Thomson', 'Comer', ',', 'commander-in-chief', 'of', 'the', 'Machine', 'Learning', 'program', 'at', 'Lambda', 'school', '.', 'I', 'am', 'creating', 'the', 'curriculum', 'for', 'the', 'Machine', 'Learning', 'program', 'and', 'will', 'be', 'teaching', 'the', 'full-time', 'Machine', 'Learning', 'program', 'beginning', 'in', 'April', '2018', '.', 'We', 'are', 'driving', 'full', 'steam', 'forward', 'to', 'have', 'enough', 'content', 'ready', 'for', 'the', 'launch', '.']


Now, we will use a range of related techniques to uncover the structure of text. The first technique that we will examine is Part-Of-Speech (POS) tagging, which labels the words in a sentence as nouns, pronouns, adjectives, verbs, etc. Following that, we will explore the concept of "Dependency Parsing", which parses a body of text to uncover chunks of meaning and the relations between them. Then, we will look at how to find the "head" of a sentence, which denotes the most important word in the sentence. Finally, we will review the topic of "Named Entity Recognition", which helps in finding named entities within sentences such as places, people, things, monetary figures, etc.

**Use Part-Of-Speech (POS) Tagging to label words in a sentence**

Part-Of-Speech (POS) tagging labels each word in a sentence with its grammatical category: it tells you which words are verbs, nouns, pronouns, adjectives, and so on.

For your reference, the link below provides definitions for a majority of the tags in the POS tag list:

https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk


In [0]:
compoundSentence = "Sacramento is the Capital of California and Washington DC is the Capital of the United States"

# Tokenize the sentence
tokensInSentence = nltk.word_tokenize(compoundSentence)

# Retrieve the parts of speech
posTags = nltk.pos_tag(tokensInSentence)
print(posTags)

# A list of tuples is output
# Each tuple contains the token (i.e. the word) and its POS identifier

# The POS identifiers for this example are outlined below
# NNP  -- proper noun, singular
# VBZ  -- verb, 3rd person singular present
# DT   -- determiner
# NN   -- noun, singular
# IN   -- preposition/subordinating conjunction
# CC   -- coordinating conjunction
# NNPS -- proper noun, plural
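
Once the tags are available as a list of (token, tag) tuples, a common next step is to group the tokens by tag. The sketch below shows one way to do that with the standard library. Note that sample_tags is a hand-written list in the same shape as the tagger's return value, not captured tagger output, and group_by_tag is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written sample in the (token, tag) shape;
# it is illustrative, not captured tagger output.
sample_tags = [
    ("Sacramento", "NNP"), ("is", "VBZ"), ("the", "DT"),
    ("capital", "NN"), ("of", "IN"), ("California", "NNP"),
]

def group_by_tag(tagged):
    """Collect tokens under their POS tag, preserving their order."""
    groups = defaultdict(list)
    for token, tag in tagged:
        groups[tag].append(token)
    return dict(groups)

groups = group_by_tag(sample_tags)

# All proper nouns in the sample, in order of appearance.
print(groups["NNP"])
```

Pulling out the proper nouns this way is a rough first approximation of finding the names in a sentence.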

### Lecture is continued in new tab....