# NLTK Tutorial

* NLTK (Natural Language Toolkit) is a powerful Python package that provides a diverse set of natural language processing algorithms. <br> 
* It is free, open source, easy to use, well documented, and backed by a large community. <br> 
* With NLTK, we can do tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. <br> 
* NLTK helps the computer analyze, preprocess, and understand written text.

We need to install NLTK before using it. It can be installed with the following command: <br>
* pip install nltk

Next we need to import it

In [1]:
import nltk

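We then download the NLTK data packages (corpora and models) used for text processing. Called without arguments, nltk.download() opens an interactive downloader.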

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## Tokenization 

* Tokenization is the first step in text analytics. <br>
* It is the process of breaking down a paragraph of text into smaller chunks such as words or sentences. <br>
* A token is a single entity that serves as a building block for a sentence or paragraph.

### Sentence Tokenization
The sentence tokenizer breaks a paragraph of text into sentences.

As an example, I have taken the following text from a student review of the Bachelor of Speech and Hearing Science course at Macquarie University.

In [3]:
paragraph = """ Overall, I enjoyed my course and time at Macquarie Uni. The course was extremely relevant and insightful into what career I wanted to pursue, but heavily lacked the practical element."""

In [4]:
print(paragraph)

 Overall, I enjoyed my course and time at Macquarie Uni. The course was extremely relevant and insightful into what career I wanted to pursue, but heavily lacked the practical element.


Import the sent_tokenize function

In [5]:
from nltk.tokenize import sent_tokenize

In [6]:
tokenized_sentences = sent_tokenize(paragraph)

In [7]:
print(tokenized_sentences)

[' Overall, I enjoyed my course and time at Macquarie Uni.', 'The course was extremely relevant and insightful into what career I wanted to pursue, but heavily lacked the practical element.']
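
To make the idea of sentence splitting concrete without NLTK, here is a minimal regex-based sketch. It is a simplified stand-in and not how sent_tokenize works internally: the function name naive_sent_split is hypothetical, and its single rule (split after ".", "!" or "?" when followed by whitespace and a capital letter) will break on abbreviations such as "Dr. Smith".

```python
import re

def naive_sent_split(text):
    """Split text after ., ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]

paragraph = ("Overall, I enjoyed my course and time at Macquarie Uni. "
             "The course was extremely relevant and insightful into what career "
             "I wanted to pursue, but heavily lacked the practical element.")

sentences = naive_sent_split(paragraph)
for s in sentences:
    print(s)
```

For real text, a trained tokenizer such as sent_tokenize is the safer choice, since it is designed to handle abbreviations and similar edge cases.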


### Word Tokenization

The word tokenizer breaks a paragraph of text into words and punctuation marks.

Import the word_tokenize function

In [8]:
from nltk.tokenize import word_tokenize

In [9]:
tokenized_words = word_tokenize(paragraph)

In [10]:
print(tokenized_words)

['Overall', ',', 'I', 'enjoyed', 'my', 'course', 'and', 'time', 'at', 'Macquarie', 'Uni', '.', 'The', 'course', 'was', 'extremely', 'relevant', 'and', 'insightful', 'into', 'what', 'career', 'I', 'wanted', 'to', 'pursue', ',', 'but', 'heavily', 'lacked', 'the', 'practical', 'element', '.']
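
As a rough illustration of what word tokenization does, the sketch below uses a single regular expression that treats runs of word characters as tokens and each punctuation mark as its own token. The helper name naive_word_tokenize is hypothetical, and this is only an approximation: word_tokenize applies more rules, for example splitting contractions such as "don't" into "do" and "n't".

```python
import re

def naive_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_word_tokenize("Overall, I enjoyed my course and time at Macquarie Uni.")
print(tokens)
```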


## Stop Words

* Stop words are words that occur very frequently in a corpus. <br>
* They are usually treated as noise because they add little value to text analysis. <br>
* Examples of stop words: is, am, are, this, a, an, the, etc.

To remove stop words with NLTK, we load its list of English stop words and filter our tokens against it.

In [11]:
from nltk.corpus import stopwords

In [12]:
stop_words = set(stopwords.words('english'))

In [13]:
print(stop_words)

{'which', 'while', "you'd", 'during', 'himself', 'too', 'myself', 'll', 'd', "aren't", "haven't", 'him', 'don', 'when', "you've", 'over', 'she', 'being', 'or', 'them', 'couldn', 'his', "mustn't", 'what', 'where', 'wouldn', 'once', 'we', 'up', 'hadn', 'same', 'themselves', 'whom', 'more', 'been', "wasn't", 'these', 'that', 'y', 'off', 'did', 'aren', 'about', 'other', 'if', 'shan', "isn't", "won't", "you'll", "doesn't", 'through', 'each', 'above', 'of', 'between', 'our', 'very', "weren't", 'be', "don't", 'most', 'me', 've', 'mustn', 'under', 'should', 'o', 'down', 'now', 'own', 'my', 'the', "hasn't", 'to', 're', 'on', 'how', 'few', 'further', "hadn't", 'itself', 'yourself', 'herself', 'her', 'doing', 'its', 'can', "she's", 'ours', 'theirs', 'why', 'nor', 'who', 'their', 'am', 'mightn', 'you', 'until', 't', 'i', 'shouldn', 'ma', 'won', 'below', "shan't", 'hasn', 'were', "shouldn't", "wouldn't", 'have', 'some', 'a', 'but', 'any', 'will', 'as', 'had', 'this', 'having', 'all', 'do', 'yoursel

Filter the tokens by removing the stop words

In [14]:
filtered_words = []
for word in tokenized_words :
    if word not in stop_words :
        filtered_words.append(word)

In [15]:
print(filtered_words)

['Overall', ',', 'I', 'enjoyed', 'course', 'time', 'Macquarie', 'Uni', '.', 'The', 'course', 'extremely', 'relevant', 'insightful', 'career', 'I', 'wanted', 'pursue', ',', 'heavily', 'lacked', 'practical', 'element', '.']
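
Note that 'I' and 'The' survive in the output above: the English stop word list is lowercase, and the membership test is case-sensitive. One possible remedy, sketched below, is to lowercase each token only for the comparison. The stop word set here is a small hand-picked subset used purely for illustration, and the token list is a shortened copy of the one above.

```python
# A small hand-picked subset used only for illustration;
# in the tutorial, stop_words comes from stopwords.words('english').
stop_words = {"i", "my", "and", "at", "the", "was", "into", "what", "to", "but"}

tokenized_words = ["Overall", ",", "I", "enjoyed", "my", "course", "and", "time",
                   "at", "Macquarie", "Uni", ".", "The", "course", "was"]

# Lowercase each token only for the membership test, keeping the original casing.
filtered_words = [w for w in tokenized_words if w.lower() not in stop_words]
print(filtered_words)
```

Whether to also drop punctuation tokens at this stage depends on the downstream task; this sketch leaves them in.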


## Stemming

* Stemming is a process of linguistic normalization that reduces words to their root form by chopping off affixes. <br>
* It strips inflections from words. Words with the same origin are reduced to a common form, which may or may not be a real word.

Some of the stemmers available in NLTK are PorterStemmer, LancasterStemmer and SnowballStemmer.

We will use PorterStemmer for our use case and import it as below

In [16]:
from nltk.stem import PorterStemmer

In [17]:
stemmer = PorterStemmer()

In [18]:
stemmed_words = []
for word in filtered_words :
    stemmed_words.append(stemmer.stem(word))

In [19]:
print(stemmed_words)

['overal', ',', 'I', 'enjoy', 'cours', 'time', 'macquari', 'uni', '.', 'the', 'cours', 'extrem', 'relev', 'insight', 'career', 'I', 'want', 'pursu', ',', 'heavili', 'lack', 'practic', 'element', '.']
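
To show why a stem "may or may not be a word", here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm and will not reproduce the output above; the suffix list and the function name naive_stem are assumptions made for illustration only.

```python
SUFFIXES = ("ing", "ed", "ly", "s")  # illustrative list, checked in this order

def naive_stem(word):
    """Lowercase the word and strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["enjoyed", "wanted", "heavily", "courses", "pursue"]}
for word, stem in stems.items():
    print(word, "->", stem)
```

Stripping "ly" from "heavily" leaves "heavi", which is not an English word. Rule-based stemmers accept this trade-off in exchange for speed and simplicity.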


## Lemmatization

* Lemmatization reduces words to their base form (lemma), which is a linguistically valid word. <br>
* It finds the lemma with the use of a vocabulary and morphological analysis. <br> 
* Lemmatization is usually more sophisticated than stemming. <br>
* A stemmer works on an individual word without knowledge of the context. For example, the word "better" has "good" as its lemma. Stemming misses this because it requires a dictionary look-up.

Import the WordNetLemmatizer class

In [20]:
from nltk.stem import WordNetLemmatizer

In [21]:
lemmatizer = WordNetLemmatizer()

In [22]:
lemmatized_words = []
for word in filtered_words :
    lemmatized_words.append(lemmatizer.lemmatize(word))

In [23]:
print(lemmatized_words)

['Overall', ',', 'I', 'enjoyed', 'course', 'time', 'Macquarie', 'Uni', '.', 'The', 'course', 'extremely', 'relevant', 'insightful', 'career', 'I', 'wanted', 'pursue', ',', 'heavily', 'lacked', 'practical', 'element', '.']
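
The output above is identical to filtered_words because lemmatize() treats every word as a noun unless a pos argument is supplied (its default is pos='n'), so verb forms such as 'enjoyed' and 'wanted' are left unchanged. The toy sketch below imitates the dictionary look-up idea with a tiny hand-written table keyed by word and part of speech. The table LEMMAS and the function toy_lemmatize are hypothetical; a real lemmatizer consults a full lexical database such as WordNet.

```python
# Tiny hand-written lemma table keyed by (word, part of speech).
# A real lemmatizer consults a full lexical database such as WordNet.
LEMMAS = {
    ("better", "a"): "good",     # adjective
    ("enjoyed", "v"): "enjoy",   # verb
    ("wanted", "v"): "want",     # verb
    ("courses", "n"): "course",  # noun
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged when no entry exists."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("enjoyed"))           # treated as a noun: no entry, unchanged
print(toy_lemmatize("enjoyed", pos="v"))
```

With WordNetLemmatizer itself, the analogous step is to pass the part of speech explicitly, for example lemmatizer.lemmatize('better', pos='a').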


## Part-of-Speech (POS) Tagging

* The primary goal of Part-of-Speech (POS) tagging is to identify the grammatical group, or part of speech, of a given word: whether it is a NOUN, PRONOUN, ADJECTIVE, VERB, ADVERB, etc., based on the context. <br>
* POS tagging looks at relationships within the sentence and assigns a corresponding tag to each word.

In [24]:
pos_tokens = nltk.pos_tag(tokenized_words)

In [25]:
print(pos_tokens)

[('Overall', 'JJ'), (',', ','), ('I', 'PRP'), ('enjoyed', 'VBP'), ('my', 'PRP$'), ('course', 'NN'), ('and', 'CC'), ('time', 'NN'), ('at', 'IN'), ('Macquarie', 'NNP'), ('Uni', 'NNP'), ('.', '.'), ('The', 'DT'), ('course', 'NN'), ('was', 'VBD'), ('extremely', 'RB'), ('relevant', 'JJ'), ('and', 'CC'), ('insightful', 'JJ'), ('into', 'IN'), ('what', 'WP'), ('career', 'NN'), ('I', 'PRP'), ('wanted', 'VBD'), ('to', 'TO'), ('pursue', 'VB'), (',', ','), ('but', 'CC'), ('heavily', 'RB'), ('lacked', 'VBD'), ('the', 'DT'), ('practical', 'JJ'), ('element', 'NN'), ('.', '.')]
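
The tags in the output above come from the Penn Treebank tag set, which can be hard to read at first. The sketch below maps the tags that occur in this example to short descriptions. The dictionary TAG_DESCRIPTIONS is written by hand for this tutorial and covers only these tags, and the token list is a short excerpt copied from the output.

```python
# Descriptions for the Penn Treebank tags that appear in the output above.
TAG_DESCRIPTIONS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "RB": "adverb",
    "TO": "to",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBP": "verb, non-3rd person singular present",
    "WP": "wh-pronoun",
}

# A few (token, tag) pairs copied from the output above.
pos_tokens = [("I", "PRP"), ("enjoyed", "VBP"), ("my", "PRP$"),
              ("course", "NN"), ("Macquarie", "NNP"), (".", ".")]

described = [(tok, tag, TAG_DESCRIPTIONS.get(tag, "punctuation or other"))
             for tok, tag in pos_tokens]
for tok, tag, desc in described:
    print(f"{tok:10} {tag:5} {desc}")
```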


## Bag-of-Words Model

Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers, specifically vectors of numbers. These vectors are derived from the textual data so that they reflect various linguistic properties of the text. This is called feature extraction or feature encoding.

* The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. <br> 
* In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. <br> 
* The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier. <br> 
* A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:<br> 1) A vocabulary of known words. <br> 2) A measure of the presence of known words.<br> 
* It is called a “bag” of words, because any information about the order or structure of words in the document is discarded.<br> 
* The model is only concerned with whether known words occur in the document, not where in the document. The intuition is that documents are similar if they have similar content, and that from the content alone we can learn something about the meaning of the document.


Steps <br> 
* We iterate through each sentence in the corpus (paragraph), convert it to lower case, replace punctuation with spaces, and collapse runs of whitespace into a single space.

In [26]:
corpus = list(tokenized_sentences)  # copy, so the original sentence list is not modified below

In [27]:
# Import Regular Expression module
import re

In [28]:
for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'\W',' ',corpus[i])
    corpus[i] = re.sub(r'\s+',' ',corpus[i])

In [29]:
print(corpus)

[' overall i enjoyed my course and time at macquarie uni ', 'the course was extremely relevant and insightful into what career i wanted to pursue but heavily lacked the practical element ']


* The next step is to tokenize the words in the sentences and create a dictionary that contains words and their corresponding frequencies in the corpus.

In [30]:
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq:
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

In [31]:
wordfreq

{'overall': 1,
 'i': 2,
 'enjoyed': 1,
 'my': 1,
 'course': 2,
 'and': 2,
 'time': 1,
 'at': 1,
 'macquarie': 1,
 'uni': 1,
 'the': 2,
 'was': 1,
 'extremely': 1,
 'relevant': 1,
 'insightful': 1,
 'into': 1,
 'what': 1,
 'career': 1,
 'wanted': 1,
 'to': 1,
 'pursue': 1,
 'but': 1,
 'heavily': 1,
 'lacked': 1,
 'practical': 1,
 'element': 1}
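
The same frequency dictionary can be built more compactly with collections.Counter from the standard library. This is a sketch under one assumption: because the corpus has already been lowercased and stripped of punctuation, splitting on whitespace is used here in place of nltk.word_tokenize.

```python
from collections import Counter

corpus = [
    " overall i enjoyed my course and time at macquarie uni ",
    "the course was extremely relevant and insightful into what career i wanted "
    "to pursue but heavily lacked the practical element ",
]

# The text is already lowercased and free of punctuation,
# so splitting on whitespace is enough for this sketch.
wordfreq = Counter(token for sentence in corpus for token in sentence.split())

print(wordfreq.most_common(4))
```

Counter is a dict subclass, so the later cells that iterate over wordfreq would work unchanged with it.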

* The final step is to convert the sentences in our corpus into their corresponding vector representation. <br> 
* For each word in the wordfreq dictionary, we append 1 if the word occurs in the sentence and 0 otherwise, so each vector records presence rather than counts.

In [32]:
sentence_vectors = []
for sentence in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)
    sent_vec = []
    for token in wordfreq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)

In [33]:
import numpy as np

In [34]:
sentence_vector_array = np.asarray(sentence_vectors)

In [35]:
sentence_vector_array

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1]])
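
The vectors above record only presence (0 or 1). Since the bag-of-words model was described earlier as keeping multiplicity, a count-based variant may also be useful: each position then holds how many times the word occurs in the sentence. The two-sentence corpus below is invented for this sketch.

```python
from collections import Counter

# Toy corpus invented for this sketch.
corpus = ["the course was great", "the course and the campus"]
tokenized = [sentence.split() for sentence in corpus]

# Vocabulary in first-seen order, mirroring the key order of wordfreq.
vocabulary = list(dict.fromkeys(token for tokens in tokenized for token in tokens))

count_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    count_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(count_vectors)
```

In the second vector, the entry for "the" is 2, which a binary representation would flatten to 1.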

## TF-IDF Model
### Term Frequency - Inverse Document Frequency

* One of the main problems associated with the bag of words model is that it assigns equal value to the words, irrespective of their importance. <br>
* The words that are rare have more classifying power compared to the words that are common. <br>
* The idea behind the TF-IDF approach is that the words that are more common in one sentence and less common in other sentences should be given high weights.<br>

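#### IDF: log((Total number of sentences (documents))/(Number of sentences (documents) containing the word))

Note that the code below adds 1 to the denominator, a smoothing choice that avoids division by zero for a word that occurs in no document.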

In [36]:
word_idf_values = {}
for token in wordfreq:
    doc_containing_word = 0
    for document in corpus:
        if token in nltk.word_tokenize(document):
            doc_containing_word += 1
    word_idf_values[token] = np.log(len(corpus)/(1 + doc_containing_word))

In [37]:
word_idf_values

{'overall': 0.0,
 'i': -0.40546510810816444,
 'enjoyed': 0.0,
 'my': 0.0,
 'course': -0.40546510810816444,
 'and': -0.40546510810816444,
 'time': 0.0,
 'at': 0.0,
 'macquarie': 0.0,
 'uni': 0.0,
 'the': 0.0,
 'was': 0.0,
 'extremely': 0.0,
 'relevant': 0.0,
 'insightful': 0.0,
 'into': 0.0,
 'what': 0.0,
 'career': 0.0,
 'wanted': 0.0,
 'to': 0.0,
 'pursue': 0.0,
 'but': 0.0,
 'heavily': 0.0,
 'lacked': 0.0,
 'practical': 0.0,
 'element': 0.0}
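
The zero and negative values above follow directly from the arithmetic. With only two sentences and 1 added to the denominator, a word found in one sentence gets log(2/2) = 0 and a word found in both gets log(2/3), which is negative. The sketch below reproduces both numbers and compares them with the formula as written in the heading; the two function names are chosen for this illustration.

```python
import math

n_documents = 2  # the corpus has two sentences

def idf_as_in_tutorial(doc_count):
    """IDF with 1 added to the denominator, as in the cell above."""
    return math.log(n_documents / (1 + doc_count))

def idf_plain(doc_count):
    """IDF exactly as written in the heading formula."""
    return math.log(n_documents / doc_count)

print(idf_as_in_tutorial(2))  # word in both sentences, e.g. 'course'
print(idf_as_in_tutorial(1))  # word in one sentence, e.g. 'overall'
print(idf_plain(2))           # word in both sentences
print(idf_plain(1))           # word in one sentence
```

On a larger corpus the effect of the extra 1 is much smaller, but on a two-sentence example it dominates the result.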

* The next step is to create the TF dictionary for each word. <br>
* In the TF dictionary, the keys are the words in wordfreq, while the values are 'N'-dimensional vectors, where 'N' is the number of sentences. <br> 
* Each entry in a vector is the TF value of the word for the corresponding sentence: the number of times the word occurs in the sentence divided by the number of tokens in that sentence.

In [38]:
word_tf_values = {}
for token in wordfreq:
    sent_tf_vector = []
    for document in corpus:
        doc_freq = 0
        for word in nltk.word_tokenize(document):
            if token == word:
                doc_freq += 1
        word_tf = doc_freq/len(nltk.word_tokenize(document))
        sent_tf_vector.append(word_tf)
    word_tf_values[token] = sent_tf_vector

In [39]:
word_tf_values

{'overall': [0.1, 0.0],
 'i': [0.1, 0.05],
 'enjoyed': [0.1, 0.0],
 'my': [0.1, 0.0],
 'course': [0.1, 0.05],
 'and': [0.1, 0.05],
 'time': [0.1, 0.0],
 'at': [0.1, 0.0],
 'macquarie': [0.1, 0.0],
 'uni': [0.1, 0.0],
 'the': [0.0, 0.1],
 'was': [0.0, 0.05],
 'extremely': [0.0, 0.05],
 'relevant': [0.0, 0.05],
 'insightful': [0.0, 0.05],
 'into': [0.0, 0.05],
 'what': [0.0, 0.05],
 'career': [0.0, 0.05],
 'wanted': [0.0, 0.05],
 'to': [0.0, 0.05],
 'pursue': [0.0, 0.05],
 'but': [0.0, 0.05],
 'heavily': [0.0, 0.05],
 'lacked': [0.0, 0.05],
 'practical': [0.0, 0.05],
 'element': [0.0, 0.05]}

Now we have IDF values of all the words, along with TF values of every word across the sentences. <br>
* The next step is to simply multiply IDF values with TF values.

In [40]:
tfidf_values = []
for token in word_tf_values.keys():
    tfidf_sentences = []
    for tf_sentence in word_tf_values[token]:
        tf_idf_score = tf_sentence * word_idf_values[token]
        tfidf_sentences.append(tf_idf_score)
    tfidf_values.append(tfidf_sentences)

In [41]:
tf_idf_model = np.asarray(tfidf_values)

In [42]:
tf_idf_model

array([[ 0.        ,  0.        ],
       [-0.04054651, -0.02027326],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [-0.04054651, -0.02027326],
       [-0.04054651, -0.02027326],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [ 0.        ,  0.        ]])

In the array above, each column is the TF-IDF vector for the corresponding sentence. <br>
We want rows (instead of columns) to represent the TF-IDF vectors, so we transpose our numpy array as follows

In [43]:
tf_idf_model_t = np.transpose(tf_idf_model)

In [44]:
tf_idf_model_t

array([[ 0.        , -0.04054651,  0.        ,  0.        , -0.04054651,
        -0.04054651,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        , -0.02027326,  0.        ,  0.        , -0.02027326,
        -0.02027326,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])
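
Most scores in this example are zero or negative, which makes the result hard to interpret. For comparison, the sketch below uses a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, the variant scikit-learn applies by default in its TF-IDF transformer. It is shown as one common alternative, not as the method used in this tutorial. It also builds one row per sentence directly, so no transpose is needed. The toy corpus is invented for illustration.

```python
import math
from collections import Counter

# Toy corpus invented for this sketch; lowercase and tokenizable by whitespace.
corpus = ["the course was great", "the campus was far"]
documents = [doc.split() for doc in corpus]
vocabulary = sorted({token for doc in documents for token in doc})
n_docs = len(documents)

# Document frequency: number of documents containing each word.
df = {word: sum(word in doc for doc in documents) for word in vocabulary}

# Smoothed IDF; always >= 1, so no weight can be negative.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

tfidf_rows = []
for doc in documents:
    counts = Counter(doc)
    # TF is the word count divided by the sentence length, as in the tutorial.
    tfidf_rows.append([counts[word] / len(doc) * idf[word] for word in vocabulary])

for word, score in zip(vocabulary, tfidf_rows[0]):
    print(f"{word:8} {score:.4f}")
```

Here "course", which occurs in only one sentence, receives a higher weight than "the", which occurs in both.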

Though TF-IDF is an improvement over the simple bag-of-words approach and often yields better results for common NLP tasks, it still requires a large sparse matrix and takes more computation than simple bag of words.

## Word2Vec

The Word2Vec approach uses a shallow neural network to convert words into vectors in such a way that semantically similar words end up close to each other in N-dimensional space, where N is the number of dimensions of the vectors.

* Word2Vec retains much of the semantic meaning of the words in a document, because each word is learned from the contexts it appears in. <br>
* Another advantage of the Word2Vec approach is that the embedding vectors are comparatively small. Each dimension can be thought of as capturing some aspect of the word. <br> 
* We do not need huge sparse vectors, unlike the bag-of-words and TF-IDF approaches.
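
Closeness between word vectors is usually measured with cosine similarity, which is also what gensim's most_similar ranks by. The sketch below computes it in plain Python. The three-dimensional vectors are made up by hand purely for illustration and are not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional embeddings, chosen by hand for illustration only.
embeddings = {
    "course": [0.9, 0.8, 0.1],
    "class":  [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_close = cosine_similarity(embeddings["course"], embeddings["class"])
sim_far = cosine_similarity(embeddings["course"], embeddings["banana"])
print(round(sim_close, 3), round(sim_far, 3))
```

A value near 1 means the vectors point in almost the same direction, while a value near 0 means they are nearly orthogonal.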

* We use the Gensim library to create the Word2Vec model. <br> 
* The tokenized text is passed to the Word2Vec class of the gensim.models package, which expects a list of tokenized sentences (a list of token lists). <br> 
* We also set the min_count parameter. With min_count=1, every word that appears at least once in the corpus is included in the model.<br>

In [45]:
import gensim

In [46]:
from gensim.models import Word2Vec

In [47]:
word2vec = Word2Vec(filtered_words, min_count=1)

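To see the dictionary of unique words that exist at least once in the corpus, execute the following script. The wv.vocab attribute belongs to gensim 3.x; in gensim 4.0 and later it was removed in favour of wv.key_to_index.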

In [48]:
vocabulary = word2vec.wv.vocab

In [49]:
sim_words = word2vec.wv.most_similar('I')

In [50]:
sim_words

[('O', 0.21362608671188354),
 ('j', 0.2060253620147705),
 ('m', 0.1966397911310196),
 ('w', 0.13187554478645325),
 ('M', 0.13053743541240692),
 ('t', 0.10804124176502228),
 ('y', 0.08303216099739075),
 ('u', 0.05196388438344002),
 ('s', 0.030994432047009468),
 ('a', 0.0288662351667881)]

The neighbours above are single characters because filtered_words is a flat list of strings: gensim treated each word as a sentence and each character as a token. Passing [filtered_words] instead trains the model on whole words, although with a corpus of only two sentences the resulting similarities would not be meaningful.