**Natural Language Processing (NLP)** is one of the fastest-growing areas of artificial intelligence, and a good command of it is essential for working with text-based datasets. I recently started learning it and, after some research, found that the concepts below need to be well understood before moving on to more advanced NLP work. Here we focus only on text preprocessing and feature extraction; later we will use them to solve some interesting problems.

**Steps -** 

1. <a href='#import-libs' target='_self'>Importing Libraries</a>
1. <a href='#preprocessing' target='_self'>Basics (Preprocessing)</a>
    1. <a href='#corpora' target='_self'>NLTK Corpora</a>
    1. <a href='#stopwords' target='_self'>Stopwords</a>
    1. <a href='#tokenization' target='_self'>Tokenization</a>
    1. <a href='#stem-lemma' target='_self'>Stemming & Lemmatization</a>
    1. <a href='#post' target='_self'>Part of Speech Tagging</a>
1. <a href='#feature-extraction' target='_self'>Feature Extraction (Vectorization)</a>
    1. <a href='#bow' target='_self'>Bag of Words</a>
    1. <a href='#tf-idf' target='_self'>TF-IDF</a>

## <a id='import-libs'>1. Importing Libraries</a>

In [1]:
!pip install numpy
!pip install pandas
!pip install nltk



In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir(".."))

['CIER', 'desktop.ini', 'NCCU STAT', 'TAROBO', 'Web概念與技術', '企業倫理與永續發展', '應用迴歸分析', '研究方法（一）', '統計學（二）', '高等數理統計']


There are many libraries we can use, such as NLTK, TextBlob, spaCy, and Pattern. Here we use NLTK because it is widely used and a good place to start; once we have a grasp of the fundamental operations, we can explore the other libraries too.

In [17]:
# download the NLTK data used below: 'wordnet' for lemmatization and 'averaged_perceptron_tagger' for POS tagging
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\zuoch\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\zuoch\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

## <a id='preprocessing'>2. Basics (Preprocessing)</a>

### <a id='corpora'>A. NLTK Corpora</a>
One of the best things about NLTK is that it provides many sample text datasets (corpora; a single dataset is called a corpus) that we can import directly. Here we import the product_reviews_1 dataset, but you can pick any of the datasets listed at http://www.nltk.org/nltk_data/

In [4]:
from nltk.corpus import product_reviews_1
nltk.download('product_reviews_1')

[nltk_data] Downloading package product_reviews_1 to
[nltk_data]     C:\Users\zuoch\AppData\Roaming\nltk_data...
[nltk_data]   Package product_reviews_1 is already up-to-date!


True

Each dataset stores its text in files, and to read a file we need to know its name.

In [5]:
product_reviews_1.fileids()

['Apex_AD2600_Progressive_scan_DVD player.txt',
 'Canon_G3.txt',
 'Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB.txt',
 'Nikon_coolpix_4300.txt',
 'Nokia_6610.txt',
 'README.txt']

Once we know the file name, we can read the file in whichever form we need, for example:

In [6]:
# Will read raw text from this file
product_review_raw = product_reviews_1.raw('Apex_AD2600_Progressive_scan_DVD player.txt')
product_review_raw[:750] 
# Slice to the first 750 characters; otherwise the output is very long

'*****************************************************************************\n* Annotated by: Minqing Hu and Bing Liu, 2004.\n*\t\tDepartment of Computer Sicence\n*               University of Illinois at Chicago              \n*\n* Product name: Apex AD2600 Progressive-scan DVD player\n* Review Source: amazon.com\n*\n* See Readme.txt to find the meaning of each symbol. \n*****************************************************************************\n\n[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . \n##repost from january 13 , 2004 with a better fit title . \n##does your apex dvd player only play dvd audio without video ? \n##or does it play audio and video but scrolling in black and white ? \n##before you try to return the player or was'

In [7]:
# Read the file as a list of sentences (each sentence is a list of words)
product_review_sents = product_reviews_1.sents('Apex_AD2600_Progressive_scan_DVD player.txt')
product_review_sents

[['repost', 'from', 'january', '13', ',', '2004', 'with', 'a', 'better', 'fit', 'title', '.'], ['does', 'your', 'apex', 'dvd', 'player', 'only', 'play', 'dvd', 'audio', 'without', 'video', '?'], ...]

In [8]:
# Read the file as a flat list of words
product_review_words = product_reviews_1.words('Apex_AD2600_Progressive_scan_DVD player.txt')
product_review_words

['repost', 'from', 'january', '13', ',', '2004', ...]

### <a id='stopwords'>B. Stopwords</a>

Stopwords are very common words (such as 'the', 'is', and 'and') that carry little meaning on their own and mostly serve to hold a sentence together. Because they rarely help distinguish one text from another, many NLP projects remove them.

In [14]:
from nltk.corpus import stopwords
stoplist = stopwords.words('english')
print(stoplist)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Let's compare the number of words in the product_reviews_1 file with and without stopwords.

In [10]:
print(f'word length with stopwords {len(product_review_words)}')
product_review_wo_stopwords = [word for word in product_review_words if word not in stoplist]
print(f'word length without stopwords {len(product_review_wo_stopwords)}')

word length with stopwords 12593
word length without stopwords 7190


Removing stopwords cut the word count from 12593 to 7190, so it is often worth eliminating them before performing any actual NLP operation.
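
One practical detail: `stoplist` is a Python list and contains only lowercase words. Here is a minimal sketch of the same filtering using a set (faster membership checks) and case folding. The tiny stopword set and the token list are made up for illustration; they are not NLTK's data.

```python
# Hypothetical mini stopword set, for illustration only (not NLTK's full list).
stop_set = {"the", "is", "a", "of", "and", "to"}

# Hypothetical tokens; note the capitalised "The".
tokens = ["The", "player", "is", "a", "piece", "of", "junk"]

# Lowercase each token before the lookup so that "The" is also removed.
filtered = [token for token in tokens if token.lower() not in stop_set]
print(filtered)
```

With the real data, `set(stopwords.words('english'))` could play the role of `stop_set`.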

### <a id='tokenization'>C. Tokenization</a>
A 'token' is a single unit of the text we are processing, such as a sentence or a word. We can split text into sentences and words as follows:

In [11]:
from nltk.tokenize import sent_tokenize, word_tokenize

print(f'Sentence Tokens - \n{sent_tokenize(product_review_raw[750:1250])}\n\n\n')
print(f'Word Tokens - \n{word_tokenize(product_review_raw[750:1250])}')

Sentence Tokens - 
['te hours calling apex tech support , or run the player over with your car , try these simple troubleshooting ideas first .', '##no picture : \n##hopefully you still have the remote control .', '##if you tossed it out the window , you need to fetch it .', '##using the remote control , press the i/p button located on the bottom right corner of the remote .', '##the i/p button switches the tv display between interlace and progressive .', '##if this doesnt bring back the picture , try pressing this button with']



Word Tokens - 
['te', 'hours', 'calling', 'apex', 'tech', 'support', ',', 'or', 'run', 'the', 'player', 'over', 'with', 'your', 'car', ',', 'try', 'these', 'simple', 'troubleshooting', 'ideas', 'first', '.', '#', '#', 'no', 'picture', ':', '#', '#', 'hopefully', 'you', 'still', 'have', 'the', 'remote', 'control', '.', '#', '#', 'if', 'you', 'tossed', 'it', 'out', 'the', 'window', ',', 'you', 'need', 'to', 'fetch', 'it', '.', '#', '#', 'using', 'the', 'remote
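
To get a feel for what a tokenizer does, here is a deliberately naive sketch built on regular expressions. It is not how NLTK's `sent_tokenize` or `word_tokenize` work internally (they handle abbreviations, contractions and many other cases); the sample text is adapted from the review above and the two patterns are my own simplifications.

```python
import re

text = ("Hopefully you still have the remote control. "
        "If you tossed it out the window, you need to fetch it.")

# Naive sentence split: break on whitespace that follows '.', '?' or '!'.
sentences = re.split(r"(?<=[.?!])\s+", text)

# Naive word split: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words)
```

A rule this simple would break on text such as "Dr. Smith", which is one reason to prefer a library tokenizer in practice.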

### <a id='stem-lemma'>D. Stemming and Lemmatization</a>
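Both are used for text normalization. Stemming reduces a word to a root form by stripping suffixes with heuristic rules, so the result is not always a real word; in the output below, 'entered' becomes 'enter' but 'hamburger' becomes 'hamburg' and 'patty' becomes 'patti'. Lemmatization, on the other hand, performs morphological analysis against a vocabulary and takes the part of speech into account to return the dictionary form (lemma) of a word. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why 'entered' and 'formed' come back unchanged below. This is best understood through an example: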

Let's consider the sentence below and apply both to it: <br/>
**A middle-aged woman entered the room, her hands full of hamburger meat as she formed a patty**

In [15]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
sample_sentence = 'A middle-aged woman entered the room, her hands full of hamburger meat as she formed a patty'
porter_stemmer = PorterStemmer()
word_lemmatizer = WordNetLemmatizer()

for w in word_tokenize(sample_sentence):
    print(f'Actual Word - {w}')
    print(f'Stem - {porter_stemmer.stem(w)}')
    print(f'Lemma - {word_lemmatizer.lemmatize(w)}\n')

Actual Word - A
Stem - a
Lemma - A

Actual Word - middle-aged
Stem - middle-ag
Lemma - middle-aged

Actual Word - woman
Stem - woman
Lemma - woman

Actual Word - entered
Stem - enter
Lemma - entered

Actual Word - the
Stem - the
Lemma - the

Actual Word - room
Stem - room
Lemma - room

Actual Word - ,
Stem - ,
Lemma - ,

Actual Word - her
Stem - her
Lemma - her

Actual Word - hands
Stem - hand
Lemma - hand

Actual Word - full
Stem - full
Lemma - full

Actual Word - of
Stem - of
Lemma - of

Actual Word - hamburger
Stem - hamburg
Lemma - hamburger

Actual Word - meat
Stem - meat
Lemma - meat

Actual Word - as
Stem - as
Lemma - a

Actual Word - she
Stem - she
Lemma - she

Actual Word - formed
Stem - form
Lemma - formed

Actual Word - a
Stem - a
Lemma - a

Actual Word - patty
Stem - patti
Lemma - patty
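
The difference between the two ideas can be sketched with a toy example. This is not the Porter algorithm and not WordNet: the suffix list and the lemma table are hypothetical and exist only to show that a stemmer applies blind rules, while a lemmatizer looks a word up together with its part of speech.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, tiny rule set


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("entered", "v"): "enter",
    ("formed", "v"): "form",
    ("hands", "n"): "hand",
}


def naive_lemma(word, pos="n"):
    """Look the word up for the given part of speech; fall back to the word."""
    return LEMMAS.get((word.lower(), pos), word)


for w in ["entered", "hands", "dancing"]:
    print(w, naive_stem(w), naive_lemma(w), naive_lemma(w, pos="v"))
```

Here `naive_lemma('entered')` returns the word unchanged because the default part of speech is noun, while `naive_lemma('entered', pos='v')` finds 'enter'. The same pattern explains the 'entered' and 'formed' rows in the output above.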



### <a id='post'>E. Part of Speech Tagging</a>
Also known as POS tagging or POST. Why is POS tagging required? 
The same word can appear in a sentence or paragraph in different grammatical roles, and it is not a good idea to treat the second occurrence as redundant. As a solution, we tag each word with its part of speech so that each usage is grammatically distinct. Consider the examples below, where the word **above** plays a different grammatical role each time. 

1. The heavens are **above**. (Adverb)

2. The moral code of conduct is **above** the civil code of conduct. (Preposition)

3. Our blessings come from **above**. (Noun)

In [18]:
sample_sentence_words = word_tokenize(sample_sentence)
nltk.pos_tag(sample_sentence_words)

[('A', 'DT'),
 ('middle-aged', 'JJ'),
 ('woman', 'NN'),
 ('entered', 'VBD'),
 ('the', 'DT'),
 ('room', 'NN'),
 (',', ','),
 ('her', 'PRP$'),
 ('hands', 'NNS'),
 ('full', 'JJ'),
 ('of', 'IN'),
 ('hamburger', 'NN'),
 ('meat', 'NN'),
 ('as', 'IN'),
 ('she', 'PRP'),
 ('formed', 'VBD'),
 ('a', 'DT'),
 ('patty', 'NN')]
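
Once the words are tagged, the tags can be used to pick out or group words. A small sketch, using the (word, tag) pairs copied from the output above; it relies on the Penn Treebank convention that noun tags start with `NN` and verb tags start with `VB`.

```python
from collections import defaultdict

# (word, tag) pairs copied from the pos_tag output above.
tagged = [
    ("A", "DT"), ("middle-aged", "JJ"), ("woman", "NN"), ("entered", "VBD"),
    ("the", "DT"), ("room", "NN"), (",", ","), ("her", "PRP$"),
    ("hands", "NNS"), ("full", "JJ"), ("of", "IN"), ("hamburger", "NN"),
    ("meat", "NN"), ("as", "IN"), ("she", "PRP"), ("formed", "VBD"),
    ("a", "DT"), ("patty", "NN"),
]

# Group words under their tag.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

# Select nouns and verbs by tag prefix.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(nouns)
print(verbs)
```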

## <a id='feature-extraction'>3. Feature Extraction</a>
We cannot use raw text directly to train our models. Most models expect numeric features, so we first need to convert the text into a numeric representation; only then can it be used to train a model.

There are two popular approaches to extracting features from text: 
1. Count the number of occurrences of each word in a document. 
2. Calculate the frequency of each word relative to all the words in a document.
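
The two approaches can be sketched with the standard library alone. The sample sentence and the plain whitespace split are simplifications chosen for illustration.

```python
from collections import Counter

document = "Must have a subject and a verb"
tokens = document.lower().split()

# Approach 1: raw occurrence count of each word.
counts = Counter(tokens)

# Approach 2: each count divided by the total number of words.
total = len(tokens)
frequencies = {word: count / total for word, count in counts.items()}

print(counts)
print(frequencies)
```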

A few of the most commonly used techniques for feature extraction are:<br/>
**1. Bag of Words**<br/>
**2. TF-IDF (Term Frequency - Inverse Document Frequency)**

### <a id='bow'>A. Bag of Words</a>
Bag of words is one of the simplest approaches to feature extraction: we simply count how often each unique word occurs and use those counts as features. Example: 

Suppose we have the sentences below (also referred to as documents):

> 1. Must have a subject and a verb.
> 2. Must express a complete thought.
> 3. Must only have one clause.

The feature extraction steps we need to perform are:

**1. Identify Unique words**
    Unique words from all documents are:
    **must, have, subject, and, verb, express, complete, thought, only, one, clause**
    (the single-letter word 'a' is left out, which matches the default behaviour of CountVectorizer used below: it ignores one-character tokens)
    
**2. Perform Vectorization**
    We need to find the frequency count of each unique word in a document, putting 0 if the word is absent. For example, the vector for the first document can be formed as: 

> must - 1 <br/>
> have - 1 <br/>
> subject - 1 <br/>
> and - 1 <br/>
> verb - 1 <br/>
> express - 0 <br/>
> complete - 0 <br/>
> thought - 0 <br/>
> only - 0 <br/>
> one - 0 <br/>
> clause - 0 <br/>

So, it will become
> 1. [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

In a similar way, document 2 and document 3 become:
> 2. [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
> 3. [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
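
The hand calculation above can be reproduced in a few lines of plain Python. This is a sketch rather than a replacement for a library vectorizer: the `tokenize` helper is my own, and its pattern (lowercase, two or more word characters) is chosen so that the single-letter word 'a' is skipped, as in the vectors above.

```python
import re

documents = [
    "Must have a subject and a verb",
    "Must express a complete thought",
    "Must only have one clause",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # which drops the single-letter word "a".
    return re.findall(r"\b\w\w+\b", text.lower())


# Build the vocabulary in order of first appearance.
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

# One count per vocabulary word, per document.
vectors = [[tokenize(doc).count(term) for term in vocabulary] for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```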

The scikit-learn library provides the CountVectorizer class to perform this operation.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize sample document
sample_documents = ['Must have a subject and a verb','Must express a complete thought','Must only have one clause']
# instantiate
vectorizer = CountVectorizer()
vectorizer.fit(sample_documents)
# summarize
print(f':: vector vocabulary - {vectorizer.vocabulary_}\n')
# encode document
vector = vectorizer.transform(sample_documents)
# summarize encoded vector
print(f':: vector shape - {vector.shape}\n')
print(f':: vector list - {vector.toarray()}')

:: vector vocabulary - {'must': 5, 'have': 4, 'subject': 8, 'and': 0, 'verb': 10, 'express': 3, 'complete': 2, 'thought': 9, 'only': 7, 'one': 6, 'clause': 1}

:: vector shape - (3, 11)

:: vector list - [[1 0 0 0 1 1 0 0 1 0 1]
 [0 0 1 1 0 1 0 0 0 1 0]
 [0 1 0 0 1 1 1 1 0 0 0]]


If you cross-check this against the vectors we calculated by hand, you will find that they hold the same counts; only the positions differ, because CountVectorizer assigns column indices to the vocabulary in alphabetical order. 
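
To check the ordering claim, we can take the hand-built vector for document 1 and rearrange it alphabetically. The variable names here are my own; the data comes from the example above.

```python
# Vocabulary and document-1 vector in the order used in the hand calculation.
hand_vocabulary = ["must", "have", "subject", "and", "verb", "express",
                   "complete", "thought", "only", "one", "clause"]
hand_vector = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

counts = dict(zip(hand_vocabulary, hand_vector))

# Sorting the vocabulary gives each word its alphabetical column index.
sorted_vocabulary = sorted(hand_vocabulary)
column_index = {term: i for i, term in enumerate(sorted_vocabulary)}

# Rearrange the counts into alphabetical column order.
reordered = [counts[term] for term in sorted_vocabulary]

print(column_index)
print(reordered)
```

The reordered list matches the first row printed by CountVectorizer, and `column_index` matches the indices in `vocabulary_`.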

This approach is very basic and has a limitation: it weights words purely by how often they occur, so the most common and least informative words such as 'the', 'is', and 'and' tend to get the highest importance. The TF-IDF method addresses this limitation. 

### <a id='tf-idf'>B. Term Frequency – Inverse Document Frequency (TF – IDF)</a>

It is one of the most popular methods of feature extraction. To understand it better, let's look at TF and IDF separately.

**Term Frequency:** Simply measures how often a word occurs in a document.<br/>
**Inverse Document Frequency:** Assigns a lower weight to words that appear in many documents. It essentially reflects how rare a word is across all documents.

![](https://mungingdata.files.wordpress.com/2017/11/equation.png?w=430&h=336) <br/>
Similar to CountVectorizer, we can import the TfidfVectorizer class from the scikit-learn library.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
# initialize sample document
sample_documents = ['Must have a subject and a verb','Must express a complete thought','Must only have one clause']
# instantiate
vectorizer = TfidfVectorizer()
vectorizer.fit(sample_documents)
# summarize
print(f':: vector vocabulary - {vectorizer.vocabulary_}\n')
# encode document
vector = vectorizer.transform(sample_documents)
# summarize encoded vector
print(f':: vector shape - {vector.shape}\n')
print(f':: vector list - {vector.toarray()}')

:: vector vocabulary - {'must': 5, 'have': 4, 'subject': 8, 'and': 0, 'verb': 10, 'express': 3, 'complete': 2, 'thought': 9, 'only': 7, 'one': 6, 'clause': 1}

:: vector shape - (3, 11)

:: vector list - [[0.50461134 0.         0.         0.         0.38376993 0.29803159
  0.         0.         0.50461134 0.         0.50461134]
 [0.         0.         0.54645401 0.54645401 0.         0.32274454
  0.         0.         0.         0.54645401 0.        ]
 [0.         0.50461134 0.         0.         0.38376993 0.29803159
  0.50461134 0.50461134 0.         0.         0.        ]]


**Interpretation:**

dictionary - 
> {'and': 0, 'clause': 1, 'complete': 2, 'express': 3, 'have': 4, 'must': 5, 'one': 6, 'only': 7, 'subject': 8, 'thought': 9, 'verb': 10} <br/>

document 1 - 
> 'Must have a subject and a verb'<br/>

vector - 
> [0.50461134  0.  0.  0.  0.38376993  0.29803159  0.  0.  0.50461134  0.  0.50461134]

> and - 0.50461134<br/>
> clause - 0.<br/>
> complete - 0. <br/>
> express - 0.<br/>
> have - 0.38376993<br/>
> must - 0.29803159<br/>
> one - 0. <br/>
> only - 0. <br/>
> subject - 0.50461134<br/>
> thought - 0. <br/>
> verb - 0.50461134<br/>

Among the non-zero weights, 'must' has the lowest because it appears in every document, and this is exactly what IDF does.
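
The weights above can also be reproduced by hand. The sketch below assumes scikit-learn's default settings for TfidfVectorizer: raw counts as term frequency, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. This variant may differ slightly from the textbook formula. The `tokenize` helper and variable names are my own.

```python
import math
import re

documents = [
    "Must have a subject and a verb",
    "Must express a complete thought",
    "Must only have one clause",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(documents)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in tokenized) for term in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}


def tfidf_vector(tokens):
    # Term count times IDF, then scale the vector to unit L2 length.
    raw = [tokens.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(weight * weight for weight in raw))
    return [weight / norm for weight in raw]


weights = [tfidf_vector(doc) for doc in tokenized]
for term, weight in zip(vocabulary, weights[0]):
    print(f"{term:>8}: {weight:.8f}")
```

Because 'must' occurs in all three documents, its IDF is exactly 1, the smallest possible value under this formula, which is why it ends up with the lowest non-zero weight.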

All these concepts can be grasped better while working on an actual problem. This is it for now ;)