# Natural Language Processing

**Natural Language Processing (NLP) is the task of getting computers to read and understand (process) written text (natural language). One of the most popular toolkits for natural language processing is the Natural Language Toolkit (NLTK) for the Python programming language.**




## NLTK (Natural Language Tool Kit)

**The NLTK module comes packed full of everything from trained algorithms to identify parts of speech to unsupervised machine learning algorithms to help you train your own machine to understand a specific bit of text.** 

**NLTK also comes with a large collection of corpora containing things like chat logs, movie reviews, journals, and much more!**

In [56]:
import nltk

## NLP Pipeline

**1.Tokenization**

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing in NLP. NLTK provides two types of tokenizers:

* Word Tokenizer
* Sentence Tokenizer



In [57]:
from nltk import sent_tokenize, word_tokenize

In [58]:
data = 'Hi class, We are going to discuss Natural Language Processing today. It will be interesting and easy to understand. Are you ready?. '

In [59]:
print(sent_tokenize(data))

['Hi class, We are going to discuss Natural Language Processing today.', 'It will be interesting and easy to understand.', 'Are you ready?.']


In [60]:
print(word_tokenize(data))

['Hi', 'class', ',', 'We', 'are', 'going', 'to', 'discuss', 'Natural', 'Language', 'Processing', 'today', '.', 'It', 'will', 'be', 'interesting', 'and', 'easy', 'to', 'understand', '.', 'Are', 'you', 'ready', '?', '.']


In [61]:
for i in word_tokenize(data):
    print(i)

Hi
class
,
We
are
going
to
discuss
Natural
Language
Processing
today
.
It
will
be
interesting
and
easy
to
understand
.
Are
you
ready
?
.
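
To get a rough feel for what a word tokenizer does, the splitting of words from punctuation can be approximated with a regular expression. This is a simplified sketch and not NLTK's actual algorithm, which also handles cases such as abbreviations and contractions; `simple_word_tokenize` is a hypothetical helper name.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+     matches a run of letters, digits, or underscores (a "word")
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Are you ready?."))
# ['Are', 'you', 'ready', '?', '.']
```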


**2.Stop Words**

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

We would not want these words taking up space in our database or taking up valuable processing time. We can remove them easily by keeping a list of the words we consider to be stop words.

NLTK (Natural Language Toolkit) in Python ships with stop-word lists for 16 different languages. You can find them in the nltk_data directory, for example home/pratima/nltk_data/corpora/stopwords

**E.g. Data Scientist - a person who does analysis on data and predicts**

***Data Scientist - person analysis data predicts (stop words removed)***

In [62]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [63]:
words = word_tokenize(data)

In [64]:
stop_words = set(stopwords.words("english"))

In [65]:
for i in words:
    if i not in stop_words:
        print(i)

Hi
class
,
We
going
discuss
Natural
Language
Processing
today
.
It
interesting
easy
understand
.
Are
ready
?
.
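
The output above still contains `We`, `It`, and `Are`, as well as punctuation. The entries in NLTK's English stop-word list are lowercase, and the `not in` test is case-sensitive. A minimal sketch of one way to handle this is shown below. It assumes a small hand-written set, `STOP_WORDS`, as a hypothetical stand-in for `set(stopwords.words("english"))` so that the example runs without downloading any NLTK data.

```python
import string

# Hypothetical stand-in for set(stopwords.words("english")); the real list is much longer.
STOP_WORDS = {"we", "are", "to", "it", "will", "be", "and", "you"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words (compared case-insensitively) and punctuation-only tokens."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # stop word, regardless of capitalisation
        if all(ch in string.punctuation for ch in tok):
            continue  # token consists only of punctuation
        kept.append(tok)
    return kept

tokens = ['Hi', 'class', ',', 'We', 'are', 'going', 'to', 'discuss',
          'Natural', 'Language', 'Processing', 'today', '.']
print(remove_stop_words(tokens))
# ['Hi', 'class', 'going', 'discuss', 'Natural', 'Language', 'Processing', 'today']
```

In the notebook itself, passing `stop_words` from the cell above as the second argument would apply the same filtering with NLTK's full list.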


**3.Stemming**

Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). Unlike a lemma, the resulting stem does not have to be a valid word.

Stemming is important in natural language understanding (NLU) and natural language processing (NLP), because different forms of a word often carry the same meaning:

* E.g. I was taking a ride on a bike.
* E.g. I was riding on a bike.

In [66]:
from nltk.stem import PorterStemmer

In [67]:
from nltk.tokenize import word_tokenize

In [68]:
ps = PorterStemmer()

In [69]:
words = ["Consultant","Consulting","Consult","Consultants","Consultative"]

In [70]:
for w in words:
    print(ps.stem(w))

consult
consult
consult
consult
consult


In [71]:
sent = "all the python programmers, who are pythoning now have poorly pythoned atleast once"

In [72]:
new_words = word_tokenize(sent)

In [73]:
for word in new_words:
    print(ps.stem(word))

all
the
python
programm
,
who
are
python
now
have
poorli
python
atleast
onc


**4.Lemmatizing**

A very similar operation to stemming is called lemmatizing. The major difference is that, as we saw earlier, stemming can often create non-existent words, whereas lemmas are actual words.

In [74]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("snakes"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("studying"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))


cat
snake
goose
rock
studying
good
best

Note that `studying` is returned unchanged: `lemmatize` treats every word as a noun unless a `pos` argument is given. Passing `pos="v"` makes it treat the word as a verb and return `study`.


**5.Part of Speech**

![image.png](attachment:image.png)

Part-of-Speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. In its simplest form, it identifies words as nouns, verbs, adjectives, adverbs, etc.

In [75]:
from nltk.corpus import state_union

In [76]:
from nltk.tokenize import PunktSentenceTokenizer

In [77]:
train_text = state_union.raw("2005-GWBush.txt")

In [78]:
text = state_union.raw("2006-GWBush.txt")

In [79]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [80]:
tokenized = custom_sent_tokenizer.tokenize(text)

In [81]:
def process():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tag = nltk.pos_tag(words)
            print(tag)
    except Exception as e:
        # Report the actual error instead of hiding it
        print("Exception occurred:", e)

In [82]:
process()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat

[('First', 'RB'), (',', ','), ('we', 'PRP'), ("'re", 'VBP'), ('helping', 'VBG'), ('Iraqis', 'NNP'), ('build', 'VB'), ('an', 'DT'), ('inclusive', 'JJ'), ('government', 'NN'), (',', ','), ('so', 'IN'), ('that', 'DT'), ('old', 'JJ'), ('resentments', 'NNS'), ('will', 'MD'), ('be', 'VB'), ('eased', 'VBN'), ('and', 'CC'), ('the', 'DT'), ('insurgency', 'NN'), ('will', 'MD'), ('be', 'VB'), ('marginalized', 'VBN'), ('.', '.')]
[('Second', 'JJ'), (',', ','), ('we', 'PRP'), ("'re", 'VBP'), ('continuing', 'VBG'), ('reconstruction', 'NN'), ('efforts', 'NNS'), (',', ','), ('and', 'CC'), ('helping', 'VBG'), ('the', 'DT'), ('Iraqi', 'NNP'), ('government', 'NN'), ('to', 'TO'), ('fight', 'VB'), ('corruption', 'NN'), ('and', 'CC'), ('build', 'VB'), ('a', 'DT'), ('modern', 'JJ'), ('economy', 'NN'), (',', ','), ('so', 'IN'), ('all', 'DT'), ('Iraqis', 'NNP'), ('can', 'MD'), ('experience', 'VB'), ('the', 'DT'), ('benefits', 'NNS'), ('of', 'IN'), ('freedom', 'NN'), ('.', '.')]
[('And', 'CC'), (',', ','), ('th

[('White', 'NNP'), ('House', 'NNP'), ('photo', 'NN'), ('by', 'IN'), ('Eric', 'NNP'), ('Draper', 'NNP'), ('The', 'DT'), ('same', 'JJ'), ('is', 'VBZ'), ('true', 'JJ'), ('of', 'IN'), ('Iran', 'NNP'), (',', ','), ('a', 'DT'), ('nation', 'NN'), ('now', 'RB'), ('held', 'VBN'), ('hostage', 'NN'), ('by', 'IN'), ('a', 'DT'), ('small', 'JJ'), ('clerical', 'JJ'), ('elite', 'NN'), ('that', 'WDT'), ('is', 'VBZ'), ('isolating', 'VBG'), ('and', 'CC'), ('repressing', 'VBG'), ('its', 'PRP$'), ('people', 'NNS'), ('.', '.')]
[('The', 'DT'), ('regime', 'NN'), ('in', 'IN'), ('that', 'DT'), ('country', 'NN'), ('sponsors', 'NNS'), ('terrorists', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('Palestinian', 'JJ'), ('territories', 'NNS'), ('and', 'CC'), ('in', 'IN'), ('Lebanon', 'NNP'), ('--', ':'), ('and', 'CC'), ('that', 'DT'), ('must', 'MD'), ('come', 'VB'), ('to', 'TO'), ('an', 'DT'), ('end', 'NN'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('The', 'DT'), ('Iranian', 'JJ'), ('government

[('Because', 'IN'), ('America', 'NNP'), ('needs', 'VBZ'), ('more', 'JJR'), ('than', 'IN'), ('a', 'DT'), ('temporary', 'JJ'), ('expansion', 'NN'), (',', ','), ('we', 'PRP'), ('need', 'VBP'), ('more', 'JJR'), ('than', 'IN'), ('temporary', 'JJ'), ('tax', 'NN'), ('relief', 'NN'), ('.', '.')]
[('I', 'PRP'), ('urge', 'VBP'), ('the', 'DT'), ('Congress', 'NNP'), ('to', 'TO'), ('act', 'VB'), ('responsibly', 'RB'), (',', ','), ('and', 'CC'), ('make', 'VB'), ('the', 'DT'), ('tax', 'NN'), ('cuts', 'NNS'), ('permanent', 'NN'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('Keeping', 'VBG'), ('America', 'NNP'), ('competitive', 'JJ'), ('requires', 'VBZ'), ('us', 'PRP'), ('to', 'TO'), ('be', 'VB'), ('good', 'JJ'), ('stewards', 'NNS'), ('of', 'IN'), ('tax', 'NN'), ('dollars', 'NNS'), ('.', '.')]
[('Every', 'DT'), ('year', 'NN'), ('of', 'IN'), ('my', 'PRP$'), ('presidency', 'NN'), (',', ','), ('we', 'PRP'), ("'ve", 'VBP'), ('reduced', 'VBN'), ('the', 'DT'), ('growth', 'NN'), ('

[('With', 'IN'), ('more', 'JJR'), ('research', 'NN'), ('in', 'IN'), ('both', 'CC'), ('the', 'DT'), ('public', 'NN'), ('and', 'CC'), ('private', 'JJ'), ('sectors', 'NNS'), (',', ','), ('we', 'PRP'), ('will', 'MD'), ('improve', 'VB'), ('our', 'PRP$'), ('quality', 'NN'), ('of', 'IN'), ('life', 'NN'), ('--', ':'), ('and', 'CC'), ('ensure', 'VB'), ('that', 'DT'), ('America', 'NNP'), ('will', 'MD'), ('lead', 'VB'), ('the', 'DT'), ('world', 'NN'), ('in', 'IN'), ('opportunity', 'NN'), ('and', 'CC'), ('innovation', 'NN'), ('for', 'IN'), ('decades', 'NNS'), ('to', 'TO'), ('come', 'VB'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('Third', 'NNP'), (',', ','), ('we', 'PRP'), ('need', 'VBP'), ('to', 'TO'), ('encourage', 'VB'), ('children', 'NNS'), ('to', 'TO'), ('take', 'VB'), ('more', 'JJR'), ('math', 'NN'), ('and', 'CC'), ('science', 'NN'), (',', ','), ('and', 'CC'), ('to', 'TO'), ('make', 'VB'), ('sure', 'JJ'), ('those', 'DT'), ('courses', 'NNS'), ('are', 'VBP'), ('ri

[('Like', 'IN'), ('Americans', 'NNPS'), ('before', 'IN'), ('us', 'PRP'), (',', ','), ('we', 'PRP'), ('will', 'MD'), ('show', 'VB'), ('that', 'DT'), ('courage', 'NN'), ('and', 'CC'), ('we', 'PRP'), ('will', 'MD'), ('finish', 'VB'), ('well', 'RB'), ('.', '.')]
[('We', 'PRP'), ('will', 'MD'), ('lead', 'VB'), ('freedom', 'NN'), ("'s", 'POS'), ('advance', 'NN'), ('.', '.')]
[('We', 'PRP'), ('will', 'MD'), ('compete', 'VB'), ('and', 'CC'), ('excel', 'VB'), ('in', 'IN'), ('the', 'DT'), ('global', 'JJ'), ('economy', 'NN'), ('.', '.')]
[('We', 'PRP'), ('will', 'MD'), ('renew', 'VB'), ('the', 'DT'), ('defining', 'VBG'), ('moral', 'JJ'), ('commitments', 'NNS'), ('of', 'IN'), ('this', 'DT'), ('land', 'NN'), ('.', '.')]
[('And', 'CC'), ('so', 'RB'), ('we', 'PRP'), ('move', 'VBP'), ('forward', 'RB'), ('--', ':'), ('optimistic', 'JJ'), ('about', 'IN'), ('our', 'PRP$'), ('country', 'NN'), (',', ','), ('faithful', 'JJ'), ('to', 'TO'), ('its', 'PRP$'), ('cause', 'NN'), (',', ','), ('and', 'CC'), ('confi

**5.Chunking**

Chunking groups elements of a similar kind based on their POS tags. One of the main goals of chunking is to group words into what are known as "noun phrases." These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns with the words that are related to them.

In order to chunk, we combine the part of speech tags with ***regular expressions.***
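
As a small, self-contained illustration of combining POS tags with a regular expression, the sketch below chunks noun phrases in a single sentence. The sentence is tagged by hand (an assumption, so that no tagger data is needed), and the `NP` grammar is an illustrative choice rather than the grammar used later in this notebook.

```python
import nltk

# Hand-tagged tokens (an assumption for illustration; normally produced by nltk.pos_tag)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP chunk: an optional determiner, any number of adjectives, then a noun
grammar = r"NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(sentence)
print(tree)

# Collect the words of each NP chunk found in the tree
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)
# ['the little yellow dog', 'the cat']
```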

![image.png](attachment:image.png)

In [83]:
def process1():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            # Chunk: any adverbs, then any verbs, then a proper noun
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>}"""
            chunkParser = nltk.RegexpParser(chunkGram)

            chunked = chunkParser.parse(tagged)
            print(chunked)
            chunked.draw()  # opens a window with the tree for each sentence

    except Exception as e:
        # Report the actual error instead of hiding it
        print("Exception occurred:", e)

In [84]:
process1()


**6.Chinking**

After a lot of chunking, you may still have some words in your chunks that you do not want. Chinking is a lot like chunking: it is basically a way to remove part of a chunk. The part that you remove from your chunk is your chink.
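
A minimal sketch of chinking on a single hand-tagged sentence (the tags are written by hand here as an assumption, so no tagger data is needed). The grammar first chunks every token and then chinks, that is removes, runs of past-tense verbs and prepositions. The `NP` label and the choice of tags to chink are illustrative.

```python
import nltk

# Hand-tagged tokens (an assumption for illustration; normally produced by nltk.pos_tag)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# First rule  {...} : chunk everything
# Second rule }...{ : chink (remove) sequences of VBD and IN from the chunk
grammar = r"""
NP:
    {<.*>+}
    }<VBD|IN>+{
"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(sentence)
print(tree)

# The words left inside each chunk after chinking
remaining = [" ".join(word for word, tag in subtree.leaves())
             for subtree in tree.subtrees()
             if subtree.label() == "NP"]
print(remaining)
# ['the little yellow dog', 'the cat']
```

Removing `barked` and `at` from the middle splits the single large chunk into two smaller ones.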

In [85]:
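def process2():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            # Chunk everything, then chink verbs, prepositions, determiners and "to"
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)

            chunked.draw()  # opens a window with the tree for each sentence

    except Exception as e:
        # Report the actual error instead of hiding it
        print("Exception occurred:", e)

In [86]:
process2()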


**7.Named Entity Recognition**

Named Entity Recognition (NER) locates named entities in text, such as people, organizations, and places, and labels them by type. It is an important method for extracting relevant information from data.

In [92]:
text = "MS Dhoni is the best captain of India,He won ICC World Cup in the year 2011"

In [93]:
text1 = nltk.word_tokenize(text)

In [94]:
tag = nltk.pos_tag(text1)

In [95]:
namedent = nltk.ne_chunk(tag)

In [96]:
namedent.draw()

**8.Vectorising**

Vectorising converts text to a numeric representation. In its simplest form, you count the occurrences of each word in each text message and store the counts in a matrix with one row per text message and one column per word.
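
The count matrix described above can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and three made-up messages; `vectorize` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

def vectorize(messages):
    """Return (vocabulary, matrix) with one row per message and one column per word."""
    # Lowercase and split on whitespace (a deliberately simple tokenizer)
    tokenized = [msg.lower().split() for msg in messages]
    # Sorted list of every distinct word; its order defines the columns
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

messages = ["free prize call now", "call me now", "free free prize"]
vocabulary, matrix = vectorize(messages)
print(vocabulary)
# ['call', 'free', 'me', 'now', 'prize']
for row in matrix:
    print(row)
# [1, 1, 0, 1, 1]
# [1, 0, 1, 1, 0]
# [0, 2, 0, 0, 1]
```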

![image.png](attachment:image.png)

**Applying Machine Learning**

![ChessUrl](https://cdn-images-1.medium.com/max/1000/1*fTPhu7PqgIbnngbWG5zFWA.gif)