# Exploring NLTK

# NLP

NLP (Natural Language Processing) is a field of computer science and linguistics. It sits at the intersection of computers and human languages. A common task is natural language understanding. For more about Natural Language Processing, visit <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a>.

# Tokenization

We cannot use a whole corpus as input for parsing or text analysis. We have to break it down into units called tokens. Tokenization is the process of breaking a stream of text into words, phrases, or other meaningful elements. Most of the time, tokenization is the first step in text analysis or NLP.

NLTK (the Natural Language Toolkit for Python) provides two kinds of tokenization functions: sentence tokenization and word tokenization.

Sentence Tokenization - We break the stream of text into a list of sentences. In this case our tokens are sentences.

Word Tokenization - We break the stream of text into a list of words and punctuation marks. In this case our tokens are words.
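To make the two levels of tokenization concrete, here is a minimal standard-library sketch. It is not how NLTK's `sent_tokenize` or `word_tokenize` work internally; the helper names `simple_sent_tokenize` and `simple_word_tokenize` and the regular expressions are assumptions chosen for illustration, and they will mis-split abbreviations such as "etc." that NLTK handles better.

```python
import re

def simple_sent_tokenize(text):
    """Naive rule: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "MYCIN was an expert system. It diagnosed infections!"
sample_sentences = simple_sent_tokenize(sample)
sample_words = simple_word_tokenize(sample_sentences[0])

print(sample_sentences)
print(sample_words)
```

The first call yields one token per sentence, the second one token per word or punctuation mark, which mirrors the two NLTK functions used in the cells that follow.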

In [3]:
#import our dependency -nltk
import nltk

#import required modules
from nltk import word_tokenize, sent_tokenize

#uncomment the line below if you haven't downloaded the nltk data and corpora yet
#nltk.download()

In [2]:
#get text (the with-block closes the file for us)
with open('./data/MyText.txt', 'r') as f:
    corpus = f.read()

#here, corpus is a string of text. Let's check it
type(corpus)

str

Now that we have our text, let's perform tokenization on it.

In [5]:
#sentence tokenization
sentences = sent_tokenize(corpus)
sentences


["A ``knowledge engineer'' interviews experts in a certain domain and tries to embody their knowledge in a computer program for carrying out some task.",
 'How well this works depends on whether the intellectual mechanisms required for the task are within the present state of AI.',
 'When this turned out not to be so, there were many disappointing results.',
 'One of the first expert systems was MYCIN in 1974, which diagnosed bacterial infections of the blood and suggested treatments.',
 'It did better than medical students or practicing doctors, provided its limitations were observed.',
 'Namely, its ontology included bacteria, symptoms, and treatments and did not include patients, doctors, hospitals, death, recovery, and events occurring in time.',
 'Its interactions depended on a single patient being considered.',
 'Since the experts consulted by the knowledge engineers knew about patients, doctors, death, recovery, etc., it is clear that the knowledge engineers forced what the expe

In [6]:
#word tokenization
words = word_tokenize(corpus)
words

['A',
 '``',
 'knowledge',
 'engineer',
 "''",
 'interviews',
 'experts',
 'in',
 'a',
 'certain',
 'domain',
 'and',
 'tries',
 'to',
 'embody',
 'their',
 'knowledge',
 'in',
 'a',
 'computer',
 'program',
 'for',
 'carrying',
 'out',
 'some',
 'task',
 '.',
 'How',
 'well',
 'this',
 'works',
 'depends',
 'on',
 'whether',
 'the',
 'intellectual',
 'mechanisms',
 'required',
 'for',
 'the',
 'task',
 'are',
 'within',
 'the',
 'present',
 'state',
 'of',
 'AI',
 '.',
 'When',
 'this',
 'turned',
 'out',
 'not',
 'to',
 'be',
 'so',
 ',',
 'there',
 'were',
 'many',
 'disappointing',
 'results',
 '.',
 'One',
 'of',
 'the',
 'first',
 'expert',
 'systems',
 'was',
 'MYCIN',
 'in',
 '1974',
 ',',
 'which',
 'diagnosed',
 'bacterial',
 'infections',
 'of',
 'the',
 'blood',
 'and',
 'suggested',
 'treatments',
 '.',
 'It',
 'did',
 'better',
 'than',
 'medical',
 'students',
 'or',
 'practicing',
 'doctors',
 ',',
 'provided',
 'its',
 'limitations',
 'were',
 'observed',
 '.',
 'Namel

# Stopwords

Stop words are the most common words in a particular language, such as I, my, and for, which usually do not affect our analysis. So we remove these stop words from our corpus and focus on the important words.

NLTK provides a list of stopwords that we can use to filter them out of our corpus.

In [7]:
#import stopwords
from nltk.corpus import stopwords

In [8]:
s_words = stopwords.words('english')
#let's have a look at a few of them
s_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

Remove stopwords from a corpus. 

In [9]:
#we will use the word-tokenized text for further processing.
filtered_corpus = [w for w in words if w not in s_words]

filtered_corpus

['A',
 '``',
 'knowledge',
 'engineer',
 "''",
 'interviews',
 'experts',
 'certain',
 'domain',
 'tries',
 'embody',
 'knowledge',
 'computer',
 'program',
 'carrying',
 'task',
 '.',
 'How',
 'well',
 'works',
 'depends',
 'whether',
 'intellectual',
 'mechanisms',
 'required',
 'task',
 'within',
 'present',
 'state',
 'AI',
 '.',
 'When',
 'turned',
 ',',
 'many',
 'disappointing',
 'results',
 '.',
 'One',
 'first',
 'expert',
 'systems',
 'MYCIN',
 '1974',
 ',',
 'diagnosed',
 'bacterial',
 'infections',
 'blood',
 'suggested',
 'treatments',
 '.',
 'It',
 'better',
 'medical',
 'students',
 'practicing',
 'doctors',
 ',',
 'provided',
 'limitations',
 'observed',
 '.',
 'Namely',
 ',',
 'ontology',
 'included',
 'bacteria',
 ',',
 'symptoms',
 ',',
 'treatments',
 'include',
 'patients',
 ',',
 'doctors',
 ',',
 'hospitals',
 ',',
 'death',
 ',',
 'recovery',
 ',',
 'events',
 'occurring',
 'time',
 '.',
 'Its',
 'interactions',
 'depended',
 'single',
 'patient',
 'considered',

Let's make sure that our new filtered corpus is really filtered.
Almost any text, even a single sentence, contains a few stopwords.
So the length of the filtered text should be less than that of the original token list.

In [10]:
len(filtered_corpus) < len(words)

True

In [11]:
#How many words did we remove?
print(len(words) - len(filtered_corpus))

72
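The filtered output above still contains tokens such as 'A', 'How', and 'It'. The membership test is case-sensitive, and the stopword entries printed earlier are all lowercase. A possible refinement is to lowercase each token before the test and to keep the stopwords in a set. The sketch below uses a tiny hand-picked set, `demo_stopwords`, as a stand-in so that it runs without NLTK data; in the notebook you would presumably build the set from `stopwords.words('english')` instead.

```python
# A tiny hand-picked stopword set for illustration only (an assumption,
# not NLTK's list); in the notebook use set(stopwords.words('english')).
demo_stopwords = {"a", "it", "how", "in", "and", "the", "of"}

demo_tokens = ["A", "knowledge", "engineer", "interviews", "experts",
               "in", "a", "domain", "."]

# Lowercase each token before the membership test; a set gives fast lookups.
demo_filtered = [w for w in demo_tokens if w.lower() not in demo_stopwords]

# Optionally drop tokens that contain no letters or digits (pure punctuation).
demo_content = [w for w in demo_filtered if any(ch.isalnum() for ch in w)]

print(demo_filtered)
print(demo_content)
```

Whether punctuation should be dropped depends on the later steps; a POS tagger, for example, can make use of sentence-final periods.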


# POS Tagging

POS Tagging (Part-of-Speech Tagging) is the process of classifying words into lexical categories such as noun or verb.
It is very useful for later analytical steps such as structure analysis.

In [12]:
#Here we use the POS tagger provided by nltk; we could also train our own POS tagger
#import
from nltk import pos_tag

#assign tags
tags = pos_tag(filtered_corpus)
tags

[('A', 'DT'),
 ('``', '``'),
 ('knowledge', 'NN'),
 ('engineer', 'NN'),
 ("''", "''"),
 ('interviews', 'NNS'),
 ('experts', 'NNS'),
 ('certain', 'JJ'),
 ('domain', 'NN'),
 ('tries', 'NNS'),
 ('embody', 'VBP'),
 ('knowledge', 'JJ'),
 ('computer', 'NN'),
 ('program', 'NN'),
 ('carrying', 'NN'),
 ('task', 'NN'),
 ('.', '.'),
 ('How', 'WRB'),
 ('well', 'RB'),
 ('works', 'VBZ'),
 ('depends', 'VBZ'),
 ('whether', 'IN'),
 ('intellectual', 'JJ'),
 ('mechanisms', 'NNS'),
 ('required', 'VBN'),
 ('task', 'NN'),
 ('within', 'IN'),
 ('present', 'JJ'),
 ('state', 'NN'),
 ('AI', 'NNP'),
 ('.', '.'),
 ('When', 'WRB'),
 ('turned', 'VBD'),
 (',', ','),
 ('many', 'JJ'),
 ('disappointing', 'JJ'),
 ('results', 'NNS'),
 ('.', '.'),
 ('One', 'CD'),
 ('first', 'JJ'),
 ('expert', 'JJ'),
 ('systems', 'NNS'),
 ('MYCIN', 'NNP'),
 ('1974', 'CD'),
 (',', ','),
 ('diagnosed', 'VBD'),
 ('bacterial', 'JJ'),
 ('infections', 'NNS'),
 ('blood', 'NN'),
 ('suggested', 'VBD'),
 ('treatments', 'NNS'),
 ('.', '.'),
 ('It', 'P

Have a look at the <a href="http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/">List of POS Tags</a> to learn more about tags such as NN, RB, and VBZ.
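As a quick reference, the sketch below attaches a readable description to a few of the tags seen in the output above. The descriptions follow the Penn Treebank tag set; the dictionary `tag_meanings` and the helper `describe` are names made up for this example and cover only a handful of tags.

```python
# Short descriptions for a few Penn Treebank tags that appear in the output above.
tag_meanings = {
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "JJ": "adjective",
    "RB": "adverb",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person singular present",
    "IN": "preposition or subordinating conjunction",
    "CD": "cardinal number",
}

# A few (word, tag) pairs copied from the tagger output above.
demo_tags = [("knowledge", "NN"), ("well", "RB"), ("works", "VBZ"), ("MYCIN", "NNP")]

def describe(tagged):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, tag_meanings.get(tag, "unknown tag")) for word, tag in tagged]

described = describe(demo_tags)
for row in described:
    print(row)
```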

# Chunking

Chunking is basically the identification of short phrases (e.g. noun phrases) in a text, built on top of part-of-speech tags. Parts of speech tell us whether words are nouns, adjectives, or verbs, but sometimes it is useful to have more information about the structure of a text. We use chunking in <a href="http://en.wikipedia.org/wiki/Named_entity_recognition">Named Entity Recognition</a>, in which we are interested in finding named entities in the text.

Chunking is useful for extracting meaningful information from a text. It allows us to extract groups of words with set characteristics.

For example: John was driving so fast.

In the above example, the named entity type for John will be "Person".

In [12]:
from nltk.chunk import RegexpParser
from nltk import sent_tokenize,word_tokenize

In [15]:
pattern = """
    NP: {<JJ>*<NN>+}
    {<JJ>*<NN><CC>*<NN>+}
    """
#pattern to detect noun phrases

In [16]:
chunker = RegexpParser(pattern)

In [9]:
text = """
The National Wrestling Association was an early professional wrestling sanctioning body created in 1930 by 
the National Boxing Association (NBA) (now the World Boxing Association, WBA) as an attempt to create
a governing body for professional wrestling in the United States. The group created a number of "World" level 
championships as an attempt to clear up the professional wrestling rankings which at the time saw a number of 
different championships promoted as the "true world championship". The National Wrestling Association's NWA 
World Heavyweight Championship was later considered part of the historical lineage of the National Wrestling 
Alliance's NWA World Heavyweight Championship when then National Wrestling Association champion Lou Thesz 
won the National Wrestling Alliance championship, folding the original championship into one title in 1949."""

In [17]:
tokenized_sentence = nltk.sent_tokenize(text)  # Tokenize the text into sentences.
tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]  # Tokenize each sentence into words.
tagged_words = [nltk.pos_tag(tokens) for tokens in tokenized_words]  # POS-tag the tokens of each sentence.
word_tree = [chunker.parse(tagged) for tagged in tagged_words]  # Identify NP chunks in each tagged sentence.

In [20]:
word_tree[0].draw() #draw the tree
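To show what a rule like `{<JJ>*<NN>+}` does, here is a rough standard-library sketch that groups "zero or more adjectives followed by one or more nouns" into chunks. It is not NLTK's `RegexpParser`; the function `chunk_noun_phrases` and the hand-assigned tags in `demo_tagged` are assumptions for illustration. One deliberate difference: the sketch accepts any tag starting with NN, which corresponds to writing `<NN.*>` in an NLTK tag pattern, whereas `<NN>` in the pattern above matches only the exact tag NN and so skips plural (NNS) and proper (NNP) nouns.

```python
def chunk_noun_phrases(tagged):
    """Group maximal runs of JJ* followed by NN-family tags into chunks."""
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        # Consume zero or more adjectives.
        while j < n and tagged[j][1] == "JJ":
            j += 1
        k = j
        # Consume one or more nouns (any tag starting with NN).
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            # At least one noun followed the adjectives: emit a chunk.
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            # No noun found at this position; move on by one token.
            i += 1
    return chunks

# Hand-tagged tokens (tags assigned by hand for this example).
demo_tagged = [("an", "DT"), ("early", "JJ"), ("professional", "JJ"),
               ("wrestling", "NN"), ("body", "NN"), ("created", "VBN"),
               ("in", "IN"), ("1930", "CD")]

demo_chunks = chunk_noun_phrases(demo_tagged)
print(demo_chunks)
```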

# Named Entity Recognition

Named Entity Recognition (NER) is one of the most important tasks in information extraction. It means finding named entities in a text and labeling their types. For example, given the text "John studies at Stanford University", NER labels John as a Person and Stanford University as an Organization.

In [22]:
from nltk import ne_chunk, pos_tag,  word_tokenize
 
sentence = "John studies at Stanford University."
 
print(ne_chunk(pos_tag(word_tokenize(sentence))))


(S
  (PERSON John/NNP)
  studies/NNS
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  ./.)


# Stemming and Lemmatizing 

Stemming and lemmatizing both reduce a word to a base form by removing morphological affixes; stemming leaves only the word stem. This is very helpful in many natural language processing tasks. For example, the stem of "running" is "run" and the stem of "makes" is "make".

In [25]:
from nltk.stem import PorterStemmer

In [26]:
stemmer = PorterStemmer()
stemmer.stem("running")

'run'

In [28]:
stemmer.stem("makes")

'make'

In [30]:
#another stemmer is snowball
from nltk.stem import SnowballStemmer

In [33]:
stemmer2 = SnowballStemmer("english")

In [34]:
stemmer2.stem("grows")

'grow'

Lemmatization is closely related to stemming, but it also takes into account the context in which a word appears, such as its part of speech, and it returns a dictionary form (lemma). Stemmers are just a set of rules, such as "remove a trailing s if some condition is satisfied". Stemmers are usually faster than lemmatizers.
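The idea that a stemmer is a set of rules can be sketched in a few lines. The rules in `toy_rules` and the function `toy_stem` are made up for illustration and are far cruder than the Porter or Snowball algorithms. For instance, this sketch turns "running" into "runn", while the PorterStemmer call shown earlier returns 'run' because Porter has additional rules for cases such as doubled consonants.

```python
# Ordered (suffix, replacement) rules; the first rule that applies wins.
# These rules are invented for this example and are not Porter's rules.
toy_rules = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix, replacement in toy_rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word

toy_stems = {w: toy_stem(w) for w in ["grows", "makes", "running", "studies", "was"]}
print(toy_stems)
```

The minimum stem length is one example of the kind of condition a rule can carry: it keeps short words like "was" from being cut down to "wa".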

In [36]:
from nltk.stem import WordNetLemmatizer

In [37]:
lemmatizer = WordNetLemmatizer()

In [39]:
lemmatizer.lemmatize("makes")

'make'
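Unlike a stemmer, a lemmatizer looks words up in a lexicon, and the result can depend on the part of speech. NLTK's `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun (`'n'`). The sketch below imitates that behavior with a tiny hand-written table; `demo_lemmas` and `demo_lemmatize` are hypothetical names, and the table is nowhere near a real lexicon such as WordNet.

```python
# A tiny hand-written lemma table keyed by (word, part of speech).
# Real lemmatizers consult a full lexicon such as WordNet instead.
demo_lemmas = {
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("mice", "n"): "mouse",
}

def demo_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return demo_lemmas.get((word.lower(), pos), word)

print(demo_lemmatize("meeting", pos="v"))  # treated as a verb
print(demo_lemmatize("meeting"))           # treated as a noun by default
print(demo_lemmatize("ran"))               # no noun entry, so returned unchanged
```

The last call shows why passing the part of speech matters: with the default noun setting, a verb form such as "ran" is left as it is.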

You can also use <a href="https://stanfordnlp.github.io/CoreNLP/">Stanford CoreNLP modules</a>, such as its NER tagger, from NLTK. They are often more accurate, although results depend on the task and the data.

# Word Sense

A word sense is one of the meanings of a word. Some English words, such as bench, have multiple meanings. NLTK provides an API for WordNet, which is a semantic graph of words.

In [41]:
from nltk.corpus import wordnet as wn
wn.synsets('man')

[Synset('man.n.01'),
 Synset('serviceman.n.01'),
 Synset('man.n.03'),
 Synset('homo.n.02'),
 Synset('man.n.05'),
 Synset('man.n.06'),
 Synset('valet.n.01'),
 Synset('man.n.08'),
 Synset('man.n.09'),
 Synset('man.n.10'),
 Synset('world.n.08'),
 Synset('man.v.01'),
 Synset('man.v.02')]

In [45]:
wn.synsets('man')[1].definition()

'someone who serves in the armed forces; a member of a military force'

In [48]:
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

In [51]:
dog = wn.synset('dog.n.01')

In [52]:
dog.examples()[0]

'the dog barked all night'

In [53]:
dog.hypernyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
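Because hypernyms link each sense to more general senses, they can be followed upward like edges in a graph. The sketch below walks such a graph with a breadth-first search. The graph `demo_hypernyms` is hand-written: the two edges for "dog" mirror the output above, while the remaining edges are simplified assumptions and not a copy of WordNet.

```python
from collections import deque

# Hand-written hypernym edges. The two edges for 'dog' mirror the output above;
# the remaining edges are simplified assumptions for illustration.
demo_hypernyms = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["mammal"],
    "domestic_animal": ["animal"],
    "mammal": ["animal"],
    "animal": [],
}

def all_hypernyms(word):
    """Collect every ancestor of `word` by breadth-first search."""
    seen = []
    queue = deque(demo_hypernyms.get(word, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(demo_hypernyms.get(node, []))
    return seen

dog_ancestors = all_hypernyms("dog")
print(dog_ancestors)
```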

# Frequency Distribution

We often need to count how frequently each word occurs in a given text. We can use FreqDist to do that.

In [55]:
t = """Science is the methodical study of nature including testable explanations and predictions. From classical antiquity through the 19th century, science as a type of knowledge was more closely linked to philosophy than it is now and, in fact, in the Western world, the term "natural philosophy" encompassed fields of study that are today associated with science, such as astronomy, medicine, and physics. However, during the Islamic Golden Age foundations for the scientific method were laid by Ibn al-Haytham in his Book of Optics. While the classification of the material world by the ancient Indians and Greeks into air, earth, fire and water was more philosophical, medieval Middle Easterns used practical, experimental observation to classify materials."""

In [56]:
from nltk.probability import FreqDist

In [57]:
fd = FreqDist(nltk.word_tokenize(t))

In [61]:
fd

FreqDist({"''": 1,
          ',': 12,
          '.': 4,
          '19th': 1,
          'Age': 1,
          'Book': 1,
          'Easterns': 1,
          'From': 1,
          'Golden': 1,
          'Greeks': 1,
          'However': 1,
          'Ibn': 1,
          'Indians': 1,
          'Islamic': 1,
          'Middle': 1,
          'Optics': 1,
          'Science': 1,
          'Western': 1,
          'While': 1,
          '``': 1,
          'a': 1,
          'air': 1,
          'al-Haytham': 1,
          'ancient': 1,
          'and': 5,
          'antiquity': 1,
          'are': 1,
          'as': 2,
          'associated': 1,
          'astronomy': 1,
          'by': 2,
          'century': 1,
          'classical': 1,
          'classification': 1,
          'classify': 1,
          'closely': 1,
          'during': 1,
          'earth': 1,
          'encompassed': 1,
          'experimental': 1,
          'explanations': 1,
          'fact': 1,
          'fields': 1,
          'f
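In the output above the comma is the most frequent token, and capitalized words are counted separately from their lowercase forms. If only word counts matter, one option is to normalize the tokens before counting. The sketch below does this with the standard library's `collections.Counter`; the sample text and the variable names are made up for this example.

```python
import re
from collections import Counter

demo_text = "Science is the study of nature. The study of the natural world is old."

# Lowercase and keep only alphabetic runs so 'The' and 'the' count together
# and punctuation is ignored.
demo_words = re.findall(r"[a-z]+", demo_text.lower())
demo_counts = Counter(demo_words)

# The single most frequent word with its count.
print(demo_counts.most_common(1))
```

The same normalization could be applied to the token list before it is passed to `FreqDist`.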

# Text Classification Example - Sentiment Analysis

Check out <a href="https://github.com/savan77/Practical-Machine-Learning-With-Python/blob/master/Part%20-%202/Sentiment%20Analysis.ipynb">this notebook</a> for a sentiment analysis task.

# Emotion Detection Task

Check out this <a href="Emotion%20Detection.ipynb">notebook</a> on the emotion detection task.