This notebook illustrates preprocessing techniques used in NLP. All examples are based on NLTK (the Natural Language Toolkit) in Python.

## Tokenization

Tokenization is often the most important step in NLP and text analytics. It is the process of breaking a stream of text into words, terms, sentences, symbols, or other meaningful elements called tokens.

In [111]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
import nltk
from nltk.tokenize import (word_tokenize,
                          sent_tokenize,
                          TreebankWordTokenizer, 
                          TweetTokenizer)

In [158]:
text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence 🤖 concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. The students went to an NLP course. The student goes to an NLP course."""

In [159]:
text_to_sentence = sent_tokenize(text)
print(text_to_sentence)

['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence 🤖 concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.', 'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.', 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.', 'The students went to an NLP course.', 'The student goes to an NLP course.']


In [160]:
text_to_sentence[1]

'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.'

In [161]:
tokenized_word = word_tokenize(text)
print(tokenized_word)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '🤖', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'The', 'goal', 'is', 'a', 'computer', 'capable', 'of', '``', 'understanding', "''", 'the', 'contents', 'of', 'documents', ',', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', '.', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', '.', 'The', 'students', 'went', 'to', 'an', 'NLP', 'course', '.', 'The', 'student', 'goes', 'to', 'an', 'NLP', 'course', '.']


In [162]:
text_to_sentence[0]

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence 🤖 concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.'

In [163]:
tokenized_word[1]

'language'
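
The import cell above also brings in TreebankWordTokenizer and TweetTokenizer, which are not used in the cells that follow. The sketch below shows how they might be called. The first sentence is the one used later in the PoS tagging section; the tweet-like string is an illustrative assumption, not data from this notebook. Both tokenizers are rule-based, so no extra download should be needed.

```python
from nltk.tokenize import TreebankWordTokenizer, TweetTokenizer

treebank = TreebankWordTokenizer()
# strip_handles=True drops @mentions from the token list.
tweet = TweetTokenizer(strip_handles=True)

sentence = "I'm going to watch Silicon-Valley on HBO."
treebank_tokens = treebank.tokenize(sentence)
print(treebank_tokens)

# Illustrative tweet-like string (an assumption, not notebook data).
tweet_text = "@nlp_fan This is a cooool #dummysmiley: :-) <3"
tweet_tokens = tweet.tokenize(tweet_text)
print(tweet_tokens)
```

The Treebank tokenizer splits contractions such as "I'm" into separate tokens, while the tweet tokenizer is designed to keep hashtags and emoticons in one piece.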

## Stop word removal

Stop words are very common words that carry little information on their own. For example, words such as "that", "these", "below", "is", and "are" add little meaning, so they are typically removed from the text. NLTK comes preloaded with a list of English stop words.

In [164]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{'do', "don't", 'because', "shan't", 'during', 'her', 'here', 'each', "mightn't", 'shan', 'be', 'wasn', 'how', 'won', 'the', "won't", 'which', 'off', 'm', "should've", 'hadn', 'd', 'so', "that'll", 'between', 'its', 'too', "she's", 'down', 'it', "you've", 'mightn', 'until', "you'd", "haven't", 'him', 'by', 't', 'those', 're', 'aren', 'but', 'while', 'whom', 'this', 'now', 'ain', 'with', "couldn't", 'yourselves', 's', 'don', 'any', 'after', 'such', 'to', 'of', 'or', 'being', 'have', 'needn', 'my', 'yourself', 'can', 'hasn', 'there', 'again', 'is', "doesn't", 'myself', 'through', 'up', 'when', "mustn't", 'once', 'what', 'i', 'am', 'having', 'about', 'only', 'our', "shouldn't", "isn't", 'in', 'against', "didn't", 'out', 'shouldn', 'mustn', 'weren', "you're", 'if', 'will', 'couldn', "it's", 'doesn', 'has', 'that', 'll', 'your', 'these', 'they', 'them', 'should', 'ours', 'we', 'under', 'isn', 'just', 've', 'from', 'above', 'as', 'their', 'further', "weren't", 'all', 'into', 'had', 'some', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [165]:
filtered_text = []
tokenized_word = word_tokenize(text)
for each_word in tokenized_word:
    if each_word not in stop_words:
        filtered_text.append(each_word)

In [166]:
print('Tokenized list with stop words: {}'.format(tokenized_word))
print('Tokenized list without stop words: {}'.format(filtered_text))

Tokenized list with stop words: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '🤖', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'The', 'goal', 'is', 'a', 'computer', 'capable', 'of', '``', 'understanding', "''", 'the', 'contents', 'of', 'documents', ',', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', '.', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', '.', 'The', 'students', 'went', 'to', 'an', 'NLP', 'course', '.', 'The', 'student', 'goes', 'to', 
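
The membership test in the loop above is case-sensitive, and the stop word list is lowercase, so capitalized tokens such as 'The' and punctuation tokens pass through the filter. A minimal sketch of a case-insensitive variant follows. The small stop_words set here is a hypothetical stand-in for set(stopwords.words("english")), used only so the example runs without the corpus download.

```python
import string

# Hypothetical stand-in for set(stopwords.words("english")).
stop_words = {"the", "is", "a", "of", "to", "an", "and"}

tokens = ["The", "students", "went", "to", "an", "NLP", "course", "."]


def is_punctuation(token):
    """Return True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


# Lowercase each token for the lookup only; the original casing is kept.
filtered = [
    tok for tok in tokens
    if tok.lower() not in stop_words and not is_punctuation(tok)
]
print(filtered)  # ['students', 'went', 'NLP', 'course']
```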

## Frequency Distribution

In [167]:
from nltk.probability import FreqDist
freq_dist_of_words = FreqDist(tokenized_word)
print(freq_dist_of_words)

<FreqDist with 69 samples and 110 outcomes>


In [168]:
freq_dist_of_words.most_common(10)

[('the', 6),
 ('of', 5),
 ('and', 5),
 ('.', 5),
 ('language', 4),
 (',', 4),
 ('to', 4),
 ('The', 4),
 ('NLP', 3),
 ('documents', 3)]

In [169]:
freq_dist_of_words_cleaned = FreqDist(filtered_text)
print(freq_dist_of_words_cleaned)

<FreqDist with 53 samples and 73 outcomes>


In [170]:
freq_dist_of_words_cleaned.most_common(10)

[('.', 5),
 ('language', 4),
 (',', 4),
 ('The', 4),
 ('NLP', 3),
 ('documents', 3),
 ('computer', 2),
 ('computers', 2),
 ('course', 2),
 ('Natural', 1)]
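
In the printed summaries above, "samples" is the number of distinct tokens and "outcomes" is the total number of tokens counted. The cleaned counts still contain punctuation and the capitalized 'The'. The sketch below, on a short token list taken from the last two sentences of the sample text, shows one way to normalize tokens before counting; dropping non-alphabetic tokens is an assumption about what is wanted here.

```python
from nltk.probability import FreqDist

tokens = ["The", "students", "went", "to", "an", "NLP", "course", ".",
          "The", "student", "goes", "to", "an", "NLP", "course", "."]

# Keep alphabetic tokens only and fold case so "The" and "the" share a count.
normalized = [tok.lower() for tok in tokens if tok.isalpha()]
freq = FreqDist(normalized)

print(freq.N())      # total tokens counted ("outcomes"): 14
print(freq.B())      # distinct tokens ("samples"): 9
print(freq["nlp"])   # 2
```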

## Stemming

Stemming is a technique for reducing a word, most often a verb, to its root form, or stem.
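
As a small illustration of the idea, the sketch below stems a family of related words. The word list is an illustrative assumption and is not taken from the sample text.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected and derived forms of the same word (illustrative list).
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(word) for word in words]

# Every form collapses to the same stem.
print(stems)
```

A stem does not have to be a dictionary word; the stemmer only strips suffixes by rule, which is why the output further down contains forms like "languag" and "comput".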

In [171]:
from nltk.stem import PorterStemmer
pstemmer = PorterStemmer()

In [172]:
for word in filtered_text:
    print(word + '--->' + pstemmer.stem(word))

Natural--->natur
language--->languag
processing--->process
(--->(
NLP--->nlp
)--->)
subfield--->subfield
linguistics--->linguist
,--->,
computer--->comput
science--->scienc
,--->,
artificial--->artifici
intelligence--->intellig
🤖--->🤖
concerned--->concern
interactions--->interact
computers--->comput
human--->human
language--->languag
,--->,
particular--->particular
program--->program
computers--->comput
process--->process
analyze--->analyz
large--->larg
amounts--->amount
natural--->natur
language--->languag
data--->data
.--->.
The--->the
goal--->goal
computer--->comput
capable--->capabl
``--->``
understanding--->understand
''--->''
contents--->content
documents--->document
,--->,
including--->includ
contextual--->contextu
nuances--->nuanc
language--->languag
within--->within
.--->.
The--->the
technology--->technolog
accurately--->accur
extract--->extract
information--->inform
insights--->insight
contained--->contain
documents--->document
well--->well
categorize--->categor
organize--->o

In [173]:
from nltk.stem.snowball import SnowballStemmer
snow_stem = SnowballStemmer(language='english')

In [174]:
for word in filtered_text:
    print(word + '--->' + snow_stem.stem(word))

Natural--->natur
language--->languag
processing--->process
(--->(
NLP--->nlp
)--->)
subfield--->subfield
linguistics--->linguist
,--->,
computer--->comput
science--->scienc
,--->,
artificial--->artifici
intelligence--->intellig
🤖--->🤖
concerned--->concern
interactions--->interact
computers--->comput
human--->human
language--->languag
,--->,
particular--->particular
program--->program
computers--->comput
process--->process
analyze--->analyz
large--->larg
amounts--->amount
natural--->natur
language--->languag
data--->data
.--->.
The--->the
goal--->goal
computer--->comput
capable--->capabl
``--->``
understanding--->understand
''--->''
contents--->content
documents--->document
,--->,
including--->includ
contextual--->contextu
nuances--->nuanc
language--->languag
within--->within
.--->.
The--->the
technology--->technolog
accurately--->accur
extract--->extract
information--->inform
insights--->insight
contained--->contain
documents--->document
well--->well
categorize--->categor
organize--->o
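
On this text the visible output of the two stemmers is the same, but they do not always agree. The sketch below compares Snowball's English stemmer with Snowball's own version of the original Porter algorithm, selected with SnowballStemmer("porter"). That is used here as a stand-in and may not match the PorterStemmer class above in every case. The test words are illustrative.

```python
from nltk.stem.snowball import SnowballStemmer

english = SnowballStemmer("english")  # Snowball's revised English rules
porter = SnowballStemmer("porter")    # Snowball's version of original Porter

results = {}
for word in ["generously", "running"]:
    results[word] = (english.stem(word), porter.stem(word))
    print(word, "--->", results[word])
```

For "generously" the English stemmer keeps "generous" while the Porter rules cut further to "gener"; for "running" both give "run".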

## Lemmatization

In [175]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [176]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [199]:
text1 = """He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun. He went to the pool at 2pm"""

In [200]:
lemmatized_words_list = []

In [201]:

tokenized_word = word_tokenize(text1)
for each_word in tokenized_word:
    lem_word = lemmatizer.lemmatize(each_word)
    lemmatized_words_list.append(lem_word)    

In [202]:
print('Text with Stop Words: {}'.format(tokenized_word))
print('Lemmatized Words list {}'.format(lemmatized_words_list))

Text with Stop Words: ['He', 'was', 'running', 'and', 'eating', 'at', 'same', 'time', '.', 'He', 'has', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hours', 'in', 'the', 'Sun', '.', 'He', 'went', 'to', 'the', 'pool', 'at', '2pm']
Lemmatized Words list ['He', 'wa', 'running', 'and', 'eating', 'at', 'same', 'time', '.', 'He', 'ha', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hour', 'in', 'the', 'Sun', '.', 'He', 'went', 'to', 'the', 'pool', 'at', '2pm']


In [203]:
lemmatized_words_list = []
for each_word in tokenized_word:
    lem_word_v = lemmatizer.lemmatize(each_word, pos="v")
    lemmatized_words_list.append(lem_word_v)    

In [204]:
print('Lemmatized Words list {}'.format(lemmatized_words_list))

Lemmatized Words list ['He', 'be', 'run', 'and', 'eat', 'at', 'same', 'time', '.', 'He', 'have', 'bad', 'habit', 'of', 'swim', 'after', 'play', 'long', 'hours', 'in', 'the', 'Sun', '.', 'He', 'go', 'to', 'the', 'pool', 'at', '2pm']
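
Passing pos="v" treats every token as a verb, which is why 'hours' stays unchanged here while the default (noun) pass reduced it to 'hour' but left 'was' and 'has' mangled. One common approach is to choose the pos argument per token from a PoS tag. The sketch below assumes Penn Treebank tags, as returned by nltk.pos_tag without a tagset argument. The helper names and the EchoLemmatizer stand-in are hypothetical; the stand-in only exists so the example runs without the WordNet download.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also the fallback for all other tags


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs using a per-token pos argument."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]


class EchoLemmatizer:
    """Hypothetical stand-in that reports the pos it was given."""

    def lemmatize(self, word, pos="n"):
        return "{}/{}".format(word, pos)


demo = lemmatize_tagged([("went", "VBD"), ("hours", "NNS")], EchoLemmatizer())
print(demo)  # ['went/v', 'hours/n']
```

With the real objects from this notebook the call would look like lemmatize_tagged(nltk.pos_tag(tokenized_word), lemmatizer), once the tagger data has been downloaded.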


## Parts Of Speech tagging

PoS tagging is the process of labeling each word in an input text with its part of speech. It identifies whether each word is a noun, pronoun, adjective, adverb, and so on.

In [205]:
nltk.download('universal_tagset')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [207]:
text = "I'm going to watch Silicon-Valley on HBO."
tokenized_word = word_tokenize(text)
nltk.pos_tag(tokenized_word, tagset='universal')

[('I', 'PRON'),
 ("'m", 'VERB'),
 ('going', 'VERB'),
 ('to', 'PRT'),
 ('watch', 'VERB'),
 ('Silicon-Valley', 'NOUN'),
 ('on', 'ADP'),
 ('HBO', 'NOUN'),
 ('.', '.')]
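
The tagged pairs are often easier to work with once grouped by tag. The sketch below does that with the output shown above, copied in as a literal list so it runs without the tagger; the variable names are illustrative.

```python
from collections import defaultdict

# (word, tag) pairs copied from the pos_tag output above.
tagged = [("I", "PRON"), ("'m", "VERB"), ("going", "VERB"), ("to", "PRT"),
          ("watch", "VERB"), ("Silicon-Valley", "NOUN"), ("on", "ADP"),
          ("HBO", "NOUN"), (".", ".")]

# Collect the words seen under each tag, preserving sentence order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(words_by_tag["NOUN"])  # ['Silicon-Valley', 'HBO']
print(words_by_tag["VERB"])  # ["'m", 'going', 'watch']
```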

## Named Entity Recognition

In [92]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

In [209]:
text = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement. Today APPLE has policies that are inclusive"
for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
ORGANIZATION APPLE
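
The loop above prints each entity as it is found. To reuse the results, the same walk can collect (label, text) pairs instead. The sketch below assumes the tree shape that ne_chunk returns: a top-level tree whose entity chunks are subtrees and whose other children are plain (word, tag) tuples. The function name is hypothetical, and the tree is built by hand so the example runs without the chunker models.

```python
from nltk.tree import Tree


def extract_entities(tree):
    """Return (label, text) pairs for every entity subtree of a chunk tree."""
    entities = []
    for node in tree:
        # Entity chunks are Tree objects; ordinary tokens are tuples.
        if isinstance(node, Tree):
            text = " ".join(token for token, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built tree mimicking part of the chunker output above.
sample = Tree("S", [
    Tree("GPE", [("WASHINGTON", "NNP")]),
    ("--", ":"),
    Tree("PERSON", [("Loretta", "NNP"), ("E.", "NNP"), ("Lynch", "NNP")]),
    ("spoke", "VBD"),
])
print(extract_entities(sample))
# [('GPE', 'WASHINGTON'), ('PERSON', 'Loretta E. Lynch')]
```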


## WordNet

In [210]:
from nltk.corpus import wordnet
synonym = wordnet.synsets("good")
print(synonym)

[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]


In [216]:
print(synonym[5].definition())

having the normally expected amount
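
Note that wordnet.synsets returns synsets (word senses), not synonyms directly; the synonyms are the lemma names attached to each synset. The sketch below gathers them into one sorted list. The function name and the FakeSynset stand-in are hypothetical; the stand-in only mimics the lemma_names() method so the example runs without the WordNet download, and its lemma lists are made up for illustration.

```python
def synonyms_from_synsets(synsets):
    """Collect the distinct lemma names across a list of synsets."""
    names = set()
    for synset in synsets:
        for name in synset.lemma_names():
            # WordNet joins multi-word lemmas with underscores.
            names.add(name.replace("_", " "))
    return sorted(names)


class FakeSynset:
    """Hypothetical stand-in exposing lemma_names() like a real Synset."""

    def __init__(self, names):
        self._names = names

    def lemma_names(self):
        return list(self._names)


demo = synonyms_from_synsets([
    FakeSynset(["good", "well"]),
    FakeSynset(["well", "in_effect"]),
])
print(demo)  # ['good', 'in effect', 'well']
```

With the real corpus the call would be synonyms_from_synsets(wordnet.synsets("good")).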
