# **Natural Language Processing [NLP]**

## ***NLP Basics***

> ### ***Natural Language Toolkit [NLTK]***
>
>**Installation:**
>- conda install nltk (recommended)
>- pip install -U nltk

In [1]:
# Download NLTK Data

import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
# see all the functions, attributes and methods in the package "nltk"

dir(nltk)

['AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures',
 'BigramCollocationFinder',
 'BigramTagger',
 'BinaryMaxentFeatureEncoding',
 'BlanklineTokenizer',
 'BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'Boxer',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CFG',
 'CRFTagger',
 'CfgReadingCommand',
 'ChartParser',
 'ChunkParserI',
 'ChunkScore',
 'Cistem',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConcordanceIndex',
 'ConditionalExponentialClassifier',
 'ConditionalFreqDist',
 'ConditionalProbDist',
 'ConditionalProbDistI',
 'ConfusionMatrix',
 'ContextIndex',
 'ContextTagger',
 'ContingencyMeasures',
 'CoreNLPDependencyParser',
 'CoreNLPParser',
 'Counter',
 'CrossValidationProbDist',
 'DRS',
 'DecisionTreeClassifier',
 'DefaultTagger',
 'DependencyEvaluator',
 'DependencyGrammar',
 'DependencyGrap

**What can you do with NLTK?**  (one example, using stopwords)

In [3]:
# stopwords on NLTK

from nltk.corpus import stopwords    # import stopwords from the corpus module

stopwords.words('english')[0::20]       # every 20th English stopword; stopword lists for other languages are available in the same way

['i', 'himself', 'that', 'a', 'through', 'here', 'own', 're', 'ma']

> ### **Reading in text data & why do we need to clean it?**
> **Read in *Semi-Structured* text data**<hr>
> We will use a dataset from the UCI Machine Learning Repository. It is a collection of text messages, each labelled as either *spam* or *ham*.

In [5]:
# Read in the raw text (the with-block closes the file afterwards)
with open('SMSSpamCollection.tsv') as f:
    rawData = f.read()

# print the first 800 characters of the raw data
rawData[0:800]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aids patent.\nham\tI HAVE A DATE ON SUNDAY WITH WILL!!\nham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune\nspam\tWINNER!! As a valued network customer you have been selected to receivea £900 pr"

> Here **\t** and **\n** act as separators: **\t** separates each label from its message, and **\n** separates one record from the next.

In [6]:
# replace \t with \n. And split the text based on \n, which returns a list
parsedData = rawData.replace('\t','\n').split('\n')

In [11]:
parsedData[:10]    # labels and messages are now separate list items
# Labels sit at even indexes and messages at odd indexes -> some structure

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham',
 "Nah I don't think he goes to usf, he lives around here though",
 'ham',
 'Even my brother is not like to speak with me. They treat me like aids patent.',
 'ham',
 'I HAVE A DATE ON SUNDAY WITH WILL!!']

In [14]:
# extracting the labels and text messages in separate lists
labelList = parsedData[0::2]   
textList = parsedData[1::2]

print(labelList[0:5])
print(textList[0:5])

['ham', 'spam', 'ham', 'ham', 'ham']
["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']


> Now that we have separate lists for labels and text messages, we need to arrange them in a structured format. <br>
> So create a dataframe for this purpose.

In [26]:
# check that both lists have the same length
assert len(labelList) == len(textList), 'Length of label list and text list is not the same. They must match to create a dataframe'

AssertionError: Length of label list and text list is not the same. They must match to create a dataframe

In [27]:
print(len(labelList))
print(len(textList))

5571
5570


In [28]:
#debugging the length mismatch problem
print(labelList[-5:])

['ham', 'ham', 'ham', 'ham', '']


In [29]:
# the trailing '' comes from the newline at the end of the file,
# so drop the last element of the label list when creating the dataframe
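
> The extra `''` at the end of `labelList` appears because splitting on `'\n'` leaves one empty string after the newline that ends the file. A minimal sketch of an alternative parse, run on a short inline sample (`sample` is a hypothetical stand-in for `rawData`): `str.splitlines()` does not produce that trailing empty string, and splitting each record on its first tab keeps every label paired with its message.

```python
# Hypothetical three-record sample in the same "label<TAB>message<NEWLINE>" layout
sample = "ham\tNah I don't think he goes to usf\nspam\tWINNER!! Text FA to 87121\nham\tI HAVE A DATE ON SUNDAY WITH WILL!!\n"

# The approach used above: the final '\n' leaves a trailing empty string
flat = sample.replace('\t', '\n').split('\n')
print(flat[-1] == '')              # True

# Alternative: one record per line, split on the first tab only
records = [line.split('\t', 1) for line in sample.splitlines() if line]
labels = [label for label, _ in records]
texts = [text for _, text in records]

print(labels)                      # ['ham', 'spam', 'ham']
print(len(labels) == len(texts))   # True
```

> With this layout the two lists always have the same length, so no element needs to be dropped afterwards.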

In [31]:
import pandas as pd

df_fullCorpus = pd.DataFrame({
    'label': labelList[:-1],
    'body_list': textList
})

df_fullCorpus.head()

|   | label | body_list |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


> **And here we have a nice, clean version of our dataset**<br>
> This is also the ***structured version*** of the data.
> <hr>

> All of this can also be done by reading the dataset with the pandas **read_csv** function and passing ***`sep = '\t'`*** as the separator argument, since we already know that '\t' is the field separator.

In [33]:
# shortcut for above method
fullCorpus = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
fullCorpus.head()

|   | 0 | 1 |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


In [34]:
fullCorpus.columns = ['label', 'text_body']
fullCorpus.sample(5)

|   | label | text_body |
|---|---|---|
| 1978 | ham | Sorry, I'll call later in meeting any thing re... |
| 1714 | spam | WOW! The Boys R Back. TAKE THAT 2007 UK Tour. ... |
| 2889 | ham | Babe? You said 2 hours and it's been almost 4 ... |
| 315 | spam | December only! Had your mobile 11mths+? You ar... |
| 1026 | ham | Its good, we'll find a way |


> **Explore the dataset**

In [38]:
# What is the shape of the dataset?

print(f'Input data has {fullCorpus.shape[0]} rows and {fullCorpus.shape[1]} columns')

Input data has 5568 rows and 2 columns
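
> Note that `read_csv` reports 5568 rows, while the manual split above produced 5570 messages. One possible cause, stated here as an assumption that has not been checked against this file, is quote handling: by default a field that starts with `"` is read as a quoted field, so it can swallow a line break and merge two records. The sketch below reproduces the effect with the standard-library `csv` module on a hypothetical inline sample. If the same thing is happening here, passing `quoting=csv.QUOTE_NONE` to `pd.read_csv` would be the setting to try.

```python
import csv
import io

# Hypothetical sample: the first message opens a double quote that is only closed on the next line
sample = 'ham\t"hello\nspam\tworld" ok\nham\tbye\n'

# Default quoting: the quoted field runs across the line break, so two records merge
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: quote characters are treated as ordinary text
plain_rows = list(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))

print(len(default_rows))   # 2
print(len(plain_rows))     # 3
```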


In [54]:
# How many spam/ham are there?

print(f"Out of {len(fullCorpus)} rows, {dict(fullCorpus['label'].value_counts())['spam']} are spam, and {dict(fullCorpus['label'].value_counts())['ham']} are ham.")

# following syntax can also be used
# print(f"Out of {len(fullCorpus)} rows, {len(fullCorpus[fullCorpus['label']=='spam'])} are spam, and {len(fullCorpus[fullCorpus['label']=='ham'])} are ham")

Out of 5568 rows, 746 are spam, and 4822 are ham.
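
> In the output above, 746 of 5568 rows (about 13.4%) are spam, so the classes are imbalanced. The same kind of count can be sketched without pandas using `collections.Counter`; the `labels` list below is a hypothetical stand-in for the `label` column.

```python
from collections import Counter

# Hypothetical stand-in for fullCorpus['label']
labels = ['ham', 'spam', 'ham', 'ham', 'ham', 'spam']

counts = Counter(labels)
total = sum(counts.values())

print(f"Out of {total} rows, {counts['spam']} are spam, and {counts['ham']} are ham.")

# Share of each class, useful for spotting class imbalance
for label, n in counts.most_common():
    print(f"{label}: {n / total:.1%}")
```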


In [58]:
# How much missing data is there?

print(f"Out of {len(fullCorpus)} rows, {dict(fullCorpus.isnull().sum())['label']} labels are missing, and {dict(fullCorpus.isnull().sum())['text_body']} text_body are missing.")

Out of 5568 rows, 0 labels are missing, and 0 text_body are missing.


In [59]:
# many similar kinds of exploration can be done.

> ### **Regular Expressions**
> **Using regular expressions in Python**<br>
> Python's [**`re`** ***module***](https://docs.python.org/3/library/re.html) is the most commonly used **regex** resource

In [61]:
import re

re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This       is a made up        string to   test 2 different      regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test~~~~2"""""different-regex-methods'

> **Splitting a sentence into list of words (tokens)**<br>
> Using **`re.split()`**

In [62]:
re.split(r'\s', re_test)      # '\s' matches a single white-space character
# returns a clean list

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [65]:
re.split(r'\s', re_test_messy)
# not as clean as re_test, as this string contains many extra white-spaces
# '\s' matches one white-space character at a time, so every extra space yields an empty string

['This',
 '',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'string',
 'to',
 '',
 '',
 'test',
 '2',
 'different',
 '',
 '',
 '',
 '',
 '',
 'regex',
 'methods']

In [67]:
re.split(r'\s+', re_test_messy)
# adding '+' makes the pattern match one or more consecutive white-space characters
# so this returns clean tokens

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
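
> For plain white-space tokenization, `str.split()` with no arguments behaves much like `re.split(r'\s+', ...)`, with one difference worth knowing: it also drops the empty strings that leading or trailing white-space would otherwise produce. A small sketch on a hypothetical padded string:

```python
import re

padded = '   This   is a   made up string   '   # hypothetical example with leading/trailing spaces

regex_tokens = re.split(r'\s+', padded)
plain_tokens = padded.split()

print(regex_tokens)   # ['', 'This', 'is', 'a', 'made', 'up', 'string', '']
print(plain_tokens)   # ['This', 'is', 'a', 'made', 'up', 'string']
```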

In [69]:
re.split(r'\s+', re_test_messy1)
# this string contains no white-space, so splitting on '\s+' leaves it in one piece
# try a regex that deals with special characters instead

['This-is-a-made/up.string*to>>>>test~~~~2"""""different-regex-methods']

In [70]:
re.split(r'\W+', re_test_messy1)     # '\W' matches non-word characters
# keep the '+' to handle runs of multiple special characters
# returns a clean list of tokens this time

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [71]:
re.split(r'\W+', re_test)
# '\W+' works on the other examples too, as white-space is also a non-word character.

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

> Using **`re.findall()`**

In [72]:
re.findall(r'\S+', re_test)     # '\S+' looks for one or more non-white-space characters
# returns a clean list

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [74]:
re.findall(r'\S+', re_test_messy)
# the same '\S+' works well on re_test_messy too

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [76]:
re.findall(r'\S+', re_test_messy1)
# '\S+' doesn't help here: the string has no white-space, so it is one long match
# try searching for word characters instead

['This-is-a-made/up.string*to>>>>test~~~~2"""""different-regex-methods']

In [78]:
re.findall(r'\w+', re_test_messy1)    # '\w+' matches runs of word characters
# returns a clean list of tokens

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [81]:
re.findall(r'\w+', re_test)
# this gives the same result as the two examples above, so '\w+' can be used in [almost] all cases to tokenize both messy and simple sentences

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
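
> `re.split(r'\W+', ...)` and `re.findall(r'\w+', ...)` gave the same tokens on the three test strings, but they are not interchangeable in general. A brief sketch on a hypothetical sentence that ends with punctuation and contains apostrophes:

```python
import re

sentence = "Don't stop, it's working!"   # hypothetical example

split_tokens = re.split(r'\W+', sentence)
found_tokens = re.findall(r'\w+', sentence)

print(split_tokens)   # ['Don', 't', 'stop', 'it', 's', 'working', '']
print(found_tokens)   # ['Don', 't', 'stop', 'it', 's', 'working']
```

> `re.split` leaves an empty string when the text starts or ends with a non-word character, and both patterns break contractions at the apostrophe.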

> **Replacing a specific string**<br>
> use **`re.findall()`** and **`re.sub()`** for this replacing task

In [82]:
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'


# the task is to replace PEP8 and misspelled variants such as PEP7 and PEEP8 with 'PEP8 Python Style Guide'

In [83]:
import re

re.findall('[A-Z0-9]+', pep8_test)   # captures every run of capital letters and/or digits

['I', 'PEP8']

In [86]:
re.findall('[A-Z]+[0-9]+', pep8_test)     # captures only capital letters immediately followed by digits
# this satisfies our task, so we will use this search pattern

['PEP8']

In [88]:
re.findall('[A-Z]+[0-9]+', pep7_test)
#the same pattern also captures PEP7 properly

['PEP7']

In [91]:
re.findall('[A-Z]+[0-9]+', peep8_test)
# the same pattern also captures PEEP8, so we will settle on this pattern
# it handles all three test strings; now use it to replace.

['PEEP8']

> Use **`re.sub()`** to replace, with the same ***regex pattern*** that worked with `re.findall()`

In [93]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Style Guide', pep8_test)      #find PEP8/PEP7/PEEP8 and replace with PEP8 Python Style Guide in pep8_test

'I try to follow PEP8 Python Style Guide guidelines'

In [94]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Style Guide', pep7_test)

'I try to follow PEP8 Python Style Guide guidelines'

In [96]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Style Guide', peep8_test)
# so our regex does the job for all three test strings

'I try to follow PEP8 Python Style Guide guidelines'
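
> The pattern `[A-Z]+[0-9]+` handles the three test strings, but it matches any run of capital letters followed by digits, so it would also rewrite unrelated tokens. A hedged sketch of that limitation and one narrower alternative (the extra test sentence and the `PE+P` pattern are illustrative assumptions, not part of the original task):

```python
import re

broad = re.compile(r'[A-Z]+[0-9]+')
narrow = re.compile(r'\bPE+P[0-9]+\b')   # 'P', one or more 'E', 'P', then digits
replacement = 'PEP8 Python Style Guide'

unrelated = 'I print on A4 paper'        # hypothetical sentence that should stay unchanged

print(broad.sub(replacement, unrelated))    # I print on PEP8 Python Style Guide paper
print(narrow.sub(replacement, unrelated))   # I print on A4 paper

# The narrower pattern still rewrites all three original test strings
tests = ['I try to follow PEP8 guidelines',
         'I try to follow PEP7 guidelines',
         'I try to follow PEEP8 guidelines']
results = [narrow.sub(replacement, text) for text in tests]
print(results[0])   # I try to follow PEP8 Python Style Guide guidelines
```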

> **Other useful regex functions (among many more):**
> - `re.search()`
> - `re.match()`
> - `re.fullmatch()`
> - `re.escape()`
> - `re.finditer()`

> ### Implementing a pipeline to **Clean Text**
> <hr>

> **Pre-processing text data.**<br>
> Cleaning the text is necessary to highlight the attributes that you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:
> - **Remove Punctuation**
> - **Tokenization**
> - **Removing stopwords**
> - **Lemmatize/Stem**

In [99]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)     # limit the displayed column width to 100 characters

data = pd.read_csv('SMSSpamCollection.tsv', sep= '\t', header=None)
data.columns = ['label', 'text_body']
data.head()

|   | label | text_body |
|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


> **Remove Punctuation**<br>
> We need to remove punctuation because Python does not know that these characters are punctuation; it treats them like any other character.

In [100]:
import string       # the string module provides a string of punctuation characters
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [103]:
'I like NLP.' == 'I like NLP'
# Python compares strings character by character, so it doesn't see that these sentences have the same meaning.
# Thus we need to remove punctuation such as the period.

False

In [112]:
# function to remove punctuation
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])    # keep every character that is not punctuation, then join them back into one string
    return text_nopunct

# apply() calls the lambda, and therefore 'remove_punct', on each row of the 'text_body' column.
data['text_body_clean'] = data['text_body'].apply(lambda x: remove_punct(x))

data.head()

|   | label | text_body | text_body_clean |
|---|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL |
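
> An alternative sketch of the same step uses `str.translate` with a deletion table built by `str.maketrans`, which avoids the per-character Python loop. `remove_punct_translate` is a name chosen here for illustration.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_translate(text):
    return text.translate(PUNCT_TABLE)

print(remove_punct_translate("Nah I don't think he goes to usf, he lives around here though"))
# Nah I dont think he goes to usf he lives around here though
```

> Like the list-comprehension version, this only strips ASCII punctuation; characters such as `£` are left in place.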


> **Tokenization**<br>
> Tokenizing is splitting some string or sentence into a list of words.

In [118]:
import re

# function for tokenization
def tokenize(text):
    tokens = re.split(r'\W+', text)      # '\W+' matches runs of non-word characters
    return tokens

# apply 'tokenize' to the 'text_body_clean' column using a lambda function.
data['text_body_tokenized'] = data['text_body_clean'].apply(lambda x: tokenize(x.lower()))    # convert to lower case, as Python string comparison is case-sensitive.
# Python treats 'NLP' and 'nlp' as different strings, hence the text is converted to lower case

data.head()

|   | label | text_body | text_body_clean | text_body_tokenized |
|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] |


In [121]:
'NLP' == 'nlp'     # thus, converting to lower case is necessary, as we want 'NLP' and 'nlp' to count as the same token.

False

> **Remove Stopwords**<br>
> Stopwords are very common words (such as *the*, *is*, *and*) that add little meaning on their own and only add extra tokens for our model to process, hence we remove them.<br>
> Use `nltk` for removing stopwords

In [122]:
import nltk

stopwords = nltk.corpus.stopwords.words('english')

In [123]:
# function to remove stopwords
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopwords]
    return text

# apply this function to each row of the dataset's 'text_body_tokenized' column
data['text_body_nostop'] = data['text_body_tokenized'].apply(lambda x: remove_stopwords(x))
data.head()


|   | label | text_body | text_body_clean | text_body_tokenized | text_body_nostop |
|---|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... | [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] | [nah, dont, think, goes, usf, lives, around, though] |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] | [even, brother, like, speak, treat, like, aids, patent] |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] | [date, sunday] |
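
> Two details are worth noting. Membership tests against a list scan it item by item, so converting the stopwords to a `set` makes each lookup faster. Also, because punctuation was stripped before this step, contractions such as `dont` (row 2 above) no longer match any stopword entry spelled with an apostrophe. A hedged sketch of both points, using a small hypothetical stopword list and helper names chosen here for illustration:

```python
import string

# Small hypothetical stopword list; NLTK's English list would normally be used instead
stopword_list = ['i', 'he', 'to', 'here', "don't"]

def strip_punct(word):
    return ''.join(ch for ch in word if ch not in string.punctuation)

# Normalise the stopwords the same way as the text, and store them in a set
stopword_set = {strip_punct(w) for w in stopword_list}

def remove_stopwords_fast(tokenized_list):
    return [word for word in tokenized_list if word not in stopword_set]

tokens = ['nah', 'i', 'dont', 'think', 'he', 'goes', 'to', 'usf',
          'he', 'lives', 'around', 'here', 'though']
print(remove_stopwords_fast(tokens))
# ['nah', 'think', 'goes', 'usf', 'lives', 'around', 'though']
```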


Thus we have covered the basics of NLP, regex, and text cleaning. More advanced functions are needed to clean the text further.

In [130]:
# store this basic cleaned data as tsv file for further use
data.to_csv('Basic_cleaned_SMS.tsv', sep='\t', index=False, header=False)
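
> One caveat when reloading this file: the token columns hold Python lists, and writing them to a TSV stores their text representation, so they come back as strings. A minimal sketch of the round trip with the standard library (an in-memory buffer stands in for the file, and `ast.literal_eval` is one possible way to restore the lists):

```python
import ast
import csv
import io

row = ['ham', 'I HAVE A DATE ON SUNDAY WITH WILL!!', ['date', 'sunday']]

# Write one row as TSV; the list is stored as its string form
buffer = io.StringIO()
csv.writer(buffer, delimiter='\t').writerow(row)

# Read it back: every field is now a string
buffer.seek(0)
label, text, tokens_as_text = next(csv.reader(buffer, delimiter='\t'))
print(type(tokens_as_text).__name__, tokens_as_text)   # str ['date', 'sunday']

# Parse the string back into a real list
tokens = ast.literal_eval(tokens_as_text)
print(tokens[0])   # date
```

> Because the file is written with `header=False`, it also needs `header=None` (and the column names reassigned) when it is read back with `pd.read_csv`.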