# Author: Valinteshley Pierre
## Natural Language Processing Essential Training
### Date: 10/25/2023
#### Course from LinkedIn Learning


**Download NLTK Data**

In [36]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [None]:
dir(nltk)

**What can you do with NLTK**

In [38]:
from nltk.corpus import stopwords

stopwords.words('english')[0:500:25] # every 25th entry of the English stopword list; stopwords are frequent words that contribute little to the meaning of a sentence

['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']

**Read in semi-structured text data**

In [39]:
# Read in the raw text; the with-block closes the file once it has been read
with open("SMSSpamCollection.tsv") as f:
    rawData = f.read()

# Print the raw data
rawData[0:500]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"

In [40]:
# Replace each tab with a newline, then split on newlines to get a flat list
parsedData = rawData.replace('\t', '\n').split('\n') # split the string at every occurrence of a given character and return a list
parsedData[0:5]


['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham']

In [41]:
labelList = parsedData[0::2] # every even-indexed item is a label ('ham' or 'spam')
textList = parsedData[1::2] # every odd-indexed item is a message body
print(labelList[0:5])
print("=============")
print(textList[0:5])

['ham', 'spam', 'ham', 'ham', 'ham']
["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']


In [77]:
import pandas as pd

# fullCorpus = pd.DataFrame({
#     'label': labelList,
#     'body_list': textList,
# })

# fullCorpus.head() # error: the two lists differ in length, so the DataFrame cannot be built

In [44]:
# get the length of the lists
print(len(labelList))
print(len(textList))
print("=============")
print(labelList[-5:]) # print out the last 5 items of a list

5571
5570
['ham', 'ham', 'ham', 'ham', '']
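
The extra empty label most likely appears because the file ends with a newline, so the final split yields an empty string. Here is a minimal sketch of parsing line by line instead. It assumes a small inline `sample` string in place of SMSSpamCollection.tsv, and all variable names are illustrative:

```python
# Illustrative sample standing in for the file contents (assumption);
# note the trailing newline.
sample = "ham\tSee you at 5\nspam\tWin a free prize now\nham\tOk\n"

# The approach used above: the trailing newline leaves an empty string at the end.
parsed = sample.replace('\t', '\n').split('\n')
print(parsed[-1] == '')  # True

# Alternative: drop the trailing newline, split into lines,
# then split each line once on the first tab.
labels, texts = [], []
for line in sample.rstrip('\n').split('\n'):
    label, _, text = line.partition('\t')
    labels.append(label)
    texts.append(text)

print(labels)
print(texts)
print(len(labels) == len(texts))
```

Parsing each line separately keeps every label paired with its message, so the two lists come out the same length without trimming.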


In [46]:
fullCorpus = pd.DataFrame({
    'label': labelList[:-1], # drop the trailing empty string so both lists have the same length
    'body_list': textList,
})

fullCorpus.head()

Unnamed: 0,label,body_list
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [47]:
dataset = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)
dataset

Unnamed: 0,0,1
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!
...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u...
5564,ham,Will ü b going to esplanade fr home?
5565,ham,"Pity, * was in mood for that. So...any other s..."
5566,ham,The guy did some bitching but I acted like i'd...


In [48]:
fullCorpus.columns = ['Label', 'body_text']
fullCorpus.head()

Unnamed: 0,Label,body_text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


**Explore the dataset**

In [34]:
# What is the shape of the dataset?

print("Input data has {} rows and {} columns".format(len(fullCorpus), len(fullCorpus.columns)))

Input data has 5570 rows and 2 columns


In [49]:
# How many spam/ham messages are there?

print('Out of {} rows, {} are spam, {} are ham'.format(len(fullCorpus),
                                                       len(fullCorpus[fullCorpus['Label']=='spam']), # filter the dataset to the rows labeled 'spam'
                                                       len(fullCorpus[fullCorpus['Label']=='ham'])))

Out of 5570 rows, 746 are spam, 4824 are ham
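
The same tally can be sketched without filtering the DataFrame twice. This version uses `collections.Counter` on a short illustrative label list (an assumption standing in for the `Label` column):

```python
from collections import Counter

# Illustrative labels (assumption); in the notebook these would come from fullCorpus['Label'].
labels = ['ham', 'spam', 'ham', 'ham', 'ham']

counts = Counter(labels)      # maps each label to its number of occurrences
total = sum(counts.values())

print('Out of {} rows, {} are spam, {} are ham'.format(total, counts['spam'], counts['ham']))
```

With pandas, `fullCorpus['Label'].value_counts()` should give the equivalent per-label counts in one call.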


In [50]:
# How much missing data is there? If there is text missing, we won't be able to use it to build the model

print('Number of null in label: {}'.format(fullCorpus['Label'].isnull().sum()))
print('Number of null in text: {}'.format(fullCorpus['body_text'].isnull().sum()))

# Data exploration

Number of null in label: 0
Number of null in text: 0


## **Regular Expression**

A regular expression is a text string that describes a search pattern.

Python's 're' package is the most commonly used regex resource.

Let's learn how to use regular expressions.

In [51]:
import re

re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This       is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2""""""different-regex-methods'

**Splitting a sentence into a list of words**

Split the sentence into a list of words so that Python knows which units it needs to look at.

Python splits the string into 'tokens' (words) so that the model can learn how the tokens relate to the response variable.

The split method lets us tokenize by matching the characters that separate the words.

In [52]:
re.split(r'\s', re_test) # split on each single whitespace character; the raw string keeps \s intact for the regex engine

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [53]:
re.split(r'\s', re_test_messy) # each extra space produces an empty string

['This',
 '',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']

In [54]:
re.split(r'\s+', re_test_messy) # \s+ matches one or more whitespace characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [56]:
re.split(r'\s+', re_test_messy1) # no split: the string contains no whitespace, so searching for whitespace isn't always sufficient

['This-is-a-made/up.string*to>>>>test----2""""""different-regex-methods']

In [58]:
re.split(r'\W+', re_test_messy1) # split on runs of one or more non-word characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [59]:
re.findall(r'\S+', re_test) # find runs of one or more non-whitespace characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [60]:
re.findall(r'\S+', re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [61]:
re.findall(r'\S+', re_test_messy1) # the dashes and other symbols still count as non-whitespace characters

['This-is-a-made/up.string*to>>>>test----2""""""different-regex-methods']

In [62]:
re.findall(r'\w+', re_test_messy1) # lowercase 'w' matches runs of one or more word characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
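
The two approaches differ when the text starts or ends with a separator. A short sketch with an illustrative string (not from the dataset):

```python
import re

text = 'Hello, world!'

# Splitting on non-word characters leaves an empty string when the text
# ends (or starts) with a separator.
split_tokens = re.split(r'\W+', text)
print(split_tokens)   # ['Hello', 'world', '']

# Finding runs of word characters never produces empty strings.
found_tokens = re.findall(r'\w+', text)
print(found_tokens)   # ['Hello', 'world']
```

This matters for tokenizing real messages, which often end in punctuation: `re.split` may return a trailing empty token that has to be filtered out, while `re.findall` does not.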

**Replacing a specific string**

Come up with a pattern that captures not only 'PEP8' but also the mistakes 'PEP7' and 'PEEP8'.

In [63]:
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [65]:
# import re

re.findall('[a-z]+', pep8_test)

['try', 'to', 'follow', 'guidelines']

In [66]:
re.findall('[A-Z]+', pep8_test)

['I', 'PEP']

In [68]:
re.findall('[A-Z]+[0-9]+', pep8_test)

['PEP8']

In [70]:
re.findall('[A-Z]+[0-9]+', peep8_test)

['PEEP8']

In [75]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', pep8_test)

'I try to follow PEP8 Python Styleguide guidelines'

In [76]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', pep7_test)

'I try to follow PEP8 Python Styleguide guidelines'
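
The pattern `[A-Z]+[0-9]+` matches any run of capital letters followed by digits, not only misspellings of PEP8. As a hedged sketch, a tighter pattern (an assumption, not part of the course material) can be compared against it on the three test strings plus one unrelated sentence:

```python
import re

tests = [
    'I try to follow PEP8 guidelines',
    'I try to follow PEP7 guidelines',
    'I try to follow PEEP8 guidelines',
]

# Assumed tighter pattern: 'P', one or more 'E', 'P', then a single digit.
pattern = r'\bPE+P[0-9]\b'

fixed = [re.sub(pattern, 'PEP8 Python Styleguide', t) for t in tests]
for line in fixed:
    print(line)

# The broader pattern also rewrites unrelated tokens such as 'MP3'.
broad = re.sub(r'[A-Z]+[0-9]+', 'PEP8 Python Styleguide', 'I play MP3 files')
tight = re.sub(pattern, 'PEP8 Python Styleguide', 'I play MP3 files')
print(broad)
print(tight)
```

The trade-off is the usual one: a broad pattern catches more typos but also more false positives, so it is worth testing a pattern against text it should leave alone.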

## **Implementing a pipeline to clean text**

**Pre-processing text data** 

Cleaning up the text data is necessary to highlight the attributes that you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:

1. **Remove punctuation**
2. **Tokenization**
3. **Remove stopwords**
4. Lemmatize/Stem

In [78]:
pd.set_option('display.max_colwidth', 100) # how many characters we can see in a pandas df when we print it out; default is 50

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [79]:
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [81]:
# build function to remove punctuation using list comprehension

def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation]) # join with no space in between the characters
    return text_nopunct

# create new column for the cleaned up text
data['body_text_clean'] = data['body_text'].apply(lambda x: remove_punct(x)) # apply func to each row in body_text

data.head()

Unnamed: 0,label,body_text,body_text_clean
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL
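
An alternative to the character-by-character list comprehension is `str.translate` with a deletion table. A minimal sketch follows; the function name `remove_punct_translate` is hypothetical and it covers only the ASCII punctuation in `string.punctuation`:

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_translate(text):
    # Hypothetical alternative to remove_punct; same result for ASCII punctuation.
    return text.translate(PUNCT_TABLE)

cleaned = remove_punct_translate("Nah I don't think he goes to usf, he lives around here though")
print(cleaned)
```

Both versions produce the same cleaned text. The translate-based one builds the table once and avoids a Python-level loop over characters.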


**Tokenization**



In [83]:
def tokenize(text):
    tokens = re.split(r'\W+', text) # split on runs of non-word characters
    return tokens

# to apply to our dataset use the lambda function

data['body_text_tokenize'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower())) # apply tokenize to each row in body_text_clean

data.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenize
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]"


**Remove stopwords**

In [84]:
stopword = nltk.corpus.stopwords.words('english')

In [85]:
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword] # cycle through our list of tokens and check to see if it's a stopword
    return text

data['body_text_nostop'] = data['body_text_tokenize'].apply(lambda x: remove_stopwords(x))
data.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenize,body_text_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]"
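
Checking `word not in stopword` against a list scans the list for every token. A sketch of the same filter using a set is shown here, with a tiny illustrative stopword list (an assumption; the notebook uses the full NLTK English list) and a hypothetical function name:

```python
# Tiny illustrative stopword list (assumption), standing in for the NLTK list.
stopword_list = ['i', 'have', 'a', 'on', 'with', 'will', 'he', 'to']
stopword_set = set(stopword_list)  # set membership checks do not scan the whole list

def remove_stopwords_set(tokenized_list):
    # Hypothetical variant of remove_stopwords that checks against a set.
    return [word for word in tokenized_list if word not in stopword_set]

tokens = ['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']
filtered = remove_stopwords_set(tokens)
print(filtered)
```

The result is the same as with the list. The difference is only in lookup cost, which adds up across thousands of messages.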


# **Chapter 2: Supplemental Data Cleaning**

## **Stemming**

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root.

In [86]:
ps = nltk.PorterStemmer()
dir(ps)

['MARTIN_EXTENSIONS',
 'NLTK_EXTENSIONS',
 'ORIGINAL_ALGORITHM',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_apply_rule_list',
 '_contains_vowel',
 '_ends_cvc',
 '_ends_double_consonant',
 '_has_positive_measure',
 '_is_consonant',
 '_measure',
 '_replace_suffix',
 '_step1a',
 '_step1b',
 '_step1c',
 '_step2',
 '_step3',
 '_step4',
 '_step5a',
 '_step5b',
 'mode',
 'pool',
 'stem',
 'vowels']

In [87]:
# Examples of how stemming works
print(ps.stem('grows'))
print(ps.stem('growing'))
print(ps.stem('grow'))

grow
grow
grow


In [88]:
print(ps.stem('run'))
print(ps.stem('running'))
print(ps.stem('runner'))

run
run
runner
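
To illustrate the idea of suffix stripping, here is a toy stemmer. It is explicitly not the Porter algorithm, only a sketch with two hard-coded suffixes and an arbitrary minimum stem length of three characters:

```python
def naive_stem(word):
    # Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
    for suffix in ('ing', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['grows', 'growing', 'grow', 'running']:
    print(w, '->', naive_stem(w))
```

The toy version turns 'running' into 'runn', while the Porter stemmer output above gives 'run'. Real stemmers apply additional rules on top of plain suffix removal to handle cases like this.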


**Read in raw text**

In [91]:
# same code used earlier to load the .tsv file
pd.set_option('display.max_colwidth', 100) # how many characters we can see in a pandas df when we print it out; default is 50

stopwords = nltk.corpus.stopwords.words('english')

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)
data.columns = ['label', 'body_text']

data.head()


Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


**Clean up text**


In [92]:
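def clean_text(text):
    text = "".join([char for char in text if char not in string.punctuation]) # strip punctuation character by character
    tokens = re.split(r'\W+', text) # tokenize on runs of non-word characters
    text = [word for word in tokens if word not in stopwords] # drop stopwords
    return text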

data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))
data.head()

Unnamed: 0,label,body_text,body_text_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]"


**Stem Text**

In [None]:
def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text] # apply the stemmer to each token from body_text_nostop and collect the stemmed words
    