## NLP with Python and Machine Learning
### Chapters 1-2: NLP Basics and Data Cleaning
[Tutorial link here](https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/nltk-setup-and-overview?autoAdvance=true&autoSkip=true&autoplay=true&resume=false)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
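import nltk
# With no arguments, nltk.download() opens the interactive downloader.
# nltk.download('stopwords') would fetch only the stop-word corpus used below.
nltk.download()
# dir(nltk)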

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

#### What can you do with nltk?

In [37]:
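from nltk.corpus import stopwords as nltk_stopwords

# Keep the corpus module under its own name so that re-running this cell
# does not fail after `stopwords` has been rebound to a list.
stopwords = nltk_stopwords.words('english')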

#### NLP Basics: Reading in text data and cleaning it

In [5]:
# Use a context manager so the file handle is closed after reading
with open('Ex_Files_NLP_Python_ML_EssT/Exercise Files/Ch01/01_03/Start/SMSSpamCollection.tsv') as f:
    rawData = f.read()

rawData[:500]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"

#### Reading the data: first-pass processing steps
* Split the raw string into label and text fields
* Make text and label lists
* Convert to dataframe


In [6]:
# Turn tabs into newlines so that one split yields alternating labels and messages
parsedData = rawData.replace('\t', '\n').split('\n')
parsedData[:5]

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham']

In [7]:
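# After the split, labels sit at even indices and messages at odd indices
labelList = parsedData[0::2]
textList = parsedData[1::2]

print(len(labelList))
print(len(textList))
print(labelList[-5:])  # the last label is an empty string (from the trailing newline)

labelList = labelList[:-1]  # drop it so both lists have the same length
print(len(labelList))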

5571
5570
['ham', 'ham', 'ham', 'ham', '']
5570
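
The slicing above works because every record holds exactly one tab, and it needs the manual fix for the trailing empty label. A minimal sketch of a line-by-line alternative, assuming the same `label<TAB>text` layout; `parse_records` and `sample` are hypothetical names used only for illustration:

```python
def parse_records(raw):
    """Split raw 'label<TAB>text' lines into parallel label and text lists."""
    labels, texts = [], []
    for line in raw.split('\n'):
        if not line.strip():
            continue  # skip blank lines, including the one after a trailing newline
        # partition splits on the first tab only, so tabs inside a message survive
        label, _, text = line.partition('\t')
        labels.append(label)
        texts.append(text)
    return labels, texts

sample = "ham\tNah I don't think he goes to usf\nspam\tFree entry in 2 a wkly comp\n"
labels, texts = parse_records(sample)
print(labels)
print(texts)
```

Because labels and texts are appended together, the two lists always have the same length and no trailing element needs to be removed.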


In [8]:
full_corpus = pd.DataFrame({
    'label': labelList,
    'text': textList,
})

full_corpus.head()

| | label | text |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


Alternatively, pandas can read the tab-separated file directly:

In [9]:
dataset = pd.read_csv('Ex_Files_NLP_Python_ML_EssT/Exercise Files/Ch01/01_03/Start/SMSSpamCollection.tsv',
                      sep='\t', header=None)

# The file has no header row, so name the columns explicitly
dataset.columns = ['label', 'text']

#### Exploring the dataset
* Size and shape
* Count of ham/spam
* Missing data?

In [10]:
print(dataset.shape)

# Count messages per class
label_counts = dataset['label'].value_counts()
nham, nspam = label_counts['ham'], label_counts['spam']
print(f'Count(ham) vs count(spam): {nham} vs {nspam}')

# Check for missing labels
nmiss = dataset['label'].isna().sum()
print(f'Missing label count: {nmiss}')

(5568, 2)
Count(ham) vs count(spam): 4822 vs 746
Missing label count: 0


#### Aside: Using regular expressions
* tokenization
* replacement

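In a regex, `\s` matches a single whitespace character and `\w` matches a single word character (letter, digit, or underscore). The uppercase forms `\S` and `\W` match the opposite sets, and a trailing `+` matches one or more in a row.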

In [21]:
import re

# Tokenization
str1 = 'This is a made-up string to try different regex methods'
str2 = 'This    is a made-up      string to try different regex.   methods'
str3 = 'This-is-a-really-messy<<<<<-string'

# Raw strings (r'...') keep the backslashes in regex patterns intact
print(str1.split(' '))
print(re.split(r'\s', str1))
print('================================================')

print(str2.split(' '))
print(re.split(r'\s', str2))
print(re.split(r'\s+', str2))
print('================================================')

print(str3.split(' '))
print(re.split(r'\s+', str3))
print(re.split(r'\W+', str3))

print('The same can be done with findall() instead of split()')

# Replacement
pep7 = 'I follow PEP7 guidelines'
pep8 = 'I follow PEP8 guidelines'
peep8 = 'I follow PEEP8 guidelines'
# Replace every 'PEP#'-style token (capital letters followed by digits) with 'PEP8 Python Style'

re.sub(r'[A-Z]+[0-9]+', 'PEP8 Python Style', pep7)

['This', 'is', 'a', 'made-up', 'string', 'to', 'try', 'different', 'regex', 'methods']
['This', 'is', 'a', 'made-up', 'string', 'to', 'try', 'different', 'regex', 'methods']
================================================
['This', '', '', '', 'is', 'a', 'made-up', '', '', '', '', '', 'string', 'to', 'try', 'different', 'regex.', '', '', 'methods']
['This', '', '', '', 'is', 'a', 'made-up', '', '', '', '', '', 'string', 'to', 'try', 'different', 'regex.', '', '', 'methods']
['This', 'is', 'a', 'made-up', 'string', 'to', 'try', 'different', 'regex.', 'methods']
================================================
['This-is-a-really-messy<<<<<-string']
['This-is-a-really-messy<<<<<-string']
['This', 'is', 'a', 'really', 'messy', 'string']
The same can be done with findall() instead of split()


'I follow PEP8 Python Style guidelines'
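
The cell above mentions `findall()` without showing it. A minimal sketch, reusing two of the test strings: where `re.split` names the separators, `re.findall` names the tokens to keep. The variable names `tokens2` and `tokens3` are only for illustration.

```python
import re

str2 = 'This    is a made-up      string to try different regex.   methods'
str3 = 'This-is-a-really-messy<<<<<-string'

# \w+ matches runs of word characters, so separators never appear in the result
tokens2 = re.findall(r'\w+', str2)
tokens3 = re.findall(r'\w+', str3)
print(tokens2)
print(tokens3)
```

Unlike `re.split(r'\s+', ...)`, this pattern splits `made-up` into two tokens, because the hyphen is not a word character. It also never produces empty strings, which `re.split(r'\W+', ...)` can do when a string starts or ends with a non-word character.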

#### Machine Learning pipeline

* Tokenize text stream to words
    * Clean tokenized data by removing stop words
* Vectorize: Convert to numeric form
* Fit ML model and predict
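
The first three steps can be sketched in plain Python. This is an illustration under stated assumptions, not the course's implementation: `STOP_WORDS` is a tiny made-up list standing in for NLTK's, `preprocess` and `vectorize` are hypothetical helpers, and the vectors are simple word counts. Model fitting is left out.

```python
import re
from collections import Counter

STOP_WORDS = {'the', 'a', 'to', 'is', 'you'}  # tiny illustrative list, not NLTK's

def preprocess(text):
    """Tokenize into lowercase words and drop stop words."""
    tokens = re.findall(r'\w+', text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

def vectorize(tokenized_docs):
    """Turn token lists into count vectors over a shared, sorted vocabulary."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ['Free entry to win the cup',
        'Nah I think he lives around here',
        'Win a free prize']
tokenized = [preprocess(doc) for doc in docs]
vocab, vectors = vectorize(tokenized)
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a row of counts with one column per vocabulary word, which is the numeric form a model can be fit on.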

#### Pre-processing
* Remove punctuation
* Tokenization
* Remove stop words

In [22]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
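
One way to use this constant is with `str.translate`, which deletes every listed character in a single pass. A minimal sketch; `PUNCT_TABLE` and `strip_punct` are hypothetical names, and the result should match what a character-by-character filter over `string.punctuation` returns:

```python
import string

# Translation table that maps every punctuation character to None (deletion)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punct(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

cleaned = strip_punct("I've been searching... T&C's apply!")
print(cleaned)
```

Note that apostrophes are removed too, so contractions collapse into a single token (`I've` becomes `Ive`).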

In [39]:
# Remove punctuation characters from a string
def remove_punct(str_text):
    str_no_punct = "".join([char for char in str_text if char not in string.punctuation])
    return str_no_punct

# Tokenize a string into words by splitting on runs of non-word characters
def tokenize(str_text):
    tok_text = re.split(r'\W+', str_text)
    return tok_text

# Remove stop words from a list of tokens.
# NLTK's English stop words are lowercase, so capitalized tokens such as 'I' or 'They' are kept.
def remove_stopwords(tokenized_list):
    rem_sw_text = [word for word in tokenized_list if word not in stopwords]
    return rem_sw_text

# Sample string for ad hoc testing of the helpers above
tst_string = 'This is. a line. with? punctuation!'
dataset['text_clean'] = dataset['text'].apply(remove_punct)
dataset['text_clean'] = dataset['text_clean'].apply(tokenize)
dataset['remove_stop'] = dataset['text_clean'].apply(remove_stopwords)
dataset.head()

| | label | text | text_clean | remove_stop |
|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to tha... | [Ive, been, searching, for, the, right, words,... | [Ive, searching, right, words, thank, breather... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | [Free, entry, in, 2, a, wkly, comp, to, win, F... | [Free, entry, 2, wkly, comp, win, FA, Cup, fin... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... | [Nah, I, dont, think, he, goes, to, usf, he, l... | [Nah, I, dont, think, goes, usf, lives, around... |
| 3 | ham | Even my brother is not like to speak with me. ... | [Even, my, brother, is, not, like, to, speak, ... | [Even, brother, like, speak, They, treat, like... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL] | [I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL] |
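
In the output above, row 4 passes through stop-word removal unchanged and `They` survives in row 3, because the comparison is case-sensitive and NLTK's English stop words are all lowercase. A minimal sketch of a case-insensitive variant; `remove_stopwords_ci` is a hypothetical helper, and `STOP_WORDS` is a small illustrative subset standing in for the NLTK list:

```python
STOP_WORDS = {'i', 'have', 'a', 'on', 'with', 'will'}  # illustrative subset, stands in for NLTK's list

def remove_stopwords_ci(tokens, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively, plus empty strings left over by re.split."""
    return [tok for tok in tokens if tok and tok.lower() not in stop_words]

tokens = ['I', 'HAVE', 'A', 'DATE', 'ON', 'SUNDAY', 'WITH', 'WILL', '']
filtered = remove_stopwords_ci(tokens)
print(filtered)
```

The trade-off is visible in this example: matching on lowercase also removes the name `WILL`, since `will` is a stop word.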


In [46]:
dataset.to_pickle('Spam_ch2.pkl')

In [44]:
type(dataset.iloc[0]['text_clean'])

list