## NLP with Python and Machine Learning
### Chapter 3: Vectorizing raw data

Vectorizing means converting text into numeric values that a model can work with. Three ways to do it:

* Count vectorization

* N-grams

* Term frequency-inverse document frequency (TF-IDF)

[Tutorial link here](https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/count-vectorization?autoAdvance=true&autoSkip=true&autoplay=true&resume=false)

In [4]:
import numpy as np
import pandas as pd
import re
import string
import pdb
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [14]:
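from nltk.corpus import stopwords

# Note: this rebinds the name `stopwords` from the corpus reader to a plain list of English stop words
stopwords = stopwords.words('english')
ps = nltk.PorterStemmer()  # Porter stemmer, used below to reduce words to their stems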
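
To make the stemmer's effect concrete, here is a minimal sketch that runs `PorterStemmer` on three words taken from the sample rows of the dataset. The list name `sample_words` is made up for illustration; no NLTK data download is needed for the stemmer itself.

```python
import nltk

ps = nltk.PorterStemmer()

# Illustrative words picked from the sample messages in the dataset
sample_words = ["searching", "entry", "goes"]

# Stems are not always dictionary words; they only need to be consistent
stems = [ps.stem(word) for word in sample_words]
print(stems)
```

The stems match what the `stemmed_data` column shows for these words ("search", "entri", "goe").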

In [3]:
dataset = pd.read_pickle('Spam_stem_lemma.pkl')
dataset.head()

Unnamed: 0,label,text,text_clean,remove_stop,stemmed_data,lemmatized_data
0,ham,I've been searching for the right words to tha...,"[Ive, been, searching, for, the, right, words,...","[Ive, searching, right, words, thank, breather...","[ive, search, right, word, thank, breather, I,...","[Ive, searching, right, word, thank, breather,..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[Free, entry, 2, wkly, comp, win, FA, Cup, fin...","[free, entri, 2, wkli, comp, win, FA, cup, fin...","[Free, entry, 2, wkly, comp, win, FA, Cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, dont, think, he, goes, to, usf, he, l...","[Nah, I, dont, think, goes, usf, lives, around...","[nah, I, dont, think, goe, usf, live, around, ...","[Nah, I, dont, think, go, usf, life, around, t..."
3,ham,Even my brother is not like to speak with me. ...,"[Even, my, brother, is, not, like, to, speak, ...","[Even, brother, like, speak, They, treat, like...","[even, brother, like, speak, they, treat, like...","[Even, brother, like, speak, They, treat, like..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL]","[I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL]","[I, have, A, date, ON, sunday, with, will]","[I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL]"


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [25]:
## By default CountVectorizer expects raw strings, which is presumably why passing the
## pre-tokenized list columns (e.g. stemmed_data) does not work.
## So pass a custom analyzer that takes the raw string and returns its cleaned, stemmed tokens.
def clean_text(text):
    # Drop punctuation character by character and lowercase what remains
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    # Remove stop words, then stem each remaining token
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

count_vectorizer = CountVectorizer(analyzer=clean_text)
X_counts = count_vectorizer.fit_transform(dataset['text'])
print(X_counts.shape)  # X_counts is a sparse matrix: (documents, unique tokens)

(5568, 8107)


In [26]:
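X_counts_df = pd.DataFrame(X_counts.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available from 1.0)
X_counts_df.columns = count_vectorizer.get_feature_names_out()
X_counts_df.head(5)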

Unnamed: 0,Unnamed: 1,0,008704050406,0089mi,0121,01223585236,01223585334,0125698789,02,020603,...,zindgi,zoe,zogtoriu,zoom,zouk,zyada,é,ü,üll,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
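
The full matrix is too wide to read, so here is a minimal sketch of the same idea on a tiny corpus. The three messages and the names `toy_corpus`, `toy_vectorizer`, `vocab`, and `dense` are invented for illustration, and the default analyzer is used instead of `clean_text` to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-message corpus, invented for illustration only
toy_corpus = [
    "free entry win cup",
    "win free prize",
    "see you sunday",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index; order the tokens by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense = toy_counts.toarray()

print(vocab)
print(dense)
```

Each row is one message and each column is one token, so a cell holds how many times that token occurs in that message. Most cells are zero, which is why scikit-learn returns a sparse matrix.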


### N-grams
Instead of counting how often individual words occur, N-grams count occurrences of sequences of N adjacent words. With `ngram_range=(2,2)`, every feature is a bigram (a pair of consecutive words).

In [30]:
def clean_text_ngram(text):
    # Same cleaning as clean_text, but the tokens are joined back into a single string,
    # because CountVectorizer's built-in n-gram analyzer tokenizes raw strings itself
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return text

dataset['clean_data_ngram'] = dataset['text'].apply(clean_text_ngram)
ngram = CountVectorizer(ngram_range=(2, 2))  # (2, 2) means bigrams only
X_counts = ngram.fit_transform(dataset['clean_data_ngram'])
print(X_counts.shape)  # (documents, unique bigrams)

(5568, 31275)
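
To see what the bigram features actually look like, here is a small sketch on two made-up messages. The names `toy_messages`, `bigram_vectorizer`, and `bigrams` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages, invented for illustration only
toy_messages = ["free entry win cup", "win free entry"]

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_counts = bigram_vectorizer.fit_transform(toy_messages)

# Order the bigrams by their column index in the count matrix
bigrams = sorted(bigram_vectorizer.vocabulary_, key=bigram_vectorizer.vocabulary_.get)

print(bigrams)
print(bigram_counts.toarray())
```

Both messages contain the same three words "free", "entry", and "win", but only the bigram "free entry" is shared, so bigrams keep some word-order information that single-word counts lose. The cost is a much larger feature space, as the jump from 8107 to 31275 columns shows.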


### TF-IDF .... To be continued
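
As a rough preview, here is a minimal sketch assuming scikit-learn's `TfidfVectorizer` with its default settings, which is not necessarily what the course uses. The toy corpus and variable names are invented for illustration. The idea is that a word appearing in every document gets a lower weight than a word that is rare across documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "free" occurs in every message, "prize" in only one
toy_corpus = ["free prize", "free entry", "free call"]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(toy_corpus)

# Compare the weights of the two words in the first message
first_row = X_tfidf.toarray()[0]
weight_free = first_row[tfidf.vocabulary_["free"]]
weight_prize = first_row[tfidf.vocabulary_["prize"]]

print(X_tfidf.shape)
print(weight_prize > weight_free)
```

The matrix has the same shape a count matrix would have, but each cell holds a weight instead of a raw count.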