## **Vectorizing Raw Data**
> ### ***Vectorizing***
> - Process of encoding text as integers to create ***feature vectors***.<br> 
> 
> #### **Feature vector**
> - An n-dimensional vector of numerical features that represent some object<br>
>
> **Vectorizing**: What do we really do?<br>
> - Python sees only characters, not their meanings. For Python and ML models to process the cleaned text data, each text must be converted into numbers (a feature vector).<br>
> - We take each row as a text and convert it to a feature vector.<br>
> - Each unique lemmatized/stemmed word across the entire set of documents becomes a column name.<br>
> - The value of each cell is the number of occurrences of that word (column name) in that particular row (text). The result is called a ***Document Term Matrix*** (one method of vectorization: **Count Vectorization**).<br>
> - A machine learning algorithm can work with these counts, so we can now fit an algorithm.<br> 
> <hr>
>
> **Why Do We Care?**<br>
> - When looking at a word, Python only sees a string of characters.<br>
> - Raw text needs to be converted to numbers so that Python and the algorithms used for machine learning can understand.<br>
> <hr>
>
> ***Different Types***:<br>
> - Count Vectorization
> - N-grams
> - Term frequency - inverse document frequency (TF-IDF)


> #### ***1. Count Vectorization***
> Creates a *document-term matrix* where the entry cell will be a count of the number of times that word occurred in that document.
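>
> As a minimal sketch of the idea, the document-term matrix can be built by hand with the standard library only. The three toy documents below are hypothetical (already cleaned, not taken from the SMS dataset), and whitespace splitting stands in for the real tokenizer.<br>

```python
from collections import Counter

# Hypothetical, already-cleaned toy documents (not from the SMS dataset).
docs = [
    "free entri win cup",
    "win free prize free",
    "date sunday",
]

# Tokenize on whitespace and collect the sorted vocabulary (the column names).
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term, each cell a count.
doc_term_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    doc_term_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

> The second row contains a `2` in the `free` column because that word occurs twice in the second document.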

In [1]:
# Read in raw text
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()     #used to truncate the words

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text_body']

In [2]:
# function to remove punctuation, tokenize, remove stopwords, and stem
def clean_text(text):
    text = ''.join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)   # raw string avoids an invalid-escape warning
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text
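
> To see what each step of the cleaning function does, here is a standard-library sketch run on one message from the dataset. It is not the function above: the stopword set is a tiny hypothetical stand-in for the NLTK list, stemming is left out, and empty tokens are filtered explicitly.<br>

```python
import re
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
STOPWORDS = {"i", "have", "a", "on", "with", "will"}

def clean_without_stemming(text):
    """Lowercase, strip punctuation, split on non-word characters, drop stopwords."""
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", no_punct)
    # re.split can yield empty strings at the edges, so filter them out too.
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

print(clean_without_stemming("I HAVE A DATE ON SUNDAY WITH WILL!!"))

# An edge case: trailing whitespace leaves an empty token after the split.
print(re.split(r"\W+", "hello world "))
```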

> #### **Apply CountVectorizer**

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data['text_body'])
print(X_counts.shape)
# print(count_vect.get_feature_names())

(5568, 8107)


> We now have **5568 text messages as rows** and **8107 unique words (tokens) as individual features**.<br>
> The **`analyzer`** parameter of CountVectorizer accepts the cleaning (**preprocessing**) function. During **fit_transform**, that function is **applied to each row** of the dataset/Series passed in, so the transformed data is returned directly and the text need not be cleaned explicitly beforehand.<br>
> **get_feature_names()** is a **method of the CountVectorizer *object*** that returns the names of all the features, i.e. all unique tokens. In scikit-learn 1.0 and later it is deprecated in favour of **get_feature_names_out()**, and it was removed in version 1.2.<br> 

In [4]:
# Apply Countvectorizer to smaller sample as it will be tough to visualize 8107 columns
data_sample = data[:20]
count_vect_sample = CountVectorizer(analyzer=clean_text)
X_counts_sample = count_vect_sample.fit_transform(data_sample['text_body'])
print(X_counts_sample.shape)
print(count_vect_sample.get_feature_names())

(20, 201)
['08002986030', '08452810075over18', '09061701461', '1', '100', '100000', '11', '12', '150pday', '16', '2', '20000', '2005', '21st', '3', '4', '4403ldnw1a7rw18', '4txtú120', '6day', '81010', '87077', '87121', '87575', '9', '900', 'aft', 'aid', 'alreadi', 'anymor', 'appli', 'ard', 'around', 'b', 'bless', 'breather', 'brother', 'call', 'caller', 'callertun', 'camera', 'cash', 'chanc', 'claim', 'click', 'co', 'code', 'colour', 'comin', 'comp', 'copi', 'cost', 'credit', 'cri', 'csh11', 'cup', 'custom', 'da', 'date', 'dont', 'eg', 'eh', 'england', 'enough', 'entitl', 'entri', 'even', 'fa', 'feel', 'final', 'fine', 'finish', 'first', 'free', 'friend', 'fulfil', 'go', 'goalsteam', 'goe', 'gonna', 'gota', 'grant', 'ha', 'help', 'hl', 'home', 'hour', 'httpwap', 'im', 'info', 'ive', 'jackpot', 'joke', 'k', 'kim', 'kl341', 'lar', 'latest', 'lccltd', 'like', 'link', 'live', 'lor', 'lunch', 'macedonia', 'make', 'may', 'mell', 'membership', 'messag', 'minnaminungint', 'miss', 'mobil', 'mon

In [5]:
X_counts_sample

<20x201 sparse matrix of type '<class 'numpy.int64'>'
	with 228 stored elements in Compressed Sparse Row format>

> ***Vectorizers output sparse matrices***<br>
> - As we can see, X_counts_sample is a *"20x201 sparse matrix with 228 stored elements in Compressed Sparse Row format".*
> - **Sparse Matrix**: A matrix in which ***most entries are 0***. In the interest of efficient storage, a sparse matrix will be stored by ***only storing the locations of the non-zero elements***.
> - To print this sparse matrix, we have to expand this matrix to a collection of arrays, and then create a dataframe from that.
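>
> The storage scheme can be sketched without any library. The code below builds the three arrays of the Compressed Sparse Row layout from a small hypothetical dense matrix and then expands them back, which is roughly the job `toarray()` does. The variable names follow the usual CSR terminology; the matrix values are made up for illustration.<br>

```python
# Hypothetical dense count matrix: 3 documents x 5 terms, mostly zeros.
dense = [
    [0, 0, 2, 0, 1],
    [0, 0, 0, 0, 0],
    [3, 0, 0, 0, 0],
]

data = []      # the non-zero values, row by row
indices = []   # column index of each stored value
indptr = [0]   # indptr[r]:indptr[r + 1] slices out the entries of row r

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))

def to_dense(data, indices, indptr, n_cols):
    """Rebuild the dense rows from the three CSR arrays."""
    rows = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows

print(data, indices, indptr)
print(to_dense(data, indices, indptr, n_cols=5) == dense)

# Density of the 20x201 sample matrix: 228 stored values out of 4020 cells.
density = 228 / (20 * 201)
print(round(density, 4))
```

> Only about 5.7% of the cells in the sample matrix are non-zero, which is why storing just those entries saves space.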

In [6]:
# print sparse matrix by expanding to a collection of arrays.
X_counts_sample_df = pd.DataFrame(X_counts_sample.toarray())
X_counts_sample_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,200
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,1,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
7,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# getting the actual column names
X_counts_sample_df.columns = count_vect_sample.get_feature_names()
X_counts_sample_df

Unnamed: 0,08002986030,08452810075over18,09061701461,1,100,100000,11,12,150pday,16,...,winner,wkli,wonder,wont,word,wwwdbuknet,xxxmobilemovieclub,xxxmobilemovieclubcomnqjkgighjjgcbl,ye,ü
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,1,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
7,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


> #### ***2. N-Grams***
> Creates a *Document-term Matrix* where counts still occupy the cells, but **instead of** the columns representing **single terms**, they represent all ***combinations of adjacent words of `length n` in your text***.<br>
> example,<br>
> *`NLP is an interesting topic`*<br>
>
> |n|Name|Tokens|
> |:---:|:---:|:---:|
> |2|bigram|["nlp is","is an","an interesting","interesting topic"]
> |3|trigram|["nlp is an","is an interesting","an interesting topic"]
> |4|four-gram|["nlp is an interesting","is an interesting topic"]
>
> <hr>
> 
> ***`n` is a hyperparameter***: we need to **tune n** to see which value generates the best model.<br>
> When `n = 1`, the n-gram vectorizer reduces to the **unigram** CountVectorizer.<br>
> **Advantage** over the unigram CountVectorizer: it captures a little context from the sentence, because tokens are made from more than one word.<br> 
> **Disadvantage**: it creates a ***huge number of features***, so experiment with n-grams and keep them only if the model performs better.<br>
> Google's auto complete while searching uses an n-gram like approach.<br>
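>
> The rows of the table above can be reproduced with a few lines of plain Python. `word_ngrams` is a hypothetical helper written for this illustration; it splits on whitespace and lowercases, which is simpler than what CountVectorizer does internally.<br>

```python
def word_ngrams(text, n):
    """Return all runs of n adjacent words in text, lowercased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "NLP is an interesting topic"
for n in (2, 3, 4):
    print(n, word_ngrams(sentence, n))
```

> A five-word sentence yields 4 bigrams, 3 trigrams and 2 four-grams; asking for an `n` larger than the sentence length returns an empty list.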

In [8]:
# Read in raw text
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()     #used to truncate the words

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text_body']

In [9]:
# function to remove punctuation, tokenize, remove stopwords, and stem
def clean_text(text):
    text = ''.join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    # Join the stemmed tokens back into one string: with the default analyzer,
    # CountVectorizer tokenizes the text itself and builds the n-grams from ngram_range.
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return text

# ngram_range is ignored when a callable analyzer is passed, so we clean the text explicitly.
data['cleaned_text'] = data['text_body'].apply(clean_text)
data.head()

Unnamed: 0,label,text_body,cleaned_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,ive search right word thank breather promis wont take help grant fulfil promis wonder bless time
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd...
2,ham,"Nah I don't think he goes to usf, he lives around here though",nah dont think goe usf live around though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,even brother like speak treat like aid patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,date sunday


> #### **Apply CountVectorizer (with *N-Grams*)**

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_vect = CountVectorizer(ngram_range=(2,2))      # don't use 'analyzer' parameter as we are not passing our cleaning function into this, we already have a cleaned_text column.
# ngram_range -> range of n-grams that we'd like to look for. eg, we look for n=1,n=2,and n=3, when 'ngram_range=(1,3)'.

# fit_transform the Series data
X_counts = ngram_vect.fit_transform(data['cleaned_text'])

# print shape of X_counts
print(X_counts.shape)

# print feature names
# print(ngram_vect.get_feature_names())

(5568, 31275)


> Here we have ***31275 unique bigrams*** and **5568 text messages**.<br>
> The number of features grew from 8107 (unigrams) to 31275 (bigrams). 
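>
> The `ngram_range` argument is an inclusive `(min_n, max_n)` pair, so `(1, 2)` keeps unigrams and bigrams together. The sketch below imitates that behaviour in plain Python on three hypothetical cleaned messages; `ngrams_in_range` is an illustrative helper, not scikit-learn's implementation.<br>

```python
def ngrams_in_range(tokens, min_n, max_n):
    """All word n-grams with min_n <= n <= max_n (both ends inclusive)."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

# Hypothetical cleaned messages, tokenized on whitespace.
docs = ["free entri win cup", "win free prize free", "date sunday"]
tokenized = [d.split() for d in docs]

unigrams = {g for t in tokenized for g in ngrams_in_range(t, 1, 1)}
bigrams = {g for t in tokenized for g in ngrams_in_range(t, 2, 2)}
both = {g for t in tokenized for g in ngrams_in_range(t, 1, 2)}

print(len(unigrams), len(bigrams), len(both))
```

> With a range of `(1, 2)` the feature set is the union of the unigram and bigram sets, so widening the range can only add columns.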

In [11]:
# apply n-grams on smaller data sample
data_sample = data[0:20]

ngram_vect_sample = CountVectorizer(ngram_range=(2,2))    
X_counts_sample = ngram_vect_sample.fit_transform(data_sample['cleaned_text'])

# print shape of X_counts
print(X_counts_sample.shape)

# print feature names
print(ngram_vect_sample.get_feature_names())

# here it is clearly visible that we have each feature as a bigram

(20, 209)
['09061701461 claim', '100 20000', '100000 prize', '11 month', '12 hour', '150pday 6day', '16 tsandc', '20000 pound', '2005 text', '21st may', '4txtú120 poboxox36504w45wq', '6day 16', '81010 tc', '87077 eg', '87077 trywal', '87121 receiv', '87575 cost', '900 prize', 'aft finish', 'aid patent', 'anymor tonight', 'appli 08452810075over18', 'appli repli', 'ard smth', 'around though', 'bless time', 'breather promis', 'brother like', 'call 09061701461', 'call mobil', 'caller press', 'callertun caller', 'camera free', 'cash 100', 'chanc win', 'claim 81010', 'claim call', 'claim code', 'click httpwap', 'click wap', 'co free', 'code kl341', 'colour mobil', 'comp win', 'copi friend', 'cost 150pday', 'credit click', 'cri enough', 'csh11 send', 'cup final', 'custom select', 'da stock', 'date sunday', 'dont miss', 'dont think', 'dont want', 'eg england', 'eh rememb', 'england 87077', 'england macedonia', 'enough today', 'entitl updat', 'entri questionstd', 'entri wkli', 'even brother', '

In [12]:
X_counts_sample    # similar to unigrams X_counts_sample here is also a sparse matrix 

<20x209 sparse matrix of type '<class 'numpy.int64'>'
	with 210 stored elements in Compressed Sparse Row format>

In [13]:
# print sparse matrix by expanding it to array
ngrams_df_sample = pd.DataFrame(X_counts_sample.toarray())
ngrams_df_sample.columns = ngram_vect_sample.get_feature_names()
ngrams_df_sample

Unnamed: 0,09061701461 claim,100 20000,100000 prize,11 month,12 hour,150pday 6day,16 tsandc,20000 pound,2005 text,21st may,...,win fa,winner valu,wkli comp,wonder bless,wont take,word claim,word thank,wwwdbuknet lccltd,xxxmobilemovieclub use,ye naughti
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,1,0,0,0
1,0,0,0,0,0,0,0,0,1,1,...,1,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
7,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0


> #### ***3. TF-IDF (Term frequency - inverse document frequency)***
> - TF-IDF creates a ***document-term matrix*** where there's still *one row per text message/document*, and *columns still represent single unique terms*.
> - ***Cells hold a weighting*** that is meant to identify ***how important a word is*** *to an individual text message.*
> - The weighting for each cell is given by the formula below.
> <hr>
>
> ## **$ w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_{i}}\right) $**
> - ***i*** - *ith feature(word)(column_name)*<br>
> - ***j*** - *jth text/document (text in corresponding row)*<br>
> - ***$ tf_{i,j} $ = number of times i occurs in j divided by total number of terms in j.*** <br>
> - ***$ df_{i} $ = number of documents containing i.***<br>
> - ***N = total number of documents.***
> <hr>
>
> **Example**<br>
> Text - ***'`I like NLP`'*** <br>
> - $ tf_{NLP,j} = \frac{\text{number of occurrences of NLP}}{\text{number of words in text message}} = \frac{1}{3} = 0.\overline{3} $<br>
> - $ N = 20 $
> - $ df_{NLP} = 1 $
> - $ w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_{i}}\right) $
> - $ w_{NLP,j} = 0.\overline{3} \times \log_{10}\left(\frac{20}{1}\right) \approx 0.43 $ (the value 0.43 corresponds to a base-10 logarithm) 
> <hr>
>
> ***Insights***<br>
> - The rarer the word is across documents, the higher the value of the log term.<br>
> - If a word occurs very frequently within a particular text message, the first term (TF) is large.
> - If the word is infrequent elsewhere, the second term (the log term) is large as well. Such a word is assigned a very large weight and is assumed to be very important for differentiating that text message from the others.
>
> **This method helps you pull out important but seldom-used words.**
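>
> The worked example can be checked numerically. The sketch below assumes a base-10 logarithm, since that is what reproduces the 0.43 above; `tf_idf` is a hypothetical helper that implements the formula exactly as written in this section.<br>

```python
import math

def tf_idf(term_count, doc_length, n_docs, doc_freq):
    """w = tf * log10(N / df), with tf = term_count / doc_length."""
    tf = term_count / doc_length
    idf = math.log10(n_docs / doc_freq)
    return tf * idf

# 'NLP' appears once in the 3-word message 'I like NLP', N = 20, df = 1.
w_nlp = tf_idf(term_count=1, doc_length=3, n_docs=20, doc_freq=1)
print(round(w_nlp, 2))

# The same term in the same message, but present in 10 of the 20 documents.
w_common = tf_idf(term_count=1, doc_length=3, n_docs=20, doc_freq=10)
print(round(w_common, 2))
```

> The weight drops from about 0.43 to about 0.10 when the term appears in half of the documents, which is the "rare words score higher" effect described above.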

In [14]:
# Read in raw text
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()     #used to truncate the words

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text_body']

In [15]:
# function to remove punctuation, tokenize, remove stopwords, and stem
def clean_text(text):
    text = ''.join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r"\W+", text)
    text = [ps.stem(token) for token in tokens if token not in stopwords]  # the token list can be passed directly to TF-IDF
    return text

# this function can be directly passed to the vectorizer just like unigram CountVectorizer

> #### **Apply TfidfVectorizer**

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['text_body'])
print(X_tfidf.shape)
# print(tfidf_vect.get_feature_names())

(5568, 8107)


> The shape is identical to that of CountVectorizer, because TfidfVectorizer here also works on unigrams produced by the same `clean_text` analyzer.

In [18]:
# apply tfidf to a smaller sample
data_sample = data[0:20]

tfidf_vect_sample = TfidfVectorizer(analyzer=clean_text)
X_tfidf_sample = tfidf_vect_sample.fit_transform(data_sample['text_body'])
print(X_tfidf_sample.shape)
print(tfidf_vect_sample.get_feature_names())

(20, 201)
['08002986030', '08452810075over18', '09061701461', '1', '100', '100000', '11', '12', '150pday', '16', '2', '20000', '2005', '21st', '3', '4', '4403ldnw1a7rw18', '4txtú120', '6day', '81010', '87077', '87121', '87575', '9', '900', 'aft', 'aid', 'alreadi', 'anymor', 'appli', 'ard', 'around', 'b', 'bless', 'breather', 'brother', 'call', 'caller', 'callertun', 'camera', 'cash', 'chanc', 'claim', 'click', 'co', 'code', 'colour', 'comin', 'comp', 'copi', 'cost', 'credit', 'cri', 'csh11', 'cup', 'custom', 'da', 'date', 'dont', 'eg', 'eh', 'england', 'enough', 'entitl', 'entri', 'even', 'fa', 'feel', 'final', 'fine', 'finish', 'first', 'free', 'friend', 'fulfil', 'go', 'goalsteam', 'goe', 'gonna', 'gota', 'grant', 'ha', 'help', 'hl', 'home', 'hour', 'httpwap', 'im', 'info', 'ive', 'jackpot', 'joke', 'k', 'kim', 'kl341', 'lar', 'latest', 'lccltd', 'like', 'link', 'live', 'lor', 'lunch', 'macedonia', 'make', 'may', 'mell', 'membership', 'messag', 'minnaminungint', 'miss', 'mobil', 'mon

In [19]:
X_tfidf_sample

<20x201 sparse matrix of type '<class 'numpy.float64'>'
	with 228 stored elements in Compressed Sparse Row format>

In [20]:
# expanding this sparse matrix by converting into array
tfidf_df_sample = pd.DataFrame(X_tfidf_sample.toarray())
tfidf_df_sample.columns = tfidf_vect_sample.get_feature_names()
tfidf_df_sample.head()

Unnamed: 0,08002986030,08452810075over18,09061701461,1,100,100000,11,12,150pday,16,...,winner,wkli,wonder,wont,word,wwwdbuknet,xxxmobilemovieclub,xxxmobilemovieclubcomnqjkgighjjgcbl,ye,ü
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.238737,0.238737,0.209853,0.0,0.0,0.0,0.0,0.0
1,0.0,0.198986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.198986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


> It can be seen that the **cells contain weights** indicating the importance of each feature (term) to each document: a term scores higher the more often it occurs in that document and the rarer it is across the other documents.
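>
> The printed weights do not follow the textbook formula exactly. According to the scikit-learn documentation, TfidfVectorizer by default uses the raw term count as tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit L2 length. The plain-Python sketch below applies those three steps to hypothetical token lists; it is an illustration of the documented defaults, not the library's code.<br>

```python
import math
from collections import Counter

# Hypothetical tokenized documents (not from the SMS dataset).
docs = [["free", "win", "free"], ["win", "cup"], ["date", "sunday"]]

vocabulary = sorted({t for d in docs for t in d})
n_docs = len(docs)
doc_freq = {term: sum(term in d for d in docs) for term in vocabulary}

# Smoothed idf: ln((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

rows = []
for d in docs:
    counts = Counter(d)
    raw = [counts[term] * idf[term] for term in vocabulary]  # tf is the raw count
    norm = math.sqrt(sum(v * v for v in raw))                # L2 norm of the row
    rows.append([v / norm for v in raw])

print(vocabulary)
for row in rows:
    print([round(v, 3) for v in row])
```

> Because every row is normalised to length 1, two terms that each occur once in a two-word message and nowhere else both get a weight of about 0.707.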