# Vectorizing Raw Data: N-Grams

### N-Grams 

An n-gram vectorizer creates a document-term matrix in which each cell still holds a count, but each column represents an n-gram (a sequence of n adjacent words in the text) rather than a single term.

For example, the sentence "NLP is an interesting topic" produces the following n-grams:

| n | Name      | Tokens                                                         |
|---|-----------|----------------------------------------------------------------|
| 2 | bigram    | ["nlp is", "is an", "an interesting", "interesting topic"]      |
| 3 | trigram   | ["nlp is an", "is an interesting", "an interesting topic"] |
| 4 | four-gram | ["nlp is an interesting", "is an interesting topic"]    |
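
The rows of the table can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming lowercase text split on whitespace; the helper name `make_ngrams` is hypothetical and is not part of scikit-learn or NLTK.

```python
def make_ngrams(text, n):
    """Return every run of n adjacent words in text, lowercased."""
    tokens = text.lower().split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "NLP is an interesting topic"
bigrams = make_ngrams(sentence, 2)
trigrams = make_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence with k words yields k - n + 1 n-grams, so the list shrinks as n grows and is empty once n exceeds the sentence length.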

### Read in text

In [1]:
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

### Create function to remove punctuation, tokenize, remove stopwords, and stem

In [2]:
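def clean_text(text):
    # Drop punctuation characters and lowercase everything else
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Remove stopwords, stem the remaining tokens, and rejoin them into one string
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return text

data['cleaned_text'] = data['body_text'].apply(clean_text)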
data.head()

|   | label | body_text | cleaned_text |
|---|-------|-----------|--------------|
| 0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd... |
| 1 | ham | Nah I don't think he goes to usf, he lives around here though | nah dont think goe usf live around though |
| 2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | even brother like speak treat like aid patent |
| 3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | date sunday |
| 4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend... |


### Apply CountVectorizer (w/ N-Grams)

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

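ngram_vect = CountVectorizer(ngram_range=(2,2))
# Fit on the raw text; pass data['cleaned_text'] instead to count n-grams of the cleaned tokens
x_counts = ngram_vect.fit_transform(data['body_text'])
print(x_counts.shape)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(ngram_vect.get_feature_names_out())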

(5567, 41739)
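
`ngram_range=(2,2)` keeps bigrams only. As a hedged illustration on a two-document toy corpus (the corpus and variable names here are made up for this sketch), widening the range to `(1,2)` keeps unigrams as well as bigrams. The vocabulary is read from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
toy_corpus = ["NLP is an interesting topic", "NLP is fun"]

# (2, 2): bigrams only; (1, 2): unigrams and bigrams
bigram_only = CountVectorizer(ngram_range=(2, 2)).fit(toy_corpus)
uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(toy_corpus)

# vocabulary_ maps each learned n-gram to its column index
bigram_vocab = sorted(bigram_only.vocabulary_)
mixed_vocab = sorted(uni_and_bi.vocabulary_)
print(bigram_vocab)
print(len(bigram_vocab), "bigram columns vs", len(mixed_vocab), "unigram + bigram columns")
```

A wider range grows the number of columns quickly, which is one reason the full-corpus matrix above has tens of thousands of columns for only a few thousand messages.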


### Apply CountVectorizer (w/ N-Grams) to smaller sample

In [5]:
data_sample = data[0:20]

ngram_vect_sample = CountVectorizer(ngram_range=(2,2))
x_counts_sample = ngram_vect_sample.fit_transform(data_sample['body_text']) 
print(x_counts_sample.shape)
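# get_feature_names() was removed in scikit-learn 1.2; list(...) keeps the list-style output
print(list(ngram_vect_sample.get_feature_names_out()))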

(20, 310)
['000 pounds', '000 prize', '09061701461 claim', '100 000', '100 to', '11 months', '12 hours', '150p day', '16 tsandcs', '20 000', '20 poboxox36504w45wq', '2005 text', '21st may', '4txt ú1', '6days 16', '81010 www', '87077 eg', '87077 try', '87121 to', '87575 cost', '900 prize', 'about this', 'aft finish', 'aids patent', 'all callers', 'alright no', 'and don', 'and send', 'anymore tonight', 'apply 08452810075over18', 'apply reply', 'ard smth', 'around here', 'as per', 'as valued', 'as your', 'be home', 'been selected', 'been set', 'brother is', 'call 09061701461', 'call the', 'callers press', 'callertune for', 'camera for', 'can meet', 'cash from', 'chances to', 'claim call', 'claim code', 'claim to', 'click here', 'click the', 'co free', 'code kl341', 'colour mobiles', 'com qjkgighjjgcbl', 'comp to', 'copy your', 'cost 150p', 'credit click', 'cried enough', 'csh11 and', 'cup final', 'customer you', 'da stock', 'date on', 'day 6days', 'dbuk net', 'did he', 'don think', 'don w

### Vectorizers output sparse matrices

_**Sparse Matrix**: A matrix in which most entries are 0. For efficient storage, a sparse matrix records only the locations and values of its non-zero elements._
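
To make the storage idea concrete, here is a conceptual sketch in plain Python that keeps only the non-zero cells of a small count matrix as `(row, column) -> value` entries. It illustrates the idea only; it is not how scikit-learn or SciPy lay out their sparse matrices internally, and the names `to_sparse` and `to_dense` are hypothetical.

```python
def to_sparse(dense):
    """Keep only the non-zero cells as {(row, col): value}."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }


def to_dense(sparse, n_rows, n_cols):
    """Rebuild the full matrix, filling every unstored cell with 0."""
    return [[sparse.get((r, c), 0) for c in range(n_cols)] for r in range(n_rows)]


dense = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 0],
]
sparse = to_sparse(dense)
print(sparse)
print(len(sparse), "stored cells instead of", len(dense) * len(dense[0]))
```

The saving grows with the share of zeros, and an n-gram count matrix is almost entirely zeros because each message contains only a tiny fraction of all n-grams in the corpus.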

In [6]:
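# Expand the sparse matrix to a dense array so it can be viewed as a DataFrame
x_counts_df = pd.DataFrame(x_counts_sample.toarray())
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
x_counts_df.columns = ngram_vect_sample.get_feature_names_out()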

In [8]:
print(x_counts_df)

    000 pounds  000 prize  09061701461 claim  100 000  100 to  11 months  \
0            0          0                  0        0       0          0   
1            0          0                  0        0       0          0   
2            0          0                  0        0       0          0   
3            0          0                  0        0       0          0   
4            0          0                  0        0       0          0   
5            0          0                  1        0       0          0   
6            0          0                  0        0       0          1   
7            0          0                  0        0       0          0   
8            1          0                  0        0       1          0   
9            0          1                  0        1       0          0   
10           0          0                  0        0       0          0   
11           0          0                  0        0       0          0   
12          