# Vectorizing Raw Data: Count Vectorization

#### Count vectorization 

Count vectorization creates a document-term matrix in which each cell holds the number of times a given word occurs in a given document.
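As a minimal sketch of the idea in plain Python (not scikit-learn's implementation), the snippet below builds a small document-term count matrix by hand. The toy corpus and the names `docs`, `vocab`, and `matrix` are made up for illustration and do not come from the SMS dataset.

```python
from collections import Counter

# Toy corpus (hypothetical; not taken from the SMS dataset).
docs = [
    "free entry free prize",
    "call me later",
    "free call",
]

# Tokenize on whitespace and build a sorted vocabulary: one column per unique word.
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document; each cell counts how often that column's word occurs in it.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

The first row has a 2 in the `free` column because that word appears twice in the first document; every word that is absent from a document contributes a 0.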

#### Read in text

In [4]:
import pandas as pd
import re
import string
import nltk

pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')

ps = nltk.PorterStemmer()   # Stemming is used because it is faster than lemmatizing

In [5]:
data = pd.read_csv("data/SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1,ham,"Nah I don't think he goes to usf, he lives around here though"
2,ham,Even my brother is not like to speak with me. They treat me like aids patent.
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!
4,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...


#### Create function to remove punctuation, tokenize, and remove stopwords

In [7]:
def clean_text(text):
    text = "".join([char for char in text if char not in string.punctuation]) # Remove punctuation (iterates over characters)
    tokens = re.split(r'\W+', text)                                           # Tokenize on runs of non-word characters
    text = [word for word in tokens if word not in stopwords]                 # Remove stop words
    return text
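To see what this analyzer hands to the vectorizer, here is a small self-contained sketch. It uses a tiny stand-in stopword set (an assumption, so the example runs without downloading the NLTK corpus) and reuses one message from the data preview above.

```python
import re
import string

# Stand-in stopword set (an assumption for this sketch); the notebook uses
# nltk.corpus.stopwords.words('english'), whose entries are all lowercase.
stopwords = {"i", "he", "to", "here"}

def clean_text(text):
    text = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    return [word for word in tokens if word not in stopwords]

cleaned = clean_text("Nah I don't think he goes to usf, he lives around here though")
print(cleaned)

# A trailing non-word character (here a space) leaves an empty-string token behind.
trailing = clean_text("call me ")
print(trailing)
```

Two things are worth noticing. Because the text is not lowercased before the stopword check, a capitalized word such as `I` is kept even though `i` is in the stopword set. And splitting on `\W+` can produce an empty-string token, which is one way an entry like `''` can end up among the feature names.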

## Apply CountVectorizer

Count vectorization builds the document-term matrix and stores, in each cell, the number of times a word appears in a given document (a text message, in our case).

In [10]:
# Import the CountVectorizer class from the sklearn.feature_extraction.text module.
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of CountVectorizer with a custom analyzer.
count_vect = CountVectorizer(analyzer=clean_text)

# Apply the CountVectorizer to the 'body_text' column in the 'data' DataFrame.
X_counts = count_vect.fit_transform(data['body_text'])

# Print the shape of the resulting sparse matrix (number of documents, number of unique tokens).
print(X_counts.shape)

print()

# Use get_feature_names_out() to get the list of feature names (tokens).
# This method returns the list of tokens that represent the columns in the sparse matrix.
print(count_vect.get_feature_names_out())

(5567, 11516)

['' '0' '008704050406' ... 'ü' 'üll' '〨ud']


#### Apply CountVectorizer to smaller sample

In [12]:
# Select a sample of the data
data_sample = data[0:20]

# Create a CountVectorizer instance with a custom analyzer
count_vect_sample = CountVectorizer(analyzer=clean_text)

# Fit the vectorizer to the sample data and transform the text data into a sparse matrix of token counts
X_counts_sample = count_vect_sample.fit_transform(data_sample['body_text'])

# Print the shape of the resulting document-term matrix
print(X_counts_sample.shape)  # Outputs the dimensions (number of documents, number of unique words)

# Print the feature names (unique words) extracted from the sample data
print(count_vect_sample.get_feature_names_out())  # Outputs the list of feature names

(20, 223)
['08002986030' '08452810075over18s' '09061701461' '1' '100' '100000' '11'
 '12' '150pday' '16' '2' '20000' '2005' '21st' '3' '4' '4403LDNW1A7RW18'
 '4txtú120' '6days' '81010' '87077' '87121' '87575' '9' '900' 'A' 'Aft'
 'Alright' 'Ard' 'As' 'CASH' 'CLAIM' 'CSH11' 'Call' 'Callers' 'Callertune'
 'Claim' 'Co' 'Cost' 'Cup' 'DATE' 'ENGLAND' 'Eh' 'England' 'Even' 'FA'
 'FREE' 'Ffffffffff' 'Fine' 'Free' 'From' 'HAVE' 'HL' 'Had' 'He' 'I' 'Im'
 'Is' 'Ive' 'Jackpot' 'KL341' 'LCCLTD' 'Macedonia' 'May' 'Melle'
 'Minnaminunginte' 'Mobile' 'Nah' 'No' 'Nurungu' 'ON' 'Oh' 'Oru' 'POBOX'
 'POBOXox36504W45WQ' 'Press' 'Prize' 'R' 'Reply' 'SCOTLAND' 'SIX' 'SUNDAY'
 'So' 'TC' 'Text' 'That' 'The' 'Then' 'They' 'To' 'TryWALES' 'TsandCs'
 'Txt' 'U' 'URGENT' 'Update' 'Valid' 'Vettam' 'WAP' 'WILL' 'WINNER' 'WITH'
 'XXXMobileMovieClub' 'Yes' 'You' 'aids' 'already' 'anymore' 'apply'
 'around' 'b' 'brother' 'call' 'callertune' 'camera' 'chances' 'claim'
 'click' 'code' 'colour' 'comin' 'comp' 'copy' 'cred

In [13]:
# Select a sample of the data
data_sample = data[0:4]

# Create a CountVectorizer instance with a custom analyzer
count_vect_sample = CountVectorizer(analyzer=clean_text)

# Fit the vectorizer to the sample data and transform the text data into a sparse matrix of token counts
X_counts_sample = count_vect_sample.fit_transform(data_sample['body_text'])

# Print the shape of the resulting document-term matrix
print(X_counts_sample.shape)  # Outputs the dimensions (number of documents, number of unique words)

# Print the feature names (unique words) extracted from the sample data
print(count_vect_sample.get_feature_names_out())  # Outputs the list of feature names

(4, 45)
['08452810075over18s' '2' '2005' '21st' '87121' 'A' 'Cup' 'DATE' 'Even'
 'FA' 'Free' 'HAVE' 'I' 'May' 'Nah' 'ON' 'SUNDAY' 'Text' 'They' 'WILL'
 'WITH' 'aids' 'apply' 'around' 'brother' 'comp' 'dont' 'entry' 'final'
 'goes' 'like' 'lives' 'patent' 'questionstd' 'rateTCs' 'receive' 'speak'
 'think' 'though' 'tkts' 'treat' 'txt' 'usf' 'win' 'wkly']


### Vectorizers output sparse matrices

_**Sparse Matrix**: A matrix in which most entries are 0. For efficient storage, only the non-zero elements (their locations and values) are stored._

In [15]:
X_counts_sample

<4x45 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>
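The output above reports 46 stored elements for a 4 x 45 matrix, so only 46 of the 180 cells are kept explicitly. As a rough sketch of how the Compressed Sparse Row layout named in that output works, the plain-Python snippet below builds the three arrays by hand. The small `dense` matrix and the `get` helper are hypothetical; SciPy's own CSR matrices expose arrays with the same names (`data`, `indices`, `indptr`).

```python
# Small dense count matrix (hypothetical values), mostly zeros.
dense = [
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]

data = []      # the non-zero values, row by row
indices = []   # the column index of each stored value
indptr = [0]   # indptr[i]:indptr[i + 1] is the slice of data/indices belonging to row i

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))

def get(row, col):
    """Look up one cell; anything not stored is an implicit zero."""
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return data[k]
    return 0

print(data, indices, indptr)
print(get(0, 3), get(1, 2))
```

Only three values are stored for the twelve cells, and the all-zero middle row costs nothing beyond one repeated entry in `indptr`.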

In [16]:
X_counts_df = pd.DataFrame(X_counts_sample.toarray())
X_counts_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,35,36,37,38,39,40,41,42,43,44
0,1,1,1,1,1,0,1,0,0,2,...,1,0,0,0,1,0,1,0,1,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# Set the column names to the feature names (unique words) from the CountVectorizer
X_counts_df.columns = count_vect_sample.get_feature_names_out()

# Now we just have the actual column names here
X_counts_df        

Unnamed: 0,08452810075over18s,2,2005,21st,87121,A,Cup,DATE,Even,FA,...,receive,speak,think,though,tkts,treat,txt,usf,win,wkly
0,1,1,1,1,1,0,1,0,0,2,...,1,0,0,0,1,0,1,0,1,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


# Vectorizing Raw Data: N-Grams

### N-Grams 

Creates a document-term matrix where the cells still hold counts, but each column represents a sequence of n adjacent words (an n-gram) from your text rather than a single term.

"NLP is an interesting topic"

| n | Name      | Tokens                                                         |
|---|-----------|----------------------------------------------------------------|
| 2 | bigram    | ["nlp is", "is an", "an interesting", "interesting topic"]      |
| 3 | trigram   | ["nlp is an", "is an interesting", "an interesting topic"] |
| 4 | four-gram | ["nlp is an interesting", "is an interesting topic"]    |
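The table can be reproduced with a few lines of plain Python. This is only a sketch of the sliding-window idea, with a hypothetical `ngrams` helper that splits on whitespace; scikit-learn's default tokenizer behaves differently in some cases (for example, it ignores single-character tokens).

```python
def ngrams(text, n):
    """Return all runs of n adjacent lowercase tokens from a whitespace-split string."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "NLP is an interesting topic"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
four_grams = ngrams(sentence, 4)

print(bigrams)
print(trigrams)
print(four_grams)
```

A sentence with k tokens yields k - n + 1 n-grams, which is why the five-word example gives four bigrams, three trigrams, and two four-grams.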

In [20]:
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("data/SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1,ham,"Nah I don't think he goes to usf, he lives around here though"
2,ham,Even my brother is not like to speak with me. They treat me like aids patent.
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!
4,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...


Important: With **n-grams**, `CountVectorizer` expects a string to be passed to it, rather than an already tokenized list, so that it can look for adjacent words in the string and chunk them together.

In [22]:
def clean_text(text):
    text = "".join([char for char in text if char not in string.punctuation])   # Remove punctuation (iterates over characters)
    tokens = re.split(r'\W+', text)                                              # Tokenize on runs of non-word characters
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords]) # Remove stopwords, stem, and rejoin into a string
    #text = [word for word in tokens if word not in stopwords]                   # Previous version: returned a list of words
    return text

data['cleaned_text'] = data['body_text'].apply(clean_text)
data.head()

Unnamed: 0,label,body_text,cleaned_text
0,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd...
1,ham,"Nah I don't think he goes to usf, he lives around here though",nah i dont think goe usf live around though
2,ham,Even my brother is not like to speak with me. They treat me like aids patent.,even brother like speak they treat like aid patent
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,i have a date on sunday with will
4,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...,as per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi fri...


We took the tokenized list and joined it back into a string. The `cleaned_text` column now holds all the tokens from our tokenized list, reconstructed into a sentence.

### Apply CountVectorizer (w/ N-Grams)

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with ngram_range set to (2,2) for bigrams
ngram_vect = CountVectorizer(ngram_range=(2,2))

# Fit the vectorizer to the cleaned text data and transform it into a document-term matrix
X_counts = ngram_vect.fit_transform(data['cleaned_text'])

# Print the shape of the resulting matrix (number of documents, number of unique bigrams)
print(X_counts.shape)

# Print the feature names (unique bigrams) extracted from the data
print(ngram_vect.get_feature_names_out())

(5567, 34114)
['008704050406 sp' '0089mi last' '0121 2025050' ... 'üll submit'
 'üll take' '〨ud even']


In [26]:
# Display the first 50 bigrams in the list
print(ngram_vect.get_feature_names_out()[:50])

# Alternatively, display all bigrams
# for bigram in ngram_vect.get_feature_names_out():
#     print(bigram)

['008704050406 sp' '0089mi last' '0121 2025050' '01223585236 xx'
 '01223585334 cum' '0125698789 ring' '02 user' '020603 2nd' '020603 thi'
 '0207 153' '02072069400 bx' '02073162414 cost' '02085076972 repli'
 '020903 thi' '021 3680' '021 3680offer' '050703 tcsbcm4235wc1n3xx'
 '06 good' '07046744435 arrang' '07090298926 reschedul'
 '07099833605 reschedul' '07123456789 87077' '0721072 find'
 '07732584351 rodger' '07734396839 ibh' '07742676969 show'
 '07753741225 show' '0776xxxxxxx uve' '077xxx won' '07801543489 guarante'
 '07808 xxxxxx' '07808247860 show' '07808726822 award' '07815296484 show'
 '0784987 show' '0789xxxxxxx today' '0796xxxxxx today' '07973788240 show'
 '07xxxxxxxxx 2000' '07xxxxxxxxx show' '0800 0721072' '0800 169' '0800 18'
 '0800 195' '0800 1956669' '0800 505060' '0800 542' '08000407165 18'
 '08000776320 repli' '08000839402 2stoptx']


### Apply CountVectorizer (w/ N-Grams) to smaller sample

In [28]:
# Select a sample of 20 documents from the dataset for analysis
data_sample = data[0:20]

# Initialize a CountVectorizer with ngram_range set to (2,2) to extract bigrams
ngram_vect_sample = CountVectorizer(ngram_range=(2,2))

# Fit the vectorizer to the cleaned text data in the sample and transform it into a document-term matrix
X_counts_sample = ngram_vect_sample.fit_transform(data_sample['cleaned_text'])

# Print the shape of the resulting matrix (number of documents, number of unique bigrams)
print(X_counts_sample.shape)

# Print the list of bigrams (unique word pairs) extracted from the sample
print(ngram_vect_sample.get_feature_names_out())

(20, 217)
['09061701461 claim' '100 20000' '100000 prize' '11 month' '12 hour'
 '150pday 6day' '16 tsandc' '20000 pound' '2005 text' '21st may'
 '4txtú120 poboxox36504w45wq' '6day 16' '81010 tc' '87077 eg'
 '87077 trywal' '87121 receiv' '87575 cost' '900 prize' 'aft finish'
 'aid patent' 'alright way' 'anymor tonight' 'appli 08452810075over18'
 'appli repli' 'ard smth' 'around though' 'as per' 'as valu'
 'brother like' 'call 09061701461' 'call the' 'caller press'
 'callertun caller' 'camera free' 'cash from' 'chanc win' 'claim call'
 'claim code' 'claim no' 'click httpwap' 'click wap' 'co free'
 'code kl341' 'colour mobil' 'comp win' 'copi friend' 'cost 150pday'
 'credit click' 'cri enough' 'csh11 send' 'cup final' 'custom select'
 'da stock' 'date on' 'dont miss' 'dont think' 'dont want' 'eg england'
 'eh rememb' 'england 87077' 'england macedonia' 'enough today'
 'entitl updat' 'entri questionstd' 'entri wkli' 'even brother' 'fa 87121'
 'fa cup' 'feel that' 'ffffffffff alright' 'fina

In [29]:
# Convert the sparse matrix of bigram counts into a dense array and create a DataFrame from it
X_counts_df = pd.DataFrame(X_counts_sample.toarray())

# Assign the bigram feature names as column headers in the DataFrame
X_counts_df.columns = ngram_vect_sample.get_feature_names_out()

# Display the DataFrame with bigram counts for each document in the sample
X_counts_df

Unnamed: 0,09061701461 claim,100 20000,100000 prize,11 month,12 hour,150pday 6day,16 tsandc,20000 pound,2005 text,21st may,...,win cash,win fa,winner as,with will,wkli comp,word claim,wwwdbuknet lccltd,xxxmobilemovieclub to,ye he,you week
0,0,0,0,0,0,0,0,0,1,1,...,0,1,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,1,0,0,0,1,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
9,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,1


# Vectorizing Raw Data: TF-IDF

<img align="left" src="images/tfidf.png" style="width:800px;">

**Term Frequency-Inverse Document Frequency (TF-IDF)** creates a document-term matrix where the columns represent single unique terms (unigrams), but each cell holds a weighting meant to represent how important a word is to a document. The weighting combines the term's frequency within the document with its inverse document frequency.
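One common textbook form of this weighting is shown below. It is given here as an assumption, since the figure is not reproduced in text, and scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant by default.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \times \log\!\left(\frac{N}{\mathrm{df}_i}\right)
```

Here $\mathrm{tf}_{i,j}$ is the number of times term $i$ occurs in document $j$, $N$ is the total number of documents, and $\mathrm{df}_i$ is the number of documents that contain term $i$. A term that appears in every document has $\mathrm{df}_i = N$, so its weight is $\log(1) = 0$.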

### Read in text

In [33]:
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("data/SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

#### Create function to remove punctuation, tokenize, remove stopwords, and stem

In [35]:
def clean_text(text):
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

## Apply TfidfVectorizer

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer with a custom analyzer function for text preprocessing
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

# Fit the vectorizer to the raw text data and transform it into a TF-IDF matrix
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

# Print the shape of the resulting matrix (number of documents, number of unique terms)
print(X_tfidf.shape)

# Print the list of feature names (terms) extracted from the data
print(tfidf_vect.get_feature_names_out())

(5567, 8104)
['' '0' '008704050406' ... 'ü' 'üll' '〨ud']


TF-IDF creates a document-term matrix where there's still one row per text message and the columns still represent single unique terms. But instead of holding a count, each cell holds a weighting that's meant to identify **how important a word is to an individual text message**.
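To make the weighting concrete, here is a plain-Python sketch on a hypothetical toy corpus. It follows the smoothed inverse document frequency and L2 row normalization that `TfidfVectorizer` documents as its defaults; treat it as an illustration of the arithmetic rather than the library's code.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["free", "prize", "free"],
    ["call", "me"],
    ["free", "call"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency: rarer terms get larger values.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Weight each vocabulary term for one document, then scale the row to unit length."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))  # L2 length of the row
    return [value / norm for value in raw]

weights = [tfidf_row(doc) for doc in docs]
for row in weights:
    print([round(value, 3) for value in row])
```

In the second document, `call` and `me` each occur once, yet `me` receives the larger weight because it appears in fewer documents overall. Terms that do not occur in a document still get a weight of 0, so the matrix stays sparse.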

#### Apply TfidfVectorizer to smaller sample

In [40]:
data_sample = data[0:20]

# Initialize TfidfVectorizer with a custom analyzer function for text preprocessing
tfidf_vect_sample = TfidfVectorizer(analyzer=clean_text)

# Fit the vectorizer to the text data in the sample and transform it into a TF-IDF matrix
X_tfidf_sample = tfidf_vect_sample.fit_transform(data_sample['body_text'])

# Print the shape of the resulting matrix (number of documents, number of unique terms)
print(X_tfidf_sample.shape)

# Print the list of feature names (terms) extracted from the data
print(tfidf_vect_sample.get_feature_names_out())

(20, 192)
['08002986030' '08452810075over18' '09061701461' '1' '100' '100000' '11'
 '12' '150pday' '16' '2' '20000' '2005' '21st' '3' '4' '4403ldnw1a7rw18'
 '4txtú120' '6day' '81010' '87077' '87121' '87575' '9' '900' 'aft' 'aid'
 'alreadi' 'alright' 'anymor' 'appli' 'ard' 'around' 'b' 'brother' 'call'
 'caller' 'callertun' 'camera' 'cash' 'chanc' 'claim' 'click' 'co' 'code'
 'colour' 'comin' 'comp' 'copi' 'cost' 'credit' 'cri' 'csh11' 'cup'
 'custom' 'da' 'date' 'dont' 'eg' 'eh' 'england' 'enough' 'entitl' 'entri'
 'even' 'fa' 'feel' 'ffffffffff' 'final' 'fine' 'finish' 'first' 'free'
 'friend' 'go' 'goalsteam' 'goe' 'gonna' 'gota' 'ha' 'hl' 'home' 'hour'
 'httpwap' 'im' 'info' 'ive' 'jackpot' 'joke' 'k' 'kim' 'kl341' 'lar'
 'latest' 'lccltd' 'like' 'link' 'live' 'lor' 'lunch' 'macedonia' 'make'
 'may' 'meet' 'mell' 'membership' 'messag' 'minnaminungint' 'miss' 'mobil'
 'month' 'nah' 'name' 'nation' 'naughti' 'network' 'news' 'next' 'nurungu'
 'oh' 'oru' 'patent' 'pay' 'per' 'pobox' 'p

#### Vectorizers output sparse matrices

_**Sparse Matrix**: A matrix in which most entries are 0. For efficient storage, only the non-zero elements (their locations and values) are stored._

In [42]:
# Convert the sparse TF-IDF matrix of the sample into a dense array and create a DataFrame from it
X_tfidf_df = pd.DataFrame(X_tfidf_sample.toarray())

# Assign the term feature names as column headers in the DataFrame
X_tfidf_df.columns = tfidf_vect_sample.get_feature_names_out()

# Display the DataFrame with TF-IDF values for each document in the sample
X_tfidf_df

Unnamed: 0,08002986030,08452810075over18,09061701461,1,100,100000,11,12,150pday,16,...,wet,win,winner,wkli,word,wwwdbuknet,xxxmobilemovieclub,xxxmobilemovieclubcomnqjkgighjjgcbl,ye,ü
0,0.0,0.198986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.174912,0.0,0.198986,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.231645,0.0,0.0,0.0,0.0,0.231645,0.0,0.0,...,0.0,0.0,0.231645,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.197682,0.0,0.0,0.0,0.0,0.0,0.197682,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.224905,0.0,0.0,0.0,0.224905,0.197695,...,0.0,0.197695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.252972,0.0,0.252972,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.252972,0.252972,0.0,0.0,0.0,0.0


Instead of integers in the cells, you now have decimals. The 0.2316 for `12` in row 5 is likely more important to its message than the 0.1977 for `11` in row 6 is to its message. That means either `12` occurs more frequently within row 5's text message than `11` does within row 6's text message, or `12` occurs less frequently across all the other text messages than `11` does.

So in summary, we set up something of a false choice by suggesting that there are three different ways to vectorize. They are all closely related, and some can be used together. **TF-IDF** is basically a count vectorizer that also accounts for the length of the document and for how common the word is across the other text messages. **N-grams** can be used within either of these two methods to look for groups of adjacent words instead of single terms. They're all slight modifications of each other, and typically you'll test different vectorization methods on your problem and let the results determine which one you use.

That wraps up our vectorization section. Next, we're going to learn about feature engineering.