# **Text to Vector conversion** (Part 1)

## **Bag Of Words**
Bag of Words (BoW) is a text vectorization technique that converts text into numerical form using word frequency. It builds a vocabulary of unique words from the dataset and represents each document as a vector based on how many times each word appears. BoW ignores grammar and word order, focusing only on word occurrence.

- Example:
    - Sentence 1: “I love NLP”
    - Sentence 2: “I love ML”

    - Vocabulary → [I, love, NLP, ML]

- Vectors:
    - S1 → [1, 1, 1, 0]
    - S2 → [1, 1, 0, 1]
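
The example above can be reproduced by hand before reaching for a library. The following is a minimal from-scratch sketch, assuming simple whitespace tokenisation; the helper names `build_vocabulary` and `bow_vector` are illustrative and not part of any library.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Collect unique words in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def bow_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

sentences = ["I love NLP", "I love ML"]
vocab = build_vocabulary(sentences)
vectors = [bow_vector(s, vocab) for s in sentences]

print(vocab)    # ['I', 'love', 'NLP', 'ML']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```

Note that scikit-learn's `CountVectorizer`, used in the implementation below, behaves a little differently with its default settings: it lowercases the text, ignores single-character tokens such as "I", and orders the vocabulary alphabetically.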

## Implementation

In [1]:
import pandas as pd

dataset = [
    ["Win a free iPhone now!!! Click here to claim.", "spam"],
    ["Hey, are we still meeting tomorrow?", "ham"],
    ["Congratulations! You have won a $1000 gift card.", "spam"],
    ["Can you send me the notes from today’s class?", "ham"],
    ["Limited time offer, buy one get one free!", "spam"],
    ["Don't forget to bring the documents.", "ham"],
    ["Urgent! Your account has been suspended. Verify now.", "spam"],
    ["Let’s catch up over coffee this weekend.", "ham"],
    ["You have been selected for a lucky draw prize.", "spam"],
    ["Please review the attached assignment.", "ham"],
    ["Earn money from home with this simple trick.", "spam"],
    ["Call me when you reach home.", "ham"],
    ["Exclusive deal just for you. Act fast!", "spam"],
    ["Happy birthday! Have a great year ahead.", "ham"],
    ["Claim your cashback reward today.", "spam"],
    ["Can you help me with this coding problem?", "ham"],
    ["Get cheap medicines without prescription.", "spam"],
    ["Meeting has been postponed to 3 PM.", "ham"],
    ["Click this link to reset your password immediately.", "spam"],
    ["Lunch at 1 PM?", "ham"],
    ["You won’t believe these shocking results!", "spam"],
    ["Assignment submission deadline is tonight.", "ham"],
    ["Lowest prices guaranteed. Shop now!", "spam"],
    ["Let me know if you need any help.", "ham"],
    ["Free entry in 2 million dollar competition.", "spam"],
    ["Project presentation slides are ready.", "ham"],
    ["Hot singles in your area waiting!", "spam"],
    ["Thanks for your support yesterday.", "ham"],
    ["This is not a scam! Claim your prize now.", "spam"],
    ["Can we reschedule our meeting?", "ham"],
    ["Double your income in 7 days.", "spam"],
    ["Please share the GitHub repo link.", "ham"],
    ["Your loan has been approved instantly.", "spam"],
    ["I will call you after class.", "ham"],
    ["Get rich quick with crypto investment.", "spam"],
    ["Are you coming to the seminar?", "ham"],
    ["Congratulations! You’ve been pre-approved for credit.", "spam"],
    ["Let’s work on the project together.", "ham"],
    ["Unlock premium features for free.", "spam"],
    ["See you at the library.", "ham"],
    ["Act now! Limited stock available.", "spam"],
    ["Can you explain this ML concept?", "ham"],
    ["You have won a lottery. Send details now.", "spam"],
    ["Don’t forget the team meeting tomorrow.", "ham"],
    ["Special discount on electronics today only.", "spam"],
    ["Submit your lab record by Friday.", "ham"],
    ["Risk-free investment opportunity.", "spam"],
    ["Thanks for the update.", "ham"],
    ["Cheap flight tickets available now.", "spam"],
    ["Are you free for a quick call?", "ham"],
    ["Winner! Claim your reward before midnight.", "spam"],
    ["Please check the email I sent.", "ham"],
    ["Exclusive membership offer just for you.", "spam"],
    ["Let’s revise DSA tonight.", "ham"],
    ["You have been chosen for a surprise gift.", "spam"]
]


In [2]:
df = pd.DataFrame(dataset, columns=['content', 'spam'])
df['spam'] = df['spam'].map({'spam': 1, 'ham': 0})
df

| | content | spam |
|---|---|---|
| 0 | Win a free iPhone now!!! Click here to claim. | 1 |
| 1 | Hey, are we still meeting tomorrow? | 0 |
| 2 | Congratulations! You have won a $1000 gift card. | 1 |
| 3 | Can you send me the notes from today’s class? | 0 |
| 4 | Limited time offer, buy one get one free! | 1 |
| 5 | Don't forget to bring the documents. | 0 |
| 6 | Urgent! Your account has been suspended. Verif... | 1 |
| 7 | Let’s catch up over coffee this weekend. | 0 |
| 8 | You have been selected for a lucky draw prize. | 1 |
| 9 | Please review the attached assignment. | 0 |


In [3]:
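# Preprocessing pipeline: tokenise, drop stopwords, lemmatise
# Requires the NLTK data packages 'punkt' (or 'punkt_tab' on newer NLTK
# versions) and 'stopwords', plus the spaCy model 'en_core_web_sm'
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load these once; reloading the spaCy model for every document is slow
NLP = spacy.load("en_core_web_sm")
STOP_WORDS = set(stopwords.words('english'))

class TextPreprocessor:
    def __init__(self, corpus):
        self.corpus = corpus

    def process(self) -> list:
        # tokenise
        words = word_tokenize(self.corpus)

        # remove stopwords, then lowercase
        # note: the check is case-sensitive, so capitalised stopwords
        # such as "You" or "Can" are kept (as the output below shows)
        words = [word.lower() for word in words if word not in STOP_WORDS]

        # lemmatise: reduce each token to its root form
        doc = NLP(" ".join(words))

        return [token.lemma_ for token in doc]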

In [4]:
processed = []

for line in df['content']:
    pre = TextPreprocessor(line)
    p = pre.process()
    p = [token for token in p if token != "."]
    processed.append(" ".join(p))

processed

['win free iphone ! ! ! click claim',
 'hey , still meet tomorrow ?',
 'congratulation ! you $ 1000 gift card',
 "can send note today ' class ?",
 'limited time offer , buy one get one free !',
 'do not forget bring document',
 'urgent ! your account suspend verify',
 "let ' catch coffee weekend",
 'you select lucky draw prize',
 'please review attach assignment',
 'earn money home simple trick',
 'call reach home',
 'exclusive deal act fast !',
 'happy birthday ! have great year ahead',
 'claim cashback reward today',
 'can help cod problem ?',
 'get cheap medicine without prescription',
 'meeting postpone 3 pm',
 'click link reset password immediately',
 'lunch 1 pm ?',
 "you ' believe shocking result !",
 'assignment submission deadline tonight',
 'low price guarantee shop !',
 'let know need help',
 'free entry 2 million dollar competition',
 'project presentation slide ready',
 'hot single area wait !',
 'thank support yesterday',
 'this scam ! claim prize',
 'can reschedule meeti

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=50)
X, y = cv.fit_transform(processed).toarray(), df['spam']
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 1, 0]], dtype=int64)

In [6]:
X = pd.DataFrame(X, columns=cv.get_feature_names_out())
X

Unnamed: 0,act,assignment,available,be,call,can,cheap,claim,class,click,...,revise,reward,rich,send,thank,today,tomorrow,tonight,you,your
0,0,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,1,0,0,1,0,...,0,0,0,1,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


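In binary BoW, a word is recorded as 1 if it occurs in the document, no matter how many times it occurs, and as 0 otherwise. The count-based matrix built above, by contrast, contains values greater than 1: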

In [7]:
for col in X.columns:
    if 2 in X[col].values:
        print("2 is there!")
        break

2 is there!


In [8]:
cv = CountVectorizer(max_features=50, binary=True)
X = pd.DataFrame(cv.fit_transform(processed).toarray())

for col in X.columns:
    if 2 in X[col].values:
        print("2 is there!")
        break
else:
    # the else branch runs only if the loop finished without hitting break
    print("There's no 2")

There's no 2
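
To make the difference between the two modes concrete, here is a minimal from-scratch sketch. It assumes the text is already tokenised and uses a few tokens from the processed message "limited time offer , buy one get one free !", in which "one" occurs twice. The helper names `count_vector` and `binary_vector` are illustrative only.

```python
from collections import Counter

def count_vector(tokens, vocab):
    """Count-based BoW: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def binary_vector(tokens, vocab):
    """Binary BoW: 1 if the word occurs at all, otherwise 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

tokens = ["buy", "one", "get", "one", "free"]
vocab = ["buy", "free", "get", "one"]

print(count_vector(tokens, vocab))   # [1, 1, 1, 2]
print(binary_vector(tokens, vocab))  # [1, 1, 1, 1]
```

Binary BoW is a reasonable choice when the presence of a word matters more than how often it is repeated, which is often the case for short messages like the ones in this dataset.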


## **N-grams**

N-grams are a text representation technique that captures sequences of N consecutive words instead of single words. Unlike Bag of Words, N-grams preserve partial word order and help capture context. They are useful for identifying phrases and handling negation patterns, but they increase vocabulary size and dimensionality.

- Example:
    - Sentences:
        - “food is good”
        - “food is not good”

    - Unigrams (N=1) → [food, is, good, not]. With single words only, the two sentences look very similar, since they differ by just one word.

    - Bigrams (N=2):
        - [food is, is good]
        - [food is, is not, not good]

    With bigrams, the difference between “is good” and “not good” becomes explicit.
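
The sliding-window idea behind N-grams can be sketched in a few lines. This is an illustrative helper (the name `ngrams` is an assumption, not a library function) that assumes whitespace-tokenised input:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s1 = "food is good".split()
s2 = "food is not good".split()

print(ngrams(s1, 2))  # ['food is', 'is good']
print(ngrams(s2, 2))  # ['food is', 'is not', 'not good']

# Unigrams plus bigrams, which is what ngram_range=(1, 2) asks
# CountVectorizer to extract
print(ngrams(s2, 1) + ngrams(s2, 2))
# ['food', 'is', 'not', 'good', 'food is', 'is not', 'not good']
```

A sentence with k tokens yields k - n + 1 N-grams, so the number of distinct features grows quickly as N increases. This is why the cell below caps the vocabulary with `max_features`.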

In [9]:
cv = CountVectorizer(max_features=200, ngram_range=(1, 2))
X = cv.fit_transform(processed).toarray()

cv.vocabulary_

{'win': 181,
 'free': 34,
 'click': 15,
 'claim': 10,
 'win free': 182,
 'click claim': 16,
 'still': 142,
 'meet': 44,
 'tomorrow': 170,
 'still meet': 143,
 'meet tomorrow': 45,
 'congratulation': 26,
 'you': 192,
 '1000': 0,
 'gift': 36,
 'congratulation you': 27,
 'you 1000': 193,
 'can': 6,
 'send': 126,
 'note': 62,
 'today': 167,
 'class': 14,
 'send note': 128,
 'note today': 63,
 'today class': 168,
 'time': 165,
 'offer': 64,
 'one': 66,
 'get': 35,
 'time offer': 166,
 'offer buy': 65,
 'one get': 68,
 'one free': 67,
 'not': 60,
 'forget': 33,
 'not forget': 61,
 'urgent': 176,
 'your': 199,
 'suspend': 154,
 'verify': 178,
 'urgent your': 177,
 'suspend verify': 155,
 'let': 40,
 'coffee': 20,
 'weekend': 180,
 'coffee weekend': 21,
 'select': 123,
 'prize': 88,
 'you select': 198,
 'select lucky': 124,
 'please': 72,
 'review': 108,
 'assignment': 2,
 'please review': 74,
 'review attach': 109,
 'money': 56,
 'home': 38,
 'simple': 134,
 'trick': 172,
 'money home': 57,
 

## **TF-IDF**
TF-IDF is a text vectorization technique that measures how important a word is in a document relative to a collection of documents. Unlike Bag of Words, which only counts frequency, TF-IDF reduces the weight of very common words and increases the importance of rare but meaningful words.

$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
$$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total terms in document } d} $$
$$ \text{IDF}(t) = \log_{e}\left(\frac{N}{df(t)}\right) $$
where
$$ N = \text{Total number of documents} $$
$$ df(t) = \text{Number of documents containing term } t $$
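
The formulas above translate almost directly into code. The following is a minimal sketch on a hypothetical three-document toy corpus; the function names `tf`, `idf` and `tf_idf` are illustrative, and each document is assumed to be an already tokenised list of words.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: ln(N / df(t)).

    Assumes the term occurs in at least one document, otherwise
    df is 0 and the division fails.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "food is good".split(),
    "food is not good".split(),
    "service is slow".split(),
]

# "is" appears in all three documents, so IDF = ln(3/3) = 0
print(tf_idf("is", docs[0], docs))    # 0.0

# "slow" appears in only one document, so it gets the largest weight
print(round(tf_idf("slow", docs[2], docs), 4))  # 0.3662

# "food" appears in two of three documents
print(round(tf_idf("food", docs[0], docs), 4))  # 0.1352
```

The numbers produced by scikit-learn's `TfidfVectorizer` will not match this sketch exactly. With its default settings it uses raw counts for TF, a smoothed IDF of the form ln((1 + N) / (1 + df(t))) + 1, and then L2-normalises each row, which is why a document with a single surviving feature gets a weight of exactly 1.0 in the table below.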

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
X = tf.fit_transform(processed).toarray()

In [19]:
# pass the feature names directly; wrapping them in a list would create a MultiIndex
X = pd.DataFrame(X, columns=tf.get_feature_names_out())
X

Unnamed: 0,1000,act,assignment,available,be,call,can,cheap,claim,class,...,reward midnight,rich,select,send,thank,today,tomorrow,tonight,you,your
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.564859,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.501886,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.356753,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.35787,0.0,0.0,0.411387,...,0.0,0.0,0.0,0.381248,0.0,0.381248,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.655566,0.0,0.0,0.0,0.0,0.0,0.465993,0.0
9,0.0,0.0,0.426,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
