# **Vectorizing Text**

## BoW (Bag of Words) Model

- **This model has a limitation: it only counts words. It does not take into account the order of the words, or how important any word in the sentence is compared to the others.**

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
data = [' Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [4]:
countvec = CountVectorizer()
countvec_fit = countvec.fit_transform(data)

*`fit_transform` scans the data to learn which unique words appear in it (the vocabulary), then counts how often each word occurs in each document and returns those counts as a sparse matrix.*
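`fit_transform` combines two steps that can also be run separately: `fit` learns the vocabulary and `transform` turns text into counts. The sketch below separates them on a small made-up corpus (an assumption for illustration, not the notebook's data) and inspects the learned `vocabulary_` mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, used only for illustration
docs = ["the cat sat", "the dog ran"]

vec = CountVectorizer()
vec.fit(docs)                      # step 1: learn the vocabulary
print(vec.vocabulary_)             # maps each word to its column index

matrix = vec.transform(docs)       # step 2: count words per document
print(matrix.toarray())

# Words not seen during fit (here "bird") are ignored by transform
unseen = vec.transform(["the bird sat"]).toarray()
print(unseen)
```

Keeping the two steps apart is useful when the vocabulary is learned on training text and then reused on new text.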

In [7]:
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())

*`toarray` converts the sparse matrix into a dense 2-D array of numbers, which can be placed in a DataFrame.*

*`get_feature_names_out` returns the column names (i.e. all unique words in the vocabulary), which we provide as the label for each column in the DataFrame.*

*Every row is a document*

*Every column is a word from the vocabulary*

*Every entry is a word count*

In [8]:
print(bag_of_words)

   10  about  admirable  ahead  are  as  attacks  back  bait  beach  ...  \
0   1      1          0      0    1   0        1     0     0      1  ...   
1   0      0          1      0    0   0        0     0     0      0  ...   
2   0      0          0      0    0   1        0     0     0      0  ...   
3   0      0          0      0    1   0        0     0     0      0  ...   
4   0      0          0      1    0   0        0     0     0      0  ...   
5   0      0          0      0    0   1        0     1     1      0  ...   

   were  west  when  where  which  with  work  works  worms  you  
0     0     0     0      1      0     0     0      0      0    0  
1     0     0     0      0      1     1     0      0      0    0  
2     1     0     0      0      0     0     0      0      0    0  
3     0     1     1      0      0     0     0      1      0    1  
4     0     0     0      0      0     0     1      0      0    0  
5     0     0     0      0      0     0     0      0      1    0 

## TF-IDF (Term Frequency - Inverse Document Frequency)

*Instead of looking only inside individual documents, TF-IDF also looks across the entire collection of documents.*

**<u>Term Frequency</u> -** *frequency of each word in a single document*

**<u>Inverse document frequency</u> -** *the weight of a word across the dataset. Common words get a lower score because, even though they appear more often, they don't help distinguish one document from another. Less common words get a higher score because they carry more weight in showing what a specific document is about (i.e. how unique the word is across all documents).*


**Higher Score means:**
- The word appears frequently in this specific document
- AND appears in few other documents (hence important across the dataset)

**Lower Score means:**
- The word appears rarely in this document
- OR appears commonly across many documents


**Mathematical Formula**


> <br/>TF-IDF score = TF(t,d) × IDF(t)<br/><br/>


Where:

**TF(t,d)** = *Number of times term t appears in document d / total number of words in document d*<br/><br/>
**IDF(t)** = *log(N/df(t)), using the natural logarithm*<br/><br/>
**N** = *Total number of documents*<br/><br/>
**df(t)** = *Number of documents containing term t*<br/><br/>

**Simple Example:**

Let's use a small corpus of 3 documents:

"The cat sat"

"The cat sat on the mat"

"The dog ran"

<br/>

*Step 1: Calculate TF (Term Frequency)* <br/>
<br/>Example term frequencies for "cat": 

> <br/>TF("cat", doc1) = 1/3<br/><br/>
TF("cat", doc2) = 1/6<br/><br/>
TF("cat", doc3) = 0/3<br/><br/>

*Step 2: Calculate IDF (Inverse Document Frequency)* <br/>
<br/>For the word "cat":


> <br/>N = 3 (total documents)<br/><br/>
df("cat") = 2 (appears in 2 documents)<br/><br/>
IDF("cat") = log(3/2) ≈ 0.405<br/><br/>

*Step 3: Calculate TF-IDF* <br/>


For "cat" in document 1:


> <br/>TF-IDF = 1/3 × 0.405 ≈ 0.135<br/><br/>
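The "cat" calculation can be reproduced in a few lines of plain Python. This is a minimal sketch of the formulas as defined in this section. The helper names `tf` and `idf` are chosen here for illustration, and tokenization is assumed to be simple lowercasing and splitting on whitespace.

```python
import math

corpus = ["The cat sat", "The cat sat on the mat", "The dog ran"]

# Assumed tokenization: lowercase, then split on whitespace
tokenized = [doc.lower().split() for doc in corpus]

def tf(term, tokens):
    """Occurrences of term in the document / number of words in it."""
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """Natural log of (number of documents / documents containing term)."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df)

idf_cat = idf("cat", tokenized)
scores = [tf("cat", tokens) * idf_cat for tokens in tokenized]

print(round(idf_cat, 3))                 # IDF("cat")
print([round(s, 3) for s in scores])     # TF-IDF of "cat" in doc1..doc3
```

Document 1 scores higher than document 2 because "cat" makes up a larger share of its words, and document 3 scores 0 because it does not contain "cat" at all.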

**Why It's Useful:**
- Common words like "the" get lower scores because they appear in many documents (high df → low IDF)
- Rare, meaningful words get higher scores (low df → high IDF)
- Words that are frequent in one document but rare across all documents get the highest scores

In [14]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
data = [' Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [16]:
tfidfvec = TfidfVectorizer()
tfidfvec_fit = tfidfvec.fit_transform(data)
tfidf_bag = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())
print(tfidf_bag)

         10     about  admirable     ahead       are        as   attacks  \
0  0.257061  0.257061   0.000000  0.000000  0.210794  0.000000  0.257061   
1  0.000000  0.000000   0.293641  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000   0.000000  0.000000  0.000000  0.292313  0.000000   
3  0.000000  0.000000   0.000000  0.000000  0.222257  0.000000  0.000000   
4  0.000000  0.000000   0.000000  0.290766  0.000000  0.000000  0.000000   
5  0.000000  0.000000   0.000000  0.000000  0.000000  0.178615  0.000000   

      back     bait     beach  ...      were     west     when     where  \
0  0.00000  0.00000  0.257061  ...  0.000000  0.00000  0.00000  0.257061   
1  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
2  0.00000  0.00000  0.000000  ...  0.356474  0.00000  0.00000  0.000000   
3  0.00000  0.00000  0.000000  ...  0.000000  0.27104  0.27104  0.000000   
4  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
5  0.21782 