## 📊 Text Vectorization: Bag of Words and TF-IDF
*(Personal Practice Notes)*

Before applying machine learning to text, we must convert text into a **numerical format**.
This process is known as **text vectorization**.

## 1️⃣ Why Text Vectorization?

Machine learning models cannot work directly with raw text.

Text vectorization converts text into numbers so that:
- documents can be compared mathematically
- patterns can be learned by ML algorithms

Two common text vectorization techniques are:
1. **Bag of Words (BoW)**
2. **TF-IDF (Term Frequency–Inverse Document Frequency)**


## 2️⃣ Bag of Words (BoW)

The **Bag of Words** model:
- counts how often each word appears in each document
- ignores word order and grammar
- represents text as word frequency vectors

In the resulting matrix:
- each document becomes a row
- each unique word becomes a column
- each value is the number of times that word appears in that document
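
To make the row and column layout concrete, here is a minimal from-scratch sketch using only the standard library. The two toy sentences and the names `docs`, `vocab`, and `rows` are made up for illustration, and simple whitespace splitting is an assumption; this is not how scikit-learn implements it internally.

```python
from collections import Counter

# Two toy documents (illustrative, not part of the dataset used in these notes)
docs = ["the cat sat on the mat", "the dog sat"]

# Split each document into lowercase words
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of unique words; each word becomes a column
vocab = sorted({word for tokens in tokenized for word in tokens})

# Each document becomes a row of counts, one value per vocabulary word
rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    rows.append([counts[word] for word in vocab])

print(vocab)
for row in rows:
    print(row)
```

Word order is lost at the `Counter` step: only the number of occurrences of each word survives.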


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer transforms a collection of text documents into a matrix of token counts.
# It splits each text into words (tokens) and counts the occurrences of each token per document,
# which gives the numerical representation needed to apply machine learning to text data.

In [None]:
data = [' Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

## 3️⃣ Creating a Bag of Words Model

We use **CountVectorizer** from scikit-learn to create a Bag of Words model.

CountVectorizer:
- tokenizes text into words
- builds a vocabulary of unique words
- counts how often each word appears in each document
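
As a rough illustration of the tokenizing step: with default settings, CountVectorizer lowercases the text and keeps tokens of two or more word characters, so single-character words such as "a" are dropped. The sketch below mimics that behaviour with the standard `re` module. The `tokenize` helper is a hypothetical name for these notes, not a scikit-learn function.

```python
import re

# Same regular expression as scikit-learn's documented default token_pattern:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("carol drank the blood as if she were a vampire")
print(tokens)

# The vocabulary is the sorted set of unique tokens
vocabulary = sorted(set(tokens))
print(vocabulary)
```

This is why the printed tables in these notes have no column for "a".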


In [None]:
# Creating a bag-of-words model is straightforward with CountVectorizer
countvec = CountVectorizer()  # no arguments, so the default settings are used
# The vectorizer is now set up, ready to learn the vocabulary of the text data

## 4️⃣ Interpreting the Bag of Words Output

- Each **row** represents one document
- Each **column** represents a word from the vocabulary
- Each value shows how many times the word appears in that document

This gives us a numerical representation suitable for machine learning.


In [None]:

# fit: CountVectorizer scans the text and learns which unique words appear (the vocabulary)
# transform: converts each text into a row of counts, one column per vocabulary word
countvec_fit = countvec.fit_transform(data)

# .toarray() converts the sparse result into a regular two-dimensional array that fits in a DataFrame
# get_feature_names_out() returns the vocabulary words learned from the data, used here as column labels
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())
print(bag_of_words)

# Each row corresponds to one text in our data; each column corresponds to one word from the vocabulary


## 5️⃣ Binary Bag of Words Representation

Sometimes we only care whether a word **appears or not**, not how many times.

Setting `binary=True`:
- assigns 1 if the word appears
- assigns 0 if it does not
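
A small sketch of the difference between counts and presence, using the last sentence of the dataset and only the standard library. Splitting on whitespace is a simplifying assumption here and is not identical to CountVectorizer's tokenization.

```python
from collections import Counter

sentence = ("the gruff old man sat in the back of the bait shop "
            "grumbling to himself as he scooped out a handful of worms")

# Count representation: how many times each word occurs
counts = Counter(sentence.split())

# Binary representation: 1 for every word that occurs at least once
binary = {word: min(count, 1) for word, count in counts.items()}

print(counts["the"], binary["the"])
print(counts["of"], binary["of"])
```

Words absent from the sentence would get 0 in both representations; the two differ only for words that repeat.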


In [None]:
countvec_binary = CountVectorizer(binary=True)
countvec_binary_fit = countvec_binary.fit_transform(data)
bag_of_words_binary = pd.DataFrame(countvec_binary_fit.toarray(), columns=countvec_binary.get_feature_names_out())
print(bag_of_words_binary)


## 6️⃣ Limitations of the Bag of Words Model

While Bag of Words is simple and effective, it has important limitations:

- ❌ Ignores word order
- ❌ Ignores context
- ❌ Treats all words as equally important
- ❌ Cannot capture meaning or relationships between words

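Because of these limitations, other methods are often preferred. **TF-IDF** addresses the
equal-importance problem by down-weighting words that are common across documents, while
**word embeddings** go further and capture meaning and relationships between words.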
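
The word-order limitation is easy to demonstrate. In this sketch (toy sentences and the `bag` helper are made up for illustration), two sentences with opposite meanings produce exactly the same bag of words.

```python
from collections import Counter

def bag(text):
    """Return the word counts of a text (hypothetical helper)."""
    return Counter(text.lower().split())

a = bag("dog bites man")
b = bag("man bites dog")

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations are indistinguishable
same = a == b
print(same)
```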


## 7️⃣ TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF improves upon Bag of Words by:
- measuring how important a word is to a specific document
- reducing the impact of very common words

It considers:
- how often a word appears in one document
- how common that word is across all documents

Words that are frequent in one document but rare overall receive higher scores.

- **Term frequency (TF)** is how many times a word appears in a single document
- **Inverse document frequency (IDF)** is based on how many documents in the collection contain the word: the more documents contain it, the lower the IDF
- Words that appear in many documents receive lower scores, while rarer words receive higher scores
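
To see where the numbers come from, here is a hand-rolled sketch of the calculation using only the standard library. It assumes scikit-learn's documented defaults for TfidfVectorizer: smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The helper names `tokenize`, `idf`, and `tfidf` are hypothetical, and this is an approximation of the library's behaviour, not its actual implementation.

```python
import math
import re
from collections import Counter

data = [' Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in data]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # Term frequency times IDF, then scale the row to unit L2 length
    tf = Counter(tokens)
    raw = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = tfidf(tokenized[2])  # 'carol drank the blood as if she were a vampire'
for term in ("the", "as", "were"):
    print(term, df[term], round(weights[term], 6))
```

In the third document, "the" (found in all six documents) gets the lowest weight, "as" (found in two) is in the middle, and "were" (found only here) gets the highest. The values for "as" and "were" should agree with row 2 of the printed TF-IDF table in these notes, to within rounding.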



In [1]:
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
data = [' Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [3]:
tfidfvec = TfidfVectorizer()

In [5]:
tfidfvec_fit = tfidfvec.fit_transform(data)

In [7]:
tfidf_bag = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())
print(tfidf_bag) 

         10     about  admirable     ahead       are        as   attacks  \
0  0.257061  0.257061   0.000000  0.000000  0.210794  0.000000  0.257061   
1  0.000000  0.000000   0.293641  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000   0.000000  0.000000  0.000000  0.292313  0.000000   
3  0.000000  0.000000   0.000000  0.000000  0.222257  0.000000  0.000000   
4  0.000000  0.000000   0.000000  0.290766  0.000000  0.000000  0.000000   
5  0.000000  0.000000   0.000000  0.000000  0.000000  0.178615  0.000000   

      back     bait     beach  ...      were     west     when     where  \
0  0.00000  0.00000  0.257061  ...  0.000000  0.00000  0.00000  0.257061   
1  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
2  0.00000  0.00000  0.000000  ...  0.356474  0.00000  0.00000  0.000000   
3  0.00000  0.00000  0.000000  ...  0.000000  0.27104  0.27104  0.000000   
4  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
5  0.21782 

## ✅ Final Takeaways

- Text vectorization is required to apply ML to text
- Bag of Words represents text using word counts
- Binary BoW captures presence instead of frequency
- Bag of Words ignores order and context
- TF-IDF helps highlight more informative words
- Choosing the right representation depends on the task
