# BoW vs TF-IDF

### What is Bag of Words (BoW)?

BoW is a simple technique to convert text into numerical features. It creates a vocabulary of all unique words in your corpus and represents each document as a vector of word counts.

- It ignores grammar and word order.
- It’s sparse and high-dimensional.
Example: for the two sentences

- "I love data"
- "I love Python"

the vocabulary is ["I", "love", "data", "Python"], and the vectors become:

- [1, 1, 1, 0]
- [1, 1, 0, 1]
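
The example above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization and a vocabulary ordered by first appearance; the helper name `to_bow_vector` is made up for illustration and is not part of any library.

```python
from collections import Counter

sentences = ["I love data", "I love Python"]

# Build the vocabulary in order of first appearance.
# Assumption: split on whitespace and keep the original casing.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def to_bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vectors = [to_bow_vector(s, vocabulary) for s in sentences]

print(vocabulary)  # ['I', 'love', 'data', 'Python']
print(vectors)     # [[1, 1, 1, 0], [1, 1, 0, 1]]
```

Real tokenizers usually lowercase the text and handle punctuation, so a library may produce a slightly different vocabulary for the same sentences.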


### What is TF-IDF?

TF-IDF (Term Frequency–Inverse Document Frequency) improves on BoW by weighting each word according to how important it is to a document within a corpus.

- TF: How often a word appears in a document.
- IDF: How rare the word is across all documents.
- Common words like “the” get lower weights, while rare but meaningful words get higher weights.
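
To make the two factors concrete, here is a small sketch using one common textbook variant: TF is the word's share of the document and IDF is `log(N / df)`. The three toy documents and the helper `tf_idf` are assumptions for illustration; libraries such as scikit-learn apply smoothing and normalization on top of this, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat",
    "the dog barked",
    "the cat purred",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    tf = tokens.count(word) / len(tokens)    # how often the word appears in this document
    idf = math.log(n_docs / doc_freq[word])  # larger when the word is rare across documents
    return tf * idf

# Weights for the first document, "the cat sat"
weights = {word: round(tf_idf(word, tokenized[0]), 3) for word in tokenized[0]}
print(weights)  # {'the': 0.0, 'cat': 0.135, 'sat': 0.366}
```

"the" occurs in every document, so its weight drops to zero, while "sat" occurs in only one document and receives the highest weight.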


### Key Differences

|Feature|BoW|TF-IDF|
|----|----|----|
|Representation|Raw word counts|Weighted scores|
|Captures Importance|No|Yes|
|Handles Common Words|No|Downweights them|
|Output|Sparse matrix|Sparse matrix with float weights|

In [3]:
!pip install scikit-learn



In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

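# Sample corpus
corpus = [
    "I love data science",
    "Data science is fun",
    "I love python and data",
]

# Bag of Words
# Note: CountVectorizer lowercases the text and, with its default token
# pattern, drops single-character tokens such as "I".
bow_vectorizer = CountVectorizer()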
bow_matrix = bow_vectorizer.fit_transform(corpus)
print("BoW Vocabulary:", bow_vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())


BoW Vocabulary: ['and' 'data' 'fun' 'is' 'love' 'python' 'science']
BoW Matrix:
 [[0 1 0 0 1 0 1]
 [0 1 1 1 0 0 1]
 [1 1 0 0 1 1 0]]


In [5]:
# TF-IDF

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
print("\nTF-IDF Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())



TF-IDF Vocabulary: ['and' 'data' 'fun' 'is' 'love' 'python' 'science']
TF-IDF Matrix:
 [[0.         0.48133417 0.         0.         0.61980538 0.
  0.61980538]
 [0.         0.34520502 0.5844829  0.5844829  0.         0.
  0.44451431]
 [0.5844829  0.34520502 0.         0.         0.44451431 0.5844829
  0.        ]]
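
The weights above can be recomputed by hand to see where they come from. The sketch below assumes `TfidfVectorizer` is used with its default settings, which in scikit-learn means a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I love data science",
    "Data science is fun",
    "I love python and data",
]

counts = CountVectorizer().fit_transform(corpus).toarray()  # raw term counts (TF)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
weighted = counts * idf                          # TF * IDF
# Scale every row to unit length (L2 normalization).
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))  # True if the default settings match the assumptions
print(manual[0].round(4))
```

"data" appears in all three documents, so its IDF is the minimum value of 1, which is why it has the lowest non-zero weight in every row.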


### Practical Applications of BoW

- Spam Detection: BoW features let a classifier flag spam emails by picking up on common spammy terms like “free,” “win,” or “urgent.”

- Sentiment Analysis: In product or movie reviews, BoW helps classify text as positive or negative based on word frequency.

- Topic Classification: News articles or documents can be categorized (e.g., sports, politics, tech) using BoW features.

- Search Engines: Early search engines used BoW to match user queries with documents based on keyword overlap.

- Plagiarism Detection: By comparing word frequency patterns, BoW can help detect copied content.
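
As an illustration of the keyword-overlap idea from the search engine item, the sketch below ranks documents by how many query words they share. The documents and query are invented for this example, and `binary=True` is used so that each word counts once no matter how often it appears.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Python tutorial for data science",
    "Football match results and scores",
    "Data science jobs and salaries",
]
query = "data science tutorial"

# binary=True records only whether a word is present, not how often.
vectorizer = CountVectorizer(binary=True)
doc_matrix = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# The dot product of two binary vectors is the number of shared words.
overlap = (doc_matrix @ query_vector.T).toarray().ravel()
best = int(overlap.argmax())

print(overlap)          # [3 0 2]
print(documents[best])  # Python tutorial for data science
```

Because every shared word counts equally, a frequent filler word contributes as much as a distinctive keyword, which is the gap TF-IDF weighting tries to close.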


### Practical Applications of TF-IDF

- Information Retrieval: TF-IDF is used to rank documents by relevance in search engines, favoring documents that contain rare but important terms.

- Keyword Extraction: Automatically identify the most important words in a document (e.g., for summarization or tagging).

- Document Similarity: TF-IDF vectors can be used to measure how similar two documents are, which is useful in recommendation systems.

- Chatbots and FAQ Matching: Match user queries to the most relevant predefined answers based on TF-IDF similarity.

- Legal and Academic Text Mining: Extract meaningful terms from large corpora of legal or research documents.
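
The document similarity and FAQ matching items both come down to comparing TF-IDF vectors, typically with cosine similarity. Here is a minimal sketch; the FAQ questions and the user query are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq_questions = [
    "How do I reset my password?",
    "What payment methods do you accept?",
    "How can I track my order?",
]
user_query = "I forgot my password"

# Fit on the FAQ questions, then project the query into the same vector space.
vectorizer = TfidfVectorizer()
faq_matrix = vectorizer.fit_transform(faq_questions)
query_vector = vectorizer.transform([user_query])

# One similarity score per FAQ question; higher means more similar.
scores = cosine_similarity(query_vector, faq_matrix).ravel()
best = int(scores.argmax())

print(faq_questions[best])  # How do I reset my password?
```

The word "forgot" never appears in the FAQ questions, so it is ignored; the match rests on "my" and, above all, the rarer word "password".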


### BoW for Sentiment Analysis

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample data
texts = ["I love this movie", "This film was terrible", "Amazing acting", "Worst plot ever"]
labels = ["positive", "negative", "positive", "negative"]

# Vectorize using BoW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train a simple classifier
model = MultinomialNB()
model.fit(X, labels)

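# Predict on new text
test = vectorizer.transform(["Terrible acting"])
print(model.predict(test))  # Expected 'negative', but the model prints ['positive']

test = vectorizer.transform(["Worst acting"])
print(model.predict(test))  # Expected 'negative', but the model prints ['positive']

# "fighting" is not in the vocabulary, so only "amazing" contributes
test = vectorizer.transform(["Amazing fighting"])
print(model.predict(test))  # Expected 'positive'; the model prints ['positive']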

['positive']
['positive']
['positive']
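
One way to see why "Terrible acting" comes out as positive is to look at the class probabilities instead of only the predicted label. The sketch below retrains the same model on the same four sentences and calls `predict_proba`; it is an inspection aid, not a fix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I love this movie", "This film was terrible", "Amazing acting", "Worst plot ever"]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Probability of each class for the misclassified sentence.
probabilities = model.predict_proba(vectorizer.transform(["Terrible acting"]))[0]
for label, probability in zip(model.classes_, probabilities):
    print(label, round(probability, 3))
```

The two probabilities end up close to each other. "terrible" was seen once in a negative sentence and "acting" once in a positive one, so the model has one piece of evidence for each class and no way of knowing that "terrible" should carry the sentiment.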


### TF-IDF for Sentiment Analysis

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample data
texts = ["I love this movie", "This film was terrible", "Amazing acting", "Worst plot ever"]
labels = ["positive", "negative", "positive", "negative"]

# Vectorize using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train a simple classifier
model = MultinomialNB()
model.fit(X, labels)

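# Predict on new text
test = vectorizer.transform(["Terrible acting"])
print(model.predict(test))  # Expected 'negative', but the model prints ['positive']

test = vectorizer.transform(["Worst acting"])
print(model.predict(test))  # Expected 'negative', but the model prints ['positive']

# "fighting" is not in the vocabulary, so only "amazing" contributes
test = vectorizer.transform(["Amazing fighting"])
print(model.predict(test))  # Expected 'positive'; the model prints ['positive']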

['positive']
['positive']
['positive']


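As you can see, both models label "Terrible acting" and "Worst acting" as positive. BoW and TF-IDF are not great approaches here, since they treat every word as an independent feature and don't capture contextual or semantic meaning.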

Note:

Semantic Meaning:
This refers to the literal or dictionary definition of a word or phrase. It’s the core meaning that remains consistent regardless of where or how the word is used.
- Example: The word “bank” semantically means a financial institution or the side of a river; both are valid dictionary meanings.

Contextual Meaning:
This is the specific meaning a word takes on based on its surrounding words, situation, or usage. It’s how we figure out which semantic meaning is intended.
- Example:
  - “He sat by the bank and watched the water.” → Here, “bank” means the side of a river.
  - “She deposited money in the bank.” → Now, “bank” refers to a financial institution.

So, contextual meaning helps us choose the correct semantic meaning based on the situation.
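
The two "bank" sentences also show the limitation in code. In this short sketch, `CountVectorizer` maps both uses of "bank" to the same feature column, so nothing in the vectors distinguishes the river bank from the financial one.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "He sat by the bank and watched the water.",
    "She deposited money in the bank.",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences).toarray()

# Both sentences share a single "bank" column with the same count.
bank_column = vectorizer.vocabulary_["bank"]
print(matrix[:, bank_column])  # [1 1]
```

TF-IDF would rescale the value but still use that one column, so it has the same blind spot.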


