<a href="https://colab.research.google.com/github/tfindiamooc/mlp/blob/main/TextAnalysisClass2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lesson #2: From Words to Numbers - Bag of Words and TF-IDF

Welcome to Lesson #2! In the last lesson, you learned how to preprocess text using `CountVectorizer`. Now, we'll focus on a few more methods of **text vectorization** - turning words into numbers that machine learning models can understand.

We'll cover:

*   **Bag of Words (BoW):**  A simple but fundamental technique.
*   **N-grams:**  Capturing some word order.
*   **TF-IDF:**  Weighting words by importance.

Let's dive into the code!

In [None]:
# Code Cell 1: Basic BoW Code (from Lesson 1 Recap)
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Sachin Tendulkar is a legendary batsman.",
    "India won the Cricket World Cup in 2011.",
    "RRR is a blockbuster Indian movie.",
    "Bollywood movies are very entertaining.",
    "Is Sachin Tendulkar the greatest batsman?",
]

vectorizer_bow = CountVectorizer(
    lowercase=True,
    token_pattern=r'[a-zA-Z]+',
    stop_words='english')
vectorizer_bow.fit(corpus)
X_bow = vectorizer_bow.transform(corpus)

print("Bag of Words Vocabulary:")
print(vectorizer_bow.get_feature_names_out())
print("\nBag of Words Document-Term Matrix (Counts):")
print(X_bow.toarray())

Bag of Words Vocabulary:
['batsman' 'blockbuster' 'bollywood' 'cricket' 'cup' 'entertaining'
 'greatest' 'india' 'indian' 'legendary' 'movie' 'movies' 'rrr' 'sachin'
 'tendulkar' 'won' 'world']

Bag of Words Document-Term Matrix (Counts):
[[1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0]
 [0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1]
 [0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0]
 [0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0]]


### Bag of Words (BoW) in Detail

Remember this code from the last lesson? Let's break down what's happening with **Bag of Words (BoW)**:

*   **Vocabulary Creation:** `CountVectorizer` first builds a **vocabulary**. This is just a list of all the unique words in your text data, after preprocessing (like lowercasing and removing stop words).  Look at the "Vocabulary" output above - these are the words `CountVectorizer` learned from our `corpus`.

*   **Document-Term Matrix:**  Then, `CountVectorizer` creates a **Document-Term Matrix**.
    *   Each **row** in this matrix represents a **document** (in our case, each sentence in the `corpus`).
    *   Each **column** represents a **word** from the **vocabulary**.
    *   The **numbers** in the matrix are simply the **counts** of each word in each document.

*   **"Bag of Words" Concept:**  The name "Bag of Words" comes from the fact that we **lose the order of words**.  We only care about *which words are present* and *how often* they appear in each document.  Word order is ignored.

Let's move on to N-grams to capture some word order!

### Introducing N-grams

**Problem with basic BoW:**  BoW treats phrases like "good movie" and "movie good" as the same thing because it ignores word order.  Sometimes, word order *does* matter for meaning!

**Solution: N-grams!** N-grams are sequences of N words that can capture some word order. Let's start with **bigrams** (N=2), which are pairs of words.

In [None]:
# Code Cell 2: N-gram Code Example (Bigrams)
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "good movie",
    "movie good",
    "not good movie",
]

vectorizer_bigram = CountVectorizer(
    ngram_range=(2, 2), # ngram_range=(2, 2) for bigrams
    lowercase=True,
    token_pattern=r'[a-zA-Z]+',
    stop_words='english')
vectorizer_bigram.fit(corpus)
X_bigram = vectorizer_bigram.transform(corpus)

print("Bigram Vocabulary:")
print(vectorizer_bigram.get_feature_names_out())
print("\nBigram Document-Term Matrix:")
print(X_bigram.toarray())

Bigram Vocabulary:
['good movie' 'movie good']

Bigram Document-Term Matrix:
[[1 0]
 [0 1]
 [1 0]]


### Explanation of N-grams (Bigrams)

Look at the code and output above.  The key change is:

`vectorizer_bigram = CountVectorizer(ngram_range=(2, 2), ...)`

*   **`ngram_range=(2, 2)`:** This is what tells `CountVectorizer` to use **bigrams** (sequences of 2 words).

Run the code and check the output:

*   **Bigram Vocabulary:** Notice the vocabulary now contains word pairs like `"good movie"` and `"movie good"`.  These are treated as distinct features!

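*   **Bigram Document-Term Matrix:** The matrix now reflects the counts of these bigrams in each document. Notice that the third row matches the first: "not" is on scikit-learn's English stop-word list, so "not good movie" is reduced to "good movie" before the bigrams are formed.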

By using bigrams, we've captured a bit of word order information that was lost in basic BoW.

Now, let's try using both unigrams (single words) and bigrams together!

In [None]:
# Code Cell 3: N-gram Code Example (Unigrams and Bigrams)
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "good movie",
    "movie good",
    "not good movie",
]

vectorizer_unigram_bigram = CountVectorizer(
    ngram_range=(1, 2), # ngram_range=(1, 2) for unigrams and bigrams
    lowercase=True,
    token_pattern=r'[a-zA-Z]+',
    stop_words='english')
vectorizer_unigram_bigram.fit(corpus)
X_unigram_bigram = vectorizer_unigram_bigram.transform(corpus)

print("\nUnigram and Bigram Vocabulary:")
print(vectorizer_unigram_bigram.get_feature_names_out())
print("\nUnigram and Bigram Document-Term Matrix:")
print(X_unigram_bigram.toarray())


Unigram and Bigram Vocabulary:
['good' 'good movie' 'movie' 'movie good']

Unigram and Bigram Document-Term Matrix:
[[1 1 1 0]
 [1 0 1 1]
 [1 1 1 0]]


### Explanation of Unigrams and Bigrams (`ngram_range=(1, 2)`)

In this code, we changed `ngram_range` to:

`vectorizer_unigram_bigram = CountVectorizer(ngram_range=(1, 2), ...)`

*   **`ngram_range=(1, 2)`:**  This tells `CountVectorizer` to include **both unigrams (single words) AND bigrams (pairs of words)** in the vocabulary.

Run this code and look at the output:

*   **Unigram and Bigram Vocabulary:** The vocabulary now has both individual words (like "good", "movie") and word pairs (like "good movie", "movie good").

*   **Unigram and Bigram Document-Term Matrix:** The matrix counts both unigrams and bigrams.

Using `ngram_range=(1, 2)` or higher allows you to capture more context than just single words. However, be aware that the vocabulary can become much larger as you increase the `ngram_range`!

Next, let's learn about TF-IDF, which weights words based on their importance.

### TF-IDF - Weighting Words by Importance

**Problem with BoW and N-grams:**  Common words like "movie", "sentence", "document", "is", "the", etc., might appear frequently in *all* documents. Are these words really the most important for understanding what a document is *about*?  Probably not.

**Solution: TF-IDF (Term Frequency-Inverse Document Frequency)**

TF-IDF is a technique that **weights words based on their importance**. It gives higher weights to words that are:

*   **Frequent in a specific document (Term Frequency - TF)**
*   **Rare across *all* documents in the corpus (Inverse Document Frequency - IDF)**

Let's see TF-IDF in action!

In [None]:
# Code Cell 4: TF-IDF Code Example
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Sachin Tendulkar is a legendary batsman.",
    "India won the Cricket World Cup in 2011.",
    "RRR is a blockbuster Indian movie.",
    "Bollywood movies are very entertaining.",
    "Is Sachin Tendulkar the greatest batsman?",
]

vectorizer_tfidf = TfidfVectorizer(
    lowercase=True,
    token_pattern=r'[a-zA-Z]+',
    stop_words='english')
vectorizer_tfidf.fit(corpus)
X_tfidf = vectorizer_tfidf.transform(corpus)

print("TF-IDF Vocabulary:")
print(vectorizer_tfidf.get_feature_names_out())
print("\nTF-IDF Document-Term Matrix:")
print(X_tfidf.toarray())

TF-IDF Vocabulary:
['batsman' 'blockbuster' 'bollywood' 'cricket' 'cup' 'entertaining'
 'greatest' 'india' 'indian' 'legendary' 'movie' 'movies' 'rrr' 'sachin'
 'tendulkar' 'won' 'world']

TF-IDF Document-Term Matrix:
[[0.4695148  0.         0.         0.         0.         0.
  0.         0.         0.         0.5819515  0.         0.
  0.         0.4695148  0.4695148  0.         0.        ]
 [0.         0.         0.         0.4472136  0.4472136  0.
  0.         0.4472136  0.         0.         0.         0.
  0.         0.         0.         0.4472136  0.4472136 ]
 [0.         0.5        0.         0.         0.         0.
  0.         0.         0.5        0.         0.5        0.
  0.5        0.         0.         0.         0.        ]
 [0.         0.         0.57735027 0.         0.         0.57735027
  0.         0.         0.         0.         0.         0.57735027
  0.         0.         0.         0.         0.        ]
 [0.4695148  0.         0.         0.         0.      

### Explanation of TF-IDF

In this code, we simply replaced `CountVectorizer` with `TfidfVectorizer`:

`vectorizer_tfidf = TfidfVectorizer(...)`

Run the code and compare the output to the BoW output from earlier.

*   **TF-IDF Values:** Look at the numbers in the "TF-IDF Document-Term Matrix".  They are no longer just counts! They are **TF-IDF scores**.

Let's understand TF and IDF:

*   **TF (Term Frequency):**  This is basically the same as the word counts we saw in Bag of Words. It measures how often a word appears in a *particular document*.

*   **IDF (Inverse Document Frequency):** This measures how rare a word is across the *entire corpus* (collection of documents).
    *   Words that are very common in *all* documents (like "is", "the", "document") will have a *low IDF*.
    *   Words that are rare and appear mainly in *specific* documents will have a *high IDF* .

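*   **TF-IDF Score = TF \* IDF:** The TF-IDF score for a word in a document is calculated by multiplying its TF and IDF values.
    *   Words that are frequent in a document *and* rare in the overall corpus will get **high TF-IDF scores**. These are likely the words that are most important for understanding the content of *that specific document*.
    *   By default, `TfidfVectorizer` also scales each row to unit length (L2 normalization), which is why the scores in the output above are fractions such as 0.5 rather than raw products.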
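The TF × IDF recipe can be checked by hand. The sketch below is a rough reconstruction, assuming scikit-learn's documented defaults: smoothed IDF (`smooth_idf=True`, computed as `ln((1 + n) / (1 + df)) + 1`) and L2 row normalization (`norm='l2'`). The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

lesson_corpus = [
    "Sachin Tendulkar is a legendary batsman.",
    "India won the Cricket World Cup in 2011.",
    "RRR is a blockbuster Indian movie.",
    "Bollywood movies are very entertaining.",
    "Is Sachin Tendulkar the greatest batsman?",
]
options = dict(lowercase=True, token_pattern=r'[a-zA-Z]+', stop_words='english')

# TF: raw word counts, exactly the Bag of Words matrix.
counts = CountVectorizer(**options).fit_transform(lesson_corpus).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF (assumes the default smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF * IDF, then scale each row to unit length (assumes the default norm='l2').
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer(**options).fit_transform(lesson_corpus).toarray()
print("Manual result matches TfidfVectorizer:", np.allclose(tfidf_manual, tfidf_sklearn))
```

If the two matrices agree, the printed scores are simply counts times IDF followed by normalization. Changing `smooth_idf` or `norm` on the vectorizer would require changing the matching line in the manual version.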

**Compare BoW and TF-IDF Matrices:** Compare the TF-IDF matrix with the BoW matrix from Code Cell 1. In the first document every word has a count of 1, but the TF-IDF scores differ: "batsman", "sachin", and "tendulkar" also appear in the fifth document and score about 0.47, while "legendary" appears only in the first document and scores about 0.58. Words shared across documents are weighted down, and words specific to one document are weighted up. (The exact numbers depend on the corpus, but that's the general idea.)

TF-IDF is often a more effective way to represent text for machine learning than simple Bag of Words because it takes into account the importance of words within the context of the entire corpus.

Now, let's experiment and compare these vectorization methods!

### Experimentation and Comparison

Time to experiment and see how these vectorization techniques work! Try a few experiments.

#### Experiment 1a. `ngram_range` in `CountVectorizer`

Try different `ngram_range` values in `CountVectorizer`. For example, try `ngram_range=(1, 3)` to include unigrams, bigrams, and trigrams. What happens to the vocabulary size and the document-term matrix when you increase the `ngram_range`?

In [None]:
# your code

#### Experiment 1b. `ngram_range` in `TfidfVectorizer`

Try different `ngram_range` values in `TfidfVectorizer`. For example, try `ngram_range=(1, 3)` to include unigrams, bigrams, and trigrams. What happens to the vocabulary size and the document-term matrix when you increase the `ngram_range`?

In [None]:
# your code

#### Experiment 2. Different Corpora

Change the `corpus` variable to use different text examples (e.g., movie reviews, news snippets, your own sentences).

Observe how the BoW and TF-IDF matrices change with different text data.



In [None]:
# your code

#### Experiment 3. Stop Word Removal

What happens if you remove `stop_words='english'` in `TfidfVectorizer`?

Do common words like "is", "the", "and" get higher TF-IDF scores when you remove stop word filtering?

In [None]:
# your code

#### Experiment 4. Real-World Example (Qualitative)
*   Think about a real-world text classification task, like classifying movie reviews as positive or negative.

*   Which words do you think TF-IDF would weight highly in a **positive** movie review?

*   Which words would TF-IDF weight highly in a **negative** movie review?

*   Discuss qualitatively - you don't need to code this part, just think about it!

After experimenting, read the comparison summary below.

### Comparison Summary: BoW, N-grams, TF-IDF

Let's summarize the vectorization techniques we've learned:

*   **Bag of Words (BoW):**
    *   **Pros:** Simple, easy to understand, computationally fast.
    *   **Cons:** Ignores word order, treats all words equally (doesn't account for word importance).

*   **N-grams:**
    *   **Pros:** Captures some word order information, can improve performance in some cases.
    *   **Cons:** Vocabulary can become very large very quickly (especially for higher N values), still mostly loses word order beyond the N-gram window.

*   **TF-IDF:**
    *   **Pros:** Weights words by importance (frequency in document AND rarity in corpus), often performs better than simple BoW in practice.
    *   **Cons:** Still a Bag-of-Words approach - word order is mostly lost. Can be slightly more computationally expensive than BoW.

**Which vectorization method is "best"?**

There's no single "best" method for all text tasks!  It depends on your specific data and problem.

*   For simple tasks or as a starting point, **BoW** can be a good baseline.
*   If word order is potentially important, **N-grams** might help.
*   For many text classification tasks, **TF-IDF** often provides a good balance of simplicity and performance.

In the next lessons, we'll experiment with these vectorization methods when we build machine learning models and see how they affect model performance!

### Summary and Next Steps

Great job! In this lesson, you learned about **text vectorization** and explored three important techniques:

*   **Bag of Words (BoW)**
*   **N-grams**
*   **TF-IDF**

You used `CountVectorizer` and `TfidfVectorizer` in scikit-learn to create numerical representations of text.

**Key Takeaway:** You now have different ways to transform text into numbers that machine learning models can understand.  You can choose the vectorization method that is most appropriate for your text analysis task.

**Next Steps:**

In the next lesson, we'll finally start building **machine learning models for text classification**! We'll use the vector representations we learned today and apply models like **Logistic Regression** and **Decision Trees** to classify text documents. Get ready to build your first text classifiers!

### Key Takeaways for Lesson #2 (for students):

*   **Text vectorization** is essential to use text data in machine learning.
*   **Bag of Words (BoW)** is a basic method that counts word frequencies.
*   **N-grams** extend BoW to capture some word order.
*   **TF-IDF** weights words by their importance in a document and corpus.
*   `CountVectorizer` and `TfidfVectorizer` in scikit-learn are used to implement these techniques.

### Resources for Lesson #2:

*   **Scikit-learn documentation on `CountVectorizer`:** [https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

*   **Scikit-learn documentation on `TfidfVectorizer`:** [https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

### Additional Notes:

*   **Customization:**  Remember that `CountVectorizer` and `TfidfVectorizer` have many more parameters you can explore (e.g., `max_features`, custom tokenizers, etc.).  Check the scikit-learn documentation for details!

*   **Choosing Vectorizer:**  The "best" vectorizer often depends on your specific text data and task. Experimentation is key!  We'll see this in later lessons when we build models and compare performance with different vectorizers.

*   **Beyond TF-IDF:**  TF-IDF is a classic and effective technique, but there are more advanced vectorization methods, such as word embeddings (Word2Vec, GloVe, FastText), which we'll cover in later lessons. These methods capture semantic meaning and relationships between words in a richer way than BoW or TF-IDF.