# Unit 2 Bag-of-Words and N-Grams in NLP

Welcome to this lesson on **Bag-of-Words (BoW) and N-Grams**, two foundational techniques in **Natural Language Processing (NLP)** for converting text data into numerical representations. Understanding BoW is crucial before moving on to more advanced text vectorization methods, because it provides the basic framework for handling textual data in machine learning models.

In this lesson, you will learn:

  * What the **Bag-of-Words model** is and why it’s useful.
  * How **n-grams** enhance text representation.
  * How to implement BoW with **scikit-learn’s CountVectorizer**.
  * Challenges and best practices for working with BoW.

### What is Bag-of-Words?

The **Bag-of-Words (BoW)** model is a simple yet effective way to represent text data numerically. It converts text into a fixed-length numerical feature vector by counting word occurrences, disregarding grammar and word order.

#### How BoW Works

1.  **Tokenization**: Splitting text into individual words (tokens).
2.  **Building a Vocabulary**: Creating a list of all unique words in the dataset.
3.  **Encoding**: Counting how many times each word appears in a document.
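
The three steps above can be sketched in plain Python. This is a minimal illustration rather than how any particular library implements BoW; the helper names (`tokenize`, `build_vocabulary`, `encode`) and the two-sentence corpus are assumptions made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens (hyphens kept)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_vocabulary(documents):
    """Step 2: collect every unique token in the corpus, sorted alphabetically."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def encode(document, vocabulary):
    """Step 3: count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


# Hypothetical two-document corpus, used only for illustration
corpus = ["The cat sat on the mat.", "The dog chased the cat."]
vocabulary = build_vocabulary(corpus)
vectors = [encode(doc, vocabulary) for doc in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document is encoded against the same vocabulary, which is why all vectors have the same length regardless of how long the original text is.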

#### Example

Consider these three sample documents:

  * "Machine learning is amazing."
  * "Bag-of-Words is a fundamental NLP technique."
  * "NLP models often rely on n-grams."

**Step 1: Creating the Vocabulary**

The unique words across all sentences are:

`vocabulary = ["machine", "learning", "is", "amazing", "bag-of-words", "a", "fundamental", "nlp", "technique", "models", "often", "rely", "on", "n-grams"]`

**Step 2: Creating the Frequency Matrix**

Each sentence is represented as a vector:

| | Machine | Learning | Is | Amazing | Bag-of-Words | A | Fundamental | NLP | Technique | Models | Often | Rely | On | N-Grams |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Doc 1** | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **Doc 2** | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| **Doc 3** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |

Each row represents a document, and each column holds the number of times one vocabulary word occurs in that document.

### Understanding N-Grams

**N-Grams** are contiguous sequences of *n* words taken from a document. The basic types include:

  * **Unigrams**: Single words (e.g., "machine")
  * **Bigrams**: Two consecutive words (e.g., "machine learning")
  * **Trigrams**: Three consecutive words (e.g., "learning is amazing")

Using n-grams helps capture context better than single-word unigrams. For example, "not good" is negative, but in a unigram model, "not" and "good" would be treated separately.

#### Example of N-Grams

Given the sentence:

"Natural Language Processing is fascinating."

  * **Unigrams**: ["Natural", "Language", "Processing", "is", "fascinating"]
  * **Bigrams**: ["Natural Language", "Language Processing", "Processing is", "is fascinating"]
  * **Trigrams**: ["Natural Language Processing", "Language Processing is", "Processing is fascinating"]

By using n-grams, models can capture **phrases** and **contextual meaning** instead of isolated words.
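
As a rough sketch, the lists above can be generated with a sliding window over the tokens. The function name `ngrams` is an assumption for this example, and the sentence is split on whitespace without further preprocessing.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "Natural Language Processing is fascinating".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with `k` tokens yields `k - n + 1` n-grams, which is why the lists get shorter as `n` grows.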

### Implementing Bag-of-Words with Unigrams in Python

Let's now implement BoW using **scikit-learn’s CountVectorizer**, focusing only on **unigrams**.

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example Documents
docs = [
    "Machine learning and NLP are closely related fields.",
    "Bag-of-Words is a common technique in NLP.",
    "NLP models often use n-grams to improve performance."
]

# Initialize CountVectorizer for unigrams only
# ngram_range=(1, 1) is the default, so it is spelled out here only for clarity
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english')
bow_matrix = vectorizer.fit_transform(docs)

# Convert to DataFrame for better visualization
df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
```

**Step-by-Step Explanation**

1.  **Import Libraries**:
      * Import `CountVectorizer` from `sklearn.feature_extraction.text` for creating the BoW model.
      * Import `pandas` for handling data in a tabular format.
2.  **Define Example Documents**:
      * Create a list `docs` containing three example text documents. These documents will be used to demonstrate the BoW model.
3.  **Initialize CountVectorizer**:
      * Create an instance of `CountVectorizer` with `ngram_range=(1,1)` to specify that only unigrams (single words) should be considered.
      * Set `stop_words='english'` to remove common English stop words from the documents.
4.  **Fit and Transform Documents**:
      * Use the `fit_transform` method of `CountVectorizer` to learn the vocabulary from the documents and transform them into a BoW matrix. This matrix contains the frequency of each word in each document.
5.  **Convert to DataFrame**:
      * Convert the BoW matrix to a Pandas DataFrame for better visualization. The columns of the DataFrame represent the unique words (features) in the vocabulary, and the rows represent the documents.
6.  **Print the DataFrame**:
      * Print the DataFrame to display the BoW representation of the documents, showing the frequency of each unigram in each document.

**Output Example**

```
   bag  closely  common  fields  grams  improve  learning  machine  models  nlp  performance  related  technique  use  words
0    0        1       0       1      0        0         1        1       0    1            0        1          0    0      0
1    1        0       1       0      0        0         0        0       0    1            0        0          1    0      1
2    0        0       0       0      1        1         0        0       1    1            1        0          0    1      0
```

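This matrix shows the frequency of different unigrams in each document. Note that the default tokenizer splits on hyphens and drops single-character tokens, so "Bag-of-Words" becomes `bag` and `words` (with `of` removed as a stop word) and "n-grams" becomes `grams`. This information can be used to train machine learning models for classification, clustering, and other NLP tasks.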

### Practical Example: Text Classification with Bag-of-Words and N-Grams

In this example, you will apply the Bag-of-Words and n-gram techniques to a simple text classification task using a dataset of categorized short texts. We will implement a complete pipeline that includes creating a BoW representation, splitting the data, training a classifier, and evaluating performance.

#### Step-by-Step Implementation

1.  **Dataset Preparation**:
      * Use a dataset of categorized short texts, such as news headlines or product descriptions, with 2-3 different categories.
2.  **BoW Representation with N-Grams**:
      * Implement BoW with an appropriate n-gram range and preprocessing options like stop word removal and stemming/lemmatization.
3.  **Data Splitting**:
      * Split the data into training and testing sets.
4.  **Model Training**:
      * Train a simple classifier, such as Naive Bayes or Logistic Regression, on the BoW features.
5.  **Performance Evaluation**:
      * Evaluate the classification performance using metrics like accuracy, precision, recall, and F1-score.
6.  **Experimentation and Analysis**:
      * Experiment with different n-gram ranges and preprocessing options.
      * Analyze how these choices affect classification accuracy.

#### Example Code

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import pandas as pd

# Example dataset
data = {
    'text': [
        "Breaking news: Market hits all-time high",
        "New product launch: Innovative tech gadget",
        "Sports update: Local team wins championship",
        "Economy news: Inflation rates are rising",
        "Tech news: New smartphone features unveiled"
    ],
    'category': ['news', 'product', 'sports', 'news', 'tech']}

df = pd.DataFrame(data)

# BoW with unigrams and bigrams: ngram_range=(1, 2) keeps single words and pairs of consecutive words
# stop_words='english' removes common English stop words to reduce noise in the data
# Note: the vectorizer is fitted on the full dataset here for brevity. On real data,
# fit it on the training split only so test-set vocabulary does not leak into training.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X = vectorizer.fit_transform(df['text'])  # Transform the text data into a BoW matrix
y = df['category']  # Target labels

# Split data into training and testing sets
# test_size=0.2 holds out 20% of the rows for testing; random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Naive Bayes classifier on the training data
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Evaluate performance on the test data
# With only five rows the test set is a single example, so the scores are illustrative only
# zero_division=0 reports 0 (instead of warning) for classes with no predictions
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))

# Experiment with different n-gram ranges and preprocessing options
# Analyze the impact on classification accuracy
```

By following this example, you can demonstrate your ability to apply BoW models to solve practical NLP tasks and understand the impact of different preprocessing and n-gram settings on classification performance.
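
One way to run the suggested experiment is to wrap the vectorizer and classifier in a pipeline and loop over several n-gram ranges. The sketch below is only an illustration: the eight headlines and their labels are invented for this example, and with so little data the accuracy values say nothing about real-world performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy dataset with two categories
texts = [
    "Local team wins the championship final",
    "Star striker scores twice in cup match",
    "Coach praises team after league win",
    "Injured player misses the next match",
    "New smartphone features unveiled today",
    "Chip maker launches faster laptop processor",
    "Software update improves smartphone battery",
    "Startup unveils new laptop with faster chip",
]
labels = ["sports"] * 4 + ["tech"] * 4

# Split the raw text first so the vectorizer only ever sees training documents
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

scores = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    model = make_pipeline(
        CountVectorizer(ngram_range=ngram_range, stop_words="english"),
        MultinomialNB(),
    )
    model.fit(train_texts, y_train)  # fits the vocabulary and the classifier
    scores[ngram_range] = model.score(test_texts, y_test)  # accuracy on held-out text

for ngram_range, accuracy in scores.items():
    print(ngram_range, accuracy)
```

Because the pipeline fits the vectorizer inside `fit`, the vocabulary is learned from the training split alone, which keeps the comparison between n-gram settings fair.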

### Challenges

While BoW is simple and effective, it has some challenges:

  * **High Dimensionality**: With a large vocabulary, the feature space grows significantly.
  * **Loss of Semantic Meaning**: BoW ignores the order and meaning of words.
  * **Sparse Representation**: Most values in the feature matrix are zeros, making it inefficient for large datasets.
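
The first and third challenges are easy to measure. The sketch below reuses the three unigram example documents and reports, for a few n-gram ranges, how many features are created and what fraction of the matrix is zero; the `stats` dictionary is just a convenience for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Machine learning and NLP are closely related fields.",
    "Bag-of-Words is a common technique in NLP.",
    "NLP models often use n-grams to improve performance.",
]

stats = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    matrix = vectorizer.fit_transform(docs)  # SciPy sparse matrix
    n_docs, n_features = matrix.shape
    # nnz is the number of stored non-zero entries in the sparse matrix
    sparsity = 1 - matrix.nnz / (n_docs * n_features)
    stats[ngram_range] = (n_features, sparsity)

for ngram_range, (n_features, sparsity) in stats.items():
    print(f"{ngram_range}: {n_features} features, {sparsity:.0%} zeros")
```

scikit-learn returns a sparse matrix precisely because of this: only the non-zero counts are stored, so calling `.toarray()` is best reserved for small examples.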

### Best Practices

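  * **Use N-Grams**: Bigrams and trigrams can capture short phrases and local context that unigrams miss.
  * **Remove Stop Words**: Dropping very common words reduces the number of uninformative features.
  * **Apply Dimensionality Reduction**: Use techniques like truncated SVD (a PCA-like method that works directly on sparse matrices) or feature selection to manage feature explosion.
  * **Consider Stemming/Lemmatization**: Mapping word variants to a common form normalizes the vocabulary and reduces redundancy.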

### Summary

In this lesson, we covered the **Bag-of-Words model** and how **n-grams** improve text representation. We implemented BoW using **scikit-learn** and explored practical examples.

In the next lesson, we will continue our journey into more advanced text representation techniques. Apply what you've learned to real-world datasets and experiment with different n-gram settings to deepen your understanding!

## Bag-of-Words Model Implementation

You've learned about the Bag-of-Words model and its implementation. Now, let's put that knowledge into practice. Your task is to:

  * Initialize the `CountVectorizer` for unigrams.
  * Transform the text data into a numerical matrix.
  * Print both the matrix and the vocabulary.

Afterward, analyze the word frequencies: identify the most frequent words and determine which documents contain specific keywords. This exercise will help you solidify your understanding of text representation. Dive in and see how well you can apply these concepts!

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example Documents
docs = [
    "Machine learning and NLP are closely related fields. Machine learning is powerful.",
    "Bag-of-Words is a common technique in NLP. NLP is widely used.",
    "NLP models often use n-grams to improve performance. N-grams are useful."
]

# TODO: Initialize CountVectorizer with n-gram range for unigrams only

# TODO: Convert to DataFrame for better visualization

```

Completing the code and analyzing the results will give you a deeper understanding of how text is transformed into a numerical representation.

Here is the completed code, including the steps to initialize the `CountVectorizer`, transform the data, and print the results for analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example Documents
docs = [
    "Machine learning and NLP are closely related fields. Machine learning is powerful.",
    "Bag-of-Words is a common technique in NLP. NLP is widely used.",
    "NLP models often use n-grams to improve performance. N-grams are useful."
]

# Initialize CountVectorizer with n-gram range for unigrams only and remove stop words
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english')

# Transform the text data into a BoW matrix
bow_matrix = vectorizer.fit_transform(docs)

# Convert to DataFrame for better visualization
df_bow = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Print the resulting matrix as a DataFrame
print("Bag-of-Words Matrix:\n")
print(df_bow)

# Print the vocabulary
print("\nVocabulary (sorted alphabetically):\n")
print(vectorizer.get_feature_names_out())
```

### Analysis of the Results

After running the code, you'll see a matrix and a vocabulary list. Let's analyze them to answer the two questions from the exercise:

**1. Most Frequent Words:**
Summing each column of the matrix gives a word's total count across the corpus.

  * `nlp` appears 4 times across all documents (1 in Doc 1, 2 in Doc 2, 1 in Doc 3).
  * `machine` and `learning` each appear twice, both times in Document 1.
  * `grams` appears twice, both times in Document 3. The default tokenizer splits "n-grams" at the hyphen and discards the single-character token `n`, so the column is named `grams` rather than `n-grams`.
  * `powerful` appears once, in Document 1.

The most frequent word in the entire corpus is **nlp** (4 occurrences), followed by **grams**, **machine**, and **learning** (2 occurrences each).

**2. Document Keywords:**
By looking at the rows and columns in the DataFrame, you can determine which document contains specific keywords.

  * `'powerful'` is in **Document 1** (as indicated by the `1` in the first row under the `'powerful'` column).
  * `'technique'` is in **Document 2**.
  * `'performance'` is in **Document 3**.

## Enhance Text Analysis with N-Grams

## Text Classification with Bag-of-Words