```{contents}
```
# Bag of Words (BoW)

The **Bag of Words** model is a way to **represent text data numerically** by treating a document as a "bag" of its words, **ignoring grammar and word order**, but keeping multiplicity (how many times a word appears).

Essentially, BoW **converts text into a vector of numbers** that can be used as input to machine learning algorithms.

---

## **How BoW Works: Step-by-Step**

1. **Collect the corpus**

   * A corpus is the entire collection of text documents you want to analyze.
   * Example corpus:

     ```
     Doc1: I love NLP
     Doc2: NLP is amazing
     Doc3: I love machine learning
     ```

2. **Create the vocabulary**

   * Extract all **unique words** from the corpus.
   * Vocabulary = `[I, love, NLP, is, amazing, machine, learning]`

3. **Vectorize the documents**

   * For each document, count the occurrence of each word in the vocabulary.
   * Represent each document as a **vector** of word counts.

**Example Table:**

| Document                      | I | love | NLP | is | amazing | machine | learning |
| ----------------------------- | - | ---- | --- | -- | ------- | ------- | -------- |
| Doc1: I love NLP              | 1 | 1    | 1   | 0  | 0       | 0       | 0        |
| Doc2: NLP is amazing          | 0 | 0    | 1   | 1  | 1       | 0       | 0        |
| Doc3: I love machine learning | 1 | 1    | 0   | 0  | 0       | 1       | 1        |
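
The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization and keeps the vocabulary in order of first appearance so that the columns line up with the table.

```python
from collections import Counter

corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]

# Step 1: tokenize (splitting on whitespace is an assumption of this sketch).
tokenized = [doc.split() for doc in corpus]

# Step 2: build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Step 3: count each vocabulary word in each document.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for doc, vec in zip(corpus, vectors):
    print(vec, doc)
```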

---

## **Key Features of Bag of Words**

1. **Simplicity**

   * Very easy to understand and implement.

2. **Ignores grammar and word order**

   * Only considers **presence and frequency** of words, not sequence.
   * “I love NLP” and “NLP love I” are treated the same.

3. **Frequency-based representation**

   * Each vector entry shows how many times a word occurs in the document.

4. **Sparse vectors**

   * Most vocabulary words do not appear in any given document, so most vector entries are zero.
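
These properties can be checked directly. The sketch below uses a hypothetical helper, `bow_vector`, with whitespace tokenization and the vocabulary from the earlier example; it is meant only to illustrate order invariance, counting, and sparsity.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in text (whitespace tokens)."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vocabulary = ["I", "love", "NLP", "is", "amazing", "machine", "learning"]

# The same words in a different order give the same vector.
v1 = bow_vector("I love NLP", vocabulary)
v2 = bow_vector("NLP love I", vocabulary)
print(v1 == v2)

# A repeated word raises its count.
print(bow_vector("I love love NLP", vocabulary))

# Fraction of zero entries in one vector, a simple measure of sparsity.
zero_fraction = v1.count(0) / len(v1)
print(round(zero_fraction, 2))
```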

---

## **Advantages of BoW**

* Simple and intuitive.
* Works well for basic text classification problems.
* Counts can be reweighted with **TF-IDF** to down-weight words that are common across the corpus.
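
As a rough illustration of the TF-IDF idea, the sketch below multiplies each raw count by `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. This unsmoothed textbook formula is an assumption of the example; libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]
tokenized = [doc.split() for doc in corpus]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

n_docs = len(tokenized)
# Document frequency: number of documents that contain each word.
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}
# Unsmoothed textbook IDF (assumed here for simplicity).
idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}

def tfidf_vector(tokens):
    """Weight raw counts by IDF so corpus-wide common words score lower."""
    counts = Counter(tokens)
    return [counts[word] * idf[word] for word in vocabulary]

print(vocabulary)
for doc, tokens in zip(corpus, tokenized):
    print(doc, [round(x, 2) for x in tfidf_vector(tokens)])
```

In "NLP is amazing", the word "is" occurs in only one document and so receives a higher weight than "NLP", which occurs in two.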

---

## **Disadvantages of BoW**

* **Ignores word order** → loses context.
* **High dimensionality** → very large vocabulary can lead to large sparse vectors.
* **Cannot capture semantics** → “good” and “great” are treated as unrelated features, even though they mean nearly the same thing.
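
Two of these limitations show up when comparing count vectors with cosine similarity. The helper `cosine_similarity` below is a hypothetical function written for this sketch, using lowercased whitespace tokens.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Near-synonyms occupy different vocabulary slots, so they look unrelated.
print(cosine_similarity("good", "great"))

# Opposite meanings built from the same words look (almost exactly) identical.
print(round(cosine_similarity("dog bites man", "man bites dog"), 6))
```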

---

## **BoW in Python using scikit-learn**



```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love NLP",
    "NLP is amazing",
    "I love machine learning"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vectors:\n", X.toarray())
```


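```
Vocabulary: ['amazing' 'is' 'learning' 'love' 'machine' 'nlp']
BoW Vectors:
 [[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]
```

This vocabulary differs from the hand-built one in step 2 because `CountVectorizer` lowercases text by default and its default token pattern keeps only tokens of two or more alphanumeric characters, so "NLP" becomes "nlp" and "I" is dropped. The columns follow the alphabetically sorted vocabulary.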



**Key Points**

* BoW converts text to numeric vectors.
* Represents **word presence and frequency**.
* Does **not capture meaning or context**.



## **What are N-Grams?**

* **Definition:** N-Grams are **contiguous sequences of n items** (usually words or characters) from a given text.
* They capture **local context and word order**, which basic Bag of Words ignores.
* The “n” in N-Grams refers to the **number of items in the sequence**.

---

## **Types of N-Grams**

1. **Unigram (1-gram)**

   * Sequence of **1 word**.
   * Captures individual word frequency.
   * Example:

     ```
     Text: "I love NLP"
     Unigrams: ["I", "love", "NLP"]
     ```

2. **Bigram (2-gram)**

   * Sequence of **2 consecutive words**.
   * Captures some local context.
   * Example:

     ```
     Text: "I love NLP"
     Bigrams: ["I love", "love NLP"]
     ```

3. **Trigram (3-gram)**

   * Sequence of **3 consecutive words**.
   * Captures slightly longer context.
   * Example:

     ```
     Text: "I love NLP models"
     Trigrams: ["I love NLP", "love NLP models"]
     ```

4. **n-gram (general)**

   * Sequence of **n consecutive words**.
   * Example: 4-gram (quadgram) from "I love NLP models today":

     ```
     ["I love NLP models", "love NLP models today"]
     ```
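
All of these cases reduce to a sliding window of width n over the token list. A minimal sketch, assuming whitespace tokenization and a hypothetical helper named `word_ngrams`:

```python
def word_ngrams(text, n):
    """Return all contiguous n-word sequences in text, joined as strings."""
    tokens = text.split()  # whitespace tokenization, assumed for simplicity
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love NLP models today"
for n in range(1, 5):
    print(n, word_ngrams(sentence, n))
```

A text with fewer than n tokens yields an empty list, since no full window fits.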

---

## **Why use N-Grams?**

* They help **capture local context** and word order in text.
* Useful in:

  * Text classification
  * Sentiment analysis
  * Spam detection
  * Predictive text / autocomplete
* Can be used for both **words** and **characters**:

  * **Character-level n-grams** are useful for spelling correction, language modeling, or handling noisy text.
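
The value of character-level n-grams for noisy text can be seen with a misspelling. In the sketch below, `char_ngrams` and `jaccard` are hypothetical helpers written for this example; Jaccard overlap is just one possible way to compare the two sets.

```python
def char_ngrams(word, n):
    """Return all contiguous n-character substrings of word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

correct = char_ngrams("language", 3)
typo = char_ngrams("langauge", 3)
print(correct)
print(typo)
print(jaccard(correct, typo))
```

As whole-word tokens, "language" and "langauge" have nothing in common, but their character trigrams still overlap.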

---

**Trade-offs**

| N-Gram Type | Pros                    | Cons                             |
| ----------- | ----------------------- | -------------------------------- |
| Unigram     | Simple, less memory     | Ignores word order/context       |
| Bigram      | Captures local context  | Increases feature space          |
| Trigram     | Captures longer context | Higher dimensionality, sparsity  |
| Higher N    | More context            | Possible features grow exponentially with n; severe sparsity |
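
The growth in feature space can be made concrete by comparing the n-grams actually observed in a corpus with the number that could exist in principle. This sketch reuses the three-document corpus from the Bag of Words example and a hypothetical helper, `distinct_ngrams`; the upper bound assumes every combination of vocabulary words is a possible n-gram.

```python
corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]

def distinct_ngrams(docs, n):
    """Collect the set of distinct word n-grams across all documents."""
    found = set()
    for doc in docs:
        tokens = doc.split()
        found.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return found

vocab_size = len(distinct_ngrams(corpus, 1))
for n in (1, 2, 3):
    observed = len(distinct_ngrams(corpus, n))
    possible = vocab_size ** n  # upper bound: any sequence of n vocabulary words
    print(f"n={n}: observed={observed}, possible={possible}")
```

The gap between possible and observed n-grams widens quickly as n increases, which is where the sparsity in the table comes from.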

---

N-Grams are a **bridge between simple Bag-of-Words and advanced embeddings**, giving models some sense of word sequences without requiring deep learning.
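
One way to see this bridging role is to count n-grams instead of single words, sometimes called a bag of n-grams. The helper `bag_of_ngrams` below is a hypothetical sketch using lowercased whitespace tokens.

```python
from collections import Counter

def bag_of_ngrams(text, max_n=2):
    """Count all word n-grams of length 1..max_n in text."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

a = "the service was good not bad"
b = "the service was bad not good"

# With unigrams only, the two reviews are indistinguishable.
print(bag_of_ngrams(a, max_n=1) == bag_of_ngrams(b, max_n=1))

# Adding bigrams separates them ("was good" vs "was bad").
print(bag_of_ngrams(a, max_n=2) == bag_of_ngrams(b, max_n=2))
```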



```python
# Import required libraries
from nltk import ngrams
from nltk.tokenize import word_tokenize

# Note: word_tokenize needs NLTK's Punkt tokenizer data. If it is missing,
# run nltk.download("punkt") once ("punkt_tab" on recent NLTK releases).

# Sample text
text = "I love natural language processing"

# Tokenize the text into words
tokens = word_tokenize(text)

print("Tokens:", tokens)

# Unigrams (1-gram)
unigrams = list(ngrams(tokens, 1))
print("\nUnigrams:")
for uni in unigrams:
    print(uni)

# Bigrams (2-gram)
bigrams = list(ngrams(tokens, 2))
print("\nBigrams:")
for bi in bigrams:
    print(bi)

# Trigrams (3-gram)
trigrams = list(ngrams(tokens, 3))
print("\nTrigrams:")
for tri in trigrams:
    print(tri)
```


```
Tokens: ['I', 'love', 'natural', 'language', 'processing']

Unigrams:
('I',)
('love',)
('natural',)
('language',)
('processing',)

Bigrams:
('I', 'love')
('love', 'natural')
('natural', 'language')
('language', 'processing')

Trigrams:
('I', 'love', 'natural')
('love', 'natural', 'language')
('natural', 'language', 'processing')
```

