# Bag of Words (BoW) Function Documentation

## Overview

The Bag of Words (BoW) function uses the `CountVectorizer` class from the `sklearn.feature_extraction.text` module to transform a collection of text documents into a matrix of token counts, so that text data can be represented numerically for further analysis.

### Key Components

### 1. **CountVectorizer Initialization**
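`CountVectorizer` is the class that builds a **Bag of Words (BoW)** representation of text data. Calling it with no arguments creates an instance with default parameters: text is lowercased and split into tokens of two or more alphanumeric characters, so punctuation and single-character words are dropped. This instance is then used to transform the text corpus into a token count matrix.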

Example:
```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
```
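The effect of the default parameters can be checked with a minimal sketch. It uses `build_analyzer()`, which returns the callable the vectorizer applies to each document; the sample sentence and the variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# build_analyzer() returns the callable that preprocesses and tokenizes text.
analyze = vectorizer.build_analyzer()

# With default settings, text is lowercased, punctuation is dropped,
# and single-character tokens such as "a" are ignored.
tokens = analyze("An Apple, a banana!")
print(tokens)
# ['an', 'apple', 'banana']
```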

---

### 2. **`fit_transform` Method**

The `fit_transform` method performs two key operations:

#### **a. Fitting**
- **Purpose**: Learns the vocabulary from the input corpus.
- **Process**:
  - Tokenizes each document and identifies all the unique words (tokens) in the corpus.
  - Assigns a unique column index to each word; indices follow the sorted (alphabetical) order of the vocabulary.

#### **b. Transforming**
- **Purpose**: Converts the corpus into a **sparse matrix** of token counts.
- **Structure**:
  - **Rows**: Represent individual documents.
  - **Columns**: Represent unique words in the vocabulary.
  - **Values**: Indicate how many times a word occurs in a given document (raw counts, not relative frequencies).

The resulting sparse matrix has the shape `(n_samples, n_features)`, where:
- `n_samples`: Number of documents in the corpus.
- `n_features`: Number of unique words in the vocabulary.
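
The two operations can also be run separately, which makes each step visible. The following is a minimal sketch, assuming the two-document corpus used elsewhere in this document; calling `fit` followed by `transform` on the same corpus gives the same result as `fit_transform`. Variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["apple banana apple", "banana orange"]
vectorizer = CountVectorizer()

# Fitting: learn the vocabulary (token -> column index).
vectorizer.fit(corpus)
vocabulary = {token: int(index) for token, index in vectorizer.vocabulary_.items()}
print(sorted(vocabulary.items()))
# [('apple', 0), ('banana', 1), ('orange', 2)]

# Transforming: count tokens per document using the learned vocabulary.
X = vectorizer.transform(corpus)
print(X.shape)  # (n_samples, n_features)
# (2, 3)
```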

---

### 3. **Feature Names Extraction**

The `get_feature_names_out` method retrieves an array of **feature names**, i.e., the unique words in the vocabulary, in the order they appear in the sparse matrix columns.

Example:
```python
feature_names = vectorizer.get_feature_names_out()
print(feature_names)
# Output (a NumPy array, printed without commas): ['apple' 'banana' 'orange']
```

---

### 4. **Sparse Matrix Conversion**

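The sparse matrix generated by `fit_transform` stores only the non-zero counts. It can be converted into a dense matrix for better readability using the `.toarray()` method. This is practical only for small corpora, because the dense matrix stores every zero explicitly.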

#### Example:

Input corpus:
```python
corpus = ["apple banana apple", "banana orange"]
X = vectorizer.fit_transform(corpus)  # sparse matrix of token counts
```

#### Sparse Matrix Representation:
```plaintext
(0, 0)  2  # Document 0: 2 occurrences of "apple" (index 0)
(0, 1)  1  # Document 0: 1 occurrence of "banana" (index 1)
(1, 1)  1  # Document 1: 1 occurrence of "banana" (index 1)
(1, 2)  1  # Document 1: 1 occurrence of "orange" (index 2)
```

#### Dense Matrix Representation:
```python
X_dense = X.toarray()
print(X_dense)
# Output:
# [[2 1 0]  # Document 0: 2 "apple", 1 "banana", 0 "orange"
#  [0 1 1]] # Document 1: 0 "apple", 1 "banana", 1 "orange"
```

---

### Summary
- **`CountVectorizer`**: Converts text into a matrix of token counts.
- **`fit_transform`**: Learns the vocabulary and transforms the corpus into a sparse matrix.
- **`get_feature_names_out`**: Retrieves the unique words in the vocabulary.
- **`.toarray()`**: Converts the sparse matrix into a dense matrix for readability.