# Lesson 1: Introduction to Text Representation: The Bag-of-Words Model

Welcome to the very first lesson of our course **“Text Representation Techniques for RAG Systems”**, part of our **“Foundations of RAG Systems”** path! In the first course of this learning path, you learned the fundamentals of RAG, how to structure a simple RAG workflow, and why combining retrieval with generation is so powerful. Now, we’ll shift our focus to how we can turn raw text into numerical data — a crucial step if we want our RAG systems to retrieve information accurately and feed it into downstream pipelines. In other words, we’ll focus on the **indexing component** of our RAG pipeline.

---

## Learning Objectives

1. Understand why we must transform text into a structured format for RAG workflows.  
2. Explore the **Bag-of-Words (BOW)** method, a simple yet classic text representation technique.  

By the end, you’ll know:
- How words get mapped into vectors  
- Why these representations matter for building robust retrieval systems  

---

## Why Text Representation Is Essential

RAG systems revolve around:
1. Retrieving relevant documents based on a user’s query  
2. Generating a final answer  

Computers don’t process language like humans—they require **structured, numerical forms** of text to compare documents effectively. Without proper representation:

- We can’t reliably measure similarity between two texts.  
- Retrieving contextually relevant information becomes very difficult.  

A straightforward solution is the **Bag-of-Words** method. It counts how often each word appears, providing a simple numerical snapshot of a document. While BOW ignores word order and nuances, it’s an excellent entry point for converting messy language into machine-friendly formats.

---

## Understanding the BOW Model

Consider these three sentences:

1. “I love machine learning”  
2. “Machine learning is fun”  
3. “I love coding”  

First, gather all unique words into a **vocabulary**:  
```
{ I, love, machine, learning, is, fun, coding }
```

| Word     | I | love | machine | learning | is | fun | coding |
|----------|:-:|:----:|:-------:|:--------:|:-:|:---:|:------:|
| **Index**| 0 |  1   |    2    |     3    | 4 |  5  |   6    |

Next, transform each sentence into a numeric **frequency vector**:

| Sentence                   | I | love | machine | learning | is | fun | coding |
|----------------------------|:-:|:----:|:-------:|:--------:|:-:|:---:|:------:|
| I love machine learning    | 1 |  1   |    1    |     1    | 0 |  0  |   0    |
| Machine learning is fun    | 0 |  0   |    1    |     1    | 1 |  1  |   0    |
| I love coding              | 1 |  1   |    0    |     0    | 0 |  0  |   1    |

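Each column corresponds to a word; each entry is its occurrence count. Words are matched case-insensitively, so “Machine” in the second sentence counts toward the `machine` column.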
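
The table above can be reproduced in a few lines of plain Python. This is only a minimal sketch: it assumes whitespace tokenization, case-insensitive matching, and a vocabulary kept in order of first appearance (as in the table). The names `sentences`, `vocab`, and `vectors` are illustrative and not part of the lesson's code.

```python
sentences = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]

# Build the vocabulary in order of first appearance, comparing words
# case-insensitively so "Machine" and "machine" share one column.
vocab = []
seen = set()
for sentence in sentences:
    for word in sentence.split():
        if word.lower() not in seen:
            seen.add(word.lower())
            vocab.append(word)

# Count how often each vocabulary word occurs in each sentence.
vectors = []
for sentence in sentences:
    tokens = [token.lower() for token in sentence.split()]
    vectors.append([tokens.count(word.lower()) for word in vocab])

print(vocab)
for sentence, vector in zip(sentences, vectors):
    print(f"{sentence:25} {vector}")
```

Each printed row should match the corresponding row of the frequency table.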

---

## Building a Basic Vocabulary

In BOW, the first step is constructing a **vocabulary** dictionary:

```python
def build_vocab(docs):
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    # Sort for consistent ordering
    return {word: idx for idx, word in enumerate(sorted(unique_words))}
```

**How it works:**
1. Iterate through each document.  
2. Lowercase and split into tokens.  
3. Strip punctuation, add tokens to a `set` for uniqueness.  
4. Sort and enumerate to assign each word an index.
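
To see these steps in action, here is a short usage sketch that runs `build_vocab` on the three example sentences from the previous section. The variable name `docs` is an assumption made for illustration.

```python
def build_vocab(docs):
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    # Sort for consistent ordering
    return {word: idx for idx, word in enumerate(sorted(unique_words))}


docs = ["I love machine learning", "Machine learning is fun", "I love coding"]
vocab = build_vocab(docs)
print(vocab)
# {'coding': 0, 'fun': 1, 'i': 2, 'is': 3, 'learning': 4, 'love': 5, 'machine': 6}
```

Because the function lowercases and sorts, the indices differ from the first-appearance order used in the table earlier. Any ordering works as long as the same vocabulary is used for every document.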

---

## Converting Text to Vectors

Once you have a vocabulary, create a numeric BOW vector:

```python
import numpy as np

def bow_vectorize(text, vocab):
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1
    return vector
```

1. Initialize a zero vector of length `|vocab|`.  
2. For each cleaned token, look up its index and increment the count.  
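
The following sketch shows `bow_vectorize` on a few inputs, including a word that is not in the vocabulary. The hard-coded `vocab` is what `build_vocab` returns for the three example sentences from earlier in this lesson; the extra test strings are made up for illustration.

```python
import numpy as np


def bow_vectorize(text, vocab):
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1
    return vector


# Vocabulary for: "I love machine learning", "Machine learning is fun", "I love coding"
vocab = {"coding": 0, "fun": 1, "i": 2, "is": 3, "learning": 4, "love": 5, "machine": 6}

print(bow_vectorize("I love machine learning", vocab))  # [0 0 1 0 1 1 1]
print(bow_vectorize("I love pizza!", vocab))            # [0 0 1 0 0 1 0]
print(bow_vectorize("Fun, fun, FUN!", vocab))           # [0 3 0 0 0 0 0]
```

Note that "pizza" is silently dropped because it has no slot in the vocabulary. This is how out-of-vocabulary words behave in this simple implementation.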

---

## Conclusion and Next Steps

In this lesson, you learned:
- **Why** text must be represented numerically for RAG systems.  
- **How** the Bag-of-Words model converts words into count-based vectors.  

While BOW has limitations (it ignores word order and context), it is a simple, widely used baseline and a natural first step before more advanced representations.  
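
The word-order limitation is easy to demonstrate. In the sketch below, the helper `bag_of_words` and the two sentences are hypothetical examples rather than code from this lesson. It uses `collections.Counter` instead of a NumPy vector, but the counting idea is the same.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercased, punctuation-stripped tokens; word order is discarded."""
    tokens = (word.strip(".,!?") for word in text.lower().split())
    return Counter(token for token in tokens if token)


a = bag_of_words("The dog bites the man.")
b = bag_of_words("The man bites the dog.")

print(a == b)  # True
```

The two sentences mean very different things, yet their bags of words are identical.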

**Next Up:**  
- Explore advanced methods that preserve semantics (e.g., embeddings from language models).  
- Practice coding your own BOW pipeline with varied text inputs.  

This foundation prepares you for more powerful semantic retrieval techniques and deeper RAG integration. Good luck, and have fun experimenting!  


## Text Cleaning with Python

You've just learned how to build a vocabulary and convert text into Bag-of-Words vectors. Now, let's put that knowledge into practice with a simple task.

Your objective is to create a function called `preprocess_string` that processes a single string. Here's what you need to do:

- Split the string into words.
- Convert each word to lowercase.
- Remove punctuation from the start and end of each word (bonus points for doing it with a list comprehension!).

This exercise will help you solidify your understanding of text preprocessing. Dive in and see how well you can clean up a string!


```python
def preprocess_string(text):
    # TODO: Split the text into words
    # TODO: Convert words to lowercase and remove punctuation
    # TODO: Return the cleaned tokens
    pass  # placeholder so the stub runs until you implement it


print(preprocess_string("Hello, World! We are preprocessing strings today."))
```
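
One way to approach this task is sketched below, using the same `".,!?"` character set as the lesson's earlier code. The second input string is a made-up example that shows what `str.strip` does and does not remove.

```python
def preprocess_string(text):
    # Split on whitespace, lowercase each word, and strip punctuation from both ends.
    return [word.lower().strip(".,!?") for word in text.split()]


print(preprocess_string("Hello, World! We are preprocessing strings today."))
# ['hello', 'world', 'we', 'are', 'preprocessing', 'strings', 'today']

# str.strip only trims the ends of a word, so punctuation inside it is kept.
print(preprocess_string("Don't stop... e.g. now!"))
# ["don't", 'stop', 'e.g', 'now']
```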

## Building a Vocabulary Dictionary

Nice job cleaning and tokenizing a string! Now, let's take it a step further by creating a vocabulary dictionary.

Your task is to build a function that:

- Takes a list of tokens.
- Removes duplicates and sorts the remaining words.
- Assigns each word a numeric index to form a vocabulary dictionary.

This exercise will help you understand how to map words to indices, a key step in text representation. Dive in and see how well you can create a structured vocabulary!

```python
def preprocess_string(text):
    words = text.split()
    cleaned_tokens = [word.lower().strip(".,!?") for word in words]
    return cleaned_tokens
    
    
def build_vocab(tokens):
    # TODO: Sort the unique tokens and assign each a numeric index
    pass


sentence = "Hello, World! We are preprocessing strings today."
tokens = preprocess_string(sentence)
vocab = build_vocab(tokens)
print("Vocabulary:", vocab)
```

One possible solution:

```python
def preprocess_string(text):
    words = text.split()
    cleaned_tokens = [word.lower().strip(".,!?") for word in words]
    return cleaned_tokens
    
    
def build_vocab(tokens):
    """
    Takes a list of tokens, sorts the unique words, 
    and assigns each word a numeric index.
    """
    unique_tokens = sorted(set(tokens))
    return {word: idx for idx, word in enumerate(unique_tokens)}


# Example usage:
sentence = "Hello, World! We are preprocessing strings today."
tokens = preprocess_string(sentence)
vocab = build_vocab(tokens)
print("Vocabulary:", vocab)
# Output:
# Vocabulary: {'are': 0, 'hello': 1, 'preprocessing': 2, 'strings': 3, 'today': 4, 'we': 5, 'world': 6}
```

## Transform Text into Numeric Vectors

You've just explored how to build a vocabulary and convert text into Bag-of-Words vectors. Now, let's apply that knowledge in a practical task.

Your objective is to complete the `bow_vectorize` function. Here's what you need to do:

- Create a zero vector with the same length as the vocabulary.
- For each word in the text, clean it and check if it's in the vocabulary.
- Increment the vector slot corresponding to each vocabulary word found.

This exercise will help you solidify your understanding of transforming text into numeric vectors. Dive in and see how well you can implement this transformation!

```python
import numpy as np


def build_vocab(docs):
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    return {word: idx for idx, word in enumerate(sorted(unique_words))}


def bow_vectorize(text, vocab):
    # TODO: Create a zero vector with the same length as the vocabulary
    # TODO: Convert the text to lowercase and split it into words
    # TODO: For each word, clean it by stripping punctuation
    # TODO: Check if the cleaned word is in the vocabulary
    # TODO: If the word is in the vocabulary, increment the corresponding index in the vector
    pass


if __name__ == "__main__":
    example_texts = [
        "RAG stands for retrieval augmented generation, and retrieval is a key component of RAG.",
        "Data is crucial for retrieval processes, and without data, retrieval systems cannot function effectively."
    ]

    vocab = build_vocab(example_texts)
    print("Vocabulary: ", vocab.items(), '\n')

    for txt in example_texts:
        vec = bow_vectorize(txt, vocab)
        print(f"Text: {txt}\nBOW Vector: {vec}\n")
```

Here’s a filled-in version of your script with `bow_vectorize` implemented. It:

1. Creates a zero vector of length `len(vocab)`.  
2. Lowercases and splits the input text.  
3. Strips punctuation from each token.  
4. Looks up the token in `vocab` and, if present, increments that slot.

```python
import numpy as np


def build_vocab(docs):
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    return {word: idx for idx, word in enumerate(sorted(unique_words))}


def bow_vectorize(text, vocab):
    # 1. Zero vector
    vec = np.zeros(len(vocab), dtype=int)
    # 2. Tokenize
    for word in text.lower().split():
        # 3. Clean punctuation
        clean_word = word.strip(".,!?")
        # 4. Increment if in vocab
        if clean_word in vocab:
            vec[vocab[clean_word]] += 1
    return vec


if __name__ == "__main__":
    example_texts = [
        "RAG stands for retrieval augmented generation, and retrieval is a key component of RAG.",
        "Data is crucial for retrieval processes, and without data, retrieval systems cannot function effectively."
    ]

    vocab = build_vocab(example_texts)
    print("Vocabulary (word → index):")
    for w, i in vocab.items():
        print(f"  {w:12} → {i}")
    print()

    for txt in example_texts:
        vec = bow_vectorize(txt, vocab)
        print(f"Text:\n  {txt}")
        print(f"BOW Vector:\n  {vec}\n")
```

### Sample Output

```
Vocabulary (word → index):
  a            → 0
  and          → 1
  augmented    → 2
  cannot       → 3
  component    → 4
  crucial      → 5
  data         → 6
  effectively  → 7
  for          → 8
  function     → 9
  generation   → 10
  is           → 11
  key          → 12
  of           → 13
  processes    → 14
  rag          → 15
  retrieval    → 16
  stands       → 17
  systems      → 18
  without      → 19

Text:
  RAG stands for retrieval augmented generation, and retrieval is a key component of RAG.
BOW Vector:
  [1 1 1 0 1 0 0 0 1 0 1 1 1 1 0 2 2 1 0 0]

Text:
  Data is crucial for retrieval processes, and without data, retrieval systems cannot function effectively.
BOW Vector:
  [0 1 0 1 0 1 2 1 1 1 0 1 0 0 1 0 2 0 1 1]
```

You can now see each vector slot corresponds to the count of its vocabulary word in the text.
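
Once texts are count vectors, they can be compared numerically, which is the reason text representation matters for retrieval. The sketch below uses cosine similarity as one common choice; this lesson does not prescribe a particular measure, and the helper `cosine_similarity` is an illustrative addition, not part of the exercise.

```python
import numpy as np


def build_vocab(docs):
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    return {word: idx for idx, word in enumerate(sorted(unique_words))}


def bow_vectorize(text, vocab):
    vec = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vec[vocab[clean_word]] += 1
    return vec


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm else 0.0


example_texts = [
    "RAG stands for retrieval augmented generation, and retrieval is a key component of RAG.",
    "Data is crucial for retrieval processes, and without data, retrieval systems cannot function effectively.",
]

vocab = build_vocab(example_texts)
v1, v2 = (bow_vectorize(txt, vocab) for txt in example_texts)
print(round(cosine_similarity(v1, v2), 3))  # 0.389
```

The two texts share only "and", "for", "is", and "retrieval", so the score is well below 1.0; a text compared with itself would score 1.0.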

## Bag-of-Words Vectorization Task

You've just explored how to build a vocabulary and convert text into Bag-of-Words (BOW) vectors. Now, let's apply that knowledge to process multiple texts.

Your task is to:

- Generate a single vocabulary dictionary from a list of texts.
- Create a BOW vector for each text using the shared vocabulary.
- Print each resulting BOW vector.

This exercise will reinforce your understanding of text representation. Jump in and see how effectively you can manage and transform text!

```python
import numpy as np


def build_vocab(docs):
    # TODO: Create a set of unique words from all documents
    # 1. Initialize an empty set for unique words
    # 2. Iterate through each document and its words
    # 3. Clean words by converting to lowercase and removing punctuation
    # 4. Add clean words to the set
    # 5. Return a dictionary mapping sorted words to indices (use enumerate)
    pass  # placeholder so the file runs until you implement it


def bow_vectorize(text, vocab):
    # TODO: Convert text into a Bag-of-Words vector
    # 1. Create a zero vector with length equal to vocabulary size
    # 2. Process each word in the text (lowercase and clean)
    # 3. If word exists in vocabulary, increment its count in the vector
    # 4. Return the BOW vector
    pass  # placeholder so the file runs until you implement it


if __name__ == "__main__":
    example_texts = [
        "RAG stands for retrieval augmented generation, and retrieval is a key component of RAG.",
        "Data is crucial for retrieval processes, and without data, retrieval systems cannot function effectively."
    ]

    # TODO: Build a vocabulary from the example texts
    # 1. Call build_vocab() with example_texts
    # 2. Print the vocabulary to see word-to-index mapping

    # TODO: Convert each text into its BOW vector representation
    # 1. Iterate through each text in example_texts
    # 2. Convert each text to BOW vector using bow_vectorize()
    # 3. Print the original text and its corresponding vector
```

One possible solution:

```python
import numpy as np
import string

def build_vocab(docs):
    """
    Build a vocabulary mapping each unique, cleaned word to a unique index.
    """
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            # strip punctuation from both ends
            clean_word = word.strip(string.punctuation)
            if clean_word:
                unique_words.add(clean_word)
    # sort for consistent ordering
    return {word: idx for idx, word in enumerate(sorted(unique_words))}


def bow_vectorize(text, vocab):
    """
    Convert a single text into its Bag-of-Words vector using the provided vocab.
    """
    vec = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(string.punctuation)
        if clean_word in vocab:
            vec[vocab[clean_word]] += 1
    return vec


if __name__ == "__main__":
    example_texts = [
        "RAG stands for retrieval augmented generation, and retrieval is a key component of RAG.",
        "Data is crucial for retrieval processes, and without data, retrieval systems cannot function effectively."
    ]

    # Build the shared vocabulary
    vocab = build_vocab(example_texts)
    print("Vocabulary (word → index):")
    for word, idx in vocab.items():
        print(f"  {word:12} → {idx}")
    print()

    # Vectorize each text and print the result
    for txt in example_texts:
        vec = bow_vectorize(txt, vocab)
        print(f"Text:\n  {txt}")
        print(f"BOW Vector:\n  {vec}\n")
```

**Explanation of steps:**

1. **build_vocab**  
   - Aggregates all words (lowercased, punctuation-stripped) from every document into a set.  
   - Sorts that set so the indices are deterministic.  
   - Returns a mapping word → index.

2. **bow_vectorize**  
   - Creates a zero-filled NumPy array of length equal to the vocabulary.  
   - Splits and cleans each word in the input text.  
   - If the cleaned word exists in the vocabulary, increments the corresponding index.

3. **Main execution**  
   - Builds one shared vocabulary from all texts.  
   - Prints the vocabulary mapping.  
   - Converts each text into its BOW vector and prints both the text and its vector.
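
The solution above strips `string.punctuation` rather than the shorter `".,!?"` set used earlier in the lesson. The sketch below compares the two on a few made-up tokens; `samples` and `results` are illustrative names.

```python
import string

samples = ['"retrieval"', "(rag)", "data;", "-", "end."]

results = {}
for token in samples:
    basic = token.strip(".,!?")             # character set used earlier in the lesson
    full = token.strip(string.punctuation)  # all ASCII punctuation characters
    results[token] = (basic, full)
    print(f"{token!r:14} basic={basic!r:14} full={full!r}")
```

With the shorter set, quotes, parentheses, and semicolons survive and would create separate vocabulary entries. With `string.punctuation`, a lone `-` becomes an empty string, which is why `build_vocab` checks `if clean_word` before adding a token.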