# Lesson 1: Introduction to Text Representation with the Bag-of-Words Model

Welcome to the very first lesson of our course **“Text Representation Techniques for RAG Systems”**, part of our **“Foundations of RAG Systems”** path! In the first course of this learning path, you learned the fundamentals of RAG, how to structure a simple RAG workflow, and why combining retrieval with generation is so powerful. Now, we’ll shift our focus to how we can turn raw text into numerical data — a crucial step if we want our RAG systems to retrieve information accurately and feed it into downstream pipelines. In other words, we’ll focus on the **indexing component** of our RAG pipeline.

---

## Learning Objectives

1. Understand why we must transform text into a structured format for RAG workflows.  
2. Explore the **Bag-of-Words (BOW)** method, a simple yet classic text representation technique.  

By the end, you’ll know:
- How words get mapped into vectors  
- Why these representations matter for building robust retrieval systems  

---

## Why Text Representation Is Essential

RAG systems revolve around:
1. Retrieving relevant documents based on a user’s query  
2. Generating a final answer  

Computers don’t process language like humans—they require **structured, numerical forms** of text to compare documents effectively. Without proper representation:

- We can’t reliably measure similarity between two texts.  
- Retrieving contextually relevant information becomes very difficult.  

A straightforward solution is the **Bag-of-Words** method. It counts how often each word appears, giving a simple numerical snapshot of a document. BOW ignores word order and context, but it is an excellent entry point for converting messy language into machine-friendly formats.
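
To make the similarity point concrete, here is a minimal sketch that counts words and compares two texts with cosine similarity. The helper names `word_counts` and `cosine_similarity` and the sample sentences are assumptions made for illustration; they are not part of the lesson's code, and the whitespace tokenizer is deliberately naive.

```python
from collections import Counter
import math


def word_counts(text):
    # Lowercase and split on whitespace (a deliberately naive tokenizer).
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Dot product over shared words, divided by the product of the vector lengths.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample texts, used only for this illustration.
doc_a = word_counts("rag systems retrieve documents")
doc_b = word_counts("rag systems generate answers")
doc_c = word_counts("bananas are yellow")

print(round(cosine_similarity(doc_a, doc_b), 2))  # 0.5 (two of four words shared)
print(round(cosine_similarity(doc_a, doc_c), 2))  # 0.0 (no words shared)
```

Once text is reduced to counts, "how similar are these two texts?" becomes ordinary arithmetic, which is what a retriever needs.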

---

## Understanding the BOW Model

Consider these three sentences:

1. “I love machine learning”  
2. “Machine learning is fun”  
3. “I love coding”  

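First, gather all unique words, ignoring case, into a **vocabulary**. The words are listed here in order of first appearance; any fixed order works as long as it is used consistently (the code later in this lesson sorts the words alphabetically instead):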
```
{ I, love, machine, learning, is, fun, coding }
```

| Word     | I | love | machine | learning | is | fun | coding |
|----------|:-:|:----:|:-------:|:--------:|:-:|:---:|:------:|
| **Index**| 0 |  1   |    2    |     3    | 4 |  5  |   6    |

Next, transform each sentence into a numeric **frequency vector**:

| Sentence                   | I | love | machine | learning | is | fun | coding |
|----------------------------|:-:|:----:|:-------:|:--------:|:-:|:---:|:------:|
| I love machine learning    | 1 |  1   |    1    |     1    | 0 |  0  |   0    |
| Machine learning is fun    | 0 |  0   |    1    |     1    | 1 |  1  |   0    |
| I love coding              | 1 |  1   |    0    |     0    | 0 |  0  |   1    |

Each column corresponds to a word; each entry is its occurrence count.
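
The tables above can be reproduced with a few lines of plain Python. This is a sketch that assumes whitespace tokenization, lowercasing, and a vocabulary ordered by first appearance, to match the index table; the variable names are illustrative.

```python
sentences = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]

# Build the vocabulary in order of first appearance.
# Lowercasing makes "Machine" and "machine" the same word.
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Count each word of each sentence at its vocabulary index.
vectors = []
for sentence in sentences:
    row = [0] * len(vocab)
    for word in sentence.lower().split():
        row[vocab[word]] += 1
    vectors.append(row)

print(list(vocab))
# ['i', 'love', 'machine', 'learning', 'is', 'fun', 'coding']
for sentence, row in zip(sentences, vectors):
    print(row, sentence)
# [1, 1, 1, 1, 0, 0, 0] I love machine learning
# [0, 0, 1, 1, 1, 1, 0] Machine learning is fun
# [1, 1, 0, 0, 0, 0, 1] I love coding
```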

---

## Building a Basic Vocabulary

In BOW, the first step is constructing a **vocabulary**, a dictionary that maps each unique word to an integer index:

```python
def build_vocab(docs):
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    # Sort for consistent ordering
    return {word: idx for idx, word in enumerate(sorted(unique_words))}
```

**How it works:**
1. Iterate through each document.  
2. Lowercase and split into tokens.  
3. Strip punctuation, add tokens to a `set` for uniqueness.  
4. Sort and enumerate to assign each word an index.
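
As a quick usage sketch, here is `build_vocab` applied to the three example sentences. The function is repeated so the snippet runs on its own. Because the words are sorted alphabetically, the indices differ from the first-appearance order used in the tables earlier.

```python
def build_vocab(docs):
    # Same logic as the lesson's build_vocab, repeated for self-containment.
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    # Sort for consistent ordering
    return {word: idx for idx, word in enumerate(sorted(unique_words))}


docs = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]
vocab = build_vocab(docs)
print(vocab)
# {'coding': 0, 'fun': 1, 'i': 2, 'is': 3, 'learning': 4, 'love': 5, 'machine': 6}
```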

---

## Converting Text to Vectors

Once you have a vocabulary, create a numeric BOW vector:

```python
import numpy as np

def bow_vectorize(text, vocab):
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1
    return vector
```

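**How it works:**
1. Initialize a zero vector of length `|vocab|`.  
2. Lowercase the text, split it into tokens, and strip punctuation from each token.  
3. For each cleaned token found in the vocabulary, look up its index and increment the count. Tokens that are not in the vocabulary are skipped.  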
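
The sketch below shows the vectorization step in use. To keep it runnable with only the standard library, it uses `bow_vectorize_list`, a hypothetical list-based stand-in for `bow_vectorize` with the same logic but no NumPy. The extra input sentences are made up for illustration.

```python
def build_vocab(docs):
    # Same logic as the lesson's build_vocab, repeated for self-containment.
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    return {word: idx for idx, word in enumerate(sorted(unique_words))}


def bow_vectorize_list(text, vocab):
    # Hypothetical stand-in for bow_vectorize that returns a plain list.
    vector = [0] * len(vocab)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1
    return vector


docs = ["I love machine learning", "Machine learning is fun", "I love coding"]
vocab = build_vocab(docs)
# vocab: coding=0, fun=1, i=2, is=3, learning=4, love=5, machine=6

print(bow_vectorize_list("I love machine learning", vocab))   # [0, 0, 1, 0, 1, 1, 1]
print(bow_vectorize_list("Coding is fun, fun, fun!", vocab))  # [1, 3, 0, 1, 0, 0, 0]
print(bow_vectorize_list("I love deep learning", vocab))      # [0, 0, 1, 0, 1, 1, 0]
```

The second call shows repeated words accumulating into a count of 3. The third shows that "deep", which is not in the vocabulary, contributes nothing to the vector.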

---

## Conclusion and Next Steps

In this lesson, you learned:
- **Why** text must be represented numerically for RAG systems.  
- **How** the Bag-of-Words model converts words into count-based vectors.  

While BOW has limitations, such as ignoring word order and context, it is a simple, widely used baseline and a natural first step toward richer text representations.  
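
The word-order limitation is easy to see in a small sketch. The helper `bag` and the two sentences are assumptions made for illustration: both sentences contain the same words, so their bags of words are identical even though their meanings differ.

```python
from collections import Counter


def bag(text):
    # Word counts only; the position of each word is discarded.
    return Counter(text.lower().split())


a = bag("dog bites man")
b = bag("man bites dog")

print(a == b)  # True: BOW cannot tell these two sentences apart
```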

**Next Up:**  
- Explore advanced methods that preserve semantics (e.g., embeddings from language models).  
- Practice coding your own BOW pipeline with varied text inputs.  

This foundation prepares you for more powerful semantic retrieval techniques and deeper RAG integration. Good luck, and have fun experimenting!  


## Text Cleaning with Python

You've just learned how to build a vocabulary and convert text into Bag-of-Words vectors. Now, let's put that knowledge into practice with a simple task.

Your objective is to create a function called `preprocess_string` that processes a single string. Here's what you need to do:

1. Split the string into words.
2. Convert each word to lowercase.
3. Remove punctuation from the start and end of each word (bonus points for doing it with a list comprehension!).

This exercise will help you solidify your understanding of text preprocessing. Dive in and see how well you can clean up a string!


```python
def preprocess_string(text):
    # TODO: Split the text into words
    # TODO: Convert words to lowercase and remove punctuation
    # TODO: Return the cleaned tokens
    pass  # Placeholder so the stub is valid Python; replace with your implementation


print(preprocess_string("Hello, World! We are preprocessing strings today."))
```

## Building a Vocabulary Dictionary

Nice job cleaning a string into tokens! Now, let's take it a step further by creating a vocabulary dictionary.

Your task is to build a function that:

1. Takes a list of tokens.
2. Sorts the unique words.
3. Assigns each word a numeric index to form a vocabulary dictionary.

This exercise will help you understand how to map words to indices, a key step in text representation. Dive in and see how well you can create a structured vocabulary!

```python
def preprocess_string(text):
    words = text.split()
    cleaned_tokens = [word.lower().strip(".,!?") for word in words]
    return cleaned_tokens
    
    
def build_vocab(tokens):
    # TODO: Sort the unique tokens and assign each a numeric index
    pass


sentence = "Hello, World! We are preprocessing strings today."
tokens = preprocess_string(sentence)
vocab = build_vocab(tokens)
print("Vocabulary:", vocab)
```

One possible solution:

```python
def preprocess_string(text):
    words = text.split()
    cleaned_tokens = [word.lower().strip(".,!?") for word in words]
    return cleaned_tokens
    
    
def build_vocab(tokens):
    """
    Takes a list of tokens, sorts the unique words, 
    and assigns each word a numeric index.
    """
    unique_tokens = sorted(set(tokens))
    return {word: idx for idx, word in enumerate(unique_tokens)}


# Example usage:
sentence = "Hello, World! We are preprocessing strings today."
tokens = preprocess_string(sentence)
vocab = build_vocab(tokens)
print("Vocabulary:", vocab)
# Output:
# Vocabulary: {'are': 0, 'hello': 1, 'preprocessing': 2, 'strings': 3, 'today': 4, 'we': 5, 'world': 6}
```

## Transform Text into Numeric Vectors

## Bag-of-Words Vectorization Task