# Lesson 4: Comparing Bag-of-Words and Embeddings-Based Semantic Search

Welcome to our final lesson in this course about **Text Representation Techniques for RAG systems**! You’ve already explored the basics of Bag-of-Words (BOW) representations and experimented with sentence embeddings in earlier lessons. Now, we’re going to compare how these two methods differ in actual search scenarios. Think of this as a practical refresher on BOW and embeddings, but with an added focus on side-by-side comparison and deciding which approach might be best for different retrieval use cases.

---

## From Words To Meaning: Why We Need Both Approaches

Before diving into the code, let’s clarify why both methods—from straightforward word matching to deeper semantic modeling—are valuable:

- **Lexical Overlap (BOW)**  
  This approach checks for exact word matches, making it easy to interpret how documents are scored. If your query has the phrase `"external data"`, any document containing those exact words gets a higher score. It’s simple, transparent, and efficient for many tasks—but can struggle with synonyms or varying phrasing.

- **Semantic Similarity (Embeddings)**  
  Here, we focus on the overall meaning rather than specific words. Two differently phrased sentences can still be close in the embedding space if they convey the same idea. This approach excels at capturing nuances. However, it depends on a trained model and requires more computation.

> In some real-world settings, you might even combine both: run a quick lexical match and then refine the results with a more precise semantic model. Let’s see how these methods look in code so you can start comparing results for yourself.
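
One way to picture the two-stage idea is the sketch below. It is illustrative only: `two_stage_search`, `lexical_score`, and `toy_semantic_score` are hypothetical names, and the "semantic" scorer is a tiny hand-written synonym table standing in for a real embedding model.

```python
def tokens(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    cleaned = (w.strip(".,!?") for w in text.lower().split())
    return [w for w in cleaned if w]


def lexical_score(query, doc):
    """Stage 1: number of distinct words shared by query and document."""
    return len(set(tokens(query)) & set(tokens(doc)))


# Hypothetical synonym table: a stand-in for a trained embedding model.
CANONICAL = {"merge": "combine", "merging": "combine", "outside": "external"}


def toy_semantic_score(query, doc):
    """Stage 2 stand-in: Jaccard overlap after mapping synonyms to one form."""
    q = {CANONICAL.get(t, t) for t in tokens(query)}
    d = {CANONICAL.get(t, t) for t in tokens(doc)}
    return len(q & d) / len(q | d)


def two_stage_search(query, docs, k=2):
    """Shortlist k documents lexically, then re-rank only the shortlist."""
    shortlist = sorted(range(len(docs)),
                       key=lambda i: lexical_score(query, docs[i]),
                       reverse=True)[:k]
    return sorted(shortlist,
                  key=lambda i: toy_semantic_score(query, docs[i]),
                  reverse=True)


docs = [
    "External data feeds with editing tools",
    "Bananas are rich in potassium",
    "Merging outside data with generative models",
]
ranking = two_stage_search("combine external data with models", docs)
print(ranking)
```

The lexical stage ties documents 0 and 2 and discards the banana sentence; the second stage then moves document 2 ahead because its synonyms map onto the query's words.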

---

## Implementing Bag-of-Words Search

Below is an example of how to implement a BOW-based search workflow. We first build a vocabulary, then vectorize each document and the query according to how often each word appears.

```python
import numpy as np


def bow_vectorize(text, vocab):
    """
    Convert a text into a Bag-of-Words vector by counting how many times 
    each token from our vocabulary appears in the text.
    """
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        # Remove punctuation for consistency
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1
    return vector

def bow_search(query, docs):
    """
    Rank documents by lexical overlap using the BOW technique.
    The dot product between the query vector and each document vector
    sums, over the words they share, the product of the word counts.
    VOCAB is the shared word -> index mapping built from the corpus.
    """
    query_vec = bow_vectorize(query, VOCAB)
    scores = []
    for i, doc in enumerate(docs):
        doc_vec = bow_vectorize(doc, VOCAB)
        score = np.dot(query_vec, doc_vec)  # Higher score = more overlap
        scores.append((i, score))
    # Sort by descending overlap
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores
```

Let’s break this down:

1. **`bow_vectorize`**  
   - Splits the text into words, applies some light cleanup (punctuation removal), and counts occurrences.  
   - If “external” appears once in the query, that contributes 1 to the corresponding position in the query vector.

2. **`bow_search`**  
   - Converts the query into a BOW vector, does the same for each document, and uses the dot product to measure overlap: each shared token contributes the product of its counts in the query and in the document.  
   - Documents with many overlapping terms move to the top of the list.

> This method is straightforward and fast for situations when exact word usage is critical. But what if your query is phrased differently than the document’s text? That’s where embeddings shine.
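
To make the dot-product scoring concrete, here is a small dependency-free sketch. It uses `collections.Counter` instead of the NumPy vectors above, and `bow_counts` and `overlap_score` are illustrative names rather than part of the lesson code.

```python
from collections import Counter


def bow_counts(text):
    """Count tokens after lowercasing and stripping basic punctuation."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return Counter(w for w in words if w)


def overlap_score(query, doc):
    """Dot product of two count vectors: sum of count products over shared words."""
    q, d = bow_counts(query), bow_counts(doc)
    return sum(q[w] * d[w] for w in q if w in d)


query = "combine external data"
repeated = overlap_score(query, "External data, more data, and external feeds.")
synonyms = overlap_score(query, "Merging outside information")
print(repeated)  # repeated words are counted each time they occur
print(synonyms)  # same idea, but no shared words
```

The first document scores 4 because "external" and "data" each appear twice; the second scores 0 even though it expresses the same idea, which is exactly the gap embeddings are meant to close.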

---

## Implementing Embedding-based Search

To tackle the challenge of phrasing differences or synonyms, let’s look at embedding-based search:

```python
import numpy as np
from numpy.linalg import norm


def cos_sim(a, b):
    """
    Compute cosine similarity between two vectors, 
    indicating how similar they are.
    """
    return np.dot(a, b) / (norm(a) * norm(b))

def embedding_search(query, docs, model):
    """
    Rank documents by comparing how semantically close they are 
    to the query in the embedding space using cosine similarity.
    """
    # Encode both the query and documents into embeddings
    query_emb = model.encode([query])[0]
    doc_embs = model.encode(docs)

    scores = []
    for i, emb in enumerate(doc_embs):
        score = cos_sim(query_emb, emb)
        scores.append((i, score))
    # Sort by semantic similarity in descending order
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores
```

In this snippet:

- **`cos_sim`** computes the cosine similarity between two vectors. Vectors pointing in a similar direction get a higher score.  
- **`embedding_search`** encodes the query and each document into high-dimensional embeddings using a pre-trained model, then ranks documents by cosine similarity.

> This approach relies on meaning rather than exact word matching. A query about “combining external data with generative models” can find documents discussing “merging external text into RAG systems,” even if some words differ.
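
The behaviour of cosine similarity is easy to check on toy vectors. The sketch below is a plain-Python variant of `cos_sim`. The zero-vector guard is an added assumption, not something the lesson's version does.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


same_direction = cos_sim([1.0, 2.0], [2.0, 4.0])  # parallel, different lengths
orthogonal = cos_sim([1.0, 0.0], [0.0, 3.0])      # no shared direction
zero_vector = cos_sim([0.0, 0.0], [1.0, 1.0])     # guarded case
print(same_direction, orthogonal, zero_vector)
```

Because the score depends only on direction, scaling a vector does not change it, so the parallel pair scores (approximately) 1 and the orthogonal pair scores 0.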

---

## Analyzing the Search Output

Let’s compare results for the sample query:

> **Query:**  
> How does a system combine external data with language generation to improve responses?

### BOW Search Results

```
Doc 3 | Score: 5 | Text: Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.
Doc 0 | Score: 4 | Text: Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.
Doc 4 | Score: 3 | Text: Financial institutions analyze market data and use automated report generation to guide investment decisions.
Doc 2 | Score: 2 | Text: By merging retrieved text with generative models, RAG overcomes the limitations of static training data.
Doc 5 | Score: 2 | Text: Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.
Doc 1 | Score: 1 | Text: RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.
Doc 6 | Score: 0 | Text: Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C.
```

### Embedding-based Search Results

```
Doc 0 | Score: 0.5939 | Text: Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.
Doc 1 | Score: 0.4375 | Text: RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.
Doc 2 | Score: 0.4234 | Text: By merging retrieved text with generative models, RAG overcomes the limitations of static training data.
Doc 3 | Score: 0.3179 | Text: Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.
Doc 4 | Score: 0.2539 | Text: Financial institutions analyze market data and use automated report generation to guide investment decisions.
Doc 5 | Score: 0.2015 | Text: Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.
Doc 6 | Score: 0.0802 | Text: Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C.
```

- **BOW** ranks **Doc 3** highest because it shares five tokens with the query (“combine”, “external”, “data”, “with”, “to”), two of which are common function words, even though **Doc 0** is far more relevant to RAG.  
- **Embeddings** place the three RAG documents (**Doc 0**, **Doc 1**, and **Doc 2**) at the top, capturing the semantic relationship between “language generation” and “integrating external documents.”
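
To see at a glance how far each document moves between the two methods, the document orders from the listings above can be compared directly. This is a small sketch; `rank_shift` is an illustrative helper, not part of the lesson code.

```python
# Document order copied from the two result listings above.
bow_order = [3, 0, 4, 2, 5, 1, 6]
embedding_order = [0, 1, 2, 3, 4, 5, 6]


def rank_shift(order_a, order_b):
    """Map doc id -> (position in order_a) - (position in order_b).

    A positive value means the document sits higher in order_b.
    """
    pos_a = {doc: rank for rank, doc in enumerate(order_a)}
    pos_b = {doc: rank for rank, doc in enumerate(order_b)}
    return {doc: pos_a[doc] - pos_b[doc] for doc in order_a}


shifts = rank_shift(bow_order, embedding_order)
for doc, shift in sorted(shifts.items()):
    print(f"Doc {doc}: {shift:+d}")
```

Doc 1 climbs four places under embeddings and Doc 3 drops three, while the unrelated Doc 6 stays last under both methods.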

---

## Conclusion And Next Steps

In this lesson, we compared a **Bag-of-Words** search with an **embedding-based** semantic search and saw how each method ranks documents differently. BOW is fast and transparent for vocabulary-based matches, while embeddings capture deeper connections between words and phrases.

> **Next**, you’ll get hands-on practice implementing these approaches. Have fun exploring!


## Building a Bag of Words

Congratulations on reaching this point in your learning journey! You've already explored the fundamentals of text processing, and now it's time to apply that knowledge to a practical exercise.

In this activity, you'll work on the `bow_vectorize` function, a key component for converting text into a Bag-of-Words (BOW) vector. Your objective is to complete the function by implementing the following:

- Remove punctuation from each word in the text to ensure consistent matching.
- Count the occurrences of each word and update the vector accordingly.

For instance, given the input text `"Hello, world! Hello."` and a vocabulary containing `"hello"` and `"world"`, the resulting vector should be `[2, 1]`.

This exercise will deepen your understanding of BOW vectors, preparing you for more advanced comparisons with embeddings. Dive in and enjoy the coding experience!

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer


KNOWLEDGE_BASE = [
    "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    "RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.",
    "By merging retrieved text with generative models, RAG overcomes the limitations of static training data.",
    "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
    "Financial institutions analyze market data and use automated report generation to guide investment decisions.",
    "Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.",
    "Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C."
]


def build_vocab(docs):
    """
    Dynamically build a vocabulary from the given docs.
    Each new word in the corpus is an entry in the vocabulary.
    """
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    return {word: idx for idx, word in enumerate(sorted(unique_words))}

VOCAB = build_vocab(KNOWLEDGE_BASE)


def bow_vectorize(text, vocab=VOCAB):
    """
    Convert a text into a Bag-of-Words vector, using a shared vocabulary.
    Each element counts how many times a particular token appears.
    """
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        # TODO: Remove punctuation from the word to ensure consistent matching
        clean_word = _______
        # TODO: If the cleaned word exists in our vocabulary, increment its count in the vector
        if clean_word in vocab:
            _______
    return vector


def bow_search(query, docs):
    """
    Rank documents by lexical overlap (BOW).
    The dot product between query and doc vectors reflects how many
    tokens they have in common.
    """
    query_vec = bow_vectorize(query)
    scores = []
    for i, doc in enumerate(docs):
        doc_vec = bow_vectorize(doc)
        score = np.dot(query_vec, doc_vec)
        scores.append((i, score))
    # Sort documents so that higher lexical overlap is first
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores


if __name__ == "__main__":
    query = "How does a system combine external data with language generation to improve responses?"
    print(f"Query: {query}")

    # BOW-based search
    bow_results = bow_search(query, KNOWLEDGE_BASE)
    print("BOW Search Results:")
    for idx, score in bow_results:
        print(f"  Doc {idx} | Score: {score} | Text: {KNOWLEDGE_BASE[idx]}")

```

To complete the `bow_vectorize` function, you'll need to implement the following:

1. **Remove Punctuation**: We need to ensure that punctuation is removed so that the words are consistently matched (e.g., "hello" and "hello," should be considered the same).
2. **Count Word Occurrences**: For each word in the `text`, we'll count its occurrences based on the provided vocabulary.

Here's how you can modify the `bow_vectorize` function:

1. **Remove punctuation**: You can use Python's `str.strip()` to remove punctuation or regular expressions to clean the words.
2. **Count occurrences**: For each cleaned word, check if it exists in the vocabulary, and if so, increment the corresponding entry in the vector.
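
The two cleaning options behave differently on words with punctuation inside them or with brackets around them. The sketch below compares them; `clean_strip` and `clean_regex` are illustrative names, and the regular expression is one possible choice rather than the lesson's prescribed pattern.

```python
import re

PUNCT_EDGE = ".,!?"  # the characters the starter code's build_vocab strips


def clean_strip(word):
    """Remove the listed characters from both ends of the word only."""
    return word.strip(PUNCT_EDGE)


def clean_regex(word):
    """Remove every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^\w]", "", word)


for raw in ["hello,", "(rag)", "retrieval-augmented"]:
    print(f"{raw!r:24} strip -> {clean_strip(raw)!r:24} regex -> {clean_regex(raw)!r}")
```

Whichever cleaner is chosen, it should be applied identically when building the vocabulary and when vectorizing; otherwise tokens such as `(rag)` are stored under one key and looked up under another.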

### Completed Code:

```python
import numpy as np

# Assumes VOCAB has already been built with build_vocab(KNOWLEDGE_BASE),
# as in the starter code above.

def bow_vectorize(text, vocab=VOCAB):
    """
    Convert a text into a Bag-of-Words vector, using a shared vocabulary.
    Each element counts how many times a particular token appears.
    """
    # Initialize the vector with zeros
    vector = np.zeros(len(vocab), dtype=int)

    # Lowercase the text and process it word by word
    for word in text.lower().split():
        # Strip the same punctuation that build_vocab strips, so cleaned
        # words line up with the vocabulary keys
        clean_word = word.strip(".,!?")

        # If the cleaned word exists in the vocabulary, increment its count in the vector
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1

    return vector
```

### Explanation:
1. **`word.strip(".,!?")`**: This removes leading and trailing punctuation from the word, matching the cleanup in `build_vocab`. Removing all punctuation with `str.translate(str.maketrans("", "", string.punctuation))` also produces the expected result for the example below, but only stays consistent if `build_vocab` cleans words the same way; with the starter vocabulary, tokens such as `(rag)` and `retrieval-augmented` are stored with their punctuation and would be looked up under a different key.
2. **Count word occurrences**: For each cleaned word, we check if it exists in the vocabulary. If it does, we increment its corresponding index in the vector.

### Example:

For the input `"Hello, world! Hello."` and a vocabulary of `{"hello": 0, "world": 1}`, the resulting vector would be `[2, 1]`.
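
This example can be checked without the rest of the script. The sketch below is a self-contained, list-based variant (`bow_vector` is an illustrative name) that fills one slot per vocabulary entry.

```python
from collections import Counter


def bow_vector(text, vocab):
    """Return a list of counts, one slot per vocabulary entry."""
    counts = Counter(w.strip(".,!?") for w in text.lower().split())
    vector = [0] * len(vocab)
    for word, idx in vocab.items():
        vector[idx] = counts[word]  # Counter returns 0 for missing words
    return vector


vocab = {"hello": 0, "world": 1}
vector = bow_vector("Hello, world! Hello.", vocab)
out_of_vocab = bow_vector("Goodbye, world", vocab)
print(vector)
print(out_of_vocab)
```

Words outside the vocabulary, such as "goodbye", are simply ignored, so the second call leaves the `hello` slot at zero.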

### Next Steps:

1. **Test the function**: You can run the `bow_vectorize` function with sample text to see how it converts the text into a vector based on the vocabulary.
2. **Implement in search**: The `bow_search` function will now be able to rank documents based on their lexical overlap using the BOW vector.


## Enhance Bag-of-Words with Bigrams

You've done a great job understanding the basics of Bag-of-Words and embeddings. Now, let's enhance the Bag-of-Words approach by incorporating bigrams into the `bow_vectorize` function.

Bigrams are a type of N-grams, which are contiguous sequences of n items from a given text. In the case of bigrams, n is 2, meaning each bigram consists of two consecutive words; for example, in the sentence "RAG systems retrieve information" the bigrams are "RAG systems", "systems retrieve", and "retrieve information".
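
A compact way to produce bigrams is to pair each word with its successor. The sketch below assumes the same lowercasing and punctuation stripping used elsewhere in the lesson; `bigrams` is an illustrative helper name.

```python
def bigrams(text):
    """Return consecutive word pairs, lowercased like the lesson's tokenizer."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


pairs = bigrams("RAG systems retrieve information")
print(pairs)
```

Because the text is lowercased first, the pairs come out as `"rag systems"`, `"systems retrieve"`, and `"retrieve information"`; a single-word text yields no bigrams at all.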

Your tasks are to:

- Modify `build_vocab` to include bigrams. This means that for each document, you should extract both individual words and pairs of consecutive words (bigrams) to add to the vocabulary.
- Update `bow_vectorize` to count bigrams. When vectorizing a text, ensure that both unigrams (single words) and bigrams are counted and represented in the vector.
- Integrate these changes into the `bow_search` function. This will allow the search to consider both individual words and word pairs, potentially improving the relevance of document rankings.

By incorporating bigrams, you'll be able to capture more context from the text, which can lead to more accurate search results. Dive in and explore the impact of bigrams on search results!


```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

KNOWLEDGE_BASE = [
    "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    "RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.",
    "By merging retrieved text with generative models, RAG overcomes the limitations of static training data.",
    "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
    "Financial institutions analyze market data and use automated report generation to guide investment decisions.",
    "Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.",
    "Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C."
]


def build_vocab(docs):
    """
    Dynamically build a vocabulary from the given docs.
    Each new word or bigram in the corpus is an entry in the vocabulary.
    """
    unique_tokens = set()
    for doc in docs:
        words = [word.strip(".,!?") for word in doc.lower().split()]
        for i in range(len(words)):
            if words[i]:
                unique_tokens.add(words[i])
            # TODO: Add bigrams to the vocabulary
    return {token: idx for idx, token in enumerate(sorted(unique_tokens))}


VOCAB = build_vocab(KNOWLEDGE_BASE)


def bow_vectorize(text, vocab=VOCAB):
    """
    Convert a text into a Bag-of-Words vector, using a shared vocabulary.
    Each element counts how many times a particular token or bigram appears.
    """
    vector = np.zeros(len(vocab), dtype=int)
    words = [word.strip(".,!?") for word in text.lower().split()]
    for i in range(len(words)):
        if words[i] in vocab:
            vector[vocab[words[i]]] += 1
        # TODO: Count bigrams in the vector
    return vector


def bow_search(query, docs):
    """
    Rank documents by lexical overlap (BOW) using unigrams and bigrams.
    The dot product between query and doc vectors reflects how many
    tokens they have in common.
    """
    query_vec = bow_vectorize(query)
    scores = []
    for i, doc in enumerate(docs):
        doc_vec = bow_vectorize(doc)
        score = np.dot(query_vec, doc_vec)
        scores.append((i, score))
    # Sort documents so that higher lexical overlap is first
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores


if __name__ == "__main__":
    query = "How does a system combine external data with language generation to improve responses?"
    print(f"Query: {query}")
    
    print(f"Vocabulary: {VOCAB}")

    # BOW-based search
    bow_results = bow_search(query, KNOWLEDGE_BASE)
    print("BOW Search Results:")
    for idx, score in bow_results:
        print(f"  Doc {idx} | Score: {score} | Text: {KNOWLEDGE_BASE[idx]}")



```

To incorporate bigrams into the Bag-of-Words (BOW) model, we need to:

1. **Modify `build_vocab` to include bigrams**: This involves extracting both unigrams (single words) and bigrams (pairs of consecutive words) from each document.
2. **Update `bow_vectorize` to count bigrams**: In addition to counting unigrams, we'll also count the occurrences of bigrams.
3. **Check the `bow_search` function**: It needs no code changes; because it calls `bow_vectorize`, the search automatically accounts for both unigrams and bigrams.

Here is the updated version of the code with bigram support:

### Updated Code:

```python
import numpy as np

KNOWLEDGE_BASE = [
    "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    "RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.",
    "By merging retrieved text with generative models, RAG overcomes the limitations of static training data.",
    "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
    "Financial institutions analyze market data and use automated report generation to guide investment decisions.",
    "Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.",
    "Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C."
]

def build_vocab(docs):
    """
    Dynamically build a vocabulary from the given docs.
    Each new word or bigram in the corpus is an entry in the vocabulary.
    """
    unique_tokens = set()
    for doc in docs:
        words = [word.strip(".,!?") for word in doc.lower().split()]
        for i in range(len(words)):
            # Add individual word to vocabulary
            if words[i]:
                unique_tokens.add(words[i])
            # Add bigrams (pair of consecutive words)
            if i < len(words) - 1:
                bigram = f"{words[i]} {words[i+1]}"
                unique_tokens.add(bigram)
    return {token: idx for idx, token in enumerate(sorted(unique_tokens))}


VOCAB = build_vocab(KNOWLEDGE_BASE)


def bow_vectorize(text, vocab=VOCAB):
    """
    Convert a text into a Bag-of-Words vector, using a shared vocabulary.
    Each element counts how many times a particular token or bigram appears.
    """
    vector = np.zeros(len(vocab), dtype=int)
    words = [word.strip(".,!?") for word in text.lower().split()]
    
    for i in range(len(words)):
        # Count unigrams
        if words[i] in vocab:
            vector[vocab[words[i]]] += 1
        # Count bigrams
        if i < len(words) - 1:
            bigram = f"{words[i]} {words[i+1]}"
            if bigram in vocab:
                vector[vocab[bigram]] += 1
    return vector


def bow_search(query, docs):
    """
    Rank documents by lexical overlap (BOW) using unigrams and bigrams.
    The dot product between query and doc vectors reflects how many
    tokens they have in common.
    """
    query_vec = bow_vectorize(query)
    scores = []
    for i, doc in enumerate(docs):
        doc_vec = bow_vectorize(doc)
        score = np.dot(query_vec, doc_vec)
        scores.append((i, score))
    # Sort documents so that higher lexical overlap is first
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores


if __name__ == "__main__":
    query = "How does a system combine external data with language generation to improve responses?"
    print(f"Query: {query}")
    
    print(f"Vocabulary: {VOCAB}")

    # BOW-based search
    bow_results = bow_search(query, KNOWLEDGE_BASE)
    print("BOW Search Results:")
    for idx, score in bow_results:
        print(f"  Doc {idx} | Score: {score} | Text: {KNOWLEDGE_BASE[idx]}")
```

### Key Changes:

1. **`build_vocab`**: 
   - Now extracts both unigrams (individual words) and bigrams (pairs of consecutive words).
   - Adds bigrams to the vocabulary; e.g., for the sentence `"RAG systems retrieve information"`, it adds the lowercased bigrams `"rag systems"`, `"systems retrieve"`, and `"retrieve information"`.

2. **`bow_vectorize`**:
   - It now counts both unigrams and bigrams when processing the text.
   - For each consecutive pair of words, it creates a bigram and updates the vector if the bigram exists in the vocabulary.

3. **`bow_search`**:
   - This function uses the `bow_vectorize` function to rank documents by their overlap with the query, considering both unigrams and bigrams.

### Example of Impact:

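For the query `"How does a system combine external data with language generation to improve responses?"`, the search now considers both the individual words and the two-word combinations (bigrams) when ranking the documents. Bigrams reward shared phrases, which unigrams alone cannot do, but they are still exact matches. Here the only bigrams the query shares with the knowledge base are `"combine external"` and `"external data"` (Doc 3) and `"generation to"` (Doc 4), so Doc 3's score rises from 5 to 7 while Doc 0 stays at 4. Bigrams sharpen phrase matching; they do not by themselves make the ranking more semantically relevant.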
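
A quick way to check what bigrams do for this particular query is to score a few documents with and without them. The sketch below reuses three sentences from `KNOWLEDGE_BASE`; `features` and `score` are illustrative helpers built on `collections.Counter`, not the lesson's NumPy implementation.

```python
from collections import Counter

QUERY = "How does a system combine external data with language generation to improve responses?"
DOCS = {
    0: "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    3: "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
    4: "Financial institutions analyze market data and use automated report generation to guide investment decisions.",
}


def features(text, use_bigrams):
    """Count unigrams, and optionally bigrams, using the lesson's tokenization."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    grams = list(words)
    if use_bigrams:
        grams += [f"{a} {b}" for a, b in zip(words, words[1:])]
    return Counter(grams)


def score(query, doc, use_bigrams):
    """Dot product of the two count vectors."""
    q, d = features(query, use_bigrams), features(doc, use_bigrams)
    return sum(q[g] * d[g] for g in q if g in d)


unigram_scores = {i: score(QUERY, text, False) for i, text in DOCS.items()}
bigram_scores = {i: score(QUERY, text, True) for i, text in DOCS.items()}
print(unigram_scores)
print(bigram_scores)
```

Comparing the two dictionaries shows which documents gain from shared phrases and which do not.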

### Next Steps:

1. **Test the system**: Run the script and compare the ranking with the unigram-only results. Do not expect an automatic improvement: bigrams help only when the query and a relevant document share exact two-word phrases.
2. **Further Optimizations**: You can experiment with trigrams or higher-order N-grams if needed to capture even more context.
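
Generalizing from bigrams to any `n` takes only a sliding window. The helper below is a minimal sketch (`ngrams` is an illustrative name) that works on an already tokenized list of words.

```python
def ngrams(words, n):
    """Return all runs of n consecutive words as space-joined strings."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "rag systems retrieve information from large databases".split()
trigrams = ngrams(words, 3)
print(trigrams)

# Number of n-grams this one sentence yields for n = 1..4.
sizes = {n: len(ngrams(words, n)) for n in range(1, 5)}
print(sizes)
```

Longer n-grams are more specific, so a query is less likely to share one with a document; it is worth checking on real queries whether the extra vocabulary pays off.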


## Bag of Words Search Task

You've just explored how BOW and embeddings differ in search scenarios. Now, let's put that knowledge into practice by completing the `bow_search` function.

Your tasks are to:

- Transform the query and documents into BOW vectors using `bow_vectorize`.
- Compute the dot product to measure overlap for each document.
- Return a list of document indices sorted by their lexical similarity scores in descending order.

This exercise will help you see how document rankings change with different query inputs. Dive in and witness the impact of your work!


```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer


KNOWLEDGE_BASE = [
    "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    "RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.",
    "By merging retrieved text with generative models, RAG overcomes the limitations of static training data.",
    "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
    "Financial institutions analyze market data and use automated report generation to guide investment decisions.",
    "Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.",
    "Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C."
]


def build_vocab(docs):
    """
    Dynamically build a vocabulary from the given docs.
    Each new word in the corpus is an entry in the vocabulary.
    """
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    return {word: idx for idx, word in enumerate(sorted(unique_words))}

VOCAB = build_vocab(KNOWLEDGE_BASE)


def bow_vectorize(text, vocab=VOCAB):
    """
    Convert a text into a Bag-of-Words vector, using a shared vocabulary.
    Each element counts how many times a particular token appears.
    """
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1
    return vector


def bow_search(query, docs):
    """
    Rank documents by lexical overlap (BOW).
    The dot product between query and doc vectors reflects how many
    tokens they have in common.
    """
    # TODO: Transform the query into a BOW vector

    # TODO: Transform each document into a BOW vector

    # TODO: Compute the dot product between query and document to measure overlap

    # TODO: Sort documents so that higher lexical overlap is first

    # TODO: Return sorted scores
    pass


if __name__ == "__main__":
    query = "How does a system combine external data with language generation to improve responses?"
    print(f"Query: {query}")

    # BOW-based search
    bow_results = bow_search(query, KNOWLEDGE_BASE)
    print("BOW Search Results:")
    for idx, score in bow_results:
        print(f"  Doc {idx} | Score: {score} | Text: {KNOWLEDGE_BASE[idx]}")


```


To complete the `bow_search` function, we need to follow these steps:

1. **Transform the query into a BOW vector**: Use the `bow_vectorize` function to convert the query into a vector based on the vocabulary.
2. **Transform each document into a BOW vector**: Apply the `bow_vectorize` function to each document in the knowledge base to get their BOW vectors.
3. **Compute the dot product between the query and each document**: The dot product will indicate how much overlap there is between the query and each document, which is a measure of their similarity.
4. **Sort documents by lexical similarity**: The documents should be sorted in descending order based on the similarity score.
5. **Return the sorted scores**: The function will return a list of tuples containing the document index and its similarity score.

Here is the completed `bow_search` function:

### Updated Code:

```python
import numpy as np

KNOWLEDGE_BASE = [
    "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    "RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.",
    "By merging retrieved text with generative models, RAG overcomes the limitations of static training data.",
    "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
    "Financial institutions analyze market data and use automated report generation to guide investment decisions.",
    "Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.",
    "Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C."
]

def build_vocab(docs):
    """
    Dynamically build a vocabulary from the given docs.
    Each new word in the corpus is an entry in the vocabulary.
    """
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    return {word: idx for idx, word in enumerate(sorted(unique_words))}

VOCAB = build_vocab(KNOWLEDGE_BASE)


def bow_vectorize(text, vocab=VOCAB):
    """
    Convert a text into a Bag-of-Words vector, using a shared vocabulary.
    Each element counts how many times a particular token appears.
    """
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1
    return vector


def bow_search(query, docs):
    """
    Rank documents by lexical overlap (BOW).
    The dot product between query and doc vectors reflects how many
    tokens they have in common.
    """
    # Transform the query into a BOW vector
    query_vec = bow_vectorize(query)
    
    # Initialize a list to store document similarity scores
    scores = []
    
    # Transform each document into a BOW vector and compute the dot product
    for i, doc in enumerate(docs):
        doc_vec = bow_vectorize(doc)
        score = np.dot(query_vec, doc_vec)  # Dot product for overlap
        scores.append((i, score))
    
    # Sort documents so that higher lexical overlap is first
    scores.sort(key=lambda x: x[1], reverse=True)
    
    return scores


if __name__ == "__main__":
    query = "How does a system combine external data with language generation to improve responses?"
    print(f"Query: {query}")

    # BOW-based search
    bow_results = bow_search(query, KNOWLEDGE_BASE)
    print("BOW Search Results:")
    for idx, score in bow_results:
        print(f"  Doc {idx} | Score: {score} | Text: {KNOWLEDGE_BASE[idx]}")
```

### Explanation of Changes:

1. **Query Vectorization**: 
   - We use `bow_vectorize(query)` to convert the query into a vector based on the vocabulary.

2. **Document Vectorization**:
   - Each document in the `KNOWLEDGE_BASE` is also transformed into a BOW vector using `bow_vectorize(doc)`.

3. **Dot Product Calculation**:
   - For each document, the dot product between the query vector and the document vector is computed. It sums, over every word the two texts share, the product of that word's counts, so a word that is repeated contributes more than once.

4. **Sorting**:
   - The documents are sorted by their similarity scores in descending order. The higher the score, the more similar the document is to the query.

5. **Return Scores**:
   - The function returns a sorted list of tuples where each tuple contains the document index and its similarity score.
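
One detail of the sorting step is how ties are ordered. Python's sort is stable, also with `reverse=True`, so documents with equal scores keep their original order. The sketch below uses the BOW scores listed earlier in the lesson; the `k` cutoff is a hypothetical addition for keeping only the best matches.

```python
# (doc index, score) pairs in document order, with scores taken from the
# BOW results listed earlier in the lesson.
scores = [(0, 4), (1, 1), (2, 2), (3, 5), (4, 3), (5, 2), (6, 0)]

# Stable sort: Doc 2 and Doc 5 both score 2, and Doc 2 stays ahead of Doc 5.
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
print(ranked)

# Optional: keep only the k best matches (k is a hypothetical parameter).
k = 3
top_k = ranked[:k]
print(top_k)
```

This matches the order in the lesson's output, where Doc 2 is listed before Doc 5 although both have a score of 2.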

### Example Output:

When you run the script with the query `"How does a system combine external data with language generation to improve responses?"`, the output reproduces the BOW ranking shown in the lesson: Doc 3 comes first with a score of 5, followed by Doc 0 with 4, down to Doc 6 with 0.


You've just explored how BOW and embeddings differ in search scenarios. Now, let's apply that knowledge by completing the `embedding_search` function.

Your tasks are to:

1. Implement the `cos_sim` function to calculate cosine similarity.
2. Encode the query and documents into embeddings.
3. Calculate their similarity using the `cos_sim` function.
4. Return a list of document indices ranked by semantic closeness to the query.

This exercise will help you see how semantic search can enhance document retrieval. Dive in and see the impact of your work!


```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

KNOWLEDGE_BASE = [
    "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    "RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.",
    "By merging retrieved text with generative models, RAG overcomes the limitations of static training data.",
    "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
    "Financial institutions analyze market data and use automated report generation to guide investment decisions.",
    "Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.",
    "Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C."
]


def cos_sim(a, b):
    # TODO: Implement cosine similarity
    return 0

def embedding_search(query, docs, model):
    # TODO: Encode the query into an embedding
    # TODO: Encode the documents into embeddings
    # TODO: Calculate and store the cosine similarity between query_emb and each document embedding
    # TODO: Sort the scores in descending order of similarity
    pass

if __name__ == "__main__":
    query = "How does a system combine external data with language generation to improve responses?"
    print(f"Query: {query}")

    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    emb_results = embedding_search(query, KNOWLEDGE_BASE, model)
    print("\nEmbedding-based Search Results:")
    for idx, score in emb_results:
        print(f"  Doc {idx} | Score: {score:.4f} | Text: {KNOWLEDGE_BASE[idx]}")


```

To complete the `embedding_search` function, we need to:

1. **Implement `cos_sim`**: This function calculates the cosine similarity between two vectors.
2. **Encode the query and documents into embeddings**: We'll use a pre-trained model from `sentence-transformers` to convert the query and documents into vectors (embeddings).
3. **Calculate similarity**: For each document, we'll compute the cosine similarity between the query and document embeddings.
4. **Return ranked results**: We need to sort the documents by their semantic similarity to the query and return the sorted list of indices and scores.

Let's go step by step.

### 1. **Cosine Similarity (cos_sim) Implementation**:
Cosine similarity is calculated as the dot product of two vectors divided by the product of their magnitudes.

Formula:
\[
\text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}
\]
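As a quick sanity check of the formula, the sketch below evaluates it on three small hand-picked vectors. The vectors and the helper name `cosine` are illustrative assumptions, not part of the exercise code.

```python
import numpy as np
from numpy.linalg import norm

def cosine(u, v):
    """Dot product divided by the product of the two magnitudes."""
    return float(np.dot(u, v) / (norm(u) * norm(v)))

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])   # same direction as a, twice as long
c = np.array([2.0, -1.0, 0.0])  # orthogonal to a (dot product is 0)

print(cosine(a, b))  # parallel vectors: similarity of 1 (up to rounding)
print(cosine(a, c))  # orthogonal vectors: similarity of 0
```

Because the magnitudes are divided out, `b` scores the same as `a` would against itself even though it is twice as long. Only direction matters, which is what distinguishes cosine similarity from the raw dot product used in the BOW search.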

### 2. **Embedding the Query and Documents**:
We'll use `SentenceTransformer` to encode both the query and the documents into embeddings.

### 3. **Ranking by Similarity**:
After computing the cosine similarity, we will sort the documents by their similarity to the query in descending order.

### Updated Code:

Here is the implementation of the `embedding_search` function:

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

KNOWLEDGE_BASE = [
    "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    "RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.",
    "By merging retrieved text with generative models, RAG overcomes the limitations of static training data.",
    "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
    "Financial institutions analyze market data and use automated report generation to guide investment decisions.",
    "Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.",
    "Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C."
]

def cos_sim(a, b):
    """
    Compute the cosine similarity between two vectors.
    """
    return np.dot(a, b) / (norm(a) * norm(b))

def embedding_search(query, docs, model):
    """
    Rank documents based on semantic similarity with the query using embeddings.
    """
    # Encode the query into an embedding
    query_emb = model.encode([query])[0]
    
    # Encode all documents into embeddings
    doc_embs = model.encode(docs)
    
    # Calculate cosine similarity between the query embedding and each document embedding
    scores = []
    for idx, doc_emb in enumerate(doc_embs):
        score = cos_sim(query_emb, doc_emb)
        scores.append((idx, score))
    
    # Sort documents based on similarity scores in descending order
    scores.sort(key=lambda x: x[1], reverse=True)
    
    return scores

if __name__ == "__main__":
    query = "How does a system combine external data with language generation to improve responses?"
    print(f"Query: {query}")

    # Initialize the model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Perform the embedding-based search
    emb_results = embedding_search(query, KNOWLEDGE_BASE, model)

    # Print the results
    print("\nEmbedding-based Search Results:")
    for idx, score in emb_results:
        print(f"  Doc {idx} | Score: {score:.4f} | Text: {KNOWLEDGE_BASE[idx]}")
```

### Explanation:

1. **`cos_sim(a, b)`**: 
   - This function calculates the cosine similarity between two vectors `a` and `b` by using the formula mentioned above. The `np.dot(a, b)` computes the dot product, and `norm(a)` and `norm(b)` compute the magnitudes of the vectors.
   
2. **`embedding_search(query, docs, model)`**: 
   - The query is encoded into an embedding using `model.encode([query])[0]`. Wrapping the query in a list produces a batch of one, and `[0]` extracts the single dense vector from that batch.
   - Similarly, we encode the documents into embeddings using `model.encode(docs)`.
   - We then compute the cosine similarity between the query embedding and each document embedding.
   - The results are sorted by their similarity score in descending order.
   
3. **Model**: 
   - We use the `sentence-transformers/all-MiniLM-L6-v2` model for encoding the query and documents into embeddings. It is a compact model that maps each sentence to a 384-dimensional vector and is commonly used for semantic similarity tasks.
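For larger document sets, the per-document loop can be replaced with one matrix operation. The following is a minimal sketch under stated assumptions: `rank_by_cosine` and its `eps` guard are hypothetical additions, and the 2-dimensional toy vectors stand in for real embeddings so the example runs without downloading a model.

```python
import numpy as np

def rank_by_cosine(query_emb, doc_embs, eps=1e-12):
    """Return (index, score) pairs sorted by cosine similarity, highest first."""
    query_emb = np.asarray(query_emb, dtype=float)
    doc_embs = np.asarray(doc_embs, dtype=float)

    # Normalize to unit length; eps avoids division by zero for all-zero vectors.
    q = query_emb / max(np.linalg.norm(query_emb), eps)
    d_norms = np.linalg.norm(doc_embs, axis=1, keepdims=True)
    d = doc_embs / np.maximum(d_norms, eps)

    # For unit vectors, the dot product equals the cosine similarity.
    sims = d @ q
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order]

# Toy stand-ins for embeddings (assumption: real ones would come from model.encode).
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.1])

ranked = rank_by_cosine(toy_query, toy_docs)
print([idx for idx, _ in ranked])  # [0, 2, 1]
```

The return shape matches `embedding_search`, so the same printing loop would work on it. Whether the vectorized form is worth the extra code depends on the collection size; for the seven documents in `KNOWLEDGE_BASE`, the explicit loop is just as practical and easier to read.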

### Example Output:

For the query `"How does a system combine external data with language generation to improve responses?"`, the search results will show the documents in the `KNOWLEDGE_BASE` ranked by their semantic similarity to the query.
