# Raw RAG 02: You Don't Have to Embed

Welcome to the second notebook of our Raw RAG series! This time, we're going to explore an interesting twist: creating a Retrieval-Augmented Generation (RAG) system without using embeddings.

## Why No Embeddings?

You might be wondering, "Aren't embeddings a crucial part of RAG?" Well, not always! Depending on your specific use case, there are alternative techniques that can be just as effective - and sometimes even more efficient.

## What We'll Cover

In this notebook, we'll dive into several non-embedding approaches for information retrieval:

1. **BM25**: A powerful ranking function used in document retrieval
2. **Natural Language Processing (NLP) techniques**: Using linguistic features for matching
3. **Reranking**: Improving initial search results for better relevance

These methods can often provide excellent results, especially for certain types of data or specific application requirements.

*Note: For simplicity, we'll be using Cohere's Rerank API in some examples.*

Ready to see how we can do RAG without embeddings? Let's get started!

In [2]:
%pip install openai python-dotenv


Note: you may need to restart the kernel to use updated packages.


In [3]:
# Load the environment variables from the .env file

from dotenv import load_dotenv
import os

dotenv_path = ".env"
load_dotenv(dotenv_path=dotenv_path)

True

In [4]:
# Load the short story text

file_path = "docs/the_lottery_text.txt"

with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

print(text[:690])

“The Lottery [abridged]” (1948)--- By Shirley Jackson

The people of the village began to gather in the square, between the post office and the bank, around ten o'clock; in some towns there were so many people that the lottery took two days and had to be started on June 2nd.  But in this village, where there were only about three hundred people, the whole lottery took less than two hours, so it could begin at ten o'clock in the morning and still be through in time to allow the villagers to get home for noon dinner. 
The children assembled first, of course. Bobby Martin had already stuffed his pockets full of stones, and the other boys soon followed his example, selecting the smooth


In [5]:
from utils import TextProcessor, SearchEngine

text_processor = TextProcessor()
search_engine = SearchEngine()

# Split the text into paragraphs
paragraphs = text_processor.text_splitter(text)

for i, paragraph in enumerate(paragraphs[:5]):
    print(f"Paragraph {i+1}: {paragraph}")

Paragraph 1: “The Lottery [abridged]” (1948)--- By Shirley Jackson The people of the village began to gather in the square, between the post office and the bank, around ten o'clock; in some towns there were so many people that the lottery took two days and had to be started on June 2nd.
Paragraph 2: But in this village, where there were only about three hundred people, the whole lottery took less than two hours, so it could begin at ten o'clock in the morning and still be through in time to allow the villagers to get home for noon dinner. The children assembled first, of course.
Paragraph 3: Bobby Martin had already stuffed his pockets full of stones, and the other boys soon followed his example, selecting the smoothest and roundest stones; Bobby and Harry Jones and Dickie Delacroix-- the villagers pronounced this name "Dellacroy"--eventually made a great pile of stones in one corner of the square and guarded it against the raids of the other boys. The lottery was conducted--as were th

# BM25 Search Algorithm: A Brief Overview

BM25 (Best Matching 25) is a ranking function used in information retrieval. It improves on plain TF-IDF weighting by capping the contribution of repeated terms and normalizing for document length, and it is widely used in search applications because it is both effective and simple.

## How It Works

BM25 calculates a relevance score for documents based on:

1. **Term Frequency**: How often query terms appear in a document.
2. **Inverse Document Frequency**: How rare or common terms are across all documents.
3. **Document Length**: Adjusting scores to avoid bias towards longer documents.
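
For reference, these three ingredients combine as follows in the standard Okapi BM25 formulation, as given in the Wikipedia article linked later in this section. The implementation in `utils.py` may differ in details such as the IDF variant or the default parameters.

$$
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

$$
\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
$$

Here $f(q_i, D)$ is the number of times query term $q_i$ occurs in document $D$, $|D|$ is the document length in words, $\text{avgdl}$ is the average document length in the collection, $N$ is the number of documents, and $n(q_i)$ is the number of documents containing $q_i$. The free parameters $k_1$ and $b$ control term-frequency saturation and the strength of length normalization.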

## Key Features

- Caps the impact of repeated terms
- Normalizes for document length
- Based on probabilistic retrieval framework
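
The capping and length normalization listed above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard Okapi BM25 formula, not the implementation in `utils.py`. The function names `tokenize` and `bm25_scores`, the naive whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions made for this example.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation (deliberately naive)."""
    tokens = (word.strip(".,;:!?\"'()[]") for word in text.lower().split())
    return [t for t in tokens if t]


def bm25_scores(
    query: str, documents: List[str], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Return one BM25 score per document for the given query (hypothetical helper)."""
    if not documents:
        return []

    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avgdl = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents in which each term occurs at least once
    doc_freq: Counter = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    query_terms = tokenize(query)
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents are penalized
        norm = k1 * (1 - b + b * len(doc) / avgdl) if avgdl else k1
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Inverse document frequency: rarer terms weigh more
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Saturating term frequency: repeated terms give diminishing returns
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores
```

Under these assumptions, a document that repeats a query term four times scores higher than one that mentions it once, but by less than a factor of four. That is the saturation effect: raising `k1` lets repetition count for more, while lowering `b` toward 0 switches length normalization off.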

## Advantages for RAG Systems

- Effective: Often outperforms simpler term-weighting schemes such as plain TF-IDF
- Efficient: Low computational overhead
- No training required: Works well without large datasets or training phases

In our RAG system, BM25 offers a powerful way to retrieve relevant documents without embeddings, making it suitable for many applications.

Wikipedia: https://en.wikipedia.org/wiki/Okapi_BM25

*Note*: Please see the BM25 code in `utils.py` for implementation details, and `utils_readme.md` for instructions on how to use it.

In [6]:
query = "When is the lottery held?"

# bm25 search
results = search_engine.bm25_search(query, paragraphs)
for idx, result in enumerate(results):
    paragraph_idx = result[0]
    print(f"Paragraph {result[0]+1}: {paragraphs[paragraph_idx]}")
    print(f"Similarity: {result[1]}")

Paragraph 4: Summers, who had time and energy to devote to civic activities. When he arrived in the square, carrying the black wooden box, there was a murmur of conversation among the villagers, and he waved and called, "Little late today, folks. " There was a great deal of fussing to be done before Mr. Summers declared the lottery open. There were the lists to make up--of heads of families- heads of households in each family. There was the proper swearing-in of Mr.
Similarity: 4.506020604301391
Paragraph 13: " Bill Hutchinson said regretfully. "My daughter draws with her husband's family; that's only fair. And I've got no other family except the kids. " "Then, as far as drawing for families is concerned, it's you," Mr. Summers said in explanation, "and as far as drawing for households is concerned, that's you, too. Right?" "Right," Bill Hutchinson said. "How many kids, Bill?" Mr. Summers asked formally. "Three," Bill Hutchinson said. "There's Bill, Jr. , and Nancy, and little Dave, an

# Rerank Method: Refining Search Results

Reranking is a common second step in information retrieval systems, including RAG. It reorders an initial set of search results so that the most relevant ones come first.

## How Reranking Works

1. **Initial Retrieval**: A fast, broad search method (like BM25) retrieves a set of potentially relevant documents.
2. **Reranking**: A more sophisticated algorithm reassesses this initial set, reordering the results based on additional criteria.
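
The two steps above can be sketched as a small, model-agnostic pipeline. This is an illustration under stated assumptions rather than a production design: `retrieve_then_rerank`, `term_overlap`, and `overlap_density` are hypothetical names, and the two toy scorers merely stand in for a real first-stage ranker (such as BM25) and a real reranking model.

```python
from typing import Callable, List, Tuple

# A scorer takes (query, document) and returns a relevance score
Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    documents: List[str],
    first_stage: Scorer,
    second_stage: Scorer,
    candidates: int = 10,
    top_n: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, second-stage score) pairs, best first."""
    # Stage 1: score every document cheaply and keep the best `candidates`
    shortlist = sorted(
        range(len(documents)),
        key=lambda i: first_stage(query, documents[i]),
        reverse=True,
    )[:candidates]

    # Stage 2: run the more expensive scorer on the shortlist only
    reranked = [(i, second_stage(query, documents[i])) for i in shortlist]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked[:top_n]


def term_overlap(query: str, document: str) -> float:
    """Toy first-stage scorer: number of distinct query words found in the document."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


def overlap_density(query: str, document: str) -> float:
    """Toy second-stage scorer standing in for a learned reranker."""
    words = document.lower().split()
    return term_overlap(query, document) / len(words) if words else 0.0
```

Returning indices into the original `documents` list, rather than positions within the shortlist, keeps it easy to look up the matching text afterwards. The expensive scorer only ever sees `candidates` documents, which is where the efficiency gain comes from.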

## Key Features

- Uses more complex relevance models than initial retrieval
- Can incorporate additional context or features not used in the first pass
- Often leverages machine learning or deep learning techniques

## Benefits in RAG Systems

- **Improved Accuracy**: Helps surface the most relevant documents for the given query
- **Balances Efficiency and Effectiveness**: Combines fast initial retrieval with more nuanced ranking
- **Flexibility**: Can be tailored to specific use cases or domains

By incorporating reranking, RAG systems can significantly enhance the quality of retrieved context, leading to more accurate and relevant generated responses.

In [7]:
# Install the Cohere Python package
%pip install cohere


Note: you may need to restart the kernel to use updated packages.


In [8]:
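# Rerank with the Cohere API. Note that the full paragraph list is passed here,
# so the reranker scores every paragraph rather than only the BM25 hits.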

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=paragraphs,
    top_n=5,
)

for idx, result in enumerate(response.results):
    paragraph_idx = result.index
    print(f"Paragraph {paragraph_idx+1}: {paragraphs[paragraph_idx]}")
    print(f"Similarity: {result.relevance_score}")

Paragraph 1: “The Lottery [abridged]” (1948)--- By Shirley Jackson The people of the village began to gather in the square, between the post office and the bank, around ten o'clock; in some towns there were so many people that the lottery took two days and had to be started on June 2nd.
Similarity: 0.77898574
Paragraph 2: But in this village, where there were only about three hundred people, the whole lottery took less than two hours, so it could begin at ten o'clock in the morning and still be through in time to allow the villagers to get home for noon dinner. The children assembled first, of course.
Similarity: 0.19590157
Paragraph 4: Summers, who had time and energy to devote to civic activities. When he arrived in the square, carrying the black wooden box, there was a murmur of conversation among the villagers, and he waved and called, "Little late today, folks. " There was a great deal of fussing to be done before Mr. Summers declared the lottery open. There were the lists to make

## Efficient Keyword Extraction with spaCy

While Large Language Models (LLMs) offer powerful natural language processing capabilities, their probabilistic nature can lead to inconsistent results when extracting keywords. Enter spaCy, a lightweight and efficient NLP library that provides a more deterministic approach to keyword extraction.

### Why Choose spaCy?

1. **Deterministic Results**: Unlike LLMs, spaCy's rule-based and statistical models produce consistent outputs for the same input, enhancing reproducibility.

2. **Efficiency**: spaCy's smaller models are designed for speed and low resource consumption, making them ideal for production environments.

3. **Ease of Use**: With a simple API, spaCy allows for quick implementation and straightforward integration into existing workflows.

4. **Customizability**: spaCy offers various pre-trained models of different sizes, allowing you to balance accuracy and performance based on your specific needs.

5. **Debuggability**: The deterministic nature of spaCy's models makes it easier to trace and debug the keyword extraction process.

By leveraging spaCy's smaller, focused models for keyword extraction, we can achieve consistent and efficient results. This approach not only simplifies our RAG pipeline but also makes it more robust and easier to maintain in the long run.

spaCy: https://spacy.io/

In [9]:
%pip install spacy


Note: you may need to restart the kernel to use updated packages.


In [10]:
# Download the spaCy model, other models can be used as well, see https://spacy.io/usage/models

!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
from typing import List, Dict, Any
from collections import Counter

import spacy
from spacy.language import Language
from spacy.tokens import Doc


def extract_nlp_keywords(
    text: str, num_keywords: int = 5, model: str = "en_core_web_sm"
) -> Dict[str, Any]:
    """
    Extract dates, person names, locations, and keywords from the given text using spaCy.

    Args:
    text (str): The input text to process.
    num_keywords (int): The number of keywords to extract (default: 5).
    model (str): The spaCy model to use (default: "en_core_web_sm").

    Returns:
    Dict[str, Any]: A dictionary containing extracted when, who, where, and keywords.
    """
    # Load the language model
    try:
        nlp: Language = spacy.load(model)
    except OSError as err:
        # Chain the original error so the underlying cause stays visible
        raise ValueError(
            f"Could not load the spaCy model '{model}'. Make sure it's installed."
        ) from err

    # Process the text
    doc: Doc = nlp(text)

    # Extract dates (when), person names (who), and locations (where)
    when: List[str] = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
    who: List[str] = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    where: List[str] = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

    # Extract keywords
    pos_tag: List[str] = ["PROPN", "ADJ", "NOUN", "VERB"]
    keywords: List[str] = [
        token.text
        for token in doc
        if not token.is_stop and not token.is_punct and token.pos_ in pos_tag
    ]

    # Count keyword frequencies and get top keywords
    keyword_freq: Counter = Counter(keywords)
    top_keywords: List[str] = [
        word for word, _ in keyword_freq.most_common(num_keywords)
    ]

    # Create a dictionary with the extracted information
    result: Dict[str, Any] = {
        "when": list(set(when)),
        "who": list(set(who)),
        "where": list(set(where)),
        "keywords": top_keywords,
    }

    return result

In [12]:
# extract keywords from query using spaCy

query_keywords = extract_nlp_keywords(query)

print(query_keywords)

{'when': [], 'who': [], 'where': [], 'keywords': ['lottery', 'held']}


In [13]:
# bm25 search with extracted keywords 

extracted_keywords = " ".join(query_keywords["keywords"])
results = search_engine.bm25_search(extracted_keywords, paragraphs)

result_paragraph = []
for idx, result in enumerate(results):
    paragraph_idx = result[0]
    print(f"Paragraph {result[0]+1}: {paragraphs[paragraph_idx]}")
    print(f"Similarity: {result[1]}")
    result_paragraph.append(paragraphs[paragraph_idx])

Paragraph 1: “The Lottery [abridged]” (1948)--- By Shirley Jackson The people of the village began to gather in the square, between the post office and the bank, around ten o'clock; in some towns there were so many people that the lottery took two days and had to be started on June 2nd.
Similarity: 1.998777831130354
Paragraph 2: But in this village, where there were only about three hundred people, the whole lottery took less than two hours, so it could begin at ten o'clock in the morning and still be through in time to allow the villagers to get home for noon dinner. The children assembled first, of course.
Similarity: 1.9717036908971395
Paragraph 3: Bobby Martin had already stuffed his pockets full of stones, and the other boys soon followed his example, selecting the smoothest and roundest stones; Bobby and Harry Jones and Dickie Delacroix-- the villagers pronounced this name "Dellacroy"--eventually made a great pile of stones in one corner of the square and guarded it against the r

In [14]:
# Install OpenAI Python package

%pip install openai


Note: you may need to restart the kernel to use updated packages.


In [15]:
# Generate answer with OpenAI API

from openai import OpenAI

client = OpenAI()

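# Join the retrieved paragraphs into one string so the prompt contains
# plain text rather than the repr of a Python list
context = "\n\n".join(result_paragraph)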

full_query = f"""Use the below context to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{context}
\"\"\"

Question: {query}"""

response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You answer questions for the user.",
        },
        {"role": "user", "content": full_query},
    ],
    model="gpt-4-turbo",
    temperature=0,
)

print(response.choices[0].message.content)

The lottery is held on June 27th.
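
Prompt assembly is easier to reuse and inspect when it lives in its own function. The sketch below is one possible way to do that, keeping the prompt wording used in this notebook. `build_prompt` is a hypothetical helper, and numbering the passages is a design choice of this example rather than something the OpenAI API requires. Joining the passages explicitly matters because a list interpolated directly into an f-string is rendered as its Python repr, brackets and quotes included.

```python
from typing import List


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a grounded-answer prompt from retrieved passages (hypothetical helper)."""
    # Number each passage and separate them with blank lines so that
    # passage boundaries stay visible to the model
    context = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Use the below context to answer the subsequent question. "
        'If the answer cannot be found, write "I don\'t know."\n\n'
        f'Article:\n"""\n{context}\n"""\n\n'
        f"Question: {question}"
    )
```

The returned string can be passed as the `content` of the user message, and printing it before sending the request is a quick way to check exactly what the model will see.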


# Conclusion: Powerful RAG Without Embeddings

In this notebook, we've explored three powerful techniques that enable us to build effective Retrieval-Augmented Generation (RAG) systems without relying on embeddings:

1. **BM25 (Best Matching 25)**
   - A robust ranking function for information retrieval
   - Balances term frequency, inverse document frequency, and document length
   - Provides efficient and interpretable search results

2. **Reranking**
   - Refines initial search results for improved relevance
   - Combines fast initial retrieval with more sophisticated ranking
   - Enhances the quality of retrieved context for RAG systems

3. **Natural Language Processing (NLP) with spaCy**
   - Offers deterministic keyword extraction and named entity recognition
   - Provides efficient, lightweight models for various NLP tasks
   - Enables consistent and debuggable text analysis

By leveraging these techniques, we've demonstrated that it's possible to create highly effective RAG systems without the need for complex embedding models. This approach offers several advantages:

- **Efficiency**: These methods often require fewer computational resources than embedding-based approaches.
- **Interpretability**: BM25 scores and spaCy's part-of-speech and entity labels can be inspected directly, which makes retrieval easier to understand and debug.
- **Flexibility**: Each component can be fine-tuned or replaced to suit specific use cases.
- **Consistency**: Deterministic results from spaCy provide reproducible outcomes.

As you continue to develop RAG systems, consider how these techniques can be combined or adapted to meet your specific needs. Remember, the best approach often depends on your particular use case, data, and performance requirements.

Next steps for exploration:
- Experiment with different BM25 parameters
- Implement custom reranking algorithms
- Explore advanced spaCy pipelines for more complex NLP tasks

By mastering these techniques, you're well-equipped to build sophisticated RAG systems that are both powerful and efficient. Keep experimenting and refining your approach!