# Finding the best open-source embedding model for your RAG application

This tutorial shows how to select the most suitable open-source embedding model for your Retrieval-Augmented Generation (RAG) application.

We evaluate three open-source embedding models: [nomic-embed-text](https://ollama.com/library/nomic-embed-text), [bge-m3](https://ollama.com/library/bge-m3/blobs/daec91ffb5dd), and [mxbai-embed-large](https://ollama.com/library/mxbai-embed-large/blobs/819c2adf5ce6).
 
To run these models and generate embeddings, we use the following tools:

- [Ollama](https://ollama.com/), a platform that provides access to a variety of open-source Large Language Models (LLMs).
- [pgai](https://github.com/timescale/pgai), an open-source extension that integrates LLM workflows, such as embedding creation and management, directly into your PostgreSQL database.

The evaluation process involves:

1. Setting up a test environment with Ollama and PostgreSQL
2. Loading Paul Graham's essays as our test dataset
3. Generating embeddings using different models
4. Creating diverse test questions across multiple categories
5. Evaluating each model's retrieval performance

## Environment Setup

Before you begin, run the `install.sh` shell script to set up your environment. This script installs the Docker containers required for the environment.

**Note for macOS Users:**
  
To run the script, first make it executable by running the command: `chmod +x install.sh`. Then, execute the script with: `./install.sh`.

Let's install the necessary Python libraries for this notebook.

In [None]:
%pip install pandas psycopg2-binary Jinja2

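Let's define the connection settings. `OLLAMA_HOST` is passed to pgai functions that run inside the database, so it must be reachable from the PostgreSQL container.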

In [None]:
DATABASE_CONNECTION_STRING="postgres://postgres:postgres@localhost:5432/postgres"
OLLAMA_HOST="http://ollama:11434"
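
If you prefer not to hard-code these values, a minimal sketch like the one below reads them from the process environment and falls back to the defaults shown above. The helper name `get_setting` and the `DEFAULTS` dictionary are hypothetical and not part of pgai or Ollama.

```python
import os

# Defaults mirror the values used in this notebook.
DEFAULTS = {
    "DATABASE_CONNECTION_STRING": "postgres://postgres:postgres@localhost:5432/postgres",
    "OLLAMA_HOST": "http://ollama:11434",
}


def get_setting(name, environ=None):
    """Return `name` from the environment, or its notebook default if unset."""
    # `environ` is injectable so the helper can be exercised without touching os.environ.
    environ = os.environ if environ is None else environ
    return environ.get(name, DEFAULTS[name])


DATABASE_CONNECTION_STRING = get_setting("DATABASE_CONNECTION_STRING")
OLLAMA_HOST = get_setting("OLLAMA_HOST")
```

With this approach, exporting `OLLAMA_HOST` before starting the notebook overrides the default without editing any cell.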

## Dataset Ingestion

Let's set up the PostgreSQL database and install [pgai](https://github.com/timescale/pgai).

In [4]:
import psycopg2

def connect_db():
    return psycopg2.connect(DATABASE_CONNECTION_STRING)

In [34]:
with connect_db() as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS ai CASCADE;")

    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS essays (
                id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
                title TEXT NOT NULL,
                date TEXT,
                text TEXT NOT NULL
            );
          """) 

### Loading the Dataset

We'll now load [Paul Graham's essays](https://huggingface.co/datasets/sgoel9/paul_graham_essays) into our PostgreSQL database. The dataset consists of 215 essays with their titles, dates, and full text content. We'll verify the successful ingestion by displaying the first essay from the database.

In [None]:
with connect_db() as conn:
    with conn.cursor() as cur:
        # Load Paul Graham's essays dataset into the 'essays' table
        cur.execute("""
            SELECT ai.load_dataset(
                    'sgoel9/paul_graham_essays', 
                    table_name => 'essays', 
                    if_table_exists => 'append');
        """)
    
    with conn.cursor() as cur:
        # Fetch and print the first row from the 'essays' table to verify the data
        cur.execute("SELECT * FROM essays LIMIT 1;")
        print(cur.fetchone())

## Generating Embeddings

pgai simplifies embedding generation with its [pgai Vectorizer](https://github.com/timescale/pgai/blob/main/docs/vectorizer.md). A single SQL command configures a vectorizer that automatically creates and updates embeddings using the embedding model you choose.

We create vectorizers for each embedding model. Each vectorizer will:

- Process text using the same chunking strategy (512 characters with 50 character overlap)
- Use consistent text formatting across all models
- Generate embeddings with model-specific dimensions

In [53]:
def create_vectorizer(embedding_model, embeddings_dimensions):
    embeddings_view_name = f"essays_{embedding_model.replace('-', '_')}_embeddings"

    with connect_db() as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT ai.create_vectorizer(
                'essays'::regclass,
                destination => %s,
                embedding => ai.embedding_ollama(%s, %s),
                chunking => ai.chunking_recursive_character_text_splitter('text', 512, 50),
                formatting => ai.formatting_python_template('title: $title $chunk')
            );
            """, (embeddings_view_name, embedding_model, embeddings_dimensions, )
            )

In [5]:
EMBEDDING_MODELS = [
    {'name':'mxbai-embed-large', 'dimensions': 1024},
    {'name':'nomic-embed-text','dimensions': 768},
    {'name':'bge-m3','dimensions': 1024},
] 

for model in EMBEDDING_MODELS:
    create_vectorizer(model['name'], model['dimensions'])
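
To make the `chunking` and `formatting` settings more concrete, here is a simplified sketch in plain Python. It is not pgai's implementation: the recursive splitter also tries to break on natural separators such as paragraph and line breaks, whereas this sketch uses fixed character windows. The function names `chunk_text` and `format_chunk` are hypothetical.

```python
from string import Template


def chunk_text(text, chunk_size=512, chunk_overlap=50):
    """Split `text` into fixed-size character windows that overlap.

    Simplification: real recursive splitters prefer separator boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def format_chunk(title, chunk):
    """Prefix a chunk with its essay title, mirroring 'title: $title $chunk'."""
    return Template("title: $title $chunk").substitute(title=title, chunk=chunk)
```

Each window starts 462 characters after the previous one, so consecutive chunks share 50 characters. The title prefix gives every chunk some essay-level context before it is embedded.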

The vectorizers will take some time to complete the embedding generation. Use the following command to monitor their progress:  

In [None]:
with connect_db() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM ai.vectorizer_status;")

        for row in cur.fetchall():
            print(f"Vectorizer ID: {row[0]}, Embedding Table: {row[2]}, Pending Items: {row[4]}")
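
Rather than rerunning the status cell by hand, you could poll until all queues are empty. The sketch below is one possible approach. `wait_for_vectorizers` is a hypothetical helper, and `fetch_pending` is a function you supply that returns a mapping of vectorizer ID to pending item count.

```python
import time


def wait_for_vectorizers(fetch_pending, poll_seconds=10.0, timeout_seconds=3600.0,
                         sleep=time.sleep):
    """Poll until every vectorizer reports zero pending items.

    `fetch_pending` is caller-supplied and returns {vectorizer_id: pending_items}.
    Returns True once all counts are zero, False if the timeout elapses first.
    """
    waited = 0.0
    while True:
        pending = fetch_pending()
        if all(count == 0 for count in pending.values()):
            return True
        if waited >= timeout_seconds:
            return False
        # `sleep` is injectable so the loop can be exercised without real delays.
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook, `fetch_pending` could wrap the `ai.vectorizer_status` query above and build the mapping from the same row positions that the status cell prints (`row[0]` and `row[4]`).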
        

## Embeddings Evaluation



### Evaluation Parameters

To ensure a fair comparison, we'll establish consistent evaluation parameters:

- Number of text chunks to evaluate
- Questions per chunk across different categories
- Number of top results to consider (K)
- Distribution of question types to test different aspects of embedding quality

In [7]:
NUM_CHUNKS = 20
NUM_QUESTIONS_PER_CHUNK = 20
TOP_K = 10

QUESTION_DISTRIBUTION = {
    'short': 4,
    'long': 4,
    'direct': 4,
    'implied': 4,
    'unclear': 4
}

assert sum(QUESTION_DISTRIBUTION.values()) == NUM_QUESTIONS_PER_CHUNK
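
These parameters determine the size of the evaluation. The short calculation below estimates the workload, assuming the LLM returns exactly the requested number of questions and that three models are compared. `NUM_MODELS` is introduced here only for illustration.

```python
NUM_CHUNKS = 20
NUM_MODELS = 3  # assumption: the three models compared in this tutorial
QUESTION_DISTRIBUTION = {'short': 4, 'long': 4, 'direct': 4, 'implied': 4, 'unclear': 4}

questions_per_chunk = sum(QUESTION_DISTRIBUTION.values())
# One LLM call per chunk and question type.
generation_calls = NUM_CHUNKS * len(QUESTION_DISTRIBUTION)
# Every chunk contributes one question set.
total_questions = NUM_CHUNKS * questions_per_chunk
# Every question is searched once per embedding model.
total_searches = total_questions * NUM_MODELS

print(generation_calls, total_questions, total_searches)
```

Under these assumptions the run issues 100 generation calls, produces 400 questions, and performs 1,200 similarity searches.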

### Evaluation Chunks

We select 20 random chunks from one of the embeddings views. Because all three vectorizers use the same chunking configuration, every model embeds the same chunks.

In [26]:
import pandas as pd 

evaluation_chunks = []

with connect_db() as conn:
    with conn.cursor() as cur:
        cur.execute("""
                SELECT id, chunk_seq, chunk, title 
                FROM essays_nomic_embed_text_embeddings 
                ORDER BY RANDOM() 
                LIMIT %s
            """, (NUM_CHUNKS,))
        
        for row in cur.fetchall():
            evaluation_chunks.append({
                'id': row[0],
                'chunk_seq': row[1],
                'chunk': row[2],
                'title': row[3]
            })

pd.DataFrame(evaluation_chunks).to_csv('./chunks.csv')

### Evaluation Question Generation

We'll generate diverse questions for each text chunk across five categories:

1. **Short questions**: Simple, direct queries under 10 words
2. **Long questions**: Detailed questions requiring comprehensive understanding
3. **Direct questions**: Questions about explicit information
4. **Implied questions**: Questions requiring contextual understanding
5. **Unclear questions**: Ambiguous queries to test robustness

Each category tests different aspects of the embedding models' capabilities.

In [27]:
def generate_questions_by_question_type(chunk, question_type, num_questions):
    prompts = {
        'short': "Generate {count} short, simple questions about this text. Questions should be direct, under 10 words",
        'long': "Generate {count} detailed, comprehensive questions about this text. Include specific details:",
        'direct': "Generate {count} questions that directly ask about explicit information in this text",
        'implied': "Generate {count} questions that require understanding context and implications of the text:",
        'unclear': "Generate {count} vague, ambiguous questions about the general topic of this text:"
    }

    prompt = prompts[question_type].format(count=num_questions) + f"\n\nText: {chunk}"

    system_instructions = """
        Generate different types of questions about the given text following the prompt provided. 
        Each question must be on a new line. Do not include empty lines or blank questions.
    """

    with connect_db() as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT ai.ollama_generate(
                    'llama3.2',
                    %s,
                    system_prompt=>%s, 
                    host=>%s
                )->>'response';
            """,(prompt, system_instructions, OLLAMA_HOST))

            generated_questions = [q.strip() for q in cur.fetchone()[0].split("\n") if q.strip()]
            print(f"Number of questions generated for {question_type}: {len(generated_questions)}")
            return generated_questions 

In [None]:
evaluation_questions = []

for i, chunk in enumerate(evaluation_chunks, 1):
    print(f"Processing chunk {i}/{len(evaluation_chunks)}")

    for question_type, count in QUESTION_DISTRIBUTION.items():
        questions = generate_questions_by_question_type(chunk['chunk'], question_type, count)

        for q in questions:
            evaluation_questions.append({
                'question': q,
                'source_chunk_id': chunk['id'],
                'source_chunk_seq': chunk['chunk_seq'],
                'question_type': question_type,
                'chunk': chunk['chunk']
            })

print("Generated questions in total:", len(evaluation_questions))

pd.DataFrame(evaluation_questions).to_csv('./generated_questions.csv')
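
LLM output does not always follow the requested format. The model may number its questions, add a preamble line, or return more questions than requested. The sketch below is one possible cleanup step. The helper `clean_generated_questions` is hypothetical, and keeping only lines that end with a question mark is a heuristic that may discard valid questions.

```python
import re

# Matches leading list markers such as "1. ", "2) ", "- ", "* " or "• ".
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def clean_generated_questions(raw_response, max_questions):
    """Strip list markers, drop non-question lines, and cap the result length."""
    questions = []
    for line in raw_response.splitlines():
        text = _LIST_MARKER.sub("", line).strip()
        # Heuristic: drops blank lines and preambles like "Here are 4 questions:".
        if text.endswith("?"):
            questions.append(text)
    return questions[:max_questions]
```

Applying a step like this to the raw response before building `evaluation_questions` would keep the number of questions per type closer to `QUESTION_DISTRIBUTION`.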

### Embedding Model Evaluation

The evaluation process involves:

1. For each model:

   - Converting questions to embeddings
   - Performing similarity search against chunk embeddings
   - Checking whether the source chunk appears in the top-K results (K = `TOP_K`)

2. Calculating performance metrics:

   - Overall accuracy
   - Per-question-type accuracy
   - Detailed success/failure analysis

This comprehensive evaluation will help identify each model's strengths and weaknesses.
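
The accuracy metric described above is a hit rate: the fraction of questions whose source chunk appears among the top-K results. As a minimal, database-free sketch, it can be computed as follows. The function name `hit_rate_at_k` and the input shape are assumptions made for illustration.

```python
from collections import defaultdict


def hit_rate_at_k(results):
    """Compute overall and per-type hit rates.

    `results` is a list of dicts with 'question_type' (str) and 'found' (bool),
    where 'found' means the source chunk was among the top-K search results.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r["question_type"]] += 1
        hits[r["question_type"]] += int(r["found"])

    # Divide by the number of questions actually evaluated for each type.
    by_type = {q_type: hits[q_type] / totals[q_type] for q_type in totals}
    total = sum(totals.values())
    overall = sum(hits.values()) / total if total else 0.0
    return overall, by_type
```

Dividing by the observed count per type keeps the metric correct even when the LLM returns more or fewer questions than requested.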

In [29]:
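import os

# Use the OLLAMA_HOST environment variable if it is set; otherwise fall back to the default defined earlier
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://ollama:11434")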

def vector_similarity_search(embeddings_view, embedding_model, question):
    with connect_db() as conn:
        with conn.cursor() as cur:
            cur.execute(f"""
                SELECT id, chunk_seq 
                FROM {embeddings_view} 
                ORDER BY embedding <=> ai.ollama_embed(%s, %s, host => %s)
                LIMIT %s;
            """, (embedding_model, question, OLLAMA_HOST, TOP_K,)
            )

            return cur.fetchall()
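
The `<=>` operator in the query above is pgvector's cosine distance operator, so `ORDER BY ... LIMIT` returns the K chunks whose embeddings are closest to the question embedding. The plain-Python sketch below illustrates the same ranking on small vectors. The helper names are hypothetical, and raising an error for zero vectors is a choice made for this sketch.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity (0 = same direction, 2 = opposite)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vector, chunks, k):
    """Return the keys of the k chunks closest to the query.

    `chunks` is a list of (chunk_key, vector) pairs.
    """
    ranked = sorted(chunks, key=lambda item: cosine_distance(query_vector, item[1]))
    return [key for key, _ in ranked[:k]]
```

Because cosine distance ignores vector length, only the direction of each embedding affects the ranking.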

In [30]:
def evaluate_embedding_models():
    evaluation_results = []
    detailed_results = []

    for model in EMBEDDING_MODELS:
        print(f"Evaluating {model['name']}...")

        embeddings_view = f"essays_{model['name'].replace('-', '_')}_embeddings"
        scores = []

        for q in evaluation_questions:
            vector_search_results = vector_similarity_search(embeddings_view, model['name'], q['question'])
            found = any(
                row[0] == q['source_chunk_id'] and row[1]== q['source_chunk_seq'] 
                for row in vector_search_results
            )

            scores.append(1 if found else 0)

            detailed_results.append({
                'model': model['name'],
                'question': q['question'],
                'question_type': q['question_type'],
                'source_chunk_id': q['source_chunk_id'],
                'source_chunk_seq': q['source_chunk_seq'],
                'found_correct_chunk': found,
                'num_results': len(vector_search_results)
            })

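        # Group scores by question type. The LLM may return more or fewer questions
        # than requested, so divide by the actual count rather than the planned one.
        scores_by_type = {q_type: [] for q_type in QUESTION_DISTRIBUTION}
        for score, q in zip(scores, evaluation_questions):
            scores_by_type[q['question_type']].append(score)

        evaluation_results.append({
            'model': model['name'],
            'overall_accuracy': sum(scores) / len(scores),
            'by_type': {
                q_type: sum(type_scores) / len(type_scores) if type_scores else 0.0
                for q_type, type_scores in scores_by_type.items()
            }
        })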
    
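    # Create the output directory first; to_csv does not create missing directories
    os.makedirs('./evaluation_data', exist_ok=True)
    pd.DataFrame(detailed_results).to_csv('./evaluation_data/detailed_results.csv')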
    return evaluation_results

In [31]:
def create_results_table(evaluation_results):
    # Create lists to store the data
    rows = []
    
    # Process each model's results
    for result in evaluation_results:
        row = {
            'Model': result['model'],
            'Overall Accuracy': f"{result['overall_accuracy']:.2%}",
        }
        # Add accuracies for each question type
        for q_type, acc in result['by_type'].items():
            row[q_type.capitalize()] = f"{acc:.2%}"
        
        rows.append(row)
    
    # Create DataFrame
    df = pd.DataFrame(rows)
    
    # Reorder columns to put Overall Accuracy after Model
    columns = ['Model', 'Overall Accuracy'] + [col for col in df.columns if col not in ['Model', 'Overall Accuracy']]
    df = df[columns]
    
    # Display the table
    return df.style.set_properties(**{
        'text-align': 'center',
        'border': '1px solid black',
        'padding': '8px'
    }).set_table_styles([
        {'selector': 'th', 'props': [
            ('background-color', 'black'),
            ('color', 'white'),
            ('text-align', 'center'),
            ('padding', '8px'),
            ('border', '1px solid black')
        ]},
        {'selector': 'caption', 'props': [
            ('text-align', 'center'),
            ('font-weight', 'bold'),
            ('font-size', '1.1em'),
            ('padding', '8px')
        ]}
    ]).set_caption('Embedding Models Evaluation Results')

In [None]:
# Display the results
evaluation_results = evaluate_embedding_models()
results_table = create_results_table(evaluation_results)
display(results_table)