
Semantic Stress Test for Embedding Models

This benchmark is designed to stress-test the semantic understanding of embedding models. It evaluates whether a model comprehends the global meaning of a text or merely relies on surface-level lexical cues.

The core of the benchmark is a dataset of triplets: (Base, Looks Similar, Looks Different).

  • Base: A description of a software component.
  • Looks Similar (The Lexical Trap): A sentence that reuses most of the words and structure from the Base but has a completely different meaning.
  • Looks Different (The Semantic Twin): A sentence that expresses the exact same meaning as the Base but uses entirely different vocabulary and sentence structure.

A good model should score the (Base, Looks Different) pair as more similar than the (Base, Looks Similar) pair.
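
In code, the check for a single triplet reduces to comparing two cosine similarities. The sketch below assumes a hypothetical embed() callable that wraps whichever embedding API is under test; it illustrates the idea and is not code from this repository.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def passes_triplet(embed, base: str, looks_similar: str, looks_different: str) -> bool:
        """True if the model ranks the semantic twin above the lexical trap."""
        v_base = embed(base)
        sim_trap = cosine_similarity(v_base, embed(looks_similar))    # lexical trap
        sim_twin = cosine_similarity(v_base, embed(looks_different))  # semantic twin
        return sim_twin > sim_trap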

Examples

Here are a couple of examples to illustrate the concept:

Example 1: Encryption vs. Decryption

  • Base: A middleware service encrypts HTTP requests before forwarding them to the internal API gateway.
  • Looks Similar (Lexical Trap): A middleware service decrypts HTTP requests before forwarding them to the internal API gateway. (Swaps encrypts for decrypts, inverting the meaning).
  • Looks Different (Semantic Twin): An intermediary application secures inbound web traffic by encoding it for confidentiality prior to relaying messages to back-end service entry points. (Describes encryption abstractly).

Example 2: Compression vs. Encryption

  • Base: A Node.js middleware automatically compresses HTTP responses with gzip when the response body exceeds a configurable size threshold.
  • Looks Similar (Lexical Trap): A Node.js middleware automatically encrypts HTTP responses with AES when the response body exceeds a configurable size threshold. (Swaps compression for security).
  • Looks Different (Semantic Twin): The component enhances web server efficiency by integrating a runtime filter that detects large outgoing payloads and dynamically applies industry-standard lossless compression before transmission. (Describes the same compression logic using different terms).

Interestingly, while Gemini's embedding-001 model is a top performer on the MTEB leaderboard, it scores 0% on this specific benchmark. (With sample sizes this small, scores in the 0%-5% range are statistical noise; there is no meaningful difference between a model scoring 0% and one scoring 5%.)

How Accuracy is Measured

The measure_accuracy.py script determines the model's success rate. For each triplet, it checks if the cosine similarity score between the Base and the Looks Different sentence is higher than the score between the Base and the Looks Similar sentence.

The final accuracy is the percentage of triplets where the model correctly ranked the "Semantic Twin" (Looks Different) as more similar to the Base than the "Lexical Trap" (Looks Similar).
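
As a rough sketch of this computation (illustrative only; the actual measure_accuracy.py and the field names in similarity_scores.jsonl may differ), assuming one JSON object per triplet holding the two similarity scores:

    import json

    def measure_accuracy(path: str = "software_repo_desc/similarity_scores.jsonl") -> float:
        correct, total = 0, 0
        with open(path) as f:
            for line in f:
                record = json.loads(line)  # hypothetical field names below
                total += 1
                # Correct if the semantic twin outranks the lexical trap.
                if record["sim_looks_different"] > record["sim_looks_similar"]:
                    correct += 1
        return correct / total if total else 0.0

    print(f"Accuracy: {measure_accuracy():.1%}")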

Why This Benchmark is a Robust Evaluation Method

Comparing embedding models based on their raw cosine similarity scores can be misleading. This benchmark circumvents two common problems:

  1. Different Score Distributions: Some models produce cosine scores that are tightly clustered (e.g., all scores fall between 0.7 and 0.9), while others have a much wider, more uniform spread. This makes it difficult to compare the absolute scores between models. A score of 0.7 from one model might signify a stronger connection than a score of 0.8 from another.

  2. Varying Score Magnitudes: Some models naturally produce lower similarity scores across the board. A model that gives a top score of 0.6 isn't necessarily worse than one that gives 0.9; it just operates on a different scale.

This benchmark solves these issues by using a relative comparison. Instead of looking at the absolute scores, we only check the ranking for each triplet within a single model's output. The only question we ask is: "Did the model correctly score the semantic twin higher than the lexical trap?"

This turns the evaluation into a simple accuracy percentage, allowing for a fair and direct comparison between any embedding models, regardless of their internal scoring behavior.
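
As a toy illustration with made-up numbers: one model clusters its scores near the top of the scale while the other scores everything lower, yet both rank the semantic twin above the lexical trap, so both count the triplet as correct.

    # Hypothetical scores for a single triplet; only the ranking matters.
    model_a = {"sim_twin": 0.86, "sim_trap": 0.83}  # tightly clustered, high scores
    model_b = {"sim_twin": 0.41, "sim_trap": 0.22}  # wider spread, lower scores

    for name, scores in [("model_a", model_a), ("model_b", model_b)]:
        verdict = "correct" if scores["sim_twin"] > scores["sim_trap"] else "wrong"
        print(name, verdict)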

Dataset Generation and Adaptability

The entire dataset is synthetically generated by GPT-4 based on a detailed generation prompt. This approach is very flexible.

  • Scalability & Complexity: The number of test cases can be scaled up by re-running the data generation script (data_gen.py), and their complexity can be adjusted by modifying the generation prompt.
  • Domain Agnostic: While the current focus is on software engineering, the benchmark can be adapted to any domain (e.g., legal, medical, finance) simply by changing the content and examples in the prompt.
  • Fine-Tuning Potential: Since we can generate a virtually unlimited number of high-quality triplets, this dataset can be used to fine-tune an existing embedding model or train a new one to excel at this specific type of semantic challenge.
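
For reference, each triplet can be stored as one JSON object per line in triplets.jsonl. The snippet below uses Example 1 from above with illustrative field names, which are not necessarily the ones the repository's files use.

    import json

    # Illustrative schema only; the real triplets.jsonl may use different field names.
    triplet = {
        "base": "A middleware service encrypts HTTP requests before forwarding them "
                "to the internal API gateway.",
        "looks_similar": "A middleware service decrypts HTTP requests before forwarding "
                         "them to the internal API gateway.",
        "looks_different": "An intermediary application secures inbound web traffic by "
                           "encoding it for confidentiality prior to relaying messages "
                           "to back-end service entry points.",
    }

    with open("software_repo_desc/triplets.jsonl", "a") as f:
        f.write(json.dumps(triplet) + "\n")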

How to Run the Benchmark

  1. Install Git LFS and pull the assets to download the pre-generated dataset.

    git lfs install
    git lfs pull
  2. Add your OPENAI_API_KEY and GEMINI_KEY to a .env file.

  3. Run these commands in order:

    # (Optional) To re-generate the dataset using GPT-4
    rm software_repo_desc/similarity_scores.jsonl
    rm software_repo_desc/triplets.jsonl
    uv run python data_gen.py

    # Run the benchmark against Gemini's embedding model
    uv run python gemini_bench.py

    # Measure the accuracy of the model
    uv run python measure_accuracy.py
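
For orientation, the benchmarking step conceptually embeds each sentence of a triplet and records the two cosine similarities for measure_accuracy.py to consume. The sketch below is not the repository's gemini_bench.py; it assumes the google-generativeai package and the illustrative field names used above.

    import json
    import os

    import numpy as np
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GEMINI_KEY"])

    def embed(text: str) -> np.ndarray:
        # embedding-001 is the model named above; swap in any embedding model.
        result = genai.embed_content(model="models/embedding-001", content=text)
        return np.array(result["embedding"])

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    with open("software_repo_desc/triplets.jsonl") as fin, \
            open("software_repo_desc/similarity_scores.jsonl", "w") as fout:
        for line in fin:
            t = json.loads(line)  # illustrative field names
            v_base = embed(t["base"])
            scores = {
                "sim_looks_similar": cosine(v_base, embed(t["looks_similar"])),
                "sim_looks_different": cosine(v_base, embed(t["looks_different"])),
            }
            fout.write(json.dumps(scores) + "\n")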

Contributing

Contributions of any kind are welcome! Whether it's improving the prompts, adding new benchmark models, or refining the code, please feel free to open an issue or submit a pull request.
