This benchmark is designed to stress-test the semantic understanding of embedding models. It evaluates whether a model comprehends the global meaning of a text or merely uses local information.
The core of the benchmark is a dataset of triplets: (Base, Looks Similar, Looks Different).
- Base: A description of a software component.
- Looks Similar (The Lexical Trap): A sentence that reuses most of the words and structure from the Base but has a completely different meaning.
- Looks Different (The Semantic Twin): A sentence that expresses the exact same meaning as the Base but uses entirely different vocabulary and sentence structure.
A good model should score (Base, Looks Different) as more similar than (Base, Looks Similar).
Here are a couple of examples to illustrate the concept:
Example 1: Encryption vs. Decryption
- Base: A middleware service encrypts HTTP requests before forwarding them to the internal API gateway.
- Looks Similar (Lexical Trap): A middleware service decrypts HTTP requests before forwarding them to the internal API gateway. (Swaps "encrypts" for "decrypts", inverting the meaning.)
- Looks Different (Semantic Twin): An intermediary application secures inbound web traffic by encoding it for confidentiality prior to relaying messages to back-end service entry points. (Describes encryption abstractly.)
Example 2: Compression vs. Encryption
- Base: A Node.js middleware automatically compresses HTTP responses with gzip when the response body exceeds a configurable size threshold.
- Looks Similar (Lexical Trap): A Node.js middleware automatically encrypts HTTP responses with AES when the response body exceeds a configurable size threshold. (Swaps compression for security.)
- Looks Different (Semantic Twin): The component enhances web server efficiency by integrating a runtime filter that detects large outgoing payloads and dynamically applies industry-standard lossless compression before transmission. (Describes the same compression logic using different terms.)
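Each triplet is ultimately stored as one JSON record per line (the repository ships a pre-generated `triplets.jsonl` via Git LFS). The field names below are hypothetical and only meant to illustrate the shape of the data; the generated file may use different keys.

```python
# Hypothetical layout of a single triplets.jsonl record, using Example 1 above.
# The actual field names produced by data_gen.py may differ.
triplet = {
    "base": "A middleware service encrypts HTTP requests before forwarding them "
            "to the internal API gateway.",
    "looks_similar": "A middleware service decrypts HTTP requests before "
                     "forwarding them to the internal API gateway.",
    "looks_different": "An intermediary application secures inbound web traffic "
                       "by encoding it for confidentiality prior to relaying "
                       "messages to back-end service entry points.",
}
```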
Interestingly, while Gemini's `embedding-001` model is a top performer on the MTEB leaderboard, it scores 0% on this specific benchmark. (With sample sizes this small, anything in the 0%-5% range is statistical noise, so there is no meaningful difference between a model scoring 0% and one scoring 5%.)
The `measure_accuracy.py` script determines the model's success rate. For each triplet, it checks whether the cosine similarity score between the Base and the Looks Different sentence is higher than the score between the Base and the Looks Similar sentence.

The final accuracy is the percentage of triplets where the model correctly ranked the "Semantic Twin" (Looks Different) as more similar than the "Lexical Trap" (Looks Similar).
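In essence, the per-triplet check is a single ranking comparison. The sketch below illustrates the idea with NumPy; it is not the actual `measure_accuracy.py`, and the function names and in-memory format are assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_correct(base: np.ndarray, twin: np.ndarray, trap: np.ndarray) -> bool:
    """True when the Semantic Twin outranks the Lexical Trap relative to the Base."""
    return cosine_similarity(base, twin) > cosine_similarity(base, trap)

def accuracy(triplets: list[tuple[np.ndarray, np.ndarray, np.ndarray]]) -> float:
    """Fraction of (base, twin, trap) embedding triplets ranked correctly."""
    correct = sum(triplet_correct(b, d, s) for b, d, s in triplets)
    return correct / len(triplets)
```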
Comparing embedding models based on their raw cosine similarity scores can be misleading. This benchmark circumvents two common problems:
- Different Score Distributions: Some models produce cosine scores that are tightly clustered (e.g., all scores fall between 0.7 and 0.9), while others have a much wider, more uniform spread. This makes it difficult to compare absolute scores across models: a score of 0.7 from one model might signify a stronger connection than a score of 0.8 from another.
- Varying Score Magnitudes: Some models naturally produce lower similarity scores across the board. A model whose top score is 0.6 isn't necessarily worse than one whose top score is 0.9; it just operates on a different scale.
This benchmark solves these issues by using a relative comparison. Instead of looking at the absolute scores, we only check the ranking for each triplet within a single model's output. The only question we ask is: "Did the model correctly score the semantic twin higher than the lexical trap?"
This turns the evaluation into a simple accuracy percentage, allowing for a fair and direct comparison across embedding models, regardless of their internal scoring behavior.
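A quick illustration with made-up numbers: two models with very different score scales can both pass the same triplet, because only the ordering within each model matters.

```python
# Illustrative numbers only, not real model output.
model_a = {"semantic_twin": 0.84, "lexical_trap": 0.81}  # tightly clustered scores
model_b = {"semantic_twin": 0.52, "lexical_trap": 0.23}  # lower, more spread-out scores

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    correct = scores["semantic_twin"] > scores["lexical_trap"]
    print(f"{name}: ranked the semantic twin higher -> {correct}")  # True for both
```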
The entire dataset is synthetically generated by GPT-4 based on a detailed generation prompt. This approach is very flexible.
- Scalability & Complexity: The number and complexity of the test cases can be easily scaled by running the data generation script (
data_gen.py
) or by modifying the prompt. - Domain Agnostic: While the current focus is on software engineering, the benchmark can be adapted to any domain (e.g., legal, medical, finance) simply by changing the content and examples in the prompt.
- Fine-Tuning Potential: Since we can generate a virtually unlimited number of high-quality triplets, this dataset can be used to fine-tune an existing embedding model or train a new one to excel at this specific type of semantic challenge.
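As a rough illustration of the generation step, a call with the official OpenAI Python client might look like the sketch below. This is not the actual `data_gen.py`: the condensed prompt, response parsing, and output handling are all assumptions.

```python
import json
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical condensed prompt; the real generation prompt is far more detailed.
prompt = (
    "Generate 5 triplets describing software components, one JSON object per line, "
    "with keys 'base', 'looks_similar', and 'looks_different'. 'looks_similar' must "
    "reuse the wording of 'base' but change its meaning; 'looks_different' must keep "
    "the meaning but use entirely different vocabulary."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

# Append validated records to the dataset file.
with open("software_repo_desc/triplets.jsonl", "a") as f:
    for line in response.choices[0].message.content.splitlines():
        line = line.strip()
        if line:
            json.loads(line)  # raises if the model returned malformed JSON
            f.write(line + "\n")
```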
- Install Git LFS and pull the assets to download the pre-generated dataset.

  ```bash
  git lfs install
  git lfs pull
  ```
- Add your `OPENAI_API_KEY` and `GEMINI_KEY` to a `.env` file.
- Run these commands in order:
  ```bash
  # (Optional) To re-generate the dataset using GPT-4
  rm software_repo_desc/similarity_scores.jsonl
  rm software_repo_desc/triplets.jsonl
  uv run python data_gen.py

  # Run the benchmark against Gemini's embedding model
  uv run python gemini_bench.py

  # Measure the accuracy of the model
  uv run python measure_accuracy.py
  ```
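For orientation, `gemini_bench.py` presumably obtains an embedding for every sentence in each triplet. With the google-generativeai package, such a call might look roughly like this; the package choice, model path, and environment-variable handling are assumptions, and the actual script may differ.

```python
import os
import google.generativeai as genai  # assumes the google-generativeai package

genai.configure(api_key=os.environ["GEMINI_KEY"])

# Request an embedding for one sentence; "models/embedding-001" is the assumed
# path for the embedding-001 model discussed above.
result = genai.embed_content(
    model="models/embedding-001",
    content="A middleware service encrypts HTTP requests before forwarding them "
            "to the internal API gateway.",
)
vector = result["embedding"]  # a list of floats
print(len(vector))
```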
Contributions of any kind are welcome! Whether it's improving the prompts, adding new benchmark models, or refining the code, please feel free to open an issue or submit a pull request.