This benchmark is designed to stress-test the semantic understanding of embedding models. It evaluates whether a model comprehends the global meaning of a text or merely uses local information.
The core of the benchmark is a dataset of triplets: (Base, Looks Similar, Looks Different).
- Base: A description of a software component.
- Looks Similar (The Lexical Trap): A sentence that reuses most of the words and structure from the Base but has a completely different meaning.
- Looks Different (The Semantic Twin): A sentence that expresses the exact same meaning as the Base but uses entirely different vocabulary and sentence structure.
A good model should score (Base, Looks Different) as more similar than (Base, Looks Similar).
Here are a couple of examples to illustrate the concept:
Example 1: Encryption vs. Decryption
- Base: A middleware service encrypts HTTP requests before forwarding them to the internal API gateway.
- Looks Similar (Lexical Trap): A middleware service decrypts HTTP requests before forwarding them to the internal API gateway. (Swaps "encrypts" for "decrypts", inverting the meaning.)
- Looks Different (Semantic Twin): An intermediary application secures inbound web traffic by encoding it for confidentiality prior to relaying messages to back-end service entry points. (Describes encryption abstractly.)
Example 2: Compression vs. Encryption
- Base: A Node.js middleware automatically compresses HTTP responses with gzip when the response body exceeds a configurable size threshold.
- Looks Similar (Lexical Trap): A Node.js middleware automatically encrypts HTTP responses with AES when the response body exceeds a configurable size threshold. (Swaps compression for security.)
- Looks Different (Semantic Twin): The component enhances web server efficiency by integrating a runtime filter that detects large outgoing payloads and dynamically applies industry-standard lossless compression before transmission. (Describes the same compression logic using different terms.)
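Each triplet is ultimately stored as one JSON record per line (the repository ships a pre-generated `triplets.jsonl` via Git LFS). The field names below are hypothetical and only meant to illustrate the shape of the data; the generated file may use different keys.

```python
# Hypothetical layout of a single triplets.jsonl record, using Example 1 above.
# The actual field names produced by data_gen.py may differ.
triplet = {
    "base": "A middleware service encrypts HTTP requests before forwarding them "
            "to the internal API gateway.",
    "looks_similar": "A middleware service decrypts HTTP requests before "
                     "forwarding them to the internal API gateway.",
    "looks_different": "An intermediary application secures inbound web traffic "
                       "by encoding it for confidentiality prior to relaying "
                       "messages to back-end service entry points.",
}
```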
Interestingly, while Gemini's `embedding-001` model is a top performer on the MTEB leaderboard, it scores 0% on this specific benchmark. (With sample sizes this small, anything in the 0%-5% range is statistical noise, so there is no meaningful difference between a model scoring 0% and one scoring 5%.)
The `measure_accuracy.py` script determines the model's success rate. For each triplet, it checks whether the cosine similarity score between the Base and the Looks Different sentence is higher than the score between the Base and the Looks Similar sentence.

The final accuracy is the percentage of triplets where the model correctly ranked the "Semantic Twin" (Looks Different) as more similar than the "Lexical Trap" (Looks Similar).
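In essence, the per-triplet check is a single ranking comparison. The sketch below illustrates the idea with NumPy; it is not the actual `measure_accuracy.py`, and the function names and in-memory format are assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_correct(base: np.ndarray, twin: np.ndarray, trap: np.ndarray) -> bool:
    """True when the Semantic Twin outranks the Lexical Trap relative to the Base."""
    return cosine_similarity(base, twin) > cosine_similarity(base, trap)

def accuracy(triplets: list[tuple[np.ndarray, np.ndarray, np.ndarray]]) -> float:
    """Fraction of (base, twin, trap) embedding triplets ranked correctly."""
    correct = sum(triplet_correct(b, d, s) for b, d, s in triplets)
    return correct / len(triplets)
```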
Comparing embedding models based on their raw cosine similarity scores can be misleading. This benchmark circumvents two common problems:
- Different Score Distributions: Some models produce cosine scores that are tightly clustered (e.g., all scores fall between 0.7 and 0.9), while others have a much wider, more uniform spread. This makes it difficult to compare absolute scores across models: a score of 0.7 from one model might signify a stronger connection than a score of 0.8 from another.
- Varying Score Magnitudes: Some models naturally produce lower similarity scores across the board. A model whose top score is 0.6 isn't necessarily worse than one whose top score is 0.9; it just operates on a different scale.
This benchmark solves these issues by using a relative comparison. Instead of looking at the absolute scores, we only check the ranking for each triplet within a single model's output. The only question we ask is: "Did the model correctly score the semantic twin higher than the lexical trap?"
This turns the evaluation into a simple accuracy percentage, allowing for a fair and direct comparison across embedding models, regardless of their internal scoring behavior.
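A quick illustration with made-up numbers: two models with very different score scales can both pass the same triplet, because only the ordering within each model matters.

```python
# Illustrative numbers only, not real model output.
model_a = {"semantic_twin": 0.84, "lexical_trap": 0.81}  # tightly clustered scores
model_b = {"semantic_twin": 0.52, "lexical_trap": 0.23}  # lower, more spread-out scores

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    correct = scores["semantic_twin"] > scores["lexical_trap"]
    print(f"{name}: ranked the semantic twin higher -> {correct}")  # True for both
```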
The entire dataset is synthetically generated by GPT-4 based on a detailed generation prompt. This approach is very flexible.
- Scalability & Complexity: The number and complexity of the test cases can be easily scaled by running the data generation script (
data_gen.py
) or by modifying the prompt. - Domain Agnostic: While the current focus is on software engineering, the benchmark can be adapted to any domain (e.g., legal, medical, finance) simply by changing the content and examples in the prompt.
- Fine-Tuning Potential: Since we can generate a virtually unlimited number of high-quality triplets, this dataset can be used to fine-tune an existing embedding model or train a new one to excel at this specific type of semantic challenge.
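As a rough illustration of the generation step, a call with the official OpenAI Python client might look like the sketch below. This is not the actual `data_gen.py`: the condensed prompt, response parsing, and output handling are all assumptions.

```python
import json
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical condensed prompt; the real generation prompt is far more detailed.
prompt = (
    "Generate 5 triplets describing software components, one JSON object per line, "
    "with keys 'base', 'looks_similar', and 'looks_different'. 'looks_similar' must "
    "reuse the wording of 'base' but change its meaning; 'looks_different' must keep "
    "the meaning but use entirely different vocabulary."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

# Append validated records to the dataset file.
with open("software_repo_desc/triplets.jsonl", "a") as f:
    for line in response.choices[0].message.content.splitlines():
        line = line.strip()
        if line:
            json.loads(line)  # raises if the model returned malformed JSON
            f.write(line + "\n")
```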
- Install Git LFS and pull the assets to download the pre-generated dataset.

  ```bash
  git lfs install
  git lfs pull
  ```
- Add your `OPENAI_API_KEY` and `GEMINI_KEY` to a `.env` file.
- Run these commands in order:
  ```bash
  # (Optional) To re-generate the dataset using GPT-4
  rm software_repo_desc/similarity_scores.jsonl
  rm software_repo_desc/triplets.jsonl
  uv run python data_gen.py

  # Run the benchmark against Gemini's embedding model
  uv run python gemini_bench.py

  # Measure the accuracy of the model
  uv run python measure_accuracy.py
  ```
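For orientation, `gemini_bench.py` presumably obtains an embedding for every sentence in each triplet. With the google-generativeai package, such a call might look roughly like this; the package choice, model path, and environment-variable handling are assumptions, and the actual script may differ.

```python
import os
import google.generativeai as genai  # assumes the google-generativeai package

genai.configure(api_key=os.environ["GEMINI_KEY"])

# Request an embedding for one sentence; "models/embedding-001" is the assumed
# path for the embedding-001 model discussed above.
result = genai.embed_content(
    model="models/embedding-001",
    content="A middleware service encrypts HTTP requests before forwarding them "
            "to the internal API gateway.",
)
vector = result["embedding"]  # a list of floats
print(len(vector))
```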
Contributions of any kind are welcome! Whether it's improving the prompts, adding new benchmark models, or refining the code, please feel free to open an issue or submit a pull request.