# Testing LLM Query Outputs with Cosine Similarity

In this notebook, we demonstrate a metamorphic testing approach for LLM-based features. Instead of hard-coding expected outputs, we test whether the outputs from similar or contrasting queries conform to the expected relationships by measuring their cosine similarity.

For example, if two queries express the same intent in different words, we expect their outputs to be semantically similar. Conversely, if a query is rephrased to flip its sentiment, we expect a noticeably lower cosine similarity between the responses. Because both responses still discuss the same topic, the score will often stay positive rather than drop below zero.
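
As a reminder of what the score measures, here is a minimal pure-Python sketch of cosine similarity on toy 2-D vectors. The helper name `cosine` and the vectors are illustrative assumptions, not part of the notebook's code, which relies on scikit-learn instead.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors chosen to hit the three reference points of the scale.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not on vector length, which is why it is a common choice for comparing text embeddings.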

In [None]:
%pip install sentence-transformers scikit-learn

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Use all-MiniLM-L6-v2, a small, widely used sentence-transformers model, to embed our texts.
model = SentenceTransformer('all-MiniLM-L6-v2')

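def get_embedding(text: str) -> np.ndarray:
    """
    Returns the embedding for the given text as a 2D array of shape
    (1, embedding_dim), the layout sklearn's cosine_similarity expects.
    """
    # Passing a one-element list makes encode() return a 2D array.
    embedding = model.encode([text])
    return embedding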

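def compute_cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """
    Computes the cosine similarity between two 2D arrays of shape (1, dim).
    Returns a value between -1 and 1.
    """
    # cosine_similarity returns a (1, 1) matrix; convert its single entry to a plain float.
    return float(cosine_similarity(vec1, vec2)[0][0])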

def simulate_llm_output(query: str) -> str:
    """
    Simulates an LLM query response. In a real-world scenario, this function would call an LLM API.
    """
    # For demonstration, we return a predefined response based on the query content
    if "drawbacks" in query or "negative aspects" in query:
        return "Eating outside can expose you to unpredictable weather and risks of foodborne illnesses."
    elif "benefits" in query or "positive aspects" in query:
        return "Dining outdoors can improve your mood and offer a refreshing change from routine indoor meals."
    else:
        return "The experience of outdoor dining depends on various factors including weather and food quality."




In [None]:
# Define queries with metamorphic relations
query_similar_1 = "What are the drawbacks of eating outside?"
query_similar_2 = "What negative aspects come with outdoor dining?"

query_opposite_1 = "What are the drawbacks of eating outside?"
query_opposite_2 = "What are the benefits of eating outside?"

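# Simulate LLM outputs for the queries.
# Note: both "similar" queries hit the same branch of simulate_llm_output, so their
# outputs are identical and the similarity is effectively 1.0. A real LLM would
# return differently worded answers, giving a high but not perfect score.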
output_similar_1 = simulate_llm_output(query_similar_1)
output_similar_2 = simulate_llm_output(query_similar_2)

output_opposite_1 = simulate_llm_output(query_opposite_1)
output_opposite_2 = simulate_llm_output(query_opposite_2)

# Obtain embeddings for the simulated outputs
embedding_similar_1 = get_embedding(output_similar_1)
embedding_similar_2 = get_embedding(output_similar_2)

embedding_opposite_1 = get_embedding(output_opposite_1)
embedding_opposite_2 = get_embedding(output_opposite_2)

# Compute cosine similarities
similarity_similar = compute_cosine_similarity(embedding_similar_1, embedding_similar_2)
similarity_opposite = compute_cosine_similarity(embedding_opposite_1, embedding_opposite_2)

print(f"Cosine Similarity for similar queries (drawbacks): {similarity_similar:.3f}")
print(f"Cosine Similarity for opposite queries (drawbacks vs benefits): {similarity_opposite:.3f}")

def interpret_similarity(sim: float) -> str:
    """
    Provides an interpretation of the cosine similarity value.
    The thresholds are illustrative and should be calibrated for the embedding model in use.
    """
    if sim >= 0.7:
        return "The outputs are highly similar (expected for similar queries)."
    elif sim <= -0.7:
        return "The outputs are highly opposite (expected for contrasting queries)."
    elif -0.3 < sim < 0.3:
        return "The outputs are largely unrelated."
    else:
        return "The outputs show moderate similarity/difference."

print("Interpretation for similar queries:", interpret_similarity(similarity_similar))
print("Interpretation for opposite queries:", interpret_similarity(similarity_opposite))
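
Fixed thresholds such as 0.7 can be brittle, so one alternative is to assert the metamorphic relation relatively: the paraphrase pair should score higher than the contrasting pair by some margin. The sketch below is a minimal illustration under stated assumptions. `check_relation`, `toy_embed`, `sparse_cosine`, and the `margin` default are hypothetical names and values, and the bag-of-words embedder is only a stand-in so the example runs without downloading a model. In the notebook you would pass outputs embedded with `get_embedding` and scored with `compute_cosine_similarity` instead.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real embedding model: lowercase word counts."""
    return dict(Counter(text.lower().replace(".", "").split()))


def sparse_cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def check_relation(
    embed: Callable[[str], Vector],
    base: str,
    paraphrase: str,
    contrast: str,
    margin: float = 0.1,  # assumed value; tune for the real embedding model
) -> bool:
    """True if base is closer to paraphrase than to contrast by at least margin."""
    sim_paraphrase = sparse_cosine(embed(base), embed(paraphrase))
    sim_contrast = sparse_cosine(embed(base), embed(contrast))
    return sim_paraphrase - sim_contrast >= margin


# The two simulated responses from this notebook.
drawbacks = "Eating outside can expose you to unpredictable weather and risks of foodborne illnesses."
benefits = "Dining outdoors can improve your mood and offer a refreshing change from routine indoor meals."

relation_holds = check_relation(toy_embed, drawbacks, drawbacks, benefits)
print("Metamorphic relation holds:", relation_holds)
```

A relative check like this does not require the contrasting pair to reach a negative score, only that it scores clearly below the paraphrase pair.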


## Conclusion

This notebook illustrates a metamorphic testing approach for LLM query outputs using cosine similarity. By simulating LLM responses for different queries and comparing their semantic similarity, we can verify whether the model's outputs adhere to the expected relationships, without relying on fixed, deterministic expected outputs.

Such an approach is especially useful when working with non-deterministic LLM outputs, where exact-match assertions against a fixed expected string tend to fail even when the response is acceptable.
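
When outputs vary from run to run, one option is to sample several responses per query and compare average similarities rather than a single pair. The following is a minimal sketch under stated assumptions. `mean_cross_similarity` and `jaccard` are hypothetical helpers, the sample strings are made up for illustration, and the word-overlap score is only a stand-in for embedding-based cosine similarity so the example runs without a model.

```python
from itertools import product
from statistics import mean
from typing import Callable, Sequence


def jaccard(a: str, b: str) -> float:
    """Hypothetical stand-in for embedding similarity: word-set overlap in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def mean_cross_similarity(
    samples_a: Sequence[str],
    samples_b: Sequence[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Average similarity over every (a, b) pair drawn from the two sample sets."""
    return mean(similarity(a, b) for a, b in product(samples_a, samples_b))


# Made-up samples standing in for repeated LLM calls with the same query.
drawback_runs = ["rain can ruin the meal", "bad weather can ruin the meal"]
paraphrase_runs = ["rain can spoil the meal", "bad weather can ruin the food"]
benefit_runs = ["fresh air can lift your mood", "sunshine makes lunch more fun"]

paraphrase_score = mean_cross_similarity(drawback_runs, paraphrase_runs, jaccard)
contrast_score = mean_cross_similarity(drawback_runs, benefit_runs, jaccard)

print(f"paraphrase: {paraphrase_score:.3f}, contrast: {contrast_score:.3f}")
```

Averaging over several samples reduces the chance that one unusual response flips the test result, at the cost of more LLM calls per test.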