Imports:

In [21]:
import openai
import anthropic
import google.generativeai as gemini

API keys - replace with your keys:

In [22]:
OPENAI_API_KEY = "your key here"
ANTHROPIC_API_KEY = "your key here"
GOOGLE_API_KEY = "your key here"
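
Hard-coded keys are easy to leak when a notebook is shared. As a minimal sketch of one alternative, the keys could be read from environment variables instead. The helper name `load_api_key` and the environment variable names are assumptions for illustration, not part of the notebook above.

```python
import os

def load_api_key(env_var, default="your key here"):
    """Return the key stored in env_var, or the placeholder if it is unset."""
    return os.environ.get(env_var, default)

# Assumed variable names; adjust them to match your own environment.
OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")
ANTHROPIC_API_KEY = load_api_key("ANTHROPIC_API_KEY")
GOOGLE_API_KEY = load_api_key("GOOGLE_API_KEY")
```

If a variable is unset, the placeholder string is kept, so the rest of the notebook behaves exactly as before.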

OpenAI helper functions:

In [23]:
def generate_openai(prompt):
    # Ask gpt-4o to answer the prompt under a facts-only system message.
    client = openai.OpenAI(api_key=OPENAI_API_KEY)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant who exclusively gives factual and true responses. You only give facts you are 100 percent certain of, and if not, you clarify that you are uncertain about an answer or do not have that knowledge."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

In [24]:
def evaluate_openai(prompt, response):
    # Ask gpt-4o to score the response; `completion` avoids shadowing the `response` argument.
    client = openai.OpenAI(api_key=OPENAI_API_KEY)
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a professional fact-checker who, when given a piece of text that was generated by an LLM, gives a score from 1 to 10 evaluating its overall accuracy and factuality, as well as a short two-or-three-sentence explanation of why you gave that score. You are not evaluating the prompt, only the response. Rate the answer based only on how accurate and factual the information contained within is."},
            {"role": "user", "content": "The prompt was as follows: \n" + prompt + "\n The response generated by an LLM that you must evaluate is as follows: \n" + response},
        ],
    )
    return completion.choices[0].message.content

Anthropic helper functions:

In [25]:
def generate_anthropic(prompt):
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "You are a helpful assistant who exclusively gives factual and true responses. You only give facts you are 100 percent certain of, and if not, you clarify that you are uncertain about an answer or do not have that knowledge. My question for you is as follows: " + prompt},
        ]
    )
    return message.content[0].text

In [26]:
def evaluate_anthropic(prompt, response):
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "You are a professional fact-checker who, when given a piece of text that was generated by an LLM, gives a score from 1 to 10 evaluating its overall accuracy and factuality, as well as a short two-or-three-sentence explanation of why you gave that score. You are not evaluating the prompt, only the response. Rate the answer based only on how accurate and factual the information contained within is. The prompt was as follows: \n" + prompt + "\n The response generated by an LLM that you must evaluate is as follows: \n" + response},
        ]
    )
    return message.content[0].text

Google helper functions:

In [27]:
def generate_google(prompt):
    gemini.configure(api_key=GOOGLE_API_KEY)
    generation_config = {
        "temperature": 1,
        "top_p": 0.95,
        "top_k": 64,
        "max_output_tokens": 8192,
        "response_mime_type": "text/plain",
    }
    model = gemini.GenerativeModel(
        model_name="gemini-1.5-flash",
        generation_config=generation_config,
        system_instruction="You are a helpful assistant who exclusively gives factual and true responses. You only give facts you are 100 percent certain of, and if not, you clarify that you are uncertain about an answer or do not have that knowledge",
    )
    chat_session = model.start_chat(history=[])
    return chat_session.send_message(prompt).text

In [28]:
def evaluate_google(prompt, response):
    gemini.configure(api_key=GOOGLE_API_KEY)
    generation_config = {
        "temperature": 1,
        "top_p": 0.95,
        "top_k": 64,
        "max_output_tokens": 8192,
        "response_mime_type": "text/plain",
    }
    model = gemini.GenerativeModel(
        model_name="gemini-1.5-flash",
        generation_config=generation_config,
        system_instruction="You are a professional fact-checker who, when given a piece of text that was generated by an LLM, gives a score from 1 to 10 evaluating its overall accuracy and factuality, as well as a short two-or-three-sentence explanation of why you gave that score. Rate the answer based only on how accurate and factual the information contained within is. You are not evaluating the prompt, only the response.",
    )
    chat_session = model.start_chat(history=[])
    # Send the prompt and the response to evaluate as the user message, so the model does not simply answer the original prompt again.
    return chat_session.send_message("The prompt was as follows: \n" + prompt + "\n The response generated by an LLM that you must evaluate is as follows: \n" + response).text

Generate answer function:

In [29]:
def generate_response(prompt, model):
    if model == "openai":
        return generate_openai(prompt)
    elif model == "anthropic":
        return generate_anthropic(prompt)
    elif model == "google":
        return generate_google(prompt)
    else:
        raise ValueError(f"Unsupported model: {model}")

Function to generate one evaluation from a model of choice:

In [30]:
def evaluate_response(prompt, response, model):
    if model == "openai":
        return evaluate_openai(prompt, response)
    elif model == "anthropic":
        return evaluate_anthropic(prompt, response)
    elif model == "google":
        return evaluate_google(prompt, response)
    else:
        raise ValueError(f"Unsupported model: {model}")
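
The two dispatchers above repeat the same if/elif chain. One possible alternative, sketched here under assumptions, is a lookup table that maps each model name to its helper. The `fake_*` functions and the `dispatch` helper are hypothetical stand-ins so the sketch runs without any API access; in the notebook, the table would hold the real generate or evaluate helpers.

```python
# Stand-in generators (hypothetical) so the sketch runs without API keys.
def fake_openai(prompt):
    return "openai: " + prompt

def fake_anthropic(prompt):
    return "anthropic: " + prompt

def fake_google(prompt):
    return "google: " + prompt

# Model name -> helper function.
GENERATORS = {
    "openai": fake_openai,
    "anthropic": fake_anthropic,
    "google": fake_google,
}

def dispatch(prompt, model, table):
    """Look up the helper for `model` in `table` and call it with `prompt`."""
    try:
        handler = table[model]
    except KeyError:
        # Same error message as the if/elif version.
        raise ValueError(f"Unsupported model: {model}") from None
    return handler(prompt)

print(dispatch("hello", "google", GENERATORS))
```

With this shape, adding a fourth model means adding one entry to the table rather than another branch in each dispatcher.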

Generate evaluations from all 3 models:

In [36]:
def evalAll(prompt, response):
    openaiEval = evaluate_openai(prompt, response)
    anthropicEval = evaluate_anthropic(prompt, response)
    googleEval = evaluate_google(prompt, response)
    evaluations = {"openai": openaiEval, "anthropic": anthropicEval, "google": googleEval}
    print("\n")
    for key in evaluations:
        print(key + ":\n" + evaluations[key] + "\n")
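
Each evaluator returns free text, so the three scores can only be compared by reading them. A rough sketch of pulling the number out follows, assuming the evaluators phrase an integer score as "8/10", "8 out of 10", or "Score: 8". `extract_score` is a hypothetical helper, not part of the notebook above, and the patterns are a heuristic that will miss other phrasings.

```python
import re

# "8/10" or "8 out of 10"; the lookbehind skips the tail of numbers like "8.5".
_OUT_OF_TEN = re.compile(r"(?<![\d.])(10|[1-9])\s*(?:/|out of)\s*10\b")
# Fallback: the word "score" followed closely by an integer from 1 to 10.
_AFTER_SCORE = re.compile(r"score\D{0,20}?(10|[1-9])\b", re.IGNORECASE)

def extract_score(evaluation_text):
    """Return the integer score found in the text, or None if none is recognized."""
    match = _OUT_OF_TEN.search(evaluation_text) or _AFTER_SCORE.search(evaluation_text)
    return int(match.group(1)) if match else None

print(extract_score("Score: 8/10. Mostly accurate."))  # 8
print(extract_score("No rating given."))               # None
```

Returning None rather than raising keeps a single unparseable evaluation from stopping a comparison across all three models.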

'All-in-one' function that takes a prompt and a model name directly, for rapid testing without the user input box:

In [32]:
def runAll(prompt, model):
    response = generate_response(prompt, model)
    print(response)
    evalAll(prompt, response)

Main function for an example end-user experience. It asks the user for a prompt and for the model that should answer it, then presents the response along with an evaluation by all 3 models:

In [33]:
def main():
    models = ["openai", "anthropic", "google"]

    prompt = input("Enter a prompt: ")
    model_choice = input(f"Choose a model ({', '.join(models)}): ")

    if model_choice not in models:
        print(f"Invalid model choice. Available models: {', '.join(models)}")
        return

    output = generate_response(prompt, model_choice)
    print(f"\nGenerated output:\n{output}")

    print("\nEvaluations:\n")
    evalAll(prompt, output)

Run the below block to input your own prompt and choice of model to run it on.

In [None]:
main()

Below is a set of prompts I found interesting for testing, including some cases where the models struggle or outright fail, both at giving accurate responses and at fact-checking the inaccurate ones. Try running the block with different prompts and model choices as input to the runAll function.

The logical_prompt poses a logic question that no LLM I have tested answers correctly (the marble would fall out of the cup), and the fact-checkers fail to identify the discrepancy as well.

The date_prompt represents a more realistic use case for LLM fact-checking: a specific, fact-based question. The generated list of events usually contains some mistakes. For example, the discovery of Benedict Arnold's treason is often listed, when it was the meeting at which the treason actually occurred that took place on the 21st; its discovery by militiamen came on the 23rd. Minor errors of this kind are generated frequently. The fact-checkers usually catch some but not all of them, and sometimes flag correct events in the list as incorrect.

The math_prompt usually produces a correct response and accurate fact-checking, but the fact-checkers occasionally describe an error that was not present in the generated response at all.
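
Unlike the other two prompts, the math_prompt has an answer that can be checked mechanically, which helps when deciding whether a fact-checker's complaint is real. A small sketch using the quadratic formula follows; `solve_quadratic` is a hypothetical helper added for illustration and handles only real roots with a nonzero leading coefficient.

```python
import math

def solve_quadratic(a, b, c):
    """Return the distinct real roots of a*x^2 + b*x + c = 0 in ascending order."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return ()  # no real roots
    root = math.sqrt(disc)
    # A set collapses a repeated root into a single entry.
    return tuple(sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)}))

# 3x^2 - 12x + 9 = 0 factors as 3(x - 1)(x - 3) = 0.
print(solve_quadratic(3, -12, 9))  # (1.0, 3.0)
```

Any generated response or evaluation that disagrees with x = 1 and x = 3 is therefore wrong about this equation.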

In [56]:
logical_prompt = "Assume the laws of physics on Earth. A small marble is put into a normal cup and the cup is placed upside down on a table. Someone then takes the cup without changing its orientation and puts it inside the microwave. Where is the marble now? Explain your reasoning step by step."
date_prompt = "What significant historical events happened on September 21st?"
math_prompt = "Solve the quadratic equation 3x^2 - 12x + 9 = 0."
runAll(date_prompt, "openai")

Here are a few notable historical events that occurred on September 21st:

1. **1776 - Great Fire of New York:** During the American Revolutionary War, a fire broke out in New York City, destroying a large portion of the city. It happened shortly after the British forces captured the city.

2. **1937 - Publication of "The Hobbit":** J.R.R. Tolkien’s fantasy novel "The Hobbit" was first published. It introduced readers to Middle-earth and characters like Bilbo Baggins and Gandalf.

3. **1942 - B-29 Superfortress Makes its First Flight:** The Boeing B-29 Superfortress, a significant bomber for the US during WWII, made its maiden flight.

4. **1976 - Seychelles Joins the United Nations:** The Seychelles, an archipelago in the Indian Ocean, became a member of the United Nations.

5. **1981 - Sandra Day O'Connor is confirmed:** Sandra Day O'Connor is unanimously approved by the U.S. Senate as the first female Supreme Court justice.

These events span a range of contexts including war, liter