# AutoJudge

This notebook demonstrates `AutoJudge` end to end: it loads a labeled dataset, generates a grading rubric from it, and judges a new response against that rubric.

In [None]:
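import sys
from pathlib import Path

import pandas as pd

# Make the parent directory importable so that `core` can be found.
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))

# Import AutoJudge after modifying sys.path
from core import AutoJudge

# Absolute path on the author's machine; point this at your local copy of the CSV.
dataset_path = "/Users/jamesliounis/Documents/judges/judges/autojudge/tests/data/synthetic_travel_qa.csv"

# Load the dataset and preview the first five rows.
df = pd.read_csv(dataset_path)
df.head()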

| Unnamed: 0 | input_text | completion | label | feedback |
|---|---|---|---|---|
| 0 | What's the best time to visit Paris? | The best time to visit Paris is during the spr... | 1 | Provides accurate and detailed advice. |
| 1 | Can you recommend a good hotel in Tokyo? | Certainly! Hotel Sunroute Plaza Shinjuku is hi... | 1 | Offers a specific and helpful recommendation. |
| 2 | What are the top attractions in New York City? | Some top attractions in NYC include the Statue... | 1 | Lists relevant and popular attractions. |
| 3 | How do I apply for a visa to Canada? | To apply for a Canadian visa, visit the offici... | 1 | Gives clear and actionable steps. |
| 4 | Is it safe to travel to Brazil? | Brazil is generally safe for travelers, but it... | 1 | Provides balanced and cautious advice. |
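
Before running the pipeline, it can help to confirm that the data has the columns seen in the preview and to look at how the labels are distributed. The sketch below is only an illustration: `summarize_records` and `REQUIRED_COLUMNS` are hypothetical names, and it assumes (without confirmation from the AutoJudge source) that the pipeline relies on `input_text`, `completion`, `label`, and `feedback`. It works on plain row dictionaries, which pandas produces with `df.to_dict("records")`.

```python
from collections import Counter

# Column names taken from the preview; whether AutoJudge requires
# all four of them is an assumption.
REQUIRED_COLUMNS = ("input_text", "completion", "label", "feedback")


def summarize_records(records):
    """Return (missing_columns, label_counts) for a list of row dicts."""
    # Collect every column name that appears in at least one row.
    present = set().union(*(row.keys() for row in records))
    missing = [col for col in REQUIRED_COLUMNS if col not in present]
    # Count how often each label value occurs (None if a row has no label).
    label_counts = Counter(row.get("label") for row in records)
    return missing, dict(label_counts)


# Two rows copied from the preview (completions are truncated there too).
# With the real data, pass df.to_dict("records") instead.
sample_records = [
    {
        "input_text": "What's the best time to visit Paris?",
        "completion": "The best time to visit Paris is during the spr...",
        "label": 1,
        "feedback": "Provides accurate and detailed advice.",
    },
    {
        "input_text": "Is it safe to travel to Brazil?",
        "completion": "Brazil is generally safe for travelers, but it...",
        "label": 1,
        "feedback": "Provides balanced and cautious advice.",
    },
]

missing, label_counts = summarize_records(sample_records)
print("Missing columns:", missing)
print("Label counts:", label_counts)
```

A dataset with very few rows of one label gives the rubric generator little feedback to work from, so the label counts are worth a glance before spending API calls.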


In [None]:
# Initialize AutoJudge
autojudge = AutoJudge()

# Run the full pipeline: load the dataset, structure the feedback, and generate the rubric
autojudge.generate_rubric(dataset_path)

print("\nPipeline executed successfully.")
print("Generated Rubric:")
print(autojudge.rubric)

2024-12-18 00:15:26,459 [INFO] AutoJudge: Loading dataset...
2024-12-18 00:15:26,461 [INFO] AutoJudge: Data loaded successfully with 20 records.
2024-12-18 00:15:26,461 [INFO] AutoJudge: Aggregating feedback from bad data points.
2024-12-18 00:15:26,461 [INFO] AutoJudge: Generating structured feedback using LLM.
2024-12-18 00:15:26,461 [INFO] AutoJudge: Loading prompt from /Users/jamesliounis/Documents/judges/judges/autojudge/prompts/STRUCTURE_FEEDBACK.txt
2024-12-18 00:15:35,073 [INFO] httpx: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-12-18 00:15:35,077 [INFO] AutoJudge: Generating grading notes using LLM.
2024-12-18 00:15:35,078 [INFO] AutoJudge: Loading prompt from /Users/jamesliounis/Documents/judges/judges/autojudge/prompts/GENERATE_RUBRIC.txt
2024-12-18 00:15:35,080 [INFO] AutoJudge: Loading prompt from /Users/jamesliounis/Documents/judges/judges/autojudge/prompts/RUBRIC_FORMAT.txt
2024-12-18 00:15:53,852 [INFO] httpx: HTTP Request: POST 


Pipeline executed successfully.
Generated Rubric:
#### Grading Note for Evaluating AI-Generated Responses

**Objective:** To determine if the AI-generated response is suitable (1) or unsuitable (0) based on its ability to accurately and effectively address the user's query.

**Criteria for Evaluation:**

1. **Accuracy of Information:**
   - **Factual Accuracy:** The response must contain only true and verifiable information. Any statement that can be fact-checked should align with current, recognized facts.
   - **Technological Realism:** The response should accurately reflect the current state of technology. Claims about non-existent technologies like teleportation or time travel should be marked as unsuitable.
   - **Scientific Accuracy:** Responses must adhere to established scientific principles. Any deviation from known scientific facts renders the response unsuitable.

2. **Representation of Reality:**
   - **Mythical and Fictional Elements:** The response should clearly disting
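
The log lines above mention "Aggregating feedback from bad data points" before the first LLM call. The notebook does not show how AutoJudge implements that step, so the following is only a rough sketch of what such an aggregation might look like. It assumes that "bad" means `label == 0` (the rubric describes 0 as unsuitable), and `aggregate_bad_feedback` is a hypothetical helper, not part of the library.

```python
def aggregate_bad_feedback(records, bad_label=0):
    """Join the feedback of rows whose label equals bad_label into a bullet list."""
    feedback = [
        str(row["feedback"]).strip()
        for row in records
        # Skip rows with another label or with empty/missing feedback.
        if row.get("label") == bad_label and row.get("feedback")
    ]
    return "\n".join(f"- {item}" for item in feedback)


# Hypothetical rows for illustration only. The first feedback string comes
# from the dataset preview; the two label-0 rows are invented.
example_records = [
    {"label": 1, "feedback": "Provides accurate and detailed advice."},
    {"label": 0, "feedback": "Recommends a destination that does not exist."},
    {"label": 0, "feedback": "Ignores the user's question."},
]

aggregated = aggregate_bad_feedback(example_records)
print(aggregated)
```

A single text block like this is one plausible input for a prompt that asks an LLM to structure the feedback into rubric criteria.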

In [5]:
# Test the judge method with sample input and output
sample_input = "Where should I go for my honeymoon?"
sample_output = "The moon, probably"

print("\nTesting judge method...")
judgment_result = autojudge.judge(
    input=sample_input,
    output=sample_output
)

# Print the judgment result
print("\nJudgment Result:")
print(f"Reasoning: {judgment_result.reasoning}")
print(f"Score: {'Correct' if judgment_result.score else 'Incorrect'}")


Testing judge method...


2024-12-18 00:16:41,122 [INFO] httpx: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



Judgment Result:
Reasoning: The response suggests going to 'the moon' for a honeymoon, which is not a feasible or realistic option at present. This fails to meet the criteria for factual accuracy and technological realism, as current technology does not allow for commercial trips to the moon. Additionally, it does not take into account practical considerations or preferences typically associated with honeymoon destinations, thereby lacking relevance and clarity in addressing the user's query. Therefore, the response is classified as unsuitable.
Score: Correct
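
To judge more than one response, the same `judge(input=..., output=...)` call can be wrapped in a loop. The sketch below is a minimal illustration that runs offline: `StubJudge` and `StubJudgment` are hypothetical stand-ins that only mirror the `reasoning` and `score` attributes used in the cell above, and the word-count rule inside the stub is arbitrary. In practice, the `autojudge` instance would be passed in place of `StubJudge()`, at the cost of one API call per pair.

```python
from dataclasses import dataclass


@dataclass
class StubJudgment:
    """Hypothetical stand-in exposing the two attributes used in this notebook."""
    reasoning: str
    score: bool


class StubJudge:
    """Hypothetical offline judge; swap in the AutoJudge instance for real use."""

    def judge(self, input, output):
        # `input` shadows the builtin, but it mirrors the keyword used above.
        ok = len(output.split()) >= 5  # arbitrary rule, for illustration only
        reason = "Long enough to be informative." if ok else "Too short to address the query."
        return StubJudgment(reasoning=reason, score=ok)


def judge_all(judge, pairs):
    """Call judge.judge on each (input, output) pair and collect the results."""
    results = []
    for question, answer in pairs:
        judgment = judge.judge(input=question, output=answer)
        results.append(
            {
                "input": question,
                "output": answer,
                "score": judgment.score,
                "reasoning": judgment.reasoning,
            }
        )
    return results


# The first pair is the sample from the cell above; the second answer is invented.
pairs = [
    ("Where should I go for my honeymoon?", "The moon, probably"),
    ("What's the best time to visit Paris?",
     "Spring and autumn usually have mild weather and fewer crowds."),
]

results = judge_all(StubJudge(), pairs)
for row in results:
    print(row["score"], "-", row["input"])
```

Collecting the results as a list of dictionaries keeps them easy to load into a DataFrame with `pd.DataFrame(results)` for comparison against the dataset's `label` column.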
