# 2- Scientific alignment

This section provides the script we use to check the scientific alignment of the created benchmarks, so that the check can be reproduced.

First, make sure that all question-answer pairs are stored in a `json` file with the following format:

```json
[
  {
    "content": "Question: Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.\nA) Safe practices, Fear, Jealousy, Trivial\nB) Unsafe practices, Distress, Joy, Trivial\nC) Safe practices, Wants, Jealousy, Trivial\nD) Safe practices, Distress, Fear, Trivial\nE) Unsafe practices, Wants, Jealousy, Serious\nF) Safe practices, Distress, Jealousy, Serious\nG) Safe practices, Wants, Fear, Serious\nH) Unsafe practices, Wants, Fear, Trivial\nI) Unsafe practices, Distress, Fear, Serious\nAnswer: I"
  },
  {
    "content": "Question: Managers are entrusted to run the company in the best interest of ________. Specifically, they have a duty to act for the benefit of the company, as well as a duty of ________ and of _______.\nA) Shareholders, Diligence, Self-interest\nB) Shareholders, Self-interest, Care and Skill\nC) Stakeholders, Care and skill, Self-interest\nD) Stakeholders, Diligence, Care and Skill\nE) Customers, Care and Skill, Diligence\nF) Shareholders, Care and Skill, Diligence\nG) Shareholders, Self-interest, Diligence\nH) Employees, Care and Skill, Diligence\nI) Stakeholders, Self-interest, Diligence\nJ) Stakeholder, Care and Skill, Diligence\nAnswer: F"
  }
  ...
]
```
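If your benchmark stores the question, options, and answer separately, the records can be built with a small helper. The following is a minimal sketch, not part of the original script: `format_qa` and the sample question are hypothetical, and the layout simply mirrors the `content` strings shown above.

```python
import json
import string


def format_qa(question, options, answer_index):
    """Build one record in the {"content": ...} shape shown above (hypothetical helper)."""
    letters = string.ascii_uppercase
    lines = [f"Question: {question}"]
    # Label the options A), B), C), ... in order
    for letter, option in zip(letters, options):
        lines.append(f"{letter}) {option}")
    lines.append(f"Answer: {letters[answer_index]}")
    return {"content": "\n".join(lines)}


# Made-up example question, used only to illustrate the format
records = [format_qa("What is the SI unit of force?", ["Joule", "Newton", "Watt"], 1)]
print(json.dumps(records, indent=2))
```

Writing `records` to disk with `json.dump` produces a file that can be passed to the script through `--input_json`.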

We use OpenAI's GPT-4o as the judge. Pass your OpenAI API key through `--openai_key` and run the script below:

In [None]:
import openai
import json
import random
from tqdm import tqdm
import os


process_function = {
    "name": "classify_prompt",
    "description": "Classify a prompt as STEM/Math (Accept) or General NLP (Reject).",
    "schema": {
        "type": "object",
        "properties": {
            "verdict": {"type": "string", "enum": ["Accept", "Reject"]}
        },
        "required": ["verdict"],
        "additionalProperties": False
    }
}


def read_jsonl(jsonl_file_path):
    """
    Read a .jsonl file into a Python list of dictionaries (or other JSON-serializable objects).
    """
    records = []
    with open(jsonl_file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                record = json.loads(line)
                records.append(record)
    return records

def read_json(json_file_path):
    """
    Read a JSON file into a Python object.
    """
    with open(json_file_path, 'r') as f:
        records = json.load(f)
    return records

def classify_prompts(json_path, num_samples=50, output_file="prompt_classifications.json"):
    """
    Sample question-answer pairs from a JSON file and classify them using GPT-4o.

    Args:
        json_path: Path to the JSON file containing question-answer pairs
        num_samples: Number of question-answer pairs to sample for classification
        output_file: Path to save the classification results
    """
    # Load all question-answer pairs from the JSON file
    qa_pairs = read_json(json_path)

    # Sample a subset of question-answer pairs
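    if len(qa_pairs) > num_samples:
        sampled_qa_pairs = random.sample(qa_pairs, num_samples)
    else:
        sampled_qa_pairs = qa_pairs
        # Warn only when fewer pairs exist than requested (an exact match is not a problem)
        if len(qa_pairs) < num_samples:
            print(f"Warning: Requested {num_samples} samples but only {len(qa_pairs)} question-answer pairs available.")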

    # Create the classification prompt
    classification_prompt = """You are tasked with classifying each of the following question-answer pairs into one of two groups:
    ---
    ### Accept

    Classify as **Accept** if the question requires **domain-specific knowledge**, **scientific understanding**, **academic expertise**, or **professional reasoning**. This includes but is not limited to:

    **Scientific & Technical Fields**
    - Biology
    - Chemistry
    - Physics
    - Mathematics
    - Computer Science
    - Engineering (all disciplines)
    - Medicine and Health Sciences
    - Neuroscience
    - Pharmacology
    - Veterinary Science
    - Environmental Science
    - Earth Science / Astronomy
    - Statistics & Data Science

    **Professional & Applied Domains**
    - Law
    - Business (e.g., Finance, Accounting, Marketing)
    - Economics
    - Political Science
    - Education
    - Sociology
    - Anthropology
    - Linguistics
    - Communications / Media Studies
    - Library & Information Science
    - Social Work
    - Public Policy
    - Nursing / Allied Health
    - Architecture / Urban Planning
    - Agriculture / Food Science

    **Humanities with Reasoning Requirements**
    - History
    - Philosophy (e.g., logic, ethics, epistemology)
    - Theology / Religious Studies
    - Art History
    - Literary Theory

    Any question that falls under these or similar knowledge-intensive domains should be marked **Accept**, even if it's not explicitly listed.

    ---

    ### Reject

    Classify as **Reject** only if the question is based on:
    - General language understanding
    - Common sense or cultural knowledge
    - Vocabulary, grammar, spelling, or idioms
    - Word analogies or word associations
    - Trivia or factoids that don’t require reasoning
    - Sentiment, emotion, or tone recognition
    - Simple reading comprehension without technical content
    - NLP-specific tasks (e.g., joke detection, paraphrasing, summarization)

    These do not require deep subject-matter knowledge and should be marked **Reject**.

    ---

    ### Output Format
    ```json
    {
      "verdict": "Accept" // or "Reject"
    }

        """

    # Create the client once; it reads the API key from the OPENAI_API_KEY environment variable.
    # This assumes a recent 1.x release of the openai SDK, where openai.ChatCompletion is no longer available.
    client = openai.OpenAI()

    # Process each question-answer pair individually for simpler output structure
    results = []

    for i, qa_pair in enumerate(sampled_qa_pairs):
        print(f"Processing QA pair {i+1}/{len(sampled_qa_pairs)}...")

        # Each record stores the question, options, and answer in a single 'content' string
        question_answer = qa_pair.get('content', '')

        # Append the pair to the shared classification instructions
        current_prompt = f"{classification_prompt}\n{question_answer}"

        messages = [
            {"role": "system", "content": "You are an expert educational content classifier specializing in identifying STEM, Math, and general language tasks."},
            {"role": "user", "content": current_prompt}
        ]

        # Constrain the reply to the JSON schema defined in process_function
        response_format = {
            "type": "json_schema",
            "json_schema": process_function,
        }

        try:
            response = client.chat.completions.create(
                model="gpt-4o-2024-08-06",
                messages=messages,
                max_tokens=300,
                temperature=0.0,
                response_format=response_format
            )

            # Extract and parse the JSON response
            classification_result = json.loads(response.choices[0].message.content)

            # Add question-answer and ID to the result
            result_with_qa = {
                "qa_id": i+1,
                "content": question_answer,
                "verdict": classification_result["verdict"]
            }

            results.append(result_with_qa)

        except Exception as e:
            print(f"Error processing QA pair {i+1}: {e}")
            results.append({
                "qa_id": i+1,
                "content": question_answer,
                "verdict": "Error"
            })

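    # Fraction of sampled pairs judged "Accept"; pairs that errored count as not accepted
    acc = 0
    for item in results:
        if item['verdict'].lower() == 'accept':
            acc += 1
    if results:
        print(f"Acceptance rate (scientific alignment): {acc/len(results)}")
    else:
        print("No question-answer pairs were classified.")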

    output_data = {"classifications": results}
    with open(output_file, 'w') as f:
        json.dump(output_data, f, indent=2)

    print(f"Classification complete! Results saved to {output_file}")
    return output_data

if __name__ == '__main__':
    import argparse
    import os

    parser = argparse.ArgumentParser(description='Classify prompts as STEM, Math, or General NLP tasks')
    parser.add_argument('--input_json', type=str, default='./hellaswag_formatted.json', help='Path to the input JSON file containing prompts')
    parser.add_argument('--num_samples', type=int, default=50, help='Number of prompts to sample for classification')
    parser.add_argument('--output_file', type=str, default='prompt_classifications_hellaswag.json', help='Path to save the classification results')
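    parser.add_argument('--openai_key', type=str, default=None, help='OpenAI API key (defaults to the OPENAI_API_KEY environment variable)')

    args = parser.parse_args()

    # Only override the environment when a key is passed explicitly;
    # os.environ rejects None, so an omitted flag would otherwise crash here
    if args.openai_key:
        os.environ["OPENAI_API_KEY"] = args.openai_key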
    classify_prompts(
        json_path=args.input_json,
        num_samples=args.num_samples,
        output_file=args.output_file
    )

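The results are saved to the path given by `--output_file` as a JSON object with a single `classifications` list. Each entry holds the `qa_id`, the original `content`, and the `verdict` (`Accept`, `Reject`, or `Error` if the API call failed).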
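To inspect a saved result, the verdicts can be tallied separately from the script. The sketch below is an illustration rather than part of the original pipeline: `summarize` is a hypothetical helper, the toy `example` data is made up, and, unlike the script's printed figure, it leaves `Error` entries out of the denominator.

```python
from collections import Counter


def summarize(output_data):
    """Count verdicts and compute the acceptance rate over successfully judged pairs."""
    counts = Counter(item["verdict"] for item in output_data["classifications"])
    judged = counts["Accept"] + counts["Reject"]  # ignore "Error" entries
    rate = counts["Accept"] / judged if judged else 0.0
    return counts, rate


# Made-up results in the same shape the script writes to output_file
example = {
    "classifications": [
        {"qa_id": 1, "content": "...", "verdict": "Accept"},
        {"qa_id": 2, "content": "...", "verdict": "Reject"},
        {"qa_id": 3, "content": "...", "verdict": "Accept"},
        {"qa_id": 4, "content": "...", "verdict": "Error"},
    ]
}

counts, rate = summarize(example)
print(dict(counts), round(rate, 3))
```

For a real run, replace `example` with the object returned by `json.load` on the saved output file.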