In [9]:
csqa_prompt = prompt = {
    "system": "You are a developmental expert evaluating how well a child's response demonstrates a specific developmental skill at a given stage, using a provided instruction and background text.\n\nYou will receive:\n- A short **text** (used as context for the instruction)\n- A **skill-based instruction** given to the child\n- The child’s **response**\n- The child’s **developmental stage** (0–9)\n- The child’s **age group** (e.g., '0–5', '5–11', '11–14')\n- The **target skill**, **subskill**, **goal**, and **indicator** that the instruction was designed to assess\n\nYour job is to:\n1. **Rate the child's response on a scale from 1 to 5**, using these criteria:\n   - **5 – Excellent:** Fully demonstrates the targeted skill/indicator with clarity and developmental appropriateness. Strong reasoning, appropriate expression, and alignment with instruction.\n   - **4 – Strong:** Mostly appropriate and well-formed. Some minor gaps in completeness, precision, or phrasing, but shows the intended skill.\n   - **3 – Adequate:** Response attempts the skill but may be vague, simplistic, or only partially aligned with the goal/indicator.\n   - **2 – Limited:** Weak or unclear demonstration of the skill. Response is partially off-track, underdeveloped, or barely relevant.\n   - **1 – Inadequate:** Fails to demonstrate the intended skill. Response is irrelevant, confusing, or clearly inappropriate for the stage.\n\n2. **Use stage-specific developmental expectations**:\n   - **Stage 0 (Age 5):** Short, concrete, present-focused responses with simple vocabulary\n   - **Stages 1–3 (Ages 6–8):** Clear expression of ideas, simple cause-effect, emotional awareness, basic reasoning\n   - **Stages 4–6 (Ages 9–11):** Logical structure, hypothetical thinking, connections to personal experience, comparisons\n   - **Stages 7–9 (Ages 12–14):** Advanced abstraction, multiple perspectives, justification, nuanced expression\n\n3. **Evaluate:**\n   - Does the child’s response meaningfully follow the instruction?\n   - Does it demonstrate the **targeted skill and indicator**?\n   - Is the language, reasoning, and expression developmentally appropriate for the stage?\n   - Is the response authentic and logically consistent with the instruction and the context text?\n\n4. **Output Format:**\nReturn only the following dictionary:\n```json\n{{\n    \"rating\": <integer from 1 to 5>,\n    \"explanation\": \"<2–3 sentence rationale>\"\n}}\n```\nDo not add any other text or formatting. Only return the JSON object.",
    "user": "Evaluate the child's response to a skill-based instruction using the provided text and developmental context. Focus on how well the response demonstrates the intended skill.\n\nContext:\n{context}\n\nInstruction:\n{instruction}\n\nResponse:\n{response}\n\nStage: {stage}\nAge group: {age_group}\nSkill: {skill}\nSubskill: {subskill}\nGoal: {goal}\nIndicator: {indicator}\nIndex: {q_index}\n\nOutput format:\n```json\n{{\n    \"rating\": <integer from 1 to 5>,\n    \"explanation\": \"<2–3 sentence rationale>\"\n}}\n```"
}

print(csqa_prompt["system"])
print("--------------")
print(csqa_prompt["user"])

# Save the prompt as JSON
import json

with open("/datadrive/pavan/az_storage/CL_results/csqa_prompt.json", "w") as f:
    json.dump(csqa_prompt, f, indent=4)

You are a developmental expert evaluating how well a child's response demonstrates a specific developmental skill at a given stage, using a provided instruction and background text.

You will receive:
- A short **text** (used as context for the instruction)
- A **skill-based instruction** given to the child
- The child’s **response**
- The child’s **developmental stage** (0–9)
- The child’s **age group** (e.g., '0–5', '5–11', '11–14')
- The **target skill**, **subskill**, **goal**, and **indicator** that the instruction was designed to assess

Your job is to:
1. **Rate the child's response on a scale from 1 to 5**, using these criteria:
   - **5 – Excellent:** Fully demonstrates the targeted skill/indicator with clarity and developmental appropriateness. Strong reasoning, appropriate expression, and alignment with instruction.
   - **4 – Strong:** Mostly appropriate and well-formed. Some minor gaps in completeness, precision, or phrasing, but shows the intended skill.
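   - **3 – Adequate:** Response attempts the skill but may be vague, simplistic, or only partially aligned with the goal/indicator.
[...]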
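
The prompts contain non-ASCII characters (en dashes, curly apostrophes). With the default `ensure_ascii=True`, `json.dump` writes them as `\uXXXX` escapes; they load back unchanged, but the saved file is harder to read. A minimal sketch of a readable round trip, assuming a temporary file instead of the path used above and a shortened stand-in prompt (both are illustrative, not part of the notebook):

```python
import json
import os
import tempfile

# Shortened stand-in for the real prompt dict (hypothetical content).
demo_prompt = {
    "system": "Rate the child’s response on a scale from 1–5.",
    "user": "Response:\n{response}\n\nStage: {stage}",
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_prompt.json")

    # ensure_ascii=False keeps characters such as "–" and "’" readable on disk.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demo_prompt, f, indent=4, ensure_ascii=False)

    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

reloaded = json.loads(raw_text)
round_trip_ok = reloaded == demo_prompt   # the dict survives the save/load cycle
has_escapes = "\\u" in raw_text           # True would mean \uXXXX escapes were written
print(round_trip_ok, has_escapes)
```

Either setting reloads to the same dictionary; the choice only affects how the file looks when opened in an editor.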

In [10]:
cqa_prompt = {
    "system": "You are a developmental expert evaluating how well a child's answer to a reading comprehension question reflects appropriate understanding and reasoning for a specific developmental stage.\n\nYou will receive:\n- The original **context** paragraph\n- A **question** based on the context\n- The child's **answer** to the question\n- The child's **developmental stage** (0–9)\n- The child's **age group** (e.g., '0–5', '5–11', '11–14')\n\nYour job is to:\n1. **Rate the child’s answer on a scale from 1 to 5**, using the following criteria:\n   - **5 – Excellent:** Fully correct, precise, and well-formed for the stage. Shows strong comprehension and reasoning.\n   - **4 – Strong:** Mostly correct and appropriate; may have minor phrasing issues or slight gaps in reasoning.\n   - **3 – Adequate:** Understands the gist but may be vague, partially incorrect, or simplistic for the stage.\n   - **2 – Limited:** Misunderstands part of the question or context; reasoning is weak or off-track.\n   - **1 – Inadequate:** Confused, incorrect, or clearly not appropriate for the stage.\n\n2. **Consider developmental expectations** for language and reasoning:\n   - **Stage 0 (Age 5):** Very basic phrases, literal recall, present-focused answers\n   - **Stages 1–3 (Ages 6–8):** Simple reasoning, sequencing, basic cause-effect, clear answers\n   - **Stages 4–6 (Ages 9–11):** Logical inference, comparative language, clear justification\n   - **Stages 7–9 (Ages 12–14):** Abstract reasoning, complex ideas, nuanced explanations\n\n3. **Evaluate:**\n   - Does the child’s answer meaningfully address the question using the provided context?\n   - Is the reasoning and language appropriate for the stage?\n   - Does it reflect comprehension of the text and question?\n\n4. **Output Format:**\nOnly return the following dictionary:\n```json\n{{\n    \"rating\": <integer from 1 to 5>,\n    \"explanation\": \"<2–3 sentence rationale>\"\n}}\n```\nDo not add any other text or formatting. Only return the JSON object.",
    "user": "Evaluate the child’s answer to a reading comprehension question. Consider the context and the developmental stage.\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\nStage: {stage}\nAge group: {age_group}\nIndex: {q_index}\n\n**Output Format:**\n```json\n{{\n    \"rating\": <integer from 1 to 5>,\n    \"explanation\": \"<2–3 sentence rationale>\"\n}}\n```"
}
print(cqa_prompt["system"])
print("--------------")
print(cqa_prompt["user"])

import json

with open("/datadrive/pavan/az_storage/CL_results/cqa_prompt.json", "w") as f:
    json.dump(cqa_prompt, f, indent=4)

You are a developmental expert evaluating how well a child's answer to a reading comprehension question reflects appropriate understanding and reasoning for a specific developmental stage.

You will receive:
- The original **context** paragraph
- A **question** based on the context
- The child's **answer** to the question
- The child's **developmental stage** (0–9)
- The child's **age group** (e.g., '0–5', '5–11', '11–14')

Your job is to:
1. **Rate the child’s answer on a scale from 1 to 5**, using the following criteria:
   - **5 – Excellent:** Fully correct, precise, and well-formed for the stage. Shows strong comprehension and reasoning.
   - **4 – Strong:** Mostly correct and appropriate; may have minor phrasing issues or slight gaps in reasoning.
   - **3 – Adequate:** Understands the gist but may be vague, partially incorrect, or simplistic for the stage.
   - **2 – Limited:** Misunderstands part of the question or context; reasoning is weak or off-track.
   - **1 – Inadequate:** Confused, incorrect, or clearly not appropriate for the stage.
[...]

In [11]:
ir_prompt = {
    "system": "You are a developmental expert rating how well a child's response to a prompt demonstrates age-appropriate reasoning and language for a given developmental stage.\n\nYou will receive:\n- An **instruction** given to the child\n- The child's **response**\n- The child's **developmental stage** (0–9)\n- The child's **age group** (e.g., '0–5', '5–11', '11–14')\n\nYour job is to:\n1. **Rate the response on a scale from 1 to 5**, using the following criteria:\n   - **5 – Excellent:** The response fully addresses the instruction with clear, developmentally appropriate reasoning and language. It meets expectations for the stage with no major issues.\n   - **4 – Strong:** Mostly appropriate and coherent; minor gaps in clarity, depth, or completeness.\n   - **3 – Adequate:** A reasonable attempt that partially addresses the instruction; may be vague, brief, or contain small misunderstandings.\n   - **2 – Limited:** Weak or underdeveloped response; minimal reasoning or limited relevance to the instruction.\n   - **1 – Inadequate:** Response is off-topic, confusing, or clearly inappropriate for the stage.\n\n2. **Use stage-specific developmental expectations**:\n   - **Stage 0 (Age 5):** Very simple sentences, concrete ideas, focused on here and now\n   - **Stages 1–3 (Ages 6–8):** Simple reasoning, some past/future thinking, familiar examples\n   - **Stages 4–6 (Ages 9–11):** Logical structure, comparisons, abstract or hypothetical reasoning\n   - **Stages 7–9 (Ages 12–14):** Nuanced reasoning, multi-step thinking, advanced vocabulary\n\n3. **Evaluate:**\n   - Does the child’s response meaningfully address the instruction?\n   - Is the language and reasoning developmentally appropriate for the stage?\n   - Is the response authentic and logically consistent?\n\n4. **Output Format:**\nOnly return the following dictionary:\n```json\n{{\n    \"rating\": <integer from 1 to 5>,\n    \"explanation\": \"<2–3 sentence rationale>\"\n}}\n```\nDo not add any other text or formatting. Only return the JSON object.",
    "user": "Evaluate the child's response to the instruction below based on the developmental stage and age group. Return a numerical rating (1–5) and a short explanation.\n\nInstruction: {instruction}\nResponse: {response}\nStage: {stage}\nAge group: {age_group}\nIndex: {q_index}\n\n**Output Format:**\nOnly return the following dictionary:\n```json\n{{\n    \"rating\": <integer from 1 to 5>,\n    \"explanation\": \"<2–3 sentence rationale>\"\n}}\n```\n"
}
print(ir_prompt["system"])
print("--------------")
print(ir_prompt["user"])

import json

with open("/datadrive/pavan/az_storage/CL_results/ir_prompt.json", "w") as f:
    json.dump(ir_prompt, f, indent=4)

You are a developmental expert rating how well a child's response to a prompt demonstrates age-appropriate reasoning and language for a given developmental stage.

You will receive:
- An **instruction** given to the child
- The child's **response**
- The child's **developmental stage** (0–9)
- The child's **age group** (e.g., '0–5', '5–11', '11–14')

Your job is to:
1. **Rate the response on a scale from 1 to 5**, using the following criteria:
   - **5 – Excellent:** The response fully addresses the instruction with clear, developmentally appropriate reasoning and language. It meets expectations for the stage with no major issues.
   - **4 – Strong:** Mostly appropriate and coherent; minor gaps in clarity, depth, or completeness.
   - **3 – Adequate:** A reasonable attempt that partially addresses the instruction; may be vague, brief, or contain small misunderstandings.
   - **2 – Limited:** Weak or underdeveloped response; minimal reasoning or limited relevance to the instruction.
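
The doubled braces (`{{` and `}}`) in these templates suggest they are meant to be filled with `str.format`, where a doubled brace renders as a literal one and single-brace names are substituted. A minimal sketch under that assumption, using a shortened stand-in for the IR user template and made-up seed values:

```python
# Shortened stand-in for ir_prompt["user"]; it uses the same placeholder names.
user_template = (
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Stage: {stage}\n"
    "Age group: {age_group}\n"
    "Index: {q_index}\n\n"
    "**Output Format:**\n"
    "{{\n"
    '    "rating": <integer from 1 to 5>\n'
    "}}\n"
)

# Hypothetical seed; real seeds come from the evaluation datasets.
seed = {
    "instruction": "Tell me about your favorite animal.",
    "response": "I like dogs because they play with me.",
    "stage": 0,
    "age_group": "0-5",
    "q_index": "demo_0001",
}

# Named placeholders are filled in; "{{" and "}}" collapse to "{" and "}".
rendered = user_template.format(**seed)
print(rendered)
```

A seed that lacks one of the placeholder names makes `format` raise `KeyError`, so the seed keys and the template fields have to agree.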
   

In [None]:
from datasets import Dataset, load_dataset
import pickle

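def prepare_seed_csqa(stage, split_type, model_name, root_dir="."):
    """Build CSQA judge seeds for one stage/split and pickle them.

    `model_name` is the dataset column that holds that model's responses.
    All three prepare_seed_* helpers write seed_{model_name}.pkl, so use a
    separate root_dir per task to avoid overwriting.
    """
    ds = load_dataset(f"Pavankalyan/stage{stage}_csqa_eval", split=split_type)
    # Keys match the placeholders in csqa_prompt["user"].
    seeds = [
        {
            'context': row['context'],
            'instruction': row['question'],
            'response': row[model_name],
            'stage': stage,
            'age_group': row['age_group'],
            'skill': row['skill'],
            'subskill': row['subskill'],
            'goal': row['goal'],
            'indicator': row['indicator'],
            'q_index': row['q_index']
        }
        for row in ds
    ]

    # root_dir defaults to the current directory; an empty string would
    # resolve the path to the filesystem root.
    with open(f"{root_dir}/seed_{model_name}.pkl", "wb") as f:
        pickle.dump(seeds, f)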

def prepare_seed_cqa(stage, split_type, model_name, root_dir="."):
    """Build CQA judge seeds for one stage/split and pickle them."""
    ds = load_dataset(f"Pavankalyan/stage{stage}_cqa_eval", split=split_type)
    # Keys match the placeholders in cqa_prompt["user"].
    seeds = [
        {
            'context': row['context'],
            'question': row['question'],
            'answer': row[model_name],
            'stage': stage,
            'age_group': row['age_group'],
            'q_index': row['q_index']
        }
        for row in ds
    ]

    # root_dir defaults to the current directory; an empty string would
    # resolve the path to the filesystem root.
    with open(f"{root_dir}/seed_{model_name}.pkl", "wb") as f:
        pickle.dump(seeds, f)
        
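def prepare_seed_ir(stage, split_type, model_name, root_dir="."):
    """Build IR judge seeds for one stage/split and pickle them."""
    ds = load_dataset(f"Pavankalyan/stage{stage}_ir_eval", split=split_type)
    # ir_prompt["user"] expects {instruction} and {response}, so the dataset's
    # 'question' column and the model's column are stored under those names.
    seeds = [
        {
            'instruction': row['question'],
            'response': row[model_name],
            'stage': stage,
            'age_group': row['age_group'],
            'q_index': row['q_index']
        }
        for row in ds
    ]

    # root_dir defaults to the current directory; an empty string would
    # resolve the path to the filesystem root.
    with open(f"{root_dir}/seed_{model_name}.pkl", "wb") as f:
        pickle.dump(seeds, f)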
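
Each seed dict is presumably substituted into the matching `user` template, so its keys need to cover every placeholder in that template. A small check along these lines can catch a mismatch before any judge calls are made. This is a sketch, not part of the original pipeline: `template_fields` and `missing_keys` are hypothetical helper names, and the template is a shortened stand-in for `ir_prompt["user"]`.

```python
from string import Formatter


def template_fields(template):
    """Return the named placeholders of a str.format template.

    Escaped braces ("{{" / "}}") are reported by Formatter.parse with a
    field name of None, so they are skipped here.
    """
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_keys(template, seed):
    """List placeholders in `template` that `seed` does not provide."""
    return sorted(template_fields(template) - set(seed))


# Shortened stand-in for ir_prompt["user"].
ir_user_template = (
    "Instruction: {instruction}\nResponse: {response}\n"
    "Stage: {stage}\nAge group: {age_group}\nIndex: {q_index}\n"
    '{{"rating": <integer from 1 to 5>}}'
)

matching_seed = {"instruction": "...", "response": "...", "stage": 0,
                 "age_group": "0-5", "q_index": "demo_0001"}
mismatched_seed = {"question": "...", "answer": "...", "stage": 0,
                   "age_group": "0-5", "q_index": "demo_0001"}

print(missing_keys(ir_user_template, matching_seed))    # []
print(missing_keys(ir_user_template, mismatched_seed))  # ['instruction', 'response']
```

Running the check once per task on the first seed of each pickle is usually enough, since all seeds in a file share the same keys.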


In [None]:
import os
import re
import json


res_path = "outputs_stage/instruct"
# Load every parquet shard of judge outputs into a single DatasetDict.
hf_df = load_dataset(
    "parquet",
    data_files=os.path.join(res_path, "*.parquet"),
    streaming=False,
)

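def extract_details(row):
    # Recover q_index from the "Index: ..." line of the rendered user prompt.
    # A regex avoids slicing with -1 offsets when a marker is absent, e.g. the
    # CSQA template uses "Output format:" rather than "**Output Format:**".
    match = re.search(r'^Index:[ \t]*(.*)$', row['user'], flags=re.MULTILINE)
    return {"q_index": match.group(1).strip() if match else None}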

def parse_json_string(text):
    # Clean outside the try block so `cleaned` is always defined for the fallback.
    cleaned = (text or "").strip()
    # Strip optional Markdown code fences around the judge output.
    cleaned = re.sub(r'^```json\s*', '', cleaned, flags=re.MULTILINE)
    cleaned = re.sub(r'^```', '', cleaned, flags=re.MULTILINE)
    cleaned = re.sub(r'```$', '', cleaned, flags=re.MULTILINE)

    try:
        parsed = json.loads(cleaned)
        return {
            "rating": parsed.get("rating"),
            "explanation": parsed.get("explanation")
        }
    except (json.JSONDecodeError, AttributeError):
        # Malformed JSON, or valid JSON that is not an object:
        # fall back to pulling the integer rating out with a regex.
        match = re.search(r'"rating"\s*:\s*(\d+)', cleaned)
        return {
            "rating": int(match.group(1)) if match else None,
            "explanation": None
        }

def flatten_parsed_fields(example):
    parsed = parse_json_string(example['answer'])
    return {
        "rating": parsed.get("rating")
    }

hf_df = hf_df['train'].map(flatten_parsed_fields)
hf_df = hf_df.remove_columns(['batch_uuid', 'embeddings', 'generated_tokens', 'messages', 'metrics', 'num_generated_tokens', 'num_input_tokens', 'params', 'prompt', 'prompt_token_ids', 'request_id', 'system', 'time_taken_llm'])
hf_df = hf_df.map(extract_details)
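
Once each row carries a parsed `rating`, a short summary helps to see both the score distribution and how many judge outputs could not be parsed (those come back as `None`). A minimal sketch, assuming the ratings are available as a plain list such as `hf_df["rating"]`; `summarize_ratings` is a hypothetical helper and the demo values are made up:

```python
from collections import Counter


def summarize_ratings(ratings):
    """Summarize judge ratings, treating None as an unparsed output."""
    valid = [r for r in ratings if r is not None]
    return {
        "n": len(ratings),
        "n_unparsed": len(ratings) - len(valid),
        "mean": sum(valid) / len(valid) if valid else None,
        "histogram": dict(sorted(Counter(valid).items())),
    }


# Made-up ratings; in the notebook this would be list(hf_df["rating"]).
demo_ratings = [5, 4, None, 3, 4]
summary = summarize_ratings(demo_ratings)
print(summary)
```

Reporting `n_unparsed` next to the mean makes it visible when a low parse rate, rather than the model under evaluation, is driving the numbers.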