<a href="https://colab.research.google.com/github/shuyu-M/Chain_of_Thought/blob/main/Generate_rubric.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install "transformers>=4.44.0" "accelerate>=0.33.0"

import torch, random, re, os, pandas as pd
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "Qwen/Qwen3-1.7B"
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL, trust_remote_code=True, device_map="auto", torch_dtype=torch.float16
)

def chat(messages, max_new_tokens=200, temperature=0.2, top_p=0.9):
    """Generic chat call; `messages` is a list of {"role": "...", "content": "..."} dicts."""
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(mdl.device)
    with torch.inference_mode():
        out = mdl.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.eos_token_id,
        )
    # Decode only the newly generated tokens. Searching for the prompt string in
    # the decoded text is unreliable: skip_special_tokens strips the chat-template
    # markers, so the prompt never matches and the whole conversation is returned.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True).strip()

print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")



`torch_dtype` is deprecated! Use `dtype` instead!


GPU: Tesla T4
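
Qwen3 chat models can emit a reasoning block wrapped in `<think>...</think>` before the final answer, and with a small `max_new_tokens` that block may use up much of the budget. The sketch below shows one way to strip such a block from a reply before parsing it. It assumes the decoded reply contains these tags as plain text; `strip_think` is a hypothetical helper and is not part of this notebook.

```python
import re

def strip_think(reply: str) -> str:
    """Remove <think>...</think> blocks (hypothetical helper, not in the notebook)."""
    # Drop complete blocks first, then any unterminated block left by truncation.
    reply = re.sub(r"<think>.*?</think>", "", reply, flags=re.S)
    reply = re.sub(r"<think>.*\Z", "", reply, flags=re.S)
    return reply.strip()

sample = (
    "<think>\nCompare the three examples.\n</think>\n\n"
    "Score 2: Full.\nScore 1: Partial.\nScore 0: Wrong."
)
print(strip_think(sample))
```

The Qwen3 model card also documents an `enable_thinking` flag for `apply_chat_template`; whether turning it off suits this task is a judgment call that this notebook does not test.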


In [None]:
# Optional: remove the default sample_data directory
!rm -rf sample_data

from google.colab import files
uploaded = files.upload()
import pandas as pd
train_raw  = pd.read_csv("train.csv")
val_raw    = pd.read_csv("val.csv")
unseen_raw = pd.read_csv("unseen_answers.csv")  # change this if your file has a different name
print(train_raw.shape, val_raw.shape, unseen_raw.shape)
train_raw.head(2)


Saving unseen_answers.csv to unseen_answers.csv
Saving train.csv to train.csv
Saving val.csv to val.csv
Saving unseen_question.csv to unseen_question.csv
(3662, 6) (407, 6) (980, 6)


Unnamed: 0,Question_id,Question,Student Answer,Correct Answer,output_label,feedback
0,,Q1. State TRUE or FALSE and justify. No correc...,FALSE,"False, parent and child processes are two inde...",0,Your answer is incorrect. Parent and child pro...
1,324731.0,A rating curve is only valid when?,Rating curve is valid only when there is no re...,A rating curve is only valid when there is no ...,1,Your answer is partially correct. While the ...


In [None]:
train = pd.read_csv("/content/train.csv")  # the file uploaded in the previous cell
train = train.rename(columns={
    "Question": "question",
    "Student Answer": "answer",
    "output_label": "score",
})
# Cleaning / dtype coercion
train["question"] = train["question"].astype(str).str.strip()
train["answer"]   = train["answer"].astype(str).str.strip()
train["score"]    = pd.to_numeric(train["score"], errors="coerce").astype("Int64")
train = train[train["score"].isin([0,1,2])].reset_index(drop=True)

# Normalize question text (collapse repeated whitespace and strip; case is left unchanged)
def norm_q(s: pd.Series) -> pd.Series:
    return (s.astype(str)
              .str.replace(r"\s+", " ", regex=True)
              .str.strip())

train["q_norm"] = norm_q(train["question"])

# Keep only questions that have samples for all three labels (0/1/2)
avail = (train.groupby("q_norm")["score"]
              .agg(lambda x: set(x.tolist()))
              .reset_index(name="labels"))
full_qs = avail[avail["labels"].apply(lambda s: {0,1,2}.issubset(s))]["q_norm"].tolist()
pool = train[train["q_norm"].isin(full_qs)].copy()

print("Unique questions:", pool["q_norm"].nunique(), " Questions with all three labels:", len(full_qs))


Unique questions: 88  Questions with all three labels: 88
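
The filter above keeps a question only when its observed labels cover 0, 1, and 2. A minimal plain-Python sketch of the same idea follows, using invented toy rows rather than the real dataset; the variable names are illustrative only.

```python
from collections import defaultdict

# Toy rows invented for illustration: (question, score)
rows = [
    ("What is  a process?", 0),
    ("What is a process?", 1),
    ("What is a process?", 2),
    ("Define latency.", 0),
    ("Define latency.", 2),
]

labels_by_question = defaultdict(set)
for question, score in rows:
    # Same effect as norm_q: collapse whitespace runs and strip the ends.
    key = " ".join(question.split())
    labels_by_question[key].add(score)

# Keep only questions whose label set covers {0, 1, 2}.
full_questions = [q for q, labels in labels_by_question.items() if {0, 1, 2} <= labels]
print(full_questions)  # ['What is a process?']
```

"Define latency." is dropped because no answer with score 1 exists for it, which is the case the fallback rubric in the later cell is meant to handle.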


In [None]:
train = train_raw.rename(columns={
    "Question": "question",
    "Student Answer": "answer",
    "output_label": "score"
})


In [None]:
# Baseline with SIMPLE_SYSTEM/SIMPLE_PROMPT (with sampling + robust parse + fallback)
import re, math, random, pandas as pd
from tqdm import tqdm

# Columns / config
QCOL, ACOL, SCOL, CCOL = "Question", "Student Answer", "output_label", "Correct Answer"
PREVIEW_N = 3                     # None -> all questions; integer -> sample size for a preview
OUTFILE   = "/content/rubrics_simple_fixed.csv"

# System / prompt templates (plain ASCII quotes)
SIMPLE_SYSTEM = (
    "You are an educational measurement specialist. "
    "Return a 0/1/2 scoring rubric tailored to the given Question and 3 Example Responses. "
    "Output EXACTLY three lines and nothing else in this order:\n"
    "Score 2: ...\nScore 1: ...\nScore 0: ..."
)

SIMPLE_PROMPT = (
    "Question: \"{QUESTION}\"\n\n"
    "Example (score 0): \"{EX0}\"\n"
    "Example (score 1): \"{EX1}\"\n"
    "Example (score 2): \"{EX2}\"\n\n"
    "Write ONLY these three lines, one concise sentence each:\n"
    "Score 2: <one sentence>\n"
    "Score 1: <one sentence>\n"
    "Score 0: <one sentence>"
)

# Sample one example answer per score level (0/1/2) for a single question
def sample_examples_for_question(g_one_q, acol=ACOL, scol=SCOL, max_len=300):
    s0 = g_one_q[g_one_q[scol]==0][acol].dropna().astype(str).tolist()
    s1 = g_one_q[g_one_q[scol]==1][acol].dropna().astype(str).tolist()
    s2 = g_one_q[g_one_q[scol]==2][acol].dropna().astype(str).tolist()
    if not (s0 and s1 and s2):
        return None
    ex0 = random.choice(s0)[:max_len]
    ex1 = random.choice(s1)[:max_len]
    ex2 = random.choice(s2)[:max_len]
    return ex0, ex1, ex2

# Robust three-line parser (compresses verbose model output into three lines)
def _first_sentence(s: str, max_chars: int = 180) -> str:
    s = re.sub(r"\s+", " ", s).strip()
    parts = re.split(r"(?<=[\.\!\?;])\s+", s, maxsplit=1)
    s = parts[0] if parts else s
    if len(s) > max_chars:
        s = s[:max_chars].rsplit(" ", 1)[0].rstrip(",;: ")
    return s

def parse_rubric(text: str):
    if not isinstance(text, str) or not text.strip():
        return None

    # Keep only the text from the first "Score 2:" onward
    m_start = re.search(r"Score\s*2\s*:", text, flags=re.I)
    if m_start:
        text = text[m_start.start():]

    # Match the three "Score N:" blocks
    blocks = re.findall(r"(Score\s*([210])\s*:\s*)(.*?)(?=Score\s*[210]\s*:|$)", text, flags=re.S|re.I)
    if len(blocks) < 3:
        return None

    out = {"2": "", "1": "", "0": ""}
    for _, lbl, body in blocks:
        body = re.sub(r"(?i)\bassistant\b|\bthink\b|\bone sentence\b|<.*?>", " ", body)
        body = re.sub(r"\s+", " ", body).strip()
        # Keep only the first sentence to avoid long-winded output
        parts = re.split(r"(?<=[\.\!\?;])\s+", body, maxsplit=1)
        body = parts[0] if parts else body
        if not body:
            return None
        out[lbl] = body

    if not (out["2"] and out["1"] and out["0"]):
        return None

    return {
        "score2": out["2"],
        "score1": out["1"],
        "score0": out["0"]
    }

def fix_difference_rubric(row):
    """
    For questions that contain the word 'difference', patch the rubric so that
    Score 2 explicitly requires explaining the difference.
    """
    q = row["question"].lower()
    if "difference" in q:  # the question explicitly asks for a comparison
        # Patch score2
        if "difference" not in row["score2"].lower():
            row["score2"] += " Must clearly explain the difference between the two concepts."
        # Patch score1
        if "difference" not in row["score1"].lower():
            row["score1"] += " Mentions both but does not clearly explain the difference."
        # Patch score0 (an answer that never mentions the difference usually scores 0 anyway)
        if "difference" not in row["score0"].lower():
            row["score0"] += " Does not mention or explain the difference."
    return row



# Rule-based fallback: extract key points from the Correct Answer and compose a three-line rubric
STOP = set("the a an to of in on for with by as is are was were be been being this that these those which and or from".split())
def _clean(s: str) -> str: return re.sub(r"\s+", " ", str(s)).strip()
def split_candidates(text: str):
    if not isinstance(text, str): return []
    t = _clean(text)
    parts = re.split(r"(?:;|,|·|•|–|-|\(|\)|\n| and | or | respectively | namely | such as )", t, flags=re.I)
    more = []
    for p in parts:
        more += re.split(r"(?:\d+\)|\d+\.\s+|:)", p)
    cand = [x.strip(" .,:;()-").lower() for x in (parts + more)]
    clean, seen = [], set()
    for x in cand:
        if len(x) < 3: continue
        words = [w for w in re.findall(r"[a-zA-Z0-9%\.]+", x) if w.lower() not in STOP]
        if 2 <= len(words) <= 10:
            s = " ".join(words)
            if s and s not in seen:
                seen.add(s); clean.append(s)
    return clean

def extract_keypoints(question: str, correct: str, k: int = 3):
    pts = split_candidates(correct) or split_candidates(question)
    if len(pts) < k:
        toks = re.findall(r"[A-Za-z]{4,}", f"{question} {correct}".lower())
        for w in toks:
            if w not in pts: pts.append(w)
    pts = sorted(pts, key=lambda s: (-len(s.split()), -len(s)))[:max(k,1)]
    pts = [re.sub(r"^(explain|describe|state)\s+", "", p) for p in pts]
    return pts

def to_en_list(pts):
    if not pts: return ""
    if len(pts) == 1: return pts[0]
    if len(pts) == 2: return f"{pts[0]} and {pts[1]}"
    return ", ".join(pts[:-1]) + f", and {pts[-1]}"

# Detect whether this is a TRUE/FALSE-style question
def is_tf_question(q: str) -> bool:
    q = (q or "").lower()
    return bool(re.search(r"\b(true|false|t/f)\b", q)) or "state true or false" in q

# Try to extract the correct True/False value from the correct answer
def parse_truth_from_correct(correct: str):
    if not isinstance(correct, str):
        return None
    txt = correct.lower()
    if "true" in txt and "false" not in txt:
        return "True"
    if "false" in txt and "true" not in txt:
        return "False"
    return None  # could not determine the truth value

# Simple key-point extraction (reuses the existing split_candidates / extract_keypoints)
def tf_extract_points(question: str, correct: str, k: int = 3):
    # Use extract_keypoints if it is in scope; otherwise fall back to a minimal word-based extraction:
    try:
        pts = extract_keypoints(question, correct, k=k)
    except NameError:
        words = re.findall(r"[A-Za-z]{4,}", f"{question} {correct}".lower())
        uniq = []
        for w in words:
            if w not in uniq:
                uniq.append(w)
        pts = uniq[:k] or ["key mechanism"]
    return pts

# TRUE/FALSE-specific rubric: explicit truth value plus the points the justification must cover
def build_rubric_tf(question: str, correct: str, k:int=3):
    truth = parse_truth_from_correct(correct)  # "True"/"False"/None
    pts = tf_extract_points(question, correct, k=k)
    # Join the key points into an English list
    if len(pts) == 1:
        lst = pts[0]
        need_min = 1
    elif len(pts) == 2:
        lst = f"{pts[0]} and {pts[1]}"
        need_min = 1
    else:
        lst = ", ".join(pts[:2]) + f", and {pts[2]}"
        need_min = 2

    truth_phrase = truth if truth else "the correct truth value (True/False)"
    score2 = (f"States {truth_phrase} and provides a correct justification referencing "
              f"{lst} (e.g., the causal link / mechanism).")
    score1 = (f"States {truth or 'the correct truth value'} OR gives a mostly correct justification, "
              f"but the explanation is vague, misses the key link, or omits the explicit truth value.")
    score0 = "States the wrong truth value or provides an incorrect/irrelevant justification."

    return score2, score1, score0


def fallback_rubric_from_correct(question: str, correct: str, k:int=3):
    # Handle TRUE/FALSE questions first
    if is_tf_question(question):
        return build_rubric_tf(question, correct, k=k)

    # Otherwise use the general "key points -> rubric" logic
    pts = extract_keypoints(question, correct, k=k)
    lst = ", ".join(pts[:-1]) + f", and {pts[-1]}" if len(pts) > 1 else pts[0]
    need_min = max(1, math.ceil(len(pts)/2))
    s2 = f"Clearly and specifically addresses the key requirement by covering {lst} with accurate details."
    s1 = (f"Mentions {pts[0]} but is vague or incomplete."
          if len(pts)==1 else
          f"Covers some key elements (at least {need_min} of: {lst}) but lacks clarity or misses essential detail.")
    s0 = f"Does not address the key requirement or fails to mention {lst}."
    return s2, s1, s0


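# --- Data preparation ---
# Use the raw frame: QCOL/ACOL/SCOL/CCOL refer to the original column names
# (`train_df` was never defined in this notebook).
df = train_raw.copy()
df["__q"] = df[QCOL].map(lambda x: str(x).strip().lower())
qn_list = list(df["__q"].unique())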

rows = []
for qn in tqdm(qn_list, desc="Generating rubrics (all)"):
    g = df[df["__q"] == qn]
    q_text = str(g[QCOL].iloc[0])

    exs = sample_examples_for_question(g)
    # If the question lacks samples for all three score levels, use the fallback directly
    if not exs:
        corr = str(g[CCOL].dropna().astype(str).iloc[0]) if not g[CCOL].dropna().empty else ""
        s2,s1,s0 = fallback_rubric_from_correct(q_text, corr, k=3)
        rows.append({"question": q_text, "score2": s2, "score1": s1, "score0": s0})
        continue

    ex0, ex1, ex2 = exs
    user_prompt = SIMPLE_PROMPT.format(QUESTION=q_text, EX0=ex0, EX1=ex1, EX2=ex2)

    # Call the model through the chat() wrapper; a tiny temperature approximates greedy decoding
    reply = chat(
        [{"role":"system","content": SIMPLE_SYSTEM},
         {"role":"user","content":   user_prompt}],
        max_new_tokens=240, temperature=1e-6, top_p=1.0
    )

    parsed = parse_rubric(reply)
    if parsed:
        rows.append({
            "question": q_text,
            "score2": parsed["score2"],
            "score1": parsed["score1"],
            "score0": parsed["score0"],
        })
    else:
        # Parsing failed -> fall back to a rubric built from the Correct Answer
        corr = str(g[CCOL].dropna().astype(str).iloc[0]) if not g[CCOL].dropna().empty else ""
        s2,s1,s0 = fallback_rubric_from_correct(q_text, corr, k=3)
        rows.append({"question": q_text, "score2": s2, "score1": s1, "score0": s0})

rubrics_df = pd.DataFrame(rows, columns=["question","score2","score1","score0"])
rubrics_df = rubrics_df.apply(fix_difference_rubric, axis=1)

display(rubrics_df.head(20))
rubrics_df.to_csv(OUTFILE, index=False)
print(f"✅ saved {OUTFILE}  (rows: {len(rubrics_df)})")



Generating rubrics (all): 100%|██████████| 106/106 [14:45<00:00,  8.35s/it]


Unnamed: 0,question,score2,score1,score0
0,Q1. State TRUE or FALSE and justify. No correc...,States False and provides a correct justificat...,States False OR gives a mostly correct justifi...,States the wrong truth value or provides an in...
1,A rating curve is only valid when?,Clearly and specifically addresses the key req...,Covers some key elements (at least 2 of: ratin...,Does not address the key requirement or fails ...
2,c. Assume the OS is using a lazy allocation po...,Clearly and specifically addresses the key req...,Covers some key elements (at least 2 of: execu...,Does not address the key requirement or fails ...
3,To segment the rose petals [4 marks]:,Clearly and specifically addresses the key req...,Covers some key elements (at least 2 of: narro...,Does not address the key requirement or fails ...
4,Write three parameters which affect ...,Clearly and specifically addresses the key req...,Covers some key elements (at least 2 of: abras...,Does not address the key requirement or fails ...
5,a. List and explain two privileged actions tha...,Clearly and specifically addresses the key req...,Covers some key elements (at least 2 of: write...,Does not address the key requirement or fails ...
6,"Consider 5 replicas N1, N2, N3, N4, and N5 run...",Clearly and specifically addresses the key req...,Covers some key elements (at least 2 of: n2 co...,Does not address the key requirement or fails ...
7,Intercepted precipitation by trees reaches ear...,Clearly and specifically addresses the key req...,Covers some key elements (at least 2 of: water...,Does not address the key requirement or fails ...
8,Two advantages of separating declaration ...,Clearly and specifically addresses the key req...,Covers some key elements (at least 2 of: overr...,Does not address the key requirement or fails ...
9,Q1. State TRUE or FALSE and justify. No correc...,States True and provides a correct justificati...,States True OR gives a mostly correct justific...,States the wrong truth value or provides an in...


✅ saved /content/rubrics_simple_fixed.csv  (rows: 106)
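
The parsing step in the cell above expects the model reply to contain `Score 2:`, `Score 1:`, and `Score 0:` segments and falls back to the rule-based rubric when any is missing. The sketch below is a simplified illustration of that contract on a hand-written reply; `parse_three_lines` is a hypothetical stand-in, not the notebook's `parse_rubric` (it keeps the first occurrence of each label and does not trim to one sentence).

```python
import re

def parse_three_lines(text: str):
    """Return a dict with score2/score1/score0, or None if any label is missing."""
    pattern = r"Score\s*([210])\s*:\s*(.*?)(?=Score\s*[210]\s*:|\Z)"
    found = {}
    for label, body in re.findall(pattern, text, flags=re.S | re.I):
        body = re.sub(r"\s+", " ", body).strip()
        # Keep the first non-empty body seen for each label.
        if body and label not in found:
            found[label] = body
    if set(found) != {"0", "1", "2"}:
        return None
    return {f"score{k}": found[k] for k in ("2", "1", "0")}

reply = (
    "Sure!\n"
    "Score 2: Explains both concepts.\n"
    "Score 1: Explains one concept.\n"
    "Score 0: Explains neither."
)
print(parse_three_lines(reply))
print(parse_three_lines("Score 2: only one line"))  # None
```

Returning `None` for incomplete replies is what lets the caller decide between the model rubric and the fallback in a single `if parsed:` check.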
