In [1]:
!nvidia-smi  # Confirm GPU is active

Tue Apr 15 23:37:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   41C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# Evidence-Based Policy Scoring with Open LLMs (Zero-Shot)

This notebook scores policy documents zero-shot with an open-weight LLM and a rubric-based prompt. It uses Mistral-7B-Instruct via Hugging Face. The steps are:
* Load Mistral-7B-Instruct via Hugging Face
* Read .txt policy files from a folder
* Prompt the model with the evidence-based policy (EBP) rubric
* Parse the response and write a .csv with a score and a justification for each rubric dimension

> If you are running in Colab rather than locally, go to Runtime > Change runtime type > GPU.


In [19]:
!pip install -q transformers accelerate bitsandbytes sentencepiece
!pip install -q unstructured
!pip install bitsandbytes --prefer-binary --no-cache-dir --force-reinstall

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting torch<3,>=2.0 (from bitsandbytes)
  Downloading torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting numpy>=1.17 (from bitsandbytes)
  Downloading numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting filelock (from torch<3,>=2.0->bitsandbytes)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting typing-extensions>=4.10.0 (from torch<3,>=2.0->bitsandbytes)
  Downloading typing_extensions-4.13.2-py3-none-any.whl.metadata (3.0 kB)
Collecting networkx (from torch<3,>=2.0->bitsandbytes)
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch<3,>=2.0->bitsandbytes)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9

In [12]:
import json
import re

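# Model output often wraps the JSON in extra text, so return only the first valid JSON object.
def extract_json_block(text):
    # Python's built-in `re` does not support the recursive (?R) pattern,
    # so try to decode a JSON object starting at each "{" in turn.
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return text[match.start():end]
    return None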

### Loading the model
Note: Mistral-7B is a large model, so the initial download takes a long time to cache and each document is slow to process. A smaller model, such as Llama-3b, is easier to run locally on a CPU. Use a GPU (e.g., Colab) to run Mistral.
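
As a rough sense of scale, the sizes of the three checkpoint shards shown in the download progress lines of this notebook (4.94, 5.00, and 4.54 GB) can be summed to estimate the footprint. The sketch below assumes those shards hold 16-bit weights and that 4-bit quantization shrinks them by about a factor of four. Real usage is higher because of quantization metadata, activations, and the generation cache, so treat the second number as a lower bound, not a measurement.

```python
# Shard sizes in GB, taken from the download progress lines in this notebook.
shard_sizes_gb = [4.94, 5.00, 4.54]

total_gb = sum(shard_sizes_gb)

# Assumption: weights are stored in 16 bits, so 4-bit quantization is ~4x smaller.
approx_4bit_gb = total_gb * 4 / 16

print(f"Download size: {total_gb:.2f} GB")
print(f"Approx. 4-bit weight footprint: {approx_4bit_gb:.2f} GB")
```

Under these assumptions the quantized weights fit comfortably in the 15360 MiB that `nvidia-smi` reports for the T4, while the 16-bit weights would leave very little headroom.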

In [2]:
from huggingface_hub import login
login()


In [3]:
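from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Request 4-bit quantization through BitsAndBytesConfig;
# passing `load_in_4bit` directly to from_pretrained is deprecated.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,
)

llm = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=1024)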


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Device set to use cuda:0


In [4]:
rubric_prompt = """You are a policy analyst evaluating how evidence-based a policy document is.
Use the rubric below to assess the document on a 0–3 scale for each dimension.
Provide both a score and a short justification for each.

### Rubric
1. Use of Empirical Research
- 0: No references to empirical evidence or data
- 1: Vague or anecdotal references (e.g., “studies show”)
- 2: Clear empirical support, but limited sourcing
- 3: Multiple, clearly cited, high-quality sources (e.g., peer-reviewed, systematic reviews)

2. Formal Evidence-Gathering Process
- 0: No structured data gathering
- 1: Informal or anecdotal input
- 2: Basic assessments (e.g., internal reports, cost estimates)
- 3: Formal tools (e.g., RCTs, modeling, pilot programs)

3. Transparency and Accessibility
- 0: No documentation or rationale
- 1: Minimal or internal-only documentation
- 2: Public access with basic explanation
- 3: Fully open access, replicable, with detailed methods

4. Expert and Stakeholder Input
- 0: No input from external experts or stakeholders
- 1: Informal or internal-only consultation
- 2: Formal expert or stakeholder involvement
- 3: Broad, interdisciplinary consultation, including marginalized groups

5. Evaluation and Iteration
- 0: No evaluation mechanism
- 1: Evaluation mentioned but vague
- 2: Evaluation planned or metrics included
- 3: Evaluation built-in with feedback loops

### Document:
{document_text}

### Task:
Provide a JSON-formatted output like this:
{
  "Use of Empirical Research": {"score": 2, "justification": "..."},
  "Formal Evidence-Gathering Process": {"score": 1, "justification": "..."},
  "Transparency and Accessibility": {"score": 3, "justification": "..."},
  "Expert and Stakeholder Input": {"score": 1, "justification": "..."},
  "Evaluation and Iteration": {"score": 2, "justification": "..."}
}
"""


In [9]:
import os

folder_path = "/content/txt"
os.makedirs(folder_path, exist_ok=True)

print(f"Upload your .txt policy documents into: {folder_path}")


Upload your .txt policy documents into: /content/txt


In [10]:
os.getcwd()

'/content'

In [11]:
import os
import json
import pandas as pd
from pathlib import Path

txt_folder = Path("txt")  # or wherever your .txt files are
results = []

for filename in os.listdir(txt_folder):
    if not filename.endswith(".txt"):
        continue
    print(f"🔍 Processing {filename}")
    with open(txt_folder / filename, "r", encoding="utf-8") as f:
        doc_text = f.read()

    full_prompt = rubric_prompt.replace("{document_text}", doc_text[:4000])
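    # return_full_text=False keeps the echoed prompt (and its example JSON) out of the response
    response = llm(full_prompt, return_full_text=False)[0]["generated_text"]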
    print("🧠 Raw response:\n", response[:500], "\n---")

    try:
        json_start = response.find("{")
        json_end = response.rfind("}") + 1
        parsed = json.loads(response[json_start:json_end])
        print("✅ Parsed JSON")
    except Exception as e:
        print(f"❌ Failed to parse output for {filename}: {e}")
        continue

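    flat = {"filename": filename}
    for k, v in parsed.items():
        if not isinstance(v, dict):
            continue  # skip entries that are not {"score": ..., "justification": ...}
        flat[f"{k} Score"] = v.get("score")
        flat[f"{k} Justification"] = v.get("justification")
    results.append(flat)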

# Save if any results made it
if results:
    df = pd.DataFrame(results)
    df.to_csv("evidence_scores.csv", index=False)
    print("✅ Scoring complete! CSV saved to evidence_scores.csv")
    display(df.head())
else:
    print("⚠️ No results were successfully parsed.")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


🔍 Processing 2025-03-10_IN12522_d12b023987361bb7fa1d58c5c2013f97dbec1abc.txt


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


🧠 Raw response:
 You are a policy analyst evaluating how evidence-based a policy document is.
Use the rubric below to assess the document on a 0–3 scale for each dimension.
Provide both a score and a short justification for each.

### Rubric
1. Use of Empirical Research
- 0: No references to empirical evidence or data
- 1: Vague or anecdotal references (e.g., “studies show”)
- 2: Clear empirical support, but limited sourcing
- 3: Multiple, clearly cited, high-quality sources (e.g., peer-reviewed, systematic revi 
---
❌ Failed to parse output for 2025-03-10_IN12522_d12b023987361bb7fa1d58c5c2013f97dbec1abc.txt: Extra data: line 9 column 1 (char 364)
🔍 Processing 20190801_R45122_c10894b4e7f120db51682d523898415b08bc43d3.txt


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


🧠 Raw response:
 You are a policy analyst evaluating how evidence-based a policy document is.
Use the rubric below to assess the document on a 0–3 scale for each dimension.
Provide both a score and a short justification for each.

### Rubric
1. Use of Empirical Research
- 0: No references to empirical evidence or data
- 1: Vague or anecdotal references (e.g., “studies show”)
- 2: Clear empirical support, but limited sourcing
- 3: Multiple, clearly cited, high-quality sources (e.g., peer-reviewed, systematic revi 
---
❌ Failed to parse output for 20190801_R45122_c10894b4e7f120db51682d523898415b08bc43d3.txt: Extra data: line 9 column 1 (char 364)
🔍 Processing 20191205_R45122_304e614457ef38889e12ca6ce2813aac03bce778.txt


KeyboardInterrupt: 
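
The `Extra data` errors in the output above are consistent with the pipeline echoing the prompt back in `generated_text`: the slice from the first `{` to the last `}` then starts at the JSON template inside the prompt and runs through the model's own answer. The sketch below reproduces that failure with a made-up prompt and answer (both strings are illustrative, not real model output) and shows that parsing only the text after the prompt succeeds. `parse_first_to_last_brace` is a hypothetical helper that mirrors the slicing in the scoring loop.

```python
import json

# Illustrative stand-ins for the real prompt and model output.
prompt = (
    "Provide a JSON-formatted output like this:\n"
    '{"Evaluation and Iteration": {"score": 2, "justification": "..."}}\n'
)
answer = (
    '{"Evaluation and Iteration": '
    '{"score": 1, "justification": "Evaluation is mentioned only vaguely."}}'
)
response = prompt + answer  # what the pipeline returns when it echoes the prompt

def parse_first_to_last_brace(text):
    """Parse the slice from the first '{' to the last '}', as the scoring loop does."""
    return json.loads(text[text.find("{"): text.rfind("}") + 1])

try:
    parse_first_to_last_brace(response)
except json.JSONDecodeError as e:
    # The slice holds two JSON objects back to back, so decoding stops after the first.
    print("Full response fails:", e.msg)

# Dropping the echoed prompt leaves a single JSON object to parse.
completion = response[len(prompt):]
parsed = parse_first_to_last_brace(completion)
print(parsed["Evaluation and Iteration"]["score"])
```

With the `transformers` text-generation pipeline, passing `return_full_text=False` is one way to receive only the completion, which avoids slicing off the prompt by hand.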