# Language Modeling instead of classification head

The paper [MATH-SHEPHERD](https://huggingface.co/datasets/peiyi9979/Math-Shepherd) present a more elegant solution, which is use language modeling then directly estimate from the turn tokens. Although their code is not released but it is quite simple to implement compared to my previous method for PRM.

use native unsloth

In [1]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch
import wandb
import os
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "checkpoints/llama3-8b-gsm8k-1epoch", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

#Use LoRA to reduce memory usage:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = "unsloth", # @@@ IF YOU GET OUT OF MEMORY - set to True @@@
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

  from .autonotebook import tqdm as notebook_tqdm


==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.642 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.9. CUDA Toolkit = 11.8.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.62it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [2]:

prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant to solve math problems step by step <|eot_id|><|start_header_id|>user<|end_header_id|>

{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{}"""

def formatting_prompts_func(examples):
    texts = []
    
    for instruction, responses, next_response, rating in zip(examples['instruction'], examples['responses'], examples['next_response'], examples['rating']):
        # Combine all responses and the next response into a single string with newline separation
        combined_responses = " + \n".join(responses) + " + \n" + next_response
        if rating == -1:
            combined_responses = combined_responses + " - \n"
        else:
            combined_responses = combined_responses + " + \n"
        
        # Format the text with the prompt template
        text = prompt.format(instruction, combined_responses) 
        texts.append(text)

    
    return {"text": texts,}

    # # Tokenize all texts at once using the tokenizer
    # model_inputs = tokenizer(texts, padding="max_length", truncation=True, max_length=512)

    # # Add labels to the model inputs
    # model_inputs['labels'] = labels
    
    # return model_inputs


If we treat it as a language model task, then the incorrect steps that ends there should also be included in the dataset. It means we still need to append all the incorrect steps.

We can use a token to represent each step's correctness, such as + and -.
For the prediction of the probability of + and - during inference. We use the special token to represent the correctness of the step. We then use softmax to get the probability of correctness.

During training, because decoder is auto-regressive, we don't need to predict the correctness of the step. We can just use the token to represent the correctness of the step. We can use the special token to represent the correctness of the step. We then use softmax to get the probability of the special token, which is the correctness of the step.

In [3]:
from datasets import load_dataset

# Load and preprocess the dataset
dataset = load_dataset("Birchlabs/openai-prm800k-stepwise-critic", split='train')
dataset = dataset.filter(lambda x: x['rating'] is not None)  # Filter entries without ratings

#filter out the examples that has 'next_response' in the responses of the solution
dataset = dataset.filter(lambda x: not(x['rating'] == 1 and x['is_solution'] == False))

#convert ratings of 0 to 1 so we have only binary labels
dataset = dataset.map(lambda x: {'rating': 1 if x['rating'] == 0 else x['rating']})

dataset = dataset.map(formatting_prompts_func, batched=True)  # Apply the preprocessing function

len(dataset)

Filter: 100%|██████████| 1015027/1015027 [00:11<00:00, 89580.28 examples/s]
Map: 100%|██████████| 369283/369283 [00:16<00:00, 21922.39 examples/s]
Map: 100%|██████████| 369283/369283 [00:01<00:00, 189351.47 examples/s]


369283

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments
from transformers.utils import logging
logging.set_verbosity_info()

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, 
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 2e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 100,
        save_steps= 5000,
        save_total_limit=2,
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "checkpoints/llama3-8b-critic-lora",
        report_to= "wandb"
    ),
)

PyTorch: setting up devices
Map (num_proc=2): 100%|██████████| 369283/369283 [00:51<00:00, 7157.29 examples/s] 
Using auto half precision backend


In [9]:
# #@title Show current memory stats
# gpu_stats = torch.cuda.get_device_properties(0)
# start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
# max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
# print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
# print(f"{start_gpu_memory} GB of memory reserved.")

In [10]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 369,283 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 23,080
 "-____-"     Number of trainable parameters = 83,886,080
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss


KeyboardInterrupt: 

In [None]:
model.save_pretrained("checkpoints/llama3-8b-critic-lora") # Local saving


# Evaluation

In [1]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch
import wandb
import os
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "checkpoints/llama3-8b-critic-SFT-id_label", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
tokenizer.padding_side = "left" # Padding side for faster inference

  from .autonotebook import tqdm as notebook_tqdm


==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.642 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.9. CUDA Toolkit = 11.8.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.60it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [2]:

good_token = '+'
bad_token = '-'
step_tag = ' ки'

candidate_tokens = tokenizer.encode(f"{good_token} {bad_token}") # [648, 387]
step_tag_id = tokenizer.encode(f"{step_tag}")[-1] # 12902


In [3]:
question = """Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make \n"""
output1 = """The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000 ки\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000 ки\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000 ки\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000 ки\n#### 70000 ки""" # 18 is right
output2 = """The house was worth 80,000 + 50,000 = $<<80000+50000=130000>>130,000 after the repairs ки\nThe house is now worth 130,000 x 150% = $<<130000*150*.01=195000>>195,000 ки\nHe made a profit of 195,000 - 130,000 = $<<195000-130000=65000>>65,000 ки\n#### 65 ки""" #wrong

for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([tokenizer.encode(input_for_prm)]).to(model.device)

    with torch.no_grad():
        logits = model(input_id).logits[:,:,candidate_tokens]
        scores = logits.softmax(dim=-1)[:,:,0] 
        step_scores = scores[input_id == step_tag_id]
        score_product = step_scores.prod()
        print(score_product)
        print(step_scores)

tensor(2.5047e-13, device='cuda:0', dtype=torch.bfloat16)
tensor([9.8828e-01, 1.4877e-04, 8.0490e-04, 1.5030e-03, 1.4114e-03],
       device='cuda:0', dtype=torch.bfloat16)
tensor(5.0477e-11, device='cuda:0', dtype=torch.bfloat16)
tensor([9.8438e-01, 1.1587e-04, 3.7956e-04, 1.1673e-03], device='cuda:0',
       dtype=torch.bfloat16)


In [4]:
question = """Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? \n"""
output1 = """Step 1: Janet's ducks lay 16 eggs per day. ки\nStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки\nStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки\nStep 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers' market. The answer is: 18 ки""" # 18 is right
output2 = """Step 1: Janet's ducks lay 16 eggs per day. ки\nStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки\nStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки\nStep 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $17 every day at the farmers' market. The answer is: 17 ки""" # 17 is wrong

for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([tokenizer.encode(input_for_prm)]).to(model.device)

    with torch.no_grad():
        logits = model(input_id).logits[:,:,candidate_tokens]
        scores = logits.softmax(dim=-1)[:,:,0] 
        step_scores = scores[input_id == step_tag_id]
        score_product = step_scores.prod()
        print(score_product)
        print(step_scores)

tensor(3.7107e-09, device='cuda:0', dtype=torch.bfloat16)
tensor([0.2334, 0.0012, 0.0012, 0.0117], device='cuda:0', dtype=torch.bfloat16)
tensor(4.4529e-09, device='cuda:0', dtype=torch.bfloat16)
tensor([0.2334, 0.0012, 0.0012, 0.0140], device='cuda:0', dtype=torch.bfloat16)


In [3]:
question = """Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? \n"""
output1 = """Step 1: Janet's ducks lay 16 eggs per day. ки\nStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки\nStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки\nStep 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers' market. The answer is: 18 ки""" # 18 is right
output2 = """Step 1: Janet's ducks lay 16 eggs per day. ки\nStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки\nStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки\nStep 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $17 every day at the farmers' market. The answer is: 17 ки""" # 17 is wrong
input_for_prm = []
for output in [output1, output2]:
    input_for_prm.append(f"{question} {output}")

with torch.no_grad():
    inputs = critic_tokenizer(input_for_prm, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")
    outputs = critic(**inputs)


In [4]:
outputs.logits[:,:,candidate_tokens]

torch.Size([2, 256, 128256])

In [1]:
import os
import json
import re
import torch
import torch.nn.functional as F
from tqdm import tqdm
import sys
sys.path.append('Data/prm800k/prm800k')
from grading import grader

def load_json_answers(directory):
    json_files = [file for file in os.listdir(directory) if file.endswith(".json")]
    all_answers = []
    for file in json_files:
        file_path = os.path.join(directory, file)
        with open(file_path, "r") as f:
            answers = json.load(f)
            all_answers.append(answers)
    return all_answers

def extract_answers(all_answers):
    extracted_answers = [[] for _ in range(len(all_answers[0]))]
    for answers in all_answers:
        for i, answer in enumerate(answers):
            match = re.search(r"####\s*(.*)", answer)
            if match:
                extracted_answers[i].append(match.group(1).strip())
            else:
                extracted_answers[i].append("")
    return extracted_answers

def compute_probabilities(all_answers, critic_tokenizer, critic, batch_size=32):
    answers_prob = [[] for _ in range(len(all_answers[0]))]
    
    good_token = '+'
    bad_token = '-'
    step_tag = ' ки'

    candidate_tokens = critic_tokenizer.encode(f"{good_token} {bad_token}") # [648, 387]
    step_tag_id = critic_tokenizer.encode(f"{step_tag}")[-1] # 12902

    with torch.no_grad():
        for answers in tqdm(all_answers, desc="Processing answers"):
            results = []
            for answer in answers:
                result = answer.split('assistant\n\n')[0].split('You are a helpful assistant to solve math problems step by step user\n\n')[1] + '\n'
                responses = answer.split('assistant\n\n')[1].split('\n')
                for response in responses:
                    result += response + " ки\n"
                results.append(result)
                                
            correct_probabilities = []
            for i in range(0, len(results), batch_size):
                batch_results = results[i:i+batch_size]
                inputs = critic_tokenizer(batch_results, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")
                logits = critic(**inputs).logits[:,:,candidate_tokens]
                scores = logits.softmax(dim=-1)[:,:,0] 
                step_scores = scores[inputs == step_tag_id]
                correct_probabilities.extend(step_scores.tolist())
                # score_product = step_scores.prod()
                
                
                # probabilities = F.softmax(outputs.logits, dim=1)
                # correct_probability = probabilities[:, 1:]
                # correct_probability = torch.sum(correct_probability, dim=1)
                # correct_probabilities.extend(correct_probability.tolist())
            
            for i, answer in enumerate(answers):
                num_responses = len(answer.split('assistant\n\n')[1].split('\n'))
                answer_prob = torch.tensor(correct_probabilities[i:i+num_responses]).prod().item()
                answers_prob[i].append(answer_prob)
    
    return answers_prob

def select_highest_probability_answers(extracted_answers, answers_prob):
    highest_probability_answers = []
    for i, question_answers in enumerate(extracted_answers):
        question_probs = answers_prob[i]
        if question_probs:
            max_prob_index = question_probs.index(max(question_probs))
            highest_probability_answer = question_answers[max_prob_index]
        else:
            highest_probability_answer = ""
        highest_probability_answers.append(highest_probability_answer)
    return highest_probability_answers

def compare_with_ground_truth(majority_answers, ground_truth_answers):
    correct_count = 0
    for majority_answer, ground_truth_answer in zip(majority_answers, ground_truth_answers):
        if grader.grade_answer(majority_answer, ground_truth_answer):
            correct_count += 1
    accuracy = correct_count / len(ground_truth_answers)
    return accuracy


In [2]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch
import wandb
import os
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.


critic, critic_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "checkpoints/llama3-8b-critic-lora", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastLanguageModel.for_inference(critic) # Enable native 2x faster inference
critic_tokenizer.padding_side = "left" # Padding side for faster inference

  from .autonotebook import tqdm as notebook_tqdm


==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.642 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.9. CUDA Toolkit = 11.8.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.59it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [3]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch

# Usage example
json_directory = "generated_answers_llama3"


all_answers = load_json_answers(json_directory)
extracted_answers = extract_answers(all_answers)
answers_prob = compute_probabilities(all_answers, critic_tokenizer, critic)
highest_probability_answers = select_highest_probability_answers(extracted_answers, answers_prob)


Processing answers: 100%|██████████| 100/100 [1:08:56<00:00, 41.37s/it]


In [5]:
# alpaca_prompt = You MUST copy from above!
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant to solve math problems step by step <|eot_id|><|start_header_id|>user<|end_header_id|>

{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{}"""

def formatting_prompts_func(examples):
    texts = []
    final_answer = []
    for instruction, answer in zip(examples['question'], examples['answer']):
        # Combine all responses and the next response into a single string with newline separation
        extracted_answer = answer.split('### ')[1]
        final_answer.append(extracted_answer)
        # Format the text with the prompt template
        text = prompt.format(instruction, '')
        texts.append(text)
    
    return {'input_text': texts, 'final_answer': final_answer}

from datasets import load_dataset

# Load and preprocess the dataset
dataset = load_dataset("gsm8k", 'main', split='test')
dataset = dataset.map(formatting_prompts_func, batched=True)  # Apply the preprocessing function

Map: 100%|██████████| 1319/1319 [00:00<00:00, 100377.16 examples/s]


In [6]:
#change to the other probability
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")


Accuracy: 0.66
