# Language Modeling instead of classification head

The paper [MATH-SHEPHERD](https://huggingface.co/datasets/peiyi9979/Math-Shepherd) present a more elegant solution, which is use language modeling then directly estimate from the turn tokens. Although their code is not released but it is quite simple to implement compared to my previous method for PRM.

use native unsloth

In [1]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch
import wandb
import os
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = 'unsloth/llama-3-8b', # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

#Use LoRA to reduce memory usage:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = "unsloth", # @@@ IF YOU GET OUT OF MEMORY - set to True @@@
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

  from .autonotebook import tqdm as notebook_tqdm


==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.642 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.9. CUDA Toolkit = 11.8.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.46it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [2]:

# prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

# You are a helpful assistant to solve math problems step by step <|eot_id|><|start_header_id|>user<|end_header_id|>

# {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

# {}"""

# def formatting_prompts_func(examples):
#     texts = []
    
#     for instruction, responses, next_response, rating in zip(examples['instruction'], examples['responses'], examples['next_response'], examples['rating']):
#         # Combine all responses and the next response into a single string with newline separation
#         combined_responses = " + \n".join(responses) + " + \n" + next_response
#         if rating == -1:
#             combined_responses = combined_responses + " - \n"
#         else:
#             combined_responses = combined_responses + " + \n"
        
#         # Format the text with the prompt template
#         text = prompt.format(instruction, combined_responses) 
#         texts.append(text)

    
#     return {"text": texts,}

#     # # Tokenize all texts at once using the tokenizer
#     # model_inputs = tokenizer(texts, padding="max_length", truncation=True, max_length=512)

#     # # Add labels to the model inputs
#     # model_inputs['labels'] = labels
    
#     # return model_inputs


If we treat it as a language model task, then the incorrect steps that ends there should also be included in the dataset. It means we still need to append all the incorrect steps.

We can use a token to represent each step's correctness, such as + and -.
For the prediction of the probability of + and - during inference. We use the special token to represent the correctness of the step. We then use softmax to get the probability of correctness.

During training, because decoder is auto-regressive, we don't need to predict the correctness of the step. We can just use the token to represent the correctness of the step. We can use the special token to represent the correctness of the step. We then use softmax to get the probability of the special token, which is the correctness of the step.

In [3]:
# from datasets import load_dataset

# # Load and preprocess the dataset
# dataset = load_dataset("Birchlabs/openai-prm800k-stepwise-critic", split='train')
# dataset = dataset.filter(lambda x: x['rating'] is not None)  # Filter entries without ratings

# #filter out the examples that has 'next_response' in the responses of the solution
# dataset = dataset.filter(lambda x: not(x['rating'] == 1 and x['is_solution'] == False))

# #convert ratings of 0 to 1 so we have only binary labels
# dataset = dataset.map(lambda x: {'rating': 1 if x['rating'] == 0 else x['rating']})

# dataset = dataset.map(formatting_prompts_func, batched=True)  # Apply the preprocessing function

# len(dataset)

Filter: 100%|██████████| 1015027/1015027 [00:11<00:00, 89580.28 examples/s]
Map: 100%|██████████| 369283/369283 [00:16<00:00, 21922.39 examples/s]
Map: 100%|██████████| 369283/369283 [00:01<00:00, 189351.47 examples/s]


369283

In [2]:
good_token = '+'
bad_token = '-'
step_tag = 'ки'
# tokenizer.add_special_tokens({'additional_special_tokens': [good_token, bad_token, step_tag]})
print(tokenizer.convert_tokens_to_ids([good_token, bad_token, step_tag]))


[None, None, None]


In [21]:
tokenizer.encode(' ки \n')

[128000, 116624, 720]

In [22]:
from datasets import load_dataset

dataset = load_dataset("peiyi9979/Math-Shepherd", split='train')

def tokenize_function(examples):
    inputs = examples["input"]
    # Replace the ки with step_tag for each input example
    inputs = [input.replace('ки \n', 'ки \n') for input in inputs]
    return tokenizer(inputs, padding="max_length", truncation=True, max_length=512)

def tokenize_labels_function(examples):
    labels_list = examples["label"]
    tokenized_labels = []
    
    for labels in labels_list:
        # Replace the + and - with good_token and bad_token, while keeping them in the solution
        labels = labels.replace('+\n', '+ \n')
        labels = labels.replace('-\n', '- \n')
        
        # # Replace the last token with the appropriate special token
        # if labels[-1] == '+':
        #     labels = labels[:-1] + good_token
        # else:
        #     labels = labels[:-1] + bad_token
        
        tokenized_label = tokenizer(labels, padding="max_length", truncation=True, max_length=512)
        tokenized_labels.append(tokenized_label["input_ids"])
    
    return {"labels": tokenized_labels}

dataset = dataset.map(tokenize_function, batched=True)
dataset = dataset.map(tokenize_labels_function, batched=True)
dataset= dataset.remove_columns(['input', 'label', 'task'])


Map: 100%|██████████| 1000/1000 [00:00<00:00, 10791.57 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 4940.18 examples/s]


In [25]:
i=2
torch.tensor(dataset[i]['input_ids']) - torch.tensor(dataset[i]['labels'])

tensor([     0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0, 

[128000,
 18820,
 295,
 21935,
 400,
 1272,
 7682,
 414,
 369,
 220,
 18,
 4207,
 824,
 2046,
 315,
 20064,
 30976,
 18872,
 323,
 400,
 1591,
 7682,
 414,
 369,
 220,
 20,
 4207,
 264,
 2046,
 315,
 27374,
 18872,
 13,
 2650,
 1790,
 810,
 1587,
 1364,
 8493,
 389,
 27374,
 18872,
 1109,
 20064,
 30976,
 18872,
 304,
 264,
 1060,
 30,
 15166,
 220,
 16,
 25,
 54765,
 38202,
 220,
 18,
 4207,
 489,
 220,
 20,
 4207,
 284,
 1134,
 18,
 10,
 20,
 28,
 23,
 2511,
 23,
 4207,
 824,
 2046,
 389,
 4731,
 18872,
 13,
 489,
 720,
 8468,
 220,
 17,
 25,
 3005,
 38202,
 220,
 1272,
 353,
 220,
 18,
 284,
 1134,
 1272,
 9,
 18,
 28,
 4364,
 2511,
 4364,
 389,
 20064,
 30976,
 18872,
 824,
 2046,
 13,
 489,
 720,
 8468,
 220,
 18,
 25,
 3005,
 38202,
 220,
 1591,
 353,
 220,
 20,
 284,
 1134,
 1591,
 9,
 20,
 28,
 6860,
 2511,
 6860,
 389,
 27374,
 18872,
 824,
 2046,
 13,
 489,
 720,
 8468,
 220,
 19,
 25,
 54765,
 38202,
 220,
 4364,
 489,
 220,
 6860,
 284,
 1134,
 4364,
 10,
 6860,
 28,
 11387

In [6]:
tokenizer.vocab_size

128000

In [5]:
print(tokenizer.encode(f' +'))
print(tokenizer.encode(f'+'))
print(tokenizer.encode(f' + '))
print(tokenizer.encode(f'fasa + \n'))




[128000, 489]
[128000, 10]
[128000, 489, 220]
[128000, 15192, 64, 489, 720]


In [7]:
print(tokenizer.encode(f' -'))
print(tokenizer.encode(f'-'))
print(tokenizer.encode(f' - '))
print(tokenizer.encode(f'fasa - \n'))
print(tokenizer.encode(f'asfdsa -'))




[128000, 482]
[128000, 12]
[128000, 482, 220]
[128000, 15192, 64, 482, 720]
[128000, 300, 65934, 64, 482]


In [80]:
print(tokenizer.encode(f'sfas + \n fas'))


[128000, 82, 15192, 489, 720, 67618]


In [5]:
len(dataset["labels"][0])

512

In [3]:
!set CUDA_LAUNCH_BLOCKING=1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [4]:
from trl import SFTTrainer
from transformers import TrainingArguments, Trainer
from transformers.utils import logging
logging.set_verbosity_info()

trainer = Trainer(
    model = model,
    train_dataset = dataset,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 2e-6,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        save_steps= 5000,
        save_total_limit=2,
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "checkpoints/llama3-8b-critic-lora-4-28",
    ),
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using auto half precision backend


In [9]:
# #@title Show current memory stats
# gpu_stats = torch.cuda.get_device_properties(0)
# start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
# max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
# print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
# print(f"{start_gpu_memory} GB of memory reserved.")

In [5]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 62
 "-____-"     Number of trainable parameters = 83,886,080


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mjq394[0m ([33mneurorunner[0m). Use [1m`wandb login --relogin`[0m to force relogin


RuntimeError: Triton Error [CUDA]: device-side assert triggered

In [None]:
model.save_pretrained("checkpoints/llama3-8b-critic-lora") # Local saving


In [5]:
tokenizer.encode("\n")

[128000, 198]

In [8]:
label = 'Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year? Janet spends 3 hours + 5 hours = <<3+5=8>>8 hours per week on music lessons. + She spends 40 * 3 = <<40*3=120>>120 on clarinet lessons per week. + She spends 28 * 5 = <<28*5=140>>140 on piano lessons per week. + Janet spends 120 + 140 = <<120+140=260>>260 on music lessons per week. + Step 5: She spends 260 * 52 = <<260*52=13520>>13520 on music lessons in a year. The answer is: 13520 -'
label = label[:-1] + '<->'
label

'Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year? Step 1: Janet spends 3 hours + 5 hours = <<3+5=8>>8 hours per week on music lessons. + Step 2: She spends 40 * 3 = <<40*3=120>>120 on clarinet lessons per week. + Step 3: She spends 28 * 5 = <<28*5=140>>140 on piano lessons per week. + Step 4: Janet spends 120 + 140 = <<120+140=260>>260 on music lessons per week. + Step 5: She spends 260 * 52 = <<260*52=13520>>13520 on music lessons in a year. The answer is: 13520 <->'

In [10]:
from datasets import load_dataset

dataset = load_dataset("peiyi9979/Math-Shepherd", split='train')
good_token = '<+>'
bad_token = '<->'
step_tag = '<ки>'
tokenizer.add_special_tokens({'additional_special_tokens': [good_token, bad_token, step_tag]})
print(tokenizer.convert_tokens_to_ids([good_token, bad_token, step_tag]))

def tokenize_function(examples):
    inputs = examples["input"]
    # Replace the ки with step_tag for each input example
    inputs = [input.replace('ки', step_tag) for input in inputs]
    return tokenizer(inputs, padding="max_length", truncation=True, max_length=512)

def tokenize_labels_function(examples):
    labels_list = examples["label"]
    tokenized_labels = []
    
    for labels in labels_list:
        # Replace the + and - with good_token and bad_token, while keeping them in the solution
        labels = labels.replace('+\n', good_token + '\n')
        labels = labels.replace('-\n', bad_token + '\n')
        
        # Replace the last token with the appropriate special token
        if labels[-1] == '+':
            labels = labels[:-1] + good_token
        else:
            labels = labels[:-1] + bad_token
        
        tokenized_label = tokenizer(labels, padding="max_length", truncation=True, max_length=512)
        tokenized_labels.append(tokenized_label["input_ids"])
    
    return {"labels": tokenized_labels}

dataset = dataset.map(tokenize_function, batched=True)
dataset = dataset.map(tokenize_labels_function, batched=True)
dataset= dataset.remove_columns(['input', 'label', 'task'])

[128256, 128257, 128258]


Map: 100%|██████████| 444655/444655 [00:46<00:00, 9555.80 examples/s] 
Map: 100%|██████████| 444655/444655 [01:55<00:00, 3845.67 examples/s]


In [11]:
dataset[0]['input_ids']

[128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 

In [12]:
dataset[0]['labels']

[128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 128001,
 

# Evaluation

In [1]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch
import wandb
import os
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/home/jianingqi/LLMRL/checkpoints/llama3-8b-critic-lora-4-29", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
tokenizer.padding_side = "left" # Padding side for faster inference

  from .autonotebook import tqdm as notebook_tqdm


==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.394 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.0. CUDA Toolkit = 11.8.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00,  1.37s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch
import wandb
import os
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/tinyllama", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
tokenizer.padding_side = "left" # Padding side for faster inference

In [14]:

good_token = ' +'
bad_token = '-'
step_tag = ' ки'

candidate_tokens = tokenizer.encode(f"{good_token} {bad_token}")[1:] # [489, 482]
step_tag_id = tokenizer.encode(f"{step_tag}")[-1] # 12902
print(candidate_tokens)
print(step_tag_id)

[489, 482]
116624


In [39]:
# tokenizer.add_special_tokens({'additional_special_tokens': [good_token, bad_token, step_tag]})

3

In [40]:
# candidate_tokens = tokenizer.encode(f"{good_token} {bad_token}")[1:] # [648, 387]
# step_tag_id = tokenizer.encode(f"{step_tag}")[-1] # 12902

In [3]:
candidate_tokens

[10, 482]

In [4]:
step_tag_id

17165

In [29]:
tokenizer.decode(3694)

' +\n'

In [30]:
candidate_tokens

[3694, 482]

In [22]:
question = """Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make \n"""
output1 = """The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000 ки \nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000 ки \nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000 ки \nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000 ки \n#### 70000 ки""" # 18 is right

output2 = """The house was worth 80,000 + 50,000 = $<<80000+50000=130000>>130,000 after the repairs ки \nThe house is now worth 130,000 x 150% = $<<130000*150*.01=195000>>195,000 ки \nHe made a profit of 195,000 - 130,000 = $<<195000-130000=65000>>65,000 ки \n#### 65 ки""" #wrong

for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([tokenizer.encode(input_for_prm)]).to(model.device)

    with torch.no_grad():
        logits = model(input_id).logits[:,:,candidate_tokens]
        scores = logits.softmax(dim=-1)[:,:,0] 
        step_scores = scores[input_id == step_tag_id]
        score_product = step_scores.prod()
        print(score_product)
        print(step_scores)

tensor(8.7917e-07, device='cuda:0', dtype=torch.bfloat16)
tensor([0.0503, 0.1406, 0.1011, 0.0103, 0.1191], device='cuda:0',
       dtype=torch.bfloat16)
tensor(3.8743e-06, device='cuda:0', dtype=torch.bfloat16)
tensor([0.0535, 0.1641, 0.0052, 0.0850], device='cuda:0', dtype=torch.bfloat16)


In [15]:
input_id

tensor([[128000,  18820,    295,    753,  78878,  11203,    220,    845,  19335,
            824,   1938,     13,   3005,  50777,   2380,    369,  17954,   1475,
           6693,    323,    293,   2094,  55404,   1354,    369,   1077,   4885,
           1475,   1938,    449,   3116,     13,   3005,  31878,    279,  27410,
            520,    279,  20957,      6,   3157,   7446,    369,    400,     17,
            824,   7878,  37085,  19151,     13,   2650,   1790,    304,  11441,
           1587,   1364,   1304,   1475,   1938,    520,    279,  20957,      6,
           3157,     30,  15166,    220,     16,     25,  54765,    596,  78878,
          11203,    220,    845,  19335,    824,   1938,     13, 116624,    198,
           8468,    220,     17,     25,   3005,  50777,   2380,    369,  17954,
           1475,   6693,     11,    779,   1364,    706,    220,    845,    482,
            220,     18,    284,    220,   1032,  19335,   2163,     13, 116624,
            198,   8468,    

In [19]:
question = """Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"""
output1 = """Janet's ducks lay 16 eggs per day. ки \nShe eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки \nShe bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки \nShe sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers' market. The answer is: 18 ки""" # 18 is right
output2 = """Janet's ducks lay 16 eggs per day. ки \nShe eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки \nShe bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки \nShe sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $17 every day at the farmers' market. The answer is: 19 ки""" # 17 is wrong


for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([tokenizer.encode(input_for_prm)]).to(model.device)

    with torch.no_grad():
        logits = model(input_id).logits[:,:,candidate_tokens]
        scores = logits.softmax(dim=-1)[:,:,0] 
        step_scores = scores[input_id == step_tag_id]
        score_product = step_scores.prod()
        print(score_product)
        print(step_scores)

tensor(0.0728, device='cuda:0', dtype=torch.bfloat16)
tensor([0.2021, 0.8281, 0.5469, 0.7969], device='cuda:0', dtype=torch.bfloat16)
tensor(0.0457, device='cuda:0', dtype=torch.bfloat16)
tensor([0.2021, 0.8281, 0.5469, 0.5000], device='cuda:0', dtype=torch.bfloat16)


In [3]:
question = """Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? \n"""
output1 = """Janet's ducks lay 16 eggs per day. ки \nShe eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки \nShe bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки \nShe sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers' market. The answer is: 18 ки""" # 18 is right
output2 = """Janet's ducks lay 16 eggs per day. ки \nShe eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки \nShe bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки \nShe sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $17 every day at the farmers' market. The answer is: 17 ки""" # 17 is wrong
input_for_prm = []
for output in [output1, output2]:
    input_for_prm.append(f"{question} {output}")

with torch.no_grad():
    inputs = critic_tokenizer(input_for_prm, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")
    outputs = critic(**inputs)


In [4]:
outputs.logits[:,:,candidate_tokens]

torch.Size([2, 256, 128256])

# Evaluation on generated dataset

## Baseline Mistral

In [None]:
import os
import json
import re
import torch
import torch.nn.functional as F
from tqdm import tqdm
import sys
sys.path.append('Data/prm800k/prm800k')
from grading import grader

def load_json_answers(directory, num_files=None):
    json_files = [file for file in os.listdir(directory) if file.endswith(".json")]
    
    if num_files is not None:
        json_files = json_files[:num_files]
    
    all_answers = []
    for file in json_files:
        file_path = os.path.join(directory, file)
        with open(file_path, "r") as f:
            answers = json.load(f)
            all_answers.append(answers)
    
    return all_answers

def extract_answers(all_answers):
    extracted_answers = [[] for _ in range(len(all_answers[0]))]
    for answers in all_answers:
        for i, answer in enumerate(answers):
            match = re.search(r"####\s*(.*)", answer)
            if match:
                extracted_answers[i].append(match.group(1).strip())
            else:
                extracted_answers[i].append("")
    return extracted_answers

def compute_probabilities(all_answers, critic_tokenizer, critic, batch_size=32, is_llama = True):
    answers_prob = [[] for _ in range(len(all_answers[0]))]
    
    good_token = '+'
    bad_token = '-'
    step_tag = ' ки'

    candidate_tokens = critic_tokenizer.encode(f"{good_token} {bad_token}")[1:] # [648, 387]
    step_tag_id = critic_tokenizer.encode(f"{step_tag}")[-1] # 12902
    print(candidate_tokens)
    print(step_tag_id)

    with torch.no_grad():
        for answers in tqdm(all_answers, desc="Processing answers"):
            results = []
            if is_llama:
                for answer in answers:
                    result = answer.split('assistant\n\n')[0].split('You are a helpful assistant to solve math problems step by step user\n\n')[1] + '\n'
                    responses = answer.split('assistant\n\n')[1].split('\n')
                    for response in responses:
                        result += response + " ки \n"
                    results.append(result)
            else:
                for answer in answers:
                    result = answer.split('### Response:')[0].split('\n### Input:\n')[1]
                    responses = answer.split('### Response:\n')[1].split('\n')
                    for response in responses:
                        result += response + " ки \n"
                    results.append(result)
                                
            correct_probabilities = []
            for i in range(0, len(results), batch_size):
                batch_results = results[i:i+batch_size]
                
                inputs = critic_tokenizer(batch_results, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")
                logits = critic(**inputs).logits[:,:,candidate_tokens]
                scores = logits.softmax(dim=-1)[:,:,0] 
                step_scores = scores[inputs['input_ids'] == step_tag_id]
                correct_probabilities.extend(step_scores.tolist())
                # score_product = step_scores.prod()
                
                
                # probabilities = F.softmax(outputs.logits, dim=1)
                # correct_probability = probabilities[:, 1:]
                # correct_probability = torch.sum(correct_probability, dim=1)
                # correct_probabilities.extend(correct_probability.tolist())
            
            response_counts = []
            for answer in answers:
                if is_llama:
                    num_responses = len(answer.split('assistant\n\n')[1].split('\n'))
                else:
                    num_responses = len(answer.split('### Response:\n')[1].split('\n'))
                response_counts.append(num_responses)
            probability_index = 0
            for i, count in enumerate(response_counts):
                answer_probs = correct_probabilities[probability_index:probability_index+count]
                answer_prob = min(answer_probs)
                answers_prob[i].append(answer_prob)
                probability_index += count
                
    
    return answers_prob

def select_highest_probability_answers(extracted_answers, answers_prob):
    highest_probability_answers = []
    for i, question_answers in enumerate(extracted_answers):
        question_probs = answers_prob[i]
        if question_probs:
            max_prob_index = question_probs.index(max(question_probs))
            highest_probability_answer = question_answers[max_prob_index]
        else:
            highest_probability_answer = ""
        highest_probability_answers.append(highest_probability_answer)
    return highest_probability_answers

def compare_with_ground_truth(majority_answers, ground_truth_answers):
    correct_count = 0
    for majority_answer, ground_truth_answer in zip(majority_answers, ground_truth_answers):
        if grader.grade_answer(majority_answer, ground_truth_answer):
            correct_count += 1
    accuracy = correct_count / len(ground_truth_answers)
    return accuracy


In [5]:
answers = ["\n### Input:\nJanet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\n\n### Response:\n day 1: 16 eggs * 3 = <<16*3=48>>48 eggs\nday 2: 48 eggs + 4 eggs = <<48+4=52>>52 eggs\nday 3: 52 eggs + 4 eggs = <<52+4=56>>56 eggs\nday 4: 56 eggs + 4 eggs = <<56+4=60>>60 eggs\nday 5: 60 eggs + 4 eggs = <<60+4=64>>64 eggs\nday 6: 64 eggs + 4 eggs = <<64+4=68>>68 eggs\nday 7: 68 eggs + 4 eggs = <<68+4=72>>72 eggs\nday 8: 72 eggs + 4 eggs = <<72+4=76>>76 eggs\nday 9: 76 eggs + 4 eggs = <<76+4=80>>80 eggs\nday 10: 80 eggs + 4 eggs = <<80+4=84>>", "\n### Input:\nA robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?\n\n### Response:\n robe takes 2*2=<<2*2=4>>4 bolt of blue fiber\nSo it needs 4/2=<<4/2=2>>2 bolt of white fiber\nIt takes 2*2=<<2*2=4>>4 bolt of white fiber\nSo in total it takes 4+2=<<4+2=6>>6 bolt of color\n#### 6", "\n### Input:\nJosh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?\n\n### Response:\n house value: 80000*1.5=$<<80000*1.5=120000>>120000\nvalue after repairs: 120000+50000=$<<120000+50000=170000>>170000\nSo he made $170000-$120000=$<<170000-120000=50000>>50000 profit\n#### 50000"]
response_counts = []
for answer in answers:
    num_responses = len(answer.split('### Response:\n')[1].split('\n'))
    response_counts.append(num_responses)

probability_index = 0
for i, count in enumerate(response_counts):
    print(probability_index)
    print(probability_index+count)
    # answer_probs = correct_probabilities[probability_index:probability_index+count]
    # answer_prob = min(answer_probs)
    # answers_prob[i].append(answer_prob)
    probability_index += count

0
10
10
15
15
19


In [None]:
def compute_probabilities(all_answers, critic_tokenizer, critic, answer_num = 1, batch_size=32, is_llama = True):
    answers_prob = [[] for _ in range(len(all_answers[0]))]
    
    good_token = '+'
    bad_token = '-'
    step_tag = ' ки'

    candidate_tokens = critic_tokenizer.encode(f"{good_token} {bad_token}")[1:] # [648, 387]
    step_tag_id = critic_tokenizer.encode(f"{step_tag}")[-1] # 12902
    print(candidate_tokens)
    print(step_tag_id)

    with torch.no_grad():
        for answers in tqdm(all_answers, desc="Processing answers"):
            results = []
            if is_llama:
                answer= answers[answer_num]
                result = answer.split('assistant\n\n')[0].split('You are a helpful assistant to solve math problems step by step user\n\n')[1] + '\n'
                responses = answer.split('assistant\n\n')[1].split('\n')
                for response in responses:
                    result += response + " ки \n"
                results.append(result)
            else:
                answer= answers[answer_num]
                result = answer.split('### Response:')[0].split('\n### Input:\n')[1]
                responses = answer.split('### Response:\n')[1].split('\n')
                for response in responses:
                    result += response + " ки \n"
                results.append(result)
                                
            correct_probabilities = []
            for i in range(0, len(results), batch_size):
                batch_results = results[i:i+batch_size]
                
                inputs = critic_tokenizer(batch_results, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")
                logits = critic(**inputs).logits[:,:,candidate_tokens]
                scores = logits.softmax(dim=-1)[:,:,0] 
                step_scores = scores[inputs['input_ids'] == step_tag_id]
                correct_probabilities.extend(step_scores.tolist())
                # score_product = step_scores.prod()
                
                
                # probabilities = F.softmax(outputs.logits, dim=1)
                # correct_probability = probabilities[:, 1:]
                # correct_probability = torch.sum(correct_probability, dim=1)
                # correct_probabilities.extend(correct_probability.tolist())
            
            answer= answers[answer_num]
            if is_llama:
                num_responses = len(answer.split('assistant\n\n')[1].split('\n'))
            else:
                num_responses = len(answer.split('### Response:\n')[1].split('\n'))
            # answer_prob = torch.tensor(correct_probabilities[i:i+num_responses]).prod().item()
            answer_prob = torch.tensor(correct_probabilities).min().item()
            answers_prob[i].append(answer_prob)
    
    return answers_prob


In [2]:
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
import torch

good_token = '+'
bad_token = '-'
step_tag = 'ки'

tokenizer = AutoTokenizer.from_pretrained('peiyi9979/math-shepherd-mistral-7b-prm')
candidate_tokens = tokenizer.encode(f"{good_token} {bad_token}")[1:] # [648, 387]
step_tag_id = tokenizer.encode(f"{step_tag}")[-1] # 12902
model = AutoModelForCausalLM.from_pretrained('peiyi9979/math-shepherd-mistral-7b-prm').eval()

question = """A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?"""
output1 = """It takes 2 bolts of blue fiber and half that much white fiber. 2 * 1/2 = <<2*1/2=1>>1 bolt of white fiber ки 
So it takes 1 + 2 = <<1+2=3>>3 bolts of fiber ки 
#### 3 ки 
""" # 18 is right
output2 = """robe takes 2*2=<<2*2=4>>4 bolt of blue fiber ки \nSo it needs 4/2=<<4/2=2>>2 bolt of white fiber ки \nIt takes 2*2=<<2*2=4>>4 bolt of white fiber ки \nSo in total it takes 4+2=<<4+2=6>>6 bolt of color ки \n#### 6 ки""" # 17 is wrong


for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([tokenizer.encode(input_for_prm)])

    with torch.no_grad():
        logits = model(input_id).logits[:,:,candidate_tokens]
        scores = logits.softmax(dim=-1)[:,:,0] 
        step_scores = scores[input_id == step_tag_id]
        print(step_scores)
        
# tensor([0.9955, 0.9958, 0.9983, 0.9957])
# tensor([0.9955, 0.9958, 0.9983, 0.0240])


Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.65s/it]


tensor([0.9930, 0.9853, 0.9504])
tensor([0.6967, 0.8318, 0.4640, 0.7615, 0.6703])


In [3]:

for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([tokenizer.encode(input_for_prm)])

    with torch.no_grad():
        logits = model(input_id).logits[:,:,candidate_tokens]
        scores = logits.softmax(dim=-1)[:,:,0] 
        step_scores = scores[input_id == step_tag_id]
        print(step_scores.prod())
        

tensor(0.9298)
tensor(0.1372)


In [7]:
tokenizer.pad_token = tokenizer.eos_token

In [21]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch

model.to("cuda")

# Usage example
json_directory = "generated_answers"


all_answers = load_json_answers(json_directory, num_files= 5)
extracted_answers = extract_answers(all_answers)
answers_prob = compute_probabilities(all_answers, tokenizer, model)
highest_probability_answers = select_highest_probability_answers(extracted_answers, answers_prob)


Processing answers:   0%|          | 0/5 [00:00<?, ?it/s]

Processing answers: 100%|██████████| 5/5 [21:57<00:00, 263.58s/it]


In [22]:
#change to the other probability
#tiny majority vote 0.08 for 5 files
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")


Accuracy: 0.06


In [19]:
#change to the other probability
#llama3
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")


Accuracy: 0.65


## llama3 finetune

In [44]:
import os
import json
import re
import torch
import torch.nn.functional as F
from tqdm import tqdm
import sys
sys.path.append('Data/prm800k/prm800k')
from grading import grader


def load_json_answers(directory, num_files=None):
    json_files = [file for file in os.listdir(directory) if file.endswith(".json")]
    
    if num_files is not None:
        json_files = json_files[:num_files]
    
    all_answers = []
    for file in json_files:
        file_path = os.path.join(directory, file)
        with open(file_path, "r") as f:
            answers = json.load(f)
            all_answers.append(answers)
    
    return all_answers


def extract_answers(all_answers):
    extracted_answers = [[] for _ in range(len(all_answers[0]))]
    for answers in all_answers:
        for i, answer in enumerate(answers):
            match = re.search(r"####\s*(.*)", answer)
            if match:
                extracted_answers[i].append(match.group(1).strip())
            else:
                extracted_answers[i].append("")
    return extracted_answers

def compute_probabilities(all_answers, critic_tokenizer, critic, batch_size=32, is_llama = True):
    answers_prob = [[] for _ in range(len(all_answers[0]))]
    
    good_token = ' +'
    bad_token = '-'
    step_tag = ' ки'

    candidate_tokens = critic_tokenizer.encode(f"{good_token} {bad_token}")[1:] # [648, 387]
    step_tag_id = critic_tokenizer.encode(f"{step_tag}")[-1] # 12902
    print(candidate_tokens)
    print(step_tag_id)

    with torch.no_grad():
        for answers in tqdm(all_answers, desc="Processing answers"):
            results = []
            if is_llama:
                for answer in answers:
                    result = answer.split('assistant\n\n')[0].split('You are a helpful assistant to solve math problems step by step user\n\n')[1] + '\n'
                    responses = answer.split('assistant\n\n')[1].split('\n')
                    for response in responses:
                        result += response + " ки \n"
                    results.append(result)
            else:
                for answer in answers:
                    result = answer.split('### Response:')[0].split('\n### Input:\n')[1]
                    responses = answer.split('### Response:\n')[1].split('\n')
                    for response in responses:
                        result += response + " ки \n"
                    results.append(result)
                                
            correct_probabilities = []
            for i in tqdm(range(0, len(results), batch_size), desc="Computing probabilities", leave=False):
                batch_results = results[i:i+batch_size]
                
                inputs = critic_tokenizer(batch_results, padding="max_length", truncation=True, max_length=512, return_tensors="pt").to("cuda")
                logits = critic(**inputs).logits[:,:,candidate_tokens]
                scores = logits.softmax(dim=-1)[:,:,0] 
                step_scores = scores[inputs['input_ids'] == step_tag_id]
                correct_probabilities.extend(step_scores.tolist())
                # score_product = step_scores.prod()
                
                
                # probabilities = F.softmax(outputs.logits, dim=1)
                # correct_probability = probabilities[:, 1:]
                # correct_probability = torch.sum(correct_probability, dim=1)
                # correct_probabilities.extend(correct_probability.tolist())
            
            response_counts = []
            for answer in answers:
                if is_llama:
                    num_responses = len(answer.split('assistant\n\n')[1].split('\n'))
                else:
                    num_responses = len(answer.split('### Response:\n')[1].split('\n'))
                response_counts.append(num_responses)
            
            probability_index = 0
            for i, count in enumerate(response_counts):
                answer_probs = correct_probabilities[probability_index:probability_index+count]
                if answer_probs:
                    answer_prob = min(answer_probs)
                    answers_prob[i].append(answer_prob)
                else:
                    print('len of prob')
                    print(len(correct_probabilities))
                    print('len of responses')
                    print(sum(response_counts))
                    print('There is a length mismatch')
                    print('-----', i)
                    print(answers[i])
                    answers_prob[i].append(0.0)
                probability_index += count
    
    return answers_prob

def select_highest_probability_answers(extracted_answers, answers_prob):
    highest_probability_answers = []
    for i, question_answers in enumerate(extracted_answers):
        question_probs = answers_prob[i]
        if question_probs:
            max_prob_index = question_probs.index(max(question_probs))
            highest_probability_answer = question_answers[max_prob_index]
        else:
            highest_probability_answer = ""
        highest_probability_answers.append(highest_probability_answer)
    return highest_probability_answers

def compare_with_ground_truth(majority_answers, ground_truth_answers):
    correct_count = 0
    for majority_answer, ground_truth_answer in zip(majority_answers, ground_truth_answers):
        if grader.grade_answer(majority_answer, ground_truth_answer):
            correct_count += 1
    accuracy = correct_count / len(ground_truth_answers)
    return accuracy


In [45]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch

# Usage example
json_directory = "generated_answers_"


all_answers = load_json_answers(json_directory, num_files = 100)
extracted_answers = extract_answers(all_answers)
answers_prob = compute_probabilities(all_answers, critic_tokenizer, critic, is_llama = False)
highest_probability_answers = select_highest_probability_answers(extracted_answers, answers_prob)
#change to the other probability
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")

[489, 482]
116624


Processing answers: 100%|██████████| 100/100 [1:40:17<00:00, 60.18s/it]



Accuracy: 0.14


In [36]:
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")


Accuracy: 0.68


In [7]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch
import wandb
import os
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.


critic, critic_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/home/jianingqi/LLMRL/checkpoints/llama3-8b-critic-lora-4-29", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastLanguageModel.for_inference(critic) # Enable native 2x faster inference
critic_tokenizer.padding_side = "left" # Padding side for faster inference

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.394 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.0. CUDA Toolkit = 11.8.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards: 100%|██████████| 4/4 [00:56<00:00, 14.10s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [31]:
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
import torch

good_token = ' +'
bad_token = '-'
step_tag = ' ки'

# tokenizer = AutoTokenizer.from_pretrained('peiyi9979/math-shepherd-mistral-7b-prm')
candidate_tokens = critic_tokenizer.encode(f"{good_token} {bad_token}")[1:] # [648, 387]
step_tag_id = critic_tokenizer.encode(f"{step_tag}")[-1] # 12902
# model = AutoModelForCausalLM.from_pretrained('peiyi9979/math-shepherd-mistral-7b-prm').eval()

question = """Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"""
output1 = """Step 1: Janet's ducks lay 16 eggs per day. ки\nStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки\nStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки\nStep 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers' market. The answer is: 18 ки""" # 18 is right
output2 = """Step 1: Janet's ducks lay 16 eggs per day. ки\nStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки\nStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки\nStep 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $17 every day at the farmers' market. The answer is: 17 ки""" # 17 is wrong

inputs = []
for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    inputs.append(input_for_prm)

inputs_id = critic_tokenizer(inputs, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")

with torch.no_grad():
    logits = critic(inputs_id['input_ids']).logits[:,:,candidate_tokens]
    scores = logits.softmax(dim=-1)[:,:,0] 
    step_scores = scores[inputs_id['input_ids'] == step_tag_id]
    print(step_scores)

tensor([0.4531, 0.2949, 0.2227, 0.1191, 0.4531, 0.2949, 0.2227, 0.1069],
       device='cuda:0', dtype=torch.bfloat16)


In [20]:

for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([critic_tokenizer.encode(input_for_prm)]).to("cuda")

    with torch.no_grad():
        logits = critic(input_id).logits[:,:,candidate_tokens]
        scores = logits.softmax(dim=-1)[:,:,0] 
        step_scores = scores[input_id == step_tag_id]
        print(step_scores)
        

tensor([0.3926, 0.4844, 0.2949, 0.4219], device='cuda:0', dtype=torch.bfloat16)
tensor([0.3926, 0.4844, 0.2949, 0.3203], device='cuda:0', dtype=torch.bfloat16)


In [37]:
with torch.no_grad():
    print(critic(inputs_id['input_ids']).logits.shape)

torch.Size([2, 256, 128256])


In [3]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch

# Usage example
json_directory = "generated_answers"


all_answers = load_json_answers(json_directory)
extracted_answers = extract_answers(all_answers)
answers_prob = compute_probabilities(all_answers, critic_tokenizer, critic)
highest_probability_answers = select_highest_probability_answers(extracted_answers, answers_prob)


Processing answers: 100%|██████████| 100/100 [50:50<00:00, 30.51s/it]


In [35]:
# alpaca_prompt = You MUST copy from above!
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant to solve math problems step by step <|eot_id|><|start_header_id|>user<|end_header_id|>

{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{}"""

def formatting_prompts_func(examples):
    texts = []
    final_answer = []
    for instruction, answer in zip(examples['question'], examples['answer']):
        # Combine all responses and the next response into a single string with newline separation
        extracted_answer = answer.split('### ')[1]
        final_answer.append(extracted_answer)
        # Format the text with the prompt template
        text = prompt.format(instruction, '')
        texts.append(text)
    
    return {'input_text': texts, 'final_answer': final_answer}

from datasets import load_dataset

# Load and preprocess the dataset
dataset = load_dataset("gsm8k", 'main', split='test')
dataset = dataset.map(formatting_prompts_func, batched=True)  # Apply the preprocessing function

product score method

In [5]:
#change to the other probability
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")


Accuracy: 0.06


In [106]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch

# Usage example
json_directory = "generated_answers_llama3"


all_answers = load_json_answers(json_directory, num_files = 10)
extracted_answers = extract_answers(all_answers)
answers_prob = compute_probabilities(all_answers, critic_tokenizer, critic, is_llama = True)
highest_probability_answers = select_highest_probability_answers(extracted_answers, answers_prob)
#change to the other probability
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")

[489, 482]
116624


Processing answers:   0%|          | 0/10 [00:00<?, ?it/s]

Processing answers: 100%|██████████| 10/10 [05:04<00:00, 30.47s/it]



Accuracy: 0.60


min score method

In [8]:
from transformers import LlamaForSequenceClassification, AutoTokenizer, LlamaForCausalLM
import torch

# Usage example
json_directory = "generated_answers_llama3"


all_answers = load_json_answers(json_directory, num_files = 5)
extracted_answers = extract_answers(all_answers)
answers_prob = compute_probabilities(all_answers, critic_tokenizer, critic, is_llama = True)
highest_probability_answers = select_highest_probability_answers(extracted_answers, answers_prob)
#change to the other probability
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")

[489, 482]
116624


Processing answers:   0%|          | 0/5 [00:00<?, ?it/s]

Processing answers:   0%|          | 0/5 [00:31<?, ?it/s]


ValueError: min() arg is an empty sequence

In [None]:
#change to the other probability
accuracy = compare_with_ground_truth(highest_probability_answers, dataset['final_answer'])
print(f"\nAccuracy: {accuracy:.2f}")


Accuracy: 0.66


In [89]:
json_directory = "generated_answers_llama3"

all_answers = load_json_answers(json_directory, num_files = 1)

batch_size=1
is_llama = True
answers_prob = [[] for _ in range(len(all_answers[0]))]

good_token = ' +'
bad_token = '-'
step_tag = ' ки'

candidate_tokens = critic_tokenizer.encode(f"{good_token} {bad_token}") # [648, 387]
step_tag_id = critic_tokenizer.encode(f"{step_tag}")[-1] # 12902

with torch.no_grad():
    for answers in tqdm(all_answers, desc="Processing answers"):
        results = []
        if is_llama:
            for answer in answers:
                result = answer.split('assistant\n\n')[0].split('You are a helpful assistant to solve math problems step by step user\n\n')[1] + '\n'
                responses = answer.split('assistant\n\n')[1].split('\n')
                for response in responses:
                    result += response + " ки \n"
                results.append(result)
        else:
            for answer in answers:
                result = answer.split('### Response:')[0].split('\n### Input:\n')[1]
                responses = answer.split('### Response:\n')[1].split('\n')
                for response in responses:
                    result += response + " ки \n"
                results.append(result)
                            
        correct_probabilities = []
        for i in range(0, len(results), batch_size):
            batch_results = results[i:i+batch_size]
            
            inputs = critic_tokenizer(batch_results, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")
            logits = critic(**inputs).logits[:,:,candidate_tokens]
            scores = logits.softmax(dim=-1)[:,:,0] 
            step_scores = scores[inputs == step_tag_id]
            correct_probabilities.extend(step_scores.tolist())
            # score_product = step_scores.prod()
            
            
            # probabilities = F.softmax(outputs.logits, dim=1)
            # correct_probability = probabilities[:, 1:]
            # correct_probability = torch.sum(correct_probability, dim=1)
            # correct_probabilities.extend(correct_probability.tolist())
        
        for i, answer in enumerate(answers):
            if is_llama:
                num_responses = len(answer.split('assistant\n\n')[1].split('\n'))
            else:
                num_responses = len(answer.split('### Response:\n')[1].split('\n'))
            # answer_prob = torch.tensor(correct_probabilities[i:i+num_responses]).prod().item()
            answer_prob = torch.tensor(correct_probabilities[i:i+num_responses]).min().item()
            answers_prob[i].append(answer_prob)
    

Processing answers:   0%|          | 0/1 [01:29<?, ?it/s]


RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

In [68]:
step_scores

tensor([5.1036e-07, 3.6671e-09, 2.5728e-08], device='cuda:0',
       dtype=torch.bfloat16)

In [133]:
with torch.no_grad():
    result_1 = results[1]
    print(result_1)
    inputs = critic_tokenizer(result_1, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")
    print(critic(**inputs).logits.shape)
    logits = critic(**inputs).logits[:,:,candidate_tokens]
    scores = logits.softmax(dim=-1)[:,:,0] 
    step_scores = scores[inputs['input_ids'] == step_tag_id]
    print(step_scores)
    answer_prob = torch.tensor(step_scores).prod().item()
    print(answer_prob)
    

A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?
It takes 2 bolts of blue fiber and half that much white fiber. 2 * 1/2 = <<2*1/2=1>>1 bolt of white fiber ки 
So it takes 1 + 2 = <<1+2=3>>3 bolts of fiber ки 
#### 3 ки 

torch.Size([1, 256, 128256])
tensor([0.3203, 0.0635, 0.2021], device='cuda:0', dtype=torch.bfloat16)
0.004119873046875


  answer_prob = torch.tensor(step_scores).prod().item()


In [154]:
inputs[0] = result_1

In [156]:
question = """A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?"""
output1 = """It takes 2 bolts of blue fiber and half that much white fiber. 2 * 1/2 = <<2*1/2=1>>1 bolt of white fiber ки 
So it takes 1 + 2 = <<1+2=3>>3 bolts of fiber ки 
#### 3 ки 
""" # 18 is right
output2 = """It takes 2 bolts of blue fiber and half that much white fiber. 2 * 1/2 = <<2*1/2=1>>1 bolt of white fiber ки 
So it takes 1 + 2 = <<1+2=3>>3 bolts of fiber ки 
#### 4 ки """ # 17 is wrong
output3 = """It takes 2 bolts of blue fiber and half that much white fiber. 2 * 1/2 = <<2*1/2=1>>1 bolt of white fiber ки 
So it takes 1 + 2 = <<1+2=3>>3 bolts of fiber ки 
#### 9 ки """ # 17 is wrong

# inputs = []
# for output in [output1, output2,output3]:
#     input_for_prm = f"{question} {output}"
#     inputs.append(input_for_prm)

inputs_id = critic_tokenizer(inputs, padding="max_length", truncation=True, max_length=256, return_tensors="pt").to("cuda")

with torch.no_grad():
    print(critic(**inputs_id).logits.shape)
    logits = critic(**inputs_id).logits[:,:,candidate_tokens]
    scores = logits.softmax(dim=-1)[:,:,0] 
    step_scores = scores[inputs_id['input_ids'] == step_tag_id]
    print(step_scores[0:3].prod())
    print(step_scores)

torch.Size([3, 256, 128256])
tensor(0.0044, device='cuda:0', dtype=torch.bfloat16)
tensor([0.3203, 0.0718, 0.1924, 0.2559, 0.0396, 0.1484, 0.2559, 0.0396, 0.1484],
       device='cuda:0', dtype=torch.bfloat16)


In [124]:
inputs[0]

'A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?\n It takes 2 bolts of blue fiber and half that much white fiber. 2 * 1/2 = <<2*1/2=1>>1 bolt of white fiber ки \nSo it takes 1 + 2 = <<1+2=3>>3 bolts of fiber ки \n#### 3 ки '

In [125]:
result_1

'A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?\nIt takes 2 bolts of blue fiber and half that much white fiber. 2 * 1/2 = <<2*1/2=1>>1 bolt of white fiber ки \nSo it takes 1 + 2 = <<1+2=3>>3 bolts of fiber ки \n#### 3 ки \n'

In [112]:
# good_token = ' +'
# bad_token = '-'
# step_tag = ' ки'

# candidate_tokens = critic_tokenizer.encode(f"{good_token} {bad_token}")[1:] # [648, 387]
# step_tag_id = critic_tokenizer.encode(f"{step_tag}")[-1] # 12902
# print(candidate_tokens)
for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([critic_tokenizer.encode(input_for_prm)]).to("cuda")

    with torch.no_grad():
        logits = critic(input_id).logits[:,:,candidate_tokens]
        scores = logits.softmax(dim=-1)[:,:,0] 
        step_scores = scores[input_id == step_tag_id]
        print(step_scores.prod())
        print(step_scores)
        

tensor(0.0073, device='cuda:0', dtype=torch.bfloat16)
tensor([0.3770, 0.1011, 0.1924], device='cuda:0', dtype=torch.bfloat16)
tensor(0.0066, device='cuda:0', dtype=torch.bfloat16)
tensor([0.3770, 0.1011, 0.1729], device='cuda:0', dtype=torch.bfloat16)


In [62]:
torch.tensor(correct_probabilities).prod().item()

4.8150832720791174e-23