# Math Question Answer Verification Competition

## Starter Code

Borrowed from the [official Unsloth implementation](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=MKX_XKs_BNZR).

In [None]:
# %%capture
# This cell will take time
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth
  Using cached unsloth-2024.11.7-py3-none-any.whl.metadata (59 kB)
Using cached unsloth-2024.11.7-py3-none-any.whl (163 kB)
Installing collected packages: unsloth
Successfully installed unsloth-2024.11.7
Found existing installation: unsloth 2024.11.7
Uninstalling unsloth-2024.11.7:
  Successfully uninstalled unsloth-2024.11.7
Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-mila2y0o/unsloth_7297e1338e3243288242c3dd6c5202c8
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-mila2y0o/unsloth_7297e1338e3243288242c3dd6c5202c8
  Resolved https://github.com/unslothai/unsloth.git to commit f26d4e739ed507de7a9088da53d10fd02f58d160
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metad

In [None]:
#Imports
import os
import gc
import math
import torch
import shutil
import random
import numpy as np
import json
import pandas as pd

from google.colab import drive
from unsloth import FastLanguageModel

In [None]:
max_seq_length = 2048 # Choose any
# Specify dtype based on GPU type. Use None for automatic detection.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

## Load model and wrap with LoRA adapters

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0.1, # Supports any, but = 0 is optimized
    bias = "all",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth: bias = `none` is supported for fast patching. You are using bias = all.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.11.7 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
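As a sanity check on the LoRA configuration above, the adapter size and scaling factor can be worked out by hand. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers), which this notebook does not state, and counts only the LoRA A and B matrices. `lora_param_count` is a hypothetical helper name.

```python
import math

# Assumed Llama 3.1 8B shapes (not stated in this notebook)
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r):
    # Each adapted projection adds A (r x in_features) and B (out_features x r)
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS

r, lora_alpha = 32, 64
total_lora_params = lora_param_count(r)
standard_scale = lora_alpha / r            # classic LoRA scaling
rslora_scale = lora_alpha / math.sqrt(r)   # rank-stabilized scaling (use_rslora = True)

print(total_lora_params)
print(standard_scale, round(rslora_scale, 2))
```

Under these assumptions the count comes to 83,886,080, the same figure the training banner reports as the number of trainable parameters. With `r = 32` and `lora_alpha = 64`, rank-stabilized scaling multiplies the adapter update by roughly 11.3 rather than 2.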


## Competition dataset

In [None]:
# download and load competition dataset

from datasets import load_dataset
dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp")
# print and see dataset
dataset

README.md:   0%|          | 0.00/2.09k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/3.65M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 1000000
    })
    test: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 10000
    })
})

In [None]:
prompt = """Review the provided math question, its solution, and the final answer. Verify the accuracy of the solution steps and the correctness of the final answer based on the calculations provided.
Please analyze the information below and explicitly state 'True' if the solution accurately solves the problem and the answer matches, or 'False' if it does not. Do not provide additional information or explanation in your response.

### Math Question:
{}

### Provided Answer:
{}

### Detailed Solution:
{}

Answer is correct (True/False only):{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    question = examples["question"]
    answer = examples["answer"]
    solution = examples["solution"]
    output = examples["is_correct"]
    texts = []

    for ques, ans, sol, out in zip(question, answer, solution, output):
        # Format the prompt and add the "text" field
        text = prompt.format(ques, ans, sol, out) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}



In [None]:
from datasets import DatasetDict

# Split and shuffle the dataset into training and validation sets
shuffled_dataset = dataset['train'].shuffle(seed=42)  # Set a seed for reproducibility
train_valid_split = shuffled_dataset.train_test_split(test_size=0.005, seed=42) # 99.5% train, 0.5% validation; fixed seed keeps the split stable across sessions
train_dataset = train_valid_split['train']
eval_dataset = train_valid_split['test']

In [None]:
# Process the training dataset and generate prompt for each datapoint
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/995000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [None]:
# Print a sample training example
train_dataset['text'][0]

"Review the provided math question, its solution, and the final answer. Verify the accuracy of the solution steps and the correctness of the final answer based on the calculations provided.\nPlease analyze the information below and explicitly state 'True' if the solution accurately solves the problem and the answer matches, or 'False' if it does not. Do not provide additional information or explanation in your response.\n\n### Math Question:\nPaul needed to buy some new clothes for work.  He had a 10% off coupon that he could use on his entire purchase after any other discounts.  Paul bought 4 dress shirts at $15.00 apiece, 2 pairs of pants that each cost $40.00.  He found a suit for $150.00 and 2 sweaters for $30.00 each.  When he got to the register, the clerk told him that the store was offering 20% off of everything in the store.  After the discounts and the coupon, how much did Paul spend on his new clothes?\n\n### Provided Answer:\n252\n\n### Detailed Solution:\nLet's solve this 

## SFT

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

max_steps = 800

training_args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 6,
        warmup_steps = int(0.1*max_steps),
        num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = max_steps,
        learning_rate = 4e-4,  # 2e-4, 3e-4, 5e-4
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        # eval_steps=1,  # Evaluate after every logging step; adjust as needed
        # eval_strategy="steps",  # Evaluate at each logging step
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
        eval_strategy="steps",
        eval_steps=10,
        save_steps=50,            # Save a checkpoint every 50 steps
        save_total_limit=3,
    )

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args,
)

Map (num_proc=4):   0%|          | 0/995000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/5000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
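The arguments above combine `warmup_steps = int(0.1*max_steps)` with `lr_scheduler_type = "linear"`. The sketch below shows the shape this is expected to produce, assuming the usual "linear warmup, then linear decay to zero" rule. It is a hand-written illustration, not the library's code, and `linear_warmup_lr` is a hypothetical name.

```python
def linear_warmup_lr(step, base_lr=4e-4, warmup_steps=80, max_steps=800):
    """Learning rate at an optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at max_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

for step in (0, 40, 80, 440, 800):
    print(step, linear_warmup_lr(step))
```

With these settings the rate peaks at step 80 and falls back to half the peak by step 440, so most of the 800 steps run below the nominal `4e-4`.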


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 995,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 6
\        /    Total batch size = 24 | Total steps = 400
 "-____-"     Number of trainable parameters = 83,886,080


Step,Training Loss,Validation Loss
10,0.7234,0.660618
20,0.5917,0.638127
30,0.7446,0.637876
40,0.6869,0.658276
50,0.7309,0.690449
60,0.6427,0.694319
70,0.6894,0.689212
80,0.6541,0.688648
90,0.7551,0.6916
100,0.7048,0.684463


In [None]:
trainer_stats

TrainOutput(global_step=400, training_loss=0.6610693509876728, metrics={'train_runtime': 13583.4139, 'train_samples_per_second': 0.707, 'train_steps_per_second': 0.029, 'total_flos': 2.108350115734487e+17, 'train_loss': 0.6610693509876728, 'epoch': 0.009648241206030151})
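The `epoch` value in this output can be reproduced from the batch settings, which shows how little of the training split this run covered. The sketch below uses only numbers reported above: batch size 4, 6 accumulation steps, 1 GPU, 400 steps and 995,000 training examples. The variable names are illustrative.

```python
per_device_batch = 4
grad_accum = 6
num_gpus = 1
global_step = 400             # from the TrainOutput above
num_train_examples = 995_000  # size of the training split

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus
# Examples consumed over the whole run
examples_seen = effective_batch * global_step
# Fraction of one pass over the training split
epoch_fraction = examples_seen / num_train_examples

print(effective_batch, examples_seen, round(epoch_fraction, 6))
```

The run therefore saw 9,600 examples, just under 1% of the training split.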

## Inference

In [None]:
# Sample inference data point
test_dataset = dataset['test']

sample_ques = test_dataset['question'][0]
sample_ans = test_dataset['answer'][0]
sample_sol = test_dataset['solution'][0]

In [None]:
# Running inference on single test
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
input_prompt = prompt.format(
        sample_ques, # ques
        sample_ans, # given answer
        sample_sol, # solution
        "", # output - leave this blank for generation! The LLM will generate whether it is True or False
    )

print("Input Promt:\n", input_prompt)
inputs = tokenizer(
[
    input_prompt
], return_tensors = "pt").to("cuda")

input_shape = inputs['input_ids'].shape
input_token_len = input_shape[1] # 1 because of batch
outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# You can get the whole generated text (prompt included) by uncommenting the line below
# text_generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)

response = tokenizer.batch_decode([outputs[0][input_token_len:]], skip_special_tokens=True)
print("Response:", response)

Input Promt:
 Review the provided math question, its solution, and the final answer. Verify the accuracy of the solution steps and the correctness of the final answer based on the calculations provided.
Please analyze the information below and explicitly state 'True' if the solution accurately solves the problem and the answer matches, or 'False' if it does not. Do not provide additional information or explanation in your response.

### Math Question:
The Parker family needs to leave the house by 5 pm for a dinner party. Mrs. Parker was waiting to get into the bathroom at 2:30 pm. Her oldest daughter used the bathroom for 45 minutes and her youngest daughter used the bathroom for another 30 minutes. Then her husband used it for 20 minutes. How much time will Mrs. Parker have to use the bathroom to leave on time?

### Provided Answer:
205

### Detailed Solution:
Let's solve this problem using Python code.
<llm-code>
minutes_per_hour = 60
minutes_left_before_5 = 5 * minutes_per_hour
tota
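Because generation runs with `max_new_tokens = 64`, the decoded response may contain more than a bare `True` or `False`. The sketch below is one possible way to reduce a raw response to a boolean, assuming the verdict is the first word the model emits. `parse_verdict` is a hypothetical helper, not part of the starter code.

```python
def parse_verdict(generated_text):
    """Map raw generated text to True/False, or None when no verdict is found."""
    stripped = generated_text.strip()
    if not stripped:
        return None
    # Only the first whitespace-delimited token is inspected; trailing text is ignored
    first_token = stripped.split()[0].strip(".,:;!'\"").lower()
    if first_token == "true":
        return True
    if first_token == "false":
        return False
    return None

print(parse_verdict(" True\n\n### Math Question: ..."))
print(parse_verdict("False."))
print(parse_verdict("The answer looks right"))
```

Returning `None` for unparseable text keeps malformed generations visible instead of silently counting them as one class.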

## Testing

In [None]:
def clear_cuda():
    """Release cached GPU memory and reset PyTorch's memory statistics."""
    if not torch.cuda.is_available():
        return

    # Wait for all queued CUDA kernels to complete
    torch.cuda.synchronize()

    # Return unused cached blocks to the driver; tensors that are still referenced stay allocated
    torch.cuda.empty_cache()

    # Reset the peak and accumulated memory counters (bookkeeping only, frees no memory)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.reset_accumulated_memory_stats()

if torch.cuda.is_available():
    print("Clearing CUDA memory...")
    clear_cuda()
    print("CUDA memory cleared.")

Clearing CUDA memory...
CUDA memory cleared.


In [None]:
torch.cuda.empty_cache()
gc.collect()

0

In [None]:
!nvidia-smi

Fri Nov 15 23:09:52 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0              46W / 400W |   7149MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

batch_size = 30
responses = []  # List to store all responses
actual_correctness = []  # List to store actual correctness
comparison_results = []  # List to store comparison results

# Process the dataset in batches
for i in range(0, len(eval_dataset['question']), batch_size):
    print("Batch: ", i," to ", i+batch_size)
    batch_questions = eval_dataset['question'][i:i+batch_size]
    batch_answers = eval_dataset['answer'][i:i+batch_size]
    batch_solutions = eval_dataset['solution'][i:i+batch_size]
    batch_is_correct = eval_dataset['is_correct'][i:i+batch_size]  # Actual correctness

    # Generate input prompts for the batch
    input_prompts = [
        prompt.format(q, a, s, "")
        for q, a, s in zip(batch_questions, batch_answers, batch_solutions)
    ]

    # Tokenize inputs as a batch
    inputs = tokenizer(input_prompts, return_tensors="pt", padding=True).to("cuda")
    input_token_lens = inputs['input_ids'].shape[1]  # Total length of the input tokens

    # Generate outputs for the entire batch
    outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

    batch_responses = [
        tokenizer.decode(outputs[j][input_token_lens:], skip_special_tokens=True).strip()
        for j in range(outputs.size(0))
    ]
    responses.extend(batch_responses)  # Append batch responses to the main list

    # Collect actual correctness and compare
    actual_correctness.extend(batch_is_correct)
    batch_comparison = [str(resp).lower() == str(ac).strip().lower() for resp, ac in zip(batch_responses, batch_is_correct)]
    # print("Batch Comparison: ", batch_comparison)
    comparison_results.extend(batch_comparison)  # Append comparison results
    # print("Comparison Results: ", comparison_results[i:])


    # Clean up GPU memory after processing each batch
    del inputs
    del outputs
    torch.cuda.empty_cache()
    gc.collect()
    clear_cuda()

# After processing all batches
print("Collected Responses:", responses)
print("Actual Correctness:", actual_correctness)
print("Comparison Results:", comparison_results)

# Compute the overall accuracy from the comparison results
accuracy = sum(comparison_results) / len(comparison_results)
print(f"Model Accuracy: {accuracy * 100:.2f}%")


Batch:  0  to  30
Batch:  30  to  60
Batch:  60  to  90
Batch:  90  to  120
Batch:  120  to  150
Batch:  150  to  180
Batch:  180  to  210
Batch:  210  to  240
Batch:  240  to  270
Batch:  270  to  300
Batch:  300  to  330
Batch:  330  to  360
Batch:  360  to  390
Batch:  390  to  420
Batch:  420  to  450
Batch:  450  to  480
Batch:  480  to  510
Batch:  510  to  540
Batch:  540  to  570
Batch:  570  to  600
Batch:  600  to  630
Batch:  630  to  660
Batch:  660  to  690
Batch:  690  to  720
Batch:  720  to  750
Batch:  750  to  780
Batch:  780  to  810
Batch:  810  to  840
Batch:  840  to  870
Batch:  870  to  900
Batch:  900  to  930
Batch:  930  to  960
Batch:  960  to  990
Batch:  990  to  1020
Batch:  1020  to  1050
Batch:  1050  to  1080
Batch:  1080  to  1110
Batch:  1110  to  1140
Batch:  1140  to  1170
Batch:  1170  to  1200
Batch:  1200  to  1230
Batch:  1230  to  1260
Batch:  1260  to  1290
Batch:  1290  to  1320
Batch:  1320  to  1350
Batch:  1350  to  1380
Batch:  1380  to 

## Submission Generation

In [None]:
clear_cuda()

In [None]:
torch.cuda.empty_cache()
gc.collect()

0

In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

batch_size = 30
responses = []  # List to store all responses

# Process the dataset in batches
for i in range(0, len(test_dataset['question']), batch_size):
    print("Batch: ", i," to ", i+batch_size)
    batch_questions = test_dataset['question'][i:i+batch_size]
    batch_answers = test_dataset['answer'][i:i+batch_size]
    batch_solutions = test_dataset['solution'][i:i+batch_size]

    # Generate input prompts for the batch
    input_prompts = [
        prompt.format(q, a, s, "")
        for q, a, s in zip(batch_questions, batch_answers, batch_solutions)
    ]

    # Tokenize inputs as a batch
    inputs = tokenizer(input_prompts, return_tensors="pt", padding=True).to("cuda")
    input_token_lens = inputs['input_ids'].shape[1]  # Total length of the input tokens

    # Generate outputs for the entire batch
    outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

    batch_responses = [
        tokenizer.decode(outputs[j][input_token_lens:], skip_special_tokens=True).strip()
        for j in range(outputs.size(0))
    ]
    responses.extend(batch_responses)  # Append batch responses to the main list

    # Clean up GPU memory after processing each batch
    del inputs
    del outputs
    torch.cuda.empty_cache()
    gc.collect()
    if torch.cuda.is_available():
        clear_cuda()


Batch:  0  to  30
Batch:  30  to  60
Batch:  60  to  90
Batch:  90  to  120
Batch:  120  to  150
Batch:  150  to  180
Batch:  180  to  210
Batch:  210  to  240
Batch:  240  to  270
Batch:  270  to  300
Batch:  300  to  330
Batch:  330  to  360
Batch:  360  to  390
Batch:  390  to  420
Batch:  420  to  450
Batch:  450  to  480
Batch:  480  to  510
Batch:  510  to  540
Batch:  540  to  570
Batch:  570  to  600
Batch:  600  to  630
Batch:  630  to  660
Batch:  660  to  690
Batch:  690  to  720
Batch:  720  to  750
Batch:  750  to  780
Batch:  780  to  810
Batch:  810  to  840
Batch:  840  to  870
Batch:  870  to  900
Batch:  900  to  930
Batch:  930  to  960
Batch:  960  to  990
Batch:  990  to  1020
Batch:  1020  to  1050
Batch:  1050  to  1080
Batch:  1080  to  1110
Batch:  1110  to  1140
Batch:  1140  to  1170
Batch:  1170  to  1200
Batch:  1200  to  1230
Batch:  1230  to  1260
Batch:  1260  to  1290
Batch:  1290  to  1320
Batch:  1320  to  1350
Batch:  1350  to  1380
Batch:  1380  to 
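The batched loops above tokenize with `padding=True` and then slice every row at the shared padded length. For a decoder-only model, padding on the left keeps each prompt's last real token directly before the first generated token. The toy sketch below illustrates the difference with plain lists. The token ids and the `pad_batch` helper are hypothetical, and the notebook does not show which side this tokenizer actually pads on.

```python
def pad_batch(token_rows, pad_id, side="left"):
    """Pad lists of token ids to equal length and build matching attention masks."""
    width = max(len(row) for row in token_rows)
    padded, masks = [], []
    for row in token_rows:
        fill = [pad_id] * (width - len(row))
        zeros = [0] * len(fill)   # mask value for padding positions
        ones = [1] * len(row)     # mask value for real tokens
        if side == "left":
            padded.append(fill + row)
            masks.append(zeros + ones)
        else:
            padded.append(row + fill)
            masks.append(ones + zeros)
    return padded, masks

rows = [[11, 12, 13], [21]]
left_ids, left_mask = pad_batch(rows, pad_id=0, side="left")
right_ids, right_mask = pad_batch(rows, pad_id=0, side="right")

print(left_ids)    # [[11, 12, 13], [0, 0, 21]]
print(right_ids)   # [[11, 12, 13], [21, 0, 0]]
```

With right padding the short prompt ends in pad tokens, so generation would continue from padding rather than from the prompt. If the tokenizer is not already left-padding, setting `tokenizer.padding_side = "left"` before the batched calls is one way to request it.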

In [None]:
len(responses)

10000

In [None]:
# Saving responses in a dataframe
responses_df = pd.DataFrame({
    'ID': range(len(responses)),
    'is_correct': responses
})

# Save to CSV file
responses_df.to_csv('model_responses.csv', index=False)
print("Responses have been saved to model_responses.csv.")


Responses have been saved to model_responses.csv.


## Saving Model

In [None]:
# Local saving
model.save_pretrained("lora_model") 
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", #Trained model
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference


==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!




Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]



## Saving Files

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


### Saving Submission File

In [None]:
# Define source and destination paths
source_file = '/content/model_responses.csv'
destination_file = '/content/drive/My Drive/DL/model_responses_85.10.csv'  # Adjust the path where you want to save in Drive

# Ensure the destination directory (the file's parent folder) exists
os.makedirs(os.path.dirname(destination_file), exist_ok=True)


In [None]:
# Ensure the destination directory exists
os.makedirs(os.path.dirname(destination_file), exist_ok=True)

# Copy the file
shutil.copy(source_file, destination_file)
print(f"File copied successfully to {destination_file}")

File copied successfully to /content/drive/My Drive/DL/model_responses_85.10.csv
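When copying a single file, the directory to create is the destination's parent folder, not the destination path itself. The sketch below shows that pattern with `pathlib` in a temporary directory. The helper name `copy_file_to` and the demo paths are hypothetical, and nothing is written to Drive.

```python
import shutil
import tempfile
from pathlib import Path

def copy_file_to(source_file, destination_file):
    """Copy a file, creating the destination's parent directory first."""
    destination = Path(destination_file)
    # Create the parent folder, not a folder named after the file
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(source_file, destination)
    return destination

# Demonstration in a throwaway directory (paths are hypothetical)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "model_responses.csv"
    src.write_text("ID,is_correct\n0,True\n")
    dst = copy_file_to(src, Path(tmp) / "DL" / "model_responses_copy.csv")
    # Check that the destination is a regular file with identical contents
    copied_ok = dst.is_file() and dst.read_text() == src.read_text()

print(copied_ok)
```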


### Saving Checkpoint

In [None]:
# Source and destination paths
source_folder = '/content/outputs'  
destination_folder = '/content/drive/My Drive/DL/dropout_0.1/checkpoints3'  

# Ensure the destination directory exists
os.makedirs(destination_folder, exist_ok=True)


In [None]:
# Copy files from source to destination
if os.path.exists(source_folder):
    # dirs_exist_ok=True merges into an existing destination folder instead of raising an error
    shutil.copytree(source_folder, destination_folder, dirs_exist_ok=True)
else:
    print(f"Source folder {source_folder} does not exist. Please check the source path.")

### Saving Model

In [None]:
# Define source and destination paths
source_folder = '/content/lora_model'  
destination_folder = '/content/drive/My Drive/DL/lora_model_dropout_0.1_3' 

# Ensure the destination directory exists
os.makedirs(destination_folder, exist_ok=True)

In [None]:
# Copy files from source to destination
if os.path.exists(source_folder):
    # dirs_exist_ok=True merges into an existing destination folder instead of raising an error
    shutil.copytree(source_folder, destination_folder, dirs_exist_ok=True)
else:
    print(f"Source folder {source_folder} does not exist. Please check the source path.")


Source folder /content/lora_model does not exist. Please check the source path.


## Loading Checkpoint for further training

The environment kept running out of CUDA memory, so the session has to be restarted before training can continue.

## Fetching Checkpoint

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Define source and destination paths
source_folder = '/content/drive/My Drive/DL/dropout_0.1/checkpoints3'
destination_folder = '/content/outputs'

# Copy files from source to destination
if os.path.exists(source_folder):
    # dirs_exist_ok=True merges into an existing destination folder instead of raising an error
    shutil.copytree(source_folder, destination_folder, dirs_exist_ok=True)
else:
    print(f"Source folder {source_folder} does not exist. Please check the source path.")

In [None]:
import os
from unsloth import FastLanguageModel  # Re-import after the session restart

# Load the tokenizer and model from the local copy
path_to_checkpoint = os.path.join(destination_folder, "checkpoint-bestmodel")
model, tokenizer = FastLanguageModel.from_pretrained(path_to_checkpoint)

==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

## Single-Example Inference Test

## SFT Continued

In [None]:
if torch.cuda.is_available():
    print("Clearing CUDA memory...")
    clear_cuda()
    print("CUDA memory cleared.")

Clearing CUDA memory...
CUDA memory cleared.


In [None]:
# Define training args again

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

max_steps = 1300

training_args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        warmup_steps = int(0.1*max_steps),
        num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = max_steps,
        learning_rate = 4e-5,  # 2e-4, 3e-4, 5e-4
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        # eval_steps=1,  # Evaluate after every logging step; adjust as needed
        # eval_strategy="steps",  # Evaluate at each logging step
        optim = "adamw_8bit",
        weight_decay = 0.02,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
        eval_strategy="steps",
        eval_steps=10,
        save_steps=50,            # Save a checkpoint every 50 steps
        save_total_limit=3,
        load_best_model_at_end=True,
        resume_from_checkpoint=path_to_checkpoint,  # Recorded here only; resumption is requested in trainer.train()
    )

In [None]:
max_seq_length=2048

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,     #.select(range(2000)),
    eval_dataset = eval_dataset,       #.select(range(2000)),
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args,
)

Map (num_proc=4):   0%|          | 0/995000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/5000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

trainer_stats = trainer.train(resume_from_checkpoint=True)

  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 995,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 8
\        /    Total batch size = 32 | Total steps = 1,300
 "-____-"     Number of trainable parameters = 83,886,080
  checkpoint_rng_state = torch.load(rng_file)


Step,Training Loss,Validation Loss
810,0.5645,0.508793
820,0.5273,0.517694
830,0.473,0.522682
840,0.5969,0.525196
850,0.5395,0.526424
860,0.5775,0.526008
870,0.5117,0.528223
880,0.4689,0.525705
890,0.5055,0.524102
900,0.5436,0.523184


config.json:   0%|          | 0.00/926 [00:00<?, ?B/s]


In [None]:
def save_model_checkpoint(trainer, tokenizer, training_args, output_dir):
    """
    Save a complete training checkpoint for SFTTrainer including RNG states and trainer state

    Args:
        trainer: SFTTrainer instance
        tokenizer: The tokenizer to save
        training_args: Training arguments/config to save
        output_dir: Directory to save the checkpoint
    """
    try:
        # Create the output directory if it doesn't exist
        os.makedirs(output_dir, exist_ok=True)

        # Save the model state
        trainer.model.save_pretrained(output_dir)

        # Save the tokenizer
        if tokenizer is not None:
            tokenizer.save_pretrained(output_dir)

        # Save optimizer state
        if trainer.optimizer is not None:
            torch.save(trainer.optimizer.state_dict(),
                      os.path.join(output_dir, 'optimizer.pt'))

        # Save scheduler state
        if hasattr(trainer, 'lr_scheduler') and trainer.lr_scheduler is not None:
            torch.save(trainer.lr_scheduler.state_dict(),
                      os.path.join(output_dir, 'scheduler.pt'))

        # Save RNG states
        rng_states = {
            'python': random.getstate(),
            'numpy': np.random.get_state(),
            'torch': torch.get_rng_state(),
            'cuda': torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None
        }
        torch.save(rng_states, os.path.join(output_dir, 'rng_state.pth'))

        # Save trainer state
        trainer_state = {
            'best_metric': trainer.state.best_metric,
            'best_model_checkpoint': trainer.state.best_model_checkpoint,
            'epoch': trainer.state.epoch,
            'global_step': trainer.state.global_step,
            'log_history': trainer.state.log_history,
            'total_flos': trainer.state.total_flos,
            'trial_name': trainer.state.trial_name,
            'trial_params': trainer.state.trial_params
        }
        # Use a context manager so the file is flushed and closed
        with open(os.path.join(output_dir, 'trainer_state.json'), 'w') as f:
            json.dump(trainer_state, f, indent=2)

        # Save training arguments
        if training_args is not None:
            torch.save(training_args,
                      os.path.join(output_dir, 'training_args.bin'))

        print(f"Checkpoint saved successfully to {output_dir}")

    except Exception as e:
        print(f"Error saving checkpoint: {str(e)}")
        raise


In [None]:
save_model_checkpoint(
    trainer=trainer,
    tokenizer=tokenizer,
    training_args=training_args,
    output_dir="outputs/checkpoint-bestmodel"
)

Checkpoint saved successfully to outputs/checkpoint-bestmodel
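The hand-saved folder above is named `checkpoint-bestmodel`, whereas the periodic checkpoints written during training are named `checkpoint-<step>`. A lookup that picks the latest checkpoint by step number would therefore skip the hand-saved one. The sketch below illustrates such a name-based lookup. It is an assumption about the naming convention, not the library's own code, and `latest_numbered_checkpoint` is a hypothetical helper.

```python
import os
import re
import tempfile

# Matches only folders whose suffix is a step number
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")

def latest_numbered_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demonstration with empty folders in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-50", "checkpoint-800", "checkpoint-bestmodel"):
        os.makedirs(os.path.join(tmp, name))
    found = os.path.basename(latest_numbered_checkpoint(tmp))

print(found)  # checkpoint-800
```

To resume from the hand-saved folder specifically, its path can be passed explicitly, for example `trainer.train(resume_from_checkpoint="outputs/checkpoint-bestmodel")`, provided that folder holds everything the trainer expects to find in a checkpoint.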


## Train model in chunks

In [None]:
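from transformers.trainer_utils import get_last_checkpoint

def train_model_in_chunks(trainer, total_steps, chunk_size=200):
    out_dir = trainer.args.output_dir
    # The Trainer saves checkpoints as "checkpoint-<step>"; pick up the newest one, if any.
    last_ckpt = get_last_checkpoint(out_dir) if os.path.isdir(out_dir) else None
    steps_completed = int(last_ckpt.rsplit("-", 1)[-1]) if last_ckpt else 0

    while steps_completed < total_steps:
        # train() stops at args.max_steps, so move that limit forward one chunk at a time.
        # Note: schedules derived from max_steps (e.g. "linear") shift when the limit changes.
        trainer.args.max_steps = min(steps_completed + chunk_size, total_steps)
        trainer.train(resume_from_checkpoint=last_ckpt)

        # Count the steps actually taken instead of assuming a full chunk ran.
        steps_completed = trainer.state.global_step
        print(f"Completed {steps_completed}/{total_steps} steps")

        last_ckpt = get_last_checkpoint(out_dir)
        if steps_completed < total_steps and last_ckpt is None:
            raise RuntimeError(f"No checkpoint found in {out_dir} to resume from")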
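
Chunked training only advances if each call to `trainer.train()` is given a later stopping point than the previous one. The sketch below computes those stopping points on their own, without touching the Trainer; `chunk_targets` is a hypothetical helper, not something provided by `transformers` or the notebook. It assumes the step limit would be applied by assigning each target to `trainer.args.max_steps` before the corresponding `train()` call.

```python
def chunk_targets(total_steps, chunk_size, start_step=0):
    """Return the cumulative step at which each training chunk should stop."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    targets = []
    step = start_step
    while step < total_steps:
        # The final chunk is shortened so the run never overshoots total_steps.
        step = min(step + chunk_size, total_steps)
        targets.append(step)
    return targets


print(chunk_targets(1000, 200))
print(chunk_targets(1000, 300, start_step=200))
```

Passing `start_step` covers the case where a checkpoint already exists and the run picks up partway through.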



In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 6,
        warmup_steps = 5,
        num_train_epochs = 1, # Overridden by max_steps below; remove max_steps to train for one full epoch.
        max_steps = 400,
        learning_rate = 3e-4,  # 2e-4, 3e-4, 5e-4
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        # eval_steps=1,  # Evaluate after every logging step; adjust as needed
        # eval_strategy="steps",  # Evaluate at each logging step
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Disables logging integrations; set to "wandb" etc. to enable them
    )

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    # eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args,
)

max_steps is given, it will override any value given in num_train_epochs
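
Because `max_steps` overrides `num_train_epochs`, it helps to check how much of the dataset a run of this length actually covers. The arithmetic below is a rough sketch using the values from the `TrainingArguments` cell and the 995,000 training examples reported by this run; the single-GPU setting is an assumption, and the per-epoch step count is an approximation that may differ slightly from the Trainer's own rounding.

```python
import math

# Values copied from the TrainingArguments cell and the run's reported dataset size.
per_device_train_batch_size = 4
gradient_accumulation_steps = 6
num_gpus = 1  # assumption: single-GPU run
max_steps = 400
num_examples = 995_000

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch_size * max_steps
# Approximation; the Trainer's own per-epoch step count may round differently.
approx_steps_per_epoch = math.ceil(num_examples / effective_batch_size)

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen} "
      f"({examples_seen / num_examples:.2%} of the data)")
print(f"Approximate optimizer steps per epoch: {approx_steps_per_epoch}")
```

With these settings each optimizer step consumes 24 examples, so 400 steps cover 9,600 examples, which is under 1% of the training set.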


In [None]:
train_model_in_chunks(trainer, total_steps=1000)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 995,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 6
\        /    Total batch size = 24 | Total steps = 200
 "-____-"     Number of trainable parameters = 83,886,080


Step,Training Loss
1,1.6882
2,1.7452
3,1.6628
4,1.3121
5,1.1146
6,0.8569
7,0.7184
8,0.7848
9,0.7646
10,0.7591


Completed 200/1000 steps


Step,Training Loss
201,0.5172


Completed 400/1000 steps


Step,Training Loss
202,0.6268


Completed 600/1000 steps


Step,Training Loss
203,0.6087


Completed 800/1000 steps


Step,Training Loss
204,0.614


Completed 1000/1000 steps
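
The configuration uses `lr_scheduler_type = "linear"` with `warmup_steps = 5` and `learning_rate = 3e-4`, so the learning rate at any step depends on the total number of steps the schedule is built for. The sketch below is written from scratch rather than calling `transformers`, and assumes the common definition of this schedule: a linear ramp from zero to the peak over the warmup steps, then a straight line down to zero at the final step. The function name `linear_lr` is hypothetical.

```python
def linear_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


peak_lr, warmup = 3e-4, 5
for horizon in (200, 1000):
    # Same steps, different schedule length: the decay slope changes with the horizon.
    rates = [linear_lr(s, peak_lr, warmup, horizon) for s in (0, 5, 100, 200)]
    print(horizon, rates)
```

Under this definition a schedule built for 200 steps has already decayed to zero at step 200, while one built for 1000 steps is still near its peak there. This is worth keeping in mind when a run is split into chunks with a step limit shorter than the intended total.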
