# Our First Step: Run the Original Model

Our experiments mainly use `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`, a reasoning model distilled from DeepSeek-R1 onto Qwen2.5-Math-1.5B.
We plan to apply Coconut to this distilled model and compare its performance with the original CoT.

## Import Dependencies and the Model

In [9]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device_map='auto', torch_dtype=torch.float16)

## Load Datasets

In [3]:
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("open-r1/OpenThoughts-114k-math")

ds['train']

Dataset({
    features: ['source', 'problem', 'solution', 'messages', 'system', 'conversations', 'generated_token_count', 'correct'],
    num_rows: 89120
})

## Modify the CoT Chain to use `<sot>` and `<eot>`

In DeepSeek R1, the CoT chain is wrapped in `<think>` and `</think>` XML-style tags, but we will use the special tokens `<sot>` and `<eot>` instead.

The reason is that Coconut needs an unambiguous marker for where the chain ends. In ordinary CoT, we can simply parse the generated text and look for `</think>`. This does not carry over to Coconut, where the reasoning steps are continuous hidden states rather than decoded text.

Hence, we will use `<sot>` and `<eot>` to determine the start and end of the chain.
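
One practical consequence of dedicated special tokens is that each delimiter maps to a single token id, so the thought span can be located by comparing ids rather than parsing text. The following is a minimal sketch of that idea; the ids `SOT_ID` and `EOT_ID` and the helper `find_thought_span` are hypothetical and used only for illustration (real ids come from the tokenizer).

```python
from typing import List, Optional, Tuple

# Hypothetical token ids for illustration only; real ids come from the tokenizer.
SOT_ID = 1001
EOT_ID = 1002


def find_thought_span(
    token_ids: List[int], sot_id: int = SOT_ID, eot_id: int = EOT_ID
) -> Optional[Tuple[int, int]]:
    """Return (start, end) slice bounds of the tokens strictly between
    the first <sot> and the next <eot>, or None if either is missing."""
    try:
        start = token_ids.index(sot_id)
        end = token_ids.index(eot_id, start + 1)
    except ValueError:
        return None
    return start + 1, end


ids = [5, SOT_ID, 7, 8, EOT_ID, 9]
span = find_thought_span(ids)
if span is not None:
    print(ids[span[0]:span[1]])  # the tokens inside the thought span
```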

## Add Special Tokens

In [4]:
# SOT: Start of Thought, EOT: End of Thought, SOS: Start of Solution, EOS: End of Solution
special_tokens_dict = {'additional_special_tokens': ['<sot>', '<eot>', '<sos>', '<eos>']}
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

Embedding(151669, 1536)

### Preprocess the Dataset

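We preprocess the dataset so that the CoT chain is wrapped in `<sot>` and `<eot>` instead of `<think>` and `</think>`, and the final `\boxed{...}` answer is wrapped in `<sos>` and `<eos>`.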

In [5]:
import re

ft_ds = load_dataset('ServiceNow-AI/R1-Distill-SFT', 'v1')

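def replace_tags(example):
    text = example['reannotated_assistant_content']
    # re.DOTALL lets `.` match newlines, so multi-line reasoning blocks are captured
    text = re.sub(r'<think>(.*?)</think>', r'<sot>\1<eot>', text, flags=re.DOTALL)
    # Note: the lazy match stops at the first `}`, so nested braces such as \boxed{\frac{1}{2}} are cut short
    text = re.sub(r'\\boxed\{(.*?)\}', r'<sos>\1<eos>', text)
    example['reannotated_assistant_content'] = text
    return example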


ft_ds = ft_ds.map(replace_tags)
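
Two details of this kind of regex-based rewriting are easy to miss: `.` does not match newlines unless `re.DOTALL` is set, and a lazy `(.*?)` cannot balance nested braces. The sketch below illustrates both on a made-up sample string; `replace_boxed` is a hypothetical helper that counts braces, shown as one possible approach rather than the method used in this notebook.

```python
import re

sample = (
    "<think>\nFirst, add 1 and 2.\nThat gives 3.\n</think>\n"
    "The answer is \\boxed{\\frac{3}{1}}."
)

# Without re.DOTALL, `.` does not match newlines, so a multi-line block is left untouched.
no_dotall = re.sub(r"<think>(.*?)</think>", r"<sot>\1<eot>", sample)

# With re.DOTALL, the whole multi-line block is captured and rewritten.
with_dotall = re.sub(
    r"<think>(.*?)</think>", r"<sot>\1<eot>", sample, flags=re.DOTALL
)


def replace_boxed(text: str) -> str:
    """Replace each \\boxed{...} with <sos>...<eos>, matching nested braces by counting."""
    marker = "\\boxed{"
    out = []
    i = 0
    while True:
        start = text.find(marker, i)
        if start == -1:
            out.append(text[i:])
            break
        out.append(text[i:start])
        depth = 1
        j = start + len(marker)
        while j < len(text) and depth > 0:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth != 0:
            # Unbalanced braces: leave the remainder unchanged.
            out.append(text[start:])
            break
        out.append("<sos>" + text[start + len(marker):j - 1] + "<eos>")
        i = j
    return "".join(out)


print(replace_boxed(with_dotall))
```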

Because the dataset is large, we use a random subset of 1,024 examples to test the training pipeline.

In [6]:
train_subset = ft_ds['train'].shuffle(seed=42).select(range(1024))


def tokenize_function(examples):
    # No return_tensors here: `map` stores plain lists, and `set_format` below converts to tensors
    return tokenizer(
        examples["reannotated_assistant_content"],  # Column containing text
        padding="max_length",  # Pads every example to max_length tokens
        truncation=True,  # Avoid overflow
        max_length=8192,  # Adjust based on model's context window and available memory
    )


tokenized_datasets = train_subset.map(tokenize_function, batched=True)
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask"])

from torch.utils.data import DataLoader

batch_size = 2  # Adjust based on GPU memory
train_dataloader = DataLoader(tokenized_datasets, batch_size=batch_size, shuffle=True)
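
With `padding="max_length"`, every batch is 8,192 tokens wide regardless of how long the examples actually are. A common alternative is dynamic padding, where each batch is padded only to its longest sequence; in `transformers`, data collators such as `DataCollatorWithPadding` serve this purpose. The sketch below shows the idea in plain Python; `pad_to_longest` and the pad id `0` are hypothetical and only for illustration.

```python
from typing import Dict, List


def pad_to_longest(batch: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Pad each sequence to the length of the longest one in this batch
    and build the matching attention mask (1 = real token, 0 = padding)."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# pad_id=0 is an assumption for this toy example.
padded = pad_to_longest([[11, 12, 13], [21]], pad_id=0)
print(padded)
```

Such a function could be passed to a `DataLoader` as `collate_fn` after tokenizing without padding, so the batch width follows the data rather than a fixed maximum.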

### Try to train the model
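
The training loop in this section relies on gradient accumulation: each micro-batch loss is divided by the number of accumulation steps, and the optimizer only steps after that many backward passes. The toy sketch below, with a hypothetical scalar model `w * x` and made-up data, checks that accumulating scaled micro-batch gradients reproduces the full-batch gradient when micro-batches have equal size.

```python
from typing import List


def grad_mse(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient of 0.5 * mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: List[float], ys: List[float], micro_batch_size: int) -> float:
    """Sum per-micro-batch gradients, each scaled by 1 / number of micro-batches.
    Assumes len(xs) is a multiple of micro_batch_size."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for k in range(n_micro):
        lo = k * micro_batch_size
        hi = lo + micro_batch_size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]
full = grad_mse(1.0, xs, ys)
accumulated = accumulated_grad(1.0, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

For the equivalence to hold, gradients must not be cleared between micro-batches; they are reset only after the optimizer step.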

In [10]:
from torch.optim import AdamW

# Define optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training parameters
num_epochs = 3
gradient_accumulation_steps = 8  # Adjust for lower VRAM
scaler = torch.cuda.amp.GradScaler()  # Mixed precision training

model.train()
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
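        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        # Exclude padding positions from the loss (-100 is the ignored label index)
        labels = input_ids.masked_fill(attention_mask == 0, -100)

        # No zero_grad() here: gradients must accumulate across micro-batches

        with torch.cuda.amp.autocast():  # Mixed precision for efficiency
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss / gradient_accumulation_steps  # Normalize loss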

        scaler.scale(loss).backward()

        if (step + 1) % gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        if step % 100 == 0:  # Logging (undo the accumulation scaling to report the true loss)
            print(f"Epoch {epoch+1}, Step {step}, Loss: {loss.item() * gradient_accumulation_steps}")

    # Save checkpoint
    torch.save(model.state_dict(), f"finetuned_model_epoch{epoch+1}.pt")

print("Fine-tuning complete!")

  scaler = torch.cuda.amp.GradScaler()  # Mixed precision training
  with torch.cuda.amp.autocast():  # Mixed precision for efficiency


RuntimeError: MPS backend out of memory (MPS allocated: 13.10 GB, other allocations: 720.00 KB, max allowed: 18.13 GB). Tried to allocate 6.00 GB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
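
A rough back-of-the-envelope estimate helps put this error in context. With a batch of 2 sequences padded to 8,192 tokens and a vocabulary of 151,669 entries (the resized embedding shown earlier), the output logits alone form a very large tensor. The sketch below computes only that tensor's size, assuming 2 bytes per value (float16); weights, gradients, optimizer state, and activations come on top, so this is not a diagnosis of the exact allocation that failed. The helper `logits_bytes` is hypothetical.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int, bytes_per_value: int) -> int:
    """Size in bytes of a (batch_size, seq_len, vocab_size) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_value


GIB = 1024 ** 3

# Settings from this notebook: batch_size=2, max_length=8192, vocab=151669, float16 assumed.
full_length = logits_bytes(2, 8192, 151669, 2)

# Same batch with sequences capped at 1,024 tokens, for comparison.
shorter = logits_bytes(2, 1024, 151669, 2)

print(f"max_length=8192: {full_length / GIB:.2f} GiB")
print(f"max_length=1024: {shorter / GIB:.2f} GiB")
```

Because the size scales linearly with sequence length, reducing `max_length`, lowering the batch size, or padding per batch instead of to a fixed maximum are the most direct levers to try.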