# Our First Step: Run the Original Model

Our experiments mainly use `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`, a reasoning model distilled from DeepSeek-R1 and built on Qwen2.5-Math-1.5B.
We plan to apply Coconut to this distilled model and compare its performance with the model's original CoT reasoning.

## Import Dependencies and the Model

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device_map='auto', torch_dtype=torch.float16)


## Load Datasets

In [2]:
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("open-r1/OpenThoughts-114k-math")

ds['train']

Dataset({
    features: ['source', 'problem', 'solution', 'messages', 'system', 'conversations', 'generated_token_count', 'correct'],
    num_rows: 89120
})

## Modify the CoT Chain to use `<sot>` and `<eot>`

In DeepSeek R1, the CoT chain is wrapped in `<think>` and `</think>` tags, but we will use the special tokens `<sot>` and `<eot>` instead.

The reason is that Coconut needs a dedicated marker to decide whether the chain has terminated. With conventional CoT, the chain is plain text, so we can parse the generated output and look for `</think>` to find the end of the chain. This does not carry over to Coconut.

Hence, we will use `<sot>` and `<eot>`, each a single token, to mark the start and end of the chain.
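As a minimal sketch of what "determining the end of the chain" could look like once the markers are single tokens, the helper below works on token IDs instead of parsing text. The function names and the example IDs are hypothetical and not part of the original notebook; the real IDs would come from `tokenizer.convert_tokens_to_ids` after the special tokens are added.

```python
from typing import List, Optional

def thought_span(token_ids: List[int], sot_id: int, eot_id: int) -> Optional[range]:
    """Return the positions strictly between the first <sot> and the next <eot>.

    Returns None when no <sot> is present. If <sot> is present but <eot> is not,
    the chain is still open and every position after <sot> counts as thought.
    """
    try:
        start = token_ids.index(sot_id)
    except ValueError:
        return None  # the chain never started
    try:
        end = token_ids.index(eot_id, start + 1)
    except ValueError:
        end = len(token_ids)  # chain not terminated yet
    return range(start + 1, end)

def chain_terminated(token_ids: List[int], sot_id: int, eot_id: int) -> bool:
    """True when an <eot> appears after the first <sot>."""
    if sot_id not in token_ids:
        return False
    start = token_ids.index(sot_id)
    return eot_id in token_ids[start + 1:]

# Placeholder IDs for illustration only (assumption, not taken from the tokenizer).
SOT_ID, EOT_ID = 900001, 900002
example_ids = [11, 12, SOT_ID, 21, 22, EOT_ID, 31]
print(list(thought_span(example_ids, SOT_ID, EOT_ID)))
print(chain_terminated(example_ids, SOT_ID, EOT_ID))
```

Comparing integer IDs avoids any dependence on how the markers are rendered as text, which is the property the paragraph above relies on.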

## Add Special Tokens

In [3]:
# SOT: Start of Thought, EOT: End of Thought, SOS: Start of Solution, EOS: End of Solution
special_tokens_dict = {'additional_special_tokens': ['<sot>', '<eot>', '<sos>', '<eos>']}
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

Embedding(151669, 1536)

### Preprocess the Dataset

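We will preprocess the fine-tuning dataset, `ServiceNow-AI/R1-Distill-SFT`, replacing `<think>`/`</think>` with `<sot>`/`<eot>` around the CoT chain and wrapping each `\boxed{...}` answer in `<sos>`/`<eos>`.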

In [4]:
import re

ft_ds = load_dataset('ServiceNow-AI/R1-Distill-SFT', 'v1')

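def replace_tags(example):
    text = example['reannotated_assistant_content']
    # DOTALL lets `.` match newlines, since reasoning chains span multiple lines
    text = re.sub(r'<think>(.*?)</think>', r'<sot>\1<eot>', text, flags=re.DOTALL)
    # Note: the non-greedy match stops at the first `}`, so nested braces such as \boxed{\frac{1}{2}} are cut short
    text = re.sub(r'\\boxed{(.*?)}', r'<sos>\1<eos>', text)
    example['reannotated_assistant_content'] = text
    return example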


ft_ds = ft_ds.map(replace_tags)
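One caveat with a regex for `\boxed{...}` is that answers often contain nested braces. The sketch below is one possible alternative, not the notebook's method: it counts brace depth to find the matching close. The helper names `replace_think` and `replace_boxed` are hypothetical, and escaped braces such as `\{` are not handled.

```python
import re

def replace_think(text: str) -> str:
    """Swap <think>...</think> for <sot>...<eot>, allowing multi-line chains."""
    return re.sub(r"<think>(.*?)</think>", r"<sot>\1<eot>", text, flags=re.DOTALL)

def replace_boxed(text: str, open_tag: str = "<sos>", close_tag: str = "<eos>") -> str:
    """Replace each \\boxed{...} with open_tag...close_tag, respecting nested braces."""
    marker = "\\boxed{"
    out = []
    i = 0
    while True:
        start = text.find(marker, i)
        if start == -1:
            out.append(text[i:])  # no more boxed answers
            break
        out.append(text[i:start])
        j = start + len(marker)
        depth = 1
        # Walk forward until the brace opened by \boxed{ is closed.
        while j < len(text) and depth > 0:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth != 0:
            out.append(text[start:])  # unbalanced braces: leave the rest untouched
            break
        out.append(open_tag + text[start + len(marker): j - 1] + close_tag)
        i = j
    return "".join(out)

sample = "<think>\nstep 1\nstep 2\n</think>\nAnswer: \\boxed{\\frac{1}{2}}"
print(replace_boxed(replace_think(sample)))
```

A function like this could be dropped into a `map` callback in place of the second `re.sub`, at the cost of a little more code.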


Because the full dataset is large (1,679,162 examples), we take a random subset of 1,024 examples to test the training pipeline.

In [10]:
train_subset = ft_ds['train'].shuffle(seed=42).select(range(1024))
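def tokenize_function(examples):
    tokens = tokenizer(
        examples["reannotated_assistant_content"],  # Column containing text
        padding="max_length",  # Ensure consistent input size
        truncation=True,  # Avoid overflow
        max_length=8192,  # Adjust based on model's context window
    )
    # Causal LM training needs labels, otherwise the model returns no loss.
    # Padding positions are set to -100 so the loss ignores them.
    tokens["labels"] = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens
tokenized_ds = train_subset.map(tokenize_function, batched=True)
tokenized_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])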
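Padding every example to 8,192 tokens is simple but can spend most of each batch on padding. The sketch below is a rough way to estimate that cost from a list of token lengths. The function names and the example lengths are made up for illustration and do not come from the dataset.

```python
from typing import List

def padded_fraction_fixed(lengths: List[int], max_length: int) -> float:
    """Fraction of positions that are padding when every row is padded to max_length."""
    if not lengths:
        return 0.0
    clipped = [min(n, max_length) for n in lengths]  # truncation=True clips long rows
    total = max_length * len(clipped)
    return 1 - sum(clipped) / total

def padded_fraction_dynamic(lengths: List[int], max_length: int, batch_size: int) -> float:
    """Same measure, but each batch is padded only to its own longest row."""
    if not lengths:
        return 0.0
    clipped = [min(n, max_length) for n in lengths]
    total = 0
    for i in range(0, len(clipped), batch_size):
        batch = clipped[i:i + batch_size]
        total += max(batch) * len(batch)
    return 1 - sum(clipped) / total

# Hypothetical token counts, for illustration only.
example_lengths = [1024, 2048, 4096, 8192]
print(padded_fraction_fixed(example_lengths, max_length=8192))
print(padded_fraction_dynamic(example_lengths, max_length=8192, batch_size=2))
```

With a per-device batch size of 1, per-batch padding would add no padding at all, so padding at collation time instead of at tokenization time may be worth considering.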


### Try to Train the Model

In [11]:
from transformers import TrainingArguments, Trainer
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

training_args = TrainingArguments(
    output_dir="./stage_1",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,  # 3 epochs to make use of the 1,024 examples
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
    tokenizer=tokenizer,
)

trainer.train()


KeyboardInterrupt: 
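To put the training settings in perspective, the arithmetic below estimates how many optimizer updates the configuration implies. It is a back-of-the-envelope sketch assuming a single device. The helper `optimizer_steps` is hypothetical, and the exact rounding for dataset sizes that do not divide evenly may differ between `transformers` versions.

```python
import math

def optimizer_steps(num_examples: int, per_device_batch_size: int,
                    grad_accum_steps: int, num_epochs: int, num_devices: int = 1) -> int:
    """Estimate the total number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens after every grad_accum_steps batches.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * num_epochs

# Values taken from the TrainingArguments above; a single device is an assumption.
total = optimizer_steps(num_examples=1024, per_device_batch_size=1,
                        grad_accum_steps=8, num_epochs=3)
effective_batch = 1 * 8  # per-device batch size x accumulation steps
warmup_share = 50 / total
print(total, effective_batch, round(warmup_share, 3))
```

Under these assumptions the run performs 384 updates with an effective batch size of 8, so the 50 warmup steps cover roughly 13% of training.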