# Training a DeepSeek Model from Scratch

This notebook demonstrates how to train the custom `DeepSeekLM` model (imported from `model.py`) on a custom dataset. It includes:
1. Training for 10,000 steps.
2. Generating 5 sample outputs.


In [18]:
# Install PyTorch with CUDA support for Windows (assuming CUDA 12.1)
# If this fails or you have a different CUDA version, check https://pytorch.org/get-started/locally/
!pip uninstall -y torch torchvision
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
!pip install transformers datasets

Found existing installation: torch 2.5.1+cu121
Uninstalling torch-2.5.1+cu121:
  Successfully uninstalled torch-2.5.1+cu121
Found existing installation: torchvision 0.20.1+cu121
Uninstalling torchvision-0.20.1+cu121:
  Successfully uninstalled torchvision-0.20.1+cu121


Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
  Using cached https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-win_amd64.whl (2449.3 MB)
Collecting torchvision
  Using cached https://download.pytorch.org/whl/cu121/torchvision-0.20.1%2Bcu121-cp312-cp312-win_amd64.whl (6.1 MB)
Installing collected packages: torch, torchvision
Successfully installed torch-2.5.1+cu121 torchvision-0.20.1+cu121



[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip







> **IMPORTANT**: After running the cell above, you **MUST** restart the Jupyter Kernel for the changes to take effect. Go to **Kernel > Restart Kernel** in the menu.

In [1]:
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoConfig
from datasets import load_dataset
from model import DeepSeekLM  # Import our custom model
import os

print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cpu":
    print("WARNING: You are running on CPU. Training will be very slow. Please ensure you have a GPU and the correct PyTorch version installed.")

PyTorch Version: 2.5.1+cu121
CUDA Available: True
CUDA Device: NVIDIA RTX 5000 Ada Generation Laptop GPU
CUDA Version: 12.1
Using device: cuda


## 1. Load Model and Tokenizer

In [3]:
# Load the custom tokenizer, falling back to the SmolLM2 tokenizer if it is missing
tokenizer_path = "./custom_tokenizer1"
model_id = "HuggingFaceTB/SmolLM2-135M"  # Also provides the base config below

if os.path.exists(tokenizer_path):
    print(f"Loading custom tokenizer from {tokenizer_path}")
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
else:
    print(f"Custom tokenizer not found at {tokenizer_path}.")
    print(f"Falling back to default tokenizer ({model_id})...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

# Ensure pad_token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"Padding token set to: {tokenizer.pad_token}")

# Start from the SmolLM2-135M config; the DeepSeek-specific fields are added below
config = AutoConfig.from_pretrained(model_id)

# Update config vocab size to match tokenizer
config.vocab_size = len(tokenizer)
print(f"Model vocab size updated to: {config.vocab_size}")

# DeepSeek Architecture Configuration
config.q_lora_rank = 128
config.kv_lora_rank = 128
config.nope_head_dim = 32
config.rope_head_dim = 32
config.num_shared_experts = 1
config.num_routed_experts = 8
config.num_active_experts = 2
config.expert_intermediate_size = 384

print("DeepSeek Architecture Configuration:")
print(f"Q LoRA Rank: {config.q_lora_rank}")
print(f"KV LoRA Rank: {config.kv_lora_rank}")
print(f"NoPE Head Dim: {config.nope_head_dim}")
print(f"RoPE Head Dim: {config.rope_head_dim}")
print(f"Num Shared Experts: {config.num_shared_experts}")
print(f"Num Routed Experts: {config.num_routed_experts}")
print(f"Num Active Experts: {config.num_active_experts}")
print(f"Expert Intermediate Size: {config.expert_intermediate_size}")

print(config)

# Initialize model from scratch
model = DeepSeekLM(config).to(device)
print("Model initialized with DeepSeek Architecture.")

Custom tokenizer not found at ./custom_tokenizer1.
Falling back to default tokenizer (HuggingFaceTB/SmolLM2-135M)...
Padding token set to: <|endoftext|>
Model vocab size updated to: 49152
DeepSeek Architecture Configuration:
Q LoRA Rank: 128
KV LoRA Rank: 128
NoPE Head Dim: 32
RoPE Head Dim: 32
Num Shared Experts: 1
Num Routed Experts: 8
Num Active Experts: 2
Expert Intermediate Size: 384
LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dtype": "bfloat16",
  "eos_token_id": 0,
  "expert_intermediate_size": 384,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 576,
  "initializer_range": 0.041666666666666664,
  "intermediate_size": 1536,
  "is_llama_config": true,
  "kv_lora_rank": 128,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "nope_head_dim": 32,
  "num_active_experts": 2,
  "num_attention_heads": 9,
  "num_hidden_layers": 30,
  "num_key_value

## 2. Prepare Dataset (Chunked)
We concatenate all tokenized text and split it into fixed-length chunks of `block_size` tokens, so the model can learn context that spans line boundaries.

In [4]:
# Load local text if it exists; otherwise fall back to a public dataset
from datasets import Dataset

data_file = "input-1.txt"
if os.path.exists(data_file):
    with open(data_file, "r", encoding="utf-8") as f:
        full_text = f.read()
    print(f"Loaded local data from {data_file}")
    dataset = Dataset.from_dict({"text": [full_text]})
else:
    print(f"Data file {data_file} not found. Downloading default dataset from Hugging Face...")
    # Use a small subset of a public dataset (load_dataset was imported in the first code cell)
    dataset = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train[:1%]")
    print("Loaded default dataset from Hugging Face.")

# Double check tokenizer padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

block_size = 256 # Context window size

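from itertools import chain

def group_texts(examples):
    # Concatenate every tokenized field (e.g. input_ids, attention_mask) across the batch.
    # chain.from_iterable avoids the quadratic cost of sum(list_of_lists, []).
    concatenated_examples = {k: list(chain.from_iterable(examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples["input_ids"])

    # Drop the trailing remainder so every chunk holds exactly block_size tokens
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # Labels are a copy of the inputs; the one-position shift happens in the training loop
    result["labels"] = result["input_ids"].copy()
    return result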

def tokenize_function(examples):
    return tokenizer(examples["text"])

# Tokenize all text
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Group into chunks
lm_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,
)

lm_dataset = lm_dataset.with_format("torch")

# Create dataloader
train_dataloader = DataLoader(lm_dataset, batch_size=16, shuffle=True, pin_memory=True)
print(f"Dataset prepared. Number of chunks: {len(lm_dataset)}")

Loaded local data from input-1.txt


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (341094 > 8192). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Dataset prepared. Number of chunks: 1332
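
The chunking step can be checked by hand. The sketch below is a simplified stand-in for `group_texts`, not the notebook's own function: the helper name `chunk_tokens` is hypothetical, it handles a single flat token list, and it always drops a trailing remainder. It runs on a toy input and then repeats the arithmetic for this run, taking the token count of 341094 from the tokenizer warning above.

```python
def chunk_tokens(token_ids, block_size):
    """Split a flat token list into full blocks, dropping any trailing remainder."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]

# Toy input: 10 token ids with a block size of 4 gives 2 full chunks, 2 tokens dropped
toy_ids = list(range(10))
toy_chunks = chunk_tokens(toy_ids, block_size=4)
print(toy_chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Same arithmetic for the run above: 341094 tokens with block_size = 256
num_tokens, block_size = 341094, 256
num_chunks = num_tokens // block_size
dropped = num_tokens - num_chunks * block_size
print(num_chunks, dropped)  # 1332 102
```

The result agrees with the reported 1332 chunks; under these assumptions the last 102 tokens of the file are never seen during training.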


In [5]:
print(model)

DeepSeekLM(
  (embed_tokens): Embedding(49152, 576)
  (layers): ModuleList(
    (0-29): 30 x Block(
      (self_attn): DeepSeekMLA(
        (kv_down_proj): Linear(in_features=576, out_features=128, bias=False)
        (kv_norm): RMSNorm()
        (w_uk): Linear(in_features=128, out_features=288, bias=False)
        (w_ur): Linear(in_features=128, out_features=288, bias=False)
        (w_uv): Linear(in_features=128, out_features=576, bias=False)
        (q_down_proj): Linear(in_features=576, out_features=128, bias=False)
        (q_norm): RMSNorm()
        (w_uq): Linear(in_features=128, out_features=288, bias=False)
        (w_qr): Linear(in_features=128, out_features=288, bias=False)
        (o_proj): Linear(in_features=576, out_features=576, bias=False)
      )
      (mlp): DeepSeekMoE(
        (shared_experts): ModuleList(
          (0): DeepSeekExpertLayer(
            (gate_proj): Linear(in_features=576, out_features=384, bias=False)
            (up_proj): Linear(in_features=576, 
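
The projection widths in the printout above appear to follow directly from the config values. The sketch below recomputes them as plain arithmetic. The layer names are copied from the printout, but the formulas are an inference from the printed shapes, not taken from `model.py`; in particular, the value head width is assumed to be `nope_head_dim + rope_head_dim`.

```python
# Values from the config printed earlier in the notebook
hidden_size = 576
num_heads = 9
q_lora_rank = 128
kv_lora_rank = 128
nope_head_dim = 32
rope_head_dim = 32

# Assumption: each value head is as wide as a full query/key head (NoPE + RoPE parts)
v_head_dim = nope_head_dim + rope_head_dim

# Expected (in_features, out_features) for each projection in DeepSeekMLA
expected_shapes = {
    "kv_down_proj": (hidden_size, kv_lora_rank),
    "w_uk": (kv_lora_rank, num_heads * nope_head_dim),
    "w_ur": (kv_lora_rank, num_heads * rope_head_dim),
    "w_uv": (kv_lora_rank, num_heads * v_head_dim),
    "q_down_proj": (hidden_size, q_lora_rank),
    "w_uq": (q_lora_rank, num_heads * nope_head_dim),
    "w_qr": (q_lora_rank, num_heads * rope_head_dim),
    "o_proj": (num_heads * v_head_dim, hidden_size),
}

for name, (in_features, out_features) in expected_shapes.items():
    print(f"{name}: in_features={in_features}, out_features={out_features}")
```

Each computed pair matches the corresponding `Linear` layer shown by `print(model)`, for example 9 heads x 32 dims = 288 for `w_uk` and 9 x 64 = 576 for `w_uv`.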

## 3. Training Loop
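
The training cell in this section uses next-token prediction: the logits at position t are scored against the token at position t+1, so the logits are trimmed at the end and the labels at the start. As a rough illustration of that shift, here is a minimal pure-Python sketch. The function name `next_token_loss` and the toy numbers are made up for this example; the notebook itself does the same thing with `torch.nn.CrossEntropyLoss` on tensors.

```python
import math

def next_token_loss(logits, token_ids):
    """Mean cross-entropy and accuracy for predicting token t+1 from position t.

    logits: one list of vocabulary scores per position.
    token_ids: the input sequence, which also serves as the labels.
    """
    shift_logits = logits[:-1]    # the last position has no next token to predict
    shift_labels = token_ids[1:]  # the first token is never a prediction target
    total_loss, correct = 0.0, 0
    for scores, label in zip(shift_logits, shift_labels):
        # Numerically stable log-sum-exp, then negative log-likelihood of the label
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total_loss += log_sum_exp - scores[label]
        # Greedy prediction (first index of the maximum score)
        correct += int(scores.index(max_score) == label)
    n = len(shift_labels)
    return total_loss / n, correct / n

# Toy sequence over a 3-token vocabulary
token_ids = [0, 2, 1]
logits = [
    [0.0, 0.0, 5.0],  # position 0 strongly predicts token 2 (the correct next token)
    [0.0, 0.0, 0.0],  # position 1 is uniform; argmax picks token 0, but the target is 1
    [9.0, 0.0, 0.0],  # position 2 is ignored because nothing follows it
]
loss, acc = next_token_loss(logits, token_ids)
print(f"loss={loss:.4f} acc={acc:.2f}")
```

Only two of the three positions contribute to the loss here, which is why `shift_labels.numel()` in the training cell is smaller than the number of input tokens.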

In [6]:
# Optimization: Enable TF32 for faster matrix multiplications on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

if os.name != 'nt':
    print("Compiling model with torch.compile...")
    model = torch.compile(model)
else:
    print("Skipping torch.compile on Windows to avoid potential compatibility issues.")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=torch.cuda.is_available())
loss_fn = torch.nn.CrossEntropyLoss()

# Optimization: Mixed Precision Training
scaler = torch.amp.GradScaler('cuda')

def generate_text(model, tokenizer, prompt="The meaning of life is", max_new_tokens=50, temperature=0.7, top_k=50):
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    input_ids = inputs.input_ids
    
    for _ in range(max_new_tokens):
        with torch.no_grad():
            with torch.amp.autocast('cuda'):
                logits = model(input_ids)
                if isinstance(logits, tuple):
                    logits = logits[0]
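            # Take the logits at the last position; a lower temperature sharpens the distribution
            next_token_logits = logits[:, -1, :] / temperature
            
            # Keep only the top_k most likely tokens, renormalize over them, and sample one
            top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k, dim=-1)
            probs = torch.nn.functional.softmax(top_k_logits, dim=-1)
            next_token_index = torch.multinomial(probs, num_samples=1)
            # Map the sampled position within the top_k list back to a vocabulary id
            next_token = top_k_indices.gather(-1, next_token_index)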
            
            input_ids = torch.cat([input_ids, next_token], dim=1)
            
            if next_token.item() == tokenizer.eos_token_id:
                break
            
    print(f"Generated: {tokenizer.decode(input_ids[0], skip_special_tokens=True)}")
    model.train()

steps = 0
max_steps = 10000
save_path = "checkpoint_10000.pt"

model.train()
print("Starting training...")

import time
start_time = time.time()
total_tokens = 0

# Cycle through the dataloader (multiple epochs) until max_steps is reached
while steps < max_steps:
    for batch in train_dataloader:
        if steps >= max_steps:
            break
            
        input_ids = batch["input_ids"].to(device)
        labels = input_ids.clone()
        
        optimizer.zero_grad()
        
        with torch.amp.autocast('cuda'):
            outputs = model(input_ids)
            if isinstance(outputs, tuple):
                logits, expert_usage = outputs
            else:
                logits = outputs
                expert_usage = None
            
            # Shift by one so that the logits at position t are scored against token t+1
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            
            loss = loss_fn(shift_logits.view(-1, config.vocab_size), shift_labels.view(-1))
            
            # Calculate Accuracy
            with torch.no_grad():
                preds = torch.argmax(shift_logits, dim=-1)
                correct = (preds == shift_labels).sum()
                total = shift_labels.numel()
                accuracy = correct.float() / total
        
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        
        steps += 1
        total_tokens += input_ids.numel()
        
        if steps % 100 == 0:
            elapsed = time.time() - start_time
            tps = total_tokens / elapsed
            print(f"Step {steps}: Loss {loss.item():.4f} | Acc {accuracy.item():.4f} | TPS {tps:.2f}")
            if expert_usage is not None:
                top_experts = torch.topk(expert_usage, k=3)
                print(f"  Top Experts: {top_experts.indices.tolist()} (Counts: {top_experts.values.tolist()})")
            
        if steps % 1000 == 0:
            print(f"\n--- Step {steps} Generation ---")
            generate_text(model, tokenizer)
            print("-----------------------------\n")

# Save Checkpoint
torch.save(model.state_dict(), save_path)
print(f"Checkpoint saved to {save_path}")

print("\n--- Final Generations (5 Outputs) ---")
prompts = ["The future of AI is", "Once upon a time", "In a galaxy far away", "The secret to happiness is", "Python is a programming language that"]
for i, p in enumerate(prompts):
    print(f"\nGeneration {i+1}:")
    generate_text(model, tokenizer, prompt=p, max_new_tokens=100)
print("-------------------------------------")

Skipping torch.compile on Windows to avoid potential compatibility issues.
Starting training...
Step 100: Loss 4.6934 | Acc 0.2505 | TPS 4007.58
  Top Experts: [3, 1, 6] (Counts: [36809.0, 33519.0, 32624.0])
Step 200: Loss 4.0469 | Acc 0.3025 | TPS 4174.99
  Top Experts: [3, 1, 6] (Counts: [35844.0, 33733.0, 33348.0])
Step 300: Loss 3.7518 | Acc 0.3353 | TPS 4202.27
  Top Experts: [1, 3, 6] (Counts: [33874.0, 33787.0, 33483.0])
Step 400: Loss 3.4933 | Acc 0.3642 | TPS 4165.01
  Top Experts: [3, 1, 0] (Counts: [34359.0, 33296.0, 32160.0])
Step 500: Loss 3.1531 | Acc 0.4066 | TPS 4132.45
  Top Experts: [3, 1, 6] (Counts: [35348.0, 33343.0, 32253.0])
Step 600: Loss 2.2389 | Acc 0.5588 | TPS 4147.06
  Top Experts: [3, 6, 1] (Counts: [34546.0, 32690.0, 32171.0])
Step 700: Loss 1.9544 | Acc 0.6103 | TPS 4124.61
  Top Experts: [1, 3, 6] (Counts: [34124.0, 33445.0, 31511.0])
Step 800: Loss 1.4430 | Acc 0.7078 | TPS 4117.61
  Top Experts: [1, 6, 3] (Counts: [33659.0, 33184.0, 32558.0])
Step 900