# Fine-tune DeepSeek-R1 with Synthetic Reasoning Dataset

This notebook demonstrates how to fine-tune a distilled **DeepSeek-R1** model (DeepSeek-R1-Distill-Qwen-1.5B) on **synthetic reasoning data** for solving Python coding problems.

## 🧠 What is DeepSeek-R1?

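**DeepSeek-R1** is an open-weight reasoning model that:
- Is reported to perform comparably to OpenAI's o1 on math, coding, and reasoning benchmarks
- Works through problems step by step before answering
- Performs strongly on mathematics, coding, and other multi-step reasoning tasks
- Writes its reasoning chain explicitly inside `<think>` tags
- Is released under the MIT license, together with smaller distilled variants based on Qwen and Llama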

## 🎯 What You'll Learn

- How to generate synthetic reasoning datasets
- How to fine-tune DeepSeek-R1 with Unsloth
- How to work with reasoning chains
- How to solve coding problems with reasoning
- How to evaluate reasoning model performance

## 💡 Why Synthetic Data?

**Advantages:**
- ✅ **Data scarcity solution**: Create data for niche domains
- ✅ **Cost-effective**: No manual labeling required
- ✅ **Customizable**: Tailor to specific tasks
- ✅ **Scalable**: Generate thousands of examples
- ✅ **Quality control**: Regenerate or filter examples until they meet your bar

## 🔧 Requirements

- **GPU**: 16GB+ VRAM (for 1.5B model with 4-bit quantization)
- **Time**: 2-4 hours (dataset generation + training)
- **Model**: DeepSeek-R1-Distill-Qwen-1.5B (4-bit)
- **Library**: Unsloth (optimized fine-tuning)

## 📊 Key Stats

| Metric | Value |
|--------|-------|
| Base Model | DeepSeek-R1-Distill-Qwen-1.5B |
| Quantization | 4-bit (BitsAndBytes) |
| LoRA Rank | 16 |
| Dataset Size | 500 examples |
| Training Time | 1-2 hours |
| GPU Memory | ~10-12GB |

## 📖 Table of Contents

1. [Generate Reasoning Dataset](#1-generate-reasoning-dataset)
2. [Setup and Load Model](#2-setup-and-load-model)
3. [Prepare Training Data](#3-prepare-training-data)
4. [Fine-Tune with Unsloth](#4-fine-tune-with-unsloth)
5. [Inference and Evaluation](#5-inference-and-evaluation)

---

**Credits**: Based on the tutorial by [Sara Han Díaz](https://huggingface.co/blog/sdiazlor/deepseek-synthetic-data)

## 1. Generate Reasoning Dataset

### About Synthetic Data Generation

The dataset used here was created with the **Synthetic Data Generator**, a Hugging Face Space for producing synthetic training data.

**What it does:**
- Generates reasoning chains for coding problems
- Uses a DeepSeek-R1 model (here, DeepSeek-R1-Distill-Qwen-32B) to write the training examples
- Produces structured prompt-completion pairs
- Includes the step-by-step thinking process

### Options for Dataset Generation

**Option 1: Use Synthetic Data Generator (Recommended)**
- User-friendly Hugging Face Space
- No code required
- Configurable with different models
- Automatic upload to Hub

**Option 2: Use Pre-generated Dataset**
- We'll use `sdiazlor/python-reasoning-dataset`
- 500 examples already generated
- Ready for immediate use

**Option 3: Generate Programmatically**
- Use the Distilabel library
- Full control over generation
- Scriptable and reproducible

### Dataset Format

Each example has:
- **prompt**: Python coding problem
- **completion**: Reasoning chain + solution (with `<think>` tags)
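Because the reasoning and the final answer share one `completion` string, it can help to split them when inspecting or filtering the data. A minimal sketch, assuming each completion contains at most one `<think>...</think>` block (the helper `split_completion` is hypothetical, not part of the dataset tooling):

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_completion(completion: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer).

    Assumes at most one <think>...</think> block; if none is found,
    the reasoning is empty and the whole text is treated as the answer.
    """
    match = THINK_RE.search(completion)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer


example = "<think>\nUse a sieve.\n</think>\n\nHere is the code."
reasoning, answer = split_completion(example)
print(reasoning)  # Use a sieve.
print(answer)     # Here is the code.
```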

### Using Synthetic Data Generator

If you want to generate your own dataset:

1. Visit the [Synthetic Data Generator Space](https://huggingface.co/spaces/argilla/synthetic-data-generator)
2. Duplicate the Space
3. Set `MODEL_COMPLETION` to `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`
4. Navigate to "Chat Data" tab
5. Describe your task: "an assistant that solves python coding problems"
6. Configure:
   - System prompt
   - Number of examples (e.g., 500)
   - Temperature for generation (0.6-0.9)
7. Generate and export to Hub

**Note**: This process takes ~2 hours for 500 examples with DeepSeek-R1-Distill-Qwen-32B.

### Preview of Reasoning Dataset

Let's look at what a reasoning example looks like:

**Example Prompt:**
```
How can I get the prime numbers from 0 to 125?
```

**Example Completion (Reasoning Chain):**
```
<think>
Okay, so I need to find all the prime numbers between 0 and 125. 
Hmm, primes are numbers greater than 1 that have no divisors other 
than 1 and themselves.

I remember the Sieve of Eratosthenes is efficient for this.
Steps:
1. Create a boolean array for numbers 0 to 125
2. Mark 0 and 1 as non-prime
3. For each number from 2 to sqrt(125):
   - If it's still marked as prime, mark all its multiples as non-prime
4. Collect all numbers still marked as prime
</think>

To find all prime numbers from 0 to 125, we can use the Sieve of 
Eratosthenes algorithm:

[Python code here]
```
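The sieve steps in the reasoning chain above translate directly into a short function. This is an illustrative sketch of the algorithm, not the code from the dataset's completion (which is elided above):

```python
import math


def primes_up_to(limit: int) -> list[int]:
    """Return all primes in [0, limit] using the Sieve of Eratosthenes."""
    if limit < 2:
        return []
    # Step 1: one flag per number 0..limit, initially assumed prime
    is_prime = [True] * (limit + 1)
    # Step 2: 0 and 1 are not prime
    is_prime[0] = is_prime[1] = False
    # Step 3: only factors up to sqrt(limit) need to be checked
    for n in range(2, math.isqrt(limit) + 1):
        if is_prime[n]:
            # Multiples below n*n were already marked by smaller primes
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    # Step 4: collect the numbers still marked as prime
    return [n for n, flag in enumerate(is_prime) if flag]


primes = primes_up_to(125)
print(len(primes), primes[-1])  # 30 113
```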

**Key Features:**
- `<think>` tags show reasoning process
- Step-by-step logical thinking
- Followed by clear solution with code

## 2. Setup and Load Model

### Install Unsloth

**Unsloth** is an optimized library for LLM fine-tuning that:
- ✅ Reports roughly 2x faster training than standard Hugging Face setups
- ✅ Reports up to 70% less memory use
- ✅ Works on consumer GPUs
- ✅ Integrates with Hugging Face Transformers, PEFT, and TRL
- ✅ Supports recent models (DeepSeek, Qwen, Llama, etc.)

### Model Selection

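We'll use **DeepSeek-R1-Distill-Qwen-1.5B** (4-bit quantized):
- Much smaller than the 32B model used to generate the data
- Fits on a single GPU for training when quantized to 4 bits
- Already distilled from DeepSeek-R1, so it produces `<think>` reasoning out of the box
- Fine-tuning on domain data aims to make its answers more complete and accurate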

In [None]:
# Install Unsloth and dependencies
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q datasets trl

print("✅ All packages installed successfully!")

In [None]:
# Import necessary libraries
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Model configuration
MODEL = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit"

print(f"Model: {MODEL}")
print(f"\nThis is a 4-bit quantized version optimized for training")
print(f"Original model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

In [None]:
# Load the 4-bit pre-quantized model and tokenizer
print(f"Loading {MODEL}...")
print("This may take a few minutes...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL,
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)

print("\n✅ Model and tokenizer loaded successfully!")
print(f"\nMax sequence length: 2048 tokens")
print(f"Quantization: 4-bit (BitsAndBytes)")

### Configure LoRA Adapters

We'll add **LoRA (Low-Rank Adaptation)** adapters to enable efficient fine-tuning.

**LoRA Configuration:**
- **r=16**: Rank of the low-rank update matrices (balance between capacity and efficiency)
- **lora_alpha=16**: Scaling factor; the update is scaled by `lora_alpha / r` (here 1.0)
- **lora_dropout=0**: No dropout (Unsloth's kernels are optimized for a dropout of 0)
- **Target modules**: All attention and MLP projection layers

**Why these modules?**
- `q_proj, k_proj, v_proj, o_proj`: Attention layers
- `gate_proj, up_proj, down_proj`: MLP layers
- Adapting all linear layers usually works better than adapting attention alone, at a modest cost in trainable parameters
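To make `r` and `lora_alpha` concrete: LoRA freezes a weight matrix $W$ and learns a low-rank update, so an adapted layer computes $h = Wx + \frac{\alpha}{r} BAx$ with $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$. The NumPy sketch below is illustrative only (PEFT and Unsloth implement this internally). The layer sizes used for the parameter count (hidden size 1536, MLP size 8960, key/value size 256, 28 layers) are assumptions based on the Qwen2.5-1.5B architecture; read the real values from `model.config` before relying on the number.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha, r):
    """LoRA layer: frozen W plus a scaled low-rank update B @ A."""
    return W @ x + (alpha / r) * (B @ (A @ x))


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    return r * d_in + d_out * r  # A is r x d_in, B is d_out x r


rng = np.random.default_rng(3407)
d_in, d_out, r, alpha = 64, 32, 16, 16
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, random init
B = np.zeros((d_out, r))                   # trainable, zero init
x = rng.standard_normal(d_in)

# With B = 0 the adapted layer starts out identical to the base layer
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# Assumed Qwen2.5-1.5B-style shapes (check model.config for the real ones)
hidden, mlp, kv, layers = 1536, 8960, 256, 28
per_layer = (
    lora_param_count(hidden, hidden, 16)    # q_proj
    + lora_param_count(hidden, kv, 16)      # k_proj
    + lora_param_count(hidden, kv, 16)      # v_proj
    + lora_param_count(hidden, hidden, 16)  # o_proj
    + lora_param_count(hidden, mlp, 16)     # gate_proj
    + lora_param_count(hidden, mlp, 16)     # up_proj
    + lora_param_count(mlp, hidden, 16)     # down_proj
)
print(f"{per_layer * layers:,}")  # 18,464,768
```

If the assumed shapes are right, this should be close to the trainable-parameter count printed by the parameter-check cell further down.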

In [None]:
# Add LoRA adapters to the model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=[
        "q_proj",    # Query projection
        "k_proj",    # Key projection
        "v_proj",    # Value projection
        "o_proj",    # Output projection
        "gate_proj", # Gate projection (MLP)
        "up_proj",   # Up projection (MLP)
        "down_proj", # Down projection (MLP)
    ],
    lora_alpha=16,  # Scaling factor
    lora_dropout=0,  # Dropout (0 for small models)
    bias="none",  # Don't train bias
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=3407,  # Reproducibility
    use_rslora=False,  # Standard LoRA
    loftq_config=None,  # No LoftQ
)

print("✅ LoRA adapters added successfully!")

In [None]:
# Check trainable parameters
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # With the default uint8 storage, bitsandbytes packs two 4-bit weights
        # per byte, so numel() reports half the real count for Params4bit tensors
        if param.__class__.__name__ == "Params4bit":
            num_params *= 2
        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    
    trainable_pct = 100 * trainable_params / all_param
    
    print(f"Trainable params: {trainable_params:,}")
    print(f"All params: {all_param:,}")
    print(f"Trainable%: {trainable_pct:.4f}%")
    
    return trainable_params, all_param, trainable_pct

print("=== Model Parameters ===")
trainable, total, pct = print_trainable_parameters(model)

print(f"\n💡 With LoRA, we're training only {pct:.2f}% of parameters!")

## 3. Prepare Training Data

### Load the Synthetic Reasoning Dataset

We'll load the pre-generated dataset from Hugging Face Hub.

**Dataset**: `sdiazlor/python-reasoning-dataset`
- 500 Python coding problems
- Each with reasoning chains
- Format: prompt + completion with `<think>` tags

In [None]:
# Load the reasoning dataset
print("Loading python-reasoning-dataset...")

dataset = load_dataset("sdiazlor/python-reasoning-dataset", split="train")

print(f"\n✅ Dataset loaded successfully!")
print(f"Number of examples: {len(dataset):,}")
print(f"\nDataset columns: {dataset.column_names}")

In [None]:
# Explore a sample from the dataset
print("=== Sample Example ===\n")
sample_idx = 0

print(f"Prompt:")
print(dataset[sample_idx]['prompt'])
print(f"\n{'='*70}\n")

print(f"Completion (first 500 chars):")
print(dataset[sample_idx]['completion'][:500] + "...")

In [None]:
# Show a few more examples
print("=== Additional Examples ===\n")

for i in range(1, 4):
    print(f"--- Example {i} ---")
    print(f"Prompt: {dataset[i]['prompt'][:80]}...")
    print(f"Has reasoning: {'<think>' in dataset[i]['completion']}")
    print()

### Define Prompt Template

We need to format the dataset into a structured prompt that:
1. Provides clear instructions
2. Includes the coding problem
3. Includes the reasoning chain in `<think>` tags
4. Ends with EOS token

**Format:**
```
Below is an instruction that describes a task, paired with a question.
Write a response that appropriately answers the question.
Before answering, think carefully but concisely about the question.

### Instruction:
[System prompt]

### Question:
[Coding problem]

### Response:
<think>
[Reasoning chain]
</think>
[Solution]
<｜end▁of▁sentence｜>
```
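The same template serves both training and inference: for training, the completion is substituted after `<think>`; at inference time the slot is left empty, so the model continues from exactly the point where training examples start their reasoning. A minimal sketch with a shortened stand-in template (`TEMPLATE` and `EOS` are illustrative placeholders, not the notebook's actual `prompt_style` and `tokenizer.eos_token`):

```python
TEMPLATE = "### Question:\n{}\n\n### Response:\n<think>\n{}"
EOS = "<eos>"  # placeholder for tokenizer.eos_token


def training_text(question: str, completion: str) -> str:
    # Full example the model is trained on: prompt + target + EOS
    return TEMPLATE.format(question, completion) + EOS


def inference_prompt(question: str) -> str:
    # Same template with an empty completion slot
    return TEMPLATE.format(question, "")


q = "How can I get the prime numbers from 0 to 125?"
train = training_text(q, "Use a sieve.\n</think>\nHere is the code.")
prompt = inference_prompt(q)

# The inference prompt is an exact prefix of the training example,
# so generation picks up right where the reasoning starts
print(train.startswith(prompt))  # True
```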

In [None]:
# Define the prompt template
prompt_style = """Below is an instruction that describes a task, paired with a question that provides further context.
Write a response that appropriately answers the question.
Before answering, think carefully but concisely about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an expert programmer with advanced knowledge of Python. Your task is to provide concise and easy-to-understand solutions. Please answer the following python question.

### Question:
{}

### Response:
<think>
{}
"""

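EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """
    Format the dataset examples into the training format.
    
    Args:
        examples: Batch of examples from the dataset
        
    Returns:
        Dictionary with formatted texts
    """
    prompts = examples["prompt"]
    completions = examples["completion"]
    texts = []
    
    for prompt, completion in zip(prompts, completions):
        # The template already ends with "<think>\n"; drop a leading <think>
        # from the completion (if present) so the tag is not duplicated
        completion = completion.strip()
        if completion.startswith("<think>"):
            completion = completion[len("<think>"):].lstrip()
        # Format with template and add EOS token so the model learns to stop
        text = prompt_style.format(prompt, completion) + EOS_TOKEN
        texts.append(text)
    
    return {"text": texts}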

print("✅ Formatting function defined!")

In [None]:
# Apply formatting to the dataset
print("Formatting dataset...")

dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)

print("\n✅ Dataset formatted successfully!")
print(f"\nNew column: 'text'")
print(f"\nExample length: {len(dataset['text'][0])} characters")

In [None]:
# Inspect a formatted example
print("=== Formatted Example ===\n")
print(dataset["text"][0][:1000])  # First 1000 chars
print("\n... (truncated) ...\n")
print(dataset["text"][0][-200:])  # Last 200 chars
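Examples longer than `max_seq_length` (2048 here) are truncated during training, which can cut off the end of a reasoning chain, the final answer, or the EOS token. A minimal sketch to count such examples; `count_too_long` is a hypothetical helper that accepts any callable returning token IDs, so in the notebook you would pass something like `lambda t: tokenizer(t)["input_ids"]`:

```python
from typing import Callable, Sequence


def count_too_long(
    texts: Sequence[str],
    tokenize: Callable[[str], Sequence[int]],
    max_len: int = 2048,
) -> tuple[int, int]:
    """Return (number of texts over max_len tokens, longest length seen)."""
    lengths = [len(tokenize(text)) for text in texts]
    too_long = sum(1 for n in lengths if n > max_len)
    return too_long, max(lengths, default=0)


# Toy demo with a whitespace "tokenizer"; in the notebook, use
# count_too_long(dataset["text"], lambda t: tokenizer(t)["input_ids"])
toy_texts = ["a b c", "a b c d e f", "a"]
print(count_too_long(toy_texts, str.split, max_len=4))  # (1, 6)
```

If many examples exceed the limit, raising `max_seq_length` (memory permitting) or filtering them out is usually preferable to silently training on truncated answers.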

## 4. Fine-Tune with Unsloth

### Configure Training Arguments

We'll set up the training configuration for optimal results.

**Key settings:**
- **Output directory**: Where to save checkpoints
- **Epochs**: 3 (good balance for 500 examples)
- **Batch size**: 2 per device (adjust for your GPU)
- **Gradient accumulation**: 4 steps (effective batch size = 8)
- **Learning rate**: 2e-4 (standard for LoRA)
- **Optimizer**: AdamW 8-bit (memory efficient)
- **Scheduler**: Linear warmup and decay
- **Mixed precision**: FP16 or BF16
- **Logging**: W&B for tracking
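With these settings, the number of optimizer steps and the learning-rate curve follow from simple arithmetic. A sketch assuming 500 examples on a single GPU; `linear_lr` mirrors the formula used by Transformers' `get_linear_schedule_with_warmup`, and the exact step count can differ by one depending on how your Transformers version rounds the last partial accumulation step:

```python
import math


def total_optimizer_steps(n_examples, per_device_bs, grad_accum, epochs):
    """Approximate optimizer steps for single-GPU training."""
    batches_per_epoch = math.ceil(n_examples / per_device_bs)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total = total_optimizer_steps(500, per_device_bs=2, grad_accum=4, epochs=3)
print(total)  # 189
for step in (0, 5, 97, total):
    print(step, f"{linear_lr(step, 2e-4, 5, total):.2e}")
```

With only 5 warmup steps out of roughly 189, almost the whole run is spent in the decay phase.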

In [None]:
# Define training arguments
training_arguments = TrainingArguments(
    # Output and logging
    output_dir="./deepseek-r1-python-reasoning",
    logging_dir="./logs",
    logging_steps=10,
    
    # Training duration
    num_train_epochs=3,
    max_steps=-1,  # Train for full epochs
    
    # Batch sizes
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # Effective batch size = 8
    
    # Optimization
    learning_rate=2e-4,
    weight_decay=0.01,
    optim="adamw_8bit",  # Memory efficient optimizer
    
    # Learning rate schedule
    lr_scheduler_type="linear",
    warmup_steps=5,
    
    # Mixed precision
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    
    # Checkpointing
    save_strategy="epoch",
    save_total_limit=2,  # Keep only last 2 checkpoints
    
    # Evaluation
    # eval_strategy="epoch",  # Uncomment if you have a validation set
    
    # Performance
    # Gradient checkpointing is already enabled through Unsloth
    # (use_gradient_checkpointing="unsloth" in get_peft_model above)
    max_grad_norm=0.3,
    
    # Experiment tracking
    report_to="none",  # Set to "wandb" if using W&B
    run_name="deepseek-r1-python-reasoning",
    
    # Misc
    seed=3407,
)

print("✅ Training arguments configured!")
print(f"\nKey settings:")
print(f"  Epochs: {training_arguments.num_train_epochs}")
print(f"  Batch size (per device): {training_arguments.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_arguments.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_arguments.per_device_train_batch_size * training_arguments.gradient_accumulation_steps}")
print(f"  Learning rate: {training_arguments.learning_rate}")
print(f"  Mixed precision: FP16={training_arguments.fp16}, BF16={training_arguments.bf16}")

In [None]:
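# Initialize SFTTrainer
# These keyword arguments follow the older TRL API used in Unsloth's notebooks.
# Recent TRL releases move the data-processing options (dataset_text_field,
# max_seq_length, packing, ...) into SFTConfig and rename tokenizer to processing_class.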
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,  # Don't pack sequences (important for reasoning chains)
    args=training_arguments,
)

print("✅ SFTTrainer initialized!")
print("\nReady to start training...")

In [None]:
# Start training!
print("=" * 70)
print("STARTING FINE-TUNING")
print("=" * 70)
print()
print(f"Training on {len(dataset):,} examples")
print(f"Running for {training_arguments.num_train_epochs} epochs")
print(f"Effective batch size: {training_arguments.per_device_train_batch_size * training_arguments.gradient_accumulation_steps}")
print()
print("This will take approximately 1-2 hours...")
print("Monitor progress via logging output")
print()
print("=" * 70)
print()

# Train the model
# Uncomment the line below to start training
# trainer_stats = trainer.train()

print("⚠️ Training is commented out by default.")
print("Uncomment 'trainer_stats = trainer.train()' above to start actual training.")
print()
print("For testing, you can:")
print("  - Reduce num_train_epochs to 1")
print("  - Use a smaller subset of data")
print("  - Adjust batch size based on your GPU")

### Training Tips

**If you run out of memory:**
- Reduce `per_device_train_batch_size` to 1
- Increase `gradient_accumulation_steps` to maintain effective batch size
- Reduce `max_seq_length` to 1024 or 1536

**If training is slow:**
- Increase `per_device_train_batch_size` if you have more memory
- Ensure you're using FP16 or BF16
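- Gradient checkpointing trades speed for memory; if you have spare VRAM, setting `use_gradient_checkpointing=False` in `get_peft_model` can speed up training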

**To monitor training:**
- Set `report_to="wandb"` and login to W&B
- Watch the logging output
- Check GPU utilization with `nvidia-smi`
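When you lower the per-device batch size to save memory, the gradient accumulation steps needed to keep the same effective batch size (per-device batch × accumulation steps × number of GPUs) follow directly. A small illustrative helper (hypothetical, not part of Transformers or Unsloth):

```python
def accumulation_steps_for(
    target_effective_bs: int, per_device_bs: int, num_gpus: int = 1
) -> int:
    """Gradient accumulation steps needed to keep the effective batch size."""
    per_step = per_device_bs * num_gpus
    if target_effective_bs % per_step != 0:
        raise ValueError(
            f"{target_effective_bs} is not divisible by {per_step}; "
            "pick a per-device batch size that divides it"
        )
    return target_effective_bs // per_step


# Dropping from batch size 2 to 1 while keeping the effective batch size of 8
print(accumulation_steps_for(8, per_device_bs=1))  # 8
```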

### Save the Fine-Tuned Model

After training, we can save the model in different formats:

1. **Merged 16-bit**: LoRA weights merged into the base model and saved in 16-bit precision
2. **Merged to Hub**: Push to Hugging Face Hub
3. **GGUF format**: For llama.cpp and Ollama

**GGUF** (GPT-Generated Unified Format):
- Optimized for CPU and edge deployment
- Used by llama.cpp, Ollama, LM Studio
- Smaller file size with quantization

In [None]:
# Save the fine-tuned model
# Uncomment when training is complete

MODEL_NAME = "deepseek-r1-python-reasoning"
REPO_NAME = "your-username/deepseek-r1-python-reasoning"  # Update with your HF username

# Save locally in 16-bit
# model.save_pretrained_merged(MODEL_NAME, tokenizer, save_method="merged_16bit")
# print(f"✅ Model saved locally to: {MODEL_NAME}")

# Push to Hub in 16-bit
# model.push_to_hub_merged(REPO_NAME, tokenizer, save_method="merged_16bit")
# print(f"✅ Model pushed to Hub: {REPO_NAME}")

# Push to Hub in GGUF format (q4_k_m quantization)
# model.push_to_hub_gguf(
#     f"{REPO_NAME}_q4_k_m",
#     tokenizer,
#     quantization_method="q4_k_m"
# )
# print(f"✅ GGUF model pushed to Hub: {REPO_NAME}_q4_k_m")

print("⚠️ Model saving is commented out")
print("Uncomment after training completes")

## 5. Inference and Evaluation

Now let's test our fine-tuned model and compare it with the base model!

### Test Question

We'll ask both models to solve a Python coding problem:
**"How can I get the prime numbers from 0 to 125?"**

### What to Look For

**Base Model (Before Fine-tuning):**
- May provide high-level outline
- Less detailed reasoning
- Might miss code implementation
- Shorter responses

**Fine-Tuned Model (After Fine-tuning):**
- Detailed reasoning in `<think>` tags
- Step-by-step logical thinking
- Complete code implementation
- Clear explanation of approach
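Some of these qualities can be checked automatically. A minimal heuristic sketch (the function name and checks are illustrative assumptions, not a standard metric): it looks for a closing `</think>` tag (the opening tag comes from the prompt), extracts the first fenced Python block after the reasoning, and checks that the code at least parses with Python's built-in `compile()`:

```python
import re

# Matches ``` or ```python fences and captures the code inside
CODE_BLOCK_RE = re.compile(r"```(?:python)?\s*\n(.*?)```", re.DOTALL)


def check_response(response: str) -> dict:
    """Heuristic structural checks for a reasoning-model response."""
    has_reasoning = "</think>" in response
    # Only look for code in the answer part, after the reasoning block
    answer = response.split("</think>", 1)[-1]
    match = CODE_BLOCK_RE.search(answer)
    code = match.group(1) if match else ""
    try:
        compile(code, "<generated>", "exec")  # parse only, nothing is run
        code_parses = bool(code.strip())
    except SyntaxError:
        code_parses = False
    return {
        "has_reasoning": has_reasoning,
        "has_code": bool(code.strip()),
        "code_parses": code_parses,
    }


sample = (
    "<think>\nUse a sieve.\n</think>\n"
    "Here is the code:\n```python\nprint([2, 3, 5, 7])\n```\n"
)
print(check_response(sample))
# {'has_reasoning': True, 'has_code': True, 'code_parses': True}
```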

In [None]:
# Prepare the model for inference
FastLanguageModel.for_inference(model)  # Enable faster inference

print("✅ Model ready for inference!")

In [None]:
# Define our test question
question = "How can I get the prime numbers from 0 to 125?"

print("=== Test Question ===")
print(question)
print("=" * 70)

### Response from Base Model (Before Fine-tuning)

The cell below prints an example base-model response recorded in the original tutorial; it is not generated live in this notebook.

In [None]:
# Note: this is a recorded example, not live output. To generate a real
# base-model response, you could load the base model separately, or run
# generation inside `with model.disable_adapter():` (a PEFT context manager
# that temporarily turns the LoRA adapters off).

print("=== BASE MODEL RESPONSE (Before Fine-tuning) ===\n")

# Typical base model response (example from the original tutorial):
base_response = """<think>
To find all prime numbers between 0 and 125, I can follow these steps:

1. **Define the range**: Identify the start and end of the range, which in this case are 0 and 125.

2. **Create a boolean array**: Initialize an array of booleans with the same length as the range.

3. **Mark non-prime numbers**: Starting from the first prime number (2), iterate through each number.

4. **Identify primes**: After marking the non-primes, the remaining True values correspond to prime numbers.

5. **Output the results**: Extract the indices and list them to obtain all prime numbers from 0 to 125.

This method ensures that we efficiently identify primes using the Sieve of Eratosthenes algorithm.
</think>

To find all prime numbers between 0 and 125, follow these steps:
[high-level outline without code]
"""

print(base_response)
print("\n" + "=" * 70)
print("\nNotice: High-level outline but NO code implementation")

### Response from Fine-Tuned Model

Now let's see the improved response from our fine-tuned model!

In [None]:
# Generate response with fine-tuned model
print("=== FINE-TUNED MODEL RESPONSE ===\n")

# Format the prompt
formatted_prompt = prompt_style.format(question, "")

# Tokenize
inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")

# Generate
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=2048,
    use_cache=True,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Extract just the response part (after "### Response:")
if "### Response:" in response:
    response = response.split("### Response:")[1].strip()

print(response[:1500])  # Print first 1500 chars
print("\n... (truncated for display) ...")
print("\n" + "=" * 70)

### Analysis: Before vs After Fine-Tuning

**Key improvements observed in the original tutorial:**

1. **Detailed Reasoning Chain** 🧠
   - Before: High-level steps
   - After: Detailed step-by-step thinking in `<think>` tags

2. **Code Implementation** 💻
   - Before: No code provided
   - After: Complete, working Python code

3. **Explanation Quality** 📝
   - Before: Brief overview
   - After: Detailed explanation of the algorithm

4. **Problem-Solving Approach** 🎯
   - Before: Conceptual understanding
   - After: Practical, executable solution

**Qualities to look for in your own fine-tuned outputs:**
- Reasoning that considers edge cases
- Well-commented, structured code
- A solution that runs as-is
- An explanation that connects the algorithm to the implementation

### Test with More Examples

Let's test the model with various coding problems to validate its reasoning capabilities.

In [None]:
# Define multiple test cases
test_cases = [
    "Write a function to find the factorial of a number",
    "How do I reverse a string in Python?",
    "Create a function to check if a number is palindrome",
    "Write code to find the longest word in a sentence",
    "How can I remove duplicates from a list while preserving order?",
]

print("=== Test Cases Prepared ===\n")
for i, case in enumerate(test_cases, 1):
    print(f"{i}. {case}")

In [None]:
# Function to test multiple examples
def test_reasoning_model(model, tokenizer, question, max_tokens=1024):
    """
    Test the model on a coding question.
    
    Args:
        model: The fine-tuned model
        tokenizer: The tokenizer
        question: The coding question
        max_tokens: Maximum tokens to generate
        
    Returns:
        Generated response
    """
    # Format prompt
    prompt = prompt_style.format(question, "")
    
    # Tokenize
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    
    # Generate
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=max_tokens,
        use_cache=True,
        temperature=0.7,
        do_sample=True,
    )
    
    # Decode
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    
    # Extract response
    if "### Response:" in response:
        response = response.split("### Response:")[1].strip()
    
    return response

print("✅ Testing function defined!")
print("\nYou can use this to test your model on any coding problem.")

In [None]:
# Test a few cases (uncomment to run)
print("=== BATCH TEST RESULTS ===\n")

for i, case in enumerate(test_cases[:2], 1):  # Test first 2
    print(f"Test {i}: {case}")
    print("-" * 70)
    
    # Uncomment to run actual inference
    # result = test_reasoning_model(model, tokenizer, case, max_tokens=800)
    # print(result[:300] + "\n... (truncated) ...\n")
    
    print("[Run inference to see results]\n")

print("💡 Uncomment inference code to test all cases")
print("=" * 70)

## 🎉 Congratulations!

You've successfully:
- ✅ Learned about synthetic reasoning dataset generation
- ✅ Loaded and configured DeepSeek-R1-Distill-Qwen-1.5B
- ✅ Applied LoRA for efficient fine-tuning
- ✅ Prepared reasoning data with proper formatting
- ✅ Fine-tuned the model with Unsloth
- ✅ Evaluated reasoning improvements
- ✅ Compared base vs fine-tuned model

## 🎯 Key Takeaways

### Synthetic Data Generation
- **Solves data scarcity** in specialized domains
- **Customizable** for specific tasks
- **Cost-effective** compared to manual labeling
- **Scalable** to thousands of examples

### Reasoning Models
- **Explicit reasoning chains** (`<think>` tags)
- **Step-by-step logical thinking**
- **Better for complex problems**
- **More interpretable** than standard models

### Efficient Fine-Tuning
- **Unsloth**: reported ~2x faster training with up to 70% less memory
- **LoRA**: Train tiny fraction of parameters
- **4-bit quantization**: Fits on consumer GPUs
- **Works on single GPU** (16GB+)

### Practical Impact
- **Smaller model** (1.5B) achieves good results
- **Domain-specific** fine-tuning improves accuracy
- **Reasoning chains** make thinking transparent
- **Fast turnaround**: a working fine-tuned adapter after 1-2 hours of training

## 🚀 Next Steps

### Immediate
1. **Train on your domain**: Generate custom reasoning data
2. **Experiment with hyperparameters**: Learning rate, LoRA rank, etc.
3. **Evaluate systematically**: Create test sets with expected outputs
4. **Deploy**: Convert to GGUF for edge deployment

### Advanced
1. **Larger models**: Try 7B or 14B variants
2. **More data**: Generate 1000-5000 examples
3. **Multi-task**: Combine different reasoning tasks
4. **Preference/RL fine-tuning**: DPO or GRPO for further improvement

### Production
1. **Quantization**: Convert to 4-bit or 8-bit GGUF
2. **Optimization**: Use llama.cpp or vLLM for serving
3. **API deployment**: Wrap in FastAPI or similar
4. **Monitoring**: Track reasoning quality over time

## 📚 Resources

### Papers
- [DeepSeek-R1 Technical Report](https://arxiv.org/abs/2501.12948)
- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [Chain-of-Thought Prompting](https://arxiv.org/abs/2201.11903)

### Tools
- [Unsloth](https://github.com/unslothai/unsloth)
- [Synthetic Data Generator](https://huggingface.co/spaces/argilla/synthetic-data-generator)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeek Models](https://huggingface.co/deepseek-ai)

### Datasets
- [Python Reasoning Dataset](https://huggingface.co/datasets/sdiazlor/python-reasoning-dataset)
- [Code Contests](https://huggingface.co/datasets/deepmind/code_contests)
- [APPS Dataset](https://huggingface.co/datasets/codeparrot/apps)

## 💡 Tips for Better Results

### Data Generation
1. **Quality over quantity**: 500 good examples > 5000 poor ones
2. **Diverse problems**: Cover different difficulty levels
3. **Clear instructions**: Well-defined system prompts
4. **Reasoning depth**: Balance detail vs conciseness

### Training
1. **Learning rate**: 1e-4 to 5e-4 works well for LoRA
2. **Epochs**: 3-5 for 500 examples, 1-2 for 5000+
3. **LoRA rank**: 16-32 for small models, 64+ for large
4. **Sequence length**: 2048 for code, 512-1024 for text

### Evaluation
1. **Hold-out test set**: Never seen during training
2. **Code execution**: Test if generated code runs
3. **Reasoning quality**: Check logic in `<think>` tags
4. **Human review**: Sample random outputs
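For the code-execution check, a simple approach is to run each extracted snippet in a separate Python process with a timeout. A minimal sketch (the helper `run_snippet` is hypothetical); note that a subprocess is not a security sandbox, so model-generated code is best run inside an isolated environment such as a container:

```python
import subprocess
import sys


def run_snippet(code: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run code in a fresh Python process; return (succeeded, combined output)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    return result.returncode == 0, result.stdout + result.stderr


ok, output = run_snippet("print(sum(range(10)))")
print(ok, output.strip())  # True 45
```

Combined with a hold-out test set, the pass rate of `run_snippet` over generated answers gives a rough, automatable signal to track alongside human review.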

## 🔧 Troubleshooting

### Out of Memory
- Use 4-bit quantization (`load_in_4bit=True`)
- Reduce batch size to 1
- Enable gradient checkpointing
- Reduce `max_seq_length` to 1024

### Poor Reasoning Quality
- Generate more diverse training data
- Increase LoRA rank (32 or 64)
- Train for more epochs
- Check data quality and format

### Slow Inference
- Use GGUF format with llama.cpp
- Reduce `max_new_tokens`
- Use lower precision (INT4, INT8)
- Batch multiple requests

---

**Happy Fine-Tuning!** 🎓✨

For more tutorials, check out:
- [GPT-2 From Scratch](../01-Full-Fine-Tuning/)
- [Falcon-7B LoRA](../02-PEFT/)
- [FLAN-T5 Summarization](../03-Instruction-Tuning/Summarization-FLAN-T5.ipynb)
- [Financial Sentiment OPT](../03-Instruction-Tuning/Financial-Sentiment-OPT.ipynb)

---

**This completes the LLM-Finetuning-Cookbook!** 🎊

You've mastered:
- Full fine-tuning
- PEFT/LoRA
- Instruction tuning
- Summarization
- Sentiment analysis
- **Reasoning with synthetic data** ⭐

**Thank you for learning with us!** 🚀