# 💿 LMFast: Custom Datasets

**Train SLMs on your own data!**

## What You'll Learn
- Load data from various formats (JSON, CSV, JSONL)
- Format data for instruction tuning
- Handle conversation/chat datasets
- Push prepared datasets to HuggingFace Hub

## Supported Formats
- **Alpaca Style**: `{"instruction": "...", "input": "...", "output": "..."}`
- **ShareGPT / Chat**: `{"conversations": [{"from": "human", "value": "..."}, ...]}`
- **Text Completion**: `"Just raw text for pretraining..."`
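When records come from mixed sources, it can help to check which of the shapes above each one matches before formatting it. The following is a minimal sketch in plain Python; `detect_format` is a hypothetical helper written for illustration and is not part of LMFast.

```python
def detect_format(record):
    """Classify a record as 'alpaca', 'sharegpt', or 'text' (hypothetical helper)."""
    if isinstance(record, str):
        # Raw text for completion-style pretraining
        return "text"
    if isinstance(record, dict):
        if "conversations" in record:
            return "sharegpt"
        # "input" is optional in Alpaca-style data, so only require these two keys
        if "instruction" in record and "output" in record:
            return "alpaca"
    raise ValueError(f"Unrecognized record shape: {record!r}")


print(detect_format({"instruction": "Add 2+2", "output": "The answer is 4."}))
print(detect_format({"conversations": [{"from": "human", "value": "Hi"}]}))
print(detect_format("Just raw text for pretraining..."))
```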

**Time to complete:** ~10 minutes

## 1️⃣ Setup

In [None]:
!pip install -q "lmfast[all]"

import lmfast
import pandas as pd
from datasets import Dataset

## 2️⃣ Loading from Local Files

Let's create some dummy files to demonstrate loading.

In [None]:
# Create a dummy JSONL file (one JSON object per line, a common dataset format)
import json

data = [
    {"instruction": "What is the capital of France?", "output": "Paris is the capital of France."},
    {"instruction": "Add 2+2", "output": "The answer is 4."},
    {"instruction": "Write a function to print hello", "output": "print('Hello')"}
]

with open("my_data.jsonl", "w", encoding="utf-8") as f:
    for entry in data:
        f.write(json.dumps(entry) + "\n")

print("Created my_data.jsonl")

In [None]:
# Load with LMFast utility (wraps 'datasets' library)
from lmfast.training.data import load_dataset

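dataset = load_dataset("json", data_files="my_data.jsonl")
# Assumption: like datasets.load_dataset, the wrapper may return a DatasetDict
# keyed by split when no split is given; if so, select the "train" split.
if isinstance(dataset, dict):
    dataset = dataset["train"]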
print(f"Loaded {len(dataset)} examples")
print(dataset[0])
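Hand-edited JSONL files often contain a malformed line or a record with a missing field, and those are easier to catch before loading. Here is a small sketch using only the standard library; `validate_jsonl` and its `required` parameter are hypothetical names chosen for this example.

```python
import json
import os
import tempfile


def validate_jsonl(path, required=("instruction", "output")):
    """Return a list of (line_number, message) for lines that look wrong."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            missing = [key for key in required if key not in record]
            if missing:
                problems.append((lineno, f"missing keys: {missing}"))
    return problems


# Demo on a throwaway file: one good line, one missing "output", one broken line
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"instruction": "Add 2+2", "output": "The answer is 4."}\n')
        f.write('{"instruction": "No answer here"}\n')
        f.write('{"instruction": "Broken line"\n')
    problems = validate_jsonl(path)

for lineno, message in problems:
    print(f"line {lineno}: {message}")
```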

## 3️⃣ Formatting for Instruction Tuning

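Instruction-tuned models such as SmolLM or Llama-3 expect prompts that follow a specific template. LMFast handles this via `DataCollator` or an explicit mapping; the cell below uses an explicit mapping to the Alpaca template.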

In [None]:
def format_alpaca(example):
    input_text = example.get('input', '')
    if input_text:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{input_text}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

formatted_ds = dataset.map(format_alpaca)
print("Formatted Example:")
print(formatted_ds[0]['text'])
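At inference time the same template is usually needed without the answer, so that the model continues from the response header. The sketch below assumes you want to reuse the Alpaca layout shown above; `build_prompt` is a hypothetical helper and not an LMFast function.

```python
def build_prompt(instruction, input_text=""):
    """Return the Alpaca-style prompt up to and including the response header."""
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


# A training example is the prompt followed by the expected output
prompt = build_prompt("Add 2+2")
training_text = prompt + "The answer is 4."
print(training_text)
```

Keeping prompt construction in one function makes it less likely that the training and inference templates drift apart.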

## 4️⃣ Handling Chat Data (ShareGPT)

Multi-turn chat data is more complex, but it is the standard format for fine-tuning assistants.

In [None]:
chat_data = [
    {
        "conversations": [
            {"from": "human", "value": "Hi there!"},
            {"from": "gpt", "value": "Hello! How can I help you today?"},
            {"from": "human", "value": "Tell me a joke."},
            {"from": "gpt", "value": "Why did the chicken cross the road? To get to the other side!"}
        ]
    }
]

# Create dataset
chat_ds = Dataset.from_list(chat_data)

# ShareGPT speaker names mapped to the role names that chat templates expect
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def apply_chat_template(example, tokenizer):
    # Render the conversation with the model's own chat template (e.g. ChatML)
    messages = [
        {"role": ROLE_MAP.get(msg["from"], "assistant"), "content": msg["value"]}
        for msg in example["conversations"]
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

# Must load a tokenizer to apply template
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

formatted_chat = chat_ds.map(lambda x: apply_chat_template(x, tokenizer))

print("ChatML Formatted Example:")
print(formatted_chat[0]['text'])
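To make the ChatML layout concrete, here is a hand-rolled rendering of the basic turn structure. It is an illustration only: `render_chatml` is a hypothetical function, and the tokenizer's own template remains authoritative because it may add a system prompt or differ in other details.

```python
def render_chatml(messages):
    """Render role/content messages in the basic ChatML turn layout (illustrative)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]
chatml_text = render_chatml(messages)
print(chatml_text)
```

For actual training data, prefer `tokenizer.apply_chat_template`, since it uses the special tokens the model was trained with.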

## 5️⃣ Using CSV / Excel Files

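Tabular exports are common in enterprise settings. The cell below uses a CSV file; an Excel sheet can be read with `pd.read_excel` and converted with `Dataset.from_pandas` before applying the same mapping.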

In [None]:
# Create dummy CSV
df = pd.DataFrame({
    'question': ['How do I reset password?', 'Where is the login button?'],
    'answer': ['Go to settings > reset.', 'Top right corner.']
})
df.to_csv("support.csv", index=False)

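# Load CSV
csv_ds = load_dataset("csv", data_files="support.csv")
# Assumption: like datasets.load_dataset, the wrapper may return a DatasetDict
# keyed by split when no split is given; if so, select the "train" split.
if isinstance(csv_ds, dict):
    csv_ds = csv_ds["train"]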

# Map columns to instruction format
def map_csv(example):
    return {
        "instruction": example['question'],
        "output": example['answer']
    }

final_ds = csv_ds.map(map_csv)
final_ds = final_ds.map(format_alpaca)

print("CSV Example:")
print(final_ds[0]['text'])
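Real exports often contain blank cells, which would otherwise become empty instructions or answers. The following sketch does the same column mapping with the standard `csv` module and skips incomplete rows; `rows_to_alpaca` and its column parameters are hypothetical names for this example.

```python
import csv
import io


def rows_to_alpaca(rows, instruction_col, output_col):
    """Map tabular rows to Alpaca-style records, skipping incomplete rows."""
    records = []
    for row in rows:
        instruction = (row.get(instruction_col) or "").strip()
        output = (row.get(output_col) or "").strip()
        if instruction and output:
            records.append({"instruction": instruction, "output": output})
    return records


# In-memory CSV: the second row has no answer and is dropped
csv_text = (
    "question,answer\n"
    "How do I reset password?,Go to settings > reset.\n"
    "Where is the login button?,\n"
)
records = rows_to_alpaca(csv.DictReader(io.StringIO(csv_text)), "question", "answer")
print(records)
```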

## 6️⃣ Preparing for Training

Once formatted, the dataset can be passed directly to `SLMTrainer`.

In [None]:
# Split into train/test (a fixed seed makes the split reproducible)
split_ds = final_ds.train_test_split(test_size=0.1, seed=42)

print(f"Train size: {len(split_ds['train'])}")
print(f"Test size: {len(split_ds['test'])}")

# Ready to train!
# lmfast.train("HuggingFaceTB/SmolLM-135M", dataset=split_ds['train'])
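The learning goals also mention pushing a prepared dataset to the Hugging Face Hub. In the `datasets` library, `Dataset` and `DatasetDict` objects provide a `push_to_hub` method for this. The sketch below wraps that call; `publish_dataset` is a hypothetical helper, the repo id is a placeholder, and it assumes you have already authenticated (for example with `huggingface-cli login`).

```python
def publish_dataset(ds, repo_id, private=True):
    """Upload a datasets.Dataset or DatasetDict to the Hugging Face Hub.

    Assumes the caller is already authenticated with the Hub.
    Defaults to a private repo so that data is not published by accident.
    """
    ds.push_to_hub(repo_id, private=private)


# Example call (hypothetical repo id; replace it with your own namespace):
# publish_dataset(split_ds, "your-username/support-sft")
```

Pushing the split `DatasetDict` rather than a single `Dataset` keeps the train and test splits together in one repository.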

## 🎉 Summary

You've learned how to:
- ✅ Load JSONL, CSV, and Python lists
- ✅ Format for Alpaca and ChatML styles
- ✅ Split datasets for training/evaluation

### Best Practices
- **Clean your data**: Remove duplicates and low-quality entries.
- **Use Chat Templates**: Always use `apply_chat_template` for chat models.
- **Balance your data**: Ensure coverage of all tasks you want the model to do.
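As one way to act on the first practice, the sketch below drops records with empty fields and near-exact duplicates. It is a minimal example in plain Python: `clean_records` is a hypothetical helper, and treating records as duplicates when they differ only in case or whitespace is an assumption that may not suit every dataset.

```python
def clean_records(records):
    """Drop records with empty fields and duplicates (ignoring case and whitespace)."""
    seen = set()
    cleaned = []
    for record in records:
        # Normalize only for comparison; the original text is kept unchanged
        instruction = " ".join(record.get("instruction", "").split()).lower()
        output = " ".join(record.get("output", "").split()).lower()
        if not instruction or not output:
            continue
        key = (instruction, output)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"instruction": "Add 2+2", "output": "The answer is 4."},
    {"instruction": "add  2+2 ", "output": "The answer is 4."},  # duplicate
    {"instruction": "Say hi", "output": ""},  # empty output
]
cleaned = clean_records(raw)
print(f"Kept {len(cleaned)} of {len(raw)} records")
```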

### Next Steps
- `01_quickstart_training.ipynb`: Use your custom dataset to train!