# 📚 Preparing Data for Fine-Tuning

To teach our Llama 3 model how to be a better assistant, we need to show it examples of high-quality conversations.

We are going to use the **`HuggingFaceH4/no_robots`** dataset. It contains 10,000 high-quality instruction examples written by skilled human annotators: 9,500 in the training split and 500 in the test split.

First, let's download it!

In [1]:
from datasets import load_dataset
import json

# 1. Download the dataset from Hugging Face
print("Downloading dataset...")
dataset = load_dataset("HuggingFaceH4/no_robots")

# Let's look at one example from the training split
example = dataset["train"][0]
print("\n--- Example Raw Data ---")
print(json.dumps(example['messages'], indent=2))

Downloading dataset...


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/10.5M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/571k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/500 [00:00<?, ? examples/s]


--- Example Raw Data ---
[
  {
    "content": "Please summarize the goals for scientists in this text:\n\nWithin three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert\u2019s portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition\u2014which was never properly documented. With that insight, conservation research groups like the San Francisco Bay Bird Observatory can help guide best practices when restoring the native habitat that has long served as critical refuge for imperiled birds and animals as adjacent marshes flood more with rising sea levels. \u201cWe can\u2019t ask 

### Formatting the Data
As the output above shows, each raw example is just a list of dictionaries, one per message.
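To make that structure concrete, here is a minimal sketch using a made-up two-turn conversation. The `role` and `content` keys are assumed from the example output and from the `user` header in the formatted string; the conversation itself and the helper `summarize_turns` are hypothetical and not part of the dataset or any library.

```python
# Hypothetical conversation in the same shape as one `messages` entry.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]


def summarize_turns(messages):
    """Return a (role, character count) pair for every message in order."""
    return [(m["role"], len(m["content"])) for m in messages]


# Print one line per turn so the alternating roles are easy to see.
for role, n_chars in summarize_turns(messages):
    print(f"{role}: {n_chars} characters")
```

Running the same helper on `example['messages']` from the cell above is a quick way to check which roles a given example contains.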

But Llama 3 cannot train on that structure directly! Each conversation has to be rendered as a single long string, with turns marked by special tokens such as `<|start_header_id|>` and `<|eot_id|>`.

Fortunately, the `mlx-lm` tokenizer we used in the first notebook has a built-in helper, `apply_chat_template`, that does exactly this formatting!

In [2]:
from mlx_lm import load

# 2. load() returns (model, tokenizer); we only need the tokenizer, so the model is discarded.
#    Note: load() still fetches and loads the model weights, so the first run is not instant.
model_id = "mlx-community/Meta-Llama-3-8B-Instruct-4bit"
_, tokenizer = load(model_id, tokenizer_config={"trust_remote_code": True})

# 3. Create a function to convert the raw messages into Llama 3 format
def format_for_llama3(messages):
    # apply_chat_template takes the list of dictionaries and turns it into the exact training string
    chat_string = tokenizer.apply_chat_template(messages, tokenize=False)
    # Store the string under a 'text' key, the record format we will feed to mlx_lm for training
    return {"text": chat_string}

# Test the formatter on our example!
formatted_example = format_for_llama3(example['messages'])
print("\n--- Fully formatted Llama 3 Training String ---")
print(formatted_example['text'])

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]


--- Fully formatted Llama 3 Training String ---
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Please summarize the goals for scientists in this text:

Within three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert’s portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition—which was never properly documented. With that insight, conservation research groups like the San Francisco Bay Bird Observatory can help guide best practices when restoring the native habitat that has long served as critical refuge for imperiled birds and animals as adjacent marshes flood m
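The formatted string follows a simple pattern: one `<|begin_of_text|>` token, then a role header, a blank line, the message text, and `<|eot_id|>` for every turn. The sketch below rebuilds that pattern by hand to show what the template is doing. The function `render_llama3` is hypothetical and only approximates the template; the tokenizer's own `apply_chat_template` remains the authoritative source and is what the notebook actually uses.

```python
def render_llama3(messages):
    """Hand-rolled approximation of the Llama 3 chat layout, for illustration only."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn starts with a header naming the speaker, followed by a blank line.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # The message text is trimmed and closed with the end-of-turn token.
        parts.append(message["content"].strip() + "<|eot_id|>")
    return "".join(parts)


# Hypothetical two-turn conversation used only to show the layout.
demo = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_llama3(demo))
```

Comparing this output with `tokenizer.apply_chat_template(demo, tokenize=False)` is a useful sanity check; if the two ever differ, trust the tokenizer.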

### Saving the Dataset

Now we just need to apply this formatting to 500 examples and save them to a `.jsonl` file, with one JSON object per line. Training on all 9,500 examples would take hours, so we'll start small for practice!

In [3]:
# 4. Process and save the data for training
num_examples_to_use = 500

print(f"Processing {num_examples_to_use} examples...")
with open("train.jsonl", "w") as f:
    for i in range(num_examples_to_use):
        messages = dataset["train"][i]["messages"]
        formatted_data = format_for_llama3(messages)
        
        # Write as a JSON Line (jsonl)
        f.write(json.dumps(formatted_data) + "\n")

# We also need a small validation set so the loss can be measured on held-out data during training
print("Processing 50 validation examples...")
with open("valid.jsonl", "w") as f:
    for i in range(50):
        # We take these from the 'test' split so the model hasn't seen them before
        messages = dataset["test"][i]["messages"]
        formatted_data = format_for_llama3(messages)
        f.write(json.dumps(formatted_data) + "\n")

print("\n✅ Data preparation complete! You now have train.jsonl and valid.jsonl ready for fine-tuning.")

Processing 500 examples...
Processing 50 validation examples...

✅ Data preparation complete! You now have train.jsonl and valid.jsonl ready for fine-tuning.
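Before starting a training run, it can be worth reading the files back to confirm that every line parses as JSON and carries a non-empty `text` string. The helper below is a minimal sketch of such a check; `check_jsonl` is a hypothetical name introduced here and is not part of `mlx_lm` or `datasets`.

```python
import json


def check_jsonl(path, required_key="text"):
    """Return the number of records in a JSONL file, raising on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            # json.loads raises an error if the line is not valid JSON.
            record = json.loads(line)
            value = record.get(required_key) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value:
                raise ValueError(
                    f"{path}:{line_no} has no non-empty '{required_key}' string"
                )
            count += 1
    return count
```

With the files written by the cell above, `check_jsonl("train.jsonl")` should return 500 and `check_jsonl("valid.jsonl")` should return 50, since one record is written per example.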
