# Introduction to Model Distillation: Efficient Knowledge Transfer for AI Applications  - Part 1

[![](https://img.shields.io/badge/Powered%20by-Nebius%20AI-orange?style=flat&labelColor=darkblue&color=green)](https://nebius.com/ai-studio)

## Pre requisites

- Nebius API key.  Sign up for free at [AI Studio](https://studio.nebius.com/)

## Introduction

Model distillation is a powerful technique in machine learning where a compact "student" model learns to replicate the behavior of a larger, more complex "teacher" model within the given task. By transferring knowledge from the teacher to the student, distillation enables lightweight models to achieve comparable performance within the task, while being dramatically faster and cheaper to deploy (and, consequently, cheaper to run inference with).

The benefits are compelling:   
- **Latency improvement**: Smaller models perform much faster, which makes them ideal for real-time applications like agentic scenarios or other tasks where an immediate response is required.
- **Cost reduction**: Smaller models require less compute for inference, and hence are available at cheaper rates. Furthermore, the fine-tuned model removes the need in the long detailed prompts to ensure a specific format of the output data, which also reduces the price due to the lower tokens consumption.

In this tutorial, we demonstrate how to perform distillation using **Nebius AI Studio** to create a grammar-correcting model. We will:  
1. Generate high-quality training data via batched LLM generation using the recently released **Qwen3-235B-A22B**.
2. Fine-tune a **Qwen3-4B** non-reasoning student model using LoRA adapters  
3. Deploy, evaluate and compare the distilled model with a 3.5x times larger model of this family, Qwen3-14B, using the most powerful open-source LLM to date, **DeepSeek-R1**, as evaluator.

By leveraging Nebius AI Studio’s batched generation, fine-tuning API, optimized inference and zero-click model deployment, we streamline the entire workflow—proving that large capabilities can indeed come in small packages. Let’s dive in!

Before we start, please note three things.

First, the procedure we employ differs from traditional distillation where the student model is trained on teacher's internal representations - instead, we will simply train the student model on the completions of the teacher model. 

Second, the goal of this blog post is not to maximize the quality of the model on the given task but rather exhibit how to correctly perform distillation and why it matters. Hence, we will not focus on task-specific quality-optimization tricks or play around with the data. However, we will include all best practices for distillation so that your distilled model goes beyond your expectations! 

Finally, due to non-deterministic parameters recommended by Qwen3 authors to run Qwen3, your results may slightly differ from the ones you see here if you relaunch the code. But no need to worry - we made sure the quality of the fine-tuned model is guaranteed to stay within the confidence interval of the baseline model!

## 1 - Prerequisites

Create an `.env` file with NEBIUS_API_KEY as follows

```text
NEBIUS_API_KEY=your_api_key_goes_here
```

## 2 - Dependencies

Let's start with importing the necessary packages.

In [1]:
import os
from dotenv import load_dotenv

from typing import Sequence
from openai import Client
from datasets import load_dataset, Dataset, concatenate_datasets
from tqdm import tqdm
import pandas as pd
import json
import numpy as np
import time
import requests
import re

## 3 - Load Configuration

In [2]:
from dotenv import load_dotenv
load_dotenv()

True

## 4  - Initialize Client

The cell below creates the OpenAI-like Client to work with Nebius AI Studio and defines necessary variables.

In [None]:
DATASETS_CACHE_DIR = 'cache'
BASE_URL = "https://api.studio.nebius.ai"

client = Client(
    base_url=f'{BASE_URL}/v1',
    api_key=os.getenv('NEBIUS_API_KEY')
)

## 5 - Load Input Dataset
In this tutorial, we want to demonstrate how, given only a dataset of input texts, train a small model by leveraging the most powerful LLMs to generate the desired outputs.

We will take a C4-200M dataset [1] for these purposes, which is intended for GEC models pre-training. Its outputs are anyway unsuitable for direct fine-tuning of a GEC model because it contains many errors, for example:

- **Input:** review narrow river as if air surf ...
- **Output:** air washer review boneco w200 air washer winix air washer review..

We will use its inputs and generate correct outputs using one of state-of-the-art LLMs - a recently released Qwen3-235B-A22B [2]. With a proper prompt tuning, we can urge the model output the data in easy-to-reuse format so that we can create the dataset to fine-tune our target small model -- Qwen3-4B [2].

Tens of thousands observations is generally enough to improve the quality of the model. Let's take a subset of 25k observations, process it by removing too short and too long sentences (this will leave us at 22k), and split into train and validation subsets for fine-tuning (21k & 1k).

In [4]:
input_dataset = load_dataset('Aktsvigun/c4_200m_25k', split='train', cache_dir=DATASETS_CACHE_DIR)
input_dataset

Dataset({
    features: ['input'],
    num_rows: 25000
})

Let's examine a random instance from the dataset.

In [5]:
input_dataset[2025]

{'input': 'Are you dissapointed on DNF or upsng race ettiraces?'}

C4-200M dataset is intended to contain sentences. Sentences below 3 or above 40 words are definite outliers, which most likely contain some garbage inputs. Let's filter out such input texts.

In [6]:
input_dataset = input_dataset.filter(lambda x: 40 > len(x['input'].split()) > 3)
input_dataset

Dataset({
    features: ['input'],
    num_rows: 22114
})

## 6 - Batch Inference

_Heads up: Running this part will cost ~$4.9._

You can use normal synchronous generations with Qwen3-235B-A22B to generate outputs for the dataset. However, if you are not in a last-minute rush, usage of batch inference is recommended. It provides as much as 2x cheaper rates, and is guaranteed to finish within 24 hours. In most cases, it takes a few hours or even minutes, again, depending on the size of the dataset.

Let's see how to use the batched generation to annotate our input dataset.

First, we need a carefully designed prompt to have the data generated in the desired format. Desired format here is untouched input sentence if it is already gramatically correct, or its corrected version, otherwise.

To urge the model follow the desired format for generation (without adding introduction like "Here is the corrected text" or further explanations), we will leverage few-shot learning examples. We provide one example per gramatically correct and incorrect input texts in our prompt.

In [7]:
system_prompt_distillation = """
Act as an experienced English proofreader. Please check the grammar of the user's text. If the text contains errors or misprints, print the corrected text. Otherwise, print the text as it is, otherwise. Check only the grammar of the text. Don't print anything else.

Examples for few-shot learning:
Example 1 (the text contains errors):
User: In fact who let me know abut this program was him.
Assistant: In fact, he was the one who let me know about this program.

Example 2 (the text does not contain errors):
User: On the other hand, it's very efficient computationally as it only requires one forward pass through the model per example.
Assistant: On the other hand, it's very efficient computationally as it only requires one forward pass through the model per example.
""".strip()

### 6.1 - Save input data in JSONL format

Let's format the dataset and save it as a `.jsonl` file.

In [8]:
!mkdir data

max_tokens = 4096

with open('data/batch_input.jsonl', 'w') as f:
    for i, inst in enumerate(input_dataset, 1):
        dict_to_write = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "Qwen/Qwen3-235B-A22B",
                "messages": [
                    {"role": "system", "content": system_prompt_distillation},
                    {"role": "user", "content": inst["input"]}
                ],
                "max_tokens": max_tokens
            }
        }
        json.dump(dict_to_write, f, ensure_ascii=False)
        f.write('\n')

mkdir: cannot create directory ‘data’: File exists


### 6.2 - Upload our input dataset to Nebius AI Studio.

In [9]:
batch_input_file = client.files.create(
    file=open("data/batch_input.jsonl", "rb"),
    purpose="batch"
)
batch_input_file

FileObject(id='file-386da9e1-b2c8-42aa-a799-363b61cff93c', bytes=24630543, created_at=1753305500, filename='batch_input.jsonl', object='file', purpose='batch', status=None, expires_at=None, status_details=None)

### 6.3 - Launch the batch inference job

Now that all perliminary steps are done, use the uploaded dataset to create the batched generation job. The code below launches the batched generation.

In [10]:
batch_input_file_id = batch_input_file.id
batch = client.batches.create(
    input_file_id=batch_input_file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "description": "Distillation of Qwen/Qwen3-235B-A22B for GEC"
    }
)
batch

Batch(id='batch_624bbde2-58f6-42e9-adf4-096ca8a98569', completion_window='24h', created_at=1753305501, endpoint='/v1/chat/completions', input_file_id='file-386da9e1-b2c8-42aa-a799-363b61cff93c', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=None, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Distillation of Qwen/Qwen3-235B-A22B for GEC'}, output_file_id=None, request_counts=BatchRequestCounts(completed=None, failed=None, total=None))

### 6.4 - Wait for batch inference job to finish

It will now take some time to complete your job. The total time highly depends on the workload of the model. In our case, it finished within 1 hour. 

You can periodically monitor the status of your job. When the job is completed, `status` will be equal to `'done'`. The cell below will update its status every minute and stop running once the job is finished.

In [11]:
%%time 
import time

start_time = time.time()
update_num_seconds = 60
active_statuses = ["validating", "validated", "running"]
print (f"Batch {batch.id} created, waiting for completion...")
while batch.status in active_statuses:
    time.sleep(update_num_seconds)
    # Retrieve the batch state
    batch = client.batches.retrieve(batch.id)
    elapsed = time.time() - start_time
    print(f"Elapsed: {int(elapsed)}s ({elapsed/60:.1f} min) : current status: {batch.status}")

Batch batch_624bbde2-58f6-42e9-adf4-096ca8a98569 created, waiting for completion...
Elapsed: 60s (1.0 min) : current status: running
Elapsed: 121s (2.0 min) : current status: running
Elapsed: 182s (3.0 min) : current status: running
Elapsed: 242s (4.0 min) : current status: running
Elapsed: 303s (5.1 min) : current status: running
Elapsed: 364s (6.1 min) : current status: running
Elapsed: 425s (7.1 min) : current status: running
Elapsed: 486s (8.1 min) : current status: running
Elapsed: 547s (9.1 min) : current status: running
Elapsed: 607s (10.1 min) : current status: running
Elapsed: 668s (11.1 min) : current status: running
Elapsed: 729s (12.2 min) : current status: running
Elapsed: 789s (13.2 min) : current status: running
Elapsed: 850s (14.2 min) : current status: running
Elapsed: 911s (15.2 min) : current status: running
Elapsed: 972s (16.2 min) : current status: running
Elapsed: 1033s (17.2 min) : current status: running
Elapsed: 1093s (18.2 min) : current status: running
Elapse

### 6.5 - Save Batch inference data locally

Our batch has been successfully processed. Let's save the generations to a file and examine the format in which it comes.

In [12]:
file_response = client.files.content(batch.output_file_id)
file_response.write_to_file('data/batch_output.jsonl')

In [13]:
# Display the first line from the output file

with open('data/batch_output.jsonl') as f:
    for line in f.readlines():
        output = json.loads(line)
        break
print(json.dumps(output, indent=4))
print ("✅ Batch processing completed successfully!")

{
    "id": "batch_req_fc038ea5-6112-4b0f-be71-e402f74a8105",
    "custom_id": "request-1580",
    "response": {
        "id": "chatcmpl-c03d469ec6bd4c61a7a9a22157ce548b",
        "choices": [
            {
                "finish_reason": "stop",
                "index": 0,
                "logprobs": null,
                "message": {
                    "content": "<think>\nOkay, let's tackle this query. The user wants me to act as an experienced English proofreader. The task is to check the grammar of their text. If there are errors or misprints, I need to correct them. If not, just return the text as is. Important: only check grammar, don't add anything else.\n\nFirst, I'll read through the user's text carefully. The example they provided shows that the assistant should correct grammatical errors and misprints, like typos or wrong word choices, but leave the technical terms and product names intact. \n\nThe user's text is: \"Fluke Australia - Th7 Fluke 1587 and 1577 insulations Mu

## 7 - Data Cleanup 

To get a model suitable for online application, let's query only the generations without the thinking part. Next, create a dataset that we'll afterwards merge with the input dataset.

Even though our dataset is not that large, let's create the `Dataset` object from file so that at no point we store the whole dataset in RAM - this will be a helpful example of how to deal with large datasets.

Since we want to exhibit a distillation for real-world usecases, we will only train the model on completions, discarding the thinking part. This ensures the responses are generated immediately, which is generally crucial for production applications. Hence, we extract the content after the `</think>` tag to save only the final corrected version.

There may also be cases where the model thought for so long that it didn't reach the final output. We will filter these cases by removing observations where the number of completion tokens coincides with maximum tokens we used for generation (4096).

In [14]:
output_save_path = 'data/batch_output_processed.jsonl'
prompt_tokens = 0
completion_tokens = 0
ids_to_filter = set()

with open(output_save_path, 'w') as f_out:
    with open('data/batch_output.jsonl') as f_in:
        for line in f_in.readlines():
            output = json.loads(line)
            output_text = output['response']['choices'][0]['message']['content']
            output_without_thinking = output_text.split('</think>')[-1].strip()
            output_id = int(output['custom_id'].split('-')[1])
            # Check the generation was finished. We won't remove these instances at the moment:
            # we will remove them once we concatenate the outputs with the input dataset
            if output['response']['usage']['completion_tokens'] == max_tokens:
                ids_to_filter.add(output_id)
                    
            json.dump({'output': output_without_thinking, 'id': output_id}, f_out, ensure_ascii=False)
            f_out.write('\n')
            # Calculate token statistics
            prompt_tokens += output['response']['usage']['prompt_tokens']
            completion_tokens += output['response']['usage']['completion_tokens']

## 8 - Examine the price of batch inference

Let's also calculate the price of the batched generation. We can take the price for input/output tokens of a model from the [AI Studio home page](https://studio.nebius.com/). For `Qwen/Qwen3-235B-A22B`, it is \\$0.2/\\$0.6 for 1M input/output tokens. However, thanks to using batched generation, it costs twice as little with \\$0.1/\\$0.3 for 1M input/output tokens

In [15]:
price = (prompt_tokens * 0.1 + completion_tokens * 0.3) / 1_000_000
print(f'Batched generation price: ${price:.1f}')

Batched generation price: $4.9


## 9 - Merge and Clean the dataset

Now let's merge our dataset with outputs with dataset with inputs, remove the instance for which the generation has not been finished, and check that the merge didn't break anything.

In [16]:
output_dataset = Dataset.from_json(output_save_path, split="train")
output_dataset = output_dataset.sort('id')
output_dataset

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['output', 'id'],
    num_rows: 22114
})

In [17]:
assert len(input_dataset) == len(output_dataset)
ft_dataset = concatenate_datasets([input_dataset, output_dataset], axis=1)
# Filter out unfinished generations
ft_dataset = ft_dataset.filter(lambda x: x['id'] not in ids_to_filter)
# Remove the `id` column, which is not useful anymore
ft_dataset = ft_dataset.remove_columns('id')
ft_dataset

Flattening the indices:   0%|          | 0/22114 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/22114 [00:00<?, ? examples/s]

Filter:   0%|          | 0/22114 [00:00<?, ? examples/s]

Dataset({
    features: ['input', 'output'],
    num_rows: 22092
})

In [18]:
ft_dataset[42]

{'input': 'I think we need both 48bit & softprin in Libdrm.',
 'output': 'I think we need both 48-bit and softprin in Libdrm.'}

## 10 - Save the final finetuning dataset

In [19]:
## Save final dataset in JSONL format

# ft_dataset.to_json('data/ft_dataset2.jsonl', orient="records", lines=True)

with open('data/ft_dataset.jsonl', 'w') as f:
    for i, record in enumerate(ft_dataset):
        json.dump(record, f, ensure_ascii=False)
        f.write('\n')
print("✅ Final dataset saved to 'data/ft_dataset.jsonl'")

✅ Final dataset saved to 'data/ft_dataset.jsonl'


Our dataset for fine-tuning is created! We can now procede to fine-tuning - the most exciting part for most of AI developers :-).