In [1]:
%load_ext autoreload
%autoreload 2

# InstructLab Skills Synthetic Data Generation

![InstructLab Banner](../../../assets/imgs/instructlab-banner.png)

This notebook demonstrates how to customize language models by generating training data for specific skills, following the methodology outlined in the LAB (Large-scale Alignment for Chatbots) framework [[paper link](https://arxiv.org/pdf/2403.01081)].

### Customizing Model Behavior

The LAB framework enables us to shape how a model responds to various tasks by training it on carefully crafted examples. Want your model to write emails in your company's tone? Need it to follow specific formatting guidelines? This customization is achieved through what the paper defines as compositional skills.

Compositional skills are tasks that combine different abilities to handle complex queries. For example, if you want your model to write company emails about quarterly performance, it needs to:
- Understand financial concepts
- Perform basic arithmetic
- Write in your preferred communication style
- Follow your organization's email format

### Demo Overview

This notebook will show you how to:
1. Set up a teacher model for generating training data
2. Create examples that reflect your preferred style and approach
3. Generate synthetic data
4. Validate that the generated data matches your requirements

The end goal is to create training data that will help align the model with your specific needs, whether that's matching your company's communication style, following particular protocols, or handling specialized tasks in your preferred way.

## Install sdg-hub

```bash 
pip install sdg-hub==0.1.0a2
```

## 🧑‍🏫 Step 1: Set Up the Teacher Model

This demo uses **Mixtral-8x7B-Instruct-v0.1** as the teacher model. We'll serve it using **vLLM**, and use **Llama Stack** to expose an OpenAI-compatible API.


### Serve the Model with vLLM

Launch the vLLM server with:

```bash
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2
```

This will start the model server at: `http://localhost:8000`
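
To confirm that the server is reachable before continuing, you can query it from Python. This is a minimal sketch, assuming the vLLM server exposes the usual OpenAI-compatible `/v1/models` route at the address above; the helper names `models_url`, `parse_model_ids`, and `list_served_models` are illustrative and not part of vLLM or sdg-hub.

```python
import json
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def parse_model_ids(payload: str) -> list:
    """Extract model IDs from an OpenAI-style model-listing response body."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> list:
    """Return the served model IDs, or an empty list if the server is not reachable."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as response:
            return parse_model_ids(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        # Assumption: a refused connection, timeout, or malformed body means "not ready yet".
        return []
```

An empty list from `list_served_models()` suggests the server is still loading or is listening on a different address; once it is ready, the list should include the model name passed to `vllm serve`.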

> ⚠️ Make sure your system has sufficient GPU memory.  
> 🔧 Adjust `--tensor-parallel-size` based on available GPUs.  
> ⏱️ First-time model loading may take several minutes.


### Set Up Llama Stack (OpenAI-Compatible Interface)

1. Clone and install Llama Stack (OpenAI-compatible branch)
```bash
git clone https://github.com/bbrowning/llama-stack.git
cd llama-stack
git checkout openai_server_compat
pip install -e .
```

2. Install the Python client
```bash
pip install llama-stack-client
```

3. Launch the Llama Stack Server (connected to vLLM)
```bash
export INFERENCE_MODEL=mistralai/Mixtral-8x7B-Instruct-v0.1
llama stack build --template remote-vllm
```

The server will start at: `http://localhost:8321`

You can use the CLI to verify the setup:

```bash
llama-stack-client \
  --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id $INFERENCE_MODEL \
  --message "write a haiku about language models"
```


Let's set up a client and test the connection from Python. 🚀


In [2]:
from openai import OpenAI

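# Placeholder key for the local server; the OpenAI client requires a non-empty value
openai_api_key = "EMPTY"
# OpenAI-compatible route exposed by the Llama Stack server
openai_api_base = "http://localhost:8321/v1/openai/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# List the models registered with the server (useful for checking the model ID)
models = client.models.list()
teacher_model = "mistralai/Mixtral-8x7B-Instruct-v0.1"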

# Test the connection with a simple completion
response = client.chat.completions.create(
    model=teacher_model,
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.0,
    max_tokens=10
)
completion = response.choices[0].message.content

print(f"Connection successful! {teacher_model}: {completion}")

Connection successful! mistralai/Mixtral-8x7B-Instruct-v0.1:  Hello! It's nice to meet you.


## ✍️ Step 2: Provide Custom Examples


#### Use Case: Teaching a Language Model the Skill: Unstructured Text → Markdown Table

Company X receives large volumes of user feedback through support emails, in-app surveys, and app store reviews. These messages often contain valuable product insights, but the content is unstructured and difficult to analyze at scale.

To streamline internal workflows, an AI team at Company X wants to teach a language model how to convert raw user feedback into structured markdown tables. These tables summarize key topics, user sentiment, and issues in a format that’s easy to scan, report, or push into dashboards and tracking systems.

We can do this using InstructLab!

#### 🧾 Example Input and Output

📥 Input (Unstructured Feedback)
```
Hey team — I’ve been using the new update for about a week now.

Couple of things:
- The dark mode is awesome, great job!
- But the loading time after login feels slower than before. Not a deal breaker but noticeable.
- I also noticed that the calendar widget doesn’t update properly if I change time zones.

Overall, I love where this is going. Just needs a few tweaks.
```
📤 Output (Markdown Table)

| Feature           | Feedback                                                               | Sentiment |
|------------------|------------------------------------------------------------------------|-----------|
| Dark Mode        | Works well, user is satisfied.                                          | Positive  |
| Login Performance| Loading time after login is slower than previous version.               | Negative  |
| Calendar Widget  | Doesn't update correctly when time zones change.                        | Negative  |
| Overall          | User is happy with the direction of the product, but suggests tweaks.   | Positive  |
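
When drafting target outputs like the one above by hand, it can help to build the table from structured rows so the column order and delimiters stay consistent. The following is a minimal sketch and not part of InstructLab or sdg-hub; `escape_cell` and `to_markdown_table` are hypothetical helper names, and the compact divider style is a choice made here for brevity.

```python
DEFAULT_COLUMNS = ("Feature", "Feedback", "Sentiment")


def escape_cell(value) -> str:
    """Make a value safe for a table cell: escape pipes and flatten newlines."""
    return str(value).replace("|", "\\|").replace("\n", " ").strip()


def to_markdown_table(rows, columns=DEFAULT_COLUMNS) -> str:
    """Render a list of dicts as a markdown table with the given column order."""

    def render(cells):
        return "| " + " | ".join(cells) + " |"

    lines = [render(columns), render(["---"] * len(columns))]
    lines += [render([escape_cell(row[col]) for col in columns]) for row in rows]
    return "\n".join(lines)


rows = [
    {"Feature": "Dark Mode", "Feedback": "Works well, user is satisfied.", "Sentiment": "Positive"},
    {"Feature": "Calendar Widget", "Feedback": "Doesn't update correctly when time zones change.", "Sentiment": "Negative"},
]
print(to_markdown_table(rows))
```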

#### InstructLab Grounded Skills Generation Pipeline

Now that we have laid out our use case, let's dive into the skills generation pipeline proposed by LAB.
You can find the flow details and block configuration in this YAML file (`src/instructlab/sdg/flows/generation/skills/simple_grounded_skill.yaml`).

InstructLab uses a multi-step process of generation and evaluation to produce synthetic data. For grounded skills, it looks like this:

<table>
<tr>
  <td>
    <img src="../../../assets/imgs/IL_skills_pipeline.png" alt="Skills Pipeline" width="250">
  </td>
  <td>
    <ul>
      <li>
        <strong>Context Generation (<code>gen_contexts</code>)</strong><br>
        Generates diverse, relevant contexts for the skill<br>
        Produces 10 unique contexts per run<br><br>
      </li>
      <li>
        <strong>Question Generation & Validation</strong><br>
        <code>gen_grounded_questions</code>: Creates 3 questions per context<br>
        <code>eval_grounded_questions</code>: Evaluates question quality<br>
        <code>filter_grounded_questions</code>: Keeps only perfect scores (1.0)<br><br>
      </li>
      <li>
        <strong>Response Generation & Quality Control</strong><br>
        <code>gen_grounded_responses</code>: Generates appropriate responses<br>
        <code>evaluate_grounded_qa_pair</code>: Scores Q&A pair quality<br>
        <code>filter_grounded_qa_pair</code>: Retains high-quality pairs (score ≥ 2.0)<br><br>
      </li>
      <li>
        <strong>Final Processing</strong><br>
        <code>combine_question_and_context</code>: Merges context with questions for complete examples<br><br>
      </li>
    </ul>
  </td>
</tr>
</table>
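
The two filtering stages above can be pictured as simple threshold checks. The following is a rough sketch of that logic only, not the sdg-hub implementation: the thresholds and per-run counts come from the list above, while the field names `question_score` and `qa_score` and both function names are hypothetical.

```python
CONTEXTS_PER_RUN = 10        # gen_contexts: 10 contexts per run
QUESTIONS_PER_CONTEXT = 3    # gen_grounded_questions: 3 questions per context
CANDIDATE_QUESTIONS_PER_RUN = CONTEXTS_PER_RUN * QUESTIONS_PER_CONTEXT  # 30 before filtering

QUESTION_PASS_SCORE = 1.0    # filter_grounded_questions: perfect scores only
QA_PAIR_MIN_SCORE = 2.0      # filter_grounded_qa_pair: score >= 2.0


def filter_questions(samples):
    """Keep samples whose question received a perfect evaluation score."""
    # Assumption: the evaluation score is stored under the key "question_score".
    return [s for s in samples if s["question_score"] == QUESTION_PASS_SCORE]


def filter_qa_pairs(samples):
    """Keep samples whose question-answer pair scored at or above the minimum."""
    # Assumption: the pair score is stored under the key "qa_score".
    return [s for s in samples if s["qa_score"] >= QA_PAIR_MIN_SCORE]
```

Because both filters only discard samples, the number of examples that survive a run can be noticeably lower than the number of candidates generated.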

#### Seed Data with Examples
Now that we've seen how LAB generates skill-specific data, let's walk through how to use it for our own use case.

As outlined in the LAB paper, the first step is to provide a small number of **seed examples** (typically 5) to bootstrap the skill. These examples are passed into the generation pipeline as input and are stored in a `.jsonl` file.
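
If you want to write seed examples for your own skill, a seed file can be produced with a few lines of Python. This is a minimal sketch, assuming each line is one JSON object with the four fields used by this demo's seed file (`task_description`, `seed_context`, `seed_question`, `seed_response`); the function name `write_seed_file` and the sample content are illustrative.

```python
import json
from pathlib import Path

SEED_FIELDS = ("task_description", "seed_context", "seed_question", "seed_response")


def write_seed_file(examples, path):
    """Validate seed examples and write them as one JSON object per line."""
    for index, example in enumerate(examples):
        missing = [field for field in SEED_FIELDS if field not in example]
        if missing:
            raise ValueError(f"seed example {index} is missing fields: {missing}")
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for example in examples:
            handle.write(json.dumps(example, ensure_ascii=False) + "\n")
    return path


# Illustrative seed example; replace the content with examples for your own skill.
example_seed = {
    "task_description": "Convert the following unstructured user feedback into a structured markdown table.",
    "seed_context": "The dark mode is awesome, great job! But the loading time after login feels slower than before.",
    "seed_question": "Convert the above feedback into a markdown table with columns for Feature, Feedback and Sentiment.",
    "seed_response": (
        "| Feature | Feedback | Sentiment |\n"
        "|---|---|---|\n"
        "| Dark Mode | Works well, user is satisfied. | Positive |\n"
        "| Login Performance | Loading time after login is slower than before. | Negative |"
    ),
}
```

Calling `write_seed_file([example_seed], "my_seeds.jsonl")` would then produce a file that can be loaded the same way as the demo's seed file.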

For this demo, we’ll use the pre-populated seed file located at: [mdtable_seeds.jsonl](examples/instructlab/skills/sample_data/mdtable_seeds.jsonl)

Let's open the file and look at a row:

In [3]:
from datasets import load_dataset

# Load the seed dataset
seed_data = load_dataset("json", data_files="sample_data/mdtable_seeds.jsonl", split="train")

# Display the first example
seed_data[0]


{'task_description': 'Convert the following unstructured user feedback into a structured markdown table.',
 'seed_context': "Been using the new dashboard for a few days. It's way faster than the previous one, really appreciate the snappy filters. But export to CSV seems broken — nothing happens when I click it. Also, dark mode resets every time I log in.",
 'seed_question': 'I would like to convert the above feedback into a markdown table with columns for Feature, Feedback and Sentiment.',
 'seed_response': "| Feature           | Feedback                                                           | Sentiment |\n|------------------|--------------------------------------------------------------------|-----------|\n| Dashboard        | Much faster than previous version, filters are responsive.         | Positive  |\n| Export to CSV    | Clicking the export button doesn't trigger a download.             | Negative  |\n| Dark Mode        | Resets to light mode on login.                      

## 🚀 Step 3: Generate Synthetic Data

Now that we have our seed data ready, we can use LAB’s Skill Data Generator to create **high-quality synthetic training examples** for our custom skill.

This step leverages a predefined **flow configuration** that encodes how seed examples are expanded — by generating new contexts, questions, and responses, and filtering them for quality.

In this demo, we'll use the `synth_grounded_skills.yaml` flow, which follows LAB's grounded generation pattern (context → question → response).

In [4]:
from sdg_hub.flow import Flow
from sdg_hub.pipeline import Pipeline
from sdg_hub.sdg import SDG

# Path to the skill generation flow configuration
flow_path = "../../../src/sdg_hub/flows/generation/skills/synth_grounded_skills.yaml"

# Load the flow
flow = Flow(client).get_flow_from_file(flow_path)

# Initialize the synthetic data generator
generator = SDG(
    [Pipeline(flow)],
)

At this point, the generator is ready to run the full pipeline — including context generation, question/response generation, evaluation, and filtering — to produce a synthetic dataset that can be used for fine-tuning or skill bootstrapping.

In the next step, we'll run this pipeline and inspect the generated outputs. This should take about a minute.

In [5]:
generated_data = generator.generate(seed_data)

Map: 100%|██████████| 211/211 [00:00<00:00, 15834.36 examples/s]
Filter: 100%|██████████| 211/211 [00:00<00:00, 56887.46 examples/s]
Filter: 100%|██████████| 211/211 [00:00<00:00, 41071.01 examples/s]


Map: 100%|██████████| 150/150 [00:00<00:00, 13887.81 examples/s]
Filter: 100%|██████████| 150/150 [00:00<00:00, 56008.69 examples/s]
Filter: 100%|██████████| 150/150 [00:00<00:00, 36493.36 examples/s]
Map (num_proc=8): 100%|██████████| 150/150 [00:00<00:00, 969.40 examples/s]


## 🔍 Step 4: Explore and Validate the Synthetically Generated Data

Once the skill generation pipeline has been executed, the output is a set of **synthetically generated examples** — new context-question-response triples that follow the same structure as the seed data but are expanded and refined by the teacher model.

Below is an example of one generated entry:

In [12]:
import random

# Pick a random index into the generated dataset
rand_idx = random.randrange(len(generated_data))
generated_data[rand_idx]

{'task_description': 'Convert the following unstructured user feedback into a structured markdown table.',
 'seed_context': 'The analytics view is very informative. Would love to see breakdown by team as well. Charts sometimes take a few seconds to load though. Mobile layout is clean.',
 'seed_question': 'Please convert the above feedback into a markdown table with columns for Feature, Feedback and Sentiment.',
 'seed_response': '| Feature           | Feedback                                                             | Sentiment |\n|------------------|----------------------------------------------------------------------|-----------|\n| Analytics View    | Provides useful insights.                                           | Positive  |\n| Team Breakdown    | Requested feature not currently available.                         | Neutral   |\n| Charts            | Load slowly on occasion.                                            | Negative  |\n| Mobile Layout     | Clean and well-desi

In [13]:
print(generated_data[rand_idx]['question'])

The user provided feedback on the e-commerce website design. They mentioned that the product search function is efficient and easy to use, but suggested improving the search results by adding images. The user also likes the layout of the shopping cart and checkout process. However, they pointed out that the site could benefit from a more prominent display of shipping costs. The user appreciates the guest checkout option but feels that creating an account should offer some benefits, such as discounts or quicker checkout in the future.

Regarding the e-commerce website design, how would you express the user's thoughts on the guest checkout option and the benefits of creating an account in a markdown table with columns for 'Feature', 'Feedback', and 'Sentiment'?


In [14]:
print(generated_data[rand_idx]['response'])

| Feature                  | Feedback                                                                                   | Sentiment |
|-------------------------|--------------------------------------------------------------------------------------------|-----------|
| Guest Checkout Option   | Appreciated for its convenience, but could be improved with more incentives.              | Neutral   |
| Creating an Account      | User suggested offering benefits like discounts or quicker checkout in the future.        | Positive  |
| Benefits of Creating an Account | Not explicitly stated, but user implied they should offer value for account creation. | Positive  |
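
Beyond reading individual samples, a simple structural check can flag responses that do not match the expected table format. The following is a minimal sketch and not part of sdg-hub: `split_row` and `validate_table` are hypothetical helpers, the allowed sentiment labels are an assumption based on the examples in this notebook, and escaped pipes inside cells are not handled.

```python
EXPECTED_COLUMNS = ["Feature", "Feedback", "Sentiment"]
ALLOWED_SENTIMENTS = {"Positive", "Negative", "Neutral"}  # assumption drawn from the examples


def split_row(line: str) -> list:
    """Split one markdown table line into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def validate_table(markdown: str) -> list:
    """Return a list of problems found in a markdown table (empty if none)."""
    lines = [line for line in markdown.strip().splitlines() if line.strip()]
    if len(lines) < 3:
        return ["table needs a header, a divider, and at least one row"]
    problems = []
    if split_row(lines[0]) != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {split_row(lines[0])}")
    # A divider cell contains only dashes, colons, and spaces, with at least one dash.
    if not all(set(cell) <= set("-: ") and "-" in cell for cell in split_row(lines[1])):
        problems.append("second line is not a divider row")
    for number, line in enumerate(lines[2:], start=1):
        cells = split_row(line)
        if len(cells) != len(EXPECTED_COLUMNS):
            problems.append(f"row {number} has {len(cells)} cells")
        elif cells[-1] not in ALLOWED_SENTIMENTS:
            problems.append(f"row {number} has unknown sentiment {cells[-1]!r}")
    return problems
```

Running `validate_table(sample['response'])` over every generated sample and collecting the non-empty results gives a quick list of entries worth reviewing by hand. A check like this covers structure only; whether the feedback and sentiment are accurate still needs human review.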


## 🏁 Conclusion

In this notebook, we demonstrated how to teach a custom skill to a language model using InstructLab's synthetic data generation (SDG) pipeline. Starting from a small set of seed examples, we walked through the full pipeline: context creation, question generation, response synthesis, evaluation, and filtering.

We explored a real-world use case: **transforming unstructured user feedback into structured markdown tables**, and showed how the LAB framework can automate the generation of high-quality, instructional training data at scale.

This approach is especially powerful for procedural or domain-specific tasks where labeled data is scarce but consistent task logic can be modeled. With just a few carefully curated seed examples, you can unlock scalable skill creation and push new capabilities into LLMs with minimal manual effort.

You're now ready to use these synthetic examples for fine-tuning small models!
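
Many fine-tuning tools expect chat-style records rather than separate question and response columns. The following is a minimal sketch of such a conversion, assuming each generated sample has the `question` and `response` fields shown earlier; the `messages` layout and the function names `to_chat_record` and `to_jsonl` are assumptions, so check the format your training tool actually requires.

```python
import json


def to_chat_record(sample: dict) -> dict:
    """Turn one generated sample into a two-turn chat record."""
    return {
        "messages": [
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["response"]},
        ]
    }


def to_jsonl(samples) -> str:
    """Serialize samples as JSON Lines, one chat record per line."""
    return "\n".join(json.dumps(to_chat_record(s), ensure_ascii=False) for s in samples)
```

Writing the result of `to_jsonl(generated_data)` to a file gives one training record per line.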

Next steps? Try adapting this pipeline to your own task, domain, or format — whether it’s triaging support tickets, extracting structured data, or following domain-specific workflows. The skills are yours to create.