# DPO Fine-Tuning with Intel Orca Dataset on Microsoft Foundry

This notebook demonstrates how to fine-tune language models using **Direct Preference Optimization (DPO)** with the Intel Orca DPO Pairs dataset.

## What You'll Learn
1. Understand DPO fine-tuning
2. Prepare and format DPO training data  
3. Upload datasets to Microsoft Foundry
4. Create and monitor a DPO fine-tuning job
5. Evaluate your fine-tuned model

Note: Execute each cell in sequence.
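
As background for the first item, the objective that DPO optimizes can be sketched in a few lines. This is only an illustration of the published DPO loss, not the code the fine-tuning service runs; the function name `dpo_loss`, the `beta` value of 0.1, and the log-probabilities are all assumptions chosen for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss computed from summed sequence log-probabilities."""
    # How much more the policy favors each response than the frozen reference model does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as softplus(-logits) in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Made-up log-probabilities: before training the policy equals the reference model
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After training the policy prefers the chosen response more than the reference does
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"untrained: {untrained:.4f}, improved: {improved:.4f}")
```

When the policy and the reference model agree, the loss equals log 2 (about 0.693). It falls as the policy raises the preferred response and lowers the non-preferred one relative to the reference.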

## 1. Setup and Installation

Install all required packages from `requirements.txt`.

In [1]:
pip install -r requirements.txt

Collecting azure-mgmt-cognitiveservices (from -r requirements.txt (line 9))
  Downloading azure_mgmt_cognitiveservices-14.1.0-py3-none-any.whl.metadata (32 kB)
Collecting azure-ai-evaluation>=1.13.0 (from -r requirements.txt (line 12))
  Downloading azure_ai_evaluation-1.15.3-py3-none-any.whl.metadata (51 kB)
Collecting nltk>=3.9.1 (from azure-ai-evaluation>=1.13.0->-r requirements.txt (line 12))
  Downloading nltk-3.9.3-py3-none-any.whl.metadata (3.2 kB)
Collecting pandas<3.0.0,>=2.1.2 (from azure-ai-evaluation>=1.13.0->-r requirements.txt (line 12))
  Downloading pandas-2.3.3-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting Jinja2>=3.1.6 (from azure-ai-evaluation>=1.13.0->-r requirements.txt (line 12))
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting joblib (from nltk>=3.9.1->azure-ai-evaluation>=1.13.0->-r requirements.txt (line 12))
  Downloading joblib-1.5.3-py3-none-any.whl.metadata (5.5 kB)
Collecting regex>=2021.8.3 (from nltk>=3.9.1->azure-ai-evalu

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
semantic-kernel 1.30.0 requires pydantic!=2.10.0,!=2.10.1,!=2.10.2,!=2.10.3,<2.12,>=2.0, but you have pydantic 2.12.5 which is incompatible.

[notice] A new release of pip is available: 25.0.1 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## 2. Import Libraries

In [3]:
import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

print("All libraries imported successfully")

All libraries imported successfully


## 3. Define Evaluation Function

Define a function that evaluates model performance using the Azure AI Evaluation SDK.

In [4]:
def evaluate_model(deployment_name, num_samples=10, evaluator_model=None):
    """
    Evaluate a model deployment using Azure AI Evaluation SDK.
    
    Args:
        deployment_name: Name of the deployed model to evaluate
        num_samples: Number of samples to evaluate (default: 10)
        evaluator_model: Name of the model to use for evaluation (default: use base model from env)
    
    Returns:
        Dictionary containing evaluation metrics
    """
    import json
    from azure.ai.evaluation import evaluate, CoherenceEvaluator, FluencyEvaluator, GroundednessEvaluator
    from openai import AzureOpenAI
    
    if evaluator_model is None:
        evaluator_model = os.getenv("DEPLOYMENT_NAME")
    
    print(f"Evaluating deployment: {deployment_name}")
    print(f"Using evaluator model: {evaluator_model}")
    print(f"Using {num_samples} samples from training.jsonl")
    
    azure_openai_client = AzureOpenAI(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        api_key=os.getenv("AZURE_OPENAI_KEY"),
        api_version="2024-08-01-preview"
    )
    
    print("Generating model responses...")
    eval_data = []
    with open("training.jsonl", 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= num_samples:
                break
            sample = json.loads(line)
            
            messages = sample["input"]["messages"]
            query = next((msg["content"] for msg in messages if msg["role"] == "user"), "")
            
            response = azure_openai_client.chat.completions.create(
                model=deployment_name,
                messages=messages,
                temperature=0.7,
                max_tokens=500
            )
            model_response = response.choices[0].message.content
            
            ground_truth = next((msg["content"] for msg in sample["preferred_output"] if msg["role"] == "assistant"), "")
            
            eval_data.append({
                "query": query,
                "response": model_response,
                "ground_truth": ground_truth
            })
            if (i + 1) % 10 == 0 or (i + 1) == num_samples:
                print(f"  Processed {i+1}/{num_samples}")
    
    eval_file = f"evaluation_data_{deployment_name.replace('-', '_')}.jsonl"
    with open(eval_file, 'w', encoding='utf-8') as f:
        for item in eval_data:
            f.write(json.dumps(item) + '\n')
    
    model_config = {
        "azure_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
        "api_key": os.getenv("AZURE_OPENAI_KEY"),
        "azure_deployment": evaluator_model,
        "api_version": "2024-08-01-preview",
    }
    
    print("Running evaluation with 3 metrics...")
    try:
        results = evaluate(
            data=eval_file,
            evaluators={
                "coherence": CoherenceEvaluator(model_config=model_config),
                "fluency": FluencyEvaluator(model_config=model_config),
                "groundedness": GroundednessEvaluator(model_config=model_config)
            },
            evaluator_config={
                "default": {
                    "column_mapping": {
                        "query": "${data.query}",
                        "response": "${data.response}",
                        "ground_truth": "${data.ground_truth}"
                    }
                },
                "groundedness": {
                    "column_mapping": {
                        "query": "${data.query}",
                        "response": "${data.response}",
                        "context": "${data.ground_truth}"
                    }
                }
            },
            output_path=f"./evaluation_results_{deployment_name.replace('-', '_')}"
        )
    except Exception as e:
        print(f"Evaluation encountered an error: {str(e)}")
        print("Results may be incomplete. Check the output folder for partial results.")
        results = {"metrics": {}, "error": str(e)}
    
    print(f"EVALUATION RESULTS: {deployment_name}\n")
    
    if "metrics" in results:
        metrics = results["metrics"]
        
        coherence = metrics.get('coherence.coherence', metrics.get('coherence'))
        fluency = metrics.get('fluency.fluency', metrics.get('fluency'))
        groundedness = metrics.get('groundedness.groundedness', metrics.get('groundedness'))
        
        if coherence is not None:
            print(f"Coherence:      {coherence:.4f} (1-5 scale)")
        if fluency is not None:
            print(f"Fluency:        {fluency:.4f} (1-5 scale)")
        if groundedness is not None:
            print(f"Groundedness:   {groundedness:.4f} (1-5 scale)")
    
    print("="*60)
    print(f"Detailed results: ./evaluation_results_{deployment_name.replace('-', '_')}")
    
    return results

## 4. Configure Azure Environment
Set your Microsoft Foundry project endpoint and model name. This example uses **gpt-4.1-mini**, but you can use other supported GPT models. Copy the file `.env.template` (located in this folder), save it as `.env`, and enter values for the environment variables required by the job you want to run.

```
# Required for DPO Fine-Tuning
MICROSOFT_FOUNDRY_PROJECT_ENDPOINT=<your-endpoint> 
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>
AZURE_AOAI_ACCOUNT=<your-foundry-account-name>
MODEL_NAME=<your-base-model-name>

# Required for Local Model Evaluation
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_KEY=<your-azure-openai-api-key>
DEPLOYMENT_NAME=<your-deployment-name>
```

In [19]:
# Load environment variables
load_dotenv()

endpoint = os.environ.get("MICROSOFT_FOUNDRY_PROJECT_ENDPOINT")
model_name = os.environ.get("MODEL_NAME")

# Define dataset file paths
training_file_path = "training.jsonl"
validation_file_path = "validation.jsonl"

print(f"Base model: {model_name}")
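
The `training.jsonl` and `validation.jsonl` files are expected to hold one preference record per line. The evaluation function in this notebook reads `input.messages` and `preferred_output` from each record. The sketch below shows one way a dataset row might be mapped to that shape. It assumes the source rows carry `system`, `question`, `chosen`, and `rejected` fields, and that the rejected answer is stored under a `non_preferred_output` key; verify both against your data and the current service documentation. The helper name `to_preference_record` and the sample row are hypothetical.

```python
import json

def to_preference_record(row):
    """Map one Orca-style row to a preference record (field names are assumptions)."""
    messages = []
    if row.get("system"):
        messages.append({"role": "system", "content": row["system"]})
    messages.append({"role": "user", "content": row["question"]})
    return {
        "input": {"messages": messages},
        "preferred_output": [{"role": "assistant", "content": row["chosen"]}],
        "non_preferred_output": [{"role": "assistant", "content": row["rejected"]}],
    }

# Made-up row for illustration only
example_row = {
    "system": "You are a helpful assistant.",
    "question": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}

record = to_preference_record(example_row)
# Each record becomes a single line of the JSONL file
jsonl_line = json.dumps(record, ensure_ascii=False)
print(jsonl_line)
```

Writing one `json.dumps` result per line keeps the file compatible with the line-by-line reader used in the evaluation function.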

## 5. Connect to Microsoft Foundry Project

Connect to Microsoft Foundry Project using Azure credential authentication. This initializes the project client and OpenAI client needed for fine-tuning workflows. Ensure you have the **Azure AI User** role assigned to your account for the Microsoft Foundry Project resource.

In [6]:
credential = DefaultAzureCredential()
project_client = AIProjectClient(endpoint=endpoint, credential=credential)
openai_client = project_client.get_openai_client()

print("Connected to Microsoft Foundry Project")

Connected to Microsoft Foundry Project


## 6. Upload Training Files

Upload the training and validation JSONL files to Microsoft Foundry. Each file is assigned a unique ID that will be referenced when creating the fine-tuning job.
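
Because a malformed record can cause a job to fail only after the upload, it may be worth checking the files locally first. The following is a minimal sketch, not an official validator. The required key names are assumptions based on the preference format this notebook reads, and `find_format_problems` is a hypothetical helper.

```python
import json

# Assumed top-level keys of a preference record
REQUIRED_KEYS = ("input", "preferred_output", "non_preferred_output")

def find_format_problems(lines):
    """Return (line_number, problem) tuples for records that look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((number, f"missing keys: {', '.join(missing)}"))
            continue
        if not isinstance(record["input"], dict) or not record["input"].get("messages"):
            problems.append((number, "input.messages is empty or missing"))
    return problems

# Small in-memory sample: one valid record followed by three broken ones
user_turn = [{"role": "user", "content": "Hi"}]
answer = [{"role": "assistant", "content": "Hello"}]
sample_lines = [
    json.dumps({"input": {"messages": user_turn}, "preferred_output": answer, "non_preferred_output": answer}),
    json.dumps({"input": {"messages": []}, "preferred_output": answer, "non_preferred_output": answer}),
    '{"input": ',
    json.dumps({"input": {"messages": user_turn}}),
]
problems = find_format_problems(sample_lines)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

The function accepts any iterable of lines, so an open file handle for `training_file_path` or `validation_file_path` can be passed in directly.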

In [7]:
print("Uploading training file...")
with open(training_file_path, "rb") as f:
    train_file = openai_client.files.create(file=f, purpose="fine-tune")
print(f"Training file ID: {train_file.id}")

print("\nUploading validation file...")
with open(validation_file_path, "rb") as f:
    validation_file = openai_client.files.create(file=f, purpose="fine-tune")
print(f"Validation file ID: {validation_file.id}")

Uploading training file...
Training file ID: file-790562148477483a9ddccb51d4a02c7e

Uploading validation file...
Validation file ID: file-f1ef0a6914e046d6b35fa81ad6ca2f71


In [8]:
print("Waiting for files to be processed...")
openai_client.files.wait_for_processing(train_file.id)
openai_client.files.wait_for_processing(validation_file.id)
print("Files ready!")

Waiting for files to be processed...
Files ready!


## 7. Evaluate Base Model

Establish baseline performance metrics by evaluating the base model before DPO fine-tuning. This provides a comparison point to measure improvements after training.



In [16]:
base_deployment = os.getenv("DEPLOYMENT_NAME")
print(f"Evaluating base model: {base_deployment}\n")

base_results = evaluate_model(base_deployment, num_samples=50)

## 8. Create DPO Fine-Tuning Job
Create a DPO fine-tuning job with your uploaded datasets. Configure the following hyperparameters to control the training process:

1. `n_epochs` (3): Number of complete passes through the training dataset. More epochs can improve performance but may lead to overfitting. Typical range: 1-10.
2. `batch_size` (1): Number of training examples processed together in each iteration. Smaller batches (1-2) are common for DPO to maintain training stability.
3. `learning_rate_multiplier` (1.0): Scales the default learning rate. Values < 1.0 make training more conservative, while values > 1.0 speed up learning but may cause instability. Typical range: 0.1-2.0.

Adjust these values based on your dataset size and desired model behavior. Start with these defaults and experiment if needed.

In [None]:
fine_tuning_job = openai_client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=validation_file.id,
    model=model_name,
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {
                "n_epochs": 3,
                "batch_size": 1,
                "learning_rate_multiplier": 1.0
            }
        }
    },
    extra_body={"trainingType": "GlobalStandard"}
)

print(f"Job ID: {fine_tuning_job.id}")
print(f"Status: {fine_tuning_job.status}")

## 9. Monitor Training Progress
Check the status of your fine-tuning job and track its progress. You can view the current status and recent training events. Training duration varies with dataset size, model, and hyperparameters, and typically ranges from minutes to several hours.

In [12]:
job_status = openai_client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
print(f"Status: {job_status.status}")

In [17]:
# View recent events
events = list(openai_client.fine_tuning.jobs.list_events(fine_tuning_job.id, limit=10))
for event in events:
    print(event.message)
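
Rather than re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below takes the status lookup as a callable, so it runs here against scripted statuses with no service call. The helper name `wait_for_job`, the polling interval, the timeout, and the set of terminal statuses are assumptions; confirm the status names your service actually reports.

```python
import time

# Assumed terminal states of a fine-tuning job
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_status, poll_seconds=60.0, timeout_seconds=6 * 3600, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        print(f"Status: {status}")
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still '{status}' after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Offline demonstration: scripted statuses stand in for real service responses
scripted = iter(["pending", "running", "running", "succeeded"])
final_status = wait_for_job(lambda: next(scripted), poll_seconds=0, sleep=lambda seconds: None)
print(f"Final status: {final_status}")
```

In this notebook the callable could be `lambda: openai_client.fine_tuning.jobs.retrieve(fine_tuning_job.id).status`, which reuses the same retrieve call as the status cell.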

## 10. Retrieve Fine-Tuned Model
After the fine-tuning job succeeds, retrieve the fine-tuned model ID. This ID is needed to deploy your customized model for inference.

In [18]:
completed_job = openai_client.fine_tuning.jobs.retrieve(fine_tuning_job.id)

if completed_job.status == "succeeded":
    fine_tuned_model_id = completed_job.fine_tuned_model
    print(f"Fine-tuned Model ID: {fine_tuned_model_id}")
else:
    print(f"Status: {completed_job.status}")

## 11. Deploy the Fine-Tuned Model

Deploy the fine-tuned model to Azure OpenAI as a deployment endpoint. This step is required before making inference calls. The deployment uses the GlobalStandard SKU with a capacity of 200.

In [None]:
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import Deployment, DeploymentProperties, DeploymentModel, Sku

subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID")
resource_group = os.environ.get("AZURE_RESOURCE_GROUP")
account_name = os.environ.get("AZURE_AOAI_ACCOUNT")

deployment_name = "gpt-4.1-mini-dpo-finetuned"

with CognitiveServicesManagementClient(credential=credential, subscription_id=subscription_id) as cogsvc_client:
    deployment_model = DeploymentModel(format="OpenAI", name=fine_tuned_model_id, version="1")
    deployment_properties = DeploymentProperties(model=deployment_model)
    deployment_sku = Sku(name="GlobalStandard", capacity=200)
    deployment_config = Deployment(properties=deployment_properties, sku=deployment_sku)
    
    print(f"Deploying fine-tuned model: {fine_tuned_model_id}")
    deployment = cogsvc_client.deployments.begin_create_or_update(
        resource_group_name=resource_group,
        account_name=account_name,
        deployment_name=deployment_name,
        deployment=deployment_config,
    )
    
    print("Waiting for deployment to complete...")
    deployment.result()

print(f"Model deployment completed: {deployment_name}")

Deploying fine-tuned model: gpt-4.1-mini-2025-04-14.ft-4cad7de198a34baeb4f0c95ff01ac844
Waiting for deployment to complete...
Model deployment completed: gpt-4.1-mini-dpo-finetuned


## 12. Test Your Fine-Tuned Model

Validate your fine-tuned model by running test inferences. This helps you assess whether the DPO training successfully aligned the model with the preferred response patterns in the training data.

In [10]:
print(f"Testing fine-tuned model via deployment: {deployment_name}")

response = openai_client.responses.create(
    model=deployment_name,
    input=[{"role": "user", "content": "Explain machine learning in simple terms."}]
)

print(f"Model response: {response.output_text}")

Testing fine-tuned model via deployment: gpt-4.1-mini-dpo-finetuned
Model response: Machine learning is like teaching a computer to learn from experience, similar to how people do. Instead of programming specific instructions for every task, we give the computer a lot of data and it figures out patterns on its own. Then, it can use what it learned to make decisions or predictions. For example, if you show a machine learning system lots of pictures of cats and dogs, it will learn to recognize which is which by itself.


## 13. Evaluate Fine-Tuned Model

Evaluate your model using the Azure AI Evaluation SDK to measure quality improvements from DPO fine-tuning.

We'll assess 3 key metrics:
- **Coherence**: Logical flow and structure
- **Fluency**: Grammatical correctness and naturalness
- **Groundedness**: Factual accuracy against context

In [None]:
base_model = os.getenv("DEPLOYMENT_NAME")  # Use base model for evaluation

print(f"Evaluating fine-tuned model: {deployment_name}")
print(f"Using base model as evaluator: {base_model}\n")

finetuned_results = evaluate_model(deployment_name, num_samples=50, evaluator_model=base_model)

print("\nCompare base model vs fine-tuned model metrics to see DPO improvements!")

Evaluating fine-tuned model: gpt-4.1-mini-dpo-finetuned
Using base model as evaluator: gpt-4.1-mini

Evaluating deployment: gpt-4.1-mini-dpo-finetuned
Using evaluator model: gpt-4.1-mini
Using 50 samples from training.jsonl
Generating model responses...
  Processed 10/50
  Processed 20/50
  Processed 30/50
  Processed 40/50
  Processed 50/50
Running evaluation with 3 metrics...
2026-01-05 14:50:06 +0530   24016 execution.bulk     INFO     Finished 1 / 50 lines.
2026-01-05 14:50:06 +0530   24016 execution.bulk     INFO     Average execution time for completed lines: 8.14 seconds. Estimated time for incomplete lines: 398.86 seconds.
2026-01-05 14:50:07 +0530   24016 execution.bulk     INFO     Finished 2 / 50 lines.
2026-01-05 14:50:07 +0530   24016 execution.bulk     INFO     Average execution time for completed lines: 4.35 seconds. Estimated time for incomplete lines: 208.8 seconds.
2026-01-05 14:50:07 +0530   22800 execution.bulk     INFO     Finished 1 / 50 lines.

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "coherence_20260105_091958_326856"
Run status: "Completed"
Start time: "2026-01-05 09:19:58.326856+00:00"
Duration: "0:00:39.278192"

2026-01-05 14:50:39 +0530   22800 execution.bulk     INFO     Finished 41 / 50 lines.
2026-01-05 14:50:39 +0530   22800 execution.bulk     INFO     Average execution time for completed lines: 1.0 seconds. Estimated time for incomplete lines: 9.0 seconds.
2026-01-05 14:50:39 +0530   22800 execution.bulk     INFO     Finished 42 / 50 lines.
2026-01-05 14:50:39 +0530   22800 execution.bulk     INFO     Average execution time for completed lines: 0.98 seconds. Estimated time for incomplete lines: 7.84 seconds.
2026-01-05 14:50:39 +0530   22800 execution.bulk     INFO     Finished 43 / 50 lines.
2026-01-05 14:50:39 +0530   22800 execution.bulk     INFO     Average execution time for completed lines: 0.95 seconds. Estimated time for incomplete lines: 6.65 seconds.
2026-01-05 14:50:39 +0530   22800 execution.bulk     INFO     Finished 44 / 50 lines.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "fluency_20260105_091958_332852"
Run status: "Completed"
Start time: "2026-01-05 09:19:58.332852+00:00"
Duration: "0:00:42.034842"

2026-01-05 14:50:40 +0530   28448 execution.bulk     INFO     Finished 41 / 50 lines.
2026-01-05 14:50:40 +0530   28448 execution.bulk     INFO     Average execution time for completed lines: 1.03 seconds. Estimated time for incomplete lines: 9.27 seconds.
2026-01-05 14:50:40 +0530   28448 execution.bulk     INFO     Finished 42 / 50 lines.
2026-01-05 14:50:40 +0530   28448 execution.bulk     INFO     Average execution time for completed lines: 1.0 seconds. Estimated time for incomplete lines: 8.0 seconds.
2026-01-05 14:50:40 +0530   28448 execution.bulk     INFO     Finished 43 / 50 lines.
2026-01-05 14:50:40 +0530   28448 execution.bulk     INFO     Average execution time for completed lines: 0.98 seconds. Estimated time for incomplete lines: 6.86 seconds.
2026-01-05 14:50:40 +0530   28448 execution.bulk     INFO     Finished 44 / 50 lines.

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "groundedness_20260105_091958_336895"
Run status: "Completed"
Start time: "2026-01-05 09:19:58.336895+00:00"
Duration: "0:00:43.737933"


{
    "coherence": {
        "status": "Completed",
        "duration": "0:00:39.278192",
        "completed_lines": 50,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    },
    "fluency": {
        "status": "Completed",
        "duration": "0:00:42.034842",
        "completed_lines": 50,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    },
    "groundedness": {
        "status": "Completed",
        "duration": "0:00:43.737933",
        "completed_lines": 50,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    }
}


Evaluation results saved to "C:\work\AMLRepos\fine-tuning\Demos\DPO_Intel_Orca\evaluation_results_gpt_4.1_mini_dpo_finetuned".

EVALUA

## 13.1 Model Comparison Results

Below is an example comparison between base model and fine-tuned model evaluation results (using 50 samples):

| Metric | Base Model | Fine-Tuned Model | Change | Status |
|--------|-----------|------------------|--------|--------|
| **Coherence** | 4.3000 | 3.8400 | -0.4600 | Decreased |
| **Fluency** | 3.5800 | 2.7000 | -0.8800 | Decreased |
| **Groundedness** | 4.1000 | 3.1000 | -1.0000 | Decreased |

### Understanding the Results

The example above shows that the fine-tuned model performed worse than the base model across all metrics. This indicates that the DPO training did not improve model quality with the current configuration.
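
A comparison like the example table can be produced directly from the two result dictionaries. This is a minimal sketch: `compare_metrics` is a hypothetical helper, and the metric keys follow the `name.name` pattern that the evaluation function in this notebook looks up, with the plain name as a fallback. The numbers are copied from the example table.

```python
def compare_metrics(base_metrics, tuned_metrics, names=("coherence", "fluency", "groundedness")):
    """Return (name, base, tuned, change, status) rows for metrics present in both dicts."""
    rows = []
    for name in names:
        key = f"{name}.{name}"
        base = base_metrics.get(key, base_metrics.get(name))
        tuned = tuned_metrics.get(key, tuned_metrics.get(name))
        if base is None or tuned is None:
            continue  # skip metrics missing from either run
        change = tuned - base
        status = "Increased" if change > 0 else "Decreased" if change < 0 else "Unchanged"
        rows.append((name, base, tuned, change, status))
    return rows

# Values copied from the example comparison table
base_example = {"coherence.coherence": 4.30, "fluency.fluency": 3.58, "groundedness.groundedness": 4.10}
tuned_example = {"coherence.coherence": 3.84, "fluency.fluency": 2.70, "groundedness.groundedness": 3.10}

rows = compare_metrics(base_example, tuned_example)
for name, base, tuned, change, status in rows:
    print(f"{name:<14}{base:.4f}  {tuned:.4f}  {change:+.4f}  {status}")
```

With the notebook's own variables, the call would be `compare_metrics(base_results["metrics"], finetuned_results["metrics"])`.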

### How to Improve Results

To achieve better fine-tuned model performance, experiment with:

**1. Hyperparameter Tuning:**
- **Reduce epochs**: Try `n_epochs=1` or `n_epochs=2` to prevent overfitting
- **Lower learning rate**: Set `learning_rate_multiplier=0.5` or `0.1` for more conservative training
- **Adjust batch size**: Keep at 1-2 for DPO stability

**2. Training Data:**
- **Increase sample size**: Use more training examples (e.g., 100-1000 samples)
- **Verify data quality**: Ensure "preferred_output" responses are truly higher quality than rejected ones
- **Review data format**: Confirm DPO pairs are correctly labeled

**3. Evaluation Settings:**
- **Increase evaluation samples**: Use `num_samples=100` or more for more reliable metrics
- **Test on different data**: Evaluate on a separate validation set, not training data
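
To evaluate on data the model was not trained on, a held-out set can be carved out before training. The sketch below uses a seeded shuffle so the split is reproducible. The helper name `split_records`, the 20% fraction, and the seed are assumptions. Note that the evaluation function in this notebook reads `training.jsonl` directly, so it would need to be pointed at the held-out file.

```python
import random

def split_records(records, eval_fraction=0.1, seed=42):
    """Shuffle records with a fixed seed and hold out a fraction for evaluation."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Placeholder records; in practice these would be the parsed preference records
records = [{"id": i} for i in range(100)]
train_records, heldout_records = split_records(records, eval_fraction=0.2)
print(f"train: {len(train_records)}, held out: {len(heldout_records)}")
```

Fixing the seed keeps the held-out set identical across runs, so base-model and fine-tuned scores are measured on the same examples.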

### Success Indicators

Your fine-tuning is successful when you see **positive changes** like:
- Coherence: +0.5 or higher
- Fluency: +0.3 or higher  
- Groundedness: +0.4 or higher

Iterate on hyperparameters and training data until your fine-tuned model consistently outperforms the base model!

## 14. Continual Fine-Tuning (Optional)

If your fine-tuned model didn't show improvements, you can perform **continual fine-tuning** by using the fine-tuned model as the base for another round of training. This iterative approach can help refine the model further.

### When to Use Continual Fine-Tuning:
- Your first fine-tuning run didn't improve metrics
- You want to adjust hyperparameters and train further
- You have additional training data to incorporate
- You need to fine-tune for a more specific task

### How It Works:
Instead of using `model_name` (the base model), use `fine_tuned_model_id` from section 10 as your new base model. The code below is the same as in section 8, but modified to continue training from your fine-tuned model.

In [None]:

continual_base_model = fine_tuned_model_id
print(f"Continual fine-tuning using base model: {continual_base_model}")

continual_job = openai_client.fine_tuning.jobs.create(
    training_file=train_file.id,  # You can use the same or upload new training data
    validation_file=validation_file.id,
    model=continual_base_model,  # Using fine-tuned model instead of base model
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {
                "n_epochs": 2,  # Reduced from 3 to prevent overfitting
                "batch_size": 1,
                "learning_rate_multiplier": 0.5  # Lower learning rate for fine-tuning refinement
            }
        }
    },
    extra_body={"trainingType": "GlobalStandard"}
)

print("Continual fine-tuning job created!")
print(f"Job ID: {continual_job.id}")
print(f"Status: {continual_job.status}")

Continual fine-tuning using base model: gpt-4.1-mini-2025-04-14.ft-4cad7de198a34baeb4f0c95ff01ac844
Continual fine-tuning job created!
Job ID: ftjob-ca7805d53a5e496ca87aebca9894e134
Status: pending


## 15. Next Steps

Congratulations! You've successfully fine-tuned a model with DPO.

### What's Next?
- Deploy your model to production
- Evaluate on more test cases
- Experiment with hyperparameters
- Try different datasets