
Commit 22f4e8a

Merge pull request #38923 from github/repo-sync: Repo sync

2 parents 7d68d57 + bdcf6ea · commit 22f4e8a

File tree

2 files changed (+22, -19 lines)

.github/workflows/sync-audit-logs.yml

Lines changed: 2 additions & 2 deletions
@@ -87,8 +87,8 @@ jobs:
           echo "Creating pull request..."
           gh pr create \
             --title "Update audit log event data" \
-            --body '👋 humans. This PR updates the audit log event data with the latest changes. (Synced from github/audit-log-allowlists)
-
+            --body '👋 Docs First Responder. This PR updates the audit log event data with the latest changes, synced from github/audit-log-allowlists.
+            You only need to spot check this PR and make sure it builds successfully. You do not need to validate the contents (that is the responsibility of product teams).
             If CI does not pass or other problems arise, contact #docs-engineering on slack.' \
             --repo github/docs-internal \
             --label audit-log-pipeline \

content/github-models/use-github-models/evaluating-ai-models.md

Lines changed: 20 additions & 17 deletions
@@ -1,6 +1,6 @@
 ---
-title: Evaluating AI models
-intro: 'Test and compare AI model outputs using evaluators and scoring metrics in GitHub Models.'
+title: Evaluating AI models
+intro: 'Test and compare AI model outputs using evaluators and scoring metrics in {% data variables.product.prodname_github_models %}.'
 versions:
   feature: github-models
 shortTitle: Evaluate AI models
@@ -10,7 +10,10 @@ topics:
 
 ## Overview
 
-{% data variables.product.prodname_github_models %} provides a simple evaluation workflow that helps developers compare large language models (LLMs), refine prompts, and make data-driven decisions within the GitHub platform. You can use {% data variables.product.prodname_github_models %} to experiment with new features or validate model changes by analyzing performance, accuracy, and cost through structured evaluation tools.
+{% data variables.product.prodname_github_models %} provides a simple evaluation workflow that helps developers compare large language models (LLMs), refine prompts, and make data-driven decisions within the {% data variables.product.github %} platform. You can use {% data variables.product.prodname_github_models %} to experiment with new features or validate model changes by analyzing performance, accuracy, and cost through structured evaluation tools.
+
+>[!TIP]
+> You can run evaluations directly from the command line using the `gh models eval` command. It uses the same evaluators as the UI: string match, similarity, custom LLM-as-a-judge evaluators, and more, so you can test your `.prompt.yml` file locally or in CI.
 
 ## Use cases for {% data variables.product.prodname_github_models %}
 
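As a quick illustration of the tip added above (editorial sketch, not part of this diff): assuming the GitHub CLI is available and the Models extension from github/gh-models is installed, a local or CI run against a committed prompt file might look like the following. The prompt file name is a placeholder.

```bash
# One-time setup: install the GitHub Models extension for the GitHub CLI
# (assumed here; skip if it is already installed).
gh extension install github/gh-models

# Run the evaluators defined in a committed prompt file, locally or in CI.
# "summarize-feedback.prompt.yml" is a placeholder path, not a file from this commit.
gh models eval summarize-feedback.prompt.yml
```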
@@ -41,7 +44,7 @@ In this step, you'll configure a model to generate summaries for customer suppor
 
 ### Defining the system prompt
 
-Define the model's behavior for your current goal. In this case, the goal is to summarize customer feedback. Under **Parameters**, enter the following System prompt:
+Define the model's behavior for your current goal. In this case, the goal is to summarize customer feedback. Under **Parameters**, enter the following System prompt:
 
 You are a helpful assistant that summarizes support ticket responses into concise summaries.
 
@@ -144,13 +147,13 @@ The following tables display each model's output and its evaluator scores:
 
 | Model | Output |
 |-------|--------------------------------------------------------------------------------------------------------------------------------------------|
-| GPT-4.1 | User reports that the app crashes when uploading a PDF from mobile, though uploading works on desktop. |
+| {% data variables.copilot.copilot_gpt_41 %} | User reports that the app crashes when uploading a PDF from mobile, though uploading works on desktop. |
 | DeepSeek-R1 | <think> Okay, the user is having trouble with the app crashing when they try to upload a PDF from their phone. It works on the desktop but not mobile. Let me break this down. First, the main issue is the app crashing specifically during PDF uploads on mobile. The fact that it works on desktop suggests it's a mobile-specific problem. Possible causes could be related to the mobile app version, the phone's OS, file size limits on mobile, or compatibility issues with certain PDF formats on mobile devices. I need to make sure the summary mentions the crash during PDF uploads on mobile, note that desktop works, and |
 | Phi-4 | The app crashes when attempting to upload a PDF from the mobile device, though it functions correctly on the desktop version. |
 
 | Model | Similarity | Relevance | Groundedness | Input tokens | Output tokens | Latency |
 |---------------|------------|-----------|--------------|--------------|---------------|---------|
-| GPT-4.1 | 100% | 50% | 100% | 61 | 20 | 918ms |
+| {% data variables.copilot.copilot_gpt_41 %} | 100% | 50% | 100% | 61 | 20 | 918ms |
 | DeepSeek-R1 | 50% | 50% | 75% | 52 | 128 | 2285ms |
 | Phi-4 | 75% | 100% | 100% | 61 | 66 | 1117ms |
 
@@ -162,19 +165,19 @@ Evaluate how closely each model’s output aligns with the expected summary. The
 
 | Model | Similarity score |
 |---------------|------------------|
-| GPT-4.1 | 100% |
+| {% data variables.copilot.copilot_gpt_41 %} | 100% |
 | DeepSeek-R1 | 50% |
 | Phi-4 | 75% |
 
-While all models included the key content from the input, DeepSeek-R1’s similarity score is significantly lower due to its verbose internal commentary, which deviate from the expected concise summary format. In contrast, GPT-4.1’s response matches the phrasing and structure of the reference output.
+While all models included the key content from the input, DeepSeek-R1’s similarity score is significantly lower due to its verbose internal commentary, which deviates from the expected concise summary format. In contrast, {% data variables.copilot.copilot_gpt_41 %}’s response matches the phrasing and structure of the reference output.
 
 ### Relevance
 
 Evaluate how well each model captures the core intent of the input. The table below shows the relevance scores for each model.
 
 | Model | Relevance score |
 |---------------|-----------------|
-| GPT-4.1 | 50% |
+| {% data variables.copilot.copilot_gpt_41 %} | 50% |
 | DeepSeek-R1 | 50% |
 | Phi-4 | 100% |
 
@@ -187,7 +190,7 @@ The table below shows the relevance scores for each model.
 
 | Model | Groundedness score |
 |---------------|--------------------|
-| GPT-4.1 | 100% |
+| {% data variables.copilot.copilot_gpt_41 %} | 100% |
 | DeepSeek-R1 | 75% |
 | Phi-4 | 100% |
 
@@ -201,13 +204,13 @@ The following tables display each model's output and its evaluator scores:
 
 | Model | Output |
 |-------|---------------------------------------------------------------------------------------------------------------------------------------------|
-| GPT-4.1 | User requests the addition of a dark mode feature due to discomfort and eye strain when using the app at night. |
+| {% data variables.copilot.copilot_gpt_41 %} | User requests the addition of a dark mode feature due to discomfort and eye strain when using the app at night. |
 | DeepSeek-R1 | <think> Okay, the user wants dark mode added because it's hard to use at night and their eyes hurt. I need to make a concise summary. Let me start with the main request: adding dark mode. Then mention the reason: eye strain from prolonged use at night. Keep it to one sentence. Maybe structure it as "Request to implement dark mode to reduce eye strain during nighttime use." That covers the key points without extra details. Check if that's under the character limit. Yep, looks good. </think> Summary: Request to implement dark mode to reduce eye strain caused by prolonged nighttime use. |
 | Phi-4 | The customer is requesting the addition of a dark mode feature to reduce eye strain when using the product at night. |
 
 | Model | Similarity | Relevance | Groundedness | Input Tokens | Output Tokens | Latency |
 |---------------|------------|-----------|--------------|---------------|----------------|---------|
-| GPT-4.1 | 100% | 75% | 100% | 57 | 18 | 1286ms |
+| {% data variables.copilot.copilot_gpt_41 %} | 100% | 75% | 100% | 57 | 18 | 1286ms |
 | DeepSeek-R1 | 50% | 0% | 25% | 49 | 128 | 1946ms |
 | Phi-4 | 100% | 75% | 100% | 58 | 20 | 899ms |
 
@@ -217,7 +220,7 @@ Evaluate how closely each model’s output aligns with the expected summary. The
 
 | Model | Similarity score |
 |---------------|------------------|
-| GPT-4.1 | 100% |
+| {% data variables.copilot.copilot_gpt_41 %} | 100% |
 | DeepSeek-R1 | 50% |
 | Phi-4 | 100% |
 
@@ -229,11 +232,11 @@ Evaluate how well each model captures the core intent of the input. The table be
 
 | Model | Relevance score |
 |---------------|-----------------|
-| GPT-4.1 | 75% |
+| {% data variables.copilot.copilot_gpt_41 %} | 75% |
 | DeepSeek-R1 | 0% |
 | Phi-4 | 75% |
 
-GPT-4.1 and Phi-4 both captured the main intent of the user’s request: the need for dark mode to reduce eye strain and improve usability at night. DeepSeek-R1 scored 0% in relevance due to its verbose internal commentary that distracted from the actual output.
+{% data variables.copilot.copilot_gpt_41 %} and Phi-4 both captured the main intent of the user’s request: the need for dark mode to reduce eye strain and improve usability at night. DeepSeek-R1 scored 0% in relevance due to its verbose internal commentary that distracted from the actual output.
 
 ### Groundedness
 
@@ -242,15 +245,15 @@ The table below shows the relevance scores for each model.
 
 | Model | Groundedness score |
 |---------------|--------------------|
-| GPT-4.1 | 100% |
+| {% data variables.copilot.copilot_gpt_41 %} | 100% |
 | DeepSeek-R1 | 25% |
 | Phi-4 | 100% |
 
 DeepSeek-R1 scored lower due to its verbose `<think>` block, which included speculative reasoning not present in the original input.
 
 ## Save the configuration
 
-After completing your evaluations, the final step is to choose the model that performs best for your specific use case. In the examples above, Phi-4 and GPT-4.1 delivered strong, consistent results across all evaluators. DeepSeek-R1 scored lower due to its verbose reasoning and less focused outputs.
+After completing your evaluations, the final step is to choose the model that performs best for your specific use case. In the examples above, Phi-4 and {% data variables.copilot.copilot_gpt_41 %} delivered strong, consistent results across all evaluators. DeepSeek-R1 scored lower due to its verbose reasoning and less focused outputs.
 
 Once you've selected your preferred model and prompt configuration, add a descriptive name to the prompt file, then click **Commit changes**. This will save the model, prompt, parameter settings, and associated dataset as a reusable configuration file in your repository.
 
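For context on the "reusable configuration file" mentioned in the last hunk: prompts saved from the Models UI live in the repository as `.prompt.yml` files, which is also what the `gh models eval` tip earlier in this diff runs against. The sketch below is illustrative only: the system prompt and the expected summary are quoted from the doc content in this diff, while the file layout, field names, model identifier, and sample input are assumptions and may not match the exact GitHub Models schema.

```yaml
# Illustrative sketch of a saved prompt configuration (.prompt.yml).
# The system prompt and expected summary are quoted from this diff; every
# other field name and value is an assumption, not taken from the commit.
name: Summarize customer feedback
model: openai/gpt-4.1            # assumed model identifier
modelParameters:
  temperature: 0.3               # assumed parameter name and value
messages:
  - role: system
    content: You are a helpful assistant that summarizes support ticket responses into concise summaries.
  - role: user
    content: "{{input}}"
testData:
  - input: "<customer support ticket text>"   # placeholder; the raw ticket is not shown in this diff
    expected: "User reports that the app crashes when uploading a PDF from mobile, though uploading works on desktop."
```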