`.github/workflows/sync-audit-logs.yml` (2 additions, 2 deletions)
@@ -87,8 +87,8 @@ jobs:
 echo "Creating pull request..."
 gh pr create \
   --title "Update audit log event data" \
-  --body '👋 humans. This PR updates the audit log event data with the latest changes. (Synced from github/audit-log-allowlists)
-
+  --body '👋 Docs First Responder. This PR updates the audit log event data with the latest changes, synced from github/audit-log-allowlists.
+  You only need to spot check this PR and make sure it builds successfully. You do not need to validate the contents (that is the responsibility of product teams).
 If CI does not pass or other problems arise, contact #docs-engineering on slack.' \
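For context, a minimal sketch of a workflow step that opens a pull request with the `gh` CLI is shown below. The step name, branch names, and placeholder body text are assumptions for illustration only, not copied from the actual workflow.

```yaml
# Illustrative sketch only: step name, branches, and body text are assumed, not from the real workflow.
- name: Create pull request
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} # the gh CLI authenticates with this token
  run: |
    echo "Creating pull request..."
    gh pr create \
      --title "Update audit log event data" \
      --body 'Example body text for reviewers.' \
      --base main \
      --head update-audit-log-event-data
```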
`content/github-models/use-github-models/evaluating-ai-models.md` (20 additions, 17 deletions)
@@ -1,6 +1,6 @@
 ---
-title: Evaluating AI models
-intro: 'Test and compare AI model outputs using evaluators and scoring metrics in GitHub Models.'
+title: Evaluating AI models
+intro: 'Test and compare AI model outputs using evaluators and scoring metrics in {% data variables.product.prodname_github_models %}.'
 versions:
   feature: github-models
 shortTitle: Evaluate AI models
@@ -10,7 +10,10 @@ topics:

 ## Overview

-{% data variables.product.prodname_github_models %} provides a simple evaluation workflow that helps developers compare large language models (LLMs), refine prompts, and make data-driven decisions within the GitHub platform. You can use {% data variables.product.prodname_github_models %} to experiment with new features or validate model changes by analyzing performance, accuracy, and cost through structured evaluation tools.
+{% data variables.product.prodname_github_models %} provides a simple evaluation workflow that helps developers compare large language models (LLMs), refine prompts, and make data-driven decisions within the {% data variables.product.github %} platform. You can use {% data variables.product.prodname_github_models %} to experiment with new features or validate model changes by analyzing performance, accuracy, and cost through structured evaluation tools.
+
+>[!TIP]
+> You can run evaluations directly from the command line using the `gh models eval` command. It uses the same evaluators as the UI: string match, similarity, custom LLM-as-a-judge evaluators, and more, so you can test your `.prompt.yml` file locally or in CI.

 ## Use cases for {% data variables.product.prodname_github_models %}

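As a rough illustration of the tip added above, the same evaluators can also be run in CI against a committed prompt file. The sketch below assumes a hypothetical file path and that the `gh models` extension gets installed in the job; none of it comes from this PR.

```yaml
# Hypothetical GitHub Actions step; the prompt file path and step name are assumptions.
- name: Evaluate prompt file
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} # token used by the gh CLI
  run: |
    gh extension install github/gh-models       # `gh models` ships as a gh CLI extension
    gh models eval prompts/summarize.prompt.yml # runs the evaluators defined in that file
```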
@@ -41,7 +44,7 @@ In this step, you'll configure a model to generate summaries for customer support

 ### Defining the system prompt

-Define the model's behavior for your current goal. In this case, the goal is to summarize customer feedback. Under **Parameters**, enter the following System prompt:
+Define the model's behavior for your current goal. In this case, the goal is to summarize customer feedback. Under **Parameters**, enter the following System prompt:

 You are a helpful assistant that summarizes support ticket responses into concise summaries.

@@ -144,13 +147,13 @@ The following tables display each model's output and its evaluator scores:
-|GPT-4.1| User reports that the app crashes when uploading a PDF from mobile, though uploading works on desktop. |
+|{% data variables.copilot.copilot_gpt_41 %}| User reports that the app crashes when uploading a PDF from mobile, though uploading works on desktop. |
 | DeepSeek-R1 | <think> Okay, the user is having trouble with the app crashing when they try to upload a PDF from their phone. It works on the desktop but not mobile. Let me break this down. First, the main issue is the app crashing specifically during PDF uploads on mobile. The fact that it works on desktop suggests it's a mobile-specific problem. Possible causes could be related to the mobile app version, the phone's OS, file size limits on mobile, or compatibility issues with certain PDF formats on mobile devices. I need to make sure the summary mentions the crash during PDF uploads on mobile, note that desktop works, and |
 | Phi-4 | The app crashes when attempting to upload a PDF from the mobile device, though it functions correctly on the desktop version. |
@@ -162,19 +165,19 @@ Evaluate how closely each model’s output aligns with the expected summary. The

 | Model | Similarity score |
 |---------------|------------------|
-|GPT-4.1| 100% |
+|{% data variables.copilot.copilot_gpt_41 %}| 100% |
 | DeepSeek-R1 | 50% |
 | Phi-4 | 75% |

-While all models included the key content from the input, DeepSeek-R1’s similarity score is significantly lower due to its verbose internal commentary, which deviate from the expected concise summary format. In contrast, GPT-4.1’s response matches the phrasing and structure of the reference output.
+While all models included the key content from the input, DeepSeek-R1’s similarity score is significantly lower due to its verbose internal commentary, which deviates from the expected concise summary format. In contrast, {% data variables.copilot.copilot_gpt_41 %}’s response matches the phrasing and structure of the reference output.

 ### Relevance

 Evaluate how well each model captures the core intent of the input. The table below shows the relevance scores for each model.

 | Model | Relevance score |
 |---------------|-----------------|
-|GPT-4.1| 50% |
+|{% data variables.copilot.copilot_gpt_41 %}| 50% |
 | DeepSeek-R1 | 50% |
 | Phi-4 | 100% |

@@ -187,7 +190,7 @@ The table below shows the relevance scores for each model.

 | Model | Groundedness score |
 |---------------|--------------------|
-|GPT-4.1| 100% |
+|{% data variables.copilot.copilot_gpt_41 %}| 100% |
 | DeepSeek-R1 | 75% |
 | Phi-4 | 100% |

@@ -201,13 +204,13 @@ The following tables display each model's output and its evaluator scores:
-|GPT-4.1| User requests the addition of a dark mode feature due to discomfort and eye strain when using the app at night. |
+|{% data variables.copilot.copilot_gpt_41 %}| User requests the addition of a dark mode feature due to discomfort and eye strain when using the app at night. |
 | DeepSeek-R1 | <think> Okay, the user wants dark mode added because it's hard to use at night and their eyes hurt. I need to make a concise summary. Let me start with the main request: adding dark mode. Then mention the reason: eye strain from prolonged use at night. Keep it to one sentence. Maybe structure it as "Request to implement dark mode to reduce eye strain during nighttime use." That covers the key points without extra details. Check if that's under the character limit. Yep, looks good. </think> Summary: Request to implement dark mode to reduce eye strain caused by prolonged nighttime use. |
 | Phi-4 | The customer is requesting the addition of a dark mode feature to reduce eye strain when using the product at night. |
@@ -217,7 +220,7 @@ Evaluate how closely each model’s output aligns with the expected summary. The

 | Model | Similarity score |
 |---------------|------------------|
-|GPT-4.1| 100% |
+|{% data variables.copilot.copilot_gpt_41 %}| 100% |
 | DeepSeek-R1 | 50% |
 | Phi-4 | 100% |

@@ -229,11 +232,11 @@ Evaluate how well each model captures the core intent of the input. The table be

 | Model | Relevance score |
 |---------------|-----------------|
-|GPT-4.1| 75% |
+|{% data variables.copilot.copilot_gpt_41 %}| 75% |
 | DeepSeek-R1 | 0% |
 | Phi-4 | 75% |

-GPT-4.1 and Phi-4 both captured the main intent of the user’s request: the need for dark mode to reduce eye strain and improve usability at night. DeepSeek-R1 scored 0% in relevance due to its verbose internal commentary that distracted from the actual output.
+{% data variables.copilot.copilot_gpt_41 %} and Phi-4 both captured the main intent of the user’s request: the need for dark mode to reduce eye strain and improve usability at night. DeepSeek-R1 scored 0% in relevance due to its verbose internal commentary that distracted from the actual output.

 ### Groundedness

@@ -242,15 +245,15 @@ The table below shows the relevance scores for each model.

 | Model | Groundedness score |
 |---------------|--------------------|
-|GPT-4.1| 100% |
+|{% data variables.copilot.copilot_gpt_41 %}| 100% |
 | DeepSeek-R1 | 25% |
 | Phi-4 | 100% |

 DeepSeek-R1 scored lower due to its verbose `<think>` block, which included speculative reasoning not present in the original input.

 ## Save the configuration

-After completing your evaluations, the final step is to choose the model that performs best for your specific use case. In the examples above, Phi-4 and GPT-4.1 delivered strong, consistent results across all evaluators. DeepSeek-R1 scored lower due to its verbose reasoning and less focused outputs.
+After completing your evaluations, the final step is to choose the model that performs best for your specific use case. In the examples above, Phi-4 and {% data variables.copilot.copilot_gpt_41 %} delivered strong, consistent results across all evaluators. DeepSeek-R1 scored lower due to its verbose reasoning and less focused outputs.

 Once you've selected your preferred model and prompt configuration, add a descriptive name to the prompt file, then click **Commit changes**. This will save the model, prompt, parameter settings, and associated dataset as a reusable configuration file in your repository.
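For reference, a committed prompt configuration of the kind described above might look roughly like the sketch below. The field names follow the `.prompt.yml` format as far as it is publicly documented; the file name, model ID, test data, and evaluator choice are illustrative assumptions rather than content from this PR.

```yaml
# Hypothetical summarize.prompt.yml; all values are illustrative.
name: Support ticket summarizer
description: Summarizes customer support feedback into concise summaries
model: openai/gpt-4.1 # assumed model identifier
modelParameters:
  temperature: 0.3
messages:
  - role: system
    content: You are a helpful assistant that summarizes support ticket responses into concise summaries.
  - role: user
    content: "{{input}}"
testData:
  - input: "The app crashes whenever I upload a PDF from my phone, but it works fine on desktop."
    expected: "User reports that the app crashes when uploading a PDF from mobile, though uploading works on desktop."
evaluators:
  - name: Similarity to reference summary
    uses: github/similarity
```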