Commit 59560d2
Wrote up detailed instructions & provided sample files
1 parent 153d933 commit 59560d2

6 files changed: +312 -0 lines changed

.env.sample

Lines changed: 10 additions & 0 deletions
AZURE_OPENAI_API_BASE=YOUR_API_BASE_ENDPOINT
AZURE_OPENAI_API_KEY=YOUR_KEY_HERE
AZURE_OPENAI_API_MODEL=gpt4o
AZURE_OPENAI_API_VERSION=2024-02-15-preview
ITERATIONS_PER_PROMPT=10
TEMPERATURE=0.95
TOP_P=0.95
MAX_TOKENS=800
SLEEP_TIME=10
OUTPUT_EXTENSION=.html

.gitignore

Lines changed: 2 additions & 0 deletions
.env
.DS_Store

README

Lines changed: 160 additions & 0 deletions
# CodeGen Model Evaluation and Refinement Tools

A test suite for evaluating & refining generative models’ “knowledge” of code pattern best practices.

## Install & Setup

Before you begin, you’ll want to create a new repository based on this template. Then configure the repository to your specific needs.

To configure your repository, you’ll need to:

1. Create [an Azure OpenAI resource](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryItemDetailsBladeNopdl/id/Microsoft.CognitiveServicesOpenAI/)
2. [Set up a deployment of the LLM model](https://oai.azure.com/) of your choosing
3. Put the credentials & such in an [`.env` file](#env-file-setup) in the root of the project (you can copy over `.env.sample`)
4. [Install the dependencies](#dependencies)

### .env File Setup

You will need the following keys in your `.env` file:

```env
AZURE_OPENAI_API_BASE=https://YOUR_PROJECT.openai.azure.com/
AZURE_OPENAI_API_KEY=*******************
AZURE_OPENAI_API_MODEL=YOUR_DEPLOYMENT_NAME
AZURE_OPENAI_API_VERSION=THE_API_VERSION
ITERATIONS_PER_PROMPT=10
TEMPERATURE=0.95
TOP_P=0.95
MAX_TOKENS=800
SLEEP_TIME=10
OUTPUT_EXTENSION=.html
```

* `AZURE_OPENAI_API_BASE` - The base endpoint of the Azure OpenAI resource you’re testing against
* `AZURE_OPENAI_API_KEY` - Your secret key for accessing the API endpoint
* `AZURE_OPENAI_API_MODEL` - The name of the model deployment you are testing against within that resource
* `AZURE_OPENAI_API_VERSION` - The current API version (_not_ the model version)
* `ITERATIONS_PER_PROMPT` - How many times you’d like to run each prompt against the model (at least 5 is recommended, but more will give you a better sense of variability)
* `TEMPERATURE` - The temperature value applied to the prompts. Higher values allow the model to improvise more and are likely to give you a better sense of the range of potential responses the model may give.
* `TOP_P` - A high “Top p” value tells the model to consider more potential words and can yield a greater variety in responses.
* `MAX_TOKENS` - The maximum number of tokens allowed across the prompt and the response combined. Depending on the type of code generation you are testing, you could keep this low. The absolute maximum may depend on the model you’re using.
* `SLEEP_TIME` - How long (in seconds) to wait between prompts; this helps you avoid getting throttled.
* `OUTPUT_EXTENSION` - The file extension used when saving each response (e.g., `.html`).

The best place to gather many of these details is in [Azure AI Studio’s Playground](https://oai.azure.com/portal/playground). Pick the "Chat" playground and then click `View Code > Key Authentication` to find the Endpoint (API Base), API Key, and API version (at the end of the ENDPOINT string in the sample Python code).
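For reference, here is a minimal sketch (not part of this repo) of how those values end up being used. It mirrors what `run_tests.py` does: load the `.env`, coerce the numeric settings, and assemble the chat-completions endpoint from the base, deployment name, and API version.

```python
import os
from dotenv import load_dotenv

# Pull settings out of .env; everything arrives as a string,
# so numeric values need to be coerced before use
load_dotenv()
API_BASE = os.environ["AZURE_OPENAI_API_BASE"]        # e.g. https://YOUR_PROJECT.openai.azure.com/
DEPLOYMENT = os.environ["AZURE_OPENAI_API_MODEL"]     # your deployment name
API_VERSION = os.environ["AZURE_OPENAI_API_VERSION"]
ITERATIONS = int(os.environ["ITERATIONS_PER_PROMPT"])
TEMPERATURE = float(os.environ["TEMPERATURE"])

# The chat-completions endpoint is the base + deployment + API version
ENDPOINT = f"{API_BASE}openai/deployments/{DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
print(ENDPOINT, ITERATIONS, TEMPERATURE)
```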
### Dependencies

You’ll need Python, along with the `python-dotenv` and `requests` packages, to run the script and pull in the environment variables:

```bash
$> pip install python-dotenv requests
```
## Running an Evaluation

To run an evaluation, you will follow these steps:

1. Document your test & evaluation criteria
1. Configure your tests
1. Run the tests
1. Evaluate the results
1. Generate diffs

### Document Test & Evaluation Criteria

Before you write your tests, I recommend spending some time documenting the following (a sample criteria document appears after the list):

1. Your ideal code output (which may be one thing or one of several acceptable patterns)
1. Acceptable variations (e.g., if the value of a particular property can be improvised, note it)
1. Unacceptable variations (e.g., if a specific property must exist, note it)
1. A list of prompts you would reasonably expect to result in the model returning the code sample you’re looking for. Being too prescriptive will result in very little variance in the proposed code samples, so be as specific as you need to be to point the model in the right direction, but keep it open to interpretation. Try some more specific prompts and some less specific ones too. Use specific keywords from documentation in some prompts and synonyms for those keywords in others.
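For the Radio Group test used later in this README, such a document might look like the sketch below (the specific requirements here are illustrative, not a definitive rubric):

```txt
Ideal output:
- Radio inputs grouped in a <fieldset> with a <legend> describing the group (e.g., "Theme")
- Each <input type="radio"> shares the same name attribute and has an associated <label>
- The three options (light, dark, high contrast) are all present

Acceptable variations:
- Exact wording of the legend and labels
- Whether one option is checked by default

Unacceptable variations:
- Options built from <div>s, <button>s, or a <select> instead of radio inputs
- Inputs with no programmatically associated label

Prompts:
- (see the "Radio Group" test in tests.json)
```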
This documentation will not only help you generate your tests, it will help with [evaluating the results](#evaluating-the-results) too.

If you want to include these docs in the directory structure of this project, be sure to commit them before proceeding.

### Configuring the tests

Tests are stored in the `tests.json` file. Each test contains a `title` string, a `prompts` array of prompt strings, and an optional `prefix` string that will be added before each prompt in the test. For example, if I wanted to test the model’s understanding of best practices for building an HTML radio group, I could include the following:

```json
{
  "title": "Radio Group",
  "prefix": "Given the options light, dark, and high contrast, create the HTML only (no JavaScript) for",
  "prompts": [
    "a radio group to choose a theme",
    "a “theme” picker using radio controls",
    "a radio control-based theme chooser",
    "an accessible theme chooser with radio controls"
  ]
}
```

This will trigger the tool to run a test titled “Radio Group.” It will execute the following prompts against the model:

1. Given the options light, dark, and high contrast, create the HTML only (no JavaScript) for a radio group to choose a theme
1. Given the options light, dark, and high contrast, create the HTML only (no JavaScript) for a “theme” picker using radio controls
1. Given the options light, dark, and high contrast, create the HTML only (no JavaScript) for a radio control-based theme chooser
1. Given the options light, dark, and high contrast, create the HTML only (no JavaScript) for an accessible theme chooser with radio controls

The number of prompt variations you choose to include is up to you. Try to describe the component in a few different ways to see how well the model “understands” key coding concepts (e.g., semantics, accessibility, performance).
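Before spending API calls, it can be worth previewing exactly what will be sent. Here is a quick sketch (not part of the repo) that expands each test’s `prefix` and prompts the same way `run_tests.py` does:

```python
import json

# Load the test definitions and print every fully expanded prompt
with open("tests.json") as f:
    data = json.load(f)

for test in data["tests"]:
    prefix = test.get("prefix", "")
    print(f"# {test['title']}")
    for i, prompt in enumerate(test["prompts"], start=1):
        # Same prefix + prompt combination the test runner sends to the model
        full_prompt = f"{prefix} {prompt}".strip()
        print(f"{i}. {full_prompt}")
```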
Once you have your tests written, you’re ready to begin running them. Commit your files to the repository and then proceed to the next step.
### Running the Tests

To run your tests against the model, open the command line and run the Python script:

```bash
$> python run_tests.py
```

If all goes well (and you aren’t throttled), the Python code should work its way through your test suite and store the results in the `./output` directory. For every test, it will create a directory named to match the test title. Within that directory, it will create numbered subdirectories for each prompt. And within those directories, each unique response to the prompt will be captured to a file named with a UUID plus the extension you configured in your `.env` (duplicate responses are skipped, so you may end up with fewer files than iterations).

For example, running one iteration of each prompt from the test above would result in something like this:

```bash
output
|-- Radio Group
|   |-- 1
|   |   |-- f9b6abd4-5952-4df6-b638-a49fee156a21.html
|   |-- 2
|   |   |-- 594caf29-05bb-4699-ad50-362347132ca9.html
|   |-- 3
|   |   |-- 4625cc9e-a38a-480d-bdee-4c80065e362e.html
|   |-- 4
|   |   |-- 42159a0b-1273-49a1-a517-c99ce56480fd.html
```

If you run into issues with your tests running to completion, consider breaking the tests into batches, or running them overnight or at another time when there is less network traffic.
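If you’d like a quick sanity check of what a run produced, a small sketch like this (not part of the repo) counts the saved responses per prompt:

```python
from pathlib import Path

# Report how many unique responses were saved for each test and prompt
for test_dir in sorted(Path("output").iterdir()):
    if not test_dir.is_dir():
        continue
    for prompt_dir in sorted(test_dir.iterdir()):
        if prompt_dir.is_dir():
            count = sum(1 for f in prompt_dir.iterdir() if f.is_file())
            print(f"{test_dir.name} / prompt {prompt_dir.name}: {count} unique responses")
```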
Once you have your results, commit them to the repository before proceeding. You will want to keep track of the commit ID for later.
### Evaluating the results

Once you have your results, it’s time to begin evaluating. Use [your documentation](#document-test--evaluation-criteria) to evaluate each code suggestion.

When the output falls outside the acceptable bounds of your evaluation (because it includes something it shouldn’t or misses something critical), edit the file to align it with your ideal result. Then commit that individual file to the repo with a commit message that enumerates each individual error on its own line. Be as descriptive and instructive as possible. Your commit messages can be used to help improve the model.

For example:

```txt
Error: Current page must be indicated using `aria-current="page"`
Error: Style changes (such as bold) must not be the only means of indicating a link is the current page
```

Repeat this process for every file in the output directory.

If you are looking to speed up this process a bit, you can tackle the output on a per-test basis: make the necessary changes to every file and save them, but don’t commit them yet. As you go, keep a scratch file containing the commit message you plan to use for each error you encountered. Then go through the files and commit them one by one, using the appropriate commit messages from your scratch file (a sketch of one way to do this follows). This can make things faster, but be careful not to commit more than one file at a time, because the next step relies on each change being an individual commit.
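If you take the batch route, a small helper along these lines can keep the one-commit-per-file discipline. The `commit-messages.json` scratch file and its structure are hypothetical, not part of this repo:

```python
import json
import subprocess

# Hypothetical scratch file mapping each edited output file to its error lines, e.g.
# { "output/Radio Group/1/f9b6abd4-....html": ["Error: ...", "Error: ..."] }
with open("commit-messages.json") as f:
    messages = json.load(f)

for path, errors in messages.items():
    # Stage and commit one file at a time, with one error per line in the message
    subprocess.run(["git", "add", path], check=True)
    subprocess.run(["git", "commit", "-m", "\n".join(errors)], check=True)
```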
When you’re done and all output files have been aligned with your ideal code snippets and committed to the repo with descriptive error messages, proceed to the final step.
### Generate Diff Files

The last step is to generate the diff files that can be used to refine the model’s output. To generate them, run the shell script and provide the ID of the last commit made _before_ you began evaluating the code samples:

```bash
$> ./diff-generator.sh 7189a15dd6fc785314af2dc4e035de83ec83b5a8
```

The script will run through every commit and generate a `.diff` file for it. At the top of each file will be the commit message associated with that commit (which is why having them as individual commits is super helpful). If you’d like to improve the readability of the files, you might consider running a batch conversion of " Error:" to "\r\nError:" in the text editor of your choice, as this will ensure each Error from your commit messages appears on its own line.
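If you’d rather script that cleanup than do it in an editor, a quick pass like this sketch (not part of the repo) does the same thing, assuming the `.diff` files live in `./diffs` as the shell script writes them:

```python
from pathlib import Path

# Put each "Error:" from the commit message on its own line in every generated diff
for diff_file in Path("diffs").glob("*.diff"):
    text = diff_file.read_text()
    diff_file.write_text(text.replace(" Error:", "\nError:"))
```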

diff-generator.sh

Lines changed: 26 additions & 0 deletions
#!/bin/bash

# Check if the target commit hash is provided
if [ -z "$1" ]; then
  echo "Usage: $0 <target_commit_hash>"
  exit 1
fi

TARGET_COMMIT=$1

# Make sure the directory the diff files are written to exists
mkdir -p ./diffs

# Get the list of commits from the current HEAD back to the target commit
COMMITS=$(git rev-list HEAD ^"$TARGET_COMMIT")

# Loop through each commit and generate a diff file
for COMMIT in $COMMITS; do
  # get the path of the file touched by this commit
  FILEPATH=$(git show --pretty=format: --name-only "$COMMIT" | head -1)
  # remove everything from the FILEPATH up to the final slash
  FILEPATH=$(echo "$FILEPATH" | sed 's/.*\///')
  FILENAME=$(basename "$FILEPATH" .html)
  # generate the diff file, led by the commit message (subject line)
  echo "Generating diff file for $FILENAME..."
  git show --pretty=format:%s "$COMMIT" >> "./diffs/$FILENAME.diff"
done

echo "Diff files generated for each commit back to $TARGET_COMMIT."

run_tests.py

Lines changed: 100 additions & 0 deletions
import os
import requests
import time
import json
import uuid
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
API_BASE = os.environ.get("AZURE_OPENAI_API_BASE")
DEPLOYMENT = os.environ.get("AZURE_OPENAI_API_MODEL")
API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION")
API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
ENDPOINT = API_BASE + "openai/deployments/" + DEPLOYMENT + "/chat/completions?api-version=" + API_VERSION
# Environment variables arrive as strings, so coerce the numeric settings
ITERATIONS = int(os.environ.get("ITERATIONS_PER_PROMPT"))
TEMP = float(os.environ.get("TEMPERATURE"))
TOP_P = float(os.environ.get("TOP_P"))
MAX_TOKENS = int(os.environ.get("MAX_TOKENS"))
SLEEP = float(os.environ.get("SLEEP_TIME"))
EXTENSION = os.environ.get("OUTPUT_EXTENSION")

# headers
headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY
}

instructions = [
    "You are an AI programming assistant.",
    "You return only code snippets with NO OTHER TEXT, code fences, etc.",
    "Assume your responses will be used in a code editor within an existing HTML document.",
    # "Your responses only include the amount of HTML required to properly and validly fulfill the request.",
    "You may include inline CSS or JavaScript, but only as much as absolutely necessary."
]

def get_code_response(prompt):
    print(f"Prompt: {prompt}")
    payload = {
        "messages": [
            {
                "role": "system",
                "content": [{
                    "type": "text",
                    "text": " ".join(instructions)
                }]
            },
            {
                "role": "user",
                "content": [{
                    "type": "text",
                    "text": prompt
                }]
            }
        ],
        "temperature": TEMP,
        "top_p": TOP_P,
        "max_tokens": MAX_TOKENS
    }

    try:
        response = requests.post(ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
    except requests.RequestException as e:
        raise SystemExit(f"Failed to make the request. Error: {e}")

    result = response.json()
    code = result['choices'][0]['message']['content']
    print(f"Response: {code}")
    return code

def main():
    # Load the JSON file
    with open('tests.json', 'r') as file:
        data = json.load(file)

    for test in data['tests']:
        test_title = test['title']
        test_folder = os.path.join('output', test_title)
        os.makedirs(test_folder, exist_ok=True)

        prefix = test.get('prefix', '')

        for prompt_index, prompt in enumerate(test['prompts'], start=1):
            prompt_folder = os.path.join(test_folder, str(prompt_index))
            os.makedirs(prompt_folder, exist_ok=True)

            unique_responses = set()

            for _ in range(ITERATIONS):
                full_prompt = f"{prefix} {prompt}".strip()
                response = get_code_response(full_prompt)
                # Only save responses we haven't already seen for this prompt
                if response not in unique_responses:
                    unique_responses.add(response)
                    filename = os.path.join(prompt_folder, f"{uuid.uuid4()}{EXTENSION}")
                    with open(filename, 'w') as response_file:
                        response_file.write(response)
                time.sleep(SLEEP)  # To avoid hitting rate limits

if __name__ == "__main__":
    main()

tests.json

Lines changed: 14 additions & 0 deletions
{
  "tests": [
    {
      "title": "Radio Group",
      "prefix": "Given the options light, dark, and high contrast, create the HTML only (no JavaScript) for",
      "prompts": [
        "a radio group to choose a theme",
        "a “theme” picker using radio controls",
        "a radio control-based theme chooser",
        "an accessible theme chooser with radio controls"
      ]
    }
  ]
}
