# SPADE Experiment Example

This notebook loads one of the LangChain pipelines and its annotated examples, then displays the SPADE-generated assertions. You must specify the pipeline name.

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
# Replace this with one of the 9 pipelines in the paper

PIPELINE_NAME = "codereviews"

## Print the last prompt template, an example, and a sample of 5 candidate assertions

In [16]:
import importlib
import inspect
from rich import print as rprint

# Import the pipeline's modules by name instead of building a string for exec()
base = f"paper_experiments.{PIPELINE_NAME}"
TEMPLATES = importlib.import_module(f"{base}.prompt_templates").TEMPLATES
EXAMPLES = importlib.import_module(f"{base}.examples").EXAMPLES
ALL_FUNCTIONS = importlib.import_module(f"{base}.candidate_assertions").ALL_FUNCTIONS

# Print the last prompt template and example
rprint(f"Last prompt template:\n{TEMPLATES[-1]}")
rprint(f"Last example:\n{EXAMPLES[-1]}")
rprint("Sample of candidate assertions:")

for f in ALL_FUNCTIONS[:5]:
    rprint(inspect.getsource(f))

## Load the cached pipeline results for the examples and join with the labels

In [28]:
# Load cached responses

from spade.execute_assertions import execute_candidate_assertions

pipeline_results = await execute_candidate_assertions(PIPELINE_NAME, TEMPLATES[-1], EXAMPLES, ALL_FUNCTIONS)
rprint(f"There are {len(pipeline_results)} results (approx num assertions * num examples b/c there are some errors).")

c26484e8b8cb09a62f72d2fa70de3ef466a5a4381bdabe91d0805ceb3eb2929f
Found cached results
There are 3344 results (approx num assertions * num examples b/c there are some errors).


In [49]:
# Load the optimizer input, labeled responses, and subsumption results
import pandas as pd
import pickle

labeled_responses = pd.read_csv(f"paper_experiments/{PIPELINE_NAME}/labeled_responses.csv")
rprint(f"There are {len(labeled_responses)} labeled responses. {len(labeled_responses[labeled_responses['label'] == True])} are successful (i.e., good), {len(labeled_responses[labeled_responses['label'] == False])} are failures (i.e., bad).")

# Join the labeled responses with the pipeline results
results_and_labels = pipeline_results.merge(labeled_responses, on=["response"])

# Load subsumption results
subsumption_results = pd.read_csv(f"paper_experiments/{PIPELINE_NAME}/subsumption_results.csv")
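
The join above can be illustrated on toy data. This is a minimal sketch with made-up frames (`toy_results` and `toy_labels` are hypothetical names, not the notebook's data). It assumes each response appears once in the label file, so every per-assertion row picks up that response's label.

```python
import pandas as pd

# Hypothetical stand-ins for pipeline_results and labeled_responses.
toy_results = pd.DataFrame(
    {
        "response": ["r1", "r1", "r2", "r3"],
        "function_name": ["assert_a", "assert_b", "assert_a", "assert_a"],
        "result": [True, False, True, False],
    }
)
toy_labels = pd.DataFrame({"response": ["r1", "r2"], "label": [True, False]})

# merge() defaults to an inner join, so only responses present in both
# frames survive; the rows for the unlabeled response "r3" are dropped.
toy_joined = toy_results.merge(toy_labels, on=["response"])
print(toy_joined)
```

If a response were listed more than once in the label file, the merge would repeat the matching result rows, so it may be worth checking for duplicate responses before joining.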

### Pipeline responses for some examples and their labels

In [34]:
# Randomly sample 10 results and their labels

results_and_labels.sample(n=10)

| Unnamed: 0 | prompt | example | response | model | function_name | result | prompt_tokens | completion_tokens | label |
|---|---|---|---|---|---|---|---|---|---|
| 1264 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Integrate S... | Hello @secureRust,\n\nThank you for your contr... | gpt-4 | assert_completeness_in_reviewing_code_diff | False | 669.0 | 1.0 | 1 |
| 1068 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Add TCP con... | Thank you for the PR @rustNetDev! Your effort ... | gpt-4 | assert_response_is_personal_and_grateful_v2 | False |  |  | 1 |
| 1345 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Implement b... | Thank you @rustNetworkDev for the contribution... | gpt-4 | assert_proper_acknowledgement_v1 | False |  |  | 1 |
| 1810 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Function to... | @goDeveloper, thank you for the pull request. ... | gpt-3 | assert_response_is_personal_and_grateful_v1 | True |  |  | 1 |
| 2145 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Use Go chan... | Hey @goChannelExpert, thanks for the contribut... | gpt-4 | assert_consistency_with_tasks_and_instructions | True | 645.0 | 1.0 | 1 |
| 1612 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Function to... | Thank you for the PR, @secretManager! I notice... | gpt-4 | assert_response_is_concise_and_clear | True | 840.0 | 1.0 | 1 |
| 2499 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Enhance PyT... | Thank you for your pull request, @pytorchEnhan... | gpt-4 | assert_respects_information_limitation_v1 | True | 578.0 | 1.0 | 0 |
| 1507 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Setup Kuber... | Thank you @kubeDev for the pull request to ini... | gpt-4 | assert_follows_code_conventions_v1 | False | 721.0 | 1.0 | 0 |
| 2245 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Fix race co... | @nodeAsyncFix, thank you for the pull request!... | gpt-3 | assert_response_follows_review_format_v1 | False |  |  | 1 |
| 1594 | user:\nYou are an AI Assistant that’s an exper... | {'pr_webhook_payload': '{"title": "Function to... | Thank you for the PR, @secretManager! I notice... | gpt-4 | assert_response_is_brief_v1 | True | 845.0 | 1.0 | 1 |


### List all subsumption pairs identified by the LLM

`A -> B` means that A subsumes B. Some subsumption pairs were not identified by the LLM directly: rows with `asked_LLM = False` are edges that we construct via transitivity.

In [46]:
subsumption_results[subsumption_results["A -> B"] == True]

| Unnamed: 0 | func A | func B | A -> B | asked_LLM |
|---|---|---|---|---|
| 0 | async def assert_response_is_personal_and_grat... | async def assert_proper_acknowledgement_v1(exa... | True | True |
| 1 | async def assert_response_is_personal_and_grat... | async def assert_gratitude_personal_touch(exam... | True | True |
| 2 | async def assert_response_follows_review_forma... | async def assert_response_is_personal_and_grat... | True | True |
| 3 | async def assert_response_is_brief_v1(example:... | async def assert_response_is_concise_v1(\n ... | True | True |
| 4 | async def assert_contains_brief_answers_v1(exa... | async def assert_response_is_brief_v1(example:... | True | True |
| 5 | async def assert_response_is_concise_and_clear... | async def assert_response_is_concise_v1(\n ... | True | True |
| 6 | async def assert_response_is_concise_and_clear... | async def assert_clear_professional_language_v... | True | True |
| 7 | async def assert_excludes_irrelevant_content_v... | async def assert_excludes_unrelated_topics_or_... | True | True |
| 8 | async def assert_response_does_not_require_ful... | async def assert_excludes_full_codebase_review... | True | True |
| 9 | async def assert_excludes_full_codebase_review... | async def assert_response_does_not_require_ful... | True | True |
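
The note on transitivity can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `transitive_closure` helper is hypothetical, and bare function names stand in for the full source strings stored in `func A` and `func B`. The two starting edges are taken from rows 3 and 4 of the table above.

```python
def transitive_closure(edges):
    """Return every (a, b) pair reachable by chaining a -> b edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a -> b and b -> d imply a -> d (self-loops are skipped).
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Edges the LLM was asked about directly (rows 4 and 3 of the table).
asked_edges = {
    ("assert_contains_brief_answers_v1", "assert_response_is_brief_v1"),
    ("assert_response_is_brief_v1", "assert_response_is_concise_v1"),
}

closed = transitive_closure(asked_edges)
# Edges that exist only because of chaining; in the notebook's terms these
# would be the rows with asked_LLM = False.
inferred_edges = closed - asked_edges
print(inferred_edges)
```

The repeated pairwise scan is quadratic per pass, which is fine for a few dozen candidate assertions but would need a graph library or a Floyd-Warshall style pass for larger sets.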


## Run the SPADE optimizer

Ignore the MILP solver output for now. We compare the optimizer results in later cells.

In [52]:
from spade.optimizer import select_functions

optimizer_results = select_functions(f"paper_experiments/{PIPELINE_NAME}/optimizer_input.pkl", tau=0.25, alpha=0.6)

Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /Users/shreyashankar/miniforge3/envs/promptdelta/lib/python3.10/site-packages/pulp/solverdir/cbc/osx/64/cbc /var/folders/nq/ldkhrrws0xb9whw7b6rpzhc00000gn/T/1f69fc1155544fa1923afef33aa36b5c-pulp.mps timeMode elapsed branch printingOptions all solution /var/folders/nq/ldkhrrws0xb9whw7b6rpzhc00000gn/T/1f69fc1155544fa1923afef33aa36b5c-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 6711 COLUMNS
At line 25417 RHS
At line 32124 BOUNDS
At line 35605 ENDATA
Problem MODEL has 6706 rows, 3480 columns and 11701 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Continuous objective value is 0.6 - 0.01 seconds
Cgl0002I 1767 variables fixed
Cgl0003I 24 fixed, 0 tightened bounds, 34 strengthened rows, 0 substitutions
Cgl0004I processed model has 17 rows, 25 columns (25 integer (25 of which binary)) and 58 elements
Cutoff increment incr

Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /Users/shreyashankar/miniforge3/envs/promptdelta/lib/python3.10/site-packages/pulp/solverdir/cbc/osx/64/cbc /var/folders/nq/ldkhrrws0xb9whw7b6rpzhc00000gn/T/111a17bc6ed346feb587c2887b5db7eb-pulp.mps timeMode elapsed branch printingOptions all solution /var/folders/nq/ldkhrrws0xb9whw7b6rpzhc00000gn/T/111a17bc6ed346feb587c2887b5db7eb-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 10715 COLUMNS
At line 41446 RHS
At line 52157 BOUNDS
At line 57662 ENDATA
Problem MODEL has 10710 rows, 5504 columns and 19678 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Continuous objective value is 18 - 0.01 seconds
Cgl0002I 3725 variables fixed
Cgl0003I 28 fixed, 0 tightened bounds, 0 strengthened rows, 4 substitutions
Cgl0004I processed model has 19 rows, 15 columns (15 integer (15 of which binary)) and 45 elements
Cutoff increment incr

### Compare SPADE_base and SPADE_sub

SPADE_base does not rely on the ILP; it is constructed by discarding every candidate assertion whose individual false failure rate (FFR) exceeds the threshold. SPADE_sub uses the ILP (with subsumption) to find the optimal set of assertions.

In [62]:
rprint(f"There are {len(ALL_FUNCTIONS)} candidate assertions.")
rprint(f"SPADE_base selected {len(optimizer_results['spade_base']['selected_functions'])} functions while SPADE_sub selected {len(optimizer_results['spade_sub']['selected_functions'])} functions.")
rprint(f"SPADE_base had an FFR of {optimizer_results['spade_base']['ffr']} while SPADE_sub had an FFR of {optimizer_results['spade_sub']['ffr']}. (Lower FFR is better.)")
rprint(f"SPADE_base had a coverage of {optimizer_results['spade_base']['coverage']} while SPADE_sub had a coverage of {optimizer_results['spade_sub']['coverage']}. (Higher coverage is better.)")
rprint(f"SPADE_base excluded {len(optimizer_results['spade_base']['not_subsumed_excluded_functions'])} functions that are not subsumed and still satisfy FFR constraints while SPADE_sub excluded {len(optimizer_results['spade_sub']['not_subsumed_excluded_functions'])} such functions.")
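
The SPADE_base filter described above can be sketched on toy data. This is an illustration under stated assumptions, not the code in `spade.optimizer`: the rows and the `per_assertion_ffr` helper are hypothetical, an assertion's FFR is taken to be the fraction of good responses it fails, and the threshold of 0.25 mirrors the `tau` argument on the assumption that `tau` is the FFR threshold.

```python
from collections import defaultdict

# Hypothetical (function_name, assertion_passed, response_is_good) rows.
toy_rows = [
    ("assert_a", True, True),
    ("assert_a", True, True),
    ("assert_a", False, False),
    ("assert_b", False, True),
    ("assert_b", True, True),
    ("assert_b", False, False),
]


def per_assertion_ffr(rows):
    """Fraction of good responses that each assertion fails."""
    failed_good = defaultdict(int)
    total_good = defaultdict(int)
    for name, passed, is_good in rows:
        if is_good:
            total_good[name] += 1
            if not passed:
                failed_good[name] += 1
    return {name: failed_good[name] / total_good[name] for name in total_good}


ffr = per_assertion_ffr(toy_rows)
threshold = 0.25  # assumed to play the role of the FFR threshold
# Keep only assertions whose individual FFR is within the threshold.
kept = sorted(name for name, rate in ffr.items() if rate <= threshold)
print(ffr, kept)
```

Here `assert_b` fails one of two good responses, so it is discarded, while `assert_a` fails only the bad response and is kept.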

### Compare SPADE_cov and SPADE_sub

SPADE_cov uses the ILP but only optimizes for example coverage and FFR. SPADE_sub uses the ILP and optimizes for example coverage, FFR, and subsumption.

In [72]:
rprint(f"There are {len(ALL_FUNCTIONS)} candidate assertions.")
rprint(f"SPADE_cov selected {len(optimizer_results['spade_cov']['selected_functions'])} functions while SPADE_sub selected {len(optimizer_results['spade_sub']['selected_functions'])} functions.")
rprint(f"SPADE_cov had an FFR of {optimizer_results['spade_cov']['ffr']} while SPADE_sub had an FFR of {optimizer_results['spade_sub']['ffr']}. (Lower FFR is better.)")
rprint(f"SPADE_cov had a coverage of {optimizer_results['spade_cov']['coverage']} while SPADE_sub had a coverage of {optimizer_results['spade_sub']['coverage']}. (Higher coverage is better.)")
rprint(f"SPADE_cov excluded {len(optimizer_results['spade_cov']['not_subsumed_excluded_functions'])} functions that are not subsumed and still satisfy FFR constraints while SPADE_sub excluded {len(optimizer_results['spade_sub']['not_subsumed_excluded_functions'])} such functions.")
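
To make the coverage and FFR trade-off concrete, here is a brute-force stand-in for the selection problem. It is not the ILP that `select_functions` solves. It assumes the goal is the smallest set of assertions that meets a coverage floor and an FFR ceiling, where coverage is the fraction of bad examples failed by at least one selected assertion and FFR is the fraction of good examples failed by at least one. All data and helper names are hypothetical.

```python
from itertools import combinations

# Hypothetical data: for each assertion, the set of example ids it fails.
fails = {
    "assert_a": {"bad1", "bad2"},
    "assert_b": {"bad2", "good1"},
    "assert_c": {"bad3"},
}
good_examples = {"good1", "good2", "good3", "good4"}
bad_examples = {"bad1", "bad2", "bad3"}


def coverage(selected):
    """Fraction of bad examples failed by at least one selected assertion."""
    caught = set().union(*(fails[name] for name in selected)) & bad_examples
    return len(caught) / len(bad_examples)


def ffr(selected):
    """Fraction of good examples failed by at least one selected assertion."""
    flagged = set().union(*(fails[name] for name in selected)) & good_examples
    return len(flagged) / len(good_examples)


def smallest_feasible_set(min_coverage, max_ffr):
    """Try subsets from smallest to largest; return the first feasible one."""
    names = sorted(fails)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if coverage(subset) >= min_coverage and ffr(subset) <= max_ffr:
                return subset
    return None


best = smallest_feasible_set(min_coverage=1.0, max_ffr=0.25)
print(best, coverage(best), ffr(best))
```

Enumerating subsets is exponential in the number of assertions, which is why a solver is the practical choice once there are more than a handful of candidates.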

#### Print a sample of assertions selected by SPADE_sub but not SPADE_cov

In [78]:
# Get assertions in SPADE_sub but not in SPADE_cov
unique_functions_in_sub = set(optimizer_results["spade_sub"]["selected_function_names"]) - set(optimizer_results["spade_cov"]["selected_function_names"])

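# Print up to 3 random assertions in unique_functions_in_sub
import random
random_assertions = sorted(unique_functions_in_sub)
# Cap the sample size so this does not raise when fewer than 3 names differ
random_assertions = random.sample(random_assertions, min(3, len(random_assertions)))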
for assertion in random_assertions:
    for f in ALL_FUNCTIONS:
        if f.__name__ == assertion:
            rprint(inspect.getsource(f))
            break