# Project 5: Prompting With Large Language Models

In this project, we learn how to solve tasks by prompting existing LLM APIs. We will experiment with zero-shot and few-shot prompting and different methods for example selection for a semantic parsing task.

First we install and import the required dependencies. These include:
* `openai` as our API for querying LLMs (you are free to choose to use a different LLM API if you would like)


In [1]:
%%capture
%pip install openai

If you are using the OpenAI API, create an account and copy your secret API key from `https://platform.openai.com/account/api-keys`. Store the key in an environment variable or a key management service so we can load it below, and keep it private. You may use a different LLM service instead, e.g. Cohere (https://cohere.ai/).

In [2]:
import os
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
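
Note that `os.getenv` returns `None` when the variable is unset, so a missing key only surfaces at the first API call. A minimal sketch of a fail-fast check is shown below; the helper name `require_env` is hypothetical and not part of the assignment.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before starting the notebook."
        )
    return value


# In the notebook this could replace the plain lookup:
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```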

If authentication succeeded, you should see a list of available models when you run the cell below.

In [96]:
import openai
openai.api_key = OPENAI_API_KEY
[model["root"] for model in openai.Model.list()["data"]]

['babbage',
 'davinci',
 'text-davinci-edit-001',
 'gpt-3.5-turbo-0301',
 'babbage-code-search-code',
 'text-similarity-babbage-001',
 'gpt-3.5-turbo',
 'code-davinci-edit-001',
 'text-davinci-001',
 'text-davinci-003',
 'ada',
 'babbage-code-search-text',
 'babbage-similarity',
 'code-search-babbage-text-001',
 'text-curie-001',
 'whisper-1',
 'code-search-babbage-code-001',
 'text-ada-001',
 'text-embedding-ada-002',
 'text-similarity-ada-001',
 'curie-instruct-beta',
 'ada-code-search-code',
 'ada-similarity',
 'code-search-ada-text-001',
 'text-search-ada-query-001',
 'davinci-search-document',
 'ada-code-search-text',
 'text-search-ada-doc-001',
 'davinci-instruct-beta',
 'text-similarity-curie-001',
 'code-search-ada-code-001',
 'ada-search-query',
 'text-search-davinci-query-001',
 'curie-search-query',
 'davinci-search-query',
 'babbage-search-document',
 'ada-search-document',
 'text-search-curie-query-001',
 'text-search-babbage-doc-001',
 'curie-search-document',
 'text-sear

We will now evaluate the LLM on a semantic parsing task. Geoquery is a dataset that contains information about the geography of the United States. For more information, please see: https://www.cs.utexas.edu/users/ml/nldata/geoquery.html. We will experiment with the compositional split introduced in Keysers et al. (2020), https://openreview.net/forum?id=SygcCnNKwr. First, let's download the train and validation data. The goal for the LLM is to take an English query about US geography (population, elevation, etc.) and output a formal representation of that query.

In [5]:
os.makedirs('./saves/', exist_ok=True)

In [6]:
!wget https://github.com/kl2806/geoquery/raw/main/data.zip -P data/
!unzip -o data/data.zip -d data/

--2023-04-22 20:22:43--  https://github.com/kl2806/geoquery/raw/main/data.zip
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/kl2806/geoquery/main/data.zip [following]
--2023-04-22 20:22:43--  https://raw.githubusercontent.com/kl2806/geoquery/main/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4493 (4.4K) [application/zip]
Saving to: ‘data/data.zip.2’


2023-04-22 20:22:43 (49.8 MB/s) - ‘data/data.zip.2’ saved [4493/4493]

Archive:  data/data.zip
  inflating: data/train.tsv          
  inflating: data/dev.tsv            


Let's take a look at the data. Each instance consists of an English utterance and a formal representation of that utterance, separated by a tab.

In [7]:
!head -n 5 data/train.tsv

how tall is the highest point in m0	answer ( elevation_1 ( highest ( intersection ( place , loc_2 ( m0 ) ) ) ) )
what is the largest city in m0	answer ( largest ( intersection ( city , loc_2 ( m0 ) ) ) )
what states border states that the m0 runs through	answer ( intersection ( state , next_to_2 ( intersection ( state , traverse_1 ( m0 ) ) ) ) )
what is the maximum elevation of m0	answer ( highest ( intersection ( place , loc_2 ( m0 ) ) ) )
what is the population of m0	answer ( population_1 ( m0 ) )
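
The target programs in the rows above nest function calls with parentheses. When we later inspect model outputs, a quick structural check can help separate malformed programs from well-formed but wrong ones. The sketch below only tests that parentheses are balanced; it does not validate function names, and the helper `has_balanced_parentheses` is hypothetical rather than part of the assignment.

```python
def has_balanced_parentheses(program: str) -> bool:
    """Return True if every '(' in `program` is closed by a later ')'."""
    depth = 0
    for char in program:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0


print(has_balanced_parentheses("answer ( population_1 ( m0 ) )"))  # True
print(has_balanced_parentheses("answer ( population_1 ( m0 )"))    # False
```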


In [8]:
import csv
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    query: str
    program: str

training_examples: List[Example] = []

with open('./data/train.tsv', 'r') as tsv_file:
    reader = csv.reader(tsv_file, delimiter='\t')
    for query, program in reader:
        training_examples.append(Example(query, program))

dev_examples: List[Example] = []
with open('./data/dev.tsv', 'r') as tsv_file:
    reader = csv.reader(tsv_file, delimiter='\t')
    for query, program in reader:
        dev_examples.append(Example(query, program))

print(f"Num training examples: {len(training_examples)}")
print(f"Num dev examples: {len(dev_examples)}")


Num training examples: 440
Num dev examples: 40


Now, let's define a function that uses the OpenAI API to output the semantic parse. As a first cut, describe the task in English and return that description as the prompt from `get_static_prompt`. Then implement `parse_example`, which should call the LLM API and return the semantic parse. `gpt-3.5-turbo`, the model behind ChatGPT, should work well for this task.

In [129]:
from typing import Callable

def get_static_prompt(utterance: str):
    """Return a prompt that doesn't change between different examples"""
    """YOUR CODE HERE"""
    return [
        {"role": "system", "content": "You are a semantic parser. Answer as concisely as possible."},
        {"role": "user", "content": "Map the following query into functional query language: " + utterance}
    ]

# prompt_method returns a list of chat messages; extra keyword arguments are forwarded to it.
def parse_example(model: str, utterance: str, prompt_method: Callable[..., List[dict]], **kwargs) -> str:
    """Return the semantic parse of the utterance"""
    prompt = prompt_method(utterance, **kwargs)
    """YOUR CODE HERE"""
    response = openai.ChatCompletion.create(model=model, messages=prompt)
    
    return response["choices"][0]["message"]["content"]

parse = parse_example(model="gpt-3.5-turbo",
                      utterance="what river runs through m0",
                      prompt_method=get_static_prompt)

print(parse)

filter(same_entity(m0, river), river)


With just an English description, the output probably does not look very similar to the target language that we want. Let's try to construct a prompt with some examples from our training set and see how it does. Implement the function below to uniformly sample examples from the training set, and use them to construct a few-shot prompt to the model.

In [130]:
import random
from typing import List

def get_prompt(utterance: str, examples: List[Example]):
    messages = [{"role": "system", "content": "You are a semantic parser. Answer as concisely as possible."}]

    for example in examples:
        messages.extend([
            {"role": "user", "content": "Map the following query into functional query language: " + example.query},
            {"role": "assistant", "content": example.program}
        ])

    return messages + [
        {"role": "user", "content": "Map the following query into functional query language: " + utterance}
    ]

def random_sample_prompt(utterance: str, training_examples: List[Example], num_samples: int = 10):
    """Return a prompt for a given example"""
    """YOUR CODE HERE"""
    examples = random.sample(training_examples, num_samples)

    return get_prompt(utterance, examples)

prompt = random_sample_prompt(utterance="what river runs through m0",
                              training_examples=training_examples)
print("Uniform sampling prompt:")
for entry in prompt:
    print(entry)

Uniform sampling prompt:
{'role': 'system', 'content': 'You are a semantic parser. Answer as concisely as possible.'}
{'role': 'user', 'content': 'Map the following query into functional query language: number of people in m0'}
{'role': 'assistant', 'content': 'answer ( population_1 ( m0 ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: how many people live in m0'}
{'role': 'assistant', 'content': 'answer ( population_1 ( m0 ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: what are the major cities in the state of m0'}
{'role': 'assistant', 'content': 'answer ( intersection ( major , intersection ( city , loc_2 ( intersection ( state , m0 ) ) ) ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: which of the states bordering m0 has the largest population'}
{'role': 'assistant', 'content': 'answer ( largest_one ( population_1 , intersection ( state , next_to_2 ( m0 )

In [131]:
parse = parse_example(model="gpt-3.5-turbo",
                      utterance="what river runs through m0",
                      prompt_method=random_sample_prompt,
                      training_examples=training_examples)
print(parse)

answer ( intersection ( river , traverse_2 ( m0 ) ) )
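
Because `random.sample` draws from the global random state, each call to `random_sample_prompt` picks different demonstrations and the evaluation numbers change from run to run. If reproducibility matters, one option is a private, seeded generator. This is a sketch under that assumption; `seeded_sample` is a hypothetical helper, not part of the assignment.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def seeded_sample(items: Sequence[T], num_samples: int, seed: int) -> List[T]:
    """Sample without replacement using a private RNG, leaving the global random state untouched."""
    rng = random.Random(seed)
    # random.sample rejects a sample size larger than the population, so cap it.
    k = min(num_samples, len(items))
    return rng.sample(items, k)


pool = [f"example {i}" for i in range(20)]
first = seeded_sample(pool, 5, seed=0)
second = seeded_sample(pool, 5, seed=0)
print(first == second)  # True
```

A single fixed seed gives every utterance the same demonstrations. Deriving the seed from the utterance with a stable hash such as `zlib.crc32(utterance.encode())` would vary the demonstrations per query while keeping runs repeatable.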


Now, let's evaluate our uniform sampling prompt on the validation set. If you run into rate limit issues with the API, you may want to use a backoff strategy or consult one of the solutions here https://platform.openai.com/docs/guides/rate-limits/error-mitigation.
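
As an illustration of the backoff strategy mentioned above, here is a dependency-free sketch of exponential backoff with jitter. The function `call_with_backoff` and its parameters are hypothetical names chosen for this example. Catching every `Exception` is a simplification; in practice you would retry only on the client library's rate-limit error.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with exponentially growing, jittered delays."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            attempt += 1
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles after each failure (1s, 2s, 4s, ...) up to max_delay,
            # then is scaled by a random factor in [0.5, 1.0] to spread out retries.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))


# Demo with a fake sleep so the example runs instantly.
attempts: List[int] = []


def flaky_call() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated rate limit")
    return "parsed"


waits: List[float] = []
print(call_with_backoff(flaky_call, sleep=waits.append))  # parsed
```

Unlike a fixed wait, the growing delay recovers quickly from a brief hiccup and still backs off far enough when the limit is hit repeatedly.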

In [14]:
%%capture
%pip install tenacity

In [132]:
import tqdm
from tenacity import (
    retry,
    wait_fixed,
    stop_after_attempt,
)

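REQUEST_PER_MIN = 3

# On any exception (e.g. a rate-limit error), wait a fixed 60/REQUEST_PER_MIN + 1 = 21 seconds
# and try again, giving up after 100 attempts. Despite the name, the wait does not grow.
@retry(wait=wait_fixed(60/REQUEST_PER_MIN+1), stop=stop_after_attempt(100))
def completion_with_backoff(*args, **kwargs):
    return parse_example(*args, **kwargs)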

def get_predictions(model: str,
                    evaluation_examples: List[Example],
                    prompt_creation_function: Callable[..., List[dict]],
                    **kwargs) -> List[str]:
    """Get a list of predictions from the evaluation examples"""
    predictions = []
    for example in tqdm.tqdm(evaluation_examples):
        predicted_program = completion_with_backoff(model, example.query, prompt_creation_function, **kwargs)
        predictions.append(predicted_program)
    return predictions

def evaluate(predictions: List[str], evaluation_examples: List[Example]) -> float:
    """Evaluate the accuracy of the predictions"""
    correct = 0
    for prediction, example in zip(predictions, evaluation_examples):
        if prediction == example.program:
            correct += 1
    return correct / len(evaluation_examples)
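
The `evaluate` function above compares strings exactly, so a prediction that differs from the target only by a trailing newline or doubled space counts as wrong. For error analysis it can help to also compute a whitespace-normalized score. The sketch below is one way to do that; the helper names are hypothetical, and the assignment's official metric remains the strict comparison above.

```python
from typing import List


def normalize_program(program: str) -> str:
    """Collapse runs of whitespace (including newlines) to single spaces and trim the ends."""
    return " ".join(program.split())


def normalized_exact_match(predictions: List[str], targets: List[str]) -> float:
    """Fraction of predictions equal to their target after whitespace normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(
        normalize_program(p) == normalize_program(t) for p, t in zip(predictions, targets)
    )
    return correct / len(targets)


print(normalized_exact_match(
    ["answer ( population_1 ( m0 ) )\n", "answer ( loc_1 ( m0 ) )"],
    ["answer ( population_1 ( m0 ) )", "answer ( city )"],
))  # 0.5
```

This does not rescue outputs that drop the spaces around parentheses, such as `answer(population_1(m0))`; those still need the few-shot examples to teach the format.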

In [133]:
random_sample_predictions = get_predictions(model="gpt-3.5-turbo",
                                            evaluation_examples=dev_examples,
                                            prompt_creation_function=random_sample_prompt,
                                            training_examples=training_examples)

100%|██████████| 40/40 [12:51<00:00, 19.28s/it]


The model should get at least 15% exact match with randomly sampled examples.

In [134]:
import re

def save_predictions(predictions: List[str], file_path: str):
    with open(file_path, 'w') as f:
        for prediction in predictions:
            f.write(re.sub("\n", " ", prediction))
            f.write('\n')

In [135]:
# Save the predictions as `random_predictions.txt`
save_predictions(random_sample_predictions, './saves/random_predictions.txt')

exact_match = evaluate(random_sample_predictions, dev_examples)
print(f"Exact match for uniform sampling prompt: {exact_match}")

Exact match for uniform sampling prompt: 0.3


Randomly sampling examples does not consider the utterance when selecting examples. Next, we will try to pick examples for the prompt based on embedding similarity. First, let's install `sentence-transformers`, which we will use to get embeddings of the utterance.

In [49]:
%%capture
%pip install sentence-transformers

Now, let's construct embeddings for each example in our training data using a small pretrained model. With the embeddings of all the training data, we can construct a function that takes in an utterance and outputs a prompt containing the examples with the highest cosine similarity to the utterance.

In [136]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def get_nearest_neighbor_prompt(utterance: str,
                                training_examples: List[Example],
                                embedding_model: str,
                                num_samples: int = 10):
    model = SentenceTransformer(embedding_model)
    """YOUR CODE HERE"""
    utterance_embedding = model.encode([utterance])
    corpus_embeddings = model.encode([example.query for example in training_examples])

    similarities = cosine_similarity(utterance_embedding, corpus_embeddings)[0]
    indices = similarities.argsort()[::-1][:num_samples]
    examples = [training_examples[i] for i in indices]

    return get_prompt(utterance, examples)
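
The cell above loads the embedding model and re-encodes all training queries on every call, which repeats the same work for each of the 40 dev examples. One possible restructuring is to embed the corpus once and reuse it across queries. The sketch below illustrates the idea with plain Python lists and a toy encoder so that it runs without downloading a model; `EmbeddingIndex`, `cosine`, and `toy_encode` are hypothetical names, not part of the assignment.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class EmbeddingIndex:
    """Embed the corpus once at construction, then answer many nearest-neighbor queries."""

    def __init__(self, texts: List[str], encode: Callable[[List[str]], List[Vector]]):
        self.texts = texts
        self.encode = encode
        self.embeddings = encode(texts)  # computed a single time

    def top_k(self, query: str, k: int) -> List[int]:
        """Return indices of the k corpus texts most similar to `query`, best first."""
        query_embedding = self.encode([query])[0]
        scores = [cosine(query_embedding, e) for e in self.embeddings]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return ranked[:k]


def toy_encode(texts: List[str]) -> List[Vector]:
    """Stand-in encoder (letter counts) used only to make this sketch runnable."""
    return [[text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"] for text in texts]


index = EmbeddingIndex(["what rivers are in m0", "how many people live in m0"], toy_encode)
print(index.top_k("which rivers are in m0", k=1))
```

In the notebook, the `encode` argument could be the `encode` method of a `SentenceTransformer` instance created once outside the function, assuming it is called with a list of strings as in the cell above.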

In [144]:
prompt = get_nearest_neighbor_prompt(utterance="what river runs through m0",
                           training_examples=training_examples,
                           embedding_model='all-MiniLM-L6-v2')
print("Similarity sampling prompt:")
for entry in prompt:
    print(entry)

Similarity sampling prompt:
{'role': 'system', 'content': 'You are a semantic parser. Answer as concisely as possible.'}
{'role': 'user', 'content': 'Map the following query into functional query language: which rivers are in m0'}
{'role': 'assistant', 'content': 'answer ( intersection ( river , loc_2 ( m0 ) ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: what rivers are in m0'}
{'role': 'assistant', 'content': 'answer ( intersection ( river , loc_2 ( m0 ) ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: what rivers are in m0'}
{'role': 'assistant', 'content': 'answer ( intersection ( river , loc_2 ( m0 ) ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: what rivers are in m0'}
{'role': 'assistant', 'content': 'answer ( intersection ( river , loc_2 ( m0 ) ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: what rivers are i

Evaluate the similarity-based prompt on the validation data and save your predictions as `similarity_predictions.txt`, with one prediction per line of the validation set. With similarity-based example selection, we should get at least 20% exact match on the validation set. Note that the training data may contain duplicates because we are working with a version of the data in which location names are normalized to variables like `m0`.

In [138]:
similarity_predictions = get_predictions(model="gpt-3.5-turbo",
                                         evaluation_examples=dev_examples,
                                         prompt_creation_function=get_nearest_neighbor_prompt,
                                         training_examples=training_examples,
                                         embedding_model='all-MiniLM-L6-v2')

100%|██████████| 40/40 [13:08<00:00, 19.71s/it]


In [139]:
# Save the predictions as `similarity_predictions.txt`
save_predictions(similarity_predictions, './saves/similarity_predictions.txt')

exact_match = evaluate(similarity_predictions, dev_examples)
print(f"Exact match for nearest neighbor prompt: {exact_match}")

Exact match for nearest neighbor prompt: 0.7


Let's try to improve on nearest-neighbor example search. This part is more open-ended: implement a different example selection method that improves over uniform random selection. You may implement an algorithm from the paper "Diverse Demonstrations Improve In-context Compositional Generalization" (https://arxiv.org/abs/2212.06800) or come up with your own example selection method. In the report, describe the algorithm you implemented and the intuition for why it could be effective.

In [140]:
import numpy as np
from sklearn.cluster import KMeans

def construct_diversity_prompt(utterance: str,
                               training_examples: List[Example],
                               embedding_model: str,
                               num_samples: int = 10,
                               random_state: int = 0):
    """YOUR CODE HERE"""
    model = SentenceTransformer(embedding_model)
    utterance_embedding = model.encode([utterance])
    corpus_embeddings = model.encode([example.query for example in training_examples])

    similarities = cosine_similarity(utterance_embedding, corpus_embeddings)[0]

    kmeans = KMeans(
        n_clusters=num_samples, random_state=random_state, n_init="auto"
    ).fit(corpus_embeddings)
    examples = []
    for label in np.unique(kmeans.labels_):
        cluster_indices = np.where(kmeans.labels_ == label)[0]
        index = cluster_indices[similarities[cluster_indices].argsort()[-1]]
        examples.append(training_examples[index])

    return get_prompt(utterance, examples)
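
The cell above picks, from each k-means cluster, the example closest to the utterance. As another illustration of a "come up with your own" method, the sketch below greedily selects examples whose words cover as many utterance words as possible. It is a word-level simplification written for this example, not the algorithm from the paper cited above, and `greedy_token_cover` is a hypothetical name.

```python
from typing import List


def greedy_token_cover(utterance: str, candidates: List[str], num_samples: int) -> List[int]:
    """Greedily pick candidate indices that cover the most not-yet-covered utterance words."""
    target = set(utterance.split())
    token_sets = [set(c.split()) for c in candidates]
    uncovered = set(target)
    remaining = list(range(len(candidates)))
    chosen: List[int] = []
    while remaining and len(chosen) < num_samples:
        gains = {i: len(uncovered & token_sets[i]) for i in remaining}
        if max(gains.values()) == 0 and uncovered != target:
            # Nothing left adds new coverage: start a fresh pass over the utterance words.
            uncovered = set(target)
            gains = {i: len(uncovered & token_sets[i]) for i in remaining}
        best = max(remaining, key=lambda i: gains[i])  # ties go to the earliest candidate
        chosen.append(best)
        remaining.remove(best)
        uncovered -= token_sets[best]
    return chosen


queries = [
    "where is m0",
    "what river flows through m0",
    "which river runs to the sea",
    "how many people live here",
]
print(greedy_token_cover("what river runs through m0", queries, num_samples=2))  # [1, 2]
```

The second pick is the query containing "runs", the only utterance word the first pick left uncovered, even though "where is m0" shares a word with the utterance too. That is the diversity effect: each new demonstration is rewarded for adding something the prompt does not yet contain.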

In [141]:
prompt = construct_diversity_prompt(utterance="what river runs through m0",
                           training_examples=training_examples,
                           embedding_model='all-MiniLM-L6-v2')
print("Diversity sampling prompt:")
for entry in prompt:
    print(entry)

Diversity sampling prompt:
{'role': 'system', 'content': 'You are a semantic parser. Answer as concisely as possible.'}
{'role': 'user', 'content': 'Map the following query into functional query language: through which states does the m0 flow'}
{'role': 'assistant', 'content': 'answer ( intersection ( state , traverse_1 ( m0 ) ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: where is m0'}
{'role': 'assistant', 'content': 'answer ( loc_1 ( m0 ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: which river traverses most states'}
{'role': 'assistant', 'content': 'answer ( most ( river , traverse_2 , state ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: which rivers are in m0'}
{'role': 'assistant', 'content': 'answer ( intersection ( river , loc_2 ( m0 ) ) )'}
{'role': 'user', 'content': 'Map the following query into functional query language: what states border sta

In [142]:
diversity_predictions = get_predictions(model="gpt-3.5-turbo",
                                        evaluation_examples=dev_examples,
                                        prompt_creation_function=construct_diversity_prompt,
                                        training_examples=training_examples,
                                        embedding_model='all-MiniLM-L6-v2')

100%|██████████| 40/40 [13:13<00:00, 19.85s/it]


In [143]:
# Save the predictions as `diversity_predictions.txt`
save_predictions(diversity_predictions, './saves/diversity_predictions.txt')

exact_match = evaluate(diversity_predictions, dev_examples)
print(f"Exact match for diversity based prompt: {exact_match}")

Exact match for diversity based prompt: 0.475


Get the predictions and submit them as `diversity_predictions.txt`, where each line is a prediction for the development set. With the improved selection, we should get at least 35% exact match on the validation set.

For the report, compare the predictions from the example selection methods using 1) uniform random sampling, 2) embedding-based similarity search, and 3) coverage-based selection. Compare and contrast the errors and submit your analysis as `report.pdf`.
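
To get started on the comparison, it can help to tabulate which methods got each dev example right and then read through the examples where they disagree. The sketch below shows one way to do this on toy data; the function names and the toy predictions are hypothetical and only illustrate the bookkeeping.

```python
from typing import Dict, List


def compare_methods(predictions: Dict[str, List[str]], targets: List[str]) -> List[Dict[str, bool]]:
    """For each example, record whether each method's prediction matches the target exactly."""
    rows = []
    for i, target in enumerate(targets):
        rows.append({name: preds[i] == target for name, preds in predictions.items()})
    return rows


def disagreements(rows: List[Dict[str, bool]]) -> List[int]:
    """Indices of examples that some methods got right and others got wrong."""
    return [i for i, row in enumerate(rows) if len(set(row.values())) > 1]


# Toy data standing in for the three prediction lists and the dev targets.
toy_targets = ["answer ( a )", "answer ( b )", "answer ( c )"]
toy_predictions = {
    "random": ["answer ( a )", "answer ( x )", "answer ( y )"],
    "similarity": ["answer ( a )", "answer ( b )", "answer ( y )"],
    "diversity": ["answer ( a )", "answer ( x )", "answer ( y )"],
}
rows = compare_methods(toy_predictions, toy_targets)
print(disagreements(rows))  # [1]
```

In the notebook, the dictionary values would be `random_sample_predictions`, `similarity_predictions`, and `diversity_predictions`, and the targets would be `[example.program for example in dev_examples]`.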

## Submission

Turn in the following files on Gradescope:
* hw5.ipynb (this file; please rename to match)
* random_predictions.txt
* similarity_predictions.txt
* diversity_predictions.txt
* report.pdf