<a href="https://colab.research.google.com/github/scorecard-ai/scorecard-cookbook/blob/main/%5BScorecard%5D_OpenAI_Assistants_API_RAG_Evals_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo: OpenAI Assistants API RAG Evals Example

## 🧙‍♂️ Instructions

1. Create an account and [log in to Scorecard](https://app.getscorecard.ai/). Copy your [API key](https://app.getscorecard.ai/settings).
1. Add your Scorecard and OpenAI API keys below.
1. Go to `Runtime` -> `Run all`. Enjoy!

In [None]:
#@title 👉 API Keys

import os

OPENAI_API_KEY = "" #@param { type: "string" }
SCORECARD_API_KEY = "" #@param { type: "string" }

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["SCORECARD_API_KEY"] = SCORECARD_API_KEY
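
If either key is left blank, the later cells fail only when the first API call is made. A minimal sketch of an early check, assuming the two environment variable names set above; `missing_keys` is a hypothetical helper, not part of either SDK:

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names whose value is unset or blank in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


# Hypothetical usage: report blank keys before running the rest of the notebook.
missing = missing_keys(["OPENAI_API_KEY", "SCORECARD_API_KEY"])
if missing:
    print(f"Missing API keys: {', '.join(missing)}")
```

Passing `env` explicitly makes the helper easy to try against a plain dictionary.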

# Setup

In [None]:
#@title Install dependencies
#@markdown In order to keep the notebook working for all future users, we pin the dependency versions.

!pip install scorecard-ai==0.1.12
!pip install openai==1.14.3

Collecting scorecard-ai==0.1.12
  Downloading scorecard_ai-0.1.12-py3-none-any.whl (43 kB)
Installing collected packages: scorecard-ai
  Attempting uninstall: scorecard-ai
    Found existing installation: scorecard-ai 0.1.10
    Uninstalling scorecard-ai-0.1.10:
      Successfully uninstalled scorecard-ai-0.1.10
Successfully installed scorecard-ai-0.1.12
Collecting openai==1.14.3
  Downloading openai-1.14.3-py3-none-any.whl (262 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.11.1
    Uninstalling openai-1.11.1:
      Successfully uninstalled openai-1.11.1
Successfully i

In [None]:
#@title Imports

from openai import OpenAI
from scorecard.client import Scorecard
from google.colab import files
from typing import List, Tuple
import asyncio
import time

# Build your LLM system

Now, let's define your system (aka the system under test)! For this demo, we'll set up an OpenAI Assistant that performs retrieval-augmented generation (RAG) on the file you upload.

In [None]:
#@title Upload your document/data

uploaded = files.upload()
print("Upload done!")
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [None]:
#@title Create an OpenAI Assistant

client = OpenAI()
file = client.files.create(
  file=open(list(uploaded.keys())[0], "rb"),
  purpose='assistants'
)
assistant = client.beta.assistants.create(
    instructions="You have access to files containing my emails to answer questions.",
    name="RAG Assistant",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id],
    model="gpt-4-1106-preview",
)

In [None]:
#@title Call OpenAI to generate completion from the Assistant
#@markdown Here we'll define a function that sends a single user message to the Assistant, waits for the run to complete, and returns the response together with any cited quotes.

def generate(query: str) -> Tuple[str, str]:
    client = OpenAI(api_key=OPENAI_API_KEY)
    run = client.beta.threads.create_and_run(
        assistant_id=assistant.id,
        thread={
            "messages": [
                {
                    "role": "user",
                    "content": query,
                }
            ]
        },
    )
    run_id = run.id
    thread_id = run.thread_id
    print(f"Run ID: {run_id}")
    print(f"Thread ID: {thread_id}")

    while True:
        try:
            run = client.beta.threads.runs.retrieve(
                run_id=run_id, thread_id=thread_id
            )
        except Exception as e:
            print(f"An error occurred while retrieving the run: {e}")
            raise

        # A run that fails, is cancelled, or expires never sets completed_at,
        # so stop here instead of polling forever.
        if run.status in ("failed", "cancelled", "expired"):
            raise RuntimeError(f"Run ended with status: {run.status}")

        if not run.completed_at:
            time.sleep(0.2)
            continue

        elapsed_time = run.completed_at - run.created_at
        formatted_elapsed_time = time.strftime(
            "%H:%M:%S", time.gmtime(elapsed_time)
        )
        print(f"Run completed in {formatted_elapsed_time}")

        # Messages are listed newest first, so the first item is the Assistant's reply.
        messages = client.beta.threads.messages.list(thread_id=thread_id)
        last_message = messages.data[0]

        # Extract the text content and its annotations
        message_content = last_message.content[0].text
        citations = []

        # Collect the quoted passage from each file citation
        for annotation in message_content.annotations:
            if (file_citation := getattr(annotation, 'file_citation', None)):
                citations.append(file_citation.quote)

        # The citations list is returned as a string to use as the testrecord context
        return message_content.value, str(citations)
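
The polling loop above waits at a fixed 0.2-second interval with no upper bound on total wait time. A minimal sketch of a reusable polling helper with a timeout and a growing interval follows. `poll_until` and its parameters are hypothetical names for illustration and not part of the OpenAI SDK; the default values are arbitrary.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], Optional[T]],
    timeout_s: float = 120.0,
    interval_s: float = 0.2,
    max_interval_s: float = 5.0,
) -> T:
    """Call `fetch` until it returns a non-None value or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No result after {timeout_s} seconds")
        time.sleep(interval_s)
        # Double the wait after each attempt, up to the cap.
        interval_s = min(interval_s * 2, max_interval_s)
```

In `generate`, the `fetch` callable could retrieve the run and return it only once `completed_at` is set, returning `None` otherwise.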

# Evaluate your system

## Pre-req: Create Metrics

First, using the Scorecard application, create your metrics and scoring config. For this example,
we can use something simple like a Helpfulness metric that determines whether
the generation adheres to the user's request.

Once you have created your scoring config, copy the ID and enter it below:

In [None]:
#@title Configure Metrics
SCORING_CONFIG_ID = 1  #@param { type: "number" }

In [None]:
scorecard_client = Scorecard(
    api_key=SCORECARD_API_KEY
)

In [None]:
#@title 1. Create a basic Testset
#@markdown Here we'll create a basic Testset that gets stored in Scorecard.

client = Scorecard(
    api_key=SCORECARD_API_KEY
)

# Create a Testset
testset = client.testset.create(
    name="RAG Email Demo",
    description="Demo of a testset created via Scorecard Python SDK",
    using_retrieval=True
)

# Add three testcases
client.testcase.create(
    testset_id=testset.id,
    user_query="What was said on January 2?"
)
client.testcase.create(
    testset_id=testset.id,
    user_query="How many times did I email Roland?"
)
client.testcase.create(
    testset_id=testset.id,
    user_query="What did I last say to Roland?"
)

print("Visit the Scorecard app to view your Testset:")
print(f"https://app.getscorecard.ai/view-dataset/{testset.id}")

Visit the Scorecard app to view your Testset:
https://app.getscorecard.ai/view-dataset/1597
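
The three `client.testcase.create` calls above differ only in the query, so they could be driven from a list. A minimal sketch, assuming the same `testset_id` and `user_query` keyword arguments used in the cell above; `add_testcases` and the stand-in client are hypothetical and exist only so the example runs without network access:

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def add_testcases(client: Any, testset_id: int, queries: Iterable[str]) -> List[Any]:
    """Create one testcase per query, mirroring the calls in the cell above."""
    return [
        client.testcase.create(testset_id=testset_id, user_query=query)
        for query in queries
    ]


class _FakeTestcaseAPI:
    """Stand-in for the Scorecard testcase API (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.created = []

    def create(self, testset_id: int, user_query: str) -> SimpleNamespace:
        self.created.append((testset_id, user_query))
        return SimpleNamespace(testset_id=testset_id, user_query=user_query)


fake_client = SimpleNamespace(testcase=_FakeTestcaseAPI())
created = add_testcases(
    fake_client,
    1597,
    [
        "What was said on January 2?",
        "How many times did I email Roland?",
        "What did I last say to Roland?",
    ],
)
print(f"Created {len(created)} testcases")
```

With the real client from the cell above, the call would be `add_testcases(client, testset.id, queries)`.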


In [None]:
#@title 2. Run Tests
#@markdown Now we'll create a new Run to execute our LLM system above.

from scorecard.types import RunStatus

client = Scorecard(
    api_key=SCORECARD_API_KEY
)

run = client.run.create(testset_id=testset.id)
client.run.update_status(run_id=run.id, status=RunStatus.RUNNING_EXECUTION)

for testcase in client.testset.get_testcases(testset_id=testset.id).results:
  model_response = generate(query=testcase.user_query)
  print(model_response)
  client.testrecord.create(run_id=run.id,
                           testset_id=testset.id,
                           testcase_id=testcase.id,
                           user_query=testcase.user_query,
                           context=model_response[1],
                           response=model_response[0])

client.run.update_status(run_id=run.id, status=RunStatus.AWAITING_SCORING)

print("Visit the Scorecard app to view your Run:")
print(f"https://app.getscorecard.ai/view-records/{run.id}")

Run ID: run_MN238gIBPWbIrqkvU41SCLCQ
Thread ID: thread_LtxCRjTAsGQTGw3A2GeuQEj6
Run completed in 00:00:33
('I am unable to find the specific mention of January 2nd in the emails. It appears that scrolling through the document did not reveal any correspondence from that date. If you have any specific keywords or context about the content from January 2nd that you are looking for, it would help to narrow down the search. Otherwise, it might be that the emails from January 2nd are not included in the uploaded file. Can you please provide more information or guide me on how to proceed?', '[]')
Run ID: run_lRUgq7cqkjPWydFEwikhq6IW
Thread ID: thread_whGAvk6ffLqGCu9yNA6knu1x
Run completed in 00:00:12
('After scrolling through the document and seeing references to "Roland," it appears that you have not directly emailed Roland. The instances of the name "Roland" found in the document are in the context of an email invitation list and RSVP for events where Roland is one of many recipients. If yo
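
In the output above, the context is the literal string `'[]'` because `generate` returns `str(citations)` even when no citations were found. A minimal sketch of an alternative formatter that joins the quotes and yields an empty string when there are none. `format_context` is a hypothetical helper, and whether an empty context is preferable for your metrics is an assumption to verify:

```python
from typing import List


def format_context(citations: List[str], separator: str = "\n\n") -> str:
    """Join non-empty citation quotes; return "" when there are none."""
    cleaned = [quote.strip() for quote in citations if quote and quote.strip()]
    return separator.join(cleaned)
```

Inside `generate`, this would replace `str(citations)` in the return statement.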

In [None]:
#@title 3. Kick off Scoring
#@markdown Once your run above is finished executing, hit the "Run Scoring" button to run scoring. Once that's done, visit the Results page:

print("Visit the Scorecard app to view your Results:")
print(f"https://app.getscorecard.ai/view-grades/{run.id}")

Visit the Scorecard app to view your Results:
https://app.getscorecard.ai/view-grades/3109
