# Perform Experiments with UpTrain

Experiments let you run A/B tests, so you can compare options and choose the one that suits you best. This notebook shows you how to perform experiments with UpTrain.

The experiment demonstrated here compares the responses a model gives when it is passed contexts of different lengths. The context length is controlled by a `chunk_size` parameter that limits the number of tokens in the `context` passed to the model.
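To make the role of `chunk_size` concrete, here is a minimal sketch of capping a context at a token budget. It assumes that whitespace-separated words stand in for tokens; a real pipeline would count tokens with the model's own tokenizer. `truncate_context` is a hypothetical helper for illustration and is not part of UpTrain.

```python
def truncate_context(text: str, chunk_size: int) -> str:
    """Keep at most `chunk_size` whitespace-delimited tokens (an approximation)."""
    if chunk_size < 0:
        raise ValueError("chunk_size must be non-negative")
    tokens = text.split()
    return " ".join(tokens[:chunk_size])


sample = "Prompt design can be an efficient way to adapt a language model"
short_context = truncate_context(sample, 5)
print(short_context)
```

A smaller budget keeps less of the source passage, which is why the two rows used in this notebook carry contexts of very different lengths.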

We cover only the code that is specific to performing experiments, not the entire process of extracting the context and generating the response. To learn more about that, please refer to the [Data Driven Experimentation Demo](https://github.com/uptrain-ai/uptrain/blob/main/examples/experiments/data_driven_experimentation_demo.ipynb).

## Install UpTrain

Run the following command in your terminal to install UpTrain:
```bash
pip install uptrain
```

In [1]:
from uptrain import APIClient, Evals
import json

### Create an UpTrain API Client

Before we can start using UpTrain, we need to create an API client. You can do this by passing your API key to the `APIClient` constructor.


In [2]:
import os

# Read the API key from the environment rather than hard-coding it.
# UpTrain keys have the form "up-***************".
UPTRAIN_API_KEY = os.environ.get("UPTRAIN_API_KEY")
client = APIClient(uptrain_api_key=UPTRAIN_API_KEY)

### Create your data

You can define your data as a simple dictionary with the following keys:

- `question`: The question you want to ask
- `context`: The context relevant to the question
- `response`: The response to the question

Here, we will perform A/B testing based on chunk size. This value is also passed as a key in the data dictionary.

- `chunk_size`: The limit on the number of tokens in the context

In [3]:
data = [
    {
        "question": "How can you ensure that a designed prompt elicits the desired response from a language model?",
        "context": "nudge the model to generate a desired result. Prompt design can be an efficient\nway to experiment with adapting a language model for a specific use case.",
        "response": "To ensure that a designed prompt elicits the desired response from a language model, you can experiment with different prompt designs that nudge the model towards generating the desired result. This can be an efficient way to adapt the language model for a specific use case.",
        "chunk_size": 200
    },
    {
        "question": "How can you ensure that a designed prompt elicits the desired response from a language model?",
        "context": "Design and save your own prompts\nPrompt design is the process of manually creating prompts that elicit the\ndesired response from a language model. By carefully crafting prompts, you can\nnudge the model to generate a desired result. Prompt design can be an efficient\nway to experiment with adapting a language model for a specific use case.\nYou can create and save your own prompts in Vertex AI Studio. When\ncreating a new prompt, you enter the prompt text, specify the model to use,\nconfigure parameter values, and test the prompt by generating a response. You\ncan iterate on the prompt and its configurations until you get the desired\nresults. When you are done designing the prompt, you can save it in\nVertex AI Studio.\nResponse citations\nIf you are using a text model in Vertex AI Studio like text-bison, you\nreceive text responses based on your input. Our features are intended to produce\noriginal content and not replicate existing content at length. If",
        "response": "To ensure that a designed prompt elicits the desired response from a language model, you can manually create prompts that carefully craft the desired result. By iterating on the prompt and its configurations in Vertex AI Studio, you can experiment and adjust until you achieve the desired results. Additionally, it is important to note that the text models in Vertex AI Studio are designed to produce original content and not replicate existing content at length.",
        "chunk_size": 1000
    }
]
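Before sending rows for evaluation, it can help to confirm that every row carries the keys listed above, including the experiment column. The check below is a minimal sketch in plain Python; `REQUIRED_KEYS` and `find_invalid_rows` are hypothetical names introduced here, not part of UpTrain.

```python
REQUIRED_KEYS = {"question", "context", "response", "chunk_size"}


def find_invalid_rows(rows):
    """Return (index, missing_keys) pairs for rows lacking any required key."""
    problems = []
    for index, row in enumerate(rows):
        missing = REQUIRED_KEYS - set(row)
        if missing:
            problems.append((index, sorted(missing)))
    return problems


example_rows = [
    {"question": "q", "context": "c", "response": "r", "chunk_size": 200},
    {"question": "q", "context": "c", "response": "r"},  # chunk_size missing
]
print(find_invalid_rows(example_rows))
```

An empty list means every row has the expected keys.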

## Experiments

Now that we have our data, we can perform experiments on it using UpTrain. We use the `evaluate_experiments` method to do this. This method takes the following arguments:

- `project_name`: The name of your project
- `data`: The data you want to log and evaluate
- `checks`: The evaluations you want to perform on your data
- `exp_columns`: A list of the columns that identify which experiment each row belongs to. You can enter multiple column names here.

You can find the list of all available evaluations [here](https://docs.uptrain.ai/key-components/evals).
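The result printed later in this notebook contains keys such as `score_factual_accuracy_chunk_size_200`, which suggests that rows sharing a question are merged and each experiment-specific field is suffixed with the experiment column's name and value. The sketch below imitates that layout in plain Python to show what an experiment column does. It is an illustration based only on the output shape, not UpTrain's implementation, and `pivot_by_experiment` is a hypothetical helper.

```python
def pivot_by_experiment(rows, exp_columns, id_column="question"):
    """Merge rows sharing `id_column`, suffixing other fields with the experiment id."""
    merged = {}
    for row in rows:
        # Build a suffix such as "chunk_size_200" from the experiment columns.
        suffix = "_".join(f"{col}_{row[col]}" for col in exp_columns)
        target = merged.setdefault(row[id_column], {id_column: row[id_column]})
        for key, value in row.items():
            if key == id_column or key in exp_columns:
                continue
            target[f"{key}_{suffix}"] = value
    return list(merged.values())


scored_rows = [
    {"question": "q1", "response": "short answer", "chunk_size": 200},
    {"question": "q1", "response": "long answer", "chunk_size": 1000},
]
print(pivot_by_experiment(scored_rows, ["chunk_size"]))
```

With more than one experiment column, the suffix would simply chain each column name and value in order.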


In [14]:
results = client.evaluate_experiments(
    project_name="Chunk-Size-Experiment",
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_RELEVANCE, Evals.FACTUAL_ACCURACY],
    exp_columns=["chunk_size"]
)

2024-01-09 19:17:52.964 | INFO     | uptrain.framework.remote:log_and_evaluate:507 - Sending evaluation request for rows 0 to <50 to the Uptrain server


### Get your results

In [15]:
print(json.dumps(results, indent=3))

[
   {
      "question": "How can you ensure that a designed prompt elicits the desired response from a language model?",
      "score_factual_accuracy_chunk_size_200": 1.0,
      "score_factual_accuracy_chunk_size_1000": 1.0,
      "score_context_relevance_chunk_size_200": 0.5,
      "score_context_relevance_chunk_size_1000": 1.0,
      "response_chunk_size_200": "To ensure that a designed prompt elicits the desired response from a language model, you can experiment with different prompt designs that nudge the model towards generating the desired result. This can be an efficient way to adapt the language model for a specific use case.",
      "response_chunk_size_1000": "To ensure that a designed prompt elicits the desired response from a language model, you can manually create prompts that carefully craft the desired result. By iterating on the prompt and its configurations in Vertex AI Studio, you can experiment and adjust until you achieve the desired results. Additionally, it is i

We can use these results to compare how the model's response changes when the context length changes. The differences become clearer with a larger dataset, but the process is the same.

Factual Accuracy Score:

In [9]:
print("Factual Accuracy for chunk size 200: ", results[0]["score_factual_accuracy_chunk_size_200"])
print("Factual Accuracy for chunk size 1000: ", results[0]["score_factual_accuracy_chunk_size_1000"])

Factual Accuracy for chunk size 200:  1.0
Factual Accuracy for chunk size 1000:  1.0


Context Relevance Score:

In [11]:
print("Context Relevance for chunk size 200: ", results[0]["score_context_relevance_chunk_size_200"])
print("Context Relevance for chunk size 1000: ", results[0]["score_context_relevance_chunk_size_1000"])

Context Relevance for chunk size 200:  0.5
Context Relevance for chunk size 1000:  0.5


Response Relevance Score:

In [10]:
print("Response Relevance for chunk size 200: ", results[0]["score_response_relevance_chunk_size_200"])
print("Response Relevance for chunk size 1000: ", results[0]["score_response_relevance_chunk_size_1000"])

Response Relevance for chunk size 200:  1.0
Response Relevance for chunk size 1000:  0.6666666666666666
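With a larger dataset, reading individual rows stops being practical, so averaging each score per chunk size gives a quicker comparison. The sketch below assumes result rows use the `score_<metric>_chunk_size_<n>` key layout seen above. `mean_scores_by_chunk_size` is a hypothetical helper, and the rows in the example are made up to mimic that layout; they are not real evaluation scores.

```python
import re
from collections import defaultdict

SCORE_KEY = re.compile(r"^score_(?P<metric>.+)_chunk_size_(?P<size>\d+)$")


def mean_scores_by_chunk_size(result_rows):
    """Average every `score_<metric>_chunk_size_<n>` field across rows."""
    collected = defaultdict(list)
    for row in result_rows:
        for key, value in row.items():
            match = SCORE_KEY.match(key)
            if match and value is not None:
                collected[(match["metric"], int(match["size"]))].append(value)
    return {group: sum(vals) / len(vals) for group, vals in collected.items()}


# Made-up rows that mimic the key layout; these are not real evaluation scores.
toy_results = [
    {"question": "a", "score_factual_accuracy_chunk_size_200": 1.0,
     "score_factual_accuracy_chunk_size_1000": 0.5},
    {"question": "b", "score_factual_accuracy_chunk_size_200": 0.5,
     "score_factual_accuracy_chunk_size_1000": 0.5},
]
for (metric, size), mean in sorted(mean_scores_by_chunk_size(toy_results).items()):
    print(f"{metric} @ chunk_size={size}: {mean:.2f}")
```

Scores that come back as `None` are skipped rather than counted as zero, so a failed evaluation does not drag an average down.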


**Access UpTrain Dashboards**: We can access the evaluation results at https://demo.uptrain.ai/dashboard/ - the same API key can be used to access the dashboards. Here's a sample screenshot of the above evaluation performed on a larger dataset in the [Data Driven Experimentation Demo](https://github.com/uptrain-ai/uptrain/blob/main/examples/experiments/data_driven_experimentation_demo.ipynb).


<img width="600" src="https://uptrain-assets.s3.ap-south-1.amazonaws.com/images/github-assets/Screenshot+2023-11-08+at+12.10.08+AM.png">