# LangSmith-Dataset 

- Author: [Minji](https://github.com/r14minji)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview

This notebook demonstrates how to create a dataset for evaluating Retrieval-Augmented Generation (RAG) models using LangSmith. It covers setting up environment variables, building a dataset of questions and answers, and uploading examples to LangSmith for testing. It also shows how to load a dataset from HuggingFace and how to add new examples to an existing dataset.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Creating a LangSmith Dataset](#creating-a-langsmith-dataset)
- [Creating a Dataset for LangSmith Testing](#creating-a-dataset-for-langsmith-testing)


### References

- [LangChain](https://blog.langchain.dev/)
- [LangSmith](https://docs.smith.langchain.com)
----

## Environment Setup

Setting up your environment is the first step. See the [Environment Setup](https://wikidocs.net/257836) guide for more details.


**[Note]**

`langchain-opentutorial` is a package that bundles easy-to-use environment setup helpers, functions, and utilities for these tutorials.
See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)

You can set API keys in a `.env` file or set them manually.

[Note] If you’re not using the `.env` file, no worries! Just enter the keys directly in the cell below, and you’re good to go.

In [4]:
from dotenv import load_dotenv
from langchain_opentutorial import set_env

# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.
if not load_dotenv():
    set_env(
        {
            # "OPENAI_API_KEY": "",
            # "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "",  # set the project name same as the title
        }
    )
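
Before moving on, it can help to confirm that the keys are actually visible to the running process. The sketch below is not part of the original notebook: `missing_env_vars` is a hypothetical helper, and the list of required names is an assumption based on the keys shown above.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumed list of keys this tutorial relies on; adjust it to your own setup.
required = ["LANGCHAIN_API_KEY", "LANGCHAIN_ENDPOINT"]

missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

An empty result means every listed variable has a non-empty value; it does not verify that the key itself is valid.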

## Creating a LangSmith Dataset

Let's learn how to build a custom RAG evaluation dataset.

To construct a dataset, you first need to understand the three main evaluation cases it has to support:

Case: Evaluating whether the retrieval is relevant to the question

> Question - Retrieval

![](./assets/langsmith-dataset-01.png)

Case: Evaluating whether the answer is relevant to the question

> Question - Answer

![](./assets/langsmith-dataset-02.png)

Case: Checking if the answer is based on the retrieved documents (Hallucination Check)

> Retrieval - Answer

![](./assets/langsmith-dataset-03.png)

Thus, you typically need `Question`, `Retrieval`, and `Answer` information. In practice, however, constructing ground truth for `Retrieval` is difficult.

If ground truth for `Retrieval` exists, you can store all three in your dataset. Otherwise, you can create and use a dataset with only `Question` and `Answer`.
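
One way to picture the two options is a record whose retrieval ground truth is optional. This is a minimal sketch, not the notebook's method: `make_example` is a hypothetical helper, and the `contexts` key is an assumed name for retrieval ground truth.

```python
def make_example(question, answer, contexts=None):
    """Build an (inputs, outputs) pair for one evaluation example.

    `contexts` is optional retrieval ground truth; the key name is an
    assumption made for this illustration.
    """
    inputs = {"question": question}
    outputs = {"answer": answer}
    if contexts is not None:
        # Only store retrieval ground truth when it actually exists.
        outputs["contexts"] = list(contexts)
    return inputs, outputs


# Question and Answer only
qa_only = make_example("What is RAG?", "Retrieval-Augmented Generation.")

# Question, Answer, and Retrieval ground truth
with_retrieval = make_example(
    "What is RAG?",
    "Retrieval-Augmented Generation.",
    contexts=["RAG stands for Retrieval-Augmented Generation."],
)
```

Keeping the retrieval field optional lets both kinds of example share one schema, so retrieval ground truth can be added later without restructuring the dataset.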

## Creating a LangSmith Dataset

Use `inputs` and `outputs` to create a dataset.

Here, `inputs` holds the questions and `outputs` holds the corresponding answers.

In [5]:
import pandas as pd

# List of questions and answers
inputs = [
    "What is the name of the generative AI created by Samsung Electronics?",
    "On what date did U.S. President Biden issue an executive order ensuring safe and trustworthy AI development and usage?",
    "Please briefly describe Cohere's data provenance explorer."
]

# List of corresponding answers
outputs = [
    "The name of the generative AI created by Samsung Electronics is Samsung Gauss.",
    "On October 30, 2023, U.S. President Biden issued an executive order.",
    "Cohere's data provenance explorer is a platform that tracks the sources and licensing status of datasets used for training AI models, ensuring transparency. It collaborates with 12 organizations and provides source information for over 2,000 datasets, helping developers understand data composition and lineage.",
]

# Create question-answer pairs
qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Convert to a DataFrame
df = pd.DataFrame(qa_pairs)

# Display the DataFrame
df.head()

|   | question | answer |
|---|----------|--------|
| 0 | What is the name of the generative AI created ... | The name of the generative AI created by Samsu... |
| 1 | On what date did U.S. President Biden issue an... | On October 30, 2023, U.S. President Biden issu... |
| 2 | Please briefly describe Cohere's data provenan... | Cohere's data provenance explorer is a platfor... |
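
Note that `zip` stops at the shorter list, so a missing answer would silently drop a question rather than raise an error. The sketch below adds an explicit length check before pairing; `build_qa_pairs` is a hypothetical helper introduced only for illustration.

```python
def build_qa_pairs(questions, answers):
    """Pair questions with answers, refusing lists of different lengths."""
    if len(questions) != len(answers):
        raise ValueError(
            f"Got {len(questions)} questions but {len(answers)} answers"
        )
    return [{"question": q, "answer": a} for q, a in zip(questions, answers)]


pairs = build_qa_pairs(
    ["Question 1?", "Question 2?"],
    ["Answer 1.", "Answer 2."],
)
```

The resulting list has the same shape as `qa_pairs` above, so it can be passed to `pd.DataFrame` in the same way.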


Alternatively, you can use the Synthetic Dataset generated in a previous tutorial.

The code below shows an example of using an uploaded HuggingFace Dataset.

In [8]:
# !pip install -qU datasets

In [14]:
import pandas as pd
from datasets import load_dataset
import os

# Download the dataset from the HuggingFace Hub using its repo_id
dataset = load_dataset(
    "teddylee777/rag-synthetic-dataset",  # Dataset name (repo_id), passed as a string
    token=os.environ["HUGGINGFACEHUB_API_TOKEN"],  # Required for private datasets
)

# View dataset by split
huggingface_df = dataset["korean_v1"].to_pandas()
huggingface_df.head()

## Creating a Dataset for LangSmith Testing

- Create a new dataset under `Datasets & Testing`.

![](./assets/langsmith-dataset-04.png)

You can also create a dataset directly using the LangSmith UI from a CSV file.

For more details, refer to the documentation below:

- [LangSmith UI Documentation](https://docs.smith.langchain.com/observability/how_to_guides/tracing/upload_files_with_traces)
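
If you prefer the CSV route, the question-answer pairs first need to be written to a file. The following is a minimal sketch using only the standard library; `write_qa_csv` is a hypothetical helper, and the `question` and `answer` column names are assumptions chosen to match the keys used in this tutorial. For the DataFrame built earlier, `df.to_csv(path, index=False)` would produce a comparable file.

```python
import csv
import os
import tempfile


def write_qa_csv(path, questions, answers):
    """Write question-answer pairs to a CSV file with a header row."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "answer"])
        writer.writerows(zip(questions, answers))


# Example usage with a temporary directory; in practice, choose any path.
csv_path = os.path.join(tempfile.mkdtemp(), "rag_eval_dataset.csv")
write_qa_csv(
    csv_path,
    ["What is the name of the generative AI created by Samsung Electronics?"],
    ["The name of the generative AI created by Samsung Electronics is Samsung Gauss."],
)
print(csv_path)
```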



In [11]:
from langsmith import Client

client = Client()
dataset_name = "RAG_EVAL_DATASET"


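# Return the dataset named `dataset_name` if it already exists; otherwise create it
def create_dataset(client, dataset_name, description=None):
    # Reuse an existing dataset with the same name instead of creating another
    for dataset in client.list_datasets():
        if dataset.name == dataset_name:
            return dataset

    # No dataset with this name was found, so create a new one
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description=description,
    )
    return dataset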


# Create dataset
dataset = create_dataset(client, dataset_name)

# Add examples to the created dataset
client.create_examples(
    inputs=[{"question": q} for q in df["question"].tolist()],
    outputs=[{"answer": a} for a in df["answer"].tolist()],
    dataset_id=dataset.id,
)


You can add examples to the dataset later.

In [13]:
# New list of questions
new_questions = [
    "What is the name of the generative AI created by Samsung Electronics?",
    "Is it true that Google invested $2 billion in Teddynote?",
]

# New list of answers
new_answers = [
    "The name of the generative AI created by Samsung Electronics is Samsung Gauss.",
    "This is not true. Google agreed to invest up to $2 billion in Anthropic, starting with $500 million and planning to invest an additional $1.5 billion in the future.",
]

# Add the new examples, then verify the updated dataset version in the UI
client.create_examples(
    inputs=[{"question": q} for q in new_questions],
    outputs=[{"answer": a} for a in new_answers],
    dataset_id=dataset.id,
)
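
The first new question above repeats one from the initial upload. If you want to avoid adding a question twice, you can filter the new pairs before uploading them. This is a sketch under stated assumptions: `filter_new_pairs` is a hypothetical helper, and the set of existing questions is passed in by the caller. In the notebook it could plausibly be collected from `client.list_examples(dataset_id=dataset.id)` by reading each example's `inputs["question"]`, but that wiring is an assumption and is not shown in the original.

```python
def filter_new_pairs(existing_questions, questions, answers):
    """Drop pairs whose question is already known, keeping the original order."""
    seen = set(existing_questions)
    kept_questions, kept_answers = [], []
    for q, a in zip(questions, answers):
        if q in seen:
            continue  # skip questions that are already in the dataset
        seen.add(q)  # also guards against repeats within the new batch
        kept_questions.append(q)
        kept_answers.append(a)
    return kept_questions, kept_answers


existing = ["What is the name of the generative AI created by Samsung Electronics?"]
candidate_questions = [
    "What is the name of the generative AI created by Samsung Electronics?",
    "Is it true that Google invested $2 billion in Teddynote?",
]
candidate_answers = ["Answer A", "Answer B"]

unique_questions, unique_answers = filter_new_pairs(
    existing, candidate_questions, candidate_answers
)
```

The filtered lists can then be passed to `client.create_examples` in the same way as `new_questions` and `new_answers`.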
