<a href="https://colab.research.google.com/github/uptrain-ai/uptrain/blob/main/examples/integrations/vector_db/milvus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="UpTrain">
  </a>
</h1>

<div style="text-align: center;">

## Building a RAG Pipeline using Milvus and Evaluating Results using UpTrain

### Overview

Retrieval-Augmented Generation (RAG) is a technique that improves the accuracy of an LLM's responses by supplying information retrieved from an external source.

In this tutorial, we use [Milvus](https://milvus.io/) (an open-source vector database) and [UpTrain](https://uptrain.ai/) (an open-source LLM evaluation tool) to walk through the key components of a RAG pipeline.

A vector database lets you run similarity search over your documents. After all, your knowledge base could be quite large, and you would want to pass only the relevant information to an LLM.

A RAG pipeline has multiple parts, and a failure can occur at any step. Tools like UpTrain can help you analyze LLM-generated responses using 20+ pre-built evaluations (covering use cases such as response quality, context awareness, and code-related evals).

### Install Dependencies

In [1]:
%pip install pymilvus uptrain milvus-model openai wikipedia

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
from openai import OpenAI
import wikipedia
import os
import json

### Defining the data 

In this example, we consider a case where the user's questions are related to the 2020 Summer Olympics.

In [3]:
questions = [
    "How many events were there in the 2020 Summer Olympics",
    "Who performed at the opening ceremony of the 2020 Summer Olympics?",
    "Where did the 2020 Summer Olympics take place?",
]

**Let's first define a knowledge base**

In this example, we will be using a [Wikipedia page](https://en.wikipedia.org/wiki/2020_Summer_Olympics) about the 2020 Summer Olympics.

In [4]:
base_data = wikipedia.page("2020 Summer Olympics").content 

In [5]:
base_data[:100]

'The 2020 Summer Olympics, officially the Games of the XXXII Olympiad and officially branded as Tokyo'

**Split Document into Chunks**

For ease of retrieving information, we will split the knowledge base into smaller chunks.

Here, we split the document into sentence-level chunks.

In [6]:
chunks = base_data.split('. ')

# Let's Look at the first 5 chunks
chunks[:5] 

['The 2020 Summer Olympics, officially the Games of the XXXII Olympiad and officially branded as Tokyo 2020, were an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July 2021',
 'Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina on 7 September 2013.Originally scheduled to take place from 24 July to 9 August 2020, the event was postponed to 2021 on 24 March 2020 due to the global COVID-19 pandemic, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled)',
 'However, the event retained the Tokyo 2020 branding for marketing purposes',
 'It was largely held behind closed doors with no public spectators permitted due to the declaration of a state of emergency in the Greater Tokyo Area in response to the pandemic, the first and only Olympic Games to be held without official spectators',
 'The Games were the mos
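
The output above shows that splitting on `'. '` can leave sentences fused together (for example `2013.Originally`). The following is a minimal sketch of a slightly more tolerant splitter, not the approach used in the rest of this notebook. The helper name `split_into_chunks` and the regular expression are illustrative assumptions, and the heuristic will still mis-split abbreviations.

```python
import re

# Split after sentence-ending punctuation followed by whitespace, or between
# a lowercase letter/digit + '.' and an uppercase letter (e.g. "2013.Originally").
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[a-z0-9]\.)(?=[A-Z])")


def split_into_chunks(text: str) -> list[str]:
    """Return sentence-like chunks, stripped of whitespace, with empty pieces dropped."""
    pieces = _SENTENCE_BOUNDARY.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "Tokyo was selected on 7 September 2013.Originally scheduled for 2020. It was postponed."
for chunk in split_into_chunks(sample):
    print(chunk)
```

Unlike `base_data.split('. ')`, this sketch keeps the trailing period on each chunk, so chunk counts and ids would differ from the outputs shown in this notebook.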

### Create embeddings

Create embeddings for the information present in the chunks.

In [7]:
from pymilvus import model

# initialize using 'text-embedding-3-large'
openai_ef = model.dense.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-large", 
    dimensions=512 
)

context_embeddings = openai_ef(chunks)

In [8]:
context_embeddings_formatted = []

for index in range(len(context_embeddings)):
    context_embeddings_formatted.append(
        {
            "id": index,
            "vector": context_embeddings[index]
        }
    )
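
The loop above stores only an id and a vector per row. As a hedged alternative, the sketch below builds the same rows with `enumerate` and also keeps the chunk text next to each vector. The helper name `build_records` and the extra `text` field are assumptions for illustration; storing `text` relies on the collection accepting fields beyond `id` and `vector`, which you should verify for your Milvus setup.

```python
def build_records(chunks, embeddings):
    """Pair each chunk with its embedding and a sequential integer id."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        # "text" is an illustrative extra field, not used elsewhere in this notebook.
        {"id": index, "vector": vector, "text": chunk}
        for index, (chunk, vector) in enumerate(zip(chunks, embeddings))
    ]


# Tiny stand-in data; in the notebook these would be `chunks` and `context_embeddings`.
records = build_records(["first chunk", "second chunk"], [[0.1, 0.2], [0.3, 0.4]])
print(records[0]["id"], records[0]["text"])
```

Keeping the text alongside the vector would let a search return the chunk directly, rather than looking it up by id in the `chunks` list afterwards.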

### Store these embeddings in Milvus Vector DB

In this example, we have hosted Milvus locally using Docker.

You can check out this [documentation](https://milvus.io/docs/install_standalone-docker.md) to learn how to install Milvus.

In [9]:
from pymilvus import MilvusClient

CLUSTER_ENDPOINT = "http://localhost:19530/"

milvus_client = MilvusClient(
    uri=CLUSTER_ENDPOINT,
)

We will now create a collection in Milvus to store the embeddings that we created

In [10]:
MILVUS_COLLECTION_NAME = "UpTrain_Milvus"

milvus_client.create_collection(
    collection_name=MILVUS_COLLECTION_NAME,
    dimension=512
)
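
If you re-run the notebook against the same Milvus server, a collection with this name may already exist from an earlier run. A minimal sketch of one way to start from a clean collection is shown below. The helper name `recreate_collection` is hypothetical; it assumes the client exposes `has_collection`, `drop_collection`, and `create_collection` with the keyword arguments used here, as `MilvusClient` does in the versions we are aware of.

```python
def recreate_collection(client, name: str, dimension: int) -> None:
    """Drop the collection if it exists, then create it again with the given dimension."""
    if client.has_collection(collection_name=name):
        # Dropping removes all previously inserted vectors in this collection.
        client.drop_collection(collection_name=name)
    client.create_collection(collection_name=name, dimension=dimension)


# Usage in this notebook (requires a running Milvus server):
# recreate_collection(milvus_client, MILVUS_COLLECTION_NAME, 512)
```

The `dimension` must match the `dimensions=512` value passed to the embedding function above.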

We will now store the embeddings in the collection that we created

In [11]:
milvus_client.insert(
    collection_name=MILVUS_COLLECTION_NAME,
    data=context_embeddings_formatted
)

{'insert_count': 272,
 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215,

### Retrieve relevant chunk using embeddings stored in Milvus

Create embeddings for the list of questions.

In [12]:
questions_embeddings = openai_ef(questions)

Extract the chunks closest to the questions using the embeddings stored in Milvus

In [13]:
res = milvus_client.search(
    collection_name=MILVUS_COLLECTION_NAME,
    data=questions_embeddings,
    limit=2,
)

> **Note**: You can set the number of results to fetch with the `limit` argument. In this example, we fetch the 2 best results for each question.

In [14]:
res

[[{'id': 171, 'distance': 0.6818565726280212, 'entity': {}},
  {'id': 0, 'distance': 0.6670520305633545, 'entity': {}}],
 [{'id': 155, 'distance': 0.6477672457695007, 'entity': {}},
  {'id': 153, 'distance': 0.6460253000259399, 'entity': {}}],
 [{'id': 0, 'distance': 0.7630819082260132, 'entity': {}},
  {'id': 92, 'distance': 0.7080820798873901, 'entity': {}}]]
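
To build intuition for what the search returns, here is a brute-force sketch in plain Python that ranks stored vectors by cosine similarity and returns the top `k` hits in a shape similar to `res`. This is an illustration only: it assumes the collection uses cosine similarity (so a larger `distance` means a closer match), the function names are hypothetical, and Milvus itself relies on indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k most similar vectors as dicts with 'id' and 'distance' keys."""
    hits = [
        {"id": index, "distance": cosine_similarity(query, vector)}
        for index, vector in enumerate(vectors)
    ]
    hits.sort(key=lambda hit: hit["distance"], reverse=True)
    return hits[:k]


# Toy 2-dimensional vectors standing in for the 512-dimensional embeddings.
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], stored, k=2))
```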

Now, let's extract the ids of these matching vectors (i.e. the ids of the most similar chunks).

In [15]:
vector_id_list = [[item['id'] for item in sublist] for sublist in res]

In [16]:
vector_id_list

[[171, 0], [155, 153], [0, 92]]

Using these ids we will retrieve the chunks most relevant to the question asked 

In [17]:
retrieved_chunk = []

# For each question, look up the chunk text for every matching vector id
for ids in vector_id_list:
    retrieved_chunk.append([chunks[chunk_id] for chunk_id in ids])

In [18]:
retrieved_chunk

[['These five new sports were approved on 3 August 2016 by the IOC during the 129th IOC Session in Rio de Janeiro, Brazil, and were included in the sports program for 2020 only, bringing the total number of sports at the 2020 Olympics to 33.\n\n\n=== Test events ===\nA total of 56 test events were scheduled to take place in the run-up to the 2020 Olympics and Paralympics',
  'The 2020 Summer Olympics, officially the Games of the XXXII Olympiad and officially branded as Tokyo 2020, were an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July 2021'],
 ['Emperor Naruhito formally opened the Games, and at the end of the torch relay the Olympic cauldron was lit by Japanese tennis player Naomi Osaka.For the first time in the 2020 Olympic Games, it was decided that one male and one female in each country would take turns holding flags and serve as two of them',
  'One of these events was a concert held on 18 Jul

### Generating responses using the question and context

In [19]:
open_ai_client = OpenAI(api_key = os.environ["OPENAI_API_KEY"])

def get_response(retrieved_context, question):
    prompt = f"""
        Context information is below.
        ---------------------
        {retrieved_context}
        ---------------------
        Given the context information and not prior knowledge, answer the query and relevant citations.
        Query: {question}
        Answer Format:
        A python dictionary with 'response' and 'cited_context'
        """
    
    response = open_ai_client.chat.completions.create(
            model="gpt-3.5-turbo", messages=[{"role": "system", "content": prompt}]
        ).choices[0].message.content
    # Parse the model output once; this assumes the model returned valid JSON
    parsed = json.loads(response)
    return {
        'question': question,
        'context': retrieved_context,
        'response': parsed['response'],
        'cited_context': parsed['cited_context'],
    }
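
The prompt asks the model for a Python dictionary, while the code parses the reply with `json.loads`. A reply that uses single quotes is a valid Python literal but not valid JSON, so parsing could fail. Below is a minimal sketch of a more forgiving parser, assuming the reply is either JSON or a Python dict literal. The helper name `parse_llm_answer` is hypothetical and not part of the OpenAI or UpTrain libraries.

```python
import ast
import json

REQUIRED_ANSWER_KEYS = {"response", "cited_context"}


def parse_llm_answer(raw: str) -> dict:
    """Parse a model reply into a dict containing 'response' and 'cited_context'."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax, e.g. single-quoted keys and values.
        parsed = ast.literal_eval(raw.strip())
    if not isinstance(parsed, dict):
        raise ValueError("model reply is not a dictionary")
    missing = REQUIRED_ANSWER_KEYS - parsed.keys()
    if missing:
        raise ValueError(f"model reply is missing keys: {sorted(missing)}")
    return parsed


answer = parse_llm_answer("{'response': 'Tokyo', 'cited_context': 'held in Tokyo, Japan'}")
print(answer["response"])
```

`ast.literal_eval` only evaluates literals, so it does not execute arbitrary code from the model reply. Replies that are neither JSON nor a dict literal still raise an exception, which you may want to catch and retry.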

### Define your dataset

In [20]:
data = []

for i in range(len(questions)):
    retrieved_context = retrieved_chunk[i]
    retrieved_context = ' '.join(retrieved_context)
    question = questions[i]    
    data.append(get_response(retrieved_context, question))

In [21]:
data

[{'question': 'How many events were there in the 2020 Summer Olympics',
  'context': 'These five new sports were approved on 3 August 2016 by the IOC during the 129th IOC Session in Rio de Janeiro, Brazil, and were included in the sports program for 2020 only, bringing the total number of sports at the 2020 Olympics to 33.\n\n\n=== Test events ===\nA total of 56 test events were scheduled to take place in the run-up to the 2020 Olympics and Paralympics The 2020 Summer Olympics, officially the Games of the XXXII Olympiad and officially branded as Tokyo 2020, were an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July 2021',
  'response': 'A total of 339 events were held in the 2020 Summer Olympics.',
  'cited_context': 'A total of 56 test events were scheduled to take place in the run-up to the 2020 Olympics and Paralympics. The 2020 Summer Olympics, officially the Games of the XXXII Olympiad and official

### Performing Evaluations on RAG pipeline using UpTrain's Open-Source Software (OSS)

UpTrain uses these 4 parameters to perform root cause analysis (RCA) on your RAG pipeline:

|Parameter|Explanation|
|--|--|
|question|This is the query asked by your user.|
|context|This is the context that you pass to an LLM (the retrieved context).|
|response|The response generated by the LLM.|
|cited_context|The relevant portion of the retrieved context that the LLM cites along with the response.|
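
Before running the evaluation, it can help to confirm that every row carries all four fields from the table above. The check below is a small sketch. The helper name `find_incomplete_rows` is hypothetical and not part of UpTrain, and treating empty strings as missing is an assumption.

```python
REQUIRED_KEYS = ("question", "context", "response", "cited_context")


def find_incomplete_rows(rows):
    """Return (row_index, missing_keys) pairs for rows lacking a required field."""
    problems = []
    for index, row in enumerate(rows):
        # A key counts as missing if it is absent or holds an empty value.
        missing = [key for key in REQUIRED_KEYS if not row.get(key)]
        if missing:
            problems.append((index, missing))
    return problems


sample_rows = [
    {"question": "q", "context": "c", "response": "r", "cited_context": "cc"},
    {"question": "q", "context": "", "response": "r"},
]
print(find_incomplete_rows(sample_rows))
```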

You can refer to this [tutorial](https://github.com/uptrain-ai/uptrain/blob/main/examples/root_cause_analysis/rag_with_citation.ipynb) or the [documentation](https://docs.uptrain.ai/tutorials/analyzing-failure-cases) to learn more about using UpTrain to evaluate a RAG pipeline.

In [22]:
from uptrain import RcaTemplate, EvalLLM

import nest_asyncio
nest_asyncio.apply()
from loguru import logger
logger.remove()


eval_llm = EvalLLM(openai_api_key = os.environ["OPENAI_API_KEY"])

results = eval_llm.perform_root_cause_analysis(
    data = data,
    rca_template = RcaTemplate.RAG_WITH_CITATION
)

100%|██████████| 3/3 [00:01<00:00,  1.62it/s]
100%|██████████| 3/3 [00:04<00:00,  1.44s/it]
100%|██████████| 3/3 [00:08<00:00,  2.83s/it]
100%|██████████| 3/3 [00:06<00:00,  2.33s/it]
100%|██████████| 3/3 [00:04<00:00,  1.48s/it]
100%|██████████| 3/3 [00:02<00:00,  1.36it/s]
100%|██████████| 3/3 [00:05<00:00,  1.96s/it]


### Let's look at some of the evaluations generated by UpTrain

In this example, we can see that the LLM has hallucinated: it generated information that cannot be verified from the information present in the context.

In [23]:
results[0]

{'question': 'How many events were there in the 2020 Summer Olympics',
 'context': 'These five new sports were approved on 3 August 2016 by the IOC during the 129th IOC Session in Rio de Janeiro, Brazil, and were included in the sports program for 2020 only, bringing the total number of sports at the 2020 Olympics to 33.\n\n\n=== Test events ===\nA total of 56 test events were scheduled to take place in the run-up to the 2020 Olympics and Paralympics The 2020 Summer Olympics, officially the Games of the XXXII Olympiad and officially branded as Tokyo 2020, were an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July 2021',
 'response': 'A total of 339 events were held in the 2020 Summer Olympics.',
 'cited_context': 'A total of 56 test events were scheduled to take place in the run-up to the 2020 Olympics and Paralympics. The 2020 Summer Olympics, officially the Games of the XXXII Olympiad and officially b

In this example, we can see that the context doesn't contain relevant information on who performed at the opening ceremony. Hence, the retrieval pipeline needs improvement.

In [24]:
results[1]

{'question': 'Who performed at the opening ceremony of the 2020 Summer Olympics?',
 'context': 'Emperor Naruhito formally opened the Games, and at the end of the torch relay the Olympic cauldron was lit by Japanese tennis player Naomi Osaka.For the first time in the 2020 Olympic Games, it was decided that one male and one female in each country would take turns holding flags and serve as two of them One of these events was a concert held on 18 July, which featured J-rock band Wanima, choreography by dancers Aio Yamada and Tuki Takamura, and the presentation of animated "creatures" based on illustrations "embodying the thoughts and emotions of people from across the world".The original plans for Nippon Festival included events such as Kabuki x Opera (a concert that would have featured stage actor Ichikawa Ebizō XI, opera singers Anna Pirozzi and Erwin Schrott, and the Tokyo Philharmonic Orchestra), an arts and culture festival focusing on disabilities, and a special two-day exhibition s