## Introduction to LLM & RAG evaluation with MLflow - Example Notebook
### Overview
#### This repository contains sample code for evaluating RAG systems with MLflow metrics, specifically relevance and latency. The code is for illustrative purposes only and is not intended for production deployment. Refer to the documentation links in the repository's readme.md for additional details on MLflow services and metrics.

#### The following services and libraries are used to illustrate the solution:

##### - Azure OpenAI: gpt-4 is used for embedding and completions.
##### - OpenAI: MLflow uses OpenAI for evaluation.
##### - Chroma and Qdrant: Vector databases for storing embedded domain-specific documents.
##### - MLflow AI Gateway (Experimental): A central configuration management service deployed locally. The LLM provider, endpoint, and API keys can be stored using the MLflow AI Gateway.
##### - MLflow Tracking Server: A stand-alone HTTP server that serves multiple REST API endpoints for tracking runs and experiments.
##### - LangChain
##### - Sample MLflow documentation for Q&A.


#### Complete all the steps in the prerequisites section of the readme.md in the repository before running the cells in this notebook.

In [None]:
# Install the required libraries and packages included in the requirements.txt file
!pip install -r requirements.txt


### Import necessary libraries

In [None]:
import os
import sys
import langchain 
from openai import AzureOpenAI
from langchain.vectorstores import Chroma
import pandas as pd
import mlflow
import mlflow.deployments
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import Mlflow
from langchain.document_loaders import WebBaseLoader
from mlflow.metrics.genai import relevance
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.embeddings import MlflowEmbeddings
import qdrant_client
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import CollectionStatus
from langchain.vectorstores import Qdrant
import langchain_qdrant
from langchain_qdrant import QdrantVectorStore
from qdrant_client.http.models import Distance, VectorParams
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.retrievers.self_query.qdrant import QdrantTranslator


### Retrieve environment variables, set the tracking URI, and configure the LLMs for embedding and querying

In [None]:
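# Read the API credentials and endpoint from the environment
AZURE_OPENAI_API_KEY = os.environ.get('AZURE_OPENAI_API_KEY')
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
AZURE_OPENAI_ENDPOINT = os.environ.get('AZURE_OPENAI_ENDPOINT')

# Point MLflow at the local tracking server
tracking_uri = "http://localhost:8080/"
mlflow.set_tracking_uri(tracking_uri)

# Route completion and embedding requests through the locally deployed MLflow AI Gateway
llm = Mlflow(target_uri="http://127.0.0.1:5000", endpoint="completions")
embed_llm = Mlflow(target_uri="http://127.0.0.1:5000", endpoint="embeddings")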
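
Because `os.environ.get` returns `None` for an unset variable, a missing key would only surface later as an authentication error. The following is a minimal sketch of an early check, assuming the three variable names used above. The helper name `missing_env_vars` is hypothetical and is not part of MLflow or LangChain.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ["AZURE_OPENAI_API_KEY", "OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```

Passing an explicit mapping as `environ` makes the helper easy to exercise without touching the real environment.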

### Load the MLflow sample documents, embed and ingest them into the Chroma vector database, and instantiate a retriever



In [None]:
# Load the MLflow sample documents using the LangChain WebBaseLoader and create a retriever for querying the Chroma vector store
loader = WebBaseLoader(
    [
        "https://mlflow.org/docs/latest/index.html",
        "https://mlflow.org/docs/latest/tracking/autolog.html",
        "https://mlflow.org/docs/latest/deep-learning/index.html",
        "https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html",
        "https://mlflow.org/docs/latest/python_api/mlflow.deployments.html",
    ]
)

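documents = loader.load()

# Split the loaded pages into chunks of roughly CHUNK_SIZE characters with no overlap
CHUNK_SIZE = 1000
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Embed the chunks through the embeddings endpoint of the MLflow AI Gateway
embeddings = MlflowEmbeddings(
    target_uri="http://127.0.0.1:5000",
    endpoint="embeddings",
)

# Store the embedded chunks in a Chroma vector store
docsearch = Chroma.from_documents(texts, embeddings)

# Build a question-answering chain that also returns the retrieved source documents
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(fetch_k=3),
    return_source_documents=True,
)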
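
To make the effect of `chunk_size` with zero overlap more concrete, here is a simplified sketch of separator-based chunking. It is a pure-Python stand-in written for illustration, not LangChain's `CharacterTextSplitter` implementation. The function name `split_text` and the blank-line separator are assumptions.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The next piece does not fit, so close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa\n\nbbbb\n\ncccc"
print(split_text(sample, chunk_size=10))
```

With no overlap, each piece of text lands in exactly one chunk, so neighbouring chunks share no content.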


In [None]:
# Create a dataframe to store a list of questions to evaluate on.
eval_df = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "What is deep learning?",
            "How to monitor models using mlflow?",
            "Explain Mlflow auto logging feature",
            "What is Deep learning",
        ],
    }
)

In [None]:
# Initialize an in-memory Qdrant store and the metadata used by the LangChain SelfQueryRetriever to query the Qdrant vector database

Qdrant_store = Qdrant.from_documents(
    texts,
    embeddings,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="MLflow Collection",
)
# Metadata fields: source, title, language, _id, _collection_name
from langchain_openai import OpenAI

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="Mlflow documentation link",
        type="string",
    ),
    AttributeInfo(
        name="title",
        description="Title of the web page",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="language",
        description="Language of mlflow documentation",
        type="string",
    ),
]

document_content_description = "Mlflow documentation repository"
# llm = OpenAI(temperature=0)
qdrant_retriever = SelfQueryRetriever.from_llm(
    llm,
    Qdrant_store,
    document_content_description,
    metadata_field_info,
    structured_query_translator=QdrantTranslator(metadata_key="metadata"),
    verbose=True,
)
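
A self-query retriever asks the LLM to turn a question into a search string plus a structured filter over the metadata fields described above. As a rough illustration of the filtering half only, the sketch below applies an equality filter in plain Python. The `Doc` class, the `filter_by_metadata` helper, and the sample metadata values are hypothetical and are not part of the Qdrant or LangChain APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a stored document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(docs, conditions):
    """Keep documents whose metadata equals every key/value pair in conditions."""
    return [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in conditions.items())
    ]

# Made-up documents that reuse the "source" and "language" field names
docs = [
    Doc("Autologging overview", {"source": "autolog.html", "language": "en"}),
    Doc("Tracking server overview", {"source": "tracking-server.html", "language": "en"}),
]
matches = filter_by_metadata(docs, {"source": "autolog.html"})
print([doc.page_content for doc in matches])
```

In the real retriever the filter is translated into the vector store's own query language and combined with the similarity search.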

In [None]:
# Create the functions that apply the questions to the Chroma and Qdrant retrievers for evaluation

def chroma_model(input_df):
    # Run each question through the RetrievalQA chain backed by Chroma
    return input_df["questions"].map(qa).tolist()

def qdrant_model(input_df):
    # return input_df['questions'].map(qdrant.similarity_search).tolist()
    # Run each question through the self-query retriever backed by Qdrant
    return input_df['questions'].map(qdrant_retriever.invoke)

def mlflow_evaluate(vectordb, input_df):
    # Select the model function that matches the requested vector database
    if "chroma" in vectordb.lower():
        model = chroma_model
    elif "qdrant" in vectordb.lower():
        model = qdrant_model
    else:
        raise ValueError(f"Unsupported vector database: {vectordb}")

    relevance_metric = relevance()
    with mlflow.start_run():
        results = mlflow.evaluate(
            model,
            data=input_df,
            # model_type="question-answering",
            evaluators="default",
            predictions="result",
            extra_metrics=[relevance_metric, mlflow.metrics.latency()],
            evaluator_config={
                "col_mapping": {
                    "inputs": "questions",
                    "context": "result",
                }
            },
        )

    print("Aggregated evaluation results:\n")
    print(results.metrics)
    display(results.tables["eval_results_table"])
    # Return the aggregated metrics as a single-row dataframe
    return pd.DataFrame(results.metrics, index=[0])

### Experiment #1: Evaluate performance of the RAG system with Chroma Vector database

In [None]:
# Call the function created above to process the questions using the Chroma vector database
df_c = mlflow_evaluate('Chroma', eval_df)

### Experiment #2: Evaluate RAG system's relevance and latency with Qdrant Vector database

In [None]:
# Call the function to query Qdrant in-memory store
df_q = mlflow_evaluate('Qdrant', eval_df)
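
Conceptually, a latency metric is the wall-clock time spent producing each prediction. The sketch below shows one way to time a single call with the standard library. It is not how `mlflow.metrics.latency()` is implemented. The names `timed_call` and `fake_model` are hypothetical, and `fake_model` merely simulates a retriever call.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_model(question):
    """Hypothetical stand-in for a retriever or chain call."""
    time.sleep(0.01)  # simulate the time spent on retrieval and generation
    return f"answer to: {question}"

answer, seconds = timed_call(fake_model, "What is MLflow?")
print(f"{answer!r} took {seconds:.3f}s")
```

Timing each question separately is what makes per-row latencies available for later aggregation.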


### Plot the relevance and latency values from experiments 1 and 2 for comparison

In [None]:
import matplotlib.pyplot as plt
import numpy as np

labels = ['latency/mean', 'relevance/v1/mean', 'relevance/v1/p90']

# Extract the aggregated metric values for each vector database
c_values = df_c[labels].squeeze()
q_values = df_q[labels].squeeze()

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, c_values, width, label='Chroma')
rects2 = ax.bar(x + width/2, q_values, width, label='Qdrant')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Value')
ax.set_title('Comparison of relevance and latency metrics between Chroma and Qdrant')
ax.set_xticks(x, labels)
ax.legend()

ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)

fig.tight_layout()

plt.show()
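
The plotted keys are aggregates over the per-question scores. The sketch below shows how a mean and a 90th percentile could be derived from such scores. It assumes linear interpolation between ranks, which is NumPy's default percentile behaviour; MLflow's exact method may differ. The scores are made-up values on a 1 to 5 scale, and the `percentile` helper is hypothetical.

```python
import math
import statistics

def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical per-question relevance scores, not results from this notebook
scores = [5, 4, 3, 5, 4]
summary = {"mean": statistics.mean(scores), "p90": percentile(scores, 90)}
print(summary)
```

With only five questions, as in `eval_df`, the p90 value is close to the highest individual score, so it is best read together with the mean.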

#### Note: The comparison above is for illustration purposes only and is not meant to convey that one vector database is better than the other.