# Homework 1

Question 0.
To get up and running, you'll first need to install Docker and Milvus. Find instructions below:
* Docker Compose ([Instructions](https://docs.docker.com/compose/install/))
* Milvus Standalone ([Instructions](https://milvus.io/docs/install_standalone-docker.md))

Question 1.
Build a prototype version of your RAG application. Choose any embedding/LLM/parameter set you like.

Question 2.
Set up evaluations. Answer relevance is set up for you. Add evaluations for context relevance and groundedness.

Question 3.
Try different index types, embeddings, parameters, and LLMs. Find the best-performing application for the evaluation set, and explain why it performed best.

Note. You may not prune the dataset stored in the vector database to improve performance.

## Setup

### Install dependencies
Let's install the dependencies for this notebook if we don't have them already.

In [None]:
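#! pip install trulens-eval==0.10.0 llama_index==0.8.4 pymilvus==2.3.0 nltk==3.8.1 html2text==2020.1.16 tenacity==8.2.3
# WikipediaReader and HuggingFaceEmbeddings also need these (versions left unpinned):
#! pip install wikipedia sentence-transformers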

### Add API keys
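For this notebook, you will need an OpenAI API key. The Hugging Face embedding models used below run locally through `sentence-transformers` and do not require a key.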

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "..."
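
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming you either export the key in your shell or type it at a prompt; `require_key` is a hypothetical helper, not part of any library used here.

```python
import os
from getpass import getpass


def require_key(name, env=os.environ, prompt=getpass):
    """Return the value of `name` from `env`, prompting once if it is missing."""
    value = env.get(name, "").strip()
    if not value or value == "...":
        # Unset or still the "..." placeholder: ask interactively instead of
        # storing the secret in the notebook source.
        value = prompt(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} is required")
        env[name] = value
    return value
```

Calling `require_key("OPENAI_API_KEY")` before the imports below would leave `os.environ` populated in the same way as the assignment above.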

### Import from LlamaIndex and TruLens

In [None]:
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import MilvusVectorStore
from llama_index.llms import OpenAI
from llama_index import (
    VectorStoreIndex,
    SimpleWebPageReader,
    LLMPredictor,
    ServiceContext
)

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings

from tenacity import retry, stop_after_attempt, wait_exponential

from trulens_eval import TruLlama, Feedback, Tru, feedback
from trulens_eval.feedback import Groundedness
tru = Tru()


### First we need to load documents. We can use WikipediaReader

In [None]:
from llama_index import WikipediaReader

cities = [
    "Atlanta", "Los Angeles", "Chicago", "Houston", "Phoenix, Arizona",
    "Philadelphia", "San Antonio", "Honolulu", "Tucson", "Mexico City",
    "Austin", "Jacksonville", "Fort Worth", "Cincinnati", "Charlotte",
    "San Francisco", "Indianapolis", "Seattle", "Salt Lake City", "Washington D.C."
]

wiki_docs = []
for city in cities:
    try:
        doc = WikipediaReader().load_data(pages=[city])
        wiki_docs.extend(doc)
    except Exception as e:
        print(f"Error loading page for city {city}: {e}")

### Now write down our test prompts

In [None]:
test_prompts = [
    "What's the best national park near Honolulu?",
    "What are some famous universities in Tucson?",
    "What bodies of water are near Chicago?",
    "What is the name of Chicago's central business district?",
    "What are the two most famous universities in Los Angeles?",
    "What are some famous festivals in Mexico City?",
    "What are some famous festivals in Los Angeles?",
    "What professional sports teams are located in Los Angeles?",
    "How do you classify Houston's climate?",
    "What landmarks should I know about in Cincinnati?"
]


### Question 1. Build a prototype RAG

In [None]:
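# Prototype configuration. These values are one example choice; any
# embedding/LLM/parameter set is allowed for this question.
index_param = "IVF_FLAT"
embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
top_k = 3

vector_store = MilvusVectorStore(
    index_params={"index_type": index_param, "metric_type": "L2"},
    search_params={"nprobe": 20},
    overwrite=True,
)
llm = OpenAI(model="gpt-3.5-turbo")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)
index = VectorStoreIndex.from_documents(
    wiki_docs,
    service_context=service_context,
    storage_context=storage_context,
)
# similarity_top_k sets how many chunks the retriever returns per query.
query_engine = index.as_query_engine(similarity_top_k=top_k)


# Retry transient failures (for example, rate limits) with exponential backoff.
@retry(stop=stop_after_attempt(10), wait=wait_exponential(multiplier=1, min=4, max=10))
def call_query_engine(prompt):
    return query_engine.query(prompt)


for prompt in test_prompts:
    call_query_engine(prompt)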
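
The `@retry` decorator above waits between attempts on a clamped exponential schedule. The sketch below reproduces that schedule in plain Python to show how long a single prompt could block in the worst case. It assumes the wait before retry `n` is `multiplier * 2**(n-1)` clamped to `[min, max]`; tenacity's exact formula may differ between versions, and `clamped_exponential_waits` is a hypothetical helper, not a tenacity function.

```python
def clamped_exponential_waits(attempts, multiplier=1.0, min_wait=4.0,
                              max_wait=10.0, base=2.0):
    """Seconds waited before each retry, assuming multiplier * base**(n-1)
    clamped to [min_wait, max_wait]."""
    waits = []
    for n in range(1, attempts):  # no wait after the final attempt
        raw = multiplier * base ** (n - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


waits = clamped_exponential_waits(attempts=10)
print(waits)       # per-retry waits in seconds
print(sum(waits))  # worst-case total wait for one prompt
```

Under these assumptions, the `min=4` floor dominates the first few retries and the `max=10` cap dominates the later ones, so the schedule is nearly flat rather than truly exponential.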

### Question 2. Set up Evaluation.

In [None]:
import numpy as np

# Initialize OpenAI-based feedback function collection class:
openai = feedback.OpenAI(model_engine="gpt-4")

# Define groundedness
grounded = Groundedness(groundedness_provider=openai)
f_groundedness = Feedback(grounded.groundedness_measure, name = "Groundedness").on(
    TruLlama.select_source_nodes().node.text # context
).on_output().aggregate(grounded.grounded_statements_aggregator)

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(openai.relevance, name = "Answer Relevance").on_input_output()

# Question/statement relevance between question and each context chunk.
f_qs_relevance = Feedback(openai.qs_relevance, name = "Context Relevance").on_input().on(
    TruLlama.select_source_nodes().node.text
).aggregate(np.mean)
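
Context relevance is scored once per retrieved chunk and then averaged with `np.mean`, so a larger `top_k` can lower the score when the extra chunks are off-topic. A small sketch of that effect using made-up per-chunk scores (illustrative values, not evaluation output); `context_relevance` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical per-chunk relevance scores in [0, 1], ordered by retrieval rank.
chunk_scores = [0.9, 0.6, 0.3, 0.2, 0.1, 0.1, 0.0]


def context_relevance(scores, top_k):
    """Mean relevance over the first `top_k` retrieved chunks."""
    return mean(scores[:top_k])


for k in (1, 3, 7):
    print(k, round(context_relevance(chunk_scores, k), 3))
```

This is worth keeping in mind when comparing configurations: a higher `top_k` may help groundedness by supplying more evidence while pulling the averaged context relevance down.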

### Question 3. Find the best configuration.

In [None]:
index_params = ["IVF_FLAT","HNSW"]
embed_v12 = HuggingFaceEmbeddings(model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embed_ft3_v12 = HuggingFaceEmbeddings(model_name = "Sprylab/paraphrase-multilingual-MiniLM-L12-v2-fine-tuned-3")
embed_ada = OpenAIEmbeddings(model = "text-embedding-ada-002")
embed_models = [embed_v12, embed_ft3_v12, embed_ada]
top_ks = [1,3,7]

In [None]:
import itertools
for index_param, embed_model, top_k in itertools.product(
    index_params, embed_models, top_ks
    ):
    # Identity checks map each model object to a short label for the app_id.
    if embed_model is embed_v12:
        embed_model_name = "v12"
    elif embed_model is embed_ft3_v12:
        embed_model_name = "ft3_v12"
    elif embed_model is embed_ada:
        embed_model_name = "ada"
    vector_store = MilvusVectorStore(index_params={
        "index_type": index_param,
        "metric_type": "L2"
        },
        search_params={"nprobe": 20},
        overwrite=True)
    llm = OpenAI(model="gpt-3.5-turbo")
    storage_context = StorageContext.from_defaults(vector_store = vector_store)
    service_context = ServiceContext.from_defaults(embed_model = embed_model, llm = llm)
    index = VectorStoreIndex.from_documents(wiki_docs,
            service_context=service_context,
            storage_context=storage_context)
    query_engine = index.as_query_engine(similarity_top_k = top_k)
    tru_query_engine = TruLlama(query_engine,
                    app_id=f"gpt4eval-App-{index_param}-{embed_model_name}-{top_k}",
                    feedbacks=[f_groundedness, f_qa_relevance, f_qs_relevance],
                    metadata={
                        'index_param':index_param,
                        'embed_model':embed_model_name,
                        'top_k':top_k
                        })
    @retry(stop=stop_after_attempt(10), wait=wait_exponential(multiplier=1, min=4, max=10))
    def call_tru_query_engine(prompt):
        return tru_query_engine.query(prompt)
    for prompt in test_prompts:
        call_tru_query_engine(prompt)
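
Before running the sweep, it helps to know how large it is, since every record is also scored by three GPT-4 feedback functions. The sketch below enumerates the same grid with `itertools.product`, using the short embedding labels from the loop above in place of the model objects; the variable names are illustrative.

```python
import itertools

index_types = ["IVF_FLAT", "HNSW"]
embed_names = ["v12", "ft3_v12", "ada"]  # short labels for the embedding models
top_k_values = [1, 3, 7]
n_prompts = 10                           # length of test_prompts

app_ids = [
    f"gpt4eval-App-{index_type}-{embed_name}-{top_k}"
    for index_type, embed_name, top_k in itertools.product(
        index_types, embed_names, top_k_values
    )
]

print(len(app_ids))              # number of configurations
print(len(app_ids) * n_prompts)  # number of query-engine calls
print(app_ids[0], app_ids[-1])   # first and last app_id in sweep order
```

Because `itertools.product` varies the last argument fastest, all `top_k` values for one embedding model run back to back, and the index is rebuilt for every combination.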

### Explore in a Dashboard

In [None]:
tru.run_dashboard() # open a local streamlit app to explore

# tru.stop_dashboard() # stop if needed

Alternatively, you can run `trulens-eval` from a command line in the same folder to start the dashboard.

### Or view results directly in your notebook

In [None]:
tru.get_records_and_feedback(app_ids=[])[0] # pass an empty list of app_ids to get all
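
To pick the best configuration, one option is to average each feedback score per `app_id` and sort. The sketch below does this in plain Python on made-up rows. The column names mirror the feedback names defined earlier, but the values are hypothetical and `leaderboard` is an illustrative helper; with the real DataFrame, a pandas `groupby("app_id")` would do the same job.

```python
from collections import defaultdict
from statistics import mean

FEEDBACKS = ["Groundedness", "Answer Relevance", "Context Relevance"]

# Hypothetical records, one per (app_id, prompt). Values are not real results.
records = [
    {"app_id": "A", "Groundedness": 0.8, "Answer Relevance": 0.9, "Context Relevance": 0.5},
    {"app_id": "A", "Groundedness": 0.6, "Answer Relevance": 0.7, "Context Relevance": 0.7},
    {"app_id": "B", "Groundedness": 0.9, "Answer Relevance": 0.9, "Context Relevance": 0.8},
    {"app_id": "B", "Groundedness": 0.7, "Answer Relevance": 1.0, "Context Relevance": 0.6},
]


def leaderboard(rows, feedbacks=FEEDBACKS):
    """Return (app_id, per-feedback means, overall mean) tuples, best first."""
    by_app = defaultdict(list)
    for row in rows:
        by_app[row["app_id"]].append(row)
    table = []
    for app_id, app_rows in by_app.items():
        means = {f: mean(r[f] for r in app_rows) for f in feedbacks}
        table.append((app_id, means, mean(means.values())))
    return sorted(table, key=lambda item: item[2], reverse=True)


for app_id, means, overall in leaderboard(records):
    print(app_id, round(overall, 3), means)
```

An unweighted mean across the three feedbacks is only one possible ranking rule; weighting groundedness more heavily, for example, is an equally defensible choice as long as the explanation in Question 3 states it.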