# Building a RAG chat assistant

This notebook develops the concept for a chat assistant that can give feedback on espresso-related questions, using the guidelines from the [Espresso Aficionados website](https://espressoaf.com) as context.

The assistant should refer to the properties of an espresso saved by the user, so these properties are also passed as an input to the prompt that is sent to the LLM.

Models from Google AI Studio will be used. The Google API key is stored in the .env file, so load the environment variables to make it available:

In [None]:
from dotenv import load_dotenv

load_dotenv()

True
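
`load_dotenv()` returns `True` as soon as it finds a .env file, which does not by itself confirm that the key is present. A minimal sketch of an explicit check follows; the helper name `missing_env_vars` is hypothetical, and `GOOGLE_API_KEY` is assumed to be the variable name used in the .env file.

```python
import os

def missing_env_vars(required, env=os.environ):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

# Demonstration with an explicit mapping, so no real key is needed or printed.
print(missing_env_vars(["GOOGLE_API_KEY"], env={"GOOGLE_API_KEY": "dummy"}))  # []
print(missing_env_vars(["GOOGLE_API_KEY"], env={}))  # ['GOOGLE_API_KEY']
```

Calling `missing_env_vars(["GOOGLE_API_KEY"])` without the `env` argument checks the real environment after `load_dotenv()` has run.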

## Indexing and storing context documents

This step only needs to run once, since the database is persisted to disk; rerun it only when the documents need to be updated.

Use the Gemini embeddings:

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

gemini_embeddings = GoogleGenerativeAIEmbeddings(model = "models/gemini-embedding-001")

Create a Chroma vector database whose data is persisted to a `chroma_db` folder in the working directory:

In [10]:
import chromadb
from langchain_chroma import Chroma

chroma_client = chromadb.PersistentClient(path = "chroma_db")
chroma_vector_store = Chroma(
    client = chroma_client,
    collection_name = "espresso_aficionados",
    embedding_function = gemini_embeddings
)

Get the relevant webpages, keeping only the main content, which is marked by the `<main>` tag.

In [14]:
import bs4

bs_main_content = bs4.SoupStrainer("main")

from langchain_community.document_loaders import WebBaseLoader

loader_espresso_aficionados = WebBaseLoader(
    web_paths = (
        "https://espressoaf.com/guides/beginner.html", 
        "https://espressoaf.com/guides/puckprep.html", 
        "https://espressoaf.com/guides/profiling.html", 
        "https://espressoaf.com/guides/water.html", 
        "https://espressoaf.com/guides/preferential-extraction.html", 
        "https://espressoaf.com/info/Glossary.html", 
        "https://espressoaf.com/info/flow_and_pressure.html", 
        "https://espressoaf.com/info/extraction_evenness_theory.html"
    ), 
    bs_kwargs = {"parse_only": bs_main_content}
)

docs_espresso_aficionados = loader_espresso_aficionados.load()

Split the page content into smaller chunks that will comfortably fit into the LLM context window:

In [25]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 100,
    add_start_index = True
)

split_text_espresso_aficionados = text_splitter.split_documents(docs_espresso_aficionados)
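
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch that cuts fixed-width overlapping windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraphs, lines and spaces; the function `sliding_chunks` is hypothetical and only illustrates the overlap and the start index that `add_start_index = True` stores in the chunk metadata.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs from fixed-size overlapping windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
# 0 abcdefghij
# 6 ghijklmnop
# 12 mnopqrstuv
# 18 stuvwxyz
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.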

Develop a procedure to add the split documents to the vector database in batches, so that Google API's rate limits for the embedding model are not exceeded (the limit is 100 requests per minute; to be on the safe side, we send 80 requests every 70 seconds):

In [None]:
from uuid import uuid4
from time import sleep

def add_documents_to_chroma_through_gemini(
        documents = split_text_espresso_aficionados, 
        vector_store = chroma_vector_store,
        batch_size = 80,
        pause_s = 70
    ):

    # nothing to do if there are no documents
    if len(documents) == 0:
        print("No documents found to be added to vector store.")
        return

    # determine number of batches (ceiling division)
    n_batches = (len(documents) + batch_size - 1) // batch_size

    # iterate through batches
    for i in range(n_batches):
        batch = documents[i * batch_size : (i + 1) * batch_size]
        
        # create unique ids for the batch
        uuids_batch = [str(uuid4()) for _ in range(len(batch))]
        
        # add the batch to the vector store, if Google allows it
        try:
            vector_store.add_documents(
                documents = batch,
                ids = uuids_batch
            )
        except Exception as e:
            print(f"Error adding batch {i + 1} to vector store: {e}")
            raise
        # if no exception was raised, wait before the next batch to avoid rate limiting
        else:
            print(f"Added batch {i + 1} of {n_batches} to vector store.")
            # no need to wait after the final batch
            if i < n_batches - 1:
                sleep(pause_s)
            

Add the embeddings to the database with the function defined above:

In [32]:
add_documents_to_chroma_through_gemini(
    documents = split_text_espresso_aficionados,
    vector_store = chroma_vector_store
)

Added batch 1 of 4 to vector store.
Added batch 2 of 4 to vector store.
Added batch 3 of 4 to vector store.
Added batch 4 of 4 to vector store.
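
The number of batches follows from ceiling division of the chunk count by the batch size. With batches of 80, any chunk count from 241 to 320 gives four batches, which matches the output above. The sketch below shows the slice bounds for a made-up count of 250 chunks; the helper `batch_bounds` is hypothetical and not part of the notebook.

```python
import math

def batch_bounds(n_documents: int, batch_size: int = 80) -> list[tuple[int, int]]:
    """Return the (start, stop) slice bounds of each batch."""
    n_batches = math.ceil(n_documents / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, n_documents))
        for i in range(n_batches)
    ]

print(batch_bounds(250))
# [(0, 80), (80, 160), (160, 240), (240, 250)]
```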


## Retrieving context and generating response

Initialise the model:

In [None]:
from langchain.chat_models import init_chat_model

gemini_flash_llm = init_chat_model("gemini-2.5-flash", model_provider = "google_genai")

Define the prompt template:

In [None]:
from langchain_core.prompts import PromptTemplate

prompt_template = """
    You are an expert in espresso preparation and extraction. 
    You have access to the following guidelines that contain relevant information about espresso preparation. \n
    Guidelines : {context}
    The user will ask you a question about an espresso with the following properties: \n
    Espresso properties : {data}
    Please answer the question at the end based on the provided context and data. 
    If you don't know the answer, just say that you don't know.
    Use three sentences maximum and keep the answer concise. \n
    Question : {question}
    Answer :
"""

prompt = PromptTemplate.from_template(prompt_template)
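
The curly-brace placeholders in the template are the inputs that must be supplied when the prompt is invoked. As a rough illustration using only the standard library (not LangChain's own code), the sketch below lists the placeholders of a shortened, made-up template and fills them with `str.format`; the example values are invented.

```python
from string import Formatter

# Shortened stand-in for the notebook's template, for illustration only.
template = "Guidelines : {context}\nEspresso properties : {data}\nQuestion : {question}\nAnswer :"

# Formatter().parse yields (literal_text, field_name, format_spec, conversion) tuples;
# field_name is None for trailing literal text.
placeholders = [field for _, field, _, _ in Formatter().parse(template) if field]
print(placeholders)  # ['context', 'data', 'question']

filled = template.format(
    context="Grind finer to extract more.",
    data="18 g in, 36 g out, 30 s",
    question="Why is my shot sour?",
)
print(filled)
```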

Define espresso data to be used in the prompt (in the actual implementation, this would come from the database):

In [59]:
from typing_extensions import TypedDict

class EspressoData(TypedDict):
    roast_level: int
    grind_size: float
    dose_gr: float
    extraction_time_s: int
    yield_gr: float

last_espresso = EspressoData(
    roast_level = 4,
    grind_size = 0.4,
    dose_gr = 18.0,
    extraction_time_s = 30,
    yield_gr = 36.0
)

Define the logic of retrieval and generation. The app state consists of three inputs (context, data, question) and the answer. The context depends on the question, since the most relevant document chunks are retrieved based on their similarity to the question; hence retrieval is a node in the flow. The data doesn't depend on the question, so it is simply converted into a natural language description.

In [63]:
from langchain_core.documents import Document
from typing_extensions import List


class State(TypedDict):
    context: List[Document]
    data: EspressoData
    question: str
    answer: str

def retrieve_context(state: State):
    context_docs = chroma_vector_store.similarity_search(state["question"], k = 2)
    return {"context": context_docs}

def describe_espresso_data(espresso_data: EspressoData) -> str:
    return f"""
        The espresso was made from beans with a roast level of {espresso_data["roast_level"]} on a scale of 1 to 5, 
        ground to a size of {espresso_data["grind_size"]} on a scale of 0 to 1,
        with a dose of {espresso_data["dose_gr"]} grams,
        extracted for {espresso_data["extraction_time_s"]} seconds,
        yielding {espresso_data["yield_gr"]} grams of espresso, 
        hence with an extraction ratio of {round(espresso_data["yield_gr"] / espresso_data['dose_gr'], 2)}.
        """

def generate_answer(state: State, espresso_data = last_espresso, llm = gemini_flash_llm, prompt = prompt) -> dict:
    docs_content = "\n\n".join([doc.page_content for doc in state["context"]])
    data_description = describe_espresso_data(espresso_data)
    message = prompt.invoke({
        "context" : docs_content,
        "data" : data_description,
        "question" : state["question"]
    })
    response = llm.invoke(message)
    return {"answer": response.content}
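
Retrieval by similarity can be pictured as ranking the stored chunk embeddings by how close they are to the question embedding and keeping the top `k`. The sketch below uses cosine similarity on made-up three-dimensional vectors; the real embeddings have far more dimensions, and the distance metric Chroma actually uses depends on the collection configuration, so this is an illustration rather than a description of what `similarity_search` does internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, chunk_vectors, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vectors,
        key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
        reverse=True,
    )
    return ranked[:k]

# Invented toy embeddings, labelled by topic for readability.
toy_chunks = {
    "bitterness": [0.9, 0.1, 0.0],
    "water": [0.0, 0.2, 0.9],
    "grind size": [0.7, 0.6, 0.1],
}
toy_question = [1.0, 0.2, 0.0]

print(top_k(toy_question, toy_chunks, k=2))  # ['bitterness', 'grind size']
```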

Bring together retrieval and generation in a LangGraph graph:

In [64]:
from langgraph.graph import StateGraph, START

graph_builder = StateGraph(State).add_sequence([retrieve_context, generate_answer])
graph_builder.add_edge(START, "retrieve_context")
graph = graph_builder.compile()
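
Conceptually, the compiled graph runs the two nodes in order, and each node returns a partial update that is merged into the shared state. The plain-Python sketch below mimics that flow with stand-in nodes; it assumes simple key overwriting and leaves out everything else LangGraph provides, and the names `run_sequence`, `fake_retrieve` and `fake_generate` are hypothetical.

```python
def run_sequence(nodes, initial_state: dict) -> dict:
    """Call each node with the current state and merge its partial update."""
    state = dict(initial_state)
    for node in nodes:
        update = node(state)
        state.update(update)  # keys returned by the node overwrite existing ones
    return state

def fake_retrieve(state):
    # stand-in for retrieve_context: returns only the "context" key
    return {"context": [f"chunk about: {state['question']}"]}

def fake_generate(state):
    # stand-in for generate_answer: reads "context", returns only "answer"
    return {"answer": f"answer based on {len(state['context'])} chunk(s)"}

final_state = run_sequence(
    [fake_retrieve, fake_generate],
    {"question": "Why is my shot bitter?"},
)
print(final_state["answer"])  # answer based on 1 chunk(s)
```

This is why the nodes in the notebook return dictionaries such as `{"context": ...}` and `{"answer": ...}` rather than the full state.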

Test with a question:

In [65]:
result = graph.invoke({
    "question" : "My espresso is too bitter, what can I do to improve it?"
})

In [68]:
print("Context:\n", result["context"], "\n\n")
print("Answer:\n", result["answer"], "\n")

Context:
 [Document(id='eb80c520-66d0-48e9-bdbf-e4ae4802809b', metadata={'source': 'https://espressoaf.com/guides/beginner.html', 'start_index': 3}, page_content='How to Dial In Espresso: The Basics  A couple notes before we begin: 1) The dialling-in guide below is only one of the many ways an espresso shot can dialled. There are variations to this technique, and many different shot profiles that one can use. That said, if you’ve just started your espresso journey, I’ve found this particular step-by-step technique to be the easiest way to familiarise yourself with the basic principles of espresso extraction. 3) The 1:2 in 30 seconds “rule” is but a'), Document(id='5bba32df-9bac-4145-95d1-06a258314e73', metadata={'start_index': 5585, 'source': 'https://espressoaf.com/guides/beginner.html'}, page_content='grind size. As a general rule, lower temperatures will extract less while higher temperatures will extract more.   “Help! My shot tastes bad no matter what I do!”  If you find that the 