https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_crag.ipynb

Some deviations from the source code because I don't wanna pay for OpenAI embeddings or hit OpenAI models. All OpenAI integration is replaced with Ollama.

I also removed the LangSmith integration. Don't think it's needed; it's just a frontend for LLM debugging, which I can get with `langchain.debug = True`.

Trying to roll my own adaptive RAG for my own use case. I already know how to use LangGraph, so time to practice. Don't wanna implement the same graph as in the video because I don't wanna do a web search.

Question --> query_analysis -> Related to documents --> retrieve_and_grade, relevant?  --Yes--> llm_generation --> hallucinate_check
                                                                                   | No  
                                                                                   v  
                                                                                 hyde_retrieval 
                                                                                   |  
                                                                                   v  
                                                                              llm_generation --> hallucinate_check  

            query_analysis --> Anything else --> llm_generation --> hallucinate_check 

hallucinate_check --Yes--> llm_generation  
                 | No  
                 v  
          answer_qns_check --Yes--> Give LLM response (End)  
                          | No  
                          v  
                   Give default response (End)  



The state keys I'll need:
1. question
2. documents (if vector store is hit)
3. prompt (store the prompt so that it can be checked against the generation to see if anything is hallucinated)
4. generation
5. tries (count the number of generation attempts so the hallucination check can't loop too many times)
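
A rough sketch of how I expect these keys to fill in as the graph runs. My understanding is that each node returns only the keys it changed and those keys overwrite the previous values by default. `apply_update` is a hypothetical stand-in I wrote for illustration, not a LangGraph function, and the values are made up.

```python
from typing import List, TypedDict


class GraphState(TypedDict, total=False):
    question: str
    documents: List[str]
    prompt: str
    generation: str
    tries: int


def apply_update(state: GraphState, update: dict) -> GraphState:
    """Hypothetical helper: return a new state with the keys in `update` overwritten."""
    merged = dict(state)
    merged.update(update)
    return merged  # type: ignore[return-value]


# Only `question` exists at the start; later steps add or overwrite keys.
state: GraphState = {"question": "What are the types of agent memory?"}
state = apply_update(state, {"documents": ["doc A", "doc B"]})
state = apply_update(state, {"generation": "draft answer", "tries": 1})
state = apply_update(state, {"generation": "second draft", "tries": 2})
print(state)
```

Keys that a step does not return (here `question` and `documents`) simply carry over.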

Points to note:
1. query_analysis takes question --llm--> decides 1 of 2 paths (retrieve_and_grade or llm_generation). A web_search route just ends, since I'm not implementing web search.
2. retrieve_and_grade takes question, does retrieval --llm grader--> if not even 1 doc is relevant, send to hyde_retrieval; else store the docs in documents and send to llm_generation.
3. hyde_retrieval takes question --hyde rewrite--> retrieval --llm grader--> if not even 1 doc is relevant, send to END; else store the docs in documents and send to llm_generation.
4. llm_generation takes question and docs (if any) and answers the question (store the prompt in state) --> hallucinate_check.
5. hallucinate_check takes the prompt and generation and grades whether the generation hallucinates. If yes and tries < X, send it to llm_generation again (tries starts at 1 and is incremented on each attempt). If yes but tries >= X, overwrite generation with a default response and send to END. If no, send to answer_qns_check.
6. answer_qns_check takes prompt and generation and grades whether the generation answers the question. If yes, send to END. If no, overwrite generation with a default response and send to END.
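
Points 5 and 6 boil down to a small decision table, sketched below without any LLM calls. The function names, `MAX_TRIES = 2` (my assumed value for X) and the default response text are placeholders for illustration only.

```python
MAX_TRIES = 2  # assumed value for X
DEFAULT_RESPONSE = "Sorry, no reliable answer could be generated."  # placeholder text


def route_after_hallucination_check(hallucinated: bool, tries: int,
                                    max_tries: int = MAX_TRIES) -> str:
    """Return the name of the next step after the hallucination grader has run."""
    if not hallucinated:
        return "answer_qns_check"
    if tries < max_tries:
        return "llm_generation"  # try generating again
    return "END"  # give up; the default response is used instead


def final_answer(generation: str, answers_question: bool) -> str:
    """Keep the generation only if the grader says it answers the question."""
    return generation if answers_question else DEFAULT_RESPONSE


print(route_after_hallucination_check(hallucinated=True, tries=1))
print(route_after_hallucination_check(hallucinated=True, tries=2))
print(route_after_hallucination_check(hallucinated=False, tries=1))
```

Keeping the routing as a pure function of (hallucinated, tries) makes the loop bound easy to check by hand.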

In [None]:
from langchain_ollama.chat_models import ChatOllama
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from typing import Literal
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from operator import itemgetter

In [2]:
from typing_extensions import TypedDict
from typing import List
from langchain_core.documents import Document

class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        question: question
        documents: retrieved documents that passed grading
        prompt: prompt used for LLM generation
        generation: LLM generation
        tries: number of LLM generation attempts so far
    """
    question: str
    documents: List[Document]  # the retriever returns Document objects, not plain strings
    prompt: str
    generation: str
    tries: int

In [3]:
# Load docs
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=100
)
doc_splits = text_splitter.split_documents(docs_list)

# Add to vectorDB
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=OllamaEmbeddings(
        model="nomic-embed-text:v1.5"
    ),
)
retriever = vectorstore.as_retriever()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [4]:
# llm = ChatOllama(
#     model="command-r7b",
#     temperature=0.1,
# )
llm = ChatOllama(
    model="llama3.2:3b-instruct-q8_0",
    temperature=0,
)

In [None]:
def query_analysis(state):
    # Data model
    class web_search(BaseModel):
        """
        The internet. Use web_search for questions that are related to anything else than agents, prompt engineering, and adversarial attacks.
        """
        query: str = Field(description="The query to use when searching the internet.")

    class vectorstore(BaseModel):
        """
        A vectorstore containing documents related to agents, prompt engineering, and adversarial attacks. Use the vectorstore for questions on these topics.
        """
        query: str = Field(description="The query to use when searching the vectorstore.")

    question = state['question']
    llm_router = ChatOllama(
        model="llama3.1:8b", # "command-r7b", command r7b doesn't support tool calling, refer to debug section below.
        temperature=0.1,
    )

    llm_router = llm_router.bind_tools(
        tools=[web_search, vectorstore] # , preamble=preamble
    )
    route = llm_router.invoke(question)
    print(route.tool_calls)

    # Route accordingly
    if route.tool_calls == []:
        raise ValueError("Router could not decide source")
    if len(route.tool_calls) != 1:
        raise ValueError("Router could not decide routing")

    path = route.tool_calls[0]['name']
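    if path == 'web_search':
        print("Not implementing websearch. End here.")
        # NOTE: as far as I can tell, writes to state inside a routing function are not
        # persisted; only node return values update the state.
        state['generation'] = 'Should have been a websearch.'
        return 'END'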

    elif path == 'vectorstore':
        print("Routing to retrieve_and_grade")
        return 'retrieve_and_grade'

    else:
        print('Route to llm generation')
        return 'llm_generation'

def grader(question, docs):
    # Data model
    class GradeDocuments(BaseModel):
        """Binary score for relevance check on retrieved documents."""
        binary_score: Literal['Yes', "No"]
    grader_llm = llm.with_structured_output(GradeDocuments)
    # Prompt
    grade_prompt_template = """You are a grader assessing relevance of a retrieved document to a user question.
If the document contains keyword(s) or semantic meaning related to the question, grade it as relevant.
Give a binary score 'Yes' or 'No' to indicate whether the document is relevant to the question.

question:
{question}

document:
{document}
"""
    grade_prompt = ChatPromptTemplate.from_template(grade_prompt_template)

    filtered_docs = []
    for i, doc in enumerate(docs):
        relevant = (grade_prompt | grader_llm).invoke({
            'question': question,
            'document': doc.page_content.strip()
        })
        print(f"Doc {i+1}, relevant?: {relevant.binary_score}")
        if relevant.binary_score == 'Yes':
            filtered_docs.append(doc)
    return filtered_docs

def retrieve_and_grade(state):
    # retrieve_and_grade takes question, does retrieval --llm grader--> if there isn't 1 doc that's related, send to Hyde retrieval, else store docs to documents and send to llm_generation
    question = state['question']
    docs = retriever.invoke(question)  # get_relevant_documents is deprecated in favour of invoke
    filtered_docs = grader(question, docs)
    return {
        'question': question,
        'documents': filtered_docs
    }

def retrieve_and_grade_check(state):
    if len(state['documents']) == 0:
        print('No relevant documents retrieved, send to HyDE')
        return 'hyde_retrieval'
    else:
        return 'llm_generation'

def hyde_retrieval(state):
    # hyde_retrieval takes question --hyde rewrite--> retrieval --llm grader--> if there isn't 1 doc that's related, send to END, else store docs to documents and send to llm_generation
    question = state['question']
    # Prompt
    hyde_prompt_template = """Please write a paper/document to answer the question:
Question: {question}
"""
    hyde_prompt = ChatPromptTemplate.from_template(hyde_prompt_template)
    hypothetical_doc = (hyde_prompt | llm | StrOutputParser()).invoke({
            'question': question,
    })
    docs = retriever.invoke(hypothetical_doc)  # search with the hypothetical document, not the question
    filtered_docs = grader(question, docs)
    return {
        'question': question,
        'documents': filtered_docs
    }

def hyde_retrieval_check(state):
    if len(state['documents']) == 0:
        print('No relevant documents retrieved from HyDE, send to end')
        state['generation'] = "Sorry LLM could not retrieve any documents that answers your question."
        return 'END'
    else:
        return 'llm_generation'
    

def llm_generation(state):
    # llm_generation takes question and docs (if any) and answers the question (store the prompt in state)--> hallucinate_check
    question = state['question']
    documents = state.get('documents', [])  # absent when routed here without retrieval
    tries = state.get('tries', 0) + 1  # first attempt is 1

    if len(documents) == 0:
        template = "You are a helpful assistant, answer the following question and do not hallucinate facts.\n{question}"
        inputs = {'question': question}
    else:
        template = "You are a helpful RAG system, answer the following question with the given documents as context.\ndocuments:\n{documents}\n--END OF CONTEXT--\nquestion:{question}"
        inputs = {
            'question': question,
            'documents': '\n\ndocument:\n'.join(doc.page_content.strip() for doc in documents),
        }

    # Render the prompt once so the exact text sent to the LLM can be stored as a string
    prompt_value = ChatPromptTemplate.from_template(template).invoke(inputs)
    generation = (llm | StrOutputParser()).invoke(prompt_value)

    return {
        'question': question,
        'documents': documents,
        'prompt': prompt_value.to_string(),
        'generation': generation,
        'tries': tries
    }

def hallucinate_check(state):
    # hallucinate_check checks if there's hallucination, by taking in the prompt and generation and grading whether it hallucinates, if yes, and tries < X, send it to llm_generation again, if yes but tries >= X, overwrite generation with default response and send to END. If no, send to answer_question_check
    class HallucinationCheck(BaseModel):
        """Binary score for whether the generation contains hallucinations."""
        binary_score: Literal['Yes', 'No']
    hallucinate_llm = llm.with_structured_output(HallucinationCheck)

    prompt = state['prompt']
    generation = state['generation']
    tries = state['tries']

    hallucinate_check_prompt_template = """You are an LLM hallucination checker. Your job is to evaluate whether the given response to a question has anything non-factual, or made up, containing hallucinations. If there are hallucinations, you should respond 'Yes', and vice versa.
\nquestion:{prompt}\n\n{generation}"""
    hallucinate_check_prompt = ChatPromptTemplate.from_template(hallucinate_check_prompt_template)
    # with_structured_output already returns a HallucinationCheck object, so no StrOutputParser here
    hallucinate = (hallucinate_check_prompt | hallucinate_llm).invoke({
            'prompt': prompt,
            'generation': generation
    }).binary_score
    if hallucinate == 'Yes' and tries < 2:
        print(f'LLM generation has hallucinations, attempt re-generation, tries: {tries}.')
        return 'llm_generation'
    elif hallucinate == 'Yes' and tries >= 2:
        state['generation'] = 'Sorry LLM could not come up with generations that were hallucination free.'
        return 'END'
    else:
        return 'answer_question_check'  # must match the key used in add_conditional_edges
    
def answer_question_check(state):
    # answer_question_check takes prompt and generation and grades whether it answers the question. If no, overwrite generation with default response. Either way the next node is END.
    class AnswerCheck(BaseModel):
        """Binary score for whether the answer addresses the question."""
        binary_score: Literal['Yes', 'No']
    answer_llm = llm.with_structured_output(AnswerCheck)
    answer_check_prompt_template = """You are an answer checker. Your job is to evaluate whether the given response to a question answers the question. If the answer answers the question, respond with 'Yes', and vice versa.
\nquestion:{prompt}\n\n{generation}"""
    answer_check_prompt = ChatPromptTemplate.from_template(answer_check_prompt_template)
    # with_structured_output already returns an AnswerCheck object, so no StrOutputParser here
    answered = (answer_check_prompt | answer_llm).invoke({
            'prompt': state['prompt'],
            'generation': state['generation']
    }).binary_score
    # This function is registered as a node, so it returns a state update, not a route name
    if answered == 'No':
        return {'generation': "Sorry LLM could not come up with generations that answered the question."}
    return {'generation': state['generation']}

def end(state):
    # Nodes return state updates (dicts); .get avoids a KeyError when nothing was generated
    return {'generation': state.get('generation', '')}
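
To convince myself why hyde_retrieval searches with a hypothetical document instead of the question, here is a toy sketch with no LLM and no vector store. The bag-of-words cosine similarity stands in for the real embeddings, the two-line corpus is made up, and `hypothetical_doc` is hand-written where the LLM's output would normally go.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list, k: int = 1) -> list:
    """Return the k corpus entries most similar to the query."""
    q = bow(query)
    return sorted(corpus, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]


corpus = [
    "Short-term memory holds in-context information; long-term memory uses an external vector store.",
    "Adversarial attacks craft inputs that trigger unsafe model outputs.",
]
question = "How do agents remember things?"
# Stand-in for what the LLM would write when asked to answer the question as a document
hypothetical_doc = "Agents keep short-term memory in context and long-term memory in an external vector store."

print(cosine(bow(question), bow(corpus[0])))          # the question shares no words with the chunk
print(cosine(bow(hypothetical_doc), bow(corpus[0])))  # the hypothetical doc shares many
print(retrieve(hypothetical_doc, corpus))
```

The question and the relevant chunk use different vocabulary, while the hypothetical answer is written in the same style as the chunk, which is the gap HyDE tries to close.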
    


In [24]:
from langgraph.graph import END, StateGraph, START

workflow = StateGraph(GraphState)

# Define the nodes (routing functions are attached as conditional edges below, not added as nodes)
workflow.add_node("start", lambda state: state)  # Dummy entry node; query_analysis routes from here
workflow.add_node("retrieve_and_grade", retrieve_and_grade)  # retrieve and grade documents
workflow.add_node("hyde_retrieval", hyde_retrieval)  # HyDE rewrite, then retrieve and grade
workflow.add_node("llm_generation", llm_generation)  # generate an answer
workflow.add_node("answer_question_check", answer_question_check)  # check the answer addresses the question
workflow.add_node("END", end)  # final node

# Build graph
workflow.add_edge(START, "start")

workflow.add_conditional_edges(
    "start",
    query_analysis,
    {
        "END": "END",
        "retrieve_and_grade": "retrieve_and_grade",
        "llm_generation": "llm_generation"
    },
)
workflow.add_conditional_edges(
    "retrieve_and_grade",
    retrieve_and_grade_check,
    {
        "hyde_retrieval": "hyde_retrieval",
        "llm_generation": "llm_generation"
    },
)
workflow.add_conditional_edges(
    "hyde_retrieval",
    hyde_retrieval_check,
    {
        "END": "END",
        "llm_generation": "llm_generation"
    },
)
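workflow.add_conditional_edges(
    "llm_generation",
    hallucinate_check,
    {
        "END": "END",
        "llm_generation": "llm_generation",  # retry when a hallucination is detected
        "answer_question_check": "answer_question_check"
    },
)
workflow.add_edge("answer_question_check", "END")
workflow.add_edge("END", END)  # finish the graph after the custom "END" node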

app = workflow.compile()


# workflow.add_edge("retrieve", "grading")
# workflow.add_conditional_edges(
#     "grading",
#     decide_to_search,
#     {
#         "rewrite_and_search": "rewrite_and_search",
#         "generate": "generate",
#     },
# )
# workflow.add_edge("rewrite_and_search", "generate")
# workflow.add_edge("generate", END)

# # Compile
# app = workflow.compile()

In [None]:
inputs = {"question": "What are the types of agent memory?"}
for output in app.stream(inputs):
    for key, value in output.items():
        # Node
        print(f"Node '{key}':")
        # Optional: print full state at each node
        # pprint.pprint(value["keys"], indent=2, width=80, depth=None)
    print("\n---\n")

[{'name': 'vectorstore', 'args': {'query': 'types of agent memory'}, 'id': 'ae666ea1-9777-4fb4-b750-f9ad0f3aa2c6', 'type': 'tool_call'}]
Routing to retrieve_and_grade
Node 'start':

---



## Debugging tool calling
My router kept answering the question instead of routing to the appropriate tool. The debugging below shows it's because command-r7b doesn't support tool calling on Ollama as of 23 Feb 2025.
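
A small sketch of the check I use to tell "the model called a tool" apart from "the model answered directly". It only assumes the `tool_calls` shape seen in the router output (a list of dicts with a `name` key); `pick_route` is a hypothetical helper, not part of langchain or ollama.

```python
def pick_route(tool_calls: list, content: str = "") -> str:
    """Return the single tool name chosen by the model, or raise if it did not route."""
    if not tool_calls:
        # This is the failure mode seen below: text in `content`, nothing in `tool_calls`
        raise ValueError(f"Model answered directly instead of calling a tool: {content[:60]!r}")
    if len(tool_calls) != 1:
        raise ValueError(f"Expected exactly 1 tool call, got {len(tool_calls)}")
    return tool_calls[0]["name"]


good = [{"name": "vectorstore", "args": {"query": "types of agent memory"}, "type": "tool_call"}]
print(pick_route(good))

try:
    pick_route([], content="Agent memory is a crucial component in artificial intelligence")
except ValueError as err:
    print(err)
```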

In [None]:
class web_search(BaseModel):
    """
    The internet. Use web_search for questions that are related to anything else than agents, prompt engineering, and adversarial attacks.
    """
    query: str = Field(description="The query to use when searching the internet.")

class vectorstore(BaseModel):
    """
    A vectorstore containing documents related to agents, prompt engineering, and adversarial attacks. Use the vectorstore for questions on these topics.
    """
    query: str = Field(description="The query to use when searching the vectorstore.")

llm_router = ChatOllama(
    model="command-r7b",
    temperature=0.1,
)

preamble="""You are an expert at routing a user question to a vectorstore or web search.
The vectorstore contains documents related to agents, prompt engineering, and adversarial attacks.
Use the vectorstore for questions on these topics. Otherwise, use web-search."""
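# NOTE: bind_tools returns a new runnable, so the result has to be assigned back.
# preamble is left out here, same as in query_analysis above.
llm_router = llm_router.bind_tools(
    tools=[web_search, vectorstore]  # , preamble=preamble
)
route = llm_router.invoke('What are the types of agent memory?')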

In [5]:
route
# The tool calling doesn't work properly?

AIMessage(content="Agent memory is a crucial component in artificial intelligence and robotics, enabling agents to store and retrieve information for decision-making and learning. There are several types of memory that agents can utilize, each serving different purposes:\n\n1. Short-Term or Working Memory: This type of memory holds temporary information that the agent needs to process or manipulate immediately. It is often used for maintaining a current state, performing calculations, or holding intermediate results during problem-solving tasks. Short-term memory has limited capacity and duration, typically lasting only a few seconds to a minute.\n\n2. Long-Term Memory: Long-term memory stores information over extended periods, sometimes even indefinitely. It is responsible for retaining knowledge, experiences, skills, and learned behaviors. Long-term memory can be further categorized into different types:\n   - Episodic Memory: This type of long-term memory involves the recollection o

In [6]:
llm = ChatOllama(
    model="command-r7b",
    temperature=0.1,
)
structured_llm_router = llm.bind_tools(
    tools=[web_search, vectorstore], #preamble=preamble
    tool_choice="auto"
)

# Prompt
route_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{question}"),
    ]
)

question_router = route_prompt | structured_llm_router
response = question_router.invoke(
    {"question": "Should I use websearch or vectorstore to answer this question: Who will the Bears draft first in the NFL draft?"}
)

In [7]:
response
# Still doesn't work, try an example straight from langchain docs

AIMessage(content="To answer this question, it would be best to use the 'web_search' tool as it is related to a current event and requires up-to-date information. The 'vectorstore' tool is designed for questions on agents, prompt engineering, and adversarial attacks, which are not relevant to this query.", additional_kwargs={}, response_metadata={'model': 'command-r7b', 'created_at': '2025-02-23T02:05:38.069165443Z', 'done': True, 'done_reason': 'stop', 'total_duration': 74148531953, 'load_duration': 68528020, 'prompt_eval_count': 1351, 'prompt_eval_duration': 64765000000, 'eval_count': 65, 'eval_duration': 9313000000, 'message': Message(role='assistant', content="To answer this question, it would be best to use the 'web_search' tool as it is related to a current event and requires up-to-date information. The 'vectorstore' tool is designed for questions on agents, prompt engineering, and adversarial attacks, which are not relevant to this query.", images=None, tool_calls=None)}, id='ru

In [None]:
from langchain_ollama import ChatOllama
from pydantic import BaseModel, Field

chat = ChatOllama(
    model="command-r7b",
    temperature=0.1,
)

class Multiply(BaseModel):
    a: int = Field(..., description="First integer")
    b: int = Field(..., description="Second integer")

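chat_with_tools = chat.bind_tools([Multiply])  # the tool has to be bound before invoking
ans = chat_with_tools.invoke("What is 45*67")
ans.tool_calls
# yep, the example from the langchain docs doesn't work either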

[]

In [16]:
# try direct from ollama
import ollama

def add_two_numbers(a: int, b: int) -> int:
  """
  Add two numbers

  Args:
    a: The first integer number
    b: The second integer number

  Returns:
    int: The sum of the two numbers
  """
  return a + b


response = ollama.chat(
  "llama3.1:8b",
  messages=[{'role': 'user', 'content': 'What is 10 + 10?'}],
  tools=[add_two_numbers], # Actual function reference
)
response

ChatResponse(model='llama3.1:8b', created_at='2025-02-23T02:13:25.906459303Z', done=True, done_reason='stop', total_duration=3675175215, load_duration=75125448, prompt_eval_count=179, prompt_eval_duration=281000000, eval_count=24, eval_duration=3317000000, message=Message(role='assistant', content='', images=None, tool_calls=[ToolCall(function=Function(name='add_two_numbers', arguments={'a': 10, 'b': 10}))]))

In [None]:
# try direct from ollama
import ollama

def add_two_numbers(a: int, b: int) -> int:
  """
  Add two numbers

  Args:
    a: The first integer number
    b: The second integer number

  Returns:
    int: The sum of the two numbers
  """
  return a + b


response = ollama.chat(
  "command-r7b",
  messages=[{'role': 'user', 'content': 'What is 10 + 10?'}],
  tools=[add_two_numbers], # Actual function reference
)
response
# Okay, looks like it's the command-r7b model that doesn't support tool calling. It's currently tagged as supporting tools on the Ollama page, so probably some bug there. Never mind, let's use llama3.1.

ChatResponse(model='command-r7b', created_at='2025-02-23T02:15:18.467995693Z', done=True, done_reason='stop', total_duration=88410796397, load_duration=28001151421, prompt_eval_count=1259, prompt_eval_duration=60120000000, eval_count=3, eval_duration=284000000, message=Message(role='assistant', content='20', images=None, tool_calls=None))

In [19]:
llm = ChatOllama(
    model="llama3.1:8b",
    temperature=0.1,
)
structured_llm_router = llm.bind_tools(
    tools=[web_search, vectorstore], #preamble=preamble
    tool_choice="auto"
)

# Prompt
route_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{question}"),
    ]
)

question_router = route_prompt | structured_llm_router
response = question_router.invoke(
    {"question": "Should I use websearch or vectorstore to answer this question: Who will the Bears draft first in the NFL draft?"}
)

In [None]:
response
# Confirmed that llama3.1 is OK, but command-r7b doesn't work for tool calling right now

AIMessage(content='', additional_kwargs={}, response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-02-23T02:18:03.656131531Z', 'done': True, 'done_reason': 'stop', 'total_duration': 31813874514, 'load_duration': 16094691810, 'prompt_eval_count': 279, 'prompt_eval_duration': 12892000000, 'eval_count': 24, 'eval_duration': 2825000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-0e4acb95-5ce4-4343-82ec-d880d8911d17-0', tool_calls=[{'name': 'vectorstore', 'args': {'query': 'Bears first draft pick in NFL draft'}, 'id': 'aa5948d2-f962-46b9-be83-4da2213e7b92', 'type': 'tool_call'}], usage_metadata={'input_tokens': 279, 'output_tokens': 24, 'total_tokens': 303})