In [20]:
!pip install "google-api-python-client>=2.10"
!pip install -U langchain
!pip install -U langchain-community
!pip install -U langchain-text-splitters
!pip install langgraph

Collecting langchain
  Downloading langchain-0.2.3-py3-none-any.whl.metadata (6.9 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.1-py3-none-any.whl.metadata (2.2 kB)
Downloading langchain-0.2.3-py3-none-any.whl (974 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 974.0/974.0 kB 37.1 MB/s eta 0:00:00
Downloading langchain_text_splitters-0.2.1-py3-none-any.whl (23 kB)
Installing collected packages: langchain-text-splitters, langchain
  Attempting uninstall: langchain-text-splitters
    Found existing installation: langchain-text-splitters 0.0.1
    Uninstalling langchain-text-splitters-0.0.1:
      Successfully uninstalled langchain-text-splitters-0.0.1
  Attempting uninstall: langchain
    Found existing installation: langchain 0.1.12
    Uninstalling langchain-0.1.12:
      Successfully uninstalled langchain-0.1.12
Successfully installed langchain-0.2.3 langchain-text-splitt

In [4]:
!pip install tavily-python

Collecting tavily-python
  Downloading tavily_python-0.3.5-py3-none-any.whl.metadata (11 kB)
Downloading tavily_python-0.3.5-py3-none-any.whl (13 kB)
Installing collected packages: tavily-python
Successfully installed tavily-python-0.3.5


# Evaluating Workflow with Graph Agents

This notebook works through the tutorial presented in this [video](https://www.youtube.com/watch?v=-ROS6gfYIts), following the corresponding source notebook [here](https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_rag_agent_llama3_local.ipynb).

> Note: I change the RAG source to local md files of the Streameye blog and use HuggingFaceEmbeddings to load ./data/embeddings/gte-large/ as the embeddings model instead of nomic. For search I planned to use the Google API (the search cell currently uses Tavily, with the Google wrapper commented out). Llama3 is loaded with the standard Hugging Face transformers API.

With the release of LLaMA3, we're seeing great interest in agents that can run reliably and locally (e.g., on your laptop). Here, we show how to build reliable local agents using LangGraph and LLaMA3-8b from scratch. We combine ideas from 3 advanced RAG papers (Adaptive RAG, Corrective RAG, and Self-RAG) into a single control flow. We run this locally with a local vectorstore courtesy of @nomic_ai & @trychroma, @tavilyai for web search, and LLaMA3-8b via @ollama.


## Setting Environment


In [1]:
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv()) # read local .env file
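
Later cells rely on credentials read from the environment (for example the Tavily search client). A minimal sketch of a pre-flight check follows; `missing_env_vars` is a hypothetical helper, and the key name `TAVILY_API_KEY` is an assumption that should be adjusted to whatever your `.env` file defines.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumed key name; replace with the variables your .env actually defines.
required = ["TAVILY_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv` surfaces a missing key early, before the model has been loaded.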


## Loading the Model

In [2]:
import pandas as pd
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    GenerationConfig,
    pipeline
)

from langchain_community.utilities import GoogleSearchAPIWrapper

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_community.tools.tavily_search import TavilySearchResults



model_name = "../ext_models/Meta-Llama-3-8B-Instruct"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [3]:
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, 
                                             device_map=DEVICE, 
                                             torch_dtype="auto")
generation_config = GenerationConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
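stop_token = "<|eot_id|>"
# convert_tokens_to_ids looks up the token directly; encode() may prepend a BOS token,
# in which case [0] would return the BOS id instead of the stop token id
stop_token_id = tokenizer.convert_tokens_to_ids(stop_token)
begin_token = "<|begin_of_text|>"
begin_token_id = tokenizer.convert_tokens_to_ids(begin_token)
generation_config.eos_token_id = stop_token_id
generation_config.bos_token_id = begin_token_id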
generation_config.max_new_tokens = 1024
generation_config.temperature = 0.1
generation_config.top_p = 0.9
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15
generation_config.pad_token_id = tokenizer.eos_token_id

In [6]:
llm_pipeline = pipeline("text-generation", 
                        model=model, 
                        tokenizer=tokenizer, 
                        generation_config=generation_config, 
                        return_full_text=False) 
llm = HuggingFacePipeline(pipeline=llm_pipeline)

  warn_deprecated(


## Loading Embeddings

In [7]:
gte_large = HuggingFaceEmbeddings(model_name="./data/embeddings/gte-large/", 
                                       model_kwargs={"device": DEVICE}, 
                                       encode_kwargs={"normalize_embeddings": True})

  warn_deprecated(


## Prepare the Retriever

Load previously created Chroma DB of the Streameye Blog articles. 

In [8]:
db = Chroma(persist_directory="./data/sound/db", embedding_function=gte_large) # lets try the multimodal podcast

In [9]:
retriever = db.as_retriever(search_kwargs={"k": 3})

In [10]:
def format_docs(docs, confidence = 0.85):
    return "\n\n".join(doc[0].page_content for doc in docs if doc[1] > confidence)
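
To see what the confidence filter does, here is a small self-contained sketch. `FakeDoc` is a hypothetical stand-in for LangChain's `Document`, and the scores are made up for illustration; only pairs scoring strictly above the threshold survive.

```python
from collections import namedtuple

# Hypothetical stand-in for langchain's Document (only page_content is needed here)
FakeDoc = namedtuple("FakeDoc", ["page_content"])

def format_docs(docs, confidence=0.85):
    # docs is a list of (document, relevance_score) pairs
    return "\n\n".join(doc.page_content for doc, score in docs if score > confidence)

scored = [
    (FakeDoc("chunk about Chameleon"), 0.91),
    (FakeDoc("loosely related chunk"), 0.83),
    (FakeDoc("unrelated chunk"), 0.40),
]

print(format_docs(scored))                  # only the 0.91 chunk passes the default 0.85
print(format_docs(scored, confidence=0.8))  # the 0.83 chunk is included as well
```

If no chunk clears the threshold the function returns an empty string, which is worth handling before the text is passed to a grader.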

### Let's Create Another DB

We will use the text transcribed with OpenAI Whisper to create a new Chroma DB.

In [8]:
import shutil  # needed for shutil.rmtree in save_to_chroma

from langchain_community.document_loaders import TextLoader
from langchain.schema import Document

# loader = TextLoader("./data/sound/text_full.txt")
loader = TextLoader("./data/sound/nextjs.conf/text.txt")
text = loader.load()


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        add_start_index=True
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")
    return chunks

chunks = split_text(text)

## Saving to Chroma

def save_to_chroma(chunks: list[Document]):
    db_path = "./data/sound/nextjs.conf/db"
    if os.path.exists(db_path):
        shutil.rmtree(db_path) ## Recursive delete
    db = Chroma.from_documents(chunks, gte_large, persist_directory=db_path)
    db.persist()
save_to_chroma(chunks)

Split 1 documents into 29 chunks.


  warn_deprecated(
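
The `chunk_size` and `chunk_overlap` settings can be illustrated with a plain sliding window. This is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses (the real splitter prefers to break on separators such as paragraphs and sentences); `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
for chunk in demo:
    print(chunk["start_index"], chunk["text"])
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks, at the cost of storing some text twice.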


### DB From Single Text From Whisper

In [9]:
db = Chroma(persist_directory="./data/sound/nextjs.conf/db", embedding_function=gte_large)
retriever = db.as_retriever(search_kwargs={"k": 3})

## The Different Nodes of the Graph

1. Retriever Grader - grades the documents returned by the retriever as yes/no
2. The Answer Generator - answers a question using context from the docs
3. Answer Grader - grades if the answer is relevant
4. Hallucination Grader - grades if the model has hallucinated
5. Web Search - falls back to web search
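
Read as plain Python, the control flow these nodes implement in the first stage (without the hallucination grader) looks roughly like the sketch below. All callables are hypothetical stubs passed in as arguments, and the `max_rounds` cap is an assumption added so the sketch always terminates; the actual graph is wired with LangGraph.

```python
def run_flow(question, retrieve, grade_docs, search_web, generate, grade_answer, max_rounds=3):
    """Retrieve, fall back to web search if needed, generate, and re-check the answer."""
    docs = retrieve(question)
    if not grade_docs(question, docs):
        docs = search_web(question)          # retrieved context judged irrelevant
    answer = generate(question, docs)
    for _ in range(max_rounds):
        if grade_answer(question, answer):
            break
        docs = search_web(question)          # answer judged not useful: try the web
        answer = generate(question, docs)
    return answer

# Stub components for illustration only
answer = run_flow(
    "What is Chameleon?",
    retrieve=lambda q: ["local chunk"],
    grade_docs=lambda q, d: False,
    search_web=lambda q: ["web result"],
    generate=lambda q, d: f"answer from {d[0]}",
    grade_answer=lambda q, a: True,
)
print(answer)
```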


### 1. Retrieval Grader

In [11]:
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing relevance 
    of a retrieved document to a user question. If the document contains keywords related to the user question, 
    grade it as relevant. It does not need to be a stringent test. The goal is to filter out erroneous retrievals.
    Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question.
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
     <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["question", "document"],
)
retrieval_grader = prompt | llm | JsonOutputParser()

In [12]:
question = "What can you tell me about Chameleon model"
context = db.similarity_search_with_relevance_scores(question, k=3)
docs = format_docs(context, confidence = 0.8)

In [13]:
context

[(Document(metadata={'source': './data/sound/text_full.txt', 'start_index': 4800}, page_content="is pretty important because now what this means is that this is the new state-of-the-art vision language model, multimodal model, right? I guess GPT-4-0 is supposedly better than GPT-4, but is it better than GPT-4-V? So if Chameleon is better than GPT-4-V, it might even be better than GPT-4-0, but we don't really know yet. So we're going to say potentially state-of-the-art multimodal model. Okay. The interesting thing about Chameleon is that it's going to be trained from scratch on an end-to-end fashion on an interleaved mixture of all modalities, right? And they need to train this from scratch because they're going to be using this early fusion approach where all modalities are projected into a shared representation space from the start, allowing for seamless reasoning and generation across modalities. Okay, so let's take a pause there to think about what does this actually mean. So what t

In [14]:

docs

"is pretty important because now what this means is that this is the new state-of-the-art vision language model, multimodal model, right? I guess GPT-4-0 is supposedly better than GPT-4, but is it better than GPT-4-V? So if Chameleon is better than GPT-4-V, it might even be better than GPT-4-0, but we don't really know yet. So we're going to say potentially state-of-the-art multimodal model. Okay. The interesting thing about Chameleon is that it's going to be trained from scratch on an end-to-end fashion on an interleaved mixture of all modalities, right? And they need to train this from scratch because they're going to be using this early fusion approach where all modalities are projected into a shared representation space from the start, allowing for seamless reasoning and generation across modalities. Okay, so let's take a pause there to think about what does this actually mean. So what this means is that this model is taking these different modalities. Here you have the modality of

In [15]:
retrieval_grader.invoke({"question": question, "document": docs})

{'score': 'yes'}
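
`JsonOutputParser` expects the model to emit valid JSON. Small local models occasionally wrap the JSON in extra text, so a defensive fallback can help. A minimal sketch, assuming the grader's raw text output is available as a string; `extract_score` is a hypothetical helper and not part of LangChain.

```python
import json
import re

def extract_score(raw_text, default="no"):
    """Pull a yes/no 'score' out of model output that may contain extra text."""
    match = re.search(r"\{.*?\}", raw_text, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get("score", default)
            return str(value).strip().lower()
        except json.JSONDecodeError:
            pass  # fall through to the default
    return default

print(extract_score('Sure! {"score": "Yes"}'))
print(extract_score("I cannot decide"))
```

Defaulting to "no" is a deliberate choice here: an unreadable grade then sends the flow to web search instead of trusting context that was never confirmed as relevant.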

### 2. The Answer Generator

In [25]:
# Prompt
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an assistant for question-answering 
    tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer,
    just say that you don't know. Try and answer exhaustively but only with information provided by the context below.
    Do not make up any assumptions yourself.<|eot_id|>
    <|start_header_id|>user<|end_header_id|>
    Question: {question} 
    Context: {context} 
    Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question", "context"],
)
rag_chain = prompt | llm | StrOutputParser()

In [27]:
question = "Is there a difference between a multimodal model and generic LLM that handles graphics and text?"
context = db.similarity_search_with_relevance_scores(question, k=3)
docs = format_docs(context)
generation = rag_chain.invoke({"question": question, "context": docs})
print(generation)



Based on the given context, I would say that yes, there is a difference between a multimodal model and a generic Large Language Model (LLM) that handles graphics and text.

A multimodal model is specifically designed to handle multiple input modalities such as images, videos, audio, or text, whereas a generic LLM is primarily focused on processing and generating human language (text). While some LLMs may be able to process simple visual inputs like images, they are not typically designed to handle complex multimedia data like videos or graphics.

Multimodal models, on the other hand, are trained on large datasets that combine different types of data, allowing them to learn representations that can capture relationships across modalities. This enables them to perform tasks like image captioning, visual question answering, or video summarization, which require integrating information from multiple sources.

In contrast, a generic LLM might struggle with these tasks due to its limited t

### 3. Search

In [28]:
# search = GoogleSearchAPIWrapper()
search = TavilySearchResults(max_results=3)  # max_results is the tool's result-count field
# search.run("Is there a difference in responsive html5 ad and takeover skin?")
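
Search backends differ in what they return: `GoogleSearchAPIWrapper.run` gives back one string, while Tavily results typically arrive as a list of dictionaries with `url` and `content` keys. A minimal sketch of a normaliser, assuming that result shape; `results_to_text` is a hypothetical helper.

```python
def results_to_text(results):
    """Collapse search results into one string suitable for a prompt."""
    if isinstance(results, str):
        return results
    parts = []
    for item in results:
        if isinstance(item, dict):
            parts.append(str(item.get("content", "")))
        else:
            parts.append(str(item))
    return "\n".join(p for p in parts if p)

sample = [
    {"url": "https://example.com/a", "content": "First snippet."},
    {"url": "https://example.com/b", "content": "Second snippet."},
]
print(results_to_text(sample))
```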

### 4. Answer Grader

In [29]:
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether an 
    answer is useful to resolve a question. Give a binary score 'yes' or 'no' to indicate whether the answer is 
    useful to resolve a question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
     <|eot_id|><|start_header_id|>user<|end_header_id|> Here is the answer:
    \n ------- \n
    {generation} 
    \n ------- \n
    Here is the question: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "question"],
)
# test it
question = "What kind of ads can a CMP generate? Do I need to use my own designers?"
generation = """According to the given context, a Creative Management Platform (CMP) can generate various types of ads, 
including:

* Tailored ad banners
* Advertisements showcasing the distinctiveness of one's brand, incorporating brand colors, logo, and slogan.

As for whether you need to use your own designers, it seems that CMPs can assist in creating ads, setting them up, 
and even allowing users to add dynamic data, images, illustrations, videos, etc., which implies that CMPs may not 
require the involvement of in-house designers. However, this does not necessarily mean that having professional 
designers would not be beneficial or necessary, especially if you want to customize the designs further or have 
specific requirements.
"""

answer_grader = prompt | llm | JsonOutputParser()

In [16]:
answer_grader.invoke({"generation": generation, "question": question})

{'score': 'yes'}

## The Graph State Class Handles the State

While the class holds the state, the nodes are plain functions that read this state and return updates to it.

In [30]:
from typing_extensions import TypedDict
from typing import List
from langchain_core.documents import Document

class GraphState(TypedDict):
    """
    Represents the state of the Graph
    """
    question: str
    generation: str
    web_search: str
    documents: List[str]
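
As a mental model of how nodes operate on the state, the sketch below has a node return only the keys it changes and merges them back. `DemoState`, `demo_retrieve`, and `apply_update` are hypothetical names, and the overwrite-on-merge behaviour is an assumption about the default case, not a description of LangGraph internals.

```python
from typing import List, TypedDict

class DemoState(TypedDict, total=False):
    question: str
    generation: str
    web_search: str
    documents: List[str]

def demo_retrieve(state: DemoState) -> dict:
    # A node reads what it needs and returns only the keys it changes
    return {"documents": [f"chunk matching: {state['question']}"]}

def apply_update(state: DemoState, update: dict) -> DemoState:
    # Assumption: returned keys overwrite existing ones, all other keys are kept
    return {**state, **update}

state: DemoState = {"question": "What is Chameleon?"}
state = apply_update(state, demo_retrieve(state))
print(state)
```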
    

### Define the Nodes and Edges of the Graph

In [31]:
def combine_docs(docs, confidence = 0.85):
    return "\n\n".join(doc[0].page_content for doc in docs if doc[1] > confidence)
def retrieve(state):
    """
    Retrieves documents from the database
    Args:
        state (GraphState): The current state of the graph
    Returns:
        new key in the state containing the documents
    """
    documents = db.similarity_search_with_relevance_scores(state["question"], k=3)
    question = state["question"]
    return {"documents": documents, "question": question}

def generate(state):
    """
    Generate answer using RAG on retrieved documents

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, generation, that contains LLM generation
    """
    print("---GENERATE---")
    question = state["question"]
    documents = state["documents"]

    # RAG generation
    generation = rag_chain.invoke({"context": documents, "question": question})
    return {"documents": documents, "question": question, "generation": generation}

def grade_combined(state):
    """
    Determines whether the retrieved documents are relevant to the question
    This function would first combine all documents into one single context
    and grade the full text in one go.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): The resulting context document and updated web_search state
    """
    print("---CHECK DOCUMENT RELEVANCE TO QUESTION---")
    question = state["question"]
    documents = state["documents"]
    web_search = "No"
    # Combine all documents
    combined_docs = combine_docs(documents, confidence = 0.8)
    # initialise score as a dictionary
    score = {}
    if combined_docs.strip() == "":
        score = {"score": "No"}
    else:
        score = retrieval_grader.invoke({"question": question, "document": combined_docs})
    
    if score["score"].lower() == "yes":
        print("---GRADE: DOCUMENT RELEVANT---")
        documents = Document(page_content=combined_docs)
    # Document not relevant
    else:
        print("---GRADE: DOCUMENT NOT RELEVANT---")
        # We do not return the document
        # We set a flag to indicate that we want to run web search
        web_search = "Yes"
    return {"documents": documents, "question": question, "web_search": web_search}
    
def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question
    If any document is not relevant, we will set a flag to run web search

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Filtered out irrelevant documents and updated web_search state
    """

    print("---CHECK DOCUMENT RELEVANCE TO QUESTION---")
    question = state["question"]
    documents = state["documents"]

    # Score each doc
    filtered_docs = []
    web_search = "No"
    for d in documents:
        score = retrieval_grader.invoke(
            {"question": question, "document": d[0].page_content} # 0 is the content, 1 is the confidence
        )
        grade = score["score"]
        # Document relevant
        if grade.lower() == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        # Document not relevant
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            # We do not include the document in filtered_docs
            # We set a flag to indicate that we want to run web search
            web_search = "Yes"
            continue
    return {"documents": filtered_docs, "question": question, "web_search": web_search}

def web_search(state):
    """
    Web search based on the question

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Appended web results to documents
    """

    print("---WEB SEARCH---")
    question = state["question"]
    documents = state["documents"]

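    # Web search
    docs = search.run(question)
    # Tavily returns a list of dicts with a "content" key, whereas
    # GoogleSearchAPIWrapper.run returns a plain string; normalise to a string
    if not isinstance(docs, str):
        docs = "\n".join(d["content"] for d in docs)
    print("---SEARCH DOCS---", docs)

    web_results = Document(page_content=docs)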
    # if documents is not None:
    #     documents.append(web_results)
    # else:
    #     documents = [web_results]
    documents = [web_results]
    return {"documents": documents, "question": question}

# Conditional

def decide_to_generate(state):
    """
    Determines whether to generate an answer, or add web search

    Args:
        state (dict): The current graph state

    Returns:
        str: Binary decision for next node to call
    """

    print("---ASSESS GRADED DOCUMENTS---")
    question = state["question"]
    web_search = state["web_search"]
    filtered_documents = state["documents"]

    if web_search == "Yes":
        # The retrieved context was judged not relevant
        # We fall back to web search before generating
        print(
            "---DECISION: SOME DOCUMENTS ARE NOT RELEVANT TO QUESTION, INCLUDE WEB SEARCH---"
        )
        return "websearch"
    else:
        # We have relevant documents, so generate answer
        print("---DECISION: GENERATE---")
        return "generate"

def grade_answer(state):
    """
    Determines whether the generation is useful for answering the question.

    Args:
        state (dict): The current graph state

    Returns:
        str: "yes" or "no", used to pick the next edge
    """
    print("---CHECKING ANSWER---")
    question = state["question"]
    generation = state["generation"]
    score = answer_grader.invoke({"question": question, "generation": generation})
    # Normalise the case so the value matches the "yes"/"no" keys of the edge mapping
    return score["score"].lower()

### Add the Nodes to the Graph

In [32]:
from langgraph.graph import END, StateGraph

workflow = StateGraph(GraphState)

# Define the nodes
workflow.add_node("websearch", web_search)  # web search
workflow.add_node("retrieve", retrieve)  # retrieve
workflow.add_node("grade_documents", grade_combined)  # grade combined
workflow.add_node("generate", generate)  # generate

### Build the Graph

Use this diagram for reference. Note that this first stage is simplified: it omits the hallucination grader and the initial router, which can be added next.
![Graph Agents Diagram](./data/images/graph_agents.png)

In [33]:
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {
        "websearch": "websearch",
        "generate": "generate",
    },
)
workflow.add_edge("websearch", "generate")
workflow.add_conditional_edges(
    "generate",
    grade_answer,
    {
        "yes": END,
        "no": "websearch",
    },
)
# workflow.add_edge("generate", END)
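
The mapping passed to `add_conditional_edges` translates a router's return value into the name of the next node. The sketch below mimics only that lookup in plain Python; `next_node`, the `END` string, and the `answer_ok` key are hypothetical stand-ins, and LangGraph's real execution model is more involved.

```python
END = "__end__"  # stand-in for langgraph's END sentinel

def next_node(current, state, edges, conditional_edges):
    """Resolve the next node name from fixed edges or a router plus its mapping."""
    if current in conditional_edges:
        router, mapping = conditional_edges[current]
        return mapping[router(state)]
    return edges.get(current, END)

edges = {"retrieve": "grade_documents", "websearch": "generate"}
conditional_edges = {
    "grade_documents": (
        lambda s: "websearch" if s["web_search"] == "Yes" else "generate",
        {"websearch": "websearch", "generate": "generate"},
    ),
    "generate": (lambda s: s["answer_ok"], {"yes": END, "no": "websearch"}),
}

print(next_node("retrieve", {}, edges, conditional_edges))
print(next_node("grade_documents", {"web_search": "Yes"}, edges, conditional_edges))
print(next_node("generate", {"answer_ok": "yes"}, edges, conditional_edges))
```

In this sketch a router value outside the mapping, such as "Yes" with a capital Y on the `generate` edge, raises a `KeyError`, which is why the router output should match the mapping keys exactly. Note also that the `generate` to `websearch` to `generate` path can repeat for as long as the answer is graded "no".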

In [44]:
app = workflow.compile()
inputs = {"question": "Are there any open sourced multimodal models that we can try?"}
response = []
# Stream node outputs; after the loop, `output` holds the last node's result
for output in app.stream(inputs):
    for key, value in output.items():
        response.append({"key": key, "value": value})
response

---CHECK DOCUMENT RELEVANCE TO QUESTION---
---GRADE: DOCUMENT RELEVANT---
---ASSESS GRADED DOCUMENTS---
---DECISION: GENERATE---
---GENERATE---
---CHECKING ANSWER---


[{'key': 'retrieve',
  'value': {'question': 'Are there any open sourced multimodal models that we can try?',
   'documents': [(Document(page_content="papers on multimodal models. Multimodal models are all the rage right now because the two biggest players in the AI race, OpenAI and Google, released multimodal models, right? So OpenAI released GPT-4.0, but a version of it that we haven't really gotten access to, which is this kind of like multimodal version. And then Google released Project Astra, which is basically the same thing, right? You can stream video and audio to the model and the model outputs audio, right? Streams out audio. So we wanted to kind of figure out what is the current kind of state of the art? What's the landscape in multimodal models right now? So one thing that we looked at is the open source world, right? Hugging face kind of represents the open source community. You don't have a ton of money. You can't be doing giant training runs. So whenever people build mul

In [45]:
print(output['generate']['generation'])



According to the given context, yes, there are open-sourced multimodal models that can be tried. Specifically mentioned are:

1. "What Matters When Building Vision Language Models" from Hugging Face (published on May 3, 2024)
2. Mirasol 3B, a multimodal autoregressive model for time-aligned and contextual modalities from Google, DeepMind, and Google Research (first published on Archive last year and updated since)

These models represent the open-source community's efforts in developing multimodal models. However, please note that the context does not provide direct links or access to these models.
