# Week 5 Exercise: RAG

In [16]:
from dotenv import load_dotenv
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [17]:
load_dotenv()

True

In [6]:
search_term = "LangChain"
docs = WikipediaLoader(query=search_term, load_max_docs=1).load()

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20, length_function=len)

In [9]:
data = text_splitter.split_documents(docs)


[Document(metadata={'title': 'LangChain', 'summary': "LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n", 'source': 'https://en.wikipedia.org/wiki/LangChain'}, page_content="LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis."),
 Document(metadata={'title': 'LangChain', 'summary': "LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangC

In [12]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

In [18]:
embeddings = OpenAIEmbeddings()

In [20]:
import chromadb

In [21]:
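store = Chroma.from_documents(
    data,
    embeddings,
    # Give each chunk a stable ID: the source URL plus the chunk's position.
    ids=[f"{item.metadata['source']}-{index}" for index, item in enumerate(data)],
    collection_name="langchain_wikipedia",
    persist_directory="db",
)

# Recent Chroma versions persist automatically, so this call may be unnecessary.
store.persist()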


In [27]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
import pprint

The video this notebook follows is only about a year old, yet many of the libraries have already changed. LangChain appears to have been refactored into several packages, so I had to update the imports to match the current version of the library.
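A quick way to see which of these packages are present in an environment is to probe for them without importing anything. This is a sketch using only the standard library; the package list reflects the imports used in this notebook, and `installed_packages` is a hypothetical helper.

```python
from importlib.util import find_spec

# Top-level packages imported in this notebook.
PACKAGES = ["langchain", "langchain_community", "langchain_openai", "chromadb", "dotenv"]


def installed_packages(names):
    """Hypothetical helper: map each top-level package name to whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


for name, found in installed_packages(PACKAGES).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```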

In [None]:
template = """You are a bot that answers questions about LangChain using only the context provided. If you don't know the answer, say "I don't know".

{context}

Question: {question}"""
PROMPT = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

The prompt template pattern feels unintuitive to me: the `{context}` and `{question}` placeholders are declared here but filled in later by the chain, out of sight.
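To make the pattern less opaque, here is a minimal sketch in plain Python of roughly what the "stuff" chain does with this template: it joins the retrieved chunks into one string and substitutes it for `{context}`, alongside the user's question. The `fill_prompt` helper and the sample chunks are hypothetical, and joining with a blank line is an assumption about the chain's separator; this is not LangChain's actual implementation.

```python
# Same template text as in the notebook cell above.
template = """You are a bot that answers questions about LangChain using only the context provided. If you don't know the answer, say "I don't know".

{context}

Question: {question}"""


def fill_prompt(chunks, question):
    """Hypothetical helper: join chunks and fill the two placeholders."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


# Made-up chunks standing in for retrieved documents.
sample_chunks = [
    "LangChain is a software framework for integrating LLMs into applications.",
    "Its use-cases include document analysis, chatbots, and code analysis.",
]
prompt_text = fill_prompt(sample_chunks, "What is LangChain?")
print(prompt_text)
```

Seen this way, the template is just a string with two named slots, and the model only ever receives the filled-in text.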

In [28]:
llm = ChatOpenAI(temperature=0, model="gpt-4o")

In [29]:
qa_with_source = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=store.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

In [31]:
pprint.pprint(qa_with_source.invoke("How can I use LangChain to build a RAG system?"))

{'query': 'How can I use LangChain to build a RAG system?',
 'result': "I don't know.",
 'source_documents': [Document(metadata={'source': 'https://en.wikipedia.org/wiki/LangChain', 'summary': "LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n", 'title': 'LangChain'}, page_content="LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis."),
                      Document(metadata={'summary': "LangChain is a software framework that helps facilitate the int

Right off the bat I'm getting an "I don't know", which is disappointing but not surprising, since the Wikipedia article probably doesn't have enough detail on how to build a RAG system with LangChain.
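One way to check that hunch is to look at what the retriever actually handed to the model. The sketch below assumes the result dictionary has the shape shown in the output above (a `source_documents` list whose items have a `page_content` attribute). `preview_sources` is a hypothetical helper, and the stand-in result is made up so the snippet runs without an API key.

```python
from types import SimpleNamespace


def preview_sources(result, width=80):
    """Hypothetical helper: return the first `width` characters of each retrieved chunk."""
    previews = []
    for doc in result["source_documents"]:
        text = " ".join(doc.page_content.split())  # collapse newlines and extra spaces
        previews.append(text[:width])
    return previews


# Made-up stand-in for the dict returned by qa_with_source.invoke(...).
fake_result = {
    "query": "How can I use LangChain to build a RAG system?",
    "result": "I don't know.",
    "source_documents": [
        SimpleNamespace(page_content="LangChain is a software framework that helps\nfacilitate the integration of LLMs."),
        SimpleNamespace(page_content="Placeholder text standing in for a second chunk."),
    ],
}

for i, preview in enumerate(preview_sources(fake_result), start=1):
    print(f"{i}. {preview}")
```

In the notebook, passing the real output of `qa_with_source.invoke(...)` would show whether any retrieved chunk mentions RAG at all.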

In [32]:
pprint.pprint(qa_with_source.invoke("What is LangChain?"))

{'query': 'What is LangChain?',
 'result': 'LangChain is a software framework that helps facilitate the '
           'integration of large language models (LLMs) into applications. It '
           'is used for various purposes, including document analysis and '
           'summarization, chatbots, and code analysis.',
 'source_documents': [Document(metadata={'source': 'https://en.wikipedia.org/wiki/LangChain', 'title': 'LangChain', 'summary': "LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n"}, page_content="LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with 

This response wasn't bad. It was shorter than I expected, but it was accurate and to the point.

In [33]:
pprint.pprint(qa_with_source.invoke("How can I use LangChain?"))

{'query': 'How can I use LangChain?',
 'result': "I don't know.",
 'source_documents': [Document(metadata={'source': 'https://en.wikipedia.org/wiki/LangChain', 'summary': "LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n", 'title': 'LangChain'}, page_content="LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis."),
                      Document(metadata={'title': 'LangChain', 'source': 'https://en.wikipedia.org/wiki/LangChain', 'summary': "LangChain 

Similar to the RAG question, this one also returned "I don't know". I think this is because the Wikipedia page doesn't have enough information on how to use LangChain in general.

In [34]:
pprint.pprint(qa_with_source.invoke("What is the purpose of LangChain?"))

{'query': 'What is the purpose of LangChain?',
 'result': 'LangChain is a software framework designed to facilitate the '
           'integration of large language models (LLMs) into applications.',
 'source_documents': [Document(metadata={'title': 'LangChain', 'source': 'https://en.wikipedia.org/wiki/LangChain', 'summary': "LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n"}, page_content="LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis."),
     

This response is very similar to the answer to "What is LangChain?", as if the model is just repeating itself. I think this is because the Wikipedia page doesn't say much about the purpose of LangChain beyond its definition. The original answer arguably covers this question as well, but it would have been nice to see a more detailed response.
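To put a rough number on "too similar", the two answers can be compared with Python's standard `difflib`. This is only a character-level sketch; the answer strings are copied from the outputs above, and the ratio is a crude similarity score, not a semantic one.

```python
from difflib import SequenceMatcher

# Answers copied from the two outputs above.
answer_what = (
    "LangChain is a software framework that helps facilitate the "
    "integration of large language models (LLMs) into applications. It "
    "is used for various purposes, including document analysis and "
    "summarization, chatbots, and code analysis."
)
answer_purpose = (
    "LangChain is a software framework designed to facilitate the "
    "integration of large language models (LLMs) into applications."
)

# ratio() returns a float in [0, 1]; 1.0 means the strings are identical.
similarity = SequenceMatcher(None, answer_what, answer_purpose).ratio()
print(f"similarity: {similarity:.2f}")
```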