**11-697 Homework 4: Retrieval Augmented Generation (RAG) for Question Answering**

**A. Overview**

In this task, you will curate a set of questions and build two LangChain QA pipelines, one with RAG and one without. For the RAG pipeline, you will index a set of documents that improve model performance on your questions. Therefore you should experiment to find questions about your chosen topic that can't be answered well by the model without RAG.

By default, the retriever uses k = 1 (the number of chunks it returns per query); you can set this to any value that works well with your chosen questions and corpus.

Before beginning, **please read through the entire notebook** and let us know right away if you have questions (post on Piazza).

**NOTE**: Please use a markdown (Text) cell to include your written comments!

**NOTE**: Please submit your notebook in a completed state, i.e. run the notebook to completion, and don't erase the content of the output cells!

**NOTE**: Please name your submission notebook 'HW4_\<your Andrew ID>.ipynb'.

**SUBMISSION DEADLINE**: October 12th at 11:59pm

Good luck!

In [None]:
# INSTALL REQUIRED LIBRARIES.
!pip install langchain
!pip install langchain_core
!pip install beautifulsoup4
!pip install langchain_community
!pip install langchain-openai
!pip install faiss-cpu

Collecting langchain_community
  Downloading langchain_community-0.3.31-py3-none-any.whl.metadata (3.0 kB)
Collecting requests<3.0.0,>=2.32.5 (from langchain_community)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7.0,>=0.6.7->langchain_community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7.0,>=0.6.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7.0,>=0.6.7->langchain_community)
  Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_community-0.3.31-py3-none-any.whl (2.5 MB)

In [None]:
# MOUNT GOOGLE DRIVE AND GET FILE CHUNKS FOR RAG.
#
# Modify this cell to load file chunks however you like (you don't need
# to use Google Drive, you can use LangChain integrations, etc.). Pick file chunks that
# improve the performance of the model on your questions when RAG is used.
#
# If you use Google Drive to store your files, then create an appropriate folder
# under your Colab Notebooks folder in Google Drive (the example below uses
# /Data/RAG_documents as a subfolder).

from google.colab import drive
drive.mount('/gdrive', force_remount=True)
drive_root_folder = '/gdrive/My Drive/Colab Notebooks/Data/RAG_documents'

import os
from langchain_core.documents import Document  # wraps each file's text for LangChain

os.chdir(drive_root_folder)
fileChunks = []
files = os.listdir()
print(files)
for file in files:
    with open(file, 'r') as f:
        document = f.read()
        fileChunks.append(Document(page_content=document))

Mounted at /gdrive
['lti.txt', 'nyberg.txt', 'cmu.txt', 'carnegie.txt', 'sinema.txt', 'in_nai.txt', 'beta.txt']
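
The loading loop above assumes every entry in the folder is a readable text file. A slightly more defensive variant is sketched below, assuming the corpus is a flat folder of UTF-8 `.txt` files. The helper name `load_text_files` and the dictionary layout are illustrative assumptions, not part of the assignment; the keys are chosen to mirror the `page_content` and `metadata` fields of a LangChain `Document`.

```python
from pathlib import Path


def load_text_files(folder, suffix=".txt"):
    """Return one record per matching file, skipping subfolders and other file types."""
    records = []
    for path in sorted(Path(folder).iterdir()):
        # Ignore directories and files with a different extension.
        if not path.is_file() or path.suffix.lower() != suffix:
            continue
        text = path.read_text(encoding="utf-8")
        # Keep the file name so a retrieved chunk can be traced back to its source.
        records.append({"page_content": text, "metadata": {"source": path.name}})
    return records
```

In the notebook, each record could then be turned into a document with `Document(**record)` before being appended to `fileChunks`.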


In [None]:
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("enter your Openai_api_key: ")

enter your Openai_api_key: ··········


In [None]:
import getpass
os.environ["AZURE_API_KEY"] = getpass.getpass("enter your Azure_api_key: ")

enter your Azure_api_key: ··········


In [None]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os

# Create an LLM instance to use with LangChain. You can use whatever model
# you want, the gpt-4o-mini seems to work well and isn't too expensive. Be
# sure to create a corresponding OpenAI key for whatever model you choose.
llm = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    api_key=os.environ['OPENAI_API_KEY'],
    base_url='https://ai-gateway.andrew.cmu.edu/'
)
# Create word embeddings to use when vectorizing file chunks. Don't change!
embeddings = OpenAIEmbeddings(
    model="azure/text-embedding-3-small",
    api_key=os.environ['AZURE_API_KEY'],
    base_url='https://ai-gateway.andrew.cmu.edu/'
)

In [None]:
# VECTORIZE THE FILE CHUNKS FOR RAG.
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(fileChunks)
vector = FAISS.from_documents(documents, embeddings)
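
The splitter above is created with no arguments, so its default `chunk_size` and `chunk_overlap` apply. To build intuition for what those two parameters control, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not LangChain's algorithm (the recursive splitter prefers to break at separators such as paragraph and line breaks), and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=50):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Smaller chunks make each retrieved passage more focused but spread an answer across more chunks, which is one reason chunk size and the retriever's k tend to be tuned together.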

In [None]:
# CREATE RETRIEVER_TOOL FOR LANGCHAIN.
#
# Note: "k" is a hyperparameter that must be tuned

from langchain.tools.retriever import create_retriever_tool
retriever = vector.as_retriever(search_kwargs={"k": 3})
retriever_tool = create_retriever_tool(
    retriever,
    "CMU-assistant",
    """Please play the role of a Carnegie Mellon University assistant. You are an expert on CMU who helps people understand what it is and all the information related to it.
    Your job is to answer the user's queries according to the information you are provided with.
    """,
)

In [None]:
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

# Define the prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """Please play the role of a Carnegie Mellon University assistant. You are an expert on CMU who helps people understand what it is and all the information related to it.
    Your job is to answer the user's queries according to the information you are provided with. """),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# Create the agents, one with RAG, one without
agent = create_tool_calling_agent(llm, [retriever_tool], prompt)
dummyAgent = create_tool_calling_agent(llm, [], prompt)

# Create agent executors, one with RAG, one without
agent_executor = AgentExecutor(agent=agent, tools=[retriever_tool], verbose=False)
dummyAgent_executor = AgentExecutor(agent=dummyAgent, tools=[], verbose=False)

In [None]:
def answerQuestions(questionArray):
  for question in questionArray:
    print("----------------------------------------------------------------------------------------------------------------------------------------------")
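    print("\nQuestion: " + question)
    # Baseline: the same LLM and prompt, but with no retriever tool available.
    result = dummyAgent_executor.invoke({"input": question})
    print("Answer (no RAG): " + result["output"])
    # RAG: the agent may call the retriever tool before answering.
    result = agent_executor.invoke({"input": question})
    print("Answer (with RAG): " + result["output"])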
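
Printing the answers is convenient for reading, but keeping them in a data structure makes it easier to compare the two pipelines question by question. The sketch below is one possible way to do that. The function name `collect_answers` and its two callable parameters are assumptions introduced for illustration; each callable is expected to take a question string and return an answer string.

```python
def collect_answers(questions, answer_without_rag, answer_with_rag):
    """Run both pipelines on every question and return one record per question."""
    records = []
    for question in questions:
        records.append({
            "question": question,
            "no_rag": answer_without_rag(question),
            "with_rag": answer_with_rag(question),
        })
    return records
```

With the executors defined in this notebook, the callables could be written as `lambda q: dummyAgent_executor.invoke({"input": q})["output"]` and `lambda q: agent_executor.invoke({"input": q})["output"]`.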

Curate your own set of questions; you should choose a topic and questions that the model can't answer well without RAG. Then index files for RAG that improve the performance of the model on your curated questions. To make the assignment interesting, you should curate a set of at least 10 questions about your chosen topic, and index several files that the retriever must choose from in order to provide a useful context in the prompt.

**NOTE**: Although this sample notebook indexes four simple .txt files from Google Drive (the .zip file containing these files was posted to Canvas), you are not limited to using simple text files or only using four files. Feel free to explore the different file type integrations that are supported by LangChain!

In [None]:
myQuestions = ["What year was Eric Nyberg born?",
               "Where was Eric Nyberg born?"]
answerQuestions(myQuestions)


Question: What year was Eric Nyberg born?
Answer (no RAG): I'm sorry, but I don't have any information regarding Eric Nyberg's birth year. If you have any other questions about Carnegie Mellon University or its programs, feel free to ask!
Answer (with RAG): Eric Nyberg was born in 1962, specifically on February 7th.

Question: Where was Eric Nyberg born?
Answer (no RAG): Eric Nyberg is a notable professor at Carnegie Mellon University, specifically known for his work in the field of computer science and artificial intelligence. However, I do not have specific information about his place of birth. For detailed personal information like this, you may need to refer to his professional biography or publications.
Answer (with RAG): Eric Nyberg was born in 1962, in Salem, Massachusetts.


### Curated Questions


In [None]:
myQuestions = ["List five movies that were shown at the 2025 Nairobi film festival?",
               "What film was selected as the opening film at the 2025 Nairobi film festival?",
               "Who are the members of the jury at the 2025 Nairobi film festival?",
               "Who is the director of the NBO film festival?",
               "What is the plot of the movie How to Build a Library?",
               "What documentaries has Zippy Kimundu made?",
               "What is the plot of the film My Father's Shadow?",
               "Who directed the documentary The Eyes of Ghana?",
               "What is Memories of Love about?",
               "How many movies are represented at the 2025 NBO film festival, and from which countries?"]

# New retriever tool for the film-festival topic (reuses the retriever built above)
retriever_tool = create_retriever_tool(
    retriever,
    "African_movies_expert",
    """Please play the role of a Kenyan entertainment journalist. You are an expert on African movies, focusing on the Kenyan and East African movie industry.
    Your job is to answer the user's queries according to the information you are provided with.
    """,
)

# Define the prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """Please play the role of a Kenyan entertainment journalist. You are an expert on African movies, focusing on the Kenyan and East African movie industry.
    Your job is to answer the user's queries according to the information you are provided with. """),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# Create the agents, one with RAG, one without
agent = create_tool_calling_agent(llm, [retriever_tool], prompt)
dummyAgent = create_tool_calling_agent(llm, [], prompt)

# Create agent executors, one with RAG, one without
agent_executor = AgentExecutor(agent=agent, tools=[retriever_tool], verbose=False)
dummyAgent_executor = AgentExecutor(agent=dummyAgent, tools=[], verbose=False)

### Before adding new files for RAG

In [None]:
answerQuestions(myQuestions)

----------------------------------------------------------------------------------------------------------------------------------------------

Question: List five movies that were shown at the 2025 Nairobi film festival?
Answer (no RAG): It seems that I currently do not have access to information regarding specific movies shown at the 2025 Nairobi Film Festival. The festival typically features a diverse lineup of films from across Africa and beyond, highlighting the talents of filmmakers in the region.

If you have any specific films or filmmakers in mind or need information about previous festivals, feel free to ask!
Answer (with RAG): I'm sorry, but it seems there is no information available about the movies shown at the 2025 Nairobi Film Festival. If you have any other questions or need information on a different topic related to the Kenyan or East African film industry, feel free to ask!
----------------------------------------------------------------------------------------------

### After adding new files for RAG

In [None]:
answerQuestions(myQuestions)

----------------------------------------------------------------------------------------------------------------------------------------------

Question: List five movies that were shown at the 2025 Nairobi film festival?
Answer (no RAG): I currently do not have access to the specific movies that were shown at the 2025 Nairobi Film Festival, as that information is not available in my data. The festival typically showcases a range of local and international films, focusing on various themes relevant to the African context.

If you are interested in specific genres, notable filmmakers, or past Nairobi Film Festivals, I would be happy to provide that information!
Answer (with RAG): Here are five movies that were showcased at the 2025 Nairobi Film Festival:

1. **How to Build a Library** 
   - Directors: Maia Lekow, Christopher King | Kenya
   - A documentary about two Nairobi women who transform a former whites-only library into a vibrant cultural hub.

2. **Mothers of Chibok**
   - Direc

As shown, question answering improves once relevant source documents are added to the index.

At k=1, there was no significant improvement even with the new source documents added. k=3 was the sweet spot: the retriever supplied more useful context, and the LLM answered the questions more fully and accurately.
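
To illustrate why k matters, the toy example below ranks a handful of chunks against a query and returns the top k. Everything in it is made up for illustration: the chunk names, the 3-dimensional vectors, and the choice of cosine similarity (the FAISS index used in this notebook may rank by a different distance measure, and real embeddings have far more dimensions).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy "embeddings" for four chunks and one query.
chunks = {
    "festival_overview": [0.9, 0.1, 0.0],
    "film_lineup": [0.7, 0.6, 0.1],
    "jury_members": [0.6, 0.1, 0.7],
    "unrelated_topic": [0.0, 0.2, 0.9],
}
query = [1.0, 0.3, 0.1]

print(top_k(query, chunks, k=1))
print(top_k(query, chunks, k=3))
```

With these toy vectors, k=1 returns only the overview chunk, while k=3 also brings in the lineup and jury chunks. A question whose answer sits in the second- or third-ranked chunk can therefore only be answered once k is raised.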