Welcome to the INFO-5940 RAG Chatbot repository! This chatbot is a Retrieval-Augmented Generation (RAG) system powered by OpenAI models and LangChain. It enables users to upload documents (PDF and TXT) and retrieve relevant information from them using semantic search and conversational AI. The chatbot is implemented using Streamlit for the user interface and ChromaDB for vector storage.
Before starting, ensure you have the following available on your system:
- Docker (Ensure Docker Desktop is running)
- VS Code
- VS Code Remote - Containers Extension
- Git
- OpenAI API Key
Open a terminal and run:
```bash
git clone https://github.com/cw2236/5940.git
cd 5940
```
Since `docker-compose.yml` expects environment variables, follow these steps.

Inside the project folder, create a `.env` file:

```bash
touch .env
```
Add your API key and base URL:
```
OPENAI_API_KEY=your-api-key-here
OPENAI_BASE_URL=https://api.ai.it.cornell.edu/
TZ=America/New_York
```
Make sure `docker-compose.yml` includes this `.env` file:
```yaml
version: '3.8'
services:
  devcontainer:
    container_name: info-5940-devcontainer
    build:
      dockerfile: Dockerfile
      target: devcontainer
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - OPENAI_BASE_URL=${OPENAI_BASE_URL}
      - TZ=${TZ}
    volumes:
      - '$HOME/.aws:/root/.aws'
      - '.:/workspace'
    env_file:
      - .env
```
Build and start the container:

```bash
docker-compose up --build
```
Now, your API key will be automatically loaded inside the container.
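For reference, the application code can then read these variables at runtime. Here is a minimal sketch assuming the standard `openai` Python client; the actual setup in `Chatbot_A1.py` may differ:

```python
import os

from openai import OpenAI

# docker-compose injects OPENAI_API_KEY and OPENAI_BASE_URL from .env,
# so the client can be configured straight from the environment.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL"),
)
```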
- Open the Docker dashboard and run the image you just created (it should be called `5940`).
- Open VS Code and navigate to the project folder.
- Open the Command Palette (`Ctrl+Shift+P`, or `Cmd+Shift+P` on Mac) and search for: `Remote-Containers: Rebuild and Reopen in Container`
- Select `From 'docker-compose.yml'`; when asked to select additional services, don't check any and click OK.
- VS Code will then build and open the project inside the container.
📌 Note: If you don’t see this option, ensure that the Remote - Containers extension is installed.
Inside the container, install the additional dependencies:

```bash
pip install -U langchain-chroma
pip install pdfminer.six
```
Once the setup is complete, run the chatbot with the following command:
```bash
streamlit run Chatbot_A1.py
```
The test results follow the logic and steps defined in the prompt template.
- Tested with domain-specific questions such as "Who is Ayham Boucher?" (WITHOUT uploading any documents).
- Tested with general questions such as "What is a banana?" and "Who is the first president of the U.S.?" (WITHOUT uploading any documents).
- Tested with general questions such as "What is a banana?" and "Who is the first president of the U.S.?" (AFTER uploading single/multiple documents).
- Tested with domain-specific questions such as "Who is Ayham Boucher?", "What does Ayham teach?", "What is Harvard's school color?", and "When was Cornell built?" (AFTER uploading single/multiple documents).
To start over after testing, delete the `uploaded_docs` and `chroma_db` directories (e.g. `rm -rf uploaded_docs chroma_db`) and re-run the chatbot.
- Upload and process PDF and TXT documents.
- Store document embeddings using ChromaDB.
- Retrieve relevant document sections using similarity search.
- Generate responses using OpenAI's GPT-4o model.
- Maintain conversation history within a session.
- Identify the source of the information used in a response.
- Answer questions even when the retrieved context contains no useful information, either notifying the user that the response is not based on the context or responding with "I don't know".
- Uploaded files are saved in the `uploaded_docs` directory.
- Documents are split into chunks for efficient retrieval.
- Chunks are stored in ChromaDB for similarity search, along with metadata (source file name), as in the sketch below.
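The chunk-and-store step might look like the following sketch, assuming `pdfminer.six` for extraction and `langchain-chroma` with OpenAI embeddings (plus the `langchain-openai` and `langchain-text-splitters` packages). The file names and the `chunk_size`/`chunk_overlap` values are illustrative, not necessarily what `Chatbot_A1.py` uses:

```python
from pdfminer.high_level import extract_text
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Extract raw text from an uploaded PDF (hypothetical file name).
document_text = extract_text("uploaded_docs/example.pdf")

# Split the text into overlapping chunks for retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(document_text)

# Embed the chunks and persist them in ChromaDB,
# tagging each chunk with its source file as metadata.
vectorstore = Chroma(
    collection_name="docs",
    embedding_function=OpenAIEmbeddings(),
    persist_directory="chroma_db",
)
vectorstore.add_texts(chunks, metadatas=[{"source": "example.pdf"}] * len(chunks))
```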
- Document Upload: Users upload PDF/TXT files.
- Text Processing: Extracted text is chunked using `RecursiveCharacterTextSplitter`.
- Vector Storage: Chunks and their source metadata are embedded and stored in ChromaDB.
- Retrieval: Similar document chunks and the corresponding source-file metadata are retrieved based on user queries.
- Generation: Retrieved content is used to generate an answer using GPT-4o, following tailored prompts (see the sketch after this list).
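Here is a sketch of the retrieval and generation steps, reusing the `vectorstore` from the previous sketch; the prompt wording and the `k` value are illustrative, not the exact ones in `Chatbot_A1.py`:

```python
from langchain_openai import ChatOpenAI

question = "When was Cornell built?"  # example user query from the tests above

# Retrieve the chunks most similar to the question, with their source metadata.
docs = vectorstore.similarity_search(question, k=4)
context = "\n\n".join(
    f"[{d.metadata.get('source', 'unknown')}] {d.page_content}" for d in docs
)

# Ask GPT-4o to answer from the context, flagging out-of-context answers.
llm = ChatOpenAI(model="gpt-4o")
prompt = (
    "Answer the question using the context below and name the source file. "
    "If the context is not helpful, state that the response is not based on "
    "the uploaded documents, or reply 'I don't know'.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
answer = llm.invoke(prompt).content
print(answer)
```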
This project is developed as part of INFO-5940 at Cornell University.