# Building a Documentation Chatbot with LangChain

This notebook demonstrates how to build a chatbot that answers questions about a set of documentation files using LangChain.
The chatbot can:
- Parse and preprocess Markdown files.
- Embed document content for efficient similarity-based retrieval.
- Answer detailed, context-aware queries from users.

In [11]:
#!sudo /venv/bin/pip install langchain --quiet
#!sudo /venv/bin/pip install -U langchain-community --quiet
#!sudo /venv/bin/pip install -U langchain-openai --quiet
#!sudo /venv/bin/pip install -U langchain-core --quiet
#!sudo /venv/bin/pip install -U langchainhub --quiet
#!sudo /venv/bin/pip install -U unstructured python-magic pandoc markdown faiss-cpu --quiet
#!sudo /venv/bin/pip install --quiet chromadb
!sudo /venv/bin/pip install -U --quiet unstructured markdown

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging

import helpers.hdbg as hdbg
import langchain
import langchain.chains
import langchain.docstore.document as lngchdocstordoc
import langchain.embeddings
import langchain.hub
import langchain.text_splitter
import langchain_openai
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.vectorstores import FAISS

In [6]:
import langchain_utils as ut

In [19]:
hdbg.init_logger(verbosity=logging.INFO)

_LOG = logging.getLogger(__name__)

INFO  > cmd='/venv/lib/python3.12/site-packages/ipykernel_launcher.py -f /home/.local/share/jupyter/runtime/kernel-1e1769f8-d0de-4cbe-9738-0f57d8b46461.json'


## Define Config

In [8]:
config = {
    # Define language model arguments.
    "language_model": {
        # Define your model here.
        "model": "gpt-4o-mini",
        "temperature": 0,
    },
    # Define input directory path containing documents.
    "source_directory": "example_docs",
    "parse_data_into_chunks": {
        "chunk_size": 500,
        "chunk_overlap": 50,
    },
}

hdbg.dassert_dir_exists(config["source_directory"])

## Setting Up

With the libraries imported and the configuration defined, we can set up the chatbot's components. The chatbot will use:
- OpenAI's GPT-4o-mini as the core language model.
- FAISS for fast document retrieval.
- LangChain utilities for document parsing, text splitting, and chaining.

In [9]:
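# Set the OpenAI API key. `ChatOpenAI` reads it from the `OPENAI_API_KEY`
# environment variable; to set it here instead, add an "open_ai_api_key" entry
# to `config`, then uncomment the two lines below.
#import os
#os.environ["OPENAI_API_KEY"] = config["open_ai_api_key"]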
# Initialize the chat model.
print(config["language_model"])
chat_model = langchain_openai.ChatOpenAI(**config["language_model"])

{'model': 'gpt-4o-mini', 'temperature': 0}


## Parse and Preprocess Documentation

Markdown files serve as the primary data source for this chatbot.
We'll parse the files into LangChain `Document` objects and split them into manageable chunks to ensure efficient retrieval.

In [14]:
md_files = ut.list_markdown_files(config["source_directory"])
print(md_files)

['example_docs/README.md', 'example_docs/code_organization.md', 'example_docs/onboarding/ck.setup_vpn_and_dev_server_access.how_to_guide.md', 'example_docs/onboarding/admin.onboarding_process.reference.md', 'example_docs/onboarding/ck.hiring_process.how_to_guide.md', 'example_docs/onboarding/all.track_time_with_hubstaff.how_to_guide.md', 'example_docs/onboarding/ck.development_setup.how_to_guide.md', 'example_docs/onboarding/all.development_documents.reference.md', 'example_docs/onboarding/all.organize_email.how_to_guide.md', 'example_docs/onboarding/intern.set_up_development_on_laptop.how_to_guide.md', 'example_docs/onboarding/kaizenflow.prepare_for_development.how_to_guide.md', 'example_docs/onboarding/all.dev_must_read_checklist.reference.md', 'example_docs/onboarding/intern.onboarding_checklist.reference.md', 'example_docs/onboarding/all.onboarding_checklist.reference.md', 'example_docs/onboarding/kaizenflow.signing_up.how_to_guide.md', 'example_docs/onboarding/all.receive_crypto_p
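
The implementation of `ut.list_markdown_files` is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch using only the standard library. The function name `list_markdown_files_sketch` and the sorted ordering are assumptions made for illustration; the real helper may differ (its output above is not sorted, for instance).

```python
from pathlib import Path


def list_markdown_files_sketch(directory: str) -> list[str]:
    """Return all `.md` files under `directory`, including subdirectories.

    Hypothetical stand-in for `ut.list_markdown_files`, not its actual code.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {directory}")
    # `rglob` walks the tree recursively; sorting makes the order deterministic.
    return sorted(str(path) for path in root.rglob("*.md") if path.is_file())
```

Calling `list_markdown_files_sketch(config["source_directory"])` would return paths such as `example_docs/README.md`, including files in nested folders like `example_docs/onboarding/`.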

In [20]:
# Initialize with documents.
md_files = ut.list_markdown_files(config["source_directory"])
raw_documents = ut.parse_markdown_files(md_files)
chunked_documents = ut.split_documents(
    raw_documents,
    chunk_size=config["parse_data_into_chunks"]["chunk_size"],
    chunk_overlap=config["parse_data_into_chunks"]["chunk_overlap"],
)

INFO  Found 16 markdown files in example_docs


100%|██████████| 16/16 [00:00<00:00, 37.48it/s]

INFO  Successfully parsed 16/16 files
INFO  Split 16 documents into 184 chunks
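
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the code behind `ut.split_documents`, which is not shown here; LangChain's text splitters also try to break at natural separators such as paragraph boundaries, so real chunks vary in length. The function name and the dictionary layout are assumptions for illustration.

```python
def split_text_sketch(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split `text` into windows of `chunk_size` characters that overlap
    by `chunk_overlap` characters (hypothetical, simplified splitter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start : start + chunk_size]})
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=500` and `chunk_overlap=50`, consecutive windows start 450 characters apart, so the last 50 characters of one chunk are repeated at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.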





In [22]:
print(chunked_documents[1])

page_content='File description

Invariants:

Files are organized by directory (e.g., docs, docs/work_tools)

Each file name uses the Diataxis naming convention

Each file name should be linked to the corresponding file as always

Files are organized in alphabetical order to make it easy to add more files and see which file is missing

Each file has a bullet lists summarizing its content using imperative mode

In docs

docs/README.md

This file

Describe all the available documentation files' metadata={'source': 'example_docs/README.md', 'last_modified': 1743723856.6785784, 'checksum': '96671db8a4b2d1d7be94bbcc56a5f965', 'start_index': 447}


In [None]:
print(chunked_documents[2])

## Create a FAISS Vector Store

To enable fast document retrieval, we'll embed the document chunks using OpenAI's embeddings and store them in a FAISS vector store.

In [27]:
# Initialize OpenAI embeddings (`langchain.embeddings.OpenAIEmbeddings` is
# deprecated in favor of the `langchain_openai` package).
embeddings = langchain_openai.OpenAIEmbeddings()
# Create a FAISS vector store.
vector_store = ut.create_vector_store(chunked_documents, embeddings)
_LOG.info("FAISS vector store created with %d documents.", len(chunked_documents))

INFO  HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO  Created vector store with 184 entries
INFO  FAISS vector store created with 184 documents.
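
Conceptually, similarity-based retrieval means finding the stored vectors closest to the query vector. The sketch below shows the idea as an exhaustive search under Euclidean distance. It is an illustration only: FAISS uses optimized index structures, the distance metric depends on how the index is configured, and the function name and the toy 2-D vectors are made up for this example (real embeddings have hundreds or thousands of dimensions).

```python
import math


def top_k_nearest(query: list[float], vectors: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the `k` vectors closest to `query`.

    Brute-force search under Euclidean distance (illustrative only).
    """
    # Pair each distance with its index so sorting keeps track of positions.
    distances = [(math.dist(query, vector), index) for index, vector in enumerate(vectors)]
    distances.sort()
    return [index for _, index in distances[:k]]


# Toy example with made-up 2-D "embeddings".
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.1]]
nearest = top_k_nearest([0.0, 0.0], toy_vectors, k=2)
```

Here `nearest` holds the indices of the two stored vectors closest to the query. In the chatbot, each index would map back to a document chunk.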


## Build a QA Chain

The `RetrievalQA` chain combines document retrieval with the chat model configured above (`gpt-4o-mini`) for question answering.
It retrieves the document chunks most relevant to a query and passes them to the model as context for generating the answer.

In [24]:
# Build the retriever from the vector store
retriever = ut.build_retriever(vector_store)

# Create the RetrievalQA chain
qa_chain = langchain.chains.RetrievalQA.from_chain_type(
    llm=chat_model, retriever=retriever, return_source_documents=True
)

_LOG.info("RetrievalQA chain initialized.")

INFO  Built retriever with config: {'k': 4}
INFO  RetrievalQA chain initialized.
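
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt. The sketch below illustrates that idea with plain string formatting. The function name and the prompt wording are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuffed_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and a question into one prompt string.

    Hypothetical template; LangChain's own prompt wording differs.
    """
    # Separate chunks with blank lines so the model can tell them apart.
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because every retrieved chunk ends up in one prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window each query consumes.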


## Query the Chatbot

Let's interact with the chatbot! We'll ask it questions based on the documentation.
The chatbot will retrieve relevant chunks and generate context-aware responses.

In [26]:
# Define a user query.
#query = "What are the guidelines for setting up a new project?"
query = "Is there any mention of Diataxis?"

# Query the chatbot (`invoke` replaces the deprecated `qa_chain(...)` call style).
response = qa_chain.invoke({"query": query})

# Display the answer and source documents.
print(f"Answer:\n{response['result']}\n")
print("Source Documents:")
for doc in response["source_documents"]:
    print(f"- Source: {doc.metadata['source']}")
    print(f"  Excerpt: {doc.page_content[:200]}")

INFO  HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO  HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Answer:
The provided context does not mention Diataxis beyond stating that file names use the Diataxis naming convention.

Source Documents:
- Source: example_docs/README.md
  Excerpt: File description

Invariants:

Files are organized by directory (e.g., docs, docs/work_tools)

Each file name uses the Diataxis naming convention

Each file name should be linked to the corresponding 
- Source: example_docs/onboarding/kaizenflow.prepare_for_development.how_to_guide.md
  Excerpt: Happy coding!

Technologies used

UMD DATA605 Big Data Systems contains lectures and tutorials about most of the technologies we use in KaizenFlow, e.g., Dask, Docker, Docker Compose, Git, github, Jup
- Source: example_docs/onboarding/admin.onboarding_process.reference.md
  Excerpt: Google form: ?

CV / LinkedIn: ?

GitHub handle: ?

Email: ?

Devops

## Dynamic Updates

What if the documentation changes? We handle this by checking the source folder for new or modified files.
The vector store is then updated so that the chatbot stays up to date.

In [None]:
ut.update_vector_store_from_changes(
    config,
    vector_store,
    embeddings,
)
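
The internals of `ut.update_vector_store_from_changes` are not shown here. One common way to detect new or modified files is to compare content checksums against those recorded on the previous run (the chunk metadata printed earlier already carries a `checksum` field). The sketch below illustrates that approach; the function names, the use of SHA-256, and the returned dictionary layout are all assumptions for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return a hex digest of the file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_changes(directory: str, known: dict[str, str]) -> dict[str, list[str]]:
    """Compare current Markdown files against previously recorded checksums.

    `known` maps file path -> checksum from the last run (hypothetical format).
    """
    current = {
        str(path): file_checksum(path)
        for path in Path(directory).rglob("*.md")
        if path.is_file()
    }
    return {
        # Files that were not seen on the previous run.
        "new": sorted(p for p in current if p not in known),
        # Files whose contents differ from the recorded checksum.
        "modified": sorted(p for p in current if p in known and current[p] != known[p]),
        # Files that were recorded before but no longer exist.
        "deleted": sorted(p for p in known if p not in current),
    }
```

Only the files reported as new or modified would need to be re-parsed, re-chunked, and re-embedded, which avoids paying for embeddings of unchanged documents.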

## Enhancements: Personalization

We can extend the chatbot to include personalized responses:
- Filter documents by metadata (e.g., tags, categories).
- Customize responses based on user preferences.

For example, users can ask for specific sections of the documentation or request summaries tailored to their needs.
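
As a rough illustration of metadata filtering, the sketch below keeps only the documents whose metadata matches a set of key/value pairs. It uses plain dictionaries in place of LangChain `Document` objects, and the `category` key is hypothetical: the chunks in this notebook carry `source`, `last_modified`, `checksum`, and `start_index`, so tags or categories would have to be added during parsing.

```python
def filter_by_metadata(documents: list[dict], **required: object) -> list[dict]:
    """Keep documents whose metadata matches every `key=value` pair given.

    Documents are plain dicts with an optional "metadata" dict (illustrative
    stand-in for LangChain `Document` objects).
    """
    return [
        doc
        for doc in documents
        if all(doc.get("metadata", {}).get(key) == value for key, value in required.items())
    ]


# Toy documents with a hypothetical "category" metadata key.
toy_docs = [
    {"page_content": "VPN setup", "metadata": {"category": "onboarding"}},
    {"page_content": "Code layout", "metadata": {"category": "reference"}},
]
onboarding_docs = filter_by_metadata(toy_docs, category="onboarding")
```

Filtering before or during retrieval narrows the candidate chunks, so a query about onboarding is not answered from unrelated reference material.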

In [28]:
# Example query with personalized intent.
personalized_query = "Show me onboarding guidelines for new employees."

# Query the chatbot (`invoke` replaces the deprecated `qa_chain(...)` call style).
personalized_response = qa_chain.invoke({"query": personalized_query})

# Display the personalized response.
print(f"Answer:\n{personalized_response['result']}\n")
print("Source Documents:")
for doc in personalized_response["source_documents"]:
    print(f"- Source: {doc.metadata['source']}")
    print(f"  Excerpt: {doc.page_content[:200]}")

INFO  HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO  HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Answer:
I don't know.

Source Documents:
- Source: example_docs/onboarding/all.onboarding_checklist.reference.md
  Excerpt: Onboarding Checklist

Onboarding process for a new team member

Meta

Make on-boarding automatic

Be patient

Ask for confirmation

Make on-boarding similar to our work routine

Improve on-boarding pr
- Source: example_docs/onboarding/all.onboarding_checklist.reference.md
  Excerpt: Ask for confirmation of all the actions, e.g.,

"Does this and that work?"

"Did you receive the email?"

"Can you log in?"

Make the new team member follow the instructions so that they can get famil
- Source: example_docs/onboarding/ck.hiring_process.how_to_guide.md
  Excerpt: Follow the instructions in all.onboarding_checklist.reference.md

HiringMeister: once the full onboarding is complete, organize more complex tasks

## Summary

In this notebook, we:
1. Parsed Markdown documentation and split it into chunks.
2. Embedded the chunks into a FAISS vector store for similarity-based retrieval.
3. Built a `RetrievalQA` chain for context-aware question answering.
4. Showed how to update the vector store when the documentation changes.
5. Outlined how the chatbot could be extended with personalized query handling.

Together, these steps show how LangChain can be used to build a chatbot tailored to a specific documentation set.