# Tutorial For Langchain

The notebook shows how to use langchain API with an example. In this example we will be building chatbot for the internal documentation using lancgchain.

Refernces: 
 - Official docs : https://python.langchain.com/docs/introduction/

## Imports

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [15]:
import os
import glob
import logging

import langchain_openai as langOpenAI
import langchain.document_loaders as docloader
import langchain.docstore.document as docstore
import langchain.text_splitter as txtsplitter
import langchain.embeddings as lang_embeddings
import langchain.vectorstores as vectorstores
import langchain.chains as chains
import langchain.chat_models as chatmodels

from typing import List

import helpers.hsystem as hsystem
import helpers.hprint as hprint
import helpers.hdbg as hdbg
import helpers.hpandas as hpanda


In [64]:
hdbg.init_logger(verbosity=logging.INFO)

_LOG = logging.getLogger(__name__)



### Define the GPT Model to use.

In [19]:
os.environ["OPENAI_API_KEY"] = ""

chat_model = langOpenAI.ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)

In [9]:
def parse_markdown_files(file_paths) -> List[docstore.Document]:
    """
    Parse all the markdown files into Documents.

    :param file_paths: list of md file_paths
    """
    documents = []
    for file_path in file_paths:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        # Create a Document object for each file
        documents.append(docstore.Document(page_content=content, metadata={"source": file_path}))
    return documents

In [10]:
def list_markdown_files(directory:str) -> List[str]:
    return list(glob.glob(f"{directory}/*.md"))

### RecursiveCharacterTextSplitter 
Utility function in LangChain  for splitting large chunks of text into smaller more manageable pieces while ensuring minimal overlap or fragmentation of meaningful content.


In [31]:
# Directory containing Markdown files
directory = "../../docs"

# List Markdown files
markdown_files = list_markdown_files(directory)

# Parse Markdown files into LangChain documents
documents = parse_markdown_files(markdown_files)

# Split long documents into smaller chunks
text_splitter = txtsplitter.RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(documents)

# Print sample chunked documents
for doc in split_documents[:5]:
    _LOG.info("Source: %s", {doc.metadata['source']})
    _LOG.info("Content: %s", {doc.page_content})

INFO  Source: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Content: {'<!-- toc -->\n\n- [Tutorials "Learn X in 60 minutes"](#tutorials-learn-x-in-60-minutes)\n  * [What are the goals for each tutorial](#what-are-the-goals-for-each-tutorial)\n\n<!-- tocstop -->\n\n# Tutorials "Learn X in 60 minutes"\n\nThe goal is to give everything needed for one person to become familiar with a\nBig data / AI / LLM / data science technology in 60 minutes.\n\n- Each tutorial conceptually corresponds to a blog entry.'}
INFO  Source: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Content: {'Each tutorial corresponds to a directory in the `//tutorials` repo\n[https://github.com/causify-ai/tutorials](https://github.com/causify-ai/tutorials)\nwith'}
INFO  Source: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Content: {'- A markdown \\`XYZ.API.md\\` about the API and the software layer written by us\n  on top of the native API\n- A markdown `XYZ.example.md` with a

### VECTOR STORES

#### FAISS (Facebook AI Similarity Search) 
It is a library designed for efficient similarity search and clustering of dense vectors. In LangChain, FAISS is commonly used as a vector store to store and retrieve embeddings, which are vector representations of text or other data.


In [32]:
# Initialize embeddings
embeddings = lang_embeddings.OpenAIEmbeddings()

# Embed and store split_documents
vector_store = vectorstores.FAISS.from_documents(split_documents, embeddings)

retriever = vector_store.as_retriever()

INFO  HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [33]:
# Create the QA chain
qa_chain = chains.RetrievalQA.from_chain_type(
    llm=chat_model,
    retriever=retriever,
    return_source_documents=True
)

In [34]:
# User's question
query = "What are the guidelines on creating new project"

# Get the answer and source documents
result = qa_chain({"query": query})

# Print the answer
_LOG.info("Answer: %s", result['result'])

# Print the source file references
_LOG.info("\nSource Documents:")
for doc in result['source_documents']:
    _LOG.info("File: %s", {doc.metadata['source']})
    _LOG.info("Excerpt: %s", {doc.page_content[:200]})

  result = qa_chain({"query": query})


INFO  HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO  HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO  Answer: The guidelines for creating a new project based on the Markdown documents should cover various aspects such as the package description, problem it solves, alternatives, native API description, Docker container details, visual aids using tools like mermaid, references to relevant resources, and ensuring that Jupyter notebooks are unit tested, self-contained, and run end-to-end after a restart. These guidelines aim to provide a comprehensive and well-documented approach to developing projects in areas like Git, Docker, databases, workflow management tools, and more.
INFO  
Source Documents:
INFO  File: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Excerpt: {'Markdown documents should cover:'}
INFO  File: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Excerpt: {'This is the same appr

In [38]:
def get_vectors_by_document_name(vector_store: vectorstores.FAISS, document_name: str) -> List:
    """
    Retrieve vectors from a FAISS vector store based on the document name.

    :param vector_store: FAISS vector store object that supports similarity search.
    :param document_name:  name of the document used as a filter in the metadata.

    :return: list of results from the FAISS vector store that match the given document name.
    """
    # Query using the metadata field source
    results = vector_store.similarity_search(
        # Pass an empty query or a dummy vector if supported
        query="",
        # Retrieve all matching documents
        k=None,
        # Filter by the document name
        filter={"source": document_name} 
    )
    return results


In [39]:
# Example usage
document_name = "all.how_write_tutorials.how_to_guide.md"
results = get_vectors_by_document_name(vector_store, document_name)

# Print results
for doc in results:
    _LOG.info("File: %s", {doc.metadata['source']})
    _LOG.info("Content: %s", {doc.page_content[:200]})

INFO  HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


### Demo to create a documentation QA bot but the docs can be updated or deleted.

In [40]:
# Initialize some state.
vector_store = None
folder = "../docs"
filename_to_md5sum = {}

In [41]:
from typing import List
from langchain.schema import Document

def parse_markdown_files(file_paths: List[str]) -> List[Document]:
    """
    Parse and structure Markdown files into LangChain Document objects.

    :param file_paths: list of file paths to the Markdown files

    :return: list of Document objects, where each document contains the content
                        of a Markdown file and metadata with the file's source path.
    """
    documents = []
    filename_to_md5sum = {}
    for file_path in file_paths:
        # Read the content of the Markdown file
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()       
        # Compute the MD5 checksum of the file
        md5sum, _ = hsystem.system_to_string(f"md5sum {file_path}")[1].split()
        filename_to_md5sum[file_path] = md5sum 
        # Create a Document object for each file
        documents.append(Document(page_content=content, metadata={"source": file_path}))
    
    return documents


In [55]:
def create_vector_store_from_markdown_files(folder):
    # List Markdown files
    markdown_files = list_markdown_files(directory)
    # Parse Markdown files into LangChain documents
    documents = parse_markdown_files(markdown_files)
    # Split long documents into smaller chunks
    text_splitter = txtsplitter.RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    split_documents = text_splitter.split_documents(documents)
    # Create embeddings for all documents.
    vector_store = vectorstores.Chroma.from_documents(split_documents, embeddings)
    return vector_store

In [56]:
def get_changes_in_documents_folder(folder):
    # List Markdown files
    markdown_files = list_markdown_files(folder)
    changes = {}
    changes["modified"] = []
    for file_path in markdown_files:
        md5sum, _ = hsystem.system_to_string(f"md5sum {file_path}")[1].split()
        if file_path not in filename_to_md5sum or filename_to_md5sum[file_path] == md5sum:
            print(f"Found a new / modified file {file_path}")
            changes["modified"].append(file_path)
    return changes

In [57]:
def update_files_in_vector_store(vector_store, files):
    if len(files) == 0:
        print("No new files found")
        return
    ids_to_delete = []
    for file in files:
        for doc in vector_store:
            if doc.metadata.get('source') == file:
                ids_to_delete.append(doc.id)
    vector_store.delete(ids_to_delete)
    documents = parse_markdown_files(files)
    # Split long documents into smaller chunks
    text_splitter = txtsplitter.RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    split_documents = text_splitter.split_documents(documents)
    texts = [doc.page_content for doc in split_documents]
    embeddings_list = embeddings.embed_documents(texts)  # Compute embeddings for multiple documents
    # Add documents to vector store with computed embeddings
    vector_store.add_documents(
        documents=split_documents,
        embeddings=embeddings_list
    )
    return vector_store

In [58]:
query = "What are the goals for tutorial project?"

In [59]:
if vector_store:
    changes = get_changes_in_documents_folder(folder)
    vector_store = update_files_in_vector_store(vector_store, changes["modified"])
else:
    vector_store = create_vector_store_from_markdown_files(folder)

INFO  Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
INFO  HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [62]:
# Create the QA chain
qa_chain = chains.RetrievalQA.from_chain_type(
    llm=chat_model,
    retriever=retriever,
    return_source_documents=True
)

In [63]:
# Get the answer and source documents
result = qa_chain({"query": query})

# Print the answer
_LOG.info("Answer: %s", result['result'])

# Print the source file references
_LOG.info("\nSource Documents:")
for doc in result['source_documents']:
    _LOG.info("File: %s", {doc.metadata['source']})
    _LOG.info("Excerpt: %s", {doc.page_content[:200]})

INFO  HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO  HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO  Answer: The goals for the tutorial project are to provide everything needed for one person to become familiar with a Big data / AI / LLM / data science technology in 60 minutes. This includes creating tutorials that are conceptually like blog entries, providing markdown files about the API and software layer, offering examples of applications using the API, supplying Docker containers in a specific format, and including Jupyter notebooks with examples of APIs and full examples. The tutorials should be unit tested, self-contained, linear, and run end-to-end after a restart to ensure they work properly and are easy to follow.
INFO  
Source Documents:
INFO  File: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Excerpt: {'<!-- toc -->\n\n- [Tutorials "Learn X in 60 minutes"](#tutorials-learn-x-in-60-minutes)