# Build an AI RAG Assistant Using LangChain

Imagine you work as a consultant for Quest Analytics, a small but fast-growing research organization.

In today’s fast-paced research environment, the sheer volume of scientific papers can be overwhelming, making it nearly impossible to stay up-to-date with the latest developments. 

The researchers at Quest Analytics have been struggling to find the time to examine countless documents, let alone extract the most relevant and insightful information. 

You have been hired to build an AI RAG assistant that can read, understand, and summarize vast amounts of data in real time. Complete the tasks below to construct the AI-powered RAG assistant and streamline the research work at Quest Analytics.

## Task 1: Load documents from different sources using LangChain

In [1]:
from langchain_community.document_loaders import PyPDFLoader

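# Path to a local copy of the paper
pdf_path = "A_Comprehensive_Review_of_Low_Rank_Adaptation_in_Large_Language_Models.pdf"

loader = PyPDFLoader(pdf_path)
# load_and_split() loads the PDF and splits it with LangChain's default text splitter
pages = loader.load_and_split()
print(pages[0].page_content[:1000])  # Print the first 1000 characters of the first chunk for brevity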



A Comprehensive Review of Low-Rank
Adaptation in Large Language Models for
Efficient Parameter Tuning
September 10, 2024
Abstract
Natural Language Processing (NLP) often involves pre-training large
models on extensive datasets and then adapting them for specific tasks
through fine-tuning. However, as these models grow larger, like GPT-3
with 175 billion parameters, fully fine-tuning them becomes computa-
tionally expensive. We propose a novel method called LoRA (Low-Rank
Adaptation) that significantly reduces the overhead by freezing the orig-
inal model weights and only training small rank decomposition matrices.
This leads to up to 10,000 times fewer trainable parameters and reduces
GPU memory usage by three times. LoRA not only maintains but some-
times surpasses fine-tuning performance on models like RoBERTa, De-
BERTa, GPT-2, and GPT-3. Unlike other methods, LoRA introduces
no extra latency during inference, making it more efficient for practical
applications. All relevant code an

## Task 2: Apply text splitting techniques

In [2]:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Raw string: without the r prefix, "\b" in "\begin" is read as a backspace character
latex_text = r"""
    \documentclass{article}

    \begin{document}

    \maketitle

    \section{Introduction}

    Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. 
    In recent years, LLMs have made significant advances in various natural language processing tasks, including language translation, text generation, 
    and sentiment analysis.

    \subsection{History of LLMs}

    The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational 
    power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets,
    leading to significant improvements in performance.

    \subsection{Applications of LLMs}

    LLMs have many applications in the industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for 
    research in linguistics, psychology, and computational linguistics.

    \end{document}
"""

text_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.LATEX,
    chunk_size=100,
    chunk_overlap=0,
    length_function=len
)

texts = text_splitter.create_documents([latex_text])
texts

[Document(metadata={}, page_content='\\documentclass{article}\n\n    \\begin{document}\n\n    \\maketitle\n\n    \\section{Introduction}'),
 Document(metadata={}, page_content='Large language models (LLMs) are a type of machine learning model that can be trained on vast'),
 Document(metadata={}, page_content='amounts of text data to generate human-like language. \n    In recent years, LLMs have made'),
 Document(metadata={}, page_content='significant advances in various natural language processing tasks, including language translation,'),
 Document(metadata={}, page_content='text generation, \n    and sentiment analysis.\n\n    \\subsection{History of LLMs}\n\n    The earliest'),
 Document(metadata={}, page_content='LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could'),
 Document(metadata={}, page_content='be processed and the computational \n    power available at the time. In the past decade, however,'),
 Document(metadata={}, page_cont
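
To make the effect of `chunk_size` and the separator order more concrete, here is a simplified pure-Python sketch of recursive character splitting. It is not LangChain's implementation: the function name `recursive_split` and the separator list are illustrative assumptions, and separators at chunk borders are dropped rather than kept.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Simplified illustration only: tries the coarsest separator first and
    falls back to finer ones (and finally to a hard cut) for oversized pieces.
    """
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: cut at a fixed width.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)       # close the open chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph about LoRA.\n\nSecond paragraph, which is a bit longer than the first one."
result = recursive_split(sample, chunk_size=40)
for chunk in result:
    print(repr(chunk))
```

The first paragraph fits in one chunk, while the second exceeds 40 characters and is therefore split again at word boundaries. The same fallback idea explains why the 100-character chunks above break in the middle of sentences.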

## Task 3: Embed documents

In [3]:
from langchain_huggingface import HuggingFaceEmbeddings

## Embedding model
def embedding():
    """Return a sentence-transformers embedding model that runs on the CPU."""
    model = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': True},  # return unit-length vectors
    )
    return model

query = "How are you?"
embedding_model = embedding()
query_embedding = embedding_model.embed_query(query)
query_embedding[:5]





[0.007003862410783768,
 0.010914131067693233,
 0.08746250718832016,
 0.08679929375648499,
 0.026648471131920815]
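
Because `encode_kwargs={'normalize_embeddings': True}` asks for unit-length vectors, the dot product of two embeddings equals their cosine similarity. The sketch below checks this with plain Python; the 3-dimensional vectors are made up for illustration and stand in for real model output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (assumed values, not produced by the embedding model)
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(round(cosine_similarity(a, b), 4))  # cosine similarity of the raw vectors
print(round(dot(a_unit, b_unit), 4))      # dot product of the normalized vectors
```

Both lines print the same value, which is why a vector store can rank normalized embeddings with a cheap dot product.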

## Task 4: Create and configure vector databases to store embeddings

In [4]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma

# RecursiveCharacterTextSplitter and embedding_model are reused from the cells above

# Load text data
loader = TextLoader("new-Policies.txt")
data = loader.load()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

chunks = text_splitter.split_documents(data)

# Embed the chunks and store them in Chroma under explicit string IDs
ids = [str(i) for i in range(len(chunks))]
vectordb = Chroma.from_documents(chunks, embedding_model, ids=ids)

# Similarity search
query = "Smoking policy"
docs = vectordb.similarity_search(query, k=5)
docs

[Document(metadata={'source': 'new-Policies.txt'}, page_content='to our success. We regularly review and update this policy to incorporate best practices in'),
 Document(metadata={'source': 'new-Policies.txt'}, page_content='4. Mobile Phone Policy'),
 Document(metadata={'source': 'new-Policies.txt'}, page_content='This policy encourages the responsible use of mobile devices in line with legal and ethical'),
 Document(metadata={'source': 'new-Policies.txt'}, page_content='Consequences: Non-compliance with this policy may result in disciplinary actions, including'),
 Document(metadata={'source': 'new-Policies.txt'}, page_content='Consequences: Violations of this policy may lead to disciplinary action, including potential')]
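
Conceptually, `similarity_search(query, k=5)` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below mimics that with a plain dictionary and an exhaustive scan. The helper `top_k`, the first sample text, and the 2-dimensional vectors are invented for illustration; a real vector store such as Chroma typically relies on an index rather than scanning every entry.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, store, k):
    """Return the k (id, text) pairs most similar to query_vec, best first."""
    scored = (
        (cosine(query_vec, vec), doc_id, text)
        for doc_id, (text, vec) in store.items()
    )
    return [(doc_id, text) for _, doc_id, text in heapq.nlargest(k, scored)]


# Toy store: id -> (chunk text, made-up embedding)
store = {
    "0": ("Smoking is restricted to designated areas.", [0.9, 0.1]),
    "1": ("4. Mobile Phone Policy", [0.2, 0.8]),
    "2": ("3. Internet and Email Policy", [0.1, 0.9]),
}
query_vec = [1.0, 0.0]  # pretend embedding of the query "Smoking policy"
print(top_k(query_vec, store, k=2))
```

With 100-character chunks, many stored chunks contain only generic wording, which may be why the results above mention policies in general rather than smoking. Trying a larger `chunk_size` is a reasonable next experiment.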

## Task 5: Develop a retriever to fetch document segments based on queries

In [5]:
query = "Email policy"
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'new-Policies.txt'}, page_content='3. Internet and Email Policy'),
 Document(metadata={'source': 'new-Policies.txt'}, page_content='Compliance: Adhere to all relevant laws and regulations concerning internet and email use,')]

## Task 6: Construct a QA bot that uses LangChain and an LLM to answer questions

In [None]:
import os
import warnings

import gradio as gr
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Suppress warnings generated by the libraries to keep the output readable
def warn(*args, **kwargs):
    pass
warnings.warn = warn
warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv("../.env")

## LLM
def get_llm():
    model_id = 'google/flan-t5-large'
    
    # Get HuggingFace API key from environment
    hf_api_key = os.getenv('HUGGING_FACE_API_KEY')
    
    # Create tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_api_key)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, token=hf_api_key)
    
    # Create text generation pipeline
    text_generation_pipeline = pipeline(
        "text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=256,
        temperature=0.5,
        do_sample=True,
    )
    
    # Create LangChain HuggingFace LLM
    hf_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
    return hf_llm

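## Document loader
def document_loader(file):
    # With type="filepath", Gradio may pass a plain path string or an object with a .name attribute
    path = getattr(file, "name", file)
    loader = PyPDFLoader(path)
    loaded_document = loader.load()
    return loaded_document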

## Text splitter
def text_splitter(data):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
    )
    chunks = splitter.split_documents(data)
    return chunks

## Vector db
def vector_database(chunks):
    embedding_model = hf_embedding()
    vectordb = Chroma.from_documents(chunks, embedding_model)
    return vectordb

## Embedding model
def hf_embedding():
    
    hf_embedding = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': True},
    )
    return hf_embedding

## Retriever
def retriever(file):
    documents = document_loader(file)
    chunks = text_splitter(documents)
    vectordb = vector_database(chunks)
    return vectordb.as_retriever()

## QA Chain
def retriever_qa(file, query):
    # Note: the model is loaded and the PDF is re-indexed on every call
    llm = get_llm()
    retriever_obj = retriever(file)
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever_obj,
        return_source_documents=False,
    )
    response = qa.invoke(query)
    return response['result']

# Create Gradio interface
rag_application = gr.Interface(
    fn=retriever_qa,
    allow_flagging="never",
    inputs=[
        gr.File(label="Upload PDF File", file_count="single", file_types=['.pdf'], type="filepath"),  # Drag and drop file upload
        gr.Textbox(label="Input Query", lines=2, placeholder="Type your question here...")
    ],
    outputs=gr.Textbox(label="Output"),
    title="RAG Chatbot",
    description="Upload a PDF document and ask any question. The chatbot will try to answer using the provided document."
)

# Launch the app
rag_application.launch(server_name="127.0.0.1", server_port=7860)

* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.




Device set to use cpu
Device set to use cpu
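
The `chain_type="stuff"` setting means that all retrieved chunks are placed ("stuffed") into a single prompt together with the question. A rough sketch of that idea follows. The template wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual prompt; the sample chunks are taken from the retriever output in Task 5.

```python
def build_stuff_prompt(question, chunks):
    """Join retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    "3. Internet and Email Policy",
    "Compliance: Adhere to all relevant laws and regulations concerning internet and email use,",
]
prompt = build_stuff_prompt("What does the email policy require?", retrieved)
print(prompt)
```

Because everything goes into one prompt, the combined chunks have to fit within the model's input limit. That is worth keeping in mind when choosing `chunk_size` and the number of retrieved chunks for a small model such as `google/flan-t5-large`.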
