# Retrieval-augmented generation (RAG)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/question_answering/qa.ipynb)

## Use case
Suppose you have some text documents (PDFs, blog posts, Notion pages, etc.) and want to ask questions about their contents.

LLMs, given their proficiency in understanding text, are a great tool for this.

In this walkthrough, we'll build an application that answers questions over documents using LLMs.

Two closely related use cases, covered elsewhere, are:
- [QA over structured data](https://python.langchain.com/docs/use_cases/qa_structured/sql) (e.g., SQL)
- [QA over code](https://python.langchain.com/docs/use_cases/question_answering/code_understanding) (e.g., Python)

![intro.png](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/qa_intro.png?raw=true)

## Overview
The pipeline for converting raw unstructured data into a QA chain looks like this:
1. `Loading`: First we need to load our data. Use the [LangChain integration hub](https://integrations.langchain.com/) to browse the full set of loaders.
2. `Splitting`: [Text splitters](/docs/modules/data_connection/document_transformers/) break `Documents` into splits of a specified size.
3. `Storage`: Storage (often a [vectorstore](/docs/modules/data_connection/vectorstores/)) houses [and often embeds](https://www.pinecone.io/learn/vector-embeddings/) the splits.
4. `Retrieval`: The app retrieves splits from storage (often those [with embeddings similar](https://www.pinecone.io/learn/k-nearest-neighbor/) to the input question).
5. `Generation`: An [LLM](/docs/modules/model_io/models/llms/) produces an answer using a prompt that includes the question and the retrieved data.

![flow.jpeg](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/qa_flow.jpeg?raw=true)
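
To make the five steps concrete before bringing in any libraries, here is a minimal standard-library sketch of the same flow. Everything in it is illustrative and assumed for the example: the function names, the two sample texts, and the word-overlap scoring that stands in for embedding similarity. None of it is LangChain API.

```python
# Toy end-to-end RAG flow using only the standard library.
# All names and sample texts are illustrative stand-ins.

def load() -> list[str]:
    # 1. Loading: in practice a document loader reads files or web pages.
    return [
        "Chroma is a vector store. It keeps embeddings next to the text.",
        "Text splitters break long documents into smaller chunks.",
    ]

def split(documents: list[str]) -> list[str]:
    # 2. Splitting: naively, one split per sentence.
    return [s.strip() for d in documents for s in d.split(".") if s.strip()]

def tokens(text: str) -> set[str]:
    return set(text.lower().split())

def retrieve(store: list[str], question: str, k: int = 1) -> list[str]:
    # 4. Retrieval: rank splits by word overlap (a stand-in for embedding similarity).
    q = tokens(question)
    return sorted(store, key=lambda s: len(q & tokens(s)), reverse=True)[:k]

def build_prompt(question: str, context: list[str]) -> str:
    # 5. Generation: a real app would send this prompt to an LLM.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"

store = split(load())  # 3. Storage: a plain list stands in for a vector store
question = "What do text splitters do?"
top = retrieve(store, question)
prompt = build_prompt(question, top)
print(prompt)
```

The rest of this walkthrough replaces each toy function with a real component: a document loader, a text splitter, a vector store, a retriever, and an LLM.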


In [16]:
!pip install -qq -r requirements.txt

In [None]:
!pip install -qq "unstructured[all-docs]"

# 1. Load data - connect to external documents - Model Cards

In [5]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader('model_cards/', glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()
len(docs)

6

In [6]:
loader.load()

 Document(metadata={'source': 'model_cards/123abhiALFLKFO/distilbert-base-uncased-finetuned-cola/README.md'}, page_content='---\nlicense: apache-2.0\ntags:\n- generated_from_trainer\ndatasets:\n- glue\nmetrics:\n- matthews_correlation\nmodel_index:\n- name: distilbert-base-uncased-finetuned-cola\n  results:\n  - task:\n      name: Text Classification\n      type: text-classification\n    dataset:\n      name: glue\n      type: glue\n      args: cola\n    metric:\n      name: Matthews Correlation\n      type: matthews_correlation\n      value: 0.5331291095663535\n---\n\n<!-- This model card has been generated automatically according to the information the Trainer had access to. You\nshould probably proofread and complete it, then remove this comment. -->\n\n# distilbert-base-uncased-finetuned-cola\n\nThis model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the glue dataset.\nIt achieves the following results on the evaluation set

# 2. Splitting - break documents into splits of specified size

In [7]:
# Split documents

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                               chunk_overlap=100,
                                               separators=['\n', '.'])
# Reuse the documents loaded above instead of reading the files again
splits = text_splitter.split_documents(docs)

In [8]:
len(splits)

126

In [9]:
splits[0]

Document(metadata={'source': 'model_cards/openai-community/roberta-base-openai-detector/README.md'}, page_content='---\nlanguage: en\nlicense: mit\ntags:\n- exbert\ndatasets: \n- bookcorpus\n- wikipedia\n---\n\n# RoBERTa Base OpenAI Detector\n\n## Table of Contents\n- [Model Details](#model-details)\n- [Uses](#uses)\n- [Risks, Limitations and Biases](#risks-limitations-and-biases)\n- [Training](#training)\n- [Evaluation](#evaluation)\n- [Environmental Impact](#environmental-impact)\n- [Technical Specifications](#technical-specifications)\n- [Citation Information](#citation-information)\n- [Model Card Authors](#model-card-author)')

In [10]:
splits[4]

Document(metadata={'source': 'model_cards/openai-community/roberta-base-openai-detector/README.md'}, page_content='- **Language(s):** English\n- **License:** MIT\n- **Related Models:** [RoBERTa base](https://huggingface.co/roberta-base), [GPT-XL (1.5B parameter version)](https://huggingface.co/gpt2-xl), [GPT-Large (the 774M parameter version)](https://huggingface.co/gpt2-large), [GPT-Medium (the 355M parameter version)](https://huggingface.co/gpt2-medium) and [GPT-2 (the 124M parameter version)](https://huggingface.co/gpt2)\n- **Resources for more information:**')
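
To build intuition for the `chunk_size` and `chunk_overlap` arguments used above, the sketch below cuts a string into fixed-size character windows that share a few characters with their neighbours. It is a simplification written for this example: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to break on the given separators rather than at arbitrary character positions.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

The overlap means text that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some characters twice.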

# 3. Storage - the vector store that embeds the splits

In [None]:
# Embed and store splits
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

# Downloading embedding model
embedding_model = HuggingFaceInstructEmbeddings(
    model_name = "hkunlp/instructor-large",
    embed_instruction = "Represent the model cards for retrieval: ",
    query_instruction = 'Represent the user question for retrieving supporting documents: ',
    model_kwargs = {'device': 'cuda'}
)

vectorstore = Chroma.from_documents(documents=splits, embedding=embedding_model)

# 4. Retrieval - retrieve splits from storage

In [None]:
retriever = vectorstore.as_retriever()

In [None]:
vectorstore._collection.get()

In [None]:
vectorstore._collection.get(ids=['dfa6e198-800c-11ee-96f8-0242ac1c000c'], include=['embeddings', 'documents'])

# 5. Generation - produce an answer from the question and the retrieved splits

In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
access_token = ""  # Hugging Face access token; the Llama 2 weights are gated
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             device_map='auto',
                                             use_auth_token=access_token)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_auth_token=access_token)

In [None]:
# Test the model
test_prompt = """### Instruction: What is the three step training procedure for modern transformer-based LLMs?

### Answer:
"""

encoded_instruction = tokenizer(test_prompt,
                                return_tensors='pt',
                                add_special_tokens=True)
model_inputs = encoded_instruction.to(device)
generated_ids = model.generate(**model_inputs,
                               max_new_tokens=1000,
                               do_sample=True,
                               pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

In [None]:
import transformers

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000
)
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

In [None]:
from langchain.chains import RetrievalQA

# Prompt Template
qa_template = """### Instruction: You are a helpful assistant.
Use the following context to answer the question below.
If you don't know the answer or the context does not help you answer the question, please say "I don't know".

{context}

{question}

### Answer: """

# Create a prompt instance
QA_prompt = PromptTemplate.from_template(qa_template)

# Instantiate the QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": QA_prompt},
    return_source_documents=True
)
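
As a rough illustration of what happens to the `{context}` and `{question}` placeholders, the sketch below fills the same template by hand. It assumes the retrieved splits are simply concatenated into one context string; the `fill_prompt` helper, the blank-line separator, and the two sample splits are made up for this example and are not taken from the chain's implementation.

```python
qa_template = """### Instruction: You are a helpful assistant.
Use the following context to answer the question below.
If you don't know the answer or the context does not help you answer the question, please say "I don't know".

{context}

{question}

### Answer: """

def fill_prompt(template: str, retrieved: list[str], question: str) -> str:
    # Join the retrieved splits into one context string, then substitute both placeholders.
    context = "\n\n".join(retrieved)
    return template.format(context=context, question=question)

# Made-up splits standing in for retriever output
retrieved = ["Split one about training data.", "Split two about evaluation."]
prompt = fill_prompt(qa_template, retrieved, "What data was the model trained on?")
print(prompt)
```

Because every retrieved split is placed in the prompt, the number and size of splits directly affect how much of the model's context window is used.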

In [None]:
# Enter a Question
question = "What is the three step training procedure for modern transformer-based LLMs?"
#question = "Does the model use high accuracy?"
#question = "Which is the model that has higher accuracy ?"

# Query Llama 2 7B Chat with the RAG pipeline
response = qa_chain({'query': question})

# Print your result
print(response['result'])

In [None]:
response['source_documents']

# How to compose tools (LLM agents) to tackle complex tasks (recommending AI Product components - data/model cards)?

## Step 1. Load

Specify a `DocumentLoader` to load in your unstructured data as `Documents`.

A `Document` is an object that holds text (`page_content`) and a `metadata` dict.
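
As a rough mental model, the pairing of text and metadata can be pictured with a small dataclass. `SimpleDocument` is a hypothetical stand-in written for this example, not LangChain's actual class, and the URL is a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = SimpleDocument(
    page_content="Some text extracted from a web page.",
    metadata={"source": "https://example.com/post"},  # placeholder URL
)
print(doc.metadata["source"], len(doc.page_content))
```

Keeping the source in `metadata` is what later lets an application show where an answer came from.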

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives")
data = loader.load()

### Go deeper
- Browse more than 160 data loader integrations [here](https://integrations.langchain.com/).
- See further documentation on loaders [here](/docs/modules/data_connection/document_loaders/).

## Step 2. Split

Split the `Document` into chunks for embedding and vector storage.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

### Go deeper

- `DocumentSplitters` are just one type of the more generic `DocumentTransformers`.
- See further documentation on transformers [here](/docs/modules/data_connection/document_transformers/).
- `Context-aware splitters` keep the location ("context") of each split in the original `Document`:
    - [Markdown files](/docs/use_cases/question_answering/document-context-aware-QA)
    - [Code (py or js)](/docs/integrations/document_loaders/source_code)
    - [Documents](/docs/integrations/document_loaders/grobid)
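
To illustrate the idea behind context-aware splitting, the sketch below splits Markdown on heading lines and records the heading each split came from. It is a simplified, hypothetical helper written for this example: it treats every line starting with `#` as a heading, so it would mis-handle `#` comments inside code blocks, and it is not the implementation used by any LangChain splitter.

```python
def split_markdown_by_header(markdown: str) -> list[dict]:
    """Group lines under the most recent '#' heading and record it as metadata."""
    sections = []
    header, lines = None, []

    def flush():
        # Emit the lines collected so far, tagged with the heading they belong to.
        text = "\n".join(lines).strip()
        if text:
            sections.append({"page_content": text, "metadata": {"header": header}})

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            header, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return sections

sample = "# Intro\nFirst paragraph.\n\n# Usage\nSecond paragraph."
sections = split_markdown_by_header(sample)
print(sections)
```

With the heading stored in `metadata`, a retriever can later filter or label splits by the section they originated from.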

## Step 3. Store

To look up our document splits later, we first need to store them.

The most common way to do this is to embed the contents of each document split.

We then store each embedding alongside its split in a vectorstore.

In [None]:
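from langchain_community.embeddings import OpenAIEmbeddings  # used by the SVM retriever in Step 4
from langchain_community.vectorstores import Chroma

# Reuses `embedding_model` (the Instructor embeddings defined in section 3)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=embedding_model)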
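
To show what "storing the embedding and splits" amounts to, here is a toy in-memory store that keeps each split next to its vector and metadata under a generated id. Both `toy_embed` (a hashed bag-of-words vector) and `ToyVectorStore` are hypothetical names invented for this sketch; a real vector store uses a learned embedding model and an index built for fast search.

```python
import hashlib
import uuid

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical embedding: hash each word into one of `dim` buckets and count."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

class ToyVectorStore:
    def __init__(self):
        # id -> {"embedding": ..., "document": ..., "metadata": ...}
        self._rows = {}

    def add(self, texts, metadatas=None):
        """Embed each text at write time and store it with its metadata; return the new ids."""
        metadatas = metadatas or [{} for _ in texts]
        ids = []
        for text, meta in zip(texts, metadatas):
            row_id = str(uuid.uuid4())
            self._rows[row_id] = {"embedding": toy_embed(text), "document": text, "metadata": meta}
            ids.append(row_id)
        return ids

    def get(self, row_id):
        return self._rows[row_id]

store = ToyVectorStore()
ids = store.add(["first split", "second split"], [{"source": "a.md"}, {"source": "b.md"}])
row = store.get(ids[0])
print(row["document"], row["metadata"], sum(row["embedding"]))
```

Embedding at write time is the key design choice: queries only need to embed the question and compare it against vectors that already exist.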

### Go deeper
- Browse more than 40 vectorstore integrations [here](https://integrations.langchain.com/).
- See further documentation on vectorstores [here](/docs/modules/data_connection/vectorstores/).
- Browse more than 30 text embedding integrations [here](https://integrations.langchain.com/).
- See further documentation on embedding models [here](/docs/modules/data_connection/text_embedding/).

Here are Steps 1-3:

![lc.png](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/qa_data_load.png?raw=true)

## Step 4. Retrieve

Retrieve relevant splits for any question using [similarity search](https://www.pinecone.io/learn/what-is-similarity-search/).

This is simply "top K" retrieval where we select documents based on embedding similarity to the query.

In [None]:
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
len(docs)
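
The "top K" selection described above can be sketched with plain Python lists. This example assumes cosine similarity as the metric, which is one common choice; the metric a particular vector store actually uses depends on its configuration. The `top_k` helper, the default `k=4`, and the two-dimensional vectors are made up for illustration.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=4):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = ((i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], doc_vecs, k=2)
print(hits)
```

The returned indices would then be used to look up the corresponding splits in storage.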

### Go deeper

Vectorstores are commonly used for retrieval, but they are not the only option. For example, SVMs (see thread [here](https://twitter.com/karpathy/status/1647025230546886658?s=20)) can also be used.

LangChain [has many retrievers](/docs/modules/data_connection/retrievers/) including, but not limited to, vectorstores.

All retrievers implement a common method `get_relevant_documents()` (and its asynchronous variant `aget_relevant_documents()`).

In [None]:
from langchain.retrievers import SVMRetriever

svm_retriever = SVMRetriever.from_documents(all_splits, OpenAIEmbeddings())
docs_svm = svm_retriever.get_relevant_documents(question)
len(docs_svm)

Some common ways to improve on vector similarity search include:
- `MultiQueryRetriever` [generates variants of the input question](/docs/modules/data_connection/retrievers/MultiQueryRetriever) to improve retrieval.
- `Maximal marginal relevance` (MMR) selects for [relevance and diversity](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf) among the retrieved documents.
- Documents can be filtered during retrieval using [`metadata` filters](/docs/use_cases/question_answering/document-context-aware-QA).
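
Of the options above, maximal marginal relevance is the easiest to show in a few lines. The sketch below is a greedy version written for this example: at each step it picks the candidate with the best trade-off between similarity to the query and similarity to what has already been selected. The function name `mmr`, the `lambda_mult` weighting, and the toy vectors are assumptions for illustration rather than any library's implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance; returns indices of the selected vectors."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalise candidates that resemble something already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

doc_vecs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # the first two are near-duplicates
query = [1.0, 0.2]
diverse = mmr(query, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the result matches plain similarity ranking; lower values trade some relevance for diversity.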

In [None]:
import logging
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=ChatOpenAI(temperature=0)
)
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)
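
The core idea of multi-query retrieval can be sketched without an LLM: retrieve once per question variant, then keep the de-duplicated union. In this illustration `fake_variants` stands in for the LLM call that rephrases the question and `fake_retrieve` stands in for the vector store retriever; both, along with the tiny corpus, are invented for the example. Whether the original question is also searched is a configuration detail that this sketch leaves out.

```python
def multi_query_retrieve(question, generate_variants, retrieve):
    """Run retrieval for each question variant and return the de-duplicated union."""
    seen, unique = set(), []
    for query in generate_variants(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique.append(doc)
    return unique

# Made-up corpus: keyword -> documents filed under it
corpus = {
    "rlhf": ["doc about RLHF", "doc about reward models"],
    "reward": ["doc about reward models", "doc about PPO"],
}

def fake_variants(question):
    # Stand-in for the LLM call that rephrases the question
    return ["how does rlhf work", "what is a reward model"]

def fake_retrieve(query):
    # Stand-in retriever: return documents filed under any keyword found in the query
    return [doc for key, docs in corpus.items() if key in query for doc in docs]

toy_unique_docs = multi_query_retrieve("explain rlhf", fake_variants, fake_retrieve)
print(toy_unique_docs)
```

Different phrasings surface different splits, so the union usually covers more relevant material than a single query would.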

In addition, a useful technique for improving retrieval is to decouple the documents from the embedded search key.

For example, we can embed a document summary or a question that is likely to lead to the document being retrieved.

See the [multi-vector retriever documentation](/docs/modules/data_connection/retrievers/multi_vector) for details.

![mv.png](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/multi_vector.png?raw=true)
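
A minimal sketch of this decoupling: short search keys (summaries or hypothetical questions) are matched against the query, and each key points back to the full parent document that is actually returned. Word overlap stands in for embedding similarity here, and the ids, keys, and document texts are placeholders invented for the example.

```python
def word_overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

# Full documents, keyed by a parent id (placeholder texts)
documents = {
    "doc-1": "Long full text of the first document ...",
    "doc-2": "Long full text of the second document ...",
}

# Search keys (summaries or hypothetical questions), each pointing to a parent id
search_keys = [
    ("summary of pretraining and finetuning stages", "doc-1"),
    ("how is the reward model trained", "doc-2"),
]

def retrieve_parent(query: str) -> str:
    # Match the query against the short keys, then return the full parent document.
    _, parent_id = max(search_keys, key=lambda pair: word_overlap(query, pair[0]))
    return documents[parent_id]

result = retrieve_parent("reward model training")
print(result)
```

The search key can be tuned for matching while the returned document keeps the full context needed for generation.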