# Understanding the basic structure of RAG

- Author: [Sun Hyoung Lee](https://github.com/LEE1026icarus)
- Design: 
- Peer Review: 
- Proofread: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


## Overview

### 1. Pre-processing - Steps 1 to 4
![rag-1.png](./assets/12-rag-rag-basic-pdf-rag-process-01.png)
![rag-1-graphic](./assets/12-rag-rag-basic-pdf-rag-graphic-1.png)


The pre-processing stage has four steps: load the documents, split them into chunks, embed the chunks, and store the embeddings in a vector database (Vector DB).

- **Step 1: Document Load** : Load the document content.  
- **Step 2: Text Split** : Split the document into chunks based on specific criteria.  
- **Step 3: Embedding** : Generate embeddings for the chunks and prepare them for storage.  
- **Step 4: Vector DB Storage** : Store the embedded chunks in the database.  

### 2. RAG Execution (Runtime) - Steps 5 to 8
![rag-2.png](./assets/12-rag-rag-basic-pdf-rag-process-02.png)
![rag-2-graphic](./assets/12-rag-rag-basic-pdf-rag-graphic-2.png)


- **Step 5: Retriever** : Define a retriever that fetches the chunks most relevant to the input query from the database. Retrievers are commonly categorized by search method as Dense or Sparse:
  - **Dense** : Similarity search over embedding vectors.
  - **Sparse** : Keyword-based search (for example, BM25 or TF-IDF).

- **Step 6: Prompt** : Create a prompt for executing RAG. The `context` variable in the prompt is filled with the chunks retrieved from the document. You can also use prompt engineering to specify the format of the answer.  

- **Step 7: LLM** : Define the language model (e.g., GPT-3.5, GPT-4, or Claude).  

- **Step 8: Chain** : Create a chain that connects the prompt, LLM, and output.  
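
The Dense/Sparse distinction in Step 5 can be made concrete with a small standard-library sketch. The function names and the two-dimensional vectors below are hypothetical and chosen only for illustration. A real dense retriever compares embeddings produced by a model, and a real sparse retriever typically uses a weighting scheme such as BM25 rather than a raw word overlap.

```python
import math


def sparse_score(query: str, doc: str) -> int:
    """Keyword-style score: number of distinct query words found in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dense-style score: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 2-D "embeddings" (hypothetical values, for illustration only).
query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}

# Dense: pick the document whose vector points in the most similar direction.
best_dense = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
print(best_dense)

# Sparse: count the literal keyword matches.
print(sparse_score("urban mobility", "AI for urban transport"))
```

The dense score can rank a document highly even when it shares no exact words with the query, while the sparse score depends entirely on literal term matches.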

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [RAG Basic Pipeline](#rag-basic-pipeline)
- [Complete code](#complete-code)

### References

- [LangChain docs : QA with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)
------

### Document Used for Practice

A European Approach to Artificial Intelligence - A Policy Perspective

- Author: EIT Digital and 5 EIT KICs (EIT Manufacturing, EIT Urban Mobility, EIT Health, EIT Climate-KIC, EIT Digital)
- Link: https://eit.europa.eu/sites/default/files/eit-digital-artificial-intelligence-report.pdf
- File Name: A European Approach to Artificial Intelligence - A Policy Perspective.pdf

 _Please copy the downloaded file to the data folder for practice._ 

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

 **[Note]** 
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.


In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

Install the required packages.

In [1]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "langsmith",
        "langchain",
        "langchain_text_splitters",
        "langchain_core",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)
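
If one of the later imports fails, a quick way to see which modules are unavailable is to look them up without importing them. The helper name `missing_modules` is hypothetical; `importlib.util.find_spec` is part of the standard library. The names listed are the import names used in this tutorial.

```python
from importlib.util import find_spec


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


required = [
    "langchain_community",
    "langsmith",
    "langchain",
    "langchain_text_splitters",
    "langchain_core",
    "langchain_openai",
]

# An empty list means every module can be located.
print(missing_modules(required))
```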

In [2]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {   "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "RAG-Basic-PDF",
    }
)

Environment variables have been set successfully.


In [3]:
# Alternatively, manage API keys in a .env file and load them as environment variables
from dotenv import load_dotenv

# Load API key information.
# override=True lets values from .env replace variables that are already set.
load_dotenv(override=True)

True
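
Before moving on, it can help to confirm that the key the later cells depend on is actually present. The helper below is a hypothetical standard-library sketch, not part of `langchain-opentutorial`; it only reports which variable names are unset or empty.

```python
import os


def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names of environment variables that are unset or empty."""
    return [name for name in names if not os.environ.get(name)]


missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {missing}")
else:
    print("All required environment variables are set.")
```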

## RAG Basic Pipeline

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

Below is the skeleton code for understanding the basic structure of RAG (Retrieval-Augmented Generation).

Each module can be adjusted or swapped out to fit a specific scenario, so the pipeline can be improved iteratively to suit your documents. Different options or new techniques can be applied at each step.

In [6]:
# Step 1: Load Documents
loader = PyMuPDFLoader("./data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf")
docs = loader.load()
print(f"Number of pages in the document: {len(docs)}")

Number of pages in the document: 24


Print the content of a single page. Index 10 is the eleventh page because indexing starts at 0.

In [7]:
print(docs[10].page_content)

A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE
11
GENERIC 
There are five issues that, though from slightly different angles, 
are considered strategic and a potential source of barriers and 
bottlenecks: data, organisation, human capital, trust, markets. The 
availability and quality of data, as well as data governance are of 
strategic importance. Strictly technical issues (i.e., inter-operabi-
lity, standardisation) are mostly being solved, whereas internal and 
external data governance still restrain the full potential of AI Inno-
vation. Organisational resources and, also, cognitive and cultural 
routines are a challenge to cope with for full deployment. On the 
one hand, there is the issue of the needed investments when evi-
dence on return is not yet consolidated. On the other hand, equally 
important, are cultural conservatism and misalignment between 
analytical and business objectives. Skills shortages are a main 
bottleneck in all the four sectors cons

Check the metadata.

In [8]:
docs[10].__dict__

{'id': None,
 'metadata': {'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf',
  'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf',
  'page': 10,
  'total_pages': 24,
  'format': 'PDF 1.4',
  'title': '',
  'author': '',
  'subject': '',
  'keywords': '',
  'creator': 'Adobe InDesign 15.1 (Macintosh)',
  'producer': 'Adobe PDF Library 15.0',
  'creationDate': "D:20200922223534+02'00'",
  'modDate': "D:20200922223544+02'00'",
  'trapped': ''},
 'page_content': 'A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\n11\nGENERIC \nThere are five issues that, though from slightly different angles, \nare considered strategic and a potential source of barriers and \nbottlenecks: data, organisation, human capital, trust, markets. The \navailability and quality of data, as well as data governance are of \nstrategic importance. Strictly technical issues (i.e., inter-operabi-\nlity, standardis

In [9]:
# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)
print(f"Number of split chunks: {len(split_documents)}")

Number of split chunks: 163
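
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window splitter. It is only a sketch with a hypothetical function name: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph breaks, newlines, and spaces, so its chunks are usually not exact fixed-width windows like these.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means the last character of one chunk reappears as the first character of the next, which reduces the chance that a sentence cut at a chunk boundary loses its context entirely.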


In [10]:
# Step 3: Generate Embeddings
embeddings = OpenAIEmbeddings()
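
An embedding model maps each piece of text to a fixed-length list of floats; in LangChain the `embeddings` object exposes this through `embed_query` and `embed_documents`. The toy function below is a stand-in that needs no API key. It counts words from a small hypothetical vocabulary and bears no resemblance to how the OpenAI model computes its vectors; it only shows the text-to-vector shape of the operation.

```python
def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Map text to a fixed-length vector: one word count per vocabulary entry."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


vocabulary = ["ai", "data", "mobility"]  # hypothetical vocabulary
vector = toy_embed("AI needs data and more data", vocabulary)
print(vector)  # [1.0, 2.0, 0.0]
```

Whatever the input length, the output always has one entry per vocabulary term, which is what makes vectors from different chunks comparable.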

In [11]:
# Step 4: Create and Save the Database
# Embed the chunks and build an in-memory FAISS vector store from them.
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)

In [12]:
for doc in vectorstore.similarity_search("URBAN MOBILITY"):
    print(doc.page_content)

A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE
14
Table 3: Urban Mobility: concerns, opportunities and policy levers.
URBAN MOBILITY
The adoption of AI in the management of urban mobility systems 
brings different sets of benefits for private stakeholders (citizens, 
private companies) and public stakeholders (municipalities, trans-
portation service providers). So far only light-weight task specific 
AI applications have been deployed (i.e., intelligent routing, sharing
A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE
15
One of the most interesting development close to scale up is the 
creation of platforms, which are fed by all different data sources 
of transport services (both private and public) and provide the ci-
tizens a targeted recommendation on the best way to travel, also 
based on personal preferences and characteristics. 
Urban Mobility should focus on what is already potentially avai-
apps, predictive models based on citizens’ 
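
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The brute-force sketch below uses Euclidean distance over hand-made vectors; the function name, texts, and vector values are hypothetical, and a real index such as FAISS is far more efficient than sorting every entry.

```python
import math


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k stored vectors closest to query_vec."""
    ranked = sorted(store, key=lambda item: math.dist(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# (text, vector) pairs standing in for embedded chunks.
store = [
    ("urban mobility", [0.0, 1.0]),
    ("healthcare", [1.0, 0.0]),
    ("transport platforms", [0.2, 0.9]),
]

print(top_k([0.1, 1.0], store, k=2))  # ['urban mobility', 'transport platforms']
```

With the vector store built above, the number of returned chunks can be controlled in the same spirit, for example `vectorstore.similarity_search("URBAN MOBILITY", k=2)`.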

In [13]:
# Step 5: Create Retriever
# Search and retrieve information contained in the documents.
retriever = vectorstore.as_retriever()


Send a query to the retriever and check the resulting chunks.

In [15]:
# Send a query to the retriever and check the resulting chunks.
retriever.invoke("What is the phased implementation timeline for the EU AI Act?")

[Document(id='0287d0f6-85cf-49c0-9916-623a6e5455ab', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 9, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 15.1 (Macintosh)', 'producer': 'Adobe PDF Library 15.0', 'creationDate': "D:20200922223534+02'00'", 'modDate': "D:20200922223544+02'00'", 'trapped': ''}, page_content='A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\n10\nrequirements becomes mandatory in all sectors and create bar-\nriers especially for innovators and SMEs. Public procurement ‘data \nsovereignty clauses’ induce large players to withdraw from AI for \nurban ecosystems. Strict liability sanctions block AI in healthcare, \nwhile limiting space of self-driving experimentation. The support \nmeasures to boost European A


In [14]:
# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

#Context: 
{context}

#Question:
{question}

#Answer:"""
)
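
To see what the model will eventually receive, the template can be filled by hand. This sketch uses plain `str.format` rather than `PromptTemplate`, and the helper name `build_prompt` and the sample chunks are hypothetical; it assumes the retrieved chunks are joined into one context string separated by blank lines.

```python
TEMPLATE = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.

#Context:
{context}

#Question:
{question}

#Answer:"""


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for retriever output.
example_prompt = build_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What does the document say?",
)
print(example_prompt)
```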

In [15]:
# Step 7: Create Language Model (LLM)
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

In [16]:
# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
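
The `|` operator composes the steps into a sequence, and the leading dict runs the retriever and the passthrough on the same input to produce the two prompt variables. The plain-Python analogy below traces that data flow with stand-in functions. Every name in it is hypothetical, the fake model simply echoes its prompt, and it is not how LangChain implements runnables.

```python
def fake_retriever(question: str) -> str:
    """Stand-in for the retriever: returns a fixed context string."""
    return "Placeholder context chunk."


def fake_llm(prompt_text: str) -> dict:
    """Stand-in for the LLM: wraps an echo of the prompt in a message-like dict."""
    return {"content": f"[model output for]\n{prompt_text}"}


def run_chain(question: str) -> str:
    # Fan the single input out into the two prompt variables, as the dict
    # {"context": retriever, "question": RunnablePassthrough()} does.
    inputs = {"context": fake_retriever(question), "question": question}
    # Fill the prompt template with those variables.
    prompt_text = "#Context:\n{context}\n\n#Question:\n{question}\n\n#Answer:".format(**inputs)
    # Call the model with the finished prompt.
    message = fake_llm(prompt_text)
    # Extract the text from the message (the role StrOutputParser plays).
    return message["content"]


result = run_chain("What is RAG?")
print(result)
```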

Input a query (question) into the created chain and execute it.

In [17]:
# Run Chain
# Input a query about the document and print the response.
question = "Where has the application of AI in healthcare been confined to so far?"
response = chain.invoke(question)
print(response)

The application of AI in healthcare has so far been confined to administrative tasks, such as Natural Language Processing to extract information from clinical notes or predictive scheduling of visits, and diagnostic tasks, including machine and deep learning applied to imaging in radiology, pathology, and dermatology.


## Complete code

In [18]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Step 1: Load Documents
loader = PyMuPDFLoader("./data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf")
docs = loader.load()

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)

# Step 3: Generate Embeddings
embeddings = OpenAIEmbeddings()

# Step 4: Create and Save the Database
# Embed the chunks and build an in-memory FAISS vector store from them.
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)

# Step 5: Create Retriever
# Search and retrieve information contained in the documents.
retriever = vectorstore.as_retriever()

# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

#Context: 
{context}

#Question:
{question}

#Answer:"""
)

# Step 7: Create Language Model (LLM)
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [19]:
# Run Chain
# Input a query about the document and print the response.
question = "Where has the application of AI in healthcare been confined to so far?"
response = chain.invoke(question)
print(response)

The application of AI in healthcare has been confined to administrative tasks, such as Natural Language Processing to extract information from clinical notes or predictive scheduling of visits, and diagnostic tasks, including machine and deep learning applied to imaging in radiology, pathology, and dermatology.
