## Retrieval Augmented Generation using LangChain

RAG is a technique for augmenting LLM knowledge with facts fetched from external sources.

A typical RAG application has three main components:

1.   **Indexing**: a pipeline for ingesting data from a source and indexing it
2.   **Retrieval**: take the user query at run time and retrieve the relevant data from the index
3.   **Generation**: pass the retrieved data to the LLM and generate the output
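
As a rough illustration of how the three components fit together, here is a toy sketch in plain Python. It is not the LangChain implementation: all function names are hypothetical, word overlap stands in for embedding similarity, and `echo_llm` is a stub that replaces a real model.

```python
def index_documents(documents):
    # Indexing: store each document together with its set of lowercase words.
    return [(doc, set(doc.lower().split())) for doc in documents]

def retrieve(index, query, k=1):
    # Retrieval: rank documents by how many words they share with the query.
    query_words = set(query.lower().split())
    ranked = sorted(index, key=lambda item: len(item[1] & query_words), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate(query, context_docs, llm):
    # Generation: put the retrieved text and the question into one prompt.
    prompt = f"Context: {' '.join(context_docs)}\nQuestion: {query}\nAnswer:"
    return llm(prompt)

def echo_llm(prompt):
    # Stub LLM (assumption): simply returns the context line of the prompt.
    return prompt.splitlines()[0]

toy_docs = [
    "the crack dataset comes from kaggle",
    "the music report uses audio features",
]
toy_index = index_documents(toy_docs)
toy_query = "which dataset has crack images"
retrieved = retrieve(toy_index, toy_query, k=1)
answer = generate(toy_query, retrieved, echo_llm)
print(answer)
```

In a real application the word-overlap score is replaced by vector similarity over embeddings, and the stub is replaced by an actual LLM call.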

The notebook below walks through this flow step by step.


In [1]:
## Install the libraries
!pip install -q openai==1.5.0 llmx typing-extensions==4.5.0 python-dotenv
!pip install -q langchain==0.1.4
!pip install -q langchainhub
!pip install -q transformers==4.35.2
!pip install -q unstructured==0.7.12
!pip install -q sentence_transformers
!pip install -q faiss-cpu==1.7.4


#### Indexing

The sequence has the following steps:

1. Load: First we need to load our data. This is done with DocumentLoaders.
2. Split: Break large Documents into smaller chunks. This is useful both for indexing the data and for passing it into a model, since large chunks are harder to search over.
3. Store with Embeddings: We need somewhere to store and index our splits so that they can later be searched over. This is often done using a VectorStore and an Embeddings model.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
import sys, os
if 'google.colab' in sys.modules:
    # mount google drive
    from google.colab import drive
    drive.mount('/content/gdrive')
    # specify the path of the folder containing "file_name" by changing the lecture index:
    lecture_index = '07'
    path_to_file = '/content/gdrive/My Drive/BT5153_2024/codes/lab_lecture{}/'.format(lecture_index)
    print(path_to_file)
    # change current path to the folder containing "file_name"
    os.chdir(path_to_file)
    !pwd

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/BT5153_2024/codes/lab_lecture07/
/content/gdrive/My Drive/BT5153_2024/codes/lab_lecture07


In [24]:
folder_path = "../data/sample_report"

#### Load the pdf files

In [25]:
!ls $folder_path

report1.pdf  report2.pdf


In [26]:
# Document Loader
from langchain.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    folder_path,      # folder that contains the reports
    glob='*.pdf',     # we only get pdfs
    show_progress=True
)
docs = loader.load()

100%|██████████| 2/2 [00:05<00:00,  2.95s/it]


In [27]:
# check the number of pdf files that have been loaded
print(len(docs))
# print the loaded object
print(docs[0])

2
page_content='BT5153 Applied Machine Learning for Business Analytics Group Project Proposal Concrete Surface Crack Detection\n\nBai Tong\n\nA0262700R\n\nLe Meiyan\n\nA0262707A\n\nLuo Jianwei\n\nA0262763Y\n\nZhang Ruixu\n\nA0262828W\n\nZhang Xingyu\n\nA0262692X\n\nAbstract1\n\nto proactively identify and address potential safety hazards caused by structural defects.\n\nThis report presents an approach for concrete surface crack detection using Convolutional Neural Network (CNN) models. Four different CNN models including a baseline CNN, VGG16, ResNet50, and Inception v3 were explored. The results show the Inception v3 model outperforms the other models and achieves the highest testing accuracy and lower testing loss in detecting concrete surface cracks. Further, the report investigates the importance of image features for concrete surface crack detection and the basis of prediction making by CNN models.\n\n2. Dataset introduction\n\nThe Surface Crack Detection dataset from Kaggle cont

#### Split the long document

In [28]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0
)
docs_split = text_splitter.split_documents(docs)
print(docs_split[0])

print(f"{len(docs)} documents have been split into {len(docs_split)} chunks")



page_content='BT5153 Applied Machine Learning for Business Analytics Group Project Proposal Concrete Surface Crack Detection\n\nBai Tong\n\nA0262700R\n\nLe Meiyan\n\nA0262707A\n\nLuo Jianwei\n\nA0262763Y\n\nZhang Ruixu\n\nA0262828W\n\nZhang Xingyu\n\nA0262692X\n\nAbstract1\n\nto proactively identify and address potential safety hazards caused by structural defects.\n\nThis report presents an approach for concrete surface crack detection using Convolutional Neural Network (CNN) models. Four different CNN models including a baseline CNN, VGG16, ResNet50, and Inception v3 were explored. The results show the Inception v3 model outperforms the other models and achieves the highest testing accuracy and lower testing loss in detecting concrete surface cracks. Further, the report investigates the importance of image features for concrete surface crack detection and the basis of prediction making by CNN models.\n\n2. Dataset introduction' metadata={'source': '../data/sample_report/report1.pdf'}
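
To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of fixed-size character chunking in plain Python. It is an illustration only: the helper `split_with_overlap` is hypothetical, and LangChain's `CharacterTextSplitter` works on separator-delimited pieces rather than raw character offsets, so its chunks will differ.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghij"
no_overlap = split_with_overlap(sample, chunk_size=4)
with_overlap = split_with_overlap(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A non-zero overlap repeats some text across neighbouring chunks, which can help when an answer straddles a chunk boundary, at the cost of storing more chunks.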

#### Store the collection of chunks

In [10]:
# We need an embedding model to represent both the query and the chunk text as vectors
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()
# Vectorstore: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html
from langchain.vectorstores import FAISS
db = FAISS.from_documents(docs_split, embeddings)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



The embeddings are used to measure the similarity between the query and each chunk. Based on the index, the most similar chunks are returned.

In [11]:
query = "What was the dataset used in the project music genre classification?"
search_docs = db.similarity_search(query,k=2)
print(search_docs[0])
print(len(search_docs))

page_content="Music Genre Classification\n\nBT5153 Applied Machine Learning in Business Analytics – Group 04 Project Report Anusha Mediboina (A0262847U), Arian Madadi (A0231939X), Asok Kaushik (A0262739U),\n\nHongyu Ren(A0262688M), Naveen Mathew Verghese (A0262734A)\n\nupload their content. It also helps users to discover new artists and songs that fit their preferences.\n\nAbstract\n\nMusic has become an inseparable aspect of people's daily lives today. With numerous musical genres, it is no surprise that people have varying musical tastes. Genres are extremely helpful for music discovery. They bond fans and listeners and facilitate shared experiences. Hence, the classification and the recommendation of contemporary music in music streaming platforms is up- to-date issue.\n\n2. Dataset and Features" metadata={'source': '/content/drive/MyDrive/project_papers/tmp_data/report3.pdf'}
2
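
Conceptually, a similarity search embeds the query, scores it against every stored chunk vector, and returns the `k` best matches. The sketch below illustrates the idea with hand-made 3-dimensional vectors and cosine similarity. The vectors, the chunk labels, and the helper `top_k` are all made up for illustration, and the index type and distance metric that FAISS actually uses may differ.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=2):
    # Score every (text, vector) pair and keep the k highest-scoring texts.
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "embeddings" (assumption): real models produce hundreds of dimensions.
toy_vectors = [
    ("chunk about music genres", [0.9, 0.1, 0.0]),
    ("chunk about concrete cracks", [0.0, 0.2, 0.9]),
    ("chunk about audio features", [0.7, 0.3, 0.1]),
]
toy_query_vec = [1.0, 0.2, 0.0]
hits = top_k(toy_query_vec, toy_vectors, k=2)
print(hits)
```

With these toy vectors the two music-related chunks score highest, mirroring how the query about music genre classification pulls back the music report rather than the crack detection report.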


#### Retrieval

Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.

In [12]:
# Create a retriever object from 'db' with a search configuration that retrieves up to 2 relevant splits/documents.
retriever = db.as_retriever(search_kwargs={"k": 2})

#### Generation

Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.

In [13]:
# RAG prompt
from langchain import hub

# Loads the latest version
prompt = hub.pull("rlm/rag-prompt")
print(prompt)

input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]


As you can see, `context` is filled with the output of the Retriever and `question` is the user's input.
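
To illustrate how the two placeholders get filled, the sketch below formats the template text printed above with plain Python string formatting. The retrieved chunks are placeholder strings, the helper `build_prompt` is hypothetical, and joining the chunks with blank lines is an assumption about how the context is assembled rather than a statement of what LangChain does internally.

```python
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)

def build_prompt(question, retrieved_chunks):
    # Merge the retrieved chunks into one context string, then fill the template.
    context = "\n\n".join(retrieved_chunks)
    return RAG_TEMPLATE.format(question=question, context=context)

example_prompt = build_prompt(
    "What dataset was used?",
    ["<text of retrieved chunk 1>", "<text of retrieved chunk 2>"],
)
print(example_prompt)
```

The resulting string is what the LLM receives: the instructions, the user's question, and the retrieved text, ending with `Answer:` so that the model continues from there.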

##### LLM

Here, we use a free LLM from Hugging Face that runs locally. The model is the 1.1B-parameter [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) chat model.

In [14]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM

In [15]:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")


The model 'LlamaForCausalLM' is not supported for text2text-generation. Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'GPTSanJapaneseForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'NllbMoeForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'SeamlessM4TForTextToText', 'SwitchTransformersForConditionalGeneration', 'T5ForConditionalGeneration', 'UMT5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].


In [18]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=500)
llm_model = HuggingFacePipeline(pipeline=pipe)

To improve answer quality, you can switch to an OpenAI LLM with the following code:

```python
import os
# openai_key is assumed to already hold your own OpenAI API key
os.environ["OPENAI_API_KEY"] = openai_key
# load the LLM model
from langchain.chat_models import ChatOpenAI
model_name = "gpt-3.5-turbo"
llm_model = ChatOpenAI(model_name=model_name)
```

In [19]:
from langchain.chains import RetrievalQA
# Create a question-answering chain (qa_chain) using the RetrievalQA class.
# It is configured with the language model (llm_model), the RAG prompt pulled from the hub, and the retriever we created.
# No chain_type is passed, so the default ("stuff") is used, and source documents are not returned by default.
qa_chain = RetrievalQA.from_chain_type(llm=llm_model, chain_type_kwargs={"prompt": prompt}, retriever=retriever)

In [20]:
query = "What is the dataset used in concrete surface cracks detection?"
result = qa_chain.invoke({"query": query})
print(result)

{'query': 'What is the dataset used in concrete surface cracks detection?', 'result': ' The dataset used in the concrete surface cracks detection task is the Surface Crack Detection dataset from Kaggle. The images are labeled either “positive” (with crack) or “negative” (without crack). The dataset contains 40,000 images, with 20,000 images for training, 10,000 images for validation, and 10,000 images for testing. The models used in this report are Baseline CNN, VGG-16, ResNet50, and Inception v3. The evaluation results show that the Inception v3 model has the lowest testing loss and highest testing accuracy.'}
