<a href="https://colab.research.google.com/github/tahreemrasul/fine_tune_embedding_model_rag/blob/main/rag_research_paper_engine_llama3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Paper Engine using arXiv, LangChain 🦜️🔗 and Llama 3 🦙

| | |
|-|-|
|Author(s) | [Tahreem Rasul](https://github.com/tahreemrasul) |

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/tahreemrasul/fine_tune_embedding_model_rag/blob/main/rag_research_paper_engine_llama3.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/tahreemrasul/fine_tune_embedding_model_rag/blob/main/rag_research_paper_engine_llama3.ipynb">
      <img width="28px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

## Overview

This notebook demonstrates how to build a research paper engine on top of the arXiv API, and shows how an LLM's responses can be improved by augmenting its knowledge with external data sources such as documents. The notebook uses Llama 3 (served through Groq), a Hugging Face sentence-embedding model, ChromaDB, the arXiv API and LangChain 🦜️🔗.

## Context

Large Language Models (LLMs) have improved quantitatively and qualitatively. They can learn new abilities without being directly trained on them. However, LLMs have constraints: they are unaware of events that happened after their training, and it is almost impossible to trace their responses back to sources. LLM-based systems should ideally cite their sources and be grounded in facts.

One approach to addressing these constraints is to augment the prompt sent to the LLM with relevant data retrieved from an external knowledge base through an Information Retrieval (IR) mechanism.

This approach is called Retrieval Augmented Generation (RAG), also known as Generative QA in the context of a Question Answering task. There are two main components in RAG based architecture: (1) Retriever and (2) Generator.
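
To make the two components concrete, here is a minimal, library-free sketch. It is not the pipeline built later in this notebook: the corpus, the word-overlap scoring and the names `retrieve` and `build_prompt` are all illustrative assumptions. The retriever picks the passage that best matches the question, and the resulting prompt is what a generator (the LLM) would receive.

```python
import re
from collections import Counter

# Toy knowledge base standing in for external documents (illustrative only)
CORPUS = [
    "ReLU is not differentiable at zero.",
    "Sigmoid squashes its input into the interval (0, 1).",
    "arXiv hosts preprints of research papers.",
]


def tokenize(text: str) -> Counter:
    """Lower-cased bag-of-words representation of a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Retriever: rank passages by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda p: sum((q & tokenize(p)).values()), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Augment the question with the retrieved context before it reaches the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "Where is ReLU not differentiable?"
prompt = build_prompt(question, retrieve(question, CORPUS))
print(prompt)
```

In the rest of the notebook, word overlap is replaced by embedding similarity and the prompt is sent to Llama 3.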

## Getting Started

### Install packages and their dependencies

Install the following packages required to execute this notebook.

In [2]:
# Install LangChain and related packages
!pip install --upgrade --quiet langchain langchain-groq langchain-community chromadb arxiv pymupdf chainlit sentence-transformers

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [3]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b>⚠️ Before proceeding, please wait for the kernel to finish restarting ⚠️</b>
</div>

## Retrieve Relevant Papers from arXiv API

This step retrieves relevant research papers based on the user query. The research papers pulled from the `arXiv` API serve as the document corpus. We use the `ArxivLoader` class from LangChain to load the PDFs of these papers.

In [1]:
# @title Query & No. of Papers { display-mode: "form" }
query = "neural networks"  # @param {type:"string"}

# @title Total Docs { display-mode: "form" }
num_papers = "3"  # @param {type: "string"}

In [2]:
from langchain_community.document_loaders import ArxivLoader

arxiv_docs = ArxivLoader(query=query, load_max_docs=int(num_papers)).load()

Once retrieved, display the metadata to check which papers were returned.

In [3]:
# Iterate over the documents actually returned; arXiv may return fewer than requested
for i, doc in enumerate(arxiv_docs, start=1):
  print(f"Paper # {i}:")
  print(f"Published: {doc.metadata['Published']}")
  print(f"Title: {doc.metadata['Title']}")
  print(f"Authors: {doc.metadata['Authors']}")
  print(f"Summary: {doc.metadata['Summary']}")
  print('-' * 108)


Paper # 1:
Published: 2023-04-18
Title: Lecture Notes: Neural Network Architectures
Authors: Evelyn Herberg
Summary: These lecture notes provide an overview of Neural Network architectures from
a mathematical point of view. Especially, Machine Learning with Neural Networks
is seen as an optimization problem. Covered are an introduction to Neural
Networks and the following architectures: Feedforward Neural Network,
Convolutional Neural Network, ResNet, and Recurrent Neural Network.
------------------------------------------------------------------------------------------------------------
Paper # 2:
Published: 2005-04-13
Title: Self-Organizing Multilayered Neural Networks of Optimal Complexity
Authors: V. Schetinin
Summary: The principles of self-organizing the neural networks of optimal complexity
is considered under the unrepresentative learning set. The method of
self-organizing the multi-layered neural networks is offered and used to train
the logical neural networks which were appl

## Chunk documents - TextSplitter

Split the retrieved documents into smaller chunks. Choose the chunk size so that several chunks fit within the LLM's context length.

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# One splitter is enough; it is reused for every paper
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

pdf_data = []
for doc in arxiv_docs:
    doc_splits = text_splitter.create_documents([doc.page_content])
    # Record each chunk's position within its paper
    for idx, split in enumerate(doc_splits):
        split.metadata["chunk"] = idx
    pdf_data.append(doc_splits)

print(f"# of pdfs = {len(pdf_data)} \n# of split documents = {sum(len(doc_splits) for doc_splits in pdf_data)}")

# of pdfs = 3 
# of split documents = 143
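
To see how `chunk_size` and `chunk_overlap` interact, the following simplified sketch cuts a string into fixed-width windows. It is an assumption-laden stand-in, not what `RecursiveCharacterTextSplitter` does internally: the real splitter prefers to break on separators such as paragraphs, lines and spaces, so its chunk boundaries will differ. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-width character chunks; consecutive chunks share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.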


## Create the Embedding model

In [5]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="trasul/bge-base-all-nli-triplet")
print(embedding_model)

client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
) model_name='trasul/bge-base-all-nli-triplet' cache_folder=None model_kwargs={} encode_kwargs={} multi_process=False show_progress=False
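
The printed pipeline ends with a `Normalize()` module, so the embeddings it produces have unit length. For unit-length vectors the dot product equals the cosine similarity, which the small sketch below checks with hand-made vectors (the vectors and helper names are illustrative, not outputs of the model).

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both numbers should agree up to floating-point error
print(cosine(a, b), dot(a_unit, b_unit))
```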


## Configure `ChromaDB` as Vector Store

This step generates embeddings from the documents and adds the embeddings to the vector store. The vector store being used is the `Chroma` database.


In [6]:
from langchain_community.vectorstores import Chroma

# Note: pdf_data[0] holds the chunks of the first paper only, so only that paper is indexed
db = Chroma.from_documents(pdf_data[0], embedding_model)

In [7]:
# @title search query { display-mode: "form" }
search_query = "What should be considered when taking derivatives of ReLU?"  # @param {type:"string"}

Verify the `Chroma` vector store with a similarity search:

In [8]:
db.similarity_search(
    search_query
)

[Document(metadata={'chunk': 52}, page_content='∂L\n∂y[L] is a part of all derivatives. So,\nif we compute the derivative by W [L−1] ﬁrst, we already have this component available and can\nreuse it in the computation of the derivative by W [L−2], etc.\nIn order to formalize the eﬀective computation of derivatives in a backpropagation algorithm,\nwe decompose the forward propagation into two parts, cf. e.g. [26, Section 7.3.2].\nz[ℓ] = W [ℓ−1]y[ℓ−1] + b[ℓ−1]\n∈Rnℓ,\ny[ℓ] = σℓ](z[ℓ])\n∈Rnℓ.\n2\nFEEDFORWARD NEURAL NETWORK\n23\nThis was not necessary in Example 2.3, because we only consider weights and no biases. Fur-\nthermore, we assume that the loss function L takes the ﬁnal output y[L] as an input. Especially,\nno other feature vectors y[ℓ] for ℓ̸= L enter the loss function directly. This is the case e.g. for\nmean squared error, cf. Example 1.2, and cross entropy, cf. Section 2.4.\nIn general, we now have by chain rule for all ℓ= 0, . . . , L −1\n∂L\n∂W [ℓ] = ∂L\n∂y[L] ·\nℓ+2\nY\nj=L\
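
Conceptually, a similarity search embeds the query and returns the `k` chunks whose vectors are closest to it. The brute-force sketch below illustrates that idea with tiny hand-made vectors. It is only an illustration: Chroma maintains its own index and distance metric, and the helper `top_k` is hypothetical. The value `k=4` is chosen to mirror what we understand to be the default of `similarity_search`.

```python
import heapq


def top_k(query_vec, doc_vecs, k=4):
    """Return (score, index) pairs for the k vectors with the highest dot product."""
    scores = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    return heapq.nlargest(k, scores)


# Unit-length toy "embeddings"; index = chunk id
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
query_vec = [1.0, 0.0]

hits = top_k(query_vec, doc_vecs, k=2)
print(hits)
```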

## Retrieval based Question/Answering Chain

We will demonstrate using three LangChain retrieval Q&A chains:

- `RetrievalQA`
- `ConversationalRetrievalChain`
- Advanced: customized Q&A prompt and format

We begin by initializing the Llama 3 LLM (served through Groq) and a LangChain retriever that fetches documents from our Chroma database, which contains the chunks of the papers we ingested earlier.

For Q&A chains our retriever is passed directly to the chain and can be used automatically without any further configuration.

Behind the scenes, first the search query is passed to the retriever which runs a search and returns relevant document chunks.

These chunks are then passed to the prompt used by the LLM to be used as context.

In [17]:
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA

llm = model = ChatGroq(temperature=0, model_name="llama3-8b-8192",
                       api_key="your_GROQ_API_key_here")

retriever = db.as_retriever()
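
Rather than pasting the key into the notebook, one option is to read it from an environment variable and prompt for it only when it is missing. This is a sketch under assumptions: the helper `get_groq_api_key` is hypothetical, and `GROQ_API_KEY` is the variable name we assume here.

```python
import os
from getpass import getpass


def get_groq_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        key = getpass(f"Enter {env_var}: ")  # input is not echoed to the screen
        os.environ[env_var] = key
    return key


# Usage with the cell above (commented out so this sketch runs without langchain_groq):
# llm = ChatGroq(temperature=0, model_name="llama3-8b-8192", api_key=get_groq_api_key())
```

This keeps the secret out of the saved notebook and its version history.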

### `RetrievalQA` chain

This is the simplest document Q&A chain offered by LangChain.

There are several different chain types available.

- In these examples we use the `stuff` type, which simply inserts all of the document chunks into the prompt.
- This has the advantage of only making a single LLM call, which is faster and more cost efficient.
- However, if we have a large number of search results we run the risk of exceeding the token limit in our prompt, or truncating useful information.
- Other chain types such as `map_reduce` and `refine` use an iterative process which makes multiple LLM calls, taking individual document chunks at a time and refining the answer iteratively.
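
The difference between these chain types comes down to how many LLM calls are made and what each call sees. The sketch below mimics the three strategies with a stub LLM that just counts calls. The prompts are made up for illustration and are not LangChain's actual templates, and the function names are hypothetical.

```python
from typing import Callable

LLM = Callable[[str], str]  # any function mapping a prompt to a completion


def stuff_answer(llm: LLM, question: str, chunks: list[str]) -> str:
    """'stuff': place every chunk in one prompt and make a single call."""
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce_answer(llm: LLM, question: str, chunks: list[str]) -> str:
    """'map_reduce': one call per chunk, then one call to combine the partial answers."""
    partial = [llm(f"Context:\n{c}\n\nQuestion: {question}") for c in chunks]
    joined = "\n".join(partial)
    return llm(f"Combine these partial answers:\n{joined}\n\nQuestion: {question}")


def refine_answer(llm: LLM, question: str, chunks: list[str]) -> str:
    """'refine': visit chunks one at a time, updating a running answer."""
    answer = ""
    for c in chunks:
        answer = llm(f"Current answer: {answer}\nNew context:\n{c}\n\nQuestion: {question}")
    return answer


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stub LLM that records each prompt instead of calling a model."""
    calls.append(prompt)
    return f"answer #{len(calls)}"


chunks = ["chunk A", "chunk B", "chunk C"]
call_counts = {}
for name, strategy in [("stuff", stuff_answer),
                       ("map_reduce", map_reduce_answer),
                       ("refine", refine_answer)]:
    calls.clear()
    strategy(fake_llm, "q?", chunks)
    call_counts[name] = len(calls)

print(call_counts)
```

With three chunks, `stuff` needs one call, while the other two strategies need one call per chunk (plus a combining call for `map_reduce`), which is why `stuff` is the cheapest option as long as everything fits in the prompt.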

In [10]:
retrieval_qa = RetrievalQA.from_chain_type(llm=llm,
                                           chain_type="stuff",
                                           retriever=retriever)

retrieval_qa.invoke(search_query)

{'query': 'What should be considered when taking derivatives of ReLU?',
 'result': 'When taking derivatives of ReLU (Rectified Linear Unit), one needs to account for the non-differentiability at 0.'}

#### Inspecting the process

If we add `return_source_documents=True` we can inspect the document chunks that were returned by the retriever.

This is helpful for debugging, as these chunks may not always be relevant to the answer, or their relevance might not be obvious.

In [11]:
retrieval_qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=retriever,
                                 return_source_documents=True)

results = retrieval_qa.invoke(search_query)

print("*" * 79)
print(results["result"])
print("*" * 79)
for doc in results["source_documents"]:
    print("-" * 79)
    print(doc.page_content)

*******************************************************************************
When taking derivatives of ReLU (Rectified Linear Unit), one needs to account for the non-differentiability at 0.
*******************************************************************************
-------------------------------------------------------------------------------
∂L
∂y[L] is a part of all derivatives. So,
if we compute the derivative by W [L−1] ﬁrst, we already have this component available and can
reuse it in the computation of the derivative by W [L−2], etc.
In order to formalize the eﬀective computation of derivatives in a backpropagation algorithm,
we decompose the forward propagation into two parts, cf. e.g. [26, Section 7.3.2].
z[ℓ] = W [ℓ−1]y[ℓ−1] + b[ℓ−1]
∈Rnℓ,
y[ℓ] = σℓ](z[ℓ])
∈Rnℓ.
2
FEEDFORWARD NEURAL NETWORK
23
This was not necessary in Example 2.3, because we only consider weights and no biases. Fur-
thermore, we assume that the loss function L takes the ﬁnal output y[L] as an input. 

### `ConversationalRetrievalChain`

`ConversationalRetrievalChain` remembers and uses previous questions so you can have a chat-like discovery process.

To use this chain we must provide a memory class to store and pass the previous messages to the LLM as context. Here we use the `ConversationBufferMemory` class that comes with LangChain.

In [12]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_retrieval = ConversationalRetrievalChain.from_llm(llm=llm,
                                                                 retriever=retriever,
                                                                 memory=memory)


conversational_retrieval.invoke(search_query)["answer"]

'When taking derivatives of ReLU (Rectified Linear Unit), one needs to account for the non-differentiability at 0.'
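
A follow-up question such as "What about other activation functions?" is ambiguous on its own. The chain handles this by first using the chat history to rewrite the follow-up into a standalone question, and then running retrieval and answering on that rewritten question. The sketch below imitates this flow with stub functions. The prompts and the names `condense_question` and `chat_turn` are illustrative assumptions, not LangChain's implementation.

```python
def condense_question(llm, chat_history, follow_up):
    """Rewrite a follow-up into a standalone question; pass it through if there is no history."""
    if not chat_history:
        return follow_up
    history = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
    return llm(
        f"Chat history:\n{history}\n\nFollow-up: {follow_up}\n\n"
        "Rephrase the follow-up as a standalone question."
    )


def chat_turn(llm, retrieve, chat_history, question):
    """One conversational turn: condense, retrieve, answer, then store the exchange."""
    standalone = condense_question(llm, chat_history, question)
    context = "\n".join(retrieve(standalone))
    answer = llm(f"Context:\n{context}\n\nQuestion: {standalone}")
    chat_history.append((question, answer))
    return answer


prompts = []


def echo_llm(prompt):
    """Stub LLM that records prompts and returns a numbered reply."""
    prompts.append(prompt)
    return f"reply {len(prompts)}"


def fake_retrieve(query):
    """Stub retriever that returns a single passage mentioning the query."""
    return [f"passage about: {query}"]


history = []
chat_turn(echo_llm, fake_retrieve, history, "What about ReLU?")
first_turn_calls = len(prompts)  # no history yet, so only the answering call is made
chat_turn(echo_llm, fake_retrieve, history, "And sigmoid?")
print(first_turn_calls, len(prompts))
```

Note that every turn after the first costs one extra LLM call for the rewriting step.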

In [13]:
new_query = "What about other activation functions?"
result = conversational_retrieval.invoke(new_query)
print(result["answer"])

According to the text, other popular activation functions are:

1. Sigmoid (logistic): σ(y) = 1 / (1 + exp(-y))
2. Hyperbolic tangent: σ(y) = tanh(y) = exp(y) - exp(-y) / (exp(y) + exp(-y))
3. Rectified Linear Unit (ReLU): σ(y) = max{y, 0}
4. Leaky ReLU: σ(y) = max{αy, y}, where α is a small positive value.

These activation functions are all monotone increasing and continuous, which is in the spirit of the original idea of the Heaviside function.


In [14]:
new_query = "give me specifically for sigmoid"
result = conversational_retrieval.invoke(new_query)
print(result["answer"])

According to the provided context, especially a problem for sigmoid activation function, since its derivative is bounded by 1.
