<a href="https://colab.research.google.com/github/swethag04/llm/blob/main/Q%26A/Research_QandA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research paper Q&A App using Retrieval Augmented Generation (RAG)

**Goal** : <br>
The goal of the project is to build a Question and Answer system powered by Large Language Models (LLM). The user uploads a research paper in the app and it will answer any specific question from the research paper.
<br><br>

**Data Required**: <br>
Digital copies of research papers in pdf format
<br><br>

**How to source the data?** <br>
The data can be sourced by scraping the internet for popular research papers
<br><br>

**Expected Results:** <br>
The app should reliably and correctly answer the questions using the information from the specific document.
<br><br>


### Technology and Tools Used


1. **Large Language Models (LLMs)**<br>
    
> LLMS are deep learning models that are pre-trained on vast amounts of data and can be used for tasks like generating text, summarizing, translating, answering questions etc.

2.   **Retrieval Augmented Generation (RAG)**

> RAG is a technique that enhances the capabilties of LLMs by incorporating information retrieval into the text generation process. This is done by retrieving  data/documents relevant to a question or task and providing them as context for the LLM.

3. **LlamaIndex**

> LlamaIndex is an open source data framework for building RAG systems.

4. **Streamlit**

> Streamlit is an open source python library used to create custom web apps for ML



###Installing relevant packages

In [3]:
#!pip install openai
#!pip install tiktoken
#!pip install cohere
#!pip install llama-index==0.8.69.post1
#!pip install pypdf
#!pip install ragas==0.0.20
#!pip install sentence-transformers
#!pip install -q "huggingface_hub[inference]"

###Setting up API Keys

In [7]:
from secret_key import openapi_key, hugging_face_key
import os
import openai

os.environ["OPENAI_API_KEY"] = openapi_key
openai.api_key = os.getenv("OPENAI_API_KEY")

os.environ["HUGGINGFACE_API_KEY"] = hugging_face_key

### Importing Packages

In [8]:
import pandas as pd
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, OpenAIEmbedding
from llama_index.llms import OpenAI
from langchain.embeddings import HuggingFaceEmbeddings

import nest_asyncio

nest_asyncio.apply()

### Building RAG
Following are the different steps required to build a RAG system

1. **Loading**: <br>
Load the document and create text chunks to divide longer texts into smaller chunks , so as to not exceed the context widow of LLMs
2. **Indexing**: <br>
Create embeddings of the text chunks (Two semantically related texts will be in proximity in the vector space while dissimilar texts are far away)
3. **Storing**: <br>
Store embeddings and associated metadata in vector databases with indexing for fast search and retrieval
4. **Querying**: <br>
Retrieve relevant context from an index when given a user query.
6. **Response Synthesis**: <br>
 Combine the user query, top k results from the search (similarity search) and the prompt to send as input to the LLM which generates the final response to the user query
7. **Evaluation**: <br>
Evaluate the response generated for accuracy and relevancy

Llama Index provides an easy framework for implementing the above steps to build a RAG system.


### Loading the data file
The data file used here is the following research paper :
PaLM: Scaling Language Modeling with Pathways  (https://arxiv.org/abs/2204.02311)  [link text](https://)

In [9]:
# Load the document
reader = SimpleDirectoryReader(input_files = ['/content/sample_data/palm.pdf'])
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 87 docs


### Indexing and Querying

In [86]:
# Build an index, ask a sample query and generate a response
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What hardware was used to train PaLM?")
print(response)

TPU v4 Pods were used to train PaLM.


The above RAG system works fine and provides a relevant and correct answer for the query asked from the research paper provided.

### Evaluation
In order to further evaluate the RAG system, Ragas python package can be used.
 Ragas offers metrics for evaluating both the  Retrieval and the Generation components of the RAG.
####Generation Metrics


1.   **Faithfulness**


*   Measures factual consitency of generated answer against the given context
*   Generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context
* Faithfulness score = (# claims that can be inferred from given context)/(Total# of claims in the generated answer)


        
2.   **Answer Relevancy**
* How relevant is the generated answer to the question
* Lower score is assigned to answers that are incomplete or contain redundant info
* To calculate this score, LLM is prompted to generate an appropriate question for the generated answer multiple times and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to genrate questions from the answer that align with the original question.

<br>


####Retrieval Metrics


1.   **Context Relevancy**

*   This metric gauges the relavancy of the retrieved context, calculated based on both the questions and contexts
*   context relevancy = # sentences in retrieved context that are relevant for answering the given question/total # sentences in retrieved context


2.  **Context Precision**



*   Metric that evaluates whether all of the ground truth relevant items present in the contexts are ranked higher or not

### Synthetic test dataset

In [31]:
# Creating a synthetic test dataset for evaluation
testsetgenerator = TestsetGenerator.from_default()
test_size = 10
testset = testsetgenerator.generate(docs, test_size=test_size)
test_df = testset.to_pandas()
test_df.head()


  0%|          | 0/10 [00:00<?, ?it/s][A
 10%|█         | 1/10 [01:19<11:59, 79.93s/it][A
 30%|███       | 3/10 [02:14<04:47, 41.08s/it][A
 60%|██████    | 6/10 [04:20<02:46, 41.60s/it][A
100%|██████████| 10/10 [15:22<00:00, 108.31s/it][A
15it [16:32, 62.06s/it]                         [A
21it [17:17, 37.16s/it][A
28it [28:50, 63.87s/it][A
36it [30:15, 41.86s/it][A
45it [31:03, 27.27s/it][A
55it [33:18, 21.94s/it][A

Unnamed: 0,question,context,answer,question_type,episode_done
0,What are the limitations of fairness evaluatio...,- A major limitation of the fairness analyses ...,The limitations of fairness evaluations in Eng...,simple,True
1,What is the performance trend for the PaLM mod...,- We can see in Figure 5 that performance on g...,The performance trend for the PaLM model on go...,simple,True
2,What impact does model size have on the rate o...,- Figure 18(b) shows the memorization rate as ...,The impact of model size on the rate of memori...,multi_context,True
3,"Which model outperforms GPT-3, Gopher, and Chi...",- The three models evaluated on 58 tasks in co...,"PaLM 540B 5-shot outperforms GPT-3, Gopher, an...",multi_context,True
4,What types of data are included in the PaLM pr...,- The dataset is a mixture of ﬁltered webpages...,The types of data included in the PaLM pretrai...,simple,True


In [33]:
test_df.to_pickle('/content/sample_data/testdf.pkl')

In [12]:
test_df = pd.read_pickle('/content/sample_data/testdf.pkl')

In [13]:
# collect questions and answers
test_questions = test_df['question'].values.tolist()
test_answers = [[item] for item in test_df['answer'].values.tolist()]

In [14]:
test_questions

['What are the limitations of fairness evaluations in English language technologies?',
 'What is the performance trend for the PaLM model on goal step wikihow and logical args tasks?',
 'What impact does model size have on the rate of memorization of training examples, taking into account occurrence proportion in training data and corpus breakdown?',
 'Which model outperforms GPT-3, Gopher, and Chinchilla on the 58 tasks in the PaLM models on BIG-bench?',
 'What types of data are included in the PaLM pretraining dataset?',
 'What are the occupations that co-occur most frequently with gender pronouns?',
 'What is a critical open scaling question when comparing parameter models trained with varying token numbers, considering model depth and width, training corpus quality, and increased model capacity without increased compute?',
 'What is the relationship between parameter count and model scale in the PaLM model? How does parameter count change when model scale increases from 8B to 62B?'

In [15]:
test_answers

[['The limitations of fairness evaluations in English language technologies include the fact that they are only performed on English language data. This means that bias benchmarks and evaluations developed for the Western world may not be applicable to other languages and socio-cultural contexts. There is also a lack of standardization of fairness benchmarks, an understanding of the harms related to different bias measures in NLP, and comprehensive coverage of identities. The evaluations are also limited by potential risks that cannot be measured, and bias can vary depending on the specific downstream application and training pipeline. It is unclear if evaluations on pre-trained language models affect downstream task evaluations after fine-tuning. Therefore, it is recommended to assess fairness gaps in the application context before deployment.'],
 ['The performance trend for the PaLM model on goal step wikihow and logical args tasks is a log-linear scaling curve, with the PaLM 540B mo

### Compare Embeddings for Retriever
The performance of a the retriever is an important factor that determines the effectiveness of a RAG system, specifically the quality of embeddings. Here I compare the performance of two different embeddings.

In [16]:
# Build RAG

def build_query_engine(embed_model):

    # Split the documents into nodes and create vector embeddings of the text of every node
    vector_index = VectorStoreIndex.from_documents( docs, service_context=ServiceContext.from_defaults(chunk_size=512),
        embed_model=embed_model,
    )
    #
    query_engine = vector_index.as_query_engine(similarity_top_k=2)
    return query_engine

In [17]:
# Import metrics from ragas to evaluate retriever component

from ragas.metrics import (
    context_precision,
    context_recall,
)

metrics = [
    context_precision,
    context_recall,
]

### Evaluate OpenAI Embeddings

In [18]:
from ragas.llama_index import evaluate

openai_model = OpenAIEmbedding()
query_engine1 = build_query_engine(openai_model)
result_openai = evaluate(query_engine1, metrics, test_questions, test_answers)


[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


evaluating with [context_precision]


100%|██████████| 1/1 [00:13<00:00, 13.33s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [01:41<00:00, 101.33s/it]


In [19]:
print(result_openai)


{'context_precision': 0.9500, 'context_recall': 0.6067}


In [21]:
import statistics

print("Ragas score: ", statistics.harmonic_mean(list(result_openai.values())))

Ragas score:  0.7404710920603806


### Evaluate Bge Embeddings

In [22]:

flag_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
query_engine2 = build_query_engine(flag_model)
result_bge = evaluate(query_engine2, metrics, test_questions, test_answers)

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/90.3k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

evaluating with [context_precision]


100%|██████████| 1/1 [00:11<00:00, 11.48s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [01:34<00:00, 94.74s/it]


In [23]:
print(result_bge)

{'context_precision': 0.9500, 'context_recall': 0.7067}


In [24]:
print("Ragas score: ", statistics.harmonic_mean(list(result_openai.values())))

Ragas score:  0.7404710920603806


### Compare scores of openAI vs Bge embedding models

Based on the evaluation results, `context_precision` and `context_recall` metrics of the OpenAI model are the same in my RAG pipeline when applied to my own dataset.

Here, I will be using 2 LLMs - HuggingFace zephyr-7b-alpha and Falcon-7b-instruct models

### Compare LLMs using Ragas Evaluations
The LLM used in the RAG system also has a large impact in the quality of the generated output.

In [25]:
from llama_index.llms import HuggingFaceInferenceAPI
from llama_index.embeddings import HuggingFaceInferenceAPIEmbedding

def build_qe(llm):
  vector_index = VectorStoreIndex.from_documents(
        docs, service_context=ServiceContext.from_defaults(chunk_size=512, llm=llm),
        embed_model=HuggingFaceInferenceAPIEmbedding,
    )

  qe = vector_index.as_query_engine(similarity_top_k=2)
  return qe

In [26]:
# Function to evaluate as Llama index does not support async evaluation for HFInference API
from datasets import Dataset

def generate_responses(query_engine, test_questions, test_answers):
  responses = [query_engine.query(q) for q in test_questions]

  answers = []
  contexts = []
  for r in responses:
    answers.append(r.response)
    contexts.append([c.node.get_content() for c in r.source_nodes])
  dataset_dict = {
        "question": test_questions,
        "answer": answers,
        "contexts": contexts,
  }
  if test_answers is not None:
    dataset_dict["ground_truths"] = test_answers
  ds = Dataset.from_dict(dataset_dict)
  return ds

In [27]:
# Import metrics from ragas to evaluate generator component
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    #answer_correctness,
)

metrics = [
    faithfulness,
    answer_relevancy,
   # answer_correctness,
]

### Evaluate Zephyr7B Alpha LLM

In [28]:
# Use zephyr model using HFInference API
zephyr_llm = HuggingFaceInferenceAPI(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    token=""
)
qe1 = build_qe(zephyr_llm)
result_ds = generate_responses(qe1, test_questions, test_answers)
print(result_ds)


Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 10
})


In [30]:
from ragas import evaluate
result_zephyr = evaluate(result_ds, metrics=metrics )
print(result_zephyr)

evaluating with [faithfulness]


100%|██████████| 1/1 [02:38<00:00, 158.87s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:17<00:00, 17.30s/it]


{'faithfulness': 0.9667, 'answer_relevancy': 0.9325}


In [31]:
print("Ragas score: ", statistics.harmonic_mean(list(result_zephyr.values())))

Ragas score:  0.9492609291938752


### Evaluate Falcon-7B-Instruct LLM

In [32]:
falcon_llm = HuggingFaceInferenceAPI(
    model_name="tiiuae/falcon-7b-instruct",
    token=""
)
qe2 = build_query_engine(falcon_llm)
result_ds_falcon = generate_responses(qe2, test_questions, test_answers)
result_falcon = evaluate(result_ds_falcon, metrics=metrics)

result_falcon


evaluating with [faithfulness]


100%|██████████| 1/1 [02:23<00:00, 143.35s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:11<00:00, 11.56s/it]


{'faithfulness': 0.8524, 'answer_relevancy': 0.9588}

In [34]:
print("Ragas score: ", statistics.harmonic_mean(list(result_falcon.values())))

Ragas score:  0.9024430950954836


Based on the evaluation results, the Zephyr model seems to outperform the Falcon model in my RAG pipeline when applied to my own dataset.

