# Title: RAGAS: Automated Evaluation of Retrieval Augmented Generation

#### Group Member Names :

1. Yogita - 200553102
2. Shoaib Khan - 200574765


### INTRODUCTION:

* These days, LLMs have achieved impressive capabilities in answering questions without external knowledge. However, they have limitations in handling out-of-distribution information.

* Retrieval-augmented generation (RAG) systems address this by incorporating relevant information into the LLM's context. However, evaluating RAGs is a challenge due to the absence of reliable evaluation metrics.

*********************************************************************************************************************
#### AIM :

* The aim of the paper is to develop an efficient framework for evaluating RAGs, particularly in scenarios where ground-truth data is limited. RAGAS aims to provide insights into the performance of RAGs by offering a suite of metrics.

*********************************************************************************************************************
#### Github Repo:
Link: https://github.com/shoaibk99/Machine-Learning-Programming-Final
*********************************************************************************************************************
#### DESCRIPTION OF PAPER:

* The paper introduces RAGAS, a framework designed to automate the assessment of RAGs.

* It focuses on scenarios where reference answers are unavailable and aims to provide measures for correctness. 

*********************************************************************************************************************
#### PROBLEM STATEMENT :

* Although RAGAS provides a solution for RAG evaluation, it does so by depending on major LLM providers like OpenAI.

* It requires an API key from that LLM provider, which can be quite expensive.

* Moreover, it gives users less control over the evaluation model.
*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
* 
*********************************************************************************************************************
#### SOLUTION:
* We will implement a version of the evaluation using open-source models from HuggingFace instead of a proprietary LLM.

* This will solve the issue of requiring paid API keys for RAG evaluation. 


# Background
*********************************************************************************************************************


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|



*********************************************************************************************************************






### Implement paper code :

In [None]:
!pip install ragas

In [None]:

from datasets import Dataset
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"  # placeholder: replace with a real key

covid_information_bot_dataset = {
    'question': [
        'What are common symptoms of COVID-19?',
        'How does COVID-19 spread?',
        'Can COVID-19 be cured by antibiotics?',
        'What is the incubation period for COVID-19?',
        'Do masks help prevent the spread of COVID-19?'
    ],
    'answer': [
        'Common symptoms of COVID-19 include headache and rash.',
        'COVID-19 spreads primarily through respiratory droplets when an infected person coughs, sneezes, or talks.',
        'Yes, COVID-19 can be cured by antibiotics.',
        'The incubation period for COVID-19 is typically 14-21 days.',
        'No, masks do not help in preventing the spread of COVID-19.'
    ],
    'contexts': [
        [
            'Common symptoms of COVID-19 include fever, cough, and shortness of breath. Some patients also experience fatigue, body aches, loss of taste or smell, sore throat, and gastrointestinal issues.'
        ],
        [
            'COVID-19 spreads mainly through close contact with an infected person. The virus is transmitted via respiratory droplets that are released when the infected person coughs, sneezes, or talks. It can also spread by touching surfaces contaminated with the virus and then touching the face.'
        ],
        [
            'COVID-19 is caused by a virus, specifically the SARS-CoV-2 virus. Antibiotics are not effective against viruses; they only work on bacterial infections. Treatment for COVID-19 primarily focuses on relieving symptoms and may include antiviral medications in severe cases.'
        ],
        [
            'The incubation period for COVID-19, which is the time between exposure to the virus and the onset of symptoms, is typically 2 to 14 days. However, most cases develop symptoms within 5 to 6 days after exposure.'
        ],
        [
            'Wearing masks, particularly in crowded and enclosed spaces, helps reduce the spread of COVID-19 by blocking respiratory droplets from reaching others. Masks are especially effective when combined with other measures like social distancing and hand hygiene.'
        ]
    ],
    'ground_truth': [
        'Common symptoms of COVID-19 include fever, cough, and shortness of breath.',
        'COVID-19 spreads primarily through respiratory droplets when an infected person coughs, sneezes, or talks.',
        'No, antibiotics do not cure COVID-19, as it is caused by a virus, not bacteria.',
        'The incubation period for COVID-19 is typically 2 to 14 days.',
        'Yes, masks do help in preventing the spread of COVID-19 by blocking respiratory droplets.'
    ]
}


dataset = Dataset.from_dict(covid_information_bot_dataset)

score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()

*********************************************************************************************************************
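The `faithfulness` metric used above can be illustrated with a small sketch. In the RAGAS paper, the answer is broken into statements, each statement is checked against the retrieved context, and the score is the fraction of supported statements, F = |V| / |S|. The sketch below assumes the per-statement verdicts are already available (in RAGAS they come from an LLM). The function name `faithfulness_score` and the example verdicts are hypothetical and are not part of the `ragas` API.

```python
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Fraction of answer statements supported by the context (|V| / |S|)."""
    if not verdicts:
        raise ValueError("at least one statement is required")
    # True counts as 1 and False as 0, so sum() gives the number of supported statements
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for a three-statement answer: two supported, one not
example_verdicts = [True, True, False]
print(round(faithfulness_score(example_verdicts), 3))
```

An answer whose statements are all contradicted by the context, such as the antibiotics answer in the dataset above, would be expected to score near 0 under this definition.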
### Contribution  Code :

In [None]:
!pip install torch transformers datasets sentence-transformers pandas

In [None]:
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from datasets import Dataset
import pandas as pd
from IPython.display import display

# Check if CUDA is available and set the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load models
qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", device=device)
semantic_similarity_model = SentenceTransformer('paraphrase-MiniLM-L6-v2').to(device)

def generate_qa_answer(question, context):
    # Extract the answer span from the context with the extractive QA model
    return qa_model(question=question, context=context)['answer']

def calculate_semantic_similarity(text1, text2):
    # Embed both texts and return their cosine similarity as a Python float
    embeddings = semantic_similarity_model.encode([text1, text2])
    return float(torch.nn.functional.cosine_similarity(
        torch.tensor(embeddings[0]).unsqueeze(0),
        torch.tensor(embeddings[1]).unsqueeze(0)
    ))

def evaluate_answer_correctness(generated_answer, pre_available_answer, ground_truth):
    gen_vs_ground = calculate_semantic_similarity(generated_answer, ground_truth)
    pre_vs_ground = calculate_semantic_similarity(pre_available_answer, ground_truth)
    gen_vs_pre = calculate_semantic_similarity(generated_answer, pre_available_answer)

    # Accuracy score: mean of the pre-available-vs-ground-truth and
    # generated-vs-pre-available similarities
    accuracy_score = (pre_vs_ground + gen_vs_pre) / 2

    return {
        'generated_answer': generated_answer,
        'gen_vs_ground_similarity': gen_vs_ground,
        'pre_vs_ground_similarity': pre_vs_ground,
        'gen_vs_pre_similarity': gen_vs_pre,
        'accuracy_score': accuracy_score
    }

def evaluate_rag(dataset):
    results = []
    for sample in dataset:
        context = ' '.join(sample['contexts'])
        generated_answer = generate_qa_answer(sample['question'], context)
        evaluation = evaluate_answer_correctness(generated_answer, sample['answer'], sample['ground_truth'])

        results.append({
            'question': sample['question'],
            'pre_available_answer': sample['answer'],
            'generated_answer': evaluation['generated_answer'],
            'ground_truth': sample['ground_truth'],
            'accuracy_score': evaluation['accuracy_score'],
            'gen_vs_ground_similarity': evaluation['gen_vs_ground_similarity'],
            'pre_vs_ground_similarity': evaluation['pre_vs_ground_similarity'],
            'gen_vs_pre_similarity': evaluation['gen_vs_pre_similarity']
        })
    return pd.DataFrame(results)

# Sample data
covid_information_bot_dataset = {
    'question': [
        'What are common symptoms of COVID-19?',
        'How does COVID-19 spread?',
        'Can COVID-19 be cured by antibiotics?',
        'What is the incubation period for COVID-19?',
        'Do masks help prevent the spread of COVID-19?'
    ],
    'answer': [
        'Common symptoms of COVID-19 include headache and rash.',
        'COVID-19 spreads primarily through respiratory droplets when an infected person coughs, sneezes, or talks.',
        'Yes, COVID-19 can be cured by antibiotics.',
        'The incubation period for COVID-19 is typically 14-21 days.',
        'No, masks do not help in preventing the spread of COVID-19.'
    ],
    'contexts': [
        [
            'Common symptoms of COVID-19 include fever, cough, and shortness of breath. Some patients also experience fatigue, body aches, loss of taste or smell, sore throat, and gastrointestinal issues.'
        ],
        [
            'COVID-19 spreads mainly through close contact with an infected person. The virus is transmitted via respiratory droplets that are released when the infected person coughs, sneezes, or talks. It can also spread by touching surfaces contaminated with the virus and then touching the face.'
        ],
        [
            'COVID-19 is caused by a virus, specifically the SARS-CoV-2 virus. Antibiotics are not effective against viruses; they only work on bacterial infections. Treatment for COVID-19 primarily focuses on relieving symptoms and may include antiviral medications in severe cases.'
        ],
        [
            'The incubation period for COVID-19, which is the time between exposure to the virus and the onset of symptoms, is typically 2 to 14 days. However, most cases develop symptoms within 5 to 6 days after exposure.'
        ],
        [
            'Wearing masks, particularly in crowded and enclosed spaces, helps reduce the spread of COVID-19 by blocking respiratory droplets from reaching others. Masks are especially effective when combined with other measures like social distancing and hand hygiene.'
        ]
    ],
    'ground_truth': [
        'Common symptoms of COVID-19 include fever, cough, and shortness of breath.',
        'COVID-19 spreads primarily through respiratory droplets when an infected person coughs, sneezes, or talks.',
        'No, antibiotics do not cure COVID-19, as it is caused by a virus, not bacteria.',
        'The incubation period for COVID-19 is typically 2 to 14 days.',
        'Yes, masks do help in preventing the spread of COVID-19 by blocking respiratory droplets.'
    ]
}

dataset = Dataset.from_dict(covid_information_bot_dataset)

# Evaluate
results = evaluate_rag(dataset)

# Display results
print(f"\nEvaluated {len(results)} samples")
display(results)

*********************************************************************************************************************
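To make the `accuracy_score` aggregation in `evaluate_answer_correctness` concrete, the sketch below recomputes it from hypothetical similarity values. The numbers are invented for illustration and are not outputs of the models above, and the helper name `accuracy_from_similarities` is an assumption. As the contribution code is written, the score averages `pre_vs_ground` and `gen_vs_pre`, while `gen_vs_ground` is reported separately and does not enter the average.

```python
def accuracy_from_similarities(pre_vs_ground: float, gen_vs_pre: float) -> float:
    """Mean of the two similarities, mirroring the accuracy_score formula above."""
    return (pre_vs_ground + gen_vs_pre) / 2


# Hypothetical similarity values, chosen only to illustrate the arithmetic
hypothetical = {"pre_vs_ground": 0.75, "gen_vs_pre": 0.5}
print(accuracy_from_similarities(**hypothetical))  # 0.625
```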

### Results :

* We were able to implement the same evaluation metrics using open-source HuggingFace models like **distilbert-base-cased-distilled-squad** for question answering and **paraphrase-MiniLM-L6-v2** for similarity.
*******************************************************************************************************************************

#### Observations :

* More transparency in metrics calculation as the models are open-source.

* Less dependency on proprietary LLMs like ChatGPT for metrics evaluation.
*******************************************************************************************************************************


### Conclusion and Future Direction :

* While the original implementation offers a pre-packaged solution for metric evaluation, the new implementation offers a better approach in terms of cost, flexibility, and transparency.

* It also shows that proprietary LLM providers are not required for developing new approaches to RAG metrics evaluation.
*******************************************************************************************************************************
#### Learnings :

* Learned to implement a question-answering model for answer generation.

* Acquired hands-on experience in implementing open-source models for RAG metrics evaluation.

* Became familiar with similarity-based evaluation metrics such as 'gen_vs_ground_similarity', 'pre_vs_ground_similarity', and 'gen_vs_pre_similarity'.
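
The three similarity metrics listed above are all cosine similarities between sentence embeddings. As a minimal sketch of that computation, the pure-Python function below works on toy vectors. The vectors are made up for illustration and are not real `paraphrase-MiniLM-L6-v2` embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: same direction, then orthogonal directions
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Values close to 1 indicate that two texts are embedded in nearly the same direction, while values near 0 indicate unrelated embeddings.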
*******************************************************************************************************************************
#### Results Discussion :

* The metrics calculation is transparent. This is especially helpful given that LLMs are known to hallucinate.
*******************************************************************************************************************************
#### Limitations :

* Since our implementation is hosted locally, it is difficult to scale.

*******************************************************************************************************************************
#### Future Extension :

* The implementation can be turned into a free-to-use service that lets other people calculate metrics for their RAGs.

* Since we have more control over the implementation, newer and more useful metrics can be added over time.
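
One way new metrics could be added over time is through a small registry of scoring functions. The sketch below is only an illustration under assumed names: `METRICS`, `register_metric`, `score_all`, and the `token_overlap` metric are all hypothetical and are not part of the implementation above or of the `ragas` library.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a metric name to a scoring function
MetricFn = Callable[[str, str], float]
METRICS: Dict[str, MetricFn] = {}


def register_metric(name: str) -> Callable[[MetricFn], MetricFn]:
    """Decorator that stores a scoring function under the given name."""
    def decorator(fn: MetricFn) -> MetricFn:
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("token_overlap")
def token_overlap(answer: str, ground_truth: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (illustrative metric only)."""
    a, b = set(answer.lower().split()), set(ground_truth.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def score_all(answer: str, ground_truth: str) -> Dict[str, float]:
    """Run every registered metric on one answer/ground-truth pair."""
    return {name: fn(answer, ground_truth) for name, fn in METRICS.items()}


print(score_all("masks help", "masks help prevent spread"))  # {'token_overlap': 0.5}
```

With this layout, adding a metric means writing one decorated function, and the evaluation loop does not need to change.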

# References:

[1]:  Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated evaluation of retrieval augmented generation. Cardiff University, United Kingdom.
    