# Question-Answer Dataframe Creation

In this phase, I build a question-answer (QA) file using the LLM (LLaMA-3.0). The file consists of question-answer pairs generated from a randomly selected subset of texts in my DataFrame. Its purpose is to provide a reliable dataset for my later experiments and evaluations.

**Steps to Create the QA File:**

1.	Selecting Random Texts:\
	•	To ensure a diverse and representative sample of the dataset, I will randomly select texts from my DataFrame. This approach guarantees that the QA file covers a wide spectrum of topics and contexts, providing a comprehensive base for testing the model.
2.	Generating Questions and Answers:\
	•	I will leverage the advanced natural language understanding capabilities of LLaMA-3.0 to generate meaningful and contextually relevant questions and their corresponding answers. This step is crucial as it tests the model’s ability to interpret and process textual information accurately.
3.	Formatting the QA File:\
	•	Each question-answer pair will be systematically organized into a structured file. Entries will include the ID of the original text, the generated question, and its answer. This format will facilitate easy access and manipulation of data for subsequent evaluation steps.
4.	Evaluating the QA Dataframe:\
	•	After creation, the QA dataframe will undergo a rigorous evaluation to ensure that the generated questions meet my criteria of clarity, realism, and specificity. This evaluation determines whether the generated questions are usable in real-world scenarios and in further experimental setups.

### Importing libraries

In [1]:
# Standard library imports
import os
import sys
import json
import logging
import time
import random
import re
from typing import Any, List, Tuple


# Third-party imports
import pandas as pd
import numpy as np
import mlflow
import openai
from dotenv import load_dotenv
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_exception_type
from tqdm import tqdm
from sklearn.metrics import precision_score, recall_score, f1_score

# Langchain and embedding imports
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate
from langchain_community.embeddings import HuggingFaceEmbeddings


# Local imports from script directories
cwd = os.getcwd()
sys.path.insert(0, os.path.abspath(os.path.join(cwd, '../assets')))
sys.path.insert(0, os.path.abspath(os.path.join(cwd, '../src')))
sys.path.insert(0, os.path.abspath(os.path.join(cwd, '../')))

from scripts.pinecone_func import pinecone_upsert

# Logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Environment variables setup
load_dotenv()

# Instance of embedding model
embed_model = HuggingFaceEmbeddings()

2024-08-30 01:41:27,632 - INFO - Use pytorch device_name: mps
2024-08-30 01:41:27,633 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


---
### Initialising LLama Model

In this section, I set up a quantized LLaMA 3 8B Instruct model (the `Q3_K_L` GGUF file) for running experiments on my local machine through `LlamaCpp`. This configuration aims to generate high-quality outputs while keeping the computational cost manageable.

**Model Initialization:**\
	•	Model Path: The model files are loaded from a local directory.\
	•	Temperature: Set at 0.5 to generate more precise and less random outputs.\
	•	Max Tokens: Increased to 512, allowing the model to generate longer responses.\
	•	Top_p and Top_k: Set to 0.95 and 50 to control the generation’s randomness by restricting sampling to the most likely next tokens.\
	•	Repeat Penalty: Increased to 1.2 to discourage repetitive responses, enhancing the variety and relevance of the generated content.\
	•	GPU Layers: One layer is offloaded to the GPU (`n_gpu_layers=1`) to make use of the available hardware.

In [None]:
# Set up logging
logging.basicConfig(level=logging.INFO)

# Initialize the LlamaCpp model
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="../models/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf",
    temperature=0.5,  # Reduced temperature for more focused outputs
    max_tokens=512,  # Increased max tokens to allow for longer responses
    top_p=0.95,
    top_k=50,
    callback_manager=callback_manager,
    verbose=True,
    n_ctx=8192,
    repeat_penalty=1.2,  # Increased repeat penalty to discourage repetition
    n_gpu_layers=1, 
)


---
## Creating QA dataframe 


---
### Data Preparation

This section outlines the process of preparing the underlying data from which the QA pairs will be generated. The data preparation is crucial for ensuring the quality and variability of the input data, which directly influences the quality of the QA output.

Steps Involved in Data Preparation:

1.	Data Loading:\
    •	The complete dataset is loaded from a CSV file to ensure all available information is included for processing. This step captures the initial scope of the data available for generating QA pairs.
2.	Data Cleaning:\
    •	I clean the dataset by removing any rows where the text column contains NA values. This step is crucial because NA values can occur during text extraction due to language mismatches or missing entries. Cleaning ensures the dataset’s integrity, providing a solid foundation for generating accurate and relevant QA pairs.
3.	Selecting a Random Subset:\
    •	I select a random subset of 50 entries from the cleaned dataset. This subset is not only manageable in size for in-depth processing but also diverse, encompassing a range of contexts and topics. This diversity is vital for enhancing the robustness and comprehensiveness of the QA training and evaluation phases.


In [2]:
#loading the data
data_full = pd.read_csv("../assets/csv/data_full.csv", index_col=0)
print(f'Initial shape of the dataframe: {data_full.shape}')

#dropping na values
data_full = data_full.dropna(subset=['text'])
print(f'Shape of the dataframe after dropping na values: {data_full.shape}')

#selecting random subset
df_subset = data_full.sample(50, random_state=42)


Initial shape of the dataframe: (1034, 7)
Shape of the dataframe after dropping na values: (893, 7)


---
### QA Pair Extraction and Generation

This part of the process involves extracting and generating QA pairs from the dataset using predefined patterns and an LLM-based generation method. The aim is to create a diverse and comprehensive QA dataset suitable for evaluating the LLaMA model’s performance.

**Steps Involved in QA Pair Generation:**

1.	QA Pair Extraction Function:\
	•	This function searches the provided text for a question-answer pattern. It first looks for an explicit pair marked with “Question:” or “Q:” and “Answer:” or “A:”. If none is found, it takes the first sentence ending in a question mark as the question and treats the text that follows as the answer.
2. 	Single Question Generation Function:\
	•	This function truncates the input text to its first 300 words to keep the prompt short and focused. It then formats a prompt asking the LLM to generate one specific question and its answer based on the truncated text, adhering to defined rules such as using specific terms from the text and ensuring relevance.\
	•	Attempts to generate a QA pair are made, with retries if the initial attempts fail, to manage errors or unsatisfactory responses.
3. Batch QA Pair Generation Function:\
	•	Operating on the entire DataFrame, this function attempts to generate multiple QA pairs. It shuffles the DataFrame to randomize the starting point, processes each row individually, and uses the generate_single_question function to produce QA pairs.\
	•	It logs the process, captures successful QA pairs, and stops once the desired number of pairs is generated or the DataFrame is fully processed.
	

In [None]:
def extract_qa_pair(text):
    """
    Extracts a question-answer pair from a given text. First, it looks for explicitly formatted Q&A pairs.
    If none are found, it searches for any sentence ending with a question mark and treats the following
    text as the answer.
    
    Parameters:
    - text (str): The text from which to extract the QA pair.

    Returns:
    - tuple: A tuple containing the question and answer as strings, or (None, None) if no pair is found.
    """
    qa_pattern = re.compile(r'(?:Question:|Q:)\s*(.*?)\s*(?:Answer:|A:)\s*(.*)', re.DOTALL | re.IGNORECASE)
    match = qa_pattern.search(text)
    
    if match:
        question = match.group(1).strip()
        answer = match.group(2).strip()
        return question, answer
    
    # If no Q&A pattern found, try to extract any sentence with a question mark
    question_pattern = re.compile(r'([^.!?]+\?)')
    questions = question_pattern.findall(text)
    
    if questions:
        question = questions[0].strip()
        # Use the rest of the text as the answer
        answer = text[text.index(question) + len(question):].strip()
        return question, answer
    
    return None, None

def generate_single_question(text, max_retries=3):
    """
    Generates a single question-answer pair from a provided text using an LLM model. It retries generation
    up to a specified number of attempts if the initial tries fail.

    Parameters:
    - text (str): The text from which to generate the question.
    - max_retries (int): The maximum number of retries allowed for generating the QA pair.

    Returns:
    - list: A list containing a single tuple of the generated question and answer, or an empty list if generation fails.
    """
    # Keep only the first 300 whitespace-separated words to bound the prompt size
    max_words = 300
    truncated_text = ' '.join(text.split()[:max_words])
    
    template = """Text: {text}

Generate 1 specific question and answer based on the text above. Follow these rules:

1. Use specific terms, names, and figures from the text in your question.
2. The question must be directly related to the text content.
3. The answer should be comprehensive and use information from the text.
4. Use only this format, with no additional text:

Q: [Specific question]
A: [Detailed answer]"""

    prompt = PromptTemplate.from_template(template)
    formatted_prompt = prompt.format(text=truncated_text)
    
    for attempt in range(max_retries):
        try:
            response = llm.invoke(formatted_prompt)
            logging.info(f"Raw LLM response: {response}")

            question, answer = extract_qa_pair(response)
            
            if question and answer:
                return [(question, answer)]
            else:
                logging.warning(f"Attempt {attempt + 1}: Failed to extract Q&A. Retrying...")
        except Exception as e:
            logging.error(f"Attempt {attempt + 1}: Error in generate_single_question: {str(e)}")
    
    logging.error(f"Failed to generate Q&A after {max_retries} attempts. Problematic text: {truncated_text[:100]}...")
    return []

def generate_qa_pairs(df: pd.DataFrame, max_questions: int = 5) -> List[Tuple[str, str, str]]:
    """
    Generates a list of question-answer pairs from a DataFrame containing text data. This function shuffles
    the DataFrame to ensure diverse starting points and processes each row until the desired number of QA
    pairs is generated or the DataFrame is fully traversed.

    Parameters:
    - df (pd.DataFrame): The DataFrame containing the text data from which to generate QA pairs.
    - max_questions (int): The maximum number of QA pairs to generate.

    Returns:
    - List[Tuple[str, str, str]]: A list of tuples, each containing the document ID, question, and answer.
    """


    all_qa_pairs = []
    processed_ids = set()

    # Shuffle the dataframe to ensure we're not always starting from the same place
    df = df.sample(frac=1).reset_index(drop=True)

    for _, row in df.iterrows():
        if len(all_qa_pairs) >= max_questions:
            break

        if row['id'] in processed_ids:
            continue

        logging.info(f"Processing row with ID: {row['id']}")
        qa_pairs = generate_single_question(row['text'])
        
        if qa_pairs:
            for question, answer in qa_pairs:
                all_qa_pairs.append((row['id'], question, answer))
                logging.info(f"Added QA pair for ID {row['id']}: Q: {question[:50]}... A: {answer[:50]}...")
                if len(all_qa_pairs) >= max_questions:
                    break
            processed_ids.add(row['id'])
        else:
            logging.warning(f"No QA pairs generated for row with ID: {row['id']}")

        if len(all_qa_pairs) >= max_questions:
            break

    logging.info(f"Total QA pairs generated: {len(all_qa_pairs)}")
    return all_qa_pairs

def clean_evaluated_questions(input_df: pd.DataFrame) -> pd.DataFrame:
    """
    Cleans and formats the 'question' and 'answer' columns in a DataFrame by removing specific unwanted
    substrings, characters, and formatting cues. It also handles special cases like removing everything
    after certain patterns and trimming extra spaces.

    Parameters:
    - input_df (pd.DataFrame): DataFrame with 'question' and 'answer' columns to clean.

    Returns:
    - pd.DataFrame: A cleaned copy of the DataFrame with updated 'question' and 'answer' columns.
    """
    # Work on a copy so the caller's DataFrame is left unchanged
    evaluated_df = input_df.copy()

    # Define a helper function to apply various cleaning operations to a column
    def clean_column(text: str) -> str:
        # Remove specific phrases and extra new lines
        replacements = {
            'Question\n\n': '', 'Question\n': '', 'Question:': '',
            'Answer\n\n': '', 'Answer\n': '', 'Answer:': '',
            'Q:': '', 'A:': ''
        }
        for old, new in replacements.items():
            text = text.replace(old, new)
        
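        # Strip first so that a leading newline left over from the replacements
        # above does not make the split below return an empty string
        text = text.strip()

        # Remove content after certain patterns
        for delimiter in ['\n\n**', '\n\n', '\n']:
            if delimiter in text:
                text = text.split(delimiter)[0]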
        
        # Strip leading and trailing whitespace
        text = text.strip()
        
        return text

    # Clean 'question' and 'answer' columns
    evaluated_df['question'] = evaluated_df['question'].apply(clean_column)
    evaluated_df['answer'] = evaluated_df['answer'].apply(clean_column)


    # Return the cleaned DataFrame
    return evaluated_df

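# Generate QA pairs
# Note: this call uses data_full rather than df_subset. Each row yields at most
# one pair, so the 50-row subset could produce at most 50 of the 75 pairs requested.
qa_pairs = generate_qa_pairs(data_full, max_questions=75)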

# Print the generated QA pairs
for doc_id, question, answer in qa_pairs:
    print(f"Document ID: {doc_id}")
    print(f"Q: {question}")
    print(f"A: {answer}")
    print("-" * 50)

# Convert to a dataframe
qa_df = pd.DataFrame(qa_pairs, columns=['id', 'question', 'answer'])
qa_df = clean_evaluated_questions(qa_df)
print(qa_df)

# Optionally, save the results to a CSV file
qa_df.to_csv('qa_pairs_to_evaluate.csv', index=False)
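
To make the extraction step concrete, the following minimal sketch applies the same `Q:`/`A:` pattern used in `extract_qa_pair` to a made-up model response. The sample string is invented for illustration and is not output from the model.

```python
import re

# Same pattern as in extract_qa_pair: a "Question:"/"Q:" label, a lazy capture
# of the question, then an "Answer:"/"A:" label followed by the rest of the text.
QA_PATTERN = re.compile(
    r'(?:Question:|Q:)\s*(.*?)\s*(?:Answer:|A:)\s*(.*)',
    re.DOTALL | re.IGNORECASE,
)

# Hypothetical response, invented for illustration only.
sample_response = (
    "Q: Which course covers Bayesian methods?\n"
    "A: The example course covers Bayesian methods in week 3."
)

match = QA_PATTERN.search(sample_response)
question = match.group(1).strip()
answer = match.group(2).strip()
print(question)
print(answer)
```

Because the pattern is case-insensitive and unanchored, a lowercase `a:` inside a word such as `data:` would also be treated as the answer marker, which is one reason the generated pairs still need the cleaning and review steps.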


---
## QA Evaluation


After generating the question and answer pairs, a rigorous evaluation is essential to ensure their quality and applicability. This evaluation is structured around three key criteria:

1.	Specificity: Questions should be precise rather than broad. For instance, rather than asking a vague question like, “What is this article about?” it is more beneficial to ask something specific, such as, “What is the main argument presented in article ‘XYZ’?” This approach ensures that the questions probe specific knowledge areas and are directly related to the text.
2.	Realism: The questions must reflect queries that students are likely to pose in an academic context, embodying practical, real-world relevance. This means they should be formulated in a way that naturally aligns with typical student inquiries and educational standards.
3.	Clarity: It is critical that each question is articulated clearly, without any ambiguous terms or confusing phrasing that could mislead or perplex students. Clarity in question formulation is vital to effective assessment and learning.

The evaluation comprises two distinct phases:

1.	Cross-Evaluation with an Alternate Large Language Model: A second language model (here `gpt-3.5-turbo`) rates each generated question on the three criteria above, providing an automated layer of quality assurance.
2.	Manual Review with Human-in-the-Loop: The final stage of the evaluation process involves a manual review by an expert. This human element ensures that the questions not only meet technical standards but also resonate with human judgment and educational value.

This dual-stage evaluation strategy aims to establish the reliability and efficiency of using AI-generated content in educational settings, potentially paving the way for broader applications in future projects.

In [18]:
qa_df = pd.read_csv('../assets/csv/qa_pairs_to_evaluate.csv', index_col=0)

In [27]:

# Get the API key
api_key = os.getenv("OPENAI_API_KEY")

if not api_key:
    raise ValueError("No OpenAI API key found. Please set the OPENAI_API_KEY environment variable.")

# Set up OpenAI client
client = openai.Client(api_key=api_key)

def is_non_specific_question(question):
    # List of patterns for very general, non-specific questions
    patterns = [
        r"^what is this about\??$",
        r"^what is the main topic\??$",
        r"^what (is|are) the main point(s)?\??$",
        r"^can you summarize this\??$",
        r"^what (is|are) the key takeaway(s)?\??$",
        r"^what is the main (point|idea|topic) of this (article|paper)\??$",
        r"^what information does the text provide\??$",
        r"^what is the main point of this paper\??$",
        # Add further patterns here. A bracketed placeholder must not be listed as a
        # pattern: regex would read the brackets as a character class.
        # Patterns for detecting questions that refer to unspecified models or papers
        r"^what (does|do) the (model|paper|study) (propose|suggest|conclude)\??$",
        r"^how does the (model|paper|study) improve (on|over) existing (models|methods)\??$"
    ]

    # Normalise case and surrounding whitespace so the anchored patterns can match
    question = question.strip().lower()
    
    # Check if the question matches any pattern
    return any(re.match(pattern, question) for pattern in patterns)

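# Retry only on rate limits: AuthenticationError is a subclass of openai.APIError,
# so retrying on APIError would also retry a bad API key up to 20 times.
@retry(wait=wait_random_exponential(min=1, max=120), stop=stop_after_attempt(20), retry=retry_if_exception_type(openai.RateLimitError))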
def evaluate_question(question):
    # Generic questions skip the API call and receive the minimum score on every criterion
    if is_non_specific_question(question):
        return {
            "Specificity": 1,
            "Realism": 1,
            "Clarity": 1
        }

    evaluation_prompt = f"""
    Evaluate the following question based on three criteria using a scale of 1 to 5:
    1. **Specificity**: Does the question clearly identify a specific paper, model, or study, or specific topic, avoiding general references like 'the model' or 'the paper'? The question should mention specific details that clearly delineate which content is being queried, rather than asking broadly.
    2. **Realism**: Is the question realistic and aligned with what students might genuinely ask in an academic setting? The question should reflect practical and common-sense inquiries likely to arise during study or review.
    3. **Clarity**: Is the question clearly formulated, avoiding ambiguous language or phrasing that could confuse students? The question should be easy to understand and free from vague terms or complex structures.
    Rate each criterion from 1 to 5, where:
    1 - Very Poor
    2 - Poor
    3 - Fair
    4 - Good
    5 - Excellent
    Question: "{question}"
    Please provide a rating for each criterion along with a brief explanation.
    Specificity:
    Realism:
    Clarity:
    """
    
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are an AI assistant that evaluates questions based on specificity, realism, and clarity."},
                {"role": "user", "content": evaluation_prompt}
            ],
            max_tokens=150,
            temperature=0.5,
        )
        response_text = response.choices[0].message.content.strip()
    except openai.RateLimitError:
        logging.warning("Rate limit reached. Retrying...")
        raise
    except openai.AuthenticationError:
        logging.error("Authentication failed. Please check your API key.")
        raise
    except Exception as e:
        logging.error(f"Error in API call: {e}")
        return None

    scores = {
        "Specificity": None,
        "Realism": None,
        "Clarity": None
    }
    for line in response_text.splitlines():
        for criterion in scores.keys():
            if f"{criterion}:" in line:
                try:
                    scores[criterion] = int(line.split(":")[1].strip().split()[0])
                except (ValueError, IndexError):
                    logging.error(f"Error parsing score for {criterion}")
    
    return scores

def process_questions(df, batch_size=1):
    evaluated_questions_df = df.copy()
    evaluated_questions_df['Specificity'] = None
    evaluated_questions_df['Realism'] = None
    evaluated_questions_df['Clarity'] = None
    evaluated_questions_df['Average Score'] = None
    evaluated_questions_df['Is Non-Specific'] = None

    for i in tqdm(range(0, len(evaluated_questions_df), batch_size), desc="Processing questions"):
        batch = evaluated_questions_df.iloc[i:i+batch_size]
        for index, row in batch.iterrows():
            question = row['question']
            is_non_specific = is_non_specific_question(question)
            evaluated_questions_df.at[index, 'Is Non-Specific'] = is_non_specific
            
            scores = evaluate_question(question)
            if scores:
                valid_scores = [score for score in scores.values() if score is not None]
                for criterion, score in scores.items():
                    evaluated_questions_df.at[index, criterion] = score
                if valid_scores:
                    average_score = sum(valid_scores) / len(valid_scores)
                    evaluated_questions_df.at[index, 'Average Score'] = average_score
            
            # Pause for a random 5-15 seconds between requests to stay under the API rate limit
            time.sleep(random.uniform(5, 15))

    return evaluated_questions_df

# Verify API key is set
print(f"API key loaded: {'Yes' if api_key else 'No'}")

# Assuming qa_df is your DataFrame with columns: id, question, answer
# If your DataFrame columns have different names, adjust accordingly

# Process questions
try:
    evaluated_questions_df = process_questions(qa_df)

    # Save the DataFrame with evaluations to a CSV file
    evaluated_questions_df.to_csv('evaluated_questions_with_scores.csv', index=False)

    # Display the evaluated questions with their scores
    print(evaluated_questions_df)
except Exception as e:
    logging.error(f"An error occurred during processing: {e}")

API key loaded: Yes


Processing questions:   0%|          | 0/75 [00:00<?, ?it/s]2024-08-30 13:50:56,299 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing questions:   1%|▏         | 1/75 [00:14<18:09, 14.72s/it]2024-08-30 13:51:11,625 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing questions:   3%|▎         | 2/75 [00:30<18:32, 15.24s/it]2024-08-30 13:51:27,048 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing questions:   4%|▍         | 3/75 [00:47<19:11, 16.00s/it]2024-08-30 13:51:43,856 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing questions:   5%|▌         | 4/75 [01:03<19:08, 16.17s/it]2024-08-30 13:52:00,575 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing questions:   7%|▋         | 5/75 [01:13<16:01, 13.74s/it]2024-08-30 13:52:09,865 - INFO - HT

                                  id  \
0   29d3d7b9e48f046d75821fe3b763da05   
1   d804b24d051b1c5f06d0e0819f25ca37   
2   d3251ee436c00c1107189675b4bd8fc7   
3   e8de7463b8dde36dadddab419c363a57   
4   ba886c5587fbbd26b9020e0c258a8ca7   
..                               ...   
70  4cac46414d07fba4f66281e2f625418f   
71  c001c31186c9b7275a1917594b89926e   
72  10b5c43940a1595f5e955ee4fdb4fd93   
73  6621820746cd195d905c3bf7ab37006a   
74  86d2b2d0bfd932492ba0d6ceb4c5e252   

                                             question  \
0    What are some potential benefits of a new cur...   
1    How will students learn about different metho...   
2    How will the research group DeSBi improve sta...   
3   What are some of the key findings in "Debt Mat...   
4   What is the main focus of Prof. Dr. Joachim Ga...   
..                                                ...   
70        What are Jiawei Zhang's research interests?   
71        What was Hedda Nielsen's role in 2018-2019?   
72   W
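
The score parsing in `evaluate_question` takes the first whitespace-separated token after the colon. A reply such as `Specificity: 4/5` therefore fails the `int(...)` conversion, and `**Clarity**: 5` is not matched at all; in both cases the score stays `None`. The sketch below shows one possible, more tolerant parser. The `parse_scores` helper and the sample reply are hypothetical and are not part of the notebook's pipeline.

```python
import re
from typing import Dict, Optional

CRITERIA = ("Specificity", "Realism", "Clarity")


def parse_scores(response_text: str) -> Dict[str, Optional[int]]:
    """Return the first 1-5 rating found after each criterion name, else None."""
    scores: Dict[str, Optional[int]] = {name: None for name in CRITERIA}
    for name in CRITERIA:
        # Criterion name, optional markdown asterisks or spaces, a colon,
        # optional asterisks or spaces again, then a single digit from 1 to 5.
        pattern = re.compile(rf"{name}[* \t]*:[* \t]*([1-5])\b", re.IGNORECASE)
        match = pattern.search(response_text)
        if match:
            scores[name] = int(match.group(1))
    return scores


# Hypothetical reply, invented for illustration only.
reply = "**Specificity**: 4/5\nRealism: 3 - fair\nClarity: excellent"
print(parse_scores(reply))
```

Criteria without a usable rating stay `None`, so the averaging logic in `process_questions`, which already skips `None` values, would work unchanged with such a parser.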




After the initial phase of automatic evaluation, I will conduct a manual review of the questions. During this review, each question will be carefully assessed for its relevance, clarity, and alignment with the specified criteria. Based on this assessment, I will create a binary variable named “Keep.”

In this system:

	•	A value of 1.0 indicates that the question meets all quality standards and should be retained.
	•	A value of 0.0 suggests that the question does not meet the necessary criteria and should be excluded from the dataset.


In [51]:
evaluated_questions_df_checked = pd.read_csv('../assets/csv/evaluated_questions_with_scores_manual.csv', index_col=0)

#keep only where column keep is 1
evaluated_questions_df_checked = evaluated_questions_df_checked[evaluated_questions_df_checked['Keep'] == 1]
evaluated_questions_df_checked.head()

| Unnamed: 0 | id | question | answer | Specificity | Realism | Clarity | Average Score | Is Non-Specific | Keep |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 29d3d7b9e48f046d75821fe3b763da05 | What are some potential benefits of a new curr... | According to Bahaj and Reis (2023), one potent... | 2.0 | 4.0 | 3.0 | 3.0 | False | 1.0 |
| 1 | d804b24d051b1c5f06d0e0819f25ca37 | How will students learn about different method... | After participating in this course, students w... | 3.0 | 4.0 | 4.0 | 3.666667 | False | 1.0 |
| 2 | d3251ee436c00c1107189675b4bd8fc7 | How will the research group DeSBi improve stat... | The research group DeSBi aims to develop novel... | 4.0 | 5.0 | 4.0 | 4.333333 | False | 1.0 |
| 3 | e8de7463b8dde36dadddab419c363a57 | What are some of the key findings in "Debt Mat... | According to the study, one of the key finding... | 5.0 | 4.0 | 5.0 | 4.666667 | False | 1.0 |
| 4 | ba886c5587fbbd26b9020e0c258a8ca7 | What is the main focus of Prof. Dr. Joachim Ga... | The main focus of Prof. Dr. Joachim Gassen is ... | 3.0 | 4.0 | 4.0 | 3.666667 | False | 1.0 |


This section outlines the steps to finalize the dataframe used for evaluating the retriever. Currently, the process incorporates a Human-in-the-Loop (HITL) approach to ensure the quality of the dataframe. This critical human review helps identify any discrepancies or issues that might not be caught by automated systems.

Once a significant number of these dataframes have been reviewed and validated, it’s possible to develop a predictive model. This model would assess whether a question is likely to pass moderation based on the learned criteria. Implementing such a model could reduce or potentially eliminate the need for human moderation, streamlining the process to be more automatic.
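
Before training such a model, a simple baseline could check how well the automatic scores alone predict the manual `Keep` decision. The sketch below is one possible starting point under stated assumptions: the `keep_by_threshold` helper, the threshold of 3.5, and the example rows are all hypothetical and are not taken from the reviewed data.

```python
from typing import List, Tuple


def keep_by_threshold(avg_score: float, threshold: float = 3.5) -> int:
    """Predict Keep=1 when the average automatic score reaches the threshold."""
    return int(avg_score >= threshold)


def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Compute precision and recall for the positive (Keep=1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical (average score, manual Keep) pairs, invented for illustration.
rows = [(4.67, 1), (4.33, 1), (3.0, 1), (2.0, 0), (3.67, 0)]
y_true = [keep for _, keep in rows]
y_pred = [keep_by_threshold(avg) for avg, _ in rows]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

The notebook already imports `precision_score` and `recall_score` from scikit-learn, which could replace the hand-written helper. A learned classifier would only be worth adding if it clearly beats a threshold rule like this one on the manually reviewed questions.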

At this stage, human review is essential. With the rigorously checked dataframe, we can proceed to the evaluation phase of the retriever. This ensures that our retriever is tested under the most accurate and reliable conditions, paving the way for robust performance in real-world applications.

Note: the Makefile and script are stored in `src/retriver_evaluation`.