<a href="https://colab.research.google.com/github/sam4410/RAG-Technique-based-models/blob/main/Hybrid_Adaptive_RAG_Driven_Gen_AI_with_Expert_Human_Feedback.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will be implementing Adaptive RAG with Human Feedback (HF) loop. This system is called adaptive because the documents used for retrieval are updated. Integrating HF in RAG leads to a pragmatic hybrid approach because it involves humans in an otherwise automated generative process.

Implementation of Adaptive RAG framework is goign to include the following:
* Defining the adaptive RAG ecosystem
* Applying adaptive RAG to augmented retrieval queries
* Automating augmented generative AI inputs with HF
* Automating end-user feedback rankings to trigger expert HF
* Creating an automated feedback system for a human expert
* Integrating HF with adaptive RAG for GPT-4o

We will build our own pipeline and introduce HF. The program is divided into three separate parts: the retriever, generator, and evaluator functions, which can be separate agents in a real-life project's pipeline

## 1. Retriever

In [1]:
!pip install requests==2.32.3 beautifulsoup4==4.12.3 openai==1.40.3



In [2]:
import os
import warnings
import requests
from google.colab import userdata
from bs4 import BeautifulSoup
from huggingface_hub import login
import re
import openai
from google.colab import drive
warnings.filterwarnings('ignore')

# Connect this Colab to my Google Drive
drive.mount("/content/drive")

#Retrieving and setting OpenAI API key
f = open("drive/MyDrive/Colab Notebooks/key_files/openai_api_key.txt", "r")
API_KEY=f.readline().strip()
f.close()

#The OpenAI API key
os.environ['OPENAI_API_KEY'] =API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

#Retrieving and setting Activeloop API token
f = open("drive/MyDrive/Colab Notebooks/key_files/activeloop_token.txt", "r")
API_token=f.readline().strip()
f.close()
ACTIVELOOP_TOKEN=API_token
os.environ['ACTIVELOOP_TOKEN'] =ACTIVELOOP_TOKEN

# signing to hugging face hub
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=False)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Preparing the dataset

In [3]:
# we will retrieve Wikipedia documents by scraping them through their URLs. The dataset will contain automated or human-crafted labels for each document, which is the first step
# toward indexing the documents of a dataset
import requests
from bs4 import BeautifulSoup
import re

# URLs of the Wikipedia articles mapped to keywords
urls = {
    "prompt engineering": "https://en.wikipedia.org/wiki/Prompt_engineering",
    "artificial intelligence":"https://en.wikipedia.org/wiki/Artificial_intelligence",
    "llm": "https://en.wikipedia.org/wiki/Large_language_model",
    "llms": "https://en.wikipedia.org/wiki/Large_language_model"
    }

#### Processing the data

In [4]:
# apply a standard scraping and text-cleaning function to the document that will be retrieved
def fetch_and_clean(url):
  # Fetch the content of the URL
  response = requests.get(url)
  soup = BeautifulSoup(response.content, 'html.parser')

  # Find the main content of the article, ignoring side boxes and headers
  content = soup.find('div', {'class': 'mw-parser-output'})

  # Remove less relevant sections such as "See also", "References", etc.
  for section_title in ['References', 'Bibliography', 'External links', 'See also']:
    section = content.find('span', {'id': section_title})
    if section:
      for sib in section.parent.find_next_siblings():
        sib.decompose()
      section.parent.decompose()

  # Focus on extracting and cleaning text from paragraph tags only
  paragraphs = content.find_all('p')
  cleaned_text = ' '.join(paragraph.get_text(separator=' ', strip=True) for paragraph in paragraphs)
  cleaned_text = re.sub(r'\[\d+\]', '', cleaned_text)   # remove citation markers like [1], [2], etc

  return cleaned_text

#### Retrieval process for user input

The first step here involves identifying a keyword within the user's input. The function "process_query"
takes two parameters: user_input and num_words. The number of words to retrieve is restricted by
factors like the input limitations of the model, cost considerations, and overall system performance

In [5]:
import textwrap

def process_query(user_input, num_words):
  user_input = user_input.lower()

  # Check for any of the specified keywords in the input
  matched_keyword = next((keyword for keyword in urls if keyword in user_input), None)

  # upon finding match between keyword in user query and keywords associated with URLs
  # following functions for fetching and cleaning the data are triggered
  if matched_keyword:
    print(f"Fetching data from: {urls[matched_keyword]}")
    cleaned_text = fetch_and_clean(urls[matched_keyword])

    # Limit the display to the specified number of words from the cleaned text
    words = cleaned_text.split()   # Split the text into words
    first_n_words = ' '.join(words[:num_words])   # Join the first n words into a single string

    # num_words parameter helps in chunking the text
    # cleaned and truncated text is then formatted for display
    # Wrap the first n words to 80 characters wide for display
    wrapped_text = textwrap.fill(first_n_words, width=80)
    print("\nFirst {} words of the cleaned text:".format(num_words))
    print(wrapped_text) # Print the first n words as a well-formatted paragraph

    # Use the exact same first_n_words for the GPT-4 prompt to ensure consistency
    prompt = f"Summarize the following information about {matched_keyword}:\n{first_n_words}"

    wrapped_prompt = textwrap.fill(prompt, width=80) # Wrap prompt text
    print("\nPrompt for Generator:", wrapped_prompt)

    # Return the specified number of words
    return first_n_words
  else:
    print("No relevant keywords found. Please enter a query related to 'LLM', 'LLMs', or 'Prompt Engineering'.")
    return None

Above function ultimately returns the first n words, providing a concise and relevant snippet
of information based on the user's query. This design allows the system to manage data retrieval efficiently
while also maintaining user engagement

## 2.Generator

2.1. Integrating HF-RAG for augmented document inputs

The dynamic nature of information retrieval and the necessity for contextually relevant data augmentation in generative AI models require a flexible system capable of adapting to varying levels of input quality.

The adaptive RAG selection system employs HF scores to determine the
optimal retrieval strategy for document implementation within the RAG ecosystem.

The adaptive approach aims to optimize the balance between automated retrieval and human insight,
ensuring the generative model's outputs are of the highest possible relevance and accuracy

2.2. Input

A user of Company C is prompted to enter a question:

In [6]:
# Request user input for keyword parsing
user_input = input("Enter your query: ").lower()

Enter your query: What is LLMs?


2.3. Mean Ranking Simulation Scenario

Let's assume that the human user feedback panel has been evaluating
the hybrid adaptive RAG system for some time. The user feedback panel ranks the reponses a number of times, which automatically updates by calculating the mean of the ratings and storing it in a ranking variable named 'ranking'. The ranking score will help the management team decide whether to downgrade the rank of a document, upgrade it, or suppress documents through manual or automated functions.

In [7]:
# We will begin with a 1 to 5 ranking, which will deactivate RAG so that we can see the native response of the generative model
#Select a score between 1 and 5 to run the simulation
ranking=5

we will modify this value to activate RAG without additional human-expert feedback with
ranking=5. Finally, we will modify this value to activate human feedback RAG without retrieving
documents with ranking=3.

In [8]:
# initializing the text for the generative AI model simulations
text_input=[]

Each time we switch scenarios, make sure to come back and reinitialize text_input

#### Ranking 1–2: No RAG

In [9]:
# ranking of the generative AI’s output is very low. All RAG functionality is deactivated until the management team can analyze and improve the system
# In this case, text_input is equal to user_input
if ranking >= 1 and ranking < 3:
  text_input = user_input

#### Ranking 3–4: Human-expert feedback RAG

In this scenario, human-expert feedback was triggered by poor user feedback ratings with automated RAG documents(ranking=5) and without RAG (ranking
1-2). The human-expert panel has filled in a flashcard, which has now been stored as an expert-level
RAG document.

In [10]:
# program first checks the ranking and activates HF retrieval
hf=False
if ranking>3 and ranking<5:
  hf=True

The program will then fetch the proper document from an expert panel (selected experts within a
corporation) dataset based on keywords, embeddings, or other search methods that fit the goals of a
project. In this case, we assume we have found the right flashcard and download it

In [11]:
with open("drive/MyDrive/human_feedback.txt", "r") as file:
  content = file.read().replace('\n', ' ').replace('#', '')     #Removing new line and markdown characters
text_input=content
print(text_input)  #The content of the file explains both what an LLM is and how it can help Company C improve customer support

A Large Language Model (LLM) is an advanced AI system trained on vast amounts of text data to generate human-like text responses. It understands and generates language based on the patterns and information it has learned during training. LLMs are highly effective in various language-based tasks, including answering questions, making recommendations, and facilitating conversations. They can be continually updated with new information and trained to understand specific domains or industries.For the C-phone series customer support, incorporating an LLM could significantly enhance service quality and efficiency. The conversational agent powered by an LLM can provide instant responses to customer inquiries, reducing wait times and freeing up human agents for more complex issues. It can be programmed to handle common technical questions about the C-phone series, troubleshoot problems, guide users through setup processes, and offer tips for optimizing device performance. Additionally, it can 

In [12]:
# If we try generate content based this text_input creatd as symbol of expert human feedabck, the response is satisfactory
# The preceding response is now much better since it defines LLMs and also shows how to improve customer service for Company C's C-phone series.

#### Ranking 5: RAG with no human-expert feedback documents

In case, users do not require RAG documents that include human-expert RAG flashcards, snippets, or
documents. This might be the case, particularly, if software engineers are the users. In this case, the maximum number of words is limited to 100 to optimize API costs, but can be modified

In [13]:
if ranking >= 5:
  max_words=100 #Limit: the size of the data we can add to the input
  rdata=process_query(user_input,max_words)
  if rdata:
    rdata_clean = rdata.replace('\n', ' ').replace('#', '')
    rdata_sentences = rdata_clean.split('. ')
    print(rdata)
  text_input=rdata
  print(text_input)

Fetching data from: https://en.wikipedia.org/wiki/Large_language_model

First 100 words of the cleaned text:
A large language model ( LLM ) is a type of machine learning model designed for
natural language processing tasks such as language generation . LLMs are
language models with many parameters, and are trained with self-supervised
learning on a vast amount of text. The largest and most capable LLMs are
generative pretrained transformers (GPTs). Modern models can be fine-tuned for
specific tasks or guided by prompt engineering . [ 1 ] These models acquire
predictive power regarding syntax , semantics , and ontologies [ 2 ] inherent in
human language corpora, but they also inherit inaccuracies and biases present

Prompt for Generator: Summarize the following information about llm: A large language model ( LLM ) is
a type of machine learning model designed for natural language processing tasks
such as language generation . LLMs are language models with many parameters, and
are trained

2.6. Content Generation

In [15]:
import openai
from openai import OpenAI
import time
import httpx
from openai import DefaultHttpxClient

client = OpenAI(http_client=DefaultHttpxClient())
gptmodel="gpt-4o"
start_time = time.time() # Start timing before the request

# define a standard Gpt-4o prompt, giving it enough information to respond and leaving the rest up to the model and RAG data
def call_gpt4_with_full_text(itext):
  # Join all lines to form a single string
  text_input = '\n'.join(itext)
  prompt = f"Please summarize or elaborate on the following content:\n{text_input}"

  try:
    response = client.chat.completions.create(
        model=gptmodel,
        messages=[
                {"role": "system", "content": "You are an expert Natural Language Processing exercise expert."},
                {"role": "assistant", "content": "1.You can explain read the input and answer in detail"},
                {"role": "user", "content": prompt}
    ],
    temperature=0.1 # Add the temperature parameter here and otherparameters you need
    )
    return response.choices[0].message.content.strip()
  except Exception as e:
    return str(e)

In [16]:
# Call the function and print the result
gpt4_response = call_gpt4_with_full_text(text_input)

response_time = time.time() - start_time  # Measure response time
print(f"Response Time: {response_time:.2f} seconds")  # Print response time

print(gptmodel,"Response:", gpt4_response)

Response Time: 44.72 seconds
gpt-4o Response: A large language model (LLM) is a type of machine learning model specifically designed for natural language processing (NLP) tasks, such as language generation. These models are characterized by having a large number of parameters and are trained using self-supervised learning on extensive text datasets. The most advanced and capable LLMs are often generative pre-trained transformers (GPTs). Modern LLMs can be fine-tuned for specific tasks or guided using prompt engineering to achieve desired outcomes.

These models gain predictive power in understanding syntax, semantics, and ontologies inherent in human language corpora. However, they also inherit inaccuracies and biases present in the data they are trained on. This means that while they can perform complex language tasks, they may also reflect and perpetuate existing biases and errors found in the training data.


In [17]:
# below code then formats the output
import textwrap

def print_formatted_response(response):
  # Define the width for wrapping the text
  wrapper = textwrap.TextWrapper(width=80)
  wrapped_text = wrapper.fill(text=response)

  # Print the formatted response with a header and footer
  print("GPT-4 Response:")
  print("---------------")
  print(wrapped_text)
  print("---------------\n")

In [18]:
# # Assuming 'gpt4_response' contains the response from the previous GPT-4 call
print_formatted_response(gpt4_response)

GPT-4 Response:
---------------
A large language model (LLM) is a type of machine learning model specifically
designed for natural language processing (NLP) tasks, such as language
generation. These models are characterized by having a large number of
parameters and are trained using self-supervised learning on extensive text
datasets. The most advanced and capable LLMs are often generative pre-trained
transformers (GPTs). Modern LLMs can be fine-tuned for specific tasks or guided
using prompt engineering to achieve desired outcomes.  These models gain
predictive power in understanding syntax, semantics, and ontologies inherent in
human language corpora. However, they also inherit inaccuracies and biases
present in the data they are trained on. This means that while they can perform
complex language tasks, they may also reflect and perpetuate existing biases and
errors found in the training data.
---------------



## 3.Evaluator

we will implement two automatic metrics:
response time and cosine similarity score. We will then implement two interactive evaluation functions:
human user rating and human-expert evaluation.

#### 3.1. Response time


In [19]:
import time

start_time = time.time() # Start timing before the request

response_time = time.time() - start_time # Measure response time
print(f"Response Time: {response_time:.2f} seconds") # Print response time

Response Time: 0.00 seconds


#### 3.2. Cosine similarity score

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
  vectorizer = TfidfVectorizer()
  tfidf = vectorizer.fit_transform([text1, text2])
  similarity = cosine_similarity(tfidf[0:1], tfidf[1:2])
  return similarity[0][0]

# Example usage with your existing functions
similarity_score = calculate_cosine_similarity(text_input, gpt4_response)
print(f"Cosine Similarity Score: {similarity_score:.3f}")

Cosine Similarity Score: 0.734


The score shows a strong similarity between the input and the output of the model. But how will a
human user rate this response? Let's find out.

#### 3.3. Human user rating

The human user rating interface provides human user feedback. it is recommended designing this interface and process after fully understanding user needs through a workshop with them

In [21]:
# code begins with the interface's parameters
# Score parameters
counter=20         # number of feedback queries
score_history=30   # human feedback
threshold=4        # minimum rankings to trigger human expert feedback

In this simulation, the parameters show that the system has computed human feedback:
* counter=20 shows the number of ratings already entered by the users
* score_history=60 shows the total score of the 20 ratings
* threshold=4 states the minimum mean rating, score_history/counter, to obtain without triggering a human-expert feedback request

In [22]:
import numpy as np

def evaluate_response(response):
  print("\nGenerated Response:")
  print(response)
  print("\nPlease evaluate the response based on the following criteria:")
  print("1 - Poor, 2 - Fair, 3 - Good, 4 - Very Good, 5 - Excellent")
  score = input("Enter the relevance and coherence score (1-5): ")
  try:
    score = int(score)
    if 1 <= score <= 5:
      return score
    else:
      print("Invalid score. Please enter a number between 1 and 5.")
      return evaluate_response(response)      # Recursive call if the input if the input is invalid
  except ValueError:
    print("Invalid input. Please enter a number.")
    return evaluate_response(response)      # Recursive call if the input is invalid
score = evaluate_response(gpt4_response)
print("Evaluator Score:", score)

counter+=1
score_history+=score
mean_score=round(np.mean(score_history/counter), 2)
if counter>0:
  print("Rankings      :", counter)
  print("Score history : ", mean_score)


Generated Response:
A large language model (LLM) is a type of machine learning model specifically designed for natural language processing (NLP) tasks, such as language generation. These models are characterized by having a large number of parameters and are trained using self-supervised learning on extensive text datasets. The most advanced and capable LLMs are often generative pre-trained transformers (GPTs). Modern LLMs can be fine-tuned for specific tasks or guided using prompt engineering to achieve desired outcomes.

These models gain predictive power in understanding syntax, semantics, and ontologies inherent in human language corpora. However, they also inherit inaccuracies and biases present in the data they are trained on. This means that while they can perform complex language tasks, they may also reflect and perpetuate existing biases and errors found in the training data.

Please evaluate the response based on the following criteria:
1 - Poor, 2 - Fair, 3 - Good, 4 - Very

#### 3.4. Human-expert evaluation

Metrics such as cosine similarity indeed measure similarity but not provide in-depth accuracy. Time performance will not determine the accuracy of a response either. But if the rating is too low, why is that? Because the user is not satisfied with the response!

In [25]:
# parameters to trigger an expert’s feedback are counter_threshold and score_threshold
# The number of user ratings must exceed the expert's threshold counter, which is counter_threshold=10
# The threshold of the mean score of the ratings is 4 in this scenario: score_threshold=4
counter_threshold=10
score_threshold=4

if counter > counter_threshold and score_history<=score_threshold:
  print("Human expert evaluation is required for the feedback loop.")

# the output will confirm the expert feedback loop because of the poor mean ratings and the number of times the users rated the response

Now a standard HTML interface in a Python cell will display the thumbs-up and
thumbs-down icons. If the expert presses on the thumbs-down icon, a feedback snippet can be entered and saved in a feedback file named expert_feedback.txt

In [26]:
import base64
from google.colab import output
from IPython.display import display, HTML

def image_to_data_uri(file_path):
  # convert an image to data URI
  with open(file_path, 'rb') as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode()
  return f'data:image/png;base64,{encoded_string}'

thumbs_up_data_uri = image_to_data_uri("drive/MyDrive/thumbs_up.png")
thumbs_down_data_uri = image_to_data_uri("drive/MyDrive/thumbs_down.png")

def display_icons():
  # Define the HTML content with the two clickable images
  html = f'''




    '''
  display(HTML(html))

def save_feedback(feedback):
  with open("drive/MyDrive/expert_feedback.txt", "w") as file:
    file.write(feedback)
  print("Feedback saved successfully.")

# Register the callback
output.register_callback('notebook.save_feedback', save_feedback)

print("Human Expert Adaptive RAG activated")

# Display the icons with click handlers
display_icons()

Human Expert Adaptive RAG activated


We can add a function for thumbs-down meaning that the response was incorrect and that the management team has to communicate with the user panel or add a prompt to the user feedback interface. This is a management decision, of course. In our scenario, the human expert pressed the thumbs-down icon and was prompted to enter a response.

The human expert provided the response, which was saved in 'drive/MyDrive/expert_feedback.txt'. The preceding expert's feedback can then be used to improve the RAG dataset