# **Real-time search engine using Google API and Facebook BART LLM model**

**Muhammad Zain Raza**

In [None]:
!pip install google-auth google-auth-oauthlib google-auth-httplib2 requests

## **1. Google Search Function**

1. **Getting the top n results**: we define a function that returns the URLs of the top n search results.
2. **Scraping content**: the function `scrape_content` fetches a URL and extracts its main text.
3. **Building the corpus**: in a loop, the content of each URL is added to a list, *search_results*, which serves as our main corpus. Next, we preprocess this text, split it into chunks, create embeddings and add them to an index.
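
Step 3 can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: `build_corpus` and its `scrape` argument are names introduced here, and the scraper is passed in as a callable so the loop can be tried without sending any requests.

```python
def build_corpus(results, scrape):
    """Collect the scraped text of each search result into a list of dicts.

    results: list of result dictionaries that carry a 'link' key
    scrape:  callable mapping a URL to its main text (a hypothetical
             stand-in for scrape_content)
    """
    search_results = []
    for result in results:
        url = result["link"]
        content = scrape(url)
        if content.strip():  # skip pages that yielded no text
            search_results.append({"url": url, "content": content})
    return search_results


# Usage with a fake scraper, so no request is sent
fake_pages = {
    "https://example.com/a": "First page text.",
    "https://example.com/b": "",
}
corpus = build_corpus(
    [{"link": "https://example.com/a"}, {"link": "https://example.com/b"}],
    scrape=lambda url: fake_pages[url],
)
print(corpus)
```

Skipping empty pages here is a design choice of the sketch: it keeps pages without paragraph text out of the index.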

In [1]:
import requests
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

def get_top_n_results(query, api_key, cse_id, n = 10):
  '''
  get_top_n_results(query, api_key, cse_id, n)

  query   : search query string
  api_key : secret Google API key
  cse_id  : Custom Search Engine id
  n       : number of results to return (the API returns at most 10 per request)

  Returns a list containing the top n result dictionaries (for us the title and link are useful)
  '''
  url = "https://www.googleapis.com/customsearch/v1"
  params = {
      'q': query,
      'key': api_key,
      'cx': cse_id,
      'num': n
  }
  response = requests.get(url, params=params, timeout=30)  # avoid hanging on a stalled request
  results = response.json()
  return results.get("items", [])


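# Example
# The credentials are read from environment variables instead of being hard-coded
# (GOOGLE_API_KEY and GOOGLE_CSE_ID are placeholder names chosen here)
import os
api_key = os.environ["GOOGLE_API_KEY"]
cse_id = os.environ["GOOGLE_CSE_ID"]
query = "How to lose weight?"
results = get_top_n_results(query, api_key, cse_id, n=10)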

for result in results:
    print(result['title'])
    print(result['link'])
    print()

print(type(result))

from bs4 import BeautifulSoup

def scrape_content(url, timeout=30):
    """
    Scrape the paragraph text from a URL with a specified timeout.
      url: The url to scrape.
      timeout: The timeout duration in seconds.

    Returns:
      the scraped text, or an empty string if the request times out or fails
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')
        content = ' '.join([p.get_text() for p in paragraphs])
        return content
    except requests.exceptions.Timeout:
        # Report the failure but return no text, so error messages never end up in the corpus
        print(f"Timeout occurred while scraping {url}")
        return ""
    except requests.exceptions.RequestException as e:
        print(f"Error occurred while scraping {url}: {e}")
        return ""

Weight loss: 6 strategies for success - Mayo Clinic
https://www.mayoclinic.org/healthy-lifestyle/weight-loss/in-depth/weight-loss/art-20047752

Steps for Losing Weight | Healthy Weight and Growth | CDC
https://www.cdc.gov/healthy-weight-growth/losing-weight/index.html

Tips to help you lose weight - NHS
https://www.nhs.uk/live-well/healthy-weight/managing-your-weight/tips-to-help-you-lose-weight/

How to Lose Weight Fast and Safely
https://www.webmd.com/diet/lose-weight-fast

How to lose weight fast: 9 scientific ways to drop fat
https://www.medicalnewstoday.com/articles/322345

How to Lose Weight Fast (the Healthy Way)
https://store.google.com/intl/en/ideas/articles/how-to-lose-weight-fast-the-healthy-way/

Successful weight loss: 10 tips to lose weight
https://www.medicalnewstoday.com/articles/303409

How to Lose Weight Naturally: 29 Tips Supported by Science
https://www.healthline.com/nutrition/30-ways-to-lose-weight-naturally

Weight loss - a healthy approach - Better Health Channe

In [None]:
!pip install nltk
!pip install -U sentence-transformers
!pip install faiss-gpu

## **2. Chunking and creating Embeddings**

Once we have all the content, we define a helper function, `get_the_embeddings`, that splits the content into chunks and creates embeddings from these chunks for indexing.

1. Chunking: each chunk covers 10 new sentences plus the last 3 sentences of the previous chunk as overlap
2. Creating embeddings using Sentence Transformers
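
The chunking scheme can be illustrated in isolation. The sketch below mirrors the windowing used in `get_the_embeddings` on a plain list of strings; the function name `chunk_sentences` and its parameter names are introduced here for illustration only.

```python
def chunk_sentences(sentences, size=10, overlap=3):
    """Group sentences into chunks of `size` new sentences, each prefixed
    with the last `overlap` sentences of the previous chunk."""
    chunks = []
    for start in range(0, len(sentences), size):
        # Reach back `overlap` sentences, but never before the first sentence
        chunks.append(sentences[max(0, start - overlap):start + size])
    return chunks


sentences = [f"s{i}" for i in range(20)]
chunks = chunk_sentences(sentences)
for chunk in chunks:
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} sentences)")
```

With 20 sentences this yields two chunks: the first holds sentences 0 to 9, and the second starts 3 sentences earlier than its own window, at sentence 7, so the two chunks share sentences 7 to 9.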

In [2]:
from sentence_transformers import SentenceTransformer, util
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
nltk.download('punkt_tab')
# Function to extract sentences and chunk them into paragraphs, then compute the embeddings for those paragraphs
import numpy as np
import faiss

def get_the_embeddings(search_results):
  '''
  get_the_embeddings(search_results):

  Inputs: search_results - list of dicts holding the scraped content of the n search results

  The function takes the content of all search results as input and splits it into chunks: each chunk
  covers 10 new sentences and is prefixed with the last 3 sentences of the previous chunk. It then
  converts the chunks to embeddings using the all-MiniLM-L6-v2 sentence transformer.

  Outputs: extracted paragraphs & paragraph embeddings
  '''

  extracted_paragraphs = []

  for results in search_results:
    processed_sentences = sent_tokenize(results["content"])  # taking the content from the search results and tokenizing it into sentences

    # chunking the extracted sentences into paragraphs of 10 new sentences each, prefixed with the last 3 sentences of the previous paragraph
    for i in range(0, len(processed_sentences), 10):
      paragraph = ' '.join(processed_sentences[max(0, i - 3) : min(len(processed_sentences), i + 10)])
      extracted_paragraphs.append(paragraph)

  # Getting the embeddings for the extracted paragraphs
  paragraph_embeddings = embedding_model.encode(extracted_paragraphs, convert_to_tensor=True)
  return extracted_paragraphs, paragraph_embeddings

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## **3. Index Formation, and Context Generation**
1. Index is formed using the function 'get_faiss_index'
2. find_nearest_neighbors obtains top k ranked chunks from the embeddings
3. join_similar_paragraphs joins the top k chunks to pass them into the context.
4. get_context finally combines the helper functions to retrieve the context
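
To make the retrieval step concrete, the sketch below performs the same kind of exhaustive L2 search that a flat FAISS index carries out, using only NumPy on toy 2-D vectors. The function `nearest_neighbors_l2` is a name introduced here and is not part of the notebook; like `IndexFlatL2`, it reports squared distances.

```python
import numpy as np


def nearest_neighbors_l2(query_vec, vectors, k=3):
    """Return the indices and squared L2 distances of the k vectors closest to query_vec."""
    diffs = vectors - query_vec                     # broadcast the query over all rows
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise squared norms
    order = np.argsort(sq_dists)[:k]                # smallest distance first
    return order, sq_dists[order]


vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
query_vec = np.array([0.9, 0.1], dtype="float32")
indices, distances = nearest_neighbors_l2(query_vec, vectors, k=2)
print(indices, distances)
```

The returned indices play the same role as the `indices` array from `index.search`: they are positions in the list of chunks, so the matching text can be looked up directly.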


In [5]:
def get_faiss_index(paragraph_embeddings):
  '''
  function: get_faiss_index(paragraph_embeddings)

  get_faiss_index initializes a FAISS index and populates it with the paragraph embeddings.
  It uses a Euclidean (L2) distance metric to perform the nearest-neighbor search.

  Args: paragraph_embeddings - all paragraph embeddings generated by
  get_the_embeddings.

  Output: FAISS index

  '''
  # Initialize parameters
  xb = paragraph_embeddings.cpu().numpy()  # converting embeddings to a numpy array
  d = xb.shape[1]                          # dimension of vectors (384 for all-MiniLM-L6-v2)

  # Create a simple FAISS index
  index = faiss.IndexFlatL2(d)       # L2 (Euclidean) distance-based flat index

  # Add vectors to the index
  index.add(xb)                      # adding all paragraph vectors to the index
  #print("Number of vectors in the index:", index.ntotal)
  return index


In [6]:
# A function that takes in the question and converts it into a query vector to search for the k nearest neighbours in the FAISS index
def find_nearest_neighbors(question, index, k=3):
  '''
  find_nearest_neighbors(question, index, k=3)

  Args: question (string) - query input by user
        index - FAISS index populated with paragraph embeddings from search results
        k (=3) - number of top results to retrieve

  The function takes the user's query as a string, converts it to an embedding using
  a sentence transformer, and finds the top k results based on the shortest Euclidean distance between
  the query and the content chunks.

  Output: 2D array holding the indices of the k nearest chunks (one row per query)
  '''
    # Get the embeddings for the question
  question_embedding = embedding_model.encode(question, convert_to_tensor=True)

    # Ensure the embedding is a 2D array (required by FAISS)
  if question_embedding.ndim == 1:
      question_embedding = question_embedding.reshape(1, -1)

    # Convert the embedding to a NumPy array
  xq = question_embedding.cpu().numpy()

    # Perform the nearest-neighbor search
  distances, indices = index.search(xq, k)
  return indices

In [7]:
# Function that uses the indices of the most similar paragraphs to get the context from the list of original paragraphs
def join_similar_paragraphs(query_indices, original_text):
  '''
  join_similar_paragraphs(query_indices, original_text)

  Args - query_indices - indices of the top results
         original_text - list of paragraphs that the indices refer to

  The function joins the original text of the most similar paragraphs, based on the indices
  obtained from the index search, so that it can be passed on as context.

  Output: string of combined most relevant paragraphs from search to be used as context

  '''
  context = ' '.join([original_text[i] for i in query_indices])
  return context

In [8]:
def get_context(results):
  '''
  get_context(results)

  args: results - list of search result dictionaries returned by get_top_n_results

  Output : Context (string) & matching urls (list of strings)

  This function combines all the helper functions to create the context
  that is to be fed into the final RAG model to generate answers. It also
  outputs the links to the websites which the context was obtained from.
  The question is read from the module-level variable `query`.

  '''
  import torch  # used to concatenate the per-page embeddings

  extracted_paragraphs = []
  paragraph_urls = []       # paragraph_urls[i] is the page that paragraph i came from
  embedding_batches = []
  for result in results:
    url = result['link']
    content = scrape_content(url)
    if not content.strip():
      continue              # nothing was scraped from this page
    # Embed one page at a time so that every paragraph can be traced back to its URL
    paragraphs, embeddings = get_the_embeddings([{'url': url, 'content': content}])
    extracted_paragraphs.extend(paragraphs)
    paragraph_urls.extend([url] * len(paragraphs))
    embedding_batches.append(embeddings)

  # Without any scraped text there is nothing to index
  if not extracted_paragraphs:
    return "", []

  index = get_faiss_index(torch.cat(embedding_batches))

  indices_of_nearest_neighbors = find_nearest_neighbors(query, index, k=1)
  nearest_urls = [paragraph_urls[i] for i in indices_of_nearest_neighbors[0]]

  context = join_similar_paragraphs(indices_of_nearest_neighbors[0], extracted_paragraphs)

  return context, nearest_urls

## **4. Fine-Tuning an LLM model for Question Answering using Microsoft MS Marco dataset**

1. The MS MARCO dataset is downloaded
2. The train, validation and test splits are obtained
3. The data is pre-processed
4. The BART base model is fine-tuned to answer a query from a retrieved passage, as needed for the RAG approach
5. Several metrics are computed on the test split to measure the performance of the fine-tuned model

In [None]:
!pip install datasets
!pip install transformers

In [17]:
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("microsoft/ms_marco", "v1.1")

In [18]:
# The dataset already has splits of training, validation and testing datasets
train_data = dataset['train']
validation_data = dataset['validation']
test_data = dataset['test']

In [19]:
from transformers import BartTokenizer
# Initialize the tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
# Define max token length
MAX_LENGTH = 512

### **4.1 Pre-processing the dataset to prepare for training**

In this step we pre-process the training and validation data for fine-tuning our model. Each example in the dataset contains multiple context passages, along with a flag marking the passage that was used to generate the answer. In our pre-processing we select that passage and use it for training.
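
The passage selection can be shown on its own. The sketch below uses a hand-made record in the same shape as the `passages` field (the passage texts are invented for illustration), and `select_context` is a helper name introduced here rather than a function from the notebook.

```python
def select_context(passages, fallback="No relevant passage found."):
    """Return the first passage flagged as selected, or a fallback string."""
    for text, selected in zip(passages["passage_text"], passages["is_selected"]):
        if selected == 1:
            return text
    return fallback


# Illustrative record: parallel lists of passage texts and selection flags
example = {
    "passage_text": [
        "Paris is in France.",
        "Berlin is the capital of Germany.",
        "Rome is in Italy.",
    ],
    "is_selected": [0, 1, 0],
}
print(select_context(example))
```

When no passage is flagged, the fallback string is used as the context, which matches the placeholder used in the pre-processing code.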

In [None]:
def preprocess(examples):
    # Initialize lists to store processed data
    input_ids = []
    attention_masks = []
    labels = []
    # Iterate through all rows in the dataset
    for query, passages, answers in zip(examples['query'], examples['passages'], examples['answers']):
        # Extract passage_text and is_selected lists
        passage_texts = passages['passage_text']
        is_selected = passages['is_selected']

        # Find the first relevant passage
        context = ""
        for text, selected in zip(passage_texts, is_selected):
            if selected == 1:
                context = text
                break

        # Use a placeholder string if no relevant passage is found
        if context == "":
            context = "No relevant passage found."

        # Use the first answer or an empty string if no answer is available
        answer = answers[0] if len(answers) > 0 else ""

        # Tokenize the input (query + context) and output (answer)
        inputs = tokenizer(
            query + " " + tokenizer.sep_token + " " + context,
            max_length=MAX_LENGTH,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )
        outputs = tokenizer(
            answer,
            max_length=MAX_LENGTH,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )

        # Append processed data
        input_ids.append(inputs['input_ids'].squeeze(0))
        attention_masks.append(inputs['attention_mask'].squeeze(0))
        labels.append(outputs['input_ids'].squeeze(0))

    return {
        'input_ids': input_ids,
        'attention_mask': attention_masks,
        'labels': labels
    }

# Apply preprocessing
train_dataset = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)
validation_dataset = validation_data.map(preprocess, batched=True, remove_columns=validation_data.column_names)

# Set format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
validation_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

### **4.2 Loading and training the model**

In [21]:
from transformers import BartForConditionalGeneration, BartTokenizer, Trainer, TrainingArguments, DataCollatorWithPadding
import torch

In [None]:
# Load BART model for conditional generation
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Move the model to GPU if available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [23]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",          # Save after each evaluation
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    save_total_limit=2,
    load_best_model_at_end=True,    # Load best model at the end
    logging_dir="./logs",
    logging_steps=500,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    fp16=torch.cuda.is_available(),
    report_to=[]
)



In [None]:
data_collator = DataCollatorWithPadding(tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

In [None]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
0,0.0235,0.019624
2,0.0172,0.018749
4,0.015,0.018854


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

In [None]:
evaluation_results = trainer.evaluate()
print("Evaluation Results:", evaluation_results)

In [None]:
# Save the model and tokenizer
model.save_pretrained("./bart-msmarco-2")
tokenizer.save_pretrained("./bart-msmarco-2")

In [25]:
# Load the trained model and tokenizer
trained_model = BartForConditionalGeneration.from_pretrained("./bart-msmarco-2")
trained_model.to(device)  # Move model to GPU
trained_tokenizer = BartTokenizer.from_pretrained("./bart-msmarco-2")

# Prepare an example query and context
query = "What is Results-Based Accountability?"
context = "Results-Based Accountability (RBA) is a disciplined way of thinking and taking action to improve community outcomes..."

# Tokenize input
inputs = trained_tokenizer(
    query + " " + trained_tokenizer.sep_token + " " + context,
    return_tensors="pt",
    max_length=MAX_LENGTH,
    truncation=True
)

# Generate the answer
inputs = {k: v.to(device) for k, v in inputs.items()}
output = trained_model.generate(
    inputs["input_ids"],
    max_length=50,
    num_beams=5,
    early_stopping=True
)

# Decode and print the answer
answer = trained_tokenizer.decode(output[0], skip_special_tokens=True)
print("Answer:", answer)

Answer: It is a disciplined way of thinking and taking action to improve community outcomes.


## **5. Testing on the test dataset of MS MARCO**

In [26]:
def preprocess_test_data(examples):
    input_ids = []
    attention_masks = []
    references = []

    for query, passages, answers in zip(examples['query'], examples['passages'], examples['answers']):
        # Extract passage_text and is_selected lists
        passage_texts = passages['passage_text']
        is_selected = passages['is_selected']

        # Find the first relevant passage
        context = ""
        for text, selected in zip(passage_texts, is_selected):
            if selected == 1:
                context = text
                break

        # Use a placeholder string if no relevant passage is found
        if context == "":
            context = "No relevant passage found."

        # Use the first answer as the reference
        reference = answers[0] if len(answers) > 0 else ""

        # Tokenize the input (query + context)
        inputs = tokenizer(
            query + " " + tokenizer.sep_token + " " + context,
            max_length=512,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )

        # Append processed data
        input_ids.append(inputs['input_ids'].squeeze(0))
        attention_masks.append(inputs['attention_mask'].squeeze(0))
        references.append(reference)

    return {
        'input_ids': input_ids,
        'attention_mask': attention_masks,
        'references': references
    }

# Preprocess the test data
test_dataset = test_data.map(preprocess_test_data, batched=True, remove_columns=test_data.column_names)
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'references'])


In [None]:
# Load the trained model and tokenizer
from transformers import BartForConditionalGeneration, BartTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BartForConditionalGeneration.from_pretrained("./bart-msmarco-2").to(device)
tokenizer = BartTokenizer.from_pretrained("./bart-msmarco-2")

# Generate predictions
def generate_predictions(batch):
    inputs = {'input_ids': batch['input_ids'].to(device), 'attention_mask': batch['attention_mask'].to(device)}
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # pass the padding mask explicitly
        max_length=50,
        num_beams=5,
        early_stopping=True
    )
    predictions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return {"predictions": predictions}

# Apply the model to the test dataset
test_results = test_dataset.map(generate_predictions, batched=True, batch_size=16)

In [None]:
!pip install rouge_score
!pip install evaluate

## **6. Evaluation metrics for MS Marco test dataset**

### **6.1 ROUGE Score**

In [28]:
from evaluate import load

# Load ROUGE metric
rouge = load("rouge")

# Compute ROUGE scores
predictions = test_results["predictions"]
references = test_results["references"]

# Strip whitespace from predictions and references
predictions = [pred.strip() for pred in predictions]
references = [ref.strip() for ref in references]

# Evaluate using ROUGE
rouge_scores = rouge.compute(predictions=predictions, references=references)
print("ROUGE Scores:", rouge_scores)

ROUGE Scores: {'rouge1': 0.6176155114997887, 'rouge2': 0.4696353361929626, 'rougeL': 0.6107581816553828, 'rougeLsum': 0.6107647176419008}


### **6.2 Exact Match Score**

In [29]:
# Compute Exact Match
def compute_exact_match(predictions, references):
    em_count = 0
    for pred, ref in zip(predictions, references):
        if pred == ref:
            em_count += 1
    return em_count / len(references)

exact_match_score = compute_exact_match(predictions, references)
print("Exact Match Score:", exact_match_score)

Exact Match Score: 0.25202072538860104


### **6.3 BLEU Score**

In [30]:
# Load BLEU metric
bleu = load("bleu")

# Collect predictions and references
predictions = test_results["predictions"]
references = test_results["references"]

# Strip whitespace from predictions and references
predictions = [pred.strip() for pred in predictions]
references = [ref.strip() for ref in references]

# Evaluate using BLEU
bleu_scores = bleu.compute(predictions=predictions, references=references)
print("BLEU Scores:", bleu_scores)

BLEU Scores: {'bleu': 0.429352809073028, 'precisions': [0.7213172804532578, 0.6531203492409962, 0.6304825980075042, 0.6157247010790318], 'brevity_penalty': 0.656552562207897, 'length_ratio': 0.7038523479639051, 'translation_length': 110136, 'reference_length': 156476}
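
The reported BLEU value can be reproduced from the other fields in the output. The sketch below assumes the standard BLEU definition with uniform weights over 1- to 4-grams, which is what the four reported precisions suggest; the numbers are copied from the output above.

```python
import math

# Values copied from the reported BLEU output
precisions = [0.7213172804532578, 0.6531203492409962, 0.6304825980075042, 0.6157247010790318]
translation_length = 110136
reference_length = 156476

# Geometric mean of the 1- to 4-gram precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Brevity penalty: exp(1 - reference/prediction length) when predictions are shorter overall
length_ratio = translation_length / reference_length
brevity_penalty = 1.0 if length_ratio > 1.0 else math.exp(1.0 - 1.0 / length_ratio)

bleu_score = brevity_penalty * geo_mean
print(geo_mean, brevity_penalty, bleu_score)
```

Read this way, the n-gram precisions are fairly high, and much of the gap to a higher BLEU comes from the brevity penalty: the predictions are about 70% as long as the references.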


In [31]:
import pandas as pd
# Side-by-side view of the reference answers and the model predictions
df_pred = pd.DataFrame({"Actual": references, "Predictions": predictions})

df_pred

Unnamed: 0,Actual,Predictions
0,Yes,Yes
1,Fossil fuels are basically the remains of anim...,Fossil fuels are basically the remains of anim...
2,The apothem of a regular polygon is a line seg...,It is a line segment from the center to the mi...
3,$45 to $210. 2,$45 to $210
4,It is the collection of physical elements that...,Computer hardware (usually simply called hardw...
...,...,...
9645,Can last 3-4 days in the fridge as long as its...,3-4 days
9646,Body mass index (BMI) the weight in kilograms ...,Body mass index (BMI) the weight in kilograms ...
9647,Yes,Yes
9648,'Bisque' is a shade of White that is 23% satur...,A shade of White that is 23% saturated and 100...


### **6.4 Fuzzy Match**

In [None]:
!pip install rapidfuzz

In [None]:
from rapidfuzz.fuzz import ratio

def compute_fuzzy_match(predictions, references, threshold=80):
    """
    Compute fuzzy match scores and the percentage of predictions meeting a similarity threshold.

    Args:
        predictions (list): List of predicted strings.
        references (list): List of reference strings or lists.
        threshold (int): Minimum similarity percentage to consider a match.

    Returns:
        dict: Average fuzzy match score and percentage of predictions above the threshold.
    """
    # Flatten references for single string comparisons
    references = [ref[0] if isinstance(ref, list) else ref for ref in references]

    fuzzy_scores = []
    above_threshold_count = 0

    for pred, ref in zip(predictions, references):
        score = ratio(pred, ref)
        fuzzy_scores.append(score)
        if score >= threshold:
            above_threshold_count += 1

    # Calculate average score and percentage above threshold
    avg_score = sum(fuzzy_scores) / len(fuzzy_scores)
    percentage_above_threshold = (above_threshold_count / len(fuzzy_scores)) * 100

    return {
        "average_fuzzy_score": avg_score,
        "percentage_above_threshold": percentage_above_threshold,
    }

# Compute fuzzy match
fuzzy_results = compute_fuzzy_match(predictions, references, threshold=80)
print("Fuzzy Match Results:", fuzzy_results)

### **6.5 Similarity Score with all-MiniLM-L6-v2 embeddings**

In [32]:
from sentence_transformers import SentenceTransformer, util
import torch

# Load a pre-trained model for embeddings (kept under its own name so it does not overwrite the BART `model`)
similarity_model = SentenceTransformer('all-MiniLM-L6-v2')

def compute_embedding_similarity(predictions, references):
    """
    Compute cosine similarity between embeddings of predictions and references.

    Args:
        predictions (list): List of predicted strings.
        references (list): List of reference strings or lists.

    Returns:
        dict: Average similarity score and individual similarity scores.
    """
    # Flatten references for single string comparisons
    references = [ref[0] if isinstance(ref, list) else ref for ref in references]

    # Compute embeddings for predictions and references
    pred_embeddings = similarity_model.encode(predictions, convert_to_tensor=True)
    ref_embeddings = similarity_model.encode(references, convert_to_tensor=True)

    # Compute the cosine similarity of each prediction with its own reference
    # (row-wise, which avoids building the full N x N similarity matrix)
    similarity_scores = torch.nn.functional.cosine_similarity(pred_embeddings, ref_embeddings, dim=1)

    # Average similarity score
    avg_similarity = similarity_scores.mean().item()

    return {
        "average_similarity_score": avg_similarity,
        "individual_scores": similarity_scores.tolist()
    }

# Compute similarity
embedding_similarity_results = compute_embedding_similarity(predictions, references)
print("Embedding Similarity Results:", embedding_similarity_results)

Embedding Similarity Results: {'average_similarity_score': 0.7611591219902039, 'individual_scores': [1.0000001192092896, 0.9777355790138245, 0.6604857444763184, 0.9401236176490784, 0.8257583379745483, ...]}

## **7. RAG Pipeline**
Finally, we combine all the helper functions with the fine-tuned model into a single function for the full RAG pipeline.
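
The overall flow can be sketched independently of the API and the model. In the sketch below the three stages are passed in as plain callables; `run_pipeline` and its arguments are names introduced for illustration, the default separator is assumed to be BART's `</s>` token, and the stub stages only stand in for search, retrieval and generation.

```python
def run_pipeline(question, search, retrieve, generate, sep_token="</s>"):
    """Chain search -> retrieval -> generation and return (answer, source_urls)."""
    results = search(question)
    if not results:
        return "No search results found.", []
    context, urls = retrieve(question, results)
    if not context.strip():
        return "Unable to generate context from the search results.", []
    # The model input joins the question and the retrieved context with the separator
    prompt = f"{question} {sep_token} {context}"
    return generate(prompt), urls


# Stub stages so the flow can be exercised without the API or the model
answer, urls = run_pipeline(
    "What is BART?",
    search=lambda q: [{"link": "https://example.com/bart"}],
    retrieve=lambda q, results: ("BART is a sequence-to-sequence model.", [results[0]["link"]]),
    generate=lambda prompt: prompt.split("</s>")[-1].strip(),
)
print(answer, urls)
```

Returning a pair in every branch keeps the calling code simple, since the caller can always unpack an answer and a list of URLs.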



In [33]:
import torch

# Load the trained model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BartForConditionalGeneration.from_pretrained("./bart-msmarco-2").to(device)
tokenizer = BartTokenizer.from_pretrained("./bart-msmarco-2")

def rag_pipeline(question):
    # get_context() reads the module-level `query`, so keep it in sync with
    # the question being answered
    global query
    query = question

    # get google API results
    results = get_top_n_results(question, api_key, cse_id, n=6)
    if not results:
        return "No search results were returned.", []

    # get the context and its source urls in a single pass (scraping is slow)
    context, urls = get_context(results)
    #print(context)
    if not context.strip():
        return "Unable to generate context from the search results.", []

    inputs = tokenizer(
        question + " " + tokenizer.sep_token + " " + context,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model.generate(
        inputs["input_ids"],
        max_length=250,
        num_beams=5,
        early_stopping=True
    )
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer, urls

## **8. Testing on Custom Dataset**

In [None]:
import pandas as pd

file_path = "test_set.csv"
data = pd.read_csv(file_path)

questions = data["Questions"].tolist()
true_answers = data["Answers"].tolist()

# Process questions through the RAG pipeline and collect predictions
predicted_answers = []
similarities = []


for num, question in enumerate(questions):
  predicted_answer, *_ = rag_pipeline(question)
  predicted_answers.append(predicted_answer)
  print(f"{(num + 1) / len(questions) * 100:.1f}% Done")

In [None]:
rouge_scores = rouge.compute(predictions=predicted_answers, references=true_answers)
print("ROUGE Scores:", rouge_scores)
bleu_scores = bleu.compute(predictions=predicted_answers, references=true_answers)
print("BLEU Scores:", bleu_scores)
fuzzy_results = compute_fuzzy_match(predictions=predicted_answers, references=true_answers, threshold=80)
print("Fuzzy Match Results:", fuzzy_results)

## **9. Demo with Gradio**

In [None]:
!pip install gradio
import gradio as gr

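# Wrap the pipeline so that the interface always receives an answer and a text block of source URLs
def answer_with_sources(question):
    result = rag_pipeline(question)
    if isinstance(result, tuple):
        answer, urls = result
    else:
        answer, urls = result, []
    return answer, "\n".join(urls)

# Define Gradio Interface
interface = gr.Interface(
    fn=answer_with_sources,
    inputs=gr.Textbox(label="Enter your question here"),
    outputs=[gr.Textbox(label="Generated Answer"), gr.Textbox(label="Source URLs")],
    title="Simpler Search Engine",
    description="Enter a question below to get a direct answer"
)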

# Launch the Gradio app
interface.launch(share = True)

print(rag_pipeline("What is Results-Based Accountability?"))