<a href="https://colab.research.google.com/github/yokurang/ml-final-llm-model/blob/main/ml_final_project_llm_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Large Language Models (LLMs) for Data Search

We investigated the potential of using large language models (LLMs) to enhance our data search capabilities. Several approaches were considered:

1. **ChatGPT from OpenAI**: While effective, this option involves additional costs.
2. **Local Smaller Models**: Limited by our computational resources, this option could restrict performance.
3. **Hosted Open-Source Models via Hugging Face**: This provides a balance of cost-efficiency and capability.

Ultimately, we selected Mistral-7B-Instruct-v0.2 (`mistralai/Mistral-7B-Instruct-v0.2`), an open-source model hosted on Hugging Face and fine-tuned for conversational tasks. This model aligns well with our needs: rather than retraining it, we retrieve relevant passages from our documents, such as Wikipedia articles, and pass them to the model as context to improve search.

We first install the necessary libraries and load the required secrets.

In [None]:
%%capture
!pip3 install pyTelegramBotAPI
!pip3 install requests
!pip3 install pandas

In [None]:
from google.colab import userdata
import telebot
import requests

telegram_bot_token = 'TELEGRAM_TOKEN'  # @param {type: "string"}

try:
  TELEGRAM_API_KEY = userdata.get(telegram_bot_token)
  bot = telebot.TeleBot(TELEGRAM_API_KEY)
  bot.get_me()
except userdata.SecretNotFoundError as e:
  print(f'Secret not found\n\nThis expects you to create a secret named {telegram_bot_token} in Colab\n\nMessage BotFather on Telegram to create a new bot and get that token\n\nStore that in the secrets section on the left side of the notebook (key icon)\n\nName the secret {telegram_bot_token}')
  raise e
except userdata.NotebookAccessError as e:
  print(f'You need to grant this notebook access to the {telegram_bot_token} secret in order for the notebook to access your Telegram Bot on your behalf.')
  raise e
except Exception as e:
  # unknown error
  print(f"There was an unknown error. Ensure you have a secret {telegram_bot_token} stored in Colab and it's a valid key from Telegram")
  raise e

## Verifying the Environment Variables

Let us verify that our Telegram bot API key is working correctly.

In [None]:
# Checking that the tele bot is responsive

url = f"https://api.telegram.org/bot{userdata.get(telegram_bot_token)}/getUpdates"
response = requests.get(url)

print("Status Code:", response.status_code)
print("Response Content:", response.content)

Status Code: 200
Response Content: b'{"ok":true,"result":[]}'


Our Telegram bot API key is working: the request returned status 200 and `"ok":true`.
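The check above rests on two signals: an HTTP status of 200 and `"ok": true` in the JSON body. As a minimal sketch, the same check can be made explicit in code. The helper name `telegram_response_ok` is hypothetical, and the sample body is copied from the output above.

```python
import json

def telegram_response_ok(status_code: int, body: bytes) -> bool:
    """Return True when the HTTP status is 200 and the JSON payload reports ok=true."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Body was not valid JSON (or not decodable text)
        return False
    return isinstance(payload, dict) and payload.get("ok") is True

# Sample body copied from the cell output above
sample_body = b'{"ok":true,"result":[]}'
print(telegram_response_ok(200, sample_body))
```

An empty `result` list only means that no messages are waiting; it does not indicate a problem with the key.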

Let us make sure we can load the data set correctly.

In [11]:
import pandas as pd
import csv
# Loading the data set
qna = pd.read_csv('updated_qns_answers.csv')
qna

Unnamed: 0,question,points,article,nlp_analysis,readability_score,preprocessed_question,id,url,article_text,preprocessed_text,cluster
0,did the people of gibraltar vote to remain pa...,58,Gibraltar,"([('the united kingdom', 'GPE'), ('2002', 'DAT...",62.68,people gibraltar vote remain part united kingd...,15222,https://simple.wikipedia.org/wiki/Gibraltar,Gibraltar is an Overseas Territory of the Unit...,gibraltar overseas territory united kingdom me...,192.0
1,which country uses the franc as its official ...,55,Currency,"([], [(' ', 'dep', 'which'), ('which', 'det', ...",53.88,country us franc official currency,2140,https://simple.wikipedia.org/wiki/Currency,Currency is the unit of money used by the peop...,currency unit money used people country union ...,127.0
2,which of these old communist parties no longe...,52,List of communist parties,"([('communist', 'NORP'), ('today', 'DATE')], [...",69.79,old communist party longer exists today,4402,https://simple.wikipedia.org/wiki/List%20of%20...,There are a number of communist parties around...,number communist party around world world hist...,121.0
3,a patient has a terminal illness and wants to ...,65,Medical ethics,"([], [('a', 'det', 'patient'), ('patient', 'ns...",66.23,patient terminal illness want end life family ...,13938,https://simple.wikipedia.org/wiki/Medical%20et...,Medical ethics is the set of ethical rules tha...,medical ethic set ethical rule doctor follow i...,117.0
4,"according to plato, what are the three types o...",55,The Republic,"([('three', 'CARDINAL')], [('according', 'prep...",71.14,according plato three type people society made,13148,https://simple.wikipedia.org/wiki/The%20Republic,The Republic is a book by Plato. It was finish...,republic book plato finished bc asks question ...,117.0
...,...,...,...,...,...,...,...,...,...,...,...
2568,what are the main areas of focus in andrology?,62,Andrology,"([], [('what', 'attr', 'are'), ('are', 'ROOT',...",62.34,main area focus andrology,11639,https://simple.wikipedia.org/wiki/Andrology,"Andrology is the study of male health, especia...",andrology study male health especially male se...,22.0
2569,what are the main arguments in favor of animal...,51,Animal rights,"([], [('what', 'attr', 'are'), ('are', 'ROOT',...",69.79,main argument favor animal right,27285,https://simple.wikipedia.org/wiki/Animal%20rights,Animal rights is a term used for the general b...,animal right term used general belief non huma...,159.0
2570,what are the main arguments in favor of neocla...,47,Neoclassical economics,"([], [('what', 'attr', 'are'), ('are', 'ROOT',...",44.41,main argument favor neoclassical economics,18577,https://simple.wikipedia.org/wiki/Neoclassical...,Neoclassical economics is an economic theory t...,neoclassical economics economic theory argues ...,148.0
2571,what are the main branches of physical geography?,57,Geography,"([], [('what', 'attr', 'are'), ('are', 'ROOT',...",46.44,main branch physical geography,296,https://simple.wikipedia.org/wiki/Geography,"Geography (from Greek: , geographia, literally...",geography greek geographia literally earth des...,181.0


In [12]:
import telebot

# Initialise the Tele Bot

BOT_TOKEN = userdata.get(telegram_bot_token)
bot = telebot.TeleBot(BOT_TOKEN)

Next, let us make sure that we can load the Hugging Face API key correctly.

In [13]:
# @title ⚙️ Configure Hugging Face Token

from google.colab import userdata

hugging_face_token_secret = 'HF_TOKEN'  # @param {type: "string"}

try:
  # Store the Hugging Face token under its own name so the Telegram key is not overwritten
  HF_API_KEY = userdata.get(hugging_face_token_secret)
except userdata.SecretNotFoundError as e:
  print(f'Secret not found\n\nThis expects you to create a secret named {hugging_face_token_secret} in Colab\n\nGo to https://huggingface.co/settings/tokens and create a write token\n\nStore that in the secrets section on the left side of the notebook (key icon)\n\nName the secret {hugging_face_token_secret}')
  raise e
except userdata.NotebookAccessError as e:
  print(f'You need to grant this notebook access to the {hugging_face_token_secret} secret in order for the notebook to access Hugging Face on your behalf.')
  raise e
except Exception as e:
  # unknown error
  print(f"There was an unknown error. Ensure you have a secret {hugging_face_token_secret} stored in Colab and it's a valid token from Hugging Face")
  raise e

In [14]:
%%capture
!pip3 install langchain transformers

In [15]:
import os
from langchain_community.llms import HuggingFaceHub

# Get HuggingFace Credentials

os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get(hugging_face_token_secret)

In [16]:
from transformers import AutoTokenizer

In [17]:
!pip3 install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.0/27.0 MB 22.9 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0


## Setting Up the Web Scraper and Vector Database

To effectively respond to queries with precise context, we've developed a system to process Wikipedia URLs by:

1. **Loading Webpages**: We begin by loading each Wikipedia URL present in our dataset.
2. **Content Extraction**: Next, we extract textual content from these webpages.
3. **Content Chunking**: The extracted content is then segmented into manageable chunks, ensuring that each chunk contains a coherent portion of the content.
4. **Embedding Chunks**: Each chunk is transformed into embeddings using the `sentence-transformers/all-MiniLM-l6-v2` model from Hugging Face, which provides a high-level representation of the text.
5. **FAISS Database**: These embeddings are stored in a FAISS-backed database, which is optimized for efficient similarity-based query matching. This setup allows the database to return only the most relevant content in response to a query.

This architecture uses the Retrieval-Augmented Generation (RAG) technique, which lets us use our own document corpus as a searchable data layer. By connecting this system to our bot through LangChain modules, we allow the large language model to retrieve and answer from our custom dataset.
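The retrieval step at the heart of this pipeline can be sketched without any external services. The example below is a toy illustration only: it replaces the sentence-transformer embeddings with simple word counts and FAISS with a linear scan, and the sample chunks are made-up sentences rather than dataset content. The function names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list, k: int = 1) -> list:
    """Return the k chunks most similar to the query (linear scan instead of FAISS)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Made-up chunks for illustration
chunks = [
    "Gibraltar held a referendum about its sovereignty.",
    "A currency is the unit of money used in a country.",
    "Medical ethics are rules that doctors follow.",
]
print(retrieve("which unit of money does a country use?", chunks))
```

In the real pipeline the ranking idea is the same, but the vectors come from the embedding model and the nearest-neighbour search is delegated to FAISS.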

In [18]:
from langchain_community.document_loaders import WebBaseLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import LocalFileStore
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.embeddings import CacheBackedEmbeddings

In [19]:
store = LocalFileStore("./cache")
core_embeddings_model = HuggingFaceInferenceAPIEmbeddings(
    api_key = userdata.get(hugging_face_token_secret),
    model_name="sentence-transformers/all-MiniLM-l6-v2")

embedder = CacheBackedEmbeddings.from_bytes_store(core_embeddings_model, store)
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 50, length_function = len)
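To make the `chunk_size = 500` and `chunk_overlap = 50` settings concrete, here is a simplified sketch of overlapping chunking. It uses a plain sliding window over characters; `RecursiveCharacterTextSplitter` is more careful and prefers to split on separators such as paragraph and line breaks, so its output will generally differ. The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows; neighbouring windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "x" * 1200
print([len(piece) for piece in chunk_text(sample)])
```

The overlap means a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk, at the cost of storing some text twice.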

Because we are limited by our laptop's computational resources, we build the index from a subset of the data (the first 50 rows) rather than the full dataset.

In [20]:
def process_urls(urls, text_splitter, embedder):
    all_chunks = []

    # Step 1: Collect chunks from all URLs
    for url in set(urls):  # Ensure uniqueness
        print(f"Processing {url}...")
        content = WebBaseLoader(url).load()
        chunks = text_splitter.transform_documents(content)
        all_chunks.extend(chunks)

    # Step 2 and 3: Process all chunks to generate embeddings and create a single vector store
    vectorstore = FAISS.from_documents(all_chunks, embedder)

    return vectorstore

combined_vectorstore = process_urls(qna['url'][:50], text_splitter, embedder) # modify here
retriever = combined_vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.3})

Processing https://simple.wikipedia.org/wiki/By-product...
Processing https://simple.wikipedia.org/wiki/Boy...
Processing https://simple.wikipedia.org/wiki/Fact...
Processing https://simple.wikipedia.org/wiki/Newt...
Processing https://simple.wikipedia.org/wiki/Half-life%20%28element%29...
Processing https://simple.wikipedia.org/wiki/Buddha...
Processing https://simple.wikipedia.org/wiki/List%20of%20communist%20parties...
Processing https://simple.wikipedia.org/wiki/Shetland%20Sheepdog...
Processing https://simple.wikipedia.org/wiki/Snow%20leopard...
Processing https://simple.wikipedia.org/wiki/Galician%20language...
Processing https://simple.wikipedia.org/wiki/Medical%20ethics...
Processing https://simple.wikipedia.org/wiki/The%20Republic...
Processing https://simple.wikipedia.org/wiki/Biscuit...
Processing https://simple.wikipedia.org/wiki/Ketchup...
Processing https://simple.wikipedia.org/wiki/Election...
Processing https://simple.wikipedia.org/wiki/Handcuffs...
Processing https://s

In [21]:
# qna[['question','article','url','article_text']]
qna[['question']][:10]

Unnamed: 0,question
0,did the people of gibraltar vote to remain pa...
1,which country uses the franc as its official ...
2,which of these old communist parties no longe...
3,a patient has a terminal illness and wants to ...
4,"according to plato, what are the three types o..."
5,"after ten half-lives, what percentage of the o..."
6,approximately how many years will pass on eart...
7,are a and b the same number?
8,are addition and multiplication dyadic functions?
9,are albinic animals more likely to be attacked...


In [22]:
retriever.get_relevant_documents("a patient has a terminal illness and wants to end their life, but their family members are opposed to it. how should the doctor handle this situation?")

[Document(page_content="If there is not enough of a medicine to treat every person who has a disease who should get the medicine?\nIf a baby has a disease that will kill him very soon, what should a doctor do if the baby's mother says she does not want the baby to be helped?\nA patient has an injury that cannot be helped and that will kill them in a few minutes. The patient asks a doctor “am I going to die?” What should the doctor say?", metadata={'source': 'https://simple.wikipedia.org/wiki/Medical%20ethics', 'title': 'Medical ethics - Simple English Wikipedia, the free encyclopedia', 'language': 'en'})]

The function below retrieves the relevant content from the FAISS database for a given query and appends it to the prompt used by our Telegram bot.

In [23]:
def append_documents_to_instruction(instruction):
    instruction_with_documents = f'''
    Answer the following question, making use of the documents provided below if they are relevant.
    Do not use those documents and do not mention them if you deem that they do not contain any relevant information. Do not mention the id of the documents.
    If a document is relevant, add the source of the information by adding a link to the exact url that was used.
    For that use the 'source' field of the relevant document.
    Question: {instruction}
    '''

    docs = retriever.get_relevant_documents(instruction)

    if len(docs) == 0: # If there are no relevant documents, just return the original instruction
        return instruction

    instruction_with_documents += "Documents:\n"

    # Put each retrieved document on its own lines so entries do not run together
    for doc in docs:
        instruction_with_documents += f"- {doc.metadata}\n  Content: {doc.page_content}\n"
    return instruction_with_documents

The following `LLM` class wraps the Mistral-7B-Instruct-v0.2 model and keeps the chat history for the Telegram bot.

In [24]:
class LLM:
  def __init__(self):
    model_string = "mistralai/Mistral-7B-Instruct-v0.2"
    self.chat = []
    self.llm = HuggingFaceHub(repo_id=model_string, model_kwargs={"temperature": 0.5, "max_length":64,"max_new_tokens":512})
    self.tokenizer = AutoTokenizer.from_pretrained(model_string)

  def get_reply(self, instruction):
    instruction_with_context = append_documents_to_instruction(instruction)
    self.chat.append({"role" : "user", "content" : instruction_with_context})

    prompt = self.tokenizer.apply_chat_template(self.chat, tokenize=False)
    reply = self.llm.invoke(prompt)
    self.chat.append({"role" : "assistant", "content" : reply})
    return reply
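The `apply_chat_template` call turns the list of role/content messages into a single prompt string. As a rough sketch of what that string looks like, the function below approximates the `[INST] ... [/INST]` layout used by Mistral instruct models. This is an assumption for illustration: the tokenizer's own template is authoritative, and exact whitespace and special tokens may differ. The helper name `format_chat` is hypothetical.

```python
def format_chat(messages: list, bos: str = "<s>", eos: str = "</s>") -> str:
    """Approximate an instruct-style prompt from alternating user/assistant messages."""
    prompt = bos
    for i, message in enumerate(messages):
        expected_role = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            # User turns are wrapped in instruction markers
            prompt += f"[INST] {message['content']} [/INST]"
        else:
            # Assistant turns are closed with the end-of-sequence token
            prompt += f"{message['content']}{eos}"
    return prompt

chat = [
    {"role": "user", "content": "Is a nut a fruit?"},
    {"role": "assistant", "content": "Botanically, some nuts are fruits."},
    {"role": "user", "content": "Give an example."},
]
print(format_chat(chat))
```

Because the whole history is re-sent on every turn, the prompt grows with each message, which is why the bot exposes a `/newchat` command to reset it.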

We also added a feature that lets the Telegram bot accept PDF files and add their content to the vector database.

In [25]:
!pip3 install pypdf

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 290.4/290.4 kB 2.4 MB/s eta 0:00:00
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0


In [26]:
def dl_file(message):
    import requests
    file_info = bot.get_file(message.document.file_id)

    download_url = f"https://api.telegram.org/file/bot{bot.token}/{file_info.file_path}"
    response = requests.get(download_url)

    if response.status_code == 200:
        with open(message.document.file_name, 'wb') as file:
            file.write(response.content)
        return True
    else:
        return False

In [27]:
def init_bot():
  bot = telebot.TeleBot(BOT_TOKEN)

  @bot.message_handler(commands=['start'])
  def on_start(message):
      bot.send_message(message.chat.id, "Beep, boop, starting bot...")

  @bot.message_handler(commands=['newchat'])
  def on_new_chat(message):
    llm.chat = []
    bot.reply_to(message,  "Starting new chat!")

  @bot.message_handler(content_types=['document'])
  def on_document(message):
    if message.document.mime_type == 'application/pdf':
        reply = bot.reply_to(message, "⬇️ Downloading file ⬇️")

        if not dl_file(message):
            bot.edit_message_text("❌ Failed to download file", reply.chat.id, reply.message_id)
            return

        bot.edit_message_text("🗃️ Adding file to database 🗃️", reply.chat.id, reply.message_id)

        loader = PyPDFLoader(message.document.file_name)
        pages = loader.load_and_split()
        chunks = text_splitter.transform_documents(pages)

        combined_vectorstore.add_documents(chunks)

        bot.edit_message_text("✅ PDF received and added to database", reply.chat.id, reply.message_id)
    else:
        bot.reply_to(message, "For the moment, I only support indexing PDF files. Please send a PDF file.")

  @bot.message_handler(func=lambda msg: True)
  def on_message(message):
      print(f"Message received! {message}")
      reply = llm.get_reply(message.text)
      bot.reply_to(message, reply)

  return bot

In [28]:
llm = LLM()

  warn_deprecated(


tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

## Starting the Telegram Bot

To activate the Telegram bot, execute the subsequent cell which will keep the bot running indefinitely, ready to respond to user queries. Interaction with the bot is done through the Telegram app. To halt the bot, simply interrupt the running cell.

Due to the potential computational limitations of local machines, it's recommended to run the bot in Google Colab. You can access the appropriate Colab notebook [here](https://colab.research.google.com/drive/1FgZJTcYCAsYMuHCRPTWUQAvEWR4nKoQH?usp=sharing).

**Preparation for Google Colab**:
Before running the notebook in Google Colab, make sure the following are available:
- `updated_qns_answers.csv`: The dataset file, uploaded to the Colab session.
- `HF_TOKEN`: Your API key for Hugging Face, stored as a Colab secret.
- `TELEGRAM_TOKEN`: Your API key for the Telegram bot, stored as a Colab secret.

These steps will prepare your environment to effectively run the Telegram bot without local resource constraints.

In [37]:
bot = init_bot()
llm.chat = []
bot.infinity_polling()

Message received! {'content_type': 'text', 'id': 44, 'message_id': 44, 'from_user': {'id': 5304133338, 'is_bot': False, 'first_name': 'y', 'username': 'yokurang', 'last_name': 'k', 'language_code': 'en', 'can_join_groups': None, 'can_read_all_group_messages': None, 'supports_inline_queries': None, 'is_premium': None, 'added_to_attachment_menu': None, 'can_connect_to_business': None}, 'date': 1713330416, 'chat': {'id': 5304133338, 'type': 'private', 'title': None, 'username': 'yokurang', 'first_name': 'y', 'last_name': 'k', 'is_forum': None, 'photo': None, 'bio': None, 'join_to_send_messages': None, 'join_by_request': None, 'has_private_forwards': None, 'has_restricted_voice_and_video_messages': None, 'description': None, 'invite_link': None, 'pinned_message': None, 'permissions': None, 'slow_mode_delay': None, 'message_auto_delete_time': None, 'has_protected_content': None, 'sticker_set_name': None, 'can_set_sticker_set': None, 'linked_chat_id': None, 'location': None, 'active_username

2024-04-17 05:07:07,100 (__init__.py:1092 MainThread) ERROR - TeleBot: "Infinity polling: polling exited"
ERROR:TeleBot:Infinity polling: polling exited
2024-04-17 05:07:07,104 (__init__.py:1094 MainThread) ERROR - TeleBot: "Break infinity polling"
ERROR:TeleBot:Break infinity polling


Let us validate our retriever by checking whether it returns the article that the dataset lists for each question.

In [30]:
qna_asked = []
qs = qna[['question']]
qs

Unnamed: 0,question
0,did the people of gibraltar vote to remain pa...
1,which country uses the franc as its official ...
2,which of these old communist parties no longe...
3,a patient has a terminal illness and wants to ...
4,"according to plato, what are the three types o..."
...,...
2568,what are the main areas of focus in andrology?
2569,what are the main arguments in favor of animal...
2570,what are the main arguments in favor of neocla...
2571,what are the main branches of physical geography?


In [31]:
def cross_validate_retriever(questions, retriever):
    qna_asked = []  # Initialize an empty list to store question and metadata
    processed_questions = set()  # Set to track processed questions to avoid duplicates

    # Iterate over each question
    for question in questions:
        # Skip the question if it has already been processed
        if question in processed_questions:
            continue

        # Retrieve relevant documents for the question using the provided retriever
        docs = retriever.get_relevant_documents(question)

        # Store the question and the metadata of each document, and mark the question as processed
        for doc in docs:
            qna_asked.append({'question': question, 'metadata': doc.metadata})
        processed_questions.add(question)

    return qna_asked

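Since we indexed the articles for the first 50 questions, we validate on those same 50 questions. This measures how well the retriever finds the right article among the indexed ones; it is not a test on unseen data.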

In [32]:
questions = qna['question'][:50]

qna_asked = cross_validate_retriever(questions, retriever)

def remove_duplicates(qna_asked):
    # Deduplicate on the string form of each entry (dicts are unhashable);
    # a dict keeps first-seen order and avoids calling eval() on strings
    unique_entries = {str(entry): entry for entry in qna_asked}
    return list(unique_entries.values())


qna_asked_filtered = remove_duplicates(qna_asked)
qna_asked_filtered




[{'question': 'are cassette tapes still popular among audiophiles?',
  'metadata': {'source': 'https://simple.wikipedia.org/wiki/Audio%20cassette',
   'title': 'Audio cassette - Simple English Wikipedia, the free encyclopedia',
   'language': 'en'}},
 {'question': 'are handcuffs used for hog-tying a suspect?',
  'metadata': {'source': 'https://simple.wikipedia.org/wiki/Handcuffs',
   'title': 'Handcuffs - Simple English Wikipedia, the free encyclopedia',
   'language': 'en'}},
 {'question': 'are there any cultures or belief systems that attribute supernatural powers to certain objects, such as amulets or talismans, as a form of fetishism?',
  'metadata': {'source': 'https://simple.wikipedia.org/wiki/Fetishism',
   'title': 'Fetishism - Simple English Wikipedia, the free encyclopedia',
   'language': 'en'}},
 {'question': 'are shelties considered to be highly energetic dogs?',
  'metadata': {'source': 'https://simple.wikipedia.org/wiki/Shetland%20Sheepdog',
   'title': 'Shetland Sheepdo

With the duplicate responses filtered out, we keep only the question and the article title, stripping the Wikipedia suffix from each title.

In [33]:
# Process to only keep question and modified metadata.title
processed_qna = []
for entry in qna_asked_filtered:
    # Copy the question
    question = entry['question']

    # Extract and modify the title from the metadata
    title = entry['metadata']['title']
    # Remove the unwanted text from the title
    modified_title = title.replace(' - Simple English Wikipedia, the free encyclopedia', '')

    # Create a new dictionary with the modified content and add it to the new list
    processed_qna.append({'question': question, 'title': modified_title})
processed_qna

[{'question': 'are cassette tapes still popular among audiophiles?',
  'title': 'Audio cassette'},
 {'question': 'are handcuffs used for hog-tying a suspect?',
  'title': 'Handcuffs'},
 {'question': 'are there any cultures or belief systems that attribute supernatural powers to certain objects, such as amulets or talismans, as a form of fetishism?',
  'title': 'Fetishism'},
 {'question': 'are shelties considered to be highly energetic dogs?',
  'title': 'Shetland Sheepdog'},
 {'question': 'are biscuits in british english sweet or savory?',
  'title': 'Biscuit'},
 {'question': 'are midges and gnats the same thing?', 'title': 'Fly'},
 {'question': 'approximately how many years will pass on earth for each year in the spaceship traveling at 10% of the speed of light?',
  'title': 'Time dilation'},
 {'question': ' did the people of gibraltar vote to remain part of the united kingdom in the 2002 referendum?',
  'title': 'Gibraltar'},
 {'question': 'are nuts a type of fruit or a fastener?', '

Let us count how many questions our retriever matched to the correct article and how many it did not.

In [34]:
matches = 0
mismatches = 0

validation_results = []

for index, row in qna.head(50).iterrows():
    question = row['question']
    article = row['article']

    # Find the corresponding entry in processed_qna based on the question
    corresponding_entry = next((item for item in processed_qna if item['question'] == question), None)

    # If a matching question is found in processed_qna
    if corresponding_entry:
        # Compare the article in qna with the title in processed_qna
        if corresponding_entry['title'] == article:
            matches += 1
            validation_results.append({'question': question, 'match': True})
        else:
            mismatches += 1
            validation_results.append({'question': question, 'match': False})
    else:
        # If no matching question is found, consider it a mismatch for tracking purposes
        mismatches += 1
        validation_results.append({'question': question, 'match': False})

print(f"Matches: {matches}, Mismatches: {mismatches}")
for result in validation_results:
    print(result)


Matches: 44, Mismatches: 6
{'question': ' did the people of gibraltar vote to remain part of the united kingdom in the 2002 referendum?', 'match': True}
{'question': ' which country uses the franc as its official currency?', 'match': True}
{'question': ' which of these old communist parties no longer exists today?', 'match': True}
{'question': 'a patient has a terminal illness and wants to end their life, but their family members are opposed to it. how should the doctor handle this situation?', 'match': True}
{'question': 'according to plato, what are the three types of people that society should be made up of?', 'match': True}
{'question': 'after ten half-lives, what percentage of the original atoms remain?', 'match': True}
{'question': 'approximately how many years will pass on earth for each year in the spaceship traveling at 10% of the speed of light?', 'match': True}
{'question': 'are a and b the same number?', 'match': True}
{'question': 'are addition and multiplication dyadic fu

The results are good: 44 of the 50 questions (88%) were matched to the correct article.

In [35]:
correct_questions = [result['question'] for result in validation_results if result['match']]
correct_points = qna[qna['question'].isin(correct_questions)]['points'].sum()

# Calculate the total available points from all questions considered (first 50 entries)
total_points = qna.head(50)['points'].sum()

# Calculate the accuracy ratio
accuracy_ratio = correct_points / total_points if total_points > 0 else 0

# Output the results
print(f"Total Correct Points: {correct_points}")
print(f"Total Available Points: {total_points}")
print(f"Accuracy Ratio: {accuracy_ratio:.2f}")

Total Correct Points: 2464
Total Available Points: 2835
Accuracy Ratio: 0.87


Weighted by points, the result is similar: the correctly matched questions account for 2464 of the 2835 available points (87%).
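The accuracy ratio above is a points-weighted accuracy: the points of correctly matched questions divided by the points of all questions considered. A minimal sketch of that calculation follows. The helper name `weighted_accuracy` and the match flags in the toy data are made up for illustration; the final line only re-checks the arithmetic on the totals reported above.

```python
def weighted_accuracy(results: list) -> float:
    """results: list of (points, matched) pairs; returns earned points / total points."""
    total = sum(points for points, _ in results)
    earned = sum(points for points, matched in results if matched)
    return earned / total if total > 0 else 0.0

# Toy data: point values with made-up match flags
toy_results = [(58, True), (55, True), (52, False)]
print(f"Toy accuracy: {weighted_accuracy(toy_results):.2f}")

# Arithmetic check on the totals reported above
print(f"Reported ratio: {2464 / 2835:.2f}")
```

Weighting by points means a miss on a high-value question costs more than a miss on a low-value one, so this ratio can differ from the plain share of matched questions.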

In [36]:
validation_df = pd.DataFrame(validation_results)
processed_qna_df = pd.DataFrame(processed_qna)

# Merge validation results with processed_qna to include model's answers
validation_with_answers = pd.merge(validation_df, processed_qna_df, on='question', how='left')

# Rename columns in qna for clarity when merging
qna_renamed = qna.rename(columns={'article': 'correct_answer'})

# Merge the combined data with qna to get the correct answers and additional information
full_comparison = pd.merge(validation_with_answers, qna_renamed, on='question', how='left')

# Filter to show only the mismatches
incorrect_answers = full_comparison[full_comparison['match'] == False]

# Selecting relevant columns to display
incorrect_answers = incorrect_answers[['question', 'title', 'correct_answer', 'points', 'article_text']]

true_mismatches = full_comparison[(full_comparison['match'] == False) & (full_comparison['title'] != full_comparison['correct_answer'])]

# Selecting relevant columns to display
true_mismatches = true_mismatches[['question', 'title', 'correct_answer', 'points', 'article_text']]

# Display the DataFrame with true mismatches
true_mismatches

Unnamed: 0,question,title,correct_answer,points,article_text
12,are avocados considered a fruit or a vegetable...,Fruit,List of fruits,56,Fruits on this list are defined as the word is...
21,are dogs considered mammals?,,Fact,89,"A fact is a statement that is real or true, or..."
31,are nuts a type of fruit or a fastener?,Berry,Nut,53,Nut or nuts can mean different things:\nNut (f...
32,are nuts a type of fruit or a fastener?,Fruit,Nut,53,Nut or nuts can mean different things:\nNut (f...
33,are nuts a type of fruit or a fastener?,List of fruits,Nut,53,Nut or nuts can mean different things:\nNut (f...
41,are seed plants that produce both male and fem...,Fruit,Female,52,Female is a gender. It is the sex that produce...
46,are strawberries true berries according to bot...,Fruit,Berry,45,The word berry is used for many different kind...
52,are the set of all fords and the set of all do...,,Set,76,A set is an idea from mathematics. A set has m...


Reviewing the questions that our retriever got wrong reveals some common issues. First, it sometimes chose similar but incorrect documents, for example `Fruit` or `Berry` instead of `Nut`. Second, it sometimes failed to retrieve any document (the rows with an empty `title`) because no chunk reached the search similarity threshold of 0.3. Additionally, removing duplicate questions sometimes eliminated the possibility of correct answers. Future work could investigate how keeping duplicate questions in the indexed subset affects accuracy.
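Both failure modes can be illustrated with a small sketch of threshold filtering. The scores below are hypothetical and not taken from our retriever, and the sketch assumes documents are kept when their relevance score is at or above the threshold. The helper name `filter_by_threshold` is also hypothetical.

```python
def filter_by_threshold(scored_docs: list, score_threshold: float) -> list:
    """Keep (title, score) pairs scoring at or above the threshold, best first."""
    kept = [(title, score) for title, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical relevance scores for a question whose correct article is "Nut"
scored = [("Fruit", 0.41), ("Berry", 0.35), ("Nut", 0.28)]

for threshold in (0.2, 0.3, 0.5):
    titles = [title for title, _ in filter_by_threshold(scored, threshold)]
    print(threshold, titles)
```

With these made-up numbers, a threshold of 0.3 keeps the two similar but incorrect articles and drops the correct one, while a threshold of 0.5 returns nothing at all. Tuning the threshold therefore trades missing documents against admitting near misses.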