# Creation of LLM-based Agents for the MedQuad database

MedQuad (Medical Question Answering Dataset) is an English-language medical database developed primarily for artificial intelligence and natural language processing (NLP) research. This database contains question-answer pairs focused on medical topics and is designed to help automated systems provide accurate and reliable answers to medical questions. The content of the database is generally drawn from medical literature, clinical information, and authoritative sources such as publications from the National Institutes of Health (NIH) or other recognized health organizations.

## Installation of dependencies

In [None]:
!pip install llama-index llama-index-embeddings-huggingface --quiet

## Import packages

In [None]:
from tqdm import tqdm

from llama_index.core import Settings
from llama_index.core import Document
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core import load_index_from_storage
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

from matplotlib import pyplot as plt

## Collecting data

In [None]:
import pandas as pd

df_medquad = pd.read_csv("hf://datasets/keivalya/MedQuad-MedicalQnADataset/medDataset_processed.csv")
df_medquad.info()
df_medquad.head()

In [None]:
df_medquad["text_char_len"] = df_medquad["Answer"].apply(lambda x: len(x))
df_medquad["text_word_len"] = df_medquad["Answer"].apply(lambda x: len(x.split()))
df_medquad_stat = df_medquad[["text_char_len", "text_word_len"]]
df_medquad_stat.describe()

In [None]:
df_medquad_stat['text_char_len'].plot(kind='hist', bins=20, title='text_char_len')
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
df_medquad_stat['text_word_len'].plot(kind='hist', bins=20, title='text_word_len')
plt.gca().spines[['top', 'right',]].set_visible(False)

## Create a vector database using the [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library

LlamaIndex (formerly known as GPT Index) is a Python library designed to support the integration and querying of structured and unstructured data using large language models (LLMs).

Key features:

    1. Document indexing: pre-processes and indexes data for efficient searching.
    2. Retrieval-Augmented Generation (RAG) support: can be combined with LLMs to build data-driven question-answering systems.
    3. Modular architecture: includes data loaders, indexing algorithms, and different retrieval strategies.
    4. Integration capabilities: compatible with multiple data sources (e.g. files, databases, APIs, web pages).
    5. Multiple index structures: e.g. vector space search, tree-like indexes, or keyword-based search.

What is it good for?

    1. Document-based search and retrieval using LLMs.
    2. Summarising and extracting information from large texts.
    3. Enterprise AI applications (e.g. chatbots, knowledge management systems, customer service automation).
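
The vector-based retrieval idea behind document search can be illustrated without LlamaIndex. The following is a minimal sketch, not how LlamaIndex is implemented: it assumes a toy bag-of-words vector in place of a learned embedding model, and the names `embed`, `cosine`, and `top_k` are hypothetical helpers introduced only for this illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, documents, k=2):
    """Return the k (score, document) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


docs = [
    "glaucoma is a group of diseases that damage the optic nerve",
    "high blood pressure raises the risk of heart disease",
    "diabetes affects how the body uses blood sugar",
]

results = top_k("what is glaucoma", docs, k=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

A real pipeline replaces `embed` with a sentence-embedding model, so that texts with similar meaning score highly even when they share no words.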

In [None]:
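model_name = "sentence-transformers/all-MiniLM-L6-v2"

# No LLM is needed here: the index is used for retrieval only.
Settings.llm = None
# device="cuda" requires a GPU; use device="cpu" if none is available.
Settings.embed_model = HuggingFaceEmbedding(model_name=model_name, device="cuda")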

In [None]:
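chunks = []
chunk_size_by_words = 150  # maximum number of space-separated tokens per chunk

for text in tqdm(df_medquad["Answer"].values):
  # Split each answer on single spaces and group the tokens into fixed-size chunks.
  text_split = text.split(" ")
  for i in range(0, len(text_split), chunk_size_by_words):
    chunk = " ".join(text_split[i:i + chunk_size_by_words])
    # Each chunk becomes its own Document so it is embedded and retrieved separately.
    chunks.append(Document(text=chunk))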

len(chunks)
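
The loop above splits on a single space, so repeated spaces produce empty tokens and words separated only by a newline stay fused into one token. The sketch below shows a possible variant that uses `str.split()` without arguments; `chunk_words` and `chunks_demo` are hypothetical names used for illustration, and this variant replaces newlines inside a chunk with spaces.

```python
def chunk_words(text, size):
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()  # no argument: splits on any run of whitespace
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


sample = "one two  three\nfour five six seven"

# Whitespace-aware chunking: seven clean words in chunks of three.
chunks_demo = chunk_words(sample, 3)
print(chunks_demo)

# Splitting on a single space keeps an empty token and a fused "three\nfour".
naive_tokens = sample.split(" ")
print(naive_tokens)
```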

In [None]:
index = VectorStoreIndex.from_documents(chunks, show_progress=True, insert_batch_size=len(chunks))

In [None]:
persist_dir="storage"

In [None]:
# Takes roughly 1 to 2 minutes
index.storage_context.persist(persist_dir=persist_dir)
print(f"VectorStoreIndex saved to {persist_dir}.")

In [None]:
loaded_storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(loaded_storage_context)
print(f"VectorStoreIndex loaded from {persist_dir}.")

In LlamaIndex, calling the index.storage_context.persist method saves the data associated with the index as several files in the specified persist_dir directory (by default './storage'). Each file stores a different component of the index. The generated files and their contents are listed below:

    docstore.json
        Content: This file contains the information about the document repository (docstore). The docstore stores the raw documents or parts of documents (called Nodes) that you used to create the index. It can contain the text, metadata (e.g. filename, identifier) and other relevant information about the documents.
        Purpose: Provides quick access to documents without having to reload them from the source.
    index_store.json
        Content: Stores the metadata of the index store (index_store). This includes information about the index structure, such as the index identifier (index_id) and other state information that is generated when the index is created.
        Purpose: To assist in fast loading and management of the index, especially when there are multiple indexes in the same repository.
    vector_store.json
        Content: Saves the data of a vector store (vector_store) containing embedding vectors of documents or text fragments. These vectors are required for semantic search and are usually represented as a simple in-memory vector database if you are not using an external vector store (e.g. Chroma, Pinecone).
        Purpose: Allows reuse of vectors without re-indexing, which is advantageous in terms of time and computational cost.
    graph_store.json (optional)
        Content: If you are using a graph-based index (e.g. knowledge graph), this file stores the graph data, such as the relationships between nodes and edges.
        Purpose: Required to maintain and reload the graph structure. It only appears if you are using a graph-based index.
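
To check which files were actually written, the persist directory can be inspected directly. This is a minimal sketch using only the standard library; `list_persisted_files` is a hypothetical helper, and because exact file names can differ between LlamaIndex versions, it lists whatever is present instead of assuming specific names.

```python
from pathlib import Path


def list_persisted_files(persist_dir):
    """Return {file name: size in bytes} for the files in persist_dir."""
    directory = Path(persist_dir)
    if not directory.is_dir():
        return {}  # nothing has been persisted to this location yet
    return {
        path.name: path.stat().st_size
        for path in sorted(directory.iterdir())
        if path.is_file()
    }


# "storage" matches the persist_dir value used in this notebook.
files = list_persisted_files("storage")
for name, size in files.items():
    print(f"{name}: {size / 1024:.1f} KiB")
```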

In [None]:
print("Question:", df_medquad["Question"].values[0])
print("Answer:", df_medquad["Answer"].values[0])

In [None]:
query_text = df_medquad["Question"].values[0]
query_engine = index.as_query_engine(similarity_top_k=10)

response = query_engine.query(query_text)

# Print the results
print(f"Query: {query_text}\n")
print("Source details:")
for node in response.source_nodes:
  text = node.text.replace('\n', " ")
  print(f"Node ID: {node.node_id}\nScore: {node.score}\n{text[:200]}\n")