# HCP Targeting and Segmentation Using the LlamaIndex Framework

## Problem Statement
Healthcare organizations face challenges in analyzing large datasets from diverse sources, such as:
- HCP records
- Patient notes
- Prescription trends
- Engagement metrics

The goal is to process this data and provide actionable insights using the **LlamaIndex** framework, incorporating the following components:
1. Data Loaders
2. Chunking/Tokenization
3. Node Parser
4. Embeddings
5. Indexing & Retrieval
6. Vector Databases
7. LLMs for insights
8. Response synthesis & Query Engine

---

# Dataset Overview

The **HCP Targeting and Segmentation** dataset contains structured information about healthcare professionals (HCPs).
It includes fields such as `HCP_ID`, `Specialty`, `Region`, and engagement-related metrics.

## Key Features
1. **HCP_ID**: A unique identifier for each healthcare professional.
2. **Name**: The name of the HCP.
3. **Specialty**: The medical specialty of the HCP (e.g., Pediatrics, Cardiology).
4. **Region**: The geographic region where the HCP practices (e.g., North, East).
5. **Prescribing_Trend**: Indicates the prescribing behavior of the HCP (e.g., Low, Medium, High).
6. **Research_Involvement**: Specifies whether the HCP is involved in research activities (Yes/No).
7. **Engagement_Score**: A qualitative measure of the HCP's engagement level (e.g., Low, Medium, High).


In [1]:
import openai
from llama_index.core import Document, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import TokenTextSplitter, SentenceWindowNodeParser
import pandas as pd

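# Configure the OpenAI API key. LlamaIndex's default OpenAI LLM and embedding
# model read it from the OPENAI_API_KEY environment variable; alternatively, set it here:
# openai.api_key = ""  # Replace with your OpenAI API key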

# Step 1: Load the CSV file into a DataFrame


In [3]:
file_path = "data/HCP_Targeting_and_Segmentation.csv"
hcp_data = pd.read_csv(file_path)

# Step 2: Convert each row into a Document object (node)


In [4]:
documents = [
    Document(
        text=f"HCP_ID: {row['HCP_ID']}, Name: {row['Name']}, Specialty: {row['Specialty']}, "
             f"Region: {row['Region']}, Prescribing_Trend: {row['Prescribing_Trend']}, "
             f"Research_Involvement: {row['Research_Involvement']}, Engagement_Score: {row['Engagement_Score']}",
        metadata={"row_index": idx}  # Add metadata for reference (optional)
    )
    for idx, row in hcp_data.iterrows()
]

# Print the total number of documents created
print(f"Total rows converted to documents: {len(documents)}")

# Inspect the first few documents
for i, doc in enumerate(documents[:2]):
    print(f"Document {i + 1}:")
    print(doc.text)
    print(f"Metadata: {doc.metadata}")
    print("-" * 50)


Total rows converted to documents: 50000
Document 1:
HCP_ID: HCP00001, Name: HCP_Name_1, Specialty: Pediatrics, Region: North, Prescribing_Trend: Low, Research_Involvement: Yes, Engagement_Score: Medium
Metadata: {'row_index': 0}
--------------------------------------------------
Document 2:
HCP_ID: HCP00002, Name: HCP_Name_2, Specialty: Pediatrics, Region: North, Prescribing_Trend: Low, Research_Involvement: Yes, Engagement_Score: Low
Metadata: {'row_index': 1}
--------------------------------------------------
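
The row-to-text conversion above can also be written as a small reusable helper. The following is a minimal sketch using only the standard library; the helper name `row_to_text`, the `FIELDS` list, and the inline two-row sample are illustrative assumptions and are not part of the original notebook.

```python
import csv
import io

# Column order used for the Document text (taken from the dataset description).
FIELDS = [
    "HCP_ID", "Name", "Specialty", "Region",
    "Prescribing_Trend", "Research_Involvement", "Engagement_Score",
]

def row_to_text(row, fields=FIELDS):
    """Serialize one record as 'key: value' pairs joined by ', '."""
    return ", ".join(f"{name}: {row[name]}" for name in fields)

# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE_CSV = """HCP_ID,Name,Specialty,Region,Prescribing_Trend,Research_Involvement,Engagement_Score
HCP00001,HCP_Name_1,Pediatrics,North,Low,Yes,Medium
HCP00002,HCP_Name_2,Pediatrics,North,Low,Yes,Low
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
texts = [row_to_text(row) for row in rows]
print(texts[0])
```

Keeping the field list in one place makes it easier to add or drop columns without editing a long f-string.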


# Step 3: Chunking and Tokenizing using TokenTextSplitter


In [5]:
splitter = TokenTextSplitter(
    chunk_size=512,        # Max tokens per chunk
    chunk_overlap=50,      # Overlap between chunks
    separator=" "          # Separator for splitting
)

chunks = splitter.get_nodes_from_documents(documents)
print(f"Total chunks created after chunking: {len(chunks)}")

Total chunks created after chunking: 50000
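
The chunk count equals the document count because each row's text is far shorter than 512 tokens, so no row is split. The sketch below illustrates the general idea of fixed-size chunking with overlap. It is an assumption-laden simplification: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for the model tokenizer that a real token splitter would use.

```python
def chunk_tokens(text, chunk_size=512, chunk_overlap=50):
    """Split text into overlapping windows of whitespace-separated tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    if not tokens:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# A short row fits in a single chunk, so it comes back unchanged.
row = "HCP_ID: HCP00001, Name: HCP_Name_1, Specialty: Pediatrics"
print(len(chunk_tokens(row)))
```

Only texts longer than `chunk_size` tokens produce more than one chunk, and consecutive chunks then share `chunk_overlap` tokens.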


# Step 4: Node Parsing using SentenceWindowNodeParser


In [6]:
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Number of surrounding sentences to include in context
    window_metadata_key="window",  # Metadata key for context
    original_text_metadata_key="original_sentence"  # Metadata key for the original sentence
)

nodes = node_parser.get_nodes_from_documents(chunks)
print(f"Total nodes created after parsing: {len(nodes)}")

Total nodes created after parsing: 50000
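
The node count is unchanged here, presumably because each row's text contains no sentence-ending punctuation and is therefore treated as one sentence. To show what a sentence window adds for multi-sentence text, here is a minimal standard-library sketch. The function `sentence_windows` and its naive regex splitter are illustrative assumptions, not the parser's actual implementation.

```python
import re

def sentence_windows(text, window_size=3):
    """Return one record per sentence plus a window of neighbouring sentences."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        records.append({
            "original_sentence": sentence,
            "window": " ".join(sentences[lo:hi]),
        })
    return records

# A row without sentence punctuation yields one record whose window is the row itself.
row = "HCP_ID: HCP00001, Name: HCP_Name_1, Specialty: Pediatrics"
print(sentence_windows(row))
```

For free-text fields such as patient notes, each sentence would become its own record, with up to `window_size` sentences of context on either side.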


## Step 4: Generate Vector Embeddings and Create Index

In [7]:
parsed_documents = [
    Document(
        text=node.text,
        metadata=node.metadata  # Carries row_index plus the window/original_sentence keys added by the parser
    )
    for node in nodes
]
index = VectorStoreIndex.from_documents(parsed_documents)
print(f"Index created with {len(nodes)} nodes.")

Index created with 50000 nodes.


In [8]:
def get_total_index_count(index):
    """Returns the total number of documents in the VectorStoreIndex."""
    try:
        # Access the document store
        docstore = index.storage_context.docstore

        # Count the number of documents
        total_count = len(docstore.docs)
        print(f"Total number of documents indexed: {total_count}")
        return total_count
    except AttributeError as e:
        print("Unable to fetch the total count. Ensure the index is correctly initialized.")
        print(f"Error: {e}")
        return 0

# Example usage
total_indexes = get_total_index_count(index)


Total number of documents indexed: 50000
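
Conceptually, a vector index stores one embedding per node and, at query time, ranks nodes by their similarity to the query embedding. The sketch below illustrates that ranking step with the standard library only. It rests on a loud assumption: word counts stand in for a learned embedding model, and `embed`, `cosine`, and `top_k` are hypothetical helpers rather than LlamaIndex functions.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, texts, k=2):
    """Return the k texts most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(t)), t) for t in texts]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [t for _, t in ranked[:k]]

docs = [
    "Specialty: Oncology, Region: North",
    "Specialty: Cardiology, Region: East",
    "Specialty: Pediatrics, Region: West",
]
print(top_k("cardiology east", docs, k=1))
```

A real embedding model captures meaning beyond exact word overlap, but the retrieval logic (embed, score, keep the top k) follows the same shape.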


## Step 5: Query Index


In [9]:
# Query the index
def query_index(index, query):
    """Run a natural-language query against the index and return the response."""
    query_engine = index.as_query_engine()
    response = query_engine.query(query)
    return response

query_text = "What are the key prescription trends for HCPs?"
response = query_index(index, query_text)
print("Query Response:")
print(response)

Query Response:
The key prescription trends for the healthcare providers (HCPs) mentioned in the context are high prescribing trend for the HCP specializing in Oncology and medium prescribing trend for the HCP specializing in Cardiology.
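
Note that the query engine builds its answer only from the few nodes the retriever returns (by default the two most similar ones), which is why the response above describes two HCPs rather than the whole dataset. For dataset-wide trends, a direct tabulation of the structured fields gives exact counts. The following is a minimal sketch under stated assumptions: `trend_counts` is a hypothetical helper and the four sample records are made up for illustration.

```python
from collections import Counter, defaultdict

def trend_counts(records):
    """Count Prescribing_Trend values per Specialty."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["Specialty"]][rec["Prescribing_Trend"]] += 1
    return counts

# Hypothetical records; the real rows would come from the CSV file.
sample = [
    {"Specialty": "Oncology", "Prescribing_Trend": "High"},
    {"Specialty": "Oncology", "Prescribing_Trend": "High"},
    {"Specialty": "Cardiology", "Prescribing_Trend": "Medium"},
    {"Specialty": "Cardiology", "Prescribing_Trend": "Low"},
]

summary = trend_counts(sample)
for specialty, counter in summary.items():
    print(specialty, dict(counter))
```

An aggregate like this could be passed to the LLM as context when the question concerns the full population instead of individual HCPs.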


## Step 6: Dynamically Update Index with New Data


In [10]:
def dynamic_update_index(data_dir="data"):
    """Reloads every file in the given directory and builds a new index from scratch."""
    print("Updating index with new data...")
    documents = SimpleDirectoryReader(data_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    print("Index updated successfully.")
    return index

In [11]:
# Add new data
new_data = pd.DataFrame({
    "HCP_ID": [101, 102, 103],
    "Specialty": ["Pediatrics", "Cardiology", "Orthopedics"],
    "Engagement_Score": [85, 65, 45],
    "Notes": [
        "Highly engaged with pediatric programs.",
        "Needs follow-up for engagement.",
        "Minimal interaction recorded."
    ]
})
# Save inside the directory that dynamic_update_index() reads, so the new rows are picked up
new_data_filepath = "data/new_hcp_data.csv"
new_data.to_csv(new_data_filepath, index=False)

# Update index
updated_index = dynamic_update_index(data_dir="data")

Updating index with new data...
Index updated successfully.


## Step 7: Query Updated Index

In [12]:
# Query the updated index
updated_query_text = "Who are the HCPs with high engagement scores?"
updated_response = query_index(updated_index, updated_query_text)
print("Updated Query Response:")
print(updated_response)

Updated Query Response:
HCP18075 and HCP36020 are the healthcare professionals with high engagement scores.


# Conclusion

This end-to-end solution demonstrates how to process a structured dataset, such as a CSV file, using **LlamaIndex** for indexing and querying. The key steps include:

1. **Data Preparation**:
   - The rows from the CSV file were converted into `Document` objects, with the field values (such as `HCP_ID`) serialized into the text and `row_index` stored as metadata.

2. **Chunking and Tokenization**:
   - Each document was passed through `TokenTextSplitter` (512-token chunks, 50-token overlap). Because every row is far shorter than the chunk size, each document yielded exactly one chunk.

3. **Node Parsing**:
   - Sentence-level parsing was applied using `SentenceWindowNodeParser`, which stores the surrounding context under the `window` metadata key. Each row was treated as a single sentence, so the node count stayed at 50,000.

4. **Compatibility with VectorStoreIndex**:
   - The parsed nodes were wrapped back into `Document` objects and passed to `VectorStoreIndex.from_documents`. This conversion is optional, since `VectorStoreIndex` can also be constructed directly from a list of nodes.

5. **Efficient Querying**:
   - The indexed dataset supports natural-language queries, such as identifying HCPs by specialty or engagement score. Each answer is synthesized from the most similar retrieved rows rather than from the full dataset.

## Benefits
- **Scalability**: The pipeline supports large datasets with fine-grained chunking and parsing.
- **Flexibility**: Metadata allows for advanced filtering and contextual retrieval.
- **Actionable Insights**: The indexed dataset enables natural language queries, transforming raw data into meaningful insights.
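
To make the metadata-filtering point concrete: the notebook stores only `row_index` in metadata, so filtering on fields such as `Specialty` would first require copying those fields into each document's metadata. The sketch below shows the filtering idea on plain dictionaries. `filter_by_metadata` and the sample nodes are hypothetical and do not use LlamaIndex's own filter classes.

```python
def filter_by_metadata(nodes, **required):
    """Keep nodes whose metadata matches every required key/value pair."""
    return [
        node for node in nodes
        if all(node["metadata"].get(key) == value for key, value in required.items())
    ]

# Hypothetical nodes with structured fields copied into metadata.
nodes = [
    {"text": "HCP00001", "metadata": {"Specialty": "Pediatrics", "Region": "North"}},
    {"text": "HCP00002", "metadata": {"Specialty": "Cardiology", "Region": "North"}},
    {"text": "HCP00003", "metadata": {"Specialty": "Pediatrics", "Region": "East"}},
]

north_pediatrics = filter_by_metadata(nodes, Specialty="Pediatrics", Region="North")
print([node["text"] for node in north_pediatrics])
```

Applying such a filter before similarity ranking narrows retrieval to the relevant segment of HCPs.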

## Key Takeaways
This workflow combines LlamaIndex's **tokenization**, **parsing**, and **indexing** capabilities to handle structured datasets. By keeping each row as a retrievable unit and carrying metadata through the pipeline, it provides a foundation for natural-language querying and decision-making.
