<a href="https://colab.research.google.com/github/udhayarajan4562/llm-engineering/blob/main/Rag_for_xyz_hospital.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install langchain langchain-openai langchain-chroma langchain-community
!pip install sentence-transformers gradio plotly scikit-learn python-dotenv tiktoken



In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [3]:
import os
import glob
import pathlib
import numpy as np
from dotenv import load_dotenv
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import gradio as gr

# LangChain imports
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings


In [4]:
# Directly access secrets stored in Colab
from google.colab import userdata

HF_TOKEN = userdata.get("HF_TOKEN")
OPENROUTER_API_KEY = userdata.get("OPENROUTER_API_KEY")

# Check which one is missing
if not HF_TOKEN and not OPENROUTER_API_KEY:
    raise ValueError("⚠️ Both HF_TOKEN and OPENROUTER_API_KEY are missing in Colab Secrets.")
elif not HF_TOKEN:
    raise ValueError("⚠️ HF_TOKEN is missing in Colab Secrets.")
elif not OPENROUTER_API_KEY:
    raise ValueError("⚠️ OPENROUTER_API_KEY is missing in Colab Secrets.")

# Export for libraries that expect them
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY

print("✅ Both secrets loaded successfully.")


✅ Both secrets loaded successfully.
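
The secret check above can be generalized so that adding a third key does not require another `elif` branch. The following is a minimal sketch, not the notebook's own method: `find_missing` and `fake_secrets` are hypothetical names, and the dictionary stands in for Colab's secret store. It also assumes the getter may raise instead of returning an empty value when a secret is absent, so both cases are treated as missing.

```python
def find_missing(required, getter):
    """Return the names in `required` for which `getter` yields no usable value."""
    missing = []
    for name in required:
        try:
            value = getter(name)
        except Exception:
            # Assumption: some secret stores raise when a key is absent.
            value = None
        if not value:
            missing.append(name)
    return missing


# Hypothetical stand-in for the secret store; in the notebook, userdata.get plays this role.
fake_secrets = {"HF_TOKEN": "hf_example", "OPENROUTER_API_KEY": ""}

missing = find_missing(["HF_TOKEN", "OPENROUTER_API_KEY"], fake_secrets.get)
if missing:
    print("Missing in Colab Secrets: " + ", ".join(missing))
```

In the notebook itself, the call would presumably look like `find_missing(["HF_TOKEN", "OPENROUTER_API_KEY"], userdata.get)`, followed by a single `raise ValueError(...)` if the returned list is non-empty.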


In [5]:
# Define constants
MODEL = "openai/gpt-4o-mini"
DB_NAME = "/content/vector_db_data"

# Candidate locations for the 'data' folder on Google Drive / Colab
DATA_ROOT_CANDIDATES = [
    "/content/drive/My Drive/data",
    "/content/drive/MyDrive/data",
    "/content/drive/My Drive/xyz-hosipital/data",
    "/content/drive/MyDrive/xyz-hosipital/data",
    "/content/data",  # fallback if uploaded to runtime
]

def pick_data_root(candidates):
    for p in candidates:
        if os.path.isdir(p):
            return p
    return None

DATA_ROOT = pick_data_root(DATA_ROOT_CANDIDATES)
if not DATA_ROOT:
    raise FileNotFoundError(
        "❌ Could not find the 'data' folder. Please place it in Google Drive under one of: "
        + ", ".join(DATA_ROOT_CANDIDATES)
    )

print(f"✅ Using data root: {DATA_ROOT}")


✅ Using data root: /content/drive/My Drive/data


In [6]:
# Collect markdown files
text_loader_kwargs = {"encoding": "utf-8"}
documents = []

# Load root-level markdown files
root_loader = DirectoryLoader(DATA_ROOT, glob="*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
root_docs = root_loader.load()
for doc in root_docs:
    doc.metadata["doc_type"] = "root"
    documents.append(doc)

# Load markdown files from subfolders
top_level_folders = [p for p in glob.glob(f"{DATA_ROOT}/*") if os.path.isdir(p)]

for folder in top_level_folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        try:
            rel = pathlib.Path(doc.metadata.get("source", "")).resolve().as_posix()
        except Exception:
            rel = doc.metadata.get("source", "")
        doc.metadata["source_path"] = rel
        documents.append(doc)

# Split long docs into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

print(f"✅ Loaded {len(chunks)} chunks from 'data' folder.")
print("📁 Document types:", sorted(set(chunk.metadata['doc_type'] for chunk in chunks)))


✅ Loaded 92 chunks from 'data' folder.
📁 Document types: ['appointments', 'billing', 'departments', 'inventory', 'medical_records', 'patients', 'research', 'root', 'staff']
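
The output above lists which document types are present but not how the 92 chunks are distributed among them. A small sketch of a per-type tally follows; `count_by_doc_type` and `sample_metadatas` are hypothetical names, and plain dictionaries stand in for the `metadata` attribute of each chunk.

```python
from collections import Counter


def count_by_doc_type(metadatas):
    """Tally metadata dicts by their 'doc_type' key, defaulting to 'unknown'."""
    return Counter(m.get("doc_type", "unknown") for m in metadatas)


# Hypothetical metadata; in the notebook these would come from chunk.metadata.
sample_metadatas = [
    {"doc_type": "patients"},
    {"doc_type": "patients"},
    {"doc_type": "billing"},
    {},
]

counts = count_by_doc_type(sample_metadatas)
for doc_type, total in sorted(counts.items()):
    print(f"{doc_type}: {total}")
```

With the notebook's variables, `count_by_doc_type(chunk.metadata for chunk in chunks)` should give the same kind of breakdown. A heavily skewed distribution can be a hint that one folder dominates retrieval results.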


In [7]:
# Free HuggingFace embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    cache_folder="./hf_cache"
)

# Delete old DB if present
if os.path.exists(DB_NAME):
    try:
        Chroma(persist_directory=DB_NAME, embedding_function=embeddings).delete_collection()
    except Exception as e:
        print(f"⚠️ Could not delete existing collection cleanly: {e}")

# Create vectorstore
vectorstore = Chroma.from_documents(
    documents=chunks, embedding=embeddings, persist_directory=DB_NAME
)
print(f"✅ Vectorstore created with {vectorstore._collection.count()} documents.")


✅ Vectorstore created with 92 documents.


In [8]:
collection = vectorstore._collection
sample = collection.get(limit=1, include=["embeddings"])
if len(sample.get("embeddings", [])) > 0:
    print(f"📊 Embedding dimension: {len(sample['embeddings'][0])}")

result = collection.get(include=["embeddings", "documents", "metadatas"])
vectors = np.array(result["embeddings"])
documents_text = result["documents"]
doc_types = [m.get("doc_type", "unknown") for m in result["metadatas"]]

# Color map
color_map = {
    "patients": "blue",
    "medical_records": "green",
    "staff": "red",
    "departments": "orange",
    "appointments": "purple",
    "billing": "brown",
    "inventory": "teal",
    "research": "magenta",
    "root": "gray",
}
colors = [color_map.get(t, "gray") for t in doc_types]

# 2D visualization
n = vectors.shape[0] if isinstance(vectors, np.ndarray) else 0
if n >= 2:
    # scikit-learn requires perplexity < n_samples, so cap it at n - 1
    perplexity = max(1, min(30, n - 1))
    try:
        tsne_2d = TSNE(n_components=2, random_state=42, perplexity=perplexity, init="random", learning_rate="auto")
        reduced_2d = tsne_2d.fit_transform(vectors)

        fig = go.Figure(data=[go.Scatter(
            x=reduced_2d[:, 0], y=reduced_2d[:, 1],
            mode="markers",
            marker=dict(size=5, color=colors, opacity=0.8),
            text=[f"Type: {t}<br>Text: {d[:120]}..." for t, d in zip(doc_types, documents_text)]
        )])
        fig.update_layout(title="2D Chroma Vector Visualization (data/)", width=800, height=600)
        fig.show()
    except Exception as e:
        print(f"⚠️ Skipping 2D t-SNE due to: {e}")




📊 Embedding dimension: 384
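
scikit-learn's t-SNE rejects a perplexity that is not strictly smaller than the number of samples, so the value has to be clamped for very small corpora. The helper below is a sketch of that clamping on its own, without running t-SNE; `safe_perplexity` and its `target` parameter are hypothetical names.

```python
def safe_perplexity(n_samples, target=30):
    """Pick a perplexity no larger than `target` and strictly below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return max(1, min(target, n_samples - 1))


for n_samples in (2, 3, 10, 92):
    print(n_samples, safe_perplexity(n_samples))
```

For the 92 vectors stored here the result is simply the target of 30; the clamp only matters when the data folder holds a handful of chunks.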


In [9]:
# 3D visualization
if n >= 3:
    # scikit-learn requires perplexity < n_samples, so cap it at n - 1
    perplexity = max(1, min(30, n - 1))
    try:
        tsne_3d = TSNE(n_components=3, random_state=42, perplexity=perplexity, init="random", learning_rate="auto")
        reduced_3d = tsne_3d.fit_transform(vectors)

        fig = go.Figure(data=[go.Scatter3d(
            x=reduced_3d[:, 0], y=reduced_3d[:, 1], z=reduced_3d[:, 2],
            mode="markers",
            marker=dict(size=5, color=colors, opacity=0.8),
            text=[f"Type: {t}<br>Text: {d[:120]}..." for t, d in zip(doc_types, documents_text)]
        )])
        fig.update_layout(title="3D Chroma Vector Visualization (data/)", width=900, height=700)
        fig.show()
    except Exception as e:
        print(f"⚠️ Skipping 3D t-SNE due to: {e}")

In [10]:
# Initialize LLM via OpenRouter
llm = ChatOpenAI(
    model_name=MODEL,
    temperature=0.7,
    openai_api_key=os.environ["OPENROUTER_API_KEY"],
    openai_api_base="https://openrouter.ai/api/v1"
)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
retriever = vectorstore.as_retriever()

conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=retriever, memory=memory
)

print("✅ LLM and RAG chain initialized for 'data/' corpus.")

# Quick sanity query
try:
    query = "Summarize the XYZ Hospital dataset: key categories and example contents."
    result = conversation_chain.invoke({"question": query})
    print("🤖", result.get("answer", "No answer field returned."))
except Exception as e:
    print(f"⚠️ Test query failed: {e}")






✅ LLM and RAG chain initialized for 'data/' corpus.
🤖 The XYZ Hospital dataset provides comprehensive information about a fictional multi-specialty healthcare facility and is structured in markdown format for AI applications. The key categories of the dataset include:

1. **Patients Directory**: Contains detailed profiles of patients with information such as personal details, medical history, current conditions, medications, treatment plans, insurance information, and emergency contacts.

2. **Medical Records**: Clinical records organized by department that include diagnoses, treatments, and progress notes.

3. **Staff**: Information about hospital personnel, including doctors, nurses, and administrative staff.

4. **Departments**: Details about various hospital departments and their specialties.

5. **Appointments**: Scheduling data for patient appointments and procedures.

6. **Billing**: Financial records, including insurance claims and payment information.

7. **Inventory**: Data o
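
The retriever used by the chain returns the stored chunks whose embeddings are closest to the embedded question. The sketch below illustrates that idea in plain Python on toy 2-D vectors. It is not Chroma's implementation: cosine similarity is an illustrative choice that may differ from the metric the collection is configured with, `k=4` is assumed to match the retriever's default, and `cosine_similarity`, `top_k`, and `toy_docs` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings"; the real ones in this notebook have 384 dimensions.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], toy_docs, k=2))
```

To see which chunks the real retriever selects for a question, `retriever.invoke("...")` returns the matching documents, and their `metadata["doc_type"]` shows which folders the answer drew on.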

In [11]:
# Gradio chat interface
def chat(message, history):
    result = conversation_chain.invoke({"question": message})
    return result.get("answer", "")

gr.ChatInterface(
    fn=chat,
    type="messages",  # openai-style role/content messages; the 'tuples' format is deprecated
    title="🏥 XYZ Hospital Dataset Chatbot (OpenRouter + Chroma)",
    theme="soft"
).launch()
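
The `chat` function above lets any exception from the chain propagate to the UI. One possible hardening, sketched under assumptions, is to wrap the chain call so that failures turn into a readable reply. `make_chat_fn`, `safe_chat`, and `FakeChain` are hypothetical names; `FakeChain` only imitates the `invoke({"question": ...})` call shape used in this notebook so the sketch runs without an API key.

```python
def make_chat_fn(chain, fallback="Sorry, something went wrong while answering."):
    """Build a Gradio-style chat callback that never raises."""
    def safe_chat(message, history):
        try:
            result = chain.invoke({"question": message})
        except Exception as exc:
            # Surface the error text instead of crashing the UI.
            return f"{fallback} ({exc})"
        return result.get("answer", "") or fallback
    return safe_chat


class FakeChain:
    """Hypothetical stand-in for the retrieval chain, for local testing only."""
    def invoke(self, inputs):
        if not inputs["question"].strip():
            raise ValueError("empty question")
        return {"answer": f"echo: {inputs['question']}"}


safe_chat = make_chat_fn(FakeChain())
print(safe_chat("Which departments exist?", []))
print(safe_chat("   ", []))
```

In the notebook, `make_chat_fn(conversation_chain)` could be passed as `fn` to `gr.ChatInterface` in place of `chat`.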





