<a href="https://colab.research.google.com/github/thisisraj57/Kanishka-Raj-Case-study-Gen-AI_ProcDNA/blob/main/Kanishka_Raj_Case_study_Gen_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Case Study: GenAI Developer Hiring - Generative AI for Medical Research Processing**


**Introduction:**

ProcDNA is at the forefront of applying Generative AI to revolutionize medical research analysis. They are planning to build a system that can automatically process and synthesize insights from medical research papers.

**Ask:**

- Entity Recognition and Linking: Identify and link endpoints (drugs, genes, proteins, efficacy metrics, safety metrics, adverse events, etc.) to relevant knowledge bases using generative techniques.

- Extractive & Abstractive Summarization: Generate summaries that incorporate both extractive methods (key sentence selection) and abstractive methods (paraphrasing and novel sentence generation) to capture the essence of the research.

- Provide a production architecture which highlights the aspects listed below:


1. Data Preprocessing Pipeline
2. Generative AI Processing Engine
3. Knowledge Base Integration
4. Result Storage and Retrieval
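
The entity linking step in the Ask can be illustrated with a minimal dictionary-based sketch. It is not the method used later in this notebook: the knowledge-base records, the `KB:` identifiers, and the synonym table are hypothetical placeholders, and a production system would query a real biomedical knowledge base instead.

```python
import re
from typing import Dict, Optional

# Hypothetical knowledge-base entries; the identifiers are placeholders,
# not real database IDs.
KNOWLEDGE_BASE: Dict[str, Dict[str, str]] = {
    "pembrolizumab": {"type": "drug", "kb_id": "KB:DRUG-0001"},
    "egfr": {"type": "gene", "kb_id": "KB:GENE-0001"},
    "overall survival": {"type": "efficacy_metric", "kb_id": "KB:METRIC-0001"},
}

# Hypothetical synonym table mapping surface forms to canonical names.
SYNONYMS: Dict[str, str] = {"os": "overall survival"}


def normalize(mention: str) -> str:
    """Lowercase the mention and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", mention.strip().lower())


def link_entity(mention: str) -> Optional[Dict[str, str]]:
    """Return the knowledge-base record for a mention, or None if it cannot be linked."""
    key = normalize(mention)
    key = SYNONYMS.get(key, key)  # map known synonyms to their canonical name
    record = KNOWLEDGE_BASE.get(key)
    if record is None:
        return None
    return {"mention": mention, "canonical": key, **record}


print(link_entity("  EGFR "))
print(link_entity("OS"))
print(link_entity("unknown compound"))
```

Exact-match lookup after normalization is the simplest possible linker; mentions it cannot resolve return `None`, which makes unlinked entities easy to count when measuring linking accuracy.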


**Evaluation Criteria:**

- Entity Linking Accuracy: Evaluate how well the system links extracted entities to relevant knowledge bases.

- Summary Coherence and Informativeness: Assess the summaries for clarity, factuality, and their ability to capture the key findings of the research.

- Safety guardrails applied for data protection and privacy: Suggest/Implement data protection guardrails with respect to the input and output of the LLM.
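
One possible way to approximate the informativeness criterion is unigram overlap between a summary and its source text, in the spirit of ROUGE-1. The sketch below is an assumption about how this could be measured, not the metric used in the app later in this notebook; the function names are illustrative, and word overlap says nothing about factuality or coherence.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(summary: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall and F1 between a summary and a reference."""
    summary_counts = Counter(tokenize(summary))
    reference_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((summary_counts & reference_counts).values())
    precision = overlap / max(sum(summary_counts.values()), 1)
    recall = overlap / max(sum(reference_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(
    "the drug improved survival",
    "the drug improved overall survival",
)
print(scores)
```

Recall rewards summaries that cover the reference, while precision penalizes words that do not appear in it, so reporting both gives a more balanced picture than either alone.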


**Articles for train and test:**

Attached as PDFs with the case study email.



**Submission criteria:**

- Utilize any open-source LLM of your choice and create a free AWS/Azure account where you can train and test your model.

- Submit the code and results for evaluation in an attachment.

- Create a small PPT presentation outlining your approach and architecture diagram, and highlight the key challenges faced and how you overcame them.

- Include a slide detailing the cloud infrastructure and the necessary services required for implementation, along with the associated infrastructure costs.

- Additionally, provide a slide outlining the go-to-market strategy for these types of tools.



In [1]:
# Necessary libraries to be installed
!pip install streamlit
!pip install google-generativeai
!pip install langchain
!pip install PyPDF2
!pip install faiss-cpu
!pip install langchain_google_genai
!pip install -U langchain-community
!pip install pypdf



In [42]:
import getpass
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS  # Import FAISS vectorstore

# Prompt for the key instead of hard-coding it; an empty key makes the embedding calls fail
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key: ")

# Load PDF document
loader = PyPDFLoader("/content/drive/MyDrive/Kanishka Raj Case study Gen AI/Case_study_data_genAI_developer.pdf")
data = loader.load()

# Split the text into manageable chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
documents = text_splitter.split_documents(data)

# Initialize the embedding
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Create FAISS vectorstore from the documents and embeddings
docsearch = FAISS.from_documents(documents, embeddings)

# Save the FAISS index to disk
faiss_index_path = "/content/drive/MyDrive/Kanishka Raj Case study Gen AI/faissdb"
docsearch.save_local(faiss_index_path)

print(f"FAISS index saved to: {faiss_index_path}")

FAISS index saved to: /content/drive/MyDrive/Kanishka Raj Case study Gen AI/faissdb
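
Conceptually, the saved index is queried later by embedding the question and returning the chunks whose vectors are closest to it. The sketch below illustrates that idea in plain Python with cosine similarity and toy three-dimensional vectors; the vectors and function names are hypothetical, and the actual FAISS index may use a different distance measure (for example Euclidean distance) and far higher-dimensional embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, for illustration only).
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(top_k([1.0, 0.0, 0.0], chunks, k=2))
```

A brute-force scan like this is linear in the number of chunks, which is fine for a single paper; dedicated vector indexes exist to keep the same lookup fast on much larger collections.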


In [43]:
%%writefile app.py
import os
import streamlit as st
from langchain_community.vectorstores import FAISS
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
import re
from typing import List, Dict
from nltk.tokenize import sent_tokenize
from nltk.translate.bleu_score import sentence_bleu
import nltk
import plotly.graph_objects as go
from datetime import datetime

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # needed by sent_tokenize in newer NLTK releases

# Streamlit setup
st.set_page_config(page_title="Medical Research GenAI Processor", page_icon="🧬", layout="wide")

# Custom CSS for improved visual appeal
st.markdown("""
    <style>
    .stApp {
        background-color: #f0f8ff;
    }
    .main .block-container {
        padding-top: 2rem;
        padding-bottom: 2rem;
    }
    h1, h2, h3 {
        color: #2c3e50;
    }
    .stButton>button {
        background-color: #3498db;
        color: white;
        border-radius: 5px;
        border: none;
        padding: 0.5rem 1rem;
        transition: all 0.3s ease;
    }
    .stButton>button:hover {
        background-color: #2980b9;
    }
    .stTextInput>div>div>input {
        border-radius: 5px;
    }
    .sidebar .sidebar-content {
        background-color: #2c3e50;
        color: white;
    }
    </style>
    """, unsafe_allow_html=True)

# Header
st.markdown("<h1 style='text-align: center;'>🧬 Medical Research GenAI Processor 🧬</h1>", unsafe_allow_html=True)

# Sidebar setup
with st.sidebar:
    st.image("https://cdn.pixabay.com/photo/2024/05/05/16/30/ai-generated-8741448_1280.jpg", width=200)
    st.markdown(
        """
        <div style="background-color: #3498db; padding: 10px; border-radius: 5px; text-align: center;">
            <h3 style="color: white;">🌟 Empowering medical research through AI 🌟</h3>
        </div>
        """,
        unsafe_allow_html=True
    )
    st.markdown("### Status: 🟢 Online")

    st.markdown("---")

    st.markdown("### About")
    st.info("This AI-powered application analyzes and summarizes complex medical research papers, extracting critical entities and providing enriched context.")

    st.markdown("### 👩‍💼 Developer")
    st.write("Developed by: Kanishka Raj")
    st.write("Contact: +91-9612223176")

    st.markdown("---")

    # Theme selector
    theme = st.selectbox("Choose Theme", ["Light", "Dark"])
    if theme == "Dark":
        st.markdown("""
            <style>
            .stApp {
                background-color: #1e1e1e;
                color: #ffffff;
            }
            .main .block-container {
                background-color: #2d2d2d;
            }
            h1, h2, h3 {
                color: #ffffff;
            }
            </style>
            """, unsafe_allow_html=True)

# Function to set the API key securely
def set_api_key():
    api_key = st.text_input("Enter your Gemini API Key:", type="password")
    if st.button("Submit API Key"):
        if api_key:
            st.session_state.api_key = api_key
            os.environ["GOOGLE_API_KEY"] = api_key
            st.success("API Key set successfully!")
            st.rerun()
        else:
            st.error("Please enter a valid API Key.")

# Check if API key is set
if "api_key" not in st.session_state:
    st.warning("Please set your Gemini API Key to proceed.")
    set_api_key()
else:
    # Initialize session state
    if "query_history" not in st.session_state:
        st.session_state.query_history = []
    if "response_history" not in st.session_state:
        st.session_state.response_history = []
    if "response" not in st.session_state:
        st.session_state.response = None

    # Path to the pre-generated FAISS vector store
    faiss_index_path = "/content/drive/MyDrive/Kanishka Raj Case study Gen AI/faissdb"  # Update this path

    # Initialize embeddings and language model
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0.5, max_tokens=3000)

    # Entity recognition function
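    def extract_entities(text: str) -> Dict[str, List[str]]:
        # Name patterns (drugs, genes, proteins) depend on capitalisation, so they are matched
        # case-sensitively; with re.IGNORECASE the gene pattern would match almost every word.
        # Phrase patterns (metrics) are matched case-insensitively.
        entity_patterns = {
            "drugs": (r'\b[A-Z][a-z]+(?:mab|nib|zumab|ximab|lumab|stim|stat|olol|ide|ine)\b', 0),
            "genes": (r'\b[A-Z]{2,}[0-9]*\b', 0),
            "proteins": (r'\b[A-Z][a-z]{2,}(?:in|ase)\b', 0),
            "efficacy_metrics": (r'\b(?:overall survival|progression-free survival|response rate|remission rate)\b', re.IGNORECASE),
            "safety_metrics": (r'\b(?:adverse events|toxicity|side effects)\b', re.IGNORECASE),
        }

        entities = {category: re.findall(pattern, text, flags) for category, (pattern, flags) in entity_patterns.items()}
        return {k: sorted(set(v)) for k, v in entities.items() if v}  # Remove duplicates, keep a stable order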

    # Entity Linking Accuracy Evaluation
    def evaluate_entity_linking(extracted_entities: Dict[str, List[str]], knowledge_base: Dict[str, List[str]]) -> float:
        total_entities = sum(len(entities) for entities in extracted_entities.values())
        correctly_linked = 0

        for category, entities in extracted_entities.items():
            if category in knowledge_base:
                correctly_linked += len(set(entities) & set(knowledge_base[category]))

        return correctly_linked / total_entities if total_entities > 0 else 0

    # Summary Coherence and Informativeness Evaluation
    def evaluate_summary(summary: str, reference_text: str) -> Dict[str, float]:
        summary_sentences = sent_tokenize(summary)
        if not summary_sentences:
            return {"coherence_score": 0.0, "readability_score": 0.0}

        # BLEU compares token sequences, so score word tokens rather than whole sentences
        bleu_score = sentence_bleu([reference_text.split()], summary.split())

        # Simple readability score (average words per sentence)
        avg_words_per_sentence = sum(len(s.split()) for s in summary_sentences) / len(summary_sentences)

        return {
            "coherence_score": bleu_score,
            # 1.0 for sentences of up to 10 words on average, decreasing for longer sentences
            "readability_score": min(1.0, 10 / max(avg_words_per_sentence, 1))
        }

    # Data Protection and Privacy Guardrails
    def apply_privacy_guardrails(text: str) -> str:
        # Remove potential patient identifiers (simplified example)
        text = re.sub(r'\b(?:patient|subject)\s+\d+\b', '[REDACTED]', text, flags=re.IGNORECASE)

        # Remove dates (simplified)
        text = re.sub(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b', '[DATE REDACTED]', text)

        return text

    # System prompt for the GenAI model
    system_prompt = """
    You are an AI specialized in processing and answering questions based on medical research papers.
    You should:
    1. Identify and extract relevant entities like drugs, genes, proteins, efficacy metrics, and safety metrics.
    2. Generate concise, clear, and accurate summaries that incorporate key sentence selection, paraphrasing, and novel sentence generation to capture the essence of the research.
    3. Focus on key insights related to efficacy, safety, and significant conclusions of the research.
    4. Ensure all information is based solely on the provided context, without introducing external data.
    5. Respect data privacy by not mentioning any specific patient names or identifiers.

    Context:
    {context}

    Extracted Entities:
    {entities}

    Query: {input}

    Please provide a comprehensive answer based on the above information.
    """

    prompt_template = ChatPromptTemplate.from_messages([("system", system_prompt), ("human", "{input}")])

    # Load FAISS vector store with medical research embeddings
    @st.cache_resource
    def load_docsearch():
        return FAISS.load_local(faiss_index_path, embeddings, allow_dangerous_deserialization=True)

    with st.spinner("Loading document store..."):
        docsearch = load_docsearch()
        st.success("Document store loaded successfully!")

    # Query form for the user
    st.subheader("Ask your Query")
    with st.form(key="query_form"):
        query = st.text_area("Type your question here...", height=100)
        col1, col2, col3 = st.columns([1,1,1])
        with col1:
            submit_button = st.form_submit_button(label="Submit Query")
        with col2:
            clear_button = st.form_submit_button(label="Clear Input")
        with col3:
            delete_button = st.form_submit_button(label="Delete Last Query")

    # Query processing
    if submit_button and query:
        with st.spinner("Analyzing documents and generating response..."):
            retriever = docsearch.as_retriever(search_kwargs={"k": 5})
            relevant_docs = retriever.invoke(query)

            all_text = " ".join([doc.page_content for doc in relevant_docs])
            entities = extract_entities(all_text)

            question_answer_chain = create_stuff_documents_chain(llm, prompt_template)
            rag_chain = create_retrieval_chain(retriever, question_answer_chain)
            # The retrieval chain fills {context} with the retrieved documents itself,
            # so only the query and the extracted entities are passed in.
            response = rag_chain.invoke({"input": query, "entities": entities})

            safe_response = apply_privacy_guardrails(response["answer"])

            st.session_state.query_history.append(query)
            st.session_state.response_history.append(safe_response)
            st.session_state.response = safe_response

            st.markdown("### Answer:")
            st.write(safe_response)

            # Display extracted entities
            st.markdown("### Extracted Entities:")
            for entity_type, entity_list in entities.items():
                st.write(f"**{entity_type.capitalize()}:** {', '.join(entity_list)}")

            # Evaluation
            st.markdown("### Evaluation Metrics:")

            mock_knowledge_base = {
                "drugs": ["Aspirin", "Ibuprofen", "Paracetamol"],
                "genes": ["BRCA1", "TP53", "EGFR"],
                "proteins": ["Insulin", "Hemoglobin", "Collagen"],
            }
            entity_linking_accuracy = evaluate_entity_linking(entities, mock_knowledge_base)
            summary_metrics = evaluate_summary(safe_response, all_text)

            col1, col2, col3 = st.columns(3)
            with col1:
                st.metric("Entity Linking Accuracy", f"{entity_linking_accuracy:.2f}")
            with col2:
                st.metric("Summary Coherence", f"{summary_metrics['coherence_score']:.2f}")
            with col3:
                st.metric("Readability Score", f"{summary_metrics['readability_score']:.2f}")

            # Visualization of entities
            entity_counts = {k: len(v) for k, v in entities.items()}
            fig = go.Figure(data=[go.Bar(x=list(entity_counts.keys()), y=list(entity_counts.values()))])
            fig.update_layout(title="Extracted Entities Count", xaxis_title="Entity Type", yaxis_title="Count")
            st.plotly_chart(fig)

    # Clear input
    if clear_button:
        query = ""
        st.success("Input cleared!")

    # Delete last response
    if delete_button and st.session_state.response_history:
        st.session_state.query_history.pop(-1)
        st.session_state.response_history.pop(-1)
        st.session_state.response = None
        st.success("Previous response deleted!")

    # Display retrieval history
    with st.expander("Query and Answer History"):
        if st.session_state.query_history:
            for i, (q, r) in enumerate(zip(st.session_state.query_history, st.session_state.response_history)):
                st.write(f"**Query {i + 1}:** {q}")
                st.write(f"**Answer {i + 1}:** {r}")
                st.markdown("---")
        else:
            st.write("No history available.")

    # Export functionality
    if st.button("Export Session Data"):
        export_data = "\n".join([f"Q: {q}\nA: {r}\n" for q, r in zip(st.session_state.query_history, st.session_state.response_history)])
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"medical_research_session_{timestamp}.txt"
        st.download_button(
            label="Download Session Data",
            data=export_data,
            file_name=filename,
            mime="text/plain"
        )

    # Footer
    st.markdown("---")
    st.markdown("Made with ❤️ by Kanishka Raj | © 2024 Medical Research GenAI Processor")

Overwriting app.py
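
The app above applies its privacy guardrail to the model output only, while the evaluation criteria ask for guardrails on both input and output. A minimal sketch of an input-side filter follows; the function name `redact_query`, the placeholder strings, and the regular expressions are assumptions for illustration, and pattern-based redaction of this kind will miss some identifiers and can over-match long numeric strings.

```python
import re
from typing import List, Tuple

# (placeholder, pattern) pairs applied in order. The patient/subject rule runs
# before the phone rule so that long numeric IDs are labelled as IDs.
REDACTION_RULES: List[Tuple[str, re.Pattern]] = [
    ("[EMAIL REDACTED]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[ID REDACTED]", re.compile(r"\b(?:patient|subject)\s+\d+\b", re.IGNORECASE)),
    ("[PHONE REDACTED]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact_query(query: str) -> str:
    """Replace likely personal identifiers in a user query before it reaches the LLM."""
    for placeholder, pattern in REDACTION_RULES:
        query = pattern.sub(placeholder, query)
    return query


print(redact_query("Email jane.doe@example.com about patient 12"))
print(redact_query("Call +1 555 123 4567 now"))
```

In the app, a function like this could be applied to `query` before retrieval, so that identifiers typed by the user are neither embedded nor sent to the model.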


In [44]:
!npm install localtunnel
import sys
import os
import urllib.request


py_file_location = "/content/app.py"
sys.path.append(os.path.abspath(py_file_location))

print("Password/Endpoint IP for localtunnel is:",urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))

!streamlit run app.py &>/content/logs.txt &
!npx localtunnel --port 8501

up to date, audited 23 packages in 485ms

3 packages are looking for funding
  run `npm fund` for details

2 moderate severity vulnerabilities

To address all issues (including breaking changes), run:
  npm audit fix --force

Run `npm audit` for details.
Password/Endpoint IP for localtunnel is: 34.86.1.242
your url is: https://young-ants-fold.loca.lt
^C
