# RAG: The First Part

## What is RAG? 🧐

RAG, or Retrieval-Augmented Generation, is a technique that combines the strengths of search engines and language models. It first retrieves relevant information from a large collection of documents, then uses that information to generate more accurate and informed responses.

![RAG illustration: Retrieval and Generation workflow from langchain docs](https://python.langchain.com/assets/images/rag_retrieval_generation-1046a4668d6bb08786ef73c56d4f228a.png)

*In the illustration above, RAG retrieves supporting documents before generating an answer, ensuring responses are grounded in real data.*
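
The retrieve-then-generate flow can be sketched in a few lines. This is only a toy illustration: the `DOCUMENTS` dictionary and the helper names `retrieve` and `build_prompt` are assumptions made up for this example, and the "generation" step is reduced to building the prompt that a language model would receive.

```python
# Toy knowledge base; both entries are made up for illustration only.
DOCUMENTS = {
    "flu": "Flu symptoms include fever, cough and fatigue.",
    "migraine": "Migraine treatment includes rest in a dark room.",
}


def retrieve(question, documents, top_k=1):
    """Rank documents by how many question words appear in their text."""
    words = set(question.lower().replace("?", "").split())
    scored = []
    for name, text in documents.items():
        overlap = sum(1 for word in words if word in text.lower())
        scored.append((overlap, name))
    scored.sort(reverse=True)
    # Keep only documents that share at least one word with the question
    return [documents[name] for overlap, name in scored[:top_k] if overlap > 0]


def build_prompt(question, retrieved):
    """Augment the question with the retrieved text before generation."""
    context_block = "\n".join(retrieved)
    return f"Context:\n{context_block}\n\nQuestion: {question}"


question = "What are flu symptoms?"
prompt = build_prompt(question, retrieve(question, DOCUMENTS))
print(prompt)
```

The model never sees the whole collection, only the few documents that the retrieval step selected for this particular question.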

## Let's Create a Sample RAG System

Before starting the implementation, we need data to feed the application. Under `datasets/` you will find one folder per disease, each containing markdown files (`symptoms.md` and `treatment.md`) with the information the assistant will draw on.
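
To make the expected folder layout concrete, here is a small sketch that builds a throwaway dataset in a temporary directory and lists its files. The helper names `create_sample_dataset` and `describe_layout` and the placeholder file contents are assumptions for illustration; the real files ship with the project.

```python
import tempfile
from pathlib import Path


def create_sample_dataset(root, diseases):
    """Create <root>/datasets/<disease>/{symptoms,treatment}.md with placeholder text."""
    dataset_dir = Path(root) / "datasets"
    for disease in diseases:
        folder = dataset_dir / disease
        folder.mkdir(parents=True, exist_ok=True)
        for kind in ("symptoms", "treatment"):
            (folder / f"{kind}.md").write_text(
                f"# {disease} {kind}\n\n(placeholder content)\n", encoding="utf-8"
            )
    return dataset_dir


def describe_layout(dataset_dir):
    """Return the relative path of every markdown file, sorted."""
    return sorted(
        path.relative_to(dataset_dir).as_posix()
        for path in dataset_dir.glob("*/*.md")
    )


with tempfile.TemporaryDirectory() as tmp:
    layout = describe_layout(create_sample_dataset(tmp, ["flu", "migraine"]))

print(layout)
```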

## Required Imports

In [1]:
import os
import glob
from dotenv import load_dotenv
from openai import OpenAI
import gradio as gr

## Data Loading

In [14]:
context = {}
MODEL = "gpt-4o"

sickness_folders = glob.glob("datasets/*/")

for folder_path in sickness_folders:
    sickness_name = os.path.basename(folder_path.rstrip('/'))
    
    symptom_file = os.path.join(folder_path, "symptoms.md")
    treatment_file = os.path.join(folder_path, "treatment.md")
    
    sickness_data = {
        "name": sickness_name,
        "symptoms": "",
        "treatment": ""
    }
    
    if os.path.exists(symptom_file):
        try:
            with open(symptom_file, "r", encoding="utf-8") as f:
                sickness_data["symptoms"] = f.read()
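        except Exception as e:
            # Report unreadable files instead of silently skipping them
            print(f"Warning: could not read {symptom_file}: {e}")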
    
    if os.path.exists(treatment_file):
        try:
            with open(treatment_file, "r", encoding="utf-8") as f:
                sickness_data["treatment"] = f.read()
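        except Exception as e:
            # Report unreadable files instead of silently skipping them
            print(f"Warning: could not read {treatment_file}: {e}")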
    
    if sickness_data["symptoms"] or sickness_data["treatment"]:
        full_document = f"DISEASE: {sickness_name.upper()}\n\n"
        
        if sickness_data["symptoms"]:
            full_document += "SYMPTOMS:\n" + sickness_data["symptoms"] + "\n\n"
        
        if sickness_data["treatment"]:
            full_document += "TREATMENT:\n" + sickness_data["treatment"] + "\n\n"
        
        context[sickness_name] = full_document
        context[f"{sickness_name}_symptoms"] = sickness_data["symptoms"]
        context[f"{sickness_name}_treatment"] = sickness_data["treatment"]
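
Each disease folder ends up as three entries in `context`: one combined record and one entry per section. The sketch below mirrors that part of the loop on placeholder strings; `build_entries` is a hypothetical helper introduced only to show the resulting keys.

```python
def build_entries(name, symptoms, treatment):
    """Mirror the loading loop: one combined record plus two per-section entries."""
    full_document = f"DISEASE: {name.upper()}\n\n"
    if symptoms:
        full_document += "SYMPTOMS:\n" + symptoms + "\n\n"
    if treatment:
        full_document += "TREATMENT:\n" + treatment + "\n\n"
    return {
        name: full_document,
        f"{name}_symptoms": symptoms,
        f"{name}_treatment": treatment,
    }


# Placeholder strings stand in for the contents of the markdown files
entries = build_entries("flu", "placeholder symptoms text", "placeholder treatment text")
for key, value in entries.items():
    print(f"{key!r}: {value!r}")
```

Because the combined record repeats the text of both sections, the same passage can be retrieved twice for one question, once on its own and once inside the full record.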

## System Message

In [None]:
system_message = """You are a medical expert assistant that answers questions about diseases, their symptoms and treatments.
You only use information provided in the medical documents from your knowledge base.

Important instructions:
- Provide precise and well-structured answers
- Clearly distinguish between symptoms and treatments
- If you cannot find information in the provided context, state it clearly
- Never make medical diagnoses - always recommend consulting a healthcare professional
- Never invent medical information that is not in the provided documents
- Organize your responses clearly with sections for symptoms and treatments when appropriate

WARNING: The information provided is for informational purposes only and does not replace professional medical advice."""

## Context Search Function

In [15]:
def get_relevant_context(message):
    relevant_context = []
    message_lower = message.lower()
    
    symptom_keywords = ["symptom", "symptoms", "sign", "signs", "manifestation", "present"]
    treatment_keywords = ["treatment", "treatments", "treat", "cure", "medicine", "remedy", "therapy"]
    
    is_symptom_query = any(keyword in message_lower for keyword in symptom_keywords)
    is_treatment_query = any(keyword in message_lower for keyword in treatment_keywords)
    
    for document_name, document_content in context.items():
        score = 0
        
        disease_name = document_name.replace("_symptoms", "").replace("_treatment", "")
        if disease_name.lower() in message_lower:
            score += 10
        
        message_words = [word for word in message_lower.split() if len(word) > 3]
        document_content_lower = document_content.lower()
        
        matching_words = sum(1 for word in message_words if word in document_content_lower)
        if len(message_words) > 0:
            word_match_ratio = matching_words / len(message_words)
            score += word_match_ratio * 5
        
        if is_symptom_query and document_name.endswith("_symptoms"):
            score += 3
        elif is_treatment_query and document_name.endswith("_treatment"):
            score += 3
        elif not document_name.endswith("_symptoms") and not document_name.endswith("_treatment"):
            score += 1
        
        if score >= 2:
            relevant_context.append({
                "content": document_content,
                "name": document_name,
                "score": score
            })
    
    relevant_context.sort(key=lambda x: x["score"], reverse=True)
    top_documents = relevant_context[:5]
    
    return [doc["content"] for doc in top_documents]

## Context Addition Function

In [19]:
def add_context(message):
    relevant_context = get_relevant_context(message)
    
    if relevant_context:
        message += "\n\n=== RELEVANT MEDICAL INFORMATION ===\n"
        message += "The following information from your medical knowledge base may be helpful:\n\n"
        
        for i, context_doc in enumerate(relevant_context, 1):
            if "SYMPTOMS:" in context_doc and "TREATMENT:" in context_doc:
                doc_type = "COMPLETE RECORD"
            elif "SYMPTOMS:" in context_doc:
                doc_type = "SYMPTOMS"
            elif "TREATMENT:" in context_doc:
                doc_type = "TREATMENT"
            else:
                doc_type = "DOCUMENT"
            
            message += f"{doc_type} {i}:\n"
            message += f"{context_doc}\n"
            message += "-" * 60 + "\n\n"
        
        message += "=== END OF MEDICAL INFORMATION ===\n\n"
        message += "Response instructions:\n"
        message += "- Base your response only on the information above\n"
        message += "- Structure your response in clear sections (symptoms, treatments, etc.)\n"
        message += "- Always include the medical consultation warning\n"
        message += "- If the requested information is not available, state it clearly\n"
    
    return message
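
The label in front of each document depends only on which markers appear in its text. The sketch below isolates that rule; `label_document` is a hypothetical helper and the two sample strings are placeholders.

```python
def label_document(context_doc):
    """Pick the heading used for a document, based on the markers it contains."""
    if "SYMPTOMS:" in context_doc and "TREATMENT:" in context_doc:
        return "COMPLETE RECORD"
    if "SYMPTOMS:" in context_doc:
        return "SYMPTOMS"
    if "TREATMENT:" in context_doc:
        return "TREATMENT"
    return "DOCUMENT"


# A combined record built by the loading loop carries both markers
combined = "DISEASE: FLU\n\nSYMPTOMS:\nplaceholder\n\nTREATMENT:\nplaceholder\n\n"
# A per-section entry is the raw file text, with no marker added by the loader
raw_section = "placeholder text from symptoms.md"

print(label_document(combined))     # COMPLETE RECORD
print(label_document(raw_section))  # DOCUMENT
```

Per-section entries are therefore labelled `DOCUMENT` unless the markdown file itself happens to contain `SYMPTOMS:` or `TREATMENT:`.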

## OpenAI API Configuration

In [20]:
load_dotenv(override=True)

api_key = os.getenv("OPENAI_API_KEY")
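if api_key:
    client = OpenAI(api_key=api_key)
else:
    # Later cells check whether `client` exists, so only warn here
    print("OPENAI_API_KEY not found: the test and Gradio cells will be skipped")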
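
`load_dotenv` reads variables from a `.env` file, so the key is expected there as an `OPENAI_API_KEY=...` line. If you prefer the notebook to stop with a clear message when the key is missing, a check along these lines is one option. `require_env` is a hypothetical helper, not part of `python-dotenv` or the OpenAI client.

```python
import os


def require_env(name, environ=os.environ):
    """Return the variable's value, or raise a clear error if it is missing or empty."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Demonstration with an explicit dictionary instead of the real environment
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-placeholder"}))
```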

## Main Chat Function

In [21]:
def chat(message, history):
    try:
        messages = [{"role": "system", "content": system_message}]
        
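        for msg in history:
            # Forward only role and content; extra keys Gradio may attach are dropped
            messages.append({"role": msg["role"], "content": msg["content"]})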
        
        enriched_message = add_context(message)
        messages.append({"role": "user", "content": enriched_message})
        
        stream = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            stream=True,
            temperature=0.1
        )
        
        response = ""
        for chunk in stream:
            # Some chunks can arrive with no choices or an empty delta
            if chunk.choices and chunk.choices[0].delta.content:
                response += chunk.choices[0].delta.content
                yield response
                
    except Exception as e:
        yield f"Error generating response: {str(e)}"

def simple_chat_test(message):
    try:
        enriched_message = add_context(message)
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": enriched_message}
        ]
        
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=0.1
        )
        
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {str(e)}"
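
The streaming loop in `chat` yields the accumulated text rather than each new fragment, because Gradio's `ChatInterface` displays the latest value yielded by a generator function. The pattern can be tried without an API call: in this sketch, `fake_chunk` and `accumulate` are hypothetical helpers, and the fake objects only imitate the `chunk.choices[0].delta.content` shape used above.

```python
from types import SimpleNamespace


def fake_chunk(text):
    """Build an object shaped like a streamed chunk: chunk.choices[0].delta.content."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate(stream):
    """Yield the response so far each time a chunk adds new text."""
    response = ""
    for chunk in stream:
        # Skip chunks that carry no text
        if chunk.choices and chunk.choices[0].delta.content:
            response += chunk.choices[0].delta.content
            yield response


stream = [fake_chunk("Flu "), fake_chunk(None), fake_chunk("symptoms"), fake_chunk("...")]
partials = list(accumulate(stream))
print(partials)  # ['Flu ', 'Flu symptoms', 'Flu symptoms...']
```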

## RAG System Testing

In [22]:
if 'client' in globals():
    test_questions = [
        "What diseases are available in the database?",
        "What are the symptoms of flu?",
        "How to treat migraine?",
        "What are the signs of diabetes?",
        "What is the treatment for hypertension?",
        "Symptoms and treatment of asthma?"
    ]
    
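    for i, question in enumerate(test_questions, 1):
        # Show how many documents were retrieved before calling the model
        relevant_context = get_relevant_context(question)
        print(f"[{i}] {question} ({len(relevant_context)} documents retrieved)")

        # simple_chat_test handles API errors itself and returns an error string
        response = simple_chat_test(question)
        print(response)
        print("-" * 60)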

## Gradio Chat Interface

In [None]:
if 'client' in globals() and client:
    diseases = [name for name in context.keys() if not name.endswith('_symptoms') and not name.endswith('_treatment')]
    
    chat_interface = gr.ChatInterface(
        fn=chat,
        type="messages",
        title="Medical Assistant RAG",
        description=f"""
        Ask questions about diseases, symptoms and treatments!
        
        Knowledge base: {len(diseases)} documented diseases
        Total documents: {len(context)} (including symptoms and treatments)
        
        WARNING: The information provided is for informational purposes only. 
        Always consult a healthcare professional for proper diagnosis and treatment.
        """,
        examples=[
            "What diseases are available in your database?",
            "What are the symptoms of flu?",
            "How to treat migraine?",
            "Symptoms and treatment of diabetes?",
            "What is hypertension?",
            "How to recognize an asthma attack?",
        ],
    )
    
    chat_interface.launch(
        share=False,
        debug=True,
        show_error=True,
        server_name="127.0.0.1",
        server_port=7860
    )

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


## Conclusion and Possible Improvements

### What we accomplished:
- Automatic loading of the `symptoms.md` and `treatment.md` files from every folder under `datasets`
- Context search system based on keywords  
- Integration with OpenAI API for response generation
- Interactive chat interface with Gradio
- Response streaming for better user experience

### Possible improvements:
1. **Semantic search**: Use embeddings (like OpenAI Embeddings) for more precise search
2. **Intelligent chunking**: Divide long documents into smaller chunks
3. **Vector database**: Use Chroma, Pinecone, or FAISS for better search performance
4. **Preprocessing**: Clean and structure data before indexing
5. **Metrics**: Add metrics to evaluate result relevance
6. **Cache**: Cache frequent results to improve performance
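
As a rough idea of what the first three improvements involve, here is a minimal sketch of character-based chunking and cosine-similarity ranking. The function names are assumptions, and the three-dimensional vectors are written by hand to stand in for real embeddings, which would come from an embedding model and usually live in a vector database.

```python
import math


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query_vector, chunk_vectors, top_k=3):
    """Return chunk ids ordered from most to least similar to the query."""
    ranked = sorted(
        chunk_vectors,
        key=lambda chunk_id: cosine_similarity(query_vector, chunk_vectors[chunk_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Hand-written vectors stand in for real embeddings
toy_vectors = {
    "flu_symptoms#0": [0.9, 0.1, 0.0],
    "migraine_treatment#0": [0.0, 0.2, 0.9],
}
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
print(rank_by_similarity([1.0, 0.0, 0.0], toy_vectors, top_k=1))
```

Unlike keyword matching, similarity between embeddings does not require the question and the document to share exact words.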

### Next steps:
- Test with different types of questions
- Optimize the context search function
- Add more documents to the knowledge base
- Implement advanced semantic search
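
Testing and optimizing the search is easier with a number to track. One simple option is the hit rate at k: the share of questions for which the expected document appears among the top k results. The sketch below assumes a retriever that returns document names (`get_relevant_context` returns document contents, so it would need a small adaptation); `hit_rate_at_k`, `toy_retrieve` and the labelled questions are all made up for illustration.

```python
def hit_rate_at_k(retrieve, labelled_queries, k=5):
    """Fraction of queries whose expected document name is in the top-k results."""
    if not labelled_queries:
        return 0.0
    hits = sum(
        1 for query, expected in labelled_queries if expected in retrieve(query)[:k]
    )
    return hits / len(labelled_queries)


def toy_retrieve(query):
    """Stand-in retriever: return the names whose disease appears in the query."""
    names = ["flu_symptoms", "flu_treatment", "migraine_symptoms", "migraine_treatment"]
    query_lower = query.lower()
    return [name for name in names if name.split("_")[0] in query_lower]


# Each pair is (question, name of the document that should be retrieved)
labelled = [
    ("What are the symptoms of flu?", "flu_symptoms"),
    ("How to treat migraine?", "migraine_treatment"),
    ("What are the signs of diabetes?", "diabetes_symptoms"),
]
print(hit_rate_at_k(toy_retrieve, labelled, k=2))
```

Here two of the three questions are answered from the right document, so the hit rate is two thirds. Running the same questions after each change to the search function shows whether the change helped.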