# 🤖 LLM Integration for Answer Generation

This notebook demonstrates how to integrate a Large Language Model (LLM) with our retrieval pipeline to generate answers based on retrieved documents. We'll explore:

1. **Prompt Engineering** - Crafting effective prompts for answer generation
2. **Retrieval-Augmented Generation** - Using retrieved documents to ground LLM responses
3. **Comparison Analysis** - Evaluating answers with and without retrieval

This completes the RAG (Retrieval-Augmented Generation) pipeline by adding the generation component.

In [1]:
import os
import json
import time
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Optional
import matplotlib.pyplot as plt
import seaborn as sns

# Add the src directory to the path
import sys
sys.path.append(os.path.abspath('..'))

from src.rag_pipeline import RAGPipeline
from src.llm_generator import LLMGenerator

# Set paths
DATA_DIR = Path("../data")
PROCESSED_DIR = DATA_DIR / "processed"
OUTPUT_DIR = Path("../outputs")
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Set OpenAI API key - replace with your own or use environment variable
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Check if API key is set
if not os.environ.get("OPENAI_API_KEY"):
    print("⚠️ Warning: OPENAI_API_KEY environment variable not set.")
    print("Please set your API key using os.environ['OPENAI_API_KEY'] = 'your-key-here'")
    print("or export it in your environment before running this notebook.")

## 1. Initialize RAG Pipeline and LLM Generator

First, we'll set up our retrieval pipeline and LLM generator.

In [2]:
# Initialize RAG Pipeline
rag_pipeline = RAGPipeline()

# Load documents
rag_pipeline.load_documents(str(PROCESSED_DIR / "processed_chunks.json"))

# Initialize LLM Generator
llm_generator = LLMGenerator(
    model="gpt-3.5-turbo",  # You can change to "gpt-4" for better quality
    temperature=0.3,
    max_tokens=500
)

print(f"RAG Pipeline initialized with {len(rag_pipeline.documents)} documents")
print(f"LLM Generator initialized with model: {llm_generator.model}")
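
The `src.llm_generator` module is not shown in this notebook. Judging only from the calls made in later cells, a generator needs a `model` attribute and a `generate_answer(query, retrieved_docs, prompt_template)` method that returns a dictionary with an `answer` key. A minimal offline stand-in, useful for exercising the rest of the notebook without an API key, might look like the sketch below. `StubGenerator` is hypothetical and makes no API calls.

```python
from typing import Any, Dict, List, Optional


class StubGenerator:
    """Offline stand-in mirroring the interface this notebook relies on."""

    def __init__(self, model: str = "stub-model") -> None:
        self.model = model

    def generate_answer(
        self,
        query: str,
        retrieved_docs: List[Dict[str, Any]],
        prompt_template: Optional[str] = None,
    ) -> Dict[str, Any]:
        # No LLM call: echo the first sentence of the top document instead.
        if retrieved_docs:
            top = retrieved_docs[0]["content"]
            answer = top.split(". ")[0].rstrip(".") + "."
        else:
            answer = "No context was retrieved."
        return {
            "query": query,
            "answer": answer,
            "sources": [doc.get("id", "unknown") for doc in retrieved_docs],
        }


stub = StubGenerator()
out = stub.generate_answer(
    "example question",
    [{"id": "chunk_1", "content": "First sentence. Second sentence."}],
)
print(out["answer"], out["sources"])
```

Assigning `llm_generator = StubGenerator()` would let the retrieval and formatting code run unchanged, although the answers are then just excerpts rather than generated text.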

## 2. Custom Prompt Templates

Let's define some custom prompt templates for different types of queries.

In [3]:
# Define custom prompt templates
standard_prompt = """
Answer the following question based on the provided context information. 
If the answer cannot be determined from the context, say so clearly.

CONTEXT:
{context}

QUESTION:
{query}

ANSWER:
"""

detailed_prompt = """
You are an expert in sports science and athletic performance. 
Provide a detailed and scientifically accurate answer to the following question 
based ONLY on the provided context information.

If the answer cannot be fully determined from the context, clearly state what is known 
from the context and what additional information would be needed.

CONTEXT:
{context}

QUESTION:
{query}

DETAILED ANSWER:
"""

summarization_prompt = """
Summarize the key information from the provided context that is relevant to the question.
Be concise but comprehensive, focusing on the most important points.

CONTEXT:
{context}

QUESTION:
{query}

SUMMARY:
"""

# Map of prompt types
prompt_templates = {
    "standard": standard_prompt,
    "detailed": detailed_prompt,
    "summary": summarization_prompt
}
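
The templates above expose two placeholders, `{context}` and `{query}`. How `LLMGenerator.generate_answer` fills them is not shown in this notebook. The sketch below assumes plain `str.format` substitution and uses a hypothetical `build_prompt` helper that joins the retrieved chunks into one context string. The document dictionaries follow the shape used elsewhere in this notebook (`id`, `content`).

```python
from typing import Dict, List

# Shortened stand-in for the notebook's templates (same placeholders).
TEMPLATE = """
CONTEXT:
{context}

QUESTION:
{query}

ANSWER:
"""


def build_prompt(template: str, query: str, docs: List[Dict[str, str]]) -> str:
    """Fill a template by joining document contents into one context block.

    Hypothetical helper: the real LLMGenerator may format context differently.
    """
    # Label each chunk with its id so answers can be traced back to sources.
    context = "\n\n".join(f"[{doc['id']}] {doc['content']}" for doc in docs)
    return template.format(context=context, query=query)


docs = [
    {"id": "chunk_1", "content": "Example chunk about sleep."},
    {"id": "chunk_2", "content": "Example chunk about recovery."},
]
prompt = build_prompt(TEMPLATE, "How does sleep affect recovery?", docs)
print(prompt)
```

One caveat of `str.format`: literal braces in the template itself must be doubled (`{{`, `}}`), while braces inside the retrieved text are inserted as-is.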

## 3. Full RAG Pipeline: Retrieval + Generation

Now let's combine retrieval and generation to answer queries.

In [4]:
def full_rag_pipeline(query: str, top_k: int = 5, prompt_type: str = "standard"):
    """Run the full RAG pipeline: retrieval + generation."""
    start_time = time.time()
    
    # Step 1: Retrieve relevant documents
    retrieval_results = rag_pipeline.query(query, top_k=top_k, rerank=True)
    
    # Step 2: Format documents for the LLM
    retrieved_docs = []
    for doc in retrieval_results['results']:
        retrieved_docs.append({
            "id": doc["id"],
            "content": doc["content"],
            "metadata": {
                "source": doc.get("metadata", {}).get("source", "Unknown"),
                "score": doc["score"]
            }
        })
    
    # Step 3: Generate answer using LLM
    prompt_template = prompt_templates.get(prompt_type, prompt_templates["standard"])
    answer_result = llm_generator.generate_answer(query, retrieved_docs, prompt_template)
    
    # Calculate total time
    total_time = time.time() - start_time
    
    # Combine results
    full_result = {
        "query": query,
        "answer": answer_result["answer"],
        "retrieval_results": retrieval_results,
        "prompt_type": prompt_type,
        "model": llm_generator.model,
        "timing": {
            "retrieval_time": retrieval_results["timing"]["total_time"],
            # Derived by subtraction, so it also includes the document-formatting overhead
            "generation_time": total_time - retrieval_results["timing"]["total_time"],
            "total_time": total_time
        }
    }
    
    return full_result

# Test the full pipeline with a sample query
sample_query = "How does sleep quality affect athletic performance and recovery?"
result = full_rag_pipeline(sample_query, top_k=3, prompt_type="detailed")

# Display the answer
print(f"Query: {result['query']}\n")
print(f"Answer:\n{result['answer']}\n")
print("Sources:")
for i, doc in enumerate(result['retrieval_results']['results']):
    print(f"  {i+1}. {doc['id']} (Score: {doc['score']:.3f})")
print(f"\nTiming: Retrieval {result['timing']['retrieval_time']:.3f}s, Generation {result['timing']['generation_time']:.3f}s, Total {result['timing']['total_time']:.3f}s")
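
`full_rag_pipeline` derives `generation_time` by subtracting the pipeline's self-reported retrieval time from the wall-clock total, so any formatting overhead is attributed to generation. An alternative is to time each stage directly. The sketch below uses a hypothetical `stage_timer` context manager built on `time.perf_counter`; the two `time.sleep` calls are placeholders standing in for retrieval and generation.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def stage_timer(timings: Dict[str, float], name: str) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so failures still show up in timings.
        timings[name] = time.perf_counter() - start


timings: Dict[str, float] = {}

with stage_timer(timings, "retrieval_time"):
    time.sleep(0.01)  # placeholder for rag_pipeline.query(...)

with stage_timer(timings, "generation_time"):
    time.sleep(0.02)  # placeholder for llm_generator.generate_answer(...)

timings["total_time"] = sum(timings.values())
print({name: round(seconds, 3) for name, seconds in timings.items()})
```

`time.perf_counter` is a monotonic, high-resolution clock, which makes it a safer choice than `time.time` for measuring short intervals.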

## 4. Comparing Answers With and Without Retrieval

Let's compare answers generated with retrieved context vs. without context.

In [5]:
def compare_with_without_retrieval(query: str, top_k: int = 3):
    """Compare answers generated with and without retrieval context."""
    # Get retrieval results
    retrieval_results = rag_pipeline.query(query, top_k=top_k, rerank=True)
    
    # Format documents for the LLM
    retrieved_docs = []
    for doc in retrieval_results['results']:
        retrieved_docs.append({
            "id": doc["id"],
            "content": doc["content"],
            "metadata": {
                "source": doc.get("metadata", {}).get("source", "Unknown"),
                "score": doc["score"]
            }
        })
    
    # Compare answers
    comparison = llm_generator.compare_with_without_retrieval(query, retrieved_docs)
    
    return comparison

# Test with a few queries
comparison_queries = [
    "What is HRV and how does it relate to recovery?",
    "How can athletes optimize their sleep for better performance?",
    "What are the signs of overtraining syndrome?"
]

for query in comparison_queries:
    print(f"\n{'='*80}\nQUERY: {query}\n{'='*80}\n")
    
    comparison = compare_with_without_retrieval(query)
    
    print("WITH RETRIEVAL:\n")
    print(comparison['with_retrieval']['answer'])
    print("\n" + "-"*80 + "\n")
    
    print("WITHOUT RETRIEVAL:\n")
    print(comparison['without_retrieval']['answer'])
    print("\n" + "="*80)
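
The side-by-side printout above is a qualitative comparison. For a rough quantitative signal, one option is to measure how much of an answer's vocabulary also appears in the retrieved context. The sketch below is a crude lexical proxy, not a validated faithfulness metric; `context_overlap` and the example strings are hypothetical and not part of `LLMGenerator`.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word tokens of length >= 4, to skip most function words."""
    return {tok for tok in re.findall(r"[a-z0-9]+", text.lower()) if len(tok) >= 4}


def context_overlap(answer: str, contexts: List[str]) -> float:
    """Fraction of the answer's distinct tokens that also occur in the contexts."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_tokens: Set[str] = set()
    for ctx in contexts:
        context_tokens |= tokenize(ctx)
    return len(answer_tokens & context_tokens) / len(answer_tokens)


# Illustrative strings only; they are not outputs of the pipeline.
contexts = ["Heart rate variability tends to drop after hard training sessions."]
grounded = "Heart rate variability tends to drop after hard sessions."
ungrounded = "Athletes should drink plenty of water every morning."

print(f"grounded:   {context_overlap(grounded, contexts):.2f}")
print(f"ungrounded: {context_overlap(ungrounded, contexts):.2f}")
```

Applied to the `comparison` dictionary, the same function could score `comparison['with_retrieval']['answer']` and `comparison['without_retrieval']['answer']` against the `content` of the retrieved chunks. A high score only indicates shared wording, so it should be read alongside the answers themselves.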

## 5. Analyzing Answer Quality

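Let's compare the answers generated with different prompt templates. The metrics below, answer length and latency, are rough proxies only; they do not measure whether an answer is correct.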

In [6]:
# Define a set of test queries
test_queries = [
    "What factors affect HRV measurements?",
    "How can I use HRV to guide my training?",
    "What is the relationship between sleep and recovery?",
    "How should nutrition be adjusted for high-intensity training?",
    "What are the best strategies for monitoring training load?"
]

# Test different prompt templates
results = []

for query in test_queries:
    for prompt_type in prompt_templates.keys():
        print(f"Processing query: '{query}' with prompt type: {prompt_type}")
        result = full_rag_pipeline(query, top_k=3, prompt_type=prompt_type)
        
        # Store result summary
        results.append({
            "query": query,
            "prompt_type": prompt_type,
            "answer_length": len(result["answer"]),
            "retrieval_time": result["timing"]["retrieval_time"],
            "generation_time": result["timing"]["generation_time"],
            "total_time": result["timing"]["total_time"]
        })

# Convert to DataFrame for analysis
results_df = pd.DataFrame(results)

# Analyze results
print("\nSummary Statistics:")
summary = results_df.groupby('prompt_type').agg({
    'answer_length': ['mean', 'std'],
    'retrieval_time': 'mean',
    'generation_time': 'mean',
    'total_time': 'mean'
})
display(summary)

# Visualize answer length by prompt type
plt.figure(figsize=(10, 6))
sns.boxplot(x='prompt_type', y='answer_length', data=results_df)
plt.title('Answer Length by Prompt Type')
plt.xlabel('Prompt Type')
plt.ylabel('Answer Length (characters)')
plt.grid(True, alpha=0.3)
plt.savefig(OUTPUT_DIR / "answer_length_by_prompt.png", dpi=300)
plt.show()

# Visualize timing by prompt type
timing_data = results_df.melt(
    id_vars=['prompt_type', 'query'],
    value_vars=['retrieval_time', 'generation_time'],
    var_name='timing_type',
    value_name='time_seconds'
)

plt.figure(figsize=(12, 6))
sns.barplot(x='prompt_type', y='time_seconds', hue='timing_type', data=timing_data)
plt.title('Processing Time by Prompt Type')
plt.xlabel('Prompt Type')
plt.ylabel('Time (seconds)')
plt.grid(True, alpha=0.3)
plt.savefig(OUTPUT_DIR / "processing_time_by_prompt.png", dpi=300)
plt.show()

## 6. Saving Results for Future Analysis

Let's save our results for future reference and analysis.

In [7]:
# Create a comprehensive example with all prompt types
comprehensive_query = "How can athletes use HRV to optimize their training and recovery?"
comprehensive_results = {}

for prompt_type in prompt_templates.keys():
    result = full_rag_pipeline(comprehensive_query, top_k=5, prompt_type=prompt_type)
    comprehensive_results[prompt_type] = result

# Save to file
with open(OUTPUT_DIR / "llm_integration_results.json", 'w') as f:
    json.dump(comprehensive_results, f, indent=2)

print(f"Saved comprehensive results to {OUTPUT_DIR / 'llm_integration_results.json'}")

# Save a sample for the README
readme_example = comprehensive_results["detailed"]
with open(OUTPUT_DIR / "readme_example.json", 'w') as f:
    json.dump(readme_example, f, indent=2)

print(f"Saved README example to {OUTPUT_DIR / 'readme_example.json'}")
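
`json.dump` raises `TypeError` on values it cannot encode, which can happen if the nested retrieval results carry objects such as `Path` instances or NumPy scalars. Whether that applies here depends on what `RAGPipeline.query` returns, which this notebook does not show. As a defensive sketch, a `default` hook can convert such values; `to_jsonable` below is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any


def to_jsonable(value: Any) -> Any:
    """Fallback encoder passed to json.dumps via `default`.

    Called only for objects the json module cannot serialize by itself.
    """
    # NumPy scalars (and similar) expose .item() to get a plain Python value.
    if hasattr(value, "item") and callable(value.item):
        return value.item()
    if isinstance(value, (set, frozenset)):
        return sorted(value)
    if isinstance(value, Path):
        return str(value)
    raise TypeError(f"Cannot serialize object of type {type(value).__name__}")


# Illustrative record; the fields only loosely mimic the result structure.
record = {
    "query": "example query",
    "source_file": Path("data") / "processed" / "processed_chunks.json",
    "tags": {"sleep", "hrv"},
}

encoded = json.dumps(record, default=to_jsonable, indent=2)
print(encoded)
```

Raising `TypeError` for unknown types keeps unexpected objects from being silently stringified, which is the trade-off compared with the shorter `default=str`.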

## 7. Fallback Mechanisms for API Issues

Let's implement a fallback mechanism for when the OpenAI API is unavailable.

In [8]:
def generate_answer_with_fallback(query, retrieved_docs, prompt_type="standard", max_retries=3):
    """Generate an answer with fallback mechanisms for API issues."""
    prompt_template = prompt_templates.get(prompt_type, prompt_templates["standard"])
    
    # Try the OpenAI API first
    for attempt in range(max_retries):
        try:
            return llm_generator.generate_answer(query, retrieved_docs, prompt_template)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2)  # Wait before retrying; no wait after the final attempt
    
    # If all attempts fail, use a simple extractive fallback
    print("All API attempts failed. Using extractive fallback.")
    
    # Simple extractive summary from top documents
    fallback_answer = "Based on the retrieved documents:\n\n"
    
    for i, doc in enumerate(retrieved_docs[:3]):
        # Take the first 200 characters, marking truncation only when it happened
        content = doc["content"]
        snippet = content[:200] + ("..." if len(content) > 200 else "")
        fallback_answer += f"Document {i+1}: {snippet}\n\n"
    
    fallback_answer += "\nNote: This is an extractive summary due to LLM API unavailability."
    
    return {
        "query": query,
        "answer": fallback_answer,
        "sources": [doc.get("id", "unknown") for doc in retrieved_docs],
        "fallback_used": True
    }

# Example usage (we'll simulate an API failure)
def simulate_api_failure():
    # Get retrieval results
    query = "What are the best recovery strategies for athletes?"
    retrieval_results = rag_pipeline.query(query, top_k=3, rerank=True)
    
    # Format documents
    retrieved_docs = []
    for doc in retrieval_results['results']:
        retrieved_docs.append({
            "id": doc["id"],
            "content": doc["content"],
            "metadata": {
                "source": doc.get("metadata", {}).get("source", "Unknown"),
                "score": doc["score"]
            }
        })
    
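    # Temporarily set an invalid API key to simulate failure.
    # This only forces an error if LLMGenerator reads the key from the environment
    # at request time; a client that stored the key at construction is unaffected.
    original_key = os.environ.get("OPENAI_API_KEY", "")
    os.environ["OPENAI_API_KEY"] = "invalid-key-for-testing"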
    
    try:
        # This should fail and use the fallback
        result = generate_answer_with_fallback(query, retrieved_docs, max_retries=1)
        print(f"\nQuery: {query}\n")
        print(f"Answer:\n{result['answer']}")
    finally:
        # Restore the original key
        if original_key:
            os.environ["OPENAI_API_KEY"] = original_key
        else:
            del os.environ["OPENAI_API_KEY"]

# Uncomment to test the fallback mechanism
# simulate_api_failure()
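
The retry loop above waits a fixed two seconds between attempts. For rate-limit errors, a common alternative is exponential backoff with jitter. The sketch below is a generic helper, not part of `LLMGenerator`; `retry_with_backoff`, its parameters, and the `flaky_call` stand-in are hypothetical, and the `sleep` argument is injectable so the example runs without real delays.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` up to `max_retries` times, doubling the wait cap each time."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:  # in practice, catch only retryable error types
            last_error = exc
            if attempt < max_retries - 1:
                # "Full jitter": wait a random time up to the exponential cap.
                cap = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, cap))
    raise RuntimeError(f"All {max_retries} attempts failed") from last_error


# Stand-in for an API call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_call() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


waits: List[float] = []
result = retry_with_backoff(flaky_call, sleep=waits.append)
print(result, len(waits))
```

In `generate_answer_with_fallback`, the call to `llm_generator.generate_answer` could be wrapped in a `lambda` and passed as `func`, with the extractive fallback placed in an `except RuntimeError` branch.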

## Conclusion

In this notebook, we've successfully integrated an LLM with our retrieval pipeline to create a complete RAG system. Key accomplishments include:

1. **Prompt Engineering**: We created three prompt templates (standard, detailed, and summary) and measured how prompt design affects answer length and latency.

2. **Retrieval-Augmented Generation**: We grounded LLM responses in retrieved documents by passing the top-ranked chunks to the model as context.

3. **Comparison Analysis**: We printed answers generated with and without retrieval context side by side for qualitative comparison.

4. **Fallback Mechanisms**: We implemented retries and an extractive fallback so that the pipeline still returns an answer when the LLM API is unavailable.

This completes our RAG pipeline, which now includes document processing, embedding, retrieval, reranking, and answer generation components.