<a href="https://colab.research.google.com/github/ryanchen0327/OpenScholarForSciFy/blob/main/OpenScholar_Complete_Colab_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenScholar v2.0.0 - Complete Google Colab Notebook

This notebook covers setup, a quick demo, and an automated test suite for OpenScholar v2.0.0 and its enhanced features.

## Features Available:
- **Multi-Source Feedback Retrieval**: Semantic Scholar, peS2o, Google Search, You.com
- **Adaptive Score-Based Filtering**: Quality-based document filtering
- **Self-Reflective Generation**: Iterative improvement pipeline
- **OpenScholar Reranker**: Enhanced document relevance scoring
- **Automated Testing Suite**: Heuristic evaluation (answer length, citation counts, document counts) without human annotators

## Table of Contents:
1. [Environment Setup](#setup)
2. [API Key Configuration](#api-keys)
3. [Quick Demo](#demo)
4. [Automated Testing](#testing)
5. [Results Analysis](#analysis)
6. [Advanced Configuration Options](#advanced)

---

## Part 1: Environment Setup

First, let's clone the repository and install all dependencies:

In [None]:
# Clone the OpenScholar repository
!git clone https://github.com/ryanchen0327/OpenScholarForSciFy.git
%cd OpenScholarForSciFy

# Install required dependencies
!pip install -r requirements.txt
!pip install google-search-results python-dotenv requests beautifulsoup4

print("Environment setup complete!")
!pwd
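
Before moving on, it can help to confirm that the installs above actually took effect. The following is a minimal sketch, not part of the repository: the import names (`serpapi`, `dotenv`, `requests`, `bs4`) are the modules those pip packages are expected to provide, and `missing_modules` is a hypothetical helper introduced here for illustration.

```python
import importlib.util
from pathlib import Path

# Import names assumed for the packages installed above
# (google-search-results -> serpapi, python-dotenv -> dotenv, beautifulsoup4 -> bs4).
EXPECTED_MODULES = ["serpapi", "dotenv", "requests", "bs4"]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(EXPECTED_MODULES)
print("Missing modules:", missing or "none")
# run.py should be visible if the %cd into the cloned repository succeeded.
print("run.py present:", Path("run.py").exists())
```

If any module is reported missing, re-running the install cell is usually the first thing to try.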

## Part 2: API Key Configuration

Configure your API keys for different services:

### Required:
- **OpenAI API** (for GPT models)

### Optional (for enhanced features):
- **Semantic Scholar API** (better academic retrieval)
- **SerpAPI** (Google Search integration)
- **You.com API** (additional search source)

In [None]:
import os
import getpass

# OpenAI API Key (Required)
print("Setting up API keys...")
print("\n1. OpenAI API Key (Required)")
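openai_key = getpass.getpass("Enter your OpenAI API key: ").strip()
if not openai_key:
    raise ValueError("An OpenAI API key is required for the runs below")
# run.py reads the key from this file through --api_key_fp
with open('openai_key.txt', 'w') as f:
    f.write(openai_key)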
print("OpenAI API key saved")

# Optional API Keys
print("\n2. Optional API Keys (press Enter to skip any)")

# Semantic Scholar API Key
s2_key = getpass.getpass("Enter your Semantic Scholar API key (optional): ")
if s2_key.strip():
    os.environ['S2_API_KEY'] = s2_key
    print("Semantic Scholar API key set")
else:
    print("Semantic Scholar API key skipped")

# SerpAPI Key for Google Search
serp_key = getpass.getpass("Enter your SerpAPI key for Google Search (optional): ")
if serp_key.strip():
    os.environ['SERP_API_KEY'] = serp_key
    print("SerpAPI key set - Google Search enabled")
else:
    print("SerpAPI key skipped - Google Search disabled")

# You.com API Key
you_key = getpass.getpass("Enter your You.com API key (optional): ")
if you_key.strip():
    os.environ['YOU_API_KEY'] = you_key
    print("You.com API key set - You.com Search enabled")
else:
    print("You.com API key skipped - You.com Search disabled")

print("\nAPI configuration complete!")
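
To see at a glance which optional keys ended up configured, a small check such as the one below may be useful. It is a sketch that only inspects the three environment variables set in the cell above; `enabled_sources` is a hypothetical helper, not an OpenScholar function.

```python
import os

# Environment variable names are the ones set in the cell above.
OPTIONAL_SOURCES = {
    "Semantic Scholar": "S2_API_KEY",
    "Google Search (SerpAPI)": "SERP_API_KEY",
    "You.com": "YOU_API_KEY",
}

def enabled_sources(env=os.environ):
    """Map each optional source to True if its key is set and non-empty."""
    return {name: bool(env.get(var, "").strip()) for name, var in OPTIONAL_SOURCES.items()}

for name, has_key in enabled_sources().items():
    print(f"  {name}: {'key set' if has_key else 'no key'}")
```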

## Part 3: Quick Demo

Let's test OpenScholar with sample questions to see it in action:

In [None]:
import json

# Create sample questions for demo
print("Creating sample test questions...")

demo_questions = [
    {
        "input": "What are the recent advances in large language models for scientific research?",
        "question_id": "demo_1"
    },
    {
        "input": "How does CRISPR gene editing work and what are its current applications?",
        "question_id": "demo_2"
    },
    {
        "input": "What are the environmental impacts of renewable energy technologies?",
        "question_id": "demo_3"
    }
]

# Save demo data
with open('demo_input.jsonl', 'w') as f:
    for question in demo_questions:
        f.write(json.dumps(question) + '\n')

print(f"Created {len(demo_questions)} demo questions")
print("\nDemo Questions:")
for i, q in enumerate(demo_questions, 1):
    print(f"  {i}. {q['input']}")
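
When you later substitute your own question file, a quick validation pass can catch malformed lines before a run starts. The sketch below assumes each record needs the `input` and `question_id` fields used in the demo data; `load_jsonl` is a hypothetical helper written for this notebook.

```python
import json
import os

def load_jsonl(path, required_fields=("input", "question_id")):
    """Read a JSONL file and check that every record has the required fields."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [field for field in required_fields if field not in record]
            if missing:
                raise ValueError(f"{path}:{line_number} is missing {missing}")
            records.append(record)
    return records

# Only validate the demo file if it has already been written.
if os.path.exists("demo_input.jsonl"):
    print(f"Validated {len(load_jsonl('demo_input.jsonl'))} records")
```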

### Demo 1: Basic RAG Configuration

In [None]:
print("Running Basic RAG Demo...")
print("Configuration: Basic retrieval + OpenScholar reranker")

!python run.py \
  --input_file demo_input.jsonl \
  --model_name gpt-4o-mini \
  --api openai \
  --api_key_fp openai_key.txt \
  --use_contexts \
  --ranking_ce \
  --reranker OpenScholar/OpenScholar_Reranker \
  --output_file demo_basic_output.json \
  --sample_k 1

print("\nBasic RAG demo completed!")
print("Results saved to: demo_basic_output.json")

### Demo 2: Enhanced Multi-Source Configuration

In [None]:
print("Running Enhanced Multi-Source Demo...")
print("Configuration: Score filtering + Multi-source feedback + Self-reflection")

# Enhanced run with all features
!python run.py \
  --input_file demo_input.jsonl \
  --model_name gpt-4o-mini \
  --api openai \
  --api_key_fp openai_key.txt \
  --use_contexts \
  --ranking_ce \
  --reranker OpenScholar/OpenScholar_Reranker \
  --use_score_threshold \
  --score_threshold_type percentile_75 \
  --feedback \
  --ss_retriever \
  --use_pes2o_feedback \
  --output_file demo_enhanced_output.json \
  --sample_k 1

print("\nEnhanced multi-source demo completed!")
print("Results saved to: demo_enhanced_output.json")

### Demo Results Comparison

In [None]:
import json
import os

def analyze_demo_results(filename, config_name):
    """Analyze demo results"""
    if not os.path.exists(filename):
        print(f"{config_name}: File {filename} not found")
        return None
    
    with open(filename, 'r') as f:
        data = json.load(f)
    
    results = data.get('data', data)
    
    print(f"\n{config_name} Results:")
    print(f"  Questions processed: {len(results)}")
    
    if results:
        first_result = results[0]
        output = first_result.get('output', '')
        contexts = first_result.get('ctxs', first_result.get('docs', []))
        
        # Count citations (rough estimate)
        citation_count = output.count('[') + output.count('(')
        
        print(f"  Answer length: {len(output)} characters")
        print(f"  Documents used: {len(contexts)}")
        print(f"  Citations found: ~{citation_count}")
        
        # Show answer preview
        print("\n  Answer preview:")
        preview = output[:200].replace('\n', ' ')
        print(f"    {preview}...")
    
    return results

print("Analyzing demo results...")

# Analyze both configurations
basic_results = analyze_demo_results('demo_basic_output.json', 'Basic RAG')
enhanced_results = analyze_demo_results('demo_enhanced_output.json', 'Enhanced Multi-Source')

# Comparison summary
if basic_results and enhanced_results:
    print("\nConfiguration Comparison:")
    
    basic_docs = len(basic_results[0].get('ctxs', basic_results[0].get('docs', [])))
    enhanced_docs = len(enhanced_results[0].get('ctxs', enhanced_results[0].get('docs', [])))
    
    basic_length = len(basic_results[0].get('output', ''))
    enhanced_length = len(enhanced_results[0].get('output', ''))
    
    print(f"  Documents: Basic ({basic_docs}) vs Enhanced ({enhanced_docs}) = {enhanced_docs - basic_docs:+d}")
    print(f"  Length: Basic ({basic_length}) vs Enhanced ({enhanced_length}) = {enhanced_length - basic_length:+d} chars")
    
    if enhanced_docs > basic_docs:
        print("  Enhanced configuration used more documents")
    if enhanced_length > basic_length:
        print("  Enhanced configuration produced a longer answer")

print("\nDemo analysis complete!")
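
The citation estimate above counts every `[` and `(` in the answer, so ordinary parentheses inflate it. A somewhat stricter count is sketched below, under the assumption that citations appear as bracketed numbers such as `[3]` or `[1, 2]`; if the output uses a different citation style, the pattern would need adjusting.

```python
import re

# Assumption: citations appear as bracketed numbers such as [3] or [1, 2].
CITATION_PATTERN = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")

def count_citations(text):
    """Return (total citation markers, number of distinct cited indices)."""
    indices = []
    for match in CITATION_PATTERN.finditer(text):
        indices.extend(int(n) for n in match.group(1).split(","))
    return len(indices), len(set(indices))

sample = "LLMs help with literature review [1], (see also) synthesis [2, 3] and [1]."
print(count_citations(sample))  # (4, 3)
```

The second number is often the more informative one, since an answer that cites the same document repeatedly is not necessarily better grounded.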

## Part 4: Automated Testing Suite

Now let's run comprehensive automated tests to evaluate OpenScholar's performance across different configurations:

In [None]:
print("Setting up automated testing suite...")

# Create comprehensive test dataset
test_questions = [
    {
        "input": "What are the applications of CRISPR gene editing in treating genetic diseases?",
        "question_id": "auto_test_1",
        "category": "biomedical"
    },
    {
        "input": "How do transformer neural networks work in natural language processing?",
        "question_id": "auto_test_2",
        "category": "ai_ml"
    },
    {
        "input": "What are the environmental impacts of lithium mining for electric vehicle batteries?",
        "question_id": "auto_test_3",
        "category": "environmental"
    },
    {
        "input": "How does quantum entanglement enable quantum computing advantages?",
        "question_id": "auto_test_4",
        "category": "physics"
    },
    {
        "input": "What are the latest developments in fusion energy research and ITER project?",
        "question_id": "auto_test_5",
        "category": "energy"
    }
]

# Save automated test data
with open('automated_test_input.jsonl', 'w') as f:
    for item in test_questions:
        f.write(json.dumps(item) + '\n')

print(f"Created {len(test_questions)} test questions across multiple domains")
print("\nTest Categories:")
categories = {}
for q in test_questions:
    cat = q['category']
    if cat not in categories:
        categories[cat] = []
    categories[cat].append(q['input'][:60] + '...')

for cat, questions in categories.items():
    print(f"  {cat.replace('_', ' ').title()}: {len(questions)} questions")

### Test Configuration 1: Baseline RAG

In [None]:
print("Running Test 1: Baseline RAG...")
print("Configuration: Basic retrieval only")

!python run.py \
  --input_file automated_test_input.jsonl \
  --model_name gpt-4o-mini \
  --api openai \
  --api_key_fp openai_key.txt \
  --use_contexts \
  --output_file test_baseline_results.json \
  --sample_k 5

print("Baseline RAG test completed")

### Test Configuration 2: RAG + Reranker

In [None]:
print("Running Test 2: RAG + Reranker...")
print("Configuration: Basic retrieval + OpenScholar reranker")

!python run.py \
  --input_file automated_test_input.jsonl \
  --model_name gpt-4o-mini \
  --api openai \
  --api_key_fp openai_key.txt \
  --use_contexts \
  --ranking_ce \
  --reranker OpenScholar/OpenScholar_Reranker \
  --output_file test_reranker_results.json \
  --sample_k 5

print("Reranker test completed")

### Test Configuration 3: Score-Based Filtering

In [None]:
print("Running Test 3: Score-Based Filtering...")
print("Configuration: Reranker + adaptive document filtering")

!python run.py \
  --input_file automated_test_input.jsonl \
  --model_name gpt-4o-mini \
  --api openai \
  --api_key_fp openai_key.txt \
  --use_contexts \
  --ranking_ce \
  --reranker OpenScholar/OpenScholar_Reranker \
  --use_score_threshold \
  --score_threshold_type percentile_75 \
  --output_file test_filtering_results.json \
  --sample_k 5

print("Score-based filtering test completed")
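
To make the effect of `--score_threshold_type` more concrete, here is a minimal sketch of percentile-based filtering. It assumes that `percentile_75` means "keep documents whose score is at or above the 75th percentile", which matches the "keeps top 25%" description in Part 6; the repository may compute the cutoff differently, and the `score` field and helper names here are hypothetical.

```python
import statistics

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def score_threshold(scores, threshold_type):
    """Turn a threshold type such as 'average' or 'percentile_75' into a cutoff."""
    if threshold_type == "average":
        return statistics.mean(scores)
    if threshold_type == "median":
        return statistics.median(scores)
    if threshold_type.startswith("percentile_"):
        return percentile(scores, float(threshold_type.split("_", 1)[1]))
    raise ValueError(f"Unknown threshold type: {threshold_type}")

def filter_by_score(docs, threshold_type):
    """Keep documents whose (hypothetical) 'score' field is at or above the cutoff."""
    cutoff = score_threshold([d["score"] for d in docs], threshold_type)
    return [d for d in docs if d["score"] >= cutoff]

docs = [{"id": i, "score": s} for i, s in enumerate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])]
kept = filter_by_score(docs, "percentile_75")
print([d["id"] for d in kept])  # [6, 7]
```

With eight documents, the 75th-percentile cutoff keeps the two highest-scoring ones, that is, the top 25%.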

### Test Configuration 4: Self-Reflective Generation

In [None]:
print("Running Test 4: Self-Reflective Generation...")
print("Configuration: Score filtering + feedback loop + Semantic Scholar retrieval")

!python run.py \
  --input_file automated_test_input.jsonl \
  --model_name gpt-4o-mini \
  --api openai \
  --api_key_fp openai_key.txt \
  --use_contexts \
  --ranking_ce \
  --reranker OpenScholar/OpenScholar_Reranker \
  --use_score_threshold \
  --feedback \
  --ss_retriever \
  --output_file test_feedback_results.json \
  --sample_k 5

print("Self-reflective generation test completed")

### Test Configuration 5: Multi-Source Feedback

In [None]:
print("Running Test 5: Multi-Source Feedback...")
print("Configuration: All features + multiple retrieval sources")

# Enhanced run with all available sources
!python run.py \
  --input_file automated_test_input.jsonl \
  --model_name gpt-4o-mini \
  --api openai \
  --api_key_fp openai_key.txt \
  --use_contexts \
  --ranking_ce \
  --reranker OpenScholar/OpenScholar_Reranker \
  --use_score_threshold \
  --feedback \
  --ss_retriever \
  --use_pes2o_feedback \
  --output_file test_multisource_results.json \
  --sample_k 5

print("Multi-source feedback test completed")
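
The five test cells above differ only in a handful of flags. If you prefer to drive them from one place, the commands can be assembled programmatically, as in the sketch below. The flags and file names are copied from the cells above; `build_command` is a hypothetical helper, and the snippet only prints the commands rather than running them.

```python
import shlex

# Flags shared by every test run in this notebook.
BASE_ARGS = [
    "python", "run.py",
    "--input_file", "automated_test_input.jsonl",
    "--model_name", "gpt-4o-mini",
    "--api", "openai",
    "--api_key_fp", "openai_key.txt",
    "--use_contexts",
    "--sample_k", "5",
]
RERANK = ["--ranking_ce", "--reranker", "OpenScholar/OpenScholar_Reranker"]

# Extra flags per configuration, copied from the five cells above.
CONFIG_FLAGS = {
    "baseline": [],
    "reranker": RERANK,
    "filtering": RERANK + ["--use_score_threshold", "--score_threshold_type", "percentile_75"],
    "feedback": RERANK + ["--use_score_threshold", "--feedback", "--ss_retriever"],
    "multisource": RERANK + ["--use_score_threshold", "--feedback", "--ss_retriever", "--use_pes2o_feedback"],
}

def build_command(name):
    """Return the argument list for one named configuration."""
    return BASE_ARGS + CONFIG_FLAGS[name] + ["--output_file", f"test_{name}_results.json"]

for name in CONFIG_FLAGS:
    print(shlex.join(build_command(name)))
```

To execute a configuration instead of printing it, the same list can be passed to `subprocess.run(build_command(name), check=True)`.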

## Part 5: Comprehensive Results Analysis

Now let's analyze and compare all test configurations:

In [None]:
import json
import os
from datetime import datetime

def evaluate_test_configuration(filename, config_name):
    """Comprehensive evaluation of a test configuration"""
    if not os.path.exists(filename):
        return None
    
    with open(filename, 'r') as f:
        data = json.load(f)
    
    results = data.get('data', data)
    
    if not results:
        return None
    
    # Calculate metrics
    total_length = sum(len(r.get('output', '')) for r in results)
    total_citations = sum(r.get('output', '').count('[') + r.get('output', '').count('(') for r in results)
    total_docs = sum(len(r.get('ctxs', r.get('docs', []))) for r in results)
    
    # Calculate quality metrics
    avg_length = total_length / len(results)
    avg_citations = total_citations / len(results)
    avg_docs = total_docs / len(results)
    
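    # Heuristic comprehensiveness proxy: answer length in thousands of
    # characters plus 2 points per citation marker (not a validated metric)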
    comprehensiveness_score = (avg_length / 1000) + (avg_citations * 2)
    
    metrics = {
        "configuration": config_name,
        "questions_processed": len(results),
        "avg_answer_length": round(avg_length, 1),
        "avg_citations": round(avg_citations, 1),
        "avg_documents": round(avg_docs, 1),
        "comprehensiveness_score": round(comprehensiveness_score, 2),
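        # Fraction of questions that produced a non-empty answer
        "success_rate": round(sum(bool(r.get('output', '').strip()) for r in results) / len(results), 2)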
    }
    
    return metrics

print("Running comprehensive analysis...")
print("=" * 60)

# Test configurations
configurations = [
    ("test_baseline_results.json", "1. Baseline RAG"),
    ("test_reranker_results.json", "2. RAG + Reranker"),
    ("test_filtering_results.json", "3. Score-Based Filtering"),
    ("test_feedback_results.json", "4. Self-Reflective Generation"),
    ("test_multisource_results.json", "5. Multi-Source Feedback")
]

all_metrics = []
for filename, config_name in configurations:
    metrics = evaluate_test_configuration(filename, config_name)
    if metrics:
        all_metrics.append(metrics)
        print(f"\n{config_name}:")
        print(f"  Questions processed: {metrics['questions_processed']}")
        print(f"  Avg answer length: {metrics['avg_answer_length']} chars")
        print(f"  Avg citations: {metrics['avg_citations']}")
        print(f"  Avg documents: {metrics['avg_documents']}")
        print(f"  Comprehensiveness: {metrics['comprehensiveness_score']}")
    else:
        print(f"\n{config_name}: No results found")

print("\n" + "=" * 60)

In [None]:
# Generate comprehensive test report
if all_metrics:
    # Find best performers
    best_comprehensive = max(all_metrics, key=lambda x: x['comprehensiveness_score'])
    best_citation = max(all_metrics, key=lambda x: x['avg_citations'])
    
    print("\nPERFORMANCE COMPARISON")
    print("=" * 40)
    print(f"Most Comprehensive: {best_comprehensive['configuration']}")
    print(f"Best Citations: {best_citation['configuration']}")
    
    print("\nGENERAL RECOMMENDATIONS (fixed guidance, not computed from this run)")
    print("=" * 30)
    print("For maximum quality: Use Multi-Source Feedback")
    print("For balanced performance: Use Score-Based Filtering")
    print("For speed: Use Baseline RAG")
    
report = {
    "timestamp": datetime.now().isoformat(),
    "test_type": "comprehensive_automated_evaluation",
    "total_configurations": len(all_metrics),
    "results": all_metrics
}

# Save comprehensive report
with open('comprehensive_test_report.json', 'w') as f:
    json.dump(report, f, indent=2)

print(f"\nTested {len(all_metrics)} configurations")
print("Full report saved to: comprehensive_test_report.json")
print("\nAUTOMATED TESTING COMPLETE!")
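
Absolute numbers are easier to interpret as differences from the baseline. The sketch below assumes a list of metric dictionaries shaped like the ones returned by `evaluate_test_configuration`, with the baseline first; `compare_to_baseline` is a hypothetical helper, and the example values are placeholders, not measured results.

```python
def compare_to_baseline(metrics_list, keys=("avg_answer_length", "avg_citations", "avg_documents")):
    """Return per-configuration differences from the first entry in metrics_list."""
    baseline = metrics_list[0]
    rows = []
    for metrics in metrics_list[1:]:
        deltas = {key: round(metrics[key] - baseline[key], 2) for key in keys}
        rows.append({"configuration": metrics["configuration"], **deltas})
    return rows

# Placeholder numbers for illustration only; they are not measured results.
example_metrics = [
    {"configuration": "1. Baseline RAG", "avg_answer_length": 1000.0, "avg_citations": 4.0, "avg_documents": 10.0},
    {"configuration": "2. RAG + Reranker", "avg_answer_length": 1200.0, "avg_citations": 5.0, "avg_documents": 10.0},
]
for row in compare_to_baseline(example_metrics):
    print(row)
```

In this notebook you would call `compare_to_baseline(all_metrics)` once at least two configurations have produced results.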

## Part 6: Advanced Configuration Options

For users who want to customize OpenScholar further:

In [None]:
print("Advanced Configuration Options")
print("=" * 40)

print("\nScore Threshold Types:")
threshold_types = [
    "average - Uses mean score as threshold",
    "median - Uses median score as threshold", 
    "percentile_25 - Keeps top 75% of documents",
    "percentile_50 - Keeps top 50% of documents",
    "percentile_75 - Keeps top 25% of documents",
    "percentile_90 - Keeps top 10% of documents"
]
for t in threshold_types:
    print(f"  • {t}")

print("\nAvailable Models:")
models = [
    "gpt-4o - Highest quality, slower",
    "gpt-4o-mini - Balanced quality and speed", 
    "gpt-3.5-turbo - Fastest, lower cost"
]
for m in models:
    print(f"  • {m}")

print("\nFeedback Sources:")
sources = [
    "--ss_retriever - Semantic Scholar API",
    "--use_pes2o_feedback - peS2o Dense Retrieval",
    "--use_google_feedback - Google Search (requires SerpAPI)",
    "--use_youcom_feedback - You.com Search (requires API key)"
]
for s in sources:
    print(f"  • {s}")

print("\nExample Custom Command:")
custom_command = '''python run.py \\
  --input_file your_questions.jsonl \\
  --model_name gpt-4o \\
  --api openai \\
  --api_key_fp openai_key.txt \\
  --use_contexts \\
  --ranking_ce \\
  --reranker OpenScholar/OpenScholar_Reranker \\
  --use_score_threshold \\
  --score_threshold_type percentile_90 \\
  --feedback \\
  --ss_retriever \\
  --use_pes2o_feedback \\
  --use_google_feedback \\
  --output_file custom_results.json'''
print(custom_command)

## Conclusion

Congratulations! You've successfully set up and tested OpenScholar v2.0.0 with all its enhanced features.

### What You've Accomplished:
- **Environment Setup**: Cloned repository and installed dependencies
- **API Configuration**: Set up OpenAI and optional service APIs
- **Quick Demo**: Tested basic and enhanced configurations
- **Automated Testing**: Evaluated 5 different configurations
- **Performance Analysis**: Compared results across configurations

### Next Steps:

#### For Research Use:
1. **Scale Up**: Process larger datasets using the command-line interface
2. **Evaluation**: Use ScholarQABench for systematic evaluation
3. **Tuning**: Experiment with different threshold types and models

#### For Production Use:
1. **API Integration**: Set up server endpoints for real-time queries
2. **Caching**: Implement result caching for repeated queries
3. **Monitoring**: Add logging and performance monitoring
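
As one possible starting point for the caching step, the sketch below stores each answer on disk under a key derived from the question and the configuration that produced it. Everything here is an assumption for illustration: `answer_fn` stands in for whatever function or wrapper you use to call OpenScholar, and the `answer_cache` directory name is arbitrary.

```python
import hashlib
import json
from pathlib import Path

def cache_key(question, config):
    """Stable key from the question text and the configuration used to answer it."""
    payload = json.dumps({"question": question, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_answer(question, config, answer_fn, cache_dir="answer_cache"):
    """Return a cached result if present; otherwise call answer_fn and store its result.

    answer_fn is a hypothetical callable (question, config) -> JSON-serializable result.
    """
    path = Path(cache_dir) / f"{cache_key(question, config)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = answer_fn(question, config)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Including the configuration in the key keeps answers from different models or threshold settings from being mixed up.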

### Documentation & Resources:
- **Full Documentation**: [GitHub Repository](https://github.com/ryanchen0327/OpenScholarForSciFy)
- **Score Filtering Guide**: `SCORE_FILTERING_README.md`
- **Multi-Source Setup**: `MULTI_SOURCE_FEEDBACK_README.md`
- **Testing Guide**: `AUTOMATED_TESTING_GUIDE.md`
- **Changelog**: `CHANGELOG.md`

### Tips for Optimal Performance:
- **For Speed**: Use Baseline RAG or Score-Based Filtering
- **For Quality**: Use Multi-Source Feedback with all APIs
- **For Research**: Use Self-Reflective Generation
- **For Production**: Use Score-Based Filtering with percentile_75

---

**OpenScholar v2.0.0** - Enhanced Scientific Question Answering System

*Happy researching!*