<a href="https://colab.research.google.com/github/ryanchen0327/OpenScholarForSciFy/blob/main/OpenScholar_Complete_Colab_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenScholar v2.0.0 - Complete Google Colab Notebook\n\nThis notebook provides a complete setup, demo, and automated testing suite for OpenScholar v2.0.0 with all enhanced features.\n\n## Features Available:\n- **Multi-Source Feedback Retrieval**: Semantic Scholar, peS2o, Google Search, You.com\n- **Adaptive Score-Based Filtering**: Quality-based document filtering\n- **Self-Reflective Generation**: Iterative improvement pipeline\n- **OpenScholar Reranker**: Enhanced document relevance scoring\n- **Automated Testing Suite**: Performance evaluation without human annotators\n\n## Table of Contents:\n1. [Environment Setup](#setup)\n2. [API Key Configuration](#api-keys)\n3. [Quick Demo](#demo)\n4. [Automated Testing](#testing)\n5. [Results Analysis](#analysis)\n\n---

## Part 1: Environment Setup\n\nFirst, let's clone the repository and install all dependencies:

In [None]:
import os\n\n# Clone the OpenScholar repository\n!git clone https://github.com/ryanchen0327/OpenScholarForSciFy.git\n%cd OpenScholarForSciFy\n\n# Install required dependencies\n!pip install -r requirements.txt\n!pip install google-search-results python-dotenv requests beautifulsoup4\n\nprint(\"Environment setup complete!\")\n# Report the working directory after %cd\nprint(\"Current directory:\", os.getcwd())
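
Before moving on, it can help to confirm that the extra packages import cleanly. The sketch below is not part of the repository: `missing_modules` is a hypothetical helper, and the list uses the import names that the installed packages provide (`serpapi` for `google-search-results`, `dotenv` for `python-dotenv`, `bs4` for `beautifulsoup4`).

```python
import importlib.util

def missing_modules(module_names):
    """Return the names in module_names that cannot be found on the import path."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names for the packages installed above (hypothetical check, not part of the repo).
required = ["serpapi", "dotenv", "requests", "bs4"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If anything is listed as missing, re-running the `pip install` cell is usually the first thing to try.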

## Part 2: API Key Configuration\n\nConfigure your API keys for different services:\n\n### Required:\n- **OpenAI API** (for GPT models)\n\n### Optional (for enhanced features):\n- **Semantic Scholar API** (better academic retrieval)\n- **SerpAPI** (Google Search integration)\n- **You.com API** (additional search source)

In [None]:
import os\nimport getpass\n\n# OpenAI API Key (Required)\nprint(\"Setting up API keys...\")\nprint(\"\\n1. OpenAI API Key (Required)\")\nopenai_key = getpass.getpass(\"Enter your OpenAI API key: \")\n# Strip stray whitespace from pasted keys before saving\nwith open('openai_key.txt', 'w') as f:\n    f.write(openai_key.strip())\nprint(\"OpenAI API key saved\")\n\n# Optional API Keys\nprint(\"\\n2. Optional API Keys (press Enter to skip any)\")\n\n# Semantic Scholar API Key\ns2_key = getpass.getpass(\"Enter your Semantic Scholar API key (optional): \")\nif s2_key.strip():\n    os.environ['S2_API_KEY'] = s2_key.strip()\n    print(\"Semantic Scholar API key set\")\nelse:\n    print(\"Semantic Scholar API key skipped\")\n\n# SerpAPI Key for Google Search\nserp_key = getpass.getpass(\"Enter your SerpAPI key for Google Search (optional): \")\nif serp_key.strip():\n    os.environ['SERP_API_KEY'] = serp_key.strip()\n    print(\"SerpAPI key set - Google Search enabled\")\nelse:\n    print(\"SerpAPI key skipped - Google Search disabled\")\n\n# You.com API Key\nyou_key = getpass.getpass(\"Enter your You.com API key (optional): \")\nif you_key.strip():\n    os.environ['YOU_API_KEY'] = you_key.strip()\n    print(\"You.com API key set - You.com Search enabled\")\nelse:\n    print(\"You.com API key skipped - You.com Search disabled\")\n\nprint(\"\\nAPI configuration complete!\")
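
To see at a glance which optional sources ended up enabled, the environment can be inspected afterwards. This is a minimal sketch: `enabled_sources` and `OPTIONAL_KEYS` are hypothetical names, while the environment variable names are the ones set in the cell above.

```python
import os

# Environment variable names as set in the cell above; the labels are descriptive only.
OPTIONAL_KEYS = {
    "S2_API_KEY": "Semantic Scholar",
    "SERP_API_KEY": "Google Search (SerpAPI)",
    "YOU_API_KEY": "You.com",
}

def enabled_sources(env):
    """Return the labels of optional sources whose key is present and non-empty in env."""
    return [label for var, label in OPTIONAL_KEYS.items() if env.get(var, "").strip()]

print("Optional sources enabled:", enabled_sources(os.environ) or "none")
```

Passing the environment in as an argument keeps the helper easy to check with a plain dictionary.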

## Part 3: Quick Demo\n\nLet's test OpenScholar with sample questions to see it in action:

In [None]:
import json\n\n# Create sample questions for demo\nprint(\"Creating sample test questions...\")\n\ndemo_questions = [\n    {\n        \"input\": \"What are the recent advances in large language models for scientific research?\",\n        \"question_id\": \"demo_1\"\n    },\n    {\n        \"input\": \"How does CRISPR gene editing work and what are its current applications?\",\n        \"question_id\": \"demo_2\"\n    },\n    {\n        \"input\": \"What are the environmental impacts of renewable energy technologies?\",\n        \"question_id\": \"demo_3\"\n    }\n]\n\n# Save demo data\nwith open('demo_input.jsonl', 'w') as f:\n    for question in demo_questions:\n        f.write(json.dumps(question) + '\\n')\n\nprint(f\"Created {len(demo_questions)} demo questions\")\nprint(\"\\nDemo Questions:\")\nfor i, q in enumerate(demo_questions, 1):\n    print(f\"  {i}. {q['input']}\")
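
Since each question is stored as one JSON object per line, a quick validation pass can catch malformed input before any API calls are made. The following is a sketch only: `validate_jsonl_lines` is a hypothetical helper, and the required fields are simply the two used by the demo questions above.

```python
import json

REQUIRED_FIELDS = ("input", "question_id")  # the fields used by the demo questions above

def validate_jsonl_lines(lines):
    """Parse JSONL lines and return (records, errors); errors are human-readable strings."""
    records, errors = [], []
    for line_no, line in enumerate(lines, 1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {line_no}: expected a JSON object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            errors.append(f"line {line_no}: missing {', '.join(missing)}")
        else:
            records.append(record)
    return records, errors

sample = ['{"input": "What is CRISPR?", "question_id": "demo_x"}', '{"input": "No id"}']
records, errors = validate_jsonl_lines(sample)
print(len(records), "valid;", errors)
```

To check the file written above, pass an open file handle instead of a list, for example `validate_jsonl_lines(open('demo_input.jsonl'))`.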

### Demo 1: Basic RAG Configuration

In [None]:
print(\"Running Basic RAG Demo...\")\nprint(\"Configuration: Basic retrieval + OpenScholar reranker\")\n\n!python run.py \\\n  --input_file demo_input.jsonl \\\n  --model_name gpt-4o-mini \\\n  --api openai \\\n  --api_key_fp openai_key.txt \\\n  --use_contexts \\\n  --ranking_ce \\\n  --reranker OpenScholar/OpenScholar_Reranker \\\n  --output_file demo_basic_output.json \\\n  --sample_k 1\n\nprint(\"\\nBasic RAG demo completed!\")\nprint(\"Results saved to: demo_basic_output.json\")

### Demo 2: Enhanced Multi-Source Configuration

In [None]:
print(\"Running Enhanced Multi-Source Demo...\")\nprint(\"Configuration: Score filtering + Multi-source feedback + Self-reflection\")\n\n# Enhanced run with all features\n!python run.py \\\n  --input_file demo_input.jsonl \\\n  --model_name gpt-4o-mini \\\n  --api openai \\\n  --api_key_fp openai_key.txt \\\n  --use_contexts \\\n  --ranking_ce \\\n  --reranker OpenScholar/OpenScholar_Reranker \\\n  --use_score_threshold \\\n  --score_threshold_type percentile_75 \\\n  --feedback \\\n  --ss_retriever \\\n  --use_pes2o_feedback \\\n  --output_file demo_enhanced_output.json \\\n  --sample_k 1\n\nprint(\"\\nEnhanced multi-source demo completed!\")\nprint(\"Results saved to: demo_enhanced_output.json\")

### Demo Results Comparison

In [None]:
import json\nimport os\n\ndef analyze_demo_results(filename, config_name):\n    \"\"\"Analyze demo results\"\"\"\n    if not os.path.exists(filename):\n        print(f\"{config_name}: File {filename} not found\")\n        return None\n    \n    with open(filename, 'r') as f:\n        data = json.load(f)\n    \n    # Results may be a bare list or wrapped in a {'data': [...]} object\n    results = data.get('data', data) if isinstance(data, dict) else data\n    \n    print(f\"\\n{config_name} Results:\")\n    print(f\"  Questions processed: {len(results)}\")\n    \n    if results:\n        first_result = results[0]\n        output = first_result.get('output', '')\n        contexts = first_result.get('ctxs', first_result.get('docs', []))\n        \n        # Rough citation estimate: counts every '[' and '(' in the answer\n        citation_count = output.count('[') + output.count('(')\n        \n        print(f\"  Answer length: {len(output)} characters\")\n        print(f\"  Documents used: {len(contexts)}\")\n        print(f\"  Citations found: ~{citation_count}\")\n        \n        # Show answer preview\n        print(f\"  \\nAnswer preview:\")\n        preview = output[:200].replace('\\n', ' ')\n        print(f\"    {preview}...\")\n    \n    return results\n\nprint(\"Analyzing demo results...\")\n\n# Analyze both configurations\nbasic_results = analyze_demo_results('demo_basic_output.json', 'Basic RAG')\nenhanced_results = analyze_demo_results('demo_enhanced_output.json', 'Enhanced Multi-Source')\n\n# Comparison summary (first question only)\nif basic_results and enhanced_results:\n    print(\"\\nConfiguration Comparison (first question):\")\n    \n    basic_docs = len(basic_results[0].get('ctxs', basic_results[0].get('docs', [])))\n    enhanced_docs = len(enhanced_results[0].get('ctxs', enhanced_results[0].get('docs', [])))\n    \n    basic_length = len(basic_results[0].get('output', ''))\n    enhanced_length = len(enhanced_results[0].get('output', ''))\n    \n    print(f\"  Documents: Basic ({basic_docs}) vs Enhanced ({enhanced_docs}) = {enhanced_docs - basic_docs:+d}\")\n    print(f\"  Length: Basic ({basic_length}) vs Enhanced ({enhanced_length}) = {enhanced_length - basic_length:+d} chars\")\n    \n    if enhanced_docs > basic_docs:\n        print(\"  Enhanced configuration used more documents\")\n    if enhanced_length > basic_length:\n        print(\"  Enhanced configuration produced a longer answer\")\n\nprint(\"\\nDemo analysis complete!\")
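
The citation estimate in the analysis cell counts every opening bracket and parenthesis, so ordinary parenthetical remarks inflate it. A narrower estimate is sketched below, assuming that answers cite passages with numeric markers such as `[3]` or `[1, 2]`. That format is an assumption and is not confirmed by this notebook; `count_numeric_citations` is a hypothetical helper.

```python
import re

# Matches [3] as well as grouped markers such as [1, 2]; assumed citation format.
CITATION_PATTERN = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")

def count_numeric_citations(text):
    """Return (total citation markers, number of distinct cited indices)."""
    indices = []
    for group in CITATION_PATTERN.findall(text):
        indices.extend(int(part) for part in group.split(","))
    return len(indices), len(set(indices))

example = "LLMs aid literature review [1] (see also [2, 3]) and drafting [1]."
print(count_numeric_citations(example))  # (4, 3)
```

Reporting distinct indices alongside the total separates an answer that leans on one passage repeatedly from one that draws on several.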

## Part 4: Automated Testing Suite\n\nNow let's run comprehensive automated tests to evaluate OpenScholar's performance across different configurations:

In [None]:
print(\"Setting up automated testing suite...\")\n\n# Create comprehensive test dataset\ntest_questions = [\n    {\n        \"input\": \"What are the applications of CRISPR gene editing in treating genetic diseases?\",\n        \"question_id\": \"auto_test_1\",\n        \"category\": \"biomedical\"\n    },\n    {\n        \"input\": \"How do transformer neural networks work in natural language processing?\",\n        \"question_id\": \"auto_test_2\",\n        \"category\": \"ai_ml\"\n    },\n    {\n        \"input\": \"What are the environmental impacts of lithium mining for electric vehicle batteries?\",\n        \"question_id\": \"auto_test_3\",\n        \"category\": \"environmental\"\n    },\n    {\n        \"input\": \"How does quantum entanglement enable quantum computing advantages?\",\n        \"question_id\": \"auto_test_4\",\n        \"category\": \"physics\"\n    },\n    {\n        \"input\": \"What are the latest developments in fusion energy research and ITER project?\",\n        \"question_id\": \"auto_test_5\",\n        \"category\": \"energy\"\n    }\n]\n\n# Save automated test data\nwith open('automated_test_input.jsonl', 'w') as f:\n    for item in test_questions:\n        f.write(json.dumps(item) + '\\n')\n\nprint(f\"Created {len(test_questions)} test questions across multiple domains\")\nprint(\"\\nTest Categories:\")\ncategories = {}\nfor q in test_questions:\n    cat = q['category']\n    if cat not in categories:\n        categories[cat] = []\n    categories[cat].append(q['input'][:60] + '...')\n\nfor cat, questions in categories.items():\n    print(f\"  {cat.replace('_', ' ').title()}: {len(questions)} questions\")

### Test Configuration 1: Baseline RAG

In [None]:
print(\"Running Test 1: Baseline RAG...\")\nprint(\"Configuration: Basic retrieval only\")\n\n!python run.py \\\n  --input_file automated_test_input.jsonl \\\n  --model_name gpt-4o-mini \\\n  --api openai \\\n  --api_key_fp openai_key.txt \\\n  --use_contexts \\\n  --output_file test_baseline_results.json \\\n  --sample_k 5\n\nprint(\"Baseline RAG test completed\")

### Test Configuration 2: RAG + Reranker

In [None]:
print(\"Running Test 2: RAG + Reranker...\")\nprint(\"Configuration: Basic retrieval + OpenScholar reranker\")\n\n!python run.py \\\n  --input_file automated_test_input.jsonl \\\n  --model_name gpt-4o-mini \\\n  --api openai \\\n  --api_key_fp openai_key.txt \\\n  --use_contexts \\\n  --ranking_ce \\\n  --reranker OpenScholar/OpenScholar_Reranker \\\n  --output_file test_reranker_results.json \\\n  --sample_k 5\n\nprint(\"Reranker test completed\")

### Test Configuration 3: Score-Based Filtering

In [None]:
print(\"Running Test 3: Score-Based Filtering...\")\nprint(\"Configuration: Reranker + adaptive document filtering\")\n\n!python run.py \\\n  --input_file automated_test_input.jsonl \\\n  --model_name gpt-4o-mini \\\n  --api openai \\\n  --api_key_fp openai_key.txt \\\n  --use_contexts \\\n  --ranking_ce \\\n  --reranker OpenScholar/OpenScholar_Reranker \\\n  --use_score_threshold \\\n  --score_threshold_type percentile_75 \\\n  --output_file test_filtering_results.json \\\n  --sample_k 5\n\nprint(\"Score-based filtering test completed\")
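
The `percentile_75` setting suggests that documents are kept only when their reranker score reaches the 75th percentile of the scores for that query. The repository's actual logic is not shown in this notebook, so the following is only an illustrative sketch: `filter_by_percentile`, the `score` key, and the linear-interpolation percentile are all assumptions.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (0-100) of a non-empty list of numbers."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def filter_by_percentile(docs, pct=75, score_key="score"):
    """Keep documents whose score is at or above the pct-th percentile."""
    if not docs:
        return []
    threshold = percentile([doc[score_key] for doc in docs], pct)
    return [doc for doc in docs if doc[score_key] >= threshold]

# Made-up scores for illustration only.
docs = [{"id": i, "score": s} for i, s in enumerate([0.1, 0.4, 0.5, 0.8, 0.9])]
print(filter_by_percentile(docs))
```

Because the threshold is relative to each query's own score distribution, the number of surviving documents adapts to the query instead of being fixed in advance.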

### Test Configuration 4: Self-Reflective Generation

In [None]:
print(\"Running Test 4: Self-Reflective Generation...\")\nprint(\"Configuration: Score filtering + feedback loop + Semantic Scholar retrieval\")\n\n!python run.py \\\n  --input_file automated_test_input.jsonl \\\n  --model_name gpt-4o-mini \\\n  --api openai \\\n  --api_key_fp openai_key.txt \\\n  --use_contexts \\\n  --ranking_ce \\\n  --reranker OpenScholar/OpenScholar_Reranker \\\n  --use_score_threshold \\\n  --feedback \\\n  --ss_retriever \\\n  --output_file test_feedback_results.json \\\n  --sample_k 5\n\nprint(\"Self-reflective generation test completed\")
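
Conceptually, a self-reflective pipeline adds a critique-and-revise step after the first draft. The loop below is a generic illustration only: `generate`, `critique`, `retrieve`, and `max_rounds` are hypothetical placeholders supplied by the caller, and none of this is meant to mirror the internals of `run.py`.

```python
def self_reflective_answer(question, generate, critique, retrieve, max_rounds=2):
    """Draft an answer, then revise it while the critic keeps returning feedback."""
    passages = retrieve(question)
    answer = generate(question, passages, feedback=None)
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        if not feedback:  # an empty critique means the draft is accepted
            break
        passages = passages + retrieve(feedback)  # fetch extra evidence for the gap
        answer = generate(question, passages, feedback=feedback)
    return answer, passages

# Toy stand-ins so the sketch runs without any API calls.
def toy_retrieve(query):
    return [f"passage about {query}"]

def toy_generate(question, passages, feedback=None):
    return f"answer using {len(passages)} passage(s)"

def toy_critique(question, answer):
    return "" if "2 passage" in answer else "add evidence"

answer, passages = self_reflective_answer("fusion energy", toy_generate, toy_critique, toy_retrieve)
print(answer, len(passages))
```

Capping the loop with `max_rounds` bounds the number of model calls per question, which matters when each round costs an API request.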

### Test Configuration 5: Multi-Source Feedback

In [None]:
print(\"Running Test 5: Multi-Source Feedback...\")\nprint(\"Configuration: All features + multiple retrieval sources\")\n\n# Enhanced run with all available sources\n!python run.py \\\n  --input_file automated_test_input.jsonl \\\n  --model_name gpt-4o-mini \\\n  --api openai \\\n  --api_key_fp openai_key.txt \\\n  --use_contexts \\\n  --ranking_ce \\\n  --reranker OpenScholar/OpenScholar_Reranker \\\n  --use_score_threshold \\\n  --feedback \\\n  --ss_retriever \\\n  --use_pes2o_feedback \\\n  --output_file test_multisource_results.json \\\n  --sample_k 5\n\nprint(\"Multi-source feedback test completed\")
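
The five test cells differ only in a handful of flags, so the commands can also be assembled from one shared base. In the sketch below, `build_command` and `CONFIG_FLAGS` are hypothetical helper names; every flag and file name is copied from the cells above.

```python
import shlex

BASE_ARGS = [
    "python", "run.py",
    "--model_name", "gpt-4o-mini",
    "--api", "openai",
    "--api_key_fp", "openai_key.txt",
    "--use_contexts",
]
RERANK = ["--ranking_ce", "--reranker", "OpenScholar/OpenScholar_Reranker"]

# Extra flags per configuration, copied from the five test cells above.
CONFIG_FLAGS = {
    "baseline": [],
    "reranker": RERANK,
    "filtering": RERANK + ["--use_score_threshold", "--score_threshold_type", "percentile_75"],
    "feedback": RERANK + ["--use_score_threshold", "--feedback", "--ss_retriever"],
    "multisource": RERANK + ["--use_score_threshold", "--feedback", "--ss_retriever",
                             "--use_pes2o_feedback"],
}

def build_command(name, input_file="automated_test_input.jsonl", sample_k=5):
    """Return the argument list for one named test configuration."""
    return (BASE_ARGS + ["--input_file", input_file] + CONFIG_FLAGS[name]
            + ["--output_file", f"test_{name}_results.json", "--sample_k", str(sample_k)])

for name in CONFIG_FLAGS:
    print(shlex.join(build_command(name)))
```

Each list can be passed to `subprocess.run(build_command("baseline"), check=True)` in place of the `!python` line, which makes it straightforward to loop over configurations.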

## Part 5: Comprehensive Results Analysis\n\nNow let's analyze and compare all test configurations:

In [None]:
import json\nimport os\nfrom datetime import datetime\n\ndef evaluate_test_configuration(filename, config_name):\n    \"\"\"Comprehensive evaluation of a test configuration\"\"\"\n    if not os.path.exists(filename):\n        return None\n    \n    with open(filename, 'r') as f:\n        data = json.load(f)\n    \n    # Results may be a bare list or wrapped in a {'data': [...]} object\n    results = data.get('data', data) if isinstance(data, dict) else data\n    \n    if not results:\n        return None\n    \n    # Calculate metrics\n    total_length = sum(len(r.get('output', '')) for r in results)\n    total_citations = sum(r.get('output', '').count('[') + r.get('output', '').count('(') for r in results)\n    total_docs = sum(len(r.get('ctxs', r.get('docs', []))) for r in results)\n    \n    # Calculate quality metrics\n    avg_length = total_length / len(results)\n    avg_citations = total_citations / len(results)\n    avg_docs = total_docs / len(results)\n    \n    # Estimate comprehensiveness\n    comprehensiveness_score = (avg_length / 1000) + (avg_citations * 2)\n    \n    # Share of questions that produced a non-empty answer\n    answered = sum(1 for r in results if r.get('output', '').strip())\n    \n    metrics = {\n        \"configuration\": config_name,\n        \"questions_processed\": len(results),\n        \"avg_answer_length\": round(avg_length, 1),\n        \"avg_citations\": round(avg_citations, 1),\n        \"avg_documents\": round(avg_docs, 1),\n        \"comprehensiveness_score\": round(comprehensiveness_score, 2),\n        \"success_rate\": round(answered / len(results), 2)\n    }\n    \n    return metrics\n\nprint(\"Running comprehensive analysis...\")\nprint(\"=\" * 60)\n\n# Test configurations\nconfigurations = [\n    (\"test_baseline_results.json\", \"1. Baseline RAG\"),\n    (\"test_reranker_results.json\", \"2. RAG + Reranker\"),\n    (\"test_filtering_results.json\", \"3. Score-Based Filtering\"),\n    (\"test_feedback_results.json\", \"4. Self-Reflective Generation\"),\n    (\"test_multisource_results.json\", \"5. Multi-Source Feedback\")\n]\n\nall_metrics = []\nfor filename, config_name in configurations:\n    metrics = evaluate_test_configuration(filename, config_name)\n    if metrics:\n        all_metrics.append(metrics)\n        print(f\"\\n{config_name}:\")\n        print(f\"  Questions processed: {metrics['questions_processed']}\")\n        print(f\"  Avg answer length: {metrics['avg_answer_length']} chars\")\n        print(f\"  Avg citations: {metrics['avg_citations']}\")\n        print(f\"  Avg documents: {metrics['avg_documents']}\")\n        print(f\"  Comprehensiveness: {metrics['comprehensiveness_score']}\")\n    else:\n        print(f\"\\n{config_name}: No results found\")\n\nprint(\"\\n\" + \"=\" * 60)
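
The comprehensiveness score used in the analysis cell is a simple heuristic: average answer length in thousands of characters plus two points per average citation. A small worked example, using made-up inputs and a hypothetical `comprehensiveness` helper, shows how strongly the citation term dominates.

```python
def comprehensiveness(avg_length, avg_citations):
    """Heuristic from the analysis cell: length in thousands of characters plus 2 per citation."""
    return round((avg_length / 1000) + (avg_citations * 2), 2)

# Illustrative inputs only; these are not measured results.
print(comprehensiveness(1500, 4))  # 1.5 + 8 = 9.5
print(comprehensiveness(3000, 1))  # 3.0 + 2 = 5.0
```

A shorter answer with four citations outscores an answer twice as long with one, so the score is best read as a rough proxy for citation density and not as a quality measurement.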

In [None]:
# Generate comprehensive test report\nif all_metrics:\n    # Find best performers\n    best_comprehensive = max(all_metrics, key=lambda x: x['comprehensiveness_score'])\n    best_citation = max(all_metrics, key=lambda x: x['avg_citations'])\n    \n    print(\"\\nPERFORMANCE COMPARISON\")\n    print(\"=\" * 40)\n    print(f\"Most Comprehensive: {best_comprehensive['configuration']}\")\n    print(f\"Most Citations: {best_citation['configuration']}\")\n    \n    # Static guidance; not computed from the metrics above\n    print(\"\\nGENERAL GUIDANCE\")\n    print(\"=\" * 30)\n    print(\"For maximum quality: Use Multi-Source Feedback\")\n    print(\"For balanced performance: Use Score-Based Filtering\")\n    print(\"For speed: Use Baseline RAG\")\n    \nreport = {\n    \"timestamp\": datetime.now().isoformat(),\n    \"test_type\": \"comprehensive_automated_evaluation\",\n    \"total_configurations\": len(all_metrics),\n    \"results\": all_metrics\n}\n\n# Save comprehensive report\nwith open('comprehensive_test_report.json', 'w') as f:\n    json.dump(report, f, indent=2)\n\nprint(f\"\\nTested {len(all_metrics)} configurations\")\nprint(\"Full report saved to: comprehensive_test_report.json\")\nprint(\"\\nAUTOMATED TESTING COMPLETE!\")
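
For a side-by-side view, the per-configuration metrics can be rendered as a Markdown table. This is a sketch with a hypothetical `metrics_table` helper; the column names match the keys of the metrics dictionaries built in the analysis cell, and the example rows are placeholders, not results.

```python
def metrics_table(metrics_list):
    """Render metric dicts as a Markdown table, highest comprehensiveness score first."""
    columns = ["configuration", "avg_answer_length", "avg_citations",
               "avg_documents", "comprehensiveness_score"]
    rows = sorted(metrics_list, key=lambda m: m["comprehensiveness_score"], reverse=True)
    lines = ["| " + " | ".join(columns) + " |",
             "| " + " | ".join("---" for _ in columns) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(row[col]) for col in columns) + " |")
    return "\n".join(lines)

# Placeholder rows for illustration; real values come from all_metrics.
example = [
    {"configuration": "A", "avg_answer_length": 1200.0, "avg_citations": 3.0,
     "avg_documents": 5.0, "comprehensiveness_score": 7.2},
    {"configuration": "B", "avg_answer_length": 2000.0, "avg_citations": 4.0,
     "avg_documents": 8.0, "comprehensiveness_score": 10.0},
]
print(metrics_table(example))
```

In the notebook, `print(metrics_table(all_metrics))` would produce the table for the configurations that actually ran.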

## Conclusion\n\nCongratulations! You've successfully set up and tested OpenScholar v2.0.0 with all its enhanced features.\n\n### What You've Accomplished:\n- Environment Setup: Cloned repository and installed dependencies\n- API Configuration: Set up OpenAI and optional service APIs\n- Quick Demo: Tested basic and enhanced configurations\n- Automated Testing: Evaluated five configurations\n- Performance Analysis: Compared results across configurations\n\n### Next Steps:\n\n#### For Research Use:\n1. **Scale Up**: Process larger datasets using the command-line interface\n2. **Evaluation**: Use ScholarQABench for systematic evaluation\n3. **Tuning**: Experiment with different threshold types and models\n\n#### For Production Use:\n1. **API Integration**: Set up server endpoints for real-time queries\n2. **Caching**: Implement result caching for repeated queries\n3. **Monitoring**: Add logging and performance monitoring\n\n### Documentation & Resources:\n- **Full Documentation**: [GitHub Repository](https://github.com/ryanchen0327/OpenScholarForSciFy)\n- **Score Filtering Guide**: `SCORE_FILTERING_README.md`\n- **Multi-Source Setup**: `MULTI_SOURCE_FEEDBACK_README.md`\n- **Testing Guide**: `AUTOMATED_TESTING_GUIDE.md`\n- **Changelog**: `CHANGELOG.md`\n\n### Tips for Optimal Performance:\n- **For Speed**: Use Baseline RAG or Score-Based Filtering\n- **For Quality**: Use Multi-Source Feedback with all APIs\n- **For Research**: Use Self-Reflective Generation\n- **For Production**: Use Score-Based Filtering with percentile_75\n\n---\n\n**OpenScholar v2.0.0** - Enhanced Scientific Question Answering System\n\n*Happy researching!*