# PubMed Search Testing Notebook

This notebook tests the PubMed search and article retrieval functionality.

## Overview
- Test PubMed API integration
- Search for recent articles
- Retrieve article details
- Validate data quality

In [1]:
# Setup: imports used throughout the notebook.
# The src/ directory is added to sys.path in the next cell.
import asyncio
import json
from pathlib import Path

import pandas as pd

In [2]:
# Import project modules
import sys
from pathlib import Path

# Add src directory to path for imports
notebook_dir = Path().resolve()
src_dir = notebook_dir.parent / "src"
sys.path.insert(0, str(src_dir))

print(f"Notebook directory: {notebook_dir}")
print(f"Source directory: {src_dir}")
print(f"Source exists: {src_dir.exists()}")

# Now import our modules
from pubmed.searcher import PubMedSearcher
from utils.config import load_config
from utils.logger import setup_logger, get_logger

# Setup logging
setup_logger(level="INFO")
logger = get_logger(__name__)

print("✅ All imports successful!")

Notebook directory: /home/santi/Projects/UBMI-IFC-Podcast/notebooks
Source directory: /home/santi/Projects/UBMI-IFC-Podcast/src
Source exists: True
✅ All imports successful!


## 1. Configure PubMed Access

**Important**: Set your email address in `config/config.yaml` or the `.env` file before running the cells below. NCBI requires a valid email for API access.

In [3]:
# Load configuration
config = load_config()
print("PubMed configuration:")
print(f"Email: {config['pubmed']['email']}")
print(f"Base URL: {config['pubmed']['base_url']}")
print(f"Rate limit: {config['pubmed']['rate_limit_delay']}s")
print(f"Max articles per week: {config['pubmed']['max_articles_per_week']}")

if config['pubmed']['email'] == 'your-email@example.com':
    print("\n⚠️  WARNING: Please update your email in config/config.yaml or .env file")
    print("NCBI requires a valid email for API access")

PubMed configuration:
Email: santiago_gr@ciencias.unam.mx
Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
Rate limit: 0.34s
Max articles per week: 1000


In [4]:
# Initialize searcher
searcher = PubMedSearcher(config)
print("PubMed searcher initialized")

PubMed searcher initialized


## 2. Test Basic Search

Start with a simple search to test the API connection.

In [5]:
# Test basic search with neuroscience terms
import asyncio

async def test_basic_search():
    test_terms = ["neuroscience", "physiology"]
    print(f"Testing search with terms: {test_terms}")

    try:
        pmids = await searcher.search_recent_articles(
            query_terms=test_terms,
            days_back=7,
            max_results=10  # Small number for testing
        )
        
        print(f"✅ Search successful! Found {len(pmids)} articles")
        print(f"Sample PMIDs: {pmids[:5]}")
        return pmids
        
    except Exception as e:
        print(f"❌ Search failed: {e}")
        return []

# Run the async function
pmids = await test_basic_search()

2025-09-17 15:56:36 | INFO | pubmed.searcher:search_recent_articles:94 - Searching PubMed with query: "neuroscience"[Abstract] OR "physiology"[Abstract]


Testing search with terms: ['neuroscience', 'physiology']


2025-09-17 15:56:47 | INFO | pubmed.searcher:search_recent_articles:118 - Found 10 articles


✅ Search successful! Found 10 articles
Sample PMIDs: ['18558853', '18284371', '15664172', '32697748', '33848482']
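
The logged query above contains no date clause, and PMID 18558853 is reported later in this notebook with a 2008 publication date, so it is worth checking how `days_back` reaches the API. The sketch below shows one way a relative date window could be expressed with the E-utilities `datetype` and `reldate` parameters. The helper name `build_recent_search_params` and its defaults are hypothetical, and it does not reflect how `PubMedSearcher` is actually implemented.

```python
def build_recent_search_params(query_terms, days_back=7, max_results=10,
                               email="your-email@example.com"):
    """Build esearch parameters limited to the last `days_back` days (hypothetical helper)."""
    # OR-join the quoted terms, here restricted to the Title/Abstract fields
    term = " OR ".join(f'"{t}"[Title/Abstract]' for t in query_terms)
    return {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "xml",
        "datetype": "pdat",    # filter on publication date
        "reldate": days_back,  # keep only records dated within the last N days
        "email": email,
    }

params = build_recent_search_params(["neuroscience", "physiology"], days_back=7)
print(params["term"])
print(params["datetype"], params["reldate"])
```

Passing the resulting dictionary to `esearch.fcgi` and comparing the returned PMIDs with the ones above would show whether the current search is applying the date window.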


In [6]:
# Test with a simpler query (no date filters) to isolate problems in query construction
import aiohttp

async def test_simple_search():
    """Test with a very simple query without date filters"""
    
    # Override the searcher method temporarily for testing
    original_search = searcher.search_recent_articles
    
    async def simple_search_override(query_terms=None, days_back=7, max_results=1000):
        """Simplified search without complex date filters"""
        
        # Simple query without date filters
        if query_terms:
            query = " OR ".join([f'"{term}"[Title/Abstract]' for term in query_terms])
        else:
            query = "neuroscience"
        
        searcher.logger.info(f"Simple search query: {query}")
        
        # Parameters for esearch
        params = {
            'db': 'pubmed',
            'term': query,
            'retmax': max_results,
            'retmode': 'xml',
            'tool': 'ubmi-ifc-podcast',
            'email': searcher.email,
            'sort': 'relevance'
        }
        
        url = f"{searcher.base_url}esearch.fcgi"
        
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, params=params) as response:
                    if response.status == 200:
                        xml_content = await response.text()
                        pmids = searcher._parse_search_results(xml_content)
                        searcher.logger.info(f"Found {len(pmids)} articles")
                        return pmids[:max_results]
                    else:
                        searcher.logger.error(f"Search failed with status {response.status}")
                        error_content = await response.text()
                        searcher.logger.error(f"Error content: {error_content}")
                        return []
            except Exception as e:
                searcher.logger.error(f"Error in simple search: {str(e)}")
                return []
    
    # Temporarily replace the method
    searcher.search_recent_articles = simple_search_override
    
    try:
        pmids = await searcher.search_recent_articles(
            query_terms=["neuroscience"],
            max_results=5
        )
        print(f"Simple search found {len(pmids)} PMIDs: {pmids}")
        return pmids
    finally:
        # Restore original method
        searcher.search_recent_articles = original_search

# Test simple search
simple_pmids = await test_simple_search()

2025-09-17 15:57:54 | INFO | __main__:simple_search_override:19 - Simple search query: "neuroscience"[Title/Abstract]
2025-09-17 15:57:55 | INFO | __main__:simple_search_override:40 - Found 5 articles


Simple search found 5 PMIDs: ['30085354', '29723499', '30522733', '37736162', '34381347']


In [7]:
# Test the PubMed API directly, bypassing PubMedSearcher
import aiohttp

async def test_pubmed_api_direct():
    """Test PubMed API directly to diagnose issues"""
    
    # Placeholder address for this test only; replace it with your actual email
    test_email = "test@example.com"
    
    # Simple search query
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        'db': 'pubmed',
        'term': 'neuroscience[Title]',
        'retmax': 5,
        'retmode': 'xml',
        'tool': 'ifc-podcast-generator',
        'email': test_email
    }
    
    print("Testing direct PubMed API call...")
    print(f"URL: {base_url}")
    print(f"Params: {params}")
    
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(base_url, params=params) as response:
                print(f"Status: {response.status}")
                print(f"Headers: {dict(response.headers)}")
                
                if response.status == 200:
                    content = await response.text()
                    print(f"Response length: {len(content)}")
                    print(f"Response preview: {content[:500]}...")
                    
                    # Try to parse XML
                    from xml.etree import ElementTree as ET
                    try:
                        root = ET.fromstring(content)
                        id_list = root.find('.//IdList')
                        if id_list is not None:
                            pmids = [id_elem.text for id_elem in id_list.findall('Id')]
                            print(f"Found PMIDs: {pmids}")
                        else:
                            print("No IdList found in response")
                    except ET.ParseError as e:
                        print(f"XML parse error: {e}")
                        
                else:
                    error_content = await response.text()
                    print(f"Error response: {error_content}")
                    
        except Exception as e:
            print(f"Request failed: {e}")

# Run direct API test
await test_pubmed_api_direct()

Testing direct PubMed API call...
URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
Params: {'db': 'pubmed', 'term': 'neuroscience[Title]', 'retmax': 5, 'retmode': 'xml', 'tool': 'ifc-podcast-generator', 'email': 'test@example.com'}
Status: 200
Headers: {'Date': 'Wed, 17 Sep 2025 21:58:00 GMT', 'Server': 'Finatra', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Security-Policy': 'upgrade-insecure-requests', 'Referrer-Policy': 'origin-when-cross-origin', 'NCBI-SID': 'E34FD6E3A8540DBB_EB2FSID', 'NCBI-PHID': '1D340F5DAC9273150000169538C3E658.1.1.m_1', 'Content-Type': 'text/xml; charset=UTF-8', 'Cache-Control': 'private', 'Content-Encoding': 'gzip', 'X-RateLimit-Limit': '3', 'X-RateLimit-Remaining': '2', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'X-RateLimit-Limit,X-RateLimit-Remaining', 'Set-Cookie': 'ncbi_sid=E34FD6E3A8540DBB_EB2FSID; domain=.nih.gov; path=/; expires=Thu, 17 Sep 2026 21:58:01 GMT', 'X-UA-Comp
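
The response headers above include `X-RateLimit-Limit` and `X-RateLimit-Remaining`. As a rough sketch, these could be read after each request to see how much of the per-second allowance is left. The helper `rate_limit_status` is hypothetical and only relies on the header names visible in the output above.

```python
def rate_limit_status(headers):
    """Return (limit, remaining) as integers, or None where a header is missing or invalid."""
    def to_int(value):
        try:
            return int(value)
        except (TypeError, ValueError):
            return None

    return (
        to_int(headers.get("X-RateLimit-Limit")),
        to_int(headers.get("X-RateLimit-Remaining")),
    )

# Header values copied from the response above
limit, remaining = rate_limit_status(
    {"X-RateLimit-Limit": "3", "X-RateLimit-Remaining": "2"}
)
print(f"Rate limit: {limit} per window, {remaining} remaining")
```

Inside the `async with session.get(...)` block, the same helper could be called as `rate_limit_status(response.headers)`.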

## 3. Test Article Detail Retrieval

In [8]:
# Test fetching details for found articles
async def test_article_details():
    if pmids:
        print(f"Fetching details for {len(pmids)} articles...")
        
        try:
            articles = await searcher.fetch_article_details(pmids)
            print(f"✅ Retrieved details for {len(articles)} articles")
            
            if articles:
                sample = articles[0]
                print("\n📄 Sample article:")
                print(f"PMID: {sample.pmid}")
                print(f"Title: {sample.title}")
                print(f"Authors: {', '.join(sample.authors[:3]) if sample.authors else 'No authors'}")
                print(f"Journal: {sample.journal}")
                print(f"Publication Date: {sample.publication_date}")
                print(f"DOI: {sample.doi}")
                print(f"Abstract length: {len(sample.abstract) if sample.abstract else 0} characters")
                print(f"MeSH terms: {sample.mesh_terms[:5] if sample.mesh_terms else 'None'}")
                
                if sample.abstract:
                    print(f"\nAbstract preview: {sample.abstract[:200]}...")
            
            return articles
            
        except Exception as e:
            print(f"❌ Detail retrieval failed: {e}")
            return []
    else:
        print("⏭️  Skipping detail retrieval (no PMIDs found)")
        return []

# Run the async function
articles = await test_article_details()

Fetching details for 10 articles...


2025-09-17 15:58:13 | INFO | pubmed.searcher:fetch_article_details:167 - Retrieved details for 10 articles


✅ Retrieved details for 10 articles

📄 Sample article:
PMID: 18558853
Title: Descending pathways in motor control.
Authors: Roger N Lemon
Journal: Annual review of neuroscience
Publication Date: 2008
DOI: 10.1146/annurev.neuro.31.060407.125547
Abstract length: 1036 characters
MeSH terms: ['Animals', 'Biological Evolution', 'Brain', 'Efferent Pathways', 'Humans']

Abstract preview: Each of the descending pathways involved in motor control has a number of anatomical, molecular, pharmacological, and neuroinformatic characteristics. They are differentially involved in motor control...


## 4. Test Different Search Strategies

In [9]:
# Test different search approaches
async def test_search_strategies():
    search_tests = [
        {
            'name': 'Broad biomedical search',
            'terms': None,  # Uses default broad search
            'days': 7,
            'max_results': 5
        },
        {
            'name': 'Specific neuroscience terms',
            'terms': ['hippocampus', 'memory', 'synaptic plasticity'],
            'days': 14,
            'max_results': 5
        },
        {
            'name': 'Cardiovascular research',
            'terms': ['cardiac', 'heart', 'cardiovascular'],
            'days': 7,
            'max_results': 5
        }
    ]

    search_results = {}

    for test in search_tests:
        print(f"\n🔍 Testing: {test['name']}")
        
        try:
            test_pmids = await searcher.search_recent_articles(
                query_terms=test['terms'],
                days_back=test['days'],
                max_results=test['max_results']
            )
            
            search_results[test['name']] = {
                'pmids': test_pmids,
                'count': len(test_pmids),
                'terms': test['terms']
            }
            
            print(f"   Found {len(test_pmids)} articles")
            
        except Exception as e:
            print(f"   ❌ Failed: {e}")
            search_results[test['name']] = {'pmids': [], 'count': 0, 'error': str(e)}

    # Summary
    print("\n📊 Search Results Summary:")
    for name, result in search_results.items():
        print(f"   {name}: {result['count']} articles")
    
    return search_results

# Run the async function
search_results = await test_search_strategies()

2025-09-17 15:58:21 | INFO | pubmed.searcher:search_recent_articles:94 - Searching PubMed with query: (humans[MeSH Terms]) AND (english[Language])



🔍 Testing: Broad biomedical search


2025-09-17 15:58:22 | INFO | pubmed.searcher:search_recent_articles:118 - Found 5 articles
2025-09-17 15:58:22 | INFO | pubmed.searcher:search_recent_articles:94 - Searching PubMed with query: "hippocampus"[Abstract] OR "memory"[Abstract] OR "synaptic plasticity"[Abstract]


   Found 5 articles

🔍 Testing: Specific neuroscience terms


2025-09-17 15:58:22 | INFO | pubmed.searcher:search_recent_articles:118 - Found 5 articles
2025-09-17 15:58:22 | INFO | pubmed.searcher:search_recent_articles:94 - Searching PubMed with query: "cardiac"[Abstract] OR "heart"[Abstract] OR "cardiovascular"[Abstract]


   Found 5 articles

🔍 Testing: Cardiovascular research


2025-09-17 15:58:23 | INFO | pubmed.searcher:search_recent_articles:118 - Found 5 articles


   Found 5 articles

📊 Search Results Summary:
   Broad biomedical search: 5 articles
   Specific neuroscience terms: 5 articles
   Cardiovascular research: 5 articles


## 5. Data Quality Analysis

In [10]:
# Analyze data quality if we have articles
if articles:
    print("📊 Data Quality Analysis:")
    
    # Convert to DataFrame for analysis
    df_data = []
    for article in articles:
        df_data.append({
            'pmid': article.pmid,
            'title_length': len(article.title) if article.title else 0,
            'has_abstract': bool(article.abstract),
            'abstract_length': len(article.abstract) if article.abstract else 0,
            'author_count': len(article.authors) if article.authors else 0,
            'has_doi': bool(article.doi),
            'mesh_term_count': len(article.mesh_terms) if article.mesh_terms else 0,
            'journal': article.journal
        })
    
    df = pd.DataFrame(df_data)
    
    print(f"\nTotal articles analyzed: {len(df)}")
    print(f"Articles with abstracts: {df['has_abstract'].sum()} ({df['has_abstract'].mean()*100:.1f}%)")
    print(f"Articles with DOI: {df['has_doi'].sum()} ({df['has_doi'].mean()*100:.1f}%)")
    print(f"Average abstract length: {df['abstract_length'].mean():.0f} characters")
    print(f"Average author count: {df['author_count'].mean():.1f}")
    print(f"Average MeSH terms: {df['mesh_term_count'].mean():.1f}")
    
    # Top journals
    top_journals = df['journal'].value_counts().head()
    print("\nTop journals:")
    for journal, count in top_journals.items():
        print(f"   {journal}: {count}")
        
else:
    print("⏭️  No articles available for quality analysis")

📊 Data Quality Analysis:

Total articles analyzed: 10
Articles with abstracts: 8 (80.0%)
Articles with DOI: 10 (100.0%)
Average abstract length: 551 characters
Average author count: 2.3
Average MeSH terms: 9.0

Top journals:
   Current biology : CB: 3
   Annual review of neuroscience: 2
   Neuron: 1
   Acta pharmaceutica (Zagreb, Croatia): 1
   Science (New York, N.Y.): 1


## 6. Save Test Data

In [12]:
# Save test results
output_dir = Path("../data/raw")
output_dir.mkdir(parents=True, exist_ok=True)

if articles:
    # Save articles
    searcher.save_articles(articles, output_dir / "test_pubmed_articles.json")
    print(f"💾 Saved {len(articles)} test articles")
    
    # Save search results summary
    summary = {
        'timestamp': pd.Timestamp.now().isoformat(),
        'total_articles': len(articles),
        'search_results': search_results,
        'quality_metrics': {
            'articles_with_abstracts': int(df['has_abstract'].sum()),
            'articles_with_doi': int(df['has_doi'].sum()),
            'avg_abstract_length': float(df['abstract_length'].mean()),
            'avg_author_count': float(df['author_count'].mean())
        } if 'df' in locals() else {}
    }
    
    with open(output_dir / "pubmed_test_summary.json", 'w') as f:
        json.dump(summary, f, indent=2, default=str)
    
    print("💾 Saved test summary")
else:
    print("⏭️  No data to save")

2025-09-17 15:59:15 | INFO | pubmed.searcher:save_articles:325 - Saved 10 articles to ../data/raw/test_pubmed_articles.json


💾 Saved 10 test articles
💾 Saved test summary


## 7. Test Rate Limiting

In [13]:
# Test rate limiting with multiple requests
async def test_rate_limiting():
    print("🕐 Testing rate limiting...")

    import time

    rate_test_results = []
    start_time = time.time()

    for i in range(3):  # Test 3 requests
        request_start = time.time()
        
        try:
            test_pmids = await searcher.search_recent_articles(
                query_terms=["test"],
                days_back=30,
                max_results=2
            )
            
            request_time = time.time() - request_start
            rate_test_results.append({
                'request': i+1,
                'time': request_time,
                'pmids_found': len(test_pmids),
                'success': True
            })
            
            print(f"   Request {i+1}: {request_time:.2f}s, {len(test_pmids)} PMIDs")
            
        except Exception as e:
            rate_test_results.append({
                'request': i+1,
                'error': str(e),
                'success': False
            })
            print(f"   Request {i+1}: Failed - {e}")

    total_time = time.time() - start_time
    print(f"\nTotal time for {len(rate_test_results)} requests: {total_time:.2f}s")
    print(f"Average time per request: {total_time/len(rate_test_results):.2f}s")

    successful_requests = [r for r in rate_test_results if r['success']]
    print(f"Successful requests: {len(successful_requests)}/{len(rate_test_results)}")
    
    return rate_test_results

# Run the async function
rate_test_results = await test_rate_limiting()

2025-09-17 15:59:17 | INFO | pubmed.searcher:search_recent_articles:94 - Searching PubMed with query: "test"[Abstract]


🕐 Testing rate limiting...


2025-09-17 15:59:18 | INFO | pubmed.searcher:search_recent_articles:118 - Found 2 articles
2025-09-17 15:59:18 | INFO | pubmed.searcher:search_recent_articles:94 - Searching PubMed with query: "test"[Abstract]


   Request 1: 0.53s, 2 PMIDs


2025-09-17 15:59:19 | INFO | pubmed.searcher:search_recent_articles:118 - Found 2 articles
2025-09-17 15:59:19 | INFO | pubmed.searcher:search_recent_articles:94 - Searching PubMed with query: "test"[Abstract]


   Request 2: 0.52s, 2 PMIDs


2025-09-17 15:59:19 | INFO | pubmed.searcher:search_recent_articles:118 - Found 2 articles


   Request 3: 0.44s, 2 PMIDs

Total time for 3 requests: 1.49s
Average time per request: 0.50s
Successful requests: 3/3
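
The timings above include network latency, so they do not show on their own whether the configured `rate_limit_delay` (0.34 s) is being enforced between requests. The sketch below is one minimal way to enforce a minimum interval between calls. The class `MinIntervalLimiter` is hypothetical and is not taken from `PubMedSearcher`. The demo uses a short 0.05 s interval so it finishes quickly.

```python
import asyncio
import time

class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval=0.34):
        self.min_interval = min_interval
        self._last = None            # monotonic time of the previous call, if any
        self._lock = asyncio.Lock()  # serialises concurrent callers

    async def wait(self):
        async with self._lock:
            if self._last is not None:
                remaining = self.min_interval - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()

async def demo():
    limiter = MinIntervalLimiter(min_interval=0.05)
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait()  # a real request would follow each wait
    return time.monotonic() - start

# In a notebook, use `elapsed = await demo()` instead of asyncio.run()
elapsed = asyncio.run(demo())
print(f"3 calls took {elapsed:.2f}s")
```

The first call passes immediately and each later call waits out the remainder of the interval, so three calls take roughly two intervals in total.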


## Next Steps

1. **Configure email**: Set a valid email address in `config/config.yaml` or the `.env` file.
2. **API key**: Consider requesting an NCBI API key, which raises the E-utilities limit from 3 to 10 requests per second.
3. **Search optimization**: Fine-tune search terms based on IFC research areas.
4. **Error handling**: Test how the system handles network failures, timeouts, and rate-limit responses.
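
As a sketch of the API key step, the helper below adds the E-utilities `api_key` parameter to a request and picks a matching delay. The helper name `apply_api_key`, the `NCBI_API_KEY` environment variable, and the two delay values (chosen to stay just under 3 and 10 requests per second) are assumptions for illustration, not part of the project's configuration.

```python
import os

def apply_api_key(params, api_key=None):
    """Return (params, delay): a copy of params with api_key added when one is available.

    If api_key is None, the NCBI_API_KEY environment variable (assumed name) is used.
    Pass an empty string to force a keyless request.
    """
    key = api_key if api_key is not None else os.environ.get("NCBI_API_KEY")
    updated = dict(params)  # leave the caller's dictionary untouched
    if key:
        updated["api_key"] = key
    # Illustrative delays: about 9 requests/s with a key, about 3 requests/s without
    delay = 0.11 if key else 0.34
    return updated, delay

base = {"db": "pubmed", "term": "neuroscience[Title]", "retmax": 5}
with_key, fast_delay = apply_api_key(base, api_key="example-key")
without_key, slow_delay = apply_api_key(base, api_key="")
print(sorted(with_key), fast_delay)
print(sorted(without_key), slow_delay)
```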

## Common Issues
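- **Email required**: NCBI requires a valid email address for API access.
- **Rate limiting**: Without an API key, NCBI allows 3 requests per second (see `X-RateLimit-Limit` in the direct API test); requests over the limit are rejected.
- **Network timeouts**: Large requests may time out.
- **XML parsing**: Malformed XML responses can cause parsing errors.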
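
For the XML parsing issue, a defensive parser can return an empty list instead of raising. The sketch below mirrors the `IdList`/`Id` lookup used in the direct API test. The function name `parse_pmids` is hypothetical and is not the project's `_parse_search_results`.

```python
from xml.etree import ElementTree as ET

def parse_pmids(xml_text):
    """Extract PMIDs from an esearch XML response; return [] if the XML is unusable."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []  # malformed or truncated XML
    id_list = root.find(".//IdList")
    if id_list is None:
        return []  # well-formed XML without an IdList element
    # Skip empty <Id/> elements so callers only receive real identifiers
    return [elem.text for elem in id_list.findall("Id") if elem.text]

good = "<eSearchResult><IdList><Id>18558853</Id><Id>18284371</Id></IdList></eSearchResult>"
print(parse_pmids(good))
print(parse_pmids("<eSearchResult><IdList><Id>185"))
```

Returning an empty list keeps callers simple, but it also hides the failure, so logging the parse error before returning may be preferable in the real searcher.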