# SciTeX Web Tutorial

This notebook demonstrates how to use the `scitex.web` module for web-based scientific research and data collection. The examples use mock data wherever a live call would need internet access or an API key.

## Features Covered

* PubMed database searching and article retrieval
* Automated BibTeX generation from scientific articles
* Web content extraction and summarization
* URL crawling with content analysis
* Scientific literature management
* CrossRef metrics integration
* Asynchronous batch processing

## 1. Basic Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scitex import web as stx_web
import requests
import json
from datetime import datetime
import asyncio

print("SciTeX Web Tutorial")
print("Available functions:")
available_functions = [func for func in dir(stx_web) if not func.startswith('_')]
for i, func in enumerate(available_functions):
    if i % 3 == 0:
        print()
    print(f"{func:<25}", end="")
print()

# Note: Some functions require internet access and API keys
print("\n⚠️ Note: This tutorial demonstrates web functionality.")
print("   Some examples may require internet access and API keys.")
print("   Examples are designed to work with mock data when needed.")
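
The note above says that examples fall back to mock data when needed. One way to make that decision explicit is to check whether a package can be imported before relying on it. The following is a minimal sketch; the helper name `is_available` is illustrative and is not part of `scitex`.

```python
import importlib.util


def is_available(package_name):
    """Return True if a top-level package can be imported in this environment."""
    # find_spec returns None when no top-level module of that name is installed.
    return importlib.util.find_spec(package_name) is not None


# Report which of the packages used in this notebook are present.
for name in ("scitex", "requests", "bs4"):
    status = "available" if is_available(name) else "missing, use mock data"
    print(f"{name:<10}: {status}")
```

The check only covers top-level package names; it does not verify API keys or network access.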

## 2. PubMed Literature Search

### Basic PubMed Search

In [None]:
# Demonstrate PubMed search functionality
print("PubMed Literature Search Examples:")
print("=" * 34)

# Example search queries for different research areas
research_queries = {
    "Neuroscience": "epilepsy prediction machine learning",
    "AI/ML": "deep learning medical imaging", 
    "Statistics": "bayesian analysis clinical trials",
    "Bioinformatics": "genomic data analysis python",
    "Physics": "quantum computing algorithms"
}

print("Research Query Examples:")
for field, query in research_queries.items():
    print(f"{field:<15}: '{query}'")

# Demonstrate search functionality (mock example)
def mock_pubmed_search(query, n_entries=5):
    """Mock PubMed search for demonstration purposes."""
    print(f"\nSearching PubMed for: '{query}'")
    print(f"Requested entries: {n_entries}")
    
    # Simulate search results
    mock_results = {
        "query": query,
        "total_found": np.random.randint(50, 500),
        "retrieved": n_entries,
        "status": "success",
        "bibtex_file": f"pubmed_{query.replace(' ', '_')}.bib"
    }
    
    print(f"✓ Found {mock_results['total_found']} articles")
    print(f"✓ Retrieved {mock_results['retrieved']} detailed entries")
    print(f"✓ Saved to: {mock_results['bibtex_file']}")
    
    return mock_results

# Example searches
search_examples = [
    ("machine learning healthcare", 10),
    ("CRISPR gene editing", 5),
    ("neural networks signal processing", 8)
]

search_results = []
for query, n_entries in search_examples:
    result = mock_pubmed_search(query, n_entries)
    search_results.append(result)

# Summary statistics
total_articles = sum(r['total_found'] for r in search_results)
total_retrieved = sum(r['retrieved'] for r in search_results)

print(f"\nSearch Summary:")
print(f"Total queries: {len(search_results)}")
print(f"Total articles found: {total_articles:,}")
print(f"Total articles retrieved: {total_retrieved}")
print(f"Average articles per query: {total_articles/len(search_results):.1f}")
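
The mock above stands in for a real search. For reference, NCBI exposes PubMed search through its public E-utilities `esearch` endpoint, and a request URL could be assembled as in the sketch below. This is not necessarily how `scitex.web` performs its searches; the helper name `build_esearch_url` is hypothetical, and the block only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, n_entries=5):
    """Build an esearch URL for PubMed (hypothetical helper, no request is sent)."""
    params = {
        "db": "pubmed",        # database to search
        "term": query,         # free-text query
        "retmax": n_entries,   # maximum number of IDs to return
        "retmode": "json",     # ask for a JSON response
    }
    # urlencode escapes the query; spaces become '+'.
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("CRISPR gene editing", n_entries=5)
print(url)
```

Fetching this URL, for example with `requests.get(url, timeout=10)`, should return JSON whose `esearchresult` object lists the matching PMIDs under `idlist`.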

### BibTeX Generation and Management

In [None]:
# Demonstrate BibTeX generation functionality
print("BibTeX Generation and Management:")
print("=" * 33)

# Mock paper data structure
def create_mock_paper(title, authors, journal, year, pmid, doi=""):
    """Create mock paper data structure."""
    return {
        "title": title,
        "authors": [{"name": author} for author in authors],
        "source": journal,
        "pubdate": f"{year} Jan",
        "pmid": pmid,
        "doi": doi
    }

# Sample scientific papers
sample_papers = {
    "12345678": create_mock_paper(
        "Deep Learning Approaches for Medical Image Analysis",
        ["Smith J", "Johnson A", "Williams B"],
        "Nature Medicine",
        "2023",
        "12345678",
        "10.1038/s41591-023-12345"
    ),
    "23456789": create_mock_paper(
        "Machine Learning in Genomics: A Comprehensive Review",
        ["Brown C", "Davis D"],
        "Nature Genetics", 
        "2023",
        "23456789",
        "10.1038/s41588-023-23456"
    ),
    "34567890": create_mock_paper(
        "Statistical Methods for Clinical Trial Analysis",
        ["Wilson E", "Miller F", "Moore G", "Taylor H"],
        "The Lancet",
        "2024",
        "34567890",
        "10.1016/S0140-6736(24)34567"
    )
}

# Mock abstract data
sample_abstracts = {
    "12345678": (
        "Deep learning has revolutionized medical image analysis by providing automated and accurate diagnostic tools. This review examines recent advances in convolutional neural networks for radiology applications.",
        ["Deep Learning", "Medical Imaging", "Radiology", "Artificial Intelligence"],
        "10.1038/s41591-023-12345"
    ),
    "23456789": (
        "Genomic data analysis has been transformed by machine learning algorithms. This comprehensive review covers applications in variant calling, gene expression analysis, and personalized medicine.",
        ["Machine Learning", "Genomics", "Bioinformatics", "Personalized Medicine"],
        "10.1038/s41588-023-23456"
    ),
    "34567890": (
        "Statistical methods form the backbone of clinical trial design and analysis. This paper reviews modern approaches including Bayesian methods and adaptive trial designs.",
        ["Statistics", "Clinical Trials", "Bayesian Analysis", "Study Design"],
        "10.1016/S0140-6736(24)34567"
    )
}

# Demonstrate BibTeX formatting
print("Sample BibTeX Entries:")
print("-" * 25)

for pmid, paper in sample_papers.items():
    abstract_data = sample_abstracts.get(pmid, ("", [], ""))
    
    print(f"\nPaper: {paper['title'][:50]}...")
    print(f"Authors: {', '.join([a['name'] for a in paper['authors']])}")
    print(f"Journal: {paper['source']}")
    print(f"Year: {paper['pubdate'].split()[0]}")
    print(f"PMID: {pmid}")
    print(f"DOI: {paper.get('doi', 'N/A')}")
    print(f"Keywords: {', '.join(abstract_data[1])}")
    print(f"Abstract: {abstract_data[0][:100]}...")

# Mock CrossRef metrics
def mock_crossref_metrics(doi):
    """Mock CrossRef metrics for demonstration."""
    if not doi:
        return {}
    
    return {
        "citations": np.random.randint(0, 150),
        "type": "journal-article",
        # DOIs do not contain the publisher name, so match on the registrant prefix (10.1038 is Nature)
        "publisher": "Nature Publishing Group" if doi.startswith("10.1038/") else "Elsevier",
        "references": np.random.randint(20, 80),
        "doi": doi
    }

print("\n\nCrossRef Metrics:")
print("-" * 18)

for pmid, paper in sample_papers.items():
    doi = paper.get('doi', '')
    metrics = mock_crossref_metrics(doi)
    if metrics:
        print(f"\n{paper['title'][:40]}...")
        print(f"  Citations: {metrics['citations']}")
        print(f"  References: {metrics['references']}")
        print(f"  Publisher: {metrics['publisher']}")
        print(f"  Type: {metrics['type']}")
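
The cell above prints the fields of each paper but not the BibTeX itself. As a minimal sketch, a record shaped like the output of `create_mock_paper` can be turned into an `@article` entry as follows. The function name `paper_to_bibtex` and the citation-key scheme (`pmid` followed by the PMID) are illustrative assumptions, not the format `scitex.web` produces.

```python
def paper_to_bibtex(paper):
    """Format one paper dict as a BibTeX @article entry (illustrative sketch)."""
    authors = " and ".join(a["name"] for a in paper["authors"])
    year = paper["pubdate"].split()[0]  # "2023 Jan" -> "2023"
    fields = [
        ("title", paper["title"]),
        ("author", authors),
        ("journal", paper["source"]),
        ("year", year),
        ("pmid", paper["pmid"]),
    ]
    # Only emit a doi field when one is present.
    if paper.get("doi"):
        fields.append(("doi", paper["doi"]))
    body = ",\n".join(f"    {key} = {{{value}}}" for key, value in fields)
    return f"@article{{pmid{paper['pmid']},\n{body}\n}}"


# Same structure as one of the mock records in this section.
example_paper = {
    "title": "Machine Learning in Genomics: A Comprehensive Review",
    "authors": [{"name": "Brown C"}, {"name": "Davis D"}],
    "source": "Nature Genetics",
    "pubdate": "2023 Jan",
    "pmid": "23456789",
    "doi": "10.1038/s41588-023-23456",
}

entry = paper_to_bibtex(example_paper)
print(entry)
```

The sketch does not escape BibTeX special characters such as `&` or `%`; titles containing them would need extra handling.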

## 3. Web Content Extraction and Analysis

### URL Content Extraction

In [None]:
# Demonstrate web content extraction
print("Web Content Extraction and Analysis:")
print("=" * 36)

# Mock HTML content for demonstration
sample_html_content = """
<html>
<head>
    <title>Scientific Computing with Python</title>
</head>
<body>
    <header>
        <nav>Navigation menu</nav>
    </header>
    <main>
        <h1>Introduction to Scientific Computing</h1>
        <p>Scientific computing combines mathematical models, quantitative analysis 
        techniques, and computer programming to solve complex scientific problems.</p>
        
        <h2>Key Libraries</h2>
        <ul>
            <li>NumPy: Numerical computing with arrays</li>
            <li>SciPy: Scientific computing algorithms</li>
            <li>Pandas: Data manipulation and analysis</li>
            <li>Matplotlib: Plotting and visualization</li>
        </ul>
        
        <h2>Applications</h2>
        <p>Scientific computing is used in physics simulations, bioinformatics,
        climate modeling, financial analysis, and machine learning research.</p>
    </main>
    <footer>
        <p>Copyright 2024 Scientific Computing Guide</p>
    </footer>
</body>
</html>
"""

# Mock function to demonstrate content extraction
def mock_extract_main_content(html):
    """Mock content extraction for demonstration."""
    import re
    from bs4 import BeautifulSoup
    
    # Parse HTML
    soup = BeautifulSoup(html, 'html.parser')
    
    # Drop non-content elements (nav, footer, script, style) before reading the text
    for element in soup(['nav', 'footer', 'script', 'style']):
        element.decompose()
    
    # Get text content
    text = soup.get_text()
    
    # Clean up whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Extract content from sample HTML
extracted_content = mock_extract_main_content(sample_html_content)

print("Original HTML length:", len(sample_html_content))
print("Extracted content length:", len(extracted_content))
print("\nExtracted content:")
print("-" * 20)
print(extracted_content)

# Content analysis
print("\nContent Analysis:")
print("-" * 17)

words = extracted_content.split()
sentences = extracted_content.split('.')
scientific_terms = ['scientific', 'computing', 'numpy', 'scipy', 'pandas', 'matplotlib', 
                   'algorithms', 'analysis', 'research', 'bioinformatics']

# Strip surrounding punctuation, including the colon in list items such as "NumPy:"
term_count = sum(1 for word in words if word.lower().strip('.,:;!?') in scientific_terms)

print(f"Word count: {len(words)}")
print(f"Sentence count: {len([s for s in sentences if s.strip()])}")
print(f"Scientific terms found: {term_count}")
print(f"Scientific term density: {term_count/len(words)*100:.1f}%")

# Key phrases extraction (simple)
key_phrases = []
content_lower = extracted_content.lower()
phrase_patterns = [
    'scientific computing',
    'machine learning',
    'data analysis',
    'numerical computing',
    'climate modeling'
]

for phrase in phrase_patterns:
    if phrase in content_lower:
        key_phrases.append(phrase)

print(f"\nKey phrases found: {', '.join(key_phrases) if key_phrases else 'None'}")
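
`mock_extract_main_content` depends on BeautifulSoup. When that package is not installed, the same idea (skip boilerplate containers, keep the remaining text) can be sketched with the standard library's `html.parser`. The class name `MainTextExtractor` and its skip list are assumptions made for this example; unlike the mock above, it also drops the `<head>` section.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate containers (sketch)."""

    SKIP_TAGS = {"head", "nav", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        # Join the chunks and collapse any run of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


sample = (
    "<html><head><title>Demo</title></head><body>"
    "<nav>Menu</nav><main><h1>Intro</h1><p>Scientific   computing.</p></main>"
    "<footer>Copyright</footer></body></html>"
)
extractor = MainTextExtractor()
extractor.feed(sample)
print(extractor.text())
```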

### Web Crawling and Site Analysis

In [None]:
# Demonstrate web crawling concepts
print("Web Crawling and Site Analysis:")
print("=" * 31)

# Mock website structure for demonstration
mock_website = {
    "https://scientific-computing.org/": {
        "title": "Scientific Computing Hub",
        "content": "Welcome to the scientific computing community. Explore tutorials, research papers, and tools for computational science.",
        "links": [
            "https://scientific-computing.org/tutorials/",
            "https://scientific-computing.org/papers/",
            "https://scientific-computing.org/tools/"
        ],
        "content_type": "homepage"
    },
    "https://scientific-computing.org/tutorials/": {
        "title": "Tutorials - Scientific Computing",
        "content": "Learn scientific computing with Python. Topics include NumPy, SciPy, machine learning, and data visualization.",
        "links": [
            "https://scientific-computing.org/tutorials/numpy/",
            "https://scientific-computing.org/tutorials/scipy/",
            "https://scientific-computing.org/tutorials/ml/"
        ],
        "content_type": "tutorial_index"
    },
    "https://scientific-computing.org/papers/": {
        "title": "Research Papers - Scientific Computing",
        "content": "Collection of research papers on computational methods, algorithms, and applications in scientific domains.",
        "links": [],
        "content_type": "research"
    },
    "https://scientific-computing.org/tools/": {
        "title": "Tools - Scientific Computing",
        "content": "Software tools and libraries for scientific computing. Download guides, documentation, and examples.",
        "links": [],
        "content_type": "tools"
    }
}

def mock_crawl_website(start_url, max_depth=1):
    """Mock website crawling for demonstration."""
    visited = set()
    to_visit = [(start_url, 0)]
    crawl_results = {}
    
    while to_visit:
        current_url, depth = to_visit.pop(0)
        
        if current_url in visited or depth > max_depth:
            continue
            
        if current_url in mock_website:
            visited.add(current_url)
            page_data = mock_website[current_url]
            crawl_results[current_url] = page_data
            
            # Add linked pages to visit queue
            for link in page_data['links']:
                if link not in visited:
                    to_visit.append((link, depth + 1))
    
    return visited, crawl_results

# Perform mock crawl
start_url = "https://scientific-computing.org/"
max_depth = 2
crawled_urls, crawled_content = mock_crawl_website(start_url, max_depth=max_depth)

print(f"Crawl Results for: {start_url}")
print(f"Pages crawled: {len(crawled_urls)}")
print(f"Max depth: {max_depth}")

print("\nCrawled Pages:")
print("-" * 15)

for url, data in crawled_content.items():
    print(f"\nURL: {url}")
    print(f"Title: {data['title']}")
    print(f"Type: {data['content_type']}")
    print(f"Content: {data['content'][:80]}...")
    print(f"Links found: {len(data['links'])}")

# Content analysis across crawled pages
print("\nSite Analysis:")
print("-" * 14)

total_content = ' '.join([data['content'] for data in crawled_content.values()])
total_words = len(total_content.split())
total_links = sum(len(data['links']) for data in crawled_content.values())

content_types = {}
for data in crawled_content.values():
    content_type = data['content_type']
    content_types[content_type] = content_types.get(content_type, 0) + 1

print(f"Total words across all pages: {total_words}")
print(f"Total internal links: {total_links}")
print(f"Average words per page: {total_words/len(crawled_content):.1f}")
print(f"Page types: {dict(content_types)}")

# Scientific content analysis
scientific_keywords = ['scientific', 'computing', 'research', 'algorithm', 'data', 
                      'analysis', 'python', 'numpy', 'scipy', 'machine', 'learning']

keyword_counts = {}
for keyword in scientific_keywords:
    count = total_content.lower().count(keyword)
    if count > 0:
        keyword_counts[keyword] = count

print(f"\nScientific keyword frequency:")
for keyword, count in sorted(keyword_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {keyword}: {count}")
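
The mock crawler above is a breadth-first traversal with a depth limit. A crawler for real pages usually needs two more steps: resolving relative links against the current page and staying on the starting host. The sketch below shows both with the standard library. The function `bfs_crawl` and its `get_links` callback are hypothetical; the callback stands in for fetching a page and extracting its links, and `example.org` is used purely as a placeholder domain.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def bfs_crawl(start_url, get_links, max_depth=1):
    """Breadth-first crawl limited to the start host and to max_depth (sketch)."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    order = [start_url]               # URLs in the order they were discovered
    queue = deque([(start_url, 0)])   # deque gives O(1) pops from the left

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for href in get_links(url):
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc != start_host:
                continue  # skip external sites
            if absolute not in seen:
                seen.add(absolute)
                order.append(absolute)
                queue.append((absolute, depth + 1))
    return order


# In-memory stand-in for a website: URL -> links found on that page.
pages = {
    "https://example.org/": ["/a/", "https://other.org/x", "b/"],
    "https://example.org/a/": ["/a/deep/"],
}

crawled = bfs_crawl("https://example.org/", lambda u: pages.get(u, []), max_depth=1)
print(crawled)
```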

## 4. Advanced Research Workflows

### Literature Review Pipeline

In [None]:
# Advanced research workflow demonstration
print("Literature Review Pipeline:")
print("=" * 27)

class LiteratureReviewPipeline:
    def __init__(self, research_topic):
        self.research_topic = research_topic
        self.papers = []
        self.keywords = set()
        self.authors = set()
        self.journals = set()
        self.year_range = [float('inf'), 0]
    
    def search_literature(self, queries, max_papers_per_query=10):
        """Search literature across multiple queries."""
        print(f"\nSearching literature for: {self.research_topic}")
        print(f"Queries: {len(queries)}")
        
        for i, query in enumerate(queries, 1):
            print(f"\nQuery {i}: '{query}'")
            
            # Mock search results
            n_found = np.random.randint(20, 100)
            n_retrieved = min(max_papers_per_query, n_found)
            
            print(f"  Found: {n_found} papers")
            print(f"  Retrieved: {n_retrieved} papers")
            
            # Generate mock papers
            for j in range(n_retrieved):
                paper = self._generate_mock_paper(query, j)
                self.papers.append(paper)
                
                # Update metadata
                self.keywords.update(paper['keywords'])
                self.authors.update(paper['authors'])
                self.journals.add(paper['journal'])
                
                year = int(paper['year'])
                self.year_range[0] = min(self.year_range[0], year)
                self.year_range[1] = max(self.year_range[1], year)
        
        print(f"\nTotal papers collected: {len(self.papers)}")
        return len(self.papers)
    
    def _generate_mock_paper(self, query, index):
        """Generate mock paper data."""
        query_words = query.split()
        
        # Generate title based on query
        title_templates = [
            f"Advances in {query_words[0].title()} for {query_words[-1].title()} Applications",
            f"A Novel Approach to {query_words[0].title()} Using {query_words[-1].title()}",
            f"{query_words[0].title()}-Based Methods for {query_words[-1].title()} Analysis",
            f"Comparative Study of {query_words[0].title()} Techniques in {query_words[-1].title()}"
        ]
        
        title = np.random.choice(title_templates)
        
        # Generate authors
        first_names = ['John', 'Jane', 'Michael', 'Sarah', 'David', 'Emily', 'Robert', 'Lisa']
        last_names = ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Garcia', 'Miller', 'Davis']
        
        n_authors = np.random.randint(2, 6)
        authors = [f"{np.random.choice(first_names)} {np.random.choice(last_names)}" 
                  for _ in range(n_authors)]
        
        # Generate other metadata
        journals = ['Nature', 'Science', 'Cell', 'PNAS', 'Journal of AI Research', 
                   'IEEE Transactions', 'JAMA', 'The Lancet']
        
        paper = {
            'title': title,
            'authors': authors,
            'journal': np.random.choice(journals),
            'year': str(np.random.randint(2020, 2025)),
            'citations': np.random.randint(0, 200),
            'keywords': query_words + [np.random.choice(['analysis', 'method', 'study', 'research'])],
            'pmid': f"{np.random.randint(30000000, 40000000)}",
            'doi': f"10.1038/{np.random.randint(1000, 9999)}"
        }
        
        return paper
    
    def analyze_collection(self):
        """Analyze the collected literature."""
        if not self.papers:
            print("No papers to analyze.")
            return
        
        print(f"\nLiterature Collection Analysis:")
        print("=" * 31)
        
        # Basic statistics
        print(f"Total papers: {len(self.papers)}")
        print(f"Unique authors: {len(self.authors)}")
        print(f"Unique journals: {len(self.journals)}")
        print(f"Year range: {self.year_range[0]}-{self.year_range[1]}")
        print(f"Unique keywords: {len(self.keywords)}")
        
        # Citation analysis
        citations = [paper['citations'] for paper in self.papers]
        print(f"\nCitation Statistics:")
        print(f"  Mean citations: {np.mean(citations):.1f}")
        print(f"  Median citations: {np.median(citations):.1f}")
        print(f"  Max citations: {max(citations)}")
        print(f"  Total citations: {sum(citations):,}")
        
        # Year distribution
        years = [int(paper['year']) for paper in self.papers]
        year_counts = {year: years.count(year) for year in set(years)}
        print(f"\nPublications by year:")
        for year in sorted(year_counts.keys()):
            print(f"  {year}: {year_counts[year]} papers")
        
        # Journal distribution
        journal_counts = {}
        for paper in self.papers:
            journal = paper['journal']
            journal_counts[journal] = journal_counts.get(journal, 0) + 1
        
        print(f"\nTop journals:")
        for journal, count in sorted(journal_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
            print(f"  {journal}: {count} papers")
        
        return {
            'total_papers': len(self.papers),
            'citation_stats': {
                'mean': np.mean(citations),
                'median': np.median(citations),
                'max': max(citations),
                'total': sum(citations)
            },
            'year_distribution': year_counts,
            'journal_distribution': journal_counts
        }
    
    def generate_bibliography(self, filename=None):
        """Generate BibTeX bibliography."""
        if filename is None:
            filename = f"{self.research_topic.replace(' ', '_')}_bibliography.bib"
        
        print(f"\nGenerating bibliography: {filename}")
        
        # Mock BibTeX generation
        bibtex_entries = []
        for paper in self.papers:
            authors_str = ' and '.join(paper['authors'])
            
            entry = f"""@article{{{paper['pmid']},
    title = {{{paper['title']}}},
    author = {{{authors_str}}},
    journal = {{{paper['journal']}}},
    year = {{{paper['year']}}},
    doi = {{{paper['doi']}}},
    pmid = {{{paper['pmid']}}}
}}"""
            bibtex_entries.append(entry)
        
        print(f"Generated {len(bibtex_entries)} BibTeX entries")
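        # Join outside the f-string: a backslash inside an f-string expression is a SyntaxError before Python 3.12
        bibliography_text = "\n\n".join(bibtex_entries)
        print(f"Total bibliography size: ~{len(bibliography_text)} characters")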
        
        return filename, bibtex_entries

# Demonstrate literature review pipeline
research_topic = "Machine Learning in Healthcare"
pipeline = LiteratureReviewPipeline(research_topic)

# Define search queries
queries = [
    "machine learning medical diagnosis",
    "deep learning healthcare applications", 
    "artificial intelligence clinical decision",
    "neural networks medical imaging",
    "ML electronic health records"
]

# Execute pipeline
pipeline.search_literature(queries, max_papers_per_query=8)
analysis_results = pipeline.analyze_collection()
bibliography_file, bibtex_entries = pipeline.generate_bibliography()

print(f"\nLiterature review pipeline completed for: {research_topic}")
print(f"Bibliography file name (mock, nothing is written to disk): {bibliography_file}")
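
`generate_bibliography` returns the entries without writing them anywhere. A minimal sketch of that remaining step, writing the entries to a `.bib` file, could look like this. The helper name `write_bibliography` and the two example entries are illustrative; a temporary directory is used so the example leaves no files behind.

```python
import tempfile
from pathlib import Path


def write_bibliography(entries, path):
    """Write BibTeX entries to path, separated by blank lines (sketch)."""
    path = Path(path)
    path.write_text("\n\n".join(entries) + "\n", encoding="utf-8")
    return path


# Placeholder entries standing in for the output of generate_bibliography().
example_entries = [
    "@article{example1,\n    title = {First Example},\n    year = {2023}\n}",
    "@article{example2,\n    title = {Second Example},\n    year = {2024}\n}",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    bib_path = write_bibliography(example_entries, Path(tmp_dir) / "example_bibliography.bib")
    written = bib_path.read_text(encoding="utf-8")
    print(f"Wrote {written.count('@article')} entries to {bib_path.name}")
```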

### Research Trend Analysis

In [None]:
# Research trend analysis visualization
print("Research Trend Analysis:")
print("=" * 24)

# Create visualization of research trends
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Literature Review Analysis Dashboard', fontsize=16, fontweight='bold')

# Plot 1: Publications by year
ax1 = axes[0, 0]
if 'year_distribution' in analysis_results:
    years = sorted(analysis_results['year_distribution'].keys())
    counts = [analysis_results['year_distribution'][year] for year in years]
    
    ax1.bar(years, counts, color='skyblue', alpha=0.7)
    ax1.set_xlabel('Year')
    ax1.set_ylabel('Number of Publications')
    ax1.set_title('Publications by Year')
    ax1.grid(True, alpha=0.3)

# Plot 2: Citation distribution
ax2 = axes[0, 1]
citations = [paper['citations'] for paper in pipeline.papers]
ax2.hist(citations, bins=15, color='lightgreen', alpha=0.7, edgecolor='black')
ax2.set_xlabel('Citations')
ax2.set_ylabel('Number of Papers')
ax2.set_title('Citation Distribution')
ax2.grid(True, alpha=0.3)

# Plot 3: Journal distribution (top 8)
ax3 = axes[1, 0]
if 'journal_distribution' in analysis_results:
    journal_items = sorted(analysis_results['journal_distribution'].items(), 
                          key=lambda x: x[1], reverse=True)[:8]
    journals = [item[0] for item in journal_items]
    paper_counts = [item[1] for item in journal_items]
    
    ax3.barh(range(len(journals)), paper_counts, color='lightcoral', alpha=0.7)
    ax3.set_yticks(range(len(journals)))
    ax3.set_yticklabels([j[:15] + '...' if len(j) > 15 else j for j in journals])
    ax3.set_xlabel('Number of Papers')
    ax3.set_title('Top Journals')
    ax3.grid(True, alpha=0.3, axis='x')

# Plot 4: Research impact metrics
ax4 = axes[1, 1]
if 'citation_stats' in analysis_results:
    stats = analysis_results['citation_stats']
    metrics = ['Mean', 'Median', 'Max']
    values = [stats['mean'], stats['median'], stats['max']]
    
    bars = ax4.bar(metrics, values, color=['gold', 'orange', 'red'], alpha=0.7)
    ax4.set_ylabel('Citations')
    ax4.set_title('Citation Statistics')
    ax4.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for bar, value in zip(bars, values):
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                f'{value:.1f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Summary statistics
print("\nResearch Trend Summary:")
print("=" * 23)

if analysis_results:
    print(f"Dataset: {analysis_results['total_papers']} papers")
    print(f"Average citations per paper: {analysis_results['citation_stats']['mean']:.1f}")
    print(f"Most cited paper: {analysis_results['citation_stats']['max']} citations")
    print(f"Total research impact: {analysis_results['citation_stats']['total']:,} citations")
    
    # Publication trend
    years = sorted(analysis_results['year_distribution'].keys())
    if len(years) > 1:
        recent_years = years[-2:]
        trend = analysis_results['year_distribution'][recent_years[-1]] - analysis_results['year_distribution'][recent_years[0]]
        trend_direction = "increasing" if trend > 0 else "decreasing" if trend < 0 else "stable"
        print(f"Publication trend: {trend_direction} ({trend:+d} papers from {recent_years[0]} to {recent_years[1]})")
    
    # Most productive journal
    top_journal = max(analysis_results['journal_distribution'].items(), key=lambda x: x[1])
    print(f"Most productive journal: {top_journal[0]} ({top_journal[1]} papers)")

print(f"\nLiterature analysis completed for: {research_topic}")
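
The trend summary above compares only the two most recent years, so a single unusual year can flip the result. A least-squares slope over all years gives a steadier estimate. The sketch below computes it with the standard library only; `publication_trend` is a hypothetical helper, and the input years are made-up illustration data.

```python
from collections import Counter


def publication_trend(years):
    """Least-squares slope of papers per year over the observed years (sketch)."""
    counts = Counter(years)
    xs = sorted(counts)
    if len(xs) < 2:
        return 0.0  # a slope needs at least two distinct years
    ys = [counts[x] for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# One paper in 2020, two in 2021, three in 2022.
example_years = [2020, 2021, 2021, 2022, 2022, 2022]
slope = publication_trend(example_years)
print(f"Trend: {slope:+.2f} papers per year")
```

Years with no papers at all are absent from the input and are therefore not counted as zeros; fill them in first if gaps matter for the analysis.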

## 5. Automated Research Assistant

### Research Question to Literature Pipeline

In [None]:
# Automated research assistant demonstration
print("Automated Research Assistant:")
print("=" * 29)

class ResearchAssistant:
    def __init__(self):
        self.research_history = []
        self.knowledge_base = {}
    
    def process_research_question(self, question):
        """Process a research question and generate search strategy."""
        print(f"\nProcessing research question:")
        print(f"'{question}'")
        
        # Extract key concepts (simplified)
        key_concepts = self._extract_key_concepts(question)
        
        # Generate search queries
        search_queries = self._generate_search_queries(key_concepts)
        
        # Estimate search scope
        estimated_papers = self._estimate_search_scope(search_queries)
        
        research_plan = {
            'question': question,
            'key_concepts': key_concepts,
            'search_queries': search_queries,
            'estimated_papers': estimated_papers,
            'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        }
        
        self.research_history.append(research_plan)
        
        return research_plan
    
    def _extract_key_concepts(self, question):
        """Extract key concepts from research question."""
        # Simple keyword extraction (in practice, would use NLP)
        stop_words = {'what', 'how', 'why', 'when', 'where', 'is', 'are', 'the', 
                     'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with'}
        
        words = question.lower().replace('?', '').split()
        concepts = [word for word in words if word not in stop_words and len(word) > 3]
        
        # Group related concepts
        concept_groups = {
            'methods': [],
            'domains': [],
            'outcomes': []
        }
        
        method_terms = ['machine', 'learning', 'deep', 'neural', 'algorithm', 'model', 'analysis']
        domain_terms = ['medical', 'healthcare', 'clinical', 'biological', 'genetic', 'imaging']
        outcome_terms = ['prediction', 'diagnosis', 'treatment', 'outcome', 'effectiveness']
        
        for concept in concepts:
            if any(term in concept for term in method_terms):
                concept_groups['methods'].append(concept)
            elif any(term in concept for term in domain_terms):
                concept_groups['domains'].append(concept)
            elif any(term in concept for term in outcome_terms):
                concept_groups['outcomes'].append(concept)
        
        return concept_groups
    
    def _generate_search_queries(self, key_concepts):
        """Generate optimized search queries."""
        queries = []
        
        # Combine concepts from different groups
        methods = key_concepts.get('methods', [])
        domains = key_concepts.get('domains', [])
        outcomes = key_concepts.get('outcomes', [])
        
        # Generate different query strategies
        if methods and domains:
            for method in methods[:2]:  # Limit to top 2
                for domain in domains[:2]:
                    queries.append(f"{method} {domain}")
        
        if outcomes and domains:
            for outcome in outcomes[:2]:
                for domain in domains[:2]:
                    queries.append(f"{outcome} {domain}")
        
        if methods and outcomes:
            for method in methods[:2]:
                for outcome in outcomes[:2]:
                    queries.append(f"{method} {outcome}")
        
        # Broad queries
        all_concepts = methods + domains + outcomes
        if len(all_concepts) >= 2:
            queries.append(' '.join(all_concepts[:3]))
        
        return list(dict.fromkeys(queries))  # Remove duplicates while keeping a stable order
    
    def _estimate_search_scope(self, queries):
        """Estimate number of papers for each query."""
        estimates = {}
        
        for query in queries:
            # Mock estimation based on query complexity
            base_estimate = 100
            
            # Adjust based on query characteristics
            words = query.split()
            complexity_factor = len(words) * 0.8  # More words = fewer results
            
            # Check for popular terms
            popular_terms = ['machine', 'learning', 'deep', 'neural', 'medical']
            popularity_boost = sum(2 for word in words if word in popular_terms)
            
            estimated = int(base_estimate * (2 - complexity_factor) + popularity_boost * 50)
            estimated = max(10, min(500, estimated))  # Clamp between 10-500
            
            estimates[query] = estimated
        
        return estimates
    
    def execute_research_plan(self, plan, max_papers_per_query=10):
        """Execute the research plan."""
        print(f"\nExecuting research plan...")
        
        # Use the literature review pipeline
        pipeline = LiteratureReviewPipeline(plan['question'])
        pipeline.search_literature(plan['search_queries'], max_papers_per_query)
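        analysis = pipeline.analyze_collection()
        if analysis is None:
            # analyze_collection() returns None when no papers were collected
            raise ValueError("No papers were collected; the plan produced no usable search queries.")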
        
        # Generate summary
        summary = {
            'research_question': plan['question'],
            'papers_found': analysis['total_papers'],
            'key_findings': self._generate_key_findings(analysis),
            'recommendations': self._generate_recommendations(analysis)
        }
        
        return summary, pipeline
    
    def _generate_key_findings(self, analysis):
        """Generate key findings from analysis."""
        findings = []
        
        total_papers = analysis['total_papers']
        avg_citations = analysis['citation_stats']['mean']
        
        findings.append(f"Found {total_papers} relevant papers in the literature")
        findings.append(f"Average paper impact: {avg_citations:.1f} citations")
        
        # Year trend analysis
        years = sorted(analysis['year_distribution'].keys())
        if len(years) > 1:
            recent_growth = analysis['year_distribution'][years[-1]] - analysis['year_distribution'][years[0]]
            if recent_growth > 0:
                findings.append(f"Research activity is increasing ({recent_growth:+d} papers from {years[0]} to {years[-1]})")
            else:
                findings.append("Research activity appears stable or declining")
        
        # Journal analysis
        top_journals = sorted(analysis['journal_distribution'].items(), 
                            key=lambda x: x[1], reverse=True)[:3]
        findings.append(f"Most active journals: {', '.join([j[0] for j in top_journals])}")
        
        return findings
    
    def _generate_recommendations(self, analysis):
        """Generate research recommendations."""
        recommendations = []
        
        total_papers = analysis['total_papers']
        
        if total_papers < 20:
            recommendations.append("Consider expanding search terms - limited literature found")
        elif total_papers > 100:
            recommendations.append("Consider narrowing search scope - large literature volume")
        
        avg_citations = analysis['citation_stats']['mean']
        if avg_citations > 50:
            recommendations.append("High-impact research area - consider focusing on recent developments")
        elif avg_citations < 10:
            recommendations.append("Emerging research area - opportunity for novel contributions")
        
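        # Journal recommendations (skipped when no journals were collected,
        # because max() raises ValueError on an empty sequence)
        if analysis['journal_distribution']:
            top_journal = max(analysis['journal_distribution'].items(), key=lambda x: x[1])
            recommendations.append(f"Consider submitting to {top_journal[0]} (most active in this area)")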
        
        return recommendations

# Demonstrate research assistant
assistant = ResearchAssistant()

# Example research questions
research_questions = [
    "How effective is machine learning for medical diagnosis?",
    "What are the applications of deep learning in healthcare imaging?",
    "Can neural networks predict clinical outcomes?"
]

# Process each question
for question in research_questions:
    print("\n" + "="*60)
    
    # Generate research plan
    plan = assistant.process_research_question(question)
    
    print("\nResearch Plan Generated:")
    print(f"Key concepts: {plan['key_concepts']}")
    print(f"Search queries: {plan['search_queries']}")
    print(f"Estimated papers: {sum(plan['estimated_papers'].values())}")
    
    # Execute plan (first question only for demo)
    if question == research_questions[0]:
        summary, pipeline = assistant.execute_research_plan(plan, max_papers_per_query=6)
        
        print(f"\nResearch Summary:")
        print(f"Papers analyzed: {summary['papers_found']}")
        print(f"\nKey findings:")
        for finding in summary['key_findings']:
            print(f"  • {finding}")
        
        print(f"\nRecommendations:")
        for rec in summary['recommendations']:
            print(f"  • {rec}")

print(f"\n\nResearch assistant processed {len(research_questions)} questions")
print(f"Total research plans in history: {len(assistant.research_history)}")

## 6. Integration and Best Practices

### Complete Research Workflow

In [None]:
# Complete research workflow integration
print("Complete Research Workflow:")
print("=" * 27)

class IntegratedResearchWorkflow:
    def __init__(self, project_name):
        self.project_name = project_name
        self.workflow_steps = []
        self.results = {}
        self.start_time = datetime.now()
    
    def step1_question_formulation(self, research_question):
        """Step 1: Formulate and analyze research question."""
        print(f"\nSTEP 1: Question Formulation")
        print(f"Research Question: {research_question}")
        
        assistant = ResearchAssistant()
        plan = assistant.process_research_question(research_question)
        
        self.results['question_analysis'] = plan
        self.workflow_steps.append('question_formulation')
        
        print(f"✓ Generated {len(plan['search_queries'])} search queries")
        print(f"✓ Identified {len(plan['key_concepts']['methods'] + plan['key_concepts']['domains'])} key concepts")
        
        return plan
    
    def step2_literature_search(self, search_plan, max_papers=30):
        """Step 2: Execute comprehensive literature search."""
        print(f"\nSTEP 2: Literature Search")
        
        pipeline = LiteratureReviewPipeline(self.project_name)
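        # Split the paper budget evenly across queries. Guard against an empty
        # query list (ZeroDivisionError) and a per-query budget that rounds to 0.
        n_queries = max(1, len(search_plan['search_queries']))
        pipeline.search_literature(search_plan['search_queries'],
                                   max_papers_per_query=max(1, max_papers // n_queries))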
        
        self.results['literature_collection'] = pipeline
        self.workflow_steps.append('literature_search')
        
        print(f"✓ Collected {len(pipeline.papers)} papers")
        print(f"✓ Identified {len(pipeline.journals)} unique journals")
        
        return pipeline
    
    def step3_content_analysis(self, pipeline):
        """Step 3: Analyze collected literature."""
        print(f"\nSTEP 3: Content Analysis")
        
        analysis = pipeline.analyze_collection()
        
        # Additional analysis
        collaboration_analysis = self._analyze_collaborations(pipeline.papers)
        temporal_analysis = self._analyze_temporal_trends(pipeline.papers)
        
        self.results['content_analysis'] = {
            'basic_stats': analysis,
            'collaborations': collaboration_analysis,
            'temporal_trends': temporal_analysis
        }
        self.workflow_steps.append('content_analysis')
        
        print(f"✓ Analyzed citation patterns")
        print(f"✓ Identified research trends")
        print(f"✓ Mapped collaboration networks")
        
        return self.results['content_analysis']
    
    def step4_synthesis_report(self, analysis):
        """Step 4: Generate synthesis report."""
        print(f"\nSTEP 4: Synthesis Report Generation")
        
        report = {
            'executive_summary': self._generate_executive_summary(analysis),
            'methodology_review': self._review_methodologies(analysis),
            'gap_analysis': self._identify_research_gaps(analysis),
            'future_directions': self._suggest_future_research(analysis)
        }
        
        self.results['synthesis_report'] = report
        self.workflow_steps.append('synthesis_report')
        
        print(f"✓ Generated executive summary")
        print(f"✓ Identified research gaps")
        print(f"✓ Proposed future directions")
        
        return report
    
    def step5_deliverables(self, pipeline):
        """Step 5: Generate research deliverables."""
        print(f"\nSTEP 5: Deliverable Generation")
        
        # Generate bibliography
        bib_file, bib_entries = pipeline.generate_bibliography()
        
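        # Record deliverables. Only the bibliography comes from the pipeline;
        # the report, data, and chart entries are planned filenames, not written files.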
        deliverables = {
            'bibliography': {
                'filename': bib_file,
                'entries': len(bib_entries)
            },
            'summary_report': f"{self.project_name}_summary.pdf",
            'data_export': f"{self.project_name}_data.csv",
            'visualization': f"{self.project_name}_charts.png"
        }
        
        self.results['deliverables'] = deliverables
        self.workflow_steps.append('deliverables')
        
        print(f"✓ Bibliography: {deliverables['bibliography']['entries']} entries")
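        print(f"✓ Summary report target: {deliverables['summary_report']} (not written in this demo)")
        print(f"✓ Data export target: {deliverables['data_export']} (not written in this demo)")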
        
        return deliverables
    
    def _analyze_collaborations(self, papers):
        """Analyze author collaboration patterns."""
        author_counts = {}
        collaboration_sizes = []
        
        for paper in papers:
            authors = paper['authors']
            collaboration_sizes.append(len(authors))
            
            for author in authors:
                author_counts[author] = author_counts.get(author, 0) + 1
        
        return {
            'top_authors': sorted(author_counts.items(), key=lambda x: x[1], reverse=True)[:10],
            'avg_collaboration_size': np.mean(collaboration_sizes),
            'total_unique_authors': len(author_counts)
        }
    
    def _analyze_temporal_trends(self, papers):
        """Analyze temporal research trends."""
        year_citations = {}
        
        for paper in papers:
            year = int(paper['year'])
            if year not in year_citations:
                year_citations[year] = []
            year_citations[year].append(paper['citations'])
        
        trends = {}
        for year, citations in year_citations.items():
            trends[year] = {
                'papers': len(citations),
                'avg_citations': np.mean(citations),
                'total_citations': sum(citations)
            }
        
        return trends
    
    def _generate_executive_summary(self, analysis):
        """Generate executive summary."""
        basic_stats = analysis['basic_stats']
        
        summary = [
            f"Comprehensive literature review of {basic_stats['total_papers']} papers",
            f"Research spans {len(basic_stats['year_distribution'])} years with {basic_stats['citation_stats']['total']:,} total citations",
            f"Average research impact: {basic_stats['citation_stats']['mean']:.1f} citations per paper",
            f"Published across {len(basic_stats['journal_distribution'])} different journals"
        ]
        
        return summary
    
    def _review_methodologies(self, analysis):
        """Review methodological approaches."""
        # Mock methodology analysis
        methodologies = [
            "Machine learning approaches dominate recent literature",
            "Clinical trials and observational studies provide evidence base",
            "Meta-analyses increasingly common for synthesis",
            "Computational methods expanding rapidly"
        ]
        return methodologies
    
    def _identify_research_gaps(self, analysis):
        """Identify research gaps (mock: returns a fixed list and ignores `analysis`)."""
        gaps = [
            "Limited long-term follow-up studies",
            "Need for larger diverse patient populations",
            "Standardization of evaluation metrics required",
            "Integration with clinical workflows underexplored"
        ]
        return gaps
    
    def _suggest_future_research(self, analysis):
        """Suggest future research directions (mock: returns a fixed list and ignores `analysis`)."""
        directions = [
            "Multi-center collaborative studies",
            "Real-world evidence collection",
            "Ethical framework development",
            "Implementation science research"
        ]
        return directions
    
    def generate_workflow_report(self):
        """Generate complete workflow report."""
        duration = datetime.now() - self.start_time
        # Requires steps 3 and 5 to have run, since their results are read below
        stats = self.results['content_analysis']['basic_stats']
        years = stats['year_distribution'].keys()

        # Derive each status line from the steps actually recorded
        # instead of hardcoding "Completed" for all five
        step_labels = [
            ('question_formulation', 'Question Formulation'),
            ('literature_search', 'Literature Search'),
            ('content_analysis', 'Content Analysis'),
            ('synthesis_report', 'Synthesis Report'),
            ('deliverables', 'Deliverables'),
        ]
        status_lines = "\n".join(
            f"{i}. {label}: {'✓ Completed' if key in self.workflow_steps else '✗ Not run'}"
            for i, (key, label) in enumerate(step_labels, start=1)
        )
        all_done = all(key in self.workflow_steps for key, _ in step_labels)
        status = "WORKFLOW COMPLETED SUCCESSFULLY" if all_done else "WORKFLOW INCOMPLETE"

        report = f"""
RESEARCH WORKFLOW REPORT
========================

Project: {self.project_name}
Duration: {duration.total_seconds():.1f} seconds
Steps Completed: {len(self.workflow_steps)}/{len(step_labels)}

WORKFLOW SUMMARY:
{status_lines}

RESULTS OVERVIEW:
- Papers Analyzed: {stats['total_papers']}
- Research Impact: {stats['citation_stats']['total']:,} total citations
- Time Period: {min(years)}-{max(years)}
- Bibliography Entries: {self.results['deliverables']['bibliography']['entries']}

STATUS: {status}
"""
        return report

# Execute complete workflow
project_name = "AI_Healthcare_Literature_Review"
workflow = IntegratedResearchWorkflow(project_name)

# Execute all workflow steps
research_question = "How can artificial intelligence improve healthcare delivery and patient outcomes?"

step1_plan = workflow.step1_question_formulation(research_question)
step2_pipeline = workflow.step2_literature_search(step1_plan, max_papers=25)
step3_analysis = workflow.step3_content_analysis(step2_pipeline)
step4_report = workflow.step4_synthesis_report(step3_analysis)
step5_deliverables = workflow.step5_deliverables(step2_pipeline)

# Generate final report
final_report = workflow.generate_workflow_report()
print(final_report)

print("\n" + "="*50)
print("COMPLETE RESEARCH WORKFLOW DEMONSTRATED")
print("All steps executed successfully!")
print("="*50)

## 7. Summary

The `scitex.web` module supports web-based scientific research in three areas: literature search, bibliography management, and web content processing.

### Core Functions

**📚 PubMed Integration**
- `search_pubmed(query, n_entries)` - Search PubMed database for scientific articles
- `_search_pubmed()`, `_fetch_details()` - Low-level helpers for PubMed API interaction (underscore-prefixed, so treat them as private)
- `batch_fetch_details()` - Asynchronous batch processing for efficiency
- `get_crossref_metrics()` - Retrieve citation metrics from CrossRef
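
The asynchronous batch pattern mentioned above can be sketched with `asyncio` alone. This is a minimal illustration, not the `scitex.web` implementation: `fake_fetch_details` is a hypothetical stand-in for a real network call, and the concurrency limit of 3 is an arbitrary assumption.

```python
import asyncio


async def fake_fetch_details(pmid: str) -> dict:
    """Hypothetical stand-in for a network request; returns mock metadata."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return {"pmid": pmid, "title": f"Mock article {pmid}"}


async def fetch_batch(pmids, max_concurrent=3):
    """Fetch all PMIDs concurrently, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(pmid):
        async with semaphore:
            return await fake_fetch_details(pmid)

    # asyncio.gather returns results in the same order as its inputs
    return await asyncio.gather(*(bounded_fetch(p) for p in pmids))


records = asyncio.run(fetch_batch(["1001", "1002", "1003", "1004"]))
print([r["pmid"] for r in records])
```

Inside a notebook, where an event loop is already running, `await fetch_batch(...)` would be used in place of `asyncio.run(...)`. The semaphore is one simple way to stay under an API rate limit while still overlapping requests.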

**📄 Bibliography Management**
- `format_bibtex()` - Generate properly formatted BibTeX entries
- `save_bibtex()` - Save bibliography collections to files
- `_get_citation()` - Retrieve official citations from PubMed
- Automatic metadata enrichment with journal metrics
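
To make the BibTeX step concrete, here is a minimal sketch of turning a metadata dictionary into an `@article` entry. The function name `to_bibtex`, the dictionary layout, and the `surname + year` citation key are assumptions for illustration; the actual `format_bibtex()` signature and output may differ. The paper record is a placeholder, not a real reference.

```python
def to_bibtex(entry: dict) -> str:
    """Render a metadata dict as an @article BibTeX entry (illustrative only)."""
    # Authors are assumed to be "Surname, Given" strings
    surname = entry["authors"][0].split(",")[0].strip()
    key = f"{surname.lower()}{entry['year']}"
    fields = {
        "author": " and ".join(entry["authors"]),
        "title": entry["title"],
        "journal": entry["journal"],
        "year": str(entry["year"]),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


# Placeholder record, not a real publication
paper = {
    "authors": ["Doe, Jane", "Roe, Richard"],
    "title": "An Example Title",
    "journal": "Journal of Examples",
    "year": 2023,
}

bib = to_bibtex(paper)
print(bib)
```

A production formatter would also need to escape special LaTeX characters and resolve duplicate citation keys, which this sketch leaves out.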

**🌐 Web Content Processing**
- `extract_main_content()` - Extract main content from HTML pages
- `crawl_url()` - Crawl websites with depth control
- `crawl_to_json()` - Generate structured summaries of crawled content
- `summarize_url()` - Complete URL analysis pipeline
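
The idea behind main-content extraction can be illustrated with the standard library. The sketch below keeps only text inside `<p>` tags and drops scripts, navigation, and footers. The class name `ParagraphExtractor` is hypothetical and the heuristic is deliberately crude; it is not how `extract_main_content()` is implemented.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text found inside <p> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._chunks = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join the pieces and collapse runs of whitespace
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._chunks.append(data)


html = """
<html><head><title>Demo</title><script>var x = 1;</script></head>
<body><nav>Home | About</nav>
<p>First   paragraph of the <b>article</b>.</p>
<p>Second paragraph.</p>
<footer>Copyright notice</footer></body></html>
"""

parser = ParagraphExtractor()
parser.feed(html)
main_text = "\n".join(parser.paragraphs)
print(main_text)
```

Real pages often keep their main text in `<article>` or `<div>` elements, so a practical extractor would need density or readability heuristics beyond this tag filter.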

### Key Features

**🔍 Intelligent Search**
1. **Multi-query strategies**: Generate optimized search terms from research questions
2. **Batch processing**: Efficient handling of large literature collections
3. **Metadata enrichment**: Automatic addition of impact factors and citation metrics
4. **Error resilience**: Graceful handling of network issues and API limits
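
As an illustration of the error-resilience point, the following sketch retries a failing call with exponential backoff. The helper `with_retries`, its defaults, and the simulated `flaky_request` are assumptions made for this example; the module's own retry behavior is not documented here.

```python
import time


def with_retries(func, max_attempts=4, base_delay=0.01,
                 retry_on=(ConnectionError, TimeoutError)):
    """Call `func`, retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base, 2*base, 4*base, ...
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_request():
    """Simulated request that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return {"status": "ok", "attempts": calls["count"]}


result = with_retries(flaky_request)
print(result)
```

The short `base_delay` keeps the demo fast. For a rate-limited API, a delay on the order of seconds is a more realistic starting point.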

**📊 Research Analytics**
- Citation analysis and trend identification
- Author collaboration network mapping
- Journal impact assessment
- Temporal research trend analysis
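
The analytics listed above reduce to a few aggregations over paper records. The sketch below uses mock records and standard-library tools. The key names `citation_stats`, `year_distribution`, and `journal_distribution` mirror those used by the analysis dictionaries in this tutorial, while the `median` entry is an addition for illustration.

```python
from collections import Counter
from statistics import mean, median

# Mock records for demonstration only
papers = [
    {"year": 2021, "journal": "Journal A", "citations": 12},
    {"year": 2022, "journal": "Journal B", "citations": 30},
    {"year": 2022, "journal": "Journal A", "citations": 6},
    {"year": 2023, "journal": "Journal A", "citations": 0},
]

citations = [p["citations"] for p in papers]
analysis = {
    "total_papers": len(papers),
    "citation_stats": {
        "mean": mean(citations),
        "median": median(citations),
        "total": sum(citations),
    },
    # Counter gives per-year and per-journal paper counts
    "year_distribution": Counter(p["year"] for p in papers),
    "journal_distribution": Counter(p["journal"] for p in papers),
}

top_journal, top_count = analysis["journal_distribution"].most_common(1)[0]
print(f"Top journal: {top_journal} ({top_count} papers)")
print(f"Mean citations: {analysis['citation_stats']['mean']:.1f}")
```

Reporting the median next to the mean is useful because citation counts are usually skewed by a few highly cited papers.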

**🤖 Automation Features**
- Research question to search query translation
- Automated literature review pipeline
- Research gap identification
- Future research direction suggestions
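
One simple way to translate a research question into a search query is stopword filtering. The sketch below is an assumption about how such a step could work, with a hypothetical `question_to_query` helper and a small hand-picked stopword set. It is not the logic used by `ResearchAssistant`.

```python
import re

# Small illustrative stopword set; a real system would use a fuller list
STOPWORDS = {
    "how", "what", "can", "is", "are", "the", "of", "for",
    "in", "a", "an", "to", "and", "do", "does",
}


def question_to_query(question: str, max_terms: int = 5) -> str:
    """Keep the first `max_terms` distinct non-stopword terms, in order."""
    words = re.findall(r"[a-z]+", question.lower())
    keywords = [w for w in words if w not in STOPWORDS]
    unique = list(dict.fromkeys(keywords))  # de-duplicate, preserve order
    return " ".join(unique[:max_terms])


query_a = question_to_query("How effective is machine learning for medical diagnosis?")
query_b = question_to_query("Can neural networks predict clinical outcomes?")
print(query_a)
print(query_b)
```

This approach ignores phrases and synonyms, so multi-word concepts such as "machine learning" survive only because their words happen to stay adjacent.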

### Advanced Capabilities

**📈 Literature Review Pipeline**
- Complete end-to-end literature review automation
- Multi-dimensional analysis (citations, trends, collaborations)
- Professional report generation
- Integration with visualization tools

**🔬 Research Assistant**
- Natural language research question processing
- Intelligent search strategy generation
- Automated analysis and synthesis
- Actionable research recommendations

**🌍 Web Intelligence**
- Content extraction with readability optimization
- Multi-page crawling with link analysis
- AI-powered content summarization
- Structured data export

### Use Cases

**📖 Academic Research**
- Systematic literature reviews
- Meta-analysis preparation
- Research gap identification
- Citation network analysis

**💼 Evidence-Based Practice**
- Clinical guideline development
- Treatment effectiveness reviews
- Policy research support
- Best practice identification

**🔍 Competitive Intelligence**
- Technology trend monitoring
- Research landscape mapping
- Collaboration opportunity identification
- Innovation tracking

### Integration Benefits

- **SciTeX Ecosystem**: Seamless integration with other SciTeX modules
- **Asynchronous Processing**: Efficient handling of large-scale operations
- **Export Compatibility**: Standard formats (BibTeX, CSV, JSON)
- **Visualization Ready**: Direct integration with matplotlib and pandas
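
Export to the standard formats named above needs nothing beyond the standard library. The following sketch serializes mock paper records to CSV and JSON strings. The helper name `to_csv` and the record fields are assumptions for illustration, not part of the `scitex.web` API.

```python
import csv
import io
import json

# Mock records for demonstration only
papers = [
    {"title": "Example study one", "year": 2022, "citations": 14},
    {"title": "Example study two", "year": 2023, "citations": 3},
]


def to_csv(records):
    """Serialize a list of uniform dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(papers)
json_text = json.dumps(papers, indent=2)

print(csv_text)
print(json_text)
```

Writing to an in-memory buffer keeps the example free of side effects. Replacing `io.StringIO()` with `open(path, "w", newline="")` would write the same content to disk.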

The module aims to turn manual literature review steps into automated workflows. The examples in this tutorial run on mock data; with live API access, the same pipeline structure can be applied to larger collections, subject to network conditions and API rate limits.