# SciTeX Web Operations Tutorial

This comprehensive notebook demonstrates the SciTeX web utilities module, covering scientific literature search, web scraping, content extraction, and URL processing for research workflows.

## Features Covered

### Academic Literature Search

* PubMed search functionality
* Article metadata retrieval
* BibTeX citation generation
* CrossRef metrics integration

### Web Content Processing

* URL content summarization
* Main content extraction
* Web crawling capabilities
* JSON conversion utilities

### Research Applications

* Literature review automation
* Reference management
* Web-based data collection
* Content analysis pipelines

In [None]:
import sys
sys.path.insert(0, '../src')

import scitex
import pandas as pd
import numpy as np
from pathlib import Path
import json
import time
import requests
from bs4 import BeautifulSoup
import urllib.parse
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# Create output directory for web examples
web_output = Path('./web_examples')
web_output.mkdir(exist_ok=True, parents=True)

print("SciTeX Web Operations Tutorial - Ready to begin!")
print("Note: This tutorial requires internet connection for demonstration")

## Part 1: Academic Literature Search with PubMed

### 1.1 Basic PubMed Search

The SciTeX web module provides tools for searching academic literature. The following cell simulates the structure of a search result rather than querying PubMed:

In [None]:
# Search for recent papers on machine learning in neuroscience
try:
    print("Searching PubMed for machine learning in neuroscience...")

    # Perform search with limited results for demonstration
    search_query = "machine learning neuroscience 2023[PDAT]"

    # Note: In practice, you would use scitex.web.search_pubmed()
    # For demonstration, we'll simulate the search results

    # Simulated search results structure
    search_results = {
        'query': search_query,
        'total_results': 1245,
        'retrieved': 10,
        'articles': [
            {
                'pmid': '37123456',
                'title': 'Deep Learning Approaches for Epilepsy Prediction',
                'authors': ['Smith, J.', 'Johnson, A.', 'Williams, B.'],
                'journal': 'Nature Neuroscience',
                'year': '2023',
                'doi': '10.1038/s41593-023-01234-5',
                'abstract': 'This study presents novel deep learning methods for predicting epileptic seizures...',
                'keywords': ['epilepsy', 'deep learning', 'EEG', 'prediction']
            },
            {
                'pmid': '37123457',
                'title': 'Neural Networks for fMRI Data Analysis',
                'authors': ['Brown, C.', 'Davis, E.', 'Miller, F.'],
                'journal': 'NeuroImage',
                'year': '2023',
                'doi': '10.1016/j.neuroimage.2023.123456',
                'abstract': 'We demonstrate the application of convolutional neural networks to fMRI analysis...',
                'keywords': ['fMRI', 'neural networks', 'brain imaging', 'connectivity']
            },
            {
                'pmid': '37123458',
                'title': 'Machine Learning in Parkinson\'s Disease Research',
                'authors': ['Garcia, H.', 'Lopez, I.', 'Martinez, J.'],
                'journal': 'Brain',
                'year': '2023',
                'doi': '10.1093/brain/awad123',
                'abstract': 'This review examines machine learning applications in Parkinson\'s disease...',
                'keywords': ['Parkinson\'s disease', 'machine learning', 'biomarkers', 'diagnosis']
            }
        ]
    }

    print(f"Search completed successfully!")
    print(f"Query: {search_results['query']}")
    print(f"Total results available: {search_results['total_results']}")
    print(f"Retrieved: {search_results['retrieved']} articles")

    # Display first few results
    print("\nFirst 3 results:")
    for i, article in enumerate(search_results['articles'][:3]):
        print(f"\n{i+1}. {article['title']}")
        print(f"   Authors: {', '.join(article['authors'])}")
        print(f"   Journal: {article['journal']} ({article['year']})")
        print(f"   PMID: {article['pmid']}")
        print(f"   DOI: {article['doi']}")
        print(f"   Keywords: {', '.join(article['keywords'])}")
        print(f"   Abstract: {article['abstract'][:100]}...")

except Exception as e:
    print(f"Search simulation: {e}")
    print("Note: In practice, this would connect to PubMed API")
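
The cell above mentions `scitex.web.search_pubmed()` without showing its signature, so the sketch below does not call it. Instead it illustrates, under stated assumptions, how the same query could be sent to NCBI's public E-utilities `esearch` endpoint using only the standard library. `build_esearch_url` and `parse_esearch_ids` are hypothetical helper names, and the sample payload is hand-written to mirror the JSON layout (`esearchresult.count`, `esearchresult.idlist`); it is not real search output.

```python
import json
from urllib.parse import urlencode

# NCBI E-utilities search endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, retmax=10):
    """Build an esearch URL for PubMed (hypothetical helper)."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_esearch_ids(payload):
    """Return (total_count, list_of_pmids) from an esearch JSON payload."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), list(result["idlist"])


url = build_esearch_url("machine learning neuroscience 2023[PDAT]", retmax=3)
print(url)

# Hand-written payload in the esearch JSON layout; the IDs are placeholders
sample_payload = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'
total, pmids = parse_esearch_ids(sample_payload)
print(total, pmids)
```

Fetching the URL (for example with `requests.get(url, timeout=10)`) is left out so the sketch runs offline. NCBI publishes rate limits for E-utilities, so real use should throttle requests.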

### 1.2 BibTeX Citation Generation
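Titles and journal names taken from search results can contain characters that LaTeX treats specially, and two articles can produce the same citation key. The sketch below shows one possible way to handle both cases. `escape_bibtex` and `unique_key` are hypothetical helpers, not part of SciTeX, and the escape table covers only a few common characters.

```python
# Characters that commonly break BibTeX/LaTeX fields, mapped to escaped forms
_BIBTEX_ESCAPES = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_"}


def escape_bibtex(value):
    """Escape a small set of LaTeX-special characters in a field value."""
    return "".join(_BIBTEX_ESCAPES.get(ch, ch) for ch in value)


def unique_key(base_key, used_keys):
    """Return base_key, or base_key plus a letter suffix if already taken."""
    if base_key not in used_keys:
        used_keys.add(base_key)
        return base_key
    for suffix in "abcdefghijklmnopqrstuvwxyz":
        candidate = f"{base_key}{suffix}"
        if candidate not in used_keys:
            used_keys.add(candidate)
            return candidate
    raise ValueError(f"No free citation key for {base_key!r}")


used = set()
print(escape_bibtex("Brain & Behavior: 50% of cases"))
print(unique_key("Smith2023deep", used))
print(unique_key("Smith2023deep", used))
```

The letter suffix follows the common `Smith2023a`, `Smith2023b` convention; any scheme works as long as keys stay unique within one `.bib` file.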

In [None]:
# Generate BibTeX citations from search results
def generate_bibtex_entry(article):
    """Generate a BibTeX entry for an article."""
    # Build the citation key: first author's last name + year + first three title words
    first_author_last = article['authors'][0].split(',')[0].replace(' ', '')
    year = article['year']
    title_words = article['title'].split()[:3]
    key_words = [w.lower().replace(',', '').replace('.', '') for w in title_words]
    citation_key = f"{first_author_last}{year}{''.join(key_words)}"

    # Format authors and keywords for BibTeX
    authors_bibtex = ' and '.join(article['authors'])
    keywords_bibtex = ', '.join(article['keywords'])

    # Create BibTeX entry. In an f-string, '{{' and '}}' emit literal braces,
    # so '{{{value}}}' renders as '{' + value + '}'.
    bibtex_entry = f"""@article{{{citation_key},
    title = {{{article['title']}}},
    author = {{{authors_bibtex}}},
    journal = {{{article['journal']}}},
    year = {{{article['year']}}},
    doi = {{{article['doi']}}},
    pmid = {{{article['pmid']}}},
    keywords = {{{keywords_bibtex}}}
}}"""
    return bibtex_entry

# Generate BibTeX for all articles
bibtex_entries = []
for article in search_results['articles']:
    bibtex_entry = generate_bibtex_entry(article)
    bibtex_entries.append(bibtex_entry)

# Save to file
bibtex_file = web_output / 'literature_search.bib'
with open(bibtex_file, 'w') as f:
    f.write('\n\n'.join(bibtex_entries))

print(f"Generated BibTeX file: {bibtex_file}")
print(f"Number of entries: {len(bibtex_entries)}")

# Display first entry
print("\nFirst BibTeX entry:")
print(bibtex_entries[0])

### 1.3 Literature Analysis and Visualization

In [None]:
# Analyze search results
articles_df = pd.DataFrame(search_results['articles'])

print("Literature search analysis:")
print(f"Total articles: {len(articles_df)}")
print(f"Unique journals: {articles_df['journal'].nunique()}")
print(f"Year range: {articles_df['year'].min()} - {articles_df['year'].max()}")

# Journal distribution
journal_counts = articles_df['journal'].value_counts()
print("\nJournal distribution:")
for journal, count in journal_counts.items():
    print(f"  {journal}: {count} articles")

# Keyword analysis
all_keywords = []
for keywords in articles_df['keywords']:
    all_keywords.extend(keywords)

keyword_counts = pd.Series(all_keywords).value_counts()
print(f"\nTop keywords:")
for keyword, count in keyword_counts.head(10).items():
    print(f"  {keyword}: {count} occurrences")

# Create visualizations
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Journal distribution
journal_counts.plot(kind='bar', ax=axes[0])
axes[0].set_title('Articles by Journal')
axes[0].set_xlabel('Journal')
axes[0].set_ylabel('Number of Articles')
axes[0].tick_params(axis='x', rotation=45)

# Top keywords
keyword_counts.head(8).plot(kind='bar', ax=axes[1])
axes[1].set_title('Top Keywords')
axes[1].set_xlabel('Keywords')
axes[1].set_ylabel('Frequency')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Save analysis results
analysis_results = {
    'search_query': search_results['query'],
    'total_articles': len(articles_df),
    'unique_journals': articles_df['journal'].nunique(),
    'journal_distribution': journal_counts.to_dict(),
    'top_keywords': keyword_counts.head(10).to_dict(),
    'articles_summary': articles_df[['title', 'journal', 'year', 'pmid']].to_dict('records')
}

analysis_file = web_output / 'literature_analysis.json'
with open(analysis_file, 'w') as f:
    json.dump(analysis_results, f, indent=2)

print(f"\nAnalysis saved to: {analysis_file}")

## Part 2: Web Content Processing and Extraction

### 2.1 URL Content Summarization
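Sentence-based summaries depend on how sentences are split. Text extracted from web pages often separates sentences with newlines rather than with a period followed by a space, so splitting on whitespace after terminal punctuation can be more robust. The following is a minimal sketch of that idea; `split_sentences` and `first_sentences_summary` are hypothetical names, and the regex will mis-split abbreviations such as "e.g.".

```python
import re


def split_sentences(text):
    """Split on whitespace (including newlines) that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def first_sentences_summary(text, max_length=120):
    """Keep whole sentences from the start until max_length would be exceeded.

    The first sentence is always kept, even if it is longer than max_length.
    """
    kept = []
    for sentence in split_sentences(text):
        candidate = " ".join(kept + [sentence])
        if kept and len(candidate) > max_length:
            break
        kept.append(sentence)
    return " ".join(kept)


sample = "Our lab studies the brain.\nWe use EEG data! Results are promising? More work is needed."
print(split_sentences(sample))
print(first_sentences_summary(sample, max_length=50))
```

Cutting at sentence boundaries avoids summaries that end mid-word, at the cost of sometimes returning less text than the length limit allows.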

In [None]:
# Demonstrate URL content extraction and summarization
def extract_main_content(html_content):
    """Extract main content from HTML using BeautifulSoup."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Get text
    text = soup.get_text()

    # Break into lines and remove leading/trailing space
    lines = (line.strip() for line in text.splitlines())

    # Break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))

    # Drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text

def summarize_content(content, max_length=500):
    """Create a summary of content by taking first and key sentences."""
    sentences = content.split('. ')

    if len(content) <= max_length:
        return content

    # Take first few sentences and look for key information
    summary_sentences = sentences[:3]

    # Look for sentences with key scientific terms
    key_terms = ['research', 'study', 'analysis', 'results', 'conclusion', 'method', 'data']

    for sentence in sentences[3:]:
        if any(term in sentence.lower() for term in key_terms):
            summary_sentences.append(sentence)
            if len('. '.join(summary_sentences)) > max_length:
                break

    summary = '. '.join(summary_sentences)
    if len(summary) > max_length:
        summary = summary[:max_length] + '...'

    return summary

# Simulate processing different types of web content
sample_urls = [
    {
        'url': 'https://example-research-site.com/article1',
        'title': 'Machine Learning in Healthcare',
        'content': '''Machine Learning in Healthcare: A Comprehensive Review

        This research article examines the current state of machine learning applications in healthcare.
        The study analyzes over 500 recent papers to identify trends and opportunities.
        Our analysis reveals significant growth in deep learning applications for medical imaging.
        Results show that convolutional neural networks achieve 95% accuracy in radiological diagnosis.
        The conclusion emphasizes the need for better data standardization and ethical frameworks.
        Future research should focus on interpretability and clinical validation of ML models.'''
    },
    {
        'url': 'https://university.edu/neuroscience-lab',
        'title': 'Neuroscience Laboratory Research',
        'content': '''Welcome to the Computational Neuroscience Laboratory

        Our laboratory focuses on understanding brain function through computational modeling.
        We use advanced data analysis techniques to study neural networks and brain connectivity.
        Current research projects include epilepsy prediction using EEG signals.
        Our method combines signal processing with machine learning for real-time seizure detection.
        The research team has published over 50 papers in top-tier journals.
        We collaborate with hospitals to translate our findings into clinical applications.'''
    },
    {
        'url': 'https://conference.org/proceedings',
        'title': 'AI Conference Proceedings',
        'content': '''International Conference on Artificial Intelligence in Medicine 2024

        This conference brings together researchers from around the world to discuss AI in medicine.
        Over 200 presentations cover topics from natural language processing to computer vision.
        Key findings include improved diagnostic accuracy using transformer models.
        The study demonstrates that large language models can assist in medical documentation.
        Results indicate 40% reduction in documentation time for physicians.
        Future work will focus on multimodal AI systems for comprehensive patient care.'''
    }
]

# Process and summarize content
processed_content = []

for item in sample_urls:
    # Extract main content (simulated)
    main_content = extract_main_content(f"<html><body>{item['content']}</body></html>")

    # Create summary
    summary = summarize_content(main_content, max_length=300)

    processed_item = {
        'url': item['url'],
        'title': item['title'],
        'original_length': len(item['content']),
        'summary_length': len(summary),
        'compression_ratio': len(summary) / len(item['content']),
        'summary': summary
    }

    processed_content.append(processed_item)

# Display results
print("Web Content Processing Results:")
print("=" * 50)

for i, item in enumerate(processed_content, 1):
    print(f"\n{i}. {item['title']}")
    print(f"   URL: {item['url']}")
    print(f"   Original length: {item['original_length']} characters")
    print(f"   Summary length: {item['summary_length']} characters")
    print(f"   Compression ratio: {item['compression_ratio']:.2f}")
    print(f"   Summary: {item['summary']}")
    print("-" * 40)

# Save processed content
content_file = web_output / 'processed_web_content.json'
with open(content_file, 'w') as f:
    json.dump(processed_content, f, indent=2)

print(f"\nProcessed content saved to: {content_file}")

### 2.2 Web Crawling and Data Collection
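When crawling real pages, links usually arrive as raw `href` values: relative paths, fragment anchors, duplicates, and links to other hosts. One possible way to turn them into a clean crawl queue, sketched here with the standard library only, is to resolve each link against the page URL, strip the fragment, and keep only same-host HTTP(S) links. `normalize_links` is a hypothetical helper, and the example hrefs are made up for illustration.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_links(page_url, hrefs):
    """Resolve hrefs against page_url, drop fragments, keep same-host http(s) links."""
    base_host = urlparse(page_url).netloc
    seen = set()
    links = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc != base_host:
            continue  # stay on the starting site
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


hrefs = [
    "/publications",
    "projects#epilepsy",
    "https://research-institute.edu/publications",
    "https://other-site.org/page",
    "mailto:info@research-institute.edu",
]
links = normalize_links("https://research-institute.edu/", hrefs)
print(links)
```

Deduplicating before queueing keeps the visited set small and prevents the same page from being fetched once per anchor.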

In [None]:
# Simulate web crawling functionality
def simulate_crawl_url(base_url, max_depth=2):
    """Simulate crawling a website to collect content."""

    # Simulate a research website structure
    simulated_site = {
        'https://research-institute.edu/': {
            'title': 'Research Institute Home',
            'content': 'Welcome to our research institute. We conduct cutting-edge research in AI and neuroscience.',
            'links': [
                'https://research-institute.edu/publications',
                'https://research-institute.edu/projects',
                'https://research-institute.edu/team'
            ]
        },
        'https://research-institute.edu/publications': {
            'title': 'Publications',
            'content': 'Our recent publications include work on deep learning for medical diagnosis and brain-computer interfaces.',
            'links': [
                'https://research-institute.edu/publications/2024',
                'https://research-institute.edu/publications/2023'
            ]
        },
        'https://research-institute.edu/projects': {
            'title': 'Research Projects',
            'content': 'Current projects focus on epilepsy prediction, cancer diagnosis, and neural prosthetics.',
            'links': [
                'https://research-institute.edu/projects/epilepsy',
                'https://research-institute.edu/projects/cancer'
            ]
        },
        'https://research-institute.edu/publications/2024': {
            'title': '2024 Publications',
            'content': 'This year we published 15 papers in top-tier journals including Nature and Science.',
            'links': []
        },
        'https://research-institute.edu/projects/epilepsy': {
            'title': 'Epilepsy Prediction Project',
            'content': 'Using machine learning to predict epileptic seizures from EEG data with 85% accuracy.',
            'links': []
        }
    }

    visited = set()
    to_visit = [(base_url, 0)]
    crawl_results = []

    # Breadth-first traversal; URLs missing from simulated_site are skipped
    while to_visit:
        current_url, depth = to_visit.pop(0)

        if current_url in visited or depth > max_depth:
            continue

        if current_url in simulated_site:
            visited.add(current_url)
            page_data = simulated_site[current_url]

            crawl_results.append({
                'url': current_url,
                'title': page_data['title'],
                'content': page_data['content'],
                'depth': depth,
                'links_found': len(page_data['links']),
                'word_count': len(page_data['content'].split())
            })

            # Add linked pages to visit queue
            for link in page_data['links']:
                if link not in visited:
                    to_visit.append((link, depth + 1))

    return crawl_results

# Perform simulated crawl
print("Simulating web crawl...")
crawl_results = simulate_crawl_url('https://research-institute.edu/', max_depth=2)

print(f"\nCrawl completed!")
print(f"Pages crawled: {len(crawl_results)}")

# Analyze crawl results
crawl_df = pd.DataFrame(crawl_results)

print("\nCrawl Analysis:")
print(f"Total pages: {len(crawl_df)}")
print(f"Max depth reached: {crawl_df['depth'].max()}")
print(f"Total words collected: {crawl_df['word_count'].sum()}")
print(f"Average words per page: {crawl_df['word_count'].mean():.1f}")

# Display crawl results
print("\nCrawled Pages:")
for i, page in enumerate(crawl_results, 1):
    print(f"\n{i}. {page['title']}")
    print(f"   URL: {page['url']}")
    print(f"   Depth: {page['depth']}")
    print(f"   Links found: {page['links_found']}")
    print(f"   Word count: {page['word_count']}")
    print(f"   Content preview: {page['content'][:100]}...")

# Save crawl results
crawl_file = web_output / 'crawl_results.json'
with open(crawl_file, 'w') as f:
    json.dump(crawl_results, f, indent=2)

print(f"\nCrawl results saved to: {crawl_file}")

# Visualize crawl statistics
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Pages by depth
depth_counts = crawl_df['depth'].value_counts().sort_index()
depth_counts.plot(kind='bar', ax=axes[0])
axes[0].set_title('Pages by Crawl Depth')
axes[0].set_xlabel('Depth Level')
axes[0].set_ylabel('Number of Pages')

# Word count distribution
crawl_df['word_count'].hist(bins=5, ax=axes[1])
axes[1].set_title('Word Count Distribution')
axes[1].set_xlabel('Words per Page')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

### 2.3 Content Analysis and Information Extraction

In [None]:
# Analyze collected web content for research insights
import re

def count_term(text_lower, term):
    """Count whole-word, case-insensitive matches of term (optional plural 's').

    Terms are lowercased before matching so entries such as 'EEG' or 'AI' are
    found in lowercased text, and word boundaries keep short terms such as
    'ai' from matching inside longer words such as 'brain'.
    """
    pattern = rf"\b{re.escape(term.lower())}s?\b"
    return len(re.findall(pattern, text_lower))

def extract_research_terms(content):
    """Extract research-related terms from content."""
    research_keywords = [
        'machine learning', 'deep learning', 'neural network', 'artificial intelligence',
        'data analysis', 'statistical analysis', 'research', 'study', 'experiment',
        'clinical trial', 'publication', 'journal', 'conference', 'peer review',
        'neuroscience', 'brain', 'EEG', 'fMRI', 'epilepsy', 'seizure',
        'diagnosis', 'prediction', 'accuracy', 'model', 'algorithm'
    ]

    content_lower = content.lower()
    found_terms = []

    for term in research_keywords:
        count = count_term(content_lower, term)
        if count:
            found_terms.append((term, count))

    return found_terms

def categorize_content(content, title):
    """Categorize content based on research area."""
    categories = {
        'Machine Learning': ['machine learning', 'deep learning', 'neural network', 'AI', 'algorithm'],
        'Neuroscience': ['neuroscience', 'brain', 'EEG', 'fMRI', 'neural', 'seizure', 'epilepsy'],
        'Medical': ['medical', 'clinical', 'diagnosis', 'patient', 'treatment', 'healthcare'],
        'Research': ['research', 'study', 'experiment', 'analysis', 'publication', 'journal'],
        'Data Science': ['data', 'statistics', 'analysis', 'model', 'prediction', 'accuracy']
    }

    content_and_title = (content + ' ' + title).lower()
    category_scores = {}

    for category, keywords in categories.items():
        score = sum(count_term(content_and_title, keyword) for keyword in keywords)
        category_scores[category] = score

    # Return category with highest score
    if max(category_scores.values()) > 0:
        return max(category_scores, key=category_scores.get)
    else:
        return 'General'

# Analyze all collected content
content_analysis = []

# Combine crawl results and processed content
all_content = crawl_results + processed_content

for item in all_content:
    # Crawl results carry 'content'; processed items carry 'summary'
    content = item.get('content', item.get('summary', ''))
    title = item['title']

    # Extract research terms
    research_terms = extract_research_terms(content)

    # Categorize content
    category = categorize_content(content, title)

    # Calculate metrics
    analysis_item = {
        'title': title,
        'url': item.get('url', 'N/A'),
        'category': category,
        'word_count': len(content.split()),
        'research_terms_count': len(research_terms),
        'research_terms': research_terms,
        'top_terms': sorted(research_terms, key=lambda x: x[1], reverse=True)[:5]
    }

    content_analysis.append(analysis_item)

# Create analysis DataFrame
analysis_df = pd.DataFrame(content_analysis)

print("Content Analysis Results:")
print("=" * 50)
print(f"Total content items analyzed: {len(analysis_df)}")

# Category distribution
category_counts = analysis_df['category'].value_counts()
print(f"\nContent categories:")
for category, count in category_counts.items():
    print(f"  {category}: {count} items")

# Research terms analysis
all_research_terms = []
for terms_list in analysis_df['research_terms']:
    all_research_terms.extend(terms_list)

# Count term frequencies
term_frequencies = {}
for term, count in all_research_terms:
    if term in term_frequencies:
        term_frequencies[term] += count
    else:
        term_frequencies[term] = count

# Sort by frequency
sorted_terms = sorted(term_frequencies.items(), key=lambda x: x[1], reverse=True)

print(f"\nTop research terms across all content:")
for term, freq in sorted_terms[:10]:
    print(f"  {term}: {freq} occurrences")

# Display detailed analysis for each item
print(f"\nDetailed Analysis:")
for i, item in enumerate(content_analysis, 1):
    print(f"\n{i}. {item['title']}")
    print(f"   Category: {item['category']}")
    print(f"   Word count: {item['word_count']}")
    print(f"   Research terms found: {item['research_terms_count']}")
    if item['top_terms']:
        top_terms_str = ', '.join([f"{term} ({count})" for term, count in item['top_terms']])
        print(f"   Top terms: {top_terms_str}")

# Save analysis results
analysis_summary = {
    'total_items': len(analysis_df),
    'category_distribution': category_counts.to_dict(),
    'top_research_terms': dict(sorted_terms[:15]),
    'detailed_analysis': content_analysis
}

analysis_file = web_output / 'content_analysis.json'
with open(analysis_file, 'w') as f:
    json.dump(analysis_summary, f, indent=2)

print(f"\nContent analysis saved to: {analysis_file}")

# Visualize analysis results
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Category distribution
category_counts.plot(kind='pie', ax=axes[0], autopct='%1.1f%%')
axes[0].set_title('Content Categories')
axes[0].set_ylabel('')

# Top terms frequency
top_terms_df = pd.Series(dict(sorted_terms[:8]))
top_terms_df.plot(kind='bar', ax=axes[1])
axes[1].set_title('Top Research Terms Frequency')
axes[1].set_xlabel('Terms')
axes[1].set_ylabel('Frequency')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## Part 3: Research Workflow Integration

### 3.1 Automated Literature Review Pipeline
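A review pipeline that runs several queries can retrieve the same article more than once. One possible approach, sketched here under assumptions, is to key each article by DOI when available and by a normalized title otherwise, keeping the copy with the highest relevance score. `normalize_title` and `deduplicate_articles` are hypothetical helpers, and the records are illustrative, not real search results.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate_articles(articles):
    """Keep the highest-relevance copy of each article, keyed by DOI or title."""
    best = {}
    for article in articles:
        key = article.get("doi") or normalize_title(article["title"])
        current = best.get(key)
        if current is None or article.get("relevance", 0) > current.get("relevance", 0):
            best[key] = article
    return list(best.values())


# Illustrative records only; not real search results
articles = [
    {"title": "Deep Learning for Epilepsy Detection", "relevance": 0.95, "query": "q1"},
    {"title": "Deep learning for epilepsy detection.", "relevance": 0.80, "query": "q2"},
    {"title": "BCI for Motor Rehabilitation", "relevance": 0.93, "query": "q3"},
]
unique = deduplicate_articles(articles)
print(len(unique), [a["query"] for a in unique])
```

Title matching is a heuristic: it merges records that differ only in case or punctuation, but it can also merge distinct papers that share a title, so a DOI is the safer key when present.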

In [None]:
# Create an automated literature review pipeline
class LiteratureReviewPipeline:
    def __init__(self, output_dir):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True, parents=True)
        self.search_results = []
        self.processed_articles = []
        self.analysis_results = {}

    def search_literature(self, queries, max_results_per_query=20):
        """Search for literature using multiple queries."""
        print("Searching literature...")

        # Simulate searches for different queries
        simulated_results = {
            'machine learning epilepsy': [
                {'title': 'Deep Learning for Epilepsy Detection', 'journal': 'Nature Medicine', 'year': '2024', 'relevance': 0.95},
                {'title': 'ML-based Seizure Prediction Systems', 'journal': 'Brain', 'year': '2023', 'relevance': 0.89},
                {'title': 'Neural Networks in EEG Analysis', 'journal': 'IEEE TBME', 'year': '2024', 'relevance': 0.82}
            ],
            'AI medical diagnosis': [
                {'title': 'AI in Radiology: Current Status', 'journal': 'Radiology', 'year': '2024', 'relevance': 0.91},
                {'title': 'Computer Vision for Medical Imaging', 'journal': 'Medical Image Analysis', 'year': '2023', 'relevance': 0.87},
                {'title': 'Deep Learning in Pathology', 'journal': 'Nature Digital Medicine', 'year': '2024', 'relevance': 0.84}
            ],
            'brain computer interface': [
                {'title': 'BCI for Motor Rehabilitation', 'journal': 'Nature Neuroscience', 'year': '2024', 'relevance': 0.93},
                {'title': 'Neural Prosthetics Advances', 'journal': 'Science Robotics', 'year': '2023', 'relevance': 0.88},
                {'title': 'BCI Signal Processing Methods', 'journal': 'NeuroImage', 'year': '2024', 'relevance': 0.85}
            ]
        }

        for query in queries:
            if query in simulated_results:
                results = simulated_results[query][:max_results_per_query]
                for offset, result in enumerate(results, start=1):
                    result['query'] = query
                    # Offset by position so each article gets a unique simulated ID
                    # (the IDs also serve as BibTeX keys in export_bibliography)
                    result['pmid'] = f"sim_{len(self.search_results) + offset:06d}"
                self.search_results.extend(results)
                print(f"  {query}: {len(results)} articles found")

        print(f"Total articles found: {len(self.search_results)}")
        return self.search_results

    def filter_by_relevance(self, min_relevance=0.8):
        """Filter articles by relevance score."""
        filtered_results = [article for article in self.search_results if article['relevance'] >= min_relevance]
        print(f"Filtered {len(self.search_results)} articles to {len(filtered_results)} (relevance >= {min_relevance})")
        return filtered_results

    def analyze_trends(self):
        """Analyze trends in the literature."""
        if not self.search_results:
            return {}

        df = pd.DataFrame(self.search_results)

        analysis = {
            'total_articles': len(df),
            'year_distribution': df['year'].value_counts().to_dict(),
            'journal_distribution': df['journal'].value_counts().to_dict(),
            'query_distribution': df['query'].value_counts().to_dict(),
            'average_relevance': df['relevance'].mean(),
            'high_relevance_count': len(df[df['relevance'] >= 0.9]),
            'top_journals': df['journal'].value_counts().head(5).to_dict(),
            'recent_articles': len(df[df['year'] == '2024'])
        }

        self.analysis_results = analysis
        return analysis

    def generate_report(self):
        """Generate a comprehensive literature review report."""
        if not self.analysis_results:
            self.analyze_trends()

        report = f"""# Literature Review Report

## Summary
- Total articles analyzed: {self.analysis_results['total_articles']}
- Average relevance score: {self.analysis_results['average_relevance']:.3f}
- High relevance articles (≥0.9): {self.analysis_results['high_relevance_count']}
- Recent articles (2024): {self.analysis_results['recent_articles']}

## Publication Year Distribution
"""
        for year, count in sorted(self.analysis_results['year_distribution'].items()):
            report += f"- {year}: {count} articles\n"

        report += "\n## Top Journals\n"
        for journal, count in self.analysis_results['top_journals'].items():
            report += f"- {journal}: {count} articles\n"

        report += "\n## Search Query Results\n"
        for query, count in self.analysis_results['query_distribution'].items():
            report += f"- {query}: {count} articles\n"

        # Save report (UTF-8 because the report contains the '≥' character)
        report_file = self.output_dir / 'literature_review_report.md'
        with open(report_file, 'w', encoding='utf-8') as f:
            f.write(report)

        return report, report_file

    def export_bibliography(self):
        """Export bibliography in multiple formats."""
        # Export as CSV
        df = pd.DataFrame(self.search_results)
        csv_file = self.output_dir / 'bibliography.csv'
        df.to_csv(csv_file, index=False)

        # Export as BibTeX (simplified)
        bibtex_entries = []
        for article in self.search_results:
            entry = f"""@article{{{article['pmid']},
    title = {{{article['title']}}},
    journal = {{{article['journal']}}},
    year = {{{article['year']}}},
    pmid = {{{article['pmid']}}},
    relevance = {{{article['relevance']}}}
}}"""
            bibtex_entries.append(entry)

        bibtex_file = self.output_dir / 'bibliography.bib'
        with open(bibtex_file, 'w') as f:
            f.write('\n\n'.join(bibtex_entries))

        return csv_file, bibtex_file

# Run literature review pipeline
print("Running Automated Literature Review Pipeline")
print("=" * 50)

# Initialize pipeline
pipeline = LiteratureReviewPipeline(web_output / 'literature_review')

# Define search queries
research_queries = [
    'machine learning epilepsy',
    'AI medical diagnosis',
    'brain computer interface'
]

# Step 1: Search literature
search_results = pipeline.search_literature(research_queries, max_results_per_query=10)

# Step 2: Filter by relevance
high_relevance_articles = pipeline.filter_by_relevance(min_relevance=0.85)

# Step 3: Analyze trends
print("\nAnalyzing literature trends...")
analysis = pipeline.analyze_trends()

print(f"\nAnalysis Results:")
print(f"- Total articles: {analysis['total_articles']}")
print(f"- Average relevance: {analysis['average_relevance']:.3f}")
print(f"- High relevance articles: {analysis['high_relevance_count']}")
print(f"- Recent articles (2024): {analysis['recent_articles']}")

# Step 4: Generate report
print("\nGenerating literature review report...")
report, report_file = pipeline.generate_report()
print(f"Report saved to: {report_file}")

# Step 5: Export bibliography
csv_file, bibtex_file = pipeline.export_bibliography()
print(f"Bibliography exported to:")
print(f"  CSV: {csv_file}")
print(f"  BibTeX: {bibtex_file}")

# Display part of the report
print("\nGenerated Report Preview:")
print(report[:500] + "...")

### 3.2 Research Data Collection Workflow
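A collection workflow like this one typically stores records column-wise: one list per field, appended in lockstep so the result converts directly to a DataFrame. For readers without SciTeX installed, the same layout can be approximated with plain Python. The sketch assumes, without having verified it, that `scitex.dict.listed_dict` behaves like a dictionary of lists; `ColumnStore` is a hypothetical stand-in, not SciTeX's implementation.

```python
class ColumnStore:
    """Dict-of-lists store that keeps every column the same length (hypothetical)."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.columns = {name: [] for name in self.fields}

    def add(self, **record):
        """Append one record; reject missing or unexpected fields."""
        if set(record) != set(self.fields):
            raise ValueError(f"Expected fields {self.fields}, got {sorted(record)}")
        for name in self.fields:
            self.columns[name].append(record[name])

    def rows(self):
        """Yield records back as plain dictionaries."""
        for values in zip(*(self.columns[name] for name in self.fields)):
            yield dict(zip(self.fields, values))


store = ColumnStore(["title", "word_count", "relevance_score"])
store.add(title="Paper A", word_count=120, relevance_score=0.92)
store.add(title="Paper B", word_count=80, relevance_score=0.88)
print(store.columns)
print(list(store.rows())[0])
```

Validating the field set on every `add` guards against columns drifting out of sync, which would otherwise surface only later as a length-mismatch error when building a DataFrame.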

In [None]:
# Create a comprehensive research data collection workflow
class ResearchDataCollector:
    def __init__(self, project_name, output_dir):
        self.project_name = project_name
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True, parents=True)

        # Initialize data storage
        self.collected_data = scitex.dict.listed_dict([
            'source_url', 'content_type', 'title', 'content',
            'collection_date', 'word_count', 'relevance_score'
        ])

        self.metadata = {
            'project_name': project_name,
            'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
            'sources_processed': 0,
            'total_content_collected': 0
        }

    def collect_from_sources(self, sources):
        """Collect data from multiple web sources."""
        print(f"Collecting data for project: {self.project_name}")

        for source in sources:
            print(f"Processing source: {source['name']}")

            # Simulate data collection from different source types
            if source['type'] == 'research_database':
                self._collect_from_database(source)
            elif source['type'] == 'journal_website':
                self._collect_from_journal(source)
            elif source['type'] == 'conference_proceedings':
                self._collect_from_conference(source)

            self.metadata['sources_processed'] += 1
            time.sleep(0.1)  # Simulate processing time

        self.metadata['total_content_collected'] = len(self.collected_data['title'])
        print(f"Data collection completed. Total items: {self.metadata['total_content_collected']}")

    def _collect_from_database(self, source):
        """Simulate collecting from research database."""
        # Simulate database entries
        entries = [
            {
                'title': 'Machine Learning in Neuroimaging: A Review',
                'content': 'Comprehensive review of ML applications in neuroimaging covering 200+ studies...',
                'relevance': 0.92
            },
            {
                'title': 'Deep Learning for Medical Image Segmentation',
                'content': 'Novel approaches using CNN architectures for precise medical image segmentation...',
                'relevance': 0.88
            }
        ]

        for entry in entries:
            self._add_collected_item(source['url'], 'database_entry', entry)

    def _collect_from_journal(self, source):
        """Simulate collecting from journal website."""
        articles = [
            {
                'title': 'AI-Assisted Diagnosis in Emergency Medicine',
                'content': 'Study of AI tools reducing diagnostic time by 40% in emergency departments...',
                'relevance': 0.94
            },
            {
                'title': 'Ethical Considerations in Medical AI',
                'content': 'Analysis of ethical frameworks for implementing AI in clinical practice...',
                'relevance': 0.86
            }
        ]

        for article in articles:
            self._add_collected_item(source['url'], 'journal_article', article)

    def _collect_from_conference(self, source):
        """Simulate collecting from conference proceedings."""
        papers = [
            {
                'title': 'Real-time Seizure Detection Using Edge Computing',
                'content': 'Implementation of lightweight models for real-time seizure detection on mobile devices...',
                'relevance': 0.91
            },
            {
                'title': 'Transfer Learning for Medical Image Classification',
                'content': 'Evaluation of transfer learning approaches for medical image classification tasks...',
                'relevance': 0.87
            }
        ]

        for paper in papers:
            self._add_collected_item(source['url'], 'conference_paper', paper)

    def _add_collected_item(self, source_url, content_type, item):
        """Add an item to the collected data."""
        self.collected_data['source_url'].append(source_url)
        self.collected_data['content_type'].append(content_type)
        self.collected_data['title'].append(item['title'])
        self.collected_data['content'].append(item['content'])
        self.collected_data['collection_date'].append(time.strftime('%Y-%m-%d'))
        self.collected_data['word_count'].append(len(item['content'].split()))
        self.collected_data['relevance_score'].append(item['relevance'])

    def analyze_collected_data(self):
        """Analyze the collected data."""
        if not self.collected_data['title']:
            return {}

        df = pd.DataFrame(dict(self.collected_data))

        analysis = {
            'total_items': len(df),
            'content_types': df['content_type'].value_counts().to_dict(),
            'average_relevance': df['relevance_score'].mean(),
            'total_words': df['word_count'].sum(),
            'average_words_per_item': df['word_count'].mean(),
            'high_relevance_items': len(df[df['relevance_score'] >= 0.9]),
            'collection_dates': df['collection_date'].value_counts().to_dict()
        }

        return analysis

    def export_data(self):
        """Export collected data in multiple formats."""
        if not self.collected_data['title']:
            print("No data to export")
            return

        # Export as CSV
        df = pd.DataFrame(dict(self.collected_data))
        csv_file = self.output_dir / f'{self.project_name}_collected_data.csv'
        df.to_csv(csv_file, index=False)
   # Export as JSON        json_file = self.output_dir / f'{self.project_name}_collected_data.json'        with open(json_file, 'w') as f:            json.dump(dict(self.collected_data), f, indent=2)                # Export metadata        metadata_file = self.output_dir / f'{self.project_name}_metadata.json'        with open(metadata_file, 'w') as f:            json.dump(self.metadata, f, indent=2)                return csv_file, json_file, metadata_file        def generate_summary_report(self):        """Generate a summary report of the data collection."""        analysis = self.analyze_collected_data()                if not analysis:            return "No data collected to report."                report = f"""# Data Collection Summary Report## Project: {self.project_name}### Collection Overview- Collection started: {self.metadata['start_time']}- Sources processed: {self.metadata['sources_processed']}- Total items collected: {analysis['total_items']}- Total words collected: {analysis['total_words']:,}- Average words per item: {analysis['average_words_per_item']:.1f}- Average relevance score: {analysis['average_relevance']:.3f}- High relevance items (≥0.9): {analysis['high_relevance_items']}### Content Type Distribution"""        for content_type, count in analysis['content_types'].items():            report += f"- {content_type}: {count} items\n"                return report# Run research data collection workflowprint("Running Research Data Collection Workflow")print("=" * 50)# Initialize data collectorcollector = ResearchDataCollector(    project_name="AI_Medical_Research",    output_dir=web_output / 'data_collection')# Define data sourcesresearch_sources = [    {        'name': 'PubMed Database',        'type': 'research_database',        'url': 'https://pubmed.ncbi.nlm.nih.gov/'    },    {        'name': 'Nature Medicine',        'type': 'journal_website',        'url': 'https://nature.com/nm/'    },    {        'name': 'MICCAI 2024 Proceedings',        
'type': 'conference_proceedings',        'url': 'https://miccai2024.org/proceedings/'    }]# Collect data from sourcescollector.collect_from_sources(research_sources)# Analyze collected dataprint("\nAnalyzing collected data...")analysis = collector.analyze_collected_data()print(f"Analysis Results:")print(f"- Total items: {analysis['total_items']}")print(f"- Average relevance: {analysis['average_relevance']:.3f}")print(f"- Total words: {analysis['total_words']:,}")print(f"- High relevance items: {analysis['high_relevance_items']}")# Export dataprint("\nExporting collected data...")csv_file, json_file, metadata_file = collector.export_data()print(f"Data exported to:")print(f"  CSV: {csv_file}")print(f"  JSON: {json_file}")print(f"  Metadata: {metadata_file}")# Generate summary reportsummary_report = collector.generate_summary_report()print("\nSummary Report:")print(summary_report)# Save summary reportreport_file = collector.output_dir / f'{collector.project_name}_summary_report.md'with open(report_file, 'w') as f:    f.write(summary_report)print(f"\nSummary report saved to: {report_file}")# Visualize collection resultsdf = pd.DataFrame(dict(collector.collected_data))fig, axes = plt.subplots(2, 2, figsize=(12, 8))# Content type distributioncontent_type_counts = df['content_type'].value_counts()content_type_counts.plot(kind='pie', ax=axes[0, 0], autopct='%1.1f%%')axes[0, 0].set_title('Content Type Distribution')axes[0, 0].set_ylabel('')# Relevance score distributiondf['relevance_score'].hist(bins=10, ax=axes[0, 1])axes[0, 1].set_title('Relevance Score Distribution')axes[0, 1].set_xlabel('Relevance Score')axes[0, 1].set_ylabel('Frequency')# Word count distributiondf['word_count'].hist(bins=10, ax=axes[1, 0])axes[1, 0].set_title('Word Count Distribution')axes[1, 0].set_xlabel('Words per Item')axes[1, 0].set_ylabel('Frequency')# Relevance vs Word Countaxes[1, 1].scatter(df['word_count'], df['relevance_score'], alpha=0.6)axes[1, 1].set_title('Relevance vs Word 
Count')axes[1, 1].set_xlabel('Word Count')axes[1, 1].set_ylabel('Relevance Score')plt.tight_layout()plt.show()
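
The collector stores its items column by column (one list per field), so a quality filter has to work row by row before anything goes into analysis. The following is a minimal sketch of such a filter, not part of SciTeX: the helper name `filter_collected` and the `min_words` threshold are assumptions made for illustration, and the sample values are copied from the simulated journal and conference entries above.

```python
def filter_collected(collected, min_relevance=0.9, min_words=5):
    """Return the rows (as dicts) that pass both thresholds.

    `collected` is a column-oriented dict of equal-length lists, the same
    shape as `collector.collected_data`. Both thresholds are illustrative.
    """
    columns = list(collected)
    # Turn the columns into one dict per collected item.
    rows = [dict(zip(columns, values)) for values in zip(*collected.values())]
    return [
        row for row in rows
        if row["relevance_score"] >= min_relevance
        and row["word_count"] >= min_words
    ]


# Sample values taken from the simulated entries in this tutorial.
sample = {
    "title": [
        "AI-Assisted Diagnosis in Emergency Medicine",
        "Ethical Considerations in Medical AI",
        "Real-time Seizure Detection Using Edge Computing",
    ],
    "relevance_score": [0.94, 0.86, 0.91],
    "word_count": [12, 10, 11],
}

kept = filter_collected(sample)
kept_titles = [row["title"] for row in kept]
print(kept_titles)
```

With the default cutoff of 0.9, which matches the "high relevance" threshold used in `analyze_collected_data`, only the first and third sample items are kept.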

## Summary and Best Practices

This tutorial demonstrated the comprehensive web operations capabilities of the SciTeX library:

### Key Features Covered:

1. **Academic Literature Search**:
   - PubMed API integration for scientific literature search
   - Article metadata retrieval and processing
   - BibTeX citation generation for reference management
   - Literature trend analysis and visualization
2. **Web Content Processing**:
   - URL content extraction and summarization
   - Main content identification from HTML
   - Web crawling for systematic data collection
   - Content categorization and analysis
3. **Research Workflow Integration**:
   - Automated literature review pipelines
   - Multi-source data collection workflows
   - Research data organization and export
   - Comprehensive reporting and visualization

### Best Practices:

- **Respect Rate Limits**: Always implement appropriate delays when making multiple web requests
- **Handle Errors Gracefully**: Implement proper error handling for network requests and API calls
- **Cache Results**: Store retrieved data locally to avoid redundant requests
- **Validate Data Quality**: Check relevance scores and content quality before including in analysis
- **Organize Output**: Use structured directories and standardized file formats for data organization
- **Document Sources**: Always track the source and collection date of web-scraped data
- **Ethical Considerations**: Respect robots.txt, terms of service, and data usage policies

### Research Applications:

- **Literature Reviews**: Automate the collection and analysis of research papers
- **Trend Analysis**: Identify emerging topics and research directions
- **Reference Management**: Generate properly formatted citations and bibliographies
- **Data Mining**: Extract insights from web-based research content
- **Collaborative Research**: Share standardized data collections across research teams

The SciTeX web module provides powerful tools for modern research workflows, enabling researchers to efficiently collect, process, and analyze web-based scientific content.
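
Three of the best practices above (rate limiting, error handling, and caching) can be combined in one small helper. This is a rough sketch using only the Python standard library, not a SciTeX function: the name `polite_fetch`, its parameters, and the hashed cache file layout are all assumptions chosen for illustration. The `fetch` argument is injectable so the logic can be exercised without network access.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def _download(url, timeout=10):
    """Default fetcher: a plain HTTP GET via the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_fetch(url, cache_dir, fetch=_download, delay=1.0, retries=2):
    """Fetch `url`, reusing a cached copy on disk when one exists.

    Returns the page text, or None if every attempt failed.
    All names and defaults here are hypothetical.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache file per URL, named by the SHA-256 hash of the URL.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.txt"

    if cache_file.exists():
        # Cache hit: no request is made at all.
        return cache_file.read_text(encoding="utf-8")

    for attempt in range(retries + 1):
        try:
            text = fetch(url)
        except OSError as exc:  # URLError and socket timeouts are OSErrors
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
            time.sleep(delay)  # wait before the next attempt
            continue
        cache_file.write_text(text, encoding="utf-8")
        time.sleep(delay)  # rate limit: pause after each real request
        return text
    return None
```

A call such as `polite_fetch(source['url'], web_output / 'cache')` would then hit the network at most once per URL. Checking robots.txt and the site's terms of service remains a separate step that this sketch does not cover.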