# IFC Scraper Testing Notebook

This notebook tests the IFC-UNAM publication scraper component.

## Overview
- Test scraping publications from IFC-UNAM website
- Parse and validate the scraped data
- Save results for further processing

**Note**: You may need to adjust the scraper selectors based on the actual HTML structure of the IFC website.

In [None]:
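# Set up imports and make the project's src/ directory importable
import sys
from pathlib import Path

# Assumes the notebook runs from its own directory, one level below the project root
src_dir = str(Path('../src').resolve())
if src_dir not in sys.path:
    sys.path.append(src_dir)

import asyncio
import pandas as pd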

In [None]:
# Import our scraper
from scrapers.ifc_scraper import IFCPublicationScraper
from utils.config import load_config
from utils.logger import setup_logger, get_logger

# Setup logging
setup_logger(level="INFO")
logger = get_logger(__name__)

## 1. Initialize the Scraper

Load configuration and create scraper instance.

In [None]:
# Load configuration
config = load_config()
print("Configuration loaded:")
print(f"Base URL: {config['ifc']['base_url']}")
print(f"Years range: {config['ifc']['years_range']}")
print(f"Rate limit delay: {config['ifc']['rate_limit_delay']}s")
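
If one of the keys printed above is missing, the cell fails with a bare `KeyError`. A minimal sketch of an up-front check follows, assuming `load_config()` returns a nested mapping with an `ifc` section (as the subscripts above suggest). The helper name `missing_config_keys` is hypothetical and not part of the project.

```python
from collections.abc import Mapping

# Keys this notebook reads from the "ifc" section of the loaded config
REQUIRED_IFC_KEYS = ("base_url", "years_range", "rate_limit_delay")

def missing_config_keys(config):
    """Return dotted names of required keys that are absent (hypothetical helper)."""
    section = config.get("ifc") if isinstance(config, Mapping) else None
    if not isinstance(section, Mapping):
        # The whole section is missing or is not a mapping
        return ["ifc"]
    return [f"ifc.{key}" for key in REQUIRED_IFC_KEYS if key not in section]
```

Calling `missing_config_keys(config)` right after `load_config()` returns an empty list when all three keys are present.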

In [None]:
# Initialize scraper
scraper = IFCPublicationScraper(config)
print("Scraper initialized successfully")

## 2. Test Scraping a Single Year

Start with a single year to check whether the current selectors match the page before scraping a larger range.

In [None]:
# Test scraping for 2024 first
test_year = 2024
print(f"Testing scraper for year {test_year}...")

try:
    publications = await scraper.scrape_publications_by_year(test_year)
    print(f"Successfully scraped {len(publications)} publications for {test_year}")
    
    if publications:
        print("\nFirst publication sample:")
        sample = publications[0]
        print(f"Title: {sample.title}")
        print(f"Authors: {sample.authors}")
        print(f"Journal: {sample.journal}")
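        # Append "..." only when the abstract was actually truncated
        abstract = sample.abstract or ""
        preview = abstract[:200] + ("..." if len(abstract) > 200 else "")
        print(f"Abstract: {preview or 'No abstract'}")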
        
except Exception as e:
    print(f"Error: {type(e).__name__}: {e}")
    print("This can happen if the website selectors need adjustment")

## 3. Inspect Website Structure

If the scraper fails, let's manually inspect the website structure to understand the HTML layout.

In [None]:
# Manual inspection of the website
import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def inspect_website(year=2024):
    """Fetch one year's publication page and report likely container elements."""
    url = f"https://www.ifc.unam.mx/publicaciones.php?year={year}"
    print(f"Inspecting: {url}")

    # Cap the whole request so a hung connection does not stall the notebook
    timeout = aiohttp.ClientTimeout(total=30)
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(url) as response:
                if response.status != 200:
                    print(f"Failed to fetch page: HTTP {response.status}")
                    return
                html = await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError) as e:
        print(f"Error inspecting website: {type(e).__name__}: {e}")
        return

    soup = BeautifulSoup(html, 'html.parser')
    print(f"Page title: {soup.title.get_text(strip=True) if soup.title else 'No title'}")
    print(f"Page length: {len(html)} characters")

    # Look for common publication container patterns, labeled for readable output
    potential_containers = {
        "div with 'publication' in class": soup.find_all('div', class_=lambda x: x and 'publication' in x.lower()),
        "div with 'article' in class": soup.find_all('div', class_=lambda x: x and 'article' in x.lower()),
        "li": soup.find_all('li'),
        "tr": soup.find_all('tr'),
    }

    for label, containers in potential_containers.items():
        print(f"\n{label}: {len(containers)} elements")
        if containers:
            print(f"Sample: {str(containers[0])[:200]}...")

# Run inspection
await inspect_website(2024)

## 4. Test Data Processing

Even if scraping fails, let's test the data processing with mock data.

In [None]:
# Create mock publications for testing
from scrapers.ifc_scraper import Publication

mock_publications = [
    Publication(
        title="Neural mechanisms of memory formation in hippocampal circuits",
        authors="García-López, M., Rodríguez-Silva, A., Mendoza-Pérez, J.",
        journal="Journal of Neuroscience",
        year=2024,
        doi="10.1523/JNEUROSCI.1234-24.2024",
        pubmed_id="38123456",
        ifc_url="https://www.ifc.unam.mx/publicacion.php?ut=000123456789",
        abstract="We investigated the cellular and molecular mechanisms underlying memory formation in hippocampal circuits. Using electrophysiological recordings and optogenetic manipulations, we found that..."
    ),
    Publication(
        title="Cardiac physiology under metabolic stress conditions",
        authors="Hernández-Campos, L., López-Martín, R.",
        journal="Cardiovascular Research",
        year=2024,
        doi="10.1093/cvr/cvz098",
        pubmed_id="38234567",
        ifc_url="https://www.ifc.unam.mx/publicacion.php?ut=000234567890",
        abstract="Heart function during metabolic stress was analyzed using isolated perfused heart preparations. Our results demonstrate significant changes in..."
    )
]

print(f"Created {len(mock_publications)} mock publications")
for i, pub in enumerate(mock_publications, 1):
    print(f"{i}. {pub.title[:50]}...")
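
The overview mentions validating scraped data, so here is a minimal sketch of per-record checks. It assumes records expose the same attribute names used in the mock data (`title`, `authors`, `year`, `doi`, `pubmed_id`). `PublicationRecord` is a hypothetical stand-in so the example runs on its own; `validate_publication` and the DOI pattern are illustrative and not part of `ifc_scraper.py`.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PublicationRecord:
    """Hypothetical stand-in mirroring the fields used in the mock data."""
    title: str
    authors: str
    journal: str
    year: int
    doi: Optional[str] = None
    pubmed_id: Optional[str] = None
    ifc_url: Optional[str] = None
    abstract: Optional[str] = None

# Loose DOI shape check: "10.", a 4-9 digit registrant code, "/", then a suffix
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def validate_publication(pub) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    if not (pub.title or "").strip():
        problems.append("missing title")
    if not (pub.authors or "").strip():
        problems.append("missing authors")
    if not isinstance(pub.year, int) or not 1900 <= pub.year <= 2100:
        problems.append(f"implausible year: {pub.year!r}")
    if pub.doi and not DOI_PATTERN.match(pub.doi):
        problems.append(f"malformed DOI: {pub.doi!r}")
    if pub.pubmed_id and not str(pub.pubmed_id).isdigit():
        problems.append(f"non-numeric PubMed ID: {pub.pubmed_id!r}")
    return problems
```

Because the function only reads attributes, it should also accept the real `Publication` objects if they expose these fields, for example `[validate_publication(p) for p in mock_publications]`.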

## 5. Test Data Saving

In [None]:
# Test saving publications
output_dir = Path("../data/raw")
output_dir.mkdir(parents=True, exist_ok=True)

# Save mock data
scraper.save_publications(mock_publications, output_dir / "test_ifc_publications.json")

# Verify saved data
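import json
# Read as UTF-8 explicitly, since the mock author names contain non-ASCII characters
# (assumes save_publications writes UTF-8 or ASCII-escaped JSON)
with open(output_dir / "test_ifc_publications.json", 'r', encoding='utf-8') as f:
    saved_data = json.load(f)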
    
print(f"Saved {len(saved_data)} publications to file")
print("Sample saved data:")
print(json.dumps(saved_data[0], indent=2, ensure_ascii=False))
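
Author names such as "García-López" only survive a save and reload if the file is written and read with the same encoding. The sketch below shows that round trip using only the standard library. `roundtrip_json` is a hypothetical helper and says nothing about how `save_publications` is actually implemented.

```python
import json
import tempfile
from pathlib import Path

def roundtrip_json(records):
    """Write records to a temporary UTF-8 JSON file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "roundtrip.json"
        # ensure_ascii=False keeps accented characters readable in the file
        with open(path, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
```

Comparing `roundtrip_json(records) == records` gives a quick check that no characters were mangled along the way.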

## 6. Test Multiple Years (if single year works)

In [None]:
# Only run this if the single year test worked
# Uncomment and run if ready

# try:
#     all_publications = await scraper.scrape_all_years(2023, 2024)  # Test with 2 years
#     print(f"Successfully scraped {len(all_publications)} total publications")
#     
#     # Save all data
#     scraper.save_publications(all_publications, output_dir / "all_ifc_publications.json")
#     
#     # Analysis
#     df = pd.DataFrame([{
#         'title': pub.title,
#         'authors': pub.authors,
#         'journal': pub.journal,
#         'year': pub.year,
#         'has_abstract': bool(pub.abstract)
#     } for pub in all_publications])
#     
#     print("\nData summary:")
#     print(df.groupby('year').size())
#     print(f"\nArticles with abstracts: {df['has_abstract'].sum()}/{len(df)}")
#     
# except Exception as e:
#     print(f"Multi-year scraping failed: {e}")

print("Multi-year test commented out - uncomment when single year works")

## Next Steps

1. **Adjust selectors**: If scraping fails, inspect the website HTML and adjust the CSS selectors in `ifc_scraper.py`
2. **Test with real data**: Once selectors work, test with actual IFC publications
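3. **Rate limiting**: Confirm that the scraper waits the configured `rate_limit_delay` between requests so the site does not block it
4. **Error handling**: Test how the scraper handles missing fields (for example, no abstract or DOI), network errors, and non-200 responses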
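
For the error-handling step, one common pattern is to retry transient failures with exponential backoff. The following is a minimal sketch under that assumption; `with_retries` is a hypothetical helper, and whether the scraper already retries internally is not known from this notebook.

```python
import asyncio
import random

async def with_retries(make_request, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Await make_request(), retrying failures with exponential backoff.

    make_request is any zero-argument callable that returns an awaitable.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await make_request()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles on each attempt, plus a little jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay / 10)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            await asyncio.sleep(delay)
```

In a notebook cell this could wrap the single-year test, for example `await with_retries(lambda: scraper.scrape_publications_by_year(2024))`. Narrowing `retry_on` to network errors such as `aiohttp.ClientError` avoids retrying on selector or parsing bugs.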

## Common Issues
- **JavaScript rendering**: If the fetched HTML lacks the publication list that a browser shows, the content is probably loaded by JavaScript. In that case, consider a browser-automation tool such as Selenium
- **Rate limiting**: Too many requests in a short time might get blocked. Increase `rate_limit_delay` in the config
- **Changing HTML structure**: The site's HTML can change without notice, so the selectors may need periodic updates
- **Access restrictions**: Some content might require authentication or be blocked for automated clients
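
To make the rate-limiting point concrete, here is a minimal sketch of a limiter that enforces a minimum gap between requests. `MinDelayLimiter` is a hypothetical class written for illustration; the scraper's own handling of `rate_limit_delay` may differ.

```python
import asyncio
import time

class MinDelayLimiter:
    """Ensure at least `delay` seconds pass between consecutive requests."""

    def __init__(self, delay):
        self.delay = delay
        self._last = None            # monotonic time of the previous request
        self._lock = asyncio.Lock()  # serializes concurrent callers

    async def wait(self):
        """Sleep just long enough to respect the minimum delay, then return."""
        async with self._lock:
            if self._last is not None:
                remaining = self.delay - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()
```

Create one limiter inside the running event loop (a notebook cell qualifies) and call `await limiter.wait()` before each request. The first call returns immediately and later calls pause as needed.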