# IFC Scraper Testing Notebook

This notebook tests the IFC-UNAM publication scraper component.

## Overview
- Test scraping publications from IFC-UNAM website
- Parse and validate the scraped data
- Save results for further processing

**Note**: You may need to adjust the scraper selectors based on the actual HTML structure of the IFC website.

In [1]:
# Setup imports and path
import sys
import os
sys.path.append('../src')

import asyncio
import pandas as pd
from pathlib import Path

In [2]:
# Import our scraper (sys, os, and Path are already imported in the first cell)

# Add src directory to path for imports
notebook_dir = Path().resolve()
src_dir = notebook_dir.parent / "src"
sys.path.insert(0, str(src_dir))

print(f"Notebook directory: {notebook_dir}")
print(f"Source directory: {src_dir}")
print(f"Source exists: {src_dir.exists()}")

# Now import our modules
from scrapers.ifc_scraper import IFCPublicationScraper
from utils.config import load_config
from utils.logger import setup_logger, get_logger

# Setup logging
setup_logger(level="INFO")
logger = get_logger(__name__)

print("✅ All imports successful!")

Notebook directory: /home/santi/Projects/UBMI-IFC-Podcast/notebooks
Source directory: /home/santi/Projects/UBMI-IFC-Podcast/src
Source exists: True
✅ All imports successful!


## 1. Initialize the Scraper

Load configuration and create scraper instance.

In [3]:
# Load configuration
config = load_config()
print("Configuration loaded:")
print(f"Base URL: {config['ifc']['base_url']}")
print(f"Years range: {config['ifc']['years_range']}")
print(f"Rate limit delay: {config['ifc']['rate_limit_delay']}s")

Configuration loaded:
Base URL: https://www.ifc.unam.mx
Years range: {'start': 2021, 'end': 2025}
Rate limit delay: 1.0s
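
The `years_range` setting printed above is a mapping with `start` and `end` keys. A minimal sketch of turning it into the list of years to scrape follows. It assumes the end year is inclusive, which the output does not confirm, and `expand_years` is a hypothetical helper rather than part of the project.

```python
def expand_years(years_range: dict) -> list[int]:
    """Expand a {'start': ..., 'end': ...} mapping into a list of years.

    Assumption: 'end' is inclusive.
    """
    start, end = years_range["start"], years_range["end"]
    if start > end:
        raise ValueError(f"start ({start}) must not exceed end ({end})")
    return list(range(start, end + 1))


# Same shape and values as the configuration output shown above.
example_config = {"ifc": {"years_range": {"start": 2021, "end": 2025}}}
years = expand_years(example_config["ifc"]["years_range"])
print(years)
```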


In [4]:
# Initialize scraper
scraper = IFCPublicationScraper(config)
print("Scraper initialized successfully")

Scraper initialized successfully


## 2. Test Scraping a Single Year

Let's start by testing with a single year to see the HTML structure and adjust our selectors if needed.

In [6]:
# Test scraping for 2024 first
import asyncio

async def test_scraping():
    test_year = 2024
    print(f"Testing scraper for year {test_year}...")

    try:
        publications = await scraper.scrape_publications_by_year(test_year)
        print(f"Successfully scraped {len(publications)} publications for {test_year}")
        
        if publications:
            print("\nFirst publication sample:")
            sample = publications[0]
            print(f"Title: {sample.title}")
            print(f"Authors: {sample.authors}")
            print(f"Journal: {sample.journal}")
            print(f"Abstract: {sample.abstract[:200] if sample.abstract else 'No abstract'}...")
            return publications
        else:
            print("No publications found - this may be expected if selectors need adjustment")
            return []
            
    except Exception as e:
        print(f"Error: {e}")
        print("This is expected if the website selectors need adjustment")
        return []

# Run the async function
publications_2024 = await test_scraping()

2025-08-30 15:43:14 | INFO | scrapers.ifc_scraper:scrape_publications_by_year:61 - Scraping publications for year 2024


Testing scraper for year 2024...


2025-08-30 15:43:16 | INFO | scrapers.ifc_scraper:_parse_publications_page:97 - Found 125 potential publication links
2025-08-30 15:43:16 | INFO | scrapers.ifc_scraper:_parse_publications_page:158 - Successfully parsed 87 publications


CancelledError: 
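
The traceback above is truncated, so the cause of the `CancelledError` is not known. One thing that is certain: since Python 3.8, `asyncio.CancelledError` derives from `BaseException`, so the `except Exception` clause in `test_scraping` cannot catch it. The sketch below shows one possible way to bound a long-running scrape with `asyncio.wait_for`. `slow_scrape` and `scrape_with_timeout` are hypothetical stand-ins, not project functions.

```python
import asyncio


async def slow_scrape() -> list:
    # Hypothetical stand-in for a scrape that takes too long.
    await asyncio.sleep(10)
    return ["publication"]


async def scrape_with_timeout(timeout_s: float) -> list:
    """Run the scrape, giving up after `timeout_s` seconds."""
    try:
        return await asyncio.wait_for(slow_scrape(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task and raises TimeoutError here.
        print(f"Scrape exceeded {timeout_s}s; returning no results")
        return []


# CancelledError is not an Exception subclass on Python 3.8+.
catchable = issubclass(asyncio.CancelledError, Exception)
result = asyncio.run(scrape_with_timeout(0.05))
```

Inside a notebook cell an event loop is already running, so `await scrape_with_timeout(0.05)` would replace the `asyncio.run` call.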

## 3. Inspect Website Structure

If the scraper fails, let's manually inspect the website structure to understand the HTML layout.

In [7]:
# Manual inspection of the website
import aiohttp
from bs4 import BeautifulSoup

async def inspect_website(year=2024):
    url = f"https://www.ifc.unam.mx/publicaciones.php?year={year}"
    print(f"Inspecting: {url}")
    
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    html = await response.text()
                    soup = BeautifulSoup(html, 'html.parser')
                    
                    print(f"Page title: {soup.title.text if soup.title else 'No title'}")
                    print(f"Page length: {len(html)} characters")
                    
                    # Look for common publication container patterns
                    potential_containers = [
                        soup.find_all('div', class_=lambda x: x and 'publication' in x.lower()),
                        soup.find_all('div', class_=lambda x: x and 'article' in x.lower()),
                        soup.find_all('li'),
                        soup.find_all('tr'),
                    ]
                    
                    for i, containers in enumerate(potential_containers):
                        print(f"\nPotential container type {i+1}: {len(containers)} elements")
                        if containers:
                            print(f"Sample: {str(containers[0])[:200]}...")
                            
                else:
                    print(f"Failed to fetch page: {response.status}")
                    
        except Exception as e:
            print(f"Error inspecting website: {e}")

# Run inspection
await inspect_website(2024)

Inspecting: https://www.ifc.unam.mx/publicaciones.php?year=2024
Page title: Instituto de Fisiología Celular UNAM
Page length: 126087 characters

Potential container type 1: 0 elements

Potential container type 2: 0 elements

Potential container type 3: 130 elements
Sample: <li class="nav-item"><a class="nav-link text-1 text-uppercase" href="publicaciones.php?year=2025">2025</a></li>...

Potential container type 4: 0 elements


In [8]:
# Deep dive into HTML structure
async def analyze_html_structure(year=2024):
    """Detailed analysis of the IFC website structure"""
    url = f"https://www.ifc.unam.mx/publicaciones.php?year={year}"
    
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    html = await response.text()
                    soup = BeautifulSoup(html, 'html.parser')
                    
                    print("🔍 DETAILED HTML STRUCTURE ANALYSIS")
                    print("="*50)
                    
                    # Check for common content areas
                    main_content = soup.find('main') or soup.find('div', {'id': 'main'}) or soup.find('div', {'class': 'main'})
                    if main_content:
                        print(f"📄 Main content area found: {main_content.name}")
                        print(f"   Content length: {len(str(main_content))} chars")
                    else:
                        print("❌ No main content area found")
                    
                    # Look for publication-related keywords in the HTML
                    keywords = ['publication', 'article', 'paper', 'journal', 'author', 'doi', 'pubmed', 'abstract']
                    print(f"\n🔤 Keyword analysis:")
                    for keyword in keywords:
                        count = html.lower().count(keyword)
                        if count > 0:
                            print(f"   '{keyword}': {count} occurrences")
                    
                    # Check for JavaScript that might load content
                    scripts = soup.find_all('script')
                    print(f"\n🔧 JavaScript analysis:")
                    print(f"   Found {len(scripts)} script tags")
                    
                    ajax_keywords = ['ajax', 'fetch', 'xhr', 'publicaciones', 'load']
                    for script in scripts:
                        if script.string:
                            script_text = script.string.lower()
                            for keyword in ajax_keywords:
                                if keyword in script_text:
                                    print(f"   Found '{keyword}' in script - possible dynamic loading")
                                    break
                    
                    # Look for table structures
                    tables = soup.find_all('table')
                    print(f"\n📊 Table analysis:")
                    print(f"   Found {len(tables)} tables")
                    for i, table in enumerate(tables):
                        rows = table.find_all('tr')
                        print(f"   Table {i+1}: {len(rows)} rows")
                        if len(rows) > 1:  # Skip header-only tables
                            first_row = rows[1]
                            print(f"      Sample row: {str(first_row)[:150]}...")
                    
                    # Look for specific content patterns
                    print(f"\n🎯 Content pattern analysis:")
                    
                    # Search for year patterns
                    import re
                    year_pattern = r'20\d{2}'
                    years_found = re.findall(year_pattern, html)
                    unique_years = list(set(years_found))
                    print(f"   Years found: {unique_years[:10]}...")  # Show first 10
                    
                    # Search for author name patterns such as "Surname, I." or "First Last"
                    # (ASCII-only, so accented or hyphenated names match only partially)
                    author_patterns = [r'[A-Z][a-z]+, [A-Z]\.', r'[A-Z][a-z]+ [A-Z][a-z]+']
                    for pattern in author_patterns:
                        matches = re.findall(pattern, html)
                        if matches:
                            print(f"   Potential authors: {matches[:5]}...")
                            break
                    
                    # Check for forms or search interfaces
                    forms = soup.find_all('form')
                    inputs = soup.find_all('input')
                    selects = soup.find_all('select')
                    print(f"\n📝 Form analysis:")
                    print(f"   Forms: {len(forms)}, Inputs: {len(inputs)}, Selects: {len(selects)}")
                    
                    # Look for pagination or navigation
                    nav_elements = soup.find_all(['nav', 'div'], class_=lambda x: x and any(nav_word in x.lower() for nav_word in ['nav', 'page', 'pagination']))
                    print(f"   Navigation elements: {len(nav_elements)}")
                    
                    return soup
                    
        except Exception as e:
            print(f"Error in detailed analysis: {e}")
            return None

# Run detailed analysis
soup = await analyze_html_structure(2024)

🔍 DETAILED HTML STRUCTURE ANALYSIS
📄 Main content area found: div
   Content length: 73936 chars

🔤 Keyword analysis:
   'journal': 33 occurrences
   'author': 2 occurrences
   'doi': 86 occurrences

🔧 JavaScript analysis:
   Found 22 script tags

📊 Table analysis:
   Found 0 tables

🎯 Content pattern analysis:
   Years found: ['2063', '2069', '2000', '2092', '2008', '2097', '2047', '2048', '2024', '2036']...
   Potential authors: ['Silva, M.', 'Luis, E.', 'Maldonado, V.', 'Nieto, E.', 'Santacruz, L.']...

📝 Form analysis:
   Forms: 1, Inputs: 1, Selects: 0
   Navigation elements: 2
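
The "Years found" line above includes values such as `2063` and `2097`, which suggests that the bare pattern `20\d{2}` also matches digit runs inside DOIs and other numbers. A minimal sketch of the difference follows, using a made-up citation string. It assumes publication years appear in parentheses, as in the citation previews elsewhere in this notebook.

```python
import re

# Made-up citation for illustration; the DOI is not a real one.
sample = (
    "Luis, E. (2024). Some title. Journal, 14(4), 81. "
    "https://doi.org/10.3390/jof12063740"
)

# Bare pattern: matches any "20" followed by two digits, anywhere.
loose = re.findall(r"20\d{2}", sample)

# Stricter pattern: only a four-digit year wrapped in parentheses.
strict = re.findall(r"\((20\d{2})\)", sample)

print(loose)
print(strict)
```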


In [9]:
# Extract actual publication data from the main content
async def extract_publications_data(year=2024):
    """Find and extract actual publication data"""
    url = f"https://www.ifc.unam.mx/publicaciones.php?year={year}"
    
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            
            print("🎯 EXTRACTING PUBLICATION DATA")
            print("="*40)
            
            # Find the main content area
            main_content = soup.find('main') or soup.find('div', {'id': 'main'}) or soup.find('div', {'class': 'main'})
            if not main_content:
                # No main content area found: fall back to searching the whole document
                main_content = soup
            
            # Look for DOI patterns to locate publications
            import re
            doi_pattern = r'10\.\d+/[^\s<>"]+'
            doi_matches = re.findall(doi_pattern, str(main_content))
            
            print(f"📋 Found {len(doi_matches)} DOI patterns:")
            for i, doi in enumerate(doi_matches[:5]):  # Show first 5
                print(f"   {i+1}. {doi}")
            
            # Look for the container that holds these DOIs
            print(f"\n🔍 Finding DOI containers:")
            # `string=` replaces the deprecated `text=` argument of find_all
            doi_elements = main_content.find_all(string=re.compile(doi_pattern))
            
            potential_publications = []
            for i, doi_element in enumerate(doi_elements[:3]):  # Analyze first 3
                # Get the parent elements that might contain the full publication
                parent = doi_element.parent
                grandparent = parent.parent if parent else None
                
                print(f"\n📄 Publication {i+1}:")
                print(f"   DOI text: {doi_element.strip()[:100]}...")
                print(f"   Parent tag: {parent.name if parent else 'None'}")
                print(f"   Parent class: {parent.get('class') if parent else 'None'}")
                
                # Look for publication container
                pub_container = grandparent
                for level in range(5):  # Go up 5 levels max
                    if pub_container and pub_container.name:
                        # Check if this container has typical publication structure
                        container_text = pub_container.get_text()[:500]
                        
                        # Look for author patterns in this container
                        author_patterns = re.findall(r'[A-Z][a-z]+(?:, [A-Z]\.)+', container_text)
                        year_patterns = re.findall(r'20\d{2}', container_text)
                        
                        if author_patterns or len(year_patterns) > 0:
                            print(f"   Level {level} container ({pub_container.name}):")
                            print(f"     Authors found: {author_patterns[:2]}")
                            print(f"     Years found: {year_patterns[:2]}")
                            print(f"     Text preview: {container_text[:150]}...")
                            
                            potential_publications.append({
                                'container': pub_container,
                                'doi': doi_element.strip(),
                                'authors': author_patterns,
                                'years': year_patterns
                            })
                            break
                    
                    pub_container = pub_container.parent if pub_container else None
            
            return potential_publications

# Extract actual publication data
publications_data = await extract_publications_data(2024)

🎯 EXTRACTING PUBLICATION DATA
📋 Found 87 DOI patterns:
   1. 10.5306/wjco.v15.i2.195
   2. 10.3390/jox14040081
   3. 10.3390/jof10110740
   4. 10.1093/cvr/cvae156
   5. 10.1128/jb.00264-24

🔍 Finding DOI containers:

📄 Publication 1:
   DOI text: Tecalco-Cruz, AC, Medina-Abreu, KH, Oropeza-Martínez, E, Zepeda-Cervantes, J, Vázquez-Macías, A & Ma...
   Parent tag: a
   Parent class: ['opensans400', 'd-flexy']
   Level 0 container (div):
     Authors found: ['Silva, M.']
     Years found: ['2024']
     Text preview: 
Tecalco-Cruz, AC, Medina-Abreu, KH, Oropeza-Martínez, E, Zepeda-Cervantes, J, Vázquez-Macías, A & Macías-Silva, M. (2024). Deregulation of interferon...

📄 Publication 2:
   DOI text: Luis, E., Conde-Maldonado, V., García-Nieto, E., Juárez-Santacruz, L., Alvarado, M., & Anaya-Hernánd...
   Parent tag: a
   Parent class: ['opensans400', 'd-flexy']
   Level 0 container (div):
     Authors found: ['Luis, E.', 'Maldonado, V.']
     Years found: ['2024']
     Text preview: 
Luis,

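
The DOI pattern used above stops only at whitespace, angle brackets, or quotes, so a match can carry trailing punctuation from the surrounding sentence. Below is a sketch of one way to normalise and de-duplicate matches. `extract_dois` is a hypothetical helper. Lowercasing relies on DOI names being case-insensitive, and stripping a trailing `)` is an assumption that could truncate the rare DOI that legitimately ends with one.

```python
import re

DOI_RE = re.compile(r'10\.\d{4,9}/[^\s<>"]+')


def extract_dois(text: str) -> list[str]:
    """Return unique DOIs in order of first appearance.

    Each match is lowercased and stripped of trailing punctuation.
    """
    seen: dict[str, None] = {}
    for match in DOI_RE.findall(text):
        doi = match.rstrip(".,;)").lower()
        seen.setdefault(doi, None)  # dict keeps insertion order
    return list(seen)


# The two DOIs are taken from the output above; the sentence is made up.
sample_text = (
    "See https://doi.org/10.5306/wjco.v15.i2.195. "
    "Also cited as doi:10.5306/WJCO.v15.i2.195, and 10.3390/jox14040081)"
)
dois = extract_dois(sample_text)
print(dois)
```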


In [10]:
# Create improved scraper based on discovered structure
async def scrape_ifc_publications_improved(year=2024):
    """Improved scraper based on actual website structure analysis"""
    url = f"https://www.ifc.unam.mx/publicaciones.php?year={year}"
    
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            
            print(f"🔍 Scraping IFC publications for {year}...")
            
            # Find all publication links with the identified pattern.
            # Note: a list passed to class_ matches anchors carrying either class, not necessarily both.
            publication_links = soup.find_all('a', class_=['opensans400', 'd-flexy'])
            print(f"📚 Found {len(publication_links)} potential publication links")
            
            publications = []
            
            for i, link in enumerate(publication_links):
                try:
                    # Get the full text of the publication entry
                    pub_text = link.get_text().strip()
                    
                    # Skip if this doesn't look like a publication (too short or no DOI pattern)
                    if len(pub_text) < 50 or '10.' not in pub_text:
                        continue
                    
                    print(f"\n📄 Processing publication {i+1}:")
                    print(f"   Text length: {len(pub_text)} characters")
                    print(f"   Preview: {pub_text[:100]}...")
                    
                    # Extract publication details using regex patterns
                    import re
                    
                    # Extract DOI
                    doi_match = re.search(r'10\.\d+/[^\s<>"]+', pub_text)
                    doi = doi_match.group() if doi_match else None
                    
                    # Extract year (typically in parentheses)
                    year_match = re.search(r'\((\d{4})\)', pub_text)
                    pub_year = int(year_match.group(1)) if year_match else year
                    
                    # Extract title (usually after year and before journal)
                    # Pattern: (...year...). Title. Journal
                    title_match = re.search(r'\(\d{4}\)\.\s*([^.]+\.)', pub_text)
                    title = title_match.group(1).strip() if title_match else None
                    
                    # Extract authors (before the year)
                    author_match = re.search(r'^([^(]+)\s*\(', pub_text)
                    authors = author_match.group(1).strip() if author_match else None
                    
                    # Extract journal (try different patterns)
                    # Pattern 1: Title. Journal Name
                    if title:
                        remaining_text = pub_text.split(title, 1)[1] if title in pub_text else pub_text
                        journal_match = re.search(r'^\s*([^.]+)', remaining_text)
                        journal = journal_match.group(1).strip() if journal_match else None
                    else:
                        journal = None
                    
                    # Get the href for more details
                    ifc_url = link.get('href')
                    if ifc_url and not ifc_url.startswith('http'):
                        # Strip any leading slash so the joined URL has no double slash
                        ifc_url = f"https://www.ifc.unam.mx/{ifc_url.lstrip('/')}"
                    
                    publication = {
                        'title': title,
                        'authors': authors,
                        'journal': journal,
                        'year': pub_year,
                        'doi': doi,
                        'ifc_url': ifc_url,
                        'raw_text': pub_text
                    }
                    
                    publications.append(publication)
                    
                    print(f"   ✅ Extracted:")
                    print(f"      Title: {title[:50] if title else 'Not found'}...")
                    print(f"      Authors: {authors[:50] if authors else 'Not found'}...")
                    print(f"      Journal: {journal[:30] if journal else 'Not found'}...")
                    print(f"      DOI: {doi}")
                    print(f"      Year: {pub_year}")
                    
                except Exception as e:
                    print(f"   ❌ Error processing publication {i+1}: {e}")
                    continue
            
            print(f"\n🎉 Successfully extracted {len(publications)} publications!")
            return publications

# Test the improved scraper
improved_publications = await scrape_ifc_publications_improved(2024)

🔍 Scraping IFC publications for 2024...
📚 Found 125 potential publication links

📄 Processing publication 2:
   Text length: 313 characters
   Preview: Tecalco-Cruz, AC, Medina-Abreu, KH, Oropeza-Martínez, E, Zepeda-Cervantes, J, Vázquez-Macías, A & Ma...
   ✅ Extracted:
      Title: Deregulation of interferon-gamma receptor 1 expres...
      Authors: Tecalco-Cruz, AC, Medina-Abreu, KH, Oropeza-Martín...
      Journal: World Journal Of Clinical Onco...
      DOI: 10.5306/wjco.v15.i2.195
      Year: 2024

📄 Processing publication 9:
   Text length: 343 characters
   Preview: Luis, E., Conde-Maldonado, V., García-Nieto, E., Juárez-Santacruz, L., Alvarado, M., & Anaya-Hernánd...
   ✅ Extracted:
      Title: Altered Expression of Thyroid- and Calcium Ion Cha...
      Authors: Luis, E., Conde-Maldonado, V., García-Nieto, E., J...
      Journal: Journal of Xenobiotics, 14(4),...
      DOI: 10.3390/jox14040081
      Year: 2024

📄 Processing publication 11:
   Text length: 344 characters
   Pr
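
The field extraction above runs several independent regexes, and the second sample shows the journal field picking up the volume ("Journal of Xenobiotics, 14(4),..."). As a sketch only, the same idea can be expressed with one pattern using named groups that stops the journal name at the first comma or period. It assumes an APA-like layout ("Authors (Year). Title. Journal, ...") and will cut titles that contain an internal period followed by a space. `parse_citation` and the sample citation are made up for illustration.

```python
import re
from typing import Optional

CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.\s*"  # authors, then "(YYYY)."
    r"(?P<title>.+?[.?!])\s+"                        # title up to its first terminator
    r"(?P<journal>[^,.]+)"                           # journal up to a comma or period
)


def parse_citation(text: str) -> Optional[dict]:
    """Split an APA-like citation into fields, or return None if it does not fit."""
    match = CITATION_RE.match(text.strip())
    if match is None:
        return None
    fields = {key: value.strip() for key, value in match.groupdict().items()}
    fields["year"] = int(fields["year"])
    return fields


# Made-up citation; not a real publication.
sample = (
    "Pérez, A., & López, B. (2024). A made-up title for illustration. "
    "Journal of Examples, 14(4), 81. https://doi.org/10.1234/example.2024.81"
)
parsed = parse_citation(sample)
print(parsed)
```

Returning `None` for text that does not fit the layout makes it easy to count how many entries the pattern fails on before trusting the extracted fields.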

## 4. Test Data Processing

Even if scraping fails, let's test the data processing with mock data.

In [8]:
# Create mock publications for testing
from scrapers.ifc_scraper import Publication

mock_publications = [
    Publication(
        title="Neural mechanisms of memory formation in hippocampal circuits",
        authors="García-López, M., Rodríguez-Silva, A., Mendoza-Pérez, J.",
        journal="Journal of Neuroscience",
        year=2024,
        doi="10.1523/JNEUROSCI.1234-24.2024",
        pubmed_id="38123456",
        ifc_url="https://www.ifc.unam.mx/publicacion.php?ut=000123456789",
        abstract="We investigated the cellular and molecular mechanisms underlying memory formation in hippocampal circuits. Using electrophysiological recordings and optogenetic manipulations, we found that..."
    ),
    Publication(
        title="Cardiac physiology under metabolic stress conditions",
        authors="Hernández-Campos, L., López-Martín, R.",
        journal="Cardiovascular Research",
        year=2024,
        doi="10.1093/cvr/cvz098",
        pubmed_id="38234567",
        ifc_url="https://www.ifc.unam.mx/publicacion.php?ut=000234567890",
        abstract="Heart function during metabolic stress was analyzed using isolated perfused heart preparations. Our results demonstrate significant changes in..."
    )
]

print(f"Created {len(mock_publications)} mock publications")
for i, pub in enumerate(mock_publications, 1):
    print(f"{i}. {pub.title[:50]}...")

Created 2 mock publications
1. Neural mechanisms of memory formation in hippocamp...
2. Cardiac physiology under metabolic stress conditio...


## 5. Test Data Saving

In [9]:
# Test saving publications
output_dir = Path("../data/raw")
output_dir.mkdir(parents=True, exist_ok=True)

# Save mock data
scraper.save_publications(mock_publications, output_dir / "test_ifc_publications.json")

# Verify saved data
import json
with open(output_dir / "test_ifc_publications.json", 'r', encoding='utf-8') as f:
    saved_data = json.load(f)
    
print(f"Saved {len(saved_data)} publications to file")
print("Sample saved data:")
print(json.dumps(saved_data[0], indent=2, ensure_ascii=False))

2025-08-30 01:28:16 | INFO | scrapers.ifc_scraper:save_publications:234 - Saved 2 publications to ../data/raw/test_ifc_publications.json


Saved 2 publications to file
Sample saved data:
{
  "title": "Neural mechanisms of memory formation in hippocampal circuits",
  "authors": "García-López, M., Rodríguez-Silva, A., Mendoza-Pérez, J.",
  "journal": "Journal of Neuroscience",
  "year": 2024,
  "doi": "10.1523/JNEUROSCI.1234-24.2024",
  "pubmed_id": "38123456",
  "ifc_url": "https://www.ifc.unam.mx/publicacion.php?ut=000123456789",
  "abstract": "We investigated the cellular and molecular mechanisms underlying memory formation in hippocampal circuits. Using electrophysiological recordings and optogenetic manipulations, we found that...",
  "keywords": null
}
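
The saved record above shows the fields that end up on disk. The sketch below illustrates how such a record could round-trip through JSON with a dataclass. `PublicationRecord`, `save_records`, and `load_records` are hypothetical stand-ins; the real `Publication` class and `save_publications` method live in `ifc_scraper.py` and may differ.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PublicationRecord:
    """Hypothetical stand-in mirroring the field names in the saved JSON."""
    title: str
    authors: str
    journal: str
    year: int
    doi: Optional[str] = None
    pubmed_id: Optional[str] = None
    ifc_url: Optional[str] = None
    abstract: Optional[str] = None
    keywords: Optional[list] = None


def save_records(records: list, path: Path) -> None:
    payload = [asdict(record) for record in records]
    # ensure_ascii=False keeps accented names readable in the file.
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")


def load_records(path: Path) -> list:
    items = json.loads(path.read_text(encoding="utf-8"))
    return [PublicationRecord(**item) for item in items]


record = PublicationRecord(
    title="Neural mechanisms of memory formation in hippocampal circuits",
    authors="García-López, M., Rodríguez-Silva, A., Mendoza-Pérez, J.",
    journal="Journal of Neuroscience",
    year=2024,
)
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "records.json"
    save_records([record], out_path)
    restored = load_records(out_path)
```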


## 6. Test Multiple Years (if single year works)

In [None]:
# Only run this if the single year test worked
# Uncomment and run if ready

# try:
#     all_publications = await scraper.scrape_all_years(2023, 2024)  # Test with 2 years
#     print(f"Successfully scraped {len(all_publications)} total publications")
#     
#     # Save all data
#     scraper.save_publications(all_publications, output_dir / "all_ifc_publications.json")
#     
#     # Analysis
#     df = pd.DataFrame([{
#         'title': pub.title,
#         'authors': pub.authors,
#         'journal': pub.journal,
#         'year': pub.year,
#         'has_abstract': bool(pub.abstract)
#     } for pub in all_publications])
#     
#     print("\nData summary:")
#     print(df.groupby('year').size())
#     print(f"\nArticles with abstracts: {df['has_abstract'].sum()}/{len(df)}")
#     
# except Exception as e:
#     print(f"Multi-year scraping failed: {e}")

print("Multi-year test commented out - uncomment when single year works")
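
The commented-out cell summarises results with pandas. For a quick check that needs no DataFrame, the same per-year count can be sketched with `collections.Counter`. The records below are made-up dictionaries, not scraped data.

```python
from collections import Counter

# Made-up records for illustration.
records = [
    {"title": "A", "year": 2023, "abstract": "text"},
    {"title": "B", "year": 2024, "abstract": None},
    {"title": "C", "year": 2024, "abstract": "text"},
]

# Number of publications per year.
per_year = Counter(record["year"] for record in records)
# Number of records with a non-empty abstract.
with_abstract = sum(bool(record["abstract"]) for record in records)

for year in sorted(per_year):
    print(year, per_year[year])
print(f"Articles with abstracts: {with_abstract}/{len(records)}")
```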

## Next Steps

1. **Adjust selectors**: If scraping fails, inspect the website HTML and adjust the CSS selectors in `ifc_scraper.py`
2. **Test with real data**: Once selectors work, test with actual IFC publications
3. **Rate limiting**: Ensure the scraper respects rate limits to avoid being blocked
4. **Error handling**: Test how the scraper handles missing data, network errors, etc.
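
Steps 3 and 4 can be sketched together: fetch pages one at a time with a pause between requests, and retry transient failures with exponential backoff. This is an illustration under stated assumptions, not the scraper's actual behaviour. `fetch_with_retry`, `fetch_sequentially`, and `flaky_fetch` are hypothetical, and the fake fetcher raises `ConnectionError` (an `OSError` subclass) where real `aiohttp` code would more likely catch `aiohttp.ClientError`.

```python
import asyncio


async def fetch_with_retry(fetch, url, retries=3, base_delay=0.01):
    """Call `fetch(url)`, retrying on OSError with exponential backoff."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def fetch_sequentially(fetch, urls, delay=0.01):
    """Fetch URLs one at a time, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index:
            await asyncio.sleep(delay)
        results.append(await fetch_with_retry(fetch, url))
    return results


calls = []


async def flaky_fetch(url):
    # Fake fetcher: fails the first time each URL is requested.
    calls.append(url)
    if calls.count(url) == 1:
        raise ConnectionError("simulated network error")
    return f"<html>{url}</html>"


urls = [f"https://www.ifc.unam.mx/publicaciones.php?year={y}" for y in (2023, 2024)]
pages = asyncio.run(fetch_sequentially(flaky_fetch, urls))
```

In a notebook cell, `await fetch_sequentially(flaky_fetch, urls)` would replace the `asyncio.run` call, and `delay` would come from the `rate_limit_delay` config value.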

## Common Issues
- **JavaScript rendering**: The website might load content with JavaScript. If so, consider a browser-automation tool such as Selenium.
- **Rate limiting**: Too many requests in a short time might get the scraper blocked. Increase `rate_limit_delay` in the config.
- **Changing HTML structure**: Websites change their HTML over time, so the selectors may need updating.
- **Access restrictions**: Some content might require authentication.
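
A changed HTML structure often shows up as fields that suddenly come back empty rather than as an exception. One possible guard is to measure how many records have each field filled in and flag the fields that fall below a threshold. The helpers, the threshold of 0.8, and the sample records below are all assumptions for illustration.

```python
def field_coverage(records: list[dict], fields: tuple[str, ...]) -> dict[str, float]:
    """Fraction of records with a non-empty value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(bool(record.get(field)) for record in records) / total
        for field in fields
    }


def check_coverage(records, fields=("title", "authors", "doi"), minimum=0.8):
    """Return the fields whose coverage falls below `minimum`."""
    coverage = field_coverage(records, fields)
    return [field for field, share in coverage.items() if share < minimum]


# Made-up records: two titles and one DOI are missing.
records = [
    {"title": "T1", "authors": "A", "doi": "10.1234/a"},
    {"title": None, "authors": "B", "doi": "10.1234/b"},
    {"title": None, "authors": "C", "doi": None},
    {"title": "T4", "authors": "D", "doi": "10.1234/d"},
]
low = check_coverage(records)
print(low)
```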