# UAQ Crawler Example

This notebook demonstrates how to use the modular UAQ crawler to extract document links and metadata from the UAQ repository.

In [10]:
# Make the project root importable, then import the crawler
import sys
sys.path.append('..')

import json
from src.crawlers.uaq_crawler import UAQCrawler

## Initialize the Crawler

In [11]:
# Create crawler instance
crawler = UAQCrawler()
print("Crawler initialized successfully!")

Crawler initialized successfully!


## Crawl Links from First Page

Extract document handles from the first page of search results.

In [12]:
# Crawl only the first page (max_pages=1)
print("Crawling links from first page...")
links = crawler.crawl_links(max_pages=1)

print(f"Found {len(links)} document links:")
for i, link in enumerate(links[:5]):  # Show first 5 links
    print(f"  {i+1}. {link}")

if len(links) > 5:
    print(f"  ... and {len(links) - 5} more links")

Crawling links from first page...
Page 1 done
Found 100 document links:
  1. /handle/123456789/246
  2. /handle/123456789/238
  3. /handle/123456789/266
  4. /handle/123456789/260
  5. /handle/123456789/257
  ... and 95 more links
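
The links come back as relative handle paths, so they need a host before they can be opened or shared. The sketch below joins them to a base URL and drops duplicates while keeping order. `BASE_URL` is an assumption taken from the download URLs printed later in this notebook, and `to_absolute` is a hypothetical helper, not part of `UAQCrawler`.

```python
from urllib.parse import urljoin

# Assumption: handle pages live on the same host as the bitstream URLs
# shown in the download-links output of this notebook.
BASE_URL = "https://ri-ng.uaq.mx"


def to_absolute(handles, base_url=BASE_URL):
    """Turn relative handle paths into absolute URLs, keeping first-seen order."""
    seen = set()
    urls = []
    for handle in handles:
        url = urljoin(base_url, handle)
        if url not in seen:  # skip handles that appear more than once
            seen.add(url)
            urls.append(url)
    return urls


sample = ["/handle/123456789/246", "/handle/123456789/238", "/handle/123456789/246"]
for url in to_absolute(sample):
    print(url)
```

With real results, `to_absolute(links)` would be the equivalent call.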


## Parse Document Metadata

Extract metadata for each document found.

In [13]:
# Parse documents metadata
print("\nParsing document metadata...")
documents_data = crawler.parse_documents(links)

print(f"Parsed {len(documents_data)} documents")

# Display first document metadata
if documents_data:
    first_doc = documents_data[0]
    print("\nFirst document metadata:")
    for key, value in first_doc.items():
        if value and value[0]:  # Only show non-empty fields
            print(f"  {key}: {value[0] if len(value) == 1 else value}")


Parsing document metadata...
Document /handle/123456789/246 started
Document /handle/123456789/246 done
Document /handle/123456789/238 started
Document /handle/123456789/238 done
Document /handle/123456789/266 started
Document /handle/123456789/266 done
Document /handle/123456789/260 started
Document /handle/123456789/260 done
Document /handle/123456789/257 started
Document /handle/123456789/257 done
Document /handle/123456789/262 started
Document /handle/123456789/262 done
Document /handle/123456789/255 started
Document /handle/123456789/255 done
Document /handle/123456789/254 started
Document /handle/123456789/254 done
Document /handle/123456789/258 started
Document /handle/123456789/258 done
Document /handle/123456789/206 started
Document /handle/123456789/206 done
Document /handle/123456789/212 started
Document /handle/123456789/212 done
Document /handle/123456789/208 started
Document /handle/123456789/208 done
Document /handle/123456789/229 started
Document /handle/123456789/229 
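
The display loop above treats every metadata value as a list. If that holds for all records, a small flattening step can make them easier to inspect. This is a minimal sketch under that assumption: `flatten_metadata` is a hypothetical helper, and every field name in the example other than `DC.title` is made up for illustration.

```python
def flatten_metadata(doc):
    """Drop empty fields and collapse single-item lists to a plain value."""
    flat = {}
    for key, values in doc.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue  # field is missing or contains only empty strings
        flat[key] = non_empty[0] if len(non_empty) == 1 else non_empty
    return flat


# Hypothetical record, shaped the way parse_documents output is assumed to be.
example_doc = {
    "DC.title": ["Example thesis title"],
    "DC.creator": ["Author One", "Author Two"],
    "DC.description": [""],
    "DC.subject": [],
}
print(flatten_metadata(example_doc))
```

Fields with several values stay as lists, so nothing is lost when a document has more than one author or subject.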

## Extract Download Links

Get PDF download links for documents that have them.

In [14]:
# Get download links
download_links = crawler.get_download_links(documents_data)

print(f"\nFound {len(download_links)} documents with PDF downloads:")
for i, (url, filename) in enumerate(download_links[:3]):  # Show first 3
    print(f"  {i+1}. {filename}: {url[:80]}...")

if len(download_links) > 3:
    print(f"  ... and {len(download_links) - 3} more download links")


Found 100 documents with PDF downloads:
  1. 246: https://ri-ng.uaq.mx/bitstream/123456789/246/7/FIMAN-239302.pdf...
  2. 238: https://ri-ng.uaq.mx/bitstream/123456789/238/7/FIMAN-152137.pdf...
  3. 266: https://ri-ng.uaq.mx/bitstream/123456789/266/1/RI003384.pdf...
  ... and 97 more download links
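
The three URLs printed above share the layout `bitstream/<prefix>/<item id>/<sequence>/<file name>`. Assuming that layout holds for the rest of the results, the pieces can be pulled apart with the standard library. `split_bitstream_url` is a hypothetical helper written for this example, not a crawler method.

```python
from urllib.parse import unquote, urlsplit


def split_bitstream_url(url):
    """Split a bitstream URL into handle, sequence number, and file name."""
    parts = urlsplit(url).path.strip("/").split("/")
    # Assumed layout: bitstream/<prefix>/<item id>/<sequence>/<file name>
    if len(parts) != 5 or parts[0] != "bitstream":
        raise ValueError(f"Unexpected bitstream URL: {url}")
    _, prefix, item_id, sequence, filename = parts
    return {
        "handle": f"{prefix}/{item_id}",
        "sequence": int(sequence),
        "filename": unquote(filename),  # decode any percent-escapes in the name
    }


info = split_bitstream_url(
    "https://ri-ng.uaq.mx/bitstream/123456789/246/7/FIMAN-239302.pdf"
)
print(info)
```

URLs that do not match the assumed layout raise `ValueError` instead of returning partial data, which makes unexpected results easy to spot.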


## Save Results

Save the crawled data to JSON files for later use.

In [15]:
# Save results to files
crawler.save_to_json(links, '../data/raw/links_first_page.json')
crawler.save_to_json(documents_data, '../data/raw/documents_first_page.json')

print("\nResults saved to:")
print("  - data/raw/links_first_page.json")
print("  - data/raw/documents_first_page.json")


Results saved to:
  - data/raw/links_first_page.json
  - data/raw/documents_first_page.json
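
To reuse the saved files in a later session, they can be read back with the `json` module. The sketch assumes `save_to_json` writes standard UTF-8 JSON, which this notebook does not confirm. It uses a temporary stand-in file so that it runs on its own, and `load_json` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def load_json(path):
    """Read a JSON file back into Python objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Self-contained demonstration: write a small stand-in file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "links_first_page.json"
    sample_links = ["/handle/123456789/246", "/handle/123456789/238"]
    path.write_text(json.dumps(sample_links), encoding="utf-8")
    reloaded = load_json(path)

print(reloaded == sample_links)
```

In this notebook the equivalent call would be `load_json('../data/raw/links_first_page.json')`.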


## Summary

In [16]:
print("\n=== CRAWLING SUMMARY ===")
print("Pages crawled: 1")
print(f"Links found: {len(links)}")
print(f"Documents parsed: {len(documents_data)}")
print(f"PDF downloads available: {len(download_links)}")

# Count documents with titles
docs_with_titles = sum(1 for doc in documents_data if doc.get('DC.title') and doc['DC.title'][0])
print(f"Documents with titles: {docs_with_titles}")


=== CRAWLING SUMMARY ===
Pages crawled: 1
Links found: 100
Documents parsed: 100
PDF downloads available: 100
Documents with titles: 100
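
The title count above can be extended to every metadata field at once. The sketch below counts, per field, how many documents carry at least one non-empty value. It assumes each document is a dict of lists, as the title check does. `field_coverage` is a hypothetical helper, and `DC.creator` is an illustrative field name.

```python
from collections import Counter


def field_coverage(documents):
    """Count how many documents have a non-empty value for each field."""
    counts = Counter()
    for doc in documents:
        for key, values in doc.items():
            if values and any(values):  # at least one non-empty entry
                counts[key] += 1
    return counts


# Hypothetical records for illustration only.
sample_docs = [
    {"DC.title": ["A"], "DC.creator": ["X"]},
    {"DC.title": ["B"], "DC.creator": [""]},
    {"DC.title": [""]},
]
coverage = field_coverage(sample_docs)
for field, count in coverage.most_common():
    print(f"{field}: {count}/{len(sample_docs)}")
```

Running `field_coverage(documents_data)` on the parsed results would show which fields are sparse before any downstream processing relies on them.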


## Optional: Download PDFs

If you want to download the PDFs, uncomment and run the following cell:

In [17]:
# # Download PDFs (uncomment to run)
# print("\nDownloading PDFs...")
# downloaded_files = crawler.download_documents(documents_data)
# print(f"Downloaded {len(downloaded_files)} PDF files to data/raw/")
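
For more control over the download step, for example to skip files that already exist and to pause between requests, a standalone loop over the `(url, filename)` pairs is one option. This is a sketch under stated assumptions, not how `download_documents` works: `fetch_bytes` and `download_missing` are hypothetical helpers, the `.pdf` suffix and one-second delay are arbitrary choices, and nothing is fetched until the function is called.

```python
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_missing(pairs, dest_dir, fetch=fetch_bytes, delay=1.0):
    """Save each (url, name) pair as <name>.pdf, skipping files already on disk."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for url, name in pairs:
        target = dest / f"{name}.pdf"
        if target.exists():
            continue  # already downloaded in an earlier run
        target.write_bytes(fetch(url))
        written.append(target)
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return written


# Example call (not executed here):
# download_missing(download_links, '../data/raw/pdfs')
```

Passing `fetch` as a parameter keeps the loop testable without network access, since a stub function can stand in for the real request.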