# Campaign Documents Data Collection System

## Purpose and Scope

This Jupyter notebook implements a data collection system for extracting political campaign documents from the UC Santa Barbara American Presidency Project website (presidency.ucsb.edu). The system is designed to systematically collect campaign documents with associated metadata and full content text for analysis purposes.

## System Architecture

### Data Source
- **Primary URL**: https://www.presidency.ucsb.edu/documents/app-categories/elections-and-transitions/campaign-documents
- **Data Structure**: Paginated results with up to 1000 documents per page (requested via the `items_per_page=1000` query parameter)
- **Content Format**: HTML pages with structured metadata and full document text

### Collection Methodology

#### Phase 1: Metadata Extraction
The system performs initial collection of document metadata including:
- Document publication dates
- Document titles and classifications
- Direct links to full document content
- Related category assignments
- Associated reference links

#### Phase 2: Content Extraction
For each document URL collected in Phase 1, the system extracts:
- Complete document text content
- Speaker identification and titles
- Document type classification (Speech, Remarks, Statement, Interview, Address, Debate)
- Location information extracted from titles
- Video content availability detection
- Word count calculations
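
The classification and word-count steps could be sketched as follows. The content-extraction code is not shown in this section, so the keyword order, the `Other` fallback label, and the function names `classify_document_type` and `count_words` are assumptions made for illustration, not the notebook's actual rules.

```python
import re

# Hypothetical keyword priority: earlier entries win when a title matches several.
DOCUMENT_TYPES = ("Debate", "Interview", "Address", "Statement", "Remarks", "Speech")


def classify_document_type(title):
    """Return the first document type whose name appears as a whole word in the title."""
    for doc_type in DOCUMENT_TYPES:
        if re.search(rf"\b{doc_type}\b", title, flags=re.IGNORECASE):
            return doc_type
    return "Other"  # assumed fallback label, not taken from the notebook


def count_words(text):
    """Count whitespace-separated tokens in the document text."""
    return len(text.split())


print(classify_document_type("Remarks at a Campaign Rally in Des Moines, Iowa"))
print(count_words("Four score and seven years ago"))
```

A whole-word match keeps a title such as "Press Release" from being classified by accident, at the cost of missing plural forms like "Addresses".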

## Technical Implementation

### Threading Architecture
- **Concurrent Processing**: ThreadPoolExecutor with configurable worker threads
- **Connection Pooling**: Session-based HTTP requests with retry strategies
- **Rate Limiting**: Built-in delays to respect server resources

### Data Persistence
- **Caching System**: Pickle-based storage to avoid re-processing existing data
- **Checkpoint Management**: JSON-based progress tracking for resumable operations
- **Output Format**: CSV files with UTF-8 encoding
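
A minimal sketch of how a pickle cache and a JSON checkpoint might fit together is shown below. The file layout (a dictionary keyed by URL, and a checkpoint with `processed` and `total` counters) and all four function names are assumptions; the notebook's own persistence code is not part of this section.

```python
import json
import pickle
from pathlib import Path


def load_cache(cache_path):
    """Return the cached {url: record} dict, or an empty dict if no cache exists."""
    path = Path(cache_path)
    if path.exists():
        # Only unpickle files this notebook wrote itself: pickle can execute code.
        with path.open("rb") as fh:
            return pickle.load(fh)
    return {}


def save_cache(cache, cache_path):
    """Write the whole cache dict to disk."""
    with Path(cache_path).open("wb") as fh:
        pickle.dump(cache, fh)


def save_checkpoint(processed_count, total_count, checkpoint_path):
    """Record progress so an interrupted run can resume."""
    state = {"processed": processed_count, "total": total_count}
    Path(checkpoint_path).write_text(json.dumps(state), encoding="utf-8")


def load_checkpoint(checkpoint_path):
    """Return the saved progress, or zeroed counters if no checkpoint exists."""
    path = Path(checkpoint_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"processed": 0, "total": 0}
```

On restart, a run would call `load_cache()` first and skip every URL already present as a key, then call `save_checkpoint()` every `save_interval` documents.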

### Error Handling
- **Network Resilience**: Automatic retries with exponential backoff
- **Data Validation**: Content verification and error status tracking
- **Graceful Degradation**: Partial success handling for incomplete extractions
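
Retrying with exponential backoff can be illustrated with a small standard-library helper. This is a sketch under assumptions: the name `retry_with_backoff`, its parameters, and the doubling schedule are hypothetical, and the metadata scraper shown later in this notebook does not itself wrap its requests this way.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

A request could then be wrapped as `retry_with_backoff(lambda: session.get(url, timeout=15))`, where `session` and `url` stand for the caller's own objects. The `sleep` argument is injectable so the schedule can be checked without actually waiting.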

## Output Data Structure

### Primary Metadata Fields
- `Date`: Document publication date, formatted as `Month DD, YYYY` when the source value can be parsed
- `Title`: Complete document title as published
- `Document_Link`: Direct URL to full document content
- `Related_Category`: Associated political figure or category
- `Related_Link`: Reference URL for related category
- `Page`: Source page number for tracking purposes
- `URL_Page_Index`: Zero-based page index for URL construction

### Enhanced Content Fields
- `Document_Title`: Extracted title from document page
- `Document_Date`: Parsed date from document metadata
- `Document_Content`: Full extracted text content
- `Speaker`: Identified speaker or author
- `Speaker_Title`: Official title or position
- `Document_Type`: Classified document category
- `Location`: Extracted location information
- `Video_Available`: Boolean flag for video content presence
- `Word_Count`: Total word count of document content
- `Extraction_Status`: Processing status indicator

## Dependencies and Requirements

### Required Python Packages
```
requests>=2.25.0
beautifulsoup4>=4.9.0
pandas>=1.3.0
tqdm>=4.60.0
```

### System Requirements
- Python 3.7 or higher
- Minimum 2GB available memory for large datasets
- Stable internet connection for data collection

## Execution Instructions

### Cell Execution Order
1. **Cell 1**: Import dependencies and define the `CampaignDocumentsScraper` class
2. **Cell 2**: Execute metadata collection pipeline
3. **Cell 3**: Process full content extraction (optional)

### Configuration Parameters
- `max_workers`: Number of concurrent threads (default: 5)
- `max_pages`: Maximum pages to process (default: 25)
- `batch_size`: URLs processed per batch (default: 50)
- `save_interval`: Progress save frequency (default: 500)

## Performance Characteristics

### Metadata Collection
- **Processing Rate**: Approximately 400-500 documents per second
- **Estimated Runtime**: 1-2 minutes for 25,000 documents
- **Memory Usage**: 10-15 MB for complete dataset

### Content Extraction
- **Processing Rate**: 2-3 documents per second (network dependent)
- **Estimated Runtime**: 30-45 minutes for 7,500 unique documents
- **Memory Usage**: 50-100 MB for cached content

### Network Considerations
- Respectful scraping with built-in delays
- Automatic retry mechanisms for failed requests
- Session persistence for connection efficiency


## Step 1: Library Imports and Class Definitions

This cell imports the required dependencies and defines the primary data collection class. The implementation includes the following components:

### Import Dependencies
- **requests**: HTTP client library for web requests
- **BeautifulSoup**: HTML parsing and content extraction
- **pandas**: Data structure manipulation and CSV export
- **threading**: Thread-safe operations for concurrent processing
- **concurrent.futures**: ThreadPoolExecutor for parallel request handling
- **logging**: Structured logging for process monitoring

### CampaignDocumentsScraper Class Architecture

The main class implements the following methods and features:

#### Core Methods
- `__init__()`: Initialize session configuration and threading parameters
- `get_total_pages()`: Determine available pages through pagination analysis
- `test_page_availability()`: Binary search algorithm for page boundary detection
- `scrape_page()`: Extract document metadata from individual pages
- `scrape_all_pages()`: Orchestrate multi-threaded page processing

#### Session Management
- **HTTP Headers**: A fixed desktop-browser User-Agent plus standard browser headers
- **Connection Pooling**: Persistent session for request efficiency
- **Request Timeout**: Per-request timeouts (10 seconds for page discovery, 15 seconds for page scraping)

#### Thread Safety Features
- **Data Lock**: Mutex protection for shared data structures
- **Worker Management**: Configurable thread pool sizing
- **Rate Limiting**: Fixed pauses between requests (0.3 seconds after each scraped page, 0.5 seconds between availability probes)

### Detailed Code Explanation

#### Import Section
```python
import requests                    # HTTP library for making web requests
from bs4 import BeautifulSoup     # HTML parser for extracting data from web pages
import pandas as pd               # Data manipulation and CSV export
import time                       # Time utilities for delays and timing
from datetime import datetime     # Date parsing and formatting
import re                         # Regular expressions for pattern matching
from concurrent.futures import ThreadPoolExecutor, as_completed  # Multi-threading
import threading                  # Thread synchronization primitives
from urllib.parse import urljoin, urlparse  # URL manipulation utilities
import logging                    # Structured logging system
```

#### Class Constructor Explanation
The `__init__` method sets up the scraper with:
- **base_url**: The target website URL to scrape
- **max_workers**: Number of concurrent threads (default: 5)
- **max_pages**: Maximum pages to process (default: 25)
- **session**: Persistent HTTP session with browser-like headers
- **data_lock**: Thread synchronization lock for safe data access
- **all_documents**: List to store all scraped document data

#### HTTP Session Configuration
```python
self.session.headers.update({
    'User-Agent': 'Mozilla/5.0...',    # Mimics a real browser
    'Accept': 'text/html...',          # Specifies accepted content types
    'Accept-Language': 'en-US,en',     # Language preferences
    'Accept-Encoding': 'gzip, deflate', # Compression support
    'Connection': 'keep-alive',        # Persistent connections
    'Upgrade-Insecure-Requests': '1'   # HTTPS preference
})
```

#### Pagination Detection Logic
The `get_total_pages()` method uses multiple strategies:
1. **HTML Pagination Parser**: Searches for `<ul class="pagination">` elements
2. **Link Analysis**: Extracts page numbers from pagination links using regex
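3. **Document Count Fallback**: If pagination yields no page numbers, counts the documents on the first page; fewer than 1000 is treated as a single-page result
4. **Binary Search Fallback**: If the first page is full (1000 documents), calls `test_page_availability()` to probe for the last page that contains documents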
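
The link-analysis strategy can be sketched in isolation. One detail worth noting: because the base URL carries `items_per_page=1000`, a bare `page=(\d+)` pattern would match the `page=1000` inside that parameter first, so the sketch anchors on `?` or `&`. The function name `total_pages_from_hrefs` and the sample links in the usage line are hypothetical.

```python
import re

# Anchor on "?" or "&" so "items_per_page=1000" is not mistaken for "page=1000".
PAGE_PARAM = re.compile(r"[?&]page=(\d+)")


def total_pages_from_hrefs(hrefs, max_pages=25):
    """Derive a 1-based page total from the 0-based page indices in pagination links."""
    indices = []
    for href in hrefs:
        match = PAGE_PARAM.search(href)
        if match:
            indices.append(int(match.group(1)))
    if not indices:
        return 1  # no numbered links found: assume a single page
    # The highest 0-based index plus one is the page count, capped at max_pages.
    return min(max(indices) + 1, max_pages)


print(total_pages_from_hrefs(["?items_per_page=1000&page=1", "?items_per_page=1000&page=7"]))
```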

#### Page Scraping Algorithm
The `scrape_page()` method processes individual pages:
1. **URL Construction**: Builds page-specific URLs with zero-based indexing
2. **HTML Retrieval**: Makes HTTP request with timeout and error handling
3. **DOM Parsing**: Uses BeautifulSoup to parse HTML structure
4. **Data Extraction**: Finds document rows and extracts metadata fields
5. **Date Processing**: Parses ISO 8601 timestamps and `YYYY-MM-DD` dates into `Month DD, YYYY`; any other value is kept as-is
6. **Link Processing**: Converts relative URLs to absolute URLs
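
The URL construction and link processing steps can be shown as two small helpers. This is a sketch: the names `build_page_url` and `absolute_link` are hypothetical, and the path `/documents/example-remarks` in the usage lines is a made-up placeholder. It assumes, as the notebook does, that the base URL already contains a query string, so `&page=` can be appended.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.presidency.ucsb.edu"


def build_page_url(base_url, page_num):
    """Map a 1-indexed page number to the listing URL (the site counts pages from 0)."""
    page_index = page_num - 1
    # The first page is the base URL itself; later pages append the 0-based index.
    return base_url if page_index == 0 else f"{base_url}&page={page_index}"


def absolute_link(href):
    """Resolve a relative href against the site root; absolute URLs pass through unchanged."""
    return urljoin(SITE_ROOT, href)


base = SITE_ROOT + "/documents/app-categories/elections-and-transitions/campaign-documents?items_per_page=1000"
print(build_page_url(base, 3))
print(absolute_link("/documents/example-remarks"))
```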

#### Multi-threading Coordination
The `scrape_all_pages()` method orchestrates concurrent processing:
1. **Worker Pool Creation**: Initializes ThreadPoolExecutor with `min(max_workers, total_pages, 6)` threads
2. **Task Submission**: Submits each page as a separate task
3. **Result Aggregation**: Collects completed results in thread-safe manner
4. **Progress Tracking**: Logs completion status and document counts
5. **Error Handling**: Gracefully handles individual page failures
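
The coordination pattern above can be tried without any network access by swapping the page scraper for a stand-in. In this sketch `fetch_page` and `collect_all` are hypothetical names, and `fetch_page` returns fabricated placeholder records purely to exercise the thread pool.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_page(page_num):
    """Stand-in for scrape_page(): returns placeholder records instead of making a request."""
    return [{"Page": page_num, "Title": f"Document {i}"} for i in range(3)]


def collect_all(total_pages, max_workers=5):
    """Run fetch_page for every page concurrently and merge the results."""
    results = []
    # Same cap as the notebook, with a floor of 1 so the pool is always valid.
    workers = max(1, min(max_workers, total_pages, 6))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        future_to_page = {
            executor.submit(fetch_page, n): n for n in range(1, total_pages + 1)
        }
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(future_to_page):
            page_num = future_to_page[future]
            try:
                results.extend(future.result())
            except Exception as exc:
                # One failed page does not stop the others.
                print(f"Page {page_num} failed: {exc}")
    return results


documents = collect_all(total_pages=4)
print(len(documents))
```

Because results arrive in completion order, any ordering that matters downstream has to be restored by sorting on `Page` afterwards.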


### Detailed Line-by-Line Code Explanation

#### Key Method Explanations

**`get_total_pages()` Method Logic:**
```python
# Method 1: Look for pagination HTML elements
pagination = soup.find('ul', class_='pagination')
# Searches for standard pagination navigation in the page footer

# Extract last page number from pagination links
page_match = re.search(r'[?&]page=(\d+)', last_href)
# Matches the "page=NUMBER" query parameter; the leading [?&] keeps "items_per_page=1000" from matching

# Method 2: Count documents on first page as fallback
document_count = 0
for row in rows:
    main_col = row.find('div', class_='col-sm-8')
    if main_col and main_col.find('div', class_='field-title'):
        document_count += 1
# Counts actual document entries to validate page structure
```

**`test_page_availability()` Binary Search:**
```python
left, right = 1, self.max_pages  # Set search boundaries
while left <= right:             # Binary search loop
    mid = (left + right) // 2    # Calculate middle page
    test_url = f"{self.base_url}&page={mid - 1}"  # Build test URL
    # Tests if page exists by checking for document content
```
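
The excerpt above omits the branch that narrows the search range. A self-contained sketch of the complete loop follows, with the HTTP check replaced by a caller-supplied predicate; `find_last_page` and `has_documents` are hypothetical names. The search assumes that pages are contiguous, meaning that once one page is empty every later page is empty too.

```python
def find_last_page(has_documents, max_pages):
    """Return the highest 1-indexed page for which has_documents(page) is True."""
    left, right = 1, max_pages
    last_valid = 1  # mirrors the notebook's default when no page is confirmed
    while left <= right:
        mid = (left + right) // 2
        if has_documents(mid):
            last_valid = mid   # page exists: remember it and search higher
            left = mid + 1
        else:
            right = mid - 1    # page is empty: search lower
    return last_valid


# Example with a pretend site that has 8 pages of results.
print(find_last_page(lambda page: page <= 8, max_pages=25))
```

With `max_pages=25` the loop needs at most five probes, compared with up to 25 for a linear scan.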

**`scrape_page()` Data Extraction Process:**
```python
# Find main content areas in HTML structure
main_col = row.find('div', class_='col-sm-8')    # Document content column
related_col = row.find('div', class_='col-sm-4')  # Related links column

# Extract date with multiple fallback strategies
date_element = main_col.find('span', {'property': 'dc:date'})
date_content = date_element.get('content', '')     # Try metadata first
date_text = date_element.get_text(strip=True)      # Fallback to display text

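# Process title and document link
title_element = main_col.find('div', class_='field-title')  # Title container
link_element = title_element.find('a')             # Find link within title
title = link_element.get_text(strip=True)          # Extract clean title text
href = link_element.get('href', '')                # Link target, usually relative
document_link = urljoin('https://www.presidency.ucsb.edu', href)  # Make absolute URL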
```

**`scrape_all_pages()` Threading Coordination:**
```python
# Create thread pool, capping workers at 6 and at the number of pages
actual_workers = min(self.max_workers, total_pages, 6)
with ThreadPoolExecutor(max_workers=actual_workers) as executor:
    
    # Submit all page scraping tasks simultaneously
    future_to_page = {
        executor.submit(self.scrape_page, page_num): page_num 
        for page_num in range(1, total_pages + 1)
    }
    
    # Collect results as they complete (not in order)
    for future in as_completed(future_to_page):
        page_documents = future.result()           # Get results from completed thread
        with self.data_lock:                       # Thread-safe data access
            self.all_documents.extend(page_documents)  # Add to main collection
```

**Date Parsing Logic:**
```python
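def parse_date(self, date_string):
    if not date_string:
        return None
    try:
        if 'T' in date_string:  # ISO 8601 timestamp detection
            parsed = datetime.fromisoformat(date_string.replace('Z', '+00:00'))
        else:  # Plain YYYY-MM-DD date
            parsed = datetime.strptime(date_string, '%Y-%m-%d')
        return parsed.strftime('%B %d, %Y')  # Standardized output, e.g. "January 05, 2020"
    except ValueError:
        return date_string.strip()  # Unparseable input is returned unchanged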
```


## Step 2: Metadata Collection Pipeline Execution

This cell executes the primary data collection workflow with the following operational sequence:

### Initialization Phase
1. **Scraper Configuration**: Initialize CampaignDocumentsScraper with parameters:
   - Worker threads: 5 concurrent processors
   - Page limit: 25 maximum pages
   - Request timeout: 15 seconds per page (fixed inside `scrape_page()`, not a constructor argument)

### Discovery Phase
2. **Pagination Analysis**: Determine total available pages through:
   - HTML pagination element parsing
   - Binary search for page boundary detection
   - Maximum page constraint enforcement

### Collection Phase
3. **Concurrent Page Processing**: Execute multi-threaded data extraction:
   - ThreadPoolExecutor manages worker threads
   - Each worker processes individual pages
   - Thread-safe data aggregation

### Data Processing Phase
4. **Metadata Extraction**: For each document, extract:
   - Publication date with format standardization
   - Document title and classification
   - Direct URL to full content
   - Related category assignments
   - Reference link associations

### Output Generation Phase
5. **Result Compilation**: Generate structured output including:
   - CSV file export with UTF-8 encoding
   - Processing statistics and performance metrics
   - Data quality validation results

### Expected Performance Metrics
- **Processing Rate**: 400-500 documents per second
- **Execution Time**: 60-120 seconds for 25,000 documents
- **Memory Consumption**: 10-15 MB for complete dataset
- **Success Rate**: 99% successful metadata extraction
- **Output File**: `campaign_documents_[count]docs_[timestamp].csv`


## Step 3: Content Extraction Pipeline Execution

This cell executes the advanced content extraction workflow for individual campaign document URLs collected in the metadata phase.

### Processing Overview

#### Data Source Preparation
1. **CSV Import**: Loads previously collected campaign document metadata from CSV files
2. **URL Validation**: Filters valid document URLs for content extraction
3. **Batch Configuration**: Organizes URLs into processing batches for efficient handling
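
The validation and batching steps might look like the sketch below. It relies on the `No Link` placeholder that the metadata phase writes when a title has no anchor; the function names `valid_document_urls` and `make_batches`, and the decision to de-duplicate while preserving order, are assumptions.

```python
def valid_document_urls(links):
    """Keep unique http(s) links in first-seen order, dropping placeholders and blanks."""
    seen = set()
    urls = []
    for link in links:
        is_url = isinstance(link, str) and link.startswith(("http://", "https://"))
        if is_url and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls


def make_batches(items, batch_size=50):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


links = ["https://example.org/a", "No Link", "https://example.org/a", "https://example.org/b"]
batches = make_batches(valid_document_urls(links), batch_size=50)
print(len(batches), len(batches[0]))
```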

#### Content Extraction Architecture
The system implements a multi-threaded content extraction pipeline that applies the steps below to each document URL.

#### Extraction Process per Document
1. **HTTP Request Execution**: Retrieves full document page content with timeout handling
2. **Document Metadata Extraction**: Extracts title, date, and classification information
3. **Speaker Information Processing**: Identifies document authors and their official titles
4. **Content Text Extraction**: Retrieves complete document text with paragraph preservation
5. **Document Type Classification**: Categorizes documents (Speech, Remarks, Statement, Interview, Address, Debate)
6. **Location Information Extraction**: Parses location data from document titles and content
7. **Video Content Detection**: Identifies presence of embedded video content
8. **Word Count Calculation**: Computes total word count for analysis purposes

#### Advanced Processing Features
- **Caching System**: Pickle-based storage prevents re-processing of existing documents
- **Checkpoint Management**: JSON-based progress tracking enables resumable operations
- **Error Recovery**: Graceful handling of network failures and parsing errors
- **Content Validation**: Verification of extraction completeness and data quality

#### Threading and Performance Management
- **Worker Thread Configuration**: Configurable number of concurrent processors (default: 8)
- **Request Rate Limiting**: Built-in delays respect server resources and prevent blocking
- **Progress Monitoring**: Real-time progress tracking with detailed logging
- **Memory Management**: Efficient handling of large document collections
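
When several worker threads share one server, a per-thread `time.sleep` does not bound the overall request rate. One possible approach, sketched here under assumptions, is a shared limiter that spaces calls across all threads. The `RateLimiter` class is hypothetical and is not the notebook's mechanism, which uses fixed sleeps.

```python
import threading
import time


class RateLimiter:
    """Enforce a minimum interval between calls across all threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # earliest time the next call may proceed

    def wait(self):
        """Block until this caller's reserved time slot arrives."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the following slot before releasing the lock.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other threads can queue
```

Each worker would call `limiter.wait()` immediately before its request. The clock and sleep functions are injectable so the spacing can be verified without real delays.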

#### Output Data Enhancement
The enhanced dataset includes additional fields:
- `Document_Title`: Title extracted directly from document page
- `Document_Date`: Parsed date from document metadata
- `Document_Content`: Complete document text content
- `Speaker`: Identified document author or speaker
- `Speaker_Title`: Official position or title of the speaker
- `Document_Type`: Classified document category
- `Location`: Extracted location information
- `Video_Available`: Boolean indicator for video content presence
- `Word_Count`: Total word count of document content
- `Extraction_Status`: Processing status for quality assurance

#### Performance Characteristics
- **Processing Rate**: 2-3 documents per second (network dependent)
- **Total Processing Time**: 30-45 minutes for 7,500 unique documents
- **Memory Usage**: 50-100 MB during processing
- **Cache Storage**: 20-50 MB for persistent caching
- **Success Rate**: 95-99% successful content extraction

#### Quality Assurance Measures
- **Content Verification**: Validation of extracted text completeness
- **Metadata Accuracy**: Cross-verification of extracted metadata
- **Error Logging**: Detailed logging of processing failures
- **Data Consistency**: Standardized output format across all documents


### Step 2 Execution Code Explanation

#### Scraper Initialization and Configuration
```python
base_url = "https://www.presidency.ucsb.edu/documents/app-categories/elections-and-transitions/campaign-documents?items_per_page=1000"
scraper = CampaignDocumentsScraper(base_url, max_workers=5, max_pages=25)
```
- **base_url**: Target URL with 1000 items per page parameter for maximum efficiency
- **max_workers=5**: Limits concurrent threads to avoid overwhelming the server
- **max_pages=25**: Safety limit to prevent infinite loops or excessive requests

#### Timing and Performance Measurement
```python
start_time = time.time()          # Record start time for performance measurement
documents = scraper.scrape_all_pages()  # Execute main scraping operation
end_time = time.time()            # Record completion time
scraping_time = end_time - start_time    # Calculate total execution duration
```

#### Data Processing and Validation
```python
df = pd.DataFrame(documents)      # Convert raw data to structured DataFrame
if not df.empty:                  # Validate that data was successfully collected
    df = df.sort_values(['Page', 'Date'], ascending=[True, False])  # Sort for analysis
```
- **DataFrame Conversion**: Transforms list of dictionaries into structured data
- **Empty Check**: Prevents errors if no data was collected
- **Sorting**: Orders by page first, then by `Date` descending. Because `Date` holds formatted strings such as `January 05, 2020`, this order is lexicographic, not chronological
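
Since `Date` is stored as text in the `Month DD, YYYY` form, a descending sort on that column compares strings rather than dates. If newest-first order within each page is wanted, one option is to sort the raw records on a parsed key before building the DataFrame. The sketch below assumes that format; `chronological_key` and `sort_records` are hypothetical names, and the sample records are invented.

```python
from datetime import datetime


def chronological_key(record):
    """Sort key: page ascending, then date descending, with unparseable dates last."""
    try:
        parsed = datetime.strptime(record["Date"], "%B %d, %Y")
    except (ValueError, TypeError):
        return (record["Page"], 1, 0)  # e.g. "No Date" sorts after real dates
    # Negating the ordinal day number turns ascending sort into newest-first.
    return (record["Page"], 0, -parsed.toordinal())


def sort_records(records):
    """Return a new list of records in page order, newest first within each page."""
    return sorted(records, key=chronological_key)


sample = [
    {"Page": 1, "Date": "March 03, 2020"},
    {"Page": 1, "Date": "September 15, 2020"},
    {"Page": 1, "Date": "No Date"},
    {"Page": 1, "Date": "January 20, 2021"},
]
for record in sort_records(sample):
    print(record["Date"])
```

A plain descending string sort would place `September 15, 2020` ahead of `January 20, 2021`, which is why the parsed key matters here.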

#### Statistical Output Generation
```python
print(f"Total documents scraped: {len(df):,}")  # Total count with thousands separator
print(f"Pages scraped: {df['Page'].nunique()}")  # Number of unique pages processed
print(f"Actual coverage: {len(df) / (25 * 1000) * 100:.1f}% of maximum possible")
```
- **Document Count**: Shows total successfully scraped documents
- **Page Count**: Verifies how many pages actually contained data
- **Coverage Percentage**: Calculates efficiency against theoretical maximum

#### Performance Metrics Calculation
```python
print(f"Scraping speed: {len(df) / scraping_time:.1f} documents/second")
print(f"Average documents per page: {len(df) / df['Page'].nunique():.0f}")
```
- **Processing Speed**: Documents per second for performance assessment
- **Page Efficiency**: Average documents per page to identify potential issues

#### Data Analysis and Summarization
```python
# Page distribution analysis
page_counts = df['Page'].value_counts().sort_index()
for page, count in page_counts.items():
    print(f"Page {page}: {count:,} documents")

# Sample data display for verification
sample_columns = ['Date', 'Title', 'Related_Category', 'Page']
print(df[sample_columns].head().to_string(index=False, max_colwidth=60))
```
- **Page Distribution**: Shows document count per page to identify inconsistencies
- **Sample Display**: Provides preview of extracted data for quality verification

#### Data Export and File Management
```python
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # Create unique timestamp
csv_filename = f'./campaign_documents_{len(df)}docs_{timestamp}.csv'
df.to_csv(csv_filename, index=False, encoding='utf-8')  # Export to CSV
```
- **Timestamp Creation**: Generates unique identifier for output files
- **Filename Generation**: Includes document count and timestamp for clarity
- **CSV Export**: Saves data with UTF-8 encoding to handle special characters

#### Data Quality Analysis
```python
# Memory usage calculation
memory_mb = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"Memory usage: {memory_mb:.2f} MB")

# Category distribution analysis
category_counts = df['Related_Category'].value_counts().head(10)
for category, count in category_counts.items():
    print(f"  {category}: {count:,} documents")

# Temporal distribution analysis
df['Year'] = pd.to_datetime(df['Date'], errors='coerce').dt.year
year_counts = df['Year'].value_counts().sort_index().tail(10)
```
- **Memory Analysis**: Tracks resource usage for optimization
- **Category Analysis**: Identifies most common document categories
- **Temporal Analysis**: Shows document distribution across years


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from datetime import datetime
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
from urllib.parse import urljoin, urlparse
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class CampaignDocumentsScraper:
  def __init__(self, base_url, max_workers=5, max_pages=25):
      self.base_url = base_url
      self.max_workers = max_workers
      self.max_pages = max_pages  # Maximum expected pages
      self.session = requests.Session()
      self.session.headers.update({
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
          'Accept-Language': 'en-US,en;q=0.5',
          'Accept-Encoding': 'gzip, deflate',
          'Connection': 'keep-alive',
          'Upgrade-Insecure-Requests': '1',
      })
      self.data_lock = threading.Lock()
      self.all_documents = []
  
  def get_total_pages(self):
      """Get the total number of pages by checking pagination or testing pages"""
      try:
          response = self.session.get(self.base_url, timeout=10)
          response.raise_for_status()
          soup = BeautifulSoup(response.content, 'html.parser')
          
          # Method 1: Try to find pagination info
          pagination = soup.find('ul', class_='pagination')
          if pagination:
              # Look for the last page link
              last_link = pagination.find('li', class_='pager-last')
              if last_link and last_link.find('a'):
                  last_href = last_link.find('a')['href']
                  # Anchor on ? or & so "items_per_page=1000" is not read as page=1000
                  page_match = re.search(r'[?&]page=(\d+)', last_href)
                  if page_match:
                      # Pages are 0-indexed, so add 1 to get total pages
                      total_pages = int(page_match.group(1)) + 1
                      logger.info(f"Found last page index {page_match.group(1)} in pagination, total pages: {total_pages}")
                      return min(total_pages, self.max_pages)
              
              # Look for numbered page links
              page_links = pagination.find_all('a')
              page_numbers = []
              for link in page_links:
                  href = link.get('href', '')
                  # Anchor on ? or & so "items_per_page=1000" is not read as page=1000
                  page_match = re.search(r'[?&]page=(\d+)', href)
                  if page_match:
                      page_numbers.append(int(page_match.group(1)))
              
              if page_numbers:
                  max_page_index = max(page_numbers)
                  total_pages = max_page_index + 1
                  logger.info(f"Found max page index {max_page_index} in links, total pages: {total_pages}")
                  return min(total_pages, self.max_pages)
          
          # Method 2: Check if there are results on the first page
          rows = soup.find_all('div', class_='row')
          document_count = 0
          for row in rows:
              main_col = row.find('div', class_='col-sm-8')
              if main_col and main_col.find('div', class_='field-title'):
                  document_count += 1
          
          if document_count == 0:
              logger.warning("No documents found on first page")
              return 1
          elif document_count < 1000:
              logger.info(f"Found {document_count} documents on first page (less than 1000), assuming single page")
              return 1
          else:
              logger.info(f"Found {document_count} documents on first page, will test for more pages")
              # Test a few pages to find the actual limit
              return self.test_page_availability()
          
      except Exception as e:
          logger.error(f"Error getting total pages: {e}")
          return 1
  
  def test_page_availability(self):
      """Test page availability to find the actual number of pages"""
      logger.info("Testing page availability...")
      
      # Binary search approach to find the last available page
      left, right = 1, self.max_pages
      last_valid_page = 1
      
      while left <= right:
          mid = (left + right) // 2
          test_url = f"{self.base_url}&page={mid - 1}"  # Convert to 0-indexed
          
          try:
              response = self.session.get(test_url, timeout=10)
              response.raise_for_status()
              soup = BeautifulSoup(response.content, 'html.parser')
              
              # Check if this page has documents
              rows = soup.find_all('div', class_='row')
              has_documents = False
              for row in rows:
                  main_col = row.find('div', class_='col-sm-8')
                  if main_col and main_col.find('div', class_='field-title'):
                      has_documents = True
                      break
              
              if has_documents:
                  last_valid_page = mid
                  left = mid + 1
                  logger.info(f"Page {mid} has documents")
              else:
                  right = mid - 1
                  logger.info(f"Page {mid} has no documents")
              
              time.sleep(0.5)  # Be respectful to the server
              
          except Exception as e:
              logger.warning(f"Error testing page {mid}: {e}")
              right = mid - 1
      
      logger.info(f"Found {last_valid_page} total pages through testing")
      return last_valid_page
  
  def parse_date(self, date_string):
      """Parse different date formats"""
      if not date_string:
          return None
      
      try:
          # Try to parse ISO format first
          if 'T' in date_string:
              return datetime.fromisoformat(date_string.replace('Z', '+00:00')).strftime('%B %d, %Y')
          else:
              # Try to parse as regular date
              return datetime.strptime(date_string, '%Y-%m-%d').strftime('%B %d, %Y')
      except ValueError:
          return date_string.strip()
  
  def scrape_page(self, page_num):
      """Scrape a single page of documents (page_num is 1-indexed)"""
      page_documents = []
      
      try:
          # Convert to 0-indexed for URL
          page_index = page_num - 1
          
          if page_index == 0:
              url = self.base_url
          else:
              url = f"{self.base_url}&page={page_index}"
          
          logger.info(f"Scraping page {page_num} (URL page={page_index}): {url}")
          
          response = self.session.get(url, timeout=15)
          response.raise_for_status()
          soup = BeautifulSoup(response.content, 'html.parser')
          
          # Find all document rows
          rows = soup.find_all('div', class_='row')
          
          documents_found = 0
          for row in rows:
              try:
                  # Find the main content column (col-sm-8)
                  main_col = row.find('div', class_='col-sm-8')
                  related_col = row.find('div', class_='col-sm-4')
                  
                  if not main_col:
                      continue
                  
                  # Check if this row contains a document (has field-title)
                  title_element = main_col.find('div', class_='field-title')
                  if not title_element:
                      continue
                  
                  # Extract date
                  date_element = main_col.find('span', {'property': 'dc:date'})
                  if date_element:
                      date_content = date_element.get('content', '')
                      date_text = date_element.get_text(strip=True)
                      formatted_date = self.parse_date(date_content) if date_content else date_text
                  else:
                      # Try to find date in h4 tag
                      h4_element = main_col.find('h4')
                      formatted_date = h4_element.get_text(strip=True) if h4_element else 'No Date'
                  
                  # Extract title and link
                  link_element = title_element.find('a')
                  if link_element:
                      title = link_element.get_text(strip=True)
                      document_link = urljoin('https://www.presidency.ucsb.edu', link_element.get('href', ''))
                  else:
                      title = title_element.get_text(strip=True)
                      document_link = 'No Link'
                  
                  # Extract related information
                  related_category = 'No Category'
                  related_link = 'No Link'
                  
                  if related_col:
                      related_link_element = related_col.find('a')
                      if related_link_element:
                          related_category = related_link_element.get_text(strip=True)
                          related_link = urljoin('https://www.presidency.ucsb.edu', related_link_element.get('href', ''))
                  
                  # Create document record
                  document = {
                      'Date': formatted_date,
                      'Title': title,
                      'Document_Link': document_link,
                      'Related_Category': related_category,
                      'Related_Link': related_link,
                      'Page': page_num,
                      'URL_Page_Index': page_index
                  }
                  
                  page_documents.append(document)
                  documents_found += 1
                  
              except Exception as e:
                  logger.warning(f"Error parsing document on page {page_num}: {e}")
                  continue
          
          logger.info(f"Page {page_num}: Found {documents_found} documents")
          
          # Add small delay to be respectful to the server
          time.sleep(0.3)
          
      except Exception as e:
          logger.error(f"Error scraping page {page_num}: {e}")
      
      return page_documents
  
  def scrape_all_pages(self):
      """Scrape all pages using threading"""
      logger.info("Determining total number of pages...")
      total_pages = self.get_total_pages()
      logger.info(f"Will scrape {total_pages} pages (each page has up to 1000 documents)")
      
      # Limit max_workers to avoid overwhelming the server
      actual_workers = min(self.max_workers, total_pages, 6)
      logger.info(f"Using {actual_workers} worker threads")
      
      # Use ThreadPoolExecutor for concurrent scraping
      with ThreadPoolExecutor(max_workers=actual_workers) as executor:
          # Submit all page scraping tasks (1-indexed page numbers)
          future_to_page = {
              executor.submit(self.scrape_page, page_num): page_num 
              for page_num in range(1, total_pages + 1)
          }
          
          # Collect results as they complete
          completed_pages = 0
          for future in as_completed(future_to_page):
              page_num = future_to_page[future]
              try:
                  page_documents = future.result()
                  with self.data_lock:
                      self.all_documents.extend(page_documents)
                  completed_pages += 1
                  logger.info(f"Progress: {completed_pages}/{total_pages} pages completed ({len(self.all_documents)} total documents)")
              except Exception as e:
                  logger.error(f"Error processing page {page_num}: {e}")
                  completed_pages += 1
      
      logger.info(f"Scraping completed. Total documents found: {len(self.all_documents)}")
      return self.all_documents
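
The fan-out pattern used by `scrape_all_pages` can be tried without touching the network. The following is a minimal sketch, not the notebook's implementation: `fake_scrape_page` and `collect_pages` are hypothetical names, and the stub returns two made-up records per page in place of a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_scrape_page(page_num):
    """Hypothetical stand-in for scrape_page: returns two records per page."""
    return [{"Page": page_num, "Title": f"doc-{page_num}-{i}"} for i in range(2)]


def collect_pages(total_pages, max_workers=5):
    """Submit one task per page and gather results as each task finishes."""
    # ThreadPoolExecutor rejects max_workers=0, so keep at least one worker.
    workers = max(1, min(max_workers, total_pages, 6))
    documents = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        future_to_page = {
            executor.submit(fake_scrape_page, page_num): page_num
            for page_num in range(1, total_pages + 1)
        }
        for future in as_completed(future_to_page):
            documents.extend(future.result())
    # Completion order is nondeterministic, so sort before relying on order.
    return sorted(documents, key=lambda d: (d["Page"], d["Title"]))


docs = collect_pages(total_pages=3)
print(len(docs))
```

Because `as_completed` yields futures in whatever order they finish, any downstream step that depends on page order should sort explicitly, as the sketch does.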




In [None]:

# Initialize the scraper
base_url = "https://www.presidency.ucsb.edu/documents/app-categories/elections-and-transitions/campaign-documents?items_per_page=1000"
scraper = CampaignDocumentsScraper(base_url, max_workers=5, max_pages=25)

# Start scraping
start_time = time.time()
logger.info("Starting campaign documents scraping...")
logger.info(f"Base URL: {base_url}")
logger.info("Expected: Up to 25 pages with 1000 documents each")

documents = scraper.scrape_all_pages()

end_time = time.time()
scraping_time = end_time - start_time
logger.info(f"Scraping completed in {scraping_time:.2f} seconds")

# Convert to DataFrame
df = pd.DataFrame(documents)

if not df.empty:
    # Sort by page and then by date
    df = df.sort_values(['Page', 'Date'], ascending=[True, False])
    
    # Display summary
    print(f"\n📊 **Campaign Documents Scraping Summary**")
    print(f"Total documents scraped: {len(df):,}")
    print(f"Pages scraped: {df['Page'].nunique()}")
    print(f"Expected max documents (25 pages × 1000): {25 * 1000:,}")
    print(f"Actual coverage: {len(df) / (25 * 1000) * 100:.1f}% of maximum possible")
    print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
    print(f"Scraping time: {scraping_time:.2f} seconds")
    print(f"Average documents per page: {len(df) / df['Page'].nunique():.0f}")
    print(f"Scraping speed: {len(df) / scraping_time:.1f} documents/second")
    
    # Show page distribution
    print(f"\n📄 **Documents per page:**")
    page_counts = df['Page'].value_counts().sort_index()
    for page, count in page_counts.items():
        print(f"Page {page}: {count:,} documents")
    
    # Show sample data
    print(f"\n🔍 **Sample Data (first 5 documents):**")
    sample_columns = ['Date', 'Title', 'Related_Category', 'Page']
    print(df[sample_columns].head().to_string(index=False, max_colwidth=60))
    
    # Export to CSV
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_filename = f'./campaign_documents_{len(df)}docs_{timestamp}.csv'
    df.to_csv(csv_filename, index=False, encoding='utf-8')
    print(f"\n💾 **Data exported to:** {csv_filename}")
    
    # Show data info
    print(f"\n📈 **DataFrame Info:**")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")
    
    # Show unique related categories
    print(f"\n🏷️ **Top Related Categories:**")
    category_counts = df['Related_Category'].value_counts().head(10)
    for category, count in category_counts.items():
        print(f"  {category}: {count:,} documents")
    
    # Show date distribution
    print(f"\n📅 **Document Distribution by Year:**")
    df['Year'] = pd.to_datetime(df['Date'], errors='coerce').dt.year
    year_counts = df['Year'].value_counts().sort_index().tail(10)
    for year, count in year_counts.items():
        if pd.notna(year):
            print(f"  {int(year)}: {count:,} documents")
    
else:
    print("❌ No documents found!")


In [None]:
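# Load the Phase 1 export. The scraping cell writes a timestamped file name,
# so this assumes that export was saved or renamed as campaign_documents.csv.
df = pd.read_csv('./campaign_documents.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# Keep documents dated 2016 through 2024 (rows with unparseable dates are dropped)
df = df[df['Date'].between('2016-01-01', '2024-12-31')]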

In [None]:
# First, let's create the sample CSV file to test with
import pandas as pd

sample_data = {
    'Date': ["September 29, 2024", "September 27, 2024"],
    'Title': [
        "Remarks by the Vice President at a Campaign Event in Las Vegas, Nevada",
        "Remarks by the Vice President at a Campaign Event in Douglas, Arizona"
    ],
    'Document_Link': [
        "https://www.presidency.ucsb.edu/documents/remarks-the-vice-president-campaign-event-las-vegas-nevada-0",
        "https://www.presidency.ucsb.edu/documents/remarks-the-vice-president-campaign-event-douglas-arizona"
    ],
    'Related_Category': ["Kamala Harris", "Kamala Harris"],
    'Related_Link': [
        "https://www.presidency.ucsb.edu/people/other/kamala-harris",
        "https://www.presidency.ucsb.edu/people/other/kamala-harris"
    ],
    'Page': [1, 1],
    'URL_Page_Index': [0, 0]
}

df_sample = pd.DataFrame(sample_data)
df_sample.to_csv('./sample_documents.csv', index=False)

print("Sample CSV created with your data structure:")
print(df_sample)
print(f"\nColumns: {list(df_sample.columns)}")
print(f"Shape: {df_sample.shape}")

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime
import re
from tqdm import tqdm
import concurrent.futures
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import logging
from functools import lru_cache
import pickle
import os
import json

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class OptimizedDocumentExtractor:
    def __init__(self, max_workers=5, cache_file='extraction_cache.pkl', checkpoint_file='checkpoint.json'):
        self.max_workers = max_workers
        self.cache_file = cache_file
        self.checkpoint_file = checkpoint_file
        self.session = self._create_session()
        self.cache = self._load_cache()
        
    def _create_session(self):
        """Create a session with retry strategy and connection pooling"""
        session = requests.Session()
        
        # Retry strategy
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        
        adapter = HTTPAdapter(
            max_retries=retry_strategy,
            pool_connections=20,
            pool_maxsize=20
        )
        
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
        })
        
        return session
    
    def _load_cache(self):
        """Load previously extracted data from cache"""
        if os.path.exists(self.cache_file):
            try:
                with open(self.cache_file, 'rb') as f:
                    cache = pickle.load(f)
                logger.info(f"Loaded {len(cache)} cached entries")
                return cache
            except Exception as e:
                logger.warning(f"Could not load cache: {e}")
        return {}
    
    def _save_cache(self):
        """Save cache to disk"""
        try:
            with open(self.cache_file, 'wb') as f:
                pickle.dump(self.cache, f)
            logger.info(f"Saved {len(self.cache)} entries to cache")
        except Exception as e:
            logger.error(f"Could not save cache: {e}")
    
    def _load_checkpoint(self):
        """Load processing checkpoint"""
        if os.path.exists(self.checkpoint_file):
            try:
                with open(self.checkpoint_file, 'r') as f:
                    return json.load(f)
            except Exception as e:
                logger.warning(f"Could not load checkpoint: {e}")
        return {'processed_urls': [], 'last_index': 0}
    
    def _save_checkpoint(self, processed_urls, last_index):
        """Save processing checkpoint"""
        try:
            checkpoint = {
                'processed_urls': processed_urls,
                'last_index': last_index,
                'timestamp': datetime.now().isoformat()
            }
            with open(self.checkpoint_file, 'w') as f:
                json.dump(checkpoint, f)
        except Exception as e:
            logger.error(f"Could not save checkpoint: {e}")
    
    # staticmethod so lru_cache keys on the date string only, not on self
    @staticmethod
    @lru_cache(maxsize=1000)
    def _parse_date(date_string):
        """Cached date parsing function"""
        if not date_string:
            return ''
        
        try:
            if 'T' in date_string and ('+' in date_string or date_string.endswith('Z')):
                parsed_date = datetime.fromisoformat(date_string.replace('Z', '+00:00'))
                return parsed_date.isoformat()
            else:
                date_formats = ['%B %d, %Y', '%m/%d/%Y', '%Y-%m-%d', '%d %B %Y']
                for fmt in date_formats:
                    try:
                        parsed_date = datetime.strptime(date_string, fmt)
                        return parsed_date.isoformat()
                    except ValueError:
                        continue
                return date_string
        except Exception as e:
            logger.warning(f"Date parsing error for '{date_string}': {e}")
            return date_string
    
    def extract_document_info(self, url):
        """Optimized document information extraction"""
        # Check cache first
        if url in self.cache:
            logger.debug(f"Cache hit for {url}")
            return self.cache[url]
        
        try:
            response = self.session.get(url, timeout=15)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.content, 'html.parser')
            
            result = {
                'Document_Title': '',
                'Document_Date': '',
                'Document_Content': '',
                'Speaker': '',
                'Speaker_Title': '',
                'Document_Type': '',
                'Location': '',
                'Video_Available': False,
                'Word_Count': 0,
                'Extraction_Status': 'Success'
            }
            
            # Extract document title - prioritized selectors
            title_element = (soup.select_one('.field-ds-doc-title h1') or 
                           soup.select_one('h1') or 
                           soup.select_one('.field-title h1'))
            
            if title_element:
                result['Document_Title'] = title_element.get_text(strip=True)
            
            # Extract document date - prioritized selectors
            date_element = (soup.select_one('span[property="dc:date"]') or 
                          soup.select_one('.field-docs-start-date-time span') or
                          soup.select_one('.date-display-single'))
            
            if date_element:
                date_content = date_element.get('content') or date_element.get_text(strip=True)
                result['Document_Date'] = self._parse_date(date_content)
            
            # Extract speaker information
            speaker_element = (soup.select_one('.field-title a') or 
                             soup.select_one('.diet-title a') or
                             soup.select_one('h3 a'))
            
            if speaker_element:
                result['Speaker'] = speaker_element.get_text(strip=True)
            
            # Extract speaker title (separate name so the document title element is not shadowed)
            byline_element = (soup.select_one('.diet-by-line') or
                              soup.select_one('.field-resuable-byline'))
            
            if byline_element:
                result['Speaker_Title'] = byline_element.get_text(strip=True)
            
            # Extract document content - optimized
            content_element = soup.select_one('.field-docs-content')
            if content_element:
                # Remove unwanted elements
                for unwanted in content_element(['script', 'style', 'nav', 'header', 'footer']):
                    unwanted.decompose()
                
                # Extract text more efficiently
                paragraphs = content_element.find_all('p')
                if paragraphs:
                    content_parts = [p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)]
                    result['Document_Content'] = '\n\n'.join(content_parts)
                    result['Word_Count'] = len(' '.join(content_parts).split())
            
            # Quick checks for other attributes
            if soup.select_one('iframe[src*="youtube"], iframe[src*="vimeo"], .embedded-video'):
                result['Video_Available'] = True
            
            # Extract location from title
            title_lower = result['Document_Title'].lower()
            location_patterns = [
                r'\bin ([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)',
                r'\bat the ([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)'
            ]
            
            for pattern in location_patterns:
                match = re.search(pattern, result['Document_Title'])
                if match:
                    result['Location'] = match.group(1)
                    break
            
            # Determine document type
            if 'debate' in title_lower:
                result['Document_Type'] = 'Debate'
            elif any(word in title_lower for word in ['remarks', 'speech']):
                result['Document_Type'] = 'Speech/Remarks'
            elif 'interview' in title_lower:
                result['Document_Type'] = 'Interview'
            elif 'statement' in title_lower:
                result['Document_Type'] = 'Statement'
            elif 'address' in title_lower:
                result['Document_Type'] = 'Address'
            else:
                result['Document_Type'] = 'Document'
            
            # Cache the result
            self.cache[url] = result
            return result
            
        except requests.exceptions.RequestException as e:
            logger.error(f"Network error for {url}: {e}")
            return self._create_error_result('Network Error')
        except Exception as e:
            logger.error(f"Parsing error for {url}: {e}")
            return self._create_error_result('Parsing Error')
    
    def _create_error_result(self, status):
        """Create error result structure"""
        return {
            'Document_Title': '',
            'Document_Date': '',
            'Document_Content': '',
            'Speaker': '',
            'Speaker_Title': '',
            'Document_Type': '',
            'Location': '',
            'Video_Available': False,
            'Word_Count': 0,
            'Extraction_Status': status
        }
    
    def process_batch(self, urls_batch):
        """Process a batch of URLs"""
        results = {}
        for url in urls_batch:
            results[url] = self.extract_document_info(url)
            time.sleep(0.5)  # Reduced delay for batched processing
        return results
    
    def process_documents_csv(self, csv_file_path, batch_size=50, save_interval=500):
        """
        Process CSV file with optimizations for large datasets
        """
        # Read the CSV file
        df = pd.read_csv(csv_file_path)
        
        logger.info(f"Processing {len(df)} rows...")
        logger.info(f"Columns in CSV: {list(df.columns)}")
        
        # Check if Document_Link column exists
        if 'Document_Link' not in df.columns:
            logger.error("Error: 'Document_Link' column not found in CSV")
            return None
        
        # Load checkpoint
        checkpoint = self._load_checkpoint()
        processed_urls = set(checkpoint.get('processed_urls', []))
        last_index = checkpoint.get('last_index', 0)
        
        # Get unique URLs that haven't been processed
        all_urls = df['Document_Link'].dropna().unique()
        remaining_urls = [url for url in all_urls if url not in processed_urls]
        
        logger.info(f"Total unique URLs: {len(all_urls)}")
        logger.info(f"Already processed: {len(processed_urls)}")
        logger.info(f"Remaining to process: {len(remaining_urls)}")
        
        if not remaining_urls:
            logger.info("All URLs already processed!")
        else:
            # Process in batches with threading
            url_batches = [remaining_urls[i:i + batch_size] for i in range(0, len(remaining_urls), batch_size)]
            
            with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                batch_futures = []
                
                for i, batch in enumerate(url_batches):
                    future = executor.submit(self.process_batch, batch)
                    batch_futures.append((i, future))
                
                # Collect results in submission order (tqdm tracks batch progress)
                for i, future in tqdm(batch_futures, desc="Processing batches"):
                    try:
                        batch_results = future.result(timeout=300)  # 5 minute timeout per batch
                        
                        # Update cache and processed URLs
                        self.cache.update(batch_results)
                        processed_urls.update(batch_results.keys())
                        
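                        # Save progress periodically; max(1, ...) avoids a modulo-by-zero
                        # when batch_size exceeds save_interval
                        batches_per_save = max(1, save_interval // batch_size)
                        if (i + 1) % batches_per_save == 0:
                            self._save_cache()
                            self._save_checkpoint(list(processed_urls), (i + 1) * batch_size)
                            logger.info(f"Saved progress: {len(processed_urls)} URLs processed")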
                        
                    except concurrent.futures.TimeoutError:
                        logger.error(f"Batch {i} timed out")
                    except Exception as e:
                        logger.error(f"Error processing batch {i}: {e}")
        
        # Final save
        self._save_cache()
        self._save_checkpoint(list(processed_urls), len(all_urls))
        
        # Add extracted information to the dataframe
        new_columns = ['Document_Title', 'Document_Date', 'Document_Content', 'Speaker', 
                      'Speaker_Title', 'Document_Type', 'Location', 'Video_Available', 
                      'Word_Count', 'Extraction_Status']
        
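        # Fall back to typed defaults so Word_Count and Video_Available stay
        # numeric/boolean for rows whose URL is missing from the cache
        defaults = self._create_error_result('')
        for col in new_columns:
            df[col] = df['Document_Link'].map(
                lambda x, col=col: self.cache.get(x, defaults).get(col, defaults[col])
            )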
        
        # Generate summary statistics
        self._generate_summary(df)
        
        return df
    
    def _generate_summary(self, df):
        """Generate processing summary"""
        total_rows = len(df)
        successful_extractions = len(df[df['Extraction_Status'] == 'Success'])
        unique_speakers = df['Speaker'].nunique()
        total_words = df['Word_Count'].sum()
        
        logger.info("\n" + "="*50)
        logger.info("EXTRACTION SUMMARY")
        logger.info("="*50)
        logger.info(f"Total rows processed: {total_rows}")
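        # Guard the ratios so an empty or fully failed run does not divide by zero
        success_rate = successful_extractions / total_rows * 100 if total_rows else 0.0
        avg_words = total_words / successful_extractions if successful_extractions else 0
        logger.info(f"Successful extractions: {successful_extractions} ({success_rate:.1f}%)")
        logger.info(f"Unique speakers found: {unique_speakers}")
        logger.info(f"Total words extracted: {total_words:,}")
        logger.info(f"Average words per document: {avg_words:.0f}")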
        logger.info(f"Documents with video: {df['Video_Available'].sum()}")
        
        # Document type distribution
        doc_types = df['Document_Type'].value_counts()
        logger.info("\nDocument Type Distribution:")
        for doc_type, count in doc_types.items():
            logger.info(f"  {doc_type}: {count}")
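
The title-based heuristics inside `extract_document_info` (location and document type) are easy to check in isolation. The sketch below restates them as standalone functions; `guess_location` and `guess_document_type` are hypothetical names introduced here, the patterns assume single-backslash `\s` with a leading word boundary, and the first test title is taken from the sample data in this notebook.

```python
import re

# Patterns assumed for this sketch: a capitalized phrase after "in" or "at the".
LOCATION_PATTERNS = [
    r'\bin ([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)',
    r'\bat the ([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)',
]


def guess_location(title):
    """Return the first capitalized phrase following 'in' or 'at the', else ''."""
    for pattern in LOCATION_PATTERNS:
        match = re.search(pattern, title)
        if match:
            return match.group(1)
    return ''


def guess_document_type(title):
    """Classify a title by keyword, checking the more specific words first."""
    title_lower = title.lower()
    if 'debate' in title_lower:
        return 'Debate'
    if any(word in title_lower for word in ('remarks', 'speech')):
        return 'Speech/Remarks'
    if 'interview' in title_lower:
        return 'Interview'
    if 'statement' in title_lower:
        return 'Statement'
    if 'address' in title_lower:
        return 'Address'
    return 'Document'


title = "Remarks by the Vice President at a Campaign Event in Las Vegas, Nevada"
print(guess_location(title))       # Las Vegas
print(guess_document_type(title))  # Speech/Remarks
```

The match stops at the comma, so the state name is not captured; titles without a capitalized phrase after "in" or "at the" yield an empty string.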





In [None]:
"""Main function to run the extraction"""
# Initialize extractor
extractor = OptimizedDocumentExtractor(
    max_workers=8,  # Adjust to your system; keep it low to be respectful of the server
    cache_file='document_cache.pkl',
    checkpoint_file='extraction_checkpoint.json'
)

# Process the CSV file
csv_file = './campaign_documents.csv'  # Replace with your file name

try:
    processed_df = extractor.process_documents_csv(
        csv_file, 
        batch_size=50,  # Process 50 URLs per batch
        save_interval=500  # Save progress every 500 URLs
    )
    
    if processed_df is not None:
        # Save the final result (note: this overwrites the input CSV in place)
        output_file = './campaign_documents.csv'
        processed_df.to_csv(output_file, index=False)
        logger.info(f"\nProcessing complete! Results saved to: {output_file}")
        

except KeyboardInterrupt:
    logger.info("\nProcessing interrupted by user. Progress up to the last periodic save is on disk.")
    logger.info("You can resume processing by running the script again.")
except Exception as e:
    logger.error(f"Fatal error: {e}")
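
Resuming after an interruption relies on the JSON checkpoint: the set of processed URLs is written as a list and reloaded on the next run to skip finished work. The following is a minimal sketch of that round trip using free functions rather than the class methods; the function names and the `example.org` URLs are hypothetical and used only for illustration.

```python
import json
import os
import tempfile
from datetime import datetime


def save_checkpoint(path, processed_urls, last_index):
    """Write progress to JSON; sets are not JSON-serializable, so store a list."""
    checkpoint = {
        'processed_urls': sorted(processed_urls),
        'last_index': last_index,
        'timestamp': datetime.now().isoformat(),
    }
    with open(path, 'w') as f:
        json.dump(checkpoint, f)


def load_checkpoint(path):
    """Return the saved checkpoint, or an empty one if no file exists."""
    if os.path.exists(path):
        with open(path, 'r') as f:
            return json.load(f)
    return {'processed_urls': [], 'last_index': 0}


# Hypothetical URLs used only for this demonstration
all_urls = ['https://example.org/a', 'https://example.org/b', 'https://example.org/c']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.json')
    # First run: two URLs finish before an interruption
    save_checkpoint(path, {all_urls[0], all_urls[1]}, last_index=2)
    # Second run: reload the checkpoint and skip what is already done
    processed = set(load_checkpoint(path)['processed_urls'])
    remaining = [url for url in all_urls if url not in processed]

print(remaining)  # ['https://example.org/c']
```

Only work recorded in the most recent checkpoint is skipped, so a shorter save interval trades extra disk writes for less repeated work after an interruption.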
