# Data Extraction Techniques

Data extraction is the first step in any ETL/ELT pipeline. It involves retrieving data from various source systems and preparing it for transformation and loading. This notebook covers essential extraction patterns, strategies, and practical implementations.

## Table of Contents
1. [Extraction Patterns Overview](#extraction-patterns-overview)
2. [Full Extraction vs Incremental Extraction](#full-vs-incremental)
3. [Change Data Capture (CDC)](#change-data-capture)
4. [Extracting from APIs](#extracting-from-apis)
5. [Extracting from Databases](#extracting-from-databases)
6. [Extracting from Files](#extracting-from-files)
7. [Handling Extraction Challenges](#handling-challenges)
8. [Key Takeaways](#takeaways)

---
## 1. Extraction Patterns Overview <a id='extraction-patterns-overview'></a>

### Common Data Sources

| Source Type | Examples | Common Protocols |
|------------|----------|------------------|
| **Databases** | PostgreSQL, MySQL, MongoDB, Oracle | JDBC, ODBC, Native drivers |
| **APIs** | REST, GraphQL, SOAP | HTTP/HTTPS |
| **Files** | CSV, JSON, Parquet, XML | FTP, SFTP, S3, Local FS |
| **Streams** | Kafka, Kinesis, Pub/Sub | TCP (Kafka), HTTPS, gRPC |
| **SaaS** | Salesforce, HubSpot, Stripe | HTTPS (OAuth 2.0, API keys) |

### Extraction Strategies

```
┌─────────────────────────────────────────────────────────────────┐
│                    EXTRACTION STRATEGIES                        │
├─────────────────────┬─────────────────────┬─────────────────────┤
│   Full Extraction   │ Incremental Extract │  Change Data Capture│
├─────────────────────┼─────────────────────┼─────────────────────┤
│ • Extract all data  │ • Extract only new/ │ • Capture changes   │
│ • Simple to impl    │   modified records  │   at database level │
│ • High resource use │ • Requires tracking │ • Real-time capable │
│ • Good for small    │ • Lower bandwidth   │ • Log-based or      │
│   datasets          │ • Needs watermarks  │   trigger-based     │
└─────────────────────┴─────────────────────┴─────────────────────┘
```

---
## 2. Full Extraction vs Incremental Extraction <a id='full-vs-incremental'></a>

### Full Extraction

Full extraction involves reading the entire dataset from the source system every time the extraction runs.

**When to use:**
- Small datasets (< 1 million rows)
- Source systems without reliable change tracking
- Initial data loads
- When data consistency is critical

**Pros:**
- Simple implementation
- Target mirrors the source after each run, including deleted rows
- No need to track state

**Cons:**
- High resource consumption
- Long extraction times for large datasets
- Network bandwidth intensive

In [None]:
import pandas as pd
from datetime import datetime
from typing import Optional
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class FullExtractor:
    """Full extraction pattern - extracts all data from source."""
    
    def __init__(self, source_connection: str):
        self.source_connection = source_connection
        self.extraction_timestamp = None
    
    def extract(self, table_name: str, batch_size: int = 10000) -> pd.DataFrame:
        """
        Extract all records from a table.
        
        Args:
            table_name: Name of the source table
            batch_size: Number of records per batch for memory efficiency
            
        Returns:
            DataFrame containing all extracted records
        """
        self.extraction_timestamp = datetime.now()
        logger.info(f"Starting full extraction from {table_name} at {self.extraction_timestamp}")
        
        # Simulated extraction - in practice, use actual DB connection
        query = f"SELECT * FROM {table_name}"
        
        # Example with chunked reading for memory efficiency
        chunks = []
        # In practice: pd.read_sql(query, connection, chunksize=batch_size)
        # for chunk in pd.read_sql(query, self.source_connection, chunksize=batch_size):
        #     chunks.append(chunk)
        
        logger.info(f"Full extraction completed. Total batches: {len(chunks)}")
        return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()


# Example usage
print("Full Extraction Pattern:")
print("- Extracts ALL data every run")
print("- Best for: small datasets, initial loads, data consistency requirements")
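
The read inside `FullExtractor.extract` is commented out, so the class returns an empty DataFrame as written. The following is a minimal runnable sketch of the same pattern using an in-memory SQLite database from the standard library. The `customers` table, its rows, and the helper name `full_extract` are assumptions made for illustration. Because a table name cannot be passed as a bound query parameter, the sketch checks it against a plain-identifier pattern before interpolating it.

```python
import re
import sqlite3

# Hypothetical guard: accept only plain identifiers, since a table name
# cannot be supplied as a bound parameter.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def full_extract(conn: sqlite3.Connection, table_name: str) -> list[dict]:
    """Read every row of table_name and return the rows as dicts."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"Invalid table name: {table_name!r}")
    cursor = conn.execute(f"SELECT * FROM {table_name}")
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Illustrative data only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (name) VALUES (?)", [("Alice",), ("Bob",)])

rows = full_extract(conn, "customers")
print(f"Extracted {len(rows)} rows")
```

For a real source, the same shape applies with the appropriate driver; only the connection and the identifier rules change.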

### Incremental Extraction

Incremental extraction only retrieves records that have been added or modified since the last extraction.

**Common Incremental Strategies:**

| Strategy | Description | Requirements |
|----------|-------------|-------------|
| **Timestamp-based** | Use `updated_at` column | Reliable timestamp column |
| **ID-based** | Track max ID extracted | Sequential IDs, inserts only |
| **Checksum** | Compare row checksums | Compute & store checksums |
| **Version columns** | Track row versions | Version/sequence column |

In [None]:
from dataclasses import dataclass
from datetime import datetime, timedelta
import json
from pathlib import Path


@dataclass
class ExtractionState:
    """Tracks extraction state for incremental loads."""
    table_name: str
    last_extracted_at: datetime
    last_extracted_id: Optional[int] = None
    last_watermark: Optional[str] = None
    records_extracted: int = 0


class IncrementalExtractor:
    """Incremental extraction with watermark tracking."""
    
    def __init__(self, state_file: str = "extraction_state.json"):
        self.state_file = Path(state_file)
        self.states: dict[str, ExtractionState] = self._load_state()
    
    def _load_state(self) -> dict:
        """Load extraction state from file."""
        if self.state_file.exists():
            with open(self.state_file) as f:
                data = json.load(f)
                return {
                    k: ExtractionState(
                        table_name=v['table_name'],
                        last_extracted_at=datetime.fromisoformat(v['last_extracted_at']),
                        last_extracted_id=v.get('last_extracted_id'),
                        last_watermark=v.get('last_watermark'),
                        records_extracted=v.get('records_extracted', 0)
                    ) for k, v in data.items()
                }
        return {}
    
    def _save_state(self):
        """Persist extraction state to file."""
        data = {
            k: {
                'table_name': v.table_name,
                'last_extracted_at': v.last_extracted_at.isoformat(),
                'last_extracted_id': v.last_extracted_id,
                'last_watermark': v.last_watermark,
                'records_extracted': v.records_extracted
            } for k, v in self.states.items()
        }
        with open(self.state_file, 'w') as f:
            json.dump(data, f, indent=2)
    
    def extract_by_timestamp(
        self,
        table_name: str,
        timestamp_column: str = "updated_at",
        overlap_minutes: int = 5
    ) -> tuple[str, datetime]:
        """
        Generate incremental extraction query using timestamps.
        
        Args:
            table_name: Source table name
            timestamp_column: Column containing update timestamps
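            overlap_minutes: Safety overlap to handle clock skew and
                late-committing transactions. Rows inside the overlap are
                re-extracted, so the load step must deduplicate or upsert.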
            
        Returns:
            Tuple of (SQL query, new watermark timestamp)
        """
        # Get last extraction timestamp or use epoch
        if table_name in self.states:
            last_ts = self.states[table_name].last_extracted_at
            # Apply overlap for safety
            watermark = last_ts - timedelta(minutes=overlap_minutes)
        else:
            watermark = datetime(1970, 1, 1)
        
        current_ts = datetime.now()
        
        query = f"""
        SELECT * FROM {table_name}
        WHERE {timestamp_column} > '{watermark.isoformat()}'
          AND {timestamp_column} <= '{current_ts.isoformat()}'
        ORDER BY {timestamp_column} ASC
        """
        
        return query.strip(), current_ts
    
    def extract_by_id(
        self,
        table_name: str,
        id_column: str = "id"
    ) -> tuple[str, None]:
        """
        Generate incremental extraction query using sequential IDs.
        Best for append-only tables with auto-incrementing IDs.
        
        Args:
            table_name: Source table name
            id_column: Column containing sequential IDs
            
        Returns:
            Tuple of (SQL query, None)
        """
        last_id = 0
        if table_name in self.states and self.states[table_name].last_extracted_id:
            last_id = self.states[table_name].last_extracted_id
        
        query = f"""
        SELECT * FROM {table_name}
        WHERE {id_column} > {last_id}
        ORDER BY {id_column} ASC
        """
        
        return query.strip(), None
    
    def update_state(
        self,
        table_name: str,
        extracted_at: datetime,
        last_id: Optional[int] = None,
        records_count: int = 0
    ):
        """Update extraction state after successful extraction."""
        self.states[table_name] = ExtractionState(
            table_name=table_name,
            last_extracted_at=extracted_at,
            last_extracted_id=last_id,
            records_extracted=records_count
        )
        self._save_state()
        logger.info(f"Updated state for {table_name}: {records_count} records")


# Demo
extractor = IncrementalExtractor()
query, watermark = extractor.extract_by_timestamp("orders", "updated_at")
print("Timestamp-based incremental query:")
print(query)
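
`extract_by_timestamp` only builds the SQL string. As a sketch of what running it might look like, the example below executes a watermark query against an in-memory SQLite table with bound parameters, then upserts the results by primary key so that rows re-read from the overlap window do not create duplicates. The `orders` schema, the sample timestamps, and the helper name `extract_since` are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timedelta


def extract_since(conn, last_extracted_at, now, overlap_minutes=5):
    """Return rows changed in the window (last_extracted_at - overlap, now]."""
    lower = last_extracted_at - timedelta(minutes=overlap_minutes)
    cursor = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY updated_at",
        (lower.isoformat(), now.isoformat()),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "shipped", "2024-01-01T09:00:00"),  # older than the overlap window
        (2, "pending", "2024-01-01T09:58:00"),  # inside the overlap window
        (3, "pending", "2024-01-01T10:03:00"),  # changed after the last run
    ],
)

last_run = datetime(2024, 1, 1, 10, 0, 0)
this_run = datetime(2024, 1, 1, 10, 10, 0)
batch = extract_since(conn, last_run, this_run)

# Rows 1 and 2 were loaded by the previous run. Row 2 arrives again because of
# the overlap, so the load upserts by primary key instead of appending.
target = {
    1: ("shipped", "2024-01-01T09:00:00"),
    2: ("pending", "2024-01-01T09:58:00"),
}
for order_id, status, updated_at in batch:
    target[order_id] = (status, updated_at)

print(f"Extracted {len(batch)} rows, target now holds {len(target)} rows")
```

Comparing ISO 8601 strings works here only because every value uses the same format and no time zone offset; a native timestamp column avoids that caveat.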

---
## 3. Change Data Capture (CDC) <a id='change-data-capture'></a>

Change Data Capture identifies and captures changes made to data in a database, enabling real-time or near-real-time data replication.

### CDC Approaches

```
┌────────────────────────────────────────────────────────────────────────┐
│                         CDC APPROACHES                                  │
├────────────────────┬───────────────────────┬───────────────────────────┤
│   Log-Based CDC    │    Trigger-Based CDC  │     Query-Based CDC       │
├────────────────────┼───────────────────────┼───────────────────────────┤
│ Read database logs │ DB triggers capture   │ Poll source with queries  │
│ (WAL, binlog)      │ INSERT/UPDATE/DELETE  │ using timestamps/versions │
│                    │                       │                           │
│ ✓ Low source load  │ ✓ Real-time           │ ✓ Simple implementation   │
│ ✓ Captures deletes │ ✓ Captures all ops    │ ✓ Works with any DB       │
│ ✗ DB-specific      │ ✗ Adds DB overhead    │ ✗ May miss deletes        │
│ ✗ Complex setup    │ ✗ Schema changes      │ ✗ Not truly real-time     │
└────────────────────┴───────────────────────┴───────────────────────────┘
```
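
To make the trigger-based column concrete, here is a small sketch using SQLite triggers through Python's standard `sqlite3` module. The `users` and `users_changes` tables and the trigger names are hypothetical; production databases have their own trigger syntax, but the idea of writing each change to a side table is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);

-- Change table populated by the triggers below
CREATE TABLE users_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    operation TEXT NOT NULL,
    user_id   INTEGER NOT NULL,
    old_name  TEXT,
    new_name  TEXT
);

CREATE TRIGGER users_ai AFTER INSERT ON users BEGIN
    INSERT INTO users_changes (operation, user_id, new_name)
    VALUES ('INSERT', NEW.id, NEW.name);
END;

CREATE TRIGGER users_au AFTER UPDATE ON users BEGIN
    INSERT INTO users_changes (operation, user_id, old_name, new_name)
    VALUES ('UPDATE', NEW.id, OLD.name, NEW.name);
END;

CREATE TRIGGER users_ad AFTER DELETE ON users BEGIN
    INSERT INTO users_changes (operation, user_id, old_name)
    VALUES ('DELETE', OLD.id, OLD.name);
END;
""")

conn.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')")
conn.execute("UPDATE users SET name = 'Alice Smith' WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 1")

changes = conn.execute(
    "SELECT operation, user_id, old_name, new_name "
    "FROM users_changes ORDER BY change_id"
).fetchall()
for change in changes:
    print(change)
```

Every write to `users` now performs a second write to `users_changes`, which is the overhead the diagram refers to.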

### Popular CDC Tools

| Tool | Type | Supported Sources |
|------|------|-------------------|
| **Debezium** | Log-based | PostgreSQL, MySQL, MongoDB, SQL Server |
| **AWS DMS** | Log-based | Most major databases |
| **Fivetran** | Managed | 150+ connectors |
| **Airbyte** | Various | Open-source, 300+ connectors |

In [None]:
from enum import Enum
from dataclasses import dataclass, field
from typing import Any
import hashlib
import json


class OperationType(Enum):
    """CDC operation types."""
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"


@dataclass
class CDCEvent:
    """Represents a change data capture event."""
    table: str
    operation: OperationType
    timestamp: datetime
    primary_key: dict
    before: Optional[dict] = None  # Previous state (for UPDATE/DELETE)
    after: Optional[dict] = None   # New state (for INSERT/UPDATE)
    transaction_id: Optional[str] = None
    
    def to_dict(self) -> dict:
        return {
            'table': self.table,
            'operation': self.operation.value,
            'timestamp': self.timestamp.isoformat(),
            'primary_key': self.primary_key,
            'before': self.before,
            'after': self.after,
            'transaction_id': self.transaction_id
        }


class QueryBasedCDC:
    """
    Query-based CDC implementation using checksums.
    Compares current state with previous snapshot to detect changes.
    """
    
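    def __init__(self, snapshot_store: Optional[dict] = None):
        # Use "is not None" so a caller-supplied empty dict is kept and updated in place
        self.snapshots = snapshot_store if snapshot_store is not None else {}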
    
    @staticmethod
    def compute_row_hash(row: dict) -> str:
        """Compute a row fingerprint for change detection (MD5 is not used for security here)."""
        row_str = json.dumps(row, sort_keys=True, default=str)
        return hashlib.md5(row_str.encode()).hexdigest()
    
    def detect_changes(
        self,
        table_name: str,
        current_data: list[dict],
        primary_key: str
    ) -> list[CDCEvent]:
        """
        Detect INSERT, UPDATE, DELETE by comparing with previous snapshot.
        
        Args:
            table_name: Name of the table being tracked
            current_data: Current state of the data
            primary_key: Name of the primary key column
            
        Returns:
            List of CDC events representing detected changes
        """
        events = []
        current_time = datetime.now()
        
        # Build current state lookup
        current_lookup = {row[primary_key]: row for row in current_data}
        current_hashes = {
            pk: self.compute_row_hash(row) 
            for pk, row in current_lookup.items()
        }
        
        # Get previous snapshot
        prev_snapshot = self.snapshots.get(table_name, {})
        prev_lookup = prev_snapshot.get('data', {})
        prev_hashes = prev_snapshot.get('hashes', {})
        
        # Detect INSERTs and UPDATEs
        for pk, row in current_lookup.items():
            if pk not in prev_lookup:
                # New record - INSERT
                events.append(CDCEvent(
                    table=table_name,
                    operation=OperationType.INSERT,
                    timestamp=current_time,
                    primary_key={primary_key: pk},
                    after=row
                ))
            elif current_hashes[pk] != prev_hashes.get(pk):
                # Hash changed - UPDATE
                events.append(CDCEvent(
                    table=table_name,
                    operation=OperationType.UPDATE,
                    timestamp=current_time,
                    primary_key={primary_key: pk},
                    before=prev_lookup[pk],
                    after=row
                ))
        
        # Detect DELETEs
        for pk in prev_lookup:
            if pk not in current_lookup:
                events.append(CDCEvent(
                    table=table_name,
                    operation=OperationType.DELETE,
                    timestamp=current_time,
                    primary_key={primary_key: pk},
                    before=prev_lookup[pk]
                ))
        
        # Update snapshot
        self.snapshots[table_name] = {
            'data': current_lookup,
            'hashes': current_hashes
        }
        
        return events


# Demo query-based CDC
cdc = QueryBasedCDC()

# Initial snapshot
initial_data = [
    {'id': 1, 'name': 'Alice', 'email': 'alice@example.com'},
    {'id': 2, 'name': 'Bob', 'email': 'bob@example.com'},
]

events = cdc.detect_changes('users', initial_data, 'id')
print(f"Initial load: {len(events)} INSERT events")

# Simulate changes
updated_data = [
    {'id': 1, 'name': 'Alice Smith', 'email': 'alice@example.com'},  # Updated
    # id=2 deleted
    {'id': 3, 'name': 'Charlie', 'email': 'charlie@example.com'},    # Inserted
]

events = cdc.detect_changes('users', updated_data, 'id')
print("\nDetected changes:")
for event in events:
    print(f"  {event.operation.value}: PK={event.primary_key}")
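
Detecting changes is half of the job; the events still have to be applied to a target. The sketch below replays events shaped like the output of `CDCEvent.to_dict()` onto a plain dictionary keyed by primary key. The `apply_events` helper and the use of a dict as the target are assumptions for illustration; a warehouse load would typically issue a merge or upsert instead.

```python
def apply_events(target: dict, events: list[dict], key: str = "id") -> dict:
    """Apply change events, in order, to a target keyed by primary key."""
    for event in events:
        pk = event["primary_key"][key]
        if event["operation"] == "DELETE":
            target.pop(pk, None)          # idempotent: a repeated delete is harmless
        else:
            target[pk] = event["after"]   # INSERT and UPDATE are both upserts
    return target


# Target state after the initial load in the demo above
target = {
    1: {"id": 1, "name": "Alice", "email": "alice@example.com"},
    2: {"id": 2, "name": "Bob", "email": "bob@example.com"},
}

# Events in the same shape that CDCEvent.to_dict() produces
events = [
    {"operation": "UPDATE", "primary_key": {"id": 1},
     "after": {"id": 1, "name": "Alice Smith", "email": "alice@example.com"}},
    {"operation": "INSERT", "primary_key": {"id": 3},
     "after": {"id": 3, "name": "Charlie", "email": "charlie@example.com"}},
    {"operation": "DELETE", "primary_key": {"id": 2}, "after": None},
]

apply_events(target, events)
print(sorted(target))
```

Treating INSERT and UPDATE as the same upsert keeps the replay safe to run twice, which matters when a batch is retried.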

---
## 4. Extracting from APIs <a id='extracting-from-apis'></a>

API extraction is common when integrating with SaaS platforms, web services, or microservices.

### Key Considerations

- **Authentication**: API keys, OAuth 2.0, JWT tokens
- **Rate Limiting**: Respect API quotas to avoid being blocked
- **Pagination**: Handle large result sets efficiently
- **Error Handling**: Retries, exponential backoff
- **Data Formats**: JSON, XML, Protocol Buffers
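
Many APIs allow short bursts above their sustained rate. One common way to model that is a token bucket, sketched below under stated assumptions: the class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical and not part of any particular client library. The extractor in the next cell spaces requests evenly instead and does not use its `burst_limit` setting.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock                  # injectable so the demo needs no real sleeping
        self.tokens = float(capacity)       # start with a full bucket
        self.updated_at = clock()

    def _refill(self) -> None:
        now = self.clock()
        earned = (now - self.updated_at) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated_at = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.rate)


# Demo with a fake clock: 2 requests/second sustained, bursts of 3
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.try_acquire() for _ in range(4)]  # the fourth call exceeds the burst
fake_now[0] += 0.5                                # half a second later one token is back
after_wait = bucket.try_acquire()
print(burst, after_wait)
```

A caller would sleep for `wait_time()` when `try_acquire()` returns `False`, and should still honor any `Retry-After` value the server sends.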

In [None]:
import time
from typing import Generator, Any
from dataclasses import dataclass
import random


@dataclass
class RateLimitConfig:
    """Rate limiting configuration."""
    requests_per_second: float = 10.0
    burst_limit: int = 100
    retry_after_header: str = "Retry-After"


class APIExtractor:
    """
    Robust API data extractor with rate limiting, 
    pagination, and retry logic.
    """
    
    def __init__(
        self,
        base_url: str,
        auth_token: str,
        rate_limit: RateLimitConfig = None
    ):
        self.base_url = base_url.rstrip('/')
        self.auth_token = auth_token
        self.rate_limit = rate_limit or RateLimitConfig()
        self.last_request_time = 0.0
        
    def _get_headers(self) -> dict:
        """Build request headers with authentication."""
        return {
            'Authorization': f'Bearer {self.auth_token}',
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        }
    
    def _wait_for_rate_limit(self):
        """Enforce rate limiting between requests."""
        min_interval = 1.0 / self.rate_limit.requests_per_second
        elapsed = time.time() - self.last_request_time
        
        if elapsed < min_interval:
            sleep_time = min_interval - elapsed
            time.sleep(sleep_time)
        
        self.last_request_time = time.time()
    
    def _make_request_with_retry(
        self,
        endpoint: str,
        params: dict = None,
        max_retries: int = 3,
        base_delay: float = 1.0
    ) -> dict:
        """
        Make HTTP request with exponential backoff retry.
        
        Args:
            endpoint: API endpoint path
            params: Query parameters
            max_retries: Maximum retry attempts
            base_delay: Initial delay between retries (seconds)
            
        Returns:
            JSON response data
        """
        # In production, use 'requests' library
        # import requests
        
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        
        for attempt in range(max_retries + 1):
            try:
                self._wait_for_rate_limit()
                
                # Simulated request - replace with actual HTTP call
                # response = requests.get(
                #     url, 
                #     headers=self._get_headers(),
                #     params=params,
                #     timeout=30
                # )
                
                # Simulated response
                logger.info(f"GET {url} (attempt {attempt + 1})")
                return {'data': [], 'next_cursor': None}  # Simulated
                
            except Exception as e:
                if attempt == max_retries:
                    logger.error(f"Max retries exceeded for {url}")
                    raise
                
                # Exponential backoff with jitter
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                logger.warning(f"Request failed, retrying in {delay:.2f}s: {e}")
                time.sleep(delay)
    
    def extract_paginated(
        self,
        endpoint: str,
        page_size: int = 100,
        max_pages: Optional[int] = None
    ) -> Generator[list[dict], None, None]:
        """
        Extract data with cursor-based pagination.
        
        Args:
            endpoint: API endpoint
            page_size: Records per page
            max_pages: Maximum pages to fetch (None for all)
            
        Yields:
            Lists of records from each page
        """
        cursor = None
        page_count = 0
        
        while True:
            params = {'limit': page_size}
            if cursor:
                params['cursor'] = cursor
            
            response = self._make_request_with_retry(endpoint, params)
            data = response.get('data', [])
            
            if data:
                yield data
                page_count += 1
                logger.info(f"Extracted page {page_count}: {len(data)} records")
            
            # Check for next page
            cursor = response.get('next_cursor')
            if not cursor or not data:
                break
            
            if max_pages and page_count >= max_pages:
                logger.info(f"Reached max pages limit: {max_pages}")
                break
        
        logger.info(f"Pagination complete. Total pages: {page_count}")


# Example usage
print("API Extractor Features:")
print("- Rate limiting with configurable RPS")
print("- Exponential backoff retry")
print("- Cursor-based pagination")
print("- Authentication handling")
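
Because `_make_request_with_retry` returns a simulated empty response, `extract_paginated` never yields a page as written. The sketch below exercises the same cursor loop against a fake in-memory API so the control flow can be observed. `paginate`, `fake_fetch`, and the integer-offset cursor format are assumptions for illustration; real APIs define their own cursor tokens.

```python
from typing import Callable, Iterator, Optional


def paginate(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: Optional[int] = None,
) -> Iterator[list]:
    """Yield pages until the API stops returning a next cursor."""
    cursor = None
    pages = 0
    while True:
        response = fetch_page(cursor)
        data = response.get("data", [])
        if data:
            yield data
            pages += 1
        cursor = response.get("next_cursor")
        if not cursor or not data:
            break
        if max_pages is not None and pages >= max_pages:
            break


# Fake API: seven records served three at a time, cursor is the next offset
RECORDS = [{"id": i} for i in range(1, 8)]


def fake_fetch(cursor: Optional[str], limit: int = 3) -> dict:
    start = int(cursor) if cursor else 0
    end = start + limit
    next_cursor = str(end) if end < len(RECORDS) else None
    return {"data": RECORDS[start:end], "next_cursor": next_cursor}


pages = list(paginate(fake_fetch))
print([len(page) for page in pages])
```

Passing the fetch function in as an argument keeps the loop testable without network access.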

In [None]:
# Real-world API extraction example with requests library

def extract_from_rest_api(
    api_url: str,
    headers: dict,
    params: dict = None
) -> pd.DataFrame:
    """
    REST API extraction with automatic retries and next-URL pagination.
    
    Example:
        df = extract_from_rest_api(
            'https://api.example.com/v1/orders',
            headers={'Authorization': 'Bearer token123'},
            params={'status': 'completed', 'limit': 1000}
        )
    """
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    
    # Configure retry strategy
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    
    # Create session with retry (mounted for both schemes so plain-HTTP URLs also retry)
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    
    all_records = []
    next_url = api_url
    
    while next_url:
        response = session.get(
            next_url,
            headers=headers,
            params=params if next_url == api_url else None,
            timeout=30
        )
        response.raise_for_status()
        
        data = response.json()
        records = data.get('results', data.get('data', []))
        all_records.extend(records)
        
        # Handle pagination (common patterns)
        next_url = data.get('next') or data.get('next_page_url')
        params = None  # Clear params for subsequent pages
    
    return pd.DataFrame(all_records)


print("REST API extraction function ready")
print("Supports: retry with backoff (including HTTP 429) and next-URL pagination")
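
The retry loop in `_make_request_with_retry` computes its delay as `base_delay * (2 ** attempt)` plus random jitter. The sketch below isolates that schedule in a standalone helper so it can be run without a network. The name `retry_with_backoff` and the injectable `sleep` and `rng` parameters are assumptions added for testability, and catching every `Exception` is a simplification; real code would normally retry only transient errors.

```python
import random
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0,
                       sleep=time.sleep, rng=random.random):
    """Call func(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise                      # out of retries: surface the last error
            # Delay doubles on each attempt; jitter spreads out retries from many clients
            delay = base_delay * (2 ** attempt) + rng()
            sleep(delay)


# Demo: a call that fails twice and then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
# Record delays instead of sleeping, and disable jitter so the schedule is visible
result = retry_with_backoff(flaky, sleep=delays.append, rng=lambda: 0.0)
print(result, delays)
```

With jitter disabled the recorded delays are the bare exponential schedule; with the default `rng` each delay grows by up to one second.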

---
## 5. Extracting from Databases <a id='extracting-from-databases'></a>

Database extraction requires understanding connection management, query optimization, and efficient data transfer.

In [None]:
from contextlib import contextmanager
from typing import Iterator
from abc import ABC, abstractmethod


class DatabaseExtractor(ABC):
    """Abstract base class for database extractors."""
    
    @abstractmethod
    def connect(self):
        """Establish database connection."""
        pass
    
    @abstractmethod
    def extract(self, query: str) -> pd.DataFrame:
        """Execute query and return results."""
        pass


class PostgreSQLExtractor(DatabaseExtractor):
    """
    PostgreSQL extractor with connection pooling and chunked reads.
    """
    
    def __init__(
        self,
        host: str,
        port: int,
        database: str,
        user: str,
        password: str,
        pool_size: int = 5
    ):
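        from urllib.parse import quote_plus

        # URL-encode credentials so characters such as '@' or '/' do not break the URL
        self.connection_string = (
            f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}"
        )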
        self.pool_size = pool_size
        self._engine = None
    
    def connect(self):
        """Create SQLAlchemy engine with connection pooling."""
        from sqlalchemy import create_engine
        
        self._engine = create_engine(
            self.connection_string,
            pool_size=self.pool_size,
            pool_pre_ping=True,  # Validate connections
            pool_recycle=3600    # Recycle connections after 1 hour
        )
        logger.info("PostgreSQL connection pool created")
    
    @contextmanager
    def get_connection(self):
        """Get connection from pool."""
        if not self._engine:
            self.connect()
        
        conn = self._engine.connect()
        try:
            yield conn
        finally:
            conn.close()
    
    def extract(self, query: str, chunk_size: int = None) -> pd.DataFrame:
        """
        Extract data using a SQL query.
        
        Args:
            query: SQL query to execute
            chunk_size: If set, read in chunks (for large datasets)
            
        Returns:
            DataFrame with query results
        """
        with self.get_connection() as conn:
            if chunk_size:
                # Chunked reading; note that concat still materializes the full result in memory
                chunks = pd.read_sql(query, conn, chunksize=chunk_size)
                return pd.concat(chunks, ignore_index=True)
            else:
                return pd.read_sql(query, conn)
    
    def extract_incremental(
        self,
        table: str,
        timestamp_col: str,
        since: datetime,
        columns: list[str] = None
    ) -> pd.DataFrame:
        """
        Incremental extraction using timestamp watermark.
        """
        cols = ', '.join(columns) if columns else '*'
        query = f"""
        SELECT {cols}
        FROM {table}
        WHERE {timestamp_col} > %(since)s
        ORDER BY {timestamp_col}
        """
        
        with self.get_connection() as conn:
            return pd.read_sql(query, conn, params={'since': since})
    
    def extract_with_cursor(
        self,
        query: str,
        batch_size: int = 10000
    ) -> Iterator[pd.DataFrame]:
        """
        Server-side cursor extraction for very large datasets.
        Minimizes memory usage by fetching in batches.
        """
        import uuid
        import psycopg2.extras
        
        # Use server-side cursor
        with self.get_connection() as conn:
            # A random suffix avoids name collisions between extractions started in the same second
            cursor_name = f"extract_cursor_{uuid.uuid4().hex}"
            
            with conn.connection.cursor(
                name=cursor_name,
                cursor_factory=psycopg2.extras.RealDictCursor
            ) as cursor:
                cursor.execute(query)
                
                while True:
                    rows = cursor.fetchmany(batch_size)
                    if not rows:
                        break
                    yield pd.DataFrame(rows)


print("PostgreSQL Extractor Features:")
print("- Connection pooling")
print("- Chunked reads for large datasets")
print("- Incremental extraction")
print("- Server-side cursors for memory efficiency")
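
The batch loop in `extract_with_cursor` needs a live PostgreSQL server. The sketch below shows the same `fetchmany` pattern against an in-memory SQLite database so it can be run anywhere. The `events` table and the helper name `iter_batches` are assumptions for illustration, and SQLite has no server-side cursors, so this demonstrates only the iteration shape, not the memory behavior of a remote cursor.

```python
import sqlite3
from typing import Iterator


def iter_batches(
    conn: sqlite3.Connection,
    query: str,
    params: tuple = (),
    batch_size: int = 2,
) -> Iterator[list[tuple]]:
    """Yield query results in lists of at most batch_size rows."""
    cursor = conn.execute(query, params)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"event-{i}",) for i in range(5)],
)

batch_sizes = [
    len(batch)
    for batch in iter_batches(conn, "SELECT id, payload FROM events ORDER BY id")
]
print(batch_sizes)
```

Each batch can be transformed and written out before the next one is fetched, which is what keeps peak memory bounded.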

In [None]:
# MongoDB Extractor Example

class MongoDBExtractor:
    """
    MongoDB extractor for document databases.
    """
    
    def __init__(self, connection_uri: str, database: str):
        self.connection_uri = connection_uri
        self.database_name = database
        self._client = None
    
    def connect(self):
        """Connect to MongoDB."""
        from pymongo import MongoClient
        
        self._client = MongoClient(
            self.connection_uri,
            maxPoolSize=10,
            serverSelectionTimeoutMS=5000
        )
        # Verify connection
        self._client.admin.command('ping')
        logger.info(f"Connected to MongoDB: {self.database_name}")
    
    @property
    def db(self):
        if not self._client:
            self.connect()
        return self._client[self.database_name]
    
    def extract_collection(
        self,
        collection: str,
        query: dict = None,
        projection: dict = None,
        batch_size: int = 1000
    ) -> pd.DataFrame:
        """
        Extract documents from a MongoDB collection.
        
        Args:
            collection: Collection name
            query: MongoDB query filter
            projection: Fields to include/exclude
            batch_size: Cursor batch size
            
        Returns:
            DataFrame with extracted documents
        """
        cursor = self.db[collection].find(
            filter=query or {},
            projection=projection,
            batch_size=batch_size
        )
        
        documents = list(cursor)
        logger.info(f"Extracted {len(documents)} documents from {collection}")
        
        return pd.DataFrame(documents)
    
    def extract_incremental(
        self,
        collection: str,
        timestamp_field: str,
        since: datetime
    ) -> pd.DataFrame:
        """
        Incremental extraction based on timestamp field.
        """
        query = {timestamp_field: {'$gt': since}}
        return self.extract_collection(collection, query=query)
    
    def extract_with_aggregation(
        self,
        collection: str,
        pipeline: list[dict]
    ) -> pd.DataFrame:
        """
        Extract using MongoDB aggregation pipeline.
        Useful for complex transformations at source.
        """
        cursor = self.db[collection].aggregate(
            pipeline,
            allowDiskUse=True  # For large aggregations
        )
        return pd.DataFrame(list(cursor))


# Example aggregation pipeline
example_pipeline = [
    {'$match': {'status': 'active'}},
    {'$group': {
        '_id': '$category',
        'total': {'$sum': '$amount'},
        'count': {'$sum': 1}
    }},
    {'$sort': {'total': -1}}
]

print("MongoDB Extractor ready")
print(f"Example pipeline: {example_pipeline}")
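
MongoDB documents are often nested, and `pd.DataFrame(documents)` keeps each nested object in a single column. One option is to flatten documents before building the frame. The following is a minimal pure-Python sketch; the `flatten` helper, the dot separator, and the sample document are assumptions for illustration. pandas also provides `pd.json_normalize` for the same purpose.

```python
def flatten(document: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in document.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value   # lists and scalars are kept as-is
    return flat


# Illustrative document only
doc = {
    "_id": "a1",
    "customer": {"name": "Alice", "address": {"city": "Oslo"}},
    "tags": ["new", "vip"],
}

flat_doc = flatten(doc)
print(flat_doc)
```

Arrays are left untouched here; whether to explode them into rows or serialize them depends on the target schema.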

---
## 6. Extracting from Files <a id='extracting-from-files'></a>

File-based extraction handles CSV, JSON, Parquet, Excel, and other file formats from local storage, cloud storage, or remote servers.
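
Large files are usually read in chunks instead of all at once. As a sketch of that idea using only the standard library, the helper below yields fixed-size batches of rows from a CSV handle. The function name `read_csv_in_chunks` and the sample data are assumptions for illustration; with pandas, `pd.read_csv(..., chunksize=n)` serves the same purpose.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def read_csv_in_chunks(handle: TextIO, chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most chunk_size rows, each row as a dict keyed by header."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# In-memory sample standing in for a large file
sample = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,7.25\n")

totals = [
    sum(float(row["amount"]) for row in chunk)
    for chunk in read_csv_in_chunks(sample, chunk_size=2)
]
print(totals)
```

Only one chunk is held in memory at a time, so the approach scales to files larger than available RAM as long as each chunk is processed and released.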

In [None]:
from pathlib import Path
import glob
from concurrent.futures import ThreadPoolExecutor, as_completed


class FileExtractor:
    """
    Multi-format file extractor with parallel processing.
    """
    
    SUPPORTED_FORMATS = {
        '.csv': 'csv',
        '.json': 'json',
        '.jsonl': 'jsonl',
        '.parquet': 'parquet',
        '.xlsx': 'excel',
        '.xls': 'excel',
        '.xml': 'xml'
    }
    
    def __init__(self, max_workers: int = 4):
        self.max_workers = max_workers
    
    def _read_file(self, file_path: Path, **kwargs) -> pd.DataFrame:
        """Read a single file based on its extension."""
        suffix = file_path.suffix.lower()
        
        if suffix not in self.SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {suffix}")
        
        format_type = self.SUPPORTED_FORMATS[suffix]
        
        readers = {
            'csv': lambda p: pd.read_csv(p, **kwargs),
            'json': lambda p: pd.read_json(p, **kwargs),
            'jsonl': lambda p: pd.read_json(p, lines=True, **kwargs),
            'parquet': lambda p: pd.read_parquet(p, **kwargs),
            'excel': lambda p: pd.read_excel(p, **kwargs),
            'xml': lambda p: pd.read_xml(p, **kwargs)
        }
        
        logger.info(f"Reading {format_type} file: {file_path}")
        return readers[format_type](file_path)
    
    def extract_file(self, file_path: str, **kwargs) -> pd.DataFrame:
        """
        Extract data from a single file.
        
        Args:
            file_path: Path to the file
            **kwargs: Additional arguments for pandas reader
            
        Returns:
            DataFrame with file contents
        """
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        
        return self._read_file(path, **kwargs)
    
    def extract_directory(
        self,
        directory: str,
        pattern: str = "*.*",
        recursive: bool = True,
        parallel: bool = True,
        **kwargs
    ) -> pd.DataFrame:
        """
        Extract and combine all matching files from a directory.
        
        Args:
            directory: Directory path
            pattern: Glob pattern for file matching
            recursive: Search subdirectories
            parallel: Use parallel processing
            
        Returns:
            Combined DataFrame from all files
        """
        dir_path = Path(directory)
        
        if recursive:
            files = list(dir_path.rglob(pattern))
        else:
            files = list(dir_path.glob(pattern))
        
        # Filter to supported formats
        files = [f for f in files if f.suffix.lower() in self.SUPPORTED_FORMATS]
        
        if not files:
            logger.warning(f"No matching files found in {directory}")
            return pd.DataFrame()
        
        logger.info(f"Found {len(files)} files to extract")
        
        if parallel and len(files) > 1:
            return self._extract_parallel(files, **kwargs)
        else:
            return self._extract_sequential(files, **kwargs)
    
    def _extract_sequential(self, files: list, **kwargs) -> pd.DataFrame:
        """Extract files sequentially."""
        dfs = []
        for file_path in files:
            try:
                df = self._read_file(file_path, **kwargs)
                df['_source_file'] = str(file_path)
                dfs.append(df)
            except Exception as e:
                logger.error(f"Error reading {file_path}: {e}")
        
        return pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()
    
    def _extract_parallel(self, files: list, **kwargs) -> pd.DataFrame:
        """Extract files in parallel using thread pool."""
        dfs = []
        
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            future_to_file = {
                executor.submit(self._read_file, f, **kwargs): f 
                for f in files
            }
            
            for future in as_completed(future_to_file):
                file_path = future_to_file[future]
                try:
                    df = future.result()
                    df['_source_file'] = str(file_path)
                    dfs.append(df)
                except Exception as e:
                    logger.error(f"Error reading {file_path}: {e}")
        
        return pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()


# Demo
extractor = FileExtractor(max_workers=4)
print("Supported formats:", list(FileExtractor.SUPPORTED_FORMATS.keys()))
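
The parallel path in `_extract_parallel` can be exercised end to end with a small, self-contained sketch. It swaps pandas for the standard-library `csv` module so it runs without extra dependencies, and it returns failures to the caller rather than only logging them. The helper names `read_rows` and `extract_parallel` are illustrative and not part of `FileExtractor`.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def read_rows(path: Path) -> list[dict]:
    """Read one CSV file into a list of row dicts (stand-in for pd.read_csv)."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


def extract_parallel(files: list[Path], max_workers: int = 4):
    """Read files concurrently and return (rows, failures)."""
    rows, failures = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {executor.submit(read_rows, f): f for f in files}
        for future in as_completed(future_to_file):
            path = future_to_file[future]
            try:
                for row in future.result():
                    row["_source_file"] = path.name  # lineage column
                    rows.append(row)
            except Exception as exc:
                failures[path.name] = str(exc)
    return rows, failures


# Throwaway directory with two readable files and one missing path.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "a.csv").write_text("id,name\n1,ann\n2,bob\n")
    (tmp_path / "b.csv").write_text("id,name\n3,cy\n")
    files = [tmp_path / "a.csv", tmp_path / "b.csv", tmp_path / "missing.csv"]
    rows, failures = extract_parallel(files)

# Completion order is not deterministic, so sort before inspecting.
rows.sort(key=lambda r: int(r["id"]))
print(len(rows), "rows; failed:", sorted(failures))
```

Because `as_completed` yields results in completion order, the combined output is not in file order; sort on a key or on `_source_file` when order matters downstream.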

In [None]:
# Cloud Storage Extraction (S3 Example)

class S3Extractor:
    """
    Extract files from Amazon S3 with support for large datasets.
    """
    
    def __init__(
        self,
        bucket: str,
        aws_access_key: str = None,
        aws_secret_key: str = None,
        region: str = 'us-east-1'
    ):
        self.bucket = bucket
        self.region = region
        
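        # Initialize the boto3 client. The import is local so the notebook
        # still loads when boto3 is not installed. Keys left as None fall
        # back to boto3's default credential chain.
        import boto3
        self.s3_client = boto3.client(
            's3',
            aws_access_key_id=aws_access_key,
            aws_secret_access_key=aws_secret_key,
            region_name=region
        )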
    
    def list_objects(self, prefix: str = '') -> list[str]:
        """List objects in bucket with optional prefix filter."""
        objects = []
        paginator = self.s3_client.get_paginator('list_objects_v2')
        
        for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix):
            for obj in page.get('Contents', []):
                objects.append(obj['Key'])
        
        return objects
    
    def extract_parquet(self, key: str) -> pd.DataFrame:
        """
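        Extract Parquet file directly from S3.
        Uses pyarrow for efficient columnar reads. Reading an s3:// path
        through pandas requires s3fs and uses the ambient AWS credentials,
        not the keys passed to __init__.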
        """
        s3_path = f"s3://{self.bucket}/{key}"
        return pd.read_parquet(s3_path)
    
    def extract_csv_chunked(
        self,
        key: str,
        chunk_size: int = 100000
    ) -> Iterator[pd.DataFrame]:
        """
        Stream large CSV from S3 in chunks.
        
        Yields:
            DataFrame chunks
        """
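        response = self.s3_client.get_object(Bucket=self.bucket, Key=key)
        
        # Pass the StreamingBody straight to pandas so rows are parsed as
        # bytes arrive. Calling .read() first would buffer the whole object
        # in memory and defeat the purpose of chunking.
        for chunk in pd.read_csv(response['Body'], chunksize=chunk_size):
            yield chunk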
    
    def extract_partitioned_data(
        self,
        prefix: str,
        partition_filter: dict = None
    ) -> pd.DataFrame:
        """
        Extract partitioned Parquet data (Hive-style partitioning).
        
        Example:
            s3://bucket/data/year=2024/month=01/data.parquet
        """
        import pyarrow.dataset as ds
        
        s3_path = f"s3://{self.bucket}/{prefix}"
        
        dataset = ds.dataset(
            s3_path,
            partitioning='hive'
        )
        
        if partition_filter:
            # Build filter expression
            filter_expr = None
            for col, value in partition_filter.items():
                condition = ds.field(col) == value
                filter_expr = condition if filter_expr is None else filter_expr & condition
            
            return dataset.to_table(filter=filter_expr).to_pandas()
        
        return dataset.to_table().to_pandas()


print("S3 Extractor Features:")
print("- Direct Parquet reads")
print("- Chunked CSV streaming")
print("- Partition pruning for efficient queries")
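
Partition pruning works because Hive-style keys carry the partition values in the path, so whole prefixes can be skipped before any file is opened. The sketch below illustrates that idea on plain key strings. `parse_partitions` and `prune_keys` are hypothetical helpers, not pyarrow APIs, and they compare values as strings; pyarrow may infer typed partition columns, in which case filter values need to match the inferred type.

```python
def parse_partitions(key: str) -> dict[str, str]:
    """Return the name=value path segments of a Hive-style key as a dict."""
    parts = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts


def prune_keys(keys: list[str], partition_filter: dict[str, str]) -> list[str]:
    """Keep only keys whose partition values match every filter entry."""
    return [
        key for key in keys
        if all(parse_partitions(key).get(col) == val
               for col, val in partition_filter.items())
    ]


keys = [
    "data/year=2023/month=12/part-0.parquet",
    "data/year=2024/month=01/part-0.parquet",
    "data/year=2024/month=02/part-0.parquet",
]
selected = prune_keys(keys, {"year": "2024", "month": "01"})
print(selected)
```

Only the matching key survives, so a reader built on this would fetch one object instead of three.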

---
## 7. Handling Extraction Challenges <a id='handling-challenges'></a>

### Common Challenges

| Challenge | Impact | Solution |
|-----------|--------|----------|
| **Rate Limiting** | Blocked requests, failed extractions | Token bucket, exponential backoff |
| **Pagination** | Missing data, incomplete extracts | Cursor tracking, offset management |
| **Large Datasets** | Memory exhaustion, timeouts | Chunking, streaming, partitioning |
| **Network Issues** | Failed transfers | Retries, checkpointing |
| **Schema Changes** | Data type mismatches | Schema evolution handling |
| **Duplicates** | Data quality issues | Deduplication, idempotent operations |
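
The table lists exponential backoff and retries as the remedy for rate limiting and network issues. A minimal sketch of that pattern follows, assuming transient failures surface as `ConnectionError`. The function names, the `base` and `cap` defaults, and the injectable `sleep` parameter are illustrative choices.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound for the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(func, max_attempts: int = 4, base: float = 0.5,
                      cap: float = 30.0, sleep=time.sleep):
    """Call func, retrying on ConnectionError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Full jitter: a random wait up to the exponential bound
            # keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, backoff_delay(attempt, base, cap)))


# Hypothetical flaky source: fails twice, then succeeds.
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient network error")
    return {"status": "ok"}

slept = []  # record the requested sleeps instead of actually waiting
result = call_with_retries(flaky_fetch, sleep=slept.append)
print(result, "after", attempts["count"], "attempts")
```

Passing `sleep` as a parameter keeps the demo instant and makes the retry policy easy to unit test.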

In [None]:
import time


class TokenBucketRateLimiter:
    """
    Token bucket rate limiter for API requests.
    Allows bursting while maintaining average rate.
    """
    
    def __init__(
        self,
        rate: float,        # Tokens per second
        capacity: int       # Maximum burst size
    ):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_update = time.monotonic()  # monotonic: unaffected by wall-clock jumps
        self._lock = None  # Use threading.Lock() in production
    
    def _refill(self):
        """Refill tokens based on elapsed time."""
        now = time.monotonic()
        elapsed = now - self.last_update
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.rate
        )
        self.last_update = now
    
    def acquire(self, tokens: int = 1) -> float:
        """
        Acquire tokens, blocking if necessary.
        
        Returns:
            Wait time in seconds (0 if no wait needed)
        """
        self._refill()
        
        if self.tokens >= tokens:
            self.tokens -= tokens
            return 0.0
        
        # Calculate wait time
        tokens_needed = tokens - self.tokens
        wait_time = tokens_needed / self.rate
        
        time.sleep(wait_time)
        self._refill()
        self.tokens -= tokens
        
        return wait_time
    
    def can_proceed(self, tokens: int = 1) -> bool:
        """Check if request can proceed without blocking."""
        self._refill()
        return self.tokens >= tokens


# Demo
limiter = TokenBucketRateLimiter(rate=10, capacity=20)
print(f"Rate limiter: {limiter.rate} req/sec, burst: {limiter.capacity}")
print(f"Can proceed: {limiter.can_proceed()}")
print(f"Available tokens: {limiter.tokens:.1f}")
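
The limiter above leaves locking as a note for production use. One way to fill that in is sketched here: a `threading.Lock` guards the token count, and the sleep happens outside the lock so waiting threads do not block each other. The class name `ThreadSafeTokenBucket` is hypothetical, and the sketch assumes callers never request more tokens than `capacity`.

```python
import threading
import time


class ThreadSafeTokenBucket:
    """Token bucket that several threads can share."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def acquire(self, tokens: int = 1) -> float:
        """Block until tokens are available; return total seconds slept."""
        waited = 0.0
        while True:
            with self._lock:
                self._refill()
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return waited
                wait = (tokens - self._tokens) / self.rate
            time.sleep(wait)  # outside the lock, then re-check
            waited += wait


bucket = ThreadSafeTokenBucket(rate=200, capacity=5)
granted = []

def worker():
    for _ in range(5):
        bucket.acquire()
        granted.append(1)

threads = [threading.Thread(target=worker) for _ in range(4)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
print(f"{len(granted)} requests in {elapsed:.3f}s")
```

With a burst of 5 and a rate of 200 per second, the remaining 15 requests need at least 0.075 seconds, which is what the elapsed time should reflect.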

In [None]:
from dataclasses import dataclass, field
from enum import Enum


class PaginationType(Enum):
    """Types of pagination strategies."""
    OFFSET = "offset"         # ?offset=100&limit=50
    PAGE = "page"             # ?page=2&per_page=50
    CURSOR = "cursor"         # ?cursor=abc123
    KEYSET = "keyset"         # ?after_id=1000
    LINK_HEADER = "link"      # Link: <url>; rel="next"


@dataclass
class PaginationConfig:
    """Configuration for pagination handling."""
    type: PaginationType
    page_size: int = 100
    max_pages: Optional[int] = None
    
    # Field names in API response/request
    cursor_param: str = "cursor"
    cursor_response_field: str = "next_cursor"
    offset_param: str = "offset"
    page_param: str = "page"
    limit_param: str = "limit"
    data_field: str = "data"
    total_field: str = "total"


class PaginationHandler:
    """
    Unified pagination handler supporting multiple strategies.
    """
    
    def __init__(self, config: PaginationConfig):
        self.config = config
        self._current_offset = 0
        self._current_page = 1
        self._current_cursor = None
        self._pages_fetched = 0
        self._exhausted = False
    
    def get_params(self) -> dict:
        """Get query parameters for next request."""
        params = {self.config.limit_param: self.config.page_size}
        
        if self.config.type == PaginationType.OFFSET:
            params[self.config.offset_param] = self._current_offset
        
        elif self.config.type == PaginationType.PAGE:
            params[self.config.page_param] = self._current_page
        
        elif self.config.type == PaginationType.CURSOR:
            if self._current_cursor:
                params[self.config.cursor_param] = self._current_cursor
        
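        elif self.config.type == PaginationType.KEYSET:
            if self._current_cursor:
                params['after_id'] = self._current_cursor
        
        else:
            # LINK_HEADER pagination follows URLs from response headers, which
            # this handler never sees. Fail loudly instead of silently
            # re-requesting the first page.
            raise NotImplementedError(
                f"Unsupported pagination type: {self.config.type.value}"
            )
        
        return params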
    
    def process_response(self, response: dict) -> list:
        """
        Process API response and update pagination state.
        
        Returns:
            List of records from response
        """
        data = response.get(self.config.data_field, [])
        self._pages_fetched += 1
        
        # Update state based on pagination type
        if self.config.type == PaginationType.OFFSET:
            self._current_offset += len(data)
            total = response.get(self.config.total_field)
            if total and self._current_offset >= total:
                self._exhausted = True
        
        elif self.config.type == PaginationType.PAGE:
            self._current_page += 1
        
        elif self.config.type in (PaginationType.CURSOR, PaginationType.KEYSET):
            self._current_cursor = response.get(self.config.cursor_response_field)
            if not self._current_cursor:
                self._exhausted = True
        
        # Check if we've hit page limit or no more data
        if not data:
            self._exhausted = True
        
        if self.config.max_pages and self._pages_fetched >= self.config.max_pages:
            self._exhausted = True
        
        return data
    
    def has_more(self) -> bool:
        """Check if more pages are available."""
        return not self._exhausted
    
    def reset(self):
        """Reset pagination state for fresh extraction."""
        self._current_offset = 0
        self._current_page = 1
        self._current_cursor = None
        self._pages_fetched = 0
        self._exhausted = False


# Example usage
config = PaginationConfig(
    type=PaginationType.CURSOR,
    page_size=100,
    max_pages=50
)

handler = PaginationHandler(config)
print(f"Pagination type: {config.type.value}")
print(f"Initial params: {handler.get_params()}")

# Simulate response
mock_response = {
    'data': [{'id': i} for i in range(100)],
    'next_cursor': 'abc123'
}
handler.process_response(mock_response)
print(f"After first page - has_more: {handler.has_more()}")
print(f"Next params: {handler.get_params()}")
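
The handler is easiest to understand when driven in a loop: request a page, yield its records, pick up the next cursor, and stop when the cursor or the data runs out. The sketch below shows that loop for cursor pagination against an in-memory stand-in for an API. `fake_api`, `iter_records`, and the response field names are assumptions made for the demo.

```python
from typing import Callable, Iterator, Optional

RECORDS = [{"id": i} for i in range(1, 8)]  # hypothetical 7-row source


def fake_api(cursor: Optional[str], limit: int) -> dict:
    """Stand-in for an HTTP call; the cursor is the next start index as text."""
    start = int(cursor) if cursor else 0
    page = RECORDS[start:start + limit]
    end = start + len(page)
    return {"data": page, "next_cursor": str(end) if end < len(RECORDS) else None}


def iter_records(
    fetch: Callable[[Optional[str], int], dict],
    page_size: int = 3,
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield records page by page until the source or the page budget runs out."""
    cursor = None
    pages = 0
    while True:
        response = fetch(cursor, page_size)
        data = response.get("data", [])
        yield from data
        pages += 1
        cursor = response.get("next_cursor")
        if not data or not cursor:
            break  # source exhausted
        if max_pages is not None and pages >= max_pages:
            break  # safety limit reached


all_ids = [r["id"] for r in iter_records(fake_api)]
capped_ids = [r["id"] for r in iter_records(fake_api, max_pages=2)]
print(all_ids)
print(capped_ids)
```

Writing the loop as a generator lets the caller process records as they arrive rather than holding every page in memory.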

In [None]:
# Checkpointing for resumable extractions

@dataclass
class ExtractionCheckpoint:
    """Checkpoint for resumable extraction jobs."""
    job_id: str
    source: str
    started_at: datetime
    last_updated: datetime
    records_extracted: int = 0
    last_offset: int = 0
    last_cursor: Optional[str] = None
    last_timestamp: Optional[datetime] = None
    status: str = "running"
    error_message: Optional[str] = None


class CheckpointManager:
    """
    Manages extraction checkpoints for fault tolerance.
    """
    
    def __init__(self, checkpoint_dir: str = "./checkpoints"):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
    
    def _checkpoint_path(self, job_id: str) -> Path:
        return self.checkpoint_dir / f"{job_id}.json"
    
    def create(self, job_id: str, source: str) -> ExtractionCheckpoint:
        """Create new extraction checkpoint."""
        now = datetime.now()
        checkpoint = ExtractionCheckpoint(
            job_id=job_id,
            source=source,
            started_at=now,
            last_updated=now
        )
        self.save(checkpoint)
        return checkpoint
    
    def save(self, checkpoint: ExtractionCheckpoint):
        """Persist checkpoint to disk."""
        checkpoint.last_updated = datetime.now()
        
        data = {
            'job_id': checkpoint.job_id,
            'source': checkpoint.source,
            'started_at': checkpoint.started_at.isoformat(),
            'last_updated': checkpoint.last_updated.isoformat(),
            'records_extracted': checkpoint.records_extracted,
            'last_offset': checkpoint.last_offset,
            'last_cursor': checkpoint.last_cursor,
            'last_timestamp': (
                checkpoint.last_timestamp.isoformat() 
                if checkpoint.last_timestamp else None
            ),
            'status': checkpoint.status,
            'error_message': checkpoint.error_message
        }
        
        with open(self._checkpoint_path(checkpoint.job_id), 'w') as f:
            json.dump(data, f, indent=2)
    
    def load(self, job_id: str) -> Optional[ExtractionCheckpoint]:
        """Load checkpoint from disk."""
        path = self._checkpoint_path(job_id)
        
        if not path.exists():
            return None
        
        with open(path) as f:
            data = json.load(f)
        
        return ExtractionCheckpoint(
            job_id=data['job_id'],
            source=data['source'],
            started_at=datetime.fromisoformat(data['started_at']),
            last_updated=datetime.fromisoformat(data['last_updated']),
            records_extracted=data['records_extracted'],
            last_offset=data['last_offset'],
            last_cursor=data.get('last_cursor'),
            last_timestamp=(
                datetime.fromisoformat(data['last_timestamp'])
                if data.get('last_timestamp') else None
            ),
            status=data['status'],
            error_message=data.get('error_message')
        )
    
    def mark_complete(self, checkpoint: ExtractionCheckpoint):
        """Mark extraction as successfully completed."""
        checkpoint.status = "completed"
        self.save(checkpoint)
    
    def mark_failed(self, checkpoint: ExtractionCheckpoint, error: str):
        """Mark extraction as failed with error message."""
        checkpoint.status = "failed"
        checkpoint.error_message = error
        self.save(checkpoint)


# Demo
mgr = CheckpointManager()
print("CheckpointManager: Enables resumable extractions")
print("- Create checkpoints before starting")
print("- Update periodically during extraction")
print("- Resume from last checkpoint on failure")
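
The three steps printed above can be sketched as a resumable loop. This version stores only an offset and a status, and it writes the checkpoint to a temporary file before swapping it into place with `os.replace`, so a crash mid-write cannot leave a half-written JSON file. The function names and the `fail_at` parameter, which simulates an outage, are illustrative.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then atomically swap it into place."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"last_offset": 0, "status": "running"}


def run_extraction(source: list, path: Path, batch_size: int = 3, fail_at=None) -> list:
    """Extract from the last saved offset, checkpointing after each batch."""
    offset = load_checkpoint(path)["last_offset"]
    extracted = []
    while offset < len(source):
        if fail_at is not None and offset >= fail_at:
            raise ConnectionError("simulated outage")
        batch = source[offset:offset + batch_size]
        extracted.extend(batch)
        offset += len(batch)
        save_checkpoint(path, {"last_offset": offset, "status": "running"})
    save_checkpoint(path, {"last_offset": offset, "status": "completed"})
    return extracted


source = list(range(10))
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "job-1.json"
    try:
        run_extraction(source, ckpt, fail_at=6)  # first run dies part-way
    except ConnectionError:
        pass
    resumed_from = load_checkpoint(ckpt)["last_offset"]
    second_run = run_extraction(source, ckpt)    # picks up where it stopped
    final_state = load_checkpoint(ckpt)

print(resumed_from, second_run, final_state)
```

The checkpoint is saved after each batch is handed off, so a crash between the two steps re-extracts that batch on resume. This gives at-least-once delivery, which is why downstream loads should be idempotent.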

---
## 8. Key Takeaways <a id='takeaways'></a>

### Summary

```
┌────────────────────────────────────────────────────────────────────────┐
│                    DATA EXTRACTION BEST PRACTICES                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1. CHOOSE THE RIGHT STRATEGY                                          │
│     • Full extraction: Small datasets, initial loads                   │
│     • Incremental: Large datasets, frequent updates                    │
│     • CDC: Real-time requirements, audit trails                        │
│                                                                         │
│  2. IMPLEMENT ROBUST ERROR HANDLING                                    │
│     • Exponential backoff for retries                                  │
│     • Circuit breakers for failing sources                             │
│     • Checkpointing for resumability                                   │
│                                                                         │
│  3. RESPECT SOURCE SYSTEMS                                             │
│     • Honor rate limits                                                │
│     • Use connection pooling                                           │
│     • Schedule extractions during off-peak hours                       │
│                                                                         │
│  4. OPTIMIZE FOR SCALE                                                 │
│     • Chunked reads for large datasets                                 │
│     • Parallel extraction where possible                               │
│     • Partition pruning for cloud storage                              │
│                                                                         │
│  5. ENSURE DATA QUALITY                                                │
│     • Validate schemas on extraction                                   │
│     • Implement deduplication                                          │
│     • Track extraction lineage                                         │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```
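
Item 2 mentions circuit breakers. A minimal sketch of the idea: after a run of consecutive failures the breaker "opens" and rejects calls immediately, then allows a single trial call once a cooldown has passed. The class, its parameter names, and the injectable `clock` are assumptions for illustration, not an API from any particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("source is cooling down")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


# Drive the breaker with a fake clock so the demo needs no real waiting.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])

def failing():
    raise TimeoutError("source unavailable")

outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("short-circuited")
    except TimeoutError:
        outcomes.append("failed")

now[0] = 11.0  # cooldown has passed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)
```

The third call never reaches the source, which is the point: a struggling system gets time to recover instead of absorbing more retries.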

### Extraction Pattern Decision Tree

```
                    Is real-time required?
                           │
              ┌────────────┴────────────┐
              │ Yes                     │ No
              ▼                         ▼
        Use CDC/Streaming        Dataset size?
        (Debezium, Kafka)              │
                           ┌──────────┴──────────┐
                           │ < 1M rows           │ > 1M rows
                           ▼                     ▼
                    Full Extraction        Has timestamp
                                           or version?
                                                │
                                   ┌───────────┴───────────┐
                                   │ Yes                   │ No
                                   ▼                       ▼
                            Incremental            Consider CDC
                            Extraction             or Full Extract
```

### Essential Tools & Libraries

| Category | Tools |
|----------|-------|
| **Database Connectors** | SQLAlchemy, psycopg2, pymongo, pyodbc |
| **API Clients** | requests, httpx, aiohttp |
| **File Processing** | pandas, pyarrow, polars |
| **Cloud Storage** | boto3, google-cloud-storage, azure-storage-blob |
| **CDC** | Debezium, AWS DMS, Airbyte |
| **Orchestration** | Apache Airflow, Prefect, Dagster |

In [None]:
# Quick Reference: Extraction Patterns Comparison

comparison_data = {
    'Pattern': ['Full Extraction', 'Incremental', 'CDC (Log-based)', 'CDC (Query-based)'],
    'Latency': ['High', 'Medium', 'Low (near real-time)', 'Medium'],
    'Resource Usage': ['High', 'Low', 'Low', 'Medium'],
    'Complexity': ['Low', 'Medium', 'High', 'Medium'],
    'Captures Deletes': ['Yes', 'No*', 'Yes', 'No*'],
    'Best For': [
        'Small tables, initial loads',
        'Large tables with timestamps',
        'Real-time sync, audit logs',
        'When log CDC not available'
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("Extraction Patterns Comparison:")
print(comparison_df.to_string(index=False))
print("\n* Timestamp- and query-based approaches miss hard deletes; soft-delete patterns make deletes visible")
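
The soft-delete pattern from the footnote can be sketched with an in-memory SQLite table. It assumes the source marks deletions by setting `deleted_at` and bumping `updated_at` rather than removing the row, so a timestamp watermark picks up deletes alongside updates. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        updated_at TEXT,   -- ISO-8601 text sorts chronologically
        deleted_at TEXT    -- NULL while the row is live
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "ann", "2024-01-01T00:00:00", None),
        (2, "bob", "2024-01-01T00:00:00", None),
        (3, "cy", "2024-01-01T00:00:00", None),
    ],
)


def extract_changes(conn, watermark: str) -> list:
    """Return rows modified after the watermark, including soft-deleted ones."""
    cur = conn.execute(
        "SELECT id, name, updated_at, deleted_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (watermark,),
    )
    return cur.fetchall()


watermark = "2024-01-01T00:00:00"  # high-water mark from the previous run

# Source-side changes since the last run: one update and one soft delete.
conn.execute(
    "UPDATE customers SET name = 'ann b', updated_at = '2024-01-02T09:00:00' WHERE id = 1"
)
conn.execute(
    "UPDATE customers SET deleted_at = '2024-01-02T10:00:00', "
    "updated_at = '2024-01-02T10:00:00' WHERE id = 2"
)

changes = extract_changes(conn, watermark)
upserts = [row[0] for row in changes if row[3] is None]
deletes = [row[0] for row in changes if row[3] is not None]
new_watermark = max(row[2] for row in changes)
print("upserts:", upserts, "deletes:", deletes, "next watermark:", new_watermark)
```

A row that is physically removed from the source would never appear in this query, which is the gap that log-based CDC or a periodic full comparison closes.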