# Data Loading Strategies

Data loading is the final and critical phase of the ETL/ELT pipeline where transformed data is written to the target destination. The choice of loading strategy significantly impacts:

- **Performance**: How fast data can be loaded
- **Data Integrity**: Ensuring consistency and accuracy
- **Resource Utilization**: CPU, memory, and I/O consumption
- **Downtime**: Impact on target system availability

This notebook covers the essential loading patterns, strategies, and best practices used in modern data engineering.

---
## 1. Loading Patterns Overview

### Common Loading Destinations

| Destination | Use Case | Common Tools |
|-------------|----------|-------------|
| **Data Warehouse** | Analytics, BI reporting | Snowflake, BigQuery, Redshift |
| **Data Lake** | Raw storage, ML pipelines | S3, ADLS, GCS |
| **OLTP Database** | Transactional systems | PostgreSQL, MySQL, SQL Server |
| **NoSQL Database** | Flexible schema, high throughput | MongoDB, DynamoDB, Cassandra |
| **Search Engine** | Full-text search, log analytics | Elasticsearch, OpenSearch |

### Key Considerations

1. **Volume**: How much data needs to be loaded?
2. **Velocity**: How frequently does data arrive?
3. **Variety**: What format is the data in?
4. **Consistency Requirements**: ACID vs eventual consistency
5. **Target System Constraints**: Lock contention, index maintenance

---
## 2. Full Load vs Incremental Load vs Upsert

### 2.1 Full Load (Complete Refresh)

**Description**: Replace all existing data with the complete source dataset.

```
┌─────────────────┐         ┌─────────────────┐
│   Source Data   │  ───►   │   Target Table  │
│  (Complete Set) │  LOAD   │  (Truncated &   │
│                 │         │   Reloaded)     │
└─────────────────┘         └─────────────────┘
```

**Pros**:
- Simple to implement and understand
- Ensures complete data consistency
- No need to track changes

**Cons**:
- Slow for large datasets
- High resource consumption
- May cause downtime

**Best For**: Small reference tables, dimension tables with infrequent changes

In [None]:
import pandas as pd
from sqlalchemy import create_engine, text
from datetime import datetime

def full_load(df: pd.DataFrame, table_name: str, engine, schema: str = None) -> dict:
    """
    Perform a full load (drop, recreate, and reload) to target table.
    
    Args:
        df: DataFrame to load
        table_name: Target table name
        engine: SQLAlchemy engine
        schema: Optional schema name
    
    Returns:
        dict: Load statistics
    """
    start_time = datetime.now()
    
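    # 'replace' drops and recreates the table rather than truncating it,
    # so indexes, constraints, and grants on the old table are not kept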
    df.to_sql(
        name=table_name,
        con=engine,
        schema=schema,
        if_exists='replace',  # Drops and recreates table
        index=False,
        method='multi'  # Batch inserts
    )
    
    end_time = datetime.now()
    
    return {
        'load_type': 'full',
        'rows_loaded': len(df),
        'duration_seconds': (end_time - start_time).total_seconds(),
        'timestamp': end_time.isoformat()
    }

# Example usage
# engine = create_engine('postgresql://user:pass@localhost:5432/mydb')
# df = pd.read_csv('products.csv')
# stats = full_load(df, 'dim_products', engine)
# print(stats)
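
Because `if_exists='replace'` drops and recreates the table, a variant that keeps the existing schema is sometimes preferable. The following is a minimal sketch, not a definitive implementation: it uses the standard-library `sqlite3` module, and the `full_reload` helper and the `dim_products` table are hypothetical names chosen for illustration. The delete and the reload run in one transaction, so a failure rolls back to the previous contents.

```python
import sqlite3

def full_reload(conn: sqlite3.Connection, table: str, rows: list) -> int:
    """Delete all rows and reload them in one transaction (schema is kept)."""
    placeholders = ", ".join("?" for _ in rows[0]) if rows else ""
    with conn:  # commits on success, rolls back on any exception
        conn.execute(f"DELETE FROM {table}")
        if rows:
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return len(rows)

# Hypothetical target table with one stale row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_products (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO dim_products VALUES (1, 'old widget')")
conn.commit()

loaded = full_reload(conn, "dim_products", [(1, "widget"), (2, "gadget")])
final_rows = conn.execute("SELECT * FROM dim_products ORDER BY product_id").fetchall()
print(loaded, final_rows)
```

On databases that support it, `TRUNCATE` is usually cheaper than `DELETE`, but whether it can be rolled back depends on the engine.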

### 2.2 Incremental Load (Delta Load)

**Description**: Load only new or changed records since the last load.

```
┌─────────────────┐                    ┌─────────────────┐
│   Source Data   │                    │   Target Table  │
│  ┌───────────┐  │                    │                 │
│  │  Changed  │  │  ───► APPEND ───►  │  + New Records  │
│  │  Records  │  │                    │                 │
│  └───────────┘  │                    │                 │
└─────────────────┘                    └─────────────────┘
```

**Change Detection Methods**:
1. **Timestamp-based**: Use `modified_at` or `created_at` columns
2. **Sequence-based**: Use auto-increment IDs
3. **Hash-based**: Compare row hashes
4. **CDC (Change Data Capture)**: Database log-based tracking
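
Hash-based detection from the list above can be sketched as follows. This is an illustrative example under stated assumptions: rows are plain dictionaries (for a DataFrame, `df.to_dict(orient='records')` produces this shape), and the helper names `row_hash` and `changed_rows`, the `|` delimiter, and the choice of SHA-256 are not from any particular library convention.

```python
import hashlib

def row_hash(row: dict, columns: list) -> str:
    """Hash the selected column values as one delimited string."""
    payload = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_rows(source: list, target: list, key: str, columns: list) -> list:
    """Return source rows that are new or whose hash differs from the target."""
    target_hashes = {r[key]: row_hash(r, columns) for r in target}
    return [r for r in source if target_hashes.get(r[key]) != row_hash(r, columns)]

target = [{"id": 1, "city": "New York"}, {"id": 2, "city": "Boston"}]
source = [
    {"id": 1, "city": "New York"},   # unchanged
    {"id": 2, "city": "Chicago"},    # changed
    {"id": 3, "city": "Denver"},     # new
]
delta = changed_rows(source, target, key="id", columns=["city"])
print(delta)
```

In practice the target-side hash is often stored as a column at load time, so the target rows do not have to be re-read and re-hashed on every run.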

**Pros**:
- Faster than full load
- Lower resource consumption
- Minimal impact on target system

**Cons**:
- Requires change tracking mechanism
- Cannot detect hard deletes (without CDC)
- More complex implementation
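
When CDC is not available, one workaround for hard deletes is to compare key sets and soft-delete the rows that have disappeared from the source. The sketch below assumes that a full list of source keys can be extracted cheaply and that the target has an `is_deleted` flag; the `soft_delete_missing` helper, the `fact_orders` table, and the `src_keys` temporary table are hypothetical.

```python
import sqlite3

def soft_delete_missing(conn, table: str, key: str, source_keys) -> int:
    """Flag target rows whose key no longer exists in the source snapshot."""
    with conn:
        # Stage the source keys so the comparison runs inside the database
        conn.execute("CREATE TEMP TABLE IF NOT EXISTS src_keys (k)")
        conn.execute("DELETE FROM src_keys")
        conn.executemany("INSERT INTO src_keys VALUES (?)", [(k,) for k in source_keys])
        cur = conn.execute(
            f"UPDATE {table} SET is_deleted = 1 "
            f"WHERE is_deleted = 0 AND {key} NOT IN (SELECT k FROM src_keys)"
        )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, is_deleted INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO fact_orders (order_id) VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

# Order 2 was hard-deleted upstream, so it is missing from the key snapshot
flagged = soft_delete_missing(conn, "fact_orders", "order_id", source_keys=[1, 3])
deleted_ids = [r[0] for r in conn.execute(
    "SELECT order_id FROM fact_orders WHERE is_deleted = 1"
)]
print(flagged, deleted_ids)
```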

In [None]:
from typing import Optional
import hashlib

def get_last_watermark(engine, table_name: str, watermark_column: str) -> Optional[datetime]:
    """
    Get the last watermark (max timestamp/ID) from target table.
    """
    query = text(f"SELECT MAX({watermark_column}) FROM {table_name}")
    with engine.connect() as conn:
        result = conn.execute(query).scalar()
    return result


def incremental_load_timestamp(
    source_df: pd.DataFrame,
    table_name: str,
    engine,
    timestamp_column: str = 'modified_at'
) -> dict:
    """
    Perform incremental load using timestamp-based change detection.
    
    Args:
        source_df: Complete source DataFrame with timestamp column
        table_name: Target table name
        engine: SQLAlchemy engine
        timestamp_column: Column used for change detection
    
    Returns:
        dict: Load statistics
    """
    start_time = datetime.now()
    
    # Get last loaded timestamp (watermark)
    last_watermark = get_last_watermark(engine, table_name, timestamp_column)
    
    if last_watermark is not None:
        # Filter for only new/changed records
        delta_df = source_df[source_df[timestamp_column] > last_watermark]
    else:
        # First load - take all records
        delta_df = source_df
    
    if len(delta_df) > 0:
        # Append new records
        delta_df.to_sql(
            name=table_name,
            con=engine,
            if_exists='append',
            index=False,
            method='multi'
        )
    
    end_time = datetime.now()
    
    return {
        'load_type': 'incremental',
        'detection_method': 'timestamp',
        'last_watermark': str(last_watermark),
        'new_watermark': str(source_df[timestamp_column].max()),
        'rows_loaded': len(delta_df),
        'rows_skipped': len(source_df) - len(delta_df),
        'duration_seconds': (end_time - start_time).total_seconds()
    }

# Example usage
# stats = incremental_load_timestamp(df, 'fact_orders', engine, 'updated_at')
# print(stats)

### 2.3 Upsert (Merge) Load

**Description**: Insert new records and update existing ones based on a key.

```
┌─────────────────┐                    ┌─────────────────┐
│   Source Data   │                    │   Target Table  │
│  ┌───────────┐  │   Key Match?       │  ┌───────────┐  │
│  │ Record A  │──┼───► YES ──────────►│  │ Update A  │  │
│  ├───────────┤  │                    │  ├───────────┤  │
│  │ Record B  │──┼───► NO  ──────────►│  │ Insert B  │  │
│  └───────────┘  │                    │  └───────────┘  │
└─────────────────┘                    └─────────────────┘
```

**Database Support**:
- PostgreSQL: `INSERT ... ON CONFLICT DO UPDATE`
- MySQL: `INSERT ... ON DUPLICATE KEY UPDATE`
- SQL Server: `MERGE`
- Snowflake: `MERGE`

**Pros**:
- Handles both inserts and updates
- Maintains data currency
- Idempotent operations

**Cons**:
- Slower than append-only
- Requires proper key management
- May cause lock contention

In [None]:
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy import Table, MetaData

def upsert_postgresql(
    df: pd.DataFrame,
    table_name: str,
    engine,
    primary_keys: list,
    update_columns: list = None
) -> dict:
    """
    Perform upsert (INSERT ... ON CONFLICT DO UPDATE) for PostgreSQL.
    
    Args:
        df: DataFrame to upsert
        table_name: Target table name
        engine: SQLAlchemy engine
        primary_keys: List of primary key columns
        update_columns: Columns to update on conflict (default: all non-key columns)
    
    Returns:
        dict: Upsert statistics
    """
    start_time = datetime.now()
    
    # Reflect table structure
    metadata = MetaData()
    table = Table(table_name, metadata, autoload_with=engine)
    
    # Determine columns to update
    if update_columns is None:
        update_columns = [col for col in df.columns if col not in primary_keys]
    
    # Convert DataFrame to records
    records = df.to_dict(orient='records')
    
    # Build upsert statement
    stmt = pg_insert(table).values(records)
    
    # Define update on conflict
    update_dict = {col: stmt.excluded[col] for col in update_columns}
    # Track update time (assumes the target table has an updated_at column)
    update_dict['updated_at'] = datetime.now()
    
    upsert_stmt = stmt.on_conflict_do_update(
        index_elements=primary_keys,
        set_=update_dict
    )
    
    # Execute upsert
    with engine.begin() as conn:
        result = conn.execute(upsert_stmt)
    
    end_time = datetime.now()
    
    return {
        'load_type': 'upsert',
        'rows_processed': len(df),
        'rows_affected': result.rowcount,
        'duration_seconds': (end_time - start_time).total_seconds()
    }

# Example usage
# stats = upsert_postgresql(
#     df=customer_df,
#     table_name='dim_customers',
#     engine=engine,
#     primary_keys=['customer_id'],
#     update_columns=['name', 'email', 'phone']
# )

In [None]:
def upsert_generic(
    df: pd.DataFrame,
    table_name: str,
    engine,
    primary_keys: list,
    batch_size: int = 1000
) -> dict:
    """
    Database-agnostic upsert using staging table approach.
    Relies only on standard SQL (correlated DELETE, INSERT ... SELECT),
    so it should work on most relational databases.
    
    Strategy:
    1. Load data to staging table
    2. Delete matching records from target
    3. Insert all from staging to target
    """
    start_time = datetime.now()
    staging_table = f"stg_{table_name}_{datetime.now().strftime('%Y%m%d%H%M%S')}"
    
    try:
        # Step 1: Load to staging table
        df.to_sql(
            name=staging_table,
            con=engine,
            if_exists='replace',
            index=False,
            chunksize=batch_size  # Write the staging data in batches
        )
        
        # Build key matching condition
        key_conditions = ' AND '.join(
            [f"{table_name}.{k} = {staging_table}.{k}" for k in primary_keys]
        )
        
        with engine.begin() as conn:
            # Step 2: Delete existing matching records
            delete_sql = text(f"""
                DELETE FROM {table_name}
                WHERE EXISTS (
                    SELECT 1 FROM {staging_table}
                    WHERE {key_conditions}
                )
            """)
            delete_result = conn.execute(delete_sql)
            
            # Step 3: Insert all records from staging
            columns = ', '.join(df.columns)
            insert_sql = text(f"""
                INSERT INTO {table_name} ({columns})
                SELECT {columns} FROM {staging_table}
            """)
            insert_result = conn.execute(insert_sql)
            
            # Cleanup staging table
            conn.execute(text(f"DROP TABLE IF EXISTS {staging_table}"))
    
    except Exception:
        # Cleanup on error, then re-raise with the original traceback
        with engine.begin() as conn:
            conn.execute(text(f"DROP TABLE IF EXISTS {staging_table}"))
        raise
    
    end_time = datetime.now()
    
    return {
        'load_type': 'upsert_generic',
        'rows_processed': len(df),
        'rows_deleted': delete_result.rowcount,
        'rows_inserted': insert_result.rowcount,
        'duration_seconds': (end_time - start_time).total_seconds()
    }

### 2.4 Comparison Summary

| Aspect | Full Load | Incremental | Upsert |
|--------|-----------|-------------|--------|
| **Complexity** | Low | Medium | Medium-High |
| **Speed** | Slow (large data) | Fast | Medium |
| **Resource Usage** | High | Low | Medium |
| **Handles Deletes** | Yes | No* | No* |
| **Handles Updates** | Yes | Append only | Yes |
| **Data Freshness** | Complete | Delta only | Current |
| **Idempotent** | Yes | No | Yes |

*Requires soft deletes or CDC for delete handling
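
The **Idempotent** row is easiest to see by replaying the same batch twice, as a retried job would. The sketch below is an illustration using the standard-library `sqlite3` module (SQLite 3.24 or later for `ON CONFLICT ... DO UPDATE`); the table names `appended` and `upserted` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appended (id INTEGER, city TEXT)")
conn.execute("CREATE TABLE upserted (id INTEGER PRIMARY KEY, city TEXT)")

batch = [(1, "New York"), (2, "Chicago")]

for _ in range(2):  # simulate a retry that replays the same batch
    with conn:
        # Append-only load: every replay adds the rows again
        conn.executemany("INSERT INTO appended VALUES (?, ?)", batch)
        # Upsert: a replay overwrites the same keys instead of duplicating
        conn.executemany(
            "INSERT INTO upserted VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET city = excluded.city",
            batch,
        )

appended_count = conn.execute("SELECT COUNT(*) FROM appended").fetchone()[0]
upserted_count = conn.execute("SELECT COUNT(*) FROM upserted").fetchone()[0]
print(appended_count, upserted_count)  # 4 2
```

The append-only table ends up with duplicated rows after the replay, while the keyed upsert converges to the same two rows no matter how many times the batch is applied.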

---
## 3. Slowly Changing Dimensions (SCD)

Slowly changing dimensions (SCDs) are dimension tables whose attribute values change occasionally rather than on a regular schedule. The approach to handling these changes determines which historical analyses remain possible.

### Common Example: Customer Dimension
```
Customer moves from New York to Los Angeles
- How do we handle historical orders placed when they lived in NY?
- Should reports show NY or LA for those orders?
```

### 3.1 SCD Type 0 - Retain Original

**Description**: Never update the dimension. Original values are preserved.

```
Original Record: {customer_id: 1, city: 'New York'}
After Move:      {customer_id: 1, city: 'New York'}  ← Unchanged
```

**Use Case**: Birth date, original signup date, SSN

### 3.2 SCD Type 1 - Overwrite

**Description**: Update the dimension with new values. No history preserved.

```
Before: {customer_id: 1, name: 'John', city: 'New York'}
After:  {customer_id: 1, name: 'John', city: 'Los Angeles'}  ← Overwritten
```

**Pros**: Simple, space-efficient

**Cons**: No historical tracking

**Use Case**: Correcting data errors, non-critical attributes

In [None]:
def scd_type1_update(
    df: pd.DataFrame,
    table_name: str,
    engine,
    natural_key: str,
    update_columns: list
) -> dict:
    """
    SCD Type 1: Simple overwrite of changed values.
    
    Args:
        df: Source DataFrame with current values
        table_name: Target dimension table
        engine: SQLAlchemy engine
        natural_key: Business key column
        update_columns: Columns to update when changed
    """
    updates = 0
    inserts = 0
    
    # Get existing records
    existing_df = pd.read_sql(f"SELECT * FROM {table_name}", engine)
    existing_keys = set(existing_df[natural_key].tolist())
    
    with engine.begin() as conn:
        for _, row in df.iterrows():
            if row[natural_key] in existing_keys:
                # Update existing record
                set_clause = ', '.join([f"{col} = :{col}" for col in update_columns])
                update_sql = text(f"""
                    UPDATE {table_name}
                    SET {set_clause}, updated_at = :updated_at
                    WHERE {natural_key} = :{natural_key}
                """)
                params = {col: row[col] for col in update_columns}
                params[natural_key] = row[natural_key]
                params['updated_at'] = datetime.now()
                conn.execute(update_sql, params)
                updates += 1
            else:
                # Insert new record
                inserts += 1
    
    # Bulk insert new records
    new_records = df[~df[natural_key].isin(existing_keys)]
    if len(new_records) > 0:
        new_records.to_sql(table_name, engine, if_exists='append', index=False)
    
    return {'scd_type': 1, 'updates': updates, 'inserts': len(new_records)}

### 3.3 SCD Type 2 - Add New Row (Historical Tracking)

**Description**: Create a new row for each change, preserving full history.

```
┌───────────────┬─────────┬─────────────┬────────────┬────────────┬─────────┐
│ surrogate_key │ cust_id │ city        │ valid_from │ valid_to   │ current │
├───────────────┼─────────┼─────────────┼────────────┼────────────┼─────────┤
│ 1             │ 101     │ New York    │ 2020-01-01 │ 2024-06-30 │ N       │
│ 2             │ 101     │ Los Angeles │ 2024-07-01 │ 9999-12-31 │ Y       │
└───────────────┴─────────┴─────────────┴────────────┴────────────┴─────────┘
```

**Key Components**:
- **Surrogate Key**: System-generated unique identifier
- **Natural/Business Key**: Original identifier (customer_id)
- **Valid From/To**: Date range when record was current
- **Current Flag**: Indicates the active record

**Pros**: Full historical tracking, supports time-travel queries

**Cons**: Table growth, complex queries for current data

In [None]:
def scd_type2_process(
    source_df: pd.DataFrame,
    table_name: str,
    engine,
    natural_key: str,
    tracked_columns: list
) -> dict:
    """
    SCD Type 2: Historical tracking with row versioning.
    
    Table Requirements:
    - surrogate_key (auto-increment)
    - natural_key columns
    - tracked_columns
    - valid_from (DATE)
    - valid_to (DATE)
    - is_current (BOOLEAN)
    """
    today = datetime.now().date()
    far_future = datetime(9999, 12, 31).date()
    
    stats = {'new_records': 0, 'expired_records': 0, 'unchanged': 0}
    
    # Get current active records from target
    current_query = f"""
        SELECT * FROM {table_name}
        WHERE is_current = TRUE
    """
    target_df = pd.read_sql(current_query, engine)
    
    records_to_expire = []
    records_to_insert = []
    
    for _, source_row in source_df.iterrows():
        # Find matching current record in target
        match = target_df[target_df[natural_key] == source_row[natural_key]]
        
        if len(match) == 0:
            # New record - insert with current flag
            new_record = source_row.to_dict()
            new_record['valid_from'] = today
            new_record['valid_to'] = far_future
            new_record['is_current'] = True
            records_to_insert.append(new_record)
            stats['new_records'] += 1
        else:
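            # Check if tracked columns changed (two missing values count as equal)
            target_row = match.iloc[0]
            has_change = any(
                source_row[col] != target_row[col]
                and not (pd.isna(source_row[col]) and pd.isna(target_row[col]))
                for col in tracked_columns
            )
            
            if has_change:
                # Expire the old record the day before the new version starts,
                # so a BETWEEN valid_from AND valid_to lookup matches one row
                records_to_expire.append({
                    'surrogate_key': int(target_row['surrogate_key']),
                    'valid_to': (pd.Timestamp(today) - pd.Timedelta(days=1)).date(),
                    'is_current': False
                })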
                
                # Insert new version
                new_record = source_row.to_dict()
                new_record['valid_from'] = today
                new_record['valid_to'] = far_future
                new_record['is_current'] = True
                records_to_insert.append(new_record)
                stats['expired_records'] += 1
            else:
                stats['unchanged'] += 1
    
    # Execute updates
    with engine.begin() as conn:
        # Expire old records
        for rec in records_to_expire:
            expire_sql = text(f"""
                UPDATE {table_name}
                SET valid_to = :valid_to, is_current = :is_current
                WHERE surrogate_key = :surrogate_key
            """)
            conn.execute(expire_sql, rec)
    
    # Insert new records
    if records_to_insert:
        insert_df = pd.DataFrame(records_to_insert)
        insert_df.to_sql(table_name, engine, if_exists='append', index=False)
    
    return {'scd_type': 2, **stats}

# Example query for SCD Type 2
scd2_query_examples = """
-- Get current customer data
SELECT * FROM dim_customer WHERE is_current = TRUE;

-- Get customer data as of a specific date
SELECT * FROM dim_customer 
WHERE customer_id = 101 
  AND '2023-06-15' BETWEEN valid_from AND valid_to;

-- Get complete history for a customer
SELECT * FROM dim_customer 
WHERE customer_id = 101 
ORDER BY valid_from;
"""
print(scd2_query_examples)
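
The as-of lookup can be tried end to end with an in-memory database. This is a small sketch using the standard-library `sqlite3` module and the two example rows from the Type 2 table above; dates are stored as ISO-8601 text (an assumption of this sketch), which compares correctly as strings, and the `city_as_of` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        surrogate_key INTEGER PRIMARY KEY,
        customer_id   INTEGER,
        city          TEXT,
        valid_from    TEXT,
        valid_to      TEXT,
        is_current    INTEGER
    )
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 101, "New York",    "2020-01-01", "2024-06-30", 0),
    (2, 101, "Los Angeles", "2024-07-01", "9999-12-31", 1),
])

def city_as_of(customer_id: int, as_of: str) -> list:
    """Return the city (or cities) valid for the customer on the given date."""
    rows = conn.execute(
        "SELECT city FROM dim_customer "
        "WHERE customer_id = ? AND ? BETWEEN valid_from AND valid_to",
        (customer_id, as_of),
    )
    return [r[0] for r in rows]

before_move = city_as_of(101, "2023-06-15")
last_day_ny = city_as_of(101, "2024-06-30")
first_day_la = city_as_of(101, "2024-07-01")
print(before_move, last_day_ny, first_day_la)
```

Because `BETWEEN` is inclusive on both ends, the validity ranges must not share a boundary date; otherwise a lookup on that date would return two versions.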

### 3.4 SCD Type 3 - Add New Column

**Description**: Store current and previous value in separate columns.

```
┌──────────┬──────────────┬─────────────────┬────────────┐
│ cust_id  │ current_city │ previous_city   │ city_change│
├──────────┼──────────────┼─────────────────┼────────────┤
│ 101      │ Los Angeles  │ New York        │ 2024-07-01 │
└──────────┴──────────────┴─────────────────┴────────────┘
```

**Pros**: Simple queries; the previous value is available without a join

**Cons**: Only stores one previous value
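
A Type 3 change can be applied with a single `UPDATE` that shifts the current value into the previous-value column. The following is a minimal sketch using the standard-library `sqlite3` module; the `scd_type3_update` helper is hypothetical, and the column names follow the example table above.

```python
import sqlite3

def scd_type3_update(conn, cust_id: int, new_city: str, change_date: str) -> int:
    """Shift current_city into previous_city, but only when the value changes."""
    with conn:
        cur = conn.execute(
            """
            UPDATE dim_customer
            SET previous_city = current_city,  -- right-hand side sees the old value
                current_city  = ?,
                city_change   = ?
            WHERE cust_id = ? AND current_city <> ?
            """,
            (new_city, change_date, cust_id, new_city),
        )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        cust_id INTEGER PRIMARY KEY,
        current_city TEXT,
        previous_city TEXT,
        city_change TEXT
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (101, 'New York', NULL, NULL)")
conn.commit()

first = scd_type3_update(conn, 101, "Los Angeles", "2024-07-01")
second = scd_type3_update(conn, 101, "Los Angeles", "2024-07-02")  # no-op replay
row = conn.execute("SELECT * FROM dim_customer WHERE cust_id = 101").fetchone()
print(first, second, row)
```

The `current_city <> ?` guard keeps a replayed load from overwriting `previous_city` with the current value.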

### 3.5 SCD Type Comparison

| Type | History | Storage | Complexity | Use Case |
|------|---------|---------|------------|----------|
| **Type 0** | None | Minimal | Low | Immutable attributes |
| **Type 1** | None | Minimal | Low | Error corrections |
| **Type 2** | Full | High | High | Regulatory compliance |
| **Type 3** | Limited | Medium | Medium | Before/after comparison |

---
## 4. Python Code for Database Loading

### 4.1 Loading to PostgreSQL

In [None]:
import psycopg2
from psycopg2.extras import execute_values
import io

class PostgresLoader:
    """
    High-performance PostgreSQL data loader with multiple strategies.
    """
    
    def __init__(self, connection_string: str):
        self.conn_string = connection_string
    
    def load_with_copy(self, df: pd.DataFrame, table_name: str) -> dict:
        """
        Fastest method: Use PostgreSQL COPY command.
        Up to 10x faster than INSERT for large datasets.
        """
        start_time = datetime.now()
        
        # Create in-memory CSV buffer
        buffer = io.StringIO()
        df.to_csv(buffer, index=False, header=False, sep='\t')
        buffer.seek(0)
        
        conn = psycopg2.connect(self.conn_string)
        try:
            with conn.cursor() as cursor:
                cursor.copy_from(
                    file=buffer,
                    table=table_name,
                    sep='\t',
                    columns=list(df.columns),
                    null=''
                )
            conn.commit()
        finally:
            conn.close()
        
        end_time = datetime.now()
        return {
            'method': 'COPY',
            'rows': len(df),
            'duration_seconds': (end_time - start_time).total_seconds()
        }
    
    def load_with_execute_values(self, df: pd.DataFrame, table_name: str) -> dict:
        """
        Fast batch insert using psycopg2's execute_values.
        Good balance of speed and flexibility.
        """
        start_time = datetime.now()
        
        columns = ', '.join(df.columns)
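        # itertuples yields native Python ints/floats rather than NumPy scalars,
        # which psycopg2 cannot adapt by default
        values = list(df.itertuples(index=False, name=None))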
        
        conn = psycopg2.connect(self.conn_string)
        try:
            with conn.cursor() as cursor:
                insert_sql = f"INSERT INTO {table_name} ({columns}) VALUES %s"
                execute_values(cursor, insert_sql, values, page_size=1000)
            conn.commit()
        finally:
            conn.close()
        
        end_time = datetime.now()
        return {
            'method': 'execute_values',
            'rows': len(df),
            'duration_seconds': (end_time - start_time).total_seconds()
        }

# Example usage
# loader = PostgresLoader('postgresql://user:pass@localhost:5432/mydb')
# stats = loader.load_with_copy(df, 'fact_sales')

### 4.2 Loading to Snowflake

In [None]:
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

class SnowflakeLoader:
    """
    Snowflake data loader using optimal loading patterns.
    """
    
    def __init__(self, account: str, user: str, password: str, 
                 warehouse: str, database: str, schema: str):
        self.config = {
            'account': account,
            'user': user,
            'password': password,
            'warehouse': warehouse,
            'database': database,
            'schema': schema
        }
    
    def load_dataframe(self, df: pd.DataFrame, table_name: str, 
                       auto_create_table: bool = True) -> dict:
        """
        Load DataFrame to Snowflake using write_pandas.
        Uses internal stage and COPY command for optimal performance.
        """
        start_time = datetime.now()
        
        conn = snowflake.connector.connect(**self.config)
        try:
            success, num_chunks, num_rows, output = write_pandas(
                conn=conn,
                df=df,
                table_name=table_name.upper(),
                auto_create_table=auto_create_table,
                quote_identifiers=False
            )
        finally:
            conn.close()
        
        end_time = datetime.now()
        
        return {
            'method': 'write_pandas',
            'success': success,
            'chunks': num_chunks,
            'rows': num_rows,
            'duration_seconds': (end_time - start_time).total_seconds()
        }
    
    def load_from_stage(self, stage_path: str, table_name: str,
                        file_format: str = 'CSV') -> dict:
        """
        Load data from external stage (S3, Azure, GCS).
        Best for very large datasets.
        """
        conn = snowflake.connector.connect(**self.config)
        try:
            cursor = conn.cursor()
            
            # Create file format if not exists
            cursor.execute(f"""
                CREATE FILE FORMAT IF NOT EXISTS my_csv_format
                TYPE = '{file_format}'
                FIELD_OPTIONALLY_ENCLOSED_BY = '"'
                SKIP_HEADER = 1
            """)
            
            # Execute COPY command
            copy_sql = f"""
                COPY INTO {table_name}
                FROM @{stage_path}
                FILE_FORMAT = my_csv_format
                ON_ERROR = 'CONTINUE'
            """
            cursor.execute(copy_sql)
            
            result = cursor.fetchall()
            
        finally:
            conn.close()
        
        return {'method': 'COPY_FROM_STAGE', 'result': result}

### 4.3 Loading to MongoDB

In [None]:
from pymongo import MongoClient, UpdateOne
from typing import List, Dict

class MongoLoader:
    """
    MongoDB data loader with bulk operations.
    """
    
    def __init__(self, connection_string: str, database: str):
        self.client = MongoClient(connection_string)
        self.db = self.client[database]
    
    def bulk_insert(self, collection_name: str, 
                    documents: List[Dict], ordered: bool = False) -> dict:
        """
        Bulk insert documents to MongoDB.
        
        Args:
            ordered: If False, continues on error (faster)
        """
        start_time = datetime.now()
        
        collection = self.db[collection_name]
        result = collection.insert_many(documents, ordered=ordered)
        
        end_time = datetime.now()
        
        return {
            'method': 'insert_many',
            'inserted_count': len(result.inserted_ids),
            'duration_seconds': (end_time - start_time).total_seconds()
        }
    
    def bulk_upsert(self, collection_name: str, documents: List[Dict],
                    key_field: str, batch_size: int = 1000) -> dict:
        """
        Bulk upsert using bulk_write with UpdateOne operations.
        """
        start_time = datetime.now()
        collection = self.db[collection_name]
        
        total_modified = 0
        total_upserted = 0
        
        # Process in batches
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            
            operations = [
                UpdateOne(
                    filter={key_field: doc[key_field]},
                    update={'$set': doc},
                    upsert=True
                )
                for doc in batch
            ]
            
            result = collection.bulk_write(operations, ordered=False)
            total_modified += result.modified_count
            total_upserted += result.upserted_count
        
        end_time = datetime.now()
        
        return {
            'method': 'bulk_upsert',
            'modified_count': total_modified,
            'upserted_count': total_upserted,
            'duration_seconds': (end_time - start_time).total_seconds()
        }
    
    def close(self):
        self.client.close()

# Example usage
# loader = MongoLoader('mongodb://localhost:27017', 'analytics')
# docs = df.to_dict(orient='records')
# stats = loader.bulk_upsert('customers', docs, key_field='customer_id')

---
## 5. Bulk Loading Best Practices

### 5.1 General Optimization Strategies

In [None]:
from contextlib import contextmanager
from typing import Generator

class BulkLoadOptimizer:
    """
    Collection of bulk loading optimization techniques.
    """
    
    @staticmethod
    def chunk_dataframe(df: pd.DataFrame, chunk_size: int) -> Generator:
        """
        Split DataFrame into chunks for batch processing.
        Prevents memory issues with large datasets.
        """
        for i in range(0, len(df), chunk_size):
            yield df.iloc[i:i + chunk_size]
    
    @staticmethod
    @contextmanager
    def disable_indexes(engine, table_name: str):
        """
        Context manager to disable/rebuild indexes during bulk load.
        Significantly speeds up large inserts.
        """
        # Get existing indexes
        with engine.connect() as conn:
            # Disable indexes (PostgreSQL example)
            conn.execute(text(f"""
                UPDATE pg_index 
                SET indisready = false 
                WHERE indrelid = '{table_name}'::regclass
            """))
        
        try:
            yield
        finally:
            # Rebuild indexes
            with engine.connect() as conn:
                conn.execute(text(f"REINDEX TABLE {table_name}"))
    
    @staticmethod
    def parallel_load(df: pd.DataFrame, table_name: str, engine,
                      num_workers: int = 4, chunk_size: int = 10000):
        """
        Load data in parallel using multiple connections.
        """
        from concurrent.futures import ThreadPoolExecutor, as_completed
        
        def load_chunk(chunk_df, worker_id):
            chunk_df.to_sql(
                name=table_name,
                con=engine,
                if_exists='append',
                index=False,
                method='multi'
            )
            return len(chunk_df)
        
        chunks = list(BulkLoadOptimizer.chunk_dataframe(df, chunk_size))
        results = []
        
        with ThreadPoolExecutor(max_workers=num_workers) as executor:
            futures = {
                executor.submit(load_chunk, chunk, i): i 
                for i, chunk in enumerate(chunks)
            }
            
            for future in as_completed(futures):
                worker_id = futures[future]
                try:
                    rows = future.result()
                    results.append({'worker': worker_id, 'rows': rows})
                except Exception as e:
                    results.append({'worker': worker_id, 'error': str(e)})
        
        return results

# Best practices summary
print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║                      BULK LOADING BEST PRACTICES                             ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  1. BATCH SIZE TUNING                                                        ║
║     • Start with 1,000-10,000 rows per batch                                 ║
║     • Increase until diminishing returns                                     ║
║     • Monitor memory usage                                                   ║
║                                                                              ║
║  2. INDEX MANAGEMENT                                                         ║
║     • Drop indexes before bulk load                                          ║
║     • Rebuild after load completes                                           ║
║     • Consider partial indexes                                               ║
║                                                                              ║
║  3. TRANSACTION HANDLING                                                     ║
║     • Use larger transactions for throughput                                 ║
║     • Commit every N rows (e.g., 100,000)                                    ║
║     • Disable auto-commit during bulk operations                             ║
║                                                                              ║
║  4. CONSTRAINT MANAGEMENT                                                    ║
║     • Defer constraint checking until commit                                 ║
║     • Disable triggers during load                                           ║
║     • Re-enable and validate after                                           ║
║                                                                              ║
║  5. PARALLEL LOADING                                                         ║
║     • Partition data by key ranges                                           ║
║     • Use multiple connections/workers                                       ║
║     • Avoid overlapping key ranges                                           ║
║                                                                              ║
║  6. LOGGING & WAL                                                            ║
║     • Use unlogged tables for staging                                        ║
║     • Increase checkpoint interval                                           ║
║     • Consider minimal logging modes                                         ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝
""")
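
Item 5 above (partition data by key ranges, avoid overlapping ranges) can be illustrated with a small helper. This is a minimal sketch, assuming an integer surrogate key; `split_key_range` is a hypothetical name introduced here for illustration and is not part of the loader classes in this notebook.

```python
def split_key_range(min_key: int, max_key: int, num_workers: int) -> list:
    """Split [min_key, max_key] into contiguous, non-overlapping inclusive ranges.

    Hypothetical helper: assumes keys are integers and roughly evenly distributed.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must be >= min_key")

    total_keys = max_key - min_key + 1
    # Never create more ranges than there are keys
    num_ranges = min(num_workers, total_keys)
    base_size, remainder = divmod(total_keys, num_ranges)

    ranges = []
    start = min_key
    for i in range(num_ranges):
        # The first `remainder` ranges take one extra key each
        size = base_size + (1 if i < remainder else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each worker would then load rows WHERE key BETWEEN start AND end
worker_ranges = split_key_range(1, 10, 3)
```

Because each range ends exactly one key before the next begins, no two workers touch the same row, which avoids lock contention and duplicate inserts. Skewed key distributions would need a different split (for example, by row-count percentiles).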

### 5.2 Database-Specific Optimizations

In [None]:
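# PostgreSQL Optimizations
postgresql_optimizations = """
-- Before bulk load
-- UNLOGGED skips WAL for this table, but its contents are truncated after a
-- crash and it is not replicated to standbys
ALTER TABLE target_table SET UNLOGGED;
ALTER TABLE target_table DISABLE TRIGGER ALL;  -- Includes FK triggers; requires superuser
DROP INDEX idx_target_column;  -- Drop indexes

-- Increase checkpoint distance
-- These are server-level settings: a session-level SET is rejected,
-- so change them with ALTER SYSTEM (or postgresql.conf) and reload
ALTER SYSTEM SET checkpoint_timeout = '1h';
ALTER SYSTEM SET max_wal_size = '10GB';
SELECT pg_reload_conf();

-- Use COPY for fastest loading
-- The path is read by the server process; use psql's \\copy for a client-side file
COPY target_table FROM '/path/to/data.csv' WITH (FORMAT csv, HEADER true);

-- After bulk load
CREATE INDEX idx_target_column ON target_table(target_column);
ALTER TABLE target_table ENABLE TRIGGER ALL;
ALTER TABLE target_table SET LOGGED;  -- Re-enable WAL (rewrites the table)
ANALYZE target_table;  -- Update statistics
"""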

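# MySQL Optimizations
mysql_optimizations = """
-- Before bulk load
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET autocommit = 0;
-- DISABLE KEYS only affects non-unique indexes on MyISAM tables; InnoDB ignores it
ALTER TABLE target_table DISABLE KEYS;

-- Use LOAD DATA for fastest loading
-- The file is read on the server and is subject to secure_file_priv;
-- use LOAD DATA LOCAL INFILE for a client-side file
LOAD DATA INFILE '/path/to/data.csv'
INTO TABLE target_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\\n'
IGNORE 1 ROWS;

-- After bulk load
COMMIT;  -- Commit the load before re-enabling checks
ALTER TABLE target_table ENABLE KEYS;
SET FOREIGN_KEY_CHECKS = 1;
SET UNIQUE_CHECKS = 1;
SET autocommit = 1;  -- Restore the session default
ANALYZE TABLE target_table;
"""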

print("PostgreSQL Optimizations:")
print(postgresql_optimizations)
print("\n" + "="*60 + "\n")
print("MySQL Optimizations:")
print(mysql_optimizations)
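
The `COPY` statement above reads a file on the database server. When the data lives in the Python process instead, the same bulk path can be reached with `COPY ... FROM STDIN`. The following is a minimal sketch, assuming a psycopg2 connection is passed in; `rows_to_csv_buffer` and `copy_rows` are hypothetical helper names, and the table and column names are assumed to be trusted identifiers.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv_buffer(rows: Iterable[Sequence]) -> io.StringIO:
    """Serialize rows to an in-memory CSV file positioned at the start."""
    buffer = io.StringIO()
    # csv.writer emits None as an empty field, which COPY's csv format
    # reads as NULL by default
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


def copy_rows(conn, table: str, columns: Sequence[str],
              rows: Iterable[Sequence]) -> None:
    """Stream rows into `table` with COPY ... FROM STDIN.

    Assumes `conn` is a psycopg2 connection and that `table` and `columns`
    are trusted identifiers (they are interpolated into the SQL text).
    """
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    with conn.cursor() as cur:
        cur.copy_expert(sql, rows_to_csv_buffer(rows))
    conn.commit()


# The buffer can be inspected without a database connection
preview = rows_to_csv_buffer([(1, "a,b"), (2, None)]).getvalue()
```

For very large inputs, building the whole buffer in memory may not be acceptable; writing the buffer in batches and calling `copy_rows` per batch is one way to bound memory use.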

### 5.3 Error Handling and Recovery

In [None]:
import logging
from dataclasses import dataclass, field
from datetime import datetime

import pandas as pd

@dataclass
class LoadResult:
    """Container for load operation results."""
    success: bool
    rows_loaded: int = 0
    rows_failed: int = 0
    errors: list = field(default_factory=list)
    duration_seconds: float = 0.0

class ResilientLoader:
    """
    Data loader with retry logic and error handling.
    """
    
    def __init__(self, engine, max_retries: int = 3, 
                 error_threshold: float = 0.05):
        self.engine = engine
        self.max_retries = max_retries
        self.error_threshold = error_threshold  # 5% error tolerance
        self.logger = logging.getLogger(__name__)
    
    def load_with_retry(self, df: pd.DataFrame, table_name: str,
                        chunk_size: int = 1000) -> LoadResult:
        """
        Load data with automatic retry on failure.
        """
        start_time = datetime.now()
        total_rows = len(df)
        loaded_rows = 0
        failed_rows = 0
        errors = []
        
        chunks = [df.iloc[i:i+chunk_size] for i in range(0, len(df), chunk_size)]
        
        for chunk_idx, chunk in enumerate(chunks):
            retry_count = 0
            success = False
            
            while retry_count < self.max_retries and not success:
                try:
                    chunk.to_sql(
                        name=table_name,
                        con=self.engine,
                        if_exists='append',
                        index=False
                    )
                    loaded_rows += len(chunk)
                    success = True
                    
                except Exception as e:
                    retry_count += 1
                    self.logger.warning(
                        f"Chunk {chunk_idx} failed (attempt {retry_count}): {e}"
                    )
                    
                    if retry_count >= self.max_retries:
                        failed_rows += len(chunk)
                        errors.append({
                            'chunk_idx': chunk_idx,
                            'error': str(e),
                            'rows_affected': len(chunk)
                        })
            
            # Check error threshold
            error_rate = failed_rows / total_rows
            if error_rate > self.error_threshold:
                self.logger.error(
                    f"Error threshold exceeded: {error_rate:.2%} > {self.error_threshold:.2%}"
                )
                break
        
        end_time = datetime.now()
        
        return LoadResult(
            success=(failed_rows == 0),
            rows_loaded=loaded_rows,
            rows_failed=failed_rows,
            errors=errors,
            duration_seconds=(end_time - start_time).total_seconds()
        )
    
    def load_with_dead_letter_queue(
        self, df: pd.DataFrame, table_name: str, dlq_table: str
    ) -> LoadResult:
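        """
        Load data, sending failed records to a dead letter queue.

        Rows are inserted one at a time so that a bad record can be isolated.
        This is slow; reserve it for small batches or for reprocessing chunks
        that already failed a bulk load.
        """
        start_time = datetime.now()
        loaded_rows = 0
        failed_records = []
        errors = []
        
        for row_idx, row in df.iterrows():
            try:
                row_df = pd.DataFrame([row])
                row_df.to_sql(
                    name=table_name,
                    con=self.engine,
                    if_exists='append',
                    index=False
                )
                loaded_rows += 1
            except Exception as e:
                # Add to dead letter queue, keeping the original values
                # alongside the error and the time it occurred
                dlq_record = row.to_dict()
                dlq_record['_error'] = str(e)
                dlq_record['_timestamp'] = datetime.now().isoformat()
                failed_records.append(dlq_record)
                errors.append({'row_idx': row_idx, 'error': str(e)})
        
        # Write failed records to DLQ
        if failed_records:
            dlq_df = pd.DataFrame(failed_records)
            dlq_df.to_sql(
                name=dlq_table,
                con=self.engine,
                if_exists='append',
                index=False
            )
        
        return LoadResult(
            success=(len(failed_records) == 0),
            rows_loaded=loaded_rows,
            rows_failed=len(failed_records),
            errors=errors,
            duration_seconds=(datetime.now() - start_time).total_seconds()
        )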
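
The retry loop in `load_with_retry` re-attempts a failed chunk immediately. For transient failures such as dropped connections or lock timeouts, waiting between attempts usually gives the target system time to recover. The following is a minimal sketch of exponential backoff; `retry_with_backoff` and its parameters are hypothetical names introduced for illustration, not part of `ResilientLoader`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T],
                       max_retries: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, doubling the wait after each failed attempt.

    Hypothetical helper: `sleep` is injectable so the delay can be
    stubbed out in tests.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller
            if attempt == max_retries:
                raise
            # Waits base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A chunk load could then be wrapped as `retry_with_backoff(lambda: chunk.to_sql(...))`. This sketch retries every exception type; in practice it is usually worth retrying only errors known to be transient, since a constraint violation will fail the same way on every attempt.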

---
## 6. Key Takeaways

### Loading Strategy Selection Guide

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    LOADING STRATEGY DECISION TREE                       │
└─────────────────────────────────────────────────────────────────────────┘

                         Is the dataset small?
                                 │
                    ┌────────────┴────────────┐
                   YES                        NO
                    │                          │
               FULL LOAD               Need history tracking?
                                              │
                                 ┌────────────┴────────────┐
                                YES                        NO
                                 │                          │
                            SCD TYPE 2              Need to update existing?
                                                           │
                                              ┌────────────┴────────────┐
                                             YES                        NO
                                              │                          │
                                           UPSERT               INCREMENTAL
```
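
The decision tree above can also be written as a small function, which is convenient when a pipeline picks a strategy per table from configuration. This is a minimal sketch that mirrors the tree exactly; the function name and boolean parameters are hypothetical, and what counts as "small" is left to the caller.

```python
def choose_loading_strategy(is_small: bool,
                            needs_history: bool,
                            needs_updates: bool) -> str:
    """Return the strategy the decision tree selects for the given answers."""
    # Small datasets: a complete refresh is simplest
    if is_small:
        return "FULL LOAD"
    # Large datasets that must preserve prior versions of each record
    if needs_history:
        return "SCD TYPE 2"
    # Large datasets where existing rows change but history is not kept
    if needs_updates:
        return "UPSERT"
    # Large, append-only datasets
    return "INCREMENTAL"


strategy = choose_loading_strategy(
    is_small=False, needs_history=False, needs_updates=True
)
```

The questions are evaluated in the same order as in the tree, so a small dataset always gets a full load regardless of the other answers.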

### Summary Table

| Strategy | When to Use | Performance | Complexity |
|----------|------------|-------------|------------|
| **Full Load** | Small datasets, dimension tables | Slow | Low |
| **Incremental** | Large fact tables, append-only | Fast | Medium |
| **Upsert** | Frequently updated entities | Medium | Medium |
| **SCD Type 1** | Error corrections, no history needed | Fast | Low |
| **SCD Type 2** | Audit requirements, historical analysis | Slow | High |
| **Bulk COPY** | Initial loads, large migrations | Fastest | Low |

### Best Practices Checklist

- [ ] **Choose the right loading pattern** based on data volume and requirements
- [ ] **Batch operations** - avoid row-by-row processing
- [ ] **Use native bulk loading** (COPY, LOAD DATA) when possible
- [ ] **Drop or disable secondary indexes and defer constraints** during large loads, then rebuild and re-validate afterward
- [ ] **Implement proper error handling** with retry logic
- [ ] **Monitor and log** load statistics for optimization
- [ ] **Use staging tables** for complex transformations
- [ ] **Validate data quality** after loading
- [ ] **Update statistics** after bulk operations
- [ ] **Test with production-like volumes** before deployment
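
The "validate data quality after loading" item can start with a few cheap reconciliation checks. The following is a minimal sketch using an in-memory SQLite database so that it runs without external services; `validate_load`, the `customers` table, and its columns are hypothetical, and the table and column names are assumed to be trusted identifiers.

```python
import sqlite3


def validate_load(conn, table: str, expected_rows: int, key_column: str) -> list:
    """Return a list of human-readable issues found in the loaded table.

    Hypothetical helper: identifiers are interpolated into the SQL text,
    so they must come from trusted configuration.
    """
    issues = []
    cur = conn.cursor()

    # 1. Row count reconciliation against the source
    actual_rows = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if actual_rows != expected_rows:
        issues.append(
            f"row count mismatch: expected {expected_rows}, found {actual_rows}"
        )

    # 2. Keys must not be NULL
    null_keys = cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key_column} IS NULL"
    ).fetchone()[0]
    if null_keys:
        issues.append(f"{null_keys} row(s) with NULL {key_column}")

    # 3. Keys must be unique
    duplicate_keys = cur.execute(
        f"SELECT COUNT(*) FROM ("
        f"SELECT {key_column} FROM {table} "
        f"WHERE {key_column} IS NOT NULL "
        f"GROUP BY {key_column} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if duplicate_keys:
        issues.append(f"{duplicate_keys} duplicated {key_column} value(s)")

    return issues


# Demo data: one duplicated key, one NULL key, and one row fewer than expected
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ann"), (1, "Ann (dup)"), (None, "Unknown")],
)
issues = validate_load(conn, "customers", expected_rows=4, key_column="customer_id")
```

An empty list means all three checks passed. Real pipelines typically extend this with range, referential, and freshness checks, but row counts and key integrity tend to catch the most common load failures.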