# Data Population Tutorial V1 - Three Common Scenarios

This tutorial demonstrates three common scenarios when working with database tables using our custom utility classes:

1. **Creating a New Table** - First-time table creation and data population
2. **Replacing an Existing Table** - Drop and recreate a table with a new structure
3. **Appending to an Existing Table** - Add more data to an existing table

Each scenario follows the same workflow: Load → Process → Populate
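
The workflow can be sketched end to end with the standard library alone. The snippet below is a minimal illustration, not the implementation of `SmartAutoDataLoader`, `DataProcessor`, or `db_connector`: the helper names `load`, `process`, and `populate`, the sample rows, and the in-memory SQLite database (standing in for NeonDB) are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; the tutorial's real CSV is not reproduced here.
CSV_TEXT = """CustomerID,First Name,City
1,Anna,Berlin
2,Ben,Berlin
"""

def load(text):
    """Load: parse CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))

def process(rows):
    """Process: rename columns to snake_case and cast the ID to int."""
    return [
        {
            "customer_id": int(r["CustomerID"]),
            "first_name": r["First Name"],
            "city": r["City"],
        }
        for r in rows
    ]

def populate(conn, rows):
    """Populate: insert the processed rows and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, first_name TEXT NOT NULL, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (:customer_id, :first_name, :city)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

conn = sqlite3.connect(":memory:")  # stand-in for the NeonDB connection
row_count = populate(conn, process(load(CSV_TEXT)))
print(f"Rows in table: {row_count}")
```

Keeping the three stages as separate functions mirrors the tutorial's structure: the loaded and processed data can be prepared once and then handed to whichever population scenario applies.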

## Setup and Imports

First, we import the necessary classes and prepare our data.

In [2]:
import pandas as pd
import os

# Import our custom classes
from data_loader.smart_auto_data_loader import SmartAutoDataLoader
from data_processor.data_processor import DataProcessor
from db_connector.smart_db_connector_enhanced_V3 import db_connector

# Define the path to our sample data file
file_path = os.path.join('data', 'tutorial_customers.csv')

print(f"Ready to process file: {file_path}")

Ready to process file: data/tutorial_customers.csv


## Data Preparation

Let's load and process our data once - we'll reuse it in all scenarios.
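
To make the role of the type hints and datetime columns concrete, here is a minimal stdlib-only sketch of the kind of conversion they request. It is not how `DataProcessor.preprocess_loaded_data` works internally; the `to_bool` and `coerce_row` helpers, the accepted true/false spellings, and the sample record are assumptions for illustration. The date format `%Y-%m-%d` is the one the loader reports for `joined_date` in its output.

```python
from datetime import date, datetime

# Assumed spellings; a real dataset may use others.
TRUE_VALUES = {"true", "yes", "1"}
FALSE_VALUES = {"false", "no", "0"}

def to_bool(value):
    """Map common text spellings to bool; keep missing values as None."""
    if value is None or str(value).strip() == "":
        return None
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")

def coerce_row(row):
    """Apply the int, bool, and date conversions to one raw record."""
    return {
        **row,
        "customer_id": int(row["customer_id"]),
        "has_subscription": to_bool(row["has_subscription"]),
        "joined_date": datetime.strptime(row["joined_date"], "%Y-%m-%d").date(),
    }

# Hypothetical raw record with a missing subscription flag.
raw = {"customer_id": "7", "has_subscription": "", "joined_date": "2024-03-01"}
clean = coerce_row(raw)
print(clean)
```

Missing values are kept as `None` rather than coerced to `False`, so a nullable `BOOLEAN` column can store them as `NULL`.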

In [3]:
# Load the raw data
loader = SmartAutoDataLoader(verbose=True)
raw_df = loader.load(file_path)

# Process the data
processor = DataProcessor()
type_hints = {'customer_id': 'int', 'has_subscription': 'bool'}
processed_df = processor.preprocess_loaded_data(
    raw_df,
    type_hints=type_hints,
    datetime_columns=['joined_date']
)

print("Data loaded and processed successfully!")
print(f"Shape: {processed_df.shape}")
print("\nColumns:", processed_df.columns.tolist())
processed_df.head()

🎯 SmartAutoDataLoader ready!
🎯 Loading file: tutorial_customers.csv
🔍 Format detected: csv
📊 Loading CSV file...
🔤 Encoding detected: utf-8
🗓️ Searching for date columns...
   ✅ Found date column: 'joined_date' (%Y-%m-%d)
   📅 Total date columns found: 1
✅ CSV loaded: 4 rows, 7 columns


NotImplementedError: 

## Database Connection

Connect to NeonDB for all our examples.

In [15]:
print("Connecting to NeonDB...")
db = db_connector()  # Connects to NeonDB by default

# Verify the connection
health_status = db.health_check()
print(f"Database connection status: {health_status.get('status')}")

# Define our working schema
SCHEMA_NAME = 'test_berlin_data'

Connecting to NeonDB...
🌟 SMART DATABASE CONNECTOR V3 - INITIALIZING...
🔗 Using default NeonDB connection
✅ NeonDB configuration loaded
   Default schema: test_berlin_data
🔌 Connecting to NeonDB...
✅ Connection successful!
   Database: neondb
   User: neondb_owner

🔍 Auto-discovering database schemas...
✅ Discovered 4 schemas
🎯 Auto-selected default schema: test_berlin_data

📊 SMART DB CONNECTOR V3 - CONNECTION SUMMARY
🔗 Connection Type: NeonDB

🗂️  Discovered 4 schemas:
  📁 dependency_example: 5 tables
       └─ banks_test_kovalivska_aws (11 columns)
       └─ departments (2 columns)
       └─ districts (3 columns)
       └─ ... and 2 more tables
  📁 nyc_schools: 27 tables
       └─ Audrey_sat_results (10 columns)
       └─ Colleges_Berlin (12 columns)
       └─ Levon_cleaned_sat_scores (8 columns)
       └─ ... and 24 more tables
  📁 public: 15 tables
       └─ audrey_sat_results (10 columns)
       └─ cleaned_sat_results_peter_s (9 columns)
       └─ demo_users (6 columns)
       └─

---
# Scenario 1: Creating a New Table

This is the simplest case - creating a fresh table and populating it with data.
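
Because the cell below interpolates the table name directly into SQL text, it can help to generate and validate the name in one place. The following sketch is an illustration only: `make_table_name` and the identifier pattern are hypothetical helpers, not part of `db_connector`. The 63-character limit reflects PostgreSQL's default maximum identifier length.

```python
import re
import uuid

# Lowercase letters, digits, and underscores; must not start with a digit.
IDENTIFIER_RE = re.compile(r"^[a-z_][a-z0-9_]*$")
MAX_IDENTIFIER_LENGTH = 63  # PostgreSQL's default limit

def make_table_name(prefix):
    """Return '<prefix>_<8 hex chars>' after checking it is a safe identifier."""
    name = f"{prefix}_{uuid.uuid4().hex[:8]}"
    if len(name) > MAX_IDENTIFIER_LENGTH or not IDENTIFIER_RE.match(name):
        raise ValueError(f"Unsafe table name: {name!r}")
    return name

table_name = make_table_name("customers_v1_new")
print(table_name)
```

A prefix containing spaces, quotes, or semicolons fails the pattern check and raises `ValueError` before any SQL is built.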

In [33]:
import uuid

# Use UUID to ensure unique table name
unique_id = str(uuid.uuid4())[:8]
table_name_new = f'customers_v1_new_{unique_id}'  # Unique table name

print(f"=== SCENARIO 1: Creating New Table '{table_name_new}' ===")

# Step 0: Drop the legacy, non-suffixed table that earlier runs may have left behind
try:
    db.query(f'DROP TABLE IF EXISTS {SCHEMA_NAME}.customers_v1_new CASCADE;', show_info=False)
    print("Cleaned up any existing 'customers_v1_new' table")
except Exception as e:
    print(f"Cleanup skipped: {e}")

# Step 1: Create the table with proper structure
create_sql = f'''
CREATE TABLE {SCHEMA_NAME}.{table_name_new} (
    customer_id INTEGER PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    joined_date DATE,
    city VARCHAR(100),
    has_subscription BOOLEAN,
    district_id INTEGER,
    district VARCHAR(100),
    -- Foreign key: each district value must already exist in the districts table
    CONSTRAINT fk_district FOREIGN KEY (district) REFERENCES {SCHEMA_NAME}.districts(district)
);
'''

print("Creating new table...")
db.query(create_sql, show_info=False)
print("✅ Table created successfully!")

# Step 2: Populate the table
result = db.populate(
    df=processed_df if 'processed_df' in locals() else raw_df,
    table_name=table_name_new,
    schema=SCHEMA_NAME,
    mode='append',
    show_report=True
)

print(f"Population status: {result['status']}")

# Step 3: Verify the data
data_check = db.query(f"SELECT * FROM {SCHEMA_NAME}.{table_name_new} LIMIT 3")
print("\nFirst 3 rows from new table:")
print(data_check)

=== SCENARIO 1: Creating New Table 'customers_v1_new_9cf4da47' ===
Cleaned up any existing 'customers_v1_new' table
Creating new table...
✅ Table created successfully!

📊 SMART POPULATE - PRE-POPULATION ANALYSIS
🎯 Target: test_berlin_data.customers_v1_new_9cf4da47
📝 Mode: APPEND
🔗 Connection: ConnectionType.NEON_DB

📋 DATASET ANALYSIS:
   Rows: 4
   Columns: 7
   Memory usage: 0.00 MB

🔍 COLUMN ANALYSIS:
   CustomerID: int64 | Nulls: 0 (0.0%) | Unique: 4
   First Name: object | Nulls: 0 (0.0%) | Unique: 4
   joined_date: datetime64[ns] | Nulls: 0 (0.0%) | Unique: 4
   City: object | Nulls: 0 (0.0%) | Unique: 1
   Has_Subscription: object | Nulls: 1 (25.0%) | Unique: 2
   district_id: int64 | Nulls: 0 (0.0%) | Unique: 4
   district: object | Nulls: 0 (0.0%) | Unique: 4

✅ DATA QUALITY CHECKS:
   Total null values: 1
   Duplicate rows: 0

🏗️  TABLE STATUS:
   Table exists: No
📝 Inserting 4 rows × 7 columns
   Target: 

---
# Scenario 2: Replacing an Existing Table

Sometimes we need to completely replace a table with new structure or clean data.
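
The drop-and-recreate pattern can be tried without a remote database. The sketch below uses an in-memory SQLite database as a stand-in for NeonDB, so it reads column names with `PRAGMA table_info` rather than `information_schema`, and it omits `CASCADE`, which SQLite does not support. The `column_names` helper and the sample row are assumptions.

```python
import sqlite3

def column_names(conn, table):
    """Return the table's column names in definition order."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

conn = sqlite3.connect(":memory:")

# Old structure: three columns only.
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, first_name TEXT, city TEXT)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Anna', 'Berlin')")
old_columns = column_names(conn, "customers")

# Replace: drop the old table, then recreate it with the wider structure.
conn.execute("DROP TABLE IF EXISTS customers")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, first_name TEXT NOT NULL, "
    "joined_date TEXT, city TEXT, has_subscription INTEGER)"
)
new_columns = column_names(conn, "customers")
remaining_rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print("Old columns:", old_columns)
print("New columns:", new_columns)
print("Rows after replacement:", remaining_rows)
```

Dropping the table discards its rows along with its structure, so any data worth keeping has to be copied out before the replacement.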

In [34]:
import uuid

# Use UUID to create absolutely unique table names for each run
unique_id = str(uuid.uuid4())[:8]  # Use first 8 characters of UUID
table_name_replace = f'customers_v1_replace_{unique_id}'  # Absolutely unique table name

print(f"=== SCENARIO 2: Replacing Existing Table '{table_name_replace}' ===")

# First, create an "old" table to demonstrate replacement
print("Step 1: Creating old table (with fewer columns)...")
old_table_sql = f'''
CREATE TABLE {SCHEMA_NAME}.{table_name_replace} (
    customer_id INTEGER PRIMARY KEY,
    first_name VARCHAR(100),
    city VARCHAR(100)
);
'''
db.query(old_table_sql, show_info=False)

# Insert some old data - use actual column names from the dataset
print("Available columns:", processed_df.columns.tolist() if 'processed_df' in locals() else raw_df.columns.tolist())

# Map to the correct column names
data_to_use = processed_df if 'processed_df' in locals() else raw_df
old_data = data_to_use[['CustomerID', 'First Name', 'City']].head(2)
old_data = old_data.rename(columns={
    'CustomerID': 'customer_id',
    'First Name': 'first_name', 
    'City': 'city'
})

db.populate(old_data, table_name_replace, SCHEMA_NAME, mode='append', show_report=False)
print("✅ Old table created with 2 rows")

# Check old structure
old_structure = db.query(f"""
    SELECT column_name FROM information_schema.columns 
    WHERE table_schema = '{SCHEMA_NAME}' AND table_name = '{table_name_replace}'
    ORDER BY ordinal_position
""")
print(f"Old table columns: {old_structure['column_name'].tolist()}")

# Step 2: Drop table with CASCADE to remove constraints and references, and verify deletion
print("\nStep 2: Completely removing old table (with constraints if any)...")
try:
    for _ in range(3):
        db.query(f'DROP TABLE IF EXISTS {SCHEMA_NAME}.{table_name_replace} CASCADE;', show_info=False)
        check_existence = db.query(f"""
            SELECT COUNT(*) as exists_count 
            FROM information_schema.tables 
            WHERE table_schema = '{SCHEMA_NAME}' 
            AND table_name = '{table_name_replace}'
        """)
        if check_existence['exists_count'].iloc[0] == 0:
            print("✅ Old table dropped successfully")
            break
        else:
            print("⚠️ Table still exists, retrying drop...")
    else:
        raise RuntimeError("Table still exists after multiple drop attempts!")
except Exception as e:
    print(f"Drop operation: {e}")

# Step 3: Create new table with full structure and constraints
print("Step 3: Creating new table with full structure and constraints...")
new_table_sql = f'''
CREATE TABLE {SCHEMA_NAME}.{table_name_replace} (
    customer_id INTEGER PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    joined_date DATE,
    city VARCHAR(100),
    has_subscription BOOLEAN,
    district_id INTEGER,
    district VARCHAR(100),
    CONSTRAINT fk_district FOREIGN KEY (district) REFERENCES {SCHEMA_NAME}.districts(district)
);
'''
try:
    db.query(new_table_sql, show_info=False)
    print("✅ New table structure with constraints created")
except Exception as e:
    print(f"❌ Table creation failed: {e}")
    print("Trying CREATE TABLE IF NOT EXISTS as fallback...")
    fallback_sql = new_table_sql.replace('CREATE TABLE ', 'CREATE TABLE IF NOT EXISTS ')
    db.query(fallback_sql, show_info=False)
    print("✅ Table created with fallback method (may already exist)")

# Step 4: Prepare data with correct column mapping for population
print("Step 4: Preparing data for population...")
data_for_population = data_to_use.rename(columns={
    'CustomerID': 'customer_id',
    'First Name': 'first_name',
    'City': 'city',
    'Has_Subscription': 'has_subscription'
})

result = db.populate(
    df=data_for_population,
    table_name=table_name_replace,
    schema=SCHEMA_NAME,
    mode='append',
    show_report=True
)

print(f"Population status: {result['status']}")

# Verify new structure
new_structure = db.query(f"""
    SELECT column_name FROM information_schema.columns 
    WHERE table_schema = '{SCHEMA_NAME}' AND table_name = '{table_name_replace}'
    ORDER BY ordinal_position
""")
print(f"New table columns: {new_structure['column_name'].tolist()}")
print("✅ Table replacement completed successfully!")

=== SCENARIO 2: Replacing Existing Table 'customers_v1_replace_1a9254d0' ===
Step 1: Creating old table (with fewer columns)...
Available columns: ['CustomerID', 'First Name', 'joined_date', 'City', 'Has_Subscription', 'district_id', 'district']
📝 Inserting 2 rows × 3 columns
   Target: test_berlin_data.customers_v1_replace_1a9254d0
   Action: append
✅ Insert completed successfully
✅ Old table created with 2 rows
🔍 Executing query in schema: 'test_berlin_data'
✅ Query completed: 3 rows, 1 columns
Old table columns: ['customer_id', 'first_name', 'city']

Step 2: Completely removing old table (with constraints if any)...

---
# Scenario 3: Appending to an Existing Table

Adding more data to a table that already has the correct structure.
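
The column renaming performed in the cell below can also be expressed as a general rule. This is a sketch under the assumption that source columns use spaces, underscores, or CamelCase, as the tutorial's do; `to_snake_case` is a hypothetical helper, not part of the utility classes.

```python
import re

def to_snake_case(name):
    """Convert names like 'CustomerID' or 'First Name' to snake_case."""
    # Collapse runs of whitespace or hyphens into a single underscore.
    name = re.sub(r"[\s\-]+", "_", name.strip())
    # Insert an underscore where a lowercase letter or digit meets an uppercase letter.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()

# Column names as reported by the tutorial's "Available columns" output.
source_columns = [
    "CustomerID", "First Name", "joined_date", "City",
    "Has_Subscription", "district_id", "district",
]
normalized = [to_snake_case(col) for col in source_columns]
print(normalized)
```

An explicit mapping, as used in the tutorial cell, is easier to audit for a handful of columns; a rule like this becomes more useful as the number of columns grows.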

In [35]:
import uuid

# Use UUID for append scenario too
unique_id_append = str(uuid.uuid4())[:8]
table_name_append = f'customers_v1_append_{unique_id_append}'

print(f"=== SCENARIO 3: Appending to Existing Table '{table_name_append}' ===")

# Step 1: Create base table with initial data
print("Step 1: Creating base table...")
base_table_sql = f'''
CREATE TABLE {SCHEMA_NAME}.{table_name_append} (
    customer_id INTEGER PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    joined_date DATE,
    city VARCHAR(100),
    has_subscription BOOLEAN,
    district_id INTEGER,
    district VARCHAR(100),
    CONSTRAINT fk_district FOREIGN KEY (district) REFERENCES {SCHEMA_NAME}.districts(district)
);
'''

# Drop the legacy, non-suffixed table that earlier runs may have left behind
try:
    db.query(f'DROP TABLE IF EXISTS {SCHEMA_NAME}.customers_v1_append CASCADE;', show_info=False)
    print("Cleaned up any existing 'customers_v1_append' table")
except Exception as e:
    print(f"Cleanup skipped: {e}")

db.query(base_table_sql, show_info=False)

# Insert initial data (first 2 rows)
initial_data = processed_df.head(2) if 'processed_df' in locals() else raw_df.head(2)
result1 = db.populate(
    df=initial_data,
    table_name=table_name_append,
    schema=SCHEMA_NAME,
    mode='append',
    show_report=False
)

print(f"✅ Base table created with {len(initial_data)} rows")

# Check current row count
count_check = db.query(f"SELECT COUNT(*) as row_count FROM {SCHEMA_NAME}.{table_name_append}")
print(f"Current row count: {count_check['row_count'].iloc[0]}")

# Step 2: Append additional data (remaining rows)
print("\nStep 2: Appending additional data...")
additional_data = processed_df.tail(2) if 'processed_df' in locals() else raw_df.tail(2)  # Last 2 rows

# Ensure all columns match the DB table (snake_case, no spaces or CamelCase).
# Rename a source column only when it is present and its target name is not.
column_map = {
    'CustomerID': 'customer_id',
    'First Name': 'first_name',
    'City': 'city',
    'Has_Subscription': 'has_subscription',
}
rename_map = {
    src: dst for src, dst in column_map.items()
    if src in additional_data.columns and dst not in additional_data.columns
}
additional_data = additional_data.rename(columns=rename_map)

# Only keep columns that exist in the DB table (avoid extra/legacy columns)
expected_cols = ['customer_id', 'first_name', 'joined_date', 'city', 'has_subscription', 'district_id', 'district']
additional_data = additional_data[[col for col in expected_cols if col in additional_data.columns]]

# Modify customer_id to avoid primary key conflicts
additional_data = additional_data.copy()
additional_data['customer_id'] = additional_data['customer_id'] + 100  # Ensure unique IDs

result2 = db.populate(
    df=additional_data,
    table_name=table_name_append,
    schema=SCHEMA_NAME,
    mode='append',
    show_report=True
)

print(f"Append status: {result2['status']}")

# Final verification
try:
    final_count = db.query(f"SELECT COUNT(*) as row_count FROM {SCHEMA_NAME}.{table_name_append}")
    all_data = db.query(f"SELECT * FROM {SCHEMA_NAME}.{table_name_append} ORDER BY customer_id")
    print(f"\nFinal row count: {final_count['row_count'].iloc[0]}")
    print("All data in table:")
    print(all_data)
except Exception as e:
    print(f"❌ Could not fetch all data: {e}")

=== SCENARIO 3: Appending to Existing Table 'customers_v1_append_068c6449' ===
Step 1: Creating base table...
Cleaned up any existing 'customers_v1_append' table
📝 Inserting 2 rows × 7 columns
   Target: test_berlin_data.customers_v1_append_068c6449
   Action: append
✅ Insert completed successfully
✅ Base table created with 2 rows
🔍 Executing query in schema: 'test_berlin_data'
✅ Query completed: 1 rows, 1 columns
Current row count: 2

Step 2: Appending additional data...

📊 SMART POPULATE - PRE-POPULATION ANALYSIS
🎯 Target: test_berlin_data.customers_v1_append_068c6449
📝 Mode: APPEND
🔗 Connection: ConnectionType.NEON_DB

📋 DATASET ANALYSIS:
   Rows: 2
   Columns: 7
   Memory usage: 0.00 MB

🔍 COLUMN ANALYSIS:
   custom

---
# Summary

We've demonstrated three common database scenarios:

1. **New Table Creation**: Clean slate with proper structure and constraints
2. **Table Replacement**: Drop existing table and recreate with new structure
3. **Data Appending**: Add more rows to existing table structure

## Key Takeaways

- Always verify table structure before population
- Use `mode='append'` for adding data to existing tables
- Use `DROP TABLE ... CASCADE` when other objects, such as views or foreign keys in other tables, depend on the table being dropped
- Check row counts before and after operations to verify success
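
The last takeaway can be made mechanical. Below is a minimal sketch, again using in-memory SQLite as a stand-in for the real database; `append_and_verify` is a hypothetical helper and the rows are made-up sample data.

```python
import sqlite3

def append_and_verify(conn, table, rows):
    """Insert rows and confirm the row count grew by exactly len(rows)."""
    count_sql = f"SELECT COUNT(*) FROM {table}"
    before = conn.execute(count_sql).fetchone()[0]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()
    after = conn.execute(count_sql).fetchone()[0]
    if after - before != len(rows):
        raise RuntimeError(f"Expected {len(rows)} new rows, got {after - before}")
    return before, after

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT NOT NULL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Anna"), (2, "Ben")])
conn.commit()

# IDs are offset by 100, as in Scenario 3, to avoid primary key conflicts.
before, after = append_and_verify(conn, "customers", [(103, "Cem"), (104, "Dana")])
print(f"Row count: {before} -> {after}")
```

If an appended row reuses an existing `customer_id`, SQLite raises `sqlite3.IntegrityError`, which is the kind of primary key conflict the ID offset is meant to avoid.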

## Cleanup (Optional)

Run the cell below to clean up the test tables created in this tutorial.

In [32]:
# Enhanced cleanup: remove all tutorial-related tables
print("Cleaning up test tables...")

# Get all tables that start with our tutorial prefixes
all_tables_query = f"""
SELECT table_name 
FROM information_schema.tables 
WHERE table_schema = '{SCHEMA_NAME}' 
AND (table_name LIKE 'customers_v1_%' 
     OR table_name LIKE 'tutorial_customers_%')
"""

try:
    existing_tables = db.query(all_tables_query)
    tutorial_tables = existing_tables['table_name'].tolist() if not existing_tables.empty else []
    
    print(f"Found {len(tutorial_tables)} tutorial tables to clean up")
    
    for table in tutorial_tables:
        try:
            db.query(f'DROP TABLE IF EXISTS {SCHEMA_NAME}.{table} CASCADE;', show_info=False)
            print(f"✅ Dropped {table}")
        except Exception as e:
            print(f"❌ Error dropping {table}: {e}")
    
    print("\n🎉 Cleanup completed successfully!")
    print("💡 All tutorial tables have been removed.")
    print("   You can now re-run any scenario cleanly.")
    
except Exception as e:
    print(f"Error during cleanup: {e}")

print("Cleanup complete!")

Cleaning up test tables...
🔍 Executing query in schema: 'test_berlin_data'
✅ Query completed: 20 rows, 1 columns
Found 20 tutorial tables to clean up
✅ Dropped tutorial_customers_2
✅ Dropped tutorial_customers_fixed
✅ Dropped tutorial_customers_new
✅ Dropped customers_v1_new
✅ Dropped customers_v1_replace
✅ Dropped customers_v1_replace_1755884786
✅ Dropped customers_v1_replace_1755884862_8615
✅ Dropped customers_v1_new_2ec66f6a
✅ Dropped customers_v1_replace_f44c6239
✅ Dropped customers_v1_replace_c229db6d
✅ Dropped customers_v1_replace_6c444aa7
✅ Dropped c