# End-to-End Data Population Tutorial

This notebook demonstrates the complete workflow for loading, processing, and populating data into a database using our custom utility classes.

**The workflow consists of three main steps:**
1.  **Load**: Read data from a source file using `SmartAutoDataLoader`.
2.  **Process**: Clean and standardize the loaded data using `DataProcessor`.
3.  **Populate**: Connect to a database and insert the processed data using `db_connector`.

We will populate the same data into two different databases, NeonDB and AWS LayeredDB, to show the flexibility of the connector.

## Setup and Imports

First, we import the necessary classes and libraries.

In [32]:
import pandas as pd
import os

# Add project root to sys.path for imports if running from examples/ or project root
import sys
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

try:
    from db_population_utils.data_loader.smart_auto_data_loader import SmartAutoDataLoader
    from db_population_utils.data_processor.data_processor import DataProcessor
    from db_population_utils.db_connector.smart_db_connector_enhanced_V3 import db_connector
except ModuleNotFoundError:
    # Fallback: import the subpackages directly, which works when the
    # db_population_utils directory itself is on sys.path
    from data_loader.smart_auto_data_loader import SmartAutoDataLoader
    from data_processor.data_processor import DataProcessor
    from db_connector.smart_db_connector_enhanced_V3 import db_connector

# Define the path to our sample data file (always relative to project root)
project_data_dir = os.path.join(project_root, 'data')
file_path = os.path.join(project_data_dir, 'tutorial_customers.csv')

print(f"Ready to process file: {file_path}")

Ready to process file: /Users/svitlanakovalivska/layered-populate-data-pool-da/db_population_utils/data/tutorial_customers.csv


## Step 1: Load Data with `SmartAutoDataLoader`

We start by loading the raw data from the CSV file. `SmartAutoDataLoader` automatically detects the file format, the encoding, and any date columns, and reports each detection in its log output.

In [33]:
# Instantiate the loader
loader = SmartAutoDataLoader(verbose=True)

# Load the data from the file

raw_df = loader.load(file_path)

# Display the first few rows of the loaded data
print("Raw data loaded successfully:")
raw_df.head()

🎯 SmartAutoDataLoader ready!
🎯 Loading file: tutorial_customers.csv
🔍 Format detected: csv
📊 Loading CSV file...
🔤 Encoding detected: utf-8
🗓️ Searching for date columns...
   ✅ Found date column: 'joined_date' (%Y-%m-%d)
   📅 Total date columns found: 1
✅ CSV loaded: 4 rows, 7 columns
Raw data loaded successfully:


Unnamed: 0,CustomerID,First Name,joined_date,City,Has_Subscription,district_id,district
0,1,john,2023-04-15,berlin,True,11001001,Berlin-Mitte
1,2,Jane,2022-11-20,berlin,False,11002002,Berlin-Friedrichshain-Kreuzberg
2,3,Mikey,2024-01-05,berlin,True,11008008,Berlin-Neukölln
3,4,SARAH,2023-08-21,berlin,,11004004,Berlin-Charlottenburg-Wilmersdorf
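To make the date detection reported in the log more concrete, here is a minimal pandas-only sketch of one way such a check could work. It is not the `SmartAutoDataLoader` implementation; the function name `find_date_columns` and the single fixed format are assumptions made for illustration.

```python
import pandas as pd


def find_date_columns(df, fmt="%Y-%m-%d"):
    """Return the text columns whose non-null values all parse with `fmt`."""
    found = []
    for name in df.columns:
        col = df[name]
        # Only text-like columns are candidates for date parsing
        if not (pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col)):
            continue
        values = col.dropna()
        if values.empty:
            continue
        # Values that do not match the format become NaT instead of raising
        parsed = pd.to_datetime(values.astype(str), format=fmt, errors="coerce")
        if parsed.notna().all():
            found.append(name)
    return found


sample = pd.DataFrame({
    "CustomerID": [1, 2],
    "First Name": ["john", "Jane"],
    "joined_date": ["2023-04-15", "2022-11-20"],
})
date_columns = find_date_columns(sample)
print(date_columns)
```

A real loader would likely try several formats and tolerate a few unparseable values; this sketch accepts a column only when every value matches.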


## Step 2: Process Data with `DataProcessor`

Now that we have the data in a DataFrame, we need to clean it up. The `DataProcessor` will help us standardize column names, correct data types, and handle missing values.

In [None]:
# Instantiate the processor
processor = DataProcessor()

print("Original column names:", raw_df.columns.tolist())
print("Original data types:", raw_df.dtypes)

# Define processing rules
type_hints = {
    'customer_id': 'int',
    'has_subscription': 'bool'
}

# Preprocess the data
processed_df = processor.preprocess_loaded_data(
    raw_df,
    datetime_columns=['joined_date'],
    type_hints=type_hints
)

print("Processed column names:", processed_df.columns.tolist())
print("Processed data types:", processed_df.dtypes)
print("Cleaned data ready for population:")
processed_df.head()

Original column names: ['CustomerID', 'First Name', 'joined_date', 'City', 'Has_Subscription', 'district_id', 'district']
Original data types: CustomerID                   int64
First Name                  object
joined_date         datetime64[ns]
City                        object
Has_Subscription            object
district_id                  int64
district                    object
dtype: object


NotImplementedError: 
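Because the cell above ended with a `NotImplementedError`, the following pandas-only sketch shows what the intended cleanup could look like: snake_case column names plus the two type hints. The helpers `to_snake_case` and `standardize` are hypothetical and are not part of `DataProcessor`.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Convert names such as 'CustomerID' or 'First Name' to snake_case."""
    # Insert an underscore at each lower-to-upper case boundary
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse spaces and other separators into single underscores
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def standardize(df, type_hints):
    """Rename columns to snake_case, then apply simple type hints."""
    out = df.rename(columns=to_snake_case)
    for column, kind in type_hints.items():
        if kind == "bool":
            # Nullable boolean dtype keeps missing values as <NA>
            out[column] = out[column].astype("boolean")
        elif kind == "int":
            out[column] = out[column].astype("Int64")
    return out


sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "First Name": ["john", "Jane", "Mikey", "SARAH"],
    "Has_Subscription": [True, False, True, None],
})
clean = standardize(sample, {"customer_id": "int", "has_subscription": "bool"})
print(clean.dtypes)
```

The nullable `boolean` dtype is used so that the missing `Has_Subscription` value stays missing instead of being coerced to `False`.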

## Step 3: Populate Data into Databases

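With the data prepared, we can now populate it into our databases. We will perform the same steps for both NeonDB and AWS LayeredDB. Note that in the run recorded above, the processing cell ended with a `NotImplementedError`, so `processed_df` was never created. The populate calls below therefore fall back to `raw_df`, which is why their reports list the original column names such as `CustomerID`.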

### Part A: Connect to NeonDB

In [36]:
print("Connecting to NeonDB...")
neon_db = db_connector() # Connects to NeonDB by default

# Verify the connection
health_status = neon_db.health_check()
print(f"NeonDB connection status: {health_status.get('status')}")

Connecting to NeonDB...
🌟 SMART DATABASE CONNECTOR V3 - INITIALIZING...
🔗 Using default NeonDB connection
✅ NeonDB configuration loaded
   Default schema: test_berlin_data
🔌 Connecting to NeonDB...
✅ Connection successful!
   Database: neondb
   User: neondb_owner

🔍 Auto-discovering database schemas...
✅ Discovered 4 schemas
🎯 Auto-selected default schema: test_berlin_data

📊 SMART DB CONNECTOR V3 - CONNECTION SUMMARY
🔗 Connection Type: NeonDB

🗂️  Discovered 4 schemas:
  📁 dependency_example: 5 tables
       └─ banks_test_kovalivska_aws (11 columns)
       └─ departments (2 columns)
       └─ districts (3 columns)
       └─ ... and 2 more tables
  📁 nyc_schools: 27 tables
       └─ Audrey_sat_results (10 columns)
       └─ Colleges_Berlin (12 columns)
       └─ Levon_cleaned_sat_scores (8 columns)
       └─ ... and 24 more tables
  📁 public: 15 tables
       └─ audrey_sat_result

### Part B: Create Table in NeonDB

Before we can insert data, we must create a table in the database with the correct schema and constraints. We will execute a `CREATE TABLE` SQL command.

In [37]:
import uuid

# Use UUID to ensure unique table name
unique_id = str(uuid.uuid4())[:8]
table_name = f'tutorial_customers_{unique_id}'  # Unique table name

schema_name = 'test_berlin_data' # Target schema in NeonDB

# Create table SQL for NeonDB (uses district as foreign key)
create_table_sql = f'''
CREATE TABLE IF NOT EXISTS {schema_name}.{table_name} (
    customer_id INTEGER PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    joined_date DATE,
    city VARCHAR(100),
    has_subscription BOOLEAN,
    district_id INTEGER,
    district VARCHAR(100)
    -- Foreign key constraint for NeonDB (requires the districts table; remove the next line if it does not exist)
    ,CONSTRAINT fk_district FOREIGN KEY (district) REFERENCES {schema_name}.districts(district)
);
'''

# To be safe, let's drop the table first in case it exists from a previous run
try:
    neon_db.query(f'DROP TABLE IF EXISTS {schema_name}.{table_name};', show_info=False)
    print(f"Dropped existing table: {table_name}")
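except Exception as e:
    # DROP TABLE IF EXISTS does not fail for a missing table, so reaching this
    # branch means a real error (for example, insufficient permissions).
    print(f"Could not drop table {table_name}: {e}")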

print(f"Creating table {table_name} in schema {schema_name}...")
neon_db.query(create_table_sql, show_info=False)
print("Table created successfully.")

Dropped existing table: tutorial_customers_7dfd0f9b
Creating table tutorial_customers_7dfd0f9b in schema test_berlin_data...
Table created successfully.
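To illustrate why the table is created up front with explicit constraints, the sketch below reproduces the idea against an in-memory SQLite database from the standard library. SQLite stands in for NeonDB here, and the simplified table definitions are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE districts (district TEXT PRIMARY KEY)")
conn.execute("INSERT INTO districts VALUES ('Berlin-Mitte')")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        district TEXT,
        CONSTRAINT fk_district FOREIGN KEY (district) REFERENCES districts(district)
    )
""")

# A row that references a known district is accepted
conn.execute("INSERT INTO customers VALUES (1, 'john', 'Berlin-Mitte')")

# A row that references an unknown district violates the foreign key
try:
    conn.execute("INSERT INTO customers VALUES (2, 'Jane', 'Unknown-District')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("Row with unknown district rejected:", rejected)
```

The same principle applies to the PostgreSQL tables in this tutorial: the foreign key only protects the data if the referenced `districts` table exists and contains every district value being inserted.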


### Part C: Populate into NeonDB

Now we call the `populate` method with `mode='append'` to insert the DataFrame. Append mode adds rows to the existing table rather than recreating it, so the structure and constraints defined in Part B are preserved.

In [38]:
result_neon = neon_db.populate(
    df=processed_df if 'processed_df' in locals() else raw_df,
    table_name=table_name,
    schema=schema_name,
    mode='append',
    show_report=True
)

print(f"NeonDB population status: {result_neon['status']}")


📊 SMART POPULATE - PRE-POPULATION ANALYSIS
🎯 Target: test_berlin_data.tutorial_customers_7dfd0f9b
📝 Mode: APPEND
🔗 Connection: ConnectionType.NEON_DB

📋 DATASET ANALYSIS:
   Rows: 4
   Columns: 7
   Memory usage: 0.00 MB

🔍 COLUMN ANALYSIS:
   CustomerID: int64 | Nulls: 0 (0.0%) | Unique: 4
   First Name: object | Nulls: 0 (0.0%) | Unique: 4
   joined_date: datetime64[ns] | Nulls: 0 (0.0%) | Unique: 4
   City: object | Nulls: 0 (0.0%) | Unique: 1
   Has_Subscription: object | Nulls: 1 (25.0%) | Unique: 2
   district_id: int64 | Nulls: 0 (0.0%) | Unique: 4
   district: object | Nulls: 0 (0.0%) | Unique: 4

✅ DATA QUALITY CHECKS:
   Total null values: 1
   Duplicate rows: 0

🏗️  TABLE STATUS:
   Table exists: No
📝 Inserting 4 rows × 7 columns
   Target: test_berlin_data.tutorial_customers_7dfd0f9b
   Action: append
✅ Insert completed successfully

🎉 SMART POPULATE - OPERATION COMPLETED
✅ Status: SUCCESS
🎯 Table: test_berlin_data.tutorial_customers_7dfd0f9
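The difference between appending to a table and replacing it can be demonstrated with plain pandas and SQLite. This is a sketch of the general behaviour using `DataFrame.to_sql`, not of the connector's `populate` internals, and the helper `table_ddl` is introduced only for this example.

```python
import sqlite3

import pandas as pd


def table_ddl(conn, table):
    """Return the CREATE TABLE statement SQLite has stored for `table`."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT NOT NULL)"
)
df = pd.DataFrame({"customer_id": [1, 2], "first_name": ["john", "Jane"]})

# Append inserts rows into the existing table and leaves its definition alone
df.to_sql("customers", conn, if_exists="append", index=False)
ddl_after_append = table_ddl(conn, "customers")

# Replace drops the table and recreates it from the DataFrame's dtypes
df.to_sql("customers", conn, if_exists="replace", index=False)
ddl_after_replace = table_ddl(conn, "customers")

print(ddl_after_append)
print(ddl_after_replace)
```

Appending in this way relies on the DataFrame's column names matching the table's column names, which is one reason the column standardization in Step 2 matters.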

### Part D: Verify the Data in NeonDB

Let's read the data back from the table to confirm it was inserted correctly.

In [39]:
print("Verifying data in NeonDB...")
data_from_neon = neon_db.query(f"SELECT * FROM {schema_name}.{table_name}")
data_from_neon.head()

Verifying data in NeonDB...
🔍 Executing query in schema: 'test_berlin_data'
✅ Query completed: 4 rows, 7 columns


Unnamed: 0,CustomerID,First Name,joined_date,City,Has_Subscription,district_id,district
0,1,john,2023-04-15,berlin,True,11001001,Berlin-Mitte
1,2,Jane,2022-11-20,berlin,False,11002002,Berlin-Friedrichshain-Kreuzberg
2,3,Mikey,2024-01-05,berlin,True,11008008,Berlin-Neukölln
3,4,SARAH,2023-08-21,berlin,,11004004,Berlin-Charlottenburg-Wilmersdorf
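Beyond inspecting `head()` by eye, a round-trip check can be automated. The sketch below again uses an in-memory SQLite database as a stand-in for the real target; in the notebook, `loaded` would be the result of `neon_db.query(...)` and `source` the DataFrame passed to `populate`.

```python
import sqlite3

import pandas as pd

source = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "first_name": ["john", "Jane", "Mikey"],
})

conn = sqlite3.connect(":memory:")
source.to_sql("customers", conn, index=False)

# Read the rows back in a deterministic order before comparing
loaded = pd.read_sql_query("SELECT * FROM customers ORDER BY customer_id", conn)

row_count_matches = len(loaded) == len(source)
# Raises AssertionError with a detailed diff if any value differs;
# dtypes are ignored because databases may return slightly different types
pd.testing.assert_frame_equal(loaded, source, check_dtype=False)
print("Round trip OK:", row_count_matches)
```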


### Part E: Populate into AWS LayeredDB

Now, we repeat the process for AWS LayeredDB. This demonstrates how the connector can target different environments.

**Note**: This step requires a running SSH tunnel to the AWS database and valid credentials. We will use placeholder credentials here.

In [82]:
print("Connecting to AWS LayeredDB...")
try:
    # IMPORTANT: Replace with your actual username and password
    aws_db = db_connector(
        database='layereddb', 
        username='USERNAME',  # Replace with your AWS username
        password='PASSWORD'  # Replace with your AWS password
    )
    aws_health = aws_db.health_check()
    print(f"AWS connection status: {aws_health.get('status')}")
    aws_connected = True
except Exception as e:
    print(f"Could not connect to AWS LayeredDB: {e}")
    print("Skipping AWS population steps.")
    aws_connected = False

Connecting to AWS LayeredDB...
🌟 SMART DATABASE CONNECTOR V3 - INITIALIZING...
🚇 AWS LayeredDB connection requested
🚇 Tunnel Status: Connected
✅ AWS LayeredDB configuration loaded
   Tunnel: Tunnel is active on localhost:5433
🔌 Connecting to AWS LayeredDB...
✅ Connection successful!
   Database: layereddb
   User: svitlana_kovalivska

🔍 Auto-discovering database schemas...
✅ Discovered 2 schemas
🎯 Auto-selected default schema: berlin_source_data

📊 SMART DB CONNECTOR V3 - CONNECTION SUMMARY
🔗 Connection Type: AWS LayeredDB
🚇 Tunnel Status: Connected (localhost:5433)

🗂️  Discovered 2 schemas:
  🎯 [CURRENT] berlin_source_data: 12 tables
       └─ banks_test_kovalivska_aws (11 columns)
       └─ crime_statistics (15 columns)
       └─ districts (3 columns)
       └─ ... and 9 more tables
  📁 public: 22 tables
       └─ aws_test_customers_v3 (5 columns)
       └─ aws_test_p
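Rather than typing real credentials into a notebook cell, they can be read from environment variables. The following is a minimal sketch; the variable names `LAYERED_DB_USER` and `LAYERED_DB_PASSWORD` and the helper `read_db_credentials` are hypothetical and not something the connector defines.

```python
import os


def read_db_credentials(env=os.environ, prefix="LAYERED_DB"):
    """Read username and password from environment variables.

    The variable names are hypothetical; adjust the prefix to your own setup.
    """
    user_key = f"{prefix}_USER"
    password_key = f"{prefix}_PASSWORD"
    missing = [key for key in (user_key, password_key) if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"username": env[user_key], "password": env[password_key]}


# Demonstration with an explicit mapping instead of the real environment
demo_env = {"LAYERED_DB_USER": "demo_user", "LAYERED_DB_PASSWORD": "demo_password"}
credentials = read_db_credentials(demo_env)
print(credentials["username"])
```

Since the connection cell passes `username` and `password` as keyword arguments, the returned dictionary could be unpacked as `db_connector(database='layereddb', **credentials)`.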

In [None]:
import uuid

if aws_connected:
    # Define schema and table for AWS
    aws_schema_name = 'berlin_source_data'
    

    # Use UUID to ensure unique table name
    unique_id = str(uuid.uuid4())[:8]
    aws_table_name = f'tutorial_customers_{unique_id}'  # Unique table name
    
    # Create table SQL for LayeredDB; district_id is VARCHAR to match the existing districts table
    aws_create_sql = f'''
    CREATE TABLE IF NOT EXISTS {aws_schema_name}.{aws_table_name} (
        customer_id INTEGER PRIMARY KEY,
        first_name VARCHAR(100) NOT NULL,
        joined_date DATE,
        city VARCHAR(100),
        has_subscription BOOLEAN,
        district_id VARCHAR(100),  -- Changed to VARCHAR to match districts table
        district VARCHAR(100)
        -- Foreign key constraint for LayeredDB (requires the districts table; remove the next line if it does not exist)
        ,CONSTRAINT fk_district_id FOREIGN KEY (district_id) REFERENCES {aws_schema_name}.districts(district_id)
    );
    '''
    try:
        aws_db.query(f'DROP TABLE IF EXISTS {aws_schema_name}.{aws_table_name};', show_info=False)
        print(f"Dropped existing table if it existed: {aws_table_name}")
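    except Exception as e:
        # DROP TABLE IF EXISTS does not fail for a missing table, so reaching this
        # branch means a real error (for example, insufficient permissions).
        print(f"Could not drop table {aws_table_name}: {e}")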
        
    print(f"Creating table {aws_table_name} in AWS...")
    aws_db.query(aws_create_sql, show_info=False)
    print("Table created successfully in AWS.")
    
    # Populate the table in AWS
    print("Populating data into AWS LayeredDB...")
    result_aws = aws_db.populate(
        df=processed_df if 'processed_df' in locals() else raw_df,
        table_name=aws_table_name,
        schema=aws_schema_name,
        mode='append'
    )
    print(f"AWS population status: {result_aws['status']}")
    
    # Verify the data in AWS
    print("Verifying data in AWS...")
    data_from_aws = aws_db.query(f"SELECT * FROM {aws_schema_name}.{aws_table_name}")
    display(data_from_aws.head())

Dropped existing table if it existed: tutorial_customers_85f4fd8e
Creating table tutorial_customers_85f4fd8e in AWS...
Table created successfully in AWS.
Populating data into AWS LayeredDB...

📊 SMART POPULATE - PRE-POPULATION ANALYSIS
🎯 Target: berlin_source_data.tutorial_customers_85f4fd8e
📝 Mode: APPEND
🔗 Connection: ConnectionType.AWS_LAYERED_DB

📋 DATASET ANALYSIS:
   Rows: 4
   Columns: 7
   Memory usage: 0.00 MB

🔍 COLUMN ANALYSIS:
   CustomerID: int64 | Nulls: 0 (0.0%) | Unique: 4
   First Name: object | Nulls: 0 (0.0%) | Unique: 4
   joined_date: datetime64[ns] | Nulls: 0 (0.0%) | Unique: 4
   City: object | Nulls: 0 (0.0%) | Unique: 1
   Has_Subscription: object | Nulls: 1 (25.0%) | Unique: 2
   district_id: int64 | Nulls: 0 (0.0%) | Unique: 4
   district: object | Nulls: 0 (0.0%) | Unique: 4

✅ DATA QUALITY CHECKS:
   Total null values: 1
   Duplicate rows: 0

🏗️  TABLE STATUS:
   Table exists: No
📝 Inserting 4 rows × 7 columns
   Target: berlin_source_data.tutorial_customer

Unnamed: 0,CustomerID,First Name,joined_date,City,Has_Subscription,district_id,district
0,1,john,2023-04-15,berlin,True,11001001,Berlin-Mitte
1,2,Jane,2022-11-20,berlin,False,11002002,Berlin-Friedrichshain-Kreuzberg
2,3,Mikey,2024-01-05,berlin,True,11008008,Berlin-Neukölln
3,4,SARAH,2023-08-21,berlin,,11004004,Berlin-Charlottenburg-Wilmersdorf


## Conclusion

Congratulations! You have successfully completed the entire data pipeline:

1.  Loaded raw data from a CSV file using **SmartAutoDataLoader**.
2.  Cleaned and standardized the data using **DataProcessor**.
3.  Populated the clean data into two different databases (NeonDB and AWS LayeredDB) using the **db_connector**.

This notebook provides a foundational template for building more complex data ingestion workflows.