# AWS RDS PostgreSQL (FREE TIER) - Sequencing QC Database

This notebook demonstrates setting up an AWS RDS PostgreSQL database for clinical genetic testing sequencing QC operations.

## 🎉 FREE TIER VERSION 🎉
This version uses the RDS Free Tier (db.t3.micro), which is **free** for the first 12 months as long as usage stays within the limits listed under Free Tier Benefits.

## Features
- Create RDS PostgreSQL instance (db.t3.micro - FREE TIER)
- Create tables for sequencing QC metrics
- Generate and populate faux QC data
- Visualize QC metrics
- Clean up all AWS resources

## Free Tier Benefits
- ✅ 750 hours/month of db.t3.micro (enough for 24/7 operation)
- ✅ 20 GB of storage
- ✅ 20 GB of backup storage
- ✅ Free for 12 months
- ✅ Full PostgreSQL 16 compatibility
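
The "enough for 24/7 operation" claim is simple arithmetic: the longest month has 31 days, or 744 hours, which is just under the 750-hour allowance. The sketch below checks this, assuming the allowance is shared by every Free Tier instance in the account, so a second always-on instance would exceed it. The function names are illustrative only.

```python
FREE_TIER_HOURS_PER_MONTH = 750  # allowance quoted in the list above


def hours_used(days_in_month: int, instances: int = 1) -> int:
    """Instance-hours consumed by `instances` databases running around the clock."""
    return days_in_month * 24 * instances


def fits_free_tier(days_in_month: int, instances: int = 1) -> bool:
    """True when round-the-clock usage stays within the monthly allowance."""
    return hours_used(days_in_month, instances) <= FREE_TIER_HOURS_PER_MONTH


print(hours_used(31))         # 744
print(fits_free_tier(31))     # True
print(fits_free_tier(31, 2))  # False
```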

## Prerequisites
- AWS account eligible for free tier (new account or within first 12 months)
- AWS credentials configured (via AWS CLI or environment variables)
- Appropriate IAM permissions for RDS and EC2
- Python packages: boto3, psycopg2-binary, pandas, matplotlib, seaborn

## 1. Import Libraries and Configuration

In [None]:
import boto3
import json
import time
import psycopg2
import psycopg2.extensions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from psycopg2.extras import execute_values
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Register numpy type adapters for psycopg2 (prevents numpy type errors)
psycopg2.extensions.register_adapter(np.int64, lambda x: int(x))
psycopg2.extensions.register_adapter(np.int32, lambda x: int(x))
psycopg2.extensions.register_adapter(np.float64, lambda x: float(x))
psycopg2.extensions.register_adapter(np.float32, lambda x: float(x))

print("Libraries imported successfully!")
print("✓ Numpy adapters registered")

In [None]:
# Configuration
AWS_REGION = 'us-east-1'
DB_INSTANCE_IDENTIFIER = 'sequencing-qc-db-free'
DB_NAME = 'sequencing_qc_db'
MASTER_USERNAME = 'qcadmin'
MASTER_PASSWORD = 'QC_Secure_Pass_2024!'  # In production, use Secrets Manager

# Initialize AWS clients
rds_client = boto3.client('rds', region_name=AWS_REGION)
ec2_client = boto3.client('ec2', region_name=AWS_REGION)

print(f"Configuration set for region: {AWS_REGION}")
print(f"Database instance identifier: {DB_INSTANCE_IDENTIFIER}")
print(f"\n🎉 Using FREE TIER: db.t3.micro (100% free for 12 months!)")
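
The configuration cell hardcodes `MASTER_PASSWORD` and notes that production code should use Secrets Manager. A lighter intermediate step, sketched here, is to read the password from an environment variable. The variable name `QC_DB_PASSWORD` and the helper `load_master_password` are hypothetical. The fallback generates a random alphanumeric password, which sidesteps the characters RDS rejects in master passwords.

```python
import os
import secrets
import string


def load_master_password(env_var: str = "QC_DB_PASSWORD", length: int = 24) -> str:
    """Return the password from the environment, or generate a random one."""
    password = os.environ.get(env_var)
    if password:
        return password
    # Letters and digits only, so no character needs escaping or is rejected
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


MASTER_PASSWORD = load_master_password()
```

A generated password changes on every kernel restart, so the fallback only suits a throwaway instance that is created and deleted within one session. For anything longer-lived, set the environment variable.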

## 2. RDS Free Tier PostgreSQL Instance Setup

In [None]:
def create_rds_free_tier_instance():
    """
    Create an RDS PostgreSQL instance using FREE TIER (db.t3.micro).
    Returns the instance endpoint information.
    
    FREE TIER SPECS:
    - Instance: db.t3.micro (2 vCPU, 1 GB RAM)
    - Storage: 20 GB (max free tier)
    - Cost: $0.00 for first 12 months (750 hours/month)
    """
    try:
        print("Creating RDS PostgreSQL instance (FREE TIER)...")
        print("  Instance class: db.t3.micro")
        print("  Storage: 20 GB (free tier max)")
        print("  Engine: PostgreSQL 16.4")
        print("")
        
        # Create the RDS instance
        response = rds_client.create_db_instance(
            DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER,
            DBInstanceClass='db.t3.micro',          # FREE TIER!
            Engine='postgres',
            EngineVersion='16.4',
            MasterUsername=MASTER_USERNAME,
            MasterUserPassword=MASTER_PASSWORD,
            DBName=DB_NAME,
            AllocatedStorage=20,                    # FREE TIER: 20 GB max
            StorageType='gp2',                      # General purpose SSD
            PubliclyAccessible=True,                # For demo; use VPC in production
            BackupRetentionPeriod=1,                # Minimal backups
            StorageEncrypted=True,
            EnableCloudwatchLogsExports=['postgresql'],
            Tags=[
                {'Key': 'Purpose', 'Value': 'SequencingQC'},
                {'Key': 'FreeTier', 'Value': 'Yes'},
                {'Key': 'Environment', 'Value': 'Learning'}
            ]
        )
        
        print(f"✓ Instance creation initiated: {DB_INSTANCE_IDENTIFIER}")
        print("")
        
        # Wait for instance to be available
        print("Waiting for instance to become available (this may take 5-10 minutes)...")
        print("☕ Grab a coffee while AWS provisions your FREE database!")
        
        waiter = rds_client.get_waiter('db_instance_available')
        waiter.wait(
            DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER,
            WaiterConfig={'Delay': 30, 'MaxAttempts': 40}
        )
        
        # Get instance details
        instance_info = rds_client.describe_db_instances(
            DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER
        )['DBInstances'][0]
        
        endpoint = instance_info['Endpoint']['Address']
        port = instance_info['Endpoint']['Port']
        
        print(f"\n" + "="*70)
        print("✓ RDS Instance is available!")
        print("="*70)
        print(f"  Endpoint: {endpoint}")
        print(f"  Port: {port}")
        print(f"  Database: {DB_NAME}")
        print(f"  Instance class: {instance_info['DBInstanceClass']}")
        print(f"  Storage: {instance_info['AllocatedStorage']} GB")
        print(f"  Engine: {instance_info['Engine']} {instance_info['EngineVersion']}")
        print(f"\n🎉 FREE TIER: This costs $0.00 (within 750 hours/month limit)!")
        print("="*70)
        
        return {
            'endpoint': endpoint,
            'port': port,
            'database': DB_NAME,
            'username': MASTER_USERNAME,
            'password': MASTER_PASSWORD
        }
        
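    except rds_client.exceptions.DBInstanceAlreadyExistsFault:
        print(f"Instance {DB_INSTANCE_IDENTIFIER} already exists. Retrieving details...")
        # An instance that is still being created has no endpoint yet, so wait first
        rds_client.get_waiter('db_instance_available').wait(
            DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER,
            WaiterConfig={'Delay': 30, 'MaxAttempts': 40}
        )
        instance_info = rds_client.describe_db_instances(
            DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER
        )['DBInstances'][0]
        
        return {
            'endpoint': instance_info['Endpoint']['Address'],
            'port': instance_info['Endpoint']['Port'],
            'database': DB_NAME,
            'username': MASTER_USERNAME,
            'password': MASTER_PASSWORD
        }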
    
    except Exception as e:
        print(f"Error creating instance: {str(e)}")
        raise
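
The function above reads `Endpoint` from the `describe_db_instances` response, which is only populated once the instance has finished provisioning. The sketch below shows one defensive way to read that record. `extract_connection_info` is a hypothetical helper, and the two sample dictionaries are hand-written stand-ins rather than real API output.

```python
from typing import Optional


def extract_connection_info(instance_info: dict) -> Optional[dict]:
    """Return endpoint and port once the instance is available, else None."""
    if instance_info.get("DBInstanceStatus") != "available":
        return None
    endpoint = instance_info.get("Endpoint") or {}
    if "Address" not in endpoint or "Port" not in endpoint:
        return None
    return {"endpoint": endpoint["Address"], "port": endpoint["Port"]}


# Hand-written sample records (hypothetical values)
creating = {"DBInstanceStatus": "creating"}
ready = {
    "DBInstanceStatus": "available",
    "Endpoint": {"Address": "db.example.internal", "Port": 5432},
}

print(extract_connection_info(creating))  # None
print(extract_connection_info(ready))
```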

In [None]:
def configure_security_group(db_instance_identifier):
    """
    Configure security group to allow PostgreSQL connections.
    This is required for connecting to the database from external clients.
    """
    try:
        # Get instance details
        instance_info = rds_client.describe_db_instances(
            DBInstanceIdentifier=db_instance_identifier
        )['DBInstances'][0]
        
        vpc_security_groups = instance_info.get('VpcSecurityGroups', [])
        
        if not vpc_security_groups:
            print("WARNING: No VPC security groups found!")
            return
        
        for sg in vpc_security_groups:
            sg_id = sg['VpcSecurityGroupId']
            
            # Check if rule already exists
            sg_details = ec2_client.describe_security_groups(
                GroupIds=[sg_id]
            )['SecurityGroups'][0]
            
            has_postgres_rule = False
            for rule in sg_details.get('IpPermissions', []):
                if rule.get('FromPort') == 5432 and rule.get('ToPort') == 5432:
                    has_postgres_rule = True
                    break
            
            if not has_postgres_rule:
                print(f"Adding PostgreSQL inbound rule to security group {sg_id}...")
                ec2_client.authorize_security_group_ingress(
                    GroupId=sg_id,
                    IpPermissions=[
                        {
                            'IpProtocol': 'tcp',
                            'FromPort': 5432,
                            'ToPort': 5432,
                            'IpRanges': [
                                {
                                    'CidrIp': '0.0.0.0/0',
                                    'Description': 'PostgreSQL access (demo only)'
                                }
                            ]
                        }
                    ]
                )
                print("  ✓ Security group configured")
            else:
                print("  ✓ Security group already configured")
                
    except Exception as e:
        print(f"Warning: Could not configure security group: {e}")
        print("You may need to manually add inbound rule for port 5432")
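
The rule above opens port 5432 to `0.0.0.0/0`, which the code itself labels as demo only. A narrower alternative is to admit a single client address as a `/32` range. The sketch below only builds the `IpPermissions` structure and does not call AWS. `build_postgres_ingress` is a hypothetical helper, and the address shown comes from a range reserved for documentation.

```python
import ipaddress


def build_postgres_ingress(client_ip: str, port: int = 5432) -> list:
    """Build an IpPermissions list that admits one IPv4 address only."""
    address = ipaddress.IPv4Address(client_ip)  # raises ValueError on bad input
    return [{
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{
            "CidrIp": f"{address}/32",
            "Description": "PostgreSQL access from a single client",
        }],
    }]


permissions = build_postgres_ingress("203.0.113.10")
print(permissions[0]["IpRanges"][0]["CidrIp"])  # 203.0.113.10/32
```

The returned list has the same shape as the `IpPermissions` argument passed to `authorize_security_group_ingress` in `configure_security_group`, so it could be substituted there.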

In [None]:
# Create the RDS FREE TIER instance
db_config = create_rds_free_tier_instance()

# Configure security group
print("\nConfiguring security group...")
configure_security_group(DB_INSTANCE_IDENTIFIER)

## 3. Database Connection Helper

In [None]:
def get_db_connection():
    """
    Create and return a database connection.
    """
    try:
        conn = psycopg2.connect(
            host=db_config['endpoint'],
            port=db_config['port'],
            database=db_config['database'],
            user=db_config['username'],
            password=db_config['password'],
            connect_timeout=10
        )
        return conn
    except Exception as e:
        print(f"Error connecting to database: {str(e)}")
        raise

# Test connection
try:
    test_conn = get_db_connection()
    print("✓ Database connection successful!")
    test_conn.close()
except Exception as e:
    print(f"Connection failed: {e}")
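
Every later cell opens a connection with `get_db_connection()` and must remember to close it. A context manager can make that automatic, as sketched below. `db_session` is a hypothetical helper, and the demo uses the standard library's `sqlite3` as a stand-in so it runs without a database server. In the notebook you would pass `get_db_connection` instead.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def db_session(connect):
    """Open a connection, commit on success, roll back on error, always close."""
    conn = connect()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


# Demo with an in-memory SQLite database standing in for PostgreSQL
with db_session(lambda: sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    rows = conn.execute("SELECT x FROM t").fetchall()

print(rows)  # [(1,)]
```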

## 4. Create Database Schema

We'll create 7 tables for clinical sequencing QC:
1. **sequencing_runs** - Sequencing run metadata
2. **samples** - Patient samples
3. **qc_metrics** - Overall QC metrics
4. **coverage_metrics** - Coverage statistics
5. **variant_calls** - Variant calling statistics
6. **contamination_checks** - Sample contamination detection
7. **alignment_stats** - Read alignment statistics
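
The tables form a simple hierarchy: each sample belongs to one sequencing run, and each metrics table points back to a sample. The sketch below illustrates that relationship with a cut-down version of three tables and a join that counts QC outcomes per run. It uses in-memory SQLite rather than PostgreSQL, keeps only a few columns, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sequencing_runs (
        run_id INTEGER PRIMARY KEY,
        run_name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE samples (
        sample_id INTEGER PRIMARY KEY,
        sample_name TEXT UNIQUE NOT NULL,
        run_id INTEGER REFERENCES sequencing_runs(run_id)
    );
    CREATE TABLE qc_metrics (
        qc_id INTEGER PRIMARY KEY,
        sample_id INTEGER REFERENCES samples(sample_id),
        qc_status TEXT
    );
    INSERT INTO sequencing_runs VALUES (1, 'RUN_A'), (2, 'RUN_B');
    INSERT INTO samples VALUES (1, 'SAMP_1', 1), (2, 'SAMP_2', 1), (3, 'SAMP_3', 2);
    INSERT INTO qc_metrics VALUES (1, 1, 'PASS'), (2, 2, 'FAIL'), (3, 3, 'PASS');
""")

# Walk qc_metrics -> samples -> sequencing_runs to summarise QC per run
per_run = conn.execute("""
    SELECT r.run_name, q.qc_status, COUNT(*) AS n
    FROM qc_metrics q
    JOIN samples s ON q.sample_id = s.sample_id
    JOIN sequencing_runs r ON s.run_id = r.run_id
    GROUP BY r.run_name, q.qc_status
    ORDER BY r.run_name, q.qc_status
""").fetchall()
conn.close()

print(per_run)
```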

In [None]:
def create_tables():
    """
    Create all tables for sequencing QC database.
    """
    conn = get_db_connection()
    cursor = conn.cursor()
    
    # Drop existing tables if they exist
    cursor.execute("DROP TABLE IF EXISTS alignment_stats CASCADE;")
    cursor.execute("DROP TABLE IF EXISTS contamination_checks CASCADE;")
    cursor.execute("DROP TABLE IF EXISTS variant_calls CASCADE;")
    cursor.execute("DROP TABLE IF EXISTS coverage_metrics CASCADE;")
    cursor.execute("DROP TABLE IF EXISTS qc_metrics CASCADE;")
    cursor.execute("DROP TABLE IF EXISTS samples CASCADE;")
    cursor.execute("DROP TABLE IF EXISTS sequencing_runs CASCADE;")
    
    # Table 1: Sequencing Runs
    cursor.execute("""
        CREATE TABLE sequencing_runs (
            run_id SERIAL PRIMARY KEY,
            run_name VARCHAR(100) UNIQUE NOT NULL,
            sequencer_id VARCHAR(50) NOT NULL,
            platform VARCHAR(50) NOT NULL,
            run_date DATE NOT NULL,
            flow_cell_id VARCHAR(50),
            num_cycles INTEGER,
            operator VARCHAR(100),
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    print("✓ Created table: sequencing_runs")
    
    # Table 2: Samples
    cursor.execute("""
        CREATE TABLE samples (
            sample_id SERIAL PRIMARY KEY,
            sample_name VARCHAR(100) UNIQUE NOT NULL,
            patient_id VARCHAR(50) NOT NULL,
            sample_type VARCHAR(50) NOT NULL,
            collection_date DATE,
            run_id INTEGER REFERENCES sequencing_runs(run_id),
            assay_type VARCHAR(100),
            clinical_indication TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    print("✓ Created table: samples")
    
    # Table 3: QC Metrics
    cursor.execute("""
        CREATE TABLE qc_metrics (
            qc_id SERIAL PRIMARY KEY,
            sample_id INTEGER REFERENCES samples(sample_id),
            total_reads BIGINT NOT NULL,
            passed_filter_reads BIGINT NOT NULL,
            q30_bases_pct DECIMAL(5,2),
            mean_quality_score DECIMAL(5,2),
            gc_content_pct DECIMAL(5,2),
            duplicate_rate_pct DECIMAL(5,2),
            insert_size_mean DECIMAL(8,2),
            qc_status VARCHAR(20),
            qc_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    print("✓ Created table: qc_metrics")
    
    # Table 4: Coverage Metrics
    cursor.execute("""
        CREATE TABLE coverage_metrics (
            coverage_id SERIAL PRIMARY KEY,
            sample_id INTEGER REFERENCES samples(sample_id),
            mean_coverage DECIMAL(10,2) NOT NULL,
            median_coverage DECIMAL(10,2),
            coverage_10x_pct DECIMAL(5,2),
            coverage_20x_pct DECIMAL(5,2),
            coverage_30x_pct DECIMAL(5,2),
            coverage_100x_pct DECIMAL(5,2),
            uniformity_pct DECIMAL(5,2),
            on_target_rate_pct DECIMAL(5,2),
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    print("✓ Created table: coverage_metrics")
    
    # Table 5: Variant Calls
    cursor.execute("""
        CREATE TABLE variant_calls (
            variant_id SERIAL PRIMARY KEY,
            sample_id INTEGER REFERENCES samples(sample_id),
            total_variants INTEGER NOT NULL,
            snvs INTEGER,
            indels INTEGER,
            het_hom_ratio DECIMAL(5,2),
            ti_tv_ratio DECIMAL(5,2),
            pathogenic_variants INTEGER,
            vus_variants INTEGER,
            benign_variants INTEGER,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    print("✓ Created table: variant_calls")
    
    # Table 6: Contamination Checks
    cursor.execute("""
        CREATE TABLE contamination_checks (
            contamination_id SERIAL PRIMARY KEY,
            sample_id INTEGER REFERENCES samples(sample_id),
            contamination_estimate_pct DECIMAL(5,2) NOT NULL,
            contamination_status VARCHAR(20),
            method VARCHAR(50),
            num_snps_analyzed INTEGER,
            confidence_score DECIMAL(5,2),
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    print("✓ Created table: contamination_checks")
    
    # Table 7: Alignment Stats
    cursor.execute("""
        CREATE TABLE alignment_stats (
            alignment_id SERIAL PRIMARY KEY,
            sample_id INTEGER REFERENCES samples(sample_id),
            total_reads BIGINT NOT NULL,
            mapped_reads BIGINT NOT NULL,
            properly_paired_reads BIGINT,
            mapping_quality_mean DECIMAL(5,2),
            error_rate DECIMAL(6,4),
            mismatch_rate DECIMAL(6,4),
            alignment_rate_pct DECIMAL(5,2),
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    print("✓ Created table: alignment_stats")
    
    conn.commit()
    cursor.close()
    conn.close()
    print("\n✓ All tables created successfully!")

# Create the tables
create_tables()

## 5. Generate and Populate Faux Data

In [None]:
def generate_sequencing_runs(num_runs=10):
    """
    Generate faux sequencing run data.
    """
    platforms = ['NovaSeq 6000', 'NextSeq 2000', 'MiSeq']
    operators = ['Alice Chen', 'Bob Martinez', 'Carol Johnson', 'David Kim']
    
    runs = []
    start_date = datetime.now() - timedelta(days=90)
    
    for i in range(num_runs):
        run_date = start_date + timedelta(days=i*9)
        runs.append((
            f"RUN_{run_date.strftime('%Y%m%d')}_{i+1:03d}",
            f"SEQ_{np.random.choice(['NS', 'NV', 'MS'])}{int(np.random.randint(1000, 9999))}",
            str(np.random.choice(platforms)),
            run_date.date(),
            f"FC_{int(np.random.randint(100000, 999999))}",
            int(np.random.choice([150, 250, 300])),
            str(np.random.choice(operators))
        ))
    
    return runs

def generate_samples(num_samples=50):
    """
    Generate faux sample data.
    """
    sample_types = ['Blood', 'Saliva', 'Tissue', 'Buccal Swab']
    assay_types = ['Whole Exome Sequencing', 'Targeted Gene Panel', 'Whole Genome Sequencing', 'Cancer Panel']
    indications = ['Hereditary Cancer', 'Cardiovascular Disease', 'Rare Disease', 'Pharmacogenomics']
    
    samples = []
    start_date = datetime.now() - timedelta(days=90)
    
    for i in range(num_samples):
        collection_date = start_date + timedelta(days=int(np.random.randint(0, 85)))
        run_id = (i // 5) + 1  # 5 samples per run
        
        samples.append((
            f"SAMP_{i+1:05d}",
            f"PT_{i+1:04d}",
            str(np.random.choice(sample_types)),
            collection_date.date(),
            run_id,
            str(np.random.choice(assay_types)),
            str(np.random.choice(indications))
        ))
    
    return samples

def generate_qc_metrics(num_samples=50):
    """
    Generate faux QC metrics data.
    """
    metrics = []
    
    for sample_id in range(1, num_samples + 1):
        total_reads = int(np.random.randint(40_000_000, 150_000_000))
        passed_filter = int(total_reads * np.random.uniform(0.85, 0.98))
        
        # Introduce some failing samples
        if np.random.random() < 0.1:  # 10% fail rate
            q30_bases = np.random.uniform(70, 84)
            quality_score = np.random.uniform(28, 32)
            status = 'FAIL'
        else:
            q30_bases = np.random.uniform(85, 95)
            quality_score = np.random.uniform(33, 38)
            status = 'PASS'
        
        metrics.append((
            sample_id,
            total_reads,
            passed_filter,
            round(float(q30_bases), 2),
            round(float(quality_score), 2),
            round(float(np.random.uniform(40, 60)), 2),
            round(float(np.random.uniform(5, 25)), 2),
            round(float(np.random.uniform(150, 400)), 2),
            status
        ))
    
    return metrics

def generate_coverage_metrics(num_samples=50):
    """
    Generate faux coverage metrics.
    """
    metrics = []
    
    for sample_id in range(1, num_samples + 1):
        mean_cov = np.random.uniform(80, 250)
        median_cov = mean_cov * np.random.uniform(0.9, 1.1)
        
        metrics.append((
            sample_id,
            round(float(mean_cov), 2),
            round(float(median_cov), 2),
            round(float(np.random.uniform(95, 99.9)), 2),
            round(float(np.random.uniform(90, 99.5)), 2),
            round(float(np.random.uniform(85, 99)), 2),
            round(float(np.random.uniform(70, 95)), 2),
            round(float(np.random.uniform(75, 95)), 2),
            round(float(np.random.uniform(65, 90)), 2)
        ))
    
    return metrics

def generate_variant_calls(num_samples=50):
    """
    Generate faux variant calling data.
    """
    variants = []
    
    for sample_id in range(1, num_samples + 1):
        total = int(np.random.randint(20000, 80000))
        snvs = int(total * np.random.uniform(0.85, 0.92))
        indels = total - snvs
        
        pathogenic = int(np.random.randint(0, 15))
        vus = int(np.random.randint(10, 100))
        benign = total - pathogenic - vus
        
        variants.append((
            sample_id,
            total,
            snvs,
            indels,
            round(float(np.random.uniform(1.2, 2.5)), 2),
            round(float(np.random.uniform(2.0, 2.2)), 2),
            pathogenic,
            vus,
            benign
        ))
    
    return variants

def generate_contamination_checks(num_samples=50):
    """
    Generate faux contamination check data.
    """
    checks = []
    methods = ['VerifyBAMID', 'ContEst', 'FREEMIX']
    
    for sample_id in range(1, num_samples + 1):
        # Most samples have low contamination
        if np.random.random() < 0.9:
            contam = np.random.uniform(0, 2)
            status = 'PASS'
        else:
            contam = np.random.uniform(2, 10)
            status = 'FAIL'
        
        checks.append((
            sample_id,
            round(float(contam), 2),
            status,
            str(np.random.choice(methods)),
            int(np.random.randint(5000, 50000)),
            round(float(np.random.uniform(0.8, 1.0)), 2)
        ))
    
    return checks

def generate_alignment_stats(num_samples=50):
    """
    Generate faux alignment statistics.
    """
    stats = []
    
    for sample_id in range(1, num_samples + 1):
        total = int(np.random.randint(40_000_000, 150_000_000))
        mapped = int(total * np.random.uniform(0.92, 0.99))
        properly_paired = int(mapped * np.random.uniform(0.85, 0.95))
        
        stats.append((
            sample_id,
            total,
            mapped,
            properly_paired,
            round(float(np.random.uniform(50, 60)), 2),
            round(float(np.random.uniform(0.001, 0.01)), 4),
            round(float(np.random.uniform(0.002, 0.015)), 4),
            round(float((mapped / total) * 100), 2)
        ))
    
    return stats

print("Data generation functions defined.")
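
`generate_samples` assigns `run_id` as `(i // 5) + 1`, and `samples.run_id` is a foreign key into `sequencing_runs`. On freshly created tables, where `SERIAL` ids start at 1, 50 samples therefore need exactly 10 runs. Generating more samples without more runs would violate the foreign key. The sketch below makes that dependency explicit. Both helper names are hypothetical.

```python
def assign_run_ids(num_samples: int, samples_per_run: int = 5) -> list:
    """Mirror the `(i // 5) + 1` rule used by generate_samples."""
    return [(i // samples_per_run) + 1 for i in range(num_samples)]


def runs_required(num_samples: int, samples_per_run: int = 5) -> int:
    """Smallest number of sequencing runs that satisfies every run_id reference."""
    return max(assign_run_ids(num_samples, samples_per_run), default=0)


print(runs_required(50))  # 10
print(runs_required(51))  # 11
```

Checking `runs_required(num_samples) <= num_runs` before inserting would catch a mismatch before the database rejects the batch.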

In [None]:
def populate_database():
    """
    Populate all tables with generated data.
    """
    conn = get_db_connection()
    cursor = conn.cursor()
    
    # Generate data
    print("Generating data...")
    runs = generate_sequencing_runs(10)
    samples = generate_samples(50)
    qc_metrics = generate_qc_metrics(50)
    coverage = generate_coverage_metrics(50)
    variants = generate_variant_calls(50)
    contamination = generate_contamination_checks(50)
    alignment = generate_alignment_stats(50)
    
    # Insert sequencing runs
    print("Inserting sequencing runs...")
    execute_values(cursor, """
        INSERT INTO sequencing_runs (run_name, sequencer_id, platform, run_date, 
                                     flow_cell_id, num_cycles, operator)
        VALUES %s
    """, runs)
    
    # Insert samples
    print("Inserting samples...")
    execute_values(cursor, """
        INSERT INTO samples (sample_name, patient_id, sample_type, collection_date,
                            run_id, assay_type, clinical_indication)
        VALUES %s
    """, samples)
    
    # Insert QC metrics
    print("Inserting QC metrics...")
    execute_values(cursor, """
        INSERT INTO qc_metrics (sample_id, total_reads, passed_filter_reads, q30_bases_pct,
                               mean_quality_score, gc_content_pct, duplicate_rate_pct,
                               insert_size_mean, qc_status)
        VALUES %s
    """, qc_metrics)
    
    # Insert coverage metrics
    print("Inserting coverage metrics...")
    execute_values(cursor, """
        INSERT INTO coverage_metrics (sample_id, mean_coverage, median_coverage,
                                     coverage_10x_pct, coverage_20x_pct, coverage_30x_pct,
                                     coverage_100x_pct, uniformity_pct, on_target_rate_pct)
        VALUES %s
    """, coverage)
    
    # Insert variant calls
    print("Inserting variant calls...")
    execute_values(cursor, """
        INSERT INTO variant_calls (sample_id, total_variants, snvs, indels,
                                  het_hom_ratio, ti_tv_ratio, pathogenic_variants,
                                  vus_variants, benign_variants)
        VALUES %s
    """, variants)
    
    # Insert contamination checks
    print("Inserting contamination checks...")
    execute_values(cursor, """
        INSERT INTO contamination_checks (sample_id, contamination_estimate_pct,
                                         contamination_status, method, num_snps_analyzed,
                                         confidence_score)
        VALUES %s
    """, contamination)
    
    # Insert alignment stats
    print("Inserting alignment statistics...")
    execute_values(cursor, """
        INSERT INTO alignment_stats (sample_id, total_reads, mapped_reads,
                                    properly_paired_reads, mapping_quality_mean,
                                    error_rate, mismatch_rate, alignment_rate_pct)
        VALUES %s
    """, alignment)
    
    conn.commit()
    cursor.close()
    conn.close()
    
    print("\n✓ Database populated successfully!")
    print(f"  - {len(runs)} sequencing runs")
    print(f"  - {len(samples)} samples")
    print(f"  - {len(qc_metrics)} QC metric records")
    print(f"  - {len(coverage)} coverage metric records")
    print(f"  - {len(variants)} variant call records")
    print(f"  - {len(contamination)} contamination check records")
    print(f"  - {len(alignment)} alignment stat records")

# Populate the database
populate_database()

## 6. Data Visualizations

**Note**: The visualization cells mirror those in the Aurora Serverless version of this notebook.

### Visualization 1: QC Status Distribution

In [None]:
# Query QC status data
conn = get_db_connection()
qc_status_df = pd.read_sql_query("""
    SELECT qc_status, COUNT(*) as count
    FROM qc_metrics
    GROUP BY qc_status
""", conn)

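# Create pie chart; map colors by status because GROUP BY does not guarantee row order
plt.figure(figsize=(10, 6))
status_colors = {'PASS': '#2ecc71', 'FAIL': '#e74c3c'}
colors = [status_colors[status] for status in qc_status_df['qc_status']]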
plt.pie(qc_status_df['count'], labels=qc_status_df['qc_status'], autopct='%1.1f%%',
        colors=colors, startangle=90)
plt.title('Sample QC Status Distribution', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

conn.close()

### Visualization 2: Mean Coverage Distribution

In [None]:
# Query mean coverage per sample
conn = get_db_connection()
coverage_df = pd.read_sql_query("""
    SELECT mean_coverage
    FROM coverage_metrics
""", conn)

# DECIMAL columns may arrive as Decimal objects, so convert before plotting
mean_coverage = coverage_df['mean_coverage'].astype(float)

# Create histogram
plt.figure(figsize=(12, 6))
plt.hist(mean_coverage, bins=15, color='#3498db', edgecolor='black', alpha=0.8)
plt.axvline(mean_coverage.mean(), color='red', linestyle='--', linewidth=2,
            label='Mean across samples')

plt.xlabel('Mean Coverage (x)', fontsize=12)
plt.ylabel('Number of Samples', fontsize=12)
plt.title('Mean Coverage Distribution', fontsize=16, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

conn.close()

### Visualization 3: Q30 Score vs Quality Score

In [None]:
# Query QC metrics
conn = get_db_connection()
qc_df = pd.read_sql_query("""
    SELECT q30_bases_pct, mean_quality_score, qc_status
    FROM qc_metrics
""", conn)

# Create scatter plot
plt.figure(figsize=(12, 7))
colors = {'PASS': '#2ecc71', 'FAIL': '#e74c3c'}
for status in qc_df['qc_status'].unique():
    subset = qc_df[qc_df['qc_status'] == status]
    plt.scatter(subset['q30_bases_pct'], subset['mean_quality_score'], 
               c=colors[status], label=status, s=100, alpha=0.6, edgecolors='black')

plt.xlabel('Q30 Bases (%)', fontsize=12)
plt.ylabel('Mean Quality Score', fontsize=12)
plt.title('Quality Metrics: Q30 Bases vs Mean Quality Score', fontsize=16, fontweight='bold')
plt.legend(title='QC Status', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

conn.close()

### Visualization 4: Variant Classification Summary

In [None]:
# Query variant data
conn = get_db_connection()
variant_df = pd.read_sql_query("""
    SELECT 
        SUM(pathogenic_variants) as pathogenic,
        SUM(vus_variants) as vus,
        SUM(benign_variants) as benign
    FROM variant_calls
""", conn)

# Prepare data for bar chart
categories = ['Pathogenic', 'VUS', 'Benign']
values = [variant_df['pathogenic'][0], variant_df['vus'][0], variant_df['benign'][0]]
colors_bar = ['#e74c3c', '#f39c12', '#2ecc71']

# Create bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, color=colors_bar, edgecolor='black', alpha=0.8)
plt.ylabel('Total Number of Variants', fontsize=12)
plt.title('Variant Classification Summary Across All Samples', fontsize=16, fontweight='bold')
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height):,}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

conn.close()

### Visualization 5: Contamination Status

In [None]:
# Query contamination data
conn = get_db_connection()
contam_df = pd.read_sql_query("""
    SELECT contamination_estimate_pct, contamination_status
    FROM contamination_checks
    ORDER BY contamination_estimate_pct
""", conn)

# Create box plot
plt.figure(figsize=(12, 6))
box_colors = {'PASS': '#2ecc71', 'FAIL': '#e74c3c'}
bp = plt.boxplot([contam_df[contam_df['contamination_status'] == 'PASS']['contamination_estimate_pct'],
                   contam_df[contam_df['contamination_status'] == 'FAIL']['contamination_estimate_pct']],
                  labels=['PASS', 'FAIL'],
                  patch_artist=True,
                  widths=0.6)

for patch, status in zip(bp['boxes'], ['PASS', 'FAIL']):
    patch.set_facecolor(box_colors[status])
    patch.set_alpha(0.7)

plt.axhline(y=2.0, color='red', linestyle='--', linewidth=2, label='Threshold (2%)')
plt.ylabel('Contamination Estimate (%)', fontsize=12)
plt.xlabel('Contamination Status', fontsize=12)
plt.title('Sample Contamination Estimates by Status', fontsize=16, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

conn.close()

### Visualization 6: Coverage Uniformity Across Samples

In [None]:
# Query coverage uniformity data
conn = get_db_connection()
uniformity_df = pd.read_sql_query("""
    SELECT 
        s.sample_name,
        c.coverage_10x_pct,
        c.coverage_20x_pct,
        c.coverage_30x_pct,
        c.coverage_100x_pct
    FROM coverage_metrics c
    JOIN samples s ON c.sample_id = s.sample_id
    ORDER BY s.sample_id
    LIMIT 20
""", conn)

# Create line plot
plt.figure(figsize=(14, 7))
x = range(len(uniformity_df))
plt.plot(x, uniformity_df['coverage_10x_pct'], marker='o', label='10x Coverage', linewidth=2)
plt.plot(x, uniformity_df['coverage_20x_pct'], marker='s', label='20x Coverage', linewidth=2)
plt.plot(x, uniformity_df['coverage_30x_pct'], marker='^', label='30x Coverage', linewidth=2)
plt.plot(x, uniformity_df['coverage_100x_pct'], marker='d', label='100x Coverage', linewidth=2)

plt.xlabel('Sample Index', fontsize=12)
plt.ylabel('Percentage of Bases Covered (%)', fontsize=12)
plt.title('Coverage Uniformity Across Different Depth Thresholds (First 20 Samples)', 
          fontsize=16, fontweight='bold')
plt.legend(loc='lower left', fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim(60, 100)
plt.tight_layout()
plt.show()

conn.close()

### Visualization 7: Alignment Rate by Assay Type

In [None]:
# Query alignment data by assay type
conn = get_db_connection()
alignment_df = pd.read_sql_query("""
    SELECT 
        s.assay_type,
        AVG(a.alignment_rate_pct) as avg_alignment_rate,
        STDDEV(a.alignment_rate_pct) as std_alignment_rate,
        COUNT(*) as num_samples
    FROM alignment_stats a
    JOIN samples s ON a.sample_id = s.sample_id
    GROUP BY s.assay_type
    ORDER BY avg_alignment_rate DESC
""", conn)

# Create bar chart with error bars
plt.figure(figsize=(12, 7))
x_pos = range(len(alignment_df))
bars = plt.bar(x_pos, alignment_df['avg_alignment_rate'], 
               yerr=alignment_df['std_alignment_rate'],
               color='#9b59b6', alpha=0.8, edgecolor='black', capsize=5)

plt.xticks(x_pos, alignment_df['assay_type'], rotation=15, ha='right')
plt.ylabel('Average Alignment Rate (%)', fontsize=12)
plt.xlabel('Assay Type', fontsize=12)
plt.title('Average Alignment Rate by Assay Type', fontsize=16, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.ylim(90, 100)

# Add sample count labels
for i, (bar, count) in enumerate(zip(bars, alignment_df['num_samples'])):
    plt.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.3,
            f'n={count}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

conn.close()

print("\n✓ All visualizations generated successfully!")
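
The error bars in Visualization 7 come from PostgreSQL's `STDDEV`, which is an alias for the sample standard deviation (`STDDEV_SAMP`) and returns NULL for a group with a single row. The sketch below reproduces the same aggregation in pandas on a small, hypothetical set of rows (the assay names and values are made up for illustration and are not the notebook's generated data), so the averages and error-bar sizes can be sanity-checked without a database connection.

```python
import pandas as pd

# Hypothetical rows standing in for alignment_stats joined to samples.
rows = pd.DataFrame({
    "assay_type": ["WES", "WES", "WES", "Panel", "Panel", "Amplicon"],
    "alignment_rate_pct": [98.0, 99.0, 97.0, 96.0, 98.0, 95.0],
})

# Same shape as the SQL query: mean, sample standard deviation, and count
# per assay type, ordered by the mean in descending order.
summary = (
    rows.groupby("assay_type")["alignment_rate_pct"]
    .agg(avg_alignment_rate="mean", std_alignment_rate="std", num_samples="count")
    .sort_values("avg_alignment_rate", ascending=False)
    .reset_index()
)

print(summary)
```

pandas' `std` also defaults to the sample standard deviation (`ddof=1`), so a single-sample group yields `NaN`, mirroring the NULL from PostgreSQL; matplotlib then draws that bar without an error bar.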

## 7. Database Cleanup and Resource Removal

In [None]:
def cleanup_rds_free_tier_instance():
    """
    Delete the RDS instance and its automated backups.
    Optimized for FAST deletion - skips all snapshots and backups.
    WARNING: This will permanently delete all data with NO RECOVERY option!
    """
    
    print("\n" + "="*70)
    print("🗑️  RDS FREE TIER CLEANUP")
    print("="*70)
    print("WARNING: Fast deletion mode - NO SNAPSHOTS will be created!")
    print("This will permanently delete the RDS instance and ALL data!")
    print("="*70 + "\n")
    
    try:
        # Delete DB instance (FAST mode - no snapshots)
        print(f"Deleting RDS instance: {DB_INSTANCE_IDENTIFIER} (fast mode)...")
        
        try:
            rds_client.delete_db_instance(
                DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER,
                SkipFinalSnapshot=True,           # NO final snapshot - FAST
                DeleteAutomatedBackups=True       # Delete all backups
            )
            print("  ✓ Instance deletion initiated (no snapshot)")
            
            # Wait for instance to be deleted
            print("  Waiting for instance deletion (2-5 minutes)...")
            waiter = rds_client.get_waiter('db_instance_deleted')
            waiter.wait(
                DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER,
                WaiterConfig={'Delay': 20, 'MaxAttempts': 40}
            )
            print("  ✓ Instance deleted successfully")
            
        except rds_client.exceptions.DBInstanceNotFoundFault:
            print("  ✓ Instance not found (already deleted)")
        # Any other error propagates to the outer handler, which prints the
        # manual cleanup steps instead of a misleading success message.
        
        print("\n" + "="*70)
        print("✓ All RDS resources cleaned up successfully!")
        print("✓ No snapshots created - deletion was FAST")
        print("✓ FREE TIER hours are now available for other resources")
        print("="*70)
        print("\nIMPORTANT: Verify cleanup in AWS Console:")
        print("  1. Go to: RDS > Databases")
        print(f"  2. Confirm no instance named '{DB_INSTANCE_IDENTIFIER}'")
        print("  3. Check: RDS > Snapshots (should be none for this instance)")
        
    except Exception as e:
        print(f"\n❌ Error during cleanup: {str(e)}")
        print("\nManual cleanup steps:")
        print("  1. AWS Console > RDS > Databases")
        print(f"  2. Select instance: {DB_INSTANCE_IDENTIFIER}")
        print("  3. Actions > Delete")
        print("  4. UNCHECK 'Create final snapshot'")
        print("  5. CHECK 'Delete automated backups'")
        print("  6. Type instance name and confirm")
        raise

print("Fast cleanup function defined.")
print("\n" + "="*70)
print("To clean up all resources (FAST mode - NO SNAPSHOTS):")
print("  >>> cleanup_rds_free_tier_instance()")
print("\nWARNING: This will permanently delete all data with NO recovery option!")
print("="*70)

In [None]:
# UNCOMMENT THE LINE BELOW TO RUN CLEANUP
# cleanup_rds_free_tier_instance()
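
The cleanup function asks for a manual check in the AWS Console afterwards. As a rough complement, the same check can be scripted. The sketch below is not part of the original notebook: `verify_rds_cleanup` is a hypothetical helper name, and it assumes it is passed a boto3 RDS client such as the notebook's `rds_client`. It relies only on the `describe_db_instances` and `describe_db_snapshots` calls, and for simplicity it ignores pagination of snapshot results.

```python
def verify_rds_cleanup(rds_client, instance_id):
    """Return (instance_gone, leftover_snapshot_ids) for one instance.

    rds_client is assumed to be a boto3 RDS client; instance_id is the
    DB instance identifier that was deleted.
    """
    try:
        rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)
        instance_gone = False  # the instance still exists (or is still deleting)
    except rds_client.exceptions.DBInstanceNotFoundFault:
        instance_gone = True

    # List any snapshots still associated with the instance identifier.
    response = rds_client.describe_db_snapshots(DBInstanceIdentifier=instance_id)
    leftover = [snap["DBSnapshotIdentifier"] for snap in response["DBSnapshots"]]
    return instance_gone, leftover
```

In this notebook it would be called as `verify_rds_cleanup(rds_client, DB_INSTANCE_IDENTIFIER)` after the cleanup finishes; a result of `(True, [])` suggests the instance and its snapshots are gone, though the Console remains the authoritative view for billing.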

## Summary

This notebook demonstrated:

1. **Database Setup**: Created an RDS PostgreSQL instance using FREE TIER (db.t3.micro)
2. **Schema Design**: Designed 7 tables for clinical sequencing QC operations:
   - `sequencing_runs` - Run metadata
   - `samples` - Patient samples
   - `qc_metrics` - Quality control metrics
   - `coverage_metrics` - Coverage statistics
   - `variant_calls` - Variant calling results
   - `contamination_checks` - Contamination detection
   - `alignment_stats` - Alignment quality

3. **Data Generation**: Created realistic faux QC metrics for 50 samples across 10 sequencing runs

4. **Visualizations**: Generated 7 informative plots (same as Aurora version)

5. **Resource Cleanup**: Provided a function that deletes the RDS instance and its automated backups without creating a final snapshot

## FREE TIER Benefits

### Cost Comparison:
| Version | Instance Type | Monthly Cost | Free Tier? |
|---------|---------------|--------------|------------|
| **This Version** | db.t3.micro | **$0.00** | ✅ Yes (12 months) |
| Aurora Serverless v2 | 0.5-1.0 ACU | $2-20 | ❌ No |

### Free Tier Limits:
- ✅ 750 hours/month (enough for 24/7 operation of one instance)
- ✅ 20 GB storage
- ✅ 20 GB backup storage
- ✅ Valid for 12 months from AWS account creation

### Important Notes:
- This is a FREE TIER setup - $0 cost if within free tier limits
- Same PostgreSQL functionality as Aurora Serverless
- The database is publicly accessible (in production, disable public access and place it in private subnets)
- Passwords are hardcoded (use AWS Secrets Manager in production)
- Remember to clean up to free resources for other exercises!
- Always verify cleanup in AWS Console to ensure no charges

## After Free Tier Expires

After 12 months, db.t3.micro costs:
- **$0.017/hour** = ~$12.41/month if running 24/7
- For learning exercises (4 hours): ~$0.07 (7 cents)
- Still much cheaper than Aurora Serverless v2!
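
The figures above follow from simple arithmetic, sketched here for reference. The hourly rate is the one quoted in this section; the 730 hours per month is an assumption (8,760 hours per year divided by 12), and actual rates may differ by region and over time.

```python
HOURLY_RATE_USD = 0.017   # db.t3.micro rate quoted in this section
HOURS_PER_MONTH = 730     # assumption: 8,760 hours per year / 12
SESSION_HOURS = 4         # length of a typical learning exercise

monthly_cost = HOURLY_RATE_USD * HOURS_PER_MONTH
session_cost = HOURLY_RATE_USD * SESSION_HOURS

print(f"Running 24/7 for a month: ${monthly_cost:.2f}")
print(f"One {SESSION_HOURS}-hour session: ${session_cost:.2f}")
```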