# 🗄️ **AWS Database Investigation & Districts Table Creation**

## 🎯 **Learning Objectives**
By the end of this notebook, students will understand:
1. **Database Connection**: How to connect to AWS RDS PostgreSQL
2. **Schema Investigation**: How to explore existing database structure
3. **PostGIS Setup**: Understanding spatial database extensions
4. **GeoJSON Import**: How to create PostGIS tables from GeoJSON files
5. **Data Validation**: How to verify successful data import

## 📋 **Prerequisites**
- ✅ Enhanced GeoJSON file (`districts_enhanced.geojson`)
- ✅ AWS database credentials
- ✅ Basic understanding of PostgreSQL and spatial data

---

## 📦 **Step 1: Import Required Libraries**

**Learning Point**: We need specific libraries for spatial data and database operations.

In [53]:
# 📦 Import required libraries for spatial database operations
import pandas as pd
import geopandas as gpd
from sqlalchemy import create_engine, text
import os
import traceback
from dotenv import load_dotenv

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print("📚 Ready for spatial database operations")

✅ Libraries imported successfully!
📚 Ready for spatial database operations


## 🔌 **Step 2: AWS Database Connection**

**Learning Point**: Connection strings contain all necessary information to connect to a database.

**Format**: `postgresql+psycopg2://username:password@host:port/database`
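
One detail the format line hides: characters such as `@`, `:` or `/` inside a password must be percent-encoded, otherwise the URL is split in the wrong place. The following is a minimal sketch using only the standard library; `build_database_url` and the sample credentials are hypothetical and not part of this notebook.

```python
from urllib.parse import quote_plus


def build_database_url(user, password, host, port, database,
                       driver="postgresql+psycopg2"):
    """Assemble a SQLAlchemy-style URL, percent-encoding the credentials."""
    # quote_plus escapes characters that would otherwise be read as URL syntax
    safe_user = quote_plus(user)
    safe_password = quote_plus(password)
    return f"{driver}://{safe_user}:{safe_password}@{host}:{port}/{database}"


# Hypothetical values for illustration only
url = build_database_url("postgres", "p@ss:word/1", "db.example.com", 5432, "berlin_project_db")
print(url)
# postgresql+psycopg2://postgres:p%40ss%3Aword%2F1@db.example.com:5432/berlin_project_db
```

If the password only contains letters and digits, the encoded URL is identical to the plain f-string version.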

In [54]:
# 🔐 Clean and professional database connection using python-dotenv
print("🔌 **CONNECTING TO AWS DATABASE**")
print("=" * 40)

# Load environment variables from .env file (no string functions needed!)
load_dotenv('../ignored_files/.env')
PASSWORD = os.getenv('PASSWORD')

# Build connection URL with secure password from .env
DATABASE_URL = f'postgresql+psycopg2://postgres:{PASSWORD}@layered-data-warehouse.cdg2ok68acsn.eu-central-1.rds.amazonaws.com:5432/berlin_project_db'

try:
    print("1️⃣ Creating database engine...")
    engine = create_engine(DATABASE_URL, connect_args={'connect_timeout': 10})
    
    print("2️⃣ Testing connection...")
    conn = engine.connect()
    
    # Test query
    test_result = conn.execute(text("SELECT current_database(), current_user, version()"))
    db_info = test_result.fetchone()
    
    print(f"   ✅ Connected successfully!")
    print(f"   🗄️  Database: {db_info[0]}")
    print(f"   👤 User: {db_info[1]}")
    print(f"   📊 PostgreSQL Version: {db_info[2][:50]}...")
    
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("💡 Check network connection and credentials")

🔌 **CONNECTING TO AWS DATABASE**
1️⃣ Creating database engine...
2️⃣ Testing connection...
   ✅ Connected successfully!
   🗄️  Database: berlin_project_db
   👤 User: postgres
   📊 PostgreSQL Version: PostgreSQL 17.4 on aarch64-unknown-linux-gnu, comp...


## 🔍 **Step 3: Database Schema Investigation**

**Learning Point**: Before creating new tables, always investigate existing database structure to avoid conflicts.

In [55]:
# 🔍 Investigate existing database structure
print("🔍 **DATABASE SCHEMA INVESTIGATION**")
print("=" * 40)

try:
    # Check available schemas
    print("1️⃣ Available schemas:")
    schemas_result = conn.execute(text("""
        SELECT schema_name 
        FROM information_schema.schemata 
        WHERE schema_name NOT IN ('information_schema', 'pg_catalog', 'pg_toast')
        ORDER BY schema_name
    """))
    schemas = [row[0] for row in schemas_result.fetchall()]
    
    for schema in schemas:
        print(f"   📁 {schema}")
    
    # Check if berlin_data schema exists
    berlin_data_exists = 'berlin_data' in schemas
    print(f"\n2️⃣ Target schema 'berlin_data' exists: {'✅ YES' if berlin_data_exists else '❌ NO'}")
    
    if berlin_data_exists:
        # Set search path to berlin_data
        conn.execute(text("SET search_path = berlin_data, public;"))
        conn.commit()
        print("   📋 Search path set to: berlin_data, public")
        
        # Check existing tables in berlin_data
        print("\n3️⃣ Existing tables in berlin_data schema:")
        tables_result = conn.execute(text("""
            SELECT table_name, table_type 
            FROM information_schema.tables 
            WHERE table_schema = 'berlin_data'
            ORDER BY table_name
        """))
        tables = tables_result.fetchall()
        
        for table in tables:
            print(f"   🗄️  {table.table_name} ({table.table_type})")
        
        print(f"\n   📊 Total tables found: {len(tables)}")
    
    print("\n✅ Schema investigation complete!")
    
except Exception as e:
    print(f"❌ Schema investigation failed: {e}")

🔍 **DATABASE SCHEMA INVESTIGATION**
1️⃣ Available schemas:
   📁 berlin_data
   📁 public

2️⃣ Target schema 'berlin_data' exists: ✅ YES
   📋 Search path set to: berlin_data, public

3️⃣ Existing tables in berlin_data schema:
   🗄️  districts (BASE TABLE)
   🗄️  districts_pop_stat (BASE TABLE)
   🗄️  geography_columns (VIEW)
   🗄️  geometry_columns (VIEW)
   🗄️  green_spaces (BASE TABLE)
   🗄️  hospitals (BASE TABLE)
   🗄️  neighborhoods (BASE TABLE)
   🗄️  regional_statistics (BASE TABLE)
   🗄️  schools_kai (BASE TABLE)
   🗄️  short_time_listings (BASE TABLE)
   🗄️  spatial_ref_sys (BASE TABLE)
   🗄️  ubahn (BASE TABLE)

   📊 Total tables found: 12

✅ Schema investigation complete!
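
The investigation output can be turned into a quick guard against name clashes. This is a minimal sketch: `find_name_conflicts` is a hypothetical helper, the table list is copied from the output above, and the comparison is case-insensitive on the assumption that table names are unquoted (PostgreSQL folds unquoted identifiers to lowercase).

```python
def find_name_conflicts(existing_tables, planned_tables):
    """Return planned table names that already exist (case-insensitive)."""
    existing = {name.lower() for name in existing_tables}
    return sorted(name for name in planned_tables if name.lower() in existing)


# Base tables copied from the investigation output above
existing = ["districts", "districts_pop_stat", "green_spaces", "hospitals",
            "neighborhoods", "regional_statistics", "schools_kai",
            "short_time_listings", "spatial_ref_sys", "ubahn"]

print(find_name_conflicts(existing, ["districts", "districts_enhanced"]))
# ['districts']
```

The name `districts` is already taken, while `districts_enhanced` is free.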


## 🗺️ **Step 4: PostGIS Extension Check**

**Learning Point**: PostGIS is essential for spatial data operations in PostgreSQL.

In [46]:
# 🗺️ Check PostGIS extension status
print("🗺️ **POSTGIS EXTENSION CHECK**")
print("=" * 40)

try:
    # Check if PostGIS is installed
    postgis_check = conn.execute(text("""
        SELECT 
            extname as extension_name,
            extversion as version,
            nspname as schema
        FROM pg_extension e
        JOIN pg_namespace n ON e.extnamespace = n.oid
        WHERE extname = 'postgis'
    """))
    
    postgis_info = postgis_check.fetchone()
    
    if postgis_info:
        print(f"✅ PostGIS is installed!")
        print(f"   📦 Extension: {postgis_info.extension_name}")
        print(f"   🔢 Version: {postgis_info.version}")
        print(f"   📋 Schema: {postgis_info.schema}")
    else:
        print("❌ PostGIS not found")
        print("💡 PostGIS extension may need to be enabled")
    
    print("\n✅ PostGIS check complete!")
    
except Exception as e:
    print(f"❌ PostGIS check failed: {e}")

🗺️ **POSTGIS EXTENSION CHECK**
✅ PostGIS is installed!
   📦 Extension: postgis
   🔢 Version: 3.5.1
   📋 Schema: berlin_data

✅ PostGIS check complete!


## 📂 **Step 5: Load Enhanced Districts GeoJSON**

**Learning Point**: GeoJSON is a standard format for geographic data that can be easily imported into PostGIS.
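
For orientation, a GeoJSON file is plain JSON: a `FeatureCollection` whose features each carry `properties` and a `geometry`. The sketch below inspects that structure with the standard library only. The inline sample and the helper `summarize_geojson` are made up for illustration, and GeoPandas does considerably more than this when it reads a file.

```python
import json

# Tiny hand-written sample; real district geometries are far larger
SAMPLE = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"district_id": "01", "district": "Example"},
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [[[[13.0, 52.0], [13.1, 52.0], [13.1, 52.1], [13.0, 52.0]]]]
      }
    }
  ]
}
"""


def summarize_geojson(text):
    """Return (feature count, sorted property names, sorted geometry types)."""
    data = json.loads(text)
    features = data["features"]
    props = sorted({key for f in features for key in f["properties"]})
    geom_types = sorted({f["geometry"]["type"] for f in features})
    return len(features), props, geom_types


print(summarize_geojson(SAMPLE))
# (1, ['district', 'district_id'], ['MultiPolygon'])
```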

In [47]:
# 📂 Load the enhanced districts GeoJSON file
print("📂 **LOADING ENHANCED DISTRICTS GEOJSON**")
print("=" * 40)

# Path to the enhanced GeoJSON file
geojson_path = "/Users/zeal.v/Desktop/Webeet-Internship/districts-neighborhoods-populating-db/layered-populate-data-pool-da/districts-neighborhoods-populating-db/sources/districts_enhanced.geojson"

try:
    print("1️⃣ Checking file existence...")
    if os.path.exists(geojson_path):
        print(f"   ✅ File found: {os.path.basename(geojson_path)}")
        
        print("2️⃣ Loading GeoJSON with GeoPandas...")
        districts_gdf = gpd.read_file(geojson_path)
        
        print(f"   ✅ Loaded {len(districts_gdf)} districts")
        print(f"   📊 Columns: {list(districts_gdf.columns)}")
        print(f"   🌍 Coordinate Reference System: {districts_gdf.crs}")
        print(f"   📏 Geometry types: {districts_gdf.geometry.geom_type.unique()}")
        
        print("\n3️⃣ Sample data preview:")
        sample_data = districts_gdf[['district_id', 'district']].head(3)
        for idx, row in sample_data.iterrows():
            print(f"   🏢 {row['district_id']}: {row['district']}")
        
        # Ensure correct CRS (EPSG:4326 for WGS84)
        if districts_gdf.crs != 'EPSG:4326':
            print(f"\n4️⃣ Converting CRS to EPSG:4326...")
            districts_gdf = districts_gdf.to_crs('EPSG:4326')
            print("   ✅ CRS converted to EPSG:4326")
        else:
            print("\n4️⃣ CRS verification: ✅ Already EPSG:4326")
        
        print("\n✅ GeoJSON loaded and ready for database import!")
        
    else:
        print(f"   ❌ File not found: {geojson_path}")
        print("   💡 Make sure to run the main notebook first to create enhanced files")
        
except Exception as e:
    print(f"❌ Error loading GeoJSON: {e}")
    print(f"🔍 Details: {traceback.format_exc()}")

📂 **LOADING ENHANCED DISTRICTS GEOJSON**
1️⃣ Checking file existence...
   ✅ File found: districts_enhanced.geojson
2️⃣ Loading GeoJSON with GeoPandas...
   ✅ Loaded 12 districts
   📊 Columns: ['district_id', 'district', 'geometry']
   🌍 Coordinate Reference System: EPSG:4326
   📏 Geometry types: ['MultiPolygon']

3️⃣ Sample data preview:
   🏢 12: Reinickendorf
   🏢 04: Charlottenburg-Wilmersdorf
   🏢 09: Treptow-Köpenick

4️⃣ CRS verification: ✅ Already EPSG:4326

✅ GeoJSON loaded and ready for database import!


## 🔧 **Step 6: Enable PostGIS for Table Creation**

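**Learning Point**: The PostGIS extension must be enabled before spatial tables can be created.

**Approach**: Check the extension and its types, run `CREATE EXTENSION IF NOT EXISTS postgis`, and set the search path. In this database the extension lives in the `berlin_data` schema, so that schema has to be on the search path for unqualified names such as `geometry` and `ST_GeomFromText` to resolve.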

In [48]:
# 🔧 COMPREHENSIVE PostGIS Setup and Diagnostics
print("🔧 **COMPREHENSIVE POSTGIS SETUP & DIAGNOSTICS**")
print("=" * 55)

try:
    # Step 1: Check current PostGIS status
    print("1️⃣ Checking PostGIS installation status...")
    
    # Check if PostGIS extension exists
    ext_check = conn.execute(text("""
        SELECT extname, extversion, nspname as schema_name
        FROM pg_extension e
        JOIN pg_namespace n ON e.extnamespace = n.oid
        WHERE extname = 'postgis'
    """))
    
    postgis_ext = ext_check.fetchone()
    if postgis_ext:
        print(f"   ✅ PostGIS extension found: v{postgis_ext.extversion} in {postgis_ext.schema_name} schema")
    else:
        print("   ❌ PostGIS extension not found")
    
    # Step 2: Check available PostGIS types
    print("\n2️⃣ Checking available PostGIS types...")
    
    types_check = conn.execute(text("""
        SELECT typname, nspname 
        FROM pg_type t 
        JOIN pg_namespace n ON t.typnamespace = n.oid 
        WHERE typname IN ('geometry', 'geography')
        ORDER BY typname, nspname
    """))
    
    types_found = types_check.fetchall()
    if types_found:
        print("   📋 Available spatial types:")
        for type_info in types_found:
            print(f"      • {type_info.typname} in {type_info.nspname} schema")
    else:
        print("   ❌ No spatial types found")
    
    # Step 3: Enable PostGIS properly
    print("\n3️⃣ Ensuring PostGIS is properly enabled...")
    
    # Try to enable PostGIS extension
    conn.execute(text("CREATE EXTENSION IF NOT EXISTS postgis;"))
    conn.commit()
    print("   ✅ PostGIS extension command executed")
    
    # Check if it worked by testing PostGIS function
    print("\n4️⃣ Testing PostGIS functionality...")
    try:
        test_result = conn.execute(text("SELECT PostGIS_Version();"))
        version = test_result.fetchone()[0]
        print(f"   ✅ PostGIS is working! Version: {version}")
        
        # Test geometry creation
        geom_test = conn.execute(text("""
            SELECT ST_GeomFromText('POINT(13.4050 52.5200)', 4326) IS NOT NULL as geom_works
        """))
        geom_works = geom_test.fetchone()[0]
        print(f"   ✅ Geometry functions work: {geom_works}")
        
    except Exception as test_error:
        print(f"   ❌ PostGIS function test failed: {test_error}")
        # A failed statement aborts the transaction; roll back so later statements can run
        conn.rollback()
    
    # Step 5: Set search path
    print("\n5️⃣ Setting search path...")
    conn.execute(text("SET search_path = berlin_data, public;"))
    conn.commit()
    print("   ✅ Search path: berlin_data, public")
    
    # Step 6: Final type check
    print("\n6️⃣ Final spatial type availability check...")
    final_check = conn.execute(text("""
        SELECT typname 
        FROM pg_type 
        WHERE typname IN ('geometry', 'geography')
    """))
    
    final_types = [row[0] for row in final_check.fetchall()]
    print(f"   📋 Available types now: {final_types}")
    
    if 'geography' in final_types:
        print("\n🎉 **PostGIS setup complete! Geography type is available.**")
    else:
        print("\n⚠️ **Geography type still not available. Will use alternative approach.**")
    
except Exception as e:
    print(f"❌ PostGIS setup failed: {e}")
    print(f"🔍 Details: {traceback.format_exc()}")

🔧 **COMPREHENSIVE POSTGIS SETUP & DIAGNOSTICS**
1️⃣ Checking PostGIS installation status...
   ✅ PostGIS extension found: v3.5.1 in berlin_data schema

2️⃣ Checking available PostGIS types...
   📋 Available spatial types:
      • geography in berlin_data schema
      • geometry in berlin_data schema

3️⃣ Ensuring PostGIS is properly enabled...
   ✅ PostGIS extension command executed

4️⃣ Testing PostGIS functionality...
   ✅ PostGIS is working! Version: 3.5 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
   ✅ Geometry functions work: True

5️⃣ Setting search path...
   ✅ Search path: berlin_data, public

6️⃣ Final spatial type availability check...
   📋 Available types now: ['geography', 'geometry']

🎉 **PostGIS setup complete! Geography type is available.**


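## 🚀 **Step 7: Create Districts Table (Direct SQL Method)**

**Learning Point**: The sub-steps below (7A to 7F) create and fill the table with direct SQL: a `geometry(MULTIPOLYGON, 4326)` column populated through `ST_GeomFromText` from WKT strings.

**Background**: The AWS_grocery project uses a related approach, `geoalchemy2.Geography` with `SRID=4326;WKT` strings. The cells that follow do not use it.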

In [49]:
# 🔧 **STEP 7.5: CLEARING TRANSACTION ERROR** (only needed if an earlier cell failed)
# =======================================
print("🔧 **STEP 7.5: CLEARING TRANSACTION ERROR**")
print("=" * 45)

try:
    # Rollback any failed transaction
    conn.rollback()
    print("✅ Transaction rolled back successfully!")
except Exception as e:
    print(f"ℹ️  Rollback note: {e}")

print("🚀 Ready to proceed!")

🔧 **STEP 7.5: CLEARING TRANSACTION ERROR**
✅ Transaction rolled back successfully!
🚀 Ready to proceed!


## 🔧 **Step 7A: Connection Check & Transaction Reset**

**Learning Point**: Before creating tables, always verify your connection is working and clear any pending transactions.

**Why this matters**: 
- After a failed statement, PostgreSQL rejects further commands on that connection until the transaction is rolled back
- Rolling back ensures we start with a clean slate
- Connection tests verify we can communicate with the database

**Best Practice**: Always check connection health before major operations!

In [21]:
# 🔧 **STEP 7A: CONNECTION CHECK & ROLLBACK**
# ============================================
print("🔧 **STEP 7A: CONNECTION CHECK & ROLLBACK**")
print("=" * 45)

try:
    # Check connection status
    test_result = conn.execute(text("SELECT 1 as test"))
    test_value = test_result.fetchone()[0]
    print(f"✅ Connection working: {test_value}")
    
    # Rollback any pending transactions
    conn.rollback()
    print("✅ Transaction state cleared")
    
except Exception as e:
    print(f"❌ Connection issue: {e}")
    print("💡 Try reconnecting if needed")

print("🚀 Ready for table creation!")

🔧 **STEP 7A: CONNECTION CHECK & ROLLBACK**
✅ Connection working: 1
✅ Transaction state cleared
🚀 Ready for table creation!


## 🏗️ **Step 7B: Create Basic Table Structure**

**Learning Point**: Start with simple table structure before adding complex spatial columns.

**Why this approach**:
- Creates the basic columns first (district_id, district)
- Uses `CREATE TABLE IF NOT EXISTS` to avoid errors if table exists
- Establishes primary structure before spatial additions

**SQL Concepts**:
- `VARCHAR(2)` for district_id (Berlin has 2-digit district codes)
- `UNIQUE NOT NULL` ensures no duplicate district IDs
- `VARCHAR(100)` for district names

**Best Practice**: Build tables incrementally - structure first, then spatial features!
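
The constraints above can also be checked on the Python side before anything is sent to the database. This is a minimal sketch with a hypothetical helper, `check_district_ids`, run on made-up IDs; it mirrors `NOT NULL`, `VARCHAR(2)` and `UNIQUE` but does not replace them.

```python
from collections import Counter


def check_district_ids(ids, max_length=2):
    """Report IDs that would violate NOT NULL, VARCHAR(n), or UNIQUE."""
    missing = sum(1 for i in ids if i is None)
    present = [i for i in ids if i is not None]
    too_long = sorted({i for i in present if len(i) > max_length})
    duplicates = sorted(i for i, n in Counter(present).items() if n > 1)
    return {"missing": missing, "too_long": too_long, "duplicates": duplicates}


# Made-up input with one problem of each kind
print(check_district_ids(["12", "04", "09", "04", "123", None]))
# {'missing': 1, 'too_long': ['123'], 'duplicates': ['04']}
```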

In [29]:
# 🏗️ **STEP 7B: CREATE TABLE STRUCTURE**
# =====================================
print("🏗️ **STEP 7B: CREATE TABLE STRUCTURE**")
print("=" * 45)

try:
    print("1️⃣ Creating basic table structure...")
    
    create_sql = """
    CREATE TABLE IF NOT EXISTS berlin_data.districts_enhanced (
        district_id VARCHAR(2) UNIQUE NOT NULL,
        district VARCHAR(100) NOT NULL
    );
    """
    
    conn.execute(text(create_sql))
    conn.commit()
    print("   ✅ Basic table structure created!")
    
    # Connection check
    test_result = conn.execute(text("SELECT 1"))
    print("   ✅ Connection still working")
    
except Exception as e:
    print(f"❌ Table creation failed: {e}")
    conn.rollback()
    print("🔄 Transaction rolled back")
    raise

🏗️ **STEP 7B: CREATE TABLE STRUCTURE**
1️⃣ Creating basic table structure...
   ✅ Basic table structure created!
   ✅ Connection still working


## 🗺️ **Step 7C: Add PostGIS Geometry Column**

**Learning Point**: PostGIS provides special functions for adding spatial columns with proper constraints.

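**PostGIS Functions**:
- `AddGeometryColumn()` - a long-standing PostGIS helper for adding a spatial column to an existing table
- Fixes the column's SRID, geometry type, and dimension (since PostGIS 2.0 through the column's type modifier by default, rather than separate constraints)
- The column then shows up in the `geometry_columns` view, which PostGIS 2.0+ builds from the system catalogs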

**Parameters Explained**:
- `'berlin_data'` - schema name
- `'districts_enhanced'` - table name  
- `'geometry'` - column name
- `4326` - SRID (Spatial Reference Identifier; 4326 is WGS 84 longitude/latitude)
- `'MULTIPOLYGON'` - geometry type (districts can have multiple polygons)
- `2` - dimensions (2D: X,Y coordinates)

**Fallback Strategy**: If AddGeometryColumn fails, we use manual ALTER TABLE as backup!
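
The fallback form puts the same parameters into the column's type, `geometry(MULTIPOLYGON, 4326)`. Below is a minimal sketch of how such a statement could be assembled. The helper `geometry_column_sql` and its simple identifier check are assumptions for illustration; identifiers cannot be passed as bind parameters, so they are validated before being placed in the string.

```python
import re

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def geometry_column_sql(schema, table, column, srid, geom_type):
    """Build an ALTER TABLE statement that adds a typed geometry column."""
    # Identifiers cannot be bound as query parameters, so validate them instead
    for name in (schema, table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if not geom_type.isalpha():
        raise ValueError(f"unexpected geometry type: {geom_type!r}")
    return (
        f"ALTER TABLE {schema}.{table} "
        f"ADD COLUMN IF NOT EXISTS {column} geometry({geom_type.upper()}, {int(srid)});"
    )


print(geometry_column_sql("berlin_data", "districts_enhanced", "geometry", 4326, "MultiPolygon"))
# ALTER TABLE berlin_data.districts_enhanced ADD COLUMN IF NOT EXISTS geometry geometry(MULTIPOLYGON, 4326);
```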

In [30]:
# 🗺️ **STEP 7C: ADD GEOMETRY COLUMN**
# ==================================
print("🗺️ **STEP 7C: ADD GEOMETRY COLUMN**")
print("=" * 45)

try:
    print("2️⃣ Adding PostGIS geometry column...")
    
    # Use AddGeometryColumn - the PostGIS standard way
    add_geom_sql = """
    SELECT AddGeometryColumn('berlin_data', 'districts_enhanced', 'geometry', 4326, 'MULTIPOLYGON', 2);
    """
    
    try:
        conn.execute(text(add_geom_sql))
        conn.commit()
        print("   ✅ Geometry column added with AddGeometryColumn!")
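    except Exception as geom_error:
        # A failed statement aborts the transaction; roll back before running anything else
        conn.rollback()
        if "already exists" in str(geom_error):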
            print("   ✅ Geometry column already exists!")
        else:
            print(f"   ⚠️ AddGeometryColumn failed: {geom_error}")
            print("   🔄 Trying alternative approach...")
            
            # Alternative: Add column manually
            alt_sql = """
            ALTER TABLE berlin_data.districts_enhanced 
            ADD COLUMN IF NOT EXISTS geometry GEOMETRY(MULTIPOLYGON, 4326);
            """
            conn.execute(text(alt_sql))
            conn.commit()
            print("   ✅ Geometry column added manually!")
    
    # Connection check
    test_result = conn.execute(text("SELECT 1"))
    print("   ✅ Connection still working")
    
except Exception as e:
    print(f"❌ Geometry column creation failed: {e}")
    conn.rollback()
    print("🔄 Transaction rolled back")
    raise

🗺️ **STEP 7C: ADD GEOMETRY COLUMN**
2️⃣ Adding PostGIS geometry column...
   ✅ Geometry column added with AddGeometryColumn!
   ✅ Connection still working


## 🧹 **Step 7D: Clear Existing Data**

**Learning Point**: When a notebook reloads a table from scratch, clear the existing records first so that a re-run does not produce duplicates or unique-constraint errors.

**Why clear data**:
- Prevents duplicate records if we re-run the notebook
- Ensures clean, consistent dataset
- Allows for fresh data imports during development

**SQL Concepts**:
- `DELETE FROM table` - removes all records but keeps table structure
- Simpler than `DROP TABLE` + `CREATE TABLE`, which would also discard the geometry column added in Step 7C; for large tables, `TRUNCATE` is usually faster than `DELETE`
- Preserves table constraints and indexes

**Verification**: We count records after deletion to confirm the table is empty.

**Best Practice**: Always verify your operations worked as expected!
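
One design choice worth considering: if the `DELETE` and the following inserts share a single transaction, a failed reload leaves the old rows in place instead of an empty table. The sketch below demonstrates the idea with the standard library's `sqlite3` as a stand-in for PostgreSQL. The table `districts_demo` and the helper `reload_table` are hypothetical, and transaction details differ between the two databases.

```python
import sqlite3


def reload_table(conn, rows):
    """Replace all rows atomically: the full reload succeeds or nothing changes."""
    try:
        conn.execute("DELETE FROM districts_demo")
        conn.executemany(
            "INSERT INTO districts_demo (district_id, district) VALUES (?, ?)", rows
        )
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()  # undoes the DELETE as well
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE districts_demo (district_id TEXT UNIQUE NOT NULL, district TEXT NOT NULL)"
)
conn.commit()

print(reload_table(conn, [("12", "Reinickendorf"), ("04", "Charlottenburg-Wilmersdorf")]))
# A duplicate ID violates UNIQUE, so the two rows loaded before must survive
print(reload_table(conn, [("03", "Pankow"), ("03", "Pankow")]))
print(conn.execute("SELECT COUNT(*) FROM districts_demo").fetchone()[0])
```

The second call fails and rolls back, so the count printed at the end still reflects the first load.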

In [11]:
# 🧹 **STEP 7D: CLEAR EXISTING DATA**
# =================================
print("🧹 **STEP 7D: CLEAR EXISTING DATA**")
print("=" * 45)

try:
    print("3️⃣ Clearing existing data...")
    
    # Clear the table
    conn.execute(text("DELETE FROM berlin_data.districts_enhanced;"))
    conn.commit()
    print("   ✅ Table cleared successfully")
    
    # Verify it's empty
    count_check = conn.execute(text("SELECT COUNT(*) FROM berlin_data.districts_enhanced;"))
    count = count_check.fetchone()[0]
    print(f"   📊 Current record count: {count}")
    
    # Connection check
    test_result = conn.execute(text("SELECT 1"))
    print("   ✅ Connection still working")
    
except Exception as e:
    print(f"❌ Data clearing failed: {e}")
    conn.rollback()
    print("🔄 Transaction rolled back")
    raise

🧹 **STEP 7D: CLEAR EXISTING DATA**
3️⃣ Clearing existing data...
   ✅ Table cleared successfully
   📊 Current record count: 0
   ✅ Connection still working


## 📥 **Step 7E: Insert Districts with Spatial Data**

**Learning Point**: Converting GeoDataFrame geometries to PostGIS format using WKT (Well-Known Text).

**Spatial Data Conversion**:
- `row['geometry'].wkt` - converts shapely geometry to WKT string
- `ST_GeomFromText(wkt, 4326)` - PostGIS function to create geometry from WKT
- SRID 4326 ensures proper coordinate reference system

**Insertion Strategy**:
- Loop through each district in our GeoDataFrame
- Use parameterized queries to prevent SQL injection
- Insert both attribute data (district_id, district) and spatial data (geometry)

**Progress Tracking**: We show progress every 3 insertions so students can see the process.

**Why WKT**: Well-Known Text is a standard, human-readable format for geometric data that PostGIS understands perfectly!
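
As an alternative to one `execute` call per row, the bind parameters can be collected first and sent together. This is a minimal sketch: `build_insert_params` is a hypothetical helper, and the sample row uses a made-up, heavily simplified geometry rather than real district data.

```python
def build_insert_params(rows):
    """Turn (district_id, district, wkt) tuples into a list of bind-parameter dicts."""
    params = []
    for district_id, district, wkt in rows:
        # Catch obviously wrong geometry text before it reaches the database
        if not wkt.upper().startswith("MULTIPOLYGON"):
            raise ValueError(f"district {district_id}: expected MULTIPOLYGON WKT")
        params.append({"district_id": district_id, "district": district, "wkt": wkt})
    return params


# Hypothetical, simplified geometry for illustration
rows = [("01", "Example", "MULTIPOLYGON (((13 52, 13.1 52, 13.1 52.1, 13 52)))")]
params = build_insert_params(rows)
print(params[0]["district_id"], len(params))

# With SQLAlchemy, passing the whole list to one execute call runs the
# statement once per dict (executemany style):
#     conn.execute(insert_sql, params)
```

For 12 districts the difference is negligible; the pattern matters more for larger tables.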

In [28]:
# 📥 **STEP 7E: INSERT DISTRICTS DATA**
# ===================================
print("📥 **STEP 7E: INSERT DISTRICTS DATA**")
print("=" * 45)

try:
    print("4️⃣ Inserting districts data...")
    
    inserted_count = 0
    for idx, row in districts_gdf.iterrows():
        insert_sql = text("""
            INSERT INTO berlin_data.districts_enhanced 
            (district_id, district, geometry) 
            VALUES (:district_id, :district, ST_GeomFromText(:wkt, 4326))
        """)
        
        conn.execute(insert_sql, {
            'district_id': row['district_id'],
            'district': row['district'],
            'wkt': row['geometry'].wkt
        })
        inserted_count += 1
        
        # Progress indicator every 3 records
        if inserted_count % 3 == 0:
            print(f"   📍 Inserted {inserted_count} districts...")
    
    conn.commit()
    print(f"   ✅ Successfully inserted {inserted_count} districts!")
    
    # Connection check
    test_result = conn.execute(text("SELECT 1"))
    print("   ✅ Connection still working")
    
except Exception as e:
    print(f"❌ Data insertion failed: {e}")
    conn.rollback()
    print("🔄 Transaction rolled back")
    raise

📥 **STEP 7E: INSERT DISTRICTS DATA**
4️⃣ Inserting districts data...
❌ Data insertion failed: (psycopg2.errors.UndefinedTable) relation "berlin_data.districts_enhanced" does not exist
LINE 2:             INSERT INTO berlin_data.districts_enhanced 
                                ^

[SQL: 
            INSERT INTO berlin_data.districts_enhanced 
            (district_id, district, geometry) 
            VALUES (%(district_id)s, %(district)s, ST_GeomFromText(%(wkt)s, 4326))
        ]
[parameters: {'district_id': '12', 'district': 'Reinickendorf', 'wkt': 'MULTIPOLYGON (((13.320744327762688 52.6265990635977, 13.320450024315486 52.62661432040652, 13.320156209034547 52.626629556435226, 13.319861608831037  ... (92321 characters truncated) ... 21483735159507 52.626560722834455, 13.321338930680731 52.62656821917953, 13.321038601486716 52.62658380563724, 13.320744327762688 52.6265990635977)))'}]
(Background on this error at: https://sqlalche.me/e/20/f405)
🔄 Transaction rolled back



## ✅ **Step 7F: Verify Success & Test Spatial Functions**

**Learning Point**: Always verify your data import was successful and spatial functions work correctly.

**Verification Steps**:
1. **Count Records** - Ensure all districts were inserted
2. **Test Spatial Functions** - Verify PostGIS geometry operations work
3. **Check Data Types** - Confirm geometry types and coordinate systems

**PostGIS Testing Functions**:
- `ST_GeometryType()` - returns the geometry type (e.g., ST_MultiPolygon)
- `ST_SRID()` - returns the Spatial Reference System ID (should be 4326)
- These functions prove our spatial data is properly stored and accessible

**Success Criteria**:
- ✅ Record count matches expected districts (should be 12 for Berlin)
- ✅ Geometry type is MULTIPOLYGON 
- ✅ SRID is 4326 (WGS84)
- ✅ No errors in spatial function calls

**Why Verify**: Data can appear to import successfully but have hidden issues. Testing catches problems early!
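
The success criteria can be written as a small check that returns a list of problems instead of relying on reading printed output. This is a sketch under assumptions: `check_import` is a hypothetical helper, and its inputs stand for the values returned by the count query and by `ST_GeometryType()` / `ST_SRID()`.

```python
def check_import(total_count, samples, expected_count=12,
                 expected_type="ST_MultiPolygon", expected_srid=4326):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if total_count != expected_count:
        problems.append(f"expected {expected_count} records, found {total_count}")
    for district_id, geom_type, srid in samples:
        if geom_type != expected_type:
            problems.append(f"{district_id}: geometry type is {geom_type}")
        if srid != expected_srid:
            problems.append(f"{district_id}: SRID is {srid}")
    return problems


print(check_import(12, [("12", "ST_MultiPolygon", 4326)]))
# []
print(check_import(11, [("04", "ST_Polygon", 0)]))
# ['expected 12 records, found 11', '04: geometry type is ST_Polygon', '04: SRID is 0']
```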

In [33]:
# ✅ **STEP 7F: VERIFY SUCCESS**
# ===============================
print("✅ **STEP 7F: VERIFY SUCCESS**")
print("=" * 45)

try:
    print("5️⃣ Verifying results...")
    
    # Count total records
    count_result = conn.execute(text("""
        SELECT COUNT(*) FROM berlin_data.districts_enhanced
    """))
    total_count = count_result.fetchone()[0]
    
    # Test spatial functionality
    spatial_test = conn.execute(text("""
        SELECT district_id, district, 
               ST_GeometryType(geometry) as geom_type,
               ST_SRID(geometry) as srid
        FROM berlin_data.districts_enhanced 
        LIMIT 3
    """))
    
    print(f"   ✅ Total records: {total_count}")
    print("   📊 Sample spatial data:")
    for row in spatial_test.fetchall():
        print(f"      🏢 {row.district_id}: {row.district}")
        print(f"         📐 Type: {row.geom_type}, SRID: {row.srid}")
    
    print("\n🎉 **SUCCESS! Districts table created and populated!**")
    print("=" * 45)
    print("✅ Schema: berlin_data")
    print(f"✅ Table: districts_enhanced ({total_count} records)")
    print("✅ Geometry: MULTIPOLYGON with SRID 4326")
    print("✅ Method: Direct SQL + ST_GeomFromText")
    
    print("\n🖖 'Sometimes the direct path is the most logical!' - Spock")
    
except Exception as e:
    print(f"❌ Verification failed: {e}")
    conn.rollback()

✅ **STEP 7F: VERIFY SUCCESS**
5️⃣ Verifying results...
   ✅ Total records: 12
   📊 Sample spatial data:
      🏢 12: Reinickendorf
         📐 Type: ST_MultiPolygon, SRID: 4326
      🏢 04: Charlottenburg-Wilmersdorf
         📐 Type: ST_MultiPolygon, SRID: 4326
      🏢 09: Treptow-Köpenick
         📐 Type: ST_MultiPolygon, SRID: 4326

🎉 **SUCCESS! Districts table created and populated!**
✅ Schema: berlin_data
✅ Table: districts_enhanced (12 records)
✅ Geometry: MULTIPOLYGON with SRID 4326
✅ Method: Direct SQL + ST_GeomFromText

🖖 'Sometimes the direct path is the most logical!' - Spock


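## ✅ **Step 8: Reload and Verify (Legacy Cell)**

**Learning Point**: This legacy cell repeats Steps 7D to 7F in one pass: it resets the transaction, clears the table, inserts the districts, and verifies the result.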

In [31]:
# ✅ **LEGACY CELL - FIXED VERSION**
print("✅ **LEGACY CELL - FIXED VERSION**")
print("=" * 45)

try:
    # Fix the transaction state
    print("1️⃣ Fixing transaction state...")
    conn.rollback()
    print("   ✅ Transaction rolled back")
    
    # Clear existing data
    print("\n2️⃣ Clearing existing data...")
    conn.execute(text("DELETE FROM berlin_data.districts_enhanced;"))
    conn.commit()
    print("   ✅ Table cleared")
    
    # Insert districts data (FIXED: using 'district' not 'district_name')
    print("\n3️⃣ Inserting districts data...")
    
    inserted_count = 0
    for idx, row in districts_gdf.iterrows():
        insert_sql = text("""
            INSERT INTO berlin_data.districts_enhanced 
            (district_id, district, geometry) 
            VALUES (:district_id, :district, ST_GeomFromText(:wkt, 4326))
        """)
        
        conn.execute(insert_sql, {
            'district_id': row['district_id'],
            'district': row['district'],
            'wkt': row['geometry'].wkt
        })
        inserted_count += 1
    
    conn.commit()
    print(f"   ✅ Successfully inserted {inserted_count} districts!")
    
    # Verify final results (FIXED: using 'district' not 'district_name')
    print("\n4️⃣ Final verification...")
    
    count_result = conn.execute(text("""
        SELECT COUNT(*) FROM berlin_data.districts_enhanced
    """))
    total_count = count_result.fetchone()[0]
    
    # Test spatial functionality (FIXED: using 'district' not 'district_name')
    spatial_test = conn.execute(text("""
        SELECT district_id, district, 
               ST_GeometryType(geometry) as geom_type,
               ST_SRID(geometry) as srid
        FROM berlin_data.districts_enhanced 
        LIMIT 3
    """))
    
    print(f"   ✅ Total records: {total_count}")
    print("   📊 Sample spatial data:")
    for row in spatial_test.fetchall():
        print(f"      🏢 {row.district_id}: {row.district}")
        print(f"         📐 Type: {row.geom_type}, SRID: {row.srid}")
    
    print(f"\n🎉 **LEGACY CELL FIXED! {total_count} Berlin districts ready!**")
    print("=" * 45)
    print("✅ Schema: berlin_data")
    print("✅ Table: districts_enhanced")
    print("✅ Column names: FIXED to match actual table structure")
    print("✅ Spatial data: Working perfectly!")
    
    print("\n🖖 'Logic and consistency restore order!' - Spock")
    
except Exception as e:
    print(f"❌ Error: {e}")
    conn.rollback()

✅ **LEGACY CELL - FIXED VERSION**
1️⃣ Fixing transaction state...
   ✅ Transaction rolled back

2️⃣ Clearing existing data...
   ✅ Table cleared

3️⃣ Inserting districts data...
   ✅ Successfully inserted 12 districts!

4️⃣ Final verification...
   ✅ Total records: 12
   📊 Sample spatial data:
      🏢 12: Reinickendorf
         📐 Type: ST_MultiPolygon, SRID: 4326
      🏢 04: Charlottenburg-Wilmersdorf
         📐 Type: ST_MultiPolygon, SRID: 4326
      🏢 09: Treptow-Köpenick
         📐 Type: ST_MultiPolygon, SRID: 4326

🎉 **LEGACY CELL FIXED! 12 Berlin districts ready!**
✅ Schema: berlin_data
✅ Table: districts_enhanced
✅ Column names: FIXED to match actual table structure
✅ Spatial data: Working perfectly!

🖖 'Logic and consistency restore order!' - Spock

## 🎓 **Summary & Learning Outcomes**

### ✅ **What We Accomplished:**
1. **Database Connection**: Successfully connected to AWS RDS PostgreSQL
2. **Schema Investigation**: Explored existing database structure
3. **PostGIS Verification**: Confirmed spatial extension availability
4. **GeoJSON Import**: Loaded enhanced districts data
5. **Table Creation**: Created new districts table with PostGIS geometry
6. **Data Validation**: Verified successful import

### 📚 **Key Learning Points:**
- **Connection Strings**: How to format and use database URLs
- **Schema Management**: Importance of investigating existing structures
- **Spatial Data**: Working with coordinate reference systems (CRS)
- **PostGIS**: Using spatial database extensions
- **Data Validation**: Always verify your imports!

### 🚀 **Next Steps:**
- Create neighborhoods table with district relationships
- Add spatial indexes for performance
- Implement data quality checks
- Create spatial queries and analysis

---

**🖖 "Logic is the beginning of wisdom, not the end." - Spock**

*This notebook demonstrates a systematic approach to spatial database operations.*

In [32]:
# 🚀 Simple & Fast Check
print("🚀 **SIMPLE TABLE CHECK**")

# Just check if table exists and count records
simple_count = conn.execute(text("SELECT COUNT(*) FROM berlin_data.districts_enhanced;"))
count = simple_count.fetchone()[0]
print(f"📊 Total records: {count}")

# Show columns
cols = conn.execute(text("""
    SELECT column_name FROM information_schema.columns 
    WHERE table_schema = 'berlin_data' AND table_name = 'districts_enhanced';
"""))
column_list = [col[0] for col in cols.fetchall()]
print(f"📋 Columns: {', '.join(column_list)}")

print("✅ Quick check complete!")

🚀 **SIMPLE TABLE CHECK**
📊 Total records: 12
📋 Columns: district_id, district, geometry
✅ Quick check complete!


In [39]:
# Query the first 5 rows from berlin_data.districts_enhanced
result = conn.execute(text("SELECT * FROM berlin_data.districts_enhanced LIMIT 5;"))
rows = result.fetchall()

# Display results as a pandas DataFrame for readability
import pandas as pd
df_preview = pd.DataFrame(rows, columns=result.keys())
df_preview


|   | district_id | district | geometry |
|---|-------------|----------|----------|
| 0 | 12 | Reinickendorf | 0106000020E61000000100000001030000000100000084... |
| 1 | 4 | Charlottenburg-Wilmersdorf | 0106000020E6100000010000000103000000010000000D... |
| 2 | 9 | Treptow-Köpenick | 0106000020E610000001000000010300000001000000E9... |
| 3 | 3 | Pankow | 0106000020E61000000400000001030000000100000012... |
| 4 | 8 | Neukölln | 0106000020E610000001000000010300000001000000BF... |
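
The hex strings in the `geometry` column of the preview are PostGIS's extended WKB (EWKB) encoding, and the first nine bytes already confirm the type and SRID. Below is a minimal sketch of decoding that header with the standard library. `decode_ewkb_header` is a hypothetical helper that only handles the basic 2D geometry types.

```python
import struct

SRID_FLAG = 0x20000000  # set in the type word when an SRID follows
GEOMETRY_NAMES = {1: "Point", 2: "LineString", 3: "Polygon",
                  4: "MultiPoint", 5: "MultiLineString", 6: "MultiPolygon"}


def decode_ewkb_header(hex_prefix):
    """Decode geometry type and SRID from the first bytes of an EWKB hex string."""
    raw = bytes.fromhex(hex_prefix[:18])
    if len(raw) < 5:
        raise ValueError("need at least 5 bytes (10 hex characters)")
    order = "<" if raw[0] == 1 else ">"  # 1 = little-endian, 0 = big-endian
    (type_word,) = struct.unpack(order + "I", raw[1:5])
    srid = None
    if type_word & SRID_FLAG:
        (srid,) = struct.unpack(order + "I", raw[5:9])
    base_type = type_word & 0xFF
    return GEOMETRY_NAMES.get(base_type, f"unknown ({base_type})"), srid


# Common prefix of every row in the preview
print(decode_ewkb_header("0106000020E6100000"))
# ('MultiPolygon', 4326)
```

This matches what `ST_GeometryType()` and `ST_SRID()` reported during verification.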
