# Lecture 20 (Optional) Advanced Merging 

<font size = "5">

In this lecture we will see how to merge datasets <br>
whose primary key spans multiple columns (a composite key)

# I. Import Libraries and Data 


<font size = "5">

Import libraries

In [1]:
# psycopg2 helps us process SQL commands to send to the server
# sqlalchemy facilitates establishing a connection with the server

import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text

<font size = "5">

Connect to SQL server

- The default installation instructions set <br>
the password to "" on Windows and <br>
no password on Mac
- ADJUST the connection string below accordingly!
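
To make the adjustment concrete, here is a minimal sketch of assembling the connection string from its parts. The helper name `build_pg_url` and its default values are hypothetical, chosen only to match the string used in this lecture. It relies on the standard library alone, and the resulting string is what would be handed to `create_engine`.

```python
from urllib.parse import quote_plus

def build_pg_url(user="postgres", password="postgres",
                 host="localhost", port=5432, database="postgres"):
    """Assemble a SQLAlchemy-style PostgreSQL URL (hypothetical helper)."""
    # Percent-encode credentials so characters like '@' or ':' do not break the URL
    credentials = quote_plus(user)
    if password:  # leave out the ':password' part when no password is set
        credentials += ":" + quote_plus(password)
    return f"postgresql+psycopg2://{credentials}@{host}:{port}/{database}"

default_url = build_pg_url()                 # matches the string used in this lecture
no_password_url = build_pg_url(password="")  # a setup without a password
print(default_url)
print(no_password_url)
```

Keeping the pieces as separate arguments means only the part that differs on your machine has to change.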

In [2]:
# These are the default settings.
# "postgresql+psycopg2" names the database dialect and driver; keep it as is
# If you have a different host, database, username, or password,
# change the corresponding connection details

engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres', future=True)
conn = engine.connect()


<font size = "5">

Read datasets into Python

In [3]:
bills_actions   = pd.read_csv("data_raw/bills_actions.csv")
bills_subjects  = pd.read_csv("data_raw/bills_subjects.csv")

<font size = "5">

Upload to SQL


In [4]:
# Roll back any pending transaction to clear a possible error state
conn.rollback()

# Upload the data to SQL
bills_actions.to_sql('bills_actions', 
                     con = conn, 
                     if_exists='replace',
                     index=False)

conn.commit()

In [5]:
# Roll back any pending transaction to clear a possible error state
conn.rollback()

# Upload the data to SQL
bills_subjects.to_sql('bills_subjects',
               con = conn,
               if_exists='replace',
               index=False)

conn.commit()
conn.close()

In [6]:
# Open a new connection (the previous one was closed above); the existing engine can be reused
conn = engine.connect()

# Execute the SQL query
pd.read_sql("SELECT * FROM bills_actions", conn)

Unnamed: 0,congress,bill_number,bill_type,action,main_action,object,member_id
0,116,1029,s,S.Amdt.1274 Amendment SA 1274 proposed by Sena...,senate amendment proposed (on the floor),amendment,858
1,116,1031,s,S.Amdt.2698 Amendment SA 2698 proposed by Sena...,senate amendment proposed (on the floor),amendment,675
2,116,1160,s,S.Amdt.2659 Amendment SA 2659 proposed by Sena...,senate amendment proposed (on the floor),amendment,858
3,116,1199,s,"Committee on Health, Education, Labor, and Pen...",senate committee/subcommittee actions,senate bill,1561
4,116,1208,s,Committee on the Judiciary. Reported by Senato...,senate committee/subcommittee actions,senate bill,1580
...,...,...,...,...,...,...,...
3298,116,9,hr,H.Amdt.172 Amendment (A004) offered by Ms. Kus...,house amendment offered,amendment,36
3299,116,9,hr,H.Amdt.171 Amendment (A003) offered by Ms. Hou...,house amendment offered,amendment,186
3300,116,9,hr,H.Amdt.170 Amendment (A002) offered by Ms. Oma...,house amendment offered,amendment,477
3301,116,9,hr,POSTPONED PROCEEDINGS - At the conclusion of d...,other house amendment actions,amendment,393


In [7]:
# Reuse the connection opened in the previous cell

# Execute the SQL query
pd.read_sql("SELECT * FROM bills_subjects", conn)

Unnamed: 0,congress,bill_number,bill_type,bill_subject
0,116,81,hconres,Appropriations
1,116,81,hconres,Legislative rules and procedure
2,116,83,hconres,Conflicts and wars
3,116,83,hconres,Congressional oversight
4,116,83,hconres,Iran
...,...,...,...,...
158693,116,2474,s,"Veterans' education, employment, rehabilitation"
158694,116,2474,s,Veterans' medical care
158695,116,2474,s,War and emergency powers
158696,116,2474,s,Washington State


<font size = "5">

Merge two datasets where the primary key is given by multiple variables

In [None]:
# Your code here
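
One possible approach, sketched under stated assumptions: in both tables a bill is identified by the three columns `congress`, `bill_number`, and `bill_type` together, so the join condition has to match on all of them. To keep the example runnable without a server, it uses Python's built-in `sqlite3` and a few made-up rows instead of the lecture's PostgreSQL tables. The toy table names `actions_demo` and `subjects_demo` and all row values are hypothetical. The same `JOIN ... ON ... AND ...` pattern should carry over to `bills_actions` and `bills_subjects`.

```python
import sqlite3

# In-memory toy database (a stand-in for the PostgreSQL server, for illustration only)
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE actions_demo  (congress INT, bill_number INT, bill_type TEXT, main_action TEXT);
    CREATE TABLE subjects_demo (congress INT, bill_number INT, bill_type TEXT, bill_subject TEXT);

    -- Made-up rows: bill number 9 exists both as an 'hr' bill and as an 's' bill
    INSERT INTO actions_demo VALUES
        (116, 9, 'hr', 'action X'),
        (116, 9, 's',  'action Y');

    INSERT INTO subjects_demo VALUES
        (116, 9, 'hr', 'subject A'),
        (116, 9, 'hr', 'subject B'),
        (116, 9, 's',  'subject C');
""")

# Match on every column of the composite key
query = """
    SELECT a.congress, a.bill_number, a.bill_type, a.main_action, s.bill_subject
    FROM actions_demo AS a
    LEFT JOIN subjects_demo AS s
      ON  a.congress    = s.congress
      AND a.bill_number = s.bill_number
      AND a.bill_type   = s.bill_type
    ORDER BY a.bill_type, s.bill_subject
"""
merged_rows = demo.execute(query).fetchall()
for row in merged_rows:
    print(row)

# For contrast: joining on bill_number alone also pairs 'hr' 9 with 's' 9
loose_count = demo.execute("""
    SELECT COUNT(*)
    FROM actions_demo AS a
    JOIN subjects_demo AS s ON a.bill_number = s.bill_number
""").fetchone()[0]
print("rows when joining on bill_number only:", loose_count)

demo.close()
```

Because each bill can have several actions and several subjects, the key does not identify a single row in either table, so the merged result has one row per matching action-subject pair. With the lecture's setup, the same query text could be passed to `pd.read_sql` on the open connection.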

<font size = "5">

The code below is a last resort: it terminates other connections to the database and drops every table, so that we can start from scratch if all else fails. There is no need to run it if everything is working fine. 

In [None]:
from sqlalchemy import create_engine, text, inspect
import time

# Only the engine is needed here; clean_database opens and closes its own connection
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres', future=True)

def clean_database(engine):
    with engine.connect() as conn:
        try:
            # Get inspector to check existing tables
            inspector = inspect(engine)
            existing_tables = inspector.get_table_names()
            
            if not existing_tables:
                print("No tables found in database")
                return
                
            print(f"Found {len(existing_tables)} tables: {existing_tables}")
            
            # Kill other connections
            conn.execute(text("""
                SELECT pg_terminate_backend(pid) 
                FROM pg_stat_activity 
                WHERE pid <> pg_backend_pid()
                AND datname = current_database()
            """))
            
            # End the transaction opened by the query above
            conn.rollback()
            conn.execute(text("SET statement_timeout = '30s'"))
            
            # Only drop tables that exist
            for table in existing_tables:
                try:
                    # Quote the name so mixed-case or unusual table names still work
                    conn.execute(text(f'DROP TABLE IF EXISTS "{table}" CASCADE'))
                    conn.commit()
                    print(f"Dropped {table}")
                    time.sleep(1)
                except Exception as e:
                    print(f"Error with {table}: {e}")
                    conn.rollback()
                    
        except Exception as e:
            print(f"Fatal error: {e}")
            conn.rollback()

# Execute
clean_database(engine)