# SQL in Jupyter Notebook -- DRAFT NOTEBOOK

This notebook does not teach SQL, but rather provides examples of using SQL and psql from within a Jupyter Notebook.

For an excellent introduction to SQL, with postgres examples, see:  
[Practical SQL](https://www.amazon.com/Practical-SQL-Beginners-Guide-Storytelling-ebook/dp/B07197G78H/)

For an excellent introduction to PostgreSQL, see:  
[PostgreSQL Up and Running 3rd Edition](https://www.amazon.com/PostgreSQL-Running-Practical-Advanced-Database/dp/1491963417/)

It is assumed you have a Postgres Server up and running on your local computer.

For instructions on installing Postgres 11 and pgAdmin4 on Ubuntu 18.04, see:  
https://sdiehl28.netlify.com/topics/sql/postgres/

The above instructions include how to set the authentication method to md5 (instead of the default of peer), how to configure pgadmin4 for desktop mode, etc.

Of course, a google search for "postgres install" will also find installation instructions.

If you are writing a lot of SQL, then pgadmin4 (and Jupyter Notebook) are not optimal.  An excellent SQL client for several database management systems is: https://dbeaver.io/

## Programmatically Interacting with DB Server

**DB-API**  
DB-API (PEP 249) is Python's specification for interacting with databases.  It is similar to Java's JDBC.  Each database driver provides its own implementation of the specification.  For Postgres, the most common DB-API implementation is psycopg2.
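
As a rough illustration of the DB-API pattern (connect, get a cursor, execute with bound parameters, fetch, commit, close), the sketch below uses the standard library's `sqlite3` module as a stand-in so that it runs without a server. That substitution is an assumption made for portability; it is not what the rest of this notebook uses. With psycopg2 the same steps would start from `psycopg2.connect(...)` and use `%s` placeholders instead of `?`. The table `demo_actor` and its rows are made up.

```python
import sqlite3

# sqlite3 is a DB-API implementation from the standard library; it stands in
# for psycopg2 here so the example runs without a Postgres server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# hypothetical table and data, for illustration only
cur.execute("CREATE TABLE demo_actor (actor_id INTEGER PRIMARY KEY, last_name TEXT)")
cur.executemany(
    "INSERT INTO demo_actor (last_name) VALUES (?)",
    [("Guiness",), ("Wahlberg",)],
)
conn.commit()

# pass values as bound parameters rather than formatting them into the SQL text
cur.execute(
    "SELECT actor_id, last_name FROM demo_actor WHERE last_name = ?",
    ("Wahlberg",),
)
dbapi_rows = cur.fetchall()

cur.close()
conn.close()
```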

**SQL Alchemy**  
SQL Alchemy offers two distinct APIs, the Core API and the ORM API. See:  
https://docs.sqlalchemy.org/en/latest/

The Core API is a Pythonic way of interacting with a database using SQL.  It is higher level than DB-API, but it is lower level than the ORM API.

The ORM (Object Relational Mapper) API is for object-oriented application developers who want to use an OO interface to a database.  A full OO approach could result in poor performance due to the "impedance mismatch" between the object and relational models, but the ORM API allows for addressing these issues on a case-by-case basis, while also allowing for an OO approach for most database interactions.
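
To give a feel for the Core API, here is a minimal sketch that defines a table, inserts rows, and runs a filtered select. It assumes an in-memory SQLite engine (`sqlite://`) rather than Postgres so that it runs anywhere, and the table `actor_demo` is hypothetical. It sticks to constructs that should behave the same in SQLAlchemy 1.x and 2.x; only the Core API is shown, not the ORM.

```python
import sqlalchemy as sa

# in-memory SQLite engine: an assumption so the sketch needs no server
engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()

# hypothetical table, loosely modelled on dvdrental's actor table
actor_demo = sa.Table(
    "actor_demo",
    metadata,
    sa.Column("actor_id", sa.Integer, primary_key=True),
    sa.Column("first_name", sa.String(45), nullable=False),
    sa.Column("last_name", sa.String(45), nullable=False),
)

# engine.begin() opens a connection and commits when the block exits
with engine.begin() as conn:
    metadata.create_all(conn)
    conn.execute(
        actor_demo.insert(),
        [
            {"first_name": "Ada", "last_name": "Lovelace"},
            {"first_name": "Alan", "last_name": "Turing"},
        ],
    )
    # the query is built from Python expressions, not from a SQL string
    query = actor_demo.select().where(actor_demo.c.last_name == "Turing")
    core_rows = [tuple(row) for row in conn.execute(query)]
```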

**Pandas**  
Using Pandas is simpler than either psycopg2 or SQL Alchemy. For data analysis, this may be sufficient.

df.to_sql(): optionally creates a table, and writes or appends DataFrame data to it  
pd.read_sql(): creates a DataFrame from the results of a query
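
A minimal round trip through these two functions might look like the following. It assumes an in-memory SQLite database opened with the standard library's `sqlite3` module, which pandas accepts directly, so no Postgres server is needed. The table `people` and its contents are made up.

```python
import sqlite3

import pandas as pd

# in-memory SQLite connection: an assumption so the sketch runs anywhere
con = sqlite3.connect(":memory:")

people = pd.DataFrame({"name": ["Ada", "Alan"], "born": [1815, 1912]})

# create (or replace) the table from the DataFrame
people.to_sql("people", con, index=False, if_exists="replace")

# append the first row again to show if_exists="append"
people.iloc[:1].to_sql("people", con, index=False, if_exists="append")

# read the query result back into a DataFrame
result = pd.read_sql("SELECT name, born FROM people ORDER BY born, name", con)
con.close()
```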

In [59]:
import pandas as pd
import numpy as np
import sqlalchemy as sa
import psycopg2 as pg
from sqlalchemy.engine import create_engine
from IPython.display import HTML, display

%reload_ext sql

In [60]:
# versions
print(f'pandas:     {pd.__version__}')
print(f'numpy:      {np.__version__}')
print(f'sqlalchemy: {sa.__version__}')
print(f'psycopg2:   {pg.__version__}')

pandas:     0.24.1
numpy:      1.15.4
sqlalchemy: 1.2.17
psycopg2:   2.7.6.1 (dt dec pq3 ext lo64)


In [61]:
# postgres version running on local computer
!psql --version

psql (PostgreSQL) 11.2 (Ubuntu 11.2-1.pgdg18.04+1)


## Jupyter Notebook SQL "Magic"

In [62]:
# Get the user and password from the environment (rather than hardcoding it)
import os
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

In [63]:
# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/dvdrental'
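
If an environment variable is missing, `os.environ.get` returns `None` and the string above silently becomes `postgresql://None:None@...`. A password containing characters such as `@` or `/` would also break the URL. The following is one possible way to guard against both, using a hypothetical helper named `build_connect_str`; the `env` parameter exists only so the function can be tried without touching the real environment.

```python
import os
from urllib.parse import quote_plus


def build_connect_str(database, host="localhost", port=5432, env=os.environ):
    """Build a postgresql:// URL from DB_USER and DB_PASS (hypothetical helper)."""
    user = env.get("DB_USER")
    password = env.get("DB_PASS")
    if not user or not password:
        raise RuntimeError("Set DB_USER and DB_PASS before connecting")
    # percent-encode the credentials so characters like '@' and '/' are safe
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# made-up credentials, for illustration only
example_url = build_connect_str(
    "dvdrental", env={"DB_USER": "postgres", "DB_PASS": "p@ss/word"}
)
```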

In [64]:
%sql {connect_str}

'Connected: postgres@dvdrental'

## **psql**

The following shows how to execute psql from within a Jupyter Notebook.

Alternatively, a subset of the psql commands is available after 'pip install pgspecial', as described on: https://github.com/catherinedevlin/ipython-sql

### .pgpass

Having this set properly avoids having to enter a password for psql.

See: https://www.postgresql.org/docs/11/libpq-pgpass.html

Example .pgpass file (fill in user and password as appropriate)  
localhost:5432:dvdrental:<user\>:<password\>
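
On Unix, libpq ignores a `.pgpass` file whose permissions are looser than 0600, and a `:` or `\` inside a field must be escaped with a backslash. A minimal sketch for generating such a file follows. The helper names `pgpass_line` and `write_pgpass` are hypothetical, the user and password are made up, and the demo writes to a temporary directory rather than the real `~/.pgpass`.

```python
import os
import stat
import tempfile


def pgpass_line(host, port, database, user, password):
    """Format one entry as hostname:port:database:username:password."""
    def esc(field):
        # backslash-escape the two characters that are special in .pgpass
        return str(field).replace("\\", "\\\\").replace(":", "\\:")
    return ":".join(esc(f) for f in (host, port, database, user, password))


def write_pgpass(path, lines):
    """Write the entries and restrict the file to owner read/write (0600)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    os.chmod(path, 0o600)  # also tightens a file that already existed


# demo against a temporary directory, not the real ~/.pgpass
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, ".pgpass")
entry = pgpass_line("localhost", 5432, "dvdrental", "demo_user", "pa:ss")
write_pgpass(demo_path, [entry])
mode = stat.S_IMODE(os.stat(demo_path).st_mode)
```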

In [65]:
# -H for html output
# this hardcodes the database to dvdrental, the Postgres tutorial database
# this connects, executes, and disconnects
def psql(cmd):
    psql_out = !psql -H -U postgres dvdrental -c "{cmd}"
    display(HTML(''.join(psql_out)))
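
The helper above relies on IPython's `!` syntax, so it only works inside a notebook, and the command is passed through a shell inside double quotes. As an alternative sketch, the same call can be made with `subprocess`, which passes the command as a single argument with no shell quoting involved. The names `psql_args` and `run_psql` are hypothetical; as before, this assumes `psql` is on the PATH and that `.pgpass` supplies the password.

```python
import subprocess


def psql_args(cmd, database="dvdrental", user="postgres", html=True):
    """Build the psql argument list; no shell is involved, so cmd needs no quoting."""
    args = ["psql"]
    if html:
        args.append("-H")  # HTML output, as in the helper above
    args += ["-U", user, "-d", database, "-c", cmd]
    return args


def run_psql(cmd, **kwargs):
    """Run psql and return its stdout; raises CalledProcessError on failure."""
    done = subprocess.run(
        psql_args(cmd, **kwargs),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return done.stdout
```

In a notebook, `display(HTML(run_psql(r'\d actor')))` would then render the table description.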

In [66]:
# raw string, so the backslash is passed to psql unchanged
psql(r'\conninfo')

## Describe Table

For getting the actual DDL, use pg_dump or pgadmin4 or dbeaver.

In [67]:
# describe the actor table
psql(r'\d actor')

Column,Type,Collation,Nullable,Default
actor_id,integer,,not null,nextval('actor_actor_id_seq'::regclass)
first_name,character varying(45),,not null,
last_name,character varying(45),,not null,
last_update,timestamp without time zone,,not null,now()


In [68]:
# similar to \d, for just the columns of a table, using sql
def get_tbl_info(table):
    return f"""
    SELECT ordinal_position as pos,
           column_name as field,
           data_type,
           column_default as default,
           is_nullable,
           character_maximum_length as max_length,
           numeric_precision as precision
    FROM information_schema.columns
    WHERE table_name = '{table}'
    ORDER BY ordinal_position;
    """

In [69]:
%sql {get_tbl_info('actor')}

 * postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.


pos,field,data_type,default,is_nullable,max_length,precision
1,actor_id,integer,nextval('actor_actor_id_seq'::regclass),NO,,32.0
2,first_name,character varying,,NO,45.0,
3,last_name,character varying,,NO,45.0,
4,last_update,timestamp without time zone,now(),NO,,


## Removing Unwanted Connections
When experimenting, it is possible to leave connections open.

Assuming you are the only one using the database, it can be helpful to close all connections except the current connection.

The following is from:  
https://stackoverflow.com/questions/5108876/kill-a-postgresql-session-connection

In [88]:
%%sql
-- kill all pids except for the current connection
SELECT 
    pg_terminate_backend(pid) 
FROM 
    pg_stat_activity 
WHERE 
    -- don't kill my own connection!
    pid <> pg_backend_pid()
    -- don't kill the connections to other databases
    AND datname = current_database()
;

 * postgresql://postgres:***@localhost:5432/dvdrental
0 rows affected.


pg_terminate_backend


In [98]:
%%sql
SELECT pid, query, state from pg_stat_activity
  WHERE state = 'idle in transaction' ORDER BY xact_start;

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


pid,query,state
12132,SELECT COUNT(*) FROM player_game,idle in transaction


In [99]:
%%sql
SELECT pg_cancel_backend(__pid__);

 * postgresql://postgres:***@localhost:5432/dvdrental
(psycopg2.ProgrammingError) column "__pid__" does not exist
LINE 1: SELECT pg_cancel_backend(__pid__);
                                 ^
 [SQL: 'SELECT pg_cancel_backend(__pid__);'] (Background on this error at: http://sqlalche.me/e/f405)


In [101]:
%%sql
SELECT pg_cancel_backend(12132);

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


pg_cancel_backend
True


## Pandas

In [74]:
conn = create_engine(connect_str)

In [75]:
df = pd.read_sql("SELECT * FROM actor", conn)

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
actor_id       200 non-null int64
first_name     200 non-null object
last_name      200 non-null object
last_update    200 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 6.3+ KB


In [78]:
df.columns

Index(['actor_id', 'first_name', 'last_name', 'last_update'], dtype='object')

In [80]:
df.dtypes

actor_id                int64
first_name             object
last_name              object
last_update    datetime64[ns]
dtype: object

In [81]:
psql(r'\d actor')

Column,Type,Collation,Nullable,Default
actor_id,integer,,not null,nextval('actor_actor_id_seq'::regclass)
first_name,character varying(45),,not null,
last_name,character varying(45),,not null,
last_update,timestamp without time zone,,not null,now()


In [82]:
df.to_sql('my_table', conn, if_exists='replace')

In [83]:
psql(r'\d my_table')

Column,Type,Collation,Nullable,Default
index,bigint,,,
actor_id,bigint,,,
first_name,text,,,
last_name,text,,,
last_update,timestamp without time zone,,,
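
The `\d my_table` output above shows that `to_sql` wrote the DataFrame index as an extra column and mapped the string columns to `text`. Both can be adjusted with the `index` and `dtype` parameters. The sketch below assumes an in-memory SQLite database via `sqlite3` so that it runs without a server; in that mode `dtype` values are given as strings. With a SQLAlchemy engine, as used in this notebook, `dtype` would instead take SQLAlchemy types such as `sa.String(45)`. The data is made up.

```python
import sqlite3

import pandas as pd

# in-memory SQLite connection: an assumption so the sketch runs anywhere
con = sqlite3.connect(":memory:")

actors = pd.DataFrame({"actor_id": [1, 2], "first_name": ["Penelope", "Nick"]})

# index=False skips the index column; dtype overrides the inferred column type
actors.to_sql(
    "my_table",
    con,
    index=False,
    if_exists="replace",
    dtype={"first_name": "VARCHAR(45)"},
)

# PRAGMA table_info is SQLite's rough counterpart to psql's \d
schema = {row[1]: row[2] for row in con.execute("PRAGMA table_info(my_table)")}
con.close()
```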

