# Using PostgreSQL in Python (with Psycopg2)

## Problem: 
### Your data is in a SQL database, but your machine learning tools are in Python.

---------

## Solution:
### Run SQL queries from Python

* Very useful for scaled data pipelines, pre-cleaning, data exploration
* Allows for dynamic query generation

---------

### Psycopg2
A library that lets Python connect to an existing PostgreSQL database and run SQL against it.


### Documentation
* http://initd.org/psycopg/docs/install.html

# Objectives

- Learn how to connect to and run Postgres queries from Python
- Understand cursors, executes, and commits
- Learn how to generate dynamic queries

In [2]:
# basic usage, just an example.  (will not work until we set up conn and cursor)
query = "SELECT * FROM some_table;"
cursor.execute(query)
results = cursor.fetchall()

NameError: name 'cursor' is not defined

# Creating a connection with Postgres

### Import

In [3]:
import psycopg2 as pg2

### Create connection with Postgres

In [4]:
conn = pg2.connect(database='postgres', user='dan')

### Retrieve the Cursor

* A cursor is a control structure that enables traversal over the records in a database. You can think of it as an iterator or pointer for SQL data retrieval.

In [5]:

cur = conn.cursor()
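
To make the iterator analogy concrete, here is a minimal sketch of looping over a cursor directly instead of calling a fetch method. The helper name `stream_rows` is hypothetical (it is not part of psycopg2); it only assumes that `cur` is an open cursor such as the one created above.

```python
def stream_rows(cur, query, params=None):
    """Run a query and yield its rows one at a time.

    `cur` is assumed to be an open cursor, e.g. the result of conn.cursor().
    """
    cur.execute(query, params)
    # psycopg2 cursors are iterable, so a plain for-loop walks the result set.
    for row in cur:
        yield row
```

Once a table exists, usage might look like `for row in stream_rows(cur, "SELECT * FROM some_table"): ...`.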

## Create a database

In [7]:
cur.execute('CREATE DATABASE lecture')  # will fail


InternalError: current transaction is aborted, commands ignored until end of transaction block


## What happened?!
### Normally psycopg2 runs our statements inside a transaction that can still be rolled back, but database-wide operations such as CREATE DATABASE cannot run inside a transaction block.

### Try again:

In [10]:
conn.close()

conn = pg2.connect(dbname='postgres', user='dan')
conn.set_session(autocommit=True)

cur = conn.cursor()
cur.execute('CREATE DATABASE lecture')
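
If you would rather not leave a connection in autocommit mode permanently, one possible pattern is to switch it on for a single statement and then restore the previous setting. This is only a sketch: `execute_in_autocommit` is a hypothetical helper, and it assumes `conn` is a psycopg2 connection with no transaction currently in progress, since the mode cannot be changed mid-transaction.

```python
def execute_in_autocommit(conn, statement):
    """Run one statement with autocommit on, then restore the previous mode.

    Assumes `conn` is a psycopg2 connection with no open transaction.
    """
    previous = conn.autocommit      # remember the current mode
    conn.autocommit = True          # statements now run outside a transaction block
    try:
        cur = conn.cursor()
        try:
            cur.execute(statement)
        finally:
            cur.close()
    finally:
        conn.autocommit = previous  # put the connection back as we found it
```

For example, `execute_in_autocommit(conn, 'CREATE DATABASE lecture')` has the same effect as the cell above, except that the connection returns to its earlier mode afterwards.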

## Disconnect from the cursor and database (again)

In [11]:
cur.close() # optional, closing connection always closes any associated cursors
conn.close()
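
Closing by hand is easy to forget when a query raises an error. A minimal sketch of one way to guarantee cleanup uses `contextlib.closing` from the standard library. The helper `fetch_all` and its `connect` argument (a zero-argument function returning a connection) are assumptions made for illustration.

```python
from contextlib import closing


def fetch_all(connect, query, params=None):
    """Open a connection, run one query, and close everything afterwards.

    `connect` is assumed to be a zero-argument callable that returns a
    DB-API connection, e.g. lambda: pg2.connect(dbname='lecture', user='dan').
    """
    with closing(connect()) as conn:          # conn.close() runs on exit
        with closing(conn.cursor()) as cur:   # cur.close() runs on exit
            cur.execute(query, params)
            return cur.fetchall()
```

Note that in psycopg2, `with conn:` on its own only ends the transaction (commit on success, rollback on error) and does not close the connection, which is why this sketch uses `closing` instead.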

## Let's use our new database

In [12]:
conn = pg2.connect(database='lecture', user='dan')

In [13]:
cur = conn.cursor()

### Create a new table

In [14]:
query1 = '''
        CREATE TABLE logins (
            userid integer
            , tmstmp timestamp
            , type varchar(10)
        );
        '''



In [15]:

cur.execute(query1)

### Insert CSV into new table

In [16]:
query2 = '''
        COPY logins 
        FROM '/Users/dan/Documents/Galvanize_Repos/DSI_Lectures/sql-python/danwiesenthal/lecture-example/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''


In [17]:

cur.execute(query2)

### Let's take a look at the data

In [18]:
query3 = '''
        SELECT *
        FROM logins
        LIMIT 20;
        '''


In [19]:
cur.execute(query3)

### One line at a time

In [20]:
cur.fetchone()

(579, datetime.datetime(2013, 11, 20, 3, 20, 6), 'mobile')

### Many lines at a time

In [21]:
cur.fetchmany(5)

[(823, datetime.datetime(2013, 11, 20, 3, 20, 49), 'web'),
 (953, datetime.datetime(2013, 11, 20, 3, 28, 49), 'web'),
 (612, datetime.datetime(2013, 11, 20, 3, 36, 55), 'web'),
 (269, datetime.datetime(2013, 11, 20, 3, 43, 13), 'web'),
 (799, datetime.datetime(2013, 11, 20, 3, 56, 55), 'web')]

### Or everything at once

In [22]:
cur.fetchall()

[(890, datetime.datetime(2013, 11, 20, 4, 2, 33), 'mobile'),
 (330, datetime.datetime(2013, 11, 20, 4, 54, 59), 'mobile'),
 (628, datetime.datetime(2013, 11, 20, 4, 57, 22), 'mobile'),
 (398, datetime.datetime(2013, 11, 20, 5, 3, 19), 'mobile'),
 (482, datetime.datetime(2013, 11, 20, 5, 4, 43), 'mobile'),
 (581, datetime.datetime(2013, 11, 20, 5, 12, 3), 'mobile'),
 (370, datetime.datetime(2013, 11, 20, 5, 26, 46), 'mobile'),
 (230, datetime.datetime(2013, 11, 20, 5, 28, 29), 'web'),
 (596, datetime.datetime(2013, 11, 20, 5, 28, 36), 'web'),
 (274, datetime.datetime(2013, 11, 20, 5, 43, 8), 'mobile'),
 (581, datetime.datetime(2013, 11, 20, 5, 47, 10), 'web'),
 (417, datetime.datetime(2013, 11, 20, 5, 54, 37), 'mobile'),
 (185, datetime.datetime(2013, 11, 20, 5, 56, 22), 'mobile'),
 (371, datetime.datetime(2013, 11, 20, 5, 58, 35), 'mobile')]

In [23]:

cur.execute('SELECT Count(*) FROM logins')

In [24]:
cur.fetchall()

[(10000L,)]
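
With 10,000 rows, `fetchall()` is fine, but for bigger results it can be more comfortable to work through the rows in chunks. The following is a small sketch built on `fetchmany()`; the name `iter_batches` and the default batch size are arbitrary choices for illustration.

```python
def iter_batches(cur, batch_size=1000):
    """Yield lists of rows from an already-executed cursor.

    Each list holds at most `batch_size` rows; iteration stops when
    fetchmany() returns an empty list.
    """
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch
```

After `cur.execute('SELECT * FROM logins')`, a loop such as `for batch in iter_batches(cur, 500): ...` processes 500 rows at a time. With psycopg2's default client-side cursor the full result is still transferred when the query executes, so this mainly limits how many Python row objects exist at once.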

# Dynamic Queries

We have 8 login CSV files that we need to insert into the logins table. Instead of writing a COPY FROM query 8 times, we can use Python to generate the queries for us. This works because psycopg2 lets a query contain placeholders, such as %(file_path)s, that are filled in from the values passed to .execute().

In [27]:
# os is needed because we want to dynamically identify the files 
# we need to insert.
import os

In [28]:
query4 = '''
        COPY logins 
        FROM %(file_path)s
        DELIMITER ',' 
        CSV;
        '''

folder_path = '/Users/dan/Documents/Galvanize_Repos/DSI_Lectures/sql-python/danwiesenthal/lecture-example/'


In [29]:
fnames = os.listdir(folder_path)

for fname in fnames:
    path = os.path.join(folder_path, fname)
    cur.execute(query4, {'file_path': path})
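
`os.listdir` returns every entry in the folder, so any stray non-CSV file would also be handed to COPY. A small sketch of a more defensive file list follows; `csv_paths` is a hypothetical helper, and it assumes the files of interest end in `.csv`.

```python
import os


def csv_paths(folder_path):
    """Return the sorted full paths of the .csv files directly inside folder_path."""
    return sorted(
        os.path.join(folder_path, fname)
        for fname in os.listdir(folder_path)
        if fname.lower().endswith('.csv')
    )
```

The loop above could then become `for path in csv_paths(folder_path): cur.execute(query4, {'file_path': path})`. Keep in mind that logins01.csv was already loaded earlier, so skip it if duplicate rows matter.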



# WARNING: BEWARE OF SQL INJECTION

## NEVER use + or % to build query strings that are passed to .execute

In [30]:
num = 579
terribly_unsafe = "SELECT * FROM logins WHERE userid = " + str(num)
print(terribly_unsafe)


date_cut = "2014-08-01"
horribly_risky = "SELECT * FROM logins WHERE tmstmp > %s" % date_cut
print(horribly_risky)
## Python is happy, but if num or date_cut contained something malicious,
## it would become part of the SQL and your data could be at risk

SELECT * FROM logins WHERE userid = 579
SELECT * FROM logins WHERE tmstmp > 2014-08-01
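
For contrast, here is a sketch of the safe pattern: the SQL text keeps a `%s` placeholder and the value is passed as a second argument to `.execute()`, so psycopg2 quotes it rather than pasting it into the string. The helper name `logins_for_user` and the malicious input are made up for illustration.

```python
SAFE_QUERY = "SELECT * FROM logins WHERE userid = %s"


def logins_for_user(cur, userid):
    """Fetch one user's logins, passing the value separately from the SQL."""
    # Note the comma: parameters must be a sequence (or a dict for named ones).
    cur.execute(SAFE_QUERY, (userid,))
    return cur.fetchall()


# What concatenation does with hostile input: the extra statement
# becomes part of the SQL text itself.
malicious = "579; DROP TABLE logins"
unsafe = "SELECT * FROM logins WHERE userid = " + malicious
print(unsafe)  # SELECT * FROM logins WHERE userid = 579; DROP TABLE logins
```

The same applies to the date example: `cur.execute("SELECT * FROM logins WHERE tmstmp > %s", (date_cut,))` lets psycopg2 add the quotes, whereas the unquoted `2014-08-01` in the output above would be read by Postgres as integer arithmetic.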


### Don't forget to commit your changes

In [31]:
conn.commit()  # commit() is a method of the connection, not the cursor
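
Both `commit()` and `rollback()` are methods of the connection, not the cursor. A common way to combine them is sketched below: commit when the statement succeeds, roll back when it fails so the connection is usable again. The helper name `execute_and_commit` is hypothetical.

```python
def execute_and_commit(conn, query, params=None):
    """Run one statement and make it permanent, or undo it on failure."""
    cur = conn.cursor()
    try:
        cur.execute(query, params)
        conn.commit()        # make the change permanent
    except Exception:
        conn.rollback()      # clear the aborted transaction
        raise                # still report the original error
    finally:
        cur.close()
```

Rolling back after a failure is what avoids the "current transaction is aborted" error seen earlier in this notebook.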

## And then close your connection

In [32]:
cur.close()
conn.close()


# Key Things to Know

* Connections must be established using an existing database, a username, the database host (IP/URL), and possibly a password
* If you have no databases of your own yet, you can connect to Postgres using the dbname 'postgres' to create one
* Data changes are not permanent until you commit. This can be done either by calling commit() on the connection or by setting autocommit = True. Until committed, changes exist only inside your open transaction
    - autocommit = True is necessary for database-level commands like CREATE DATABASE, because Postgres cannot run them inside a transaction block.
    - Use .rollback() on the connection if your .execute() command results in an error. (This only undoes changes that have not yet been committed.)
* Database connections use cursors for data traversal and retrieval. A cursor is similar to an iterator in Python.
* Cursor operations typically go like the following:
    - execute a query
    - fetch rows from query result if it is a SELECT query
    - because it is iterative, previously fetched rows can only be fetched again by rerunning the query
    - close cursor through .close()
* Cursors and connections should be closed using .close(). A connection left open with an unfinished transaction can hold locks that block certain operations on the database/tables until the connection is severed.
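
As a small illustration of the first point, connection settings do not have to be hard-coded in the notebook. The sketch below collects them from environment variables; `connection_kwargs` is a hypothetical helper, the variable names are the standard libpq ones, and the keys match keyword arguments that `psycopg2.connect()` accepts.

```python
import os


def connection_kwargs(environ=os.environ):
    """Build psycopg2.connect() keyword arguments from libpq-style variables.

    Variables that are unset or empty are simply left out.
    """
    mapping = {
        'dbname': 'PGDATABASE',
        'user': 'PGUSER',
        'password': 'PGPASSWORD',
        'host': 'PGHOST',
        'port': 'PGPORT',
    }
    return {key: environ[var] for key, var in mapping.items() if environ.get(var)}
```

Usage would be `conn = pg2.connect(**connection_kwargs())`. libpq also reads these variables on its own when a parameter is omitted, so the explicit dictionary mainly makes the settings visible and keeps passwords out of the code.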


## And don't leave yourself vulnerable to SQL injection!
http://xkcd.com/327/