# Using PostgreSQL in Python (Psycopg2) to build data pipelines

### What is Psycopg2?
A PostgreSQL database adapter for Python: a library that lets Python code connect to an existing Postgres database and run SQL against it.

### Why
- Very useful for scaled data pipelines, pre-cleaning, data exploration
- Allows for dynamic query generation

### Documentation
* http://initd.org/psycopg/docs/install.html

# Objectives

- Learn how to connect to and run Postgres queries from Python
- Understand cursors, executes, and commits
- Learn how to generate dynamic queries

# Key Things to Know

* A connection requires the name of an existing database and a username; depending on the server setup it may also need the database host (IP/URL), a port, and a password
* If you have no existing databases, you can connect to Postgres using the dbname 'postgres' to initialize one
* Data changes are not permanent until you commit them. This can be done either by calling commit() on the connection or by setting autocommit = True.  Until committed, changes exist only inside the current transaction and are discarded if you roll back or the connection closes
    - autocommit = True is necessary for database-level commands like CREATE DATABASE.  This is because Postgres does not allow CREATE DATABASE to run inside a transaction block.
* Database connections use cursors to run queries and to traverse and retrieve the results.  A cursor is somewhat like an iterator in Python.
* Cursor operations typically go like the following:
    - execute a query
    - fetch rows from query result if it is a SELECT query
    - because it is iterative, previously fetched rows can only be fetched again by rerunning the query
    - close cursor through .close()
* Cursors and connections should be closed using .close(). A connection left open with an uncommitted transaction can hold locks that block certain operations on the database/tables until the connection is severed.
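
The cursor steps listed above can be sketched as one small helper. This is a minimal sketch, not a psycopg2 feature: `fetch_rows` is a hypothetical name, and `conn` is assumed to be an open connection such as one returned by `psycopg2.connect`.

```python
from contextlib import closing


def fetch_rows(conn, query, params=None):
    """Run a SELECT and return all rows, always closing the cursor.

    Hypothetical helper; `conn` is assumed to be an open DB-API connection.
    """
    # closing() calls cur.close() even if execute() or fetchall() raises.
    with closing(conn.cursor()) as cur:
        cur.execute(query, params)
        return cur.fetchall()
```

A call could look like `fetch_rows(conn, 'SELECT * FROM logins WHERE userid = %s', (579,))`. The connection itself stays open, so it still has to be closed separately.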


# Installation

In [None]:
!conda install psycopg2

# Creating a connection with Postgres

### Import

In [8]:
import psycopg2 as pg2

### Create connection with Postgres

In [17]:
conn = pg2.connect(database='postgres', user='brad')

### Retrieve the Cursor

* A cursor is a control structure that enables traversal over the records in a database.  You can think of it as an iterator or pointer for SQL data retrieval.

In [18]:
conn.set_session(autocommit=True)
cur = conn.cursor()

## Create a database

In [19]:
cur.execute('CREATE DATABASE lecture')
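
Running `CREATE DATABASE lecture` a second time fails because the database already exists. A minimal sketch of checking first, assuming `cur` is the cursor from above: `database_exists` is a hypothetical helper, while `pg_database` is the Postgres system catalog that lists databases.

```python
def database_exists(cur, name):
    """Return True if a database called `name` already exists.

    Hypothetical helper; `cur` is assumed to be an open psycopg2 cursor.
    """
    # The name is passed as a parameter, never pasted into the SQL string.
    cur.execute('SELECT 1 FROM pg_database WHERE datname = %s', (name,))
    return cur.fetchone() is not None


# Possible usage with the cursor from above:
# if not database_exists(cur, 'lecture'):
#     cur.execute('CREATE DATABASE lecture')
```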


## Disconnect from the cursor and database

In [20]:
cur.close()
conn.close()

## Let's use our new database

In [21]:
conn = pg2.connect(database='lecture', user='brad')

In [22]:
cur = conn.cursor()

### Create a new table

In [23]:
query1 = '''
        CREATE TABLE logins (
            userid integer
            , tmstmp timestamp
            , type varchar(10)
        );
        '''



In [24]:
cur.execute(query1)

### Insert csv into new table

In [25]:
query2 = '''
        COPY logins 
        FROM '/Users/brad/Dropbox/Galvanize/sql-python/mh-lecture/data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''


In [26]:
cur.execute(query2)
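
`COPY ... FROM '<path>'` makes the Postgres server read the file, so the path has to be readable by the server process. When the CSV lives on the machine running Python instead, psycopg2's `copy_expert` can stream it through the connection. A minimal sketch, assuming `cur` is the cursor from above and the file has no header row; `copy_logins_csv` is a hypothetical name.

```python
def copy_logins_csv(cur, csv_path):
    """Stream a local CSV file into the logins table via COPY ... FROM STDIN.

    Hypothetical helper; assumes an open psycopg2 cursor and a headerless CSV
    whose columns match logins (userid, tmstmp, type).
    """
    with open(csv_path, 'r') as f:
        # copy_expert sends the file contents to the server as COPY's STDIN.
        cur.copy_expert('COPY logins FROM STDIN WITH CSV', f)
```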

### Let's take a look at the data

In [27]:
query3 = '''
        SELECT *
        FROM logins
        LIMIT 20;
        '''


In [29]:
cur.execute(query3)

### One line at a time

In [30]:
cur.fetchone()

(579, datetime.datetime(2013, 11, 20, 3, 20, 6), 'mobile')

### Many lines at a time

In [31]:
cur.fetchmany(5)

[(823, datetime.datetime(2013, 11, 20, 3, 20, 49), 'web'),
 (953, datetime.datetime(2013, 11, 20, 3, 28, 49), 'web'),
 (612, datetime.datetime(2013, 11, 20, 3, 36, 55), 'web'),
 (269, datetime.datetime(2013, 11, 20, 3, 43, 13), 'web'),
 (799, datetime.datetime(2013, 11, 20, 3, 56, 55), 'web')]
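
Calling `fetchmany` repeatedly is a common way to work through a result in fixed-size chunks. A minimal sketch, assuming `cur` has already executed a SELECT; `iter_batches` and its default batch size are hypothetical choices.

```python
def iter_batches(cur, batch_size=1000):
    """Yield lists of rows from an already-executed cursor, batch by batch.

    Hypothetical helper; `cur` is assumed to be a cursor on which execute()
    has already been called with a SELECT query.
    """
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:  # an empty list means the result set is exhausted
            break
        yield rows
```

Each batch is a list of tuples in the same shape as the `fetchmany(5)` output above, so `for batch in iter_batches(cur, 5):` would process five rows at a time.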

### Or everything at once

In [34]:
cur.execute('SELECT Count(*) FROM logins')

In [35]:
cur.fetchall()

[(10000L,)]
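
Fetched rows are plain tuples, so column names are not attached to the values. A minimal sketch of pairing them up, assuming `cur` has already executed a SELECT; `rows_as_dicts` is a hypothetical helper that relies on `cur.description`, whose entries start with the column name.

```python
def rows_as_dicts(cur):
    """Return the remaining rows of an executed cursor as dictionaries.

    Hypothetical helper; the first item of each cur.description entry
    is the column name.
    """
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]
```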

# Dynamic Queries

We have 8 login CSV files that we need to insert into the logins table.  Instead of writing a COPY FROM query 8 times, we should utilize Python to make this more efficient.  This is possible because of parameterized queries, where placeholders in the query string are filled in when the query is executed.

In [36]:
# os is needed because we want to dynamically identify the files 
# we need to insert.
import os

In [37]:
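from psycopg2 import sql

# A table name is an identifier, not a value: passed as a query parameter it
# would be quoted as a string literal and the COPY would fail with a syntax
# error. psycopg2.sql (psycopg2 >= 2.7) composes identifiers safely.
query4 = sql.SQL('''
        COPY {table}
        FROM %(file_path)s
        DELIMITER ','
        CSV;
        ''').format(table=sql.Identifier('logins'))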

folder_path = '/Users/brad/Dropbox/Galvanize/sql-python/mh-lecture/data/'

In [39]:
fnames = os.listdir(folder_path)

for fname in fnames:
    path = os.path.join(folder_path, fname)
    cur.execute(query4, {'file_path': path, 'table_name': 'logins'})
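
`os.listdir` returns every entry in the folder, including hidden files such as `.DS_Store` and any subdirectories, and a COPY on one of those would fail. A minimal sketch of keeping only CSV files; `csv_paths` is a hypothetical helper, not part of the original notebook.

```python
import os


def csv_paths(folder_path):
    """Return the full paths of the .csv files in folder_path, sorted by name.

    Hypothetical helper; skips anything that is not a regular .csv file.
    """
    paths = []
    for fname in sorted(os.listdir(folder_path)):
        path = os.path.join(folder_path, fname)
        if fname.lower().endswith('.csv') and os.path.isfile(path):
            paths.append(path)
    return paths


# Possible usage with the loop above:
# for path in csv_paths(folder_path):
#     cur.execute(query4, {'file_path': path, 'table_name': 'logins'})
```

Note that `logins01.csv` sits in the same folder and was already loaded earlier, so a loop over every file would insert its rows a second time unless that file is skipped.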



# WARNING: BEWARE OF SQL INJECTION

## NEVER use + or % to reformat strings to be used with .execute

In [91]:
num = 579
terribly_unsafe = "SELECT * FROM logins WHERE userid = " + str(num)
print(terribly_unsafe)


date_cut = "2014-08-01"
horribly_risky = "SELECT * FROM logins WHERE tmstmp > %s" % date_cut
print(horribly_risky)
## Python is happy, but if num or date_cut included something malicious,
## your data could be at risk

SELECT * FROM logins WHERE userid = 579
SELECT * FROM logins WHERE tmstmp > 2014-08-01
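
To make the risk concrete, the sketch below feeds a hostile value through the same string concatenation and then shows the parameterized alternative. The function names and the `malicious` value are hypothetical; `safe_select` assumes `cur` is an open psycopg2 cursor.

```python
def unsafe_query(userid):
    # Anti-pattern, shown only to make the risk visible: the input becomes SQL.
    return "SELECT * FROM logins WHERE userid = " + str(userid)


def safe_select(cur, userid):
    """Hypothetical helper: pass the value separately so the driver quotes it."""
    cur.execute("SELECT * FROM logins WHERE userid = %s", (userid,))
    return cur.fetchall()


malicious = "579; DROP TABLE logins"
print(unsafe_query(malicious))
# prints: SELECT * FROM logins WHERE userid = 579; DROP TABLE logins
```

In `safe_select` the value never becomes part of the SQL text that Python builds, so a hostile string is treated as data and not as a second statement.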


### Don't forget to commit your changes

In [None]:
conn.commit()  # commit() belongs to the connection, not the cursor
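
A commit is usually paired with a rollback so that a failure halfway through does not leave partial changes pending. A minimal sketch, assuming `conn` is an open connection with autocommit turned off; `run_in_transaction` is a hypothetical helper.

```python
def run_in_transaction(conn, statements):
    """Execute (query, params) pairs and commit only if all of them succeed.

    Hypothetical helper; `conn` is assumed to be an open connection with
    autocommit turned off.
    """
    cur = conn.cursor()
    try:
        for query, params in statements:
            cur.execute(query, params)
        conn.commit()      # make every change permanent at once
    except Exception:
        conn.rollback()    # undo the uncommitted changes of this transaction
        raise
    finally:
        cur.close()
```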

## And then close your connection

In [45]:
cur.close()
conn.close()


## And don't leave yourself vulnerable to SQL injection!
http://xkcd.com/327/