# Using PostgreSQL in Python (with Psycopg2)

## Problem: 
### Your data is in a SQL database, but your machine learning tools are in Python.

---------

## Solution:
### Run SQL queries from Python

* Very useful for scaled data pipelines, pre-cleaning, data exploration
* Allows for dynamic query generation

---------

### Psycopg2
A library that allows Python to connect to an existing PostgreSQL database to utilize SQL functionality.


### Documentation
* http://initd.org/psycopg/docs/install.html

# Objectives

- Learn how to connect to and run Postgres queries from Python
- Understand cursors, executes, and commits
- Learn how to generate dynamic queries (while staying CAREFUL about SQL injection, covered below)

# Basic usage
```
query = "SELECT * FROM some_table;"
cursor.execute(query)
results = cursor.fetchall()
```

# Creating a connection with Postgres

### Import

In [None]:
import psycopg2 as pg2

### Create connection with Postgres
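The connection in the next cell relies on local defaults (no host, port, or password). As a minimal sketch for other setups, the helper below gathers keyword arguments for connect() from environment variables. The helper name connection_kwargs is hypothetical. The variable names (PGDATABASE, PGUSER, PGPASSWORD, PGHOST, PGPORT) are assumed to follow the usual libpq convention, and dbname, user, password, host, and port are keywords that psycopg2's connect() accepts.

```python
import os


def connection_kwargs(env=None):
    """Collect psycopg2.connect() keyword arguments from environment variables.

    connection_kwargs is a hypothetical helper, not part of psycopg2.
    Only variables that are actually set end up in the result.
    """
    env = os.environ if env is None else env
    mapping = {
        "dbname": "PGDATABASE",
        "user": "PGUSER",
        "password": "PGPASSWORD",
        "host": "PGHOST",
        "port": "PGPORT",
    }
    return {key: env[var] for key, var in mapping.items() if var in env}


# Usage (requires psycopg2 and a reachable server):
# import psycopg2 as pg2
# conn = pg2.connect(**connection_kwargs())
```

Keeping the password in the environment rather than in the notebook avoids committing credentials along with the code.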

In [None]:
conn = pg2.connect(dbname='postgres', user='tim.zeiske')

### Retrieve the Cursor

* A cursor is a control structure that enables traversal over the records in a database. You can think of it as an iterator or pointer for SQL data retrieval.

In [None]:
cur = conn.cursor()

## Create a database

In [None]:
cur.execute('CREATE DATABASE lecture')

## What happened?!
### psycopg2 normally wraps statements in a transaction that stays pending until you commit, but CREATE DATABASE cannot run inside a transaction block, so Postgres raises an error.

### Try again:

In [None]:
conn.close()

conn = pg2.connect(dbname='postgres', user='tim.zeiske')
conn.set_session(autocommit=True)  # run each statement immediately, outside a transaction block

cur = conn.cursor()
cur.execute('CREATE DATABASE lecture')

## Close the cursor and the connection (again)

In [None]:
cur.close() # optional, closing connection always closes any associated cursors
conn.close()

## Let's use our new database

In [None]:
conn = pg2.connect(dbname='lecture', user='tim.zeiske')
cur = conn.cursor()

### Create a new table

In [None]:
query1 = '''
        CREATE TABLE logins (
            userid integer
            , tmstmp timestamp
            , type varchar(10)
        );
        '''



In [None]:
cur.execute(query1)

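Open psql in a new terminal window and check whether the table was created!
It is not there. Why not? Because autocommit = False by default, the CREATE TABLE is still pending in an open transaction.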

In [None]:
conn.commit()

Is it there now?
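
The same effect can be reproduced without a Postgres server. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (it is not psycopg2, and its transaction rules differ in detail): one connection plays the notebook, a second plays psql, and the pending change here is an INSERT rather than the CREATE TABLE above. The names writer, reader, and demo.db are made up for the illustration.

```python
import os
import sqlite3
import tempfile

# sqlite3 is used ONLY as a stand-in so this runs without a Postgres server.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")
    writer = sqlite3.connect(db_path)  # plays the role of the notebook
    reader = sqlite3.connect(db_path)  # plays the role of psql

    writer.execute("CREATE TABLE logins (userid INTEGER)")
    writer.commit()

    # The INSERT stays pending in the writer's open transaction.
    writer.execute("INSERT INTO logins VALUES (1)")
    seen_before_commit = reader.execute("SELECT COUNT(*) FROM logins").fetchall()

    # After the commit, the second connection can see the row.
    writer.commit()
    seen_after_commit = reader.execute("SELECT COUNT(*) FROM logins").fetchall()

    writer.close()
    reader.close()

print(seen_before_commit, seen_after_commit)
```

The second connection should count zero rows before the commit and one row after it, which mirrors what psql shows for the logins table.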

### Insert csv into new table

In [None]:
query2 = '''
        COPY logins 
        FROM '/Users/tim.zeiske/github/gschool/DSI_Lectures/sql-python/tzeiske/lecture-example/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''


In [None]:
cur.execute(query2)

### Let's take a look at the data

In [None]:
query3 = '''
        SELECT *
        FROM logins
        LIMIT 20;
        '''


In [None]:
cur.execute(query3)

### One line at a time

In [None]:
cur.fetchone()

### Many lines at a time

In [None]:
cur.fetchmany(3)

What happens if you run this again? Remember this is an iterator!
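
To see how the fetch methods consume a result set, here is a small self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in so it runs without a Postgres server; the fetch methods come from the same DB-API that psycopg2 follows, but note that sqlite3 uses ? as its placeholder where psycopg2 uses %s. The six-row table is invented for the demonstration.

```python
import sqlite3

# sqlite3 is a stand-in here; the fetch methods follow the same DB-API as psycopg2.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logins (userid INTEGER)")
cur.executemany("INSERT INTO logins VALUES (?)", [(i,) for i in range(1, 7)])

cur.execute("SELECT userid FROM logins ORDER BY userid")
first = cur.fetchone()         # the first row only
next_three = cur.fetchmany(3)  # continues where fetchone stopped
rest = cur.fetchall()          # whatever is left
after_end = cur.fetchone()     # nothing left to fetch

print(first, next_three, rest, after_end)
conn.close()
```

Each call picks up where the previous one stopped, so rerunning a fetch never returns the same rows twice. To read them again, execute the query again.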

### Or everything at once

In [None]:
cur.fetchall()

In [None]:
query4 = query3
cur.execute(query4)
results = cur.fetchall()
results1 = cur.fetchone()

In [None]:
results

In [None]:
results1
# None: fetchall() already used up every row

In [None]:
cur.execute('SELECT Count(*) FROM logins')
cur.fetchall()

# Dynamic Queries

We have 8 login CSV files that we need to insert into the logins table. Instead of writing a COPY FROM query 8 times, we can let Python generate the queries. This works through query parameters: placeholders such as %(file_path)s that psycopg2 fills in from the values passed to .execute.
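
One practical detail when looping over a folder: os.listdir returns every entry in no guaranteed order, including hidden files such as .DS_Store that COPY cannot parse. A minimal sketch of a filter, assuming the data files all end in .csv (csv_paths is a hypothetical helper name):

```python
import os


def csv_paths(folder_path):
    """Return the full paths of the .csv files in folder_path, sorted by name.

    csv_paths is a hypothetical helper; it skips sub-folders and any
    file whose name does not end in .csv.
    """
    paths = []
    for name in sorted(os.listdir(folder_path)):
        path = os.path.join(folder_path, name)
        if name.lower().endswith(".csv") and os.path.isfile(path):
            paths.append(path)
    return paths


# Usage, assuming folder_path, cur and the parameterized query exist:
# for path in csv_paths(folder_path):
#     cur.execute(query4, {'file_path': path})
```

Note that logins01.csv, loaded earlier, sits in the same folder, so a loop over every file loads it a second time unless it is skipped.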

In [None]:
# os is needed because we want to dynamically identify the files 
# we need to insert.
import os

In [None]:
query4 = '''
        COPY logins 
        FROM %(file_path)s
        DELIMITER ',' 
        CSV;
        '''

folder_path = '/Users/tim.zeiske/github/gschool/DSI_Lectures/sql-python/tzeiske/lecture-example'


In [None]:
fnames = os.listdir(folder_path)

for fname in fnames:
    path = os.path.join(folder_path, fname)
    cur.execute(query4, {'file_path': path})

Check in psql whether the data is in the logins table.
Not there? The inserts are still pending, so commit:

In [None]:
conn.commit()
cur.execute('SELECT Count(*) FROM logins')
cur.fetchall()

# WARNING: BEWARE OF SQL INJECTION

## NEVER use + or % to reformat strings to be used with .execute
## ESPECIALLY if strings are generated from user input
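The safe pattern, in contrast to the string building shown in the next cell, is to keep the SQL text fixed and hand the values to .execute as a separate second argument, so that psycopg2 does the quoting. A minimal sketch follows; find_logins and SAFE_QUERY are hypothetical names, and cur is assumed to be an open psycopg2 cursor.

```python
SAFE_QUERY = "SELECT * FROM logins WHERE userid = %s"


def find_logins(cur, userid):
    """Look up logins with userid passed as data, never spliced into the SQL.

    find_logins is a hypothetical helper; cur is assumed to be an open
    psycopg2 cursor.
    """
    # The second argument must be a sequence (or dict), hence the 1-tuple.
    cur.execute(SAFE_QUERY, (userid,))
    return cur.fetchall()


# Usage, assuming cur is an open psycopg2 cursor:
# rows = find_logins(cur, 579)
```

With an input like '579; DROP TABLE logins;' the whole string is sent as one quoted value, so Postgres should reject it as an invalid integer instead of running the DROP TABLE.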

In [None]:
num = '579; DROP TABLE logins;'
terribly_unsafe = "SELECT * FROM logins WHERE userid = " + num
print(terribly_unsafe)


date_cut = "2014-08-01"
horribly_risky = "SELECT * FROM logins WHERE tmstmp > %s" % date_cut
print(horribly_risky)
# Python builds both strings without complaint, but anything malicious in
# num or date_cut becomes part of the SQL, putting your data at risk

In [None]:
cur.execute(terribly_unsafe)  # runs the SELECT and then the injected DROP TABLE

In [None]:
cur.fetchall()

### Don't forget to commit your changes

In [None]:
conn.commit()

## And then close your connection

In [None]:
cur.close()
conn.close()

# Key Things to Know

* Connections must be established using an existing database, username, database IP/URL, and maybe passwords
* If you have no existing databases, you can connect to Postgres using the dbname 'postgres' to initialize one
* Data changes are not permanent until you commit them, either by calling commit() on the connection or by setting autocommit = True. Until committed, changes exist only inside your own pending transaction
    - autocommit = True is necessary for database-level commands like CREATE DATABASE, because Postgres refuses to run them inside a transaction block.
    - Use .rollback() on the connection if your .execute() command results in an error. This rolls back to the start of the pending transaction, so it only undoes changes that have not yet been committed.
    - Closing a connection without committing the changes first causes an implicit rollback.
* Database connections use cursors to traverse and retrieve query results. A cursor behaves much like an iterator in Python.
* Cursor operations typically go like the following:
    - execute a query
    - fetch rows from query result if it is a SELECT query
    - because it is iterative, previously fetched rows can only be fetched again by rerunning the query
    - close cursor through .close()
* Close cursors and connections with .close() when you are done. A connection left open with an uncommitted transaction can hold locks that block other sessions from certain operations on the same tables until the connection is closed.
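
The execute, commit, rollback, and close steps above can be combined into one small helper. This is only a sketch: execute_and_commit is a hypothetical name, and conn is assumed to be an open psycopg2 connection with autocommit left at its default of False.

```python
def execute_and_commit(conn, query, params=None):
    """Run one statement; commit on success, roll back on any error.

    execute_and_commit is a hypothetical helper; conn is assumed to be
    an open psycopg2 connection with autocommit turned off.
    """
    cur = conn.cursor()
    try:
        cur.execute(query, params)
        conn.commit()
    except Exception:
        conn.rollback()  # discard the failed, still-pending transaction
        raise            # let the caller see the original error
    finally:
        cur.close()      # the cursor is closed on success and on failure


# Usage, assuming conn is an open psycopg2 connection:
# execute_and_commit(conn, "DELETE FROM logins WHERE userid = %s", (579,))
```

Rolling back before re-raising leaves the connection usable for the next statement instead of stuck in a failed transaction.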


## And don't leave yourself vulnerable to SQL injection!
http://xkcd.com/327/