# Python + SQL
## Using psycopg2


## Problem:
### What if your data is in a SQL database,
### but your machine learning is built in Python?

## Solution: 
### Run SQL queries from Python

* Very useful for scaled data pipelines, pre-cleaning, data exploration
* Allows for dynamic query generation

## psycopg2

* psycopg2 is a Python module (a PostgreSQL database adapter) that lets you run queries against Postgres databases from Python
* Docs: http://initd.org/psycopg/docs/install.html

In [None]:
# basic usage (assumes `cursor` was created from an open connection, as in the walkthrough)
query = "SELECT * FROM some_table;"
cursor.execute(query)
results = cursor.fetchall()

# Lecture Objectives

- Learn how to connect to and run Postgres queries from Python
- Understand psycopg2's cursors, executes, and commits
- Learn how to generate dynamic queries

## Walkthrough

### 1. Connect to the database

In [2]:
import psycopg2 as pg2
conn = pg2.connect(dbname='postgres', user='gSchool', host='localhost')

### 2. Instantiate the Cursor

* A cursor is a control structure that enables traversal over the records in a database.  You can think of it as an iterator or pointer for SQL data retrieval.

In [4]:
cur = conn.cursor()

### 2. Create a database

In [8]:
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
cur.execute('CREATE DATABASE test;')

### 3. Create a table

In [4]:
query = '''
        CREATE TABLE logins (
            userid integer
            , tmstmp timestamp
            , type varchar(10)
        );
        '''
cur.execute(query)

### 4. Insert csv into table

In [13]:
query = '''
        COPY logins 
        FROM '/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''
cur.execute(query)

### 5. Select from table

In [14]:
query = '''
        SELECT *
        FROM logins
        LIMIT 30;
        '''
cur.execute(query)

In [15]:
# fetchone() to get one row of data (like calling next() on an iterator)
cur.fetchone()

(579, datetime.datetime(2013, 11, 20, 3, 20, 6), 'mobile')

In [16]:
# fetchmany(n) to get n rows
cur.fetchmany(10)

[(823, datetime.datetime(2013, 11, 20, 3, 20, 49), 'web'),
 (953, datetime.datetime(2013, 11, 20, 3, 28, 49), 'web'),
 (612, datetime.datetime(2013, 11, 20, 3, 36, 55), 'web'),
 (269, datetime.datetime(2013, 11, 20, 3, 43, 13), 'web'),
 (799, datetime.datetime(2013, 11, 20, 3, 56, 55), 'web'),
 (890, datetime.datetime(2013, 11, 20, 4, 2, 33), 'mobile'),
 (330, datetime.datetime(2013, 11, 20, 4, 54, 59), 'mobile'),
 (628, datetime.datetime(2013, 11, 20, 4, 57, 22), 'mobile'),
 (398, datetime.datetime(2013, 11, 20, 5, 3, 19), 'mobile'),
 (482, datetime.datetime(2013, 11, 20, 5, 4, 43), 'mobile')]

In [17]:
# fetchall() to get all remaining rows
cur.fetchall()

[(581, datetime.datetime(2013, 11, 20, 5, 12, 3), 'mobile'),
 (370, datetime.datetime(2013, 11, 20, 5, 26, 46), 'mobile'),
 (230, datetime.datetime(2013, 11, 20, 5, 28, 29), 'web'),
 (596, datetime.datetime(2013, 11, 20, 5, 28, 36), 'web'),
 (274, datetime.datetime(2013, 11, 20, 5, 43, 8), 'mobile'),
 (581, datetime.datetime(2013, 11, 20, 5, 47, 10), 'web'),
 (417, datetime.datetime(2013, 11, 20, 5, 54, 37), 'mobile'),
 (185, datetime.datetime(2013, 11, 20, 5, 56, 22), 'mobile'),
 (371, datetime.datetime(2013, 11, 20, 5, 58, 35), 'mobile'),
 (133, datetime.datetime(2013, 11, 20, 5, 59, 7), 'web'),
 (621, datetime.datetime(2013, 11, 20, 6, 1, 46), 'web'),
 (306, datetime.datetime(2013, 11, 20, 6, 3, 23), 'mobile'),
 (509, datetime.datetime(2013, 11, 20, 6, 4, 43), 'web'),
 (505, datetime.datetime(2013, 11, 20, 6, 9, 52), 'web'),
 (678, datetime.datetime(2013, 11, 20, 6, 34, 18), 'web'),
 (889, datetime.datetime(2013, 11, 20, 6, 36, 32), 'mobile'),
 (202, datetime.datetime(2013, 11, 20, 
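
The three fetch methods all consume the same result set, so each call picks up where the previous one stopped. The sketch below illustrates this with the standard-library `sqlite3` module standing in for psycopg2 (an assumption made so the example runs without a Postgres server; both modules follow the Python DB-API cursor interface). The five sample rows are taken from the output above, and note that `sqlite3` uses `?` placeholders where psycopg2 uses `%s`.

```python
import sqlite3

# sqlite3 is a stand-in for psycopg2 here; the cursor methods behave alike.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logins (userid integer, type varchar(10))")
cur.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [(579, "mobile"), (823, "web"), (953, "web"), (612, "web"), (269, "web")],
)

cur.execute("SELECT userid, type FROM logins ORDER BY rowid")
first_row = cur.fetchone()    # the first row only
next_rows = cur.fetchmany(2)  # the two rows after it
remaining = cur.fetchall()    # everything that has not been fetched yet
leftover = cur.fetchall()     # result set is exhausted: empty list
after_end = cur.fetchone()    # result set is exhausted: None

print(first_row, next_rows, remaining, leftover, after_end)
conn.close()
```

To read the rows again you would have to call `cur.execute(...)` with the SELECT once more.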

### 6. Rollback pending transaction on error

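* conn.rollback() undoes every change made through cursor.execute(query) since the last commit

* Only works if the transaction has not been committed yet

* After a failed query, Postgres rejects further commands on that connection until the transaction is rolled back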

In [1]:
import psycopg2 as pg2
conn = pg2.connect(dbname='socialmedia', user='gSchool', host='localhost')
cur = conn.cursor()

In [2]:
query = '''UPDATE logins
SET type = asdf
WHERE userid = 1234567;'''
cur.execute(query)

ProgrammingError: column "asdf" does not exist
LINE 2: SET type = asdf
                   ^


In [3]:
conn.rollback()

In [None]:
# for example
try:
    cur.execute(query) # attempt to execute query
except pg2.Error: # base class for psycopg2 errors; a narrower type like pg2.ProgrammingError also works
    conn.rollback() # rollback the pending transaction
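
To see what a rollback actually discards, here is a small runnable sketch. It again uses `sqlite3` as a stand-in for psycopg2 (an assumption, so that no Postgres server is needed). One difference worth knowing: Postgres refuses further commands after the failed query until you roll back, while SQLite does not, but in both cases `rollback()` throws away every uncommitted change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logins (userid integer, type varchar(10))")
cur.execute("INSERT INTO logins VALUES (579, 'mobile')")
conn.commit()  # this row is now permanent

cur.execute("INSERT INTO logins VALUES (823, 'web')")  # pending, not committed
try:
    # Same mistake as above: asdf is unquoted, so it is read as a column name.
    cur.execute("UPDATE logins SET type = asdf WHERE userid = 579")
    conn.commit()
except sqlite3.Error as err:
    error_message = str(err)
    conn.rollback()  # discards the pending INSERT as well

cur.execute("SELECT userid FROM logins")
rows_after_rollback = cur.fetchall()
print(error_message, rows_after_rollback)
conn.close()
```

Only the committed row survives; the second INSERT was part of the rolled-back transaction.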

### 7.  Close cursor and connection

In [9]:
cur.close() # Optional since you can just close the connection
conn.close()
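
If you tend to forget the closing step, `contextlib.closing` from the standard library can do it for you, even when an error is raised inside the block. A minimal sketch, with `sqlite3` standing in for psycopg2 (an assumption; `closing` works with any object that has a `.close()` method):

```python
import sqlite3
from contextlib import closing

# closing(...) calls .close() when the with-block ends, error or not.
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT 1")
        value = cur.fetchone()[0]

# The connection is closed now, so using it again raises an error.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(value, still_open)
```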

# Dynamic Queries

* A dynamic query is a query constructed based on context

In [None]:
# example
for table in tables:
    query = "SELECT * FROM {table}".format(table=table) 
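
Because `str.format` pastes text into the query without any escaping, it is worth checking what goes in. The sketch below shows one possible guard: an allow-list of table names. `ALLOWED_TABLES` and `build_select` are hypothetical names made up for this illustration, not part of psycopg2.

```python
# Hypothetical allow-list: only these table names may appear in a query.
ALLOWED_TABLES = {"logins", "users"}


def build_select(table, limit=None):
    """Build a SELECT query for an allowed table, with an optional LIMIT."""
    if table not in ALLOWED_TABLES:
        raise ValueError("unknown table: {0!r}".format(table))
    query = "SELECT * FROM {table}".format(table=table)
    if limit is not None:
        # int() rejects anything that is not a plain number.
        query += " LIMIT {0:d}".format(int(limit))
    return query + ";"


for table in sorted(ALLOWED_TABLES):
    print(build_select(table, limit=30))
```

A name outside the allow-list, such as `"logins; DROP TABLE logins"`, raises `ValueError` instead of reaching the database.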

* We have 8 csv files to be inserted into the logins table.
* Instead of writing 8 COPY FROM queries, we'll construct the queries dynamically.

In [19]:
import os

query = '''
        COPY logins 
        FROM {file_path} 
        DELIMITER ',' 
        CSV;
        '''

folder_path = '/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/'

for f in os.listdir(folder_path):
    # logins01.csv was already loaded earlier in the walkthrough
    if f.endswith('.csv') and f != 'logins01.csv':
        file_path = "'{0}'".format(os.path.join(folder_path, f))
        cur.execute(query.format(file_path=file_path))
        print(file_path + 'inserted')

conn.commit()  # make the inserted rows permanent

'/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins02.csv'inserted
'/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins03.csv'inserted
'/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins04.csv'inserted
'/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins05.csv'inserted
'/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins06.csv'inserted
'/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins07.csv'inserted
'/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins08.csv'inserted
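
One way to make this pattern easier to check is to separate building the queries from executing them, so you can print and inspect the SQL before it touches the database. The sketch below assumes a hypothetical helper `build_copy_queries` and a throwaway temp folder with empty files in place of the real data directory; nothing is sent to Postgres.

```python
import os
import tempfile

COPY_TEMPLATE = "COPY {table} FROM '{path}' DELIMITER ',' CSV;"


def build_copy_queries(folder_path, table, skip=()):
    """Return one COPY query per .csv file in folder_path, in name order."""
    queries = []
    for name in sorted(os.listdir(folder_path)):
        if name.endswith(".csv") and name not in skip:
            path = os.path.join(folder_path, name)
            # Double any single quote so the path stays one SQL string literal.
            safe_path = path.replace("'", "''")
            queries.append(COPY_TEMPLATE.format(table=table, path=safe_path))
    return queries


# Demo on a temporary folder holding empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("logins01.csv", "logins02.csv", "logins03.csv", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    queries = build_copy_queries(demo_dir, "logins", skip=("logins01.csv",))

for q in queries:
    print(q)
```

With a real connection you would then loop over `queries`, call `cur.execute(q)` for each one, and commit at the end.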


# Key Things to Know

* Connections must be established using an existing database, a username, the database host (IP/URL), and possibly a password
* If you have not created any databases yet, you can connect to Postgres using the dbname 'postgres' and run database-level commands from there
* Data changes are not actually stored until you commit. This can be done either by calling conn.commit() or by setting conn.autocommit = True.  Until committed, the changes made in a transaction are only stored temporarily.
* autocommit = True is necessary for database-level commands like CREATE DATABASE.  This is because Postgres does not allow such commands to run inside a transaction block.
* If you ever need to build similar pipelines for other kinds of databases, there are libraries such as pyodbc that operate in essentially the same way
* Database connections use cursors for data traversal and retrieval.  A cursor is a bit like an iterator in Python.
* Cursor operations typically go as follows:
    - execute a query
    - fetch rows from the query result if it is a SELECT query
    - because fetching is iterative, previously fetched rows can only be fetched again by rerunning the query
    - close the cursor through .close()
* Cursors and connections should be closed using .close(); otherwise an open transaction can keep locks on the database/tables until the connection is severed.
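
The commit rule above is easy to verify for yourself. The sketch below uses `sqlite3` with a temporary database file as a stand-in for psycopg2 and Postgres (an assumption, so the example runs anywhere); the helper `count_logins` is a hypothetical name for this illustration. It inserts a row and closes the connection without committing, then does the same with a commit, and counts the rows each time from a fresh connection.

```python
import os
import sqlite3
import tempfile


def count_logins(db_path):
    """Open a fresh connection and count the rows in logins."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM logins").fetchall()[0][0]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "demo.db")

    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE logins (userid integer, type varchar(10))")
    conn.commit()
    conn.execute("INSERT INTO logins VALUES (579, 'mobile')")
    conn.close()  # closed without commit: the insert is discarded
    rows_without_commit = count_logins(db_path)

    conn = sqlite3.connect(db_path)
    conn.execute("INSERT INTO logins VALUES (823, 'web')")
    conn.commit()  # the insert is now permanent
    conn.close()
    rows_after_commit = count_logins(db_path)

print(rows_without_commit, rows_after_commit)
```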