# Lecture Objectives

- Learn how to connect to and run Postgres queries from Python
- Understand psycopg2's cursors, executes, and commits
- Learn how to generate dynamic queries through Python string formatting

## Combining SQL and Python

Often you will find yourself working with data that are only accessible through SQL.  However, your machine learning capabilities are built in Python.  To resolve this, we can set up a connection from Python to the SQL database and bring the data to us.

## Why do we care?

- SQL-based databases are extremely common in almost all industry environments
- You can leverage the benefits of SQL's structure and scalability while keeping the flexibility of Python
- Very useful for scaled data pipelines, pre-cleaning, and data exploration
- Allows for dynamic query generation and hence automation

## psycopg2

- A Python library for connecting to PostgreSQL databases so you can easily query and retrieve data for analysis.
- [Documentation--Includes Installation Instructions](http://initd.org/psycopg/docs/install.html)
- In addition to what's listed in the documentation, if you have the Anaconda distribution of Python, the following command worked for me:
```bash
conda install psycopg2
```
- There are similar packages for other flavors of SQL that work much the same way

## General Workflow

1. Establish connection to Postgres database using psycopg2
2. Create a cursor
3. Use the cursor to execute SQL queries
4. Commit SQL actions
5. Close the cursor and connection
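
The steps above can be sketched end to end. So that it runs without a Postgres server, this sketch swaps in Python's built-in `sqlite3` module as a stand-in, which is not what the lecture itself uses. Both `sqlite3` and psycopg2 follow the Python DB-API, so with psycopg2 the main difference should be the `connect` call. The table name `demo` is hypothetical.

```python
import sqlite3

# 1. Establish a connection
#    (stand-in for pg2.connect(dbname=..., user=..., host=...))
conn = sqlite3.connect(':memory:')

# 2. Create a cursor
cur = conn.cursor()

# 3. Use the cursor to execute SQL queries
cur.execute('CREATE TABLE demo (userid integer, type varchar(10));')
cur.execute("INSERT INTO demo VALUES (579, 'mobile');")

# 4. Commit SQL actions
conn.commit()

# Read the data back before closing
cur.execute('SELECT userid, type FROM demo;')
rows = cur.fetchall()

# 5. Close the cursor and connection
cur.close()
conn.close()
```

After this runs, `rows` holds the single inserted record as a list containing one tuple.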

# Walkthrough 1: Creating a database from Python

### Connect to the database
- Connections must be established using an existing database, username, database IP/URL, and maybe passwords
- If you need to create a database, you can first connect to Postgres using the dbname 'postgres' to initialize

In [84]:
import psycopg2 as pg2
conn = pg2.connect(dbname='postgres', user='clayton.schupp', host='localhost')
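
Since a password is only sometimes required, one way to handle the optional settings is to collect them in a dictionary and leave out whatever was not supplied. The helper `connection_kwargs` below is hypothetical and not part of psycopg2; it only builds the keyword arguments, so it runs without a database.

```python
def connection_kwargs(dbname, user, host='localhost', password=None, port=None):
    """Collect connection settings, leaving out the ones that were not supplied."""
    kwargs = {'dbname': dbname, 'user': user, 'host': host}
    if password is not None:
        kwargs['password'] = password
    if port is not None:
        kwargs['port'] = port
    return kwargs

kwargs = connection_kwargs('postgres', 'clayton.schupp')
# With psycopg2 installed and a server running, this would be passed as:
# conn = pg2.connect(**kwargs)
```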

### Commits

- Data changes are not actually stored until you choose to commit
- You can choose to commit automatically by setting `autocommit = True` on the connection
- When connecting directly to the Postgres server to issue server-level commands such as creating a database, you must use the `autocommit = True` option, since Postgres cannot run `CREATE DATABASE` inside a transaction block

In [85]:
conn.autocommit = True
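
To see what "not stored until you commit" means when autocommit is off, here is a minimal sketch. It uses the built-in `sqlite3` module as a stand-in so it runs without a server; psycopg2 connections also expose `commit()` and `rollback()`, so the same pattern should carry over. The table `demo` and the helper `count_rows` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE demo (userid integer);')
conn.commit()

def count_rows():
    """Return the current number of rows in demo."""
    cur.execute('SELECT count(*) FROM demo;')
    return cur.fetchone()[0]

# An INSERT that is rolled back never becomes permanent.
cur.execute('INSERT INTO demo VALUES (1);')
conn.rollback()
after_rollback = count_rows()

# An INSERT followed by commit() survives a later rollback().
cur.execute('INSERT INTO demo VALUES (2);')
conn.commit()
conn.rollback()
after_commit = count_rows()

conn.close()
```

Here `after_rollback` is 0 and `after_commit` is 1: only the committed insert is kept.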

### Instantiate the Cursor

- A cursor is a control structure that enables traversal over the records in a database
- Executes and fetches data
- When the cursor points at the resulting output of a query, it can only read each observation once.  If you choose to see a previously read observation, you must rerun the query. 
- Can be closed without closing the connection

In [86]:
cur = conn.cursor()
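
The read-once behavior of a cursor can be sketched as follows, again with the built-in `sqlite3` module as a stand-in so that no server is needed. The table `demo` is hypothetical, and the `?` placeholder is `sqlite3`'s style (psycopg2 uses `%s` instead).

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE demo (userid integer);')
cur.executemany('INSERT INTO demo VALUES (?);', [(1,), (2,), (3,), (4,)])

cur.execute('SELECT userid FROM demo ORDER BY userid;')
first = cur.fetchone()        # reads row 1
next_two = cur.fetchmany(2)   # reads rows 2 and 3
rest = cur.fetchall()         # reads whatever is left: row 4
exhausted = cur.fetchall()    # nothing left to read, so an empty list

# Rerunning the query is the way to see earlier rows again.
cur.execute('SELECT userid FROM demo ORDER BY userid;')
again = cur.fetchone()

conn.close()
```

Each fetch picks up where the previous one stopped, and `again` is the first row once more only because the query was rerun.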

###  Create a database

In [87]:
cur.execute('DROP DATABASE IF EXISTS temp;')
cur.execute('CREATE DATABASE temp;')

### Disconnect from the cursor and database
- Cursors and connections must be closed using `.close()`, or else Postgres will lock certain operations on the database/tables until the connection is severed.

In [88]:
cur.close() # This is optional
conn.close() # Closing the connection also closes all cursors

# Walkthrough 2: Let's use our new database

### Connect to our database

In [89]:
conn = pg2.connect(dbname='temp', user='clayton.schupp', host='localhost')

In [90]:
cur = conn.cursor()

### Create a new table

In [91]:
query = '''
        CREATE TABLE logins (
            userid integer, 
            tmstmp timestamp, 
            type varchar(10)
        );
        '''
cur.execute(query)

### Insert data into new table

In [92]:
query = '''
        COPY logins 
        FROM '/Users/clayton.schupp/cwschupp/galvanize/teaching/DSI_Lectures/sql-python/clayton/logins_data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''
cur.execute(query)

### Run a query to get 30 records from our data

In [93]:
query = '''
        SELECT *
        FROM logins
        LIMIT 30;
        '''
cur.execute(query)

### Let's look at our data one line at a time

In [94]:
cur.fetchone()

(579, datetime.datetime(2013, 11, 20, 3, 20, 6), 'mobile')

### Many lines at a time

In [95]:
#fetchmany(n) to get n rows
cur.fetchmany(10)

[(823, datetime.datetime(2013, 11, 20, 3, 20, 49), 'web'),
 (953, datetime.datetime(2013, 11, 20, 3, 28, 49), 'web'),
 (612, datetime.datetime(2013, 11, 20, 3, 36, 55), 'web'),
 (269, datetime.datetime(2013, 11, 20, 3, 43, 13), 'web'),
 (799, datetime.datetime(2013, 11, 20, 3, 56, 55), 'web'),
 (890, datetime.datetime(2013, 11, 20, 4, 2, 33), 'mobile'),
 (330, datetime.datetime(2013, 11, 20, 4, 54, 59), 'mobile'),
 (628, datetime.datetime(2013, 11, 20, 4, 57, 22), 'mobile'),
 (398, datetime.datetime(2013, 11, 20, 5, 3, 19), 'mobile'),
 (482, datetime.datetime(2013, 11, 20, 5, 4, 43), 'mobile')]

### Or everything at once

In [96]:
#fetchall() grabs all remaining rows
cur.fetchall()

[(581, datetime.datetime(2013, 11, 20, 5, 12, 3), 'mobile'),
 (370, datetime.datetime(2013, 11, 20, 5, 26, 46), 'mobile'),
 (230, datetime.datetime(2013, 11, 20, 5, 28, 29), 'web'),
 (596, datetime.datetime(2013, 11, 20, 5, 28, 36), 'web'),
 (274, datetime.datetime(2013, 11, 20, 5, 43, 8), 'mobile'),
 (581, datetime.datetime(2013, 11, 20, 5, 47, 10), 'web'),
 (417, datetime.datetime(2013, 11, 20, 5, 54, 37), 'mobile'),
 (185, datetime.datetime(2013, 11, 20, 5, 56, 22), 'mobile'),
 (371, datetime.datetime(2013, 11, 20, 5, 58, 35), 'mobile'),
 (133, datetime.datetime(2013, 11, 20, 5, 59, 7), 'web'),
 (621, datetime.datetime(2013, 11, 20, 6, 1, 46), 'web'),
 (306, datetime.datetime(2013, 11, 20, 6, 3, 23), 'mobile'),
 (509, datetime.datetime(2013, 11, 20, 6, 4, 43), 'web'),
 (505, datetime.datetime(2013, 11, 20, 6, 9, 52), 'web'),
 (678, datetime.datetime(2013, 11, 20, 6, 34, 18), 'web'),
 (889, datetime.datetime(2013, 11, 20, 6, 36, 32), 'mobile'),
 (202, datetime.datetime(2013, 11, 20, 

# Dynamic Queries

- A dynamic query is a query whose text is generated at run time based on context, rather than written out by hand.


### Example

We have 8 login csv files that we need to insert into the logins table.  Instead of writing a COPY FROM query 8 times, we can use Python (or any other language) to make this more efficient.  This is possible with string templates that contain placeholders (tokens) to be filled in for each file.

### First let's get an idea of how many records we start with

In [97]:
cur.execute('SELECT count(*) FROM logins;')
cur.fetchall()

[(10000L,)]

### os is needed because we want to dynamically identify the files we need to insert using listdir.

In [98]:
import os

### Create a query template and determine file path for imports

In [99]:
query = '''
        COPY logins 
        FROM '{file_path}'
        DELIMITER ','
        CSV;
        '''

folder_path = '/Users/clayton.schupp/cwschupp/galvanize/teaching/DSI_Lectures/sql-python/clayton/logins_data/'

### Use string formatting to generate a query for each approved file.

In [100]:
for file_name in os.listdir(folder_path):
    if file_name.endswith('.csv') and file_name != 'logins01.csv':
        dyn_query = query.format(file_path = folder_path + file_name)
        cur.execute(dyn_query)
        print('{0} inserted into table.'.format(file_name))

logins02.csv inserted into table.
logins03.csv inserted into table.
logins04.csv inserted into table.
logins05.csv inserted into table.
logins06.csv inserted into table.
logins07.csv inserted into table.
logins08.csv inserted into table.
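
The query-generation step can also be separated from execution, which makes it easy to inspect the generated SQL before running it. The sketch below is one way to do that, not the lecture's own code: the helper `build_copy_queries` and the throwaway folder are hypothetical, and nothing is sent to a database.

```python
import os
import tempfile

QUERY_TEMPLATE = "COPY logins FROM '{file_path}' DELIMITER ',' CSV;"

def build_copy_queries(folder_path, skip=()):
    """Return (file_name, query) pairs for each .csv file in folder_path."""
    pairs = []
    for file_name in sorted(os.listdir(folder_path)):
        if file_name.endswith('.csv') and file_name not in skip:
            file_path = os.path.join(folder_path, file_name)
            pairs.append((file_name, QUERY_TEMPLATE.format(file_path=file_path)))
    return pairs

# Demonstrate on a throwaway folder with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ['logins01.csv', 'logins02.csv', 'logins03.csv', 'notes.txt']:
    open(os.path.join(demo_dir, name), 'w').close()

queries = build_copy_queries(demo_dir, skip=('logins01.csv',))
file_names = [name for name, _ in queries]
```

Each generated query could then be passed to `cur.execute` in a loop. Sorting the file names keeps the insert order predictable, since `os.listdir` makes no ordering guarantee.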


### Let's check the total number of records we have right now.

In [101]:
cur.execute('SELECT count(*) FROM logins;')
cur.fetchall()

[(78588L,)]

### Don't forget to commit your changes

In [102]:
conn.commit()

### Close your connection

In [103]:
conn.close()

# Key Things to Remember

* Connections must be established using an existing database, username, database IP/URL, and maybe passwords
* If you have not created any databases yet, you can connect to Postgres using the dbname 'postgres' to run database-level commands
* Data changes are not actually stored until you choose to commit. This can be done either through `conn.commit()` or by setting `autocommit = True`.  Until committed, all changes are only stored temporarily.
* `autocommit = True` is necessary for database-level commands like CREATE DATABASE, because Postgres cannot run them inside a transaction block.
* If you ever need to build similar pipelines for other kinds of databases, there are libraries such as pyodbc that operate in essentially the same way
* SQL database connections use cursors for data traversal and retrieval.  A cursor is somewhat like an iterator in Python.
* Cursor operations typically go as follows:
    - execute a query
    - fetch rows from the query result if it is a SELECT query
    - because it is iterative, previously fetched rows can only be fetched again by rerunning the query
    - close the cursor through `.close()`
* Cursors and connections must be closed using `.close()`, or else Postgres will lock certain operations on the database/tables until the connection is severed.
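
One way to make sure `.close()` is always called, even when a query raises an exception, is `contextlib.closing` from the standard library. The sketch below uses the built-in `sqlite3` module as a stand-in so it runs without a server. Since `closing` works with any object that has a `close()` method, it should apply to psycopg2 connections and cursors in the same way.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(':memory:')) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute('SELECT 1;')
        value = cur.fetchone()[0]

# Both objects are closed here, even if the query had raised an exception.
try:
    conn.cursor()
    still_open = True
except sqlite3.ProgrammingError:
    # sqlite3 refuses to operate on a closed connection
    still_open = False
```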

# Exercise

You're given a file called `playgolf.csv` in the data folder.  The file is comma delimited and the first row is the header.  Without opening and looking at the file, create a table and insert the data. Here is the header and first row:


|Date|Outlook|Temperature|Humidity|Windy|Result|
|----|-------|-----------|--------|-----|------|
|07-01-2014|sunny|85|85|false|Don't Play|
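
The reasoning from a sample row to column types can be sketched in code. The helper `guess_pg_type` below is hypothetical and deliberately crude: it looks at a single value, assumes dates are written as month-day-year, and falls back to `varchar(30)`. One row is thin evidence, so treat the result as a starting point for the CREATE TABLE statement rather than a final schema.

```python
from datetime import datetime

def guess_pg_type(value):
    """Guess a Postgres column type from one sample string (hypothetical helper)."""
    if value.lower() in ('true', 'false'):
        return 'boolean'
    try:
        int(value)
        return 'integer'
    except ValueError:
        pass
    try:
        datetime.strptime(value, '%m-%d-%Y')
        return 'date'
    except ValueError:
        return 'varchar(30)'

header = ['Date', 'Outlook', 'Temperature', 'Humidity', 'Windy', 'Result']
first_row = ['07-01-2014', 'sunny', '85', '85', 'false', "Don't Play"]

# Pair each column name with the type guessed from its sample value.
columns = [(name, guess_pg_type(value)) for name, value in zip(header, first_row)]
guessed = [pg_type for _, pg_type in columns]
```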

In [104]:
conn = pg2.connect(dbname='temp', user='clayton.schupp', host='localhost')

In [105]:
cur = conn.cursor()

In [106]:
query = '''
        CREATE TABLE play_golf (
            date date,
            outlook varchar(30),
            temp integer, 
            humidity integer,
            windy boolean,
            result varchar(30)
        );
        '''
cur.execute(query)

In [107]:
query = '''
        COPY play_golf 
        FROM '/Users/clayton.schupp/cwschupp/galvanize/teaching/DSI_Lectures/sql-python/clayton/playgolf.csv' 
        DELIMITER ',' 
        HEADER 
        CSV;
        '''
cur.execute(query)

In [108]:
query = '''
        SELECT *
        FROM play_golf;
        '''
cur.execute(query)

In [109]:
cur.fetchall()

[(datetime.date(2014, 7, 1), 'sunny', 85, 85, False, "Don't Play"),
 (datetime.date(2014, 7, 2), 'sunny', 80, 90, True, "Don't Play"),
 (datetime.date(2014, 7, 3), 'overcast', 83, 78, False, 'Play'),
 (datetime.date(2014, 7, 4), 'rain', 70, 96, False, 'Play'),
 (datetime.date(2014, 7, 5), 'rain', 68, 80, False, 'Play'),
 (datetime.date(2014, 7, 6), 'rain', 65, 70, True, "Don't Play"),
 (datetime.date(2014, 7, 7), 'overcast', 64, 65, True, 'Play'),
 (datetime.date(2014, 7, 8), 'sunny', 72, 95, False, "Don't Play"),
 (datetime.date(2014, 7, 9), 'sunny', 69, 70, False, 'Play'),
 (datetime.date(2014, 7, 10), 'rain', 75, 80, False, 'Play'),
 (datetime.date(2014, 7, 11), 'sunny', 75, 70, True, 'Play'),
 (datetime.date(2014, 7, 12), 'overcast', 72, 90, True, 'Play'),
 (datetime.date(2014, 7, 13), 'overcast', 81, 75, False, 'Play'),
 (datetime.date(2014, 7, 14), 'rain', 71, 80, True, "Don't Play")]

In [110]:
import pandas as pd
df=pd.read_sql(query, conn)
df

| |date|outlook|temp|humidity|windy|result|
|-|----|-------|----|--------|-----|------|
|0|2014-07-01|sunny|85|85|False|Don't Play|
|1|2014-07-02|sunny|80|90|True|Don't Play|
|2|2014-07-03|overcast|83|78|False|Play|
|3|2014-07-04|rain|70|96|False|Play|
|4|2014-07-05|rain|68|80|False|Play|
|5|2014-07-06|rain|65|70|True|Don't Play|
|6|2014-07-07|overcast|64|65|True|Play|
|7|2014-07-08|sunny|72|95|False|Don't Play|
|8|2014-07-09|sunny|69|70|False|Play|
|9|2014-07-10|rain|75|80|False|Play|


In [111]:
conn.close()