# SQL Python (Postgres specifically)

## Lecture Objectives

- Connect to a database from within a Python program and run queries
- Understand psycopg2's connections, cursors, and commits
- Conceptualize the various fetches available on the cursor
- Explain what a database rollback is and how it interacts with a commit
- Understand the 'with' statement
- Generate dynamic queries

Most of this content is borrowed from Miles Erickson's deck.

## Combining SQL and Python

Often you will find yourself working with data that is only accessible through SQL.  However, your machine-learning capabilities are built in Python.  To resolve this, we can set up a connection from Python to the SQL database and bring the data to us.

## Why do we care?

- SQL-based databases are extremely common in almost all industry environments
- Can leverage the benefit of SQL's structure and scalability, while maintaining the flexibility of Python
- Very useful for scaled data pipelines, pre-cleaning, data exploration
- Allows for dynamic query generation and hence automations

## psycopg2

- A Python library that connects to PostgreSQL databases so you can easily query and retrieve data for analysis.
- [Documentation (includes installation instructions)](http://initd.org/psycopg/docs/install.html)
- In addition to what's listed in the documentation, if you have the Anaconda distribution of Python, the following should work:
```bash
conda install psycopg2
```
- There are similar packages for other flavors of SQL that work much the same way

## General Workflow

1. Establish connection to Postgres database using psycopg2
2. Create a cursor
3. Use the cursor to execute SQL queries and retrieve data
4. Commit SQL actions
5. Close the cursor and connection
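
The five steps above can be sketched end to end as follows. This is only an illustration: it uses the standard-library `sqlite3` module as a stand-in so that it runs without a Postgres server, and the table contents are made up. psycopg2 follows the same Python DB-API pattern, with `psycopg2.connect(dbname=..., host=...)` in place of `sqlite3.connect`.

```python
import sqlite3  # stand-in DB-API driver (assumption: no Postgres server available)

# 1. Establish a connection (psycopg2: psycopg2.connect(dbname=..., host=...))
conn = sqlite3.connect(":memory:")

# 2. Create a cursor
cur = conn.cursor()

# 3. Use the cursor to execute SQL and retrieve data (illustrative rows only)
cur.execute("CREATE TABLE logins (userid integer, tmstmp text, type text)")
cur.execute("INSERT INTO logins VALUES (1, '2014-08-01 12:00:00', 'web')")
cur.execute("SELECT userid, type FROM logins")
rows = cur.fetchall()

# 4. Commit the changes
conn.commit()

# 5. Close the cursor and the connection
cur.close()
conn.close()

print(rows)
```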

# Walkthrough 1: Creating a database from Python

### Connect to the database
- Connections must be established using an existing database, a username, the database IP/URL, and possibly a password
- If you need to create a database, you can first connect to Postgres using the dbname 'postgres' to initialize

In [1]:
import psycopg2
conn = psycopg2.connect(dbname='postgres', host='localhost')

### Instantiate the Cursor

- A cursor is a control structure that enables traversal over the records in a database
- Executes and fetches data
- When the cursor points at the result of a query, it can read each row only once.  To see a previously read row again, you must rerun the query.
- Can be closed without closing the connection

In [2]:
cur = conn.cursor()

### Commits

- Data changes are not actually stored until you choose to commit
- You can choose to commit automatically after every statement by setting `conn.autocommit = True`
- When connecting directly to the Postgres server to run server-level commands such as creating a database, you must set `autocommit = True`, because Postgres does not allow `CREATE DATABASE` (or `DROP DATABASE`) to run inside a transaction block, and psycopg2 otherwise opens a transaction automatically

In [3]:
conn.autocommit = True

###  Create a database

In [5]:
cur.execute('DROP DATABASE IF EXISTS temp;')
cur.execute('CREATE DATABASE temp;')

### Disconnect from the cursor and database
- Cursors and connections must be closed using `.close()`, or else Postgres will lock certain operations on the database/tables until the connection is severed.

In [None]:
cur.close() # This is optional
conn.close() # Closing the connection also closes all cursors

# Walkthrough 2: Using the new database

### Connect to our database

In [None]:
conn = psycopg2.connect(dbname='temp', host='localhost')

In [None]:
cur = conn.cursor()

### Create a new table

In [None]:
query = '''
        CREATE TABLE logins (
            userid integer, 
            tmstmp timestamp, 
            type varchar(10)
        );
        '''
cur.execute(query)

### Insert data into new table

`os` is needed to get the current directory and, later, dynamically identify the files we need to insert using listdir.

In [None]:
import os

In [None]:
path = os.getcwd()

In [None]:
path

In [None]:
query = '''
        COPY logins 
        FROM '{0}/logins_data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''.format(path)
cur.execute(query)

### Run a query to get 30 records from our data

In [None]:
query = '''
        SELECT *
        FROM logins
        LIMIT 30;
        '''
cur.execute(query)

### Let's look at our data one line at a time

In [None]:
cur.fetchone()

### Many lines at a time

In [None]:
#fetchmany(n) to get n rows
cur.fetchmany(10)

### Or everything at once

In [None]:
#fetchall() grabs all remaining rows
results = cur.fetchall()

In [None]:
conn.commit()

In [None]:
type(results)

In [None]:
type(results[0])
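
To make the behaviour of the three fetch methods concrete, here is a small sketch showing how each call consumes rows from the same result set. It again uses `sqlite3` as a stand-in with 30 made-up rows; psycopg2 cursors expose the same `fetchone`, `fetchmany`, and `fetchall` methods, but use `%s` placeholders rather than `?`.

```python
import sqlite3  # stand-in DB-API driver so the sketch runs without Postgres

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logins (userid integer, type text)")
# sqlite3 uses ? placeholders; psycopg2 would use %s here
cur.executemany("INSERT INTO logins VALUES (?, ?)",
                [(i, "web") for i in range(30)])

cur.execute("SELECT userid, type FROM logins ORDER BY userid LIMIT 30")

first = cur.fetchone()      # one row, as a tuple
batch = cur.fetchmany(10)   # the next 10 rows, as a list of tuples
rest = cur.fetchall()       # every row not yet read
leftover = cur.fetchall()   # nothing is left: the cursor reads each row once

print(first, len(batch), len(rest), leftover)
conn.close()
```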

### You can even iterate over the cursor

In [None]:
cur.execute(query)
for record in cur:
    print("{}: user {} logged in via {}".format(record[1], record[0], record[2]))

# Dynamic Queries

- A dynamic query is a query that is generated at run time based on context.


### Example

We have 8 login CSV files that we need to insert into the logins table.  Instead of writing a COPY FROM query 8 times, we should use Python (or whichever language you are working in) to make this more efficient.  This is possible because a query string can contain placeholders that are filled in at run time.
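
As a dry run of this idea, the sketch below builds one set of query parameters per CSV file without touching a database. The helper name `build_copy_params`, the throwaway folder, and the file names are hypothetical; the template uses psycopg2's named-placeholder syntax (`%(file_path)s`), which is filled in when the query is executed.

```python
import os
import tempfile

# One query template, reused for every file
COPY_TEMPLATE = """
    COPY logins
    FROM %(file_path)s
    DELIMITER ','
    CSV;
"""

def build_copy_params(folder_path, skip=()):
    """Return one parameter dict per CSV file in folder_path, sorted by name."""
    params = []
    for file_name in sorted(os.listdir(folder_path)):
        if file_name.endswith(".csv") and file_name not in skip:
            params.append({"file_path": os.path.join(folder_path, file_name)})
    return params

# Demonstration with empty placeholder files in a temporary folder
demo_dir = tempfile.mkdtemp()
for name in ("logins01.csv", "logins02.csv", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

jobs = build_copy_params(demo_dir, skip=("logins01.csv",))
for job in jobs:
    # With a live psycopg2 cursor this would be: cur.execute(COPY_TEMPLATE, job)
    print(job["file_path"])
```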

### First let's get an idea of how many records we start with

In [None]:
cur.execute('SELECT count(*) FROM logins;')
record_count = cur.fetchone()[0]


In [None]:
record_count

In [None]:
type(record_count)

### Create a query template and determine file path for imports

Use string formatting to generate a query for each approved file.

**[WARNING: BEWARE OF SQL INJECTION](http://initd.org/psycopg/docs/usage.html)**

NEVER use `+`, `%`, or `.format` to build query strings that are passed to `.execute`

In [None]:
num = 579
terribly_unsafe = "SELECT * FROM logins WHERE userid = " + str(num) + ";"
print(terribly_unsafe)


date_cut = "2014-08-01"
horribly_risky = "SELECT * FROM logins WHERE tmstmp > %s;" % date_cut
print(horribly_risky)
# Python is happy, but if num or date_cut included something malicious,
# your data could be at risk

### What is an SQL Injection Attack?

In [None]:
date_cut = "2014-08-01; DROP TABLE logins" # The user enters a date in a field on a web form
horribly_risky = "SELECT * FROM logins WHERE tmstmp > %s;" % date_cut
print(horribly_risky)
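
The effect of an injection can be seen in a small, self-contained sketch. It uses `sqlite3` as a stand-in and a made-up malicious input (`"579 OR 1=1"`); `sqlite3` executes only one statement per call, so this shows the filter-bypass variant rather than the `DROP TABLE` variant. The parameterized form uses `?` in `sqlite3`, where psycopg2 would use `%s`.

```python
import sqlite3  # stand-in DB-API driver so the sketch runs without Postgres

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logins (userid integer, type text)")
cur.executemany("INSERT INTO logins VALUES (?, ?)",
                [(1, "web"), (2, "mobile"), (3, "web")])

user_input = "579 OR 1=1"  # hypothetical value typed into a web form

# Unsafe: the input becomes part of the SQL text, so "OR 1=1" is run as SQL
unsafe_query = "SELECT * FROM logins WHERE userid = " + user_input + ";"
unsafe_rows = cur.execute(unsafe_query).fetchall()

# Safe: the driver passes the input as a value, never as SQL
safe_rows = cur.execute("SELECT * FROM logins WHERE userid = ?;",
                        (user_input,)).fetchall()

print(len(unsafe_rows))  # every row leaks
print(len(safe_rows))    # no userid equals the literal string
conn.close()
```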

### Practice safe SQL with Psycopg2

In [None]:
query = '''
        COPY logins 
        FROM %(file_path)s
        DELIMITER ','
        CSV;
        '''

In [None]:
path

In [None]:
folder_path = os.path.join(path, 'logins_data')
for file_name in os.listdir(folder_path):
    # logins01.csv was already loaded above, so skip it
    if file_name.endswith('.csv') and file_name != 'logins01.csv':
        file_path = os.path.join(folder_path, file_name)
        cur.execute(query, {'file_path': file_path})
        print('{0} inserted into table.'.format(file_name))

### Visit [bobby-tables.com](http://www.bobby-tables.com/) to learn more about SQL safety.



### Let's check the total number of records we have right now.

In [None]:
print("Old record count: {}".format(record_count))

cur.execute('SELECT count(*) FROM logins;')
record_count = cur.fetchone()[0]

print("New record count: {}".format(record_count))

### Transactions can be rolled back until they're committed

In [None]:
conn.rollback()
#conn.commit()

cur.execute('SELECT count(*) FROM logins;')
record_count = cur.fetchone()[0]

print("After rollback: {}".format(record_count))
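
The interaction between commit and rollback can be reproduced in isolation. The sketch below is an assumption-laden miniature (an in-memory `sqlite3` database with a one-column table, not the lecture's Postgres data): a rollback discards only the work done since the last commit.

```python
import sqlite3  # stand-in DB-API driver so the sketch runs without Postgres

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logins (userid integer)")
conn.commit()

def count_rows():
    cur.execute("SELECT count(*) FROM logins;")
    return cur.fetchone()[0]

cur.execute("INSERT INTO logins VALUES (1)")
conn.commit()                     # row 1 is now permanent
after_commit = count_rows()

cur.execute("INSERT INTO logins VALUES (2)")
before_rollback = count_rows()    # this connection sees its own uncommitted row
conn.rollback()                   # discards row 2, keeps committed row 1
after_rollback = count_rows()

print(after_commit, before_rollback, after_rollback)
conn.close()
```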

### Don't forget to commit your changes

In [None]:
conn.commit()

### Close your connection

In [None]:
conn.close()

### Using With Statements

In [None]:
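# The connection's with block commits on success and rolls back on an exception;
# the cursor's with block closes the cursor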
query = "SELECT count(*) FROM logins;"
with psycopg2.connect(dbname='temp', host='localhost') as conn:
    with conn.cursor() as curs:
        print("Cursor inside with block: {}".format(curs))
        curs.execute(query)
    print("Cursor outside with block: {}".format(curs))
    

### Note that the connection is *not* closed automatically:

In [None]:
conn

In [None]:
conn.close()
conn
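
If you also want the connection closed automatically, one option is to wrap it in `contextlib.closing`. The sketch below illustrates this with `sqlite3` as a stand-in, whose connection `with` block likewise commits or rolls back without closing; the same wrapper should work around `psycopg2.connect(...)`, though that is not verified here.

```python
import sqlite3  # stand-in DB-API driver so the sketch runs without Postgres
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE logins (userid integer)")
    conn.commit()

    # An exception inside "with conn" triggers a rollback
    try:
        with conn:
            cur.execute("INSERT INTO logins VALUES (1)")
            raise ValueError("simulated failure")
    except ValueError:
        pass
    cur.execute("SELECT count(*) FROM logins")
    rolled_back_count = cur.fetchone()[0]

    # A clean exit from "with conn" commits
    with conn:
        cur.execute("INSERT INTO logins VALUES (2)")
    cur.execute("SELECT count(*) FROM logins")
    committed_count = cur.fetchone()[0]

# closing() has called conn.close(), so further use raises an error
try:
    cur.execute("SELECT 1")
    is_closed = False
except sqlite3.ProgrammingError:
    is_closed = True

print(rolled_back_count, committed_count, is_closed)
```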

# Key Things to Remember

* Connections must be established using an existing database, a username, the database IP/URL, and possibly a password
* If you have not created any databases, you can connect to Postgres using the dbname 'postgres' to run database-level commands
* Data changes are not actually stored until you choose to commit. This can be done either through `conn.commit()` or by setting `autocommit = True`.  Until committed, all changes are only temporarily stored.
* `autocommit = True` is necessary for database-level commands like CREATE DATABASE.  This is because Postgres does not allow CREATE DATABASE to run inside a transaction block.
* If you ever need to build similar pipelines for other kinds of databases, there are libraries such as PyODBC which operate very similarly.
* Database connections use cursors for data traversal and retrieval.  A cursor behaves somewhat like an iterator in Python.
* Cursor operations typically go as follows:
    - execute a query
    - fetch rows from query result if it is a SELECT query
    - because it is iterative, previously fetched rows can only be fetched again by rerunning the query
    - close cursor through .close()
* Cursors and connections must be closed using `.close()`, or else Postgres will lock certain operations on the database/tables until the connection is severed.

# Exercise

You're given a file called `playgolf.csv` in the data folder.  The file is comma delimited and the first row is the header.  Without opening and looking at the file, create a table and insert the data. Here is the header and first row:


|Date|Outlook|Temperature|Humidity|Windy|Result|
|----|-------|-----------|--------|-----|------|
|07-01-2014|sunny|85|85|false|Don't Play|



Solution
--------

<details><summary>
Query to create the table
</summary>
CREATE TABLE playgolf (
            Date date, 
            Outlook varchar(10), 
            Temperature int,
            Humidity int,
            Windy varchar(10),
            Result varchar(10)
        );
</details>


In [None]:
conn = psycopg2.connect(dbname='temp', host='localhost')
cur = conn.cursor()

query = '''
        CREATE TABLE playgolf (
            Date date, 
            Outlook varchar(10), 
            Temperature int,
            Humidity int,
            Windy varchar(10),
            Result varchar(10)
        );
        '''

cur.execute(query)

In [None]:
query = '''
        COPY playgolf 
        FROM '{}/playgolf.csv' 
        CSV HEADER 
        DELIMITER AS ',';
        '''.format(path)
cur.execute(query)

In [None]:
# execute() returns None; the rows are read from the cursor afterwards
cur.execute('SELECT * FROM playgolf;')

In [None]:
cur.description
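
One common use of `cur.description` is to recover column names and pair them with each row. The sketch below shows the idea using `sqlite3` as a stand-in and a cut-down, made-up version of the table; in both drivers the first element of each description entry is the column name. Note that Postgres folds unquoted column names to lowercase, whereas `sqlite3` keeps the declared case.

```python
import sqlite3  # stand-in DB-API driver so the sketch runs without Postgres

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE playgolf (Date text, Outlook text, Temperature int)")
cur.execute("INSERT INTO playgolf VALUES ('07-01-2014', 'sunny', 85)")

cur.execute("SELECT * FROM playgolf;")

# Each description entry is a sequence whose first item is the column name
column_names = [col[0] for col in cur.description]

# Pair names with values so each row becomes a dictionary
records = [dict(zip(column_names, row)) for row in cur.fetchall()]

print(column_names)
print(records[0]["Outlook"])
conn.close()
```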

In [21]:
rc_conn = psycopg2.connect(dbname='readychef', host='localhost')

In [22]:
table_query = "select relname from pg_class where relkind='r' and relname !~ '^(pg_|sql_)';"
rc_cur = rc_conn.cursor()

In [None]:
rc_cur.execute(table_query)

In [None]:
rc_cur.fetchall()

In [None]:
q= """
SELECT userid, 
    CASE 
     WHEN campaign_id = 'PI' THEN 'Pinterest'
     WHEN campaign_id = 'FB' THEN 'FaceBook'
     WHEN campaign_id = 'tw' THEN 'Twitter'
     ELSE 't'
    END
FROM users
LIMIT 10;
"""

In [None]:
rc_cur.execute(q)
rc_cur.fetchall()

In [None]:
rc_cur.description

In [None]:
q= """
SELECT type, price, 
    avg(price) OVER (PARTITION BY type)
FROM meals;
"""

In [None]:
rc_cur.execute(q)
rc_cur.fetchall()

In [None]:
rc_cur.description

In [23]:
q= """
SELECT type, price,
    max(price) OVER (PARTITION BY type)
FROM meals;
"""

In [24]:
rc_cur.execute(q)
rc_cur.fetchall()

[('chinese', 6, 13),
 ('chinese', 8, 13),
 ('chinese', 7, 13),
 ('chinese', 9, 13),
 ('chinese', 8, 13),
 ('chinese', 6, 13),
 ('chinese', 12, 13),
 ('chinese', 6, 13),
 ('chinese', 6, 13),
 ('chinese', 7, 13),
 ('chinese', 8, 13),
 ('chinese', 8, 13),
 ('chinese', 10, 13),
 ('chinese', 7, 13),
 ('chinese', 13, 13),
 ('chinese', 9, 13),
 ('chinese', 13, 13),
 ('chinese', 10, 13),
 ('chinese', 11, 13),
 ('chinese', 11, 13),
 ('chinese', 12, 13),
 ('chinese', 11, 13),
 ('chinese', 12, 13),
 ('chinese', 11, 13),
 ('chinese', 11, 13),
 ('chinese', 10, 13),
 ('chinese', 12, 13),
 ('chinese', 6, 13),
 ('chinese', 10, 13),
 ('chinese', 8, 13),
 ('chinese', 12, 13),
 ('chinese', 6, 13),
 ('chinese', 13, 13),
 ('chinese', 11, 13),
 ('chinese', 8, 13),
 ('chinese', 6, 13),
 ('chinese', 8, 13),
 ('chinese', 12, 13),
 ('chinese', 13, 13),
 ('chinese', 7, 13),
 ('chinese', 9, 13),
 ('chinese', 10, 13),
 ('chinese', 13, 13),
 ('chinese', 10, 13),
 ('chinese', 7, 13),
 ('chinese', 10, 13),
 ... (output truncated)

In [None]:
t = r"COPY users TO '/Users/sversage/users.csv' DELIMITER ',' CSV HEADER;"
rc_cur.execute(t)
