Author: Ming Huang

- Last updated: 05/25/2016
- By: Ming Huang

# Objectives

- Use Psycopg2 to connect to Postgres from Python
- Explain how to run SQL queries from Python
- Explain how to create dynamic SQL queries through Python string formatting

# Why?

* Leverage the benefit of SQL's structure and scalability, while maintaining the flexibility of Python
* Very useful for scaled data pipelines, pre-cleaning, data exploration
* Allows for dynamic query generation and hence automation

# Psycopg2

A Python library that allows for connections to an existing Postgres database to execute queries and retrieve data.

### Documentation (Includes Installation Instructions)

- http://initd.org/psycopg/docs/install.html

# Walkthrough 1: Creating a database from Python

#### Step 1: Import psycopg2 into our namespace

In [1]:
import psycopg2 as pg2

#### Step 2: Create a connection with Postgres

This is equivalent to logging into the Postgres shell (using psql in terminal)

If the database server is not on your local machine, you can specify its address using the host parameter.

In [2]:
conn = pg2.connect(dbname='postgres', user='minghuang')
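
The host parameter mentioned above can be passed alongside dbname and user. Below is a minimal sketch, assuming a hypothetical helper `build_conn_params` and placeholder host and port values that are not part of the original notebook:

```python
def build_conn_params(dbname, user, host=None, port=None, password=None):
    """Collect keyword arguments for psycopg2.connect, skipping unset ones."""
    params = {'dbname': dbname, 'user': user}
    optional = {'host': host, 'port': port, 'password': password}
    params.update({k: v for k, v in optional.items() if v is not None})
    return params


def open_connection(params):
    """Open a connection; psycopg2 is imported lazily so the helper above works without it."""
    import psycopg2 as pg2
    return pg2.connect(**params)


# 'db.example.com' and 5432 are placeholder values for illustration only.
params = build_conn_params('postgres', 'minghuang', host='db.example.com', port=5432)
```

Calling `open_connection(params)` would then behave like the `pg2.connect(...)` call above, with the extra host and port included.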

#### Step 3: Set autocommit to True

It turns out that database changes are not committed (saved) until you tell Psycopg2 to commit them. Until you commit, your changes live only inside an open transaction and are not visible to other connections. This is a big problem when it comes to creating databases, because Postgres does not allow CREATE DATABASE to run inside a transaction. So we need to make sure everything we do gets committed automatically.

In [3]:
conn.autocommit = True

#### Step 4: Create the Cursor

Psycopg2 interacts with Postgres by means of cursors. Cursors are control structures that allow Psycopg2 to execute queries and traverse the query results (if there are any).

In [4]:
cur = conn.cursor()

#### Step 5: Create a database

With our cursor, we can now execute SQL commands.

In [5]:
cur.execute('DROP DATABASE IF EXISTS temp;')
cur.execute('CREATE DATABASE temp;')
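
The autocommit, cursor, and create steps can be bundled into one function. This is only a sketch: `create_database` is a hypothetical helper, and the name check is a conservative assumption added because the database name is formatted directly into the SQL text.

```python
import re


def create_database(conn, name):
    """Drop and recreate a database using an autocommit connection.

    `conn` is expected to be an open psycopg2 connection. `name` is limited
    to letters, digits and underscores because it is formatted straight
    into the SQL text.
    """
    if not re.match(r'[A-Za-z_][A-Za-z0-9_]*\Z', name):
        raise ValueError('unsafe database name: {0!r}'.format(name))
    conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
    cur = conn.cursor()
    try:
        cur.execute('DROP DATABASE IF EXISTS {0};'.format(name))
        cur.execute('CREATE DATABASE {0};'.format(name))
    finally:
        cur.close()
```

With the connection from Step 2, `create_database(conn, 'temp')` would issue the same two statements as the cell above.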

#### Step 6: Disconnect from the cursor and database

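After you're done, don't forget to close the cursor and the connection. Otherwise Postgres may complain later because it thinks someone else is still using the database (for example, a database cannot be dropped while other sessions are connected to it).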

In [6]:
cur.close() # This is optional
conn.close()

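In [None]:
# A cursor can also be used as a context manager, which closes it when the block exits.
# This needs an open connection, so it will not run after conn.close() above.
with cur:
    cur.execute('some query')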
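
One way to make sure the cleanup always happens is `contextlib.closing` from the standard library. In psycopg2, `with conn:` ends the transaction but does not close the connection, so `closing` is used for both objects here. This is a sketch; `fetch_all` is a hypothetical helper that takes a callable returning a connection.

```python
from contextlib import closing


def fetch_all(connect, query):
    """Run one query and return all rows, closing cursor and connection afterwards.

    `connect` is any zero-argument callable returning a DB-API connection,
    e.g. lambda: pg2.connect(dbname='postgres', user='minghuang').
    """
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(query)
            return cur.fetchall()
```

Both objects are closed even if the query raises an exception.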

# Walkthrough 2: Let's use our new database

#### Step 1: Connect to our database

Now that we have created a database, we can connect to it. Instead of connecting to 'postgres', the default database we used in Walkthrough 1, we can connect directly to 'temp'. This is equivalent to running 'psql temp' in the terminal.

In [7]:
conn = pg2.connect(dbname='temp', user='minghuang')

#### Step 2: Create a cursor

In [8]:
cur = conn.cursor()

#### Step 3: Create a new table

Since we're working from inside Python, we can store our query inside strings and just reference them later.

In [9]:
query = '''
        CREATE TABLE logins (
            userid integer
            , tmstmp timestamp
            , type varchar(10)
        );
        '''
cur.execute(query)

In [10]:
conn.commit()
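
Because this connection is not in autocommit mode, the new table only becomes permanent at `conn.commit()`. If a statement fails midway, Postgres rejects further commands on that connection until the transaction is rolled back. A minimal sketch of that pattern, with `run_in_transaction` as a hypothetical helper:

```python
def run_in_transaction(conn, statements):
    """Execute statements in order; commit if all succeed, roll back otherwise."""
    cur = conn.cursor()
    try:
        for statement in statements:
            cur.execute(statement)
        conn.commit()
    except Exception:
        conn.rollback()  # discard partial work and reset the transaction
        raise
    finally:
        cur.close()
```

For example, `run_in_transaction(conn, [query])` would create the logins table and commit in one call.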

#### Step 4: Insert .csv data into new table

Let's insert some data so we can play with it.

In [11]:
query = '''
        COPY logins 
        FROM '/Users/minghuang/Documents/git/Galvanize/lecture-prep/sql-python/ming_huang/data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''
cur.execute(query)
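
A `COPY ... FROM 'path'` statement is read by the Postgres server process, so the path has to exist on the server machine and the role usually needs elevated privileges. When the file lives on the client instead, psycopg2's `cursor.copy_expert` can stream it through `COPY ... FROM STDIN`. The sketch below assumes that situation; `copy_csv_from_client` is a hypothetical helper.

```python
def copy_csv_from_client(cur, csv_path, table='logins'):
    """Stream a local CSV file into `table` through COPY ... FROM STDIN.

    `cur` is expected to be a psycopg2 cursor. `table` is formatted into the
    SQL text, so it should come from trusted code rather than user input.
    """
    copy_sql = 'COPY {0} FROM STDIN WITH (FORMAT csv);'.format(table)
    with open(csv_path) as csv_file:
        cur.copy_expert(copy_sql, csv_file)
    return copy_sql
```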

#### Step 5: Run a query to get 30 records from our data

In [12]:
query = '''
        SELECT *
        FROM logins
        LIMIT 30;
        '''
cur.execute(query)

#### Step 6: Let's take a look at one line of data

In [13]:
cur.fetchone()

(579, datetime.datetime(2013, 11, 20, 3, 20, 6), 'mobile')

#### Step 7: Let's take a look at 10 lines instead

In [14]:
cur.fetchmany(10)

[(823, datetime.datetime(2013, 11, 20, 3, 20, 49), 'web'),
 (953, datetime.datetime(2013, 11, 20, 3, 28, 49), 'web'),
 (612, datetime.datetime(2013, 11, 20, 3, 36, 55), 'web'),
 (269, datetime.datetime(2013, 11, 20, 3, 43, 13), 'web'),
 (799, datetime.datetime(2013, 11, 20, 3, 56, 55), 'web'),
 (890, datetime.datetime(2013, 11, 20, 4, 2, 33), 'mobile'),
 (330, datetime.datetime(2013, 11, 20, 4, 54, 59), 'mobile'),
 (628, datetime.datetime(2013, 11, 20, 4, 57, 22), 'mobile'),
 (398, datetime.datetime(2013, 11, 20, 5, 3, 19), 'mobile'),
 (482, datetime.datetime(2013, 11, 20, 5, 4, 43), 'mobile')]

#### Step 8: Let's take a look at everything

In [15]:
cur.fetchall()

[(581, datetime.datetime(2013, 11, 20, 5, 12, 3), 'mobile'),
 (370, datetime.datetime(2013, 11, 20, 5, 26, 46), 'mobile'),
 (230, datetime.datetime(2013, 11, 20, 5, 28, 29), 'web'),
 (596, datetime.datetime(2013, 11, 20, 5, 28, 36), 'web'),
 (274, datetime.datetime(2013, 11, 20, 5, 43, 8), 'mobile'),
 (581, datetime.datetime(2013, 11, 20, 5, 47, 10), 'web'),
 (417, datetime.datetime(2013, 11, 20, 5, 54, 37), 'mobile'),
 (185, datetime.datetime(2013, 11, 20, 5, 56, 22), 'mobile'),
 (371, datetime.datetime(2013, 11, 20, 5, 58, 35), 'mobile'),
 (133, datetime.datetime(2013, 11, 20, 5, 59, 7), 'web'),
 (621, datetime.datetime(2013, 11, 20, 6, 1, 46), 'web'),
 (306, datetime.datetime(2013, 11, 20, 6, 3, 23), 'mobile'),
 (509, datetime.datetime(2013, 11, 20, 6, 4, 43), 'web'),
 (505, datetime.datetime(2013, 11, 20, 6, 9, 52), 'web'),
 (678, datetime.datetime(2013, 11, 20, 6, 34, 18), 'web'),
 (889, datetime.datetime(2013, 11, 20, 6, 36, 32), 'mobile'),
 (202, datetime.datetime(2013, 11, 20, 

If you count the lines, you'll see that rows returned by an earlier fetch call are not returned again by later ones. This is similar to a generator: you can only traverse the query results once, and you have to re-execute the query to read them again.
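
The same one-pass behavior can be reproduced without a Postgres server. The sketch below swaps in the standard library's sqlite3 module, which exposes the same DB-API fetch methods. The table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE logins (userid integer, type text)')
cur.executemany('INSERT INTO logins VALUES (?, ?)',
                [(i, 'web') for i in range(30)])

cur.execute('SELECT userid, type FROM logins ORDER BY userid LIMIT 30')
first = cur.fetchone()        # consumes row 1
next_ten = cur.fetchmany(10)  # consumes rows 2 to 11
rest = cur.fetchall()         # consumes whatever is left (19 rows)
leftover = cur.fetchall()     # nothing remains, so this is an empty list

conn.close()
```

Here 1 + 10 + 19 rows add up to the 30 selected, and the final `fetchall()` comes back empty.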

# Dynamic Queries

#### Example

We have 8 login csv files that we need to insert into the logins table.  Instead of doing a COPY FROM query 8 times, we should utilize Python to automate the process.

#### Step 1: First let's get an idea of how many records we start with

In [16]:
cur.execute('SELECT count(*) FROM logins;')
cur.fetchall()

[(10000L,)]

#### Step 2: We want to navigate through the file directory, so we need to import os into our namespace

In [17]:
import os

#### Step 3: Create a query template and determine file path for imports

In [18]:
query = '''
        COPY logins 
        FROM '{file_path}'
        DELIMITER ','
        CSV;
        '''

folder_path = '/Users/minghuang/Documents/git/Galvanize/lecture-prep/sql-python/ming_huang/data/'
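
String formatting is used here because the file path is part of the COPY statement's text. For ordinary values, such as those in a WHERE clause, psycopg2 can bind parameters itself through `%s` placeholders, which avoids quoting mistakes. A sketch under that assumption, with `logins_by_type` as a hypothetical helper:

```python
def logins_by_type(cur, login_type, limit=5):
    """Fetch up to `limit` logins of one type, letting the driver quote the values."""
    query = '''
            SELECT userid, tmstmp, type
            FROM logins
            WHERE type = %s
            LIMIT %s;
            '''
    # The values travel separately from the SQL text instead of being formatted in.
    cur.execute(query, (login_type, limit))
    return cur.fetchall()
```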

#### Step 4: Use string formatting to generate a query for each approved file.

In [19]:
for file_name in os.listdir(folder_path):
    if file_name.endswith('.csv') and file_name != 'logins01.csv':
        dyn_query = query.format(file_path = folder_path + file_name)
        cur.execute(dyn_query)
        print('{0} inserted into table.'.format(file_name))

logins02.csv inserted into table.
logins03.csv inserted into table.
logins04.csv inserted into table.
logins05.csv inserted into table.
logins06.csv inserted into table.
logins07.csv inserted into table.
logins08.csv inserted into table.
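
The query generation can also be separated from the execution, which makes the generated SQL easy to inspect before running it. This is a sketch rather than the notebook's own method: `build_copy_queries` and `COPY_TEMPLATE` are hypothetical names, `sorted` is added because `os.listdir` does not guarantee an order, and `os.path.join` is used so the folder path works with or without a trailing slash.

```python
import os

COPY_TEMPLATE = '''
        COPY logins
        FROM '{file_path}'
        DELIMITER ','
        CSV;
        '''


def build_copy_queries(folder_path, skip=('logins01.csv',)):
    """Return (file_name, query) pairs for each .csv file in folder_path, in name order."""
    pairs = []
    for file_name in sorted(os.listdir(folder_path)):
        if file_name.endswith('.csv') and file_name not in skip:
            file_path = os.path.join(folder_path, file_name)
            pairs.append((file_name, COPY_TEMPLATE.format(file_path=file_path)))
    return pairs
```

Each query in the returned list can then be passed to `cur.execute` in a loop.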


#### Step 5: Let's check how many records we have right now.

In [20]:
cur.execute('SELECT count(*) FROM logins;')
cur.fetchall()

[(78588L,)]

#### Step 6: Don't forget to commit your changes!

In [21]:
conn.commit()

#### Step 7: Close your connection

In [22]:
conn.close()

# Exercise

You're given a file called playgolf.csv in the data folder. The file is tab delimited and the first row is the header. Without opening and looking at the file manually (you can open it from within Python), create a table and insert the data.

In [42]:
conn = pg2.connect(dbname='temp', user='minghuang')

In [43]:
curry = conn.cursor()

In [24]:
with open('playgolf.csv') as f_in:
    col_line = f_in.readline()
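
Since the file should not be inspected by hand, the delimiter can be checked from the header line itself. The sketch below is a simple counting heuristic, not a full CSV dialect detector; `guess_delimiter` and its candidate list are hypothetical.

```python
def guess_delimiter(header_line, candidates=(',', '\t', '|', ';')):
    """Return the candidate that occurs most often in the header line."""
    counts = {candidate: header_line.count(candidate) for candidate in candidates}
    best = max(candidates, key=lambda candidate: counts[candidate])
    if counts[best] == 0:
        raise ValueError('none of the candidate delimiters appear in the header')
    return best
```

Calling `guess_delimiter(col_line)` gives a value that can be used both for `split` and in the COPY statement's delimiter clause.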

In [27]:
columns = col_line.strip().split('|')
for idx, col in enumerate(columns):
    columns[idx] = col + ' varchar(50)'
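
Header names taken straight from a file may contain spaces or other characters that are not valid in an unquoted SQL column name. The following sketch cleans them up before building the column definitions; `column_definitions` is a hypothetical helper, and the cleaning rules are an assumption rather than something the original file is known to need.

```python
import re


def column_definitions(header_line, delimiter='|', sql_type='varchar(50)'):
    """Turn a header line into a list of 'name type' column definitions.

    Names are lowercased and any character other than a letter, digit or
    underscore becomes an underscore, so headers with spaces stay valid SQL.
    """
    definitions = []
    for raw_name in header_line.strip().split(delimiter):
        name = re.sub(r'[^0-9a-zA-Z_]', '_', raw_name.strip()).lower()
        if not name or name[0].isdigit():
            name = 'col_' + name  # unquoted identifiers cannot start with a digit
        definitions.append('{0} {1}'.format(name, sql_type))
    return definitions
```

The result can be joined with commas and formatted into a CREATE TABLE template in the same way as the `columns` list.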

In [35]:
create_tb_q = '''
              create table playgolfer (
              {cols}
              );
              '''

In [44]:
curry.execute(create_tb_q.format(cols=','.join(columns)))

In [45]:
curry.execute('select * from playgolfer;')
curry.fetchall()

[]

In [53]:
copy_into_q = '''
              copy playgolfer
              from '{file_path}'
              delimiter '|'
              csv
              header;
              '''

new_folder_path = '/Users/minghuang/Documents/git/Galvanize/lecture-prep/sql-python/ming_huang/'

In [55]:
curry.execute(copy_into_q.format(file_path = new_folder_path + 'playgolf.csv'))

In [56]:
curry.execute('select * from playgolfer;')
curry.fetchall()

[('07-01-2014', 'sunny', '85', '85', 'false', "Don't Play"),
 ('07-02-2014', 'sunny', '80', '90', 'true', "Don't Play"),
 ('07-03-2014', 'overcast', '83', '78', 'false', 'Play'),
 ('07-04-2014', 'rain', '70', '96', 'false', 'Play'),
 ('07-05-2014', 'rain', '68', '80', 'false', 'Play'),
 ('07-06-2014', 'rain', '65', '70', 'true', "Don't Play"),
 ('07-07-2014', 'overcast', '64', '65', 'true', 'Play'),
 ('07-08-2014', 'sunny', '72', '95', 'false', "Don't Play"),
 ('07-09-2014', 'sunny', '69', '70', 'false', 'Play'),
 ('07-10-2014', 'rain', '75', '80', 'false', 'Play'),
 ('07-11-2014', 'sunny', '75', '70', 'true', 'Play'),
 ('07-12-2014', 'overcast', '72', '90', 'true', 'Play'),
 ('07-13-2014', 'overcast', '81', '75', 'false', 'Play'),
 ('07-14-2014', 'rain', '71', '80', 'true', "Don't Play")]
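
Because every column was created as varchar(50), all values come back as strings. A sketch of converting them in Python follows; `parse_golf_row` and the field names inside it are hypothetical, since the file's header is not shown, and the date format is assumed to be month-day-year based on values like '07-13-2014'.

```python
from datetime import datetime


def parse_golf_row(row):
    """Convert one all-text row into (date, str, int, int, bool, str)."""
    day, outlook, temperature, humidity, windy, result = row
    return (
        datetime.strptime(day, '%m-%d-%Y').date(),
        outlook,
        int(temperature),
        int(humidity),
        windy == 'true',
        result,
    )
```

Applying it to the fetched rows, for example with `[parse_golf_row(r) for r in rows]`, gives values that can be compared and aggregated numerically.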

In [57]:
conn.commit()
conn.close()