Author: Ming Huang

- Last updated: 02/10/2016
- By: Ming Huang

Combining SQL and Python
=================================

Often you will find yourself working with data that are only accessible through SQL.  However, SQL is limited in its mathematical and machine learning capabilities.  To resolve this issue, we can set up a connection from Python to the SQL database.

# Why?

- SQL-based databases are very common in almost all industries
- You can leverage SQL's structure and scalability while keeping the flexibility of Python
- Very useful for scaled data pipelines, pre-cleaning, and data exploration
- Allows for dynamic query generation and hence automation

# Objectives

- Connect to and run Postgres queries from Python
- Create cursors, execute queries, and fetch data
- Create dynamic SQL queries through Python string formatting

# Psycopg2

Psycopg2 is a Python library that connects to an existing Postgres database to execute queries and retrieve data.

### Documentation (Includes Installation Instructions)

- http://initd.org/psycopg/docs/install.html

# General Workflow

1. Establish connection to database using Psycopg2
2. Create a cursor
3. Use the cursor to execute SQL queries
4. Commit SQL actions
5. Close the cursor and connection
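
The five steps above can be sketched end to end. So that the sketch runs without a Postgres server, it uses Python's built-in sqlite3 module as a stand-in; this is an assumption made purely for illustration and is not part of the walkthroughs below. Both sqlite3 and psycopg2 follow the Python DB-API, so the connect / cursor / execute / commit / close calls look the same and only the connect() arguments differ. The table name demo is hypothetical.

```python
import sqlite3  # stand-in for psycopg2 so the sketch runs anywhere

# 1. Establish a connection (psycopg2 would take dbname/user/host instead)
conn = sqlite3.connect(':memory:')

# 2. Create a cursor
cur = conn.cursor()

# 3. Use the cursor to execute SQL queries
cur.execute('CREATE TABLE demo (id integer, label varchar(10))')
cur.execute("INSERT INTO demo VALUES (1, 'first')")

# 4. Commit SQL actions so the changes are stored
conn.commit()

# Read the data back to confirm the insert
cur.execute('SELECT id, label FROM demo')
rows = cur.fetchall()

# 5. Close the cursor and connection
cur.close()
conn.close()
```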

# What is a cursor?

- A cursor is a control structure that enables traversal over the records in a database. 
- Executes and fetches data.
- When the cursor points at the resulting output of a query, it can only read each observation once.  If you want to see a previously read observation, you must rerun the query.  This is similar to generators in Python.
- Can be closed without closing the connection.
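
The read-once behaviour described above can be seen in a small sketch. It again uses the built-in sqlite3 module as a stand-in so it runs without a Postgres server (an assumption for illustration only); psycopg2 cursors offer the same fetchone / fetchall methods, but use %s rather than ? as the placeholder in executemany.

```python
import sqlite3  # stand-in for psycopg2; both expose DB-API cursors

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE logins (userid integer)')
cur.executemany('INSERT INTO logins VALUES (?)', [(1,), (2,), (3,)])

cur.execute('SELECT userid FROM logins ORDER BY userid')
first = cur.fetchone()     # consumes the first row
rest = cur.fetchall()      # returns only the rows not yet read
leftover = cur.fetchall()  # nothing left: the result set is exhausted

# To see the rows again, the query has to be rerun
cur.execute('SELECT userid FROM logins ORDER BY userid')
rerun = cur.fetchall()

cur.close()
conn.close()
```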

# Commits

- Data changes are not actually stored until you choose to commit.
- You can choose to commit automatically by setting autocommit = True on the connection.
- When connecting directly to the Postgres server to run server-level commands such as creating a database, you must have autocommit set to True, since Postgres does not allow a database to be created inside a transaction (you cannot "temporarily" create a database).
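
A minimal sketch of the difference between committed and uncommitted changes follows. It uses the built-in sqlite3 module as a stand-in so that it runs without a Postgres server (an assumption for illustration); psycopg2 connections have the same commit() and rollback() methods. The helper count_rows is hypothetical.

```python
import sqlite3  # stand-in for psycopg2 so the sketch runs without a server

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE logins (userid integer)')
conn.commit()

def count_rows():
    """Return the current number of rows in logins."""
    cur.execute('SELECT count(*) FROM logins')
    return cur.fetchone()[0]

# An insert that is rolled back instead of committed is discarded
cur.execute('INSERT INTO logins VALUES (1)')
conn.rollback()
after_rollback = count_rows()

# An insert followed by commit survives a later rollback
cur.execute('INSERT INTO logins VALUES (2)')
conn.commit()
conn.rollback()
after_commit = count_rows()

conn.close()
```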

# Key Things to Know

- Connections must be established using an existing database, a username, the database IP/URL, and possibly a password
- If you have not created any databases yet, you can connect to Postgres using the dbname 'postgres' to run database-level commands
- Cursors and connections must be closed using .close(), or else Postgres will lock certain operations on the database/tables until the connection is severed.
- If you ever need to build similar pipelines for other kinds of databases, there are libraries such as pyodbc, which operate in essentially the same way.
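
One way to make sure .close() is always called, even when a query raises an exception, is contextlib.closing from the standard library. The sketch below is an illustration under the assumption that sqlite3 stands in for psycopg2; the pattern applies to any object with a .close() method. closing() is used rather than a bare "with conn:" because in psycopg2 the latter ends the transaction but leaves the connection open.

```python
import sqlite3  # stand-in for psycopg2; the pattern works for any DB-API connection
from contextlib import closing

# closing() calls .close() when the block exits, even if an exception is raised
with closing(sqlite3.connect(':memory:')) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute('SELECT 1')
        value = cur.fetchone()[0]

# After the block the connection is closed, so asking for a new cursor fails
try:
    conn.cursor()
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
```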

# Walkthrough 1: Creating a database from Python

#### First, let's import psycopg2

In [None]:
import psycopg2 as pg2

#### Create a connection with Postgres

In [None]:
conn = pg2.connect(dbname='postgres', user='minghuang', host='localhost')

#### Set autocommit to True

In [None]:
conn.autocommit = True

#### Create the Cursor

In [None]:
cur = conn.cursor()

#### Create a database

In [None]:
cur.execute('DROP DATABASE IF EXISTS temp;')
cur.execute('CREATE DATABASE temp;')

#### Disconnect from the cursor and database

In [None]:
cur.close() # This is optional
conn.close()

# Walkthrough 2: Let's use our new database

#### Connect to our database

In [None]:
conn = pg2.connect(dbname='temp', user='minghuang', host='localhost')

In [None]:
cur = conn.cursor()

#### Create a new table

In [None]:
query = '''
        CREATE TABLE logins (
            userid integer
            , tmstmp timestamp
            , type varchar(10)
        );
        '''
cur.execute(query)

#### Insert .csv data into new table

In [None]:
query = '''
        COPY logins 
        FROM '/Users/minghuang/Documents/git/Galvanize/lecture-prep/sql-python/ming_huang/data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''
cur.execute(query)

#### Run a query to get 30 records from our data

In [None]:
query = '''
        SELECT *
        FROM logins
        LIMIT 30;
        '''
cur.execute(query)

#### Let's look at our data one line at a time

In [None]:
cur.fetchone()

#### Many lines at a time

In [None]:
cur.fetchmany(10)

#### Or everything at once

In [None]:
cur.fetchall()
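
The fetch methods return plain tuples without column names. If the names are needed, they can be read from the cursor's description attribute, which is part of the Python DB-API. The sketch below assumes sqlite3 as a stand-in so that it runs without a Postgres server, and uses a cut-down, hypothetical version of the logins table with made-up rows.

```python
import sqlite3  # stand-in for psycopg2; cursor.description is part of the DB-API

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE logins (userid integer, type varchar(10))')
cur.executemany('INSERT INTO logins VALUES (?, ?)', [(1, 'web'), (2, 'mobile')])

cur.execute('SELECT userid, type FROM logins ORDER BY userid')
# The first item of each description entry is the column name
columns = [col[0] for col in cur.description]
# Pair each row's values with the column names
records = [dict(zip(columns, row)) for row in cur.fetchall()]

conn.close()
```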

# Dynamic Queries

A dynamic query is a query that is generated at run time based on context.
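
When the part of the query that changes is a value (rather than a file path or a table name), it can be passed to execute() as a parameter instead of being formatted into the string, which leaves the quoting to the driver. The sketch below is an illustration only: it assumes sqlite3 as a stand-in, whose placeholder is ?, whereas psycopg2 uses %s. The helper count_logins and the sample rows are hypothetical.

```python
import sqlite3  # stand-in for psycopg2, which uses %s placeholders instead of ?

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE logins (userid integer, type varchar(10))')
cur.executemany('INSERT INTO logins VALUES (?, ?)',
                [(1, 'web'), (2, 'mobile'), (3, 'web')])

def count_logins(cur, login_type):
    """Count the logins of one type, passing the type as a query parameter."""
    # The driver substitutes the value for the placeholder,
    # so no manual quoting is needed
    cur.execute('SELECT count(*) FROM logins WHERE type = ?', (login_type,))
    return cur.fetchone()[0]

# The same query text is reused for each value
counts = {t: count_logins(cur, t) for t in ('web', 'mobile')}
conn.close()
```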

#### Example

We have 8 login csv files that we need to insert into the logins table.  Instead of writing a COPY FROM query 8 times, we can use Python (or any other language) to generate the queries for us.  This is possible because of string templates with placeholders.

#### First, let's get an idea of how many records we start with

In [None]:
cur.execute('SELECT count(*) FROM logins;')
cur.fetchall()

#### os is needed because we want to dynamically identify the files we need to insert using listdir.

In [None]:
import os

#### Create a query template and determine file path for imports

In [None]:
query = '''
        COPY logins 
        FROM '{file_path}'
        DELIMITER ','
        CSV;
        '''

folder_path = '/Users/minghuang/Documents/git/Galvanize/lecture-prep/sql-python/ming_huang/data/'

#### Use string formatting to generate a query for each approved file.

In [None]:
for file_name in os.listdir(folder_path):
    # Skip logins01.csv, which was already loaded in the COPY step above
    if file_name.endswith('.csv') and file_name != 'logins01.csv':
        dyn_query = query.format(file_path=os.path.join(folder_path, file_name))
        cur.execute(dyn_query)
        print('{0} inserted into table.'.format(file_name))
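
To check what the template produces before running anything against the database, the queries can be built and printed first. This is a minimal sketch: the helper build_copy_queries, the folder /tmp/data/ and the file names are made up for illustration, and no database connection is involved.

```python
# Same template as above
query = '''
        COPY logins
        FROM '{file_path}'
        DELIMITER ','
        CSV;
        '''

def build_copy_queries(folder_path, file_names, skip=('logins01.csv',)):
    """Return one COPY query per .csv file that is not in skip."""
    queries = []
    for file_name in sorted(file_names):
        if file_name.endswith('.csv') and file_name not in skip:
            queries.append(query.format(file_path=folder_path + file_name))
    return queries

# Hypothetical directory listing, including files that should be ignored
queries = build_copy_queries(
    '/tmp/data/',
    ['logins02.csv', 'logins01.csv', 'notes.txt', 'logins03.csv'])

for q in queries:
    print(q)
```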

#### Let's check how many records we have right now.

In [None]:
cur.execute('SELECT count(*) FROM logins;')
cur.fetchall()

#### Don't forget to commit your changes

In [None]:
conn.commit()

#### Close your connection

In [None]:
conn.close()

# Exercise

You're given a file called playgolf.csv in the data folder.  The file is tab-delimited and the first row is the header.  Without opening and looking at the file, create a table and insert the data.

In [None]:
conn.close()