#Using PostgreSQL in Python (Psycopg2) to build data pipelines

###What is Psycopg2?
A library that allows Python to connect to an existing Postgres database and use its SQL functionality.

###Why
- Very useful for large-scale data pipelines, pre-cleaning, and data exploration
- Allows for dynamic query generation

###Documentation
* http://initd.org/psycopg/docs/install.html

#Objectives

- Learn how to connect to and run Postgres queries from Python
- Understand cursors, executes, and commits
- Learn how to generate dynamic queries

#Key Things to Know

* Connections must be established using an existing database name, a username, the database host (IP/URL), and possibly a password
* If you have not created any databases yet, you can connect to Postgres using the dbname 'postgres' to run database-level commands
* Data changes are not actually stored until you commit them. This can be done either by calling commit() on the connection or by setting autocommit = True.  Until committed, all changes are only stored temporarily inside the open transaction.
* Database connections use cursors for data traversal and retrieval.  A cursor is somewhat like an iterator in Python.
* Cursor operations typically go as follows:
    - execute a query
    - fetch rows from the query result if it is a SELECT query
    - because fetching is iterative, previously fetched rows can only be fetched again by rerunning the query
    - close the cursor through .close()
* Cursors and connections should be closed using .close(); otherwise an open transaction can keep locks on the database/tables until the connection is severed.
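
The cursor steps above can be sketched as a small helper. This is only a minimal sketch: it assumes `conn` is an already open connection (for example one returned by `psycopg2.connect`), and the helper name `run_select` is hypothetical, not part of psycopg2. It also shows that a second fetch on the same cursor comes back empty, because the rows were already consumed.

```python
def run_select(conn, query):
    """Run a SELECT on an open DB-API connection.

    Returns (first_pass, second_pass) to show that rows can only be
    fetched once per execute().
    """
    cur = conn.cursor()                 # cursors are created from the connection
    try:
        cur.execute(query)              # 1. execute the query
        first_pass = cur.fetchall()     # 2. fetch the rows of the result
        second_pass = cur.fetchall()    # 3. empty list: the rows are already consumed
        return first_pass, second_pass
    finally:
        cur.close()                     # 4. always close the cursor
```

To read the rows again, call `cur.execute(query)` again before fetching.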

#Fun facts

* If you ever need to build similar pipelines for other kinds of databases, there are libraries such as pyodbc that operate in essentially the same way
* autocommit = True is necessary to run database commands like CREATE DATABASE.  This is because Postgres does not allow CREATE DATABASE to run inside a transaction block.

#Installation

In [None]:
!pip install psycopg2

#Creating a connection with Postgres

###Import

In [None]:
import psycopg2 as pg2

###Create connection with Postgres
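
A connection could be opened roughly as follows. This is a sketch under assumptions: the user name `postgres` and host `localhost` are placeholders for your own setup, and the helper names `connection_params` and `open_connection` are hypothetical. `psycopg2.connect` itself accepts the keyword arguments `dbname`, `user`, `host`, and `password`.

```python
def connection_params(dbname='postgres', user='postgres',
                      host='localhost', password=None):
    """Collect the keyword arguments for psycopg2.connect().

    All default values here are assumptions; replace them with your own.
    """
    params = {'dbname': dbname, 'user': user, 'host': host}
    if password is not None:      # some local setups need no password
        params['password'] = password
    return params


def open_connection(**overrides):
    """Open a connection to Postgres using the parameters above."""
    # Imported here so connection_params() can be used without psycopg2 installed.
    import psycopg2 as pg2
    return pg2.connect(**connection_params(**overrides))

# Typical notebook usage (requires a running Postgres server):
# conn = open_connection(dbname='postgres')
```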

###Retrieve the Cursor

* A cursor is a control structure that enables traversal over the records in a database.  You can think of it as an iterator or pointer for SQL data retrieval.
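
To make the iterator comparison concrete, here is a minimal sketch that gets a cursor with `conn.cursor()` and loops over it directly. It assumes `conn` is an open connection; the helper name `iter_rows` is hypothetical.

```python
def iter_rows(conn, query):
    """Yield the rows of a query one at a time by iterating over the cursor."""
    cur = conn.cursor()        # retrieve a cursor from the open connection
    try:
        cur.execute(query)
        for row in cur:        # the cursor itself can be looped over like an iterator
            yield row
    finally:
        cur.close()

# Typical notebook usage:
# cur = conn.cursor()
```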

#Create a database
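
Creating a database might look like the sketch below. It assumes `conn` is a connection to the default 'postgres' database. The helper name `create_database` and the database name `lecture_db` in the usage comment are hypothetical. autocommit is switched on first because CREATE DATABASE cannot run inside a transaction block.

```python
def create_database(conn, dbname):
    """Create a new database through an open connection."""
    # Simple guard (an assumption of this sketch): only accept plain
    # identifier-like names, since the name is pasted into the SQL text.
    if not dbname.isidentifier():
        raise ValueError('unsafe database name: {!r}'.format(dbname))
    conn.autocommit = True     # CREATE DATABASE cannot run inside a transaction
    cur = conn.cursor()
    cur.execute('CREATE DATABASE {};'.format(dbname))
    cur.close()

# Typical notebook usage:
# create_database(conn, 'lecture_db')
```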

#Disconnect from the cursor and database
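
Disconnecting comes down to two `.close()` calls, cursor first and then connection. A minimal sketch, assuming `cur` and `conn` are the open cursor and connection from earlier; the helper name `disconnect` is hypothetical.

```python
def disconnect(cur, conn):
    """Close the cursor first, then the connection it belongs to."""
    try:
        cur.close()
    finally:
        conn.close()   # runs even if closing the cursor raised an error

# Typical notebook usage:
# disconnect(cur, conn)
```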

#Let's use our new database

###Create a new table

In [None]:
query = '''
        CREATE TABLE logins (
            userid integer
            , tmstmp timestamp
            , type varchar(10)
        );
        '''
cur.execute(query)

###Insert csv into new table

In [None]:
query = '''
        COPY logins 
        FROM '/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''
cur.execute(query)

###Let's take a look at the data

In [None]:
query = '''
        SELECT *
        FROM logins
        LIMIT 30;
        '''
cur.execute(query)

###One line at a time
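
`cur.fetchone()` returns the next row of the result, or `None` once no rows are left. A minimal sketch, assuming `cur` has just executed a SELECT; the helper name `fetch_one_at_a_time` is hypothetical.

```python
def fetch_one_at_a_time(cur):
    """Yield rows one by one until the result set is exhausted."""
    while True:
        row = cur.fetchone()   # next row, or None when nothing is left
        if row is None:
            break
        yield row

# Typical notebook usage:
# cur.fetchone()
```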

###Many lines at a time
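
`cur.fetchmany(size)` returns up to `size` rows per call and an empty list once the result is exhausted. A minimal sketch, assuming `cur` has just executed a SELECT; the helper name `fetch_in_batches` and the default batch size of 10 are arbitrary choices for illustration.

```python
def fetch_in_batches(cur, batch_size=10):
    """Yield lists of at most batch_size rows until no rows are left."""
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:          # empty list: the result set is exhausted
            break
        yield batch

# Typical notebook usage:
# cur.fetchmany(10)
```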

###Or everything at once
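
`cur.fetchall()` returns every remaining row as a list of tuples. The sketch below, which assumes `cur` has just executed a SELECT, also pairs each value with its column name taken from `cur.description`. The helper name `fetch_all_as_dicts` is hypothetical.

```python
def fetch_all_as_dicts(cur):
    """Fetch all remaining rows and label each value with its column name."""
    columns = [col[0] for col in cur.description]   # first item is the column name
    return [dict(zip(columns, row)) for row in cur.fetchall()]

# Typical notebook usage:
# cur.fetchall()
```

For large tables, fetching everything at once loads the whole result into memory, so fetching in batches may be the safer choice.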

#Dynamic Queries

We have 8 login csv files that we need to insert into the logins table.  Instead of writing a COPY FROM query 8 times, we can use Python (or any other language) to generate the queries for us.  This is possible because Python strings can hold named placeholders, such as {file_path}, that are filled in with str.format().

In [None]:
# os is needed because we want to dynamically identify the files we need to insert.
import os

In [None]:
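# {file_path} is a placeholder to be filled in with query.format(file_path=...).
# COPY expects the path as a quoted string, so the placeholder sits inside quotes.
query = '''
        COPY logins 
        FROM '{file_path}' 
        DELIMITER ',' 
        CSV;
        '''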

folder_path = '/Users/minghuang/Documents/git/zipfian-dsr/sql-python/mh-lecture/data/'
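
One way to generate and run the COPY queries is sketched below. The names `COPY_TEMPLATE`, `build_copy_queries`, and `load_all` are hypothetical, and the sketch assumes every `.csv` file in the folder belongs in the logins table. Since logins01.csv was already loaded above, the optional `skip` argument can be used to leave it out.

```python
import os

# Same COPY statement as before, with the path as a quoted placeholder.
COPY_TEMPLATE = '''
        COPY logins
        FROM '{file_path}'
        DELIMITER ','
        CSV;
        '''


def build_copy_queries(folder_path, skip=()):
    """Build one COPY query per csv file found in folder_path."""
    queries = []
    for file_name in sorted(os.listdir(folder_path)):
        if file_name.endswith('.csv') and file_name not in skip:
            file_path = os.path.join(folder_path, file_name)
            # Note: a path containing a single quote would break the SQL text.
            queries.append(COPY_TEMPLATE.format(file_path=file_path))
    return queries


def load_all(cur, folder_path, skip=()):
    """Execute every generated COPY query on an open cursor."""
    for copy_query in build_copy_queries(folder_path, skip=skip):
        cur.execute(copy_query)

# Typical notebook usage:
# load_all(cur, folder_path, skip=('logins01.csv',))
```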

In [None]:
cur.execute('SELECT count(*) FROM logins;')
cur.fetchall()

###Don't forget to commit your changes

In [None]:
conn.commit()
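
If a query fails halfway, the usual counterpart to `commit()` is `rollback()`, which discards the uncommitted changes. A minimal sketch, assuming `conn` is an open connection; the helper name `execute_and_commit` is hypothetical.

```python
def execute_and_commit(conn, query):
    """Run one statement and commit it, or roll back if it fails."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        conn.commit()          # make the change permanent
    except Exception:
        conn.rollback()        # discard the failed, uncommitted work
        raise
    finally:
        cur.close()
```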

#All done!  Time for morning exercise!

In [None]:
conn.close()