# Lecture Objectives

- Connect to a database from within a Python program and run queries
- Understand `psycopg2`'s cursors and commits
- Generate dynamic queries

## Combining SQL and Python

Often you will find yourself working with data that are only accessible through SQL.  However, your machine-learning capabilities are built in Python.  To resolve this issue, we can simply set up a connection from Python to the SQL database to bring the data to us.

## Why do we care?

- SQL-based databases are extremely common in almost all industry environments
- Can leverage the benefit of SQL's structure and scalability, while maintaining the flexibility of Python
- Very useful for scaled data pipelines, pre-cleaning, data exploration
- Allows for dynamic query generation and hence automation

## psycopg2

- A Python library that connects to PostgreSQL databases so you can easily query and retrieve data for analysis.
- [Documentation--Includes Installation Instructions](http://initd.org/psycopg/docs/install.html)
- In addition to what's listed in the documentation, if you have the Anaconda distribution of Python, the following should work:
```bash
conda install psycopg2
```
- There are similar packages for other flavors of SQL that work much the same way

## General Workflow

1. Establish connection to Postgres database using psycopg2
2. Create a cursor
3. Use the cursor to execute SQL queries and retrieve data
4. Commit SQL actions
5. Close the cursor and connection
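
The steps above can be sketched end to end as follows. This is only an illustration: it uses the standard-library `sqlite3` module with an in-memory database as a stand-in, so that it runs without a Postgres server. Both `sqlite3` and `psycopg2` follow Python's DB-API, so the method names are the same. The table name `demo` is a made-up example.

```python
import sqlite3

# 1. Establish a connection (an in-memory SQLite database here; with
#    psycopg2 this would be a pg2.connect(...) call instead).
conn = sqlite3.connect(":memory:")

# 2. Create a cursor.
cur = conn.cursor()

# 3. Use the cursor to execute SQL queries and retrieve data.
cur.execute("CREATE TABLE demo (id INTEGER, label TEXT);")
cur.execute("INSERT INTO demo VALUES (1, 'first');")
cur.execute("SELECT id, label FROM demo;")
rows = cur.fetchall()
print(rows)  # [(1, 'first')]

# 4. Commit SQL actions so the changes are stored.
conn.commit()

# 5. Close the cursor and the connection.
cur.close()
conn.close()
```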

# Walkthrough 1: Creating a database from Python

### Connect to the database
- Connections must be established using an existing database, username, database IP/URL, and possibly a password
- If you need to create a database, you can first connect to Postgres using the dbname 'postgres' to initialize

In [None]:
import psycopg2 as pg2
conn = pg2.connect(dbname='postgres', host='localhost')
# this might work better, depending on your setup
#conn = pg2.connect(dbname='postgres', user='postgres', host='localhost')

### Commits

- Data changes are not actually stored until you choose to commit
- You can choose to commit automatically by setting `autocommit = True` on the connection
- When connecting directly to the Postgres server to run server-level commands such as creating a database, you must set `autocommit = True`, because Postgres does not allow `CREATE DATABASE` to run inside a transaction block

In [None]:
conn.autocommit = True
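
To see what "not stored until you commit" means in practice, here is a small sketch. It assumes the standard-library `sqlite3` module as a stand-in for Postgres (so it runs anywhere), and the table name `demo` is hypothetical. The `commit()` and `rollback()` methods are part of the same DB-API that `psycopg2` connections implement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo (id INTEGER);")
conn.commit()

# An insert that is never committed can be thrown away with rollback().
cur.execute("INSERT INTO demo VALUES (1);")
conn.rollback()
cur.execute("SELECT count(*) FROM demo;")
after_rollback = cur.fetchone()[0]
print(after_rollback)  # 0

# Once committed, the insert survives a later rollback().
cur.execute("INSERT INTO demo VALUES (2);")
conn.commit()
conn.rollback()
cur.execute("SELECT count(*) FROM demo;")
after_commit = cur.fetchone()[0]
print(after_commit)  # 1

conn.close()
```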

### Instantiate the Cursor

- A cursor is a control structure that enables traversal over the records in a database
- Executes and fetches data
- When the cursor points at the resulting output of a query, it can only read each observation once.  If you choose to see a previously read observation, you must rerun the query. 
- Can be closed without closing the connection

In [None]:
cur = conn.cursor()
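
The read-once behavior described above can be checked with a short sketch. It again assumes `sqlite3` as a runnable stand-in for a Postgres connection, with a hypothetical `demo` table. The `execute`, `fetchall`, and `close` methods are the same ones a `psycopg2` cursor provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo (id INTEGER);")
cur.execute("INSERT INTO demo VALUES (1), (2), (3);")

# The first fetch consumes the result set.
cur.execute("SELECT id FROM demo ORDER BY id;")
first_pass = cur.fetchall()
print(first_pass)   # [(1,), (2,), (3,)]

# Nothing is left to read on a second fetch.
second_pass = cur.fetchall()
print(second_pass)  # []

# Rerunning the query makes the rows available again.
cur.execute("SELECT id FROM demo ORDER BY id;")
third_pass = cur.fetchall()

# Closing the cursor leaves the connection open for new cursors.
cur.close()
new_cur = conn.cursor()
new_cur.execute("SELECT count(*) FROM demo;")
remaining_count = new_cur.fetchone()[0]
print(remaining_count)  # 3

conn.close()
```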

###  Create a database

In [None]:
cur.execute('DROP DATABASE IF EXISTS temp;')
cur.execute('CREATE DATABASE temp;')

### Disconnect from the cursor and database
- Cursors and connections must be closed using `.close()`, or else Postgres will lock certain operations on the database/tables until the connection is severed.

In [None]:
cur.close() # This is optional
conn.close() # Closing the connection also closes all cursors
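
One way to make sure `.close()` always runs, even if a query raises an error, is `contextlib.closing` from the standard library. The sketch below is an assumption-laden illustration rather than the only approach: it uses `sqlite3` so it can run without a server, but `closing` simply calls `.close()` on whatever object it wraps, so the same pattern should apply to a `psycopg2` connection and cursor.

```python
import sqlite3
from contextlib import closing

# closing(...) calls .close() when the block exits, even after an exception.
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT 1;")
        value = cur.fetchone()[0]
print(value)  # 1

# The connection is closed now, so asking it for a cursor fails.
closed = False
try:
    conn.cursor()
except sqlite3.ProgrammingError:
    closed = True
print(closed)  # True
```

Note that using a connection directly in a `with` statement is different: in `psycopg2` (and in `sqlite3`) that form manages the transaction but does not close the connection.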

# Walkthrough 2: Let's use our new database

### Connect to our database

In [None]:
conn = pg2.connect(dbname='temp', host='localhost')
# again, this might work better
#conn = pg2.connect(dbname='temp', user='postgres', host='localhost')

In [None]:
cur = conn.cursor()

### Create a new table

In [None]:
query = '''
        CREATE TABLE logins (
            userid integer, 
            tmstmp timestamp, 
            type varchar(10)
        );
        '''
cur.execute(query)

### Insert data into new table

The `os` library is needed to get the current directory and, later, dynamically identify the files we need to insert using the `listdir` method.

In [None]:
import os

In [None]:
path = os.getcwd()

query = '''
        COPY logins 
        FROM '{0}/logins_data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''.format(path)
cur.execute(query)

### Run a query to get 30 records from our data

In [None]:
query = '''
        SELECT *
        FROM logins
        LIMIT 30;
        '''
cur.execute(query)

### Let's look at our data one line at a time

In [None]:
cur.fetchone()

### Many lines at a time

In [None]:
#fetchmany(n) to get n rows
cur.fetchmany(10)

### Or everything at once

In [None]:
#fetchall() grabs all remaining rows
cur.fetchall()
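
The three fetch methods all read from the same result set, and each one picks up where the previous one stopped. Here is a minimal sketch of that, assuming `sqlite3` as a stand-in database and a hypothetical `demo` table with six rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo (id INTEGER);")
cur.execute("INSERT INTO demo VALUES (1), (2), (3), (4), (5), (6);")

cur.execute("SELECT id FROM demo ORDER BY id;")
one = cur.fetchone()      # first row only
some = cur.fetchmany(2)   # the next two rows
rest = cur.fetchall()     # everything that is left
nothing = cur.fetchone()  # the result set is exhausted

print(one)      # (1,)
print(some)     # [(2,), (3,)]
print(rest)     # [(4,), (5,), (6,)]
print(nothing)  # None

conn.close()
```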

### You can even iterate over the cursor

A cursor is iterable, just like a `list`, a `dict`, or a `range` object, so you can loop over it in a `for` loop. Like a generator, it yields each row only once per executed query.

In [None]:
cur.execute(query)
for record in cur:
    print("Here's a record:", record)

# Dynamic Queries

A **dynamic query** is a query that's generated based on context.

### Example

We have 8 login CSV files that we need to insert into the logins table. Instead of writing 8 separate `COPY FROM` queries, we should use Python (or whatever language we're using) to generate these automatically.

That's what we did in the earlier code:
```python
query = '''
        COPY logins 
        FROM '{0}/logins_data/logins01.csv' 
        DELIMITER ',' 
        CSV;
        '''.format(path)
cur.execute(query)
```
But we did it in a bad way.

Here are some more bad examples.

In [None]:
num = 579
terribly_unsafe = "SELECT * FROM logins WHERE userid = " + str(num) + ";"
print(terribly_unsafe)

date_cut = "2014-08-01"
# The date is wrapped in quotes so that SQL reads it as a literal, not as arithmetic
horribly_risky = f"SELECT * FROM logins WHERE tmstmp > '{date_cut}';"
print(horribly_risky)
# Python is happy, but if num or date_cut included something malicious,
# your data could be at risk

They all look good, they work, and neither Python nor SQL complains. But suppose we had asked someone to enter a date into a text form and used it in the query above, and instead of entering `"2014-08-01"`, they entered something different.

In [None]:
# The injected quote closes the date literal early, and "--" comments out the rest
date_cut = "2014-08-01'; DROP TABLE logins; --"
horribly_risky = f"SELECT * FROM logins WHERE tmstmp > '{date_cut}';"
print(horribly_risky)

That would be very bad.

NEVER use `+` or `%` or `.format` or `f""` to reformat strings to be used with `.execute`. You might think you know where your variable came from, but things might change and you might be wrong.
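
To make the risk concrete, here is a small runnable sketch. It assumes `sqlite3` as a stand-in for Postgres and a tiny hypothetical `logins` table with made-up rows. Note that `sqlite3` writes its placeholder as `?`, where `psycopg2` uses `%s`. The unsafe query is built with `+` on purpose, to show what goes wrong.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logins (userid INTEGER, type TEXT);")
cur.execute(
    "INSERT INTO logins VALUES (579, 'web'), (580, 'mobile'), (581, 'web');"
)

# Pretend this came from a text form instead of a trusted source.
user_input = "579 OR 1=1"

# Unsafe: the input becomes part of the SQL, so "OR 1=1" matches every row.
unsafe_query = "SELECT userid FROM logins WHERE userid = " + user_input + ";"
cur.execute(unsafe_query)
leaked = cur.fetchall()
print(len(leaked))  # 3

# Safe: the input is passed separately and treated as a single value.
cur.execute("SELECT userid FROM logins WHERE userid = ?;", [user_input])
safe = cur.fetchall()
print(safe)  # []

conn.close()
```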

### Example, continued

In [None]:
import psycopg2

In [None]:
psycopg2.__version__

There is a way to generate queries dynamically while ensuring that only a value (a string, number, or other object) is substituted in, rather than arbitrary code: parameterized queries. The `psycopg2` library supports them in `.execute`.

In `psycopg2`, each `%s` placeholder in the query is filled in at the execute step. It looks like the `printf` format specifier in C, where the `s` stands for string, but in `psycopg2` the same `%s` placeholder is used for every value type, and it should not be wrapped in quotes.

In [None]:
query = '''
        COPY logins 
        FROM %s
        DELIMITER ','
        CSV;
        '''

To substitute in values, we pass a second argument, a sequence such as a list or tuple, to the `execute` method. The first element of the sequence replaces the first `%s`, and so on.

In [None]:
folder_path = path + '/logins_data/'
for file_name in os.listdir(folder_path):
    if file_name.endswith('.csv') and file_name != 'logins01.csv':
        file_path=folder_path+file_name
        cur.execute(query, [file_path])
        print('{0} inserted into table.'.format(file_name))
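
The same loop-over-files idea can be tried without a Postgres server. The sketch below is an analogue under stated assumptions, not the method used above: it uses `sqlite3`, which has no `COPY` command, so rows are read with the `csv` module and inserted with a parameterized `executemany`. The sample files and their contents are invented for the demo, and `?` plays the role that `%s` plays in `psycopg2`.

```python
import csv
import os
import sqlite3
import tempfile

# Write two small, made-up CSV files into a temporary folder.
folder_path = tempfile.mkdtemp()
samples = {
    "logins01.csv": [(579, "2014-08-01 10:00:00", "web")],
    "logins02.csv": [
        (580, "2014-08-02 11:30:00", "mobile"),
        (581, "2014-08-03 09:15:00", "web"),
    ],
}
for name, sample_rows in samples.items():
    with open(os.path.join(folder_path, name), "w", newline="") as f:
        csv.writer(f).writerows(sample_rows)

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logins (userid INTEGER, tmstmp TEXT, type TEXT);")

# One query template reused for every file; values are passed separately.
insert_query = "INSERT INTO logins VALUES (?, ?, ?);"

for file_name in sorted(os.listdir(folder_path)):
    if file_name.endswith(".csv"):
        with open(os.path.join(folder_path, file_name), newline="") as f:
            # Each CSV row fills the three placeholders in order.
            cur.executemany(insert_query, csv.reader(f))
        print(f"{file_name} inserted into table.")

conn.commit()
cur.execute("SELECT count(*) FROM logins;")
total = cur.fetchone()[0]
print(total)  # 3
conn.close()
```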

### Let's check the total number of records we have right now

In [None]:
cur.execute('SELECT count(*) FROM logins;')
cur.fetchall()

### Don't forget to commit your changes

In [None]:
conn.commit()

### Close your connection

In [None]:
conn.close()

# Key Things to Remember

* Connections must be established using an existing database, username, database IP/URL, and possibly a password.
* If you have not created any databases, you can connect to Postgres using the dbname 'postgres' to run database-level commands.
* Data changes are not actually stored until you choose to commit. This can be done either through `conn.commit()` or by setting `autocommit = True`. Until committed, all changes are only stored temporarily.
* `autocommit = True` is necessary for database-level commands like `CREATE DATABASE`, because Postgres does not allow them to run inside a transaction block.
* If you ever need to build similar pipelines for other kinds of databases, there are libraries such as pyodbc that operate in essentially the same way.
* SQL database connections use cursors for data traversal and retrieval. A cursor is kind of like an iterator in Python.
* Cursor operations typically go as follows:
    - execute a query
    - fetch rows from the query result if it is a SELECT query
    - because it is iterative, previously fetched rows can only be fetched again by rerunning the query
    - close the cursor through `.close()`
* Cursors and connections must be closed using `.close()`, or else Postgres will lock certain operations on the database/tables until the connection is severed.

# Exercise

You're given a file called `playgolf.csv` in the data folder.  The file is comma delimited and the first row is the header.  Without opening and looking at the file, create a table and insert the data. Here is the header and first row:


|Date|Outlook|Temperature|Humidity|Windy|Result|
|----|-------|-----------|--------|-----|------|
|07-01-2014|sunny|85|85|false|Don't Play|