# Data Pipelines & PostgreSQL

## Connect, Load, Execute, Commit on PostgreSQL

- Import the psycopg2 library.
- Connect to database.
- Use the print function to display the Connection object.
- Close the Connection using the close method.

In [None]:
import psycopg2
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
print(conn)
conn.close()

- Connect to database
- Using the Cursor object, create a string query that selects all from the test_db table.
- Execute the query using the execute method.
- Fetch all the results from the table and assign it to the variable notes.
- Close the Connection using the close method.

In [None]:
import psycopg2
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('SELECT * FROM test_db')
# fetchall() only returns rows that have not been fetched yet, so calling
# fetchone() first would leave the first row out of notes.
notes = cur.fetchall()
conn.close()

- Connect to database
- Write a SQL query that creates a table called users in the database, with the following columns and data types:
    - id -- integer data type, and is a primary key.
    - email -- text data type.
    - name -- text data type.
    - address -- text data type.
- Execute the query using the execute method.
- Don't close the connection.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('''CREATE TABLE users (id BIGSERIAL PRIMARY KEY, update VARCHAR(255), available INT, free INT, 
            name VARCHAR(255), long NUMERIC, lat NUMERIC, total INT)''');

- Connect to the database.
- Write a SQL query that creates a table called users in the database, with the following columns and data types:
    - id -- integer data type, and is a primary key.
    - email -- text data type.
    - name -- text data type.
    - address -- text data type.
- Execute the query using the execute method.
- Use the commit method on the Connection object to apply the changes in the transaction to the database.
- Close the Connection.

psycopg2 opens a transaction automatically the first time a statement is executed on a Connection. 
Every query run after that belongs to the same transaction until the <a href = "http://initd.org/psycopg/docs/connection.html#connection.commit"><b>commit()</b></a> method is called. The queries are executed as they are sent, but their changes only become permanent, and visible to other 
connections, when the transaction is committed.

If we don't want to apply the changes in the transaction block, we can call the <b>rollback()</b> 
method to discard them. If neither commit nor rollback is called, the transaction stays open, 
and closing the connection discards the changes, so they are never applied to 
the database.
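
The commit-or-rollback pattern described above can be sketched as a small helper. This is a minimal sketch, not part of psycopg2: the name `run_in_transaction` is hypothetical, and it only assumes that `conn` is a DB-API connection such as the one returned by `psycopg2.connect`.

```python
def run_in_transaction(conn, statements):
    """Execute every (sql, params) pair, then commit; roll back if one fails.

    `conn` is assumed to be a DB-API connection (for example from
    psycopg2.connect). The helper name is hypothetical.
    """
    cur = conn.cursor()
    try:
        for sql, params in statements:
            cur.execute(sql, params)
    except Exception:
        # Discard everything executed since the last commit.
        conn.rollback()
        raise
    else:
        # Make the changes permanent and visible to other connections.
        conn.commit()
    finally:
        cur.close()
```

With a psycopg2 connection, a call might look like `run_in_transaction(conn, [("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Ann"))])`; either every statement in the list is applied or none is.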

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('CREATE TABLE users (id BIGSERIAL PRIMARY KEY, \
            update VARCHAR(255), \
            available INT, \
            free INT, name VARCHAR(255), \
            long NUMERIC, \
            lat NUMERIC, \
            total INT)');
conn.commit()
conn.close()

- Import the csv module.
- Load the user_accounts.csv using the csv module
- Connect to the database
- Execute the insert query on the users table using the execute method.
- Insert every row from the user_accounts.csv file and skip the header row.
- Fetch all the results from the users table and assign it to the variable users.
- Close the Connection using the close method.

In [None]:
import csv
with open('user_accounts.csv') as f:
    reader = csv.reader(f, delimiter=",")
    next(reader)
    rows = [row for row in reader]

conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()

for row in rows:
    cur.execute("INSERT INTO test_db VALUES (%s, %s, %s, %s, %s, %s, %s, %s)", row)

conn.commit()

cur.execute('SELECT * FROM test_db')
users = cur.fetchall()
conn.close()

- Connect to database
- Load the user_accounts.csv using with open(...) as f.
- Skip the header row.
- Using the copy_from method, copy the file into the database.
- Fetch all the results from the users table and assign it to the variable users.
- Close the Connection using the close method.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()

with open('user_accounts.csv') as f:
    next(f)
    cur.copy_from(f, 'test_db', sep=',')

conn.commit()
    
cur.execute('SELECT * FROM test_db')
users = cur.fetchall()
conn.close()

## Creating Tables on PostgreSQL & Data Types

- Using the provided `cur` object, execute the `SELECT` query from the table.
- Call `print()` on the description property of `cur`.

In [None]:
import psycopg2
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('SELECT * FROM test_db LIMIT 0')
print(cur.description)
conn.close()

- Use the provided `cur` object.
- Create a table `ign_reviews` that contains a single field using the correct type for this data.
- Set the `id` column as the `PRIMARY KEY`.
- Commit your changes using the `conn` object.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('CREATE TABLE test_db (id BIGSERIAL PRIMARY KEY, \
            update VARCHAR(255), \
            available INT, \
            free INT, \
            name VARCHAR(255), \
            long NUMERIC, \
            lat NUMERIC, \
            total INT)');
conn.commit()
conn.close()

- Import the `csv` module.
- Find the maximum character size of the `name` field using the `csv.reader` object.
- Assign the size to the variable `max_name_len`.

In [None]:
import csv
with open('vb_table.csv') as f:
    next(f)
    reader = csv.reader(f)
    unique_name_lens = [len(row[4]) for row in reader] #name column is row[4]

max_len = max(unique_name_lens)
#with max len 50, we can cut down VARCHAR(255) to VARCHAR(55);
#technically adding a little bit extra, just in case.

In [None]:
import csv
with open('vb_table.csv') as f:
    next(f)
    reader = csv.reader(f)
    unique_name_lens = [len(row[5]) for row in reader] #long column is row[5]

max_len = max(unique_name_lens)
#longitude and latitude can be limited to just 15 decimal digits of precision; datatype double precision 

`CHAR(N)` pads any value shorter than `N` characters with trailing spaces, while `VARCHAR(N)` stores the value as is.

The main reason the `CHAR` datatype exists is to keep Postgres consistent with the SQL specification.

In conclusion, when using Postgres, it's better to use `TEXT` for fields of uncertain size and `VARCHAR(N)` for fields whose maximum length you know.

In [None]:
import csv
with open('vb_table.csv') as f:
    next(f)
    reader = csv.reader(f)
    unique_name_lens = [len(row[1]) for row in reader] #update column is row[1]

max_len = max(unique_name_lens)
#update can be limited to len 24.
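
The three length checks above repeat the same loop for one column at a time. As a rough sketch, the measurement could be done for every column in a single pass. The helper name `max_field_lengths` and the sample rows are made up for illustration; with the real file you would pass `open('vb_table.csv')` instead of the in-memory sample.

```python
import csv
import io

def max_field_lengths(lines):
    """Return {column name: longest value length} for CSV text with a header row."""
    reader = csv.DictReader(lines)
    longest = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name) or ""   # short rows yield None for missing fields
            longest[name] = max(longest[name], len(value))
    return longest

# Made-up sample data standing in for vb_table.csv.
sample = io.StringIO(
    "id,update,name\n"
    "1,2018-02-22 10:00:00,Plaza Mayor\n"
    "2,2018-02-22 10:05:00,Puerto\n"
)
lengths = max_field_lengths(sample)
print(lengths)
```

The resulting dictionary gives a lower bound for each `VARCHAR(N)`; adding some headroom, as done for the `name` column, is a judgment call.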

- Use the provided `cur` object.
- Add columns with the proper datatype and length.
- Commit your changes using the `conn` object.
- Note: If you're having trouble running the `CREATE TABLE` command, you can drop the table with `DROP TABLE` before creating it.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute("DROP TABLE test_db");
cur.execute('CREATE TABLE test_db (id BIGSERIAL PRIMARY KEY, \
            update VARCHAR(24), \
            available INT, \
            free INT, \
            name VARCHAR(55), \
            long DOUBLE PRECISION, \
            lat DOUBLE PRECISION, \
            total INT)');
conn.commit()
conn.close()

- Use the provided `cur` object.
- Add the `title`, `url`, `platform`, and `genre` columns with the proper datatype and/or length if required.
- Commit your changes using the `conn` object.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE ign_reviews (
        id BIGINT PRIMARY KEY,
        score_phrase VARCHAR(11),
        title TEXT,
        url TEXT,
        platform VARCHAR(20),
        genre TEXT
    )
""")
conn.commit()

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE ign_reviews (
        id BIGINT PRIMARY KEY,
        score_phrase VARCHAR(11),
        title TEXT,
        url TEXT,
        platform VARCHAR(20),
        genre TEXT,
        score DECIMAL(3, 1)
    )
""")
#for datatype DECIMAL, 3 is the total number of digits in the number,
#1 is the number of digits after the decimal
conn.commit()
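
To make the `DECIMAL(3, 1)` comment concrete, here is a small sketch that checks in Python whether a value would fit a given precision and scale. It assumes the behaviour described in the PostgreSQL documentation for `NUMERIC`: the value is rounded to `scale` fractional digits first, and an error is raised only if the integer part then needs more than `precision - scale` digits. The function name `fits_decimal` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

def fits_decimal(value, precision, scale):
    """Return True if `value` could be stored in DECIMAL(precision, scale)."""
    # Round to `scale` fractional digits, ties away from zero.
    step = Decimal(1).scaleb(-scale)          # scale=1 -> Decimal('0.1')
    rounded = Decimal(str(value)).quantize(step, rounding=ROUND_HALF_UP)
    # At most (precision - scale) digits may remain before the decimal point.
    return abs(rounded) < Decimal(10) ** (precision - scale)

for candidate in ["9.5", "10.0", "99.94", "99.96", "100"]:
    print(candidate, fits_decimal(candidate, 3, 1))
```

With precision 3 and scale 1 the largest storable value is 99.9, which comfortably covers review scores between 0 and 10.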

- Use the provided `cur` object.
- Add the `editors_choice` column with the proper datatype.
- Commit your changes using the `conn` object.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE ign_reviews (
        id BIGINT PRIMARY KEY,
        score_phrase VARCHAR(11),
        title TEXT,
        url TEXT,
        platform VARCHAR(20),
        genre TEXT,
        score DECIMAL(3, 1),
        editors_choice BOOLEAN
    )
""")
conn.commit()

- Use the provided `cur` object.
- Create the last column, `release_date`, with the proper datetime type.
- Import the `csv` module and `date` module.
- Using the `csv` module, transform the `year`, `month`, and `day` values into a date object for each row.
- Insert the values into the created table using the `INSERT` statement from above.
- Commit your changes using the `conn` object.

In [None]:
import csv
from datetime import date

conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE ign_reviews (
        id BIGINT PRIMARY KEY,
        score_phrase VARCHAR(11),
        title TEXT,
        url TEXT,
        platform VARCHAR(20),
        score DECIMAL(3, 1),
        genre TEXT,
        editors_choice BOOLEAN,
        release_date DATE
    )
""")

with open('ign.csv', 'r') as f:
    next(f)
    reader = csv.reader(f)
    for row in reader:
        updated_row = row[:8]
        updated_row.append(date(int(row[8]), int(row[9]), int(row[10])))
        cur.execute("INSERT INTO ign_reviews VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)", updated_row)
conn.commit()
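
The row transformation inside the loop above can be pulled out into a function so it can be checked without a database. This is a sketch that assumes, as the loop does, that the year, month, and day sit in positions 8, 9, and 10 of each CSV row; the name `to_review_record` and the sample row are invented for illustration.

```python
from datetime import date

def to_review_record(row):
    """Turn an 11-field CSV row into the 9 values ign_reviews expects."""
    year, month, day = (int(part) for part in row[8:11])
    # Keep the first eight fields and replace the last three with one date.
    return list(row[:8]) + [date(year, month, day)]

# Invented sample row with the same layout as the loop above assumes.
sample_row = ["1", "Amazing", "Some Game", "/games/some-game", "PC",
              "9.0", "Platformer", "Y", "2012", "9", "12"]
record = to_review_record(sample_row)
print(record[-1])
```

psycopg2 adapts the `date` object to a PostgreSQL `DATE` value when the record is passed to `cur.execute()`.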

### Make SQL Database With Pandas Dataframe Using SQL Alchemy

<a href = 'http://docs.sqlalchemy.org/en/latest/core/engines.html'>SQL Alchemy Documentation: Engine Configuration</a><br>
<a href = 'http://docs.sqlalchemy.org/en/latest/core/type_basics.html#sql-standard-and-multiple-vendor-types'>SQL Alchemy Documentation: dtypes</a>

`dialect+driver://username:password@host:port/database`<br>

In [None]:
import sqlalchemy
from sqlalchemy import create_engine

In [None]:
# data is a pandas dataframe prepared separately
engine = create_engine('postgresql+psycopg2://xtang:xtang@localhost/test_db')
# VARCHAR is used for name because CHAR would pad shorter values with spaces
data.to_sql('users', engine, dtype = {'id': sqlalchemy.types.BIGINT,
                                      'update': sqlalchemy.types.TIMESTAMP(timezone=False),
                                      'available': sqlalchemy.types.INT,
                                      'free': sqlalchemy.types.INT,
                                      'total': sqlalchemy.types.INT,
                                      'name': sqlalchemy.types.VARCHAR(length=55),
                                      'long': sqlalchemy.types.NUMERIC(precision=10, scale=8, asdecimal=True),
                                      'lat': sqlalchemy.types.NUMERIC(precision=10, scale=8, asdecimal=True)})

In [None]:
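from sqlalchemy import text

# to_sql() wrote the dataframe index as a column named "index".
# engine.begin() commits the statement when the block exits.
with engine.begin() as conn:
    conn.execute(text('ALTER TABLE users ADD PRIMARY KEY (index);'))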

## Manage Tables PostgreSQL

- Using the provided `cur` object, execute the `ALTER TABLE` query to rename the `old_ign_reviews` table to `ign_reviews`.
- Commit your changes.
- Execute the `SELECT` query from the example on the table `ign_reviews`.
- Call `print()` on the `cur.description` variable.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
# user is a reserved word in PostgreSQL, so the table name must be double-quoted
cur.execute('ALTER TABLE users RENAME TO "user"')
conn.commit()
cur.execute('SELECT * FROM "user" LIMIT 0')
print(cur.description)
conn.close()

- Use the provided `cur` object.
- Drop the redundant column `full_url` from `ign_reviews`.
- Commit your changes using the `conn` object.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('ALTER TABLE ign_reviews DROP COLUMN full_url')
conn.commit()

- Use the provided `cur` object.
- Change the column type of `id` to `BIGINT` for `ign_reviews`.
- Commit your changes using the `conn` object.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('ALTER TABLE ign_reviews ALTER COLUMN id TYPE BIGINT')
conn.commit()

- Use the provided `cur` object.
- Change the column name of `title_of_game_review` to `title` for `ign_reviews`.
- Commit your changes using the `conn` object.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('ALTER TABLE ign_reviews RENAME COLUMN title_of_game_review TO title')
conn.commit()

- Use the provided `cur` object.
- Add the `release_date` column with the proper datatype.
- Commit your changes using the `conn` object.

-- Default each entry to Jan 1st, 1991.<br>
`ALTER TABLE ign_reviews ADD COLUMN release_date DATE DEFAULT '1991-01-01'`

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('ALTER TABLE ign_reviews ADD COLUMN release_date DATE')
conn.commit()

- Use the provided `cur` object.
- Update the `release_date` column for `ign_reviews` using `UPDATE` and for every entry:
- Insert the combination of the columns `release_day`, `release_month`, `release_year`.
- Use the string concatenation operator (`||`) to create the date-like string with the corresponding date format representation.
- Use the `to_date()` function to create the date objects for the column.
- Commit your changes using the `conn` object.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('''
UPDATE ign_reviews SET release_date = to_date(
    release_day || '-' || release_month || '-' || release_year, 'DD-MM-YYYY')
''')
conn.commit()

- Use the provided `cur` object.
- Use `ALTER TABLE` with `DROP COLUMN` to drop the redundant `release_day`, `release_month`, and `release_year` columns.
- Commit your changes using the `conn` object.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
cur.execute('ALTER TABLE ign_reviews DROP COLUMN release_day')
cur.execute('ALTER TABLE ign_reviews DROP COLUMN release_month')
cur.execute('ALTER TABLE ign_reviews DROP COLUMN release_year')
conn.commit()

## Loading and Extracting Data PostgreSQL

- Use the provided `cur` variable.
- Load the `ign.csv` file using the `csv` module.
- Run the insert query on the `ign_reviews` table using the `execute` method with a parameterized statement.
- Insert every row from the `ign.csv` file except for the header row.
- Note that the last column is `release_date` instead of the 3 `release_day`, `release_month`, and `release_year` columns.
- Commit your changes using the `conn` object.

In [None]:
import csv
import psycopg2

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
with open('ign.csv', 'r') as f:
    next(f)
    reader = csv.reader(f)
    for row in reader:
        cur.execute("INSERT INTO ign_reviews VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)", row)
conn.commit()

- Use the provided `cur` variable.
- Load the `ign.csv` file using the `csv` module.
- Create a comma-separated string of mogrified values using the `mogrify()` method.
- Mogrify every row from the `ign.csv` file and skip the header row.
- Set the comma-separated string to the variable `mogrified_values`.
- Execute the insert query on the `ign_reviews` table using the execute method.
- Concat the `mogrified_values` to the `INSERT` statement.
- Commit your changes using the `conn` object.

The parameterized query safely converts the Python types to the Postgres types when executing an `INSERT` statement. The conversion takes place in a separate step within the `psycopg2` library, which is exposed through a method called `mogrify()`. In Python 3 `mogrify()` returns `bytes`, which is why the result is decoded in the code below.

> `cur.mogrify("INSERT INTO test (num, data) VALUES (%s, %s)", (42, 'bar'))`<br>
> `"INSERT INTO test (num, data) VALUES (42, E'bar')"`

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
with open('ign.csv', 'r') as f:
    next(f)
    reader = csv.reader(f)
    mogrified = [cur.mogrify("(%s, %s, %s, %s, %s, %s, %s, %s, %s)", row).decode('utf-8') for row in reader]
mogrified_values = ",".join(mogrified)
cur.execute("INSERT INTO ign_reviews VALUES " + mogrified_values)
conn.commit()

- Use the provided `cur` variable.
- Load the `ign.csv` file.
- Execute the `COPY ... FROM` method on the `ign_reviews` table using the `copy_expert` method.
- Add the `CSV` and `HEADER` options.
- Commit your changes using the `conn` object.

The `cur.copy_from()` method provides a useful API for file copying, but only if the file uses a simple separator (delimiter) character and has no quoted fields or header row.

To use the `copy_expert()` method, you first have to write out the full `COPY` statement and then pass in the Python file object. The biggest difference you may notice is that the statement doesn't copy from a file path, but from `STDIN`, which in this case is fed by the Python file object.
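
Because `copy_expert()` reads from any file-like object, rows that were transformed in Python can be streamed through `COPY` without writing a file to disk. The following is a minimal sketch under that assumption; `rows_to_csv_buffer` and `copy_rows` are hypothetical helper names, and `cur` is expected to be a psycopg2 cursor.

```python
import csv
import io

def rows_to_csv_buffer(rows):
    """Serialize rows to an in-memory CSV file, rewound and ready to be read."""
    buffer = io.StringIO()
    # None is written as an empty field, which COPY ... CSV reads as NULL.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    buffer.seek(0)
    return buffer

def copy_rows(cur, copy_sql, rows):
    """Send `rows` through a COPY ... FROM STDIN statement on cursor `cur`."""
    cur.copy_expert(copy_sql, rows_to_csv_buffer(rows))

# The buffer can be inspected without a database connection.
print(rows_to_csv_buffer([(1, "a,b"), (2, None)]).getvalue())
```

A call such as `copy_rows(cur, 'COPY ign_reviews FROM STDIN WITH CSV', rows)` would then load the rows; the `HEADER` option is left out because the buffer contains no header line.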

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
with open('ign.csv',  'r') as f:
    cur.copy_expert('COPY ign_reviews FROM STDIN WITH CSV HEADER', f)
conn.commit()

- Using the time module, play around to determine which of the last three methods we introduced is the fastest.

In [None]:
import time

conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()

# Multiple single insert statements.
start = time.time()
with open('ign.csv', 'r') as f:
    next(f)
    reader = csv.reader(f)
    for row in reader:
        cur.execute(
            "INSERT INTO ign_reviews VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)",
            row
        )
conn.rollback()
print("Single statement insert: ", time.time() - start)
        
# Multiple mogrify insert.
start = time.time()
with open('ign.csv', 'r') as f:
    next(f)
    reader = csv.reader(f)
    mogrified = [ 
        cur.mogrify("(%s, %s, %s, %s, %s, %s, %s, %s, %s)", row).decode('utf-8')
        for row in reader
    ] 
    mogrified_values = ",".join(mogrified) 
    cur.execute('INSERT INTO ign_reviews VALUES ' + mogrified_values)
conn.rollback()
print("Multiple mogrify insert: ", time.time() - start)

# Copy expert method.
start = time.time()
with open('ign.csv', 'r') as f:
    cur.copy_expert('COPY ign_reviews FROM STDIN WITH CSV HEADER', f)
conn.rollback()
print("Copy expert method: ", time.time() - start)

Single statement insert:  2.948253631591797<br>
Multiple mogrify insert:  1.0108413696289062<br>
Copy expert method:  0.16642284393310547<br>
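
The start/stop bookkeeping in the timing cell can be wrapped in a context manager so each method is measured the same way. This is only a sketch: the name `timed` is hypothetical, the workload below is a stand-in for the three insert methods, and `time.perf_counter()` is used because it is intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

results = {}
with timed("sum of squares", results):
    # Stand-in workload; replace with one of the insert methods.
    total = sum(i * i for i in range(100_000))
print(results)
```

Note that each measurement in the cell above also includes the time spent reading and parsing `ign.csv`, so the numbers compare whole pipelines rather than the database calls alone.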

- Use the provided `cur` variable.
- Open the `old_ign_reviews.csv` file using `with open(...) as f`.
- Execute the `COPY ... TO` method on the `old_ign_reviews` table using the `copy_expert` method.
- Add the `CSV` and `HEADER` options.
- Write it out to the `old_ign_reviews.csv` file.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()

with open('old_ign_reviews.csv', 'w') as f:
    cur.copy_expert('COPY old_ign_reviews TO STDOUT CSV HEADER', f)

- Use the provided `cur` variable.
- Open the `old_ign_reviews.csv` file using `with open(...) as f`.
- Execute the `COPY ... TO` method on the `old_ign_reviews` table using the `copy_expert` method.
- Add the `CSV` and `HEADER` options.
- Process the data and transform it to match the `ign_reviews` table.
- Insert the processed rows into the `ign_reviews` table using whatever `INSERT` command you want.
- Commit your changes.

In [None]:
import csv
from datetime import date

conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()
# 'w+' creates or truncates the file, so no stale data is left after the export
with open('old_ign_reviews.csv', 'w+') as f:
    cur.copy_expert('COPY old_ign_reviews TO STDOUT WITH CSV HEADER', f)
    f.seek(0)
    next(f) #skip header
    reader = csv.reader(f)
    for row in reader:
        updated_row = row[:8]
        updated_row.append(date(int(row[8]), int(row[9]), int(row[10])))
        cur.execute("INSERT INTO ign_reviews VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)", updated_row)
    conn.commit()

This approach is fine for tables that contain fewer than about a million rows, but every row makes a round trip through a CSV file and a separate `INSERT`, so it becomes impractically slow as the table grows.

- Use the provided `cur` variable.
- Insert rows into the `ign_reviews` table using the `INSERT` with `SELECT` from the `old_ign_reviews` table.
- Commit your changes.

In [None]:
conn = psycopg2.connect('dbname=test_db user=xtang password=xtang')
cur = conn.cursor()

cur.execute('''
INSERT INTO ign_reviews (id, score_phrase, title, url, platform, score, genre, editors_choice, release_date)

SELECT id, score_phrase, title_of_game_review, url, platform, score, genre, editors_choice, 
       to_date(release_day || '-' || release_month || '-' || release_year, 'DD-MM-YYYY') as release_date 
FROM old_ign_reviews
''')
conn.commit()

## User and Database Management PostgreSQL

- Import the `psycopg2` library.
- Use the `print` function to display the Connection object.

In [None]:
import psycopg2
conn = psycopg2.connect(dbname="test_db", user="xtang", password = "xtang")
print(conn)

- Create cursor object using the `.cursor()` method.
- Create a new user that has the following options:
    - Has a password with the value somepassword.
    - Not a superuser.
- Commit the transaction.

In [None]:
conn = psycopg2.connect(dbname = "test_db", user = "xtang")
cur = conn.cursor()
cur.execute("CREATE USER xtang1 WITH PASSWORD 'somepassword' NOSUPERUSER")
conn.commit()

- Use the created Cursor object using the variable `cur`.
- Revoke all privileges from user `xtang1` on the table `user`.
- Commit the transaction.

In [None]:
conn = psycopg2.connect(dbname = "test_db", user = "xtang")
cur = conn.cursor()
# user is a reserved word, so the table name is double-quoted
cur.execute('REVOKE ALL ON "user" FROM xtang1;')
conn.commit()

- Use the created Cursor object using the variable `cur`.
- Grant the `SELECT` privilege to user `xtang1` on the table `user`.
- Commit the transaction.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()
cur.execute('GRANT SELECT ON "user" TO xtang1')
conn.commit()

- Use the created Cursor object using the variable `cur`.
- Create a `readonly` group by doing the following:
    - Create a `NOLOGIN` group named readonly.
    - Revoke all privileges from the group on `user_accounts`.
    - Grant `SELECT` to the group on `user_accounts`.
- Assign `data_viewer` to the `readonly` group.
- Commit the transaction.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()
cur.execute('CREATE GROUP readonly NOLOGIN')
cur.execute('REVOKE ALL ON "user" FROM readonly')
cur.execute('GRANT SELECT ON "user" TO readonly')
cur.execute('GRANT readonly TO xtang1')
conn.commit()

- Use the created Cursor object using the variable `cur`.
- Create a database called `accounts` where the owner is the user `data_viewer`.
- Don't commit the transaction.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
# Connection set to autocommit
conn.autocommit = True
cur = conn.cursor()
cur.execute('CREATE DATABASE accounts OWNER xtang')

- Use the created Cursor object using the variable `cur`.
- Create a database called `top_secret`.
- Reconnect to the `top_secret` database with the `xtang` user.
- Create a table in `top_secret` called `documents` following schema:
    - `id` with `INT`.
    - `info` with `TEXT`.
- Create a group called `spies` with only the following privileges:
    - `NOLOGIN`.
    - Can only `INSERT`, `SELECT`, and `UPDATE` on documents.
- Create a user named `double_o_7` with the following options:
    - Can create a database.
    - Password is 'shakennotstirred'.
    - In group `spies`.
- Commit the transaction.
- Connect to the `top_secret` database using `psycopg2.connect()` and the user `double_o_7`.
    - Assign the connection variable to `conn_007`.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE DATABASE top_secret OWNER xtang")
conn = psycopg2.connect(dbname="top_secret", user="xtang")
cur = conn.cursor()
cur.execute("""
CREATE TABLE documents(id INT, info TEXT);
CREATE GROUP spies NOLOGIN;
REVOKE ALL ON documents FROM spies;
GRANT SELECT, INSERT, UPDATE ON documents TO spies;
CREATE USER double_o_7 WITH CREATEDB PASSWORD 'shakennotstirred' IN GROUP spies;
""")
conn.commit()
conn_007 = psycopg2.connect(dbname='top_secret', user='double_o_7', password='shakennotstirred')

## Postgres Internals

- Import the `psycopg2` library.
- Use the `print` function to display the Connection object.

In [None]:
import psycopg2
conn = psycopg2.connect(dbname="test_db", user="xtang1", password="admin123")
print(conn)

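This connection succeeds even though the password is wrong, and it appears that any user can connect to any database. A common cause is a `pg_hba.conf` entry that uses the `trust` method, which skips password checks entirely; in addition, the `CONNECT` privilege on a new database is granted to `PUBLIC` by default. Resources to fix this: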
- https://stackoverflow.com/questions/21054549/postgres-accepts-any-password
- https://dba.stackexchange.com/questions/17790/created-user-can-access-all-databases-in-postgresql-without-any-grants

- Use the provided `cur` object.
- Using the `SELECT` query, grab the `table_name` column from the `information_schema.tables` table with the `ORDER BY` option on the `table_name` column.
- Fetch all the results and assign them to the variable `table_names`.
- Loop through `table_names`:
- Print each `table_name` from the query.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()
cur.execute('SELECT table_name FROM information_schema.tables ORDER BY table_name')
table_names = cur.fetchall()
for name in table_names:
    print(name)

In the output, you would have noticed many tables that started with the prefix `pg_*`. Each one of these tables is part of the `pg_catalog` group of internal tables. These are the system catalog tables. 

- Use the provided `cur` object.
- Using the `SELECT` query, grab the `table_name` column from the `information_schema.tables` table.
    - Find user created by filtering the query on the `table_schema` column.
    - `ORDER BY` the `table_name` again.
- Loop through `cur.fetchall()` and print each `table_name` from the query.

In Postgres, schemas are used as a namespace for tables, with the distinct purpose of separating them into isolated groups or sets within a single database.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()
cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name")
for table_name in cur.fetchall():
    name = table_name[0]
    print(name)

- Import `AsIs` from `psycopg2.extensions`.
- Within the loop for `cur.fetchall()` for each table name:
    - Run a `SELECT` query with the table variable using `AsIs`.
    - Print the `cur.description` attribute.
    - Print a blank line to separate the descriptions at the end of each loop.

In [None]:
from psycopg2.extensions import AsIs

conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()
cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public'")
for table in cur.fetchall(): 
     table = table[0]
     cur.execute("SELECT * FROM %s LIMIT 0", [AsIs(table)])
     print(cur.description, "\n")
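
`AsIs` pastes the table name into the query verbatim, so a name that is a reserved word (such as `user`) or that contains unusual characters will break the statement. A small sketch of PostgreSQL's identifier quoting rule (wrap the name in double quotes and double any embedded double quote) is shown below; `quote_ident` and `qualified_name` are hypothetical helper names. Recent psycopg2 versions also ship a `psycopg2.sql` module with an `Identifier` class intended for this purpose.

```python
def quote_ident(name):
    """Wrap a name in double quotes, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def qualified_name(schema, table):
    """Build a schema-qualified, quoted table reference."""
    return quote_ident(schema) + "." + quote_ident(table)

print(qualified_name("public", "user"))   # "public"."user"
```

The quoted result could then be passed through `AsIs(qualified_name('public', table))`. Keep in mind that quoted identifiers are case-sensitive in PostgreSQL.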

- Use the provided `cur` object.
- Using `execute()`, `SELECT` from the `pg_catalog.pg_type`, choose two columns that can map an integer type code to a human - readable string.
- Create a dict and assign it to the variable `type_mappings`.
- Loop through the returned `SELECT` query and map the integer type code to the string.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()
cur.execute("SELECT oid, typname FROM pg_catalog.pg_type")
type_mappings = {
    int(oid): typname
    for oid, typname in cur.fetchall()
}

>One interesting thing to note about `pg_catalog.pg_type` is that types you create yourself with `CREATE TYPE` are registered there too. Let's put all this together and create our own table descriptions. We want to rewrite the description attributes from a list of tuples into something human readable. In the following exercise, we will assemble output from the previous exercises into a dictionary.

- Use the provided `cur`, `type_mappings`, and `table_names` objects.
- Create a dict and assign it to the variable `readable_description`.
- Loop through the `table_names` with the table variable and do the following:
    - Get the description attribute for the given table.
    - Map the name of the table to a dictionary with a columns key.
    - Recreate the columns list from the screen example by iterating through the description, and mapping the appropriate types.
- Print the `readable_description` dictionary at the end.

In [None]:
from psycopg2.extensions import AsIs
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

readable_description = {}
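# The loop over table_names is left commented out; a single table is described instead.
#for table in table_names:
    #cur.execute("SELECT * FROM %s LIMIT 0", [AsIs(table)])
table = "stormdata"  # defined explicitly so the key below does not depend on an earlier cell
cur.execute("SELECT * FROM stormdata LIMIT 0")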
readable_description[table] = dict(
    columns=[
        dict(
            name=col.name,
            type=type_mappings[col.type_code],
            length=col.internal_size
        )
        for col in cur.description
    ]
)
print(readable_description)
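
The mapping step can be tried without a database by feeding it a stand-in for `cur.description`. In this sketch, `describe_columns` is a hypothetical helper, and the `Column` tuples and the two type codes are hand-written sample values (20 and 1043 are the OIDs of `int8` and `varchar`); real values come from `cur.description` and the `type_mappings` dict built earlier.

```python
from collections import namedtuple

def describe_columns(description, type_mappings):
    """Turn cursor.description entries into a list of readable dicts."""
    return [
        {
            "name": col.name,
            # Fall back to a placeholder if the type code is not in the mapping.
            "type": type_mappings.get(col.type_code, "unknown"),
            "length": col.internal_size,
        }
        for col in description
    ]

# Stand-in for cur.description; only the attributes used above are modelled.
Column = namedtuple("Column", "name type_code internal_size")
fake_description = [Column("id", 20, 8), Column("name", 1043, -1)]
fake_mappings = {20: "int8", 1043: "varchar"}

columns = describe_columns(fake_description, fake_mappings)
print(columns)
```

Using `.get()` with a default avoids a `KeyError` if a column has a type code that the mapping query did not return.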

- Use the provided `cur` object and `AsIs` class.
- Loop through the `readable_description` keys:
    - Fetch the value of each table's row count and assign it to a `total` key for that table.
- Print the `readable_description` dictionary at the end.

In [None]:
from psycopg2.extensions import AsIs
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

#for table in readable_description.keys():
    #cur.execute("SELECT COUNT(*) FROM %s", [AsIs(table)])
table = "stormdata"
cur.execute("SELECT COUNT(*) FROM stormdata")
# fetchone() returns a one-element tuple; keep only the count itself
readable_description[table]["total"] = cur.fetchone()[0]

- Use the provided `cur` object and `AsIs` class.
- Loop through the `readable_description` keys and run the following:
    - Select the first 100 rows with `SELECT ... LIMIT` using `execute()` and `AsIs`.
    - Fetch the all the rows and assign it to the `readable_description` dictionary for the given table using the `sample_rows` key.
- Print the `readable_description` dictionary at the end.

In [None]:
from psycopg2.extensions import AsIs
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

#for table in readable_description.keys():
#    cur.execute("SELECT * FROM %s LIMIT 100", [AsIs(table)])
table = "stormdata"
cur.execute("SELECT * FROM stormdata LIMIT 100")
readable_description[table]["sample_rows"] = cur.fetchall()

## Debugging PostgreSQL Queries

- Use the provided `cur` object.
- Run the `EXPLAIN` command for a `SELECT` all query on the `vbstatic` table.
- Call `.fetchall()` and pretty print the output.

In [None]:
import psycopg2
import pprint as pp

conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

# user is a reserved word, so the table name is double-quoted
cur.execute('EXPLAIN SELECT * FROM "user"')
pp.pprint(cur.fetchall())

> Let's describe the path a query takes when you call `cur.execute()`.
>
>Path of a query:
>
>1. The query is parsed for correct syntax. If there are any errors, the query does not execute and you receive an error message. If error-free, then the query is transformed into a query tree.
>
>2. A rewrite system takes the query tree and checks against the system catalog internal tables for any special rules. Then, if there are any rules, it rewrites them into the query tree.
>
>3. The rewritten query tree is then processed by the planner/optimizer, which creates a query plan to send to the executor. The planner estimates the cost of the candidate plans and picks the cheapest one it finds.
>
>4. The executor takes in the query plan, runs each step, then returns back any rows it found.
>
>When we run the EXPLAIN command, we are examining the query at the third step in its path. In this step, the planner (or optimizer) is responsible for taking the written query and finding the fastest and most efficient way of returning the results.

- Use the provided `cur` object.
- Run the `EXPLAIN` command on a query that returns a `COUNT` of rows with a year later than 2012-01-01 for `homeless_by_coc`.
- Call `.fetchall()` and <a href="https://docs.python.org/3/library/pprint.html">pretty print</a> the output.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

cur.execute("""EXPLAIN SELECT COUNT(*) FROM "user" WHERE update > '2018-02-22'""")
pp.pprint(cur.fetchall())

> The executor will start by running a sequential scan on the table, filter by the year value, and then run the aggregator on those returned results. The plan of execution closely resembles a tree of commands – starting from the bottom and working its way up – but it is not clearly shown by this output format.

- Use the provided `cur` object.
- Run the `EXPLAIN` command on a query that returns a `COUNT` of rows with a year later than 2012-01-01 for `homeless_by_coc`.
    - Format the output with the `json` type.
- Call `.fetchall()` and pretty print the output.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

cur.execute("""EXPLAIN (FORMAT json) SELECT COUNT(*) FROM "user" WHERE update > '2018-02-22'""")
pp.pprint(cur.fetchall())

The Oracle documentation describes its plan cost column in terms that also fit PostgreSQL: the cost is the optimizer's estimate for the operation, it has no particular unit of measurement, and it is merely a weighted value used to compare the costs of execution plans. (The remarks in that document about rule-based statements and table access operations are Oracle-specific.) In PostgreSQL the numbers are arbitrary units that, by convention, are scaled so that fetching one page sequentially costs 1.0.  
https://docs.oracle.com/cd/A58617_01/server.804/a58246/explan.htm

> The `Startup Cost` is the estimated cost incurred before the first row can be returned (something like sorting, or collecting the rows and aggregating them). <br>
>`Total Cost` includes `Startup Cost` and is the estimated cost of running the plan node to completion. Both values are in the planner's cost units, not in units of time.
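
To make the two cost fields concrete, here is a minimal sketch that walks a plan in the nested shape `EXPLAIN (FORMAT json)` produces and lists each node with its costs. The key names (`Node Type`, `Startup Cost`, `Total Cost`, `Plans`) are the ones Postgres emits; `sample_plan`, its numbers, and the helper `walk_plan` are hypothetical and only serve the illustration.

```python
# Hypothetical plan; the numbers are invented for illustration.
sample_plan = {
    "Plan": {
        "Node Type": "Aggregate",
        "Startup Cost": 100.5,
        "Total Cost": 100.51,
        "Plans": [
            {
                "Node Type": "Seq Scan",
                "Startup Cost": 0.0,
                "Total Cost": 98.0,
            }
        ],
    }
}


def walk_plan(node, depth=0):
    """Yield (depth, node type, startup cost, total cost) for every plan node."""
    yield depth, node["Node Type"], node["Startup Cost"], node["Total Cost"]
    for child in node.get("Plans", []):
        yield from walk_plan(child, depth + 1)


rows = list(walk_plan(sample_plan["Plan"]))
for depth, node_type, startup, total in rows:
    print("  " * depth + f"{node_type}: startup={startup}, total={total}")
```

In this invented plan the aggregate cannot return anything until its child scan has finished, which is why its startup cost sits just above the scan's total cost. With a live cursor, and assuming psycopg2 decodes the json column into Python objects, the plan dictionary should be reachable as `cur.fetchall()[0][0][0]["Plan"]`.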

- Use the provided `cur` object.
- Practice running different `EXPLAIN` commands on any of the tables in the database. Here are some to try:
    - `SELECT count FROM homeless_by_coc`
    - `SELECT postal FROM state_info WHERE state='Alabama'`
    - `SELECT state, SUM(count) FROM homeless_by_coc GROUP BY state HAVING SUM(count) > 100000 ORDER BY state`

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

cur.execute('EXPLAIN SELECT count(*) FROM "user"')
pp.pprint(cur.fetchall())

cur.execute('EXPLAIN SELECT available FROM "user" WHERE free = 0')
pp.pprint(cur.fetchall())

cur.execute('EXPLAIN SELECT available, SUM(free) FROM "user" GROUP BY available')
pp.pprint(cur.fetchall())

> Under the hood, `EXPLAIN` relies on statistics kept in internal tables to produce its estimates. One of these tables is `pg_class`, which stores an estimated row count and page count for every table. Because these are estimates (not actual values), `EXPLAIN` can only give us approximate values for our queries.
>
>If we want to see, and force, actual runtime statistics of our queries, we need to use the `ANALYZE` option of the `EXPLAIN` query. With `ANALYZE`, the `EXPLAIN` command will execute our given query, wait for the results, then return the output with the recorded values.

- Use the provided `cur` object.
- Run the `EXPLAIN` command on a query that returns a `COUNT` of rows with a year later than 2012-01-01 for `homeless_by_coc`.
    - Add the `ANALYZE` option to the `EXPLAIN` command.
    - Format the output with the `json` type.
    - Note that we are trying to add two options for `EXPLAIN`.
- Call `.fetchall()` and pretty print the output.

In [None]:
cur.execute("EXPLAIN (ANALYZE, format json) SELECT COUNT(*) from vbstatic WHERE update > '2018-02-22'")
pp.pprint(cur.fetchall())

>Using the `ANALYZE` option, we have both estimates and actual times (in milliseconds) side by side. Furthermore, we are presented with the total execution time.
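
One practical use of having estimates and actual values side by side is checking how far off the planner was. The sketch below assumes a plan node in the shape `EXPLAIN (ANALYZE, FORMAT json)` returns; the key names are the ones Postgres emits, while `sample_node`, its numbers, and the helper `row_estimate_ratio` are hypothetical.

```python
# Hypothetical plan node; the numbers are invented for illustration.
sample_node = {
    "Node Type": "Seq Scan",
    "Plan Rows": 500,          # the planner's estimate
    "Actual Rows": 2000,       # what the executor really produced
    "Actual Total Time": 3.2,  # milliseconds
}


def row_estimate_ratio(node):
    """Return actual rows divided by estimated rows (1.0 means a perfect estimate)."""
    estimated = max(node["Plan Rows"], 1)  # guard against a zero estimate
    return node["Actual Rows"] / estimated


ratio = row_estimate_ratio(sample_node)
print(f"{sample_node['Node Type']}: actual/estimated rows = {ratio:.1f}")
```

A ratio far from 1.0 is a hint that the table statistics may be out of date.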

- Use the provided `cur` and `conn` objects.
- Run the `EXPLAIN ANALYZE` command on a `DELETE` query that deletes all the rows in `state_household_incomes`.
    - Format the output with the `json` type.
- Rollback the delete command.
- Call `.fetchall()` and pretty print the output.

In [None]:
conn = psycopg2.connect(dbname="valenbisi2018", user="nmolivo")
cur = conn.cursor()

cur.execute("EXPLAIN (ANALYZE, FORMAT json) DELETE FROM vbstatic")
# Rollback the change.
conn.rollback()
pp.pprint(cur.fetchall())

- Use the provided `cur` object.
- Run `EXPLAIN ANALYZE` on a select from `homeless_by_coc` and `state_info`:
    - Select columns `state`, `coc_number`, and `coc_name` from `homeless_by_coc`.
    - Select column `name` from `state_info`.
    - Join on `homeless_by_coc.state` and `state_info.postal`.
    - Format the output with the `json` type.
- Call `fetchall()` and pretty print the output.

In [None]:
cur.execute('''
    EXPLAIN (ANALYZE, FORMAT json) SELECT hbc.state, hbc.coc_number, hbc.coc_name, 
    si.name FROM homeless_by_coc as hbc, state_info as si WHERE hbc.state = si.postal
    '''
           )
pp.pprint(cur.fetchall())

The output of the `EXPLAIN ANALYZE` command reveals the inefficiency of the join. In the list of plans, each input to the join is a Seq Scan, which reads every row of its table. Before the join can occur, two full scans are performed: one over `homeless_by_coc` and one over `state_info`.

## Indexing

> We will work through strategies to make it more efficient. To begin, we will learn about different query scans a `SELECT` performs. Next, we will introduce the concept of an index, and how indexes are used to speed up common queries. 
>
>An index creates a B-tree structure on a column, separate from the table, which allows filtered queries to perform binary search.
>
> Using an index, we will show that we can speed up queries from $O(n)$ to $O(\log n)$ complexity. We will first argue it theoretically and then, using `EXPLAIN`, show how query times decrease as a result of adding the index. Finally, we will finish by demonstrating the positive effect an index can have on joins.

- Use the provided `cur` object.
- Run the `EXPLAIN` command for a `SELECT` all query on the `homeless_by_coc` table filtering by `id`=10.
- Format the `EXPLAIN` query with json output.
- Call `.fetchall()` and pretty print the output.

In [None]:
import psycopg2
import pprint as pp

conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

cur.execute('EXPLAIN (FORMAT json) SELECT * FROM "user" WHERE index = 5')
pp.pprint(cur.fetchall())

Since we were searching on the primary key (see `user_pkey` in the output), the query knows it can stop after finding the first record where `index = 5` because, in Postgres, all primary key values are unique. Instead of scanning the whole table, the query does a binary search.<br>
>A binary search can help us find an item in a list efficiently if we know the list is ordered. We can check the middle element of the list, compare it to the item we're looking for, and continue narrowing our search in this manner.
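
The difference between the two strategies is easy to see by counting steps. This is a plain-Python sketch and not what Postgres runs internally; the function names and the list `ids` are made up for the illustration.

```python
def linear_search(values, target):
    """Return (index, steps) by checking the items one by one."""
    for steps, value in enumerate(values, start=1):
        if value == target:
            return steps - 1, steps
    return -1, len(values)


def binary_search(values, target):
    """Return (index, steps) by halving the search range; `values` must be sorted."""
    low, high, steps = 0, len(values) - 1, 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if values[mid] == target:
            return mid, steps
        if values[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, steps


ids = list(range(1, 1_000_001))  # one million sorted ids
linear_index, linear_steps = linear_search(ids, 999_999)
binary_index, binary_steps = binary_search(ids, 999_999)
print(linear_steps, binary_steps)
```

For one million sorted values the linear scan needs close to a million comparisons for a target near the end, while the binary search never needs more than 20.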

- Use the provided `cur` object.
- Run the `EXPLAIN` command on a select query from each table that filters on their corresponding primary keys:
    - Format by json.
    - `homeless_by_coc.id` equal to 5 and assign `fetchall()` to the variable `homeless_query_plan`.
    - `state_info.name` equal to Alabama and assign `fetchall()` to the variable `state_query_plan`.
    - `state_household_incomes.state` equal to Georgia and assign `fetchall()` to the variable `incomes_query_plan`.
- For each `query_plan` variable (`homeless_query_plan`, `state_query_plan`, `incomes_query_plan`), pretty print the output.

In [None]:
cur.execute("EXPLAIN (format json) SELECT * FROM homeless_by_coc WHERE id=5")
homeless_query_plan = cur.fetchall()
pp.pprint(homeless_query_plan)

cur.execute("EXPLAIN (format json) SELECT * FROM state_info WHERE name='Alabama'")
state_query_plan = cur.fetchall()
pp.pprint(state_query_plan)

cur.execute("EXPLAIN (format json) SELECT * FROM state_household_incomes WHERE state='Georgia'")
incomes_query_plan = cur.fetchall()
pp.pprint(incomes_query_plan)

>Let's create a separate table that's optimized for lookups by a different column than `id` from the `homeless_by_coc` table. First, we make the column we want to query the first part of the primary key, so we get the speed benefits, and add the `id` value from `homeless_by_coc` as the second part. We call this table an index, and each row in the index contains:
>
>- the value we want to be able to search by,
>- an `id` value for the corresponding row in `homeless_by_coc`.
>
>Together, the two columns form the composite primary key of the index table.
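
The index table described above can be mimicked in plain Python to see why it helps. In this sketch the dictionary `table` stands in for `homeless_by_coc` (keyed by `id`), and the sorted list `state_idx` stands in for the index table; the sample rows and the helper `lookup_by_state` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: id -> (state, coc_number).
table = {
    1: ("AK", "AK-500"),
    2: ("CA", "CA-600"),
    3: ("AL", "AL-501"),
    4: ("CA", "CA-601"),
    5: ("NY", "NY-600"),
}

# The "index table": (state, homeless_id) pairs kept sorted, like a composite primary key.
state_idx = sorted((state, row_id) for row_id, (state, _) in table.items())


def lookup_by_state(state):
    """Binary-search the index for the state, then fetch each matching row by id."""
    start = bisect_left(state_idx, (state, float("-inf")))
    end = bisect_right(state_idx, (state, float("inf")))
    return [(row_id, *table[row_id]) for _, row_id in state_idx[start:end]]


print(lookup_by_state("CA"))
```

Because the pairs are sorted by state first, all entries for one state sit next to each other, so two binary searches find the whole group without touching the other rows.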

- Use the provided `cur` object.
- Create a table, `state_idx`, that contains the columns `state` and `homeless_id`.
    - Create a composite primary key containing both `state` and `homeless_id`.
    - Insert into `state_idx` the columns `state` and `id` from `homeless_by_coc`.
- Select `state`, `year`, and `coc_number` from `homeless_by_coc` by joining with the `state_idx` id.
    - Filter by `CA` state on `state_idx`.
- Call `fetchall()` and pretty print the results.

In [None]:
import pandas as pd

cur.execute('SELECT DISTINCT name FROM "user"')
names = pd.DataFrame(cur.fetchall())

names.columns = ['name']
names['stationid'] = ["%03d" % (x) for x in list(range(1,306))]

In [None]:
cur.execute("DROP TABLE stations;")
conn.commit()

In [None]:
import sqlalchemy
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://xtang:xtang@localhost/test_db')
names.to_sql('stations', engine, dtype = {'name': sqlalchemy.types.CHAR(length=55), \
                                         'stationid':sqlalchemy.types.CHAR(length=3)})

In [None]:
# Write the joined result to user2, the table queried in the following cells.
cur.execute("""
    SELECT "user".index, "user".update, "user".free, "user".available, "user".total,
           "user".lat, "user".long, stations.stationid, stations.name
    INTO user2
    FROM "user"
    FULL OUTER JOIN stations
    ON stations.name = "user".name
""")
conn.commit()

In [None]:
cur.execute("SELECT * from user2")
data = pd.DataFrame(cur.fetchall())

data.columns = ['index', 'update', 'free', 'available', 'total', 'lat', 'long', 'stationid', 'name']
station271 = data[data['stationid'] =='271']

In [None]:
cur.execute("SELECT update, COUNT( stationid ) FROM user2 GROUP BY update HAVING COUNT (stationid)>1 ORDER BY update")
dupcheck = pd.DataFrame(cur.fetchall())

In [None]:
query = """
DELETE FROM user2 
WHERE index IN (SELECT index
                 FROM (SELECT index, ROW_NUMBER() OVER (PARTITION BY stationid, update ORDER BY index) AS rnum 
                       FROM user2) t 
                 WHERE t.rnum >1);
"""

# Read this query from the inside out: decide what counts as a duplicate, then group on it.
# Start with the grouping (PARTITION BY):
#     - Partition by stationid and update.
#     - Order by index, the unique identifier we created; it simply numbers rows by collection ('update') time.
# Within each partition, ROW_NUMBER() numbers the records stored for that stationid/update combination:
#     - Any record with a ROW_NUMBER() greater than 1 is a duplicate.
#     - We alias the subquery as table t and the row number as rnum.
#     - Finally, select every index where rnum > 1 and delete those rows.
    
cur.execute(query) 
conn.commit()

In [None]:
cur.execute("CREATE TABLE station_update_idx (update TIMESTAMP, stationid CHAR(3), PRIMARY KEY (update, stationid))")
cur.execute("INSERT INTO station_update_idx SELECT update, stationid FROM user2")
conn.commit()

In [None]:
cur.execute("SELECT user2.available, user2.free FROM user2, station_update_idx idx\
             WHERE idx.update = '2018-02-20 07:12:19' AND idx.stationid = user2.stationid")
pp.pprint(cur.fetchall())

In [None]:
cur.execute("CREATE TABLE state_idx (state CHAR(2), homeless_id INT, PRIMARY KEY (state, homeless_id))")
cur.execute("INSERT INTO state_idx SELECT state, id FROM homeless_by_coc")
conn.commit()
cur.execute("SELECT hbc.state, hbc.year, hbc.coc_number FROM homeless_by_coc hbc, state_idx WHERE state_idx.state = 'CA' AND state_idx.homeless_id=hbc.id")
pp.pprint(cur.fetchall()) 

- Use the provided `cur` object.
- Run the `EXPLAIN ANALYZE` on the query you built in the last screen.
    - Format the output with the `json` type.
- Call `.fetchall()` and pretty print the output.
- Run the `EXPLAIN ANALYZE` on a query that returns the columns `id`, `year`, and `coc_number`, and filters `state` equal to `CA` on the `homeless_by_coc` table.
    - Format the output with the `json` type.
- Call `.fetchall()` and pretty print the output.   

In [None]:
cur.execute("""
            EXPLAIN (ANALYZE, format json) SELECT user2.available, user2.free FROM user2,\
                                                                                      station_update_idx idx\
            WHERE idx.update = '2018-02-20 07:12:19' AND idx.stationid = user2.stationid
            """)
pp.pprint(cur.fetchall())

In [None]:
cur.execute("""
SELECT hbc.id, hbc.year, hbc.coc_number FROM homeless_by_coc hbc, state_idx
WHERE state_idx.state = 'CA' AND state_idx.homeless_id = hbc.id
""")
pp.pprint(cur.fetchall())

cur.execute("""
EXPLAIN (ANALYZE, format json) SELECT hbc.id, hbc.year, hbc.coc_number FROM homeless_by_coc hbc, state_idx
WHERE state_idx.state = 'CA' AND state_idx.homeless_id = hbc.id
""")
pp.pprint(cur.fetchall())

cur.execute("""
EXPLAIN (ANALYZE, format json) SELECT id, year, coc_number FROM homeless_by_coc WHERE state='CA'
""")
pp.pprint(cur.fetchall())

- Use the provided `cur` and `conn` object.
- Create an index on state for the `homeless_by_coc` table.
    - Commit your changes.
- Run `EXPLAIN ANALYZE` on a select all from `homeless_by_coc` and filter by `CA` on the indexed `state` column.
    - Format the output with `json`.
- Call `.fetchall()` and pretty print the output.

>By letting Postgres maintain the indexes, we know that they will remain up to date as rows are added to the table. In addition, Postgres will automatically take advantages of indexes whenever possible, so we can focus on writing queries. This occurs during the planning/optimization stage, which is why we can see it in the EXPLAIN query.
>
>While creating indexes gives us tremendous speed benefits, they come at the cost of space. Each index needs to be stored in the database file. In addition, adding, editing, and deleting rows takes longer since each of the affected indexes need to be updated. Because indexes can be created after a table is created, it's recommended to only create an index when you find yourself querying on a specific column frequently.

In [None]:
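# state_idx was created as a table in an earlier cell. Tables and indexes share one
# namespace, so that table must be removed with DROP TABLE before the name is reused here.
cur.execute("DROP INDEX IF EXISTS state_idx")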
conn.commit()
cur.execute("CREATE INDEX state_idx ON homeless_by_coc(state)")
conn.commit()
cur.execute("EXPLAIN (ANALYZE, format json) SELECT * FROM homeless_by_coc where state = 'CA'")
pp.pprint(cur.fetchall())

- Use the provided `cur` and `conn` objects.
- Following the `EXPLAIN ANALYZE` command's `fetchall()`, drop the index on the `homeless_by_coc` table.
    - Commit your changes.
- Re-run `EXPLAIN ANALYZE` on a select all from `homeless_by_coc` and filter by `CA` on the indexed `state` column.
    - Format the output with `json`.
- Call `.fetchall()` and pretty print the output.

In [None]:
#cur.execute("CREATE INDEX state_idx ON homeless_by_coc(state)")
#conn.commit()
#cur.execute("EXPLAIN (ANALYZE, format json) SELECT * FROM homeless_by_coc WHERE state='CA'")
#pp.pprint(cur.fetchall())
cur.execute("DROP INDEX IF EXISTS state_idx")
conn.commit()
cur.execute("EXPLAIN (ANALYZE, format json) SELECT * FROM homeless_by_coc WHERE state = 'CA'")
pp.pprint(cur.fetchall())

- Use the provided `cur` and `conn` objects.
- Create and drop the index for `state` on `homeless_by_coc` to test the benchmark.
    - Run `EXPLAIN ANALYZE` on the given join query for `homeless_by_coc` before and after the drop.
    - Call `.fetchall()` to return the output.
- Pretty print the output from `fetchall()`.

In [None]:
#cur.execute("CREATE INDEX state_idx ON homeless_by_coc(state)")
#conn.commit()
query = "EXPLAIN (ANALYZE, format json) SELECT hbc.state, hbc.coc_number, hbc.coc_name, si.name FROM homeless_by_coc as hbc, state_info as si WHERE hbc.state = si.postal"

cur.execute(query)
pp.pprint(cur.fetchall())

- Use the provided `cur` and `conn` object.
- Create an index on `state` for the `homeless_by_coc` table.
    - Commit your changes.
- Run `EXPLAIN` on a select all from `homeless_by_coc`.
    - Filter by `CA` on the indexed `state` column.
    - Filter years greater than `1991-01-01` on the non-indexed `year` column.
    - Format the output with `json`.
- Call `.fetchall()` and pretty print the output.

In [None]:
cur.execute("DROP INDEX IF EXISTS state_idx")
conn.commit()

cur.execute("CREATE INDEX state_idx ON homeless_by_coc(state)")
conn.commit()

cur.execute("EXPLAIN (format json) SELECT * FROM homeless_by_coc WHERE state = 'CA' AND year > '1991-01-01'")
pp.pprint(cur.fetchall())

>Postgres typically chooses a `Bitmap Heap Scan` when an index matches too many rows for a plain index scan to be efficient, and it can also use one to combine several indexes. Here only `state` is indexed, and the scan follows these steps:
>
>1. Run through the index on `state` and record the location of every row that matches CA. This is the `Bitmap Index Scan`.
>2. Collect those locations in an in-memory bitmap, which acts as a temporary index into the table.
>3. Visit the table rows flagged in the bitmap and keep those that have a year value greater than 1991-01-01. This is the `Bitmap Heap Scan`.
>4. Return the results.
>
>This type of scan is more efficient than a pure Seq Scan, because the number of rows selected through the index will always be less than or equal to the number of rows in the full table. Unfortunately, each of those rows must still be checked one by one against the second filter (e.g. year greater than 1991).
>
>We can eliminate that second pass by creating a single index that covers both columns. This type of index is called a multi-column index. If you commonly run queries that filter on two columns, then using a multi-column index can speed up your query times.
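
The saving from a multi-column index can be sketched with sorted lists in plain Python. This is an analogy and not the Postgres implementation; the sample rows (with years as plain integers) and both helper functions are hypothetical.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

# Hypothetical rows keyed by id: id -> (state, year).
rows = {1: ("CA", 1989), 2: ("CA", 1995), 3: ("CA", 2001),
        4: ("NY", 1990), 5: ("NY", 2005), 6: ("AL", 1999)}

state_index = sorted((state, row_id) for row_id, (state, _) in rows.items())
state_year_index = sorted((state, year, row_id) for row_id, (state, year) in rows.items())


def with_single_column_index(state, min_year):
    """Range-scan the state index, then re-check the year on every fetched row."""
    lo = bisect_left(state_index, (state, -INF))
    hi = bisect_right(state_index, (state, INF))
    fetched = [row_id for _, row_id in state_index[lo:hi]]
    matches = [row_id for row_id in fetched if rows[row_id][1] > min_year]
    return matches, len(fetched)


def with_multi_column_index(state, min_year):
    """Both conditions map to one contiguous range of the (state, year) index."""
    lo = bisect_right(state_year_index, (state, min_year, INF))
    hi = bisect_right(state_year_index, (state, INF, INF))
    matches = [row_id for _, _, row_id in state_year_index[lo:hi]]
    return matches, len(matches)


print(with_single_column_index("CA", 1991))
print(with_multi_column_index("CA", 1991))
```

Both functions return the same matching ids, but the single-column version has to fetch every CA row and test its year, while the two-column version only touches the rows that satisfy both conditions.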

- Use the provided `cur` and `conn` objects.
- Create and drop a single column index for `state` on `homeless_by_coc` to test the benchmark.
    - Run `EXPLAIN ANALYZE` on a select all from `homeless_by_coc`.
    - Filter by CA on the indexed `state` column.
    - Filter years greater than `1991-01-01` on the non-indexed year column.
    - Format the output with `json`.
    - Call `fetchall()` and pretty print the output.
- Create a multi-column index on state and year on `homeless_by_coc` and run the same `EXPLAIN ANALYZE`.
    - pretty print the output from `fetchall()`.

In [None]:
cur.execute("DROP INDEX IF EXISTS state_idx")
conn.commit()

cur.execute("CREATE INDEX state_idx ON homeless_by_coc(state)")
conn.commit()

cur.execute("EXPLAIN ANALYZE SELECT * FROM homeless_by_coc WHERE state = 'CA' AND year > '1991-01-01'")
pp.pprint(cur.fetchall())

cur.execute("DROP INDEX IF EXISTS state_idx")
conn.commit()

cur.execute("CREATE INDEX idx ON homeless_by_coc(state, year)")
cur.execute("EXPLAIN ANALYZE SELECT * FROM homeless_by_coc WHERE state = 'CA' AND year > '1991-01-01'")
pp.pprint(cur.fetchall())

- Use the provided `cur` and `conn` objects.
- Create a multi-column index on `state`, `year`, and `coc_number` on `homeless_by_coc`.
    - Use the convention of naming your index by `snake_casing` the columns in order.
- Commit the index with the `conn` object.

In [None]:
cur.execute("CREATE INDEX state_year_coc_number_idx ON homeless_by_coc(state, year, coc_number)")
conn.commit()

>Every index adds work to your `INSERT` operations. As you increase the number of indexes, the performance of `INSERT` decreases because each new row must also be added to every index. In a high-load environment, this can slow inserts enough to become a bottleneck.
>
>Furthermore, because indexes are a separate structure, they also take up additional disk space in your database.
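
The extra work per insert can be counted in a small plain-Python model, where the table is a list and each index is a separate sorted list. The helper `insert_row` and the sample row are hypothetical; real index maintenance in Postgres is more involved.

```python
from bisect import insort


def insert_row(table, indexes, row):
    """Append `row` to `table`, add one entry per index, and return the writes performed."""
    table.append(row)
    position = len(table) - 1
    writes = 1  # the table itself
    for column, entries in indexes.items():
        insort(entries, (row[column], position))  # keep the index sorted
        writes += 1
    return writes


new_row = {"state": "CA", "year": 2012}

plain_writes = insert_row([], {}, new_row)
indexed_writes = insert_row([], {"state": [], "year": []}, new_row)
print(plain_writes, indexed_writes)
```

Each index turns one write into one more, and each index also holds its own copy of the indexed values, which is where the additional disk space goes.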

- Use the provided `cur` and `conn` objects.
- Run a copy statement that loads the `homeless_by_coc.csv` file into the `homeless_by_coc` table.
    - Enclose the `COPY` by a start and end time, then print the `end_time`.
- Delete all the rows in the `homeless_by_coc` table.
- Create a double column index on `state`, `year` for `homeless_by_coc`.
- Run another copy statement that loads the `homeless_by_coc.csv` file into the `homeless_by_coc` table.
    - Enclose the `COPY` by a start and end time, then print the `end_time`.

In [None]:
import time
from psycopg2.extensions import AsIs

filename = 'homeless_by_coc.csv'
table_name = filename.split('.')[0]  # 'homeless_by_coc'

# Time the load without the extra index.
start_time = time.time()
with open(filename) as f:
    statement = cur.mogrify('COPY %s FROM STDIN WITH CSV HEADER', (AsIs(table_name), ))
    cur.copy_expert(statement, f)
print(time.time() - start_time)

cur.execute("DELETE FROM homeless_by_coc")
cur.execute("CREATE INDEX state_year_idx ON homeless_by_coc(state, year)")

# Time the same load again, now that every inserted row also updates the index.
start_time = time.time()
with open(filename) as f:
    statement = cur.mogrify('COPY %s FROM STDIN WITH CSV HEADER', (AsIs(table_name), ))
    cur.copy_expert(statement, f)
print(time.time() - start_time)

- Use the provided `cur` and `conn` objects.
- Create a double column index on `state`, `year` for `homeless_by_coc`.
    - Add the descending order by option to `year`.
- Commit the index.
- Run a select on `homeless_by_coc`.
    - Select distinct `year`.
    - Filter by CA on the indexed `state` column.
    - Filter years greater than `1991-01-01` on the `year` column, which the index stores in descending order.
- Call `fetchall()` and assign the return value to `ordered_years`
- pretty print `ordered_years`.

In [None]:
cur.execute("DROP INDEX IF EXISTS state_year_idx")
conn.commit()

cur.execute("CREATE INDEX state_year_idx ON homeless_by_coc(state, year DESC)")
conn.commit()

cur.execute("SELECT DISTINCT year FROM homeless_by_coc WHERE state = 'CA' AND year > '1991-01-01'")
ordered_years = cur.fetchall()
pp.pprint(ordered_years)

- Use the provided `cur` and `conn` objects.
- Create a case-insensitive expression index on measures for `homeless_by_coc`.
- Commit the index.
- Run a select all from `homeless_by_coc`.
    - Filter `measures` to rows with `'unsheltered homeless people in families'`.
    - Limit to 1 row.
- Call `fetchone()` and assign the return value to `unsheltered_row`

In [None]:
cur.execute("CREATE INDEX measures_idx ON homeless_by_coc(lower(measures))")
conn.commit()

cur.execute("SELECT * FROM homeless_by_coc WHERE lower(measures)='unsheltered homeless people in families' LIMIT 1")
unsheltered_row = cur.fetchone()

- Use the provided `cur` and `conn` objects.
    - Create a partial index on `homeless_by_coc`.
    - Index on the `state` column.
- Restrict the index on all rows that have a count greater than 0.
- Commit the index.
- Run an `EXPLAIN ANALYZE` on a select all from `homeless_by_coc`.
    - Filter `state` on CA and count greater than 0.
    - Limit to 1 row.
- Call `fetchall()` and pretty print the result. 

In [None]:
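# The name idx is already taken by the multi-column index created earlier, so use a distinct name.
cur.execute("DROP INDEX IF EXISTS state_count_idx")
cur.execute("CREATE INDEX state_count_idx ON homeless_by_coc(state) WHERE count > 0")
conn.commit()

cur.execute("EXPLAIN ANALYZE SELECT * FROM homeless_by_coc WHERE state = 'CA' AND count > 0 LIMIT 1")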
pp.pprint(cur.fetchall())

- Use the provided cur and conn objects.
- Create a multi-column index that speeds up the following query:
    - `SELECT hbc.year, si.name, hbc.count FROM homeless_by_coc hbc, state_info si WHERE hbc.state = si.postal AND hbc.year > '2007-01-01' AND hbc.measures != 'total homeless'`
- Run `EXPLAIN ANALYZE` on the query.
- Call `.fetchall()` and pretty print the results.

In [None]:
cur.execute("CREATE INDEX state_year_measures_idx ON homeless_by_coc(state, lower(measures)) WHERE year > '2007-01-01'")
conn.commit()
cur.execute("""
EXPLAIN ANALYZE SELECT hbc.year, si.name, hbc.count
FROM homeless_by_coc hbc, state_info si WHERE hbc.state = si.postal
AND hbc.year > '2007-01-01' AND hbc.measures != 'total homeless'
""")
pp.pprint(cur.fetchall())

## Vacuuming Postgres Databases

>Shouldn't the speed be the same? Why would query speeds be affected by a few deletes? In this mission, we will learn 
> the process by which Postgres runs destructive commands, the reason why it can have a non-trivial effect on querying 
> speeds, and the internal tools to reclaim the lost speed.

- Use the provided `cur` object.
- Run the `DELETE FROM` command on `homeless_by_coc` to delete all the rows in the table.
- Reload the data by running `INSERT` or running a `COPY FROM` psycopg2 cursor query that loads data from the  `homeless_by_coc.csv` file into the `homeless_by_coc` table.
    - Commit your changes.
- Using `execute()`, count the number of rows from `homeless_by_coc`.
- Assign the `int` value return value to `homeless_rows`.

In [None]:
cur.execute("DELETE FROM homeless_by_coc")

filename = 'homeless_by_coc.csv'
with open(filename) as f:
    cur.copy_expert('COPY homeless_by_coc FROM STDIN WITH CSV HEADER', f)
conn.commit()

cur.execute("SELECT COUNT(*) FROM homeless_by_coc")
homeless_rows = cur.fetchone()[0]

`DELETE`<br>
Instead of removing the rows from the table right away, Postgres marks them as dead. Once the commit has succeeded they are no longer visible, but they remain on disk until they are cleaned up later.<br>
- Dead rows help keep consistency and isolation within a transaction.<br>
- Dead rows increase table size and will lengthen query times.
- To check if a table has any lingering dead rows, we use an internal view from `pg_catalog` called `pg_stat_all_tables`, which contains a collection of helpful table statistics.

Transactions are a way to ensure multiple users can concurrently run commands.<br>

All transactions follow a specific set of properties called ACID.

- Atomicity: If one thing fails in the transaction, the whole transaction fails.
- Consistency: A transaction will move the database from one valid state to another.
- Isolation: Concurrent transactions affect the database as if they had been run one after another.
- Durability: Once the transaction is committed, it will stay that way regardless of crash, power outage, etc.
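
Atomicity is the easiest of the four properties to observe directly. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in, so that it runs without a Postgres server; the `accounts` table and its rows are hypothetical. With psycopg2 the same commit/rollback pattern applies, though the exception class would differ.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
db.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
db.commit()

try:
    # Both updates belong to one transaction: credit bob, then debit alice.
    db.execute("UPDATE accounts SET balance = balance + 500 WHERE name = 'bob'")
    db.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'alice'")  # violates CHECK
    db.commit()
except sqlite3.IntegrityError:
    db.rollback()  # the failed debit undoes the earlier credit as well

balances = dict(db.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(balances)  # {'alice': 100, 'bob': 50}
```

The credit to bob succeeded on its own, yet it does not survive, because the transaction it belonged to was rolled back as a whole.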

- Use the provided `cur` object.
- Before the `DELETE` command, find the number of dead rows for the `homeless_by_coc` table.
    - Print the result.
- After loading the table, find the number of dead rows for the `homeless_by_coc` tables.
- Assign the `int` return value to `homeless_dead_rows`.

In [None]:
cur.execute("SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname = 'homeless_by_coc'")
print(cur.fetchone()[0]) #prints 0

cur.execute("DELETE FROM homeless_by_coc")
with open('homeless_by_coc.csv') as f:
    cur.copy_expert('COPY homeless_by_coc FROM STDIN WITH CSV HEADER', f)
conn.commit()

cur.execute("SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname = 'homeless_by_coc'")
homeless_dead_rows = cur.fetchone()[0]  # dead rows left behind by the DELETE

In [None]:
import psycopg2
import pprint as pp

conn = psycopg2.connect(dbname="test_db", user="xtang")
cur = conn.cursor()

cur.execute("SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname = 'users'")
print(cur.fetchone()[0])

- Use the provided `cur` object.
- Note, we have already deleted the rows for you.
- Try running a vacuum on `homeless_by_coc`.

`VACUUM`
- If you run `VACUUM` without a table name, it will vacuum every user-created table that the currently logged-in user has access to.
- Vacuuming a table will remove the marked dead rows.
- `VACUUM` cannot run inside a transaction block, but psycopg2 opens a transaction automatically by default.
- To run `VACUUM` outside a transaction block, we need to explicitly set the `autocommit` property of the `psycopg2` Connection object.
    - By setting `autocommit` to `True`, you are signalling to the `psycopg2` driver that you do not want your queries to run in a transaction block.
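
If autocommit is only needed for one command, the switch can be wrapped so that the previous setting is restored afterwards. The context manager below is a hypothetical helper, not part of psycopg2; it relies only on the connection's `autocommit` attribute, and `FakeConnection` is a stand-in so the sketch runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def autocommit(connection):
    """Temporarily switch a connection to autocommit, restoring the old setting afterwards."""
    previous = connection.autocommit
    connection.autocommit = True
    try:
        yield connection
    finally:
        connection.autocommit = previous


class FakeConnection:
    """Hypothetical stand-in exposing only the `autocommit` attribute."""
    autocommit = False


stub = FakeConnection()
with autocommit(stub) as active:
    inside = active.autocommit  # True while the block runs
after = stub.autocommit         # back to False afterwards
print(inside, after)
```

With a real connection, the body of the `with` block would be something like `connection.cursor().execute("VACUUM homeless_by_coc")`. It is safest to commit or roll back any open transaction first, since psycopg2 may refuse to change the setting while one is in progress.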

- Use the provided `cur` and `conn` objects.
- Disable transaction blocks on the connection object.
- Find the number of dead rows for the `homeless_by_coc` table.
    - Print the result.
- Run a vacuum on `homeless_by_coc`.
- After vacuuming the table, find the number of dead rows for the `homeless_by_coc` tables.
- Assign the int return value to `homeless_dead_rows`.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang", password="xtang")
conn.autocommit = True
cur = conn.cursor()
cur.execute("SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname='homeless_by_coc'")
print(cur.fetchone()[0])
cur.execute("VACUUM homeless_by_coc")
cur.execute("SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname='homeless_by_coc'")
homeless_dead_rows = cur.fetchone()[0]

- Use the provided `cur` and `conn` objects.
    - Set the connection to execute outside transaction blocks.
- Run an `EXPLAIN` query for a select all query on `homeless_by_coc`.
    - Pretty print the results.
- Vacuum analyze `homeless_by_coc`.
- Rerun the explain query.
- Pretty print the results from the explain query.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang", password="xtang")
conn.autocommit = True
cur = conn.cursor()

cur.execute("EXPLAIN SELECT * FROM homeless_by_coc")
pp.pprint(cur.fetchall())

cur.execute("VACUUM ANALYZE homeless_by_coc")
cur.execute("EXPLAIN SELECT * FROM homeless_by_coc")
pp.pprint(cur.fetchall())

- Use the provided `cur` and `conn` objects.
- Set the connection to execute outside a transaction block.
- Using `cur.execute()`, vacuum full all user created tables.

The most powerful and risky `VACUUM` option: `FULL`
- Rewrites each table compactly and returns the freed disk space to the operating system.
- Claims an <b>exclusive</b> lock on the table it is vacuuming.
    - This means that no insert, update, or delete queries can be issued against that table during the vacuum duration.
    - Select queries on the table are blocked as well until the vacuum finishes, which makes the table effectively unusable in the meantime.
- When we described a general `VACUUM`, we stated that it will remove dead rows from the table and reclaim their lost space. However, that disk space is generally not returned to the operating system; it stays assigned to the table as extra space to be used when more data is inserted.
- `VACUUM FULL` without a table name does this for every table in the current database that the user is allowed to vacuum.

In [None]:
conn = psycopg2.connect(dbname="test_db", user="xtang", password="xtang")
conn.autocommit = True
cur = conn.cursor()
cur.execute("VACUUM FULL")

> Postgres has a feature called <b>autovacuum</b> and it runs periodically on your tables to ensure that dead rows are removed, and your statistics are up to date.
>
> In the latest versions of Postgres, autovacuum is on by default, and requires no additional setup.
>
> When do we explicitly vacuum tables?
> 1. Are you running your normal analysis tasks without major table deletes and load? Then, leave vacuuming to the autovacuum.
>
>2. Have you recently deleted a significant amount of data in your tables, and you want to follow it up with complex analysis commands? Then, run a `VACUUM` or `VACUUM ANALYZE` to ensure optimized query commands.
>
>3. Are your tables growing out of control, and is there little free space left on the database server? Then, disable all queries and run a `VACUUM FULL` to reclaim a significant amount of space.