# Homework 4

* Assigned: 04/04 Thursday
* Due: 04/23 Tuesday 2:30 PM
* Value: 3.75% of your grade
* **Remember: homeworks are to be completed individually**

In this part of the problem set, you will examine query plans that PostgreSQL uses to execute queries, and try to understand
why it produces the plan it does for a certain query. The data set you will use has the same schema as the `iowa` dataset in HW3.

**NOTE: The `iowa` table is fairly large, so please avoid running broad queries such as `SELECT * FROM iowa`. They take a long time to execute and slow down the database for everyone else. See the Jupyter notes below for how to stop a hanging query.**

**EXPLAINs are fine since they don't actually execute the queries. When running a query, always use LIMIT clauses and/or selection filters to reduce the number of rows produced.**

### Jupyter Notes: _Read these carefully_

* You **may** create new IPython notebook cells for testing, debugging, exploring, etc. This is encouraged, in fact! **Just make sure that you run the final cell to submit your results.**
  * You can press shift+enter to execute the code in the cell that your cursor is in.
* When you see `In [*]:` to the left of the cell you are executing, the code / query is _running_. Please wait for the execution to complete.
    * **If the cell is hanging (i.e. running for too long), you can restart the kernel.**
    * To restart the kernel using the menu bar, choose "Kernel >> Restart >> Clear all outputs & restart", then re-execute the cells from the top.
* _Have fun!_

### Before Starting
**Please run the following cells to allow complete output for `EXPLAIN` queries and to connect to the database.**

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)

In [2]:
ib.connect_db('postgresql://student:w4111student@w4111.cisxo09blonu.us-east-1.rds.amazonaws.com/w4111')

Connected to: postgresql://student:w4111student@w4111.cisxo09blonu.us-east-1.rds.amazonaws.com/w4111


Also, please **set up the TOKEN**

In [3]:
# Your columbia uni that is used in SSOL
#
# IMPORTANT:  make sure this is consistent with the uni/alias used as your @columbia.edu email in SSOL
#
UNI = "cu1111"

# your instabase username (if you go to the instabase homepage, your username should be in the URL)
USER = "custudent"

# your repository name containing 
REPO = "class"


In Part II, we have provided you with the following indexes on the `iowa` table:

    Indexes:
      "iowa_cat_btree" btree (category)
      "iowa_date" btree (date)
      "iowa_dt_store_item_vendor_tree" btree (date, store, item, vendor)
      "iowa_store_hash" hash (store)
      "iowa_store_item_vendor_dt_tree" btree (store, item, vendor, date)
      "iowa_store_tree" btree (store)
      "iowa_vendor_hash" hash (vendor)
      "iowa_vendor_tree" btree (vendor)
      "iowa_zip_hash" hash (zipcode)
      "iowa_zip_tree" btree (zipcode)

### A Quick Example

To understand what query plan is being used, PostgreSQL includes the `EXPLAIN` command. 

It prints the plan for a query, including all of the physical operators and access methods being used. 
For example, the following SQL command displays the query plan for the SELECT:

In [4]:
# %%sql
# CREATE INDEX IF NOT EXISTS iowa_cat_btree ON iowa USING btree(category);

In [12]:
%%sql 
EXPLAIN SELECT * FROM iowa WHERE vendor_no = 0;

4 rows affected.


Unnamed: 0,QUERY PLAN
0,Bitmap Heap Scan on iowa (cost=19.59..3535.05 rows=925 width=1534)
1,Recheck Cond: (vendor_no = 0)
2,-> Bitmap Index Scan on iowa_vendor_tree (cost=0.00..19.36 rows=925 width=0)
3,Index Cond: (vendor_no = 0)


This is a query plan with no branches. It first runs a Bitmap Index Scan using the index `iowa_vendor_tree`, which is a B-tree index, with the condition `vendor_no = 0`. It _estimates_ that 925 rows match the condition.

The results are then fed into a Bitmap Heap Scan, which gathers all the tuple ids from the index scan together, sorts the tuple ids by the pages the tuples are stored in, and reads the data pages as a single scan while rechecking the vendor condition.

Don't worry about the heap scan too much. We mainly care that the query uses the iowa_vendor_tree index. You should also keep in mind that leaves of the BTree index do not store actual tuples (i.e. it is a secondary index, not a primary index).
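
The toy sketch below mimics the two steps described above on made-up data. It is only an illustration of the idea, not PostgreSQL's implementation: the page layout, the `vendor_index` dictionary, and both function names are assumptions invented for this example. Note that the index stores tuple ids (page, slot) rather than the tuples themselves, which is what makes it a secondary index.

```python
# Toy heap: a list of pages, each page a list of (item, vendor_no) tuples.
# All values are made up for illustration.
heap_pages = [
    [("A", 0), ("B", 0)],  # page 0
    [("C", 3), ("D", 0)],  # page 1
    [("E", 0), ("F", 5)],  # page 2
]

# Toy secondary index on vendor_no: key -> tuple ids (page, slot).
# The ids for key 0 are deliberately not in page order.
vendor_index = {
    0: [(2, 0), (0, 1), (1, 1), (0, 0)],
    3: [(1, 0)],
    5: [(2, 1)],
}


def bitmap_index_scan(index, key):
    """Collect the matching tuple ids, grouped by the page they live on."""
    bitmap = {}
    for page_no, slot in index.get(key, []):
        bitmap.setdefault(page_no, []).append(slot)
    return bitmap


def bitmap_heap_scan(pages, bitmap, key):
    """Read each marked page once, in page order, rechecking the condition."""
    rows, pages_read = [], []
    for page_no in sorted(bitmap):
        page = pages[page_no]
        pages_read.append(page_no)
        for slot in sorted(bitmap[page_no]):
            row = page[slot]
            if row[1] == key:  # the "Recheck Cond" step
                rows.append(row)
    return rows, pages_read


bitmap = bitmap_index_scan(vendor_index, 0)
rows, pages_read = bitmap_heap_scan(heap_pages, bitmap, 0)
print(rows)
print(pages_read)
```

Page 0 holds two matching tuples but appears only once in `pages_read`, which is the point of sorting tuple ids by page before touching the heap.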

**HINT: Some questions require selectivity information, so you may want to run `COUNT` queries from time to time.**

In [3]:
%%sql
SELECT COUNT(*) FROM iowa;

1 rows affected.


Unnamed: 0,count
0,1000000
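
One way to turn an `EXPLAIN` line and a `COUNT(*)` result into a selectivity figure is sketched below. The helper name `parse_estimates` and the regular expression are illustrative assumptions, not part of the assignment; the plan line and the total row count are copied from the outputs above. PostgreSQL prints its estimates as `(cost=<startup>..<total> rows=<estimated rows> width=<average row bytes>)`.

```python
import re

# Top line of the example plan shown earlier in this notebook.
plan_line = "Bitmap Heap Scan on iowa (cost=19.59..3535.05 rows=925 width=1534)"

# Total number of rows, from SELECT COUNT(*) FROM iowa.
total_rows = 1000000

# Illustrative pattern for the estimate block of an EXPLAIN line.
PLAN_ESTIMATE = re.compile(
    r"cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?) rows=(\d+) width=(\d+)"
)


def parse_estimates(line):
    """Return the planner's estimates from one EXPLAIN line, or None."""
    match = PLAN_ESTIMATE.search(line)
    if match is None:
        return None
    startup, total, rows, width = match.groups()
    return {
        "startup_cost": float(startup),
        "total_cost": float(total),
        "rows": int(rows),
        "width": int(width),
    }


estimates = parse_estimates(plan_line)

# Estimated selectivity: fraction of the table the planner expects to match.
estimated_selectivity = estimates["rows"] / float(total_rows)
print(estimates)
print(estimated_selectivity)
```

Here the planner expects roughly 0.09% of the table to match. Replacing the estimated row count with the result of a filtered `COUNT(*)` gives the actual selectivity for comparison.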


### Part II

**Q1**: Run `EXPLAIN` on the following query and explain in your own words (in a few sentences) the query plan that PostgreSQL picked (we are expecting something similar to the given example above).

In [13]:
%%sql
EXPLAIN SELECT * FROM iowa WHERE zipcode = '10027';

4 rows affected.


Unnamed: 0,QUERY PLAN
0,Bitmap Heap Scan on iowa (cost=18.56..3043.23 rows=792 width=1534)
1,Recheck Cond: (zipcode = '10027'::text)
2,-> Bitmap Index Scan on iowa_zip_tree (cost=0.00..18.36 rows=792 width=0)
3,Index Cond: (zipcode = '10027'::text)


In [8]:
## please answer between the quotes
a1="""

"""

**Q2**: What did PostgreSQL estimate the number of resulting rows to be and what is the actual number of rows?  
   
Why is there a difference?
_Hint_: Think about how the optimizer performs its evaluation.

In [14]:
%%sql
-- run this query to get actual number returned
SELECT COUNT(*) FROM iowa WHERE zipcode = '10027';

1 rows affected.


Unnamed: 0,count
0,0


In [10]:
## please answer between the quotes
a2="""

"""

**Q3**: Run `EXPLAIN` on the slightly different query below.  What index does the query use, and why is
   it the same as or different from the index used in Q1?


In [15]:
%%sql
EXPLAIN SELECT * FROM iowa WHERE zipcode = '10027' LIMIT 1;

3 rows affected.


Unnamed: 0,QUERY PLAN
0,Limit (cost=0.00..4.04 rows=1 width=1534)
1,-> Index Scan using iowa_zip_hash on iowa (cost=0.00..3197.86 rows=792 width=1534)
2,Index Cond: (zipcode = '10027'::text)


In [12]:
## please answer between the quotes
a3="""

"""

**Q4**: Run `EXPLAIN` on the following slightly different queries.  Why does the database choose those plans? What are the main reasons for different plans?


In [16]:
%%sql 
-- Q4A
EXPLAIN SELECT * FROM iowa WHERE '50056' < zipcode AND zipcode < '50058';

2 rows affected.


Unnamed: 0,QUERY PLAN
0,Index Scan using iowa_zip_tree on iowa (cost=0.42..8.45 rows=1 width=1534)
1,Index Cond: (('50056'::text < zipcode) AND (zipcode < '50058'::text))


In [17]:
%%sql
-- Q4B
EXPLAIN SELECT * FROM iowa WHERE '50056' < zipcode AND zipcode < '52726';

2 rows affected.


Unnamed: 0,QUERY PLAN
0,Seq Scan on iowa (cost=0.00..215000.00 rows=850022 width=1534)
1,Filter: (('50056'::text < zipcode) AND (zipcode < '52726'::text))


In [15]:
## please answer between the quotes
a4="""

"""

**Q5**: Try the following two EXPLAIN queries (Q5A, Q5B). Do they use the same query plan?  If so, why?  If not, why?
_Hint_: Think about the selectivity and cost statistics reported by the `EXPLAIN` queries in Q4 and Q5.

In [16]:
%%sql
--Q5A
EXPLAIN SELECT * FROM iowa WHERE 4500 < store AND store < 5000;

4 rows affected.


Unnamed: 0,QUERY PLAN
0,Bitmap Heap Scan on iowa (cost=2735.43..190320.10 rows=128683 width=1534)
1,Recheck Cond: ((4500 < store) AND (store < 5000))
2,-> Bitmap Index Scan on iowa_store_tree (cost=0.00..2703.26 rows=128683 width=0)
3,Index Cond: ((4500 < store) AND (store < 5000))


In [18]:
%%sql 
--Q5B
EXPLAIN SELECT * FROM iowa WHERE store = 2634;

4 rows affected.


Unnamed: 0,QUERY PLAN
0,Bitmap Heap Scan on iowa (cost=12.07..1829.59 rows=470 width=1534)
1,Recheck Cond: (store = 2634)
2,-> Bitmap Index Scan on iowa_store_item_vendor_dt_tree (cost=0.00..11.95 rows=470 width=0)
3,Index Cond: (store = 2634)


In [49]:
## please answer between the quotes
a5="""

"""

**Q6**: Suppose we inserted a large batch of new records into the table.  How would the time the insert takes differ between a table with no indexes and a table with the indexes listed above?

In [51]:
## please answer between the quotes...
a6="""

"""

## Part II Submission

### Create your submission file

Run the following cell to create a results file for your homework

DO NOT MODIFY THE FOLLOWING CELL!!

In [None]:
import datetime
import json

script_path = '{0}/{1}/fs/Instabase%20Drive'.format(USER, REPO)

with ib.open('results', "w") as f:
    result = dict(
        q1=a1,
        q2=a2,
        q3=a3,
        q4=a4,
        q5=a5,
        q6=a6,
        uni=UNI,
        user=USER
    )
    f.write(json.dumps(result))
    print("Result file created at: {0}".format(datetime.datetime.now()))
  
    print("")
    print("Check your results: https://www.instabase.com/{0}/HW4/results".format(script_path))

Finally, submit your __HW4 folder including the results file__ at the following URL:  

https://www.instabase.com/apps/file-submission/cmd/submit/c4fe9b25-9c68-4edc-ba7a-be6ceb764c9b