In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("proj2.ipynb")

# Project 2: Query Performance
## Due Date: Friday 03/12, 11:59 PM
## Assignment Details
In this project, we will explore how the database system optimizes query execution and how users can futher tune the performance of their queries.

This project works with the Lahman's Baseball Database, an open source collection of baseball statistics from 1871 to 2020. It contains a variety of data, like batting statistics, team stats, managerial records, Hall of Fame records, and much more.

**Note:** If at any point during the project, the internal state of the database or its tables have been modified in an undesirable way (i.e. a modification not resulting from the instructions of a question), restart your kernel and clear output and simply re-run the notebook as normal. This will shutdown you current connection to the database, which will prevent the issue of multiple connections to the database at any given point, and when re-running the notebook you will create a fresh database based on the provided Postgres dump.

## Scoring Breakdown
Question | Points
--- | ---
1a	| 1
1bi	| 1
1bii	| 2
1c	| 1
1di	| 2
1dii	| 1
1ei	| 1
1eii	| 1
1eiii	| 1
1eiv	| 2
1ev	| 2
2a	| 1
2b	| 1
2c	| 2
2d	| 2
3a	| 2
3bi	| 1
3bii |	2
3ci	| 1
3cii	| 2
4a	| 2
4b	| 1
4c	| 1
4di	| 2
4dii |	2
4ei	| 2
4eii | 	1
4eiii |	2
4eiv |	1
4ev |	1
4evi |	2
4evii |	2
5a | 2
5b | 2
5c | 1
5d | 2
6a | 2
6b | 1
6c | 1
6d | 2
6e | 2
7a | 1
7b | 1
8ai | 0
8aii | 1
8aiii | 1
8b | 1
8c | 1
8d | 2
8e | 1
**Total** | 72

In [2]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

## Getting Connected
Similar to Project 1, we will be using the `ipython-sql` library to connect this notebook to a PostgreSQL database server on your JupyterHub account. Run the following cell to initiate the connection.

In [None]:
# %sql postgresql://postgres:postgres@127.0.0.1:5432/baseball
# !PGPASSWORD=postgres psql -h localhost -U postgres -c 'DROP DATABASE IF EXISTS imdb'
# !PGPASSWORD=postgres psql -h localhost -U postgres -c 'CREATE DATABASE imdb' 
# !PGPASSWORD=postgres psql -h localhost -U postgres -d imdb -f data/imdbdb.sql

In [3]:
%reload_ext sql
%sql postgresql://postgres:postgres@127.0.0.1:5432/postgres

## Setting up the Database
The following cell will create the `baseball` database (if needed), unzip the Postgres dump of the Lahman's Baseball Database, populate the `baseball` database with the desired tables and data, and finally display all databases associated with the Postgres instance. After running the cell, you should see the `baseball` database in the generated list of databases outputted by `%sql \l`.

In [4]:
POSTGRES_URI = 'postgresql://postgres:postgres@127.0.0.1:5432'

# Run this only on initial run
"""            
!psql $POSTGRES_URI -c 'DROP DATABASE IF EXISTS baseball'
!psql $POSTGRES_URI -c 'CREATE DATABASE baseball'
!gzip -dc data/lahman.pgdump.gz | psql "$POSTGRES_URI/baseball" -f -
!psql $POSTGRES_URI -c 'SET max_parallel_workers_per_gather = 0;'
%sql \l
"""

'            \n!psql $POSTGRES_URI -c \'DROP DATABASE IF EXISTS baseball\'\n!psql $POSTGRES_URI -c \'CREATE DATABASE baseball\'\n!gzip -dc data/lahman.pgdump.gz | psql "$POSTGRES_URI/baseball" -f -\n!psql $POSTGRES_URI -c \'SET max_parallel_workers_per_gather = 0;\'\n%sql \\l\n'

Now, run the following cell to connect to the `baseball` database. There should be no errors after running the following cell.

In [5]:
%sql $POSTGRES_URI/baseball

To ensure that the connection to the database has been established, let's try grabbing the first 5 rows from the `halloffame` table.

In [6]:
%%sql
SELECT * FROM halloffame LIMIT 5

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
5 rows affected.


playerid,yearid,votedby,ballots,needed,votes,inducted,category,needed_note
cobbty01,1936,BBWAA,226,170,222,Y,Player,
ruthba01,1936,BBWAA,226,170,215,Y,Player,
wagneho01,1936,BBWAA,226,170,215,Y,Player,
mathech01,1936,BBWAA,226,170,205,Y,Player,
johnswa01,1936,BBWAA,226,170,189,Y,Player,


Run the following cell for grading purposes.

In [7]:
# !mkdir -p results

## Table Descriptions
In its entirety the Lahman's Baseball Database contains 27 tables containing a variety of statistics for players, teams, games, schools, etc. For simplicity, this project will focus on a subset of the tables:
* `appearances`: details on the positions each player appeared at
* `batting`: batting statistics for each player
* `collegeplaying`: list of players and the colleges they attended
* `halloffame`: Hall of Fame voting data
* `people`: player information (name, date of birth, and biographical info)
* `salaries`: player salary data
* `schools`: list of colleges that players attended

## Question 1: Queries and Views
### Question 1a:
Write a query that finds `namefirst`, `namelast`, `playerid` and `yearid` of all people who were successfully inducted into the Hall of Fame. **Note**: Your query should **not** use any sub-queries.

In [34]:
%%sql result_1a <<
SELECT
    namefirst,
    namelast,
    playerid,
    yearid
FROM
    halloffame
JOIN
    people
USING
    (playerid)
WHERE
    halloffame.inducted = 'Y'

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
323 rows affected.
Returning data to local variable result_1a


In [36]:
# Do not delete/edit this cell
result_1a.DataFrame().sort_values(['playerid', 'yearid']).to_csv('results/result_1a.csv', index=False)

In [37]:
grader.check("q1a")

### Question 1b:
In this question, we will compare the query you wrote in `Question 1a` against the provided query in `Question 1bi` by inspecting the query plans (specifically, looking at the execution time, cost, and overall query plan structure). 

__Note__: To inspect the query plan for a given query, create a variable storing the query as a string and invoke a `psql` shell command to `explain analyze` the query: 

`your_variable = "__REPLACE_ME_WITH_QUERY__"`

`!psql -h localhost -d baseball -c "explain analyze $your_variable"`

See the subsequent cells for an example of how to display the query plan.

#### Question 1bi: 
Inspect the query plan for `provided_query` and the query you wrote in `Question 1a`. Record the execution time and cost for each query.

In [11]:
%%sql provided_query <<
SELECT namefirst, namelast, p.playerid, yearid
FROM people AS p, (SELECT * FROM halloffame WHERE inducted = 'Y') AS hof 
WHERE p.playerid = hof.playerid;

In [8]:
provided_query_str = "SELECT namefirst, namelast, p.playerid, yearid FROM people AS p, (SELECT * FROM halloffame WHERE inducted = 'Y') AS hof WHERE hof.inducted = 'Y' AND p.playerid = hof.playerid;"
!psql $POSTGRES_URI/baseball -c "explain analyze $provided_query_str"

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=861.83..959.06 rows=323 width=25) (actual time=10.740..11.341 rows=323 loops=1)
   Hash Cond: ((halloffame.playerid)::text = (p.playerid)::text)
   ->  Seq Scan on halloffame  (cost=0.00..96.39 rows=323 width=13) (actual time=0.006..0.514 rows=323 loops=1)
         Filter: ((inducted)::text = 'Y'::text)
         Rows Removed by Filter: 3868
   ->  Hash  (cost=619.70..619.70 rows=19370 width=21) (actual time=10.537..10.538 rows=19370 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 1293kB
         ->  Seq Scan on people p  (cost=0.00..619.70 rows=19370 width=21) (actual time=0.007..5.031 rows=19370 loops=1)
 Planning Time: 0.749 ms
 Execution Time: 11.415 ms
(10 rows)



In [9]:
inducted_hof_query_str = """
SELECT
    namefirst,
    namelast,
    playerid,
    yearid
FROM
    halloffame
JOIN
    people
USING
    (playerid)
WHERE
    halloffame.inducted = 'Y'
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $inducted_hof_query_str"

                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=861.83..959.06 rows=323 width=25) (actual time=13.508..14.192 rows=323 loops=1)
   Hash Cond: ((halloffame.playerid)::text = (people.playerid)::text)
   ->  Seq Scan on halloffame  (cost=0.00..96.39 rows=323 width=13) (actual time=0.009..0.582 rows=323 loops=1)
         Filter: ((inducted)::text = 'Y'::text)
         Rows Removed by Filter: 3868
   ->  Hash  (cost=619.70..619.70 rows=19370 width=21) (actual time=13.271..13.272 rows=19370 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 1293kB
         ->  Seq Scan on people  (cost=0.00..619.70 rows=19370 width=21) (actual time=0.004..6.437 rows=19370 loops=1)
 Planning Time: 0.757 ms
 Execution Time: 14.279 ms
(10 rows)



In [10]:
provided_query_cost = 959.06
provided_query_timing = 11.415
your_query_cost = 959.06
your_query_timing = 14.279

In [11]:
grader.check("q1bi")

#### Question 1bii:
Given your findings from inspecting the query plans of the two queries, answer the following question. Assign the variable `q1b_part2` to a list of answer choices which are true statements.

1. Consider the following statements:
    1. Both the queries have the same cost.
    1. The provided query has a faster execution time because it makes use of a subquery.
    1. The query you wrote has a faster execution time because it does not make use a subquery.
    1. The provided query has less cost because it makes use of a subquery.
    1. The query you wrote has less cost because it does not make use a subquery.
    1. The queries have the same output.
    1. The queries do not have the same output.
    
**Note:** Your answer should look like `q1b_part2 = ['A', 'B']`

<!--
BEGIN QUESTION
name: q1bii
points: 2
-->

In [12]:
q1b_part2 = ['A', 'F']

In [13]:
grader.check("q1bii")

### Question 1c:
Write a query that creates a view named `inducted_hof_ca` of the people who were successfully inducted into the Hall of Fame and played in college at a school located in California. For each player, return their `namefirst`, `namelast`, `playerid`, `schoolid`, and `yearid` ordered by the `yearid` and then the `playerid`. 

**Note**: For this query, `yearid` refers to player's year of induction into the Hall of Fame.

In [18]:
%%sql result_1c <<
DROP VIEW IF EXISTS inducted_hof_ca;
CREATE VIEW inducted_hof_ca AS
    SELECT
        namefirst, namelast, playerid, schools.schoolid, collegeplaying.yearid
    FROM
        halloffame
    JOIN
        people USING(playerid)
    JOIN
        collegeplaying USING(playerid)
    JOIN
        schools ON collegeplaying.schoolid = schools.schoolid
    WHERE
        halloffame.inducted = 'Y' AND
        schools.schoolstate = 'CA'
    ORDER BY
        yearid, playerid
;
SELECT * FROM inducted_hof_ca;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Done.
23 rows affected.
Returning data to local variable result_1c


In [19]:
# Do not delete/edit this cell
result_1c.DataFrame().sort_values(['playerid', 'schoolid', 'yearid']).to_csv('results/result_1c.csv', index=False)

In [20]:
grader.check("q1c")

### Question 1d:
For this question, we want to compute the count of players who were inducted into the Hall of Fame and played baseball at a college in California for each `schoolid` and `yearid` combination ordered by ascending `yearid`.

#### Question 1di:
Write two queries that accomplish this task -- one query using the view you created in the `Question 1c` and one query that does not use the view, common table expressions (CTEs), or any sub-queries.

In [41]:
%%sql with_view <<
SELECT DISTINCT ON
    (playerid, schoolid) schoolid,
    yearid,
    COUNT(1)
FROM inducted_hof_ca
GROUP BY playerid, schoolid, yearid
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
13 rows affected.
Returning data to local variable with_view


In [39]:
%%sql without_view <<
SELECT DISTINCT ON
    (collegeplaying.playerid, schools.schoolid) schools.schoolid,
    collegeplaying.yearid,
    COUNT(1)
FROM halloffame
JOIN people USING(playerid)
JOIN collegeplaying USING(playerid)
JOIN schools ON collegeplaying.schoolid = schools.schoolid
WHERE halloffame.inducted = 'Y' AND schools.schoolstate = 'CA'
GROUP BY collegeplaying.playerid, schools.schoolid, collegeplaying.yearid
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
13 rows affected.
Returning data to local variable without_view


In [42]:
# Do not delete/edit this cell
with_view.DataFrame().sort_values(['schoolid', 'yearid']).to_csv('results/result_1di_view.csv', index=False)
without_view.DataFrame().sort_values(['schoolid', 'yearid']).to_csv('results/result_1di_no_view.csv', index=False)

In [43]:
grader.check("q1di")

#### Question 1dii:
Fill in your queries from `Question 1di` and inspect the query plans for the two queries. Record the execution time and cost for each query.

In [44]:
with_view_query_str = """
SELECT DISTINCT ON
    (playerid, schoolid) schoolid,
    yearid,
    COUNT(1)
FROM inducted_hof_ca
GROUP BY playerid, schoolid, yearid
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $with_view_query_str"

                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=525.47..528.11 rows=96 width=29) (actual time=11.200..11.222 rows=13 loops=1)
   ->  GroupAggregate  (cost=525.47..527.63 rows=96 width=29) (actual time=11.199..11.213 rows=23 loops=1)
         Group Key: inducted_hof_ca.playerid, inducted_hof_ca.schoolid, inducted_hof_ca.yearid
         ->  Sort  (cost=525.47..525.71 rows=96 width=21) (actual time=11.194..11.197 rows=23 loops=1)
               Sort Key: inducted_hof_ca.playerid, inducted_hof_ca.schoolid, inducted_hof_ca.yearid
               Sort Method: quicksort  Memory: 26kB
               ->  Subquery Scan on inducted_hof_ca  (cost=521.11..522.31 rows=96 width=21) (actual ti

In [45]:
without_view_query_str = """
SELECT DISTINCT ON
    (collegeplaying.playerid, schools.schoolid) schools.schoolid,
    collegeplaying.yearid,
    COUNT(1)
FROM halloffame
JOIN people USING(playerid)
JOIN collegeplaying USING(playerid)
JOIN schools ON collegeplaying.schoolid = schools.schoolid
WHERE halloffame.inducted = 'Y' AND schools.schoolstate = 'CA'
GROUP BY collegeplaying.playerid, schools.schoolid, collegeplaying.yearid
"""
!psql $POSTGRES_URI/baseball -c"explain analyze $without_view_query_str"

                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=521.11..523.75 rows=96 width=29) (actual time=12.201..12.222 rows=13 loops=1)
   ->  GroupAggregate  (cost=521.11..523.27 rows=96 width=29) (actual time=12.200..12.215 rows=23 loops=1)
         Group Key: collegeplaying.playerid, schools.schoolid, collegeplaying.yearid
         ->  Sort  (cost=521.11..521.35 rows=96 width=21) (actual time=12.193..12.197 rows=23 loops=1)
               Sort Key: collegeplaying.playerid, schools.schoolid, collegeplaying.yearid
               Sort Method: quicksort  Memory: 26kB
               ->  Nested Loop  (cost=386.71..517.95 rows=96 width=21) (actual time=11.652..12.130 rows=23 loops=1)
                     Join Fil

In [46]:
with_view_cost = 528.11
with_view_timing = 11.222
without_view_cost = 523.75
without_view_timing = 12.222

In [47]:
grader.check("q1dii")

### Question 1e:
#### Question 1ei:
Now, let's try creating a materialized view named `inducted_hof_ca_mat` instead of the regular view from `Question 1c`.

In [48]:
%%sql inducted_hof_ca_materialized <<
DROP MATERIALIZED VIEW IF EXISTS inducted_hof_ca_mat;
CREATE MATERIALIZED VIEW inducted_hof_ca_mat AS
    SELECT
        namefirst, namelast, playerid, schools.schoolid, collegeplaying.yearid
    FROM
        halloffame
    JOIN
        people USING(playerid)
    JOIN
        collegeplaying USING(playerid)
    JOIN
        schools ON collegeplaying.schoolid = schools.schoolid
    WHERE
        halloffame.inducted = 'Y' AND
        schools.schoolstate = 'CA'
    ORDER BY
        yearid, playerid
;
SELECT * FROM inducted_hof_ca_mat;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
23 rows affected.
23 rows affected.
Returning data to local variable inducted_hof_ca_materialized


In [49]:
# Do not delete/edit this cell
inducted_hof_ca_materialized.DataFrame().sort_values(['playerid', 'schoolid', 'yearid']).to_csv('results/result_1ei.csv', index=False)

In [50]:
grader.check("q1ei")

#### Question 1eii:

Now, rewrite the query from `Question 1d` to use the materialized view `inducted_hof_ca_mat`.

In [51]:
%%sql with_materialized_view <<
SELECT DISTINCT ON
    (playerid, schoolid) schoolid,
    yearid,
    COUNT(1)
FROM inducted_hof_ca_mat
GROUP BY playerid, schoolid, yearid

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
13 rows affected.
Returning data to local variable with_materialized_view


In [52]:
# Do not delete/edit this cell
with_materialized_view.DataFrame().sort_values(['schoolid', 'yearid']).to_csv('results/result_1eii.csv', index=False)

In [53]:
grader.check("q1eii")

#### Question 1eiii:
Inspect the query plan and record the execution time and cost of the query that uses the materialized view.

In [55]:
with_materialized_view_query_str = """
SELECT DISTINCT ON
    (playerid, schoolid) schoolid,
    yearid,
    COUNT(1)
FROM inducted_hof_ca_mat
GROUP BY playerid, schoolid, yearid
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $with_materialized_view_query_str"

                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=24.24..25.74 rows=200 width=98) (actual time=0.162..0.178 rows=13 loops=1)
   ->  Sort  (cost=24.24..24.74 rows=200 width=98) (actual time=0.161..0.165 rows=23 loops=1)
         Sort Key: playerid, schoolid
         Sort Method: quicksort  Memory: 26kB
         ->  HashAggregate  (cost=14.60..16.60 rows=200 width=98) (actual time=0.043..0.055 rows=23 loops=1)
               Group Key: playerid, schoolid, yearid
               Batches: 1  Memory Usage: 40kB
               ->  Seq Scan on inducted_hof_ca_mat  (cost=0.00..12.30 rows=230 width=90) (actual time=0.012..0.017 rows=23 loops=1)
 Planning Time: 0.344 ms
 Execution Time: 0.291 ms
(10 rows)



In [56]:
with_materialized_view_cost = 25.74
with_materialized_view_timing = 0.178

In [57]:
grader.check("q1eiii")

#### Question 1eiv:
Given your findings from inspecting the query plans of queries using a view, no view, and a materialized view, answer the following question. Assign the variable `q1e_part4` to a list of all statements which are true.

1. Consider the following statements:
    1. Views will reduce the execution time and the cost of a query.
    1. Views will reduce the execution time of a query, but not the cost.
    1. Views will reduce the cost of a query, but not the execution time.
    1. Materialized views reduce the execution time and the cost of a query.
    1. Materialized views reduce the execution time, but not cost of a query.
    1. Materialized views reduce the cost of a query, but not the execution time.
    1. Materialized views will result in the same query plan as a query using views.
    1. Materialized views and views take the same time to create.
    1. Materialized views take less time to create than a view.
    1. Materialized views take more time to create than a view.
    
*Note:* Your answer should look like `q1e_part4 = ['A', 'B']`

<!--
BEGIN QUESTION
name: q1eiv
points: 2
-->

In [58]:
q1e_part4 = ['B', 'D', 'H']

In [59]:
grader.check("q1eiv")

<!-- BEGIN QUESTION -->

#### Question 1ev:

Explain your answer to the previous part based on your findings and details from the query plans (your explanation should include why you didn't choose certain options).

<!--
BEGIN QUESTION
name: q1ev
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 2: Predicate Pushdown
In this question, we will explore the impact of predicates on a query's execution, particularly inspecting when the optimizer applies predicates.

### Question 2a:
Recall the `inducted_hof_ca` created in `Question 1c`. Inspect the query plan for a query that that gets all rows from the view, and record the execution time and cost.

In [60]:
query_view_str = """
SELECT * FROM inducted_hof_ca
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $query_view_str"

                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=530.54..530.78 rows=96 width=33) (actual time=5.900..5.903 rows=23 loops=1)
   Sort Key: collegeplaying.yearid, halloffame.playerid
   Sort Method: quicksort  Memory: 26kB
   ->  Nested Loop  (cost=386.71..527.38 rows=96 width=33) (actual time=5.414..5.861 rows=23 loops=1)
         Join Filter: ((halloffame.playerid)::text = (people.playerid)::text)
         ->  Hash Join  (cost=386.42..485.79 rows=96 width=30) (actual time=5.380..5.684 rows=23 loops=1)
               Hash Cond: ((halloffame.playerid)::text = (collegeplaying.playerid)::text)
               ->  Seq Scan on halloffame  (cost=0.00..96.39 rows=323 width=9) (actual time=0.009..0.507 rows=323 loops=1)
              

In [61]:
query_view_cost = 530.78
query_view_timing = 5.903

In [62]:
grader.check("q2a")

### Question 2b:
Now, add a filter to only return rows from `inducted_hof_ca` where the year is later than 2010. Inspect the query plan and record the execution time and cost.

In [63]:
%%sql result_2b <<
SELECT *
FROM inducted_hof_ca
WHERE yearid > 2010

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
0 rows affected.
Returning data to local variable result_2b


In [64]:
query_view_with_filter_str = """
SELECT *
FROM inducted_hof_ca
WHERE yearid > 2010
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $query_view_with_filter_str"

                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=396.86..396.86 rows=1 width=33) (actual time=4.281..4.284 rows=0 loops=1)
   Sort Key: collegeplaying.yearid, halloffame.playerid
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=30.36..396.85 rows=1 width=33) (actual time=4.239..4.242 rows=0 loops=1)
         ->  Nested Loop  (cost=30.07..392.69 rows=1 width=30) (actual time=4.239..4.241 rows=0 loops=1)
               ->  Hash Join  (cost=29.79..359.78 rows=5 width=21) (actual time=0.752..4.191 rows=5 loops=1)
                     Hash Cond: ((collegeplaying.schoolid)::text = (schools.schoolid)::text)
                     ->  Seq Scan on collegeplaying  (cost=0.00..329.88 rows=43 width=21) (actual time=0.048..3.645 

In [65]:
query_view_with_filter_cost = 396.86
query_view_with_filter_timing = 4.284

In [66]:
grader.check("q2b")

### Question 2c:
Given your findings from inspecting the query plans of queries from `Question 2a-b`, answer the following question. Assign the variable `q2c` to a list of all statements which are true.

1. Consider the following statements:
    1. Adding a filter lowered the cost.
    1. Adding a filter increased the cost.
    1. Adding a filter did not change the cost.
    1. Adding a filter increased the execution time.
    1. Adding a filter decreased the execution time.
    1. Adding a filter did not change the execution time.
    1. No statement is true.
    
    
**Note:** Your answer should look like `q2c = ['A', 'B']`

<!--
BEGIN QUESTION
name: q2c
points: 2
-->

In [67]:
q2c = ['A', 'E']

In [68]:
grader.check("q2c")

<!-- BEGIN QUESTION -->

#### Question 2d:

Explain your answer based on your findings and details from the query plans (your explanation should include why you didn't choose certain options).

<!--
BEGIN QUESTION
name: q2d
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 3: Joins
### Question 3a:
Perform a left, right, inner, and full join on the `people` and `collegeplaying` tables on the `playerid` column and inspect their respective query plans (in particular, look at the number of rows resulting from the query).

In [107]:
%%sql left_join <<
SELECT *
FROM people p
LEFT JOIN collegeplaying cp
USING(playerid)
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
30145 rows affected.
Returning data to local variable left_join


In [108]:
left_join_query_str = """
SELECT *
FROM people p
LEFT JOIN collegeplaying cp
USING(playerid)
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $left_join_query_str"

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=861.83..1193.88 rows=19370 width=158) (actual time=10.758..31.285 rows=30145 loops=1)
   Hash Cond: ((cp.playerid)::text = (p.playerid)::text)
   ->  Seq Scan on collegeplaying cp  (cost=0.00..286.50 rows=17350 width=21) (actual time=0.018..1.567 rows=17350 loops=1)
   ->  Hash  (cost=619.70..619.70 rows=19370 width=146) (actual time=10.521..10.522 rows=19370 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 3633kB
         ->  Seq Scan on people p  (cost=0.00..619.70 rows=19370 width=146) (actual time=0.009..2.264 rows=19370 loops=1)
 Planning Time: 1.183 ms
 Execution Time: 32.567 ms
(8 rows)



In [109]:
%%sql right_join <<
SELECT *
FROM people p
RIGHT JOIN collegeplaying cp
USING(playerid)
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
17350 rows affected.
Returning data to local variable right_join


In [110]:
right_join_query_str = """
SELECT *
FROM people p
RIGHT JOIN collegeplaying cp
USING(playerid)
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $right_join_query_str"

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=861.83..1193.88 rows=17350 width=158) (actual time=29.138..41.657 rows=17350 loops=1)
   Hash Cond: ((cp.playerid)::text = (p.playerid)::text)
   ->  Seq Scan on collegeplaying cp  (cost=0.00..286.50 rows=17350 width=21) (actual time=0.022..1.540 rows=17350 loops=1)
   ->  Hash  (cost=619.70..619.70 rows=19370 width=146) (actual time=28.836..28.843 rows=19370 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 3596kB
         ->  Seq Scan on people p  (cost=0.00..619.70 rows=19370 width=146) (actual time=0.017..12.257 rows=19370 loops=1)
 Planning Time: 1.037 ms
 Execution Time: 42.491 ms
(8 rows)



In [111]:
%%sql inner_join <<
SELECT *
FROM people p
JOIN collegeplaying cp
USING(playerid)
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
17350 rows affected.
Returning data to local variable inner_join


In [112]:
inner_join_query_str = """
SELECT *
FROM people p
JOIN collegeplaying cp
USING(playerid)
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $inner_join_query_str"

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=861.83..1193.88 rows=17350 width=158) (actual time=10.092..25.937 rows=17350 loops=1)
   Hash Cond: ((cp.playerid)::text = (p.playerid)::text)
   ->  Seq Scan on collegeplaying cp  (cost=0.00..286.50 rows=17350 width=21) (actual time=0.016..1.892 rows=17350 loops=1)
   ->  Hash  (cost=619.70..619.70 rows=19370 width=146) (actual time=9.856..9.857 rows=19370 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 3633kB
         ->  Seq Scan on people p  (cost=0.00..619.70 rows=19370 width=146) (actual time=0.008..2.074 rows=19370 loops=1)
 Planning Time: 1.130 ms
 Execution Time: 26.869 ms
(8 rows)



In [113]:
%%sql full_join <<
SELECT *
FROM people p
FULL JOIN collegeplaying cp
USING(playerid)
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
30145 rows affected.
Returning data to local variable full_join


In [114]:
full_join_query_str = """
SELECT *
FROM people p
FULL JOIN collegeplaying cp
USING(playerid)
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $full_join_query_str"

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Hash Full Join  (cost=861.83..1193.88 rows=19370 width=187) (actual time=9.380..33.945 rows=30145 loops=1)
   Hash Cond: ((cp.playerid)::text = (p.playerid)::text)
   ->  Seq Scan on collegeplaying cp  (cost=0.00..286.50 rows=17350 width=21) (actual time=0.024..2.067 rows=17350 loops=1)
   ->  Hash  (cost=619.70..619.70 rows=19370 width=146) (actual time=9.066..9.068 rows=19370 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 3633kB
         ->  Seq Scan on people p  (cost=0.00..619.70 rows=19370 width=146) (actual time=0.015..1.968 rows=19370 loops=1)
 Planning Time: 1.158 ms
 Execution Time: 35.489 ms
(8 rows)



In [116]:
# Do not delete/edit this cell
left_join.DataFrame().sort_values(['playerid', 'schoolid', 'yearid']).iloc[:1000].to_csv('results/result_3a_left.csv', index=False)
right_join.DataFrame().sort_values(['playerid', 'schoolid', 'yearid']).iloc[:1000].to_csv('results/result_3a_right.csv', index=False)
inner_join.DataFrame().sort_values(['playerid', 'schoolid', 'yearid']).iloc[:1000].to_csv('results/result_3a_inner.csv', index=False)
full_join.DataFrame().sort_values(['playerid', 'schoolid', 'yearid']).iloc[:1000].to_csv('results/result_3a_full.csv', index=False)

In [117]:
grader.check("q3a")

### Question 3b:
#### Question 3bi:
Given your findings from inspecting the query plan of the different joins, answer the following question. Assign the variable `q3b_part1` to a list of all the options which are true for the proposed statement.

1. For the query joining `people` and `collegeplaying` tables on the `playerid`,  the `INNER JOIN` output is the same as:
    1. the `LEFT JOIN` output
    1. the `RIGHT JOIN` output
    1. the `FULL JOIN` output
    1. None of the above 

**Note:** Your answer should be formatted as follows: `q3b_part1 = ['A', 'B']`

<!--
BEGIN QUESTION
name: q3bi
points: 1
-->

In [118]:
q3b_part1 = ['D']

In [119]:
grader.check("q3bi")

<!-- BEGIN QUESTION -->

#### Question 3bii
Explain your answer to the previous part. If you did not select answer choice `D`, ensure your explanation discusses if your findings can be generalized for all queries.

<!--
BEGIN QUESTION
name: q3bii
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### Question 3ci:
Given your findings from inspecting the query plan of the different joins, answer the following question. Assign the variable `q3c_part1` to a list of all options which are true for the proposed statement.

1. For the query joining `people` and `collegeplaying` tables on the `playerid`, the `FULL JOIN` output is the same as:
    1. the `LEFT JOIN` output
    1. the `RIGHT JOIN` output
    1. the `INNER JOIN` output
    1. None of the above   

**Note:** Your answer should be formatted as follows: `q3c_part1 = ['A', 'B']`

<!--
BEGIN QUESTION
name: q3ci
points: 1
-->

In [120]:
q3c_part1 = ['A']

In [121]:
grader.check("q3ci")

<!-- BEGIN QUESTION -->

#### Question 3cii

Explain your answer to the previous part. If you did not select answer choice `D`, ensure your explanation discusses if your findings can be generalized for all queries.

<!--
BEGIN QUESTION
name: q3cii
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 4: Indexes

#### Question 4a:
Write a query that outputs the `playerid` and average `salary` for each player that only batted in 10 games (the number of games in which a player batted can be found in the `g_batting` column of the `appearances` table). Inspect the query plan and record the execution time and cost.

In [127]:
%%sql result_4a <<
SELECT a.playerid, AVG(s.salary)
FROM appearances a
JOIN salaries s
USING (yearid, teamid, lgid, playerid)
WHERE a.g_batting = 10
GROUP BY a.playerid
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
134 rows affected.
Returning data to local variable result_4a


In [128]:
# Do not delete/edit this cell
result_4a.DataFrame().sort_values('playerid').to_csv('results/result_4a.csv', index=False)

In [129]:
grader.check("q4ai")

In [130]:
result_4a_query_str = """
SELECT a.playerid, AVG(s.salary)
FROM appearances a
JOIN salaries s
USING (yearid, teamid, lgid, playerid)
WHERE a.g_batting = 10
GROUP BY a.playerid
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $result_4a_query_str"

                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=3637.78..3637.81 rows=1 width=17) (actual time=28.250..28.315 rows=134 loops=1)
   Group Key: a.playerid
   ->  Sort  (cost=3637.78..3637.79 rows=1 width=17) (actual time=28.242..28.250 rows=138 loops=1)
         Sort Key: a.playerid
         Sort Method: quicksort  Memory: 35kB
         ->  Hash Join  (cost=2901.00..3637.77 rows=1 width=17) (actual time=18.229..28.030 rows=138 loops=1)
               Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
               ->  Seq Scan on salaries s  (cost

In [131]:
result_4a_cost = 3637.81
result_4a_timing = 28.315

In [132]:
grader.check("q4aii")

### Question 4b:
Add an index with name `g_batting_idx` on the `g_batting` column of the `appearances` table.

In [133]:
%%sql result_4b <<
CREATE INDEX g_batting_idx ON appearances (g_batting)
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Returning data to local variable result_4b


Now, re-inspect the query plan of the query from `Question 4a` and record its execution time and cost.

In [134]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_4a_query_str"

                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=2390.88..2390.91 rows=1 width=17) (actual time=15.001..15.065 rows=134 loops=1)
   Group Key: a.playerid
   ->  Sort  (cost=2390.88..2390.89 rows=1 width=17) (actual time=14.993..15.001 rows=138 loops=1)
         Sort Key: a.playerid
         Sort Method: quicksort  Memory: 35kB
         ->  Hash Join  (cost=1654.10..2390.87 rows=1 width=17) (actual time=3.373..14.812 rows=138 loops=1)
               Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
               ->  Seq Scan on salaries s  (cost=

In [135]:
result_4b_cost = 2390.91
result_4b_timing = 15.065

In [136]:
grader.check("q4b")

In the following question, we will explore adding a different index and evaluating the query from `Question 4a`. To avoid any interference by the `g_batting_idx` index, drop the index before moving onto the next question.

In [137]:
%%sql 
DROP INDEX g_batting_idx;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

### Question 4c:
Write a query to add an index with name `salary_idx` on the `salary` column of the `salaries` table.

In [139]:
%%sql result_4c <<
CREATE INDEX salary_idx ON salaries (salary)
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Returning data to local variable result_4c


Re-inspect the query plan of the query from the `Question 4a` and record its execution time and cost.

In [140]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_4a_query_str"

                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=3637.78..3637.81 rows=1 width=17) (actual time=23.422..23.493 rows=134 loops=1)
   Group Key: a.playerid
   ->  Sort  (cost=3637.78..3637.79 rows=1 width=17) (actual time=23.414..23.424 rows=138 loops=1)
         Sort Key: a.playerid
         Sort Method: quicksort  Memory: 35kB
         ->  Hash Join  (cost=2901.00..3637.77 rows=1 width=17) (actual time=16.106..23.221 rows=138 loops=1)
               Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
               ->  Seq Scan on salaries s  (cost

In [141]:
result_4c_cost = 3637.81
result_4c_timing = 23.493

In [142]:
grader.check("q4c")

### Question 4d:
#### Question 4di:
Given your findings from inspecting the query plans with no indexes, an index on `g_batting`, and an index on `salary`, answer the following question. Assign the variable `q4d_part1` to a list of all options which are true for the proposed statement.

1. Consider the following statements:
    1. Adding the `g_batting` index did not have a significant impact on the query execution time and cost.
    1. Adding the `g_batting` index did have a significant impact on the query execution time, but not the cost.
    1. Adding the `g_batting` index did have a significant impact on the query cost, but not the execution time.
    1. Adding the `g_batting` index did have a significant impact on the query cost and execution time.
    1. Adding the `salary` index did not have a significant impact on the query execution time and cost.
    1. Adding the `salary` index did have a significant impact on the query execution time, but not the cost.
    1. Adding the `salary` index did have a significant impact on the query cost, but not the execution time.
    1. Adding the `salary` index did have a significant impact on the query cost and execution time.

**Note:** Your answer should be formatted as follows: `q4d_part1 = ['A', 'B']`

<!--
BEGIN QUESTION
name: q4di
points: 2
-->

In [143]:
q4d_part1 = ['D', 'F']

In [144]:
grader.check("q4di")

<!-- BEGIN QUESTION -->

#### Question 4dii:

Explain your answer based on your findings from inspecting the query plans (your explanation should include why you didn't choose certain options).

<!--
BEGIN QUESTION
name: q4dii
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



In the following question, we will further explore the impact of indexes on query performance. To avoid any interference by the `salary_idx` index, please drop the index before moving onto the next question.

In [145]:
%%sql 
DROP INDEX salary_idx
;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

### Question 4e:
#### Question 4ei:
Write a query that finds the `playerid`, `yearid`, and `salary` for each player that had played 10 games and batted in 10 games. Your query should join the `salaries` and `appearances` table on `yearid`, `teamid`, and `playerid`.

In [7]:
%%sql result_4ei <<
SELECT
    a.playerid, a.yearid, s.salary
FROM
    salaries s
JOIN
    appearances a
USING
    (yearid, teamid, playerid)
WHERE
    a.g_all = 10 AND
    a.g_batting = 10

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
120 rows affected.
Returning data to local variable result_4ei


In [8]:
# Do not delete/edit this cell
result_4ei.DataFrame().sort_values(['playerid', 'yearid']).to_csv('results/result_4ei.csv', index=False)

In [9]:
grader.check("q4ei_part1")

Inspect the query plan and record the execution time and cost.

In [11]:
result_4ei_query_str = """
SELECT
    a.playerid, a.yearid, s.salary
FROM
    salaries s
JOIN
    appearances a
USING
    (yearid, teamid, playerid)
WHERE
    a.g_all = 10 AND
    a.g_batting = 10
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $result_4ei_query_str"

                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..3306.60 rows=1 width=21) (actual time=15.739..29.539 rows=120 loops=1)
   ->  Seq Scan on appearances a  (cost=0.00..3133.84 rows=21 width=17) (actual time=0.012..22.381 rows=1289 loops=1)
         Filter: ((g_all = 10) AND (g_batting = 10))
         Rows Removed by Filter: 102967
   ->  Index Scan using salaries_pkey on salaries s  (cost=0.29..8.22 rows=1 width=25) (actual time=0.005..0.005 rows=0 loops=1289)
         Index Cond: ((yearid = a.yearid) AND ((teamid)::text = (a.teamid)::text) AND ((playerid)::text = (a.playerid)::text))
 Planning Time: 2.132 ms
 Execution Time: 29.641 ms
(8 rows)



In [12]:
result_4ei_cost = 3306.60
result_4ei_timing = 29.539

In [13]:
grader.check("q4ei_part2")

#### Question 4eii:
Now, let's see the impact of adding an index on the `g_batting` column. Create an index on the `g_batting` column. Re-inspect the query from `Question 4ei` and record the execution time and cost.

In [14]:
%%sql result_4eii << 
CREATE INDEX g_batting_idx ON appearances (g_batting)

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Returning data to local variable result_4eii


In [15]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_4ei_query_str"

                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=19.01..1802.19 rows=1 width=21) (actual time=2.765..6.359 rows=120 loops=1)
   ->  Bitmap Heap Scan on appearances a  (cost=18.72..1629.43 rows=21 width=17) (actual time=0.366..2.228 rows=1289 loops=1)
         Recheck Cond: (g_batting = 10)
         Filter: (g_all = 10)
         Rows Removed by Filter: 58
         Heap Blocks: exact=899
         ->  Bitmap Index Scan on g_batting_idx  (cost=0.00..18.72 rows=1390 width=0) (actual time=0.250..0.251 rows=1347 loops=1)
               Index Cond: (g_batting = 10)
   ->  Index Scan using salaries_pkey on salaries s  (cost=0.29..8.22 rows=1 width=25) (actual time=0.003..0.003 rows=0 loops=1289)
         Index Cond: ((yearid = a.yearid) AND ((teamid):

In [16]:
result_4eii_with_index_cost = 1802.19
result_4eii_with_index_timing = 6.359

In [17]:
grader.check("q4eii")

In the following question, we will further explore the impact of indexes on query performance. To avoid any interference from the index on the `g_batting` column, drop the index before moving onto the next question.

In [18]:
%%sql 
DROP INDEX g_batting_idx

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

#### Question 4eiii:
Write a query that finds the `playerid`, `yearid`, and `salary` for each player that had played 10 games __or__ batted in 10 games. Your query should join the `salaries` and `appearances` table on `yearid`, `teamid`, and `playerid`.

In [19]:
%%sql result_4eiii <<
SELECT
    a.playerid, a.yearid, s.salary
FROM
    salaries s
JOIN
    appearances a
USING
    (yearid, teamid, playerid)
WHERE
    a.g_all = 10 OR
    a.g_batting = 10

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
216 rows affected.
Returning data to local variable result_4eiii


In [20]:
# Do not delete/edit this cell
result_4eiii.DataFrame().sort_values(['playerid', 'yearid']).to_csv('results/result_4eiii.csv', index=False)

In [21]:
grader.check("q4eiii_part1")

In [22]:
result_4eiii_query_str = """
SELECT
    a.playerid, a.yearid, s.salary
FROM
    salaries s
JOIN
    appearances a
USING
    (yearid, teamid, playerid)
WHERE
    a.g_all = 10 OR
    a.g_batting = 10
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $result_4eiii_query_str"

                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3184.87..3852.27 rows=2 width=21) (actual time=20.113..26.712 rows=216 loops=1)
   Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.playerid)::text = (a.playerid)::text))
   ->  Seq Scan on salaries s  (cost=0.00..459.28 rows=26428 width=25) (actual time=0.009..1.871 rows=26428 loops=1)
   ->  Hash  (cost=3133.84..3133.84 rows=2916 width=17) (actual time=19.933..19.934 rows=1655 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 116kB
         ->  Seq Scan on appearances a  (cost=0.00..3133.84 rows=2916 width=17) (actual time=0.008..19.167 rows=1655 loops=1)
               Filter: ((g_all = 10) OR (g_batting = 10))
               Rows Removed by Filter: 102601
 Plann

In [23]:
result_4eiii_cost = 3852.27
result_4eiii_timing = 26.712

In [24]:
grader.check("q4eiii_part2")

#### Question 4eiv
Now, let's see the impact of adding an index on `g_batting` column will have on the query. Re-create the index and re-inspect the query from `Question 4eiii` and record the execution time and cost.

In [25]:
%%sql result_4eiv << 
CREATE INDEX g_batting_idx ON appearances (g_batting)

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Returning data to local variable result_4eiv


In [26]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_4eiii_query_str"

                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3184.87..3852.27 rows=2 width=21) (actual time=21.372..29.514 rows=216 loops=1)
   Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.playerid)::text = (a.playerid)::text))
   ->  Seq Scan on salaries s  (cost=0.00..459.28 rows=26428 width=25) (actual time=0.009..2.272 rows=26428 loops=1)
   ->  Hash  (cost=3133.84..3133.84 rows=2916 width=17) (actual time=21.111..21.112 rows=1655 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 116kB
         ->  Seq Scan on appearances a  (cost=0.00..3133.84 rows=2916 width=17) (actual time=0.012..20.366 rows=1655 loops=1)
               Filter: ((g_all = 10) OR (g_batting = 10))
               Rows Removed by Filter: 102601
 Plann

In [28]:
result_4eiv_with_index_cost = 3852.27
result_4eiv_with_index_timing = 29.514

In [29]:
grader.check("q4eiv")

In the following question, we will further explore the impact of indexes on query performance. To avoid any interference from the index on the `g_batting` column, drop the index before moving onto the next question.

In [30]:
%%sql 
DROP INDEX g_batting_idx

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

#### Question 4ev:
Now, create a multiple column index on `g_batting` and `g_all` called `g_batting_g_all_idx` and record the query execution time and cost.

In [31]:
%%sql result_4v <<
CREATE INDEX g_batting_g_all_idx ON appearances (g_batting, g_all)

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Returning data to local variable result_4v


In [32]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_4eiii_query_str"

                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2871.52..3538.93 rows=2 width=21) (actual time=3.933..13.793 rows=216 loops=1)
   Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.playerid)::text = (a.playerid)::text))
   ->  Seq Scan on salaries s  (cost=0.00..459.28 rows=26428 width=25) (actual time=0.005..2.593 rows=26428 loops=1)
   ->  Hash  (cost=2820.49..2820.49 rows=2916 width=17) (actual time=3.749..3.751 rows=1655 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 116kB
         ->  Bitmap Heap Scan on appearances a  (cost=1182.39..2820.49 rows=2916 width=17) (actual time=1.294..3.316 rows=1655 loops=1)
               Recheck Cond: ((g_all = 10) OR (g_battin

In [33]:
result_4ev_multiple_col_index_cost = 3538.93
result_4ev_multiple_col_index_timing = 13.793

In [34]:
grader.check("q4ev")

#### Question 4evi:
Given your findings from inspecting the query plans from `Question 4e`, answer the following question. Assign the variable `q4e_part6` to a list of all statements that are true.

1. Consider the following statements:
    1. Adding an index on a column used in an AND predicate will reduce the query time, but not the execution cost.
    1. Adding an index on a column used in an AND predicate will reduce the query cost, but not the execution time.
    1. Adding an index on a column used in an AND predicate will reduce the query cost and the execution time.
    1. Adding an index on a column used in an OR predicate will reduce the query time, but not the execution cost.
    1. Adding an index on a column used in an OR predicate will reduce the query cost, but not the execution time.
    1. Adding an index on a column used in an OR predicate will reduce the query cost and the execution time.
    1. Adding a mutlicolumn index on columns in an OR predicate will reduce the query time, but not the execution cost.
    1. Adding a mutlicolumn index on columns in an OR predicate will reduce the query cost, but not the execution time.
    1. Adding a mutlicolumn index on columns in an OR predicate will reduce the query cost and the execution time.

**Note:** Your answer should be formatted as follows: `q4e_part6 = ['A', 'B']`

<!--
BEGIN QUESTION
name: q4evi
points: 2
-->

In [35]:
q4e_part6 = ['C', 'D', 'G']

In [36]:
grader.check("q4evi")

<!-- BEGIN QUESTION -->

#### Question 4evii

Explain your answer to the previous part based on your inspection of the query plans (your explanation should include why you didn't choose certain options).

<!--
BEGIN QUESTION
name: q4evii
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 5:
#### Question 5a:
Write 2 queries: one that finds the minimum salary from the salary table and one that finds the average. Inspect the queries' query plans and record their execution times and costs.

In [7]:
%%sql result_5a_min << 
SELECT MIN(salary) FROM salaries

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
1 rows affected.
Returning data to local variable result_5a_min


In [8]:
result_5a_min_query_str = "SELECT MIN(salary) FROM salaries"
!psql $POSTGRES_URI/baseball -c "explain analyze $result_5a_min_query_str"

                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=525.35..525.36 rows=1 width=8) (actual time=17.996..17.997 rows=1 loops=1)
   ->  Seq Scan on salaries  (cost=0.00..459.28 rows=26428 width=8) (actual time=0.012..7.219 rows=26428 loops=1)
 Planning Time: 0.883 ms
 Execution Time: 18.215 ms
(4 rows)



In [9]:
result_5a_min_query_cost = 525.36
result_5a_min_query_timing = 17.997

In [10]:
grader.check("q5ai")

In [11]:
%%sql result_5a_avg <<
SELECT AVG(salary) FROM salaries

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
1 rows affected.
Returning data to local variable result_5a_avg


In [13]:
result_5a_avg_query_str = "SELECT AVG(salary) FROM salaries"
!psql $POSTGRES_URI/baseball -c "explain analyze $result_5a_avg_query_str"

                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=525.35..525.36 rows=1 width=8) (actual time=5.701..5.701 rows=1 loops=1)
   ->  Seq Scan on salaries  (cost=0.00..459.28 rows=26428 width=8) (actual time=0.009..2.169 rows=26428 loops=1)
 Planning Time: 0.287 ms
 Execution Time: 5.777 ms
(4 rows)



In [14]:
result_5a_avg_query_cost = 525.36
result_5a_avg_query_timing = 5.701

In [15]:
grader.check("q5aii")

#### Question 5b:
Create an index on the `salary` column in the `salaries` table and re-inspect the query plans from the previous part and record the respective execution time and cost.

In [16]:
%%sql 
CREATE INDEX salary_idx ON salaries(salary)

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

In [17]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_5a_min_query_str"

                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.32..0.33 rows=1 width=8) (actual time=0.112..0.113 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.29..0.32 rows=1 width=8) (actual time=0.108..0.109 rows=1 loops=1)
           ->  Index Only Scan using salary_idx on salaries  (cost=0.29..762.78 rows=26428 width=8) (actual time=0.107..0.107 rows=1 loops=1)
                 Index Cond: (salary IS NOT NULL)
                 Heap Fetches: 0
 Planning Time: 0.856 ms
 Execution Time: 0.177 ms
(8 rows)



In [18]:
result_5b_min_query_cost = 0.33
result_5b_min_query_timing = 0.113

In [19]:
grader.check("q5bi")

In [20]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_5a_avg_query_str"

                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=525.35..525.36 rows=1 width=8) (actual time=7.310..7.311 rows=1 loops=1)
   ->  Seq Scan on salaries  (cost=0.00..459.28 rows=26428 width=8) (actual time=0.005..2.781 rows=26428 loops=1)
 Planning Time: 0.324 ms
 Execution Time: 7.389 ms
(4 rows)



In [21]:
result_5b_avg_query_cost = 525.36
result_5b_avg_query_timing = 7.311

In [22]:
grader.check("q5bii")

In the following questions, we will further explore the impact of indexes on query performance. To avoid any interference from the index on the `salary` column, drop the index before moving onto the next question.

In [23]:
%%sql 
DROP INDEX salary_idx

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

#### Question 5c:
Given your findings from `Question 5a-b`, answer the following question. Assign the variable `q5c` to the correct answer choice.

1. Which of the following statements is true?
    1. An index on the column being aggregated in a query will always provide a performance enhancement.
    1. A query finding the MIN(salary) will always benefit from an index on salary, but a query finding MAX(salary) will not.
    1. A query finding the COUNT(salary) will always benefit from an index on salary, but a query finding AVG(salary) will not.
    1. Queries finding the MIN(salary) or MAX(salary) will always benefit from an index on salary, but queries finding AVG(salary) or COUNT(salary) will not.

*Note:* Your answer should be formatted as follows: `q5c = ['A', 'B']`

<!--
BEGIN QUESTION
name: q5c
points: 1
-->

In [24]:
q5c = ['D']

In [25]:
grader.check("q5c")

<!-- BEGIN QUESTION -->

#### Question 5d:

Explain your answer to the previous part based on your inspection of the query plans.

<!--
BEGIN QUESTION
name: q5d
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 6
In this question, we will inspect the impact of that clustering our data on an index can have on a query's performance.

#### Question 6a
Write a query that finds the `playerid`, `yearid`, `teamid`, and `ab` for all players whose `ab` was above 500. Inspect the query plan and record the execution time and cost.

In [26]:
%%sql result_6a <<
SELECT
    playerid, yearid, teamid, ab
FROM batting
WHERE
    ab > 500

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
8839 rows affected.
Returning data to local variable result_6a


In [27]:
# Do not delete/edit this cell
result_6a.DataFrame().sort_values(['playerid', 'yearid', 'teamid']).iloc[:1000].to_csv('results/result_6a.csv', index=False)

In [28]:
grader.check("q6ai")

In [30]:
result_6a_query_str = """
SELECT
    playerid, yearid, teamid, ab
FROM batting
WHERE
    ab > 500
"""
!psql $POSTGRES_URI/baseball -c "explain analyze $result_6a_query_str"

                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Seq Scan on batting  (cost=0.00..2884.05 rows=8972 width=21) (actual time=0.267..15.297 rows=8839 loops=1)
   Filter: (ab > 500)
   Rows Removed by Filter: 95485
 Planning Time: 0.270 ms
 Execution Time: 15.683 ms
(5 rows)



In [31]:
result_6a_cost = 2884.05
result_6a_timing = 15.297

In [32]:
grader.check("q6aii")

#### Question 6b:
Cluster the `batting` table on its primary key (hint: use `\di` to find out what name of the primary key is). We are able to directly cluster on the primary key (without first creating a separate index) because Postgres automatically creates an index for it. Re-inspect the query plan for the query from `Question 6a` and record the execution time and cost.

In [33]:
%sql \di

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
27 rows affected.


Schema,Name,Type,Owner
public,allstarfull_pkey,index,postgres
public,appearances_pkey,index,postgres
public,awardsmanagers_pkey,index,postgres
public,awardsplayers_pkey,index,postgres
public,awardssharemanagers_pkey,index,postgres
public,awardsshareplayers_pkey,index,postgres
public,batting_pkey,index,postgres
public,battingpost_pkey,index,postgres
public,collegeplaying_pkey,index,postgres
public,fielding_pkey,index,postgres


In [34]:
%%sql 
CLUSTER VERBOSE batting USING batting_pkey

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

In [35]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_6a_query_str"

                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Seq Scan on batting  (cost=0.00..2878.05 rows=8972 width=21) (actual time=0.022..22.213 rows=8839 loops=1)
   Filter: (ab > 500)
   Rows Removed by Filter: 95485
 Planning Time: 0.515 ms
 Execution Time: 22.684 ms
(5 rows)



In [36]:
result_6b_cost = 2878.05
result_6b_timing = 22.213

In [37]:
grader.check("q6b")

#### Question 6c
Now, let's try clustering the table based on another index. Create an index on `ab` column called `ab_idx` in the `batting` table and cluster the `batting` table with this new index. Now, re-inspect the query plan and record the execution time and cost.

In [38]:
%%sql 
CREATE INDEX ab_idx ON batting(ab)

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

In [39]:
!psql $POSTGRES_URI/baseball -c "explain analyze $result_6a_query_str"

                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on batting  (cost=101.83..1787.98 rows=8972 width=21) (actual time=0.775..4.720 rows=8839 loops=1)
   Recheck Cond: (ab > 500)
   Heap Blocks: exact=1311
   ->  Bitmap Index Scan on ab_idx  (cost=0.00..99.58 rows=8972 width=0) (actual time=0.601..0.602 rows=8839 loops=1)
         Index Cond: (ab > 500)
 Planning Time: 0.513 ms
 Execution Time: 5.207 ms
(7 rows)



In [40]:
result_6c_cost = 1787.98
result_6c_timing = 4.720

In [41]:
grader.check("q6c")

#### Question 6d
Given your findings from inspecting the query plans from `Question 6a-c`, answer the following question. Assign the variable `q6d` to a list of all statements that are true.

1. Consider the following statements:
    1. Clustering based on the `ab_idx` decreased the cost of the query.
    1. Clustering based on the `ab_idx` increased the cost of the query.
    1. Clustering based on the `ab_idx` increased the execution time of the query.
    1. Clustering based on the `ab_idx` decreased the execution time of the query.
    1. Clustering based on the `batting_pkey` decreased the cost of the query.
    1. Clustering based on the `batting_pkey` increased the cost of the query.
    1. Clustering based on the `batting_pkey` increased the execution time of the query.
    1. Clustering based on the `batting_pkey` decreased the execution time of the query.
    1. None of the above
    
**Note:** Your answer should be formatted as follows: `q6d = ['A', 'B']`

<!--
BEGIN QUESTION
name: q6d
points: 2
-->

In [42]:
q6d = ['A', 'D', 'F', 'H']

In [43]:
grader.check("q6d")

<!-- BEGIN QUESTION -->

#### Question 6e:

Explain your answer to the previous part based on your inspection of the query plans (your explanation should include why you didn't choose certain options).

<!--
BEGIN QUESTION
name: q6e
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 7
Until now, we have seen the positive potential impact that indexes can have on query performance, but remember in real world technologies/applications, we will be routinely receiving new data (and in large quantities) which would trigger regular updates to our tables. In this section, we will dive into the cost of managing the indexes that we create.

#### Question 7a:
Record the time it takes to insert 300,000 rows into the `salaries` table when no additional index is configured.

Run the following cell to setup a column to track which rows we added as part of these inserts.

In [44]:
%sql ALTER TABLE salaries ADD added boolean DEFAULT False;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

Next, run the provided update script and record the wall time.

In [45]:
%%time
%%sql
DO $$
 DECLARE counter INTEGER := 1;
 BEGIN
     FOR counter IN 100000..400000 LOOP
     INSERT INTO salaries (yearid, teamid, lgid, playerid, salary, added)
         VALUES (2021, 'ATL', 'NL', 'p' || counter, RANDOM() * 1000000, true);
     END LOOP;
END;
$$;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
CPU times: user 7.26 ms, sys: 0 ns, total: 7.26 ms
Wall time: 6.02 s


[]

In [46]:
result_7a_timing = 6.02

In [47]:
grader.check("q7a")

Before adding an index to the salaries table and re-timing the updates, delete all the rows that were added to the table from the update script.

In [48]:
%%sql
DELETE FROM salaries
WHERE added = 'true';

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
300001 rows affected.


[]

#### Question 7b:
Now, create an index on the `salary` column and record the wall time after executing the update script.

In [49]:
%%sql 
CREATE INDEX salary_idx ON salaries(salary)

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.


[]

In [50]:
%%time
%%sql
DO $$
 DECLARE counter INTEGER := 1;
 BEGIN
     FOR counter IN 100000..400000 LOOP
     INSERT INTO salaries (yearid, teamid, lgid, playerid, salary, added)
         VALUES (2021, 'ATL', 'NL', 'p' || counter, RANDOM() * 1000000, true);
     END LOOP;
END;
$$;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
CPU times: user 6.62 ms, sys: 566 µs, total: 7.19 ms
Wall time: 11.9 s


[]

In [51]:
result_7b_timing = 11.9

In [52]:
grader.check("q7b")

## Question 8
In this question, we will explore the benefits of bulk loading data and indexes. 

#### Question 8ai:
Create a new table called `fabricated_salaries` with the following schema:

Column Name | Data Type
--- | --- 
yearid |  INTEGER
teamid | CHARACTER VARYING(3)
lgid | CHARACTER VARYING(2)
playerid | CHARACTER VARYING(10)
salary | DOUBLE PRECISION

**Note:** There are no tests associated with this subpart.

In [54]:
%%sql result_8a <<
DROP TABLE IF EXISTS fabricated_salaries;
CREATE TABLE fabricated_salaries (
    yearid INT,
    teamid VARCHAR(3),
    lgid VARCHAR(2),
    playerid VARCHAR(10),
    salary DOUBLE PRECISION
);

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Done.
Returning data to local variable result_8a


#### Question 8aii:
Record the time it takes to copy the contents of the `fabricated_salaries.txt` into the new table. Hint: Use the psql shell comamnd `\copy` to perform the copying (see the documentation [here](https://www.postgresql.org/docs/9.2/app-psql.html#APP-PSQL-META-COMMANDS-COPY)). Your answer should be formatted as follows: `!psql -h localhost -d baseball -c "\copy _REPLACE_ME_"`.

In [55]:
# Unzip the file before copying
!unzip -u data/salaries_trunc.zip -d data/

Archive:  data/salaries_trunc.zip
  inflating: data/salaries_trunc.txt  


In [56]:
%time
!psql $POSTGRES_URI/baseball -c "\copy fabricated_salaries FROM 'data/salaries_trunc.txt' WITH DELIMITER ','"

CPU times: user 6 µs, sys: 1e+03 ns, total: 7 µs
Wall time: 17.2 µs
COPY 10000


In [57]:
time_to_copy_data = 0.0172

In [58]:
grader.check("q8aii")

#### Question 8aiii:
Record the time to create an index on the `salary` column with name `fabricated_salary_idx` for the new table you created.

In [76]:
%%time
%%sql 
CREATE INDEX fabricated_salary_idx ON fabricated_salaries(salary)

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
CPU times: user 2.12 ms, sys: 3.67 ms, total: 5.79 ms
Wall time: 111 ms


[]

In [77]:
time_to_create_index = 111

In [78]:
grader.check("q8aiii")

### Question 8b:
Now, create a second table called `fabricated_salaries2` with the same schema definition from `Question 8ai`. Then, record the time to run 10,000 `INSERT` statements.

In [62]:
%%sql result_8b <<
DROP TABLE IF EXISTS fabricated_salaries2;
CREATE TABLE fabricated_salaries2 (
    yearid INT,
    teamid VARCHAR(3),
    lgid VARCHAR(2),
    playerid VARCHAR(10),
    salary DOUBLE PRECISION
);

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Done.
Returning data to local variable result_8b


Run the following cell and record the time it takes to perform 10,000 random inserts into the table you just created.

In [63]:
%%time
%%sql
DO $$
 DECLARE counter INTEGER := 1;
 BEGIN
     FOR counter IN 100001..110000 LOOP
     INSERT INTO fabricated_salaries2 (yearid, teamid, lgid, playerid, salary)
         VALUES (2021, 'ATL', 'NL', 'p' || counter, RANDOM() * 1000000);
     END LOOP;
END;
$$;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
CPU times: user 4.41 ms, sys: 71 µs, total: 4.48 ms
Wall time: 133 ms


[]

In [64]:
time_8b = 133

In [65]:
grader.check("q8b")

### Question 8c
Now, create a third table `fabricated_salaries3` (with the same schema specified in `Question 8ai`) and create an index on the `salary` column with name `fabricated_salary3_idx`.

In [66]:
%%sql result_8c <<
DROP TABLE IF EXISTS fabricated_salaries3;
CREATE TABLE fabricated_salaries3 (
    yearid INT,
    teamid VARCHAR(3),
    lgid VARCHAR(2),
    playerid VARCHAR(10),
    salary DOUBLE PRECISION
);

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
Done.
Done.
Returning data to local variable result_8c


Create an index on the `salary` column of the new table you created.

In [None]:
%%sql 
CREATE INDEX fabricated_salary3_idx ON fabricated_salaries3(salary)

Run the following cell and record the time it takes to perform 10,000 random inserts into the table you just created.

In [67]:
%%time
%%sql
DO $$
 DECLARE counter INTEGER := 1;
 BEGIN
     FOR counter IN 100001..110000 LOOP
     INSERT INTO fabricated_salaries3 (yearid, teamid, lgid, playerid, salary)
         VALUES (2021, 'ATL', 'NL', 'p' || counter, RANDOM() * 1000000);
     END LOOP;
 END;
$$;

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
CPU times: user 4.36 ms, sys: 59 µs, total: 4.42 ms
Wall time: 199 ms


[]

In [68]:
time_8c = 199

In [69]:
grader.check("q8c")

### Question 8d
Now, write an update query that adds $5,000 to every salary entry in the table you created in `Question 8ai`. Record the time it takes to execute the query.

In [79]:
%%time
%%sql
UPDATE fabricated_salaries
SET salary = salary + 5000

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
10000 rows affected.
CPU times: user 4.97 ms, sys: 3.41 ms, total: 8.38 ms
Wall time: 223 ms


[]

In [80]:
time_8d = 223

In [74]:
grader.check("q8d")

## Question 8e
Let's compare the time from the previous part with the time it takes to drop the index, run the same update query from `Question 8c`, and then recreate the index.

In [83]:
%%time
%%sql
BEGIN;
DROP INDEX fabricated_salary_idx;

UPDATE fabricated_salaries
SET salary = salary + 5000
;

CREATE INDEX fabricated_salary_idx ON fabricated_salaries(salary);
COMMIT

 * postgresql://postgres:***@127.0.0.1:5432/baseball
   postgresql://postgres:***@127.0.0.1:5432/postgres
Done.
(psycopg2.errors.UndefinedObject) index "fabricated_salary_idx" does not exist

[SQL: DROP INDEX fabricated_salary_idx;]
(Background on this error at: https://sqlalche.me/e/14/f405)
CPU times: user 5.54 ms, sys: 0 ns, total: 5.54 ms
Wall time: 5.52 ms


In [84]:
time_8e = 5.52

In [85]:
grader.check("q8e")

## Congratulations! You have finished Project 2.

Run the following cell to zip the results of your queries. You will also need to run the export cell at the end of the notebook. **For submission on Gradescope, you will need to submit both the proj2.zip file genreated by the export cell and the results.zip file generated by the following cell.**

In [215]:
!zip -r results.zip results

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()