This Notebook covers the basics of SQL. It covers the same material as the Instabase Notebook, but connects to your local instance of PostgreSQL instead, so you can edit and run queries against the database. You should also make sure that you are comfortable using SQL through `psql` (see the [Setup Instructions](https://github.com/umddb/cmsc424-fall2016/tree/master/project0) to get started with that).

The server should already be running, with the `university` database created and populated. The following commands load the requisite modules.

**NOTE: Although there is a warning, it doesn't seem to affect things.**

In [None]:
%load_ext sql
%sql postgresql://ubuntu:ubuntu@localhost/university

We can now run SQL commands using `magic` commands, an extensibility mechanism provided by Jupyter. `%sql` runs a single-line command, whereas `%%sql` runs a multi-line SQL command.

In [None]:
%sql select * from instructor;

### Inspecting the Database
One drawback of this way of accessing the database is that we can only run valid SQL -- meta-commands such as `\d` provided by `psql` are not available to us. Instead, we need to query the system catalog (metadata) directly. The first command below is equivalent to `\d`, whereas the second one is similar to `\d instructor`.

In [None]:
%%sql
SELECT table_schema, table_name FROM information_schema.tables
    WHERE table_type = 'BASE TABLE' AND
    table_schema NOT IN ('pg_catalog', 'information_schema', 'priv');

In [None]:
%%sql
SELECT column_name, data_type
    FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'instructor';

### University Database
Below we will use the University database from the class textbook. It is the same dataset discussed in the book, and contains randomly populated information about the students, courses, and instructors in a university.

You should follow the rest of the Notebook along with the appropriate sections in the book.

The schema diagram for the database is as follows:
<center><img src="https://github.com/umddb/cmsc424-fall2015/raw/master/postgresql-setup/university.png" width=800px></center>

### SQL Data Definition Language (Section 3.2)

You can take a look at the `DDL.sql` file to see how the tables we are using are created. We won't try to run those commands here: the tables already exist, so the commands would only give errors.
Here is how the department table is created. The primary key is specified using the `primary key` clause.
```
create table department
        (dept_name              varchar(20),
         building               varchar(15),
         budget                 numeric(12,2) check (budget > 0),
         primary key (dept_name)
        );
```

The instructor table is created similarly. Its `dept_name` attribute references the primary key of `department`, and is hence called a `foreign key`.
```
create table instructor
        (ID                     varchar(5),
         name                   varchar(20) not null,
         dept_name              varchar(20),
         salary                 numeric(8,2) check (salary > 29000),
         primary key (ID),
         foreign key (dept_name) references department
                on delete set null
        );
```
The command for inserting a new instructor is also straightforward.
```
insert into instructor values ('10101', 'Srinivasan', 'Comp. Sci.', '65000');
```
If the 'Comp. Sci.' department is not present in the `department` table already, we have a `referential integrity violation`, and the insert command would be rejected.
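
To see the rejection concretely, here is a minimal sketch that uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for PostgreSQL. That substitution is an assumption made so the example is self-contained; error messages and type handling differ between the two systems. The `Taylor` building and the budget value are sample data chosen for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL `university` database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute("""
    create table department
        (dept_name varchar(20),
         building  varchar(15),
         budget    numeric(12,2) check (budget > 0),
         primary key (dept_name))
""")
conn.execute("""
    create table instructor
        (ID        varchar(5),
         name      varchar(20) not null,
         dept_name varchar(20),
         salary    numeric(8,2) check (salary > 29000),
         primary key (ID),
         foreign key (dept_name) references department
             on delete set null)
""")

insert_instructor = (
    "insert into instructor values ('10101', 'Srinivasan', 'Comp. Sci.', 65000)"
)

# The department table is still empty, so the foreign key check should fail.
try:
    conn.execute(insert_instructor)
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("insert rejected:", err)

# Once the referenced department exists, the same insert goes through.
conn.execute("insert into department values ('Comp. Sci.', 'Taylor', 100000)")
conn.execute(insert_instructor)
count = conn.execute("select count(*) from instructor").fetchone()[0]
print("instructors stored:", count)
```

The order of inserts matters: the referenced `department` row has to exist before any `instructor` row can point to it.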

### Select Queries on a Single Relation (Section 3.3.1)
Let's start with the most basic queries. The following query reports the courses with titles containing Biology.

In [None]:
%sql select * from course where title like '%Biology%';

There are two courses. How many students have ever enrolled in the first one? What about in Summer 2009?

In [None]:
%sql select * from takes where course_id = 'BIO-101';

In [None]:
%sql select * from takes where course_id = 'BIO-101'  and year = 2009 and semester = 'Summer';

### Aggregates

Count the number of instructors in Finance.

In [None]:
%sql select count(*) from instructor where dept_name = 'Finance';

Find the instructor(s) with the highest salary. Note that using a nested "subquery" (which first finds the maximum value of the salary) as below is the most compact way to write this query.

In [None]:
%%sql 
select *
from instructor
where salary = (select max(salary) from instructor);

### Joins and Cartesian Product (Section 3.3.2)
To find building names for all instructors, we must do a join between two relations.

In [None]:
%%sql
select name, instructor.dept_name, building
from instructor, department
where instructor.dept_name = department.dept_name;

Since the join here is an equality join on the common attributes in the two relations, we can also just do:

In [None]:
%%sql 
select name, instructor.dept_name, building
from instructor natural join department;

On the other hand, just doing the following (i.e., just the Cartesian Product) will lead to a large number of tuples, most of which are not meaningful.

In [None]:
%%sql
select name, instructor.dept_name, building
from instructor, department;
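
The difference in result size is easy to check on a toy example. The sketch below assumes an in-memory SQLite database (via Python's `sqlite3` module) in place of PostgreSQL, and the rows are made-up sample data rather than the actual University dataset.

```python
import sqlite3

# Assumption: SQLite stand-in with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table department (dept_name text primary key, building text);
    create table instructor (ID text primary key, name text, dept_name text);
    insert into department values
        ('Biology', 'Watson'), ('Finance', 'Painter'), ('Physics', 'Watson');
    insert into instructor values
        ('1', 'A', 'Biology'), ('2', 'B', 'Finance'),
        ('3', 'C', 'Finance'), ('4', 'D', 'Physics');
""")

# Join: each instructor is paired only with their own department.
join_rows = conn.execute("""
    select name, instructor.dept_name, building
    from instructor, department
    where instructor.dept_name = department.dept_name
""").fetchall()

# Cartesian product: every instructor is paired with every department.
product_rows = conn.execute("""
    select name, instructor.dept_name, building
    from instructor, department
""").fetchall()

print(len(join_rows), "rows from the join")
print(len(product_rows), "rows from the Cartesian product")
```

With 4 instructors and 3 departments, the join yields one row per instructor, while the product yields 4 × 3 rows, most of which pair an instructor with a building that is not theirs.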

### Renaming using "as"
**as** can be used to rename tables and simplify queries:

In [None]:
%%sql
select distinct T.name
from instructor as T, instructor as S  
where T.salary > S.salary and S.dept_name = 'Biology';

**Self-joins** (where two of the relations in the from clause are the same) are impossible without renaming. The `as` keyword itself is optional, so `prereq p1` means the same as `prereq as p1`. The following query associates a course with the pre-requisite of one of its pre-requisites. There is no way to disambiguate the columns without some form of renaming.

In [None]:
%%sql
select p1.course_id, p2.prereq_id as pre_prereq_id
from prereq p1, prereq p2
where p1.prereq_id = p2.course_id;

The small University database doesn't have any chains of this kind. You can try adding a new tuple with an `insert` command, after which the query will return an answer.

In [None]:
%sql insert into prereq values ('CS-101', 'PHY-101');

In [None]:
%%sql
select p1.course_id, p2.prereq_id as pre_prereq_id
from prereq p1, prereq p2
where p1.prereq_id = p2.course_id;
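
The before-and-after behaviour of this self-join can be reproduced on a two-row table. This is a sketch that assumes an in-memory SQLite database (Python's `sqlite3` module) instead of PostgreSQL, with only a small, illustrative subset of `prereq` rows.

```python
import sqlite3

# Assumption: SQLite stand-in holding a small illustrative subset of prereq.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table prereq
        (course_id text, prereq_id text,
         primary key (course_id, prereq_id))
""")
conn.executemany(
    "insert into prereq values (?, ?)",
    [("CS-190", "CS-101"), ("EE-181", "PHY-101")],
)

# p1 and p2 are two aliases for the same table.
chain_query = """
    select p1.course_id, p2.prereq_id as pre_prereq_id
    from prereq p1, prereq p2
    where p1.prereq_id = p2.course_id
"""

before = conn.execute(chain_query).fetchall()  # no chains yet

# Give CS-101 a prerequisite of its own, creating CS-190 -> CS-101 -> PHY-101.
conn.execute("insert into prereq values ('CS-101', 'PHY-101')")
after = conn.execute(chain_query).fetchall()

print("before:", before)
print("after: ", after)
```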

### Set Operations
The *union* operation can be used to combine the results of two queries (from Section 3.5.1).

In [None]:
%%sql
select course_id
from section
where semester = 'Fall' and year= 2009
union 
select course_id
from section
where semester = 'Spring' and year= 2010;
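
One detail worth checking is that `union` removes duplicate rows, whereas `union all` keeps them. The sketch below assumes an in-memory SQLite database (Python's `sqlite3` module) as a stand-in for PostgreSQL, and the `section` rows are made-up sample data.

```python
import sqlite3

# Assumption: SQLite stand-in with hypothetical section rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table section (course_id text, semester text, year integer)")
conn.executemany(
    "insert into section values (?, ?, ?)",
    [
        ("CS-101", "Fall", 2009),
        ("CS-101", "Spring", 2010),   # CS-101 is offered in both semesters
        ("PHY-101", "Fall", 2009),
        ("FIN-201", "Spring", 2010),
    ],
)

# The same two sub-queries, combined with a configurable set operator.
template = """
    select course_id from section where semester = 'Fall' and year = 2009
    {op}
    select course_id from section where semester = 'Spring' and year = 2010
"""

union_rows = conn.execute(template.format(op="union")).fetchall()
union_all_rows = conn.execute(template.format(op="union all")).fetchall()

print(sorted(union_rows))      # CS-101 appears once
print(sorted(union_all_rows))  # CS-101 appears twice
```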

### Aggregation with Grouping (Section 3.7.2)

In [None]:
%%sql
select dept_name, avg(salary) as avg_salary
from instructor
group by dept_name;

You can use `having` to filter out groups. The following query only returns the average salary for departments with more than 2 instructors.

In [None]:
%%sql
select dept_name, avg(salary) as avg_salary
from instructor
group by dept_name
having count(*) > 2;
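
To see which groups `having` drops, the two queries can be compared side by side on a tiny table. This sketch assumes an in-memory SQLite database (Python's `sqlite3` module) rather than PostgreSQL, with made-up instructors and salaries; the `order by` is added only to make the output order predictable.

```python
import sqlite3

# Assumption: SQLite stand-in with hypothetical instructor rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table instructor (ID text, name text, dept_name text, salary real)"
)
conn.executemany(
    "insert into instructor values (?, ?, ?, ?)",
    [
        ("1", "A", "Comp. Sci.", 60000),
        ("2", "B", "Comp. Sci.", 70000),
        ("3", "C", "Comp. Sci.", 80000),
        ("4", "D", "Finance", 90000),
        ("5", "E", "Finance", 80000),
    ],
)

# One output row per department.
all_groups = conn.execute("""
    select dept_name, avg(salary) as avg_salary
    from instructor
    group by dept_name
    order by dept_name
""").fetchall()

# Only departments with more than 2 instructors survive the having clause.
big_groups = conn.execute("""
    select dept_name, avg(salary) as avg_salary
    from instructor
    group by dept_name
    having count(*) > 2
    order by dept_name
""").fetchall()

print(all_groups)
print(big_groups)
```

Note that `where` filters individual rows before grouping, whereas `having` filters whole groups after the aggregates are computed.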

### WITH
In many cases you might find it easier to create temporary tables, especially for queries involving finding "max" or "min". This also allows you to break down the full query and makes it easier to debug. It is preferable to use the WITH construct for this purpose. The syntax and support differ across systems, but here is the PostgreSQL documentation: http://www.postgresql.org/docs/9.0/static/queries-with.html

These are also called Common Table Expressions (CTEs).

The following query is from Section 3.8.6.

In [None]:
%%sql
with max_budget(value) as (
select max(budget)
from department
)
select budget
from department, max_budget
where department.budget = max_budget.value;
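
A `with` query of this form gives the same answer as the nested-subquery style used earlier for the highest salary. The sketch below checks that on sample data, assuming an in-memory SQLite database (Python's `sqlite3` module) in place of PostgreSQL. The department rows are made up, and `dept_name` is added to the select list purely to make the output easier to read.

```python
import sqlite3

# Assumption: SQLite stand-in with hypothetical department rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table department (dept_name text primary key, building text, budget real)"
)
conn.executemany(
    "insert into department values (?, ?, ?)",
    [
        ("Biology", "Watson", 90000),
        ("Finance", "Painter", 120000),
        ("Physics", "Watson", 70000),
    ],
)

# CTE version: name the intermediate result, then join against it.
with_rows = conn.execute("""
    with max_budget(value) as (
        select max(budget)
        from department
    )
    select dept_name, budget
    from department, max_budget
    where department.budget = max_budget.value
""").fetchall()

# Nested-subquery version of the same question.
subquery_rows = conn.execute("""
    select dept_name, budget
    from department
    where budget = (select max(budget) from department)
""").fetchall()

print(with_rows)
print(subquery_rows)
```

The CTE version is longer here, but the named intermediate result can be queried on its own while debugging.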

### LIMIT
PostgreSQL allows you to limit the number of results displayed, which is useful for debugging. Here is an example.

In [None]:
%sql select * from instructor limit 2;

### Try your own queries
Feel free to use the cells below to write new queries. You can also just modify the above queries directly if you'd like.