# Your First Database

In a relational database:

- real-life *entities* become tables
- redundancy is reduced
- data integrity is enforced through *relationships*

In the course, we will:
- work with real data
- create a database from scratch
- learn three concepts:
    - constraints
    - keys
    - referential integrity

## Creating tables

In [None]:
-- Create a table for the universities entity type
CREATE TABLE universities (
    university_shortname text,
    university text,
    university_city text
);

-- Print the contents of this table
SELECT *
FROM universities;

## Meta-database

A *meta-database* holds information about our current database. 

This would look like:
- Database: `information_schema`
    - Tables:
        - `tables`: information about all tables in the database
        - `columns`: information about all columns in all of the tables in the current database
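
The idea of querying a database about itself can be tried without a PostgreSQL server. The sketch below is only a stand-in: it uses Python's built-in `sqlite3` module, and SQLite has no `information_schema`. Its `sqlite_master` table and `PRAGMA table_info` play a roughly similar role; in PostgreSQL you would query `information_schema.tables` and `information_schema.columns` instead.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE universities ("
    " university_shortname text,"
    " university text,"
    " university_city text)"
)

# sqlite_master lists the objects in the database
# (similar in spirit to information_schema.tables).
table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info returns one row per column; index 1 is the column name
# (similar in spirit to information_schema.columns).
column_names = [row[1] for row in conn.execute("PRAGMA table_info(universities)")]

print(table_names)
print(column_names)
```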

## `CREATE` tables

The syntax for creating tables is

In [None]:
-- notice the final semicolon and no trailing comma after the last column
CREATE TABLE table_name (
    column_a data_type,
    column_b data_type,
    column_c data_type
);    

## `ALTER`ing tables and `ADD`ing columns

We can alter tables easily and add new columns as needed.

The syntax for altering is

In [None]:
ALTER TABLE table_name
ADD COLUMN column_name data_type;

## `RENAME` columns

We can also just rename a column

In [None]:
ALTER TABLE table_name
RENAME COLUMN old_name TO new_name;

## `DROP` a column

In [None]:
ALTER TABLE table_name
DROP COLUMN column_name;

## Migrating data with `INSERT`

We can also migrate data from one table to another

In [None]:
INSERT INTO table_name
SELECT DISTINCT column_name1, column_name2
FROM table_a;

Read this statement from the bottom up: the `SELECT DISTINCT` runs first, and its result rows are then `INSERT`ed into `table_name`.
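
A small runnable sketch of this migration pattern, using Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The table names, column names and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table in which university data is repeated per professor.
conn.execute("CREATE TABLE table_a (professor text, university text, city text)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [
        ("Ada", "ETH", "Zurich"),
        ("Ben", "ETH", "Zurich"),
        ("Cleo", "EPF", "Lausanne"),
    ],
)
conn.execute("CREATE TABLE universities (university text, city text)")

# The SELECT DISTINCT is evaluated first; its rows are then inserted.
conn.execute(
    "INSERT INTO universities "
    "SELECT DISTINCT university, city FROM table_a"
)

migrated = conn.execute(
    "SELECT university, city FROM universities ORDER BY university"
).fetchall()
print(migrated)
```

The three source rows collapse into two rows in the target table, because `DISTINCT` removes the repeated university.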

## `DROP` a table

In [None]:
DROP TABLE table_name;

# Enforcing data integrity

The idea of a database is to push data into a pre-defined structure, in which we define data types, relationships and other rules. These rules are called *integrity constraints*.

There are 3 types of integrity constraints:
1. Attribute constraints (data types on columns)
    - data types that can be specified for each column of a table
    - they restrict operations on a table (can't multiply `int` and `string`)
2. Key constraints (keys)
3. Referential integrity constraints

This chapter focuses on attribute constraints.

## Why constraints?

- They give data structure
- Help with consistency and consequently data quality
    - Data quality is a business advantage / data science pre-requisite
- Enforcing is difficult but PostgreSQL helps

In [None]:
-- if the column data types do not match, we can CAST
SELECT transaction_date, amount + CAST(fee AS integer) AS net_amount 
FROM transactions;

## Working with data types

- Enforced on columns (i.e. attributes)
- Define the so-called "domain" of a column
- Define what operations are possible
- Enforce consistent storage of values

### Some common datatypes

- `text`
    - character strings of any length
- `varchar`
    - a maximum of n characters
- `char`
    - a fixed-length string of n characters
- `boolean`
    - can only take 3 states: `TRUE`, `FALSE`, `NULL`
- `date`, `time`, `timestamp`
    - formats for date and time calculations
- `numeric`
    - arbitrary precision numbers
- `integer`
    - whole numbers

## `INSERT`ing data and conforming with data types

In [None]:
INSERT INTO table_name (col_text, col_int, col_date)
VALUES ('Text', 1, '2020-02-15');

## `CAST`ing columns to other values

In [None]:
SELECT transaction_date, amount + CAST(fee AS integer) AS net_amount
FROM transactions;
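
The effect of the cast can be checked with a minimal sketch in Python's built-in `sqlite3` module. This is a stand-in under stated assumptions: the row is invented, and SQLite is more lenient than PostgreSQL about mixing types, so the sketch only shows the result of the cast, not the error PostgreSQL raises when `integer` and `text` are added without it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# fee is stored as text on purpose, mirroring a badly typed column.
conn.execute(
    "CREATE TABLE transactions (transaction_date text, amount integer, fee text)"
)
conn.execute("INSERT INTO transactions VALUES ('2020-02-15', 100, '5')")

# CAST converts the text value '5' into the integer 5 before the addition.
net_amount = conn.execute(
    "SELECT amount + CAST(fee AS integer) AS net_amount FROM transactions"
).fetchone()[0]
print(net_amount)
```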

## `ALTER`ing column types

In [None]:
ALTER TABLE professors
ALTER COLUMN university_shortname
TYPE varchar(10);

## `USING`

In this example we retain the first 3 characters from a column so we can change its type.

In [None]:
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(3)
USING SUBSTRING(column_name FROM 1 FOR 3);

## The not-null and unique constraints

The not-null constraint:
- disallows `NULL` values in a certain column
- must hold true for the current state 
- must hold true for any future state

### Meaning of `NULL`

`NULL` can mean that a value:
- is unknown
- does not exist
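- does not apply

> Comparing `NULL` with `NULL` never yields `TRUE`: both `NULL = NULL` and `NULL != NULL` evaluate to `NULL` (unknown). Use `IS NULL` to test for missing values.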
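
How comparisons with `NULL` behave can be checked directly. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL; SQL `NULL` arrives in Python as `None`, and SQLite reports a true condition as `1`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Compare NULL with itself in three different ways.
equal, not_equal, is_null = conn.execute(
    "SELECT NULL = NULL, NULL != NULL, NULL IS NULL"
).fetchone()
print(equal, not_equal, is_null)
```

Both comparison operators return `NULL`, so neither is true nor false; only `IS NULL` gives a definite answer.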

## Adding `NOT NULL` constraints when creating a table

In [None]:
CREATE TABLE students (
    ssn integer NOT NULL,
    lastname varchar(64) NOT NULL,
    home_phone integer, -- default is to allow NULL
    office_phone integer -- default is to allow NULL
);

## Adding `NOT NULL` constraints *after* creating a table

In [None]:
-- adding a NOT NULL constraint
ALTER TABLE students
ALTER COLUMN home_phone
SET NOT NULL;

In [None]:
-- removing a NOT NULL constraint
ALTER TABLE students
ALTER COLUMN ssn
DROP NOT NULL;
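
To see a not-null constraint reject data, here is a minimal sketch with Python's built-in `sqlite3` module standing in for PostgreSQL. The inserted values are invented; the table mirrors the `students` example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    " ssn integer NOT NULL,"
    " lastname varchar(64) NOT NULL,"
    " home_phone integer,"
    " office_phone integer)"
)

# Allowed: the phone columns accept NULL by default.
conn.execute("INSERT INTO students VALUES (1, 'Miller', NULL, NULL)")

# Rejected: lastname is declared NOT NULL.
try:
    conn.execute("INSERT INTO students VALUES (2, NULL, NULL, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
print(rejected, row_count)
```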

## The unique constraint

- Disallows duplicates in a column
- Must hold true in the current state
- Must hold true in any future state

### Adding `UNIQUE` constraints when creating a table

In [None]:
CREATE TABLE table_name(
    column_name data_type UNIQUE
);

### Adding `UNIQUE` constraints *after* creating a table

In [None]:
ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);
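
A unique constraint in action, sketched with Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The table and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE universities (university_shortname text UNIQUE, university text)"
)
conn.execute("INSERT INTO universities VALUES ('ETH', 'ETH Zurich')")

# A second row with the same shortname violates the UNIQUE constraint.
try:
    conn.execute("INSERT INTO universities VALUES ('ETH', 'Duplicate entry')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```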

# Keys and Superkeys

## What is a key?

Typically, a database table has an attribute (or a combination of attributes) that uniquely identifies a record.

The set of all attributes of a table also identifies each record uniquely (provided there are no duplicate rows), so it is a key in itself. More precisely, it is a *superkey*: some of its attributes are redundant and can be removed. After removing them we still have a key, i.e. a set of attributes that uniquely identifies a row.

When no more attributes can be removed without losing that property, we have a *minimal key*, also called a *candidate key*.
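
One way to make the superkey idea concrete is a brute-force check in Python. The rows and column names below are made up for illustration, and the result only says what holds for this sample data, not for every possible state of a table.

```python
from itertools import combinations

# Hypothetical sample rows; keys are judged on this data only.
rows = [
    {"license_no": "A1", "make": "Ford", "model": "Focus", "year": 2015},
    {"license_no": "B2", "make": "Ford", "model": "Focus", "year": 2018},
    {"license_no": "C3", "make": "VW", "model": "Golf", "year": 2015},
]
columns = ["license_no", "make", "model", "year"]


def is_superkey(cols):
    """Return True if no two rows share the same values in cols."""
    seen = {tuple(row[c] for c in cols) for row in rows}
    return len(seen) == len(rows)


def minimal_keys():
    """Return the superkeys from which no attribute can be removed."""
    found = []
    for size in range(1, len(columns) + 1):
        for cols in combinations(columns, size):
            # A superset of a key already found cannot be minimal.
            if any(set(key) <= set(cols) for key in found):
                continue
            if is_superkey(cols):
                found.append(cols)
    return found


keys = minimal_keys()
print(is_superkey(columns))
print(keys)
```

All four columns together form a superkey, while the minimal keys for this sample are `license_no` alone, `(make, year)` and `(model, year)`.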

## Primary keys

- one primary key per table, chosen from candidate keys
- uniquely identifies records e.g. for referencing in other tables
- unique *and* not null constraints apply
- primary keys are time-invariant: choose wisely!

### Specifying a `PRIMARY KEY` when creating a table

In [None]:
CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);

Note that, in terms of the values it accepts, this is equivalent to

In [None]:
CREATE TABLE products (
    product_no integer UNIQUE NOT NULL,
    name text,
    price numeric
);

A `PRIMARY KEY` can also be a combination of columns

In [None]:
CREATE TABLE example (
    a integer,
    b integer,
    c integer,
    PRIMARY KEY (a, c)
);

A `PRIMARY KEY` should have as few columns as possible.
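
With a composite primary key, uniqueness applies to the combination of columns rather than to each column on its own. A minimal sketch with Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the inserted values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example (a integer, b integer, c integer, PRIMARY KEY (a, c))"
)
conn.execute("INSERT INTO example VALUES (1, 10, 1)")

# Same a, different c: the pair (a, c) is still unique, so this is accepted.
conn.execute("INSERT INTO example VALUES (1, 20, 2)")

try:
    # The pair (1, 1) already exists, so this row violates the primary key.
    conn.execute("INSERT INTO example VALUES (1, 30, 1)")
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

stored = conn.execute("SELECT a, b, c FROM example ORDER BY c").fetchall()
print(pair_rejected, stored)
```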

## Surrogate keys

A surrogate key is an artificial column added to serve as the primary key. It is useful because:
- primary keys should be built from as few columns as possible
- primary keys should never change over time

## Adding a `SERIAL` type

In [None]:
ALTER TABLE cars
ADD COLUMN id serial PRIMARY KEY;

## `CONCAT`ing to get a key

In [None]:
ALTER TABLE table_name
ADD COLUMN column_c varchar(256);

UPDATE table_name
SET column_c = CONCAT(column_a, column_b);

ALTER TABLE table_name
ADD CONSTRAINT pk PRIMARY KEY (column_c);

# Glue together tables with foreign keys

- a foreign key points to the primary key of another table
- domain of FK must be equal to the domain of PK
- each value of FK must exist in PK of the other table (this is the FK constraint, aka *referential integrity*)
- FK are not actual keys because duplicates are allowed

## Specifying a foreign key when creating a table

Notice the `REFERENCES` keyword

In [None]:
CREATE TABLE manufacturers (
    name varchar(255) PRIMARY KEY
);

INSERT INTO manufacturers
VALUES ('Ford'), ('VW'), ('GM');

-- the foreign key column must have the same type as the referenced column
CREATE TABLE cars (
    model varchar(255) PRIMARY KEY,
    manufacturer_name varchar(255) REFERENCES manufacturers (name)
);
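
The foreign key constraint can be seen rejecting an orphan row in the following sketch, which uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. One difference to keep in mind: SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, whereas PostgreSQL enforces them by default. The car models are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite-specific switch; not needed in PostgreSQL.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE manufacturers (name varchar(255) PRIMARY KEY)")
conn.executemany(
    "INSERT INTO manufacturers VALUES (?)", [("Ford",), ("VW",), ("GM",)]
)
conn.execute(
    "CREATE TABLE cars ("
    " model varchar(255) PRIMARY KEY,"
    " manufacturer_name varchar(255) REFERENCES manufacturers (name))"
)

# Accepted: 'Ford' exists in manufacturers.
conn.execute("INSERT INTO cars VALUES ('Ranger', 'Ford')")

# Rejected: there is no manufacturer called 'Toyota'.
try:
    conn.execute("INSERT INTO cars VALUES ('Tundra', 'Toyota')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```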

## Adding a foreign key to an existing table

In [None]:
ALTER TABLE a
ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id);

Foreign keys model 1:N relationships. We can model N:M relationships with affiliate tables. The purpose of these tables is to connect two other tables: as such, their columns are usually foreign keys to other tables so they can then be joined.
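
An N:M relationship modelled through a connecting table might look like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and every table name, column name and row in it is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript(
    """
    CREATE TABLE professors (id integer PRIMARY KEY, lastname text);
    CREATE TABLE organizations (id integer PRIMARY KEY, name text);
    -- The connecting table holds only foreign keys to the two other tables.
    CREATE TABLE affiliations (
        professor_id integer REFERENCES professors (id),
        organization_id integer REFERENCES organizations (id)
    );
    INSERT INTO professors VALUES (1, 'Meyer'), (2, 'Keller');
    INSERT INTO organizations VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO affiliations VALUES (1, 1), (1, 2), (2, 2);
    """
)

# Join through the connecting table to list who is affiliated with what.
pairs = conn.execute(
    "SELECT p.lastname, o.name "
    "FROM affiliations AS a "
    "JOIN professors AS p ON p.id = a.professor_id "
    "JOIN organizations AS o ON o.id = a.organization_id "
    "ORDER BY p.lastname, o.name"
).fetchall()
print(pairs)
```

One professor can appear with several organizations and one organization with several professors, which a single foreign key column could not express.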

## Updating the values of a column with the values of another column

In [2]:
-- for each row in table_a, find the corresponding row in table_b
-- where condition1 AND condition2 etc.
-- and set the value of column_to_update to the value of
-- column_to_update_from

-- this only makes sense if there is only one matching row in table_b!
UPDATE table_a
SET column_to_update = table_b.column_to_update_from
FROM table_b
WHERE condition1 AND condition2 AND ...;

## Referential Integrity

The definition is
> A record referencing another table must refer to an existing record in that table

- Enforced through foreign keys
- Specified between two tables

### Violations

There are two ways referential integrity can be violated:
1. If a record in table B that is referenced from a record in table A is deleted
2. If a record in table A referencing a non-existing record from table B is inserted

Foreign keys prevent both violations!

### Dealing with violations

We can specify what should happen when a referenced record in the other table is deleted.

In [None]:
CREATE TABLE a (
    id integer PRIMARY KEY,
    column_a varchar(64),
    ...
    b_id integer REFERENCES b (id) ON DELETE NO ACTION
);

-- if we try to delete a record from table B which is referenced
-- from table A, the system will throw an error

Another option is

In [None]:
CREATE TABLE a (
    id integer PRIMARY KEY,
    column_a varchar(64),
    ...
    b_id integer REFERENCES b (id) ON DELETE CASCADE
);

-- CASCADE: if we try to delete a record from table B which is referenced
-- from table A, it deletes the record from table B and then all referencing
-- records in table A

In summary, the `ON DELETE` options are:
- `NO ACTION`: throws an error
- `RESTRICT`: also throws an error
- `CASCADE`: deletes all referencing records
- `SET NULL`: sets the referencing column to `NULL`
- `SET DEFAULT`: sets the referencing column to its default value
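
Two of these options can be compared side by side in a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (with foreign keys switched on explicitly), and the tables `a_cascade` and `a_set_null` are hypothetical variants of table `a`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript(
    """
    CREATE TABLE b (id integer PRIMARY KEY);
    CREATE TABLE a_cascade (
        id integer PRIMARY KEY,
        b_id integer REFERENCES b (id) ON DELETE CASCADE
    );
    CREATE TABLE a_set_null (
        id integer PRIMARY KEY,
        b_id integer REFERENCES b (id) ON DELETE SET NULL
    );
    INSERT INTO b VALUES (1), (2);
    INSERT INTO a_cascade VALUES (10, 1), (11, 2);
    INSERT INTO a_set_null VALUES (20, 1), (21, 2);
    """
)

# Delete the referenced record with id 1 from table b.
conn.execute("DELETE FROM b WHERE id = 1")

cascade_rows = conn.execute("SELECT id, b_id FROM a_cascade ORDER BY id").fetchall()
set_null_rows = conn.execute("SELECT id, b_id FROM a_set_null ORDER BY id").fetchall()
print(cascade_rows)
print(set_null_rows)
```

With `CASCADE` the referencing row disappears together with the referenced record; with `SET NULL` the row stays but its `b_id` becomes `NULL` (shown as `None` in Python).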