# 3. Uniquely identify records with key constraints
**Now let’s get into the best practices of database engineering. It's time to add primary and foreign keys to the tables. These are two of the most important concepts in databases, and are the building blocks you’ll use to establish relationships between tables.**

## Keys and superkeys
Let's discuss key constraints. They are a very important concept in database systems, so we'll spend a whole chapter on them.

### The current database model
Let's have a look at your current database model first. In the last chapter, you specified attribute constraints, first and foremost data types. You also set not-null and unique constraints on certain attributes. This didn't actually change the structure of the model, so it still looks the same.

### The database model with primary keys
By the end of this chapter, the database will look slightly different. You'll add so-called primary keys to three different tables. You'll name them `id`. In the entity-relationship diagram, keys are denoted by underlined attribute names. Notice that you'll add a whole new attribute to the `professors` table, and you'll modify existing columns of the `organizations` and `universities` tables.

### What is a key?
Before we go into the nitty-gritty of what a primary key actually is, let's look at keys in general. Typically, a database table has an attribute, or a combination of attributes, whose values are unique across the whole table. Such attributes identify a record uniquely. Normally, a table as a whole contains only unique records, meaning that the combination of all attributes identifies each record uniquely. Any such identifying combination is called a **superkey**. As long as attributes can be removed from a superkey and the remaining attributes still identify records uniquely, the result is still a superkey. Once no further attribute can be removed without losing uniqueness, we speak of a minimal superkey. This is the actual key. So a **key** is always minimal. Let's look at an example.

### An example
Here's an example that I found in a textbook on database systems.
```
     license_no     | serial_no |    make    |  model  | year
--------------------+-----------+------------+---------+------
 Texas ABC-739      | A69352    | Ford       | Mustang |    2
 Florida TVP-347    | B43696    | Oldsmobile | Cutlass |    5
 New York MPO-22    | X83554    | Oldsmobile | Delta   |    1
 California 432-TFY | C43742    | Mercedes   | 190-D   |   99
 California RSK-629 | Y82935    | Toyota     | Camry   |    4
 Texas RSK-629      | U028365   | Jaguar     | XJS     |    4
```
Obviously, the table shows six different cars, so the combination of all attributes is a superkey. If we remove the `year` attribute from the superkey, the six records are still unique, so it's still a superkey. Actually, there are a lot of possible superkeys in this example.

However, there are only four minimal superkeys: K1 = {`license_no`}, K2 = {`serial_no`}, K3 = {`model`}, and K4 = {`make`, `year`}. Remember that a superkey is minimal if no attribute can be removed without losing the uniqueness property. This is trivial for K1 to K3, as each consists of a single attribute. For K4, removing `year` would leave `make`, which contains duplicates, and removing `make` would leave `year`, which contains duplicates as well, so neither attribute alone is suited as a key. These four minimal superkeys are also called candidate keys. Why candidate keys? In the end, there can only be one key for the table, which has to be chosen from the candidates.

## Get to know SELECT COUNT DISTINCT
Your database doesn't have any defined keys so far, and you don't know which columns or combinations of columns are suited as keys.

There's a simple way of finding out whether a certain column (or a combination) contains only unique values – and thus identifies the records in the table.

You already know the `SELECT DISTINCT` query. Now you just have to wrap everything within the `COUNT()` function and PostgreSQL will return the number of unique rows for the given columns:
```sql
SELECT COUNT(DISTINCT(column_a, column_b, ...))
FROM table;
```

- First, find out the number of rows in `universities`.

```sql
-- Count the number of rows in universities
SELECT COUNT(*) 
FROM universities;
```

```
11
```

- Then, find out how many unique values there are in the `university_city` column.

```sql
-- Count the number of distinct values in the university_city column
SELECT COUNT(DISTINCT(university_city)) 
FROM universities;
```

```
9
```

*So, obviously, the university_city column wouldn't lend itself as a key. Why? Because there are only 9 distinct values, but the table has 11 rows.*
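
Comparing the two counts tells you that duplicates exist, but not which values are duplicated. As a minimal sketch, a `GROUP BY` with a `HAVING` filter can list them. The table `universities_sample` and its three rows are made up for illustration; only the column names are taken from the course database.

```sql
-- Hypothetical sample data, not the actual universities table
CREATE TEMP TABLE universities_sample (
    university_shortname text,
    university_city text
);

INSERT INTO universities_sample VALUES
    ('AAA', 'Zurich'),
    ('BBB', 'Zurich'),
    ('CCC', 'Bern');

-- List every city that appears in more than one row
SELECT university_city, COUNT(*) AS occurrences
FROM universities_sample
GROUP BY university_city
HAVING COUNT(*) > 1;
```

With this sample data, the query returns a single row: `Zurich` with 2 occurrences.
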

## Identify keys with SELECT COUNT DISTINCT
There's a very basic way of finding out what qualifies for a key in an existing, populated table:

1. Count the distinct records for all possible combinations of columns. If the resulting number `x` equals the number of all rows in the table for a combination, you have discovered a superkey.

2. Then remove one column after another until no further column can be removed without the number `x` decreasing. The combination of columns that remains is a (candidate) key.
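
The two steps above can be tried out on the textbook cars table from earlier in this chapter. This is only a sketch: the table name `cars_example` and the column types are assumptions, while the rows are copied from the example.

```sql
-- Rows copied from the textbook example; table name and types are assumed
CREATE TEMP TABLE cars_example (
    license_no text,
    serial_no text,
    make text,
    model text,
    year integer
);

INSERT INTO cars_example VALUES
    ('Texas ABC-739', 'A69352', 'Ford', 'Mustang', 2),
    ('Florida TVP-347', 'B43696', 'Oldsmobile', 'Cutlass', 5),
    ('New York MPO-22', 'X83554', 'Oldsmobile', 'Delta', 1),
    ('California 432-TFY', 'C43742', 'Mercedes', '190-D', 99),
    ('California RSK-629', 'Y82935', 'Toyota', 'Camry', 4),
    ('Texas RSK-629', 'U028365', 'Jaguar', 'XJS', 4);

SELECT
    COUNT(*) AS total_rows,                                  -- 6
    COUNT(DISTINCT (make, model, year)) AS make_model_year,  -- 6: a superkey
    COUNT(DISTINCT (make, year)) AS make_year,               -- 6: still unique without model
    COUNT(DISTINCT make) AS make_only,                       -- 5: Oldsmobile appears twice
    COUNT(DISTINCT year) AS year_only                        -- 5: the value 4 appears twice
FROM cars_example;
```

Starting from the superkey `{make, model, year}`, dropping `model` keeps the count at 6, but dropping either `make` or `year` afterwards lowers it to 5. So `{make, year}` is a candidate key.
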

The table `professors` has 551 rows. It has only one possible candidate key, which is a combination of two attributes. You might want to try different combinations using the "Run code" button. Once you have found the solution, you can submit your answer.

- Using the above steps, identify the *candidate* key by trying out different combinations of columns.

```sql
-- Try out different combinations
SELECT COUNT(DISTINCT(firstname, lastname)) 
FROM professors;
```

```
count
-----
551
```

*Indeed, the only combination that uniquely identifies professors is `{firstname, lastname}`. `{firstname, lastname, university_shortname}` is a superkey, and all other combinations give duplicate values.*

---
## Primary keys
Now it's time to look at an actual use case for superkeys, keys, and candidate keys.

### Primary keys
Primary keys are one of the most important concepts in database design. Almost every database table should have a primary key – chosen by you from the set of candidate keys. The main purpose, as already explained, is uniquely identifying records in a table. This makes it easier to reference these records from other tables, for instance – a concept you will go through in the next and last chapter. You might have already guessed it, but primary keys need to be defined on columns that don't accept duplicate or null values. Lastly, primary key constraints are time-invariant, meaning that they must hold for the current data in the table – but also for any future data that the table might hold. It is therefore wise to choose columns where values will always be unique and not null.

### Specifying primary keys
The following two table definitions accept exactly the same data; however, the second one has an explicit primary key specified.
```sql
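-- Variant 1: UNIQUE and NOT NULL constraints
CREATE TABLE products (
	product_no integer UNIQUE NOT NULL,
	name text,
	price numeric
);

-- Variant 2: the same table defined with an explicit primary key
-- (an alternative to variant 1, not meant to be run after it)
CREATE TABLE products (
	product_no integer PRIMARY KEY,
	name text,
	price numeric
);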
```

As you can see, specifying primary keys upon table creation is very easy. Primary keys can also be specified as a separate table constraint, as in the following definition. This notation is necessary if you want to designate more than one column as the primary key. Beware: that's still only one primary key; it is just formed by the combination of two columns. Ideally, though, primary keys consist of as few columns as possible.
```sql
CREATE TABLE example (
	a integer,
	b integer,
	c integer,
	PRIMARY KEY (a, c)
);
```
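
To see what a two-column primary key does and does not allow, here is a small sketch. The table name `example_demo` and the inserted values are made up; `ON CONFLICT ... DO NOTHING` is used only so that the script runs through instead of stopping with an error at the conflicting row.

```sql
-- Hypothetical table name; the columns mirror the definition above
CREATE TEMP TABLE example_demo (
    a integer,
    b integer,
    c integer,
    PRIMARY KEY (a, c)
);

-- Individual values of a and c may repeat, as long as each (a, c) pair is unique
INSERT INTO example_demo VALUES
    (1, 10, 1),
    (1, 20, 2),
    (2, 30, 1);

-- The pair (1, 1) already exists, so this row conflicts with the primary key.
-- ON CONFLICT ... DO NOTHING skips the row instead of raising an error.
INSERT INTO example_demo VALUES (1, 40, 1)
ON CONFLICT (a, c) DO NOTHING;

SELECT COUNT(*) FROM example_demo;  -- 3
```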
Adding primary key constraints to existing tables is the same procedure as adding unique constraints, which you might remember from the last chapter. As with unique constraints, you have to give the constraint a certain name.
```sql
ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name);
```
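
After adding a constraint, you may want to check that it really exists and under which name. A possible sketch uses the standard `information_schema.table_constraints` view; the table `products_demo` and the constraint name `products_demo_pk` are made up for this example.

```sql
-- Hypothetical table, created without a primary key
CREATE TABLE products_demo (
    product_no integer,
    name text
);

-- Add the primary key afterwards
ALTER TABLE products_demo
ADD CONSTRAINT products_demo_pk PRIMARY KEY (product_no);

-- List the primary key constraint defined on the table
SELECT constraint_name, constraint_type
FROM information_schema.table_constraints
WHERE table_name = 'products_demo'
  AND constraint_type = 'PRIMARY KEY';
```

The query should return one row with the name `products_demo_pk`.
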

### Your database
In the exercises that follow, you will add primary keys to the tables `universities` and `organizations`. You will add a special type of primary key, a so-called surrogate key, to the table `professors` in the last part of this chapter.

## Identify the primary key
As the database designer, you have to make a wise choice as to which column should be the primary key.
```
     license_no     | serial_no |    make    |  model  | year
--------------------+-----------+------------+---------+------
 Texas ABC-739      | A69352    | Ford       | Mustang |    2
 Florida TVP-347    | B43696    | Oldsmobile | Cutlass |    5
 New York MPO-22    | X83554    | Oldsmobile | Delta   |    1
 California 432-TFY | C43742    | Mercedes   | 190-D   |   99
 California RSK-629 | Y82935    | Toyota     | Camry   |    4
 Texas RSK-629      | U028365   | Jaguar     | XJS     |    4
```
Which of the following columns or column combinations could best serve as the primary key?

1. ~PK = {make}~
2. ~PK = {model, year}~
3. **PK = {license_no}**
4. ~PK = {year, make}~

**Answer: 3** *A primary key consisting solely of `license_no` is probably the wisest choice, as license numbers are unique across all registered cars in a country.*

## ADD key CONSTRAINTs to the tables
Two of the tables in your database already have well-suited candidate keys consisting of one column each: `organizations` and `universities` with the `organization` and `university_shortname` columns, respectively.

In this exercise, you'll rename these columns to `id` using the `RENAME COLUMN` command and then specify primary key constraints for them. This is as straightforward as adding unique constraints.
```sql
ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name);
```
Note that you can also specify more than one column in the parentheses.

```sql
-- Rename the organization column to id
ALTER TABLE organizations
RENAME COLUMN organization TO id;

-- Make id a primary key
ALTER TABLE organizations
ADD CONSTRAINT organization_pk PRIMARY KEY (id);
```

- Rename the `university_shortname` column to `id` in `universities`.
- Make `id` a primary key and name it `university_pk`.

```sql
-- Rename the university_shortname column to id
ALTER TABLE universities
RENAME COLUMN university_shortname TO id;

-- Make id a primary key
ALTER TABLE universities
ADD CONSTRAINT university_pk PRIMARY KEY (id);
```

*Let's tackle the last table that needs a primary key right now: `professors`. However, things are going to be different this time, because you'll add a so-called surrogate key.*

---
## Surrogate keys
Surrogate keys are sort of an artificial primary key. In other words, they are not based on a native column in your data, but on a column that just exists for the sake of having a primary key. Why would you need that?

### Surrogate keys
There are several reasons for creating an artificial surrogate key. As mentioned before, **a primary key is ideally constructed from as few columns as possible**. Secondly, **the primary key of a record should never change over time**. If you define an artificial primary key, ideally consisting of a unique number or string, you can be sure that this number stays the same for each record. Other attributes might change, but the primary key always has the same value for a given record.

### An example
Let's look back at the example. 
```
     license_no     | serial_no |    make    |  model  |   color
--------------------+-----------+------------+---------+-----------
 Texas ABC-739      | A69352    | Ford       | Mustang | blue
 Florida TVP-347    | B43696    | Oldsmobile | Cutlass | black
 New York MPO-22    | X83554    | Oldsmobile | Delta   | silver
 California 432-TFY | C43742    | Mercedes   | 190-D   | champagne
 California RSK-629 | Y82935    | Toyota     | Camry   | red
 Texas RSK-629      | U028365   | Jaguar     | XJS     | blue
```
I altered it slightly and added the `color` column. In this table, the `license_no` column would be suited as the primary key: the license number is unlikely to change over time, unlike the `color` column, for example, which might change if the car is repainted. So there's no need for a surrogate key here. However, let's say there were only these three attributes in the table.
```
    make    |  model  |   color
------------+---------+-----------
 Ford       | Mustang | blue
 Oldsmobile | Cutlass | black
 Oldsmobile | Delta   | silver
 Mercedes   | 190-D   | champagne
 Toyota     | Camry   | red
 Jaguar     | XJS     | blue
```
The only sensible primary key would be the combination of `make` and `model`, but that's two columns for the primary key.

### Adding a surrogate key with serial data type
You could add a new surrogate key column, called `id`, to solve this problem. 
```sql
ALTER TABLE cars
ADD COLUMN id serial PRIMARY KEY;
```
This works because PostgreSQL offers a special data type for adding auto-incrementing numbers to an existing table: the `serial` type. It is specified just like any other data type. Once you add a column with the `serial` type, all the records in your table will be numbered. Whenever you add a new record to the table, it will automatically get a number that does not exist yet. Other database management systems, like MySQL, offer similar mechanisms.
```sql
INSERT INTO cars
VALUES('Volkswagen', 'Blitz', 'black');
```
```
    make    |  model  |   color   | id
------------+---------+-----------+----
 Ford       | Mustang | blue      |  1
 Oldsmobile | Cutlass | black     |  2
 Oldsmobile | Delta   | silver    |  3
 Mercedes   | 190-D   | champagne |  4
 Toyota     | Camry   | red       |  5
 Jaguar     | XJS     | blue      |  6
 Volkswagen | Blitz   | black     |  7
```

Also, if you try to specify an ID that already exists, the primary key constraint will prevent you from doing so. So, after all, the `id` column uniquely identifies each record in this table – which is very useful, for example, when you want to refer to these records from another table. 
```sql
INSERT INTO cars
VALUES ('Opel', 'Astra', 'green', 1);
```
```
duplicate key value violates unique constraint "cars_pkey"
DETAIL: Key (id)=(1) already exists.
```
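
The whole sequence can be replayed on a scratch table. The following is a sketch under assumptions: the table name `cars_demo` and the `text` column types are made up, and only three of the rows above are used. It relies on `serial` being shorthand for an integer column whose default value is drawn from a sequence.

```sql
-- Hypothetical scratch table with three of the example rows
CREATE TEMP TABLE cars_demo (
    make text,
    model text,
    color text
);

INSERT INTO cars_demo VALUES
    ('Ford', 'Mustang', 'blue'),
    ('Oldsmobile', 'Cutlass', 'black'),
    ('Oldsmobile', 'Delta', 'silver');

-- The three existing rows receive the numbers 1 to 3 when the column is added
ALTER TABLE cars_demo
ADD COLUMN id serial PRIMARY KEY;

-- Naming the target columns leaves id to be filled in from the sequence
INSERT INTO cars_demo (make, model, color)
VALUES ('Volkswagen', 'Blitz', 'black');

SELECT make, model, id
FROM cars_demo
ORDER BY id;
```

The new Volkswagen row gets the `id` 4, the next unused number. Listing the columns explicitly in the `INSERT` is optional here, but it keeps the statement working if the column order of the table ever changes.
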

### Another type of surrogate key
Another strategy for creating a surrogate key is to combine two existing columns into a new one. In this example, we first add a new column with the `varchar` data type. We then `UPDATE` that column with the concatenation of two existing columns. The `CONCAT` function glues together the values of two or more existing columns. Lastly, we turn that new column into a surrogate primary key.
```sql
ALTER TABLE table_name
ADD COLUMN column_c varchar(256);

UPDATE table_name
SET column_c = CONCAT(column_a, column_b);

ALTER TABLE table_name
ADD CONSTRAINT pk PRIMARY KEY (column_c);
```
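
One thing to watch out for with this approach: gluing values together without a separator can turn different column combinations into the same string. The following sketch uses made-up literal values to show the effect, and `CONCAT_WS()` (concatenate with separator) as one possible way around it.

```sql
-- Without a separator, different inputs can collapse into the same string
SELECT CONCAT('ab', 'c') AS id_one,  -- 'abc'
       CONCAT('a', 'bc') AS id_two;  -- 'abc'

-- CONCAT_WS puts the separator between the parts, which keeps them apart
SELECT CONCAT_WS('-', 'ab', 'c') AS id_one,  -- 'ab-c'
       CONCAT_WS('-', 'a', 'bc') AS id_two;  -- 'a-bc'
```

A separator only helps if it cannot occur in the column values themselves, so checking the result with `COUNT(DISTINCT ...)` before adding the primary key is still worthwhile.
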

### Your database
In the exercises, you'll add a surrogate key to the `professors` table, because the existing attributes are not really suited as a primary key. Theoretically, there could be more than one professor with the same name working for one university, resulting in duplicates. With an auto-incrementing `id` column as the primary key, you make sure that each professor can be uniquely referred to. This was not necessary for organizations and universities, as their names can be assumed to be unique across these tables. In other words, it is unlikely that two organizations with the same name exist, if only for trademark reasons. The same goes for universities.


## Add a SERIAL surrogate key
Since there's no single-column candidate key in `professors` (only a composite candidate key consisting of `firstname` and `lastname`), you'll add a new column `id` to that table.

This column has a special data type `serial`, which turns the column into an auto-incrementing number. This means that, whenever you add a new professor to the table, the new record will automatically get an `id` that does not exist yet in the table: a perfect primary key.

- Add a new column `id` with data type `serial` to the `professors` table.

```sql
-- Add the new column to the table
ALTER TABLE professors 
ADD COLUMN id serial;
```

- Make `id` a primary key and name it `professors_pkey`.

```sql
-- Make id a primary key
ALTER TABLE professors 
ADD CONSTRAINT professors_pkey PRIMARY KEY (id);
```

- Write a query that returns all the columns and 10 rows from `professors`.

```sql
-- Have a look at the first 10 rows of professors
SELECT *
FROM professors
LIMIT 10;
```

```
firstname       | lastname    | university_shortname | id
----------------|-------------|----------------------|----
Karl            | Aberer      | EPF                  | 1
Reza Shokrollah | Abhari      | ETH                  | 2
Georges         | Abou Jaoudé | EPF                  | 3
Hugues          | Abriel      | UBE                  | 4
Daniel          | Aebersold   | UBE                  | 5
Marcelo         | Aebi        | ULA                  | 6
Christoph       | Aebi        | UBE                  | 7
Patrick         | Aebischer   | EPF                  | 8
Stephan         | Aier        | USG                  | 9
Anastasia       | Ailamaki    | EPF                  | 10
```

*As you can see, PostgreSQL has automatically numbered the rows with the `id` column, which now functions as a (surrogate) primary key – it uniquely identifies professors.*

## CONCATenate columns to a surrogate key
Another strategy to add a surrogate key to an existing table is to concatenate existing columns with the `CONCAT()` function.

Let's think of the following example table:
```sql
CREATE TABLE cars (
 make varchar(64) NOT NULL,
 model varchar(64) NOT NULL,
 mpg integer NOT NULL
);
```
The table is populated with 10 rows of *completely fictional* data.

Unfortunately, the table doesn't have a primary key yet. None of the columns consists of only unique values, so no single column can serve as a key; instead, several columns have to be combined to form one.

In the course of the following exercises, you will combine `make` and `model` into such a surrogate key.

- Count the number of distinct rows with a combination of the `make` and `model` columns.

```sql
-- Count the number of distinct rows with columns make, model
SELECT COUNT(DISTINCT(make, model))
FROM cars;
```

```
count
-----
10
```

- Add a new column `id` with the data type `varchar(128)`.

```sql
-- Add the id column
ALTER TABLE cars
ADD COLUMN id varchar(128);
```

- Concatenate `make` and `model` into `id` using an `UPDATE table_name SET column_name = ...` query and the `CONCAT()` function.

```sql
-- Update id with make + model
UPDATE cars
SET id = CONCAT(make, model);
```

- Make `id` a primary key and name it `id_pk`.

```sql
-- Make id a primary key
ALTER TABLE cars
ADD CONSTRAINT id_pk PRIMARY KEY(id);

-- Have a look at the table
SELECT * FROM cars;
```

```
make       | model     | mpg | id
-----------|-----------|-----|--------------------
Subaru     | Forester  | 24  | SubaruForester
Opel       | Astra     | 45  | OpelAstra
Opel       | Vectra    | 40  | OpelVectra
Ford       | Avenger   | 30  | FordAvenger
Ford       | Galaxy    | 30  | FordGalaxy
Toyota     | Prius     | 50  | ToyotaPrius
Toyota     | Speedster | 30  | ToyotaSpeedster
Toyota     | Galaxy    | 20  | ToyotaGalaxy
Mitsubishi | Forester  | 10  | MitsubishiForester
Mitsubishi | Galaxy    | 30  | MitsubishiGalaxy
```

*The `id` column, built from `make` and `model`, now uniquely identifies each car and serves as a surrogate primary key.*

## Test your knowledge before advancing
Before you move on to the next chapter, let's quickly review what you've learned so far about attributes and key constraints.

Let's think of an entity type "student". A student has:

1. a **last name** consisting of *up to 128* characters (required),
2. a unique **social security number**, consisting only of digits, that should serve as a key,
3. a **phone number** of *fixed length 12*, consisting of numbers and characters (but some students don't have one).

- Given the above description of a student entity, create a table `students` with the correct column types.
- Add a `PRIMARY KEY` for the social security number `ssn`.

```sql
-- Create the table
CREATE TABLE students (
  last_name varchar(128) NOT NULL,
  ssn integer PRIMARY KEY,
  phone_no char(12)
);
```
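
To check that the table behaves as described, you could insert a couple of rows. This is only a sketch: the names and numbers below are made up, and the table definition is repeated with `IF NOT EXISTS` so that the example also runs on its own.

```sql
-- Same definition as above; IF NOT EXISTS lets the sketch run standalone
CREATE TABLE IF NOT EXISTS students (
  last_name varchar(128) NOT NULL,
  ssn integer PRIMARY KEY,
  phone_no char(12)
);

-- phone_no is optional, so it can be left out and is stored as NULL
INSERT INTO students (last_name, ssn)
VALUES ('Miller', 123456789);

-- A student with a 12-character phone number
INSERT INTO students (last_name, ssn, phone_no)
VALUES ('Smith', 987654321, '555-010-9999');

-- Students without a phone number
SELECT last_name, ssn
FROM students
WHERE phone_no IS NULL;
```

Inserting a second student with an `ssn` that is already in the table would be rejected by the primary key constraint.
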