# Spring 2019 | CS 6400

Author: Travis Jefferies<br>
Last updated: 04252019

## Normalization

Databases are forever.<br>
EER Diagrams may go missing over time. <br>
You never know what will be in a database: it could have been designed by an expert or by an idiot, and compromises may have been made in its design in the name of performance.

Given a relation and a set of functional dependencies like these:

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging | 13 | 1969 | Austin | 43,000 |
| u2@gt.edu | Meditation | 21 | 1969 | Austin | 43,000 |

* Email -> BirthYear, CurrentCity, Salary
* Email, Interest -> SinceAge
* BirthYear -> Salary

***Is this relation laid out in such a manner that it is easy to enforce the functional dependencies?***<br>
***How do we normalize the relation without information loss and so that the functional dependencies can be enforced?***

### The Rules

1. No redundancy of facts
2. No cluttering of facts
3. Must preserve information
4. Must preserve functional dependencies

### NOT a relation NF$^{2}$

Given

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music<br>Reading<br>Tennis<br> | 10<br>5<br>14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging<br>Meditation<br>Surfing | 13<br>21<br>19 | 1969 | Austin | 43,000 |

* Multi-values are not allowed using traditional normalization rules
    * The fix is simple - duplicate the `Email`, `BirthYear`, `CurrentCity`, `Salary` fields for each value in the `Interest` and `SinceAge` columns
    
Remember, relations are supposed to be made up of sets of atomic values.
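
A minimal Python sketch of that fix, assuming a hypothetical nested representation in which each user carries a list of (`Interest`, `SinceAge`) pairs. The dictionary layout and the `flatten` name are illustrative only and are not part of the course material:

```python
# Hypothetical nested (NF^2) representation of the table above.
nested = [
    {"Email": "u1@gt.edu", "BirthYear": 1985, "CurrentCity": "Seattle",
     "Salary": 27000,
     "Interests": [("Music", 10), ("Reading", 5), ("Tennis", 14)]},
    {"Email": "u2@gt.edu", "BirthYear": 1969, "CurrentCity": "Austin",
     "Salary": 43000,
     "Interests": [("Blogging", 13), ("Meditation", 21), ("Surfing", 19)]},
]

def flatten(nested_rows):
    """Emit one tuple of atomic values per (user, interest) pair."""
    flat = []
    for user in nested_rows:
        for interest, since_age in user["Interests"]:
            # The single-valued fields are duplicated for every interest.
            flat.append((user["Email"], interest, since_age,
                         user["BirthYear"], user["CurrentCity"],
                         user["Salary"]))
    return flat

flat_rows = flatten(nested)
for row in flat_rows:
    print(row)
```

Every field in the result holds a single atomic value, at the cost of repeating the per-user fields.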

### Relation with Problems

If we normalize the relation above, getting rid of the multi-values, we will end up with a relation that looks like this:

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging | 13 | 1969 | Austin | 43,000 |
| u2@gt.edu | Meditation | 21 | 1969 | Austin | 43,000 |

However, this relation still has many problems with it. Going back to the functional dependencies and drawing a picture can oftentimes be helpful:

![](p1.svg)

* Email -> BirthYear, CurrentCity, Salary
* Email, Interest -> SinceAge
* BirthYear -> Salary

Notice how the picture captures the functional dependencies much more clearly. Let's examine the potential issues we may have with this relation "as-is".

#### Relation with Problems: Redundancy

Given

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging | 13 | 1969 | Austin | 43,000 |
| u2@gt.edu | Meditation | 21 | 1969 | Austin | 43,000 |

The first issue with this relation is we are storing redundant information.
* For each `Email`, the same `BirthYear`, `CurrentCity`, and `Salary` are repeated
* For each `BirthYear` the same `Salary` is repeated

Redundancy can lead to inconsistencies.

#### Relation with Problems: Insertion Anomaly

Given

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging | 13 | 1969 | Austin | 43,000 |
| u2@gt.edu | Meditation | 21 | 1969 | Austin | 43,000 |
| u9@gt.edu | NULL | NULL | 1988 | Las Vegas | 24,000 |

The second issue with the relation above is the issue of insertion anomalies.<br>
If we insert a new `RegularUser` (say u9@gt.edu) without any `Interest`, then we must insert NULL values for `Interest` and `SinceAge`.
* NULL values present in a relation can be a nuisance in calculations and join situations

#### Relation with Problems: Deletion Anomaly

Given

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging | 13 | 1969 | Austin | 43,000 |
| u2@gt.edu | Meditation | 21 | 1969 | Austin | 43,000 |
| u12@gt.edu | NULL | NULL | **1974** | San Diego | **38,000** |

If we delete or filter out the tuples with NULLs present, we lose the fact that RegularUsers with `BirthYear = 1974` have a `Salary = 38,000`.
* This should prompt us to think about breaking the relation into two relations

#### Relation with Problems: Update Anomaly

Given

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | **Seattle** | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | **Seattle** | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | **Seattle** | 27,000 |
| u2@gt.edu | Blogging | 13 | **1969** | Austin | **43,000** |
| u2@gt.edu | Meditation | 21 | **1969** | Austin | **43,000** |
| u12@gt.edu | Reading | 20 | 1974 | San Diego | 38,000 |

If we update the `CurrentCity` of a RegularUser, the update is inefficient and error-prone because the same value must be changed in multiple tuples.<br>
* Alternatively, if we update `BirthYear = 1969` of `Email = u2@gt.edu` to `BirthYear = 1974`, we must do extra work to update `Salary` in multiple places
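
To make the cost concrete, here is a small Python sketch that counts how many tuples a single logical change touches. The `update_city` helper and the new city value `"Portland"` are made up for illustration:

```python
# The RegularUser relation from the table above, one dict per tuple.
rows = [
    {"Email": "u1@gt.edu", "Interest": "Music", "CurrentCity": "Seattle"},
    {"Email": "u1@gt.edu", "Interest": "Reading", "CurrentCity": "Seattle"},
    {"Email": "u1@gt.edu", "Interest": "Tennis", "CurrentCity": "Seattle"},
    {"Email": "u2@gt.edu", "Interest": "Blogging", "CurrentCity": "Austin"},
    {"Email": "u2@gt.edu", "Interest": "Meditation", "CurrentCity": "Austin"},
    {"Email": "u12@gt.edu", "Interest": "Reading", "CurrentCity": "San Diego"},
]

def update_city(rows, email, new_city):
    """Change a user's city everywhere it is stored; return rows touched."""
    touched = 0
    for row in rows:
        if row["Email"] == email:
            row["CurrentCity"] = new_city
            touched += 1
    return touched

# One fact changes, but it is stored once per interest.
touched = update_city(rows, "u1@gt.edu", "Portland")
print(touched)
```

If any one of those tuples were missed, the relation would report two different cities for the same user.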

### Dependencies

Now let's talk about functional dependencies and the need to normalize correctly.

#### Information Loss

Our job as diligent database designers is to break apart relations like **RegularUser** into smaller relations while still retaining the functional dependencies called out in the schema.

Given

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging | 13 | 1969 | Austin | 43,000 |
| u3@gt.edu | Meditation | 21 | 1967 | Austin | 48,000 |
| u12@gt.edu | Reading | 20 | 1974 | San Diego | 38,000 |

If we decide to break RegularUser down into two relations like:

**ResultStep1a**:

| CurrentCity | Salary |
| --- | --- |
| Seattle | 27,000 |
| **Austin** | 43,000 |
| **Austin** | 48,000 |
| San Diego | 38,000 |

**ResultStep1b**:

| Email | Interest | SinceAge | BirthYear | CurrentCity |
| --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle |
| u1@gt.edu | Reading | 5 | 1985 | Seattle |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle |
| u2@gt.edu | Blogging | 13 | 1969 | Austin |
| u3@gt.edu | Meditation | 21 | 1967 | Austin |
| u12@gt.edu | Reading | 20 | 1974 | San Diego |

If we try to join ResultStep1a with ResultStep1b on `CurrentCity`, we get spurious tuples. Ironically, this information gain is a form of information loss, because we can no longer tell which tuples belonged to the original relation.
* In this case two extra rows are generated (each Austin user is paired with both Austin salaries), breaking consistency with the source RegularUser table
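
The join can be reproduced with a short Python sketch. The `project` helper is a hypothetical name introduced here, and relations are modeled as plain sets of tuples:

```python
ATTRS = ("Email", "Interest", "SinceAge", "BirthYear", "CurrentCity", "Salary")
regular_user = {
    ("u1@gt.edu", "Music", 10, 1985, "Seattle", 27000),
    ("u1@gt.edu", "Reading", 5, 1985, "Seattle", 27000),
    ("u1@gt.edu", "Tennis", 14, 1985, "Seattle", 27000),
    ("u2@gt.edu", "Blogging", 13, 1969, "Austin", 43000),
    ("u3@gt.edu", "Meditation", 21, 1967, "Austin", 48000),
    ("u12@gt.edu", "Reading", 20, 1974, "San Diego", 38000),
}

def project(rows, keep):
    """Relational projection: keep the named columns, drop duplicates."""
    idx = [ATTRS.index(a) for a in keep]
    return {tuple(row[i] for i in idx) for row in rows}

step1a = project(regular_user, ("CurrentCity", "Salary"))
step1b = project(regular_user, ("Email", "Interest", "SinceAge",
                                "BirthYear", "CurrentCity"))

# Natural join on CurrentCity (position 4 in step1b, position 0 in step1a).
joined = {b + (salary,)
          for b in step1b
          for (city, salary) in step1a
          if b[4] == city}

spurious = joined - regular_user
print(len(regular_user), len(joined))
for row in sorted(spurious):
    print("spurious:", row)
```

Every original tuple is still in `joined`; the problem is the additional tuples that were never in RegularUser.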

#### Dependency Loss

Not to state the obvious but the functional dependencies define a loose skeleton for the subsequent schema layout. Remember, our functional dependencies were defined as

* Email -> BirthYear, CurrentCity, Salary
* Email, Interest -> SinceAge
* BirthYear -> Salary

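Our breakouts above in ResultStep1a and ResultStep1b do not preserve these dependencies: `Salary` is now stored only alongside `CurrentCity`, so neither Email -> Salary nor BirthYear -> Salary can be enforced within a single relation.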

#### Correct Breakout

Given

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging | 13 | 1969 | Austin | 43,000 |
| u3@gt.edu | Meditation | 21 | 1967 | Austin | 48,000 |
| u12@gt.edu | Reading | 20 | 1974 | San Diego | 38,000 |

With the functional dependencies defined above, we could reasonably propose a normalization of the RegularUser table like so:

**CorrectResultStep1a**:

| Email | BirthYear | CurrentCity |
| --- | --- | --- |
| u1@gt.edu | 1985 | Seattle |
| u2@gt.edu | 1969 | Austin |
| u3@gt.edu | 1967 | Austin |
| u12@gt.edu | 1974 | San Diego |

**CorrectResultStep1b**:

| Email | Interest | SinceAge |
| --- | --- | --- |
| u1@gt.edu | Music | 10 |
| u1@gt.edu | Reading | 5 |
| u1@gt.edu | Tennis | 14 |
| u2@gt.edu | Blogging | 13 |
| u3@gt.edu | Meditation | 21 |
| u12@gt.edu | Reading | 20 |

**CorrectResultStep1c**:

| BirthYear | Salary |
| --- | --- |
| 1985 | 27,000 |
| 1969 | 43,000 |
| 1967 | 48,000 |
| 1974 | 38,000 |

A perfect schema has the following desired attributes:
* No redundancy
* No insertion anomalies
* No deletion anomalies
* No update anomalies
* No information loss
* No dependency loss
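
As a quick check of the "no information loss" property for the three-way breakout above, the following Python sketch projects RegularUser onto the three relations and joins them back. The `project` helper and the variable names are illustrative assumptions:

```python
ATTRS = ("Email", "Interest", "SinceAge", "BirthYear", "CurrentCity", "Salary")
regular_user = {
    ("u1@gt.edu", "Music", 10, 1985, "Seattle", 27000),
    ("u1@gt.edu", "Reading", 5, 1985, "Seattle", 27000),
    ("u1@gt.edu", "Tennis", 14, 1985, "Seattle", 27000),
    ("u2@gt.edu", "Blogging", 13, 1969, "Austin", 43000),
    ("u3@gt.edu", "Meditation", 21, 1967, "Austin", 48000),
    ("u12@gt.edu", "Reading", 20, 1974, "San Diego", 38000),
}

def project(rows, keep):
    """Relational projection: keep the named columns, drop duplicates."""
    idx = [ATTRS.index(a) for a in keep]
    return {tuple(row[i] for i in idx) for row in rows}

step1a = project(regular_user, ("Email", "BirthYear", "CurrentCity"))
step1b = project(regular_user, ("Email", "Interest", "SinceAge"))
step1c = project(regular_user, ("BirthYear", "Salary"))

# Join 1b with 1a on Email, then with 1c on BirthYear.
rejoined = {
    (email, interest, since_age, birth_year, city, salary)
    for (email, interest, since_age) in step1b
    for (email_a, birth_year, city) in step1a if email_a == email
    for (birth_year_c, salary) in step1c if birth_year_c == birth_year
}

print(rejoined == regular_user)
```

Both join fields (`Email` and `BirthYear`) are keys of the relation they are looked up in, so each tuple finds exactly one partner.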

#### Functional Dependencies

Let $X$ and $Y$ be sets of attributes in $R$.<br>
$Y$ is ***functionally dependent*** on $X$ in $R$ iff for each $x \in R.X$ there is precisely one $y \in R.Y$.

![](p1.svg)

From the picture above, we know that for every combination of `Email` and `Interest` we should have a unique (one and only one value) `SinceAge`.

**CorrectResultStep1b**:

| Email | Interest | SinceAge |
| --- | --- | --- |
| u1@gt.edu | Music | 10 |
| u1@gt.edu | Reading | 5 |
| u1@gt.edu | Tennis | 14 |
| u2@gt.edu | Blogging | 13 |
| u3@gt.edu | Meditation | 21 |
| u12@gt.edu | Reading | 20 |

CorrectResultStep1b shows the proper manifestation of this functional dependency.
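
The definition translates directly into a check over a relation instance. This is a minimal Python sketch with a hypothetical `fd_holds` helper. Note that an instance can only refute a functional dependency; passing the check on sample data does not prove the dependency holds in general:

```python
ATTRS = ("Email", "Interest", "SinceAge", "BirthYear", "CurrentCity", "Salary")
REGULAR_USER = [
    ("u1@gt.edu", "Music", 10, 1985, "Seattle", 27000),
    ("u1@gt.edu", "Reading", 5, 1985, "Seattle", 27000),
    ("u1@gt.edu", "Tennis", 14, 1985, "Seattle", 27000),
    ("u2@gt.edu", "Blogging", 13, 1969, "Austin", 43000),
    ("u3@gt.edu", "Meditation", 21, 1967, "Austin", 48000),
    ("u12@gt.edu", "Reading", 20, 1974, "San Diego", 38000),
]

def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    lhs_idx = [ATTRS.index(a) for a in lhs]
    rhs_idx = [ATTRS.index(a) for a in rhs]
    seen = {}
    for row in rows:
        x = tuple(row[i] for i in lhs_idx)
        y = tuple(row[i] for i in rhs_idx)
        # setdefault stores y the first time x is seen, then returns it.
        if seen.setdefault(x, y) != y:
            return False
    return True

print(fd_holds(REGULAR_USER, ("Email", "Interest"), ("SinceAge",)))
print(fd_holds(REGULAR_USER, ("Email",), ("SinceAge",)))
print(fd_holds(REGULAR_USER, ("BirthYear",), ("Salary",)))
```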

#### Full Functional Dependencies

Let $X$ and $Y$ be sets of attributes in $R$.<br>
$Y$ is ***fully functionally dependent*** on $X$ in $R$ iff $Y$ is functionally dependent on $X$ and $Y$ is not functionally dependent on any proper subset of $X$.

In CorrectResultStep1b above, we see that for every combination of `Email` and `Interest` we have a unique (one and only one value) `SinceAge`. This tells us that `SinceAge` is fully functionally dependent on `Email` and `Interest` and cannot be determined by looking at, say, `Email` alone.

This is in contrast to the situation between `CurrentCity` and `Email` / `Interest`. We only need `Email` to determine the `CurrentCity` and are in no way dependent on `Interest`.

**RegularUser**:

| Email | Interest | SinceAge | BirthYear | CurrentCity | Salary
| --- | --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Reading | 5 | 1985 | Seattle | 27,000 |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle | 27,000 |
| u2@gt.edu | Blogging | 13 | 1969 | Austin | 43,000 |
| u3@gt.edu | Meditation | 21 | 1967 | Austin | 48,000 |
| u12@gt.edu | Reading | 20 | 1974 | San Diego | 38,000 |
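
The contrast between the two cases can be sketched in Python by testing every non-empty proper subset of the left-hand side. The helper names `fd_holds` and `fully_dependent` are assumptions made for this example, and the check is only as strong as the sample data:

```python
from itertools import combinations

ATTRS = ("Email", "Interest", "SinceAge", "BirthYear", "CurrentCity", "Salary")
REGULAR_USER = [
    ("u1@gt.edu", "Music", 10, 1985, "Seattle", 27000),
    ("u1@gt.edu", "Reading", 5, 1985, "Seattle", 27000),
    ("u1@gt.edu", "Tennis", 14, 1985, "Seattle", 27000),
    ("u2@gt.edu", "Blogging", 13, 1969, "Austin", 43000),
    ("u3@gt.edu", "Meditation", 21, 1967, "Austin", 48000),
    ("u12@gt.edu", "Reading", 20, 1974, "San Diego", 38000),
]

def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    lhs_idx = [ATTRS.index(a) for a in lhs]
    rhs_idx = [ATTRS.index(a) for a in rhs]
    seen = {}
    for row in rows:
        x = tuple(row[i] for i in lhs_idx)
        y = tuple(row[i] for i in rhs_idx)
        if seen.setdefault(x, y) != y:
            return False
    return True

def fully_dependent(rows, lhs, rhs):
    """lhs -> rhs holds and no non-empty proper subset of lhs determines rhs."""
    if not fd_holds(rows, lhs, rhs):
        return False
    for size in range(1, len(lhs)):
        for subset in combinations(lhs, size):
            if fd_holds(rows, subset, rhs):
                return False
    return True

key = ("Email", "Interest")
print(fully_dependent(REGULAR_USER, key, ("SinceAge",)))
print(fully_dependent(REGULAR_USER, key, ("CurrentCity",)))
```

`SinceAge` needs the whole pair, while `CurrentCity` is already determined by `Email` alone, so it is only partially dependent on (`Email`, `Interest`).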

#### Functional Dependencies and Keys

We use ***keys*** to enforce full functional dependencies, $X \rightarrow Y$.<br>
In a relation, the ***values of the key are unique***
* This is why/how it enforces a function

Our first example involves the full functional dependency
* Email, Interest -> SinceAge

**CorrectResultStep1b**:

| <u>Email</u> | <u>Interest</u> | SinceAge |
| --- | --- | --- |
| u1@gt.edu | Music | 10 |
| u1@gt.edu | Reading | 5 |
| u1@gt.edu | Tennis | 14 |
| u2@gt.edu | Blogging | 13 |
| u3@gt.edu | Meditation | 21 |
| u12@gt.edu | Reading | 20 |

Making (`Email`, `Interest`) the PK of CorrectResultStep1b guarantees that each (`Email`, `Interest`) pair appears in at most one tuple. Since every pair is unique, each pair is tied to exactly one `SinceAge`.

Another example involves the functional dependence
* BirthYear -> Salary

**CorrectResultStep1c**:

| <u>BirthYear</u> | Salary |
| --- | --- |
| 1985 | 27,000 |
| 1969 | 43,000 |
| 1967 | 48,000 |
| 1974 | 38,000 |

Every `BirthYear` in CorrectResultStep1c must be unique (PK) thus every `Salary` is uniquely tied to that `BirthYear`. Notice that there is a function $X \rightarrow Y$ associated with each example above.
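
A small sketch of how a key enforces the function, using Python's built-in `sqlite3` module with an in-memory database. The table names `UserInterest` and `BirthYearSalary`, the `try_insert` helper, and the rejected values (`11`, `99000`, `30`) are invented for this illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserInterest ("
    " Email TEXT NOT NULL, Interest TEXT NOT NULL, SinceAge INTEGER,"
    " PRIMARY KEY (Email, Interest))"
)
conn.execute(
    "CREATE TABLE BirthYearSalary ("
    " BirthYear INTEGER PRIMARY KEY, Salary INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO UserInterest VALUES (?, ?, ?)",
    [("u1@gt.edu", "Music", 10), ("u1@gt.edu", "Reading", 5),
     ("u1@gt.edu", "Tennis", 14), ("u2@gt.edu", "Blogging", 13),
     ("u3@gt.edu", "Meditation", 21), ("u12@gt.edu", "Reading", 20)],
)
conn.executemany(
    "INSERT INTO BirthYearSalary VALUES (?, ?)",
    [(1985, 27000), (1969, 43000), (1967, 48000), (1974, 38000)],
)

def try_insert(sql, params):
    """Return True if the row was accepted, False if a key rejected it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

# A second SinceAge for the same (Email, Interest) pair violates the key.
second_age = try_insert("INSERT INTO UserInterest VALUES (?, ?, ?)",
                        ("u1@gt.edu", "Music", 11))
# A second Salary for the same BirthYear violates the key.
second_salary = try_insert("INSERT INTO BirthYearSalary VALUES (?, ?)",
                           (1985, 99000))
# A new (Email, Interest) pair is fine.
new_pair = try_insert("INSERT INTO UserInterest VALUES (?, ?, ?)",
                      ("u2@gt.edu", "Music", 30))
print(second_age, second_salary, new_pair)
```

The database rejects any tuple that would give an existing key value a second right-hand side, which is exactly what the functional dependency requires.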

### Normal Forms

To help us recognize how well a relation is laid out we need the concept of normal forms (in ascending order of "goodness"):
* NF$^{2}$ - non-first normal form: the whole set of data structures: mostly non-relations
* 1NF - First normal form
* 2NF - Second normal form
* 3NF - Third normal form
* BCNF - Boyce-Codd normal form

We should always strive to achieve BCNF in our database design endeavors.
* A relation in BCNF is also in 3NF -> 2NF -> 1NF
    * Similarly a relation in 3NF is also in 2NF -> 1NF
    * Similarly a relation in 2NF is also in 1NF
    * However, a relation that is in 2NF may not necessarily be in 3NF
    
#### Normal Form Definitions

* NF$^{2}$: non-first normal form

* 1NF: $R$ is in 1NF iff all domain values are atomic
    * Relation: A data structure where all domain values are pulled from sets of atomic values
        * All relations are naturally born in 1NF
        
* 2NF: $R$ is in 2NF iff $R$ is in 1NF and every nonkey attribute is fully dependent on the key

* 3NF: $R$ is in 3NF iff $R$ is in 2NF and every nonkey attribute is non-transitively dependent on the key.

* BCNF (Boyce-Codd Normal Form): $R$ is in BCNF iff every determinant is a candidate key.

* Determinant: A set of attributes on which some other attribute is fully functionally dependent.

*"All attributes must depend on the key (1NF), the whole key (2NF), and nothing but the key (3NF), so help me Codd!"*

- Kent (1983), Diehr (1984)

#### 1NF

Given the functional dependency diagram below:

![](p2.svg)

**1NF: t1**

* Email -> CurrCity
* Email -> BirthYear
* Email, BirthYear -> Salary
* Email, Interest -> SinceAge

#### BCNF

![](p3.svg)

**BCNF: t2**
* Email, Interest -> SinceAge

If we split t1 into multiple relations including subset t2, each (`Email`, `Interest`) pair is guaranteed to have exactly one `SinceAge` due to the PK(`Email`,`Interest`).

#### 2NF

![](p4.svg)

**2NF: t3**
* Email -> CurrCity
* Email -> BirthYear
* Email, BirthYear -> Salary

`CurrCity`, `BirthYear`, and `Salary` are all fully dependent on the key `Email`, so t3 is in 2NF. The only thing preventing the t3 relation from being in 3NF (and therefore BCNF) is the transitive dependency of `Salary` on `Email` through `BirthYear`.

However, we can decompose t3 further into relations t4 and t5 to achieve 3NF & BCNF:

#### 3NF & BCNF

![](p5.svg)

* Email -> CurrCity
* Email -> BirthYear
* BirthYear -> Salary

Both t4 and t5 are in 3NF and BCNF. With PK(`Email`), t4 contains no transitive dependency. `Salary` is the only nonkey attribute in t5, so t5 is naturally in BCNF with PK(`BirthYear`).
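
The BCNF definition ("every determinant is a candidate key") can be checked mechanically. The Python sketch below makes two assumptions: the attribute sets of t4 and t5 are inferred from the dependencies listed above, since the figures are not reproduced here, and t3 is checked against the underlying dependency BirthYear -> Salary. The helpers `closure` and `violations` are hypothetical names, and only the listed dependencies are tested, which is enough for relations this small:

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def violations(relation_attrs, fds):
    """Non-trivial dependencies whose determinant is not a superkey."""
    attrs = set(relation_attrs)
    bad = []
    for lhs, rhs in fds:
        nontrivial = not set(rhs) <= set(lhs)
        if nontrivial and not attrs <= closure(lhs, fds):
            bad.append((lhs, rhs))
    return bad

T3 = ("Email", "CurrCity", "BirthYear", "Salary")
T3_FDS = [(("Email",), ("CurrCity",)),
          (("Email",), ("BirthYear",)),
          (("BirthYear",), ("Salary",))]

T4 = ("Email", "CurrCity", "BirthYear")   # assumed attribute set
T4_FDS = [(("Email",), ("CurrCity",)),
          (("Email",), ("BirthYear",))]

T5 = ("BirthYear", "Salary")              # assumed attribute set
T5_FDS = [(("BirthYear",), ("Salary",))]

print(violations(T3, T3_FDS))
print(violations(T4, T4_FDS))
print(violations(T5, T5_FDS))
```

Only t3 reports a violation: `BirthYear` determines `Salary` but does not determine the rest of t3, so it is a determinant that is not a key.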

### Computing with Functional Dependencies

#### Armstrong's Rules

These rules govern computing with functional dependencies over decomposed relations. They ensure that we do not lose information and still meet all functional dependencies (functional requirements):
* ***Reflexivity***:
*If Y is part of X, then $X \rightarrow Y$*<br>
Email, Interest $\rightarrow$ Interest<br>
Interest appears on both the LHS and the RHS of the arrow, so the dependency holds trivially<br><br>
* ***Augmentation***:
*If $X \rightarrow Y$, then $WX \rightarrow WY$*<br>
If Email $\rightarrow$ BirthYear, then Email, Interest $\rightarrow$ BirthYear, Interest<br><br>
* ***Transitivity***:
*If $X \rightarrow Y$ and $Y \rightarrow Z$, then $X \rightarrow Z$*<br>
If Email $\rightarrow$ BirthYear and BirthYear $\rightarrow$ Salary, then Email $\rightarrow$ Salary<br><br>
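
Applying the three rules repeatedly amounts to computing an attribute closure: the set of everything a given set of attributes determines. A minimal Python sketch of the standard fixed-point computation follows; the `closure` name and the tuple encoding of the dependencies are choices made for this example:

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the given dependencies."""
    result = set(attrs)            # reflexivity: attrs determine themselves
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already determine lhs, transitivity adds rhs.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# The functional dependencies of RegularUser.
FDS = [
    (("Email",), ("BirthYear", "CurrentCity", "Salary")),
    (("Email", "Interest"), ("SinceAge",)),
    (("BirthYear",), ("Salary",)),
]

print(sorted(closure({"Email"}, FDS)))
print(sorted(closure({"Email", "Interest"}, FDS)))
print(sorted(closure({"BirthYear"}, FDS)))
```

The closure of {`Email`, `Interest`} covers all six attributes, which is why that pair can serve as a key of RegularUser, while `Email` alone never reaches `Interest` or `SinceAge`.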

##### How to Guarantee Lossless Joins

How to guarantee a lossless join when decomposing a relation into smaller relations:<br>
The join field must be a key in at least one of the relations!

Given:

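**t6**:

| <u>Email</u> | <u>Interest</u> | SinceAge | BirthYear | CurrentCity |
| --- | --- | --- | --- | --- |
| u1@gt.edu | Music | 10 | 1985 | Seattle |
| u1@gt.edu | Reading | 5 | 1985 | Seattle |
| u1@gt.edu | Tennis | 14 | 1985 | Seattle |
| u2@gt.edu | Blogging | 13 | 1969 | Austin |
| u3@gt.edu | Meditation | 21 | 1967 | Austin |
| u12@gt.edu | Reading | 20 | 1974 | San Diego |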

**t6a**:

| <u>Email</u> | <u>Interest</u> | SinceAge |
| --- | --- | --- |
| u1@gt.edu | Music | 10 |
| u1@gt.edu | Reading | 5 |
| u1@gt.edu | Tennis | 14 |
| u2@gt.edu | Blogging | 13 |
| u3@gt.edu | Meditation | 21 |
| u12@gt.edu | Reading | 20 |

**t6b**:

| <u>Email</u> | BirthYear | CurrentCity |
| --- | --- | --- |
| u1@gt.edu | 1985 | Seattle |
| u2@gt.edu | 1969 | Austin |
| u3@gt.edu | 1967 | Austin |
| u12@gt.edu | 1974 | San Diego |

If we join t6a to the relation t6b, we do it on the <u>Email</u> field.<br>
Since <u>Email</u> is a key in one of the two relations (t6b), we are guaranteed not to lose information by decomposing relation t6 this way.<br>
When the join field is a key, as it is in the example above, each tuple on the other side matches at most one tuple, so the join cannot create spurious tuples.
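
The rule can be phrased as a test on a two-way decomposition: the shared attributes must determine all of one side. Below is a hedged Python sketch; the `closure` and `lossless` helpers are illustrative names, and the second call revisits the ResultStep1a / ResultStep1b split from earlier for contrast:

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def lossless(r1, r2, fds):
    """True if the common attributes are a key of r1 or of r2."""
    common = set(r1) & set(r2)
    reach = closure(common, fds)
    return set(r1) <= reach or set(r2) <= reach

FDS = [
    (("Email",), ("BirthYear", "CurrentCity", "Salary")),
    (("Email", "Interest"), ("SinceAge",)),
    (("BirthYear",), ("Salary",)),
]

t6a = ("Email", "Interest", "SinceAge")
t6b = ("Email", "BirthYear", "CurrentCity")
print(lossless(t6a, t6b, FDS))

step1a = ("CurrentCity", "Salary")
step1b = ("Email", "Interest", "SinceAge", "BirthYear", "CurrentCity")
print(lossless(step1a, step1b, FDS))
```

`Email` determines everything in t6b, so the first split passes. `CurrentCity` determines nothing but itself, so the second split fails the test.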

##### How to Guarantee Preservation of Functional Dependencies:

The meaning implied by the remaining functional dependencies must be the same as the meaning that was implied by the original set!<br>

Let's look at the difference between the 2NF decomposition and the BCNF decomposition from above:

![](p6.svg)
**2NF**

* Email -> CurrCity
* Email -> BirthYear
* Email, BirthYear -> Salary

**BCNF**

* Email -> CurrCity
* Email -> BirthYear
* BirthYear -> Salary

The only difference is in the last functional dependency:<br>
Email, BirthYear -> Salary vs. BirthYear -> Salary<br>
Why is this simplification allowed? ***Transitivity***.<br>
Recall that transitivity states: if $X \rightarrow Y$ and $Y \rightarrow Z$, then $X \rightarrow Z$<br>
In this case, Email -> BirthYear and BirthYear -> Salary, so Email -> Salary<br>
Transitivity therefore lets us derive Email -> Salary from the remaining dependencies, so it does not need to be stored explicitly.
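
One way to check that nothing was lost is to confirm that every dependency in the 2NF list can still be derived from the BCNF list. A minimal Python sketch, where `closure` and `implies` are hypothetical helper names:

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be derived from fds."""
    return set(rhs) <= closure(lhs, fds)

TWO_NF_FDS = [
    (("Email",), ("CurrCity",)),
    (("Email",), ("BirthYear",)),
    (("Email", "BirthYear"), ("Salary",)),
]
BCNF_FDS = [
    (("Email",), ("CurrCity",)),
    (("Email",), ("BirthYear",)),
    (("BirthYear",), ("Salary",)),
]

# Every dependency from the 2NF list must follow from the BCNF list.
all_preserved = all(implies(BCNF_FDS, lhs, rhs) for lhs, rhs in TWO_NF_FDS)
# The transitive dependency is derivable as well.
email_salary = implies(BCNF_FDS, ("Email",), ("Salary",))
print(all_preserved, email_salary)
```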

##### Using Armstrong's Rules to Check Decompositions

1. Check for lossless joins
2. Ensure all functional dependencies are preserved

Remember that there do exist relations that are in 3NF but not in BCNF, although such relations are very, very rare in practice!<br>
***Always strive for 3NF!***