# Practical Database Normalisation and Data Anomalies
© Explore Data Science Academy

## Learning Objectives
In this train, you will learn how to:
* Normalise a database up to the $3^{rd}$ Normal Form.
* Understand data anomalies and how database normalisation reduces the likelihood of their occurrence. 



## Outline

This train is structured as follows: 

* Data Anomalies
* Imports and DB Connections
* Denormalised Table
* Convert to $1^{st}$ Normal Form
* Convert to $2^{nd}$ Normal Form
* Convert to $3^{rd}$ Normal Form

## Introduction

In this train we will give a practical example of database normalisation. Database normalisation is a design technique for decoupling table structures to ***reduce*** data redundancies. We will use the tools and techniques that we have learned throughout the course to achieve this result. At each step of the normalisation process, we will reflect on the data anomalies being addressed, and on how normalisation attempts to remedy them.


## Data Anomalies

Data anomalies are issues that present themselves in poorly structured or denormalised databases. The following are examples of commonly occurring anomalies which you may find: 

 - **Deletion Anomaly**: The deletion of a record leads to the unintentional removal of another required attribute from the database. 
 - **Insertion Anomaly**: A record cannot be inserted because it requires additional data which may not yet be available.
 - **Update Anomaly**: This occurs when we have duplicated data; if we were to update the affected rows and a single row gets missed, this will lead to a data inconsistency.
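
To make the three anomalies concrete, here is a minimal sketch using an in-memory SQLite database. The `staff` table and its rows are hypothetical and are not part of **SoftDevEmployees.db**; they only illustrate the behaviour described in the list above.

```python
import sqlite3

# Hypothetical miniature table (not the SoftDevEmployees data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT NOT NULL, role TEXT NOT NULL, salary REAL)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("Ann", "Developer", 100.0),
        ("Ann", "Team Lead", 100.0),    # Ann's salary is stored twice
        ("Bob", "Scrum Master", 80.0),
    ],
)

# Update anomaly: only one of Ann's two rows receives the raise.
conn.execute("UPDATE staff SET salary = 120.0 WHERE name = 'Ann' AND role = 'Developer'")
ann_salaries = sorted(
    row[0] for row in conn.execute("SELECT DISTINCT salary FROM staff WHERE name = 'Ann'")
)

# Insertion anomaly: a new hire without a role cannot be stored at all.
try:
    conn.execute("INSERT INTO staff (name, salary) VALUES ('Cara', 60.0)")
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

# Deletion anomaly: removing Bob also removes the only record of his role.
conn.execute("DELETE FROM staff WHERE name = 'Bob'")
remaining_roles = sorted(row[0] for row in conn.execute("SELECT DISTINCT role FROM staff"))

print(ann_salaries)      # two different salaries for the same person
print(insert_rejected)   # the role-less insert was refused
print(remaining_roles)   # 'Scrum Master' is no longer listed
```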



## Imports and DB Connections

Please use the command below to install **`sql_magic`** if you do not already have it. This package will assist you with SQL syntax highlighting.
* `pip install sql_magic`

Remember to start each new cell with:  **`%%read_sql`**


In [1]:
import sqlite3
import csv
from sqlalchemy import create_engine
%load_ext sql_magic

# create engine instance using sqlalchemy
engine = create_engine("sqlite:///SoftDevEmployees.db")
%config SQL.conn_name = 'engine'

# create connection object using sqlite3
conn = sqlite3.connect('SoftDevEmployees.db')
cursor = conn.cursor()

## Denormalised Database

Let us have a look at the **SoftDevEmployees.db** database which contains a single table called **Employees**. Here we can observe, as depicted in the ERD sketch below, that the database is in its denormalised form. Our goal within this train, however, is to transform this database to conform to the $3^{rd}$ Normal Form. 


<img src="DenoramlizedEmployeesTable.png" alt="Denormalised Employees table" border="0">

## The $1^{st}$ Normal Form - 1NF

To convert our database to the $1^{st}$  normal form we need to make sure that we meet the following conditions: 

1. A cell cannot hold multiple values; it can only hold a single (atomic) value.
2. Values stored in one column should be of the same [domain](https://en.wikipedia.org/wiki/Attribute_domain).
3. The order in which data is stored should not matter.
4. The column names are unique.

Looking at the contents of the database, it is clear that the **`Employees`** table is not in the first normal form. The columns **`FullName`**, **`Role`** and **`Department`** do not hold single (atomic) values as required by the first normal form. 

Let us write a small query to see the rows with non-atomic values in the **`Role`** or **`Department`** cells.

In [2]:
%%read_sql
SELECT * 
FROM employees
WHERE Role LIKE '%,%'    -- we use the LIKE keyword to search for the comma "," delimiter
OR Department LIKE '%,%' -- we use the LIKE keyword to search for the comma "," delimiter

Query started at 07:35:41 AM SAST; Query executed in 0.00 m

Unnamed: 0,FullName,Title,Role,OccupationBand,Salary,Department
0,"Dumisani, Thwala",Mr,Back-End Developer,Graduate,52171,"Web Applications, Mobile Applications"
1,"Dirk,Banda",Mr,Business Analyst,Intern,37601,"Web Applications,Mobile Applications"
2,"barend,Edwards",MR,Database Analyst,Intern,13163,"Web Applications, Mobile Applications"
3,"kelly ,Manuel",Ms,Full-Stack Developer,Intern,47442,"Web Applications, Mobile Applications"
4,"Janet,Patel",Ms,Systems Analyst,Intern,39081,"Web Applications, Mobile Applications"
5,"Christopher, Walker",Mr,Back-End Developer,Junior,122894,"Web Applications, Mobile Applications"
6,"Marco , Morris",prof,Back-End Developer,Mid-Level,110506,"Web Applications, Mobile Applications"
7,"Danie ,Campbell",Mrs,Business Analyst,Mid-Level,205621,"Web Applications, Mobile Applications"
8,"Jessica ,Mchunu",miss,"Full-Stack Developer, Scrum Master",Mid-Level,70741,Web Applications
9,"Laura,Makhanya",Ms,"Full-Stack Developer, Team Lead",Senior,293352,Mobile Applications


### Steps required to convert to 1NF
To convert the table to the first normal form, we will need to do two things: 

1. The first step is to reduce the content in each cell to ensure that we only store a single (atomic) value. Looking at the **`FullName`** column we see that it is in the form ***Name,Surname***, so it is logical to split the column into two new columns.

2. Secondly, we need to split the content for the **`Role`** and **`Department`** columns as employees can have more than one role, or belong to multiple departments. For this change we will duplicate the row and insert the correct 'Role' or 'Department' attribute values required.

<img src ="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/Practical_Normalization/1NF.png" alt="first Normal Form" >

Let's get to it by first creating the required table. In our case we will drop the **`FullName`** column and introduce the **`Name`** and **`Surname`** columns:

In [3]:
%%read_sql

DROP TABLE IF EXISTS Employees_1NF; -- We delete our table for convenience when re-running this cell. 

CREATE TABLE Employees_1NF (
    Name VARCHAR NOT NULL, 
    Surname VARCHAR NOT NULL,
    Role VARCHAR NOT NULL,
    Department VARCHAR NOT NULL,
    Title VARCHAR,
    OccupationBand VARCHAR,
    Salary REAL,
    PRIMARY KEY(Name, Surname, Role, Department) 
);

Query started at 07:35:41 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3dda7e9190>

Before we move on to inserting values in our newly created table, let us write a few simple queries that will guide the insertion of data. Firstly, let us take a moment to consider how we will split the content in the cells such that each cell only contains one piece of data.

In [4]:
%%read_sql
SELECT 
    FullName,
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name, --Get substring before comma
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname --Get substring after comma
FROM Employees
LIMIT 5;

Query started at 07:35:41 AM SAST; Query executed in 0.00 m

Unnamed: 0,FullName,Name,Surname
0,"Dumisani, Thwala",Dumisani,Thwala
1,"Tony, Horn",Tony,Horn
2,"Vuyokazi,barnes",Vuyokazi,barnes
3,"sello ,Details",sello,Details
4,"Jacqueline ,fredericks",Jacqueline,fredericks


Now let us take a moment to understand this query by unpacking the string manipulation functions being employed. We will take the first row as an example, where **`FullName`** = '*Dumisani, Thwala*':

```sql
    TRIM(SUBSTR('Dumisani, Thwala',1,INSTR('Dumisani, Thwala',',')-1))
```
The first function call determines the index position of the comma: **`INSTR('Dumisani, Thwala',',')`** = 9.

The second function call returns the substring that appears before the comma: **`SUBSTR('Dumisani, Thwala',1,9-1)`** = 'Dumisani'.

The third function call removes any whitespace that might appear at either end of our substring: **`TRIM('Dumisani')`** = 'Dumisani'. In this row there is nothing to remove, but for a value such as '*kelly ,Manuel*' the substring before the comma is 'kelly ' and **`TRIM`** strips the trailing space.

The same explanation will hold true for the creation of the Surname, except that we are looking for the substring after the comma.
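
If you would like to check each intermediate value yourself, the sketch below evaluates the same SQLite functions from Python against an in-memory database. It does not touch **SoftDevEmployees.db**; the `:fn` parameter name and the variable names are our own, chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
full_name = "Dumisani, Thwala"

# Evaluate each step of the split as a separate output column.
comma_pos, before_comma, name, surname = conn.execute(
    """
    SELECT
        INSTR(:fn, ','),                            -- position of the comma
        SUBSTR(:fn, 1, INSTR(:fn, ',') - 1),        -- text before the comma
        TRIM(SUBSTR(:fn, 1, INSTR(:fn, ',') - 1)),  -- trimmed Name
        TRIM(SUBSTR(:fn, INSTR(:fn, ',') + 1))      -- trimmed Surname
    """,
    {"fn": full_name},
).fetchone()

print(comma_pos, repr(before_comma), repr(name), repr(surname))
```

Changing `full_name` to one of the messier values in the table, such as `"kelly ,Manuel"`, shows where `TRIM` actually makes a difference.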


**Try it yourself: Write a similar query for the `Role` and `Department` Columns**

In [5]:
%%read_sql
--Write your query here


You may have realised after attempting the above that, for the **`Role`** and **`Department`** columns, we cannot naively split them into multiple columns. This would require us to create `Role_1`, `Role_2`, ..., `Role_n` columns (the same is true for `Department`), which is not ideal. If we do this, we will potentially introduce many null values into the table. Furthermore, we ideally want to grow the table on a row basis, as this does not require a change in the table structure, unlike growing the table by its columns/attributes.

So for the next step let's see how we are going to split the contents of the cell and create a duplicate entry for it. We will approach the problem by creating three logical sets and joining them together using a union.

1. The set of all entries containing the first **`Role`** or **`Department`** for all non-atomic cells.
2. The set of all entries containing the second **`Role`**  or **`Department`** for all non-atomic cells.
3. The set of all entries that only contain atomic cells.

**Note:** There are more efficient ways of doing this task, such as using a programming language which will inherently have more data structures available for use. However, for the purposes of this train, we will assume that SQL is the only tool available. So let us flex our SQL Ninja skills to get this done!

In [36]:
%%read_sql
--Below is the INSERT query for the first normal form.
DELETE FROM Employees_1NF;
INSERT INTO Employees_1NF (Name,Surname,Title,Role,OccupationBand,Salary,Department)
/*SET #1 ======================================================================================
   The set of all entries containing the first `Role` or `Department` for all non-atomic cells. 
==============================================================================================*/
SELECT 
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name,     -- Splitting FullName to obtain Name,
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname,    -- Splitting FullName to obtain Surname
    UPPER(SUBSTR(Title,1,1)) || LOWER(SUBSTR(Title,2)) AS Title, -- Standardizing all Titles to start with a Capital letter
    TRIM(SUBSTR(Role,1,INSTR(Role,',')-1))  AS Role,       --return the first Role (substring before comma)
    OccupationBand,
    Salary,
    Department
FROM
    Employees
WHERE Role LIKE '%,%' -- Target all entries that have non-atomic values in the Role column

UNION 

SELECT 
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name,     -- Splitting FullName to obtain Name,
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname,    -- Splitting FullName to obtain Surname
    UPPER(SUBSTR(Title,1,1)) || LOWER(SUBSTR(Title,2)) AS Title, -- Standardizing all Titles to start with a Capital letter
    Role,
    OccupationBand,
    Salary,
    TRIM(SUBSTR(Department,1,INSTR(Department,',')-1)) AS Department--return the first Department (substring before comma)
     
FROM
    Employees
WHERE Department LIKE '%,%' -- Target all entries that have non-atomic values in the Department column

UNION

/*SET #2 ======================================================================================
   The set of all entries containing the second `Role` or `Department` for all non-atomic cells. 
==============================================================================================*/

SELECT
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name,     -- Splitting FullName to obtain Name,
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname,    -- Splitting FullName to obtain Surname
    UPPER(SUBSTR(Title,1,1)) || LOWER(SUBSTR(Title,2)) AS Title, -- Standardizing all Titles to start with a Capital letter
    TRIM(SUBSTR(Role,INSTR(Role,',')+1))  AS Role,      --return the second Role (substring after comma)
    OccupationBand,
    Salary,
    Department
FROM
    Employees
WHERE Role LIKE '%,%' -- Target all entries that have non-atomic values in the Role column

UNION

SELECT 
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name,     -- Splitting FullName to obtain Name,
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname,    -- Splitting FullName to obtain Surname
    UPPER(SUBSTR(Title,1,1)) || LOWER(SUBSTR(Title,2)) AS Title, -- Standardizing all Titles to start with a Capital letter
    Role,
    OccupationBand,
    Salary,
    TRIM(SUBSTR(Department,INSTR(Department,',')+1)) AS Department --return the second Department (substring after comma)
     
FROM
    Employees
WHERE Department LIKE '%,%' -- Target all entries that have non-atomic values in the Department column

UNION
/*SET #3 ====================================================================================
   The set of all entries that only contain atomic cells. 
============================================================================================*/
SELECT 
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name,     --Splitting FullName to obtain Name,
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname,    --Splitting FullName to obtain Surname
    UPPER(SUBSTR(Title,1,1)) ||LOWER(SUBSTR(Title,2)) AS Title, --Standardizing all Title to start with a Capital letter
    Role,
    OccupationBand,
    Salary,
    Department
FROM
    Employees
WHERE Role NOT LIKE '%,%' AND Department NOT LIKE '%,%' --Targets only the atomic values
    


Query started at 08:19:53 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3dda4c6d90>
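
To see the set logic at a smaller scale, the sketch below applies the same idea to a single column only. The `sample` table is hypothetical (two rows borrowed from the output shown earlier) and lives in an in-memory database, so it leaves **Employees_1NF** untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (person TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [
        ("Jessica", "Full-Stack Developer, Scrum Master"),  # non-atomic cell
        ("Dirk", "Business Analyst"),                       # already atomic
    ],
)

rows = conn.execute(
    """
    -- first role of every non-atomic cell
    SELECT person, TRIM(SUBSTR(role, 1, INSTR(role, ',') - 1)) AS role
    FROM sample WHERE role LIKE '%,%'
    UNION
    -- second role of every non-atomic cell
    SELECT person, TRIM(SUBSTR(role, INSTR(role, ',') + 1)) AS role
    FROM sample WHERE role LIKE '%,%'
    UNION
    -- cells that were atomic to begin with
    SELECT person, role
    FROM sample WHERE role NOT LIKE '%,%'
    ORDER BY person, role
    """
).fetchall()

for row in rows:
    print(row)
```

Jessica's single row becomes two rows, one per role, while Dirk's row passes through unchanged. Like the full query, this only handles cells with at most two comma-separated values.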

### Cement your understanding: Explore the query

Take time to reflect on the query given in the previous cell. Use the below cell to play around with certain chunks of the query to cement your understanding. Explore the following elements of the query:

1. Explore the individual sets
2. Make sure you understand the ***WHERE*** clause conditions for sets 1 and 2
3. Make sure you understand the ***WHERE*** clause conditions for set 3

**NOTE:** You are not restricted to the one cell below, you can create as many cells as you wish.

In [7]:
%%read_sql

-- Write your experimental queries here

### Exercise: Rewrite the above query in a more compact manner:

The above query was naively written and, as a result, consists of four unions. This can actually be reduced to only two unions, because set 1 consists of two smaller subsets, and the same is true for set 2. Your task is to ensure that we only have three pure sets (instead of the five that are present in our current query).

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/Practical_Normalization/Sets.png" alt="Query sets">



In [None]:
%%read_sql
--Write your query here.


### Checkpoint: Data Anomalies

You can think of 1NF as the "*common sense*" form which will allow you to write meaningful SQL queries without too much hassle.

Although we have transformed the table to its first normal form, data anomalies still exist: 

 - **Deletion Anomaly**: If we delete Jessica Mchunu from the table, the **Scrum Master** role will be removed from the database as well.

 - **Update Anomaly**: Christopher's name appears twice in the table. If he were to get a raise and only one entry was updated and the other one missed, it would cause a data inconsistency -  making it seem as if he is getting two different salaries.

 - **Insertion Anomaly**: Some companies like to hire talent without necessarily assigning them to a department or role, as they want them to rotate throughout the company until they find their niche. This database will not allow them to capture that information, as it requires that all employees belong to a department and have at least one role.



## Converting to the $2^{nd}$ Normal Form

To convert to the $2^{nd}$  normal form we need to make sure that we meet the following conditions: 

1. The table needs to be in the $1^{st}$ normal form
2. ***The table should not contain any partial dependencies.***

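Point number two simply means that every non-key attribute should depend on the whole primary key, not just on part of it. For example, in **Employees_1NF** the **`Salary`** depends only on the employee (**`Name`**, **`Surname`**) and not on the **`Role`** or **`Department`** parts of the key. This translates to each table serving a single purpose. Therefore, the strategy to "employ" here is to create new tables such that each table serves a single purpose. Have a look at the ERD sketch given below.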

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/Practical_Normalization/2NF.png" alt="Second Normal Form">
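
One way to look for a partial dependency in the data is to group by part of the key and count the distinct values of a non-key column. The sketch below does this for `Salary` on a hypothetical `emp_1nf` table, which holds four rows shaped like those in **Employees_1NF**. Note that such a check can only suggest a dependency from the data at hand; whether it truly holds is a business rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE emp_1nf (
        Name TEXT, Surname TEXT, Role TEXT, Department TEXT, Salary REAL,
        PRIMARY KEY (Name, Surname, Role, Department)
    )
    """
)
conn.executemany(
    "INSERT INTO emp_1nf VALUES (?, ?, ?, ?, ?)",
    [
        ("Christopher", "Walker", "Back-End Developer", "Web Applications", 122894),
        ("Christopher", "Walker", "Back-End Developer", "Mobile Applications", 122894),
        ("Laura", "Makhanya", "Full-Stack Developer", "Mobile Applications", 293352),
        ("Laura", "Makhanya", "Team Lead", "Mobile Applications", 293352),
    ],
)

# Employees whose salary differs between rows would contradict the idea
# that Salary is determined by (Name, Surname) alone.
violations = conn.execute(
    """
    SELECT Name, Surname
    FROM emp_1nf
    GROUP BY Name, Surname
    HAVING COUNT(DISTINCT Salary) > 1
    """
).fetchall()

# No violations: Salary is repeated per employee, so it can move to its own table.
salary_depends_on_employee_only = len(violations) == 0
print(salary_depends_on_employee_only)
```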

Note that most of the heavy lifting was performed when setting up the $1^{st}$ normal form, so we will use this table to create the required fields for the $2^{nd}$ normal form. To start off, we will create the tables required. For each table we will pay special attention to the relationships that exist between tables. This will enable us to create foreign keys appropriately to maintain the [referential integrity](https://en.wikipedia.org/wiki/Referential_integrity) of the tables.

In [8]:
%%read_sql

DROP TABLE IF EXISTS Employees_2NF;
DROP TABLE IF EXISTS Titles_2NF;
DROP TABLE IF EXISTS Roles_2NF;
DROP TABLE IF EXISTS Departments_2NF;
DROP TABLE IF EXISTS Employee_Department_2NF;
DROP TABLE IF EXISTS Employee_Role_2NF;

CREATE TABLE Employees_2NF (
    EmployeeID INTEGER NOT NULL,
    Name VARCHAR, 
    Surname VARCHAR,
    Salary REAL,
    OccupationBand VARCHAR,
    TitleID INTEGER,
    FOREIGN KEY(TitleID) REFERENCES Titles_2NF (TitleID), 
    PRIMARY KEY(EmployeeID AUTOINCREMENT)
);

CREATE TABLE Titles_2NF (
    TitleID INTEGER NOT NULL,
    Title VARCHAR,
    PRIMARY KEY(TitleID AUTOINCREMENT)
);


CREATE TABLE Employee_Role_2NF(
    EmployeeID INTEGER NOT NULL,
    RoleID INTEGER NOT NULL,
    FOREIGN KEY (EmployeeID) REFERENCES Employees_2NF (EmployeeID),
    FOREIGN KEY (RoleID) REFERENCES Roles_2NF (RoleID),
    PRIMARY KEY(EmployeeID, RoleID)
);

CREATE TABLE Employee_Department_2NF(
    EmployeeID INTEGER NOT NULL,
    DepartmentID INTEGER NOT NULL,
    FOREIGN KEY (EmployeeID) REFERENCES Employees_2NF (EmployeeID),
    FOREIGN KEY (DepartmentID) REFERENCES Departments_2NF (DepartmentID),
    PRIMARY KEY(EmployeeID, DepartmentID)
);

CREATE TABLE Roles_2NF (
    RoleID INTEGER NOT NULL,
    Role VARCHAR,
    PRIMARY KEY(RoleID AUTOINCREMENT)
);

CREATE TABLE  Departments_2NF (
    DepartmentID INTEGER NOT NULL,
    Department VARCHAR,
    PRIMARY KEY(DepartmentID AUTOINCREMENT)
);


Query started at 07:35:42 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3dda71af10>
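
A point worth knowing here is that SQLite does not enforce foreign key constraints unless enforcement is switched on for the connection with `PRAGMA foreign_keys = ON`. The sketch below shows the effect on two hypothetical tables in an in-memory database. It assumes an SQLite build with foreign key support, which is the norm for the library bundled with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite

conn.execute("CREATE TABLE titles (TitleID INTEGER PRIMARY KEY, Title TEXT)")
conn.execute(
    """
    CREATE TABLE employees (
        EmployeeID INTEGER PRIMARY KEY,
        Name TEXT,
        TitleID INTEGER,
        FOREIGN KEY (TitleID) REFERENCES titles (TitleID)
    )
    """
)
conn.execute("INSERT INTO titles (TitleID, Title) VALUES (1, 'Mr')")
conn.execute("INSERT INTO employees (Name, TitleID) VALUES ('Dirk', 1)")  # valid reference

try:
    # TitleID 99 does not exist in titles, so this row would be an orphan.
    conn.execute("INSERT INTO employees (Name, TitleID) VALUES ('Ghost', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(orphan_rejected, employee_count)
```

Without the `PRAGMA` line, the second insert would be accepted and the table would contain a reference to a title that does not exist.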

## Populating tables for the $2^{nd}$ Normal Form

Let us proceed to populate the **Titles_2NF**, **Roles_2NF** and **Departments_2NF** tables, as the queries for these insertions are fairly trivial. They all consist of simple SELECT DISTINCT queries on the **Employees_1NF** table.

In [9]:
%%read_sql
DELETE FROM Titles_2NF;
DELETE FROM Roles_2NF;
DELETE FROM Departments_2NF;

INSERT INTO Titles_2NF (Title)
SELECT 
    DISTINCT Title 
FROM Employees_1NF 
WHERE Title <> '';

INSERT INTO Roles_2NF (Role)
SELECT 
    DISTINCT Role
FROM Employees_1NF
WHERE Role <>'';

INSERT INTO Departments_2NF (Department)
SELECT
    DISTINCT Department
FROM Employees_1NF
WHERE Department <>'';

Query started at 07:35:42 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3dda739250>

Now we move onto the **Employees_2NF** table. This is where we start to take the foreign keys into account. Recall that the **`TitleID`** attribute references the **`TitleID`** column in the **Titles_2NF** table. This gives us a hint as to the table we need to join with in order to populate **`TitleID`** such that we maintain referential integrity.

In [10]:
%%read_sql
DELETE FROM Employees_2NF;

INSERT INTO Employees_2NF (Name, Surname, Salary, OccupationBand, TitleID)
SELECT DISTINCT
    EMP.Name,
    EMP.Surname,
    EMP.Salary,
    EMP.OccupationBand,
    T.TitleID
FROM Employees_1NF AS EMP
JOIN Titles_2NF AS T ON T.Title = EMP.Title;


Query started at 07:35:42 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3dda701090>

Now we move onto the contents of **Employee_Department_2NF** and **Employee_Role_2NF**. These tables are usually called 'mapping tables' and are usually made up of foreign keys. The logic on the joins used to populate these structures is again reliant on the foreign key references that we made during the tables' creation stage.

In [11]:
%%read_sql
DELETE FROM Employee_Department_2NF;
DELETE FROM Employee_Role_2NF;

INSERT INTO Employee_Department_2NF (EmployeeID,DepartmentID)
SELECT DISTINCT
    EMP2.EmployeeID,
    DPT.DepartmentID
FROM Employees_1NF AS EMP1
JOIN Employees_2NF AS EMP2 ON EMP1.Name = EMP2.Name AND EMP1.Surname = EMP2.Surname
JOIN Departments_2NF AS DPT ON EMP1.Department = DPT.Department;
    

INSERT INTO Employee_Role_2NF (EmployeeID,RoleID)
SELECT DISTINCT
    EMP2.EmployeeID,
    R.RoleID
FROM Employees_1NF AS EMP1
JOIN Employees_2NF AS EMP2 ON EMP1.Name = EMP2.Name AND EMP1.Surname = EMP2.Surname
JOIN Roles_2NF AS R ON EMP1.Role = R.Role

Query started at 07:35:42 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3e02bca690>
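
Mapping tables only store IDs, so reading them back requires joining through to the tables they reference. The following sketch shows the pattern on hypothetical miniature versions of the employee, role and mapping tables; the table names and rows are illustrative and live in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, Surname TEXT);
    CREATE TABLE roles (RoleID INTEGER PRIMARY KEY, Role TEXT);
    CREATE TABLE employee_role (
        EmployeeID INTEGER NOT NULL,
        RoleID INTEGER NOT NULL,
        PRIMARY KEY (EmployeeID, RoleID)
    );
    INSERT INTO employees VALUES (1, 'Jessica', 'Mchunu'), (2, 'Dirk', 'Banda');
    INSERT INTO roles VALUES (1, 'Full-Stack Developer'), (2, 'Scrum Master'), (3, 'Business Analyst');
    INSERT INTO employee_role VALUES (1, 1), (1, 2), (2, 3);
    """
)

# Follow each mapping row to the employee and the role it points at.
pairs = conn.execute(
    """
    SELECT E.Name, R.Role
    FROM employee_role AS ER
    JOIN employees AS E ON E.EmployeeID = ER.EmployeeID
    JOIN roles AS R ON R.RoleID = ER.RoleID
    ORDER BY E.Name, R.Role
    """
).fetchall()

for pair in pairs:
    print(pair)
```

The same two-join pattern should work against **Employee_Role_2NF** and **Employee_Department_2NF** in the notebook's database.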

### Checkpoint: Data Anomalies

 - **Deletion Anomaly**: We have eliminated the deletion anomalies that could occur on the **`Role`**, **`Department`** and **`Title`** columns by creating separate tables for them. For example, if Jessica Mchunu gets deleted from the **Employees_2NF** table, the **Scrum Master** role will continue to persist in the **Roles_2NF** table.

 - **Update Anomaly**: Christopher only appears once in the **Employees_2NF** table, so should he get a raise we only need to change his salary information in one place. This reduces the chances of having any data inconsistencies.

 - **Insertion Anomaly**: Now we can insert new graduates into the database without having to define a role or place them in a specific department.
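
The sketch below replays the insertion and deletion scenarios on a hypothetical, cut-down version of the 2NF structure (in-memory tables with made-up names), to show why they no longer cause problems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE roles (RoleID INTEGER PRIMARY KEY, Role TEXT);
    CREATE TABLE employee_role (
        EmployeeID INTEGER,
        RoleID INTEGER,
        PRIMARY KEY (EmployeeID, RoleID)
    );
    INSERT INTO employees VALUES (1, 'Jessica');
    INSERT INTO roles VALUES (1, 'Scrum Master');
    INSERT INTO employee_role VALUES (1, 1);
    """
)

# Insertion: a new hire can be stored before any role is assigned.
conn.execute("INSERT INTO employees (Name) VALUES ('New Graduate')")

# Deletion: removing Jessica and her role assignments leaves the role itself intact.
conn.execute("DELETE FROM employee_role WHERE EmployeeID = 1")
conn.execute("DELETE FROM employees WHERE EmployeeID = 1")

roles_left = [row[0] for row in conn.execute("SELECT Role FROM roles")]
staff_left = [row[0] for row in conn.execute("SELECT Name FROM employees")]
print(roles_left, staff_left)
```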

## Converting to the $3^{rd}$ Normal Form

For the tables to be in the $3^{rd}$ normal form we require the following conditions to be met:

1. The database needs to be in $2^{nd}$ normal form.
2. There should be no **transitive** dependencies on the **Primary Key**.

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/Practical_Normalization/3NF.png" alt = "Third Normal Form"/>

Generally speaking, one's salary is related to their occupation band, i.e. we do not expect a graduate to earn a salary similar to that of their senior counterparts. In this case we find ourselves with a transitive relation between **`EmployeeID`**, **`Salary`** and **`OccupationBand`**. The *occupation band* is dependent on the *salary*, and the *salary* is dependent on the *employee's ID*. So we can say that **`OccupationBand`** is transitively dependent on the Primary Key (**`EmployeeID`**) through the **`Salary`** column.

Most of the tables required for the third normal form are already available from our second normal form, so we will simply make a copy of these tables.

In [12]:
%%read_sql
DROP TABLE IF EXISTS Departments_3NF;
DROP TABLE IF EXISTS Employee_Department_3NF;
DROP TABLE IF EXISTS Employee_Role_3NF;
DROP TABLE IF EXISTS Roles_3NF;
DROP TABLE IF EXISTS Titles_3NF;

CREATE TABLE Departments_3NF AS
SELECT * FROM Departments_2NF;

CREATE TABLE Employee_Department_3NF AS
SELECT * FROM Employee_Department_2NF;

CREATE TABLE Employee_Role_3NF AS
SELECT * FROM Employee_Role_2NF;

CREATE TABLE Roles_3NF AS
SELECT * FROM Roles_2NF;

CREATE TABLE Titles_3NF AS
SELECT * FROM Titles_2NF;



Query started at 07:35:42 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3dda74aa90>

For the last step, we need to create two new tables for **Salaries** and **Occupation Bands**, and then modify the **Employees** table.

In [13]:
%%read_sql
DROP TABLE IF EXISTS Employees_3NF;
DROP TABLE IF EXISTS Salaries_3NF;
DROP TABLE IF EXISTS OccupationBands_3NF;

CREATE TABLE Employees_3NF (
    EmployeeID INTEGER NOT NULL,
    Name VARCHAR,
    Surname VARCHAR,
    TitleID INTEGER,
    SalaryID INTEGER,
    FOREIGN KEY(TitleID) REFERENCES Titles_3NF (TitleID),
    FOREIGN KEY(SalaryID) REFERENCES Salaries_3NF (SalaryID),
    PRIMARY KEY(EmployeeID AUTOINCREMENT)
);

CREATE TABLE Salaries_3NF(
    SalaryID INTEGER NOT NULL,
    Salary REAL,
    BandID INTEGER,
    FOREIGN KEY(BandID) REFERENCES OccupationBands_3NF (BandID),
    PRIMARY KEY(SalaryID AUTOINCREMENT)
);

CREATE TABLE OccupationBands_3NF (
    BandID INTEGER NOT NULL,
    OccupationBand VARCHAR,
    PRIMARY KEY(BandID AUTOINCREMENT)
);

Query started at 07:35:43 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3dda77b9d0>

## Populating tables for the $3^{rd}$ Normal Form

Having populated the tables for the $2^{nd}$ normal form will make it easier to populate the tables for the $3^{rd}$ normal form. We will also fall back on the same logic when it comes to populating the foreign key columns to maintain referential integrity - nothing changes. All joins will be dependent on the foreign key references we made when creating the tables.

In [14]:
%%read_sql

DELETE FROM OccupationBands_3NF;
DELETE FROM Salaries_3NF;
DELETE FROM Employees_3NF;

INSERT INTO OccupationBands_3NF (OccupationBand)
SELECT DISTINCT 
    OccupationBand
FROM Employees_2NF;

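INSERT INTO Salaries_3NF (Salary,BandID)
SELECT DISTINCT              -- one row per unique (Salary, BandID) pair
    EMP.Salary,
    OB.BandID
FROM Employees_2NF AS EMP
JOIN OccupationBands_3NF AS OB ON OB.OccupationBand = EMP.OccupationBand;

INSERT INTO Employees_3NF (EmployeeID, Name, Surname, TitleID, SalaryID)
SELECT 
    EMP.EmployeeID,          -- keep the 2NF IDs so the copied mapping tables still match
    EMP.Name,
    EMP.Surname,
    EMP.TitleID,
    S.SalaryID
FROM Employees_2NF AS EMP
JOIN OccupationBands_3NF AS OB ON OB.OccupationBand = EMP.OccupationBand
JOIN Salaries_3NF AS S ON S.Salary = EMP.Salary AND S.BandID = OB.BandID; -- match on both columns to avoid duplicate rows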
    

Query started at 07:35:43 AM SAST; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7f3dda7c0c90>

With our data now in the $3^{rd}$ normal form, we encourage you to explore the data schema and to look at the contents of the various tables in order to solidify your understanding of the transformations which have taken place. 

In [15]:
%%read_sql

--Write your experimental queries here.
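
As a starting point for this exploration, the sketch below shows two useful queries on a hypothetical miniature of the 3NF structure: listing the tables in a database via `sqlite_master`, and following the chain from an employee through their salary to their occupation band. The table names here are shortened stand-ins for the `_3NF` tables, created in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bands (BandID INTEGER PRIMARY KEY, OccupationBand TEXT);
    CREATE TABLE salaries (SalaryID INTEGER PRIMARY KEY, Salary REAL, BandID INTEGER);
    CREATE TABLE employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, SalaryID INTEGER);
    INSERT INTO bands VALUES (1, 'Graduate'), (2, 'Senior');
    INSERT INTO salaries VALUES (1, 52171, 1), (2, 293352, 2);
    INSERT INTO employees VALUES (1, 'Dumisani', 1), (2, 'Laura', 2);
    """
)

# sqlite_master is SQLite's built-in catalogue of schema objects.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# Employee -> Salary -> OccupationBand: the band is reached through the salary.
chain = conn.execute(
    """
    SELECT E.Name, S.Salary, B.OccupationBand
    FROM employees AS E
    JOIN salaries AS S ON S.SalaryID = E.SalaryID
    JOIN bands AS B ON B.BandID = S.BandID
    ORDER BY E.Name
    """
).fetchall()

print(tables)
print(chain)
```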

### Checkpoint: Data Anomalies

**Deletion Anomaly**: We have eliminated the deletion anomalies that could occur on the **`OccupationBand`**. If all Graduates were promoted to Juniors, the "*Graduate*" occupation band would still exist in the table - ready to be used when new graduates are hired.

All the **update** and **insertion** anomalies were catered for in the $2^{nd}$ normal form.



## Conclusion

We have now completed the normalisation process. From this train you should have a better understanding of some data anomalies which can occur in practice, and how they can be mitigated through database normalisation. However, it is important to note that it is sometimes not possible to eliminate all redundancies and data anomalies - we can only minimise them.

This design paradigm will assist you in creating relational databases which have minimized data redundancies and avoid data anomalies. Referential Integrity plays a key role in understanding the relationships between tables - which are usually underpinned by business rules. Take time to understand these [business rules](http://etutorials.org/SQL/Database+design+for+mere+mortals/Part+II+The+Design+Process/Chapter+11.+Business+Rules/What+Are+Business+Rules/) when creating your database. This effort will serve you well for organising your data as a future Data Scientist.


## Appendix

<a href="https://www.studytonight.com/dbms/database-normalization.php">Database Normalisation</a>

<a href="https://www.essentialsql.com/get-ready-to-learn-sql-11-database-third-normal-form-explained-in-simple-english/">Third Normal Form - Transitive dependency explained</a>

<a href="https://databasemanagement.fandom.com/wiki/Data_Anomalies">Data Anomalies</a>

### Exercise Solution: Rewrite your query in a more compact manner

The following code provides a solution to the exercise we encountered earlier in the train. **Once you've attempted the exercise on your own**, we encourage you to compare your solution with this one to enhance your learning!


In [38]:
%%read_sql
--Below is the INSERT query for the first normal form.
--DELETE FROM Employees_1NF;
--INSERT INTO Employees_1NF (Name,Surname,Title,Role,OccupationBand,Salary,Department)

/*SET #1 ======================================================================================
   The set of all entries containing the first `Role` or `Department` for all non-atomic cells. 
==============================================================================================*/
SELECT 
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name,     -- Splitting FullName to obtain Name,
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname,    -- Splitting FullName to obtain Surname
    UPPER(SUBSTR(Title,1,1)) || LOWER(SUBSTR(Title,2)) AS Title, -- Standardizing all Titles to start with a Capital letter
    
    CASE WHEN -- To make sure that we don't have blank values when creating duplicate entries due to multiple Departments
        TRIM(SUBSTR(Role,1,INSTR(Role,',')-1))='' THEN Role -- return original role if function calls return a blank
        ELSE TRIM(SUBSTR(Role,1,INSTR(Role,',')-1))         -- return the result of the function calls if it's not blank (substring before comma)
    END AS Role,

    OccupationBand,
    Salary,
    CASE WHEN --To make sure that we don't have blank values when creating duplicate entries due to multiple Roles
        TRIM(SUBSTR(Department,1,INSTR(Department,',')-1))='' THEN Department -- return original department if function calls return a blank
    ELSE TRIM(SUBSTR(Department,1,INSTR(Department,',')-1))                   -- return the result of the function calls if it's not blank (substring before comma)
    END AS Department
FROM
    Employees
WHERE Role LIKE '%,%' OR Department LIKE '%,%' -- Target all entries that have non-atomic values in the two respective columns

UNION


/*SET #2 ======================================================================================
   The set of all entries containing the second `Role` or `Department` for all non-atomic cells. 
==============================================================================================*/
SELECT 
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name,    --Splitting FullName to obtain Name,
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname,   --Splitting FullName to obtain Surname
    UPPER(SUBSTR(Title,1,1)) ||LOWER(SUBSTR(Title,2)) AS Title,--Standardizing all Title to start with a Capital letter
    TRIM(SUBSTR(Role,INSTR(Role,',')+1)) AS Role,              --Return the substring after the comma
    OccupationBand,
    Salary,
    TRIM(SUBSTR(Department,INSTR(Department,',')+1)) AS Department -- return the substring after the comma
FROM
    Employees
WHERE Role LIKE '%,%' OR Department LIKE '%,%' -- Target all entries that have non-atomic values in the two respective columns


UNION 


/*SET #3 ====================================================================================
   The set of all entries that only contain atomic cells. 
============================================================================================*/
SELECT 
    TRIM(SUBSTR(FullName,1,INSTR(FullName,',')-1)) AS Name,     --Splitting FullName to obtain Name,
    TRIM(SUBSTR(FullName,INSTR(FullName,',')+1)) AS Surname,    --Splitting FullName to obtain Surname
    UPPER(SUBSTR(Title,1,1)) ||LOWER(SUBSTR(Title,2)) AS Title, --Standardizing all Title to start with a Capital letter
    Role,
    OccupationBand,
    Salary,
    Department
FROM
    Employees
WHERE  
       LENGTH (TRIM(SUBSTR(Role,1,INSTR(Role,',')-1)))=0
    AND LENGTH (TRIM(SUBSTR(Department,1,INSTR(Department,',')-1)))=0
    
/*The above WHERE clause uses the knowledge that TRIM(SUBSTR(Role,1,INSTR(Role,',')-1)) 
 produces a blank string when a column contains an atomic value (no comma). We use this property to identify all rows that
 only contain atomic values in their columns by computing their LENGTH
*/


Query started at 08:55:02 AM SAST; Query executed in 0.00 m

Unnamed: 0,Name,Surname,Title,Role,OccupationBand,Salary,Department
0,André,gerber,Mrs,Front-End Developer,Junior,52357,Web Applications
1,Antoinette,Van Der Berg,Dr,UI/UX Developer,Junior,118731,Mobile Applications
2,Bronwyn,Swartz,Miss,UI/UX Developer,Graduate,34350,Mobile Applications
3,Christopher,Walker,Mr,Back-End Developer,Junior,122894,Mobile Applications
4,Christopher,Walker,Mr,Back-End Developer,Junior,122894,Web Applications
...,...,...,...,...,...,...,...
56,marthinus,ngobeni,Ms,Back-End Developer,Senior,354298,Web Applications
57,nicole,Ebrahim,Dr,Database Analyst,Senior,180919,Mobile Applications
58,phumzile,motsepe,Prof,Systems Analyst,Junior,129627,Mobile Applications
59,sello,Details,Mr,Database Analyst,Graduate,54945,Mobile Applications
