# Introduction to Database Design
---------------------------------------------------------------------------------------------------------

In our previous session on databases, we introduced some of the fundamental concepts and definitions applicable to databases in general, along with a brief intro to SQL and SQLite in particular. Some use cases and platforms were also discussed.

In this session, we are going to dig a little deeper into databases as representations of systems and processes. A database with a single table may not feel or function much differently from a spreadsheet. Much of the benefit of using databases results from designing them as models of complex systems in ways that spreadsheets just can't do:

* Inventory control and billing
* Human resources
* Blogging platforms
* Ecosystems

There will be some more advanced SQL statements this time, though we will still be using SQLite. Concepts which will be discussed and implemented in our code include:

* Entities and attributes
* Keys
* Relationships
* Normalization

For this session we are also going to play as we go, so let's begin by installing and importing an IPython notebook SQL library developed by Catherine Devlin and others at https://github.com/catherinedevlin/ipython-sql

Note that in order to run SQL commands within a Jupyter Notebook, code blocks need to begin with a 'magic' function:

%sql
for inline SQL or

%%sql
for multiple lines of SQL in a code block.

This is a minor addition that is not needed within a standard SQL database or interface, but we like this option because it's notebook friendly and the SQL syntax is otherwise the same.

It may be necessary to install the library:

### Install and load IPython-SQL

In [6]:
#!pip install ipython-sql
#!pip3 install ipython-sql

In [7]:
%load_ext sql
%sql sqlite://

# The Entity Relationship Data Model
------------------------------------------------------------------------------------------------

The entity relationship (ER) model is commonly used to define and develop databases. In the simplest terms, the model defines the things (entities) that are important or interesting within a system or process and the relationships between them.

For demonstration purposes, we will construct an ER model of data recorded in the _ATF observation check-lists October - December 1964 (ATF 6)_, published online by the [Biodiversity Heritage Library](https://www.biodiversitylibrary.org/).

> National Museum of Natural History (U.S.) Pacific Ocean Biological Survey Program (1964). ATF observation check-lists October - December 1964 (ATF 6). https://www.biodiversitylibrary.org/item/246338. DOI: 10.5962/bhl.title.146255 

Looking at the lists at [https://www.biodiversitylibrary.org/item/246338#page/1/mode/1up](https://www.biodiversitylibrary.org/item/246338#page/1/mode/1up), what are some of the challenges related to transferring these data to a flat spreadsheet?

## Round 1: A flat, spreadsheet-like table

In [8]:
%%sql
DROP TABLE IF EXISTS observation_list;
CREATE TABLE observation_list (
  'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
  'location' TEXT NOT NULL,
  'observer' TEXT,
  'weather' TEXT,
  'date' TEXT NOT NULL,
  'time_start' TEXT NOT NULL,
  'time_end' TEXT NOT NULL,
  'laysan_albatross' INTEGER NULL DEFAULT NULL,
  'black_footed_albatross' INTEGER,
  'wedge_tailed_shearwater' INTEGER,
  'christmas_shearwater' INTEGER,
  'audubons_shearwater' INTEGER,
  'bonin_petrel' INTEGER,
  'phoenix_petrel' INTEGER,
  'bulwers_petrel' INTEGER,
  'sooty_petrel' INTEGER,
  'redtailed_tropicbird' INTEGER,
  'whitetailed_tropicbird' INTEGER,
  'masked_booby' INTEGER,
  'brown_booby' INTEGER,
  'redfooted_booby' INTEGER,
  'great_frigatebird' INTEGER,
  'golden_plover' INTEGER,
  'ruddy_turnstone' INTEGER,
  'wandering_tattler' INTEGER,
  'sanderling' INTEGER,
  'bristlethighed_curlew' INTEGER,
  'sooty_tern' INTEGER,
  'graybacked_tern' INTEGER,
  'brownwinged_tern' INTEGER,
  'common_noddy' INTEGER,
  'hawaiian_noddy' INTEGER,
  'bluegray_noddy' INTEGER,
  'fairy_tern' INTEGER ,
  'remarks' TEXT,
  'total_birds' INTEGER
);

 * sqlite://
Done.
Done.


[]

In [9]:
%sql PRAGMA TABLE_INFO(observation_list);

 * sqlite://
Done.


cid,name,type,notnull,dflt_value,pk
0,id,INTEGER,1,,1
1,location,TEXT,1,,0
2,observer,TEXT,0,,0
3,weather,TEXT,0,,0
4,date,TEXT,1,,0
5,time_start,TEXT,1,,0
6,time_end,TEXT,1,,0
7,laysan_albatross,INTEGER,0,,0
8,black_footed_albatross,INTEGER,0,,0
9,wedge_tailed_shearwater,INTEGER,0,,0


In [10]:
# This is (not really) fine until we try to add an observation.
# There are no columns for the manually entered species.

try:
    %sql INSERT INTO observation_list ('location', 'date', 'time_start', 'time_end', 'wedge_tailed_shearwater', 'redfooted_booby', 'great_frigatebird', 'sooty_tern', 'common_noddy', 'skua', 'tern', 'pterochroza', 'remarks', 'total_birds') VALUES ('oahu to 20.38 N 158.34 W', '1964-10-01', '14:20', '17:30', 119, 5, 1, 6, 7, 1, 2, 5, "37.2 and 1.9", 148);
except Exception as e:
    print(str(e))

 * sqlite://
(sqlite3.OperationalError) table observation_list has no column named skua
[SQL: INSERT INTO observation_list ('location', 'date', 'time_start', 'time_end', 'wedge_tailed_shearwater', 'redfooted_booby', 'great_frigatebird', 'sooty_tern', 'common_noddy', 'skua', 'tern', 'pterochroza', 'remarks', 'total_birds') VALUES ('oahu to 20.38 N 158.34 W', '1964-10-01', '14:20', '17:30', 119, 5, 1, 6, 7, 1, 2, 5, 37.2 and 1.9, 148);]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


In [11]:
# Without modifying our table, we can only insert data by leaving out the manually entered species:

try:
    %sql INSERT INTO observation_list ('location', 'date', 'time_start', 'time_end', 'wedge_tailed_shearwater', 'redfooted_booby', 'great_frigatebird', 'sooty_tern', 'common_noddy', 'remarks', 'total_birds') VALUES ('oahu to 20.38 N 158.34 W', '1964-10-01', '14:20', '17:30', 119, 5, 1, 6, 7, "37.2 and 1.9", 148);
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.


In [12]:
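# Success, but - 
# Can't analyze location, can't align remarks with birds, and we are missing observations
# and the recorded total is wrong.
# Also note in the output below that remarks was stored as 1 rather than as text. The double-quoted
# value appears to have reached SQLite without its quotes (as in the SQL echoed with the earlier error),
# so it was evaluated as the expression 37.2 AND 1.9. Single quotes are the safer choice for strings.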

# Also in some cases we have start and end locations
# Also, what about abundance and breeding?

%sql select * from observation_list

 * sqlite://
Done.


id,location,observer,weather,date,time_start,time_end,laysan_albatross,black_footed_albatross,wedge_tailed_shearwater,christmas_shearwater,audubons_shearwater,bonin_petrel,phoenix_petrel,bulwers_petrel,sooty_petrel,redtailed_tropicbird,whitetailed_tropicbird,masked_booby,brown_booby,redfooted_booby,great_frigatebird,golden_plover,ruddy_turnstone,wandering_tattler,sanderling,bristlethighed_curlew,sooty_tern,graybacked_tern,brownwinged_tern,common_noddy,hawaiian_noddy,bluegray_noddy,fairy_tern,remarks,total_birds
1,oahu to 20.38 N 158.34 W,,,1964-10-01,14:20,17:30,,,119,,,,,,,,,,,5,1,,,,,,6,,,7,,,,1,148


So, among the other problems noted with the flat design, the table structure has to be updated every time a new species not on the original list is observed. Just looking at the data, this is something we would have to do for every observation.
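
To make the maintenance cost concrete, the following is a minimal sketch of what adding the three handwritten species would involve. It uses Python's built-in `sqlite3` module rather than the notebook magic, and a deliberately shortened version of the flat table (both are assumptions made to keep the example self-contained). The helper `column_names` is hypothetical.

```python
import sqlite3

# In-memory database, comparable to `%sql sqlite://` in the notebook.
conn = sqlite3.connect(":memory:")

# A shortened stand-in for the flat table (only two species columns).
conn.execute("""
    CREATE TABLE observation_list (
        id INTEGER PRIMARY KEY,
        location TEXT NOT NULL,
        date TEXT NOT NULL,
        sooty_tern INTEGER,
        common_noddy INTEGER
    )
""")

def column_names(connection, table):
    """Return the column names reported by PRAGMA table_info."""
    return [row[1] for row in connection.execute(f"PRAGMA table_info({table})")]

before = column_names(conn, "observation_list")

# Every species that is not already a column forces a schema change.
for new_species in ("skua", "tern", "pterochroza"):
    conn.execute(f"ALTER TABLE observation_list ADD COLUMN {new_species} INTEGER")

after = column_names(conn, "observation_list")
print(before)
print(after)
```

Each new species widens the table for every existing row, even though most rows will hold NULL in the new column.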

The entity relationship model provides a process for developing a more robust representation of these observations.

The following provides a useful example of an ER diagram, and includes each of the concepts to be discussed below:

![Entity Relationship example diagram](./images/1011px-ER_Diagram_MMORPG.png)

By <a href="https://en.wikipedia.org/wiki/User:TheMattrix" class="extiw" title="en:User:TheMattrix">TheMattrix</a> at the <a href="https://en.wikipedia.org/wiki/" class="extiw" title="w:">English language Wikipedia</a>, <a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=2278339">Link</a>


## Entities

Entities are _nouns_, and can be physical or logical:

* People - teachers, students
* Places - stores, websites, states
* Things - donuts, grades, purchases, courses

Entities are represented as tables within a database. 

### Attributes

Entities have properties or attributes which describe them. For each attribute there is a domain, or a range of legal values. Domains can be limited by data type - integer, string, etc. - and may be further limited by allowable values. For example, the domain of month names is limited to January, February, etc.
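
One way a domain can be enforced in SQLite is with a `CHECK` constraint. The sketch below uses Python's built-in `sqlite3` module; the table `monthly_report` and its columns are hypothetical and are not part of the observation data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: the CHECK constraint narrows the TEXT domain
# to a fixed list of allowed month names.
conn.execute("""
    CREATE TABLE monthly_report (
        id INTEGER PRIMARY KEY,
        month TEXT NOT NULL CHECK (month IN (
            'January', 'February', 'March', 'April', 'May', 'June',
            'July', 'August', 'September', 'October', 'November', 'December'
        ))
    )
""")

# A value inside the domain is accepted.
conn.execute("INSERT INTO monthly_report (month) VALUES ('October')")

# A value outside the domain is rejected with an IntegrityError.
try:
    conn.execute("INSERT INTO monthly_report (month) VALUES ('Octember')")
    rejected = False
except sqlite3.IntegrityError as e:
    rejected = True
    print(e)

row_count = conn.execute("SELECT COUNT(*) FROM monthly_report").fetchone()[0]
print(row_count)
```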

There are several types of attributes:

* __Simple attributes__ are atomic values which cannot be decomposed or divided. Examples include _age_, _last name_, _glaze_, etc.
* __Composite attributes__ consist of multiple simple attributes, such as _address_, _full name_, etc.
* __Multivalued attributes__ can include a set of more than one value. _Phone numbers_, _certifications_, etc. are examples of multivalued attributes.
* __Derived attributes__ can be calculated using other attributes. A common example is _age_, which can be calculated from a date of birth.
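
As an illustration of a derived attribute, the sketch below stores only a date of birth and computes age at query time. It uses Python's built-in `sqlite3` module; the `person` table, the sample row, and the fixed reference date are assumptions chosen so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, lname TEXT, date_of_birth TEXT)")
conn.execute("INSERT INTO person (lname, date_of_birth) VALUES ('Smith', '1990-08-20')")

# Age is derived when queried rather than stored in the table.
# A fixed reference date keeps the example reproducible;
# 'now' would normally take its place.
reference_date = "2020-06-15"
age = conn.execute(
    """
    SELECT CAST(strftime('%Y.%m%d', ?) - strftime('%Y.%m%d', date_of_birth) AS INTEGER)
    FROM person
    WHERE lname = 'Smith'
    """,
    (reference_date,),
).fetchone()[0]
print(age)
```

Because the value is computed from the stored date of birth, it never needs to be updated as time passes.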

### Keys

A key is an attribute or combination of attributes which can be used to uniquely identify individual entities within the entity set. That is, keys enforce a uniqueness constraint.

There are multiple types of keys. 

* A __candidate key__ is a simple or composite key that is both unique and minimal. _Minimal_ here means that every included attribute is needed to establish uniqueness. A table or entity set may have more than one candidate key.
* A __composite key__ is a key composed of two or more attributes. Composite keys are also minimal.
* A __primary key__ is the candidate key which is selected to uniquely identify entities in the entity set.
* A __foreign key__ is an attribute that references the primary key of another table or entity set in the database.
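
The key types above can be sketched in a small schema. This example uses Python's built-in `sqlite3` module; the `student` and `enrollment` tables are hypothetical, and the helper `violates` exists only for this illustration. Note that SQLite enforces foreign keys only when `PRAGMA foreign_keys` is turned on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are not enforced by default

conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,      -- chosen primary key
        email TEXT NOT NULL UNIQUE           -- a second candidate key
    );
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student (student_id),  -- foreign key
        course_code TEXT NOT NULL,
        term TEXT NOT NULL,
        PRIMARY KEY (student_id, course_code, term)                   -- composite key
    );
    INSERT INTO student (student_id, email) VALUES (1, 'a@example.org');
    INSERT INTO enrollment VALUES (1, 'DB101', '2020-fall');
""")

def violates(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Same composite key values again: rejected.
duplicate_key = violates("INSERT INTO enrollment VALUES (1, 'DB101', '2020-fall')")
# student_id 99 does not exist in student: rejected.
missing_parent = violates("INSERT INTO enrollment VALUES (99, 'DB101', '2020-fall')")
# Same student and course, different term: accepted.
new_term_ok = not violates("INSERT INTO enrollment VALUES (1, 'DB101', '2021-spring')")

print(duplicate_key, missing_parent, new_term_ok)
```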

#### Exercise

1. Referring back to the ATF observation checklist, identify some important entities and their attributes. How do different simple and composite attributes serve to uniquely identify individual entities?
2. Use the **WWW SQL Designer** at [http://ondras.zarovi.cz/sql/demo/](http://ondras.zarovi.cz/sql/demo/) to create an ERD that can be implemented in a database.

## Round 2: Defining entities and attributes

In [13]:
%%sql
DROP TABLE IF EXISTS location;
CREATE TABLE location (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'start_northing' TEXT NOT NULL,
    'start_easting' TEXT NOT NULL,
    'end_northing' TEXT NOT NULL,
    'end_easting' TEXT NOT NULL,
    'start_name' TEXT,
    'end_name' TEXT
);
DROP TABLE IF EXISTS observer;
CREATE TABLE observer (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'fname' TEXT,
    'lname' TEXT,
    'org' TEXT
);
DROP TABLE IF EXISTS species;
CREATE TABLE species (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'taxon' TEXT,
    'common_name' TEXT,
    'count' INTEGER,
    'breeding' TEXT,
    'remarks' TEXT
);
DROP TABLE IF EXISTS observation;
CREATE TABLE observation (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'date' TEXT NOT NULL,
    'location' INTEGER NOT NULL,
    'observer' INTEGER NOT NULL,
    'species' INTEGER NOT NULL
);

 * sqlite://
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.


[]

In [14]:
# Now populate the tables that will be referenced when recording observations.

try:
    %sql INSERT INTO location ('start_name', 'start_northing', 'start_easting', 'end_northing', 'end_easting') VALUES ('Oahu', '20.50 N', '158.20 W', '20.38 N', '158.34 W');
    %sql INSERT INTO observer ('org') VALUES ('ATF')
    %sql INSERT INTO species ('common_name', 'count', 'remarks') VALUES ('wedge-tailed shearwater', 119, '37.2')
    %sql INSERT INTO species ('common_name', 'count') VALUES ('red-footed booby', 5)
    %sql INSERT INTO species ('common_name', 'count') VALUES ('great frigatebird', 1)
    %sql INSERT INTO species ('common_name', 'count', 'remarks') VALUES ('sooty tern', 6, '1.9')
    %sql INSERT INTO species ('common_name', 'count') VALUES ('common noddy', 7)
    %sql INSERT INTO species ('common_name', 'count') VALUES ('skua', 1)
    %sql INSERT INTO species ('common_name', 'count') VALUES ('tern', 2)
    %sql INSERT INTO species ('common_name', 'count') VALUES ('pterochroza', 5)
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.


In [15]:
# now can reference ids to insert into observations
# note - this is _not_ an example of good design!

%sql select * from location

 * sqlite://
Done.


id,start_northing,start_easting,end_northing,end_easting,start_name,end_name
1,20.50 N,158.20 W,20.38 N,158.34 W,Oahu,


In [16]:
%sql select * from observer

 * sqlite://
Done.


id,fname,lname,org
1,,,ATF


In [17]:
%sql select * from species

 * sqlite://
Done.


id,taxon,common_name,count,breeding,remarks
1,,wedge-tailed shearwater,119,,37.2
2,,red-footed booby,5,,
3,,great frigatebird,1,,
4,,sooty tern,6,,1.9
5,,common noddy,7,,
6,,skua,1,,
7,,tern,2,,
8,,pterochroza,5,,


In [18]:
# Use IDs to enter observation data

try:
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 1);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 2);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 3);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 4);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 5);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 6);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 7);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 8);
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.


In [19]:
# We can dereference the ids of location, observer, etc. using joins,
# but for the most part this is a (poor) solution to the problems 
# we encountered trying to capture
# observation data with a flat, spreadsheet-like design

# join example follows in the next cell

%sql select * from observation

 * sqlite://
Done.


id,date,location,observer,species
1,1964-10-01,1,1,1
2,1964-10-01,1,1,2
3,1964-10-01,1,1,3
4,1964-10-01,1,1,4
5,1964-10-01,1,1,5
6,1964-10-01,1,1,6
7,1964-10-01,1,1,7
8,1964-10-01,1,1,8


In [20]:
%%sql
SELECT observation.date, species.common_name, species.count, species.remarks
FROM observation
INNER JOIN species ON observation.species = species.id

 * sqlite://
Done.


date,common_name,count,remarks
1964-10-01,wedge-tailed shearwater,119,37.2
1964-10-01,red-footed booby,5,
1964-10-01,great frigatebird,1,
1964-10-01,sooty tern,6,1.9
1964-10-01,common noddy,7,
1964-10-01,skua,1,
1964-10-01,tern,2,
1964-10-01,pterochroza,5,


# Relationships
------------------------------------------------------------------------------------------------------------

Relationships represent connections between entities. In keeping with the idea that entities are nouns, relationships are verbs. The MMORPG example above demonstrates this: a character _has_ an account, a region _contains_ characters.

_Cardinality_ determines the type of relationship that exists between two entities.

* __One to many (1:M)__: In the example above, region -> character is a 1 to many relationship. That is, one region can have many characters in it.
* __One to one (1:1)__: Not in the diagram above. One to one relationships can indicate a design issue, since the two entities might really describe the same thing.
* __Many to many (M:N)__: In the example, character and creep have a many to many relationship. Within a database, these need to be implemented as a set of 1:M relationships with an intermediate (dependent) entity between the two.
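
A many to many relationship is typically resolved with a third table that holds one row per pairing. The sketch below uses Python's built-in `sqlite3` module; the table names `game_character`, `creep`, and `encounter`, along with the sample rows, are assumptions loosely based on the diagram and are not taken from it directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE game_character (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE creep (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Intermediate table: each row links one character to one creep,
    -- so each side of the M:N relationship becomes a 1:M relationship.
    CREATE TABLE encounter (
        character_id INTEGER NOT NULL REFERENCES game_character (id),
        creep_id INTEGER NOT NULL REFERENCES creep (id),
        PRIMARY KEY (character_id, creep_id)
    );

    INSERT INTO game_character VALUES (1, 'Ayla'), (2, 'Brun');
    INSERT INTO creep VALUES (1, 'rat'), (2, 'wolf');
    INSERT INTO encounter VALUES (1, 1), (1, 2), (2, 2);
""")

# One character can meet many creeps...
creeps_per_character = conn.execute("""
    SELECT c.name, COUNT(*)
    FROM game_character AS c
    INNER JOIN encounter AS e ON e.character_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# ...and one creep can be met by many characters.
characters_per_creep = conn.execute("""
    SELECT k.name, COUNT(*)
    FROM creep AS k
    INNER JOIN encounter AS e ON e.creep_id = k.id
    GROUP BY k.id
    ORDER BY k.id
""").fetchall()

print(creeps_per_character)
print(characters_per_creep)
```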


#### Exercise

1. Identify the relationships between entities in the observation checklist. Since we are working with text, we will use notation similar to the examples at [https://www.datanamic.com/support/lt-dez005-introduction-db-modeling.html](https://www.datanamic.com/support/lt-dez005-introduction-db-modeling.html):

Location -> Specimen; 1 location can contain multiple specimens -> 1:N

In some ways Round 2 was better than the flat table, but we still have a problem with observation->species relationship. Other relationships are 1:1, as far as we can tell from the data. But as implemented the M:N relationship between observations and species creates a lot of redundancy.


# Normalization
------------------------------------------------------------------------------------------------------------

Normalization is a process of analyzing entities and attributes to reduce redundancy and prevent anomalies:

* Update anomaly: Redundant values within a table must be updated multiple times. In the example below, if Smith's favorite donut changes, the table has to be updated twice. Otherwise, there will be inconsistent values.
* Delete anomaly: Deleting data forces the deletion of other attributes. For example, removing apple cider donuts from our table would also force the deletion of Wilson and Wilson's dependent, Pete. (Remember that DELETE operations delete a whole row, not just a single attribute value.)
* Insert anomaly: Data cannot be added to the table without also adding other attributes. If null values are not allowed in the *Favorite_Donut* column, it becomes impossible to add information about an employee who doesn't have a favorite donut.


| EmployeeID | LName      |Favorite_Donut | Dependent |
|------------|------------|---------------|-----------|
| 115        | Smith      | glazed        | James     |
| 115        | Smith      | glazed        | Sandy     |
| 116        | Wilson     | apple cider   | Pete      |
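
The update and delete anomalies can be reproduced directly from the table above. This is a minimal sketch using Python's built-in `sqlite3` module; the table name `employee_flat` and the replacement value 'jelly' are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee_flat (
        EmployeeID INTEGER, LName TEXT, Favorite_Donut TEXT, Dependent TEXT
    );
    INSERT INTO employee_flat VALUES (115, 'Smith', 'glazed', 'James');
    INSERT INTO employee_flat VALUES (115, 'Smith', 'glazed', 'Sandy');
    INSERT INTO employee_flat VALUES (116, 'Wilson', 'apple cider', 'Pete');
""")

# Update anomaly: changing only one of Smith's two rows leaves
# the table with two different favorite donuts for the same employee.
conn.execute(
    "UPDATE employee_flat SET Favorite_Donut = 'jelly' "
    "WHERE EmployeeID = 115 AND Dependent = 'James'"
)
smith_donuts = [row[0] for row in conn.execute(
    "SELECT DISTINCT Favorite_Donut FROM employee_flat "
    "WHERE EmployeeID = 115 ORDER BY Favorite_Donut"
)]

# Delete anomaly: removing the apple cider row also removes
# everything we knew about Wilson and Pete.
conn.execute("DELETE FROM employee_flat WHERE Favorite_Donut = 'apple cider'")
wilson_rows = conn.execute(
    "SELECT COUNT(*) FROM employee_flat WHERE LName = 'Wilson'"
).fetchone()[0]

print(smith_donuts)
print(wilson_rows)
```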

In the second iteration of our database design, the table _species_ particularly demonstrates the update anomaly. Because we combined observation attributes with species attributes, we will end up duplicating values every time a species is observed more than once:

```
CREATE TABLE species (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'taxon' TEXT,
    'common_name' TEXT,
    'count' INTEGER,
    'breeding' TEXT,
    'remarks' TEXT
);

```

Normalization involves removing dependencies among attributes to improve the logical structure and consistency of a database.

There are progressive degrees of normalization across multiple _normal forms_ (NF). There are six normal forms, but generally a database is considered normalized if the tables satisfy the requirements of the first three NF.

* 1NF: No repeating columns. Technically, our first flat table design satisfied 1NF, but imagine instead a structure like:

| date       | location   | species_1       | count_1     | species_2    | count_2     |
|------------|------------|-----------------|-------------|--------------|-------------|
| 1964-10-01 | Oahu       | albatross       | 5           | tern         | 1           |

As poor as our initial design was, this would be worse because each observation would have multiple species and count columns, and it's unlikely any two observations would end up having the same number of columns.

All of the tables in our second round of database design satisfy 1NF requirements.

* 2NF: A table must be in 1NF, and every non-key attribute must depend on the entire primary key. A single-attribute primary key satisfies this automatically; with a composite key, no non-key attribute may depend on only part of the key. That is, eliminate redundancy caused by partial dependencies.

Technically, all of our table definitions in our second round of design also satisfy the requirements of 2NF because they all have single attribute primary keys. However, our keys are poorly chosen because without additional uniqueness constraints there is nothing to prevent us from entering the same observer or location into the table multiple times:

```
CREATE TABLE location (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'start_northing' TEXT NOT NULL,
    'start_easting' TEXT NOT NULL,
    'end_northing' TEXT NOT NULL,
    'end_easting' TEXT NOT NULL,
    'start_name' TEXT,
    'end_name' TEXT
);

CREATE TABLE observer (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'fname' TEXT,
    'lname' TEXT,
    'org' TEXT
);
```
In effect, the keys selected aren't really the minimal attributes needed to uniquely identify every entity in the entity set.
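
The difference a uniqueness constraint makes can be sketched as follows, using Python's built-in `sqlite3` module. The table names `observer_v2` and `observer_unique` are hypothetical, and treating `org` alone as identifying an observer is an assumption based on the limited sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Round 2 style: only the surrogate id is constrained.
    CREATE TABLE observer_v2 (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        fname TEXT, lname TEXT, org TEXT
    );
    -- Same columns, with a uniqueness constraint added on org.
    CREATE TABLE observer_unique (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        fname TEXT, lname TEXT, org TEXT NOT NULL UNIQUE
    );
""")

# Without the constraint, the same observer can be entered twice.
for _ in range(2):
    conn.execute("INSERT INTO observer_v2 (org) VALUES ('ATF')")
duplicates = conn.execute(
    "SELECT COUNT(*) FROM observer_v2 WHERE org = 'ATF'"
).fetchone()[0]

# With the constraint, the second insert is rejected.
conn.execute("INSERT INTO observer_unique (org) VALUES ('ATF')")
try:
    conn.execute("INSERT INTO observer_unique (org) VALUES ('ATF')")
    second_insert_rejected = False
except sqlite3.IntegrityError:
    second_insert_rejected = True

print(duplicates, second_insert_rejected)
```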

* 3NF: A table must be 2NF AND eliminate transitive dependencies. That is, remove non-key attributes that depend on other non-key attributes. Our _species_ table definition from above provides a good example of transitive dependencies:

```
CREATE TABLE species (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'taxon' TEXT,
    'common_name' TEXT,
    'count' INTEGER,
    'breeding' TEXT,
    'remarks' TEXT
);

```

The attributes _count_, _breeding_, and _remarks_ depend on each other as observation attributes. We need a way to link this information with species _per observation_ by creating a dependent entity to resolve the M:N relationship between species and observations.

## Round 3: Striving for 3NF

Note that some of the choices for primary keys in the final design are based on analysis of a limited set of sample data. More data might require further definition.

**NOTE**: In the table definition for species, *common_name* is used as the primary key. In principle this is a poor choice - common names are subject to regional variation, etc. For demonstration purposes, we are using it as a key here because taxon information is not available and because of how common names are used within the actual observation lists we are modeling.

In [21]:
%%sql
DROP TABLE IF EXISTS location;
CREATE TABLE location (
    'locID' INTEGER NOT NULL,
    'start_northing' TEXT NOT NULL,
    'start_easting' TEXT NOT NULL,
    'end_northing' TEXT NOT NULL,
    'end_easting' TEXT NOT NULL,
    'start_name' TEXT,
    'end_name' TEXT,
    PRIMARY KEY ('start_northing', 'start_easting', 'end_northing', 'end_easting')
);
DROP TABLE IF EXISTS observer;
CREATE TABLE observer (
    'observerID' INTEGER NOT NULL,
    'fname' TEXT,
    'lname' TEXT,
    'org' TEXT NOT NULL PRIMARY KEY UNIQUE
);
DROP TABLE IF EXISTS species;
CREATE TABLE species (
    'speciesID' INTEGER NOT NULL,
    'taxon' TEXT,
    'common_name' TEXT NOT NULL PRIMARY KEY UNIQUE
);
DROP TABLE IF EXISTS observation;
CREATE TABLE observation (
    'observationID' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'date' TEXT NOT NULL,
    'locID' INTEGER NOT NULL,
    'observerID' INTEGER NOT NULL
);
DROP TABLE IF EXISTS observed_species;
CREATE TABLE observed_species (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'observationID' INTEGER NOT NULL,
    'speciesID' INTEGER NOT NULL,
    'count' INTEGER NOT NULL,
    'breeding' TEXT,
    'remarks' TEXT
);

 * sqlite://
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.


[]

In [22]:
# Add location and observer - note that we are no longer autoincrementing IDs

try:
    %sql INSERT INTO location ('locID', 'start_name', 'start_northing', 'start_easting', 'end_northing', 'end_easting') VALUES (1, 'Oahu', '20.50 N', '158.20 W', '20.38 N', '158.34 W');
    %sql INSERT INTO observer ('observerID', 'org') VALUES (1, 'ATF')
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.


In [23]:
# A much simpler insert statement for species - populate all of the listed species at once

try:
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (1, 'laysan_albatross')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (2, 'black_footed_albatross' )
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (3, 'wedge_tailed_shearwater')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (4, 'christmas_shearwater')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (5, 'audubons_shearwater')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (6, 'bonin_petrel')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (7, 'phoenix_petrel')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (8, 'bulwers_petrel')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (9, 'sooty_petrel')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (10, 'redtailed_tropicbird' )
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (11, 'whitetailed_tropicbird')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (12, 'masked_booby')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (13, 'brown_booby')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (14, 'redfooted_booby')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (15, 'great_frigatebird')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (16, 'golden_plover')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (17, 'ruddy_turnstone')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (18, 'wandering_tattler' )
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (19, 'sanderling')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (20, 'bristlethighed_curlew')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (21, 'sooty_tern')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (22, 'graybacked_tern')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (23, 'brownwinged_tern')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (24, 'common_noddy')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (25, 'hawaiian_noddy')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (26, 'bluegray_noddy')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (27, 'fairy_tern')
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.


In [24]:
%sql select * from location

 * sqlite://
Done.


locID,start_northing,start_easting,end_northing,end_easting,start_name,end_name
1,20.50 N,158.20 W,20.38 N,158.34 W,Oahu,


In [25]:
%sql select * from observer

 * sqlite://
Done.


observerID,fname,lname,org
1,,,ATF


In [26]:
try:
    %sql INSERT INTO observation ('date', 'locID', 'observerID') VALUES ('1964-10-01', 1, 1);
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.


In [27]:
%sql select * from observation

 * sqlite://
Done.


observationID,date,locID,observerID
1,1964-10-01,1,1


In [28]:
# for each observation we can easily add any birds not on the list
# then add observations

try:
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (28, 'skua')
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (29, 'tern' )
    %sql INSERT INTO species ('speciesID', 'common_name') VALUES (30, 'pterochroza')
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.


In [29]:
try:
    %sql INSERT INTO observed_species ('observationID', 'speciesID', 'count', 'remarks') VALUES (1, 3, 119, '37.2');
    %sql INSERT INTO observed_species ('observationID', 'speciesID', 'count') VALUES (1, 14, 5);
    %sql INSERT INTO observed_species ('observationID', 'speciesID', 'count') VALUES (1, 15, 1);
    %sql INSERT INTO observed_species ('observationID', 'speciesID', 'count', 'remarks') VALUES (1, 21, 6, '1.9');
    %sql INSERT INTO observed_species ('observationID', 'speciesID', 'count') VALUES (1, 24, 7);
    %sql INSERT INTO observed_species ('observationID', 'speciesID', 'count') VALUES (1, 28, 1);
    %sql INSERT INTO observed_species ('observationID', 'speciesID', 'count') VALUES (1, 29, 2);
    %sql INSERT INTO observed_species ('observationID', 'speciesID', 'count') VALUES (1, 30, 5);
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.


In [30]:
# Using multiple inner joins to dereference IDs, we can generate a view of
# the data that is complete

In [31]:
%%sql
SELECT date, location.start_name, observer.org, species.common_name, observed_species.count
FROM observation
INNER JOIN location ON location.locID = observation.locID
INNER JOIN observer ON observer.observerID = observation.observerID
INNER JOIN observed_species ON observed_species.observationID = observation.observationID
INNER JOIN species ON species.speciesID = observed_species.speciesID

 * sqlite://
Done.


date,start_name,org,common_name,count
1964-10-01,Oahu,ATF,wedge_tailed_shearwater,119
1964-10-01,Oahu,ATF,redfooted_booby,5
1964-10-01,Oahu,ATF,great_frigatebird,1
1964-10-01,Oahu,ATF,sooty_tern,6
1964-10-01,Oahu,ATF,common_noddy,7
1964-10-01,Oahu,ATF,skua,1
1964-10-01,Oahu,ATF,tern,2
1964-10-01,Oahu,ATF,pterochroza,5


In [32]:
# The total bird count can be calculated using the built in sum() function
# Keep in mind this will be different from the total in the observation list because of the one
# entry that is illegible

In [33]:
%%sql
SELECT SUM(count) AS total_birds, observation.date
FROM observed_species
INNER JOIN observation ON observation.observationID = observed_species.observationID

 * sqlite://
Done.


total_birds,date
146,1964-10-01
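
The total above covers the single checklist entered so far. Once more than one observation is stored, a per-observation total would need a `GROUP BY` clause. The following is a minimal sketch using Python's built-in `sqlite3` module instead of the notebook magic, with trimmed versions of the two tables; the second observation and its counts are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE observation (observationID INTEGER PRIMARY KEY, date TEXT NOT NULL);
    CREATE TABLE observed_species (
        observationID INTEGER NOT NULL,
        speciesID INTEGER NOT NULL,
        count INTEGER NOT NULL
    );

    INSERT INTO observation VALUES (1, '1964-10-01');
    INSERT INTO observation VALUES (2, '1964-10-02');  -- hypothetical second checklist

    -- Counts for observation 1 match the rows entered earlier.
    INSERT INTO observed_species VALUES
        (1, 3, 119), (1, 14, 5), (1, 15, 1), (1, 21, 6),
        (1, 24, 7), (1, 28, 1), (1, 29, 2), (1, 30, 5);
    -- Made-up counts for the hypothetical observation 2.
    INSERT INTO observed_species VALUES (2, 21, 10), (2, 24, 4);
""")

# One total per observation rather than one total for the whole table.
totals = conn.execute("""
    SELECT observation.observationID, observation.date,
           SUM(observed_species.count) AS total_birds
    FROM observed_species
    INNER JOIN observation ON observation.observationID = observed_species.observationID
    GROUP BY observation.observationID
    ORDER BY observation.observationID
""").fetchall()

print(totals)
```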


## Verify that our design is 3NF

In [34]:
# Test whether our design is really 3NF - check for update, delete, and insert anomalies
# UPDATE - for any uniquely identified entity, are multiple updates needed to change an attribute?
# Make a change and re-run the joined view

In [35]:
%%sql
UPDATE species
SET common_name = 'small_pterochroza'
WHERE common_name = 'pterochroza';

 * sqlite://
1 rows affected.


[]

In [36]:
%%sql
SELECT date, location.start_name, observer.org, species.common_name, observed_species.count
FROM observation
INNER JOIN location ON location.locID = observation.locID
INNER JOIN observer ON observer.observerID = observation.observerID
INNER JOIN observed_species ON observed_species.observationID = observation.observationID
INNER JOIN species ON species.speciesID = observed_species.speciesID

 * sqlite://
Done.


date,start_name,org,common_name,count
1964-10-01,Oahu,ATF,wedge_tailed_shearwater,119
1964-10-01,Oahu,ATF,redfooted_booby,5
1964-10-01,Oahu,ATF,great_frigatebird,1
1964-10-01,Oahu,ATF,sooty_tern,6
1964-10-01,Oahu,ATF,common_noddy,7
1964-10-01,Oahu,ATF,skua,1
1964-10-01,Oahu,ATF,tern,2
1964-10-01,Oahu,ATF,small_pterochroza,5


In [37]:
# DELETE - can we delete a row without deleting other data?
# Make a deletion and re-run the joined view

In [38]:
%%sql
DELETE FROM observer
WHERE org = 'ATF';

 * sqlite://
1 rows affected.


[]

In [39]:
%%sql
SELECT date, location.start_name, observer.org, species.common_name, observed_species.count
FROM observation
INNER JOIN location ON location.locID = observation.locID
INNER JOIN observer ON observer.observerID = observation.observerID
INNER JOIN observed_species ON observed_species.observationID = observation.observationID
INNER JOIN species ON species.speciesID = observed_species.speciesID

 * sqlite://
Done.


date,start_name,org,common_name,count


In [40]:
# That looks bad, but only because we deleted our only observer. There is nothing to join on.
# However, the other data - species, location, observation date, etc. are intact.

# (There are some caveats here that relate to assumptions made about who or what counts
# as an observer.)
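
An inner join returns nothing here because it only keeps rows that match on both sides. One alternative, sketched below with Python's built-in `sqlite3` module and trimmed versions of the two tables, is a `LEFT JOIN`, which keeps every observation and fills the missing observer with NULL. The trimmed schema is an assumption made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE observer (observerID INTEGER NOT NULL, org TEXT NOT NULL PRIMARY KEY);
    CREATE TABLE observation (
        observationID INTEGER PRIMARY KEY,
        date TEXT NOT NULL,
        observerID INTEGER NOT NULL
    );
    INSERT INTO observation VALUES (1, '1964-10-01', 1);
""")
# The observer table is left empty, as it is after the DELETE above.

# INNER JOIN: the observation disappears because no observer row matches.
inner_rows = conn.execute("""
    SELECT observation.date, observer.org
    FROM observation
    INNER JOIN observer ON observer.observerID = observation.observerID
""").fetchall()

# LEFT JOIN: the observation is kept and org comes back as NULL (None in Python).
left_rows = conn.execute("""
    SELECT observation.date, observer.org
    FROM observation
    LEFT JOIN observer ON observer.observerID = observation.observerID
""").fetchall()

print(inner_rows)
print(left_rows)
```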

In [41]:
%%sql
SELECT date, location.start_name, species.common_name, observed_species.count
FROM observation
INNER JOIN location ON location.locID = observation.locID
INNER JOIN observed_species ON observed_species.observationID = observation.observationID
INNER JOIN species ON species.speciesID = observed_species.speciesID

 * sqlite://
Done.


date,start_name,common_name,count
1964-10-01,Oahu,wedge_tailed_shearwater,119
1964-10-01,Oahu,redfooted_booby,5
1964-10-01,Oahu,great_frigatebird,1
1964-10-01,Oahu,sooty_tern,6
1964-10-01,Oahu,common_noddy,7
1964-10-01,Oahu,skua,1
1964-10-01,Oahu,tern,2
1964-10-01,Oahu,small_pterochroza,5


In [42]:
# INSERT - can we add data if particular non-key attributes are not known?
# Maybe - we can't add an observation if the date (a non-key attribute) is not known.

# Add back our observer
try:
    %sql INSERT INTO observer ('observerID', 'org') VALUES (1, 'ATF')
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.


In [43]:
# Try to insert an observation without a date.

try:
    %sql INSERT INTO observation ('locID', 'observerID') VALUES (1, 1);
except Exception as e:
    print(str(e))

 * sqlite://
(sqlite3.IntegrityError) NOT NULL constraint failed: observation.date
[SQL: INSERT INTO observation ('locID', 'observerID') VALUES (1, 1);]
(Background on this error at: http://sqlalche.me/e/13/gkpj)


In [44]:
# So our design is not 3NF. 
# We can fix this by either allowing null dates or making the date attribute a key.

# References

Adrienne Watt and Nelson Eng (n.d.) Database Design - 2nd Edition. Retrieved from [https://opentextbc.ca/dbdesign01/](https://opentextbc.ca/dbdesign01/)

Datanamic (n.d.) Database normalization. Retrieved from [https://www.datanamic.com/support/database-normalization.html](https://www.datanamic.com/support/database-normalization.html)