# Introduction to Database Design
---------------------------------------------------------------------------------------------------------

In our previous session on databases, we introduced some of the fundamental concepts and definitions applicable to databases in general, along with a brief intro to SQL and SQLite in particular. Some use cases and platforms were also discussed.

In this session, we are going to dig a little deeper into databases as representations of systems and processes. A database with a single table may not feel or function much differently from a spreadsheet. Much of the benefit of using databases results from designing them as models of complex systems in ways that spreadsheets just can't do:

* Inventory control and billing
* Human resources
* Blogging platforms
* Ecosystems

For this session we are also going to play as we go, so let's begin by installing and importing an IPython notebook SQL library developed by Catherine Devlin and others at https://github.com/catherinedevlin/ipython-sql

Note that in order to run SQL commands within a Jupyter Notebook, code blocks need to begin with a 'magic' function:

* `%sql` for inline SQL, or
* `%%sql` for multiple lines of SQL in a code block.

This is a minor addition that is not needed within a standard SQL database or interface, but we like this option because it's notebook friendly and the SQL syntax is otherwise the same.

It may be necessary to install the library:

### Install and load IPython-SQL

In [2]:
#!pip install ipython-sql
#!pip3 install ipython-sql

In [3]:
%load_ext sql
%sql sqlite://

'Connected: @None'

In [4]:
# Some info links:

# https://www.datanamic.com/support/lt-dez005-introduction-db-modeling.html
# https://www.datanamic.com/support/database-normalization.html
# https://www.fullstackpython.com/databases.html
# https://www.sqlalchemy.org/
# https://opentextbc.ca/dbdesign01/chapter/chapter-8-entity-relationship-model/
# https://www.tutorialspoint.com/dbms/er_model_basic_concepts.htm

# Use field notes for examples
# https://www.biodiversitylibrary.org/
# https://www.biodiversitylibrary.org/bibliography/146255#/summary
# https://www.biodiversitylibrary.org/item/246338#page/1/mode/1up

# The Entity Relationship Data Model
------------------------------------------------------------------------------------------------

The entity relationship (ER) model is commonly used to define and develop databases. In the simplest terms, the model defines the things (entities) that are important or interesting within a system or process and the relationships between them.

For demonstration purposes, we will construct an ER model of data recorded in the _ATF observation check-lists October - December 1964 (ATF 6)_, published online by the [Biodiversity Heritage Library](https://www.biodiversitylibrary.org/).

> National Museum of Natural History (U.S.) Pacific Ocean Biological Survey Program (1964). ATF observation check-lists October - December 1964 (ATF 6). https://www.biodiversitylibrary.org/item/246338. DOI: 10.5962/bhl.title.146255 

Looking at the lists at [https://www.biodiversitylibrary.org/item/246338#page/1/mode/1up](https://www.biodiversitylibrary.org/item/246338#page/1/mode/1up), what are some of the challenges related to transferring these data to a flat spreadsheet?

## Round 1: A flat, spreadsheet-like table

In [5]:
%%sql
DROP TABLE IF EXISTS observation_list;
CREATE TABLE observation_list (
  'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
  'location' TEXT NOT NULL,
  'observer' TEXT,
  'weather' TEXT,
  'date' TEXT NOT NULL,
  'time_start' TEXT NOT NULL,
  'time_end' TEXT NOT NULL,
  'laysan_albatross' INTEGER NULL DEFAULT NULL,
  'black_footed_albatross' INTEGER,
  'wedge_tailed_shearwater' INTEGER,
  'christmas_shearwater' INTEGER,
  'audubons_shearwater' INTEGER,
  'bonin_petrel' INTEGER,
  'phoenix_petrel' INTEGER,
  'bulwers_petrel' INTEGER,
  'sooty_petrel' INTEGER,
  'redtailed_tropicbird' INTEGER,
  'whitetailed_tropicbird' INTEGER,
  'masked_booby' INTEGER,
  'brown_booby' INTEGER,
  'redfooted_booby' INTEGER,
  'great_frigatebird' INTEGER,
  'golden_plover' INTEGER,
  'ruddy_turnstone' INTEGER,
  'wandering_tattler' INTEGER,
  'sanderling' INTEGER,
  'bristlethighed_curlew' INTEGER,
  'sooty_tern' INTEGER,
  'graybacked_tern' INTEGER,
  'brownwinged_tern' INTEGER,
  'common_noddy' INTEGER,
  'hawaiian_noddy' INTEGER,
  'bluegray_noddy' INTEGER,
  'fairy_tern' INTEGER ,
  'remarks' TEXT,
  'total_birds' INTEGER
);

 * sqlite://
Done.
Done.


[]

In [6]:
%sql PRAGMA TABLE_INFO(observation_list);

 * sqlite://
Done.


cid,name,type,notnull,dflt_value,pk
0,id,INTEGER,1,,1
1,location,TEXT,1,,0
2,observer,TEXT,0,,0
3,weather,TEXT,0,,0
4,date,TEXT,1,,0
5,time_start,TEXT,1,,0
6,time_end,TEXT,1,,0
7,laysan_albatross,INTEGER,0,,0
8,black_footed_albatross,INTEGER,0,,0
9,wedge_tailed_shearwater,INTEGER,0,,0


In [8]:
# This is fine until we try to add an observation.
# There are no columns for the manually entered species.

try:
    %sql INSERT INTO observation_list ('location', 'date', 'time_start', 'time_end', 'wedge_tailed_shearwater', 'redfooted_booby', 'great_frigatebird', 'sooty_tern', 'common_noddy', 'skua', 'tern', 'pterochroza', 'remarks', 'total_birds') VALUES ('oahu to 20.38 N 158.34 W', '1964-10-01', '14:20', '17:30', 119, 5, 1, 6, 7, 1, 2, 5, "37.2 and 1.9", 148);
except Exception as e:
    print(str(e))

 * sqlite://
(sqlite3.OperationalError) table observation_list has no column named skua [SQL: 'INSERT INTO observation_list (\'location\', \'date\', \'time_start\', \'time_end\', \'wedge_tailed_shearwater\', \'redfooted_booby\', \'great_frigatebird\', \'sooty_tern\', \'common_noddy\', \'skua\', \'tern\', \'pterochroza\', \'remarks\', \'total_birds\') VALUES (\'oahu to 20.38 N 158.34 W\', \'1964-10-01\', \'14:20\', \'17:30\', 119, 5, 1, 6, 7, 1, 2, 5, "37.2 and 1.9", 148);'] (Background on this error at: http://sqlalche.me/e/e3q8)


In [9]:
# Without modifying our table, we can only insert data by leaving out the manually entered species:

try:
    %sql INSERT INTO observation_list ('location', 'date', 'time_start', 'time_end', 'wedge_tailed_shearwater', 'redfooted_booby', 'great_frigatebird', 'sooty_tern', 'common_noddy', 'remarks', 'total_birds') VALUES ('oahu to 20.38 N 158.34 W', '1964-10-01', '14:20', '17:30', 119, 5, 1, 6, 7, "37.2 and 1.9", 148);
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.


In [10]:
# Success, but:
# we can't analyze location, can't align remarks with individual species,
# we are missing observations, and the recorded total does not match the counts we stored.

# Also, in some cases we have start and end locations.
# Also, what about abundance and breeding?

%sql select * from observation_list

 * sqlite://
Done.


id,location,observer,weather,date,time_start,time_end,laysan_albatross,black_footed_albatross,wedge_tailed_shearwater,christmas_shearwater,audubons_shearwater,bonin_petrel,phoenix_petrel,bulwers_petrel,sooty_petrel,redtailed_tropicbird,whitetailed_tropicbird,masked_booby,brown_booby,redfooted_booby,great_frigatebird,golden_plover,ruddy_turnstone,wandering_tattler,sanderling,bristlethighed_curlew,sooty_tern,graybacked_tern,brownwinged_tern,common_noddy,hawaiian_noddy,bluegray_noddy,fairy_tern,remarks,total_birds
1,oahu to 20.38 N 158.34 W,,,1964-10-01,14:20,17:30,,,119,,,,,,,,,,,5,1,,,,,,6,,,7,,,,37.2 and 1.9,148


So, among the other problems noted with the flat design, the table structure has to be updated every time a new species not on the original list is observed. Just looking at the data, this is something we would have to do for every observation.
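The maintenance cost described above can be sketched outside the notebook magics. The example below is a minimal illustration using Python's built-in `sqlite3` module and an abbreviated stand-in for `observation_list` (only a handful of columns, chosen here for brevity). It shows that each hand-written species needs its own `ALTER TABLE` before the row can be inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, like `%sql sqlite://`

# Abbreviated stand-in for the flat observation_list table (assumption: most
# species columns are left out to keep the sketch short).
conn.execute("""
    CREATE TABLE observation_list (
        id INTEGER PRIMARY KEY,
        location TEXT NOT NULL,
        date TEXT NOT NULL,
        sooty_tern INTEGER,
        common_noddy INTEGER
    )
""")


def column_names(connection, table):
    """Return the column names reported by PRAGMA table_info."""
    return [row[1] for row in connection.execute(f"PRAGMA table_info({table})")]


before = column_names(conn, "observation_list")

# Each species written in by hand forces a change to the table structure.
for species in ("skua", "tern", "pterochroza"):
    conn.execute(f"ALTER TABLE observation_list ADD COLUMN {species} INTEGER")

after = column_names(conn, "observation_list")

# Only now can the full row be stored.
conn.execute(
    "INSERT INTO observation_list "
    "(location, date, sooty_tern, common_noddy, skua, tern, pterochroza) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("oahu to 20.38 N 158.34 W", "1964-10-01", 6, 7, 1, 2, 5),
)

print("before:", before)
print("after: ", after)
```

Every earlier row gets `NULL` in the new columns, so the table keeps growing wider while staying mostly empty.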

The following provides a useful example of an ER diagram, and includes each of the concepts to be discussed below:

![Entity Relationship example diagram](./images/1011px-ER_Diagram_MMORPG.png)

By <a href="https://en.wikipedia.org/wiki/User:TheMattrix" class="extiw" title="en:User:TheMattrix">TheMattrix</a> at the <a href="https://en.wikipedia.org/wiki/" class="extiw" title="w:">English language Wikipedia</a>, <a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=2278339">Link</a>


## Entities

Entities are _nouns_, and can be physical or logical:

* People - teachers, students
* Places - stores, websites, states
* Things - donuts, grades, purchases, courses

Entities are represented as tables within a database. 

### Attributes

Entities have properties or attributes which describe them. For each attribute there is a domain, or a range of legal values. Domains can be limited by data type - integer, string, etc. - and may be further limited by allowable values. For example, the domain of month names is limited to January, February, etc.
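One way to enforce a domain in SQLite is a `CHECK` constraint. The sketch below uses Python's built-in `sqlite3` module and a hypothetical `survey_month` table (not part of the checklist data) to restrict a text column to the twelve month names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: the data type (TEXT) narrows the domain, and the CHECK
# constraint narrows it further to the allowable values.
conn.execute("""
    CREATE TABLE survey_month (
        id INTEGER PRIMARY KEY,
        month_name TEXT NOT NULL CHECK (month_name IN (
            'January', 'February', 'March', 'April', 'May', 'June',
            'July', 'August', 'September', 'October', 'November', 'December'
        ))
    )
""")

# A value inside the domain is accepted.
conn.execute("INSERT INTO survey_month (month_name) VALUES ('October')")

# A value outside the domain is rejected with an IntegrityError.
try:
    conn.execute("INSERT INTO survey_month (month_name) VALUES ('Octember')")
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("rejected:", err)

stored = [row[0] for row in conn.execute("SELECT month_name FROM survey_month")]
print(stored)
```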

There are several types of attributes:

* __Simple attributes__ are atomic values which cannot be decomposed or divided. Examples include _age_, _last name_, _glaze_, etc.
* __Composite attributes__ consist of multiple simple attributes, such as _address_, _full name_, etc.
* __Multivalued attributes__ can include a set of more than one value. _Phone numbers_, _certifications_, etc. are examples of multivalued attributes.
* __Derived attributes__ can be calculated using other attributes. A common example is _age_, which can be calculated from a date of birth.
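The attribute types above might map onto tables roughly as follows. This is a sketch with hypothetical `person` and `phone` tables and made-up values: the composite _full name_ is stored as two simple attributes, the multivalued _phone numbers_ get their own table, and the derived _age_ is calculated at query time instead of being stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Composite attribute "full name" stored as two simple attributes.
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        fname TEXT NOT NULL,
        lname TEXT NOT NULL,
        birth_date TEXT NOT NULL      -- ISO 8601 date, e.g. 1930-12-15
    );
    -- Multivalued attribute "phone numbers": one row per value.
    CREATE TABLE phone (
        person_id INTEGER NOT NULL REFERENCES person (id),
        number TEXT NOT NULL
    );
    INSERT INTO person (fname, lname, birth_date) VALUES ('Ada', 'Example', '1930-12-15');
    INSERT INTO phone (person_id, number) VALUES (1, '555-0100'), (1, '555-0101');
""")

# Derived attribute: age as of a given date. Subtract the years, then subtract
# one more if the birthday has not yet occurred in the reference year.
full_name, age = conn.execute(
    """
    SELECT fname || ' ' || lname,
           (strftime('%Y', :as_of) - strftime('%Y', birth_date))
           - (strftime('%m-%d', :as_of) < strftime('%m-%d', birth_date))
    FROM person
    WHERE id = 1
    """,
    {"as_of": "1964-10-01"},
).fetchone()

phone_count = conn.execute(
    "SELECT COUNT(*) FROM phone WHERE person_id = 1"
).fetchone()[0]

print(full_name, age, phone_count)
```

Storing the date of birth and deriving the age avoids a value that would otherwise go stale every year.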

### Keys

A key is an attribute or combination of attributes which can be used to uniquely identify individual entities within the entity set. That is, keys enforce a uniqueness constraint.

There are multiple types of keys. 

* A __candidate key__ is a simple or composite key that is both unique and minimal. _Minimal_ here means that every included attribute is needed to establish uniqueness. A table or entity set may have more than one candidate key.
* A __composite key__ is a key composed of two or more attributes. Composite keys are also minimal.
* A __primary key__ is the candidate key which is selected to uniquely identify entities in the entity set.
* A __foreign key__ is an attribute that references the primary key of another table or entity set in the database.
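The key types above can be declared directly in SQLite. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `sighting` table whose composite primary key is (`checklist_date`, `species_id`). Note that SQLite only enforces foreign keys on a connection after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are off by default in SQLite

conn.executescript("""
    CREATE TABLE species (
        id INTEGER PRIMARY KEY,              -- primary key
        common_name TEXT NOT NULL UNIQUE     -- a second candidate key
    );
    CREATE TABLE sighting (                  -- hypothetical table for illustration
        checklist_date TEXT NOT NULL,
        species_id INTEGER NOT NULL REFERENCES species (id),   -- foreign key
        bird_count INTEGER,
        PRIMARY KEY (checklist_date, species_id)               -- composite key
    );
    INSERT INTO species (common_name) VALUES ('sooty tern');
""")


def try_insert(params):
    """Attempt one sighting insert and report whether a key constraint blocked it."""
    try:
        conn.execute("INSERT INTO sighting VALUES (?, ?, ?)", params)
        return "ok"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


results = [
    try_insert(("1964-10-01", 1, 6)),   # valid row
    try_insert(("1964-10-01", 1, 9)),   # duplicates the composite primary key
    try_insert(("1964-10-01", 99, 2)),  # species 99 does not exist
]
for line in results:
    print(line)
```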

#### Exercise

1. Referring back to the ATF observation checklist, identify some important entities and their attributes. How do different simple and composite attributes serve to uniquely identify individual entities?
2. Use the **WWW SQL Designer** at [http://ondras.zarovi.cz/sql/demo/](http://ondras.zarovi.cz/sql/demo/) to create an ERD that can be implemented in a database.

## Round 2: Defining entities and attributes

In [11]:
%%sql
DROP TABLE IF EXISTS location;
CREATE TABLE location (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'start_northing' TEXT NOT NULL,
    'start_easting' TEXT NOT NULL,
    'end_northing' TEXT NOT NULL,
    'end_easting' TEXT NOT NULL,
    'start_name' TEXT,
    'end_name' TEXT
);
DROP TABLE IF EXISTS observer;
CREATE TABLE observer (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'fname' TEXT,
    'lname' TEXT,
    'org' TEXT
);
DROP TABLE IF EXISTS species;
CREATE TABLE species (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'taxon' TEXT,
    'common_name' TEXT,
    'count' INTEGER,
    'remarks' TEXT
);
DROP TABLE IF EXISTS observation;
CREATE TABLE observation (
    'id' INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    'date' TEXT NOT NULL,
    'location' INTEGER NOT NULL,
    'observer' INTEGER NOT NULL,
    'species' INTEGER NOT NULL
);

 * sqlite://
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.


[]

In [12]:
# Now populate the tables that will be referenced when recording observations.

try:
    %sql INSERT INTO location ('start_name', 'start_northing', 'start_easting', 'end_northing', 'end_easting') VALUES ('Oahu', '20.50 N', '158.20 W', '20.38 N', '158.34 W');
    %sql INSERT INTO observer ('org') VALUES ('ATF')
    %sql INSERT INTO species ('common_name', 'count', 'remarks') VALUES ('wedge-tailed shearwater', 119, '37.2')
    %sql INSERT INTO species ('common_name', 'count') VALUES ('red-footed booby', 5)
    %sql INSERT INTO species ('common_name', 'count') VALUES ('great frigatebird', 1)
    %sql INSERT INTO species ('common_name', 'count', 'remarks') VALUES ('sooty tern', 6, '1.9')
    %sql INSERT INTO species ('common_name', 'count') VALUES ('common noddy', 7)
    %sql INSERT INTO species ('common_name', 'count') VALUES ('skua', 1)
    %sql INSERT INTO species ('common_name', 'count') VALUES ('tern', 2)
    %sql INSERT INTO species ('common_name', 'count') VALUES ('pterochroza', 5)
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.


In [14]:
# now can reference ids to insert into observations
# note - this is _not_ an example of good design!

%sql select * from location

 * sqlite://
Done.


id,start_northing,start_easting,end_northing,end_easting,start_name,end_name
1,20.50 N,158.20 W,20.38 N,158.34 W,Oahu,


In [15]:
%sql select * from observer

 * sqlite://
Done.


id,fname,lname,org
1,,,ATF


In [16]:
%sql select * from species

 * sqlite://
Done.


id,taxon,common_name,count,remarks
1,,wedge-tailed shearwater,119,37.2
2,,red-footed booby,5,
3,,great frigatebird,1,
4,,sooty tern,6,1.9
5,,common noddy,7,
6,,skua,1,
7,,tern,2,
8,,pterochroza,5,


In [17]:
# Use IDs to enter observation data

try:
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 1);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 2);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 3);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 4);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 5);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 6);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 7);
    %sql INSERT INTO observation ('date', 'location', 'observer', 'species') VALUES ('1964-10-01', 1, 1, 8);
except Exception as e:
    print(str(e))

 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.
 * sqlite://
1 rows affected.


In [20]:
# the problem here is that without foreign keys or a lot of joins we can't dereference the ids of location, observer, etc.
# but for the most part this is a (poor) solution to the problems we encountered trying to capture
# observation data with a flat, spreadsheet-like design

# join example follows in the next cell

%sql select * from observation

 * sqlite://
Done.


id,date,location,observer,species
1,1964-10-01,1,1,1
2,1964-10-01,1,1,2
3,1964-10-01,1,1,3
4,1964-10-01,1,1,4
5,1964-10-01,1,1,5
6,1964-10-01,1,1,6
7,1964-10-01,1,1,7
8,1964-10-01,1,1,8


In [22]:
%%sql
SELECT observation.date, species.common_name, species.count, species.remarks
FROM observation
INNER JOIN species ON observation.species = species.id

 * sqlite://
Done.


date,common_name,count,remarks
1964-10-01,wedge-tailed shearwater,119,37.2
1964-10-01,red-footed booby,5,
1964-10-01,great frigatebird,1,
1964-10-01,sooty tern,6,1.9
1964-10-01,common noddy,7,
1964-10-01,skua,1,
1964-10-01,tern,2,
1964-10-01,pterochroza,5,
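The join above only resolves the species ID. Resolving every ID in `observation` takes one join per referenced table. The sketch below uses Python's built-in `sqlite3` module and abbreviated versions of the Round 2 tables (fewer columns and only two species, as a simplification) to show the pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Abbreviated Round 2 tables with a small sample of the checklist data.
conn.executescript("""
    CREATE TABLE location (id INTEGER PRIMARY KEY, start_name TEXT);
    CREATE TABLE observer (id INTEGER PRIMARY KEY, org TEXT);
    CREATE TABLE species (id INTEGER PRIMARY KEY, common_name TEXT, count INTEGER);
    CREATE TABLE observation (
        id INTEGER PRIMARY KEY,
        date TEXT NOT NULL,
        location INTEGER NOT NULL,
        observer INTEGER NOT NULL,
        species INTEGER NOT NULL
    );
    INSERT INTO location (start_name) VALUES ('Oahu');
    INSERT INTO observer (org) VALUES ('ATF');
    INSERT INTO species (common_name, count) VALUES ('sooty tern', 6), ('common noddy', 7);
    INSERT INTO observation (date, location, observer, species)
        VALUES ('1964-10-01', 1, 1, 1), ('1964-10-01', 1, 1, 2);
""")

# One INNER JOIN per ID column that needs to be turned back into readable values.
rows = conn.execute("""
    SELECT observation.date, location.start_name, observer.org,
           species.common_name, species.count
    FROM observation
    INNER JOIN location ON observation.location = location.id
    INNER JOIN observer ON observation.observer = observer.id
    INNER JOIN species ON observation.species = species.id
    ORDER BY observation.id
""").fetchall()

for row in rows:
    print(row)
```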


# Relationships
------------------------------------------------------------------------------------------------------------

Relationships represent connections between entities. In keeping with the idea that entities are nouns, relationships are verbs. The MMORPG example above demonstrates this: a character _has_ an account, a region _contains_ characters.

_Cardinality_ determines the type of relationship that exists between two entities.

* __One to many (1:M)__: In the example above, region -> character is a 1 to many relationship. That is, one region can have many characters in it.
* __One to one (1:1)__: Not in the diagram above. A one to one relationship can indicate a design issue, since the two entities might really describe the same thing.
* __Many to many (M:N)__: In the example, character and creep have a many to many relationship. Within a database, these need to be implemented as a set of 1:M relationships with an intermediate (dependent) entity.
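A many to many relationship is usually resolved with a third table that holds one row per pairing, so that each original table has a 1:M relationship to it. The sketch below uses Python's built-in `sqlite3` module. The table names `game_character`, `creep`, and `encounter`, and the sample rows, are hypothetical and only loosely follow the MMORPG diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE game_character (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE creep (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Intermediate table: one row per (character, creep) pairing.
    CREATE TABLE encounter (
        character_id INTEGER NOT NULL REFERENCES game_character (id),
        creep_id INTEGER NOT NULL REFERENCES creep (id),
        PRIMARY KEY (character_id, creep_id)
    );
    INSERT INTO game_character (name) VALUES ('Ayla'), ('Bron');
    INSERT INTO creep (name) VALUES ('slime'), ('wolf');
    INSERT INTO encounter (character_id, creep_id) VALUES (1, 1), (1, 2), (2, 1);
""")

# One character, many creeps.
creeps_for_ayla = [row[0] for row in conn.execute("""
    SELECT creep.name
    FROM encounter
    INNER JOIN creep ON encounter.creep_id = creep.id
    WHERE encounter.character_id = 1
    ORDER BY creep.name
""")]

# One creep, many characters.
characters_for_slime = [row[0] for row in conn.execute("""
    SELECT game_character.name
    FROM encounter
    INNER JOIN game_character ON encounter.character_id = game_character.id
    WHERE encounter.creep_id = 1
    ORDER BY game_character.name
""")]

print(creeps_for_ayla, characters_for_slime)
```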


#### Exercise

1. Identify the relationships between entities in the observation checklist. Since we are working with text, we will use notation similar to the examples at [https://www.datanamic.com/support/lt-dez005-introduction-db-modeling.html](https://www.datanamic.com/support/lt-dez005-introduction-db-modeling.html):

Location -> Specimen; 1 location can contain multiple specimens -> 1:N

In some ways Round 2 was better than the flat table, but we still have a problem with the observation -> species relationship. Other relationships are 1:1, as far as we can tell from the data. But as implemented, the M:N relationship between observations and species creates a lot of redundancy.


# Normalization
------------------------------------------------------------------------------------------------------------

Relevant to normalization: `observation` and `species` have an M:N cardinality that needs to be resolved.

Also, do other tables as defined satisfy 1NF, 2NF, and 3NF?

* 1NF - no repeating columns
* 2NF - 1NF AND either a) the PK is a single attribute, or b) if the PK is composite, each non-key attribute must depend on the entire key (this eliminates redundant values)
* 3NF - 2NF AND eliminate transitive dependencies: non-key attributes may not be functionally dependent on another non-key attribute (https://opentextbc.ca/dbdesign01/chapter/chapter-12-normalization/)

So the Round 2 definitions were 1NF, and also 2NF since each PK is a single attribute. If we had created composite keys with species and observation, they would not be 2NF.

* location is 2NF
* observer is 2NF
* species is 2NF with transitive dependencies
* observation is 2NF with transitive dependencies

Now achieve 3NF for all.

In observation, what makes each row unique is the species. Create a dependent entity to resolve the M:N relationship.
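One possible shape for that dependent entity is sketched below using Python's built-in `sqlite3` module. This is an assumption about where a next round could go, not a definitive design: the table name `observation_species` and the column `bird_count` are hypothetical, and the location and observer references are left out for brevity. The count and remarks move out of `species` because they describe a species _on a particular checklist_, not the species itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE species (
        id INTEGER PRIMARY KEY,
        common_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE observation (          -- one row per checklist
        id INTEGER PRIMARY KEY,
        date TEXT NOT NULL
    );
    CREATE TABLE observation_species (  -- dependent entity resolving the M:N
        observation_id INTEGER NOT NULL REFERENCES observation (id),
        species_id INTEGER NOT NULL REFERENCES species (id),
        bird_count INTEGER NOT NULL,
        remarks TEXT,
        PRIMARY KEY (observation_id, species_id)
    );
""")

# Counts and remarks as entered in the Round 2 cells above.
sightings = [
    ("wedge-tailed shearwater", 119, "37.2"),
    ("red-footed booby", 5, None),
    ("great frigatebird", 1, None),
    ("sooty tern", 6, "1.9"),
    ("common noddy", 7, None),
    ("skua", 1, None),
    ("tern", 2, None),
    ("pterochroza", 5, None),
]

conn.execute("INSERT INTO observation (id, date) VALUES (1, '1964-10-01')")
for name, bird_count, remarks in sightings:
    cursor = conn.execute("INSERT INTO species (common_name) VALUES (?)", (name,))
    conn.execute(
        "INSERT INTO observation_species VALUES (1, ?, ?, ?)",
        (cursor.lastrowid, bird_count, remarks),
    )

# The checklist is stored once, and the total becomes a derived attribute.
n_observations = conn.execute("SELECT COUNT(*) FROM observation").fetchone()[0]
n_links = conn.execute("SELECT COUNT(*) FROM observation_species").fetchone()[0]
total_birds = conn.execute(
    "SELECT SUM(bird_count) FROM observation_species WHERE observation_id = 1"
).fetchone()[0]

print(n_observations, n_links, total_birds)
```

Compared with Round 2, the date is no longer repeated once per species, and a new species only needs a new row in `species`, not a change to any table structure.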