# Database Design Concepts
---------------------------------------------------------------------------------------------------------

In our previous session on databases, we introduced some of the fundamental concepts and definitions applicable to databases in general, along with a brief intro to SQL and SQLite in particular. Some use cases and platforms were also discussed.

In this session, we are going to dig a little deeper into databases as representations of systems and processes. A database with a single table may not feel or function much differently from a spreadsheet. Much of the benefit of using databases results from designing them as models of complex systems in ways that spreadsheets just can't support:

* Inventory control and billing
* Human resources
* Blogging platforms
* Ecosystems

There will be some more advanced SQL statements this time, though we will still be using SQLite. Concepts which will be discussed and implemented in our code include:

* Entities and attributes
* Keys
* Relationships
* Normalization

For this session we are also going to play as we go. We will use data from the Portal Project Teaching Database:

> Ernest, Morgan; Brown, James; Valone, Thomas; White, Ethan P. (2018): Portal Project Teaching Database. figshare. Dataset. https://doi.org/10.6084/m9.figshare.1314459.v10 

Go to the item record in figshare and click on the button to _Download all_. Download and unzip the data to your preferred location on your computer.

# The Entity Relationship Data Model
------------------------------------------------------------------------------------------------

The entity relationship (ER) model is commonly used to define and develop databases. In the simplest terms, the model defines the things (entities) that are important or interesting within a system or process and the relationships between them.


## Design Round 1: A flat, spreadsheet-like table

For this part of the workshop, we will use Jamboards to collaboratively identify the entities represented within the data. 

#### Exercise

1. Go to the Jamboard at [INSERT JAMBOARD LINK]
1. Go to the folder with the data just downloaded from figshare. Open the file *combined.csv*.
1. We have also shared a file with some example field notes (*field_notes.xlsx*). Try adding the survey information from the field notes to *combined.csv*. What problems or issues do you run into? Add notes to the Jamboard.
1. It turns out there was an error in data collection. All of the data in *combined.csv* with a date of March 5, 2000, was actually collected on March 6. Try to update all the affected records. As before, add notes to the Jamboard about any issues you encounter while doing this.

Some of the columns in the CSV file have dependencies on information from other columns. We can simplify our data entry and reduce the risk of human error if we split or decompose our single table into multiple tables to eliminate these dependencies.

The entity relationship model provides a process for developing a more robust representation of the system we are observing with our survey data.

The following provides a useful example of an ER diagram, and includes each of the concepts to be discussed below:

![Entity Relationship example diagram](./images/1011px-ER_Diagram_MMORPG.png)

By <a href="https://en.wikipedia.org/wiki/User:TheMattrix" class="extiw" title="en:User:TheMattrix">TheMattrix</a> at the <a href="https://en.wikipedia.org/wiki/" class="extiw" title="w:">English language Wikipedia</a>, <a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=2278339">Link</a>


## Entities

Entities are *nouns*, and can be physical or logical:

* People - teachers, students
* Places - stores, websites, states
* Things - donuts, courses, grades, purchases

Entities are represented as tables within a database. 

## Attributes

Entities have properties or attributes which describe them. For each attribute there is a domain, or a range of legal values. Domains can be limited by data type (integer, string, etc.) and may be further limited by allowable values. For example, the domain of month names is limited to January, February, etc.
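As a minimal sketch of how a domain can be enforced, the following Python snippet uses the standard-library `sqlite3` module to create a table whose data type and `CHECK` constraint limit the legal values of a column. The table name `survey_month` and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table: the CHECK constraint restricts month_name to a fixed domain.
conn.execute("""
    CREATE TABLE survey_month (
        id INTEGER PRIMARY KEY,
        month_name TEXT NOT NULL CHECK (month_name IN (
            'January', 'February', 'March', 'April', 'May', 'June',
            'July', 'August', 'September', 'October', 'November', 'December'
        ))
    )
""")

# A value inside the domain is accepted.
conn.execute("INSERT INTO survey_month (month_name) VALUES ('March')")

# A value outside the domain violates the CHECK constraint.
try:
    conn.execute("INSERT INTO survey_month (month_name) VALUES ('Smarch')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM survey_month").fetchone()[0]
print(rejected, row_count)
```

Only the valid row is stored; the invalid month name is rejected by the database itself rather than by whoever is entering the data.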

There are several types of attributes:

* __Simple attributes__ are atomic values which cannot be decomposed or divided. Examples include _age_, _last name_, _glaze_, etc.
* __Composite attributes__ consist of multiple simple attributes, such as _address_, _full name_, etc.
* __Multivalued attributes__ can include a set of more than one value. _Phone numbers_, _certifications_, etc. are examples of multivalued attributes.
* __Derived attributes__ can be calculated using other attributes. A common example is _age_, which can be calculated from a date of birth.
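Derived attributes are usually calculated when needed rather than stored. The sketch below, which assumes a hypothetical `person` table with dates stored as ISO 8601 text, calculates an age in whole years as of a fixed reference date so that the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: only the date of birth is stored, not the age.
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, last_name TEXT, date_of_birth TEXT)"
)
conn.execute(
    "INSERT INTO person (last_name, date_of_birth) VALUES ('Smith', '1990-06-15')"
)

# Age = difference in years, minus 1 if the birthday has not yet
# occurred in the reference year.
query = """
    SELECT last_name,
           strftime('%Y', :as_of) - strftime('%Y', date_of_birth)
             - (strftime('%m-%d', :as_of) < strftime('%m-%d', date_of_birth)) AS age
    FROM person
"""
last_name, age = conn.execute(query, {"as_of": "2020-03-01"}).fetchone()
print(last_name, age)  # Smith 29
```

Because the age is calculated from the stored date of birth, it never goes out of date the way a stored age value would.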

## Keys

A key is an attribute or combination of attributes which can be used to uniquely identify individual entities within the entity set. That is, keys enforce a uniqueness constraint.

There are multiple types of keys. 

* A __candidate key__ is a simple or composite key that is both unique and minimal. _Minimal_ here means that every included attribute is needed to establish uniqueness. A table or entity set may have more than one candidate key.
* A __composite key__ is a key composed of two or more attributes. Composite keys are also minimal.
* A __primary key__ is the candidate key which is selected to uniquely identify entities in the entity set.
* A __foreign key__ is an attribute that references the primary key of another table or entity set in the database.
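The key types above can be sketched with three small tables. The table and column names here (`plot`, `survey`, `plot_visit`) are simplified assumptions for illustration and are not the exact definitions in the Portal database. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` has been set for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.executescript("""
    CREATE TABLE plot (
        plot_id INTEGER PRIMARY KEY,                        -- primary key
        plot_type TEXT
    );
    CREATE TABLE survey (
        record_id INTEGER PRIMARY KEY,
        plot_id INTEGER NOT NULL REFERENCES plot (plot_id)  -- foreign key
    );
    CREATE TABLE plot_visit (
        plot_id INTEGER REFERENCES plot (plot_id),
        visit_date TEXT,
        PRIMARY KEY (plot_id, visit_date)                   -- composite key
    );
""")

conn.execute("INSERT INTO plot VALUES (1, 'Control')")
conn.execute("INSERT INTO survey VALUES (1, 1)")
conn.execute("INSERT INTO plot_visit VALUES (1, '2000-03-06')")


def violates(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


duplicate_key = violates("INSERT INTO plot VALUES (1, 'Another type')")
missing_parent = violates("INSERT INTO survey VALUES (2, 99)")
duplicate_pair = violates("INSERT INTO plot_visit VALUES (1, '2000-03-06')")
print(duplicate_key, missing_parent, duplicate_pair)
```

Each of the three rejected statements breaks a different rule: a repeated primary key, a foreign key with no matching parent row, and a repeated combination of values in a composite key.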


## Design Round 2: Defining entities and attributes

#### Exercise

1. In the Jamboard, the columns in *combined.csv* have been added as stickies. Each column can be considered an attribute of an entity, for example a plot. Working together, rearrange the stickies into groups of attributes describing a single entity. Use a marker or another sticky of the same color to name the entity (for example "plot").
1. Use DB Browser to open the *portal_mammals.sqlite* database included in the data we downloaded from figshare. Compare the data table definitions with the entities and attributes defined in the previous step. 
1. It may be useful to have information about the person who collected the survey data in the field. Let's add a new table, "recorder," to hold this information. What are some attributes of this entity?
    * first name
    * last name
    * what else?
1. Use the _Modify Table_ feature to add additional attributes for "status" (undergraduate, graduate, staff) and date of birth. 
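The same changes can also be made with SQL instead of the DB Browser interface. The following is one possible sketch, run against an empty in-memory database rather than *portal_mammals.sqlite*; the `recorder_id` column and the choice of `TEXT` for every attribute are assumptions, not part of the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Start with the attributes identified first.
conn.execute("""
    CREATE TABLE recorder (
        recorder_id INTEGER PRIMARY KEY,  -- assumed surrogate key
        first_name TEXT,
        last_name TEXT
    )
""")

# Add the additional attributes afterwards, one column per statement.
conn.execute("ALTER TABLE recorder ADD COLUMN status TEXT")
conn.execute("ALTER TABLE recorder ADD COLUMN date_of_birth TEXT")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(recorder)")]
print(columns)
```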



# Relationships
------------------------------------------------------------------------------------------------------------

Relationships represent connections between entities. In keeping with the idea that entities are nouns, relationships are verbs. The MMORPG example above demonstrates this: a character _has_ an account, a region _contains_ characters.

_Cardinality_ determines the type of relationship that exists between two entities.

* __One to many (1:M)__: In the example above, region -> character is a 1 to many relationship. That is, one region can have many characters in it.
* __One to one (1:1)__: Not in the diagram above. A one to one relationship can indicate a design issue, because the two entities may really describe the same thing.
* __Many to many (M:N)__: In the example, character and creep have a many to many relationship. Within a database, these need to be implemented as a set of 1:M relationships with an intermediate (dependent) entity.
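As a rough sketch of how a many to many relationship is split into two 1:M relationships, the snippet below links characters and creeps through a third table. All table names, column names, and sample rows (`game_character`, `creep`, `encounter`) are hypothetical and are not taken from the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE game_character (character_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE creep (creep_id INTEGER PRIMARY KEY, name TEXT);

    -- Linking table: one row per (character, creep) pairing.
    -- Each side is an ordinary 1:M relationship to this table.
    CREATE TABLE encounter (
        character_id INTEGER NOT NULL REFERENCES game_character (character_id),
        creep_id INTEGER NOT NULL REFERENCES creep (creep_id),
        PRIMARY KEY (character_id, creep_id)
    );

    INSERT INTO game_character VALUES (1, 'Ayla'), (2, 'Borin');
    INSERT INTO creep VALUES (10, 'Slime'), (20, 'Bat');
    INSERT INTO encounter VALUES (1, 10), (1, 20), (2, 10);
""")

# One character can be linked to many creeps...
creeps_for_ayla = [row[0] for row in conn.execute("""
    SELECT creep.name
    FROM encounter
    JOIN creep ON creep.creep_id = encounter.creep_id
    WHERE encounter.character_id = 1
    ORDER BY creep.name
""")]

# ...and one creep can be linked to many characters.
characters_for_slime = [row[0] for row in conn.execute("""
    SELECT game_character.name
    FROM encounter
    JOIN game_character ON game_character.character_id = encounter.character_id
    WHERE encounter.creep_id = 10
    ORDER BY game_character.name
""")]

print(creeps_for_ayla)       # ['Bat', 'Slime']
print(characters_for_slime)  # ['Ayla', 'Borin']
```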


#### Exercise

1. Identify the relationships between entities in the observation checklist. Since we are working with text, we will use notation similar to the examples at [https://www.datanamic.com/support/lt-dez005-introduction-db-modeling.html](https://www.datanamic.com/support/lt-dez005-introduction-db-modeling.html):

Location -> Specimen; 1 location can contain multiple specimens -> 1:N

In some ways Round 2 was better than the flat table, but we still have a problem with the observation -> species relationship. Other relationships are 1:1, as far as we can tell from the data. As implemented, however, the M:N relationship between observations and species creates a lot of redundancy.


# Normalization
------------------------------------------------------------------------------------------------------------

Normalization is a process of analyzing entities and attributes to reduce redundancy and prevent anomalies:

* Update anomaly: Redundant values within a table must be updated multiple times. In the example below, if Smith's favorite donut changes, the table has to be updated twice. Otherwise, there will be inconsistent values.
* Delete anomaly: Deleting data forces the deletion of other attributes. For example, removing apple cider donuts from our table would also force the deletion of Wilson and Wilson's dependent, Pete. (Remember that DELETE operations delete a whole row, not just a single attribute value.)
* Insert anomaly: Data cannot be added to the table without also adding other attributes. If null values are not allowed in the *Favorite_Donut* column, it becomes impossible to add information about an employee who doesn't have a favorite donut.


| EmployeeID | LName      | Favorite_Donut | Dependent |
|------------|------------|---------------|-----------|
| 115        | Smith      | glazed        | James     |
| 115        | Smith      | glazed        | Sandy     |
| 116        | Wilson     | apple cider   | Pete      |
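The update and delete anomalies can be reproduced with the table above. This is a small sketch only: the table name `employee` and the replacement donut value `jelly` are assumptions made for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee "
    "(EmployeeID INTEGER, LName TEXT, Favorite_Donut TEXT, Dependent TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (115, "Smith", "glazed", "James"),
        (115, "Smith", "glazed", "Sandy"),
        (116, "Wilson", "apple cider", "Pete"),
    ],
)

# Update anomaly: changing only one of Smith's two rows leaves
# two conflicting values for the same employee.
conn.execute(
    "UPDATE employee SET Favorite_Donut = 'jelly' "
    "WHERE EmployeeID = 115 AND Dependent = 'James'"
)
smith_donuts = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT Favorite_Donut FROM employee WHERE EmployeeID = 115"
    )
)
print(smith_donuts)  # ['glazed', 'jelly']

# Delete anomaly: removing the apple cider row also removes
# everything we knew about Wilson and Pete.
conn.execute("DELETE FROM employee WHERE Favorite_Donut = 'apple cider'")
wilson_rows = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE LName = 'Wilson'"
).fetchone()[0]
print(wilson_rows)  # 0
```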


Normalization involves removing dependencies among attributes to improve the logical structure and consistency of a database.

There are progressive degrees of normalization across multiple _normal forms_ (NF). There are six normal forms, but generally a database is considered normalized if the tables satisfy the requirements of the first three NF.

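* 1NF: No repeating columns or groups of columns. Each attribute holds a single, atomic value.
* 2NF: A table must be 1NF AND every non-key attribute must depend on the entire primary key. A table with a single-attribute primary key satisfies this automatically. If the key is composite, no non-key attribute may depend on only part of it. That is, eliminate redundant values.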
* 3NF: A table must be 2NF AND eliminate transitive dependencies. That is, remove non-key attributes that depend on other non-key attributes. For example, consider the following table definition for _species_:

```
CREATE TABLE species (
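    -- Identifiers use double quotes (single quotes denote string literals).
    -- PRIMARY KEY already implies uniqueness, so UNIQUE is not repeated.
    "id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    "taxon" TEXT,
    "common_name" TEXT,
    "count" INTEGER,
    "breeding" TEXT,
    "remarks" TEXT
);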
```

The attributes _count_, _breeding_, and _remarks_ describe a particular observation of a species rather than the species itself, so they do not belong in the _species_ table. We need a way to link this information with species _per observation_ by creating a dependent entity to resolve the M:N relationship between species and observations.
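One possible decomposition is sketched below. The `observation` and `species_observation` tables, their columns, and the sample rows are all assumptions for illustration; they are not the definitions used in the workshop database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    -- Species now holds only attributes of the species itself.
    CREATE TABLE species (
        id INTEGER PRIMARY KEY,
        taxon TEXT,
        common_name TEXT
    );

    -- Hypothetical table for each observation event.
    CREATE TABLE observation (
        id INTEGER PRIMARY KEY,
        observed_on TEXT
    );

    -- Dependent entity: attributes that belong to a species
    -- within one particular observation.
    CREATE TABLE species_observation (
        observation_id INTEGER NOT NULL REFERENCES observation (id),
        species_id INTEGER NOT NULL REFERENCES species (id),
        "count" INTEGER,
        breeding TEXT,
        remarks TEXT,
        PRIMARY KEY (observation_id, species_id)
    );

    -- Placeholder sample data.
    INSERT INTO species VALUES (1, 'Bird', 'Example sparrow');
    INSERT INTO observation VALUES (1, '2000-03-06'), (2, '2000-03-07');
    INSERT INTO species_observation VALUES
        (1, 1, 3, 'no', NULL),
        (2, 1, 5, 'yes', 'nest seen');
""")

# The species is stored once, however many times it is observed.
species_rows = conn.execute("SELECT COUNT(*) FROM species").fetchone()[0]
total_count = conn.execute(
    'SELECT SUM("count") FROM species_observation WHERE species_id = 1'
).fetchone()[0]
print(species_rows, total_count)  # 1 8
```

With this structure, correcting a common name means updating a single row in `species`, and counts for separate observations no longer compete for the same column.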

## Design Round 3: Data Integrity



# References

Adrienne Watt and Nelson Eng (n.d.) Database Design - 2nd Edition. Retrieved from [https://opentextbc.ca/dbdesign01/](https://opentextbc.ca/dbdesign01/)

Datanamic (n.d.) Database normalization. Retrieved from [https://www.datanamic.com/support/database-normalization.html](https://www.datanamic.com/support/database-normalization.html)