# Intro to Databases and SQL

Up until now, all of the data in this book has been stored in single files (also called **flat files**) that are not directly connected with one another. In the real world, it's very common to come across data that is stored in a **database**, which is often defined as an organized collection of data. This chapter introduces you to databases and SQL, a powerful language to query and update data stored in a relational database.

## Databases

Numerous ways and theories for how data can be organized have been developed and studied. A **database model** is a term that describes the structure in which the data is stored within a database. A few popular database models are listed below.

* **Flat** - a single file of one table of rows and columns. This is the structure for most of the data in this book.
* **Relational** - multiple two-dimensional tables with rows and columns linked together by primary and foreign keys. This is the structure we will use in this part of the book.
* **Hierarchical** - a tree-like structure such as the one used for your computer's file system.
* **Key-Value** - similar to a Python dictionary, where keys map to specific pieces of data.

### Database Management System

A **Database Management System** or **DBMS** is software that provides users access to the database so that they can store, retrieve, and update the data. This software provides specific commands that the users can issue in order to manage the data.

A **Relational Database Management System** or **RDBMS** is a DBMS specific to databases employing a relational model, which is the only structure of data we will be covering in this chapter. Dozens of RDBMS's exist, with some popular ones listed below:

* **[SQLite][sqlite]** - lightweight free and open source software that comes shipped with Python
* **[MySQL][mysql]** - popular free and open source software often used in web applications
* **[PostgreSQL][postgres]** - popular free and open source software with more features than the two above
* **[Oracle][oracle]** - enterprise software
* **[SQL Server][msft]** - enterprise software from Microsoft

## SQL

Although there are dozens of different RDBMS's, nearly all provide users with a language called **SQL** (Structured Query Language) to update and retrieve the data. SQL is a **declarative** programming language: you describe the result you want from an operation, but not the exact control flow used to produce it. This contrasts with **imperative** programming languages like Python, where each command expresses exactly what operations the computer will execute. Although SQL is declarative, it still has specific syntax rules that must be followed for the commands to execute properly.
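
To make the contrast concrete, here is a minimal sketch that answers the same question twice: once with an imperative Python loop and once with a declarative SQL query run through Python's built-in `sqlite3` module. The table contents and the `age` column are made up for illustration and are not part of the healthcare database.

```python
import sqlite3

# Hypothetical sample data: (patient_id, first_name, age)
rows = [(1, "Ana", 34), (2, "Ben", 71), (3, "Cho", 65), (4, "Dev", 19)]

# Imperative: we spell out every step of the control flow ourselves
seniors_imperative = []
for patient_id, first_name, age in rows:
    if age >= 65:
        seniors_imperative.append(first_name)
seniors_imperative.sort()

# Declarative: we describe the result and let the RDBMS decide how to produce it
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, first_name TEXT, age INTEGER)"
)
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)", rows)
query = "SELECT first_name FROM patient WHERE age >= 65 ORDER BY first_name"
seniors_declarative = [name for (name,) in conn.execute(query)]
conn.close()

print(seniors_imperative)   # ['Ben', 'Cho']
print(seniors_declarative)  # ['Ben', 'Cho']
```

Both approaches return the same names, but only the loop says how to find them.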

### Different dialects of SQL

Each RDBMS has its own distinct dialect of SQL that must be used. Although each dialect is distinct, most commands are similar and many are exactly the same. The American National Standards Institute (ANSI) does publish a specification for SQL, but each RDBMS chooses which features of the standard it implements and often adds extensions of its own.
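
As one small illustration of dialect differences, the sketch below limits a query to two rows. SQLite (like MySQL and PostgreSQL) uses a `LIMIT` clause, while SQL Server writes the same request with `SELECT TOP`. The example runs both spellings against an in-memory SQLite database with made-up rows, so only the first one is expected to succeed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO patient (first_name) VALUES (?)", [("Ana",), ("Ben",), ("Cho",)]
)

# LIMIT is the SQLite spelling for capping the number of rows returned
limit_rows = conn.execute(
    "SELECT first_name FROM patient ORDER BY patient_id LIMIT 2"
).fetchall()

# SELECT TOP is the SQL Server spelling; SQLite reports a syntax error
try:
    conn.execute("SELECT TOP 2 first_name FROM patient ORDER BY patient_id")
    top_supported = True
except sqlite3.OperationalError:
    top_supported = False
conn.close()

print(limit_rows)     # [('Ana',), ('Ben',)]
print(top_supported)  # False
```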

[sqlite]: https://www.sqlite.org/index.html
[mysql]: https://www.mysql.com/
[postgres]: https://www.postgresql.org/
[oracle]: https://www.oracle.com
[msft]: https://www.microsoft.com/sql-server

## Download DbSchema

To make learning SQL easier, download free software called [DbSchema][0]. This software provides a graphical user interface to view the database diagram, browse the data in the tables, and execute SQL statements. It is not itself an RDBMS, but it can connect to many different ones.

[0]: https://dbschema.com/

### Steps to connect to the SQLite Healthcare Database

Once you've downloaded and installed DbSchema, open it and choose **Connect to the Database**. On the next screen, choose **SQLite** from the RDBMS dropdown menu on the top. If prompted, download the **driver** specifically for SQLite. Towards the bottom of the window, you'll need to select the specific database file location. Click the **Choose...** button and navigate to the **data** folder for this book and again into the **databases** folder. Select the **healthcare.db** file and press **connect**. Check all of the boxes on the next screen and click OK.

## The database diagram

You should now see the **database diagram**, also called the **Entity-Relationship Diagram** or **ERD**, which is presented below. The entities are the tables and the relationships are the lines connecting them.

![1]

It's important to understand this diagram as all of the items on it are meaningful. Each of the five large rectangular boxes represents a **table** containing two-dimensional data with rows and columns. This is the same structure of data that we've seen in the CSV files. The table name is centered at the top of each box. The column names and their corresponding data type reside below the table names.

### Primary and foreign keys

Most tables in a relational database will have a column designated as the **primary key**. Each value in a primary key is unique and therefore uniquely identifies each row in the table. The little golden key symbol to the left of the first column in each table represents the primary key in our database diagram.

A **foreign key** is a column that refers to the primary key of another table. The appointment table is the only one containing foreign keys, and it contains the primary keys of each of the other four tables. In the diagram, a foreign key is marked by a blue arrow pointing up, to the right of the column. If a primary key is contained in another table, then it has a golden arrow pointing down to the right of the column.

Foreign key values do not have to be unique in the table in which they appear and may repeat any number of times. In our database, the values of the four foreign keys in the appointment table all repeat multiple times. In the tables where these same columns serve as the primary key, each value appears exactly once.
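
The behavior of the two kinds of keys can be sketched with a tiny in-memory SQLite database. The table and `patient_id` column names follow the diagram, while the `appointment_id` column and all row values are assumptions made for illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` is issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless enabled
conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, first_name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE appointment (
        appointment_id INTEGER PRIMARY KEY,
        patient_id INTEGER REFERENCES patient (patient_id)
    )
""")
conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
# The foreign key value 1 repeats, which is allowed
conn.executemany("INSERT INTO appointment VALUES (?, ?)", [(10, 1), (11, 1), (12, 2)])


def rejected(sql):
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# A primary key value must be unique, so a second patient_id of 1 is refused
duplicate_pk_rejected = rejected("INSERT INTO patient VALUES (1, 'Cho')")
# A foreign key must point at an existing primary key, so patient 99 is refused
unknown_fk_rejected = rejected("INSERT INTO appointment VALUES (13, 99)")
conn.close()

print(duplicate_pk_rejected, unknown_fk_rejected)  # True True
```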

[1]: images/health_erd.png

### One-to-one, one-to-many, and many-to-many relationships

The dashed lines connecting the tables represent the type of relationships each table has with the other. There are three broad types of relationships between two tables:

**One-to-one relationship** - Each row of one table is connected to exactly one row of the other table. In the diagram below, each row in the table on the left is connected to exactly one row in the table on the right. This type of relationship does not exist in our database and is uncommon in relational databases.

![1]

**One-to-many relationship** - Each row of one table is connected to any number of rows of the other table. In the diagram below (which is a subset of the database we will examine), values of the patient_id in the patient table appear exactly once. They are connected to the same patient_id column in the appointment table on the right, where each value may appear any number of times. Here, the order of the tables matters. The "one" component of the relationship applies to the patient table, while "many" applies to the appointment table. From the perspective of the appointment table, it is a **many-to-one relationship**.

![2]

**Many-to-many relationship** - Each row of one table is connected to multiple rows of the other table and vice-versa. Like the one-to-one relationship, this one does not exist in our database and is uncommon in relational databases. 

![3]

The reasons why one-to-one and many-to-many relationships are uncommon in relational databases will be discussed in the **Data Normalization** chapter.
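
One rough way to get a feel for these three relationship types is to count how often each key value repeats on either side. The helper below is a hypothetical sketch, not part of any library, and the key values are made up. It judges the relationship from the data alone, which is only evidence: the real relationship is defined by the database design, and a small sample can make a one-to-many relationship look one-to-one.

```python
from collections import Counter


def relationship(left_keys, right_keys):
    """Classify a link between two tables by how often key values repeat on each side."""
    left_max = max(Counter(left_keys).values())
    right_max = max(Counter(right_keys).values())
    left_side = "one" if left_max == 1 else "many"
    right_side = "one" if right_max == 1 else "many"
    return f"{left_side}-to-{right_side}"


# Hypothetical patient_id values as they might appear in each table
patient_ids = [1, 2, 3]
appointment_patient_ids = [1, 1, 2, 2, 2]

print(relationship(patient_ids, appointment_patient_ids))  # one-to-many
print(relationship(appointment_patient_ids, patient_ids))  # many-to-one
```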

## Crow's foot notation

Look back at the database diagram and take note of the ending of each of the dashed lines. All of the lines originating from the appointment table begin with three little prongs and a little circle. This symbol is part of a set of symbols called **crow's foot notation**.

There are two components of each symbol from crow's foot notation - the **maximum** and the **minimum**. The maximum is always closest to the table and can either represent **one** or **many**. The minimum will refer to either **zero** or **one**. With this particular symbol, the three prongs refer to the maximum and represent **many**. The little circle refers to the minimum and represents **zero**. 

What is meant by a minimum of zero? There might be a patient_id that exists in the patient table, but does not exist in the appointment table. In other words, some patients in the database have no appointments. The four common crow's foot notation symbols are depicted below:

![4]

In DbSchema, you'll notice an arrow at the other end of each of the lines. This is not part of crow's foot notation, but is used whenever the connecting column is the primary key of the table. Because it is a primary key, it is assumed that it is unique (maximum of one) and necessary (minimum of one). In crow's foot notation, you would use the symbol with two vertical lines.

Therefore, looking at the entire line originating from the appointment table to the patient table, we see that there exists a many-to-one relationship with the possibility that not all patient_id values in the patient table appear in the appointment table.
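
The "minimum of zero" can be seen directly in the data. The sketch below builds a tiny stand-in for the patient and appointment tables (the `appointment_id` column and all rows are assumptions) and counts appointments per patient. It uses a SELECT statement with a join, so treat the query itself as a preview.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, first_name TEXT)")
conn.execute("CREATE TABLE appointment (appointment_id INTEGER PRIMARY KEY, patient_id INTEGER)")
conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Cho")])
conn.executemany("INSERT INTO appointment VALUES (?, ?)", [(10, 1), (11, 1), (12, 2)])

# LEFT JOIN keeps every patient, even those with no matching appointment rows
query = """
    SELECT p.patient_id, COUNT(a.appointment_id) AS n_appointments
    FROM patient AS p
    LEFT JOIN appointment AS a ON a.patient_id = p.patient_id
    GROUP BY p.patient_id
    ORDER BY p.patient_id
"""
appointment_counts = conn.execute(query).fetchall()
conn.close()

print(appointment_counts)  # [(1, 2), (2, 1), (3, 0)]
```

Patient 3 exists in the patient table but has zero appointments, while patients 1 and 2 show the "many" side of the relationship.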

### More examples of crow's foot notation

The full context of crow's foot notation is seen through the entire connection of two tables, not just on one end. Let's look at a couple more examples. Here the person and social_security tables have a one-to-one relationship where the person_id appears exactly once in each table.

![5]

In this many-to-many relationship, some song_id values in the price table might not appear in the song table.

![6]


[1]: images/oneone.png
[2]: images/onetomany.png
[3]: images/many_to_many.png
[4]: images/crowsfoot.png
[5]: images/ssn_erd.png
[6]: images/manymanyerd.png

## Viewing the data

All of the data in our healthcare database is stored in two-dimensional tables and is viewable in DbSchema in the Sample Relational Data Explorer, which is open in the bottom half of the screen. Close any tables currently open in that area and then drag and drop individual table names from the upper left-hand window pane into the data explorer. You'll need to navigate into the Schemas and Default folders to find the table names. Here, you can scroll up and down to view the raw data from specific tables.
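
If you prefer to peek at the tables from Python instead, a sketch along the following lines should work with the `sqlite3` module. To keep it self-contained, it builds a small in-memory stand-in with made-up rows rather than opening the real healthcare.db file. To try it on the real data, you would pass the path to that file to `sqlite3.connect` and skip the two setup statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the healthcare.db file
conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Cho")])

# sqlite_master is SQLite's built-in catalog of the objects in a database
table_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# Table names cannot be passed as ? parameters, so only format in trusted names
first_rows = {
    name: conn.execute(f'SELECT * FROM "{name}" LIMIT 2').fetchall()
    for name in table_names
}
conn.close()

print(table_names)  # ['patient']
print(first_rows)
```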

![1]

[1]: images/view_data.png

## SQL statements

We are now ready to write our first lines of SQL code. All of our SQL code will be composed of **SQL statements**, each of which is a specific set of instructions to complete a particular task. You can think of a SQL statement as an English sentence that commands an action to be taken. As SQL is a declarative language, the specific steps of how to complete the command are not given, just the final result of what is needed. For instance, the command "Bring me a medium-rare steak and ice cream" does not specify how the items are to be secured or prepared, only what the result needs to be.

### Categories of SQL statements

Most SQL statements begin with single-word verbs that describe an action. Some common statements include **CREATE**, **DROP**, **ALTER**, **SELECT**, **INSERT**, **UPDATE**, **DELETE**, **GRANT**, **REVOKE**, **COMMIT**, and **ROLLBACK**. These statements are often grouped into different categories that describe the types of actions performed.

**Data Definition** - makes changes to entire tables
* **CREATE** - creates tables
* **DROP** - deletes tables
* **ALTER** - changes a table's structure - changes table/column names, data types, adds new columns

**Data Manipulation** - queries or modifies individual rows/values in a table
* **SELECT** - queries specific data
* **UPDATE** - changes specific values that already exist
* **INSERT** - inserts new rows
* **DELETE** - deletes rows

**Data Control** - determines which users have access to data/commands
* **GRANT** - grant access to specific users
* **REVOKE** - revoke access to specific users

**Transaction Control** - makes changes to the database permanent or undoes them
* **COMMIT** - permanently save changes
* **ROLLBACK** - permanently undo changes since last commit
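
Here is a brief sketch that runs one or two statements from most of these categories against an in-memory SQLite database using Python's `sqlite3` module, with made-up rows. The `commit` and `rollback` methods of the connection stand in for the COMMIT and ROLLBACK statements. The Data Control statements are left out because SQLite does not implement GRANT and REVOKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data Definition: create a table
conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, first_name TEXT)")

# Data Manipulation: insert new rows and change an existing value
conn.execute("INSERT INTO patient VALUES (1, 'Ana')")
conn.execute("INSERT INTO patient VALUES (2, 'Ben')")
conn.execute("UPDATE patient SET first_name = 'Anna' WHERE patient_id = 1")

# Transaction Control: make the changes so far permanent
conn.commit()

# Delete a row, then undo that change since the last commit
conn.execute("DELETE FROM patient WHERE patient_id = 2")
conn.rollback()

# Data Manipulation: query the data
remaining = conn.execute(
    "SELECT patient_id, first_name FROM patient ORDER BY patient_id"
).fetchall()
conn.close()

print(remaining)  # [(1, 'Anna'), (2, 'Ben')]
```

The update survives because it was committed, while the delete is undone by the rollback.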
    

### SQL statement syntax

Each SQL statement has its own specific syntax that must be followed so that it executes properly. Below, we have a **CREATE** statement that creates a table named patient with six columns.

```sqlite
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    sex TEXT,
    address TEXT,
    date_of_birth DATETIME
);
```

As previously mentioned, each RDBMS uses its own dialect of SQL. The above syntax works for SQLite and many others, but is not guaranteed to work for all dialects. Also, the available SQL statements themselves differ depending on the RDBMS. While understanding all of the SQL statements is necessary for administering and managing databases, it is the SELECT statement that is of most importance to those who wish to analyze data and is the main statement that will be covered in great detail in the next chapter. 
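
If you'd like to see the CREATE statement in action without touching the healthcare database, the following sketch runs it against a throwaway in-memory SQLite database and then asks SQLite to describe the new table. `PRAGMA table_info` is specific to SQLite, and other RDBMS's expose this information differently.

```python
import sqlite3

create_sql = """
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    sex TEXT,
    address TEXT,
    date_of_birth DATETIME
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(create_sql)

# Each row of PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk)
columns = [
    (name, col_type, bool(notnull), bool(pk))
    for _, name, col_type, notnull, _, pk in conn.execute("PRAGMA table_info(patient)")
]
conn.close()

# Print name, declared type, whether NOT NULL, and whether part of the primary key
for column in columns:
    print(column)
```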

## Data types and missing values

Like pandas DataFrames, each column of a SQL table has a data type informing us as to what kind of values we can expect in the column. You can expect most RDBMS's to have integer, float, text, and datetime data types. Often, data types specify the size of the value, just like pandas specifies the integer and float bit size (int32, float64, etc.). With text data, it's common to see the maximum number of characters possible, such as nvarchar(20), a variable-length character type that holds up to 20 characters. SQLite is an exception, as its data types are flexible and not as strict as those of other RDBMS's.
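
SQLite's flexibility is easy to demonstrate. In the sketch below, a column is declared as INTEGER, yet SQLite accepts text and floating-point values in it, converting only the values that look like integers. The `reading` table and its values are made up for illustration, and most other RDBMS's would reject the non-integer inserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (value INTEGER)")
conn.executemany(
    "INSERT INTO reading VALUES (?)", [(42,), ("43",), ("forty-four",), (4.5,)]
)

# typeof() reports the type SQLite actually stored for each individual value
stored_types = conn.execute(
    "SELECT value, typeof(value) FROM reading ORDER BY rowid"
).fetchall()
conn.close()

print(stored_types)
# [(42, 'integer'), (43, 'integer'), ('forty-four', 'text'), (4.5, 'real')]
```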

Unlike pandas, missing values in SQL are simple and are all represented by the keyword NULL. Every column, regardless of its data type, uses NULL to represent a missing value.
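
When working with SQLite from Python, the `sqlite3` module maps NULL to `None` in both directions. The sketch below uses made-up rows to show this for a text column and a datetime column. It also shows a common surprise: missing values are found with `IS NULL`, while a comparison written as `= NULL` never matches any row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, first_name TEXT, date_of_birth DATETIME)"
)
# Python's None is stored as NULL regardless of the column's data type
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [(1, "Ana", "1990-05-01"), (2, None, "1985-11-23"), (3, "Cho", None)],
)

missing_name = conn.execute(
    "SELECT patient_id FROM patient WHERE first_name IS NULL"
).fetchall()
missing_dob = conn.execute(
    "SELECT patient_id, date_of_birth FROM patient WHERE date_of_birth IS NULL"
).fetchall()
# Comparing with = NULL is never true, so no rows come back
equals_null = conn.execute(
    "SELECT patient_id FROM patient WHERE first_name = NULL"
).fetchall()
conn.close()

print(missing_name)  # [(2,)]
print(missing_dob)   # [(3, None)]
print(equals_null)   # []
```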

## Exercises

Use the following ERD for the exercises.

![1]

[1]: images/prof_student_class.png

### Exercise 1

<span style="color:green; font-size:16px">In words, describe the relationship between the professor and class tables.</span>

### Exercise 2

<span style="color:green; font-size:16px">In words, describe the relationship between the class and students_in_class tables.</span>

### Exercise 3

<span style="color:green; font-size:16px">What are the minimum and maximum number of professors each student can have?</span>