# Data Normalization
Let's take a look at a few rows from some visits to a doctor in the `tidy/doctor_visits.csv` file.

In [None]:
import pandas as pd
import numpy as np
dv = pd.read_csv('../data/tidy/doctor_visits.csv')
dv

### The information is easy to read
All the information presented in this table is easy to read. Any question that is asked about the visits can be quickly answered.

### Yet something is wrong. What is it?
Although the table of data is sufficient to address our questions, it repeats much of the data and will not scale well as more patients come into the system. It will also be difficult to update historical data, for example when a patient's address changes or a clinic changes its name, because every affected row would have to be edited.

### A short intro into data normalization
Modern databases attempt to reduce the amount of replication in the data by a process called **normalization**. It involves separating data into tables to minimize replication and increase data accuracy.

From [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization):
> Database Normalization, or simply normalization, is the process of organizing the columns (attributes) and tables (relations) of a relational database to reduce data redundancy and improve data integrity. Normalization is also the process of simplifying the design of a database so that it achieves the optimum structure. It reduces and eliminates redundant data. In normalization, data integrity is assured. It was first proposed by Dr. Edgar F. Codd, as an integral part of a relational model.

There is much, much more to data normalization. The following example is just a brief overview.

### Which values are always the same for each visit?
When looking through the table, you should notice that certain values repeat for every visit made by the same patient. For instance, the patient name, address, and birth date are going to be the same each time. There isn't a need to keep repeating all these values for every visit.
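
As a rough way to quantify that repetition, the sketch below counts how many rows repeat patient values that already appeared earlier. The small `visits` DataFrame and its values are invented for illustration and only mimic the patient columns of the real file.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the visits table; the values are invented.
visits = pd.DataFrame({
    'patient_name': ['Ann Lee', 'Ann Lee', 'Bo Chen', 'Ann Lee'],
    'patient_address': ['1 Elm St', '1 Elm St', '9 Oak Ave', '1 Elm St'],
    'patient_birthdate': ['1980-01-05', '1980-01-05', '1975-07-30', '1980-01-05'],
    'visit_date': ['2017-01-03', '2017-02-10', '2017-02-11', '2017-03-15'],
})

patient_cols = ['patient_name', 'patient_address', 'patient_birthdate']

# Rows whose patient values already appeared in an earlier row
n_repeated = visits.duplicated(subset=patient_cols).sum()

# Number of distinct patients
n_patients = len(visits[patient_cols].drop_duplicates())

print(n_repeated, n_patients)
```

The same two calls should work on `dv` with the real column names.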

### Separating patient values into a distinct table
Let's select all the patient columns into a single DataFrame. The **`copy`** method is called to ensure that the new DataFrame is completely independent of the original and does not refer to the same data.

In [None]:
patient = dv[['patient_name', 'patient_address', 'patient_birthdate']].copy()
patient

### Drop duplicate rows
There is no need to store duplicate values of each row. Let's keep only the unique rows.

In [None]:
patient = patient.drop_duplicates()
patient

## Create a primary key to uniquely identify each row
When we normalize our data and separate it into new tables, we add a column to each table that uniquely identifies each row. The unique value that identifies each row is called a **primary key**. Let's add a primary key to the patient table. Note: if pandas raises a **`SettingWithCopyWarning`** here, it can be ignored in this case, because the column is still added to the `patient` DataFrame.

In [None]:
patient['patient_id'] = np.arange(len(patient))
patient
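
If you would rather not rely on ignoring the warning, one alternative is the `assign` method, which returns a new DataFrame instead of modifying an existing one in place. This is only a sketch with invented patient rows; the name `patient_keyed` is hypothetical and not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical patient rows; the values are invented for illustration.
patient = pd.DataFrame({
    'patient_name': ['Ann Lee', 'Bo Chen', 'Cy Diaz'],
    'patient_address': ['1 Elm St', '9 Oak Ave', '4 Pine Rd'],
})

# assign returns a new DataFrame with the extra column; `patient` itself is not modified
patient_keyed = patient.assign(patient_id=np.arange(len(patient)))

print(patient_keyed)
```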

### Rearrange the columns so that `patient_id` is first
It's best to always put the primary key as the first column.

In [None]:
cols = ['patient_id', 'patient_name', 'patient_address', 'patient_birthdate']
patient = patient[cols]
patient

## We just created a dimension
In database terminology, the patient table would be considered a **dimension**. Dimensions are things that exist in your data that are independent of any event taking place. They tend to be static and do not change often (such as your name or birth date).

### Create tables for all the other dimensions
There are several other dimensions in our original table. Let's create new dimension tables for each one of the following:

* clinic
* doctor
* procedure

We will select each of the columns unique to each dimension, drop the replicated rows and add a primary key as the first column to uniquely identify each row.
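
Because the same three steps repeat for every dimension, they could be wrapped in a small helper. The function `make_dimension` below is a hypothetical convenience, not something pandas provides, and the sample rows are invented. The cells that follow do the same work step by step.

```python
import numpy as np
import pandas as pd

def make_dimension(df, cols, key_name):
    """Return the unique rows of df[cols] with an integer key as the first column."""
    # Select the dimension's columns and keep one row per distinct combination
    dim = df[cols].drop_duplicates().reset_index(drop=True)
    # Insert the primary key at position 0 so it becomes the first column
    dim.insert(0, key_name, np.arange(len(dim)))
    return dim

# Hypothetical visits data; the values are invented for illustration.
visits = pd.DataFrame({
    'doctor_name': ['Dr. Kim', 'Dr. Roy', 'Dr. Kim'],
    'doctor_specialty': ['Cardiology', 'Dermatology', 'Cardiology'],
    'cost': [120, 80, 45],
})

doctor_dim = make_dimension(visits, ['doctor_name', 'doctor_specialty'], 'doctor_id')
print(doctor_dim)
```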

### Clinic Dimension

In [None]:
cols = ['clinic_name', 'clinic_address']
clinic = dv[cols].drop_duplicates()
clinic

In [None]:
clinic['clinic_id'] = np.arange(len(clinic))
clinic

In [None]:
cols = ['clinic_id', 'clinic_name', 'clinic_address']
clinic = clinic[cols]
clinic.head()

### Doctor Dimension

In [None]:
cols = ['doctor_name', 'doctor_specialty']
doctor = dv[cols].drop_duplicates()
doctor

In [None]:
doctor['doctor_id'] = np.arange(len(doctor))
doctor

In [None]:
cols = ['doctor_id', 'doctor_name', 'doctor_specialty']
doctor = doctor[cols]
doctor.head()

### Procedure Dimension
Here, the primary key is already given to us with the procedure code. We will keep it as is.

In [None]:
cols = ['procedure_code', 'procedure_name']
procedure = dv[cols].drop_duplicates()
procedure

## Replacing original data with primary keys
We can now revisit our original DataFrame and replace all columns in each dimension with a single column, the primary key of that dimension.

### Join original table to dimension tables
To make the replacement, we will join our original table to each one of our dimension tables. The **`merge`** method joins tables together in pandas. We specify which columns to join on with the **`on`** parameter. We will join on all the non-primary key columns. Below, we join the **`patient`** table. The result is one extra column at the end of the DataFrame, the **`patient_id`**.

In [None]:
dv_fact = dv.merge(patient, on=['patient_name', 'patient_address', 'patient_birthdate'])
dv_fact
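
A join like this should leave the number of visits unchanged, since each visit matches exactly one patient. The sketch below shows one way to guard that assumption using the `how` and `validate` parameters of `merge`. The tables are invented, and `how='left'` is a choice made for this sketch, whereas the cell above uses the default inner join.

```python
import pandas as pd

# Hypothetical visits and patient tables; the values are invented for illustration.
visits = pd.DataFrame({
    'patient_name': ['Ann Lee', 'Bo Chen', 'Ann Lee'],
    'patient_address': ['1 Elm St', '9 Oak Ave', '1 Elm St'],
    'cost': [120, 80, 45],
})
patient = pd.DataFrame({
    'patient_id': [0, 1],
    'patient_name': ['Ann Lee', 'Bo Chen'],
    'patient_address': ['1 Elm St', '9 Oak Ave'],
})

# validate='many_to_one' raises an error if the join columns are duplicated in `patient`
fact = visits.merge(patient, on=['patient_name', 'patient_address'],
                    how='left', validate='many_to_one')

# Every visit is kept and every visit found its patient
assert len(fact) == len(visits)
assert fact['patient_id'].notna().all()
print(fact)
```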

### Drop the dimension columns
We can now drop all the original patient columns as the **`patient_id`** now refers to them.

In [None]:
dv_fact = dv_fact.drop(columns=['patient_name', 'patient_address', 'patient_birthdate'])
dv_fact

## Replace all the other dimensions with primary key columns

### Doctor Dimension

In [None]:
dv_fact = dv_fact.merge(doctor, on=['doctor_name', 'doctor_specialty'])
dv_fact = dv_fact.drop(columns=['doctor_name', 'doctor_specialty'])
dv_fact

In [None]:
clinic

### Clinic Dimension

In [None]:
dv_fact = dv_fact.merge(clinic, on=['clinic_name', 'clinic_address'])
dv_fact = dv_fact.drop(columns=['clinic_name', 'clinic_address'])
dv_fact

### Procedure Dimension
Since the primary key is already in the table, we can just drop the **`procedure_name`** column.

In [None]:
dv_fact = dv_fact.drop(columns=['procedure_name'])
dv_fact

### Rearrange columns with foreign keys first
When a primary key from one table is found in another table, it is called a **foreign key**. Foreign keys can repeat in the table they are in. Primary keys, however, can never repeat in the tables they are in. This is a very important property. We will place the foreign key columns first.

In [None]:
cols = ['patient_id', 'clinic_id', 'doctor_id', 
        'procedure_code', 'visit_date', 'cost']
dv_fact = dv_fact[cols]
dv_fact
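
These key properties can be checked directly. The sketch below uses invented doctor and fact tables to confirm that the primary key never repeats, that the foreign key does repeat, and that every foreign key value refers to an existing row in the dimension table.

```python
import pandas as pd

# Hypothetical dimension and fact tables; the values are invented for illustration.
doctor = pd.DataFrame({
    'doctor_id': [0, 1],
    'doctor_name': ['Dr. Kim', 'Dr. Roy'],
})
fact = pd.DataFrame({
    'doctor_id': [0, 1, 1, 0],
    'cost': [120, 80, 45, 60],
})

# Primary key: no value appears twice in the dimension table
pk_unique = bool(doctor['doctor_id'].is_unique)

# Foreign key: values are allowed to repeat in the fact table
fk_repeats = bool(fact['doctor_id'].duplicated().any())

# Every foreign key value must exist as a primary key in the dimension table
fk_valid = bool(fact['doctor_id'].isin(doctor['doctor_id']).all())

print(pk_unique, fk_repeats, fk_valid)
```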

## Fact Table
This last DataFrame, **`dv_fact`**, is called a **fact table** in database terminology. Fact tables hold the actual **events** or **transactions** that take place in a business. They hold the columns that vary from event to event, such as the date and cost here. If we had data from a grocery store, our fact table would have columns like the number of items purchased, the cost of each item, and the type of payment used. Fact tables have references to the static dimension tables through foreign keys.
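
Splitting the data does not lose the readable, wide view, because it can be rebuilt by joining the fact table back to a dimension on its foreign key. The sketch below uses invented, cut-down tables and also shows that a change of address now touches a single row in the dimension table.

```python
import pandas as pd

# Hypothetical, cut-down tables; ids and values are invented for illustration.
patient = pd.DataFrame({
    'patient_id': [0, 1],
    'patient_name': ['Ann Lee', 'Bo Chen'],
    'patient_address': ['1 Elm St', '9 Oak Ave'],
})
fact = pd.DataFrame({
    'patient_id': [0, 1, 0],
    'visit_date': ['2017-01-03', '2017-02-11', '2017-03-15'],
    'cost': [120, 80, 45],
})

# A change of address is a single-row update in the dimension table
patient.loc[patient['patient_id'] == 0, 'patient_address'] = '22 Birch Ln'

# Joining on the foreign key rebuilds the wide, readable view
flat = fact.merge(patient, on='patient_id', how='left')
print(flat)
```

Both of the first patient's visits pick up the new address without any edit to the fact table.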

## Data Model Diagram
The diagram of the **data model**, or **entity-relationship diagram**, is presented below. Data models show the logical relationships between the fact and dimension tables. This type of data model is called a **star schema** and is one of the simplest designs.

![](images/doctor_data_model.png)

See [this simple Stack Overflow answer][1] for another description of fact and dimension tables.

[1]: https://stackoverflow.com/a/33750545/3707607

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Tidy the dataset `tidy/store_transactions.csv`.</span>