# Exercise - Assess Data Quality and Structure Visually

In this exercise, we will look at a synthetic phase two clinical trial dataset of 350 patients for an innovative new oral insulin called Auralin, a proprietary capsule that can solve a stomach lining problem.

**Note:** Auralin and Novodra are not real insulin products. This clinical trial data was fabricated for the sake of this course. When assessing this data, the issues that you'll detect (and later clean) are meant to simulate real-world data quality and tidiness issues with the capability to **impact quality of care, patient registration, and revenue**. The datasets, `patients` and `treatments`, were constructed with the consultation of real doctors to ensure plausibility.

In [1]:
#DO NOT MODIFY - imports and loading data
import pandas as pd

patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')

## 1. Identify data quality issues in the patients data

The `patients` dataframe contains the following variables:

- **patient_id**: the unique identifier for each patient
- **assigned_sex**: the assigned sex of each patient at birth (male or female)
- **given_name**: the given name (i.e. first name) of each patient
- **surname**: the surname (i.e. last name) of each patient
- **address**: the main address for each patient
- **city**: the corresponding city for the main address of each patient
- **state**: the corresponding state for the main address of each patient
- **zip_code**: the corresponding zip code for the main address of each patient
- **country**: the corresponding country for the main address of each patient (all United States for this clinical trial)
- **contact**: phone number and email information for each patient
- **birthdate**: the date of birth of each patient (month/day/year). 
> The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is age >= 18 *(there is no maximum age because diabetes is a [growing problem](http://www.diabetes.co.uk/diabetes-and-the-elderly.html) among the elderly population)*
- **weight**: the weight of each patient in pounds (lbs)
- **height**: the height of each patient in inches (in)
- **bmi**: the Body Mass Index (BMI) of each patient. 
> BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m<sup>2</sup> where kg is a person's weight in kilograms and m<sup>2</sup> is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. *The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is 16 <= BMI <= 38.*
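
Because `weight` is stored in pounds and `height` in inches, the BMI formula needs a unit conversion before it can be applied. The following is a minimal sketch of that calculation, not part of the exercise itself. It uses the first two rows of the `patients` data displayed in section 1.1; the column names `bmi_recomputed` and `bmi_in_range` are hypothetical, and the range check assumes the inclusion criterion is read as 16 <= BMI <= 38.

```python
import pandas as pd

# Values copied from the first two rows of the patients data (section 1.1)
sample = pd.DataFrame({
    "given_name": ["Zoe", "Pamela"],
    "weight": [121.7, 118.8],  # pounds
    "height": [66, 66],        # inches
    "bmi": [19.6, 19.2],       # BMI as recorded in the dataset
})

LB_TO_KG = 0.45359237  # kilograms per pound
IN_TO_M = 0.0254       # metres per inch

weight_kg = sample["weight"] * LB_TO_KG
height_m = sample["height"] * IN_TO_M

# BMI = kg / m^2, rounded to one decimal place like the recorded values
sample["bmi_recomputed"] = (weight_kg / height_m ** 2).round(1)

# Hypothetical helper column: assumes the criterion means 16 <= BMI <= 38
sample["bmi_in_range"] = sample["bmi"].between(16, 38)

print(sample)
```

Comparing `bmi_recomputed` with the recorded `bmi` is one way to spot rows where height, weight, and BMI disagree with each other.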

### 1.1 Display dataset

Display the first 10 rows of the patients data, and visually inspect the data, using directed and non-directed visual assessment as you see necessary. 

Identify **four** instances of problematic data points corresponding to **accuracy** and **validity** in the patients dataset.

**Hint**: take a look at the `given_name`, `zip_code`, and `height` columns. 

In [2]:
#FILL IN - inspect the first 10 rows of the patients data
patients.head(10)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
5,6,male,Rafael,Costa,1140 Willis Avenue,Daytona Beach,Florida,32114.0,United States,386-334-5237RafaelCardosoCosta@gustr.com,8/31/1931,183.9,70,26.4
6,7,female,Mary,Adams,3145 Sheila Lane,Burbank,NV,84728.0,United States,775-533-5933MaryBAdams@einrot.com,11/19/1969,146.3,65,24.3
7,8,female,Xiuxiu,Chang,2687 Black Oak Hollow Road,Morgan Hill,CA,95037.0,United States,XiuxiuChang@einrot.com1 408 778 3236,8/13/1958,158.0,60,30.9
8,9,male,Dsvid,Gustafsson,1790 Nutter Street,Kansas City,MO,64105.0,United States,816-265-9578DavidGustafsson@armyspy.com,3/6/1937,163.9,66,26.5
9,10,female,Sophie,Cabrera,3303 Anmoore Road,New York,New York,10011.0,United States,SophieCabreraIbarra@teleworm.us1 718 795 9124,12/3/1930,194.7,64,33.4


### 1.2 Identify issues

- Issue 1 (Accuracy): Row 8, column `given_name`. The patient's name is recorded as "Dsvid Gustafsson", but the email address in `contact` spells it "David", so this is probably a data entry error.
- Issue 2 (Accuracy): Row 4, column `height`. Tim Neudorf's height is recorded as 27 inches. His weight (192.3 lbs) and BMI (26.1) are consistent with a height of 72 inches, so the digits were probably transposed.
- Issue 3 (Validity): All rows, column `zip_code`. pandas has parsed the zip codes as floats (e.g., `92390.0`) rather than strings.
- Issue 4 (Validity): Row 3, column `zip_code`. The zip code `7095.0` has only four digits, most likely because a leading zero was lost when the value was read as a number.
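
Visual assessment found these issues, and a few lines of pandas can confirm them. The sketch below is illustrative only: it uses three rows copied from the output above rather than the full `patients.csv`, assumes the sample has no missing zip codes, and uses an arbitrary, hypothetical height threshold of 48 inches to flag implausible adult heights.

```python
import pandas as pd

# Rows copied from the patients output above (a small sample, not the full data)
sample = pd.DataFrame({
    "given_name": ["Zoe", "Liêm", "Tim"],
    "zip_code": [92390.0, 7095.0, 36303.0],
    "height": [66, 70, 27],  # inches
})

# Issue 3: zip codes were parsed as floats
print(sample["zip_code"].dtype)

# Issue 4: numeric zip codes below 10000 have fewer than five digits
short_zip = sample[sample["zip_code"] < 10000]

# One possible repair: cast to int, then to a 5-character zero-padded string
# (assumes no missing values in this sample)
zip_as_str = sample["zip_code"].astype(int).astype(str).str.zfill(5)

# Issue 2: flag heights below a hypothetical plausibility threshold
MIN_PLAUSIBLE_HEIGHT_IN = 48  # assumption chosen for illustration only
implausible_height = sample[sample["height"] < MIN_PLAUSIBLE_HEIGHT_IN]

print(short_zip)
print(zip_as_str.tolist())
print(implausible_height)
```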

## 2. Identify data quality and structural issues in the treatments data

Let's take a brief look at the context around the `treatments` data:

350 patients participated in this clinical trial. None of the patients were using **Novodra** (a popular injectable insulin) or **Auralin** (the oral insulin being researched) as their primary source of insulin before.  All were experiencing elevated HbA1c levels.
> HbA1c stands for Hemoglobin A1c. The [HbA1c test](https://depts.washington.edu/uwcoe/healthtopics/diabetes.html) measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks:
- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

`treatments` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial
- **auralin**: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash `-`) *and* the ending median daily dose of insulin at the end of the 24 weeks of treatment measured over the 24th week of treatment (the number after the dash `-`). Both are measured in units (shortform 'u'), which is the [international unit](https://en.wikipedia.org/wiki/International_unit) of measurement and the standard measurement for insulin.
- **novodra**: same as above, except for patients that continued treatment with Novodra
- **hba1c_start**: the patient's HbA1c level at the beginning of the first week of treatment.
- **hba1c_end**: the patient's HbA1c level at the end of the last week of treatment.
- **hba1c_change**: the change in the patient's HbA1c level from the start of treatment to the end, i.e., `hba1c_start - hba1c_end`.

### 2.1 Display data

Display the first 10 rows of the `treatments` data, and visually inspect the data. Identify three data quality issues corresponding to **completeness** and **consistency** in the dataset.

Hints: 
- For the first completeness issue, recall that each of the two treatment arms (Auralin and Novodra) contained 175 patients. How many records should we have in the dataset, and how many rows does the treatments data currently contain?
- For the second completeness issue, take a look at the `hba1c_change` variable. Is there anything noticeable about the values for this column? What could we do to mitigate this during cleaning?
- For the consistency issue, take a look at the `given_name` and `surname` variables. How are these different from what we saw in the `patients` dataframe?

In [3]:
#FILL IN - inspect the first 10 rows of the treatments data
treatments.head(10)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
5,jasmine,sykes,-,42u - 44u,7.56,7.18,0.38
6,sophia,haugen,37u - 42u,-,7.65,7.27,0.38
7,eddie,archer,31u - 38u,-,7.89,7.55,0.34
8,saber,ménard,-,54u - 54u,8.08,7.7,
9,asia,woźniak,30u - 36u,-,7.76,7.37,


In [4]:
#FILL IN - calculate the number of rows in your data
treatments.shape

(280, 7)

### 2.2. Identify issues
- Issue 1 (Completeness): There are 280 rows in the treatments table, but each of the two treatment arms (Auralin and Novodra) had 175 patients. The dataframe should contain 350 records instead of 280, so 70 records are missing.
- Issue 2 (Completeness): We see a number of NaN values in the `hba1c_change` column. Since the change in a patient's HbA1c level is computed as `hba1c_start - hba1c_end`, it should be possible to fill in these missing values whenever `hba1c_start` and `hba1c_end` are both present. Note: You may also have noticed `-` placeholder values in the `auralin` and `novodra` columns when one treatment applies and the other does not.
- Issue 3 (Consistency): Data in the `treatments` dataframe's `given_name` and `surname` columns are lowercase, but patients' names in the `patients` dataframe start with uppercase letters, indicating a data recording or entry error.
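
The three issues above can also be checked programmatically. The following is a rough sketch under stated assumptions: it uses three rows copied from the `treatments` output above instead of the full file, takes the observed row count of 280 from the `treatments.shape` result, and rounds the recomputed change to two decimals to match the precision of the displayed values.

```python
import pandas as pd

# Rows copied from the treatments output above (a small sample, not the full data)
sample = pd.DataFrame({
    "given_name": ["veronika", "skye", "saber"],
    "hba1c_start": [7.63, 7.97, 8.08],
    "hba1c_end": [7.20, 7.62, 7.70],
    "hba1c_change": [float("nan"), 0.35, float("nan")],
})

# Issue 1: expected vs. observed number of records
EXPECTED_ROWS = 175 + 175  # two treatment arms of 175 patients each
OBSERVED_ROWS = 280        # first element of treatments.shape shown above
missing_rows = EXPECTED_ROWS - OBSERVED_ROWS

# Issue 2: count missing changes, then fill them from start - end
n_missing_change = int(sample["hba1c_change"].isna().sum())
recomputed = (sample["hba1c_start"] - sample["hba1c_end"]).round(2)
filled_change = sample["hba1c_change"].fillna(recomputed)

# Issue 3: are all given names entirely lowercase?
all_lowercase = bool(sample["given_name"].str.islower().all())

print(missing_rows, n_missing_change, filled_change.tolist(), all_lowercase)
```

Filling with `fillna` leaves the existing `hba1c_change` values untouched, so any recorded value that disagrees with `hba1c_start - hba1c_end` would still need a separate check.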

### 2.3 Data structural issue 

Identify one structural issue in the dataset. Take a look at the `auralin` and `novodra` columns.

Issue (Structural): The `auralin` and `novodra` columns violate the first rule of tidiness: that each variable forms a column. There are three variables: treatment (auralin or novodra), start dose (e.g., 41 units), and end dose (e.g., 48 units). Column headers in this case are values, not variable names. Instead of the `auralin` and `novodra` columns, there should be three columns:
- treatment, which contains values Auralin or Novodra
- start_dose
- end_dose
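
One way to reach the three-column layout described above is to melt the two treatment columns and then split the dose string. This is a sketch of one possible approach, not the course's prescribed cleaning step. It uses two rows copied from the `treatments` output, assumes every dose follows the `41u - 48u` pattern seen there, and introduces `melted`, `dose`, and `tidy` as hypothetical names.

```python
import pandas as pd

# Rows copied from the treatments output (a small sample, not the full data)
sample = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Turn the two column headers into values of a single `treatment` variable
melted = sample.melt(
    id_vars=["given_name", "surname"],
    value_vars=["auralin", "novodra"],
    var_name="treatment",
    value_name="dose",
)

# Drop the "-" placeholders for the treatment a patient did not receive
melted = melted[melted["dose"] != "-"].copy()

# Split "41u - 48u" into integer start and end doses
doses = melted["dose"].str.extract(
    r"(?P<start_dose>\d+)u\s*-\s*(?P<end_dose>\d+)u"
).astype(int)

tidy = pd.concat([melted.drop(columns="dose"), doses], axis=1).reset_index(drop=True)
print(tidy)
```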