# Assessing Data
Assessing your data is the second step in the data wrangling process. When assessing, you're inspecting your data set for two things:

- **Data quality issues:** problems with the data's content, such as missing, duplicate, or incorrect values. Data with quality issues is called dirty data.
- **Lack of tidiness:** structural problems that slow you down when cleaning, analyzing, visualizing, or modeling your data later. Data with these issues is called messy data.

This lesson will use a dataset constructed by Udacity about a clinical trial. The trial tested the efficacy of a hypothetical drug, Auralin, that would allow patients to take insulin orally. The control group was treated with an injectable insulin called Novodra. By comparing key metrics between these two drugs, we can determine if Auralin is effective.

---

### Dirty vs. Messy Data

A dirty bedroom and a messy bedroom illustrate the difference. A dirty bedroom can have dirty plates and cutlery, banana peels, or candy wrappers, for instance. These things don't belong in your bedroom and should be cleaned.

A messy (or untidy) bedroom, on the other hand, can have clothes on the floor or an unmade bed. The clothes are meant to be there, but not on the ground, while the bed should probably be made as well. When data is organized, it is no longer messy (or untidy). 

In more formal terms:

- **Dirty data** has issues with its *content*. It is often called *low-quality data* and can include things like inaccurate data, corrupted data, and duplicate data.
- **Messy data** has issues with its *structure*. It is often referred to as *untidy*.

---

### Types of Assessment
- Visual: opening and looking through the data in its entirety.
- Programmatic: using code to view specific parts of the data, for instance with the `.info` method.

### Steps to Assessing Data
1. Detecting an issue
2. Documenting that issue: here, you don't have to specify how to fix it. That can be done later, in the cleaning step of the data wrangling process.

--- 
## Quality Issues

### Visual Assessment: Acquaint Yourself
This Auralin Phase II clinical trial dataset comes in three tables: `patients`, `treatments`, and `adverse_reactions`. Acquaint yourself with them through visual assessment below.

### Gather

In [1]:
import pandas as pd

In [2]:
patients = pd.read_csv('support-files/03_Assessing-Data/patients.csv')
treatments = pd.read_csv('support-files/03_Assessing-Data/treatments.csv')
adverse_reactions = pd.read_csv('support-files/03_Assessing-Data/adverse_reactions.csv')

### Assess
In the cells below, each column of each table in this clinical trial dataset is described. To see the table that goes hand in hand with these descriptions, display each table in its entirety by displaying the pandas DataFrame that it was gathered into. This task is the mechanical part of visual assessment in pandas.

In [3]:
patients.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


`patients` columns:
- **patient_id**: the unique identifier for each patient in the [Master Patient Index](https://en.wikipedia.org/wiki/Enterprise_master_patient_index) (i.e. patient database) of the pharmaceutical company that is producing Auralin
- **assigned_sex**: the assigned sex of each patient at birth (male or female)
- **given_name**: the given name (i.e. first name) of each patient
- **surname**: the surname (i.e. last name) of each patient
- **address**: the main address for each patient
- **city**: the corresponding city for the main address of each patient
- **state**: the corresponding state for the main address of each patient
- **zip_code**: the corresponding zip code for the main address of each patient
- **country**: the corresponding country for the main address of each patient (all United States for this clinical trial)
- **contact**: phone number and email information for each patient
- **birthdate**: the date of birth of each patient (month/day/year). The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is age >= 18 *(there is no maximum age because diabetes is a [growing problem](http://www.diabetes.co.uk/diabetes-and-the-elderly.html) among the elderly population)*
- **weight**: the weight of each patient in pounds (lbs)
- **height**: the height of each patient in inches (in)
- **bmi**: the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m<sup>2</sup> where kg is a person's weight in kilograms and m<sup>2</sup> is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. *The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is 16 <= BMI <= 38.*

In [4]:
tim_weight_lbs = 192.3
tim_height_in = 27
tim_bmi = 703 * tim_weight_lbs / (tim_height_in**2) 
# 703 is the conversion factor for computing BMI from pounds and inches
tim_bmi # a human with a BMI of 185 wouldn't be alive! there must be a mistake

185.44156378600823

What if Tim Neudorf's height was wrongly logged? What if it's actually 72 in instead of 27 in?

In [5]:
tim_weight_lbs = 192.3
tim_height_in = 72
tim_bmi = 703 * tim_weight_lbs / (tim_height_in**2) 
tim_bmi # that seems more likely! and it matches the recorded BMI for Tim Neudorf in the patients table

26.077719907407406
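
The same check can be run on many rows at once rather than one patient at a time. The following is a minimal sketch, not part of the original notebook: it rebuilds a few rows from the `patients` table by hand, and the `tolerance` value and the `bmi_recomputed` column name are assumptions chosen for illustration.

```python
import pandas as pd

# A few rows copied by hand from the patients table shown above.
sample = pd.DataFrame({
    "surname": ["Wellish", "Hill", "Debord", "Neudorf"],
    "weight": [121.7, 118.8, 177.8, 192.3],  # pounds
    "height": [66, 66, 71, 27],              # inches
    "bmi": [19.6, 19.2, 24.8, 26.1],         # BMI as recorded in the table
})

# Recompute BMI from pounds and inches (703 is the unit conversion factor).
sample["bmi_recomputed"] = 703 * sample["weight"] / sample["height"] ** 2

# Assumed tolerance: generous enough to absorb rounding to one decimal place.
tolerance = 0.5
suspect = sample[(sample["bmi_recomputed"] - sample["bmi"]).abs() > tolerance]
print(suspect[["surname", "height", "bmi", "bmi_recomputed"]])
```

A row that fails this check has an inconsistent weight, height, or BMI, but the check alone does not say which of the three is wrong.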

In [6]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before.  All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn't enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:
- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

`treatments` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial
- **auralin**: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) *and* the ending median daily dose of insulin, measured over the 24th and final week of treatment (the number after the dash). Both are measured in units (short form 'u'), which is the [international unit](https://en.wikipedia.org/wiki/International_unit) of measurement and the standard measurement for insulin.
- **novodra**: same as above, except for patients that continued treatment with Novodra
- **hba1c_start**: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The [HbA1c test](https://depts.washington.edu/uwcoe/healthtopics/diabetes.html) measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
- **hba1c_end**: the patient's HbA1c level at the end of the last week of treatment
- **hba1c_change**: the change in the patient's HbA1c level from the start of treatment to the end, i.e., `hba1c_start` - `hba1c_end`. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).

In [7]:
adverse_reactions.head()

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation


`adverse_reactions` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with Auralin and patients treated with Novodra)
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with Auralin and patients treated with Novodra)
- **adverse_reaction**: the adverse reaction reported by the patient

Additional useful information:
- [Insulin resistance varies person to person](http://www.tudiabetes.org/forum/t/how-much-insulin-is-too-much-on-a-daily-basis/9804/5), which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
- It is important to test drugs and medical products in the people they are meant to help. People of different age, race, sex, and ethnic group must be included in clinical trials. This [diversity](https://www.clinicalleader.com/doc/an-fda-perspective-on-patient-diversity-in-clinical-trials-0001) is reflected in the `patients` table.
- Ensuring column names are descriptive enough is an important step in acquainting yourself with the data. 'Descriptive enough' is subjective. Ideally you want short column names (so they are easier to type and read in code form) but also fully descriptive. Length vs. descriptiveness is a tradeoff and common debate (a [similar debate](https://softwareengineering.stackexchange.com/questions/176582/is-there-an-excuse-for-short-variable-names) exists for variable names). The *auralin* and *novodra* column names are probably not descriptive enough, but you'll address that later so don't worry about that for now.

---

### Assessing vs. Exploring
Discovering the issues outlined at the end of this notebook ensures that the analysis can be executed, which for this clinical trial data includes calculating average patient metrics (e.g. age, weight, height, and BMI) and calculating the confidence interval for the difference in mean HbA1c change between Novodra and Auralin patients.

#### Exploring (EDA)
In the context of this dataset, exploring might include using summary statistics like count on the state column or mean on the weight column to see if patients from certain states or of certain weights are more likely to have diabetes, which we can use to exclude certain patients from the analysis and make it less biased.

---

### Data Quality Dimensions
Data quality dimensions help guide your thought process while assessing and also cleaning. The four main data quality dimensions are:

1. **Completeness:** do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
2. **Validity:** we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
3. **Accuracy:** inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
4. **Consistency:** inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

These are listed in decreasing order of severity, meaning that the dimension listed first, completeness, is the most important.
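
As a rough illustration of how some of these dimensions translate into code, here is a minimal sketch. It uses a handful of values copied by hand from the `patients` table, and the variable names are assumptions made for this example. Accuracy is left out on purpose, since it usually needs an outside source of truth to check.

```python
import pandas as pd

# Values copied by hand from the patients table; None mimics a missing cell.
df = pd.DataFrame({
    "patient_id": [1, 4, 5, 210],
    "state": ["California", "NJ", "AL", None],
})

# Completeness: how many state values are missing?
missing_states = df["state"].isnull().sum()

# Validity: does patient_id satisfy a unique key constraint?
ids_unique = df["patient_id"].is_unique

# Consistency: are states written both as full names and as 2-letter codes?
is_abbreviation = df["state"].dropna().str.len() == 2
mixed_state_formats = is_abbreviation.nunique() > 1

print(missing_states, ids_unique, mixed_state_formats)
```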

---

### Programmatic Assessment
Programmatic assessment is usually driven by the problem you want to solve. Using `.info()`, we'll see that there are only 171 non-null entries in the `hba1c_change` column of the `treatments` table, out of 280 rows. That indicates missing data.

Non-directed programmatic assessment can also be useful. It means trying out programmatic assessments without any particular goal in mind. The pandas `.sample()` method, available on DataFrames, displays a random sample of entries.
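
Because `.sample()` returns different rows on every call, it can help to fix the seed when you want to show someone the same rows again. A minimal sketch, using a small hand-built frame rather than the real `treatments` table; the seed value 42 is arbitrary:

```python
import pandas as pd

# Hand-built stand-in for a few rows of the treatments table.
df = pd.DataFrame({
    "given_name": ["veronika", "elliot", "yukitaka", "skye", "alissa"],
    "hba1c_start": [7.63, 7.56, 7.68, 7.97, 7.78],
})

# random_state makes the "random" sample repeatable across calls.
picked = df.sample(n=2, random_state=42)
again = df.sample(n=2, random_state=42)

print(picked)
print(picked.equals(again))
```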

### Assess
These are the programmatic assessment methods in pandas that you will probably use most often:

* .head (DataFrame and Series)
* .tail (DataFrame and Series)
* .sample (DataFrame and Series)
* .info (DataFrame; also available on Series since pandas 1.4)
* .describe (DataFrame and Series)
* .value_counts (Series; also available on DataFrame since pandas 1.1)
* Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc)

Check out the [pandas API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) for detailed usage information.

Try `.head` and `.tail` on the `patients` table.

In [8]:
patients.head(1)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6


In [9]:
patients.tail(1)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3


Try `.sample` on the `treatments` table.

In [10]:
treatments.sample(2)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
130,kamila,pecinová,-,54u - 51u,7.77,7.39,0.38
244,angel,grant,31u - 37u,-,7.93,7.47,


Try `.info` on the `treatments` table.

In [11]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


Try `.describe` on the `patients` table.

In [12]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


Try `.value_counts` on the *adverse_reaction* column of the `adverse_reactions` table.

In [13]:
adverse_reactions['adverse_reaction'].value_counts()

hypoglycemia                 19
injection site discomfort     6
headache                      3
cough                         2
throat irritation             2
nausea                        2
Name: adverse_reaction, dtype: int64

Try selecting the records in the `patients` table for patients that are from the *city* New York.

In [14]:
len(patients.query('city == "New York"'))

18

In [15]:
# also possible
len(patients.loc[patients['city'] == 'New York'])

18

### Quality: Programmatic Assessment

In [16]:
# check if there are missing data
patients.info() # it seems that there are

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


Columns with the wrong data type (dtype) (validity issues):
- `assigned_sex` should be categorical
- `state` should be categorical
- `zip_code` should be a string (object), since you can't perform calculations with zip codes
- `birthdate` should be datetime

There are specific summaries or calculations to be performed on each data type. If a column is mislabeled, these analyses won't be possible. Document these issues!
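
To see why a float `zip_code` matters in practice, here is a minimal sketch using zip codes copied by hand from the first rows of `patients`. The final padding step assumes that a short value lost a leading zero, which is a guess about the cause rather than something the data confirms.

```python
import pandas as pd

# Zip codes as pandas read them: floats, with NaN for a missing value.
zips = pd.Series([92390.0, 61812.0, 7095.0, None], name="zip_code")
print(zips.dtype)

# Render the non-missing values as text and look for fewer than 5 digits.
as_text = zips.dropna().astype(int).astype(str)
short = as_text[as_text.str.len() < 5]
print(short.tolist())

# Under the leading-zero assumption, padding restores a 5-character code.
restored = short.str.zfill(5)
print(restored.tolist())
```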

In [17]:
# filter by missing data (completeness issue)
patients[patients['address'].isnull()] # document this missing info

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2
264,265,female,Wafiyyah,Asfour,,,,,,,11/3/1989,158.6,63,28.1
269,270,female,Flavia,Fiorentino,,,,,,,10/9/1937,175.2,61,33.1
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4


In [18]:
# descriptive stats for the numerical data
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [19]:
# descriptive stats for the numerical data
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


Notice that the difference in `hba1c_change` between the 25th and 50th percentiles is 0.04, while the difference between the 50th and 75th percentiles is 0.54. That suggests inaccurate data. Whoever recorded this data might have miscalculated or made a mistake when writing the number down.

The success or failure of this clinical trial hinges on the results from the HbA1c change, so it's of utmost importance there are no errors in this column. Document the issue to clean it later. 
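
Since `hba1c_change` is defined as `hba1c_start` - `hba1c_end`, one way to locate suspicious entries is to recompute the change and compare it with the recorded value. This is a minimal sketch on the five rows displayed by `treatments.head()`, rebuilt by hand; the 0.005 threshold is an assumption that allows for floating-point noise.

```python
import pandas as pd

# The five rows shown by treatments.head(); None marks a missing change.
rows = pd.DataFrame({
    "surname": ["jindrová", "richardson", "takenaka", "gormanston", "montez"],
    "hba1c_start": [7.63, 7.56, 7.68, 7.97, 7.78],
    "hba1c_end": [7.20, 7.09, 7.25, 7.62, 7.46],
    "hba1c_change": [None, 0.97, None, 0.35, 0.32],
})

# Recompute the change from its definition.
recomputed = (rows["hba1c_start"] - rows["hba1c_end"]).round(2)
recorded = rows["hba1c_change"]

# Compare only where a value was actually recorded.
mismatch = rows[recorded.notna() & ((recorded - recomputed).abs() > 0.005)]
print(mismatch)
```

Rows with a missing `hba1c_change` are a completeness issue and are deliberately excluded from this comparison.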

In [20]:
patients.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
254,255,male,Kang,Mai,1109 Beechwood Drive,Pittsburgh,PA,15205.0,United States,412-274-6756KangMai@jourrapide.com,9/16/1986,200.4,70,28.8
102,103,female,Najla,Touma,2146 Willow Greene Drive,Montgomery,AL,36104.0,United States,NajlaQismahTouma@cuvox.de+1 (334) 261-1235,8/20/1948,165.0,61,31.2
437,438,male,Alwin,Svensson,1846 Joseph Street,Union Grove,WI,53182.0,United States,AlwinSvensson@armyspy.com+1 (262) 878-9576,11/2/1924,137.7,63,24.4
111,112,female,Nicole,Zimmerman,909 Williams Avenue,Newhall,California,91321.0,United States,661-291-1812NicoleZimmerman@rhyta.com,1/14/1984,225.9,74,29.0
452,453,female,Fearne,McGregor,1168 Stout Street,Harrisburg,PA,17111.0,United States,717-368-8321FearneMcGregor@superrito.com,6/30/1936,142.1,65,23.6


There are multiple phone number formats in the `contact` column. This is a consistency issue introduced at data entry. Document the issue.
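
Spotting formats by eye only goes so far, so a rough programmatic tally can help. The sketch below is not from the original notebook: the `PHONE_FORMATS` labels, the patterns, and the `phone_format` helper are all hypothetical, and the patterns only cover formats that appear in rows displayed in this notebook.

```python
import re
import pandas as pd

# Contact values copied by hand from rows displayed in this notebook.
contact = pd.Series([
    "951-719-9170ZoeWellish@superrito.com",
    "PamelaSHill@cuvox.de+1 (217) 569-3204",
    "KarenJakobsen@jourrapide.com1 979 203 0438",
    "johndoe@email.com1234567890",
])

# Hypothetical labels mapped to patterns, tried in this order.
PHONE_FORMATS = {
    "+1 (ddd) ddd-dddd": r"\+1 \(\d{3}\) \d{3}-\d{4}",
    "ddd-ddd-dddd": r"\d{3}-\d{3}-\d{4}",
    "1 ddd ddd dddd": r"1 \d{3} \d{3} \d{4}",
    "dddddddddd": r"\d{10}",
}

def phone_format(value):
    """Return the label of the first pattern found in a contact string."""
    for label, pattern in PHONE_FORMATS.items():
        if re.search(pattern, value):
            return label
    return "unrecognized"

formats = contact.map(phone_format)
print(formats.value_counts())
```

On the full table, missing `contact` values would need to be dropped first (for example with `.dropna()`), and digits inside an email address could produce false matches.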

In [21]:
patients['country'].value_counts()

United States    491
Name: country, dtype: int64

#### Are there any issues in these programmatic assessments?

In [22]:
patients['surname'].value_counts()
# Doe is usually the surname chosen when you don't know someone's name
# this should be investigated 

Doe            6
Jakobsen       3
Taylor         3
Ogochukwu      2
Tucker         2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: surname, Length: 466, dtype: int64

In [23]:
patients['address'].value_counts()
# there are six people who live in the same place? and this address seems fake
# there are 3 addresses with two hits each: do these people live together or
# were these patients logged twice?

123 Main Street             6
2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
3094 Oral Lake Road         1
                           ..
1066 Goosetown Drive        1
4291 Patton Lane            1
4643 Reeves Street          1
174 Lost Creek Road         1
3652 Boone Crockett Lane    1
Name: address, Length: 483, dtype: int64

In [24]:
patients[patients['address'].duplicated()]
# here we can see the rows that represent the value counts above
# I can also see that every 123 Main Street address is related to 
# a John Doe patient. So there is a validity issue here. 
# document this!
# this assessment also brings up the null values for the address column

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [25]:
# 'Elisabeth Knudsen' is not a duplicated name; she was flagged above only
# because her address is missing, so there is no issue here
patients.query('surname == "Knudsen"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4


In [26]:
# 'Jake Jakobsen' is a duplicated record: Jake and Jakob have the same address,
# so the patient was probably logged twice, once under his given name and once
# under a nickname
# validity issue. document this!
patients.query('surname == "Jakobsen"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,77020.0,United States,KarenJakobsen@jourrapide.com1 979 203 0438,11/25/1962,185.2,67,29.0


In [27]:
patients.query('address == "2476 Fulton Street"')
# this is the same validity issue as Jake/Jakob above
# document this!

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962.0,United States,304-438-2648SandraCTaylor@dayrep.com,10/23/1960,206.1,64,35.4
282,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,25962.0,United States,304-438-2648SandraCTaylor@dayrep.com,10/23/1960,206.1,64,35.4


In [28]:
patients.query('address == "2778 North Avenue"')
# this is the same validity issue as Jake/Jakob and Sandy/Sandra above
# document this!

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3
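
The checks above ran into two quirks: `.duplicated()` hides the first occurrence of each repeated value, and it treats missing addresses as duplicates of each other. A minimal sketch of one way around both, using rows rebuilt by hand from the outputs above (matching on `address` plus `birthdate` is an assumption about what identifies the same person):

```python
import pandas as pd

# Rows copied by hand from the outputs above; None marks a missing address.
rows = pd.DataFrame({
    "patient_id": [25, 30, 220, 231, 433],
    "given_name": ["Jakob", "Jake", "Mỹ", "Elisabeth", "Karen"],
    "surname": ["Jakobsen", "Jakobsen", "Quynh", "Knudsen", "Jakobsen"],
    "address": ["648 Old Dear Lane", "648 Old Dear Lane", None, None,
                "1690 Fannie Street"],
    "birthdate": ["8/1/1985", "8/1/1985", "4/9/1978", "9/23/1976",
                  "11/25/1962"],
})

# Default behaviour: only later occurrences are flagged, and the second
# missing address counts as a repeat of the first.
flagged_by_address = rows[rows["address"].duplicated()]

# keep=False flags every member of a repeated group; requiring a non-missing
# address removes the rows that matched only because they were empty.
has_address = rows["address"].notna()
repeated = rows.duplicated(subset=["address", "birthdate"], keep=False)
likely_duplicates = rows[has_address & repeated]
print(likely_duplicates[["patient_id", "given_name", "surname"]])
```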


---

In [29]:
patients['weight'].sort_values()
# the first entry, 48.8, is most likely inaccurate, considering that one of the
# requirements to participate in the study is being at least 18 years old

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

The issue above is an interesting one. It could be an accuracy or validity issue. However, if we think a bit deeper, this could be a consistency issue in disguise, if this was mistakenly logged as kilograms instead of pounds. We can check this using the height and BMI entry for this patient. Document this!

In [30]:
patients.query('weight == 48.8')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
210,211,female,Camilla,Zaitseva,4689 Briarhill Lane,Wooster,OH,44691.0,United States,330-202-2145CamillaZaitseva@superrito.com,11/26/1938,48.8,63,19.1


In [31]:
weight_lbs = patients.query('weight == 48.8')['weight'] * 2.20462 
# 2.20462 is the conversion factor between kilograms and pounds
height_in = patients.query('weight == 48.8')['height']
bmi_check = 703 * weight_lbs / (height_in**2)
bmi_check # notice 19.1 matches the bmi entry in the table 

210    19.055827
dtype: float64

In [32]:
# notice 19.1 matches the bmi entry in the table 
patients.query('weight == 48.8')['bmi']

210    19.1
Name: bmi, dtype: float64

---

In [33]:
sum(treatments['auralin'].isnull())
# there are no null values, but there should be
# since not all patients were treated with auralin

0

In [34]:
sum(treatments['novodra'].isnull())

0

In [35]:
# there are no null values, but there should be
# since not all patients were treated with novodra
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


Representing missing values with a placeholder such as a dash, a slash, N/A, or None is a common issue. With these dashes (strings) present, we wouldn't be able to convert these columns to a numeric type, and without that conversion the calculations needed to report the clinical trial findings won't be possible. This is a validity issue, because nulls should be... nulls, not text. Document this!
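
Placeholders like this can be counted directly, and pandas can also be told to treat them as missing when a file is read. The sketch below uses a two-row inline CSV in place of the real `treatments.csv`, so the file contents here are a hand-made stand-in.

```python
import io
import pandas as pd

# Inline stand-in for treatments.csv, built from the first two rows above.
csv_text = """given_name,surname,auralin,novodra
veronika,jindrová,41u - 48u,-
elliot,richardson,-,40u - 45u
"""

# Read as-is: the dashes are ordinary strings, so nothing counts as null.
raw = pd.read_csv(io.StringIO(csv_text))
placeholder_counts = (raw[["auralin", "novodra"]] == "-").sum()
print(placeholder_counts)

# na_values declares the dash as a missing-value marker at read time.
declared = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
null_counts = declared[["auralin", "novodra"]].isnull().sum()
print(null_counts)
```

Only cells that consist of the dash alone are affected; a dose range such as `41u - 48u` is left untouched.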

---

## Tidiness

### Requirements
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table

### Visual Assessment

In [36]:
patients.head(2)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2


The `contact` column is the first tidiness issue, violating the first requirement for tidy data. There are two variables in one column: phone number and email. They should be separate. Document this! 

In [37]:
treatments.head(2)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97


Using the `auralin` column as an example, the same column currently holds two pieces of information: the starting dosage and the ending dosage. The same goes for the `novodra` column. Because each variable should form its own column for the data to be considered *tidy*, these two columns should be split into three: `treatment`, `start_dose` and `end_dose`.

In the lesson about cleaning data, this will be done using the `melt` function and the `split` method. 
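
As a preview of what that reshaping might look like, here is a minimal sketch on the first two `treatments` rows, rebuilt by hand. It is one possible approach, not necessarily the one used in the cleaning lesson, and the intermediate names `tidy`, `dose`, and `doses` are assumptions.

```python
import pandas as pd

# First two rows of treatments, rebuilt by hand.
treatments = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# melt turns the two drug columns into a treatment column and a dose column.
tidy = treatments.melt(
    id_vars=["given_name", "surname"],
    value_vars=["auralin", "novodra"],
    var_name="treatment",
    value_name="dose",
)

# Keep only the drug each patient actually took.
tidy = tidy[tidy["dose"] != "-"].copy()

# Split "41u - 48u" into start and end doses, dropping the unit suffix.
doses = tidy["dose"].str.split(" - ", expand=True)
tidy["start_dose"] = doses[0].str.rstrip("u").astype(int)
tidy["end_dose"] = doses[1].str.rstrip("u").astype(int)
tidy = tidy.drop(columns="dose").reset_index(drop=True)
print(tidy)
```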

**More Information**
Hadley Wickham, the creator of the tidy data format, provides some standard vocabulary for describing the structure and semantics of a dataset, and then uses those definitions to define tidy data in [Tidying messy datasets](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).

---

### Programmatic Assessment
Programmatic assessment can be handy for the third tidiness requirement: each type of observational unit forms a table.

- Which column headers should belong in what table?
- How many tables do we need to make this clinical trial dataset as a whole tidy?

In [38]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [39]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [40]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 944.0+ bytes


This dataset can be represented tidily in two tables. The first is the `patients` table, with its columns as they are now.

The second is the `treatments` table, extended with the adverse reaction information. Adverse reactions are naturally connected to treatment data, so they belong to the same observational unit (the third requirement of tidy data). From `adverse_reactions`, we only need to keep the `adverse_reaction` column. The other two, `given_name` and `surname`, can be dropped since they duplicate columns in the other tables.

To facilitate joining tables afterwards, it's good practice to use an ID (in this case, `patient_id`) as the primary key, because names can change while IDs stay fixed.

In [41]:
# combining the column names from all three tables shows that
# given_name and surname are duplicated, and can therefore be dropped
all_columns = pd.Series(list(patients) + list(treatments) + list(adverse_reactions))
all_columns[all_columns.duplicated()]

14    given_name
15       surname
21    given_name
22       surname
dtype: object
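
The primary key advice above can be sketched with a small merge. This is a minimal illustration, not the lesson's own cleaning code: the sample rows are invented, and it assumes names in `treatments` are the lowercase form of the names in `patients`.

```python
import pandas as pd

# Toy stand-ins for the lesson's tables (values are invented for illustration).
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "given_name": ["Zoe", "Pamela", "Jae"],
    "surname": ["Wellish", "Hill", "Debord"],
})
treatments = pd.DataFrame({
    "given_name": ["zoe", "jae"],
    "surname": ["wellish", "debord"],
    "hba1c_start": [7.71, 7.56],
})

# Build lowercase name keys so the two tables can be matched.
id_names = patients[["patient_id", "given_name", "surname"]].copy()
id_names["given_name"] = id_names["given_name"].str.lower()
id_names["surname"] = id_names["surname"].str.lower()

# Attach patient_id, then drop the now-redundant name columns.
treatments_keyed = (
    treatments
    .merge(id_names, on=["given_name", "surname"], how="left")
    .drop(columns=["given_name", "surname"])
)
print(treatments_keyed)
```

Once `patient_id` is in place, later joins no longer depend on how a name happens to be spelled or cased.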

## Issues

### Quality
##### `patients` table: 
- Zip code is a float not a [string](https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas) (validity issue)
- Zip code sometimes has only four digits (the spreadsheet software probably read the zip code column as a number and suppressed the leading 0) (accuracy issue)
- Tim Neudorf's height is 27 inches instead of 72 inches (accuracy issue)
- Full state names sometimes, abbreviations other times (consistency issue)
- Dsvid Gustafsson (name typo, patient_id: 9) (accuracy issue)
- Missing demographic information (address through contact columns)
- Erroneous datatypes (`assigned_sex`, `state`, `zip_code`, and `birthdate` columns)
- Multiple phone number formats (`contact` column)
- Default John Doe data
- Multiple records for Jakobsen, Gersten, Taylor
- Weight recorded in kgs instead of lbs for the patient with surname Zaitseva (patient_id: 211)

##### `treatments` table:
- Missing HbA1c changes (completeness issue)
- The letter 'u' in starting and ending doses for Auralin and Novodra (validity issue)
- Lowercase given names and surnames (since the `patients` table capitalizes names, this will be an issue if you decide to join tables) (consistency issue)
- Missing records (280 instead of 350) (completeness issue)
- Erroneous datatypes (`auralin` and `novodra` columns)
- Inaccurate HbA1c changes
- Nulls represented as dashes (-) in `auralin` and `novodra` columns

##### `adverse_reactions` table:
- Lowercase given names and surnames (consistency issue)
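
Several of the quality issues listed above can also be surfaced programmatically. The sketch below uses invented sample rows rather than the real tables. The 48-inch height threshold is an arbitrary assumption, and so is the idea that `hba1c_change` should equal `hba1c_start` minus `hba1c_end`.

```python
import pandas as pd

# Invented sample rows that mimic a few of the issues listed above.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "surname": ["Neudorf", "Doe", "Doe", "Hill"],
    "zip_code": [92390.0, 1730.0, 12345.0, None],
    "height": [27, 66, 70, 65],
})

# Zip codes read as floats lose a leading zero; look for fewer than 5 digits.
zip_as_text = patients["zip_code"].dropna().astype(int).astype(str)
short_zips = zip_as_text[zip_as_text.str.len() < 5]

# Implausible heights (the 48-inch cutoff is an assumption, not a rule).
short_heights = patients[patients["height"] < 48]

# Surnames that occur more than once hint at duplicate or placeholder records.
repeated_surnames = patients[patients["surname"].duplicated(keep=False)]

treatments = pd.DataFrame({
    "hba1c_start": [7.63, 7.97, 9.53],
    "hba1c_end": [7.20, 7.62, 9.10],
    "hba1c_change": [None, 0.35, 0.93],
})

# Assumption: the change should equal start minus end, rounded to 2 decimals.
expected_change = (treatments["hba1c_start"] - treatments["hba1c_end"]).round(2)
missing_change = treatments["hba1c_change"].isna()
wrong_change = ~missing_change & (
    (treatments["hba1c_change"] - expected_change).abs() > 0.005
)

print(short_zips, short_heights, repeated_surnames, sep="\n\n")
print(treatments[missing_change | wrong_change])
```

Checks like these only flag candidates. Each flagged row still needs a visual look before it is documented as an issue.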

### Tidiness
##### `patients` table: 
- `contact` column should be split into `phone_number` and `email`

##### `treatments` table:
- Three variables in two columns (`treatment`, i.e., auralin or novodra, `start_dose`, and `end_dose`)
- Include the `adverse_reaction` column from the `adverse_reactions` table
- Include `patient_id` column from the `patients` table to serve as primary key and facilitate joining the remaining two tables (`treatments` and `patients`) afterwards

##### `adverse_reactions` table:
- After adding the `adverse_reaction` column to the `treatments` table, this table is redundant and can be dropped.
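
The "three variables in two columns" issue noted above is the kind of problem `melt` is designed for. The following is a minimal sketch on invented rows. It assumes doses are stored as strings like `41u - 48u` and that a lone dash marks the drug a patient did not receive.

```python
import pandas as pd

# Invented rows in the assumed shape of the treatments table.
treatments = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrova", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# One row per patient/treatment pair, with the dose range in a single column.
tidy = treatments.melt(
    id_vars=["given_name", "surname"],
    value_vars=["auralin", "novodra"],
    var_name="treatment",
    value_name="dose",
)

# A lone dash means the patient did not receive that drug, so drop those rows.
tidy = tidy[tidy["dose"] != "-"].copy()

# Split "41u - 48u" into start and end doses and strip the unit letter.
doses = tidy["dose"].str.split(" - ", expand=True)
tidy["start_dose"] = doses[0].str.rstrip("u").astype(int)
tidy["end_dose"] = doses[1].str.rstrip("u").astype(int)
tidy = tidy.drop(columns="dose").reset_index(drop=True)

print(tidy)
```

After this reshaping each variable (`treatment`, `start_dose`, `end_dose`) has its own column, which satisfies the first tidy data requirement.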

### Reminder: Data Wrangling Can Be Iterative
Iteration is less common in clinical trials, given the rigor involved in planning them. Still, in theory, the following situations could arise and require iteration:

- Maybe you (as the data analyst or data scientist on the clinical trial research team) realized your statistical power calculations were wrong, and you needed to recruit more patients to make your study statistically significant. You'd have to revisit gathering in this scenario.
- Maybe you realized you were missing a key piece of patient information, like patient blood type (again, unlikely given the rigor of clinical trials, but mistakes happen), because you discovered new research that related insulin resistance to blood type. You'd also have to revisit gathering in this scenario.
- Maybe you finished assessing, started cleaning, and spotted another data quality issue. Revisiting assessing to add these assessments to your notes is fine.

### Assess: Summary
Assessing is the second step in the data wrangling process:

- Gather
- **Assess**
- Clean

You can assess data for:

- Quality: issues with content. Low-quality data is also known as dirty data.
- Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
    1. Each variable forms a column.
    2. Each observation forms a row.
    3. Each type of observational unit forms a table.

...using two types of assessment:

1. Visual assessment: scrolling through the data in your preferred software application (Google Sheets, Excel, a text editor, etc.).
2. Programmatic assessment: using code to view specific portions and summaries of the data (pandas' `head`, `tail`, and `info` methods, for example).