# Assessing Data Quality Programmatically

Context: The dataset is a synthetic phase two clinical trial dataset of 350 patients for a new oral insulin called Auralin, a proprietary capsule that can solve a stomach lining problem.

In [2]:
#Import pandas
import pandas as pd

#Load in datasets
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')

## Data quality issues with the treatments table

In [3]:
#View the first few rows of the treatments dataframe
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [4]:
#View the last few rows of the treatments dataframe
treatments.tail()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
275,albina,zetticci,45u - 51u,-,7.93,7.73,0.2
276,john,teichelmann,-,49u - 49u,7.9,7.58,
277,mathea,lillebø,23u - 36u,-,9.04,8.67,0.37
278,vallie,prince,31u - 38u,-,7.64,7.28,0.36
279,samúel,guðbrandsson,53u - 56u,-,8.0,7.64,0.36


In [5]:
#Returns one random entry from the dataframe
#for non-directed programmatic assessment
treatments.sample()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
95,jaakko,eskelinen,39u - 45u,-,7.73,7.42,0.31


In [6]:
treatments.sample(5) #returns 5 entries

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
48,zak,kelly,-,38u - 38u,7.66,7.26,
245,wu,sung,-,47u - 48u,7.61,7.12,0.99
87,finley,chandler,31u - 45u,-,7.65,7.26,0.39
10,joseph,day,29u - 36u,-,7.7,7.19,
227,erica,macdonald,33u - 39u,-,7.55,7.26,0.29


In [7]:
#Returns 5 entries; fixing random_state=2 makes the sample reproducible
treatments.sample(5, random_state=2) 

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
157,asuna,morita,-,35u - 39u,7.58,7.25,0.33
7,eddie,archer,31u - 38u,-,7.89,7.55,0.34
99,abel,yonatan,-,38u - 39u,7.88,7.5,
13,gregor,bole,-,47u - 45u,7.61,7.16,0.95
112,olof,holm,39u - 52u,-,7.85,7.43,


**1. Completeness**

In [8]:
#Get the information of the treatments dataframe using .info()
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


*Interpretation:* The summary returned by `.info()` shows only 171 non-null hba1c_change entries, while every other column has 280. That leaves 109 hba1c_change values missing, which is a completeness issue.
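
The missing values can also be counted directly rather than read off the summary. The following is a minimal sketch, assuming a small hand-built frame (`sample`, a hypothetical stand-in holding the five rows shown by `treatments.head()`) in place of the full table:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the treatments table: the five rows shown by head()
sample = pd.DataFrame({
    'given_name': ['veronika', 'elliot', 'yukitaka', 'skye', 'alissa'],
    'hba1c_start': [7.63, 7.56, 7.68, 7.97, 7.78],
    'hba1c_end': [7.20, 7.09, 7.25, 7.62, 7.46],
    'hba1c_change': [np.nan, 0.97, np.nan, 0.35, 0.32],
})

# Count the missing values in each column
missing_counts = sample.isnull().sum()
print(missing_counts)
```

Run against the full `treatments` table, the count for `hba1c_change` should equal the gap between the 280 rows and the non-null count reported by `.info()`.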

**2. Validity**

In [9]:
#Get the data types of the different variables in the dataframe
treatments.dtypes

given_name       object
surname          object
auralin          object
novodra          object
hba1c_start     float64
hba1c_end       float64
hba1c_change    float64
dtype: object

*Interpretation:* The auralin and novodra columns have the object data type because each value is a string such as "41u - 48u". Ideally, the doses in these two columns would be stored as integers or floats so they can be used directly in calculations.
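
One possible way to turn these strings into numbers is sketched below. It assumes every dose follows the "NNu - NNu" pattern seen in the output above, and the column names `dose_start` and `dose_end` are hypothetical names chosen for illustration:

```python
import pandas as pd

# Hypothetical sample of dose strings in the format seen in the auralin column
doses = pd.Series(['41u - 48u', '-', '33u - 36u'], name='auralin')

# Capture the number on each side of the dash; rows that do not match become NaN
parsed = doses.str.extract(r'(?P<dose_start>\d+)u - (?P<dose_end>\d+)u')

# Convert the captured text to numbers (float, because NaN cannot be stored as int64)
parsed = parsed.apply(pd.to_numeric)
print(parsed)
```

Rows that contain only a dash come out as NaN in both new columns, so they no longer look like valid values.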

**3. Accuracy**

In [10]:
#Describe the dataframe using .describe()
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


*Interpretation*: The maximum hba1c_change of 0.99 is large compared with the median of 0.38, and the 75th percentile of 0.92 is also well above the median.

Looking at the first rows of the treatments table below, the hba1c_change of 0.97 for Elliot Richardson is calculated incorrectly: 7.56 - 7.09 = 0.47, not 0.97. This indicates an accuracy issue.

In [11]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
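
The same arithmetic check can be applied to every row at once. This is a minimal sketch, assuming a hypothetical frame `sample` built from three of the rows above and an illustrative tolerance of 0.005 to absorb floating-point rounding:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: three rows from the head() output that report a change
sample = pd.DataFrame({
    'surname': ['richardson', 'gormanston', 'montez'],
    'hba1c_start': [7.56, 7.97, 7.78],
    'hba1c_end': [7.09, 7.62, 7.46],
    'hba1c_change': [0.97, 0.35, 0.32],
})

# Recompute the change from the start and end values
recomputed = (sample.hba1c_start - sample.hba1c_end).round(2)

# Flag rows where the reported change differs from the recomputed one
mismatch = ~np.isclose(recomputed, sample.hba1c_change, atol=0.005)
print(sample[mismatch])
```

Only the Richardson row is flagged in this sample, since the other two reported values match start minus end.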


**4. Validity**

In [12]:
#Count null values in the auralin column
#We would expect a non-zero count here,
#since hyphens indicate missing data
sum(treatments.auralin.isnull())

0

In [13]:
#Count null values in the novodra column
#We would expect a non-zero count here,
#since hyphens indicate missing data
sum(treatments.novodra.isnull())

0

*Interpretation*: The dashes aren't picked up as null values in the `auralin` and `novodra` columns, which can be problematic when doing calculations on the data. Misrepresenting missing values is a validity issue.
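
A minimal sketch of one way to make the dashes visible as missing values, assuming a hypothetical frame `sample` that holds the dose columns from the first five rows of the table:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: the dose columns from the first five treatments rows
sample = pd.DataFrame({
    'auralin': ['41u - 48u', '-', '-', '33u - 36u', '-'],
    'novodra': ['-', '40u - 45u', '39u - 36u', '-', '33u - 29u'],
})

# Replace cells that consist of exactly '-' with NaN; dose ranges are untouched
cleaned = sample.replace('-', np.nan)

# The dashes are now counted as nulls
null_counts = cleaned.isnull().sum()
print(null_counts)
```

Alternatively, passing `na_values='-'` to `pd.read_csv` would mark the dashes as missing at load time.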

## Data quality issues with the patients table

**5. Consistency**

In [23]:
#Use the describe() function on the patients dataframe
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


*Interpretation*: The minimum weight is 48.8 pounds, which is implausibly low given that the average weight is about 173 lbs.

In [24]:
#Use sort_values() method to sort values
#for the weight column from low to high weight.
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

Why is the minimum weight value 48.8? Let's look at the corresponding row.

In [25]:
#Identify the row where the weight column value is at its minimum.
patients[patients.weight==patients.weight.min()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
210,211,female,Camilla,Zaitseva,4689 Briarhill Lane,Wooster,OH,44691.0,United States,330-202-2145CamillaZaitseva@superrito.com,11/26/1938,48.8,63,19.1


Let's check this anomaly with an assumption: maybe this patient's weight was logged in kilograms instead of pounds. Let's test this by converting the weight to pounds and recalculating the BMI ourselves.

In [26]:
#Conversion factor from kilograms to pounds is 2.20462
#Get the weight of the patient, converted to pounds
weight_lbs = patients[patients.surname == 'Zaitseva'].weight * 2.20462
#Get the height of the patient in inches
height_in = patients[patients.surname == 'Zaitseva'].height
#Calculate the BMI using the equation 703 * weight / height squared
bmi_check = 703 * weight_lbs / (height_in * height_in)
bmi_check

210    19.055827
dtype: float64

In [27]:
#Get the patients' reported BMI
patients[patients.surname == 'Zaitseva'].bmi

210    19.1
Name: bmi, dtype: float64

*Interpretation:* The BMI we calculated comes to 19.056, which rounds to the 19.1 reported in the data. This means this patient's weight was recorded in kilograms, whereas the rest of the dataset records weight in pounds. This is a consistency issue with the unit of measurement.
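
The same idea can be used to scan for other rows with a unit problem: recompute the BMI from weight and height assuming pounds and inches, then flag rows where it disagrees with the reported BMI. This is a sketch under stated assumptions: `sample` is a hypothetical frame built from four rows that appear in this notebook's output, and the 0.1 tolerance is an illustrative choice to allow for rounding in the reported BMI.

```python
import pandas as pd

# Hypothetical stand-in: four patients taken from rows shown in this notebook
sample = pd.DataFrame({
    'surname': ['Zaitseva', 'Jakobsen', 'Taylor', 'Gersten'],
    'weight': [48.8, 155.8, 206.1, 138.2],
    'height': [63, 67, 64, 71],
    'bmi': [19.1, 24.4, 35.4, 19.3],
})

# BMI formula for pounds and inches: 703 * weight / height squared
bmi_from_lbs = 703 * sample.weight / sample.height ** 2

# Rows where the recomputed BMI is far from the reported one are suspect
suspect = sample[(bmi_from_lbs - sample.bmi).abs() > 0.1]
print(suspect)
```

In this sample only the Zaitseva row is flagged, because her weight yields a BMI of about 8.6 when treated as pounds.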

**6. Uniqueness + Validity**

In [28]:
#Count how many times each surname appears
patients.surname.value_counts()

Doe            6
Jakobsen       3
Taylor         3
Ogochukwu      2
Tucker         2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: surname, Length: 466, dtype: int64

In [29]:
#Count how many times each address appears
patients.address.value_counts()

123 Main Street             6
2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
3094 Oral Lake Road         1
                           ..
1066 Goosetown Drive        1
4291 Patton Lane            1
4643 Reeves Street          1
174 Lost Creek Road         1
3652 Boone Crockett Lane    1
Name: address, Length: 483, dtype: int64

In [12]:
#Use .duplicated() to select rows whose surname and address repeat an earlier row
patients[patients.duplicated(subset=['surname','address'])]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
277,278,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
282,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,25962.0,United States,304-438-2648SandraCTaylor@dayrep.com,10/23/1960,206.1,64,35.4
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3


*Interpretation:* We find several John Does who live at 123 Main Street, New York, NY, ZIP code 12345, with the email johndoe@email.com.

This is a uniqueness issue, because we have multiple duplicated rows in the dataset, and a validity issue because this data doesn't conform to the defined schema of one record per patient.
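
Note that `.duplicated()` defaults to `keep='first'`, so the first occurrence of each repeated surname and address pair is not flagged. That is why the output above lists five John Doe rows while `value_counts()` reports six. The sketch below shows how `keep=False` returns every copy, and how `drop_duplicates()` would keep one row per pair. It assumes a hypothetical frame `sample` with a few columns taken from rows shown above; whether dropping rows is the right fix for this dataset is a separate cleaning decision.

```python
import pandas as pd

# Hypothetical stand-in: one unique patient and three copies of the same record
sample = pd.DataFrame({
    'patient_id': [211, 230, 238, 245],
    'given_name': ['Camilla', 'John', 'John', 'John'],
    'surname': ['Zaitseva', 'Doe', 'Doe', 'Doe'],
    'address': ['4689 Briarhill Lane', '123 Main Street',
                '123 Main Street', '123 Main Street'],
})

# keep=False flags every row in a duplicated group, including the first one
all_copies = sample[sample.duplicated(subset=['surname', 'address'], keep=False)]

# drop_duplicates keeps only the first row of each surname/address pair
deduped = sample.drop_duplicates(subset=['surname', 'address'], keep='first')
print(all_copies)
print(deduped)
```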