# Cleaning Data
The best way to clean data is to code it yourself.

The Phase II clinical trial dataset for a new oral insulin called Auralin, introduced in Lesson 3 (Assessing Data), is used again here in Lesson 4 (Cleaning Data).

---

## Manual vs. Programmatic Cleaning
**Manual Data Cleaning** includes:

- Retyping incorrect data
- Copying and pasting columns and rows

However, manual cleaning is inefficient, error-prone, and demoralizing, so never clean manually.

**Programmatic Data Cleaning** uses code to:

- Automate cleaning tasks
- Minimize repetition
- Save time

Data wrangling takes up a tremendous amount of a data professional's time, so anything that saves time is valuable.

---

## Data Cleaning Process
Programmatic data cleaning is a separate process within data wrangling. It has three steps:

1. **Define:** first, define a data cleaning plan in writing by converting your assessments into cleaning tasks, each described as a short how-to guide. This plan also serves as documentation so that your work can be reproduced.
2. **Code:** second, translate these words into code and run it.
3. **Test:** finally, test your dataset often, using code, to make sure your cleaning code worked. This is like revisiting the assess step.

> Remember that before any cleaning occurs, it's good practice to make a copy of each piece of data. This can be done using the `copy` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html).
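
A minimal sketch of why the copy matters, using a made-up two-row DataFrame (the names `raw`, `alias`, and `cleaned` are illustrative only): plain assignment only creates a second name for the same object, while `copy` creates an independent DataFrame that can be cleaned safely.

```python
import pandas as pd

# Hypothetical stand-in for a freshly gathered dataset
raw = pd.DataFrame({'Animal': ['bbCow', 'bbGoat']})

alias = raw            # plain assignment: both names refer to the same object
cleaned = raw.copy()   # independent copy that is safe to modify

# Clean only the copy
cleaned['Animal'] = cleaned['Animal'].str[2:]

print(alias is raw)                # the alias is not a copy
print(raw['Animal'].tolist())      # the original values are untouched
print(cleaned['Animal'].tolist())  # only the copy was cleaned
```

Keeping the original untouched means you can always compare the cleaned data against it or restart the cleaning from scratch.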

---

## Cleaning Sequences
There are multiple ways of sequencing the steps in the data cleaning process. For instance:

### Gather

In [1]:
import pandas as pd
import numpy as np

In [2]:
animals = pd.read_csv('support-files/04_Cleaning-Data/animals.csv')

### Assess

In [3]:
animals.head()

Unnamed: 0,Animal,Body weight (kg),Brain weight (g)
0,bbMountain beaver,1!35,8!1
1,bbCow,465,423
2,bbGrey wolf,36!33,119!5
3,bbGoat,27!66,115
4,bbGuinea pig,1!04,5!5


#### Quality
- bb before every animal name
- ! instead of . for decimal in body weight and brain weight

### Clean

In [4]:
animals_cleaned = animals.copy()

#### Define
- Remove 'bb' before every animal name using string slicing
- Replace ! with . in body weight and brain weight columns

#### Code

In [5]:
# remove 'bb' before every animal name using string slicing
animals_cleaned['Animal'] = animals_cleaned['Animal'].str[2:]

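Note: `str.replace` [documentation](https://docs.python.org/dev/library/stdtypes.html#str.replace). The pandas `Series.str.replace` method used in the next cell applies the same idea to every value in a column.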

In [6]:
# replace ! with . in body weight and brain weight columns
animals_cleaned['Body weight (kg)'] = animals_cleaned['Body weight (kg)'].str.replace('!', '.')
animals_cleaned['Brain weight (g)'] = animals_cleaned['Brain weight (g)'].str.replace('!', '.')

#### Test

In [7]:
animals_cleaned.head()

Unnamed: 0,Animal,Body weight (kg),Brain weight (g)
0,Mountain beaver,1.35,8.1
1,Cow,465.0,423.0
2,Grey wolf,36.33,119.5
3,Goat,27.66,115.0
4,Guinea pig,1.04,5.5
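
Looking at `head()` is a visual test. As a rough sketch of a programmatic one, the block below rebuilds a few of the rows shown above in a small stand-in DataFrame (`sample` is an illustrative name, not part of the lesson) and uses `assert` statements so that a failed cleaning step raises an error. The final `pd.to_numeric` conversion is an extra step that is not in the cleaning plan above.

```python
import pandas as pd

# Small stand-in for the animals table, built from rows displayed above
sample = pd.DataFrame({
    'Animal': ['bbMountain beaver', 'bbCow', 'bbGrey wolf'],
    'Body weight (kg)': ['1!35', '465', '36!33'],
})

# Same cleaning steps as in the Code section
sample['Animal'] = sample['Animal'].str[2:]
# regex=False treats '!' and '.' as literal characters
sample['Body weight (kg)'] = sample['Body weight (kg)'].str.replace('!', '.', regex=False)

# Programmatic tests: each assert raises an AssertionError if the cleaning failed
assert not sample['Animal'].str.startswith('bb').any()
assert not sample['Body weight (kg)'].str.contains('!', regex=False).any()

# Extra step (an assumption, not in the plan): convert the text to numbers
sample['Body weight (kg)'] = pd.to_numeric(sample['Body weight (kg)'])
print(sample.dtypes)
```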


### Another possible cleaning sequence
You can also use multiple **Define**, **Code**, and **Test** headers, one for each data quality and tidiness issue (or group of data quality and tidiness issues). Effectively, you are defining then coding then testing immediately. This sequence is helpful when you have a lot of quality and tidiness issues to clean. Since that is the case in this lesson, this sequence will be used.

Pasting each assessment above the **Define** header as its own header can also be helpful.

In [8]:
# reload animals_cleaned
animals_cleaned = animals.copy()

In [9]:
# checking it worked
animals_cleaned.head(1)

Unnamed: 0,Animal,Body weight (kg),Brain weight (g)
0,bbMountain beaver,1!35,8!1


### Clean

#### Assessment
bb before every animal name

#### Define
Remove 'bb' before every animal name using string slicing.

#### Code

In [10]:
animals_cleaned['Animal'] = animals_cleaned['Animal'].str[2:]

#### Test

In [11]:
animals_cleaned.head()

Unnamed: 0,Animal,Body weight (kg),Brain weight (g)
0,Mountain beaver,1!35,8!1
1,Cow,465,423
2,Grey wolf,36!33,119!5
3,Goat,27!66,115
4,Guinea pig,1!04,5!5


#### Assessment
! instead of . for decimal in body weight and brain weight

#### Define
Replace ! with . in body weight and brain weight columns

#### Code

In [12]:
animals_cleaned['Body weight (kg)'] = animals_cleaned['Body weight (kg)'].str.replace('!', '.')
animals_cleaned['Brain weight (g)'] = animals_cleaned['Brain weight (g)'].str.replace('!', '.')

#### Test

In [13]:
animals_cleaned.head()

Unnamed: 0,Animal,Body weight (kg),Brain weight (g)
0,Mountain beaver,1.35,8.1
1,Cow,465.0,423.0
2,Grey wolf,36.33,119.5
3,Goat,27.66,115.0
4,Guinea pig,1.04,5.5


## Cleaning: Phase II Clinical Trial for Auralin

### Assessments (copied from the previous lesson to avoid having two notebooks open):

### Quality
##### `patients` table: 
- Zip code is a float not a [string](https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas) (validity issue)
- Zip code has four digits sometimes (probably the spreadsheet software recognized the zip code column as a number and suppressed the 0 if the zip code started with a 0) (accuracy issue).
- Tim Neudorf height is 27 inches instead of 72 inches (accuracy issue)
- Full state names sometimes, abbreviations other times (consistency issue)
- Dsvid Gustafsson (name typo, patient_id: 9) (accuracy issue)
- Missing demographic information (address through contact columns)
- Erroneous datatypes (`assigned_sex`, `state`, `zip_code`, and `birthdate` columns)
- Multiple phone number formats (`contact` column)
- Default John Doe data
- Multiple records for Jakobsen, Gersten, Taylor
- Kgs instead of lbs for patient surname Zaitseva weight (patient_id: 211)
- Nulls represented as dashes (-) in auralin and novodra columns

##### `treatments` table:
- Missing HbA1c changes (completeness issue)
- The letter 'u' in starting and ending doses for Auralin and Novodra (validity issue)
- Lowercase given names and surnames (since the patients table has the name in uppercase, this will be an issue if you decide to join tables) (consistency issue)
- Missing records (280 instead of 350) (completeness issue)
- Erroneous datatypes (`auralin` and `novodra` columns)
- Inaccurate HbA1c changes 

##### `adverse_reactions` table:
- Lowercase given names and surnames (consistency issue)

### Tidiness
##### `patients` table: 
- `contact` column should be split into `phone number` and `email`

##### `treatments` table:
- Three variables in two columns (`treatment`, i.e., auralin or novodra, `start_dose`, and `end_dose`).
- Include the `adverse_reaction` column from the `adverse_reactions` table
- Include `patient_id` column from the `patients` table to serve as primary key and facilitate joining the remaining two tables (`treatments` and `patients`) afterwards

##### `adverse_reactions` table:
- After adding the `adverse_reaction` column to the `treatments` table, drop the column.

---

In [14]:
patients = pd.read_csv('support-files/04_Cleaning-Data/patients.csv')
treatments = pd.read_csv('support-files/04_Cleaning-Data/treatments.csv')
adverse_reactions = pd.read_csv('support-files/04_Cleaning-Data/adverse_reactions.csv')

### Remember to make a copy of the original DataFrames!

In [15]:
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()

### Address Missing Data First

When checking data quality, it is usually best to deal with **completeness issues first**. For missing data this means:

- Concatenate
- Join
- Impute, if possible

It's important to do this upfront so that subsequent data cleaning will not have to be repeated.

Going through the assessments above, there are three (3) completeness issues:

**`treatments` table:**

- Missing HbA1c changes
- Missing records (280 instead of 350)

**`patients` table:**
- Missing demographic information (address through contact columns)

Unfortunately, we can't do anything about the missing demographic information because we have no way of accessing that information until those patients come back. But let's deal with the other missing data issues now.

### Clean

#### 1. Missing Data

#### `treatments`: Missing records (280 instead of 350)

##### 1.1. Define: 

First, import the cut treatments into a DataFrame. Then, use `pd.concat` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)) to join the 70 missing records to the `treatments` table. 
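
A minimal sketch of how `pd.concat` stacks rows, using two made-up fragments (`part_a` and `part_b` are illustrative names). It shows why `ignore_index=True` is worth passing: without it, each fragment keeps its own index labels and duplicates appear.

```python
import pandas as pd

# Two hypothetical fragments of one table, each with its own 0-based index
part_a = pd.DataFrame({'given_name': ['ana', 'ben'], 'hba1c_start': [7.63, 7.56]})
part_b = pd.DataFrame({'given_name': ['cy'], 'hba1c_start': [7.97]})

# Default behaviour: the original index labels are kept, so label 0 appears twice
stacked = pd.concat([part_a, part_b])

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```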

##### Code:

In [16]:
# load the missing entries from the treatments table
treatment_cut = pd.read_csv('support-files/04_Cleaning-Data/treatments_cut.csv')

##### Test:

In [17]:
# check to see if it worked
treatment_cut.info() # the 70 missing entries

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    70 non-null     object 
 1   surname       70 non-null     object 
 2   auralin       70 non-null     object 
 3   novodra       70 non-null     object 
 4   hba1c_start   70 non-null     float64
 5   hba1c_end     70 non-null     float64
 6   hba1c_change  42 non-null     float64
dtypes: float64(3), object(4)
memory usage: 4.0+ KB


In [18]:
# confirm that our current treatment dataframe is missing those 70 entries
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


##### Code:

In [19]:
# use pd.concat to join both dataframes
treatments_clean = pd.concat([treatments_clean, treatment_cut], ignore_index=True)

##### Test:

In [20]:
# check to see if it worked
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   auralin       350 non-null    object 
 3   novodra       350 non-null    object 
 4   hba1c_start   350 non-null    float64
 5   hba1c_end     350 non-null    float64
 6   hba1c_change  213 non-null    float64
dtypes: float64(3), object(4)
memory usage: 19.3+ KB


---
#### `treatments`: Missing HbA1c changes and inaccurate HbA1c changes (leading 4s mistaken as 9s)

##### 1.2. Define:
Subtract the column `hba1c_start` from the column `hba1c_end` to get the missing `hba1c_change` entries. 

Note: once you perform the subtraction on the whole column, the inaccurate HbA1c changes will also be corrected. Two birds, one stone. 
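
As a rough sketch of the idea, the block below recomputes the change on three rows taken from the first rows of the `treatments` table and flags recorded values that disagree with the recomputed ones. The names `sample`, `recomputed`, and `inaccurate` are illustrative, and the rounding to two decimals is an added assumption, not part of the plan above.

```python
import numpy as np
import pandas as pd

# Values taken from the first rows of the treatments table (NaN = missing change)
sample = pd.DataFrame({
    'hba1c_start': [7.63, 7.56, 7.97],
    'hba1c_end': [7.20, 7.09, 7.62],
    'hba1c_change': [np.nan, 0.97, 0.35],
})

recomputed = sample['hba1c_start'] - sample['hba1c_end']

# Flag recorded values that disagree with the recomputed ones.
# np.isclose avoids false alarms caused by floating-point rounding.
recorded = sample['hba1c_change']
inaccurate = recorded.notna() & ~np.isclose(recorded, recomputed)
print(inaccurate.tolist())

# Overwriting the whole column fills the gap and fixes the inaccurate value at once
sample['hba1c_change'] = recomputed.round(2)
print(sample['hba1c_change'].tolist())
```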

##### Code:

In [21]:
# in the second entry we can spot an instance of an inaccurate HbA1c change
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [22]:
# subtract the two columns
treatments_clean['hba1c_change'] = treatments_clean['hba1c_start'] - treatments_clean['hba1c_end']

##### Test:

In [23]:
treatments_clean['hba1c_change'].head()

0    0.43
1    0.47
2    0.43
3    0.35
4    0.32
Name: hba1c_change, dtype: float64

### Address Tidiness
After dealing with structural issues (like missing data), it's best to address tidiness. Finally, deal with content issues (quality).

Statistician Hadley Wickham pioneered the concept of tidy data. In his paper 'Tidy Data' (*The Journal of Statistical Software*, vol. 59, 2014), he makes these key points:

- Tidy datasets are easy to manipulate
- Tidy datasets with data quality issues are almost always easier to clean than untidy datasets with the same issues

#### Tidiness issues in the clinical trial dataset:

##### `patients` table: 
- `contact` column should be split into `phone number` and `email`

##### `treatments` table:
- Three variables in two columns (`treatment`, i.e., auralin or novodra, `start_dose`, and `end_dose`).
- Include the `adverse_reaction` column from the `adverse_reactions` table
- Include `patient_id` column from the `patients` table to serve as primary key and facilitate joining the remaining two tables (`treatments` and `patients`) afterwards

### Clean

#### 2. Tidiness

#### `patients`: `contact` column should be split into `phone number` and `email`

#####  2.1. Define:
In the `patients` table, use `str.extract` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html) to separate phone number and email into two columns. 

- Hint 1: [regex tutorial](https://regexone.com/)
- Hint 2: [various phone number regex patterns](https://stackoverflow.com/questions/16699007/regular-expression-to-match-standard-10-digit-phone-number)
- Hint 3: [email address regex pattern](http://emailregex.com/)

##### Code:

In [24]:
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


There are dozens of ways to format a US phone number. To name a few:

```###-###-####```

```(###) ###-####```

```### ### ####```

```###.###.####```

```#+ ###.###.####```

The regular expression must be able to encompass them all!
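
One way to gain confidence in a pattern before running it on the real column is to try it on a handful of hand-written samples. The sketch below does that with the pattern used in this section's `str.extract` call; the sample numbers are made up to cover the formats listed above, and `PHONE_PATTERN` and `samples` are illustrative names.

```python
import pandas as pd

# Same pattern as in this section's str.extract call
PHONE_PATTERN = r'(\+?\d?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})'

# Made-up numbers, one per format
samples = pd.Series([
    '951-719-9170',
    '(217) 569-3204',
    '402 363 6804',
    '402.363.6804',
    '+1 402.363.6804',
    '+1 (732) 636-8246',
])

extracted = samples.str.extract(PHONE_PATTERN, expand=False)

# Every sample should be captured in full
print((extracted == samples).all())
```

If a new format turns up in the data, adding it to `samples` is a quick way to check whether the pattern needs to change.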

In [25]:
# use str.extract with a regular expression to get the phone numbers
patients_clean['phone_number'] = patients_clean['contact'].str.extract(r'(\+?\d?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=False)

##### Test:

In [26]:
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi,phone_number
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6,951-719-9170
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2,+1 (217) 569-3204
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8,402-363-6804
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7,+1 (732) 636-8246
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1,334-515-7487


In [27]:
# 491 is the same number of entries we have in contact
patients_clean['phone_number'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: phone_number
Non-Null Count  Dtype 
--------------  ----- 
491 non-null    object
dtypes: object(1)
memory usage: 4.1+ KB


#####  Code:

In [28]:
# use str.extract with a regular expression to get the emails
# the regular expression must ensure the email starts with a letter and ends with one, 
# with an "@" in the middle
# use str.lower so the email is formatted all lowercase (not mandatory)
patients_clean['email'] = patients_clean['contact'].str.extract(r'([a-zA-Z].*@[\w\.]*[a-zA-Z])', expand=False).str.lower()

# drop 'contact' column
patients_clean.drop(columns='contact', inplace=True)

##### Test:

In [29]:
# check to see if the column 'contact' was removed and
# if the column 'email' was created
patients_clean.head(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,7/10/1976,121.7,66,19.6,951-719-9170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,4/3/1967,118.8,66,19.2,+1 (217) 569-3204,pamelashill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,2/19/1980,177.8,71,24.8,402-363-6804,jaemdebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,7/26/1951,220.9,70,31.7,+1 (732) 636-8246,phanbaliem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,2/18/1928,192.3,27,26.1,334-515-7487,timneudorf@cuvox.de


In [30]:
# 491 is the same number of entries we used to have in 'contact'
patients_clean['email'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: email
Non-Null Count  Dtype 
--------------  ----- 
491 non-null    object
dtypes: object(1)
memory usage: 4.1+ KB


In [31]:
# check whether any emails start with a number
patients_clean['email'].sort_values().head()

404               aaliyahrice@dayrep.com
11          abdul-nurmummarisa@rhyta.com
332                abelefrem@fleckens.hu
258              abelyonatan@teleworm.us
305    addoloratalombardi@jourrapide.com
Name: email, dtype: object

In [32]:
# check whether any emails end with a number
patients_clean['email'].sort_values(ascending=False).head()

49              zsinkovivien@teleworm.us
0               zoewellish@superrito.com
165              zlatkorukavina@cuvox.de
456    zikoranaudodimmachinedum@cuvox.de
199           zdeneksynek@jourrapide.com
Name: email, dtype: object
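
Sorting and inspecting the first few values only covers the extremes of the column. As a hedged alternative, the sketch below runs the same extraction on two contact strings taken from the table preview and checks every value with a pattern test (`contact`, `starts_with_digit`, and `ends_with_digit` are illustrative names).

```python
import pandas as pd

# Two contact strings from the patients preview, one per layout
contact = pd.Series([
    '951-719-9170ZoeWellish@superrito.com',
    'PamelaSHill@cuvox.de+1 (217) 569-3204',
])

# Same email pattern as in the Code cell above
email = contact.str.extract(r'([a-zA-Z].*@[\w\.]*[a-zA-Z])', expand=False).str.lower()

# str.match anchors at the start of each value; '$' anchors at the end
starts_with_digit = email.str.match(r'\d')
ends_with_digit = email.str.contains(r'\d$')

print(email.tolist())
print(starts_with_digit.any(), ends_with_digit.any())
```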

---

#### `treatments`: three variables in two columns (`treatment`, i.e., auralin or novodra, `start_dose`, and `end_dose`).

##### 2.2. Define: 

Use the pandas `melt` [function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) and the `str.split()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html) to transform these two columns into three. 

`melt` [tutorial](https://deparkes.co.uk/2016/10/28/reshape-pandas-data-with-melt/)
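
The plan can be sketched end to end on a two-row toy table before applying it to the full dataset. The DataFrame below is a cut-down, illustrative version of `treatments` (only one identifier column is kept), and `wide` and `long` are names chosen for this example.

```python
import pandas as pd

# Toy wide table: one column per treatment, '-' where the treatment was not given
wide = pd.DataFrame({
    'given_name': ['veronika', 'elliot'],
    'auralin': ['41u - 48u', '-'],
    'novodra': ['-', '40u - 45u'],
})

# Step 1: stack the two treatment columns into 'treatment' and 'dose'
long = pd.melt(wide,
               id_vars=['given_name'],
               value_vars=['auralin', 'novodra'],
               var_name='treatment',
               value_name='dose')

# Step 2: 2 patients x 2 treatment columns = 4 rows; drop the '-' placeholders
long = long[long['dose'] != '-'].reset_index(drop=True)

# Step 3: split 'dose' into the two remaining variables
long[['start_dose', 'end_dose']] = long['dose'].str.split(' - ', expand=True)
long = long.drop(columns='dose')

print(long)
```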

##### Code:

In [33]:
# checking column headers before melting the dataframe
treatments_clean.head(1)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43


In [34]:
treatments_index = treatments_clean.index
treatments_clean = pd.melt(treatments_clean, 
               id_vars=['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'],
               value_vars=['auralin', 'novodra'],
               var_name='treatment',
               value_name='dose')

##### Test:

In [35]:
# the columns 'auralin' and 'novodra' were stacked into a single column 'treatment'
# so, having 700 entries is expected
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    700 non-null    object 
 1   surname       700 non-null    object 
 2   hba1c_start   700 non-null    float64
 3   hba1c_end     700 non-null    float64
 4   hba1c_change  700 non-null    float64
 5   treatment     700 non-null    object 
 6   dose          700 non-null    object 
dtypes: float64(3), object(4)
memory usage: 38.4+ KB


##### Code:

In [36]:
# get the indexes of every row with a '-' 
to_drop = treatments_clean.query('dose == "-"').index

# drop those rows and we should have 350 entries once again
treatments_clean = treatments_clean.drop(index=to_drop)

##### Test:

In [37]:
# check to see if there are 350 rows
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 698
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   hba1c_start   350 non-null    float64
 3   hba1c_end     350 non-null    float64
 4   hba1c_change  350 non-null    float64
 5   treatment     350 non-null    object 
 6   dose          350 non-null    object 
dtypes: float64(3), object(4)
memory usage: 21.9+ KB


##### Code: 

In [38]:
# separate the 'dose' column into two: 'start_dose' and 'end_dose'
treatments_clean[['start_dose', 'end_dose']] = treatments_clean['dose'].str.split(pat=' - ', expand=True)

# drop the column 'dose', since it will no longer be necessary
treatments_clean.drop(columns='dose', inplace=True)

##### Test:

In [39]:
# confirm the 'dose' column was indeed dropped
treatments_clean.head()

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u,48u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u,36u
6,sophia,haugen,7.65,7.27,0.38,auralin,37u,42u
7,eddie,archer,7.89,7.55,0.34,auralin,31u,38u
9,asia,woźniak,7.76,7.37,0.39,auralin,30u,36u


---

#### `treatments`: include the `adverse_reaction` column from the `adverse_reactions` table

##### 2.3 Define: 

Using [`df.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html), left join the `adverse_reactions` table into the `treatments` table
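
A minimal sketch of a left join on made-up data (all names and the reaction value are hypothetical). Every row of the left table is kept, and rows without a match receive a null in the new column. The `validate='many_to_one'` argument is an optional safeguard added here, not something the plan above requires: it makes pandas raise an error if the key columns are not unique in the right table.

```python
import pandas as pd

# Hypothetical left table: one row per treated patient
treatments_demo = pd.DataFrame({
    'given_name': ['ana', 'ben', 'cy'],
    'surname': ['alves', 'brown', 'chen'],
})

# Hypothetical right table: only patients with a reaction appear here
reactions_demo = pd.DataFrame({
    'given_name': ['ben'],
    'surname': ['brown'],
    'adverse_reaction': ['headache'],
})

merged = treatments_demo.merge(reactions_demo,
                               on=['given_name', 'surname'],
                               how='left',
                               validate='many_to_one')

print(len(merged))                           # same number of rows as the left table
print(merged['adverse_reaction'].count())    # number of non-null reactions
```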

##### Code: 

In [40]:
treatments_clean = treatments_clean.merge(adverse_reactions, on=['given_name', 'surname'], how='left')

##### Test:

In [41]:
# check how many patients had adverse reactions
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 944.0+ bytes


In [42]:
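# count the non-null adverse reactions after the merge
# (35 is one more than the 34 rows in adverse_reactions, which means at least
# one name pair matched more than one row in treatments)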
treatments_clean.query('adverse_reaction != None')['adverse_reaction'].count()

35

--- 

#### Include `patient_id` column from the `patients` table to serve as primary key and facilitate joining the remaining two tables (`treatments` and `patients`) afterwards

##### 2.4. Define: 
Before including `patient_id` in the `treatments` table, we must first capitalize `given_name` and `surname`. Otherwise, the names won't match those in the `patients` table and we won't be able to use those columns as the keys for the merge. Then, we create a new DataFrame (`id_names`) from the `patients` table, containing three columns: `patient_id`, `given_name`, and `surname`. The first is the column we actually need to add to the `treatments` table; the other two are used as the keys for the join. Finally, drop the columns `given_name` and `surname` from `treatments`, since they will no longer be needed. 
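
One caveat worth checking: `str.capitalize` uppercases only the first character and lowercases the rest, so a name with inner capitals would no longer match. The sketch below uses made-up names to show how a left merge with `indicator=True` can reveal rows that failed to find a `patient_id`. Whether such names occur in the real dataset is not established here; this only illustrates the check.

```python
import pandas as pd

# Hypothetical patients table with one surname containing an inner capital
patients_demo = pd.DataFrame({
    'patient_id': [1, 2],
    'given_name': ['Zoe', 'Ian'],
    'surname': ['Wellish', 'McKay'],
})

# Hypothetical treatments table with lowercase names
treatments_demo = pd.DataFrame({
    'given_name': ['zoe', 'ian'],
    'surname': ['wellish', 'mckay'],
    'treatment': ['auralin', 'novodra'],
})

for col in ['given_name', 'surname']:
    treatments_demo[col] = treatments_demo[col].str.capitalize()

# indicator=True adds a '_merge' column saying where each row was found
check = treatments_demo.merge(patients_demo,
                              on=['given_name', 'surname'],
                              how='left',
                              indicator=True)
unmatched = check[check['_merge'] == 'left_only']

print(unmatched[['given_name', 'surname']])
```

With the default inner join, unmatched rows would be dropped silently, so comparing row counts before and after the merge is another quick test.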

##### Code:

In [43]:
# capitalize column
treatments_clean['given_name'] = treatments_clean['given_name'].str.capitalize()

# capitalize column
treatments_clean['surname'] = treatments_clean['surname'].str.capitalize()

##### Test:

In [44]:
# check if name and surname were indeed capitalized
treatments_clean.head()

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose,adverse_reaction
0,Veronika,Jindrová,7.63,7.2,0.43,auralin,41u,48u,
1,Skye,Gormanston,7.97,7.62,0.35,auralin,33u,36u,
2,Sophia,Haugen,7.65,7.27,0.38,auralin,37u,42u,
3,Eddie,Archer,7.89,7.55,0.34,auralin,31u,38u,
4,Asia,Woźniak,7.76,7.37,0.39,auralin,30u,36u,


##### Code:

In [45]:
# create a dataframe with name, surname and id
id_names = patients_clean[['given_name', 'surname', 'patient_id']]

# merge id_names with treatments on given name and surname
treatments_clean = treatments_clean.merge(id_names, on=['given_name', 'surname'])

# since these columns are duplicated and no longer needed, we can drop them
treatments_clean.drop(columns=['given_name', 'surname'], inplace=True)

##### Test:

In [46]:
# check to see if given_name and surname were dropped
treatments_clean.head()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41u,48u,,225
1,7.97,7.62,0.35,auralin,33u,36u,,242
2,7.65,7.27,0.38,auralin,37u,42u,,345
3,7.89,7.55,0.34,auralin,31u,38u,,276
4,7.76,7.37,0.39,auralin,30u,36u,,15


In [47]:
# patient_id should be the only duplicate column
all_columns = pd.Series(list(patients_clean) + list(treatments_clean))
all_columns[all_columns.duplicated()]

22    patient_id
dtype: object

---

### Address Quality
Once the missing data and tidiness issues are cleaned, all that remains is cleaning the remaining data quality issues.

##### `patients` table: 
- Zip code is a float not a [string](https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas) (validity issue)
- Zip code has four digits sometimes (probably the spreadsheet software recognized the zip code column as a number and suppressed the 0 if the zip code started with a 0) (accuracy issue).
- Tim Neudorf height is 27 inches instead of 72 inches (accuracy issue)
- Full state names sometimes, abbreviations other times (consistency issue)
- Dsvid Gustafsson (name typo, patient_id: 9) (accuracy issue)
- Erroneous datatypes (`assigned_sex`, `state`, `zip_code`, and `birthdate` columns)
- Multiple phone number formats (`contact` column)
- Default John Doe data
- Multiple records for Jakobsen, Gersten, Taylor
- Kgs instead of lbs for patient surname Zaitseva weight (patient_id: 211)
- Nulls represented as dashes (-) in auralin and novodra columns

##### `treatments` table:
- The letter 'u' in starting and ending doses for Auralin and Novodra (validity issue)
- Lowercase given names and surnames (since the patients table has the name in uppercase, this will be an issue if you decide to join tables) (consistency issue)
- Erroneous datatypes (`auralin` and `novodra` columns)
- Inaccurate HbA1c changes 

##### `adverse_reactions` table:
- Lowercase given names and surnames (consistency issue)

### Clean

#### 3. Quality

- `patients`: Zip code is a float not a [string](https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas) (validity issue)
- `patients`: Zip code has four digits sometimes (probably the spreadsheet software recognized the zip code column as a number and suppressed the 0 if the zip code started with a 0) (accuracy issue).

#####  3.1. Define:
First, use `astype` to convert the `zip_code` column to string. The trailing '.0' will remain, so use string slicing to remove it. Some zip codes start with a zero that is not being displayed, so use `zfill` to pad the four-character zip codes with a leading 0. Finally, use `Series.replace` and `np.nan` to convert the NaNs that were turned into strings in the first step back to null values. 
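
The four steps can be traced on a three-value toy Series before running them on the real column. The values below come from the table preview plus one missing entry, and `zips` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Two zip codes as floats (as loaded from the CSV) and one missing value
zips = pd.Series([92390.0, 7095.0, np.nan])

zips = zips.astype(str)              # floats become text such as '92390.0'
zips = zips.str[:-2]                 # drop the last two characters ('.0')
zips = zips.str.zfill(5)             # left-pad with zeros up to 5 characters
zips = zips.replace('0000n', np.nan) # undo the damage done to missing values

print(zips.tolist())
```

Depending on the pandas version, the missing value either becomes the text 'nan' in the first step (and ends up as '0000n' after slicing and padding) or stays null throughout; the final `replace` makes the result the same in both cases.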

##### Code:

In [48]:
# check the data type for zip_code
patients_clean['zip_code'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: zip_code
Non-Null Count  Dtype  
--------------  -----  
491 non-null    float64
dtypes: float64(1)
memory usage: 4.1 KB


In [49]:
# convert the column to str
patients_clean['zip_code'] = patients_clean['zip_code'].astype(str)

##### Test:

In [50]:
# confirm it worked
# later, we will have to reconvert the missing values to NaNs
patients_clean['zip_code'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: zip_code
Non-Null Count  Dtype 
--------------  ----- 
503 non-null    object
dtypes: object(1)
memory usage: 4.1+ KB


In [51]:
# the .0 are still showing
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,7/10/1976,121.7,66,19.6,951-719-9170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,4/3/1967,118.8,66,19.2,+1 (217) 569-3204,pamelashill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,2/19/1980,177.8,71,24.8,402-363-6804,jaemdebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,7/26/1951,220.9,70,31.7,+1 (732) 636-8246,phanbaliem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,2/18/1928,192.3,27,26.1,334-515-7487,timneudorf@cuvox.de


##### Code:

In [52]:
# remove the '.0's using string slicing
patients_clean['zip_code'] = patients_clean['zip_code'].str[:-2]

##### Test:

In [53]:
# confirm it worked
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390,United States,7/10/1976,121.7,66,19.6,951-719-9170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812,United States,4/3/1967,118.8,66,19.2,+1 (217) 569-3204,pamelashill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467,United States,2/19/1980,177.8,71,24.8,402-363-6804,jaemdebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095,United States,7/26/1951,220.9,70,31.7,+1 (732) 636-8246,phanbaliem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,2/18/1928,192.3,27,26.1,334-515-7487,timneudorf@cuvox.de


##### Code:

In [54]:
# check how many zip_codes need to be padded with a leading 0
patients_clean.query('zip_code.str.len() < 5')['zip_code'].count()

61

In [55]:
# pad the entries with fewer than 5 characters with a leading 0
patients_clean['zip_code'] = patients_clean['zip_code'].str.zfill(5)

##### Test:

In [56]:
# confirm it worked
patients_clean.query('zip_code.str.len() < 5')['zip_code'].count()

0

In [57]:
# these "0000n" will need to go back to NaNs
patients_clean['zip_code'].sort_values().head()

249    0000n
264    0000n
269    0000n
278    0000n
242    0000n
Name: zip_code, dtype: object

##### Code:

In [58]:
# replace '0000n' with NaN using np.nan
patients_clean['zip_code'] = patients_clean['zip_code'].replace('0000n', np.nan)

##### Test:

In [59]:
# check to see if it worked
patients_clean['zip_code'].sort_values().head()

316    01002
38     01581
290    01730
252    01730
167    01730
Name: zip_code, dtype: object

In [60]:
# confirm there are still 12 null values
patients_clean['zip_code'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: zip_code
Non-Null Count  Dtype 
--------------  ----- 
491 non-null    object
dtypes: object(1)
memory usage: 4.1+ KB


---

#### `patients`: Tim Neudorf height is 27 inches instead of 72 inches (accuracy issue)

##### 3.2. Define: 
Use [`DataFrame.at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) to access the height for Tim Neudorf (index label 4) and change it to 72. 
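
A small sketch of the difference between a label and a position, on a made-up two-row table whose index labels are 0 and 4. It also shows a hedged alternative that selects the row by its content with `.loc` and a boolean mask, which avoids hard-coding the label (`demo`, `by_label`, and `by_mask` are illustrative names).

```python
import pandas as pd

# Hypothetical table: index labels (0 and 4) differ from row positions (0 and 1)
demo = pd.DataFrame({'surname': ['Wellish', 'Neudorf'], 'height': [66, 27]},
                    index=[0, 4])

# .at is label-based: 4 is the index label, not the row position
by_label = demo.copy()
by_label.at[4, 'height'] = 72

# Alternative: select the row by its content instead of a hard-coded label
by_mask = demo.copy()
by_mask.loc[by_mask['surname'] == 'Neudorf', 'height'] = 72

print(by_label.equals(by_mask))
```

In the `patients` table the label and the position happen to coincide for this row, because the index is the default 0-based range.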

##### Code:

In [61]:
# display height for Tim Neudorf
patients_clean.query('surname == "Neudorf"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,2/18/1928,192.3,27,26.1,334-515-7487,timneudorf@cuvox.de


Note: [`DataFrame.at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) accesses a single value for a row/column label pair.

In [62]:
# change it to 72
patients_clean.at[4, 'height'] = 72

##### Test:

In [63]:
# confirm it worked
patients_clean.query('surname == "Neudorf"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,2/18/1928,192.3,72,26.1,334-515-7487,timneudorf@cuvox.de


In [64]:
# confirm it worked
patients_clean.iloc[4]

patient_id                         5
assigned_sex                    male
given_name                       Tim
surname                      Neudorf
address         1428 Turkey Pen Lane
city                          Dothan
state                             AL
zip_code                       36303
country                United States
birthdate                  2/18/1928
weight                         192.3
height                            72
bmi                             26.1
phone_number            334-515-7487
email            timneudorf@cuvox.de
Name: 4, dtype: object
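
A small sketch of why it matters that `DataFrame.at` works with index labels rather than positions. The toy frame below is hypothetical and deliberately uses a shuffled index so that label and position differ.

```python
import pandas as pd

# Hypothetical two-row frame whose index labels are not 0, 1
people = pd.DataFrame(
    {'surname': ['Hill', 'Neudorf'], 'height': [66, 27]},
    index=[10, 4],
)

# .at selects by index label (4) and column name, not by position
people.at[4, 'height'] = 72

# The same cell by position: second row, second column
print(people.iat[1, 1])
```

In `patients_clean` the index is a default `RangeIndex`, so label 4 and position 4 happen to coincide. After rows are dropped that is no longer guaranteed.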

---

#### Full state names sometimes, abbreviations other times

##### 3.3. Define: 

Find out which states are represented by their full name. Then, create a dictionary with these states. After that, create a function to later apply to the whole column. The logic is that, if the state is one of the keys in the dict (their full name), then change it to the abbreviation. If not, leave the entry untouched. Finally, use [`df.apply`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) to apply the function to the `state` column.

##### Code:

In [65]:
# check which states are represented by their full name
patients_clean['state'].unique()

array(['California', 'Illinois', 'Nebraska', 'NJ', 'AL', 'Florida', 'NV',
       'CA', 'MO', 'New York', 'MI', 'TN', 'VA', 'OK', 'GA', 'MT', 'MA',
       'NY', 'NM', 'IL', 'LA', 'PA', 'CO', 'ME', 'WI', 'SD', 'MN', 'FL',
       'WY', 'OH', 'IA', 'NC', 'IN', 'CT', 'KY', 'DE', 'MD', 'AZ', 'TX',
       'NE', 'AK', 'ND', 'KS', 'MS', 'WA', 'SC', 'WV', 'RI', 'NH', 'OR',
       nan, 'VT', 'ID', 'DC', 'AR'], dtype=object)

In [66]:
# create a dict with these states
state_abbrev = {'California': 'CA',
                'Illinois': 'IL',
                'Nebraska': 'NE',
                'Florida': 'FL',
                'New York': 'NY'}

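# create a function to apply to each row (with axis=1, apply passes one row at a time)
def abbreviate_state(row):
    # full state names are the dict keys; swap them for their abbreviation
    if row['state'] in state_abbrev:
        return state_abbrev[row['state']]
    # anything else (abbreviations, NaN) is returned untouched
    return row['state']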

# apply function
patients_clean['state'] = patients_clean.apply(abbreviate_state, axis=1)

##### Test:

In [67]:
# there are no more full name states
patients_clean['state'].unique()

array(['CA', 'IL', 'NE', 'NJ', 'AL', 'FL', 'NV', 'MO', 'NY', 'MI', 'TN',
       'VA', 'OK', 'GA', 'MT', 'MA', 'NM', 'LA', 'PA', 'CO', 'ME', 'WI',
       'SD', 'MN', 'WY', 'OH', 'IA', 'NC', 'IN', 'CT', 'KY', 'DE', 'MD',
       'AZ', 'TX', 'AK', 'ND', 'KS', 'MS', 'WA', 'SC', 'WV', 'RI', 'NH',
       'OR', nan, 'VT', 'ID', 'DC', 'AR'], dtype=object)
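
As an alternative to a row-wise function, the same mapping can be sketched with `Series.replace`, which accepts a dictionary of old-to-new values. This is a swapped-in technique, not the approach used above, and the toy Series is made up for illustration.

```python
import numpy as np
import pandas as pd

state_abbrev = {'California': 'CA',
                'Illinois': 'IL',
                'Nebraska': 'NE',
                'Florida': 'FL',
                'New York': 'NY'}

# Toy column mixing full names, abbreviations and a missing value
states = pd.Series(['California', 'NJ', 'New York', np.nan, 'FL'])

# Values matching a dict key are swapped; everything else is left as-is
abbreviated = states.replace(state_abbrev)

print(abbreviated.tolist())
```

This keeps the "leave other entries untouched" logic from the definition, since values that are not dictionary keys pass through unchanged.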

---

#### `patients`: Dsvid Gustafsson (name typo, patient_id: 9) (accuracy issue)

##### 3.4. Define: 
Use [`DataFrame.at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) to correct David's given name (index label 8).

##### Code:

In [68]:
# use DataFrame.at to correct David's given name
patients_clean.at[8, 'given_name'] = 'David'

##### Test:

In [69]:
# confirm it worked
patients_clean.query('surname == "Gustafsson"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
8,9,male,David,Gustafsson,1790 Nutter Street,Kansas City,MO,64105,United States,3/6/1937,163.9,66,26.5,816-265-9578,davidgustafsson@armyspy.com


In [70]:
# confirm it worked
patients_clean.iloc[8]

patient_id                                9
assigned_sex                           male
given_name                            David
surname                          Gustafsson
address                  1790 Nutter Street
city                            Kansas City
state                                    MO
zip_code                              64105
country                       United States
birthdate                          3/6/1937
weight                                163.9
height                                   66
bmi                                    26.5
phone_number                   816-265-9578
email           davidgustafsson@armyspy.com
Name: 8, dtype: object

---
#### `patients`: erroneous datatypes (`assigned_sex`, `state`, `zip_code`, and `birthdate` columns)

##### 3.5. Define:

- `patients['assigned_sex']`: Use [`pd.Series.astype`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html) to change the column's dtype to `category`. 
- `patients['state']`: Use [`pd.Series.astype`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html) to change the column's dtype to `category`. 
- `patients['zip_code']`: Zip code was already addressed in item 3.1. 
- `patients['birthdate']`: Use [`pd.to_datetime`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) to change the column's dtype to `datetime`. 

##### Code:

In [71]:
# check which dtype 'assigned_sex' currently has
patients_clean['assigned_sex'].unique()

array(['female', 'male'], dtype=object)

In [72]:
# use pd.Series.astype to change it to 'category'
patients_clean['assigned_sex'] = patients_clean['assigned_sex'].astype('category')

##### Test:

In [73]:
# check to see if it worked
patients_clean['assigned_sex'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: assigned_sex
Non-Null Count  Dtype   
--------------  -----   
503 non-null    category
dtypes: category(1)
memory usage: 755.0 bytes


In [74]:
# check to see if it worked
patients_clean['assigned_sex'].unique()

['female', 'male']
Categories (2, object): ['female', 'male']

##### Code:

In [75]:
# check which dtype 'state' currently has
patients_clean['state'].unique()

array(['CA', 'IL', 'NE', 'NJ', 'AL', 'FL', 'NV', 'MO', 'NY', 'MI', 'TN',
       'VA', 'OK', 'GA', 'MT', 'MA', 'NM', 'LA', 'PA', 'CO', 'ME', 'WI',
       'SD', 'MN', 'WY', 'OH', 'IA', 'NC', 'IN', 'CT', 'KY', 'DE', 'MD',
       'AZ', 'TX', 'AK', 'ND', 'KS', 'MS', 'WA', 'SC', 'WV', 'RI', 'NH',
       'OR', nan, 'VT', 'ID', 'DC', 'AR'], dtype=object)

In [76]:
# use pd.Series.astype to change it to 'category'
patients_clean['state'] = patients_clean['state'].astype('category')

##### Test:

In [77]:
# check to see if it worked
patients_clean['state'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: state
Non-Null Count  Dtype   
--------------  -----   
491 non-null    category
dtypes: category(1)
memory usage: 2.0 KB


In [78]:
# check to see if it worked
patients_clean['state'].unique()

['CA', 'IL', 'NE', 'NJ', 'AL', ..., NaN, 'VT', 'ID', 'DC', 'AR']
Length: 50
Categories (49, object): ['AK', 'AL', 'AR', 'AZ', ..., 'WA', 'WI', 'WV', 'WY']

##### Code:

In [79]:
# check which dtype 'birthdate' currently has
patients_clean['birthdate'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: birthdate
Non-Null Count  Dtype 
--------------  ----- 
503 non-null    object
dtypes: object(1)
memory usage: 4.1+ KB


In [80]:
# use pd.to_datetime to convert the column to datetime
patients_clean['birthdate'] = pd.to_datetime(patients_clean['birthdate'])

##### Test:

In [81]:
# check to see if it worked
patients_clean['birthdate'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 503 entries, 0 to 502
Series name: birthdate
Non-Null Count  Dtype         
--------------  -----         
503 non-null    datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 4.1 KB


In [82]:
# check to see if it worked
patients_clean.head(2)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,CA,92390,United States,1976-07-10,121.7,66,19.6,951-719-9170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,IL,61812,United States,1967-04-03,118.8,66,19.2,+1 (217) 569-3204,pamelashill@cuvox.de
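
The two conversions above can be sketched together on a toy frame. The rows are made up for illustration, and passing an explicit `format` to `pd.to_datetime` is an assumption about the month/day/year layout rather than something the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    'assigned_sex': ['female', 'male', 'female'],
    'birthdate': ['7/10/1976', '4/3/1967', '2/19/1980'],
})

# A column with a handful of repeated values fits the category dtype
toy['assigned_sex'] = toy['assigned_sex'].astype('category')

# Spell out month/day/year so dates are not left to format inference
toy['birthdate'] = pd.to_datetime(toy['birthdate'], format='%m/%d/%Y')

print(toy.dtypes)
```

Stating the format explicitly makes the parse fail loudly if a date does not follow the expected layout.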


---
#### `treatments`: the letter 'u' in starting and ending doses for Auralin and Novodra (validity issue)
#### `treatments`: erroneous datatypes (`auralin` and `novodra`)

##### 3.6. Define:
- `treatments`: use [`pd.Series.str.strip`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html#pandas.Series.str.strip) to remove the letter 'u' in the `start_dose` and `end_dose` columns.
- `treatments`: use [`pd.Series.astype`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html) to change `start_dose` and `end_dose` to int. 

##### Code:

In [83]:
# check how start_dose is currently formatted
treatments_clean.head(1)

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41u,48u,,225


In [84]:
# use str.strip to remove the 'u'
treatments_clean['start_dose'] = treatments_clean['start_dose'].str.strip('u')

##### Test:

In [85]:
# check to see if it worked
treatments_clean.head(1)

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41,48u,,225


##### Code: 

In [86]:
# use str.strip to remove the 'u'
treatments_clean['end_dose'] = treatments_clean['end_dose'].str.strip('u')

##### Test:

In [87]:
# check to see if it worked
treatments_clean.head(1)

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41,48,,225


##### Code:

In [88]:
# check what's the current dtype for start_dose
treatments_clean['start_dose'].info()

<class 'pandas.core.series.Series'>
Int64Index: 339 entries, 0 to 338
Series name: start_dose
Non-Null Count  Dtype 
--------------  ----- 
339 non-null    object
dtypes: object(1)
memory usage: 5.3+ KB


In [89]:
# change it to int
treatments_clean['start_dose'] = treatments_clean['start_dose'].astype(int)

##### Test:

In [90]:
# check to see if it worked
treatments_clean['start_dose'].info()

<class 'pandas.core.series.Series'>
Int64Index: 339 entries, 0 to 338
Series name: start_dose
Non-Null Count  Dtype
--------------  -----
339 non-null    int32
dtypes: int32(1)
memory usage: 4.0 KB


##### Code:

In [91]:
# check what's the current dtype for end_dose
treatments_clean['end_dose'].info()

<class 'pandas.core.series.Series'>
Int64Index: 339 entries, 0 to 338
Series name: end_dose
Non-Null Count  Dtype 
--------------  ----- 
339 non-null    object
dtypes: object(1)
memory usage: 5.3+ KB


In [92]:
# change it to int
treatments_clean['end_dose'] = treatments_clean['end_dose'].astype(int)

##### Test:

In [93]:
# check to see if it worked
treatments_clean['end_dose'].info()

<class 'pandas.core.series.Series'>
Int64Index: 339 entries, 0 to 338
Series name: end_dose
Non-Null Count  Dtype
--------------  -----
339 non-null    int32
dtypes: int32(1)
memory usage: 4.0 KB
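
Since both dose columns get the same treatment, the strip-then-cast steps can be sketched as a short loop. The toy values are made up, and `str.rstrip` is used here instead of `str.strip` as a slightly narrower choice that only removes a trailing 'u'.

```python
import pandas as pd

# Toy frame with doses stored as strings carrying a unit suffix
doses = pd.DataFrame({'start_dose': ['41u', '33u'],
                      'end_dose': ['48u', '36u']})

for col in ['start_dose', 'end_dose']:
    # drop the trailing 'u', then convert the remaining digits to integers
    doses[col] = doses[col].str.rstrip('u').astype(int)

print(doses.dtypes)
```

The exact integer width reported (`int32` or `int64`) depends on the platform and pandas version.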


---
#### `patients`: multiple phone number formats (`contact` column)

##### 3.7. Define: 

First, use [`str.replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html#pandas.Series.str.replace) with a regular expression to remove any non-digit characters (`\D+`). Then, use [`str.pad`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.pad.html) to add a leading 1 to entries that currently lack the country code.

##### Code:

In [94]:
# use str.replace to remove any non-digit characters
patients_clean['phone_number'] = patients_clean['phone_number'].str.replace(r'\D+', '', regex=True)

##### Test:

In [95]:
# check to see if it worked
patients_clean.head(2)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,CA,92390,United States,1976-07-10,121.7,66,19.6,9517199170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,IL,61812,United States,1967-04-03,118.8,66,19.2,12175693204,pamelashill@cuvox.de


##### Code:

In [96]:
# use str.pad to include a leading 1 to entries that don't have the international code
patients_clean['phone_number'] = patients_clean['phone_number'].str.pad(11, fillchar='1')

##### Test:

In [97]:
# check to see if it worked
patients_clean.head(1)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,CA,92390,United States,1976-07-10,121.7,66,19.6,19517199170,zoewellish@superrito.com
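
The two phone-number steps can be sketched on a toy Series holding the two formats seen above. The variable names are illustrative only.

```python
import pandas as pd

phones = pd.Series(['951-719-9170', '+1 (217) 569-3204'])

# Remove every run of non-digit characters
digits = phones.str.replace(r'\D+', '', regex=True)

# Left-pad to 11 characters with '1' so 10-digit numbers gain the country code
normalized = digits.str.pad(11, side='left', fillchar='1')

print(normalized.tolist())
```

Padding works here because every number is either 10 or 11 digits long. A shorter, malformed number would receive several leading 1s, so checking `digits.str.len()` first is a reasonable precaution.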


---
#### Default John Doe data

##### 3.8. Define:
Since we can't recover the John Doe records from the `patients` table, drop them. First, use [`DataFrame.index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html) to get the index labels of the John Doe rows. Then, use [`DataFrame.drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to drop those records.

##### Code:

In [98]:
# check how many instances of John Doe are present in the patients table
patients_clean.query('given_name == "John" and surname == "Doe"')['patient_id'].count()

6

In [99]:
# get the indexes of those entries
john_doe = patients_clean.query('given_name == "John" and surname == "Doe"').index

In [100]:
# use DataFrame.drop to drop them
patients_clean.drop(index=john_doe, inplace=True)

##### Test:

In [101]:
# check to see if it worked
patients_clean.query('given_name == "John" and surname == "Doe"')['patient_id'].count()

0
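
A compact sketch of the query, index, drop sequence on a made-up frame. It also shows that only rows matching both conditions are removed.

```python
import pandas as pd

toy = pd.DataFrame({
    'given_name': ['John', 'Zoe', 'John', 'John'],
    'surname': ['Doe', 'Wellish', 'Smith', 'Doe'],
})

# Index labels of the placeholder rows (both conditions must hold)
placeholder_index = toy.query('given_name == "John" and surname == "Doe"').index

# Drop by label; returning a new frame instead of using inplace=True
kept = toy.drop(index=placeholder_index)

print(kept)
```

John Smith survives because the surname condition fails for that row.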

---
#### Multiple records for Jakobsen, Gersten, Taylor

##### 3.9. Define: 

First, find the index labels of the duplicated records. Then, use [`DataFrame.drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to drop them.

##### Code:

In [102]:
# jakob and jake are the same person, which can be confirmed
# by the fact that their info, like address, are the same
patients_clean.query('surname == "Jakobsen"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,1985-08-01,155.8,67,24.4,18458587707,jakobcjakobsen@einrot.com
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,1985-08-01,155.8,67,24.4,18458587707,jakobcjakobsen@einrot.com
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,77020,United States,1962-11-25,185.2,67,29.0,19792030438,karenjakobsen@jourrapide.com


In [103]:
# drop one of them
patients_clean.drop(29, inplace=True)

##### Test:

In [104]:
# check to see if it worked
patients_clean.query('surname == "Jakobsen"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,1985-08-01,155.8,67,24.4,18458587707,jakobcjakobsen@einrot.com
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,77020,United States,1962-11-25,185.2,67,29.0,19792030438,karenjakobsen@jourrapide.com


##### Code:

In [105]:
# patrick and pat are the same person, which can be confirmed
# by the fact that their info, like address, are the same
patients_clean.query('surname == "Gersten"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,68324,United States,1954-05-03,138.2,71,19.3,14028484923,patrickgersten@rhyta.com
502,503,male,Pat,Gersten,2778 North Avenue,Burr,NE,68324,United States,1954-05-03,138.2,71,19.3,14028484923,patrickgersten@rhyta.com


In [106]:
# drop one of them
patients_clean.drop(502, inplace=True)

##### Test:

In [107]:
# check to see if it worked
patients_clean.query('surname == "Gersten"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,68324,United States,1954-05-03,138.2,71,19.3,14028484923,patrickgersten@rhyta.com


##### Code:

In [108]:
# sandra and sandy are the same person, which can be confirmed
# by the fact that their info, like address, are the same
patients_clean.query('surname == "Taylor"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,1960-10-23,206.1,64,35.4,13044382648,sandractaylor@dayrep.com
282,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,1960-10-23,206.1,64,35.4,13044382648,sandractaylor@dayrep.com
426,427,male,Rogelio,Taylor,4064 Marigold Lane,Miami,FL,33179,United States,1992-09-02,186.6,69,27.6,13054346299,rogeliojtaylor@teleworm.us


In [109]:
# drop one of them
patients_clean.drop(282, inplace=True)

##### Test:

In [110]:
patients_clean.query('surname == "Taylor"')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,1960-10-23,206.1,64,35.4,13044382648,sandractaylor@dayrep.com
426,427,male,Rogelio,Taylor,4064 Marigold Lane,Miami,FL,33179,United States,1992-09-02,186.6,69,27.6,13054346299,rogeliojtaylor@teleworm.us
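
Nickname duplicates like these differ in `given_name`, so a full-row comparison does not catch them. One possible way to surface them programmatically is to compare only the columns that should identify a person. The choice of key columns below is an assumption, and the toy rows are abridged from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    'given_name': ['Jakob', 'Jake', 'Karen'],
    'surname': ['Jakobsen', 'Jakobsen', 'Jakobsen'],
    'address': ['648 Old Dear Lane', '648 Old Dear Lane', '1690 Fannie Street'],
    'birthdate': ['1985-08-01', '1985-08-01', '1962-11-25'],
})

# Hypothetical identifying columns; given_name is deliberately left out
key = ['surname', 'address', 'birthdate']

# keep=False flags every member of a duplicate group for inspection
suspects = toy[toy.duplicated(subset=key, keep=False)]

# keep='first' retains the first record of each group
deduped = toy.drop_duplicates(subset=key, keep='first')

print(suspects)
print(deduped)
```

Inspecting `suspects` before dropping anything keeps a human in the loop, which matters when two different people could plausibly share an address and birthdate.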


---

#### Kgs instead of lbs for patient surname Zaitseva weight (patient_id: 211)

##### 3.10. Define:

First, convert her weight from kg to lb (multiply by 2.20462262). Then, use [`DataFrame.at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) to make the substitution. Don't forget to [round](https://pandas.pydata.org/docs/reference/api/pandas.Series.round.html) the weight to one decimal place, to match the formatting of the other entries.

##### Code:

In [111]:
# find her entry
patients_clean.query('patient_id == 211')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
210,211,female,Camilla,Zaitseva,4689 Briarhill Lane,Wooster,OH,44691,United States,1938-11-26,48.8,63,19.1,13302022145,camillazaitseva@superrito.com


In [112]:
# convert her weight from kgs to lbs
weight_lbs = patients_clean.at[210, 'weight'] * 2.20462262
weight_lbs

107.58558385599999

In [113]:
# round the result to one decimal place
patients_clean.at[210, 'weight'] = weight_lbs.round(1)

##### Test:

In [114]:
# check to see if it worked
patients_clean.query('patient_id == 211')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
210,211,female,Camilla,Zaitseva,4689 Briarhill Lane,Wooster,OH,44691,United States,1938-11-26,107.6,63,19.1,13302022145,camillazaitseva@superrito.com


In [115]:
# check to see if it worked
patients_clean['weight'].sort_values()

459    102.1
335    102.7
74     103.2
317    106.0
171    106.5
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 494, dtype: float64
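
The conversion can be double-checked with plain arithmetic. The sketch below also recomputes BMI from the converted weight using the standard imperial formula (703 × lb / in²). That sanity check is an addition here, not a step from the notebook.

```python
KG_TO_LB = 2.20462262

weight_kg = 48.8   # value recorded for patient_id 211
height_in = 63     # height from the same row

# Convert and round to one decimal place, like the other weights
weight_lb = round(weight_kg * KG_TO_LB, 1)

# Imperial BMI: 703 * weight (lb) / height (in) squared
bmi = round(703 * weight_lb / height_in ** 2, 1)

print(weight_lb, bmi)
```

The recomputed BMI agrees with the 19.1 stored in the row, which supports the reading that the original 48.8 was in kilograms.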

---

##### 3.11. Define:

Check to see if there are any duplicated values. If there are, drop them.

##### Code:

In [116]:
# checking the treatments table
sum(treatments_clean.duplicated())

1

In [117]:
# drop duplicate from the treatments table
treatments_clean.drop_duplicates(inplace=True)

##### Test:

In [118]:
# checking to see if it worked
sum(treatments_clean.duplicated())

0

##### Code:

In [119]:
# checking the patients table
sum(patients_clean.duplicated())

0
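
A minimal sketch of the count-then-drop check on a made-up frame, using the Series method `.sum()` on the boolean mask.

```python
import pandas as pd

toy = pd.DataFrame({'patient_id': [1, 2, 2],
                    'start_dose': [41, 33, 33]})

# duplicated() marks rows that repeat an earlier row in every column
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each fully identical row
deduped = toy.drop_duplicates()

print(n_dupes, len(deduped))
```

Only rows identical in every column count as duplicates here, so records that differ in a single field are not caught by this check.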

---
### Optional: Save cleaned datasets

Note: since we have joined the adverse reactions to the treatments table, there's no need to save that file.

In [120]:
treatments_clean.to_csv('support-files/04_Cleaning-Data/treatments_clean.csv')

In [121]:
patients_clean.to_csv('support-files/04_Cleaning-Data/patients_clean.csv')
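
One thing to watch when saving to CSV: the file stores no dtypes, so padded zip codes can lose their leading zeros when the file is read back. The sketch below round-trips a toy frame through an in-memory buffer. The `index=False` and `dtype` arguments are suggestions, not part of the cells above.

```python
import io

import pandas as pd

toy = pd.DataFrame({'patient_id': [4], 'zip_code': ['07095']})

# Write without the index so no extra unnamed column appears on reload
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Naive reload: zip_code is inferred as an integer and loses its zero
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reload with an explicit string dtype to keep the padded value
buffer.seek(0)
careful = pd.read_csv(buffer, dtype={'zip_code': str})

print(naive['zip_code'].iloc[0], careful['zip_code'].iloc[0])
```

The same applies to category and datetime columns, which also need to be restored after loading a CSV.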

---
### Reminder: Data Wrangling Can Be Iterative
Iteration is less common in clinical trials, given the rigor involved in their planning, but many other situations require revisiting earlier wrangling steps. If no further iteration is needed, you can proceed to analysis or visualization at this point.

### Clean: Summary
Cleaning is the third step in the data wrangling process:

- Gather
- Assess
- **Clean**

There are two types of cleaning:

- Manual (not recommended unless the issues are one-off occurrences)
- Programmatic

The programmatic data cleaning process:

1. Define: convert our assessments into defined cleaning tasks. These definitions also serve as an instruction list so others (or yourself in the future) can look at your work and reproduce it.
2. Code: convert those definitions to code and run that code.
3. Test: test your dataset, visually or with code, to make sure your cleaning operations worked.

***Always make copies of the original pieces of data before cleaning!***