## 1. The Discovery of Handwashing
<p>In the mid-1800s, Dr. Ignaz Semmelweis was an obstetrician at Vienna General Hospital. At the time, maternal death due to puerperal fever was common, but he was particularly concerned that the death rate in his clinic (Clinic 1) was much higher than the death rate in another clinic at Vienna General Hospital (Clinic 2). <em>So what was the difference between these two clinics?</em> Doctors and midwives worked in Clinic 1, while only midwives worked in Clinic 2. This led Dr. Semmelweis to hypothesize that doctors carried deadly "cadaverous particles" from their autopsies to their patients in Clinic 1.</p>
<p>In 1847, Dr. Semmelweis instituted a policy requiring doctors to wash their hands with a chlorine solution between performing autopsies and seeing patients. The maternal mortality rate decreased drastically, as seen in the plot below. Sadly, germ theory (the idea that there are particles that cause disease) was not widely accepted at the time, so his hypothesis was rejected by most doctors.</p>
<p><img src="https://assets.datacamp.com/production/project_1187/img/semmelweis_plot.png" alt="Line plot of maternal mortality rate in Clinic 1 at Vienna General Hospital" width="600px"></p>
<p>The two datasets you will use are from Dr. Semmelweis's original 1861 publication<sup>1</sup>. Here are the details:</p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
<div style="font-size:20px"><b>datasets/clinic_data.csv</b></div>
This contains yearly clinic-level data on births and maternal deaths in each of the two maternity clinics at Vienna General Hospital.
<ul>
<li><b><code>year</code>:</b> each year from 1833 to 1858</li>
<li><b><code>births</code>:</b> total number of births in the clinic</li>
<li><b><code>deaths</code>:</b> number of maternal deaths in the clinic</li>
<li><b><code>clinic</code>:</b> clinic (either <code>clinic_1</code> or <code>clinic_2</code>). Doctors and midwives worked in Clinic 1, while only midwives worked in Clinic 2.</li>
</ul>
</div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
<div style="font-size:20px"><b>datasets/hospital_data.csv</b></div>
This contains yearly hospital-level data on births and maternal deaths. 
<ul>
<li><b><code>year</code>:</b> each year from 1784 to 1848</li>
<li><b><code>births</code>:</b> total number of births at the hospital</li>
<li><b><code>deaths</code>:</b> number of maternal deaths at the hospital</li>
<li><b><code>hospital</code>:</b> hospital (either <code>Vienna</code> or <code>Dublin</code>). At the Vienna General Hospital where Dr. Semmelweis worked, doctors began performing pathological autopsies in 1823. At the Dublin Rotunda Hospital, doctors did not perform pathological autopsies at all.</li>
</ul>
</div>
<p><small><sup>1</sup><a href="http://graphics8.nytimes.com/images/blogs/freakonomics/pdf/the%20etiology,%20concept%20and%20prophylaxis%20of%20childbed%20fever.pdf">Ignaz Semmelweis: The etiology, concept, and prophylaxis of childbed fever.</a></small></p>

## Questions
1. What were the death rates for each year in both datasets?
2. In each clinic, what was the average death rate for the years before handwashing was introduced in 1847?
3. What were the average death rates in the Vienna General Hospital both before and after pathological autopsies were introduced in 1823? 

In [1]:
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [2]:
# Load the data
clinic_data <- read.csv("datasets/clinic_data.csv")
hospital_data <- read.csv("datasets/hospital_data.csv")

# Check the first 6 rows of each dataset (the default for head())
head(clinic_data)
head(hospital_data)

,year,births,deaths,clinic
,<int>,<int>,<int>,<chr>
1,1833,3737,197,clinic_1
2,1834,2657,205,clinic_1
3,1835,2573,143,clinic_1
4,1836,2677,200,clinic_1
5,1837,2765,251,clinic_1
6,1838,2987,91,clinic_1


,year,births,deaths,hospital
,<int>,<int>,<int>,<chr>
1,1784,1261,11,Dublin
2,1785,1292,8,Dublin
3,1786,1351,8,Dublin
4,1787,1347,10,Dublin
5,1788,1469,23,Dublin
6,1789,1435,25,Dublin


In [3]:
str(clinic_data)
str(hospital_data)

'data.frame':	52 obs. of  4 variables:
 $ year  : int  1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 ...
 $ births: int  3737 2657 2573 2677 2765 2987 2781 2889 3036 3287 ...
 $ deaths: int  197 205 143 200 251 91 151 267 237 518 ...
 $ clinic: chr  "clinic_1" "clinic_1" "clinic_1" "clinic_1" ...
'data.frame':	130 obs. of  4 variables:
 $ year    : int  1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 ...
 $ births  : int  1261 1292 1351 1347 1469 1435 1546 1602 1631 1747 ...
 $ deaths  : int  11 8 8 10 23 25 12 25 10 19 ...
 $ hospital: chr  "Dublin" "Dublin" "Dublin" "Dublin" ...


In [4]:
# Count missing values; is.na() checks every cell, whereas is.null() only tests whether the whole object is NULL
sum(is.na(clinic_data))
sum(is.na(hospital_data))
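
When counting missing values, it helps to keep in mind that `is.null()` and `is.na()` answer different questions. The sketch below illustrates the difference on a small made-up data frame (`toy` and its values are hypothetical, not taken from the datasets) and uses only base R:

```r
# Hypothetical toy data frame with one missing value (not from the datasets)
toy <- data.frame(year = c(1841L, 1842L, 1843L),
                  deaths = c(10L, NA, 12L))

# is.null() asks whether the whole object is NULL, so it yields a single FALSE here
null_count <- sum(is.null(toy))

# is.na() checks every cell, so sum() counts the missing cells
na_count <- sum(is.na(toy))

# colSums() breaks the count of missing cells down by column
na_by_column <- colSums(is.na(toy))

print(null_count)
print(na_count)
print(na_by_column)
```

A per-column count is often the more useful view, since it shows which variable would need cleaning.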

### 1. What were the death rates for each year in both datasets?

In [5]:
# Add a death rate column (deaths per birth) to each dataset
clinic_data <- clinic_data %>% mutate(death_rate = deaths / births)
hospital_data <- hospital_data %>% mutate(death_rate = deaths / births)

# Show the first 6 rows of each dataset
head(clinic_data)
head(hospital_data)

,year,births,deaths,clinic,death_rate
,<int>,<int>,<int>,<chr>,<dbl>
1,1833,3737,197,clinic_1,0.05271608
2,1834,2657,205,clinic_1,0.07715469
3,1835,2573,143,clinic_1,0.05557715
4,1836,2677,200,clinic_1,0.0747105
5,1837,2765,251,clinic_1,0.09077758
6,1838,2987,91,clinic_1,0.03046535


,year,births,deaths,hospital,death_rate
,<int>,<int>,<int>,<chr>,<dbl>
1,1784,1261,11,Dublin,0.008723236
2,1785,1292,8,Dublin,0.00619195
3,1786,1351,8,Dublin,0.00592154
4,1787,1347,10,Dublin,0.007423905
5,1788,1469,23,Dublin,0.015656909
6,1789,1435,25,Dublin,0.017421603


### 2. In each clinic, what was the average death rate for the years before handwashing was introduced in 1847? 

In [6]:
# Average death rate per clinic for the years before handwashing was introduced (1847),
# converted to a plain data frame
rate_by_clinic_pre_handwashing <- clinic_data %>%
  filter(year < 1847) %>%
  group_by(clinic) %>%
  summarise(avg_rate = mean(death_rate)) %>%
  as.data.frame()

rate_by_clinic_pre_handwashing

`summarise()` ungrouping output (override with `.groups` argument)



clinic,avg_rate
<chr>,<dbl>
clinic_1,0.07993925
clinic_2,0.04787381
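
Note that averaging yearly rates weights every year equally, regardless of how many births occurred in it. An alternative is the pooled rate, total deaths divided by total births. The sketch below contrasts the two using only the six Clinic 1 rows shown in the earlier `head()` output, so its numbers will not match the full-data averages above; the variable names are illustrative.

```r
# First six Clinic 1 rows, copied from the head() output shown earlier
clinic_1_sample <- data.frame(
  year   = 1833:1838,
  births = c(3737, 2657, 2573, 2677, 2765, 2987),
  deaths = c(197, 205, 143, 200, 251, 91)
)

# Unweighted: average the yearly rates (each year counts equally)
mean_of_yearly_rates <- mean(clinic_1_sample$deaths / clinic_1_sample$births)

# Pooled: total deaths over total births (each birth counts equally)
pooled_rate <- sum(clinic_1_sample$deaths) / sum(clinic_1_sample$births)

print(c(mean_of_yearly = mean_of_yearly_rates, pooled = pooled_rate))
```

The two figures differ whenever the number of births varies from year to year; which one is appropriate depends on whether the question is about a typical year or about the overall risk per birth.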


### 3. What were the average death rates in the Vienna General Hospital both before and after pathological autopsies were introduced in 1823? 

In [7]:
# Keep only the Vienna rows and flag the years from 1823 on, when pathological autopsies were introduced
hospital_data <- hospital_data %>%
  filter(hospital == "Vienna") %>%
  mutate(autopsies_introduced = year >= 1823)

# Average death rate before (FALSE) and after (TRUE) autopsies were introduced,
# converted to a plain data frame
rate_by_autopsies_introduced <- hospital_data %>%
  group_by(autopsies_introduced) %>%
  summarise(avg_rate = mean(death_rate)) %>%
  as.data.frame()

rate_by_autopsies_introduced

`summarise()` ungrouping output (override with `.groups` argument)



autopsies_introduced,avg_rate
<lgl>,<dbl>
False,0.01166024
True,0.05877959


In [8]:
# # R function to calculate the average death rate before and after a given year
# avg_death_rate_change <- function(df, yr, name){
#     # This function takes three input arguments:
#     # df --> data frame
#     # yr --> the year of interest, INT
#     # name --> string, new name for the grouping column
#     # Returns a data frame of average death rates before and after the given year
#
#     # Create a logical column flagging years on or after yr
#     df <- df %>% mutate(yr_interested = year >= yr)
#     temp <- data.frame(df %>% group_by(yr_interested) %>% summarise(avg_rate = mean(death_rate)))
#     colnames(temp) <- c(name, "avg_rate")
#     return(temp)
# }

In [9]:
# rate_by_clinic_pre_handwashing<-avg_death_rate_change(clinic_data, 1847, "clinic")
# rate_by_clinic_pre_handwashing

In [10]:
# rate_by_autopsies_introduced<- avg_death_rate_change(hospital_data, 1823,"autopsies_introduced")
# rate_by_autopsies_introduced
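
The commented-out helper above groups only by the before/after flag, so it could not reproduce the per-clinic answer to question 2. One possible variation, sketched here in base R, adds an optional grouping column. The function name `avg_rate_before_after`, its `group_col` argument, and the `toy` data are all hypothetical and chosen for illustration; this is not the notebook's own method.

```r
# Hypothetical helper: average yearly death rate before and after a given year,
# optionally split by one extra grouping column (for example "clinic")
avg_rate_before_after <- function(df, yr, group_col = NULL) {
  # Flag each row as falling on or after the year of interest
  df$on_or_after <- df$year >= yr
  df$death_rate <- df$deaths / df$births
  group_cols <- c(group_col, "on_or_after")
  # Average the yearly death rates within each group
  result <- aggregate(df["death_rate"], by = df[group_cols], FUN = mean)
  names(result)[names(result) == "death_rate"] <- "avg_rate"
  result
}

# Made-up toy data (not from the datasets) with round numbers for easy checking
toy <- data.frame(
  year   = c(1845L, 1846L, 1847L, 1848L, 1845L, 1846L, 1847L, 1848L),
  births = c(1000, 1000, 1000, 1000, 500, 500, 500, 500),
  deaths = c(100, 80, 30, 10, 20, 10, 10, 5),
  clinic = rep(c("clinic_1", "clinic_2"), each = 4)
)

by_clinic <- avg_rate_before_after(toy, 1847, group_col = "clinic")
overall   <- avg_rate_before_after(toy, 1847)

print(by_clinic)
print(overall)
```

Filtering the result to rows where `on_or_after` is `FALSE` gives the pre-intervention averages per group, which is the shape of the answer question 2 asks for.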