# Dr. Semmelweis and the Discovery of Handwashing
Reanalyse the data behind one of the most important discoveries of modern medicine: handwashing.

Project Description
In 1847, the Hungarian physician Ignaz Semmelweis made a breakthrough discovery: handwashing. Contaminated hands were a major cause of childbed fever, and by enforcing handwashing at his hospital he saved hundreds of lives.

# 1. Meet Dr. Ignaz Semmelweis

This is Dr. Ignaz Semmelweis, a Hungarian physician born in 1818 and active at the Vienna General Hospital. If Dr. Semmelweis looks troubled it's probably because he's thinking about childbed fever: a deadly disease affecting women who have just given birth. He is thinking about it because in the early 1840s at the Vienna General Hospital as many as 10% of the women giving birth die from it. He is thinking about it because he knows the cause of childbed fever: it's the contaminated hands of the doctors delivering the babies. And they won't listen to him and wash their hands!

In this notebook, we're going to reanalyze the data that made Semmelweis discover the importance of handwashing. Let's start by looking at the data that made Semmelweis realize that something was wrong with the procedures at Vienna General Hospital.

Task 1: Instructions
- Load in the tidyverse package.
- Read in datasets/yearly_deaths_by_clinic.csv using read_csv and assign it to the variable yearly.
- Print out yearly.

# Good to know
The tidyverse package automatically loads in the packages ggplot2, dplyr, and readr. Make sure you use read_csv from the readr package, and not read.csv, to read in the data. This project assumes you can manipulate data frames using dplyr and make simple plots using ggplot2. You can learn these skills in the course Introduction to the Tidyverse. The most relevant exercises are:

- Using mutate to change or create a column
- Adding color to a scatter plot
- Summarizing by continent
- Visualizing median GDP per capita by continent over time

Even if you've taken this course you will still find this project challenging unless you use some external documentation. In this project RStudio's ggplot2 cheat sheet and dplyr cheat sheet can come in handy.
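
One practical difference between the two readers matters later in this project, when dates are compared: read_csv returns a tibble and tries to guess column types such as dates, while read.csv returns a plain data frame and leaves date-like text unparsed. The sketch below illustrates this on a made-up two-row CSV (the values are invented, not taken from the project's files) and assumes readr 2.0 or later, where literal text is passed via I().

```r
library(readr)

# Made-up two-row CSV, for illustration only (not the project's data)
csv_text <- "date,births,deaths\n1841-01-01,100,10\n1841-02-01,200,5\n"

# readr: returns a tibble and guesses that `date` holds dates
toy_readr <- read_csv(I(csv_text), show_col_types = FALSE)

# base R: returns a plain data frame and does not parse `date` as a Date
toy_base <- read.csv(text = csv_text)

class(toy_readr$date)
class(toy_base$date)
```

Comparing the two class() results shows why the choice of reader affects whether date comparisons work without an extra conversion step.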

In [None]:
# Load in the tidyverse package
library(tidyverse)

# Read datasets/yearly_deaths_by_clinic.csv into yearly
yearly <- read_csv('datasets/yearly_deaths_by_clinic.csv')

# Print out yearly
yearly

# 2. The alarming number of deaths
The table above shows the number of women giving birth, and the number of deaths, at the two clinics at the Vienna General Hospital for the years 1841 to 1846. You'll notice that giving birth was very dangerous; an alarming number of women died as a result of childbirth, most of them from childbed fever.

We see this more clearly if we look at the proportion of deaths out of the number of women giving birth.

Task 2: Instructions
- Use mutate to add the column proportion_deaths to yearly calculated as the proportion of deaths per number of births.
- Print out yearly.

For an example of how mutate works look under Make New Variables in the dplyr cheat sheet.

In [None]:
# Adding a new column with proportion of deaths per no. births
yearly <- yearly %>% 
  mutate(proportion_deaths = deaths / births)

# Print out yearly
yearly
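
To see what mutate does in isolation, here is a minimal sketch on a tiny made-up table. The counts are invented for illustration and are not the hospital's records; toy is a hypothetical name.

```r
library(dplyr)

# Made-up counts, for illustration only (not the hospital's records)
toy <- tibble(
  births = c(200, 400),
  deaths = c(20, 10)
)

# mutate adds a new column computed row by row from existing columns
toy <- toy %>%
  mutate(proportion_deaths = deaths / births)

toy
```

The existing columns are kept, and proportion_deaths is appended as a third column.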

# 3. Death at the clinics
If we now plot the proportion of deaths at both clinic 1 and clinic 2 we'll see a curious pattern…

Task 3: Instructions
- Use ggplot to make a line plot of proportion_deaths by year with one line per clinic.
- The lines should have different colors.

If you don't remember how to plot line plots with ggplot check out the ggplot2 cheat sheet under Geoms, continuous function.

In [None]:
# Setting the size of plots in this notebook
options(repr.plot.width=7, repr.plot.height=4)

# Plot yearly proportion of deaths at the two clinics
ggplot(yearly, aes(x = year, y = proportion_deaths, colour = clinic)) +
  geom_line()

# 4. The handwashing begins
Why is the proportion of deaths consistently so much higher in Clinic 1? Semmelweis saw the same pattern and was puzzled and distressed. The only difference between the clinics was that many medical students served at Clinic 1, while mostly midwife students served at Clinic 2. While the midwives only tended to the women giving birth, the medical students also spent time in the autopsy rooms examining corpses.

Semmelweis started to suspect that something on the corpses, spread from the hands of the medical students, caused childbed fever. So in a desperate attempt to stop the high mortality rates, he decreed: Wash your hands! This was an unorthodox and controversial request; nobody in Vienna knew about bacteria at this point in time.

Let's load in monthly data from Clinic 1 to see if the handwashing had any effect.

Task 4: Instructions
- Read in datasets/monthly_deaths.csv and assign it to the variable monthly.
- Add the column proportion_deaths to monthly calculated as the proportion of deaths per number of births.
- Print out the first rows in monthly using the head() function.

In [None]:
# Read datasets/monthly_deaths.csv into monthly
monthly <- read_csv("datasets/monthly_deaths.csv")

# Adding a new column with proportion of deaths per no. births
monthly <- monthly %>% 
  mutate(proportion_deaths = deaths / births)

# Print out the first rows in monthly
head(monthly)

# 5. The effect of handwashing
With the data loaded we can now look at the proportion of deaths over time. In the plot below we haven't marked where obligatory handwashing started, but it reduced the proportion of deaths to such a degree that you should be able to spot it!

Task 5: Instructions
- Make a line plot of proportion_deaths by date for the monthly data frame using ggplot.
- Use the labs function to give the x-axis and y-axis prettier labels.

For how to use the labs function to add labels check out the ggplot2 cheat sheet under Labels.

In [None]:
# Plot monthly proportion of deaths
ggplot(monthly, aes(date, proportion_deaths)) +
  geom_line() +
  labs(x = "Year", y = "Proportion Deaths")

# 6. The effect of handwashing highlighted
Starting from the summer of 1847 the proportion of deaths is drastically reduced and, yes, this was when Semmelweis made handwashing obligatory.

The effect of handwashing becomes even clearer if we highlight this in the graph.

Task 6: Instructions
- Add a TRUE/FALSE column to monthly called handwashing_started which is TRUE for dates where obligatory handwashing was enforced.
- Make a line plot of proportion_deaths by date for the monthly data frame using ggplot. Make the color of the line depend on handwashing_started.
- Use the labs function to give the x-axis and y-axis prettier labels.

Since the column monthly$date is a Date column you can compare it to other Dates using the comparison operators (<, >=, ==, etc.). For example, the following would create a new column in monthly which is FALSE for all dates except for the month when handwashing started:

monthly <- monthly %>%
  mutate(is_start_month = 
    date == handwashing_start)
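
As a quick check of how Date comparisons behave, the sketch below compares three hand-picked dates against the start date used in this project. The vector dates and the result names are illustrative only.

```r
# The start date used in this project
handwashing_start <- as.Date("1847-06-01")

# Three hand-picked dates: one before, one on, and one after the start date
dates <- as.Date(c("1847-05-01", "1847-06-01", "1847-07-01"))

# Comparison operators work element-wise and return TRUE/FALSE values
started <- dates >= handwashing_start
is_start_month <- dates == handwashing_start

started
is_start_month
```

Because the comparison is element-wise, the same expression can be used directly inside mutate to build a TRUE/FALSE column.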

In [None]:
# From this date handwashing was made mandatory
handwashing_start <- as.Date('1847-06-01')

# Add a TRUE/FALSE to monthly called handwashing_started
monthly <- monthly %>%
  mutate(handwashing_started = date >= handwashing_start)

# Plot monthly proportion of deaths before and after handwashing
ggplot(monthly, aes(x = date, y = proportion_deaths, color = handwashing_started)) +
  geom_line() +
  labs(x = "Year", y = "Proportion Deaths")

# 7. More handwashing, fewer deaths?
Again, the graph shows that handwashing had a huge effect. How much did it reduce the monthly proportion of deaths on average?

Task 7: Instructions
- Use group_by and summarise to calculate the mean proportion of deaths before and after handwashing was enforced.
- Put the resulting table into monthly_summary.

The resulting data frame should look like below, but with 0.????? replaced by the actual numbers.

| handwashing_started | mean_proportion_deaths |
|---|---|
| FALSE | 0.????? |
| TRUE | 0.????? |

Look under Group Cases in the dplyr cheat sheet for an example of how group_by and summarise work together.

In [None]:
# Calculating the mean proportion of deaths 
# before and after handwashing.

monthly_summary <- monthly %>% 
  group_by(handwashing_started) %>%
  summarise(mean_proportion_deaths = mean(proportion_deaths))

# Printing out the summary.
monthly_summary
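
The way group_by and summarise collapse many rows into one row per group can be seen on a small made-up example. The values below are invented for illustration, and toy_monthly, toy_summary and diff_means are hypothetical names.

```r
library(dplyr)

# Made-up monthly proportions, for illustration only
toy_monthly <- tibble(
  handwashing_started = c(FALSE, FALSE, TRUE, TRUE),
  proportion_deaths   = c(0.10, 0.12, 0.02, 0.04)
)

# One row per group, holding the mean of each group
toy_summary <- toy_monthly %>%
  group_by(handwashing_started) %>%
  summarise(mean_proportion_deaths = mean(proportion_deaths))

# Difference in means: before handwashing minus after
diff_means <-
  toy_summary$mean_proportion_deaths[!toy_summary$handwashing_started] -
  toy_summary$mean_proportion_deaths[toy_summary$handwashing_started]

toy_summary
diff_means
```

Subtracting the two group means gives the average reduction as a proportion; multiplying by 100 expresses it in percentage points.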

# 8. A statistical analysis of Semmelweis' handwashing data
Handwashing reduced the proportion of deaths by around 8 percentage points! From 10% on average before handwashing to just 2% when handwashing was enforced (which is still a high number by modern standards). To get a feeling for the uncertainty around how much handwashing reduces mortality we could look at a confidence interval (here calculated using a t-test).
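
The difference between a reduction in percentage points and a relative reduction can be checked with a little arithmetic. The sketch below uses the rounded figures quoted above (10% and 2%), not the exact means from monthly_summary, and the variable names are illustrative.

```r
# Rounded average proportions quoted in the text
before <- 0.10
after  <- 0.02

# Absolute reduction, expressed in percentage points
pp_reduction <- (before - after) * 100

# Relative reduction, as a share of the original proportion
relative_reduction <- (before - after) / before

pp_reduction
relative_reduction
```

With these rounded inputs the absolute reduction is 8 percentage points, which corresponds to a relative reduction of 80%.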

Task 8: Instructions
- Use the t.test function to calculate a 95% confidence interval around how much dirty hands increase proportion_deaths.

A t-test is a simple statistical model for the means of two groups where you have continuous measurements. The two groups we have are monthly proportion_deaths before handwashing had started and then after it was enforced. A t-test produces a lot of numbers, but what we are interested in is the confidence interval, here a measure of uncertainty around what the increase in mortality could be due to doctors not washing their hands.

If df is a data frame, outcome is a numeric column in df, and group is a TRUE/FALSE column splitting df into two groups, then the following would run a t-test for the two groups:

t.test(outcome ~ group, data = df)

The tilde (~) should be read as "depends on", and so the above means "assume the outcome depends on group".

In [None]:
# Calculating a 95% confidence interval using t.test
test_result <- t.test(proportion_deaths ~ handwashing_started, data = monthly)
test_result
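
Printing the test result shows many numbers at once. If only the confidence interval is of interest, it can be pulled out of the returned object. The sketch below does this on made-up data (the values are invented and toy and toy_result are hypothetical names), so the numbers it produces are not those of the Semmelweis data.

```r
# Made-up proportions for two groups, for illustration only
toy <- data.frame(
  proportion_deaths   = c(0.10, 0.12, 0.08, 0.11, 0.02, 0.03, 0.01, 0.02),
  handwashing_started = rep(c(FALSE, TRUE), each = 4)
)

toy_result <- t.test(proportion_deaths ~ handwashing_started, data = toy)

# Confidence interval for the difference in means
# (mean of the FALSE group minus mean of the TRUE group)
ci <- toy_result$conf.int
ci

# The two group means the interval is based on
toy_result$estimate
```

Because the FALSE group comes first, a positive interval here means the proportion of deaths was higher before handwashing started.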

# 9. The fate of Dr. Semmelweis
That the doctors didn't wash their hands increased the proportion of deaths by between 6.7 and 10 percentage points, according to a 95% confidence interval. All in all, it would seem that Semmelweis had solid evidence that handwashing was a simple but highly effective procedure that could save many lives.

The tragedy is that, despite the evidence, Semmelweis' theory — that childbed fever was caused by some "substance" (what we today know as bacteria) from autopsy room corpses — was ridiculed by contemporary scientists. The medical community largely rejected his discovery and in 1849 he was forced to leave the Vienna General Hospital for good.

One reason for this was that statistics and statistical arguments were uncommon in medical science in the 1800s. Semmelweis only published his data as long tables of raw data, but he didn't show any graphs or confidence intervals. If he had had access to the analysis we've just put together he might have been more successful in getting the Viennese doctors to wash their hands.

Task 9: Instructions
- Given the data Semmelweis collected, is it TRUE or FALSE that doctors should wash their hands?

Congratulations, you've made it this far! If you haven't tried it already, you should check your project now by clicking the "Check project" button.

Good luck! :)

In [None]:
# The data Semmelweis collected points to that:
doctors_should_wash_their_hands <- TRUE