---
title: "An outbreak of gastroenteritis in Stegen, Germany, June 1998"
authors: ["Amrish Baidjoe", "Thibaut Jombart", "Janetta Skarp", "Zhian N. Kamvar", "Alexander Spina", "Patrick Keating"]
categories: "case studies"
tags: ["level: beginner", "epicurve", "single variable analysis", "2x2 tables", "gastroenteritis", "plotting cases"]
date: 2018-11-12
slug: stegen
licenses: CC-BY
image: img/highres/graduation-1965.jpg
showonlyimage: true
toc: true
always_allow_html: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, warning = FALSE)
```
# Context
On 26 June 1998, the St Sebastian High School in Stegen (school A), Germany,
celebrated a graduation party, where 250 to 350 participants were
expected. Attendants included graduates from that school, their families and
friends, teachers, 12th grade students, and some graduates from a nearby school
(school B).
A self-service party buffet was supplied by a commercial caterer from the
nearby city of Freiburg. Food was prepared on the day of the party and
transported in a refrigerated van to the school. Relative locations of the two
schools and Freiburg can be seen below:
<iframe src="" width="100%" height="500"></iframe>
Festivities at school A started with a dinner buffet open from 8.30 pm onwards and were
followed by a dessert buffet offered from 10 pm. The party and the buffet
extended late into the night and alcoholic beverages were quite popular. All
agreed it was a party to be remembered.
## The alert
On 2nd July 1998, the Freiburg local health office reported to the Robert Koch
Institute (RKI) in Berlin a significant increase of cases of gastroenteritis in the municipality of Stegen
following the graduation party described above. More than 100 cases were
suspected among participants and some of them were so ill that they were admitted to nearby
hospitals. Sick people showed symptoms including *fever, nausea, diarrhoea and vomiting*
lasting for several days. Stool samples were collected from several ill cases and sent to the laboratory in Freiburg for microbiological identification. *Salmonella enteritidis* was identified in 19 stool samples.
In response to the large number of cases associated with the dinner, the Freiburg health office sent an outbreak control team to investigate the kitchen of the caterer in Freiburg. Food preparation procedures were reviewed. Samples were taken of the food items remaining after the graduation ceremony and were sent to the laboratory of Freiburg University. Microbiological analyses were performed on samples of the following food items:
brown chocolate mousse, caramel cream, remoulade sauce, yoghurt dill sauce, and
10 raw eggs. Unfortunately, no tiramisu was left over after the dinner, so no samples of this specific food item could be tested.
The Freiburg health office requested help from the RKI to
assess the magnitude of the outbreak and identify potential vehicle(s) and risk
factors for transmission in order to better control the outbreak.
## The epidemiological study
**Case definition**: cases were defined as any person who had attended the party
at St Sebastian High School and who suffered from *diarrhoea* (at least three
loose stools within 24 hours) with onset between 27 June and 29 June 1998, or from at least three of
the following symptoms: *vomiting*, *fever over 38.5° C*, *nausea*,
*abdominal pain*, *headache*.
Students from both schools attending the party were asked through phone
interviews to provide names of persons who attended the party.
Overall, 291 responded to enquiries and 103 cases were identified. The linelist
analysed in this case study was built from these 291 responses.
## This case study
In this case study, we will take you through the analysis of this epidemic using R. This
will be the occasion to illustrate more generally useful practices for data
analysis using **R**, including how to:
- read and import data from Excel
- explore data using tables and summaries
- clean data
- make graphics to describe the data
- test which food items were associated with disease
- plot a very basic spatial overview of cases
# Initial data processing
## Project setup
When using R, it's a very good idea to keep your data and scripts organised
together. One way of doing this is to keep them in a single folder that hosts
an RStudio project file. For this project, your folder structure will start
out looking like this:
```
├── 01-stegen-analysis.R
├── data/
│   ├── stegen-map/
│   │   ├── stegen_households.cpg
│   │   ├── stegen_households.prj
│   │   ├── stegen_households.qpj
│   │   ├── stegen_households.shp
│   │   └── stegen_households.shx
│   └── stegen_raw.xlsx
└── stegen.Rproj
```
To create this, do the following steps:
(@) Open RStudio and create a new RStudio project in a new directory called "stegen".
After you create your RStudio project, look at the top right corner of the
RStudio window and make sure that the project is set to "stegen" and not
"Project: (None)":
![Screenshot of RStudio project set to stegen](../../img/screenshots/stegen-project-initial.png)
(@) Once you have your Project set up, make a new folder called "data/" and
[download `stegen_raw.xlsx` and save it to that folder](../../data/stegen_raw.xlsx)
(@) [Download ``](../../data/ and extract it to the
"data/" folder
(@) In RStudio, click on the following menu path: <kbd>File > New File > R
Script</kbd> and save it as `01-stegen-analysis.R`. This will be the R
script where you will save the code of the analysis. Your RStudio window
should look like this when you are done:
![Screenshot of RStudio project set to stegen with all the data files](../img/screenshots/stegen-project-final.png)
<summary><b>Going Further:</b> What are the `LICENSE` and `README` files?</summary>
Both of these files are plain text files (edited via programs such as notepad
or textedit) that tell people how they can use the data and code.
The `README` file is a text file that describes in plain language what this
project is about and what the components are. It's called README with the idea
that anyone who comes across this project should be able to read this file
first and understand how to work with the project.
The `LICENSE` file describes how the data and code are to be used. For example,
this case study is licensed under the [Creative Commons Attribution 4.0
International license](, which
gives people the freedom to reuse this in any way as long as they attribute the
original authors.
## Loading required packages
The following packages will be used in the case study. You can use the
`install.packages()` command to install them; this will download and install them to
your computer's R library, so you only have to do this once:
- [*here*]( to find the path to data or script files
- [*readxl*]( to read Excel spreadsheets into R
- [*readr*]( to write (and read) spreadsheets as text files
- [*incidence*]( to build epicurves
- [*epitrix*]( to clean labels from our spreadsheet
- [*dplyr*]( to help with factors
- [*ggplot2*]( to create custom visualisations
- [*epitools*]( to calculate risk ratios
- [*sf*]( To read in shape files
- [*leaflet*]( to demonstrate interactive maps
Once we have these packages installed, we can tell R to load these packages from
our R library. This needs to be done once at the top of every R script:
```{r stegen_libraries, warning=FALSE, message=FALSE}
library("here") # find data/script files
library("readxl") # read xlsx files
library("readr") # read and write text spreadsheets
library("incidence") # make epicurves
library("epitrix") # clean labels and variables
library("dplyr") # general data handling
library("ggplot2") # advanced graphics
library("epitools") # statistics for epi data
library("sf") # shapefile handling
library("leaflet") # interactive maps
```
<summary> <b> Package not installed? Did you get an error? </b> </summary>
Note that you will get an error if the packages have not been installed on your
system. To install them (you only need to do this once!), type:
```{r stegen_install_libraries, eval = FALSE}
install.packages(c("here", "readxl", "readr", "incidence", "epitrix",
                   "dplyr", "ggplot2", "epitools", "sf", "leaflet"))
```
Loading these packages makes all functions implemented by the packages
available, so that `function_name()` can be used directly (without the
`package_name::` prefix). This is not a problem here as these packages do not
implement functions with identical names.
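If two attached packages ever did export a function with the same name, the `package_name::` prefix is how you would disambiguate. As a minimal, self-contained illustration using the base *stats* package (which is always available):

```{r namespace_prefix}
## the two calls below are equivalent; the prefixed form works even
## if the package has not been attached with library()
median(c(1, 5, 9))         # relies on the attached namespace; returns 5
stats::median(c(1, 5, 9))  # fully qualified call; also returns 5
```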
## Importing data from Excel
Linelist data can be read from various formats, including flat text files (e.g.
`.txt`, `.csv`), other statistical software (e.g. STATA) or Excel spreadsheets.
We illustrate the latter, which is probably the most common format. We assume
that [the data file `stegen_raw.xlsx`](../../data/stegen_raw.xlsx) has been
saved in a `data/` folder of your project, and that your current R session is
at the root of the project.
Here we decompose the steps to read data in:
1. defining the path to the data (`path_to_data`) with the *here* package
2. using the function `read_excel()` from the *readxl* package to read data in,
and save the data within R as a new object called `stegen`.
```{r stegen_show_data, eval = FALSE}
path_to_data <- here("data", "stegen_raw.xlsx")
```
```{r stegen_to_show, echo = FALSE}
path_to_data <- "/home/zkamvar/Documents/Projects/stegen/data/stegen_raw.xlsx"
```
The `here()` function combines the path to the current R project on your
computer and the path to your data. If you look at what is in the `path_to_data`
object, you will notice that it will end with `data/stegen_raw.xlsx`, but *it
will look different than what I have here*:
```{r stegen_show_data_really}
path_to_data
```
This is because my project is stored in the folder on my computer called
"/home/zkamvar/Documents/Projects/stegen". On your computer, it will be
in a different place, but the important part is that *this code will work on any
computer that has this R project*.
Now that we have the address of the data saved in the `path_to_data` variable,
we can read it in with the `read_excel()` function:
```{r fake_read_Data, eval = FALSE}
stegen <- read_excel(path_to_data)
```
```{r stegen_read_data, results = 'hide', echo = FALSE}
stegen <- readxl::read_excel(here::here("static", "data", "stegen_raw.xlsx"))
```
To look at the content of the dataset, we can use either of these commands:
```{r stegen_print_data}
stegen
```

```{r, stegen_view, eval = FALSE}
View(stegen)
```
<summary><b>Problems?</b> In case of trouble, click here to toggle additional help:</summary>
If the above fails, you should check:
- that your data file has been saved in the right folder, with the right name
(lower case and upper case do matter)
- that your R session was started from the right project - if unsure, close R
and re-open Rstudio by double-clicking on the `.Rproj` file
- that all packages are installed and loaded (see installation guidelines above)
## Overview and summaries
We first have a quick look at the content of the data set. The information we are looking for is:
- the numbers of cases (rows) and variables (columns) in the data
- the *name* of the variables: do they use consistent capitalisation and separators?
- the *type* of the variables: are dates, numeric or categorical variables used where they should be?
- the *coding* of the variables: are explicit labels (e.g. `"male"`/`"female"`) used where relevant?
We first check the dimensions of the `stegen` object, and the name of the variables:
```{r stegen_dim}
dim(stegen) # rows x columns
names(stegen) # column labels
```
We can now try and summarise the dataset using:
```{r stegen_summary}
summary(stegen)
```
Note that binary variables, when treated as numeric values (0/1), are summarised
as such, which may not always be useful. As an alternative, `table()` can be used
to list all possible values of a variable, and count how many times each value
appears in the data. For instance, we can compare the `summary()` and `table()` for
consumption of `tiramisu`:
```{r stegen_summary_table}
stegen$tiramisu # all values
summary(stegen$tiramisu) # summary
table(stegen$tiramisu) # table
```
<summary><b>Going Further:</b> Showing missing values in <code>table()</code></summary>
You may notice that the `tiramisu` column has missing data, but `table()` does
not show how many missing values there are. By studying the documentation for
`table()` (type `?table` in your R console), we find that there is an option
called `useNA` with three options: "no", "ifany", and "always". Let's take a
look at what happens when we use these three options:
```{r usena}
table(stegen$tiramisu, useNA = "no")
table(stegen$tiramisu, useNA = "ifany")
table(stegen$tiramisu, useNA = "always")
```
We can see that both "ifany" and "always" show us that there are
`r sum($tiramisu))` missing values in our data set. What, then, is
the difference between the two options? What happens if we have no missing
values? We can try it with a dummy data set:
```{r usena_dummy}
table(c(1, 1, 0), useNA = "ifany")
table(c(1, 1, 0), useNA = "always")
```
**Good news**: the dataset has the expected dimensions, and all the relevant
variables seem to be present. There are, however, **a few often observed issues**:
1. variable names are a bit messy, and include different separators, spaces, and
irregular capitalisation
2. variable types are partly wrong:
- *unique keys* should be `character` (i.e. character strings)
- *dates of onset* should be `Date` (**R** is good at handling actual dates)
- *illness* and *sex* should be `factor` (i.e. categorical variables)
3. labels used in some variables are ambiguous:
- *sex* should be coded explicitly, not as 0 (here, male) and 1 (here, female)
- *illness* should be coded explicitly, not as 0 (here, non-case) and 1 (case)
4. some binary variables have maximum values of 9 (see `summary(stegen)`)
## Data cleaning
While it is tempting to go back to the Excel spreadsheet to fix issues with
data, it is almost always quicker and more reliable to clean data in **R**
directly. Here, we make a copy of the old data set, and clean `stegen` before
further analysis.
```{r stegen_old}
stegen_old <- stegen # save 'dirty data'
```
We use *epitrix*'s function `clean_labels()` to standardise the variable names:
```{r stegen_clean_labels}
new_labels <- clean_labels(names(stegen)) # generate standardised labels
new_labels # check the result
names(stegen) <- new_labels
```
We convert the unique identifiers to character strings (`character`), dates
of onset to actual `Date` objects, and sex and illness to categorical
variables (`factor`). Factors store the variables as numbers internally, which
is convenient for modelling, while letting us attach meaningful labels:
```{r stegen_unique_ids}
stegen$unique_key <- as.character(stegen$unique_key)
stegen$sex <- factor(stegen$sex)
stegen$ill <- factor(stegen$ill)
stegen$date_onset <- as.Date(stegen$date_onset)
```
We use the function `recode_factor()` from the *dplyr* package to recode `sex` and
`ill` more explicitly.
```{r stegen_recode}
stegen$sex <- recode_factor(stegen$sex, "0" = "male", "1" = "female")
stegen$ill <- recode_factor(stegen$ill, "0" = "non case", "1" = "case")
```
Even though the labels are now explicit, if we look at the structure of the
data, the factors are still represented as numbers internally:

```{r stegen_recode_show}
str(stegen)
```
Finally, we take a closer look at the variables with maximum values of 9,
where we expect 0/1; `table()` is useful for listing all values taken by a
variable and counting their frequencies:
```{r stegen_nines}
table(stegen$pork, useNA = "always")
table(stegen$salmon, useNA = "always")
table(stegen$horseradish, useNA = "always")
```
The only rogue values are `9`; they are likely either data entry issues, or
missing data, which in **R** should be coded as `NA` ("not available"). We can
replace these values using:
```{r stegen_replace_9}
stegen$pork[stegen$pork == 9] <- NA
stegen$salmon[stegen$salmon == 9] <- NA
stegen$horseradish[stegen$horseradish == 9] <- NA
```
Now we can confirm that the `9`'s have been replaced by `NA`.
```{r stegen_nine_agains}
table(stegen$pork, useNA = "always")
table(stegen$salmon, useNA = "always")
table(stegen$horseradish, useNA = "always")
```
<summary> <b> Going further </b> Why does this work? Click here for more details. </summary>
There are several things going on in a command like:
```{r eval = FALSE}
stegen$pork[stegen$pork == 9] <- NA
```
Let us break it down from the inside out.
1. `stegen$pork` means "get the variable called `pork` in the dataset `stegen`"
2. The `==` is a logical check for equality. `stegen$pork == 9` checks each
element in `stegen$pork` against `9`, returning `TRUE` if the element equals
`9` and `FALSE` otherwise.
3. The square brackets (`[ ]`) subset the vector `stegen$pork` according to
whatever is between them; in this case, it's checking `stegen$pork == 9`, which
evaluates to `r head(stegen_old$PORK == 9)`... This will return only the cases
in `stegen$pork` that have `9`s recorded.
4. The replacement: `... <- NA` replaces the subset with `NA` (missing value)

In other words: "isolate the entries of `stegen$pork` which equal `9`, and
replace them with `NA`". Here is another example to illustrate the procedure:
```{r stegen_subset}
## make example input vector
example_vector <- 1:5
## make example logical vector for subsetting
example_logical <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
example_vector[example_logical] # subset values
example_vector[example_logical] <- 0 # replace subset values
example_vector # check outcome
```
## Saving clean data {#clean-data}
Now that we've cleaned our data, we will probably want to re-use it in further
analyses, but we probably don't want to go through the steps of cleaning it
again. Here, we will save our cleaned data into a new folder under `data/`
called `cleaned/`. We will use **R**'s function `dir.create()` to create this
folder under our data folder:
```{r create-clean-dir, eval = FALSE}
clean_dir <- here("data", "cleaned")
dir.create(clean_dir)
```

> STOP: open your file browser and confirm that a folder called "data/cleaned" has been created
We will store two different files in this directory that represent the same data:
1. a flat text file (spreadsheet) that can be read by any program
2. a binary file that can only be accessed from R
### Saving as a flat file
A flat file is a text file that can be read by any program. The most common type
of flat file is a called a **csv** (comma separated values) file. This is a
text version of a spreadsheet with commas denoting the columns. We can use the
`write_csv()` function from the *readr* package to do this:
```{r save-clean-data, eval = FALSE}
stegen_clean_file <- here("data", "cleaned", "stegen_clean.csv")
write_csv(stegen, path = stegen_clean_file)
```
When you want to read in the file later, you can use the `read_csv()` function.
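As a sketch (assuming the csv file was saved as shown above), re-importing it would look like:

```{r read-clean-csv-sketch, eval = FALSE}
## re-import the clean data from the csv saved earlier
stegen_clean_file <- here("data", "cleaned", "stegen_clean.csv")
stegen <- read_csv(stegen_clean_file)
```

Remember that a csv does not store column types, so factors and dates would need to be re-defined after reading.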
### Saving as a binary file
When you save as a flat file, you will have to re-define which variables are
dates and which are factors when you re-read the clean data. One way to
avoid doing this is to save your data into a file called an **RDS** file, which
is a data file that only R can read. This will preserve your column types:
```{r save-clean-rds, eval = FALSE}
stegen_clean_rds <- here("data", "cleaned", "stegen_clean.rds")
saveRDS(stegen, file = stegen_clean_rds)
```
To read in the clean data for later use, we would use the `readRDS()` function.
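As a sketch (assuming the RDS file was created as above), reading it back would look like:

```{r read-clean-rds-sketch, eval = FALSE}
## re-import the clean data from the RDS file saved earlier
stegen_clean_rds <- here("data", "cleaned", "stegen_clean.rds")
stegen <- readRDS(stegen_clean_rds)
```

Unlike the csv route, the factors and dates come back exactly as we defined them.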
# Data exploration
## Summaries of age and sex distribution
It is good practice to explore and familiarise yourself with a dataset before
heading into the analysis. We can have a brief look at age and sex
distributions using some basic summaries; for instance:
```{r stegen_age_sex}
summary(stegen$age) # age stats
summary(stegen$sex) # gender distribution
tapply(stegen$age, INDEX = stegen$sex, FUN = summary) # age stats by gender
```
<summary> <b>Going further</b> click here to learn about <code>tapply()</code>: </summary>
`tapply()` is a very handy function to stratify any kind of analysis. You will
find more details in the documentation of the function (`?tapply`), but
briefly, the syntax is `tapply(input_data, INDEX = stratification,
FUN = function_to_use, further_arguments)`, where `FUN` can be any function
you choose. In the command used above:
```{r eval = FALSE}
tapply(stegen$age, INDEX = stegen$sex, FUN = summary)
```
this literally means: take the age variable in the dataset `stegen`
(`stegen$age`), stratify it by sex (`INDEX = stegen$sex`), and apply the
`summary()` function (`FUN = summary`) to summarise each stratum. So for
instance, to get the average age by sex (function `mean()`), one could use:
```{r stegen_age_per_sex}
tapply(stegen$age, INDEX = stegen$sex, FUN = mean, na.rm = TRUE)
```
Here we illustrate that further arguments to the function `mean()` can be passed
on; here, `na.rm = TRUE` means "ignore missing data".
## Graphical exploration
The summaries above may be useful for reporting purposes, but graphics are
usually better for getting a feel for the data. Here, we illustrate how the
package *ggplot2* can be used to derive informative graphics of age/sex
distribution. It implements an alternative graphics system to the basic **R**
plots, in which the plot is built by adding (literally using `+`) different
layers of information, using:
- `ggplot()` to specify the dataset to use
- `geom_...()` functions which define a type of graphics to use (e.g. barplot, histogram)
- `aes()` to map elements of the data into aesthetic properties (e.g. x/y axes, color, shape)
For instance, to get a histogram of age:

```{r stegen_histogram}
ggplot(stegen) + geom_histogram(aes(x = age))
```
<summary>What does "`stat_bin()` using `bins = 30`. Pick better value with `binwidth`" mean?</summary>
This is a message that came from `geom_histogram()`. If we look at the help
documentation for `geom_histogram()`, we can see this paragraph in the **Details**
> By default, the underlying computation (‘stat_bin()’) uses 30
> bins; this is not a good default, but the idea is to get you
> experimenting with different bin widths. You may need to look
> at a few to uncover the full story behind your data.
Each bar in a histogram is considered a "bin". The height of each bar is the
number of observations that fall within the bounds of that bar. Because we are
looking at age, it may be better for us to define a binwidth of 1 so that each
bar represents a single year.
```{r stegen_histogram-bin}
ggplot(stegen) + geom_histogram(aes(x = age), binwidth = 1)
```
We can differentiate between genders by filling the bars with color (n.b. we are
also setting the binwidth to be 1 year):
```{r stegen_histogram_sex}
ggplot(stegen) + geom_histogram(aes(x = age, fill = sex), binwidth = 1)
```
Here, the age distribution is pretty much identical between male and female.
<summary> <b> Going further:</b> click here to learn about customising <i>ggplot2</i> graphics:</summary>
*ggplot2* graphics are highly customisable, and a lot of help and examples can
be found online. The [official website]( is a good place to start.
For instance, here we:
- define the width of each bin to be 1 year (`binwidth = 1` to avoid the warning from above)
- add white borders to the plot (`color = "white"`)
- specify colors manually for male / female, using specified color codes (see
[html color picker]( to
define your own colors) (`scale_fill_manual(...)`)
- add labels for title, *x* and *y* axes (`labs()`)
- specify a lighter general color theme, using Times font of a larger size by default (`theme_light(...)`)
- move the legend inside the plot (`theme(...)`)
Note, it's best practice to place each *ggplot2* layer on its own line when
combining more than two.
```{r stegen_ggplot2_custom}
ggplot(stegen) +
geom_histogram(aes(x = age, fill = sex), binwidth = 1, color = "white") +
scale_fill_manual(values = c(male = "#4775d1", female = "#cc6699")) +
labs(title = "Age distribution by gender", x = "Age (years)", y = "Number of cases") +
theme_light(base_family = "Times", base_size = 16) +
theme(legend.position = c(0.8, 0.8))
```
## Epidemic curve
Epicurves (counts of incident cases) can be built using the package
*incidence*, which will compute the number of new cases given a vector of dates
(here, onset) and a time interval (1 day by default). We use the function
`incidence()` to achieve this, and then visualise the results:
```{r stegen_incidence}
i <- incidence(stegen$date_onset)
i
plot(i)
```
Details of the case counts can be obtained using:
```{r stegen_case_counts}
```
How long is this outbreak? It looks like most cases occurred over the course of
3 days, but cases kept showing up 10 days after the peak. Is this true? Not
really. Stratifying the epidemic curve by case definition will clarify the
picture:
```{r incidence_stratified}
i_ill <- incidence(stegen$date_onset, group = stegen$ill)
plot(i_ill, color = c("non case" = "#66cc99", "case" = "#993333"))
```
> n.b. Above, we defined the colors of the groups by specifying
> `color = c("non case" = "#66cc99", "case" = "#993333")` in our plot command.
> Here we are saying "The non case group should have the color \#66cc99 and the
> case group should have the color \#993333". These six digit codes are
> hexadecimal color codes widely used in web design. It's not expected that
> anyone should be able to memorize what hex codes represent what colors, so
> there are websites that can help you pick a pleasing color palette, and many
> of them show, for a given code such as \#66cc99, how the color looks and
> what its complements are.
The outbreak really only happened over 3 days: onsets reported after did not
match the epi case definition. This is compatible with a food-borne outbreak
with limited or no person-to-person transmission.
Note that the plots produced by *incidence* are *ggplot2* objects, so that the
options seen before can be used for further customisation (see below).
<summary> <b> Going further</b> click here to learn about the <i>incidence</i> package: </summary>
More information on the *incidence* package can be found from the
[dedicated website]( We saw above how incidence can be stratified, e.g. by case definition.
We can also customise this graphic like other *ggplot2* plots (see [this tutorial]( for more):
```{r incidence_stratified_custom, fig.height = 12}
plot(i_ill, border = "grey40", show_cases = TRUE, color = c("non case" = "#66cc99", "case" = "#993333")) +
labs(title = "Epicurve by case", x = "Date of onset", y = "Number of cases") +
theme_light(base_family = "Times", base_size = 16) + # changes the overall theme
theme(legend.position = c(x = 0.7, y = 0.8)) + # places the legend inside the plot
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + # sets the dates along the x axis at a 45 degree angle
coord_equal() # makes each case appear as a box
```
## Age and gender distribution of the cases
We use the same principles as before to visualise the distribution of illness by
age and gender. To split the plot into different panels according to `sex`, we
use `facet_wrap()` (see the previous extra info on *ggplot2* customisation for
further options):
```{r stegen_case_age_sex}
ggplot(stegen) +
geom_histogram(aes(x = age, fill = ill), binwidth = 1) +
scale_fill_manual("Illness", values = c("non case" = "#66cc99", "case" = "#993333")) +
facet_wrap(~sex, ncol = 1) + # stratify the sex into a single column of panels
labs(title = "Cases by age and gender")
```
From these graphics, we can see that illness does not appear to depend on age or
gender, which suggests that the illness was caused by one or more of the foods
consumed. We will explore this with a risk ratio analysis in the next section.
# Looking for the culprits
In order to figure out which food item was most strongly associated with the
illness, we need to test the [Risk Ratio](
for all of the food items recorded. Risk ratios are normally computed on
*contingency* tables (commonly known as 2x2 tables). In this part, we will
compute contingency tables for each exposure (food item) against the defined
cases, calculate their risk ratio with confidence intervals and p-values, and
plot them as points and errorbars using *ggplot2*.
## Reading in clean data
In the section on saving clean data, we saved a binary file with our stegen
data set. If you are starting here, you will want to read this file in with:
```{r read-clean-data, eval = FALSE}
stegen_clean_rds <- here("data", "cleaned", "stegen_clean.rds")
stegen <- readRDS(stegen_clean_rds)
```
```{r read-clean-data-show, echo = FALSE}
```
### Isolating the variables to test
Not all of the columns in our data set are food items (e.g. `age` and
`tportion`), so we only want to test the columns that are food items. Rather
than subsetting the dataset, we keep all of the data and define a character
vector with the names of the variables of interest: the binary food columns,
excluding `tportion` and `mportion`, which are not binary. We can later use
this vector inside the subsetting brackets (`[ ]`) to select these columns.
```{r stegen_food}
food <- c('tiramisu', 'wmousse', 'dmousse', 'mousse', 'beer', 'redjelly',
'fruit_salad', 'tomato', 'mince', 'salmon', 'horseradish',
'chickenwin', 'roastbeef', 'pork')
```
## Risk ratios
### For one variable
The [risk ratio]( is defined as the
ratio between the proportion of illness in one group (typically 'exposed') vs
another ('non-exposed'). For instance, let us consider the contingency table of
pork consumption and illness. One way to calculate a contingency table is using
the `epitable()` function from the *epitools* package:
```{r stegen_pork_table}
pork_table <- epitable(stegen$pork, stegen$ill)
```
We can get the risk ratio by using the `riskratio()` function from the *epitools*
package. Here, we want to use Yates' continuity correction and calculate Wald
confidence intervals (more information about this can be found on the help page
for `riskratio()` by typing `?riskratio` in your R console).
```{r stegen_pork_rr}
pork_rr <- riskratio(pork_table, correct = TRUE, method = "wald")
```
The documentation of `riskratio()` shows that it returns quite a few things as
a list. However, we only want three things from this list:
1. risk ratio estimate
2. 95% confidence interval
3. P-value
<summary> <b>Going further:</b> What is a list? </summary>
`pork_rr` is an R object called a "list":
```{r stegen_pork-list1}
pork_rr
```
A list is an R vector that can hold other vectors. While this may look strange
and unfamiliar, we've actually seen lists before. If you look closely, you can
see that each element of the list starts with a `$`. This is the exact same syntax
as data frames. If we inspect our `stegen` data frame, we can see that it is
actually a list:
```{r stegen_is_list}
is.list(stegen)
```
Each column of a data frame is just a different element of a list. In fact, we
can use the `$` operator to access different elements of the list just as we
would the columns of a data frame:
```{r stegen_is_list2}
stegen_list <- as.list(stegen)
```
All of these are present in the `pork_rr` object, but this output does not lend
itself well to quick visual inspection. What we can do now is to *extract*
elements from this output and place it in a data frame.
The first thing we want is the risk ratio and the confidence intervals for the
consumption of pork. These are presented in the `$measure` element of the list:
```{r extract-rr}
pork_rr$measure
```
We can see that this is present as a "matrix". A matrix has rows and columns and
we can access them like we do a vector with square brackets (`[ ]`), but we
separate the rows and columns with a comma like so: `[rows, columns]`. For
example, if we wanted the risk ratio for pork consumed, we would take the second
row in the first column using the row and column numbers:
```{r pork-rr-mat}
pork_rr$measure[2, 1]
```
In our case, we want to take the entire second row, so we can leave the `columns`
portion blank:
```{r pork-rr-mat2}
pork_est_ci <- pork_rr$measure[2, ]
```
We also want to get the P-value, of which we have three choices. In our case,
we will choose the P-value from Fisher's exact test of the exposure (by selecting
the second row and the column called `"fisher.exact"`) and combine the results
into a data frame:
```{r pork-rr-p-value}
pork_p <- pork_rr$p.value[2, "fisher.exact"]
res <- data.frame(estimate = pork_est_ci["estimate"],
                  lower = pork_est_ci["lower"],
                  upper = pork_est_ci["upper"],
                  p.value = pork_p)
res
```
Now we have a table that tells us that pork has a risk ratio of
`r round(res$estimate, 3)` CI: (`r round(res$lower, 3)`,
`r round (res$upper, 3)`), P = `r round(res$p.value, 3)`.
<summary> Going Further: Different types of Univariate tests </summary>
## Univariate tests
Methods for testing the association between two variables can be broken down
into three types, depending on the types of these variables:
1. **2 categorical variables**: Chi-squared test on the 2x2 table
(a.k.a. *contingency* table) and similar methods (e.g. Fisher's exact test)
2. **1 quantitative, 1 categorical**: ANOVA types of approaches; particular case with
2 groups: Student's $t$-test
3. **2 quantitative variables**: Pearson's correlation coefficient ($r$) and similar
We can use these approaches to test if the disease is associated with any other
variables of interest. As illness itself is a categorical variable, only
approaches of type 1 and 2 are illustrated in this case study.
### Is illness associated with age?
We can use the function `t.test()` to test if the average age is different
across illness status. As this test assumes that the two categories exhibit
similar variation, we first ensure that the variances are comparable using
Bartlett's test. The syntax `variable ~ group` is used to indicate the variable
of interest (left hand-side), and the group (right hand-side):
```{r stegen_bartlett_test}
bartlett.test(stegen$age ~ stegen$ill)
```
The resulting p-value (`r round(bartlett.test(stegen$age ~ stegen$ill)[["p.value"]], 3)`) suggests the variances are indeed comparable. We can thus proceed to a $t$-test:
```{r stegen_t_test}
t.test(stegen$age ~ stegen$ill, var.equal = TRUE)
```
The p-value of `r round(t.test(stegen$age ~ stegen$ill, var.equal = TRUE)[["p.value"]], 3)` confirms the previous graphics: the illness does not appear to be associated with age.
### Is illness associated with gender?
To test the association between gender and illness (2 categorical variables), we
first build a 2-by-2 (contingency) table, using:
```{r stegen_2by2}
tab_sex_ill <- table(stegen$sex, stegen$ill)
```
Note that proportions can be obtained using `prop.table`:
```{r stegen_prop_table}
## basic proportions
prop.table(tab_sex_ill)
## expressed in %, rounded:
round(100 * prop.table(tab_sex_ill))
```
Once a contingency table has been built, the Chi-square test can be run using `chisq.test`:
```{r stegen_chisq}
chisq.test(tab_sex_ill)
```
Here, the p-value of `r round(chisq.test(tab_sex_ill)[["p.value"]], 3)` suggests illness is not associated with gender either.
### Consolidating several steps into one (writing functions)
To recap, to get a risk ratio we've done the following:
1. created a contingency table
2. estimated the risk ratio
3. extracted estimate into a data frame
However, we have `r length(food)` exposures to sort through and we don't want to
have to repeat these procedures manually over and over. One strategy to help
with this is to create our own *function*, which we can think of as a miniature
recipe for a computer to read. If we take the steps above and place them into
individual steps we would get:
```{r make-fun-1}
et <- epitable(stegen$pork, stegen$ill)
rr <- riskratio(et)
estimate <- rr$measure[2, ]
res <- data.frame(estimate = estimate["estimate"],
                  lower = estimate["lower"],
                  upper = estimate["upper"],
                  p.value = rr$p.value[2, "fisher.exact"])
res
```
Just like a recipe, a function needs both ingredients and instructions. What we
have above are instructions, but we still need to say what our ingredients are.
If we look at the above code, we can see that we need to define our exposure
variable (`stegen$pork`) and outcome (`stegen$ill`); everything else flows from
that. We can then write our function (which we will call `single_risk_ratio()`)
like this:
```{r make-fun-2}
single_risk_ratio <- function(exposure, outcome) { # ingredients defined here
  et <- epitable(exposure, outcome) # ingredients used here
  rr <- riskratio(et)
  estimate <- rr$measure[2, ]
  res <- data.frame(estimate = estimate["estimate"],
                    lower = estimate["lower"],
                    upper = estimate["upper"],
                    p.value = rr$p.value[2, "fisher.exact"])
  return(res) # return the data frame
}
```
Now, we can use this `single_risk_ratio()` like any other function:
```{r risk-ratio-df-fun-run}
pork_rr <- single_risk_ratio(exposure = stegen$pork, outcome = stegen$ill)
pork_rr
```
If we change the variable, we'll get a different answer:
```{r risk-ratio-df-fun-run-fruit}
fruit_rr <- single_risk_ratio(exposure = stegen$fruit_salad, outcome = stegen$ill)
fruit_rr
```
We can combine the two using a function from *dplyr* called `bind_rows()`:
```{r pork-fruit}
bind_rows(pork = pork_rr, fruit = fruit_rr, .id = "exposure")
```
Here we can see that the fruit salad has a higher risk ratio with a lower
P-value, but we want to compare all of the variables and still, copying and
pasting all this code would be a nightmare and potentially lead to errors. We
will see in the next section how we can use R to apply the function
`single_risk_ratio()` to each column of `stegen` that we have listed in `food`
to give us the risk ratios for all the food items.
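As a preview of that approach, here is a minimal sketch of how a function can be applied to every element of a list with `lapply()`; the toy list below is invented for illustration:

```{r lapply-sketch, eval = FALSE}
## a toy list holding two numeric vectors
toy <- list(a = 1:3, b = 4:6)

## lapply() calls the function once per element and returns a list
lapply(toy, FUN = sum) # returns a list with a = 6 and b = 15

## extra arguments are passed on to the function, here to mean()
lapply(toy, FUN = mean, trim = 0) # returns a list with a = 2 and b = 5
```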
### Multiple variables
In the example above, we have computed the risk ratio using a simple recipe:
1. create a contingency table
2. estimate the risk ratio
3. extract estimate into a data frame
We placed this recipe in a function called `single_risk_ratio()` and we know we
can use it to calculate the risk ratio for individual variables, but we need to
be able to calculate it over all the variables from `stegen` that we have
listed in our `food` vector by subsetting with the square brackets (`[ ]`). The
solution is to use an R function called `lapply()`, which takes each element of
a list (or column of a data frame) and uses it as the first ingredient (also
known as argument) of a function. Like `tapply()`, the function can be anything
and additional arguments are placed behind the function. It will then return
the output as a list:
```{r stegen_all_rr}
all_rr <- lapply(stegen[food], FUN = single_risk_ratio, outcome = stegen$ill)
```
Finally, like we did above, we re-shape these results into a data frame:
```{r all-food-df}
all_food_df <- bind_rows(all_rr, .id = "exposure")
all_food_df
```
This is still in the order of our data, but we want it sorted in order of the
risk ratio. For this, we will use the *dplyr* functions `arrange()` and `desc()`
to arrange our data by the risk ratio estimate in descending order:
```{r arrange-desc}
all_food_df <- arrange(all_food_df, desc(estimate))
```
Now we can use this data frame to plot the results using *ggplot2*:
```{r plot-arrange}
# first, make sure the exposures are factored in the right order
all_food_df$exposure <- factor(all_food_df$exposure, unique(all_food_df$exposure))
# plot
p <- ggplot(all_food_df, aes(x = estimate, y = exposure, color = p.value)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper)) +
  geom_vline(xintercept = 1, linetype = 2) +
  scale_x_log10() +
  scale_color_viridis_c(trans = "log10") +
  labs(x = "Risk Ratio (log scale)",
       y = "Exposure",
       title = "Risk Ratio for gastroenteritis in Stegen, Germany")
p
```
The results are a lot clearer now: the tiramisu has by far the highest risk
ratio in this outbreak, but there are a couple of things to note:
1. Its wide confidence interval suggests that the sample size is a bit small.
2. The other variables that have risk ratios significantly different from one
   have overlapping confidence intervals, which may suggest that confounding
   factors were involved (i.e. the food items were not independent or, in this
   case, were potentially sharing contaminated ingredients).
<summary> <b>Going Further:</b> Creating a function for multiple risk ratios </summary>
Just like we created the function `single_risk_ratio()` to calculate the risk
ratio of a single variable, we can create another function that will calculate
the risk ratio for all variables in a data frame. We can do it the same way we
did above. First, define the recipe:
```{r multi-risk-recipe}
all_rr <- lapply(stegen[food], FUN = single_risk_ratio, outcome = stegen$ill)
all_food_df <- bind_rows(all_rr, .id = "exposure")
all_food_df <- arrange(all_food_df, desc(estimate))
all_food_df$exposure <- factor(all_food_df$exposure, unique(all_food_df$exposure))
```
Now, find the ingredients. The first step is `lapply()`, which needs
`stegen[food]` and `stegen$ill`; these are a data frame of exposures and a
vector of outcomes, respectively, so we will add these as *arguments* called
`exposures` and `outcome`:
```{r multi-risk-df}
multi_risk_ratio <- function(exposures, outcome) {
  all_rr <- lapply(exposures, FUN = single_risk_ratio, outcome = outcome)
  all_food_df <- bind_rows(all_rr, .id = "exposure")
  all_food_df <- arrange(all_food_df, desc(estimate))
  all_food_df$exposure <- factor(all_food_df$exposure, unique(all_food_df$exposure))
  return(all_food_df) # return the sorted data frame
}
```
And we can call the function like so:
```{r multi-risk-run}
multi_risk_ratio(exposures = stegen[food], outcome = stegen$ill)
```
Note that we have defined arguments for `exposures` and `outcome`, but we
didn't define an argument for the `single_risk_ratio()` function. This is
because we know that we've defined it above, but this also means that if we
want to use the `multi_risk_ratio()` function in other scripts, we have to also
define `single_risk_ratio()` as well. One way of keeping these organised is to
always write the functions together:
```{r funwrite}
single_risk_ratio <- function(exposure, outcome) { # ingredients defined here
  et <- epitools::epitable(exposure, outcome) # ingredients used here
  rr <- epitools::riskratio(et)
  estimate <- rr$measure[2, ]
  res <- data.frame(estimate = estimate["estimate"],
                    lower = estimate["lower"],
                    upper = estimate["upper"],
                    p.value = rr$p.value[2, "fisher.exact"])
  return(res) # return the data frame
}

multi_risk_ratio <- function(exposures, outcome) {
  all_rr <- lapply(exposures, FUN = single_risk_ratio, outcome = outcome)
  all_food_df <- dplyr::bind_rows(all_rr, .id = "exposure")
  all_food_df <- dplyr::arrange(all_food_df, dplyr::desc(estimate))
  all_food_df$exposure <- factor(all_food_df$exposure, unique(all_food_df$exposure))
  return(all_food_df) # return the sorted data frame
}
```
You might notice, however, that these functions look a bit different. The
`epitable()` function is now written as `epitools::epitable()`. This is no
accident. This is a way for us to tell R to use a function *even if the package
hasn't been loaded*, which makes it easier to share these functions.
# Plotting a very basic spatial overview of cases
To complete your report, you would like to include a very basic spatial
overview of cases. Based on the postal codes of all individuals with a date of
symptom onset, spatial coordinates were generated, giving each person a
latitude and longitude corresponding to their household (derived from their
postal code). These coordinates are stored in the variables `latitude` and
`longitude`, which we will use to plot the cases with *ggplot2*.
Here, we want to plot the data to see if there is any spatial component to the
outbreak. Since we have the coordinates for each household, it becomes
straightforward to plot them with *ggplot2*. Here we will use points and color
them based on whether or not the person was ill:
```{r stegen-ill-plot-1, fig.width = 10}
ggplot(stegen) +
  geom_point(aes(x = longitude, y = latitude, color = ill)) +
  scale_color_manual("Illness", values = c("non case" = "#66cc99", "case" = "#993333")) +
  coord_map()
```
Here, we can see that there is no clear correlation between illness and location.
## Using shapefile data
It's not uncommon to have community-level shapefile data. These data give us a
better idea of how the cases relate to each other. In this instance, we have
shapefiles of the households located in `data/stegen-map/stegen_households.shp`.
We will use the `read_sf()` function from the *sf* package to read the
shapefiles in:
```{r read-shapefiles_fake, eval = FALSE}
stegen_shp <- read_sf(here("data", "stegen-map", "stegen_households.shp"))
```
```{r read-shapefiles, echo = FALSE}
stegen_shp <- read_sf(here("static/data/stegen-map/stegen_households.shp"))
```
Now we can use the same ggplot2 code as above, but we place a call to `geom_sf()`
right after the ggplot call to place the shapes underneath the points. Note that
we are removing the `coord_map()` because the shapefile ensures that the
projection is correct.
```{r sf_map, fig.width = 10}
ggplot(stegen) +
  geom_sf(data = stegen_shp) +
  geom_point(aes(x = longitude, y = latitude, color = ill)) +
  scale_color_manual("Illness", values = c("non case" = "#66cc99", "case" = "#993333"))
```
## Interactive maps
It is also possible to create interactive maps that allow you to zoom in and get
information about individual cases. This uses a different system than *ggplot2*,
but follows a similar idea of building up the map layer by layer. Here, we
use the package *leaflet*, which uses image data from open street maps. Because
mapping could take up an entire workshop in and of itself, we will present the
code as is. One thing that we must do first is to subset the data to all cases
with non-missing coordinates with the command `!`, where `!` means
"not" and `` checks each element of a vector for missing values.
```{r subset_coords}
stegen_sub <- stegen[!$longitude), ]
```
```{r interactive_map, eval = FALSE}
# create the map
lmap <- leaflet()
# add open street map tiles
lmap <- addTiles(lmap)
# set the coordinates for Stegen
lmap <- setView(lmap, lng = 7.963, lat = 47.982, zoom = 15)
# Add the shapefile
lmap <- addPolygons(lmap, data = st_transform(stegen_shp, '+proj=longlat +ellps=GRS80'))
# Add the cases
lmap <- addCircleMarkers(lmap,
                         label = ~ill,
                         color = ~ifelse(ill == "case", "#993333", "#66cc99"),
                         stroke = FALSE,
                         fillOpacity = 0.8,
                         data = stegen_sub)
# show the map
lmap
```
<iframe src="widgets/stegen-casemap.html" width="100%" height="500"></iframe>
# Conclusion
This case study illustrated how **R** can be used to import data, clean data,
and derive basic summaries for a first glance at the data. It also showed how to
generate graphics using *ggplot2*, and how to detect associations between
several potential risk factors and the occurrence of illness. One major caveat
here is that we are not accounting for potential confounding factors. These will
be treated in a separate case study, which will focus on the use of logistic
regression in epidemic studies. Lastly we plotted a basic overview of the cases
in this outbreak in relative distance to each other. More sophisticated mapping
and spatial methodologies will be covered in other case studies.
# About this document
## Source
This case study was first designed by Alain Moren and Gilles Desve, EPIET. It is
based on a real outbreak investigation conducted by Anja Hauri, RKI, Berlin, 1998.
## Contributors
- original version: Alain Moren, Gilles Desve
- reviewers, previous versions: Marta Valenciano, Alain Moren
- adaptation for EPIET module: Alicia Barrasa, Ioannis Karagiannis
- rewriting for R: Alexander Spina, Patrick Keating, Janetta Skarp, Zhian N. Kamvar, Thibaut Jombart, Amrish Baidjoe
Contributions are welcome via [pull
requests]( The [source file for this case study]( can be found in the same repository.
## Legal stuff
**License**: [CC-BY](
**Copyright**: 2018