# Missing Data Treatment: A Hands-on Illustration using `R` Package `mice`

<img src="/home/tony/uio/pc/Dokumenter/PhD/Teaching/UseR/Missing_Data/Figures/cat-mouse.jpg" style="width:100%">

## 0. Overview

### 0.1 Learning Intention
This learning session aims to empower the `R` user community with a gentle introduction to the multiple imputation procedure in order to address missing data problems using one widely used `R` package: multivariate imputation by chained equations (`mice`, van Buuren & Groothuis-Oudshoorn, 2011).

### 0.2 Success Criteria
By the end of this session, participants will have gained *interest* and *confidence* in dealing with missing data in their work.

Learners will, firstly, be able to assess the missing data mechanism in their datasets and apply appropriate treatment procedures accordingly. They will then appreciate the workflow of `mice` and see how Rubin's rules are applied in the pooling process. Finally, learners will be able to interpret `mice` output in order to answer their research questions.

Furthermore, participants are encouraged to continue their learning by reaching out to the advanced MI literature for techniques suited to their specific modelling needs.

### 0.3 Learning Structure

## 1. Background

### 1.1 Rationale for Multiple Imputation
Complete-case analyses are valid and unbiased only under very restrictive conditions (MCAR, defined below). Even when this condition holds, removing cases can cause a substantial loss in estimation efficiency. In addition, an entire observation, together with the resources and effort spent collecting it, is discarded because of a single missing value. Multiple imputation tries to salvage such imperfect observations by filling the "holes" with "guesses". The uncertainty of our guesses is reflected in the variation of the imputed values: the wider the variation, the less certain we are about our guesses.

### 1.2 Two Approaches to Missing Data Treatment
**Joint modelling** (JM, Schafer, 1997; `R` package `jomo`) and **fully conditional specification** (FCS) are the two main approaches to missing data treatment. FCS is also known as multivariate imputation by chained equations (MICE). This session will focus exclusively on the `R` package `mice` by van Buuren and Groothuis-Oudshoorn (2011), currently in Version 3.14.0, to demonstrate the power and flexibility of the MICE procedure for handling missing data. Other `R` packages that work with missing data are `Amelia`, `Hmisc`, `jomo`, `mi`, `norm`, `norm2` and `pan`. See Table 5.1 of Kleinke et al. (2020) for a comparison of these packages. Table 6 of Grund et al. (2018) provides specific advice on package use for multilevel models.

## 2. Data Missing Mechanism (Rubin, 1976)

<img src="/home/tony/uio/pc/Dokumenter/PhD/Teaching/UseR/Missing_Data/Figures/mechanism.png" style="width:100%">

### 2.1 Missing Completely at Random (**MCAR**)

A variable's missing propensity is *independent* of all variables in the dataset.

This is the "least evil" case of missingness. Under MCAR, complete-case analyses are still valid and estimates unbiased, although efficiency is reduced because the dataset is smaller.

### 2.2 Missing at Random (**MAR**)

A variable's missing propensity depends *exclusively* on the *observed* variables.

The MAR assumption is behind most MI procedures, including `mice`.

### 2.3 Missing not at Random (**MNAR**)

A variable's missing propensity depends on *unobserved* information, such as the missing values themselves or variables that were not recorded.

MNAR represents the "most evil" end of the spectrum. The exact theory and treatment of MNAR is complicated. Interested readers are referred to Rose's (2013) thesis for richer references. In practice, we can introduce lots of covariates to the MI process in order to push MNAR more towards the MAR region of the spectrum.
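
To make the three mechanisms concrete, here is a minimal simulation sketch in base `R`. The variables `x` and `y`, the sample size and the logistic coefficients are illustrative assumptions and do not come from the session data. The sketch compares the complete-case mean of `y` under each mechanism with the mean of the full data.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)                      # fully observed covariate (hypothetical)
y <- 0.6 * x + rnorm(n, sd = 0.8)  # variable that will receive missing values

# MCAR: every value of y has the same probability of being missing
miss_mcar <- rbinom(n, 1, 0.3) == 1
# MAR: the missing propensity of y depends only on the observed x
miss_mar <- rbinom(n, 1, plogis(-1 + 1.5 * x)) == 1
# MNAR: the missing propensity of y depends on y itself
miss_mnar <- rbinom(n, 1, plogis(-1 + 1.5 * y)) == 1

# Complete-case means of y under each mechanism
cc_means <- c(
  full = mean(y),
  MCAR = mean(y[!miss_mcar]),
  MAR  = mean(y[!miss_mar]),
  MNAR = mean(y[!miss_mnar])
)
round(cc_means, 2)

# Under MAR, the complete-case regression of y on x still targets the true slope (0.6)
slope_mar <- coef(lm(y ~ x, subset = !miss_mar))[["x"]]
slope_mar
```

Under these assumptions, the complete-case mean should stay close to the full-data mean only under MCAR. Under MAR, the relationship between `y` and the observed `x` is preserved among the complete cases, which is the information an imputation model exploits. Under MNAR, the observed data alone cannot reveal the distortion.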

### 2.4 Ignorability
The missing data literature commonly refers to MCAR and MAR as "ignorable" missingness and to MNAR as "non-ignorable" missingness. This vocabulary does *not* suggest that no treatment is required. Rather, it indicates whether the process that caused the missing data has to be modelled explicitly in subsequent analyses.

## 3. The `mice` Package

### 3.1 `mice` Workflow

<img src="/home/tony/uio/pc/Dokumenter/PhD/Teaching/UseR/Missing_Data/Figures/workflow.jpg" style="width:100%">

### 3.2 First Look at `mice`

In [None]:
# Install some packages
# install.packages(c("mice", "VIM"), dependencies = TRUE)

# Set working directory
setwd("~/uio/pc/Dokumenter/PhD/Teaching/UseR/Missing_Data/")

# Load the mice package (suppress both warnings and messages)
suppressWarnings(suppressMessages(library(mice)))

# Use the example dataset nhanes (came with the package)
nhanes

# observations = 25, variables = 4
# Variable names
#    age     age group               ordered categorical
#    bmi     body mass index         numerical
#    hyp     hypertension status     binary
#    chl     cholesterol level       numerical

### 3.3 Missing Pattern Inspection

There are two ways to inspect the missing data pattern: a "macro" view of the whole dataset and a "micro" view of variable pairs:

In [None]:
# Inspect missing pattern (Method 1)
md.pattern(nhanes)

# Colour convention
#   blue    observed
#   red     missing
# Interpretation
#   13 rows are complete
#   3 rows where only chl is missing
#   1 row where only bmi is missing
#   1 row where both hyp and bmi are missing
#   7 rows where only age is observed
#   Total number of missing values = 3x1 + 1x1 + 1x2 + 7x3 = 27
#   Most missing values (10) occur in chl
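
The counts read off the pattern plot can be cross-checked with a few base `R` calls. This is a small sketch that assumes the `mice` package is installed, so that the `nhanes` data are available. The object names `n_missing`, `n_complete` and `n_total_missing` are chosen here for illustration.

```r
library(mice)  # provides the nhanes example data

# Number of missing values per variable
n_missing <- colSums(is.na(nhanes))
n_missing

# Number of complete rows and total number of missing cells
n_complete <- sum(complete.cases(nhanes))
n_total_missing <- sum(is.na(nhanes))
c(complete_rows = n_complete, missing_cells = n_total_missing)
```

These numbers should agree with the margins of the `md.pattern` output: 13 complete rows and 27 missing cells, 10 of them in `chl`.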

In [None]:
# Inspection by variable pairs (Method 2)
md.pairs(nhanes)

# Symbol convention (left, top)
#   r   observed (response)
#   m   missing

# Interpretation (focus on (bmi, chl) pair)
#   13 completely observed pairs
#   3 pairs: bmi is observed but chl is missing
#   2 pairs: bmi is missing but chl is observed
#   7 pairs: both bmi and chl are missing
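
The same four counts for the (bmi, chl) pair can be reproduced with a plain cross-tabulation of the missingness indicators. The sketch below assumes `mice` is installed. It also relies on `md.pairs` returning a list with the components `rr`, `rm`, `mr` and `mm`, where the first letter refers to the row variable and the second to the column variable. The name `miss_tab` is illustrative.

```r
library(mice)

# Cross-tabulate missingness of bmi (rows) against missingness of chl (columns)
miss_tab <- table(
  bmi_missing = is.na(nhanes$bmi),
  chl_missing = is.na(nhanes$chl)
)
miss_tab

# Extract the same cells from md.pairs() for comparison
pairs_out <- md.pairs(nhanes)
c(
  rr = pairs_out$rr["bmi", "chl"],  # both observed
  rm = pairs_out$rm["bmi", "chl"],  # bmi observed, chl missing
  mr = pairs_out$mr["bmi", "chl"],  # bmi missing, chl observed
  mm = pairs_out$mm["bmi", "chl"]   # both missing
)
```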

### 3.4 Margin Plot

A helpful way to visualise the missing data pattern is a margin plot, generated with the `VIM` package:

In [None]:
# Margin plot
par(mar = c(7, 7, 3, 3)) # In order to show the axes labels
# Inspect data range
range(nhanes$chl, na.rm = TRUE) # (113, 284)
range(nhanes$bmi, na.rm = TRUE) # (20.4, 35.3)
# Generate margin plot
VIM::marginplot(nhanes[, c("chl", "bmi")],
    xlim = c(110, 290), ylim=c(20, 36),
    col = mdc(1:2), pch = 19,
    cex = 1.2, cex.lab = 1.2, cex.numbers = 1.3
)

# Interpretation
#   red 9   variable bmi contains 9 missing values
#   red 10  variable chl contains 10 missing values
#   red 7   7 rows are missing both bmi and chl
#   three red dots on the left: bmi values are known but chl is missing
#   two red dots on the bottom: chl values are known but bmi is missing
#   the red dot where the two margins cross represents the 7 rows missing both
#   Total # dots = 13 (blue) + 3 (red left) + 2 (red bottom) + 7 (red cross) = 25
#   Box plots: marginal distributions (blue = other variable observed, red = other variable missing)
#   If MCAR => red and blue box plots are expected to look similar

### 3.5 Impute Missing Data

We can now generate our first imputation using `mice`, accepting all default settings:

In [None]:
imp <- mice(nhanes, printFlag = FALSE, seed = 23109)
# The multiply imputed dataset, imp, is of class mids (MI data set)
print(imp)
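
Since all defaults were accepted, it can help to look at what those defaults were. The sketch below assumes `mice` is installed and re-creates `imp` so that it runs on its own. It reads the components `m`, `iteration`, `method` and `predictorMatrix` from the `mids` object.

```r
library(mice)
imp <- mice(nhanes, printFlag = FALSE, seed = 23109)

# Number of imputed datasets and number of iterations that were run
imp$m
imp$iteration

# Imputation method per variable ("" means the variable needed no imputation)
imp$method

# Rows = variables being imputed, columns = predictors used (1 = used, 0 = not used)
imp$predictorMatrix
```

For the all-numeric `nhanes` data, the incomplete variables are imputed by predictive mean matching (`"pmm"`), and each one uses all other variables as predictors.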

### 3.6 Diagnostic Checking

It is important to check that the imputed data indeed make sense:

In [None]:
# Recall that bmi contains 9 missing values
# The MI procedure produced five guesses for each missing value:
imp$imp$bmi

In [None]:
# The 1st complete dataset combines the observed and imputed values:
complete(imp)

In [None]:
# We can print out the 2nd set of the complete dataset
complete(imp, 2)
# Values observed in the original data are identical in all five sets
# Values that were originally missing differ between sets
# Degree of difference reflects degree of uncertainty
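
The statements above can be checked programmatically. The following sketch assumes `mice` is installed and re-creates `imp` so that it is self-contained. The names `long`, `observed` and `same_observed` are illustrative. It stacks the completed datasets in long format, confirms that the observed `bmi` values never change, and summarises the spread of the imputed values.

```r
library(mice)
imp <- mice(nhanes, printFlag = FALSE, seed = 23109)

# Stack all completed datasets; .imp is the imputation number, .id the row id
long <- complete(imp, action = "long")
dim(long)

# Observed bmi values should be the same in every completed dataset
observed <- !is.na(nhanes$bmi)
same_observed <- all(sapply(seq_len(imp$m), function(i) {
  isTRUE(all.equal(complete(imp, i)$bmi[observed], nhanes$bmi[observed]))
}))
same_observed

# Standard deviation of the imputed bmi values for each originally missing row
apply(imp$imp$bmi, 1, sd)
```

A larger standard deviation for a row indicates more uncertainty about that particular missing value.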

### 3.7 MI Visual Inspection

Similar to Section 3.3, the reasonableness of the MI procedure can be inspected visually using two plots:

In [None]:
# Visual inspection (big picture)
stripplot(imp, pch = 20, cex = 1.2)
# Colour convention
#   blue    observed
#   red     imputed
# Each x-axis marker is one version of MI. 0 = original set
# Red points follow the blue points reasonably well, including the gaps in the distribution.

In [None]:
# Visual inspection (fine details)
xyplot(imp, bmi ~ chl | .imp, pch = 20, cex = 1.4)
# Red points have more or less the same shape as blue data => imputed data could have been plausible measurements if they had not been missing
# Differences between the red points represent uncertainty about the true, but unknown, values

### 3.8 Analysing Imputed Datasets

With all the "holes" now filled, we can apply our familiar analyses to the "restored" datasets. Fortunately, `mice` automates the analyses for us (the `with` function) and pools the results using Rubin's rules (the `pool` function):

In [None]:
# Original regression: lm(chl ~ age + bmi)
# Repeat this analysis on each imputed dataset
fit <- with(data = imp, expr = lm(chl ~ age + bmi))
# Pool the multiple versions of the analyses together
summary(pool(fit))
# Both age and bmi are significant at .05 level
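
To see what `pool` does, Rubin's rules can be applied by hand to the five fitted models. This is a sketch under the assumption that `mice` is installed. The names `q_bar`, `u_bar`, `b` and `t_var` are chosen here for illustration. The pooled estimate is the average of the per-imputation estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by `1 + 1/m`.

```r
library(mice)
imp <- mice(nhanes, printFlag = FALSE, seed = 23109)
fit <- with(data = imp, expr = lm(chl ~ age + bmi))

# One column per imputed dataset, one row per coefficient
est <- sapply(fit$analyses, coef)
var_within <- sapply(fit$analyses, function(f) diag(vcov(f)))
m <- length(fit$analyses)

q_bar <- rowMeans(est)             # pooled point estimates
u_bar <- rowMeans(var_within)      # average within-imputation variance
b     <- apply(est, 1, var)        # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b   # total variance

manual <- data.frame(estimate = q_bar, std.error = sqrt(t_var))
manual

# Compare with the output of pool()
pooled <- summary(pool(fit))
pooled[, c("term", "estimate", "std.error")]
```

The hand-computed estimates and standard errors should match the `estimate` and `std.error` columns reported by `summary(pool(fit))`. The degrees of freedom and p-values involve further adjustments that this sketch does not reproduce.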

In [None]:
# If we increase m, the number of imputations, significance levels may change
summary(pool(with(
    mice(nhanes, m = 10, printFlag = FALSE, seed = 23109), # request 10 imputed datasets
    lm(chl ~ age + bmi)
))) # More significant

## 4. Fine Tuning `mice`

## 9. References
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). `mice`: Multivariate imputation by chained equations in `R`. Journal of Statistical
Software, 45(3), 1–67. <https://doi.org/10.18637/jss.v045.i03>

Grund, S., Lüdtke, O., & Robitzsch, A. (2018). Multiple imputation of missing data for multilevel models: Simulations and
recommendations. Organizational Research Methods, 21(1), 111–149. <https://doi.org/10.1177/1094428117703686>

Kleinke, K., Reinecke, J., Salfrán, D., & Spiess, M. (2020). Applied multiple imputation: Advantages, pitfalls, new developments and
applications in `R`. Springer. <https://doi.org/10.1007/978-3-030-38164-6>

Rose, N. (2013). Item nonresponses in educational and psychological measurement [PhD Thesis, Friedrich-Schiller-Universität
Jena]. Open Access Thesis and Dissertations.
<https://www.db-thueringen.de/servlets/MCRFileNodeServlet/dbt_derivate_00027809/Diss/NormanRose.pdf>

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. <https://doi.org/10.1093/biomet/63.3.581>

Schafer, J. L. (1997). Analysis of incomplete multivariate data. Chapman & Hall/CRC.