<a href="https://colab.research.google.com/github/rhodes-byu/cs-stat-180/blob/main/notebooks/missingness-types.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><p><b>After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or it will not save your work</b></p>

# 📊 Tutorial: Understanding Data Missingness

This notebook introduces the three classic missingness mechanisms (**MCAR**, **MAR**, and **MNAR**) with small synthetic examples and plots. Run the cells top to bottom.

## 0) Setup & Synthetic Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
data = pd.DataFrame({
    'PatientID': range(1, 11),
    'Age': np.random.randint(20, 70, 10),
    'Weight': np.random.randint(50, 100, 10),
    'Cholesterol': np.random.randint(150, 300, 10)
})

data

---
## 1) MCAR — Missing Completely at Random
**Definition:** Missingness is independent of all data values, both observed and unobserved.

**Example:** Random logging glitches cause some *Cholesterol* values to be missing.

In [None]:
mcar = data.copy()
mcar.loc[np.random.choice(mcar.index, 3, replace=False), 'Cholesterol'] = np.nan
mcar
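
The cell above removes exactly three values. A closely related way to generate MCAR is to flip an independent coin for every row with one fixed probability. The sketch below does this on a separate, larger synthetic table so the pattern is easier to see. The names `demo`, `p_missing`, and `is_missing`, the sample size, and the 30% rate are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical larger table (illustrative only, not the notebook's `data`)
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    'Age': rng.integers(20, 70, n),
    'Cholesterol': rng.integers(150, 300, n).astype(float),
})

# MCAR: every row has the same chance of losing its value,
# regardless of Age or Cholesterol
p_missing = 0.3  # assumed rate
is_missing = rng.random(n) < p_missing

demo_mcar = demo.copy()
demo_mcar.loc[is_missing, 'Cholesterol'] = np.nan

# Under MCAR, rows with and without a value should look alike on Age
missing_rate = demo_mcar['Cholesterol'].isna().mean()
age_missing = demo_mcar.loc[is_missing, 'Age'].mean()
age_observed = demo_mcar.loc[~is_missing, 'Age'].mean()

print(f"missing rate: {missing_rate:.2f}")
print(f"mean Age (missing rows):  {age_missing:.1f}")
print(f"mean Age (observed rows): {age_observed:.1f}")
```

If the mechanism really is MCAR, the two mean ages should differ only by sampling noise.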

---
## 2) MAR — Missing at Random
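**Definition:** Given the observed data, missingness does not depend on the missing value itself; it may depend on other, observed variables.

**Example:** Younger patients (Age < 35) are less likely to get *Cholesterol* measured. To keep the example small, the cell below uses the extreme case: every patient under 35 has a missing value.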

In [None]:
mar = data.copy()
mar.loc[mar['Age'] < 35, 'Cholesterol'] = np.nan
mar
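
The example text says younger patients are "less likely" to be measured, whereas the cell above removes the value for every patient under 35. A probabilistic version might look like the sketch below, which runs on a separate, larger synthetic table. The probabilities 0.7 and 0.1, the sample size, and the names `demo`, `p_missing`, and `is_missing` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical larger table (illustrative only, not the notebook's `data`)
rng = np.random.default_rng(1)
n = 2000
demo = pd.DataFrame({
    'Age': rng.integers(20, 70, n),
    'Cholesterol': rng.integers(150, 300, n).astype(float),
})

# MAR: the chance of a missing value depends only on Age, which is observed.
# Assumed probabilities: 0.7 for patients under 35, 0.1 otherwise.
p_missing = np.where(demo['Age'] < 35, 0.7, 0.1)
is_missing = rng.random(n) < p_missing

demo_mar = demo.copy()
demo_mar.loc[is_missing, 'Cholesterol'] = np.nan

# Missing rates by age group should land near the assumed probabilities
young = demo_mar['Age'] < 35
rate_young = demo_mar.loc[young, 'Cholesterol'].isna().mean()
rate_older = demo_mar.loc[~young, 'Cholesterol'].isna().mean()

print(f"missing rate, Age < 35:  {rate_young:.2f}")
print(f"missing rate, Age >= 35: {rate_older:.2f}")
```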

---
## 3) MNAR — Missing Not at Random
**Definition:** Missingness depends on the value itself, even after conditioning on observed data.

**Example:** Patients with *very high* cholesterol avoid reporting it.

In [None]:
mnar = data.copy()
mnar.loc[mnar['Cholesterol'] > 250, 'Cholesterol'] = np.nan
mnar
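
The three definitions can also be written compactly. The notation here is an assumption introduced for this summary and is not used elsewhere in the notebook: $R$ is the missingness indicator for *Cholesterol*, $Y_{\text{obs}}$ is everything that was observed (for example, Age), and $Y_{\text{mis}}$ is the set of values that went missing.

$$
\begin{aligned}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ still depends on } Y_{\text{mis}} \text{ after conditioning on } Y_{\text{obs}}
\end{aligned}
$$

In the examples above, the Age-based rule fits the MAR line because Age is always observed, while the Cholesterol-based rule fits the MNAR line because the rule looks at the very value it removes.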

---
## 4) Visual Intuition
Each plot shows **Age** vs **Cholesterol**. Missing points are drawn as **X** markers at the patient's actual Age but at a random height, because their Cholesterol value is unknown. Only the x-position of an **X** is meaningful.

In [None]:
# Helper to plot one dataset
def plot_missing(df, title):
    plt.figure(figsize=(6,4))
    # Observed values
    obs = df.dropna(subset=['Cholesterol'])
    plt.scatter(obs['Age'], obs['Cholesterol'], s=60, label='Observed')
    # Missing values (placed at random y for visibility)
    miss_mask = df['Cholesterol'].isna()
    if miss_mask.any():
        y_fake = np.random.uniform(150, 300, miss_mask.sum())
        plt.scatter(df.loc[miss_mask, 'Age'], y_fake, marker='x', s=100, label='Missing (y shown randomly)')
    plt.title(title)
    plt.xlabel('Age')
    plt.ylabel('Cholesterol')
    plt.legend()
    plt.show()

plot_missing(mcar, 'MCAR (random)')
plot_missing(mar, 'MAR (depends on Age)')
plot_missing(mnar, 'MNAR (depends on Cholesterol)')


---
## 5) How Missingness Affects Simple Estimates
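We compare the **mean Cholesterol** under each mechanism in two ways:
1. The full-data mean (ground truth from the original complete data).
2. The observed-only mean after missingness (rows with missing *Cholesterol* are dropped).

In general, **MCAR** leaves the observed-only mean unbiased on average, while **MAR** and **MNAR** can bias it. Two caveats apply to this table. First, with only 10 rows, a single run mostly reflects sampling noise. Second, `data` draws Age and Cholesterol independently, so the MAR rule here does not bias the mean in expectation; MAR biases a naive mean only when the variable driving missingness is related to the variable being averaged. The MNAR rule removes the largest values, so its observed-only mean can only fall at or below the full-data mean.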

In [None]:
def summary_means(df_complete, df_with_missing, label):
    baseline = df_complete['Cholesterol'].mean()
    observed = df_with_missing['Cholesterol'].mean(skipna=True)
    return {
        'Scenario': label,
        'Full-data mean': round(float(baseline), 2),
        'Observed-only mean': round(float(observed), 2)
    }

rows = [
    summary_means(data, mcar, 'MCAR'),
    summary_means(data, mar, 'MAR'),
    summary_means(data, mnar, 'MNAR')
]
pd.DataFrame(rows)
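
A single 10-row table cannot show bias "on average", and the notebook's `data` draws Age and Cholesterol independently. The sketch below repeats the comparison many times on larger synthetic samples in which Cholesterol rises with Age. That linear relationship, the missingness probabilities, and the helper name `simulate_once` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

def simulate_once(n=500):
    """Return (observed-only mean - full mean) for each mechanism."""
    age = rng.uniform(20, 70, n)
    # Assumed relationship: Cholesterol increases with Age, plus noise
    chol = 150 + 1.5 * age + rng.normal(0, 20, n)
    full_mean = chol.mean()
    masks = {
        # MCAR: same chance for everyone
        'MCAR': rng.random(n) < 0.3,
        # MAR: chance depends on Age (observed)
        'MAR': rng.random(n) < np.where(age < 35, 0.7, 0.1),
        # MNAR: chance depends on Cholesterol itself
        'MNAR': rng.random(n) < np.where(chol > 250, 0.7, 0.1),
    }
    return {name: chol[~m].mean() - full_mean for name, m in masks.items()}

# Average the error over many repetitions to estimate the bias
errors = pd.DataFrame([simulate_once() for _ in range(200)])
avg_error = errors.mean()
print(avg_error.round(2))
```

Under these assumptions, the MCAR error should average close to zero, the MAR error should be positive (younger, lower-cholesterol patients are dropped more often), and the MNAR error should be negative (high values are dropped more often).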

---
## 6) Key Takeaways
- **MCAR**: Dropping missing rows gives unbiased estimates, at the cost of a smaller sample and less precision.
- **MAR**: Adjust for the observed predictors of missingness (e.g., include Age), or use multiple imputation.
- **MNAR**: Requires modeling the missingness mechanism or collecting additional information.
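
As one way to picture the MAR advice, the sketch below fits a straight line of Cholesterol on Age using the observed rows and uses it to fill in the missing rows before averaging. This is single regression imputation, a simpler stand-in for multiple imputation, and it understates uncertainty. The data-generating model and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
age = rng.uniform(20, 70, n)
# Assumed relationship between Age and Cholesterol
chol = 150 + 1.5 * age + rng.normal(0, 20, n)

# MAR: younger patients are more likely to have a missing value
is_missing = rng.random(n) < np.where(age < 35, 0.7, 0.1)
observed = ~is_missing

# Fit Cholesterol ~ Age on observed rows only
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(age[observed], chol[observed], deg=1)

# Fill missing rows with the fitted value, keep observed values as they are
chol_filled = np.where(observed, chol, intercept + slope * age)

full_mean = chol.mean()
naive_mean = chol[observed].mean()
adjusted_mean = chol_filled.mean()

print(f"full-data mean:     {full_mean:.2f}")
print(f"observed-only mean: {naive_mean:.2f}")
print(f"Age-adjusted mean:  {adjusted_mean:.2f}")
```

The adjustment works here because the fitted model matches how the data were generated; with a poorly specified model, the adjusted mean can still be off.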

💡 Tip: In practice, use diagnostic plots and sensitivity analyses to evaluate assumptions.
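
One simple form of sensitivity analysis is a delta adjustment: fill the missing values with the observed mean shifted by an offset `delta`, then see how much the estimate moves as `delta` grows. The sketch below uses made-up cholesterol values and an arbitrary grid of offsets; both are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with gaps (illustrative values only)
chol = pd.Series([210.0, np.nan, 190.0, 245.0, np.nan,
                  230.0, 205.0, np.nan, 220.0, 198.0])

observed_mean = chol.mean()  # pandas skips NaN by default

rows = []
for delta in [0, 10, 20, 40]:  # assumed offsets for the unseen values
    # Suppose the missing values sit `delta` units above the observed mean
    filled = chol.fillna(observed_mean + delta)
    rows.append({'delta': delta, 'estimated mean': round(filled.mean(), 2)})

sensitivity = pd.DataFrame(rows)
print(sensitivity)
```

If the conclusion changes materially across plausible values of `delta`, the analysis is sensitive to the MNAR possibility and deserves a more careful model.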