# INST 414 — Additional Lab Tasks (Week 4)

Genetic screening + probability in pandas (independence and Bayes' theorem).


## Setup

We will work with a genetic screening dataset with these columns:
- `has_variant`: 1 if the person has the genetic variant, 0 otherwise
- `screen_positive`: 1 if the screening test is positive, 0 otherwise
- `age_group`: a category like `Under 50` or `50+`

Unless otherwise stated, report probabilities rounded to three decimal places.


In [None]:
import numpy as np
import pandas as pd


## Data
Run the next cell to create the dataset we’ll use for these tasks.
It’s a toy dataset of 4,000 rows, generated with a fixed random seed so every run produces the same values. It has the columns we need: `age_group`, `has_variant`, and `screen_positive`.


In [None]:
# Generate a reproducible toy dataset (fixed random seed)

rng = np.random.default_rng(414)

n_under_50 = 2000
n_50_plus  = 2000

age_group = np.array(['Under 50'] * n_under_50 + ['50+'] * n_50_plus)

# Different pre-test probabilities by age group
p_variant = np.where(age_group == 'Under 50', 0.010, 0.030)
has_variant = (rng.random(age_group.size) < p_variant).astype(int)

# Screening test quality (same across groups in this toy example)
sensitivity = 0.86  # P ( screen_positive = 1 | has_variant = 1 )
specificity = 0.92  # P ( screen_positive = 0 | has_variant = 0 )

# Generate test results conditional on has_variant
screen_positive = np.empty_like(has_variant)

mask_variant = has_variant == 1
mask_no_variant = ~mask_variant

screen_positive[mask_variant] = (rng.random(mask_variant.sum()) < sensitivity).astype(int)
# For no-variant, a positive is a false positive: probability 1 - specificity
screen_positive[mask_no_variant] = (rng.random(mask_no_variant.sum()) < (1 - specificity)).astype(int)

df = pd.DataFrame({
    # Stored as float so NaN can be assigned below without an implicit int-to-float upcast
    'has_variant': has_variant.astype(float),
    'screen_positive': screen_positive.astype(float),
    'age_group': age_group,
})

# Add a small amount of missingness to practice isna / dropna
missing_idx = rng.choice(df.index, size=40, replace=False)
df.loc[missing_idx[:15], 'has_variant'] = np.nan
df.loc[missing_idx[15:30], 'screen_positive'] = np.nan
df.loc[missing_idx[30:], 'age_group'] = np.nan

df.head()


## Task 1 — Load, inspect, and clean

1) Check missing values with `df.isna().sum()` for `has_variant`, `screen_positive`, and `age_group`.

2) Create `df_clean` by dropping rows with missing values in those three columns.

3) Confirm that `has_variant` and `screen_positive` are truly binary (only 0 and 1) in `df_clean`.

4) Report how many rows you dropped, and give a one-sentence justification.
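
The four steps above can be sketched on a tiny stand-in frame. This is only an illustration under assumptions: `toy`, `cols`, and `toy_clean` are hypothetical names, and the six rows are made up rather than taken from the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical six-row frame (not the lab data) with one missing value per column
toy = pd.DataFrame({
    "has_variant":     [1.0, 0.0, np.nan, 0.0, 1.0, 0.0],
    "screen_positive": [1.0, 0.0, 1.0, np.nan, 0.0, 0.0],
    "age_group":       ["50+", "Under 50", "50+", "Under 50", None, "50+"],
})
cols = ["has_variant", "screen_positive", "age_group"]

# Step 1: number of missing values in each of the three columns
missing_counts = toy[cols].isna().sum()

# Step 2: drop every row that is missing at least one of the three columns
toy_clean = toy.dropna(subset=cols)

# Step 3: True per column only if every remaining value is 0 or 1
is_binary = toy_clean[["has_variant", "screen_positive"]].isin([0, 1]).all()

# Step 4: rows removed by the cleaning step
n_dropped = len(toy) - len(toy_clean)

print(missing_counts, is_binary, n_dropped, sep="\n")
```

Passing `subset=cols` to `dropna` keeps the cleaning rule explicit, which matters if the frame ever gains extra columns that are allowed to be missing.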


In [None]:
# Your code here


## Task 2 — Base rates (marginal probabilities)

Compute:
- Overall pre-test probability: $$ P ( has\_variant = 1 ) $$
- Overall positive screening rate: $$ P ( screen\_positive = 1 ) $$

Then compute both of these by `age_group`:
- $$ P ( has\_variant = 1 \mid age\_group ) $$
- $$ P ( screen\_positive = 1 \mid age\_group ) $$

In one sentence: which `age_group` has the higher pre-test probability, and by how much?
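
One possible pattern for these base rates relies on the fact that the mean of a 0/1 column equals the share of ones. The sketch below assumes a hypothetical ten-row frame called `toy` with invented values, so its numbers say nothing about the lab data.

```python
import pandas as pd

# Hypothetical ten-row frame (not the lab data), already free of missing values
toy = pd.DataFrame({
    "age_group":       ["Under 50"] * 5 + ["50+"] * 5,
    "has_variant":     [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    "screen_positive": [0, 1, 0, 1, 0, 0, 0, 1, 1, 0],
})

# Marginal probabilities: the mean of a 0/1 column is the fraction of ones
p_variant = toy["has_variant"].mean()
p_positive = toy["screen_positive"].mean()

# The same means computed within each age group
by_group = toy.groupby("age_group")[["has_variant", "screen_positive"]].mean()

print(round(p_variant, 3), round(p_positive, 3))
print(by_group.round(3))
```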


In [None]:
# Your code here


## Task 3 — Joint distribution table

1) Create a 2×2 joint probability table for $ P ( has\_variant , screen\_positive ) $ from `df_clean`.

2) Verify the table sums to 1.

3) Create the same 2×2 joint probability table separately for each `age_group`.

Optional shortcut: use `pd.crosstab(..., normalize=True)` to build the 2×2 joint table and confirm it matches your other method.
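
As a rough illustration of both routes, the sketch below builds the joint table once from grouped counts and once with `pd.crosstab`, then compares them. The frame `toy` and the variable names are hypothetical, and the ten rows are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical ten-row frame (not the lab data), already free of missing values
toy = pd.DataFrame({
    "age_group":       ["Under 50"] * 5 + ["50+"] * 5,
    "has_variant":     [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    "screen_positive": [0, 1, 0, 1, 0, 0, 0, 1, 1, 0],
})

# Route 1: count each (has_variant, screen_positive) pair, divide by n, reshape to 2x2
joint = (
    toy.groupby(["has_variant", "screen_positive"]).size()
       .div(len(toy))
       .unstack(fill_value=0.0)
)

# Route 2: crosstab with normalize=True divides every cell by the grand total
joint_ct = pd.crosstab(toy["has_variant"], toy["screen_positive"], normalize=True)

# Checks: the cells sum to 1 and the two routes agree
total = joint.to_numpy().sum()
same = np.allclose(joint.to_numpy(), joint_ct.to_numpy())

# One table per age group, each normalised by that group's own row count
joint_by_group = {
    name: pd.crosstab(g["has_variant"], g["screen_positive"], normalize=True)
    for name, g in toy.groupby("age_group")
}

print(joint, total, same, sep="\n")
```

The `fill_value=0.0` argument guards against a pair that never occurs in the data, which would otherwise show up as `NaN` after `unstack`.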


In [None]:
# Your code here


## Task 4 — Test quality (conditional probabilities)

Compute the following overall, and then again within each `age_group`:

1) Sensitivity: $$ P ( screen\_positive = 1 \mid has\_variant = 1 ) $$

2) Specificity: $$ P ( screen\_positive = 0 \mid has\_variant = 0 ) $$

3) False positive rate: $$ P ( screen\_positive = 1 \mid has\_variant = 0 ) $$

In 1 to 2 sentences: do these quantities look similar across age groups, or do they differ?
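
Conditioning on `has_variant` amounts to filtering to the matching rows before taking a mean. The helper `screen_quality` below is a hypothetical name, and `toy` is an invented ten-row frame, so treat this as one possible layout rather than the expected solution.

```python
import pandas as pd

# Hypothetical ten-row frame (not the lab data), already free of missing values
toy = pd.DataFrame({
    "age_group":       ["Under 50"] * 5 + ["50+"] * 5,
    "has_variant":     [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    "screen_positive": [0, 1, 0, 1, 0, 0, 0, 1, 1, 0],
})

def screen_quality(frame):
    """Sensitivity, specificity, and false positive rate for one frame."""
    # P(screen_positive = 1 | has_variant = 1): positive share among variant rows
    pos_given_variant = frame.loc[frame["has_variant"] == 1, "screen_positive"].mean()
    # P(screen_positive = 1 | has_variant = 0): positive share among no-variant rows
    pos_given_no_variant = frame.loc[frame["has_variant"] == 0, "screen_positive"].mean()
    return pd.Series({
        "sensitivity": pos_given_variant,
        "specificity": 1 - pos_given_no_variant,
        "false_positive_rate": pos_given_no_variant,
    })

overall = screen_quality(toy)

# One row per age group, built by applying the same helper to each group's rows
by_group = pd.DataFrame(
    {name: screen_quality(g) for name, g in toy.groupby("age_group")}
).T

print(overall.round(3))
print(by_group.round(3))
```

With only a handful of rows per group, as here, the per-group estimates swing widely, so differences between groups in a sample this small would not mean much.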


In [None]:
# Your code here


## Task 5 — Independence check

Overall, check whether `has_variant` and `screen_positive` are independent by comparing:

- $$ P ( has\_variant = 1 , screen\_positive = 1 ) $$
- $$ P ( has\_variant = 1 ) \cdot P ( screen\_positive = 1 ) $$

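Report both numbers and their difference. Because these are sample estimates, the two will rarely match exactly, so judge whether the gap is large relative to the probabilities themselves. Conclude: independent or not?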

Then repeat the same check within each `age_group` (using group-specific probabilities).
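
The comparison can be wrapped in a small helper and reused per group. In this sketch `independence_gap` and `toy` are hypothetical names and the data are invented; how small a difference counts as "close enough" is left as a judgment call.

```python
import pandas as pd

# Hypothetical ten-row frame (not the lab data), already free of missing values
toy = pd.DataFrame({
    "age_group":       ["Under 50"] * 5 + ["50+"] * 5,
    "has_variant":     [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    "screen_positive": [0, 1, 0, 1, 0, 0, 0, 1, 1, 0],
})

def independence_gap(frame):
    """Joint probability, product of marginals, and their difference."""
    # P(has_variant = 1, screen_positive = 1): share of rows where both are 1
    p_joint = ((frame["has_variant"] == 1) & (frame["screen_positive"] == 1)).mean()
    # P(has_variant = 1) * P(screen_positive = 1): what independence would predict
    p_product = frame["has_variant"].mean() * frame["screen_positive"].mean()
    return pd.Series({
        "joint": p_joint,
        "product": p_product,
        "difference": p_joint - p_product,
    })

overall = independence_gap(toy)

# Same check inside each age group, using that group's own probabilities
by_group = pd.DataFrame(
    {name: independence_gap(g) for name, g in toy.groupby("age_group")}
).T

print(overall.round(3))
print(by_group.round(3))
```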


In [None]:
# Your code here


## Task 6 — Bayes' theorem (post-test probability) + low base-rate effect

1) Compute the post-test probability directly from the data:
$$ P ( has\_variant = 1 \mid screen\_positive = 1 ) $$

2) Compute it again using Bayes' theorem:
$$ P ( has\_variant = 1 \mid screen\_positive = 1 ) = \frac{ P ( screen\_positive = 1 \mid has\_variant = 1 ) \cdot P ( has\_variant = 1 ) }{ P ( screen\_positive = 1 ) } $$

3) Confirm the two answers match (up to rounding).

4) Repeat steps 1 to 3 within each `age_group`. In one sentence: why can post-test probability differ across age groups?
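
Steps 1 to 4 can be sketched as a direct estimate next to the Bayes computation, overall and per group. The helper `post_test_both_ways` and the frame `toy` are hypothetical, and the ten rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ten-row frame (not the lab data), already free of missing values
toy = pd.DataFrame({
    "age_group":       ["Under 50"] * 5 + ["50+"] * 5,
    "has_variant":     [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    "screen_positive": [0, 1, 0, 1, 0, 0, 0, 1, 1, 0],
})

def post_test_both_ways(frame):
    """P(has_variant = 1 | screen_positive = 1), directly and via Bayes' theorem."""
    # Direct: among positive screens, the share that actually has the variant
    direct = frame.loc[frame["screen_positive"] == 1, "has_variant"].mean()
    # Bayes: sensitivity * prior / P(screen_positive = 1)
    sensitivity = frame.loc[frame["has_variant"] == 1, "screen_positive"].mean()
    prior = frame["has_variant"].mean()
    p_positive = frame["screen_positive"].mean()
    return pd.Series({"direct": direct, "bayes": sensitivity * prior / p_positive})

overall = post_test_both_ways(toy)
by_group = pd.DataFrame(
    {name: post_test_both_ways(g) for name, g in toy.groupby("age_group")}
).T

# The two routes should agree up to floating-point error
overall_match = np.isclose(overall["direct"], overall["bayes"])
group_match = np.isclose(by_group["direct"], by_group["bayes"]).all()

print(overall.round(3))
print(by_group.round(3))
```

Comparing with `np.isclose` rather than `==` avoids a false mismatch caused by floating-point rounding in the Bayes product.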

Low base-rate check:

5) Pick one `age_group`. Keep its sensitivity and false positive rate fixed, but set a hypothetical prevalence like $ P ( has\_variant = 1 ) = 0.0002 $. Compute the hypothetical post-test probability.

6) In 2 to 3 sentences: explain why the post-test probability can still be low even when the screening test is very accurate.
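
When the prevalence is changed by hand, the observed positive rate no longer applies, so the denominator has to be rebuilt from the law of total probability: sensitivity times prevalence plus false positive rate times one minus prevalence. The sketch below assumes a hypothetical helper `post_test_probability` and, as stand-ins for a group's estimated values, plugs in the design values from the data-generating cell (sensitivity 0.86, specificity 0.92).

```python
def post_test_probability(sensitivity, false_positive_rate, prevalence):
    """P(has_variant = 1 | screen_positive = 1) for a given prevalence."""
    # Law of total probability: positives come from true cases and from false alarms
    p_positive = (
        sensitivity * prevalence
        + false_positive_rate * (1 - prevalence)
    )
    # Bayes' theorem with the rebuilt denominator
    return sensitivity * prevalence / p_positive

# Stand-in values: the design parameters from the data cell, not group estimates
sensitivity = 0.86
false_positive_rate = 1 - 0.92

# Hypothetical rare-variant scenario from step 5
rare = post_test_probability(sensitivity, false_positive_rate, prevalence=0.0002)

# For contrast, the 50+ design prevalence used when generating the data
common = post_test_probability(sensitivity, false_positive_rate, prevalence=0.03)

print(round(rare, 3), round(common, 3))
```

With these stand-in numbers the rare scenario gives a post-test probability of roughly 0.002: false positives from the very large no-variant population far outnumber true positives from the tiny variant population.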


In [None]:
# Your code here
