# INST414 — Lab 4: Performance Metrics + Naive Bayes (with Pandas)

**How this notebook works:**
- Everything **before** the **Lab Tasks** section is your **pre-lab** work to complete at home.
- In class, we’ll focus on the **Lab Tasks** and I’ll call on people to walk through solutions.

**Today’s goal:** compute performance metrics as **conditional probabilities**, see how metrics change when you change the **threshold**, and then use **Naive Bayes** to make a simple classification.

## Learning goals
By the end, you should be able to:
- Turn predicted probabilities into predictions using a threshold.
- Explain the confusion-matrix cells (TP, FP, TN, FN).
- Compute a joint distribution of $(y, \hat{y})$ with `value_counts`.
- Compute **TPR (recall)** and **PPV (precision)** from data.
- Use `sort_values`, `between`, and `map` for common data tasks.
- Compare two Naive Bayes scores and classify.

## How to work in this notebook
- Run cells top-to-bottom. If something errors, re-run from the last successful cell.
- When you see **Checkpoint** prompts, pause and try before scrolling.
- After any big step, do a quick check with `.shape`, `.head()`, and `.value_counts()`.

## Common issues (quick fixes)
- If you get a `KeyError`, check column names with `lab4_data.columns`.
- If your probabilities look weird, double-check the denominator (what you’re conditioning on).
- If `value_counts` output is hard to read, use `.sort_index()` to organize it.


**Before you start:** Click **File → Save a copy in Drive** so you have your own version of this notebook. If you skip this step, your work will not be saved.

**Turn off AI assistance:** Go to **Settings → AI Assistance** and uncheck everything. AI-generated code is not allowed on assignments in this course.

# Load modules and settings


In [1]:
# first thing is to import pandas
import pandas as pd

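# show every column, but cap long tables at 20 rows
pd.options.display.max_columns = None
pd.options.display.max_rows = 20

# display the value of every standalone expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"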

# Pre-lab (do before class)

We’ll build the core workflow in a small example first, then repeat it on the Maryland dataset.


## Part 1 — Warm-up: metrics on a tiny table

We’ll treat:
- `y` as the true outcome (0/1)
- `prediction` as a model’s predicted probability for `y = 1`

Then we’ll create ŷ by choosing a threshold and compute TPR and PPV.


### Step 0: From a probability to a predicted label

A model gives a number like `prediction = 0.73`. Think of it as: 
> the model’s estimated probability that `y = 1`

To make a yes/no prediction, we pick a **threshold** `t` and define:
- predict $\hat{y} = 1$ if `prediction ≥ t`
- predict $\hat{y} = 0$ if `prediction < t`

Changing `t` changes how many people we predict as positive, which changes the error types we make.
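
As a rough illustration of that last point, the sketch below counts how many rows would be predicted positive at a few thresholds. The scores are made up for this example and are not from the lab data.

```python
import pandas as pd

# Hypothetical scores, made up for illustration (not from the lab data).
scores = pd.Series([0.9, 0.7, 0.5, 0.4, 0.2])

# Count how many rows would be predicted positive at each threshold t.
n_predicted_positive = {t: int((scores >= t).sum()) for t in (0.3, 0.5, 0.8)}
n_predicted_positive
```

Raising `t` from 0.3 to 0.8 shrinks the predicted-positive group, which is why the mix of errors changes with the threshold.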

### Step 1: Confusion matrix (four outcomes)

Once you have `y` and $\hat{y}$, every row is one of:
- **TP**: `y = 1` and $\hat{y} = 1$
- **FP**: `y = 0` and $\hat{y} = 1$
- **TN**: `y = 0` and $\hat{y} = 0$
- **FN**: `y = 1` and $\hat{y} = 0$

In Pandas, `value_counts` on two columns is a fast way to compute those counts.
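
A minimal sketch of that counting step, using a handful of made-up labels. It also shows `pd.crosstab`, which this lab does not otherwise use, as one alternative way to lay out the same four counts.

```python
import pandas as pd

# Hypothetical labels, made up for illustration.
df_example = pd.DataFrame({
    'y':    [1, 0, 1, 1, 0],
    'yhat': [1, 1, 0, 1, 0],
})

# Option 1: value_counts on two columns gives one count per (y, yhat) pair.
pair_counts = df_example[['y', 'yhat']].value_counts().sort_index()

# Option 2: crosstab arranges the same counts as a 2x2 table
# (rows are y, columns are yhat).
table = pd.crosstab(df_example['y'], df_example['yhat'])

pair_counts
table
```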

### Step 1.5: Recode booleans into 0/1

When we apply a threshold, we naturally get a `True/False` value (did the prediction clear the threshold?).
For the rest of the lab, it’s convenient to also have a `0/1` version.

We’ll use Pandas `map` to recode values like this:
- `True → 1`
- `False → 0`



In [None]:
df_toy = pd.DataFrame({
    'prediction': [0.95, 0.80, 0.73, 0.68, 0.60, 0.55, 0.52, 0.49, 0.30, 0.18],
    'y':          [1,    0,    1,    1,    0,    1,    0,    1,    0,    0],
})
df_toy


In this toy table:
- `y = 1` means the outcome happened.
- `prediction` is the model’s score (higher means the model thinks `y = 1` is more likely).


In [None]:
df_toy['yhat'] = df_toy['prediction'] >= 0.5

# map: recode True/False into 1/0
df_toy['yhat01'] = df_toy['yhat'].map({True: 1, False: 0})

df_toy


Here’s what we just did:
- `yhat` is `True/False` (a boolean).
- `yhat01` converts that to `1/0` using `map`, which is convenient for counting and probability calculations.
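
For booleans specifically, `astype(int)` should give the same result as the `map` recode. The lab sticks with `map`; the sketch below only shows that the two agree on a small made-up Series.

```python
import pandas as pd

flags = pd.Series([True, False, True])

# Two equivalent ways to recode booleans as 1/0.
via_map = flags.map({True: 1, False: 0})
via_astype = flags.astype(int)

via_map.tolist(), via_astype.tolist()
```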


In [None]:
joint_counts_toy = df_toy[['y', 'yhat01']].value_counts().sort_index()
joint_counts_toy


`value_counts` on `['y', 'yhat01']` gives counts for each pair $(y, \hat{y})$.

Interpretation example: `( 1 , 1 )` is the number of rows where the outcome is 1 and the prediction is 1 (that’s **TP**).

We’ll use those four counts to compute TPR and PPV.


### Step 2: Metrics as conditional probabilities

- **TPR / Recall** = $P ( \hat{y} = 1 \mid y = 1 ) = \frac{TP}{TP + FN}$
- **PPV / Precision** = $P ( y = 1 \mid \hat{y} = 1 ) = \frac{TP}{TP + FP}$

Notice the denominators:
- TPR conditions on **actual positives** (`y = 1`).
- PPV conditions on **predicted positives** ($\hat{y} = 1$).
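
One way to see the "conditioning" directly is to filter to the denominator group first and then take a mean. The sketch below rebuilds the toy table under a new, hypothetical name (`toy_check`) so it runs on its own, and uses the 0.5 threshold from earlier.

```python
import pandas as pd

# Same values as the toy table above, rebuilt here so the cell runs on its own.
toy_check = pd.DataFrame({
    'prediction': [0.95, 0.80, 0.73, 0.68, 0.60, 0.55, 0.52, 0.49, 0.30, 0.18],
    'y':          [1,    0,    1,    1,    0,    1,    0,    1,    0,    0],
})
toy_check['yhat01'] = (toy_check['prediction'] >= 0.5).astype(int)

# TPR: restrict to actual positives, then take the share predicted positive.
tpr_check = toy_check.loc[toy_check['y'] == 1, 'yhat01'].mean()

# PPV: restrict to predicted positives, then take the share that are actually positive.
ppv_check = toy_check.loc[toy_check['yhat01'] == 1, 'y'].mean()

round(tpr_check, 3), round(ppv_check, 3)
```

The filter is the "given" part of the conditional probability: change which rows you keep and you change the metric.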


In [None]:
cm_toy = joint_counts_toy.unstack(fill_value=0).reindex(index=[0, 1], columns=[0, 1], fill_value=0)
cm_toy.index = ['y = 0', 'y = 1']
cm_toy.columns = ['yhat = 0', 'yhat = 1']

TN = int(cm_toy.iloc[0, 0])
FP = int(cm_toy.iloc[0, 1])
FN = int(cm_toy.iloc[1, 0])
TP = int(cm_toy.iloc[1, 1])

TPR = TP / (TP + FN)
PPV = TP / (TP + FP)

summary_counts = pd.DataFrame({'Count': [TP, FP, TN, FN]}, index=['TP', 'FP', 'TN', 'FN'])
summary_metrics = pd.DataFrame({'Value': [TPR, PPV]}, index=['TPR (Recall)', 'PPV (Precision)']).round(3)

cm_toy
summary_counts
summary_metrics


**Checkpoint:** lower the threshold to 0.3. What happens to TPR? What happens to PPV?


A common pattern is:
- Lower threshold → predict more positives → **TPR goes up or stays the same** (you can only miss fewer true positives)
- But the extra predicted positives may include more false positives → **PPV can go down**


In [None]:
df_toy['yhat01_03'] = (df_toy['prediction'] >= 0.3).map({True: 1, False: 0})
joint_counts_toy_03 = df_toy[['y', 'yhat01_03']].value_counts().sort_index()

cm_toy_03 = joint_counts_toy_03.unstack(fill_value=0).reindex(index=[0, 1], columns=[0, 1], fill_value=0)
cm_toy_03.index = ['y = 0', 'y = 1']
cm_toy_03.columns = ['yhat = 0', 'yhat = 1']

FP = int(cm_toy_03.iloc[0, 1])
FN = int(cm_toy_03.iloc[1, 0])
TP = int(cm_toy_03.iloc[1, 1])

TPR_03 = TP / (TP + FN)
PPV_03 = TP / (TP + FP)

summary_metrics_03 = pd.DataFrame({'Value': [TPR_03, PPV_03]}, index=['TPR (Recall)', 'PPV (Precision)']).round(3)

cm_toy_03
summary_metrics_03
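
To compare more than two thresholds at once, a small loop is one option. This is a sketch on the toy values only, rebuilt under a hypothetical name (`toy_sweep`); the thresholds are chosen so that at least one row is predicted positive each time, which keeps the PPV denominator above zero.

```python
import pandas as pd

# Same values as the toy table, rebuilt here so the cell runs on its own.
toy_sweep = pd.DataFrame({
    'prediction': [0.95, 0.80, 0.73, 0.68, 0.60, 0.55, 0.52, 0.49, 0.30, 0.18],
    'y':          [1,    0,    1,    1,    0,    1,    0,    1,    0,    0],
})

rows = []
for t in [0.7, 0.5, 0.3]:
    yhat = toy_sweep['prediction'] >= t   # predicted positive at this threshold
    is_pos = toy_sweep['y'] == 1          # actually positive
    tp = int((yhat & is_pos).sum())
    fp = int((yhat & ~is_pos).sum())
    fn = int((~yhat & is_pos).sum())
    rows.append({'threshold': t, 'TPR': tp / (tp + fn), 'PPV': tp / (tp + fp)})

sweep = pd.DataFrame(rows).set_index('threshold').round(3)
sweep
```

On these ten rows, TPR rises as the threshold falls while PPV drifts down, matching the pattern described above.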


### Step 3: Conditional independence (preview for Naive Bayes)

Naive Bayes works by multiplying probabilities like $P ( A = 1 \mid y ) \times P ( B = 1 \mid y )$.
That multiplication is only exactly valid if the features are **conditionally independent** given $y$.

**Conditional independence idea:** within a fixed value of $y$, knowing $A$ shouldn’t tell you anything extra about $B$.
Mathematically, one way to write this is:

- $P ( A = 1 , B = 1 \mid y ) = P ( A = 1 \mid y ) \times P ( B = 1 \mid y )$

We’ll check that with a tiny toy table where $A$ and $B$ are binary (0/1).


In [None]:
rows_y0 = ([{'y': 0, 'A': 0, 'B': 0}] * 4
           + [{'y': 0, 'A': 0, 'B': 1}] * 4
           + [{'y': 0, 'A': 1, 'B': 0}] * 4
           + [{'y': 0, 'A': 1, 'B': 1}] * 4)

rows_y1 = ([{'y': 1, 'A': 0, 'B': 0}] * 4
           + [{'y': 1, 'A': 1, 'B': 1}] * 4)

df_ci = pd.DataFrame(rows_y0 + rows_y1)
df_ci


**Checkpoint:** Check conditional independence for $A$ and $B$ separately for $y = 0$ and for $y = 1$.

For each value of $y$, compute:
- $P ( A = 1 , B = 1 \mid y )$
- $P ( A = 1 \mid y ) \times P ( B = 1 \mid y )$

Are they the same?


In [None]:
def check_conditional_independence(df, y_value):
    sub = df[df['y'] == y_value]
    p_joint = ((sub['A'] == 1) & (sub['B'] == 1)).mean()
    p_a = (sub['A'] == 1).mean()
    p_b = (sub['B'] == 1).mean()
    p_prod = p_a * p_b

    return pd.DataFrame({
        'Value': [p_joint, p_a, p_b, p_prod]
    }, index=[
        'P(A=1, B=1 | y)',
        'P(A=1 | y)',
        'P(B=1 | y)',
        'P(A=1 | y) * P(B=1 | y)'
    ]).round(3)

check_conditional_independence(df_ci, 0)
check_conditional_independence(df_ci, 1)
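
Working the same check by hand from the row counts in `df_ci` (16 rows with $y = 0$, 8 rows with $y = 1$) should give:

$$P(A = 1, B = 1 \mid y = 0) = \tfrac{4}{16} = 0.25, \qquad P(A = 1 \mid y = 0) \times P(B = 1 \mid y = 0) = \tfrac{8}{16} \times \tfrac{8}{16} = 0.25$$

$$P(A = 1, B = 1 \mid y = 1) = \tfrac{4}{8} = 0.5, \qquad P(A = 1 \mid y = 1) \times P(B = 1 \mid y = 1) = \tfrac{4}{8} \times \tfrac{4}{8} = 0.25$$

So in this toy table $A$ and $B$ are conditionally independent given $y = 0$ but not given $y = 1$, where $A$ and $B$ always take the same value. Multiplying the two conditional probabilities would understate the joint probability in the $y = 1$ group.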


## Part 2 — Maryland dataset

This lab uses a dataset with:
- a **model prediction** (a probability between 0 and 1), and
- a **binary outcome** (0/1).

Our goal is to practice turning a probability into a predicted label using a threshold, then computing performance metrics.


### What the columns mean (for this lab)

- `umd_id`: an ID for each person (not a feature for prediction).
- `outcome_violent_rearrest`: the true label $y$ (0/1). In the raw data it may show up as `0.0`/`1.0`, but we treat it as 0/1.
- `prediction`: the model’s predicted probability that $y = 1$.

We’ll also use a few columns to build toy features:
- `age_at_arrest`: age (in years) at the time of arrest.
- `n_vio_convictions_last_4yrs`: number of violent convictions in the last 4 years.
- `n_vio_convictions_last_180days`: number of violent convictions in the last 180 days.
- `n_vio_arrests_last_4yrs`: number of violent arrests in the last 4 years.
- `n_vio_arrests_last_180days`: number of violent arrests in the last 180 days.

Workflow (same as the toy example): choose a threshold, build $\hat{y}$, then compute metrics.


In [2]:
lab4_data = pd.read_csv('https://www.dropbox.com/scl/fi/0nhbo32tdw7mi2v2g0vd0/lab5_dataset.csv?rlkey=e0oevkrciv91r53rricdesq00&dl=1')
lab4_data.head()


|   | umd_id | outcome_violent_rearrest | prediction | n_vio_convictions_last_4yrs | n_vio_convictions_last_180days | n_vio_arrests_last_4yrs | n_vio_arrests_last_180days | age_at_arrest |
|---|---|---|---|---|---|---|---|---|
| 0 | 5974815 | 0.0 | 0.084093 | 0.0 | 0.0 | 0.0 | 0.0 | 35.060274 |
| 1 | 5835352 | 1.0 | 0.420702 | 0.0 | 0.0 | 0.0 | 0.0 | 19.717808 |
| 2 | 7222541 | 0.0 | 0.124549 | 0.0 | 0.0 | 0.0 | 0.0 | 56.427397 |
| 3 | 5031868 | 1.0 | 0.181424 | 0.0 | 0.0 | 0.0 | 0.0 | 25.186301 |
| 4 | 3532116 | 0.0 | 0.114881 | 0.0 | 0.0 | 0.0 | 0.0 | 28.764384 |


**Sanity check:**


In [3]:
lab4_data.shape


(10000, 8)

We’ll mostly work with three columns for the metrics part:
- `y`
- `prediction`
- `yhat01`


In [None]:
lab4_data['y'] = lab4_data['outcome_violent_rearrest']
lab4_data['yhat'] = lab4_data['prediction'] >= 0.5
lab4_data['yhat01'] = lab4_data['yhat'].map({True: 1, False: 0})

lab4_data[['y', 'yhat01']].head()


In [None]:
joint_counts = lab4_data[['y', 'yhat01']].value_counts().sort_index()
joint_counts


Just like in the toy example, `joint_counts` contains the four confusion-matrix counts (TP, FP, TN, FN), indexed by $(y, \hat{y})$.


In [None]:
cm = joint_counts.unstack(fill_value=0).reindex(index=[0, 1], columns=[0, 1], fill_value=0)
cm.index = ['y = 0', 'y = 1']
cm.columns = ['yhat = 0', 'yhat = 1']

FP = int(cm.iloc[0, 1])
FN = int(cm.iloc[1, 0])
TP = int(cm.iloc[1, 1])

TPR = TP / (TP + FN)
PPV = TP / (TP + FP)

metrics_md = pd.DataFrame({'Value': [TPR, PPV]}, index=['TPR (Recall)', 'PPV (Precision)']).round(3)

cm
metrics_md


## Part 3 — Pandas functions we’ll keep using

You’ve already seen one of these (`map`) in the toy example. Here are three small functions that show up constantly in data science work:

- `sort_values`: sort a DataFrame by a column (we’ll rank people by predicted risk).
- `between`: create a boolean for whether a value falls in a range (we’ll define an age group).
- `map`: recode values (we’ll convert `True/False` into `1/0`).


In [None]:
# sort_values: rank people by predicted risk (highest first)
ranked = lab4_data.sort_values('prediction', ascending=False)
ranked[['prediction', 'y']].head()

# Example: PPV among the top 100 highest-risk people
ppv_top_100 = ranked.head(100)['y'].mean()
ppv_top_100


The idea above is the same as the in-class “top 500” task, just with K = 100 instead of K = 500.


In [None]:
# between example
lab4_data['young_adult'] = lab4_data['age_at_arrest'].between(18, 25)
lab4_data['young_adult01'] = lab4_data['young_adult'].map({True: 1, False: 0})
lab4_data[['age_at_arrest', 'young_adult', 'young_adult01']].head()


`between(18, 25)` is often nicer than writing two conditions with `&`. Both endpoints are included by default, so it matches `(age >= 18) & (age <= 25)`.

It returns `True/False`. In this course we often map that to `1/0` so we can treat it like an indicator variable.
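
Because `age_at_arrest` is fractional in the preview above (for example `35.060274`), the upper bound deserves a second look. The sketch below uses made-up ages to show which values `between(18, 25)` keeps.

```python
import pandas as pd

# Hypothetical ages, made up for illustration.
ages = pd.Series([17.9, 18.0, 25.0, 25.4])

# between is inclusive at both ends by default: 18.0 and 25.0 count, 25.4 does not.
in_range = ages.between(18, 25)

# The same test written as two conditions joined with &.
in_range_manual = (ages >= 18) & (ages <= 25)

in_range.tolist()
```

Someone aged 25.4 is outside the range here, even though they would usually be described as "25 years old".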


In [None]:
# map example
pd.Series([True, False, True]).map({True: 1, False: 0})


`map` is a general recoding tool. We’ll use it mostly to turn booleans into `1/0`, but you can also map categories (like `'M'` → `'Male'`).
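
One behavior worth knowing when mapping categories with a dictionary: values that are not keys in the dictionary come back as `NaN` instead of raising an error. A small sketch with made-up codes:

```python
import pandas as pd

# Hypothetical category codes, made up for illustration.
codes = pd.Series(['M', 'F', 'M', 'X'])

# Values missing from the dictionary ('X' here) become NaN rather than raising an error.
labels = codes.map({'M': 'Male', 'F': 'Female'})

labels
labels.isna().sum()
```

A quick `.isna().sum()` after a `map` is a cheap way to catch categories you forgot to include.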


# Lab Tasks (in class)

You should come to class having completed the pre-lab above.
In class, we’ll work through these tasks together.


## A) Changing the threshold changes the metrics

1) Compute the base rate `base_rate = P ( y = 1 )`.

Then create `yhat_new` that is 1 if `prediction >= base_rate` and 0 otherwise.


2) Compute TPR and PPV for `yhat_new`, and compare them to the threshold-0.5 values you computed in the pre-lab.

In 1–2 sentences: why might TPR go up while PPV goes down?


## B) Rank-based thresholding (top K)

3) Sort `lab4_data` by `prediction` from highest to lowest.

Compute the PPV among the **top 500** rows (treat “top 500” as being predicted positive).


## C) Naive Bayes classification

We’ll classify based on two features:
- `young_adult` (18–25 inclusive)
- `vio_conv_gt1`, meaning `n_vio_convictions_last_4yrs > 1`

4) Create `vio_conv_gt1` and convert it to 0/1 if helpful.


5) Compute the pieces you need for the `y = 1` score and compute:

`score_1` $= P(y = 1) \times P(\text{young\_adult} = 1 \mid y = 1) \times P(\text{vio\_conv\_gt1} = 1 \mid y = 1)$.


6) Compute `score_0` similarly, then classify as `y = 1` if `score_1 > score_0` and `y = 0` otherwise.

In 1 sentence: what does Naive Bayes assume when it multiplies those conditional probabilities?
