# INST414 — Lab 4: Performance Metrics + Naive Bayes (with Pandas)

**What you’ll do today:** compute performance metrics as **conditional probabilities**, see how metrics change when you change the **threshold**, and then use **Naive Bayes** to make a simple classification.

## Learning goals
By the end, you should be able to:
- Turn predicted probabilities into predictions using a **threshold**.
- Use `value_counts` to compute a joint distribution of `(y, ŷ)`.
- Compute **TPR (recall)** and **PPV (precision)** from data.
- Explain why changing the threshold can change TPR and PPV in opposite directions.
- Use `sort_values` to compute a metric for the **top K** highest-risk people.
- Use Naive Bayes to compare two unnormalized scores and classify.

## How to work in this notebook
- Run cells top-to-bottom. If a cell raises an error, fix it and re-run from the last cell that ran successfully.
- After any big step, do a quick check with `.shape`, `.head()`, and `value_counts()`.
- If your output has the wrong *type* (Series vs DataFrame), check your brackets.
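The bracket tip above can be checked with a short sketch. The two-row table `demo` is made up for illustration and is not part of the lab data.

```python
import pandas as pd

# Hypothetical two-row table, used only to show the bracket behavior
demo = pd.DataFrame({'prediction': [0.9, 0.2], 'y': [1, 0]})

one_col_series = demo['y']     # single brackets with one name -> Series
one_col_frame = demo[['y']]    # double brackets (a list of names) -> DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(one_col_frame.shape)
```

Both objects hold the same values; only the container type differs, which affects how later steps such as `value_counts` label their output.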


**Before you start:** Click **File → Save a copy in Drive** so you have your own version of this notebook. If you skip this step, your work will not be saved.

**Turn off AI assistance:** Go to **Settings → AI Assistance** and uncheck everything. AI-generated code is not allowed on assignments in this course.

# Load modules and settings


In [1]:
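# Import pandas for working with tabular data
import pandas as pd

# Show every column, but cap long outputs at 20 rows
pd.options.display.max_columns = None
pd.options.display.max_rows = 20

# Display the value of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"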

# Part A — Warm-up: metrics on a tiny table

Before we use a big dataset, we’ll practice on a tiny table where you can see every row.

We’ll treat:
- `y` as the true outcome (0/1)
- `prediction` as a model’s predicted probability for `y = 1`

Then we’ll create `ŷ` (a 0/1 predicted label) by comparing `prediction` to a threshold.


In [None]:
df_toy = pd.DataFrame({
    'prediction': [0.95, 0.80, 0.73, 0.68, 0.60, 0.55, 0.52, 0.49, 0.30, 0.18],
    'y':          [1,    0,    1,    1,    0,    1,    0,    0,    0,    0],
})

df_toy


# Lab Tasks

We’ll build up the ideas in three parts. The goal is that each step reuses the same small set of Pandas patterns.


## A) Warm-up on a tiny table

We’ll start by turning `prediction` into `ŷ`, then compute a joint distribution, then compute metrics.


1) Create a new column `yhat` that is `True` when `prediction >= 0.5` and `False` otherwise.

Then create `yhat01` that converts `True/False` into `1/0` using `map`.
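As a rough sketch of the threshold-then-`map` pattern, here it is on a separate made-up table. The table `demo` and its values are assumptions for illustration, not the lab's `df_toy`.

```python
import pandas as pd

# Hypothetical demo table (made-up values, not the lab data)
demo = pd.DataFrame({
    'prediction': [0.9, 0.8, 0.6, 0.45, 0.42, 0.40, 0.25, 0.2, 0.15, 0.1],
    'y':          [1,   1,   0,   0,    1,    0,    0,    0,   0,    0],
})

# A comparison on a column returns a boolean Series (True/False per row)
demo['yhat'] = demo['prediction'] >= 0.5

# map with a dictionary replaces each value by its dictionary entry
demo['yhat01'] = demo['yhat'].map({True: 1, False: 0})

print(demo)
```

Keeping both columns makes it easy to confirm that every `True` became `1` and every `False` became `0`.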


2) Compute the joint distribution of `y` and `yhat01` using `value_counts`.

(You can compute **counts** first, or use `normalize=True` to get probabilities.)
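One possible way to get a joint distribution is to select both columns with double brackets and call `value_counts` on the result. The sketch below uses a made-up table `demo` (an assumption, not the lab data).

```python
import pandas as pd

# Hypothetical demo table (made-up values, not the lab data)
demo = pd.DataFrame({
    'prediction': [0.9, 0.8, 0.6, 0.45, 0.42, 0.40, 0.25, 0.2, 0.15, 0.1],
    'y':          [1,   1,   0,   0,    1,    0,    0,    0,   0,    0],
})
demo['yhat01'] = (demo['prediction'] >= 0.5).map({True: 1, False: 0})

# Counts of each (y, yhat01) combination
joint_counts = demo[['y', 'yhat01']].value_counts()

# Same table as proportions of all rows, so the values sum to 1
joint_probs = demo[['y', 'yhat01']].value_counts(normalize=True)

print(joint_counts)
print(joint_probs)
```

The result is indexed by `(y, yhat01)` pairs, so a single cell can be read with, for example, `joint_counts.loc[(1, 1)]`.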


3) Compute:
- TPR (recall) = `P(ŷ = 1 | y = 1)`
- PPV (precision) = `P(y = 1 | ŷ = 1)`

Write each one as a single number.
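A conditional probability can be computed by first filtering to the rows that satisfy the condition and then taking proportions within that subset. The sketch below does this on a made-up table `demo` (an assumption, not the lab data).

```python
import pandas as pd

# Hypothetical demo table (made-up values, not the lab data)
demo = pd.DataFrame({
    'prediction': [0.9, 0.8, 0.6, 0.45, 0.42, 0.40, 0.25, 0.2, 0.15, 0.1],
    'y':          [1,   1,   0,   0,    1,    0,    0,    0,   0,    0],
})
demo['yhat01'] = (demo['prediction'] >= 0.5).map({True: 1, False: 0})

# TPR: condition on y == 1, then ask how often yhat01 is 1
actual_pos = demo[demo['y'] == 1]
tpr = actual_pos['yhat01'].value_counts(normalize=True).loc[1]

# PPV: condition on yhat01 == 1, then ask how often y is 1
pred_pos = demo[demo['yhat01'] == 1]
ppv = pred_pos['y'].value_counts(normalize=True).loc[1]

print(tpr, ppv)
```

Note that `.loc[1]` raises a `KeyError` if the value `1` never appears in the subset; `(actual_pos['yhat01'] == 1).mean()` is an alternative that returns `0.0` in that case.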


4) Compute the base rate `base_rate = P(y = 1)`.

Now create `yhat_base` which is `1` when `prediction >= base_rate` and `0` otherwise.

Compute TPR and PPV again using `yhat_base`. What changed?
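For 0/1 data, the mean of `y` equals `P(y = 1)`, so the base rate can serve directly as a threshold. The sketch below compares the two rules on a made-up table; `demo`, the helper `tpr_ppv`, and the column `yhat_half` are hypothetical names introduced here, and `astype(int)` is used as an alternative to `map`.

```python
import pandas as pd

# Hypothetical demo table (made-up values, not the lab data)
demo = pd.DataFrame({
    'prediction': [0.9, 0.8, 0.6, 0.45, 0.42, 0.40, 0.25, 0.2, 0.15, 0.1],
    'y':          [1,   1,   0,   0,    1,    0,    0,    0,   0,    0],
})

def tpr_ppv(df, yhat_col):
    """Return (TPR, PPV) for the 0/1 columns 'y' and yhat_col."""
    tpr = df.loc[df['y'] == 1, yhat_col].mean()   # share of actual positives flagged
    ppv = df.loc[df[yhat_col] == 1, 'y'].mean()   # share of flagged rows that are positive
    return tpr, ppv

# For a 0/1 column, the mean is the proportion of 1s
base_rate = demo['y'].mean()

demo['yhat_half'] = (demo['prediction'] >= 0.5).astype(int)
demo['yhat_base'] = (demo['prediction'] >= base_rate).astype(int)

tpr_half, ppv_half = tpr_ppv(demo, 'yhat_half')
tpr_base, ppv_base = tpr_ppv(demo, 'yhat_base')

print(base_rate)
print(tpr_half, ppv_half)
print(tpr_base, ppv_base)
```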


5) Sort the rows from highest to lowest `prediction`.

Compute the PPV among the **top 5** rows (treating “in the top 5” as being predicted positive).
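A minimal sketch of rank-based thresholding, assuming a made-up, deliberately unsorted table `demo` and a hypothetical cutoff `k = 4`:

```python
import pandas as pd

# Hypothetical demo table in arbitrary row order (made-up values, not the lab data)
demo = pd.DataFrame({
    'prediction': [0.25, 0.9, 0.42, 0.1, 0.8, 0.45, 0.6, 0.2, 0.40, 0.15],
    'y':          [0,    1,   1,    0,   1,   0,    0,   0,   0,    0],
})

k = 4  # hypothetical cutoff for this sketch

# Sort from highest to lowest score, then keep the first k rows
top_k = demo.sort_values('prediction', ascending=False).head(k)

# Everyone in top_k counts as predicted positive, so PPV is the share with y == 1
ppv_top_k = top_k['y'].mean()

print(top_k)
print(ppv_top_k)
```

With this rule the number of predicted positives is fixed at `k`, whatever the scores happen to be.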


# Part B — Maryland dataset: performance metrics

Now we’ll repeat the same workflow on a larger dataset.

`prediction` is a model score (a predicted probability of violent rearrest within 1 year).
`outcome_violent_rearrest` is what actually happened (0/1).


## Load the Maryland data

Column meanings (high level):
- `umd_id`: person identifier
- `outcome_violent_rearrest`: the true outcome `y` (0/1)
- `prediction`: predicted probability of violent rearrest (0–1)
- other columns: prior arrests/convictions and age


In [2]:
lab4_data = pd.read_csv('https://www.dropbox.com/scl/fi/0nhbo32tdw7mi2v2g0vd0/lab5_dataset.csv?rlkey=e0oevkrciv91r53rricdesq00&dl=1')
lab4_data.head()


| | umd_id | outcome_violent_rearrest | prediction | n_vio_convictions_last_4yrs | n_vio_convictions_last_180days | n_vio_arrests_last_4yrs | n_vio_arrests_last_180days | age_at_arrest |
|---|---|---|---|---|---|---|---|---|
| 0 | 5974815 | 0.0 | 0.084093 | 0.0 | 0.0 | 0.0 | 0.0 | 35.060274 |
| 1 | 5835352 | 1.0 | 0.420702 | 0.0 | 0.0 | 0.0 | 0.0 | 19.717808 |
| 2 | 7222541 | 0.0 | 0.124549 | 0.0 | 0.0 | 0.0 | 0.0 | 56.427397 |
| 3 | 5031868 | 1.0 | 0.181424 | 0.0 | 0.0 | 0.0 | 0.0 | 25.186301 |
| 4 | 3532116 | 0.0 | 0.114881 | 0.0 | 0.0 | 0.0 | 0.0 | 28.764384 |


**Sanity check:** run the next cell to see the shape.


In [3]:
lab4_data.shape


(10000, 8)

## B) Performance metrics on the Maryland dataset

We’ll compute TPR and PPV using two different threshold rules, then compare them.


6) Create a new column `y` that is a copy of `outcome_violent_rearrest`.


7) Create `yhat` that is `True` when `prediction >= 0.5` and `False` otherwise.

Then convert it to `yhat01` (0/1) using `map`.


8) Compute the joint distribution of `y` and `yhat01` (counts or probabilities).


9) Compute TPR and PPV using `y` and `yhat01`.


10) Compute the base rate `base_rate = P(y = 1)`.

Now create `yhat_new` that is `1` when `prediction >= base_rate` and `0` otherwise.

Compute TPR and PPV again, and compare to the threshold-0.5 version. In 1–2 sentences: why might TPR go up while PPV goes down?
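To see how both metrics move as the cutoff changes, one option is to tabulate them for several thresholds at once. The sketch below uses a made-up table `demo` and an arbitrary list of thresholds; both are assumptions for illustration, not the Maryland data.

```python
import pandas as pd

# Hypothetical demo table (made-up values, not the lab data)
demo = pd.DataFrame({
    'prediction': [0.9, 0.8, 0.6, 0.45, 0.42, 0.40, 0.25, 0.2, 0.15, 0.1],
    'y':          [1,   1,   0,   0,    1,    0,    0,    0,   0,    0],
})

rows = []
for threshold in [0.7, 0.5, 0.3]:
    yhat = (demo['prediction'] >= threshold).astype(int)
    rows.append({
        'threshold': threshold,
        'n_flagged': int(yhat.sum()),            # how many rows are predicted positive
        'TPR': yhat[demo['y'] == 1].mean(),      # P(yhat = 1 | y = 1)
        'PPV': demo.loc[yhat == 1, 'y'].mean(),  # P(y = 1 | yhat = 1)
    })

sweep = pd.DataFrame(rows)
print(sweep)
```

Reading the table from top to bottom shows what happens to `n_flagged`, TPR, and PPV as the threshold is lowered.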


11) Rank-based thresholding: sort `lab4_data` by `prediction` from highest to lowest.

Compute the PPV among the **top 500** rows (treat “top 500” as being predicted positive).


# Part C — Naive Bayes (a simple classification)

Now we’ll use Naive Bayes to classify whether someone is high risk (`y = 1`) or not (`y = 0`) based on two features.

We’ll compare two **unnormalized** scores:
- score for `y = 1`
- score for `y = 0`

and pick the bigger score.
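Written out for two binary features $x_1$ and $x_2$ (symbols introduced here for illustration), the two scores being compared take this form:

$$
\text{score}_1 = P(y = 1)\,P(x_1 \mid y = 1)\,P(x_2 \mid y = 1),
\qquad
\text{score}_0 = P(y = 0)\,P(x_1 \mid y = 0)\,P(x_2 \mid y = 0)
$$

They are called unnormalized because neither is divided by $P(x_1, x_2)$. That denominator is the same for both classes, so leaving it out does not change which score is bigger.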


## C) Naive Bayes tasks

We’ll use these two features:
- `young_adult` = 1 if age is between 18 and 25 (inclusive)
- `vio_conv_gt1` = 1 if `n_vio_convictions_last_4yrs > 1`


12) Create a new column `young_adult` using `between(18, 25)` on `age_at_arrest`.

Convert it to 0/1 if you want (it can be helpful for counting).
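A small sketch of how `between` behaves at the edges, using a made-up `ages` table (an assumption, not the lab data). Both endpoints are included by default.

```python
import pandas as pd

# Hypothetical ages chosen to probe the boundaries (not the lab data)
ages = pd.DataFrame({'age_at_arrest': [17.9, 18.0, 21.5, 25.0, 25.3, 40.2]})

# between(18, 25) is True when 18 <= age <= 25
ages['young_adult'] = ages['age_at_arrest'].between(18, 25)

# Optional 0/1 version, convenient for counting and averaging
ages['young_adult01'] = ages['young_adult'].astype(int)

print(ages)
```

Because ages here are fractional, a value such as 25.3 falls outside the range even though the person is still 25 years old in everyday terms.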


13) Create a new column `vio_conv_gt1` that is `True` when `n_vio_convictions_last_4yrs > 1` and `False` otherwise.

(Again, you can convert to 0/1 if helpful.)


14) Compute the pieces you need for the `y = 1` score:

- `P(young_adult = 1 | y = 1)`
- `P(vio_conv_gt1 = 1 | y = 1)`
- `P(y = 1)`

Then compute the score for a person with both features equal to 1: `score_1 = P(y = 1) * P(young_adult = 1 | y = 1) * P(vio_conv_gt1 = 1 | y = 1)`.


15) Do the same for `y = 0`, and compute `score_0`.


16) Classification: if `score_1 > score_0`, classify as `y = 1`; otherwise classify as `y = 0`.

In 1 sentence: what does Naive Bayes assume when it multiplies those conditional probabilities?
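The score comparison can be sketched end to end on a small made-up table. The table `nb` and the helper `nb_score` are hypothetical, and the scores are for a person with `young_adult = 1` and `vio_conv_gt1 = 1`.

```python
import pandas as pd

# Hypothetical 10-row table with the outcome and two 0/1 features (not the lab data)
nb = pd.DataFrame({
    'y':            [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'young_adult':  [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
    'vio_conv_gt1': [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})

def nb_score(df, label):
    """Unnormalized score for class `label`, for a person with both features equal to 1."""
    prior = (df['y'] == label).mean()              # P(y = label)
    group = df[df['y'] == label]                   # condition on y = label
    p_young = (group['young_adult'] == 1).mean()   # P(young_adult = 1 | y = label)
    p_conv = (group['vio_conv_gt1'] == 1).mean()   # P(vio_conv_gt1 = 1 | y = label)
    return prior * p_young * p_conv

score_1 = nb_score(nb, 1)
score_0 = nb_score(nb, 0)

# Pick the class with the bigger score
predicted_class = 1 if score_1 > score_0 else 0

print(score_1, score_0, predicted_class)
```

In this made-up table, `score_1` is `0.4 * 0.75 * 0.5` and `score_0` is `0.6 * (2/6) * (1/6)`, so the sketch classifies the person as `y = 1`.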
