# Population Metrics Configuration and Execution Tutorial

**Purpose:** Walk through configuring and running the `population_metrics` repo.

**What this notebook does:**
1. Shows current config settings
2. Validates data files exist
3. Runs population metrics
4. Displays results

## Part 1: Setup

In [1]:
import sys
import pandas as pd
from pathlib import Path
import subprocess

In [2]:
# Add population_metrics to path
project_root = Path.cwd().parent
pop_metrics_dir = project_root / "external_repos" / "population_metrics"
sys.path.insert(0, str(pop_metrics_dir))

## Part 2: View Current Configuration

In [3]:
# Import config
import config_custom as CFG

In [4]:
# Show file paths
print("File Paths:")
for key, value in CFG.PATHS.items():
    print(f"  {key}: {value}")

File Paths:
  demographics: ../../data/demographics.csv
  current_commitments: ../../data/current_commitments_clean.csv
  prior_commitments: ../../data/prior_commitments_clean.csv


In [5]:
# Show column mappings
print("\nColumn Mappings:")
for key, value in CFG.COLS.items():
    print(f"  {key}: {value}")


Column Mappings:
  id: cdcno
  current_sentence_months: aggregate sentence in months
  completed_months: None
  past_time_months: None
  offense_begin_date: offense begin date
  release_date: expected release date
  age_years: None
  offense_code: offense_clean
  offense_description: offense description
  offense_category: offense category
  in_prison: in prison


In [6]:
# Show offense lists
print("\nOffense Classifications:")
print(f"  Violent codes: {len(CFG.OFFENSE_LISTS.get('violent', []))} codes")
print(f"    Sample: {CFG.OFFENSE_LISTS.get('violent', [])[:5]}")
print(f"  Nonviolent codes: {len(CFG.OFFENSE_LISTS.get('nonviolent', []))} codes")
print(f"    Sample: {CFG.OFFENSE_LISTS.get('nonviolent', [])[:5]}")


Offense Classifications:
  Violent codes: 83 codes
    Sample: ['187', '190.05', '190.25', '190(d)', '190(c)']
  Nonviolent codes: 51 codes
    Sample: ['136.1', '210.5', '244', '245.2', '245.3']


In [7]:
# Show metric weights
print("\nMetric Weights:")
for metric, weight in CFG.METRIC_WEIGHTS.items():
    direction = "GOOD" if weight > 0 else "BAD"
    print(f"  {metric}: {weight:+.1f} ({direction})")


Metric Weights:
  desc_nonvio_curr: +1.0 (GOOD)
  desc_nonvio_past: +1.0 (GOOD)
  severity_trend: +1.0 (GOOD)
  age: -0.5 (BAD)
  freq_violent: -1.0 (BAD)
  freq_total: -0.5 (BAD)


## Part 3: Validate Data Files

In [8]:
# Check if files exist
print("Checking data files...\n")

for name, path in CFG.PATHS.items():
    full_path = pop_metrics_dir / path
    exists = full_path.exists()
    status = "✓" if exists else "✗"
    print(f"{status} {name}: {full_path}")
    
    if exists:
        df = pd.read_csv(full_path)
        print(f"  Rows: {len(df):,}")

Checking data files...

✓ demographics: C:\Users\gandh\PycharmProjects\PythonProject\external_repos\population_metrics\..\..\data\demographics.csv
  Rows: 95,476
✓ current_commitments: C:\Users\gandh\PycharmProjects\PythonProject\external_repos\population_metrics\..\..\data\current_commitments_clean.csv
  Rows: 369,125
✓ prior_commitments: C:\Users\gandh\PycharmProjects\PythonProject\external_repos\population_metrics\..\..\data\prior_commitments_clean.csv
  Rows: 191,436


## Part 4: Preview Data Structure

In [9]:
# Load demographics to check columns
demo_path = pop_metrics_dir / CFG.PATHS['demographics']
demographics = pd.read_csv(demo_path)

print("Demographics columns:")
for col in demographics.columns:
    print(f"  - {col}")

Demographics columns:
  - cdcno
  - ethnicity
  - controlling offense
  - description
  - offense begin date
  - offense end date
  - controlling case number
  - controlling case sentencing county
  - sentence type
  - aggregate sentence in months
  - offense category
  - eprd mepd month and year
  - current location
  - aggregate sentence in years
  - time served in years
  - expected release date


In [10]:
# Load current commits to check offense_clean column
current_path = pop_metrics_dir / CFG.PATHS['current_commitments']
current = pd.read_csv(current_path)

print("\nCurrent commits - key columns:")
key_cols = ['cdcno', 'offense', 'offense_clean', 'offense description']
existing_cols = [col for col in key_cols if col in current.columns]
print(current[existing_cols].head(3))


Current commits - key columns:
        cdcno        offense offense_clean  \
0  2cf2a233c4     VC10851(a)         10851   
1  5a72696541  PC191.5(c)(2)         191.5   
2  5a72696541      PC187 2nd       187 2ND   

                        offense description  
0                             Vehicle Theft  
1  Vehicular Manslaughter While Intoxicated  
2                                Murder 2nd  




## Part 5: Verify Offense Classification Will Work

In [11]:
# Check how many offenses will match violent/nonviolent lists
if 'offense_clean' in current.columns:
    violent_codes = set(CFG.OFFENSE_LISTS.get('violent', []))
    nonviolent_codes = set(CFG.OFFENSE_LISTS.get('nonviolent', []))
    
    current_codes = current['offense_clean'].dropna().unique()
    
    violent_matches = sum(1 for code in current_codes if str(code) in violent_codes)
    nonviolent_matches = sum(1 for code in current_codes if str(code) in nonviolent_codes)
    
    print(f"Offense code matching:")
    print(f"  Total unique offense codes in data: {len(current_codes)}")
    print(f"  Matches violent list: {violent_matches}")
    print(f"  Matches nonviolent list: {nonviolent_matches}")
    print(f"  Unclassified: {len(current_codes) - violent_matches - nonviolent_matches}")
else:
    print("Warning: offense_clean column not found!")

Offense code matching:
  Total unique offense codes in data: 463
  Matches violent list: 22
  Matches nonviolent list: 21
  Unclassified: 420
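
With 420 of 463 unique codes unclassified, it helps to see which unclassified codes are most common before running the metrics. One possible cause, not confirmed here, is a formatting mismatch: the data preview contains `187 2ND`, while the violent list sample contains the bare code `187`. The sketch below ranks unclassified codes by frequency; `top_unclassified` is a hypothetical helper, and the counts in `sample_codes` are toy data.

```python
from collections import Counter

def top_unclassified(codes, violent, nonviolent, n=10):
    """Return the n most frequent codes that appear in neither list."""
    known = set(violent) | set(nonviolent)
    counts = Counter(str(code) for code in codes if str(code) not in known)
    return counts.most_common(n)

# Toy data: code strings taken from the outputs above, frequencies invented.
sample_codes = ["187 2ND", "10851", "187 2ND", "187", "191.5", "10851", "187 2ND"]
violent = ["187", "190.05"]
nonviolent = ["136.1"]

for code, count in top_unclassified(sample_codes, violent, nonviolent, n=2):
    print(f"{code}: {count}")
```

In the notebook, `current['offense_clean'].dropna()` could be passed as `codes` to rank by row count instead of by unique value.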


## Part 6: Run Population Metrics (Demo Mode)

**Note:** This will take 30-60 seconds

In [14]:
# Run via subprocess to show progress
# This runs: python scripts/01b_run_population_metrics_fast.py

script_path = project_root / "scripts" / "01b_run_population_metrics_fast.py"

print("Running population metrics (demo mode)...\n")
print("This will process 5,000 random individuals\n")

# The script prompts for a run mode; input="1\n" answers the prompt with "1" (demo mode).
# sys.executable runs the script with the same interpreter as this notebook.
result = subprocess.run([sys.executable, str(script_path)], capture_output=True, text=True, input="1\n")
print(result.stdout)

print("To run outside the notebook, use a terminal:")
print(f"  python {script_path}")

Running population metrics (demo mode)...

This will process 5,000 random individuals


To run outside the notebook, use a terminal:
  python C:\Users\gandh\PycharmProjects\PythonProject\scripts\01b_run_population_metrics_fast.py
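
The recorded run printed nothing between the banner and the closing message, which means the script's captured stdout was empty. Because stderr and the return code are never shown, a failure inside the script would go unnoticed. The sketch below surfaces both; `run_with_input` is a hypothetical helper, and a one-line child process stands in for the real script so the example is self-contained.

```python
import subprocess
import sys

def run_with_input(args, stdin_text, timeout=300):
    """Run a command, feed it stdin_text, and return (returncode, stdout, stderr)."""
    result = subprocess.run(
        args, input=stdin_text, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stdout, result.stderr

# Stand-in child process that echoes the menu choice it reads from stdin.
child = [sys.executable, "-c", "print('mode', input())"]
code, out, err = run_with_input(child, "1\n")

if code != 0:
    print("Script failed:\n" + err)
else:
    print(out)
```

For the real run, `child` would be `[sys.executable, str(script_path)]`.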


## Alternative: Run Population Metrics Directly in Notebook

In [15]:
# Import the modules
import compute_metrics as cm
import sentencing_math as sm
from tqdm import tqdm
import random

In [16]:
# Load data
# cm.read_table appears to expect a string path: passing a pathlib.Path raised
# "AttributeError: 'WindowsPath' object has no attribute 'lower'", so convert with str().
print("Loading data...")
demo = cm.read_table(str(pop_metrics_dir / CFG.PATHS["demographics"]))
cur = cm.read_table(str(pop_metrics_dir / CFG.PATHS["current_commitments"]))
pri = cm.read_table(str(pop_metrics_dir / CFG.PATHS["prior_commitments"]))

print(f"Demographics: {len(demo):,} rows")
print(f"Current commits: {len(cur):,} rows")
print(f"Prior commits: {len(pri):,} rows")

In [17]:
# Get sample of IDs for demo
all_ids = demo[CFG.COLS["id"]].astype(str).unique().tolist()
random.seed(42)
sample_ids = random.sample(all_ids, min(1000, len(all_ids)))  # Small sample for notebook

print(f"Processing {len(sample_ids):,} individuals...")
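
The `'WindowsPath' object has no attribute 'lower'` error seen when loading data is typical of code that calls a string method on a path argument. The sketch below shows how a reader function can accept both `str` and `pathlib.Path` by normalizing with `os.fspath`. `file_extension` is a hypothetical helper for illustration; it is not the code inside `cm.read_table`, which is not shown in this notebook.

```python
import os
from pathlib import Path

def file_extension(path):
    """Return the lower-cased extension of a str or path-like object."""
    text = os.fspath(path)  # Path -> str; a str passes through unchanged
    return os.path.splitext(text)[1].lower()

print(file_extension(Path("data") / "demographics.CSV"))
print(file_extension("data/prior_commitments_clean.csv"))
```

Until the library accepts path objects, wrapping the argument in `str()` at the call site has the same effect.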

In [None]:
# Process each person
rows = []

for uid in tqdm(sample_ids[:100]):  # Just first 100 for demo
    try:
        feats, aux = cm.compute_features(uid, demo, cur, pri, CFG.OFFENSE_LISTS)
        present = feats.keys() & CFG.METRIC_WEIGHTS.keys()
        score = sm.suitability_score_named(feats, CFG.METRIC_WEIGHTS)
        score_out_of = sum(abs(CFG.METRIC_WEIGHTS[k]) for k in present)
        
        rows.append({
            CFG.COLS["id"]: uid,
            **feats,
            "score": score,
            "score_out_of": score_out_of
        })
    except Exception as e:
        print(f"Error processing {uid}: {e}")

# Convert to DataFrame
results_df = pd.DataFrame(rows)
print(f"\nProcessed {len(results_df):,} individuals successfully")
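
To make the `score` and `score_out_of` columns concrete, the sketch below works through one person by hand. The `score_out_of` part follows the loop above: it sums the absolute weights of the metrics that were actually computed. The `score` part assumes a plain weighted sum, which may differ from what `sm.suitability_score_named` really does, since that function's code is not shown here. `weighted_score` is a hypothetical name and the feature values are invented.

```python
import math

def weighted_score(feats, weights):
    """Return (score, out_of) over the metrics present in both dicts.

    ASSUMPTION: score is a plain weighted sum; the real library may differ.
    """
    present = feats.keys() & weights.keys()
    score = sum(weights[k] * feats[k] for k in present)
    out_of = sum(abs(weights[k]) for k in present)
    return score, out_of

# Weights copied from the config output; feature values are made up.
weights = {"desc_nonvio_curr": 1.0, "age": -0.5, "freq_violent": -1.0}
feats = {"desc_nonvio_curr": 1.0, "age": 0.5}  # freq_violent was not computed

score, out_of = weighted_score(feats, weights)
print(f"score={score:.2f} out of {out_of:.2f}")
```

Because `freq_violent` is missing for this person, its weight is excluded from `out_of`, so scores for people with different available metrics are measured against different maxima.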

## Part 7: View Results

In [None]:
# Load results (if you ran via script)
# OR use results_df from above if you ran in notebook

output_path = project_root / "outputs" / "population_metrics_demo.csv"

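if output_path.exists():
    results_df = pd.read_csv(output_path)
    print(f"Loaded results from: {output_path}")
elif "results_df" in globals():
    print("Using results from notebook execution")
else:
    # Neither source is available, so fail with a clear message instead of a NameError
    raise FileNotFoundError(f"No results found: run Part 6 first ({output_path} is missing)")

print(f"\nResults shape: {results_df.shape}")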

In [None]:
# Display first few rows
results_df.head()

In [None]:
# Summary statistics
results_df.describe()

In [None]:
# Distribution of suitability scores
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.hist(results_df['score'], bins=30, edgecolor='black')
plt.xlabel('Suitability Score')
plt.ylabel('Frequency')
plt.title('Distribution of Suitability Scores')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# Check for zeros (potential issue)
zero_scores = (results_df['score'] == 0).sum()
print(f"Individuals with score = 0: {zero_scores} ({zero_scores/len(results_df)*100:.1f}%)")

if zero_scores > len(results_df) * 0.5:
    print("\n⚠️ Warning: More than 50% have score = 0")
    print("This suggests offense classification may not be working properly")
else:
    print("\n✓ Score distribution looks reasonable")

## Part 8: Diagnostic - Look at Specific Person

In [None]:
# Pick a specific person to examine
id_col = CFG.COLS["id"]  # "cdcno" in the current config
example_id = results_df.iloc[0][id_col]
print(f"Examining person: {example_id}\n")

# Their metrics
person_metrics = results_df[results_df[id_col] == example_id]
print("Calculated metrics:")
print(person_metrics.T)

In [None]:
# Their raw data
print("\nRaw offense data:")
person_offenses = cur[cur[CFG.COLS['id']] == example_id][['offense_clean', 'offense description']]
print(person_offenses)

In [None]:
# Check if offenses matched classification
person_codes = person_offenses['offense_clean'].dropna().tolist()
violent_codes = set(CFG.OFFENSE_LISTS.get('violent', []))
nonviolent_codes = set(CFG.OFFENSE_LISTS.get('nonviolent', []))

print("\nClassification check:")
for code in person_codes:
    if str(code) in violent_codes:
        print(f"  {code}: VIOLENT")
    elif str(code) in nonviolent_codes:
        print(f"  {code}: NONVIOLENT")
    else:
        print(f"  {code}: UNCLASSIFIED")

## Summary

**What we did:**
1. ✓ Validated configuration settings
2. ✓ Checked data files exist
3. ✓ Verified offense classification
4. ✓ Ran population metrics
5. ✓ Reviewed results

**Next steps:**
- If scores look good → Run full dataset
- If many zeros → Debug offense classification
- Proceed to regression analysis