# INST414 — Lab 5: Feature Engineering for Event-Based Data

**What you’ll do today:** build features for a prediction problem using two tables: a **universe** table (one row per prediction) and an **events** table (many rows per person).

## Learning goals
By the end, you should be able to:
- Explain why we need a **universe** table (one row per prediction).
- Use `pd.get_dummies(...)` for **one-hot encoding**.
- Use `merge(..., how='left')` and verify the number of rows stays the same.
- Build a **time-window** feature using `pd.to_datetime(...)` and `pd.DateOffset(...)`.
- Use `groupby(...).size().reset_index(name=...)` to create count-based features.
- Handle missing values created by merges using `fillna(...)`.
- Clean up columns using `drop(...)`.

## How to work in this notebook
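- Run cells top-to-bottom. Several cells overwrite `universe` and `arrest_events`, so if something errors or you run a cell twice by accident, re-run from the data-loading cell in Part 1.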
- After every merge, do a quick sanity check: `df.shape` and `df.head()`.
- If a merge changes the number of rows in the universe table, stop and debug before moving on.
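
The row-count check above can be sketched on toy data. The table and column names below (`toy_universe`, `toy_features`, `some_feature`) are made up for illustration and are not part of the lab data. The example shows the usual reason a left merge adds rows: the right-hand table has more than one row for the same key.

```python
import pandas as pd

# Toy stand-ins for the lab's tables (all values are made up).
toy_universe = pd.DataFrame({"arrest_id": [1, 2, 3]})
toy_features = pd.DataFrame({
    "arrest_id": [1, 2, 2],          # arrest_id 2 appears twice on purpose
    "some_feature": [10, 20, 21],
})

rows_before = len(toy_universe)
merged = toy_universe.merge(toy_features, on="arrest_id", how="left")
rows_after = len(merged)

# A left merge keeps every left row, but duplicate keys on the right add rows.
print(rows_before, rows_after)       # 3 4

# Debugging step: find keys that appear more than once in the right table.
key_counts = toy_features["arrest_id"].value_counts()
duplicate_keys = key_counts[key_counts > 1]
print(duplicate_keys)
```

If you want pandas to stop you automatically, `merge` also accepts `validate="many_to_one"`, which raises an error when the right table's keys are not unique.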


**Before you start:** Click **File → Save a copy in Drive** so you have your own version of this notebook. If you skip this step, your work will not be saved.

**Turn off AI assistance:** Go to **Settings → AI Assistance** and uncheck everything. AI-generated code is not allowed on assignments in this course.

# Load modules and settings


In [1]:
# first thing is to import pandas
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = 20

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Part 1 — Load DataFrames (Universe + Events)

We’ll work with two DataFrames:
- **`universe`**: one row per arrest that will receive a prediction (this row count should stay fixed).
- **`arrest_events`**: an event log of arrests (multiple rows per person over time).

Most feature engineering in this lab follows the same pattern:
1) create/transform columns in `arrest_events` (or in a temporary merged table),
2) aggregate to one row per universe arrest,
3) **left-merge** the result into `universe`.


In [2]:
universe = pd.read_csv('https://www.dropbox.com/scl/fi/69syqjo6pfrt9123rubio/universe_lab6.feather?rlkey=h2gt4o6z9r5649wo6h6ud6dce&dl=1')
arrest_events = pd.read_csv('https://www.dropbox.com/scl/fi/wv9kthwbj4ahzli3edrd7/arrest_events_lab6.feather?rlkey=mhxozpazqjgmo6qqahc2vd0xp&dl=1')

# convert string dates to date types.
# filing date is the same thing as the arrest date
universe['filing_date'] = pd.to_datetime(universe.filing_date)
arrest_events['filing_date'] = pd.to_datetime(arrest_events.filing_date)

In [3]:
universe.head()

Unnamed: 0,arrest_id,person_id,filing_date,age_at_arrest,sex,race
0,7268817,25928,2016-04-06,59.336986,M,Black
1,5958672,1448386,2017-03-29,44.989041,M,Black
2,5014551,1571572,2017-10-14,39.394521,F,Black
3,3573863,126282,2018-06-20,36.465753,M,Black
4,5502020,883298,2017-07-16,18.547945,M,White


In [4]:
arrest_events.head()

Unnamed: 0,person_id,arrest_id,filing_date,charge_degree,offense_category
0,78786,3835604,2018-05-01,felony,property
1,78786,11999735,2018-05-01,felony,property
2,1064849,3442497,2018-07-12,misdemeanor,other
3,78786,4205882,2018-03-09,felony,property
4,78786,12006352,2018-03-09,felony,property


# Part 2 — Feature 1: Current Incident (Charge Degree)

A **current-incident** feature uses information about the arrest we are predicting *right now*.
Here we’ll convert a categorical column (`charge_degree`) into binary columns using one-hot encoding.
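
As a small illustration of what one-hot encoding does, here is a sketch on a three-row toy table (the values are made up). The `dtype=int` argument is an addition for this example: recent pandas versions return `True`/`False` columns by default, and `dtype=int` asks for 1/0 instead.

```python
import pandas as pd

# Toy table with one categorical column (values are made up).
toy = pd.DataFrame({
    "arrest_id": [1, 2, 3],
    "charge_degree": ["felony", "misdemeanor", "felony"],
})

# Each distinct value becomes its own 0/1 column named <column>_<value>.
# The original charge_degree column is removed from the result.
encoded = pd.get_dummies(toy, columns=["charge_degree"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the new columns, which is why the technique is called "one-hot".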

## One-hot encode `charge_degree`


In [5]:
arrest_events = pd.get_dummies(data=arrest_events, columns=['charge_degree'], dtype=int)
arrest_events.head()

Unnamed: 0,person_id,arrest_id,filing_date,offense_category,charge_degree_felony,charge_degree_misdemeanor
0,78786,3835604,2018-05-01,property,1,0
1,78786,11999735,2018-05-01,property,1,0
2,1064849,3442497,2018-07-12,other,0,1
3,78786,4205882,2018-03-09,property,1,0
4,78786,12006352,2018-03-09,property,1,0


## Merge into `universe` (left merge)

We merge on `arrest_id` because this feature is about the **current arrest**.

**Sanity check:** after the merge, `universe` should have the **same number of rows** as before.


In [6]:
universe = universe.merge(
    right = arrest_events[['arrest_id', 'charge_degree_felony']],
    on=['arrest_id'], 
    how='left')
universe.head()

Unnamed: 0,arrest_id,person_id,filing_date,age_at_arrest,sex,race,charge_degree_felony
0,7268817,25928,2016-04-06,59.336986,M,Black,1
1,5958672,1448386,2017-03-29,44.989041,M,Black,0
2,5014551,1571572,2017-10-14,39.394521,F,Black,0
3,3573863,126282,2018-06-20,36.465753,M,Black,0
4,5502020,883298,2017-07-16,18.547945,M,White,0


# Part 3 — Feature 2: Prior History (Arrests in the Last Year)

A **prior-history** feature uses information from *earlier* events to describe someone’s history at prediction time.
Our goal here: for each row in `universe`, count how many arrests that person had in the **year before** the `filing_date`.

## Create `num_arr_last_year`


In [7]:
temp_df = universe[['arrest_id', 'person_id', 'filing_date']].merge(
    arrest_events, on=['person_id'], how='left', suffixes=['_univ', '_arr']
)
temp_df.shape
temp_df.head()


(6519, 8)

Unnamed: 0,arrest_id_univ,person_id,filing_date_univ,arrest_id_arr,filing_date_arr,offense_category,charge_degree_felony,charge_degree_misdemeanor
0,7268817,25928,2016-04-06,10695699,2013-03-18,property,0,1
1,7268817,25928,2016-04-06,10584440,2013-04-26,property,0,1
2,7268817,25928,2016-04-06,5224389,2017-09-09,property,1,0
3,7268817,25928,2016-04-06,7268817,2016-04-06,violent,1,0
4,5958672,1448386,2017-03-29,11383129,2012-07-14,property,0,1


## Step 1 — Keep only arrests that happened before the universe arrest

We only want history that was available *before* the prediction date.
So we keep rows where the event date in `arrest_events` is **earlier** than the `filing_date` in `universe`.
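
The date comparison described above can be sketched on toy data (the dates are made up; the column names mirror the suffixed names from the merge). Note that a strict `<` also drops the event that falls on the universe date itself, which is how the current arrest gets excluded from its own history.

```python
import pandas as pd

# One universe date repeated against three event dates (values are made up).
toy = pd.DataFrame({
    "filing_date_univ": pd.to_datetime(["2018-06-20"] * 3),
    "filing_date_arr": pd.to_datetime(["2017-11-20", "2018-06-20", "2018-09-01"]),
})

# Comparing two datetime columns gives a True/False value per row.
is_earlier = toy["filing_date_arr"] < toy["filing_date_univ"]
print(is_earlier.tolist())   # [True, False, False]

# Using the mask inside [] keeps only the rows where it is True.
history = toy[is_earlier]
print(history)
```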


In [8]:
temp_df = temp_df[temp_df.filing_date_arr < temp_df.filing_date_univ]
temp_df.shape
temp_df.head()


(3045, 8)

Unnamed: 0,arrest_id_univ,person_id,filing_date_univ,arrest_id_arr,filing_date_arr,offense_category,charge_degree_felony,charge_degree_misdemeanor
0,7268817,25928,2016-04-06,10695699,2013-03-18,property,0,1
1,7268817,25928,2016-04-06,10584440,2013-04-26,property,0,1
4,5958672,1448386,2017-03-29,11383129,2012-07-14,property,0,1
6,5958672,1448386,2017-03-29,21440843,2012-04-09,property,0,1
7,5958672,1448386,2017-03-29,11566425,2012-05-15,property,0,1


## Step 2 — Keep only arrests within the last year

Now we restrict to a 1-year window using `pd.DateOffset(years=1)`.
This creates a feature that is easier to interpret (recent history) and avoids counting very old events.
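
To see what `pd.DateOffset(years=1)` does to a date, here is a small sketch with made-up dates. It also compares the offset with subtracting 365 days, which is an alternative not used in this lab; the two differ around leap years.

```python
import pandas as pd

univ_date = pd.Timestamp("2018-06-20")

# DateOffset moves the calendar year back by one and keeps month and day.
window_start = univ_date - pd.DateOffset(years=1)
print(window_start)          # 2017-06-20 00:00:00

# Around a leap day, a calendar year and 365 days are not the same thing.
leap_offset = pd.Timestamp("2016-02-29") - pd.DateOffset(years=1)
leap_days = pd.Timestamp("2016-02-29") - pd.Timedelta(days=365)
print(leap_offset, leap_days)

# Strict > means an event exactly one year earlier is NOT inside the window.
event_dates = pd.Series(pd.to_datetime(["2017-06-20", "2017-06-21", "2016-01-05"]))
in_window = event_dates > window_start
print(in_window.tolist())    # [False, True, False]
```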


In [9]:
temp_df = temp_df[temp_df.filing_date_arr > (temp_df.filing_date_univ - pd.DateOffset(years=1))]
temp_df.shape
temp_df.head()

(904, 8)

Unnamed: 0,arrest_id_univ,person_id,filing_date_univ,arrest_id_arr,filing_date_arr,offense_category,charge_degree_felony,charge_degree_misdemeanor
14,3573863,126282,2018-06-20,4778011,2017-11-20,other,0,1
15,3573863,126282,2018-06-20,4516013,2018-01-12,property,1,0
22,3573863,126282,2018-06-20,4762793,2017-11-24,property,0,1
29,3573863,126282,2018-06-20,4406759,2018-02-01,drug,0,1
33,5502020,883298,2017-07-16,6080238,2017-02-23,other,0,1


## Step 3 — Count arrests for each universe row

After filtering, we want **one row per universe arrest** with a count.
A common pattern is: `groupby(...)` → `.size()` → `.reset_index(name=...)`.
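
Here is the same pattern on a four-row toy table, split into two steps so you can see each intermediate result. The values and the count column name `num_arrests` are made up for illustration.

```python
import pandas as pd

# Three prior events for universe arrest 100, one for universe arrest 200.
toy = pd.DataFrame({
    "arrest_id_univ": [100, 100, 100, 200],
    "person_id": [7, 7, 7, 8],
    "arrest_id_arr": [11, 12, 13, 21],
})

# .size() counts rows per group and returns a Series indexed by the group keys.
sizes = toy.groupby(["arrest_id_univ", "person_id"]).size()
print(sizes)

# .reset_index(name=...) turns the keys back into columns and names the counts.
counts = sizes.reset_index(name="num_arrests")
print(counts)
```

The result has one row per group, which is the shape you need before merging back into a one-row-per-prediction table.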


In [10]:
temp_df.groupby(['arrest_id_univ', 'person_id']).size()

arrest_id_univ  person_id
2472356         242203       1
2500555         4931         1
2511968         358578       2
2532252         2582921      1
2547343         122092       2
                            ..
7563566         49472        4
7587118         997983       1
7591648         2326         2
7598842         1531743      1
12005804        1065717      1
Length: 373, dtype: int64

In [11]:
temp_df = temp_df.groupby(['arrest_id_univ', 'person_id']).size().reset_index(name="num_arr_last_year")
temp_df.shape
temp_df.head()

(373, 3)

Unnamed: 0,arrest_id_univ,person_id,num_arr_last_year
0,2472356,242203,1
1,2500555,4931,1
2,2511968,358578,2
3,2532252,2582921,1
4,2547343,122092,2


## Step 4 — Merge the count feature back into `universe`

We’ll left-merge the counts into the universe table.
If a universe row has **no prior arrests** in the last year, it won’t appear in the count table, so it will get `NaN` after the merge.

**Why `left_on` / `right_on`?**
Sometimes the key columns have different names after earlier merges (for example, suffixes like `_univ`).
`left_on` and `right_on` let you specify which columns should match.
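
A minimal sketch of `left_on` / `right_on` on toy tables (the values and the `num_arrests` column are made up). It also shows the two side effects discussed in this part: the differently named key from the right table is kept as an extra column, and left rows with no match get `NaN`.

```python
import pandas as pd

left = pd.DataFrame({"arrest_id": [100, 200, 300], "person_id": [7, 8, 9]})
right = pd.DataFrame({
    "arrest_id_univ": [100, 200],
    "person_id": [7, 8],
    "num_arrests": [3, 1],
})

# Keys are matched pairwise: arrest_id <-> arrest_id_univ, person_id <-> person_id.
merged = left.merge(
    right,
    left_on=["arrest_id", "person_id"],
    right_on=["arrest_id_univ", "person_id"],
    how="left",
)
print(merged)

# Arrest 300 has no row in `right`, so its right-hand columns are NaN.
print(merged["num_arrests"].isna().tolist())   # [False, False, True]
```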


In [12]:
universe.columns
temp_df.columns

Index(['arrest_id', 'person_id', 'filing_date', 'age_at_arrest', 'sex', 'race',
       'charge_degree_felony'],
      dtype='object')

Index(['arrest_id_univ', 'person_id', 'num_arr_last_year'], dtype='object')

In [13]:
universe = universe.merge(
    right=temp_df,
    left_on=['arrest_id','person_id'],
    right_on=['arrest_id_univ', 'person_id'],
    how='left')
universe.shape
universe.head()

(1000, 9)

Unnamed: 0,arrest_id,person_id,filing_date,age_at_arrest,sex,race,charge_degree_felony,arrest_id_univ,num_arr_last_year
0,7268817,25928,2016-04-06,59.336986,M,Black,1,,
1,5958672,1448386,2017-03-29,44.989041,M,Black,0,,
2,5014551,1571572,2017-10-14,39.394521,F,Black,0,,
3,3573863,126282,2018-06-20,36.465753,M,Black,0,3573863.0,4.0
4,5502020,883298,2017-07-16,18.547945,M,White,0,5502020.0,1.0


## Step 5 — Replace missing counts with 0

After the merge, missing values mean: *we didn’t find any qualifying prior arrests*.
For a count feature, that should be `0`, not `NaN`.
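
A short sketch of the fill step on a toy Series (values are made up). The `astype(int)` call is an optional extra that is not part of the lab steps: a column that contains `NaN` is stored as floats, which is why counts show up as `4.0`, and once the missing values are filled you can convert back to whole numbers if you prefer.

```python
import pandas as pd

# Counts after a left merge: NaN where no match was found.
counts = pd.Series([float("nan"), 4.0, 1.0], name="num_arrests")
print(counts.dtype)          # float64

# fillna returns a new Series; assign it back to keep the result.
filled = counts.fillna(0)
print(filled.tolist())       # [0.0, 4.0, 1.0]

# Optional: with no NaN left, the values can be stored as integers.
filled_int = filled.astype(int)
print(filled_int.tolist())   # [0, 4, 1]
```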


In [14]:
universe['num_arr_last_year'] = universe['num_arr_last_year'].fillna(value=0)
universe.head()

Unnamed: 0,arrest_id,person_id,filing_date,age_at_arrest,sex,race,charge_degree_felony,arrest_id_univ,num_arr_last_year
0,7268817,25928,2016-04-06,59.336986,M,Black,1,,0.0
1,5958672,1448386,2017-03-29,44.989041,M,Black,0,,0.0
2,5014551,1571572,2017-10-14,39.394521,F,Black,0,,0.0
3,3573863,126282,2018-06-20,36.465753,M,Black,0,3573863.0,4.0
4,5502020,883298,2017-07-16,18.547945,M,White,0,5502020.0,1.0


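You can also fill the column in place by calling `fillna` on the whole DataFrame with a dictionary that maps column names to fill values, plus `inplace=True`. Avoid the chained form `universe['num_arr_last_year'].fillna(0, inplace=True)`: recent pandas versions warn about it, and with Copy-on-Write enabled it does not update `universe`.


In [15]:
universe.fillna(value={'num_arr_last_year': 0}, inplace=True)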

## Step 6 — Drop columns you don’t need

Merges can create extra key columns (like `arrest_id_univ`).
If you don’t need them anymore, drop them to avoid confusion later.
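
One way to avoid the extra key column in the first place is to rename the key in the count table before merging, so both tables share the same key names. This is an alternative sketch on toy data (values and the `num_arrests` column are made up), not the approach used in the cells in this part.

```python
import pandas as pd

left = pd.DataFrame({"arrest_id": [100, 200], "person_id": [7, 8]})
counts = pd.DataFrame({
    "arrest_id_univ": [100],
    "person_id": [7],
    "num_arrests": [3],
})

# Give the key the same name on both sides, then merge with on=...
counts_renamed = counts.rename(columns={"arrest_id_univ": "arrest_id"})
merged = left.merge(counts_renamed, on=["arrest_id", "person_id"], how="left")

# No leftover arrest_id_univ column, so there is nothing to drop afterwards.
print(list(merged.columns))   # ['arrest_id', 'person_id', 'num_arrests']
```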


In [16]:
universe.columns
universe.drop(columns=['arrest_id_univ'], inplace=True)
universe.columns

Index(['arrest_id', 'person_id', 'filing_date', 'age_at_arrest', 'sex', 'race',
       'charge_degree_felony', 'arrest_id_univ', 'num_arr_last_year'],
      dtype='object')

Index(['arrest_id', 'person_id', 'filing_date', 'age_at_arrest', 'sex', 'race',
       'charge_degree_felony', 'num_arr_last_year'],
      dtype='object')

# Lab Tasks

Complete the tasks below by filling in the blank code cells.


## Task 1 — One-hot encode offense categories (current incident)

We’ll create binary indicator columns from `offense_category` and merge them into `universe`.
This is the same core idea as `charge_degree`, just with a different categorical column.


1) Left-merge the following offense categories into `universe`:
- `drug`
- `property`
- `violent`

We won’t merge `other` because it would be redundant: if all three above are 0, then the category is `other`.


2) After the merge, how many rows are in `universe`? (It should not change.)


3) What share (proportion) of universe arrests are felonies?


## Task 2 — Prior history: property arrests in the last 2 years


Your goal is to create a new feature called `num_prop_arr_last_2yrs`:

- Build a temporary DataFrame by merging relevant columns from `universe` and `arrest_events`.
- Use `suffixes` so you can tell the two dates apart.
- Then filter + count + merge back into `universe`.


1) Keep only rows where the event date is **before** the universe date.


2) Keep only rows where the event date is within the **2 years before** the universe date.


3) Keep only rows where `offense_category` is `property`.


4) Use `groupby` + `size` + `reset_index` to create a DataFrame with the count column named `num_prop_arr_last_2yrs`.


5) Left-merge the count feature into `universe`.


6) Replace missing values in `num_prop_arr_last_2yrs` with 0.


7) What is the average number of property arrests in the last 2 years (mean of `num_prop_arr_last_2yrs`)?
