## 🛠️ Mod5 Data Challenge 2: Feature Engineering (Interaction Terms)


**Goal:** Practice designing and interpreting *interaction features* that might help stakeholders (Ops Manager, Borough Director, Mayor’s Office) understand and act on 311 performance.

**Structure**
- Instructor: list candidate interaction features; build *one* together
- Students: build more interactions, each tied to a stakeholder need
- Wrap-up: talk through explainability, complexity, and trade-offs


### Data
Use the **nyc311.csv** file located in your GitHub repo's `data` folder within Mod5/DataChallenges. This is a sample of the original file covering just one week of data, since the full dataset is HUGE. Read more about the columns [HERE](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/about_data).

### 👩‍🏫 Instructor-Led Demo (15 minutes) -- FOLLOW ALONG

#### Step 1:  Prep the Data 
* You have seen most of this code already! Take **4 mins** and read through it AGAIN so you know what is stored in each boolean variable (e.g., what counts as "high priority").

* Read in data, and run the next cell AS IS!  (This is creating our features used for "interaction terms")

In [None]:
import pandas as pd
import numpy as np

df = None

In [None]:
# RUN THIS CELL WITHOUT CHANGES 

# Helper: robust tz conversion
LOCAL_TZ = "America/New_York"

def to_utc(series, local_tz=LOCAL_TZ):
    s = pd.to_datetime(series, errors="coerce")
    if s.dt.tz is None:
        s = s.dt.tz_localize(local_tz, nonexistent="shift_forward", ambiguous="NaT")
    return s.dt.tz_convert("UTC")

# Identify likely datetime columns
candidate_created = [c for c in df.columns if "Created" in c and "Date" in c]
candidate_closed  = [c for c in df.columns if "Closed"  in c and "Date" in c]
if not candidate_created or not candidate_closed:
    raise KeyError("Could not find 'Created Date' and 'Closed Date' columns. Rename or update detection logic.")

CREATED_COL, CLOSED_COL = candidate_created[0], candidate_closed[0]

# Drop nulls, convert to tz-aware UTC
df = df.dropna(subset=[CREATED_COL, CLOSED_COL]).copy()
df[CREATED_COL] = to_utc(df[CREATED_COL])
df[CLOSED_COL]  = to_utc(df[CLOSED_COL])
df = df.dropna(subset=[CREATED_COL, CLOSED_COL])

# Compute response time (hrs)
delta = df[CLOSED_COL] - df[CREATED_COL]
df["response_time_hrs"] = delta.dt.total_seconds() / 3600

# Base temporal features
df["hour_of_day"] = df[CREATED_COL].dt.hour
df["weekday"]     = df[CREATED_COL].dt.weekday  # 0=Mon
df["is_weekend"]  = df["weekday"] >= 5
df["is_night"]    = df["hour_of_day"].isin([0,1,2,3,4,5])
df["is_peak_commute"] = df["hour_of_day"].isin([7,8,9,16,17,18,19])

# Complaint text normalization (helps reproducibility)
if "Complaint Type" in df.columns:
    df["complaint_norm"] = (df["Complaint Type"].astype(str)
                            .str.normalize("NFKC")
                            .str.strip()
                            .str.casefold())
else:
    df["complaint_norm"] = ""

# High-priority complaint flag (example list)
priority_list = {"heat/hot water","electric","elevator","structural","gas","sewer","water system"}
df["is_high_priority"] = df["complaint_norm"].isin(priority_list)

# Borough normalization + a borough flag (example)
if "Borough" in df.columns:
    df["borough_norm"] = df["Borough"].astype(str).str.strip().str.title()
    example_borough = "Brooklyn"
    df["is_brooklyn"] = df["borough_norm"].eq(example_borough)
else:
    df["borough_norm"] = ""
    df["is_brooklyn"] = False

# Clean negatives & NaNs for response_time_hrs
df = df[df["response_time_hrs"] >= 0].dropna(subset=["response_time_hrs"]).copy()

df.head(3)


#### Step 2:  Brainstorm Interaction Features using the Boolean Columns Created Above 


Type out several interactions **we haven't built in the code-along**, then we will create **#1** together.

1. `is_high_priority × is_weekend` → priority complaints that **arrive on weekends** (we will build this one together)


We'll **build #1** together and discuss how a stakeholder could use it.

In [None]:
# Binary interaction (1 if high-priority AND weekend)
df["int_highprio_weekend"] = None

# Quick view and a sanity check
display(df[["complaint_norm","is_high_priority","is_weekend","int_highprio_weekend"]].head(8))

# What does this code do? 
print("Share of records that are high-priority weekend:",
      df["int_highprio_weekend"].mean().round(4))

#### Step 3: Aggregate the new feature against response time

In [None]:
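# Compare average response time: high-priority weekend vs others
# Replace the first None with the grouping column and the second with the column to average
agg = (df.groupby(None)[None]
         .mean()
         # Relabel the 0/1 group keys; no reset_index here, since that would discard these labels
         .rename({0: "other", 1: "highprio_weekend"}))
agg.to_frame("avg_response_time_hrs")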


#### Side Note: Why create the `int_highprio_weekend` feature in the first place? (The use case)

- **Definition:** 1 if the complaint is both **high-priority** and submitted on a **weekend**, else 0.
- **Why it matters:** Ops can check if **high-priority weekend** cases have longer response times or need different staffing.
- **Stakeholder example:** The **Operations Manager** could propose adding a weekend on-call rota for elevator/electric emergencies if these cases show longer average response time.
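
To make the definition above concrete, here is a minimal sketch on a made-up six-row table (the values are invented for illustration and are not taken from nyc311.csv). It assumes the same column names used in this notebook and builds the flag with a logical AND, which gives the same result as multiplying the two 0/1 columns.

```python
import pandas as pd

# Toy data, invented for illustration only (not from nyc311.csv)
toy = pd.DataFrame({
    "is_high_priority":  [True, True, False, False, True, False],
    "is_weekend":        [True, False, True, False, True, False],
    "response_time_hrs": [10.0, 4.0, 3.0, 2.0, 14.0, 5.0],
})

# Binary interaction: 1 only when BOTH flags are True
toy["int_highprio_weekend"] = (toy["is_high_priority"] & toy["is_weekend"]).astype(int)

# Average response time per group, with readable labels on the 0/1 keys
toy_agg = (toy.groupby("int_highprio_weekend")["response_time_hrs"]
              .mean()
              .rename({0: "other", 1: "highprio_weekend"}))
print(toy_agg)
```

In this toy table the two high-priority weekend rows average 12.0 hours against 3.5 hours for everything else. A gap like that in the real data is the kind of evidence an Operations Manager could bring to a staffing discussion.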


### 👩‍💻 Student-Led Section (20 minutes)

**Goal:** Create and interpret *interaction features* that would help a stakeholder you choose (e.g., Ops Manager, Borough Director).

**Rules**
- The instructor built one interaction; you will build **others**.
- For each interaction you create, add **one sentence** explaining how a stakeholder could use it.

**Deliverables**
- At least **2 new interactions**
- A short **stakeholder note** under each (1–2 sentences)


#### Task 1:  Create Interaction #1 

**Goal:** Build a binary interaction using two booleans we already have (e.g., `is_weekend × is_night`, or `is_high_priority × is_peak_commute`).

**Stakeholder note (1–2 sentences):** Explain how your chosen stakeholder could use this interaction to make a decision.


In [None]:
# Example pattern (replace None):
new_col_name_1 = None  # e.g., "int_highprio_peakcommute"
left_bool_1    = None  # e.g., df["is_high_priority"]
right_bool_1   = None  # e.g., df["is_peak_commute"]

df[new_col_name_1] = (left_bool_1.astype(int) * right_bool_1.astype(int))
df[[new_col_name_1]].mean().round(4)  # quick share check

#### Task 2: Create Interaction #2

**Goal:** Build a **borough-specific** interaction (e.g., `is_brooklyn × is_high_priority` or `is_brooklyn × is_weekend`).

**Stakeholder note (1–2 sentences):** Explain how a **Borough Director** could use this to prioritize crews or adjust on-hand maintenance staff.


In [None]:
# Example pattern (replace None):
new_col_name_2 = None   # e.g., "int_brooklyn_weekend"
left_bool_2    = None   # e.g., df["is_brooklyn"]
right_bool_2   = None   # e.g., df["is_weekend"]

df[new_col_name_2] = (left_bool_2.astype(int) * right_bool_2.astype(int))
df[[new_col_name_2]].mean().round(4)

#### Task 3: Aggregate Check

Compute and show the average `response_time_hrs` grouped by **each new interaction** you created.  
Add a 1‑sentence takeaway for each (e.g., “Group=1 is slower by ~3.2 hours → staffing gap.”)


In [None]:
# Replace None with your new column names from Tasks 1 & 2
for col in [None, None]:  # e.g., ["int_highprio_peakcommute", "int_brooklyn_weekend"]
    grp = (df.groupby(col)["response_time_hrs"].mean().round(2))
    print(f"\n=== {col} ===")
    print(grp)

#### Reflection (2–5 sentences)
Pick **one** of your interactions and explain:

- Why it is useful to your chosen stakeholder
- One risk or bias it might introduce (e.g., masking weekday patterns)
- A next step to validate it (e.g., compare across months, run an A/B on staffing)
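
One quick validation step, sketched here on invented numbers, is to look at the group size and the median alongside the mean before drawing a conclusion. The column name `int_example` is a hypothetical stand-in for whichever interaction you built.

```python
import pandas as pd

# Toy data, invented for illustration: the flagged group has a single record
toy = pd.DataFrame({
    "int_example":       [0, 0, 0, 0, 0, 1],
    "response_time_hrs": [2.0, 3.0, 4.0, 3.0, 3.0, 40.0],
})

# Count, mean, and median per group: a tiny count means the mean is not reliable
summary = (toy.groupby("int_example")["response_time_hrs"]
              .agg(["count", "mean", "median"]))
print(summary)
```

Here the flagged group looks far slower, but it contains only one record, so the gap could come from a single unusual case. With one week of data, checking `count` first helps avoid over-reading a small group.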


### 📣 Class Share-Out & Instructor Wrap-Up (15 minutes)

Be ready to have the students share out the following points with the class: 

**Explain:**
Your response to the reflection question above 

#### Instructor Wrap-Up: Explainability & Complexity (Notes)

- **Interpretability:** Simple binary interactions are clearer to stakeholders than complex numeric × numeric terms.
- **Complexity vs. Performance:** More interactions can improve model accuracy but may reduce explainability (more on this in Mod 6).
- **Scaling:** If you use numeric × numeric interactions, consider scaling before modeling. We didn't do that here, BUT you saw how to scale in the code-along.
- **Correlation & Leakage:** Interactions can be correlated with each other and with the features they are built from! This can cause problems for modeling, so keep it in mind (more on this in Mod 6).
- **DECISION MAKING:** Remember our DATA IN CONTEXT area: this notebook helped you drive decision making by going beyond the surface of the data and creating features and interaction terms that MATTER!
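
The scaling and correlation notes above can be explored with a small sketch. It uses invented numbers, and `open_requests` is a hypothetical numeric column that does not exist in this notebook's data. The z-score helper is written by hand here as one possible way to scale, not necessarily the method from the code-along.

```python
import pandas as pd

# Toy data, invented for illustration; "open_requests" is a hypothetical column
toy = pd.DataFrame({
    "hour_of_day":   [1, 5, 9, 13, 17, 21],
    "open_requests": [120, 80, 300, 260, 340, 150],
})

def zscore(s):
    """Standardize a numeric Series to mean 0 and (population) std 1."""
    return (s - s.mean()) / s.std(ddof=0)

toy["hour_z"] = zscore(toy["hour_of_day"])
toy["open_z"] = zscore(toy["open_requests"])

# Numeric x numeric interaction, built from raw values and from scaled values
toy["int_raw"]    = toy["hour_of_day"] * toy["open_requests"]
toy["int_scaled"] = toy["hour_z"] * toy["open_z"]

# Correlation of each interaction with the features it was built from
corr = toy[["hour_of_day", "open_requests", "int_raw", "int_scaled"]].corr().round(2)
print(corr)
```

Reading the `int_raw` and `int_scaled` rows of the table shows how strongly each version tracks its parent features, which is the kind of overlap to watch for before modeling.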
