# Data Challenge 9 — Feature Engineering & Feature Selection

**Format:** Instructor Guidance → You Do (Students) → We Share (Reflection)

**Goal:** Engineer better predictors (one-hot/dummies, interactions, polynomials), avoid unnecessary complexity, and compare a **Base** vs **Engineered** model on the **same train–test split** using **MAE/RMSE**. Interpret coefficients in units and explain business value.



> Dataset: **NYC Yellow Taxi — Dec 2023** (CSV). Keep code *simple*: light numeric coercion only for your chosen columns.

## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

**Docs (quick links):**
- One-hot encoding (pandas): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html  
- OneHotEncoder (sklearn): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html  
- Train/Test Split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html  
- MAE / MSE / RMSE: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html  
- OLS (statsmodels): https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html  
- OLS Results (coef/p/CIs/resid): https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html

### Pseudocode Plan (Feature Engineering + Selection)
1) **Load CSV** → preview columns/shape.  
2) **Pick Y and initial Xs (2–3 numeric)** → keep it simple and decision-time-available.  
3) **Engineer features:**
   - **One-hot** a categorical with a dropped baseline (e.g., `payment_type` or `weekday/weekend`).  
   - **Interaction**: choose a hypothesis-driven pair (e.g., `trip_distance × is_weekend`).  
   - **Polynomial**: add one squared term for a plausible curve (e.g., `trip_distance²`).  
4) **Build Base vs Engineered design matrices** (add intercept).  
5) **Single train–test split** (80/20, fixed `random_state`) shared by both models.  
6) **Fit on TRAIN**, **predict on TEST** for both models; compute **MAE/RMSE** (units of Y).  
7) **Interpretation**: write unit-based coefficient sentences; note baseline category for dummies.  
8) **Light selection**: if the Engineered model doesn’t beat Base on TEST (or adds complexity without adding value), prefer Base.  
9) **Diagnostics (quick)**: residuals vs fitted (train); note any cones (heteroskedasticity).  
10) **Stakeholder one-liner**: which model, why (TEST metrics in units), and what the added features *mean*.


## You Do — Student Section
Work in pairs. Comment your choices briefly. Keep code simple—only coerce the columns you use.

### Step 0 — Setup & Imports

In [1]:
import pandas as pd, numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
import scipy.stats as stats
from pathlib import Path

pd.set_option('display.float_format', lambda x: f'{x:,.4f}')

### Step 1 — Load CSV & Preview
- Point to your **Dec 2023** taxi CSV.
- Print **shape** and **columns**.

**Hint: You may need to drop missing values and coerce your chosen columns to numeric so they stay numeric (earlier coding assignments may help)**

In [2]:
path = '/Users/Marcy_Student/Desktop/Marcy-Modules/marcy-git/DA2025_Lectures/Mod6/data/2023_Yellow_Taxi_Trip_Data_20251015.csv'
df = pd.read_csv(path)
print(df.shape)
print(df.keys())



(3310907, 19)
Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')


### Step 2 —  Pick Target **Y** and Predictors **Xs** (choose 2–3 numeric)

- **Avoid** using an X that directly defines Y (e.g., `total_amount` when Y = `fare_amount`).
- Coerce **only these columns** to numeric; drop NA rows.

In [3]:
# Coerce only the modeling columns to numeric; values that cannot be parsed become NaN
num_cols = ['fare_amount', 'tip_amount', 'trip_distance', 'passenger_count']
for c in num_cols:
    df[c] = pd.to_numeric(
        df[c].astype(str).str.strip().str.replace(r'[^0-9.+\-eE]', '', regex=True),
        errors='coerce'
    )

In [None]:
# Keep plausible trips: 0 < tip <= 1000, 0 < distance <= 560 miles, extra >= 0
tip_ok = (df['tip_amount'] > 0) & (df['tip_amount'] <= 1000)
dist_ok = (df['trip_distance'] > 0) & (df['trip_distance'] <= 560)
df = df[tip_ok & dist_ok & (df['extra'] >= 0)]

### Step 3 —  Engineer New Features (One-hot, Interaction, Polynomial)

Pick **one** categorical to one-hot (drop baseline). Options that usually exist:

- `payment_type` (codes): treat as categorical strings for clarity, then one-hot with drop_first=True, or  
- derive **weekday/weekend** from `tpep_pickup_datetime` if present.

Then add **one interaction** and **one squared term** guided by a business hypothesis.
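
As a rough sketch of the first option, the snippet below one-hot encodes a payment-type column with `pandas.get_dummies`. It assumes a tiny made-up frame (`toy`) rather than the taxi file, and the `pay` prefix is just an illustrative naming choice.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi file (codes are illustrative).
toy = pd.DataFrame({
    "payment_type": [1, 2, 1, 4, 2],
    "trip_distance": [0.7, 3.0, 2.2, 0.3, 1.5],
})

# Cast the codes to strings so they are treated as categories, not quantities.
toy["payment_type"] = toy["payment_type"].astype(str)

# drop_first=True drops the first category in sorted order ("1"),
# which then serves as the baseline for the remaining dummies.
dummies = pd.get_dummies(toy["payment_type"], prefix="pay", drop_first=True, dtype=int)
toy_encoded = pd.concat([toy, dummies], axis=1)
print(toy_encoded)
```

Each remaining dummy coefficient would then be read relative to the dropped category.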

In [None]:
# Practice with OneHotEncoder(); the models below use is_weekend instead of these dummies

ohe = OneHotEncoder(drop='first', sparse_output=False)
feature_array = ohe.fit_transform(df[['payment_type']])
feature_labels = list(ohe.get_feature_names_out())
dummies_skl = pd.DataFrame(feature_array, columns=feature_labels)

# Concatenate back to the original dataframe
df_with_skl = pd.concat([df.reset_index(drop=True), dummies_skl.reset_index(drop=True)], axis=1)
print("\n--- Data with sklearn.OneHotEncoder ---")
print(df_with_skl.head())


--- Data with sklearn.OneHotEncoder ---
   VendorID    tpep_pickup_datetime   tpep_dropoff_datetime  passenger_count  \
0         2  12/01/2023 04:11:39 PM  12/01/2023 04:19:13 PM           2.0000   
1         1  12/01/2023 04:11:39 PM  12/01/2023 04:34:39 PM           2.0000   
2         2  12/01/2023 04:11:40 PM  12/01/2023 04:28:50 PM           6.0000   
3         1  12/01/2023 04:11:41 PM  12/01/2023 04:14:35 PM           1.0000   
4         2  12/01/2023 04:11:41 PM  12/01/2023 04:28:34 PM           1.0000   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0         0.6900      1.0000                  N           141           140   
1         3.0000      1.0000                  N           164           211   
2         2.1500      1.0000                  N           238            48   
3         0.3000      1.0000                  N           163           161   
4         1.4700      1.0000                  N           137           229   

   

In [23]:
# Derive day of week from pickup time (Monday=0 ... Sunday=6); is_weekend is built from it next
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['weekday'] = df['tpep_pickup_datetime'].dt.weekday

df['weekday'].value_counts()

weekday
5    409057
4    391492
3    361469
2    352651
6    334953
1    330441
0    265981
Name: count, dtype: int64

In [35]:
# Flag Saturday (5) and Sunday (6) as weekend; weekday (0) is the baseline
df['is_weekend'] = df['weekday'].isin([5, 6]).astype(int)
df['is_weekend'].value_counts()

is_weekend
0    1702034
1     744010
Name: count, dtype: int64

In [37]:
# Interaction is_weekend * trip_distance
df['weekend_trip_distance'] = df['is_weekend'] * df['trip_distance']

In [38]:
# Polynomial term trip_distance squared
df['trip_distance_sq']  = df['trip_distance']**2
df['trip_distance']

0          0.6900
4          3.0000
7          2.1500
9          0.3000
10         1.4700
            ...  
3310899    2.7000
3310902    1.7000
3310903   21.6000
3310905    0.0100
3310906   16.6700
Name: trip_distance, Length: 2446044, dtype: float64

### Step 4 — Build **Base** and **Engineered** Design Matrices

- **Base** = intercept + base predictors (Xs you assigned in Step 2) 
- **Engineered** = intercept + base predictors + engineered columns (dummies + interaction + polynomial)


In [40]:
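# Base = intercept + base predictors; Engineered = Base + dummy, interaction, squared term
base_cols = ['trip_distance', 'passenger_count']
eng_cols = base_cols + ['is_weekend', 'weekend_trip_distance', 'trip_distance_sq']
model_df = df.dropna(subset=['fare_amount'] + eng_cols)  # OLS cannot fit rows with NaN
X1, X2 = sm.add_constant(model_df[base_cols]), sm.add_constant(model_df[eng_cols])
y = model_df['fare_amount']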

### Step 5 — Single Train–Test Split (Shared by Both Models)

Use one split so Base and Engineered are comparable.

In [41]:
# One call splits X1, X2, and y on identical rows, so both models share the same split
x_train, x_test, x_train2, x_test2, y_train, y_test = train_test_split(
    X1, X2, y, test_size=0.2, random_state=42
)

### Step 6 — Fit on TRAIN, Predict on TEST, Compute **MAE/RMSE** (units of Y)

In [10]:
None

### Step 7 — Interpret Key Coefficients (Plain Language)

Write **unit-based** interpretations for 2–3 impactful coefficients **in the Engineered model**, noting:
- The **baseline** category for dummies (the dropped category).
- **Interaction** meaning (change in slope under the condition).
- **Polynomial** meaning (curve: does effect rise then taper?).


*(Use this template; edit to your variables/units):*

- **Dummy (pay_…):** Compared to baseline **[dropped category]**, the expected **Y** is **β** higher/lower, holding other features constant.  
- **Interaction (dist×weekend):** On weekends, each additional **mile** changes **Y** by **β_interaction** *more/less* than on weekdays, holding other features constant.  
- **Polynomial (distance²):** The marginal effect of distance changes with distance; the negative/positive β on distance² indicates **diminishing/increasing** returns.
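
To make the interaction and polynomial templates concrete, here is a small sketch that combines them into one "extra dollars per extra mile" number. The coefficient values are hypothetical placeholders, not estimates from the taxi data; substitute the ones from your own Engineered model.

```python
def marginal_effect_of_distance(b_dist, b_interaction, b_sq, distance, is_weekend):
    """Slope of predicted Y with respect to distance at a given point.

    For Y = b0 + b_dist*d + b_w*w + b_interaction*(d*w) + b_sq*d**2,
    the derivative is dY/dd = b_dist + b_interaction*w + 2*b_sq*d.
    """
    return b_dist + b_interaction * is_weekend + 2 * b_sq * distance

# Hypothetical coefficients, for illustration only (not fitted to the taxi data).
b_dist, b_interaction, b_sq = 3.0, 0.5, -0.02

for d in (1, 10):
    for w in (0, 1):
        slope = marginal_effect_of_distance(b_dist, b_interaction, b_sq, d, w)
        label = "weekend" if w else "weekday"
        print(f"{d:>2} mi, {label}: one more mile adds about ${slope:.2f}")
```

With a negative coefficient on the squared term, the per-mile effect shrinks as trips get longer, and the interaction shifts the slope up or down on weekends.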

### Step 8 —  Quick Diagnostics (Train Residuals) — Engineered Model
- **Residuals vs Fitted:** random cloud ≈ good; cone/funnel suggests non-constant variance.  
- **Q–Q plot:** points roughly along diagonal (normality for inference).  
- **Durbin–Watson:** printed in `eng_model.summary()` (values near 2 suggest little first-order autocorrelation in the residuals).

In [11]:
None

## We Share — Reflection & Wrap‑Up

**Notes on Feature Selection**
- If **Engineered** doesn’t beat **Base** on TEST (or gains are tiny), prefer **Base** for simplicity.  
- If two engineered features are redundant (e.g., highly correlated dummies), consider dropping one.  
- Keep features that improve TEST error **and** you can explain to a stakeholder.
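
A quick redundancy check could look like the sketch below, which flags feature pairs whose absolute correlation exceeds a cutoff. The data is synthetic and the 0.9 threshold is a hypothetical choice, not a rule from the lecture.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in features (assumption: NOT the taxi data).
rng = np.random.default_rng(1)
n = 1000
feats = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 20, n),
    "is_weekend": rng.integers(0, 2, n),
})
feats["weekend_trip_distance"] = feats["is_weekend"] * feats["trip_distance"]
feats["trip_distance_sq"] = feats["trip_distance"] ** 2

# Absolute pairwise correlations between all candidate features.
corr = feats.corr().abs()

# Hypothetical cutoff; pick one that suits your tolerance for redundancy.
THRESHOLD = 0.9
redundant_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > THRESHOLD
]
print(redundant_pairs)
```

A variable and its own square are often strongly correlated, so a flagged pair is a prompt to check whether the extra term improves TEST error, not an automatic reason to drop it.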


Write **2 short paragraphs** and be specific:


1) **Which model would you deploy today—Base or Engineered—and why?**  
Use **TEST MAE/RMSE in units**, your coefficient interpretations (baseline/interaction/polynomial), and any residual observations.

2) **What engineered feature was most useful (or not)?**  
Explain the **business logic** behind it and whether it earned its place on the TEST set. If not, what would you try next (different interaction, different categorical, or simplifying features)?

In [12]:
# First, recreate the DataFrame from Reading 9
np.random.seed(0)
ad_spend = np.random.rand(100) * 10
web_traffic = np.random.rand(100) * 50
sales = 15 + 5 * ad_spend + 1.5 * web_traffic + np.random.normal(0, 8, 100)
df_mlr = pd.DataFrame({'sales': sales, 'ad_spend': ad_spend, 'web_traffic': web_traffic})
df_mlr['day_type'] = np.random.choice(['Weekday', 'Weekend'], 100)

# --- 1. Interaction Term ---
df_mlr['ad_x_traffic'] = df_mlr['ad_spend'] * df_mlr['web_traffic']

# --- 2. Squared Term (for non-linear effect) ---
df_mlr['ad_spend_sq'] = df_mlr['ad_spend']**2

# --- 3. Dummy Variables (Method A: pandas.get_dummies) ---
# Easy for analysis, use drop_first=True
dummies_pd = pd.get_dummies(df_mlr['day_type'], drop_first=True, prefix='day')
df_with_pd = pd.concat([df_mlr, dummies_pd], axis=1)
print("--- Data with pd.get_dummies ---")
print(df_with_pd.head())

# --- 3. Dummy Variables (Method B: sklearn.OneHotEncoder) ---
# Better for ML pipelines. Note: This is more complex to set up.
# We fit the encoder on the 'day_type' column
ohe = OneHotEncoder(drop='first', sparse_output=False)
feature_array = ohe.fit_transform(df_mlr[['day_type']])
feature_labels = list(ohe.get_feature_names_out())
dummies_skl = pd.DataFrame(feature_array, columns=feature_labels)

# Concatenate back to the original dataframe
df_with_skl = pd.concat([df_mlr.reset_index(drop=True), dummies_skl.reset_index(drop=True)], axis=1)
print("\n--- Data with sklearn.OneHotEncoder ---")
print(df_with_skl.head())


--- Data with pd.get_dummies ---
     sales  ad_spend  web_traffic day_type  ad_x_traffic  ad_spend_sq  \
0 102.2900    5.4881      33.8908  Weekend      185.9974      30.1196   
1  62.3706    7.1519      13.5004  Weekday       96.5534      51.1496   
2  91.0980    6.0276      36.7597  Weekend      221.5740      36.3324   
3 110.9057    5.4488      48.1094  Weekend      262.1402      29.6898   
4  50.8550    4.2365      12.4377  Weekend       52.6927      17.9483   

   day_Weekend  
0         True  
1        False  
2         True  
3         True  
4         True  

--- Data with sklearn.OneHotEncoder ---
     sales  ad_spend  web_traffic day_type  ad_x_traffic  ad_spend_sq  \
0 102.2900    5.4881      33.8908  Weekend      185.9974      30.1196   
1  62.3706    7.1519      13.5004  Weekday       96.5534      51.1496   
2  91.0980    6.0276      36.7597  Weekend      221.5740      36.3324   
3 110.9057    5.4488      48.1094  Weekend      262.1402      29.6898   
4  50.8550    4.2365