# Data Challenge 10 — MLR Interpretation with Adjusted R² (HVFHV Trips)


**Format:** Instructor Guidance → You Do (Students) → We Share (Reflection)

**Goal:** Build **3 MLR models** with different feature sets to predict a numeric target, then compare **Adjusted R²** and **p-values** to select the better model and justify it in business terms.

**Data:** NYC For-Hire Vehicle trip data, July 1, 2023 to July 15, 2023

[July For Hire Vehicles Data](https://data.cityofnewyork.us/Transportation/2023-High-Volume-FHV-Trip-Data/u253-aew4/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (quick links):**
- TLC HVFHV data dictionary (columns/meaning): https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf  
- statsmodels OLS: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html  
- OLS Results (attributes like `rsquared_adj`, `pvalues`): https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html  

### Pseudocode Plan
1) **Load CSV** → preview columns/shape; confirm target & candidate predictors exist.  
2) **Assign Y + Xs** (start small, add features with a hypothesis). Coerce **just these columns** to numeric.  
3) **Light prep:** derive `trip_time_minutes` from `trip_time` (seconds); convert flags (`shared_request_flag`, `wav_request_flag`) to 0/1 if present.  
4) **Model sets (3 total):**  
   - **Model A (parsimonious).**  
   - **Model B (adds one meaningful predictor).**  
   - **Model C (adds 1–2 more, e.g., flags).**  
5) **Add intercept** and **fit** each with OLS on the same rows.  
6) **Record metrics:** `rsquared_adj`, coefficient table, and **p-values**.  
7) **Compare:** Prefer higher **Adjusted R²** and keep an eye on **p-values** (and signs/units).  
8) **Interpretation:** Write unit-based sentences **holding others constant**.  
9) **Selection rationale:** Pick the simplest model that improves **Adjusted R²** and whose key coefficients have sensible signs and small **p-values**.
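
Step 7 leans on Adjusted R², so it may help to see how it relates to plain R². The sketch below assumes the standard formula for a model with an intercept, `1 - (1 - R²)(n - 1)/(n - p - 1)`, where `n` is the number of rows and `p` the number of predictors excluding the constant. The helper name `adjusted_r2` and the input values are hypothetical; statsmodels reports the same quantity as `rsquared_adj`.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for a model with an intercept, n rows, and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more rows than estimated parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# With few rows, each extra predictor is penalized noticeably.
small = adjusted_r2(r2=0.50, n=11, p=2)

# With millions of rows the penalty is negligible, so R² and Adjusted R²
# agree to several decimal places.
large = adjusted_r2(r2=0.50, n=4_000_000, p=4)

print(f"small sample: {small:.4f}")
print(f"large sample: {large:.6f}")
```

On a dataset with millions of rows the two metrics will look nearly identical, so the p-values, signs, and coefficient sizes carry more of the model-selection argument.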


## You Do — Student Section
Work in pairs. Comment your choices briefly. Keep code simple—only coerce the columns you use.

### Step 0 — Setup & Imports

In [1]:
import pandas as pd, numpy as np
import statsmodels.api as sm
from pathlib import Path

pd.set_option('display.float_format', lambda x: f'{x:,.4f}')

### Step 1 — Load CSV & Preview
- Point to your For Hire Vehicle Data 
- Print **shape** and **columns**.

**Hint: You may have to drop missing values and coerce columns to numeric so the variables stay numeric (other coding assignments may help).**

In [2]:
csv_path = Path('../data/FHV_072023.csv') 


df = pd.read_csv(csv_path)
print('Shape:', df.shape)
print('Columns (first 20):', df.columns.tolist()[:20])
df.head()


Shape: (8324591, 24)
Columns (first 20): ['hvfhs_license_num', 'dispatching_base_num', 'originating_base_num', 'request_datetime', 'on_scene_datetime', 'pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID', 'trip_miles', 'trip_time', 'base_passenger_fare', 'tolls', 'bcf', 'sales_tax', 'congestion_surcharge', 'airport_fee', 'tips', 'driver_pay', 'shared_request_flag']


| | hvfhs_license_num | dispatching_base_num | originating_base_num | request_datetime | on_scene_datetime | pickup_datetime | dropoff_datetime | PULocationID | DOLocationID | trip_miles | ... | sales_tax | congestion_surcharge | airport_fee | tips | driver_pay | shared_request_flag | shared_match_flag | access_a_ride_flag | wav_request_flag | wav_match_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HV0005 | B03406 | | 07/01/2023 05:34:30 PM | | 07/01/2023 05:37:48 PM | 07/01/2023 05:44:45 PM | 158 | 68 | 1.266 | ... | 1.35 | 2.75 | 0.0 | 2.0 | 5.57 | N | N | N | N | False |
| 1 | HV0003 | B03404 | B03404 | 07/01/2023 05:34:30 PM | 07/01/2023 05:36:53 PM | 07/01/2023 05:37:15 PM | 07/01/2023 05:55:15 PM | 162 | 234 | 2.35 | ... | 1.52 | 2.75 | 0.0 | 3.28 | 13.38 | N | N | | N | False |
| 2 | HV0003 | B03404 | B03404 | 07/01/2023 05:34:30 PM | 07/01/2023 05:35:17 PM | 07/01/2023 05:35:52 PM | 07/01/2023 05:44:27 PM | 161 | 163 | 0.81 | ... | 0.49 | 2.75 | 0.0 | 0.0 | 5.95 | N | N | | N | False |
| 3 | HV0003 | B03404 | B03404 | 07/01/2023 05:34:30 PM | 07/01/2023 05:37:39 PM | 07/01/2023 05:39:35 PM | 07/01/2023 06:23:02 PM | 122 | 229 | 15.47 | ... | 5.17 | 2.75 | 0.0 | 0.0 | 54.46 | N | N | | N | True |
| 4 | HV0003 | B03404 | B03404 | 07/01/2023 05:34:30 PM | 07/01/2023 05:36:06 PM | 07/01/2023 05:36:39 PM | 07/01/2023 05:45:06 PM | 67 | 14 | 1.52 | ... | 0.85 | 0.0 | 0.0 | 3.0 | 7.01 | N | N | | N | False |
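
With more than 8 million rows and 24 columns, it can help to load only the columns the models will use. The sketch below relies on two real `pd.read_csv` parameters, `usecols` and `low_memory`. It reads a made-up in-memory CSV because the real file path is local to each student, so `toy_csv`, `wanted`, and `small_df` are illustrative names and the values are not real trips.

```python
import io

import pandas as pd

# Toy stand-in for the real file; the values are made up for illustration.
toy_csv = io.StringIO(
    "hvfhs_license_num,trip_miles,trip_time,base_passenger_fare,shared_request_flag\n"
    "HV0003,2.35,1080,13.50,N\n"
    "HV0005,1.27,417,9.10,N\n"
    "HV0003,bad,515,7.25,Y\n"
)

wanted = ["trip_miles", "trip_time", "base_passenger_fare", "shared_request_flag"]

# usecols skips unneeded columns at parse time; low_memory=False makes pandas
# infer each column's dtype from the whole file rather than chunk by chunk,
# which avoids mixed-type inference within a column.
small_df = pd.read_csv(toy_csv, usecols=wanted, low_memory=False)

print(small_df.shape)
print(small_df.dtypes)
```

The `bad` entry keeps `trip_miles` as text here, which is the situation the coercion hint in this step is meant to handle.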


### Step 2 — Choose Target **Y** and Candidate Predictors

- Suggested **Y**: `base_passenger_fare` (USD).
- Start with **distance** and **time**; optionally add **flags** if present.
- Derive `trip_time_minutes` from `trip_time` (seconds) if available.

In [3]:
# === EDIT HERE if your file has different names ===
Y = 'base_passenger_fare'
candidate_numeric = ['trip_miles', 'trip_time']  # we'll transform trip_time -> minutes if present
candidate_flags   = ['shared_request_flag', 'wav_request_flag']  # if present

# Keep only columns we touch
keep_cols = [col for col in [Y, *candidate_numeric, *candidate_flags] if col in df.columns]
df = df[keep_cols].copy()

# Coerce numeric columns we actually have
for c in [Y] + [col for col in candidate_numeric if col in df.columns]:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Trip time to minutes (if present)
if 'trip_time' in df.columns:
    df['trip_time_minutes'] = df['trip_time'] / 60.0

# Convert flags to 0/1 if present (robust to 'Y'/'N', True/False, ints)
for col in candidate_flags:
    if col in df.columns:
        df[col] = (
            df[col]
            .map({'Y':1,'N':0,'y':1,'n':0, True:1, False:0, 1:1, 0:0})
            .fillna(0)
            .astype(int)
        )

# Final usable columns (drop rows with NA in used cols)
usable_cols = [Y] + [c for c in ['trip_miles','trip_time_minutes'] if c in df.columns] \
              + [c for c in candidate_flags if c in df.columns]
df = df.dropna(subset=usable_cols)

print('Usable cols:', usable_cols)
df[usable_cols].describe()

Usable cols: ['base_passenger_fare', 'trip_miles', 'trip_time_minutes', 'shared_request_flag', 'wav_request_flag']


| | base_passenger_fare | trip_miles | trip_time_minutes | shared_request_flag | wav_request_flag |
|---|---|---|---|---|---|
| count | 4486118.0 | 4486118.0 | 4486118.0 | 4486118.0 | 4486118.0 |
| mean | 13.0977 | 2.0963 | 10.1909 | 0.0199 | 0.0017 |
| std | 5.5997 | 1.395 | 3.6448 | 0.1398 | 0.0409 |
| min | -9.3 | 0.0 | 0.0167 | 0.0 | 0.0 |
| 25% | 8.92 | 1.16 | 7.3167 | 0.0 | 0.0 |
| 50% | 11.83 | 1.73 | 10.2167 | 0.0 | 0.0 |
| 75% | 15.56 | 2.59 | 13.2167 | 0.0 | 0.0 |
| max | 502.34 | 23.803 | 16.65 | 1.0 | 1.0 |
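
The row count falls from 8,324,591 at load to 4,486,118 in the summary above, so it can be worth printing how many rows each cleaning step removes. The sketch below uses a tiny made-up frame (`toy` is a hypothetical name) to show what `errors='coerce'` and the `.map(...).fillna(0)` pattern do to bad or missing entries.

```python
import pandas as pd

toy = pd.DataFrame({
    "trip_miles": ["2.35", "bad", "0.81"],    # one unparseable entry
    "shared_request_flag": ["Y", "N", None],  # one missing flag
})

# errors='coerce' turns unparseable entries into NaN instead of raising.
toy["trip_miles"] = pd.to_numeric(toy["trip_miles"], errors="coerce")

# Unmapped or missing flags become NaN after .map; fillna(0) then counts
# them as "not requested", which is a modeling choice worth stating.
mapped = toy["shared_request_flag"].map({"Y": 1, "N": 0})
toy["shared_request_flag"] = mapped.fillna(0).astype(int)

n_before = len(toy)
toy = toy.dropna(subset=["trip_miles"])
print(f"dropped {n_before - len(toy)} of {n_before} rows")
print(toy)
```

Note the asymmetry: a bad numeric value removes the row, while a missing flag silently becomes 0. Either choice is defensible as long as it is mentioned in the write-up.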


### Step 3 — Define Three Model Specs (A, B, C)
The models below are examples. You can choose any models you want, as long as Model A has one term, Model B has two, and Model C adds one or two more.

- **Model A:** distance only.  
- **Model B:** distance + time (minutes).  
- **Model C:** distance + time + flags (whichever exist).
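
When models are nested, as in the example A/B/C above, plain R² cannot go down as predictors are added, which is why the comparison uses Adjusted R². The sketch below illustrates this on synthetic data. All numbers and the helper name `r2_scores` are assumptions, and NumPy least squares stands in for statsmodels OLS so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic data (assumption): fare depends on miles and minutes only.
miles = rng.uniform(0.5, 10.0, n)
minutes = rng.uniform(3.0, 40.0, n)
noise_col = rng.normal(size=n)  # predictor unrelated to the fare
fare = 3.0 + 2.0 * miles + 0.4 * minutes + rng.normal(0.0, 2.0, n)


def r2_scores(predictors, y):
    """Fit OLS with an intercept; return (R², Adjusted R²)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    p = X.shape[1] - 1  # predictors, excluding the constant
    adj = 1.0 - (1.0 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj


r2_a, adj_a = r2_scores([miles], fare)
r2_b, adj_b = r2_scores([miles, minutes], fare)
r2_c, adj_c = r2_scores([miles, minutes, noise_col], fare)

for name, r2, adj in [("A", r2_a, adj_a), ("B", r2_b, adj_b), ("C", r2_c, adj_c)]:
    print(f"Model {name}: R2={r2:.4f}  Adj R2={adj:.4f}")
```

R² for the third fit is at least as high as for the second even though the added column is pure noise. Adjusted R² charges for that extra term and may or may not rise.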

In [4]:
models = {}

# A: distance only (if available)
if 'trip_miles' in df.columns:
    models['A_dist'] = ['trip_miles']

# B: distance + time (if available)
if 'trip_miles' in df.columns and 'trip_time_minutes' in df.columns:
    models['B_dist_time'] = ['trip_miles', 'trip_time_minutes']

# C: distance + time + flags (only those present)
flags_present = [c for c in ['shared_request_flag','wav_request_flag'] if c in df.columns]
if 'trip_miles' in df.columns and 'trip_time_minutes' in df.columns and flags_present:
    models['C_dist_time_flags'] = ['trip_miles', 'trip_time_minutes'] + flags_present

print('Model specs:', models)


Model specs: {'A_dist': ['trip_miles'], 'B_dist_time': ['trip_miles', 'trip_time_minutes'], 'C_dist_time_flags': ['trip_miles', 'trip_time_minutes', 'shared_request_flag', 'wav_request_flag']}


### Step 4 — Fit Each Model (with intercept) and Collect Adjusted R² & p-values


In [5]:
results = []

for name, Xcols in models.items():
    X = sm.add_constant(df[Xcols].astype(float))
    y = df[Y].astype(float)
    res = sm.OLS(y, X).fit()
    results.append({
        'model': name,
        'features': Xcols,
        'adj_r2': res.rsquared_adj,
        'r2': res.rsquared,
        'n': int(res.nobs),
        'coef': res.params.to_dict(),
        'pvalues': res.pvalues.to_dict(),
        'summary': res  # keep the full result for later printing
    })

# Comparison table (Adj R² + n + features)
comp = pd.DataFrame([{
    'Model': r['model'],
    'Adj_R2': r['adj_r2'],
    'R2': r['r2'],
    'n': r['n'],
    'Features': ', '.join(r['features'])
} for r in results]).sort_values('Adj_R2', ascending=False)

# Display the comparison table (last expression in a notebook cell)
comp
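
The `coef` and `pvalues` dictionaries stored per model can be flattened into one table for side-by-side reading. The sketch below assumes a list shaped like `results`. The `toy_results` values are made up, and `coef_table` and the `alpha` threshold are hypothetical names and choices.

```python
import pandas as pd

# Hypothetical stand-in for the `results` list built above; numbers are made up.
toy_results = [
    {"model": "A_dist",
     "coef": {"const": 5.0, "trip_miles": 3.8},
     "pvalues": {"const": 0.0, "trip_miles": 0.0}},
    {"model": "B_dist_time",
     "coef": {"const": 3.1, "trip_miles": 2.4, "trip_time_minutes": 0.45},
     "pvalues": {"const": 0.0, "trip_miles": 0.0, "trip_time_minutes": 0.03}},
]


def coef_table(results, alpha=0.05):
    """One row per (model, term) with its coefficient and p-value."""
    rows = [
        {"model": r["model"], "term": term, "coef": coef,
         "p_value": r["pvalues"][term]}
        for r in results
        for term, coef in r["coef"].items()
    ]
    table = pd.DataFrame(rows)
    table["significant"] = table["p_value"] < alpha
    return table


table = coef_table(toy_results)
print(table.to_string(index=False))
```

With millions of rows, even small effects tend to produce very small p-values, so the size and sign of each coefficient deserve as much attention as the significance column.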

### Step 5 — Inspect Full Summaries (coefficients, p-values, diagnostics)

- Print summaries for the top 1–2 models by **Adjusted R²**.
- Write **unit-based** interpretations “holding others constant.”

In [6]:
# Show summaries in order of Adj R²
for r in sorted(results, key=lambda d: d['adj_r2'], reverse=True):
    print('='*80)
    print(f"Model: {r['model']} | Adj R²: {r['adj_r2']:.4f} | R²: {r['r2']:.4f} | n={r['n']}")
    print('- Features:', r['features'])
    print(r['summary'].summary())

Model: C_dist_time_flags | Adj R²: 0.5036 | R²: 0.5036 | n=4486118
- Features: ['trip_miles', 'trip_time_minutes', 'shared_request_flag', 'wav_request_flag']
                             OLS Regression Results                            
Dep. Variable:     base_passenger_fare   R-squared:                       0.504
Model:                             OLS   Adj. R-squared:                  0.504
Method:                  Least Squares   F-statistic:                 1.138e+06
Date:                 Sun, 02 Nov 2025   Prob (F-statistic):               0.00
Time:                         17:51:41   Log-Likelihood:            -1.2523e+07
No. Observations:              4486118   AIC:                         2.505e+07
Df Residuals:                  4486113   BIC:                         2.505e+07
Df Model:                            4                                         
Covariance Type:             nonrobust                                         
                          coef    std err 

### Step 6 — Interpretations (write below)

Using the **best model’s** coefficients, interpret each one in a markdown cell.
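
A unit-based interpretation follows a fixed pattern, so one way to draft the sentences is a small formatter. This is only a sketch: `interpret`, `toy_coefs`, and `units` are hypothetical names, and the coefficients are placeholders rather than values from the fitted models.

```python
def interpret(term: str, coef: float, unit: str, target: str = "base fare") -> str:
    """Turn one coefficient into a unit-based, holding-others-constant sentence."""
    direction = "higher" if coef >= 0 else "lower"
    return (f"Each additional {unit} is associated with a ${abs(coef):.2f} "
            f"{direction} {target}, holding the other predictors constant "
            f"({term} = {coef:+.4f}).")


# Hypothetical coefficients, not taken from the fitted models.
toy_coefs = {"trip_miles": 2.4, "trip_time_minutes": 0.45}
units = {"trip_miles": "mile", "trip_time_minutes": "minute"}

for term, coef in toy_coefs.items():
    print(interpret(term, coef, units[term]))

# Rescaling a predictor rescales its coefficient: minutes = seconds / 60,
# so the per-minute coefficient is 60 times the per-second one.
per_second = toy_coefs["trip_time_minutes"] / 60
print(f"per-second equivalent: {per_second:.4f}")
```

For a 0/1 flag the wording should change from "each additional unit" to a comparison of trips with the flag against trips without it.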

## We Share — Reflection & Wrap‑Up

Write **2 short paragraphs** and be specific:

1) **Which model (A/B/C) do you pick and why?**  
Reference **Adjusted R²** (higher is better when comparing models with different numbers of predictors) and the **p-values**/signs of key coefficients.

2) **Business explanation:**  
Give a stakeholder-friendly summary in **units** (e.g., “+1 mile ≈ +$X in base fare, holding time constant”). If you added flags, explain their effect plainly. Mention any limitations (e.g., time vs distance confounding, missing columns).
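
To back up a remark about time vs distance confounding, a quick check is the correlation between the two predictors. The sketch below uses synthetic trips; the data-generating numbers are assumptions. It relies on the fact that with exactly two predictors, each one's variance inflation factor equals `1 / (1 - r²)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic trips (assumption): time grows with distance plus traffic noise.
miles = rng.uniform(0.5, 10.0, size=5_000)
minutes = 4.0 * miles + rng.normal(0.0, 3.0, size=5_000)
toy = pd.DataFrame({"trip_miles": miles, "trip_time_minutes": minutes})

# Pearson correlation between the two predictors.
r = toy["trip_miles"].corr(toy["trip_time_minutes"])

# With exactly two predictors, each one's variance inflation factor is
# 1 / (1 - r^2); larger values mean less stable individual coefficients.
vif = 1.0 / (1.0 - r**2)

print(f"correlation: {r:.3f}")
print(f"VIF: {vif:.2f}")
```

On the real data, the same two lines applied to `df['trip_miles']` and `df['trip_time_minutes']` would show how much the "holding time constant" caveat matters.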