## Mod 5 Lecture 2 Code-Along:  Feature Engineering & Scaling 

### Goals
* Create `hour_of_day` and `is_weekend` if not already done

* Create `night_weekend_interaction` = `is_weekend * is_night`

* Scale `hour_of_day` and `response_time_hrs` using both techniques (`StandardScaler` and `MinMaxScaler`)

### Data
We are using the same NYC 311 dataset as before (remember, the full dataset is HUGE, so we extracted just one week). Dataset documentation is available [HERE](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/about_data).

In [1]:
import pandas as pd

In [4]:
# Read in data nyc311.csv 
path = '/Users/Marcy_Student/Desktop/Marcy-Modules/Mod5/Data-challenge/nyc311.csv'
df = pd.read_csv(path)
df

| | Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66178993 | 09/17/2025 02:50:56 AM | | DOT | Department of Transportation | Street Condition | Pothole | | 10457.0 | CROSS BRONX EXPRESSWAY | ... | | | | | | | | | | |
| 1 | 66174339 | 09/17/2025 02:44:55 AM | | DOT | Department of Transportation | Street Condition | Pothole | | 10469.0 | SEYMOUR AVENUE | ... | | | | | | | | | | |
| 2 | 66170874 | 09/17/2025 02:42:45 AM | | DOT | Department of Transportation | Street Condition | Pothole | | 10467.0 | BOSTON ROAD | ... | | | | | | | | | | |
| 3 | 66172189 | 09/17/2025 01:51:12 AM | | NYPD | New York City Police Department | Illegal Parking | Blocked Sidewalk | Street/Sidewalk | 11691.0 | 13-54 DAVIES ROAD | ... | | | | | | | | 40.599549 | -73.748018 | (40.5995492740367, -73.74801784107588) |
| 4 | 66175640 | 09/17/2025 01:50:43 AM | | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 11372.0 | 89-07 34 AVENUE | ... | | | | | | | | 40.754443 | -73.878352 | (40.754442992557145, -73.87835236707805) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55810 | 66123433 | 09/11/2025 10:48:18 AM | 09/13/2025 05:56:22 PM | HPD | Department of Housing Preservation and Develop... | APPLIANCE | REFRIGERATOR | RESIDENTIAL BUILDING | 11208.0 | 1097 GLENMORE AVENUE | ... | | | | | | | | 40.677233 | -73.868409 | (40.67723284394943, -73.86840939045801) |
| 55811 | 66119235 | 09/11/2025 10:48:18 AM | 09/13/2025 05:56:22 PM | HPD | Department of Housing Preservation and Develop... | UNSANITARY CONDITION | PESTS | RESIDENTIAL BUILDING | 11208.0 | 1097 GLENMORE AVENUE | ... | | | | | | | | 40.677233 | -73.868409 | (40.67723284394943, -73.86840939045801) |
| 55812 | 66119222 | 09/11/2025 10:48:18 AM | 09/13/2025 05:56:22 PM | HPD | Department of Housing Preservation and Develop... | UNSANITARY CONDITION | MOLD | RESIDENTIAL BUILDING | 11208.0 | 1097 GLENMORE AVENUE | ... | | | | | | | | 40.677233 | -73.868409 | (40.67723284394943, -73.86840939045801) |
| 55813 | 66117803 | 09/11/2025 10:48:18 AM | 09/13/2025 05:56:22 PM | HPD | Department of Housing Preservation and Develop... | PLUMBING | WATER SUPPLY | RESIDENTIAL BUILDING | 11208.0 | 1097 GLENMORE AVENUE | ... | | | | | | | | 40.677233 | -73.868409 | (40.67723284394943, -73.86840939045801) |


In [7]:
#Run this cell without changes!  You've done this in the previous Data Challenge 

from sklearn.preprocessing import StandardScaler, MinMaxScaler


LOCAL_TZ = "America/New_York"

def to_utc(series, local_tz=LOCAL_TZ):
    """
    Idempotent conversion:
      1) Parse to datetime.
      2) If naive -> localize to local_tz (handle DST).
      3) Convert to UTC.
    Safe to re-run without raising 'Already tz-aware' errors.
    """
    s = pd.to_datetime(series, errors="coerce")

    # if tz-naive, localize; if tz-aware, leave as-is
    if s.dt.tz is None:
        s = s.dt.tz_localize(local_tz, nonexistent="shift_forward", ambiguous="NaT")

    return s.dt.tz_convert("UTC")

# --- Apply to your DataFrame (df) ---
# Ensure the columns exist; adjust names if your file uses different headers
required_cols = ["Created Date", "Closed Date"]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

# Optionally drop rows that lack either timestamp before conversion
df = df.dropna(subset=["Created Date", "Closed Date"]).copy()

df["Created Date"] = to_utc(df["Created Date"])
df["Closed Date"]  = to_utc(df["Closed Date"])

# Compute response time in hours
delta = df["Closed Date"] - df["Created Date"]
df["response_time_hrs"] = delta.dt.total_seconds() / 3600

# Drop any rows that became NaT due to ambiguous DST cases
# (.copy() avoids SettingWithCopyWarning when new columns are added later)
df = df.dropna(subset=["Created Date", "Closed Date"]).copy()
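
To see what the `nonexistent` and `ambiguous` arguments do, here is a small sketch on three made-up timestamps (these are not rows from the dataset). It repeats the same localize-then-convert steps as `to_utc`; the explicit `format` string is an assumption based on the `MM/DD/YYYY hh:mm:ss AM/PM` layout seen in the file.

```python
import pandas as pd

# Hypothetical timestamps chosen to hit the 2025 DST edge cases in New York
raw = pd.Series([
    "03/09/2025 02:30:00 AM",  # does not exist: clocks jump from 2:00 to 3:00
    "11/02/2025 01:30:00 AM",  # ambiguous: 1:30 AM happens twice that night
    "09/17/2025 02:50:56 AM",  # ordinary time
])

parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
localized = parsed.dt.tz_localize(
    "America/New_York", nonexistent="shift_forward", ambiguous="NaT"
)
demo_utc = localized.dt.tz_convert("UTC")
print(demo_utc)
```

The nonexistent time is shifted forward to 3:00 AM local time, and the ambiguous time becomes `NaT`, which is why the cell drops `NaT` rows at the end.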

In [8]:
df.dtypes

Unique Key                                      int64
Created Date                      datetime64[ns, UTC]
Closed Date                       datetime64[ns, UTC]
Agency                                         object
Agency Name                                    object
Complaint Type                                 object
Descriptor                                     object
Location Type                                  object
Incident Zip                                  float64
Incident Address                               object
Street Name                                    object
Cross Street 1                                 object
Cross Street 2                                 object
Intersection Street 1                          object
Intersection Street 2                          object
Address Type                                   object
City                                           object
Landmark                                       object
Facility Type               

### Task 1:  Create Features 

Extract the hour and create a variable for weekends (we did this previously!). We will define “night” as any time from midnight to 6am.

In [9]:
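# Create base features
# Note: 'Created Date' was converted to UTC above, so hour and weekday are in UTC
df['hour_of_day'] = df['Created Date'].dt.hour
df['is_weekend'] = df['Created Date'].dt.day_of_week >= 5  # Saturday=5, Sunday=6
df['is_night'] = df['hour_of_day'].between(0, 5)  # 12am up to (not including) 6am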
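
Because `Created Date` is stored in UTC at this point, `.dt.hour` and `.dt.day_of_week` describe the UTC clock, not the New York clock. If the intent is local time, one option is to convert back with `tz_convert` before extracting features. The sketch below uses a single made-up timestamp and hypothetical column names to show how the two can disagree.

```python
import pandas as pd

# Hypothetical UTC timestamp: early Saturday in UTC, still Friday night in New York
created_utc = pd.Series(pd.to_datetime(["2025-09-13 03:30:00"])).dt.tz_localize("UTC")
created_local = created_utc.dt.tz_convert("America/New_York")

demo = pd.DataFrame({
    "hour_utc": created_utc.dt.hour,
    "hour_local": created_local.dt.hour,
    "is_weekend_utc": created_utc.dt.day_of_week >= 5,
    "is_weekend_local": created_local.dt.day_of_week >= 5,
})
print(demo)
```

Either convention can work for modeling, as long as it is chosen deliberately and applied consistently.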

### Task 2:  Create Interaction Term 

Create the `night_weekend_interaction` feature 

In [11]:
# Multiply the two flags as integers: 1 only when both are True
df['night_weekend_interaction'] = df['is_weekend'].astype(int) * df['is_night'].astype(int)

# Look at the data: there are many columns, so show only the ones we need
df[['hour_of_day', 'is_weekend', 'is_night', 'night_weekend_interaction']].head()

| | hour_of_day | is_weekend | is_night | night_weekend_interaction |
|---|---|---|---|---|
| 7 | 5 | False | True | 0 |
| 37 | 5 | False | True | 0 |
| 43 | 5 | False | True | 0 |
| 47 | 5 | False | True | 0 |
| 50 | 5 | False | True | 0 |
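
The first five rows all happen to show `0`, so it can help to see every combination of the two flags at once. This is a toy truth table, not data from the file: the interaction is `1` only when a request is both on a weekend and at night.

```python
import pandas as pd

# All four combinations of the two flags (toy data, not from the dataset)
flags = pd.DataFrame({
    "is_weekend": [False, False, True, True],
    "is_night":   [False, True, False, True],
})
flags["night_weekend_interaction"] = (
    flags["is_weekend"].astype(int) * flags["is_night"].astype(int)
)
print(flags)
```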


### Task 3:  Scale Data 
* Use sklearn's StandardScaler object to scale hours and response time 
* Use sklearn's MinMaxScaler object to scale hours and response time 

**Note: You will also scale data before modeling in Mod 6, but it will look slightly different: there the scaler is fit on only a subset of the data (which we call "training data") rather than on the whole dataset like we do here. This is an important note!**
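
As a rough preview of that idea, here is a minimal sketch on toy numbers. The 80/20 split, the `random_state`, and the variable names are assumptions for illustration; the Mod 6 workflow may look different.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for a single numeric feature
toy = pd.DataFrame({"response_time_hrs": np.arange(10, dtype=float)})

# Hypothetical 80/20 split; Mod 6 may use a different split or workflow
train, test = train_test_split(toy, test_size=0.2, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training rows only
test_scaled = scaler.transform(test)        # reuse those statistics, no refitting

print("mean learned from training rows:", scaler.mean_[0])
```

The key point is that `fit_transform` is called only on the training rows, while the held-out rows go through `transform` alone.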

In [12]:
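df = df.dropna(subset=["response_time_hrs", "hour_of_day"]).copy()

# Use a separate scaler per column so each fitted scaler keeps its own mean and std
hour_scaler = StandardScaler()
resp_scaler = StandardScaler()
df['hour_scaled'] = hour_scaler.fit_transform(df[['hour_of_day']]).ravel()
df['resp_scaled'] = resp_scaler.fit_transform(df[['response_time_hrs']]).ravel()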

In [13]:
#Run this cell without changes -- do you see the difference in the scaled column? 

df[['resp_scaled', 'response_time_hrs']]

| | resp_scaled | response_time_hrs |
|---|---|---|
| 7 | -0.381873 | 1.178056 |
| 37 | -0.438908 | 0.158889 |
| 43 | -0.432504 | 0.273333 |
| 47 | -0.446681 | 0.020000 |
| 50 | -0.445748 | 0.036667 |
| ... | ... | ... |
| 55810 | 2.637687 | 55.134444 |
| 55811 | 2.637687 | 55.134444 |
| 55812 | 2.637687 | 55.134444 |
| 55813 | 2.637687 | 55.134444 |
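
Standard scaling computes z = (x - mean) / std for each value, so a scaled value tells you how many standard deviations a row sits above or below the column average. The sketch below checks that formula against `StandardScaler` on a few toy values (not taken from the dataset). `StandardScaler` uses the population standard deviation, which matches NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values, not taken from the dataset
x = np.array([[1.0], [2.0], [3.0], [10.0]])

by_hand = (x - x.mean()) / x.std()   # NumPy's default std is the population std (ddof=0)
by_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(by_hand, by_sklearn))
```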


In [14]:
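# Use a separate scaler per column so each fitted scaler keeps its own min and max
hour_minmax = MinMaxScaler()
resp_minmax = MinMaxScaler()
df["hour_mm"] = hour_minmax.fit_transform(df[["hour_of_day"]]).ravel()
df["resp_mm"] = resp_minmax.fit_transform(df[["response_time_hrs"]]).ravel()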

In [15]:
#Run this cell without changes -- do you see the difference across all 3 columns (two scaled, one original)? 

df[['resp_mm', 'resp_scaled', 'response_time_hrs']]

| | resp_mm | resp_scaled | response_time_hrs |
|---|---|---|---|
| 7 | 0.434466 | -0.381873 | 1.178056 |
| 37 | 0.429910 | -0.438908 | 0.158889 |
| 43 | 0.430421 | -0.432504 | 0.273333 |
| 47 | 0.429289 | -0.446681 | 0.020000 |
| 50 | 0.429363 | -0.445748 | 0.036667 |
| ... | ... | ... | ... |
| 55810 | 0.675696 | 2.637687 | 55.134444 |
| 55811 | 0.675696 | 2.637687 | 55.134444 |
| 55812 | 0.675696 | 2.637687 | 55.134444 |
| 55813 | 0.675696 | 2.637687 | 55.134444 |
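
Min-max scaling computes (x - min) / (max - min), so the column minimum maps to 0 and the maximum maps to 1. One thing worth noticing in the output above: a response time of almost 0 hours maps to about 0.43, not to something near 0. Since the transformation is linear, two rows are enough to back out a rough estimate of the min and max that the scaler saw. This is only a sketch based on the rounded values shown above, so treat the result as approximate.

```python
# Two (response_time_hrs, resp_mm) pairs copied from the output above
x1, m1 = 0.020000, 0.429289
x2, m2 = 55.134444, 0.675696

# Min-max scaling is linear: m = (x - x_min) / (x_max - x_min)
value_range = (x2 - x1) / (m2 - m1)   # x_max - x_min
est_min = x1 - m1 * value_range
est_max = est_min + value_range

print(f"estimated min: {est_min:.1f} hrs, estimated max: {est_max:.1f} hrs")
```

If the estimate holds, the minimum is negative, which would mean some requests have a `Closed Date` earlier than their `Created Date`. Checking `df['response_time_hrs'].min()` directly would confirm or rule this out.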
