## Mod 5 Lecture 3 Code-Along: Data Transformations

### Goals
* Apply a log transformation to a continuous numeric variable
* Be able to articulate the difference between raw ("non-logged") and log-transformed variables

### Data
We are using the same NYC 311 dataset. Remember that the full dataset is HUGE, so we extracted just one week of it. Dataset documentation is available [HERE](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/about_data).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Read in the data from nyc311.csv

df = None

In [None]:
# Run this cell without changes! You've already seen this code in the previous code-along

LOCAL_TZ = "America/New_York"

def to_utc(series, local_tz=LOCAL_TZ):
    """
    Idempotent conversion:
      1) Parse to datetime.
      2) If naive -> localize to local_tz (handle DST).
      3) Convert to UTC.
    Safe to re-run without raising 'Already tz-aware' errors.
    """
    s = pd.to_datetime(series, errors="coerce")

    # If tz-naive, localize first; if already tz-aware, skip straight to the UTC conversion
    if s.dt.tz is None:
        s = s.dt.tz_localize(local_tz, nonexistent="shift_forward", ambiguous="NaT")

    return s.dt.tz_convert("UTC")

# --- Apply to your DataFrame (df) ---
# Ensure the columns exist; adjust names if your file uses different headers
required_cols = ["Created Date", "Closed Date"]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

# Drop rows that lack either timestamp before conversion
df = df.dropna(subset=["Created Date", "Closed Date"]).copy()

df["Created Date"] = to_utc(df["Created Date"])
df["Closed Date"]  = to_utc(df["Closed Date"])

# Compute response time in hours
delta = df["Closed Date"] - df["Created Date"]
df["response_time_hrs"] = delta.dt.total_seconds() / 3600

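# Drop any rows that became NaT during conversion (unparseable values or ambiguous DST times)
# .copy() keeps later column assignments from raising SettingWithCopyWarning
df = df.dropna(subset=["Created Date", "Closed Date"]).copy()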
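
To make the conversion above concrete, here is a minimal sketch of the same localize, convert, and subtract steps on two made-up requests. The `sample` DataFrame and its timestamps are hypothetical and are not taken from the 311 data; they only show why the timezone step matters (New York is UTC-5 in January and UTC-4 in July).

```python
import pandas as pd

# Hypothetical sample: two made-up requests, timestamps in local NYC time
sample = pd.DataFrame({
    "Created Date": ["2024-01-15 08:00:00", "2024-07-15 08:00:00"],
    "Closed Date":  ["2024-01-15 20:30:00", "2024-07-16 08:00:00"],
})

for col in ["Created Date", "Closed Date"]:
    parsed = pd.to_datetime(sample[col], errors="coerce")    # strings -> naive datetimes
    localized = parsed.dt.tz_localize(
        "America/New_York", nonexistent="shift_forward", ambiguous="NaT"
    )                                                        # attach the local timezone
    sample[col] = localized.dt.tz_convert("UTC")             # express in UTC

# Response time in hours, same formula as the cell above
delta = sample["Closed Date"] - sample["Created Date"]
sample["response_time_hrs"] = delta.dt.total_seconds() / 3600

print(sample)
```

The January request shows up at 13:00 UTC and the July request at 12:00 UTC, even though both were created at 8:00 local time, and the response times come out to 12.5 and 24 hours.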

### Task 1: Visualize the "response time" BEFORE LOGGING

In [None]:
# Assumes df is already loaded, with tz-aware dates and response_time_hrs computed
df["response_time_hrs"] = None  # make every value positive to avoid log(0)

# Quick look at original distribution
None
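
One possible way to approach this task is sketched below on synthetic data, so it runs without the CSV. The `hours` Series is a made-up stand-in for `df["response_time_hrs"]`, and the floor of 0.01 hours is an assumption. Any small positive floor, or simply filtering to rows greater than zero, would serve the same purpose of keeping the later log defined.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for df["response_time_hrs"]: right-skewed, with a few zeros
rng = np.random.default_rng(0)
hours = pd.Series(rng.lognormal(mean=2.0, sigma=1.5, size=1_000))
hours.iloc[:5] = 0.0  # pretend a few requests were closed the instant they were opened

# Assumed floor of 0.01 hours so log() is defined for every row
hours_clipped = hours.clip(lower=0.01)

# Histogram of the untransformed values
ax = hours_clipped.plot(kind="hist", bins=50)
ax.set_xlabel("Response time (hours)")
ax.set_title("Original Response Time")
plt.show()
```

With data this skewed, most of the bars pile up at the far left and a long, thin tail stretches to the right, which is the pattern a log transformation is meant to address.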

### Task 2: Apply a log transformation

In [None]:
df["log_response_time"] = None
df[["response_time_hrs", "log_response_time"]].describe()
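
As a rough illustration of what the log column means, the sketch below applies `np.log` (the natural log) to a handful of made-up values. The `toy` DataFrame is hypothetical; the point is that multiplying the hours by 10 always adds the same amount, about 2.30, on the log scale.

```python
import numpy as np
import pandas as pd

# Made-up response times in hours, spanning several orders of magnitude
toy = pd.DataFrame({"response_time_hrs": [0.01, 1.0, 10.0, 100.0, 1000.0]})

# Natural log: every 10x step in hours becomes an equal step of ln(10)
toy["log_response_time"] = np.log(toy["response_time_hrs"])

print(toy)
print(toy.describe())
```

This is the difference worth articulating: on the raw scale, a one-unit change is one more hour, while on the log scale a one-unit change means the response time was multiplied by e (about 2.72). Applying `np.exp` to the log column recovers the original hours.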

### Task 3: Visualize the Difference

In [None]:
plt.figure(figsize=(12,5))

plt.subplot(1, 2, 1)
None
plt.title("Original Response Time")

plt.subplot(1, 2, 2)
None
plt.title("Log-Transformed")

plt.tight_layout()
plt.show()
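
For comparison, here is a self-contained sketch of the same side-by-side layout on synthetic data. The `demo` DataFrame is made up and deliberately right-skewed, so the real 311 histograms will look different in detail, but the overall contrast between the two panels should be similar.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic right-skewed "response times" (hypothetical, not the 311 data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"response_time_hrs": rng.lognormal(mean=2.0, sigma=1.5, size=1_000)})
demo["log_response_time"] = np.log(demo["response_time_hrs"])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: raw hours, dominated by a long right tail
axes[0].hist(demo["response_time_hrs"], bins=50)
axes[0].set_title("Original Response Time")

# Right panel: logged hours, much closer to symmetric
axes[1].hist(demo["log_response_time"], bins=50)
axes[1].set_title("Log-Transformed")

fig.tight_layout()
plt.show()

# Skewness near 0 means roughly symmetric; large positive means a long right tail
print(demo.skew())
```

Printing the skewness of both columns gives a numeric check on what the plots show: the raw column has a large positive skew, while the logged column sits much closer to zero.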
