# FEATURE ENGINEERING

In this phase we build the **final analytical base table** used for downstream feature preselection and modeling.

We do three things:

1) **Create features (transformations).** The most domain-specific part of this phase and maybe the whole project.

   *Why* - Raw AIS + schedule fields do not directly express behaviors like acceleration, maneuvering, reporting cadence, or time-to-ETA.  
   We convert them into **explicit, measurable signals**.

2) **Transform categorical variables (encodings)**  

   *Why* - Most ML models require numeric inputs. Ports and vessel identifiers must be represented as numbers.  
   - **One-Hot Encoding** for low/moderate-cardinality categorical variables (ports)

   - **Target Encoding** for high-cardinality identifiers (ship / IMO)
     


3) **Rescale numerical features**  

   *Why* - Some algorithms benefit from comparable numeric scales, and scaling can make feature selection methods more stable.  


   We use **Min-Max scaling** to map numeric features into [0, 1] without changing their ordering.

No feature preselection happens here; that starts in the next notebook (`05_feature_preselection`).

## IMPORT LIBRARIES

In [19]:
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.preprocessing import MinMaxScaler

# Autocomplete
%config IPCompleter.greedy=True

## IMPORT DATASETS

Why: we start from the **cleaned dataset** (structurally valid) produced earlier.

The rows are expected to be sorted by **(imo, updated_ts)**, because many features are based on:
- time differences (`diff`)
- rolling statistics (`rolling`)
- previous position calculations (`shift`)

These operations require correct chronological order **within each vessel**.
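As a minimal sketch of why the ordering matters, the toy frame below (made-up values, not taken from the dataset) compares per-vessel time gaps before and after sorting. The names `toy`, `toy_sorted`, `gap_unsorted`, and `gap_min` are illustrative only.

```python
import pandas as pd

# Toy frame: two vessels whose reports arrive interleaved and out of order.
toy = pd.DataFrame({
    "imo": [2, 1, 1, 2, 1],
    "updated_ts": pd.to_datetime([
        "2018-01-05 03:10", "2018-01-05 03:06", "2018-01-05 03:02",
        "2018-01-05 03:04", "2018-01-05 03:08",
    ]),
})

# Without sorting, per-vessel diffs can come out negative.
gap_unsorted = toy.groupby("imo")["updated_ts"].diff().dt.total_seconds() / 60

# Sort chronologically within each vessel, then recompute the gaps.
toy_sorted = toy.sort_values(["imo", "updated_ts"]).reset_index(drop=True)
gap_min = toy_sorted.groupby("imo")["updated_ts"].diff().dt.total_seconds() / 60

print(gap_unsorted.tolist())
print(gap_min.tolist())
```

After sorting, every gap is non-negative and the first report of each vessel has no gap (`NaN`), which is what `diff`, `rolling`, and `shift` rely on.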

In [20]:
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 200)

project_path = "/Users/rober/smartport-delay-risk-scoring/"
working_dir = project_path + "/02_Data/03_Working/"

name_df = "work_clean.csv"
output_name = "work_fe.csv"

df = pd.read_csv(working_dir + name_df)


In [21]:
df

Unnamed: 0,record_id,updated_ts,ship_name,imo,lat,lon,sog,cog,hdg,dep_port,etd_schedule,etd,atd,arr_port,eta_schedule,eta,ata
0,20404,,Europa,8919805,59.4454,24.7695,2.9,29,221,EETLL,,,,FIHEL,,,
1,20712,,Finlandia,9214379,59.4453,24.7648,3.0,67,246,EETLL,,,,FIHEL,,,
2,20731,,Megastar,9773064,59.4446,24.7708,3.0,33,210,EETLL,,,,FIHEL,,,
3,20741,,Star,9364722,60.1483,24.9152,0.6,162,207,FIHEL,,,,EETLL,,,
4,20742,,Finlandia,9214379,59.4460,24.7680,5.4,66,239,EETLL,,,2018-04-13 02:51:13,FIHEL,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
467479,912177,,Finlandia,9214379,59.9357,24.8750,20.6,197,195,FIHEL,,,2019-03-15 06:51:54,EETLL,,,
467480,912178,,Finlandia,9214379,59.9308,24.8722,20.6,197,196,FIHEL,,,2019-03-15 06:51:54,EETLL,,,
467481,912179,,Finlandia,9214379,59.9242,24.8683,20.7,196,196,FIHEL,,,2019-03-15 06:51:54,EETLL,,,
467482,912180,,Finlandia,9214379,59.9198,24.8657,20.8,197,195,FIHEL,,,2019-03-15 06:51:54,EETLL,,,


### Datetime parsing and integrity check

Feature engineering uses time deltas (e.g., `updated_ts.diff()`), which require datetime dtypes, and CSV files do not preserve them.

So we **parse** the timestamp columns and then **validate** their dtypes. If any column is still not a datetime after parsing, we fail fast with a clear error.

In [22]:
# Parse datetime columns (CSV does not preserve dtypes)
datetime_cols = [
    "updated_ts", "etd_schedule", "etd",
    "atd", "eta_schedule", "eta", "ata"
]

for c in datetime_cols:
    if c in df.columns:
        df[c] = pd.to_datetime(df[c], errors="coerce")

# Validate conversion
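# Skip columns that are absent and use the pandas dtype check (also handles tz-aware datetimes)
bad_dt = [
    (c, df[c].dtype) for c in datetime_cols
    if c in df.columns and not pd.api.types.is_datetime64_any_dtype(df[c])
]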
if bad_dt:
    raise TypeError(f"Datetime parsing failed for columns: {bad_dt}")

print("Datetime columns parsed and validated.")

Datetime columns parsed and validated.


## TARGET CREATION (Delay Target)

Why: we need a clear target definition before we encode or scale, because:
- target encoding uses the target to compute category statistics
- downstream notebooks assume the target is already present

| **Field**        | **Description**                                                |
| ---------------- | -------------------------------------------------------------- |
| `etd_schedule`   | Scheduled Estimated Time of Departure                          |
| `etd`            | Updated Estimated Time of Departure (may differ from schedule) |
| `atd`            | Actual Time of Departure                                       |
| `arr_port`       | Arrival port code                                              |
| `eta_schedule`   | Scheduled Estimated Time of Arrival                            |
| `eta`            | Updated Estimated Time of Arrival (may differ from schedule)   |
| `ata`            | Actual Time of Arrival                                         |


### Objective
The goal of this section is to define the **ground truth** (`delay_flag`). To be operationally useful for **tugboats** and **stevedores**, the model must learn to identify delays based on patterns observed *before* the vessel reaches the dock.

### Calculation Logic (Priority Waterfall)
Since operational data is collected at different stages of a journey, we calculate the delay using a hierarchical approach to ensure we always use the most reliable signal available:

1.  **Primary Signal (Arrival Delay):** If the **Actual Time of Arrival (ATA)** is recorded, the delay is calculated as:  
    $$\text{Delay Minutes} = \text{ATA} - \text{ETA Schedule}$$
2.  **Secondary Proxy (Departure Delay):** If the vessel has not yet arrived but has departed from its previous location, we use the **Actual Time of Departure (ATD)** as a proxy for journey friction:  
    $$\text{Delay Minutes} = \text{ATD} - \text{ETD Schedule}$$
3.  **Incomplete Data:** If neither an actual arrival nor departure timestamp is available, the delay cannot be verified, and the record is excluded from training to maintain data integrity.



### Defining the Binary Target
We convert the continuous delay measurement into a binary classification event:
* **Threshold:** A **120-minute** limit (`DELAY_THRESHOLD_MIN`) is applied.
* **Delayed (`1`):** Any record with a delay $\ge 120$ minutes.
* **On-Time (`0`):** Any record with a delay $< 120$ minutes.

### The "Blinding" Protocol for Predictive Integrity
To ensure the model is truly predictive and not merely reporting past events, we implement a **Blinding Protocol**:
* **Labeling:** We use the "future" timestamps (`ata`, `atd`) exclusively to create the `delay_flag`.
* **Blinding:** Immediately after the flag is created, the `ata`, `atd`, and `delay_minutes` columns are **deleted** from the feature set ($X$).
* **The Result:** The model is forced to find predictive signals in "live" data—such as Speed Over Ground (SOG), heading, and distance to port—ensuring it can provide a high-sensitivity **Early Warning** even when the actual arrival time is still unknown.
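One way to make the blinding step harder to undo by accident is a small guard that fails if any blinded column is still present. The sketch below is illustrative only: `assert_no_leakage` and `BLINDED_COLS` are hypothetical names, not part of this notebook, and the toy frame uses made-up values.

```python
import pandas as pd

# Columns that are only known after the event (names as used in this notebook).
BLINDED_COLS = {"ata", "atd", "delay_minutes"}

def assert_no_leakage(frame: pd.DataFrame, blinded=BLINDED_COLS) -> None:
    """Hypothetical helper: raise if any blinded column is still in the frame."""
    leaked = sorted(blinded.intersection(frame.columns))
    if leaked:
        raise ValueError(f"Blinded columns still present: {leaked}")

# Toy frame that still carries one "future" column.
toy = pd.DataFrame({
    "sog": [4.6],
    "ata": [pd.Timestamp("2018-01-05 03:00")],
    "delay_flag": [1],
})

try:
    assert_no_leakage(toy)
    leak_detected = False
except ValueError:
    leak_detected = True

# After dropping the column, the guard passes silently.
assert_no_leakage(toy.drop(columns=["ata"]))
print("Leak detected before drop:", leak_detected)
```

A check like this could be called right before the feature table is assembled, so a re-ordered or partially re-run notebook cannot reintroduce the actual timestamps unnoticed.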

In [23]:
DELAY_THRESHOLD_MIN = 120 

# 1. Define the columns we use to create the target
reality_cols = ["ata", "atd", "eta_schedule", "etd_schedule"]

# Check if we still have the columns. If not, we've already blinded the data.
if all(col in df.columns for col in ["ata", "atd"]):
    
    # Ensure they are datetime
    for col in reality_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')

    # 2. Calculate real-world delays
    arr_delay = (df["ata"] - df["eta_schedule"]).dt.total_seconds() / 60.0
    dep_delay = (df["atd"] - df["etd_schedule"]).dt.total_seconds() / 60.0

    # 3. Assign target (priority: arrival > departure)
    df["delay_minutes"] = np.where(
        df["ata"].notna(), 
        arr_delay, 
        np.where(df["atd"].notna(), dep_delay, np.nan)
    )

    # 4. Create binary flag (1/0)
    df["delay_flag"] = np.nan
    mask = df["delay_minutes"].notna()
    df.loc[mask, "delay_flag"] = (df.loc[mask, "delay_minutes"] >= DELAY_THRESHOLD_MIN).astype(int)

    # 5. BLINDING: Drop the "future" reality columns
    # This is the "Point of No Return" for this variable in memory
    df = df.drop(columns=["ata", "atd", "delay_minutes"])
    print("✔ Target created and reality columns dropped (Blinded).")

else:
    print("⚠ Reality columns already removed. Skipping target creation logic.")

# 6. Final Clean: Only keep rows where we have a known answer
df = df.dropna(subset=['delay_flag'])

✔ Target created and reality columns dropped (Blinded).


In [24]:
df[["delay_flag"]]

Unnamed: 0,delay_flag
31359,1.0
31399,1.0
31406,1.0
31407,1.0
31408,1.0
...,...
463334,0.0
463335,0.0
463336,0.0
463337,0.0


## FEATURE CREATION (Domain-driven transformations)

We convert raw signals into features that represent:
- temporal behavior (when + how frequently AIS reports)
- movement dynamics (acceleration, volatility)
- navigation changes (turning / heading shifts)
- geospatial progression (distance traveled)
- port-call context (time relative to ETD/ETA)
- route stability (trend + variability)

These features are deterministic and reproducible.


### Temporal features

- `date`: calendar date extracted from the update timestamp, used to identify daily and weekly patterns
- `hour`: hour of day extracted from the update timestamp, used to capture intraday behavior 
- `hour_of_day`: reuses existing hour (no timestamp recreation)
- `day_of_week`: captures weekday/weekend effects
- `time_since_last_position_min`: reporting gaps, missing signals, changes in cadence
- `reporting_interval_min`: rolling median to summarize typical reporting frequency

In [25]:
df["date"] = df["updated_ts"].dt.floor("D")
df["hour"] = df["updated_ts"].dt.hour

df["hour_of_day"] = df["hour"]
df["day_of_week"] = df["date"].dt.dayofweek

df["time_since_last_position_min"] = (
    df.groupby("imo")["updated_ts"]
      .diff()
      .dt.total_seconds() / 60
)

df["reporting_interval_min"] = (
    df.groupby("imo")["time_since_last_position_min"]
      .transform(lambda s: s.rolling(3, min_periods=1).median())
)

In [26]:
df.loc[
    df[[
        "date",
        "hour",
        "hour_of_day",
        "day_of_week",
        "time_since_last_position_min",
        "reporting_interval_min"
    ]].notna().all(axis=1),
    [
        "updated_ts",
        "date",
        "hour",
        "hour_of_day",
        "day_of_week",
        "time_since_last_position_min",
        "reporting_interval_min"
    ]
]

Unnamed: 0,updated_ts,date,hour,hour_of_day,day_of_week,time_since_last_position_min,reporting_interval_min
31407,2018-01-05 03:06:00,2018-01-05,3.0,3.0,4.0,4.0,4.0
31408,2018-01-05 03:06:00,2018-01-05,3.0,3.0,4.0,0.0,2.0
31409,2018-01-05 03:08:00,2018-01-05,3.0,3.0,4.0,2.0,2.0
31410,2018-01-05 03:09:00,2018-01-05,3.0,3.0,4.0,1.0,1.0
31411,2018-01-05 03:10:00,2018-01-05,3.0,3.0,4.0,1.0,1.0
...,...,...,...,...,...,...,...
463334,2019-03-03 22:24:00,2019-03-03,22.0,22.0,6.0,1.0,1.0
463335,2019-03-03 22:25:00,2019-03-03,22.0,22.0,6.0,2.0,2.0
463336,2019-03-03 22:24:00,2019-03-03,22.0,22.0,6.0,0.0,1.0
463337,2019-03-03 22:26:00,2019-03-03,22.0,22.0,6.0,1.0,2.0


### Movement / navigational features

Delay risk can be reflected in abnormal movement patterns:
- slowing down / speeding up (speed deltas and trends)
- unstable speeds (rolling std)
- maneuvering changes (course / heading changes)

We wrap angular differences into the range [-180, 180) so that 0/360 crossings are handled correctly.
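A small worked example, using made-up course values, shows what the wrap changes. A vessel turning from 350° to 10° has turned 20° to starboard, but a plain difference reports -340°.

```python
import pandas as pd

def angular_diff(series: pd.Series) -> pd.Series:
    # Plain consecutive difference, then wrap into [-180, 180).
    d = series.diff()
    return (d + 180) % 360 - 180

# Made-up course-over-ground values in degrees.
cog = pd.Series([350.0, 10.0, 5.0, 185.0])

naive = cog.diff().tolist()          # [nan, -340.0, -5.0, 180.0]
wrapped = angular_diff(cog).tolist() # [nan, 20.0, -5.0, -180.0]

print(naive)
print(wrapped)
```

Note that an exact 180° reversal maps to -180, not +180, because the wrapped interval is half-open.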

In [27]:
def angular_diff(series: pd.Series) -> pd.Series:
    d = series.diff()
    return (d + 180) % 360 - 180  # wrap to [-180, 180)

df["speed_delta"] = df.groupby("imo")["sog"].diff()

df["rolling_mean_sog"] = (
    df.groupby("imo")["sog"]
      .transform(lambda s: s.rolling(window=5, min_periods=1).mean())
)

df["rolling_std_sog"] = (
    df.groupby("imo")["sog"]
      .transform(lambda s: s.rolling(window=5, min_periods=2).std())
)

df["course_change"] = df.groupby("imo")["cog"].transform(angular_diff)
df["heading_change"] = df.groupby("imo")["hdg"].transform(angular_diff)

In [28]:
df[["sog","speed_delta","rolling_mean_sog","rolling_std_sog","course_change","heading_change"]]

Unnamed: 0,sog,speed_delta,rolling_mean_sog,rolling_std_sog,course_change,heading_change
31359,4.6,,4.600000,,,
31399,4.5,,4.500000,,,
31406,1.3,,1.300000,,,
31407,5.6,1.0,5.100000,0.707107,-22.0,-14.0
31408,4.9,-0.7,5.033333,0.513160,-9.0,9.0
...,...,...,...,...,...,...
463334,1.9,-5.3,10.620000,6.202177,129.0,85.0
463335,8.4,-4.4,16.660000,5.795084,41.0,43.0
463336,1.0,-0.9,7.540000,6.433739,72.0,14.0
463337,7.3,-1.1,13.820000,6.288641,6.0,7.0


### Geospatial features

Distance traveled between consecutive positions captures:
- progress (or lack of it)
- holding patterns
- slow approach / drifting

We compute great-circle distance using haversine, in nautical miles (nm).
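A quick sanity check on the unit: one minute of latitude should come out at roughly one nautical mile. The scalar helper below, `haversine_nm_scalar`, is a hypothetical stand-alone version written for this check and uses the same Earth radius as the notebook.

```python
import math

R_EARTH_NM = 3440.065  # mean Earth radius in nautical miles (same value as in this notebook)

def haversine_nm_scalar(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return R_EARTH_NM * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# One arc-minute of latitude along a meridian: expected to be close to 1 nm.
one_arcmin = haversine_nm_scalar(59.0, 24.0, 59.0 + 1 / 60, 24.0)

# Identical points: distance is exactly zero.
zero = haversine_nm_scalar(59.4457, 24.7662, 59.4457, 24.7662)

print(round(one_arcmin, 4), zero)
```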

In [29]:
R_EARTH_NM = 3440.065

def haversine_nm(lat1, lon1, lat2, lon2) -> pd.Series:
    lat1 = np.radians(lat1)
    lon1 = np.radians(lon1)
    lat2 = np.radians(lat2)
    lon2 = np.radians(lon2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2.0)**2
    c = 2.0 * np.arctan2(np.sqrt(a), np.sqrt(1.0-a))
    return R_EARTH_NM * c

df["distance_from_last_position_nm"] = haversine_nm(
    df.groupby("imo")["lat"].shift(),
    df.groupby("imo")["lon"].shift(),
    df["lat"],
    df["lon"]
)

In [30]:
df[["lat","lon","distance_from_last_position_nm"]]

Unnamed: 0,lat,lon,distance_from_last_position_nm
31359,59.4457,24.7662,
31399,59.4456,24.7718,
31406,60.1482,24.9151,
31407,59.4488,24.7763,0.360090
31408,59.4495,24.7775,0.055746
...,...,...,...
463334,60.1456,24.9138,0.055251
463335,59.4482,24.7760,0.341190
463336,60.1456,24.9136,0.005978
463337,59.4466,24.7739,0.115483


### Port-call features: measuring against the plan

This section generates features that describe the vessel's progress relative to its operational timeline. By calculating the "time-gap" between the current moment and the scheduled events, we provide the model with the context needed to identify potential delays.

To maintain **predictive integrity**, we only calculate time features using data available at the moment of prediction. 

* **Heuristic (is_in_port):** We identify if a vessel is stationary using a Speed Over Ground (SOG) threshold of $\le 0.5$ knots. This helps the model distinguish between transit time and berth time.
* **Time to Schedule (The "Window"):** We calculate the minutes remaining until the **Scheduled Arrival (ETA)** and **Scheduled Departure (ETD)**. This tells the model how much "buffer" remains in the plan.
* **Time Since Scheduled Departure:** By measuring the time elapsed since the `etd_schedule`, the model can detect if a ship is already "running late" before it even begins its current transit.


**Avoiding Target Leakage:**

In this version of the features, we have strictly excluded **Actual Arrival (ATA)** and **Actual Departure (ATD)** timestamps. 
* **The Reason:** In a real-world production environment, the "Actual" time is exactly what we are trying to predict—therefore, it is unknown. 
* **The Result:** By forcing the model to compare the current timestamp (`updated_ts`) against the **Schedules**, we ensure the model learns to identify the *symptoms* of a delay rather than simply calculating a known result.

In [47]:
# 1. Helper function to calculate time differences safely
def diff_minutes(later, earlier):
    """Calculates the difference in minutes between two datetime columns."""
    return (pd.to_datetime(later) - pd.to_datetime(earlier)).dt.total_seconds() / 60.0

IN_PORT_SOG_THRESHOLD = 0.5

# 2. Heuristic: Is the ship stationary?
df["is_in_port"] = ((df["sog"] <= IN_PORT_SOG_THRESHOLD) & df["sog"].notna()).astype("Int8")

# 3. Time measures relative to the SCHEDULE (The Plan)
# We ONLY use 'etd_schedule' and 'eta_schedule' because these are known 
# BEFORE the ship arrives. 

# Minutes until the planned departure
df["time_to_etd_schedule_min"] = diff_minutes(df["etd_schedule"], df["updated_ts"])

# Minutes until the planned arrival
df["time_to_eta_schedule_min"] = diff_minutes(df["eta_schedule"], df["updated_ts"])

# Minutes since the planned departure (to detect if already behind schedule)
df["time_since_etd_schedule_min"] = diff_minutes(df["updated_ts"], df["etd_schedule"])

# 4. Remove leakage-prone features if they are present (e.g. left over from an earlier run)
# We drop columns that use Actual timestamps to ensure a true prediction.
cols_to_drop = ["time_since_atd_min", "time_to_eta_min", "time_to_etd_min"]
df = df.drop(columns=[c for c in cols_to_drop if c in df.columns], errors='ignore')

print("✔ Features calculated using Schedule only. Predictive integrity maintained.")

✔ Features calculated using Schedule only. Predictive integrity maintained.


### Route behavior features

We want compact indicators of route consistency:
- `distance_variation`: rolling variability of distance traveled between reports
- `avg_speed_trend`: how average speed is changing
- `movement_stability`: combined stability index (higher = more stable)

These help summarize movement behavior over short windows.
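Written out, the stability index computed in the next cell is

$$S = \frac{1}{1 + \sigma_{\text{sog}} + \sigma_{\text{dist}}}$$

where $\sigma_{\text{sog}}$ is `rolling_std_sog` and $\sigma_{\text{dist}}$ is `distance_variation`, with missing values treated as 0. For example, with $\sigma_{\text{sog}} = 0.513160$ and $\sigma_{\text{dist}} = 0.215204$, $S = 1 / 1.728364 \approx 0.5786$. A consequence of the zero fill is that the first reports of a vessel, where neither rolling statistic exists yet, get $S = 1$ (maximally stable) rather than a missing value.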

In [34]:
df["distance_variation"] = (
    df.groupby("imo")["distance_from_last_position_nm"]
      .transform(lambda s: s.rolling(window=5, min_periods=2).std())
)

df["avg_speed_trend"] = df.groupby("imo")["rolling_mean_sog"].diff()

df["movement_stability"] = 1.0 / (
    1.0
    + df["rolling_std_sog"].fillna(0.0)
    + df["distance_variation"].fillna(0.0)
)

In [35]:
df[["distance_variation","avg_speed_trend","movement_stability"]]

Unnamed: 0,distance_variation,avg_speed_trend,movement_stability
31359,,,1.000000
31399,,,1.000000
31406,,,1.000000
31407,,0.500000,0.585786
31408,0.215204,-0.066667,0.578582
...,...,...,...
463334,0.237353,-2.960000,0.134417
463335,0.097707,-2.600000,0.145079
463336,0.252350,-3.080000,0.130105
463337,0.154460,-2.840000,0.134353


## TRANSFORMATIONS IN CATEGORICAL VARIABLES (Encodings)

We choose encodings based on high/low cardinality:

**One-Hot Encoding** (low cardinality) for ports (`dep_port`, `arr_port`)
- Why: the number of distinct ports is usually manageable


**Target Encoding** for high-cardinality identifiers (`imo`, `ship_name`)
- Why: one-hot for IMO/ship_name can explode dimensionality (too many sparse columns).
  Target encoding compresses this into a single numeric feature representing historical delay rate.

Note: target encoding can overfit if not validated correctly; feature preselection/modeling will handle robust validation. Here we only construct the representation once.
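For reference, one common way to reduce that overfitting is out-of-fold target encoding, where each row is encoded using category means computed on the other folds only. The sketch below is not what this notebook does. `oof_target_encode` is a hypothetical helper, the toy data is made up, and random folds are an assumption (with AIS data, folds grouped by voyage or split by time may be more appropriate).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(frame, key, target, n_splits=5, seed=0):
    """Hypothetical helper: encode `key` with out-of-fold means of `target`."""
    encoded = pd.Series(np.nan, index=frame.index, dtype="float64")
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(frame):
        fit_part = frame.iloc[fit_idx]
        # Category means and fallback are computed on the fitting folds only.
        means = fit_part.groupby(key)[target].mean()
        fallback = fit_part[target].mean()
        enc_keys = frame.iloc[enc_idx][key]
        encoded.iloc[enc_idx] = enc_keys.map(means).fillna(fallback).to_numpy()
    return encoded

# Made-up toy data: two vessels with binary delay flags.
toy = pd.DataFrame({
    "imo": [1, 1, 1, 1, 2, 2, 2, 2],
    "delay_flag": [1, 1, 0, 1, 0, 0, 1, 0],
})
toy["imo_te_oof"] = oof_target_encode(toy, "imo", "delay_flag", n_splits=4)
print(toy)
```

Because a row's own label never contributes to its encoded value, the feature carries less direct information about the target than the full-data mean.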

In [36]:
# One-Hot Encoding
cat_ohe_cols = ["dep_port", "arr_port"]
cat_ohe_df = pd.get_dummies(df[cat_ohe_cols], prefix=cat_ohe_cols, dummy_na=True)

# Target Encoding
imo_te_map = df.groupby("imo")["delay_flag"].mean()
ship_te_map = df.groupby("ship_name")["delay_flag"].mean()

df["imo_te"] = df["imo"].map(imo_te_map)
df["ship_name_te"] = df["ship_name"].map(ship_te_map)

cat_target_df = df[["imo_te", "ship_name_te"]]

In [37]:
cat_ohe_df, cat_target_df

(        dep_port_EETLL  dep_port_FIHEL  dep_port_nan  arr_port_EETLL  arr_port_FIHEL  arr_port_nan
 31359             True           False         False           False            True         False
 31399             True           False         False           False            True         False
 31406            False            True         False            True           False         False
 31407             True           False         False           False            True         False
 31408             True           False         False           False            True         False
 ...                ...             ...           ...             ...             ...           ...
 463334            True           False         False           False            True         False
 463335           False            True         False            True           False         False
 463336            True           False         False           False            True         False


## TRANSFORMATIONS IN NUMERICAL VARIABLES (Selection for scaling)

At this stage we define the numerical feature space that will be used for scaling and modeling.

- Identifiers are excluded and kept only for traceability.

- Raw categorical fields are excluded; their information is already captured via encodings.

- Target variables are kept separate to prevent leakage and accidental scaling.

- Raw timestamps and calendar dates are not included in the feature space. They are raw material used to derive meaningful temporal signals (hours, weekdays, time deltas, reporting cadence), which are the variables that actually carry behavioral information.

The output of this step is a clean, model-ready set of numerical features that can be safely scaled and passed downstream.

In [38]:
# Base identifiers we keep for traceability (not scaled, not used as features directly)
id_cols = ["record_id", "imo", "ship_name"]

# Targets (delay_minutes was dropped during blinding, so only the flag remains)
target_cols = ["delay_flag"]

# Raw categorical columns (kept optional; features come from encoded versions)
raw_cat_cols = ["dep_port", "arr_port"]

# Collect numeric feature columns from engineered set + original numeric signals
numeric_candidate_cols = [
    "lat","lon","sog","cog","hdg",
    "hour","hour_of_day","day_of_week",
    "time_since_last_position_min","reporting_interval_min",
    "speed_delta","rolling_mean_sog","rolling_std_sog",
    "course_change","heading_change",
    "distance_from_last_position_nm",
    "is_in_port",
    "time_since_etd_schedule_min",  # leakage-prone time_to_*/time_since_atd columns were dropped above
    "distance_variation","avg_speed_trend","movement_stability",
    # target encodings are numeric too
    "imo_te","ship_name_te",
]

# Keep only those that exist (safe)
numeric_candidate_cols = [c for c in numeric_candidate_cols if c in df.columns]

num_df = df[numeric_candidate_cols].copy()

In [39]:
num_df

Unnamed: 0,lat,lon,sog,cog,hdg,hour,hour_of_day,day_of_week,time_since_last_position_min,reporting_interval_min,speed_delta,rolling_mean_sog,rolling_std_sog,course_change,heading_change,distance_from_last_position_nm,is_in_port,time_since_etd_schedule_min,distance_variation,avg_speed_trend,movement_stability,imo_te,ship_name_te
31359,59.4457,24.7662,4.6,67,244,3.0,3.0,4.0,,,,4.600000,,,,,0,2.0,,,1.000000,0.866983,0.866983
31399,59.4456,24.7718,4.5,31,214,4.0,4.0,4.0,,,,4.500000,,,,,0,-12.0,,,1.000000,0.859157,0.859157
31406,60.1482,24.9151,1.3,209,209,4.0,4.0,4.0,,,,1.300000,,,,,0,-10.0,,,1.000000,0.860206,0.860206
31407,59.4488,24.7763,5.6,45,230,3.0,3.0,4.0,4.0,4.0,1.0,5.100000,0.707107,-22.0,-14.0,0.360090,0,6.0,,0.500000,0.585786,0.866983,0.866983
31408,59.4495,24.7775,4.9,36,239,3.0,3.0,4.0,0.0,2.0,-0.7,5.033333,0.513160,-9.0,9.0,0.055746,0,6.0,0.215204,-0.066667,0.578582,0.866983,0.866983
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
463334,60.1456,24.9138,1.9,217,179,22.0,22.0,6.0,1.0,1.0,-5.3,10.620000,6.202177,129.0,85.0,0.055251,0,114.0,0.237353,-2.960000,0.134417,0.860206,0.860206
463335,59.4482,24.7760,8.4,210,209,22.0,22.0,6.0,2.0,2.0,-4.4,16.660000,5.795084,41.0,43.0,0.341190,0,115.0,0.097707,-2.600000,0.145079,0.859157,0.859157
463336,60.1456,24.9136,1.0,289,193,22.0,22.0,6.0,0.0,1.0,-0.9,7.540000,6.433739,72.0,14.0,0.005978,0,114.0,0.252350,-3.080000,0.130105,0.860206,0.860206
463337,59.4466,24.7739,7.3,216,216,22.0,22.0,6.0,1.0,2.0,-1.1,13.820000,6.288641,6.0,7.0,0.115483,0,116.0,0.154460,-2.840000,0.134353,0.859157,0.859157


## MERGE TRANSFORMED DATASETS

In [40]:
X_num = num_df.reset_index(drop=True)
X_ohe = cat_ohe_df.reset_index(drop=True)

# imo_te and ship_name_te already inside num_df, so cat_target_df doesn't need separate concat. We'll avoid duplication.

df_features = pd.concat([X_num, X_ohe], axis=1)

In [41]:
df_features

Unnamed: 0,lat,lon,sog,cog,hdg,hour,hour_of_day,day_of_week,time_since_last_position_min,reporting_interval_min,speed_delta,rolling_mean_sog,rolling_std_sog,course_change,heading_change,distance_from_last_position_nm,is_in_port,time_since_etd_schedule_min,distance_variation,avg_speed_trend,movement_stability,imo_te,ship_name_te,dep_port_EETLL,dep_port_FIHEL,dep_port_nan,arr_port_EETLL,arr_port_FIHEL,arr_port_nan
0,59.4457,24.7662,4.6,67,244,3.0,3.0,4.0,,,,4.600000,,,,,0,2.0,,,1.000000,0.866983,0.866983,True,False,False,False,True,False
1,59.4456,24.7718,4.5,31,214,4.0,4.0,4.0,,,,4.500000,,,,,0,-12.0,,,1.000000,0.859157,0.859157,True,False,False,False,True,False
2,60.1482,24.9151,1.3,209,209,4.0,4.0,4.0,,,,1.300000,,,,,0,-10.0,,,1.000000,0.860206,0.860206,False,True,False,True,False,False
3,59.4488,24.7763,5.6,45,230,3.0,3.0,4.0,4.0,4.0,1.0,5.100000,0.707107,-22.0,-14.0,0.360090,0,6.0,,0.500000,0.585786,0.866983,0.866983,True,False,False,False,True,False
4,59.4495,24.7775,4.9,36,239,3.0,3.0,4.0,0.0,2.0,-0.7,5.033333,0.513160,-9.0,9.0,0.055746,0,6.0,0.215204,-0.066667,0.578582,0.866983,0.866983,True,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116476,60.1456,24.9138,1.9,217,179,22.0,22.0,6.0,1.0,1.0,-5.3,10.620000,6.202177,129.0,85.0,0.055251,0,114.0,0.237353,-2.960000,0.134417,0.860206,0.860206,True,False,False,False,True,False
116477,59.4482,24.7760,8.4,210,209,22.0,22.0,6.0,2.0,2.0,-4.4,16.660000,5.795084,41.0,43.0,0.341190,0,115.0,0.097707,-2.600000,0.145079,0.859157,0.859157,False,True,False,True,False,False
116478,60.1456,24.9136,1.0,289,193,22.0,22.0,6.0,0.0,1.0,-0.9,7.540000,6.433739,72.0,14.0,0.005978,0,114.0,0.252350,-3.080000,0.130105,0.860206,0.860206,True,False,False,False,True,False
116479,59.4466,24.7739,7.3,216,216,22.0,22.0,6.0,1.0,2.0,-1.1,13.820000,6.288641,6.0,7.0,0.115483,0,116.0,0.154460,-2.840000,0.134353,0.859157,0.859157,False,True,False,True,False,False


## RESCALING (Min-Max)

Rescaling puts numeric features in a comparable range [0, 1], which can help:
- some supervised selection methods
- models sensitive to feature scale
- stable optimization

We apply Min-Max scaling to `df_features` (the merged dataframe containing both the numeric features and the encoded categorical features).

We do not scale:
- identifiers
- targets
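Min-Max scaling maps each value to $(x - x_{\min}) / (x_{\max} - x_{\min})$ per column. The small example below, on made-up numbers, also shows that scikit-learn's `MinMaxScaler` ignores `NaN` when fitting and keeps it in the output, which is why missing values can still appear in a scaled table.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One made-up feature column with a missing value.
X = np.array([[2.0], [np.nan], [4.0], [10.0]])

# min = 2 and max = 10 are computed from the non-missing values only.
scaled = MinMaxScaler().fit_transform(X).ravel()

print(scaled)  # 2 -> 0.0, NaN stays NaN, 4 -> 0.25, 10 -> 1.0
```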

In [42]:
scaler = MinMaxScaler()

df_features_scaled = pd.DataFrame(
    scaler.fit_transform(df_features),
    columns=df_features.columns
)

In [43]:
df_features_scaled

Unnamed: 0,lat,lon,sog,cog,hdg,hour,hour_of_day,day_of_week,time_since_last_position_min,reporting_interval_min,speed_delta,rolling_mean_sog,rolling_std_sog,course_change,heading_change,distance_from_last_position_nm,is_in_port,time_since_etd_schedule_min,distance_variation,avg_speed_trend,movement_stability,imo_te,ship_name_te,dep_port_EETLL,dep_port_FIHEL,dep_port_nan,arr_port_EETLL,arr_port_FIHEL,arr_port_nan
0,0.002675,0.376320,0.165468,0.186111,0.679666,0.130435,0.130435,0.666667,,,,0.167638,,,,,0.0,0.515838,,,1.000000,1.000000,1.000000,1.0,0.0,0.0,0.0,1.0,0.0
1,0.002534,0.392297,0.161871,0.086111,0.596100,0.173913,0.173913,0.666667,,,,0.163994,,,,,0.0,0.515676,,,1.000000,0.719043,0.719043,1.0,0.0,0.0,0.0,1.0,0.0
2,0.991833,0.801141,0.046763,0.580556,0.582173,0.173913,0.173913,0.666667,,,,0.047376,,,,,0.0,0.515699,,,1.000000,0.756725,0.756725,0.0,1.0,0.0,1.0,0.0,0.0
3,0.007040,0.405136,0.201439,0.125000,0.640669,0.130435,0.130435,0.666667,0.832573,0.001001,0.520661,0.185860,0.053641,0.440111,0.462396,0.008404,0.0,0.515884,,0.554192,0.572978,1.000000,1.000000,1.0,0.0,0.0,0.0,1.0,0.0
4,0.008026,0.408559,0.176259,0.100000,0.665738,0.130435,0.130435,0.666667,0.832566,0.000956,0.485537,0.183431,0.038929,0.476323,0.526462,0.001301,0.0,0.515884,0.009208,0.496251,0.565550,1.000000,1.000000,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116476,0.988172,0.797432,0.068345,0.602778,0.498607,0.956522,0.956522,1.000000,0.832568,0.000933,0.390496,0.387026,0.470500,0.860724,0.738162,0.001290,0.0,0.517131,0.010155,0.200409,0.107650,0.756725,0.756725,1.0,0.0,0.0,0.0,1.0,0.0
116477,0.006195,0.404280,0.302158,0.583333,0.582173,0.956522,0.956522,1.000000,0.832569,0.000956,0.409091,0.607143,0.439617,0.615599,0.621170,0.007963,0.0,0.517143,0.004180,0.237219,0.118642,0.719043,0.719043,0.0,1.0,0.0,1.0,0.0,0.0
116478,0.988172,0.796862,0.035971,0.802778,0.537604,0.956522,0.956522,1.000000,0.832566,0.000933,0.481405,0.274781,0.488066,0.701950,0.540390,0.000140,0.0,0.517131,0.010797,0.188139,0.103205,0.756725,0.756725,1.0,0.0,0.0,0.0,1.0,0.0
116479,0.003943,0.398288,0.262590,0.600000,0.601671,0.956522,0.956522,1.000000,0.832568,0.000956,0.477273,0.503644,0.477059,0.518106,0.520891,0.002695,0.0,0.517155,0.006609,0.212679,0.107584,0.719043,0.719043,0.0,1.0,0.0,1.0,0.0,0.0


## ANALYTICAL BASE TABLE

Downstream notebooks (preselection and modeling) should consume one clean table that includes:
- identifiers (for grouping/debugging)
- scaled features (ready to model)
- targets (ready for selection/modeling)

In [50]:
# 1. Update your target columns list to ONLY include the flag
# We remove 'delay_minutes' because it was dropped to prevent leakage
target_cols = ["delay_flag"] 

# 2. Build the table
df_analytical_base = pd.concat(
    [
        df[id_cols].reset_index(drop=True),
        df_features_scaled.reset_index(drop=True),
        df[target_cols].reset_index(drop=True),
    ],
    axis=1
)

print("✔ Analytical Base Table created successfully with 'delay_flag' as the only target.")

✔ Analytical Base Table created successfully with 'delay_flag' as the only target.


In [51]:
df_analytical_base

Unnamed: 0,record_id,imo,ship_name,lat,lon,sog,cog,hdg,hour,hour_of_day,day_of_week,time_since_last_position_min,reporting_interval_min,speed_delta,rolling_mean_sog,rolling_std_sog,course_change,heading_change,distance_from_last_position_nm,is_in_port,time_since_etd_schedule_min,distance_variation,avg_speed_trend,movement_stability,imo_te,ship_name_te,dep_port_EETLL,dep_port_FIHEL,dep_port_nan,arr_port_EETLL,arr_port_FIHEL,arr_port_nan,delay_flag
0,69195,9214379,Finlandia,0.002675,0.376320,0.165468,0.186111,0.679666,0.130435,0.130435,0.666667,,,,0.167638,,,,,0.0,0.515838,,,1.000000,1.000000,1.000000,1.0,0.0,0.0,0.0,1.0,0.0,1.0
1,69260,9773064,Megastar,0.002534,0.392297,0.161871,0.086111,0.596100,0.173913,0.173913,0.666667,,,,0.163994,,,,,0.0,0.515676,,,1.000000,0.719043,0.719043,1.0,0.0,0.0,0.0,1.0,0.0,1.0
2,69269,9364722,Star,0.991833,0.801141,0.046763,0.580556,0.582173,0.173913,0.173913,0.666667,,,,0.047376,,,,,0.0,0.515699,,,1.000000,0.756725,0.756725,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,69272,9214379,Finlandia,0.007040,0.405136,0.201439,0.125000,0.640669,0.130435,0.130435,0.666667,0.832573,0.001001,0.520661,0.185860,0.053641,0.440111,0.462396,0.008404,0.0,0.515884,,0.554192,0.572978,1.000000,1.000000,1.0,0.0,0.0,0.0,1.0,0.0,1.0
4,69273,9214379,Finlandia,0.008026,0.408559,0.176259,0.100000,0.665738,0.130435,0.130435,0.666667,0.832566,0.000956,0.485537,0.183431,0.038929,0.476323,0.526462,0.001301,0.0,0.515884,0.009208,0.496251,0.565550,1.000000,1.000000,1.0,0.0,0.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116476,882641,9364722,Star,0.988172,0.797432,0.068345,0.602778,0.498607,0.956522,0.956522,1.000000,0.832568,0.000933,0.390496,0.387026,0.470500,0.860724,0.738162,0.001290,0.0,0.517131,0.010155,0.200409,0.107650,0.756725,0.756725,1.0,0.0,0.0,0.0,1.0,0.0,0.0
116477,882642,9773064,Megastar,0.006195,0.404280,0.302158,0.583333,0.582173,0.956522,0.956522,1.000000,0.832569,0.000956,0.409091,0.607143,0.439617,0.615599,0.621170,0.007963,0.0,0.517143,0.004180,0.237219,0.118642,0.719043,0.719043,0.0,1.0,0.0,1.0,0.0,0.0,0.0
116478,882643,9364722,Star,0.988172,0.796862,0.035971,0.802778,0.537604,0.956522,0.956522,1.000000,0.832566,0.000933,0.481405,0.274781,0.488066,0.701950,0.540390,0.000140,0.0,0.517131,0.010797,0.188139,0.103205,0.756725,0.756725,1.0,0.0,0.0,0.0,1.0,0.0,0.0
116479,882644,9773064,Megastar,0.003943,0.398288,0.262590,0.600000,0.601671,0.956522,0.956522,1.000000,0.832568,0.000956,0.477273,0.503644,0.477059,0.518106,0.520891,0.002695,0.0,0.517155,0.006609,0.212679,0.107584,0.719043,0.719043,0.0,1.0,0.0,1.0,0.0,0.0,0.0


## VALIDATION (structural checks)

We verify that feature construction did not create invalid numeric values.

We do not do correlation pruning or selection here (that happens in the next notebook).

In [52]:
# Check no inf / -inf in numeric columns (pandas-safe)
num_check = df_analytical_base.select_dtypes(include=[np.number])

has_inf = np.isinf(num_check).any().any()
print("Are we cooked with inf/-inf?", has_inf)

# Quick sanity checks for bounded fields
# (checked on the unscaled values in df, because the base table holds [0, 1]-scaled features)
assert df["hour_of_day"].dropna().between(0, 23).all()
assert df["day_of_week"].dropna().between(0, 6).all()

print("Basic validation in hour_of_day and day_of_week passed")

Are we cooked with inf/-inf? False
Basic validation in hour_of_day and day_of_week passed


## SAVE DATASET

In [53]:
df_analytical_base.to_csv(working_dir + output_name, index=False)
print("Saved:", working_dir + output_name)

Saved: /Users/rober/smartport-delay-risk-scoring//02_Data/03_Working/work_fe.csv
