# Feature Enhancement: Adding Lag Features (Recent Congestion Levels) and Weather Data



This notebook combines previously engineered features (from earlier notebooks) with two important new sources of information to create the final feature set.


### 1. Adding Lag Features 



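These features (prev_1h_severity, prev_2h_severity) give the models access to the recent severity history of each road. Note that the traffic data is sampled every 30 minutes, so shift(1) and shift(2) return the readings from roughly 30 and 60 minutes earlier, despite the 1h/2h names.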

In [1]:
import pandas as pd

# Load existing engineered dataset 
df = pd.read_csv("../data/engineered_traffic_data.csv")

# Ensure timestamp is parsed correctly
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Sort the data by road and timestamp to calculate lags properly
df = df.sort_values(by=['road', 'timestamp']).reset_index(drop=True)

# Create lag features
df['prev_1h_severity'] = df.groupby('road')['severity_level'].shift(1)
df['prev_2h_severity'] = df.groupby('road')['severity_level'].shift(2)

# Fill missing lags on each road's first rows (not enough history) with the sentinel value -1
df['prev_1h_severity'] = df['prev_1h_severity'].fillna(-1).astype(int)
df['prev_2h_severity'] = df['prev_2h_severity'].fillna(-1).astype(int)

print(df.head())

# Save this version (before the weather data is merged in)
df.to_csv("../data/engineered_traffic_with_lags.csv", index=False)
print("Lag features successfully created and saved.")


  road status            description           timestamp  hour weekday  \
0   A1   Good  No Exceptional Delays 2025-03-10 00:08:00     0  Monday   
1   A1   Good  No Exceptional Delays 2025-03-10 00:38:00     0  Monday   
2   A1  Minor           Minor Delays 2025-03-10 01:08:00     1  Monday   
3   A1   Good  No Exceptional Delays 2025-03-10 01:38:00     1  Monday   
4   A1   Good  No Exceptional Delays 2025-03-10 02:08:00     2  Monday   

   day_of_week  is_weekend  is_rush_hour  severity_level  prev_1h_severity  \
0            0           0             0               0                -1   
1            0           0             0               0                 0   
2            0           0             0               1                 0   
3            0           0             0               0                 1   
4            0           0             0               0                 0   

   prev_2h_severity  
0                -1  
1                -1  
2                 0 
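
A row-based shift only matches a fixed time offset when no readings are missing. The sketch below, on made-up toy data (the values and the `lag_is_stale` column are illustrative assumptions, not part of the notebook), shows one way to flag lags that span more than the expected 30-minute step.

```python
import pandas as pd

# Toy data standing in for the traffic table; the 01:08 reading for A1 is missing.
toy = pd.DataFrame({
    "road": ["A1", "A1", "A1", "A2", "A2"],
    "timestamp": pd.to_datetime([
        "2025-03-10 00:08", "2025-03-10 00:38", "2025-03-10 01:38",
        "2025-03-10 00:08", "2025-03-10 00:38",
    ]),
    "severity_level": [0, 1, 0, 2, 2],
}).sort_values(["road", "timestamp"]).reset_index(drop=True)

# Same lag construction as in the cell above
toy["prev_1h_severity"] = toy.groupby("road")["severity_level"].shift(1)

# Time elapsed since the previous row of the same road (NaT on each road's first row)
toy["gap"] = toy.groupby("road")["timestamp"].diff()

# Flag lags that do not come from the expected 30-minute step (assumed interval)
expected_step = pd.Timedelta(minutes=30)
toy["lag_is_stale"] = toy["gap"].notna() & (toy["gap"] != expected_step)

print(toy)
```

If many rows are flagged on the real data, a time-based lookup (for example a self-merge on `timestamp - 30 minutes`) may be a safer alternative to `shift`.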

### 2. Adding Weather Data

Source: https://open-meteo.com/

Weather data is merged into the main dataset, providing the following variables: temperature_2m, precipitation, rain, snowfall, wind_speed_10m, wind_gusts_10m, cloud_cover.

These features may help the models account for the environmental context of traffic patterns.

In [2]:
import pandas as pd

weather_df = pd.read_csv(
    '../data/open-meteo-51.49N0.16W1m.csv',
    skiprows=3  # skip metadata, use row 4 as header
)

weather_df.columns = [col.split(' (')[0].strip() for col in weather_df.columns]

weather_df['time'] = pd.to_datetime(weather_df['time'])

print(weather_df.head())


                 time  temperature_2m  precipitation  rain  snowfall  \
0 2025-03-10 00:00:00             8.7            0.0   0.0       0.0   
1 2025-03-10 01:00:00             8.6            0.0   0.0       0.0   
2 2025-03-10 02:00:00             8.8            0.0   0.0       0.0   
3 2025-03-10 03:00:00             8.4            0.0   0.0       0.0   
4 2025-03-10 04:00:00             8.0            0.0   0.0       0.0   

   wind_speed_10m  wind_gusts_10m  cloud_cover  
0             6.9            15.8           97  
1             6.4            14.0           29  
2             7.3            14.8           99  
3             7.5            14.8           99  
4             7.1            14.8           11  
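
The column clean-up in the cell above relies on headers of the form "name (unit)". A minimal sketch of what the list comprehension does, using hypothetical raw headers (the exact unit strings in the real file are an assumption here):

```python
# Hypothetical raw headers in the "name (unit)" style that the split assumes
raw_columns = ["time", "temperature_2m (°C)", "precipitation (mm)", "wind_speed_10m (km/h)"]

# Keep everything before the first " (" and trim stray whitespace
clean_columns = [col.split(" (")[0].strip() for col in raw_columns]

print(clean_columns)
```

Headers without a unit suffix, such as `time`, pass through unchanged.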


#### Weather Data Merging Approach

The traffic data contains observations at 30-minute intervals (at HH:08 and HH:38 in the sample above), while the available weather data is recorded hourly. To integrate weather information into the traffic dataset, each traffic timestamp is floored to the start of its hour, so both traffic records within an hour are matched to the weather record at HH:00. Flooring always rounds down; it does not pick the nearest hour.

This approach assumes that weather conditions are relatively stable within each hour, allowing each hourly weather record to be used for both 30-minute traffic records in that hour. 

This provides a consistent, aligned dataset where every traffic observation is enriched with the most recent weather information available at that time.
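
Flooring and rounding give different results for readings in the second half of an hour. A small sketch on three illustrative timestamps (chosen to mirror the HH:08 / HH:38 pattern in the traffic sample):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2025-03-10 00:08",
    "2025-03-10 00:38",
    "2025-03-10 01:08",
]))

floored = ts.dt.floor("h")   # always rounds down to the start of the hour
rounded = ts.dt.round("h")   # rounds to the nearest hour instead

print(pd.DataFrame({"timestamp": ts, "floored": floored, "rounded": rounded}))
```

With flooring, the 00:38 reading uses the 00:00 weather record, which has already been observed at that time; rounding would assign it the 01:00 record, which lies in the future relative to the reading.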

In [3]:

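# Align traffic timestamps to the start of the hour to match the hourly weather data
# (lowercase 'h': the uppercase 'H' alias is deprecated in recent pandas versions)
df['weather_time'] = df['timestamp'].dt.floor('h')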

# Merge traffic with weather
merged_df = pd.merge(df, weather_df, left_on='weather_time', right_on='time', how='left')

# Drop unnecessary columns
merged_df = merged_df.drop(columns=['weather_time', 'time'])

print(merged_df.head())

# save merged data
merged_df.to_csv("../data/engineered_traffic_with_lags_and_weather.csv", index=False)
print("Successfully merged weather data.")


  road status            description           timestamp  hour weekday  \
0   A1   Good  No Exceptional Delays 2025-03-10 00:08:00     0  Monday   
1   A1   Good  No Exceptional Delays 2025-03-10 00:38:00     0  Monday   
2   A1  Minor           Minor Delays 2025-03-10 01:08:00     1  Monday   
3   A1   Good  No Exceptional Delays 2025-03-10 01:38:00     1  Monday   
4   A1   Good  No Exceptional Delays 2025-03-10 02:08:00     2  Monday   

   day_of_week  is_weekend  is_rush_hour  severity_level  prev_1h_severity  \
0            0           0             0               0                -1   
1            0           0             0               0                 0   
2            0           0             0               1                 0   
3            0           0             0               0                 1   
4            0           0             0               0                 0   

   prev_2h_severity  temperature_2m  precipitation  rain  snowfall  \
0               



Successfully merged weather data.
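
A left merge silently leaves NaN weather values for any traffic row whose hour is missing from the weather file. One possible check, sketched on toy frames (the data is made up; `validate` and `indicator` are standard `pd.merge` parameters):

```python
import pandas as pd

# Toy traffic readings; the 05:08 reading has no matching weather hour below
traffic = pd.DataFrame({
    "road": ["A1", "A1", "A1"],
    "timestamp": pd.to_datetime([
        "2025-03-10 00:08", "2025-03-10 00:38", "2025-03-10 05:08",
    ]),
})
weather = pd.DataFrame({
    "time": pd.to_datetime(["2025-03-10 00:00", "2025-03-10 01:00"]),
    "temperature_2m": [8.7, 8.6],
})

traffic["weather_time"] = traffic["timestamp"].dt.floor("h")

checked = pd.merge(
    traffic, weather,
    left_on="weather_time", right_on="time",
    how="left",
    validate="many_to_one",  # raises MergeError if the weather table has duplicate hours
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "traffic rows without a weather record")
```

Running the same check on the real frames before saving would show whether the weather file covers the full traffic period.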


### 3. Feature Scaling

Scaling puts features with different numeric ranges on a comparable scale, so that no feature dominates the model simply because its values are larger.

Here I use StandardScaler, which standardizes each feature to zero mean and unit variance (it does not squeeze values into a fixed range).

For example:

- Temperatures may range from 0°C to 30°C.

- Precipitation may range from 0 mm to 10 mm.

- Wind speeds may range from 0 to 40 km/h.

- Cloud cover ranges from 0% to 100%.

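Such different scales can distort algorithms that rely on distances or gradient descent (e.g. k-NN, SVMs, logistic regression, neural networks). Tree-based models are largely insensitive to feature scale.

I scale only the weather features.
The traffic features (severity_level, lags, weekday, etc.) are categorical or ordinal, so they don't need scaling.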
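
As a quick illustration of what StandardScaler computes, the sketch below compares it with the manual formula z = (x - mean) / std on four made-up temperature values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative temperatures in °C, shaped (n_samples, n_features)
temps = np.array([[0.0], [10.0], [20.0], [30.0]])

scaled = StandardScaler().fit_transform(temps)

# Manual standardization; StandardScaler uses the population std (ddof=0)
manual = (temps - temps.mean(axis=0)) / temps.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1 (up to floating-point error), regardless of the original units.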

In [4]:

print(df.columns.tolist())


['road', 'status', 'description', 'timestamp', 'hour', 'weekday', 'day_of_week', 'is_weekend', 'is_rush_hour', 'severity_level', 'prev_1h_severity', 'prev_2h_severity', 'weather_time']


In [5]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the full dataset
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")

# Strip any possible whitespaces in column names
df.columns = df.columns.str.strip()

# Select weather features
weather_cols = ['temperature_2m', 'precipitation', 'rain', 'snowfall',
                'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover']

# Apply scaling
scaler = StandardScaler()
df[weather_cols] = scaler.fit_transform(df[weather_cols])

# Save scaled version
df.to_csv("../data/engineered_traffic_with_lags_and_weather_scaled.csv", index=False)

print("Scaling applied and saved successfully!")


Scaling applied and saved successfully!
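
The cell above fits the scaler on the whole dataset. If the later notebooks hold out a test set, fitting on the training portion only avoids leaking test-set statistics into the features. A minimal sketch, assuming a time-based split with a hypothetical cutoff and toy values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy hourly data standing in for the merged dataset
toy = pd.DataFrame({
    "timestamp": pd.date_range("2025-03-10", periods=10, freq="h"),
    "temperature_2m": [8.7, 8.6, 8.8, 8.4, 8.0, 7.5, 7.9, 9.1, 10.2, 11.0],
})

# Hypothetical cutoff: everything before it is training data
cutoff = pd.Timestamp("2025-03-10 08:00")
train = toy[toy["timestamp"] < cutoff].copy()
test = toy[toy["timestamp"] >= cutoff].copy()

cols = ["temperature_2m"]

# Learn mean and std from the training rows only, then apply to both splits
scaler = StandardScaler().fit(train[cols])
train[cols] = scaler.transform(train[cols])
test[cols] = scaler.transform(test[cols])

print(train["temperature_2m"].mean(), test["temperature_2m"].mean())
```

The training column ends up centered at zero, while the test column generally does not, which is expected because it is scaled with the training statistics.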


### Summary

I now have two datasets (engineered_traffic_with_lags.csv and engineered_traffic_with_lags_and_weather_scaled.csv), which I will use in the following two notebooks to experiment with ML models.