# Baseline Probability Model Based on Historical Frequencies



This notebook implements a simple, interpretable baseline model that predicts traffic severity using empirical class probabilities conditioned on the time of day, weekday, and road segment. These probabilities are computed using historical data and serve as a reference point for evaluating more complex models later in the project.

No machine learning is used in this notebook.

The approach is useful because it reflects recurring traffic patterns and allows for probability-based feature engineering, which is later used to enhance the performance of machine learning models.

How the probabilities are calculated:

- I use empirical frequency-based probabilities, obtained purely by counting past events.
- I group the historical dataset by road, hour, and weekday.
- For each group, I count how many times each severity level (0, 1, 2) occurred.
- Then I normalize those counts within each group so that they sum to 1, which gives the class probabilities.
  
These probabilities are statistical summaries of the historical data rather than predictions from a trained model, which is what keeps the baseline simple and interpretable.
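
The counting and normalizing steps can be illustrated on a tiny example. This is a minimal sketch on made-up data: the road name `A1`, the five observations, and the variable names `toy`, `counts`, and `toy_probs` are assumptions for illustration and do not come from the project dataset.

```python
import pandas as pd

# Hypothetical toy data: one road, one hour, one weekday, five observations.
toy = pd.DataFrame({
    "road": ["A1"] * 5,
    "hour": [8] * 5,
    "weekday": [0] * 5,
    "severity_level": [0, 0, 0, 1, 2],
})

# Count each severity level per (road, hour, weekday) group.
counts = (
    toy.groupby(["road", "hour", "weekday"])["severity_level"]
    .value_counts()
    .unstack(fill_value=0)
)

# Normalize each row so the class probabilities of a group sum to 1.
toy_probs = counts.div(counts.sum(axis=1), axis=0)
print(toy_probs)
```

With three of the five observations at severity 0, the group's probability for class 0 is 3/5, and the remaining two classes get 1/5 each.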

These probabilities can later be added as features to ML models (e.g., Random Forest, XGBoost), where they can improve performance, especially in structured time-series or location-based tasks such as traffic forecasting.
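
One possible way to attach the probabilities as features is a left join on the grouping keys. The sketch below uses made-up data, and the names `train`, `new_rows`, `priors`, and `features` are hypothetical. It also assumes two design choices that the notebook does not prescribe: the probabilities are estimated from the training rows only, so the target of the rows being enriched does not leak into their features, and keys that were never observed fall back to the overall class frequencies.

```python
import pandas as pd

# Hypothetical training rows used to estimate the probabilities.
train = pd.DataFrame({
    "road": ["A1", "A1", "A1", "A1", "A1", "A1"],
    "hour": [8, 8, 8, 8, 9, 9],
    "weekday": [0, 0, 0, 0, 0, 0],
    "severity_level": [0, 0, 1, 2, 0, 0],
})

# Hypothetical rows to enrich; the second key never occurs in `train`.
new_rows = pd.DataFrame({
    "road": ["A1", "B2"],
    "hour": [8, 17],
    "weekday": [0, 4],
})

keys = ["road", "hour", "weekday"]

# Per-group class probabilities, built the same way as the lookup table.
probs = (
    train.groupby(keys)["severity_level"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
    .rename(columns=lambda c: f"prob_severity_{c}")
    .reset_index()
)

# Overall class frequencies, used as a fallback for unseen keys (assumption).
priors = train["severity_level"].value_counts(normalize=True)

# The left join keeps every row; unseen keys get NaN and then the fallback.
features = new_rows.merge(probs, on=keys, how="left")
for level, prior in priors.items():
    col = f"prob_severity_{level}"
    features[col] = features[col].fillna(prior)

print(features)
```

The resulting `prob_severity_*` columns can then be passed to a model alongside the other engineered features.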

In [1]:


import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load engineered dataset
df = pd.read_csv("../data/engineered_traffic_data.csv")

# Parse timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Extract hour and weekday for grouping
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek  # 0 = Monday

# Map status to numeric severity levels
severity_map = {'Good': 0, 'Minor': 1, 'Serious': 2}
df['severity_level'] = df['status'].map(severity_map)

# Create the baseline probability lookup table: for each (road, hour, weekday)
# combination, compute the distribution of severity classes from normalized counts
prob_df = (
    df.groupby(['road', 'hour', 'weekday'])['severity_level']
    .value_counts(normalize=True)
    .unstack(fill_value=0)
    .reset_index()
)

# Rename columns for clarity 
prob_df = prob_df.rename(columns={
    0: 'prob_severity_0',
    1: 'prob_severity_1',
    2: 'prob_severity_2'
})

# Save baseline probabilities for future notebooks
prob_df.to_csv("../results/baseline_probabilities.csv", index=False)

print("Baseline probabilities saved ")




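# EVALUATION
# Baseline prediction: the most likely class for each (road, hour, weekday) key.
# Note: scoring uses the same rows the probabilities were computed from (in-sample).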

# Create a lookup dictionary for quick predictions
lookup = (
    prob_df
    .set_index(['road', 'hour', 'weekday'])
    .to_dict(orient='index')
)

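# Function to get prediction based on highest probability
def predict_baseline(row):
    """Return the most probable severity level for the row's (road, hour, weekday)."""
    key = (row['road'], row['hour'], row['weekday'])
    probs = lookup.get(key)
    if probs is None:
        return 0  # Default to 'Good' if the combination was never observed
    # Pick the column with the highest probability, e.g. 'prob_severity_2'
    pred = max(probs, key=probs.get)
    # The class label is the numeric suffix of the column name
    return int(pred.split('_')[-1])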

# Apply baseline predictions
df['baseline_prediction'] = df.apply(predict_baseline, axis=1)

# Evaluate performance
y_true = df['severity_level']
y_pred = df['baseline_prediction']

report = classification_report(y_true, y_pred)
conf_matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

print("\nAccuracy:", round(accuracy, 4))
print("\nClassification Report:\n", report)
print("\nConfusion Matrix:\n", conf_matrix)


Baseline probabilities saved 

Accuracy: 0.795

Classification Report:
               precision    recall  f1-score   support

           0       0.80      1.00      0.89     63108
           1       0.47      0.03      0.06     11823
           2       0.40      0.00      0.00      4557

    accuracy                           0.79     79488
   macro avg       0.56      0.34      0.32     79488
weighted avg       0.73      0.79      0.71     79488


Confusion Matrix:
 [[62797   304     7]
 [11434   384     5]
 [ 4421   128     8]]


### Observations

The baseline model achieves an overall accuracy of 79.5%, which is almost entirely due to the 'Good' class (precision 0.80, recall 1.00). Since 63,108 of the 79,488 samples are 'Good', always predicting 'Good' would already reach about 79.4%, so the lookup table adds very little beyond the majority class. The model almost never predicts 'Minor' (recall 0.03) or 'Serious' (recall 0.00) delays, which are underrepresented in the dataset and vary more than aggregation by road, hour, and weekday can capture.
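
The comparison with the majority class can be checked directly from the confusion matrix reported above. The snippet below is a small arithmetic sketch; the variable names are illustrative, and only the matrix values are taken from the notebook output.

```python
import numpy as np

# Confusion matrix from the output above (rows = true class, columns = predicted).
cm = np.array([
    [62797, 304, 7],
    [11434, 384, 5],
    [4421, 128, 8],
])

total = cm.sum()

# Overall accuracy: correctly classified samples lie on the diagonal.
accuracy = np.trace(cm) / total

# Accuracy of a trivial model that always predicts class 0 ('Good').
majority_accuracy = cm[0].sum() / total

# Recall per class: correct predictions divided by true instances of the class.
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy      = {accuracy:.4f}")
print(f"always 'Good' = {majority_accuracy:.4f}")
print("recall        =", np.round(recall, 4))
```

Comparing `accuracy` with `majority_accuracy` shows how much of the headline figure is explained by class imbalance alone.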

As expected, this baseline performs reasonably well on the majority class but fails to capture the more complex dependencies that may influence traffic severity. Note that the predictions are scored on the same data that was used to build the probability table, so these figures are in-sample and likely optimistic. Nevertheless, the estimated probabilities may still encode useful recurring patterns per road, hour, and weekday, and can be used as additional features in more advanced models to potentially improve predictive performance.