# Car Insurance Telematics - Feature Engineering

**Notebook 1: Feature Engineering and Data Preparation**

This notebook derives trip-level and driver-level features from raw telematics trip data for use in risk assessment and premium calculation.

## Objectives:

- Engineer driver behavior features
- Create risk indicators
- Calculate aggregated metrics
- Prepare data for modeling
- Export processed dataset


In [19]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

# Set display options
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)

print("Feature Engineering for Car Insurance Telematics")
print("=" * 50)

Feature Engineering for Car Insurance Telematics


## 1. Data Loading and Initial Inspection


In [20]:
# Load the telematics data
# Adjust the path below if your processed trips file lives elsewhere
df = pd.read_csv("../data/processed/processed_trips_1200_drivers.csv")

print(f"Dataset shape: {df.shape}")
print(f"\nDataset info:")

df.info()

print(f"\nFirst few rows:")
df.head()

Dataset shape: (17819, 19)

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17819 entries, 0 to 17818
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   trip_id                    17819 non-null  object 
 1   driver_id                  17819 non-null  object 
 2   driver_trip_number         17819 non-null  int64  
 3   duration_minutes           17819 non-null  float64
 4   distance_miles             17819 non-null  float64
 5   average_speed_mph          17819 non-null  float64
 6   max_speed_mph              17819 non-null  float64
 7   harsh_braking_events       17819 non-null  int64  
 8   harsh_acceleration_events  17819 non-null  int64  
 9   sharp_cornering_events     17819 non-null  int64  
 10  phone_usage_seconds        17819 non-null  int64  
 11  speeding_percent           17819 non-null  float64
 12  night_driving              17819 non-null  int64  
 13  rush

Unnamed: 0,trip_id,driver_id,driver_trip_number,duration_minutes,distance_miles,average_speed_mph,max_speed_mph,harsh_braking_events,harsh_acceleration_events,sharp_cornering_events,phone_usage_seconds,speeding_percent,night_driving,rush_hour,time_of_day,start_zone,end_zone,data_quality_score,gps_accuracy_avg
0,trip_016937,0002af179ff1223f,1,17.3,9.64,26.0,27.3,1,1,0,0,0.036,1,0,early_morning,zone_37.67_-122.28,zone_37.64_-122.37,0.708209,21.113862
1,trip_016938,0002af179ff1223f,2,6.7,4.64,43.0,57.1,1,0,0,0,0.109,0,1,morning_commute,zone_37.83_-122.26,zone_37.89_-122.19,0.996235,20.336994
2,trip_016939,0002af179ff1223f,3,34.2,28.98,48.8,51.7,2,0,0,0,0.093,0,1,evening_commute,zone_37.65_-122.36,zone_37.71_-122.29,0.89667,15.342455
3,trip_016940,0002af179ff1223f,4,28.0,24.95,35.9,46.3,0,1,1,0,0.072,0,1,evening_commute,zone_37.79_-122.42,zone_37.71_-122.46,0.98535,18.978125
4,trip_016941,0002af179ff1223f,5,11.6,4.8,39.9,50.9,0,0,0,0,0.079,0,1,morning_commute,zone_37.87_-122.36,zone_37.84_-122.43,0.983278,9.109515
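
Before engineering features, it can help to confirm that the loaded frame has the columns the later cells rely on. The sketch below is one way to do that. `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the toy frame stands in for the real file.

```python
import pandas as pd

# Columns used by the feature-engineering cells in this notebook.
REQUIRED_COLUMNS = [
    "trip_id", "driver_id", "duration_minutes", "distance_miles",
    "average_speed_mph", "max_speed_mph", "harsh_braking_events",
    "harsh_acceleration_events", "sharp_cornering_events",
    "phone_usage_seconds", "speeding_percent", "night_driving",
    "rush_hour", "data_quality_score", "gps_accuracy_avg",
]


def missing_columns(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]


# Toy frame that deliberately lacks most columns, for illustration only.
toy = pd.DataFrame({"trip_id": ["t1"], "driver_id": ["d1"], "distance_miles": [1.0]})
absent = missing_columns(toy, REQUIRED_COLUMNS)
print(f"{len(absent)} required columns missing, first: {absent[0]}")
```

Calling `missing_columns(df, REQUIRED_COLUMNS)` on the real frame and stopping when the list is non-empty would surface schema problems before any derived column is computed.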


## 2. Trip-Level Feature Engineering


In [21]:
# Create a copy for feature engineering
df_features = df.copy()

# 1. Speed-related features
# Note: "speed_variance" is the max-minus-average spread, not a statistical variance.
df_features["speed_variance"] = (
    df_features["max_speed_mph"] - df_features["average_speed_mph"]
)
df_features["speed_ratio"] = df_features["max_speed_mph"] / (
    df_features["average_speed_mph"] + 1e-6
)
df_features["high_speed_flag"] = (df_features["max_speed_mph"] > 80).astype(int)
df_features["excessive_speed_flag"] = (df_features["max_speed_mph"] > 100).astype(int)

# 2. Aggressive driving indicators
df_features["total_harsh_events"] = (
    df_features["harsh_braking_events"]
    + df_features["harsh_acceleration_events"]
    + df_features["sharp_cornering_events"]
)

df_features["harsh_events_per_mile"] = df_features["total_harsh_events"] / (
    df_features["distance_miles"] + 1e-6
)
df_features["harsh_events_per_minute"] = df_features["total_harsh_events"] / (
    df_features["duration_minutes"] + 1e-6
)

# 3. Phone usage features
# Same 1e-6 guard as the other ratios, so a zero-duration trip cannot divide by zero
df_features["phone_usage_percent"] = (
    df_features["phone_usage_seconds"] / (df_features["duration_minutes"] * 60 + 1e-6)
) * 100
df_features["phone_usage_flag"] = (df_features["phone_usage_seconds"] > 0).astype(int)
df_features["excessive_phone_use"] = (df_features["phone_usage_percent"] > 10).astype(
    int
)

# 4. Trip characteristics
df_features["trip_efficiency"] = df_features["distance_miles"] / (
    df_features["duration_minutes"] + 1e-6
)
df_features["long_trip_flag"] = (df_features["distance_miles"] > 50).astype(int)
df_features["short_trip_flag"] = (df_features["distance_miles"] < 5).astype(int)
df_features["extended_duration_flag"] = (df_features["duration_minutes"] > 120).astype(
    int
)

# 5. Risk timing features
df_features["high_risk_time"] = (
    (df_features["night_driving"] == 1) | (df_features["rush_hour"] == 1)
).astype(int)

# 6. Data quality features
df_features["low_quality_data"] = (df_features["data_quality_score"] < 0.8).astype(int)
df_features["poor_gps_accuracy"] = (df_features["gps_accuracy_avg"] > 10).astype(int)

print("Trip-level features created:")
new_features = [col for col in df_features.columns if col not in df.columns]
for feature in new_features:
    print(f"  - {feature}")

Trip-level features created:
  - speed_variance
  - speed_ratio
  - high_speed_flag
  - excessive_speed_flag
  - total_harsh_events
  - harsh_events_per_mile
  - harsh_events_per_minute
  - phone_usage_percent
  - phone_usage_flag
  - excessive_phone_use
  - trip_efficiency
  - long_trip_flag
  - short_trip_flag
  - extended_duration_flag
  - high_risk_time
  - low_quality_data
  - poor_gps_accuracy
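
As a small check of the ratio features defined above, the sketch below recomputes three of them for the first sample trip shown in Section 1 (17.3 minutes, 9.64 miles, one harsh braking and one harsh acceleration event). The one-row `toy` frame is built only for illustration.

```python
import pandas as pd

# One-row frame copied from the first sample trip; built only for illustration.
toy = pd.DataFrame(
    {
        "duration_minutes": [17.3],
        "distance_miles": [9.64],
        "harsh_braking_events": [1],
        "harsh_acceleration_events": [1],
        "sharp_cornering_events": [0],
    }
)

eps = 1e-6  # same guard against division by zero as in the cell above

toy["total_harsh_events"] = (
    toy["harsh_braking_events"]
    + toy["harsh_acceleration_events"]
    + toy["sharp_cornering_events"]
)
toy["harsh_events_per_mile"] = toy["total_harsh_events"] / (toy["distance_miles"] + eps)
# trip_efficiency is expressed in miles per minute
toy["trip_efficiency"] = toy["distance_miles"] / (toy["duration_minutes"] + eps)

print(toy[["total_harsh_events", "harsh_events_per_mile", "trip_efficiency"]].round(4))
```

For this trip the two harsh events over 9.64 miles give roughly 0.21 events per mile, and 9.64 miles in 17.3 minutes gives roughly 0.56 miles per minute.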


## 3. Driver-Level Aggregated Features


In [22]:
# Calculate driver-level aggregated features
driver_features = (
    df_features.groupby("driver_id")
    .agg(
        {
            # Basic trip statistics
            "trip_id": "count",
            "duration_minutes": ["mean", "std", "sum"],
            "distance_miles": ["mean", "std", "sum"],
            "average_speed_mph": ["mean", "std"],
            "max_speed_mph": ["mean", "max"],
            # Aggressive driving
            "harsh_braking_events": ["mean", "sum"],
            "harsh_acceleration_events": ["mean", "sum"],
            "sharp_cornering_events": ["mean", "sum"],
            "total_harsh_events": ["mean", "sum"],
            "harsh_events_per_mile": "mean",
            "harsh_events_per_minute": "mean",
            # Phone usage
            "phone_usage_seconds": ["mean", "sum"],
            "phone_usage_percent": "mean",
            "phone_usage_flag": "mean",
            "excessive_phone_use": "mean",
            # Speed behavior
            "speeding_percent": "mean",
            "speed_variance": "mean",
            "speed_ratio": "mean",
            "high_speed_flag": "mean",
            "excessive_speed_flag": "mean",
            # Timing patterns
            "night_driving": "mean",
            "rush_hour": "mean",
            "high_risk_time": "mean",
            # Trip patterns
            "long_trip_flag": "mean",
            "short_trip_flag": "mean",
            "extended_duration_flag": "mean",
            "trip_efficiency": "mean",
            # Data quality
            "data_quality_score": "mean",
            "gps_accuracy_avg": "mean",
            "low_quality_data": "mean",
            "poor_gps_accuracy": "mean",
        }
    )
    .round(4)
)

# Flatten column names
driver_features.columns = ["_".join(col).strip() for col in driver_features.columns]
driver_features = driver_features.reset_index()

# Rename some columns for clarity
driver_features.rename(
    columns={
        "trip_id_count": "total_trips",
        "duration_minutes_sum": "total_driving_time_minutes",
        "distance_miles_sum": "total_distance_miles",
    },
    inplace=True,
)

print(f"Driver-level features shape: {driver_features.shape}")
print(f"Number of unique drivers: {driver_features['driver_id'].nunique()}")
driver_features.head()

Driver-level features shape: (1200, 43)
Number of unique drivers: 1200


Unnamed: 0,driver_id,total_trips,duration_minutes_mean,duration_minutes_std,total_driving_time_minutes,distance_miles_mean,distance_miles_std,total_distance_miles,average_speed_mph_mean,average_speed_mph_std,max_speed_mph_mean,max_speed_mph_max,harsh_braking_events_mean,harsh_braking_events_sum,harsh_acceleration_events_mean,harsh_acceleration_events_sum,sharp_cornering_events_mean,sharp_cornering_events_sum,total_harsh_events_mean,total_harsh_events_sum,harsh_events_per_mile_mean,harsh_events_per_minute_mean,phone_usage_seconds_mean,phone_usage_seconds_sum,phone_usage_percent_mean,phone_usage_flag_mean,excessive_phone_use_mean,speeding_percent_mean,speed_variance_mean,speed_ratio_mean,high_speed_flag_mean,excessive_speed_flag_mean,night_driving_mean,rush_hour_mean,high_risk_time_mean,long_trip_flag_mean,short_trip_flag_mean,extended_duration_flag_mean,trip_efficiency_mean,data_quality_score_mean,gps_accuracy_avg_mean,low_quality_data_mean,poor_gps_accuracy_mean
0,0002af179ff1223f,12,32.625,22.2946,391.5,19.2492,12.2774,230.99,39.0583,6.1187,47.4,57.1,1.25,15,1.25,15,0.75,9,3.25,39,0.1883,0.1027,0.0,0,0.0,0.0,0.0,0.0977,8.3417,1.2157,0.0,0.0,0.25,0.5833,0.8333,0.0,0.25,0.0,0.5982,0.93,18.3363,0.0833,0.8333
1,003485c0a33db1b4,17,24.0529,12.4703,408.9,10.1388,6.6773,172.36,29.6059,4.7059,38.0529,51.0,0.7059,12,0.8824,15,0.9412,16,2.5294,43,0.3146,0.1163,9.0588,154,1.4313,0.1176,0.0588,0.0945,8.4471,1.2862,0.0,0.0,0.1765,0.7059,0.8824,0.0,0.1765,0.0,0.4328,0.887,17.0923,0.1765,0.8824
2,00b03adcdee26a16,9,28.9889,21.1209,260.9,24.0533,22.5892,216.48,47.6333,4.9513,60.2778,78.6,4.2222,38,4.5556,41,2.4444,22,11.2222,101,0.5119,0.3979,2.7778,25,0.4495,0.1111,0.0,0.3476,12.6444,1.2664,0.0,0.0,0.3333,0.4444,0.7778,0.1111,0.0,0.0,0.7893,0.8789,17.8971,0.2222,0.6667
3,00c67cb4783ab2fa,18,23.7889,9.5601,428.2,18.0206,7.8179,324.37,49.1833,4.149,58.35,65.8,3.8889,70,2.6111,47,3.0,54,9.5,171,0.5939,0.4315,8.7222,157,0.5119,0.4444,0.0,0.2782,9.1667,1.1898,0.0,0.0,0.1667,0.7222,0.8889,0.0,0.0556,0.0,0.7524,0.891,14.9008,0.2222,0.6667
4,00c89280fe405154,17,23.0824,12.5857,392.4,12.4771,6.7808,212.11,30.6824,4.3216,43.5059,61.9,0.3529,6,0.4706,8,0.1765,3,1.0,17,0.0793,0.0448,8.5882,146,0.8647,0.1176,0.0,0.0251,12.8235,1.4369,0.0,0.0,0.2353,0.4118,0.6471,0.0,0.1176,0.0,0.5449,0.8912,16.2346,0.1176,0.8235
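
The aggregation and column-flattening steps can be seen in isolation on a toy frame. The sketch below uses made-up trips for two hypothetical drivers and the same mix of single and list-valued aggregations as the cell above.

```python
import pandas as pd

# Made-up trips for two hypothetical drivers.
trips = pd.DataFrame(
    {
        "driver_id": ["a", "a", "b"],
        "trip_id": ["t1", "t2", "t3"],
        "distance_miles": [10.0, 20.0, 5.0],
    }
)

agg = trips.groupby("driver_id").agg(
    {"trip_id": "count", "distance_miles": ["mean", "std", "sum"]}
)

# agg.columns is a MultiIndex of (column, function) pairs; join each pair into one name.
agg.columns = ["_".join(col) for col in agg.columns]
agg = agg.reset_index().rename(
    columns={
        "trip_id_count": "total_trips",
        "distance_miles_sum": "total_distance_miles",
    }
)
print(agg)
```

Driver `b` has a single trip, so its `distance_miles_std` is NaN (the sample standard deviation is undefined for one observation). Any `_std` feature built this way will be missing for single-trip drivers, which is worth keeping in mind when the real data changes.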


## 4. Advanced Risk Features


In [23]:
# Calculate additional risk-based features

# 1. Driving frequency patterns
OBSERVATION_DAYS = 30  # Assumed length of the observation window, in days
driver_features["avg_trips_per_day"] = driver_features["total_trips"] / OBSERVATION_DAYS
driver_features["avg_miles_per_trip"] = (
    driver_features["total_distance_miles"] / driver_features["total_trips"]
)
driver_features["avg_duration_per_trip"] = (
    driver_features["total_driving_time_minutes"] / driver_features["total_trips"]
)

# 2. Risk intensity scores
driver_features["harsh_driving_intensity"] = (
    driver_features["harsh_braking_events_mean"]
    + driver_features["harsh_acceleration_events_mean"]
    + driver_features["sharp_cornering_events_mean"]
)

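# Note: the sample rows in Section 1 show speeding_percent values such as 0.036,
# which suggests a 0-1 fraction; if so, the first term below contributes very little.
driver_features["speed_risk_score"] = (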
    driver_features["speeding_percent_mean"] * 0.4
    + driver_features["high_speed_flag_mean"] * 30
    + driver_features["excessive_speed_flag_mean"] * 50
)

driver_features["distraction_risk_score"] = (
    driver_features["phone_usage_percent_mean"] * 2
    + driver_features["excessive_phone_use_mean"] * 20
)

driver_features["timing_risk_score"] = (
    driver_features["night_driving_mean"] * 15 + driver_features["rush_hour_mean"] * 10
)

# 3. Composite risk score
driver_features["composite_risk_score"] = (
    driver_features["harsh_driving_intensity"] * 10
    + driver_features["speed_risk_score"] * 0.3
    + driver_features["distraction_risk_score"] * 0.5
    + driver_features["timing_risk_score"] * 0.2
)

# 4. Risk categories
driver_features["risk_category"] = pd.cut(
    driver_features["composite_risk_score"],
    bins=[-np.inf, 10, 25, 50, np.inf],
    labels=["Low", "Medium", "High", "Very High"],
)

# 5. Experience indicators
driver_features["high_mileage_driver"] = (
    driver_features["total_distance_miles"] > 1000
).astype(int)
driver_features["frequent_driver"] = (driver_features["total_trips"] > 50).astype(int)
driver_features["consistent_driver"] = (
    driver_features["duration_minutes_std"] < 30
).astype(int)

print("Advanced risk features created:")
risk_features = [
    "avg_trips_per_day",
    "avg_miles_per_trip",
    "avg_duration_per_trip",
    "harsh_driving_intensity",
    "speed_risk_score",
    "distraction_risk_score",
    "timing_risk_score",
    "composite_risk_score",
    "risk_category",
    "high_mileage_driver",
    "frequent_driver",
    "consistent_driver",
]

for feature in risk_features:
    print(f"  - {feature}")

Advanced risk features created:
  - avg_trips_per_day
  - avg_miles_per_trip
  - avg_duration_per_trip
  - harsh_driving_intensity
  - speed_risk_score
  - distraction_risk_score
  - timing_risk_score
  - composite_risk_score
  - risk_category
  - high_mileage_driver
  - frequent_driver
  - consistent_driver
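
To see how the weights combine, the sketch below recomputes the composite score for the first driver from the rounded per-driver means displayed in Section 3. The inputs are copied from that table, so the result should agree with the notebook's own value only up to rounding.

```python
# Rounded means for driver 0002af179ff1223f, copied from the Section 3 output.
means = {
    "harsh_braking_events_mean": 1.25,
    "harsh_acceleration_events_mean": 1.25,
    "sharp_cornering_events_mean": 0.75,
    "speeding_percent_mean": 0.0977,
    "high_speed_flag_mean": 0.0,
    "excessive_speed_flag_mean": 0.0,
    "phone_usage_percent_mean": 0.0,
    "excessive_phone_use_mean": 0.0,
    "night_driving_mean": 0.25,
    "rush_hour_mean": 0.5833,
}

harsh = (
    means["harsh_braking_events_mean"]
    + means["harsh_acceleration_events_mean"]
    + means["sharp_cornering_events_mean"]
)
speed = (
    means["speeding_percent_mean"] * 0.4
    + means["high_speed_flag_mean"] * 30
    + means["excessive_speed_flag_mean"] * 50
)
distraction = (
    means["phone_usage_percent_mean"] * 2 + means["excessive_phone_use_mean"] * 20
)
timing = means["night_driving_mean"] * 15 + means["rush_hour_mean"] * 10

# Weighted contribution of each component to the composite score.
terms = {
    "harsh": harsh * 10,
    "speed": speed * 0.3,
    "distraction": distraction * 0.5,
    "timing": timing * 0.2,
}
composite = sum(terms.values())

print(f"composite_risk_score = {composite:.6f}")
for name, value in terms.items():
    print(f"  {name:<12} {value:8.4f}  ({value / composite:.1%})")
```

For this driver the harsh-event term is 32.5 out of a total of about 34.43, so with the current weights the composite score is driven mostly by harsh-event counts. Whether that balance is intended is a modeling decision, not something this sketch settles.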


## 5. Feature Summary and Quality Check


In [24]:
# Display feature summary
print("=" * 60)
print("FEATURE ENGINEERING SUMMARY")
print("=" * 60)

print(f"Original dataset: {df.shape[0]} trips, {df.shape[1]} columns")
print(
    f"Trip-level features: {df_features.shape[0]} trips, {df_features.shape[1]} columns"
)
print(
    f"Driver-level features: {driver_features.shape[0]} drivers, {driver_features.shape[1]} columns"
)

print(f"\nRisk Category Distribution:")
print(driver_features["risk_category"].value_counts())

print(f"\nComposite Risk Score Statistics:")
print(driver_features["composite_risk_score"].describe())

# Check for missing values
missing_values = driver_features.isnull().sum()
if missing_values.sum() > 0:
    print(f"\nMissing values found:")
    print(missing_values[missing_values > 0])
else:
    print(f"\n✓ No missing values in driver features")

# Display sample of key features
key_features = [
    "driver_id",
    "total_trips",
    "total_distance_miles",
    "harsh_driving_intensity",
    "speed_risk_score",
    "distraction_risk_score",
    "composite_risk_score",
    "risk_category",
]

print(f"\nSample of key driver features:")
driver_features[key_features].head(10)

FEATURE ENGINEERING SUMMARY
Original dataset: 17819 trips, 19 columns
Trip-level features: 17819 trips, 36 columns
Driver-level features: 1200 drivers, 55 columns

Risk Category Distribution:
risk_category
High         551
Very High    345
Medium       288
Low           16
Name: count, dtype: int64

Composite Risk Score Statistics:
count    1200.000000
mean       50.703241
std        36.499889
min         1.125104
25%        24.567082
50%        37.438980
75%        74.400122
max       165.799748
Name: composite_risk_score, dtype: float64

✓ No missing values in driver features

Sample of key driver features:


Unnamed: 0,driver_id,total_trips,total_distance_miles,harsh_driving_intensity,speed_risk_score,distraction_risk_score,composite_risk_score,risk_category
0,0002af179ff1223f,12,230.99,3.25,0.03908,0.0,34.428324,High
1,003485c0a33db1b4,17,172.36,2.5295,0.0378,4.0386,29.26694,High
2,00b03adcdee26a16,9,216.48,11.2222,0.13904,0.899,114.601912,Very High
3,00c67cb4783ab2fa,18,324.37,9.5,0.11128,1.0238,97.489784,Very High
4,00c89280fe405154,17,212.11,1.0,0.01004,1.7294,12.397212,Medium
5,00f1339f026d0abb,14,259.93,3.9285,0.02596,0.0,40.935588,High
6,0146d0dbbee1eb8b,12,243.32,2.2501,0.03004,0.6202,24.820012,Medium
7,015d7c3a8cfbc0e9,20,267.08,3.05,0.02864,1.0356,32.826392,High
8,017188caf2097abf,21,319.36,1.5715,0.00676,0.0,17.479128,Medium
9,01d62493148fcd53,18,387.43,10.0556,0.14328,1.3492,103.107084,Very High
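
The category boundaries follow from how `pd.cut` treats bin edges: by default each bin is closed on the right, so a score of exactly 10 falls in "Low" and exactly 25 in "Medium". The sketch below applies the notebook's bins to four scores copied from the sample table above, plus two edge values added here for illustration.

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 10, 25, 50, np.inf]
labels = ["Low", "Medium", "High", "Very High"]

# First four scores come from the sample table; 10.0 and 25.0 show edge handling.
scores = [34.428324, 114.601912, 12.397212, 24.820012, 10.0, 25.0]
categories = pd.cut(scores, bins=bins, labels=labels)

for score, category in zip(scores, categories):
    print(f"{score:>10.4f} -> {category}")
```

Passing `right=False` to `pd.cut` would make the bins closed on the left instead, which moves scores that sit exactly on an edge into the next category up.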


## 6. Export Processed Data


In [25]:
# Export trip-level features
trip_features_file = "../data/processed/trip_level_features.csv"
df_features.to_csv(trip_features_file, index=False)
print(f"✓ Trip-level features saved to: {trip_features_file}")

# Export driver-level features
driver_features_file = "../data/processed/driver_level_features.csv"
driver_features.to_csv(driver_features_file, index=False)
print(f"✓ Driver-level features saved to: {driver_features_file}")

# Create a feature dictionary for documentation
feature_descriptions = {
    "speed_variance": "Difference between max and average speed",
    "speed_ratio": "Ratio of max to average speed",
    "total_harsh_events": "Sum of all harsh driving events",
    "harsh_events_per_mile": "Harsh events normalized by distance",
    "harsh_events_per_minute": "Harsh events normalized by time",
    "phone_usage_percent": "Percentage of trip time using phone",
    "trip_efficiency": "Distance per minute ratio",
    "harsh_driving_intensity": "Combined harsh driving behavior score",
    "speed_risk_score": "Weighted speed-related risk score",
    "distraction_risk_score": "Phone usage risk score",
    "timing_risk_score": "Night/rush hour driving risk",
    "composite_risk_score": "Overall driver risk assessment",
    "risk_category": "Categorical risk level (Low/Medium/High/Very High)",
}

# Save feature descriptions
feature_desc_df = pd.DataFrame(
    list(feature_descriptions.items()), columns=["Feature", "Description"]
)
feature_desc_df.to_csv("../data/processed/feature_descriptions.csv", index=False)
print(f"✓ Feature descriptions saved to: feature_descriptions.csv")

print("=" * 60)
print("FEATURE ENGINEERING COMPLETED SUCCESSFULLY")
print("=" * 60)
print(f"Files created:")
print(f"  1. {trip_features_file} - Trip-level features ({df_features.shape[0]} rows)")
print(
    f"  2. {driver_features_file} - Driver-level features ({driver_features.shape[0]} rows)"
)
print(f"  3. feature_descriptions.csv - Feature documentation")
print(f"\nReady for next notebook: Risk Scoring and Analysis")

✓ Trip-level features saved to: ../data/processed/trip_level_features.csv
✓ Driver-level features saved to: ../data/processed/driver_level_features.csv
✓ Feature descriptions saved to: feature_descriptions.csv
FEATURE ENGINEERING COMPLETED SUCCESSFULLY
Files created:
  1. ../data/processed/trip_level_features.csv - Trip-level features (17819 rows)
  2. ../data/processed/driver_level_features.csv - Driver-level features (1200 rows)
  3. feature_descriptions.csv - Feature documentation

Ready for next notebook: Risk Scoring and Analysis
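
One thing to watch when the next notebook reads these files back: CSV stores `risk_category` as plain text, so the ordered categorical produced by `pd.cut` is not preserved. The sketch below shows the round trip on a small made-up frame, using an in-memory buffer in place of a file, and one way to restore the ordering.

```python
import io

import numpy as np
import pandas as pd

labels = ["Low", "Medium", "High", "Very High"]

# Made-up scores, for illustration only.
demo = pd.DataFrame({"composite_risk_score": [5.0, 30.0, 80.0]})
demo["risk_category"] = pd.cut(
    demo["composite_risk_score"], bins=[-np.inf, 10, 25, 50, np.inf], labels=labels
)

# Write to and read from an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the column is no longer categorical.
was_categorical = isinstance(reloaded["risk_category"].dtype, pd.CategoricalDtype)
print(f"categorical after reload: {was_categorical}")

# Restore the ordered categorical so comparisons and sorting follow risk level.
reloaded["risk_category"] = pd.Categorical(
    reloaded["risk_category"], categories=labels, ordered=True
)
print(f"lowest category present: {reloaded['risk_category'].min()}")
```

Without the restore step, sorting or taking the minimum of the column would use alphabetical order, which places "High" before "Low".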
