`Modeling Car Insurance Claim Outcomes`

`September 2025`

This project aims to build a predictive model for a car insurance company to determine whether a customer will make a claim during their policy period. Using customer data, we explore factors that may influence claim behavior and fit a logistic regression model on each feature to find the one that best predicts claim outcomes.

`Any questions, please reach out!`

Chiawei Wang, PhD\
Data & Product Analyst\
<chiawei.w@outlook.com>

`*` Note that the table of contents and other links may not work directly on GitHub.

[Table of contents](#table-of-contents)
1. [Executive summary](#executive-summary)
   - [Challenge](#challenge)
   - [Research questions](#research-questions)
   - [Data overview](#data-overview)
   - [Approach](#approach)
   - [Results](#results)
   - [Conclusion](#conclusion)
2. [Exploratory data analysis](#exploratory-data-analysis)

# Executive summary

## Challenge

Car insurance companies need to accurately predict the likelihood of customers making claims in order to optimize pricing and manage risk. On the Road Car Insurance has requested a model to predict whether a customer will make a claim during the policy period, using their customer data.

## Research questions

1. What factors are most predictive of a customer making a claim?
2. How can we quantify the impact of each factor on the likelihood of a claim?
3. What is the overall accuracy of our predictive model?

## Data overview


| Index | Column                | Type    | Description                                                                          |
| ----- | --------------------- | ------- | ------------------------------------------------------------------------------------ |
| 0     | `id`                  | int64   | Unique client identifier                                                             |
| 1     | `age`                 | int64   | Client's age: 0: 16-25, 1: 26-39, 2: 40-64, 3: 65+                                   |
| 2     | `gender`              | int64   | Client's gender: 0: Female, 1: Male                                                  |
| 3     | `driving_experience`  | object  | Years the client has been driving: 0: 0-9, 1: 10-19, 2: 20-29, 3: 30+                |
| 4     | `education`           | object  | Client's level of education: 0: No education, 1: High school, 2: University          |
| 5     | `income`              | object  | Client's income level: 0: Poverty, 1: Working class, 2: Middle class, 3: Upper class |
| 6     | `credit_score`        | float64 | Client's credit score (between zero and one)                                         |
| 7     | `vehicle_ownership`   | float64 | Client's vehicle ownership status: 0: Does not own, 1: Owns                          |
| 8     | `vehicle_year`        | object  | Year of vehicle registration: 0: Before 2015, 1: 2015 or later                       |
| 9     | `married`             | float64 | Client's marital status: 0: Not married, 1: Married                                  |
| 10    | `children`            | float64 | Client's number of children                                                          |
| 11    | `postal_code`         | int64   | Client's postal code                                                                 |
| 12    | `annual_mileage`      | float64 | Number of miles driven by the client each year                                       |
| 13    | `vehicle_type`        | object  | Type of car: 0: Sedan, 1: Sports car                                                 |
| 14    | `speeding_violations` | int64   | Total number of speeding violations received by the client                           |
| 15    | `duis`                | int64   | Number of times the client has been caught driving under the influence               |
| 16    | `past_accidents`      | int64   | Total number of previous accidents the client has been involved in                   |
| 17    | `outcome`             | float64 | Whether the client made a claim on their car insurance: 0: No claim, 1: Made a claim |

## Approach

1. Reading in and exploring the dataset
2. Filling missing values
3. Preparing for modeling
4. Building and storing the models
5. Measuring performance
6. Finding the best performing model

## Results

- **Best feature:** `driving_experience` was found to be the most predictive of whether a customer would make a claim. Less experienced drivers have a higher claim rate.
- **Best accuracy:** The single-feature logistic regression using `driving_experience` as the predictor achieved the highest accuracy, approximately 0.78, measured on the same data the model was fitted to.

## Conclusion

The analysis identified `driving_experience` as the single feature that predicts insurance claims most accurately among those evaluated. This insight can help the company refine its risk assessment and pricing strategies, treating driver experience as a key factor in claim likelihood.

# Exploratory data analysis

In [1]:
# Import necessary libraries
import pandas as pd
from statsmodels.formula.api import logit

In [2]:
# Read in the CSV as a DataFrame
df = pd.read_csv('insurance.csv')

# Preview the data
print(df.shape)
df.head()

(10000, 18)


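|     | id     | age | gender | driving_experience | education   | income        | credit_score | vehicle_ownership | vehicle_year | married | children | postal_code | annual_mileage | vehicle_type | speeding_violations | duis | past_accidents | outcome |
| --- | ------ | --- | ------ | ------------------ | ----------- | ------------- | ------------ | ----------------- | ------------ | ------- | -------- | ----------- | -------------- | ------------ | ------------------- | ---- | -------------- | ------- |
| 0   | 569520 | 3   | 0      | 0-9y               | high school | upper class   | 0.629027     | 1.0               | after 2015   | 0.0     | 1.0      | 10238       | 12000.0        | sedan        | 0                   | 0    | 0              | 0.0     |
| 1   | 750365 | 0   | 1      | 0-9y               | none        | poverty       | 0.357757     | 0.0               | before 2015  | 0.0     | 0.0      | 10238       | 16000.0        | sedan        | 0                   | 0    | 0              | 1.0     |
| 2   | 199901 | 0   | 0      | 0-9y               | high school | working class | 0.493146     | 1.0               | before 2015  | 0.0     | 0.0      | 10238       | 11000.0        | sedan        | 0                   | 0    | 0              | 0.0     |
| 3   | 478866 | 0   | 1      | 0-9y               | university  | working class | 0.206013     | 1.0               | before 2015  | 0.0     | 1.0      | 32765       | 11000.0        | sedan        | 0                   | 0    | 0              | 0.0     |
| 4   | 731664 | 1   | 1      | 10-19y             | none        | working class | 0.388366     | 1.0               | before 2015  | 0.0     | 0.0      | 32765       | 12000.0        | sedan        | 2                   | 0    | 1              | 1.0     |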
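
Before filling missing values, it can help to count them per column and confirm afterwards that none remain. The following is a minimal sketch on made-up data: the `toy` DataFrame and its values are hypothetical stand-ins, not rows from `insurance.csv`. It relies on the fact that `Series.fillna` returns a new Series, so the result is assigned back to the column.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    'credit_score': [0.6, float('nan'), 0.4],
    'annual_mileage': [12000.0, 16000.0, float('nan')],
})

# Count missing values per column before imputation
missing_before = toy.isna().sum()

# fillna returns a new Series, so assign the result back to the column
for col in ['credit_score', 'annual_mileage']:
    toy[col] = toy[col].fillna(toy[col].mean())

# Count missing values again to confirm the imputation took effect
missing_after = toy.isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

The same before-and-after check could be run on `df` to verify that `credit_score` and `annual_mileage` no longer contain missing values.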


In [3]:
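# Fill missing values with the column mean
# (fillna returns a new Series, so assign the result back)
df['credit_score'] = df['credit_score'].fillna(df['credit_score'].mean())
df['annual_mileage'] = df['annual_mileage'].fillna(df['annual_mileage'].mean())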

# Empty list to store model results
models = []

# Feature columns
features = df.columns.drop(['id', 'outcome'])

# Loop through features
for col in features:
    # Create a model
    model = logit(f'outcome ~ {col}', data = df).fit(disp = False)
    # Add each model to the models list
    models.append(model)

# Empty list to store accuracies
accuracies = []

# Loop through models
for model in models:
    # Compute the confusion matrix
    conf_matrix = model.pred_table()
    # True negatives
    tn = conf_matrix[0, 0]
    # True positives
    tp = conf_matrix[1, 1]
    # False negatives
    fn = conf_matrix[1, 0]
    # False positives
    fp = conf_matrix[0, 1]
    # Compute accuracy
    acc = (tn + tp) / (tn + fn + fp + tp)
    accuracies.append(acc)

# Find the feature with the largest accuracy
best_feature = features[accuracies.index(max(accuracies))]

# Create best_feature_df
best_feature_df = pd.DataFrame({'best_feature': [best_feature], 'best_accuracy': [round(max(accuracies), 2)]})
best_feature_df

|     | best_feature       | best_accuracy |
| --- | ------------------ | ------------- |
| 0   | driving_experience | 0.78          |
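
An accuracy figure is easier to interpret next to a majority-class baseline, that is, the accuracy of always predicting the more common outcome. The sketch below shows one way to compute both from the four cells of a confusion matrix. The helper names `accuracy` and `majority_baseline` and the counts are hypothetical illustrations, not values from the insurance data.

```python
def accuracy(tn, fp, fn, tp):
    """Share of correct predictions in a 2x2 confusion matrix."""
    return (tn + tp) / (tn + fp + fn + tp)


def majority_baseline(tn, fp, fn, tp):
    """Accuracy of always predicting the more common observed class."""
    negatives = tn + fp  # observed 'no claim'
    positives = fn + tp  # observed 'made a claim'
    return max(negatives, positives) / (negatives + positives)


# Hypothetical counts for illustration only
tn, fp, fn, tp = 50, 10, 15, 25

model_acc = accuracy(tn, fp, fn, tp)
baseline_acc = majority_baseline(tn, fp, fn, tp)

print(f'model: {model_acc:.2f}, baseline: {baseline_acc:.2f}')
```

A model adds predictive value only to the extent that its accuracy exceeds this baseline, so reporting the two side by side gives the 0.78 figure more context.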
