

## The dataset

| Column | Description |
|--------|-------------|
| `id` | Unique client identifier |
| `age` | Client's age: <br> <ul><li>`0`: 16-25</li><li>`1`: 26-39</li><li>`2`: 40-64</li><li>`3`: 65+</li></ul> |
| `gender` | Client's gender: <br> <ul><li>`0`: Female</li><li>`1`: Male</li></ul> |
| `driving_experience` | Years the client has been driving: <br> <ul><li>`0`: 0-9</li><li>`1`: 10-19</li><li>`2`: 20-29</li><li>`3`: 30+</li></ul> |
| `education` | Client's level of education: <br> <ul><li>`0`: No education</li><li>`1`: High school</li><li>`2`: University</li></ul> |
| `income` | Client's income level: <br> <ul><li>`0`: Poverty</li><li>`1`: Working class</li><li>`2`: Middle class</li><li>`3`: Upper class</li></ul> |
| `credit_score` | Client's credit score (between zero and one) |
| `vehicle_ownership` | Client's vehicle ownership status: <br><ul><li>`0`: Does not own their vehicle (paying off finance)</li><li>`1`: Owns their vehicle</li></ul> |
| `vehicle_year` | Year of vehicle registration: <br><ul><li>`0`: Before 2015</li><li>`1`: 2015 or later</li></ul> |
| `married` | Client's marital status: <br><ul><li>`0`: Not married</li><li>`1`: Married</li></ul> |
| `children` | Client's number of children |
| `postal_code` | Client's postal code |
| `annual_mileage` | Number of miles driven by the client each year |
| `vehicle_type` | Type of car: <br> <ul><li>`0`: Sedan</li><li>`1`: Sports car</li></ul> |
| `speeding_violations` | Total number of speeding violations received by the client |
| `duis` | Number of times the client has been caught driving under the influence of alcohol |
| `past_accidents` | Total number of previous accidents the client has been involved in |
| `outcome` | Whether the client made a claim on their car insurance (response variable): <br><ul><li>`0`: No claim</li><li>`1`: Made a claim</li></ul> |

In [1]:
# Import required modules
import pandas as pd  # pandas is used for data manipulation and analysis
import numpy as np  # numpy is used for numerical operations, though not directly used in this code
from statsmodels.formula.api import logit  # logit function from statsmodels for logistic regression

In [2]:
# Load the dataset from a CSV file
cars = pd.read_csv("car_insurance.csv")  # Reads the CSV file into a pandas DataFrame
cars.info()  # Prints information about the dataset (column names, data types, and null values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  object 
 4   education            10000 non-null  object 
 5   income               10000 non-null  object 
 6   credit_score         9018 non-null   float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  object 
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       9043 non-null   float64
 13  vehicle_type         10000 non-null  object 
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  
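The `Non-Null Count` column above shows that `credit_score` (9018) and `annual_mileage` (9043) have fewer than 10000 non-null entries. As a minimal sketch, the same check can be made explicit by counting missing values per column. The `toy` frame and its values are made up for illustration and are not taken from `car_insurance.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `cars`; the values are invented for illustration.
toy = pd.DataFrame({
    "credit_score": [0.62, np.nan, 0.48, 0.71],
    "annual_mileage": [12000.0, 11000.0, np.nan, np.nan],
    "duis": [0, 0, 1, 0],
})

# Count missing entries per column, then keep only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the real data, replacing `toy` with `cars` should list only the columns that need imputation.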

In [3]:
# Fill missing values in the 'credit_score' column with the column's mean value
# Assign the result back instead of using a chained inplace=True call, which pandas deprecates
cars["credit_score"] = cars["credit_score"].fillna(cars["credit_score"].mean())


In [4]:
# Fill missing values in the 'annual_mileage' column with the column's mean value
# Assign the result back instead of using a chained inplace=True call, which pandas deprecates
cars["annual_mileage"] = cars["annual_mileage"].fillna(cars["annual_mileage"].mean())
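A minimal sketch of mean imputation written with plain assignment, shown on a toy frame so the effect is easy to verify by hand. The `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column; values are invented.
toy = pd.DataFrame({
    "credit_score": [0.2, np.nan, 0.6],
    "annual_mileage": [10000.0, 14000.0, np.nan],
})

# Series.mean() skips NaN by default, so each mean uses observed values only.
means = toy[["credit_score", "annual_mileage"]].mean()

# Passing a Series to DataFrame.fillna fills each column with its own value.
# Assigning the result back updates `toy` without relying on inplace=True.
toy = toy.fillna(means)
print(toy)
```

Both columns are handled in one call here; filling them one at a time with `df[col] = df[col].fillna(...)` should behave the same way.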


In [5]:
# Initialize an empty list to store models
models = []

In [6]:
# Create a new DataFrame `features` that drops the 'id' and 'outcome' columns from the original dataset
features = cars.drop(columns=["id", "outcome"])  # 'id' is not a predictor, and 'outcome' is the target variable
columns = features.columns  # Get the list of feature column names


In [7]:
# Loop through each feature column to fit a logistic regression model for each feature individually
for col in columns:
    # For each feature, fit a logistic regression model using the formula `outcome ~ feature`
    model = logit(f"outcome ~ {col}", data=cars).fit()  # Fits a logistic regression model for predicting 'outcome'
    models.append(model)  # Store the fitted model in the 'models' list

Optimization terminated successfully.
         Current function value: 0.511794
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.615951
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.467092
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.603742
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.531499
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.572557
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.552412
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.572668
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.586659
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.595431
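The loop above fits one model per column, including columns stored as strings (dtype `object`), which the formula interface treats as categorical. The following is a small sketch of the same pattern on synthetic data. The column names `mileage_k` and `experience`, the category labels, and the claim probabilities are all hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for `cars`: one numeric and one string-valued feature.
toy = pd.DataFrame({
    "mileage_k": rng.normal(12, 3, n),
    "experience": rng.choice(["0-9y", "10-19y", "20-29y"], n),
})

# Invented claim probabilities: less experience means more claims.
p = toy["experience"].map({"0-9y": 0.6, "10-19y": 0.3, "20-29y": 0.1})
toy["outcome"] = (rng.random(n) < p).astype(int)

# One single-feature model per column; disp=0 silences the optimizer messages.
fitted = {}
for col in ["mileage_k", "experience"]:
    fitted[col] = logit(f"outcome ~ {col}", data=toy).fit(disp=0)

# The string column yields an intercept plus one coefficient per non-reference level.
print(fitted["experience"].params)
```

Storing the models in a dict keyed by column name, rather than a list, keeps each model tied to its feature without relying on matching list positions.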
  

In [8]:
# Initialize an empty list to store the accuracies of each model
accuracies = []

In [9]:
# Loop through each model and calculate its accuracy using the confusion matrix
for model in models:  # Iterate directly over the fitted models
    conf_matrix = model.pred_table()  # Rows are observed classes, columns are predicted classes
    tn = conf_matrix[0, 0]  # True Negatives: correctly predicted negatives
    tp = conf_matrix[1, 1]  # True Positives: correctly predicted positives
    fn = conf_matrix[1, 0]  # False Negatives: positives incorrectly predicted as negatives
    fp = conf_matrix[0, 1]  # False Positives: negatives incorrectly predicted as positives

    # Calculate the accuracy of the model: Accuracy = (TN + TP) / (TN + TP + FP + FN)
    acc = (tn + tp) / (tn + tp + fp + fn)
    accuracies.append(acc)  # Store the calculated accuracy in the 'accuracies' list
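The accuracy formula can be checked by hand on a small table. The sketch below uses a hypothetical 2x2 confusion matrix in the layout that `pred_table()` returns (rows are observed classes, columns are predicted classes). The counts are made up.

```python
import numpy as np

# Hypothetical confusion matrix: rows = observed, columns = predicted.
conf_matrix = np.array([[50.0, 10.0],
                        [5.0, 35.0]])

tn, fp = conf_matrix[0]  # observed 0: predicted 0, predicted 1
fn, tp = conf_matrix[1]  # observed 1: predicted 0, predicted 1

# Accuracy written out term by term.
acc_long = (tn + tp) / (tn + tp + fp + fn)

# Equivalent shortcut: correct predictions sit on the diagonal.
acc_short = np.trace(conf_matrix) / conf_matrix.sum()

print(acc_long, acc_short)
```

Here 85 of the 100 cases lie on the diagonal, so both expressions give an accuracy of 0.85.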

In [10]:
# Find the feature with the highest accuracy by identifying the index of the maximum accuracy
best_feature = features.columns[accuracies.index(max(accuracies))]
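As an alternative sketch, the lookup can be done with a pandas Series indexed by feature name, so the name and its accuracy stay together. The feature names and accuracy values below are made up and are not results from this dataset.

```python
import pandas as pd

# Made-up accuracies for three hypothetical features.
accuracies = pd.Series({"feature_a": 0.70, "feature_b": 0.75, "feature_c": 0.72})

best_feature = accuracies.idxmax()  # index label of the largest value
best_accuracy = accuracies.max()    # the largest value itself

print(best_feature, best_accuracy)
```

With the real results, a Series built as `pd.Series(accuracies, index=features.columns)` would presumably support the same two calls.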

In [11]:
# Create a DataFrame that contains the best feature and its corresponding accuracy
best_feature_df = pd.DataFrame({"best_feature": [best_feature],
                                "best_accuracy": [max(accuracies)]},
                               index=[0])

In [12]:
# Display the result DataFrame containing the best feature and its accuracy
best_feature_df

|   | best_feature | best_accuracy |
|---|--------------|---------------|
| 0 | driving_experience | 0.7771 |
