# Car Insurance Claims Classification

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation
7. Experimentation / Improvements

# 1. Problem Definition

How can we use Python-based machine learning models and the given features to predict whether a claim was made?

# 2. Data

Data from: https://www.kaggle.com/sagnik1511/car-insurance-data

## Context

The company has shared its annual car insurance data. Now, you have to find out the real customer behaviors over the data.

## Content

The columns resemble real-world features.
The OUTCOME column is 1 if a customer has made a claim, else 0.
The data has 19 features, 18 of which are logs taken by the company.

## Acknowledgements

Most of the data is real; some of it was generated by Sagnik Roy.

# 3. Evaluation

As this is a classification problem, we will use classification metrics to evaluate the model.
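As a quick reminder of what those metrics measure, here is a minimal sketch that derives them from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this dataset.

```python
# Hypothetical confusion-matrix counts (not taken from the insurance data).
tp, fp, fn, tn = 60, 20, 40, 880

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of the predicted claims, how many were real
recall = tp / (tp + fn)                     # of the real claims, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts accuracy is high even though recall is modest, which is why accuracy alone can flatter a model when one class is rare.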

# 4. Features

## Input/Features

    1. ID
    2. AGE
    3. GENDER
    4. RACE
    5. DRIVING_EXPERIENCE
    6. EDUCATION
    7. INCOME
    8. CREDIT_SCORE
    9. VEHICLE_OWNERSHIP
    10. VEHICLE_YEAR
    11. MARRIED
    12. CHILDREN
    13. POSTAL_CODE
    14. ANNUAL_MILEAGE
    15. VEHICLE_TYPE
    16. SPEEDING_VIOLATIONS
    17. DUIS
    18. PAST_ACCIDENTS
    
## Outputs/Labels
    
    19. OUTCOME

## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading the Dataset

In [None]:
# Local
# df = pd.read_csv('Data/Car_Insurance_Claim.csv')

#Kaggle
df = pd.read_csv('/kaggle/input/car-insurance-data/Car_Insurance_Claim.csv')
df.head()

## Data Exploration

In [None]:
df

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Outcome Count')
sns.countplot(data=df, x ='OUTCOME');

From the count, we can see that the data is imbalanced.
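One way to put a number on the imbalance is to compute the class shares and the accuracy of always predicting the majority class. The sketch below uses a hypothetical label list; in this notebook the equivalent would be `df['OUTCOME'].value_counts(normalize=True)`.

```python
from collections import Counter

# Hypothetical labels standing in for df['OUTCOME'].
labels = [0] * 7 + [1] * 3

counts = Counter(labels)
total = sum(counts.values())

# Fraction of rows belonging to each class.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(shares.values())

print(shares, majority_baseline)
```

Any model worth keeping should beat the majority baseline on accuracy and, more importantly, show useful recall on the minority class.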

In [None]:
df['AGE'].value_counts()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Age Count')
sns.countplot(data=df, x ='AGE');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Age Count colored by outcome')
sns.countplot(data=df, x ='AGE', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Gender Count color by outcome')
sns.countplot(data=df, x ='GENDER', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Race Count color by outcome')
sns.countplot(data=df, x ='RACE', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Driving Experience Count color by outcome')
sns.countplot(data=df, x ='DRIVING_EXPERIENCE', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Education Count color by outcome')
sns.countplot(data=df, x ='EDUCATION', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Income Count color by outcome')
sns.countplot(data=df, x ='INCOME', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Credit Score histogram')
sns.histplot(data=df, x='CREDIT_SCORE', kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Boxplot of Credit Score')
sns.boxplot(data=df, x='CREDIT_SCORE')

In [None]:
df[df['CREDIT_SCORE']<0.1]

In [None]:
df[df['CREDIT_SCORE']>0.9]

As we can see, there are some outliers in the credit scores; however, we don't think they will affect the overall model.
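The outlier check can also be expressed numerically instead of with fixed cut-offs. The following is a minimal sketch of the 1.5 × IQR rule (the same rule seaborn's boxplot whiskers use by default). The helper name `iqr_outlier_mask` and the sample scores are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])  # ignore NaN when computing quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical credit scores; the last two are far from the rest.
scores = np.array([0.45, 0.50, 0.52, 0.55, 0.48, 0.51, 0.05, 0.97])
mask = iqr_outlier_mask(scores)
print(scores[mask])
```

In the notebook, the same mask could be built from `df['CREDIT_SCORE']` and used to count how many rows fall outside the whiskers before deciding whether they matter.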

In [None]:
plt.figure(figsize=(20,10))
plt.title('Vehicle ownership Score histogram')
sns.histplot(data=df, x='VEHICLE_OWNERSHIP', kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Vehicle ownership Count colored by outcome')
sns.countplot(data=df, x='VEHICLE_OWNERSHIP', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Vehicle Year Count colored by outcome')
sns.countplot(data=df, x='VEHICLE_YEAR', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('married Count colored by outcome')
sns.countplot(data=df, x='MARRIED', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Children Count colored by outcome')
sns.countplot(data=df, x='CHILDREN', hue='OUTCOME');

In [None]:
df.info()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Annual Mileage Score histogram')
sns.histplot(data=df, x='ANNUAL_MILEAGE', kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Annual Mileage histogram colored by outcome')
sns.histplot(data=df, x='ANNUAL_MILEAGE',hue='OUTCOME', kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Vehicle type Count colored by outcome')
sns.countplot(data=df, x='VEHICLE_TYPE', hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Speeding Violations histogram colored by outcome')
sns.histplot(data=df, x='SPEEDING_VIOLATIONS',hue='OUTCOME', kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('DUIS count colored by outcome')
sns.countplot(data=df, x='DUIS',hue='OUTCOME');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Past Accidents colored by outcome')
sns.countplot(data=df, x='PAST_ACCIDENTS',hue='OUTCOME');

In [None]:
df.info()

We will use the median credit score of each income group to fill the NaN values in CREDIT_SCORE.

In [None]:
df['INCOME'].value_counts()

In [None]:
upper_class_median = df[df['INCOME'] == 'upper class']['CREDIT_SCORE'].median()
middle_class_median = df[df['INCOME'] == 'middle class']['CREDIT_SCORE'].median()
poverty_class_median = df[df['INCOME'] == 'poverty']['CREDIT_SCORE'].median()
working_class_median = df[df['INCOME'] == 'working class']['CREDIT_SCORE'].median()

In [None]:
df[(df['INCOME'] == 'working class') & df['CREDIT_SCORE'].isnull()].index

In [None]:
# Fill missing credit scores with the median of the matching income group
df.loc[(df['INCOME'] == 'working class') & df['CREDIT_SCORE'].isnull(), 'CREDIT_SCORE'] = working_class_median
df.loc[(df['INCOME'] == 'poverty') & df['CREDIT_SCORE'].isnull(), 'CREDIT_SCORE'] = poverty_class_median
df.loc[(df['INCOME'] == 'middle class') & df['CREDIT_SCORE'].isnull(), 'CREDIT_SCORE'] = middle_class_median
df.loc[(df['INCOME'] == 'upper class') & df['CREDIT_SCORE'].isnull(), 'CREDIT_SCORE'] = upper_class_median
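The same group-wise imputation can be written without naming each income group, which may help if the categories change. This is a sketch on a hypothetical toy frame (`toy` and its values are made up), using `groupby(...).transform('median')` to broadcast each group's median back to its rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the INCOME / CREDIT_SCORE columns.
toy = pd.DataFrame({
    'INCOME': ['poverty', 'poverty', 'poverty',
               'upper class', 'upper class', 'upper class'],
    'CREDIT_SCORE': [0.30, np.nan, 0.40, 0.70, 0.80, np.nan],
})

# Median credit score of each row's income group, aligned to the original index.
group_median = toy.groupby('INCOME')['CREDIT_SCORE'].transform('median')

# Only the missing entries are replaced; existing scores are left untouched.
toy['CREDIT_SCORE'] = toy['CREDIT_SCORE'].fillna(group_median)
print(toy)
```

Applied to `df`, this would give the same result as the four per-group assignments, assuming every income group has at least one non-missing score.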

In [None]:
df.info()

We will use the median ANNUAL_MILEAGE to fill the NaN values for ANNUAL_MILEAGE.

In [None]:
df['ANNUAL_MILEAGE'] = df['ANNUAL_MILEAGE'].fillna(df['ANNUAL_MILEAGE'].median())

In [None]:
df.info()

In [None]:
df.describe()

# 5. Modelling

In [None]:
X = df.drop(['OUTCOME','ID'], axis=1)
y = df['OUTCOME']
X = pd.get_dummies(X, drop_first=True)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
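Because the OUTCOME labels are imbalanced, one option (not used in the split above) is to pass `stratify=y` so that the train and test sets keep the same class ratio. A minimal sketch on hypothetical toy arrays (`X_toy` and `y_toy` are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 30% positive labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy keeps the 70/30 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Without stratification the minority share in the test set can drift by chance, which makes recall and F1 noisier from one random seed to the next.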

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Model imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

## Baseline Model Scores

In [None]:
from sklearn.metrics import classification_report,precision_score, recall_score,f1_score

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_scores = {}
    model_recall = {}
    model_f1 = {}
    model_precision = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        y_preds = model.predict(X_test)
        print(name)
        print(classification_report(y_test, y_preds))
        print('\n')
        model_scores[name] = model.score(X_test,y_test)
        model_recall[name] = recall_score(y_test, y_preds)
        model_f1[name] = f1_score(y_test, y_preds)
        model_precision[name] = precision_score(y_test, y_preds)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
    model_recall = pd.DataFrame(model_recall, index=['Recall']).transpose()
    model_recall = model_recall.sort_values('Recall')
    model_f1 = pd.DataFrame(model_f1, index=['F1']).transpose()
    model_f1 = model_f1.sort_values('F1')
    model_precision = pd.DataFrame(model_precision, index=['Precision']).transpose()
    model_precision = model_precision.sort_values('Precision')
        
    return model_scores, model_recall, model_f1, model_precision

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'XGBRFClassifier': XGBRFClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'LGBMClassifier':LGBMClassifier(),
         'CatBoostClassifier': CatBoostClassifier(verbose=0)}

In [None]:
model_scores, model_recall, model_f1, model_precision = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
model_scores

In [None]:
model_recall

In [None]:
model_f1

In [None]:
model_precision

Since the labels are imbalanced, accuracy alone can be misleading, so we also compare recall, F1 and precision. We will use the LGBMClassifier, as it provides the best overall scores, and run a randomized search with cross-validation to find better hyperparameters.

## Random Search CV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
def randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_rs_scores = {}
    model_rs_best_param = {}
    
    for name, model in models.items():
        rs_model = RandomizedSearchCV(model,
                                     param_distributions=params[name],
                                      scoring='f1',
                                      cv=5,
                                     n_iter=40,
                                     verbose=0)        
        rs_model.fit(X_train,y_train)
        model_rs_scores[name] = rs_model.score(X_test,y_test)
        model_rs_best_param[name] = rs_model.best_params_
        y_preds = rs_model.predict(X_test)
        print('\n')
        print(name)
        print(classification_report(y_test, y_preds))
        print('\n')
        
        
    return model_rs_scores, model_rs_best_param
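For reference, here is a self-contained sketch of the same kind of search on synthetic data. It uses a `DecisionTreeClassifier` and a small hypothetical grid as stand-ins for LightGBM (an assumption made only to keep the example light). It also passes `random_state` to the search itself and reads `best_score_`, the mean cross-validated F1, which gives a way to compare search rounds without consulting the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions={
        'max_depth': [2, 3, 4, 5, 6],          # hypothetical search space
        'min_samples_leaf': [1, 5, 10, 20],
    },
    n_iter=8,          # number of sampled parameter combinations
    scoring='f1',      # optimise F1 because the classes are imbalanced
    cv=5,
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated F1 of the best combination
```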

## Baseline CV scores

In [None]:
models = {'LGBMClassifier': LGBMClassifier()}

params = {'LGBMClassifier':{}}

In [None]:
model_rs_scores_base, model_rs_best_param_base = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

## RS Model 1

In [None]:
params = {'LGBMClassifier':{'num_leaves': np.arange(21,42,2),
                           'learning_rate': np.linspace(0.1,0.9,9),
                            'n_estimators':[50,100,200,300,500],
                            'min_split_gain':np.linspace(0.0,0.9,10),
                            'min_child_weight':np.linspace(0.0,0.9,10),
                            'min_child_samples': [10,20,40,80,100],
                            'reg_alpha': np.linspace(0.0,0.9,10),
                            'reg_lambda': np.linspace(0.0,0.9,10)
                           }
         }

In [None]:
model_rs_scores1, model_rs_best_param1 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores1

In [None]:
model_rs_best_param1

## RS Model 2

In [None]:
params = {'LGBMClassifier':{'num_leaves': np.arange(30,43),
                           'learning_rate': np.linspace(0.0,0.2,9),  # note: the grid includes 0.0, but LightGBM requires learning_rate > 0
                            'n_estimators':[250,300,250],
                            'min_split_gain':np.linspace(0.7,0.9,10),
                            'min_child_weight':np.linspace(0.0,0.1,10),
                            'min_child_samples': [5,10,20,30],
                            'reg_alpha': np.linspace(0.4,0.6,10),
                            'reg_lambda': np.linspace(0.4,1.6,10)
                           }
         }

In [None]:
model_rs_scores2, model_rs_best_param2 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores2

In [None]:
model_rs_best_param2

## RS Model 3

In [None]:
params = {'LGBMClassifier':{'num_leaves': [41],
                           'learning_rate': np.linspace(0.001,0.006,9),
                            'n_estimators':[290,300,310],
                            'min_split_gain':np.linspace(0.8,0.9,10),
                            'min_child_weight':[0.05555555555555556],
                            'min_child_samples': [3,4,5,6,7,8],
                            'reg_alpha': [0.5333333333333333],
                            'reg_lambda': [1.6]
                           }
         }

In [None]:
model_rs_scores3, model_rs_best_param3 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores3

In [None]:
model_rs_best_param3

In [None]:
model_rs_scores2

In [None]:
model_rs_best_param2

We will use RS model 2, as it provides the best hyperparameters.

# 6. Model Evaluation

In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.model_selection import cross_val_score

In [None]:
model = LGBMClassifier(reg_lambda = 1.6,
                      reg_alpha = 0.5333333333333333,
                      num_leaves = 41,
                      n_estimators = 300,
                      min_split_gain = 0.8555555555555556,
                      min_child_weight = 0.05555555555555556,
                      min_child_samples = 5,
                      learning_rate = 0.05)

In [None]:
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

## Classification Report

In [None]:
print(classification_report(y_test, y_preds))

## Confusion Matrix

In [None]:
# Plot the confusion matrix on the test set (Display API, scikit-learn >= 1.0)
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

## ROC Curve

In [None]:
# Plot the ROC curve on the test set (Display API, scikit-learn >= 1.0)
RocCurveDisplay.from_estimator(model, X_test, y_test)

## Evaluation using Cross-Validation

In [None]:
def get_cv_score(model, X, y, cv=5):
    # Cross-validate the model once per metric and print per-fold and mean scores
    cv_accuracy = cross_val_score(model, X, y, cv=cv,
                                  scoring='accuracy')
    print(f'Cross Validation accuracy Scores: {cv_accuracy}')
    print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')

    cv_precision = cross_val_score(model, X, y, cv=cv,
                                   scoring='precision')
    print(f'Cross Validation precision Scores: {cv_precision}')
    print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')

    cv_recall = cross_val_score(model, X, y, cv=cv,
                                scoring='recall')
    print(f'Cross Validation recall Scores: {cv_recall}')
    print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')

    cv_f1 = cross_val_score(model, X, y, cv=cv,
                            scoring='f1')
    print(f'Cross Validation f1 Scores: {cv_f1}')
    print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}')

    # Collect the mean of each metric into a one-row summary table
    cv_metrics = pd.DataFrame({'Accuracy': cv_accuracy.mean(),
                               'Precision': cv_precision.mean(),
                               'Recall': cv_recall.mean(),
                               'f1': cv_f1.mean()}, index=[0])

    return cv_metrics

In [None]:
cv_metrics = get_cv_score(model, X_train, y_train, cv=5)

In [None]:
cv_metrics
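As an alternative sketch, scikit-learn's `cross_validate` can compute several metrics in a single pass over the folds, and wrapping the scaler in a pipeline refits it inside each fold so that validation rows do not influence the scaling. The example uses synthetic data and `LogisticRegression` as stand-ins; both are assumptions for illustration, not the notebook's LGBM model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the unscaled training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)

# The scaler is fitted on the training part of each fold only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scoring = ['accuracy', 'precision', 'recall', 'f1']
results = cross_validate(pipeline, X_demo, y_demo, cv=5, scoring=scoring)

# cross_validate stores per-fold scores under 'test_<metric>' keys.
summary = {metric: results[f'test_{metric}'].mean() for metric in scoring}
print(summary)
```

Because every metric is read from its own key, this layout makes it harder to report one metric under another's name by accident.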

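With this model and the CV evaluation, we get the following:

    Accuracy 0.850714
    Precision 0.762139
    Recall 0.762747
    f1 0.762747

Note that the f1 value equals the recall value because, in the run that produced these numbers, the summary table filled the f1 entry from the mean recall; the mean F1 itself is the one printed by `get_cv_score`.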

# 7. Experimentation / Improvements

With a recall of about 76% and a reported F1 of about 76% in the CV evaluation and the classification report, we hope to build a better-scoring model.

Possible directions for improvement:

    1. Check for other outliers, try other ways of filling NaN values, or drop the rows with NaN values instead of filling them.
    2. Explore the dataset further to build a better model.
    3. Get more data to balance out the labels.
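If collecting more data is not possible, a commonly used alternative is to reweight the classes instead of rebalancing the rows. The sketch below shows how scikit-learn's "balanced" weights are computed on a hypothetical label array. Many classifiers, including `LGBMClassifier`, accept a `class_weight` parameter, though whether this helps here is untested.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 70/30 split, standing in for y_train.
y_toy = np.array([0] * 70 + [1] * 30)

# 'balanced' weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_toy)
class_weight = dict(zip([0, 1], weights))

print(class_weight)
```

The minority class receives the larger weight, so errors on claims cost more during training than errors on non-claims.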