# **Auto Insurance: Fraud Claim Prediction Model**

Insurance fraud is a deliberate deception perpetrated against or by an insurance company or agent for the purpose of financial gain. Fraud may be committed at different points in the transaction by applicants, policyholders, third-party claimants, or professionals who provide services to claimants. Insurance agents and company employees may also commit insurance fraud. Common frauds include “padding,” or inflating claims; misrepresenting facts on an insurance application; submitting claims for injuries or damage that never occurred; and staging accidents.

Auto insurance fraud ranges from misrepresenting facts on insurance applications and inflating insurance claims to staging accidents and submitting claim forms for injuries or damage that never occurred, to false reports of stolen vehicles.

***Source***: https://www.iii.org/article/background-on-insurance-fraud

# **Problem Statement**
"Our objective is to create an interface for an insurance company, backed by a machine learning model, that identifies fraudulent claims in the automobile industry."

**H0**: "*Given data is fraud*"

**H1**: "*Given data is genuine*"

In [None]:
import numpy as np #Linear Algebra library
import pandas as pd #Data Analytical library
import matplotlib.pyplot as plt #Data Visualisation Library
import seaborn as sns #Statistical Visualisation Library

### **Loading the Data**
Using the pandas library, we import the insurance dataset.

***Source***: https://www.kaggle.com/roshansharma/insurance-claim

In [None]:
df=pd.read_csv("../input/insurance-claim-report/insurance_claims_report.csv")
df_copy = df.copy()
pd.set_option('display.max_columns', 100)
df_copy

In [None]:
df_copy.shape

In [None]:
df_copy.info()

In [None]:
df_copy.describe().transpose()

In [None]:
df_copy.select_dtypes(include='object').describe().transpose()

# Data Preprocessing
-  Data cleaning
-  Data Transformation
-  Data Integration

## 1. Data Cleaning

In [None]:
#replacing '?' with NaN

df_copy = df_copy.replace('?', np.nan)
df_copy.shape

In [None]:
df_copy.isnull().sum()

In [None]:
#replacing null values with the mode of the respective columns
#(assign the result back: inplace=True on a selected column triggers a warning in newer pandas and may not update df_copy)

for col in ['collision_type', 'property_damage', 'police_report_available']:
    df_copy[col] = df_copy[col].fillna(df_copy[col].mode()[0])

In [None]:
df_copy.isnull().sum()

In [None]:
df_copy.isna().sum()
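
The two cleaning steps above can be checked on a small example. This is a minimal sketch: the frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "collision_type": ["Rear Collision", "?", "Rear Collision", "Side Collision"],
    "property_damage": ["YES", "NO", "?", "NO"],
})

# Step 1: treat the '?' placeholder as a missing value.
toy = toy.replace("?", np.nan)
missing_before = int(toy.isna().sum().sum())

# Step 2: fill each column's gaps with that column's most frequent value.
for col in toy.columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

missing_after = int(toy.isna().sum().sum())
print(missing_before, missing_after)
```

Mode imputation keeps every row but makes the most common category even more common, so it is worth comparing the category counts before and after.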

## 2. Data Transformation

In [None]:
df_copy['policy_bind_date'] = pd.to_datetime(df_copy['policy_bind_date'])
df_copy['incident_date'] = pd.to_datetime(df_copy['incident_date'])
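
Once both columns are datetimes, date arithmetic becomes possible. The sketch below shows one possible use: the number of days between policy binding and the incident. The column name `policy_age_days` and the dates in `toy` are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy dates (made up for illustration).
toy = pd.DataFrame({
    "policy_bind_date": ["2014-10-17", "2006-06-27"],
    "incident_date": ["2015-01-25", "2015-01-21"],
})

# Same conversion as in the cell above.
for col in ["policy_bind_date", "incident_date"]:
    toy[col] = pd.to_datetime(toy[col])

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
toy["policy_age_days"] = (toy["incident_date"] - toy["policy_bind_date"]).dt.days
print(toy)
```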

# EDA 
- Descriptive Statistics
- Outlier Analysis
- Data Visualisation

## 1. Descriptive Statistics

In [None]:
df_copy.describe().transpose()

In [None]:
df_copy.select_dtypes(include='object').describe().transpose()

## 2. Outlier Analysis

In [None]:
df_copy.plot.box(figsize = (16,6))
plt.xticks(rotation = 90)

In [None]:
#import sklearn.preprocessing as pre
#lb=pre.LabelEncoder()

#df_copy['fraud_reported']=lb.fit_transform(df_copy['fraud_reported'])
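
The commented-out cell above sketches label encoding of the target. A pandas-only alternative is shown below, assuming the target takes the values `'Y'` and `'N'`. The column name `fraud_flag` is hypothetical, and the original column is kept so that the categorical plots further down still work.

```python
import pandas as pd

# Toy target column (made-up values).
toy = pd.DataFrame({"fraud_reported": ["Y", "N", "N", "Y", "N"]})

# Explicit mapping makes the meaning of 0 and 1 visible in the code.
toy["fraud_flag"] = toy["fraud_reported"].map({"N": 0, "Y": 1})
print(toy)
```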

In [None]:
#correlation matrix for numerical variables

plt.figure(figsize = (10, 10))
sns.heatmap(df_copy.select_dtypes(include='number').corr(), annot = True, cmap = 'Blues')
plt.title('Correlation matrix for numerical features')
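
A large heatmap can be hard to read, so it may help to list the strongest pairs directly. The following is a sketch only: the helper `top_correlations` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n numeric column pairs with the largest absolute correlation."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data (made-up values): property_claim is exactly a tenth of total_claim_amount.
toy = pd.DataFrame({
    "total_claim_amount": [50000, 60000, 40000, 70000, 55000],
    "property_claim": [5000, 6000, 4000, 7000, 5500],
    "witnesses": [0, 3, 1, 2, 1],
})
top = top_correlations(toy, n=2)
print(top)
```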

## 3. Data Visualisation

In [None]:
#function for crosstabs
def cross_tab(x,y):
    crtab = pd.crosstab(df_copy[x], df_copy[y])
    return crtab


In [None]:
#Number of fraud claim

p = df_copy['fraud_reported'].value_counts()
print(p)
df_copy['fraud_reported'].value_counts().plot.bar()


In [None]:
sns.countplot(x=df_copy['fraud_reported'])
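
The bar charts show raw counts; the class shares and their ratio are often easier to compare. A minimal sketch on made-up labels (the 6:2 split is illustrative and is not the dataset's actual ratio):

```python
import pandas as pd

# Toy labels (made-up): 6 genuine claims, 2 fraudulent ones.
toy = pd.Series(["N"] * 6 + ["Y"] * 2, name="fraud_reported")

# normalize=True returns proportions instead of counts.
share = toy.value_counts(normalize=True)
imbalance_ratio = share.max() / share.min()
print(share)
print(imbalance_ratio)
```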

In [None]:
#Age v/s fraud reported


table=pd.crosstab(df_copy.age,df_copy.fraud_reported)
stacked_data = table.apply(lambda x: x*100/sum(x), axis=1)
stacked_data.plot(kind="bar", stacked=True)
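
The row-percentage step above can also be done by `pd.crosstab` itself through its `normalize` parameter. The sketch below checks on made-up data that both routes give the same table.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    "age": [30, 30, 30, 30, 45, 45],
    "fraud_reported": ["N", "N", "N", "Y", "N", "Y"],
})

# Manual route, as in the cell above: divide each row by its total.
table = pd.crosstab(toy["age"], toy["fraud_reported"])
manual = table.apply(lambda row: row * 100 / row.sum(), axis=1)

# Built-in route: normalize="index" scales each row to sum to 1.
builtin = pd.crosstab(toy["age"], toy["fraud_reported"], normalize="index") * 100
print(builtin)
```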

In [None]:
#insured sex v/s fraud reported

cross_tab('insured_sex','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='insured_sex',hue='fraud_reported',kind='count')

In [None]:
#policy state v/s fraud reported

cross_tab('policy_state','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='policy_state',hue='fraud_reported',kind='count')

In [None]:
#Hour in which incident happened

table1=pd.crosstab(df_copy['incident_hour_of_the_day'],df_copy['fraud_reported'])
stacked_data = table1.apply(lambda x: x*100/sum(x), axis=1)
stacked_data.plot(kind="bar", stacked=True)

In [None]:
#Hour in which incident happened

cross_tab('incident_hour_of_the_day','fraud_reported')

In [None]:
sns.countplot(data = df_copy,x ='incident_hour_of_the_day',hue='fraud_reported')
plt.xticks(rotation = 90)

In [None]:
#incident type v/s fraud reported

cross_tab('incident_type','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='incident_type',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
#insured education level v/s fraud reported

cross_tab('insured_education_level','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='insured_education_level',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
#insured occupation v/s fraud reported 

cross_tab('insured_occupation','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='insured_occupation',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
#insured hobbies v/s fraud reported

cross_tab('insured_hobbies','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='insured_hobbies',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
#insured relationship v/s fraud reported

cross_tab('insured_relationship','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='insured_relationship',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
#collision type v/s fraud reported

cross_tab('collision_type','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='collision_type',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
#incident severity v/s fraud reported

cross_tab('incident_severity','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='incident_severity',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
sns.catplot(data =df_copy, x="incident_severity", y="property_claim",hue='fraud_reported',kind='box')
plt.xticks(rotation = 90)
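
Count plots compare absolute numbers, which can hide differences when category sizes vary. One way to compare categories on equal footing is the fraud rate within each level. This is a sketch: the helper `fraud_rate_by`, the level names, and the values are hypothetical, and `'Y'` is assumed to mark a fraudulent claim.

```python
import pandas as pd

def fraud_rate_by(frame, column, target="fraud_reported", positive="Y"):
    """Share of rows flagged as fraud within each level of `column`, highest first."""
    is_fraud = frame[target].eq(positive)
    return is_fraud.groupby(frame[column]).mean().sort_values(ascending=False)

# Toy data (made-up levels and labels).
toy = pd.DataFrame({
    "incident_severity": ["Major"] * 4 + ["Minor"] * 4,
    "fraud_reported": ["Y", "Y", "Y", "N", "N", "N", "N", "Y"],
})
rates = fraud_rate_by(toy, "incident_severity")
print(rates)
```

The same call works for any categorical column, for example `fraud_rate_by(df_copy, 'insured_hobbies')`.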

In [None]:
#authorities contacted v/s fraud reported

cross_tab('authorities_contacted','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='authorities_contacted',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
cross_tab('police_report_available','fraud_reported')

In [None]:

sns.catplot(data=df_copy,x='police_report_available',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
#incident state v/s fraud reported

cross_tab('incident_state','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='incident_state',hue='fraud_reported',kind='count')

In [None]:
#witnesses v/s fraud reported

cross_tab('witnesses','fraud_reported')

In [None]:
sns.catplot(data=df_copy,x='witnesses',hue='fraud_reported',kind='count')
plt.xticks(rotation = 90)

In [None]:
#incident type v/s total claim amount WRT fraud reported

sns.catplot(data=df_copy,x='incident_type',y='total_claim_amount',hue='fraud_reported',kind='bar')
plt.xticks(rotation = 90)

In [None]:
#vehicle year v/s number of fraud reported

plt.figure(figsize = (15, 5))
df_temp = df_copy[df_copy.fraud_reported == 'Y']
sns.set_style('darkgrid')
sns.countplot(x = 'auto_year', data = df_temp)
plt.ylabel('No. of fraud reported')
plt.title('Fraud reported VS vehicle year')

# **CONCLUSION**

The insurance claim report dataset was taken from a US insurance company. It had 1000 entries and 39 variables. The dataset had an equal division of categorical and numerical variables. 

EDA gives a clear picture of the data and of the interdependencies among the variables. The data is imbalanced: genuine claims outnumber fraudulent ones. Only a few variables, such as incident severity and insured hobbies, appear to be associated with the target variable (`fraud_reported`).

For modeling, this is a classification problem, so supervised machine learning models such as Logistic Regression, Decision Tree, and Random Forest can be used for prediction. In every case, the class imbalance in the data has to be addressed. Techniques such as stratified sampling, undersampling, oversampling, or SMOTE are advised for this purpose.
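
As a minimal sketch of the modeling step described above, the example below combines a stratified train/test split with a logistic regression. It runs on synthetic data, not on the insurance dataset, and the 25% fraud share is an assumption. Note that it handles the imbalance with class weighting (`class_weight="balanced"`) rather than with resampling or SMOTE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): about 25% positives, one informative feature.
rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.25).astype(int)
X = rng.normal(size=(n, 3))
X[:, 0] += 1.5 * y

# stratify=y keeps the class proportions the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

With imbalanced classes, per-class precision and recall from the report are more informative than overall accuracy.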