本demo使用网络车险案件数据集，演示机器学习模型应用于车险欺诈识别的效果

需要配合本demo使用的文件：
* insurance_claims.csv
* 测试数据.xlsx
* 测试数据标签.xlsx
* 训练数据.xlsx
* 训练数据标签.xlsx

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

df = pd.read_csv('insurance_claims.csv')
df.replace('?', np.nan, inplace = True)
import missingno as msno

msno.bar(df)
df['collision_type'] = df['collision_type'].fillna(df['collision_type'].mode()[0])
df['property_damage'] = df['property_damage'].fillna(df['property_damage'].mode()[0])
df['police_report_available'] = df['police_report_available'].fillna(df['police_report_available'].mode()[0])
#df.isna().sum()
plt.figure(figsize = (18, 12))
corr = df.corr()
sns.heatmap(data = corr, annot = True, fmt = '.2g', linewidth = 1)
#plt.show()
to_drop = ['policy_number','policy_bind_date','policy_state','insured_zip','incident_location','incident_date',
           'incident_state','incident_city','insured_hobbies','auto_make','auto_model','auto_year', '_c39']
df.drop(to_drop, inplace = True, axis = 1)
plt.figure(figsize = (18, 12))
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))
sns.heatmap(data = corr, mask = mask, annot = True, fmt = '.2g', linewidth = 1)
#plt.show()
df.drop(columns = ['age', 'total_claim_amount'], inplace = True, axis = 1)
X = df.drop('fraud_reported', axis = 1)
y = df['fraud_reported']
cat_df = X.select_dtypes(include = ['object'])
cat_df = pd.get_dummies(cat_df, drop_first = True)
num_df = X.select_dtypes(include = ['int64'])
X = pd.concat([num_df, cat_df], axis = 1)
plt.figure(figsize = (25, 20))
plotnumber = 1
for col in X.columns:
    if plotnumber <= 24:
        ax = plt.subplot(5, 5, plotnumber)
        sns.distplot(X[col])
        plt.xlabel(col, fontsize = 15)
        
    plotnumber += 1   
#plt.tight_layout()
#plt.show()
plt.figure(figsize = (20, 15))
plotnumber = 1
for col in X.columns:
    if plotnumber <= 24:
        ax = plt.subplot(5, 5, plotnumber)
        sns.boxplot(X[col
        ])
        plt.xlabel(col, fontsize = 15)
    
    plotnumber += 1
#plt.tight_layout()
#plt.show()

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
num_df = X_train[['months_as_customer', 'policy_deductable', 'umbrella_limit',
       'capital-gains', 'capital-loss', 'incident_hour_of_the_day',
       'number_of_vehicles_involved', 'bodily_injuries', 'witnesses', 'injury_claim', 'property_claim',
       'vehicle_claim']]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(num_df)
scaled_num_df = pd.DataFrame(data = scaled_data, columns = num_df.columns, index = X_train.index)
X_train.drop(columns = scaled_num_df.columns, inplace = True)
X_train = pd.concat([scaled_num_df, X_train], axis = 1)

In [10]:
pd.DataFrame(X_train).to_excel('训练数据.xlsx')
pd.DataFrame(y_train).to_excel('训练数据标签.xlsx')

In [12]:
pd.DataFrame(X_test).to_excel('测试数据.xlsx')
pd.DataFrame(y_test).to_excel('测试数据标签.xlsx')

上面的代码将原始数据进行了预处理（包括数据清洗、数据编码等工作，将原始的数据转换成了适用与机器学习模型的数据），同时将原始数据划分为训练模型用的数据以及测试模型用的数据，均导出为了excel文件。

## 决策树模型

In [14]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
dtc_train_acc = accuracy_score(y_train, dtc.predict(X_train))
dtc_test_acc = accuracy_score(y_test, y_pred)
from sklearn.model_selection import GridSearchCV
grid_params = {
    'criterion' : ['gini', 'entropy'],
    'max_depth' : [3, 5, 7, 10],
    'min_samples_split' : range(2, 10, 1),
    'min_samples_leaf' : range(2, 10, 1)
}
grid_search = GridSearchCV(dtc, grid_params, cv = 5, n_jobs = -1, verbose = 1)
grid_search.fit(X_train, y_train)
dtc = grid_search.best_estimator_
y_pred = dtc.predict(X_test)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
dtc_train_acc = accuracy_score(y_train, dtc.predict(X_train))
dtc_test_acc = accuracy_score(y_test, y_pred)

Fitting 5 folds for each of 512 candidates, totalling 2560 fits


上述代码使用750个训练用案件数据训练了机器学习模型，并对250个测试用案件数据进行了预测，这250个模型的预测结果如下('N'为不存在欺诈，'Y'为存在欺诈)

In [15]:
y_pred

array(['N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y',
       'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y', 'N', 'N',
       'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N',
       'Y', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'Y', 'N', 'Y',
       'Y', 'N', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'Y',
       'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'N', 'N',
       'N', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N',
       'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'N',
       'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N',
       'N', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'Y', 'N',
       'Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'Y', 'N', 'N', 'N',
       'N', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'N', 'Y', 'N', 'N',
       'Y', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'N',
       'N', 'N', 'N', 'N', 'Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N

与250个测试案件的实际结果对比，当前模型的准确率如下：

In [16]:
print(f"Training accuracy of Decision Tree is : {dtc_train_acc}")
print(f"Test accuracy of Decision Tree is : {dtc_test_acc}")

Training accuracy of Decision Tree is : 0.7906666666666666
Test accuracy of Decision Tree is : 0.768
