## Myth: Fraud is a victimless crime.

## Reality: Insurance fraud is one of the biggest crimes in the U.S.

- Every year, claims and underwriting fraud costs **$80 billion**
- Fraudulent claims account for **5-10%** of all claims

![Image](https://www.wnsdecisionpoint.com/Portals/1/Images/Reports-Images/Insurance-Fraud-Detection/exhibit-2.jpg)

## Impact of Fraud

- Fraudulent claims directly impact the loss ratio, reducing profitability and negatively impacting Return on Equity (ROE)
- Fraudulent claims add to premium costs, since insurers are compelled to pass on the cost of such claims to policyholders

## Business Need

- Control payouts on dishonest claims through better fraud detection techniques
- Automate the fraud detection approach and make better use of data

![Image](https://www.wnsdecisionpoint.com/Portals/1/Images/Reports-Images/Insurance-Fraud-Detection/exhibit-4.jpg)


![Image](https://www.wnsdecisionpoint.com/Portals/1/Images/Reports-Images/Insurance-Fraud-Detection/exhibit-5.jpg)

Detecting insurance fraud is an interesting problem from a data science perspective: it is a binary classification problem in which the “response” variable is “Y” or “N”, indicating whether or not a claim is fraudulent.

![Image](https://www.wnsdecisionpoint.com/Portals/1/Images/Reports-Images/Insurance-Fraud-Detection/exhibit-8.jpg)

![Image](https://www.wnsdecisionpoint.com/Portals/1/Images/Reports-Images/Insurance-Fraud-Detection/exhibit-12-1.jpg)

We'll demonstrate an attempt to solve this problem using an auto insurance dataset containing 1000 observations on 39 features, and train a classifier on this data to predict whether a particular claim is fraudulent or not.

The [dataset](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4954928053318020/1058911316420443/167703932442645/latest.html) is made available by APN Partner [Databricks](https://databricks.com/), a company that provides a unified analytics platform, unifying data science and engineering across the Machine Learning lifecycle.

In [None]:
!python -m pip install tornado

In [None]:
!python -m pip install nose

In [None]:
!python -m pip install pandas

In [None]:
!python -m pip install matplotlib

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("insurance_claims.csv")

## Data exploration

Display number of observations (rows) and features (columns)

In [None]:
data.shape

A quick look at some sample observations reveals that the dataset is not the cleanest:
- Some features have wide variations in values
- Some features have missing values
- There is a mix of numeric and categorical features
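
These observations can be quantified by summarizing the dtype, missing-value count, and number of distinct values per column. The sketch below uses a small made-up frame: the values are illustrative, and treating `?` as a missing-value placeholder is an assumption about how gaps are encoded in the file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the claims data; values are illustrative only.
toy = pd.DataFrame({
    "months_as_customer": [12, 340, 87, 5],
    "collision_type": ["Rear Collision", "?", "Side Collision", "?"],
    "total_claim_amount": [71610.0, 5070.0, np.nan, 63400.0],
})

# Assumption: gaps may be encoded as the placeholder string "?".
toy = toy.replace("?", np.nan)

# One row per column: its dtype, number of missing values, and distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(summary)
```

The same three-column summary could be built for `data` itself to decide which features need cleaning before encoding.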

In [None]:
data.head(10)

In [None]:
data.dtypes

In [None]:
# Collect the names of all string (object-typed) columns
stringCols = data.select_dtypes(include='object').columns.tolist()

# Print the number of distinct values in each string column
for col in stringCols:
    print("{} : {}".format(col, data[col].nunique()))
        

Based on a count of the distinct values of each categorical variable, the following fields have too many categories to be encoded into numeric values meaningfully:
- `policy_bind_date`
- `incident_date`
- `incident_location`

Among numeric columns, `insured_zip` is in reality a categorical value, and cannot be encoded meaningfully.

In addition, `policy_number` serves as an identifier, and likely has no effect on whether a claim is fraudulent or not.

These columns are therefore removed, for the sake of simplicity.

In [None]:
colsToDelete = ["policy_number", "policy_bind_date", "insured_zip", "incident_location", "incident_date"]

for col in colsToDelete:
    del data[col]

Remaining columns will be encoded later.

In [None]:
filteredStringColList = [i for i in stringCols if i not in colsToDelete]
print(filteredStringColList)

The target variable to be predicted is `fraud_reported`. Plotting the count of observations belonging to the two classes `Y` and `N` also reveals a large imbalance.

In [None]:
data['fraud_reported'].value_counts().plot(kind='bar', color=plt.cm.Set1(np.arange(len(data))), rot=1)
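
Alongside the bar chart, the imbalance can be expressed as class shares and a majority-to-minority ratio. This is a minimal sketch on made-up labels; the real proportions would come from `data['fraud_reported']`.

```python
import pandas as pd

# Illustrative labels only, not the dataset's actual distribution.
labels = pd.Series(["N"] * 6 + ["Y"] * 2, name="fraud_reported")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(shares)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```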

Analyzing the location of claims using the feature `incident_state` reveals this dataset only contains records from mid-Atlantic states. 

Plotting the claims data on a map provides a nice visual of which areas are more prone to insurance claims, which could prove to be a valuable insight.

In [None]:
data['incident_state'].value_counts()

In [None]:
# Attach to each row the number of incidents recorded for its state,
# e.g. NY -> 262, SC -> 248, WV -> 217, VA -> 110, NC -> 110, PA -> 30, OH -> 23
state_counts = data['incident_state'].value_counts()
data['incident_state_count'] = data['incident_state'].map(state_counts)

In [None]:
#!pip install chart_studio 

In [None]:
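import os

import plotly.graph_objs as go
import chart_studio.plotly as cplt

# Read the Chart Studio credentials from the environment rather than hard-coding them.
# The variable names are placeholders; set them to your own account's values.
cplt.sign_in(username=os.environ['CHART_STUDIO_USERNAME'], api_key=os.environ['CHART_STUDIO_API_KEY'])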

plotdata = [go.Choropleth(autocolorscale = True, locations = data['incident_state'],
                      z = data['incident_state_count'],
                      locationmode = 'USA-states',
                      marker = go.choropleth.Marker(line = go.choropleth.marker.Line(color = 'rgb(255,255,255)', width = 2)),
                      colorbar = go.choropleth.ColorBar(title = "Number of Incidents"))]
layout = go.Layout(
    title = go.layout.Title(
        text = 'Insurance Incident Claims by State'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)'),
)

fig = go.Figure(data = plotdata, layout = layout)
cplt.iplot(fig, filename = 'd3-cloropleth-map')

In [None]:
del data['incident_state_count']

Plotting the total claim amount by `auto_year` shows no clear pattern in claims, suggesting that this feature likely has little influence.

In [None]:
import plotly.express as px
fig = px.bar(data, x='auto_year', y='total_claim_amount')
fig.show()
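
A numeric summary can back up the visual impression. The sketch below aggregates claim amounts per year on a made-up frame (the values are illustrative, not taken from the dataset). If the per-year means are roughly flat, that supports the reading of the chart.

```python
import pandas as pd

# Toy data; only the column names mirror the dataset.
toy = pd.DataFrame({
    "auto_year": [2010, 2010, 2011, 2012, 2012],
    "total_claim_amount": [5000, 7000, 6000, 4000, 8000],
})

# Total, average, and number of claims for each year.
per_year = toy.groupby("auto_year")["total_claim_amount"].agg(["sum", "mean", "count"])
print(per_year)
```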

Now the data is ready to split into `X` and `Y`, representing features and target, with categorical variables encoded into numeric ones.

Note that the number of columns goes up, since each categorical column is expanded into one column for each of its distinct values.

In [None]:
# All columns except the last are features; the last column (fraud_reported) is the target
X = data.iloc[:, :-1]
X = pd.get_dummies(X)  # one-hot encode the categorical columns
Y = data.iloc[:, -1].map({"Y": 1, "N": 0})  # encode the target as 1 (fraud) / 0 (no fraud)
print(X.shape)
print(Y.shape)
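
To make the column expansion concrete, here is a minimal sketch of the same encoding on a three-row toy frame. The values are made up for illustration; only the pattern mirrors the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "incident_severity": ["Minor Damage", "Total Loss", "Minor Damage"],
    "witnesses": [0, 2, 1],
    "fraud_reported": ["N", "Y", "N"],
})

# One categorical column with two distinct values becomes two indicator columns;
# the numeric column passes through unchanged.
features = pd.get_dummies(toy.drop(columns="fraud_reported"))
target = toy["fraud_reported"].map({"Y": 1, "N": 0})

print(features.columns.tolist())
print(target.tolist())
```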

## LogisticRegression - Random Cross-validation

The problem at hand is now a binary classification, where we have to assign observations to one of two classes, `1` or `0` (corresponding to whether a particular claim is identified as fraudulent or not).

This can be achieved using different algorithms, such as LogisticRegression, RandomForest, DecisionTree, SVM, etc., with various hyperparameters, such as the penalty function, the number of estimators, the maximum tree depth, etc.

Before doing a full grid search, our first attempt is to run a Logistic Regression model through a randomized search with cross-validation, using Scikit-learn, which is computationally inexpensive.

The search uses a LogisticRegression learner with the `liblinear` solver, with the hyperparameter space defined over the penalty (L1 and L2 norms) and the inverse regularization strength `C` (100,000 values drawn uniformly at random between 0 and 10).

In [None]:
%%time
from sklearn import linear_model
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

def perform_randomized_search(features, target, model, hyperparams, kFolds):
    # Sample hyperparameter combinations at random and score each with k-fold cross-validation
    randomizedsearch = RandomizedSearchCV(model, hyperparams, cv = kFolds, verbose=1)
    best_model = randomizedsearch.fit(features, target)
    # Note: this score is computed on the same data the search was fitted on
    print("The mean accuracy of the model is:",best_model.score(features, target))
    print("The mean cross-validated accuracy of the best model is:",best_model.best_score_)
    print("The best parameters for the model are:",best_model.best_params_)
    return best_model
    
logistic = linear_model.LogisticRegression(class_weight = 'balanced', solver='liblinear', max_iter=100)
norms = ['l1', 'l2']
C = np.random.uniform(0, 10, 100000)
hyperparameters = dict(C=C, penalty=norms)
model = perform_randomized_search(X, Y, logistic, hyperparameters, 10)
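
`RandomizedSearchCV` also accepts `scipy.stats` distributions, so `C` can be sampled directly from a continuous range instead of from a pre-drawn array. The following is a sketch under assumptions, not the notebook's method: it runs on synthetic data from `make_classification`, and the lower bound of 0.01 is chosen here only to keep `C` strictly positive.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the claims data (roughly 75% / 25%).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.75, 0.25], random_state=0
)

search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", class_weight="balanced"),
    # uniform(loc, scale) samples from the interval [loc, loc + scale].
    param_distributions={"C": uniform(loc=0.01, scale=10), "penalty": ["l1", "l2"]},
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best candidate.
print(search.best_params_, search.best_score_)
```

Reporting `best_score_` rather than the score on the fitting data gives a less optimistic estimate of how the model generalizes.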

## Classification report - Logistic Regression
As a result of this randomized search, we obtain a model that yields an accuracy of close to 88%. Note that this figure is measured on the same data used for fitting, so it is likely optimistic.

However, accuracy in itself is not of much value, especially when dealing with classification on imbalanced classes. In this particular case, as we saw earlier, there is a large imbalance: many more claims in the training data are identified as not fraudulent than are reported as fraudulent.

In reality, the insurance industry would prefer to correctly identify as many fraudulent claims (the under-represented class) as possible, to minimize the loss.
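
A small worked example shows why accuracy alone can mislead. The labels below are made up for illustration (25% fraud). A "model" that labels every claim as not fraudulent reaches 75% accuracy while catching no fraud at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels: 1 = fraudulent, 0 = not fraudulent.
y_true = np.array([0] * 75 + [1] * 25)

# Baseline that always predicts the majority class.
y_always_legit = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_legit)
baseline_recall = recall_score(y_true, y_always_legit)  # share of fraud cases caught

print(baseline_accuracy)  # 0.75
print(baseline_recall)    # 0.0
```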

So we create a classification report to see how our Logistic Regression model performed in predicting both classes.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=30)

logReg = linear_model.LogisticRegression(C=2.2014271313563083, penalty='l1', solver='liblinear')
logReg.fit(X_train, y_train)
y_pred = logReg.predict(X_test)

print(classification_report(y_test, y_pred))

In spite of the relative difficulty of predicting the under-represented class, the Logistic Regression model still predicts fraudulent claims correctly about 62% of the time.
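
The per-class figures in a classification report can be traced back to the confusion matrix. The sketch below uses ten made-up labels (not the notebook's predictions) to show how precision and recall for the fraud class are derived.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels: 1 = fraudulent, 0 = not fraudulent.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the claims flagged as fraud, how many really were
recall = tp / (tp + fn)     # of the real fraud cases, how many were flagged

print(precision, recall)  # 0.6 0.75
```

Which of the two matters more depends on the cost of a missed fraud case relative to the cost of investigating a legitimate claim.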

Even though the preliminary test looks good, and it feels like we could proceed to productionize this Logistic Regression model, it would be prudent to search over the other algorithms available for the task.

## GridSearch with Pipeline

Instead of running different classifiers separately and comparing the outcomes manually, in Scikit-learn we can use the Pipeline feature with a list of classifiers.

We then conduct a Grid search over the algorithms and hyperparameters specified for each.

In [None]:
%%time
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
## We will include our logistic regression models in addition to RandomForestClassifier and DecisionTreeClassifier

models = [
    {
        "classifier": [LogisticRegression(solver='liblinear')],  # liblinear supports both l1 and l2 penalties
        "classifier__penalty": ['l2','l1'], 
        "classifier__C": np.logspace(0, 10, 50)
    },
    {
        "classifier": [LogisticRegression()], 
        "classifier__penalty": ['l2'], 
        "classifier__C": np.logspace(0, 10, 50),
        "classifier__solver":['newton-cg','saga','sag','liblinear']
    },
    {
        "classifier": [RandomForestClassifier()],
        "classifier__n_estimators": [10, 100, 1000],
        "classifier__max_depth":[5,8,15,25,30,None],
        "classifier__min_samples_leaf":[1,2,5,10,15,100],
        "classifier__max_leaf_nodes": [2, 5,10]
    },
    {
        "classifier": [DecisionTreeClassifier()],
        "classifier__splitter":['best', 'random'],
        "classifier__max_depth":[5,8,15,25,30,None],
        "classifier__min_samples_leaf":[1,2,5,10,15,100],
        "classifier__max_leaf_nodes": [2, 5,10]
    },
    {
        "classifier": [DecisionTreeClassifier(class_weight = 'balanced')],
        "classifier__splitter":['best', 'random'],
        "classifier__max_depth":[5,8,15,25,30,None],
        "classifier__min_samples_leaf":[1,2,5,10,15,100],
        "classifier__max_leaf_nodes": [2, 5,10]
    }
]

In [None]:
from sklearn.model_selection import GridSearchCV
def execute_pipeline(features,target, model_list, kFolds):
    # The classifier set here is only a placeholder; each grid entry swaps in its own classifier
    pipe = Pipeline([("classifier", RandomForestClassifier())])
    gridsearch = GridSearchCV(pipe, model_list, cv=kFolds, verbose=1, n_jobs=-1)
    best_model = gridsearch.fit(features, target) # Fit grid search
    # Note: this score is computed on the same data the search was fitted on
    print("The mean accuracy of the model is:",best_model.score(features, target))
    print("The best parameters for the model are:",best_model.best_params_)
    return best_model


model = execute_pipeline(X, Y, models, 10)

GridSearch finds the Decision Tree Classifier to be the best-performing model, although its accuracy of 86% is clearly lower than that of Logistic Regression.

But then again, accuracy alone is not that relevant given the class imbalance that we noticed for this dataset.
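
`GridSearchCV` takes a `scoring` argument, so model selection can target a metric focused on the fraud class instead of accuracy. The sketch below is an illustration under assumptions: it uses synthetic data, a deliberately tiny grid, and F1 as the metric, none of which come from the notebook's actual search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the claims data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.75, 0.25], random_state=0
)

# The initial classifier is a placeholder; each grid entry swaps in its own.
pipe = Pipeline([("classifier", LogisticRegression())])
param_grid = [
    {"classifier": [LogisticRegression(solver="liblinear")],
     "classifier__C": [0.1, 1, 10]},
    {"classifier": [DecisionTreeClassifier(random_state=0)],
     "classifier__max_depth": [3, 5]},
]

# scoring="f1" ranks candidates by the F1 score of the positive class (label 1).
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

best_name = type(search.best_estimator_.named_steps["classifier"]).__name__
print(best_name, search.best_score_)
```

Other built-in scorers such as `"recall"` or `"precision"` can be substituted, depending on which kind of error is more costly.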

## Classification report - DecisionTreeClassifier


We therefore run a classification report using the identified classifier from GridSearch, so as to be able to compare it with the one obtained for Logistic Regression earlier.

In [None]:
# Only the non-default settings found by the grid search are passed; `presort` and
# `min_impurity_split` are omitted because recent scikit-learn releases no longer accept them
decT = DecisionTreeClassifier(criterion='gini', max_depth=5, max_leaf_nodes=5,
            min_samples_leaf=10, splitter='random')

decT.fit(X_train, y_train)
y_pred = decT.predict(X_test)
print(classification_report(y_test, y_pred))

Comparing the classification reports, we find that the Decision Tree classifier yields a much higher precision of 94% for sample class `0` (NOT Fraudulent).

However, the precision for sample class `1` (Fraudulent) is lower at 59% (as opposed to 63% as obtained from Logistic Regression).

Since, as discussed before, correctly identifying fraudulent claims is more important for insurers, it would be safe to proceed with the Logistic Regression classifier.

## Next Steps

With the insight derived thus far, we'll now proceed to create a SageMaker training and deployment pipeline using one of SageMaker's native algorithms, the Linear Learner Algorithm.

The Linear Learner Algorithm can be used for either regression or classification. When used for classification, it is essentially the same as using a LogisticRegression classifier with the `liblinear` solver.