Architect and Build an End-to-End Workflow for Auto Claim Fraud Detection with SageMaker Services

The purpose of this end-to-end example is to demonstrate how to prepare, train, and deploy a model that detects fraudulent auto insurance claims.

Contents

  1. Business Problem
  2. Technical Solution
  3. Solution Components
  4. Solution Architecture
  5. Code Resources
  6. Exploratory Data Science and Operational ML workflows
  7. The ML Life Cycle: Detailed View

Business Problem

"Auto insurance fraud ranges from misrepresenting facts on insurance applications and inflating insurance claims to staging accidents and submitting claim forms for injuries or damage that never occurred, to false reports of stolen vehicles. Fraud accounted for between 15 percent and 17 percent of total claims payments for auto insurance bodily injury in 2012, according to an Insurance Research Council (IRC) study. The study estimated that between $5.6 billion and $7.7 billion was fraudulently added to paid claims for auto insurance bodily injury payments in 2012, compared with a range of $4.3 billion to $5.8 billion in 2002. " source: Insurance Information Institute

In this example, we use the auto insurance domain to detect claims that are possibly fraudulent. More precisely, we address the use case "what is the likelihood that a given auto claim is fraudulent?" and explore the technical solution.

As you review the notebooks and the architectures presented at each stage of the ML life cycle, you will see how you can leverage SageMaker services and features to enhance your effectiveness as a data scientist, as a machine learning engineer, and as an ML Ops Engineer.

We then perform data exploration on the synthetically generated datasets for Customers and Claims.

Next, we provide an overview of the technical solution by examining the Solution Components and the Solution Architecture. We motivate these new ML tasks by examining a detailed view of the machine learning lifecycle, recognizing the separation between exploratory data science and an operational ML workflow.

Car Insurance Claims: Data Sets and Problem Domain

The inputs for building our model and workflow are two tables of insurance data: a claims table and a customers table. This data was synthetically generated and is provided to you in its raw state for pre-processing with SageMaker Data Wrangler. However, completing the SageMaker Data Wrangler step is not required to continue with the rest of this notebook. If you wish, you may use the claims_preprocessed.csv and customers_preprocessed.csv files in the data directory, as they are exact copies of what SageMaker Data Wrangler would output.
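If you skip the SageMaker Data Wrangler step, you can load the preprocessed files directly. A minimal sketch, assuming the files sit in a local `data/` directory as described above:

```python
import pandas as pd

# Load the preprocessed datasets (exact copies of the Data Wrangler output)
claims = pd.read_csv("data/claims_preprocessed.csv")
customers = pd.read_csv("data/customers_preprocessed.csv")

# Quick sanity check on shapes and columns before moving on
print(claims.shape, customers.shape)
print(claims.columns.tolist())
```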

Technical Solution

In this introduction, you will look at the technical architecture and solution components to build a solution for predicting fraudulent insurance claims and deploy it using SageMaker for real-time predictions. While a deployed model is the end-product of this notebook series, the purpose of this guide is to walk you through all the detailed stages of the machine learning (ML) lifecycle and show you what SageMaker services and features are there to support your activities in each stage.

Solution Components

The following SageMaker Services are used in this solution:

  1. SageMaker Data Wrangler - docs
  2. SageMaker Processing - docs
  3. SageMaker Feature Store - docs
  4. SageMaker Clarify - docs
  5. SageMaker Training with the XGBoost Algorithm and Hyperparameter Optimization - docs
  6. SageMaker Model Registry - docs
  7. SageMaker Hosted Endpoints (predictors) - docs
  8. SageMaker Pipelines - docs

[Diagram: Solution Components]

Solution Architecture

The overall architecture is shown in the diagram below.

[Diagram: end-to-end architecture]

We will go through five stages of the ML lifecycle and explore the corresponding SageMaker solution architecture at each stage. Each of the sequential notebooks dives deep into its corresponding ML stage.

Notebook 1: Data Exploration

Notebook 2: Data Preparation, Ingest, Transform, Preprocess, and Store in SageMaker Feature Store

[Diagram: Notebook 2 solution architecture]
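As a preview of the ingestion step in Notebook 2, the sketch below creates a feature group and ingests a DataFrame into SageMaker Feature Store. The feature group name, record identifier, and event-time column here are illustrative assumptions rather than the notebook's exact values:

```python
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hypothetical feature group for the preprocessed claims data
claims_fg = FeatureGroup(name="claims-feature-group", sagemaker_session=session)
claims_fg.load_feature_definitions(data_frame=claims)  # infer feature types from the DataFrame

claims_fg.create(
    s3_uri=f"s3://{session.default_bucket()}/fraud-detection/feature-store",
    record_identifier_name="policy_id",    # assumed identifier column
    event_time_feature_name="event_time",  # assumed event-time column
    role_arn=role,
    enable_online_store=True,  # needed later for real-time inference
)

claims_fg.ingest(data_frame=claims, max_workers=3, wait=True)
```

The `enable_online_store=True` flag writes records to both the offline store (for training) and the online store (for low-latency lookups at inference time).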

Notebook 3 and Notebook 4: Train, Tune, Check Pre- and Post-Training Bias, Mitigate Bias, Re-train, Deposit, and Deploy the Best Model to the SageMaker Model Registry

[Diagram: Notebook 3 and 4 solution architecture]
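To give a flavor of the training and tuning work in Notebooks 3 and 4, here is a hedged sketch of training the built-in XGBoost algorithm with hyperparameter optimization. The S3 paths, container version, and hyperparameter ranges are assumptions for illustration:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Assumed locations of the train/validation CSVs (label in the first column)
train_input = TrainingInput(f"s3://{bucket}/fraud-detection/train.csv", content_type="text/csv")
val_input = TrainingInput(f"s3://{bucket}/fraud-detection/validation.csv", content_type="text/csv")

# Built-in XGBoost container; the version is an assumption
xgb_image = image_uris.retrieve("xgboost", region=session.boto_region_name, version="1.2-1")

xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/fraud-detection/models",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# Illustrative hyperparameter ranges; the notebooks make their own choices
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.5),
        "max_depth": IntegerParameter(2, 10),
    },
    max_jobs=4,
    max_parallel_jobs=2,
)

tuner.fit({"train": train_input, "validation": val_input})
```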

This is the architecture for model deployment.

[Diagram: model deployment architecture]
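Continuing the sketch above, deploying the best model from the tuning job to a real-time endpoint might look like the following; the endpoint name and instance type are illustrative assumptions:

```python
from sagemaker.serializers import CSVSerializer

# Deploy the best model found by the tuner (reuses `tuner` from the training sketch)
predictor = tuner.best_estimator().deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="fraud-detection-endpoint",  # hypothetical name
)
predictor.serializer = CSVSerializer()

# `sample_features` is a hypothetical feature vector matching the training columns
prediction = predictor.predict(sample_features)  # returns the fraud probability
```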

Pipeline Notebook: End-to-End Pipeline - an MLOps pipeline that runs the end-to-end automated workflow, encoding all the design decisions made during the manual/exploratory steps in the previous notebooks.

[Diagram: pipelines solution architecture]
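A deliberately minimal SageMaker Pipelines sketch with a single training step is shown below; the real pipeline in the notebook also includes processing, bias checks, and model registration. It reuses the `xgb` estimator and inputs from the training sketch above, and the pipeline name is hypothetical:

```python
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# One training step; the full pipeline chains many more steps together
train_step = TrainingStep(
    name="TrainXGBoost",
    estimator=xgb,
    inputs={"train": train_input, "validation": val_input},
)

pipeline = Pipeline(name="FraudDetectionDemoPipeline", steps=[train_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # kick off an automated run
```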

Code Resources

Stages

Our solution is split into the following stages of the ML lifecycle, and each stage has its own notebook:

  1. Notebook 1: Data Exploration
  2. Notebook 2: Data Preparation, Ingest, Transform, Preprocess, and Store in SageMaker Feature Store
  3. Notebook 3 and Notebook 4: Train, Tune, Check Bias, Mitigate Bias, Re-train, Deposit, and Deploy the Best Model
  4. Pipeline Notebook: End-to-End Pipeline

The Exploratory Data Science and ML Ops Workflows

[Diagram: Exploratory Data Science and Scalable MLOps]

Note that there are typically two workflows: a manual exploratory workflow and an automated workflow.

The exploratory, manual data science workflow is where experiments are conducted and various techniques and strategies are tested.

After you have established your data preparation, transformations, featurizations, and training algorithms, and tested various hyperparameters for model tuning, you can move to the automated workflow, where you rely on the MLOps or ML engineering part of your team to streamline the process and make it more repeatable and scalable by putting it into an automated pipeline.

[Diagram: the two workflows]

The ML Life Cycle: Detailed View

[Diagram: the ML life cycle, detailed view]

The red boxes and icons represent the comparatively newer concepts and tasks that are now deemed important to include and execute in a production-oriented (versus research-oriented), scalable ML lifecycle.

These newer lifecycle tasks, and the AWS services and features that support them, include:

  1. Data Wrangling: SageMaker Data Wrangler for cleaning, normalizing, transforming, and encoding data, as well as joining datasets. The output of Data Wrangler is generated code that works with SageMaker Processing, SageMaker Pipelines, and SageMaker Feature Store, or just a plain Python script with pandas.
    1. Feature engineering has always been done, but now with SageMaker Data Wrangler we can use a GUI-based tool to do it and generate code for the next phases of the lifecycle.
  2. Detect Bias: Using SageMaker Clarify, in data prep or in training we can detect pre-training and post-training bias, and eventually, at inference time, provide interpretability/explainability of the inferences (e.g., which factors were most influential in producing the prediction); a bias-detection sketch follows this list.
  3. Feature Store (Offline): Once we have done all of our feature engineering, encoding, and transformations, we can standardize the features offline in SageMaker Feature Store, to be used as input features for training models.
  4. Artifact Lineage: Using SageMaker's artifact lineage features, we can associate all of the artifacts (data, models, parameters, etc.) with a trained model to produce metadata that can be stored in a model registry.
  5. Model Registry: The SageMaker Model Registry stores the metadata around all the artifacts you have chosen to include in the process of creating your models, along with the models themselves. Later, a human approval step can be used to note that a model is fit to be put into production. This feeds into the next phase: deploy and monitor.
  6. Inference and the Online Feature Store: For real-time inference, we can leverage the online SageMaker Feature Store we have created to achieve single-digit-millisecond latency and high throughput when serving our model with new incoming data; a lookup sketch follows this list.
  7. Pipelines: Once we have experimented and decided on the various options in the lifecycle (which transforms to apply to our features, how to handle imbalance or bias in the data, which algorithms to train with, which hyperparameters give us the best performance metrics, etc.), we can automate the various tasks across the lifecycle using SageMaker Pipelines.
    1. In this notebook, we will show a pipeline that starts with the outputs of SageMaker Data Wrangler and ends with storing trained models in the Model Registry.
    2. Typically, you could have one pipeline for data prep, one for training through to the model registry (which we show in the code associated with this blog), one for inference, and one for re-training that uses SageMaker Model Monitor to detect model drift and data drift and then triggers re-training via an AWS Lambda function.
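As referenced in item 2 above, here is a minimal sketch of a pre-training bias check with SageMaker Clarify. The label and facet (sensitive attribute) column names are illustrative assumptions, not the exact columns the notebooks use:

```python
from sagemaker import clarify

# Reuses `session`, `role`, `bucket`, and the `claims` DataFrame from earlier sketches
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# DataConfig points Clarify at the training CSV in S3; "fraud" is the assumed label
bias_data_config = clarify.DataConfig(
    s3_data_input_path=f"s3://{bucket}/fraud-detection/train.csv",  # assumed path
    s3_output_path=f"s3://{bucket}/fraud-detection/clarify-bias",
    label="fraud",
    headers=claims.columns.tolist(),
    dataset_type="text/csv",
)

# BiasConfig names the facet to examine; the column name is hypothetical
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[0],        # favorable outcome: not fraudulent
    facet_name="customer_gender_female",  # hypothetical facet column
    facet_values_or_threshold=[1],
)

clarify_processor.run_pre_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
    methods="all",  # compute all pre-training bias metrics (CI, DPL, KL, ...)
)
```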
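And for item 6, a sketch of the low-latency lookup against the online feature store at inference time, assuming the hypothetical feature group and record identifier from the ingestion sketch earlier:

```python
import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# Fetch the latest feature values for a single claim from the online store
response = featurestore_runtime.get_record(
    FeatureGroupName="claims-feature-group",       # assumed feature group name
    RecordIdentifierValueAsString="policy-12345",  # assumed record identifier
)
features = {f["FeatureName"]: f["ValueAsString"] for f in response["Record"]}
```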