The purpose of this end-to-end example is to demonstrate how to prepare, train, and deploy a model that detects fraudulent auto insurance claims.
- Business Problem
- Technical Solution
- Solution Components
- Solution Architecture
- Code Resources
- Exploratory Data Science and Operational ML workflows
- The ML Life Cycle: Detailed View
"Auto insurance fraud ranges from misrepresenting facts on insurance applications and inflating insurance claims to staging accidents and submitting claim forms for injuries or damage that never occurred, to false reports of stolen vehicles. Fraud accounted for between 15 percent and 17 percent of total claims payments for auto insurance bodily injury in 2012, according to an Insurance Research Council (IRC) study. The study estimated that between $5.6 billion and $7.7 billion was fraudulently added to paid claims for auto insurance bodily injury payments in 2012, compared with a range of $4.3 billion to $5.8 billion in 2002. " source: Insurance Information Institute
In this example, we use the auto insurance domain to detect claims that may be fraudulent. More precisely, we address the use case "what is the likelihood that a given auto claim is fraudulent?" and explore the technical solution.
As you review the notebooks and the architectures presented at each stage of the ML lifecycle, you will see how you can leverage SageMaker services and features to enhance your effectiveness as a data scientist, machine learning engineer, or MLOps engineer.
We then perform data exploration on the synthetically generated Customers and Claims datasets. Next, we provide an overview of the technical solution by examining the Solution Components and the Solution Architecture. Motivated by the need to accomplish new tasks in ML, we take a detailed view of the machine learning lifecycle, recognizing the separation between exploratory data science and an operationalized ML workflow.
The inputs for building our model and workflow are two tables of insurance data: a claims table and a customers table. This data was synthetically generated and is provided to you in its raw state for pre-processing with SageMaker Data Wrangler. However, completing the SageMaker Data Wrangler step is not required to continue with the rest of this notebook. If you wish, you may use the `claims_preprocessed.csv` and `customers_preprocessed.csv` files in the `data` directory, as they are exact copies of what SageMaker Data Wrangler would output.
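For example, a minimal sketch of loading those preprocessed copies with pandas (assuming the files sit in a local `data/` directory next to this notebook):

```python
import pandas as pd

# The ./data paths assume the files ship alongside this notebook; swap in
# your own SageMaker Data Wrangler outputs if you ran that step yourself.
claims = pd.read_csv("./data/claims_preprocessed.csv")
customers = pd.read_csv("./data/customers_preprocessed.csv")

print(claims.shape, customers.shape)
```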
In this introduction, you will look at the technical architecture and solution components to build a solution for predicting fraudulent insurance claims and deploy it using SageMaker for real-time predictions. While a deployed model is the end-product of this notebook series, the purpose of this guide is to walk you through all the detailed stages of the machine learning (ML) lifecycle and show you what SageMaker services and features are there to support your activities in each stage.
The following SageMaker Services are used in this solution:
- SageMaker Data Wrangler - docs
- SageMaker Processing - docs
- SageMaker Feature Store - docs
- SageMaker Clarify - docs
- SageMaker Training with the XGBoost Algorithm and Hyperparameter Optimization - docs
- SageMaker Model Registry - docs
- SageMaker Hosted Endpoints - predictors - docs
- SageMaker Pipelines - docs
The overall architecture is shown in the diagram below.
We will go through five stages of the ML lifecycle and explore the solution architecture on SageMaker. Each of the sequential notebooks dives deep into the corresponding ML stage.
Notebook 1: Data Exploration
Notebook 2: Data Preparation, Ingest, Transform, Preprocess, and Store in SageMaker Feature Store
Notebook 3 and Notebook 4: Train, Tune, Check Pre- and Post-Training Bias, Mitigate Bias, Re-train, Deposit the Best Model in the SageMaker Model Registry, and Deploy It
This is the architecture for model deployment.
Pipeline Notebook: End-to-End Pipeline - an MLOps pipeline that runs an end-to-end automated workflow with all the design decisions made during the manual/exploratory steps in the previous notebooks.
Our solution is split into the following stages of the ML Lifecycle, and each stage has its own notebook:
- Notebook 1: Data Exploration: We first explore the data.
- Notebook 2: Data Prep and Store: We prepare a dataset for machine learning using SageMaker Data Wrangler, then create the datasets and deposit them in SageMaker Feature Store.
- Notebook 3: Train, Assess Bias, Establish Lineage, Register Model: We detect possible pre-training and post-training bias, train and tune an XGBoost model using Amazon SageMaker, and record its lineage in the Model Registry so we can later deploy it (a minimal training-and-tuning sketch follows this list).
- Notebook 4: Mitigate Bias, Re-train, Register, Deploy Unbiased Model: We mitigate bias, retrain a less biased model, and store it in the Model Registry. We then deploy the model to an Amazon SageMaker Hosted Endpoint and run real-time inference via the SageMaker Online Feature Store.
- Pipeline Notebook: Create and Run an MLOps Pipeline: We then create a SageMaker Pipeline that ties together everything we have done so far, from the outputs of Data Wrangler through Feature Store, Clarify, and the Model Registry, and finally deployment to a SageMaker Hosted Endpoint.
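As referenced in the Notebook 3 bullet above, here is a minimal sketch of training and tuning the built-in XGBoost algorithm with the SageMaker Python SDK. The bucket, prefixes, and file names are placeholder assumptions, not the notebooks' actual values:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()

# Placeholder S3 locations -- substitute your own train/validation splits.
train_uri = "s3://<your-bucket>/fraud-detect-demo/data/train/train.csv"
validation_uri = "s3://<your-bucket>/fraud-detect-demo/data/validation/validation.csv"

# Built-in XGBoost container, configured for binary classification.
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<your-bucket>/fraud-detect-demo/models",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# Search a small hyperparameter space, optimizing AUC on the validation channel.
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.1, 0.5),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=6,
    max_parallel_jobs=2,
)
tuner.fit(
    {
        "train": TrainingInput(train_uri, content_type="text/csv"),
        "validation": TrainingInput(validation_uri, content_type="text/csv"),
    }
)
```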
Note that there are typically two workflows: a manual exploratory workflow and an automated workflow.
The exploratory, manual data science workflow is where experiments are conducted and various techniques and strategies are tested.
After you have established your data prep, transformations, featurizations, and training algorithms, and have tested various hyperparameters for model tuning, you can move to the automated workflow, where you rely on the MLOps or ML engineering part of your team to streamline the process and make it more repeatable and scalable by putting it into an automated pipeline.
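To make the contrast concrete, here is a minimal, hypothetical SageMaker Pipelines sketch that turns a single training step into a repeatable, parameterized workflow. The pipeline name, parameter, and S3 URI are illustrative; the real Pipeline Notebook chains many more steps (processing, bias checks, registration, deployment):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

session = sagemaker.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()

# A pipeline parameter lets the automated workflow re-run on new data
# without code changes; the default URI is a placeholder.
train_data = ParameterString(
    name="TrainDataUri",
    default_value="s3://<your-bucket>/fraud-detect-demo/data/train/train.csv",
)

xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

train_step = TrainingStep(
    name="TrainFraudModel",
    estimator=xgb,
    inputs={"train": TrainingInput(train_data, content_type="text/csv")},
)

pipeline = Pipeline(
    name="FraudDetectDemoPipeline",  # illustrative name
    parameters=[train_data],
    steps=[train_step],
)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # kick off an automated run
```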
The Red Boxes and Icons represent comparatively newer concepts and tasks that are now deemed important to include and execute, in a production-oriented (versus research-oriented) and scalable ML lifecycle.
These newer lifecycle tasks and their corresponding, supporting AWS Services and features include:
- Data Wrangling: SageMaker Data Wrangler for cleaning, normalizing, transforming, and encoding data, as well as joining datasets. Data Wrangler outputs generated code that works with SageMaker Processing, SageMaker Pipelines, and SageMaker Feature Store, or a plain Python script with pandas.
- Feature Engineering has always been done, but now with SageMaker Data Wrangler we can do it in a GUI-based tool and generate code for the next phases of the lifecycle.
- Detect Bias: Using SageMaker Clarify, during data prep or training we can detect pre-training and post-training bias, and eventually, at inference time, provide interpretability/explainability of the inferences (e.g., which factors were most influential in producing the prediction). A Clarify sketch follows this list.
- Feature Store (Offline): Once we have done all of our feature engineering, encoding, and transformations, we can standardize the features offline in SageMaker Feature Store, to be used as input features for training models. A Feature Store sketch also follows this list.
- Artifact Lineage: Using Amazon SageMaker's Artifact Lineage features, we can associate all the artifacts (data, models, parameters, etc.) with a trained model to produce metadata that can be stored in a Model Registry.
- Model Registry: SageMaker Model Registry stores the metadata around all the artifacts you have chosen to include in the process of creating your models, along with the models themselves. Later, a human approval step can mark a model as ready for production, which feeds into the next phase: deploy and monitor.
- Inference and the Online Feature Store: For real-time inference, we can leverage the online SageMaker Feature Store we have created to serve our model with new incoming data at single-digit-millisecond latency and high throughput (the Feature Store sketch below includes an online read).
- Pipelines: Once we have experimented and decided on the various options in the lifecycle (which transforms to apply to our features, how to handle imbalance or bias in the data, which algorithms to train with, which hyperparameters give us the best performance metrics, etc.), we can automate the various tasks across the lifecycle using SageMaker Pipelines.
- In this notebook series, we show a pipeline that starts with the outputs of SageMaker Data Wrangler and ends with storing trained models in the Model Registry.
- Typically, you could have one pipeline for data prep, one for training up to the Model Registry (which we show in the code associated with this example), one for inference, and one for re-training that uses SageMaker Model Monitor to detect model drift and data drift and triggers re-training via an AWS Lambda function.
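For the bias-detection bullet above, here is a minimal sketch of running a pre-training bias job with the SageMaker Python SDK's Clarify module. The S3 paths, label column (`fraud`), and facet column (`customer_gender_female`) are illustrative assumptions; the training notebooks define the real ones:

```python
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = sagemaker.get_execution_role()

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

bias_data_config = clarify.DataConfig(
    s3_data_input_path="s3://<your-bucket>/fraud-detect-demo/data/train/train.csv",
    s3_output_path="s3://<your-bucket>/fraud-detect-demo/clarify-bias",
    label="fraud",            # assumed target column in the training CSV
    dataset_type="text/csv",
)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[0],        # the favorable label value(s)
    facet_name="customer_gender_female",  # sensitive attribute to audit
    facet_values_or_threshold=[1],
)

# Computes pre-training metrics such as Class Imbalance (CI) and
# Difference in Proportions of Labels (DPL).
clarify_processor.run_pre_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
    methods=["CI", "DPL"],
)
```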
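And for the offline/online Feature Store bullets, a sketch of creating a feature group, ingesting a DataFrame into the offline store, and reading a single record back from the online store. The group name, identifier column (`policy_id`), and S3 URI are again assumptions; Notebook 2 defines the real ones:

```python
import time
import boto3
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

claims = pd.read_csv("./data/claims_preprocessed.csv")
claims["event_time"] = time.time()  # Feature Store requires an event-time column

# Assumes all feature columns are numeric (or already cast to string).
claims_fg = FeatureGroup(name="claims-feature-group", sagemaker_session=session)
claims_fg.load_feature_definitions(data_frame=claims)  # infer feature types

claims_fg.create(
    s3_uri="s3://<your-bucket>/fraud-detect-demo/feature-store",  # offline store
    record_identifier_name="policy_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,  # also materialize the low-latency online store
)
while claims_fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)  # creation is asynchronous; wait before ingesting

claims_fg.ingest(data_frame=claims, max_workers=3, wait=True)

# Real-time reads for inference come from the online store.
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")
record = featurestore_runtime.get_record(
    FeatureGroupName="claims-feature-group",
    RecordIdentifierValueAsString="123",  # a policy_id value
)
print(record["Record"])
```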