# Amazon SageMaker Solution: Explaining Credit Decisions

As machine learning models grow more complex, the need for model
explainability has grown with them. Some governments have also
introduced stricter regulations that mandate a *right to explanation*
from machine learning models. In this solution, we look at how
[Amazon SageMaker](https://aws.amazon.com/sagemaker/) can be used to
explain individual predictions from machine learning models and to
shine a light on their global behavior.

## Credit Default Classification
As an example use case, we classify credit applications and predict
whether or not the credit will be paid back. Given a credit application
from a bank customer, the aim of the bank is to predict whether or not
the customer will pay back the credit in accordance with their repayment
plan. When a customer can't pay back their credit, often called a
*credit default*, the bank loses money and the customer's credit score is
impacted. On the other hand, denying credit to trustworthy customers
also has negative impacts. Using accurate machine learning models to
classify the risk of a credit application can help find a good balance
between these two scenarios, but this provides no comfort to those
customers who have been denied credit.

Using explainability methods, it's possible to determine actionable
factors that had a negative impact on the application. Customers can then
take action to increase their chance of obtaining credit in subsequent
applications. Companies can also use explanations to identify risk
factors.

We train a tree-based
[LightGBM](https://lightgbm.readthedocs.io/en/latest/) model using
[Amazon SageMaker](https://aws.amazon.com/sagemaker/) and explain its
predictions using a game theoretic approach called
[SHAP](https://github.com/slundberg/shap) (SHapley Additive
exPlanations). We deploy an endpoint that returns the credit default risk
score, alongside an explanation, in real time. We also show how
explanations can be computed in batch mode.

## What is SHAP?
SHAP is the method used for calculating explanations in this solution.
Unlike other feature attribution methods, such as single feature
permutation, SHAP tries to disentangle the effect of a single feature by
looking at all possible combinations of features.

[SHAP](https://github.com/slundberg/shap) (Lundberg et al. 2017) stands
for SHapley Additive exPlanations. 'Shapley' relates to a game theoretic
concept called [Shapley
values](https://en.wikipedia.org/wiki/Shapley_value) that is used to
create the explanations. A Shapley value describes the marginal
contribution of each 'player' when considering all possible 'coalitions'.
Using this in a machine learning context, a Shapley value describes the
marginal contribution of each feature when considering all possible sets
of features. 'Additive' relates to the fact that these Shapley values can
be summed together, along with a baseline value, to give the final model
prediction.
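
The definition above can be written compactly. The notation here is our own and is not taken from the solution's code: $F$ is the full set of features, $S$ is a coalition that excludes feature $i$, $v(S)$ is the model's expected output when only the features in $S$ are known, and $\phi_0 = v(\emptyset)$ is the baseline.

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ v(S \cup \{i\}) - v(S) \right]
$$

The additive property then says that the prediction for the full set of features is the baseline plus the sum of the Shapley values:

$$
v(F) = \phi_0 + \sum_{i \in F} \phi_i
$$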

As an example, we might start off with a baseline credit default risk of
10%. Given a set of features, we can calculate the Shapley value for each
feature. Summing together all the Shapley values, we might obtain a
cumulative value of +30%. Given the same set of features, we therefore
expect our model to return a credit default risk of 40% (i.e. 10% + 30%).
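
The arithmetic above can be reproduced with a small, self-contained sketch. It assumes a hypothetical three-feature toy risk function (`toy_risk`, with made-up feature names and effect sizes) and computes exact Shapley values by enumerating every coalition. This brute-force approach is for illustration only; it is not how the SHAP library computes values for tree models such as LightGBM.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values, computed by enumerating every coalition."""
    n = len(features)
    phi = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Weight shared by all coalitions of this size:
            # |S|! * (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of `feature` when added to coalition s.
                total += weight * (value_fn(s | {feature}) - value_fn(s))
        phi[feature] = total
    return phi


def toy_risk(coalition):
    """Hypothetical default risk given the set of features that are 'present'."""
    risk = 0.10  # baseline risk with no features known
    if "missed_payments" in coalition:
        risk += 0.20
    if "high_loan_amount" in coalition:
        risk += 0.05
    if {"missed_payments", "high_loan_amount"} <= coalition:
        risk += 0.10  # interaction between the two risk factors
    if "long_employment" in coalition:
        risk -= 0.05
    return risk


features = ["missed_payments", "high_loan_amount", "long_employment"]
phi = shapley_values(toy_risk, features)

baseline = toy_risk(frozenset())
prediction = toy_risk(frozenset(features))

for name, value in phi.items():
    print(f"{name}: {value:+.2f}")
print(f"baseline + sum of Shapley values = {baseline + sum(phi.values()):.2f}")
print(f"model output for all features    = {prediction:.2f}")
```

In this toy setup the Shapley values sum to +30%, so the baseline of 10% plus the contributions matches the model output of 40%. The interaction effect is split equally between the two features involved.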

## Architecture

As part of the solution, the following services are used:

* [AWS Lambda](https://aws.amazon.com/lambda/): Used to generate a synthetic credits dataset and upload it to Amazon S3.
* [AWS Glue](https://aws.amazon.com/glue/): Used to crawl datasets, and transform the credits dataset using Apache Spark.
* [Amazon S3](https://aws.amazon.com/s3/): Used to store datasets and the outputs of the AWS Glue Job.
* [Amazon SageMaker Notebook](https://aws.amazon.com/sagemaker/): Used to train the LightGBM model.
* [Amazon ECR](https://aws.amazon.com/ecr/): Used to store the custom Scikit-learn + LightGBM training environment.
* [Amazon SageMaker Endpoint](https://aws.amazon.com/sagemaker/): Used to deploy the trained model and SHAP explainer.
* [Amazon SageMaker Batch Transform](https://aws.amazon.com/sagemaker/): Used to compute explanations in batch.

<p align="center">
  <img src="https://github.com/awslabs/sagemaker-explaining-credit-decisions/raw/master/docs/architecture_diagrams/complete.png" width="1000px">
</p>

## Stages

Our solution is split into the following stages, and each stage has its own notebook:

* [Introduction](./0_introduction.ipynb): We take a high-level look at the solution components.
* [Datasets](./1_datasets.ipynb): We prepare a dataset for machine learning using AWS Glue.
* [Training](./2_training.ipynb): We train a LightGBM model using Amazon SageMaker, so we have an example trained model to explain.
* [Endpoint](./3_endpoint.ipynb): We deploy the model explainer to an HTTP endpoint using Amazon SageMaker and visualize the explanations.
* [Batch Transform](./4_batch_transform.ipynb): We use Amazon SageMaker Batch Transform to obtain explanations for our complete dataset.
* [Dashboard](./5_dashboard.ipynb): We develop a dashboard for explanations using Amazon SageMaker and Streamlit.
* [Conclusion](./6_conclusion.ipynb): We wrap things up and discuss how to clean up the solution.

## Next Stage

Up next, we'll take a look at preparing datasets for machine learning using AWS Glue.

[Click here to continue.](./1_datasets.ipynb)