# Datasets

In this notebook, we'll take a look at preparing datasets for machine
learning using AWS Glue and create schemas to enforce validity of the
data in later stages.

**Note**: When running this notebook on SageMaker Studio, you should make
sure the 'SageMaker JumpStart Data Science 1.0' image/kernel is used. You
can run all cells or step through them one at a time.

<p align="center">
  <img src="https://github.com/awslabs/sagemaker-explaining-credit-decisions/raw/master/docs/architecture_diagrams/stage_1.png" width="1000px">
</p>

When creating the AWS CloudFormation stack, a collection of synthetic
datasets was generated and stored in our solution's Amazon S3 bucket with
a prefix of `dataset`. Most of the features contained in these datasets
are based on the [German Credit
Dataset](http://archive.ics.uci.edu/ml/datasets/statlog%2B%28german%2Bcredit%2Bdata%29)
(UCI Machine Learning Repository), but there are some synthetic data
fields too. All personal information was generated using
[`Faker`](https://faker.readthedocs.io/en/master/). We have 3 datasets in
total: credits, people and contacts.

### Dataset #1: Credits

Our credits dataset contains features directly related to the credit
application.

It is a CSV file (i.e. Comma Separated Values file) that has a header row
with feature names. Of particular note is the feature called `default`.
It is our target variable that we're trying to predict with our LightGBM
model. We show the first two rows of the dataset below:

```
"credit_id","person_id","amount","duration","purpose","installment_rate","guarantor","coapplicant","default"
"51829372","f032303d",1169,6,"electronics",4,0,0,False
```
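
As a rough illustration of how a file like this could be read outside of AWS Glue, the sketch below parses the sample rows with Python's built-in `csv` module. The column types are our own assumption, inferred by eye from the sample values; the solution itself relies on the AWS Glue crawler to infer them.

```python
import csv
import io

# The header row and first record shown above.
sample = '''"credit_id","person_id","amount","duration","purpose","installment_rate","guarantor","coapplicant","default"
"51829372","f032303d",1169,6,"electronics",4,0,0,False
'''

# Assumed integer columns (an illustration, not the crawler's inferred schema).
INT_COLUMNS = ("amount", "duration", "installment_rate", "guarantor", "coapplicant")


def parse_credits(text):
    """Parse credits CSV text into a list of typed dictionaries."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for column in INT_COLUMNS:
            row[column] = int(row[column])
        # The target variable is stored as the text "True" or "False".
        row["default"] = row["default"] == "True"
        records.append(row)
    return records


credit_records = parse_credits(sample)
print(credit_records[0]["purpose"], credit_records[0]["amount"], credit_records[0]["default"])
```

The identifiers (`credit_id`, `person_id`) are kept as strings, since they are only used for joining and not as numeric features.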

### Dataset #2: People

Our people dataset contains features related to the people making the
credit applications (i.e. the applicants).

It's a [JSON Lines](http://jsonlines.org/) file, where each row is a
separate JSON object. Of particular note is the feature called
`person_id`. You'll notice that this feature was also included in the
credits dataset. It is used to connect the credit application with the
applicant. We show the first row of the dataset below:

```
{
    "person_id": "f032303d",
    "finance": {
        "accounts": {
            "checking": {
                "balance": "negative"
            }
        },
        "repayment_history": "very_poor",
        "credits": {
            "this_bank": 2,
            "other_banks": 0,
            "other_stores": 0
        },
        "other_assets": "real_estate"
    },
    "personal": {
        "age": 67,
        "gender": "male",
        "relationship_status": "single",
        "name": "Peter Jones"
    },
    "dependents": [
        {
            "gender": "male",
            "name": "Michael Morales"
        }
    ],
    "employment": {
        "type": "professional",
        "title": "Learning disability nurse",
        "duration": 11,
        "permit": "foreign"
    },
    "residence": {
        "type": "own",
        "duration": 4
    }
}
```
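
The record above is pretty-printed for readability; in a JSON Lines file each object sits on a single line. A minimal sketch of reading such a file with the standard library follows. The first record is abbreviated from the sample above, while the second record and its `person_id` are made up for illustration.

```python
import json

# Two abbreviated records in JSON Lines form: one JSON object per line.
# The second record is hypothetical.
sample = (
    '{"person_id": "f032303d", "personal": {"age": 67}, "dependents": [{"gender": "male"}]}\n'
    '{"person_id": "00000000", "personal": {"age": 30}, "dependents": []}\n'
)


def read_json_lines(text):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


people = read_json_lines(sample)

# Index the applicants by person_id so credit applications can look them up.
people_by_id = {person["person_id"]: person for person in people}
print(people_by_id["f032303d"]["personal"]["age"])
```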

### Dataset #3: Contacts

Our contacts dataset contains contact information for the applicants.

It is a CSV file that has a header row with feature names. Once again we
have `person_id`. We show the first two rows of the dataset below:

```
"contact_id","person_id","type","value"
"5996e20a","f032303d","telephone","(716)406-9514x345"
```

## AWS Glue

One of the most time-consuming tasks in developing a machine learning
workflow is data preparation. AWS Glue can be used to simplify this
process. As a demonstration of how it can be used to infer data schemas
and perform extract, transform and load (ETL) jobs in Spark, we'll
prepare a dataset using AWS Glue. Although our sample datasets are small,
there are many real world scenarios that will benefit from the
scalability of AWS Glue.

When creating the AWS CloudFormation stack, a number of AWS Glue resources
were created:

* A
  [Database](https://docs.aws.amazon.com/glue/latest/dg/define-database.html)
  is used to organize the solution's tables.
* A [Crawler](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html) is
  used to infer the formats and schemas of the datasets above.
* A [Custom
  Classifier](https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html)
  is used to help the crawler infer the schema of the contacts dataset.
    * All fields are of type 'string', so we need to indicate that the first
      row is a header row rather than data.
* A [Job](https://docs.aws.amazon.com/glue/latest/dg/author-job.html) is used
  to join the datasets together, drop certain features, create other features,
  and split the data into train and test sets.
* A
  [Workflow](https://docs.aws.amazon.com/glue/latest/dg/orchestrate-using-workflows.html)
  (and associated
  [triggers](https://docs.aws.amazon.com/glue/latest/dg/trigger-job.html)) is
  used to orchestrate the above crawler and job.

You can explore the service console for AWS Glue for more details.
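
The job itself is a PySpark script that is not reproduced in this notebook. Purely to illustrate the kind of steps it performs (join, drop, create, split), here is a plain-Python sketch on tiny stand-in data. The derivations of `personal__num_dependents` and `contact__has_telephone`, the 80/20 split, and the function names are all assumptions for illustration, not the job's actual logic.

```python
import random

# Tiny stand-ins for the three datasets (values abbreviated from the samples above).
credit_rows = [{"credit_id": "51829372", "person_id": "f032303d", "amount": 1169, "default": False}]
people_rows = [{"person_id": "f032303d", "personal": {"name": "Peter Jones", "age": 67},
                "dependents": [{"gender": "male", "name": "Michael Morales"}]}]
contact_rows = [{"contact_id": "5996e20a", "person_id": "f032303d", "type": "telephone"}]


def build_records(credit_rows, people_rows, contact_rows):
    """Join the datasets on person_id and derive a few features."""
    people_by_id = {p["person_id"]: p for p in people_rows}
    phone_owners = {c["person_id"] for c in contact_rows if c["type"] == "telephone"}
    records = []
    for credit in credit_rows:
        person = people_by_id[credit["person_id"]]
        # Identifiers and names are not copied over, which drops them.
        records.append({
            "credit__amount": credit["amount"],
            # Created features (assumed derivations).
            "personal__num_dependents": len(person["dependents"]),
            "contact__has_telephone": credit["person_id"] in phone_owners,
            "credit__default": credit["default"],
        })
    return records


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = build_records(credit_rows, people_rows, contact_rows)
train, test = split_train_test(records)
print(records[0])
```

On real data volumes the same steps would be expressed as Spark joins and column operations, which is where the scalability of AWS Glue comes in.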

We then import a variety of packages that will be used throughout
the notebook. One of the most important packages used throughout this
solution is the Amazon SageMaker Python SDK (i.e. `import sagemaker`). We
also import modules from our own custom package that can be found at
`../package`.

In [2]:
import boto3
from pathlib import Path
import sagemaker
import sys

sys.path.insert(0, '../package')
from package import config, utils
from package.data import glue

We'll now start the AWS Glue workflow which will first crawl the datasets
stored in Amazon S3 and then execute a job for our data ETL (i.e.
Extract, Transform & Load).

In [3]:
glue_run_id = glue.start_workflow(config.GLUE_WORKFLOW)

Our workflow takes around 10 minutes to complete. Most of this time is spent
on resource provisioning, but there is a [preview
feature](https://pages.awscloud.com/glue-reduced-spark-times-preview-2020.html)
for reduced start times. We'll wait until the AWS Glue workflow has completed
before continuing. We need the dataset before training our model in Amazon
SageMaker.

In [4]:
glue.wait_for_workflow_finished(config.GLUE_WORKFLOW, glue_run_id)

.................................................
AWS Glue Workflow has finished successfully.
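
The internals of `glue.wait_for_workflow_finished` are not shown here. One plausible way to implement such a wait is a polling loop around the boto3 Glue client's `get_workflow_run` call, sketched below. The function name, polling interval and error handling are assumptions, and `FakeGlueClient` is a hypothetical stand-in so the sketch runs without AWS access; in practice you would pass `boto3.client("glue")` instead.

```python
import time


def wait_for_workflow(glue_client, workflow_name, run_id, poll_seconds=15):
    """Poll a workflow run until it reaches a final state, then check the outcome."""
    while True:
        run = glue_client.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
        # RUNNING and STOPPING are transitional; anything else is final.
        if run["Status"] not in ("RUNNING", "STOPPING"):
            break
        print(".", end="", flush=True)
        time.sleep(poll_seconds)
    # A run can complete even though some of its actions failed.
    failed = run.get("Statistics", {}).get("FailedActions", 0)
    if run["Status"] != "COMPLETED" or failed:
        raise RuntimeError(
            f"Workflow run ended as {run['Status']} with {failed} failed actions"
        )
    return run


class FakeGlueClient:
    """Hypothetical stand-in that replays a fixed sequence of run statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_workflow_run(self, Name, RunId):
        status = next(self._statuses)
        return {"Run": {"Status": status, "Statistics": {"FailedActions": 0}}}


final_run = wait_for_workflow(
    FakeGlueClient(["RUNNING", "RUNNING", "COMPLETED"]),
    "example-workflow",
    "wr_example",
    poll_seconds=0,
)
print("\nAWS Glue Workflow has finished successfully.")
```

Checking `FailedActions` in addition to the run status guards against a workflow that finishes while one of its crawler or job actions has failed.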


With our AWS Glue workflow complete, we should now have 4 additional
datasets in our solution's Amazon S3 bucket: `data_train`, `label_train`,
`data_test` and `label_test`. We show an example first record of
`data_train` below:

```
{
	"credit__coapplicant": 0,
	"credit__guarantor": 0,
	"finance__credits__other_stores": 0,
	"employment__permit": "foreign",
	"employment__type": "labourer",
	"finance__accounts__checking__balance": "no_account",
	"finance__credits__this_bank": 2,
	"finance__credits__other_banks": 1,
	"personal__num_dependents": 1,
	"finance__accounts__savings__balance": "very_high",
	"residence__duration": 2,
	"credit__amount": 250,
	"contact__has_telephone": false,
	"employment__duration": 4,
	"credit__duration": 6,
	"finance__repayment_history": "very_poor",
	"finance__other_assets": "real_estate",
	"credit__purpose": "new_car",
	"residence__type": "own",
	"credit__installment_rate": 2
}
```

We now have 20 features that describe a credit application and its
applicant. Since we are using JSON-formatted records, we still have the
feature names, but additional information such as the feature types can
be retrieved from the schema stored in our AWS Glue catalog. Since we're
interested in explaining the model predictions, and our explanations will
assign contributions to features, it's useful if our feature names are
understandable.

**Advanced**: We can also organize features in a hierarchy (using a separator
in the feature names), which enables summarization of the explanations. As an
example, `employment__type` and `employment__duration` are both `employment`
related features. We use two consecutive underscores (`__`) as our level
separator.
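
To make the naming convention concrete, the sketch below flattens a nested record into `__`-separated feature names and then sums per-feature values by their top-level group. The helper names and the contribution values are made up for illustration; they are not taken from the solution's package.

```python
def flatten(record, separator="__", prefix=""):
    """Flatten nested dictionaries into a single level of separator-joined keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, separator, name))
        else:
            flat[name] = value
    return flat


def group_by_top_level(contributions, separator="__"):
    """Sum per-feature values by the first level of the feature name."""
    totals = {}
    for name, value in contributions.items():
        group = name.split(separator, 1)[0]
        totals[group] = totals.get(group, 0.0) + value
    return totals


nested = {"employment": {"type": "professional", "duration": 11}, "residence": {"type": "own"}}
print(flatten(nested))

# Made-up contribution values, purely to show the summarization.
contributions = {"employment__type": 0.25, "employment__duration": 0.5, "residence__type": -0.25}
print(group_by_top_level(contributions))
```

Using a double underscore keeps the separator distinct from the single underscores that already appear inside names such as `repayment_history`.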

## Schema

Schemas can be used to keep track of feature names, descriptions and
types. Our solution uses
[`jsonschema`](https://python-jsonschema.readthedocs.io/en/stable/) as
the primary schema format. We have the added bonus of being able to use
schemas to validate input to the trained model and deployed endpoints,
and consistently map unordered dictionaries to ordered lists (as required
by the model). We don't need to create schemas to use Amazon SageMaker,
but they help in this solution.
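
The solution handles validation and ordering through `jsonschema` and its own schema classes, which are not shown here. As a dependency-free sketch of the idea only, the example below uses a hypothetical, much-reduced array schema to check a feature dictionary and map it to a list in schema order. The schema layout, the `to_ordered_list` helper and the hand-rolled type check are all assumptions.

```python
# A hypothetical, much-reduced schema: an array whose items are listed in model order.
schema = {
    "title": "Credit Application",
    "type": "array",
    "items": [
        {"title": "credit__amount", "type": "integer"},
        {"title": "credit__purpose", "type": "string"},
        {"title": "contact__has_telephone", "type": "boolean"},
    ],
}


def matches_type(value, type_name):
    """Check a value against a small subset of JSON Schema type names."""
    if type_name == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, {"string": str, "boolean": bool}[type_name])


def to_ordered_list(record, schema):
    """Map a feature dictionary to a list that follows the schema's item order."""
    expected = [item["title"] for item in schema["items"]]
    if set(record) != set(expected):
        mismatched = sorted(set(record) ^ set(expected))
        raise ValueError(f"Unexpected or missing features: {mismatched}")
    values = []
    for item in schema["items"]:
        value = record[item["title"]]
        if not matches_type(value, item["type"]):
            raise TypeError(f"{item['title']} should be of type {item['type']}")
        values.append(value)
    return values


record = {"contact__has_telephone": False, "credit__purpose": "new_car", "credit__amount": 250}
print(to_ordered_list(record, schema))
```

The dictionary's key order does not matter; the output order is driven entirely by the schema, which is what lets the model receive features in a consistent position.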

We already have most of this schema information in our AWS Glue catalog
(it was added when data was exported by the AWS Glue Job), so let's start
by retrieving the table schema for `data_train`.

In [5]:
data_schema = glue.get_table_schema(
    database_name=config.GLUE_DATABASE, table_name="data_train"
)

We can now add additional information, such as feature descriptions, which
will be shown inside the tooltips on the visuals later on.

In [6]:
# flake8: noqa: E501
data_schema.title = "Credit Application"
data_schema.description = "An array of items used to describe a credit application."
item_descriptions_dict = {
    "contact__has_telephone": "Customer has a registered telephone number.",
    "credit__amount": "Amount of money requested as part of credit application (in EUR).",
    "credit__coapplicant": "Co-applicant on credit application.",
    "credit__duration": "Amount of time the credit is requested for (in months).",
    "credit__guarantor": "Guarantor on credit application.",
    "credit__installment_rate": "Credit installment rate (as a percentage of the customer's disposable income).",
    "credit__purpose": "Customer's reason for requiring credit.",
    "employment__duration": "Amount of time the customer has been employed at their current employer (in years).",
    "employment__permit": "Customer's current work permit type.",
    "employment__type": "Customer's current job classification.",
    "finance__accounts__checking__balance": "Customer's checking account balance.",
    "finance__accounts__savings__balance": "Customer's savings account balance.",
    "finance__credits__other_banks": "Count of credits the customer has at other banks.",
    "finance__credits__other_stores": "Count of credits the customer has at other stores.",
    "finance__credits__this_bank": "Count of credits the customer has at this bank.",
    "finance__other_assets": "Customer's most significant asset.",
    "finance__repayment_history": "Quality of the customer's repayment history.",
    "personal__num_dependents": "Count of the customer's dependents.",
    "residence__duration": "Amount of time the customer has been at their current residence (in years).",
    "residence__type": "Class of the customer's residence."
}
data_schema.item_descriptions_dict = item_descriptions_dict

We do the same for `label_train` too.

In [7]:
label_schema = glue.get_table_schema(
    database_name=config.GLUE_DATABASE, table_name="label_train"
)
label_schema.title = "Credit Application Outcome"
item_descriptions_dict = {
    "credit__default": (
        "0 if the customer successfully made credit payments, "
        "1 if the customer defaulted on credit payments.")
}
label_schema.item_descriptions_dict = item_descriptions_dict

Since the schemas for train and test datasets are the same, we can skip
`data_test` and `label_test`.

We can save our updated schemas to disk, in preparation for uploading to
Amazon S3. You can check the `schema_folder` afterwards, and examine the
`data.schema.json` and `label.schema.json` files for data and label
schemas respectively.

In [8]:
current_folder = utils.get_current_folder(globals())
schema_folder = Path(current_folder, "../schemas").resolve()
data_schema_filepath = Path(schema_folder, "data.schema.json")
data_schema.save(data_schema_filepath)
label_schema_filepath = Path(schema_folder, "label.schema.json")
label_schema.save(label_schema_filepath)

Up next, we create a SageMaker Session. A SageMaker Session can be used to
conveniently perform certain AWS actions, such as uploading and downloading
files from Amazon S3. We use the SageMaker Session to upload our schemas to
Amazon S3.

In [9]:
boto_session = boto3.session.Session(region_name=config.AWS_REGION)
sagemaker_session = sagemaker.Session(boto_session)

sagemaker_session.upload_data(
    path=str(schema_folder),
    bucket=config.S3_BUCKET,
    key_prefix=config.SCHEMAS_S3_PREFIX
)

's3://sagemaker-soln-ecd-js-51p2dp-us-east-1-396548483691/schemas'

We now have data and label schemas uploaded to Amazon S3, and they can
be used during model training and model deployment.

## Customization

We have provided an example dataset above, but our solution is
customizable if you have your own datasets. You can choose to use AWS
Glue or perform the processing steps in some other way of your choosing.

When your own data still needs to be prepared for machine learning (e.g.
it still needs to be joined and flattened to create a single table), you
can modify the AWS Glue Workflow that's provided to suit your own
data. As an example, if your data is stored in Amazon RDS (or another
JDBC data store) rather than Amazon S3, you can add an AWS Glue
Connection and configure the AWS Glue Crawler and AWS Glue Job to use it.
You should also modify the AWS Glue Job's script (written in PySpark) to
suit your own data (e.g. change the joins and select features of
interest).

When your own data is already in a suitable format for machine learning
(i.e. can be represented as a single table), you don't necessarily need
to run the AWS Glue Workflow. Just upload the data to Amazon S3 (in the
bucket created as part of the solution). You should, however, convert your
data to the JSON Lines format that is used by the solution. You should
also create JSON schemas for the data and label to take advantage of
the automatic data preprocessing and data validation.
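
If your single table is currently a CSV file, the conversion to JSON Lines can be done with the standard library. The sketch below is one way to do it; the column names, the sample values and the `converters` argument are illustrative assumptions rather than anything required by the solution.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text, converters=None):
    """Convert CSV text with a header row into JSON Lines text."""
    converters = converters or {}
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Apply a per-column converter where given; otherwise keep the text value.
        record = {name: converters.get(name, str)(value) for name, value in row.items()}
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


# Hypothetical single-table data; the column names follow the solution's naming style.
csv_text = "credit__amount,credit__purpose\n250,new_car\n1169,electronics\n"
json_lines = csv_to_json_lines(csv_text, converters={"credit__amount": int})
print(json_lines, end="")
```

Converting numeric columns explicitly matters here, because CSV values are read as text and would otherwise be written as JSON strings.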

## Next Stage

Up next we'll train a LightGBM model using Amazon SageMaker, so we have
an example trained model to explain.

[Click here to continue.](./2_training.ipynb)