# Mortgage Workflow with Deep Learning
**Workflow modified with PyTorch DNN by Chris Green**

## Dataset

The dataset used with this workflow is derived from [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.

To acquire this dataset, please visit the [RAPIDS Datasets Homepage](https://rapidsai.github.io/demos/datasets/mortgage-data).

## Goal
The aim of this notebook is to **combine RAPIDS GPU data processing** with a [PyTorch](https://pytorch.org) **deep neural network** (DNN) to train a 12-month mortgage loan delinquency prediction model, in the same vein as the XGBoost end-to-end example.

**Note: this notebook is here for archival purposes and is not intended to illustrate best practices or to be working code with the latest RAPIDS version. It has been tested to work with RAPIDS 0.7.**

## ETL: What's new?

The ETL pipeline below looks very much like the original E2E example with two notable exceptions:

* Train/Validation/Test Data Split
* Feature Discretization & Hashing

#### Data Split
When training a model we want to ensure its generalizability, i.e. that it performs well on new data. As we train the DNN below, we want to make sure it does not begin to overfit and, once it is trained, that we have a good measure of its expected performance if it were used in the real world. To accomplish this, we break the time series dataset into three non-overlapping fixed time intervals. The earliest is used as training data, the second as validation data to track the performance of the model during training, and the last as our final test set.
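As a rough illustration of this kind of time-based split, here is a minimal sketch in plain Python. The helper name `split_by_month` and the toy records are hypothetical and are not part of the notebook's ETL pipeline, which performs the split on GPU dataframes.

```python
def split_by_month(records, ranges):
    """Assign each record to the split whose inclusive month range contains it.

    records: iterable of dicts with a "timestamp" key holding a (year, month) tuple.
    ranges:  dict mapping split name -> ((start_year, start_month), (end_year, end_month)).
    Records that fall outside every range are dropped.
    """
    splits = {name: [] for name in ranges}
    for rec in records:
        for name, (start, end) in ranges.items():
            # Tuples compare year first, then month, so this is a date comparison.
            if start <= rec["timestamp"] <= end:
                splits[name].append(rec)
                break
    return splits


# Hypothetical toy data: one record per month from 2000-01 through 2002-02.
records = [{"timestamp": (y, m)}
           for y in (2000, 2001, 2002)
           for m in range(1, 13)
           if (y, m) <= (2002, 2)]

ranges = {
    "train": ((2000, 1), (2000, 12)),
    "validation": ((2002, 1), (2002, 1)),
    "test": ((2002, 2), (2002, 2)),
}

splits = split_by_month(records, ranges)
print({name: len(rows) for name, rows in splits.items()})
# {'train': 12, 'validation': 1, 'test': 1}
```

The 2001 records land in no split, so the three intervals do not overlap and a gap separates training from evaluation data.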

#### Feature Discretization & Hashing
The original ETL pipeline results in both discrete and continuous features. To easily employ a DNN, we transform
the continuous features into discrete ones. This is accomplished by computing quantile-based bin edges for each of these features (`Series.quantile`)
and then assigning each value to its bin id (`Series.digitize`).
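The idea can be sketched on the CPU with the standard library alone. This is an illustrative approximation, not the cuDF implementation: `quantile_edges` and `digitize` are hypothetical helpers, the quantile rule is a simple nearest-rank pick rather than interpolation, and the bin-id convention follows `numpy.digitize` with its default arguments.

```python
from bisect import bisect_right


def quantile_edges(values, max_quantiles):
    """Return sorted, de-duplicated bin edges at quantiles 0, 1/q, ..., (q-1)/q.

    Uses a lower nearest-rank rule instead of interpolation, so the edges may
    differ slightly from those returned by a library quantile routine.
    """
    ordered = sorted(values)
    n = len(ordered)
    edges = {ordered[(k * n) // max_quantiles] for k in range(max_quantiles)}
    return sorted(edges)


def digitize(value, edges):
    """Bin id = number of edges that are <= value."""
    return bisect_right(edges, value)


# Hypothetical interest rates; 4 quantiles keep the example readable.
rates = [3.5, 4.0, 4.0, 4.25, 5.0, 6.5, 7.0, 8.0]
edges = quantile_edges(rates, max_quantiles=4)
bin_ids = [digitize(r, edges) for r in rates]

print(edges)    # [3.5, 4.0, 5.0, 7.0]
print(bin_ids)  # [1, 2, 2, 2, 3, 3, 4, 4]
```

De-duplicating the edges matters for heavily repeated values: when many rows share one value, several quantiles coincide and would otherwise produce empty bins.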

Once we have discrete features, we hash
them into a fixed range, `[0, num_features)`, whose values act as indices into an embedding table providing
the inputs to the network. We use the `Series.hash_encode` method to
accomplish this [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing).
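The following sketch shows the general shape of feature hashing in plain Python. It assumes `zlib.crc32` as the hash function purely for illustration; this is not the hash cuDF uses, and `hash_feature` and the column names are hypothetical.

```python
import zlib


def hash_feature(column, value, num_features):
    """Map a (column name, discrete value) pair to an index in [0, num_features).

    Mixing in the column name keeps equal values in different columns from
    always sharing an index. zlib.crc32 is deterministic across runs, unlike
    Python's built-in hash() for strings.
    """
    key = "{}={}".format(column, value).encode("utf-8")
    return zlib.crc32(key) % num_features


num_features = 2 ** 18

# Hypothetical row of already-discretized features.
row = {"interest_rate_bin": 3, "loan_age_bin": 3, "property_state": 12}
indices = [hash_feature(col, val, num_features) for col, val in row.items()]
print(indices)
```

Distinct features can still collide on the same index. A larger `num_features` lowers the collision rate at the cost of a larger embedding table.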

## PyTorch Deep Neural Network

### Model
The model constructed below starts with an initial embedding layer ([`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/nn.html#embeddingbag)) that takes the indices from the ETL pipeline, looks up the corresponding rows in the embedding table and takes their mean. This vector then passes to a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron), which finally outputs a single score.

Many of the model architecture parameters can be configured by the user, such as the embedding dimension, the number and size of hidden layers, and the activation function.
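To make the lookup-and-mean step concrete without depending on PyTorch, here is a small pure-Python sketch of what a mean-mode embedding bag computes for a single sample. The function names and table sizes are hypothetical, and the table is filled with random numbers rather than learned weights.

```python
import random


def make_embedding_table(num_features, embedding_size, seed=0):
    """Build a num_features x embedding_size table of random floats."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embedding_size)]
            for _ in range(num_features)]


def embedding_bag_mean(table, indices):
    """Look up one table row per index and average the rows element-wise."""
    rows = [table[i] for i in indices]
    return [sum(column) / len(rows) for column in zip(*rows)]


table = make_embedding_table(num_features=8, embedding_size=4)

# One sample whose hashed features are the indices 1, 5 and 5 (a repeat counts twice).
pooled = embedding_bag_mean(table, [1, 5, 5])
print(len(pooled))  # 4
```

However many features a sample has, the pooled vector always has `embedding_size` entries, which is what lets it feed a fixed-width multilayer perceptron.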

### Training
To cut down on boilerplate code and realize the benefits of [early stopping](https://en.wikipedia.org/wiki/Early_stopping)
we'll use the [`ignite`](https://pytorch.org/ignite/) library.
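The logic behind early stopping with a patience counter can be sketched in a few lines of plain Python. This is a conceptual illustration only, not how `ignite` implements its handler; the function name and the validation scores are made up.

```python
def train_with_early_stopping(scores, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation scores.

    Higher scores are better. Training stops once `patience` consecutive
    epochs pass without improving on the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Ran out of epochs before the patience budget was exhausted.
    return best_epoch, len(scores) - 1


# Hypothetical validation scores, one per epoch.
scores = [0.60, 0.65, 0.70, 0.69, 0.71, 0.70, 0.69, 0.72]
best_epoch, stop_epoch = train_with_early_stopping(scores, patience=2)
print(best_epoch, stop_epoch)  # 4 6
```

In this run the final score of 0.72 is never reached, because training halts at epoch 6. A small patience saves compute but can stop before a late improvement.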


## Requirements
Beyond the dependencies that come installed in the standard 
[RAPIDS docker containers](https://hub.docker.com/r/rapidsai/rapidsai) we'll also
need the following `pip` dependencies installed:

In [2]:
!pip install torch pytorch-ignite

Collecting torch
  Downloading https://files.pythonhosted.org/packages/ac/23/a4b5c189dd624411ec84613b717594a00480282b949e3448d189c4aa4e47/torch-1.1.0-cp37-cp37m-manylinux1_x86_64.whl (676.9MB)
Collecting pytorch-ignite
  Downloading https://files.pythonhosted.org/packages/98/7b/1da69e5fdcb70e8f40ff3955516550207d5f5c81b428a5056510e72c60c5/pytorch_ignite-0.2.0-py2.py3-none-any.whl (73kB)
Installing collected packages: torch, pytorch-ignite
Successfully installed pytorch-ignite-0.2.0 torch-1.1.0


## CODE

### Imports

In [3]:
from collections import defaultdict, OrderedDict
import datetime as dt
import glob
import os
import re
import subprocess
import tempfile
import time

import cudf
import dask
from dask.delayed import delayed
from dask.distributed import as_completed, Client, wait
from dask_cuda import LocalCUDACluster
from ignite.engine import create_supervised_evaluator, create_supervised_trainer, Events
from ignite.handlers import EarlyStopping as IgniteEarlyStopping
from ignite.metrics import Loss, Metric
import numpy as np
import pyarrow.parquet as pq
from sklearn.metrics import auc, precision_recall_curve
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as torch_optim
from torch.utils import data as torch_data

# CUDF_VERSION = tuple(map(int, cudf.__version__.split(".")[:3]))
# assert CUDF_VERSION >= (0, 6, 0), "cudf version must be at least 0.6.0! Found {}!".format(CUDF_VERSION)

## Configuration

#### ETL - Raw Data

In [None]:
# to download data for this notebook, visit https://rapidsai.github.io/demos/datasets/mortgage-data and update the following paths accordingly
acq_data_path = "/data/mortgage/acquisition"
perf_data_path = "/data/mortgage/perf_clean_full_split/"
col_names_path = "/data/mortgage/names.csv"

#### ETL - Data Splits
The loan range below is used to select only those loans acquired by FNMA during the configured period. Note that the validation range's
earliest date is more than 12 months after the last training date, because the training target, 12-month delinquency, uses information from up to one year in the future. To ensure generalizability, we must validate and test on data beyond those dates.
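A small sanity check for this gap might look like the sketch below. The helper names are hypothetical and the check is not part of the notebook; it only encodes the rule that evaluation data must start more than 12 months after the last training month.

```python
def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`, each a (year, month) tuple."""
    return (later[0] - earlier[0]) * 12 + (later[1] - earlier[1])


def check_no_label_leakage(train_end, eval_start, horizon_months=12):
    """Raise if evaluation data starts within the label horizon of the training data."""
    gap = months_between(train_end, eval_start)
    if gap <= horizon_months:
        raise ValueError(
            "Evaluation starts {} months after training ends; need more than {}.".format(
                gap, horizon_months))
    return gap


# Last training month 2000-12, first validation month 2002-01.
gap = check_no_label_leakage((2000, 12), (2002, 1))
print(gap)  # 13
```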

In [None]:
# Loans are divided on a per quarter basis
loan_range = ['2000Q1', '2000Q4']

# Loan performance data are divided on a per month basis
train_range = ['200001', '200012']
validation_range = ['200201', '200201']
test_range = ['200202', '200202']

#### ETL - Discretization

In [None]:
max_quantiles = 20  # Used for computing histograms of continuous features
num_features = 2 ** 18  # When hashing features range will be [0, num_features)

#### Training - Model

In [None]:
embedding_size = 16
hidden_dims = [256, 256, 256]

device = "cuda"
dropout = None  # Can add dropout probability in [0, 1] here
activation = nn.ReLU()

#### Training - Optimization

In [None]:
epoch_size = 10000000

train_batch_size = 512
validation_batch_size = 2048

log_interval = 1000

learning_rate = 0.01
patience = 4
lr_multiplier = 0.5
max_epochs = 3  # Increase this for a more realistic training run 

### Start Dask CUDA Cluster

In [None]:
cmd = "hostname --all-ip-addresses"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
IPADDR = str(output.decode()).split()[0]

cluster = LocalCUDACluster(ip=IPADDR, scheduler_port=8786, diagnostics_port=8787)
client = Client(cluster)
client

In [None]:
def run_dask_task(func, **kwargs):
    task = func(**kwargs)
    return task

## Date Range Setup

In [None]:
def parse_year_quarter(date_str, regex):
    matches = regex.findall(date_str)
    if not matches:
        raise Exception("'{}' does not match '{}' pattern!".format(date_str, regex.pattern))
    return int(matches[0][0]), int(matches[0][1])


def check_date_range(rng, regex):
    dates = [parse_year_quarter(d, regex) for d in rng]
    assert dates[0][0] <= dates[1][0], "First year > second year"
    if dates[0][0] == dates[1][0]:
        assert dates[0][1] <= dates[1][1]
    return tuple(dates)

MONTH_RE = re.compile('([0-9]{4})(0[1-9]|1[0-2])')
QUARTER_RE = re.compile('([0-9]{4})Q([1234])')

In [None]:
start_date, end_date = check_date_range(loan_range, QUARTER_RE)
train_dates = check_date_range(train_range, MONTH_RE)
validation_dates = check_date_range(validation_range, MONTH_RE)
test_dates = check_date_range(test_range, MONTH_RE)

In [None]:
print("Using data from loans acquired by FNMA from {} to {}".format(loan_range[0], loan_range[1]))

## RMM Pool Functions

In [None]:
def initialize_rmm_pool():
    from librmm_cffi import librmm_config as rmm_cfg

    rmm_cfg.use_pool_allocator = True
    #rmm_cfg.initial_pool_size = 2<<30 # set to 2GiB. Default is 1/2 total GPU memory
    import cudf
    return cudf.rmm.initialize()

def initialize_rmm_no_pool():
    from librmm_cffi import librmm_config as rmm_cfg
    
    rmm_cfg.use_pool_allocator = False
    import cudf
    return cudf.rmm.initialize()

def finalize_rmm():
    import cudf
    return cudf.rmm.finalize()

In [None]:
client.run(initialize_rmm_pool)

## Load Raw Mortgage Data

In [None]:
PERFORMANCE_COLS = OrderedDict([
    ("loan_id", "int64"),
    ("monthly_reporting_period", "date"),
    ("servicer", "category"),
    ("interest_rate", "float64"),
    ("current_actual_upb", "float64"),
    ("loan_age", "float64"),
    ("remaining_months_to_legal_maturity", "float64"),
    ("adj_remaining_months_to_maturity", "float64"),
    ("maturity_date", "date"),
    ("msa", "float64"),
    ("current_loan_delinquency_status", "int32"),
    ("mod_flag", "category"),
    ("zero_balance_code", "category"),
    ("zero_balance_effective_date", "date"),
    ("last_paid_installment_date", "date"),
    ("foreclosed_after", "date"),
    ("disposition_date", "date"),
    ("foreclosure_costs", "float64"),
    ("prop_preservation_and_repair_costs", "float64"),
    ("asset_recovery_costs", "float64"),
    ("misc_holding_expenses", "float64"),
    ("holding_taxes", "float64"),
    ("net_sale_proceeds", "float64"),
    ("credit_enhancement_proceeds", "float64"),
    ("repurchase_make_whole_proceeds", "float64"),
    ("other_foreclosure_proceeds", "float64"),
    ("non_interest_bearing_upb", "float64"),
    ("principal_forgiveness_upb", "float64"),
    ("repurchase_make_whole_proceeds_flag", "category"),
    ("foreclosure_principal_write_off_amount", "float64"),
    ("servicing_activity_indicator", "category")
])

ACQUISITION_COLS = OrderedDict([
    ("loan_id", "int64"),
    ("orig_channel", "category"),
    ("seller_name", "category"),
    ("orig_interest_rate", "float64"),
    ("orig_upb", "int64"),
    ("orig_loan_term", "int64"),
    ("orig_date", "date"),
    ("first_pay_date", "date"),
    ("orig_ltv", "float64"),
    ("orig_cltv", "float64"),
    ("num_borrowers", "float64"),
    ("dti", "float64"),
    ("borrower_credit_score", "float64"),
    ("first_home_buyer", "category"),
    ("loan_purpose", "category"),
    ("property_type", "category"),
    ("num_units", "int64"),
    ("occupancy_status", "category"),
    ("property_state", "category"),
    ("zip", "int64"),
    ("mortgage_insurance_percent", "float64"),
    ("product_type", "category"),
    ("coborrow_credit_score", "float64"),
    ("mortgage_insurance_type", "float64"),
    ("relocation_mortgage_indicator", "category")
])
    
NAMES_COLS = OrderedDict([
    ("seller_name", "category"),
    ("new", "category"),
])

def gpu_load_performance_csv(performance_path, drop_cols=None, skip_rows=0):
    """ Loads performance data

    Returns
    -------
    GPU DataFrame
    """
    df = cudf.read_csv(performance_path, names=PERFORMANCE_COLS.keys(), delimiter='|',
                       dtype=list(PERFORMANCE_COLS.values()), skiprows=skip_rows)
    # Avoid a mutable default argument; None means "drop nothing"
    for col in drop_cols or []:
        df.drop_column(col)
    return df

def gpu_load_acquisition_csv(acquisition_path, skip_rows=0):
    """ Loads acquisition data

    Returns
    -------
    GPU DataFrame
    """
    return cudf.read_csv(acquisition_path, names=ACQUISITION_COLS.keys(), 
                         delimiter='|', dtype=list(ACQUISITION_COLS.values()), skiprows=skip_rows)

def gpu_load_names(col_names_path):
    """ Loads names used for renaming the banks
    
    Returns
    -------
    GPU DataFrame
    """
    return cudf.read_csv(col_names_path, names=NAMES_COLS.keys(), delimiter='|', 
                         dtype=list(NAMES_COLS.values()), skiprows=1)


## Feature Engineering

In [None]:
def make_dt_first_day(ym):
    return dt.datetime(ym[0], ym[1], 1)


def sample_df(df, start, end):

    start_dt = make_dt_first_day(start)
    end_dt = make_dt_first_day(end)

    query = "timestamp >= @{} and timestamp <= @{}"
    return df.query(query.format("start_dt", "end_dt"))


def null_workaround(df, **kwargs):
    for column, data_type in df.dtypes.items():
        if str(data_type) == "category":
            df[column] = df[column].astype('int32').fillna(np.dtype(np.int32).type(-1))
        if str(data_type) in ['int8', 'int16', 'int32', 'int64', 'float32', 'float64']:
            df[column] = df[column].fillna(np.dtype(data_type).type(-1)).astype(data_type)
    return df


def create_ever_features(df):
    """Calculate whether loans have ever been delinquent for 30/90/180 days"""

    everdf = df[['loan_id', 'current_loan_delinquency_status']]
    everdf = everdf.groupby('loan_id', method='hash', as_index=False).max()
    del(df)
    everdf['ever_30'] = (everdf['current_loan_delinquency_status'] >= 1).astype('int8')
    everdf['ever_90'] = (everdf['current_loan_delinquency_status'] >= 3).astype('int8')
    everdf['ever_180'] = (everdf['current_loan_delinquency_status'] >= 6).astype('int8')
    everdf.drop_column('current_loan_delinquency_status')
    return everdf


def create_delinq_features(df, **kwargs):
    """Find minimum dates when loans are delinquent for 30/90/180 days"""
    delinq_df = df[['loan_id', 'monthly_reporting_period', 'current_loan_delinquency_status']]
    del(df)
    delinq_30 = delinq_df.query('current_loan_delinquency_status >= 1')[['loan_id', 'monthly_reporting_period']].groupby('loan_id', method='hash', as_index=False).min()
    delinq_30['delinquency_30'] = delinq_30['monthly_reporting_period']
    delinq_30.drop_column('monthly_reporting_period')
    delinq_90 = delinq_df.query('current_loan_delinquency_status >= 3')[['loan_id', 'monthly_reporting_period']].groupby('loan_id', method='hash', as_index=False).min()
    delinq_90['delinquency_90'] = delinq_90['monthly_reporting_period']
    delinq_90.drop_column('monthly_reporting_period')
    delinq_180 = delinq_df.query('current_loan_delinquency_status >= 6')[['loan_id', 'monthly_reporting_period']].groupby('loan_id', method='hash', as_index=False).min()
    delinq_180['delinquency_180'] = delinq_180['monthly_reporting_period']
    delinq_180.drop_column('monthly_reporting_period')
    del(delinq_df)
    delinq_merge = delinq_30.merge(delinq_90, how='left', on=['loan_id'], type='hash')
    delinq_merge['delinquency_90'] = delinq_merge['delinquency_90'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    delinq_merge = delinq_merge.merge(delinq_180, how='left', on=['loan_id'], type='hash')
    delinq_merge['delinquency_180'] = delinq_merge['delinquency_180'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    del(delinq_30)
    del(delinq_90)
    del(delinq_180)
    return delinq_merge


def join_ever_delinq_features(everdf_tmp, delinq_merge, **kwargs):
    """
    Output cols: loan_id|ever_30|ever_90|ever_180|delinquency_30|delinquency_90|delinquency_180
    """
    everdf = everdf_tmp.merge(delinq_merge, on=['loan_id'], how='left', type='hash')
    del(everdf_tmp)
    del(delinq_merge)
    everdf['delinquency_30'] = everdf['delinquency_30'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    everdf['delinquency_90'] = everdf['delinquency_90'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    everdf['delinquency_180'] = everdf['delinquency_180'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    return everdf


def create_joined_df(df, everdf, **kwargs):
    """Join the full dataframe column subset with the joined delinquency/ever dataframe

    Note:
      * The column 'delinquency_12' returned is just the current loan delinquency status
      * The column 'upb_12' returned is just the current actual unpaid principal balance
      * Output cols:
        ** loan_id
        ** timestamp
        ** timestamp_month
        ** timestamp_year
        ** delinquency_12
        ** upb_12
        ** ever_30
        ** ever_90
        ** ever_180
        ** delinquency_30
        ** delinquency_90
        ** delinquency_180
    """
    test = df[['loan_id', 'monthly_reporting_period', 'current_loan_delinquency_status', 'current_actual_upb']]
    del(df)
    test['timestamp'] = test['monthly_reporting_period']
    test.drop_column('monthly_reporting_period')
    test['timestamp_month'] = test['timestamp'].dt.month
    test['timestamp_year'] = test['timestamp'].dt.year

    test['delinquency_12'] = test['current_loan_delinquency_status']
    test.drop_column('current_loan_delinquency_status')
    test['delinquency_12'] = test['delinquency_12'].fillna(-1)

    test['upb_12'] = test['current_actual_upb']
    test.drop_column('current_actual_upb')
    test['upb_12'] = test['upb_12'].fillna(999999999)
    
    joined_df = test.merge(everdf, how='left', on=['loan_id'], type='hash')
    del(everdf)
    del(test)
    
    joined_df['ever_30'] = joined_df['ever_30'].fillna(-1)
    joined_df['ever_90'] = joined_df['ever_90'].fillna(-1)
    joined_df['ever_180'] = joined_df['ever_180'].fillna(-1)
    joined_df['delinquency_30'] = joined_df['delinquency_30'].fillna(-1)
    joined_df['delinquency_90'] = joined_df['delinquency_90'].fillna(-1)
    joined_df['delinquency_180'] = joined_df['delinquency_180'].fillna(-1)
    
    joined_df['timestamp_year'] = joined_df['timestamp_year'].astype('int32')
    joined_df['timestamp_month'] = joined_df['timestamp_month'].astype('int32')
    
    return joined_df


def create_12_mon_features(joined_df):
    testdfs = []
    n_months = 12
    for y in range(1, n_months + 1):
        tmpdf = joined_df[['loan_id', 'timestamp_year', 'timestamp_month', 'delinquency_12', 'upb_12']]
        tmpdf['josh_months'] = tmpdf['timestamp_year'] * 12 + tmpdf['timestamp_month']
        tmpdf['josh_mody_n'] = ((tmpdf['josh_months'].astype('float64') - 24000 - y) / 12).floor()
        tmpdf = tmpdf.groupby(['loan_id', 'josh_mody_n'], method='hash', as_index=False).agg({'delinquency_12': 'max','upb_12': 'min'})
        tmpdf['delinquency_12'] = (tmpdf['max_delinquency_12']>3).astype('int32')
        tmpdf['delinquency_12'] +=(tmpdf['min_upb_12']==0).astype('int32')
        tmpdf.drop_column('max_delinquency_12')
        tmpdf['upb_12'] = tmpdf['min_upb_12']
        tmpdf.drop_column('min_upb_12')
        tmpdf['timestamp_year'] = (((tmpdf['josh_mody_n'] * n_months) + 24000 + (y - 1)) / 12).floor().astype('int16')
        tmpdf['timestamp_month'] = np.int8(y)
        tmpdf.drop_column('josh_mody_n')
        testdfs.append(tmpdf)
        del(tmpdf)
    del(joined_df)

    return cudf.concat(testdfs)


def combine_joined_12_mon(joined_df, testdf, **kwargs):
    joined_df.drop_column('delinquency_12')
    joined_df.drop_column('upb_12')
    joined_df['timestamp_year'] = joined_df['timestamp_year'].astype('int16')
    joined_df['timestamp_month'] = joined_df['timestamp_month'].astype('int8')
    return joined_df.merge(testdf, how='left', on=['loan_id', 'timestamp_year', 'timestamp_month'], type='hash')

def final_performance_delinquency(df, joined_df, **kwargs):
    merged = null_workaround(df)
    joined_df = null_workaround(joined_df)
    merged['timestamp_month'] = merged['monthly_reporting_period'].dt.month
    merged['timestamp_month'] = merged['timestamp_month'].astype('int8')
    merged['timestamp_year'] = merged['monthly_reporting_period'].dt.year
    merged['timestamp_year'] = merged['timestamp_year'].astype('int16')
    merged = merged.merge(joined_df, how='left', on=['loan_id', 'timestamp_year', 'timestamp_month'], type='hash')

    merged.drop_column('timestamp_year')
    merged.drop_column('timestamp_month')
    return merged


def join_perf_acq_dfs(perf, acq, **kwargs):
    perf = null_workaround(perf)
    acq = null_workaround(acq)
    return perf.merge(acq, how='left', on=['loan_id'], type='hash')


def last_mile_cleaning(df):
    df['timestamp'] = df['monthly_reporting_period'].astype('datetime64[ms]')
    drop_list = [
        'loan_id', 'orig_date', 'first_pay_date', 'seller_name', 'monthly_reporting_period', 
        'last_paid_installment_date', 'maturity_date', 'ever_30', 'ever_90', 'ever_180',
        'delinquency_30', 'delinquency_90', 'delinquency_180', 'upb_12',
        'zero_balance_effective_date','foreclosed_after', 'disposition_date'
    ]
    for column in drop_list:
        df.drop_column(column)
    for col, dtype in df.dtypes.items():  # .items() matches the other loops; iteritems() is gone from newer pandas
        if str(dtype)=='category':
            df[col] = df[col].cat.codes
        if col != 'timestamp':
            if 'float' in str(dtype):
                df[col] = df[col].astype('float32')
    df['delinquency_12'] = df['delinquency_12'] > 0
    df['delinquency_12'] = df['delinquency_12'].fillna(False).astype('int8')

    for column in df.columns:
        df[column] = df[column].fillna(-1)

    return df


def load_sample_and_clean_data(year,
                               quarter,
                               perf_file,
                               dates=None,
                               **kwargs):
    names = gpu_load_names(col_names_path)

    acq_df = gpu_load_acquisition_csv(acquisition_path=acq_data_path + "/Acquisition_"
                                      + str(year) + "Q" + str(quarter) + ".txt")
    acq_df = acq_df.merge(names, how='left', on=['seller_name'])
    acq_df.drop_column('seller_name')
    acq_df['seller_name'] = acq_df['new']
    acq_df.drop_column('new')
    
    perf_df_tmp = gpu_load_performance_csv(perf_file)

    df = perf_df_tmp
    everdf = create_ever_features(df)
    delinq_merge = create_delinq_features(df)
    everdf = join_ever_delinq_features(everdf, delinq_merge)
    del(delinq_merge)
    joined_df = create_joined_df(df, everdf)
    testdf = create_12_mon_features(joined_df)
    joined_df = combine_joined_12_mon(joined_df, testdf)
    del(testdf)
    perf_df = final_performance_delinquency(df, joined_df)
    del(df, joined_df)
    final_df = join_perf_acq_dfs(perf_df, acq_df)
    del(perf_df)
    del(acq_df)
    final_df = last_mile_cleaning(final_df)
    if dates is None:
        final_df.drop_column("timestamp")
        out = {"train": final_df.to_arrow(preserve_index=False)}
        del(final_df)
        return out
    else:
        output = {}
        for k, (start, end) in dates.items():
            sampled_df = sample_df(final_df, start, end)
            sampled_df.drop_column("timestamp")
            output[k] = sampled_df.to_arrow(preserve_index=False)
            del(sampled_df)
        del(final_df)
        return output



def process_quarter_gpu(client, year, quarter, perf_file, dates=None):

    ml_arrays = run_dask_task(delayed(load_sample_and_clean_data),
                              year=year,
                              quarter=quarter,
                              acq_subdir=acq_data_path,
                              perf_file=perf_file,
                              dates=dates,
                              dask_key_name="PROCESS/" + os.path.basename(perf_file))
    return client.compute(ml_arrays,
                          optimize_graph=False,
                          fifo_timeout="0ms")


def load_and_process_data(client,
                          start_year,
                          start_quarter,
                          end_year,
                          end_quarter,
                          dates=None):
    assert start_year <= end_year
    if start_year == end_year:
        assert start_quarter <= end_quarter
    
    gpu_dfs = []
    year = start_year
    quarter = start_quarter
    while year <= end_year:
        if year == end_year and quarter > end_quarter:
            break
        perf_file_regex = os.path.join(perf_data_path,
                                       'Performance_{}Q{}*'.format(year, quarter))
        for perf_file in glob.glob(perf_file_regex):
            gpu_dfs.append(process_quarter_gpu(client,
                                               year,
                                               quarter,
                                               perf_file,
                                               dates=dates))
        quarter += 1
        if quarter == 5:
            year += 1
            quarter = 1
    return gpu_dfs

In [None]:
dates = {
    "train": train_dates,
    "validation": validation_dates,
    "test": test_dates
}

tables_fut = load_and_process_data(client, start_date[0], start_date[1], 
                                   end_date[0], end_date[1], dates=dates)
for future in as_completed(tables_fut):
    print(future)

## Compute Discretization & Hashing

In [None]:
@delayed
def get_dtype_columns(table_dict, key="train"):
    df = cudf.DataFrame.from_arrow(table_dict[key])
    columns = defaultdict(list)
    items = df.dtypes.items()
    del(df)
    for column, dtype in items:
        columns[str(dtype)].append(column)
    return columns


@delayed
def select_column(tables_dict, column, key="train"):
    return tables_dict[key].column(column).to_pandas()


def calculate_quantiles(tables, max_quantiles, client, columns, tables_key="train", unique=True):
    assert len(tables) > 0
    quantiles = {}
    for col in columns:
        col_pieces = [select_column(td, col, key=tables_key) for td in tables]
        col_pieces = client.compute(col_pieces)
        wait(col_pieces)
        col_list = client.gather(col_pieces)
        assert len(col_list)
        col_data = np.concatenate(col_list)
        series = cudf.Series.from_pandas(col_data, nan_as_null=False)
        assert series.null_count == 0, "Found {} null values. Should be 0!".format(series.null_count)
        step = 1 / max_quantiles
        quantiles[col] = series.quantile(np.arange(0, 1, step), quant_index=False).to_array().astype(np.float32)
        if unique:
            quantiles[col] = np.unique(quantiles[col])
    return quantiles


@delayed
def discretize(tables_dict, bin_dict, hash_size):
    for key, table in tables_dict.items():
        df = cudf.DataFrame.from_arrow(table)
        for col, dtype in df.dtypes.items():
            if col != 'delinquency_12':
                if col in bin_dict:
                    bins = bin_dict[col]
                    df[col] = df[col].astype(np.float32).digitize(bins)
                    df[col] = df[col].hash_encode(hash_size, use_name=True)
                elif 'float' in str(dtype):
                    raise RuntimeError(
                        "Float column '{}' does not have bins for discretization!".format(col))
                else:
                    df[col] = df[col].hash_encode(hash_size, use_name=True)
        tables_dict[key] = df.to_arrow(preserve_index=False)
    return tables_dict

In [None]:
col_type_fut = client.compute(get_dtype_columns(tables_fut[0]),
                              optimize_graph=False, fifo_timeout="0ms")
wait(col_type_fut)
col_type_dict = client.gather(col_type_fut)
cts_columns = col_type_dict["float32"] + col_type_dict["float64"]
print("Found {} continuous value columns: {}".format(len(cts_columns),
                                                     ",".join(cts_columns)))
quantiles = calculate_quantiles(tables_fut, max_quantiles, client, cts_columns)

In [None]:
bin_delayed = []
for df in tables_fut:
    key = "DISCRETIZE/" + os.path.basename(df.key)
    task = run_dask_task(discretize,
                         tables_dict=df,
                         bin_dict=quantiles,
                         hash_size=num_features,
                         dask_key_name=key)
    bin_delayed.append(task)

bin_fut = client.compute(bin_delayed, optimize_graph=False, fifo_timeout="0ms")
for future in as_completed(bin_fut):
    print(future)

## Persist Data in Apache Parquet Format

In [None]:
@delayed
def save(tables_dict, base_output_dir, fnames):
    out_dirs = {}
    for k in tables_dict.keys():
        out_dirs[k] = os.path.join(base_output_dir, k)
        assert os.path.exists(out_dirs[k]), "Output directory {} does not exist!".format(out_dirs[k])

    for k, table in tables_dict.items():
        path = os.path.join(base_output_dir, k, fnames[k])
        assert not os.path.exists(path), "Output path already exists at {}!".format(path)
        pq.write_table(table, path, compression='snappy')

        
def persist_data(client, tables_fut, out_fname_dates):
    out_delayed = []
    for td in tables_fut:
        perf_file = os.path.basename(td.key)
        key = os.path.join("PERSIST_DNN", perf_file)
        fnames = {k: perf_file.replace(".", "_") + "_" + v + ".parquet" for (k, v) in out_fname_dates.items()}
        task = run_dask_task(save,
                             tables_dict=td,
                             base_output_dir=out_dir,
                             fnames=fnames,
                             dask_key_name=key)
        out_delayed.append(task)
    out_fut = client.compute(out_delayed, optimize_graph=False, fifo_timeout="0ms")
    for future in as_completed(out_fut):
        print(future)

In [None]:
out_fname_dates = {}
out_dir = tempfile.mkdtemp()
for key, ((sy, sm), (ey, em)) in dates.items():
    key_dir = os.path.join(out_dir, key)
    print("Creating directory: {}".format(key_dir))
    os.makedirs(key_dir)
    out_fname_dates[key] = "{}{:02d}_{}{:02d}".format(sy, sm, ey, em)
persist_data(client, bin_fut, out_fname_dates)

## Shutdown Dask - ETL Complete

In [None]:
client.close()
cluster.close()

## Torch Dataset from Parquet

In [None]:
def load_tensors_from_parquet(path, target_name='delinquency_12'):
    tbl = pq.read_table(path).to_pandas()
    target = None
    if target_name in tbl:
        target = torch.from_numpy(tbl.pop(target_name).values.astype(np.float32))
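    # np.long is not available in every NumPy release; int64 yields the LongTensor that EmbeddingBag expects
    features = torch.from_numpy(tbl.values.astype(np.int64))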
    tensors = [features]
    if target is not None:
        tensors.append(target)
    return tuple(tensors)


class MortgageParquetDataset(torch_data.Dataset):

    def __init__(self, root_path, num_samples=None, target_name='delinquency_12',
                 shuffle_files=False):
        self.parquet_files = glob.glob(os.path.join(root_path, "*.parquet"))
        if shuffle_files:
            self.parquet_files = list(np.random.permutation(self.parquet_files))
        self.target_name = target_name
        self.metadata = [pq.read_metadata(p) for p in self.parquet_files]
        self.cumsum_rows = np.cumsum([m.num_rows for m in self.metadata])

        self.times_through = 0
        if num_samples is not None:
            self.num_samples = min(num_samples, self.cumsum_rows[-1])
        else:
            self.num_samples = self.cumsum_rows[-1]

        self.loaded_tensors = None

    def __len__(self):
        return self.num_samples

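    def __getitem__(self, item):
        tt = self.times_through
        if item == len(self) - 1:
            self.times_through += 1
        # Shift by one window per completed pass and wrap over all available rows.
        # Wrapping modulo len(self) would cancel the shift, so every pass would
        # revisit the same rows.
        item += tt * len(self)
        item %= self.cumsum_rows[-1]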

        part_idx = np.searchsorted(self.cumsum_rows, item, side='right')

        if self.loaded_tensors is None or self.loaded_tensors[0] != part_idx:
            tensors = load_tensors_from_parquet(self.parquet_files[part_idx])
            self.loaded_tensors = (part_idx, tensors)

        i = item if part_idx == 0 else item - self.cumsum_rows[part_idx - 1]
        return tuple(tensor[i] for tensor in self.loaded_tensors[1])



def load_torch_dataset(root_path, num_samples=None, shuffle_files=False):
    return MortgageParquetDataset(root_path, num_samples=num_samples, shuffle_files=shuffle_files)
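
The index arithmetic in `__getitem__` can be checked in isolation. The sketch below is a minimal illustration, not part of the original workflow: the row counts in `rows_per_file` and the helper name `locate` are hypothetical. It shows how the cumulative row counts and `np.searchsorted(..., side='right')` translate a global sample index into a file index plus a row offset inside that file.

```python
import numpy as np

# Hypothetical row counts for three Parquet files (illustration only).
rows_per_file = [3, 5, 2]
cumsum_rows = np.cumsum(rows_per_file)  # running totals: 3, 8, 10


def locate(item):
    """Map a global sample index to (file index, row offset within that file)."""
    # side='right' sends an index equal to a running total to the *next* file.
    part_idx = int(np.searchsorted(cumsum_rows, item, side="right"))
    offset = item if part_idx == 0 else item - int(cumsum_rows[part_idx - 1])
    return part_idx, offset


locations = [locate(i) for i in range(int(cumsum_rows[-1]))]
for i, (part_idx, offset) in enumerate(locations):
    print("sample {} -> file {}, row {}".format(i, part_idx, offset))
```

Because consecutive indices usually fall in the same file, caching the most recently loaded partition avoids re-reading a file for every sample when the loader is not shuffled.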

## PyTorch DNN Model

In [None]:
def _make_hidden_layer(in_dim, out_dim, activation, dropout=None):
    if dropout:
        return nn.Sequential(nn.Linear(in_dim, out_dim), activation, nn.Dropout(p=dropout))
    return nn.Sequential(nn.Linear(in_dim, out_dim), activation)


class MortgageNetwork(nn.Module):
    """Mortgage Delinquency DNN."""

    def __init__(
        self,
        num_features,
        embedding_size,
        hidden_dims,
        use_cuda=True,
        activation=nn.ReLU(),
        dropout=None,
        embedding_bag_mode='mean'
    ):
        super(MortgageNetwork, self).__init__()
        self.input_size = num_features
        self.embedding_size = embedding_size
        if use_cuda and torch.cuda.is_available():
            self.device = torch.device("cuda")
        else:
            self.device = torch.device("cpu")
        self.activation = activation
        self.dropout = dropout

        self.embedding = nn.modules.EmbeddingBag(self.input_size, self.embedding_size,
                                                 mode=embedding_bag_mode)

        # Hidden stack followed by a single-logit output layer. The output layer
        # is always created so the model stays valid when hidden_dims is empty.
        dims = [self.embedding_size] + list(hidden_dims)
        hidden_layers = [
            _make_hidden_layer(dims[i], dims[i + 1], self.activation, self.dropout)
            for i in range(len(dims) - 1)
        ]
        self.hidden_layers = nn.ModuleList(hidden_layers)
        self.hidden_layers.append(nn.Linear(dims[-1], 1))

        self.to(self.device)

    def forward(self, x):
        """Forward pass."""
        out = self.embedding(x)
        out = self.activation(out)
        for layer in self.hidden_layers:
            out = layer(out)
        # Drop only the trailing logit dimension so a batch of one keeps shape (1,).
        return out.squeeze(-1)
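
To make the first layer of the network concrete, here is a minimal sketch of what `EmbeddingBag` does with a 2D batch of hashed feature indices. The sizes and index values are hypothetical and chosen only for illustration. With a 2D input of shape `(batch, n_indices)`, each row is treated as one bag, and `mode="mean"` averages the embedding vectors of that row's indices.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes for illustration: 10 hashed feature ids, 4-dimensional embeddings.
num_features, embedding_size = 10, 4
bag = nn.EmbeddingBag(num_features, embedding_size, mode="mean")

# Two examples, each described by three discrete feature indices.
batch = torch.tensor([[0, 3, 7], [1, 3, 9]], dtype=torch.long)

pooled = bag(batch)                      # one pooled vector per example
manual = bag.weight[batch].mean(dim=1)   # explicit lookup followed by a mean

print(pooled.shape)
print(torch.allclose(pooled, manual))
```

Pooling the per-feature embeddings gives every example a fixed-size dense vector regardless of how many discrete features it has, which is what the hidden layers consume.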

## Metric Used for Early Stopping

In [None]:
class PrAucMetric(Metric):
    def __init__(self, ignore_bad_metric=False):
        super(PrAucMetric, self).__init__()
        self.name = "PR-AUC"
        self._predictions = []
        self._targets = []
        self._ignore_bad_metric = ignore_bad_metric

    def reset(self):
        self._predictions = []
        self._targets = []

    def update(self, output):
        if len(output) != 2:
            raise ValueError("Expected output of length 2!")
        y_pred, y_target = output
        self._predictions.append(y_pred)
        self._targets.append(y_target)

    def curve(self, targets, predictions):
        prec, rec, _ = precision_recall_curve(targets, predictions)
        return rec, prec, None

    def compute(self):
        # Gather every batch on the CPU before handing the data to scikit-learn.
        targets = torch.cat(self._targets).cpu()
        predictions = torch.cat(self._predictions).cpu()
        print("Number of targets for {}-Curve: {}".format(self.name, len(targets)))
        x, y, _ = self.curve(targets, predictions)
        if not self._ignore_bad_metric and len(x) == 2:
            raise MetricCurveError("{}-Curve returned only two points!".format(self.name))
        # Area under the precision-recall curve (recall on x, precision on y).
        return auc(x, y)
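
The metric above is fed the raw model outputs, which are logits rather than probabilities. The following sketch uses made-up targets and logits (not taken from the mortgage data) and suggests why that is acceptable: the precision-recall curve depends only on the ranking of the scores, so applying a sigmoid first should leave the area unchanged.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Made-up labels and raw model outputs (logits) for illustration.
targets = np.array([0, 0, 1, 1])
logits = np.array([-2.2, -0.4, -0.6, 1.4])
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps the ordering of the scores


def pr_auc(y_true, y_score):
    """Area under the precision-recall curve, with recall on the x axis."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)


pr_auc_logits = pr_auc(targets, logits)
pr_auc_probs = pr_auc(targets, probs)
print(pr_auc_logits, pr_auc_probs)
```

PR-AUC is a common choice for imbalanced targets such as delinquency, because it focuses on the positive class instead of rewarding the many easy negatives.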

## Early Stopping Handler

In [None]:
class EarlyStopping(IgniteEarlyStopping):
    def __init__(
        self, model, optimizer, lr_multiplier=0.5, min_lr=1.0e-7, delta=0.0005, *args, **kwargs
    ):
        super(EarlyStopping, self).__init__(*args, **kwargs)
        self.optimizer = optimizer
        self.model = model
        self.lr_multiplier = lr_multiplier
        self.min_lr = min_lr
        tmp_dir = tempfile.mkdtemp()
        self._state_path = os.path.join(tmp_dir, "best_state.pth")
        self.delta = delta

    def _state(self):
        return {
            "model": self.model.state_dict(),
            "optimizer": self.optimizer.state_dict(),
        }

    def _save_state(self):
        print("Saving state to {}.".format(self._state_path))
        state = self._state()
        torch.save(state, self._state_path)

    def _load_state(self, update_lr=True):
        print("Loading state from {}.".format(self._state_path))
        state = torch.load(self._state_path)
        self.model.load_state_dict(state["model"])

        # Compute the decayed learning rate before restoring the optimizer,
        # because load_state_dict() resets "lr" to the checkpointed value.
        new_lr = max(self.optimizer.param_groups[0]["lr"] * self.lr_multiplier, self.min_lr)
        self.optimizer.load_state_dict(state["optimizer"])
        if update_lr:
            self.optimizer.param_groups[0]["lr"] = new_lr
            self._logger.info("Updated optimizer: {}".format(str(self.optimizer)))

    def __call__(self, engine):
        score = self.score_function(engine)

        if self.best_score is None:
            # First evaluation: checkpoint it as the baseline.
            self.best_score = score
            self._save_state()
        elif score < self.best_score + self.delta:
            # No improvement of at least delta: roll back to the best
            # checkpoint and continue with a reduced learning rate.
            self.counter += 1
            print("Score did not improve! EarlyStopping: %i / %i" % (self.counter, self.patience))
            self._load_state()
            if self.counter >= self.patience:
                print("EarlyStopping: Stop training")
                self.trainer.terminate()
        else:
            # New best score: checkpoint it and reset the patience counter.
            self.best_score = score
            self.counter = 0
            self._save_state()
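
The decision rule of the handler can be replayed without a model or optimizer. The sketch below is a simplified simulation under stated assumptions: the function name `simulate_early_stopping`, the score sequence, and the starting learning rate are hypothetical, and checkpoint saving and loading are left out. It tracks only the best score, the patience counter, and the learning rate after each validation pass.

```python
def simulate_early_stopping(scores, patience, lr=1e-3, lr_multiplier=0.5,
                            min_lr=1.0e-7, delta=0.0005):
    """Replay the improve / no-improve rule over a list of validation scores."""
    best_score, counter, history = None, 0, []
    for epoch, score in enumerate(scores, start=1):
        if best_score is None or score >= best_score + delta:
            # First score, or an improvement of at least delta: reset patience.
            best_score, counter = score, 0
        else:
            # No sufficient improvement: count a strike and decay the learning rate.
            counter += 1
            lr = max(lr * lr_multiplier, min_lr)
        history.append((epoch, best_score, counter, lr))
        if counter >= patience:
            break  # corresponds to terminating the trainer
    return history


# Made-up PR-AUC values for illustration.
history = simulate_early_stopping([0.10, 0.12, 0.1201, 0.13, 0.12, 0.12], patience=2)
for epoch, best, strikes, lr in history:
    print("epoch {}: best={:.4f} strikes={} lr={:.6f}".format(epoch, best, strikes, lr))
```

Note that a gain smaller than `delta` (0.12 to 0.1201 here) counts as a strike, so the learning rate is halved even though the score rose slightly.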

## Training

In [None]:
def run_training(model):
    # Data
    train_dataset = load_torch_dataset(os.path.join(out_dir, "train"), epoch_size, shuffle_files=True)
    validation_dataset = load_torch_dataset(os.path.join(out_dir, "validation"))
    test_dataset = load_torch_dataset(os.path.join(out_dir, "test"))
    
    train_loader = torch_data.DataLoader(train_dataset,
                                         batch_size=train_batch_size,
                                         num_workers=8)
    validation_loader = torch_data.DataLoader(validation_dataset,
                                              batch_size=validation_batch_size,
                                              num_workers=8)
    test_loader = torch_data.DataLoader(test_dataset,
                                        batch_size=validation_batch_size,
                                        num_workers=8)
    # Optimizer
    optimizer = torch_optim.Adam(model.parameters(), lr=learning_rate)
    
    # Loss Function
    # The model emits raw logits, so use the numerically stable logits variant of BCE.
    loss_fn = F.binary_cross_entropy_with_logits

    # Run batches on the same device the model selected in its constructor.
    trainer = create_supervised_trainer(model=model, optimizer=optimizer, loss_fn=loss_fn,
                                        device=model.device)
    evaluator = create_supervised_evaluator(model, metrics={"pr-auc": PrAucMetric()},
                                            device=model.device)

    # Early stopping
    early_stopping_handler = EarlyStopping(
        model=model,
        optimizer=optimizer,
        lr_multiplier=lr_multiplier,
        patience=patience,
        score_function=lambda engine: engine.state.metrics["pr-auc"],
        trainer=trainer,
    )
    evaluator.add_event_handler(Events.COMPLETED, early_stopping_handler)

    # Events
    @trainer.on(Events.EPOCH_STARTED)
    def timer(engine):
        setattr(engine.state, "epoch_start", time.time())

    num_epoch_batches = len(train_loader)
    examples_per_epoch = num_epoch_batches * train_batch_size

    @trainer.on(Events.ITERATION_COMPLETED)
    def log_training_loss(engine):
        # 1-based batch index within the current epoch.
        batch_idx = (engine.state.iteration - 1) % num_epoch_batches + 1
        if batch_idx % log_interval == 0:
            epoch_time_elapsed = time.time() - engine.state.epoch_start
            examples = engine.state.iteration * train_batch_size
            epoch_examples = examples - (engine.state.epoch - 1) * examples_per_epoch
            epoch_examples_per_second = epoch_examples / epoch_time_elapsed
            print(
                "Epoch[{}] Iteration[{}/{}] Loss: {:.5f} Example/s: {:.3f} (Total examples: {})".format(
                    engine.state.epoch, batch_idx, num_epoch_batches, engine.state.output,
                    epoch_examples_per_second, examples))

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_validation_results(engine):
        evaluator.run(validation_loader)
        metrics = evaluator.state.metrics
        pr_auc = metrics["pr-auc"]
        print("Validation Results - Epoch: {}\n\tPR-AUC: {:.5f}".format(engine.state.epoch, pr_auc))

    @trainer.on(Events.COMPLETED)
    def log_test_results(engine):
        evaluator.run(test_loader)
        metrics = evaluator.state.metrics
        pr_auc = metrics["pr-auc"]
        print("Final Test Results - PR-AUC: {:.5f}".format(pr_auc))

    # Start training; the handlers registered above take care of logging and evaluation.
    trainer.run(train_loader, max_epochs=max_epochs)

In [None]:
model = MortgageNetwork(num_features, embedding_size, hidden_dims,
                        dropout=dropout, activation=activation)
run_training(model)