# Mortgage Workflow

## The Dataset
The dataset used with this workflow is derived from [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.

To acquire this dataset, visit the [RAPIDS Datasets Homepage](https://docs.rapids.ai/datasets/mortgage-data). The cells below also make it easier to download the data, set parameters, and run the notebook.

## Introduction
The Mortgage workflow is composed of three core phases:

1. ETL - Extract, Transform, Load
2. Data Conversion
3. ML - Training

### ETL
Data is 
1. Read in from storage
2. Transformed to emphasize key features
3. Loaded into volatile memory for conversion

### Data Conversion
Features are
1. Broken into (labels, data) pairs
2. Distributed across many workers
3. Converted into compressed sparse row (CSR) matrix format for XGBoost

### Machine Learning
The CSR data is fed into a distributed training session through XGBoost's Dask interface (`xgboost.dask`).

## Import statements

In [1]:
# Necessary for NCCL < 2.4
%env NCCL_P2P_DISABLE=1

env: NCCL_P2P_DISABLE=1


In [2]:
import dask
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.delayed import delayed
from dask.distributed import Client, wait
import cudf

import pynvml
import numpy as np
import xgboost as xgb

from collections import OrderedDict
import gc
from glob import glob
import os

## Set up RMM and Dask_cudf cluster

While `dask_cudf` lets RAPIDS treat all the GPUs in the cluster as a single, large GPU, a dataset far larger than the combined GPU memory of your Dask cluster also needs `RMM` (the RAPIDS Memory Manager). You declare the number of GPU workers below, and by default RMM sets its pool to 50% of each GPU's memory. You can raise the pool size toward 100%, which may improve performance and avoid some out-of-memory errors. Below are suggested `initial_pool_size` values for common card sizes; each leaves some headroom. They are commented out, as is the `initial_pool_size` argument in the cell that sets up the pool. Uncomment the value you want to use, along with that argument.

### Define your GPU Memory Allocation

In [4]:
# Decide initial pool size based on the memory size of the smallest GPU in your cluster and uncomment that line, as well as the initial_pool_size in the cell after
# ips = 6<<30 #8gb card
# ips = 14<<30 #16gb card
# ips = 22<<30 #24gb card
# ips = 29<<30 #32gb card
# ips = 43<<30 #48gb card
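
The `N<<30` values above are GiB-to-byte conversions. As a minimal sketch (the helper names `gib_to_bytes` and `pool_fraction` are illustrative and not part of RMM), the following prints what each option requests and how much of the nominal card size it covers, treating the card sizes as GiB for simplicity:

```python
def gib_to_bytes(gib: int) -> int:
    """Convert GiB to bytes; identical to the `gib << 30` shorthand."""
    return gib << 30


def pool_fraction(pool_gib: int, card_gib: int) -> float:
    """Fraction of the nominal card size taken by the pool."""
    return pool_gib / card_gib


# Nominal card size -> suggested pool size, both taken from the cell above.
options = {8: 6, 16: 14, 24: 22, 32: 29, 48: 43}

for card, pool in options.items():
    share = pool_fraction(pool, card)
    print(f"{card} GB card: ips = {gib_to_bytes(pool)} bytes ({share:.0%} of nominal size)")
```

Size the pool for the smallest GPU in the cluster, since every worker receives the same `initial_pool_size`.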


### Define the number of GPU workers that you have

In [5]:
n_workers = 4  # Please change your n_workers amount to the number of GPUs you have

### Set up RMM pool and start Dask cluster

In [6]:
def initialize_rmm_pool():
    import rmm
    rmm.reinitialize(pool_allocator=True, 
                     #initial_pool_size=ips, #RMM defaults to 50% GPU memory.  You may get further performance by increasing the pool size per GPU.  Do not go above your smallest GPU mem size
                     managed_memory=True)

def initialize_rmm_no_pool():
    import rmm

    rmm.reinitialize(pool_allocator=False, managed_memory=True)

In [7]:
dask.config.set({'distributed.scheduler.work-stealing': False})
dask.config.get('distributed.scheduler.work-stealing')
dask.config.set({'distributed.scheduler.bandwidth': 1})
dask.config.get('distributed.scheduler.bandwidth')
cluster = LocalCUDACluster(n_workers=n_workers, threads_per_worker=1) 
print(cluster)
client = Client(cluster)
client.restart()
client.run(initialize_rmm_pool)
client

LocalCUDACluster('tcp://127.0.0.1:43807', workers=4, threads=4, memory=270.40 GB)


Client: Scheduler: tcp://127.0.0.1:43807, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 270.40 GB


## Get Data

We've made it easy for you to download the data and start your ETL.  Please select the range of years that you want to analyze.  The more years you pick, the longer the download will take.  **If you already have the data, please edit and run only the first cell.**

The [RAPIDS Datasets Homepage](https://docs.rapids.ai/datasets/mortgage-data) gives further information about the datasets, such as their sizes.  Our largest dataset is 196GB uncompressed, so download a dataset that fits in your available storage.

In [8]:
# Edit and uncomment `end_year` if you already have the dataset you want downloaded.  Please keep track of that end year.  
# If you don't have the data, you can use the below cells to download the size you want
path = os.getcwd()
data_dir = path + "/data/mortgage/" #your folder where the mortgage data is located.
start_year = 2000
print(data_dir)
#end_year = 2000 #uncomment only if you have the data downloaded to your machine already and know your end year.   

/raid/tdyer/rapids-h2o/mortgage/data/mortgage/


If you already have a dataset, declare its end year below and do not:
- download the data again
- untar your data, if you have already done so.

In [9]:
end_year = 2007

If you don't have the dataset, choose a span of years to download by commenting out our default and uncommenting the span that you want.  Our default is the two-year dataset.

In [None]:
#One year
# !curl http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz --create-dirs -o data/mortgage/mortgage_compressed.tgz
# end_year = 2000


In [None]:
#Two years
!curl http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2001.tgz --create-dirs -o data/mortgage/mortgage_compressed.tgz
end_year = 2001

In [None]:
#Four years
# !curl http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2003.tgz --create-dirs -o data/mortgage/mortgage_compressed.tgz
# end_year = 2003

In [6]:
#Eight years
# !curl http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2007.tgz --create-dirs -o data/mortgage/mortgage_compressed.tgz
# end_year = 2007



In [None]:
#Sixteen years
# !curl http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2015.tgz --create-dirs -o data/mortgage/mortgage_compressed.tgz
# end_year = 2015

In [None]:
#Seventeen years
# !curl http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2016.tgz --create-dirs -o data/mortgage/mortgage_compressed.tgz
# end_year = 2016
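
The download cells above all follow one URL pattern. A minimal sketch of that pattern, assuming only the six archives listed above exist (the names `dataset_url`, `BASE_URL`, and `AVAILABLE_END_YEARS` are illustrative):

```python
BASE_URL = "http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data"

# End years of the archives referenced in the download cells; all start in 2000.
AVAILABLE_END_YEARS = (2000, 2001, 2003, 2007, 2015, 2016)


def dataset_url(end_year: int) -> str:
    """Build the archive URL for the span 2000..end_year."""
    if end_year not in AVAILABLE_END_YEARS:
        raise ValueError(f"no archive listed for end year {end_year}")
    # The one-year archive is named by its single year; the others by a range.
    span = "2000" if end_year == 2000 else f"2000-{end_year}"
    return f"{BASE_URL}/mortgage_{span}.tgz"


print(dataset_url(2001))
```

Whichever archive you download, set `end_year` to the matching value so that the ETL loop stops at the last year you actually have.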

In [7]:
#untar your data
!tar -xvzf data/mortgage/mortgage_compressed.tgz -C data/mortgage/

names.csv
acq/Acquisition_2007Q4.txt
acq/Acquisition_2007Q3.txt
acq/Acquisition_2007Q2.txt
acq/Acquisition_2007Q1.txt
acq/Acquisition_2006Q4.txt
acq/Acquisition_2006Q3.txt
acq/Acquisition_2006Q2.txt
acq/Acquisition_2006Q1.txt
acq/Acquisition_2005Q4.txt
acq/Acquisition_2005Q3.txt
acq/Acquisition_2005Q2.txt
acq/Acquisition_2005Q1.txt
acq/Acquisition_2004Q4.txt
acq/Acquisition_2004Q3.txt
acq/Acquisition_2004Q2.txt
acq/Acquisition_2004Q1.txt
acq/Acquisition_2003Q4.txt
acq/Acquisition_2003Q3.txt
acq/Acquisition_2003Q2.txt
acq/Acquisition_2003Q1.txt
acq/Acquisition_2002Q4.txt
acq/Acquisition_2002Q3.txt
acq/Acquisition_2002Q2.txt
acq/Acquisition_2002Q1.txt
acq/Acquisition_2001Q4.txt
acq/Acquisition_2001Q3.txt
acq/Acquisition_2001Q2.txt
acq/Acquisition_2001Q1.txt
acq/Acquisition_2000Q4.txt
acq/Acquisition_2000Q3.txt
acq/Acquisition_2000Q2.txt
acq/Acquisition_2000Q1.txt
perf/Performance_2007Q4.txt
perf/Performance_2007Q3.txt
perf/Performance_2007Q2.txt
perf/Performance_2007Q1.txt
perf/Performan

#### Define the paths to data and set the size of the dataset

The cell below builds the paths to the acquisition files, the performance files, and the seller-name mapping under `data_dir`. Visit the [RAPIDS Datasets Homepage](https://docs.rapids.ai/datasets/mortgage-data) for details on the other year ranges.

In [10]:
acq_data_path = data_dir + "acq"
perf_data_path = data_dir + "perf"
col_names_path = data_dir + "names.csv"

In [11]:
#test to see you are reading from the proper directory.  You will see an output if you are.
temp = cudf.read_csv(col_names_path)
print(temp.head())
del temp



                                   seller_name"|"new
0                        WITMER FUNDING|LLC"|"Witmer
1  WELLS FARGO CREDIT RISK TRANSFER SECURITIES TR...
2                 WELLS FARGO BANK| NA"|"Wells Fargo
3                WELLS FARGO BANK|N.A."|"Wells Fargo
4                  WELLS FARGO BANK|NA"|"Wells Fargo


#### Define functions to encapsulate the workflow into a single call

In [12]:
#You can edit your number of dask_cudf partitions here, depending on how many GPUs you have and how big your dataset is.  The number chosen can boost or reduce performance
n_partitions = 40

In [13]:
def process_quarter_gpu(year=2000, quarter=1, perf_file=""):
    ml_arrays = run_gpu_workflow(quarter=quarter,year=year, perf_file=perf_file)
    return ml_arrays

def null_workaround(df, **kwargs):
    for column, data_type in df.dtypes.items():
        if str(data_type) == "str":
            df[column] = df[column].astype('int32').fillna(-1)
        if str(data_type) == "category":
            df[column] = df[column].astype('int32').fillna(-1)
        if str(data_type) in ['int8', 'int16', 'int32', 'int64', 'float32', 'float64']:
            df[column] = df[column].fillna(np.dtype(data_type).type(-1))
    return df

def gpu_load_performance_csv(performance_path, **kwargs):
    """ Loads performance data

    Returns
    -------
    Dask GPU DataFrame
    """
    
    cols = [
        "loan_id", "monthly_reporting_period", "servicer", "interest_rate", "current_actual_upb",
        "loan_age", "remaining_months_to_legal_maturity", "adj_remaining_months_to_maturity",
        "maturity_date", "msa", "current_loan_delinquency_status", "mod_flag", "zero_balance_code",
        "zero_balance_effective_date", "last_paid_installment_date", "foreclosed_after",
        "disposition_date", "foreclosure_costs", "prop_preservation_and_repair_costs",
        "asset_recovery_costs", "misc_holding_expenses", "holding_taxes", "net_sale_proceeds",
        "credit_enhancement_proceeds", "repurchase_make_whole_proceeds", "other_foreclosure_proceeds",
        "non_interest_bearing_upb", "principal_forgiveness_upb", "repurchase_make_whole_proceeds_flag",
        "foreclosure_principal_write_off_amount", "servicing_activity_indicator"
    ]
    
    dtypes = OrderedDict([
        ("loan_id", "int64"),
        ("monthly_reporting_period", "date"),
        ("servicer", "str"),
        ("interest_rate", "float64"),
        ("current_actual_upb", "float64"),
        ("loan_age", "float64"),
        ("remaining_months_to_legal_maturity", "float64"),
        ("adj_remaining_months_to_maturity", "float64"),
        ("maturity_date", "date"),
        ("msa", "float64"),
        ("current_loan_delinquency_status", "int32"),
        ("mod_flag", "category"),
        ("zero_balance_code", "str"),
        ("zero_balance_effective_date", "date"),
        ("last_paid_installment_date", "date"),
        ("foreclosed_after", "date"),
        ("disposition_date", "date"),
        ("foreclosure_costs", "float64"),
        ("prop_preservation_and_repair_costs", "float64"),
        ("asset_recovery_costs", "float64"),
        ("misc_holding_expenses", "float64"),
        ("holding_taxes", "float64"),
        ("net_sale_proceeds", "float64"),
        ("credit_enhancement_proceeds", "float64"),
        ("repurchase_make_whole_proceeds", "float64"),
        ("other_foreclosure_proceeds", "float64"),
        ("non_interest_bearing_upb", "float64"),
        ("principal_forgiveness_upb", "float64"),
        ("repurchase_make_whole_proceeds_flag", "str"),
        ("foreclosure_principal_write_off_amount", "float64"),
        ("servicing_activity_indicator", "category")
    ])

    print(performance_path)
    pdf = dask_cudf.read_csv(performance_path, names=cols, delimiter='|', dtype=list(dtypes.values()), header= True, npartitions = n_partitions) 
    return pdf

def gpu_load_acquisition_csv(acquisition_path, **kwargs):
    """ Loads acquisition data

    Returns
    -------
    Dask GPU DataFrame
    """
    
    cols = [
        'loan_id', 'orig_channel', 'seller_name', 'orig_interest_rate', 'orig_upb', 'orig_loan_term', 
        'orig_date', 'first_pay_date', 'orig_ltv', 'orig_cltv', 'num_borrowers', 'dti', 'borrower_credit_score', 
        'first_home_buyer', 'loan_purpose', 'property_type', 'num_units', 'occupancy_status', 'property_state',
        'zip', 'mortgage_insurance_percent', 'product_type', 'coborrow_credit_score', 'mortgage_insurance_type', 
        'relocation_mortgage_indicator'
    ]
    
    dtypes = OrderedDict([
        ("loan_id", "int64"),
        ("orig_channel", "str"),
        ("seller_name", "str"),
        ("orig_interest_rate", "float64"),
        ("orig_upb", "int64"),
        ("orig_loan_term", "int64"),
        ("orig_date", "date"),
        ("first_pay_date", "date"),
        ("orig_ltv", "float64"),
        ("orig_cltv", "float64"),
        ("num_borrowers", "float64"),
        ("dti", "float64"),
        ("borrower_credit_score", "float64"),
        ("first_home_buyer", "str"),
        ("loan_purpose", "str"),
        ("property_type", "str"),
        ("num_units", "int64"),
        ("occupancy_status", "str"),
        ("property_state", "str"),
        ("zip", "int64"),
        ("mortgage_insurance_percent", "float64"),
        ("product_type", "category"),
        ("coborrow_credit_score", "float64"),
        ("mortgage_insurance_type", "float64"),
        ("relocation_mortgage_indicator", "str")
    ])
    
    print(acquisition_path)
    adf = dask_cudf.read_csv(acquisition_path, names=cols, delimiter='|', dtype=list(dtypes.values()), header= True, npartitions = 1)
    return adf

def gpu_load_names(**kwargs):
    """ Loads names used for renaming the banks
    
    Returns
    -------
    Dask GPU DataFrame
    """

    cols = [
        'seller_name', 'new'
    ]
    
    dtypes = OrderedDict([
        ("seller_name", "str"),
        ("new", "category"),
    ])
    ndf = dask_cudf.read_csv(col_names_path, names=cols, delimiter='|', dtype=list(dtypes.values()), header= True, npartitions = 1)
    return ndf
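
The loaders above supply the column names and dtypes themselves and split each line on `|`. To illustrate the idea without a GPU, here is a minimal sketch using Python's standard `csv` module. The three-column layout and the sample rows are hypothetical (the real files have many more columns), and empty fields are mapped to `None` to stand in for nulls:

```python
import csv
import io

# Hypothetical excerpt of the acquisition schema: column name -> converter.
CASTS = {"loan_id": int, "orig_channel": str, "orig_interest_rate": float}


def parse_pipe_delimited(text, casts=CASTS):
    """Parse pipe-delimited rows, naming and typing each field explicitly."""
    rows = []
    for raw in csv.reader(io.StringIO(text), delimiter="|"):
        row = {}
        for (name, cast), value in zip(casts.items(), raw):
            # An empty field is treated as a missing value.
            row[name] = cast(value) if value != "" else None
        rows.append(row)
    return rows


sample = "100001|R|6.5\n100002|C|\n"
for row in parse_pipe_delimited(sample):
    print(row)
```

Reading such a file with the default comma delimiter would put each whole line into a single column, which is why the delimiter is set explicitly.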

In [14]:
def create_ever_features(gdf, **kwargs):
    everdf = gdf[['loan_id', 'current_loan_delinquency_status']]
    everdf = everdf.groupby('loan_id').max()
    everdf = everdf.reset_index()
    del(gdf)
    everdf['ever_30'] = (everdf['current_loan_delinquency_status'] >= 1).astype('int8')
    everdf['ever_90'] = (everdf['current_loan_delinquency_status'] >= 3).astype('int8')
    everdf['ever_180'] = (everdf['current_loan_delinquency_status'] >= 6).astype('int8')
    everdf = everdf.drop('current_loan_delinquency_status', axis=1)
    return everdf
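
`create_ever_features` takes the worst delinquency status each loan ever reported and turns it into three flags. A minimal pure-Python sketch of the same logic, on made-up records (the function name `ever_flags` is illustrative):

```python
def ever_flags(records):
    """records: iterable of (loan_id, delinquency_status), one per monthly report."""
    worst = {}
    for loan_id, status in records:
        # Equivalent of groupby('loan_id').max()
        worst[loan_id] = max(worst.get(loan_id, status), status)
    return {
        loan_id: {
            "ever_30": int(w >= 1),   # at least one month behind
            "ever_90": int(w >= 3),   # at least three months behind
            "ever_180": int(w >= 6),  # at least six months behind
        }
        for loan_id, w in worst.items()
    }


reports = [(1, 0), (1, 3), (1, 1), (2, 0), (3, 7)]
print(ever_flags(reports))
```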

In [15]:
def create_delinq_features(gdf, **kwargs):
    """ Gathers loans with 30, 90 and 180 day delinquency status and merges them into one dataframe
    
    Returns
    -------
    Dask GPU DataFrame
    """
    delinq_gdf = gdf[['loan_id', 'monthly_reporting_period', 'current_loan_delinquency_status']]
    del(gdf)
    delinq_30 = delinq_gdf.query('current_loan_delinquency_status >= 1')[['loan_id', 'monthly_reporting_period']].groupby('loan_id').min()
    delinq_30 = delinq_30.reset_index()
    delinq_30['delinquency_30'] = delinq_30['monthly_reporting_period']
    delinq_30 = delinq_30.drop('monthly_reporting_period', axis=1)
    delinq_90 = delinq_gdf.query('current_loan_delinquency_status >= 3')[['loan_id', 'monthly_reporting_period']].groupby('loan_id').min()
    delinq_90 = delinq_90.reset_index()
    delinq_90['delinquency_90'] = delinq_90['monthly_reporting_period']
    delinq_90 = delinq_90.drop('monthly_reporting_period', axis=1)
    delinq_180 = delinq_gdf.query('current_loan_delinquency_status >= 6')[['loan_id', 'monthly_reporting_period']].groupby('loan_id').min()
    delinq_180 = delinq_180.reset_index()
    delinq_180['delinquency_180'] = delinq_180['monthly_reporting_period']
    delinq_180 = delinq_180.drop('monthly_reporting_period', axis=1)
    del(delinq_gdf)
    delinq_merge = delinq_30.merge(delinq_90, on=['loan_id'])
    delinq_merge['delinquency_90'] = delinq_merge['delinquency_90'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    delinq_merge = delinq_merge.merge(delinq_180, on=['loan_id'])
    delinq_merge['delinquency_180'] = delinq_merge['delinquency_180'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    del(delinq_30)
    del(delinq_90)
    del(delinq_180)
    return delinq_merge
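
Each `query(...).groupby('loan_id').min()` above finds the first reporting period in which a loan reached a given delinquency threshold. A minimal pure-Python sketch of that step, on made-up records and assuming ISO-formatted date strings (which sort chronologically); `first_delinquency` is an illustrative name:

```python
def first_delinquency(records, threshold):
    """records: iterable of (loan_id, reporting_period, status).

    Returns {loan_id: earliest period with status >= threshold}.
    Loans that never reach the threshold are absent from the result.
    """
    first = {}
    for loan_id, period, status in records:
        if status >= threshold:
            if loan_id not in first or period < first[loan_id]:
                first[loan_id] = period
    return first


reports = [
    (1, "2000-03-01", 1),
    (1, "2000-02-01", 1),
    (1, "2000-05-01", 3),
    (2, "2000-02-01", 0),
]
print(first_delinquency(reports, 1))  # threshold used for delinquency_30
print(first_delinquency(reports, 3))  # threshold used for delinquency_90
```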

In [16]:
def join_ever_delinq_features(everdf_tmp, delinq_merge, **kwargs):
    everdf = everdf_tmp.merge(delinq_merge, on=['loan_id'])
    del(everdf_tmp)
    del(delinq_merge)
    everdf['delinquency_30'] = everdf['delinquency_30'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    everdf['delinquency_90'] = everdf['delinquency_90'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    everdf['delinquency_180'] = everdf['delinquency_180'].fillna(np.dtype('datetime64[ms]').type('1970-01-01').astype('datetime64[ms]'))
    return everdf

In [17]:
def create_joined_df(gdf, everdf, **kwargs):
    test = gdf[['loan_id', 'monthly_reporting_period', 'current_loan_delinquency_status', 'current_actual_upb']]
    del(gdf)
    test['timestamp'] = test['monthly_reporting_period']
    test= test.drop('monthly_reporting_period', axis=1)
    test['timestamp_month'] = test['timestamp'].dt.month
    test['timestamp_year'] = test['timestamp'].dt.year
    test['delinquency_12'] = test['current_loan_delinquency_status']
    test = test.drop('current_loan_delinquency_status', axis=1)
    test['upb_12'] = test['current_actual_upb']
    test = test.drop('current_actual_upb', axis=1)
    test['upb_12'] = test['upb_12'].fillna(999999999)
    test['delinquency_12'] = test['delinquency_12'].fillna(-1)
    
    joined_df = test.merge(everdf, on=['loan_id'])
    del(everdf)
    del(test)
    
    joined_df['ever_30'] = joined_df['ever_30'].fillna(-1)
    joined_df['ever_90'] = joined_df['ever_90'].fillna(-1)
    joined_df['ever_180'] = joined_df['ever_180'].fillna(-1)
    joined_df['delinquency_30'] = joined_df['delinquency_30'].fillna(-1)
    joined_df['delinquency_90'] = joined_df['delinquency_90'].fillna(-1)
    joined_df['delinquency_180'] = joined_df['delinquency_180'].fillna(-1)
    
    joined_df['timestamp_year'] = joined_df['timestamp_year'].astype('int32')
    joined_df['timestamp_month'] = joined_df['timestamp_month'].astype('int32')
    
    return joined_df

In [18]:
def create_12_mon_features(joined_df, **kwargs):
    testdfs = []
    n_months = 12
    for y in range(1, n_months + 1):
        tmpdf = joined_df[['loan_id', 'timestamp_year', 'timestamp_month', 'delinquency_12', 'upb_12']]
        tmpdf['josh_months'] = tmpdf['timestamp_year'] * 12 + tmpdf['timestamp_month']
        tmpdf['josh_mody_n'] = ((tmpdf['josh_months'].astype('float64') - 24000 - y) / 12).astype('int64')
        tmpdf = tmpdf.groupby(['loan_id', 'josh_mody_n']).agg({'delinquency_12': 'max','upb_12': 'min'})
        tmpdf = tmpdf.reset_index()
        tmpdf['delinquency_12'] = (tmpdf['delinquency_12']>3).astype('int32')
        tmpdf['delinquency_12'] +=(tmpdf['upb_12']==0).astype('int32')
        tmpdf['timestamp_year'] = (((tmpdf['josh_mody_n'] * n_months) + 24000 + (y - 1)) / 12).astype('int16')
        tmpdf['timestamp_month'] = np.int8(y)
        tmpdf = tmpdf.drop('josh_mody_n', axis=1)
        testdfs.append(tmpdf)
        del(tmpdf)
    del(joined_df)

    return dask_cudf.concat(testdfs)
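
The window arithmetic in `create_12_mon_features` is compact, so a worked example may help. Months are counted as `year * 12 + month`, and `24000` is `2000 * 12`. For each offset `y`, twelve consecutive months share one window index, and the window is labelled with the year and month in which it starts. The sketch below mirrors that arithmetic in plain Python, assuming the cast to `int64` truncates toward zero as Python's `int()` does (the function names are illustrative):

```python
def window_index(year, month, y):
    """Index of the 12-month window (for offset y) that contains this month."""
    months = year * 12 + month
    return int((months - 24000 - y) / 12)  # int() truncates toward zero


def window_label(n, y):
    """(timestamp_year, timestamp_month) assigned to window n for offset y."""
    year = int((n * 12 + 24000 + (y - 1)) / 12)
    return year, y


# With y=2, February 2000 through January 2001 fall into the same window...
print(window_index(2000, 2, 2), window_index(2001, 1, 2))
# ...and that window is labelled with its starting month, February 2000.
print(window_label(0, 2))
# The next month, February 2001, opens the following window.
print(window_index(2001, 2, 2), window_label(1, 2))
```

Each labelled row therefore carries the maximum delinquency and minimum unpaid balance over the twelve months that start at its own month.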

In [19]:
def combine_joined_12_mon(joined_df, testdf, **kwargs):
    joined_df = joined_df.drop('delinquency_12', axis=1)
    joined_df = joined_df.drop('upb_12', axis=1)
    joined_df['timestamp_year'] = joined_df['timestamp_year'].astype('int16')
    joined_df['timestamp_month'] = joined_df['timestamp_month'].astype('int8')
    return joined_df.merge(testdf, on=['loan_id', 'timestamp_year', 'timestamp_month'])

In [20]:
def final_performance_delinquency(gdf, joined_df, **kwargs):
    merged = null_workaround(gdf)
    joined_df = null_workaround(joined_df)
    joined_df['timestamp_month'] = joined_df['timestamp_month'].astype('int8')
    joined_df['timestamp_year'] = joined_df['timestamp_year'].astype('int16')
    merged['timestamp_month'] = merged['monthly_reporting_period'].dt.month
    merged['timestamp_month'] = merged['timestamp_month'].astype('int8')
    merged['timestamp_year'] = merged['monthly_reporting_period'].dt.year
    merged['timestamp_year'] = merged['timestamp_year'].astype('int16')
    merged = merged.merge(joined_df, on=['loan_id', 'timestamp_year', 'timestamp_month'])
    merged = merged.drop('timestamp_year', axis=1)
    merged = merged.drop('timestamp_month', axis=1)
    return merged

In [21]:
def join_perf_acq_gdfs(perf, acq, **kwargs):
    perf = null_workaround(perf)
    acq = null_workaround(acq)
    return perf.merge(acq, on=['loan_id'])

In [22]:
def last_mile_cleaning(df, **kwargs):
    drop_list = [
        'loan_id', 'orig_date', 'first_pay_date', 'seller_name',
        'monthly_reporting_period', 'last_paid_installment_date', 'maturity_date', 'ever_30', 'ever_90', 'ever_180',
        'delinquency_30', 'delinquency_90', 'delinquency_180', 'upb_12',
        'zero_balance_effective_date','foreclosed_after', 'disposition_date','timestamp'
    ]
    for column in drop_list:
        df = df.drop(column, axis=1)
    for col, dtype in df.dtypes.items():
        if str(dtype) == 'category':
            df[col] = df[col].cat.codes
        df[col] = df[col].astype('float32')
    df['delinquency_12'] = df['delinquency_12'] > 0
    df['delinquency_12'] = df['delinquency_12'].fillna(False).astype('int32')
    for column in df.columns:
        df[column] = df[column].fillna(np.dtype(str(df[column].dtype)).type(-1))
    return df

In [23]:
def run_gpu_workflow(quarter=1, year=2000, perf_file="", **kwargs):
    """ Main function to perform ETL on the data.   
    
    Returns
    -------
    Dask GPU DataFrame
    """
    names = gpu_load_names()
    acq_gdf = gpu_load_acquisition_csv(acquisition_path= acq_data_path + "/Acquisition_"
                                      + str(year) + "Q" + str(quarter) + ".txt")
    acq_gdf = acq_gdf.merge(names, on=['seller_name'], how="left")
    acq_gdf = acq_gdf.drop('seller_name', axis=1)
    acq_gdf['seller_name'] = acq_gdf['new']
    acq_gdf = acq_gdf.drop('new', axis=1)
    perf_df_tmp = gpu_load_performance_csv(perf_file)
    gdf = perf_df_tmp
    everdf = create_ever_features(gdf)
    delinq_merge = create_delinq_features(gdf)
    everdf = join_ever_delinq_features(everdf, delinq_merge)
    del(delinq_merge)
    joined_df = create_joined_df(gdf, everdf)
    testdf = create_12_mon_features(joined_df)
    joined_df = combine_joined_12_mon(joined_df, testdf)
    del(testdf)
    perf_df = final_performance_delinquency(gdf, joined_df)
    del(gdf, joined_df)
    final_gdf = join_perf_acq_gdfs(perf_df, acq_gdf)
    del(perf_df)
    del(acq_gdf)
    final_gdf = last_mile_cleaning(final_gdf)
    return final_gdf

## ETL

#### Perform all of ETL with a single call to
```python
process_quarter_gpu(year=year, quarter=quarter, perf_file=file)
```

In [None]:
%%time

# NOTE: The ETL calculates additional features which are then dropped before creating the XGBoost DMatrix.
# This can be optimized to avoid calculating the dropped features.
part_count = 4

gpu_dfs = []
gpu_time = 0
quarter = 1
year = start_year
count = 0
while year <= end_year:
    for file in glob(os.path.join(perf_data_path + "/Performance_" + str(year) + "Q" + str(quarter) + "*")):
        gpu_dfs.append(process_quarter_gpu(year=year, quarter=quarter, perf_file=file))
        count += 1
    quarter += 1
    if quarter == 5:
        year += 1
        quarter = 1
print("ETL for start_year:{} and end_year:{}\n".format(start_year,end_year))
wait(gpu_dfs)

/raid/tdyer/rapids-h2o/mortgage/data/mortgage/acq/Acquisition_2000Q1.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/perf/Performance_2000Q1.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/acq/Acquisition_2000Q2.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/perf/Performance_2000Q2.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/acq/Acquisition_2000Q3.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/perf/Performance_2000Q3.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/acq/Acquisition_2000Q4.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/perf/Performance_2000Q4.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/acq/Acquisition_2001Q1.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/perf/Performance_2001Q1.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/acq/Acquisition_2001Q2.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/perf/Performance_2001Q2.txt_0
/raid/tdyer/rapids-h2o/mortgage/data/mortgage/acq/Acquisition_2001Q2.txt
/raid/tdyer/rapids-h2o/mortgage/data/mortga
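
The driver loop above visits every quarter from `start_year` through `end_year` in chronological order. As a minimal sketch, the same sequence of file prefixes can be generated with `itertools.product` (the function name and the example directory are illustrative):

```python
from itertools import product


def performance_prefixes(perf_dir, start_year, end_year):
    """Return one glob prefix per (year, quarter), in chronological order."""
    return [
        f"{perf_dir}/Performance_{year}Q{quarter}"
        for year, quarter in product(range(start_year, end_year + 1), range(1, 5))
    ]


prefixes = performance_prefixes("data/mortgage/perf", 2000, 2001)
print(len(prefixes))
print(prefixes[0], prefixes[-1])
```

The notebook appends `*` to each prefix before globbing, which also picks up split files such as `Performance_2001Q2.txt_0`.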

In [None]:
#client.run(initialize_rmm_no_pool)

## XGBoost the Dataset

### Train and test on the dataset

First, we split the data into a training dataframe and a test dataframe.  The last dataframe in `gpu_dfs` becomes the test set, which we persist and wait on here.

In [None]:
%%time
test_df = gpu_dfs[len(gpu_dfs)-1].persist()
_ = wait(test_df)
del gpu_dfs[len(gpu_dfs)-1]
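
The split above holds out the last processed file as the test set and trains on everything before it. A minimal sketch of that hold-out pattern on a plain list (the function name and labels are illustrative):

```python
def split_last_as_test(parts):
    """Hold out the final element as the test set; train on the rest."""
    if len(parts) < 2:
        raise ValueError("need at least two parts to form a train/test split")
    return parts[:-1], parts[-1]


train_parts, test_part = split_last_as_test(["2000Q1", "2000Q2", "2000Q3", "2000Q4"])
print(train_parts, test_part)
```

With only one processed file there would be nothing left to train on, which is why the sketch rejects that case.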

Concatenate all the dask dataframes into one.  We will `.persist()` and `wait` here as a milestone to reduce the persist/wait times later.

In [None]:
%%time
train_df = dask_cudf.concat(gpu_dfs).persist(split_out=n_workers)
_ = wait(train_df)

In [None]:
train_data, test_data = dask.persist(train_df[train_df.columns.difference(["delinquency_12"])], test_df[test_df.columns.difference(["delinquency_12"])])
train_labels, test_labels = dask.persist(train_df["delinquency_12"], test_df["delinquency_12"])
_ = wait(train_data)
_ = wait(test_data)
_ = wait(train_labels)
_ = wait(test_labels)
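
The cell above separates the label column, `delinquency_12`, from the feature columns. A minimal sketch of the column selection, assuming `columns.difference` behaves like pandas' `Index.difference` and returns the remaining names in sorted order (the function name and sample column list are illustrative):

```python
def feature_columns(columns, label="delinquency_12"):
    """All column names except the label, sorted as Index.difference would."""
    return sorted(set(columns) - {label})


cols = ["orig_upb", "delinquency_12", "dti", "borrower_credit_score"]
print(feature_columns(cols))
```

Because the train and test frames go through the same selection, their feature columns end up in the same order.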

In [None]:
# Create dtrain and dtest dask dmatrix
dtrain = xgb.dask.DaskDMatrix(client, 
                              train_data,
                              train_labels, missing=-1    )

dtest = xgb.dask.DaskDMatrix(client, 
                             test_data,
                             test_labels , missing=-1   )

In [None]:
%%time
#Train the model

trained_model = xgb.dask.train(client,
                        {
                          'learning_rate': 0.1,
                          'max_depth': 8,
                          'subsample': 1,
                          'gamma': 0.1,
                          'tree_method': 'gpu_hist',
                          'objective': 'binary:logistic',
                          'grow_policy': 'lossguide',
                        },
                        dtrain,
                        num_boost_round=100, evals=[(dtrain, 'train')])

# Predict on the held-out test set
prediction = xgb.dask.predict(client, trained_model['booster'], dtest)

# Bring the xgboost.dask predictions into local memory as an array
pred = prediction.compute()

Finally, we convert `test_labels` into an array so that we can validate the predictions in `pred` against it with an RMSE score.

In [None]:
%%time
## get test_labels into an array form
ytest= test_labels.compute()
ytest = ytest.astype(np.float32)

## test predictions with RMSE
from cuml.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(ytest, pred))

print("RMSE: ", rmse)
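
For reference, RMSE is the square root of the mean squared difference between labels and predictions. A minimal CPU-only sketch with made-up values (the function name `rmse` is illustrative; the notebook itself uses `cuml.metrics.mean_squared_error`):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [0.0, 1.0, 1.0, 0.0]
predicted_probabilities = [0.5, 0.5, 0.5, 0.5]
print("RMSE:", rmse(labels, predicted_probabilities))
```

Since the model outputs probabilities for a binary label, an RMSE of 0.5 corresponds to always predicting 0.5, so a useful model should score below that.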

Our notebook ends here.  You now have a trained model and its RMSE on the held-out test set.  You can continue with any other analytics you'd like.  Hooray!