# Mortgage Workflow with Deep Learning

## Dataset

The dataset used with this workflow is derived from [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.

The ETL preprocessing has already been run, and its output is located at `/tmp/eoldridge/fnma_full_data_proc_out4/dnn/`.

## PyTorch Deep Neural Network

### Model
The model constructed below starts with an embedding layer ([`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/nn.html#embeddingbag)) that takes the hashed feature indices produced by the ETL pipeline, looks up the corresponding embeddings, and takes their mean. The resulting vector is passed to a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron), which outputs a single score.

Many of the model architecture parameters can be configured by the user, including the embedding dimension, the number and size of the hidden layers, and the activation function.
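The architecture described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the notebook's `MortgageNetwork`: the class name `TinyBagMLP`, the fixed number of indices per example, the ReLU activations, and the tiny sizes are all assumptions made for the example.

```python
import torch
from torch import nn


class TinyBagMLP(nn.Module):
    """Hypothetical sketch: mean-pooled embeddings -> MLP -> one score."""

    def __init__(self, num_features, embedding_size, hidden_dims):
        super().__init__()
        # mode="mean" averages the embeddings of all indices in each bag.
        self.embedding = nn.EmbeddingBag(num_features, embedding_size, mode="mean")
        layers = []
        in_dim = embedding_size
        for out_dim in hidden_dims:
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        layers.append(nn.Linear(in_dim, 1))  # single output score per example
        self.mlp = nn.Sequential(*layers)

    def forward(self, indices):
        # indices: LongTensor of shape (batch, indices_per_example);
        # each row is treated as one bag.
        pooled = self.embedding(indices)      # (batch, embedding_size)
        return self.mlp(pooled).squeeze(-1)   # (batch,)


# Small sizes so the sketch runs quickly on CPU.
sketch_model = TinyBagMLP(num_features=1000, embedding_size=8, hidden_dims=[16, 16])
batch = torch.randint(0, 1000, (4, 5))  # 4 examples, 5 hashed indices each
scores = sketch_model(batch)
print(scores.shape)
```

Passing a 2-D index tensor lets `EmbeddingBag` treat every row as one bag, which avoids building an explicit `offsets` tensor when each example has the same number of indices.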

### Training
To cut down on boilerplate code and to get the benefits of [early stopping](https://en.wikipedia.org/wiki/Early_stopping),
we use the [`ignite`](https://pytorch.org/ignite/) library.
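The idea behind early stopping can be illustrated without any library: stop once the validation score has failed to improve for a fixed number of evaluations. The sketch below is plain Python, not the `ignite` API and not necessarily how `training.py` configures it; the class name `EarlyStopper`, the `patience` value, and the loss values are made up for illustration.

```python
class EarlyStopper:
    """Stop when the score has not improved for `patience` evaluations."""

    def __init__(self, patience):
        self.patience = patience
        self.best_score = None
        self.bad_evals = 0

    def should_stop(self, score):
        # Higher is better, so callers pass e.g. the negative validation loss.
        if self.best_score is None or score > self.best_score:
            self.best_score = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


val_losses = [0.050, 0.042, 0.040, 0.041, 0.043, 0.039]  # illustrative only
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(-loss):
        stopped_at = epoch
        break

print(f"stopped after epoch {stopped_at}")
```

With these values the loop stops after the second consecutive evaluation that does not beat the best loss so far, so the final, lower loss is never reached. That is the usual trade-off when choosing `patience`.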


## Requirements
Beyond the dependencies that come installed in the standard
[RAPIDS docker containers](https://hub.docker.com/r/rapidsai/rapidsai), we'll also
need the following `pip` dependencies installed:

In [1]:
!pip install torch pytorch-ignite

Collecting torch
  Downloading https://files.pythonhosted.org/packages/69/60/f685fb2cfb3088736bafbc9bdbb455327bdc8906b606da9c9a81bae1c81e/torch-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (676.9MB)
     |################################| 676.9MB 20kB/s
Collecting pytorch-ignite
  Downloading https://files.pythonhosted.org/packages/98/7b/1da69e5fdcb70e8f40ff3955516550207d5f5c81b428a5056510e72c60c5/pytorch_ignite-0.2.0-py2.py3-none-any.whl (73kB)
     |################################| 81kB 34.7MB/s
Installing collected packages: torch, pytorch-ignite
Successfully installed pytorch-ignite-0.2.0 torch-1.1.0


In [2]:
!pip install snakeviz

Collecting snakeviz
  Downloading https://files.pythonhosted.org/packages/39/b5/2672d76c4d21debc451aaa4cc49ef5b1af1796d4627184db9f6a3b6a6401/snakeviz-2.0.0-py2.py3-none-any.whl (281kB)
     |################################| 286kB 9.4MB/s
Installing collected packages: snakeviz
Successfully installed snakeviz-2.0.0


## Code
Most of the implementation details live in the accompanying `.py` files (the `training` and `model` modules imported below).

### Imports

In [3]:
from collections import defaultdict, OrderedDict
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
import pyarrow.parquet as pq

In [4]:
import cudf
cudf.__version__

'0.7.2+0.g3ebd286.dirty'

In [5]:
import pdb

In [6]:
%load_ext autoreload
%autoreload 2

## Configuration

#### ETL - Discretization

In [7]:
max_quantiles = 20  # Used for computing histograms of continuous features
num_features = 2 ** 22  # Hashed feature indices fall in the range [0, num_features)
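To make the two settings above concrete, here is a minimal sketch of how a continuous feature might be bucketed into quantiles and then hashed into `[0, num_features)`. The actual ETL code is not shown in this notebook, so the helper names (`quantile_edges`, `hashed_index`), the use of `zlib.crc32` as the hash, and the sample values are assumptions for illustration only.

```python
import bisect
import zlib

max_quantiles = 20
num_features = 2 ** 22


def quantile_edges(values, max_quantiles):
    """Return up to max_quantiles - 1 sorted, de-duplicated cut points."""
    ordered = sorted(values)
    edges = []
    for q in range(1, max_quantiles):
        edge = ordered[(q * len(ordered)) // max_quantiles]
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges


def hashed_index(feature_name, bucket, num_features):
    """Map a (feature, bucket) pair to an index in [0, num_features)."""
    key = f"{feature_name}={bucket}".encode("utf-8")
    return zlib.crc32(key) % num_features  # crc32 is a stand-in hash function


interest_rates = [3.0 + 0.01 * i for i in range(400)]  # made-up sample values
edges = quantile_edges(interest_rates, max_quantiles)
bucket = bisect.bisect_right(edges, 4.25)  # which quantile bucket 4.25 falls in
index = hashed_index("interest_rate", bucket, num_features)
print(len(edges), bucket, index)
```

Hashing keeps the index space at a fixed size regardless of how many distinct feature values appear, at the cost of occasional collisions between unrelated features.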

#### Training - Model Details

In [8]:
embedding_size = 64
hidden_dims = [600,600,600,600]

device = 'cuda'
dropout = None  # Can add dropout probability in [0, 1] here
activation = nn.ReLU()

batch_size = 8096
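As a rough sanity check on these settings, the arithmetic below estimates how many parameters they imply. It assumes (the notebook does not confirm this) that the embedding table has one row per hashed index, that the hidden layers are fully connected with biases, that the network ends in a single output unit, and that weights are stored as float32.

```python
num_features = 2 ** 22
embedding_size = 64
hidden_dims = [600, 600, 600, 600]

# Assumption: one embedding row per hashed feature index.
embedding_params = num_features * embedding_size

# Assumption: fully connected layers with biases, ending in one output unit.
layer_sizes = [embedding_size] + hidden_dims + [1]
mlp_params = sum(n_in * n_out + n_out
                 for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

embedding_gib = embedding_params * 4 / 2 ** 30  # float32 = 4 bytes per weight

print(f"embedding parameters: {embedding_params:,}")
print(f"MLP parameters:       {mlp_params:,}")
print(f"embedding table size: {embedding_gib:.2f} GiB in float32")
```

Under these assumptions the embedding table dominates the model size by more than two orders of magnitude, so `num_features` and `embedding_size` are the settings to revisit first if GPU memory is tight.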

## Torch Dataset from Parquet
The ETL preprocessing has already been run, and its output is stored at `/tmp/eoldridge/fnma_full_data_proc_out4/dnn/`. The training code below reads the data from the directory given by `data_dir`.

In [9]:
data_dir = '/data/mortgage/'
!ls -al --block-size=M /data/mortgage/

total 1M
drwxr-xr-x 1 10128 10004 0M May 29 18:38 .
drwxr-xr-x 3 root  root  1M May 29 18:50 ..
drwxr-xr-x 1 10128 10004 0M May 29 18:38 test
drwxr-xr-x 1 10128 10004 0M May 29 18:38 train
drwxr-xr-x 1 10128 10004 0M May 29 18:38 validation


### Training starts here

In [10]:
from training import run_training
from model import MortgageNetwork

In [11]:
model = None
model = MortgageNetwork(num_features, embedding_size, hidden_dims,
                        dropout=dropout, activation=activation, use_cuda=True)

In [12]:
model.device

device(type='cuda')

In [14]:
%load_ext snakeviz
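`snakeviz` visualizes profiles collected by Python's built-in `cProfile`. Outside a notebook, a comparable profile can be collected with the standard library alone, as in the sketch below. The function `busy_work` is a hypothetical stand-in for an expensive call such as a training run, and the file name `training.prof` is likewise only an example.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Hypothetical stand-in for an expensive call such as a training run."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = busy_work(100_000)
profiler.disable()

# Print the five most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())

# To inspect the profile in snakeviz instead, save it and open it from a shell:
# profiler.dump_stats("training.prof")   # then run: snakeviz training.prof
```

Sorting by cumulative time surfaces the top-level calls that contain the most work, which is usually the quickest way to see whether data loading or the model itself dominates.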

In [15]:
%snakeviz run_training(model, data_dir, batch_size=batch_size, batch_dataload=True, num_workers=8)

Epoch[1] Iteration[63/1067] Loss: 0.04694 Example/s: 122547.646 (Total examples: 510048)
Epoch[1] Iteration[126/1067] Loss: 0.03472 Example/s: 139842.134 (Total examples: 1020096)
Epoch[1] Iteration[189/1067] Loss: 0.04061 Example/s: 146761.253 (Total examples: 1530144)
Epoch[1] Iteration[252/1067] Loss: 0.03693 Example/s: 150355.437 (Total examples: 2040192)
Epoch[1] Iteration[315/1067] Loss: 0.03975 Example/s: 152617.831 (Total examples: 2550240)
Epoch[1] Iteration[378/1067] Loss: 0.03986 Example/s: 154140.674 (Total examples: 3060288)
Epoch[1] Iteration[441/1067] Loss: 0.04132 Example/s: 155262.977 (Total examples: 3570336)
Epoch[1] Iteration[504/1067] Loss: 0.03956 Example/s: 155617.958 (Total examples: 4080384)
Epoch[1] Iteration[567/1067] Loss: 0.03490 Example/s: 156353.267 (Total examples: 4590432)
Epoch[1] Iteration[630/1067] Loss: 0.03293 Example/s: 156952.977 (Total examples: 5100480)
Epoch[1] Iteration[693/1067] Loss: 0.03787 Example/s: 157412.830 (Total examples: 5610528)
E