# Mortgage Workflow with Deep Learning - Basic (No Ignite)

Original workflow by Even Oldridge - Modified by Rick Zamora

## Dataset

The dataset used with this workflow is derived from [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.

The ETL preprocessing has already been run, and its output is located at `/datasets/mortgage/post_etl/dnn/`.

## PyTorch Deep Neural Network

### Model
The model constructed below starts with an embedding layer ([`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/nn.html#embeddingbag)) that takes the hashed feature indices produced by the ETL pipeline, looks up the corresponding embeddings in its embedding table, and takes their mean. The resulting vector is passed to a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron), which outputs a single score.

Many of the model's architecture parameters are user-configurable, including the embedding dimension, the number and size of the hidden layers, and the activation function.
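The architecture described above can be sketched roughly as follows. This is a minimal illustration, not the `MortgageNetwork` implementation itself (which lives in the accompanying `.py` files): the class name `SketchNetwork`, its constructor arguments, and the small sizes used at the bottom are assumptions chosen so the example runs quickly on CPU.

```python
import torch
from torch import nn


class SketchNetwork(nn.Module):
    """Hypothetical stand-in: EmbeddingBag (mean) -> MLP -> one score per row."""

    def __init__(self, num_features, embedding_size, hidden_dims,
                 dropout=0.0, activation=None):
        super().__init__()
        activation = activation if activation is not None else nn.ReLU()
        # Each row of indices is one "bag"; mode="mean" averages its embeddings.
        self.embedding = nn.EmbeddingBag(num_features, embedding_size, mode="mean")
        layers = []
        in_dim = embedding_size
        for out_dim in hidden_dims:
            layers += [nn.Linear(in_dim, out_dim), activation, nn.Dropout(dropout)]
            in_dim = out_dim
        layers.append(nn.Linear(in_dim, 1))  # final layer emits a single score
        self.mlp = nn.Sequential(*layers)

    def forward(self, indices):
        # indices: LongTensor of shape (batch, features_per_row)
        pooled = self.embedding(indices)     # (batch, embedding_size)
        return self.mlp(pooled).squeeze(-1)  # (batch,)


# Small, illustrative sizes (not the notebook's configuration).
sketch = SketchNetwork(num_features=2 ** 10, embedding_size=8, hidden_dims=[16, 16])
sketch_indices = torch.randint(0, 2 ** 10, (4, 5))
sketch_scores = sketch(sketch_indices)
print(sketch_scores.shape)  # torch.Size([4])
```

Passing a 2-D index tensor lets `EmbeddingBag` treat every row as a fixed-length bag, so no `offsets` argument is needed in this sketch.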

## Code
Most of the implementation details are organized in the accompanying `.py` files (the `model` and `training` modules imported below).

### Imports

In [22]:
from collections import defaultdict, OrderedDict
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
import pyarrow.parquet as pq
import cudf
cudf.__version__

'0.8.0a1+388.gbaff98a'

In [23]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Configuration

### ETL - Discretization

In [24]:
num_features = 2 ** 22  # Hashed feature indices fall in the range [0, num_features)
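
As a rough illustration of what hashing features into `[0, num_features)` means, the sketch below maps each (column, value) pair to an integer index. The hash function (`zlib.crc32`), the helper name `hash_feature`, and the column names are all assumptions made for this example; the hash actually used by the ETL pipeline is not shown in this notebook.

```python
import zlib

num_features = 2 ** 22


def hash_feature(column, value, num_features=num_features):
    """Map a (column, value) pair to an index in [0, num_features).

    zlib.crc32 is an illustrative choice, not necessarily what the ETL uses.
    """
    key = f"{column}={value}".encode("utf-8")
    return zlib.crc32(key) % num_features


# Hypothetical column names and values, for illustration only.
row = {"col_a": "x", "col_b": 360, "col_c": 4.25}
indices = [hash_feature(col, val) for col, val in row.items()]
print(indices)
```

Because the index space is fixed, distinct feature values can collide on the same index; a larger `num_features` makes collisions rarer at the cost of a larger embedding table.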

### Training - Model Details

In [25]:
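embedding_size = 64                 # dimension of each embedding vector
hidden_dims = [600, 600, 600, 600]  # four hidden layers of 600 units each
device = 'cuda'
activation = nn.ReLU()
batch_size = 80960                  # samples per training batch
using_docker = False                # selects which data directory is used below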

## Torch Dataset from Parquet
The preprocessed ETL output is read from `/datasets/mortgage/post_etl/dnn/`, or from `/data/mortgage/` when `using_docker` is set.

In [26]:
if using_docker:
    data_dir = '/data/mortgage/'
    !ls -al --block-size=M /data/mortgage/
else:
    data_dir = '/datasets/mortgage/post_etl/dnn/'
    !ls -al --block-size=M /datasets/mortgage/post_etl/dnn/

total 0M
drwxr-xr-x 1 eoldridge nvidia 0M May 29 11:38 .
drwxr-xr-x 1 eoldridge nvidia 0M May 29 11:38 ..
drwxr-xr-x 1 eoldridge nvidia 0M May 29 11:38 test
drwxr-xr-x 1 eoldridge nvidia 0M May 29 11:38 train
drwxr-xr-x 1 eoldridge nvidia 0M May 29 11:38 validation


### Training starts here

In [27]:
from training import run_training
from model import MortgageNetwork

In [28]:
model = MortgageNetwork(
    num_features,
    embedding_size,
    hidden_dims,
    dropout=dropout,  # NOTE: `dropout` is not set in the configuration cells above; define it before running this cell
    activation=activation,
    use_cuda=True
)
model.device

device(type='cuda')

In [29]:
%%time
run_training(
    model,
    data_dir,
    batch_size=batch_size,
    batch_dataload=True,
    num_workers=8
)

Epoch [1/3], Step [100/107] Loss: 0.0470
Epoch [2/3], Step [100/107] Loss: 0.0455
Epoch [3/3], Step [100/107] Loss: 0.0418


IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
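
The traceback for this error is not included, so the failing call inside `run_training` is not visible here. As a general illustration rather than a diagnosis of this run, PyTorch raises this message when an operation addresses dimension 1 of a one-dimensional tensor. The tensor and the `sum` call below are assumptions chosen only to reproduce the message.

```python
import torch

vec = torch.zeros(3)  # 1-D tensor: the only valid dims are -1 and 0

try:
    vec.sum(dim=1)    # dim=1 does not exist on a 1-D tensor
    message = None
except IndexError as err:
    message = str(err)
print(message)

# Adding a leading batch dimension first makes dim=1 valid.
fixed = vec.unsqueeze(0).sum(dim=1)  # shape (1,)
```

A common place to look for this kind of mismatch is wherever a per-sample tensor (such as predictions or labels) loses its batch dimension before a reduction or concatenation along `dim=1`.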