# Dataduit

In this example, we'll demonstrate how to use [dataduit](https://github.com/JackBurdick/dataduit) to create TensorFlow datasets from a pandas DataFrame by specifying a config file.

We'll then demonstrate how to use yeahml to build, train, and evaluate a model on the created data.

#### Note:
> Neither dataduit nor yeahml is installed in the environment. I have both repositories cloned to my machine and am operating in the yeahml directory, including the dataduit project by calling `sys.path.append("../path/to/dataduit")` in the next cell. This will hopefully change in the future. Also, the model for this project likely doesn't make sense: I am not personally familiar with the dataset and only wanted to show that it is possible to use these two libraries together.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("../dataduit/")
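
A slightly more defensive variant of the path setup, as a sketch: it resolves the clone location to an absolute path and only adds it when the directory exists. The helper name `add_local_repo` is hypothetical (it is not part of dataduit or yeahml), and `../dataduit/` is simply the path used in this notebook.

```python
import sys
from pathlib import Path


def add_local_repo(path):
    """Add a locally cloned repository to sys.path if the directory exists.

    Hypothetical helper for illustration only.
    Returns True when the directory was found (and is now on sys.path).
    """
    repo = Path(path).resolve()
    if not repo.is_dir():
        return False
    if str(repo) not in sys.path:
        sys.path.append(str(repo))
    return True


# "../dataduit/" is the clone location assumed in this notebook.
if not add_local_repo("../dataduit/"):
    print("dataduit clone not found; adjust the path before importing")
```

Failing early with a clear message tends to be easier to debug than an `ImportError` a cell later.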

In [2]:
import pandas as pd
import tensorflow as tf
import dataduit as dd
import yeahml as yml

## Create Datasets

In [3]:
# The config file looks something like this.
# Right now the config is a Python dict, but eventually
# we'll be able to write JSON or YAML files that are parsed into this dict.
from pandas_dev import conf_dict
conf_dict

{'meta': {'name': 'albalone',
  'logging': {'log_stream_level': 'INFO'},
  'in': {'from': 'memory', 'type': 'pandas'}},
 'read': {'split_percents': [75, 15, 10],
  'split_names': ['train', 'val', 'test'],
  'iterate': {'return_type': 'tuple',
   'schema': {'x': {'length': {'indicator': 'length',
      'datatype': {'in': {'options': {'dtype': 'float64', 'shape': 1}},
       'out': {}},
      'special': 'decode'},
     'diameter': {'indicator': 'diameter',
      'datatype': {'in': {'options': {'dtype': 'float64', 'shape': 1}},
       'out': {}},
      'special': 'decode'}},
    'y': {'rings': {'datatype': {'in': {'options': {'dtype': 'int64',
         'shape': 1}},
       'out': {}}}}}}}}
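
The nested schema is easier to read once flattened. The sketch below walks a dict shaped like the `schema` entry above and lists each column with its declared input dtype and shape. The function `summarize_schema` is hypothetical and not part of dataduit, and reading `x` as features and `y` as labels is my interpretation of the config.

```python
# Copied from the 'read' -> 'iterate' -> 'schema' part of the output above.
schema = {
    "x": {
        "length": {
            "indicator": "length",
            "datatype": {"in": {"options": {"dtype": "float64", "shape": 1}}, "out": {}},
            "special": "decode",
        },
        "diameter": {
            "indicator": "diameter",
            "datatype": {"in": {"options": {"dtype": "float64", "shape": 1}}, "out": {}},
            "special": "decode",
        },
    },
    "y": {
        "rings": {
            "datatype": {"in": {"options": {"dtype": "int64", "shape": 1}}, "out": {}}
        }
    },
}


def summarize_schema(schema):
    """Return {role: [(column, dtype, shape), ...]} for a schema dict.

    Hypothetical helper: it only reads the 'in' options of each column.
    """
    summary = {}
    for role, columns in schema.items():
        rows = []
        for name, spec in columns.items():
            options = spec["datatype"]["in"]["options"]
            rows.append((name, options["dtype"], options["shape"]))
        summary[role] = rows
    return summary


for role, rows in summarize_schema(schema).items():
    print(role, rows)
# x [('length', 'float64', 1), ('diameter', 'float64', 1)]
# y [('rings', 'int64', 1)]
```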

In [4]:
# Read the abalone data from the UCI repository.
# More information can be found here:
# > https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/
# The raw file has no header row, so the column names are supplied explicitly.
h = [
    "sex",
    "length",
    "diameter",
    "height",
    "whole_weight",
    "shucked_weight",
    "viscera_weight",
    "shell_weight",
    "rings",
]
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
    names=h,
)

In [5]:
# create the datasets based on the names/splits/data specified above
ds_dict = dd.read(conf_dict, df)

`ds_dict` is just a dictionary containing the TensorFlow datasets, keyed by split name. Each dataset can be accessed like this:

```python
ds_val = ds_dict["val"]
```
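
To get a feel for how large each split is, the `split_percents` of `[75, 15, 10]` can be turned into row counts. This is a rough sketch, not dataduit's implementation: it assumes the abalone file has 4177 rows (the instance count listed by UCI), floors each share, and gives any leftover rows to the last split. dataduit may round differently. `split_counts` is a hypothetical helper.

```python
def split_counts(n_rows, percents):
    """Turn split percentages into row counts that sum to n_rows.

    Assumption: floor each share and add leftover rows to the last split.
    """
    if sum(percents) != 100:
        raise ValueError("split percentages must sum to 100")
    counts = [n_rows * p // 100 for p in percents]
    counts[-1] += n_rows - sum(counts)
    return counts


names = ["train", "val", "test"]
counts = split_counts(4177, [75, 15, 10])
print(dict(zip(names, counts)))
# {'train': 3132, 'val': 626, 'test': 419}
```

In the notebook itself, `len(df)` could be passed instead of the hard-coded row count.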

## Specify the Model

In [6]:
example = "./examples/abalone/main_config.yml"
config_dict = yml.create_configs(example)

The config looks something like this:

In [7]:
!cat ./examples/abalone/main_config.yml

meta:
  name: 'abalone'
  experiment_dir: 'trial_00'
  # TODO: information on when to save params, currently only best params saved
logging:
  console:
    level: 'info'
    format_str: null
  file:
    level: 'ERROR'
    format_str: null
  graph_spec: True

performance:
  loss_fn: 
    type: 'MSE'
  type: ["MeanSquaredError", "MeanAbsoluteError"]
  options: [null, 
            null]

# TODO: this section needs to be redone
data:
  in:
    dim: [2,1]
    dtype: 'float64'
  label:
    dim: [1]
    dtype: 'int32'

hyper_parameters:
  optimizer: 
    type: 'adam'
    learning_rate: 0.0001
  epochs: 30
  dataset:
    # TODO: I would like to make this logic more abstract
    # I think the only options that should be applied here are "batch" and "shuffle"
    batch: 16
    shuffle: 128 # this should be grouped with batchsize
model:
  path: './examples/abalone/model_config.yml'



And the model config looks like this:

In [8]:
!cat ./examples/abalone/model_config.yml

meta:
  name: "model_a"
  name_override: True
  activation:
    type: 'elu'

layers:
  dense_1:
    type: 'dense'
    options:
      units: 16
  dense_2:
    type: 'dense'
    options:
      units: 8
      activation:
        type: 'linear'
  dense_3_output:
    type: 'dense'
    options:
      units: 1
      activation:
        type: 'linear'

## Build the model

In [9]:
model = yml.build_model(config_dict)

build_logger: INFO     -> START building graph
build_logger: INFO     -> START building hidden block
graph_logger: INFO     | dense_1         | no_shape
graph_logger: INFO     | dense_2         | no_shape
graph_logger: INFO     | dense_3_output  | no_shape
build_logger: INFO     [END] building hidden block
build_logger: INFO     information json file created


In [10]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 2, 1)]            0         
_________________________________________________________________
dense_1 (Dense)              (None, 2, 16)             32        
_________________________________________________________________
dense_2 (Dense)              (None, 2, 8)              136       
_________________________________________________________________
dense_3_output (Dense)       (None, 2, 1)              9         
Total params: 177
Trainable params: 177
Non-trainable params: 0
_________________________________________________________________
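
The parameter counts in the summary can be checked by hand. A dense layer with a bias has `in_features * units + units` parameters, where `in_features` is the size of the last axis of its input. The sketch below applies that formula to the three layers of the model config; `dense_params` is a hypothetical helper, not a yeahml or Keras function.

```python
def dense_params(in_features, units, use_bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return in_features * units + (units if use_bias else 0)


# (layer name, units), taken from the model config above.
layer_units = [("dense_1", 16), ("dense_2", 8), ("dense_3_output", 1)]

in_features = 1  # last axis of the input shape (None, 2, 1)
counts = {}
for name, units in layer_units:
    counts[name] = dense_params(in_features, units)
    in_features = units  # the next layer sees this layer's output width

total = sum(counts.values())
print(counts, total)
# {'dense_1': 32, 'dense_2': 136, 'dense_3_output': 9} 177
```

Because a dense layer acts only on the last axis, the middle dimension of size 2 is carried through unchanged, which is why every output shape in the summary is `(None, 2, ...)`.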


## Train the Model

Notice here that we're using the training and validation sets created earlier and stored in `ds_dict`.

In [11]:
train_dict = yml.train_model(model, config_dict, (ds_dict["train"], ds_dict["val"]))

train_logger: INFO     -> START training graph
W0115 17:23:33.572206 140689008121664 base_layer.py:1814] Layer dense_1 is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

train_logger: INFO     start creating train_dict
train_logger: INFO     [END] creating train_dict


## Evaluate the Model

In [12]:
eval_dict = yml.eval_model(
    model,
    config_dict,
    dataset=ds_dict["test"]
)
print(eval_dict)

eval_logger : INFO     params loaded from examples/abalone/trial_00/model_a/save/params/run_2020_01_15-17_23_33/best_params.h5
eval_logger : INFO     -> START evaluating model
eval_logger : INFO     [END] evaluating model
eval_logger : INFO     -> START creating eval_dict
eval_logger : INFO     [END] creating eval_dict


{'meansquarederror': 6.5413327, 'meanabsoluteerror': 1.9091898}
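
To put these numbers in context: mean squared error is in squared units (rings squared), so its square root is often easier to interpret next to the mean absolute error. The sketch below defines both metrics in plain Python, applies them to a tiny made-up example, and converts the reported MSE to an RMSE. The sample values are invented for illustration and are not taken from the dataset.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ring counts and predictions, only to show the two formulas.
y_true = [10, 8, 12]
y_pred = [9, 8, 15]
print(mse(y_true, y_pred), mae(y_true, y_pred))

# RMSE for the MSE reported in eval_dict above.
rmse = math.sqrt(6.5413327)
print(round(rmse, 2))
# 2.56
```

An RMSE of about 2.56 rings alongside an MAE of about 1.91 rings suggests that a few larger errors pull the squared metric up, since RMSE weights large errors more heavily than MAE.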


## Inspect the Model in TensorBoard

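From the command line, navigate to the experiment directory that contains `model_a/` (in this run, `examples/abalone/trial_00`, as shown in the evaluation log path) and run the following, provided TensorBoard is installed in your environment: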

```bash
tensorboard --logdir model_a/
```