
TF2 porting: eager mode training and evaluation, numerical and binary features #646

Merged
merged 72 commits into ludwig-ai:tf2_porting on Mar 27, 2020

Conversation

jimthompson5802
Collaborator

Reopening PR for TF2 porting...

I'm hoping this posting provides some evidence of progress. With your last guidance, I was able to get a "minimal training loop" working with model subclassing and eager execution.

I wanted to offer you an early look at how I'm adapting Ludwig training to TF2 eager execution.

This commit demonstrates the minimal training loop for this model_definition:

input_features = [
    {'name': 'x1', 'type': 'numerical', 'preprocessing': {'normalization': 'zscore'}},
    {'name': 'x2', 'type': 'numerical', 'preprocessing': {'normalization': 'zscore'}},
    {'name': 'x3', 'type': 'numerical', 'preprocessing': {'normalization': 'zscore'}}
]
output_features = [
    {'name': 'y', 'type': 'numerical'}
]

model_definition = {
    'input_features': input_features,
    'output_features': output_features,
    'combiner': {
        'type': 'concat',
        'num_fc_layers': 5,
        'fc_size': 64
    },
    'training': {'epochs': 100}
}
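
For context, a rough sketch of how this model_definition might be driven end to end, assuming the full_train() entry point referenced later in this thread and a training CSV with columns x1, x2, x3, and y (the path is an assumption):

# Sketch only, not part of the PR: drive the definition above through
# Ludwig's full_train() entry point used elsewhere in this thread.
from ludwig.train import full_train

exp_dir_name = full_train(
    model_definition=model_definition,
    data_csv='./data2/train.csv'
)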

The main result of this "minimal training loop" is demonstrating:

  • not blowing up while running the specified number of epochs
  • creating the NumPy arrays used for training from the input features and output features specified in model_definition
  • reduction in the loss function during training

Here is an excerpt of the log file for the minimal training loop:


Epoch   1

Epoch   1
Training: 100%|██████████| 6/6 [00:00<00:00, 50.40it/s]
Epoch 1, train Loss: 9732.708984375, : train metric 9743.146484375

Epoch   2

Epoch   2
Training: 100%|██████████| 6/6 [00:00<00:00, 77.71it/s]
Epoch 2, train Loss: 9711.203125, : train metric 9721.9404296875

Epoch   3

Epoch   3
Training: 100%|██████████| 6/6 [00:00<00:00, 80.07it/s]
Epoch 3, train Loss: 9678.2998046875, : train metric 9689.5517578125

<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>

Epoch  50

Epoch  50
Training: 100%|██████████| 6/6 [00:00<00:00, 80.76it/s]
Epoch 50, train Loss: 1540.60400390625, : train metric 1547.77294921875

Epoch  51

Epoch  51
Training: 100%|██████████| 6/6 [00:00<00:00, 75.30it/s]
Epoch 51, train Loss: 1510.5498046875, : train metric 1517.5792236328125

Epoch  52

Epoch  52
Training: 100%|██████████| 6/6 [00:00<00:00, 75.37it/s]
Epoch 52, train Loss: 1481.64404296875, : train metric 1488.5391845703125

<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Epoch  98

Epoch  98
Training: 100%|██████████| 6/6 [00:00<00:00, 78.33it/s]
Epoch 98, train Loss: 787.4824829101562, : train metric 791.1544189453125

Epoch  99

Epoch  99
Training: 100%|██████████| 6/6 [00:00<00:00, 75.26it/s]
Epoch 99, train Loss: 779.5379028320312, : train metric 783.1728515625

Epoch 100

Epoch 100
Training: 100%|██████████| 6/6 [00:00<00:00, 83.06it/s]
Epoch 100, train Loss: 771.7520141601562, : train metric 775.3506469726562

Here is the entire eager execution log file:
proof-of-concept_log_tf2_eager_exec.txt.

Now for the limitations and caveats of the current state of the code:

  • Training loss reduction is not as fast as in the TF1 implementation. Here is the TF1 log file for the same model:
    proof-of-concept_log_tf1.txt.
  • Large sections of code in the model.train() method are commented out. I envision re-enabling and modifying the commented code as work progresses.
  • Several items were hard-coded in this implementation to minimize the amount of change required just to demonstrate the "training loop". The hard-coded items are:
    • The encoder structure is hard-coded in the model.call() method. This will change to reflect the encoders/decoders specified in the model_definition.
    • The objective, loss, and metric functions are hard-coded. Future work will build these from the model_definition.
  • One of the commented-out sections of code creates these data structures. Without these structures, Ludwig processing after the training loop terminates abnormally. This will be fixed as the work progresses. Actually, I'm thinking this is the next thing to fix with the simple model I'm using for testing. If I can get these enabled, then the rest of Ludwig processing "should work".
    • progress_tracker.train_stats
    • progress_tracker.vali_stats
    • progress_tracker.test_stats

@jimthompson5802
Collaborator Author

@w4nderlust One item I forgot to point out in my initial posting.

In the current implementation, the ludwig.models.Model class is now a subclass of the tensorflow.keras.models.Model class. This seemed like the path of least resistance to get something going.

But this now ties Ludwig's Model class to tensorflow.keras.models.Model. I probably should have checked with you beforehand.

Would you rather have Ludwig's Model be independent of tensorflow.keras?

Or is Ludwig's Model being a subclass of the tensorflow.keras Model good enough for now? We'll deal with separating the two later if needed.

@w4nderlust
Collaborator

You shouldn't mind the difference in convergence speed for now; the learning rate and other parameters may be different, so don't worry.

Regarding the structure of the code, I think Model should not extend the Keras model because they are somewhat different concepts: in Keras, a model is something that produces predictions, and it has fit functions that are somewhat similar to Ludwig's train. I would avoid conflating them. I think Model can have a property that we can call keras_model for now (although you already have keras_layers; maybe the layers are not needed), and later we can name it differently. So Model doesn't need a call function; you would call keras_model.call().
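
As an illustration of this suggestion, a minimal composition-over-inheritance sketch (hypothetical names, not the actual Ludwig classes):

import tensorflow as tf

class Model:
    """Ludwig-level model object that wraps, rather than extends, a Keras model."""

    def __init__(self, keras_model: tf.keras.Model):
        # composition instead of inheritance: the Keras model is just an attribute
        self.keras_model = keras_model

    def predictions(self, inputs):
        # Ludwig code delegates to the wrapped Keras model instead of defining call()
        return self.keras_model(inputs, training=False)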

This is quite some progress. Tomorrow I can try to do a pass to reflect the point in the previous paragraph.

@jimthompson5802
Collaborator Author

Got it...I'll separate the Ludwig Model from tensorflow.keras Model

@jimthompson5802
Collaborator Author

Ludwig Model and tf.keras.models.Model are now separated. This commit 87c962e implemented this change.

As noted earlier, I'm focusing on capturing all the metrics for the training, validation, and test data sets. Once I can do this, the Ludwig training procedure should be able to run from start to end.

From what I can tell, in the TF1 version the metrics are computed on the TensorFlow graph and the values are extracted from the graph. OTOH, in TF2 with eager execution, based on the tutorial you pointed out, the metrics are calculated by Python functions, such as

loss_object = tf2.keras.losses.MeanSquaredError()
train_loss = tf2.keras.metrics.Mean(name='train_loss')
train_metric = tf2.keras.metrics.MeanSquaredError(name='train_metric')
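
For reference, a minimal eager training step built around these objects, in the spirit of the TF2 tutorial; the small Sequential network and the Adam optimizer below are stand-ins, not Ludwig code:

import tensorflow as tf2  # aliased to match the snippet above

# stand-in network; in Ludwig this would be the encoder/combiner/decoder stack
model = tf2.keras.Sequential([
    tf2.keras.layers.Dense(64, activation='relu'),
    tf2.keras.layers.Dense(1)
])
optimizer = tf2.keras.optimizers.Adam()

def train_step(inputs, targets):
    with tf2.GradientTape() as tape:
        predictions = model(inputs, training=True)
        loss = loss_object(targets, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    # the stateful metric objects accumulate values across the batches of an epoch
    train_loss(loss)
    train_metric(targets, predictions)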

Right now these are hard-coded to the specific model I'm testing with. Looking at Ludwig's design, I can abstract these functions by their associated output feature. I'm looking at updating each output feature's output_config dictionary to add a calc_fn key that points to the function that will calculate the specific metric. Below is an example for the numerical output feature.

   output_config = OrderedDict([
        (LOSS, {
            'output': EVAL_LOSS,
            'aggregation': SUM,
            'value': 0,
            'calc_fn': tf2.keras.metrics.Mean(name='train_loss'),  # todo: tf2
            'type': MEASURE
        }),
        (MEAN_SQUARED_ERROR, {
            'output': SQUARED_ERROR,
            'aggregation': SUM,
            'value': 0,
            'calc_fn': tf2.keras.metrics.MeanSquaredError(name='metric_mse'), # todo: tf2
            'type': MEASURE
        }),
        (MEAN_ABSOLUTE_ERROR, {
            'output': ABSOLUTE_ERROR,
            'aggregation': SUM,
            'value': 0,
            'calc_fn': tf2.keras.metrics.MeanAbsoluteError(name='metric_mae'), # todo: tf2,
            'type': MEASURE
        }),
        (R2, {
            'output': R2,
            'aggregation': SUM,
            'value': 0,
            'calc_fn': lambda y, y_hat: -1, # todo: tf2 need to find function
            'type': MEASURE
        }),
        (ERROR, {
            'output': ERROR,
            'aggregation': SUM,
            'value': 0,
            'calc_fn': lambda y, y_hat: y - y_hat,  # todo: tf2
            'type': MEASURE
        }),
        (PREDICTIONS, {
            'output': PREDICTIONS,
            'aggregation': APPEND,
            'value': [],
            'calc_fn': None, # todo: tf2 need to define function
            'type': PREDICTION
        })
    ])

Right now I'm still working out the appropriate way to make the calc_fn visible at the right point in the processing. If this works, I'll make similar changes for the other output data types' output_config data structures.

Does this seem like a reasonable approach?

@w4nderlust
Collaborator

You don't need that, I believe; you can just change the implementation of NumericalOutputFeatures.get_loss() and .get_measures().

@jimthompson5802
Collaborator Author

A follow-up on recording loss and measures.

From what I can tell, there are two parts to being able to record the loss and measures.

  • Part 1: This happens during model building. Based on output feature type, determine the functions required to calculate loss and measures and record those functions somewhere such that the functions are available during training.
  • Part 2: During training make use of the recorded functions identified in Part 1 to calculate loss and measures and save those values in a data structure that is used to create the Ludwig training statistics.

From what I can tell, in TF1, Ludwig uses the static Graph to save the "functions" and make them available during training.

In TF2 with eager execution there is no static Graph, or at least I believe this to be true. This leads to the question of "Where should the functions needed to calculate loss and measures be saved such that they are available during training?" That is how I ended up thinking that the output_config data structure could be used by adding the calc_fn item.

re:

You don't need that, I believe; you can just change the implementation of NumericalOutputFeatures.get_loss() and .get_measures().

When I look in numerical_feature.py I see the NumericalOutputFeature definition. This class is a subclass of NumericalBaseFeature and OutputFeature. As best I can tell, there are no public methods NumericalOutputFeature.get_loss() or NumericalOutputFeature.get_measures().

There are, however, private methods _get_loss() and _get_measures() in the NumericalOutputFeature class. As I understand these methods, they are the ones that place the required functions on the static Graph. These functions are called only during the model building phase. As far as I can tell, they are not accessed during training.

Even if I change the implementation of NumericalOutputFeature._get_loss() and NumericalOutputFeature._get_measures() to be used during training to retrieve the loss and measures, I still have the gap of passing the functions identified during the model build phase (Part 1) to training (Part 2). I need some data structure to pass the functions. If output_config is not appropriate, is there a better data structure?

Or I'm just missing your point. If this is the case, I need a little more hand-holding. :-)

@w4nderlust
Collaborator

In TF2 with eager execution there is no static Graph, or at least I believe this to be true. This leads to the question of "Where should the functions needed to calculate loss and measures be saved such that they are available during training?" That is how I ended up thinking that the output_config data structure could be used by adding the calc_fn item.
...

Your reasoning is correct. The NumericalOutputFeature class can have an init function that saves in self which loss and measures to use, so that get_loss and get_measures can become non-static methods and use the info in self to determine what to compute. Does that make sense?
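
To make that concrete, a rough sketch of the pattern (names and details are illustrative, not the actual Ludwig code):

import tensorflow as tf

class NumericalOutputFeature:
    """Sketch only: the feature decides at init time which loss/metric objects it owns."""

    def __init__(self, feature_definition):
        loss_type = feature_definition.get('loss', {}).get('type', 'mean_squared_error')
        if loss_type == 'mean_squared_error':
            self._loss_fn = tf.keras.losses.MeanSquaredError()
        elif loss_type == 'mean_absolute_error':
            self._loss_fn = tf.keras.losses.MeanAbsoluteError()
        else:
            raise ValueError('Unsupported loss type {}'.format(loss_type))
        # stateful metric objects live on the instance for the whole training run
        self._metrics = {
            'mean_squared_error': tf.keras.metrics.MeanSquaredError(),
            'mean_absolute_error': tf.keras.metrics.MeanAbsoluteError(),
        }

    def get_loss(self, targets, predictions):
        return self._loss_fn(targets, predictions)

    def get_measures(self, targets, predictions):
        for metric in self._metrics.values():
            metric.update_state(targets, predictions)
        return {name: m.result().numpy() for name, m in self._metrics.items()}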

@jimthompson5802
Collaborator Author

OK...now I get what you are suggesting. Save the functions as instance attributes in the output feature object. This makes sense. Thank you for the guidance.

@jimthompson5802
Collaborator Author

While working on the latest guidance, it became apparent that this question

"Where should the functions needed to calculate loss and measures be saved such that they are available during training?"

applies to other items: input features and output features. In TF1, these are saved in the static Graph.

Following the same reasoning as above, the appropriate place to save these in TF2 would appear to be as instance attributes in the ludwig.models.Model class. I'm thinking of creating two attributes for this purpose.

self.input_features = {}
self.output_features = {}

These will be dictionaries where the keys are the features' names and the values are the feature objects.

Does this seem reasonable?
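
For illustration, a minimal sketch of how those attributes might get populated (build_input_feature and build_output_feature are hypothetical helpers standing in for Ludwig's feature construction):

class Model:
    """Sketch: feature objects live on the model instance instead of a static graph."""

    def __init__(self, input_feature_defs, output_feature_defs):
        # keys are feature names, values are the built feature objects
        self.input_features = {}
        self.output_features = {}
        for feature_def in input_feature_defs:
            # build_input_feature is a hypothetical stand-in for Ludwig's feature construction
            self.input_features[feature_def['name']] = build_input_feature(feature_def)
        for feature_def in output_feature_defs:
            # build_output_feature is likewise hypothetical
            self.output_features[feature_def['name']] = build_output_feature(feature_def)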

test: add temporary stats data structure for testing
@jimthompson5802
Collaborator Author

jimthompson5802 commented Mar 3, 2020

This commit 28f2dd3 fixes how metrics are reported. Prior to the change it appeared that convergence was slow. Now it seems to be as expected.

 
╒══════════╕
╒══════════╕
│ TRAINING │
│ TRAINING │
╘══════════╛
╘══════════╛



Epoch  1

Epoch  1
Training: 100%|██████████| 6/6 [00:00<00:00, 56.92it/s]
Epoch 1, train Loss: 9753.6533203125, : train metric 9753.6533203125

Epoch  2

Epoch  2
Training: 100%|██████████| 6/6 [00:00<00:00, 88.75it/s]
Epoch 2, train Loss: 9727.7841796875, : train metric 9727.7841796875

Epoch  3

Epoch  3
Training: 100%|██████████| 6/6 [00:00<00:00, 90.72it/s]
Epoch 3, train Loss: 9670.3916015625, : train metric 9670.3916015625

Epoch  4

Epoch  4
Training: 100%|██████████| 6/6 [00:00<00:00, 96.40it/s]
Epoch 4, train Loss: 9544.5966796875, : train metric 9544.5966796875

Epoch  5

Epoch  5
Training: 100%|██████████| 6/6 [00:00<00:00, 100.62it/s]
Epoch 5, train Loss: 9289.3447265625, : train metric 9289.3447265625

Epoch  6

Epoch  6
Training: 100%|██████████| 6/6 [00:00<00:00, 90.71it/s]
Epoch 6, train Loss: 8794.6416015625, : train metric 8794.6416015625

Epoch  7

Epoch  7
Training: 100%|██████████| 6/6 [00:00<00:00, 87.95it/s]
Epoch 7, train Loss: 7876.7001953125, : train metric 7876.7001953125

Epoch  8

Epoch  8
Training: 100%|██████████| 6/6 [00:00<00:00, 97.43it/s]
Epoch 8, train Loss: 6290.029296875, : train metric 6290.029296875

Epoch  9

Epoch  9
Training: 100%|██████████| 6/6 [00:00<00:00, 97.19it/s]
Epoch 9, train Loss: 3933.5751953125, : train metric 3933.5751953125

Epoch 10

Epoch 10
Training: 100%|██████████| 6/6 [00:00<00:00, 88.64it/s]
Epoch 10, train Loss: 1521.4022216796875, : train metric 1521.4022216796875

Epoch 11

Epoch 11
Training: 100%|██████████| 6/6 [00:00<00:00, 90.59it/s]
Epoch 11, train Loss: 667.9972534179688, : train metric 667.9972534179688

Epoch 12

Epoch 12
Training: 100%|██████████| 6/6 [00:00<00:00, 97.81it/s]
Epoch 12, train Loss: 370.60516357421875, : train metric 370.60516357421875

Epoch 13

Epoch 13
Training: 100%|██████████| 6/6 [00:00<00:00, 97.54it/s]
Epoch 13, train Loss: 226.81394958496094, : train metric 226.81394958496094

Epoch 14

Epoch 14
Training: 100%|██████████| 6/6 [00:00<00:00, 87.97it/s]
Epoch 14, train Loss: 145.25387573242188, : train metric 145.25387573242188

Epoch 15

Epoch 15
Training: 100%|██████████| 6/6 [00:00<00:00, 90.65it/s]
Epoch 15, train Loss: 112.29157257080078, : train metric 112.29157257080078

Epoch 16

Epoch 16
Training: 100%|██████████| 6/6 [00:00<00:00, 98.22it/s]
Epoch 16, train Loss: 74.66690826416016, : train metric 74.66690826416016

Epoch 17

Epoch 17
Training: 100%|██████████| 6/6 [00:00<00:00, 94.20it/s]
Epoch 17, train Loss: 64.78895568847656, : train metric 64.78895568847656

Epoch 18

Epoch 18
Training: 100%|██████████| 6/6 [00:00<00:00, 87.57it/s]
Epoch 18, train Loss: 51.5577278137207, : train metric 51.5577278137207

Epoch 19

Epoch 19
Training: 100%|██████████| 6/6 [00:00<00:00, 88.60it/s]
Epoch 19, train Loss: 44.502235412597656, : train metric 44.502235412597656

Epoch 20

Epoch 20
Training: 100%|██████████| 6/6 [00:00<00:00, 96.82it/s]
Epoch 20, train Loss: 38.008697509765625, : train metric 38.008697509765625
Best validation model epoch: 1
Best validation model epoch: 1
Best validation model loss on validation set combined: 9489.847173455057
Best validation model loss on validation set combined: 9489.847173455057
Best validation model loss on test set combined: 9489.847173455057
Best validation model loss on test set combined: 9489.847173455057

Finished: experiment_run

Finished: experiment_run
Saved to: results/experiment_run_41
Saved to: results/experiment_run_41

Process finished with exit code 0

@jimthompson5802
Collaborator Author

Just a status update....

Re: the NumericalOutputFeature class. This is an example of how I'm adapting it to use the tf.keras.losses functions. These are the same functions that are hard-coded in the current proof-of-concept.

    def _get_loss(self):
        if self.loss['type'] == 'mean_squared_error':
            train_loss = tf.keras.losses.MeanSquaredError(
                reduction=Reduction.NONE
            )
            train_mean_loss = tf.keras.losses.MeanSquaredError(
                reduction=Reduction.SUM
            )
        elif self.loss['type'] == 'mean_absolute_error':
            train_loss = tf.keras.losses.MeanAbsoluteError(
                reduction=Reduction.NONE
            )
            train_mean_loss = tf.keras.losses.MeanAbsoluteError(
                reduction=Reduction.SUM
            )
        else:
            train_mean_loss = None
            train_loss = None
            raise ValueError(
                'Unsupported loss type {}'.format(self.loss['type'])
            )

        return train_mean_loss, train_loss

The above code replaces this current implementation of _get_loss()
https://github.com/uber/ludwig/blob/4039f93e605ed929e4f4a75c725bb70e148df92b/ludwig/features/numerical_feature.py#L226-L252

However, taking this approach presents some interesting challenges. In the TF1 version, train_loss and train_mean_loss are tensors placed on the static graph. Once I converted these to the Keras functions that are executed during eager execution, other parts of Ludwig broke because they were expecting tensors. I tried working around those issues by commenting out the code or using None as a temporary placeholder. This then led to other parts of Ludwig breaking. Working through this is slow and tedious.

I'm thinking the better approach is to leave the current static graph functions in place, at least for now. Instead, I'll create parallel functions that use the eager execution approach. So instead of refactoring the _get_loss() method, I'll put in place a parallel method _get_loss_tf2().

I think this will provide a more straightforward way of converting to eager execution without getting encumbered with retrofitting. Once eager execution is working with this parallel structure, I'm hoping it will be easier to rip out the static graph functionality. At that point we can rename the parallel functions to the appropriate names, e.g., _get_loss_tf2() reverts back to _get_loss().

Any thoughts on this approach?

refactor: partial working generalized approach for optimizer
@jimthompson5802
Collaborator Author

jimthompson5802 commented Mar 4, 2020

The primary change in this commit 82ee8bf is generalizing the loss function and starting to generalize the optimizer.

Taking the "parallel coding approach" seems to be working out. Using this approach I was able to make progress on generalizing the loss and optimizer functions.

Specific sections of code to review. Let me know if these are reasonable approaches.

  • ludwig.features.numerical_feature.NumericalOutputFeature._get_loss_tf2(): if this looks reasonable, this is the approach I'll reuse for other output features.
  • ludwig.models.modules.optimization_modules.get_optimizer_fun_tf2(): this converts to the tf.keras optimizers (a rough sketch of the kind of mapping involved follows below). As you'll see in the code, not all TF1 optimizers were converted. Related to this, I still have to rework the optimize() function in the same module for handling user-specified optimizer parameters. Right now the optimizers are running with the default parameters.
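
For illustration, a rough sketch of the kind of name-to-class mapping such a function could use; this is not the actual implementation and runs every optimizer with default parameters:

import tensorflow as tf

def get_optimizer_fun_tf2(optimizer_type):
    """Sketch: map Ludwig optimizer names onto tf.keras optimizer classes (defaults only)."""
    optimizer_registry = {
        'sgd': tf.keras.optimizers.SGD,
        'adam': tf.keras.optimizers.Adam,
        'adadelta': tf.keras.optimizers.Adadelta,
        'adagrad': tf.keras.optimizers.Adagrad,
        'rmsprop': tf.keras.optimizers.RMSprop,
    }
    if optimizer_type not in optimizer_registry:
        raise ValueError('Unsupported optimizer type {}'.format(optimizer_type))
    # user-specified optimizer parameters are not wired through yet in this sketch
    return optimizer_registry[optimizer_type]()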

With this approach I was able to remove the hard-coded optimizer and loss function.

For now I think the optimizer running with the default parameters is good enough. So the next thing I'm going to work on is capturing metrics. As noted earlier, when I can get this working it will allow me to test the full_experiment() function.

This is the eager execution output when executing the full_train() function:

<<<<<<<<<<<<< REMOVED LINES >>>>>>>>>>>>>>>>>>>>>
╒══════════╕
╒══════════╕
│ TRAINING │
│ TRAINING │
╘══════════╛
╘══════════╛



Epoch  1

Epoch  1
Training: 100%|██████████| 6/6 [00:00<00:00, 58.80it/s]
Epoch 1, train Loss: 9749.208984375, : train metric 9749.208984375

Epoch  2

Epoch  2
Training: 100%|██████████| 6/6 [00:00<00:00, 94.76it/s]
Epoch 2, train Loss: 9715.1982421875, : train metric 9715.1982421875

Epoch  3

Epoch  3
Training: 100%|██████████| 6/6 [00:00<00:00, 95.15it/s]
Epoch 3, train Loss: 9652.4091796875, : train metric 9652.4091796875

<<<<<<<<<<<<< REMOVED LINES >>>>>>>>>>>>
Epoch 16

Epoch 16
Training: 100%|██████████| 6/6 [00:00<00:00, 94.06it/s]
Epoch 16, train Loss: 69.82646942138672, : train metric 69.82646942138672

Epoch 17

Epoch 17
Training: 100%|██████████| 6/6 [00:00<00:00, 101.08it/s]
Epoch 17, train Loss: 62.97928237915039, : train metric 62.97928237915039

Epoch 18

Epoch 18
Training: 100%|██████████| 6/6 [00:00<00:00, 94.39it/s]
Epoch 18, train Loss: 48.84450149536133, : train metric 48.84450149536133

Epoch 19

Epoch 19
Training: 100%|██████████| 6/6 [00:00<00:00, 91.03it/s]
Epoch 19, train Loss: 47.20182418823242, : train metric 47.20182418823242

Epoch 20

Epoch 20
Training: 100%|██████████| 6/6 [00:00<00:00, 102.21it/s]
Epoch 20, train Loss: 41.10151672363281, : train metric 41.10151672363281
Best validation model epoch: 1
Best validation model epoch: 1
Best validation model loss on validation set combined: 9489.847173455057
Best validation model loss on validation set combined: 9489.847173455057
Best validation model loss on test set combined: 9489.847173455057
Best validation model loss on test set combined: 9489.847173455057

Finished: experiment_run

Finished: experiment_run
Saved to: results/experiment_run_86
Saved to: results/experiment_run_86

Process finished with exit code 0


@jimthompson5802
Collaborator Author

This posting is about managing the work for TF2 port.

When I make changes to the source code, I embed TODO comments to let you know my view of future changes for the module. It occurred to me that this might not be sufficient and that a higher-level view of how the work will evolve is needed. With this in mind, I created a simple kanban in my Ludwig fork to keep track of

  • tasks completed
  • tasks in-flight
  • to do tasks

The "to do tasks" list is not complete. Right now I'm using this list as a reminder for near-term work. I'll add new tasks, for example, enabling other input and output features for eager execution, as the work progresses. The cards in this list are listed in the sequence I plan to do the work (top-down). This is also where I'll note significant design decisions/discussions, such as this one,

Generalize KerasModel.call() and KerasModel.__build() functions. Maintain "Graph structure" in Model or KerasModel classes?

Feel free to provide guidance on what should be on the "To Do" board and the sequencing of the work.

@w4nderlust
Collaborator

Sorry for being slow at answering.

Any thoughts on this approach?

Yes it makes sense to me.

_get_loss_tf2

Instead of having two calls to the Keras functions, one with and one without reduction, I would have only one, and then reduce the output.
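
A sketch of that single-call-then-reduce idea, using MSE purely as an illustration:

import tensorflow as tf

# a single unreduced loss object instead of two differently-reduced ones
loss_fn = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE
)

def compute_losses(targets, predictions):
    per_sample_loss = loss_fn(targets, predictions)    # one value per sample
    # reduce afterwards; tf.reduce_sum would match the earlier SUM reduction
    mean_loss = tf.reduce_mean(per_sample_loss)
    return mean_loss, per_sample_loss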

kanban

Looks great, and the tasks make sense to me. Do you have specific questions about them?

@jimthompson5802
Collaborator Author

jimthompson5802 commented Mar 6, 2020

Just an update...

The set of commits cf61261, b00ae61, and f21bced added two capabilities:

  • Support for both mse and mae loss functions
  • an approach for defining and capturing measures in eager execution. I have a partial implementation of the approach to capture numerical-type measures. Right now only mse and mae are captured. To work within eager execution mode, I may have to define custom tf.keras.metrics classes for the error and r2 measures (a possible shape for such a metric is sketched below). I'm still researching this.
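
For reference, a sketch of what a custom streaming r2 metric could look like as a tf.keras.metrics.Metric subclass (illustrative only, not the final implementation):

import tensorflow as tf

class R2Score(tf.keras.metrics.Metric):
    """Sketch of a streaming R^2 metric; not the final Ludwig implementation."""

    def __init__(self, name='r2', **kwargs):
        super().__init__(name=name, **kwargs)
        self.sum_squared_error = self.add_weight(name='sse', initializer='zeros')
        self.sum_y = self.add_weight(name='sum_y', initializer='zeros')
        self.sum_y_squared = self.add_weight(name='sum_y_sq', initializer='zeros')
        self.count = self.add_weight(name='count', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.cast(y_pred, tf.float32)
        self.sum_squared_error.assign_add(tf.reduce_sum(tf.square(y_true - y_pred)))
        self.sum_y.assign_add(tf.reduce_sum(y_true))
        self.sum_y_squared.assign_add(tf.reduce_sum(tf.square(y_true)))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))

    def result(self):
        mean_y = self.sum_y / self.count
        total_sum_of_squares = self.sum_y_squared - self.count * tf.square(mean_y)
        return 1.0 - self.sum_squared_error / total_sum_of_squares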

When you look at the last set of changes, you may see some "unusual" coding sequences. These were needed as a temporary work-around for this problem:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_resource_variable_ops.py", line 142, in assign_variable_op
    tld.op_callbacks, resource, value)
tensorflow.python.eager.core._FallbackException: This function does not handle the case of the path where all inputs are not already EagerTensors.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/project/sandbox/tf2_port/sandbox_model_ludwig.py", line 117, in <module>
    exp_dir_name = full_train(data_csv='./data2/train.csv', **args)
  File "/opt/project/ludwig/train.py", line 354, in full_train

Epoch  1

Epoch  1
    debug=debug
  File "/opt/project/ludwig/train.py", line 520, in train
    **model_definition['training']
  File "/opt/project/ludwig/models/model.py", line 627, in train
    of.reset_measures()
  File "/opt/project/ludwig/features/numerical_feature.py", line 328, in reset_measures
    measure_fn.reset_states()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 212, in reset_states
    K.batch_set_value([(v, 0) for v in self.variables])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3323, in batch_set_value
    x.assign(np.asarray(value, dtype=dtype(x)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 821, in assign
    self.handle, value_tensor, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_resource_variable_ops.py", line 147, in assign_variable_op
    resource, value, name=name, ctx=_ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_resource_variable_ops.py", line 165, in assign_variable_op_eager_fallback
    attrs=_attrs, ctx=ctx, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 76, in quick_execute
    raise e
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 61, in quick_execute
    num_outputs)
TypeError: An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
  @tf.function
  def has_init_scope():
    my_constant = tf.constant(1.)
    with tf.init_scope():
      added = my_constant * 2
The graph tensor has name: y/total:0

Process finished with exit code 1

The natural place to define the measures is during the build_output() method in the OutputFeature class. At this point of the execution, it is running in the context of with graph.as_default(). This meant the tensors used to keep track of the measures were placed on the static Graph. So to solve the problem I had to move the execution of _setup_measures_tf1(), which is called from within build_output(), to be outside of the graph.as_default() context. I've noted in the code where I had to make this change. Once everything is in eager execution mode and the static Graph is removed, I'll change the code back to a more natural execution.

Example output with the currently set-up measures captured:

Epoch  1

Epoch  1
Training: 100%|██████████| 6/6 [00:00<00:00, 62.43it/s]
Epoch 1: mse: 9749.9560546875 mae: 77.47599792480469

Epoch  2

Epoch  2
Training: 100%|██████████| 6/6 [00:00<00:00, 114.13it/s]
Epoch 2: mse: 9713.7109375 mae: 77.31450653076172

Epoch  3

Epoch  3
Training: 100%|██████████| 6/6 [00:00<00:00, 107.14it/s]
Epoch 3: mse: 9647.970703125 mae: 77.02067565917969

Epoch  4

Epoch  4
Training: 100%|██████████| 6/6 [00:00<00:00, 111.02it/s]
Epoch 4: mse: 9520.9091796875 mae: 76.44526672363281

Epoch  5

Epoch  5
Training: 100%|██████████| 6/6 [00:00<00:00, 110.76it/s]
Epoch 5: mse: 9278.9423828125 mae: 75.32697296142578

Epoch  6

Epoch  6
Training: 100%|██████████| 6/6 [00:00<00:00, 110.54it/s]
Epoch 6: mse: 8825.17578125 mae: 73.17013549804688

Epoch  7

Epoch  7
Training: 100%|██████████| 6/6 [00:00<00:00, 104.93it/s]
Epoch 7: mse: 8004.2001953125 mae: 69.08928680419922

Epoch  8

Epoch  8
Training: 100%|██████████| 6/6 [00:00<00:00, 102.74it/s]
Epoch 8: mse: 6595.7216796875 mae: 61.526405334472656

Epoch  9

Epoch  9
Training: 100%|██████████| 6/6 [00:00<00:00, 105.86it/s]
Epoch 9: mse: 4440.728515625 mae: 47.92460250854492

Epoch 10

Epoch 10
Training: 100%|██████████| 6/6 [00:00<00:00, 108.38it/s]
Epoch 10: mse: 1895.5604248046875 mae: 27.373172760009766

Epoch 11

Epoch 11
Training: 100%|██████████| 6/6 [00:00<00:00, 100.99it/s]
Epoch 11: mse: 516.0098266601562 mae: 16.7248592376709

Epoch 12

Epoch 12
Training: 100%|██████████| 6/6 [00:00<00:00, 110.01it/s]
Epoch 12: mse: 216.2883758544922 mae: 10.587516784667969

Epoch 13

Epoch 13
Training: 100%|██████████| 6/6 [00:00<00:00, 113.04it/s]
Epoch 13: mse: 266.9472961425781 mae: 9.908870697021484

Epoch 14

Epoch 14
Training: 100%|██████████| 6/6 [00:00<00:00, 111.45it/s]
Epoch 14: mse: 104.53760528564453 mae: 7.767614364624023

Epoch 15

Epoch 15
Training: 100%|██████████| 6/6 [00:00<00:00, 105.33it/s]
Epoch 15: mse: 85.23799896240234 mae: 6.513027191162109

Epoch 16

Epoch 16
Training: 100%|██████████| 6/6 [00:00<00:00, 108.96it/s]
Epoch 16: mse: 84.063232421875 mae: 5.99407958984375

Epoch 17

Epoch 17
Training: 100%|██████████| 6/6 [00:00<00:00, 111.00it/s]
Epoch 17: mse: 52.07749938964844 mae: 5.028381824493408

Epoch 18

Epoch 18
Training: 100%|██████████| 6/6 [00:00<00:00, 104.68it/s]
Epoch 18: mse: 46.138309478759766 mae: 4.436774253845215

Epoch 19

Epoch 19
Training: 100%|██████████| 6/6 [00:00<00:00, 101.70it/s]
Epoch 19: mse: 39.23905944824219 mae: 4.340053558349609

Epoch 20

Epoch 20
Training: 100%|██████████| 6/6 [00:00<00:00, 109.33it/s]
Epoch 20: mse: 35.55711364746094 mae: 4.135770320892334

Once error and r2 are captured, I'll put the measures in the standard Ludwig output.

@jimthompson5802
Collaborator Author

These commits 42b09c4 and 9422158 fix/implement saving the test statistics to the results directory. Both numerical and binary features now save all the test statistics.

This zip file shows the results directory with the test statistics.
test_stats.zip

results/example_run contains the numerical output feature example
results/example_run_0 contains the binary output feature example

These are the commands I used to generate the above:

python -m ludwig.experiment --logging_level debug --data_csv data2/train.csv \
  --model_definition "{input_features: [{name: x1, type: numerical, preprocessing: {normalization: zscore}},
    {name: x2, type: numerical, preprocessing: {normalization: zscore}},
    {name: x3, type: numerical, preprocessing: {normalization: zscore}}],
    combiner: {type: concat, num_fc_layers: 5, fc_size: 128},
    output_features: [{name: y, type: numerical}], training: {epochs: 10}}"


python -m ludwig.experiment --logging_level debug --data_csv data3/train.csv \
  --model_definition "{input_features: [{name: x1, type: numerical, preprocessing: {normalization: zscore}},
    {name: x2, type: numerical, preprocessing: {normalization: zscore}},
    {name: x3, type: numerical, preprocessing: {normalization: zscore}}],
    combiner: {type: concat, num_fc_layers: 2},
    output_features: [{name: y, type: binary}], training: {epochs: 10}}"

Let me know if this looks right. At this point I can't think of anything else to do re: the numerical and binary features. If these are in a reasonable state, I can start working on the next feature type to convert.

@jimthompson5802
Collaborator Author

I was curious and wanted to establish a baseline for the pytest unit tests since we started working on the eager execution approach.

============================= test session starts ==============================
platform linux -- Python 3.6.9, pytest-5.4.1, py-1.8.1, pluggy-0.13.1
rootdir: /opt/project
plugins: typeguard-2.7.1
collected 80 items

tests/integration_tests/test_api.py F                                    [  1%]
tests/integration_tests/test_experiment.py FFFFFFFFFFFFFFFFFFF           [ 25%]
tests/integration_tests/test_kfold_cv.py FF.                             [ 28%]
tests/integration_tests/test_server.py F                                 [ 30%]
tests/integration_tests/test_visualization.py FFFFFFFFFFFFFFFFFFFFFFFFFF [ 62%]
                                                                         [ 62%]
tests/integration_tests/test_visualization_api.py FFFFFFFFFFFFFFFFFFFFFF [ 90%]
                                                                         [ 90%]
tests/ludwig/models/modules/test_encoder.py FF.F                         [ 95%]
tests/ludwig/utils/test_data_utils.py .                                  [ 96%]
tests/ludwig/utils/test_image_utils.py ..                                [ 98%]
tests/ludwig/utils/test_normalization.py .                               [100%]

=================================== FAILURES ===================================

=================== 74 failed, 6 passed, 1 warning in 41.97s ===================

@w4nderlust
Collaborator

w4nderlust commented Mar 27, 2020

I did a pass myself as it was simpler than explaining what I wanted changed :) Hopefully, by looking at the code of the commit, it's easier for you to figure out the changes directly.

Regarding BinaryOutputFeature, there are still a couple of things to improve before moving on.
Specifically, train_loss and update_metrics should respect the superclass interface.
Right now they can't because of binary_weighted_cross_entropy_with_logits and BWCEWLScore I believe. So what needs to happen here is that there needs to be a BWCEWLoss that takes the

positive_class_weight=self.loss['positive_class_weight'],
robust_lambda=self.loss['robust_lambda']

parameters as inputs at initialization time. This way, you can create the object in _setup_loss() like:

self.train_loss_function = BWCEWLoss(
    positive_class_weight=self.loss['positive_class_weight'],
    robust_lambda=self.loss['robust_lambda']
)

and then use it without the need for those parameters, which in turn removes the need for a custom train_loss.
The same applies to BWCEWLScore, which now doesn't need parameters in update_state. I believe you could also just substitute it with Mean, but I'm not 100% sure.
This is also a general pattern moving forward for losses and metrics that are not available in TF2.
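
For illustration, a rough sketch of what such a BWCEWLoss wrapper might look like, with the parameters bound at initialization time (the weighting and robust-lambda math here only approximates the existing Ludwig formulation):

import tensorflow as tf

class BWCEWLoss(tf.keras.losses.Loss):
    """Sketch: binary weighted cross-entropy with logits, parameters bound at init."""

    def __init__(self, positive_class_weight=1.0, robust_lambda=0.0, **kwargs):
        super().__init__(**kwargs)
        self.positive_class_weight = positive_class_weight
        self.robust_lambda = robust_lambda

    def call(self, y_true, y_pred):
        # y_pred is assumed to be logits
        loss = tf.nn.weighted_cross_entropy_with_logits(
            labels=tf.cast(y_true, tf.float32),
            logits=y_pred,
            pos_weight=self.positive_class_weight
        )
        # robust-lambda smoothing, only loosely following the existing Ludwig formulation
        if self.robust_lambda > 0:
            loss = (1 - self.robust_lambda) * loss + self.robust_lambda / 2
        return loss

With the parameters bound at construction, the train_loss_function created in _setup_loss() above can then be called as a plain loss(targets, logits) during training.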

Is all of this clear?

The next step after this, anyway, I believe would be to work on the CategoryFeature. It's a bit more complicated because the encoder (which is an embedding layer) may need to be modified according to TF2 too, and the losses are really custom. But having figured out the general pattern from binary and numerical will make things much easier to grasp moving forward.

@jimthompson5802
Collaborator Author

re:

Right now they can't because of binary_weighted_cross_entropy_with_logits and BWCEWLScore I believe. So what needs to happen here is that there needs to be a BWCEWLoss that takes the ...

The same applies to BWCEWLScore, which now doesn't need parameters in update_state. I believe you could also just substitute it with Mean, but I'm not 100% sure.

I'll make the changes as requested.

@jimthompson5802
Collaborator Author

jimthompson5802 commented Mar 27, 2020

Two items:

Item 1

Implemented changes requested in this comment #646 (comment)

re: BWCEWLScore class

  • Renamed the class from BWCEWLScore to BWCEWLMetric. I did this because I noticed that you had renamed the custom metric classes used in NumericalOutputFeature that had a Score suffix to have a Metric suffix.
  • I passed the BWCEWLoss() instance used for BinaryOutputFeature.train_loss_function into the update_state() method.

If I missed anything let me know.

Item 2

Re: my question here #646 (comment) on changing OutputFeature.update_metrics() method to be an abstract method. Did you want me to do that?

The reason for changing OutputFeature.update_metrics() to be abstract is that different metrics may require different prediction values, e.g., logits, predictions, etc.

Right now both the BinaryOutputFeature and NumericalOutputFeature classes override the update_metrics() method.
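
For reference, the abstract-method version under discussion might look roughly like this (the signature is illustrative only):

from abc import ABC, abstractmethod

class OutputFeature(ABC):
    """Sketch: each concrete feature decides which prediction values its metrics need."""

    @abstractmethod
    def update_metrics(self, targets, predictions):
        # e.g. numerical features update from predictions, binary features from logits
        ...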

In the meantime, I'll start working on the categorical feature.

@w4nderlust
Collaborator

w4nderlust commented Mar 27, 2020

A couple of things:

  • It's true, some metrics may use the predictions, some may use the probabilities, and some the logits. And that depends on the losses that are in place, which in turn depends on the feature type. So yes, you are right, it would be a good design pattern to make that an abstract method and remove the current implementation.
  • Before moving to categorical, let's do the following: I do a further pass of cleanup and solve the confidence_penalty part, and then let me merge into the tf2_porting branch. The reason is that other people may want to help with other features, so merging back there enables us to do more specific PRs for specific new things, like new features ported to TF2 or saving models, etc. Is that fine?

@jimthompson5802
Collaborator Author

I'll make the change to have OutputFeature.update_metrics() be abstract and add it to the PR.

I'll wait until you do your pass.

@w4nderlust
Collaborator

Actually I made a change that should work in most cases. So for now we can keep the method in the base class. If we see that there are features for which this does not work, then we can make it abstract in the future.

@w4nderlust
Collaborator

So I'm merging this back now. From now on, let's make the PRs more self-contained, like one PR for the categorical feature, one for sets, and so on, ok?

@w4nderlust w4nderlust merged commit 116e474 into ludwig-ai:tf2_porting Mar 27, 2020
Ludwig Development automation moved this from Triage to Done Mar 27, 2020
@w4nderlust w4nderlust changed the title TF2 Porting TF2 porting: eager mode training and evaluation, numerical and binary features Mar 27, 2020
@jimthompson5802
Collaborator Author

Got it...I understand that you are merging to the uber repo.

Future PRs should be specific and limited in scope. I'm assuming future PRs should be made against the uber:tf2_porting branch. Is this correct?

@w4nderlust
Collaborator

Yes, that's correct. The motivation is that now the basic systems for training and prediction work, so with this in place, work can be parallelized.

@jimthompson5802 jimthompson5802 deleted the tf2_porting branch March 27, 2020 23:43