TF2 porting: eager mode training and evaluation, numerical and binary features #646
@w4nderlust One item I forgot to point out in my initial posting. The current implementation has Ludwig's Model subclass tf.keras.Model, but this ties Ludwig's Model to tensorflow.keras. Would you rather have Ludwig's Model be independent of tensorflow.keras? Or is Ludwig's Model being a subclass of tensorflow.keras Model good enough for now, with the two separated later if needed?
You shouldn't mind the difference in convergence speed for now; the learning rate and other parameters may be different, so don't worry. Regarding the structure of the code, I think Model should not extend the Keras model because they are kinda different concepts: in Keras a model is something that produces predictions, and it has fit functions that are kinda similar to Ludwig's train. I would avoid confounding them. I think Model can have a property variable that we can call keras_model for now (although you already have keras_layers, maybe the layers are not needed) and then later we can name it differently. So Model doesn't need a call function; you would call keras_model.call(). This is quite some progress. Tomorrow I can try to do a pass to reflect the point in the previous paragraph.
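For illustration, a minimal sketch of this composition-over-inheritance idea (class and attribute names here are illustrative, not the actual Ludwig API):

```python
import tensorflow as tf

class Model:
    """Sketch: Ludwig-level model that owns the training procedure
    and delegates all computation to a wrapped keras model."""

    def __init__(self, keras_model: tf.keras.Model):
        # composition instead of inheritance
        self.keras_model = keras_model

    def predict(self, inputs):
        # no call() on the Ludwig Model itself; delegate instead
        return self.keras_model(inputs)
```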
Got it...I'll separate the Ludwig Model from tensorflow.keras Model.
Ludwig Model and tf.keras.models.Model are now separated; commit 87c962e implemented this change. As noted earlier, I'm focusing on capturing all the metrics for the training, validation, and test data sets. Once I can do this, the Ludwig training procedure should be able to run from start to end. From what I can tell, in the TF1 version the metrics are computed on the TensorFlow graph and the values are extracted from the graph. OTOH, in TF2 with eager execution, based on the tutorial you pointed out, the metrics are calculated by Python functions.
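For example, something along these lines (an illustrative sketch using standard tf.keras.metrics, not the exact functions from the commit):

```python
import tensorflow as tf

# stateful metric objects, updated batch by batch during the training loop
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.BinaryAccuracy(name='train_accuracy')

def update_metrics(targets, predictions, batch_loss):
    # accumulate; values are read back with .result() and cleared
    # with .reset_states() at epoch boundaries
    train_loss.update_state(batch_loss)
    train_accuracy.update_state(targets, predictions)
```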
Right now these are hard-coded to the specific model I'm testing with. Looking at Ludwig's design, I can abstract these functions by their associated output feature. I'm looking at updating each output feature's ...

Right now I'm still working out the appropriate way to make the ... Does this seem like a reasonable approach?
You don't need that, I believe; you can just change the implementation of NumericalOutputFeatures.get_loss() and .get_measures().
This reverts commit bd19343.
A follow-up on recording loss and measures. From what I can tell, there are two parts to being able to record the loss and measures.

In TF1, Ludwig uses the static Graph to save the "functions" and make them available during training. In TF2 with eager execution there is no static Graph, or at least I believe this to be true. This leads to the question of where the functions needed to calculate loss and measures should be saved so that they are available during training. That is how I ended up thinking that the ...

Re: ...

When I look in ..., there are, however, private methods ... Even if I change the implementation of these methods, since they are currently static there is no instance state telling them which loss and measures to compute. Or maybe I'm just missing your point; if that's the case, I need a little more hand-holding. :-)
Your reasoning is correct. The NumericalOutputFeature class can have an init function that saves in self which loss and measure to use, so that get_loss and get_measures can become non-static methods and use the info in self to determine what to compute. Does it make sense?
OK...now I get what you are suggesting: save the functions as instance attributes in the output feature object. This makes sense. Thank you for the guidance.
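A minimal sketch of that pattern (the loss and measure choices here are hypothetical and hard-coded; the real feature classes would resolve them from the feature definition):

```python
import tensorflow as tf

class NumericalOutputFeature:
    def __init__(self, feature_definition):
        # which loss and measures to use is resolved once and saved in self;
        # feature_definition would drive these choices, hard-coded for brevity
        self.loss_function = tf.keras.losses.MeanSquaredError()
        self.measures = {
            'mean_absolute_error': tf.keras.metrics.MeanAbsoluteError(),
        }

    def get_loss(self, targets, predictions):
        # non-static: uses the loss chosen at init time
        return self.loss_function(targets, predictions)

    def get_measures(self, targets, predictions):
        results = {}
        for name, measure in self.measures.items():
            measure.update_state(targets, predictions)
            results[name] = measure.result()
        return results
```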
While working on the latest guidance, it became apparent that this question applies to other items: input features and output features. In TF1, these are saved in the static Graph. Following the same reasoning as above, the appropriate place to save these in TF2 would appear to be as instance attributes in the Ludwig Model object. These will be dictionaries where the keys are the features' names and the values are the feature objects. Does this seem reasonable?
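Roughly like this (a sketch; the factory helpers are hypothetical):

```python
def build_input_feature(definition):
    ...  # hypothetical factory

def build_output_feature(definition):
    ...  # hypothetical factory

class Model:
    def __init__(self, input_feature_defs, output_feature_defs):
        # features live on the Model instance, not in a static graph;
        # dictionaries keyed by feature name
        self.input_features = {
            d['name']: build_input_feature(d) for d in input_feature_defs
        }
        self.output_features = {
            d['name']: build_output_feature(d) for d in output_feature_defs
        }
```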
test: add temporary stats data structure for testing
This commit 28f2dd3 fixes how metrics are reported. Prior to the change it appeared that convergence was slow. Now it seems to be as expected.
Just a status update... Re: ...

The above code replaces the current implementation of ... However, taking this approach is presenting some interesting challenges. In the TF1 version ... I'm thinking the better approach is to, at least for now, leave the current static graph functions in place. Instead, I'll create parallel functions that use the eager execution approach. So instead of refactoring ..., I think this will provide a more straightforward way of converting to eager execution without getting encumbered with retrofitting. Once eager execution is working with this parallel structure, I'm hoping it will be easier to rip out the static graph functionality. At that point we can rename the parallel functions to the appropriate names, e.g., ... Any thoughts on this approach?
refactor: partial working generalized approach for optimizer
The primary change in this commit 82ee8bf is a generalization of the loss function and the start of generalizing the optimizer. Taking the "parallel coding approach" seems to be working out; by using it I was able to make progress on generalizing the loss and optimizer functions. Specific sections of code to review: ... Let me know if these are reasonable approaches.
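In broad strokes, what this is converging toward looks like the following (a sketch under assumed names and defaults, not the committed code): the optimizer is built from the model definition, the loss comes from the output feature, and the gradient step uses tf.GradientTape:

```python
import tensorflow as tf

OPTIMIZERS = {
    'adam': tf.keras.optimizers.Adam,
    'sgd': tf.keras.optimizers.SGD,
    'rmsprop': tf.keras.optimizers.RMSprop,
}

def build_optimizer(training_params):
    # pick the optimizer class named in the model definition;
    # default parameters are good enough for now
    opt_cls = OPTIMIZERS[training_params.get('optimizer', 'adam')]
    return opt_cls(learning_rate=training_params.get('learning_rate', 0.001))

def train_step(keras_model, optimizer, output_feature, inputs, targets):
    # generic eager-mode step: the loss comes from the output feature,
    # so nothing here is specific to one feature type
    with tf.GradientTape() as tape:
        predictions = keras_model(inputs, training=True)
        loss = output_feature.get_loss(targets, predictions)
    gradients = tape.gradient(loss, keras_model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, keras_model.trainable_variables))
    return loss
```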
With this approach I was able to remove the hard-coded optimizer and loss function. For now I think the optimizer running with the default parameters is good enough. The next thing I'm going to work on is capturing metrics. As noted earlier, getting this working will allow me to test ... This is the eager execution output when executing ...
This posting is about managing the work for the TF2 port. When I make changes to the source code, I embed TODO comments to let you know my view of future changes for the module. It occurred to me that this might not be sufficient and that a higher-level view of how the work will evolve is needed. With this in mind, I created a simple kanban board in my Ludwig fork to keep track of the work.

The "to do tasks" list is not complete; right now I'm using it as a reminder for near-term work. I'll add new tasks, for example enabling other input and output features for eager execution, as the work progresses. The cards in this list are in the sequence I plan to do the work (top-down). This is also where I'll note significant design decisions/discussions, such as this one.

Feel free to provide guidance on what should be on the "To Do" board and the sequencing of the work.
Sorry for being slow at answering.
Yes it makes sense to me.
Instead of having two calls to the keras functions, one with and one without reduction, I would have only one, and then reduce the output.
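Concretely, something along these lines (a small sketch):

```python
import tensorflow as tf

targets = tf.constant([[1.0], [2.0], [3.0]])
predictions = tf.constant([[1.1], [1.9], [3.4]])

# single loss object, configured to skip reduction:
# it returns one loss value per sample
loss_fn = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)

per_sample_loss = loss_fn(targets, predictions)  # shape: (3,)
reduced_loss = tf.reduce_mean(per_sample_loss)   # scalar, when needed
```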
Looks great to me, and the tasks make sense. Do you have specific questions about them?
Just an update... The set of commits cf61261, b00ae61, f21bced added two capabilities:

When you look at the last set of changes, you may see some "unusual" coding sequences. These were needed as a temporary work-around for this problem:

The natural place to define the measures is during ... Example output of the current set of measures captured: ...

Once ...
These commits, 42b09c4 and 9422158, fix/implement saving the test statistics to the ... This zip file shows the ...

These are the commands I used to generate the above: ...

Let me know if this looks right. At this point I can't think of anything else to do re: the Numerical and Binary features. If these are at a reasonable state, I can start working on the next feature type to convert.
I was curious and wanted to establish a baseline for the pytest unit tests since we started working on the eager execution approach.
I did a pass myself, as it was simpler than explaining what I wished to be changed :) Hopefully by looking at the code of the commit it's easier for you to figure things out directly. Regarding ..., the idea is to pass the ... parameters as inputs at initialization time. This way, you can create the object in ... and then use it without the need for those parameters, which in turn removes the need for a custom ...

Is all of this clear? The next step after this, anyway, I believe would be to work on the ...
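For instance (a hypothetical sketch of the pattern, not the actual classes):

```python
import tensorflow as tf

class ThresholdedError(tf.keras.metrics.Mean):
    """Hypothetical measure whose parameters are bound at init time."""

    def __init__(self, threshold, name='thresholded_error'):
        super().__init__(name=name)
        self.threshold = threshold  # configured once, up front

    def update_state(self, y_true, y_pred, sample_weight=None):
        # callers never need to pass the threshold again
        errors = tf.cast(tf.abs(y_true - y_pred) > self.threshold, tf.float32)
        return super().update_state(errors, sample_weight=sample_weight)

# create the object once, e.g. in the feature's __init__ ...
measure = ThresholdedError(threshold=0.5)
# ... then use it parameter-free wherever it's needed
measure.update_state(tf.constant([[1.0], [2.0]]), tf.constant([[1.2], [3.0]]))
```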
Re: ...

I'll make the changes as requested.
Two items:

Item 1: Implemented the changes requested in this comment #646 (comment) re: ...

If I missed anything, let me know.

Item 2: Re: my question here #646 (comment) on changing ... The reason for changing ... Right now both ... In the meantime, I'll start working on the categorical feature.
A couple of things:
I'll make the change to have ...

I'll wait until you do your pass.
Actually, I made a change that should work in most cases. So for now we can keep the method in the base class. If we see that there are features for which this does not work, then we can make it abstract in the future.
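In code, that distinction is roughly this (illustrative names only):

```python
class BaseOutputFeature:
    def __init__(self):
        self.measures = {}

    def update_measures(self, targets, predictions):
        # default implementation kept in the base class;
        # it should work for most features
        for measure in self.measures.values():
            measure.update_state(targets, predictions)

class SetOutputFeature(BaseOutputFeature):
    def update_measures(self, targets, predictions):
        # a feature for which the default does not fit would override;
        # alternatively, the base method could later become abstract
        ...
```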
So I'm merging this back now. From now on, let's make the PRs more self-contained: one PR for the categorical feature, one for sets, and so on, OK?
Got it...I understand that you are merging to the uber repo, and future PRs should be specific and limited in scope. I'm assuming future PRs should be made against ...
Yes, that's correct. The motivation is that the basic systems for training and prediction now work, so with this in place, work can be parallelized.
Reopening PR for TF2 porting...
I'm hoping this posting provides some evidence of progress. With your last guidance, I was able to get a "minimal training loop" working with model subclassing and eager execution.
I wanted to offer you an early look at how I'm adapting Ludwig training to TF2 eager execution.
This commit demonstrates the minimal training loop for this model_definition:

The main result of this "minimal training loop" is demonstrating training and evaluation driven by the model_definition.

Here is an excerpt of the log file for the minimal training loop:

Here is the entire eager execution log file: proof-of-concept_log_tf2_eager_exec.txt. For comparison, the TF1 log file: proof-of-concept_log_tf1.txt.

Now come all the limitations and caveats for the current state of the code:

- The model.train() method: I envision re-enabling and modifying the commented code as work progresses.
- The model.call() method: this will change to reflect the encoder/decoders specified in the model_definition.
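For orientation, the overall shape of such a minimal eager training loop with a subclassed model looks like the following (an illustrative sketch with toy data, not the committed code):

```python
import tensorflow as tf

class MinimalModel(tf.keras.Model):
    """Stand-in subclassed model: one dense encoder and decoder."""

    def __init__(self):
        super().__init__()
        self.encoder = tf.keras.layers.Dense(16, activation='relu')
        self.decoder = tf.keras.layers.Dense(1)

    def call(self, inputs, training=False):
        return self.decoder(self.encoder(inputs))

model = MinimalModel()
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

# toy data standing in for a Ludwig dataset
features = tf.random.normal((128, 4))
targets = tf.random.normal((128, 1))
dataset = tf.data.Dataset.from_tensor_slices((features, targets)).batch(32)

for epoch in range(3):
    for x, y in dataset:  # eager execution: a plain Python loop
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    print('epoch {}: loss {:.4f}'.format(epoch, loss.numpy()))
```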