
v0.3: TensorFlow 2, Hyperparameter optimization, Hugging Face Transformers integration, new data formats and more

Released by @w4nderlust on 06 Oct 00:03

Improvements

  • Full porting to TensorFlow 2.
  • New hyperparameter optimization functionality through the hyperopt command.
  • Integration with HuggingFace Transformers for pre-trained text encoders.
  • Refactored preprocessing with new supported data formats: auto, csv, df, dict, excel, feather, fwf, hdf5 (cache file produced during previous training), html (file containing a single HTML <table>), json, jsonl, parquet, pickle (pickled Pandas DataFrame), sas, spss, stata, tsv (see the sketch after this list).
  • Improved validation logic.
  • New Transformer encoders for sequential data types (sequence, text, audio, timeseries).
  • New batch_predict functionality in the REST API.
  • New export command to export to SavedModel and Neuropod.
  • New collect_summary command to print out a model summary with layer names.
  • Modified the predict command, splitting it into predict and evaluate: the first only produces predictions, the second evaluates those predictions against ground truth.
  • Two new hyperopt-related visualizations: hyperopt_report and hyperopt_hiplot.
  • Improved tracking of metrics in TensorBoard.
  • Greatly improved test suite.
  • Various documentation improvements.
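
As an illustration of the new data format handling mentioned above, here is a minimal sketch using the Python API; the config, file names, and the explicit data_format argument are assumptions for illustration rather than excerpts from Ludwig's documentation.

```python
# Minimal sketch of training from different data formats (illustrative only).
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

# The format is auto-detected from the file, e.g. a CSV...
model = LudwigModel(config)
model.train(dataset="reviews.csv")

# ...or it can be set explicitly with the data_format argument.
model = LudwigModel(config)
model.train(dataset="reviews.parquet", data_format="parquet")
```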

Bugfixes

This release includes a fundamental rewrite of the internals, so many bugs have been fixed during the rewriting.
This list includes only the ones that have a specific issue associated with them, but many others were addressed.

  • Fix #649: Replaced SPLIT with 'split' in example code.
  • Fix #684: Fixed a wrong parameter name in the documentation.
  • Fix #702: Fixed setting defaults in binary output feature.
  • Fix #729: Reduce output was not passed to the sequence encoder inside the sequence combiner.
  • Fix #742: Renamed self._learning_rate in Progresstracker.
  • Fix #799: Added tf_version to description.json.
  • Fix #840: Better messaging for plateau logic.
  • Fix #850: Switch from ValueError to Warning to make stratify work on non-output features.
  • Fix #844: Load LudwigModel in test_savedmodel before creating the saved model.
  • Fix #833: Load the model after training and before predicting if the model was saved to disk.
  • Fix #933: Added NumpyDecoder before returning JSON response from server.
  • Fix #935: Multiple categorical features with different vocabs now work.

Breaking changes

Because of the change in the underlying tensor computation library (TensorFlow 1 to TensorFlow 2) and the internal reworking it required, models trained with v0.2 don't work on v0.3.
We suggest retraining such models; in most cases the same model definition can be used. One impactful breaking change is that the model_definition is now called config, because it no longer contains only information about the model, but also training, preprocessing, and a newly added hyperopt section.
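
As a rough sketch (feature names and values are made up), a v0.3 config now groups all of these sections together:

```python
# Illustrative v0.3 config: what used to be a model_definition is now a config
# that also carries training, preprocessing and hyperopt sections.
config = {
    "input_features": [{"name": "utterance", "type": "text"}],
    "output_features": [{"name": "intent", "type": "category"}],
    "training": {"epochs": 20, "batch_size": 128},
    "preprocessing": {},  # global and per-type preprocessing options
    "hyperopt": {},       # newly added in v0.3; see the User Guide for its schema
}
```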

There have been some changes in the parameters inside the config too.
In particular, one main change is dropout, which is now a float value that can be specified for each encoder / combiner / decoder layer, while before it was a boolean parameter.
As a consequence, the dropout_rate parameter in the training section has been removed.
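
For example (an illustrative sketch, assuming a standard parallel_cnn text encoder), dropout is now set where it applies:

```python
# Illustrative input feature: dropout is now a float set on the encoder itself,
# replacing the old boolean flag plus the global dropout_rate training parameter.
input_feature = {
    "name": "description",
    "type": "text",
    "encoder": "parallel_cnn",
    "dropout": 0.2,  # dropout rate in [0, 1), previously a boolean
}
```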

Another change to the training parameters concerns the available optimizers.
TensorFlow 2 doesn't ship with some of the optimizers that were exposed in Ludwig (adagradda, proximalgd, proximaladagrad), and the momentum optimizer has been removed, as momentum is now a parameter of the sgd optimizer (see the sketch after this paragraph).
Newly added optimizers are nadam and adamax.
Note that the accuracy metric for the combined feature has been removed because it was misleading in some scenarios where multiple features of different types were trained.
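
Returning to the optimizer change, a minimal sketch of a v0.3 training section (the parameter layout is assumed from the description above):

```python
# Illustrative training section: momentum is now a parameter of the sgd
# optimizer rather than a separate "momentum" optimizer type.
training = {
    "optimizer": {"type": "sgd", "momentum": 0.9},  # or e.g. {"type": "nadam"}, {"type": "adamax"}
    "learning_rate": 0.001,
    "epochs": 50,
}
```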

In most cases, encoders, combiners and decoders now have an increased number of exposed parameters to play with for increased flexibility.
One notable change is that the previous BERT encoder has been replaced by a Hugging Face based one with different parameters (sketched below), and it is now available only for text features.
Please refer to the User Guide for the details of each encoder.
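
For illustration, a text feature using one of the Hugging Face based encoders might look roughly like the sketch below; the encoder name and the pretrained_model_name_or_path and trainable parameters are assumptions to be checked against the User Guide.

```python
# Illustrative text input feature with a Hugging Face Transformers based encoder.
text_feature = {
    "name": "title",
    "type": "text",
    "encoder": "bert",
    # assumed parameter names: which pretrained weights to load and whether to fine-tune them
    "pretrained_model_name_or_path": "bert-base-uncased",
    "trainable": False,
}
```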

Tokenizers also changed substantially, with new parameters supported; refer to the User Guide for more details.

Other major changes are related to the CLI interface.
The predict command has been replaced by a simplified predict and a new evaluate: the first only produces predictions, the second evaluates those predictions against ground truth.
Some parameters of all CLI commands changed.
All the different data_* parameters have been replaced by the generic dataset, training_set, validation_set and test_set parameters. The data format is determined automatically, but it can also be set manually with the data_format argument. There is no gpu_fraction anymore; users can now specify gpu_limit to manage VRAM usage.
For all additional minor changes to the CLI please refer to the User Guide.

The programmatic API changed too, as a consequence.
Now all the parameters closely match those of the CLI interface, including the new dataset and gpu parameters.
Also in this case the predict function has been split into predict and evaluate.
Finally, the returned values of most functions changed to include some intermediate processing values, such as the preprocessed and split data and the output experiment directory when calling train.
Notably, there is now an experiment function in the API too, together with a new hyperopt one.
For more details, refer to the API reference.
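
A hedged end-to-end sketch of the reworked Python API follows; the exact return tuples and argument names should be checked against the API reference, and the file names are placeholders.

```python
# Illustrative round trip with the v0.3 Python API (return values as described above).
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}
model = LudwigModel(config)

# train() now also returns intermediate values such as the preprocessed/split
# data and the output directory (mirrors the ludwig train CLI command).
train_stats, preprocessed_data, output_directory = model.train(dataset="train.csv")

# predict() only produces predictions (mirrors ludwig predict)...
predictions, _ = model.predict(dataset="test.csv")

# ...while evaluate() scores them against ground truth (mirrors ludwig evaluate).
eval_stats, predictions, _ = model.evaluate(dataset="test.csv")

# experiment() and hyperopt() are also exposed programmatically (see the API reference).
```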

Contributors

@jimthompson5802 @tgaddair @kaushikb11 @ANarayan @calio @dme65 @ydudin3 @carlogrisetti @ifokeev @flozi00 @soovam123 @KushalP1 @JiByungKyu @stremlau @adiov @martinremy @dsblank @jakobt @vkuzmin-uber @mbzhu1 @moritzebeling @lnxpy