First Pass at Fully Implementing the PropertyEstimator Calls #7

SimonBoothroyd · 2019-06-24T23:05:47Z

Description

This PR is the result of a first pass of integrating the propertyestimator into the template provided by Yudong.

Status

Ready to go

Cleaned up the config dictionary into a full object. Minor reformatting and code renaming to be more comprehensible.

src/property_estimator_io.py

yudongqiu · 2019-06-25T22:04:36Z

Thanks a lot for the PR. I think the changes overall looks great, and a few minor issues can be fixed quickly. In general, I have two suggestions:

Customize the input file and simplify the API calls to property estimator.
Reproduce the ForceBalance run in studies/023_property_estimator. (Sorry that I didn't include the targets folder before, I just committed a compressed file targets.tar.gz there).

Made gradient calculation optional. Refactored code to reduce redundency. Temporarily made the estimation code blocking. Turned off uncertainty checks. Added support for JSON property files. Moved imports into try / except block.

src/property_estimator_io.py

yudongqiu · 2019-07-10T16:33:17Z

src/property_estimator_io.py

+        if AGrad is False:
+            return estimated_data, estimated_gradients
+
+        jacobian = self._build_pvals_jacobian(mvals)


It's nice to see this implementation leads to using gradients in PE. I have three questions:

Are these gradients validated against the numerical gradients of mvals? If so, can we keep the numerical gradients code in here as a sanity check option?

The pvals does not contain all changing parameters in the ForceBalance parameterization. These are the "freely moving" ones. ForceBalance also supports "parameter_eval" and "parameter_repeat", and when those are defined the one-to-one mapping _gradient_key_mappings will lead to gradients of the evaluated parameters, (the ones not in pvals) to be lost. You can see here, that pfields have all fields in the Force field file that is changing. And Lee-Ping recently implemented OpenFF API calls to modify OpenFF Forcefield object. (If you merge the master branch you will get them.) The point is we want to collect gradients for all pfields into mvals, not just pvals.

I think the _build_pvals_jacobian() gives the same result so shall we store that and skip re-computing it?

Currently I'm comparing them against the values of the native FB implementation - my preference would be to use FB as the reference in this case (as opposed to the initial PE implementation) as it's a good sign when two independent code implementations are in good agreement!

Apologies I'm not sure that I understand this point fully - would you be able to give a few examples of cases where this would occur?

If these options aren't set frequently (or more pressing if they don't get set in the first round of fitting) it might be good for now to just raise a NotImplementedException in those cases, and revisit this at a later date just to help ease the time constraints in this first sprint (although if the effort to add this is minimal I'm open to trying to fix it now rather than postpone it until later 🙂)!

I think the latency of re-calculating the jacobian is well hidden by the cost of evaluating the physical properties so it shouldn't really contribute to the performance, but happy to add the caching if you'd prefer that?

For 2, I also had the same idea at first when implementing other components, but then I realized that we do need those features. An example is this test proposed by Mobley group: https://github.com/openforcefield/openforcefield/pull/384/files
If you look at line 300 in that file, the parameter_eval rule is set to use a special sqrtk1 parameter to set the k1 parameter. Here you will get the value of sqrtk1 in the pvals list, but actually k1 is the "real" parameter. The parameter_eval feature can also be used to set relations between different terms, i.e. constraining k1 of two different SMIRKs to be the same.

For 3, I agree the cost of evaluating the physical properties is higher than other parts of the code, so we designed the code to distribution the evaluation cost to many "worker nodes". In comparison, the _build_pvals_jacobian takes serial execution time on the host. Although it is cheap for small number of parameters, it's not exactly clear how mush time cost it will add eventually when the parameter relations are all evaluated in this function. Because this time cost can be easily eliminated by caching the result, I think it's better to do that.

For 1, after think more about it, I think I can understand your reason. The general idea is that, we have (a) ForceBalance liquid target as reference, and we implemented (b) numerical gradients, it matches the reference, then we implement (c) analytical gradients and it also matches the reference. I think you're saying as long as (c) matches (a), we don't need to keep (b) in the code.
I agree with this. However I think (b) is still valuable, because when (c) does not match (a), we will have a middle step to be able to tell which part goes wrong. We have found this very helpful, and you can see examples in the implementation of (a) where Lee-Ping still keep the FDCheck code just in case. https://github.com/leeping/forcebalance/blob/8cbafff745b416b513045b1eacbd58e14af22c16/src/data/npt.py#L526

For 2, I also had the same idea at first when implementing other components, but then I realized that we do need those features. An example is this test proposed by Mobley group: https://github.com/openforcefield/openforcefield/pull/384/files

I was just summarizing the Torsions meeting, and I think we've decided to skip improper fitting for release-1. We are probably OK to have it raise a NotImplementedError if this comes along.

yudongqiu · 2019-07-10T19:16:42Z

I think this PR is very close to be done now! Great job.

Here are a few things left:

Include updated forcebalance/studies/023_property_estimator/optimize.out
keep numerical mvals gradients option as a sanity check.
Try improve jacobian calculation to support parameter_eval and parameter_repeat features. If it's too much work, we can revisit this later, and the sanity check with numerical gradients will be very helpful validating this.

improve assign_openff_parameter performance by caching parameter search

Improve optgeo target

… property_estimator

Temporarily disabled the ab initio target.

yudongqiu · 2019-08-21T21:19:39Z

Although the numerical mvals gradients as a sanity check is a good feature to have, we agreed that we don't have time for re-adding that now, so we will revisit that later.

Everything else looks good and I will merge now.

SimonBoothroyd added 2 commits June 24, 2019 19:03

First pass at fully implementing the PE calls.

3adf144

Cleaned up the config dictionary into a full object. Minor reformatting and code renaming to be more comprehensible.

Finished adding support for calculating gradients not using the PE API.

1ded573

SimonBoothroyd changed the title ~~[WIP] First Pass at Fully Implementing the PropertyEstimator Calls~~ First Pass at Fully Implementing the PropertyEstimator Calls Jun 25, 2019