Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate sites support #410

Merged
merged 10 commits into from May 10, 2019

Conversation

Projects
None yet
5 participants
@alubbock
Copy link
Member

commented Feb 4, 2019

Duplicate sites are monomer sites with the same name. The following becomes valid PySB with this PR:

Monomer('A', ['a', 'a'])

One can then reference the sites using a tuple, with one entry
for each of the sites, for example:

A(a=(1, 2))   # Example 1

Note that this is distinct from multiple bonds to the same site:

A(a=[1,2])    # Example 2

In example 1, there are two sites, each named A, one with bond 1
and one with bond 2. In example 2, there is one site called A with
two bonds, 1 and 2.

The syntax is similar in both cases so users will have to be careful to
check their syntax matches their use case.

Note that the data types are checked, so that expressing a
(state, bond) tuple is still valid, noting that states must be strings:

A(a=('p', 1))

Most of the refactoring in this PR in core.py is to pull out the
site and state checking logic into separate functions, so that it
can be applied to check each site in a tuple of duplicate sites.
This includes _handle_site_instance within _as_graph
and _site_instance_concrete; these are mostly refactors rather
than new code. Likewise with _check_bond, is_state_bond_tuple,
and _check_state_bond_tuple.

Modifications outside core.py:

  • pysb/bng.py modified to support import of duplicate sites
    from netfiles, and to use the model's name as a file stem (L61-64)
    to aid debugging.
  • generator/bng.py modified to export models with duplicate
    sites to BioNetGen. This is straightforward: we just loop
    over each duplicate site and export them individually
  • importers/bngl.py modified to import BNGL models with
    duplicate sites. A couple of other modifications have snuck in,
    to make unit tests pass on new models with duplicate sites:
    • Model name is now set based on the .bngl filename
      to make it more meaningful than just pysb (line 43)
    • ln (natural log) now supported in BNGL import (line 64)
    • Support for constant expressions import marked as
      Constant in BNGXML (lines 178-185)
    • Support for BNG observables with same name as monomers,
      which is illegal in PySB (lines 196-199)
  • simulator/bng.py modified to support using model name's
    in the BNG output file
  • pysb/tests/test_core.py unit tests for duplicate sites
  • pysb/tests/test_importers.py update import tests
    to support duplicate sites

Closes: #59

@coveralls

This comment has been minimized.

Copy link

commented Feb 4, 2019

Coverage Status

Coverage increased (+0.07%) to 79.365% when pulling cc86b8f on alubbock:duplicate_sites into 470945e on pysb:master.

@coveralls

This comment has been minimized.

Copy link

commented Feb 4, 2019

Coverage Status

Coverage increased (+0.2%) to 78.345% when pulling 9cc3cba on alubbock:duplicate_sites into 4c51327 on pysb:master.

@jmuhlich
Copy link
Member

left a comment

This is a pretty deep change in a lot of core functionality and I'm having trouble following all of the logic and data structures. Could you add a summary of the strategy/assumptions/etc? Maybe add it to the PR message, and it can go into the squashed commit message upon acceptance.

It looks like the multi-bond support should still work since that uses a list of bonds rather than a tuple, but that difference seems awfully subtle. Can we eventually deprecate multi-bonds in favor of duplicate sites? Are there plausible use cases for unconstrained numbers of bonds on a site in a rule-based model? I think the typical case is where there's true symmetry or lack of distinguishability between several protein domains, and in that case the multiplicity is known ahead of time and thus can be encoded in the number of duplicate sites.

@alubbock

This comment has been minimized.

Copy link
Member Author

commented Mar 21, 2019

I've updated the PR message with further details. Please let me know if anything requires further clarification.

I checked with @lh64 regarding multiple bonds, and he thinks they're useful enough to keep. He gave the following example (using BNGL):

Say we wanted to build a 2D lattice out of Monomers X and A, where each X is bound to 4 As but each A is bound to 2 Xs. With multiple bonds we could write something like ‘X(a!1!2!3!4).A(x!1!5).A(x!2!6).A(x!3!7).A(x!4!8).X(a!5!9!10!11)....’

One issue, as you point out, is that the proposed syntax (Option 1) for duplicate sites in PySB A(a=(1,2)) is close to the syntax for multiple bonds A(a=[1,2]). This comes down to a stylistic choice between brevity (which I went for) versus explicitness.

  • Option 2: consider sites with certain suffixes to be equivalent: A(a_1=1, a_2=2). As @lh64 pointed out, this could also be confusing and is a little too "magic" for my taste.
  • Option 3: use an explicit keyword: A(DupSite(a=1), DupSite(a=2)). Internally, we could convert to the tuple representation that I've already added (and bar direct use of tuples in sites, except for (state, bond) pairs), and just convert back to the above syntax for printing/export etc.

Happy to hear thoughts and suggestions on this.

@jmuhlich

This comment has been minimized.

Copy link
Member

commented Mar 22, 2019

Thanks for the extra details, I'll have another look at this today.

Regarding syntax:

For Leonard's example, wouldn't you get the equivalent system with monomers X(a,a,a,a) and A(x,x), putting one bond on each duplicate site rather than multi-bonds on the single sites? This is what I meant about knowing the bond multiplicity structure ahead of time.

I wonder if set instead of list for multi-bonds would have helped us out here. More semantically accurate I think (bond numbers at one site would need to be unique) and harder to confuse with tuple or list. Now that would be a breaking change, though. Of course dropping support for multi-bonds entirely, even if they can reasonably be replaced by duplicate sites, is also a breaking change.

@lh64

This comment has been minimized.

Copy link
Contributor

commented Mar 22, 2019

Hi @jmuhlich. You're right that X(a,a,a,a) and A(x,x) can, in theory, be used to build an equivalent system as multiple bonds. The issue is really one of symmetry and efficiency. With all of those identical sites BNG ends up having to do a lot of additional pattern matching and calculating of symmetry factors, etc., that can really bog the network generation down, to the point where it becomes impossible in some cases. Using multiple bonds really simplifies the problem by removing these symmetry factors, which is why I think we should retain the functionality, even if it's not commonly used right now. I've run into this problem before in various contexts (I was helping someone in John Tyson's lab at one point with something related to this) so I could probably dig up an example to illustrate the point if you like. Just let me know.

@jmuhlich

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

This design fails to distinguish between (in BNG syntax) A(a~p!1) and A(a~p, a!1) (both would be A(a=('p', 1)) in the proposed PySB syntax). Is this a problem? Swapping the order to put the bond first does produce the expected duplicate-site pattern, but this seems error-prone. Are there other legal/useful BNGL patterns that can't be expressed?

As a further test, I imported the following bngl into PySB and the two rules ended up being identical.

begin model
begin parameters
  P   1
end parameters

begin molecule types
  A(a~p,a~p)
  B(b)
end molecule types

begin reaction rules
  r_1_site:  A(a~p!1).B(b!1) -> 0    P
  r_2_sites:  A(a~p,a!1).B(b!1) -> 0    P
end reaction rules

end model

I also found that get_half_bonds_in_pattern doesn't handle nested (state,bond) tuples in a duplicate site context (e.g. it returns an empty list for A(a=('u', ('u', 1)))), but I can address that in an actual code review once we work out this design issue.

@jmuhlich

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

Maybe we want to add some new class to wrap duplicate site bonds/states. Something along the lines of option 3 above, but on the pattern not the monomer declaration. I don't have a good name, so just as a point for discussion:

A(a=DuplicateSites('p', 1))  # A(a~p, a!1)
A(a=DuplicateSites(1, 'p'))  # A(a!1, a~p)
A(a=('p', 1))  # A(a~p!1)
@alubbock

This comment has been minimized.

Copy link
Member Author

commented Apr 10, 2019

Given the problems you've highlighted, an explicit duplicate sites qualifier seems appropriate. Does MultiSite seem a reasonable name for the class? Slightly shorter.

@alubbock

This comment has been minimized.

Copy link
Member Author

commented Apr 11, 2019

@jmuhlich I've gone with MultiSite for now, but that can be renamed if needed.

@lh64

This comment has been minimized.

Copy link
Contributor

commented Apr 12, 2019

Just to throw more of a wrench into the discussion, all components in BNGL can have states and bonds associated with them. So something like A(a~p!1, a~u!2) is perfectly legal syntax. Would that be written in the proposed PySB syntax as A(a=MultiSite(('p',1), ('u',2)))?

Also, how about DupSite as a name? Or even just Dup?

@alubbock

This comment has been minimized.

Copy link
Member Author

commented Apr 12, 2019

Yes, A(a=MultiSite(('p',1), ('u',2))) works (just tried it).

DupSite seems suitable, I'd lean towards that. Dup seems more ambiguous.

@jmuhlich
Copy link
Member

left a comment

Looks good overall. There are some BNG-related changes that don't seem to logically belong here. Did this get rebased against another branch at some point?

def _check_bond(bond):
""" A bond can either by a single int, WILD, ANY, or a list of ints """
return (
isinstance(bond, int) or

This comment has been minimized.

Copy link
@jmuhlich

jmuhlich Apr 24, 2019

Member

Prefer operator at beginning of next line instead of end of previous line, per pep8.

def is_state_bond_tuple(state):
""" Check the argument is a (state, bond) tuple for a Mononer site """
return (
isinstance(state, tuple) and

This comment has been minimized.

Copy link
@jmuhlich

jmuhlich Apr 24, 2019

Member

Prefer operators at beginning of next line.

Show resolved Hide resolved pysb/core.py
Show resolved Hide resolved pysb/core.py
@@ -45,7 +46,7 @@ def generate_parameters(self):
max_length = max(len(p.name) for p in
self.model.parameters | self.model.expressions)
for p in self.model.parameters:
self.__content += ((" %-" + str(max_length) + "s %e\n") %
self.__content += ((" %-" + str(max_length) + "s %.16g\n") %

This comment has been minimized.

Copy link
@jmuhlich

jmuhlich Apr 24, 2019

Member

Needs to be %.17g to get the full precision.

@jmuhlich

This comment has been minimized.

Copy link
Member

commented Apr 24, 2019

It looks like some of the changes brought in with the merge from master in 72b85e1 are showing up in the final diff for some reason. I think you should rebase this against master one last time to make sure everything is resolved properly.

@jmuhlich

This comment has been minimized.

Copy link
Member

commented Apr 24, 2019

One other thought: Should we normalize a MultiSite with one value to just the naked value? A(x=MultiSite(1)) seems confusing.

@jmuhlich

This comment has been minimized.

Copy link
Member

commented Apr 24, 2019

Even worse is MultiSite(('u',1)). Without looking very carefully that's easy to confuse with MultiSite('u', 1).

@alubbock

This comment has been minimized.

Copy link
Member Author

commented Apr 24, 2019

Thanks for the feedback. I'll rebase and see if that cleans things up. Couple of quick questions:

  1. Should we normalise, or reject, supplying a single site to MultiSite, as per your examples? Reject might be cleaner, and less likely to lead to unexpected behaviour.
  2. How do you feel about the name DupSite instead of MultiSite? Keeps us closer to BNG naming convention, and MultiSite might get confused with multiple bonds.

alubbock added some commits Sep 5, 2018

Duplicate sites support
Duplicate sites are monomer sites with the same name, which are
supported in BioNetGen but not Kappa.

The following is now valid PySB:

    Monomer('A', ['a', 'a'])

One can then reference the sites using a tuple, with one entry
for each of the sites, for example:

    A(a=(1, 2))

@alubbock alubbock force-pushed the alubbock:duplicate_sites branch from a67c6c3 to 6b04d3b Apr 24, 2019

@alubbock

This comment has been minimized.

Copy link
Member Author

commented Apr 24, 2019

@jmuhlich OK, I'm nearly ready to push. One issue with '%.17g' as a format specifier is it often gives roundoff errors:

>>> '%.17g' % 0.3
'0.29999999999999999'

>>> '%.17g' % 3.14
'3.1400000000000001'

Our current Parameter printing also matches %.16g:

>>> Parameter('pi', math.pi)
Parameter('pi', 3.141592653589793)

>>> '%.16g' % (math.pi)
'3.141592653589793'

I'd suggest we standardise on %.16g across the codebase (BNG, Kappa, PySB flat etc.). Thoughts?

@glucksfall

This comment has been minimized.

Copy link

commented Apr 25, 2019

I tested that precision for model parameters can make huge variations. I added an option to my scripts to let the user choose it. Now, I decided to use %.6e as default. Another reason to allow configurable was BioNetFit. It writes parameters with %.6g format (I believe... Need to double-check).

@jmuhlich

This comment has been minimized.

Copy link
Member

commented Apr 25, 2019

  1. I agree that raising rather than normalizing is better. It's in line with how all of our other constructors operate, too.
  2. "Duplicate" sort of has a negative connotation in my mind, like it's a mistake or something, whereas "multiple" is value-neutral. Actually, the entity being modeled here isn't even a site after all -- it's state(s). What about StateList or DuplicateSiteStates something like that? Either way, let's avoid using an abbreviation in the name, and users can alias it if they prefer a shorter name (e.g. from pysb import DuplicateSiteStates as DS).
  3. %.17g is absolutely required to round-trip the full precision of a 64-bit float. "If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number." - https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF / https://en.wikipedia.org/wiki/Double-precision_floating-point_format . The rounding "error" you are seeing isn't an error and it's nothing to worry about, it's just the reality of converting between base-10 and base-2 representations. For example, 0.3 can't be represented exactly in base-2 but 0.29999999999999999 converted to base-2 gets you as close as possible. The big-picture solution might be to find a nice way to use fraction.Rational or sympy.Rational for nominal parameter values.
@alubbock

This comment has been minimized.

Copy link
Member Author

commented Apr 25, 2019

  1. Raising it is
  2. I'm not sure about encouraging aliasing for core constructs - it lowers readability when sharing models. For that reason I'd prefer something short. How about MultiState?
  3. As @glucksfall points out, switching to %.17g might cause subtle reproducibility issues with existing models. How about adding a switch (environment variable, e.g. PYSB_HIGH_PRECISION=True) to enable %.17g parameter format, but keep the current setup for now? Then make high precision the default in PySB 2.0, since it's a breaking change.
@jmuhlich

This comment has been minimized.

Copy link
Member

commented Apr 25, 2019

  1. 👍 for MultiState.
  2. Why is the float precision change for the BNG generator even in here? Is it required to support a new BNG example model?
@alubbock

This comment has been minimized.

Copy link
Member Author

commented Apr 25, 2019

@jmuhlich wrote:

Why is the float precision change for the BNG generator even in here? Is it required to support a new BNG example model?

It's required to get one of the BNG importer unit tests to pass when round-tripping a BNG model with duplicate sites (sorry, multi-states!). But it's probably best to make that test an expected failure, and deal with this precision change as a separate issue and PR.

@jmuhlich

This comment has been minimized.

Copy link
Member

commented Apr 25, 2019

Let's do expected failure for now to get this merged, then plan how to resolve it. I don't think an environment variable is the way to go, but we can hash it out separately.

@alubbock alubbock merged commit ffb0a7c into pysb:master May 10, 2019

3 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.2%) to 78.345%
Details

@alubbock alubbock deleted the alubbock:duplicate_sites branch May 10, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.