Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove type-distinction between system-communicators and user-communicators #155

Merged
merged 15 commits into from
Oct 20, 2023

Conversation

Cydhra
Copy link
Contributor

@Cydhra Cydhra commented Mar 14, 2023

As outlined in #148, it is very unintuitive to split the Communicator interface in SystemCommunicator and UserCommunicator.
I drafted a PR to unify both into one common enum SimpleCommunicator:

  • I migrated topology-related code from UserCommunicator into the common enum, so both communicator variants can now call those functions. I couldn't confirm it outright, but the spec does not prohibit that explicitely, so I think this models the MPI interface more closely.
  • I refactored CartesianCommunicator to accept both World- and UserCommunicator, which is legal by spec, I don't know why this wasn't possible before.
  • since WorldCommunicator is no longer Copy, code that moved a reference to SystemCommunicator must now explicitly refer to IntraCommunicator::WorldCommunicator (see immediate_multiple_test.rs)
  • I added SelfCommunicator next to WorldCommunicator, because the MPI spec defines it and without it downstream crates would need unsafe code to create it, and call free on it when their UserCommunicator drops

Things missing from this draft:

There is another weird split in the code-base: InterCommunicators use SystemCommunicatorHandle and UserCommunicatorHandle, even though those handle structs do the exact same as their communicator-counterparts (at least as far as I can tell).
Since I have never worked with InterCommunicators before though, and there are neither integration-tests nor examples for them, I didn't migrate them to use SimpleCommunicator yet, as I wanted to get some feedback on their purpose.

All examples and tests are green, but I am not very familiar with the codebase yet, so somebody should scrutinize this.

- migrated topology-related code from UserCommunicator into the common enum, so both communicator variants can now call those functions. Spec indicates that this is legal
- refactored CartesianCommunicator to accept both World- and UserCommunicator, which is legal by spec, I don't know why this wasn't possible before
- since WorldCommunicator is no longer Copy, code that moved a reference to SystemCommunicator must now explicitly refer to IntraCommunicator::WorldCommunicator (see immediate_multiple_test.rs)
@jedbrown
Copy link
Contributor

Thanks! The spawn examples use inter-communicators. A subtle point is that the inter-comm returned by MPI_Comm_get_parent should be thought of as a "system" inter-comm since freeing it would leave other references dangling. This is in contrast to the inter-comm returned by MPI_Comm_spawn, which has normal "user" semantics. This might help explain why the separate handles exist. I'm not attached to this design, BTW.

/// A system-defined inter-communicator
pub type SystemInterCommunicator = InterCommunicator<SystemCommunicatorHandle>;

/// A user-defined inter-communicator
pub type UserInterCommunicator = InterCommunicator<UserCommunicatorHandle>;

Looks like we have some CI failures.

@Cydhra
Copy link
Contributor Author

Cydhra commented Mar 15, 2023

A subtle point is that the inter-comm returned by MPI_Comm_get_parent should be thought of as a "system" inter-comm since freeing it would leave other references dangling. This is in contrast to the inter-comm returned by MPI_Comm_spawn, which has normal "user" semantics. This might help explain why the separate handles exist. I'm not attached to this design, BTW.

The reason why the separate handles exist is the same why SystemCommunicator and UserCommunicator were split in the first place, I suppose. My question was why use a separate -Handle instead of SystemCommunicator and UserCommunicator.

I don't quite understand if Drop-semantics are important here: Intuitively I wouldn't even unify the two handles into a common SimpleCommunicatorHandle, but instead use SimpleCommunicator directly. But I don't know if there was a reason to differentiate between *CommunicatorHandle and *Communicator.

Looks like we have some CI failures.

I was bitten by #[cfg(not(msmpi))], since I tested everything on Windows. Fixing it on Fedora right now.

@Cydhra
Copy link
Contributor Author

Cydhra commented Mar 16, 2023

So I figured out why the split in *Communicator and *CommunicatorHandle exists. Drop semantics are actually different, which I would've noticed if I had read the code more carefully. So I unified the handles into a common enum as well, just like I did with the Communicator.

As far as I can tell, the split in System* and User* types existed, because System* types were Clone and Copy. With the new design they aren't. I merged the types into one enum and Clone would produce double free bugs with user communicators. However, since both system communicators (that is MPI_COMM_WORLD and MPI_COMM_SELF) are static constants, I have made both enum variants for those stateless, meaning you don't need to Clone them, you can just instantiate them twice (I referenced that in one of the commits, where I changed the behavior in one of the examples).

examples/immediate_multiple_test.rs Outdated Show resolved Hide resolved
src/topology/mod.rs Outdated Show resolved Hide resolved
…torHandle and simplified InterCommunicator type

 * removed CommunicatorHandle trait
 * removed [User|System]InterCommunicator types in favor of InterCommunicator
 * added self_comm() function to SimpleCommunicator type
 * added inter-comm check to constructor of intercomm handles instead of constructor of intercomms to save on MPI-calls.
 * added unchecked inter-comm handle constructor as all callers of inter-comm constructors would unwrap the optional anyway
@Cydhra
Copy link
Contributor Author

Cydhra commented Mar 23, 2023

I have implemented an enum CommunicatorHandle now, deleted the previous handles and their trait, and simplified the API to one struct SimpleCommunicator, one InterCommunicator and the topological communicators.

The latter still use simple communicators in their constructor (as before), I don't know if that is how it is supposed to be, because it seems weird to have this back-and-forth between the types going on. I need some feedback on that.

The Handle type is still public, I don't know if we should hide it completely. At the moment hiding it is weird, because there is the public trait AsRaw that exposes the raw handles from ffi anyway. If we want to hide the handle enum, we should also hide AsRaw.

@Cydhra Cydhra changed the title Unified SystemCommunicator and UserCommunicator into common enum Remove type-distinction between system-communicators and user-communicators Mar 23, 2023
@jedbrown
Copy link
Contributor

Cool, I'll check it out as time allows.

I think of AsRaw as public because a user may want to call a C function (say PETSc) that takes an MPI_Comm (relates to #32). My current thinking is that AsRaw can stay public while hiding the handle. One should be able to create any of the concrete implementations from a raw MPI_Comm, but I don't think we need to see the CommunicatorHandle for that either. Do you have a use case?

@Cydhra
Copy link
Contributor Author

Cydhra commented Mar 23, 2023

Cool, I'll check it out as time allows.

I think of AsRaw as public because a user may want to call a C function (say PETSc) that takes an MPI_Comm (relates to #32). My current thinking is that AsRaw can stay public while hiding the handle. One should be able to create any of the concrete implementations from a raw MPI_Comm, but I don't think we need to see the CommunicatorHandle for that either. Do you have a use case?

Nope. At the moment InterCommunicator can be created from CommunicatorHandle (which is a bit weird) so this needs to be changed to hide the handle. I wanted to ask whether to hide the handle before changing it, because it was like this before. For some reason the check whether a raw handle is of an inter-comm was implemented in the InterComm constructor instead of the Handle-constructor. I have since changed that.

Copy link
Contributor

@jedbrown jedbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work here and sorry for my slow review. Would it be a lot to ask for an example showing something that can be done in the new model that couldn't be done before?

I suspect the new version can compile a little faster/smaller due to fewer variants to monomorphize, though even that is hard to measure. I find it a bit more intuitive, but it seems to be almost exactly the same amount of code. Unfortunately, we can still hardly use &dyn Communicator because most methods require Self: Sized. I'm not sure how to gracefully get us out of that bind.

src/topology/mod.rs Outdated Show resolved Hide resolved
examples/immediate_multiple_test.rs Outdated Show resolved Hide resolved
src/topology/mod.rs Outdated Show resolved Hide resolved
src/topology/mod.rs Outdated Show resolved Hide resolved
src/topology/mod.rs Outdated Show resolved Hide resolved
src/topology/cartesian.rs Outdated Show resolved Hide resolved
@jedbrown jedbrown requested a review from tbetcke April 2, 2023 03:45
Co-authored-by: Jed Brown <jed@jedbrown.org>
@Cydhra
Copy link
Contributor Author

Cydhra commented Apr 4, 2023

About the example what can be done in the new model:
I don't think it changes that much to be honest. I initially wanted to change it, because I was using &SystemCommunicator in my crate, which was overly restrictive. So I tried to unify it with UserCommunicator (because as #148 points out, dyn Communicator doesnt work), but then realized that I cannot easily unify the struct with InterCommunicator.

I then changed my approach to your suggestion of at least encapsulating the handle correctly, which further disconnected this PR from its initial attempt to make it easier to support different communicators for higher-level routines: Now the distinction between Intra-, Inter- and topological communicators still exists publicly, only the handle is nicely modelled. I am still hoping this will simplify the solution for #148 (see below), but from the outside, this PR doesn't improve the API that much, sadly.

About all the safety-related stuff: I tried to not create more mess than was already present, but some functions were and are not marked unsafe that really should be (then again, at least some collective communication functions are unsound but not marked unsafe either).

I opened #157 and linked a commit that builds upon this PR to fix the issue with process_at_rank detailed in #148. I think this should motivate the changes here, but it is probably also possible to do that without the changes here, if they aren't desirable.

 - renamed checked from_raw functions to try_from_raw
 - removed from_handle functions for inter-comms, as those can as well be replaced with from_raw functions
 - added more safety documentation
@Cydhra
Copy link
Contributor Author

Cydhra commented May 2, 2023

Is this still desired?

@jedbrown
Copy link
Contributor

jedbrown commented May 2, 2023

Sorry, I got swamped with end-of-semester stuff. I'll be able to put more attention toward this next week.

Copy link
Contributor

@jedbrown jedbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Sorry about losing track of this. Pipeline is running now.

src/environment.rs Outdated Show resolved Hide resolved
@jedbrown jedbrown merged commit 25de28e into rsmpi:main Oct 20, 2023
7 checks passed
Cydhra added a commit to Cydhra/rsmpi that referenced this pull request Oct 29, 2023
…cators (rsmpi#155)

* Unified SystemCommunicator and UserCommunicator into common enum
- migrated topology-related code from UserCommunicator into the common enum, so both communicator variants can now call those functions. Spec indicates that this is legal
- refactored CartesianCommunicator to accept both World- and UserCommunicator, which is legal by spec, I don't know why this wasn't possible before
- since WorldCommunicator is no longer Copy, code that moved a reference to SystemCommunicator must now explicitly refer to IntraCommunicator::WorldCommunicator (see immediate_multiple_test.rs)

* Added SelfCommunicator into SimpleCommunicator for easy access

* cargo fmt

* refactored optional code that was untested in msmpi

* SimpleCommunicator is no longer Clone, as that violates RAII logic

* merged SystemCommunicatorHandle and UserCommunicatorHandle into one enum.
- SysHandles are no longer Copy&Clone, but they no longer hold state

* merged SimpleCommunicator and SimpleCommunicatorHandle into CommunicatorHandle and simplified InterCommunicator type
 * removed CommunicatorHandle trait
 * removed [User|System]InterCommunicator types in favor of InterCommunicator
 * added self_comm() function to SimpleCommunicator type
 * added inter-comm check to constructor of intercomm handles instead of constructor of intercomms to save on MPI-calls.
 * added unchecked inter-comm handle constructor as all callers of inter-comm constructors would unwrap the optional anyway

* made constructors public again, since handles are public too atm

* downgraded handle-enum visibility to pub(crate)

* implemented FromRaw for SimpleCommunicator and InterCommunicator

* implemented FromRaw for CartesianCommunicator

* Better explanation of empty drop logic

Co-authored-by: Jed Brown <jed@jedbrown.org>

* example recommendation: &impl Communicator vs SimpleCommunicator

* Cleaned up unsafe from_raw functions
 - renamed checked from_raw functions to try_from_raw
 - removed from_handle functions for inter-comms, as those can as well be replaced with from_raw functions
 - added more safety documentation

* Mild cleaning

---------

Co-authored-by: Jed Brown <jed@jedbrown.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants