
Enable RX Fan-In #3

Closed
quetric opened this issue Sep 21, 2021 · 3 comments
@quetric
Collaborator

quetric commented Sep 21, 2021

Currently the receive pipeline supports fan-in == 1 and all collectives are ring-based as a result. Adding support for fan-in > 1 would enable tree collectives.

@quetric quetric added the enhancement New feature or request label Sep 21, 2021
@DanieleParravicini
Contributor

[image: RX datapath block diagram showing the notification AXIS]

As the image shows, the RX datapath provides a notification AXIS (AXI-Stream).
This is how it works:

  1. When bytes arrive at the stack and are stored in its buffer (window), ready to be read by the application (in our case the CCLO), the stack sends a notification carrying the session ID and the number of bytes ready.
  2. The user kernel (the CCLO in this case) replies with the session ID and the number of bytes it wants to read.
    As a side note, the high-level message we have now carries the total number of bytes of the message in bits 32-63.
    What I propose, to get past RX fan-in = 1 in the easy case, is the following (see the sketch after this list).
    The depacketizer waits for notifications from the stack.
    When the first one arrives, the depacketizer reads the first bytes (which include the header) and discovers the number of bytes to be read.
    It dequeues further stack notifications to learn when new data has been received.
    Data coming from other sessions is kept aside and waits.
    All data from the first active session is consumed, up to the last byte.
    Then we move on to the next session.
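A minimal HLS-style C++ sketch of that handshake and the drain-one-session-at-a-time policy. The stream names (`notif_in`, `read_req_out`, `data_in`), the struct layouts and `HEADER_BYTES` are assumptions for illustration, not the actual stack or CCLO interfaces:

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// Assumption: one 64-byte header word whose bits 32-63 carry the message length.
const unsigned HEADER_BYTES = 64;

// Hypothetical notification from the stack: which session has data and how much.
struct rx_notification {
    ap_uint<16> session_id;
    ap_uint<32> bytes_ready;
};

// Hypothetical read request back to the stack: session and number of bytes to consume.
struct rx_read_request {
    ap_uint<16> session_id;
    ap_uint<32> bytes_to_read;
};

// Serve one message from one session end-to-end before switching to another session.
void depacketizer_handshake(hls::stream<rx_notification> &notif_in,
                            hls::stream<rx_read_request> &read_req_out,
                            hls::stream<ap_uint<512> > &data_in) {
    // The first notification selects the active session (assume it covers at least the header).
    rx_notification first = notif_in.read();
    ap_uint<16> active_session = first.session_id;

    // Fetch the header and extract the total message length from bits 32-63.
    read_req_out.write(rx_read_request{active_session, HEADER_BYTES});
    ap_uint<512> header = data_in.read();
    ap_uint<32> message_bytes = header(63, 32);

    ap_uint<32> consumed = 0;
    ap_uint<32> available = first.bytes_ready - HEADER_BYTES;

    while (consumed < message_bytes) {
        if (available == 0) {
            // Dequeue further notifications; only the active session makes progress,
            // data for other sessions stays buffered in the stack for now.
            rx_notification n = notif_in.read();
            if (n.session_id == active_session) available += n.bytes_ready;
            continue;
        }
        // Never request past the end of the current message (boundary handling).
        ap_uint<32> remaining = message_bytes - consumed;
        ap_uint<32> chunk = (available < remaining) ? available : remaining;
        read_req_out.write(rx_read_request{active_session, chunk});
        // Forward the requested bytes (as 64-byte words) toward the CCLO datapath.
        for (ap_uint<32> b = 0; b < chunk; b += 64) {
            ap_uint<512> word = data_in.read();
            (void)word; // downstream handling omitted in this sketch
        }
        consumed += chunk;
        available -= chunk;
    }
    // Message complete: the next notification picks the next session to serve.
}
```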

We need to:

  1. avoid running out of memory in the network stack. Is data saved in DDR/HBM at the moment?
  2. (related to the previous point) avoid limiting the number of ranks we can support
  3. avoid this approach leading to a deadlock (imagine you have to sum two buffers of 32 MB each coming from two external FPGAs. You can receive 1 MB from each and sum them to build the resulting 32 MB output. If instead we receive all 32 MB from one and then 32 MB from the other, we may run out of space in the CCLO staging area (spare buffers))
  4. ensure that we handle the end of a message properly (e.g. a message of 5 MB, where the stack notifies each time 4 MB has arrived: the depacketizer needs to respect message boundaries and fetch only the missing 1 MB after the first 4 MB are read; see the example after this list). This is because:
    a. we need to be fair and serve other sessions
    b. we must avoid mixing data coming from different sessions
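To illustrate point 4, the boundary rule boils down to never requesting more than what is left of the current message. A small stand-alone C++ illustration using the numbers from the example above (the function name is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Request the smaller of what the stack has buffered and what is still missing
// from the current message, so a read never crosses a message boundary.
uint64_t bytes_to_request(uint64_t bytes_ready, uint64_t message_total, uint64_t consumed) {
    return std::min(bytes_ready, message_total - consumed);
}

int main() {
    const uint64_t MB = 1024 * 1024;
    // 5 MB message, stack buffers/notifies 4 MB at a time.
    assert(bytes_to_request(4 * MB, 5 * MB, 0) == 4 * MB);      // first read: take the full 4 MB
    assert(bytes_to_request(4 * MB, 5 * MB, 4 * MB) == 1 * MB); // second read: only the missing 1 MB
    return 0;
}
```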

@DanieleParravicini
Contributor

I can sketch out an FSM for that if you want.
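For reference, one possible state breakdown for such an FSM (a C++ sketch only; the state names are assumptions, not taken from the repo):

```cpp
// Candidate depacketizer states for fan-in > 1: one message from one session
// is drained end-to-end before another session is served.
enum class RxFsmState {
    WAIT_NOTIFICATION, // idle: wait for the stack to announce data on some session
    READ_HEADER,       // request and parse the header to learn the total message length
    REQUEST_DATA,      // ask the stack for min(bytes buffered, bytes still missing)
    FORWARD_DATA,      // stream the requested payload words to the CCLO datapath
    END_OF_MESSAGE     // release the session and return to WAIT_NOTIFICATION
};
```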

@quetric
Collaborator Author

quetric commented Feb 23, 2022

Closing, feature implemented in dev

@quetric quetric closed this as completed Feb 23, 2022