support for MPI data count larger than INT_MAX #43

Closed
JieRen98 opened this issue Apr 8, 2024 · 30 comments

JieRen98 commented Apr 8, 2024

Is your feature request related to a problem? Please describe.
MPI uses a 32-bit int as the data count, as in the following signature:

int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)

When we want to send a larger buffer (count > INT_MAX), we need to split it into several chunks and send them one by one. However, StarPU does not support this so far (e.g., https://gitlab.inria.fr/starpu/starpu/-/blob/master/src/drivers/mpi/driver_mpi_common.c?ref_type=heads#L300).

Describe the solution you'd like
Split the buffer into several chunks:

void send_large_byte_buffer(const void* data, size_t total_size, int dest, int tag, MPI_Comm comm) {
    const size_t max_int = INT_MAX;
    /* Number of INT_MAX-byte chunks needed to cover the whole buffer. */
    size_t chunks = (total_size + max_int - 1) / max_int;

    for (size_t i = 0; i < chunks; ++i) {
        const size_t offset = i * max_int;
        const size_t remaining = total_size - offset;
        /* Each individual send stays within the int range of MPI_Send's count. */
        const int count = remaining > max_int ? INT_MAX : (int)remaining;

        MPI_Send((const char*)data + offset, count, MPI_BYTE, dest, tag, comm);
    }
}
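
For completeness, a matching receive-side sketch under the same chunking assumption (the function name is illustrative, not an existing StarPU helper):

void recv_large_byte_buffer(void* data, size_t total_size, int src, int tag, MPI_Comm comm) {
    const size_t max_int = INT_MAX;
    /* Same chunk count as on the sending side. */
    size_t chunks = (total_size + max_int - 1) / max_int;

    for (size_t i = 0; i < chunks; ++i) {
        const size_t offset = i * max_int;
        const size_t remaining = total_size - offset;
        const int count = remaining > max_int ? INT_MAX : (int)remaining;

        MPI_Recv((char*)data + offset, count, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
    }
}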

The signatures of the relevant functions (e.g., __starpu_mpi_common_send_to_device, __starpu_mpi_common_send, etc.) should be changed correspondingly.

Describe alternatives you've considered
N/A

Additional context
N/A

sthibaul (Collaborator) commented Apr 8, 2024

StarPU does not support it so far. (e.g., https://gitlab.inria.fr/starpu/starpu/-/blob/master/src/drivers/mpi/driver_mpi_common.c?ref_type=heads#L300)

I guess you are using starpu_mpi_task_insert etc., not the MPI master-slave driver support, so it's rather the MPI datatype definition from mpi/src/starpu_mpi_datatype.c that you need fixed.

I understand that you have an urgent deadline. Which StarPU data type are you using in your application?

JieRen98 (Author) commented Apr 8, 2024 via email

sthibaul (Collaborator) commented Apr 8, 2024

I am using Chameleon actually. I am not completely sure how Chameleon interacts with StarPU

Ok, then I guess you are using a matrix descriptor from Chameleon?

JieRen98 (Author) commented Apr 8, 2024 via email

sthibaul (Collaborator) commented Apr 8, 2024

How is it customized? Essentially, the question is which starpu_data_something_register function is getting called in your case.

sthibaul (Collaborator) commented Apr 8, 2024

Put another way, do you have any data_register call beyond starpu_vector_data_register and starpu_mpi_data_register?

Is it using starpu_data_register directly?

sthibaul (Collaborator) commented Apr 8, 2024

Does it use starpu_mpi_interface_datatype_register?

JieRen98 (Author) commented Apr 8, 2024 via email

sthibaul (Collaborator) commented Apr 8, 2024

Is MPI_Type_vector_c supported by your MPI implementation?

sthibaul (Collaborator) commented Apr 8, 2024

(basically, we would just want to use the _c variants of the MPI calls that we currently make: MPI_Irecv, MPI_Isend, MPI_Issend, MPI_Type_vector, MPI_Type_contiguous, MPI_Type_size)
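
For reference, the _c variants (introduced in MPI 4.0) differ from the classic calls only in taking MPI_Count, typically 64 bits, where the classic calls take int, e.g.:

int MPI_Isend  (const void *buf, int       count, MPI_Datatype datatype,
                int dest, int tag, MPI_Comm comm, MPI_Request *request);
int MPI_Isend_c(const void *buf, MPI_Count count, MPI_Datatype datatype,
                int dest, int tag, MPI_Comm comm, MPI_Request *request);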

JieRen98 (Author) commented Apr 8, 2024

Chameleon uses both starpu_data_register and starpu_mpi_data_register to register tiles. I believe StarPU does not know what the type is but only the size count in bytes.

sthibaul (Collaborator) commented Apr 8, 2024

Then it must also be using starpu_mpi_interface_datatype_register to register the MPI type to be used? Otherwise StarPU does not even know the size in bytes.

sthibaul (Collaborator) commented Apr 8, 2024

(starpu_data_register alone does not tell StarPU the size in bytes)

JieRen98 (Author) commented Apr 8, 2024

Yes, you are right. Here is the type registration, although I do not understand it completely:

void
starpu_cham_tile_interface_init()
{
    if ( starpu_interface_cham_tile_ops.interfaceid == STARPU_UNKNOWN_INTERFACE_ID )
    {
        starpu_interface_cham_tile_ops.interfaceid = starpu_data_interface_get_next_id();
#if defined(CHAMELEON_USE_MPI_DATATYPES)
  #if defined(HAVE_STARPU_MPI_INTERFACE_DATATYPE_NODE_REGISTER)
        starpu_mpi_interface_datatype_node_register( starpu_interface_cham_tile_ops.interfaceid,
                                                    cti_allocate_datatype_node,
                                                    cti_free_datatype );
  #else
        starpu_mpi_interface_datatype_register( starpu_interface_cham_tile_ops.interfaceid,
                                                cti_allocate_datatype,
                                                cti_free_datatype );
  #endif
#endif
    }
}

This shows how Chameleon registers the tile. I thought the .allocsize and .tilesize attributes would tell StarPU the size.

void
starpu_cham_tile_register( starpu_data_handle_t *handleptr,
                           int                   home_node,
                           CHAM_tile_t          *tile,
                           cham_flttype_t        flttype )
{
    size_t elemsize = CHAMELEON_Element_Size( flttype );
    starpu_cham_tile_interface_t cham_tile_interface =
        {
            .id         = STARPU_CHAM_TILE_INTERFACE_ID,
            .flttype    = flttype,
            .dev_handle = (intptr_t)(tile->mat),
            .allocsize  = -1,
            .tilesize   = tile->m * tile->n * elemsize,
        };
    memcpy( &(cham_tile_interface.tile), tile, sizeof( CHAM_tile_t ) );
    /* Overwrite the flttype in case it comes from a data conversion */
    cham_tile_interface.tile.flttype = flttype;

    if ( tile->format & CHAMELEON_TILE_FULLRANK ) {
        cham_tile_interface.allocsize = tile->m * tile->n * elemsize;
    }
    else if ( tile->format & CHAMELEON_TILE_DESC ) { /* Needed in case starpu ask for it */
        cham_tile_interface.allocsize = tile->m * tile->n * elemsize;
    }
    else if ( tile->format & CHAMELEON_TILE_HMAT ) {
        /* For hmat, allocated data will be handled by hmat library. StarPU cannot allocate it for the library */
        cham_tile_interface.allocsize = 0;
    }

    starpu_data_register( handleptr, home_node, &cham_tile_interface, &starpu_interface_cham_tile_ops );
}

sthibaul (Collaborator) commented Apr 8, 2024

Please also show cti_allocate_datatype_node and cti_allocate_datatype; that's most probably where you need a fix.

JieRen98 (Author) commented Apr 8, 2024

Here you go:

#if defined(CHAMELEON_USE_MPI_DATATYPES)
int
cti_allocate_datatype_node( starpu_data_handle_t handle,
                            unsigned             node,
                            MPI_Datatype        *datatype )
{
    int ret;

    starpu_cham_tile_interface_t *cham_tile_interface = (starpu_cham_tile_interface_t *)
        starpu_data_get_interface_on_node( handle, node );

    size_t m  = cham_tile_interface->tile.m;
    size_t n  = cham_tile_interface->tile.n;
    size_t ld = cham_tile_interface->tile.ld;
    size_t elemsize = CHAMELEON_Element_Size( cham_tile_interface->flttype );

    ret = MPI_Type_vector( n, m * elemsize, ld * elemsize, MPI_BYTE, datatype );
    STARPU_ASSERT_MSG(ret == MPI_SUCCESS, "MPI_Type_vector failed");

    ret = MPI_Type_commit( datatype );
    STARPU_ASSERT_MSG(ret == MPI_SUCCESS, "MPI_Type_commit failed");

    return 0;
}

int
cti_allocate_datatype( starpu_data_handle_t handle,
                       MPI_Datatype        *datatype )
{
    return cti_allocate_datatype_node( handle, STARPU_MAIN_RAM, datatype );
}

void
cti_free_datatype( MPI_Datatype *datatype )
{
    MPI_Type_free( datatype );
}
#endif

sthibaul (Collaborator) commented Apr 8, 2024

Also, again,

Is MPI_Type_vector_c supported by your MPI implementation?

JieRen98 (Author) commented Apr 8, 2024

MPI_Type_vector_c

I didn't find any line that includes MPI_Type_vector_c

sthibaul (Collaborator) commented Apr 8, 2024

Here you go:

    ret = MPI_Type_vector( n, m * elemsize, ld * elemsize, MPI_BYTE, datatype );

That's it: you want to use MPI_Type_vector_c instead. StarPU just MPI_Sends one of this type.
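
A minimal sketch of the fixed call, assuming an MPI implementation that provides the MPI-4 large-count functions (variable names taken from cti_allocate_datatype_node above):

    /* MPI_Type_vector_c takes MPI_Count arguments, so n, m * elemsize and
     * ld * elemsize are no longer truncated to int. */
    ret = MPI_Type_vector_c( (MPI_Count)n, (MPI_Count)(m * elemsize),
                             (MPI_Count)(ld * elemsize), MPI_BYTE, datatype );
    STARPU_ASSERT_MSG(ret == MPI_SUCCESS, "MPI_Type_vector_c failed");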

sthibaul (Collaborator) commented Apr 8, 2024

MPI_Type_vector_c

I didn't find any line that includes MPI_Type_vector_c

Where did you not find it?

Put another way: which MPI implementation are you using?

sthibaul (Collaborator) commented Apr 8, 2024

I see that, notably, Open MPI doesn't seem to provide the _c variants, so in that case you need to use a for loop to make the series of MPI_Type_vector calls

JieRen98 (Author) commented Apr 8, 2024

Here you go:

    ret = MPI_Type_vector( n, m * elemsize, ld * elemsize, MPI_BYTE, datatype );

That's it: you want to use MPI_Type_vector_c instead. StarPU just MPI_Sends one of this type.

Ok, Chameleon uses MPI_BYTE and the leading dimension becomes ld * sizeof(t); that's fine I guess. So do you mean I should change this to MPI_Type_vector_c to support large counts?

sthibaul (Collaborator) commented Apr 8, 2024

If your MPI implementation supports MPI_Type_vector_c, that's the simplest, yes. If not, you need to use a for loop to describe the data type piece by piece.
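
For implementations without the _c variants, here is a sketch of one way to describe a contiguous buffer larger than INT_MAX bytes piece by piece with classic calls only (the helper name and chunking scheme are illustrative, and a strided tile would need the same treatment applied per dimension):

#include <limits.h>
#include <mpi.h>

static void build_large_byte_type( size_t total_size, MPI_Datatype *newtype )
{
    const size_t chunk = INT_MAX;              /* largest block an int can express */
    const size_t nfull = total_size / chunk;   /* number of full chunks */
    const size_t rest  = total_size % chunk;   /* trailing bytes */

    MPI_Datatype full_part, rest_part;
    /* nfull blocks of `chunk` bytes, laid out back to back */
    MPI_Type_vector( (int)nfull, (int)chunk, (int)chunk, MPI_BYTE, &full_part );
    /* the trailing partial chunk */
    MPI_Type_contiguous( (int)rest, MPI_BYTE, &rest_part );

    /* glue the two pieces together at their byte offsets */
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { 0, (MPI_Aint)(nfull * chunk) };
    MPI_Datatype types[2]     = { full_part, rest_part };
    MPI_Type_create_struct( 2, blocklens, displs, types, newtype );
    MPI_Type_commit( newtype );

    MPI_Type_free( &full_part );
    MPI_Type_free( &rest_part );
}

The resulting type can then be passed to MPI_Send/MPI_Recv with a count of 1.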

JieRen98 (Author) commented Apr 8, 2024

MPI_Type_vector_c

I didn't find any line that includes MPI_Type_vector_c

Where did you not find it?

Put another way: which MPI implementation are you using?

I am using MPICH. I mean that Chameleon does not use MPI_Type_vector_c, not that my MPI implementation lacks it.

sthibaul (Collaborator) commented Apr 8, 2024

So in the end it's the Chameleon code that needs fixing. StarPU will, however, want to do the same for its predefined vector/matrix/etc. types, so I'm keeping this issue open for that.

sthibaul (Collaborator) commented Apr 8, 2024

I am using MPICH

MPICH does have MPI_Type_vector_c, so you can simply add a _c to the MPI_Type_vector call, and that should work.

sthibaul (Collaborator) commented Apr 8, 2024

(MPICH has apparently provided it since version 4)

JieRen98 (Author) commented Apr 8, 2024

Thanks a lot! I was reading the StarPU code, but it seems I did not understand it well. So starpu_mpi_interface_datatype_node_register is the most important thing: the starpu_data_register and starpu_mpi_data_register calls have no effect on the integer overflow, as long as allocate_datatype_func is correctly defined (using MPI_Type_vector_c) when calling starpu_mpi_interface_datatype_node_register?

sthibaul (Collaborator) commented Apr 8, 2024

Yes, that's the idea. For application-defined interfaces, it's starpu_mpi_interface_datatype_node_register that tells StarPU how to send the data over MPI.

JieRen98 (Author) commented Apr 8, 2024

Thanks a lot, you helped a lot!
