Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update table sorting API to allow for callback #616

Closed
molpopgen opened this issue May 14, 2020 · 17 comments · Fixed by #627
Closed

Update table sorting API to allow for callback #616

molpopgen opened this issue May 14, 2020 · 17 comments · Fixed by #627
Assignees
Labels
C API Issue is about the C API
Milestone

Comments

@molpopgen
Copy link
Member

Sorting an edge table is quite expensive, and the current implementation can be out-performed by client code. The main motivation here is to speed up forward simulations.

It has been proposed to allow a callback to be passed to tsk_table_collection_sort that would look something like:

typedef int (*edge_table_sort_callback)(tsk_table_collection_t * tables, void * tbd);

I'd like to propose adding the following callback as a "built in" in tskit:

int tsk_do_not_sort_edges(tsk_table_collection_t * tables, void * tbd)
{
    return 0;
}

The idea is that this call back does nothing, meaning that client code is assuming 100% responsibility here and accepting that the integrity checker will double-check their work. It turns out to be very fiddly to get complex callbacks in other languages to work when the calling conventions differ from C. (Read: my attempts to mock this up so far all segfault.). Thus, allowing the step to be just skipped internally would be useful.

@molpopgen molpopgen added the C API Issue is about the C API label May 14, 2020
@jeromekelleher
Copy link
Member

jeromekelleher commented May 15, 2020

The way it works at the moment is that table_collection_sort is a thin wrapper for the table_sorter private "class".

My suggestion would be that we keep the existing API for tsk_table_collection_sort, but make the tsk_table_sorter part of the public API and put all the bells and whistles on that.

In terms of making hooks available, one thing we could do is have sort_edges be a function pointer which by default points to the existing implementation. Then, if you want to do something more fancy, you can use:

tsk_table_sorter_t sorter;
ret = tsk_table_sorter_init(sorter, 0);
error_check(ret);
sorter.sort_edges = my_fancy_edge_sort_func;
ret = table_sorter_run(&sorter, tables, bookmark, TSK_NO_INTEGRITY_CHECK);
error_check(ret);

It's a bit more long-winded, but seems quite flexible and extensible.

Will this work with C++ ABI/calling convention issues @molpopgen?

@jeromekelleher
Copy link
Member

jeromekelleher commented May 15, 2020

BTW, we'd change around the implementation of the table_sorter class a bit - the way things are done is left over from older code organisations, and it missed out a few refactorings. But the basic idea would stay the same.

@jeromekelleher jeromekelleher added this to the C API 0.99.3 milestone May 15, 2020
@molpopgen
Copy link
Member Author

Will this work with C++ ABI/calling convention issues @molpopgen?

Should. You can set the sorter to do nothing like I was thinking, and then run your own separately. The issue is that only a small subset of functional types in C++ convert to C function pointers without a lot of extra work to write trampolines.

@jeromekelleher
Copy link
Member

It should be possible wrap the C++ lambda fanciness with a simple function call though, right? So, you define your function

int 
edge_sort_func(tsk_table_sorter_t *self, tsk_size_t start, tsk_flags_t options)
{
    // Define fancy C++ stuff in here!
    return ret;
}

That should all just compile down to something simple enough then?

@molpopgen
Copy link
Member Author

In order for that to work, then your lamba expressions will need to do awful things like capture global variables. What is usually needed is the reverse of your example. The closure should have the signature of the simpler function, and itself called some more complex function that takes advantage of captured variables.

@jeromekelleher
Copy link
Member

Oh, OK, I was thinking too much like Python then. Would you mind pasting in an outline of what this would look like ideally from the C++ client?

@molpopgen
Copy link
Member Author

molpopgen commented May 15, 2020 via email

@bhaller
Copy link

bhaller commented May 15, 2020

I think the void * tbd in @molpopgen's original proposal would suffice for SLiM. From that we could recover the simulation object, which would get us to the table collection, which would let us do the sort in parallel ourselves. BTW, it would be good for the comparator function to be public in tskit, if it isn't already, so that one would not have to hard-code the details of that in the client. (Although one might choose to do so anyway, for efficiency; but one shouldn't have to.)

@molpopgen
Copy link
Member Author

molpopgen commented May 15, 2020 via email

@bhaller
Copy link

bhaller commented May 15, 2020

Oh, sorry, I misunderstood the purposed to the void * tbd; I took that to be the fairly standard mechanism whereby one would pass in a void* to tskit when registering one's callback, and tskit would simply pass the void* back again when calling the callback, allowing any arbitrary pointer from the client to be provided back to the callback code. If that wasn't the intent, then I'd suggest doing that; it's a common design pattern that seems adequate for most purposes. And of course I wasn't suggesting that satisfying SLiM was the only important thing! Just saying "This design [as I understood it then] would satisfy SLiM, for what that's worth." :->

@molpopgen
Copy link
Member Author

@bhaller -- if you want access to the pop object, then your callback becomes a closure, meaning that you bind (by one of the 3 or 4 possible mechanisms) your population, leaving a final signature of int(*)(table_collection *). You'd do that via a lambda, std::function, etc., and then it would crash on the C side. You'll see this in the example I write up.

If it helps, I can write in the comments of the example how function composition works? It'll take some time to write, so may not happen this week.

@bhaller
Copy link

bhaller commented May 15, 2020

If it helps, I can write in the comments of the example how function composition works? It'll take some time to write, so may not happen this week.

No need; that is not what I was suggesting. I would simply pass in a SLiMSim* to tskit when registering my callback, tskit would hang on to that void* for me and pass it back to my callback, and my callback would be a C function that would cast the void* back to a SLiMSim* and use that to get the table collection and do the sort. Pure C, as far as tskit and the ABI are concerned (perhaps one would want to declare the callback function with extern "C", but in my experience that has never been necessary). And perfectly safe, as long as the table collection doesn't live longer than the SLiMSim object (in which case one would need to deregister the callback to get rid of the dangling pointer in the void*). What is the need to make it more complicated than this?

@molpopgen
Copy link
Member Author

molpopgen commented May 15, 2020

Let's get clarification from @jeromekelleher -- is the intent that the void * be an as-yet-undefined pointer to an internal state, or something which the client callback receives? If the latter, that changes how I'd write the example.

@molpopgen
Copy link
Member Author

Assuming for the moment that I originally over-thought the problem, meaning @bhaller is correct and that the void * refers to user data, then we are good with an API that looks like this, which also allows for the callback to be undefined so that client code can do arbitrarily complicated things on their own.

The case I was originally envisioning, with tskit sending some internal state object to the callback, is a lot more complex to handle.

#include <vector>
#include <iostream>
#include <functional>

// Mock some tskit stuff

struct tsk_mock_table_collection
{
};

using tsk_mock_edge_sort_callback = int (*)(tsk_mock_table_collection *, void *);

struct tsk_mock_table_sorter
{
    tsk_mock_edge_sort_callback cback;
};

int
tsk_mock_default_edge_sorting_fxn(tsk_mock_table_collection *tables, void *)
{
    std::cout << "default callback\n";
    return 0;
}

int
tsk_mock_table_sorter_init(tsk_mock_table_sorter *sorter)
{
    sorter->cback = &tsk_mock_default_edge_sorting_fxn;
    return 0;
};

int
tsk_mock_sort_tables(tsk_mock_table_sorter *sorter, tsk_mock_table_collection *tables,
                     void *client_data)
{
    if (sorter->cback != nullptr)
        {
            sorter->cback(tables, client_data);
        }
    else
        {
            std::cout << "not sorting the tables, so hope you did your homework!\n";
        }
    return 0;
}

// Mock some client stuff

int
vanilla_client_callback(tsk_mock_table_collection *, void *)
{
    std::cout << "vanilla client callback\n";
    return 0;
}

struct _edge
{
    // left, right, parent, child,
    // used for sorting
};

int
cpp_callback_1(tsk_mock_table_collection *, void *)
{
    std::cout << "cpp callback 1\n";
    std::vector<_edge> edges;

    // copy from tables to edges
    // sort via std::sort
    // copy back
    return 0;
}

int
cpp_callback_2(tsk_mock_table_collection *, void *data)
{
    std::cout << "cpp callback 2\n";
    auto edges = static_cast<std::vector<_edge> *>(data);
    if (edges == nullptr)
        {
            return -1;
        }
    edges->clear(); // re-use those big allocations

    // copy/sort/copy

    return 1;
}

// Typdef for a C++ callback via std::function
using std_function_callback = std::function<int(tsk_mock_table_collection *, void *)>;

int
main(int argc, char **argv)
{
    tsk_mock_table_sorter sorter;
    tsk_mock_table_sorter_init(&sorter);
    tsk_mock_table_collection tables;
    tsk_mock_sort_tables(&sorter, &tables, nullptr);

    // I promise that I know what I am doing
    sorter.cback = nullptr;
    tsk_mock_sort_tables(&sorter, &tables, nullptr);

    // Now, explore more complex setups
    sorter.cback = &vanilla_client_callback;
    tsk_mock_sort_tables(&sorter, &tables, nullptr);

    // A non-capturing lambda suffices
    const auto lambda_callback_0 = [](tsk_mock_table_collection *, void *) -> int {
        std::cout << "lambda callback 0\n";
        return 0;
    };

    sorter.cback = lambda_callback_0;
    tsk_mock_sort_tables(&sorter, &tables, nullptr);

    const auto lambda_callback_1
        = [](tsk_mock_table_collection *tables, void *tbd) -> int {
        std::cout << "lambda callback 1 dispatches to";
        return cpp_callback_1(tables, tbd);
    };

    sorter.cback = lambda_callback_1;
    tsk_mock_sort_tables(&sorter, &tables, nullptr);

    sorter.cback = &cpp_callback_2;
    std::vector<_edge> edges;
    tsk_mock_sort_tables(&sorter, &tables, &edges);
}

@molpopgen
Copy link
Member Author

I never did anything using the std::function case, btw.

@jeromekelleher
Copy link
Member

Looks good to me @molpopgen, I'll try to get a PR with a proposal for the tskit API soon.

@molpopgen
Copy link
Member Author

Sounds good. I think I was being completely thick originally and had forgotten everything about C callbacks.

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 19, 2020
@jeromekelleher jeromekelleher self-assigned this May 22, 2020
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 26, 2020
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Jul 6, 2020
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Jul 6, 2020
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Jul 7, 2020
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Jul 8, 2020
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Jul 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C API Issue is about the C API
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants