Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non string (composite) keys #3

Closed
ghost opened this issue Jul 4, 2017 · 10 comments
Closed

Non string (composite) keys #3

ghost opened this issue Jul 4, 2017 · 10 comments

Comments

@ghost
Copy link

ghost commented Jul 4, 2017

Hi! Does the hat-trie have to be string keys only? Would be really great to have something like arbitrary type composite keys along the lines of MultiIndex container http://www.boost.org/doc/libs/1_64_0/libs/multi_index/doc/tutorial/key_extraction.html#composite_keys

@Tessil
Copy link
Owner

Tessil commented Jul 4, 2017

For now the HAT-trie only supports a contiguous sequence of chars (could be used to store binary data as a '\0' char inside the sequence is supported).

I'm not that familiar with the composite keys of boost::multi_index_container. Do you have a simple example of what you want to achieve so I can wrap my head around?

If you want to store something like:

struct employee {
    std::string first_name;
    std::string last_name;
};

You could store in the HAT-trie the concatenation of the first name and the last name and provide a string_view into this concatenation in your interface.

struct employee {
    std::string_view first_name() const;
    std::string_view last_name() const;
};

@ghost
Copy link
Author

ghost commented Jul 4, 2017

I'm more thinking of binary types like:

struct order {
    int64_t price;
    int32_t id;
};

And then I'd need to find an order by it's price + id combination and iterate over orders at the same price but having lower id. Would you recon it could be done by putting both to std::string maybe in big-endian?

@Tessil
Copy link
Owner

Tessil commented Jul 5, 2017

You can do something like this.

#include <iostream>
#include "htrie_set.h"

struct order {
    int64_t price;
    int32_t id;
};

int main() {
    order o1{10, 1};
    order o2{5, 2};
    order o3{10, 3};
    order o4{16, 4};
    
    static_assert(std::is_pod<order>::value, "");
    
    tsl::htrie_set<char> test;
    test.insert_ks(reinterpret_cast<char*>(&o1), sizeof(o1));
    test.insert_ks(reinterpret_cast<char*>(&o2), sizeof(o2));
    test.insert_ks(reinterpret_cast<char*>(&o3), sizeof(o3));
    test.insert_ks(reinterpret_cast<char*>(&o4), sizeof(o4));
    


    
    int64_t price = 10;
    auto its_prefix = test.equal_prefix_range_ks(
                                 reinterpret_cast<char*>(&price), sizeof(price));

    order o;
    std::string key_buffer;

    // 3: 10    1: 10
    for(auto it = its_prefix.first; it != its_prefix.second; ++it) {
        it.key(key_buffer);
        
        assert(key_buffer.size() == sizeof(o));
        std::memcpy(&o, key_buffer.data(), sizeof(o));
        
        std::cout << o.id << ": " << o.price << std::endl;
    }
}

If I'm not mistaken, there should be no undefined behaviour if order is a POD and you don't transfer the trie between big and little endian but it's quite "hackish".

You also can't really find objects with a id lower than 'x' (you could check for id bigger than 'x' if you use an unsigned int by knowing the binary representation, but well...).

What is the problem with boost::multi_index_container for this usecase? Using too much memory?

@ghost
Copy link
Author

ghost commented Jul 5, 2017

"hakish" and slight "undefined behavior" is perfectly fine as long as gcc compiled x64 is doing good.

boost::multi_index_container has pretty good hash set but I don't like relying on red-black tree to maintain an order as it's not cache friendly and once tree's growing deep quite a bit of different cache lines could be required.

the hash set is heavily modified too i.e. a lot of deletes, inserts going on. I'll try to benchmark trie vs boost for my cases, thanks so much for your help!

if you'd also need to know the sum of all order quantities (extra uint32_t qty field in the struct) for each price is there a good way to store a bit of extra data in the trie node? maybe I can just have some values in the order itself I could use for extra storage though.

@Tessil
Copy link
Owner

Tessil commented Jul 5, 2017

Yes I think it should be good to do a benchmark as I'm not sure that the HAT-trie offers a significant advantage compared to boost::multi_index_container for your use case considering the complexity that it add.

For the sum, I'm not sure to understand. You can do:

std::size_t sum = 0;
for(auto it = its_prefix.first; it != its_prefix.second; ++it) {
    it.key(key_buffer);
    std::memcpy(&o, key_buffer.data(), sizeof(o));
        
    sum += o.qty;
}

You want to have an aggregate field for each price to avoid this calculation?

@ghost
Copy link
Author

ghost commented Jul 5, 2017

Exactly, an aggregate and some way to get N highest & lowest aggregates by price.

Now have a separate std::map collection of aggregates which is constantly updated and each order have a pointer to map node. But not happy with the setup, kind of too many tree structures everywhere.

@Tessil
Copy link
Owner

Tessil commented Jul 5, 2017

I think it's better to have a separate structure for the aggregates.

As the trie stores binary data, you could store multiple data types in the trie.

Example you will have a char buffer that looks like this for an order:
| char (type) | int64_t (id) | int64_t (price) | uint32 (qty) |

And for an aggregate:
| char (type) | char (aggregate type min, max, avg, ...) | int64_t (price) | uint32 (total_qty) |

The first byte differentiate an order from an aggregate. When you retrieve your data you would do:

if(key_buffer[0] == ORDER_TYPE) {
    std::memcpy(&ord, key_buffer.data() + 1, sizeof(order));
}
else {
    assert(key_buffer[0] == AGGREGATE_TYPE);
    std::memcpy(&aggr, key_buffer.data() + 1, sizeof(aggragate));
}

But I don't see any advantage compared to a separate map while it really increases the complexity of the code.

@ghost
Copy link
Author

ghost commented Jul 6, 2017

#define NONIUS_RUNNER

#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/mem_fun.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/pool/pool.hpp>

#include "../../hat-trie/src/htrie_set.h"
#include "../../nonius/nonius.h++"
#include <unordered_set>

struct order
{
    int64_t price;
    int32_t id;
};

order ht_o1 {10, 1};
order us_o1 {10, 1};
order mi_o1 {10, 1};

struct order_hash
{
    size_t operator()(const order &o) const
    {
        return std::hash<int64_t>()(o.price) + std::hash<int32_t>()(o.id);
    }
};

struct order_equal
{
    bool operator()(const order &x, const order &y) const
    {
        return x.price == y.price && x.id == y.id;
    }
};

using namespace boost::multi_index;

using mindex_set = multi_index_container<
    order,
    indexed_by<
        hashed_unique<
            member<order, const int32_t, &order::id>>,
        ordered_non_unique<
            composite_key<
                order,
                member<order, int64_t, &order::price>,
                member<order, int32_t, &order::id>>>>>;

tsl::htrie_set<char> ht_test;
std::unordered_set<order, order_hash, order_equal> us_test;
mindex_set mi_test;

NONIUS_BENCHMARK("htrie-basic", [](nonius::chronometer meter)
{
    ht_test.clear();

    meter.measure([&](int i)
    {
        ++ht_o1.id;

        if (ht_o1.id % 10 == 0)
            ++ht_o1.price;

        ht_test.insert_ks(reinterpret_cast<char*>(&ht_o1), sizeof(ht_o1));
    });
})

NONIUS_BENCHMARK("unordered_set-basic", [](nonius::chronometer meter)
{
    us_test.clear();

    meter.measure([&](int i)
    {
        ++us_o1.id;

        if (us_o1.id % 10 == 0)
            ++us_o1.price;

        us_test.insert(us_o1);
    });
})

NONIUS_BENCHMARK("multi_index-basic", [](nonius::chronometer meter)
{
    mi_test.clear();

    meter.measure([&](int i)
    {
        ++mi_o1.id;

        if (mi_o1.id % 10 == 0)
            ++mi_o1.price;

        mi_test.insert(mi_o1);
    });
})
clock resolution: mean is 20.725 ns (20480002 iterations)

benchmarking htrie-basic
collecting 100 samples, 155 iterations each, in estimated 2.077 ms
mean: 87.9874 ns, lb 86.7012 ns, ub 89.3417 ns, ci 0.95
std dev: 6.71555 ns, lb 5.84097 ns, ub 8.5427 ns, ci 0.95
found 1 outliers among 100 samples (1%)
variance is severely inflated by outliers

benchmarking unordered_set-basic
collecting 100 samples, 189 iterations each, in estimated 2.0601 ms
mean: 39.9197 ns, lb 39.7768 ns, ub 40.1902 ns, ci 0.95
std dev: 0.976376 ns, lb 0.597635 ns, ub 1.42363 ns, ci 0.95
found 15 outliers among 100 samples (15%)
variance is moderately inflated by outliers

benchmarking multi_index-basic
collecting 100 samples, 80 iterations each, in estimated 2.072 ms
mean: 62.8324 ns, lb 62.3842 ns, ub 64.4419 ns, ci 0.95
std dev: 3.71753 ns, lb 0.54332 ns, ub 8.42814 ns, ci 0.95
found 5 outliers among 100 samples (5%)
variance is severely inflated by outliers

@ghost
Copy link
Author

ghost commented Jul 6, 2017

htrie seem to be quite complicated indeed so basic inserts are getting slower. probably got to stick to simple data structures for now.

@Tessil
Copy link
Owner

Tessil commented Jul 6, 2017

Yes, I think the HAT-trie is not well-suited for your usecase. It could probably be speed-up to take advantage of the fixed size of the entries (the HAT-trie here must support strings of variable length) but it's not the main purpose of the structure, there are probably more suited datastructures for this.

@Tessil Tessil closed this as completed Jul 6, 2017
This was referenced Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant