Fail to generate perfect hash for reading when there are one or more unicode keys in struct #666

Closed
Nukoooo opened this issue Dec 28, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@Nukoooo

Nukoooo commented Dec 28, 2023

I was trying to read CSV files that have Unicode keys and hit this bug. After debugging a little, it turned out to be caused by the Unicode keys.

Compiler: MSVC 19.38.33133 / Visual Studio 2022 17.8.3 (it also doesn't work in 17.8.2)

Callstack from visual studio error report:
[image: call stack screenshot]

Code to reproduce
struct unicode_keys_t
{
    float field1;
    float field2;
    std::uint8_t field3;
    std::string field4;
    std::string field5;
    std::string field6;
    std::string field7;
};

// example 1
template <>
struct glz::meta<unicode_keys_t>
{
    using T = unicode_keys_t;
    static constexpr auto value = object("alpha",&T::field1,
                                         "bravo",&T::field2, 
                                         "charlie",&T::field3,
                                         "♥️",&T::field4,
                                         "delta",&T::field5,
                                         "echo",&T::field6, // won't compile if there are any unicode keys after this
                                         "😄",&T::field7 // fails
                                      // "foxtrot",&T::field7 // success
    );
};

// example 2
template <>
struct glz::meta<unicode_keys_t>
{
    using T = unicode_keys_t;
    static constexpr auto value = object("😄",&T::field1,
                                         "💔",&T::field2, // won't compile if there are any keys (both unicode and non-unicode) after this (order doesn't matter)
                                         "alpha",&T::field3
    );
};
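
As background for the reproduction above, here is a minimal sketch of what distinguishes the failing keys at the byte level, assuming the source file is compiled as UTF-8. The helper name count_high_bytes is hypothetical and is not part of Glaze. It only shows that ASCII keys contain no bytes at or above 0x80, while an emoji key consists entirely of such bytes.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not a Glaze API): counts bytes with the high bit set,
// i.e. bytes that belong to a multi-byte UTF-8 sequence.
constexpr std::size_t count_high_bytes(std::string_view key) {
    std::size_t n = 0;
    for (char c : key) {
        // Compare through unsigned char so the check does not depend on
        // whether plain char is signed on this platform.
        if (static_cast<unsigned char>(c) >= 0x80) ++n;
    }
    return n;
}

// ASCII-only key: no high bytes.
static_assert(count_high_bytes("alpha") == 0);
// U+1F604 written out as its four UTF-8 bytes: every byte is >= 0x80.
static_assert(count_high_bytes("\xF0\x9F\x98\x84") == 4);
```

On platforms where plain char is signed (MSVC on x86/x64 among them), each of those high bytes is a negative char value.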
@stephenberry
Owner

Thanks for this bug report. Could you share some of the unicode keys that are failing for you? I haven't tested much with unicode keys, so I'll have to look into it and add more unit tests.

@Nukoooo
Author

Nukoooo commented Dec 29, 2023

Sure, here is the struct for one of my CSV files:

struct FishRecord
{
     float Duration;
     float FishSize;
     std::uint8_t Amount;

     std::string FishBaitName;
     std::string SurfaceSlapFishName;
     std::string MoochFishName;
     std::string BuffName;
     std::string FishingSpotPlaceName;

     std::string BiteTypeName;
     std::string CaughtFishName;
     std::string HooksetName;
};

template <>
struct glz::meta<FishRecord>
{
    using T = FishRecord;
    static constexpr auto value = object("上钩的鱼", &T::CaughtFishName, //
                                         "间隔", &T::Duration, //
                                         "尺寸", &T::FishSize, //
                                         "数量", &T::Amount, //
                                         "鱼饵", &T::FishBaitName, //
                                         "拍水的鱼", &T::SurfaceSlapFishName, //
                                         "以小钓大的鱼", &T::MoochFishName, //
                                         "Buff", &T::BuffName, //
                                         "钓场", &T::FishingSpotPlaceName, //
                                         "咬钩类型", &T::BiteTypeName, //
                                         "提钩类型", &T::HooksetName //
    );
};

@Nukoooo
Author

Nukoooo commented Dec 29, 2023

I went ahead and tested a little more, and "when there are one or more unicode keys in struct" doesn't seem to be the exact condition.
This struct has more than two Unicode keys and it compiles:

template <>
struct glz::meta<unicode_keys_t>
{
    using T = unicode_keys_t;
    static constexpr auto value = object("简体汉字", &T::field0, // simplified chinese characters
                                         "漢字寿限無寿限無五劫", &T::field1, // traditional chinese characters / kanji
                                         "こんにちはむところやぶらこうじのぶらこうじパイポパイポパイポのシューリンガンシューリンガンのグーリンダイグーリンダイのポンポコピーのポンポコナの", &T::field2, // katakana 
                                         "한국인", &T::field3, // korean
                                         "русский", &T::field4, // cyrillic
                                         "สวัสดี", &T::field5, // thai
                                         "english", &T::field6 
    );
};

However, when I replace the key for field2 with こんにちはむところやぶら, which basically removes a bunch of the characters, it fails to compile.

@stephenberry added the bug (Something isn't working) label Dec 29, 2023
@stephenberry
Owner

@Nukoooo, I'm looking into this now. I'm wondering how you're using these structures with CSV reading. Glaze is designed to read into containers like std::vector<std::string> or std::deque<float>. Are you trying to read into a structure, like std::vector<FishRecord>? This is a reasonable desire/request, but it is currently not supported. Could you provide an example CSV for one of these use cases? I'm curious that sometimes things are working for you. I converted the types to std::vector<int> and was not able to replicate the unicode issues you are having, so there's something going on with the structure.

If you can clarify your use case a little bit more I think we can resolve this. Thanks!

@Nukoooo
Author

Nukoooo commented Jan 2, 2024

Ah, I just realized I had given you the struct I used for testing. Sorry about that.

"Are you trying to read into a structure, like std::vector<FishRecord>?"

No, that struct was for testing with JSON. I originally thought the problem was caused by the CSV format, so I tried removing std::vector from the struct.

I was reading into std::vector<float>, std::vector<std::uint8_t>, std::vector<std::string> for user-readable data and std::vector<std::int32_t>, std::vector<bool> for raw data:

struct FishRecord
{
    std::vector<float> Duration;
    std::vector<float> FishSize;
    std::vector<std::uint8_t> Amount;

    std::vector<std::string> FishBaitName;
    std::vector<std::string> SurfaceSlapFishName;
    std::vector<std::string> MoochFishName;
    std::vector<std::string> BuffName;
    std::vector<std::string> FishingSpotPlaceName;

    std::vector<std::string> BiteTypeName;
    std::vector<std::string> CaughtFishName;
    std::vector<std::string> HooksetName;
    std::vector<std::string> IsLargeSizeName;
    std::vector<std::string> IsCollectableName;
};

template <>
struct glz::meta<FishRecord>
{
    using T = FishRecord;
    static constexpr auto value = object("上钩的鱼", &T::CaughtFishName, //
                                         "间隔", &T::Duration, //
                                         "尺寸", &T::FishSize, //
                                         "数量", &T::Amount, //
                                         "鱼饵", &T::FishBaitName, //
                                         "拍水的鱼", &T::SurfaceSlapFishName, // 
                                         "以小钓大的鱼", &T::MoochFishName, //
                                         "Buff", &T::BuffName, //
                                         "钓场", &T::FishingSpotPlaceName, //
                                         "咬钩类型", &T::BiteTypeName, //
                                         "提钩类型", &T::HooksetName, //
                                         "大尺寸", &T::IsLargeSizeName, //
                                         "收藏品", &T::IsCollectableName //
    );
};

struct FishRecordRaw
{
    std::vector<float> Duration;
    std::vector<float> FishSize;
    std::vector<std::uint8_t> Amount;

    std::vector<std::int32_t> FishBaitIds;
    std::vector<std::int32_t> SurfaceSlapFishIds;
    std::vector<std::int32_t> MoochFishIds;
    std::vector<std::int32_t> BuffFlags;
    std::vector<std::int32_t> FishingSpotPlaceNameId;

    std::vector<std::uint8_t> BiteType;
    std::vector<std::uint32_t> CaughtFishId;
    std::vector<std::uint8_t> Hookset;
    std::vector<bool> IsLargeSize;
    std::vector<bool> IsCollectable;
};

template <>
struct glz::meta<FishRecordRaw>
{
    using T = FishRecordRaw;
    static constexpr auto value = object(
                                         "上钩的鱼", &T::CaughtFishId, //
                                         "间隔", &T::Duration, //
                                         "尺寸", &T::FishSize, //
                                         "数量", &T::Amount, //
                                         "鱼饵", &T::FishBaitIds, //
                                         "拍水的鱼", &T::SurfaceSlapFishIds, // 
                                         "以小钓大的鱼", &T::MoochFishIds, //
                                         "Buff", &T::BuffFlags, //
                                         "钓场", &T::FishingSpotPlaceNameId, //
                                         "咬钩类型", &T::BiteType, //
                                         "提钩类型", &T::Hookset, //
                                         "大尺寸", &T::IsLargeSize, //
                                         "收藏品", &T::IsCollectable //
    );
};

The user-readable CSV data:

上钩的鱼,间隔,尺寸,数量,鱼饵,拍水的鱼,以小钓大的鱼,Buff,钓场,咬钩类型,提钩类型,大尺寸,收藏品
珊瑚蝶,8.365,12.5,1,万能拟饵,无,无,撒饵,利姆萨·罗敏萨下层甲板,重杆,提钩,否,否
海港鲱,7.586,11.1,1,万能拟饵,无,无,撒饵,利姆萨·罗敏萨下层甲板,重杆,提钩,否,否
猛犸章鱼,26.044,55.6,1,万能拟饵,无,海港鲱,撒饵,利姆萨·罗敏萨下层甲板,鱼王杆,提钩,否,否

The raw data:

上钩的鱼,间隔,尺寸,数量,鱼饵,拍水的鱼,以小钓大的鱼,Buff,钓场,咬钩类型,提钩类型,大尺寸,收藏品
4876,8.365,12.5,1,29717,0,0,2,29,2,1,0,0
4874,7.586,11.1,1,29717,0,0,2,29,2,1,0,0
7707,26.044,55.6,1,29717,0,4874,2,29,3,1,0,0

@stephenberry
Owner

Thanks for this. I made some progress today figuring it out. It has to do with the naive_map and unicode. Will hopefully have a fix soon.

@stephenberry
Owner

stephenberry commented Jan 3, 2024

@Nukoooo, I think I fixed the bug with #670. The issue was with hashing Unicode characters: I was casting char bytes directly to uint64_t, which would make the values enormous for Unicode bytes. The fix was to use uint64_t(uint8_t(character)), so that the uint64_t value is between 0 and 255 as intended.

If you run into more issues, please let me know. And, thanks so much for pointing this out!
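
To illustrate the cast described in the fix, here is a minimal sketch, not the actual Glaze code. The function names widen_direct and widen_via_u8 are hypothetical. It assumes a platform where plain char is signed, as with MSVC; where char is unsigned, both functions return the same value.

```cpp
#include <cstdint>

// Direct widening. When plain char is signed, a byte >= 0x80 is negative,
// so the conversion sign-extends and produces a value near 2^64.
constexpr std::uint64_t widen_direct(char c) {
    return static_cast<std::uint64_t>(c);
}

// Widening through uint8_t, the form quoted in the fix:
// the result is always in the range [0, 255].
constexpr std::uint64_t widen_via_u8(char c) {
    return std::uint64_t(std::uint8_t(c));
}

// 0xF0 is the first UTF-8 byte of U+1F604.
constexpr char lead = static_cast<char>(0xF0);

// Holds regardless of whether char is signed.
static_assert(widen_via_u8(lead) == 0xF0);
// ASCII bytes are unaffected by either form.
static_assert(widen_direct('a') == 97 && widen_via_u8('a') == 97);
```

With a signed char, widen_direct(lead) yields 0xFFFFFFFFFFFFFFF0 rather than 0xF0, which is the kind of enormous value the comment above refers to.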
