
bug: Automatic sharding with store not working #2616

Closed
AlejandroCabeza opened this issue Apr 22, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@AlejandroCabeza (Contributor)

AlejandroCabeza commented Apr 22, 2024

Problem

The short and full versions of a content topic under automatic sharding don't retrieve the same messages. Given two sets of messages, one for /toychat/2/huilong/proto and another for /0/toychat/2/huilong/proto, a HistoryQuery for each content topic retrieves only its respective set of messages, not both.

Impact

...

To reproduce

  1. Define two equivalent content topics: one full version (e.g. /0/toychat/2/huilong/proto) and one short version (/toychat/2/huilong/proto).
  2. Create two sets of messages.
  3. Create an ArchiveDriver and insert the two sets of messages; each of them assigned to one of the content topics.
  4. Create a server and a client; mount store and the archive on the server, and the store client on the client.
  5. Create two HistoryQuery instances. Both will make a request to the server, one using the full content topic and the other the short content topic.
  6. Verify the results. The short content topic query will only retrieve the messages assigned to the short content topic, and the full content topic query will only retrieve the messages assigned to the full content topic.
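The mismatch in the last step boils down to raw-string comparison in the archive. A minimal Python sketch (with a hypothetical parse_content_topic helper, not nwaku code) of the spec's /<generation>/<application>/<version>/<topic-name>/<encoding> layout, with generation defaulting to 0 when the prefix is omitted, shows why the two spellings differ as strings yet denote the same topic:

```python
# Sketch: the two spellings differ as raw strings (what the DB compares today)
# but parse to the same structured topic. parse_content_topic is hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ContentTopic:
    generation: int  # defaults to 0 when the prefix is omitted
    application: str
    version: str
    name: str
    encoding: str

def parse_content_topic(raw: str) -> ContentTopic:
    parts = raw.strip("/").split("/")
    if len(parts) == 5:  # full form: /<gen>/<app>/<version>/<name>/<encoding>
        gen, app, ver, name, enc = parts
        return ContentTopic(int(gen), app, ver, name, enc)
    if len(parts) == 4:  # short form: generation prefix omitted, implied 0
        app, ver, name, enc = parts
        return ContentTopic(0, app, ver, name, enc)
    raise ValueError(f"malformed content topic: {raw}")

short = "/toychat/2/huilong/proto"
full = "/0/toychat/2/huilong/proto"

assert short != full  # raw strings differ, so a string-equality DB filter misses one set
assert parse_content_topic(short) == parse_content_topic(full)  # same topic once parsed
```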

You can also check tests/sharding branch. Open tests/node/test_wakunode_sharding.nim and check out the store (automatic sharding filtering) case. Keep in mind they are declared as xasyncTest, so you'll need to update them to asyncTest for them not to be ignored by the test runner.

Expected behavior

Given the two content topics are equivalent, both sets of messages should be retrieved by both queries.

nwaku version/commit hash

wakunode2: v0.27.0-rc.0-11-ge61e4f
branch: tests/sharding
commit: 82978c5 (might change in the future)
pr: #2603

@AlejandroCabeza AlejandroCabeza added the bug Something isn't working label Apr 22, 2024
@gabrielmer (Contributor)

@SionoiS WDYT? Are those content topics actually supposed to be equivalent?

@SionoiS (Contributor)

SionoiS commented Apr 23, 2024

@SionoiS WDYT? Are those content topics actually supposed to be equivalent?

AFAIK we store content topics as strings, so they are different at the DB level...

🤔 I'm not sure what to do exactly. I agree that they should be equivalent. I guess we could format content topics before storing them in DB?

@gabrielmer (Contributor)

Yes, at the DB level they're different strings; there's nothing to do about it there.
So formatting either happens, or should happen, at a higher level. But maybe this isn't something to test at the ArchiveDriver, which is too low-level.

CC @Ivansete-status

@SionoiS (Contributor)

SionoiS commented Apr 23, 2024

I feel we should not use strings internally at all. We already have `type NsPubsubTopic* = object` and `type NsContentTopic* = object` that we could use.

@kaichaosun (Contributor)

Could someone elaborate what this 0 means, and what's the value it brings?

The generation number monotonically increases and indirectly refers to the total number of shards of the Waku Network.

From spec it still confuses me.

And I have been sending messages via /relay/v1/auto/messages, and the content topic is not prefixed with any 0 (generation) in the database messages table; it stores the one the app used.

@SionoiS (Contributor)

SionoiS commented Apr 25, 2024

Could someone elaborate what this 0 means, and what's the value it brings?

The generation number monotonically increases and indirectly refers to the total number of shards of the Waku Network.

From spec it still confuses me.

Since autosharding maps infinitely many content topics to a finite number of shards, we need to know the number of shards and the algorithm used for autosharding; that is what this number represents. In TWN gen 0, hash mod 8 (shards) is used for autosharding, but we could define a TWN gen 1 in the future with a different algorithm.

And I have been sending messages via /relay/v1/auto/messages, the content topic is not prefixed with any 0 (generation) in database messages table, it stores the one app used.

0 is the first and default prefix and can be omitted; it is implicit in all content topics ATM.
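The "hash mod 8" idea above can be sketched in a few lines. This is a simplified Python illustration of the concept only, not the exact hash construction nwaku uses; the field selection and sha256 choice here are assumptions for the sketch:

```python
# Illustration: autosharding hashes topic fields and reduces mod the shard
# count. NOT the exact TWN algorithm; shows why the generation prefix, which
# is dropped before hashing, does not affect the resulting shard.
import hashlib

GEN0_SHARDS = 8  # TWN generation 0 defines 8 shards

def shard_for(content_topic: str) -> int:
    parts = content_topic.strip("/").split("/")
    if len(parts) == 5:   # full form carries an explicit generation prefix
        parts = parts[1:]  # drop it: gen 0 is implicit in the short form
    application, version = parts[0], parts[1]
    digest = hashlib.sha256((application + version).encode()).digest()
    return int.from_bytes(digest[:8], "big") % GEN0_SHARDS

# Both spellings land on the same shard, because only the application and
# version fields feed the hash.
assert shard_for("/toychat/2/huilong/proto") == shard_for("/0/toychat/2/huilong/proto")
```

A hypothetical gen 1 would change GEN0_SHARDS or the hashing rule, which is why the generation number "indirectly refers to the total number of shards".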

@gabrielmer (Contributor)

Revisiting this, I personally think that we should store in the DB the full-length content topic defined in the spec.

And if we receive the short version of the content topic, transform it at the application level to the long version, which is what gets stored in the DB.

If that's the case, then when interacting directly with the DB in tests, we would only need to use the full version.

Does it make sense? cc @jm-clius
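The normalization proposed here can be sketched as follows. This is a hypothetical Python helper illustrating the idea (nwaku would do this in Nim at the application layer), assuming the only difference between the forms is the implicit /0 generation prefix:

```python
# Sketch of the proposed normalization: expand a short-form topic to its full
# form before it reaches the DB, so every stored row and every query uses the
# canonical spelling. to_full_form is hypothetical, not an nwaku API.
def to_full_form(content_topic: str) -> str:
    parts = content_topic.strip("/").split("/")
    if len(parts) == 4:              # short form: prepend the implicit gen 0
        return "/0/" + "/".join(parts)
    return "/" + "/".join(parts)     # already full form: return unchanged

assert to_full_form("/toychat/2/huilong/proto") == "/0/toychat/2/huilong/proto"
assert to_full_form("/0/toychat/2/huilong/proto") == "/0/toychat/2/huilong/proto"
```

With this in place, a plain string-equality filter at the DB level would match both spellings, since only the canonical one is ever stored or queried.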

@jm-clius (Contributor)

At this stage we don't know yet if there will be a gen 1. I wouldn't pad our existing content topics just in case we have this generational use in future.
In fact, autosharding should preferably not have a major effect on what we do in the lower layer of the protocols in general. For now, I'm happy with the content topics being differentiated in the DB (i.e. requiring separate filters to query), as long as both short and long forms map to the same pubsub topic. We may clarify the specification to indicate:

  • without a prefix a content topic will be assumed gen 0
  • there is no filter equivalence between content topics that use the explicit vs implicit default gen 0

Actually, the entire generational concept is more to indicate how we could expand the number of shards in the future, and it may be marked as not implemented until we define subsequent generations.

@gabrielmer (Contributor)

Sounds good, agree with that approach :))

If that's the case, then can we close this one? or is there anything missing? @AlejandroCabeza

We do have to check that content topics in short and long forms map to the same pubsub topic, but I think that would be a separate issue from this one.

@gabrielmer (Contributor)

Closing this issue as per the above discussion. Feel free to reopen in case something is missing :))

@AlejandroCabeza (Contributor, Author)

AlejandroCabeza commented Jun 10, 2024

If I understood correctly then these two, toychat/2/huilong/proto and /0/toychat/2/huilong/proto, are not to be treated as equivalent?

@gabrielmer (Contributor)

If I understood correctly then these two, toychat/2/huilong/proto and /0/toychat/2/huilong/proto, are not to be treated as equivalent?

My understanding is that they're not equivalent at the DB level, but when using autosharding's API they should behave as equals.

@AlejandroCabeza (Contributor, Author)

If I understood correctly then these two, toychat/2/huilong/proto and /0/toychat/2/huilong/proto, are not to be treated as equivalent?

My understanding is that they're not equivalent at the DB level, but when using autosharding's API they should behave as equals.

Right, then has somebody implemented that and enabled the test?

@gabrielmer (Contributor)

Right, then has somebody implemented that and enabled the test?

I see that there are already tests activated for it and working, such as:

asyncTest "relay (automatic sharding filtering)":
  # Given a connected server and client subscribed to the same content topic (with two different formats)
  let
    contentTopicShort = "/toychat/2/huilong/proto"
    contentTopicFull = "/0/toychat/2/huilong/proto"
    pubsubTopic = "/waku/2/rs/0/58355"
    serverHandler = server.subscribeToContentTopicWithHandler(contentTopicShort)
    clientHandler = client.subscribeToContentTopicWithHandler(contentTopicFull)

  await sleepAsync(FUTURE_TIMEOUT)
  await client.connectToNodes(@[server.switch.peerInfo.toRemotePeerInfo()])

  # When the client publishes a message
  discard await client.publish(
    some(pubsubTopic),
    WakuMessage(payload: "message1".toBytes(), contentTopic: contentTopicShort),
  )
  let
    serverResult1 = await serverHandler.waitForResult(FUTURE_TIMEOUT)
    clientResult1 = await clientHandler.waitForResult(FUTURE_TIMEOUT)

  # Then the server and client receive the message
  assertResultOk(serverResult1)
  assertResultOk(clientResult1)

  # When the server publishes a message
  serverHandler.reset()
  clientHandler.reset()
  discard await server.publish(
    some(pubsubTopic),
    WakuMessage(payload: "message2".toBytes(), contentTopic: contentTopicFull),
  )
  let
    serverResult2 = await serverHandler.waitForResult(FUTURE_TIMEOUT)
    clientResult2 = await clientHandler.waitForResult(FUTURE_TIMEOUT)

  # Then the server and client receive the message
  assertResultOk(serverResult2)
  assertResultOk(clientResult2)

Which if I understand correctly, implies that both content topics map to the same shard. Maybe I'm missing something?

@AlejandroCabeza (Contributor, Author)

Right, then has somebody implemented that and enabled the test?

I see that there's already tests activated for it and working such as

asyncTest "relay (automatic sharding filtering)":
  # Given a connected server and client subscribed to the same content topic (with two different formats)
  let
    contentTopicShort = "/toychat/2/huilong/proto"
    contentTopicFull = "/0/toychat/2/huilong/proto"
    pubsubTopic = "/waku/2/rs/0/58355"
    serverHandler = server.subscribeToContentTopicWithHandler(contentTopicShort)
    clientHandler = client.subscribeToContentTopicWithHandler(contentTopicFull)

  await sleepAsync(FUTURE_TIMEOUT)
  await client.connectToNodes(@[server.switch.peerInfo.toRemotePeerInfo()])

  # When the client publishes a message
  discard await client.publish(
    some(pubsubTopic),
    WakuMessage(payload: "message1".toBytes(), contentTopic: contentTopicShort),
  )
  let
    serverResult1 = await serverHandler.waitForResult(FUTURE_TIMEOUT)
    clientResult1 = await clientHandler.waitForResult(FUTURE_TIMEOUT)

  # Then the server and client receive the message
  assertResultOk(serverResult1)
  assertResultOk(clientResult1)

  # When the server publishes a message
  serverHandler.reset()
  clientHandler.reset()
  discard await server.publish(
    some(pubsubTopic),
    WakuMessage(payload: "message2".toBytes(), contentTopic: contentTopicFull),
  )
  let
    serverResult2 = await serverHandler.waitForResult(FUTURE_TIMEOUT)
    clientResult2 = await clientHandler.waitForResult(FUTURE_TIMEOUT)

  # Then the server and client receive the message
  assertResultOk(serverResult2)
  assertResultOk(clientResult2)

Which if I understand correctly, implies that both content topics map to the same shard. Maybe I'm missing something?

That's a relay-related test. This issue is store-related.

@gabrielmer (Contributor)

That's a relay-related test. This issue is store-related.

Yes, but isn't the mapping from content topic to shard the same for all protocols? Same autosharding algorithm.

I understood this was the last question remaining, whether autosharding maps both to the same shard and if so we're fine for now (as we decided to not see them as equivalent at the DB level).
The test case I attached, even if it's for relay, tests/proves this is the case.

So currently

  1. We know that the mapping works as intended
  2. If it stops working as intended, a test will fail and we will know about it

@AlejandroCabeza (Contributor, Author)

That's a relay-related test. This issue is store-related.

Yes, but isn't the mapping from content topic to shard the same for all protocols? Same autosharding algorithm.

I understood this was the last question remaining, whether autosharding maps both to the same shard and if so we're fine for now (as we decided to not see them as equivalent at the DB level). The test case I attached, even if it's for relay, tests/proves this is the case.

So currently

1. We know that the mapping works as intended

2. If it stops working as intended, a test will fail and we will know about it

I can't remember, but there must be a reason I posted an issue specifically referencing an issue with lightpush-store, with the test case skipped, while there's a couple of other tests referencing other protocols that weren't skipped. There could be something within store that is messing with the topics format.
I just ran the mentioned test and it still fails. I'll point out, though, that the test uses HistoryQuery, which is now located in waku_store_legacy.

@gabrielmer (Contributor)

I can't remember, but there must be a reason I posted an issue specifically referencing an issue with lightpush-store, with the test case skipped, while there's a couple of other tests referencing other protocols that weren't skipped. There could be something within store that is messing with the topics format. I just ran the mentioned test and it still fails. I'll point out, though, that the test uses HistoryQuery, which is now located in waku_store_legacy.

I think that the issue here is that the test is inserting and querying at the DB level, which is too low a level. At that level, as per the above discussion, it's OK to see both content topics as different.

As long as autosharding maps them to the same shard (which is not specific to any protocol, and we saw there are already tests that ensure it), then everything seems to be alright :)
