feat: Add new DB column messageHash
#2202
Conversation
This PR may contain changes to the database schema of one of the drivers. If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues. Please make sure the label …
You can find the image built from this PR at …
Built from 104a717
Thanks!
Kinda understand what this does but SQL not my forte.
LGTM in general. I have left some comments on selecting an appropriate primary key and on whether we need to maintain the previous index. Since DB migration is a delicate process, we should test the migration from a previous version (or two versions?) of nwaku with a populated DB to this version.
" storedAt BIGINT NOT NULL," & | ||
" CONSTRAINT messageIndex PRIMARY KEY (storedAt, id, pubsubTopic)" & | ||
" CONSTRAINT messageIndex PRIMARY KEY (storedAt, messageHash)" & |
Why do we need both `storedAt` and `messageHash` as `PRIMARY KEY`? Since the messageHash is unique and not null, shouldn't that be enough?

We may still need an index (not a primary key) on `(storedAt, id, pubsubTopic)` for query performance, though, since we still query these attributes. @Ivansete-status may have a better idea.
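For context, a minimal sketch of that alternative in SQLite terms, assuming a reduced set of the table's columns (illustrative only, not the PR's actual schema):

```sql
-- Illustrative: messageHash alone as the primary key, with a
-- secondary index kept for queries on the old attributes.
CREATE TABLE message (
    id BLOB,
    messageHash BLOB NOT NULL,
    pubsubTopic BLOB NOT NULL,
    storedAt INTEGER NOT NULL,
    CONSTRAINT messageIndex PRIMARY KEY (messageHash)
);

-- Non-primary index preserving the previous query path.
CREATE INDEX i_query ON message (storedAt, id, pubsubTopic);
```

A plain index keeps lookup performance for `storedAt`-ordered queries without forcing those columns into the uniqueness constraint.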
First half of your question:

Short answer: `storedAt` is required; otherwise the integrity of the DB will be broken.

Long answer: https://github.com/waku-org/nwaku/blob/master/tests/testlib/wakucore.nim#L54 (payload and meta are optional here). If we remove `storedAt` from the primary key, some test cases will fail: an app sending data might keep the same data across messages (payload, meta, content topic and pubsub topic, i.e. all the inputs required for the `messageHash` computation).
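A minimal sketch of that failure mode, assuming `messageHash` alone were the primary key (table reduced to two columns for illustration):

```sql
-- Illustrative: two messages sharing every hash input collide
-- if messageHash is the sole primary key.
CREATE TABLE message (
    messageHash BLOB NOT NULL,
    storedAt INTEGER NOT NULL,
    CONSTRAINT messageIndex PRIMARY KEY (messageHash)
);

INSERT INTO message VALUES (x'ab01', 1700000000);
-- Same payload/meta/topics => same hash, later receive time:
INSERT INTO message VALUES (x'ab01', 1700000005);
-- => Error: UNIQUE constraint failed: message.messageHash
```

With `storedAt` in the composite key, the second insert would succeed because the key tuples differ.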
I agree with your statement @jm-clius. Better to have `messageHash` as the primary key.
" storedAt INTEGER NOT NULL," & | ||
" CONSTRAINT messageIndex PRIMARY KEY (storedAt, id, pubsubTopic)" & | ||
" CONSTRAINT messageIndex PRIMARY KEY (storedAt, messageHash)" & |
Same question as for Postgres: unsure why we'd need both `storedAt` and `messageHash` for the primary key.
```sql
id BLOB,
messageHash BLOB, -- Newly added, this will be populated with a counter value
storedAt INTEGER NOT NULL,
CONSTRAINT messageIndex PRIMARY KEY (storedAt, messageHash)
```
Have noted elsewhere, but:

- I would imagine we only need `messageHash` as the primary key?
- Not sure if we need to maintain the previous index on `(storedAt, id, pubsubTopic)` for query performance?
- Answered elsewhere
- For query performance: input from @Ivansete-status?
ah yes! You are right @jm-clius, the `messageHash` will be the primary key
> ah yes! You are right @jm-clius, the `messageHash` will be the primary key

In that case, I will open a new PR, since this one also needs to depend on the amended digest (which I have already made locally in a separate PR against issue #2215), so it is better to open a new PR for the removal of `storedAt` from the primary key.
I have tested the migration on tag versions 15 and 20; it works well on both of them without any problem.
Thanks for this step further, @ABresting! Overall it looks good. Just a little detail that I don't quite understand in the upgrade script in SQLite:
```sql
id BLOB,
messageHash BLOB, -- Newly added, this will be populated with a counter value
storedAt INTEGER NOT NULL,
CONSTRAINT messageIndex PRIMARY KEY (storedAt, messageHash)
```
ah yes! You are right @jm-clius, the `messageHash` will be the primary key
```sql
(
    SELECT COUNT(*)
    FROM message_backup AS mb2
    WHERE mb2.storedAt <= mb.storedAt
) as messageHash, -- to populate the counter values
```
Sorry but I don't quite understand this snippet. That doesn't sound like a message hash.
So basically, in the case of migration we cannot leave the newly added `messageHash` as NULL or empty for pre-existing messages/DB rows, so this script fills `messageHash` with a counter value. For example, if there were 50 messages before running the migration scripts, then after migration the `messageHash` column is introduced, and for those previous 50 messages we fill the `messageHash` value with a counter from 1 to 50. This way every value in the `messageHash` column is unique. WDYT @Ivansete-status
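For context, a sketch of how that counter backfill might sit in a rename-and-copy migration, with the table reduced to the columns discussed here (an illustrative shape, not the PR's exact script):

```sql
-- Illustrative migration outline: rename the old table, recreate it
-- with the new column, then copy rows back, deriving a counter per
-- row from its position in the storedAt ordering.
ALTER TABLE message RENAME TO message_backup;

CREATE TABLE message (
    id BLOB,
    messageHash BLOB,
    storedAt INTEGER NOT NULL,
    CONSTRAINT messageIndex PRIMARY KEY (storedAt, messageHash)
);

INSERT INTO message (id, messageHash, storedAt)
SELECT
    mb.id,
    (
        SELECT COUNT(*)
        FROM message_backup AS mb2
        WHERE mb2.storedAt <= mb.storedAt
    ) AS messageHash, -- counter value: 1..N in storedAt order
    mb.storedAt
FROM message_backup AS mb;

DROP TABLE message_backup;
```

Note that the correlated `COUNT(*)` runs once per copied row, making the copy quadratic in the row count, which is what the next comment picks up on.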
I wonder if this would take long in a DB with a large number of rows, like those from Status.

Perhaps https://www.sqlite.org/lang_corefunc.html#randomblob could be a good alternative to doing a count for each inserted row, or perhaps https://www.sqlite.org/c3ref/create_function.html could be used to define a function in Nim that calculates the message hash and can be called from SQLite (this is for sure much more complicated to implement).
> I wonder if this would take long in a DB with a large number of rows, like those from Status.
>
> Perhaps https://www.sqlite.org/lang_corefunc.html#randomblob could be a good alternative to doing a count for each inserted row, or perhaps https://www.sqlite.org/c3ref/create_function.html could be used to define a function in Nim that calculates the message hash and can be called from SQLite (this is for sure much more complicated to implement).

I think the current solution is capable of scaling, and it is a simpler approach to the issue here: post-migration there will be long hash strings in the `messageHash` column, so backfilling the DB with counter values will be unique and less resource-consuming in comparison?

WDYT @jm-clius @richard-ramos @Ivansete-status
> I wonder if this would take long in a DB with a large number of rows, like those from Status. Perhaps https://www.sqlite.org/lang_corefunc.html#randomblob could be a good alternative to doing a count for each inserted row, or perhaps https://www.sqlite.org/c3ref/create_function.html could be used to define a function in Nim that calculates the message hash and can be called from SQLite (this is for sure much more complicated to implement).

> I think the current solution is capable of scaling, and it is a simpler approach to the issue here: post-migration there will be long hash strings in the `messageHash` column, so backfilling the DB with counter values will be unique and less resource-consuming in comparison? WDYT @jm-clius @richard-ramos @Ivansete-status

Actually, @richard-ramos is right: the `count(*)` operation will be more intensive than `randomblob(N)`, so it will be nice to make this change. But make sure that this function is available in older clients; it was introduced in SQLite version 3.3.13. Also, as recommended by @jm-clius, the counter mechanism (i.e. using simple values 1, 2, 3, ...) will then be dropped.
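A hedged sketch of the `randomblob`-based backfill, reusing the illustrative migration shape above; the 32-byte length is an assumption, not taken from the PR:

```sql
-- Illustrative: fill messageHash with 32 random bytes per row
-- instead of a positional counter; one call per row, so the copy
-- stays linear in the row count.
INSERT INTO message (id, messageHash, storedAt)
SELECT
    mb.id,
    randomblob(32) AS messageHash, -- random placeholder, not a real hash
    mb.storedAt
FROM message_backup AS mb;
```

Unlike the counter, random blobs are only probabilistically unique, but at 32 bytes the collision risk is negligible in practice.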
Considering that SQLite 3.3.13 was launched in 2007, should we proceed with `randomblob(N)` without a version check?
No, it should be possible to check the SQLite version in the ABI used by the version of nwaku currently deployed to the fleets.
Seems like `3.40`, but probably worth checking the history here for any recent upgrades: https://raw.githubusercontent.com/arnetheduck/nim-sqlite3-abi/362e1bd9f689ad9f5380d9d27f0705b3d4dfc7d3/sqlite3_abi/sqlite3.c
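For reference, the linked SQLite version can also be confirmed at runtime with the built-in `sqlite_version()` function (standard SQLite, nothing nwaku-specific):

```sql
-- Returns the version of the SQLite library actually linked in,
-- e.g. '3.40.0'; randomblob() exists in anything >= 3.3.13.
SELECT sqlite_version();
```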
Would it be possible to just set this attribute to "empty" or nil?
It doesn't make sense from my point of view to add this complexity because, in the end, we are not calculating the messageHash in the same way as the code base does. Or am I missing something?
> No, it should be possible to check the SQLite version in the ABI used by the version of nwaku currently deployed to the fleets.

One question: `randomblob()` has been there from tag v0.1, so is it safe to assume that it is available even on the earliest clients and will not break any backward/outdated clients?
" storedAt BIGINT NOT NULL," & | ||
" CONSTRAINT messageIndex PRIMARY KEY (storedAt, id, pubsubTopic)" & | ||
" CONSTRAINT messageIndex PRIMARY KEY (storedAt, messageHash)" & |
I agree with your statement @jm-clius. Better to have `messageHash` as the primary key.
Thanks for your patience with this. I think we would need to amend the deterministic message hashing algorithm (#2215) before merging this (and rebase it on that change).

This change is meant to introduce a new unique index column, called messageHash. If we have determined that `messageHash` is not unique, we first need to fix the underlying logic; otherwise we'll have to migrate existing databases twice (to update the primary key from `(storedAt, messageHash)` to just `messageHash`). Afaict the fix for #2215 should be a small change in a single file?
Thanks for the input @jm-clius. I have opened a PR fix for #2215; upon its merge I will quickly rebase onto that change.
Seeing you've re-requested a review, but afaict the primary key has not been updated to reflect the latest changes in …
I was planning to remove the primary key in a follow-up PR, since we follow a modular-changes practice, but let me fold it into this PR then. Ah, but yeah, in production it's better to do a one-time migration than two.
Indeed, we do incremental PRs, but in this case that would necessitate unnecessary migrations.
Thanks!
Description

DB column `messageHash` added in both SQLite and Postgres. More about this PR can be found at this research issue.

Changes

- `messageHash` column added in SQLite
- `messageHash` column added in Postgres

Issue

#2112

closes #2229