Conversation
(I haven't read up on the architecture, so this might not actually be applicable.) Are there unique indices on these tables (probably)? In that case, you need to make sure to handle race conditions (duplicate keys) where the replicas don't yet have rows already inserted on master… (Unlikely) Are rows ever removed/purged from these tables? If so, that needs to be handled here as well. Also I find the functional style sometimes a little obstructive to read, especially the …
Currently, there are, yes. (You can see them in ….)

-- Deduplication is currently best-effort to avoid locking on inserts that
-- would be required for strict deduplication. There MAY be multiple rows with
-- the same comment_text and comment_data.
They are (implemented in #10) – after all, terms from deleted revisions might contain sensitive data that should not remain on the Cloud VPS replicas. (MediaWiki core solves this with a whole bunch of custom views for its own normalization tables, as far as I’m aware, but since ….)
Or we can query the master for IDs if we don’t find them on the replicas, and only insert if we don’t find the IDs on the master either, though that would probably need a ….
Also I find the functional style sometimes a little obstructive to read, especially the array_reduce calls that probably could be replaced by fairly simple foreach loops.
Interesting – I find reduce more readable than foreach, but maybe just because I'm more used to functional style. Fair point though: our paradigm is mainly OO, so I'll happily change it so that it is more readable for more Wikidata developers :)
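As an aside on that trade-off, the two styles are easy to compare side by side. A minimal Python illustration (names hypothetical, unrelated to the PR's actual code) of the same grouping written as a fold and as a plain loop:

```python
from functools import reduce

rows = [("en", 1), ("en", 2), ("de", 3)]  # (language, text id) pairs

# Functional style: fold the rows into a dict of language => list of ids.
grouped_reduce = reduce(
    lambda acc, row: {**acc, row[0]: acc.get(row[0], []) + [row[1]]},
    rows,
    {},
)

# Imperative style: the same grouping with an explicit loop.
grouped_loop = {}
for language, text_id in rows:
    grouped_loop.setdefault(language, []).append(text_id)

assert grouped_reduce == grouped_loop == {"en": [1, 2], "de": [3]}
```

Both produce the identical structure; the question raised here is purely about which one the team finds easier to read.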
* @return array
* [
*     'type' => [
*         [ 'language' => [ <textId1>, <textId2>, ... ] ], ...
Given [ 'type' => [ 'language' => 'term' ] ], will this return [ 'type' => [ 'language' => 1 ] ] or [ 'type' => [ 'language' => [ 1 ] ] ]? That is, if the input has a single term instead of a list, is that preserved in the output?
It will not be preserved, IIRC – I convert them to arrays in the output, even for single elements.
I tried it out, and it looks like it’s preserved after all (because array_walk_recursive preserves the structure) – I think that’s why DatabaseTermIdsAcquirerTest::assertTermsArrayExistInDb needs the cast (array)$texts too.
We certainly do not want to allow duplicate rows here, as the main purpose of this normalization is to remove duplication. We can query master for the non-existing ones and eventually only insert those that do not exist, but yeah, that's then another …. In a previous implementation, we tried to use …. A third, in-the-middle solution could be to do the following for ids that do not exist in the replica: …

What do you think?
That’s the main purpose of the …. I was going to ask about …. I don’t understand your third solution yet – can you clarify what “prepare” means, and where which queries are run (except 2., which already says “master”)?
Sure – ATM I feel that should still be the last resort… but let's see which solution might be cheaper in performance and maintenance.
I'll put it in the full pseudo steps:
That is a loop solution that accounts for terms inserted by in-parallel requests… not sure how often that will happen, but at least one run of the loop is needed, since the very first set of records that do not exist in the replica might already have some of its records in master.
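Those pseudo steps can be sketched as a small loop. Here is a minimal Python model (all names hypothetical, with plain dicts standing in for the replica and master tables, and insert-ignore semantics standing in for tolerated duplicate-key failures):

```python
def find_existing(db, records):
    """Return a record -> id map for the records present in the given store."""
    return {r: db[r] for r in records if r in db}

def insert_ignoring_duplicates(db, records):
    """Insert missing records, silently skipping duplicates (like INSERT IGNORE)."""
    for r in records:
        db.setdefault(r, len(db) + 1)

def acquire_ids(replica, master, needed):
    """Resolve every needed record to an id, inserting into master as required."""
    resolved = find_existing(replica, needed)           # one select on the replica
    needed = [r for r in needed if r not in resolved]
    while needed:
        insert_ignoring_duplicates(master, needed)      # insert attempt on master
        resolved.update(find_existing(master, needed))  # follow-up select on master
        # Anything still unresolved was lost to a race; loop and retry.
        needed = [r for r in needed if r not in resolved]
    return resolved

replica = {"a": 1, "b": 2}                  # the replica lags behind master
master = {"a": 1, "b": 2, "c": 3, "d": 4}   # master already has "c" and "d"
ids = acquire_ids(replica, master, ["a", "c", "e"])
assert ids == {"a": 1, "c": 3, "e": 5}
```

In the common case the loop body runs exactly once: the records missing from the replica are either already in master (replication lag) or get inserted there.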
This needs to
I think here you need to query all
In general, you will have to make sure all entries (rows) you want to refer to are present (and ensure that by locking them on the master) until you put your references in place (so that no one will delete them because they are not referred to from anywhere). If doing all of that in a transaction is a performance problem, you could go for a multi-step process: 1) find/create the entries you want to reference, 2) put the references in place, 3) make sure the entries you just referenced are still there, 4) fix now-invalid references if needed. This should work fairly well, unless the runner dies after step 2 and you're not in a transaction. I know that this is all fairly ugly, but doing this without using proper foreign keys is not easy (unless you heavily compromise on consistency, I guess).
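A rough Python sketch of that four-step process, with a dict standing in for the database and all names invented for illustration:

```python
def reference_entries(store, references, wanted_texts):
    """Sketch of the multi-step process: 1) find/create entries,
    2) put references in place, 3) re-check the entries,
    4) fix references that became invalid in between."""
    # 1) Find or create the entries we want to reference.
    ids = {t: store.setdefault(t, len(store) + 1) for t in wanted_texts}
    # 2) Put the references in place.
    references.update(ids)
    # 3) Make sure the entries we just referenced are still there
    #    (a cleanup job may have purged them in the meantime).
    stale = [t for t, i in references.items() if store.get(t) != i]
    # 4) Fix now-invalid references by re-creating the missing entries.
    for t in stale:
        references[t] = store.setdefault(t, len(store) + 1)
    return references

store = {"hello": 1}
refs = reference_entries(store, {}, ["hello", "world"])
assert refs == {"hello": 1, "world": 2}
```

As the comment above notes, this is only safe up to step 2: if the process dies between steps 2 and 4 outside a transaction, a dangling reference can remain.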
Thanks @mariushoch. I will try to sleep on it a bit and check whether there's a "simpler" solution that can be done. (The top of my head screams "staging table" at me – no clue what it means, though, so I hope I will shut it up with some reasoning, haha)
@mariushoch FYI I'm continuing with this to finish it first without considering the race-condition problem (= assuming there's no cleanup logic running in parallel). We will get back to race conditions between the cleanup (#10) and the acquirer logic, and try to reproduce them in a test, once both are merged and tested to be functional in the test env.
Sounds good to me :) Do you plan to handle insertion failures (duplicate key) already, or not yet?
@mariushoch yup, I'm going to handle that while implementing the logic I stated in the latest pseudo steps, more or less.
	return $carry;
},
[]
);
I think this whole reduce business can be replaced with:
$existingTextIds = $this->dbr->selectFieldValues(
self::TABLE_TEXT,
'wbx_id',
[ 'wbx_text' => $texts ],
__METHOD__ // unrelated but also nice
);
}, $nonExistingTexts )
);

$insertedTextRows = $this->dbw->select(
Can also use selectFieldValues(), I think.
}
}

$existingLangTextIdRows = $this->dbr->select(
Can also use selectFieldValues(), I think.
$this->assertNotEmpty(
	$textId,
	sprintf( 'Expected record for text "%s" was not found in wbt_text', $text )
Why sprintf instead of normal string interpolation? "Expected record for text '$text' was not found in wbt_text"
$acquiredTermIds = $dbTermIdsAcquirer->acquireTermIds( $termsArray );

$this->assertTermSArrayExistInDb( $termsArray, $alreadyAcquiredTypeIds );
TermS → Terms
$acquiredTermIds = $dbTermIdsAcquirer->acquireTermIds( $termsArray );

$this->assertEquals(
Would be nicer to use assertSame() instead of assertEquals() (this also applies to two more places below). It might also be worth extracting a separate function, à la $this->assertRowCount( 'wbt_text', '*' ).
(force-pushed from 1a99c4c to 6dae465)
I'm close to wanting this to be reviewed and merged, to not keep it open forever. I will go over it today to add separate tests for the new class. I will also respond to the reviews from before, of course ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a full review, but some hopefully useful comments
	$existingRecords,
	$this->findExistingRecords( $this->dbMaster, $neededRecords )
);
$neededRecords = $this->filterNonExistingRecords( $neededRecords, $existingRecords );
When would this be non-empty? AFAIS only if something you just inserted into master got deleted… is it really worth guarding just for this specific case? (If you want to be safe in all cases, you need to SELECT … FOR UPDATE all record-rows on master.)
This would be non-empty in the following case:
- the replica has records 1 and 2, but not 3, 4 and 5
- master has records 3 and 4, but not 5 (so the previous insert will fail due to the duplication of 3 and 4)
- therefore, 5 is still needed and has to be inserted into master again
I thought I had added a test case for that one, but the one I have under testWhenSomeRecordsDoNotExistInReplicaButExistInMaster tests the case where all $neededRecords spilling over here exist in master… will add another test case for that.
I changed the existing test case in testWhenSomeRecordsDoNotExistInReplicaButExistInMaster so that master has only a subset of those $neededRecords already, in which case the first attempt to save will fail, resulting in $neededRecords being non-empty on this line.
Using an infinite loop here might still not be the best option… we can make it just a two-step check, procedural-style… but I don't know if that would look better, so I'd rather not change it until we find a third, better improvement.
Ok, that makes sense… but given we only care for equivalent rows anyway, couldn't we just go with INSERT IGNORE and be done?
Yup, we could have, but we tried that before, and the problem with using IGNORE is that it will still increment the next primary key for every failing record. We concluded that we do not want that side effect in tables with high cardinality, as our primary keys would have gaps and might risk reaching the maximum value of the id column much more quickly.
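To visualize that side effect, here is a toy Python model (not MySQL itself – just a counter standing in for AUTO_INCREMENT, which InnoDB typically advances even for rejected INSERT IGNORE rows):

```python
class ToyTable:
    """Toy model of a unique-keyed table whose id counter advances even
    when an insert is rejected as a duplicate (as with MySQL INSERT IGNORE)."""

    def __init__(self):
        self.rows = {}      # text -> id
        self.next_id = 1    # stand-in for the AUTO_INCREMENT counter

    def insert_ignore(self, text):
        allocated = self.next_id
        self.next_id += 1               # the counter burns an id either way
        if text not in self.rows:       # unique-key check
            self.rows[text] = allocated

t = ToyTable()
for text in ["a", "b", "a", "a", "c"]:
    t.insert_ignore(text)

# The two duplicate "a" inserts consumed ids 3 and 4, leaving a gap before "c".
assert t.rows == {"a": 1, "b": 2, "c": 5}
```

With high write volume and many duplicates, those gaps add up, which is the exhaustion concern raised above.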
Had a quick glance… I think this is mostly good to go now (although I only checked superficially and didn't look at the tests). One thing I noticed is that sometimes you seem to manually create sub-arrays (like …).
(force-pushed from 6e3b56b to 489f81c)
Now I remember how annoying rebasing was on GitHub, with these commits now appearing after all the previous comments. A slight annoyance, not a big deal if you ignore it, though in one team we did use merge over rebase mainly for that reason (which I'd probably not do again).
Good catch… we surely should stick to one. I prefer auto-create (not more evil in PHP than creating variables without defining them). Will update accordingly, unless there are reasons against that approach that you know of (and I don't).
(force-pushed from 489f81c to ce2bf3f)
@@ -8,6 +8,10 @@ class TermStoreSchemaUpdater {
Changes to this file (and the related change in MediaWikiNormalizedTermCleanerTest.php) were not necessary; they slipped through when I rebased, but I'll avoid creating a commit to revert them unless reviewers would prefer that.
Checking the code again, I can't seem to find any sub-arrays pre-created where that could be avoided. There are a few cases where it was necessary, as the sub-array will be used and can cause an exception if it wasn't created, as in this example (https://github.com/wmde/mediawiki-term-store/pull/11/files#diff-318c7c56290dcb03989fe814a679f4aeR146, lines 146 to 155):

if ( !isset( $flattenedLangTextIds[$lang] ) ) {
	$flattenedLangTextIds[$lang] = [];
}
$flattenedLangTextIds[$lang] = array_unique(
	array_merge(
		(array)$textIds,
		(array)$flattenedLangTextIds[$lang]
	)
);
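For comparison, some languages fold that guard-then-merge into a single call. A hypothetical Python version of the same pattern, using dict.setdefault in place of the isset() guard (names invented for illustration):

```python
flattened_lang_text_ids = {}
pairs = [("en", [1, 2]), ("en", [2, 3]), ("de", [4])]

for lang, text_ids in pairs:
    # setdefault plays the role of the isset() guard: it creates the
    # sub-list on first use, so no pre-creation step is needed.
    bucket = flattened_lang_text_ids.setdefault(lang, [])
    # Merge in the new ids, de-duplicated, preserving insertion order.
    for text_id in text_ids:
        if text_id not in bucket:
            bucket.append(text_id)

assert flattened_lang_text_ids == {"en": [1, 2, 3], "de": [4]}
```

PHP's auto-vivification of array keys (the "auto-create" preference mentioned above) achieves much the same effect without an explicit guard.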
Left some comments on the tests. (Some of them are duplicates, but I don’t know how readable “same here” comments are on GitHub, so I copy+pasted the comments in those cases.)
	[ 'wbtl_text_in_lang_id' => $enSameTextInLangId,
		'wbtl_type_id' => $alreadyAcquiredTypeIds['alias'] ]
);
$aliasEnSameTermInLangId = (string)$this->db->insertId();
Why the cast to string?
Because the interface expects string ids only (note 2, https://github.com/wmde/mediawiki-term-store/pull/11/files#diff-8b29a0bea7b8ed10130868f7ba1dc7cdR75), while the InMemory implementation returns integer ids. I can perhaps change InMemory to return string ids as well.
Let’s merge this, it’s been waiting in review for quite long enough by now.
Phab task: https://phabricator.wikimedia.org/T223963
Database implementation of TermIdsAcquirer.
The implementation acquires ids bottom-up (starting from wbt_text table towards wbt_term_in_lang table), inserting new records as needed. The pseudo steps for each table are:
This implementation optimizes the number of queries, effectively making them one select per table on the replicas and, if needed, a follow-up of one insert plus one select per table on master.
Todo:
Along with it, a TypeIdsAcquirer interface with an in-memory implementation was introduced. wbt_type is likely to be loaded and stored in memory, for which the NameTableStore class in MW core can be used. So we will likely have an implementation of TypeIdsAcquirer that uses NameTableStore internally and pass it to this class.