
Add database term ids acquirer #11

Merged
merged 18 commits into master from add-database-term-ids-acquirer on May 21, 2019

Conversation

@AlaaSarhan (Collaborator) commented May 9, 2019

Phab task: https://phabricator.wikimedia.org/T223963

Database implementation of TermIdsAcquirer.

The implementation acquires ids bottom-up (starting from wbt_text table towards wbt_term_in_lang table), inserting new records as needed. The pseudo steps for each table are:

  • find existing ids that can be reused on read database
  • insert new non-existing data into write database, if any
  • fetch inserted records' ids from write database, if any
  • use the acquired ids to link up with the next table

This implementation minimizes the number of queries: effectively one select per table on the replicas and, only when needed, a follow-up of one insert plus one select per table on the master.
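
A minimal sketch of this per-table step for the wbt_text table, for illustration (method and variable names are assumptions loosely following the description above, not the PR's actual code):

	private function acquireTextIds( array $texts ): array {
		// 1. Find existing ids that can be reused, on the read (replica) database.
		$textIds = [];
		$existingRows = $this->dbr->select(
			'wbt_text',
			[ 'wbx_id', 'wbx_text' ],
			[ 'wbx_text' => $texts ],
			__METHOD__
		);
		foreach ( $existingRows as $row ) {
			$textIds[$row->wbx_text] = (int)$row->wbx_id;
		}

		// 2. Insert new, non-existing data into the write (master) database, if any.
		$nonExistingTexts = array_values( array_diff( $texts, array_keys( $textIds ) ) );
		if ( $nonExistingTexts !== [] ) {
			$this->dbw->insert(
				'wbt_text',
				array_map( function ( $text ) {
					return [ 'wbx_text' => $text ];
				}, $nonExistingTexts ),
				__METHOD__
			);

			// 3. Fetch the inserted records' ids from the write database.
			$insertedRows = $this->dbw->select(
				'wbt_text',
				[ 'wbx_id', 'wbx_text' ],
				[ 'wbx_text' => $nonExistingTexts ],
				__METHOD__
			);
			foreach ( $insertedRows as $row ) {
				$textIds[$row->wbx_text] = (int)$row->wbx_id;
			}
		}

		// 4. The caller uses these ids to link up with the next table (wbt_text_in_lang).
		return $textIds;
	}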

Todo:

  • test case for reusing existing TermInLangIds
  • handle insertion failures on duplicate insertions into dbw due to lagging replica
  • add logger
  • make code more readable where possible (at minimum, split the huge class by per-table functionality)
  • reduce duplication in code
  • check for low-performance pitfalls

Along with it, a TypeIdsAcquirer interface with an in-memory implementation was introduced. wbt_type is likely to be loaded and kept in memory, for which the NameTableStore class in MW core can be used. So we will likely have an implementation of TypeIdsAcquirer that uses NameTableStore internally, and pass that to this class.
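
A minimal sketch of what such an in-memory implementation might look like (the interface shape is an assumption; the PR's actual code may differ):

	class InMemoryTypeIdsAcquirer implements TypeIdsAcquirer {

		/** @var int[] map of type name to assigned id */
		private $typeIds = [];

		/**
		 * @param string[] $types type names, e.g. [ 'label', 'description', 'alias' ]
		 * @return int[] map of type name to acquired id
		 */
		public function acquireTypeIds( array $types ): array {
			$acquired = [];
			foreach ( $types as $type ) {
				if ( !isset( $this->typeIds[$type] ) ) {
					// Assign the next sequential id, mimicking an auto-increment column.
					$this->typeIds[$type] = count( $this->typeIds ) + 1;
				}
				$acquired[$type] = $this->typeIds[$type];
			}
			return $acquired;
		}
	}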

@mariushoch (Member)

(I haven't read up on the architecture, so this might not actually be applicable)

Are there unique indices on these tables (probably)? In that case, you need to make sure to handle race conditions (duplicate keys) where the replicas don't yet have rows already inserted on master…

(Unlikely) Are rows ever removed/ purged from these tables? If so, that needs to be handled here as well.

Also I find the functional style sometimes a little obstructive to read, especially the array_reduce calls that could probably be replaced by fairly simple foreach loops.

@lucaswerkmeister (Member)

Are there unique indices on these tables (probably)? In that case, you need to make sure to handle race conditions (duplicate keys) where the replicas don't yet have rows already inserted on master…

Currently, there are, yes. (You can see them in AddNormalizedTermsTablesDDQ.sql, by the way.) We can either handle the error on the master, or make the indices non-unique again and tolerate duplicate rows. The latter is what MediaWiki does for the comment table:

-- Deduplication is currently best-effort to avoid locking on inserts that
-- would be required for strict deduplication. There MAY be multiple rows with
-- the same comment_text and comment_data.

(Unlikely) Are rows ever removed/ purged from these tables? If so, that needs to be handled here as well.

They are (implemented in #10) – after all, terms from deleted revisions might contain sensitive data that should not remain on the Cloud VPS replicas. (MediaWiki core solves this with a whole bunch of custom views for its own normalization tables, as far as I’m aware, but since wb_terms is only a secondary data store, we can completely delete the rows instead.)

@lucaswerkmeister (Member)

We can either handle the error on the master, or make the indices non-unique again and tolerate duplicate rows.

Or we can query the master for IDs if we don’t find them on the replicas, and only insert if we don’t find the IDs on the master either, though that would probably need a FOR UPDATE on the master select, and it’s currently unclear if we should use that or not (see also discussion in #10).
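
A sketch of that fallback for the wbt_text case, for illustration (field and column names are assumptions; whether to add the FOR UPDATE option is exactly the open question):

	$textIds = $this->dbr->selectFieldValues(
		'wbt_text',
		'wbx_id',
		[ 'wbx_text' => $texts ],
		__METHOD__
	);
	if ( count( $textIds ) < count( $texts ) ) {
		// Some texts were not found on the replica; check the master before
		// inserting, optionally locking the rows until we have referenced them.
		$textIdsFromMaster = $this->dbw->selectFieldValues(
			'wbt_text',
			'wbx_id',
			[ 'wbx_text' => $texts ],
			__METHOD__,
			[ 'FOR UPDATE' ]
		);
		// Only the texts still missing on the master would then be inserted.
	}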

@AlaaSarhan (Collaborator, Author) left a comment

Also I find the functional style sometimes a little obstructive to read, especially the array_reduce calls that could probably be replaced by fairly simple foreach loops.

Interesting, I find reduce more readable than foreach, but maybe just because I'm more used to functional style. Fair point though, our paradigm is mainly OO, so I'll happily change it so that it is more readable to more Wikidata developers :)

src/PackagePrivate/DatabaseTermIdsAcquirer.php (outdated; resolved)
* @return array
* [
* 'type' => [
* [ 'language' => [ <textId1>, <textId2>, ... ] ], ...

Member

Given [ 'type' => [ 'language' => 'term' ] ], will this return [ 'type' => [ 'language' => 1 ] ] or [ 'type' => [ 'language' => [ 1 ] ] ]? That is, if the input has a single term instead of a list, is that preserved in the output?

Collaborator Author

It will not be preserved iirc; I convert them to arrays in the output even for single elements.

Member

I tried it out, it looks like it’s preserved after all (because array_walk_recursive preserves the structure) – I think that’s why DatabaseTermIdsAcquirerTest::assertTermsArrayExistInDb needs the cast (array)$texts too.
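
For illustration, a self-contained demonstration of why array_walk_recursive preserves the shape (plain PHP, not code from the PR):

	$terms = [
		'label' => [ 'en' => 'hello' ],          // single term
		'alias' => [ 'en' => [ 'hi', 'hey' ] ],  // list of terms
	];
	array_walk_recursive( $terms, function ( &$text ) {
		$text = 42; // pretend 42 is the acquired text id
	} );
	// array_walk_recursive only visits non-array leaves, so the shape survives:
	// [ 'label' => [ 'en' => 42 ], 'alias' => [ 'en' => [ 42, 42 ] ] ]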

@AlaaSarhan (Collaborator, Author) commented May 10, 2019

We can either handle the error on the master, or make the indices non-unique again and tolerate duplicate rows.

Or we can query the master for IDs if we don’t find them on the replicas, and only insert if we don’t find the IDs on the master either, though that would probably need a FOR UPDATE on the master select, and it’s currently unclear if we should use that or not (see also discussion in #10).

We certainly do not want to allow duplicate rows here, as the main purpose of this normalization is to remove duplication. We can query master for the non-existing ones, and only insert those that turn out not to exist, but yeah, that's then another select query added.

In a previous implementation, we tried to use the IGNORE option in the insert so that it ignores duplication errors and still proceeds to insert the rest. The issue is that it will increment the next primary key value even for those failing inserts. We wanted to avoid that too.

A third, in-the-middle solution can be to do the following for non-existing ids in replica:

  1. prepare non existing records
  2. insert into master
  3. select ids of non existing records
  4. if the selected ids do not cover all the ones we asked for, find the new set of non-existing records again (this should now exclude the ones that already exist and made the previous insert fail due to a duplicate constraint violation) and go back to 2

what do you think?

@lucaswerkmeister (Member)

We certainly do not want to allow duplicate rows here as the main purpose of this normalization is to remove duplication.

That’s the main purpose of the comment table too, and it still allows duplicate rows. In practice, those would be pretty rare (and we could even clean them up periodically if we really wanted to, though the query to find duplicate rows would probably be very inefficient), so I’m not yet convinced we should dismiss this idea.

I was going to ask about IGNORE – good that you already tested that :)

I don’t understand your third solution yet, can you clarify what “prepare” means and where which queries are run (except 2., which already says “master”)?

@AlaaSarhan (Collaborator, Author) commented May 10, 2019

That’s the main purpose of the comment table too, and it still allows duplicate rows. In practice, those would be pretty rare (and we could even clean them up periodically if we really wanted to, though the query to find duplicate rows would probably be very inefficient), so I’m not yet convinced we should dismiss this idea.

Sure, atm I feel that should still be the last resort.. but let's see which solution might be cheaper on performance and maintenance.

I don’t understand your third solution yet, can you clarify what “prepare” means and where which queries are run (except 2., which already says “master”)?

I'll put it in the full pseudo steps:

  1. prepare a list of needed records -> needed
  2. query needed records from replica with their ids -> existing
  3. prepare a list of non-existing records in replica: needed - existing -> non_existing
  4. insert non_existing into master
  5. query non_existing from master with their ids -> existing_master
  6. if count(existing_master) != count(non_existing) then non_existing - existing_master -> non_existing and go to 4.

That is a loop solution that accounts for terms inserted by in-parallel requests. Not sure how often that will happen, but at least one run of the loop is needed, because the very first set of records missing from the replica might already have some records existing on master.
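
For illustration, the loop in PHP-like pseudocode; findExistingRecords and filterNonExistingRecords echo method names that appear in the diff excerpts further below, while insertNonExistingRecords and the dbReplica field are assumptions:

	$existingRecords = $this->findExistingRecords( $this->dbReplica, $neededRecords );
	$neededRecords = $this->filterNonExistingRecords( $neededRecords, $existingRecords );

	while ( $neededRecords !== [] ) {
		// Duplicate-key failures are expected here and must be tolerated:
		// a parallel request may have inserted some of these rows already.
		$this->insertNonExistingRecords( $this->dbMaster, $neededRecords );

		$existingRecords = array_merge(
			$existingRecords,
			$this->findExistingRecords( $this->dbMaster, $neededRecords )
		);
		$neededRecords = $this->filterNonExistingRecords( $neededRecords, $existingRecords );
	}

	return $existingRecords;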

@mariushoch (Member)

That’s the main purpose of the comment table too, and it still allows duplicate rows. In practice, those would be pretty rare (and we could even clean them up periodically if we really wanted to, though the query to find duplicate rows would probably be very inefficient), so I’m not yet convinced we should dismiss this idea.

Sure, atm I feel that should still be the last resort.. but let's see which solution might be cheaper on performance and maintenance.

I don’t understand your third solution yet, can you clarify what “prepare” means and where which queries are run (except 2., which already says “master”)?

I'll put it in the full pseudo steps:

1. prepare a list of needed records -> `needed`

2. query `needed` records from replica with their ids -> `existing`

3. prepare a list of non-existing records in replica: `needed - existing -> non_existing`

4. insert `non_existing` into master

This needs to INSERT IGNORE to guard against races.

5. query `non_existing` from master with their ids -> `existing_master`

I think here you need to query all needed (and lock them until you "use" them), as some might have been/ will be deleted from master.

6. if `count(existing_master) != count(non_existing)` then `non_existing - existing_master -> non_existing` and go to 4.

that is a loop solution, that accounts of inserted terms from in-parallel requests.. not sure how often that will happen, but one run of the loop is at least needed in case the very first set of non-existing records in replica might have some of the records in master already

In general you will have to make sure all entries (rows) you want to refer to are present (and ensure that by locking them on the master) until you put your references in place (so that no one deletes them because they are not referred to from anywhere).

If doing all of that in a transaction is a performance problem, you could go for a multi-step process: 1) find/create the entries you want to reference, 2) put the references in place, 3) make sure the entries you just referenced are still there, 4) fix now-invalid references if needed. This should work fairly well, unless the runner dies after step 2) and you're not in a transaction.
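
A sketch of that multi-step process, purely for illustration (all helper names are hypothetical; the numbered comments map to steps 1) to 4) above):

	// 1) Find or create the rows we want to reference.
	$textIds = $this->acquireTextIds( $texts );

	// 2) Put the references in place, e.g. wbt_text_in_lang rows pointing at them.
	$this->insertTextInLangRecords( $textIds );

	// 3) Make sure the rows we just referenced still exist on master.
	$stillExistingIds = $this->dbMaster->selectFieldValues(
		'wbt_text',
		'wbx_id',
		[ 'wbx_id' => $textIds ],
		__METHOD__
	);

	// 4) Re-acquire and re-point any references that a parallel cleanup invalidated.
	$lostIds = array_diff( $textIds, $stillExistingIds );
	if ( $lostIds !== [] ) {
		$this->fixInvalidReferences( $lostIds );
	}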

I know that this is all fairly ugly, but doing this without using proper foreign keys is not easy (unless you heavily compromise on consistency, I guess).

@AlaaSarhan (Collaborator, Author) commented May 10, 2019

Thanks @mariushoch

I will then try to sleep on it a bit and check whether there's a "simpler" solution that can be done. (The top of my head screams "staging table" at me; no clue what it means though, so I hope to shut it up with some reasoning haha)

@AlaaSarhan (Collaborator, Author) commented May 13, 2019

@mariushoch FYI

I'm continuing with this to finish it first without considering the race-condition problem (= assuming there's no cleanup logic running in-parallel). We will get back to the race conditions between the cleanup (#10) and the acquirer logic, and try to reproduce them in a test, once both are merged and verified to be functional in the test env.

@mariushoch (Member)

I'm continuing with this to finish it first without considering the race-condition problem (= assuming there's no cleanup logic running in-parallel).

Sounds good to me :) Do you plan to handle insertion failures already (duplicate key) or not yet?

@AlaaSarhan (Collaborator, Author)

@mariushoch yup, I'm going to handle that while implementing the logic I stated in the latest pseudo steps, more or less


return $carry;
},
[]
);

Member

I think this whole reduce business can be replaced with:

$existingTextIds = $this->dbr->selectFieldValues(
	self::TABLE_TEXT,
	'wbx_id',
	[ 'wbx_text' => $texts ],
	__METHOD__ // unrelated but also nice
);

}, $nonExistingTexts )
);

$insertedTextRows = $this->dbw->select(

Member

Can also use selectFieldValues(), I think.

}
}

$existingLangTextIdRows = $this->dbr->select(

Member

Can also use selectFieldValues(), I think.

src/PackagePrivate/TypeIdsAcquirer.php (resolved)

$this->assertNotEmpty(
$textId,
sprintf( 'Expected record for text "%s" was not found in wbt_text', $text )

Member

Why sprintf instead of normal string interpolation? "Expected record for text '$text' was not found in wbt_text"


$acquiredTermIds = $dbTermIdsAcquirer->acquireTermIds( $termsArray );

$this->assertTermSArrayExistInDb( $termsArray, $alreadyAcquiredTypeIds );

Member

TermS → Terms


$acquiredTermIds = $dbTermIdsAcquirer->acquireTermIds( $termsArray );

$this->assertEquals(

Member

Would be nicer to use assertSame() instead of assertEquals() (this also applies to two more places below).

It might also be worth extracting a separate function, à la $this->assertRowCount( 'wbt_text', '*' ).
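
One possible shape for such a helper, as a sketch (the second parameter of the suggested call above is left open, so this signature is an assumption):

	private function assertRowCount( string $table, int $expectedCount ): void {
		$this->assertSame(
			$expectedCount,
			(int)$this->db->selectField( $table, 'COUNT(*)', [], __METHOD__ ),
			"Expected $expectedCount row(s) in $table"
		);
	}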

@AlaaSarhan force-pushed the add-database-term-ids-acquirer branch from 1a99c4c to 6dae465 on May 15, 2019 07:19
@AlaaSarhan (Collaborator, Author) commented May 15, 2019

I'm close to wanting this to be reviewed and merged, so as not to keep it open forever. I will today go over it to add separate tests for the new class ReplicaMasterAwareRecordIdsAcquirer (perhaps renaming the class, if anyone has a better suggestion) and a test case for duplication failures on master insertions. But I'd like to stop expanding this PR at this point, and then we can have a look again at how different things can be improved/continued in-parallel/separately.

@AlaaSarhan (Collaborator, Author)

I will also respond to the reviews from before of course ;)

@lucaswerkmeister (Member) left a comment

not a full review, but some hopefully useful comments

@AlaaSarhan changed the title from "[WIP] Add database term ids acquirer" to "Add database term ids acquirer" on May 15, 2019
$existingRecords,
$this->findExistingRecords( $this->dbMaster, $neededRecords )
);
$neededRecords = $this->filterNonExistingRecords( $neededRecords, $existingRecords );

Member

When would this be non-empty? AFAIS only if something you just inserted into master got deleted… is it really worth guarding just for this specific case? (If you want to be safe in all cases, you need to SELECT … FOR UPDATE all record-rows on master)

Collaborator Author

This would be non-empty in the following case:

  • the replica has records 1 and 2, but not 3, 4 and 5
  • master has records 3 and 4, but not 5 (the previous insert will fail due to the duplication of 3 and 4)
  • therefore 5 is still needed and has to be inserted into master again

Collaborator Author

I thought I added a test case for that one, but the one I have under testWhenSomeRecordsDoNotExistInReplicaButExistInMaster tests the case where all $neededRecords spilling here exist in master .. will add another test case for that

Collaborator Author

I changed the existing test case in testWhenSomeRecordsDoNotExistInReplicaButExistInMaster so that master has only a subset of those neededRecords already, in which case the first attempt to save will fail, resulting in $neededRecords being non-empty on this line.

Using an infinite loop here might still not be the best option. We could make it just a two-step check, procedural-style, but I don't know if that would look better, so I'd rather not change it until we find a third, better improvement.

Member

Ok, that makes sense… but given we only care for equivalent rows anyway, couldn't we just go with INSERT IGNORE and be done?

Collaborator Author

Yup, we could have, but we tried that before, and the problem with using IGNORE is that it will still increment the next primary key for every failing record. We concluded that we do not want that side effect in tables with high cardinality, as our primary keys would have gaps and might risk reaching the maximum value of the id column much more quickly.
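
For illustration, a minimal sketch of the rejected approach ('IGNORE' is a real option of IDatabase::insert(); the gap behaviour described above is MySQL/InnoDB's, not something this code controls):

	// Rejected: duplicate-key errors are silently skipped, so this always succeeds...
	$this->dbw->insert(
		'wbt_text',
		$rowsToInsert,
		__METHOD__,
		[ 'IGNORE' ]
	);
	// ...but every ignored duplicate still consumes an auto-increment value,
	// leaving gaps in wbx_id and burning through the id space faster on
	// high-cardinality tables.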

@mariushoch (Member)

Had a quick glance… I think this is mostly good to go now (although I only checked superficially and didn't look at the tests).

One thing I noticed is that sometimes you seem to manually create sub-arrays (like $a = []; if (!isset($a['a'])) { $a['a'] = []; } $a['a'][] = 4;) and sometimes you "autocreate" them (like $a = []; $a['a'][] = 4;). I'm not sure which one we should choose… but I guess we should stick with one way.

@AlaaSarhan force-pushed the add-database-term-ids-acquirer branch from 6e3b56b to 489f81c on May 15, 2019 23:29
@AlaaSarhan (Collaborator, Author)

Now I remember how annoying rebasing is on GitHub, with these commits now appearing after all the previous comments. A slight annoyance, not a big deal if you ignore it; though in one team we did use merge over rebase mainly for that reason (which I'd probably not do again).

@AlaaSarhan (Collaborator, Author)

Had a quick glance… I think this is mostly good to go now (although I only checked superficially and didn't look at the tests).

One thing I noticed is that sometimes you seem to manually create sub-arrays (like $a = []; if (!isset($a['a'])) { $a['a'] = []; } $a['a'][] = 4;) and sometimes you "autocreate" them (like $a = []; $a['a'][] = 4;). I'm not sure which one we should choose… but I guess we should stick with one way.

Good catch, we surely should stick to one. I prefer auto-create (it's not more evil in PHP than creating variables without defining them). Will update accordingly, unless there are reasons against that approach that you know of (and I don't).

@AlaaSarhan force-pushed the add-database-term-ids-acquirer branch from 489f81c to ce2bf3f on May 21, 2019 11:00
@@ -8,6 +8,10 @@ class TermStoreSchemaUpdater {

Collaborator Author

Changes to this file (and the related change in MediaWikiNormalizedTermCleanerTest.php) were not necessary; they slipped through when I rebased. But I'll avoid creating a commit to revert them unless reviewers would prefer that.

@AlaaSarhan (Collaborator, Author)

Had a quick glance… I think this is mostly good to go now (although I only checked superficially and didn't look at the tests).
One thing I noticed is that sometimes you seem to manually create sub-arrays (like $a = []; if (!isset($a['a'])) { $a['a'] = []; } $a['a'][] = 4;) and sometimes you "autocreate" them (like $a = []; $a['a'][] = 4;). I'm not sure which one we should choose… but I guess we should stick with one way.

Good catch, we surely should stick to one. I prefer auto-create (it's not more evil in PHP than creating variables without defining them). Will update accordingly, unless there are reasons against that approach that you know of (and I don't).

Checking the code again, I can't seem to find any pre-created sub-arrays where the pre-creation could have been skipped. There are a few cases where it was necessary, as the sub-array is used in a way that can cause an error if it wasn't created, as in this example (https://github.com/wmde/mediawiki-term-store/pull/11/files#diff-318c7c56290dcb03989fe814a679f4aeR146, lines 146 to 155):

				if ( !isset( $flattenedLangTextIds[$lang] ) ) {
					$flattenedLangTextIds[$lang] = [];
				}
				$flattenedLangTextIds[$lang] = array_unique(
					array_merge(
						(array)$textIds,
						(array)$flattenedLangTextIds[$lang]
					)
				);

@lucaswerkmeister (Member) left a comment

Left some comments on the tests. (Some of them are duplicates, but I don’t know how readable “same here” comments are on GitHub, so I copy+pasted the comments in those cases.)

[ 'wbtl_text_in_lang_id' => $enSameTextInLangId,
'wbtl_type_id' => $alreadyAcquiredTypeIds['alias'] ]
);
$aliasEnSameTermInLangId = (string)$this->db->insertId();

Member

Why the cast to string?

@AlaaSarhan (Collaborator, Author) commented May 21, 2019

because the interface expects string ids only (Note 2 https://github.com/wmde/mediawiki-term-store/pull/11/files#diff-8b29a0bea7b8ed10130868f7ba1dc7cdR75)

while the InMemory implementation returns integer ids. I can perhaps change InMemory to return string ids as well.

@lucaswerkmeister (Member) left a comment

Let’s merge this, it’s been waiting in review for quite long enough by now.

@lucaswerkmeister merged commit 36913a5 into master on May 21, 2019
@lucaswerkmeister deleted the add-database-term-ids-acquirer branch on May 21, 2019 15:49