Add TermCleaner interface and in-memory implementation #16

lucaswerkmeister · 2019-05-14T11:29:30Z

This is based on #10, but I want to get that one merged soon without blocking it on discussions about the two other commits here, so it’s a separate pull request.

mariushoch · 2019-05-14T12:56:44Z

You also submitted MediaWikiNormalizedTermCleaner here, I don't think that was intended.

lucaswerkmeister · 2019-05-14T12:57:51Z

Well, that’s how GitHub pull requests work as far as I know, anything not in the target branch is included as part of the pull request. Once #10 is merged that should fix itself.

JeroenDeDauw · 2019-05-14T16:59:43Z

I am reviewing this now, might push some small changes

JeroenDeDauw · 2019-05-14T17:06:53Z

src/PackagePrivate/InMemoryTermIdsAcquirer.php

+		foreach ( $this->terms as $type => &$termsOfType ) {
+			foreach ( $termsOfType as $lang => &$termsOfLang ) {
+				foreach ( $termsOfLang as $term => $id ) {
+					if ( array_key_exists( $id, $termInLangIdsAsKeys ) ) {


PHPMD would be very mad if it where included in the CS. Lots of indent. I'd have split this into multiple functions to reduce scope of the many variables. Local and minor issue though, fine with merging this.

JeroenDeDauw · 2019-05-14T17:07:55Z

src/PackagePrivate/MediaWikiNormalizedTermCleaner.php

+		if ( $this->dbr === null ) {
+			$this->dbr = $this->lb->getConnection( ILoadBalancer::DB_REPLICA );
+			$this->dbw = $this->lb->getConnection( ILoadBalancer::DB_MASTER );
+		}


This is block fits in its own method

JeroenDeDauw · 2019-05-14T17:17:24Z

tests/Unit/PackagePrivate/InMemoryTermIdsAcquirerTest.php

+			'description' => [ 'en' => 'the description' ],
+		] );
+
+		$termIdsAcquirer->cleanTerms( array_merge( $ids1, $ids2 ) );


What is the reason for making two calls to acquireTermIds rather than one?

I created a PR into this one that changes this test to only one such call: #17

No particular reason, just a vague sense of “let’s make sure the implementation supports multiple acquireTermIds calls”.

That is not the job of this test method

JeroenDeDauw · 2019-05-14T17:34:13Z

tests/Unit/PackagePrivate/InMemoryTermIdsAcquirerTest.php

+
+		$termIdsAcquirer->cleanTerms( array_merge( $ids1, $ids2 ) );
+
+		$this->assertEmpty( TestingAccessWrapper::newFromObject( $termIdsAcquirer )->terms );


I think breaking encapsulation like this is not a good idea. Since this was done deliberately I am not pushing the change onto this PR and instead made a follow up into it: #18

Well, checking that $terms is completely empty (and not just something like [ 'label' => [ 'en' => [] ]) is the sole purpose of this test, and your follow-up no longer tests that. (hasTerms() might still return false for such a structure, depending on how it’s implemented.)

Bit weird you are not putting that comment on the PR.

Will check later in my IDE myself - currently not clear to me how the behavior is changed. assertEmpty and empty() are not equivalent?

As your pull request is a follow-up to this pull request, I think it makes sense to keep the discussion here – my comment was intended mainly as a reply to your comment, not as a review of your follow-up.

Neither your pull request nor mine use empty() as far as I can tell, but I assume your question refers to the implementation of hasTerms() in your pull request, which you then check with assertFalse(). The test !== [] in that method is indeed equivalent to the assertEmpty() here (assuming the values are arrays), but this is an implementation detail – hasTerms() might be changed later so that it still returns false if the $terms for some reason end up in a state like [ 'label' => [ 'en' => [] ] ] (which still arguably has no actual terms). By directly using assertEmpty on the internal field in the tests, we avoid relying on this implementation detail.

By directly using assertEmpty on the internal field in the tests, we avoid relying on this implementation detail.

I am quite confused by this. Is it not the exact opposite? By using hasTerms, nothing outside of the class depends on its implementation, and by directly accessing the private field, the test does rely on it? Or are you talking about something else?

As your pull request is a follow-up to this pull request, I think it makes sense to keep the discussion here – my comment was intended mainly as a reply to your comment, not as a review of your follow-up.

Not sure you noticed: my PRs are PRs into this one. If they are merged, they become part of this PR. Hence these are to be merged first, not after this PR, as you might expect when using the term "follow up".

JeroenDeDauw · 2019-05-14T17:51:36Z

tests/Unit/PackagePrivate/InMemoryTermIdsAcquirerTest.php

+			'label' => [ 'en' => 'the label' ],
+			'description' => [ 'en' => 'the description' ],
+		] );
+		$this->assertSame( $ids2, $ids3 );


Another follow up: #19

This test is not doing what I originally expected it to be doing: ensuring only unreferenced ids are cleaned up. The tested code (InMemoryTermIdsAcquirer) does not seem to be doing that at all. Should it not be doing that?

How is the code supposed to know which term_in_lang IDs are referenced and which aren’t? It cleans up all the ones it gets, that’s why the calling code has to do the array_diff ($ids2 are still used).

As far as I can tell, your follow-up is mostly equivalent, except that it hard-codes all the IDs, meaning it will break as soon as we change the ID assignment in any way (e. g. start counting from 0 instead of 1, or make the counter per-class instead of per-instance), so I don’t understand how it’s supposed to be an improvement.

The refactored version is less complex and (at least to me) easier to understand. If the array_diff was adding some kind of value to the test, it is unclear enough for me to still not understand it after looking at it for 10 minutes.

You are right this test binds to the id numbers. Not clear to me there is a practical downside to that for this particular case. Remember we are talking about an in memory test double that is package private.

The non-in-memory implementation is also “package private” at the moment, so to me that label doesn’t carry much value. Plenty of Wikibase’s in-memory test doubles are public (because they’re useful in tests of other components as well).

Yes many test doubles in WB are public and used in other components. (A practice which I started, and had some people complaining about consistency and "this should not be public because I don't expect it to be".)

So what? And how does that make the distinction between package public and private less important? Does it not make it more important?

A bunch of thoughts on this topic:

From the perspective of the "store Item/Property terms" use case

ItemTermStore and PropertyTermStore are good interfaces (which is a bit "doh!", since that is what they where created for)

The database tables and proposed split of having some in one component and some in the other are an implementation detail

If persistence details are encapsulated, then you cannot join across tables behind the boundary

If you drop ItemTermStore and PropertyTermStore and just have wbt_item_terms/wbt_property_terms bound implementations in Wikibase Repo, then you loose the current implementation decoupling

Having a service/component that just deals with terms and not the entity ids can make sense

Such a component can be in its own sub domain. The code is already going in that direction by using data structures specific to the functionality provided. In this case it is likely good to explicitly not use Wikibase DataModel.

Replacing the current term store abstractions with the package private ones here makes little sense to me. Yes, we might want to reuse a bunch of the package private stuff here at some later point. We can then do the extraction and have an additional (lower level) public abstraction. I do not think it makes sense to do this now because of YAGNI and certainly don't think it makes sense to replace the current abstraction with the package private one.

ItemTermStore and PropertyTermStore are good interfaces (which is a bit "doh!", since that is what they where created for)

Yes they are. No problem with these interfaces, and they will continue to exist in their context (Wikibase extension probably).
We have one of two options proceeding with Wikibase DataModel design:

Drop the general abstraction. Meaning we drop EntityId and EntityDocument interfaces from the public spaces of the DataModel. If that's the direction that we want to go in, we should then not do the split suggested in here probably.

Keep the abstraction but take Item and Property stuff out of it.

The database tables and proposed split of having some in one component and some in the other are an implementation detail

Not sure I understand this one.

If persistence details are encapsulated, then you cannot join across tables behind the boundary

No body says that joining tables is the only solution that must be used between linking tables. For that case specifically, it sounds like an implementation detail imposed on design, which sometimes is acceptable if such detail is of great importance for the function of the system. But it is not immediately obvious that it is to me yet.

To give and example, if normalized term tables are joined by a library to search through them and returns back ids, then that is used to query against linking tables in a second query .. this is a pseudo snippet:

$resultTermIds = TermLookup::searchFuzzilyInAllTypesAndLanguages( 'some text' ); $resultItemIds = doQuery(<<<SQL select wbit_item_id from wbt_item_terms where wbit_term_in_lang_id IN { convertToSqlList( $resultTermIds ) } SQL; );

It isn't immediately obvious that that's not acceptable nor that it is necessarily less performant (as in compared to acceptable performance) than if doing all of it as one query and one series of joins. There is even another benefit with such split, now you can cache $resultTermIds separately too ;)

If you drop ItemTermStore and PropertyTermStore and just have wbt_item_terms/wbt_property_terms bound implementations in Wikibase Repo, then you loose the current implementation decoupling

No sure what you mean here. Would you give an example maybe on this?

Such a component can be in its own sub domain. The code is already going in that direction by using data structures specific to the functionality provided. In this case it is likely good to explicitly not use Wikibase DataModel.

Agree. but what is the difference between a domain and a sub domain? are they all domains on their own right at the end, and it is just a matter of the distance from where you view things? What difference would it make to this decision if we say "TermsStore" must be a sub domain and not a domain on its own right, practically?

Replacing the current term store abstractions with the package private ones here makes little sense to me. Yes, we might want to reuse a bunch of the package private stuff here at some later point. We can then do the extraction and have an additional (lower level) public abstraction. I do not think it makes sense to do this now because of YAGNI and certainly don't think it makes sense to replace the current abstraction with the package private one.

Agree to YAGNI part.

I tend to agree with your concerns .. but I'm a bit paralyzed why you think the above decision is in contradiction with your main message. I think what that decision tries to do is the same as what you have described as "Such a component can be in its own sub domain. The code is already going in that direction by using data structures specific to the functionality provided. In this case it is likely good to explicitly not use Wikibase DataModel."

I feel like we are in agreement so not sure why that decision doesn't makes sense to you.

Or am I getting you completely wrong?

No problem with these interfaces, and they will continue to exist in their context (Wikibase extension probably).

Why move them there? Are the reasons why we created the library no longer valid?

We have one of two options proceeding with Wikibase DataModel design

I do not see a benefit in including redesign of Wikibase DataModel in this conversation. I certainly was not talking about it.

The database tables and proposed split of having some in one component and some in the other are an implementation detail

Not sure I understand this one.

The fact that there are DB tables is an implementation detail to begin with, so anything that depends on that is as well.

No body says that joining tables is the only solution that must be used

Yes, and no one said anyone said that. All I did was point out you can't do joins with that approach. I totally agree with the rest of what you wrote.

not sure why that decision doesn't makes sense to you.

The two reasons are described in the last paragraph of my previous comment.

If persistence details are encapsulated, then you cannot join across tables behind the boundary

The fact that there are DB tables is an implementation detail to begin with, so anything that depends on that is as well.

I don’t fully agree with this. The exact internal layout of the tables is an implementation detail, but how can the mere fact that there are DB tables remain internal? Surely, Wikibase (or any other user of the library) will have to inject a database connection into the implementation. (The library may offer multiple term store implementations, some of which wouldn’t use a database, but I assume it will be Wikibase who chooses which implementation is used.)

And if the general fact that a database is used is part of the interface, then a join across the boundary should also be possible. For example, MediaWiki core’s ActorMigration and CommentStore classes each have a getJoin() method to instruct both other core classes and extensions how to perform a join against the actor/comment storage.

Constructors are not part of the interface of a class. It does not matter what needs to be injected.

This is the interface offered by the MediaWikiNormalizedTermCleaner class, which may also be implemented by others.

The TermIdsAcquirer and the TermCleaner clearly have to manipulate the same state somehow. For the MediaWiki implementation, this happens via the database; for the in-memory implementation, I’m going the easy route and making one class implement both interfaces (rather than, for example, having a third class with the actual state – the $terms array – and two classes accessing that and implementing the acquirer and cleaner parts). Some of the tests could be reused for other cleaners as well in the future, if we wanted.

lucaswerkmeister · 2019-05-15T16:10:46Z

I’ve pushed new versions of the commits with some added documentation that might help clear up the confusion around the array_diff() usage.

JeroenDeDauw · 2019-05-15T21:05:44Z

Well great, you merge conflicted all my PRs into this one. Can't say I feel encouraged to help out more.

JeroenDeDauw

Accidental complexity in tests. Fixed by linked PRs. +2 when those have been merged.

AlaaSarhan · 2019-05-16T13:07:34Z

I'm merging this for now.

I will direct the other PRs against master, review them and try to fix merge conflicts for ones I approve.

lucaswerkmeister requested review from JeroenDeDauw and AlaaSarhan May 14, 2019 11:29

JeroenDeDauw reviewed May 14, 2019

View reviewed changes

JeroenDeDauw force-pushed the clean-inmemory branch from 3a68b90 to 05b3166 Compare May 14, 2019 17:11

JeroenDeDauw reviewed May 14, 2019

View reviewed changes

lucaswerkmeister force-pushed the clean-inmemory branch from bc20fa4 to 05b3166 Compare May 15, 2019 13:41

lucaswerkmeister mentioned this pull request May 15, 2019

Fix typo in class name and add missing @covers #22

Merged

lucaswerkmeister added 2 commits May 15, 2019 18:10

Introduce TermCleaner interface

db65b63

This is the interface offered by the MediaWikiNormalizedTermCleaner class, which may also be implemented by others.

lucaswerkmeister force-pushed the clean-inmemory branch from 05b3166 to 3acc58e Compare May 15, 2019 16:10

JeroenDeDauw suggested changes May 15, 2019

View reviewed changes

AlaaSarhan merged commit 43a2604 into master May 16, 2019

JeroenDeDauw deleted the clean-inmemory branch May 18, 2019 01:12

lucaswerkmeister mentioned this pull request May 21, 2019

Remove InMemoryTermIdsAcquirer::hasTerms() and tests #24

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add TermCleaner interface and in-memory implementation #16

Add TermCleaner interface and in-memory implementation #16

lucaswerkmeister commented May 14, 2019

mariushoch commented May 14, 2019

lucaswerkmeister commented May 14, 2019

JeroenDeDauw commented May 14, 2019

JeroenDeDauw May 14, 2019

JeroenDeDauw May 14, 2019

JeroenDeDauw May 14, 2019

JeroenDeDauw May 14, 2019

lucaswerkmeister May 15, 2019

JeroenDeDauw May 15, 2019

JeroenDeDauw May 14, 2019

lucaswerkmeister May 15, 2019

JeroenDeDauw May 15, 2019

lucaswerkmeister May 15, 2019

JeroenDeDauw May 15, 2019

JeroenDeDauw May 14, 2019

lucaswerkmeister May 15, 2019

JeroenDeDauw May 15, 2019

lucaswerkmeister May 15, 2019

JeroenDeDauw May 15, 2019

JeroenDeDauw May 18, 2019 •

edited

Loading

AlaaSarhan May 18, 2019 •

edited by JeroenDeDauw

Loading

JeroenDeDauw May 19, 2019

lucaswerkmeister May 20, 2019

JeroenDeDauw May 21, 2019

lucaswerkmeister commented May 15, 2019

JeroenDeDauw commented May 15, 2019

JeroenDeDauw left a comment

AlaaSarhan commented May 16, 2019


		$termIdsAcquirer->cleanTerms( array_merge( $ids1, $ids2 ) );

		$this->assertEmpty( TestingAccessWrapper::newFromObject( $termIdsAcquirer )->terms );

Add TermCleaner interface and in-memory implementation #16

Add TermCleaner interface and in-memory implementation #16

Conversation

lucaswerkmeister commented May 14, 2019

mariushoch commented May 14, 2019

lucaswerkmeister commented May 14, 2019

JeroenDeDauw commented May 14, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JeroenDeDauw May 18, 2019 • edited Loading

Choose a reason for hiding this comment

AlaaSarhan May 18, 2019 • edited by JeroenDeDauw Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lucaswerkmeister commented May 15, 2019

JeroenDeDauw commented May 15, 2019

JeroenDeDauw left a comment

Choose a reason for hiding this comment

AlaaSarhan commented May 16, 2019

JeroenDeDauw May 18, 2019 •

edited

Loading

AlaaSarhan May 18, 2019 •

edited by JeroenDeDauw

Loading