Skip to content
This repository has been archived by the owner on Jul 20, 2019. It is now read-only.

Add TermCleaner interface and in-memory implementation #16

Merged
merged 2 commits into from
May 16, 2019

Conversation

lucaswerkmeister
Copy link
Member

This is based on #10, but I want to get that one merged soon without blocking it on discussions about the two other commits here, so it’s a separate pull request.

@mariushoch
Copy link
Member

You also submitted MediaWikiNormalizedTermCleaner here, I don't think that was intended.

@lucaswerkmeister
Copy link
Member Author

Well, that’s how GitHub pull requests work as far as I know, anything not in the target branch is included as part of the pull request. Once #10 is merged that should fix itself.

@JeroenDeDauw
Copy link
Contributor

I am reviewing this now, might push some small changes

foreach ( $this->terms as $type => &$termsOfType ) {
foreach ( $termsOfType as $lang => &$termsOfLang ) {
foreach ( $termsOfLang as $term => $id ) {
if ( array_key_exists( $id, $termInLangIdsAsKeys ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PHPMD would be very mad if it where included in the CS. Lots of indent. I'd have split this into multiple functions to reduce scope of the many variables. Local and minor issue though, fine with merging this.

if ( $this->dbr === null ) {
$this->dbr = $this->lb->getConnection( ILoadBalancer::DB_REPLICA );
$this->dbw = $this->lb->getConnection( ILoadBalancer::DB_MASTER );
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is block fits in its own method

'description' => [ 'en' => 'the description' ],
] );

$termIdsAcquirer->cleanTerms( array_merge( $ids1, $ids2 ) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for making two calls to acquireTermIds rather than one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a PR into this one that changes this test to only one such call: #17

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason, just a vague sense of “let’s make sure the implementation supports multiple acquireTermIds calls”.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not the job of this test method


$termIdsAcquirer->cleanTerms( array_merge( $ids1, $ids2 ) );

$this->assertEmpty( TestingAccessWrapper::newFromObject( $termIdsAcquirer )->terms );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think breaking encapsulation like this is not a good idea. Since this was done deliberately I am not pushing the change onto this PR and instead made a follow up into it: #18

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, checking that $terms is completely empty (and not just something like [ 'label' => [ 'en' => [] ]) is the sole purpose of this test, and your follow-up no longer tests that. (hasTerms() might still return false for such a structure, depending on how it’s implemented.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit weird you are not putting that comment on the PR.

Will check later in my IDE myself - currently not clear to me how the behavior is changed. assertEmpty and empty() are not equivalent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As your pull request is a follow-up to this pull request, I think it makes sense to keep the discussion here – my comment was intended mainly as a reply to your comment, not as a review of your follow-up.

Neither your pull request nor mine use empty() as far as I can tell, but I assume your question refers to the implementation of hasTerms() in your pull request, which you then check with assertFalse(). The test !== [] in that method is indeed equivalent to the assertEmpty() here (assuming the values are arrays), but this is an implementation detail – hasTerms() might be changed later so that it still returns false if the $terms for some reason end up in a state like [ 'label' => [ 'en' => [] ] ] (which still arguably has no actual terms). By directly using assertEmpty on the internal field in the tests, we avoid relying on this implementation detail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By directly using assertEmpty on the internal field in the tests, we avoid relying on this implementation detail.

I am quite confused by this. Is it not the exact opposite? By using hasTerms, nothing outside of the class depends on its implementation, and by directly accessing the private field, the test does rely on it? Or are you talking about something else?

As your pull request is a follow-up to this pull request, I think it makes sense to keep the discussion here – my comment was intended mainly as a reply to your comment, not as a review of your follow-up.

Not sure you noticed: my PRs are PRs into this one. If they are merged, they become part of this PR. Hence these are to be merged first, not after this PR, as you might expect when using the term "follow up".

'label' => [ 'en' => 'the label' ],
'description' => [ 'en' => 'the description' ],
] );
$this->assertSame( $ids2, $ids3 );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another follow up: #19

This test is not doing what I originally expected it to be doing: ensuring only unreferenced ids are cleaned up. The tested code (InMemoryTermIdsAcquirer) does not seem to be doing that at all. Should it not be doing that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the code supposed to know which term_in_lang IDs are referenced and which aren’t? It cleans up all the ones it gets, that’s why the calling code has to do the array_diff ($ids2 are still used).

As far as I can tell, your follow-up is mostly equivalent, except that it hard-codes all the IDs, meaning it will break as soon as we change the ID assignment in any way (e. g. start counting from 0 instead of 1, or make the counter per-class instead of per-instance), so I don’t understand how it’s supposed to be an improvement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactored version is less complex and (at least to me) easier to understand. If the array_diff was adding some kind of value to the test, it is unclear enough for me to still not understand it after looking at it for 10 minutes.

You are right this test binds to the id numbers. Not clear to me there is a practical downside to that for this particular case. Remember we are talking about an in memory test double that is package private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-in-memory implementation is also “package private” at the moment, so to me that label doesn’t carry much value. Plenty of Wikibase’s in-memory test doubles are public (because they’re useful in tests of other components as well).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes many test doubles in WB are public and used in other components. (A practice which I started, and had some people complaining about consistency and "this should not be public because I don't expect it to be".)

So what? And how does that make the distinction between package public and private less important? Does it not make it more important?

Copy link
Contributor

@JeroenDeDauw JeroenDeDauw May 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of thoughts on this topic:

  • From the perspective of the "store Item/Property terms" use case
    • ItemTermStore and PropertyTermStore are good interfaces (which is a bit "doh!", since that is what they where created for)
    • The database tables and proposed split of having some in one component and some in the other are an implementation detail
  • If persistence details are encapsulated, then you cannot join across tables behind the boundary
  • If you drop ItemTermStore and PropertyTermStore and just have wbt_item_terms/wbt_property_terms bound implementations in Wikibase Repo, then you loose the current implementation decoupling
  • Having a service/component that just deals with terms and not the entity ids can make sense
  • Such a component can be in its own sub domain. The code is already going in that direction by using data structures specific to the functionality provided. In this case it is likely good to explicitly not use Wikibase DataModel.

Replacing the current term store abstractions with the package private ones here makes little sense to me. Yes, we might want to reuse a bunch of the package private stuff here at some later point. We can then do the extraction and have an additional (lower level) public abstraction. I do not think it makes sense to do this now because of YAGNI and certainly don't think it makes sense to replace the current abstraction with the package private one.

Copy link
Collaborator

@AlaaSarhan AlaaSarhan May 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ItemTermStore and PropertyTermStore are good interfaces (which is a bit "doh!", since that is what they where created for)

Yes they are. No problem with these interfaces, and they will continue to exist in their context (Wikibase extension probably).
We have one of two options proceeding with Wikibase DataModel design:

  • Drop the general abstraction. Meaning we drop EntityId and EntityDocument interfaces from the public spaces of the DataModel. If that's the direction that we want to go in, we should then not do the split suggested in here probably.
  • Keep the abstraction but take Item and Property stuff out of it.

The database tables and proposed split of having some in one component and some in the other are an implementation detail

Not sure I understand this one.

If persistence details are encapsulated, then you cannot join across tables behind the boundary

No body says that joining tables is the only solution that must be used between linking tables. For that case specifically, it sounds like an implementation detail imposed on design, which sometimes is acceptable if such detail is of great importance for the function of the system. But it is not immediately obvious that it is to me yet.

To give and example, if normalized term tables are joined by a library to search through them and returns back ids, then that is used to query against linking tables in a second query .. this is a pseudo snippet:

$resultTermIds = TermLookup::searchFuzzilyInAllTypesAndLanguages( 'some text' );

$resultItemIds = doQuery(<<<SQL
    select wbit_item_id from wbt_item_terms
    where wbit_term_in_lang_id IN { convertToSqlList( $resultTermIds ) }
SQL;
);

It isn't immediately obvious that that's not acceptable nor that it is necessarily less performant (as in compared to acceptable performance) than if doing all of it as one query and one series of joins. There is even another benefit with such split, now you can cache $resultTermIds separately too ;)

If you drop ItemTermStore and PropertyTermStore and just have wbt_item_terms/wbt_property_terms bound implementations in Wikibase Repo, then you loose the current implementation decoupling

No sure what you mean here. Would you give an example maybe on this?

Such a component can be in its own sub domain. The code is already going in that direction by using data structures specific to the functionality provided. In this case it is likely good to explicitly not use Wikibase DataModel.

Agree. but what is the difference between a domain and a sub domain? are they all domains on their own right at the end, and it is just a matter of the distance from where you view things? What difference would it make to this decision if we say "TermsStore" must be a sub domain and not a domain on its own right, practically?

Replacing the current term store abstractions with the package private ones here makes little sense to me. Yes, we might want to reuse a bunch of the package private stuff here at some later point. We can then do the extraction and have an additional (lower level) public abstraction. I do not think it makes sense to do this now because of YAGNI and certainly don't think it makes sense to replace the current abstraction with the package private one.

Agree to YAGNI part.


I tend to agree with your concerns .. but I'm a bit paralyzed why you think the above decision is in contradiction with your main message. I think what that decision tries to do is the same as what you have described as "Such a component can be in its own sub domain. The code is already going in that direction by using data structures specific to the functionality provided. In this case it is likely good to explicitly not use Wikibase DataModel."

I feel like we are in agreement so not sure why that decision doesn't makes sense to you.

Or am I getting you completely wrong?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem with these interfaces, and they will continue to exist in their context (Wikibase extension probably).

Why move them there? Are the reasons why we created the library no longer valid?

We have one of two options proceeding with Wikibase DataModel design

I do not see a benefit in including redesign of Wikibase DataModel in this conversation. I certainly was not talking about it.

The database tables and proposed split of having some in one component and some in the other are an implementation detail

Not sure I understand this one.

The fact that there are DB tables is an implementation detail to begin with, so anything that depends on that is as well.

No body says that joining tables is the only solution that must be used

Yes, and no one said anyone said that. All I did was point out you can't do joins with that approach. I totally agree with the rest of what you wrote.

not sure why that decision doesn't makes sense to you.

The two reasons are described in the last paragraph of my previous comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If persistence details are encapsulated, then you cannot join across tables behind the boundary

The fact that there are DB tables is an implementation detail to begin with, so anything that depends on that is as well.

I don’t fully agree with this. The exact internal layout of the tables is an implementation detail, but how can the mere fact that there are DB tables remain internal? Surely, Wikibase (or any other user of the library) will have to inject a database connection into the implementation. (The library may offer multiple term store implementations, some of which wouldn’t use a database, but I assume it will be Wikibase who chooses which implementation is used.)

And if the general fact that a database is used is part of the interface, then a join across the boundary should also be possible. For example, MediaWiki core’s ActorMigration and CommentStore classes each have a getJoin() method to instruct both other core classes and extensions how to perform a join against the actor/comment storage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructors are not part of the interface of a class. It does not matter what needs to be injected.

This is the interface offered by the MediaWikiNormalizedTermCleaner
class, which may also be implemented by others.
The TermIdsAcquirer and the TermCleaner clearly have to manipulate the
same state somehow. For the MediaWiki implementation, this happens via
the database; for the in-memory implementation, I’m going the easy route
and making one class implement both interfaces (rather than, for
example, having a third class with the actual state – the $terms array –
and two classes accessing that and implementing the acquirer and cleaner
parts). Some of the tests could be reused for other cleaners as well in
the future, if we wanted.
@lucaswerkmeister
Copy link
Member Author

I’ve pushed new versions of the commits with some added documentation that might help clear up the confusion around the array_diff() usage.

@JeroenDeDauw
Copy link
Contributor

Well great, you merge conflicted all my PRs into this one. Can't say I feel encouraged to help out more.

Copy link
Contributor

@JeroenDeDauw JeroenDeDauw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental complexity in tests. Fixed by linked PRs. +2 when those have been merged.

@AlaaSarhan
Copy link
Collaborator

I'm merging this for now.

I will direct the other PRs against master, review them and try to fix merge conflicts for ones I approve.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants