Changed how "forgeign repository name" is stored in entity ids #681

JeroenDeDauw · 2016-09-22T05:18:26Z

Follow up to #678

I'd have submitted this as a PR into the original, though as that one is coming from a fork, I'd be a PR against that fork... So if you want to see my changes against the existing PR, see only the second commit.

This removes the addition of static code to EntityId and pushes the responsibility of parsing serializations out of the class.

I did not touch various other issues and have created a number of small ones, so this leaves the code in a state, that while it works, still needs some work before it is mergeable. That includes making names consistent again and adding some edge case tests.

The first constructor argument contains what the existing getter has named the "local part", which can also include prefixes defined on other sites. It's not immedidately clear to me why we need this here, so I'd be nice if someone could explain that. Should this not already have been resolved by the time the id is constructed?

Bug: T145516

Follow up to #678 This removes the addition of static code to EntityId and pushes the responsibility of parsing serializations out of the class. I did not touch various other issues and have created a number of small ones, so this leaves the code in a state, that while it works, still needs some work before it is mergeable. That includes making names consistent again and adding some edge case tests. The first constructor argument contains what the existing getter has named the "local part", which can also include prefixes defined on other sites. It's not immedidately clear to me why we need this here, so I'd be nice if someone could explain that. Should this not already have been resolved by the time the id is constructed?

brightbyte · 2016-09-22T13:57:59Z

I don't see how this will still allow an EntityIdParser to apply prefix mappings before constructing an EntityId, or how it can decide which EntityId to construct. For both tasks, it is necessary to have access to a split version of the raw ID string, before the type of the ID is known, or an EntityId has been constructed.

JeroenDeDauw · 2016-09-22T23:11:04Z

Interpreting the string it gets is part of the parsers job. I don't see why you'd put this is some global functions glued to the ID class. If this leads to a concrete problem I'm not seeing it.

manicki · 2016-09-23T09:38:21Z

Having looked at this again I think it would be cleaner if the parser was doing the serialized id split to repository name, local id etc. I don't think it would be a problem for the parser to apply prefix mapping before constructing the EntityId.
Parser would have split methods, maybe similar to those that are now in EntityId, and could use those to get "local id" and test it against known entity id patterns.
Do we want to have some abstract EntityIdParser-implementing class that would have those methods, so all actual Parser implementation s(DispatchingEntityIdParser from data model, prefixMappingEntityIdParser I am adding to data model services etc) could use the same splitting code?

One issue I see with the approach suggested here is classes derived from EntityId in a way duplicate the splitting code, as in https://github.com/wmde/WikibaseDataModel/pull/681/files#diff-3bebb92621622bd5fabe5f9b33d786b1R35 It would be nice to not have those details in Id classes if we wanted to move the responsibility to parsers. But it doesn't seem to be easily doable. Or is it?

Also, it seems that parser would also now need to check if the structure of the id serialization matches the general format and normalize it (eg. remove leading colon). This is now done in EntityId.

Are there any other things to consider here that I am missing?

brightbyte · 2016-09-23T10:00:06Z

@manicki Moving the splitting logic from EntityId into the EntityIdParser sounds sensible. However, that logic then cannot be used in any EntityId implementation, to avoid a circular dependency. That implies that the constructor of EntityId needs to get the three parts separately. The join-logic would still need to be in EntityId, I think, which means knowledge about the syntax is spread over two classes.

This seems worth investigating, but I would like it better if we'd only have to pass two parts to the EntityId, not three. But EntityId needs the first part (for getRepoName), and the last part (for normalization), and middle-and-last part (for getLocalPart).

brightbyte

I have no strong feelings about this PR. I'm personally rather keen on knowledge locality, and I'm ok with binding syntax knowledge that is the same for all IDs to the base class. This PR avoids the static methods, at the price of spreading knowedge about the syntax across three (potentially four) classes.
Ideally, that knowledge would be encapsulated in a separate class, IdTokenizer or some such. But EntityIds would need to know an instance of that, for normalization and joining, which again isn't nice for a value object.

So, in my opinion, this is just swapping one annoyance for another. Seems like a matter of taste to me.

brightbyte · 2016-09-23T10:02:08Z

src/Entity/ItemId.php

 		if ( !is_string( $idSerialization ) ) {
 			throw new InvalidArgumentException( '$idSerialization must be a string' );
 		}

-		if ( !preg_match( self::PATTERN, $idSerialization ) ) {
+		$parts = explode( ':', $idSerialization );


Having the explode/implode based on ":" here means spreading the knowledge about the syntax for the prefixes across multiple classes. That's what I tried to avoid with the static methods.

brightbyte · 2016-09-23T10:20:29Z

@jeroen wrote:

The first constructor argument contains what the existing getter has named the "local part", which can also include prefixes defined on other sites. It's not immedidately clear to me why we need this here, so I'd be nice if someone could explain that. Should this not already have been resolved by the time the id is constructed?

We cannot assume that we will always have a local mapping for all prefixes used by another repo. If we don't have a local mapping for x:Q12 coming from repo foo, we still need to handle it somehow. The solution is to call it foo:x:Q12 locally. When trying to load the entity, we need to ask repo foo (from getRepoName) for x:Q12 (from getLocalPart). This is explained in the section about "chained" prefixes here: https://phabricator.wikimedia.org/T133381

JeroenDeDauw · 2016-09-29T13:12:44Z

The join-logic would still need to be in EntityId, I think [...]

Why?

brightbyte · 2016-09-29T13:29:58Z

@JeroenDeDauw if the constructor of the EntityId gets the three parts separately, it still needs to be able to construct a single string "serialization" from them, so it needs the "join-logic".

I really don't like the idea of repeating the split-and-join logic for all entity types (six so far), plus EntityIdFactory, plus DispatchingEntityIdParser. This spreads knowledge about the syntax not only over several classes, but over several modules, and requires implementors of one of our main extension points (adding entity types) to duplicate it for every new type.

JeroenDeDauw · 2016-09-29T13:34:08Z

What does it need to create this serialization for?

brightbyte · 2016-09-29T13:41:03Z

@JeroenDeDauw to return it from getSerialization(), which is used in 100 files in wikibase.

You could of course pass the full serialization as yet another parameter to the constructor.

But in the end, it seems to me that if you want to keep the parsing logic out of the EntityId object and into the parser, you also need to move all validation and normalization out of EntityId. Which then raises the question where it should go, since it is bound to each specific subclass of EntityId, and can't live in the parser.

JeroenDeDauw · 2016-09-29T14:42:55Z

Yeah, but what types of things is getSerialization used for? If it is not the public format, then you don't need to bind to it anyway and you are not "duplicating", since the things are different.

brightbyte · 2016-09-29T15:10:43Z

@JeroenDeDauw what EntityId::getSerialization returns is the canonical public ID, used in the database, in JSON, in RDF, and so on.

The syntax used for prefixes cannot be different for different entity types. It should be possible to determine the home repo of an entity without knowing the concrete type. Only the normalization/validation of the last part of the ID is specific to the type of the ID.

JeroenDeDauw · 2016-09-29T15:57:31Z

Value objects are not the place to put presentation code

brightbyte · 2016-09-29T20:33:56Z

@JeroenDeDauw what presentation code? The "serial" form of the ID is the value that is contained in the value object. There is nothing presentational about it. A URL object would have a method to get the URL as a string, right?

If I remember correctly, it was you who introduced getSerialization, for exactly that reason.

JeroenDeDauw · 2016-09-30T07:07:42Z

If the value object has multiple fields, then no, it does not just contain the serial form.

Originally we had one constructor parameter, $serialization, with matching getSerialization. Now we have more parameters and renamed $serialization to $localSerialization, without touching the corresponding method.

brightbyte · 2016-09-30T07:16:58Z

@JeroenDeDauw well, it was a single value when getSerialization() was introduced. And to me, it still only contains one field (with convenience getters, like a URL object would), but that's a matter of perspective. In any case, getSerialization() needs to return the entire thing, not just the local part. If you pass the individual bits to the constructor, you have to join them back together again.

JeroenDeDauw · 2016-09-30T07:59:50Z

It's not a matter of perspective, it's a choice you make, after which it's one way or the other.

The general rule is to have a field for each meaningful part of the value. AFAIK pretty much all data values and value objects in Wikibase follow that approach. Perhaps this URL you mention is an exception - can you link it? Such an exception can make sense if the context in which the value is defined does not care about the parts; in other words its not really an exception at all. To me the different parts of the IDs are clearly relevant within the domain of our data model, do you disagree?

brightbyte · 2016-10-04T13:26:10Z

@JeroenDeDauw I really don't care whether we store the ID in one member variable, or in two, or in three. I do care about how many classes should know about the syntax for the prefix. Spreading that knowledge across multiple classes is annoying, spreading it across multiple modules is bad. In the name of OCP, this knowledge should be local to a single class.

What concrete problem are you trying to solve here? I don't see the improvement. And I don't see a way around the OCP issue introduced by this approach that would not leads to overhead an annoyance (we could introduce IdTokenizer and IdTokens classes - but YAGNI),

Perhaps this URL you mention is an exception - can you link it?

Well, e.g. Java's URL class has a single string constructor, as well as a per-part constructor. It has getters for the individual parts, as well as a getter for the whole thing. I don't know or care how it is represented internally, that should be an implementation detail. That was my point.

Such an exception can make sense if the context in which the value is defined does not care about the parts; in other words its not really an exception at all. To me the different parts of the IDs are clearly relevant within the domain of our data model, do you disagree?

There are three context in which the individual parts of the ID are relevant in our data model:

normalization during construction (last part)
mapping during deserialization (middle part)
resolving the ID to load content (first part)

In all other places, the ID is completely opaque, prefix or no. This is pretty similar to the URL use case, I think: in most contexts, a URL is an atomic identifier; the individual parts are only relevant when actually loading data.

But again, I think that discussion is rather academic. To me, the relevant question is:

What problem does this change solve, and is it worth spreading syntax knowledge across modules?

JeroenDeDauw · 2016-10-06T23:43:13Z

I do not understand where this OCP violation you mention is.

The description of this PR mentions what problems it solves: the responsibility of presentation bound serialization code in the value object and the global functions glued onto it and accessed from other components.

It is not clear to me syntax knowledge is being spread across modules more than it otherwise would be. Depending on if getSerialization is conceptually bound to the presentation format, the DataModel component can even avoid that binding altogether. I tried finding out earlier in this thread and did not get an answer.

brightbyte · 2016-10-08T09:50:57Z

The OCP violation is about encapsulating the prefix syntax. With the code currently in master, the syntax is controlled entirely by the EntityId base class, no other code knows or cares about whether it uses colons or what else. The syntax could be extended by changing just that one class. This is particularly nice when fixing edge case behavior which might otherwise diverge - e.g. regarding the handling of foo:::bar or something. With your code, every constructor of every sublclass, and also every parser and deserializer, has to know about the syntax, and behave consistently, copying the exact same code over and over. That's asking for trouble.

Note that the prefix syntax cannot be different for different kinds of IDs. It's always exactly the same.

I think the big issue here is the term "presentation". I see nothing related to presentation here. "representation" perhaps. But again, as for a URL, the representation of an entity ID as a string is essential here, it's not an arbitrary presentation. It defines what an entity ID is, and thus should be b bound to the base class. We are essentially talking about the equivalent of PHP's parse_url http://php.net/manual/de/function.parse-url.php or python's urlparse https://docs.python.org/2/library/urlparse.html. Not that python or PHP libraries are a good example of OO design in general. But I see no problem here.

Can you give an example under which circumstances this would be a problem? Most code violates some principle or other to some extend. We have to the benefit of fixing that violation against the coast of fixing that violation - and in this case, I see no practical benefit, and quite a bit of potential harm.

thiemowmde · 2017-11-09T14:34:18Z

I'm going to close this for now because the code drafted here is not applicable any more. #767 currently does something very similar and might be worth a look.

jakobw and others added 2 commits September 21, 2016 17:25

Support foreign EntityIds.

41e8f75

Bug: T145516

JeroenDeDauw force-pushed the suchids branch from 874a834 to 163ba71 Compare September 22, 2016 05:19

JeroenDeDauw mentioned this pull request Sep 22, 2016

Support foreign EntityIds. #678

Merged

brightbyte reviewed Sep 23, 2016

View reviewed changes

thiemowmde closed this Nov 9, 2017

thiemowmde deleted the suchids branch November 9, 2017 14:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changed how "forgeign repository name" is stored in entity ids #681

Changed how "forgeign repository name" is stored in entity ids #681

JeroenDeDauw commented Sep 22, 2016 •

edited

Loading

brightbyte commented Sep 22, 2016

JeroenDeDauw commented Sep 22, 2016

manicki commented Sep 23, 2016

brightbyte commented Sep 23, 2016

brightbyte left a comment

brightbyte Sep 23, 2016

brightbyte commented Sep 23, 2016 •

edited

Loading

JeroenDeDauw commented Sep 29, 2016

brightbyte commented Sep 29, 2016

JeroenDeDauw commented Sep 29, 2016

brightbyte commented Sep 29, 2016 •

edited

Loading

JeroenDeDauw commented Sep 29, 2016

brightbyte commented Sep 29, 2016 •

edited

Loading

JeroenDeDauw commented Sep 29, 2016

brightbyte commented Sep 29, 2016

JeroenDeDauw commented Sep 30, 2016

brightbyte commented Sep 30, 2016

JeroenDeDauw commented Sep 30, 2016

brightbyte commented Oct 4, 2016

JeroenDeDauw commented Oct 6, 2016

brightbyte commented Oct 8, 2016 •

edited

Loading

thiemowmde commented Nov 9, 2017

Changed how "forgeign repository name" is stored in entity ids #681

Changed how "forgeign repository name" is stored in entity ids #681

Conversation

JeroenDeDauw commented Sep 22, 2016 • edited Loading

brightbyte commented Sep 22, 2016

JeroenDeDauw commented Sep 22, 2016

manicki commented Sep 23, 2016

brightbyte commented Sep 23, 2016

brightbyte left a comment

Choose a reason for hiding this comment

brightbyte Sep 23, 2016

Choose a reason for hiding this comment

brightbyte commented Sep 23, 2016 • edited Loading

JeroenDeDauw commented Sep 29, 2016

brightbyte commented Sep 29, 2016

JeroenDeDauw commented Sep 29, 2016

brightbyte commented Sep 29, 2016 • edited Loading

JeroenDeDauw commented Sep 29, 2016

brightbyte commented Sep 29, 2016 • edited Loading

JeroenDeDauw commented Sep 29, 2016

brightbyte commented Sep 29, 2016

JeroenDeDauw commented Sep 30, 2016

brightbyte commented Sep 30, 2016

JeroenDeDauw commented Sep 30, 2016

brightbyte commented Oct 4, 2016

JeroenDeDauw commented Oct 6, 2016

brightbyte commented Oct 8, 2016 • edited Loading

thiemowmde commented Nov 9, 2017

JeroenDeDauw commented Sep 22, 2016 •

edited

Loading

brightbyte commented Sep 23, 2016 •

edited

Loading

brightbyte commented Sep 29, 2016 •

edited

Loading

brightbyte commented Sep 29, 2016 •

edited

Loading

brightbyte commented Oct 8, 2016 •

edited

Loading