Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed how "forgeign repository name" is stored in entity ids #681

Closed
wants to merge 2 commits into from

Conversation

JeroenDeDauw
Copy link
Contributor

@JeroenDeDauw JeroenDeDauw commented Sep 22, 2016

Follow up to #678

I'd have submitted this as a PR into the original, though as that one is coming from a fork, I'd be a PR against that fork... So if you want to see my changes against the existing PR, see only the second commit.

This removes the addition of static code to EntityId and pushes the responsibility of parsing serializations out of the class.

I did not touch various other issues and have created a number of small ones, so this leaves the code in a state, that while it works, still needs some work before it is mergeable. That includes making names consistent again and adding some edge case tests.

The first constructor argument contains what the existing getter has named the "local part", which can also include prefixes defined on other sites. It's not immedidately clear to me why we need this here, so I'd be nice if someone could explain that. Should this not already have been resolved by the time the id is constructed?

jakobw and others added 2 commits September 21, 2016 17:25
Follow up to #678

This removes the addition of static code to EntityId and pushes
the responsibility of parsing serializations out of the class.

I did not touch various other issues and have created a number of
small ones, so this leaves the code in a state, that while it works,
still needs some work before it is mergeable. That includes making
names consistent again and adding some edge case tests.

The first constructor argument contains what the existing getter
has named the "local part", which can also include prefixes defined
on other sites. It's not immedidately clear to me why we need this
here, so I'd be nice if someone could explain that. Should this
not already have been resolved by the time the id is constructed?
@brightbyte
Copy link

I don't see how this will still allow an EntityIdParser to apply prefix mappings before constructing an EntityId, or how it can decide which EntityId to construct. For both tasks, it is necessary to have access to a split version of the raw ID string, before the type of the ID is known, or an EntityId has been constructed.

@JeroenDeDauw
Copy link
Contributor Author

Interpreting the string it gets is part of the parsers job. I don't see why you'd put this is some global functions glued to the ID class. If this leads to a concrete problem I'm not seeing it.

@manicki
Copy link
Member

manicki commented Sep 23, 2016

Having looked at this again I think it would be cleaner if the parser was doing the serialized id split to repository name, local id etc. I don't think it would be a problem for the parser to apply prefix mapping before constructing the EntityId.
Parser would have split methods, maybe similar to those that are now in EntityId, and could use those to get "local id" and test it against known entity id patterns.
Do we want to have some abstract EntityIdParser-implementing class that would have those methods, so all actual Parser implementation s(DispatchingEntityIdParser from data model, prefixMappingEntityIdParser I am adding to data model services etc) could use the same splitting code?

One issue I see with the approach suggested here is classes derived from EntityId in a way duplicate the splitting code, as in https://github.com/wmde/WikibaseDataModel/pull/681/files#diff-3bebb92621622bd5fabe5f9b33d786b1R35 It would be nice to not have those details in Id classes if we wanted to move the responsibility to parsers. But it doesn't seem to be easily doable. Or is it?

Also, it seems that parser would also now need to check if the structure of the id serialization matches the general format and normalize it (eg. remove leading colon). This is now done in EntityId.

Are there any other things to consider here that I am missing?

@brightbyte
Copy link

@manicki Moving the splitting logic from EntityId into the EntityIdParser sounds sensible. However, that logic then cannot be used in any EntityId implementation, to avoid a circular dependency. That implies that the constructor of EntityId needs to get the three parts separately. The join-logic would still need to be in EntityId, I think, which means knowledge about the syntax is spread over two classes.

This seems worth investigating, but I would like it better if we'd only have to pass two parts to the EntityId, not three. But EntityId needs the first part (for getRepoName), and the last part (for normalization), and middle-and-last part (for getLocalPart).

Copy link

@brightbyte brightbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no strong feelings about this PR. I'm personally rather keen on knowledge locality, and I'm ok with binding syntax knowledge that is the same for all IDs to the base class. This PR avoids the static methods, at the price of spreading knowedge about the syntax across three (potentially four) classes.
Ideally, that knowledge would be encapsulated in a separate class, IdTokenizer or some such. But EntityIds would need to know an instance of that, for normalization and joining, which again isn't nice for a value object.

So, in my opinion, this is just swapping one annoyance for another. Seems like a matter of taste to me.

if ( !is_string( $idSerialization ) ) {
throw new InvalidArgumentException( '$idSerialization must be a string' );
}

if ( !preg_match( self::PATTERN, $idSerialization ) ) {
$parts = explode( ':', $idSerialization );

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the explode/implode based on ":" here means spreading the knowledge about the syntax for the prefixes across multiple classes. That's what I tried to avoid with the static methods.

@brightbyte
Copy link

brightbyte commented Sep 23, 2016

@jeroen wrote:

The first constructor argument contains what the existing getter has named the "local part", which can also include prefixes defined on other sites. It's not immedidately clear to me why we need this here, so I'd be nice if someone could explain that. Should this not already have been resolved by the time the id is constructed?

We cannot assume that we will always have a local mapping for all prefixes used by another repo. If we don't have a local mapping for x:Q12 coming from repo foo, we still need to handle it somehow. The solution is to call it foo:x:Q12 locally. When trying to load the entity, we need to ask repo foo (from getRepoName) for x:Q12 (from getLocalPart). This is explained in the section about "chained" prefixes here: https://phabricator.wikimedia.org/T133381

@JeroenDeDauw
Copy link
Contributor Author

The join-logic would still need to be in EntityId, I think [...]

Why?

@brightbyte
Copy link

@JeroenDeDauw if the constructor of the EntityId gets the three parts separately, it still needs to be able to construct a single string "serialization" from them, so it needs the "join-logic".

I really don't like the idea of repeating the split-and-join logic for all entity types (six so far), plus EntityIdFactory, plus DispatchingEntityIdParser. This spreads knowledge about the syntax not only over several classes, but over several modules, and requires implementors of one of our main extension points (adding entity types) to duplicate it for every new type.

@JeroenDeDauw
Copy link
Contributor Author

What does it need to create this serialization for?

@brightbyte
Copy link

brightbyte commented Sep 29, 2016

@JeroenDeDauw to return it from getSerialization(), which is used in 100 files in wikibase.

You could of course pass the full serialization as yet another parameter to the constructor.

But in the end, it seems to me that if you want to keep the parsing logic out of the EntityId object and into the parser, you also need to move all validation and normalization out of EntityId. Which then raises the question where it should go, since it is bound to each specific subclass of EntityId, and can't live in the parser.

@JeroenDeDauw
Copy link
Contributor Author

Yeah, but what types of things is getSerialization used for? If it is not the public format, then you don't need to bind to it anyway and you are not "duplicating", since the things are different.

@brightbyte
Copy link

brightbyte commented Sep 29, 2016

@JeroenDeDauw what EntityId::getSerialization returns is the canonical public ID, used in the database, in JSON, in RDF, and so on.

The syntax used for prefixes cannot be different for different entity types. It should be possible to determine the home repo of an entity without knowing the concrete type. Only the normalization/validation of the last part of the ID is specific to the type of the ID.

@JeroenDeDauw
Copy link
Contributor Author

Value objects are not the place to put presentation code

@brightbyte
Copy link

@JeroenDeDauw what presentation code? The "serial" form of the ID is the value that is contained in the value object. There is nothing presentational about it. A URL object would have a method to get the URL as a string, right?

If I remember correctly, it was you who introduced getSerialization, for exactly that reason.

@JeroenDeDauw
Copy link
Contributor Author

If the value object has multiple fields, then no, it does not just contain the serial form.

Originally we had one constructor parameter, $serialization, with matching getSerialization. Now we have more parameters and renamed $serialization to $localSerialization, without touching the corresponding method.

@brightbyte
Copy link

@JeroenDeDauw well, it was a single value when getSerialization() was introduced. And to me, it still only contains one field (with convenience getters, like a URL object would), but that's a matter of perspective. In any case, getSerialization() needs to return the entire thing, not just the local part. If you pass the individual bits to the constructor, you have to join them back together again.

@JeroenDeDauw
Copy link
Contributor Author

It's not a matter of perspective, it's a choice you make, after which it's one way or the other.

The general rule is to have a field for each meaningful part of the value. AFAIK pretty much all data values and value objects in Wikibase follow that approach. Perhaps this URL you mention is an exception - can you link it? Such an exception can make sense if the context in which the value is defined does not care about the parts; in other words its not really an exception at all. To me the different parts of the IDs are clearly relevant within the domain of our data model, do you disagree?

@brightbyte
Copy link

@JeroenDeDauw I really don't care whether we store the ID in one member variable, or in two, or in three. I do care about how many classes should know about the syntax for the prefix. Spreading that knowledge across multiple classes is annoying, spreading it across multiple modules is bad. In the name of OCP, this knowledge should be local to a single class.

What concrete problem are you trying to solve here? I don't see the improvement. And I don't see a way around the OCP issue introduced by this approach that would not leads to overhead an annoyance (we could introduce IdTokenizer and IdTokens classes - but YAGNI),

Perhaps this URL you mention is an exception - can you link it?

Well, e.g. Java's URL class has a single string constructor, as well as a per-part constructor. It has getters for the individual parts, as well as a getter for the whole thing. I don't know or care how it is represented internally, that should be an implementation detail. That was my point.

Such an exception can make sense if the context in which the value is defined does not care about the parts; in other words its not really an exception at all. To me the different parts of the IDs are clearly relevant within the domain of our data model, do you disagree?

There are three context in which the individual parts of the ID are relevant in our data model:

  1. normalization during construction (last part)
  2. mapping during deserialization (middle part)
  3. resolving the ID to load content (first part)

In all other places, the ID is completely opaque, prefix or no. This is pretty similar to the URL use case, I think: in most contexts, a URL is an atomic identifier; the individual parts are only relevant when actually loading data.

But again, I think that discussion is rather academic. To me, the relevant question is:

What problem does this change solve, and is it worth spreading syntax knowledge across modules?

@JeroenDeDauw
Copy link
Contributor Author

I do not understand where this OCP violation you mention is.

The description of this PR mentions what problems it solves: the responsibility of presentation bound serialization code in the value object and the global functions glued onto it and accessed from other components.

It is not clear to me syntax knowledge is being spread across modules more than it otherwise would be. Depending on if getSerialization is conceptually bound to the presentation format, the DataModel component can even avoid that binding altogether. I tried finding out earlier in this thread and did not get an answer.

@brightbyte
Copy link

brightbyte commented Oct 8, 2016

The OCP violation is about encapsulating the prefix syntax. With the code currently in master, the syntax is controlled entirely by the EntityId base class, no other code knows or cares about whether it uses colons or what else. The syntax could be extended by changing just that one class. This is particularly nice when fixing edge case behavior which might otherwise diverge - e.g. regarding the handling of foo:::bar or something. With your code, every constructor of every sublclass, and also every parser and deserializer, has to know about the syntax, and behave consistently, copying the exact same code over and over. That's asking for trouble.

Note that the prefix syntax cannot be different for different kinds of IDs. It's always exactly the same.

I think the big issue here is the term "presentation". I see nothing related to presentation here. "representation" perhaps. But again, as for a URL, the representation of an entity ID as a string is essential here, it's not an arbitrary presentation. It defines what an entity ID is, and thus should be b bound to the base class. We are essentially talking about the equivalent of PHP's parse_url http://php.net/manual/de/function.parse-url.php or python's urlparse https://docs.python.org/2/library/urlparse.html. Not that python or PHP libraries are a good example of OO design in general. But I see no problem here.

Can you give an example under which circumstances this would be a problem? Most code violates some principle or other to some extend. We have to the benefit of fixing that violation against the coast of fixing that violation - and in this case, I see no practical benefit, and quite a bit of potential harm.

@thiemowmde
Copy link
Contributor

I'm going to close this for now because the code drafted here is not applicable any more. #767 currently does something very similar and might be worth a look.

@thiemowmde thiemowmde closed this Nov 9, 2017
@thiemowmde thiemowmde deleted the suchids branch November 9, 2017 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants