Don't use numeric IDs for internal PHP serialization. #716

brightbyte · 2017-02-08T18:48:00Z

This changes Snak serialization to use the full string representation of property IDs.

NOTE: serialize/unserialize is used mainly when cloning Statements.
NOTE: it's unclear if this completely fixes T157442
NOTE: this changes snak hashes, and consequently reference hashes! This may break reference update/removal!

Bug: T157442

brightbyte · 2017-02-08T18:48:39Z

pinging @manicki @thiemowmde

JeroenDeDauw · 2017-02-08T19:04:40Z

src/Entity/EntityIdValue.php

@@ -31,8 +31,8 @@ public function __construct( EntityId $entityId ) {
 	 */
 	public function serialize() {
 		return json_encode( [
-			$this->entityId->getEntityType(),
-			$this->getNumericId()
+			get_class( $this->entityId ),


Why this change? Kinda scary to do new $userInput()

Yes, it's not pretty. But all we use this for is cloning statements.

The change is here because the old code assumes two things: a) the ID can be represented as a number (it can't) and b) we have a well known set of entity types (we don't).

So, having the choice to try and access some kind of entity id type registry via a global, or putting the class name into the serialization, I opted for the latter.

My remark is not about prettiness, it is about security.

> But all we use this for is cloning statements.

That does not reassure me. Kinda like saying "no need to escape this SQL argument since right now only we call this function with non-user-input".

> So, having the choice to try and access some kind of entity id type registry via a global, or putting the class name into the serialization, I opted for the latter.

Why not just (de)serialize the ID?

We can't deserialize the ID without access to a service. How do we get the right service instance? This is particularly tricky for entity parsing, since we use different EntityIdParsers for data coming from different repos.

But I appreciate your concern for security. I'm not very happy about this solution. I just don't see an alternative, apart from not using serialize/unserialize for cloning statements.

> We can't deserialize the ID without access to a service.

Huh?

deserialize( $entityId );

I uploaded a draft utilizing this approach as #718.

For what it's worth: if we didn't want to use the PHP serialization for EntityIdValue that @thiemowmde is suggesting in #718 (I am not sure what the implications of that approach are, if this serialization is never stored), another option could, theoretically, be EntityIdComposer, which is meant as an entity-type-agnostic (kinda) replacement for LegacyIdInterpreter. The obvious problem is how to pull EntityIdComposer in here so that it makes sense (it is currently bound to the Wikibase git repo and the entity type definitions there). I cannot think of a good way to achieve this, so this comment is mostly noise.

Also theoretically, another option would be to use the value of EntityId::getSerialization in EntityIdValue::serialize, and an EntityIdParser in EntityIdValue::unserialize, but that seems so wrong.

@JeroenDeDauw Yes, I realized this morning that I misunderstood. deserialize( $this->entityId ) will work, sure.

thiemowmde · 2017-02-08T19:02:19Z

src/Entity/EntityIdValue.php

@@ -31,8 +31,8 @@ public function __construct( EntityId $entityId ) {
 	 */
 	public function serialize() {
 		return json_encode( [
-			$this->entityId->getEntityType(),
-			$this->getNumericId()
+			get_class( $this->entityId ),


This is something we are heavily trying to avoid, because this makes implementation details like the (entire!) namespace and the class name leak into the database. That's why we have entity type identifiers.

Oh, I definitely do not want this in the database, or in our external representation! As far as I am aware, this is not used in any way with our JSON encoding. We only use the PHP serialization temporarily, to do deep clones. PHP serialization is extremely brittle (sensitive to changes to private members, etc.), so it should never be used for persistence.

The alternative would be to change the way we do cloning, and avoid copying immutable objects when we clone. That would be nice, but I can't think of a reliable and easy way of doing this. serialize/unserialize is a very convenient method to do deep cloning safely. But it's somewhat wasteful, since it also copies immutable objects.
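
A minimal sketch of the cloning technique described above. The Id and Holder classes are hypothetical stand-ins, not the DataModel classes. A plain clone shares nested objects, while a serialize/unserialize round trip copies them all, including immutable ones that would not need copying.

```php
<?php
// Hypothetical stand-in classes, for illustration only.
final class Id {
	public $serialization;

	public function __construct( string $serialization ) {
		$this->serialization = $serialization;
	}
}

final class Holder {
	public $id;

	public function __construct( Id $id ) {
		$this->id = $id;
	}
}

$original = new Holder( new Id( 'P42' ) );

// Shallow copy: the nested Id instance is shared with the original.
$shallow = clone $original;

// Deep copy: the round trip rebuilds every nested object, immutable or not.
$deep = unserialize( serialize( $original ) );

$sharedAfterClone = $shallow->id === $original->id;  // true
$sharedAfterRoundTrip = $deep->id === $original->id; // false
$stillEqual = $deep == $original;                    // true
```

The last comparison uses loose equality on purpose: the deep copy holds the same data in different object instances.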

It's correct that these serialize formats are not used in our JSON encoding, but they are used to build summaries.

@thiemowmde you wrote:

> they are used to build summaries.

when/where? I can't find it.

thiemowmde · 2017-02-08T20:21:43Z

src/Entity/EntityIdValue.php


 		try {
-			$entityId = LegacyIdInterpreter::newIdFromTypeAndNumber( $entityType, $numericId );
+			if ( is_string( $id ) ) {


This needs at least a class_exists check and a fallback solution (or return null?).

If the class doesn't exist, this should fail hard. Ok, we could throw an exception instead of a fatal error.
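
A sketch of what such a guard might look like, also touching on the security concern raised earlier about instantiating a class name taken from serialized data. ExampleId, ExampleItemId and newExampleId are hypothetical names for illustration; none of this is the code from this pull request.

```php
<?php
// Hypothetical types for illustration; not the real EntityId hierarchy.
interface ExampleId {
	public function getSerialization(): string;
}

final class ExampleItemId implements ExampleId {
	private $serialization;

	public function __construct( string $serialization ) {
		$this->serialization = $serialization;
	}

	public function getSerialization(): string {
		return $this->serialization;
	}
}

function newExampleId( string $class, string $serialization ): ExampleId {
	// Refuse names that are unknown or do not implement the expected interface,
	// and fail with an exception rather than a fatal error.
	if ( !class_exists( $class ) || !is_subclass_of( $class, ExampleId::class ) ) {
		throw new InvalidArgumentException( "Not a known ID class: $class" );
	}

	return new $class( $serialization );
}
```

The interface check is the part that limits what can be instantiated; class_exists alone would still allow any loadable class.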

thiemowmde · 2017-02-08T20:23:15Z

src/Snak/SnakObject.php

+			return PropertyId::newFromNumber( $unserialized );
+		} elseif ( is_string( $unserialized ) ) {
+			return new PropertyId( $unserialized );
+		} elseif ( $unserialized instanceof PropertyId ) {


Which code path needs this?

Probably none atm, it's here for completeness. I don't have strong feelings about having it.

Still want me to remove this? This is currently blocking the pull request...

brightbyte · 2017-02-09T12:02:53Z

I talked to Thiemo about this. We decided to use his patch #718 for EntityIdValue and the EntityId classes, and my changes for SnakObject and PropertyValueSnak. I will remove the changes to EntityIdValue from this pull request to avoid conflicts. The two pull requests do not depend on each other, but are conceptually related, and should become part of the same release.

Note that this is a breaking change only if we consider the PHP serialization to be a stable external interface. I don't think we have an explicit statement about this anywhere.

brightbyte · 2017-02-09T13:20:30Z

NOTE: this changes snak hashes, and consequently reference hashes! This may break reference update/removal!
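
A small illustration of why the hashes change. It assumes that a hash is derived from the serialized form, as the note implies. The sha1 call and the simplified payloads are assumptions made for this sketch and are not taken from this pull request.

```php
<?php
// Assumption for this sketch: a hash is computed as sha1() over the serialized form.
$legacyPayload = serialize( 42 ); // 'i:42;' (numeric ID)
$newPayload = 'P42';              // full string ID

$legacyHash = sha1( $legacyPayload );
$newHash = sha1( $newPayload );

// Different input bytes give a different hash for the same property.
$hashesDiffer = $legacyHash !== $newHash; // true
```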

thiemowmde · 2017-02-09T14:02:00Z

src/Snak/SnakObject.php

@@ -102,7 +102,7 @@ public function equals( $target ) {
 	 * @return string
 	 */
 	public function serialize() {
-		return serialize( $this->propertyId->getNumericId() );
+		return serialize( $this->propertyId->getSerialization() );


This is a bit pointless. Please replace with a straight return $this->propertyId->getSerialization();.
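
To illustrate the point (P42 is just an example value): PHP-serializing a string only wraps it in a type and length prefix, so the bare ID string carries the same information.

```php
<?php
$legacy = serialize( 42 );     // 'i:42;'
$wrapped = serialize( 'P42' ); // 's:3:"P42";'
$bare = 'P42';                 // what a straight return of the ID string yields

// Unwrapping the serialized string gives back exactly the bare ID.
$roundTrip = unserialize( $wrapped ); // 'P42'
```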

brightbyte · 2017-02-09T16:04:08Z

NOTE: once this is merged, we should make a release.

thiemowmde · 2017-02-09T16:17:54Z

src/Snak/SnakObject.php

+	}
+
+	/**
+	 * @param string $serialized


thiemowmde · 2017-02-09T16:20:08Z

src/Snak/PropertyValueSnak.php

-		list( $numericId, $dataValue ) = unserialize( $serialized );
-		$this->__construct( $numericId, $dataValue );
+		list( $propertyId, $dataValue ) = unserialize( $serialized );
+		$this->__construct( self::newPropertyId( $propertyId ), $dataValue );


Mark the date in your calendar. ;-) This might be the first time I don't like the fact that this relies on a protected method from a base class. Can be much simpler:

$this->__construct( is_int( $propertyId ) ? $propertyId : new PropertyId( $propertyId ), $dataValue );

thiemowmde · 2017-02-09T16:28:51Z

src/Snak/SnakObject.php

+		}
+
+		$unserialized = unserialize( $serialized );
+		if ( is_int( $unserialized ) ) {


Sorry, but this code is weird. It does an is_int check twice? The reason is that this method is used for two very different use cases:

In Value snaks this can be called with either 1 or "P1".

In NoValue and SomeValue snaks this can be called with "i:42;" or "P1".

The problem is: there is no need to make these two use cases compatible with each other.
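
A sketch of how the two cases could be handled separately. The helper names are hypothetical, they return plain ID strings to stay self-contained, and the mapping of the legacy number 42 to "P42" is assumed for illustration. This is not the merged implementation.

```php
<?php
// Hypothetical helpers; they return plain ID strings to stay self-contained.

/**
 * Value snaks: the element taken from the unserialized array is
 * either a legacy integer (1) or an ID string ("P1").
 */
function idFromValueSnakElement( $element ): string {
	return is_int( $element ) ? 'P' . $element : $element;
}

/**
 * NoValue and SomeValue snaks: the raw payload is either a legacy
 * PHP-serialized integer ("i:42;") or a bare ID string ("P1").
 */
function idFromBareSnakPayload( string $payload ): string {
	if ( preg_match( '/^i:(\d+);$/', $payload, $matches ) ) {
		return 'P' . $matches[1];
	}

	return $payload;
}
```

Each helper needs only one type check, because it no longer has to accept the other caller's input format.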

brightbyte · 2017-02-09T16:46:44Z

Thanks @thiemowmde for refactoring the PropertyId instantiation! +1 for that.

thiemowmde

I squashed a few commits and added one that simplifies the code quite a lot. I did not touch the tests. +2 from my side. I will leave this here for others to see and merge this tomorrow.

JeroenDeDauw · 2017-02-09T18:36:04Z

+1

manicki

Looks reasonable to me. I think I've spotted one cosmetic thing. I'll fix it and then merge this.
Please also note my comment on the @since 7.0 tag. @thiemowmde it was you who changed it, right? I think it is OK to change this tag back before the release if needed; I am not willing to block merging this pull request on this discussion.

manicki · 2017-02-10T08:20:37Z

tests/unit/Snak/PropertyValueSnakTest.php

+		return [
+			'legacy' => [
+				new PropertyValueSnak( $p2, $value ),
+				'a:2:{i:0;i:2;i:1;C:22:"DataValues\StringValue":1:{b}}'


It does not affect the test result, but I believe that, to be 100% accurate, "i:2" here should become "i:1". Minor thing; I am going to fix this and squash it into the other commits.

Please ignore this comment. I looked at wrong spot in the old code. All good here.

manicki · 2017-02-10T08:23:11Z

src/Snak/SnakObject.php

@@ -97,12 +97,12 @@ public function equals( $target ) {
 	/**
 	 * @see Serializable::serialize
 	 *
-	 * @since 0.1
+	 * @since 7.0


I wonder whether it is correct to change this number here. The change is indeed a breaking change, but the interface of the methods does not change. So I guess the version number change is meant to show that the serialization format changed in release 7.0. I am fine with that; I just wonder if this is "how it should be done" (tm).

I believe the returned value is a crucial part of the contract of a method. If the format of the return value changes, it's a different method. The idea of the since tag is to make this obvious.

We have some places where we bumped the number but then mention that the method was already there in some other form, or something along those lines.

Yea, something like this seems reasonable.

JeroenDeDauw reviewed Feb 8, 2017

thiemowmde suggested changes Feb 8, 2017

thiemowmde mentioned this pull request Feb 9, 2017

Rework native EntityId(Value) serializations #718

Merged

brightbyte force-pushed the FixIdSerialization branch from db964fb to 2d0fff1 Compare February 9, 2017 13:19

thiemowmde suggested changes Feb 9, 2017

daniel and others added 4 commits February 9, 2017 17:30

Don't use numeric IDs for internal PHP serialization of Snaks.

0c7c9ce

NOTE: it's unclear if this completely fixes T157442 Bug: T157442

Updated snak tests for new serialization

9d16272

Don't php-serialize ID string.

01d66ca

Split and simplify unserialize implementations

ae3f139

thiemowmde force-pushed the FixIdSerialization branch from 8cede90 to ae3f139 Compare February 9, 2017 16:44

thiemowmde approved these changes Feb 9, 2017

manicki reviewed Feb 10, 2017

manicki merged commit 68667eb into master Feb 10, 2017

manicki deleted the FixIdSerialization branch February 10, 2017 08:32

thiemowmde added this to the 7.0.0 milestone Feb 13, 2017
