Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't use numeric IDs for internal PHP serialization. #716

Merged
merged 4 commits into from Feb 10, 2017

Conversation

brightbyte
Copy link

@brightbyte brightbyte commented Feb 8, 2017

This changes Snak serialization to use the full string representation of property IDs.

NOTE: serialize/unserialize is used mainly when cloning Statements.
NOTE: it's unclear if this completely fixes T157442
NOTE: this changes snak hashes, and consequently reference hashes! This may break reference update/removal!

Bug: T157442

@brightbyte
Copy link
Author

pinging @manicki @thiemowmde

@@ -31,8 +31,8 @@ public function __construct( EntityId $entityId ) {
*/
public function serialize() {
return json_encode( [
$this->entityId->getEntityType(),
$this->getNumericId()
get_class( $this->entityId ),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? Kinda scary to do new $userInput()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not pretty. But all we use this for is cloning statements.

The change is here because the old code assumes two things: a) the ID can be represented as a number (it can't) and b) we have a well known set of entity types (we don't).

So, having the choice to try and access some kind of entity id type registry via a global, or putting the class name into the serialization, I opted for the latter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My remark is not about prettiness, it is about security.

But all we use this for is cloning statements.

That does no reassure me. Kinda like saying "no need to escape this SQL argument since right now only we call this function with non-user-input".

So, having the choice to try and access some kind of entity id type registry via a global, or putting the class name into the serialization, I opted for the latter.

Why not just (de)serialize the ID?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't deserialize the ID without access to a service. How do we get the right service instance? This is particularly tricky for entity parsing, since we use different EntityIdParsers for data coming from different repos.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I appreciate your concern for security. I'm not very happy about this solution. I just don't see an alternative, apart from not using serialize/unserialize for cloning statements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't deserialize the ID without access to a service.

Huh?

deserialize( $entityId );

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I uploaded a draft utilizing this approach as #718.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what is worth: if we didn't want to use PHP serialization for EntityIdValue that @thiemowmde is suggesting in #718 (I am not sure what are implications with that approach, if this serialization is never stored) another option (theoretically) could be using EntityIdComposer which is meant as a entity-type-agnostic (kinda) replacement for LegacyIdInterpreter. The obvious problem is how to pull EntityIdComposer in here, so it makes sense (it is now bound to Wikibase git repo and entity type definitions out there). I cannot think of good way to achieve this, so this comment is mostly noise.

Also theoretically, another option would be to use value of EntityId::getSerialization in EntityIdValue::serialize, and use EntityIdParser in EntityIdValue::unserialize but that seems so wrong.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JeroenDeDauw Yes, I realized this morning that I misunderstood. deserialize( $this->entityId ) will work, sure.

@@ -31,8 +31,8 @@ public function __construct( EntityId $entityId ) {
*/
public function serialize() {
return json_encode( [
$this->entityId->getEntityType(),
$this->getNumericId()
get_class( $this->entityId ),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we are heavily trying to avoid, because this makes implementation details like the (entire!) namespace and the class name leak into the database. That's why we have entity type identifiers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I definitely do not want this in the database, or in our external representation! As far as I am aware, this is not used in any way with our JSON encoding. We only use the php-serialization temporarily, to do deep clones. PHP-Serialization is extremely brittle (sensitive against changes to private members, etc), so it should never be used for persistence.

The alternative would be to change the way we do cloning, and avoid copying immutable objects when we clone. That would be nice, but I can't think of a reliable and easy way of doing this. serialize/unserialize is a very convenient method to do deep cloning safely. But it's somewhat wasteful, since it also copies immutable objects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's correct that these serialize formats are not used in our JSON encoding, but they are used to build summaries.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thiemowmde you wrote:

they are used to build summaries.

when/where? I can't find it.


try {
$entityId = LegacyIdInterpreter::newIdFromTypeAndNumber( $entityType, $numericId );
if ( is_string( $id ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs at least a class_exists check and a fallback solution (or return null?).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the class doesn't exist, this should fail hard. Ok, we could throw an exception instead of a fatal error.

return PropertyId::newFromNumber( $unserialized );
} elseif ( is_string( $unserialized ) ) {
return new PropertyId( $unserialized );
} elseif ( $unserialized instanceof PropertyId ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which code path needs this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably none atm, it's here for completeness. I don't have strong feelings about having it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still want me to remove this? This is currently blocking the pull request...

@brightbyte
Copy link
Author

I talked to Thiemo about this. We decided to use his patch #718 for EntityIdValue and the EntityId classes, and my changes for SnakObject and PropertyValueSnak. I will remove the changes to EntityIdValue from this pull request to avoid conflicts. The two pull requests do not depend on each other, but are conceptually related, and should become part of the same release.

Note that this is a breaking change only if we consider the php serialization to be a stable external interface. I don't think we have an explicite statement about this anywhere.

@brightbyte
Copy link
Author

NOTE: this changes snak hashes, and consequently reference hashes! This may break reference update/removal!

@@ -102,7 +102,7 @@ public function equals( $target ) {
* @return string
*/
public function serialize() {
return serialize( $this->propertyId->getNumericId() );
return serialize( $this->propertyId->getSerialization() );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit pointless. Please replace with a straight return $this->propertyId->getSerialization();.

@brightbyte
Copy link
Author

NOTE: once this is merged, we should make a release.

}

/**
* @param string $serialized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string|int

list( $numericId, $dataValue ) = unserialize( $serialized );
$this->__construct( $numericId, $dataValue );
list( $propertyId, $dataValue ) = unserialize( $serialized );
$this->__construct( self::newPropertyId( $propertyId ), $dataValue );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark the date in your calendar. ;-) This might be the first time I don't like the fact that this relies on a protected method from a base class. Can be much simpler:

$this->__construct( is_int( $propertyId ) ? $propertyId : new PropertyId( $propertyId ), $dataValue );

}

$unserialized = unserialize( $serialized );
if ( is_int( $unserialized ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but this code is weird. It does an is_int check twice? The reason is that this method is used for two very different use cases:

  • In Value snaks this can be called with either 1 or "P1".
  • In NoValue and SomeValue snaks this can be called with "i:42;" or "P1".

The problem is: there is no need to make these two use cases compatible with each other.

@brightbyte
Copy link
Author

Thanks @thiemowmde for refactoring the PropertyId instantiation! +1 for that.

Copy link
Contributor

@thiemowmde thiemowmde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I squashed a few commits and added one that simplifies the code quite a lot. I did not touched the tests. +2 from my side. I will leave this here for others to see and merge this tomorrow.

@JeroenDeDauw
Copy link
Contributor

+1

Copy link
Member

@manicki manicki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me. I think I've spotted one cosmetic thing. I'll fix it and then merge this.
Please also note my comment on @SInCE 7.0 tag. @thiemowmde it was you who changed, right? I think it is OK to possibly change this tag back if needed before the release, I am not willing to block merging this pull request on this discussion.

return [
'legacy' => [
new PropertyValueSnak( $p2, $value ),
'a:2:{i:0;i:2;i:1;C:22:"DataValues\StringValue":1:{b}}'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not affect test result but I believe to be 100% accurate "i:2" here should become "i:1". Minor thing, I am going to fix this, and squash into other commits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ignore this comment. I looked at wrong spot in the old code. All good here.

@@ -97,12 +97,12 @@ public function equals( $target ) {
/**
* @see Serializable::serialize
*
* @since 0.1
* @since 7.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder is this correct to change this number here. Change here is indeed breaking change but the interface of the methods does not change. So I guess version number change is to show that serialization format has changed in the release 7.0. I am find with that I just wonder if this is "how it should be done" (tm).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the returned value is a crucial part of the contract of a method. If the format of the return value changes, it's a different method. The idea of the since tag is to make this obvious.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some places where we bumped the number but then mention the method was already there in some other form or something along those lines

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, something like this seems reaonable.

@manicki manicki merged commit 68667eb into master Feb 10, 2017
@manicki manicki deleted the FixIdSerialization branch February 10, 2017 08:32
@thiemowmde thiemowmde added this to the 7.0.0 milestone Feb 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants