Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future-proof serialization #87

Open
tdunning opened this issue Apr 20, 2017 · 24 comments
Open

Future-proof serialization #87

tdunning opened this issue Apr 20, 2017 · 24 comments

Comments

@tdunning
Copy link
Owner

Currently, we depend on Java's serialization which is not a good strategy.

It would be much better to do our own and make all the different kinds of Digest use interchangeable formats.

@Nikko78
Copy link

Nikko78 commented Jun 25, 2017

You can overload Java serialization process to save what you want, for example in my example, methods getBytes and fromBytes already implements in your Digest classes.

In this example, fields of object are ignored and only declared "digestEncodeWithBytes" field is declared :

public abstract class TDigest implements Serializable {

    // For serialisation, overload ObjectInput/OutputStream reflection
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("digestEncodeWithBytes", byte[].class)
    };

    // Or use getBytes(ByteBuffer) instead
    public abstract byte[] getBytes();

    // Serialization create the object which implements this abstract class, so call a method to init it
    protected abstract void initWithBytes(byte[] b);

    /**
     * Overload {@link ObjectOutputStream#defaultWriteObject()}
     *
     * @param oos object output stream use for write
     * @throws IOException error during write process
     */
    private void writeObject(ObjectOutputStream oos) throws IOException {
        oos.putFields().put("digestEncodeWithBytes", getBytes());
        oos.writeFields();
    }

    /**
     * Overload {@link ObjectInputStream#defaultReadObject()}
     *
     * @param ois object input stream use for read
     * @throws IOException error during read process
     * @throws ClassNotFoundException class use in serialised stream unknown
     */
    private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        Object o = ois.readFields().get("digestEncodeWithBytes", null);
        if (o != null) {
            byte[] b = (byte[]) o;
            initWithBytes(b);
        }
    }
}

You can also overwrite deserialisation with : link

class HackedObjectInputStream extends ObjectInputStream {

    public HackedObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
        ObjectStreamClass resultClassDescriptor = super.readClassDescriptor();

        if (resultClassDescriptor.getName().equals("oldpackage.Clazz"))
            resultClassDescriptor = ObjectStreamClass.lookup(newpackage.Clazz.class);

        return resultClassDescriptor;
    }
}

@anvinjain
Copy link

anvinjain commented Jun 28, 2017

Isn't it possible to add a Transcoder interface which has byte[] asBytes(SerializationModel) and SerializationModel fromBytes(byte[]) methods? SerializationModel here being a java object that is good enough to represent any digest. TDigest interface can now have a create method that takes SerializationModel as input which all digests implement. Basically taking out byte-level concerns out of the digest impls.

You can provide a default transcoder which does serialization similar to how it's done today (with bugfixes), while keeping the flexibility of accepting PRs from others that implement transcoders for other formats. If this line of thought makes sense to you and more importantly if we can come up with a common model to represent all digests, I will be happy to contribute a PR!

A little background: we have a project that does jvm profiling for clusters and all our components talk Protobuf. We deal with latencies, scheduler delays and other metrics and streaming quantiles make a lot of sense. Hence the interest in adopting tdigest but it does not have protobuf support today nor the design is friendly enough to submit a PR for the same

@tdunning
Copy link
Owner Author

tdunning commented Jun 28, 2017 via email

@anvinjain
Copy link

No. Basically, I wanted to enable bring your own serialization logic with some sane ones provided by the lib. By SerializationModel, I am implying one java class something on the lines of:

class SerializationModel {
    int encodingType; double min; double max; float compression; // ... and other members
}

class MergingDigest {
    // ...
    static MergingDigest fromSerializationModel(SerializationModel);
}

interface Transcoder {
    byte[] asBytes(SerializationModel);
    SerializationModel fromBytes(byte[]);
}

class ProtobufTranscoder implements Transcoder {
    // ...
}

This keeps the byte magic out of digest implementations and they are responsible for generating a uniform on-heap representation (which is SerializationModel, I know this name sucks)
Users can instantiate any out of the box transcoder or write their own. A way to maintain backward compatibility would be for MergingDigest, etc to optionally take a transcoder as constructor param and use that (or default one if not provided) to keep existing asBytes, fromBytes method working.

If you have any other ideas on how tdigest can support multiple serialization formats (and hopefully you consider that as a feature to be added in the first place), would love to hear about them.

@tdunning
Copy link
Owner Author

tdunning commented Jun 28, 2017 via email

@anvinjain
Copy link

Or DigestModel? I see this as a dto representing digests.
Sure, I have some free time next week so expect a PR then.

@anvinjain
Copy link

anvinjain commented Jun 29, 2017

You mention interchangeability of digest implementations. Presently, MergingDigest and AvlTreeDigest serialize different values so their serialized representation is not equivalent. Will need your inputs on structure of DigestModel and what fields of MergingDigest/AvlTreeDigest map to DigestModel fields.
The straightforward ones are min, max, compression, mean[]. But rest differ across impls.

@tdunning
Copy link
Owner Author

tdunning commented Jun 29, 2017 via email

@anvinjain
Copy link

@tdunning Need your inputs on what DigestModel looks like. As mentioned in previous comment, presently different fields are serialized across digest implementations

@tdunning
Copy link
Owner Author

tdunning commented Jul 5, 2017 via email

@anvinjain
Copy link

anvinjain commented Jul 8, 2017

Apart from bufferSize, MergingDigest serializes lastUsedCell. For verbose encoding, this is deserialized and set as a property in digest. For small encoding, lastUsedCell additionally represents the values to read and assign in mean/weight array. Rest of the array seems to be zero-filled since n(array size) is part of serde as well.

Also, oddly MergingDigest#asBytes serializes tempMean.length but MergingDigest#fromBytes (verbose) does not expect it to be present in buffer. Is it a known issue or am I missing something there?

@tdunning
Copy link
Owner Author

tdunning commented Jul 8, 2017 via email

@tdunning
Copy link
Owner Author

tdunning commented Aug 1, 2017

@anvinjain Do you have a PR coming?

We need to cut the new major release. If you can't get this in soon, it will have to wait.

@anvinjain
Copy link

anvinjain commented Aug 2, 2017 via email

@tdunning
Copy link
Owner Author

tdunning commented Aug 2, 2017 via email

@tdunning
Copy link
Owner Author

tdunning commented Aug 6, 2017

This is being pushed beyond the 3.2 release.

@anvinjain
Copy link

@tdunning I have raised a PR around this.

@tdunning
Copy link
Owner Author

tdunning commented Jan 1, 2018 via email

@anvinjain
Copy link

@tdunning Did you get a chance to look at the PR?

@tdunning
Copy link
Owner Author

tdunning commented Feb 14, 2018 via email

@anvinjain
Copy link

@tdunning Ping. Is there a timeline for next release? If you don't see this PR #99 making the cut without major changes, let me know.

@scheler
Copy link

scheler commented Jan 14, 2021

@tdunning looks like the PR was never merged. This is of interest to us too, so if there are any review comments we can take a look at addressing them too. Thanks in advance.

@tdunning
Copy link
Owner Author

tdunning commented Jan 14, 2021 via email

@amalic
Copy link

amalic commented Mar 14, 2021

Here's my proposal: #162

  • Serializations can be configured via serialization.properties file.
  • The serialization gets instatniated with a java.util.Properties object containing all properties form the serialization.properties file
  • It can load custom implementations via classForName as long as they have been added to the classpath and configured in the serialization.properties file. It's up to the implementation what to do with those properties.

Example for serialization.properties file

tdigest.serializerClass=org.example.CustomSerializer
someProperty=someValue
someOtherProperty=someOtherValue

This is the current implementation of TDigestDefaultSerializer

public class TDigestDefaultSerializer extends AbstractTDigestSerializer<TDigest, byte[]> {
    
    public TDigestDefaultSerializer(Properties properties) {
        super(properties);
    }

    @Override
    public byte[] serialize(TDigest tDigest) throws TDigestSerializerException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(5120);
        try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
            out.writeObject(tDigest);
            return baos.toByteArray();
        } catch (IOException e) {
            throw new TDigestSerializerException(e);
        } 
    }

    @Override
    public TDigest deserialize(byte[] object) throws TDigestSerializerException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(object))) {
            return (TDigest) in.readObject();
        } catch (ClassCastException | ClassNotFoundException | IOException e) {
            throw new TDigestSerializerException(e);
        }
    }

}

This is the code for AbstractTDigestSerializer

public abstract class AbstractTDigestSerializer<T extends TDigest, O extends Object> implements TDigestSerializer<T, O> {
    private Properties properties;
    
    public AbstractTDigestSerializer(Properties properties) {
        this.properties = properties;
    }    
    
    public Properties getProperties() {
        return properties;
    }

}

Hope this is being seen as future proof.

There's still some work to do:

  • code formatting (I use 🌑Eclipse) and review (especially how I do generics 😅)
  • proper JavaDoc documentation in source code
  • additional tests for custom serialization implementations and property file handling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants