No Performance Gain over Base Avro #6

zcargnop · 2018-09-03T06:56:46Z

I am using Avro to parse messages generated by another vendor (I single fixed schema) and found that the parsing performance is not great. I was looking for ways to improve performance when I came across avro-fastserde library.

My initial attempt at using the library was not successful, I was able to parse messages but the performance was identical to the base Avro implementation. I was hoping you might be able to provide some additional insights into what I might be doing wrong as the documentation does not provide a complete working example.

In my case I have used the Avro schema provided by the vendor to generate a SpecificDatumParser and all supporting classes. The code used to parse the messages looks something like this...

final SpecificDatumReader<HfReadData> avroHfReader = new SpecificDatumReader<HfReadData>(HfReadData.SCHEMA$);

DataFileReader<HfReadData> dataFileReader = new DataFileReader<HfReadData>(byteStream, avroHfReader);

while (dataFileReader.hasNext()) {
    HfReadData hfRead = hfReadQueue.take();
    hfRead = dataFileReader.next(hfRead);
    parseAvroMessage(hfRead);
}

where the parseAvroMessage() method parses the method into various objects before passing them on to the application for processing. Note: the JSON schema is moderately simple consisting of a single record with an array of 1..n sub-records. The parser method combines sets of sub-records into single objects. This method consumes minimal cpu as verified using Java Mission Control to take Flight Recordings.

Here is the FastSerde implementation...

final FastSerdeCache serdeCache = new FastSerdeCache("./build/classes/main/com/serde");
final FastSpecificDatumReader<HfReadData> fastSpecificDatumReader = new FastSpecificDatumReader<HfReadData>(HfReadData.SCHEMA$, HfReadData.SCHEMA$, serdeCache);

DataFileReader<HfReadData> dataFileReader = new DataFileReader<HfReadData>(byteStream, fastSpecificDatumReader);

while (dataFileReader.hasNext()) {
    HfReadData hfRead = hfReadQueue.take();
    hfRead = dataFileReader.next(hfRead);
    parseAvroMessage(hfRead);
}

As noted above I get the exact same throughput regardless of whether I use FastSerde or base Avro. On my test setup it takes about 5min to process 100,000 messages and flight recordings from each test run show slight differences but otherwise are more or less the same.

You note that the FastSpecificDatumReader will fall back to the SpecificDatumReader if the specific classes are not available. I am feeling this is most likely happening in my case (hence the identical performance). I feel like I have done something wrong but not sure what that is.

The text was updated successfully, but these errors were encountered:

dervan · 2018-09-17T12:45:00Z

We're glad that you found our package as a way to improve the performance of your code. We hope it will eventually work in your scenario too.

Unfortunately, I see some unclarities in your example implementation, so it's hard to investigate this case without better insight into your real program/schemas (as I don't know what is an error in your code and what is just an error in code pasting). Similar routine (DataFileReader, FastSpecificDatumReader, 1M LargeAndFlat records) have 2x boost in my tests - so indeed it is very likely that your code is falling back to the default implementation.

Could you carefully verify that you don't have any error logs (especially at the start of processing) which could hint us what does not work?

zcargnop · 2018-09-18T02:26:14Z

I've updated the typos in the original post but thought it best to include the complete class for processing byte messages off a jms queue;

private class jmsListener implements MessageListener, ExceptionListener {

        final FastSerdeCache serdeCache = new FastSerdeCache("./build/classes/main/com/schema/avro/hf_data");
        final FastSpecificDatumReader<HfReadData> fastSpecificDatumReader = new FastSpecificDatumReader<HfReadData>(HfReadData.SCHEMA$, HfReadData.SCHEMA$, serdeCache);

        DataFileReader<HfReadData> dataFileReader;

        @Override
        public void onMessage(Message message) {
            BytesMessage bytesMessage = (BytesMessage) message;

            try {
                byte[] byteArray = new byte[(int) bytesMessage.getBodyLength()];
                bytesMessage.readBytes(byteArray);
                SeekableByteArrayInput byteStream = new SeekableByteArrayInput(byteArray);
                dataFileReader = new DataFileReader<HfReadData>(byteStream, fastSpecificDatumReader);
                while (dataFileReader.hasNext()) {
                    HfReadData hfRead = hfReadQueue.take();
                    hfRead = dataFileReader.next(hfRead);
                    parseAvroMessage(hfRead);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
}

The only difference between using fastserde and pure avro is replacing the Avro SpecificDatumReader

final SpecificDatumReader<HfReadData> avroHfReader = new SpecificDatumReader<HfReadData>(HfReadData.SCHEMA$);

with the fast serde FastSpecificDatumReader

final FastSpecificDatumReader<HfReadData> fastSpecificDatumReader = new FastSpecificDatumReader<HfReadData>(HfReadData.SCHEMA$, HfReadData.SCHEMA$, serdeCache);

and passing the instance of the Avro SpecificDatumReader created above

dataFileReader = new DataFileReader<HfReadData>(byteStream, avroHfReader);

with the instance of the fastserde FastSpecificDatumReader (also created above)

dataFileReader = new DataFileReader<HfReadData>(byteStream, fastSpecificDatumReader);

The HfReadData class referenced in the code was generated by Avro using the schema.

I can see no errors in my logs during startup nor do I see any exceptions at runtime.

dervan · 2018-09-18T12:40:01Z

Do you have to reuse HfReadData object? For now, we don't support object reuse (as stated in "Limitations" paragraph in the readme), so I think that may be some issue. You can verify, that plain "dataFileReader.next()" loop will work faster indeed.

FelixGV · 2019-04-18T05:23:59Z

Hi,

We’ve had great success with avro-fastserde in our own project. This is a truly fantastic piece of software. Thank you for that!

We have decided to invest in it and add some additional performance enhancements. Among those is support for object re-use. Our fork of this project is available here, if you’d like to take a look:

https://github.com/linkedin/avro-util

-F

flowenol · 2019-05-06T09:19:36Z

Hi,

Glad to hear that you find this library useful! I briefly looked over your branch and the enhancements you made are really worth incorporating into this repository. Thank you for adding the object reuse. We haven't provided it to this day, because we didn't really need that feature and I wasn't sure if it won't spoil the overall performance gain. Would you mind if I asked you to provide the pull-request?

FelixGV · 2019-05-09T23:18:00Z

Sure, we can look into providing a PR. We have shied away from doing so so far because we were not sure whether our changes were useful and appropriate for you. In particular:

We have a mildly-annoying requirement to support many versions of Avro, because we need to support a mix of modern use cases written against Avro 1.7 and 1.8 as well as decade-old applications written against Avro 1.4. Unfortunately, Avro has poor API compatibility practices, which makes this tricky. Therefore, most of our infra projects (and fast-avro falls into this category) need to code against our avro-migration-helper shim in order to be compatible with many versions of Avro. This indirection adds a bit of complexity to the code, and is not really adding any value for users who have the luxury to depend on modern versions of Avro exclusively.
We have removed the locally-cached code-generated classes in our fork, because we considered that the code-gen time seemed acceptable as is and, more importantly, because we wanted to maintain the ability to iterate over the code-generation such that we had guarantees that deploying a new version of the lib would not result in interference by previous runs of the code-gen. This is not an insurmountable problem, so if the cache is really desired, we could address the freshness guarantees concern by versioning the code-generation and including the version as part of the cached classes' name, but we have not bothered doing that so far.
Since code generation is quite complex in general, we have made some refactoring which we believe improves readability, but which may not merge super cleanly. This is not necessarily a big deal, as long as you agree that the refactoring we did does improve readability and not the other way around (: ...

cc @gaojieliu

LMK what you think.

-F

flowenol · 2019-05-14T11:25:40Z

Thank you for the detailed answer and willingness to provide the PR. However upon further consideration, I decided to provide the pull request by myself, basing partially on your changes.
This will cause less hassle for you and for us as well.

To be more specific, I will surely add the reuse parameter support but I won't provide the support for legacy avro versions, because we surely don't need this and I think that majority of potential users don't need it as well. I will also consider adding the option to switch generated classes caching on/off.
And I also admit that the readability is rather poor as-is and has to be rethinked and improved.

From our perspective this is a side project, so it may take a while for us to introduce these changes, but nevertheless they will appear soon.

Anyways, thank you for sharing your fork and investing your time into this project!

FelixGV · 2019-05-29T00:06:20Z

Oh, I just saw this by chance. Not sure why I didn't get the notification for it... anyway.

BTW, we have a fix for the object re-use which is not yet merged in the other repo, here: linkedin/avro-util#10

That code (including the PR) runs in production and we have validated that object re-use works fully as expected.

-F

FelixGV mentioned this issue May 29, 2020

Two failure cases of FastDeserializerGenerator #16

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No Performance Gain over Base Avro #6

No Performance Gain over Base Avro #6

zcargnop commented Sep 3, 2018 •

edited

Loading

dervan commented Sep 17, 2018

zcargnop commented Sep 18, 2018

dervan commented Sep 18, 2018

FelixGV commented Apr 18, 2019

flowenol commented May 6, 2019

FelixGV commented May 9, 2019

flowenol commented May 14, 2019

FelixGV commented May 29, 2019

No Performance Gain over Base Avro #6

No Performance Gain over Base Avro #6

Comments

zcargnop commented Sep 3, 2018 • edited Loading

dervan commented Sep 17, 2018

zcargnop commented Sep 18, 2018

dervan commented Sep 18, 2018

FelixGV commented Apr 18, 2019

flowenol commented May 6, 2019

FelixGV commented May 9, 2019

flowenol commented May 14, 2019

FelixGV commented May 29, 2019

zcargnop commented Sep 3, 2018 •

edited

Loading