Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No Performance Gain over Base Avro #6

Open
zcargnop opened this issue Sep 3, 2018 · 8 comments
Open

No Performance Gain over Base Avro #6

zcargnop opened this issue Sep 3, 2018 · 8 comments

Comments

@zcargnop
Copy link

zcargnop commented Sep 3, 2018

I am using Avro to parse messages generated by another vendor (I single fixed schema) and found that the parsing performance is not great. I was looking for ways to improve performance when I came across avro-fastserde library.

My initial attempt at using the library was not successful, I was able to parse messages but the performance was identical to the base Avro implementation. I was hoping you might be able to provide some additional insights into what I might be doing wrong as the documentation does not provide a complete working example.

In my case I have used the Avro schema provided by the vendor to generate a SpecificDatumParser and all supporting classes. The code used to parse the messages looks something like this...

final SpecificDatumReader<HfReadData> avroHfReader = new SpecificDatumReader<HfReadData>(HfReadData.SCHEMA$);

DataFileReader<HfReadData> dataFileReader = new DataFileReader<HfReadData>(byteStream, avroHfReader);

while (dataFileReader.hasNext()) {
    HfReadData hfRead = hfReadQueue.take();
    hfRead = dataFileReader.next(hfRead);
    parseAvroMessage(hfRead);
}

where the parseAvroMessage() method parses the method into various objects before passing them on to the application for processing. Note: the JSON schema is moderately simple consisting of a single record with an array of 1..n sub-records. The parser method combines sets of sub-records into single objects. This method consumes minimal cpu as verified using Java Mission Control to take Flight Recordings.

Here is the FastSerde implementation...

final FastSerdeCache serdeCache = new FastSerdeCache("./build/classes/main/com/serde");
final FastSpecificDatumReader<HfReadData> fastSpecificDatumReader = new FastSpecificDatumReader<HfReadData>(HfReadData.SCHEMA$, HfReadData.SCHEMA$, serdeCache);

DataFileReader<HfReadData> dataFileReader = new DataFileReader<HfReadData>(byteStream, fastSpecificDatumReader);

while (dataFileReader.hasNext()) {
    HfReadData hfRead = hfReadQueue.take();
    hfRead = dataFileReader.next(hfRead);
    parseAvroMessage(hfRead);
}

As noted above I get the exact same throughput regardless of whether I use FastSerde or base Avro. On my test setup it takes about 5min to process 100,000 messages and flight recordings from each test run show slight differences but otherwise are more or less the same.

You note that the FastSpecificDatumReader will fall back to the SpecificDatumReader if the specific classes are not available. I am feeling this is most likely happening in my case (hence the identical performance). I feel like I have done something wrong but not sure what that is.

@dervan
Copy link
Contributor

dervan commented Sep 17, 2018

We're glad that you found our package as a way to improve the performance of your code. We hope it will eventually work in your scenario too.

Unfortunately, I see some unclarities in your example implementation, so it's hard to investigate this case without better insight into your real program/schemas (as I don't know what is an error in your code and what is just an error in code pasting). Similar routine (DataFileReader, FastSpecificDatumReader, 1M LargeAndFlat records) have 2x boost in my tests - so indeed it is very likely that your code is falling back to the default implementation.

Could you carefully verify that you don't have any error logs (especially at the start of processing) which could hint us what does not work?

@zcargnop
Copy link
Author

I've updated the typos in the original post but thought it best to include the complete class for processing byte messages off a jms queue;

private class jmsListener implements MessageListener, ExceptionListener {

        final FastSerdeCache serdeCache = new FastSerdeCache("./build/classes/main/com/schema/avro/hf_data");
        final FastSpecificDatumReader<HfReadData> fastSpecificDatumReader = new FastSpecificDatumReader<HfReadData>(HfReadData.SCHEMA$, HfReadData.SCHEMA$, serdeCache);

        DataFileReader<HfReadData> dataFileReader;

        @Override
        public void onMessage(Message message) {
            BytesMessage bytesMessage = (BytesMessage) message;

            try {
                byte[] byteArray = new byte[(int) bytesMessage.getBodyLength()];
                bytesMessage.readBytes(byteArray);
                SeekableByteArrayInput byteStream = new SeekableByteArrayInput(byteArray);
                dataFileReader = new DataFileReader<HfReadData>(byteStream, fastSpecificDatumReader);
                while (dataFileReader.hasNext()) {
                    HfReadData hfRead = hfReadQueue.take();
                    hfRead = dataFileReader.next(hfRead);
                    parseAvroMessage(hfRead);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
}

The only difference between using fastserde and pure avro is replacing the Avro SpecificDatumReader

final SpecificDatumReader<HfReadData> avroHfReader = new SpecificDatumReader<HfReadData>(HfReadData.SCHEMA$);

with the fast serde FastSpecificDatumReader

final FastSpecificDatumReader<HfReadData> fastSpecificDatumReader = new FastSpecificDatumReader<HfReadData>(HfReadData.SCHEMA$, HfReadData.SCHEMA$, serdeCache);

and passing the instance of the Avro SpecificDatumReader created above

dataFileReader = new DataFileReader<HfReadData>(byteStream, avroHfReader);

with the instance of the fastserde FastSpecificDatumReader (also created above)

dataFileReader = new DataFileReader<HfReadData>(byteStream, fastSpecificDatumReader);

The HfReadData class referenced in the code was generated by Avro using the schema.

I can see no errors in my logs during startup nor do I see any exceptions at runtime.

@dervan
Copy link
Contributor

dervan commented Sep 18, 2018

Do you have to reuse HfReadData object? For now, we don't support object reuse (as stated in "Limitations" paragraph in the readme), so I think that may be some issue. You can verify, that plain "dataFileReader.next()" loop will work faster indeed.

@FelixGV
Copy link

FelixGV commented Apr 18, 2019

Hi,

We’ve had great success with avro-fastserde in our own project. This is a truly fantastic piece of software. Thank you for that!

We have decided to invest in it and add some additional performance enhancements. Among those is support for object re-use. Our fork of this project is available here, if you’d like to take a look:

https://github.com/linkedin/avro-util

-F

@flowenol
Copy link
Contributor

flowenol commented May 6, 2019

Hi,

Glad to hear that you find this library useful! I briefly looked over your branch and the enhancements you made are really worth incorporating into this repository. Thank you for adding the object reuse. We haven't provided it to this day, because we didn't really need that feature and I wasn't sure if it won't spoil the overall performance gain. Would you mind if I asked you to provide the pull-request?

@FelixGV
Copy link

FelixGV commented May 9, 2019

Sure, we can look into providing a PR. We have shied away from doing so so far because we were not sure whether our changes were useful and appropriate for you. In particular:

  • We have a mildly-annoying requirement to support many versions of Avro, because we need to support a mix of modern use cases written against Avro 1.7 and 1.8 as well as decade-old applications written against Avro 1.4. Unfortunately, Avro has poor API compatibility practices, which makes this tricky. Therefore, most of our infra projects (and fast-avro falls into this category) need to code against our avro-migration-helper shim in order to be compatible with many versions of Avro. This indirection adds a bit of complexity to the code, and is not really adding any value for users who have the luxury to depend on modern versions of Avro exclusively.

  • We have removed the locally-cached code-generated classes in our fork, because we considered that the code-gen time seemed acceptable as is and, more importantly, because we wanted to maintain the ability to iterate over the code-generation such that we had guarantees that deploying a new version of the lib would not result in interference by previous runs of the code-gen. This is not an insurmountable problem, so if the cache is really desired, we could address the freshness guarantees concern by versioning the code-generation and including the version as part of the cached classes' name, but we have not bothered doing that so far.

  • Since code generation is quite complex in general, we have made some refactoring which we believe improves readability, but which may not merge super cleanly. This is not necessarily a big deal, as long as you agree that the refactoring we did does improve readability and not the other way around (: ...

cc @gaojieliu

LMK what you think.

-F

@flowenol
Copy link
Contributor

Thank you for the detailed answer and willingness to provide the PR. However upon further consideration, I decided to provide the pull request by myself, basing partially on your changes.
This will cause less hassle for you and for us as well.

To be more specific, I will surely add the reuse parameter support but I won't provide the support for legacy avro versions, because we surely don't need this and I think that majority of potential users don't need it as well. I will also consider adding the option to switch generated classes caching on/off.
And I also admit that the readability is rather poor as-is and has to be rethinked and improved.

From our perspective this is a side project, so it may take a while for us to introduce these changes, but nevertheless they will appear soon.

Anyways, thank you for sharing your fork and investing your time into this project!

@FelixGV
Copy link

FelixGV commented May 29, 2019

Oh, I just saw this by chance. Not sure why I didn't get the notification for it... anyway.

BTW, we have a fix for the object re-use which is not yet merged in the other repo, here: linkedin/avro-util#10

That code (including the PR) runs in production and we have validated that object re-use works fully as expected.

-F

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants