Non-Cascaded `#[inline]` Propagation #678

RoloEdits · 2023-11-05T11:57:52Z

When looking at the code I noticed that there was a break in #[inline] propagation on functions. Given that #[inline] is non-transitive, cross crate inline optimization can be very easily disrupted if the chain breaks at the wrong depth.

When running tests with this setup:

use quick_xml::events::Event;
use quick_xml::reader::Reader;

static XML: &str = include_str!(r"XML");

fn main() {
    quick_xml();
}

fn quick_xml() {
    let mut reader = Reader::from_str(XML);
    reader.trim_text(true);

    let mut buf = Vec::with_capacity(1024 * 8);

    loop {
        match reader.read_event_into(&mut buf) {
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            Ok(Event::Eof) => break,
            Ok(Event::Start(_)) => {}
            Ok(Event::Text(_)) => {}
            _ => (),
        }
        buf.clear();
    }
}

I noticed that although this function has #[inline]:

    #[inline]
    pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {
        self.read_event_impl(buf)
    }

The next funtion its calling does not:

fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
    where
        R: XmlSource<'i, B>,
    {
        read_event_impl!(self, buf, self.reader, read_until_open, read_until_close)
    }

In trying to see the effects of the break in the #[inline] propgation, I ran a few tests. I went through and at first added #[inline] to the functions so that there is an end-to-end #[inline] chain, and then later went through and changed it to be #[inline(always)] to see what the differences would be.

I ran lto=true as well to see if there is any benefit to #[inline] or if it just ultimately matches what a manual #[inline] decoration brings, looking to see if rather than adding the attribute, just give a heads up in the docs that lto would benefit this crate to some degree.

The xml file was a 1.3GB file from the worksheet of this dataset.

Benchmarks

quick_xml: master branch
  Time (mean ± σ):      5.941 s ±  0.018 s    [User: 5.614 s, System: 0.350 s]
  Range (min … max):    5.915 s …  5.976 s    10 runs

quick_xml: master branch + lto
  Time (mean ± σ):      4.623 s ±  0.049 s    [User: 4.231 s, System: 0.395 s]
  Range (min … max):    4.579 s …  4.719 s    10 runs

// I might have missed coverage for this one, but the main functions were gotten for sure
quick_xml: inline
  Time (mean ± σ):      5.236 s ±  0.240 s    [User: 4.877 s, System: 0.353 s]
  Range (min … max):    5.123 s …  5.916 s    10 runs
  
quick_xml: inline(always)
  Time (mean ± σ):      4.232 s ±  0.022 s    [User: 3.843 s, System: 0.385 s]
  Range (min … max):    4.201 s …  4.255 s    10 runs

quick_xml: inline(always) + lto
  Time (mean ± σ):      4.148 s ±  0.017 s    [User: 3.744 s, System: 0.393 s]
  Range (min … max):    4.116 s …  4.162 s    10 runs

Percentage over master:

Seeing that the changes even benefited lto, I think this is a worth while add.

Profile

Current master:

#[inline]:

#[inline(always)]:

#[inline(always)] + lto:

One thing I did notice was a memcpy instruction that was 25% of the samples:

I'm not sure if this is part of the program or if its just an artifact of the profiling.

The text was updated successfully, but these errors were encountered:

Mingun · 2023-11-05T14:22:41Z

Thanks for your research!

The next funtion its calling does not:

Yes, this is because this function implemented using a macro which is actually expands into a relatively long code. I never profiled library yet, so maybe all your suggestions are correct, but because I want to rewrite parser I think it is prematurally to optimize inlines now. Currently reading and parsing parts are very coupled which I think is not very good for readability and performance. Indirectly, this theory is confirmed by the fact that by rewriting the code in a new manner, I later discovered that the same approach is used in maybe_xml, which is now often 1.5-2x times faster than quick_xml.

Thanks for a dataset. Would be very grateful if you also share the code you used to do comparisons and draw graphs so we can get back to this topic after rewriting.

About calamine performance: if it uses serde-based parsing than it is significantly slow that working with raw events. I think it is because when you read events in a cycle, it is better for caches, then when you read one event and then process it, read another one and process it, etc. I have an idea to improve serde performance by buffering events first.

As a conclusion: I definitely want to look and improve performance and create the best XML library in Rust, it is just the next step after fixing some bugs which are coupled with current implementation which I hope will be fixed after rewriting.

dralley · 2023-11-05T16:17:43Z

Inlining is also a bit of a tricky form of optimization in general. If you bloat the code size too much, larger programs will get slower rather than faster. Debugging & profiling also gets more difficult, etc.

Certainly worth it in many cases, though. I agree w/ Mingun about taking a deeper look once the rewrites are done.

RoloEdits · 2023-11-06T11:27:55Z

The code I changed was:

// reader/buffered_reader.rs

 #[inline(always)]
        $($async)? fn read_bytes_until $(<$lf>)? (
            &mut self,
            byte: u8,
            buf: &'b mut Vec<u8>,
            position: &mut usize,
        ) -> Result<Option<&'b [u8]>> {...}

 #[inline(always)]
        $($async)? fn read_element $(<$lf>)? (
            &mut self,
            buf: &'b mut Vec<u8>,
            position: &mut usize,
        ) -> Result<Option<&'b [u8]>> {...}

#[inline(always)]
    pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn read_to_end_into(&mut self, end: QName, buf: &mut Vec<u8>) -> Result<Span> {...}

// reader/mod.rs

#[inline(always)]
    pub fn buffer_position(&self) -> usize {...}

 #[inline(always)]
    fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
    where
        R: XmlSource<'i, B>,
    {...}

#[inline(always)]
    fn read_until_open<'i, B>(&mut self, buf: B) -> Result<std::result::Result<Event<'i>, B>>
    where
        R: XmlSource<'i, B>,
    {...}

 #[inline(always)]
    fn read_until_close<'i, B>(&mut self, buf: B) -> Result<Event<'i>>
    where
        R: XmlSource<'i, B>,
    {...}

// reader/state.rs

#[inline(always)]
    pub fn emit_text<'b>(&mut self, bytes: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn emit_bang<'b>(&mut self, bang_type: BangType, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn emit_end<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn emit_question_mark<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn emit_start<'b>(&mut self, content: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn close_expanded_empty(&mut self) -> Result<Event<'static>> {...}

This should be all of it. It was a pretty quick and dirty patch job, but the main idea was to create an end-to-end link with memchrs own #[inline] hints as well as just going through the function calls in the flamegraph and making them #[inline(always)].

For the plots, I just used a .ipynb notebook with python:

import plotly.express as px
import polars as pl

df = pl.read_csv('CSV')

# requires that `master` time is in the first row
diff = df.item(column='time', row=0)

df = df.with_columns(
    percentage_over_master = pl.lit((df.get_column('time') / diff  - 1) * -1)
)

fig = px.bar(
    x=df['implementation'],
    y=df['time'],
    template='plotly_dark',
    text=df['time'],
    height=800,
    labels={'y': 'seconds', 'x': 'implementation', 'color': 'implementation'},
    color=df['implementation'],
    title='quick_xml `#[inline]` Comparison: Runtime'
    )

fig.update_traces(textposition='outside', texttemplate = "%{text}s",)

fig.show()

fig = px.bar(
    x=df['implementation'],
    y=df['percentage_over_master'],
    text=df['percentage_over_master'],
    template='plotly_dark',
    height=800,
    labels={'y': 'percentage', 'x': 'implementation', 'color': 'implementation'},
    color=df['implementation'],
    text_auto='.4p',
    title='quick_xml `#[inline]` Comparison: Percentage Over Master'
    )

fig.update_traces(textposition='outside')

fig.update_yaxes(showticklabels=False)

fig.show()

The csv was shaped like this:

implementation,time
`master`,5.941
`master` + `lto`,4.623
`#[inline]`,5.236
`#[inline(always)]`,4.232
`#[inline(always)]` + `lto`,4.148

I used hyperfine for the runtime comparisons.

Related: tafia#678 All methods called only once or two and inlining them in most cases increases performance of our benchmarks: > critcmp master element-parser -t 5 group element-parser master ----- -------------- ------ NsReader::read_resolved_event_into/trim_text = false 1.00 398.9±6.30µs ? ?/sec 1.05 419.6±7.94µs ? ?/sec NsReader::read_resolved_event_into/trim_text = true 1.00 382.1±7.06µs ? ?/sec 1.06 404.0±7.44µs ? ?/sec One event/CData 1.00 56.3±0.97ns ? ?/sec 1.21 68.1±1.35ns ? ?/sec One event/Comment 1.00 141.2±2.52ns ? ?/sec 1.14 161.4±2.79ns ? ?/sec decode_and_parse_document_with_namespaces/rpm_filelists.xml 1.00 95.1±1.45µs 115.5 MB/sec 1.07 102.2±1.65µs 107.5 MB/sec escape_text/escaped_chars_long 1.42 1806.4±34.20ns ? ?/sec 1.00 1275.0±23.98ns ? ?/sec escape_text/escaped_chars_short 1.00 491.5±8.35ns ? ?/sec 1.07 526.6±10.80ns ? ?/sec escape_text/no_chars_to_escape_long 2.06 1831.1±36.31ns ? ?/sec 1.00 887.1±17.00ns ? ?/sec parse_document_nocopy_with_namespaces/libreoffice_document.fodt 1.00 507.2±8.56µs 107.6 MB/sec 1.08 546.2±10.20µs 100.0 MB/sec parse_document_nocopy_with_namespaces/rpm_filelists.xml 1.00 87.2±1.64µs 126.0 MB/sec 1.14 99.2±1.74µs 110.7 MB/sec parse_document_nocopy_with_namespaces/rpm_other.xml 1.00 139.6±2.83µs 158.5 MB/sec 1.07 148.7±2.71µs 148.9 MB/sec parse_document_nocopy_with_namespaces/rpm_primary.xml 1.00 190.5±3.43µs 106.4 MB/sec 1.09 207.9±3.79µs 97.5 MB/sec parse_document_nocopy_with_namespaces/rpm_primary2.xml 1.00 61.7±1.10µs 116.2 MB/sec 1.09 67.5±1.28µs 106.2 MB/sec parse_document_nocopy_with_namespaces/sample_1.xml 1.00 10.5±0.20µs 105.0 MB/sec 1.06 11.1±0.21µs 99.3 MB/sec parse_document_nocopy_with_namespaces/sample_ns.xml 1.00 8.4±0.16µs 86.5 MB/sec 1.08 9.0±0.18µs 80.0 MB/sec parse_document_nocopy_with_namespaces/sample_rss.xml 1.00 786.4±13.46µs 239.8 MB/sec 1.09 859.9±12.82µs 219.3 MB/sec parse_document_nocopy_with_namespaces/test_writer_ident.xml 1.00 29.0±0.55µs 146.4 MB/sec 1.06 30.8±0.55µs 138.0 MB/sec read_event/trim_text = false 1.00 199.3±3.59µs ? ?/sec 1.10 218.5±3.98µs ? ?/sec read_event/trim_text = true 1.00 190.4±3.76µs ? ?/sec 1.11 211.7±4.11µs ? ?/sec unescape_text/no_chars_to_unescape_short 1.00 11.8±0.21ns ? ?/sec 1.06 12.4±0.23ns ? ?/sec

Related: #678 All methods called only once or two and inlining them in most cases increases performance of our benchmarks: > critcmp master element-parser -t 5 group element-parser master ----- -------------- ------ NsReader::read_resolved_event_into/trim_text = false 1.00 398.9±6.30µs ? ?/sec 1.05 419.6±7.94µs ? ?/sec NsReader::read_resolved_event_into/trim_text = true 1.00 382.1±7.06µs ? ?/sec 1.06 404.0±7.44µs ? ?/sec One event/CData 1.00 56.3±0.97ns ? ?/sec 1.21 68.1±1.35ns ? ?/sec One event/Comment 1.00 141.2±2.52ns ? ?/sec 1.14 161.4±2.79ns ? ?/sec decode_and_parse_document_with_namespaces/rpm_filelists.xml 1.00 95.1±1.45µs 115.5 MB/sec 1.07 102.2±1.65µs 107.5 MB/sec escape_text/escaped_chars_long 1.42 1806.4±34.20ns ? ?/sec 1.00 1275.0±23.98ns ? ?/sec escape_text/escaped_chars_short 1.00 491.5±8.35ns ? ?/sec 1.07 526.6±10.80ns ? ?/sec escape_text/no_chars_to_escape_long 2.06 1831.1±36.31ns ? ?/sec 1.00 887.1±17.00ns ? ?/sec parse_document_nocopy_with_namespaces/libreoffice_document.fodt 1.00 507.2±8.56µs 107.6 MB/sec 1.08 546.2±10.20µs 100.0 MB/sec parse_document_nocopy_with_namespaces/rpm_filelists.xml 1.00 87.2±1.64µs 126.0 MB/sec 1.14 99.2±1.74µs 110.7 MB/sec parse_document_nocopy_with_namespaces/rpm_other.xml 1.00 139.6±2.83µs 158.5 MB/sec 1.07 148.7±2.71µs 148.9 MB/sec parse_document_nocopy_with_namespaces/rpm_primary.xml 1.00 190.5±3.43µs 106.4 MB/sec 1.09 207.9±3.79µs 97.5 MB/sec parse_document_nocopy_with_namespaces/rpm_primary2.xml 1.00 61.7±1.10µs 116.2 MB/sec 1.09 67.5±1.28µs 106.2 MB/sec parse_document_nocopy_with_namespaces/sample_1.xml 1.00 10.5±0.20µs 105.0 MB/sec 1.06 11.1±0.21µs 99.3 MB/sec parse_document_nocopy_with_namespaces/sample_ns.xml 1.00 8.4±0.16µs 86.5 MB/sec 1.08 9.0±0.18µs 80.0 MB/sec parse_document_nocopy_with_namespaces/sample_rss.xml 1.00 786.4±13.46µs 239.8 MB/sec 1.09 859.9±12.82µs 219.3 MB/sec parse_document_nocopy_with_namespaces/test_writer_ident.xml 1.00 29.0±0.55µs 146.4 MB/sec 1.06 30.8±0.55µs 138.0 MB/sec read_event/trim_text = false 1.00 199.3±3.59µs ? ?/sec 1.10 218.5±3.98µs ? ?/sec read_event/trim_text = true 1.00 190.4±3.76µs ? ?/sec 1.11 211.7±4.11µs ? ?/sec unescape_text/no_chars_to_unescape_short 1.00 11.8±0.21ns ? ?/sec 1.06 12.4±0.23ns ? ?/sec

RoloEdits mentioned this issue Nov 5, 2023

Worksheet Reading 3-4x Slower Than Baseline Of Underlying quick_xml Library tafia/calamine#372

Open

Mingun added enhancement help wanted optimization Issues related to reducing time needed to parse XML or to memory consumption labels Nov 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-Cascaded `#[inline]` Propagation #678

Non-Cascaded `#[inline]` Propagation #678

RoloEdits commented Nov 5, 2023 •

edited

Mingun commented Nov 5, 2023 •

edited

dralley commented Nov 5, 2023 •

edited

RoloEdits commented Nov 6, 2023

Non-Cascaded #[inline] Propagation #678

Non-Cascaded #[inline] Propagation #678

Comments

RoloEdits commented Nov 5, 2023 • edited

Benchmarks

Profile

Mingun commented Nov 5, 2023 • edited

dralley commented Nov 5, 2023 • edited

RoloEdits commented Nov 6, 2023

Non-Cascaded `#[inline]` Propagation #678

Non-Cascaded `#[inline]` Propagation #678

RoloEdits commented Nov 5, 2023 •

edited

Mingun commented Nov 5, 2023 •

edited

dralley commented Nov 5, 2023 •

edited