Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-Cascaded #[inline] Propagation #678

Open
RoloEdits opened this issue Nov 5, 2023 · 3 comments
Open

Non-Cascaded #[inline] Propagation #678

RoloEdits opened this issue Nov 5, 2023 · 3 comments
Labels
enhancement help wanted optimization Issues related to reducing time needed to parse XML or to memory consumption

Comments

@RoloEdits
Copy link

RoloEdits commented Nov 5, 2023

When looking at the code I noticed that there was a break in #[inline] propagation on functions. Given that #[inline] is non-transitive, cross crate inline optimization can be very easily disrupted if the chain breaks at the wrong depth.

When running tests with this setup:

use quick_xml::events::Event;
use quick_xml::reader::Reader;

static XML: &str = include_str!(r"XML");

fn main() {
    quick_xml();
}

fn quick_xml() {
    let mut reader = Reader::from_str(XML);
    reader.trim_text(true);

    let mut buf = Vec::with_capacity(1024 * 8);

    loop {
        match reader.read_event_into(&mut buf) {
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            Ok(Event::Eof) => break,
            Ok(Event::Start(_)) => {}
            Ok(Event::Text(_)) => {}
            _ => (),
        }
        buf.clear();
    }
}

I noticed that although this function has #[inline]:

    #[inline]
    pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {
        self.read_event_impl(buf)
    }

The next funtion its calling does not:

fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
    where
        R: XmlSource<'i, B>,
    {
        read_event_impl!(self, buf, self.reader, read_until_open, read_until_close)
    }

In trying to see the effects of the break in the #[inline] propgation, I ran a few tests. I went through and at first added #[inline] to the functions so that there is an end-to-end #[inline] chain, and then later went through and changed it to be #[inline(always)] to see what the differences would be.

I ran lto=true as well to see if there is any benefit to #[inline] or if it just ultimately matches what a manual #[inline] decoration brings, looking to see if rather than adding the attribute, just give a heads up in the docs that lto would benefit this crate to some degree.

The xml file was a 1.3GB file from the worksheet of this dataset.

Benchmarks

quick_xml: master branch
  Time (mean ± σ):      5.941 s ±  0.018 s    [User: 5.614 s, System: 0.350 s]
  Range (min … max):    5.915 s …  5.976 s    10 runs

quick_xml: master branch + lto
  Time (mean ± σ):      4.623 s ±  0.049 s    [User: 4.231 s, System: 0.395 s]
  Range (min … max):    4.579 s …  4.719 s    10 runs

// I might have missed coverage for this one, but the main functions were gotten for sure
quick_xml: inline
  Time (mean ± σ):      5.236 s ±  0.240 s    [User: 4.877 s, System: 0.353 s]
  Range (min … max):    5.123 s …  5.916 s    10 runs
  
quick_xml: inline(always)
  Time (mean ± σ):      4.232 s ±  0.022 s    [User: 3.843 s, System: 0.385 s]
  Range (min … max):    4.201 s …  4.255 s    10 runs

quick_xml: inline(always) + lto
  Time (mean ± σ):      4.148 s ±  0.017 s    [User: 3.744 s, System: 0.393 s]
  Range (min … max):    4.116 s …  4.162 s    10 runs

Percentage over master:
image

Seeing that the changes even benefited lto, I think this is a worth while add.

Profile

Current master:
image

#[inline]:
image

#[inline(always)]:
image

#[inline(always)] + lto:
image

One thing I did notice was a memcpy instruction that was 25% of the samples:
image

I'm not sure if this is part of the program or if its just an artifact of the profiling.

@Mingun
Copy link
Collaborator

Mingun commented Nov 5, 2023

Thanks for your research!

The next funtion its calling does not:

Yes, this is because this function implemented using a macro which is actually expands into a relatively long code. I never profiled library yet, so maybe all your suggestions are correct, but because I want to rewrite parser I think it is prematurally to optimize inlines now. Currently reading and parsing parts are very coupled which I think is not very good for readability and performance. Indirectly, this theory is confirmed by the fact that by rewriting the code in a new manner, I later discovered that the same approach is used in maybe_xml, which is now often 1.5-2x times faster than quick_xml.

Thanks for a dataset. Would be very grateful if you also share the code you used to do comparisons and draw graphs so we can get back to this topic after rewriting.

About calamine performance: if it uses serde-based parsing than it is significantly slow that working with raw events. I think it is because when you read events in a cycle, it is better for caches, then when you read one event and then process it, read another one and process it, etc. I have an idea to improve serde performance by buffering events first.

As a conclusion: I definitely want to look and improve performance and create the best XML library in Rust, it is just the next step after fixing some bugs which are coupled with current implementation which I hope will be fixed after rewriting.

@Mingun Mingun added enhancement help wanted optimization Issues related to reducing time needed to parse XML or to memory consumption labels Nov 5, 2023
@dralley
Copy link
Collaborator

dralley commented Nov 5, 2023

Inlining is also a bit of a tricky form of optimization in general. If you bloat the code size too much, larger programs will get slower rather than faster. Debugging & profiling also gets more difficult, etc.

Certainly worth it in many cases, though. I agree w/ Mingun about taking a deeper look once the rewrites are done.

@RoloEdits
Copy link
Author

The code I changed was:

// reader/buffered_reader.rs

 #[inline(always)]
        $($async)? fn read_bytes_until $(<$lf>)? (
            &mut self,
            byte: u8,
            buf: &'b mut Vec<u8>,
            position: &mut usize,
        ) -> Result<Option<&'b [u8]>> {...}

 #[inline(always)]
        $($async)? fn read_element $(<$lf>)? (
            &mut self,
            buf: &'b mut Vec<u8>,
            position: &mut usize,
        ) -> Result<Option<&'b [u8]>> {...}

#[inline(always)]
    pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn read_to_end_into(&mut self, end: QName, buf: &mut Vec<u8>) -> Result<Span> {...}
// reader/mod.rs

#[inline(always)]
    pub fn buffer_position(&self) -> usize {...}

 #[inline(always)]
    fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
    where
        R: XmlSource<'i, B>,
    {...}

#[inline(always)]
    fn read_until_open<'i, B>(&mut self, buf: B) -> Result<std::result::Result<Event<'i>, B>>
    where
        R: XmlSource<'i, B>,
    {...}

 #[inline(always)]
    fn read_until_close<'i, B>(&mut self, buf: B) -> Result<Event<'i>>
    where
        R: XmlSource<'i, B>,
    {...}
// reader/state.rs

#[inline(always)]
    pub fn emit_text<'b>(&mut self, bytes: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn emit_bang<'b>(&mut self, bang_type: BangType, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn emit_end<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn emit_question_mark<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn emit_start<'b>(&mut self, content: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
    pub fn close_expanded_empty(&mut self) -> Result<Event<'static>> {...}

This should be all of it. It was a pretty quick and dirty patch job, but the main idea was to create an end-to-end link with memchrs own #[inline] hints as well as just going through the function calls in the flamegraph and making them #[inline(always)].

For the plots, I just used a .ipynb notebook with python:

import plotly.express as px
import polars as pl

df = pl.read_csv('CSV')

# requires that `master` time is in the first row
diff = df.item(column='time', row=0)

df = df.with_columns(
    percentage_over_master = pl.lit((df.get_column('time') / diff  - 1) * -1)
)

fig = px.bar(
    x=df['implementation'],
    y=df['time'],
    template='plotly_dark',
    text=df['time'],
    height=800,
    labels={'y': 'seconds', 'x': 'implementation', 'color': 'implementation'},
    color=df['implementation'],
    title='quick_xml `#[inline]` Comparison: Runtime'
    )

fig.update_traces(textposition='outside', texttemplate = "%{text}s",)

fig.show()

fig = px.bar(
    x=df['implementation'],
    y=df['percentage_over_master'],
    text=df['percentage_over_master'],
    template='plotly_dark',
    height=800,
    labels={'y': 'percentage', 'x': 'implementation', 'color': 'implementation'},
    color=df['implementation'],
    text_auto='.4p',
    title='quick_xml `#[inline]` Comparison: Percentage Over Master'
    )

fig.update_traces(textposition='outside')

fig.update_yaxes(showticklabels=False)

fig.show()

The csv was shaped like this:

implementation,time
`master`,5.941
`master` + `lto`,4.623
`#[inline]`,5.236
`#[inline(always)]`,4.232
`#[inline(always)]` + `lto`,4.148

I used hyperfine for the runtime comparisons.

Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 9, 2024
Related: tafia#678

All methods called only once or two and inlining them in most cases increases performance
of our benchmarks:

> critcmp master element-parser -t 5
group                                                              element-parser                         master
-----                                                              --------------                         ------
NsReader::read_resolved_event_into/trim_text = false               1.00    398.9±6.30µs        ? ?/sec    1.05    419.6±7.94µs        ? ?/sec
NsReader::read_resolved_event_into/trim_text = true                1.00    382.1±7.06µs        ? ?/sec    1.06    404.0±7.44µs        ? ?/sec
One event/CData                                                    1.00     56.3±0.97ns        ? ?/sec    1.21     68.1±1.35ns        ? ?/sec
One event/Comment                                                  1.00    141.2±2.52ns        ? ?/sec    1.14    161.4±2.79ns        ? ?/sec
decode_and_parse_document_with_namespaces/rpm_filelists.xml        1.00     95.1±1.45µs   115.5 MB/sec    1.07    102.2±1.65µs   107.5 MB/sec
escape_text/escaped_chars_long                                     1.42  1806.4±34.20ns        ? ?/sec    1.00  1275.0±23.98ns        ? ?/sec
escape_text/escaped_chars_short                                    1.00    491.5±8.35ns        ? ?/sec    1.07   526.6±10.80ns        ? ?/sec
escape_text/no_chars_to_escape_long                                2.06  1831.1±36.31ns        ? ?/sec    1.00   887.1±17.00ns        ? ?/sec
parse_document_nocopy_with_namespaces/libreoffice_document.fodt    1.00    507.2±8.56µs   107.6 MB/sec    1.08   546.2±10.20µs   100.0 MB/sec
parse_document_nocopy_with_namespaces/rpm_filelists.xml            1.00     87.2±1.64µs   126.0 MB/sec    1.14     99.2±1.74µs   110.7 MB/sec
parse_document_nocopy_with_namespaces/rpm_other.xml                1.00    139.6±2.83µs   158.5 MB/sec    1.07    148.7±2.71µs   148.9 MB/sec
parse_document_nocopy_with_namespaces/rpm_primary.xml              1.00    190.5±3.43µs   106.4 MB/sec    1.09    207.9±3.79µs    97.5 MB/sec
parse_document_nocopy_with_namespaces/rpm_primary2.xml             1.00     61.7±1.10µs   116.2 MB/sec    1.09     67.5±1.28µs   106.2 MB/sec
parse_document_nocopy_with_namespaces/sample_1.xml                 1.00     10.5±0.20µs   105.0 MB/sec    1.06     11.1±0.21µs    99.3 MB/sec
parse_document_nocopy_with_namespaces/sample_ns.xml                1.00      8.4±0.16µs    86.5 MB/sec    1.08      9.0±0.18µs    80.0 MB/sec
parse_document_nocopy_with_namespaces/sample_rss.xml               1.00   786.4±13.46µs   239.8 MB/sec    1.09   859.9±12.82µs   219.3 MB/sec
parse_document_nocopy_with_namespaces/test_writer_ident.xml        1.00     29.0±0.55µs   146.4 MB/sec    1.06     30.8±0.55µs   138.0 MB/sec
read_event/trim_text = false                                       1.00    199.3±3.59µs        ? ?/sec    1.10    218.5±3.98µs        ? ?/sec
read_event/trim_text = true                                        1.00    190.4±3.76µs        ? ?/sec    1.11    211.7±4.11µs        ? ?/sec
unescape_text/no_chars_to_unescape_short                           1.00     11.8±0.21ns        ? ?/sec    1.06     12.4±0.23ns        ? ?/sec
Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 9, 2024
Related: tafia#678

All methods called only once or two and inlining them in most cases increases performance
of our benchmarks:

> critcmp master element-parser -t 5
group                                                              element-parser                         master
-----                                                              --------------                         ------
NsReader::read_resolved_event_into/trim_text = false               1.00    398.9±6.30µs        ? ?/sec    1.05    419.6±7.94µs        ? ?/sec
NsReader::read_resolved_event_into/trim_text = true                1.00    382.1±7.06µs        ? ?/sec    1.06    404.0±7.44µs        ? ?/sec
One event/CData                                                    1.00     56.3±0.97ns        ? ?/sec    1.21     68.1±1.35ns        ? ?/sec
One event/Comment                                                  1.00    141.2±2.52ns        ? ?/sec    1.14    161.4±2.79ns        ? ?/sec
decode_and_parse_document_with_namespaces/rpm_filelists.xml        1.00     95.1±1.45µs   115.5 MB/sec    1.07    102.2±1.65µs   107.5 MB/sec
escape_text/escaped_chars_long                                     1.42  1806.4±34.20ns        ? ?/sec    1.00  1275.0±23.98ns        ? ?/sec
escape_text/escaped_chars_short                                    1.00    491.5±8.35ns        ? ?/sec    1.07   526.6±10.80ns        ? ?/sec
escape_text/no_chars_to_escape_long                                2.06  1831.1±36.31ns        ? ?/sec    1.00   887.1±17.00ns        ? ?/sec
parse_document_nocopy_with_namespaces/libreoffice_document.fodt    1.00    507.2±8.56µs   107.6 MB/sec    1.08   546.2±10.20µs   100.0 MB/sec
parse_document_nocopy_with_namespaces/rpm_filelists.xml            1.00     87.2±1.64µs   126.0 MB/sec    1.14     99.2±1.74µs   110.7 MB/sec
parse_document_nocopy_with_namespaces/rpm_other.xml                1.00    139.6±2.83µs   158.5 MB/sec    1.07    148.7±2.71µs   148.9 MB/sec
parse_document_nocopy_with_namespaces/rpm_primary.xml              1.00    190.5±3.43µs   106.4 MB/sec    1.09    207.9±3.79µs    97.5 MB/sec
parse_document_nocopy_with_namespaces/rpm_primary2.xml             1.00     61.7±1.10µs   116.2 MB/sec    1.09     67.5±1.28µs   106.2 MB/sec
parse_document_nocopy_with_namespaces/sample_1.xml                 1.00     10.5±0.20µs   105.0 MB/sec    1.06     11.1±0.21µs    99.3 MB/sec
parse_document_nocopy_with_namespaces/sample_ns.xml                1.00      8.4±0.16µs    86.5 MB/sec    1.08      9.0±0.18µs    80.0 MB/sec
parse_document_nocopy_with_namespaces/sample_rss.xml               1.00   786.4±13.46µs   239.8 MB/sec    1.09   859.9±12.82µs   219.3 MB/sec
parse_document_nocopy_with_namespaces/test_writer_ident.xml        1.00     29.0±0.55µs   146.4 MB/sec    1.06     30.8±0.55µs   138.0 MB/sec
read_event/trim_text = false                                       1.00    199.3±3.59µs        ? ?/sec    1.10    218.5±3.98µs        ? ?/sec
read_event/trim_text = true                                        1.00    190.4±3.76µs        ? ?/sec    1.11    211.7±4.11µs        ? ?/sec
unescape_text/no_chars_to_unescape_short                           1.00     11.8±0.21ns        ? ?/sec    1.06     12.4±0.23ns        ? ?/sec
Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 9, 2024
Related: tafia#678

All methods called only once or two and inlining them in most cases increases performance
of our benchmarks:

> critcmp master element-parser -t 5
group                                                              element-parser                         master
-----                                                              --------------                         ------
NsReader::read_resolved_event_into/trim_text = false               1.00    398.9±6.30µs        ? ?/sec    1.05    419.6±7.94µs        ? ?/sec
NsReader::read_resolved_event_into/trim_text = true                1.00    382.1±7.06µs        ? ?/sec    1.06    404.0±7.44µs        ? ?/sec
One event/CData                                                    1.00     56.3±0.97ns        ? ?/sec    1.21     68.1±1.35ns        ? ?/sec
One event/Comment                                                  1.00    141.2±2.52ns        ? ?/sec    1.14    161.4±2.79ns        ? ?/sec
decode_and_parse_document_with_namespaces/rpm_filelists.xml        1.00     95.1±1.45µs   115.5 MB/sec    1.07    102.2±1.65µs   107.5 MB/sec
escape_text/escaped_chars_long                                     1.42  1806.4±34.20ns        ? ?/sec    1.00  1275.0±23.98ns        ? ?/sec
escape_text/escaped_chars_short                                    1.00    491.5±8.35ns        ? ?/sec    1.07   526.6±10.80ns        ? ?/sec
escape_text/no_chars_to_escape_long                                2.06  1831.1±36.31ns        ? ?/sec    1.00   887.1±17.00ns        ? ?/sec
parse_document_nocopy_with_namespaces/libreoffice_document.fodt    1.00    507.2±8.56µs   107.6 MB/sec    1.08   546.2±10.20µs   100.0 MB/sec
parse_document_nocopy_with_namespaces/rpm_filelists.xml            1.00     87.2±1.64µs   126.0 MB/sec    1.14     99.2±1.74µs   110.7 MB/sec
parse_document_nocopy_with_namespaces/rpm_other.xml                1.00    139.6±2.83µs   158.5 MB/sec    1.07    148.7±2.71µs   148.9 MB/sec
parse_document_nocopy_with_namespaces/rpm_primary.xml              1.00    190.5±3.43µs   106.4 MB/sec    1.09    207.9±3.79µs    97.5 MB/sec
parse_document_nocopy_with_namespaces/rpm_primary2.xml             1.00     61.7±1.10µs   116.2 MB/sec    1.09     67.5±1.28µs   106.2 MB/sec
parse_document_nocopy_with_namespaces/sample_1.xml                 1.00     10.5±0.20µs   105.0 MB/sec    1.06     11.1±0.21µs    99.3 MB/sec
parse_document_nocopy_with_namespaces/sample_ns.xml                1.00      8.4±0.16µs    86.5 MB/sec    1.08      9.0±0.18µs    80.0 MB/sec
parse_document_nocopy_with_namespaces/sample_rss.xml               1.00   786.4±13.46µs   239.8 MB/sec    1.09   859.9±12.82µs   219.3 MB/sec
parse_document_nocopy_with_namespaces/test_writer_ident.xml        1.00     29.0±0.55µs   146.4 MB/sec    1.06     30.8±0.55µs   138.0 MB/sec
read_event/trim_text = false                                       1.00    199.3±3.59µs        ? ?/sec    1.10    218.5±3.98µs        ? ?/sec
read_event/trim_text = true                                        1.00    190.4±3.76µs        ? ?/sec    1.11    211.7±4.11µs        ? ?/sec
unescape_text/no_chars_to_unescape_short                           1.00     11.8±0.21ns        ? ?/sec    1.06     12.4±0.23ns        ? ?/sec
dralley pushed a commit that referenced this issue Jun 9, 2024
Related: #678

All methods called only once or two and inlining them in most cases increases performance
of our benchmarks:

> critcmp master element-parser -t 5
group                                                              element-parser                         master
-----                                                              --------------                         ------
NsReader::read_resolved_event_into/trim_text = false               1.00    398.9±6.30µs        ? ?/sec    1.05    419.6±7.94µs        ? ?/sec
NsReader::read_resolved_event_into/trim_text = true                1.00    382.1±7.06µs        ? ?/sec    1.06    404.0±7.44µs        ? ?/sec
One event/CData                                                    1.00     56.3±0.97ns        ? ?/sec    1.21     68.1±1.35ns        ? ?/sec
One event/Comment                                                  1.00    141.2±2.52ns        ? ?/sec    1.14    161.4±2.79ns        ? ?/sec
decode_and_parse_document_with_namespaces/rpm_filelists.xml        1.00     95.1±1.45µs   115.5 MB/sec    1.07    102.2±1.65µs   107.5 MB/sec
escape_text/escaped_chars_long                                     1.42  1806.4±34.20ns        ? ?/sec    1.00  1275.0±23.98ns        ? ?/sec
escape_text/escaped_chars_short                                    1.00    491.5±8.35ns        ? ?/sec    1.07   526.6±10.80ns        ? ?/sec
escape_text/no_chars_to_escape_long                                2.06  1831.1±36.31ns        ? ?/sec    1.00   887.1±17.00ns        ? ?/sec
parse_document_nocopy_with_namespaces/libreoffice_document.fodt    1.00    507.2±8.56µs   107.6 MB/sec    1.08   546.2±10.20µs   100.0 MB/sec
parse_document_nocopy_with_namespaces/rpm_filelists.xml            1.00     87.2±1.64µs   126.0 MB/sec    1.14     99.2±1.74µs   110.7 MB/sec
parse_document_nocopy_with_namespaces/rpm_other.xml                1.00    139.6±2.83µs   158.5 MB/sec    1.07    148.7±2.71µs   148.9 MB/sec
parse_document_nocopy_with_namespaces/rpm_primary.xml              1.00    190.5±3.43µs   106.4 MB/sec    1.09    207.9±3.79µs    97.5 MB/sec
parse_document_nocopy_with_namespaces/rpm_primary2.xml             1.00     61.7±1.10µs   116.2 MB/sec    1.09     67.5±1.28µs   106.2 MB/sec
parse_document_nocopy_with_namespaces/sample_1.xml                 1.00     10.5±0.20µs   105.0 MB/sec    1.06     11.1±0.21µs    99.3 MB/sec
parse_document_nocopy_with_namespaces/sample_ns.xml                1.00      8.4±0.16µs    86.5 MB/sec    1.08      9.0±0.18µs    80.0 MB/sec
parse_document_nocopy_with_namespaces/sample_rss.xml               1.00   786.4±13.46µs   239.8 MB/sec    1.09   859.9±12.82µs   219.3 MB/sec
parse_document_nocopy_with_namespaces/test_writer_ident.xml        1.00     29.0±0.55µs   146.4 MB/sec    1.06     30.8±0.55µs   138.0 MB/sec
read_event/trim_text = false                                       1.00    199.3±3.59µs        ? ?/sec    1.10    218.5±3.98µs        ? ?/sec
read_event/trim_text = true                                        1.00    190.4±3.76µs        ? ?/sec    1.11    211.7±4.11µs        ? ?/sec
unescape_text/no_chars_to_unescape_short                           1.00     11.8±0.21ns        ? ?/sec    1.06     12.4±0.23ns        ? ?/sec
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement help wanted optimization Issues related to reducing time needed to parse XML or to memory consumption
Projects
None yet
Development

No branches or pull requests

3 participants