Small String Optimization #362

Open
RoloEdits opened this issue Oct 4, 2023 · 23 comments

@RoloEdits
Contributor

Considering the domain, as well as the specification, it seems meaningful performance gains could be made by using something like smartstring in place of String.

It could also replace some of the Vec<u8> keys in the BTreeMaps, as an inlined string has better cache locality. And even if it does need to allocate, it's no worse than a Vec.
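For illustration, a minimal sketch of the idea, assuming the smartstring crate and a cut-down DataType-like enum (not calamine's actual definition):

use smartstring::alias::String as SmartString;

enum CellData {
    Float(f64),
    Str(SmartString), // short values live inline on the stack
    Empty,
}

fn from_shared(shared: &str) -> CellData {
    // Short strings are copied inline; only long ones hit the allocator.
    CellData::Str(SmartString::from(shared))
}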

@tafia
Owner

tafia commented Oct 5, 2023

I am not certain of the impact it could have, but indeed strings are mostly small and immutable, so it seems like a good fit.

@RoloEdits
Contributor Author

RoloEdits commented Oct 6, 2023

I tried to make a quick and dirty patch just to see if there would be any immediate differences.

Here are the flamegraphs. Purple is alloc.

Current std::string::String:
[flamegraph]

SSO with smartstring::String:
[flamegraph]

The biggest "win" is this block disappearing:
[flamegraph excerpt]

Some quick benchmarks with hyperfine opening an xlsx with ~150k rows.

std::string::String

Time (mean ± σ):     831.9 ms ±   5.7 ms    [User: 811.6 ms, System: 23.7 ms]
Range (min … max):   824.1 ms … 840.1 ms    10 runs

smartstring::String

Time (mean ± σ):     786.5 ms ±   9.0 ms    [User: 768.1 ms, System: 18.3 ms]
Range (min … max):   775.6 ms … 805.1 ms    10 runs

I think it's worth spending more time on a proper implementation. I'll try to get a PR up for this when I have some time.

@tafia
Owner

tafia commented Oct 8, 2023

Thanks a lot for looking at it!

@RoloEdits
Contributor Author

With the way the library is currently designed, where the String in the cell value is exposed directly (as are Strings in several other places), hiding the new type is impossible without breaking changes. And if &str were stored in the cells instead, even larger changes would be needed.

I can try to hack together a feature flag that exposes the new SSO string directly and uses it where possible, but that is going to lead to a bit of a code mess and also further lock us into whichever SSO implementation we go with: the new type would be part of the public API, so it can't even be an implementation detail.

How would you feel about me writing up a proposal for working towards a version 1.0, with some changes to accommodate better ergonomics and better encapsulation? I have looked through some of the issues and can try to address them as I go, as well as at least start some discussion on how writing spreadsheets would work. The goal, ultimately, is to make calamine the go-to for spreadsheet-related matters in Rust.

@tafia
Owner

tafia commented Oct 17, 2023

I'd be happy to see how it could go, yes, but I'm afraid it may take quite a lot of your time.
I don't mind breaking the public API if necessary; this is still pre-1.0.

Having a DataType and a SmallDataType, for instance, seems like a good idea. A Reader<Data = SmallDataType> type might also be worth exploring.

Alternatively, having an opaque DataType with various methods is also possible. I am not a huge fan of opaque structs as opposed to generic ones, but I understand that it may be simpler to iterate on.
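A rough sketch of the two shapes being discussed, with made-up names (neither is calamine's current API):

// Generic: the cell representation is an associated type of the reader,
// so a `Reader<Data = SmallDataType>` bound becomes possible.
trait CellValue: Sized {
    fn from_shared(s: &str) -> Self;
}

struct Range<T> {
    cells: Vec<T>,
}

trait Reader {
    type Data: CellValue;
    fn worksheet_range(&mut self, name: &str) -> Option<Range<Self::Data>>;
}

// Opaque: one concrete type whose representation stays hidden behind methods.
struct OpaqueCell {
    raw: String, // could be swapped for an SSO string without an API break
}

impl OpaqueCell {
    fn as_str(&self) -> Option<&str> {
        Some(&self.raw)
    }
}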

@RoloEdits
Contributor Author

Cool, I will try to come up with a write-up for the reading API, leaving writing for later, and we can go from there on what should actually be done.

As for storing the values, another option would be to store all the values as &str, leaving parsing into other types to the user. It would take overhead away from the parsing, and with SSO a lot of the data would still fit within the 24 bytes of stack space.

We could also provide accessor methods for the enum variants that return a failure case when the value is a different variant. The type info would be useful for writing, so if possible I think it's best to track it.

The worst option for performance might be a Vec<Box<dyn DataType>> scenario, but for the exposed API it might allow for the most flexibility. Though in keeping with Rust's emphasis on correctness, scattering types onto the end user isn't great. We will have to find a balance of ease of use vs. correctness.
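The accessor idea could look something like this (a cut-down sketch, not the actual API):

enum CellData {
    Int(i64),
    Float(f64),
    Str(String),
    Empty,
}

impl CellData {
    // Typed accessors that fail when the cell holds a different variant.
    fn as_str(&self) -> Option<&str> {
        match self {
            CellData::Str(s) => Some(s),
            _ => None,
        }
    }

    fn as_f64(&self) -> Option<f64> {
        match self {
            CellData::Float(f) => Some(*f),
            CellData::Int(i) => Some(*i as f64), // lossless for small integers
            _ => None,
        }
    }
}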

@RoloEdits
Contributor Author

Another case for having the cell hold a &str into backing memory: the strings get cloned from the shared-string Vec. Cloning here takes up 5% of the samples, with read_value taking up 30% of the samples.

[flamegraph]

calamine/src/xlsx.rs

Lines 1015 to 1020 in 09c25ba

match get_attribute(c_element.attributes(), QName(b"t"))? {
Some(b"s") => {
// shared string
let idx: usize = v.parse()?;
Ok(DataType::String(strings[idx].clone()))
}

The main issue is when thinking of eventually having an API for writing and editing: the cell would need to own the type. And we can't really just mem::take the value from the backing shared-string Vec either, as other sheets might rely on it too.

At least with the compact_str crate, if the strings fit on the stack, clones are just a copy. This would help in the case where, no matter what, the cell needs to own the value.

I think having the SSO type in the cell is the best middle ground. A write API could look like a .value_mut(NEW_VALUE) or a set_value(NEW_VALUE). The issue then becomes directly exposing the type in the API if we keep the enum; the type will always be exposed. But if we go with wrapper types, then we could end up with a Vec<Box<dyn CellValue>> situation, which is going backwards as far as perf is concerned.
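As a hypothetical shape for that write API (illustrative names only; assumes the compact_str crate):

use compact_str::CompactString;

struct Cell {
    value: CompactString,
}

impl Cell {
    // Hypothetical setter; short strings are stored inline, so no allocation.
    fn set_value(&mut self, value: impl Into<CompactString>) {
        self.value = value.into();
    }
}

fn demo(cell: &mut Cell) {
    cell.set_value("updated"); // inline copy for short values, no heap hit
}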

@tafia
Owner

tafia commented Oct 25, 2023

Sorry for the late reply.

I may be wrong, but couldn't we use some form of Cow? By default it just refers to the backing store, with the option to own it if needed.

Don't get me wrong, I am really not opposed to having an SSO type in the cell; I just want to understand all the constraints.
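A minimal sketch of the Cow idea, assuming the shared-string table outlives the cells (names are illustrative, not the types in #370):

use std::borrow::Cow;

enum CellDataRef<'a> {
    Str(Cow<'a, str>), // borrows from the shared-string table by default
    Float(f64),
    Empty,
}

fn from_shared(strings: &[String], idx: usize) -> CellDataRef<'_> {
    CellDataRef::Str(Cow::Borrowed(strings[idx].as_str()))
}

fn from_inline(value: &str) -> CellDataRef<'static> {
    // Inline (non-shared) cell strings still have to be owned.
    CellDataRef::Str(Cow::Owned(value.to_owned()))
}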

@tafia
Owner

tafia commented Oct 25, 2023

#369, while not tackling the per-cell issue, might help with memory usage for the large shared-string table in xlsx (as per your benchmarks). Unfortunately I cannot test it properly (the computer I'm using now is under WSL and I can't figure out how to monitor memory usage there). Would you mind giving it a try?

Similarly, another avenue would be to add a function that reads row by row instead of saving everything into a big range and then iterating over it, assuming it is OK to hold all the data in the Nested struct.

@RoloEdits
Contributor Author

Nested:
[flamegraph]

Master:
[flamegraphs]

The malloc samples are the allocations for clone() on master and to_owned() on nested. Surprisingly, more time is spent in malloc on nested.

The memory plot is basically the same:
[memory plot]

I think I would need to build on top of the default allocator to actually count the allocations being made to get any more resolution. The second spike in the nested plot could actually be happening in the other one as well, but the sampling is only one sample per 200 ms, so there is a lot it can miss.
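One way to do that counting would be a wrapper around the system allocator that tallies calls (a sketch; the workload line is just a stand-in):

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);
static BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Count every allocation and its size, then defer to the system allocator.
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let _s = "workbook parsing would go here".to_string(); // stand-in workload
    println!(
        "{} allocations, {} bytes",
        ALLOCATIONS.load(Ordering::Relaxed),
        BYTES.load(Ordering::Relaxed)
    );
}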

@tafia
Owner

tafia commented Oct 25, 2023

The second spike could be because of some Vec resizing (maybe we could try a Nested::with_capacity(BIG) with a very high capacity upfront to validate that?).

In general this is what I'd expect, because it is overwhelmingly dominated by the allocations for DataType, which are not addressed.

@RoloEdits
Contributor Author

RoloEdits commented Oct 25, 2023

Did a memory profile of the current master with valgrind --tool=massif --time-unit=B.

    GB
3.417^                                                                ###     
     |                                                                #       
     |                                                                #       
     |                                                                #       
     |                                                                #       
     |                                                                #       
     |                                                                #       
     |                                                                #       
     |                                                       @@@@@@@::#  :    
     |                                                @@@@@@@@@@@@@@  #  :    
     |                                                @@@@@@@@@@@@@@  #  ::   
     |                                                @@@@@@@@@@@@@@  #  ::   
     |                                                @@@@@@@@@@@@@@  #  :::: 
     |                                           @@@@@@@@@@@@@@@@@@@  #  ::   
     |                                    :::::::@@@@ @@@@@@@@@@@@@@  #  ::   
     |                                    :::::::@@@@ @@@@@@@@@@@@@@  #  ::   
     |                                  @@:::::::@@@@ @@@@@@@@@@@@@@  #  ::   
     |                               :@@@@:::::::@@@@ @@@@@@@@@@@@@@  #  ::   
     |                            @@@:@@@@:::::::@@@@ @@@@@@@@@@@@@@  #  ::   
     |                   :::::::@@@@@:@@@@:::::::@@@@ @@@@@@@@@@@@@@  #  ::  :
   0 +----------------------------------------------------------------------->GB
     0                                                                   31.44

Here is the plot from massif-visualizer:
[massif-visualizer plot]

Here is a pastebin of the full output.

The allocation trees are associated with snapshots. More info can be found in the massif man page.

@RoloEdits
Contributor Author

RoloEdits commented Oct 25, 2023

Where it looks like the sheet is being read (samples 89, 90, and 91, with 91 being the peak memory usage), I see this for string clones. It accounts for about 12% of the allocations during the sheet read.

->12.20% (447,488,418B) 0x180256: alloc (alloc.rs:98)
| ->12.20% (447,488,418B) 0x180256: alloc_impl (alloc.rs:181)
|   ->12.20% (447,488,418B) 0x180256: allocate (alloc.rs:241)
|     ->12.20% (447,488,418B) 0x180256: allocate_in<u8, alloc::alloc::Global> (raw_vec.rs:184)
|       ->12.20% (447,488,418B) 0x180256: with_capacity_in<u8, alloc::alloc::Global> (raw_vec.rs:130)
|         ->12.20% (447,488,418B) 0x180256: with_capacity_in<u8, alloc::alloc::Global> (mod.rs:670)
|           ->12.20% (447,488,418B) 0x180256: to_vec<u8, alloc::alloc::Global> (slice.rs:162)
|             ->12.20% (447,488,418B) 0x180256: to_vec<u8, alloc::alloc::Global> (slice.rs:111)
|               ->12.20% (447,488,418B) 0x180256: to_vec_in<u8, alloc::alloc::Global> (slice.rs:441)
|                 ->12.20% (447,488,418B) 0x180256: clone<u8, alloc::alloc::Global> (mod.rs:2628)
|                   ->12.20% (447,488,418B) 0x180256: <alloc::string::String as core::clone::Clone>::clone (string.rs:1930)
|                     ->12.20% (447,488,208B) 0x12C443: read_value (xlsx.rs:1019)
|                     | ->12.20% (447,488,208B) 0x12C443: {closure
|                     |   ->12.20% (447,488,208B) 0x12C443: read_sheet<calamine::datatype::DataType, calamine::xlsx::read_sheet_data::{closure_env
|                     |     ->12.20% (447,488,208B) 0x12C443: calamine::xlsx::read_sheet_data (xlsx.rs:1077)
|                     |       ->12.20% (447,488,208B) 0x118EC2: {closure
|                     |         ->12.20% (447,488,208B) 0x118EC2: worksheet<calamine::datatype::DataType, calamine::xlsx::{impl
|                     |           ->12.20% (447,488,208B) 0x118EC2: {closure
|                     |             ->12.20% (447,488,208B) 0x118EC2: map<core::result::Result<quick_xml::reader::Reader<std::io::buffered::bufreader::BufReader<zip::read::ZipFile>>, calamine::xlsx::XlsxError>, core::result::Result<calamine::Range<calamine::datatype::DataType>, calamine::xlsx::XlsxError>, calamine::xlsx::{impl
|                     |               ->12.20% (447,488,208B) 0x118EC2: <calamine::xlsx::Xlsx<RS> as calamine::Reader<RS>>::worksheet_range (xlsx.rs:821)
|                     |                 ->12.20% (447,488,208B) 0x127077: nyc::main (nyc.rs:7)
|                     |                   ->12.20% (447,488,208B) 0x129032: call_once<fn(), ()> (function.rs:250)
|                     |                     ->12.20% (447,488,208B) 0x129032: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:154)
|                     |                       ->12.20% (447,488,208B) 0x12A1C8: std::rt::lang_start::{{closure}} (rt.rs:166)
|                     |                         ->12.20% (447,488,208B) 0x15CF4A: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284)
|                     |                           ->12.20% (447,488,208B) 0x15CF4A: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:502)
|                     |                             ->12.20% (447,488,208B) 0x15CF4A: try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:466)
|                     |                               ->12.20% (447,488,208B) 0x15CF4A: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:142)
|                     |                                 ->12.20% (447,488,208B) 0x15CF4A: {closure
|                     |                                   ->12.20% (447,488,208B) 0x15CF4A: do_call<std::rt::lang_start_internal::{closure_env
|                     |                                     ->12.20% (447,488,208B) 0x15CF4A: try<isize, std::rt::lang_start_internal::{closure_env
|                     |                                       ->12.20% (447,488,208B) 0x15CF4A: catch_unwind<std::rt::lang_start_internal::{closure_env
|                     |                                         ->12.20% (447,488,208B) 0x15CF4A: std::rt::lang_start_internal (rt.rs:148)
|                     |                                           ->12.20% (447,488,208B) 0x12735B: main (in /mnt/d/git/calamine/target/release/nyc)
|                     |                                             
|                     ->00.00% (210B) in 1+ places, all below ms_print's threshold (01.00%)

Here is the tree from massif-visualizer when it was at peak memory:
[massif-visualizer allocation tree]

@tafia
Owner

tafia commented Oct 25, 2023

Thanks for the analysis, I've done another iteration in #370 to provide some DataTypeRef and see how it fares in terms of memory consumption.

On my machine it is >10% faster. Memory usage is also better, but still not impressive.

    GB
2.672^                                                              ####
     |                                                              #
     |                                                              #
     |                                                              #
     |                                                              #
     |                                                              #
     |                                                              #
     |                                                              #
     |                                                              #
     |                                                              #
     |                                     :::::::::::::::::::::::::#   :::::
     |                                     :::: :::: ::: :: :: ::   #   :
     |                                     :::: :::: ::: :: :: ::   #   :
     |                                     :::: :::: ::: :: :: ::   #   :
     |                    ::::::::::::::::::::: :::: ::: :: :: ::   #   :
     |                    :::: :: : : :::: :::: :::: ::: :: :: ::   #   :
     |                    :::: :: : : :::: :::: :::: ::: :: :: ::   #   :
     |           ::::::::::::: :: : : :::: :::: :::: ::: :: :: ::   #   :
     |     ::::::: :: ::: :::: :: : : :::: :::: :::: ::: :: :: ::   #   :
     |  :::: : ::: :: ::: :::: :: : : :::: :::: :::: ::: :: :: ::   #   :    :
   0 +----------------------------------------------------------------------->GB
     0                                                                   20.08

@RoloEdits
Contributor Author

10% is a good speedup. Just shows the allocation pressure that exists.

Memory:
[massif plot]

Virtual memory:
[massif plot]

Master:
[plot]

Ref:
[plot]

When measured, there were 2,819,870 shared strings. Each of these is a 24-byte String header (pointer, length, capacity), for 67,676,880 bytes in the Vec. The size in bytes of all the character data in those Strings was 62,534,331. Altogether that makes 130,211,211 bytes for the shared strings.

compact_str::CompactString stores the character bytes inline, up to the 24-byte size of the String header itself. It would help here, as short strings need neither a separate character allocation nor a pointer out of the Vec, and you'd expect a lot of the words to fit within the inline space. This not only removes an allocation, it is also more memory-efficient for storing the same information.
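A quick check of that inline-storage claim (assuming the compact_str crate, on a 64-bit target):

use compact_str::CompactString;

fn main() {
    // Same 24-byte footprint in the Vec either way...
    assert_eq!(std::mem::size_of::<String>(), 24);
    assert_eq!(std::mem::size_of::<CompactString>(), 24);

    // ...but short values are stored inline instead of behind a heap pointer.
    let short = CompactString::from("BROOKLYN");
    assert!(!short.is_heap_allocated());
}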

But I'm actually wondering now where the extra overhead comes from that gets us to ~2GB before the sheet load. Around 200MB looks more plausible from this accounting. I wonder if there is a lot of fragmentation going on from each string getting allocated?

@RoloEdits
Contributor Author

RoloEdits commented Oct 25, 2023

Did a quick patch to use Vec<compact_str::CompactString> in place of Vec<String> for the shared strings.

Memory:
[massif plot]

Virtual memory:
[massif plot]

Benchmark 1: master 
  Time (mean ± σ):     25.278 s ±  0.424 s    [User: 24.852 s, System: 0.470 s]
  Range (min … max):   24.980 s … 26.369 s    10 runs

Benchmark 1: data_ref_sso
  Time (mean ± σ):     21.644 s ±  0.180 s    [User: 21.429 s, System: 0.240 s]
  Range (min … max):   21.489 s … 22.018 s    10 runs

Over the current master that's a 16% increase in speed and a 28% reduction in peak memory usage, but only a 2% reduction in memory over the data_ref result. Out of the 2,819,870 total shared strings, 2,487,960 were stored inline, leaving 331,910 needing more than 24 bytes and thus allocated on the heap.

@RoloEdits
Contributor Author

So at this point in the program's execution:
[screenshot]

I waited for the print of the number of inline strings and then stopped the profile right after, so I could see the memory usage right after the initial struct holding the shared strings was built. I got 111,833,088 bytes. This is much closer to the expected value than the ~2GB seen before the worksheet read, and it is this early in the execution that the data is read into memory.

[memory plot]

I'm really not sure where the rest of the memory usage is coming from, if not from the sheet, given the shared-string data was in memory this early on.

@tafia Do you have any idea what else could cause the memory to go up like this that wouldn't be from the sheet read?

@tafia
Owner

tafia commented Oct 26, 2023

At this point most of the memory is used in cells:

  • size_of::<Cell>() is 40 so ~ 1GB + actual string heap
  • There should actually be more because of the power-of-2 resizing of the vecs (hence the step function)

Clearly the SSO alone won't cut it; we should either find a better way to store a cell or use streaming iterators.

  • it is using a "sparse" storage, which in this case seems to be beneficial else we would have ~40M cells
  • we "could" use u16 for columns but I think that there won't be any impact because of padding
  • we can definitely use Box<str> and remove ~214MB

Having a streaming iterator on the other hand seems doable and would dwarf any other attempt in terms of both memory and speed.
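For reference, the per-field sizes behind those estimates (64-bit target assumed; the Cell stand-ins here are not calamine's definition). Dropping String's capacity word via Box<str> is where the per-cell saving comes from:

use std::mem::size_of;

fn main() {
    println!("String    : {} bytes", size_of::<String>());    // 24: ptr + len + cap
    println!("Box<str>  : {} bytes", size_of::<Box<str>>());   // 16: ptr + len
    println!("&str      : {} bytes", size_of::<&str>());       // 16: ptr + len
    println!("(u32, u32): {} bytes", size_of::<(u32, u32)>()); // 8: cell position
}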

@RoloEdits
Contributor Author

If we commit to being read-only, we could just use &str: the same 16 bytes as Box<str>, minus the extra allocations. That would bring size_of::<Cell<'a>>() down to 32 bytes, a 25% size reduction. The backing shared-strings Vec can use the SSO string, and then the sheets would have a

struct Worksheet<'a, 'b> { 
    shared_strings: &'a [CompactString],
    cells: Vec<Cell<'b>>,
}

And the part for the string in the cell would just be a

match get_attribute(c_element.attributes(), QName(b"t"))? { 
     Some(b"s") => { 
         // shared string 
         let idx: usize = v.parse()?; 
         Ok(DataType::String(shared_strings[idx].as_str())) 
     } 

Mutating the data means the cell would need to own the type, so a clone() would still need to take place. With the SSO type, and with this specific data set, 87% of strings were inlined, and inlined strings clone as a plain copy. For single-sheet workbooks this is just an unnecessary duplicate of the information, but that's how it is currently anyway.

Being read-only would also mean that the public API can just expose a &str where needed, and it would make a lot of the other API much simpler. As seen with the 10% difference attributed to the allocation pressure of cloning the cells' String, the SSO clone could still speed this operation up and also allow the cell to own the type, but the Cell size would remain 40 bytes. So there isn't any win for memory usage, and there is the issue that the type is now part of the API.

Sparse-data optimization could help in certain workloads. But as it really doesn't seem like we can reduce the memory requirements of the Cell any lower than what we have now, even in a read-only situation, I think iterators over cells or rows are the next step we have to look into. That could be a solution for people who only need to go cell by cell or row by row over very large data sets. For anything else, as far as in-memory representation goes, we might just have to accept that we have reached the limit.

One thing I am curious about is the large spike at the end of the program. If the gradual climb after the shared strings are read into memory is the sheet being read, is cell creation the final big bump? But if that's the case, why is there a climb up to that point? The current master peaks at 3.2GB. The 40 million cells at 40 bytes is 1.6GB. The shared strings are 130MB. Even if you clone each string, that should only be another 130MB. The spike's jump matches the Vec<Cell> size. We might be able to remove the middle climb from the 130MB shared strings to the 2GB point before the cells are formed. The max we should expect for memory usage should be right at that 2GB mark, not the 3.2GB we have now.
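To make that arithmetic explicit (just re-tallying the numbers quoted above):

fn main() {
    let cells = 40_000_000u64;            // cell count quoted above
    let cell_size = 40u64;                // size_of::<Cell>() reported earlier
    let shared_strings = 130_211_211u64;  // shared-string bytes measured earlier

    let cells_bytes = cells * cell_size;  // 1.6 GB
    // Even with every shared string cloned once, the expected peak is ~1.9 GB,
    // well under the observed 3.2 GB.
    let expected_peak = cells_bytes + 2 * shared_strings;
    println!("cells: {:.2} GB", cells_bytes as f64 / 1e9);
    println!("expected peak: {:.2} GB", expected_peak as f64 / 1e9);
}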

@tafia
Owner

tafia commented Oct 26, 2023

An option to work around ownership is to ask the caller for a buffer to hold the non-shared strings (like what is done in quick-xml).

/// Get worksheet range where shared string values are only borrowed
pub fn worksheet_range_ref<'a>(
    &'a mut self,
    name: &str,
    buffer: &'a mut Vec<u8>,
) -> Option<Result<Range<DataTypeRef<'a>>, XlsxError>> {
    // ...
}

The buffer could be something other than a plain Vec, but it seems simpler this way. At this point the SSO optimization seems unnecessary, because every cell string would only be two words wide (smaller than CompactString) and likely faster, as there is no need to check for all the variant types.

The current master peaks at 3.2GB. The 40 million cells at 40 bytes is 1.6GB.

What I believe we're seeing here is just the total Vec<Cell> resizing (~1GB old + ~2GB new before copying old and dropping the old one, assuming the heap allocator couldn't just find "free" 1GB space just after the current old). One way to avoid that is

  • be better at guessing the total capacity upfront (with some statistics, maybe depending on the file size / the sheet size etc...)
  • control the resizing on our end so it is not always a 2x doubling after some threshold (for example, incrementing by at most 200_000 items; see the sketch below)
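A sketch of that bounded-growth idea (the chunk size is a placeholder):

const CHUNK: usize = 200_000;

fn push_cell<T>(cells: &mut Vec<T>, cell: T) {
    if cells.len() == cells.capacity() {
        // Reserve a bounded extra chunk rather than letting a multi-GB Vec double.
        cells.reserve_exact(CHUNK);
    }
    cells.push(cell);
}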

I think iterators over cells or rows are the next step we have to look into. This could be a solution for people who only need to go cell by cell or row by row with very large data sets. For anything else, as far as in-memory representation goes, we might just have to accept we have reached the limit.

Yes. In general the current API could simply collect all the cells into a range, so we won't duplicate too much code.

@tafia
Owner

tafia commented Oct 29, 2023

I've updated #370 to support a new XlsxCellReader:

    let mut cell_reader = excel
        .worksheet_cells_reader("NYC_311_SR_2010-2020-sample-1M")
        .unwrap();
    let mut l = 0;
    while let Some(cell) = cell_reader.next_cell().unwrap() {
        l += cell.get_position().0;
    }

This one, as expected, doesn't use much memory (relatively speaking, of course); mostly the sharedStrings, I suppose.

    MB
202.7^                                   ::           :          ::@       :::
     |                             ::#::@: ::::::::::::::::::@:::: @::::@:::::
     |                          :@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |                        :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |                        :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |                        :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |                        :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |                        :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |                      :::::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |                   :::: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |                ::@: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |             @@:: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |            :@ :: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |            :@ :: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |            :@ :: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |         ::::@ :: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |       :::: :@ :: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |     ::: :: :@ :: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |    :: : :: :@ :: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
     |  :::: : :: :@ :: @: :: :::@@: #: @: : ::::::: ::: ::::@:::: @: ::@:::::
   0 +----------------------------------------------------------------------->GB
     0                                                                   24.96

In terms of perf, there isn't much of a difference compared to using DataTypeRef.
I haven't updated the code to reuse the same buffer for every cell etc., but I don't think it will have a big impact now.
I'll update it this week anyway to see how it fares.

@RoloEdits
Contributor Author

This looks like a promising direction. So then we could have two modes: an eager mode that reads fully into memory, and a 'lazy' mode that exposes iterators over cells and, with some work, rows?

use calamine::{Workbook, Xlsx, XlsxError};

fn main() -> Result<(), XlsxError> {
    // Type-state for a full open in memory
    let workbook: Xlsx<Memory> = Workbook::open("WORKBOOK.xlsx")?;
    Ok(())
}

use calamine::{Workbook, Xlsx, XlsxError};

fn main() -> Result<(), XlsxError> {
    // Type-state for a lazy open
    let workbook: Xlsx<Lazy> = Workbook::open("WORKBOOK.xlsx")?;
    Ok(())
}

Or if we want to scope the changes to only the sheet level:

use calamine::{Workbook, Xlsx, XlsxError};

fn main() -> Result<(), XlsxError> {
    let workbook: Xlsx = Workbook::open("WORKBOOK.xlsx")?;

    let worksheet: Worksheet<Memory> = workbook.worksheet("Sheet1")?;

    for cell in worksheet.cells() {
        // work
    }

    for row in worksheet.rows() {
        // work
    }

    Ok(())
}

use calamine::{Workbook, Xlsx, XlsxError};

fn main() -> Result<(), XlsxError> {
    let workbook: Xlsx = Workbook::open("WORKBOOK.xlsx")?;

    let worksheet: Worksheet<Lazy> = workbook.worksheet_lazy("Sheet1")?;

    for cell in worksheet.cells() {
        // work
    }

    for row in worksheet.rows() {
        // work
    }

    Ok(())
}

We would need coverage of .worksheets() and a .worksheet_lazy(), and perhaps a .worksheets_by_name(["Sheet1", "Sheet5"]) and a .worksheets_by_name_lazy(["Sheet1", "Sheet5"]). These would offer either a full load or a lazy, iterating load.

@tafia
Owner

tafia commented Oct 30, 2023

I think we should keep the changes only at the sheet level. The current code already provides both lazy and full APIs:
lazy being the reader and full being the Range, and the full one already uses the lazy version, so we're good.
We may want to rename things, etc.

Next steps

  1. Add some wrappers
    a. rows (see the sketch below)
    b. filter_position: a special way to filter cells based on their position. IMHO it is very common NOT to load all the columns, and we could avoid some work by filtering out cells based only on their positions
  2. investigate fetching multiple sheets in parallel. I don't think there is any strong issue, as long as we have different bufreaders.
  3. investigate reading the same sheet with multiple threads (seek into some index, find a start-row event and update the "length" of the previous file slice). I am not sure it is that easy, but it may be worth checking at some point. EDIT: it seems we cannot really seek into a compressed file in a zip ... so if we need to decompress it all first, it may be quite heavy in the end
  4. investigate moving all this logic into the other file formats. It is unclear whether this is possible at all, but I'd like to expose the same API regardless of the file format
  5. investigate using a fully borrowed DataTypeRef where the inline strings use some external buffer. Maybe it is useful, maybe not
  6. try to avoid duplicated code; currently cell values and cell formulas are almost exact copies. We could use the same code and just call some FnMut depending on the scenario
  7. see how we can leverage the lazy variant for serde deserialization
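For item 1a, a rows wrapper could buffer the streaming cells until the row index changes. A minimal sketch with a stand-in Cell type (calamine's real streaming type is the XlsxCellReader shown earlier):

struct Cell {
    pos: (u32, u32), // (row, column)
    value: String,
}

struct Rows<I: Iterator<Item = Cell>> {
    cells: std::iter::Peekable<I>,
}

impl<I: Iterator<Item = Cell>> Iterator for Rows<I> {
    type Item = Vec<Cell>;

    fn next(&mut self) -> Option<Vec<Cell>> {
        let first = self.cells.next()?;
        let row = first.pos.0;
        let mut out = vec![first];
        // Keep pulling cells while they belong to the same row.
        while self.cells.peek().map(|c| c.pos.0) == Some(row) {
            out.push(self.cells.next().unwrap());
        }
        Some(out)
    }
}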
