Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Saving copy-on-write mmap changes back to file #54

Closed
crumblingstatue opened this issue Aug 16, 2022 · 12 comments
Closed

Question: Saving copy-on-write mmap changes back to file #54

crumblingstatue opened this issue Aug 16, 2022 · 12 comments

Comments

@crumblingstatue
Copy link

I'm writing a hex editor, and I want to add memory mapped file support to it.
Writable memory maps write changes directly into a file, which is not ideal, since the user might not want to save their changes.
The changes should only be written when the user saves.
I discovered that memmap2-rs supports copy-on-write memory maps, which sounds ideal for my use case.
However, I'm unsure how to safely write the changes back to the file on save.
Calling .flush() does not write the changed data back to the file.

I see two options, both of which involve keeping track of the parts of the file that changed:

  1. Just use the underlying File handle to write back the changes right from the mmap.
    Is this sound? Or is this undefined behavior?
  2. Right before saving:
    1. Copy all the changes into buffers that I maintain
    2. Drop the mmap
    3. Write the changes from my buffers using the File handle
    4. Reopen the mmap
      This is more convoluted, but this should definitely be sound. Or at least I hope so.

Which one should I use? Or is there a better alternative?

Thank you, and sorry for using issues for questions, but I couldn't find good resources on this, and I hope there are some mmap experts here that can help answer.

@RazrFalcon
Copy link
Owner

I personally using memmap2 only for reading, so I cannot help you here. Maybe someone else would.

@adamreichold
Copy link

This does not answer your general question and I am not sure if file sizes prohibit this in your case, but editors usually do extra work to atomically replace the file being edited, i.e. instead of modifying foo.bin, they would copy it to .foo.bin.swp and modify only that file and finally when the user wants to save, rename .foo.bin.swp to foo.bin thereby atomically replacing one file by the other.

If you adopt this approach, you can map the .foo.bin.swp directly and modify its contents which would not touch the original.

If you want to optimize the impact of copying foo.bin to .foo.bin.swp in the first place, you could do something like cp --reflink=auto foo.bin .foo.bin.swp which shares the underlying file extents if possible thereby copying only metadata and data only on demand, i.e. if you modify the mapping and the page cache's write back wants to push the pages to disk.

@crumblingstatue
Copy link
Author

I am not sure if file sizes prohibit this in your case

I want the hex editor to be usable for things like viewing and editing block devices, which could be hundreds of gigabytes large, so I'm not sure how well this approach would play with that.

you could do something like cp --reflink=auto foo.bin .foo.bin.swp

I want my hex editor to work at least on windows and linux, so I don't want to use unix specific commands. Not sure if windows supports something similar to this.

@adamreichold
Copy link

In that case, I would suggest looking at it from the other way around: Mmap the files directly for reading and writing. But instead of writing pending modifications into the memory map, store those in separate data structures (a sort of overlay that is considered when displayed any modified file range) and write them into the memory map only when saving.

This is sort of like your option 2 but without closing and re-opening the memory map and keeping the buffers storing the modifications around until saving. Of course, this still means modifications will not be atomic from the PoV of other reads of the underlying file.

@crumblingstatue
Copy link
Author

instead of writing pending modifications into the memory map, store those in separate data structures (a sort of overlay that is considered when displayed any modified file range) and write them into the memory map only when saving.

This is what I would like to avoid doing if possible, as it makes the implementation a lot more complex.
Copy-on-write mmap would allow my data model to remain &mut [u8], which is drastically more simple than having to do bookkeeping work on every access.

@adamreichold
Copy link

adamreichold commented Aug 16, 2022

having to do bookkeeping work on every access.

Don't you have to do the book keeping anyway if you are going for option 2?

(I don't think option 1 is obviously sound. This was discussed repeatedly on URLO without reaching a consensus that the current memory map semantics yield anything which soundness can be based upon. But then again, since this is somewhat unchartered territory, soundness of option 2 is not really ensured either. Using memory maps is always mildly problematic with Rust ATM. (Even copy-on-write maps only protect other processes from seeing your writes, it does not protect your process from seeing concurrent modifications performed by other processes.))

@crumblingstatue
Copy link
Author

Don't you have to do the book keeping anyway if you are going for option 2?

There is nothing to do when trying to access memory. The only thing that has to be done is keep track of what changed during editing operations.

However, if I'm not using a copy-on-write memory map, I have to, on every access, look up if that access is within an edited range, and if it is, find the edit buffer that corresponds to that location, and return a slice to that instead of the original memory.
What if the request spans both an edited range and a non-edited range? I can't return a contiguous slice that spans both my edit buffer and the original memory, unless I allocate and copy on every access.

@adamreichold
Copy link

adamreichold commented Aug 16, 2022

Since you need detailed control of when writeback (user presses save) happens and care about soundness and portability, I think you should probably implement your own paging to keep the simplicity of not having to implement an overlay on read. (Meaning just read fixed size blocks from storage into memory to display and optionally modify them (and track this on a block level). If a configured memory limit is reached, write out clean blocks to make space. When the user saves write out all dirty blocks.)

@crumblingstatue
Copy link
Author

I think you should probably implement your own paging to keep the simplicity of not having to implement an overlay on read.

I don't see how fixed size blocks would help with simplicity, as there could still be requests that span blocks. And there are certain APIs that expect slices, like memmem. If I wanted to do a string search with such an implementation, I couldn't just simply call memmem, I would have to call memmem for every page, plus search each page boundary for matches as well.
And this is just one example of how such an implementation complicates things.
I would rather let the operating system take care of all this hard work for me, as it was written by professionals who know what they are doing, and has received lots of testing over the years.

@crumblingstatue
Copy link
Author

I did some preliminary testing, and this passes both on windows and linux:

#[test]
fn test_mmap_write() {
    use std::fs::OpenOptions;
    use std::io::{Seek, Write};
    std::fs::write(
        "testfile-small.txt",
        include_bytes!("../testfile-small-in.txt"),
    )
    .unwrap();
    let mut f = OpenOptions::new()
        .read(true)
        .write(true)
        .open("testfile-small.txt")
        .unwrap();
    let mut mmap = unsafe { memmap2::MmapOptions::new().map_copy(&f).unwrap() };
    assert_eq!(&*mmap, include_bytes!("../testfile-small-in.txt"));
    mmap[7..=12].copy_from_slice(b"Edited");
    assert_eq!(&*mmap, include_bytes!("../testfile-small-edited-in.txt"));
    mmap.flush().unwrap();
    f.seek(std::io::SeekFrom::Start(7)).unwrap();
    f.write_all(&mmap[7..=12]).unwrap();
    f.flush().unwrap();
    let new = std::fs::read("testfile-small.txt").unwrap();
    assert_eq!(new, include_bytes!("../testfile-small-edited-in.txt"));
    mmap[70..=86].copy_from_slice(b"Changed you again");
    assert_eq!(&*mmap, include_bytes!("../testfile-small-edited2-in.txt"));
    mmap.flush().unwrap();
    f.seek(std::io::SeekFrom::Start(70)).unwrap();
    f.write_all(&mmap[70..=86]).unwrap();
    f.flush().unwrap();
    let new = std::fs::read("testfile-small.txt").unwrap();
    assert_eq!(new, include_bytes!("../testfile-small-edited2-in.txt"));
}

testfile-small-in.txt:

Hello, World! I am a small test file.
I'm pretty nice as you can see.
Change me please.
Thank you!.

testfile-small-edited-in.txt:

Hello, Edited I am a small test file.
I'm pretty nice as you can see.
Change me please.
Thank you!.

testfile-small-edited2-in.txt:

Hello, Edited I am a small test file.
I'm pretty nice as you can see.
Changed you again
Thank you!.

This gives me some level of confidence that I could use this strategy.
I could expose it under an --unsafe-memmap option, and warn users about potential risks, but the option would still be available for users who understand the risks.

Then if some smart person wants to contribute a smart paging algorithm for my hex editor, I'll accept their pull request. But until then, I'd like to have at least basic usability for huge files without greatly complicating the implementation.

@adamreichold
Copy link

I think you should probably implement your own paging to keep the simplicity of not having to implement an overlay on read.

I think my tone was wrong here. What I would like to have said is that if you ask about soundness - meaning defined behaviour no matter the conditions under which the code executed to me - and portability, then I fear that the semantics this crate (or rather the underlying system calls) can provide will be insufficient. The intersection between memory maps and Rust's take on memory safety is still not really defined as the discussions on URLO seem to indicate.

Testing a certain pattern before usage can of course provide evidence that it is practical, but it cannot answer the question of soundness. This is why I suggested to try to implement your own paging without using the kernel-provided memory maps, because soundness should be achievable there based on the currently established interactions between Rust and the kernel.

Concerning the pattern your test case employs, I think your option 2 is still preferable and it avoids aliasing the same memory via the memory map and via the file handle (which will eventually write into the same physical page in the kernel's page cache). An alternative might to build an API which ensures that your code can only either Deref the Mmap or accessing the underlying File, but not both at the same time.

But finally, even option 2 needs to assume that your process is the only one mapping the underlying file to be safe from aliasing issues which is usually not the case for an interactively used editor.

@crumblingstatue
Copy link
Author

crumblingstatue commented Aug 18, 2022

I'm fine with assuming that I have exclusive access to the file. I'll document this invariant for the users when they opt into mmap. The default will still be reading everything into a Vec. mmap will be an option available in cases it's needed, like huge files or block devices, and the user should take precautions that they provide exclusive access.

As far as the question of soundness goes, I don't need it to be 100% theoretically sound, I just want to know if there is something obviously wrong with these approaches that is known to break things.

As for why I would rather use a copy on write memmap than implementing my own paging, the answer is greatly reduced implementation complexity. If I can rely on data being [u8], I can reuse all the algorithms and libraries that work on byte slices.
This works because mmap maps pages on page fault, so I (and the libraries I use) don't have to do anything special, just access bytes directly.
If I did my own paging, I would have to write significantly more complex algorithms, and I couldn't use those libraries that work on byte slices either. I would have to write my own versions that request pages explicitly instead of it happening on page fault.
Unless I can write my own page fault handler for doing this, but I get the feeling that is another can of worms that would lead to similar issues that mmap has.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants