-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question: Saving copy-on-write mmap changes back to file #54
Comments
I personally using |
This does not answer your general question and I am not sure if file sizes prohibit this in your case, but editors usually do extra work to atomically replace the file being edited, i.e. instead of modifying If you adopt this approach, you can map the If you want to optimize the impact of copying |
I want the hex editor to be usable for things like viewing and editing block devices, which could be hundreds of gigabytes large, so I'm not sure how well this approach would play with that.
I want my hex editor to work at least on windows and linux, so I don't want to use unix specific commands. Not sure if windows supports something similar to this. |
In that case, I would suggest looking at it from the other way around: Mmap the files directly for reading and writing. But instead of writing pending modifications into the memory map, store those in separate data structures (a sort of overlay that is considered when displayed any modified file range) and write them into the memory map only when saving. This is sort of like your option 2 but without closing and re-opening the memory map and keeping the buffers storing the modifications around until saving. Of course, this still means modifications will not be atomic from the PoV of other reads of the underlying file. |
This is what I would like to avoid doing if possible, as it makes the implementation a lot more complex. |
Don't you have to do the book keeping anyway if you are going for option 2? (I don't think option 1 is obviously sound. This was discussed repeatedly on URLO without reaching a consensus that the current memory map semantics yield anything which soundness can be based upon. But then again, since this is somewhat unchartered territory, soundness of option 2 is not really ensured either. Using memory maps is always mildly problematic with Rust ATM. (Even copy-on-write maps only protect other processes from seeing your writes, it does not protect your process from seeing concurrent modifications performed by other processes.)) |
There is nothing to do when trying to access memory. The only thing that has to be done is keep track of what changed during editing operations. However, if I'm not using a copy-on-write memory map, I have to, on every access, look up if that access is within an edited range, and if it is, find the edit buffer that corresponds to that location, and return a slice to that instead of the original memory. |
Since you need detailed control of when writeback (user presses save) happens and care about soundness and portability, I think you should probably implement your own paging to keep the simplicity of not having to implement an overlay on read. (Meaning just read fixed size blocks from storage into memory to display and optionally modify them (and track this on a block level). If a configured memory limit is reached, write out clean blocks to make space. When the user saves write out all dirty blocks.) |
I don't see how fixed size blocks would help with simplicity, as there could still be requests that span blocks. And there are certain APIs that expect slices, like memmem. If I wanted to do a string search with such an implementation, I couldn't just simply call memmem, I would have to call memmem for every page, plus search each page boundary for matches as well. |
I did some preliminary testing, and this passes both on windows and linux: #[test]
fn test_mmap_write() {
use std::fs::OpenOptions;
use std::io::{Seek, Write};
std::fs::write(
"testfile-small.txt",
include_bytes!("../testfile-small-in.txt"),
)
.unwrap();
let mut f = OpenOptions::new()
.read(true)
.write(true)
.open("testfile-small.txt")
.unwrap();
let mut mmap = unsafe { memmap2::MmapOptions::new().map_copy(&f).unwrap() };
assert_eq!(&*mmap, include_bytes!("../testfile-small-in.txt"));
mmap[7..=12].copy_from_slice(b"Edited");
assert_eq!(&*mmap, include_bytes!("../testfile-small-edited-in.txt"));
mmap.flush().unwrap();
f.seek(std::io::SeekFrom::Start(7)).unwrap();
f.write_all(&mmap[7..=12]).unwrap();
f.flush().unwrap();
let new = std::fs::read("testfile-small.txt").unwrap();
assert_eq!(new, include_bytes!("../testfile-small-edited-in.txt"));
mmap[70..=86].copy_from_slice(b"Changed you again");
assert_eq!(&*mmap, include_bytes!("../testfile-small-edited2-in.txt"));
mmap.flush().unwrap();
f.seek(std::io::SeekFrom::Start(70)).unwrap();
f.write_all(&mmap[70..=86]).unwrap();
f.flush().unwrap();
let new = std::fs::read("testfile-small.txt").unwrap();
assert_eq!(new, include_bytes!("../testfile-small-edited2-in.txt"));
} testfile-small-in.txt:
testfile-small-edited-in.txt:
testfile-small-edited2-in.txt:
This gives me some level of confidence that I could use this strategy. Then if some smart person wants to contribute a smart paging algorithm for my hex editor, I'll accept their pull request. But until then, I'd like to have at least basic usability for huge files without greatly complicating the implementation. |
I think my tone was wrong here. What I would like to have said is that if you ask about soundness - meaning defined behaviour no matter the conditions under which the code executed to me - and portability, then I fear that the semantics this crate (or rather the underlying system calls) can provide will be insufficient. The intersection between memory maps and Rust's take on memory safety is still not really defined as the discussions on URLO seem to indicate. Testing a certain pattern before usage can of course provide evidence that it is practical, but it cannot answer the question of soundness. This is why I suggested to try to implement your own paging without using the kernel-provided memory maps, because soundness should be achievable there based on the currently established interactions between Rust and the kernel. Concerning the pattern your test case employs, I think your option 2 is still preferable and it avoids aliasing the same memory via the memory map and via the file handle (which will eventually write into the same physical page in the kernel's page cache). An alternative might to build an API which ensures that your code can only either But finally, even option 2 needs to assume that your process is the only one mapping the underlying file to be safe from aliasing issues which is usually not the case for an interactively used editor. |
I'm fine with assuming that I have exclusive access to the file. I'll document this invariant for the users when they opt into mmap. The default will still be reading everything into a Vec. mmap will be an option available in cases it's needed, like huge files or block devices, and the user should take precautions that they provide exclusive access. As far as the question of soundness goes, I don't need it to be 100% theoretically sound, I just want to know if there is something obviously wrong with these approaches that is known to break things. As for why I would rather use a copy on write memmap than implementing my own paging, the answer is greatly reduced implementation complexity. If I can rely on data being [u8], I can reuse all the algorithms and libraries that work on byte slices. |
I'm writing a hex editor, and I want to add memory mapped file support to it.
Writable memory maps write changes directly into a file, which is not ideal, since the user might not want to save their changes.
The changes should only be written when the user saves.
I discovered that memmap2-rs supports copy-on-write memory maps, which sounds ideal for my use case.
However, I'm unsure how to safely write the changes back to the file on save.
Calling
.flush()
does not write the changed data back to the file.I see two options, both of which involve keeping track of the parts of the file that changed:
File
handle to write back the changes right from the mmap.Is this sound? Or is this undefined behavior?
File
handleThis is more convoluted, but this should definitely be sound. Or at least I hope so.
Which one should I use? Or is there a better alternative?
Thank you, and sorry for using issues for questions, but I couldn't find good resources on this, and I hope there are some mmap experts here that can help answer.
The text was updated successfully, but these errors were encountered: