
Feature: Ability to pause, save state and restore VM #480

Closed
dobegor opened this issue Jun 3, 2019 · 25 comments
Labels
🎉 enhancement New feature! 📦 lib-compiler-llvm About wasmer-compiler-llvm priority-low Low priority issue

Comments

@dobegor

dobegor commented Jun 3, 2019

The actual proposal is wasmerio/wasmer-go#22, forwarded here as requested by @Hywan.

Motivation

With the ability to pause the VM, save its state (memory, execution context) and restore it later,
it would be possible to create persistable sandboxed environments with the ability to migrate them, restart them, and perform host maintenance with little to no interruption of user programs.

Proposal

Add the necessary functions (with checkpoints?) to pause/resume VM execution, export its memory along with the rest of the state (registers, instruction pointer), and allow creating a VM from imported state.

@dobegor dobegor added the 🎉 enhancement New feature! label Jun 3, 2019
@losfair
Contributor

losfair commented Jun 13, 2019

Being worked on as a part of #489 .

@moonheart08

#489 is merged now; however, there is very little documentation on how to use it.

@dobegor
Author

dobegor commented Sep 12, 2019

@losfair could you please document or point us how to use the Su engine to accomplish this?

@kaimast

kaimast commented Jan 10, 2021

I am also interested in using Su, but I'm not sure how. I'm not even sure whether that code was removed during the 1.0 refactor?

@Hywan Hywan self-assigned this Jul 13, 2021
@Hywan Hywan added the 📦 lib-compiler-llvm About wasmer-compiler-llvm label Jul 13, 2021
@Amanieu Amanieu added the priority-low Low priority issue label Oct 20, 2021
@RobDavenport

Does a workaround for this currently exist? I can't find much on Su in the existing documentation.

@supercmmetry

Any updates on this feat?

@danielblignaut

+1 Any updates or plans for this?


stale bot commented Dec 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the 🏚 stale Inactive issues or PR label Dec 9, 2023
@theduke
Contributor

theduke commented Dec 12, 2023

We will be able to use the functionality from #4263 for this.

@stale stale bot removed the 🏚 stale Inactive issues or PR label Dec 12, 2023
@MaratBR

MaratBR commented Jan 7, 2024

#4263 has been merged recently, and I have been trying to implement automatic saving of the VM state. That is, at certain moments I want to send Sigstop to the VM and save the state to the journal (via SnapshotTrigger::Sigstop). However, I can't get it to work: the journal file contains only records of syscalls (a bunch of printlns in my case).

In my use case I need to save the state of the VM at arbitrary moments, not periodically (although saving every 2-3 s is an option, albeit not a very good one).

I am pretty new to using wasmer though, so maybe I am just doing it wrong.

My attempts so far
macro_rules! setup_tokio_rt {
    () => {
        let tokio_rt = tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()
            .unwrap();
        let tokio_rt_handle = tokio_rt.handle().clone();
        let _tokio_rt_guard = tokio_rt_handle.enter();
    };
}

// set everything up manually
//  1. snapshot restoration doesn't work because the function that does it seems to be private (or pub(crate), I am not sure)
//  2. sending Sigstop doesn't do anything, but sending Sigint does work (though no snapshot is saved)
//  3. snapshots are not being taken every 100ms
// NOTE: because of 1. I can't be sure that snapshots are actually not being taken, but the content of the journal file does not
// change significantly, and my WASM module allocates around 15 kB of memory as a Vec<u8>, so it should show up in a snapshot,
// I am assuming (I am using the debug profile, so it shouldn't be optimized away)
pub fn test1_fully_manual() {
    setup_tokio_rt!();

    let mut store = Store::default();
    let file_path = Path::new("./output.wasm");
    let module = Module::from_file(&mut store, file_path).unwrap();
    let journal = Arc::new(LogFileJournal::new(Path::new("./test1.wasi-journal")).unwrap());

    // create store and wasi environment
    let mut wasi_env_builder = WasiEnv::builder("hello")
        .stdin(Box::new(Stdin::default()))
        .stdout(Box::new(Stdout::default()))
        .stderr(Box::new(Stderr::default()));
    wasi_env_builder.add_snapshot_trigger(SnapshotTrigger::Sigint);
    wasi_env_builder.add_snapshot_trigger(SnapshotTrigger::Sigstop);
    wasi_env_builder.with_snapshot_interval(Duration::from_millis(100));
    wasi_env_builder.add_journal(journal);
    wasi_env_builder.set_module_hash(ModuleHash::from_bytes([0, 0, 0, 0, 0, 0, 0, 0]));
    let wasi_env = wasi_env_builder.build().unwrap();
    let tasks = wasi_env.runtime.task_manager().clone();
    let mut wasi_fn_env = WasiFunctionEnv::new(&mut store, wasi_env);

    // imports
    let mut import_object = wasi_fn_env
        .import_object_for_all_wasi_versions(&mut store, &module)
        .unwrap();
    // TODO: add my own imports

    let mut store_mut = store.as_store_mut();
    let memory = tasks
        .build_memory(&mut store_mut, SpawnMemoryType::CreateMemory)
        .unwrap();
    if let Some(memory) = memory.as_ref() {
        import_object.define("env", "memory", memory.clone());
    }

    let instance = Instance::new(&mut store, &module, &import_object).unwrap();
    wasi_fn_env
        .initialize_with_memory(&mut store, instance.clone(), memory, true)
        .unwrap();

    let start_fn = instance.exports.get_function("_start").unwrap();

    let data = wasi_fn_env.data(&mut store).clone();
    std::thread::spawn(move || {
        std::thread::sleep(Duration::from_secs(1));
        data.process.signal_process(Signal::Sigint); // Sigstop does nothing, Sigint stops the process without taking a snapshot and then crashes the
    });

    let result = start_fn.call(&mut store, &[]);
    result.unwrap();
}

// very similar to test1, but now restoring from the journal actually works, except there is still
// no snapshot; only syscalls like the "println"s are being restored.
// Also, SnapshotTrigger::FirstStdin outright crashes the WASI process when I call stdin().read_line() from within it.
// # UPDATE: I figured out the issue with the crash, you can probably ignore this example
pub fn test2_run_with_store_ext() {
    setup_tokio_rt!();

    let mut store = Store::default();
    let file_path = Path::new("./output.wasm");
    let module: Module = Module::from_file(&mut store, file_path).unwrap();

    let journal = Arc::new(LogFileJournal::new(Path::new("./test2.wasi-journal")).unwrap());

    let mut builder = WasiEnv::builder("hello")
        .stdin(Box::new(Stdin::default()))
        .stdout(Box::new(Stdout::default()))
        .stderr(Box::new(Stderr::default()));
    builder.add_journal(journal);
    builder.add_snapshot_trigger(SnapshotTrigger::FirstStdin); // THIS LINE crashes the VM when it gets to stdin().read_line()
    builder.with_snapshot_interval(Duration::from_millis(500));
    builder
        .run_with_store_ext(
            module,
            ModuleHash::from_bytes([0, 1, 2, 3, 4, 5, 6, 7]),
            &mut store,
        )
        .unwrap();
}

// test3: using wasi_runner - could not find a way to send a Sigstop, but
// periodic snapshots don't seem to work either
pub fn test3_wasi_runner() {
    let tokio_rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap();
    let handle = tokio_rt.handle().clone();
    let _guard = handle.enter();

    let mut store = Store::default();
    let engine = store.engine().clone();

    let file_path = Path::new("./output.wasm");
    let module: Module = Module::from_file(&mut store, file_path).unwrap();

    let journal = CompactingLogFileJournal::new(Path::new("./test3.wasi-journal"))
        .unwrap()
        .with_compact_on_drop();
    let journal = Arc::new(journal);

    let task_manager = Arc::new(TokioTaskManager::new(tokio_rt));
    let mut rt = PluggableRuntime::new(task_manager.clone());
    rt.add_journal(journal.clone());
    // runtime.set_engine(Some(store.engine().clone()));
    rt.set_networking_implementation(virtual_net::UnsupportedVirtualNetworking::default());

    let tty = Arc::new(SysTty::default());
    tty.reset();
    rt.set_tty(tty);
    rt.set_engine(Some(engine));

    let rt = Arc::new(rt);

    let mut runner = WasiRunner::new()
        .with_args(Vec::<&'static str>::new())
        .with_forward_host_env(true)
        .with_capabilities(Capabilities::default());
    runner
        .add_journal(journal)
        .add_default_snapshot_triggers()
        .with_snapshot_interval(Duration::from_millis(500))
        .add_snapshot_trigger(SnapshotTrigger::Sigstop);
    runner
        .run_wasm(
            rt,
            "hello",
            &module,
            ModuleHash::from_bytes([0, 1, 2, 3, 4, 5, 6, 7]), // i am too lazy to calculate actual hash, sorry
            true,
        )
        .unwrap()
}
WASM module

The WASM module is built for wasm32-wasi in Rust and then processed with wasm-opt --asyncify -o ./output.wasm ./target/blah/blah/result.wasm.

WASM module:

fn main() {
    // allocate 15 kB of memory with a repeating pattern so it is easy to
    // spot in the snapshot
    let mut v = Vec::<u8>::new();
    let mut byte: u8 = 0;
    for _ in 0..15000 {
        v.push(byte);
        byte = if byte == 255 { 0 } else { byte + 1 }
    }

    // just waste some time
    for i in 0..10 {
        println!("{}", i);
        std::thread::sleep(Duration::from_millis(200));
    }

    for i in v {
        print!("{}", i)
    }

    // FirstStdin snapshot trigger
    let mut s = String::new();
    stdin().read_line(&mut s).unwrap();
    println!("you typed: {}", s);
}

On a related note: the only way I found to send Sigstop to WasmProcess is to set up WASI manually rather than use WasiRunner or run_with_store_ext (since those don't give you direct access to WasiEnv). However, after looking at the code, it seems the part responsible for restoring the snapshot is not exported, so if I set up the WASI env manually I don't have a clean way of restoring from the snapshot.

EDIT: formatting

EDIT 2: by the way, it's a little weird that I have to call add_journal twice in test3_wasi_runner. I didn't look into it much, but without the first add_journal it won't work, because enable_journal in one of the internal structs will be false (I don't remember the exact details right now).

EDIT 3: I figured out the issue with the crash in example 2 (see code above). Wasmer assumes that the __stack_pointer global is exported and tries to access it, which is not the case for Rust builds, where __stack_pointer is mutable but not exported.

To summarize:

  • Any plans on creating examples for this feature in the future?
  • What am I doing wrong in the examples above?
  • Would it be possible to make the functions responsible for restoring the VM from a snapshot public, so that manually setting up WASI is an option for my use case?
  • ...or alternatively, is it possible to send Sigstop to a WASI process started with WasiRunner?
  • Are there plans to handle __stack_pointer in Rust (see EDIT 3)?

Some more findings:

  • WasiEnvBuilder::snapshot_interval seems to be unused.

@dobegor
Author

dobegor commented Jan 8, 2024

[quotes @MaratBR's comment above in full]

@john-sharratt could you please help (sorry for pinging but since you're the original author of the journaling PR it might be our best shot)?

@theduke
Contributor

theduke commented Jan 8, 2024

Hopefully @john-sharratt can provide more context.

Regarding the stack pointer, I think that is provided only in wasix builds.

@theduke
Contributor

theduke commented Jan 8, 2024

For capturing the stack to work, the module needs to be compiled with wasix (see cargo-wasix / wasix.org), because it uses asyncify to enable unwinding and rewinding.

@john-sharratt
Contributor

john-sharratt commented Jan 9, 2024

The first release of the journaling capability we merged was scope-limited to a number of use cases, in particular the DCGI runner. The reason we could not keep going with it is that the PR was becoming huge, so we had to merge it as-is, with a stable test pass rate.

That means the DProxy functionality will close out the remaining capability, which is now being worked on here:
https://github.com/wasmerio/wasmer/tree/dproxy

In hindsight, the journaling functionality was of course going to attract quite some interest given its capability, so I can understand why you are jumping on it now before GA. That is a good thing: the more hands on it, the faster everyone gets the capability. But it does mean you are ahead on some of the use cases and will hit problems that others won't.

On the use case you have explicitly linked, there are some restrictions on how you build the app for it to work properly (which we should probably detect at runtime and issue warnings about).

  1. It must be compiled with WASIX (cargo wasix build)
  2. It must have the asyncify step applied after it (wasm-opt --asyncify in.wasm -o out.wasm)
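The two build steps above can be sketched as a short shell pipeline. The crate name and output paths below are illustrative (not taken from this thread), and the exact target directory depends on the cargo-wasix version, so check your target/ tree:

```shell
# Sketch of the build pipeline described in steps 1 and 2; crate name and
# paths are placeholders, not something prescribed in this thread.

# 1. Compile with WASIX instead of plain wasm32-wasi
cargo wasix build --release

# 2. Apply the asyncify transform so the runtime can unwind/rewind the stack
#    (newer cargo-wasix releases may run this step automatically for
#    release builds; see the follow-up comments below)
wasm-opt --asyncify \
    -o ./output.wasm \
    ./target/wasm32-wasmer-wasi/release/my_app.wasm
```
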

When compiled that way, it will export the globals it needs to snapshot threads (in particular the main thread). While main-thread snapshotting (which is not needed for DCGI) should work, it has not been tested nearly as much as the more basic use case, due to the scope limitation of focusing on DCGI.

Some known issues when attempting to use journaling on full-blown apps:
a. Atomic waits block execution, preventing snapshots from being taken (this will require some changes to WASIX).
b. It will currently only restore the main thread; other spawned threads are not restored (i.e. it does not support multithreading).

Looking at the code you posted, the std::thread::sleep call actually translates to a poll_oneoff syscall, which does not use the WASM atomic wait logic, so your code should avoid problem (a). Further, you are not spawning threads, so you should avoid problem (b). Given this, your use case should in theory work if you follow steps 1 and 2 above and compile it to WASIX.

If you would like to run at the absolute cutting edge, you can jump over to the DProxy branch, where I'm developing fixes for problems (a) and (b) and building a new runner that will allow traditional (WASIX) apps to work with journals. The testing path for this branch is a bit more complex, but the hard bits are already in place (code-wise).

I do see your point about the snapshot triggers being a problem; I will have to look into this some more and brainstorm some ideas, as having many robust and simple ways to trigger a snapshot is going to be important for making this as useful as possible.

For a direct answer to the questions you asked:

Any plans on creating examples for this feature in the future?

Yes, more complex use cases will be added to the DProxy branch, along with examples that the guys will publish when an article about the feature comes out.

What am I doing wrong in examples above?

It looks like you need to do steps 1 and 2 when compiling.

Would it be possible to make functions responsible for restoring the VM from a snapshot public, so that manually setting up WASI is an option for my use case?

Yeah, that makes sense. If you want to have a go at a PR, we can review it and get it in earlier (otherwise I'll drop it in my PR at a later date).

Journaling does work with WASI, but it will not be able to save or restore thread state, as only WASIX has the extensions that make this possible. In practice, that means WASI with journaling is mainly useful for saving and restoring the file system, which is all that DCGI uses it for at the moment. The good news is that WASIX is fully backwards compatible with WASI, so you can take WASI code and straight up compile it to WASIX.

...or alternatively, is it possible to send Sigstop to a WASI process started with WasiRunner?

This will be possible in the future. The triggers for this were added as placeholders, but they are not all wired up yet; it's not a big difficulty to add signal hooks and wire that up. If you want to have a crack at a PR you can; otherwise I'll take a look at it in the DProxy PR.

Are there plans to handle __stack_pointer in Rust (see EDIT 3)?

The stack pointer is used to unwind the stack when capturing it from memory; the Wasmer runtime needs that global in order to know where in memory the stack starts and stops. When compiling to WASI, that global is not exported, but when compiling to WASIX it is.
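One way to check whether a particular build actually exports the global is to dump the module's export section. The command below is my suggestion (using wasm-objdump from the wabt toolchain), not something prescribed in this thread:

```shell
# List the module's exports; a WASIX build should include __stack_pointer,
# while a plain wasm32-wasi build typically will not.
# (wasm-objdump ships with the wabt toolchain; -x dumps section details,
# -j restricts the dump to the named section.)
wasm-objdump -x -j Export ./output.wasm | grep stack_pointer
```
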

WasiEnvBuilder::snapshot_interval seems to be unused.

It seems this is not wired up properly yet; I will look into it in the DProxy PR.

@theduke
Contributor

theduke commented Jan 11, 2024

Just to clarify, cargo wasix build will run wasm-opt with the asyncify step automatically.

No need to do that manually.

@john-sharratt
Contributor

Just to clarify, cargo wasix build will run wasm-opt with the asyncify step automatically.

No need to do that manually.

I think that step only runs in release builds (cargo wasix build --release).
That's not very intuitive for users though, so perhaps that should change.

@MaratBR

MaratBR commented Jan 12, 2024

I see, thank you for your responses. After following the steps you provided, it works. Some triggers still do not, but as you said they are placeholders. I was testing with SnapshotTrigger::FirstStdin and it worked fine (except when I was using WasiRunner, when it didn't, for some reason). I'll probably wait until the feature becomes more stable though; I am not in a rush.

@john-sharratt
Contributor

@MaratBR good news: we've made some excellent progress on this. It's not quite mainline-ready yet, but the main parts are there. You should be able to use single-threaded applications, take snapshots, and resume them.
#4462

I will update again when multi-threading is implemented and the patch hits mainline.

@Wulfheart

@john-sharratt Thanks for the clarification.
Is there any timeframe for this? No pressure from my side, I would just like to plan my master thesis accordingly.

@john-sharratt
Contributor

@Wulfheart it will probably go into master this week, I suspect.

@Wulfheart

@john-sharratt Thanks for the clarification. I am looking forward to this.

@Wulfheart

Do you have any docs on how this works?

@john-sharratt
Contributor

john-sharratt commented Mar 21, 2024

@Wulfheart it's primarily initiated in the CLI, which has some limited documentation in the help commands.

There is also a document here that describes it at a high level:
https://github.com/wasmerio/wasmer/blob/master/docs/journal.md
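Based on the linked journal.md, CLI usage looks roughly like the sketch below. The flag names here are an assumption on my part and may differ between wasmer versions, so confirm them with `wasmer run --help` before relying on this:

```shell
# Hypothetical sketch: run a module while recording state to a journal file,
# taking a periodic snapshot. Flag names are assumptions; verify against
# your wasmer version's help output.
wasmer run ./output.wasm --journal ./app.journal --snapshot-period 2000

# Running again with the same journal should restore the saved state
# and continue execution from it.
wasmer run ./output.wasm --journal ./app.journal
```
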

@john-sharratt
Contributor

This is now merged into master: #4462.

It will go out in the next release.

@CodeDoctorDE

Hi,
I just saw this issue but couldn't find any restore, save, or similar methods in the wasmer API (https://docs.rs/wasmer/4.2.8/wasmer/index.html).
Is there any example of how to use it?
