-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CP-24290: Remove sm calls & clean up VBDs on termination #18
CP-24290: Remove sm calls & clean up VBDs on termination #18
Conversation
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
This is equivalent to an Lwt.finalize, which is safer. Signed-off-by: Iglói Gábor <gabor.igloi@citrix.com>
Using the temporary Block_error_printer module from the nbd 2.x library - when we upgrade to the new Mirage block library, we can use the Block.pp_*_error functions instead. Signed-off-by: Iglói Gábor <gabor.igloi@citrix.com>
Otherwise users won't be able to run data_destroy on the VDI & dom0 will run out of the max number of VBDs. Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
If we do not close the block device first, the VBD.unplug call will hang, and systemd will eventually kill the process, and the VBD will not get cleaned up. Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
src/main.ml
Outdated
(fun () -> | ||
Lwt_log.notice_f "Destroying VBD %s" vbd >>= fun () -> | ||
Xen_api.VBD.destroy ~rpc ~session_id ~self:vbd >|= fun () -> | ||
vbds_to_clean_up := StringSet.remove vbd !vbds_to_clean_up) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like accessing references this way, even if it looks safe now. Can you please wrap any change with an Lwt_mutex.with_lock
(https://ocsigen.org/lwt/2.7.1/api/Lwt_mutex)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks for the suggestion!
src/main.ml
Outdated
let with_vbd ~vDI ~vM ~mode ~rpc ~session_id f = | ||
Xen_api.VBD.create ~rpc ~session_id ~vM ~vDI ~userdevice:"autodetect" ~bootable:false ~mode ~_type:`Disk ~unpluggable:true ~empty:false ~other_config:[] ~qos_algorithm_type:"" ~qos_algorithm_params:[] | ||
>>= fun vbd -> | ||
vbds_to_clean_up := StringSet.add vbd !vbds_to_clean_up; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see below
src/main.ml
Outdated
| `Error e -> Lwt.fail_with (Printf.sprintf "Unable to read %s: %s" filename (Nbd.Block_error_printer.to_string e)) | ||
| `Ok b -> | ||
let block_uuid = Uuidm.v `V4 |> Uuidm.to_string in | ||
Hashtbl.add blocks_to_close block_uuid b; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see below
src/main.ml
Outdated
(fun () -> f b) | ||
(fun () -> | ||
Block.disconnect b >|= fun () -> | ||
Hashtbl.remove blocks_to_close block_uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see below
In case this becomes multithreaded in the future. Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
@mseri Done, now I protect both references with mutexes when I update them. |
a4d20f0
to
82c8a48
Compare
To ensure that we'll still clean them up in case something goes wrong. Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
82c8a48
to
371b868
Compare
One question: should we cause the whole client handler thread to fail when persistent VBD tracking fails? Or should we silently ignore these unexpected I/O errors. I think it might be ok to fail in that case, because we don't expect any errors to occur? |
I'd like @jonludlam and @thomassa to give their opinion, I am not sure how resilient should be the client |
src/vbd_store.ml
Outdated
(function | ||
| Unix.(Unix_error (EEXIST, "mkdir", dir)) when dir = Consts.xapi_nbd_persistent_dir -> Lwt.return_unit | ||
| e -> | ||
Lwt_log.error_f "Failed to create directory: %s" (Printexc.to_string e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I see what you mean. I think if we have failures here, we should let the client fail and the user/admin fix the permission/space/... issue that caused the failure
src/vbd_store.ml
Outdated
let transform_vbd_list f = | ||
Lwt_mutex.with_lock m (fun () -> | ||
create_dir_if_doesnt_exist () >>= fun () -> | ||
(try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why try
instead of Lwt.catch
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot. Lwt_stream
s are a bit unique because they are not an Lwt.t
, so that's why I originally used try
, but now that I convert them into a list, we do get a Lwt.t
type back, so we need Lwt.catch
indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean that depending on the failure we need both try and catch? :O
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lwt.catch
is always enough, as it works with normal exceptions too as far as I know :) And we need that in this case, because now we have an Lwt type. Previously I was using one stream piped into a chain of functions, and got a horrible, hard-to-track-down bug where the beginning of the stream was reading what the end of the stream was writing, and it overwrote the original file with duplicate items 🤕
src/vbd_store.ml
Outdated
create_dir_if_doesnt_exist () >>= fun () -> | ||
(try | ||
Lwt_io.lines_of_file Consts.vbd_list_file |> Lwt_stream.to_list | ||
with _ -> Lwt.return []) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine but I'd like to see a log as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should fail here, I think we should fail on write errors but be resistant to read errors (but with a log when the error is not about non-existent file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, after discussing it a bit, it may be better to fail to avoid leaking VBDs.
src/vbd_store.ml
Outdated
with _ -> Lwt.return []) | ||
>>= fun l -> | ||
let l = f l in | ||
Lwt_stream.of_list l |> Lwt_io.lines_to_file Consts.vbd_list_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not in a Lwt.catch
either. Although I think we should fail on errors so it seems ok. Let's think if it is worth logging the failure before failing or not. I don't have a strong opinion in regard
src/vbd_store.ml
Outdated
Lwt_unix.file_exists Consts.vbd_list_file >>= fun exists -> | ||
if exists then | ||
Lwt_mutex.with_lock m (fun () -> | ||
Lwt_io.lines_of_file Consts.vbd_list_file |> Lwt_stream.to_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well, do we want to log the failures?
src/vbd_store.ml
Outdated
transform_vbd_list (List.filter ((<>) vbd_uuid)) | ||
|
||
let get_all () = | ||
Lwt_unix.file_exists Consts.vbd_list_file >>= fun exists -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are checking this because you know that nothing should delete the file, in the worst case it will be empty. Add a comment explaining this and the fact that this prevents races issues that would require to remove this check and instead use a catch (like you do for the make dir above)
src/vbd_store.mli
Outdated
(** Read and write a persistent list of VBD UUIDs. | ||
These functions are thread-safe. *) | ||
|
||
val add: string -> unit Lwt.t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstrings here please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. It's messy not because of you but by necessity. Please fill the code with docstrings and comments that explain the various assumptions or choices.
@@ -0,0 +1,178 @@ | |||
open Lwt.Infix | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an interface file here to hide the implementation details
It would be good to add unit tests. Up to you if you want to add them to this PR or in a separate one, but I expect to see them |
@mseri Thanks for the review, I've addressed your above comments, let me know if more documentation / comments are needed. |
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
b6c02df
to
4229a78
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are almost there. Just a few, minor, pedantic comments
val ignore_exn_log_error : string -> (unit -> unit Lwt.t) -> unit Lwt.t | ||
|
||
module VBD : sig | ||
val with_vbd : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/cleanup.mli
Outdated
end | ||
|
||
module Block : sig | ||
val with_block : string -> (Block.t -> 'a Lwt.t) -> 'a Lwt.t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/cleanup.mli
Outdated
end | ||
|
||
module Runtime : sig | ||
val register_signal_handler : unit -> unit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring. Which signals are handled? What will the signal handler do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/cleanup.mli
Outdated
end | ||
|
||
module Persistent : sig | ||
val cleanup : unit -> unit Lwt.t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/vbd_store.ml
Outdated
let create_dir_if_doesnt_exist () = | ||
Lwt.catch | ||
(fun () -> Lwt_unix.mkdir Consts.xapi_nbd_persistent_dir 0o755) | ||
(function | ||
| Unix.(Unix_error (EEXIST, "mkdir", dir)) when dir = Consts.xapi_nbd_persistent_dir -> Lwt.return_unit | ||
| e -> | ||
Lwt_log.error_f "Failed to create directory: %s" (Printexc.to_string e) | ||
(* In this we should let the client fail and the user/admin fix the permission/space/... issue that caused the failure *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reword as
(* In any other case we let the client fail. In this case the user/admin should go and fix the root cause of the issue *)
src/vbd_store.ml
Outdated
let create_dir_if_doesnt_exist () = | ||
Lwt.catch | ||
(fun () -> Lwt_unix.mkdir Consts.xapi_nbd_persistent_dir 0o755) | ||
(function | ||
| Unix.(Unix_error (EEXIST, "mkdir", dir)) when dir = Consts.xapi_nbd_persistent_dir -> Lwt.return_unit | ||
| e -> | ||
Lwt_log.error_f "Failed to create directory: %s" (Printexc.to_string e) | ||
(* In this we should let the client fail and the user/admin fix the permission/space/... issue that caused the failure *) | ||
log_and_reraise_error "Failed to create directory" e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the path to the error message
src/vbd_store.ml
Outdated
(function | ||
| Unix.(Unix_error (ENOENT, "open", file)) when file = vbd_list_file -> Lwt.return [] | ||
| e -> | ||
(* In this we should let the client fail and the user/admin fix the permission/... issue that caused the failure *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the comment as above
src/vbd_store.ml
Outdated
if exists then | ||
Lwt_mutex.with_lock m (fun () -> | ||
Lwt_io.lines_of_file Consts.vbd_list_file |> Lwt_stream.to_list) | ||
Lwt.catch (fun () -> Lwt_io.lines_of_file vbd_list_file |> Lwt_stream.to_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<pedantic>
Put the function on a new line as the catches above</pedantic>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're completely right, I wanted to do that but had to deal with something else so I forgot about it 😄 .
In particular, document when exceptions are thrown. Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
@mseri Thanks for the review, I think I addressed all the above comments - your comments were completely fair and helpful, actually I just pushed a commit adding all the missing docstrings you requested before reading your comments about them 😄 |
@mseri Another downside is that systemd will say the process failed when it's terminated by a signal, because of the exception I throw. But a while ago (https://discuss.ocaml.org/t/lwt-how-to-catch-exceptions-raised-by-signal-handlers/565) I found that the exception is necessary to terminate the program:
It's a pity that the solution using |
I have tested the following scenarios, with different combinations of the various tracking mechanisms:
I used this command to list the VBDs tracked by xapi-nbd, and the VBDs actually plugged to dom0:
I used the following command to watch the log output of xapi-nbd:
When I did the reboot, I got
When I started it manually later, it worked, and it cleaned up the leaked VDIs properly. |
By the way, does xapi notify systemd about when its startup sequence finishes? If not, I think that would be a useful thing to add in the future, probably using https://github.com/juergenhoetzel/ocaml-systemd/ |
This might be relevant to the solution: https://stackoverflow.com/questions/1372080/wait-for-a-unix-domain-socket-to-be-bound?rq=1 Indeed, this page says that Unix domain sockets may not be bound to an existing file, and when they are bound, the file is created: http://osr507doc.sco.com/en/netguide/dusockD.binding_names.html So this means hopefully that we can uses systemd to wait for xapi to become active, because its Unix domain socket will always be freshly created. But this relies on the assumption that this socket is immediately able to receive requests without failures. |
Even if we make the server path-activated by both the unix domain socket of xapi and the
So it seems that xapi is not yet completely ready when its Unix domain socket is created ( |
I'd like to see this discussion in a |
|
||
module Persistent = struct | ||
let cleanup () = | ||
local_login () >>= fun (rpc, session_id) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that I'm not sure about is whether I should log out of these local sessions. It's not a disaster if I don't do that, xapi would just invalidate the oldest sessions of the server, because we supply the originator
name when we log in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it better to log out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in the the PR.
Added a unit test for the |
Not it works in case of service stop, restart, and machine reboot. The only issue is that the client hangs after reboot.