CP-24290: Remove sm calls & clean up VBDs on termination #18

Merged


gaborigloi
Contributor

@gaborigloi gaborigloi commented Sep 27, 2017

Now it works in case of service stop, restart, and machine reboot. The only issue is that the client hangs after reboot.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
This is equivalent to an Lwt.finalize, which is safer.

Signed-off-by: Iglói Gábor <gabor.igloi@citrix.com>
Using the temporary Block_error_printer module from the nbd 2.x library
- when we upgrade to the new Mirage block library, we can use the
Block.pp_*_error functions instead.

Signed-off-by: Iglói Gábor <gabor.igloi@citrix.com>
Otherwise users won't be able to run data_destroy on the VDI & dom0 will
run out of the max number of VBDs.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
If we do not close the block device first, the VBD.unplug call will
hang, and systemd will eventually kill the process, and the VBD will not
get cleaned up.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
src/main.ml Outdated
(fun () ->
Lwt_log.notice_f "Destroying VBD %s" vbd >>= fun () ->
Xen_api.VBD.destroy ~rpc ~session_id ~self:vbd >|= fun () ->
vbds_to_clean_up := StringSet.remove vbd !vbds_to_clean_up)
Collaborator

I don't like accessing references this way, even if it looks safe now. Can you please wrap any change with an Lwt_mutex.with_lock (https://ocsigen.org/lwt/2.7.1/api/Lwt_mutex)?
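A minimal sketch of what this suggestion could look like, assuming a `StringSet` built with `Set.Make (String)`. `vbds_to_clean_up` mirrors the reference in the diff above; `vbds_mutex`, `update_vbds`, `track_vbd` and `untrack_vbd` are hypothetical names, not taken from the PR:

```ocaml
module StringSet = Set.Make (String)

(* [vbds_to_clean_up] mirrors the reference in the diff; the mutex and the
   helper names below are hypothetical and only serve this sketch. *)
let vbds_to_clean_up = ref StringSet.empty
let vbds_mutex = Lwt_mutex.create ()

(* Every update of the reference runs while the mutex is held. *)
let update_vbds f =
  Lwt_mutex.with_lock vbds_mutex (fun () ->
      vbds_to_clean_up := f !vbds_to_clean_up;
      Lwt.return_unit)

let track_vbd vbd = update_vbds (StringSet.add vbd)
let untrack_vbd vbd = update_vbds (StringSet.remove vbd)
```

Funnelling all writes through one helper keeps the locking in a single place, so a later change cannot forget it.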

Contributor Author

Sure, thanks for the suggestion!

src/main.ml Outdated
let with_vbd ~vDI ~vM ~mode ~rpc ~session_id f =
Xen_api.VBD.create ~rpc ~session_id ~vM ~vDI ~userdevice:"autodetect" ~bootable:false ~mode ~_type:`Disk ~unpluggable:true ~empty:false ~other_config:[] ~qos_algorithm_type:"" ~qos_algorithm_params:[]
>>= fun vbd ->
vbds_to_clean_up := StringSet.add vbd !vbds_to_clean_up;
Collaborator

see below

src/main.ml Outdated
| `Error e -> Lwt.fail_with (Printf.sprintf "Unable to read %s: %s" filename (Nbd.Block_error_printer.to_string e))
| `Ok b ->
let block_uuid = Uuidm.v `V4 |> Uuidm.to_string in
Hashtbl.add blocks_to_close block_uuid b;
Collaborator

see below

src/main.ml Outdated
(fun () -> f b)
(fun () ->
Block.disconnect b >|= fun () ->
Hashtbl.remove blocks_to_close block_uuid
Collaborator

see below

In case this becomes multithreaded in the future.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
@gaborigloi
Contributor Author

@mseri Done, now I protect both references with mutexes when I update them.

@gaborigloi gaborigloi force-pushed the remove_sm_calls_signal_handler branch 11 times, most recently from a4d20f0 to 82c8a48 Compare September 28, 2017 08:12
To ensure that we'll still clean them up in case something goes wrong.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
@gaborigloi
Contributor Author

One question: should we cause the whole client handler thread to fail when persistent VBD tracking fails, or should we silently ignore these unexpected I/O errors? I think it might be OK to fail in that case, because we don't expect any errors to occur.

@mseri
Collaborator

mseri commented Sep 28, 2017

I'd like @jonludlam and @thomassa to give their opinion; I am not sure how resilient the client should be.

src/vbd_store.ml Outdated
(function
| Unix.(Unix_error (EEXIST, "mkdir", dir)) when dir = Consts.xapi_nbd_persistent_dir -> Lwt.return_unit
| e ->
Lwt_log.error_f "Failed to create directory: %s" (Printexc.to_string e)
Collaborator

Now I see what you mean. I think if we have failures here, we should let the client fail and the user/admin fix the permission/space/... issue that caused the failure

src/vbd_store.ml Outdated
let transform_vbd_list f =
Lwt_mutex.with_lock m (fun () ->
create_dir_if_doesnt_exist () >>= fun () ->
(try
Collaborator

Why try instead of Lwt.catch?

Contributor Author

Good spot. Lwt_streams are a bit unique because they are not an Lwt.t, so that's why I originally used try, but now that I convert them into a list, we do get a Lwt.t type back, so we need Lwt.catch indeed.

Collaborator

you mean that depending on the failure we need both try and catch? :O

Contributor Author

@gaborigloi gaborigloi Sep 28, 2017

Lwt.catch is always enough, as it works with normal exceptions too as far as I know :) And we need that in this case, because now we have an Lwt type. Previously I was using one stream piped into a chain of functions, and got a horrible, hard-to-track-down bug where the beginning of the stream was reading what the end of the stream was writing, and it overwrote the original file with duplicate items 🤕
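The difference between the two handlers can be sketched as follows. All four function names are made up for illustration. A plain `try` only sees exceptions raised while the expression is evaluated, whereas `Lwt.catch` also handles a promise that is rejected:

```ocaml
(* Raises before any promise is returned. *)
let raises_synchronously () : string list Lwt.t =
  failwith "raised synchronously"

(* Returns a promise that is already rejected. *)
let rejects () : string list Lwt.t = Lwt.fail (Failure "rejected promise")

(* [try ... with] only catches exceptions raised during evaluation of [f ()];
   a rejected promise passes through untouched. *)
let with_try f = try f () with _ -> Lwt.return []

(* [Lwt.catch] handles both a raise inside the callback and a rejected
   promise. *)
let with_catch f = Lwt.catch f (fun _ -> Lwt.return [])
```

With these definitions, `with_try rejects` is still a rejected promise, while `with_catch` recovers to the empty list in both cases.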

src/vbd_store.ml Outdated
create_dir_if_doesnt_exist () >>= fun () ->
(try
Lwt_io.lines_of_file Consts.vbd_list_file |> Lwt_stream.to_list
with _ -> Lwt.return [])
Collaborator

This is fine but I'd like to see a log as well

Collaborator

I don't think we should fail here, I think we should fail on write errors but be resistant to read errors (but with a log when the error is not about non-existent file)

Collaborator

On the other hand, after discussing it a bit, it may be better to fail to avoid leaking VBDs.

src/vbd_store.ml Outdated
with _ -> Lwt.return [])
>>= fun l ->
let l = f l in
Lwt_stream.of_list l |> Lwt_io.lines_to_file Consts.vbd_list_file
Collaborator

@mseri mseri Sep 28, 2017

This is not in a Lwt.catch either. Although I think we should fail on errors, so it seems OK. Let's think about whether it is worth logging the failure before failing. I don't have a strong opinion in this regard.
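For illustration, here is a synchronous, standard-library analogue of the read-transform-write step discussed in this thread. It is not the Lwt code from the PR, it does no locking, and `read_lines`, `write_lines` and `transform_list` are hypothetical names. The point it shows is that the old contents are read completely before the file is reopened for writing, that a missing file counts as an empty list, and that any other I/O error propagates to the caller:

```ocaml
(* Read every line eagerly; a missing file is treated as an empty list.
   Other I/O errors propagate as [Sys_error]. *)
let read_lines path =
  if not (Sys.file_exists path) then []
  else begin
    let ic = open_in path in
    Fun.protect
      ~finally:(fun () -> close_in ic)
      (fun () ->
        let rec loop acc =
          match input_line ic with
          | line -> loop (line :: acc)
          | exception End_of_file -> List.rev acc
        in
        loop [])
  end

(* Overwrite the file with the given lines, one per line. *)
let write_lines path lines =
  let oc = open_out path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> List.iter (fun l -> output_string oc (l ^ "\n")) lines)

(* The old list is fully in memory before the file is opened for writing. *)
let transform_list path f = write_lines path (f (read_lines path))
```

Usage would look like `transform_list path (List.filter ((<>) uuid))` to drop one entry.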

src/vbd_store.ml Outdated
Lwt_unix.file_exists Consts.vbd_list_file >>= fun exists ->
if exists then
Lwt_mutex.with_lock m (fun () ->
Lwt_io.lines_of_file Consts.vbd_list_file |> Lwt_stream.to_list)
Collaborator

Here as well, do we want to log the failures?

src/vbd_store.ml Outdated
transform_vbd_list (List.filter ((<>) vbd_uuid))

let get_all () =
Lwt_unix.file_exists Consts.vbd_list_file >>= fun exists ->
Collaborator

You are checking this because you know that nothing should delete the file; in the worst case it will be empty. Add a comment explaining this, and the fact that it rules out the race issues that would otherwise require removing this check and using a catch instead (like you do for the mkdir above).

(** Read and write a persistent list of VBD UUIDs.
These functions are thread-safe. *)

val add: string -> unit Lwt.t
Collaborator

docstrings here please

Collaborator

@mseri mseri left a comment

Looks good to me. It's messy not because of you but by necessity. Please fill the code with docstrings and comments that explain the various assumptions or choices.

@@ -0,0 +1,178 @@
open Lwt.Infix

Collaborator

Add an interface file here to hide the implementation details

@mseri
Collaborator

mseri commented Sep 28, 2017

It would be good to add unit tests. Up to you if you want to add them to this PR or in a separate one, but I expect to see them

@gaborigloi
Contributor Author

@mseri Thanks for the review, I've addressed your above comments, let me know if more documentation / comments are needed.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Collaborator

@mseri mseri left a comment

We are almost there. Just a few, minor, pedantic comments

val ignore_exn_log_error : string -> (unit -> unit Lwt.t) -> unit Lwt.t

module VBD : sig
val with_vbd :
Collaborator

Please add a docstring

Contributor Author

Done.

src/cleanup.mli Outdated
end

module Block : sig
val with_block : string -> (Block.t -> 'a Lwt.t) -> 'a Lwt.t
Collaborator

Please add a docstring

Contributor Author

Done.

src/cleanup.mli Outdated
end

module Runtime : sig
val register_signal_handler : unit -> unit
Collaborator

Please add a docstring. Which signals are handled? What will the signal handler do?

Contributor Author

Done.

src/cleanup.mli Outdated
end

module Persistent : sig
val cleanup : unit -> unit Lwt.t
Collaborator

Please add a docstring

Contributor Author

Done.

src/vbd_store.ml Outdated
let create_dir_if_doesnt_exist () =
Lwt.catch
(fun () -> Lwt_unix.mkdir Consts.xapi_nbd_persistent_dir 0o755)
(function
| Unix.(Unix_error (EEXIST, "mkdir", dir)) when dir = Consts.xapi_nbd_persistent_dir -> Lwt.return_unit
| e ->
Lwt_log.error_f "Failed to create directory: %s" (Printexc.to_string e)
(* In this we should let the client fail and the user/admin fix the permission/space/... issue that caused the failure *)
Collaborator

Reword as

(* In any other case we let the client fail. In this case the user/admin should go and fix the root cause of the issue *)

src/vbd_store.ml Outdated
let create_dir_if_doesnt_exist () =
Lwt.catch
(fun () -> Lwt_unix.mkdir Consts.xapi_nbd_persistent_dir 0o755)
(function
| Unix.(Unix_error (EEXIST, "mkdir", dir)) when dir = Consts.xapi_nbd_persistent_dir -> Lwt.return_unit
| e ->
Lwt_log.error_f "Failed to create directory: %s" (Printexc.to_string e)
(* In this we should let the client fail and the user/admin fix the permission/space/... issue that caused the failure *)
log_and_reraise_error "Failed to create directory" e
Collaborator

Add the path to the error message

src/vbd_store.ml Outdated
(function
| Unix.(Unix_error (ENOENT, "open", file)) when file = vbd_list_file -> Lwt.return []
| e ->
(* In this we should let the client fail and the user/admin fix the permission/... issue that caused the failure *)
Collaborator

Update the comment as above

src/vbd_store.ml Outdated
if exists then
Lwt_mutex.with_lock m (fun () ->
Lwt_io.lines_of_file Consts.vbd_list_file |> Lwt_stream.to_list)
Lwt.catch (fun () -> Lwt_io.lines_of_file vbd_list_file |> Lwt_stream.to_list)
Collaborator

<pedantic>Put the function on a new line as the catches above</pedantic>

Contributor Author

You're completely right, I wanted to do that but had to deal with something else so I forgot about it 😄 .

In particular, document when exceptions are thrown.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
@gaborigloi
Contributor Author

gaborigloi commented Sep 28, 2017

@mseri Thanks for the review, I think I addressed all the above comments - your comments were completely fair and helpful, actually I just pushed a commit adding all the missing docstrings you requested before reading your comments about them 😄
Please let me know if any of the comments / docstrings are incorrect / incomplete / unclear.

@gaborigloi
Contributor Author

gaborigloi commented Sep 28, 2017

@mseri Another downside is that systemd will say the process failed when it's terminated by a signal, because of the exception I throw. But a while ago (https://discuss.ocaml.org/t/lwt-how-to-catch-exceptions-raised-by-signal-handlers/565) I found that the exception is necessary to terminate the program:

...
Sep 28 17:43:00 clabuk-02-08 systemd[1]: xapi-nbd.service: main process exited, code=exited, status=1/FAILURE
Sep 28 17:43:00 clabuk-02-08 systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 17:43:00 clabuk-02-08 systemd[1]: xapi-nbd.service failed.

It's a pity that the solution using Lwt.wait + Lwt.pick does not work. It worked perfectly for all the small test programs that I've tried. It even worked for parallel background threads as long as I joined them with Lwt.join. I tried doing the same thing in xapi-nbd, building up a list of background threads and then calling Lwt.join on them, but it did not work :(. Maybe the issue is that we have a recursive Lwt thread that does not terminate. Or maybe there is something in the code that I'm missing - not sure.
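For readers unfamiliar with the pattern, this is roughly what the `Lwt.wait` + `Lwt.pick` approach looks like in a small program. It is only a sketch of the idea that worked in the small tests mentioned above, not the xapi-nbd code; `stop`, `request_stop`, `server` and `main` are hypothetical names, and the never-ending server loop is simulated with a pending `Lwt.task` promise:

```ocaml
(* [stop] is resolved by whoever wants to shut down, e.g. a signal handler
   calling [Lwt.wakeup request_stop ()]. *)
let stop, request_stop = Lwt.wait ()

(* Stand-in for a server loop that never finishes by itself; [Lwt.task]
   creates a pending promise that can be cancelled. *)
let server, _ = Lwt.task ()

(* Resolves as soon as one of the promises resolves and cancels the other. *)
let main : unit Lwt.t = Lwt.pick [ server; stop ]
```

Once `request_stop` is woken up, `main` resolves and the pending `server` promise is cancelled, so the program can return from `Lwt_main.run` without raising an exception.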

@gaborigloi
Contributor Author

gaborigloi commented Sep 28, 2017

I have tested the following scenarios, with different combinations of the various tracking mechanisms:

  • Only using Persistent tracking:
    • Two clients connecting / disconnecting, the persistent list of VBDs to clean up is the same as the list of VBDs plugged to dom0
    • Two clients connected, stop server with service xapi-nbd stop. After termination, the persistent list of VBDs to clean up is the same as the list of VBDs plugged to dom0. After service xapi-nbd start, all these VDIs are cleaned up, and both the persistent list of VBDs to clean up and the list of VBDs plugged to dom0 are empty:
    Sep 28 18:11:33 clabuk-02-08 xapi-nbd[908]: main: Starting xapi-nbd: port = '10809'; certfile = '/etc/xensource/xapi-ssl.pem'; ciphersuites = '!EXPORT:RSA+AES128-SHA256' no_tls = 'false'
    Sep 28 18:11:33 clabuk-02-08 xapi-nbd[908]: main: Checking if there are any VBDs to clean up that leaked during the previous run
    Sep 28 18:11:33 clabuk-02-08 xapi-nbd[908]: main: Cleaning up VBD with UUID c0bfe55f-f201-9db4-80e1-9b2e75dedee1
    Sep 28 18:11:33 clabuk-02-08 xapi-nbd[908]: main: Cleaning up VBD with UUID 347f2673-0b12-d27b-3e98-3546d0fffbef
    Sep 28 18:11:33 clabuk-02-08 xapi-nbd[908]: main: Initialising TLS
    
    • Same as above, but with service xapi-nbd restart.
    • Same as above, but interrupt with reboot.

I used this command to list the VBDs tracked by xapi-nbd, and the VBDs actually plugged to dom0:

cat /var/lib/xapi-nbd/VBDs_to_clean_up ; echo ------------------------; xe vbd-list vm-uuid=<UUID of dom0> params=uuid --minimal

I used the following command to watch the log output of xapi-nbd:

tail -f /var/log/daemon.log | grep xapi-nbd

When I did the reboot, I got ECONNREFUSED, probably when xapi-nbd tried to connect to xapi's Unix domain socket:

Sep 28 18:30:33 localhost xapi-nbd[2843]: Caught: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 18:30:33 localhost xapi-nbd[2843]: main: Caught unexpected exception: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 18:30:33 localhost xapi-nbd[2843]: xapi-nbd: internal error, uncaught exception:
Sep 28 18:30:33 localhost xapi-nbd[2843]: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 18:30:33 localhost xapi-nbd[2843]: Raised at file "src/core/lwt.ml", line 805, characters 16-23
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/main.ml", line 135, characters 2-181
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/cmdliner_term.ml", line 27, characters 19-24
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/cmdliner.ml", line 27, characters 27-34
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/cmdliner.ml", line 106, characters 32-39
Sep 28 18:30:33 localhost systemd[1]: xapi-nbd.service: main process exited, code=exited, status=1/FAILURE
Sep 28 18:30:33 localhost systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 18:30:33 localhost systemd[1]: xapi-nbd.service failed.
Sep 28 18:30:34 localhost systemd[1]: xapi-nbd.service holdoff time over, scheduling restart.

When I started it manually later, it worked, and it cleaned up the leaked VDIs properly.
This issue happens consistently at startup: the NBD server will try to make XenAPI calls via the Unix domain socket and will fail.

@gaborigloi
Contributor Author

By the way, does xapi notify systemd about when its startup sequence finishes? If not, I think that would be a useful thing to add in the future, probably using https://github.com/juergenhoetzel/ocaml-systemd/

@gaborigloi
Contributor Author

gaborigloi commented Sep 28, 2017

This might be relevant to the solution: https://stackoverflow.com/questions/1372080/wait-for-a-unix-domain-socket-to-be-bound?rq=1
Maybe we can just add a path dependency on the unix domain socket, because systemd path activation uses inotify under the hood, and presumably it listens to IN_CREATE. I don't see why it shouldn't work for xapi's unix domain socket, that's also a file after all. If it does not work, we can always add an inotify loop, like this: https://github.com/gaborigloi/xapi-nbd/tree/inotify

Indeed, this page says that Unix domain sockets may not be bound to an existing file, and when they are bound, the file is created: http://osr507doc.sco.com/en/netguide/dusockD.binding_names.html So this hopefully means that we can use systemd to wait for xapi to become active, because its Unix domain socket will always be freshly created. But this relies on the assumption that this socket is immediately able to receive requests without failures.

@gaborigloi
Contributor Author

Even if we make the server path-activated by both the unix domain socket of xapi and the .pem file, there are still transient errors:

Sep 28 21:42:41 localhost xapi-nbd[2605]: Caught: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 21:42:41 localhost xapi-nbd[2605]: main: Caught unexpected exception: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 21:42:41 localhost xapi-nbd[2605]: xapi-nbd: internal error, uncaught exception:
Sep 28 21:42:41 localhost xapi-nbd[2605]: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 21:42:41 localhost xapi-nbd[2605]: Raised at file "src/core/lwt.ml", line 805, characters 16-23
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/main.ml", line 135, characters 2-181
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/cmdliner_term.ml", line 27, characters 19-24
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/cmdliner.ml", line 27, characters 27-34
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/cmdliner.ml", line 106, characters 32-39
Sep 28 21:42:41 localhost systemd[1]: xapi-nbd.service: main process exited, code=exited, status=1/FAILURE
Sep 28 21:42:41 localhost systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 21:42:41 localhost systemd[1]: xapi-nbd.service failed.
Sep 28 21:42:41 localhost systemd[1]: xapi-nbd.service holdoff time over, scheduling restart.
Sep 28 21:42:41 localhost systemd[1]: Cannot add dependency job for unit lvm2-activation.service, ignoring: Unit lvm2-activation.service is masked.
Sep 28 21:42:41 localhost systemd[1]: Cannot add dependency job for unit lvm2-activation-early.service, ignoring: Unit lvm2-activation-early.service is masked.
Sep 28 21:42:41 localhost systemd[1]: start request repeated too quickly for xapi-nbd.service
Sep 28 21:42:41 localhost systemd[1]: Failed to start NBD server that exposes XenServer disks.
Sep 28 21:42:41 localhost systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 21:42:41 localhost systemd[1]: xapi-nbd.service failed.
Sep 28 21:42:41 localhost ntpd[2475]: 0.0.0.0 c615 05 clock_sync
Sep 28 21:42:45 localhost systemd[1]: Cannot add dependency job for unit lvm2-activation.service, ignoring: Unit lvm2-activation.service is masked.
Sep 28 21:42:45 localhost systemd[1]: Cannot add dependency job for unit lvm2-activation-early.service, ignoring: Unit lvm2-activation-early.service is masked.
Sep 28 21:42:45 localhost systemd[1]: Started NBD server that exposes XenServer disks.
Sep 28 21:42:45 localhost systemd[1]: Starting NBD server that exposes XenServer disks...
Sep 28 21:42:45 localhost xapi-nbd[2778]: main: Starting xapi-nbd: port = '10809'; certfile = '/etc/xensource/xapi-ssl.pem'; ciphersuites = '!EXPORT:RSA+AES128-SHA256' no_tls = 'false'
Sep 28 21:42:46 localhost xapi-nbd[2778]: main: Caught unexpected exception: Server_error(INTERNAL_ERROR, [ missing table; host;  ])
Sep 28 21:42:46 localhost xapi-nbd[2778]: xapi-nbd: internal error, uncaught exception:
Sep 28 21:42:46 localhost xapi-nbd[2778]: Server_error(INTERNAL_ERROR, [ missing table; host;  ])
Sep 28 21:42:46 localhost xapi-nbd[2778]: Raised at file "src/core/lwt.ml", line 805, characters 16-23
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/main.ml", line 135, characters 2-181
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/cmdliner_term.ml", line 27, characters 19-24
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/cmdliner.ml", line 27, characters 27-34
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/cmdliner.ml", line 106, characters 32-39
Sep 28 21:42:46 localhost systemd[1]: xapi-nbd.service: main process exited, code=exited, status=1/FAILURE
Sep 28 21:42:46 localhost systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 21:42:46 localhost systemd[1]: xapi-nbd.service failed.
Sep 28 21:42:46 localhost systemd[1]: xapi-nbd.service holdoff time over, scheduling restart.
Sep 28 21:42:46 localhost systemd[1]: Started NBD server that exposes XenServer disks.
Sep 28 21:42:46 localhost systemd[1]: Starting NBD server that exposes XenServer disks...
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Starting xapi-nbd: port = '10809'; certfile = '/etc/xensource/xapi-ssl.pem'; ciphersuites = '!EXPORT:RSA+AES128-SHA256' no_tls = 'false'
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Checking if there are any VBDs to clean up that leaked during the previous run
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Initialising TLS
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Setting up server socket
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Listening for incoming connections

So it seems that xapi is not yet completely ready when its Unix domain socket is created (Sep 28 21:42:46 localhost xapi-nbd[2778]: Server_error(INTERNAL_ERROR, [ missing table; host; ])). So I'll just add a loop to the startup of xapi-nbd that will try logging in until it succeeds, up to a timeout, to ensure that xapi is up when we need it.
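A rough sketch of such a startup loop, under stated assumptions: it bounds the number of attempts rather than elapsed time, it takes the login call and the delay as arguments, and every name in it (`retry_until_ok`, `attempt`, `sleep`, `max_attempts`) is hypothetical rather than taken from xapi-nbd:

```ocaml
(* Call [attempt] until it returns [Ok _] or [max_attempts] calls have failed.
   [sleep] is injected so the sketch does not depend on a particular
   scheduler; all names here are hypothetical. *)
let retry_until_ok ~max_attempts ~sleep attempt =
  let rec go n =
    match attempt () with
    | Ok v -> Ok v
    | Error e when n >= max_attempts -> Error e
    | Error _ ->
      sleep ();
      go (n + 1)
  in
  go 1
```

In an Lwt program the same shape would use a promise-returning `attempt` and a non-blocking delay instead of the blocking calls assumed here.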

@mseri
Collaborator

mseri commented Sep 29, 2017

I'd like to see this discussion in an issue and the fix in a different PR; this one is already big enough, and the socket mechanism predates it anyway.

@gaborigloi gaborigloi merged commit 6f64c4b into xapi-project:master Sep 29, 2017
@gaborigloi gaborigloi deleted the remove_sm_calls_signal_handler branch September 29, 2017 10:04

module Persistent = struct
let cleanup () =
local_login () >>= fun (rpc, session_id) ->
Contributor Author

@gaborigloi gaborigloi Sep 29, 2017

One thing that I'm not sure about is whether I should log out of these local sessions. It's not a disaster if I don't do that, xapi would just invalidate the oldest sessions of the server, because we supply the originator name when we log in.

Collaborator

I think it is better to log out.

Contributor Author

Will do in the PR.
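One way to make sure the logout always happens is a small bracket built on `Lwt.finalize`. This is a sketch only: `with_session` is a hypothetical helper, and `login` and `logout` are stand-ins for the real session calls, passed in as arguments so the example stays self-contained:

```ocaml
(* [login] and [logout] are hypothetical stand-ins for the XenAPI session
   calls; they are passed in so the sketch stays self-contained. *)
let with_session ~login ~logout f =
  Lwt.bind (login ()) (fun session ->
      Lwt.finalize
        (fun () -> f session)
        (fun () -> logout session))
```

Because `Lwt.finalize` runs its second argument whether `f` resolves or fails, the session is released on both paths.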

@gaborigloi
Contributor Author

gaborigloi commented Sep 29, 2017

@mseri raised #20 to fix the startup and log out of the local session

@gaborigloi
Contributor Author

Added a unit test for the Vbd_store in #21
