-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CP-26717: Port Gpumon to PPX-based RPCs #196
Conversation
gpumon/gpumon_client.ml
Outdated
xml_url | ||
call | ||
module Client = RPC_API(Idl.GenClientExnRpc(struct let rpc=rpc end)) | ||
include Client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you decide to include the client? Just to avoid doing Gpumon_client.Client
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, although it looks like that's redundant since it's only called in xapi 4 times and always as Gpumon_client.Client
anyway, so I can change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine either way
gpumon/jbuild
Outdated
@@ -38,7 +37,7 @@ let () = Printf.ksprintf Jbuild_plugin.V1.send {| | |||
threads | |||
xcp)) | |||
(wrapped false) | |||
%s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep the coverage_rewriter
s otherwise this would break @edwintorok coverage work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, keep %s
and coverage_rewriter
in lines 41 and 56? Or just line 56?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both
gpumon/gpumon_interface.ml
Outdated
|
||
type incompatibility_reason = Host_driver | Guest_driver | GPU | Other | ||
type compatibility = Compatible | Incompatible of incompatibility_reason list | ||
(** Boolean: compatible? *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doctstring is a bit weird. This is the compatibility information for vgpu migrations returned from the Nvidia library. Vgpu and pgpu can be compatible or they can be incompatible for a number of reasons. This type encodes that behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wasn't sure what to write there.
gpumon/gpumon_interface.ml
Outdated
;"in the form `domain:bus:device.function` PCI identifier."] | ||
pgpu_address | ||
|
||
let nvidia_pgpu_metadata_p = param ~description: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could make sense, now that it is possible, to move all these nvidia_
types and errors to the Nvidia
submodule dropping the prefix and encapsulating things more appropriately. They had to be extracted only due to camlp4 generator limitations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but we need to avoid breanking coverage and it makes sense to make the interface more coherent now that we can
|
||
|
||
(** Error wrapper *) | ||
type gpu_errors = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm uncertain about is whether we need an error for compatibility issues. I assume these are already handled, but is it worth including one just in case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compatibility errors are only for the Nvidia module at the moment, we will need new ones for ATI or Intel if ever needed. How to deal with it I think is open for discussion:
It is probably not worth splitting:
type nvml_error =
| NvmlInterfaceNotAvailable
(** Exception raised when gpumon is unable to load the nvml nvidia library *)
| NvmlFailure of string
(** Exception raised by the c bindings to the nvml nvidia library*)
[@@deriving rpcty]
(** Error wrapper *)
type gpu_errors =
| Nvml of nvml_error
(** Error raised by the Nvml library bindings *)
| Gpumon_failure
(** Default exception raised upon daemon failure *)
[@@default Gpumon_failure]
[@@deriving rpcty]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't merge until all the necessary testing have been performed
gpumon/gpumon_client.ml
Outdated
@@ -17,14 +17,13 @@ open Xcp_client | |||
|
|||
let xml_url () = "file:" ^ xml_path | |||
|
|||
module Client = Gpumon_interface.Client(struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this gone, do we still need open Gpumon_interface
?
gpumon/gpumon_client.ml
Outdated
end) | ||
|
||
let rpc call = | ||
if !use_switch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather see if !Xcp_service.use_switch
and no open Xcp_client
. This module is so small that it should not need `open.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without opening modules:
let xml_url () = "file:" ^ Gpumon_interface.xml_path
let rpc call =
if !Xcp_client.use_switch
then Xcp_client.json_switch_rpc Gpumon_interface.queue_name call
else Xcp_client.xml_http_rpc
~srcstr:(Xcp_client.get_user_agent ())
~dststr:"gpumon"
xml_url
call
module Client = Gpumon_interface.RPC_API(Idl.GenClientExnRpc(struct let rpc=rpc end))
I think there's a case to be made either way, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let rpc call =
let open Xcp_client in
if !use_switch
then json_switch_rpc Gpumon_interface.queue_name call
else xml_http_rpc
~srcstr:(get_user_agent ())
~dststr:"gpumon"
xml_url
call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to be explicit. You can still to inside rpc
:
let rpc call =
let open Xcp_client in
or
module C = Xcp_client
let rpc call =
if !C.use_switch
then ..
gpumon/gpumon_interface.ml
Outdated
; description = | ||
[ "This interface is used by Xapi and Gpumon to monitor " | ||
; "physical and virtual GPUs."] | ||
; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Join to ; version = (1,0,0)
gpumon/gpumon_interface.ml
Outdated
let get_pgpu_vm_compatibility = | ||
declare "get_pgpu_vm_compatibility" | ||
["Checks compatibility between a VM's vGPU(s) and another pGPU."] | ||
(debug_info_p @-> pgpu_address_p @-> domid_p @-> nvidia_pgpu_metadata_p @-> returning compatibility_p gpu_err ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very long line; how about
( debug_info_p
@-> pgpu_address_p
@-> domid_p
@-> nvidia_pgpu_metadata_p
@-> returning compatibility_p gpu_err
)
or similar? This also provides room for comments on parameters when needed.
gpumon/gpumon_interface.ml
Outdated
let get_pgpu_vgpu_compatibility = | ||
declare "get_pgpu_vgpu_compatibility" | ||
["Checks compatibility between a pGPU (on a host) and a list of vGPUs (assigned to a VM). Note: A VM may use several vGPUs."] | ||
( debug_info_p @-> nvidia_pgpu_metadata_p @-> nvidia_vgpu_metadata_list_p @-> returning compatibility_p gpu_err ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, a very long line. Break it up in at least two lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not merge until all testing is complete and all the related PRs are approved
9e0eaef
to
6000ed6
Compare
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Please squash the fixup commits |
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com> fixup! CP-26717: Minor edits and refactoring - no global opens in Gpumon_client - reformatted Gpumon_interface - rewrote compatibility type docstring in Gpumon_interface Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
This is the first of 3 PRs to port the GPU monitoring daemon gpumon to use PPX-based code generation for RPCs. This PR updates the Gpumon interface and client to use PPX, and adds a dev/debugging CLI.
(Core BVT and Ring3 BST have passed)
PRs involved: