Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CP-26717: Port Gpumon to PPX-based RPCs #196

Merged
merged 5 commits into from
Feb 2, 2018

Conversation

minishrink
Copy link
Contributor

@minishrink minishrink commented Jan 28, 2018

This is the first of 3 PRs to port the GPU monitoring daemon gpumon to use PPX-based code generation for RPCs. This PR updates the Gpumon interface and client to use PPX, and adds a dev/debugging CLI.
(Core BVT and Ring3 BST have passed)

PRs involved:

  1. Port Gpumon IDL to PPX (this one)
  2. Update Xapi to use upgraded Gpumon interface
  3. Port Gpumon daemon

xml_url
call
module Client = RPC_API(Idl.GenClientExnRpc(struct let rpc=rpc end))
include Client
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you decide to include the client? Just to avoid doing Gpumon_client.Client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, although it looks like that's redundant since it's only called in xapi 4 times and always as Gpumon_client.Client anyway, so I can change that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine either way

gpumon/jbuild Outdated
@@ -38,7 +37,7 @@ let () = Printf.ksprintf Jbuild_plugin.V1.send {|
threads
xcp))
(wrapped false)
%s))
Copy link
Collaborator

@mseri mseri Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the coverage_rewriters otherwise this would break @edwintorok coverage work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, keep %s and coverage_rewriter in lines 41 and 56? Or just line 56?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both


type incompatibility_reason = Host_driver | Guest_driver | GPU | Other
type compatibility = Compatible | Incompatible of incompatibility_reason list
(** Boolean: compatible? *)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doctstring is a bit weird. This is the compatibility information for vgpu migrations returned from the Nvidia library. Vgpu and pgpu can be compatible or they can be incompatible for a number of reasons. This type encodes that behaviour

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't sure what to write there.

;"in the form `domain:bus:device.function` PCI identifier."]
pgpu_address

let nvidia_pgpu_metadata_p = param ~description:
Copy link
Collaborator

@mseri mseri Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could make sense, now that it is possible, to move all these nvidia_ types and errors to the Nvidia submodule dropping the prefix and encapsulating things more appropriately. They had to be extracted only due to camlp4 generator limitations.

Copy link
Collaborator

@mseri mseri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but we need to avoid breanking coverage and it makes sense to make the interface more coherent now that we can



(** Error wrapper *)
type gpu_errors =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'm uncertain about is whether we need an error for compatibility issues. I assume these are already handled, but is it worth including one just in case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compatibility errors are only for the Nvidia module at the moment, we will need new ones for ATI or Intel if ever needed. How to deal with it I think is open for discussion:

It is probably not worth splitting:

type nvml_error = 
   | NvmlInterfaceNotAvailable
   (** Exception raised when gpumon is unable to load the nvml nvidia library *)
   | NvmlFailure of string
   (** Exception raised by the c bindings to the nvml nvidia library*)
[@@deriving rpcty]

(** Error wrapper *)
 type gpu_errors =
   | Nvml of nvml_error
   (** Error raised by the Nvml library bindings *)
   | Gpumon_failure
   (** Default exception raised upon daemon failure *)
 [@@default Gpumon_failure]
[@@deriving rpcty]

Copy link
Collaborator

@mseri mseri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't merge until all the necessary testing have been performed

@@ -17,14 +17,13 @@ open Xcp_client

let xml_url () = "file:" ^ xml_path

module Client = Gpumon_interface.Client(struct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this gone, do we still need open Gpumon_interface?

end)

let rpc call =
if !use_switch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather see if !Xcp_service.use_switch and no open Xcp_client. This module is so small that it should not need `open.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without opening modules:

let xml_url () = "file:" ^ Gpumon_interface.xml_path

let rpc call =
  if !Xcp_client.use_switch
  then Xcp_client.json_switch_rpc Gpumon_interface.queue_name call
  else Xcp_client.xml_http_rpc
      ~srcstr:(Xcp_client.get_user_agent ())
      ~dststr:"gpumon"
      xml_url
      call
module Client = Gpumon_interface.RPC_API(Idl.GenClientExnRpc(struct let rpc=rpc end))

I think there's a case to be made either way, what do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let rpc call =
  let open Xcp_client in
  if !use_switch
  then json_switch_rpc Gpumon_interface.queue_name call
  else xml_http_rpc
      ~srcstr:(get_user_agent ())
      ~dststr:"gpumon"
      xml_url
      call

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to be explicit. You can still to inside rpc:

let rpc call =
  let open Xcp_client in

or

module C = Xcp_client
let rpc call =
  if !C.use_switch 
  then ..

; description =
[ "This interface is used by Xapi and Gpumon to monitor "
; "physical and virtual GPUs."]
;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Join to ; version = (1,0,0)

let get_pgpu_vm_compatibility =
declare "get_pgpu_vm_compatibility"
["Checks compatibility between a VM's vGPU(s) and another pGPU."]
(debug_info_p @-> pgpu_address_p @-> domid_p @-> nvidia_pgpu_metadata_p @-> returning compatibility_p gpu_err )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very long line; how about

( debug_info_p 
@->  pgpu_address_p 
@-> domid_p 
@-> nvidia_pgpu_metadata_p 
@-> returning compatibility_p gpu_err 
)

or similar? This also provides room for comments on parameters when needed.

let get_pgpu_vgpu_compatibility =
declare "get_pgpu_vgpu_compatibility"
["Checks compatibility between a pGPU (on a host) and a list of vGPUs (assigned to a VM). Note: A VM may use several vGPUs."]
( debug_info_p @-> nvidia_pgpu_metadata_p @-> nvidia_vgpu_metadata_list_p @-> returning compatibility_p gpu_err )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, a very long line. Break it up in at least two lines.

Copy link
Collaborator

@mseri mseri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not merge until all testing is complete and all the related PRs are approved

@coveralls
Copy link

coveralls commented Jan 30, 2018

Coverage Status

Coverage remained the same at 11.014% when pulling 8a007fa on minishrink:gpu_ppx into 8ab1ac7 on xapi-project:master.

@minishrink minishrink force-pushed the gpu_ppx branch 2 times, most recently from 9e0eaef to 6000ed6 Compare January 31, 2018 14:07
Akanksha Mathur added 3 commits February 1, 2018 11:11
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
@mseri
Copy link
Collaborator

mseri commented Feb 1, 2018

Please squash the fixup commits

@mseri mseri closed this Feb 1, 2018
@mseri mseri reopened this Feb 1, 2018
Akanksha Mathur added 2 commits February 1, 2018 14:40
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>

fixup! CP-26717: Minor edits and refactoring

- no global opens in Gpumon_client
- reformatted Gpumon_interface
- rewrote compatibility type docstring in Gpumon_interface

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
@mseri mseri merged commit bda1536 into xapi-project:master Feb 2, 2018
@minishrink minishrink deleted the gpu_ppx branch February 15, 2018 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants