Skip to content

Conversation

krizex
Copy link
Member

@krizex krizex commented Feb 6, 2018

  1. Implement network sriov create/destroy
  • Call idl interface to enable/disable network sriov when plug/unplug
  • Update pci status after sriov plug/unplug
  1. Check SRIOV pci compatibility before adding it to network
  2. When disabling sriov, only call networkd when all the PIFs that use same
    driver in the host has disabled sriov.
  3. Add UT for vlan/bond/tunnel/network_sriov

Signed-off-by: Yang Qian yang.qian@citrix.com


This change is Reviewable

@krizex krizex requested review from mseri and robhoes February 6, 2018 03:23
@krizex krizex changed the title Private/yangqi/cp 25699 CP-25699 Network sriov create/destroy Feb 6, 2018
@krizex
Copy link
Member Author

krizex commented Feb 6, 2018

Reviewable is really good at diffing, it automatically ignore the white space changes, so changes of nm.ml is clearer than github.
P.S. github could also ignore the white space while diffing by adding ?w=1 to the end of the url.
P.P.S. Chrome extension github-mate also could achieve the job, see camsong/chrome-github-mate#5

@robhoes
Copy link
Member

robhoes commented Feb 6, 2018

The commit message of the first commit is not quite what I see in the patch:

  1. Implement network sriov create/destroy
    OK
  • Call idl interface to enable/disable network sriov when plug/unplug
    Where is this? Anyway, this should be a separate patch. Hmmm.... the title of the second sounds like it...
  • Update pci status after sriov plug/unplug
    Where is this? I see some helpers but nothing call them. Should also be a separate patch.
  1. Check SRIOV pci compatibility before adding it to network
    OK
  1. When disabling sriov, only call networkd when all the PIFs that use same
    driver in the host has disabled sriov.
    Where is this? Should also be a separate patch.

Reviewed 5 of 6 files at r1.
Review status: 5 of 14 files reviewed at latest revision, 13 unresolved discussions.


ocaml/idl/datamodel.ml, line 576 at r1 (raw file):

  error Api_errors.network_sriov_disable_failed ["PIF"; "msg"]
    ~doc:"Failed to disable SR-IOV on PIF" ();
  error Api_errors.network_sriov_pci_vendor_not_compatible ["PIF"; "network"]

Some comments on this in the code below.


ocaml/xapi/xapi_network_sriov.ml, line 15 at r1 (raw file):

 *)

open Xapi_network_sriov_helpers

As said before, prefer to not routinely open modules.


ocaml/xapi/xapi_network_sriov.ml, line 26 at r1 (raw file):

  let sriov_uuid = Uuid.to_string (Uuid.make_uuid ()) in
  let logical_PIF = Ref.make () in
  let mTU = Db.PIF.get_MTU ~__context ~self:physical_PIF in

Let's pass in the record, which you've already got from the DB below, to avoid all these additional DB calls.


ocaml/xapi/xapi_network_sriov.ml, line 35 at r1 (raw file):

    ~ip_configuration_mode:`None ~iP:"" ~netmask:"" ~gateway:"" ~dNS:"" ~bond_slave_of:Ref.null
    ~vLAN_master_of:Ref.null ~management:false ~other_config:[] ~disallow_unplug:false
    ~ipv6_configuration_mode:`None ~iPv6:[""] ~ipv6_gateway:"" ~primary_address_type:`IPv4 ~managed:true

iPv6:[] seems better... we are not consistent unfortunately, and [""] is used in some places.... but it is just weird.


ocaml/xapi/xapi_network_sriov.ml, line 46 at r1 (raw file):

  let pif_rec = Db.PIF.get_record ~__context ~self:pif in
  Xapi_pif_helpers.sriov_is_allowed_on_pif ~__context ~physical_PIF:pif ~pif_rec;
  if pif_rec.API.pIF_sriov_physical_PIF_of <> [] then

Why is this not part of sriov_is_allow_on_pif? Seems like the same category of checks.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 66 at r1 (raw file):

    )

let need_operate_pci_device ~__context ~self =

Where are this function and the ones above used? It doesn't seem to be part of this patch, or something else is missing.

And what does the function name mean?


ocaml/xapi/xapi_network_sriov_helpers.ml, line 112 at r1 (raw file):

  (get_device_info pif1) = (get_device_info pif2)

let assert_sriov_pif_compatible_with_network ~__context ~pif ~network =

This deserves a comment that explains the checks and the logic behind them.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 118 at r1 (raw file):

    begin
      match Db.PIF.get_sriov_logical_PIF_of ~__context ~self:logical_pif with
      | [] -> raise (Api_errors.Server_error(Api_errors.network_incompatible_purposes, [Ref.string_of network]));

That error code is used for something else.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 120 at r1 (raw file):

      | [] -> raise (Api_errors.Server_error(Api_errors.network_incompatible_purposes, [Ref.string_of network]));
      | sriov :: _ ->
        let exist_pif = Db.Network_sriov.get_physical_PIF ~__context ~self:sriov in

existing_pif


ocaml/xapi/xapi_network_sriov_helpers.ml, line 122 at r1 (raw file):

        let exist_pif = Db.Network_sriov.get_physical_PIF ~__context ~self:sriov in
        if not (is_device_underneath_same_type ~__context pif exist_pif) then
          raise (Api_errors.Server_error(Api_errors.network_sriov_pci_vendor_not_compatible, [Ref.string_of pif; Ref.string_of network]))

You can write raise Api_errors.(Server_error (network_sriov_pci_vendor_not_compatible, ..........)).

The syntax M.(...) locally opens module M in between the ().

Also, it's not only about the PCI vendor, but also the device type. I'd call it network_has_incompatible_pifs.


ocaml/xapi/xapi_pif_helpers.ml, line 127 at r1 (raw file):

    raise (Api_errors.Server_error (Api_errors.cannot_add_tunnel_to_sriov_logical, [Ref.string_of transport_PIF]))
  | VLAN_untagged _ :: tl when List.exists (fun x -> match x with Network_sriov_logical _ -> true | _ -> false) tl ->
      raise (Api_errors.Server_error (Api_errors.cannot_add_tunnel_to_vlan_on_sriov_logical, [Ref.string_of transport_PIF]))

These error codes occur in a variety of styles, and we have never had good naming guidelines. This one looks inconsistent with some of the ones below, but consistent with some others. Oh well...


ocaml/xapi/xapi_vlan.ml, line 58 at r1 (raw file):

    raise (Api_errors.Server_error (Api_errors.vlan_tag_invalid, [Int64.to_string tag]));

  let device = pif_rec.API.pIF_device in

In future, please do this sort of "cleanup" in a separate commit.


ocaml/xapi-consts/api_errors.ml, line 135 at r1 (raw file):

let cannot_add_vlan_to_bond_slave = "CANNOT_ADD_VLAN_TO_BOND_SLAVE"
let cannot_add_tunnel_to_bond_slave = "CANNOT_ADD_TUNNEL_TO_BOND_SLAVE"
let cannot_add_tunnel_to_sriov_logical = "CANNOT_ADD_TUNNEL_TO_SRIOV"

_LOGICAL (below as well)


Comments from Reviewable

@robhoes
Copy link
Member

robhoes commented Feb 6, 2018

Reviewed 1 of 6 files at r1, 1 of 2 files at r2.
Review status: 6 of 14 files reviewed at latest revision, 16 unresolved discussions.


ocaml/xapi/nm.ml, line 317 at r2 (raw file):

      | Network_sriov_logical _ :: _ ->
        Xapi_network_sriov_helpers.sriov_bring_up ~__context ~self:pif
      | VLAN_untagged _ :: tl when List.exists (fun x -> match x with Network_sriov_logical _ -> true | _ -> false) tl ->

Is it ever not just | VLAN_untagged _ :: Network_sriov_logical sriov_logical_pif :: _ ->? Then we can delete the when clause and the seven lines below.


ocaml/xapi/nm.ml, line 329 at r2 (raw file):

      | _ ->
        let dbg = Context.string_of_task __context in
        let rc = Db.PIF.get_record ~__context ~self:pif in

Remove this line and call the one on top of this diff rc.


ocaml/xapi/nm.ml, line 485 at r2 (raw file):

      match get_pif_topo ~__context ~pif_rec with
      | Network_sriov_logical _ :: _ ->
        Xapi_network_sriov_helpers.sriov_bring_down ~__context ~self:pif

Do we also disable SRIOV if there are still VLAN PIFs with currently_attached = true on it? For the normal bridge case, we leave the bridge is place even if the underlying PIF gets unplugged and its currently_attached becomes false.


ocaml/xapi/nm.ml, line 486 at r2 (raw file):

      | Network_sriov_logical _ :: _ ->
        Xapi_network_sriov_helpers.sriov_bring_down ~__context ~self:pif
      | VLAN_untagged _ :: tl when List.exists (fun x -> match x with Network_sriov_logical _ -> true | _ -> false) tl ->

As above, I think we can simplify this match case.


Comments from Reviewable

@krizex
Copy link
Member Author

krizex commented Feb 7, 2018

I have reorganize the commits make it clearer for you to review the change.


Review status: 6 of 14 files reviewed at latest revision, 17 unresolved discussions.


ocaml/idl/datamodel.ml, line 576 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Some comments on this in the code below.

Done.


ocaml/xapi/nm.ml, line 317 at r2 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Is it ever not just | VLAN_untagged _ :: Network_sriov_logical sriov_logical_pif :: _ ->? Then we can delete the when clause and the seven lines below.

That's because we have to consider the vlan on vlan scenario.
See https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/xapi_vlan.ml#L58


ocaml/xapi/nm.ml, line 329 at r2 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Remove this line and call the one on top of this diff rc.

Done.


ocaml/xapi/nm.ml, line 485 at r2 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Do we also disable SRIOV if there are still VLAN PIFs with currently_attached = true on it? For the normal bridge case, we leave the bridge is place even if the underlying PIF gets unplugged and its currently_attached becomes false.

In case of unplugging sriov logical PIF which has vlan on top of it, we have to handle the scenario as unplugging tunnel transport PIF, so we should call unplug on vlan PIF. Besides, we have to handle the case when plugging the vlan PIF which on top of a sriov logical PIF, so we have to plug the sriov logical PIF before that.


ocaml/xapi/nm.ml, line 486 at r2 (raw file):

Previously, robhoes (Rob Hoes) wrote…

As above, I think we can simplify this match case.

See reply above


ocaml/xapi/xapi_network_sriov.ml, line 15 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

As said before, prefer to not routinely open modules.

Done.


ocaml/xapi/xapi_network_sriov.ml, line 26 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Let's pass in the record, which you've already got from the DB below, to avoid all these additional DB calls.

Done.


ocaml/xapi/xapi_network_sriov.ml, line 35 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

iPv6:[] seems better... we are not consistent unfortunately, and [""] is used in some places.... but it is just weird.

Done.


ocaml/xapi/xapi_network_sriov.ml, line 46 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Why is this not part of sriov_is_allow_on_pif? Seems like the same category of checks.

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 66 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Where are this function and the ones above used? It doesn't seem to be part of this patch, or something else is missing.

And what does the function name mean?

This function is called in sriov_bring_down.
sriov_bring_down is called by Nm.bring_pif_down which in the next patch, but I will reorganize the patches to make this changes in one patch.
This function is used to judge whether we should call networkd to disable SRIOV on the pci device. So require_operation_on_pci_device maybe better.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 112 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

This deserves a comment that explains the checks and the logic behind them.

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 118 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

That error code is used for something else.

I will introduce a new error network_is_not_sriov_compatible


ocaml/xapi/xapi_network_sriov_helpers.ml, line 120 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

existing_pif

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 122 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

You can write raise Api_errors.(Server_error (network_sriov_pci_vendor_not_compatible, ..........)).

The syntax M.(...) locally opens module M in between the ().

Also, it's not only about the PCI vendor, but also the device type. I'd call it network_has_incompatible_pifs.

Yes, and I will update the Api_errors.(....) to the style you mentioned.


ocaml/xapi/xapi_pif_helpers.ml, line 127 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

These error codes occur in a variety of styles, and we have never had good naming guidelines. This one looks inconsistent with some of the ones below, but consistent with some others. Oh well...

:-(


ocaml/xapi/xapi_vlan.ml, line 58 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

In future, please do this sort of "cleanup" in a separate commit.

Done.


ocaml/xapi-consts/api_errors.ml, line 135 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

_LOGICAL (below as well)

👍


Comments from Reviewable

@krizex krizex force-pushed the private/yangqi/CP-25699 branch from 21163f5 to 3f13491 Compare February 7, 2018 08:37
@krizex
Copy link
Member Author

krizex commented Feb 7, 2018

Review status: 1 of 15 files reviewed at latest revision, 16 unresolved discussions.


ocaml/xapi/nm.ml, line 485 at r2 (raw file):

Previously, krizex (Yang Qian) wrote…

In case of unplugging sriov logical PIF which has vlan on top of it, we have to handle the scenario as unplugging tunnel transport PIF, so we should call unplug on vlan PIF. Besides, we have to handle the case when plugging the vlan PIF which on top of a sriov logical PIF, so we have to plug the sriov logical PIF before that.

Code updated accordingly ^


Comments from Reviewable

@robhoes
Copy link
Member

robhoes commented Feb 7, 2018

Reviewed 13 of 13 files at r7, 7 of 7 files at r8.
Review status: 7 of 15 files reviewed at latest revision, 9 unresolved discussions.


ocaml/xapi/nm.ml, line 317 at r2 (raw file):

Previously, krizex (Yang Qian) wrote…

That's because we have to consider the vlan on vlan scenario.
See https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/xapi_vlan.ml#L58

But we can't do VLAN-on-VLAN-on-SRIOV right?


ocaml/xapi/xapi_network_sriov.ml, line 56 at r8 (raw file):

let destroy ~__context ~self =
  let logical_PIF = Db.Network_sriov.get_logical_PIF ~__context ~self in
  if Db.PIF.get_VLAN_slave_of ~__context ~self:logical_PIF <> [] then

This looks like assert_no_vlans from xapi_pif.ml. Move that to the helpers as well? You don't need the new API error then.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 27 at r8 (raw file):

  (get_device_info pif1) = (get_device_info pif2)

(* SRIOV PIF can only join the network which is empty or all of the existing PIFs of it are SRIOV PIFS *)

...and the PIFs need to be identical PCI devices.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 31 at r8 (raw file):

  match Db.Network.get_PIFs ~__context ~self:network with
  | [] -> ()
  | logical_pif :: _ ->

Strictly speaking, this seems not fully reliable, because we don't block the network from being used by VLAN.create etc. We may ignore that now, and consider making those other create functions stricter in future.


ocaml/xapi-consts/api_errors.ml, line 135 at r1 (raw file):

Previously, krizex (Yang Qian) wrote…

👍

Below as well ;)


ocaml/xapi-consts/api_errors.ml, line 105 at r8 (raw file):

let network_sriov_disable_failed = "NETWORK_SRIOV_DISABLE_FAILED"
let network_is_not_sriov_compatible = "NETWORK_IS_NOT_SRIOV_COMPATIBLE"
let network_has_incompatible_sriov_pifs = "NETWORK_SRIOV_PCI_VENDOR_NOT_COMPATIBLE"

These strings really must match exactly (apart from upper- vs. lowercase), otherwise things get very confusing.


Comments from Reviewable

@robhoes
Copy link
Member

robhoes commented Feb 7, 2018

Reviewed 2 of 2 files at r9.
Review status: 8 of 15 files reviewed at latest revision, 16 unresolved discussions.


ocaml/xapi/nm.ml, line 317 at r2 (raw file):

Previously, robhoes (Rob Hoes) wrote…

But we can't do VLAN-on-VLAN-on-SRIOV right?

I mean: we shouldn't try to handle the VLAN-on-VLAN case for SRIOV. I'm not sure why it is there at all, but given that it is off by default and enabled by FIST point, it is most likely something for testing purposes. We should just block that case for SRIOV and simplify the code below. I have no idea how a VLAN inside a VLAN is supposed to work anyway.

I've looked this up: the FIST thing was put in there 9 years ago, when fixing the bug that allowed one to put a VLAN on a VLAN. I can find any xenrt test code that uses it. So let's delete it and simply never allow VLANs on a VLAN (in a separate patch please ;).


ocaml/xapi/nm.ml, line 486 at r2 (raw file):

Previously, krizex (Yang Qian) wrote…

See reply above

And mine :)


ocaml/xapi/xapi_network_sriov_helpers.ml, line 28 at r9 (raw file):

      | Modprobe_successful_requires_reboot -> `modprobe, true
    in
    let sriov = List.hd (Db.PIF.get_sriov_logical_PIF_of ~__context ~self) in

I'm always nervous about List.hd. I know that it "should not" raise an exception here, but if something gets really out of sync for some reason, the error that is reported is really opaque. It is better to do a match and raise an internal_error with a better explanation in case of a [].


ocaml/xapi/xapi_network_sriov_helpers.ml, line 37 at r9 (raw file):

  let device = Db.PIF.get_device ~__context ~self in
  begin
    match Net.Sriov.enable "enable_sriov" ~name:device with

You need to pass in a debug string from the context. Search for let dbg = elsewhere.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 40 at r9 (raw file):

    | Ok result -> update_sriov_with_result result
    | Error error ->
      Db.PIF.set_currently_attached ~__context ~self ~value:false;

If SRIOV was already on, does then an error here still mean that it is now off?


ocaml/xapi/xapi_network_sriov_helpers.ml, line 43 at r9 (raw file):

      raise (Api_errors.(Server_error (network_sriov_enable_failed, [Ref.string_of self; error])))
  end;
  update_sriovs ~__context

Is this from another patch again?


ocaml/xapi/xapi_network_sriov_helpers.ml, line 47 at r9 (raw file):

let require_operation_on_pci_device ~__context ~self =
  let is_sriov_enabled ~pif =
    match Db.PIF.get_sriov_logical_PIF_of ~__context ~self:pif with

Can you pass in the network_sriov ref from sriov_bring_down?


ocaml/xapi/xapi_network_sriov_helpers.ml, line 50 at r9 (raw file):

    | [] -> false
    | sriov :: _ ->
      Db.PIF.get_currently_attached ~__context ~self:pif = true || Db.Network_sriov.get_requires_reboot ~__context ~self:sriov = true

= true is a bit redundant.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 53 at r9 (raw file):

  in
  if is_sriov_enabled ~pif:self then begin
    let sriov = List.hd (Db.PIF.get_sriov_logical_PIF_of ~__context ~self) in

Again getting the network_sriov ref?


ocaml/xapi/xapi_network_sriov_helpers.ml, line 79 at r9 (raw file):

          is_sriov_enabled ~pif:pif_ref
        )
      |> List.length

Perhaps (=) [self]?


ocaml/xapi/xapi_network_sriov_helpers.ml, line 85 at r9 (raw file):

let sriov_bring_down ~__context ~self =
  let sriov = List.hd (Db.PIF.get_sriov_logical_PIF_of ~__context ~self) in

See above about List.hd.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 91 at r9 (raw file):

    let open Network_interface in
    let device = Db.PIF.get_device ~__context ~self in
    match Net.Sriov.disable "disable_sriov" ~name:device with

See above about the dbg arg.


Comments from Reviewable

@robhoes
Copy link
Member

robhoes commented Feb 7, 2018

Review status: 8 of 15 files reviewed at latest revision, 17 unresolved discussions.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 24 at r10 (raw file):

let pci_path x = Printf.sprintf "/sys/bus/pci/devices/%s/physfn" x

let update_sriovs ~__context =

Ah, there it is now :)

Please add a comment on top to explain what this does.

Is there any reason not to put this inside xapi_pci.ml, and do this stuff as part of update_pcis? There is something similar happening there already, and there seems to be overlap with the AMD MXGPU case (mxgpu_set_phys_fn_ref). There seems to be scope for merging this. At least, I don't think that there is anything fundamentally different (at this level) between network SRIOV and MXGPU.


Comments from Reviewable

@robhoes
Copy link
Member

robhoes commented Feb 7, 2018

Reviewed 1 of 1 files at r10.
Review status: 9 of 15 files reviewed at latest revision, 17 unresolved discussions.


Comments from Reviewable

@krizex
Copy link
Member Author

krizex commented Feb 8, 2018

Review status: 9 of 15 files reviewed at latest revision, 18 unresolved discussions.


ocaml/xapi/nm.ml, line 317 at r2 (raw file):

Previously, robhoes (Rob Hoes) wrote…

I mean: we shouldn't try to handle the VLAN-on-VLAN case for SRIOV. I'm not sure why it is there at all, but given that it is off by default and enabled by FIST point, it is most likely something for testing purposes. We should just block that case for SRIOV and simplify the code below. I have no idea how a VLAN inside a VLAN is supposed to work anyway.

I've looked this up: the FIST thing was put in there 9 years ago, when fixing the bug that allowed one to put a VLAN on a VLAN. I can find any xenrt test code that uses it. So let's delete it and simply never allow VLANs on a VLAN (in a separate patch please ;).

OK, I will update code accordingly, and create another PR to delete the vlan on vlan switch.


ocaml/xapi/nm.ml, line 486 at r2 (raw file):

Previously, robhoes (Rob Hoes) wrote…

And mine :)

Done.


ocaml/xapi/xapi_network_sriov.ml, line 56 at r8 (raw file):

Previously, robhoes (Rob Hoes) wrote…

This looks like assert_no_vlans from xapi_pif.ml. Move that to the helpers as well? You don't need the new API error then.

Yes, we can take some logic from assert_no_vlans to create a new function to handle this stuff.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 27 at r8 (raw file):

Previously, robhoes (Rob Hoes) wrote…

...and the PIFs need to be identical PCI devices.

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 31 at r8 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Strictly speaking, this seems not fully reliable, because we don't block the network from being used by VLAN.create etc. We may ignore that now, and consider making those other create functions stricter in future.

Yes, if use create network linker object with XC, this code would not be a problem, but if with xe, anything could happen because we don't restrict that before.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 28 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

I'm always nervous about List.hd. I know that it "should not" raise an exception here, but if something gets really out of sync for some reason, the error that is reported is really opaque. It is better to do a match and raise an internal_error with a better explanation in case of a [].

Yes, and actually we have met some problem caused by List.hd in xcp-networkd, and the backtrace is not enough to find the place where the exception was raised. Any idea? Does that means Backtrace only create backtrace for Api_errors.Server_error?

    570 Jan 25 09:01:59 localhost xcp-networkd: [error|localhost.localdomain|104 ||backtrace] org.xen.xapi.xenops.classic events D:85d496198c5e failed with exception (Failure hd)
    571 Jan 25 09:01:59 localhost xcp-networkd: [error|localhost.localdomain|104 ||backtrace] Raised (Failure hd)
    572 Jan 25 09:01:59 localhost xcp-networkd: [error|localhost.localdomain|104 ||backtrace] 1/1 xcp-networkd @ localhost.localdomain Raised at file (Thread 104 has no backtrace table. Was

ocaml/xapi/xapi_network_sriov_helpers.ml, line 37 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

You need to pass in a debug string from the context. Search for let dbg = elsewhere.

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 40 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

If SRIOV was already on, does then an error here still mean that it is now off?

yes


ocaml/xapi/xapi_network_sriov_helpers.ml, line 43 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Is this from another patch again?

Yes, when splitting the change I have noticed that but sorry I forget to add a hint here.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 47 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Can you pass in the network_sriov ref from sriov_bring_down?

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 50 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

= true is a bit redundant.

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 53 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Again getting the network_sriov ref?

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 79 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Perhaps (=) [self]?

Yes it could, will update.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 85 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

See above about List.hd.

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 91 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

See above about the dbg arg.

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 24 at r10 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Ah, there it is now :)

Please add a comment on top to explain what this does.

Is there any reason not to put this inside xapi_pci.ml, and do this stuff as part of update_pcis? There is something similar happening there already, and there seems to be overlap with the AMD MXGPU case (mxgpu_set_phys_fn_ref). There seems to be scope for merging this. At least, I don't think that there is anything fundamentally different (at this level) between network SRIOV and MXGPU.

I will add comment for it.
This function do the same work as Xapi_pgpu.mxgpu_set_phys_fn_ref does, but merge them together will impact vgpu code, so I'd like to keep this function in this PR, and create another PR to merge the stuff into xapi_pci.ml meanwhile change xapi_pgpu.ml andxapi_network_sriov_helpers.ml.


ocaml/xapi-consts/api_errors.ml, line 135 at r1 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Below as well ;)

Done.


ocaml/xapi-consts/api_errors.ml, line 105 at r8 (raw file):

Previously, robhoes (Rob Hoes) wrote…

These strings really must match exactly (apart from upper- vs. lowercase), otherwise things get very confusing.

Oh I forgot to update them simultaneously.


Comments from Reviewable

@krizex krizex force-pushed the private/yangqi/CP-25699 branch from 3f13491 to e236b6e Compare February 8, 2018 05:16
@mseri
Copy link
Contributor

mseri commented Feb 8, 2018

ocaml/xapi/xapi_pif_helpers.ml, line 69 at r12 (raw file):

  with
  | Some v, _ -> v
  | None, _ -> raise (Api_errors.(Server_error (internal_error, [Printf.sprintf "Cannot calculate PIF type of %s" pif_rec.API.pIF_uuid])))

Here and in all the raise calls below you can get rid of a pair of brackets:

raise Api_errors.(Server_error (cannot_add_vlan_to_bond_slave, [Ref.string_of tagged_PIF]))

Comments from Reviewable

@mseri
Copy link
Contributor

mseri commented Feb 8, 2018

ocaml/xapi/xapi_network_sriov_helpers.ml, line 17 at r13 (raw file):

open Network
open Db_filter_types
module D = Debug.Make(struct let name="xapi" end)

why not name="xapi_network_sriov_helpers"? This is a bit inconsistent in xapi at the moment, with some modules following a convention or the other, but using the file name can ease the beginning of the denugging


Comments from Reviewable

@mseri
Copy link
Contributor

mseri commented Feb 8, 2018

ocaml/xapi/nm.ml, line 26 at r13 (raw file):

open Network
open Network_interface
open Xapi_pif_helpers

Can we avoid opening it in the toplevel and do local openings instead?


Comments from Reviewable

@robhoes
Copy link
Member

robhoes commented Feb 8, 2018

Reviewed 1 of 6 files at r11, 13 of 13 files at r12, 2 of 2 files at r13.
Review status: 8 of 15 files reviewed at latest revision, 5 unresolved discussions.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 28 at r9 (raw file):

Previously, krizex (Yang Qian) wrote…

Yes, and actually we have met some problem caused by List.hd in xcp-networkd, and the backtrace is not enough to find the place where the exception was raised. Any idea? Does that means Backtrace only create backtrace for Api_errors.Server_error?

    570 Jan 25 09:01:59 localhost xcp-networkd: [error|localhost.localdomain|104 ||backtrace] org.xen.xapi.xenops.classic events D:85d496198c5e failed with exception (Failure hd)
    571 Jan 25 09:01:59 localhost xcp-networkd: [error|localhost.localdomain|104 ||backtrace] Raised (Failure hd)
    572 Jan 25 09:01:59 localhost xcp-networkd: [error|localhost.localdomain|104 ||backtrace] 1/1 xcp-networkd @ localhost.localdomain Raised at file (Thread 104 has no backtrace table. Was

If the exception is marked with Backtrace.is_important at the right place, then it should work. But in any case, we should never see Failure hd anywhere; we can always avoid it by using a match.


Comments from Reviewable

@mseri
Copy link
Contributor

mseri commented Feb 8, 2018

ocaml/xapi/xapi_pif_helpers.ml, line 172 at r14 (raw file):

    debug "PIF has associated VLANs: [ %s ]"
      (String.concat
         ("; ")

You don't need parentheses here nor in (vlan) below


Comments from Reviewable

@mseri
Copy link
Contributor

mseri commented Feb 8, 2018

Review status: 8 of 15 files reviewed at latest revision, 7 unresolved discussions.


ocaml/xapi/nm.ml, line 322 at r15 (raw file):

        Db.PIF.set_currently_attached ~__context ~self:pif ~value:currently_attached
      | _ ->
      let dbg = Context.string_of_task __context in

This function needs to be reindented or is reviewable that is ignoring whitespaces?


ocaml/xapi/test_bond.ml, line 1 at r15 (raw file):

Why so many new lines here?


ocaml/xapi-consts/api_errors.ml, line 100 at r15 (raw file):

let network_sriov_already_enabled = "NETWORK_SRIOV_ALREADY_ENABLED"
let network_sriov_find_pf_from_vf_failed = "NETWORK_SRIOV_FIND_PF_FROM_VF_FAILED"
let network_sriov_pif_has_vf_inuse = "NETWORK_SRIOV_PIF_HAS_VF_INUSE"

It seems that this error is not used anywhere


Comments from Reviewable

@krizex
Copy link
Member Author

krizex commented Feb 8, 2018

Review status: 8 of 15 files reviewed at latest revision, 7 unresolved discussions.


ocaml/xapi/nm.ml, line 26 at r13 (raw file):

Previously, mseri (Marcello Seri) wrote…

Can we avoid opening it in the toplevel and do local openings instead?

Done.


ocaml/xapi/nm.ml, line 322 at r15 (raw file):

Previously, mseri (Marcello Seri) wrote…

This function needs to be reindented or is reviewable that is ignoring whitespaces?

Good eye!


ocaml/xapi/test_bond.ml, line 1 at r15 (raw file):

Previously, mseri (Marcello Seri) wrote…

Why so many new lines here?

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 28 at r9 (raw file):

Previously, robhoes (Rob Hoes) wrote…

If the exception is marked with Backtrace.is_important at the right place, then it should work. But in any case, we should never see Failure hd anywhere; we can always avoid it by using a match.

Got it.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 17 at r13 (raw file):

Previously, mseri (Marcello Seri) wrote…

why not name="xapi_network_sriov_helpers"? This is a bit inconsistent in xapi at the moment, with some modules following a convention or the other, but using the file name can ease the beginning of the denugging

Good suggestion, I will also do it to xapi_pif_helpers.ml


ocaml/xapi/xapi_pif_helpers.ml, line 69 at r12 (raw file):

Previously, mseri (Marcello Seri) wrote…

Here and in all the raise calls below you can get rid of a pair of brackets:

raise Api_errors.(Server_error (cannot_add_vlan_to_bond_slave, [Ref.string_of tagged_PIF]))

Done.


ocaml/xapi/xapi_pif_helpers.ml, line 172 at r14 (raw file):

Previously, mseri (Marcello Seri) wrote…

You don't need parentheses here nor in (vlan) below

Done.


ocaml/xapi-consts/api_errors.ml, line 100 at r15 (raw file):

Previously, mseri (Marcello Seri) wrote…

It seems that this error is not used anywhere

Yes.


Comments from Reviewable

@mseri
Copy link
Contributor

mseri commented Feb 8, 2018

ocaml/xapi/xapi_network_sriov.ml, line 15 at r16 (raw file):

 *)

module D = Debug.Make(struct let name="xapi" end)

We could use "xapi_newtork_sriov" here


Comments from Reviewable

@mseri
Copy link
Contributor

mseri commented Feb 8, 2018

Reviewed 1 of 6 files at r1, 1 of 7 files at r8, 2 of 13 files at r12, 1 of 6 files at r15, 4 of 6 files at r16.
Review status: 9 of 15 files reviewed at latest revision, 4 unresolved discussions.


ocaml/xapi/xapi_pif.ml, line 805 at r16 (raw file):

  Xapi_pif_helpers.assert_pif_is_managed ~__context ~self;
  let pif_rec = Db.PIF.get_record ~__context ~self in
  let _ = match Xapi_pif_helpers.get_pif_type pif_rec with

Let's be stricter and use let () = here


ocaml/xapi/xapi_pif_helpers.ml, line 172 at r14 (raw file):

Previously, krizex (Yang Qian) wrote…

Done.

This is not important, but I just realised that we could write it more readably with pipes:

vlans
|> List.map (fun self -> Db.VLAN.get_uuid ~__context ~self)
|> String.concat "; "
|> debug "PIF has associated VLANs: [ %s ]"

Comments from Reviewable

@krizex
Copy link
Member Author

krizex commented Feb 8, 2018

Review status: 9 of 15 files reviewed at latest revision, 3 unresolved discussions.


ocaml/xapi/xapi_network_sriov.ml, line 15 at r16 (raw file):

Previously, mseri (Marcello Seri) wrote…

We could use "xapi_newtork_sriov" here

Done. So I will keep name="xapi" for xapi_pif_helpers.ml to keep it consistent with xapi_pif.ml


ocaml/xapi/xapi_pif.ml, line 805 at r16 (raw file):

Previously, mseri (Marcello Seri) wrote…

Let's be stricter and use let () = here

Done.


ocaml/xapi/xapi_pif_helpers.ml, line 172 at r14 (raw file):

Previously, mseri (Marcello Seri) wrote…

This is not important, but I just realised that we could write it more readably with pipes:

vlans
|> List.map (fun self -> Db.VLAN.get_uuid ~__context ~self)
|> String.concat "; "
|> debug "PIF has associated VLANs: [ %s ]"

Yes, that's clearer


Comments from Reviewable

@krizex krizex force-pushed the private/yangqi/CP-25699 branch from df4004d to 80d6820 Compare February 8, 2018 14:38
@robhoes
Copy link
Member

robhoes commented Feb 8, 2018

Review status: 6 of 15 files reviewed at latest revision, 6 unresolved discussions.


ocaml/xapi/xapi_network_sriov.ml, line 15 at r16 (raw file):

Previously, mseri (Marcello Seri) wrote…

We could use "xapi_newtork_sriov" here

"xapi_network_sriov", yes


ocaml/xapi/xapi_network_sriov_helpers.ml, line 46 at r14 (raw file):

    with Not_found ->
      error "Failed to find physical function of vf %s" (Db.PCI.get_uuid ~__context ~self:vf);
      raise (Api_errors.(Server_error (network_sriov_find_pf_from_vf_failed, [Ref.string_of vf])))

Do we really need to fail the API call here? In any case, it should be an internal_error, because the error condition would probably imply a bug in our software. But logging the error, and carrying on with the rest, is probably enough.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 54 at r14 (raw file):

      let pfs, vfs = List.partition (fun pci -> is_phy_fn ~pci) local_pcis in
      (* set physical function for vfs *)
      List.iter (fun vf -> if Db.PCI.get_physical_function ~__context ~self:vf = Ref.null then set_phy_fn ~vf ~pfs) vfs;

This is very inefficient: it does DB.PCI.get_pci_id O(pfs * vfs) times, plus another DB-get for all vfs. And that's after doing the same DB-get O(pfs + vfs) times in the List.partition. And then some more DB gets for all pfs below. And that for every SR-IOV PIF that is plugged. On a pool slave with many PCI devices that could get quite bad.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 60 at r14 (raw file):

          Db.PCI.set_functions ~__context ~self:pf ~value:(Int64.of_int (vf_cnt + 1))
        ) pfs
    )

Overall, I don't quite like this function, because of its inefficiency, and because it is essentially the same as (part of) mxgpu_vf_setup+is_local_gpu+mxgpu_set_phys_fn_ref. The GPU one is already in the wrong place, but adding this one here makes it worse. This is entirely at the PCI level, so update_pcis should do it, and it can do it much more efficiently. We can do this as a separate PR, but then please remove this patch from this PR.


Comments from Reviewable

@robhoes
Copy link
Member

robhoes commented Feb 8, 2018

I am happy with this now, apart from the commit that updates the PCIs, which I think we should remove from this PR and do separately. Also, the last commit still needs to be squashed. We can merge it after both are done.


Reviewed 1 of 1 files at r14, 6 of 6 files at r15, 3 of 6 files at r16, 4 of 4 files at r17.
Review status: all files reviewed at latest revision, 7 unresolved discussions.


ocaml/xapi/test_vlan.ml, line 106 at r15 (raw file):

       Xapi_vlan.create ~__context ~tagged_PIF:untagged_PIF ~tag ~network:vlan_network2)

let test_create_pif_is_vlan_master_on_sriov () =

You didn't add this to tests below, so it won't get run.


Comments from Reviewable

@krizex
Copy link
Member Author

krizex commented Feb 9, 2018

OK, will do


Review status: all files reviewed at latest revision, 5 unresolved discussions.


ocaml/xapi/test_vlan.ml, line 106 at r15 (raw file):

Previously, robhoes (Rob Hoes) wrote…

You didn't add this to tests below, so it won't get run.

Done.


ocaml/xapi/xapi_network_sriov.ml, line 15 at r16 (raw file):

Previously, robhoes (Rob Hoes) wrote…

"xapi_network_sriov", yes

Done.


ocaml/xapi/xapi_network_sriov_helpers.ml, line 46 at r14 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Do we really need to fail the API call here? In any case, it should be an internal_error, because the error condition would probably imply a bug in our software. But logging the error, and carrying on with the rest, is probably enough.

ok


ocaml/xapi/xapi_network_sriov_helpers.ml, line 54 at r14 (raw file):

Previously, robhoes (Rob Hoes) wrote…

This is very inefficient: it does DB.PCI.get_pci_id O(pfs * vfs) times, plus another DB-get for all vfs. And that's after doing the same DB-get O(pfs + vfs) times in the List.partition. And then some more DB gets for all pfs below. And that for every SR-IOV PIF that is plugged. On a pool slave with many PCI devices that could get quite bad.

Exactly, I will update the DB access complexity to O(pfs + vfs)


ocaml/xapi/xapi_network_sriov_helpers.ml, line 60 at r14 (raw file):

Previously, robhoes (Rob Hoes) wrote…

Overall, I don't quite like this function, because of its inefficiency, and because it is essentially the same as (part of) mxgpu_vf_setup+is_local_gpu+mxgpu_set_phys_fn_ref. The GPU one is already in the wrong place, but adding this one here makes it worse. This is entirely at the PCI level, so update_pcis should do it, and it can do it much more efficiently. We can do this as a separate PR, but then please remove this patch from this PR.

Yes, will do.


Comments from Reviewable

1. remove redundant Api_errors
2. fix typo error
3. slightly refine xapi_vlan.ml

Signed-off-by: Yang Qian <yang.qian@citrix.com>
1. implement Network sriov create/destroy
2. consider vlan on network sriov scenario when:
   + plug vlan pif on top of network sriov
   + unplug network sriov pif underneath vlan pif
3. check SRIOV compatibility before adding it to network

Signed-off-by: Yang Qian <yang.qian@citrix.com>
1. call networkd to enable/disable sriov in bring_pif_up/bring_pif_down
2. when disabling sriov, only call networkd when all of PIFs that use same
driver in the host has disabled sriov

Signed-off-by: Yang Qian <yang.qian@citrix.com>
Signed-off-by: Yang Qian <yang.qian@citrix.com>
@krizex krizex force-pushed the private/yangqi/CP-25699 branch from 80d6820 to 9fdc3fa Compare February 9, 2018 03:19
@krizex
Copy link
Member Author

krizex commented Feb 9, 2018

Updated.
Separate the update_pci change to another PR.


Review status: 1 of 15 files reviewed at latest revision, 4 unresolved discussions.


Comments from Reviewable

@robhoes
Copy link
Member

robhoes commented Feb 9, 2018

:lgtm:


Reviewed 14 of 14 files at r18, 7 of 7 files at r19, 2 of 2 files at r20, 6 of 6 files at r21.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@robhoes robhoes merged commit 2df4154 into xapi-project:sr-iov Feb 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants