Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smartswitch Platform Test Plan Document #12701

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

nissampa
Copy link

@nissampa nissampa commented May 2, 2024

Description of PR

The smartSwitch is a next generation of data center switch for T0/T1 roles, that now subsumes the DPU. This PR describes test cases to validate additional platform management functions such FPD, BMC, Console, Power mgmt., Health, Software upgrade, Life-cycle scenarios needed due to the presence of these DPUs in the system.

Back port request

  • 201911
  • 202012
  • 202205
  • 202305
  • 202311

Copy link

linux-foundation-easycla bot commented May 2, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@nissampa nissampa marked this pull request as draft May 2, 2024 20:51
@nissampa nissampa changed the title DPU Test Plan Document Smartswitch Test Plan Document May 3, 2024
@r12f r12f requested review from r12f and zjswhhh May 3, 2024 22:44
@r12f r12f requested a review from prgeor May 3, 2024 22:49
@KrisNey-MSFT
Copy link

We reviewed the test cases today 5/8/2024. One comment, please change to SONiC-DASH OS for the DPU, and SONiC only for the CPU/NPU :)

@nissampa nissampa marked this pull request as ready for review May 9, 2024 18:31
* The "show reboot-cause history module-name" CLI on the switch shows the history of the specified module
* Use `config chassis modules shutdown <DPU_Number>`
* Use `config chassis modules startup <DPU_Number>`
* Wait for 5 minutes for Pmon to update the dpu states
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to wait 5 minutes? what is the max time until the dpu states is updated?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Considering power on dpu, service to be up on dpu and chassis db update, we had given the 5 mins to be max limit.
  • This time limit is for initial boot up case and for subsequent operation state updates are going to be instantaneous.

@nissampa nissampa changed the title Smartswitch Test Plan Document Smartswitch Platform Test Plan Document May 14, 2024
### 1.8 Check the NTP date and timezone between DPU and NPU

#### Steps
* In Switch, under the file /etc/ntp.conf configure it to use the ntp server and restart ntp.service to configure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NTP configuration should be set via config DB. An example of the configuration is in https://github.com/sonic-net/SONiC/blob/master/doc/ntp/ntp-design.md HLD.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. This test case is just to check that both NPU and DPU are in sync with the dates. Nothing to do with any configurations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update the steps? Because the first step here is describes that the configuration will be set:

under the file /etc/ntp.conf configure it to use the ntp server and restart ntp.service


#### Steps
* In Switch, under the file /etc/ntp.conf configure it to use the ntp server and restart ntp.service to configure
* In DPU, similarly under the ntp configuration use the switches ip as ntp server and restart ntp service to configure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that SONiC on the switch should run NTP server? The support of the NTP servers is not yet integrated into the SONiC. It should be possible to configure the NTP server via Linux config files but this configuration might conflict with the NTP client configuration that SONiC supports.

If we want to run the NTP server on the switch we need to discuss this with the Microsoft team. @prgeor can you please assist?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is nothing to do with any configuration. This is to check just the date and time zones are all same both on host and dpus. Changed the test case as such.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update the steps? The steps tell opposite

## Introduction

The purpose is to test the functionality of Smartswitch.
Smartswitch is connected to dpu sleds via pcie link having two dpus per sled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have the sleds in our platform, could you please update to make it common for all vendors?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it.

### 1.4 Check DPU Console

#### Steps
* Use command `/usr/bin/picocom -b 115200 /dev/ttyS<DPU_SLOT_NUM+OFFSET>` to access console for given dpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use minicom as the console tool in our platform, maybe we don't need to specify the commands in the steps?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generalized it as serial port utility rather than command

* Currently all the interfaces gets static ip address
* Use the command `lspci -d 1dd8:1004` to list all the pcie buses for DPUs
* Ip adddress mapping to dpu - 169.254.x.2 (switch side interface) and 169.254.x.1 (DPU side interface). where x - bus number in decimal number.
* Work in progress - Dymanic assignment of ip address via dhcp
Copy link
Contributor

@congh-nvidia congh-nvidia Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the DPU IP assignment design(https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/ip-address-assigment/smart-switch-ip-address-assignment.md), looks like we shouldn't have the switch side IP interfaces but only a bridge-midplane interface?
@oleksandrivantsiv please help confirm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. The logic described here is incorrect and should be aligned with the IP address assignment HLD.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed to it. Once the above design is implemented, the test case will be changed as such. This is the temporary solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the community test plan. It should be generic and not based on the specific implementation. I can't approve such PR.

## Introduction

The purpose is to test the functionality of Smartswitch.
Smartswitch is connected to dpu sleds via pcie link having two dpus per sled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Sled' was never discussed in any of the HLD reviews. This seems to be a vendor-specific concept that should not be part of the test plan.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it.

```
On Switch:

root@sonic:/home/cisco# show platform inventory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inventory output should be based on the module implemented in the platform API. The example output includes vendor-specific concepts. The test should be generalized.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it. The display now shows only the generalized output

On Switch:

root@sonic:/home/cisco# show platform inventory
Name Product ID Version Serial Number Description
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the version column mean in this table? Where is the data for this column taken from? We don't have an API defined for this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the standard column in show platform inventory and there is no change to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The show platform inventory is not a standard command. It new command for the SONiC community that was introduced in the scope PMON HDL.

```
#### Pass/Fail Criteria
* Verify number of dpus from api and number of dpus shown in the cli output.
* Verify powered off dpus should display it as powered off.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inventory table should not display the DPU state. How will the test verify this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed, this is not to check the change of state for dpus. It is just checking the display output.
Changed the test case.

#### Pass/Fail Criteria
* Verify number of dpus from api and number of dpus shown in the cli output.
* Verify powered off dpus should display it as powered off.
* Verify all the serial numbers of the dpus that are powered on are unique.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serial numbers of the DPU should be available regardless of the DPU state. Why do we care about DPU state?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, serial number gets populated on DB when DPUs are powered on for the first time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumption was never discussed in the context of PMON HLD.

### 1.4 Check DPU Console

#### Steps
* Use command `/usr/bin/picocom -b 115200 /dev/ttyS<DPU_SLOT_NUM+OFFSET>` to access console for given dpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serial device names may differ from vendor to vendor. We should include the mapping from DPU to the serial device in the platform.json file. The test should read the device information from platform.json.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it,


#### Steps
* Use command `/usr/bin/picocom -b 115200 /dev/ttyS<DPU_SLOT_NUM+OFFSET>` to access console for given dpu
* Get starting offset of serial port for dpus using the command `cat /proc/tty/driver/serial`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumption is based on the vendor implementation. Please generalize the test case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it.

### 1.7 Check removal of pcie link between npu and dpu

#### Steps
* Use command `pcieutil generate` to generate pcie yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pcie.yaml file should be generated in the run-time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be part of the platform definition

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are use cases where user can turn off and turn on DPUs dynamically using config cli, so in such cases pcieyaml needs to be regenerated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right approach to do this. The HLD should be extended to cover this case during the shutdown of the DPU. @prgeor what do you think?

#### Steps

Existing Test case:
* Reboot using a particular command (sonic reboot, watchdog reboot, etc) (timeout 5 mins, wait 2 mins)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The boot time may differ from vendor to vendor. This information should be part of the platform.json file. Please check sonic-net/SONiC#1699 PR

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agreed and will move this info to platform.json

* Verify number of interfaces should be equal to number of dpu modules.


### 1.6 Check DPU shutdown and power up individually
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root/shutdown flows are part of sonic-net/SONiC#1699 HLD. I think the reboot and shutdown tests should be defined in that document

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is an overlap, the reboot is part of the document specified above. Here as a platform testing, the test of power on/off is done through this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HDL introduces new functionality and covers both reboot and power on/off flows. The assumptions here contradict the HDL. I'm ok to have those tests here but they should be aligned accordingly

Copy link
Contributor

@oleksandrivantsiv oleksandrivantsiv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented

@KrisNey-MSFT
Copy link

Looks like we need 1 approver to close this out?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants