ZOS disastrous performance on PCIE 4 NVME SSD #1467

Open
archit3kt opened this issue Nov 3, 2021 · 12 comments

Labels
type_bug Something isn't working

Comments

@archit3kt

Hello, following this forum post with no news for a month, I thought it would be a better idea to create this issue here.

Quick summary: ZOS has a terrible PCIe 4 SSD performance issue. Here are some fio test results on the current ZOS (a sketch of the fio invocations follows the numbers):

Random read 4k blocks: 12.4 MB/s, 4142 IOPS
Random write 4k blocks: 13.3 MB/s, 4489 IOPS
Sequential read 2MB blocks: 1316 MB/s, 864 IOPS
Sequential write 2MB blocks: 2326 MB/s, 1528 IOPS
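
The exact fio command lines are not included in the thread; a minimal sketch of invocations that would produce this kind of random 4k and sequential 2MB numbers (file path, size, queue depths and runtime are assumptions) might look like:

```bash
# Hedged reproduction sketch: the exact parameters used in the thread were not
# posted, so file path, size, queue depth and runtime below are assumptions.

# Random read, 4k blocks, direct I/O to bypass the guest page cache
fio --name=randread --filename=/data/fio.test --size=4G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --group_reporting

# Sequential write, 2MB blocks
fio --name=seqwrite --filename=/data/fio.test --size=4G \
    --rw=write --bs=2M --iodepth=8 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --group_reporting
```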

I made some tests on the same machine with Ubuntu 20 and kernel 5.4: same results.

Fortunately, performance is very good on Ubuntu when switching to 5.10.x kernels 👍

Random read 4k blocks: 1855 MB/s, 488,000 IOPS
Random write 4k blocks: 563 MB/s, 144,000 IOPS
Sequential read 2MB blocks: 6728 MB/s, 3360 IOPS
Sequential write 2MB blocks: 6271 MB/s, 3132 IOPS

This answer was given to me:

"It’s not kernel related, if you run fio on your root filesystem of the container, you hit 0-fs , which is not made to be fast, specially for random read/write.

I get it, 0-fs is not meant to be fast, but being this slow would still be a big problem for a machine which only has one container running... I tried to deploy a flist with a data container mounted at /data and ran fio again; the results were strictly identical. I'm pretty sure the issue is kernel related.

Could you have a look please? I cannot start hosting production workloads with such terrible IO performance...

@archit3kt
Author

New tests made on zos v3.0.1-rc3: better, but still way below what I should get.

Tests done on the rootfs of an Ubuntu zMachine:

Random read 4k blocks: 790 MB/s, 200,000 IOPS
Random write 4k blocks: 116 MB/s, 30,000 IOPS
Sequential read 2MB blocks: 1850 MB/s, 900 IOPS
Sequential write 2MB blocks: 900 MB/s, 450 IOPS

Note the performance regression on sequential writes...

Tests on disks added to the zMachine and mounted on /data:

Random read 4k blocks: 630 MB/s, 160,000 IOPS
Random write 4k blocks: 190 MB/s, 50,000 IOPS
Sequential read 2MB blocks: 1200 MB/s, 600 IOPS
Sequential write 2MB blocks: 290 MB/s, 140 IOPS

It doesn't make sense! If added disks are supposed to get native NVMe SSD performance, there is clearly a problem somewhere! Could someone please explain how the storage framework in zos v3 works?

@xmonader
Collaborator

xmonader commented Nov 7, 2021

@maxux please take a look at it

@xmonader xmonader added this to To do in 3.0 via automation Nov 7, 2021
@xmonader xmonader added this to the next milestone Nov 7, 2021
@xmonader xmonader added the type_bug Something isn't working label Nov 7, 2021
@muhamadazmy
Member

I tried to deploy a flist with a data container mounted at /data and ran fio again; the results were strictly identical. I'm pretty sure the issue is kernel related.

Just to be clear about this part: do you mean that you mounted a volume under /data in the container and then ran the fio tests on that location, /data?

@muhamadazmy
Member

For V3 all container workloads are virtualized, which means all IO actually goes through the virtio driver. This explains the drop in performance.

What happens behind the scenes for V3 (roughly sketched in commands after the list):

  • When creating a zmount (or a disk), a raw disk file is allocated on the SSD (the disk is formatted as btrfs)
  • The disk is then attached to the cloud-hypervisor process as a raw disk
  • The disk is auto-mounted on the configured location in your deployment
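
Roughly, those host-side steps could be reproduced by hand like this; the paths, sizes and the cloud-hypervisor invocation are illustrative assumptions, not the actual zos code path:

```bash
# 1. Allocate a raw disk file on the SSD pool (the pool is formatted as btrfs)
truncate -s 50G /mnt/ssd-pool/vm-disks/disk-1.raw

# 2. Attach the raw file to the cloud-hypervisor process as a virtio block device,
#    next to the VM's root filesystem image
cloud-hypervisor \
    --kernel ./vmlinux \
    --cmdline "console=ttyS0 root=/dev/vda" \
    --disk path=./rootfs.raw path=/mnt/ssd-pool/vm-disks/disk-1.raw \
    --cpus boot=4 --memory size=4G

# 3. Inside the guest the extra disk shows up as a virtio-blk device (/dev/vdb, ...)
#    and gets mounted on the location configured in the deployment
```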

So IO operations go through these steps (a quick guest-side check is sketched after the list):

  • In the VM (the cloud-hypervisor process), the operation is handled by the btrfs module in the VM kernel
  • The disk IO operation then goes through the VirtIO driver to the host machine (ZOS)
  • The write operation is then handled again by the btrfs driver on the host
  • Finally, it is written to the physical disk
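
From inside the zmachine, the first two hops above can be checked directly; the device name and the /data mount point are assumptions based on the thread:

```bash
# The attached disk is visible as a virtio-blk device (vda, vdb, ...)
lsblk -o NAME,SIZE,TYPE

# The /data mount is a btrfs filesystem inside the guest, layered on top of the
# raw disk file that itself lives on the host's btrfs pool
findmnt -T /data -o SOURCE,FSTYPE
```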

Of course there is a lot of room for improvement, for example using logical volumes on the host so that write operations on the host are sent directly to the physical disk instead of going through another btrfs layer.
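
As an illustration of that suggestion (not something zos does today; the volume group name and sizes are invented), an LVM logical volume carved out of the NVMe device could be handed to cloud-hypervisor directly, so guest writes skip the host btrfs layer:

```bash
# Carve a logical volume out of the physical NVMe device
pvcreate /dev/nvme0n1
vgcreate vg_zos /dev/nvme0n1
lvcreate -L 50G -n vm1-data vg_zos

# Attach the LV as the VM disk instead of a raw file on btrfs; guest writes then
# reach the physical disk without passing through a second btrfs layer on the host
cloud-hypervisor \
    --kernel ./vmlinux \
    --cmdline "console=ttyS0" \
    --disk path=/dev/vg_zos/vm1-data \
    --cpus boot=4 --memory size=4G
```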

@archit3kt
Author

Just to be clear about this part: do you mean that you mounted a volume under /data in the container and then ran the fio tests on that location, /data?

Yes, you got it

Of course there is a lot of room for improvement, for example using logical volumes on the host so that write operations on the host are sent directly to the physical disk instead of going through another btrfs layer.

Thanks for the explanation. Indeed, the architectural choice you made is not the best for IO performance! It would be great to allow creating logical volumes and mounting them inside the VMs (at least for power users who'd like to get all the performance out of their hardware). I would be glad to be a tester for this use case!

For V3 all container workloads are virtualized, which means all IO actually goes through the virtio driver. This explains the drop in performance.

If I get it correctly, every ZOS deployment will be a VM in v3 (like k3s), and containers should be deployed inside the virtualized k3s?

@muhamadazmy
Member

If I get it correctly, every ZOS deployment will be a VM in v3 (like k3s), and containers should be deployed inside the virtualized k3s?

Yes, ZOS has a unified workload type called ZMACHINE which is always (under the hood) a VM. If your flist is a container (let's say an ubuntu flist), we inject a custom-built kernel+initramfs and still start the "container" as a full VM. This ensures 100% separation from the ZOS host, and control over the amount of CPU and memory allocated to your resource. The user can still perfectly access and run his processes inside this "container" normally.

When you start a k8s node on zos, it's basically a well-crafted "flist" with k8s already configured and ready to start. For ZOS it's just another VM that it runs the same way as a container (this makes the code much simpler).
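
Conceptually, booting a "container" flist as a full VM with an injected kernel and initramfs looks roughly like the sketch below; the file names are placeholders and the real zos invocation differs:

```bash
# The container flist is turned into a root filesystem image on the host,
# then booted as a regular VM with a zos-provided kernel + initramfs
cloud-hypervisor \
    --kernel ./zos-vmlinux \
    --initramfs ./zos-initramfs.img \
    --cmdline "console=ttyS0" \
    --disk path=./ubuntu-flist-rootfs.raw \
    --cpus boot=2 --memory size=2G
```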

@maxux
Contributor

maxux commented Nov 8, 2021

Which image do you run exactly? Default zos runs a 5.4 kernel; a 5.10 is also available.
Can you give me the node id?

@archit3kt
Author

My first post was done with kernel 5.4 on grid v2.
My second post was done with the latest zos for grid v3; I saw kernel 5.12 inside the VMs.

Node id is 68, IP is 2a02:842a:84c8:c601:d250:99ff:fedf:924d (ICMP is blocked, but the IPv6 firewall allows everything else).

@maxux
Contributor

maxux commented Nov 8, 2021

I confirm, your node is running the 5.10.55 kernel, which is the latest we support officially.
The limitation is probably the VM, like Azmy said, yep.

@archit3kt
Author

archit3kt commented Nov 24, 2021

FYI, I automated my fio tests and launched them simultaneously on X ubuntu VMs.

With 4 VMs, each VM gets exactly the same results as a run with only 1 VM.

I see per-VM performance degradation when I launch the test on 8 VMs.

My guess is that it is a virtio limitation; it would be good to know if you make some performance tweaks someday.

Still, sequential write is disastrous with virtio, and I don't have a clue why...
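
For reference, the kind of automation described above (not the author's actual script; host names and paths are made up) can be as simple as fanning the same fio job out over SSH and waiting for all runs to finish:

```bash
# Launch the same random-write test on several VMs at once and collect JSON results
for vm in vm1 vm2 vm3 vm4; do
    ssh "$vm" fio --name=randwrite --filename=/data/fio.test --size=4G \
        --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio \
        --direct=1 --runtime=60 --time_based \
        --output-format=json --output=/tmp/fio-result.json &
done
wait   # all VMs run the benchmark simultaneously
```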

@muhamadazmy muhamadazmy moved this from Accepted to Backlog in 3.0 Nov 25, 2021
@muhamadazmy muhamadazmy removed this from Backlog in 3.0 Dec 28, 2021
@muhamadazmy muhamadazmy added this to Accepted in 3.0.9+ X via automation Dec 28, 2021
@muhamadazmy muhamadazmy removed this from Accepted in 3.0.9+ X Jan 13, 2022
@muhamadazmy muhamadazmy added this to Accepted in 3.1.0 via automation Jan 13, 2022
@despiegk

despiegk commented Feb 7, 2022

This will have to wait, we have other things to do first.

@despiegk despiegk removed this from Accepted in 3.1.0 Feb 7, 2022
@despiegk despiegk modified the milestones: next, later Feb 7, 2022
@xmonader xmonader removed this from the later milestone Jul 4, 2022
@xmonader xmonader added this to the 3.5.x milestone Nov 14, 2022
@amandacaster

Hello Team, can we have an update on this, please?

@muhamadazmy muhamadazmy modified the milestones: 3.6.x, later Jul 24, 2023