
512 vs 4k block size devices under XFS #372

Closed
trgill opened this issue Sep 18, 2018 · 13 comments
Labels: omnibus (aggregator for other issues)

Comments

trgill commented Sep 18, 2018

I think XFS needs a constant block size for the underlying devices.

Write tests that:

  • mix 4k and 512 block size devices in the same pool (a loopback setup sketch follows below)
  • add a 4k block size device as a cache to a 512 block size device
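
A minimal sketch of one way to stand up such devices for a test: back two loop devices with files and give them different logical sector sizes. This assumes a util-linux losetup new enough to support --sector-size; the file paths and loop device names below are illustrative, not from a real run.

# truncate -s 1G /tmp/blk512.img /tmp/blk4096.img
# losetup --sector-size 512 --find --show /tmp/blk512.img
/dev/loop0
# losetup --sector-size 4096 --find --show /tmp/blk4096.img
/dev/loop1
# cat /sys/block/loop0/queue/logical_block_size /sys/block/loop1/queue/logical_block_size
512
4096

The two loop devices can then be used for both cases: create a pool from the pair, or create a pool from one and add the other as cache.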

tasleson commented Feb 18, 2019

I did a manual test with 2 devices in a pool, one 512 and one 4096. I created a single FS, consumed 89% of the total size of the pool, and saw no issues. The devices were loopback devices, e.g.:

# cat /sys/block/loop*/queue/hw_sector_size
4096
512

Starting to think device mapper hides this by emulating a uniform sector size (512 or 4096 bytes) across the stack, but I'm not sure.
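
One quick way to check what the device-mapper layers actually advertise (the dm-* numbering is system-specific):

# grep -H . /sys/block/dm-*/queue/logical_block_size /sys/block/dm-*/queue/physical_block_size

If the stacked devices report 4096, the block layer raised the limit to the largest member's size; if they report 512, the smaller size is being passed through.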

@tasleson

Taken from Mike's https://people.redhat.com/msnitzer/docs/io-limits.txt doc.

For instance, a 512 byte device and a 4K device may be combined into a
single logical DM device; the resulting DM device would have a
'logical_block_size' of 4K.  Filesystems layered on such a hybrid device
assume that 4K will be written atomically but in reality it will span 8
LBAs when issued to the 512 byte device.  Using a 4K 'logical_block_size'
for the higher-level DM device increases potential for a partial write
to the 512b device if there is a system crash.

If combining multiple devices' "I/O Limits" results in a conflict the
block layer may report a warning that the device is susceptible to
partial writes and/or misaligned.

We may want to prevent devices with different block sizes from being incorporated into the same pool, but we should discuss.
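
For the discussion: the combined I/O limits that the block layer settles on are easy to inspect with lsblk's topology columns, run against whatever member and stacked devices are of interest, e.g.:

# lsblk -o NAME,LOG-SEC,PHY-SEC,MIN-IO,OPT-IO,ALIGNMENT

The DM entries in the output show which limits were propagated up from the members.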


tasleson commented Feb 19, 2019

FYI: lvm allows mixing devices with different physical block sizes in a VG, and allows creating a single LV that uses both of them, with the resulting LV's physical block size reported as 4096 bytes.

Update: This has been removed: lvmteam/lvm2@0404539
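
(For anyone pre-checking their own devices before building a VG or pool: blockdev reports the logical and physical sector sizes directly. The device name and sample output below are illustrative, for a 512e drive with 512-byte logical and 4096-byte physical sectors.)

# blockdev --getss --getpbsz /dev/sdX
512
4096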

@johnsimcall

@tasleson @trgill did this discussion about allowing devices with mixed block sizes into the same pool ever happen? I have several servers with 8TB disks and 4,096 byte sector sizes. Those same servers also have some NVMe devices with 512 byte sector sizes. The total capacity is 1.5TB of NVMe and 70TB of HDD. I'm especially interested in creating a single Stratis pool with the NVMe devices added as cache devices. My testing shows that this does not work... Any suggestions would be appreciated!

[root@rhdata5 ~]# **lsblk -o +MODEL,VENDOR,PHY-SEC,LOG-SEC**
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT MODEL       VENDOR  PHY-SEC LOG-SEC
sda       8:0    0 139.8G  0 disk            INTEL SSDSC ATA        4096     512
sdb       8:16   0  72.8T  0 disk            MR9361-8i   AVAGO      4096    4096
sdc       8:32   0 139.8G  0 disk            INTEL SSDSC ATA        4096     512
├─sdc1    8:33   0   600M  0 part /boot/efi                         4096     512
├─sdc2    8:34   0     1G  0 part /boot                             4096     512
└─sdc3    8:35   0 138.2G  0 part                                   4096     512
  ├─rhel_rhdata5-root
  │     253:0    0    70G  0 lvm  /                                 4096     512
  ├─rhel_rhdata5-swap
  │     253:1    0    14G  0 lvm  [SWAP]                            4096     512
  └─rhel_rhdata5-home
        253:2    0  54.2G  0 lvm  /home                             4096     512
nvme0n1 259:0    0 745.2G  0 disk            INTEL SSDPE             512     512
nvme1n1 259:1    0 745.2G  0 disk            INTEL SSDPE             512     512

[root@rhdata5 ~]# **stratis pool list**
Name   Total Physical   Properties

[root@rhdata5 ~]# **stratis pool create Data /dev/sdb**

[root@rhdata5 ~]# **stratis pool add-cache Data /dev/nvme0n1**
Execution failed:
stratisd failed to perform the operation that you requested. It returned the following information via the D-Bus: ERROR: Error: No cache has been initialized for pool with UUID f73411729ffd4558af2aff32170cdaa3 and name Data; it is therefore impossible to add additional devices to the cache. 

[root@rhdata5 ~]# **stratis --propagate pool add-cache Data /dev/nvme0n1**
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/stratis_cli/_main.py", line 43, in the_func
    result.func(result)
  File "/usr/lib/python3.6/site-packages/stratis_cli/_parser/_parser.py", line 87, in wrapped_func
    func(*args)
  File "/usr/lib/python3.6/site-packages/stratis_cli/_actions/_top.py", line 622, in add_cache_devices
    raise StratisCliEngineError(return_code, message)
stratis_cli._errors.StratisCliEngineError: ERROR: Error: No cache has been initialized for pool with UUID f73411729ffd4558af2aff32170cdaa3 and name Data; it is therefore impossible to add additional devices to the cache

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/stratis", line 35, in <module>
    main()
  File "/usr/bin/stratis", line 32, in main
    return run()(sys.argv[1:])
  File "/usr/lib/python3.6/site-packages/stratis_cli/_main.py", line 60, in the_func
    raise StratisCliActionError(command_line_args, result) from err
stratis_cli._errors.StratisCliActionError: Action selected by command-line arguments ['--propagate', 'pool', 'add-cache', 'Data', '/dev/nvme0n1'] which were parsed to Namespace(blockdevs=['/dev/nvme0n1'], func=<function add_subcommand.<locals>.wrap_func.<locals>.wrapped_func at 0x7f6c5dd88048>, pool_name='Data', propagate=True) failed


[root@rhdata5 ~]# **rpm -q stratis-cli stratisd**
stratis-cli-2.1.1-6.el8.noarch
stratisd-2.1.0-1.el8.x86_64

[root@rhdata5 ~]# **stratis --version**
2.1.1

[root@rhdata5 ~]# **uname -a**
Linux rhdata5.dota-lab.iad.redhat.com 4.18.0-240.el8.x86_64 #1 SMP Wed Sep 23 05:13:10 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@rhdata5 ~]# **grep -e '^NAME' -e '^VERSION' /etc/os-release**
NAME="Red Hat Enterprise Linux"
VERSION="8.3 (Ootpa)"

[root@rhdata5 ~]# **dmesg | tail -20**
[   81.227331] igb 0000:01:00.0 enp1s0f0: igb: enp1s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   81.386989] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0f0: link becomes ready
[   90.542510] irq 3: Affinity broken due to vector space exhaustion.
[  184.266197] XFS (dm-6): Mounting V5 Filesystem
[  184.322135] XFS (dm-6): Ending clean mount
[  184.379799] XFS (dm-6): Unmounting Filesystem
[  184.466763] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.
[  184.593735] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.
[  184.720656] device-mapper: thin: 253:7: growing the metadata device from 4096 to 4161600 blocks
[  184.837826] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.
[  184.964778] device-mapper: thin: 253:7: growing the data device from 768 to 76227356 blocks
[  185.091823] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.
[  185.222726] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.
[  185.352260] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.
[  185.483769] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.
[  185.614781] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.
[  185.745753] device-mapper: thin: Metadata device dm-4 is larger than 33292800 sectors: excess space will not be used.

[root@rhdata5 ~]# **dmsetup ls | sort -k 2**
rhel_rhdata5-root	(253:0)
rhel_rhdata5-swap	(253:1)
rhel_rhdata5-home	(253:2)
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-physical-originsub	(253:3)
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-thinmeta	(253:4)
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-thindata	(253:5)
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-mdv	(253:6)
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-thinpool-pool	(253:7)

[root@rhdata5 ~]# **dmsetup info -c | sort -k 3**
Name                                                                  Maj Min Stat Open Targ Event  UUID                                                                 
rhel_rhdata5-root                                                     253   0 L--w    1    1      0 LVM-oS0K6QOk1gaeu90rgwR4Mr0zgQu5Alrf755NlqBy6SagEofNGjLc8qPNjmIazJge 
rhel_rhdata5-swap                                                     253   1 L--w    2    1      0 LVM-oS0K6QOk1gaeu90rgwR4Mr0zgQu5Alrf3zAwGNe7imHT6Od7EZUX3Dfcd4YEyrTQ 
rhel_rhdata5-home                                                     253   2 L--w    1    1      0 LVM-oS0K6QOk1gaeu90rgwR4Mr0zgQu5AlrfnYBGFG5WDHygfSQL0jgyxNtlKgYP5Hhu 
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-physical-originsub 253   3 L--w    3    1      0 stratis-1-private-f73411729ffd4558af2aff32170cdaa3-physical-originsub
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-thinmeta      253   4 L--w    1    2      0 stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-thinmeta     
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-thindata      253   5 L--w    1    2      0 stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-thindata     
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-mdv           253   6 L--w    0    1      0 stratis-1-private-f73411729ffd4558af2aff32170cdaa3-flex-mdv          
stratis-1-private-f73411729ffd4558af2aff32170cdaa3-thinpool-pool      253   7 L--w    0    1      0 stratis-1-private-f73411729ffd4558af2aff32170cdaa3-thinpool-pool     

@mulkieran (Member)

@johnsimcall We have discussed this issue, but only in the most general terms. Note that the error you experienced when adding the cache is irrelevant to your question; you need to initialize a cache now before you can add more devices to the cache.
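
For reference, the two-step sequence the current CLI expects looks roughly like this (hypothetical session based on the stratis-cli 2.1 command set; output omitted):

[root@rhdata5 ~]# stratis pool init-cache Data /dev/nvme0n1
[root@rhdata5 ~]# stratis pool add-cache Data /dev/nvme1n1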

@johnsimcall

Thanks @mulkieran, your advice to initialize the cache helped me get the pool (data+cache) created! The howto page didn't mention it, and I had overlooked the change in behavior since 2.1.0.

I'd love to hear more about mixing 4096 and 512-byte media in the same pool. At this point I've created a stratis filesystem and am running an fio job against it to compare its performance against an uncached filesystem. My data is not mission-critical, so I don't mind testing this use case with NFS exports, VMs, and container workloads.


tasleson commented Dec 3, 2020

@johnsimcall FYI: lvm placed restrictions on mixed block sizes lvmteam/lvm2@0404539

@johnsimcall

Thanks for the additional info @tasleson ! The statement that stood out most to me was...

Avoid having PVs with different logical block sizes in the same VG. This prevents LVs from having mixed block sizes, which can produce file system errors.

This seems pretty damning... Can I try one last time to wiggle success out of my hardware before finally putting this issue to bed? 😁
My last grasp at hope comes from the fact that I'm using a 4096-byte device (/dev/sdb) as the primary storage device, and mixing in 512-byte devices (/dev/nvme0n1 and/or /dev/nvme1n1) only for cache.

[root@rhdata5 ~]# lsblk -d -o NAME,SIZE,TYPE,PHY-SEC,LOG-SEC,MODEL /dev/sdb /dev/nvme0n1 /dev/nvme1n1
NAME      SIZE TYPE PHY-SEC LOG-SEC MODEL
sdb      72.8T disk    4096    4096 MR9361-8i       
nvme1n1 745.2G disk     512     512 INTEL SSDPEDMD800G4                     
nvme0n1 745.2G disk     512     512 INTEL SSDPEDMD800G4                     

My testing with fio has been successful so far... 🤞

[root@rhdata5 ~]# stratis blockdev list
Pool Name   Device Node    Physical Size    Tier
Data        /dev/nvme0n1      745.21 GiB   Cache
Data        /dev/nvme1n1      745.21 GiB   Cache
Data        /dev/sdb           72.77 TiB    Data

[root@rhdata5 ~]# grep IOPS /tmp/hdd-only.output /tmp/stratis-cached.output
/tmp/hdd-only.output:        seq-read:   IOPS=37.9k, BW=148MiB/s (155MB/s)(20.0GiB/138330msec)
/tmp/hdd-only.output:        rand-read:  IOPS=639, BW=2559KiB/s (2620kB/s)(750MiB/300008msec)
/tmp/hdd-only.output:        seq-write:  IOPS=75.1k, BW=293MiB/s (308MB/s)(20.0GiB/69788msec); 0 zone resets
/tmp/hdd-only.output:        rand-write: IOPS=1750, BW=7001KiB/s (7169kB/s)(2052MiB/300166msec); 0 zone resets

/tmp/stratis-cached.output:  seq-read:   IOPS=25.1k, BW=98.1MiB/s (103MB/s)(100GiB/1043607msec)
/tmp/stratis-cached.output:  rand-read:  IOPS=26.3k, BW=103MiB/s (108MB/s)(100GiB/998500msec)
/tmp/stratis-cached.output:  seq-write:  IOPS=29.9k, BW=117MiB/s (122MB/s)(100GiB/877174msec); 0 zone resets
/tmp/stratis-cached.output:  rand-write: IOPS=1562, BW=6251KiB/s (6401kB/s)(21.5GiB/3600438msec); 0 zone resets


tasleson commented Dec 3, 2020

This seems pretty damning... Can I try one last time to wiggle success out of my hardware before finally putting this issue to bed? grin

When you are exercising the good path, I don't think there is much to worry about. From what I posted in https://github.com/stratis-storage/stratisd/issues/1196#issuecomment-465142750, issues would arise in cases like losing power during a write, or some other interruption that leaves the write incomplete for a logical sector composed of multiple physical sectors. Although, if you lose power in the middle of a write you would likely corrupt one of the physical sectors, which would result in an error when reading the logical sector later.

One good thing I can think of is that XFS has checksums on all of its metadata. Thus it would be able to detect a "partial write" there, as the checksum would fail.
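
If it helps, whether metadata CRCs are enabled (and which sector size the filesystem was created with) can be checked on a mounted filesystem with xfs_info; the mount point here is illustrative:

# xfs_info /mnt/stratis-fs | grep -E 'crc|sectsz'

crc=1 in the meta-data section means V5 checksummed metadata; sectsz shows the sector size XFS is assuming for the device.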

LVM doesn't know what filesystem will be used on top of it, so it requires a more conservative approach. However, really answering this question would require a discussion with the device mapper kernel and XFS folks.


johnsimcall commented Dec 3, 2020

Wonderful, thank you for the additional details @tasleson! As a take-away from this discussion, I would suggest that Stratis should behave similarly to LVM and alert users to mismatched block sizes... perhaps even prevent users from mixing block sizes in a pool.

@johnsimcall

I found these comments from Mike Snitzer regarding 4096-byte IOs being issued to 512-byte disks to be interesting.

Concerns about 4K issued to 512b physical devices not being atomic
(could have 5 of the 8 512b written, so old 3 bytes could cause
issues). IIRC I shared those concerns with Martin Petersen before
(Martin is an upstream Linux SCSI maintainer) and he felt the atomicity
concerns were overstated. Thinking now, it was possibly for devices
that advertise 4K physical and 512b logical. Whereas issuing 4K to a
512b/512b device could easily not be atomic for that 4K IO.

I can revisit this with Martin. Also, I'm happy to adjust my
understanding based on further anecdotal real-world evidence that
issuing 4K IOs to a 512b device and expecting any 4K IO operation to be
atomic is wrong.

I wonder if an intermediate layer which reports a minimum block size of 4096 bytes would mitigate any of his concerns (like a device-mapper device or an lvm volume).

@mulkieran (Member)

Our plan at this time is to ensure that pools cannot be created with devices with different block sizes.

@mulkieran (Member)

We've implemented a policy that restricts the combinations of block sizes that are allowed.
