
nvme: Use interim buffer for nvme_get_log_sanitize() #951

Merged: 1 commit into storaged-project:master on Aug 30, 2023

Conversation

tbzatek
Member

@tbzatek tbzatek commented Aug 8, 2023

Certain drives tend to return more data than expected resulting in stack corruption. Use an interim buffer large enough to handle the excess bytes.

storaged-project/udisks#1152
linux-nvme/libnvme#684


This is a dirty workaround until a proper fix is implemented either in libnvme or in the kernel.
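Conceptually, the workaround amounts to something like this (a minimal sketch using libnvme directly, not the exact merged diff; the real change lives in the plugin's bd_nvme_get_sanitize_log()):

#include <stdbool.h>
#include <string.h>
#include <libnvme.h>

/* Sketch of the interim-buffer idea: read the log into a deliberately
 * oversized buffer and only interpret the leading 512 bytes, so a drive
 * that transfers too much data cannot clobber the caller's stack frame. */
static int get_sanitize_log_safe (int fd, struct nvme_sanitize_log_page *out)
{
    unsigned char buf[65536];   /* 64 KiB of slack vs. the 512-byte log page */
    int ret;

    memset (buf, 0, sizeof (buf));
    ret = nvme_get_log_sanitize (fd, false, (struct nvme_sanitize_log_page *) buf);
    if (ret != 0)
        return ret;

    memcpy (out, buf, sizeof (*out));
    return 0;
}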

@igaw

igaw commented Aug 9, 2023

Makes sense until we figure out how to deal with such devices. I suspect we need to introduce quirks for this. BTW, would it be possible to add some debug code which checks whether the device writes too much data?

@tbzatek
Member Author

tbzatek commented Aug 10, 2023

BTW, would it be possible to add some debug code which checks whether the device writes too much data?

We can afford to have extra checks here; any suggestion on how to achieve that other than analyzing the data in the buffer itself?

@igaw

igaw commented Aug 11, 2023

The only idea I have is to write a pattern into the buffer and check how much of it was overwritten, similar to what KASAN does. Not really clever, I know.

What we could also do, instead of bloating up production code, is write a small standalone program which does this.
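A minimal sketch of such a standalone checker (assuming libnvme is available; the canary value and the 64 KiB buffer size are arbitrary choices):

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <libnvme.h>

int main (int argc, char **argv)
{
    unsigned char buf[65536];
    ssize_t last;
    int fd, ret;

    if (argc != 2) {
        fprintf (stderr, "usage: %s /dev/nvme0n1\n", argv[0]);
        return 1;
    }

    /* fill the whole buffer with a canary pattern */
    memset (buf, 0xAA, sizeof (buf));

    fd = open (argv[1], O_RDONLY);
    if (fd < 0) {
        perror ("open");
        return 1;
    }

    ret = nvme_get_log_sanitize (fd, false, (struct nvme_sanitize_log_page *) buf);
    close (fd);
    if (ret != 0) {
        fprintf (stderr, "nvme_get_log_sanitize failed: %d\n", ret);
        return 1;
    }

    /* find the last byte that no longer holds the canary value; bytes the
     * device happened to write as 0xAA are missed, so this is a lower bound */
    for (last = sizeof (buf) - 1; last >= 0; last--)
        if (buf[last] != 0xAA)
            break;

    printf ("device wrote at least %zd bytes (expected at most 512)\n", last + 1);
    return 0;
}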

@allfoxwy

I tried to play with this patch today, and I got more confused.

I wanted to know how large a buffer is needed, so I took igaw's idea:

The only idea I have is to write a pattern into the buffer and check how much of it was overwritten.

I modified the patch to:

BDNVMESanitizeLog * bd_nvme_get_sanitize_log (const gchar *device, GError **error) {
    int ret;
    int fd;
    char buf[65536] = ZERO_INIT;
    int i, buf_len;
    FILE *save_buf_file = NULL;
    FILE *save_ij_file = NULL;
    struct nvme_sanitize_log_page *sanitize_log;
    BDNVMESanitizeLog *log;
    __u16 sstat;

    buf_len = sizeof(buf) / sizeof(buf[0]);

    memset(buf, 'V', buf_len);

    /* open the block device */
    fd = _open_dev (device, error);
    if (fd < 0)
        return NULL;

    /* send the NVME_LOG_LID_SANITIZE ioctl */
    ret = nvme_get_log_sanitize (fd, FALSE /* rae */, (struct nvme_sanitize_log_page *) &buf);
    if (ret != 0) {
        _nvme_status_to_error (ret, FALSE, error);
        g_prefix_error (error, "NVMe Get Log Page - Sanitize Status Log command error: ");
        close (fd);
        return NULL;
    }

    close (fd);

    /* scan backwards for the last byte that no longer holds the 'V' pattern;
     * i ends up holding the number of bytes the device actually wrote
     * (this assumes the device overwrote at least one byte) */
    for(i = buf_len - 1; i >= 0; --i) {
        if(buf[i] != 'V') {
            i++;
            break;
        }
    }

    /* append the detected length to a debug file */
    save_ij_file = fopen("/path/to/my-ij.bin", "ab");
    if(save_ij_file) {
        fwrite(&i, sizeof(i), 1, save_ij_file);
        fclose(save_ij_file);
    }

    /* append the overwritten part of the buffer to another debug file */
    save_buf_file = fopen("/path/to/my-nvme.bin", "ab");
    if(save_buf_file) {
        fwrite(buf, sizeof(buf[0]), i, save_buf_file);
        fclose(save_buf_file);
    }



    /* need to use interim buffer that is large enough for broken drives
     * returning more data than expected
     */
    sanitize_log = (struct nvme_sanitize_log_page *) &buf;

    log = g_new0 (BDNVMESanitizeLog, 1);
    log->sanitize_progress = 0;
    sstat = GUINT16_FROM_LE (sanitize_log->sstat);
    if ((sstat & NVME_SANITIZE_SSTAT_STATUS_MASK) == NVME_SANITIZE_SSTAT_STATUS_IN_PROGESS)
        log->sanitize_progress = ((gdouble) GUINT16_FROM_LE (sanitize_log->sprog) * 100) / 0x10000;
    log->global_data_erased = sstat & NVME_SANITIZE_SSTAT_GLOBAL_DATA_ERASED;
    log->overwrite_passes = (sstat >> NVME_SANITIZE_SSTAT_COMPLETED_PASSES_SHIFT) & NVME_SANITIZE_SSTAT_COMPLETED_PASSES_MASK;

    switch (sstat & NVME_SANITIZE_SSTAT_STATUS_MASK) {
        case NVME_SANITIZE_SSTAT_STATUS_COMPLETE_SUCCESS:
            log->sanitize_status = BD_NVME_SANITIZE_STATUS_SUCCESS;
            break;
        case NVME_SANITIZE_SSTAT_STATUS_IN_PROGESS:
            log->sanitize_status = BD_NVME_SANITIZE_STATUS_IN_PROGESS;
            break;
        case NVME_SANITIZE_SSTAT_STATUS_COMPLETED_FAILED:
            log->sanitize_status = BD_NVME_SANITIZE_STATUS_FAILED;
            break;
        case NVME_SANITIZE_SSTAT_STATUS_ND_COMPLETE_SUCCESS:
            log->sanitize_status = BD_NVME_SANITIZE_STATUS_SUCCESS_NO_DEALLOC;
            break;
        case NVME_SANITIZE_SSTAT_STATUS_NEVER_SANITIZED:
        default:
            log->sanitize_status = BD_NVME_SANITIZE_STATUS_NEVER_SANITIZED;
            break;
    }

    log->time_for_overwrite = (GUINT32_FROM_LE (sanitize_log->eto) == 0xffffffff) ? -1 : (gint64) GUINT32_FROM_LE (sanitize_log->eto);
    log->time_for_block_erase = (GUINT32_FROM_LE (sanitize_log->etbe) == 0xffffffff) ? -1 : (gint64) GUINT32_FROM_LE (sanitize_log->etbe);
    log->time_for_crypto_erase = (GUINT32_FROM_LE (sanitize_log->etce) == 0xffffffff) ? -1 : (gint64) GUINT32_FROM_LE (sanitize_log->etce);
    log->time_for_overwrite_nd = (GUINT32_FROM_LE (sanitize_log->etond) == 0xffffffff) ? -1 : (gint64) GUINT32_FROM_LE (sanitize_log->etond);
    log->time_for_block_erase_nd = (GUINT32_FROM_LE (sanitize_log->etbend) == 0xffffffff) ? -1 : (gint64) GUINT32_FROM_LE (sanitize_log->etbend);
    log->time_for_crypto_erase_nd = (GUINT32_FROM_LE (sanitize_log->etcend) == 0xffffffff) ? -1 : (gint64) GUINT32_FROM_LE (sanitize_log->etcend);

    return log;
}

Then I got this result:

$ hexdump my-ij.bin 
0000000 0940 0000 05c0 0000                    
0000008
$ hexdump my-nvme.bin 
0000000 ffff 0001 0000 0000 ffff ffff ffff ffff
0000010 ffff ffff 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0000940 ffff 0001 0000 0000 ffff ffff ffff ffff
0000950 ffff ffff 0000 0000 0000 0000 0000 0000
0000960 0000 0000 0000 0000 0000 0000 0000 0000
*
0000f00
$

So this code is called TWICE during boot, as there are two records in the file.
Both records have the same actual content, but the length of their zero end-padding differs.

Both record lengths, 0x940 and 0x5c0, are divisible by 32.
struct nvme_sanitize_log_page's actual data is 32 bytes.

So I am more confused:

  1. This Sanitize Log query is not idempotent, as the two queries returned different data.
  2. I GUESS maybe every query is answered with some kind of whole storage cell, where only the first part of the data is valid and the remaining zero-filled space could be reserved for future log data. I GUESS.
  3. The 0x940 number may change. I don't know how it works, but it might vary on the next OS boot. However, 0x5c0 is always there as far as I can see. This might need more boot attempts to learn something more.

Anyway, the patch itself works. Thanks for all of it.

@igaw

igaw commented Aug 16, 2023

Both record lengths, 0x940 and 0x5c0, are divisible by 32.
struct nvme_sanitize_log_page's actual data is 32 bytes.

We are requesting 512 bytes and thus the device should only transfer 512 bytes. Older kernels always used bounce buffers for this operation, but with the fix for the DMA alignment setting (torvalds/linux@52fde2c) newer kernels use zero-copy I/O.
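For reference, the 512 bytes come from the size of the log page structure itself; what gets sent down is equivalent to roughly this raw admin passthrough (a sketch for illustration, not the actual libnvme code):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

/* Sketch: Get Log Page for the Sanitize Status log (LID 0x81) with a transfer
 * length of exactly 512 bytes, i.e. sizeof(struct nvme_sanitize_log_page). */
static int read_sanitize_log_raw (int fd, void *buf)
{
    struct nvme_admin_cmd cmd;
    uint32_t numdl = 512 / 4 - 1;          /* transfer length in dwords, 0-based */

    memset (&cmd, 0, sizeof (cmd));
    cmd.opcode = 0x02;                     /* Get Log Page */
    cmd.nsid = 0xffffffff;                 /* controller-scoped log */
    cmd.addr = (uint64_t) (uintptr_t) buf;
    cmd.data_len = 512;                    /* the host asks for exactly 512 bytes */
    cmd.cdw10 = 0x81 | (numdl << 16);      /* LID 0x81, NUMDL = 127 */

    return ioctl (fd, NVME_IOCTL_ADMIN_CMD, &cmd);
}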

So we just need one more experiment to have all statements supported by experiments.

Could you build a kernel and hack blk_rq_map_user_iov so that it always uses bounce buffers?

--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -627,7 +627,7 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
                        struct rq_map_data *map_data,
                        const struct iov_iter *iter, gfp_t gfp_mask)
 {
-       bool copy = false, map_bvec = false;
+       bool copy = true, map_bvec = false;
        unsigned long align = q->dma_pad_mask | queue_dma_alignment(q);
        struct bio *bio = NULL;
        struct iov_iter i;

@allfoxwy

Could you build a kernel and hack blk_rq_map_user_iov so that it always uses bounce buffers?

I tried flipping that bool on linux-6.4.11, and it didn't help with the bug. udisks2 still hits the stack-smashing error.

Thanks for looking into this.

@igaw

igaw commented Aug 17, 2023

Okay, let me play with this. I thought I had the right place to force the kernel to use bounce buffers. Obviously, it was the wrong place...

@tbzatek tbzatek marked this pull request as ready for review August 18, 2023 13:34
@igaw

igaw commented Aug 21, 2023

@allfoxwy I was stupid, the correct test patch is here: storaged-project/udisks#1152 (comment)

@allfoxwy

I'll try it and reply in that thread. Thanks.

@keithbusch

Maybe nvme-cli could just allocate a page-aligned and page-sized buffer instead?

@tbzatek
Member Author

tbzatek commented Aug 30, 2023

Let's merge this in its current state as @vojtechtrefny is planning to make a new release tomorrow.

Feel free to continue here even after the ticket is closed. I'm watching the linux-nvme mailing list thread, but there is no clear outcome as of yet.

@tbzatek tbzatek merged commit 4292295 into storaged-project:master Aug 30, 2023
12 checks passed
@igaw

igaw commented Aug 30, 2023

Sorry, got a bit drowned in work. Anyway, Christoph had an idea for another test patch. The idea here is to introduce a 512-byte alignment for the buffer and enable KASAN to see if this device will work then. From our previous tests and bisection, this has a good chance to work. Anyway, will look into it... hopefully soon.

@keithbusch

Could we also try my suggestion from http://lists.infradead.org/pipermail/linux-nvme/2023-August/041901.html? But as an experiment, memset to 0xff, then check if the padding bytes are unchanged after the expected 512b log read? If so, then I think we can conclude that Christoph's hypothesis of bad PRP parsing is likely the bug in the device and a quirk for DMA alignment sounds like the right kernel side fix. If the data is still corrupted with an aligned buffer, though, then I don't think we can simply force 512b DMA alignment.

@allfoxwy

Could we also try my suggestion from http://lists.infradead.org/pipermail/linux-nvme/2023-August/041901.html? But as an experiment, memset to 0xff, then check if the padding bytes are unchanged after the expected 512b log read? If so, then I think we can conclude that Christoph's hypothesis of bad PRP parsing is likely the bug in the device and a quirk for DMA alignment sounds like the right kernel side fix. If the data is still corrupted with an aligned buffer, though, then I don't think we can simply force 512b DMA alignment.

I'm not confident I understand all of your suggestion. I think it means that if we make the log buffer's start address aligned, then there would be good news, right?

So I tried an aligned-HACK patch like this:

--- libblockdev-nvme-27/src/plugins/nvme/nvme-info.c	
+++ align-libblockdev-nvme-27/src/plugins/nvme/nvme-info.c	
@@ -1026,7 +1026,10 @@
 BDNVMESanitizeLog * bd_nvme_get_sanitize_log (const gchar *device, GError **error) {
     int ret;
     int fd;
-    char buf[65536] = ZERO_INIT;
+    const int buf_size = 0x10000; // 65536
+    char *check_ptr;
+    FILE *len_file, *buf_file;
+    int i;
     struct nvme_sanitize_log_page *sanitize_log;
     BDNVMESanitizeLog *log;
     __u16 sstat;
@@ -1036,20 +1039,51 @@
     if (fd < 0)
         return NULL;
 
+    if(posix_memalign((void*)&sanitize_log, getpagesize(), buf_size)) {
+        g_prefix_error (error, "NVMe Get Log Page - Sanitize Status Log command error: Failed to get aligned mem");
+        close(fd);
+        return NULL;
+    }
+
+    memset(sanitize_log, 'A', buf_size);
+
     /* send the NVME_LOG_LID_SANITIZE ioctl */
-    ret = nvme_get_log_sanitize (fd, FALSE /* rae */, (struct nvme_sanitize_log_page *) &buf);
+    ret = nvme_get_log_sanitize (fd, FALSE /* rae */, sanitize_log);
     if (ret != 0) {
         _nvme_status_to_error (ret, FALSE, error);
         g_prefix_error (error, "NVMe Get Log Page - Sanitize Status Log command error: ");
+        free(sanitize_log);
         close (fd);
         return NULL;
     }
     close (fd);
 
+    check_ptr = (char*)sanitize_log;
+    for(i = buf_size - 1; i >= 0; --i) {
+        if(check_ptr[i] != 'A') {
+            i++;
+            break;
+        }
+    }
+
+    if(i > 0) {
+        len_file = fopen("/path/to/length.bin", "ab");
+        if(len_file) {
+            fwrite(&i, sizeof(i), 1, len_file);
+            fclose(len_file);
+        }
+
+        buf_file = fopen("/path/to/content.bin", "ab");
+        if(buf_file) {
+            fwrite(sanitize_log, sizeof(check_ptr[0]), i, buf_file);
+            fclose(buf_file);
+        }
+    }
+
+
     /* need to use interim buffer that is large enough for broken drives
      * returning more data than expected
      */
-    sanitize_log = (struct nvme_sanitize_log_page *) &buf;
 
     log = g_new0 (BDNVMESanitizeLog, 1);
     log->sanitize_progress = 0;
@@ -1085,5 +1119,7 @@
     log->time_for_block_erase_nd = (GUINT32_FROM_LE (sanitize_log->etbend) == 0xffffffff) ? -1 : (gint64) GUINT32_FROM_LE (sanitize_log->etbend);
     log->time_for_crypto_erase_nd = (GUINT32_FROM_LE (sanitize_log->etcend) == 0xffffffff) ? -1 : (gint64) GUINT32_FROM_LE (sanitize_log->etcend);
 
+    free(sanitize_log);
+
     return log;
 }

And I think there is indeed some news.

During the first boot after applying the aligned-HACK patch:

$ hexdump length.bin 
0000000 1000 0000 1000 0000                    
0000008
$ hexdump content.bin 
0000000 ffff 0001 0000 0000 ffff ffff ffff ffff
0000010 ffff ffff 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 ffff 0001 0000 0000 ffff ffff ffff ffff
0001010 ffff ffff 0000 0000 0000 0000 0000 0000
0001020 0000 0000 0000 0000 0000 0000 0000 0000
*
0002000
$

During the third boot after applying the aligned-HACK patch:

$ hexdump length.bin 
0000000 1000 0000 1000 0000 1000 0000 1000 0000
0000010 1000 0000 1000 0000                    
0000018
$ hexdump content.bin 
0000000 ffff 0001 0000 0000 ffff ffff ffff ffff
0000010 ffff ffff 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 ffff 0001 0000 0000 ffff ffff ffff ffff
0001010 ffff ffff 0000 0000 0000 0000 0000 0000
0001020 0000 0000 0000 0000 0000 0000 0000 0000
*
0002000 ffff 0001 0000 0000 ffff ffff ffff ffff
0002010 ffff ffff 0000 0000 0000 0000 0000 0000
0002020 0000 0000 0000 0000 0000 0000 0000 0000
*
0003000 ffff 0001 0000 0000 ffff ffff ffff ffff
0003010 ffff ffff 0000 0000 0000 0000 0000 0000
0003020 0000 0000 0000 0000 0000 0000 0000 0000
*
0004000 ffff 0001 0000 0000 ffff ffff ffff ffff
0004010 ffff ffff 0000 0000 0000 0000 0000 0000
0004020 0000 0000 0000 0000 0000 0000 0000 0000
*
0005000 ffff 0001 0000 0000 ffff ffff ffff ffff
0005010 ffff ffff 0000 0000 0000 0000 0000 0000
0005020 0000 0000 0000 0000 0000 0000 0000 0000
*
0006000
$

The buffer is always written with 4096 bytes. This is different from my former experiment #951 (comment), where the overwritten size differed on every boot. Now, with an aligned start address, the overwritten length is always 4096 bytes.

I'm not sure I understand that POSIX aligned-memory function properly, or whether I programmed this correctly, but I feel you are onto something, so I'm posting a quick reply.

@keithbusch

Ok, if the device is overrunning an aligned buffer, I think we can conclude it's not a bad interpretation of a PRP with an offset. The device just always writes a full 4k page into a 512b log. That's not good: the kernel DMA alignment suggestion isn't viable and it's preferred that user space work around this with an over-allocated buffer instead of relying on the kernel to do something about it.
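A sketch of what that over-allocation could look like in user space (assuming a page-sized transfer is the worst case, which is all that has been observed here):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <libnvme.h>

/* Sketch: use a page-aligned, page-sized buffer for the 512-byte sanitize log,
 * so a device that insists on writing a whole 4 KiB page has somewhere
 * harmless to put the excess. */
static int get_sanitize_log_overalloc (int fd, struct nvme_sanitize_log_page *out)
{
    long page = sysconf (_SC_PAGESIZE);
    void *buf = NULL;
    int ret;

    if (posix_memalign (&buf, page, page) != 0)
        return -1;
    memset (buf, 0, page);

    ret = nvme_get_log_sanitize (fd, false, buf);
    if (ret == 0)
        memcpy (out, buf, sizeof (*out));

    free (buf);
    return ret;
}

Compared to the 64 KiB stack buffer merged here, this keeps the allocation bounded to a single page but relies on the device never writing past one page.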

I think the reason you saw different overruns in the previous experiments is that the original buffer plus the extra unwanted data straddles different page offsets, causing the device to transfer the bulk of the corruption through PRP2, which the driver hadn't set, so those bytes went nowhere. Different page offsets will cause different observable buffer overruns. (Concretely, if the device always transfers a full 4 KiB: a buffer at page offset 0x6c0 leaves 0x1000 - 0x6c0 = 0x940 bytes before the page boundary, an offset of 0xa40 leaves 0x5c0, and a page-aligned buffer leaves the full 0x1000, which matches the overruns observed above.)

@igaw

igaw commented Aug 31, 2023

Okay, I think we should handle this inside libnvme somehow.
