Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filesystem Write Fails with Some SD-Cards #52931

Closed
clamattia opened this issue Dec 9, 2022 · 22 comments · Fixed by #53501
Closed

Filesystem Write Fails with Some SD-Cards #52931

clamattia opened this issue Dec 9, 2022 · 22 comments · Fixed by #53501
Assignees
Labels
area: Disk Access area: Drivers area: File System bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug Regression Something, which was working, does not anymore

Comments

@clamattia
Copy link
Contributor

Writing Data to a file on the SD-card fails with the version *** Booting Zephyr OS build v3.1.99-ncs1-9-g85edda1e989c *** of zephyr. The Failure depends on the card and on the data being written. It fails with the cards Lexar 300x 32GB microSD HC and Transcend 4GB microSD HC. It succeeds with Delkin Devices. Utility + 4GB microSD HC. If the data is all zeros instead of initialized via the function provided below, the test always succeeds.

See more details and failing unit-test on target below.

This is a regression and it worked fine with version 2.6.

The target board is a variant of nrf9160ns.

To Reproduce
Steps to reproduce the behavior:

  1. add the following test and invoke it with twister:
#include <fs/fs.h>
#include <ztest.h>

#define BUFFER_SIZE (1<<15)
#define BUFFER_CNT (15)

static char buffer[BUFFER_SIZE] = {0};

static void init_buffer(void){
    for(int i=0;i<BUFFER_SIZE;i++){
        buffer[i]+=3*(1+i);
    }
}

static bool write_file(void){
    struct fs_file_t file;

    fs_file_t_init(&file);
    int rc = fs_open(&file, "/SD:/write_test", FS_O_CREATE | FS_O_RDWR);
    if (rc < 0)
    {
        printk("Failed to open file\n");
        return false;
    }

    ssize_t ret = 0;
    for(int i=0;i<BUFFER_CNT;i++){
        printk("chunk: %d size: %d data: %02x ... %02x\n",
                i,
                BUFFER_SIZE,
                buffer[0],
                buffer[BUFFER_SIZE -1]);
        ret = fs_write(&file, buffer, BUFFER_SIZE);
        if (ret < 0)
        {
            printk("Failed to write %d\n", ret);
            printk("WRITTEN %d PACKETS A %d = %d\n", i, BUFFER_SIZE,i*BUFFER_SIZE);
            return false;
        }
    }

    rc = fs_truncate(&file, BUFFER_CNT*BUFFER_SIZE);
    if(rc < 0)
    {
        printk("error truncating file\n");
        return false;
    }

    rc = fs_close(&file);
    if (rc < 0)
    {
        printk("error closing file\n");
        return false;
    }

    return true;
}

static void test_write(void)
{
    init_buffer();
    zassert_true(write_file(), "failed to write file");
}

void test_main(void)
{
    ztest_test_suite(test_write_file,
            ztest_unit_test(test_write)
            );

    ztest_run_test_suite(test_write_file);
}

Expected behavior
The test should pass independently of the data written to the file and of the SD card being used.

Impact
Show stopper. We are likely not able to update zephyr version without a workaround.

Logs and console output
Failing test result using v3.1.99 and Transcend 4GB microSD HC:

*** Booting Zephyr OS build v3.1.99-ncs1-9-g85edda1e989c  ***
I: Start initializing SDHC.
I: Detected SD card
I: Maximum SD clock is under 25MHz, using clock of 8000000Hz
I: Block count 7680000
I: Sector size 512
I: Memory Size(MB) 3750
I: Disk mounted under "/SD:"
CENSOR
I: SDHC initialization succeeded.
Running TESTSUITE test_write_file
===================================================================
START - test_write
chunk: 0 size: 32768 data: 03 ... 00
chunk: 1 size: 32768 data: 03 ... 00
chunk: 2 size: 32768 data: 03 ... 00
chunk: 3 size: 32768 data: 03 ... 00
chunk: 4 size: 32768 data: 03 ... 00
chunk: 5 size: 32768 data: 03 ... 00
E: Write failed
E: file write error (-5)
Failed to write -5
WRITTEN 5 PACKETS A 32768 = 163840
Assertion failed at CMAKE_SOURCE_DIR/src/main.c:62: test_write: (write_file() is false)
failed to write file
FAIL - test_write in 17.438 seconds
===================================================================
TESTSUITE test_write_file failed.
===================================================================
RunID: fba8dcb6ff2a643162217b8944b19ed0
PROJECT EXECUTION FAILED

Succeeding test result using v3.1.99 and Delkin Devices. Utility + 4GB microSD HC:

DEBUG   - DEVICE: *** Booting Zephyr OS build v3.1.99-ncs1-9-g85edda1e989c  ***
DEBUG   - DEVICE: I: Start initializing SDHC.
DEBUG   - DEVICE: I: Detected SD card
DEBUG   - DEVICE: I: Maximum SD clock is under 25MHz, using clock of 8000000Hz
DEBUG   - DEVICE: I: Block count 7576576
DEBUG   - DEVICE: I: Sector size 512
DEBUG   - DEVICE: I: Memory Size(MB) 3699
DEBUG   - DEVICE: I: Disk mounted under "/SD:"
CENSOR
DEBUG   - DEVICE: I: SDHC initialization succeeded.
DEBUG   - DEVICE: Running TESTSUITE test_write_file
DEBUG   - DEVICE: ===================================================================
DEBUG   - DEVICE: START - test_write
DEBUG   - DEVICE: chunk: 0 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 1 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 2 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 3 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 4 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 5 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 6 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 7 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 8 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 9 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 10 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 11 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 12 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 13 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: chunk: 14 size: 32768 data: 03 ... 00
DEBUG   - DEVICE: PASS - test_write in 1.313 seconds
DEBUG   - DEVICE: ===================================================================
DEBUG   - DEVICE: TESTSUITE test_write_file succeeded
DEBUG   - DEVICE: ===================================================================
DEBUG   - DEVICE: RunID: ff4fd55abfb0591aa484e9d6b58d34af
DEBUG   - DEVICE: PROJECT EXECUTION SUCCESSFUL

Environment:

  • Ubuntu 22.04.1 LTS running as VM
  • Nordic SDK

Furthermore

  • Changes to device tree:
-    compatible = "zephyr,mmc-spi-slot";
-    reg = <0>;
-    status = "okay";
-    label = "SDHC0";
-       spi-max-frequency = <8000000>; // Maximum of NRF91
+      compatible = "zephyr,sdhc-spi-slot";
+      reg = <0>;
+      status = "okay";
+      mmc {
+          compatible = "zephyr,sdmmc-disk";
+          status = "okay";
+      };
+      spi-max-frequency = <8000000>; // Maximum of NRF91
  };
  • Changes to kconfig:
+CONFIG_DISK_ACCESS=y
+CONFIG_DISK_DRIVERS=y
+CONFIG_SDMMC_SUBSYS=y
+CONFIG_SPM=y
+CONFIG_ZTEST_STACK_SIZE=4096
  • This is how we initialize the SD-card using SYS_INIT. We do it early in the boot process (high priority).:
    static const char *disk_pdrv = "SD";
    uint64_t memory_size;
    uint32_t block_count;
    uint32_t block_size;

    LOG_INF("Start initializing SDHC.");

    const struct device *gpio = DEVICE_DT_GET(DT_NODELABEL(gpio0));
    if (!gpio)
    {
        LOG_ERR("Failed to get GPIO device.");
        return false;
    }
    gpio_pin_configure(gpio, DT_GPIO_PIN(DT_ALIAS(sdhcsw), gpios), GPIO_OUTPUT);
    gpio_pin_set(gpio, DT_GPIO_PIN(DT_ALIAS(sdhcsw), gpios), 0);

    int err = disk_access_init(disk_pdrv);
    if (err)
    {
        LOG_ERR("Storage init ERROR: %d!", -err);
        return false;
    }

    if(disk_access_ioctl(disk_pdrv, DISK_IOCTL_GET_SECTOR_COUNT, &block_count)) {
        LOG_ERR("Unable to get sector count!");
        return false;
    }
    LOG_INF("Block count %u", block_count);

    if(disk_access_ioctl(disk_pdrv, DISK_IOCTL_GET_SECTOR_SIZE, &block_size)) {
        LOG_ERR("Unable to get sector size!");
        return false;
    }
    LOG_INF("Sector size %u", block_size);

    memory_size = (uint64_t)block_count * block_size;
    LOG_INF("Memory Size(MB) %u", (uint32_t)(memory_size >> 20));

    mp.mnt_point = disk_mount_pt;

    int res = fs_mount(&mp);

    if (res == FR_OK) {
        LOG_INF("Disk mounted under \"%s\"", mp.mnt_point);
    }
    else {
        LOG_ERR("Error mounting disk!");
        return false;
    }
...

Let me know if more information is needed to reproduce.

@clamattia clamattia added the bug The issue is a bug, or the PR is fixing a bug label Dec 9, 2022
@nordicjm nordicjm added area: Drivers area: File System Regression Something, which was working, does not anymore area: Disk Access priority: high High impact/importance bug labels Dec 12, 2022
@danieldegrasse
Copy link
Collaborator

@clamattia Can you try running this test with CONFIG_SDHC_LOG_LEVEL_DBG=y, and CONFIG_SD_LOG_LEVEL_DBG=y? I've tried to reproduce this failure with a few cards I have, and haven't had any luck

@clamattia
Copy link
Contributor Author

Thanks for the quick response.
You might need to increase the BUFFER_CNT to reproduce.
And I made it repeat the test with different data to increase the chance of the bug:

for(int i=0;i<10;i++){
        init_buffer();
        zassert_true(write_file(), "failed to write file");
    }

and added some data variation:

buffer[i]+=3*(1+i)+buffer[i];

The bug must be timing sensitive. If I enable the log output as you suggested or if I put a k_msleep(100) in the loop, it changes the behavior of the bug.

Debug Log for the Lexar 300x 32GB microSD HC:

*** Booting Zephyr OS build v3.1.99-ncs1-9-g85edda1e989c  ***
I: Start initializing SDHC.
D: Resetting power to card
D: cmd0 arg 0x0
D: cmd8 arg 0x1aa
D: Found SDHC with CMD8 support
D: cmd59 arg 0x1
D: cmd5 arg 0x0
I: Detected SD card
D: cmd0 arg 0x0
D: cmd8 arg 0x1aa
D: Found SDHC with CMD8 support
D: cmd59 arg 0x1
D: cmd58 arg 0x0
D: cmd55 arg 0x0
D: cmd41 arg 0x40000000
D: cmd55 arg 0x0
D: cmd41 arg 0x40000000
D: cmd55 arg 0x0
D: cmd41 arg 0x40000000
D: SDMMC responded to ACMD41 after 2 attempts
D: cmd58 arg 0x40000000
D: cmd10 arg 0x0
D: Card MID: 0x28, OID: EB
D: cmd9 arg 0x0
D: Card block count 62705664, block size 512
I: Maximum SD clock is under 25MHz, using clock of 8000000Hz
D: cmd55 arg 0x0
D: cmd51 arg 0x0
D: SD reports specification version 8
I: Block count 62705664
I: Sector size 512
I: Memory Size(MB) 30618
D: READ: Sector = 0, Count = 1
D: cmd17 arg 0x0
D: READ: Sector = 8192, Count = 1
D: cmd17 arg 0x2000
D: READ: Sector = 8193, Count = 1
D: cmd17 arg 0x2001
I: Disk mounted under "/SD:"
D: READ: Sector = 40960, Count = 1
D: cmd17 arg 0xa000
D: READ: Sector = 44352, Count = 1
D: cmd17 arg 0xad40
CENSOR
I: SDHC initialization succeeded.
Running TESTSUITE test_write_file
===================================================================
START - test_write
D: READ: Sector = 40960, Count = 1
D: cmd17 arg 0xa000
...............................chunk: 0 size: 32768 data: 03 ... 00
D: WRITE: Sector = 63552, Count = 32
D: cmd25 arg 0xf840
D: cmd13 arg 0x0
D: READ: Sector = 10365, Count = 1
D: cmd17 arg 0x287d
D: WRITE: Sector = 63584, Count = 32
D: cmd25 arg 0xf860
D: cmd13 arg 0x0
...............................chunk: 1 size: 32768 data: 03 ... 00
D: WRITE: Sector = 63616, Count = 32
D: cmd25 arg 0xf880
D: cmd13 arg 0x0
D: WRITE: Sector = 63648, Count = 32
D: cmd25 arg 0xf8a0
D: cmd13 arg 0x0
...............................chunk: 2 size: 32768 data: 03 ... 00
D: WRITE: Sector = 63680, Count = 32
D: cmd25 arg 0xf8c0
D: cmd13 arg 0x0
D: WRITE: Sector = 63712, Count = 32
D: cmd25 arg 0xf8e0
D: cmd13 arg 0x0
E: Card did not return to ready state
E: Write failed
E: file write error (-5)
Failed to write -5
WRITTEN 2 PACKETS A 32768 = 65536
Assertion failed at CMAKE_SOURCE_DIR/src/main.c:64: test_write: (write_file() is false)
failed to write file
FAIL - test_write in 0.292 seconds
===================================================================
TESTSUITE test_write_file failed.
===================================================================
RunID: 99c4093aad855c1545f29078355efbd3
PROJECT EXECUTION FAILED

I was unable to reproduce it on the Transcend today unfortunately.

@clamattia
Copy link
Contributor Author

FYI: I have the same issue in production. Adding the debug log there made the bug go away.

@danieldegrasse
Copy link
Collaborator

Ok, so it seems the card fails to return to a ready state. Can you try this patch? #53017

This change should poll the SD card for the full duration of CONFIG_SD_DATA_TIMEOUT until it returns to a ready state or times out. Currently the SD card status is only checked once after write

@clamattia
Copy link
Contributor Author

Thanks @danieldegrasse for the quick patch. On a first glance it seems to improve but not completely fix it.
Now I am unable to reproduce the issue with the logs enabled. Will try some more later. But with the logs disable I am still able to get the error:

...
...............................chunk: 100 size: 32768 data: 03 ... 00
...............................chunk: 101 size: 32768 data: 03 ... 00
...............................chunk: 102 size: 32768 data: 03 ... 00
E: Write failed
E: file write error (-5)
Failed to write -5
WRITTEN 102 PACKETS A 32768 = 3342336
Assertion failed at CMAKE_SOURCE_DIR/src/main.c:64: test_write: (write_file() is false)
failed to write file
FAIL - test_write in 23.516 seconds
===================================================================
TESTSUITE test_write_file failed.

On a first glance it seems unrelated to CONFIG_SD_DATA_TIMEOUT. I set CONFIG_SD_DATA_TIMEOUT=60000 but it failed already after ~<10s.
I will try/investigate some more.

@clamattia
Copy link
Contributor Author

clamattia commented Dec 14, 2022

I was able to reproduce the problem with the both "bad"-cards, even with the patches.

Edit: If debug log is enabled both cards pass unfortunately. Probably due to timing reasons.

@danieldegrasse
Copy link
Collaborator

On a first glance it seems unrelated to CONFIG_SD_DATA_TIMEOUT. I set CONFIG_SD_DATA_TIMEOUT=60000 but it failed already after ~<10s.

Do you see this behavior even with the patch? Without the patch, CONFIG_SD_DATA_TIMEOUT wouldn't have an effect, as the SD card's ready status is only checked once after a write, without delays

@clamattia
Copy link
Contributor Author

clamattia commented Dec 14, 2022

It is hard to tell now. To increase the likelihood of the bug I removed all debug out. So it is hard to tell, when the timeout is starting and ending.

I re-enabled some debug out to check the timing. But then unfortunately it does not happen anymore.

But I think I have found what was causing the ignoring of the timeout earlier (unrelated) and believe we can assume, that the timeout is now respected. If you have an idea, how I could check this (without debug logging) let me know.

Edit: Btw, I think the patch helped, because it feels like it has become more difficult to reproduce the bug but it is hard to say for sure.

@danieldegrasse
Copy link
Collaborator

One way you can check the location of the error return would be to change the return code at the SD subsystem level. For example, I believe that your error is occurring on this line: https://github.com/nxp-zephyr/zephyr/blob/065b7ac02661d580b9300d4c1c1048874671f118/subsys/sd/sdmmc.c#L1421 (Note I've linked to code on the patched branch, as that is what I assume you're working from now). If you changed that return code to something obvious (say 0xDEADBEEF), you could likely get a better idea of if that is the location the error occurs at.

@clamattia
Copy link
Contributor Author

@danieldegrasse I can say confidently now that:

  • The timeout is not respected. I put it to 600 seconds CONFIG_SD_DATA_TIMEOUT=600000 the test still fails after 20.4 seconds.
START - test_write
E: Write failed
E: file write error (-5)
Failed to write -5
WRITTEN 48 PACKETS A 32768 = 1572864
Assertion failed at CMAKE_SOURCE_DIR/src/main.c:136: test_write: (write_file() is false)
failed to write file
FAIL - test_write in 20.463 seconds

@danieldegrasse
Copy link
Collaborator

The timeout is not respected. I put it to 600 seconds CONFIG_SD_DATA_TIMEOUT=600000 the test still fails after 20.4 seconds.

Just to be clear here, from what you are describing it sounds like the timeout is being respected, just that the card responds to a query after some time. In sdmmc_wait_ready, CONFIG_SD_DATA_TIMEOUT is treated as the maximum time the subsystem will wait for the card to respond before returning a timeout error. Based off the error location, it seems that ACMD22 is failing on your card. I would expect your card to support this command, as it is mandatory per the specification.

A few things that might help figure out the root cause here:

  • Is this the first time the write fails? If so, could you log the error code?
  • If you ignore the error and retry the call to fs_write, does the card write data?
  • In the SD implementation for v2.6.0, there was logic to retry a failed write in SPI mode: https://github.com/zephyrproject-rtos/zephyr/blob/zephyr-v2.6.0/drivers/disk/sdmmc_spi.c#L471. If possible, could you check if this function is called while running this test on v2.6.0? Not sure if you've got a branch you can revert to.

Also, I've added emulation of the retry behavior in v2.6.0 on this sha: https://github.com/nxp-zephyr/zephyr/tree/fdcf37a65a033d59081d8807f72866185d15ef59. Could you try this, and see if the error recovery logic can fix your issue?

@clamattia
Copy link
Contributor Author

Thanks for the help. Yes I did not express myself well. At first I thought, that maybe the timeout was ignored. But in fact, as you point out, and as I later found out by figuring out the exact source of the problem, it is not the timeout causing the issue.

The error code is -5 returned here https://github.com/nxp-zephyr/zephyr/blob/065b7ac02661d580b9300d4c1c1048874671f118/subsys/sd/sdmmc.c#L1358 (I will double check just to be sure)

I will try the other suggestions today.

Thank you.

@clamattia
Copy link
Contributor Author

  • I added a retry logic (with additional sleep) to call fs_write again in case of error. This did not help with the bug. (It failed the second time also, always).
  • The suggested sha does avoid the error from happening, but the test gets stuck and cannot proceed. I could not yet figure out why. I tried to debug using printk but cannot reproduce problem then (timing issue). I also put the timeout back to CONFIG_SD_DATA_TIMEOUT=1000. Maybe the card never returns from busy state?

One more thing I observed, The "good" card supports UHS-I with speed class 3. One of the "bad" cards does only support UHS-I with speed class 1, while the other does not seem to support UHS at all (seems to only support regular speed with class 10). Not sure if relevant.

@danieldegrasse
Copy link
Collaborator

The suggested sha does avoid the error from happening, but the test gets stuck and cannot proceed. I could not yet figure out why. I tried to debug using printk but cannot reproduce problem then (timing issue). I also put the timeout back to CONFIG_SD_DATA_TIMEOUT=1000. Maybe the card never returns from busy state?

I'm interested to know more about this failure. Does the test get stuck at the same location the error previously occurred at? If so, where did you place printk calls that caused the issue to disappear? I'm wondering if there are a few locations printk calls could be used (such as when the test is already stuck) that could help debug the issue

@clamattia
Copy link
Contributor Author

clamattia commented Jan 3, 2023

I put the printks before each call to fs_write which makes the bug disappear.

Does the test get stuck at the same location the error previously occurred at?

Do you have a suggestion how to find out? Or where I could add a printk to help us debug the issue?

@danieldegrasse
Copy link
Collaborator

I put the printks before each call to fs_write which makes the bug disappear.

Can adding a call to printk before each fs_write call resolve the issue without any other changes (ie on the latest revision of Zephyr main), or does the fix only work with the changes in https://github.com/nxp-zephyr/zephyr/tree/fdcf37a65a033d59081d8807f72866185d15ef59?

Do you have a suggestion how to find out? Or where I could add a printk to help us debug the issue?

I expect the hang is occurring in the error recovery function, so I'd start there: nxp-zephyr@fdcf37a#diff-c4b06563dc5d9e3210c473243a2cdfbd20800c1ab33dcf71ccbc44233ea83a8eR596

@clamattia
Copy link
Contributor Author

clamattia commented Jan 3, 2023

Can adding a call to printk before each fs_write call resolve the issue without any other changes (ie on the latest revision of Zephyr main), or does the fix only work with the changes in https://github.com/nxp-zephyr/zephyr/tree/fdcf37a65a033d59081d8807f72866185d15ef59?

I went back to before the proposed fixes. Note, that this is not zephyr main but a fork based on 3.1.99 https://github.com/nrfconnect/sdk-zephyr/commits/v3.1.99-ncs1. Is something important missing from upstream? I saw, that the history of subsys/sd and /driver/sdhc/ have quite some additions since then.

Compare for example: https://github.com/nrfconnect/sdk-zephyr/commits/86893246053c50d34618f09ec6722cae8ba19472/drivers/sdhc and https://github.com/nrfconnect/sdk-zephyr/commits/86893246053c50d34618f09ec6722cae8ba19472/subsys/sd with https://github.com/nxp-zephyr/zephyr/commits/fdcf37a65a033d59081d8807f72866185d15ef59/drivers/sdhc and https://github.com/nxp-zephyr/zephyr/commits/065b7ac02661d580b9300d4c1c1048874671f118/subsys/sd/sdmmc.c

Doing so, I can get the write error still. If I then add the printk before the fs_write, I do not get the write error but instead I get a -5 error when attempting to close the file. (Sorry this might not be helpful).

Note on how I tested the your proposed fixes: I cherry picked them (and previous commits in that history) on our fork based on 3.1.99.

@clamattia
Copy link
Contributor Author

One more note. Before the proposed fix (and the other cherry-picks from that history), I can reproduce the bug with the DEBUG_LOG enabled:

D: cmd13 arg 0x0
D: WRITE: Sector = 473728, Count = 64
D: cmd25 arg 0x73a80
D: cmd13 arg 0x0
WRITTEN 452 PACKETS A 32768 = 14811136
attempting to closeD: READ: Sector = 8192, Count = 1
D: cmd17 arg 0x2000
D: WRITE: Sector = 8192, Count = 1
D: cmd24 arg 0x2000
D: cmd13 arg 0x0
E: file close error (-5)
error closing file

With the proposed fix (and the other cherry-picks from that history) and the DEBUG_LOG enabled, the bug can not be reproduced. Only with debug log disabled.

This suggests, that there might be two separate issues. One fixed by the proposal and a separate, timing-sensitive issue.

@danieldegrasse
Copy link
Collaborator

I went back to before the proposed fixes. Note, that this is not zephyr main but a fork based on 3.1.99 nrfconnect/sdk-zephyr@v3.1.99-ncs1 (commits). Is something important missing from upstream? I saw, that the history of subsys/sd and /driver/sdhc/ have quite some additions since then.

Thanks for the clarification- there are two commits that might be useful since the branch you linked:

  • nxp-zephyr@2c2b8c3: This adds support for f_sync to the SPI mode SD driver. I am unsure if f_sync is being called internally by the filesystem, but this could cause issues if you don't have an implementation of the function present on your branch.

  • nxp-zephyr@ba9220c: This fixed a timing bug seen when the SD subsystem was compiled with CONFIG_SPEED_OPTIMIZATIONS=y. However, this was only observed on cards running in native SD mode, and the failure seen in that issue doesn't seem similar to the one you experienced. Either way, that fix is currently only applied to cards running in native mode. You could try adding a similar k_busy_wait call here: https://github.com/nxp-zephyr/zephyr/blob/ba9220c5bdf951983bcee3f6c49b195a066d98e8/subsys/sd/sdmmc.c#L975 (before the call to sdmmc_send_ocr) if you want to try this fix.

Doing so, I can get the write error still. If I then add the printk before the fs_write, I do not get the write error but instead I get a -5 error when attempting to close the file. (Sorry this might not be helpful).

The -5 error may be due to the support for f_sync being absent (see nxp-zephyr@2c2b8c3)

@danieldegrasse
Copy link
Collaborator

Could you try applying these commits?

The implementation of sdmmc_wait_ready in the fix I initially proposed did not check if the card was busy prior to reading its status. This commit does, which may help resolve the issue. You'll need the second linked commit to add support for checking the busy status of a card in SPI mode.

@clamattia
Copy link
Contributor Author

Could you try applying these commits?

* [nxp-zephyr@c612299](https://github.com/nxp-zephyr/zephyr/commit/c6122991ba51cf4128064e7ac4cb352ab0e07073)

* [nxp-zephyr@2c2b8c3](https://github.com/nxp-zephyr/zephyr/commit/2c2b8c342247ff767ee369169a74dbcf09b5edc6)

I went back to our fork of version 3.1.99. I applied the two suggested commits (cherry-pick).

With those I can not reproduce the bug under conditions it consistently happens without them.

Great job and congratulations on finding this 👍
Thanks for the help.

Feel free to close this issue, when appropriate.

[ Out of curiosity, do you know why only some of the SD-cards are affected? Might it be related to the speed-class, such that some of them are always ready again, before zephyr can contact them, while the slower might not be ready? ]

@danieldegrasse
Copy link
Collaborator

Thanks for the help.

Absolutely, appreciate the prompt responses.

[ Out of curiosity, do you know why only some of the SD-cards are affected? Might it be related to the speed-class, such that some of them are always ready again, before zephyr can contact them, while the slower might not be ready? ]

I believe that is the reason, yes. The changes present in the two commits you applied wait for the card to finish programming before checking its status, while the upstream SD implementation simply checks SD status directly after the write has completed.

carlescufi pushed a commit that referenced this issue Jan 10, 2023
Update sdmmc framework to use sdmmc_wait_ready when accessing card in
SPI mode. this will allow cards that do not return to ready to be polled
for busy status until the SD data timeout expires

Fixes #52931

Signed-off-by: Daniel DeGrasse <daniel.degrasse@nxp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Disk Access area: Drivers area: File System bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug Regression Something, which was working, does not anymore
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants