smartctl -a returns "Read Self-test Log failed: Invalid Field in Command (0x2002)" with Kingston SFYRS1000G #217

Closed
sbraz opened this issue Nov 1, 2023 · 11 comments

@sbraz

sbraz commented Nov 1, 2023

Hello,
Since version 7.4 (4c974b3 to be precise, when the support for NVMe self-tests was added), I cannot run smartctl -a /dev/nvme0n1 without getting Read Self-test Log failed: Invalid Field in Command (0x2002). It also causes an increment to the Error Information Log Entries counter.

To reproduce, the following is enough:

# smartctl -l selftest /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.7-gentoo-x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Read Self-test Log failed: Invalid Field in Command (0x2002)
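
The value in parentheses can be split into its fields. The following is a minimal sketch, assuming smartctl prints the 15-bit NVMe completion status field without the phase tag (status code in bits 7:0, status code type in bits 10:8, CRD in bits 12:11, More in bit 13, Do Not Retry in bit 14). The function name and the dictionary keys are made up for illustration and are not part of smartmontools.

```python
def decode_nvme_status(status: int) -> dict:
    """Split a 15-bit NVMe status field (phase tag already removed) into its parts."""
    return {
        "sc": status & 0xFF,               # Status Code
        "sct": (status >> 8) & 0x7,        # Status Code Type (0 = generic command status)
        "crd": (status >> 11) & 0x3,       # Command Retry Delay
        "more": bool(status & (1 << 13)),  # More: further details in the Error Information log
        "dnr": bool(status & (1 << 14)),   # Do Not Retry
    }


if __name__ == "__main__":
    fields = decode_nvme_status(0x2002)
    print(fields)
```

Under that assumption, 0x2002 is generic status code 0x02 (Invalid Field in Command) with the More bit set, which would fit the observation that each failed read also adds an entry to the Error Information log.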

Here are some details about the device:

# smartctl -i /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.7-gentoo-x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SFYRS1000G
Serial Number:                      
Firmware Version:                   EIFK31.6
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            
Local Time is:                      Wed Nov  1 02:52:34 2023 CET

nvme-cli has no problem reading the log and it does not increment the error log entries counter:

# nvme version
nvme version 2.6 (git 2.6)
libnvme version 1.6 (git 1.6)
# nvme self-test-log /dev/nvme0n1
Device Self Test Log for NVME device:nvme0n1
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0xf
Self Test Result[1]:
  Operation Result             : 0xf
Self Test Result[2]:
  Operation Result             : 0xf
Self Test Result[3]:
  Operation Result             : 0xf
Self Test Result[4]:
  Operation Result             : 0xf
Self Test Result[5]:
  Operation Result             : 0xf
Self Test Result[6]:
  Operation Result             : 0xf
Self Test Result[7]:
  Operation Result             : 0xf
Self Test Result[8]:
  Operation Result             : 0xf
Self Test Result[9]:
  Operation Result             : 0xf
Self Test Result[10]:
  Operation Result             : 0xf
Self Test Result[11]:
  Operation Result             : 0xf
Self Test Result[12]:
  Operation Result             : 0xf
Self Test Result[13]:
  Operation Result             : 0xf
Self Test Result[14]:
  Operation Result             : 0xf
Self Test Result[15]:
  Operation Result             : 0xf
Self Test Result[16]:
  Operation Result             : 0xf
Self Test Result[17]:
  Operation Result             : 0xf
Self Test Result[18]:
  Operation Result             : 0xf
Self Test Result[19]:
  Operation Result             : 0xf

Please let me know how I can help troubleshoot this issue.

@sbraz changed the title from smartctl -a returns "Read Self-test Log failed: Invalid Field in Command (0x2002)" with KINGSTON SFYRS1000G to smartctl -a returns "Read Self-test Log failed: Invalid Field in Command (0x2002)" with Kingston SFYRS1000G on Nov 1, 2023

@chrfranke

Please retry with broadcast namespace: smartctl -l selftest /dev/nvme0

@sbraz
Author

sbraz commented Nov 1, 2023

@chrfranke it seems to work:

# smartctl -l selftest /dev/nvme0
smartctl 7.4 (build date Nov  1 2023) [x86_64-linux-6.5.7-gentoo-x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

@chrfranke

Related or duplicate: ticket 1741.

@chrfranke

Please try also: smartctl -d nvme,0xffffffff -l selftest /dev/nvme0n1

@sbraz
Author

sbraz commented Nov 11, 2023

It works as well:

# smartctl -d nvme,0xffffffff -l selftest /dev/nvme0n1
smartctl 7.4 (build date Nov 11 2023) [x86_64-linux-6.5.7-gentoo-x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

@mmokrejs

Hi, I have a similar issue with a Corsair MP600 PRO NH 8TB (firmware EIFM51.3):

# smartctl -x /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.4.16] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Corsair MP600 PRO NH
Serial Number:                      xxxx
Firmware Version:                   EIFM51.3
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 8,001,563,222,016 [8.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          8,001,563,222,016 [8.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 81aac01b7d
Local Time is:                      Sat Jan 20 15:36:28 2024 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0c):         Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     89 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.80W       -        -    0  0  0  0        0       0
 1 +     7.10W       -        -    1  1  1  1        0       0
 2 +     5.20W       -        -    2  2  2  2        0       0
 3 -   0.0620W       -        -    3  3  3  3     2500    7500
 4 -   0.0440W       -        -    4  4  4  4    10500   65000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        48 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    21,711,686 [11.1 TB]
Data Units Written:                 38,990,195 [19.9 TB]
Host Read Commands:                 475,953,129
Host Write Commands:                979,837,212
Controller Busy Time:               1,052
Power Cycles:                       46
Power On Hours:                     153
Unsafe Shutdowns:                   31
Media and Data Integrity Errors:    0
Error Information Log Entries:      76
Warning  Comp. Temperature Time:    109
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   3
Thermal Temp. 2 Transition Count:   1
Thermal Temp. 1 Total Time:         14943
Thermal Temp. 2 Total Time:         76

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0         76     0  0x1004  0x4004  0x004            0     1     -  Invalid Field in Command
  1         75     0  0x001c  0x4004  0x028            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                 152            -     -   -   -    -
 1   Extended          Completed without error                   5            -     -   -   -    -
 2   Short             Completed without error                   5            -     -   -   -    -
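
The Status column of the error log (0x4004) and the status that smartctl prints when the read fails (0x2002) look different but appear to describe the same completion status. A small sketch, assuming the error log entry stores the status field in bits 15:1 with the phase tag in bit 0, as the NVMe Error Information log layout describes; the function name is hypothetical:

```python
def split_error_log_status(raw: int) -> tuple:
    """Return (status_field, phase_tag) from a 16-bit error log Status value."""
    status_field = raw >> 1  # drop the phase tag to get the 15-bit status field
    phase_tag = raw & 0x1    # bit 0 is the phase tag posted for the command
    return status_field, phase_tag


if __name__ == "__main__":
    status, phase = split_error_log_status(0x4004)
    print(hex(status), phase)
```

With that layout, 0x4004 shifts down to 0x2002, the same value smartctl reports for the failed self-test log read.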

My Dell E5580 and my external M.2 NVMe-to-USB3 enclosures are only PCIe Gen3, while the MP600 is a Gen4 device. A few times the RTL9210B chips in the external enclosures failed to talk to the xHCI host. The NVMe SSD has also completely disappeared from the BIOS and seemed dead. I repeatedly have problems resuming under Linux 6.6.11 and 6.4.16: I lose access to the NVMe filesystem, and not even dmesg works; it fails with an I/O error. So I do not know what triggered the errors reported by smartmontools.

Some messages caught on the next boot (after the unsuccessful resume):

# dmesg
...
[    1.130687] pcieport 0000:00:1d.0: AER: enabled with IRQ 125
[   13.900901] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
[   13.901655] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   13.902398] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00000001/00002000
[   13.903172] pcieport 0000:00:1d.0:    [ 0] RxErr                 
[   14.777095] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
[   14.778971] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   14.780821] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00000001/00002000
[   14.782675] pcieport 0000:00:1d.0:    [ 0] RxErr                  (First)
[   15.694484] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
[   15.696437] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   15.698375] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00000001/00002000
[   15.700281] pcieport 0000:00:1d.0:    [ 0] RxErr                  (First)
...
# lspci -t -v
-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
           +-02.0  Intel Corporation HD Graphics 630
           +-04.0  Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem
           +-14.0  Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller
           +-14.2  Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem
           +-15.0  Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0
           +-15.1  Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1
           +-16.0  Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1
           +-16.3  Intel Corporation 100 Series/C230 Series Chipset Family KT Redirection
           +-17.0  Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode]
           +-1c.0-[01]----00.0  Intel Corporation Wireless 8265 / 8275
           +-1c.2-[02]----00.0  Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader
           +-1c.4-[03]--
           +-1d.0-[04]----00.0  Phison Electronics Corporation E18 PCIe4 NVMe Controller
           +-1f.0  Intel Corporation CM238 Chipset LPC/eSPI Controller
           +-1f.2  Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller
           +-1f.3  Intel Corporation CM238 HD Audio Controller
           +-1f.4  Intel Corporation 100 Series/C230 Series Chipset Family SMBus
           \-1f.6  Intel Corporation Ethernet Connection (5) I219-LM
#

From another boot with 6.4.16 I collected:

[    4.434871] pcieport 0000:00:1c.2: AER: Multiple Corrected error received: 0000:02:00.0
[    4.435708] pcieport 0000:00:1c.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[    4.436405] pcieport 0000:00:1c.2:   device [8086:a112] error status/mask=00001000/00002000
[    4.437128] pcieport 0000:00:1c.2:    [12] Timeout               
[    4.437845] rtsx_pci 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[    4.438541] rtsx_pci 0000:02:00.0:   device [10ec:525a] error status/mask=00000080/00006000
[    4.439290] rtsx_pci 0000:02:00.0:    [ 7] BadDLLP               
[    4.440030] rtsx_pci 0000:02:00.0: AER:   Error of this Agent is reported first

@chrfranke

chrfranke commented Jan 22, 2024

@sbraz: Thanks for testing.

smartctl -l selftest /dev/nvme0n1 - fails.
smartctl -l selftest /dev/nvme0 - works.
smartctl -d nvme,0xffffffff -l selftest /dev/nvme0n1 - works.

This suggests that (some) single-namespace devices do not allow reading the self-test log with the NSID field set to 1. The NVMe specs might not cleanly specify this corner case.
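
To make the difference concrete, here is a rough sketch of the Get Log Page admin command behind the self-test log read. It assumes the generic NVMe layout (opcode 0x02, log identifier in CDW10 bits 7:0, zero-based dword count split across CDW10 bits 31:16 and CDW11 bits 15:0) and a 564-byte Device Self-test log (4 header bytes plus 20 results of 28 bytes each). The class and function names are invented for illustration and do not mirror smartmontools internals.

```python
from dataclasses import dataclass

SELFTEST_LOG_ID = 0x06             # Device Self-test log page
SELFTEST_LOG_SIZE = 4 + 20 * 28    # 564 bytes: header plus 20 result entries
BROADCAST_NSID = 0xFFFFFFFF


@dataclass
class GetLogPage:
    opcode: int
    nsid: int
    cdw10: int
    cdw11: int


def build_get_log_page(lid: int, size: int, nsid: int) -> GetLogPage:
    """Build the command fields for reading `size` bytes of log page `lid`."""
    if size <= 0 or size % 4:
        raise ValueError("log size must be a positive multiple of 4 bytes")
    numd = size // 4 - 1                   # number of dwords, zero-based
    cdw10 = ((numd & 0xFFFF) << 16) | lid  # NUMDL | LID
    cdw11 = numd >> 16                     # NUMDU
    return GetLogPage(opcode=0x02, nsid=nsid, cdw10=cdw10, cdw11=cdw11)


if __name__ == "__main__":
    failing = build_get_log_page(SELFTEST_LOG_ID, SELFTEST_LOG_SIZE, nsid=1)
    working = build_get_log_page(SELFTEST_LOG_ID, SELFTEST_LOG_SIZE, nsid=BROADCAST_NSID)
    print(failing)
    print(working)
```

In this sketch the two commands are identical except for the NSID field, which is the only thing the tests above varied.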

Conclusion: smartctl -l selftest should always use the broadcast namespace for such devices.
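
Until a fixed build is installed, the two invocations that worked in this thread can be generated from a namespace device path. This is only a convenience sketch: the helper names are hypothetical, the /dev/nvmeXnY naming is the Linux convention seen above, and nothing is executed, the command lines are just built.

```python
import re

BROADCAST_NSID = 0xFFFFFFFF


def controller_device(dev: str) -> str:
    """Map a namespace device such as /dev/nvme0n1 to its controller /dev/nvme0."""
    match = re.fullmatch(r"(/dev/nvme\d+)n\d+", dev)
    return match.group(1) if match else dev


def selftest_log_commands(dev: str) -> list:
    """Return the two workaround command lines reported to work in this thread."""
    return [
        # Address the controller device, which uses the broadcast namespace.
        ["smartctl", "-l", "selftest", controller_device(dev)],
        # Keep the namespace device but force the broadcast NSID explicitly.
        ["smartctl", "-d", f"nvme,0x{BROADCAST_NSID:08x}", "-l", "selftest", dev],
    ]


if __name__ == "__main__":
    for argv in selftest_log_commands("/dev/nvme0n1"):
        print(" ".join(argv))
```

Either list can be passed to subprocess.run if the output should be captured by a script.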

@mmokrejs

It also happens with commands other than just -l selftest:

# smartctl -x /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.11] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Corsair MP600 PRO NH
...
Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x2002)

#
# smartctl -x /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.11] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Corsair MP600 PRO NH
...
Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0         80     0  0x9001  0x4004  0x004            0     1     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                 152            -     -   -   -    -
 1   Extended          Completed without error                   5            -     -   -   -    -
 2   Short             Completed without error                   5            -     -   -   -    -

#

@sbraz
Author

sbraz commented Jan 27, 2024

> This suggests that (some) single-namespace devices do not allow reading the self-test log with the NSID field set to 1. The NVMe specs might not cleanly specify this corner case.

@chrfranke would it be possible for smartctl to do whatever the nvme CLI tool does and apply some kind of workaround for these devices? Surely they have some logic in place to handle this case since nvme self-test-log /dev/nvme0n1 works.

> It also happens with commands other than just -l selftest:

@mmokrejs -x is an alias that includes the selftest option, see the manual:

          For NVMe, this is equivalent to
         '-H -i -c -A -l error -l selftest'.

@chrfranke

See ticket 1741.

@sbraz
Author

sbraz commented Mar 29, 2024

I can confirm that the git master as of today (d117639) works as expected, thanks!

$ smartctl -l selftest /dev/nvme0n1
smartctl 7.4 (build date Mar 30 2024) [x86_64-linux-6.7.6-gentoo-x86_64] (local build)
Copyright (C) 2002-24, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
No Self-tests Logged
