When the disk is almost full, the storage will crash #3423

Closed
handsonbao opened this issue Dec 7, 2021 · 4 comments · Fixed by #3576

Comments

@handsonbao

handsonbao commented Dec 7, 2021

Please check the FAQ documentation before raising an issue

Describe the bug (required)
When I used the importer to load SF300 into Nebula, I saw some errors in the output log. I stopped the import and found that storaged had crashed. I checked everything, including the storaged log, but found nothing.
Eventually I noticed that the disk was almost full, so I deleted some data.
I restarted storaged and re-imported SF300. It worked, with no errors.

When the disk is almost full, the storage will crash
Your Environments (required)
ent 2.6.1 c074eeb

  • OS: uname -a
    4.18.0-305.7.1.el8_4.x86_64 #1 SMP Tue Jun 29 21:55:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

  • Compiler: g++ --version or clang++ --version
    g++ (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1)
    Copyright © 2018 Free Software Foundation, Inc.

  • CPU: lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 96
    On-line CPU(s) list: 0-95
    Thread(s) per core: 2
    Core(s) per socket: 24
    Socket(s): 2
    NUMA node(s): 2
    Vendor ID: GenuineIntel
    BIOS Vendor ID: Intel(R) Corporation
    CPU family: 6
    Model: 85
    Model name: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
    BIOS Model name: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
    Stepping: 7
    CPU MHz: 3100.013
    CPU max MHz: 3900.0000
    CPU min MHz: 1000.0000
    BogoMIPS: 5000.00
    Virtualization: VT-x
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 1024K
    L3 cache: 36608K
    NUMA node0 CPU(s): 0-23,48-71
    NUMA node1 CPU(s): 24-47,72-95

  • Commit id (e.g. a3ffc7d8)

How To Reproduce(required)

Steps to reproduce the behavior:

Follow the steps in "Describe the bug" above.

Expected behavior

Additional context

@handsonbao handsonbao added the type/bug Type: something is unexpected label Dec 7, 2021
@Sophie-Xie Sophie-Xie added this to the v3.0.0 milestone Dec 7, 2021
@critical27
Contributor

Check whether your storage log contains a line like "Failed to appendLogs because of no more space".

@handsonbao
Author

I found no mention of "appendLogs".

@Sophie-Xie Sophie-Xie added the need info Solution: need more information (ex. can't reproduce) label Dec 15, 2021
@Sophie-Xie Sophie-Xie assigned Nivras and unassigned critical27 Dec 15, 2021
@handsonbao
Author

I just tested this on a single node and found the following in nebula-storage.INFO:

W1215 10:29:13.669871 10755 FileBasedWal.cpp:520] [Port: 9780, Space: 1, Part: 9] Failed to appendLogs because of no more space
W1215 10:29:13.669879 10755 RaftPart.cpp:718] [Port: 9780, Space: 1, Part: 9] Failed to write into WAL
W1215 10:29:13.669885 10755 RaftPart.cpp:731] [Port: 9780, Space: 1, Part: 9] Failed to write wal
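
For anyone hunting for the same warning in a large log, the lines above can be matched with a short script. This is only a sketch: the regular expression assumes the glog-style prefix seen in the three lines above (severity letter, MMDD date, time, thread id, `file:line]`), and the function name `find_no_space_warnings` is made up for illustration.

```python
import re

# Assumed glog-style prefix, e.g.:
# W1215 10:29:13.669871 10755 FileBasedWal.cpp:520] message...
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

# Message text quoted in this thread.
NO_SPACE = "Failed to appendLogs because of no more space"


def find_no_space_warnings(lines):
    """Return (time, source file, message) for every out-of-space WAL warning."""
    hits = []
    for raw in lines:
        m = GLOG_LINE.match(raw.strip())
        if m and NO_SPACE in m.group("msg"):
            hits.append((m.group("time"), m.group("src"), m.group("msg")))
    return hits


if __name__ == "__main__":
    # The path is a placeholder; point it at your own storaged log file.
    with open("nebula-storage.INFO", encoding="utf-8", errors="replace") as f:
        for time, src, msg in find_no_space_warnings(f):
            print(time, src, msg)
```

If the list comes back empty even though storaged exited, the disk may have filled up before the warning could be written, which would match the first report in this thread.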

storaged eventually crashed:
Wed Dec 15 10:33:10 UTC 2021
[vesoft@handson scripts]$ sudo ./nebula.service status all
[INFO] nebula-metad(de03025): Running as 10702, Listening on 9559
[INFO] nebula-graphd(de03025): Running as 10716, Listening on 9669
[INFO] nebula-storaged(de03025): Exited

In a cluster, if the disk on one node is almost full, the storaged on that node will crash.

@Sophie-Xie Sophie-Xie removed the need info Solution: need more information (ex. can't reproduce) label Dec 20, 2021
@Nivras
Contributor

Nivras commented Dec 23, 2021

This happens because diskManager reads the disk space info and caches the free bytes only once every 10 seconds. Every time storaged writes to the WAL, it checks that the cached free bytes are not below minimum_reserved_bytes. In this case minimum_reserved_bytes is 256 MB and the write speed is very high, so the disk runs out of space before diskManager updates the free bytes, and storaged hits a fatal error.

I changed minimum_reserved_bytes to 2.5 GB and tested again: storaged now returns an error when about 2.1 GB of disk space is left, and it no longer crashes.
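
The timing problem can be illustrated with a toy model. This is a sketch, not Nebula's actual code: the class `StaleFreeSpaceModel`, its field names, the starting free space, and the 40 MB/s write rate are all assumptions chosen for illustration. Only the 10-second refresh interval and the 256 MB and 2.5 GB reserve values come from this thread.

```python
from dataclasses import dataclass
from typing import Tuple

MB = 1024 * 1024


@dataclass
class StaleFreeSpaceModel:
    """Toy model: free space is sampled periodically; writes check the cached sample."""
    free_bytes: int          # real free space on disk
    reserved_bytes: int      # stands in for minimum_reserved_bytes
    refresh_interval_s: int  # how often the cached value is refreshed
    write_rate_bps: int      # assumed sustained write rate, bytes per second

    def run(self, max_seconds: int = 3600) -> Tuple[str, int]:
        cached = self.free_bytes
        for t in range(max_seconds):
            if t % self.refresh_interval_s == 0:
                cached = self.free_bytes  # periodic refresh of the cached value
            if cached < self.reserved_bytes:
                # The check fires in time: the write is rejected gracefully.
                return ("rejected", self.free_bytes)
            if self.free_bytes < self.write_rate_bps:
                # The cached value still looked fine, but the disk is exhausted.
                return ("disk_full", self.free_bytes)
            self.free_bytes -= self.write_rate_bps
        return ("ok", self.free_bytes)


# Small reserve: the disk fills up between two refreshes.
small = StaleFreeSpaceModel(700 * MB, 256 * MB, 10, 40 * MB)
print(small.run())  # ('disk_full', 20971520), i.e. 20 MB left

# Larger reserve: the stale value is still noticed before the disk is full.
large = StaleFreeSpaceModel(3000 * MB, 2560 * MB, 10, 40 * MB)
print(large.run())  # ('rejected', 2306867200), i.e. 2200 MB left
```

In this model the disk cannot fill up between two refreshes as long as the reserve is at least the write rate multiplied by the refresh interval (here 40 MB/s × 10 s = 400 MB). That is consistent with a 256 MB reserve being too small for a fast import, though the real write rate in the reported test is not known.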
