Revert "MKS-Klipad50: Switch to standard support" #7883

Closed
wants to merge 1 commit into from

Conversation

torte71
Contributor

@torte71 torte71 commented Feb 26, 2025

This reverts commit 4bddce9.

Standard support turned out to be too complicated for me.

@github-actions github-actions bot added size/small PR with less then 50 lines Needs review Seeking for review Hardware Hardware related like kernel, U-Boot, ... labels Feb 26, 2025
@igorpecovnik
Member

too complicated for me

In what sense? This is still a grey zone. You are not obligated to do the heavy lifting or resolve bugs. If the device generally works well and most of the general rules apply, why not try at least until the next release and then decide?

@igorpecovnik igorpecovnik added the Discussion Being discussed - Voice your opinions :) label Feb 27, 2025
@torte71
Contributor Author

torte71 commented Feb 28, 2025

I was panicking. :)

After the change from .csc to .conf, only the nightly trixie/jammy images got created and apt.armbian.com did not receive any packages (though beta.armbian.com still does). I wrongly assumed that stable images would have been generated earlier.
This and your "website status is still on manual change" comment from #7851 (comment) (which I still do not fully understand) made me think that I had driven the project into a dead end because of some unmet requirements, so I wanted to revert.

But now it sounds as if that setup is correct and sufficient, i.e. a stable image and packages on apt.armbian.com will follow in 05/25 (or sooner, if there is an earlier release). That would be perfectly fine.

You are right, I'll drop this PR and wait for the next release.

Thanks for the clarification.

@torte71 torte71 closed this Feb 28, 2025
@igorpecovnik
Member

igorpecovnik commented Feb 28, 2025

No worries.

After the change from .csc to .conf, only the nightly trixie/jammy images got created and apt.armbian.com did not receive any packages

Let me clarify this better.

Stable images are released manually, one by one, even though we have "build them all" automation, because some families either have known serious problems or weren't finished by the release date. It's a quality control feature: better old images than broken new ones. We can't hold back the others ... so I think it's better to release the majority and release the rest once ready. This is happening right now, today, tomorrow. It's a lot of manual work, first fixing and then releasing / testing. It involves several people, who all have some life happening in between. I had to spend a few days with my family, as otherwise I risk serious problems of a different kind ;) There is backup, but not for all roles and tasks ...

Since most targets got a major kernel upgrade, where troubles are more than expected, we are holding back on populating the apt repository with all packages. Selecting here is much more difficult, as a kernel is common to many boards. Luckily we don't have just one kernel, so we can hold back Rockchip but release Allwinner ... but that only adds more manual work and the risk of introducing bugs in the process.

your "website status is still on manual change"

When images are pushed to the download servers, the index on the download pages is updated automatically, but not the status itself (supported / csc ...). This also slightly affects how images are displayed. What we are looking for here is a script that would update the WordPress database with changes from Git, as Git is the source of truth. For now this still has to be switched by hand.

If a board was previously .csc, it will keep its community supported targets on the download pages until the next recompilation after the change to .conf (this happens weekly, sometimes not due to compilation issues, which then have to be fixed before it is run again). A .conf board gets daily images and a daily repo, but no community builds ... Perhaps too many complications.

That would be perfectly fine.

Yes, everything should show up properly within a week. If not, then something could be wrong somewhere; in that case open a ticket, ping me, ...

@torte71
Contributor Author

torte71 commented Feb 28, 2025

Many thanks for that detailed explanation. Getting the whole picture from reading the docs and the workflow galore is a bit demanding for me, being new to this project.
But that really cleared things up. 👍

@igorpecovnik
Member

igorpecovnik commented Feb 28, 2025

But that really cleared things up.

Great! I know it's overwhelming for anyone who wants to jump in. We try our best.

Stable images (.conf) - you can prepare them on your own and tell me when they are tested, so that they can be moved to the download folder. You have the rights for that (once you accept the invitation to join .org), and most of the related documentation is here - https://docs.armbian.com/Process_CI/

When you manage.

@torte71
Contributor Author

torte71 commented Feb 28, 2025

Strange: The noble-minimal image goes into bootloop, kernel runs into "Synchronous abort" handler directly after u-boot starts the kernel.
(I've checked the .asc and .sha checksums, the image is correct. Tried different image-writing programs and a different emmc card. It's not a download or card error.)

It has something to do with the initrd - if I replace it with the initrd from noble-server, it boots up OK.
Regenerating initrd on noble-minimal does not solve it.
Comparing the extracted initrd contents shows that noble-minimal lacks some quite basic libraries - e.g. libz.so.1, liblzo2.so.2, libfuse3.so.3. That might be a trace.
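
A minimal sketch of that comparison, assuming both initrds have already been unpacked into two directories. The function names and any directory names are hypothetical and not part of the Armbian tooling; symlinked directories are not descended into.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_files(reference_root, candidate_root):
    """List files that exist in reference_root but not in candidate_root."""
    return sorted(list_files(reference_root) - list_files(candidate_root))
```

Calling `missing_files("initrd-server", "initrd-minimal")` on two such (hypothetical) directories would list the files that only the noble-server initrd contains.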

I'll investigate that further, but I wonder if this really only affects the mks-klipad50?

Edit: Both other images (bookworm-minimal and noble-server) work fine.

@igorpecovnik
Member

igorpecovnik commented Feb 28, 2025

This is strange, huh. Check the CI build logs for anything odd, like a qemu crash. Images were assembled on an x86 machine - we can force them to use aarch64 runners ...

Build logs for broken image:
https://paste.armbian.de/gigazitici

Build logs for OKish Noble server:
https://paste.armbian.de/emehuriqar

I wonder if this really only affects the mks-klipad50?

We need to find that out.

@igorpecovnik
Member

igorpecovnik commented Feb 28, 2025

Nothing obvious - this was built from trunk, and there are some commits after the release that could cause trouble. I would propose a quick workaround - removing the broken image (in progress) - until we find out why it broke.

@torte71
Contributor Author

torte71 commented Feb 28, 2025

No need to hurry.
If it affects more boards, then fixing it for all has priority. If it affects only my board, then it still feels better if things got sorted out beforehand.
Tomorrow more, now it's family time.

@torte71
Contributor Author

torte71 commented Mar 1, 2025

A working uInitrd (without bootloop) gets generated after installing "fuse3".
On noble-minimal (with "uInitrd" copied from noble-server to allow boot):

  • update-initramfs -u && reboot
    • creates bootloop
  • apt install fuse3 && reboot
    • (automatically runs update-initramfs -u)
    • boots OK
  • apt remove fuse3 && reboot
    • (automatically runs update-initramfs -u)
    • bootloop

A noble-minimal built locally from 25.05-trunk (this morning) has no problems booting - but also has no fuse3 installed.
Digging further into it...

Edit: The initrd from noble-minimal DOES have fuse3, even though the normal filesystem doesn't.
After an update-initramfs with fuse3 installed, the resulting initrd contains the same files as the original (non-booting) noble-minimal, but there are binary differences between libfuse3.so.3, libfuse3.so.3.14.0, mount.fuse3 (but they have the same version numbers).
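
One way to spot such binary differences is to hash the same-named files in both unpacked trees. This is only a sketch under the assumption that both initrds are extracted to local directories; the helper names are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(root_a, root_b, relative_paths):
    """Return the relative paths whose contents differ between the two trees."""
    changed = []
    for rel in relative_paths:
        hash_a = sha256_of(os.path.join(root_a, rel))
        hash_b = sha256_of(os.path.join(root_b, rel))
        if hash_a != hash_b:
            changed.append(rel)
    return changed
```

Symlinks are followed when the files are opened, so a link such as libfuse3.so.3 is reported together with its target if the target differs.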

@torte71
Contributor Author

torte71 commented Mar 3, 2025

Not nailed it down yet, current test status for the record (in addition to my prior post):

  • Other *-minimal images (bookworm, trixie, plucky) don't have this problem (but they also have no fuse3 installed)
  • I tried to reproduce the build locally exactly as on the workflow raw log (same git hash, same compile options), but it does not show the boot problem. Though there must be some more logic to it, as the "armbian-images" parameter is unknown to compile.sh from armbian/build.
  • I tried debugging the initrd, but even with "break=top" I don't get a shell before the bootloop
  • I cannot explain why/how fuse3 should be required at this early stage:
    • Running "ldd" on all initrd files: no dependency on libfuse
    • Grepping all initrd files for fuse3, fusermount and mount.fuse: nothing found
    • steps "/init" (from initrd) does before checking "break=top" (apart from setting variables):
      • mount sysfs, proc, devtmpfs, devpts, tmpfs - none should require fuse
      • executed binaries (non-shell-builtin): mount (see above), mkdir, cat, ln, hostname - none should require fuse
  • Every other image (with a rockchip kernel, running on mksklipad) shows this directly after u-boot emits "Starting kernel...":
    efi_free_pool: illegal free 0x000000003cf20040
    efi_free_pool: illegal free 0x000000003cf1d040
    efi_free_pool: illegal free 0x000000003cf1b040

That happens when u-boot loads the initramfs into an EFI memory region (which should have been fixed in later versions): https://lore.kernel.org/all/d3f3fc7f-b29a-4503-9fe0-97468bbe1f71@gmx.de/
The broken noble-minimal shows only the first two "efi_free_pool" errors, then the Abort handler kicks in:

    efi_free_pool: illegal free 0x000000003cf20040
    efi_free_pool: illegal free 0x000000003cf1d040
    "Synchronous Abort" handler, esr 0x96000004

So maybe that illegal free leads to execution of some random code, which in case of installed fuse3 just happens to take a non-fatal pathway (out of sheer luck)?
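
To compare serial captures from different images more systematically, the two patterns above can be summarized with a few lines of Python. This is only a sketch: the function name is hypothetical, and the sample text is copied from the two excerpts in this comment.

```python
import re

# Matches the U-Boot message and captures the reported address.
ILLEGAL_FREE = re.compile(r"efi_free_pool: illegal free (0x[0-9a-fA-F]+)")

BOOTING_LOG = """\
efi_free_pool: illegal free 0x000000003cf20040
efi_free_pool: illegal free 0x000000003cf1d040
efi_free_pool: illegal free 0x000000003cf1b040
"""

BOOTLOOP_LOG = """\
efi_free_pool: illegal free 0x000000003cf20040
efi_free_pool: illegal free 0x000000003cf1d040
"Synchronous Abort" handler, esr 0x96000004
"""


def summarize_boot_log(text):
    """Collect the illegal-free addresses and note whether the abort handler ran."""
    addresses = [int(match, 16) for match in ILLEGAL_FREE.findall(text)]
    aborted = '"Synchronous Abort" handler' in text
    return {"illegal_frees": addresses, "aborted": aborted}
```

Running `summarize_boot_log` over each capture makes it easy to tabulate how many illegal frees were printed before a reset, which varies between builds.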

My other assumption is that this is caused by some bug in an upstream package.
@igorpecovnik
I'll try one new workflow run when armbian/os is not busy to see if that changes the behaviour. So please don't wonder why I rerun it without any prior code change.


Are other boards affected? Probably not, but unsure:

  • I hoped to get some positive/negative replies on discord, if other people with rockchip64 boards had a similar problem with 25.2 stable noble-minimal images, but got no reaction so far. Not sure if others are affected.
  • I tested the same release for RasPi (rpi4b) noble-minimal, and it worked correctly, so other boards are probably not affected (I don't have any other Armbian capable hardware lying around for testing, at least to my knowledge)

@torte71
Contributor Author

torte71 commented Mar 3, 2025

After rerunning the workflow:
Now "bookworm-minimal" has the bootloop.
But "noble-minimal" boots fine - even without the efi errors.
"noble-server" behaves like before: boots ok, with efi errors.

That's not my idea of a "stable" release. I'll continue searching.

@igorpecovnik
Member

I hoped to get some positive/negative replies on discord, if other people with rockchip64 boards had a similar problem

@paolosabatino Have you experienced this on similar Rockchip RK3328 boards?

I tested the same release for RasPi (rpi4b) noble-minimal, and it worked correctly, so other boards are probably not affected

To me it looks isolated to the Rockchip family, possibly to this SoC. There would be more reports if this were more widespread.

torte71 added a commit to torte71/armbian-mksklipad50 that referenced this pull request Mar 6, 2025
Fixes loading initramfs into EFI memory region, leading to
errors "efi_free_pool: illegal free".
Which may be the cause for bootloops:
  armbian#7883 (comment)
See also:
  https://lore.kernel.org/all/d3f3fc7f-b29a-4503-9fe0-97468bbe1f71@gmx.de/
@torte71
Contributor Author

torte71 commented Mar 7, 2025

@redrathnure The latest community build (trunk.185) for MKS-PI bookworm-minimal from https://github.com/armbian/community/releases/tag/25.5.0-trunk.185 also runs into the same bootloop on my MKS-Klipad50.

I suspect that this is caused by u-boot-2022.07 (that is "patch/u-boot/u-boot-rockchip64") loading the initramfs into the EFI address space (see here), which usually leads to this output at boot:

Starting kernel ...

efi_free_pool: illegal free 0x000000003cf21040
efi_free_pool: illegal free 0x000000003cf1e040
efi_free_pool: illegal free 0x000000003cf1c040

Can you confirm

  1. that these error messages show up on your boards (can be any "old" image prior to trunk.185, it happens even with the original Makerbase image for my board)?
  2. that the bootloop happens on your board as well (with the community build exactly as stated above)?

In case of a bootloop, this will be the output (with varying amounts of EFI errors):

Starting kernel ...

efi_free_pool: illegal free 0x000000003cf21040
"Synchronous Abort" handler, esr 0x96000004
elr: 0000000000273208 lr : 000000000025e708 (reloc)
elr: 000000003ffad208 lr : 000000003ff98708
x0 : bdcc11bc76560ecc x1 : 000000003ffb4b60
x2 : 0000000000000010 x3 : 000000003df3c680
x4 : 0000000000000000 x5 : bdcc11bc76560ecc
x6 : 000000003cd18000 x7 : 0000000000000007
x8 : 0000000000000004 x9 : 0000000000000008
x10: 000000000000a994 x11: 000000003df233cc
x12: 000000000000a994 x13: 000000003df23488
x14: 000000003cea8000 x15: 0000000000000021
x16: 000000003ff72b70 x17: 00000000f9d4f04f
x18: 000000003df31dc0 x19: 000000003cf23040
x20: 000000003ff3ab50 x21: 000000003ffb4b60
x22: 0000000000000001 x23: 000000003df3c5d0
x24: 000000003ffd0a48 x25: 0000000000001000
x26: 000000003cf1e000 x27: 0000000000200000
x28: 0000000000000001 x29: 000000003df23280

Code: eb04005f 54000061 52800000 14000006 (386468a3)
Resetting CPU ...

I am currently preparing a patch to switch mksklipad50 to u-boot v2025.01 (needs testing and a bit of cleanup/squeezing before PR, but you can get the picture), hoping that this will fix the reboots.

Edit: Reworked patch/PR: #7922

torte71 added a commit to torte71/armbian-mksklipad50 that referenced this pull request Mar 7, 2025
Fixes loading initramfs into EFI memory region, leading to
errors "efi_free_pool: illegal free".
Which may be the cause for these bootloops:
  armbian#7883 (comment)
See also:
  https://lore.kernel.org/all/d3f3fc7f-b29a-4503-9fe0-97468bbe1f71@gmx.de/
@redrathnure
Contributor

Will test it over the weekend

@redrathnure
Contributor

redrathnure commented Mar 9, 2025

@torte71 A test on MKSPI + armbian/v25.02 based image, hash: b1ac026, date 2025.02.07, hope it's old enough for the discussion:

U-Boot 2022.07-armbian-2022.07-Se092-P92b1-Hd0b5-Vb79e-Bb703-R448a (Feb 03 2025 - 15:42:38 +0000)

Model: Makerbase MKS-PI
DRAM:  1022 MiB
PMIC:  RK8050 (on=0x10, off=0x00)
Core:  229 devices, 22 uclasses, devicetree: separate
MMC:   mmc@ff500000: 1, mmc@ff520000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

 .... blah blah blah....

## Executing script at 09000000
Trying kaslrseed command... Info: Unknown command can be safely ignored since kaslrseed does not apply to all boards.
Unknown command 'kaslrseed' - try 'help'
Moving Image from 0x2080000 to 0x2200000, end=4600000
## Loading init Ramdisk from Legacy Image at 06000000 ...
   Image Name:   uInitrd
   Image Type:   AArch64 Linux RAMDisk Image (gzip compressed)
   Data Size:    24004880 Bytes = 22.9 MiB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 01f00000
   Booting using the fdt blob at 0x1f00000
   Loading Ramdisk to 3c83a000, end 3df1e910 ... OK
   Loading Device Tree to 000000003c7c1000, end 000000003c839fff ... OK

Starting kernel ...

efi_free_pool: illegal free 0x000000003cf21040
efi_free_pool: illegal free 0x000000003cf1e040
efi_free_pool: illegal free 0x000000003cf1c040
done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... Scanning for Btrfs filesystems

IMO:

  1. MKSPI and most likely SKIPR work well with U-Boot 2022.07
  2. The U-Boot stage has tons of various errors and warnings... (I have never paid attention to them)
  3. It's pretty strange that the U-Boot package and the boot loop depend on the Debian distro image. As I understand it, this is a separate project and repo, and strictly speaking it is not even related to the Linux kernel itself.

~~Will try to flash latest builds and will drop updates here...~~
And yes, MKS-PI bookworm-minimal gives a bootloop on MKSPI too:

Moving Image from 0x2080000 to 0x2200000, end=4670000
## Loading init Ramdisk from Legacy Image at 06000000 ...
   Image Name:   uInitrd
   Image Type:   AArch64 Linux RAMDisk Image (gzip compressed)
   Data Size:    16768158 Bytes = 16 MiB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 01f00000
   Booting using the fdt blob at 0x1f00000
   Loading Ramdisk to 3cf21000, end 3df1ec9e ... OK
   Loading Device Tree to 000000003cea8000, end 000000003cf20fff ... OK

Starting kernel ...

efi_free_pool: illegal free 0x000000003cf21040
"Synchronous Abort" handler, esr 0x96000004
elr: 0000000000273208 lr : 000000000025e708 (reloc)
elr: 000000003ffad208 lr : 000000003ff98708
x0 : bdcc11bc76560ecc x1 : 000000003ffb4b60
x2 : 0000000000000010 x3 : 000000003df3c680
x4 : 0000000000000000 x5 : bdcc11bc76560ecc
x6 : 000000003cd18000 x7 : 0000000000000007
x8 : 0000000000000004 x9 : 0000000000000008
x10: 000000000000a994 x11: 000000003df233cc
x12: 000000000000a994 x13: 000000003df23488
x14: 000000003cea8000 x15: 0000000000000021
x16: 000000003ff72b70 x17: 0000000080d11b1c
x18: 000000003df31dc0 x19: 000000003cf23040
x20: 000000003ff3ab50 x21: 000000003ffb4b60
x22: 0000000000000001 x23: 000000003df3c5d0
x24: 000000003ffd0a48 x25: 0000000000001000
x26: 000000003cf1e000 x27: 0000000000200000
x28: 0000000000000001 x29: 000000003df23280

Code: eb04005f 54000061 52800000 14000006 (386468a3)
Resetting CPU ...

resetting ...

@torte71
Contributor Author

torte71 commented Mar 10, 2025

  1. It's pretty strange that the U-Boot package and the boot loop depend on the Debian distro image. As I understand it, this is a separate project and repo, and strictly speaking it is not even related to the Linux kernel itself.

Absolutely!

And actually, the bootloop doesn't depend on the distro: The first time I noticed it, noble-minimal was affected but bookworm-minimal (Debian) ran fine. Then I triggered a rebuild (without having changed any of the klipad related files) and suddenly bookworm-minimal was affected but noble-minimal was fine.

No less weird is that adding fuse3 to the initrd fixed the bootloop, though none of the initrd contents show any kind of dependency on fuse3 (and especially none of the "init" script contents that are executed before checking the "break=top" parameter).

In my understanding, this bootloop can only be triggered in/by u-boot (as opposed to the kernel or initrd):

  • The text "Synchronous Abort" (and the register dump) comes from u-boot: arch/arm/lib/interrupts_64.c:do_sync()
  • The text "Synchronous Abort" is not contained in the kernel sources (apart from some comments in drivers/scsi/lpfc/lpfc.h and lpfc_hw.h, but these are not printed)
  • Judging from the serial output, the bootloop happens at the same time, when the efi "illegal free" happens, without any prior output from the kernel (but that can also happen, if unbuffered error messages get printed before buffered standard output got processed, leading to non-chronological output - so this point is only an indication, but no proof)
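
The esr value printed by the handler can also be decoded by hand. The sketch below assumes the standard Arm ESR_ELx layout (exception class in bits 31:26, instruction-specific syndrome in bits 24:0) and lists only the few encodings relevant here; the function name is hypothetical.

```python
# Exception class (EC) values for data aborts in the ESR_ELx encoding.
EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

# Data Fault Status Code (DFSC) values; only translation faults are listed.
DFSC_NAMES = {
    0b000100: "Translation fault, level 0",
    0b000101: "Translation fault, level 1",
    0b000110: "Translation fault, level 2",
    0b000111: "Translation fault, level 3",
}


def decode_data_abort_esr(esr):
    """Split an ESR value into the fields that matter for a data abort."""
    ec = (esr >> 26) & 0x3F      # exception class, bits 31:26
    iss = esr & 0x1FFFFFF        # instruction-specific syndrome, bits 24:0
    dfsc = iss & 0x3F            # fault status code, bits 5:0 of the ISS
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not listed in this sketch"),
        "il": (esr >> 25) & 0x1,  # instruction length bit
        "wnr": (iss >> 6) & 0x1,  # 1 = write access, 0 = read access
        "dfsc": dfsc,
        "dfsc_name": DFSC_NAMES.get(dfsc, "not listed in this sketch"),
    }


fields = decode_data_abort_esr(0x96000004)
```

For 0x96000004 this gives EC 0x25 (a data abort taken at the current exception level), WnR 0 (a read) and DFSC 0b000100 (translation fault, level 0). That would be consistent with a load through an invalid pointer, but it does not by itself show which code issued the load.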

Double free()s have the potential to allow "random" code execution under certain conditions (and are used for various exploits).
My guess is that this is what happens here: The initrd just reached a size/condition where the wrong free() is not detected but some other code gets executed, leading to strange, seemingly illogical behaviour.

So I hope that switching to a recent u-boot, which claims that the "illegal free" got fixed (hopefully not by just removing the printf() statement), will solve this.
But without being able to exactly reproduce that issue and tracking it down to where the wrong code actually gets executed, this is just a more or less educated guess.
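
As a small sanity check of the addresses involved, the "illegal free" values can be compared against the load ranges U-Boot printed in the two MKS-PI logs posted earlier in this thread. This is a sketch only: the numbers are copied from those logs, the names are hypothetical, and the end of each device tree range is taken as the start of the ramdisk (the logged inclusive end plus one).

```python
def region_of(address, regions):
    """Return the name of the half-open [start, end) region containing address."""
    for name, (start, end) in regions.items():
        if start <= address < end:
            return name
    return None


# Ranges from "Loading Ramdisk to ..." / "Loading Device Tree to ..." lines.
BOOTING_REGIONS = {
    "ramdisk": (0x3C83A000, 0x3DF1E910),
    "device tree": (0x3C7C1000, 0x3C83A000),
}
BOOTLOOP_REGIONS = {
    "ramdisk": (0x3CF21000, 0x3DF1EC9E),
    "device tree": (0x3CEA8000, 0x3CF21000),
}

# Addresses from the "efi_free_pool: illegal free" lines of each log.
BOOTING_FREES = [0x3CF21040, 0x3CF1E040, 0x3CF1C040]
BOOTLOOP_FREES = [0x3CF21040]

booting_hits = [region_of(a, BOOTING_REGIONS) for a in BOOTING_FREES]
bootloop_hits = [region_of(a, BOOTLOOP_REGIONS) for a in BOOTLOOP_FREES]
```

In both logs every reported address falls inside the range the ramdisk was loaded to. That fits the description in the linked mailing list post, but it is not proof that the free is what triggers the abort.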

torte71 added a commit to torte71/armbian-mksklipad50 that referenced this pull request Mar 10, 2025
Fixes loading initramfs into EFI memory region, leading to
errors "efi_free_pool: illegal free".
Which may be the cause for these bootloops:
  armbian#7883 (comment)
See also:
  https://lore.kernel.org/all/d3f3fc7f-b29a-4503-9fe0-97468bbe1f71@gmx.de/