Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: init: allow to jitter boot delay #25627

Conversation

markus-becker-tridonic-com
Copy link
Contributor

When a big amount of networked devices collectively gets switched on,
e.g. by a mains switch, quickly overwhelms a wireless network.

This PR allows to jitter the start-up of the devices.

Cleaned-up the boot banner after the boot delay. At that point in time
is irrelevant.

Signed-off-by: Markus Becker markus.becker@tridonic.com

@markus-becker-tridonic-com
Copy link
Contributor Author

Open issue: How to make sure that the printk() is flushed before k_msleep()?

@pfalcon
Copy link
Contributor

pfalcon commented May 26, 2020

When a big amount of networked devices collectively gets switched on,
e.g. by a mains switch, quickly overwhelms a wireless network.

As this mentions networking usecase, adding @jukkar to reviewers. I myself would point that if networking is the usecase, than rather e.g. DHCP client should have a randomized delay support. Handling that with a boot delay seems like a crude approach.

@markus-becker-tridonic-com
Copy link
Contributor Author

It is mostly a wireless/BT/openThread networking use case, not all of the networking options necessarily require DHCP.

It might be a crude approach, but the implementation is rather straight forward.

Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable. And as already commented, the DHCP provides similar support but if one is not using DHCP then something like this is useful.

kernel/Kconfig Outdated Show resolved Hide resolved
kernel/Kconfig Outdated Show resolved Hide resolved
kernel/init.c Outdated Show resolved Hide resolved
@markus-becker-tridonic-com
Copy link
Contributor Author

rebased on master post 2.3 release.

@markus-becker-tridonic-com
Copy link
Contributor Author

@jukkar @andyross would you take yet another look at this reworked solution?

kernel/init.c Outdated Show resolved Hide resolved
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick about docs, but one that does need to be fixed. Also confused about that noop printk wrapper.

help
This option randomizes the bootup delay between 0 and the
value of BOOT_DELAY. This can be used e.g., to jitter boot-up
of wireless network devices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantry: this needs to specify units. How long is "1"? One reasonable choice IMHO is to bake it into the API and call this CONFIG_BOOT_DELAY_RANDOM_MS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it says, the value is between 0 and BOOT_DELAY. BOOT_DELAY is in milliseconds:
https://github.com/zephyrproject-rtos/zephyr/pull/25627/files#diff-a0dbba97dac305f5927e75a408d79a25R309-R310

va_start(ap, fmt);
vprintk(fmt, ap);
va_end(ap);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. What is the point of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

printk() is calling log_printk() when CONFIG_LOG_PRINTK is on. Then the output would be happening in the logging thread which is halted while the busy-loop is running. The output of the delay would only happen after the delay and thus leaving the user wondering what is happening.

This way the message is outputted before the busy-loop executes and the user knows that boot is delayed.

@@ -191,6 +182,15 @@ void z_data_copy(void)

bool z_sys_post_kernel;

void immediate_printk(const char *fmt, ...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding it ? What it the problem with printk ? I presume is something related with log_printk ... but I don't know what it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

help
This option randomizes the bootup delay between 0 and the
value of BOOT_DELAY. This can be used e.g., to jitter boot-up
of wireless network devices.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth to add a note about it in BOOT_DELAY option. Because there is saying that the entire value set will be used to delay the boot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If BOOT_DELAY_RANDOM is not set, then BOOT_DELAY is used.

Would you rather add a random delay on top of BOOT_DELAY?

Copy link
Member

@ceolin ceolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my comments. If I'm right the random value is not being used.

Copy link
Contributor

@andrewboie andrewboie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather you not modify the core kernel for this use-case.

Have you considered defining a SYS_INIT() in your application (at PRE_KERNEL_1) which just introduces a delay there?

@markus-becker-tridonic-com
Copy link
Contributor Author

I would rather you not modify the core kernel for this use-case.

Have you considered defining a SYS_INIT() in your application (at PRE_KERNEL_1) which just introduces a delay there?

Have not considered this. How would you delay network set-up with SYS_INIT()? Do you have an example?

When a big amount of networked devices collectively gets switched on,
e.g. by a mains switch, a low-bandwidth wireless network is quickly
overwhelmed.

This PR allows to jitter the start-up of the devices.

Cleaned-up the boot banner after the boot delay. At that point in time
the boot banner is irrelevant.

Signed-off-by: Markus Becker <markus.becker@tridonic.com>
@andyross
Copy link
Contributor

andyross commented Jul 9, 2020

printk() is calling log_printk() when CONFIG_LOG_PRINTK is on. Then the output would be happening in the logging thread

OK... I get it. And this is clever, but... no, sorry. :)

Really what you're doing here is exploiting a bug in the logging subsystem. CONFIG_LOG_PRINTK is supposed to mean that printk() output goes to logging instead, but apparently the feature missed a spot, or put the hook in the wrong place, and it happens that vprintk() doesn't do that and goes to the original output. But that's wrong. When you have CONFIG_LOG=y and CONFIG_LOG_IMMEDIATE=n, logging is asynchronous. That's the way it's supposed to work, and it's desired behavior[1]. Subsystems shouldn't be silently working around the intended behavior of other subsystems by using backdoors.

Basically: if the user asked for async logging give them what they want. If you don't think they should want that, let's deal with that via a patch to the log layer.

[1] By someone (cough, cough, @nordic-krch). I actually agree that for most development usage asynchronous logging is the devil's work and don't understand why we decided to make it default. But it's the system we have, so let's not hack holes around it.

@markus-becker-tridonic-com
Copy link
Contributor Author

printk() is calling log_printk() when CONFIG_LOG_PRINTK is on. Then the output would be happening in the logging thread

OK... I get it. And this is clever

Thanks :)

, but... no, sorry. :)

Hmm.

Really what you're doing here is exploiting a bug in the logging subsystem. CONFIG_LOG_PRINTK is supposed to mean that printk() output goes to logging instead, but apparently the feature missed a spot, or put the hook in the wrong place, and it happens that vprintk() doesn't do that and goes to the original output. But that's wrong. When you have CONFIG_LOG=y and CONFIG_LOG_IMMEDIATE=n, logging is asynchronous. That's the way it's supposed to work, and it's desired behavior[1]. Subsystems shouldn't be silently working around the intended behavior of other subsystems by using backdoors.

Will the bug be fixed?

Basically: if the user asked for async logging give them what they want. If you don't think they should want that, let's deal with that via a patch to the log layer.

True, that the user wants to use async logging (or has to because it is pulled in by default). But this tripped me several times, looking at debug on UART and expecting to see something, only to notice several trials later that the device is delaying the boot.

For my information, would be switching the logging backend to RTT make any difference?

[1] By someone (cough, cough, @nordic-krch). I actually agree that for most development usage asynchronous logging is the devil's work and don't understand why we decided to make it default. But it's the system we have, so let's not hack holes around it.

What was the reason to move it to async? Would disabling async logging be possible? What would be the side-effects?

@andrewboie
Copy link
Contributor

andrewboie commented Jul 13, 2020

Will the bug be fixed?

Do we have a bug opened on GitHub for this issue? That's the first step.

Consider enabling CONFIG_LOG_MINIMAL and just ignoring the async log mechanism.

@github-actions
Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Sep 12, 2020
@github-actions github-actions bot closed this Sep 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants