Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some 64-bit milliseconds tick values were not migrated to 32-bit #209

Closed
warasilapm opened this issue Feb 28, 2024 · 30 comments
Closed

Some 64-bit milliseconds tick values were not migrated to 32-bit #209

warasilapm opened this issue Feb 28, 2024 · 30 comments
Assignees

Comments

@warasilapm
Copy link

warasilapm commented Feb 28, 2024

It would appear at some time in the past (I believe here), the portable API was changed from 64-bit millisecond ticks to 32-bit milliseconds ticks. In general, this was migrated successfully. However, there are still a number of 64-bit millisecond time values which would more correctly be 32-bit values, since that matches the portable API's type. None of these included here strictly cause issues, but they are confusing and their use could conceivably lead to issues with misuse in the future. See attached patch.

I may have missed some uses; perhaps a more thorough review is in order.

@RobMeades
Copy link
Contributor

Agreed: I was aware of this but it is kind of a ticklish thing to change, need to check each one to make sure we're not introducing a new problem. Thanks for the ones you've spotted here, what I will do is search the codebase for int64_t variables and sift through them to see if there are any more to be had, then we will have to check each one.

I had wondered if there is an argument for creating a macro which performs the compare "safely", and which would fall-through if the tick wraps, though the macro would need to be named very well in order that the person writing the code knows how it will behave, since the macro is now hiding the truth.

Anyway, maybe we can discuss this later.

@RobMeades RobMeades self-assigned this Feb 29, 2024
@RobMeades
Copy link
Contributor

RobMeades commented Mar 1, 2024

As discussed, I have created a branch which integrates your proposed fixes here and in #207/#208, and also introduces a pair of macros, U_PORT_TICK_TIME_EXPIRED_OR_WRAP_MS() and U_PORT_TICK_TIME_BEYOND_STOP_OR_WRAP_MS(); the macros are used where timeouts are checked throughout the code so as to ensure the wrap case (which should cause both macros to return true) is handled.

You can find a preview of the branch here:

https://github.com/u-blox/ubxlib/tree/preview_fix_tick_wrap_rmea

Note that this has NOT BEEN tested properly yet, since a key part of our test system is down. The test system should be fixed on Monday, at which point I can begin running proper tests. Obviously with such a wide change I'm concerned that I've typed in something wrong, maybe got the sense of a check upside-down: please, if you have the time, give it a thorough review.

All feedback welcomed.

@warasilapm
Copy link
Author

If I may ask - why use a macro for the tick comparison functions instead of a static inline style header function? I notice they both have explicit casts to uint32_t. These suppress the compiler warnings for implicit integer conversion. If you're concerned with the types, a static inline style function would enforce the type more naturally and (if memory serves correctly) also issue a warning if the implicit integer conversion warning is turned on in the compiler.

@RobMeades
Copy link
Contributor

RobMeades commented Mar 15, 2024

Hi: only because static inline is, IIRC, a request to the compiler, not an instruction, so it cannot be guaranteed. Though the macro doesn't enforce type-checking, casting does enforce the type used.

@RobMeades
Copy link
Contributor

RobMeades commented Mar 15, 2024

But I could go either way of course, just seems a bit more "embedded" to use a macro, which guarantees a specific behaviour irrespective of toolchain.

@warasilapm
Copy link
Author

warasilapm commented Mar 15, 2024

Hi: only because static inline is, IIRC, a request to the compiler, not an instruction, so it cannot be guaranteed. Though the macro doesn't enforce type-checking, casting does enforce the type used.

This is largely true. The nuance is complicated but I'm so used to always having optimization turned on. For the three primary arm targeting c compilers (gcc, clang, and armcc), with any level of optimization static inline is always respected, since the internal linkage of the function allows the compiler to avoid emitting an out-of-line version. I believe gcc (which is what I use most often) tends to still emit an out-of-line version when compiled with -O0.

All three compilers do support the __attribute__((always_inline)) (gcc, clang, armcc) syntax for forcing functions inline, though I think this can sometimes tie the compiler's hands in -Os which may prefer to emit an out of line version for the code reuse benefits. Generally I prefer to trust the compiler in that case.

I generally live on -O1, -Og, or -Os. In my experience with gcc, all of them will pick up the hint correctly and avoid emitting an out-of-line version when unnecessary. I also find that an attached debugger will better follow the code path of the inline function, allowing easier line steps through it, while a macro can only be stepped over or instruction stepped. I'm just being an evangelist for header delcared static inline functions; I don't think using a macro is strictly a problem. The names are a bit long though; maybe uPortTickTimeExpired() and uPortTickTimePastStop() would be less cumbersome.

@RobMeades
Copy link
Contributor

RobMeades commented Mar 15, 2024

OK, I can be persuaded to go for static inline, though, personal preference again of course, I would prefer to leave the "or wrap" part in the function name; we are doing this to point out to the caller that the function will return true at wrap as well as on expiry or past stop. Otherwise the developer might just do a compare to avoid this stupidly inefficient function call that just, after all, compares two numbers. Definitely want to leave the ms in there, 'cos it is not a tick, it is a tick-derived-time-value in milliseconds.

@warasilapm
Copy link
Author

You're welcome to name it whatever you want; it's not my codebase after all. I have a different perspective on things again so I'll share for posterity.

Wrapping Schmapping

The thing is, it's not that it returns true if the value(s) has/have wrapped. It returns true if the time has expired or is past stop. The caller shouldn't need to know or even think about the wrap around that occurs around INT32_MAX. I think what you're implying is actually further from the right behavior, suggesting that if either value has wrapped past this point the function will return true. That would be a different buggy behavior compared to handling the wrap around correctly.

If there's any caveats to list, it would be that this is only valid within the range (INT32_MIN, INT32_MAX ) relative to the current time. Otherwise, the math will correctly handle the timeouts presented. I generally just list these as a @note field in the doxygen comment.

What's in a Millisecond

This can be nice to have in the signature, but truthfully it doesn't matter what unit the int32_t values have. The math is the same either way. Hence, I dropped them. In ubxlib's case there's only milliseconds ticks values as far as I can see. It's more important that the two values are both the same base unit, and no change to the function signature can really enforce that in C without a more complicated structure or timing system. In my opinion, it's not really worth it.


On a different evangelism note, static inline functions tend to show up in the call graph during static analysis. Some tools will not pick up the macro as part of the call graph. That can be handy some times when working with automated coverage and analysis tools, but as far as working with the code I still mostly rely on a text search myself so it doesn't really matter.

@RobMeades
Copy link
Contributor

RobMeades commented Mar 15, 2024

We make ubxlib as source code with the primary intention that it is understandable/absorbable by customers; your opinion carries more weight than mine/ours, so we will follow your pattern. I am testing the revised changes now, will post here shortly.

My concern remains that I have got a comparison the wrong way around somewhere in this relatively broad change; I will merge the change when you have confirmed when you are happy that has not happened.

@RobMeades
Copy link
Contributor

RobMeades commented Mar 15, 2024

Branch is now updated with the new form, do let me know when you are happy with it and I will merge it to master here and delete the preview branch some time after that.

@warasilapm
Copy link
Author

Been looking at this again trying to resolve issues running ubxlib past INT32_MAX rollover. Had the following comments:

  • The keep going callback in u_device_private_cell.c is still broken. We both missed this one.
  • I misunderstood that your expiry function calls uPortGetTickTimeMs() under the hood, and therefore is only compatible with milliseconds values. I had assumed that you passed in the return value from uPortGetTickTimeMs() into the elapsed functions. Given that context, it makes more sense for the unit to be in there. My mistake.
  • For the static inline to compile as I described, the function body needs to be in the header, not in a separate compilation unit. Just as a random example plucked out of thin air, this style is what I'm thinking of.

Unfortunately, I don't really have a great way of pulling this down and testing it at the moment. Our fork diverged way back at 1.2.0; not sure what the executive decision will be on updating.

@RobMeades
Copy link
Contributor

RobMeades commented Mar 27, 2024

The keep going callback in u_device_private_cell.c is still broken. We both missed this one.

Ooo, well spotted, will fix.

Given that context, it makes more sense for the unit to be in there.

Understood, will modify in next revision.

For the static inline to compile as I described, the function body needs to be in the header.

Interesting, hadn't appreciated that but it does seem to be the case that relying on the linker is not so wise. Will modify.

Our fork diverged way back at 1.2.0; not sure what the executive decision will be on updating.

Well I guess you could apply the changes manually to places you know/care about; with the exception of the one you spotted above they were reasonably easy to search for. Either way the result will require a careful review.

@RobMeades
Copy link
Contributor

Changes now applied to the preview branch, see here:

https://github.com/u-blox/ubxlib/tree/preview_fix_tick_wrap_rmea

Thanks again for the feedback. I'm not going to merge this change into master here until you are happy with it, of course, so let me know how it goes.

@warasilapm
Copy link
Author

So we've made progress on handling the INT32_MAX wrap around point. However, there are still rough edges to consider. Code which immediately captures a start time and then retries for some timeout do not suffer from this next issue, but ones that revisit a potentially much earlier set start time do. For example, consider the code which enforces a 1 second gap between CFUN commands in registerNetwork().

ubxlib/cell/src/u_cell_net.c

Lines 1243 to 1248 in 3cf5348

// Come out of airplane mode and try to register
// Wait for flip time to expire first though
while (!uPortTickTimeExpiredMs(pInstance->lastCfunFlipTimeMs,
U_CELL_PRIVATE_AT_CFUN_FLIP_DELAY_SECONDS * 1000)) {
uPortTaskBlock(1000);
}

What if it's been more than INT32_MAX ticks since the last CFUN command was issued? This while loop will end up waiting as much as INT32_MAX + U_CELL_PRIVATE_AT_CFUN_FLIP_DELAY_SECONDS * 1000 milliseconds. This is much longer than the expected worst case timeout. The issue here is that the current comparison (and indeed all using the newly introduced port API functions) maps half of the integer space (red and yellow below) as within the timeout period instead of just the timeout period (yellow only below).

image

There are a couple of different approaches to solving this sort of problem. The first is to explicitly check for values in that range. With the current code that would look like a timeout implementation below:

static U_INLINE bool uPortTickTimeExpiredMs(int32_t startTimeMs,
                                            int32_t timeoutMs)
{
    int32_t delta = uPortGetTickTimeMs() - startTimeMs;
    return (delta > timeoutMs) || (delta < 0);
}

The second way this is often resolved is to just used unsigned math for expiration values, since the negative values we had to explicitly check for before are simply large values which will always exceed the timeout. Both of these approaches map only the yellow area above and below as within the timeout value. This more tightly bounds the potential waiting time to the potential timeout window anyways, so worst case the check will result in the timeout to be observed in code.

image

If that's not sufficient, the only way is to maintain these timeout values, essentially checking that they are within the expected value and bringing them within range or invalidating them every 2^31 ticks. This is a lot of complexity and overhead and generally it's acceptable to have the application wait up to the specified timeout anyways in the exceedingly rare case this occurs.

Note that a similar relationship is true with the current state of the comparisons for "stop" values instead of "start" values with a timeout. These map stopTimeMs - INT32_MAX to "not yet timed out", just like startTimeMs + timeoutMs - INT32_MAX.

I'm curious if there was a reason the portable layer chose int32_t instead of uint32_t. At least in STM32Cube land both HAL and FreeRTOS os ticks are both unsigned numbers.

An additional spot we have identified this style of is issue is below:

ubxlib/cell/src/u_cell_sock.c

Lines 2003 to 2006 in a2d986e

while (uPortGetTickTimeMs() - pInstance->connectedAtMs <
U_CELL_SOCK_SARA_R422_DNS_DELAY_MILLISECONDS) {
uPortTaskBlock(100);
}


Another potential source of issues worth revisiting is the use of 0 or -1 to indicate unset or invalid times. This works when it is used for optionally set timeout values, such as the AT client's atTimeoutSavedMs member. However, it doesn't work well for start times, like startTimeMs in the cell instance structure. The local keep going callback in the cellular network code will fail to abort a failing connection if startTimeMs is a negative tick value as a result of a check for this "unset" value that it is returned to.

if ((pInstance->startTimeMs > 0) &&
!uPortTickTimeExpiredMs(pInstance->startTimeMs,
U_CELL_NET_CONNECT_TIMEOUT_SECONDS * 1000)) {
keepGoing = false;
}

This check has to be removed to prevent immediate timeout in these cases. The good news on this one is that, currently, there does not seem to be a code path which calls this without first setting startTimeMs so a bug from an unset value does not seem to be an issue. In our application this has yet to cause a problem because we only call through uNetworkInterfaceUp() and uNetworkInterfaceDown() which passes in the keepGoingCallback() in u_network_private_cell.h.

@RobMeades
Copy link
Contributor

All good stuff, give me a little time to process...

@RobMeades
Copy link
Contributor

@warasilapm: have been discussing with Michael today, hope to have next revision tomorrow.

@RobMeades
Copy link
Contributor

RobMeades commented Apr 4, 2024

Branch is now updated.

This is the "Michael edition": a uTimeout API is introduced in /common/timeout with "anonymous" [not really] type uTimeoutStart_t, a function uTimeoutStart() to populate variables of that type and functions uTimeoutExpiredMs()/uTimeoutExpiredSeconds() to check for expiry. All calculations are uint32_t.

This new API is used in all cases where time calculations are performed, aside from one case in BLE and the lock timer in the AT Client (which has been pretty darned thoroughly tested and doesn't quite fit the pattern).

Just about every file in ubxlib has been touched by this so I don't want to leave it sitting too long or it will be a nightmare to merge: please let me know what you think, @mazgch as well as @warasilapm.

@RobMeades
Copy link
Contributor

RobMeades commented Apr 4, 2024

@mazgch has made initial comments on the branch (thanks Michael). I have addressed these, I think, but it was getting quite difficult for Michael to view (the baseline for the current preview branch is also pretty old now) so I have (a) deleted the old preview branch, (b) created a new branch preview_fix_tick_wrap_2_rmea and (c), in order to make commenting hopefully easier, I have made this branch a PR #225 to this repo (rather than our internal development repo). I won't merge this PR, it will only be for comments.

@RobMeades
Copy link
Contributor

@warasilapm: in the hope that it will enable you to comment directly on #225 I have sent you an invitation to this repo as a read-only collaborator. Or you can also just comment here of course.

@warasilapm
Copy link
Author

@RobMeades As far as I know, I can comment on PRs and submit code review on PRs in public repositories. I just can't be used as an "approving" review for the purposes of code review requirements, etc. As such, I have declined the invitation for now; if there are other features you need me to have access for in the future I'll let you know.

I'll still take a look at the changes in #225 as I am able.

@warasilapm
Copy link
Author

@RobMeades If memory serves, you should even be able to request review from me on the PR without me as a collaborator.

@RobMeades
Copy link
Contributor

If memory serves, you should even be able to request review from me on the PR without me as a collaborator.

Unfortunately typing your Github ID doesn't seem to resolve:

image

@RobMeades
Copy link
Contributor

@warasilapm: I understand from separate e-mail communication that you cannot use a new release of ubxlib in your product, whatever is proposed in the very fruitful discussions we have had here. I had planned to wait until you were happy with the changes but maybe that is not good use of your time, or at least not something you would be able to prioritise? If that is the case, let me know and we will continue with our internal review process; the changes have already been tested, but of course, given the statistical nature of these things, can be tested further, I just didn't want to head down that route if you disagreed with the direction.

Please advise.

@warasilapm
Copy link
Author

@RobMeades Did you develop a test case where the tick value is initialized close to the 2^31 and 2^32 rollover values?

@RobMeades
Copy link
Contributor

RobMeades commented Apr 10, 2024

Yes, we do this on test instance 23 (where ubxlib happens to be running on Windows, not that it matters) which has a SARA-R5, NINA-W15 and M8 module attached, and also runs out-of-coverage tests. It is running with U_CFG_TEST_TIMEOUT_SPEED_UP=14, so a multiplier of 2^14: time should wrap roughly every 4 minutes. This is about as fast as we can go without having a wrap that is within our largest timeout.

EDIT: for context, the complete test run on that instance takes just under one hour.

@warasilapm
Copy link
Author

warasilapm commented Apr 10, 2024

Okay sounds good! Just wanted to make sure that was being covered explicitly since that's the main advantage I could provide with testing on our application.

I may be able to find some time to review the PR you put up in the next couple of days but you are correct in that the takeaway in the email from our side today is a) our version is not synchronized with latest ubxlib so it's challenging for me to work with, b) we aren't updating to latest ubxlib on this release cycle, and c) there are other immediate priorities for me during the work day over reviewing your PR.

@RobMeades
Copy link
Contributor

Got it: in that case I will wait until late Friday evening UK time before I go ahead with things on our side; there are reasons for me not to upset the applecart before then.

@RobMeades
Copy link
Contributor

FYI, #225 is about to enter our internal review process; once it has passed that it should appear on master here, hopefully next week. @warasilapm: still happy to receive any comments you may have, which I will feed into that process.

@RobMeades
Copy link
Contributor

@warasilapm: the change has now passed testing, review, and is pushed to master here in commit 79e808e. Are you OK for me to close #207, #208 and #209 or is there more you would like to discuss?

@warasilapm
Copy link
Author

I'm good to close the issues for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants