Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fdtable: fix longstanding layering violations #75348

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

cfriedt
Copy link
Member

@cfriedt cfriedt commented Jul 3, 2024

Although fdtable sat a layer below posix, as part of the base os, it was reaching up a layer for POSIX types, creating another dependency cycle.

Instead of forcing POSIX types to be used, simply add the native Zephyr types, k_off_t and k_ssize_t. Remove <sys/types.h> and <fcntl.h> inclusion below the POSIX layer, where appropriate.

Generally, it would be best to use only ISO or native zephyr types below POSIX.

(Would have fixed #75205 with a very small workaround)
Fixes #75513
Fixes #10436

Update:
To future-proof Zephyr in terms of file, disk, and device compatibility, a Kconfig option was added that allows the user to explicitly use a 64-bit k_off_t, even if the underlying hardware is only 32-bit.

The question is whether we want future generations to pay off our debt + even more interest, or whether we are willing to pay the bill ourselves. This has already been ignored for 2 LTS releases.

Directly before an LTS release probably isn't a super appropriate time for a fix like this to go in, but then again, maybe it's the most appropriate time for it to go in 🤷‍♂️

@cfriedt
Copy link
Member Author

cfriedt commented Jul 3, 2024

twister -p native_sim -p qemu_riscv64 -p qemu_x86 -p qemu_x86_64 -p qemu_cortex_a53 -p mps2/an385 -T tests/posix
...
INFO    - Total complete:  270/ 270  100%  skipped:   84, failed:    0, error:    0
INFO    - 45 test scenarios (270 test instances) selected, 84 configurations skipped (58 by static filter, 26 at runtime).
INFO    - 186 of 270 test configurations passed (100.00%), 0 failed, 0 errored, 84 skipped with 0 warnings in 403.48 seconds
INFO    - In total 7029 test cases were executed, 3759 skipped on 6 out of total 21 platforms (28.57%)
INFO    - 181 test configurations executed on platforms, 5 test configurations were only built.

A lot of build-only targets being skipped here..

twister --build-only -T tests/net/ -T tests/lib/c_lib
...
INFO    - Total complete: 7042/7042  100%  skipped: 5895, failed:    0, error:    0
INFO    - 250 test scenarios (7042 test instances) selected, 5895 configurations skipped (5868 by static filter, 27 at runtime).
INFO    - 1147 of 7042 test configurations passed (100.00%), 0 failed, 0 errored, 5895 skipped with 0 warnings in 2589.39 seconds
INFO    - 0 test configurations executed on platforms, 1147 test configurations were only built.
twister -i -p  mimxrt1170_evk/mimxrt1176/cm7 -p mm_feather -p nrf9131ek/nrf9131/ns -s tests/posix/common/portability.posix.common.newlib -s tests/posix/common/portability.posix.common.tls.newlib -s tests/posix/fs/portability.posix.fs.newlib -s tests/posix/fs/portability.posix.fs.tls.newlib -s tests/posix/headers/portability.posix.headers.newlib.with_posix_api -s tests/lib/c_lib/thrd/libraries.libc.c11_threads.newlib -s tests/lib/c_lib/thrd/libraries.libc.c11_threads.newlib_nano
...
INFO    - Total complete:   21/  21  100%  skipped:    2, failed:    0, error:    0
INFO    - 7 test scenarios (21 test instances) selected, 2 configurations skipped (0 by static filter, 2 at runtime).
INFO    - 19 of 21 test configurations passed (100.00%), 0 failed, 0 errored, 2 skipped with 0 warnings in 127.21 seconds
INFO    - In total 696 test cases were executed, 210 skipped on 3 out of total 7 platforms (42.86%)
INFO    - 0 test configurations executed on platforms, 19 test configurations were only built.

@cfriedt cfriedt force-pushed the declare-off-t-and-ssize-t-in-fdtable-h branch from 468003b to 398403c Compare July 3, 2024 02:01
@cfriedt cfriedt requested a review from ycsin July 3, 2024 02:08
@cfriedt cfriedt marked this pull request as ready for review July 3, 2024 02:54
@zephyrbot zephyrbot added area: POSIX POSIX API Library area: Base OS Base OS Library (lib/os) labels Jul 3, 2024
@cfriedt cfriedt added the Hotfix Fix for issues blocking development, i.e. CI issues, tests failing in CI, etc. label Jul 3, 2024
Copy link
Member

@aescolar aescolar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I understand what and why you are doing this I don't think this is the solution.
Defining a type in the "wrong" header to avoid including another header from that one will end up with an even more messy situation, as now issues will depend on the order of header inclusion by users of those headers.
It is not safe to redefine these types (as you do not know what the C library is defining those to), and attempting to guard those definitions with the C library internal guard names is going to not work with other libraries or when those guards names change.

include/zephyr/sys/fdtable.h Outdated Show resolved Hide resolved
@cfriedt
Copy link
Member Author

cfriedt commented Jul 3, 2024

@aescolar - this was the "path of least resistance" change that is non-destructive. The better change would be to simply not use inappropriate types in the base OS (i.e. simply change return values from ssize_t to long). However, that would have greater repercussions as several function signatures in and out of tree would need to be corrected.

@keith-packard
Copy link
Collaborator

keith-packard commented Jul 3, 2024

While painful, I think the only 'conforming' solution is to avoid POSIX types entirely and either use existing C types or create your own Zephyr-specific ones. For ssize_t, you could use ptrdiff_t and then introduce casts in the POSIX layer between those -- I've never heard of a target for which those don't use the same underlying representation.

For off_t, I don't think there's any C replacement you could use with the right semantics. fpos_t would be great as an abstraction but that's not required to be an integer type. I think you need to create a zephyr-specific type here. I would really encourage you to use a 64-bit type and avoid all of the disasters that come from 32-bit file size adventures.

C libraries for Zephyr could then adopt this new OS type for their own off_t and fpos_t definitions so that we would be assured of 64-bit file support through the whole stack. C libraries without such changes would need range checks across the OS abstraction layer to catch problems at runtime (sigh).

Reading through the picolibc code, I'm afraid it defines off_t as long. I'll go fix that so that it will always use 64-bits. That way it will match the Zephyr OS type you should create.

@cfriedt
Copy link
Member Author

cfriedt commented Jul 5, 2024

Thanks for your help with this @keith-packard 🪁

@cfriedt cfriedt force-pushed the declare-off-t-and-ssize-t-in-fdtable-h branch from 398403c to 55290f0 Compare July 6, 2024 13:13
@cfriedt cfriedt force-pushed the declare-off-t-and-ssize-t-in-fdtable-h branch from 55290f0 to f633095 Compare July 6, 2024 13:26
@cfriedt cfriedt changed the title fdtable: avoid pulling in posix header & types fdtable: remove inappropriate use of POSIX off_t and ssize_t Jul 6, 2024
@cfriedt cfriedt removed the Hotfix Fix for issues blocking development, i.e. CI issues, tests failing in CI, etc. label Jul 6, 2024
@cfriedt cfriedt added this to the v3.7.0 milestone Jul 10, 2024
@nashif
Copy link
Member

nashif commented Jul 10, 2024

This appears to be passing CI now and should be merged before v3.7.0

@cfriedt the nature of this change which involves changing types, touching 400 files and changing sginatures of many APIs requires more discussion and following the process outlined above, i.e. discussion and consesnus in architecture working group and wide broadcasting of the change to the community and those affected.

This would have been the case if we were not in the middle of a release. The fact that we are in the middle of a release and closing on a hard freeze milestone in 2 days makes it a bit more critical to get a signoff from stakeholders, not only rely on CI passing to get it merged.

The arch. wg was cancelled yesterday, so I can give you a slot in today's TSC to make the case for this to go into 3.7, if you wish to do that please mark this issue with the TSC label and join the TSC today.

@cfriedt
Copy link
Member Author

cfriedt commented Jul 10, 2024

@nashif - I have an appointment during the TSC call unfortunately. It's not something I usually plan on attending these days.

Additionally, I would be advocating for the same things I advocated for the last time I attended the TSC meeting. I.e. putting v3.7-branch into a better state for additional bug fixes and stabilization fixes.

At this point, it's unlikely that POSIX will be able to receive any significant backports in v3.7-branch.

There are also other fairly significant bugs that would be part of the release at this point (such as #75402).

Personally, I wouldn't consider tagging v3.7.0 without additional bug fixes, but that isn't my decision to make.

ycsin
ycsin previously approved these changes Jul 11, 2024
Copy link
Member

@ycsin ycsin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to fixing the layering issue, it is unfortunate that there are a lot of affected files and we didn't get to stop this from snowballing much earlier.

It is pretty late in the current release cycle, but for users that jump from one LTS release to the next, getting this sorted out before the release means that LTS v3 users will be better positioned for the next LTS release.

@kartben
Copy link
Collaborator

kartben commented Jul 11, 2024

It is pretty late in the current release cycle, but for users that jump from one LTS release to the next, getting this sorted out before the release means that LTS v3 users will be better positioned for the next LTS release.

Not only is it late in the cycle but it still is a tree-wide change so I believe the process linked in a previous comment would still need to be followed, and this potential change communicated to a wider audience ASAP.

@cfriedt
Copy link
Member Author

cfriedt commented Jul 11, 2024

Not only is it late in the cycle but it still is a tree-wide change so I believe the process linked in a previous comment would still need to be followed, and this potential change communicated to a wider audience ASAP.

Yes, the process requires the arch wg to approve, and then when it is stamped, an email should be sent out to the devel mailing list before it is merged, etc. If the arch wg does not come to a consensus it goes to the TSC, etc.

For the arch wg to approve, there would need to be consensus, and I think that would be the challenging part; some people really like defending bad decisions.

It would also need some buy-in from the release manager... but... 🤷🏼‍♂️

It is very late in the release cycle for all of this.

Of course; this should have been fixed many, many, releases ago 🤷🏼‍♂️

Chris Friedt added 3 commits July 12, 2024 06:35
Define k_off_t to have native support for measuring file offsets
within any kind of file that is supportable in Zephyr.

CONFIG_OFFSET_64BIT is added to allow 32-bit users to force
64-bit offsets, when trading space and speed for flexibility
makes sense for the application.

Define k_ssize_t to have native support for reporting file I/O
sizes along with negative error codes.

Signed-off-by: Chris Friedt <cfriedt@tenstorrent.com>
Update sof to use k_ssize_t instead of the POSIX ssize_t, thus
avoiding layering violations in core parts of Zephyr.

Signed-off-by: Chris Friedt <cfriedt@tenstorrent.com>
A shortcut was taken many years ago by "cherry-picking" the
POSIX off_t and ssize_t types for use throughout Zephyr.

Additionally, the POSIX header fcntl.h, as well as constants
defined in that header, were being used inappropriately
throughout Zephyr.

Doing so created a dependency cycle: below POSIX's position
in the stack, code depended on POSIX; above POSIX's position
in the stack, code depends on the lower parts.

Such dependency cycles usually result in fragility and
instability of the software stack.

Use the newly defined k_off_t and k_ssize_t types throughout
Zephyr's stack, where off_t and ssize_t were previously used
inappropriately.

Additionally, use ZVFS-prefixed constants instead of their
POSIX counterparts.

Additionally, where appropriate, ensure the POSIX fcntl.h
header is prefixed with <zephyr/posix/fcntl.h>.

We effectively create a mutual dependency to resolve the
cyclic dependency, as described in GitHub issue zephyrproject-rtos#51211.

For clarification, it is inappropriate to use POSIX types or
functions within the kernel, core OS, OS services, or really
anywhere that is not equal to or above POSIX's position in the
software stack.

In other words, if a library uses POSIX, Zephyr's POSIX
implementation cannot depend on that library.

Similarly, if a system service uses POSIX, Zephyr's POSIX
implementation cannot depend on that system service.

Signed-off-by: Chris Friedt <cfriedt@tenstorrent.com>
@cfriedt
Copy link
Member Author

cfriedt commented Jul 12, 2024

Copy link
Collaborator

@sylvioalves sylvioalves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From ESP32 side and memory discussion, LGTM.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooph, that's a lot of change. Might as well +1 just so we can get this in. In the longer term, maybe we should look at moving usage from ssize_t (AIUI a POSIX type) to ptrdiff_t (ISO C). Not sure whether off_t has an equivalent in the standard library, though? Probably not, so I guess a k_off_t is unavoidable.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops forgot vote

ssize_t len;
k_ssize_t len;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really correct? This driver is for native_sim, and some 10 lines later the return value from the host libc's read() is assigned to it.

Copy link
Member Author

@cfriedt cfriedt Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definitions should be identical (as far as the underlying type goes). Otherwise, there would be warnings / errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't doubt that. I was mainly wondering where a native_sim driver like this conceptually sits in the software stack. It's calling the host OS APIs (like read()) so in that sense it could be seen to be above POSIX (in which case ssize_t would be more appropriate), however the driver also accesses Zephyr kernel APIs. Anyway, this isn't necessarily that critical (as you say, the types should be the same), which is why I didn't want to block this PR by requesting changes, but I do think it would be good to understand where native_sim driver code sits in the software stack.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it's kind of a recursion. I would still say those drivers sit below (Zephyr's) POSIX API.

The namespace issue is kind of an unfortunate side-effect (not without solutions, ofc)

@andyross
Copy link
Contributor

Not sure whether off_t has an equivalent in the standard library, though?

Duh, I forgot stdio. So interestingly fread/fwrite() use size_t as the buffer offset type, so effectively in all real circumstances that demands off_t be the same. Interestingly fseek() never got the memo and uses a bare long as the offset type. But there, POSIX got huffy and defined fseeko() which takes off_t, so we don't get to assume anything.

@cfriedt
Copy link
Member Author

cfriedt commented Jul 12, 2024

Technically size_t would be the same already for 64 bit (without the sign)

@keith-packard
Copy link
Collaborator

Duh, I forgot stdio. So interestingly fread/fwrite() use size_t as the buffer offset type, so effectively in all real circumstances that demands off_t be the same.

Not at all -- size_t refers to the in-memory storage, which is limited to the address space. off_t covers file sizes, not memory, which C doesn't really talk about much, aside from fseek as you say. However, there's also fpos_t as used by fgetpos and fsetpos. C has weasel words for that allowing it to be a struct type, which allows file sizes beyond the scope of any integer type, but sadly doesn't map neatly to off_t.

On 32-bit Linux systems, size_t is 32-bits, but (most of the time), off_t is 64-bits.

@cfriedt
Copy link
Member Author

cfriedt commented Jul 12, 2024

Aren't standards wonderful?? 😂🙃

@nashif
Copy link
Member

nashif commented Jul 16, 2024

The PR is reaching the limit of what GH can handle, opening it in the browser just takes forever :(

from the commit message of the largest commit:

Additionally, the POSIX header fcntl.h, as well as constants
defined in that header, were being used inappropriately
throughout Zephyr.

Can this be done in its own commit?

Additionally, use ZVFS-prefixed constants instead of their POSIX counterparts.

ditto

Additionally, where appropriate, ensure the POSIX fcntl.h header is prefixed with <zephyr/posix/fcntl.h>.

ditto

@@ -866,6 +866,23 @@ config XIP
supply a linker command file when building your image. Enabling this
option increases both the code and data footprint of the image.

config OFFSET_64BIT
bool "Use 64-bit offsets"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be great if this went into its own section in Kconfig (menu), i.e. do not put alongside other features and make it obvious is it a type, maybe others will be added to the same menu later.

OFFSET_64BIT can be confusing, maybe make it obvious we are talking about a type here: TYPE_OFFSET_64BIT.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. The closest thing I found to this so far is TIMEOUT_64BIT. Should that be moved as well? It also has to do with a type (k_timeout_t).

@de-nordic
Copy link
Collaborator

de-nordic commented Jul 16, 2024

Not sure whether off_t has an equivalent in the standard library, though?

Duh, I forgot stdio. So interestingly fread/fwrite() use size_t as the buffer offset type, so effectively in all real circumstances that demands off_t be the same. Interestingly fseek() never got the memo and uses a bare long as the offset type. But there, POSIX got huffy and defined fseeko() which takes off_t, so we don't get to assume anything.

fread/fwrite operate on memory objects rather than buffers. The POSIX write operates on buffers, but still knows it can only write as much as size_t.
I think that when the fread/fwrite was designed you would still mostly write on tape, so you would rather do a lot of sequential operations, on same types of objects, and then fseek to some other location, and again do some sequential operations.
Tape could be much longer then your address space so using long for fseek made sense, as this operates on external address space which may not fit in size_t.
POSIX was designed later than C, to stop Sys V vs 4BSD API division in systems getting too annoying.
It is funny that if you have true POSIX system, this means that lseek would be native function using off_t, for offset, and the fseek implemented on top would have to convert from long to off_t on call.

@nashif nashif modified the milestones: v3.7.0, v4.0.0 Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Base OS Base OS Library (lib/os) area: POSIX POSIX API Library DNM This PR should not be merged (Do Not Merge) manifest manifest-sof treewide 🧹
Projects
Status: Todo