AArch64 support #550

Open
grigorig opened this Issue Mar 1, 2016 · 71 comments

grigorig commented Mar 1, 2016

Is there a chance of AArch64 builds of the userspace and kernel? What's missing to get this to work?

clivem commented Mar 1, 2016

kernel support!

Contributor

popcornmix commented Mar 1, 2016

This isn't going to happen from us any time soon. A 64-bit kernel is not trivial (and could be produced by the community).

deborah-c commented Mar 1, 2016

Could it be produced by the community? I think it might well need changes to the VC firmware to correspond, as interface structures would potentially change shape.

Contributor

popcornmix commented Mar 1, 2016

The kernel could be. Depends on the implementation if the interface to VC needs to change. Forcing 32-bit pointers in interface to VC would be a sensible solution that wouldn't need a VC side change.

Contributor

pelwell commented Mar 1, 2016

MMAL is an awkward case, since it passes kernel pointers to the client and expects them to be echoed back - the space is 32-bits, so some form of compression or lookup table would be required.
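The lookup-table option mentioned above can be sketched roughly as follows. This is only an illustration of the idea, not the MMAL or VCHIQ code: `handle_table`, `handle_alloc`, `handle_lookup`, `handle_free` and the table size are all hypothetical names chosen for the example, and a real driver would also need locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size table: (slot index + 1) is the 32-bit handle that
 * would be sent to VC in place of a 64-bit kernel pointer. */
#define HANDLE_TABLE_SIZE 64u

struct handle_table {
    void *slots[HANDLE_TABLE_SIZE];   /* NULL marks a free slot */
};

/* Store ptr and return a non-zero 32-bit handle, or 0 if the table is full. */
uint32_t handle_alloc(struct handle_table *t, void *ptr)
{
    assert(ptr != NULL);
    for (uint32_t i = 0; i < HANDLE_TABLE_SIZE; i++) {
        if (t->slots[i] == NULL) {
            t->slots[i] = ptr;
            return i + 1;             /* 0 is reserved as "invalid" */
        }
    }
    return 0;
}

/* Translate a handle echoed back by the firmware into the original pointer.
 * Unknown or out-of-range handles yield NULL instead of a wild pointer. */
void *handle_lookup(const struct handle_table *t, uint32_t handle)
{
    if (handle == 0 || handle > HANDLE_TABLE_SIZE)
        return NULL;
    return t->slots[handle - 1];
}

/* Release a handle and return the pointer it referred to (or NULL). */
void *handle_free(struct handle_table *t, uint32_t handle)
{
    void *ptr = handle_lookup(t, handle);
    if (ptr != NULL)
        t->slots[handle - 1] = NULL;
    return ptr;
}
```

A side effect of this design is that a corrupted value coming back from the other side is rejected by the range check rather than dereferenced.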

joolswills referenced this issue in RetroPie/RetroPie-Setup Mar 1, 2016: Rpi3 #1307 (Merged)

6by9 commented Mar 1, 2016

> MMAL is an awkward case, since it passes kernel pointers to the client and expects them to be echoed back - the space is 32-bits, so some form of compression or lookup table would be required.

Really? That doesn't sound right as kernel pointers have no meaning outside the kernel.
I'm happy to take a look if you'll email me details of the bit of concern.

6by9 commented Mar 1, 2016

Fair cop - not nice. The V4L2 driver is just copying the way MMAL did it.

How brave are we feeling? We could pull in the rpmsg mmal service instead, however that loses the bulk transfer facility so may need a slight change to the client code.

Edit: Hang on, that is the V4L2 driver only, so all kernel side. It's expecting VC to echo back a kernel pointer, not userspace.
I do have some changes planned for V4L2 which may help here (GSH and DC are aware). I'll check in a moment, but does the MMAL interface to userland have this same nastiness?

6by9 commented Mar 1, 2016

:-( Userland also expects VC to preserve a kernel pointer https://github.com/raspberrypi/userland/blob/master/interface/mmal/vc/mmal_vc_msgs.h#L360
That is just VC and kernel side, so could be updated fairly easily, but would be an ABI change between firmware and kernel (or we need the firmware to try and handle multiple different versions of structure).
There are a couple of other pointers in structures passed to VC which would need attention too (e.g. https://github.com/raspberrypi/userland/blob/master/interface/mmal/vc/mmal_vc_msgs.h#L421)

Are the other services OK?
IL had some niggles with having to set OMX_SKIP64BIT due to structure padding mismatches, but how does ILCS shape up more generally? Something will still need to reduce kernel 64bit pointers to 32 bit physicals for VC.
VCSM? Mailbox services?

My memory is failing me - did we ever get a 64bit kernel running? All userspaces were certainly 32bit.
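To make the "structures change shape" concern concrete, here is a small sketch. The structs are hypothetical and are not the real `mmal_vc_msgs.h` definitions; `to_vc_addr` is likewise an invented helper. It shows why a message that embeds a raw pointer has a layout that depends on the compiler's pointer width, how fixed-width fields pin the layout down, and what a checked narrowing of an address to 32 bits might look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message as 32-bit-only code might declare it: its size and
 * field offsets follow the pointer width of whoever compiles it. */
struct msg_native {
    uint32_t length;
    void    *client_context;
    uint32_t flags;
};

/* The same message with the wire layout pinned down, so a 32-bit firmware
 * and a 64-bit kernel agree on every offset. */
struct msg_wire {
    uint32_t length;
    uint32_t client_context;   /* 32-bit handle, not a raw pointer */
    uint32_t flags;
};

/* Fail the build if the wire layout ever drifts. */
static_assert(sizeof(struct msg_wire) == 12, "wire layout must not change");
static_assert(offsetof(struct msg_wire, flags) == 8, "flags offset must not change");

/* Narrow an address to 32 bits for the firmware. Returns 0 on success and
 * -1 if the address cannot be represented in 32 bits. */
int to_vc_addr(uint64_t addr, uint32_t *out)
{
    if (addr > UINT32_MAX)
        return -1;
    *out = (uint32_t)addr;
    return 0;
}
```

The compile-time checks are the part that catches padding mismatches early; the runtime check covers the case where an address simply does not fit.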

ncguk commented Mar 1, 2016

Speaking from a position of zero knowledge - how much of the Debian arm64 kernel source can be used before you run into problems?

deborah-c commented Mar 2, 2016

At Broadcom, the intention was 64 bit user land over 32 bit kernel, as a long term thing.

TheSin- commented Mar 2, 2016

Why not 64-bit across the board? Debian has a 64-bit kernel and distribution for ARM. I know the Pi-specific stuff would still need to be done, but why were they planning a 32-bit kernel? This is a curiosity question.

grigorig commented Mar 2, 2016

32 bit kernel w/ 64 bit userland? I wasn't aware that this is a possible combination. Seems like a strange idea.

6by9 commented Mar 2, 2016

> At Broadcom, the intention was 64 bit user land over 32 bit kernel, as a long term thing.

I'd remembered other way up - 64 bit kernel, 32 bit userland (as that was the current state of Android). I couldn't remember if that work had actually happened - did we actually have A53s in a chip that was brought up?

Contributor

pelwell commented Mar 2, 2016

64-bit user space with 32-bit kernel is not possible on ARMv8. The kernel (especially the task switching) needs to be able to access all register state used by user space, which wouldn't be possible if the kernel was in 32-bit mode. The ARMv8 architecture allows an AArch32->AArch64 transition as the result of an exception/interrupt, and AArch64->AArch32 on return from an exception; the reverse routes don't exist.
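The rule described above boils down to a small truth table. As a sketch (the enum and function names are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

enum exec_state { AARCH32, AARCH64 };

/* True if a kernel running in state `kernel` can host user code running in
 * state `user`, following the rule stated above: user space may be narrower
 * than the kernel, but never wider. */
bool kernel_can_host(enum exec_state kernel, enum exec_state user)
{
    assert(kernel == AARCH32 || kernel == AARCH64);
    assert(user == AARCH32 || user == AARCH64);

    /* Only the 32-bit-kernel / 64-bit-user pairing is ruled out. */
    return !(kernel == AARCH32 && user == AARCH64);
}
```

So a 64-bit kernel with a 32-bit userland is a valid combination, while the reverse is not.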

grigorig commented Mar 2, 2016

Okay, good to see this clarified.

On x86, 64 bit kernels have some (small) performance advantages even if combined with 32 bit userspace. Maybe that's a possible motivation to get it working on Pi 3 as well.

Contributor

pelwell commented Mar 2, 2016

There must be figures out there from all of those other A53-based SBCs comparing 32-bit vs 64-bit kernels - let's see some.

Ferroin commented Mar 2, 2016

It's worth noting that the biggest reason on x86 for a performance increase is not the wider registers, but the fact that x86_64 has more general purpose registers available, which means on average you need fewer load/store operations to do the same calculation. AArch64 however has the same required registers as AArch32, so x86 is not really a good point of comparison for the performance difference.

While I don't personally have any figures, I can attest that there is a small but noticeable performance improvement for 64-bit vs 32-bit on both SPARC and PPC with recent kernels. I've not seen figures for any 64-bit ARM processors, but I would assume there will be a similar small but noticeable performance increase there as well as the differences between 32 and 64 bit modes on SPARC/PPC are relatively similar to those on ARM when compared to the changes on x86.

That said, I think the big thing that will really make the difference stand out is the fact that the in-kernel timekeeping structures are in the process of being converted from 32-bit to 64-bit to avoid the y2038 issue. Once that hits mainline, most 32-bit systems will likely show measurably lower performance as a result.

On a slightly separate note, I seem to recall hearing something about AArch64 natively supporting use of 32-bit pointers in otherwise 64-bit code (kind of like the X32 ABI on x86, just supported directly in hardware). If that is the case, then it might make handling the issues with pointer width a bit easier (and also result in overall better memory usage).

deborah-c commented Mar 2, 2016

Sorry, my bad: I've clearly misremembered!

grigorig commented Mar 4, 2016

> On a slightly separate note, I seem to recall hearing something about AArch64 natively supporting use of 32-bit pointers in otherwise 64-bit code (kind of like the X32 ABI on x86, just supported directly in hardware). If that is the case, then it might make handling the issues with pointer width a bit easier (and also result in overall better memory usage).

I think it's called AArch64-ILP32. I am not sure if it is a good idea to use such an unusual ABI. No regular AArch32 or AArch64 binaries will work without a costly multilib setup.
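For readers unfamiliar with the naming, ILP32 and LP64 describe which C types are 32 or 64 bits wide. The sketch below classifies a set of type sizes; `data_model` and `native_data_model` are illustrative helpers, not part of any toolchain. Built as a regular AArch64 binary, `native_data_model()` would be expected to report LP64, and an ILP32 build ILP32, but that depends entirely on the compiler and ABI in use.

```c
#include <assert.h>
#include <stddef.h>

/* Classify a C data model from the sizes (in bytes) of int, long and
 * pointers. Only the common models are named; anything else is "other". */
const char *data_model(size_t int_size, size_t long_size, size_t ptr_size)
{
    assert(int_size > 0 && long_size > 0 && ptr_size > 0);

    if (int_size == 4 && long_size == 4 && ptr_size == 4)
        return "ILP32";   /* int, long and pointers are all 32-bit */
    if (int_size == 4 && long_size == 8 && ptr_size == 8)
        return "LP64";    /* long and pointers are 64-bit */
    if (int_size == 4 && long_size == 4 && ptr_size == 8)
        return "LLP64";   /* only long long and pointers are 64-bit */
    return "other";
}

/* Report the data model this translation unit was compiled for. */
const char *native_data_model(void)
{
    return data_model(sizeof(int), sizeof(long), sizeof(void *));
}
```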

Ferroin commented Mar 4, 2016

We would need a multilib setup anyway for 32-bit support if we do a 64-bit version, otherwise we're actively breaking compatibility with existing systems. On the Pi, flash storage space is cheap and upgradeable, whereas RAM is not, and this situation is exactly the type of thing that such ABIs are designed for. On top of that, we don't need to worry about AArch64 compatibility: we have no established user base using it, and people are more likely to either use stuff bundled with Raspbian (or whatever other distribution) or built locally than third-party proprietary code. The sole exception is the Oracle JDK, which isn't as critical as it was because we have much better performance now and IcedTea should run fine (and there's no hardware acceleration for Java on newer processors anyway, so using Jazelle doesn't really provide any performance improvement). Such compatibility would be nice, but should by no means be mandatory.

The big deciding factor should really be whether we can support all three ABIs at the same time (I know of no distribution on x86 that currently supports all three options there (32, 64, and x32), even though the kernel fully supports having all three operating modes on the same system), and whether the processor itself supports it (I think it's optional, but I'm not sure; I've never had the time to read the ARMv8 ABI spec).

Aside from that, my point was more that using that ABI in the kernel may allow us to avoid having to deal with pointer width issues in the kernel drivers. I'm not certain that the mainline kernel fully supports it yet, though.

ED6E0F17 commented Mar 5, 2016

(Upstream, not rpi-specific) Kernel ILP32 support is not fully baked, but someone is putting a lot of effort into it:

https://lkml.org/lkml/2015/12/15/737

grigorig commented Mar 5, 2016

I'm not very convinced that going for AArch64-ILP32 is a good idea. Raspbian is stuck with an unusual ARMv6 hard-fp architecture/ABI, but it was necessary given the BCM2835 SoC. Now we have a chance of finally switching to a standard ABI, so let's do that instead of going for some questionable new ABI that doesn't really have much support upstream.

Regarding multiarch, having to support fewer architectures is always a good thing. AArch64-ILP32 would add a third architecture into the mix. And storage might be cheap, but it's not free either! Also, multiarch can actually increase RAM usage because shared libraries can't be shared if processes of multiple architectures are running at the same time. This can be a pretty big deal if large frameworks like Qt are used.

niklas88 commented Mar 5, 2016

I wonder how big the case is for binary compatibility anyway? Most Raspberry Pi users that actually have non-repository software probably either use scripting languages like Python or have the source and shouldn't have a problem combining a system upgrade with recompiling their code. Also interestingly Go 32bit ARM executables would only need a 32bit glibc besides there being ARM64 support. So that basically leaves the people working with C, C++ without access to source code while still having a desire to upgrade.

On the other hand, many people likely do want to port code to ARM64 and would greatly benefit from the Raspberry Pi as an inexpensive ARM64 platform. So yeah, I really don't see the Raspberry Pi as depending on binary compatibility.

turnip86 commented Mar 7, 2016

For the love of all things digital, please provide an AArch64 kernel build. Debian and Arch already have arm64 ports, and a large number of Raspberry Pi owners are using those distros already, one of the motivations being armv7 support on Pi 2. There are significant performance increases - 15% to 30% - in running AArch64 code versus AArch32 on Cortex A53:

http://www.cnx-software.com/2016/03/01/64-bit-arm-aarch64-instructions-boost-performance-by-15-to-30-compared-to-32-bit-arm-aarch32-instructions/
(pelwell, Ferroin: was this what you're looking for?)

And this does not take into account the benefits of AArch32 compared to ARMv7, like load-acquire/store-release, new VFP float and SIMD instructions, and the cryptography extensions. https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf (page 106)

One group of users that will directly benefit from this are people who use the Pi for media and emulation. OpenELEC, OSMC and RetroPie all have separate armv6 and armv7 releases specifically to maximize performance.

Would any Raspberry-specific userland code need to be patched for this?

Ferroin commented Mar 7, 2016

@grigorig The big things that made me think about it were:

  1. AArch64-ILP32 is intended for memory constrained systems which will never need a 64-bit address space (VC4 limits us to 4G RAM, which fits this perfectly)
  2. There were multiple comments made about certain components only using 32-bit pointers, and thus potentially needing significant work to handle properly from a regular AArch64 kernel.
    I was advocating it less because I want to deal with it than because I thought it might help as a starting point.

@turnip86 There would likely be some significant code changes needed. From what I understand based on discussion both here and elsewhere, some of the hardware components only deal in 32-bit pointers, and handling that sanely will take some work, not only in the vc binaries, but likely also in most of the third-party stuff that uses hardware acceleration.

MrTomasz commented Mar 9, 2016

Maybe let's first try running a proper kernel in AArch64 mode?

I've already done a bunch of work to try to boot it, but I still can't see the kernel booting on the UART... As I mentioned on the forums, it's not a simple "make ARCH=arm64 defconfig Image" one-shot...

Is anyone else working on a 64-bit kernel as well?

TheSin- commented Mar 9, 2016

I currently am, but I'm nowhere near ready for a test boot yet. And my Pi 3 doesn't arrive for weeks yet, sadly.

MrTomasz commented Mar 9, 2016

@TheSin-
How far along are you with changes compared to the vanilla arm64 kernel?
Could you contact me?

TheSin- commented Mar 10, 2016

Once I get a full build, sure, but I'm sure I won't be the first or fastest source; there are people here much stronger at this stuff than me. I'm just using the Debian build system with cross compiling at the moment, and I haven't made it very far because I'm still messing with the defines for the .config build.

Not to mention I'm still working with 4.1, which Debian no longer supports. I've been thinking about jumping to 4.4, but Debian testing is only on 4.3, so there's lots to decide still. And I have no idea how stable the 4.3 and/or 4.4 branches are here. I assume everything in the 4.1 branch gets backported to the other branches, but I haven't looked into it. Though I'm sure a newer kernel would be easier to work with for arm64.

TheSin- commented Mar 10, 2016

Okay, so with the 4.1 tree and the Debian build system I believe I've finally got the config and such working, but now I'm at my first VC issue. This is where things are going to get icky, for me anyhow, as I assume we are going to have to convert everything to force 32-bit integers.

/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:462:11: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
  .write = vc_cma_proc_write,
           ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:462:11: note: (near initialization for ‘vc_cma_proc_fops.write’)
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c: In function ‘vc_cma_alloc_chunks’:
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:584:3: error: implicit declaration of function ‘dmac_flush_range’ [-Werror=implicit-function-declaration]
   dmac_flush_range(chunk_addr, chunk_addr + chunk_size);
   ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:585:3: error: implicit declaration of function ‘outer_inv_range’ [-Werror=implicit-function-declaration]
   outer_inv_range(__pa(chunk_addr), __pa(chunk_addr) +
   ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c: In function ‘cma_worker_proc’:
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:651:7: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
   if ((unsigned int)msg >= VC_CMA_MSG_MAX) {
       ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:658:11: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
    type = (int)msg;
           ^
In file included from /root/rpi/linux-4.1/linux/include/linux/printk.h:6:0,
                 from /root/rpi/linux-4.1/linux/include/linux/kernel.h:13,
                 from /root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:34:
/root/rpi/linux-4.1/linux/include/linux/kern_levels.h:4:18: error: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long unsigned int’ [-Werror=format=]
 #define KERN_SOH "\001"  /* ASCII Start Of Header */
                  ^
/root/rpi/linux-4.1/linux/include/linux/kern_levels.h:10:18: note: in expansion of macro ‘KERN_SOH’
 #define KERN_ERR KERN_SOH "3" /* error conditions */
                  ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:64:9: note: in expansion of macro ‘KERN_ERR’
  printk(KERN_ERR fmt "\n", ##__VA_ARGS__)
         ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:678:6: note: in expansion of macro ‘LOG_ERR’
      LOG_ERR
      ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:732:12: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
            (unsigned int)page);
            ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:64:30: note: in definition of macro ‘LOG_ERR’
  printk(KERN_ERR fmt "\n", ##__VA_ARGS__)
                              ^

Should I make a PR on the linux tree for the Kconfig changes? I'm mostly just reusing the 2709 stuff for now, since I don't have a 2710 to get more specific. I'd just like to be able to build to start with. I know the VC stuff is going to take some time and planning, but we all have to start someplace ;)
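
For readers following the log above: the pointer-to-int-cast errors appear because `unsigned int` and `int` are narrower than a pointer on a 64-bit build. A minimal user-space sketch of the usual pattern follows, assuming the message slot carries either a small integer code or a real pointer (which is what the casts in cma_worker_proc suggest). `EXAMPLE_MSG_MAX` and the function names are hypothetical stand-ins, not the driver's real code; inside the kernel the usual choices would be `unsigned long` and `%lu` or `%p`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical limit standing in for VC_CMA_MSG_MAX from the log. */
#define EXAMPLE_MSG_MAX 8u

/*
 * Casting through uintptr_t keeps the integer as wide as the pointer,
 * so the comparison is valid on both 32-bit and 64-bit builds.
 */
int msg_is_small_code(const void *msg)
{
    return (uintptr_t)msg < EXAMPLE_MSG_MAX;
}

/* Only narrow to int once the value is known to be a small code. */
int msg_to_type(const void *msg)
{
    assert(msg_is_small_code(msg));
    return (int)(uintptr_t)msg;
}

/* Print a pointer-sized value with a format that matches its width. */
int format_addr(char *buf, size_t len, const void *p)
{
    return snprintf(buf, len, "0x%" PRIxPTR, (uintptr_t)p);
}
```

The narrowing cast to `int` happens only after the range check, so no significant bits are discarded silently.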

MrTomasz Mar 10, 2016

You can first try disabling that kind of thing. I believe it should boot with a minimal subset of features...

Remember also to disable EFI in the config, otherwise you will create an incompatible kernel binary.

TheSin- Mar 10, 2016

Yeah, I just wanted to try with the VC stuff to start with, to see how far I could get. As for EFI, it's all set the same as in my rpi and rpi2 builds. Anyhow, trying it now with the VC stuff disabled.

TheSin- Mar 11, 2016

Okay, I disabled the VC stuff to try to get a little further. I'm now stuck with:

/tmp/ccyU2uoV.s: Assembler messages:
/tmp/ccyU2uoV.s:117: Error: missing immediate expression at operand 1 -- `dsb '
/tmp/ccyU2uoV.s:199: Error: missing immediate expression at operand 1 -- `dsb '
/tmp/ccyU2uoV.s:297: Error: missing immediate expression at operand 1 -- `dsb '
/root/rpi/linux-4.1/linux/scripts/Makefile.build:258: recipe for target 'drivers/dma/bcm2708-dmaengine.o' failed
make[7]: *** [drivers/dma/bcm2708-dmaengine.o] Error 1

Seems like pretty much all the RPi stuff is going to have issues of some sort. Asm is not my thing, so I assume I'm going to have to skip that.
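
A note on the assembler message above: in A64 syntax the `dsb` instruction takes an explicit option operand (for example `sy`), whereas 32-bit ARM assemblers accept a bare `dsb`. The error is consistent with inline assembly written for 32-bit ARM, though the driver source is not shown here, so that is an assumption. A minimal sketch of an architecture-aware barrier macro follows; `example_dsb` and `publish_with_barrier` are hypothetical names, not kernel APIs. In kernel code the existing barrier helpers would normally be used rather than open-coded assembly.

```c
#include <assert.h>

#if defined(__aarch64__)
/* A64 'dsb' needs an explicit option; 'sy' is the full-system variant. */
#define example_dsb() __asm__ volatile("dsb sy" ::: "memory")
#else
/* Fallback for other hosts: the compiler's full memory barrier builtin. */
#define example_dsb() __sync_synchronize()
#endif

/* Write a value, issue the barrier, then flag that the value is ready. */
int publish_with_barrier(volatile int *slot, volatile int *ready, int value)
{
    assert(slot && ready);
    *slot = value;
    example_dsb();
    *ready = 1;
    return *slot;
}
```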

MrTomasz Mar 11, 2016

I don't have my code with me right now, but if I remember correctly, you're building with CONFIG_DMA_BCM2708_LEGACY=y, which as I understand it is wrong for BCM2709 (and 2710).

I did it this way:

config DMA_BCM2708_LEGACY
    bool "BCM2708 DMA legacy API support"
    depends on (DMA_BCM2708 && !ARCH_BCM2710)
    default y

TheSin- Mar 11, 2016

Nice, I'll try that, thanks. I'm using 2709 as a base.

madscientist42 Mar 16, 2016

How're things coming along on this? Many are waiting with bated breath on the people trying right now (no sense in a bunch of duplicated efforts...)

popcornmix Mar 16, 2016

Contributor

Very impressive progress here.
In the last week there has been:

  • a 64-bit demo with UART output
  • a 64-bit port of U-Boot
  • a 64-bit upstream kernel (single core only, and no GPU features)

madscientist42 Mar 16, 2016

Epic. I'll need to pop over there to grab the work ongoing so that I can get a rough-cut for OE metadata there going. :D

swarren Apr 9, 2016

Should this be closed now? Per the 3rd comment here, the Pi Foundation is going to leave 64-bit kernel support to the community, and besides, that aspect should probably be covered by a bug against the kernel git, not the firmware git. The firmware does now support 64-bit booting, and any remaining issues re: that feature are covered by issue #579.

Ruffio Jun 29, 2016

Should this be closed?

xcvista Jul 23, 2016

I wonder if this method can solve this 32-bit pointer issue:

  • If we are talking about physical addresses, since the Raspberry Pi hardware will not support anything over 4GB anytime soon, we can safely crop off the high bits and pass the result to the hardware.
  • If we are talking about virtual addresses, I think the "canonical address" concept from amd64 can be borrowed: all high 32 bits of a 64-bit virtual address have to be the same as bit 31, and any virtual memory address outside that range results in SIGSEGV. In other words, limit the virtual address space to 0x0000000000000000-0x000000007fffffff and 0xffffffff80000000-0xffffffffffffffff. This means that when a pointer is passed to the hardware the top 32 bits can be safely cropped off, and when it is passed back it can be safely sign-extended into a 64-bit canonical address.
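
The crop-and-sign-extend round trip in the second bullet can be sketched in plain C. This is only an illustration of the arithmetic under the address ranges stated above; the function names are hypothetical and nothing here reflects the actual VC interface.

```c
#include <assert.h>
#include <stdint.h>

/* True if bits 63..31 of va are all equal, so va survives the round trip. */
int va_is_gpu_safe(uint64_t va)
{
    uint64_t top = va >> 31;            /* the 33 bits 63..31 */
    return top == 0 || top == 0x1FFFFFFFFull;
}

/* Crop a GPU-safe 64-bit address to the 32 bits sent over the interface. */
uint32_t va_to_wire(uint64_t va)
{
    assert(va_is_gpu_safe(va));
    return (uint32_t)va;
}

/* Sign-extend the 32-bit value back into a 64-bit canonical address. */
uint64_t wire_to_va(uint32_t wire)
{
    if (wire & 0x80000000u)
        return 0xFFFFFFFF00000000ull | wire;
    return wire;
}
```

Any address outside the two ranges fails `va_is_gpu_safe`, which is the condition the proposal would turn into a fault.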

popcornmix Jul 23, 2016

Contributor

I assume this scheme wouldn't help MongoDB, which I believe maps the whole database file (> 4GB) to virtual RAM. That's the only example I've seen reported as requiring a 64-bit address space to run.

But yes, if virtual and physical address spaces are limited to 32-bit then that would avoid the issue of pointers (e.g. userdata/cookies) being returned to applications from GPU callbacks. I'm sure some would argue that is not a fully 64-bit system (although with only 1GB of physical RAM the limitation is unlikely to affect many use cases).

xcvista Jul 23, 2016

@popcornmix This is almost exactly what the x32 ABI for amd64 is - 32-bit pointers for an otherwise 64-bit system. I think this can be a stop-gap method between the 32-bit only and fully 64-bit kernel.

Another method would be introducing one layer of indirection in the kernel. Whenever the userland passes a pointer to the GPU, it is caught by the kernel and put into a buffer, and a kernel pointer to the buffer is passed to the GPU instead. The kernel still has to keep itself inside the top-half "canonical address" range for this to work though, as pointers are still passed with their high bits cropped off. This can affect the efficiency of user-mode GPU calls but removes the 32-bit pointer length limit.

It seems to me that this pair fits well in the current Raspbian/Raspbian Lite release model. The first has a virtual memory size limit of 4GB but has faster graphics, so is better suited as a desktop system; the latter has the full 64-bit virtual memory space but graphics can be atrociously slow, so is better suited as a headless server system.
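
One common way to build such an indirection layer is a handle table: the 64-bit pointer stays on the ARM side and only a small 32-bit handle crosses the interface. The sketch below is a simplified variant of the buffer idea, not the scheme described above and not the firmware's actual protocol. All names are hypothetical, the table is fixed-size, and a real kernel implementation would also need locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_SLOTS 64u

struct handle_table {
    void *slot[EXAMPLE_SLOTS];   /* NULL marks a free slot */
};

/* Store ptr and return a nonzero 32-bit handle, or 0 if the table is full. */
uint32_t handle_put(struct handle_table *t, void *ptr)
{
    assert(t != NULL);
    if (ptr == NULL)
        return 0;
    for (uint32_t i = 0; i < EXAMPLE_SLOTS; i++) {
        if (t->slot[i] == NULL) {
            t->slot[i] = ptr;
            return i + 1;        /* handle 0 is reserved for "invalid" */
        }
    }
    return 0;
}

/* Resolve a handle echoed back by the other side and free its slot. */
void *handle_take(struct handle_table *t, uint32_t handle)
{
    assert(t != NULL);
    if (handle == 0 || handle > EXAMPLE_SLOTS)
        return NULL;
    void *ptr = t->slot[handle - 1];
    t->slot[handle - 1] = NULL;
    return ptr;
}
```

The linear scan is the lookup cost in this sketch; it stays small as long as few requests are outstanding at once.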

popcornmix Jul 23, 2016

Contributor

There is an option of using 32-bit pointers globally (as a compiler default), but that precludes using standard 64-bit Debian packages, so is not a favoured option.

64-bit pointers that are forced (through some kernel virtual address limiting) to only have 32 significant bits is a possibility, but doesn't fix MongoDB.

I think the layer of indirection in the kernel<->GPU interface is probably the best option, but there may be some performance hit in the lookups. Probably not critical in general, as I suspect the number of messages awaiting a response from the GPU will normally be low, but there may be some situations where it gets to be a problem.

Currently we haven't seen strong evidence (e.g. benchmarks) that shows there will be a noticeable performance improvement when moving to 64-bit, so it's unlikely to become a default configuration for Raspbian and hence is not a very high priority. We'd certainly like to support it for users who are interested, so suggestions for good ways to solve it are welcome.

xcvista Jul 23, 2016

@popcornmix Both the limited pointer solution and the GPU trapping solution allow the use of standard Debian packages, and the trade-off is virtual memory space versus graphics performance. I think this should be a choice for the user to make.

A 64-bit processor can handle SHA512 (as well as its friends SHA384, SHA512/224 and SHA512/256) much faster than a 32-bit one, as the internal state words, being 64 bits long, fit in registers natively. Also, AArch64 has more registers than AArch32, allowing for more aggressive optimizations.
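
To illustrate the SHA512 point: the algorithm is built from 64-bit rotates and XORs such as the Σ0 function from FIPS 180-4, shown below as a small sketch. On a 64-bit target each rotate can operate on a single register, while a 32-bit target has to handle each 64-bit word as a pair of registers. This shows only the operation shape and makes no claim about measured speed; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit word right by n bits, 0 < n < 64. */
uint64_t rotr64(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

/* SHA-512 "big sigma 0": ROTR28(x) ^ ROTR34(x) ^ ROTR39(x). */
uint64_t sha512_Sigma0(uint64_t x)
{
    return rotr64(x, 28) ^ rotr64(x, 34) ^ rotr64(x, 39);
}
```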

cleverca22 Jul 24, 2016

Would it be possible to do both?

Only use half of the 64 bits for any app dealing with the GPU,

but use the full 64 bits for non-GPU things like MongoDB?

xcvista Jul 25, 2016

@cleverca22 Then how do you tell them apart? What if a program that has already claimed a memory block outside the canonical memory block suddenly starts to call the GPU?

cleverca22 Jul 30, 2016

The only thing I can think of there is a flag in the ELF headers that you set at compile time, to promise never to make GPU calls.

Though now that I think of it, you could also modify the userland to just use mmap() to create a secondary heap in the lower 4 GB of the address space?

xcvista Jul 30, 2016

@cleverca22 There is a MAP_32BIT flag in mmap(2) for AMD64. Maybe we can implement this for AArch64? Plain malloc(3) makes no promise about virtual memory location (and can go over 2GB), but mmap(2) with MAP_32BIT guarantees a sub-2GB address range.

cleverca22 Jul 31, 2016

And since you're dealing with relatively large buffers being shared with the GPU, mmap isn't really an overhead; malloc will often internally re-route to mmap when you request large blocks.

xcvista Aug 1, 2016

@cleverca22 @popcornmix So to round up: I think we can use a straight, full 64-bit AArch64 kernel and implement the MAP_32BIT flag for mmap(2) with the same semantics as implemented on AMD64. This poses no performance penalty, as there is no pointer trapping involved and no reserved memory. The MAP_32BIT flag allows allocating (or mapping) memory with the highest 33 bits of each 64-bit pointer guaranteed to be zero over the entire allocation block, allowing GPU-facing code to allocate memory with pointers that are safe to crop short.

This means:

  • The existing VC driver stack will still work after refactoring with 64-bit in mind,
  • No change to VC code is needed,
  • The kernel must relocate itself to high-half GPU-safe memory 0xffffffff80000000-0xffffffffffffffff before making GPU calls,
  • Existing VC-facing code must be modified to use mmap(2) with MAP_32BIT to allocate memory that would be passed to the VC,
  • Optionally, implement a memory range check in VC kernel code to EBADM or SIGSEGV out calls with a non-canonical address.
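
For reference, this is roughly how MAP_32BIT is used where it already exists (x86-64 Linux). The sketch below requests the flag only when the headers define it for that architecture; on any other platform, including AArch64 today, it falls back to an ordinary mapping with no placement guarantee, which is exactly the gap the proposal would fill. The function names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* 1 if this build can request low placement via MAP_32BIT, else 0. */
int have_map_32bit(void)
{
#if defined(MAP_32BIT) && defined(__x86_64__)
    return 1;
#else
    return 0;
#endif
}

/* Map anonymous memory, asking for a sub-2GB address where supported. */
void *alloc_gpu_visible(size_t len)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#if defined(MAP_32BIT) && defined(__x86_64__)
    flags |= MAP_32BIT;
#endif
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Range check: does [p, p + len) lie entirely below 2GB? */
int fits_below_2g(const void *p, size_t len)
{
    uintptr_t start = (uintptr_t)p;
    return start < 0x80000000u && len <= 0x80000000u - start;
}
```

A check like `fits_below_2g` corresponds to the optional range check in the last bullet: the driver would reject buffers that fail it rather than crop their addresses.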

xcvista Aug 2, 2016

Another point of optimization with a 64-bit kernel with MAP_32BIT is code address optimization (CAO) with address space layout randomization (ASLR).

CAO means that code segments of PIE executables are loaded outside the GPU-safe range. This requires an ASLR facility, so we might as well implement that too. It randomizes the address layout of both the kernel and the loaded PIE, reducing the likelihood of a stack or heap overflow attack working.

Gaunah Jan 7, 2017

Is this issue still relevant? I.e. is there a chance of getting AArch64 support?

madscientist42 Jan 8, 2017

I think it is... Raspbian (i.e. the Pi Foundation) is dragging its feet, but it seems that SuSE and Arch have managed to get there. (I got SuSE to boot on my Pi 3, but had varying hardware issues: it doesn't play nice with a PiTop monitor and doesn't seem to work right with a Logitech Unifying HID device. Arch is about to be evaluated here in a few moments.)

All things considered, there's little excuse for the Foundation to NOT embrace this as an option, since they're phasing out the Pi 2's SoC for the Pi 3's, with the Pi 2 now differing only in being sans WiFi and Bluetooth. It makes a HELL of a lot more sense to have two worlds- PiZero/Pi and then Pi2/3, with the old 2's being in the other 32-bit ARMv6/7 world and the rest being in the ARMv8 properly. You gain a LOT from being in AArch32, you gain even MORE in AArch64.

madscientist42 Jan 8, 2017

Now, SuSE side-stepped some of this- they used the UEFI "firmware" path. No telling what Arch did yet- I'll be seeing this in a bit and reporting back.

pelwell Jan 8, 2017

Contributor

"It makes a HELL of a lot more sense to have two worlds- PiZero/Pi and then Pi2/3, with the old 2's being in the other 32-bit ARMv6/7 world and the rest being in the ARMv8 properly."

That sounds like three worlds.

"You gain a LOT from being in AArch32, you gain even MORE in AArch64."

Back in July, @popcornmix wrote "Currently we haven't seen strong evidence (e.g. benchmarks) that show there will be a noticeable performance improvement when moving to 64-bit", and that remains true today. If you have some compelling numbers then please share them with us.

madscientist42 Jan 8, 2017

That's because nobody's DONE a lot of benchmarks.

I'd opine that many said PRECISELY the same things about x86-64 when it came out. I know, I was one of the early adopters, getting access to one of the Clawhammer prototypes.

Most of the commentary talks about "increased" memory sizes and the like, and the commenters don't have a single clue about what they're talking about, unfortunately. I'll probably get benchmarks for myself and see.

As for your numbers...

http://www.cnx-software.com/2016/03/01/64-bit-arm-aarch64-instructions-boost-performance-by-15-to-30-compared-to-32-bit-arm-aarch32-instructions/

http://www.anandtech.com/show/7335/the-iphone-5s-review/4

30% is enough to bother with. 15%'s a bit shallow, but it's a gain all the same. Some things gain massive jumps of 200% over the 32-bits with a few things losing single-digit percentages, which could be merely implementation fails.

I'm one for not leaving things lying on the floor JUST to be able to boot one single image across the line (which you can't do anyhow...you've got PiZero/Pi and Pi2/3 images right now as it is.) Sorry, the argument there IS specious and invalid, based on just that alone.

Clear numbers have been said- and only a couple of months past when this was boldly (and quite incorrectly, I might add) said. Time to ditch the rubbish and think about what you can gain from it all.

madscientist42 Jan 8, 2017

As for Arch...it's up. Now I get to do my own analysis and benchmarking. The fact that SuSE saw fit to make this is a hint for most...they're not ones to "waste time" on things.

xcvista Jan 9, 2017

@pelwell Two worlds: ARMv6 AArch32 version and ARMv8 AArch64 64-bit-only version.

I think that the MAP_32BIT hack I mentioned above is still relevant. The libc can be modified to use MAP_32BIT in malloc(3) by default (this shouldn't break compatibility) to make the migration easier.

pelwell Jan 9, 2017

Contributor

Pi2 is ARMv7.

xcvista Jan 9, 2017

@pelwell There are two Pi 2 revisions, Rev 1.1 and Rev 1.2. Rev 1.1 shipped with the BCM2836, which is an ARMv7 processor. Rev 1.2 has been updated with the BCM2837, an ARMv8 AArch64 processor.

The ARMv6 version runs on Pi 0, 1, 1+, 2 Rev 1.1 and CM1, while the ARMv8 AArch64 version runs on Pi 2 Rev 1.2, 3 and CM3.
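
The proposed split can be sketched as a lookup table. This is only an illustration of the board list given above: the image labels `aarch32-image` and `aarch64-image` and the function name `image_for` are hypothetical, and the SoC for each board is filled in on the assumption that the Pi 0, 1, 1+ and CM1 use the BCM2835 and the Pi 3 and CM3 use the BCM2837.

```python
# Board -> (SoC, architecture), following the list in the comment above.
BOARDS = {
    "Pi 0": ("BCM2835", "ARMv6"),
    "Pi 1": ("BCM2835", "ARMv6"),
    "Pi 1+": ("BCM2835", "ARMv6"),
    "CM1": ("BCM2835", "ARMv6"),
    "Pi 2 Rev 1.1": ("BCM2836", "ARMv7"),
    "Pi 2 Rev 1.2": ("BCM2837", "ARMv8"),
    "Pi 3": ("BCM2837", "ARMv8"),
    "CM3": ("BCM2837", "ARMv8"),
}


def image_for(board: str) -> str:
    """Pick an image under the proposed two-image scheme (labels are made up)."""
    _soc, arch = BOARDS[board]
    # Only ARMv8 boards can run an AArch64 kernel and userland;
    # everything else stays on the existing 32-bit image.
    return "aarch64-image" if arch == "ARMv8" else "aarch32-image"


for board in BOARDS:
    print(f"{board:13} -> {image_for(board)}")
```

The Pi 2 Rev 1.1 is the odd one out: it is ARMv7, so it stays on the 32-bit image alongside the ARMv6 boards.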

pelwell Jan 9, 2017

Contributor

Have you tried booting a 2836-based Pi2 with a 2835 image?

Ruffio Jan 9, 2017

@pelwell Is this really the level of argument when discussing a 32-bit vs 64-bit kernel? Wording? Shouldn't the arguments be about performance, compatibility, pros/cons, overall architecture and what the future brings?

Ferroin Jan 9, 2017

He asked a perfectly valid question given what @xcvista proposed. If the suggested plan going forwards is to be running an ARMv6 kernel on Pi 0, 1, 1+, CM1 and 2r1.1 and an ARMv8 kernel on 2r1.2, 3, and CM3, then it needs to be determined how well a BCM2835 kernel (the ARMv6 one) works on a BCM2836 system (the Pi 2 r1.1 SoC).

pelwell Jan 9, 2017

Contributor

@Ruffio Supporting multiple kernel configurations is a resource drain, which is why the 2 vs 3 question is important.

xcvista Jan 9, 2017

@Ferroin Currently the AArch32 image contains ARMv6 and v7 kernels at the same time. That is not being changed as ARMv6 userland still works with a v7 kernel. What I am talking about is a new image based on AArch64 which not only calls for a new kernel but also a new userland.

Ferroin Jan 9, 2017

Your statement suggested (at least to both myself and @pelwell) eliminating the ARMv7 kernel from the mix, which if it would work reasonably well, would be a viable option on the kernel side because the Foundation doesn't want to support all that many kernels.

pelwell Jan 9, 2017

Contributor

Precisely. And while we are prepared to selectively improve performance on some Pis, we are loath to slow down some Pis in order to speed up others, particularly where the older, slower models are adversely affected.

Ferroin Jan 9, 2017

Slightly OT, but I'm curious how much difference there actually is between the ARMv6 and ARMv7 kernels beyond the generic ISA differences between v6 and v7. IOW, is there actually all that much maintenance burden from a coding perspective, or is it mostly just testing? Looking at the code with my admittedly somewhat lacking background regarding the drivers in question, it doesn't look like there's much actual difference between v6 and v7 code. Looking at the issue tracker, very few issues appear to have resulted from a difference between v6 and v7, other than the initial ones that are expected when adding new code.

Yes, it's not anywhere near as trivial to add v8 support (although from looking at Arch, it looks like most of the work other than the GPU drivers is already done), and yes, this will break compiled code (although the percentage of stuff on the Pi that's compiled code and not scripted is probably pretty small). But you've got a community that's pretty willing to help with testing and debugging, and given that, I think this won't add anywhere near as much work as you seem to think once the initial work of porting the drivers (arguably poorly written, given that they're not 64-bit clean) is complete.

pelwell Jan 9, 2017

Contributor

There are differences in the ARM address map between BCM2835 and BCM2836 - it isn't just the ISA - but as far as ongoing development is concerned the overhead is primarily "just" one of building and testing the different variants.
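
One concrete instance of such an address-map difference is the physical base of the peripheral block as seen from the ARM. The values below are not stated in this thread; they are my understanding of the published peripheral documentation and should be treated as assumptions, as should the helper name `gpio_base`.

```python
# ARM physical base address of the peripheral block, per SoC.
# Assumed values, not taken from this thread.
PERIPHERAL_BASE = {
    "BCM2835": 0x2000_0000,
    "BCM2836": 0x3F00_0000,
}

# Offset of the GPIO registers within the peripheral block (assumed).
GPIO_OFFSET = 0x20_0000


def gpio_base(soc: str) -> int:
    """Return the ARM physical address of the GPIO registers for a SoC."""
    return PERIPHERAL_BASE[soc] + GPIO_OFFSET


for soc in PERIPHERAL_BASE:
    print(f"{soc}: GPIO registers at {gpio_base(soc):#010x}")
```

A kernel built with one base hard-coded would address the wrong registers on the other SoC, which is why such values are normally supplied per board rather than compiled in.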

6by9 referenced this issue in raspberrypi/linux on Jan 16, 2017: ARM64: What's missing at this point #1801 (Closed)
