Port PCRE2 JIT to Linux on IBMz (s390x) #89

Open
edelsohn opened this issue Aug 24, 2020 · 71 comments · Fixed by #91

@edelsohn

Enable PCRE2 JIT on Linux on IBMz (s390x-linux) and optimize it to achieve a speedup over non-JIT code equivalent to that on x86_64. The goal is full functionality and a passing test suite with JIT enabled.

An unfinished port of PCRE2 JIT to s390x exists in the linux-on-ibm-z GitHub account and can be used as a starting point. IBM will contribute the code as necessary.

https://github.com/linux-on-ibm-z/pcre2/tree/s390x-experimental

A $5000 bounty from IBM is available, posted on Bountysource:
https://www.bountysource.com/issues/92353837-port-pcre2-jit-to-linux-on-ibmz-s390x

System access is available through the LinuxONE Community Cloud at Marist:
https://linuxone.cloud.marist.edu/#/register?flag=VM

This issue is tied to a feature request to the Exim community for PCRE2 support.
https://bugs.exim.org/show_bug.cgi?id=2635

@edelsohn
Author

@zherczeg Do you know of someone in the PCRE2 JIT community who would like to work on this issue?

@zherczeg
Owner

Perhaps @carenas might be interested.
As for me, I work at a university that has many industrial collaborations, so that could be an option, although it would be pricier.

@carenas
Contributor

carenas commented Aug 28, 2020

The development branch s390x shows some encouraging progress (only in the cloud though, not in qemu):

$ bin/regex_test
Pass -v to enable verbose, -s to disable this hint.

REGEX tests: all tests are PASSED on s390x 64bit (big endian + unaligned)

pcre2 itself fails (even without JIT) and will need work functionality-wise: pcre2test segfaults with anything that touches invalid UTF (not including a commented-out test that would segfault in the JIT part):

Successful test ratio: 50% (330 failed)
Invalid UTF8 successful test ratio: 0% (129 failed)

and, as zherczeg pointed out, it will likely need a significant amount of work to get close to the performance objective.

@edelsohn
Author

I can't access the branch created by carenas. Is that the same as the branch from linux-on-ibm-z?

How much effort do you estimate to make PCRE2 functional?

How much effort to optimize it for the platform?

What amount of bounty would make it interesting, probably split into sub-milestones?

@carenas
Contributor

carenas commented Aug 28, 2020

I can't access the branch created by carenas. Is that the same as the branch from linux-on-ibm-z?

My bad; I fixed the link. It is mostly the linux-on-ibm-z code, but with a few fixes on top to fight the bitrot so it builds against the current tree for further development.

How much effort

It is too early to know, but the original code didn't support the FPU and will need put_label support added before it can work with current PCRE2 (including bad UTF-8). So the sooner we can get it into a shape good enough for merging (even if it doesn't perform well), the better, IMHO, to make sure it doesn't bitrot further.

@edelsohn
Author

IBM can redirect the current bounty towards the basic enablement. We would appreciate guidance from the PCRE2 community on a plan and a sizing for the various steps: first basic enablement suitable for merging, then further optimization to achieve optimal performance.

@carenas
Contributor

carenas commented Sep 2, 2020

@edelsohn: is instruction cache coherency an issue on Z? The original code went to great lengths to avoid clearing the instruction cache when code needs updating (e.g. when jump addresses or constants are updated, and now when put_labels are updated). Even though it uses a somewhat independent pool to keep those values, that pool is still in the same memory segment as the rest of the code and is therefore likely to be cached along with the instructions, AFAIK.

On the positive side, while it doesn't yet fully work (because of put_labels), at least it doesn't segfault.
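
For illustration, a minimal sketch (a hypothetical helper, not code from the branch) of the conventional, portable pattern being avoided: after rewriting a constant or jump target inside already-emitted code, flush that range with GCC/Clang's __builtin___clear_cache so instruction fetch sees the update. On targets with coherent instruction caches the builtin is effectively a no-op.

#include <stdint.h>
#include <string.h>

/* Sketch: patch one 64-bit pool entry inside generated code, then flush
   the patched range before the code is executed again. */
static void patch_pool_entry(uint8_t *slot, uint64_t value)
{
    memcpy(slot, &value, sizeof value);
    __builtin___clear_cache((char *)slot, (char *)(slot + sizeof value));
}

Whether s390x needs the flush at all is exactly the question above.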

@edelsohn
Author

edelsohn commented Sep 2, 2020

Thanks for the progress.

IBMz has a fairly strong memory consistency model, like x86. Are you experiencing unusual behavior or just checking? I'm not aware of other self-modifying IBMz code explicitly flushing / clearing caches.

@carenas
Contributor

carenas commented Sep 2, 2020

Just checking, and hoping to have a (likely incomplete) version that could pass all tests by the end of the week.

@carenas
Contributor

carenas commented Sep 3, 2020

Building and running all tests successfully in:

https://travis-ci.com/github/carenas/sljit/builds/182588796

as well as when applied to a trunk version of PCRE2, as shown by this run in the IBM cloud:

make  check-TESTS
make[2]: Entering directory '/home/linux1/src/pcre'
make[3]: Entering directory '/home/linux1/src/pcre'
PASS: pcre2_jit_test
PASS: RunTest
PASS: RunGrepTest
============================================================================
Testsuite summary for PCRE2 10.36-RC1
============================================================================
# TOTAL: 3
# PASS:  3

Note that it is not enabled by default and is mainly available to coordinate further development: it is missing functionality that would trigger abort(), has no FPU (nor SIMD) support, and has not been optimized or even profiled, so it might even be slower than the interpreter.
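
As a usage note, here is a minimal sketch (standard PCRE2 API, not specific to this branch) of how a caller can probe for JIT at run time and silently fall back to the interpreter, which is what happens on builds where the s390x JIT is not enabled:

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int main(void)
{
    uint32_t have_jit = 0;
    pcre2_config(PCRE2_CONFIG_JIT, &have_jit);   /* 1 if JIT support was compiled in */

    int errcode; PCRE2_SIZE erroff;
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"a+b", PCRE2_ZERO_TERMINATED,
                                   0, &errcode, &erroff, NULL);
    if (re == NULL) return 1;

    /* If this is skipped or fails, pcre2_match() simply uses the interpreter. */
    if (have_jit)
        pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);

    printf("JIT available: %u\n", have_jit);
    pcre2_code_free(re);
    return 0;
}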

@edelsohn
Author

edelsohn commented Sep 3, 2020

Awesome progress, @carenas ! Let us know when you and the community have analyzed the situation and have sizings for the next steps.

@carenas
Contributor

carenas commented Sep 4, 2020

Let us know when you and the community have analyzed the situation and have sizings for the next steps.

@zherczeg: would the following make sense for sizing and next steps?

  1. Get the s390x branch developer-ready so it is maintained in-tree and avoids further bitrot (ETA end of the week; it will need review, and while I agree it has some rough edges, it should be safe since it is excluded from autodetection, and I hope to keep further development rebased and under CI). This should help testing and even contributions from the wider community (still hoping those IBM insiders bring back some of their very useful architecture knowledge this way).
  2. Get a newer version that completes the implementation needed to fully support s390x (mainly FPU and vector operations, which is what I expect will be needed, with a target CPU of z13). I expect this to take about 40 man-hours, and it will also include a repository with patches to PCRE2 kept under CI.
  3. Tie up loose ends. Most of the abort() calls should be gone by then, but there will likely still be some important TODOs and other work needed for performance, as well as support for anything that was not on the critical path (e.g. a fallback for non-FPU or older Z machines, with the goal of hopefully making it run in QEMU). This is likely where most of the optimization and performance work will happen. I suggest including this code (still disabled for autodetection) in the next PCRE2 release to broaden the user base, recruit contributors, and gather information to help guide future development, with autodetection hopefully enabled in the release after that.

Note that the original fork was maintained for almost two years, but it might still be only 60% done, and there is also the need to learn the architecture (which, being proprietary, makes things slightly more difficult). I am optimistic it can be done, thanks to the cloud availability and a good starting base, which is what I focused on last week to make sure all the previous investment is not wasted.

@zherczeg
Owner

zherczeg commented Sep 4, 2020

I would like to avoid what happened with the TileGX port. It was never completed, it has no maintainers, and thus it would be best to remove it. As for a new port, I would prefer to have a cross compiler, qemu, and gdb tools that I can use to build and test the new port, and also some CPU / ABI documentation, since I need to learn it at some level. I am curious what "proprietary" exactly means here. No public documentation available? No free tools available?

@edelsohn
Author

edelsohn commented Sep 4, 2020

Yes, s390x is proprietary compared with, say, RISC-V, but not compared with Intel x86_64. Proprietary doesn't mean secret. GCC, LLVM, PyPy (including NumPyPy), node.js, OpenJDK, and Mono have all been ported to s390x. There is plenty of documentation.

z/Architecture ISA

Linux on z ABI

QEMU for s390x

You can request access to a system (VM/VPS) through the LinuxONE Community cloud that I mentioned in the first message.

The LinuxONE Community Cloud also hosts Travis-CI instances, which Carlo used.

IBM s390x experts are available to answer questions.

40 hours seems a reasonable estimate. We can continue to discuss how to set up incremental milestones and associated bounties.

@carenas
Contributor

carenas commented Sep 4, 2020

I would prefer to have a cross compiler, qemu, gdb tools which I can use to build and test the new port and also some cpu / abi documentation since I need to learn it at some level.

Both gcc and clang can cross-compile code for s390x (and indeed I have both set up in the CI on two versions of Ubuntu), and I was originally using the standard gcc cross-compiler package from Ubuntu 20.04 to build the code and develop it further, so unlike TileGX this is considerably more open (with CI / cloud VMs available and several Linux distributions to use on real hardware).

qemu-user targets s390x and does work, but it fails tests, either because the instructions we are using are not correctly emulated in the default CPU it uses, or because the original code targets z12 and the implementation differs enough. Running with an emulated z12 or z13 might help, but using a real VM is easier.

That is why I mention CI, and why I was hoping in the third step to broaden CPU support, which will also help real users with older hardware, even if initially the target will be z13 (which includes vector support).

In that context my "proprietary" label meant that, for now, we will need to rely on the LinuxONE cloud for development, but unlike TileGX the situation is likely to get better later.

@zherczeg
Owner

zherczeg commented Sep 4, 2020

I have checked the documentation. At first glance the system design matches sljit well: it uses the two's-complement number format, the ABI is similar to PowerPC, and it has condition codes, wide multiply, and IEEE single / double precision floating point. It also has an incredible number of instruction forms, even more than ARM-T2, probably because it is an old architecture. I have one more question: do we need to support EBCDIC? PCRE2 has many ASCII / UTF related optimizations which would not work with it.

@edelsohn
Author

edelsohn commented Sep 4, 2020

Linux on IBMz is ASCII, and we are only asking to port PCRE2 JIT for Linux, not to z/OS with EBCDIC.

z14 added some additional SIMD instructions, so you might want to target that instead of z13, if it's beneficial.

@carenas
Contributor

carenas commented Sep 4, 2020

z14 added some additional SIMD instructions, so you might want to target that instead of z13, if it's beneficial.

Is z14 what is being used by the LinuxONE cloud-provided VM? The Travis containers are running older-than-usual Ubuntu versions as well, so I was suspecting they might be constrained by the CPU version there too.

Note that the main concern here is being able to maintain the code base moving forward, which is why QEMU was ideal (since it can run locally on each developer's workstation, and we have a lot of different OSes to support there, including one person with z/OS who is likely to benefit from this port if it is done in a compatible way). Of course, CI and remote access to a VM on native hardware is a good enough substitute, but it is less scalable, hence why I was hoping it would only be needed during the original bootstrap.

@edelsohn
Author

edelsohn commented Sep 4, 2020

I believe that the LinuxONE systems (hosted at Marist College) are now z15. Mainframe processors are not available in laptops, sorry. I understand the appeal, but I would recommend relying on remote access over QEMU.

I am not certain what OS levels are available through Travis CI on s390x.

@carenas
Contributor

carenas commented Sep 5, 2020

I looked at the qemu side, and we are definitely unlikely to get that going regardless of how much we tweak the code generator's CPU support; FWIW, even qemu-system (running Fedora Rawhide for s390x) segfaults with "interruption code 0010 ilc:3":

[Screenshot (Sep 4, 2020): qemu guest console showing the crash]

@zherczeg
Owner

I think this issue should not be closed

@zherczeg zherczeg reopened this Sep 13, 2020
@edelsohn
Author

edelsohn commented Oct 2, 2020

How is the enablement work proceeding?

@carenas
Contributor

carenas commented Oct 2, 2020

How is the enablement work proceeding?

Slower than planned, mainly due to my poor planning; but it is on the right track now that the first phase is complete (with a lot more changes, and still a few more reservations, than originally expected).

@zherczeg
Owner

zherczeg commented Oct 4, 2020

I have started to do some fixes, but I need to learn a lot more before I can do them the way I want, and my time for volunteer work is quite limited.

@edelsohn
Author

@carenas Any updates about this project?

@carenas
Contributor

carenas commented Nov 12, 2020

Phase 1 is included in the RC1 for PCRE2 10.36, which means that with the right setup we could now get more people working in parallel to clean up and complete the implementation so it is ready for end users.

After merging phase 1 it became clear that my original plan would incur too much tech debt, so it is reasonable to expect that phase 2 (getting vector operation support) and phase 3 (cleaning up old tech debt) will benefit from wider distribution, and therefore more hands than I was originally expecting to have (only me).

This obviously doesn't qualify under the bounty terms (which, in all fairness, I have to admit I was uncomfortable with after pulling in so much volunteer work reviewing phase 1), but I am still committed to getting this out (hopefully with a little more help, even if that means I will have to "subcontract" that work).

Apologies for the delays, but I am hoping that, in the end, the PCRE2 release that enables user-facing JIT support on s390x will be of better quality.

It is important to note that most of the work I'd been doing (in PCRE2, not in sljit) was to add FPU support for phase 2 in the same way it is done for the other architectures. My concern is that it might be too hacky and therefore increase tech debt unnecessarily, even if it is possible (I was hoping to get mostly vector instructions in, while leaving everything else untouched, as originally planned for phase 1). I am also concerned that, with a third implementation being added, it might make sense to refactor the other two and move code from PCRE2 into sljit as well, which would obviously be a bigger undertaking than planned.

@edelsohn
Author

@carenas Let's figure out how to solve this. There are multiple options:

  1. IBM can increase the bounty, within reason, if the project is larger than originally estimated.
  2. Bounties can be split among multiple people in any proportion.
  3. IBM can re-arrange the bounties into multiple parts associated with incremental milestones.

I'm a little concerned with the architecture redesign because IBM would like to have some PCRE JIT available on IBMz sooner rather than later. We can collaborate on a redesign as a second phase.

@edelsohn
Author

I never saw any Jitsi invitation. I can join via Bluejeans.

@carenas
Contributor

carenas commented Nov 15, 2020

I never saw any Jitsi invitation. I can join via Bluejeans.

Jitsi doesn't do invitations AFAIK, but I was going to create the meeting and share the link at that time.
I agree Bluejeans is a nicer solution and can manage the calendar, but I am not sure it will work for Zoltan (hence why I would like to keep it as a backup until we confirm otherwise).

I sent you an invite to the email associated with your GitHub account so you can manage your calendar, anyway.

@zherczeg
Owner

Maybe I can be there a bit longer. Jitsi seems like a good solution.

@edelsohn
Author

And share the link in GitHub comments? This really isn't the right medium for an interactive conversation to schedule a meeting, nor to share links to a meeting.

@zherczeg
Owner

I have tried to do some minor improvements to the code, so I tried to log in to the virtual machine I created. However, I got "An unknown error has occurred. Please try again later." After a few days I still got the same error, so I decided to delete the VM and create a new one (the VM quota is 1). However, when I try to create or upload an SSH key for the new VM, I get the same error. Since this is the only service where we can test the code, we probably need to wait until they fix it.

@edelsohn
Author

edelsohn commented Nov 25, 2020

There is no known, system-wide problem with the VMs at Marist. Have you reported the problem through the support system?

@zherczeg
Owner

zherczeg commented Jan 4, 2021

Is there an easy way to insert a breakpoint instruction on s390? I tried svc and trap, but no success so far. In the worst case I could probably call a function, but that is not the nicest solution.

@edelsohn
Author

edelsohn commented Jan 4, 2021

I'm not an s390x expert. The GDB breakpoint instruction seems to be the bytes 0x0,0x1. I don't know what instruction that corresponds to. Have you tried examining an s390x breakpoint in GDB to see what instruction it inserts?

@zherczeg
Owner

zherczeg commented Jan 5, 2021

It looks like gdb cannot decode it as an instruction with disassemble, but at least it stops, and it can continue the execution, so it is good in practice. Thank you for the help.

I found another strange thing. display/i $pc does not work in JIT code; gdb says "PC not saved". The following session shows that it works in normal functions, but not in JIT code:

165       rc = convert_executable_func.call_executable_func(&arguments);
(gdb) display/i $pc
1: x/i $pc
=> 0x2aa00165b98 <pcre2_jit_match_8+672>:       lg      %r1,232(%r11)
(gdb) si
165       rc = convert_executable_func.call_executable_func(&arguments);
1: x/i $pc
=> 0x2aa00165b9e <pcre2_jit_match_8+678>:       aghik   %r2,%r11,264
(gdb) si
0x000002aa00165ba4      165       rc = convert_executable_func.call_executable_func(&arguments);
1: x/i $pc
=> 0x2aa00165ba4 <pcre2_jit_match_8+684>:       basr    %r14,%r1
(gdb) si
PC not saved
(gdb) si
PC not saved

I can type x/i $pc after si as a workaround, but it takes a bit more time. Is there a gdb command to fix this?

@zherczeg
Owner

zherczeg commented Jan 5, 2021

I have another question, about vector registers. It seems the CPU has 32 vector registers: the first 16 are mapped onto the FPU registers, and the other 16 can be accessed via the RXB field of the instruction. However, the ABI does not mention anything about the vector registers:

https://refspecs.linuxbase.org/ELF/zSeries/lzsabi0_zSeries.html#AEN413

The libc memchr routine uses vector registers above 16 without saving them. Can I assume that registers < 16 have the same saving rules as the FPU registers and that registers >= 16 are all volatile?

@edelsohn
Author

edelsohn commented Jan 5, 2021

0x0,0x1 seems to be the byte sequence that GDB uses for a breakpoint, but I didn't suggest that it was a normal instruction. I suggested that you could examine a breakpoint set by GDB (which should be the same 0x0,0x1), not that GDB would provide a useful disassembly of the sequence.

I don't know about the x/i $pc error. Is it possible that the JIT code is not setting up the registers completely? The JIT code doesn't have to follow the normal calling convention, and maybe it avoids some steps for speed, which would complicate debugging.

I'm not an expert on IBMz and don't know about the vector register ABI. I will try to find someone from IBM to join this issue and answer the questions.

@aarnez

aarnez commented Jan 5, 2021

(gdb) si
PC not saved

This is a bug. It looks like a known problem in GDB that may occur on s390x when the code being stepped has no binary associated with it.

Can I assume that registers < 16 has the same saving rules as the fpu registers and registers >= 16 are all volatile?

All vector registers are volatile, except for the bits that overlap with the nonvolatile FPRs.

@carenas
Contributor

carenas commented Jan 5, 2021

@zherczeg: to work around not being able to use gdb to disassemble the generated code, I integrated sljit with capstone and added a feature to dump the assembly, which you could use from my capstone branch (not yet ready to be merged, and a little rusty, but stable enough to be useful IMHO).
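
For reference, a self-contained sketch of what such a dump can look like with stock capstone (illustrative only, not the code in that branch; dump_s390x is a hypothetical helper):

#include <stdio.h>
#include <inttypes.h>
#include <capstone/capstone.h>

/* Disassemble a buffer of generated s390x (SystemZ) machine code. */
static void dump_s390x(const uint8_t *code, size_t size, uint64_t base)
{
    csh handle;
    cs_insn *insn;

    if (cs_open(CS_ARCH_SYSZ, CS_MODE_BIG_ENDIAN, &handle) != CS_ERR_OK)
        return;

    size_t count = cs_disasm(handle, code, size, base, 0, &insn);
    for (size_t i = 0; i < count; i++)
        printf("0x%" PRIx64 ":\t%s\t%s\n",
               insn[i].address, insn[i].mnemonic, insn[i].op_str);

    cs_free(insn, count);
    cs_close(&handle);
}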

@aarnez

aarnez commented Jan 5, 2021

[...] not being able to use gdb to disassemble the generated code [...]

Really? From reading the comments in this issue, I only understand that GDB doesn't disassemble the breakpoint instruction "0x00 0x01". Everything else should be disassembled as usual, right? And the breakpoint instruction shouldn't normally occur in any code. GDB does insert it when the user says "break ", but then GDB hides the breakpoint instruction in the disassembly and shows whatever was there before. So this should only be an issue if the breakpoint instruction was inserted in some other way, and I wonder why that would be necessary.

@carenas
Contributor

carenas commented Jan 5, 2021

[...] not being able to use gdb to disassemble the generated code [...]

Really? From reading the comments in this issue, I only understand that GDB doesn't disassemble the
breakpoint instruction "0x00 0x01".

Sorry for oversimplifying the problem, but you are correct: the problem is not disassembling the code, but disassembling the code and being able to run into it, which my code changes also don't address (as I found it more useful to avoid setting breakpoints in the generated code anyway).

My proposed changes add two options to the "runTest" program that comes with sljit: one to dump the generated code's assembly (like gdb would do), and one to jump into it to validate that it does the right thing once a test has been created for it. It is in the test where I add the logic that validates the generated code does what is expected, so I then have no need for either gdb or breakpoints.

@edelsohn
Author

edelsohn commented Jan 5, 2021

Carlo originally was looking for some instruction to manually set a breakpoint. When his attempts failed, I suggested the BREAK instruction sequence (0x0,0x1) inserted by GDB.

I think some of the confusion stems from ambiguity about what Carlo really wants to accomplish. What does he want the instruction to do? Does he want an instruction that generates some sort of "trap" operation that non-destructively suspends the program? Does he want an instruction that is visible to a debugger (as opposed to the magic 0x0,0x1 sequence that GDB knows to ignore)?

I believe that the solutions are operating correctly but not what Carlo expected and we need more clarification about what Carlo really is trying to accomplish.

@carenas
Contributor

carenas commented Jan 5, 2021

Carlo originally was looking for some instruction to manually set a breakpoint. When his attempts failed,
I suggested the BREAK instruction sequence (0x0,0x1) inserted by GDB.

This was work Zoltan was doing, though; but yes, we both had the same need, which I will try to explain below, and it is not even specific to s390x. It is also why we mentioned QEMU support earlier as something that would be nice to have (since with an emulator you have full control over how the CPU state changes from instruction to instruction).

sljit creates machine code dynamically, and we write the generator based on our understanding of the documentation for each instruction.

More often than not (at least for me) I get the opcodes wrong, or misunderstand the side effects of the instructions, and end up running something that could trap badly (if lucky).

At that point the only way to know what went wrong is to isolate the generated code and step through it, looking at the effects on the CPU and memory; for that, adding a breakpoint at a specific memory address (when loaded into the PC) helps get to the interesting part of the opcodes faster.

The alternative I was using (and proposed in my dev branch) sidesteps the issue by letting me programmatically tell the sljit program to print the generated assembly (which capstone allows, and which helpfully spots invalid opcodes or even invalid sequences without having to crash), and then, once I feel confident enough, execute the function and make sure its effects are what was expected (based on the documentation), which is what the test suite does for every single instruction on every supported CPU.

In this last step, I might add some asserts and iterate with some throwaway variations of the test case, to confirm all documented side effects are really observed (to guard against any documentation bug) and to make sure the generator fails safely if pushed hard against the specific instruction constraints (e.g. when some operand combinations would result in invalid instructions).

I believe that the solutions are operating correctly but not what Carlo expected and we need more
clarification about what Carlo really is trying to accomplish.

I'll let Zoltan explain here, but from his comments I suspect he is trying to add the missing FPU instructions that are needed to generate vector operations.

@zherczeg
Owner

zherczeg commented Jan 6, 2021

The 0x0, 0x1 works nicely from my side. If you run the code without gdb, it stops with an illegal instruction, but that does not matter for me. In gdb, it stops after the 0x0, 0x1, and I can disassemble the rest of the code (from $pc). And the VM is more responsive than it was last year, so I can do some meaningful development this week.
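
In case it helps others, the trick can be wrapped in a tiny helper in a C test harness (a sketch assuming GNU inline assembly; s390x_debug_break is a hypothetical name). Calling it just before jumping into the generated code gives gdb a known stopping point; outside gdb it raises the illegal-instruction signal mentioned above.

/* Emit gdb's 2-byte s390x breakpoint sequence (0x00 0x01) inline. */
static inline void s390x_debug_break(void)
{
#if defined(__s390x__)
    __asm__ volatile(".short 0x0001");
#endif
}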

@zherczeg
Owner

zherczeg commented Jan 6, 2021

PCRE2 has 3 SIMD-accelerated functions, and I have landed code for the simplest one:
https://lists.exim.org/lurker/message/20210106.075205.4d7fc4b1.hu.html

The good news is it seems s390x simd is quite good for text processing.

The bad news is that I had to add some TODOs because the s390x port emulates CPU status flags. I don't know why since the CPU has hardware support for flags. I also noticed that many instructions are generated for simple computations which needs to be improved as well.

@aarnez

aarnez commented Jan 7, 2021

The bad news is that I had to add some TODOs because the s390x port emulates CPU status flags. I don't know why since the CPU has hardware support for flags. I also noticed that many instructions are generated for simple computations which needs to be improved as well.

Right, I wondered about that too, and asked the original author, Mike Monday, about it. He answered:

SLJIT supports two flag registers: one representing whether the result of
an operation is zero or not and the other representing a different
condition: for example, whether the left operand was less than the right
operand or the operation overflowed. The way SLJIT is architected we need
to represent both flag registers simultaneously which IBM Z can't natively
do since the ISA only has a single 2-bit condition code register (e.g. the
result of an add logical may be 0 and it may have overflowed - the
condition code can only tell you one of those facts despite both possibly
being true).

I don't understand that explanation, though, because z/Architecture instructions usually do indicate a zero result in the condition code (CC), if applicable. Maybe SLJIT has a special way of dealing with these flags that makes it difficult to map them to the CC, but I'd be very surprised if that couldn't be handled somehow. At the time when I asked Mike about this, he was already off the project, and he considered the port to be in a prototype state, so we didn't follow up on this issue.

@zherczeg
Owner

zherczeg commented Jan 7, 2021

I identified the following problems:

  • ADD / SUBTRACT: they can indicate that the result is == 0, < 0 or > 0, but only if no overflow occurs. Otherwise this information is not available; in other words, only overflows can be detected reliably. However, the LOGICAL variants can at least be used for detecting a zero result (normal and logical addition are the same operation except for the flags they set). Subtracting something and checking for a zero result is a frequent operation.
  • Some instructions set value 0 when the result is zero; others set both 0 and 2. I don't see any other variants in Appendix C of the ISA document. Currently the conditional opcodes are not aware of which variant needs to be used. This is probably one reason for the emulation.
  • It seems to me that the COMPARE and ADD / SUBTRACT LOGICAL instructions use the flags in a different way. On other systems COMPARE and SUBTRACT do the same thing, except COMPARE does not store the result. By the way, SUBTRACT LOGICAL can tell if the result is == 0 (equal) or < 0 (borrow). Is it possible to tell if the result is > 0? The source registers cannot be swapped, similar to ADD LOGICAL.
  • It is not possible to prevent any instruction from setting the status flags.

Overall it seems to me that SUBTRACT instructions cannot be used to set status flags in general, since COMPARE instruction produces different flags. ADD / SUBTRACT LOGICAL can be used if the zero result needs to be checked, but they might set two different values. Setting some combinations (e.g. overflow + zero) might need some emulation, but they are probably never needed in practice.

As for the operation, a COMPARE is likely needed after an operation when non-zero status flags are requested. It cannot be done before the operation, since the operation may overwrite the flags. Hence, if the result register is the same as one of the source registers, its original value needs to be preserved for the comparison. As for shifting, an OR operation with the same register operands can set the zero-result flag.

Whatever flag model we choose, we need to check / update manually every instruction which sets condition codes, one by one. There are a lot of them, so this looks like a lot of work.

@aarnez

aarnez commented Jan 7, 2021

  • ADD / SUBTRACT: they can indicate that the result is == 0, < 0 or > 0, but only if no overflow occurs. Otherwise this information is not available.

Right. However, in C semantics a signed integer overflow would be an undefined condition. I don't know how SLJIT expects those to be handled.

Also, if necessary, or until a better way is found, you can always add a "load and test" instruction such as LTR or LTGR to compare a signed integer against zero. (Source and target registers can be the same here.)

  • It seems to me that the COMPARE and ADD / SUBTRACT LOGICAL instructions use the flags in a different way. On other systems COMPARE and SUBTRACT do the same thing, except COMPARE does not store the result.

Well, COMPARE indicates the same flags as SUBTRACT, except when SUBTRACT would overflow. Basically, COMPARE sets the flags according to the full arithmetic result. Similarly, in C semantics two signed integers can always be compared, but not always be subtracted.

By the way, SUBTRACT LOGICAL can tell if the result is == 0 (equal) or < 0 (borrow). Is it possible to tell if the result is > 0?

All the "LOGICAL" instructions operate on unsigned integers, so a "LOGICAL" result can never be smaller than zero.

  • It is not possible to prevent any instruction from setting the status flags.

Not in general, but there are cases where certain instructions can be used in place of others to prevent the condition code from being set. An example is "rotate then and/or/xor/insert selected bits", where there are variants like RISBGN.

Overall it seems to me that SUBTRACT instructions cannot be used to set status flags in general, since COMPARE instruction produces different flags.

Not sure. It really depends on how SLJIT expects signed integer overflows to be handled.

@zherczeg
Owner

zherczeg commented Jan 7, 2021

Let me explain how status flags work in sljit. Sljit only supports two status flag bits: one is "result is zero", the other depends on the instruction and is customizable. For example, it can be a signed greater flag, an unsigned less-than-or-equal flag, or carry for a subtraction operation. Setting a status flag must be explicitly requested, otherwise the flag is undefined. For example, if the carry flag is requested for an addition, and the next jump depends on the zero or signed-less flag, the behavior is undefined (in debug mode this triggers an assertion in the compiler). So valid sljit code can only use the carry flag after the previously mentioned addition. Unless it is explicitly stated that an instruction does not modify flags (e.g. mov, call, jump instructions), the flags are undefined. Hence it is valid to do some mov operations before the carry flag is used, but not another addition.

Sljit does not have an "undefined on overflow" behavior: if the zero flag is requested, it must be set. Obviously emulation might be needed in some cases, but currently 2+ instructions are used whenever the zero flag is used, and that is a lot. The ideal is zero extra instructions in the commonly used cases.
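
To make that contract concrete, a minimal sketch using the public sljit API (macro and function names as found in sljitLir.h; treat the exact spellings as assumptions): the zero flag may only be consumed because it was explicitly requested on the arithmetic operation.

#include "sljitLir.h"

/* Sketch: R0 = R0 - 1, requesting only the Z flag, then branch if zero. */
static struct sljit_jump *dec_and_jump_if_zero(struct sljit_compiler *c)
{
    sljit_emit_op2(c, SLJIT_SUB | SLJIT_SET_Z,
                   SLJIT_R0, 0, SLJIT_R0, 0, SLJIT_IMM, 1);
    /* Consuming any other flag here (e.g. a signed-less jump) would be
       undefined and triggers an assertion in debug builds. */
    return sljit_emit_jump(c, SLJIT_ZERO);
}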

I may misunderstand something, but COMPARE LOGICAL sets 0 for equal, 1 for less, and 2 for greater, and SUBTRACT LOGICAL sets 1 for less, 2 for equal, and 3 for greater. Apart from less, these do not match. We can use the latter all the time, and discard the result when compare is needed, but compare might have instruction forms which are useful.

@carenas
Contributor

carenas commented Jan 8, 2021

  • ADD / SUBTRACT: they can indicate that the result is == 0, < 0 or > 0, but only if no overflow occurs. Otherwise this information is not available.

Right. However, in C semantics a signed integer overflow would be an undefined condition. I don't know how SLJIT expects those to be handled.

sljit assumes that negative numbers are represented in two's-complement format and therefore mostly ignores overflows (even the ones that could be considered undefined per C).

In practical terms this is not a problem, because all supported CPUs use that format, and most modern CPUs that might be added will also support it, but it is certainly something to consider if we ever have to generate code for a CPU that uses ones' complement or something else.

@aarnez

aarnez commented Jan 8, 2021

Let me explain how status flags works in sljit. [...]

Thanks, that explanation helps.

Sljit does not have an "undefined on overflow" behavior: if zero flag is requested to be set, it must be set.

Well, the only case where the result of a signed addition can be zero and an overflow occurs is when calculating INT_MIN + INT_MIN. So SLJIT expects the zero flag to be set in this case, right? (I find it a bit unfortunate that we need to introduce special handling just to cover this case.)
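
A quick way to see that corner case in plain C (using the GCC/Clang overflow builtin; this is just an illustration, not sljit code):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int r;
    int overflowed = __builtin_add_overflow(INT_MIN, INT_MIN, &r);
    /* Prints "result=0 overflow=1": the wrapped result is zero and the
       addition overflowed at the same time. */
    printf("result=%d overflow=%d\n", r, overflowed);
    return 0;
}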

I may misunderstand something, but COMPARE LOGICAL sets 0 for equal, 1 for less, and 2 for greater, and SUBTRACT LOGICAL sets 1 for less, 2 for equal, and 3 for greater.

Your understanding is correct. COMPARE LOGICAL sets the CC as usual, while SUBTRACT LOGICAL is designed to prepare the borrow for a multi-precision integer subtraction.

Perhaps we could add another s390-specific field to the sljit_compiler structure that indicates the "CC mode" of the last operation performed, such as "carry/borrow mode" versus "normal mode"? Then we could make the result of get_cc dependent on that. Just a thought.

@edelsohn
Author

Zoltan reported the following progress:

I had some free time last week and I have added SIMD support for s390x. The code has landed and the next step could be measuring the improvement, but virtual machines are not suitable for this. The SIMD support is currently enabled by default, although we might need to support a compile-time or runtime check for availability.

I have noticed that the code generator needs improvement. For example, adding 16 to a register (r3 = r4 + 16) is translated to four instructions:

lghi %r1,16
lgr %r0,%r3
algr %r0,%r1
lgr %r4,%r0

This is less efficient than other ports, which can do this with one machine instruction. The code generator is probably less complex than the other ports, which try to exploit the strengths of their corresponding instruction sets. I suspect improving this is not a trivial task (maybe we need to completely redesign it), but it should be done, since this is a serious disadvantage for the s390x port.

@aarnez Can you help measure the performance when you have a moment? And maybe you have a suggestion for Zoltan's code generation observation.
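
On the compile-time / runtime availability check mentioned in the report above, one common approach on Linux/s390x is to test the vector-facility bit in the auxiliary vector. A sketch (HWCAP_S390_VXRS is the kernel's name and value for that bit, assumed here; older glibc headers may not define it):

#include <stdio.h>
#include <sys/auxv.h>

#ifndef HWCAP_S390_VXRS
#define HWCAP_S390_VXRS 2048   /* vector facility bit, per kernel headers (assumed) */
#endif

int main(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);
    printf("vector facility: %s\n",
           (hwcap & HWCAP_S390_VXRS) ? "available" : "not available");
    return 0;
}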

@aarnez

aarnez commented Jan 13, 2021 via email

@zherczeg
Owner

#106 should be the next step in this work. The patch focuses only on flags; optimal instruction selection is still far away. I cannot measure the performance on a virtual machine.
