Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

-march=...Xcustom no longer working? #190

Closed
seldridge opened this issue Nov 2, 2016 · 34 comments
Closed

-march=...Xcustom no longer working? #190

seldridge opened this issue Nov 2, 2016 · 34 comments

Comments

@seldridge
Copy link

Somewhere between 8000750 (good) and 55f8308 (bad), -march seems to have stopped respecting Xcustom as an option.

For example, when compiling the following program in test.c, -march doesn't seem to do anything.

int main() {
  asm volatile ("custom0 0, 0, 0, 0");
}

Compiling with and without the option:

> riscv64-unknown-elf-gcc test.c
test.c: Assembler messages:
test.c:2: Error: unrecognized opcode `custom0 0,0,0,0'
> riscv64-unknown-elf-gcc -march=RV64IMAFDXcustom test.c
test.c: Assembler messages:
test.c:2: Error: unrecognized opcode `custom0 0,0,0,0'

Am I doing something wrong or did this support get inadvertently dropped during the -march changes?

@colinschmidt
Copy link
Contributor

I think @palmer-dabbelt talked to me about this a little while ago but I don't remember what the result was.

@palmer-dabbelt
Copy link
Contributor

We decided to not upstream the Xcustom support because the RoCC ISA doesn't have a proper document describing the ISA. This stuff was kind of a CS250 hack anyway, and we're trying to keep those out of the official releases.

The idea was that we could support this with CPP macros instead of proper assembler support, since those don't have to go upstream (and then be supported forever). The general idea is that you'd emit a ".word" instead of an instruction. I believe something like

#define CUSTOM0(rs1, rs2) .word CUSTOM0_BASE | (rs1 << CUSTOM0_RS1_OFFSET) | (rs2 << CUSTOM0_RS2_OFFSET)

should do it. We'll have to resurrect RoCC support next time a class is taught, but it probably won't happen before that: Hwacha is the only accelerator we really use, and that has an assembler fork so there's proper support for the instructions that use the customN opcodes.

If this approach doesn't work then I'd be amenable to bringing back some amount of RoCC support, but it'd have to be slightly cleaner. If you do make it work, I'd love to get it upstream somewhere.

Sorry we broke your stuff!

@seldridge
Copy link
Author

Ah, that makes total sense. Thanks for the informative answer, Palmer. Upstreaming undocumented stuff seems ill advised (though the fact that you all are getting this upstreamed is super exciting).

I'll give this a shot and update all public code that I have documenting this (including basic example code for exercising the rocket-chip RoCC examples). This should be fine as I'm already using macros, it's just no longer mapping to some asm volatile XXX.

FWIW, if there's any way that you all can keep me in the loop of what's going on with extension / RoCC support I would really appreciate it. As you guys know, I've been managing my own stuff and trying to keep it semi-synced with rocket-chip master (possibly ill-advised, but it's not too bad once I'm running my own regressions). However, I'm now the local go-to guy for all "my rocket / accelerator doesn't work / is broken" queries. Having some additional information straight from you all would dramatically help.

@palmer-dabbelt
Copy link
Contributor

I actually had no idea anyone outside of Berkeley was using RoCC for anything. That's why I just talked to Colin about this, because he was the last 250 TA. I'll try and remember you're using it in the future, but now that we're upstream any assembler changes will have to go through the binutils mailing list. If you're doing odd RISC-V assembly things it might be worth subscribing (they have digests).

Sorry, again, for the trouble!

@seldridge
Copy link
Author

I'm continually evangelizing, so there may be more in the future... Also, I do know that Michael Taylor (calling @anujnr) is doing something with it though he's been using Chisel wrappers to jump to his preferred SystemVerilog.

Thanks for the pointer.

No problem at all. You guys are cranking out good stuff and it's fun to be in that ecosystem.

@seldridge
Copy link
Author

seldridge commented Nov 3, 2016

Following your approach, @palmer-dabbelt, the following gets the job done:

#define STR1(x) #x
#define STR(x) STR1(x)
#define EXTRACT(a, size, offset) (((~(~0 << size) << offset) & a) >> offset)

#define CUSTOMX_OPCODE(x) CUSTOM_##x
#define CUSTOM_0 0b0001011
#define CUSTOM_1 0b0101011
#define CUSTOM_2 0b1011011
#define CUSTOM_3 0b1111011

#define CUSTOMX(X, rd, rs1, rs2, funct) \
  CUSTOMX_OPCODE(X)                   | \
  (rd                   << (7))       | \
  (0x7                  << (7+5))     | \
  (rs1                  << (7+5+3))   | \
  (rs2                  << (7+5+3+5)) | \
  (EXTRACT(funct, 7, 0) << (7+5+3+5+5))

#define CUSTOMX_R_R_R(X, rd, rs1, rs2, funct)           \
  asm ("mv a4, %[_rs1]\n\t"                             \
       "mv a5, %[_rs2]\n\t"                             \
       ".word "STR(CUSTOMX(X, 15, 14, 15, funct))"\n\t" \
       "mv %[_rd], a5"                                  \
       : [_rd] "=r" (rd)                                \
       : [_rs1] "r" (rs1), [_rs2] "r" (rs2)             \
       : "a4", "a5");

This introduces three additional moves vs. the old rocc-support toolchain, but it works.

Edit: Fixed clobbering issue with a4 and a5.

@palmer-dabbelt
Copy link
Contributor

Great! Thanks for posting this, we'll need it as well. If the moves are a performance problem then I can try to figure something out (maybe adding something to either the assembler or compiler), I don't want to have a performance regression because of a cleanup.

@seldridge
Copy link
Author

No problem.
Two things:

  • STR does conflict with defines in the proxy kernel (simple to fix if needed)
  • The original, unedited code above was nicely clobbering a4 and a5. Fixed now.

@cbatten
Copy link

cbatten commented Dec 1, 2016

Is there some way to do this without the overhead of the extra moves? We are using the ROCC interface extensively and the overhead of those extra moves will be prohibitive. We want the same affect as our current single-instruction inline assembly where the compiler can do the register allocation and we don't need to do any extra moves.

@aswaterman
Copy link
Collaborator

@cbatten You should be able to use the register keyword with an asm constraint, so the compiler will only emit the moves if necessary:

int foo(int w, int x)
{
  register int y asm ("a4") = w;
  register int z asm ("a5") = x;
  register int a asm ("a6");

  asm ("# whatever" : "=r" (a) : "r" (y), "r" (z));

  return a;
}

A better solution would be to allow assembler plugins so that you can add custom instructions without modifying the binutils source code, but I don't know to do this in a way that the binutils maintainers would allow. @palmer-dabbelt any ideas?

@cbatten
Copy link

cbatten commented Dec 1, 2016

This needs to be inlined to avoid any performance overhead so I am not sure the above approach which basically still does explicit register allocation will work well. The alternative is we will have to fork the upstream repo and add the custom0 instructions back into binutils -- which is kind of annoying ... I mean we are already forking the RISC-V compiler flow for research, but it would be nice if the custom0 stuff worked out of the box since we also use that in courses so that students can download the upstream RISC-V cross-compiler (instead of our fork) and use it for projects.

I guess for teaching we can use CSRs to create a register-mapped interface for our accelerators. That is actually what we were doing before, but I was thinking of moving to ROCC. Sticking with a register-mapped interface might be simpler now though ...

@aswaterman
Copy link
Collaborator

I just meant to show an example... I agree you'd need to inline this or put it in a macro.

We took the customX opcodes out because of a policy decision not to have any nonstandard extensions supported by the upstream tools. I think an assembler plugin, if possible, is the right way to add them back (and is potentially better than the customX approach, because it is not limited to RoCC and provides more expressive [dis]assembly).

@cbatten
Copy link

cbatten commented Dec 1, 2016

Makes sense -- although binutils doesn't support assembler plugins. I think we will just end up using CSRs.

(as an aside: looking forward to receiving my founder's edition HiFive1 board :)

@aswaterman
Copy link
Collaborator

:)

@seldridge
Copy link
Author

Merging @aswaterman's suggestion does appear to produce reasonable assembly output with -O3.

The above macro then becomes (source):

#define STR1(x) #x
#define STR(x) STR1(x)
#define EXTRACT(a, size, offset) (((~(~0 << size) << offset) & a) >> offset)

#define CUSTOMX_OPCODE(x) CUSTOM_ ## x
#define CUSTOM_0 0b0001011
#define CUSTOM_1 0b0101011
#define CUSTOM_2 0b1011011
#define CUSTOM_3 0b1111011

#define CUSTOMX(X, rd, rs1, rs2, funct)         \
  CUSTOMX_OPCODE(X)                   |         \
  (rd                   << (7))       |         \
  (0x7                  << (7+5))     |         \
  (rs1                  << (7+5+3))   |         \
  (rs2                  << (7+5+3+5)) |         \
  (EXTRACT(funct, 7, 0) << (7+5+3+5+5))

// Standard macro that passes rd, rs1, and rs2 via registers
#define ROCC_INSTRUCTION(X, rd, rs1, rs2, funct)                \
  ROCC_INSTRUCTION_R_R_R(X, rd, rs1, rs2, funct, 10, 11, 12)

// rd, rs1, and rs2 are data
// rd_n, rs_1, and rs2_n are the register numbers to use
#define ROCC_INSTRUCTION_R_R_R(X, rd, rs1, rs2, funct, rd_n, rs1_n, rs2_n) { \
    register uint64_t rd_  asm ("x" # rd_n);                            \
    register uint64_t rs1_ asm ("x" # rs1_n) = (uint64_t) rs1;          \
    register uint64_t rs2_ asm ("x" # rs2_n) = (uint64_t) rs2;          \
    asm volatile (                                                      \
        ".word " STR(CUSTOMX(X, rd_n, rs1_n, rs2_n, funct)) "\n\t"      \
        : "=r" (rd_)                                                    \
        : [_rs1] "r" (rs1_), [_rs2] "r" (rs2_));                        \
    rd = rd_;                                                           \
  }

From that, an example load into the accumulator example is then (source):

#define k_DO_WRITE 0
#define XCUSTOM_ACC 0
#define doWrite(y, rocc_rd, data)                                       \
  ROCC_INSTRUCTION(XCUSTOM_ACC, y, data, rocc_rd, k_DO_WRITE);

Compiling the following program with -O3 (source with printfs removed):

int main() {
  uint64_t data [] = {0xdead, 0xbeef, 0x0bad, 0xf00d}, y;

  uint16_t addr = 1;
  doWrite(y, addr, data[0]);

  doRead(y, addr);
  assert(y == data[0]);

  uint64_t data_accum = -data[0] + data[1];
  doAccum(y, addr, data_accum);
  assert(y == data[0]);

  doRead(y, addr);
  assert(y == data[1]);

  uint64_t data_addr;
  doTranslate(data_addr, &data[2]);
  doLoad(y, addr, data_addr);
  assert(y == data[1]);

  doRead(y, addr);
  assert(y == data[2]);
  return 0;
}

Gets me this assembly which seems to be great:

   15290:	00f13023          	sd	a5,0(sp)
   15294:	00e13423          	sd	a4,8(sp)
   15298:	00d13823          	sd	a3,16(sp)
   1529c:	00078593          	mv	a1,a5
   152a0:	00100613          	li	a2,1
   152a4:	00c5f50b          	0xc5f50b
   152a8:	00000593          	li	a1,0
   152ac:	02c5f50b          	0x2c5f50b
   152b0:	04f51e63          	bne	a0,a5,1530c <main+0xa8>
   152b4:	00050813          	mv	a6,a0
   152b8:	40a705b3          	sub	a1,a4,a0
   152bc:	06c5f50b          	0x6c5f50b
   152c0:	0b051e63          	bne	a0,a6,1537c <main+0x118>
   152c4:	00000593          	li	a1,0
   152c8:	02c5f50b          	0x2c5f50b
   152cc:	00050793          	mv	a5,a0
   152d0:	08e51863          	bne	a0,a4,15360 <main+0xfc>
   152d4:	01010593          	addi	a1,sp,16
   152d8:	00000613          	li	a2,0
   152dc:	00c5f52b          	0xc5f52b
   152e0:	00100613          	li	a2,1
   152e4:	00050593          	mv	a1,a0
   152e8:	04c5f50b          	0x4c5f50b
   152ec:	04f51c63          	bne	a0,a5,15344 <main+0xe0>
   152f0:	00000593          	li	a1,0
   152f4:	02c5f50b          	0x2c5f50b
   152f8:	02d51863          	bne	a0,a3,15328 <main+0xc4>

Note, the macro above is going to always set the xd bit so it expects a response. This may cause a slight slowdown for instructions whose outputs are never used and xd should be deasserted. Also, using the _R_R_R variant and trying to reuse registers may do odd things. I didn't test this.

@cbatten
Copy link

cbatten commented Dec 1, 2016

Nothing like trying to write an assembler using pre-processor macros ...

@sorear
Copy link
Contributor

sorear commented Dec 1, 2016

The other solution is to write code pretending you have an extended assembler, then write a small program to process the assembly between gcc -S and as. Glasgow Haskell did this for many years to work around the lack of tail-call optimization in GCC at the time and to implement an ad-hoc optimization for single-entry vtables (very common in Haskell), called the "evil mangler", nobody liked it but it worked well most of the time.

@vanjoe
Copy link
Contributor

vanjoe commented Dec 1, 2016 via email

@aswaterman
Copy link
Collaborator

aswaterman commented Dec 1, 2016 via email

@arunthomas
Copy link
Contributor

I'm sure others will want toolchain support for their non-standard extensions. What's the recommended way to accomplish this? Inline assembler macros as in @seldridge's comment? Or would a GCC/LLVM plugin make sense?

It would be great if we could provide an example of how to make use of non-standard extensions. Maybe we should package up @seldridge's header file into a repo -- along with some documentation on how to add/use extensions?

@cliffordwolf
Copy link

I am using assembler macros like the following to wrap the custom0 instruction:

.macro timer rd rs
custom0 \rd,\rs,0,5
.endm

This allows me to write in my assembler code statements like this:

timer zero, x1

In order to convert those macros to use .word directives I'd need a mechanism to convert identifiers like zero and x1 to integer constants holding the register index. Is there such a mechanism in gnu assembler?

@colinschmidt
Copy link
Contributor

I think this is where LLVM provides a simpler solution, since you can contain the custom instruction definitions in their own standalone file. In addition, parts of LLVM already load shared libraries for out-of-tree passes so they would seem more receptive to having a true plugin than the gnu people.
That being said I've always used implemented my own ISA extensions directly into the gnu assembler, although I have added codegen for custom extensions to LLVM.

@seldridge
Copy link
Author

The above macro relies on C constructs, so when writing assembly I've been using the additional assembly-only macro:

#define ROCC_INSTRUCTION_RAW_R_R_R(x, rd, rs1, rs2, funct)      \
  .word CUSTOMX(x, ## rd, ## rs1, ## rs2, funct)

With the example you give above, something like the following should work:

.macro timer rd rs
CUSTOMX(0, rd, rs, 0, 5)
.endm

However, that's not giving you the "nice" register names, i.e., rd should be 0 and rs should be 5. Also, just watch out when using this as all the xd bits are set and passing in 0/zero into rd may cause a hang if your extension doesn't generate a response.

@arunthomas, I'm fine with going that route, but this is all a giant kludge.

@colinschmidt, yes, LLVM would clearly be the cleanest way forward.

@sorear
Copy link
Contributor

sorear commented Dec 8, 2016

Given that RISC-V has a limited number of 32-bit instruction formats, what if we added psuedo-instructions for each format? In other words these would be the same:

i_type 3, 4, a0, a7, 128
lbu a0, 128(a7)

@aswaterman
Copy link
Collaborator

That's good to know @colinschmidt. It may make sense for LLVM to be the default answer for assembling and compiling to non-standard extensions, and then the GCC flow can abdicate responsibility for that.

@arunthomas
Copy link
Contributor

arunthomas commented Dec 8, 2016 via email

@sorear
Copy link
Contributor

sorear commented Dec 8, 2016

  1. A working clang is probably quite a few months out.
  2. Requiring a clang plugin for new instructions seems to unnecessarily increase the barrier to entry.

So I still think that having register and immediate-aware xx_type pseudos would be useful to the community. Would something like that have a chance of being accepted if done?

@aswaterman
Copy link
Collaborator

I'm mostly concerned about assembling at this point. Presumably people can use GCC for codegen and assemble with LLVM-AS (once our ABI issues are all sorted out). I get the impression from @colinschmidt that it isn't an especially onerous task to extend the assembler, and we can always provide a template for people adding instructions using the existing standard formats.

That said, I'm not outright opposed to providing a way to assemble arbitrary R/I-type instructions (I guess you provide the opcode, funct3, and funct7 as additional arguments). This is more generic than Xcustom, and the burden on the assembler is relatively low.

@cliffordwolf
Copy link

That said, I'm not outright opposed to providing a way to assemble arbitrary R/I-type instructions (I guess you provide the opcode, funct3, and funct7 as additional arguments).

@aswaterman
That would be awesome. Exactly what I need.

@palmer-dabbelt
Copy link
Contributor

I like the generic instruction approach. Can someone add an issue to riscv-binutils-gdb that asks for this support? We're probably not going to be able to get to it right away because we're trying to get everything cleaned up for GCC upstreaming, which means we need to get the ABI stuff sorted out. I think the GCC 7 window closes in about a month, so we're going to have to focus on stability instead of features for a bit.

@cbatten
Copy link

cbatten commented Dec 8, 2016

+1 for the generic instruction approach where binutils supports some generic R/I-type instructions. Seems super useful for rapid prototyping new instructions where you just want to play around with a new instruction before going through the hassle of adding it to the assembler directly -- plus it seems generic enough that it doesn't fall into the "customX is too custom to support upstream".

@Yangff
Copy link

Yangff commented Nov 4, 2019

macro not work for llvm:

--- machdep.o ---
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/machdep.c:722:2: error: expected relocatable expression
./machine/asm.h:115:56: note: expanded from macro 'write_csrid'
#define write_csrid(regid, v) ({ uint64_t __tmp = (v); csrrw_proc(regid, 31, __tmp, 0, __tmp); __tmp;})
                                                       ^
./machine/asm.h:108:21: note: expanded from macro 'csrrw_proc'
  __asm__ volatile (".word " STR(make_ins_csrrw(regid, rs1_n, rd_n)) "\n\t"          \
                    ^
<inline asm>:1:8: note: instantiated into assembly here
        .word (MATCH_CSRRW) | ((0x181) << 20) | ((31) << 15) | ((0) << 7)
              ^
1 warning and 1 error generated.

Also not work for GCC assembler

/usr/local/bin/riscv64-unknown-freebsd12.0-gcc -c -x assembler-with-cpp -DLOCORE -O -pipe  -g -nostdinc  -I. -I/usr/home/yangff/sources/freebsd/sys -I/usr/home/yangff/sources/freebsd/sys/contrib/ck/include -I/usr/home/yangff/sources/freebsd/sys/contrib/libfdt -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h    -fno-omit-frame-pointer -fno-optimize-sibling-calls -MD  -MF.depend.swtch.o -MTswtch.o -fdebug-prefix-map=./machine=/usr/home/yangff/sources/freebsd/sys/riscv/include -march=rv64imafdc -mabi=lp64 -ffreestanding -fwrapv -fstack-protector -gdwarf-2 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef -Wno-pointer-sign -Wno-format -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas -Wno-error-tautological-compare -Wno-error-empty-body -Wno-error-parentheses-equality -Wno-error-unused-function -Wno-error-pointer-sign -Wno-error-shift-negative-value -Wno-address-of-packed-member -Wno-format-zero-length    -std=iso9899:1999   /usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S: Assembler messages:
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:240: Error: invalid operands (*UND* and *ABS* sections) for `<<'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:240: Error: invalid operands (*UND* and *ABS* sections) for `|'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:240: Error: invalid operands (*UND* and *ABS* sections) for `<<'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:240: Error: invalid operands (*UND* and *ABS* sections) for `<<'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:366: Error: invalid operands (*UND* and *ABS* sections) for `<<'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:366: Error: invalid operands (*UND* and *ABS* sections) for `|'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:366: Error: invalid operands (*UND* and *ABS* sections) for `<<'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:366: Error: invalid operands (*UND* and *ABS* sections) for `<<'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:440: Error: invalid operands (*UND* and *ABS* sections) for `<<'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:440: Error: invalid operands (*UND* and *ABS* sections) for `|'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:440: Error: invalid operands (*UND* and *ABS* sections) for `<<'
/usr/home/yangff/sources/freebsd/sys/riscv/riscv/swtch.S:440: Error: invalid operands (*UND* and *ABS* sections) for `<<'

I'm using both GCC assembler and clang because 1. I want to use clang and 2. clang's assembler has bug with medium memory model. So I need a way to use customized instruction and csr for both gcc and clang... and clearly it works for none of them... wondering what's going wrong.

The macro:

#define make_ins_csrrw(reg, rs1, rd) (MATCH_CSRRW) | ((reg) << 20) | ((rs1) << 15) | ((rd) << 7)
#ifndef STR
#define STR(x) STR1(x)
#define STR1(x) #x
#endif
#define csrrw_proc(regid, rs1_n, rs1, rd_n, rd)  ({  \
  register uint64_t rd_  __asm__ ("x" # rd_n);            \
  register uint64_t rs __asm__ ("x" # rs1_n) = rs1;\
  __asm__ volatile (".word " STR(make_ins_csrrw(regid, rs1_n, rd_n)) "\n\t"          \
  : "=r" (rd_)                                        \
  : [_rs1] "r" (rs1));                                \
  rd = rd_;                                           \
  })

#define read_csrid(regid) ({ uint64_t __tmp = 0; csrrw_proc(regid, 0, 0, 31, __tmp); __tmp;})
#define write_csrid(regid, v) ({ uint64_t __tmp = (v); csrrw_proc(regid, 31, __tmp, 0, __tmp); __tmp;})

ASM

.macro csrrw_id rs rd 
.word make_ins_csrrw(CSR_SSTACKID, rs, rd)
.endm

And inline C

write_csrid(CSR_SSTACKID, thread0.td_pcb->pcb_stackkernel);

The only thing works is .word 0xxxxxxx to emit instruction...

@jim-wilson
Copy link
Collaborator

You need to provide a definition of MATCH_CSRRW. I don't know why you expect the compiler to define that.

The assembler provides the .insn directive that can be used to create instructions. See the gas docs.
https://sourceware.org/binutils/docs-2.33.1/as/RISC_002dV_002dDirectives.html#RISC_002dV_002dDirectives
.insn is described near the end of that page, and the next section describes the values that can be used with .insn. There is also a testcase for it in the gas testsuite if you want to see practical examples.

@Yangff
Copy link

Yangff commented Nov 4, 2019

I thought it's defined because I also included encoding.h.. but it seems not.. thank you. I will also try
.insn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests