
Rounding mode mismatch with the hardware #111

Closed
Kuree opened this issue Jun 5, 2019 · 38 comments

Kuree (Member) commented Jun 5, 2019

Two problems:

  1. The rounding mode (round to nearest even) does not match between the functional model and the RTL. See: https://buildkite.com/stanford-aha/lassen/builds/109#712c5179-b555-4ddb-8fd3-6da54ebba32b
    The functional model uses mpfr, which I believe should be correct.

  2. When using -v with pytest, fault doesn't catch the error, yet running the test individually does. Something is wrong with either the test bench setup or fault. See the successful build: https://buildkite.com/stanford-aha/lassen/builds/108#ef4d8b94-f985-4e9e-9873-cdf64ba87be3
    @leonardt can you take a closer look?

leonardt (Contributor) commented Jun 5, 2019

For 1., the output shows:

Failed on action=5 checking port O0. Expected 4117, got 4116

If I'm not mistaken, that means the functional model expects 4117 and the RTL got 4116; if we're rounding to the nearest even, then 4117 doesn't seem right?

leonardt (Contributor) commented Jun 5, 2019

For 2., I suspect this is an issue with the TB rather than fault, because fault correctly identifies the error in the single-run case; perhaps it's an issue with how fault generates the TBs for multiple tests. Since all the tests run in the same test_dir, maybe ncsim is not recompiling something when it should (perhaps it's using an old test bench rather than the newly generated one for the next op). I'm going to see if running each test in its own tempdir catches the failure, which would confirm the above.

Kuree (Member, Author) commented Jun 5, 2019

My educated guess for 1. is that the rounding depends on how many extra precision bits you have. In the test, we have

3 * 3.140625 = 0x4116c000

If we have more than 1 bit available for rounding, then clearly it should round up, hence 0x4117. If we only have 1 bit for rounding, then we shouldn't round up, since 0x4117 is odd.
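This argument can be checked with a small sketch. The snippet below is my own illustration of rounding a float32 bit pattern to bfloat16 with round-to-nearest-even, not the functional model's actual code:

```python
def round_bfloat16_rne(bits32):
    """Round a float32 bit pattern to bfloat16 using round-to-nearest-even."""
    upper = bits32 >> 16              # candidate bfloat16 bits
    lower = bits32 & 0xFFFF           # the 16 bits being dropped
    guard = (lower >> 15) & 1         # most significant dropped bit
    sticky = (lower & 0x7FFF) != 0    # OR of the remaining dropped bits
    if guard and (sticky or (upper & 1)):
        upper += 1                    # round up; ties go to the even mantissa
    return upper

# 3 * 3.140625 == 9.421875 == 0x4116c000 as float32: guard is 1 and sticky
# bits are set, so with full precision information the result rounds up.
assert round_bfloat16_rne(0x4116C000) == 0x4117

# With only the single guard bit visible (a tie), RNE keeps the even
# mantissa, giving 0x4116 -- matching the RTL's answer.
assert round_bfloat16_rne(0x41168000) == 0x4116
```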

leonardt (Contributor) commented Jun 5, 2019

Ah, I forgot we were dealing with FP numbers; I was interpreting the values as integers.

leonardt (Contributor) commented Jun 5, 2019

Curiously, on kiwi, running pytest tests/test_pe.py -xv does result in the failure on fp_mult. I'm going to confirm the error isn't present when doing pytest tests/ -xv (I believe this is what I saw on my initial local checkout, but I'll double check). This might suggest that another test file is causing some sort of conflict.

steveri commented Jun 5, 2019

Do the tests depend at all on Python floating-point implementation, and thus could Python/C++ version/lib differences on kiwi vs. other machines explain the difference?

leonardt (Contributor) commented Jun 5, 2019

I believe they use the libmpfr implementation, but the issue is that the output from our functional model (based on libmpfr) is not consistent with the RTL simulation (which uses the CW primitives with rnd('h0)).

leonardt (Contributor) commented Jun 5, 2019

Could the issue be related to how the multiply promotes the inputs (concatenating 0s to the end) and truncates the output (see the use of result_x)? Also the fact that frac_bits has 3 added to it? These are the configuration parameters I see that might be related to the issue. What about ieee_compliance?

module mul #(parameter exp_bits=1, parameter frac_bits=1) (
  input [exp_bits+frac_bits:0] in0,
  input [exp_bits+frac_bits:0] in1,
  output [exp_bits+frac_bits:0] out
);
  wire [2:0] result_x;
  wire [7:0] status;
  CW_fp_mult #(.sig_width(frac_bits+3), .exp_width(exp_bits), .ieee_compliance(0)) mul1
    (.a({in0,3'h0}), .b({in1,3'h0}), .rnd('h0), .z({out,result_x}), .status(status));
endmodule  // mul
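If that reading is right, the wrapper computes the product at sig_width = frac_bits+3 and then simply slices the low 3 bits off into result_x, i.e. it truncates instead of rounding. A quick sketch of the effect on the failing case (my own illustration; the variable names are hypothetical):

```python
# 3 * 3.140625 == 9.421875 is exactly representable with a 10-bit significand
# (1.0010110110 * 2^3), so CW_fp_mult has nothing left to round at 10 bits.
full_frac = 0b0010110110            # 10 fractional bits at sig_width = 7+3

# The .z({out, result_x}) connection then just drops the low 3 bits:
out_frac = full_frac >> 3           # 0b0010110 -> the 0x4116 result
result_x = full_frac & 0b111        # 0b110, the discarded rounding information

# Rounding to nearest even at 7 bits would instead round up, since the
# dropped bits 0b110 are past the halfway point -> the model's 0x4117.
round_up = result_x > 0b100 or (result_x == 0b100 and (out_frac & 1))
rounded_frac = out_frac + (1 if round_up else 0)

assert out_frac == 0b0010110 and rounded_frac == 0b0010111
```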

Kuree (Member, Author) commented Jun 5, 2019

@leonardt Yes, we believe this is the issue and are still investigating. There are more problems with the IP stub.

Kuree (Member, Author) commented Jun 5, 2019

Update:

We have located the bug, which is due to the extra 3 bits in the instantiation. However, there is another bug in CW_fp_mult.v that prevents us from using a 7-bit significand.

leonardt (Contributor) commented Jun 5, 2019

Okay, update on issue 2: weirdly enough, running pytest tests -v misses the failure, but running pytest tests -xv catches it. This is odd.

rdaly525 (Contributor) commented Jun 5, 2019

I forgot about the extra precision bits. I think the appropriate solution is to modify the coreir stub so it uses exactly the correct number of bits. If we want higher precision, we can specify that at the peak/magma level.

Kuree (Member, Author) commented Jun 5, 2019

The vanilla CW_fp_mult.v doesn't allow you to use the correct number of precision bits. Due to copyright I can't post it here, but if you look at lines 60-62 you'll see the bug.

leonardt (Contributor) commented Jun 5, 2019

For me locally, it seems to reliably catch the failure when running pytest -xv tests and reliably miss it when running pytest -v tests. Perhaps we can use -x as a temporary unblocker (not great in some cases because you'll only see the first failure, whereas it can be useful to see all of them, but in other cases killing the run early on the first failure is useful). It's not ideal, but it's a strange issue that I suspect involves some odd interaction between pytest and our code.

rdaly525 (Contributor) commented Jun 5, 2019

@Kuree, are you saying there is also a bug in the IP itself? Currently coreir instantiates the verilog stub with an extra 3 bits of fractional precision, so I know that is likely part of the issue.

I will work on a coreir fix for this.

Kuree (Member, Author) commented Jun 5, 2019

@rdaly525 Yes, I believe so. You can check it out and read the lines I pointed out above.

In the short run we can write a sed script to fix the bug.
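For the sed route, something along these lines would work. The patterns below are pure placeholders, since the real contents of lines 60-62 of CW_fp_mult.v are copyrighted and can't be shown here; the demo therefore runs against a scratch file:

```shell
# Placeholder patterns only -- the real substitution depends on the
# copyrighted CW_fp_mult.v source. Demo on a scratch copy:
printf 'keep\nsig_width_min = 10\nsig_width_min = 10\nkeep\n' > /tmp/cw_demo.v

# Restrict the substitution to a line range (here 2,3) and keep a .bak backup.
sed -i.bak '2,3s/= 10/= 7/' /tmp/cw_demo.v

grep -n '= 7' /tmp/cw_demo.v
```

The `2,3` address range is the important piece: it scopes the substitution to the offending lines, so a generic pattern can't accidentally rewrite the rest of the file.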

rdaly525 (Contributor) commented Jun 5, 2019

I have a coreir branch 'float-fix' that updates how CoreIR instantiates the multiply. @Kuree, does this fix the issue?

rdaly525/coreir#753

rdaly525 (Contributor) commented Jun 5, 2019

Also, if we do actually want the multiply to have 3 extra bits of precision, we need to update the lassen description to concat 3 bits onto the inputs, feed the result into the appropriate BFloat[8,7+3] multiply operator, and then update the tests to match these semantics.

Kuree (Member, Author) commented Jun 5, 2019

I don't think we need the extra 3 bits of precision. You have to round somewhere anyway, and it's difficult to match with other tools.

leonardt (Contributor) commented Jun 5, 2019

Okay, even stranger: I just had a run with pytest tests -v catch the error. @Kuree, is there any possibility of this test failing non-deterministically? Or is the bug deterministic?

Kuree (Member, Author) commented Jun 5, 2019

This bug is deterministic since it's caused by improper instantiation of the IP.

leonardt (Contributor) commented Jun 5, 2019

Weird, I seem to be getting consistent failures (2 in a row at least) with pytest tests -v now. I'm not exactly sure where to continue debugging; perhaps this is related to the buildkite environment? I'll look at the buildkite script to see if there's anything potentially interesting there. Maybe it has to do with the docker flow, so I can try recreating that.

It's strange that I did see the pass locally earlier, but now I can't seem to reproduce it.

Kuree (Member, Author) commented Jun 5, 2019

I could reproduce the same problem inside the buildkite docker environment last night, so I doubt it has anything to do with buildkite.

In the interest of saving time, I think we can add the -x flag and monitor whether the problem resolves itself in the future.

leonardt (Contributor) commented Jun 5, 2019

Oh I found something interesting in the buildkite logs:
This run https://buildkite.com/stanford-aha/lassen/builds/104#eeab0d8b-a96e-45ed-aa84-ef2e392fb4de/6 from ea7f165 enables the -s flag to see the stdout.

If we search the log for fp_mul, we see


tests/test_pe.py::test_fp_binary_op[args15-op1] Running command: irun -top WrappedPE_tb -timescale 1ns/1ns -access +rwc -notimingchecks  -input WrappedPE_cmd.tcl WrappedPE_tb.sv WrappedPE.v CW_fp_mult.v CW_fp_add.v
irun(64): 15.20-s022: (c) Copyright 1995-2017 Cadence Design Systems, Inc.
Recompiling... reason: file './WrappedPE_tb.sv' is newer than expected.
expected: Tue Jun  4 22:42:23 2019
actual:   Tue Jun  4 22:42:24 2019
file: WrappedPE_tb.sv
    module worklib.WrappedPE_tb:sv
errors: 0, warnings: 0
file: CW_fp_mult.v
ncvlog: *W,WARIPR: warning within protected source code.
Elaborating the design hierarchy:
Caching library 'worklib' ....... Done
Top level design units:
    WrappedPE_tb
ncelab: *W,DSEMEL: This SystemVerilog design will be simulated as per IEEE 1800-2009 SystemVerilog simulation semantics. Use -disable_sem2009 option for turning off SV 2009 simulation semantics.
    CW_fp_add #(.sig_width(frac_bits), .exp_width(exp_bits), .ieee_compliance(0)) add1 (.a(in0),.b(in1),.rnd('h0),.z(out),.status(status));
ncelab: *W,CUVMPW (./WrappedPE.v,18|107): port sizes differ in port connection (32/3).
    CW_fp_mult #(.sig_width(frac_bits+3), .exp_width(exp_bits), .ieee_compliance(0)) mul1 (.a({in0,3'h0}),.b({in1,3'h0}),.rnd('h0),.z({out,result_x}),.status(status));
ncelab: *W,CUVMPW (./WrappedPE.v,8|124): port sizes differ in port connection (32/3).
Building instance overlay tables: .................... Done
Generating native compiled code:
    worklib.WrappedPE_tb:sv <0x40ef9b4e>
    streams:   8, words:  4345
Building instance specific data structures.
Loading native compiled code:     .................... Done
Design hierarchy summary:
                   Instances  Unique
    Modules:            1041      71
    Primitives:            1       1
    Registers:            63     114
    Scalar wires:       2668       -
    Expanded wires:       51       5
    Vectored wires:     1669       -
    Always blocks:        18      16
    Initial blocks:        8       4
    Cont. assignments:  1716    2028
    Pseudo assignments:   11      11
    Simulation timescale: 1ns
Writing initial simulation snapshot: worklib.WrappedPE_tb:sv
Loading snapshot worklib.WrappedPE_tb:sv .................... Done
ncsim: *W,DSEM2009: This SystemVerilog design is simulated as per IEEE 1800-2009 SystemVerilog simulation semantics. Use -disable_sem2009 option for turning off SV 2009 simulation semantics.
ncsim> source /cad/cadence/INCISIVE15.20.022/tools/inca/files/ncsimrc
ncsim>
ncsim> database -open -vcd vcddb -into verilog.vcd -default -timescale ps
Created default VCD database vcddb
ncsim> probe -create -all -vcd -depth all
Created probe 1
ncsim> run 10000ns
Simulation complete via $finish(1) at time 40 NS + 0
./WrappedPE_tb.sv:50         #20 $finish;
ncsim> quit
PASSED

tests/test_pe.py::test_fp_mul Running command: irun -top WrappedPE_tb -timescale 1ns/1ns -access +rwc -notimingchecks  -input WrappedPE_cmd.tcl WrappedPE_tb.sv WrappedPE.v CW_fp_mult.v CW_fp_add.v

irun(64): 15.20-s022: (c) Copyright 1995-2017 Cadence Design Systems, Inc.
Loading snapshot worklib.WrappedPE_tb:sv .................... Done
ncsim: *W,DSEM2009: This SystemVerilog design is simulated as per IEEE 1800-2009 SystemVerilog simulation semantics. Use -disable_sem2009 option for turning off SV 2009 simulation semantics.
ncsim> source /cad/cadence/INCISIVE15.20.022/tools/inca/files/ncsimrc
ncsim>
ncsim> database -open -vcd vcddb -into verilog.vcd -default -timescale ps
Created default VCD database vcddb
ncsim> probe -create -all -vcd -depth all
Created probe 1
ncsim> run 10000ns
Simulation complete via $finish(1) at time 40 NS + 0
./WrappedPE_tb.sv:50         #20 $finish;
ncsim> quit
PASSED

Notice that in the previous test (tests/test_pe.py::test_fp_binary_op[args15-op1]), ncsim says "Recompiling... reason: file './WrappedPE_tb.sv' is newer than expected.", which makes sense: with the new test there should be a new test bench, forcing a recompile.

The output of the fp_mul test, however, doesn't show this recompile message, which implies it's actually running the test bench from the previous test. Any ideas why this might be happening? I'll look to see if there's any place where fault might skip generating the test bench, but maybe some strange condition here is causing ncsim to miss the recompile.

Kuree (Member, Author) commented Jun 5, 2019

Hmm, can you force fault to touch the test bench file to make sure the modified time has changed?

rdaly525 (Contributor) commented Jun 5, 2019

This is from Nikhil's email:

I think the bug is in the way the multiplier is instantiated. Since CW_mult does not support a 7 bit mantissa (minimum is 10 bits), we had to instantiate with a 10 bit mantissa. So CW rounds to nearest even for that precision, and not for the 7 bit precision we need. I have pasted verilog code below that should round it to nearest even for our precision (note that CW outputs to int_out, not out). Can you please try with this added to the RTL? I have set the CW rounding mode to truncate now.
module mul #(parameter exp_bits=1, parameter frac_bits=1) (
  input [exp_bits+frac_bits:0] in0,
  input [exp_bits+frac_bits:0] in1,
  output [exp_bits+frac_bits:0] out
);
  wire [exp_bits+frac_bits:0] int_out;
  wire [2:0] result_x;  // low 3 bits of the wider CW result, used for rounding
  reg sign;
  reg [exp_bits-1:0] exp;
  reg [frac_bits:0] frac;

  // CW multiply at frac_bits+3 precision, rnd('h1) = truncate
  CW_fp_mult #(.sig_width(frac_bits+3), .exp_width(exp_bits), .ieee_compliance(0)) mul1
    (.a({in0,3'h0}), .b({in1,3'h0}), .rnd('h1), .z({int_out,result_x}), .status());

  // round to nearest even at frac_bits precision:
  // round up when guard & (round | sticky), or on a tie when the LSB is odd
  always @(*) begin
    sign = int_out[exp_bits+frac_bits];
    exp  = int_out[exp_bits+frac_bits-1:frac_bits];
    frac = {1'b0, int_out[frac_bits-1:0]};
    if ((result_x[2] & (result_x[1] | result_x[0])) | (int_out[0] & result_x[2])) begin
      frac = frac + 1'd1;
      if (~&exp) begin
        exp = exp + frac[frac_bits];
      end
    end
  end

  assign out = {sign, exp, frac[frac_bits-1:0]};
endmodule
He said that a 7-bit significand won't synthesize properly, so we have to use 10 bits here.

If we are going to go with the above code, let's explicitly write this in lassen so we can easily verify the high-risk rounding code using our functional testbench. @Kuree or @nikhilbhagdikar, can you do a lassen PR using the multiply operator of FPVector[8,7+3,RNE,False] in sim.py, then use normal BFloat16 for the testbench?
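As a cross-check of the rounding logic above, here is a small Python model of the always block (my own sketch, not peak/lassen code); with the failing inputs it reproduces the functional model's 0x4117:

```python
def model_round(int_out, result_x, exp_bits=8, frac_bits=7):
    """Python model of the Verilog post-rounding: int_out is the truncated
    CW result, result_x the 3 discarded significand bits."""
    sign = (int_out >> (exp_bits + frac_bits)) & 1
    exp = (int_out >> frac_bits) & ((1 << exp_bits) - 1)
    frac = int_out & ((1 << frac_bits) - 1)
    guard = (result_x >> 2) & 1
    round_sticky = result_x & 0b011
    # round up when guard & (round | sticky), or on a tie when the LSB is odd
    if (guard and round_sticky) or (guard and (int_out & 1)):
        frac += 1
        carry = frac >> frac_bits           # mantissa overflow into the exponent
        frac &= (1 << frac_bits) - 1
        if exp != (1 << exp_bits) - 1:      # mirror the Verilog ~&exp guard
            exp += carry
    return (sign << (exp_bits + frac_bits)) | (exp << frac_bits) | frac

# truncated CW result 0x4116 with discarded bits 0b110 rounds up to 0x4117
assert model_round(0x4116, 0b110) == 0x4117
# a tie (discarded bits 0b100) with an even mantissa stays at 0x4116
assert model_round(0x4116, 0b100) == 0x4116
```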

leonardt (Contributor) commented Jun 5, 2019

There are a couple of ways to address the issue, but what I did for now is have the lassen test remove the generated tb file if it exists, to hopefully force ncsim to recompile (see 1705e47). I also changed the test script back to run them all with dc58b42, so we can see if we get the expected failure.

We could add this to fault, but this was a faster workaround because otherwise we'd need to get the change merged into fault and released. The thing is, fault creates the test bench file through Python's file IO interface, see https://github.com/leonardt/fault/blob/master/fault/system_verilog_target.py#L272. We could add logic after this to touch the created file, but it seems to me that this should already result in a new timestamp every test. Maybe there's something weird going on with the way ncsim tracks file modifications? I'm hesitant to integrate this special case into fault, but if it's a general issue with how ncsim handles recompilation, then it does make sense to have fault automatically avoid it by doing some timestamp magic (we could use Python's os.utime https://docs.python.org/3.7/library/os.html#os.utime to try to force a recompilation).
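A sketch of that os.utime idea (the helper name is mine, not fault's API):

```python
import os

def bump_mtime(path, delta_s=2):
    """Push a file's modification time forward so timestamp-based
    recompile checks (like ncsim's) always see it as newer."""
    st = os.stat(path)
    # keep atime, advance mtime by delta_s seconds
    os.utime(path, (st.st_atime, st.st_mtime + delta_s))
```

fault could call this right after writing the generated test bench file, sidestepping coarse timestamp granularity without sleeping.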

leonardt (Contributor) commented Jun 5, 2019

I'll run with -s to see if there's a similar message from ncsim (skipping the recompile step).

Kuree (Member, Author) commented Jun 5, 2019

@rdaly525 What's the plan for the RTL? Are you going to hard-code the verilog fix in? (I will verify the fix later today.)

rdaly525 (Contributor) commented Jun 5, 2019

In my opinion, it would be easier to verify the fix directly in peak by running functional tests and comparing the semantics of the code Nikhil wrote against BFloat16.

leonardt (Contributor) commented Jun 5, 2019

Weird. I added a print to make sure the test bench file was being removed before fault regenerated it, and the output looks like

tests/test_pe.py::test_fp_mul removing tb file
Running command: irun -top WrappedPE_tb -timescale 1ns/1ns -access +rwc -notimingchecks  -input WrappedPE_cmd.tcl WrappedPE_tb.sv WrappedPE.v CW_fp_mult.v CW_fp_add.v

irun(64): 15.20-s022: (c) Copyright 1995-2017 Cadence Design Systems, Inc.
Loading snapshot worklib.WrappedPE_tb:sv .................... Done
ncsim: *W,DSEM2009: This SystemVerilog design is simulated as per IEEE 1800-2009 SystemVerilog simulation semantics. Use -disable_sem2009 option for turning off SV 2009 simulation semantics.
ncsim> source /cad/cadence/INCISIVE15.20.022/tools/inca/files/ncsimrc
ncsim>
ncsim> database -open -vcd vcddb -into verilog.vcd -default -timescale ps
Created default VCD database vcddb
ncsim> probe -create -all -vcd -depth all
Created probe 1
ncsim> run 10000ns
Simulation complete via $finish(1) at time 40 NS + 0
./WrappedPE_tb.sv:50         #20 $finish;
ncsim> quit
PASSED

so it seems that even if I force-remove that file before generating a new _tb, ncsim still uses the cached worklib. I could try deleting the intermediate worklib files.

Kuree (Member, Author) commented Jun 5, 2019

@leonardt If you remove worklib, then it's the same thing as using a temp folder, since ncsim won't be able to use the cached version anymore. This will make the runs much slower, but should avoid the bug.
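The temp-folder variant is straightforward to sketch; in the snippet below, echo stands in for the real irun command line:

```python
import subprocess
import tempfile

# Run each simulator invocation in its own scratch directory so ncsim
# can never reuse a stale worklib/INCA_libs cache from a previous test.
with tempfile.TemporaryDirectory() as run_dir:
    # "echo" is a placeholder for the actual irun invocation
    result = subprocess.run(["echo", "irun", "-top", "WrappedPE_tb"],
                            cwd=run_dir, capture_output=True, text=True)

print(result.stdout.strip())
```

The directory (and any cached compile artifacts inside it) is deleted when the with block exits, which is exactly why every run pays the full recompile cost.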

leonardt (Contributor) commented Jun 5, 2019

I'll try that first to see if it avoids the error, so we know the cause; then we can investigate how to get ncsim to force-recompile our _tb.

leonardt (Contributor) commented Jun 5, 2019

Ah, I saw this in the output:


Recompiling... reason: file './WrappedPE_tb.sv' is newer than expected.
expected: Wed Jun  5 15:15:35 2019
actual:   Wed Jun  5 15:15:36 2019

so maybe it has to do with whether a full test run takes under 1 second (the timestamps above have 1-second resolution), so I'm going to try adding a time.sleep(1)

leonardt (Contributor) commented Jun 5, 2019

Adding time.sleep(1) also fixes the issue, see https://buildkite.com/stanford-aha/lassen/builds/121#a3c84cda-fb2c-4133-adda-7ce4e3a31e33

but this makes the tests run quite a bit slower (because most of them take under a second). Not sure what the best workaround is here; we could benchmark nuking INCA_libs (forcing a full recompile) versus time.sleep(1).

leonardt (Contributor) commented Jun 6, 2019

Issue seems to be resolved using fault's timestamp-edit branch (leonardt/fault#116), see https://buildkite.com/stanford-aha/lassen/builds/131#b4859b6e-ad64-47da-8104-98a5b23ef0ad

Kuree (Member, Author) commented Jun 11, 2019

Moved the discussion to #120, since this issue also covers another bug.

Kuree closed this as completed Jun 11, 2019