Regarding the result of dhrystone with TCM #383

Closed · piondeno opened this issue Dec 11, 2023 · 6 comments
@piondeno

Hi Dolu1990,
I generated an ITCM and a DTCM, each 16 KB in size, which lets the whole program and all data be loaded into the TCMs.
I want the entire test to run from TCM.

/home/datakey/tools/riscv64-unknown-elf-gcc-2018/bin/riscv64-unknown-elf-gcc -fno-inline -fno-common -O3 -DPREALLOCATE=1 -DHOST_DEBUG=0 -DMSC_CLOCK  -march=rv32im  -mabi=ilp32 -g -O3  -fno-inline   -MD -fstrict-volatile-bitfields  -o build/dhrystone.elf build/src/main.o build/src/dhry_1.o build/src/dhry_2.o build/src/crt.o build/src/stdlib.o -lc -lc  -march=rv32im  -mabi=ilp32 -nostdlib -lgcc -mcmodel=medany -nostartfiles -ffreestanding -Wl,-Bstatic,-T,../libs/linkerTcm.ld,-Map,build/dhrystone.map,--print-memory-usage 
Memory region         Used Size  Region Size  %age Used
            iTcm:       11624 B        16 KB     70.95%
            dTcm:       15376 B        16 KB     93.85%

I only modified one line of code in the dhrystone project, because CLK_TCK seems to be obsolete:

#include <time.h>
#define HZ	CLOCKS_PER_SEC
//#define HZ	CLK_TCK
#endif

I used GenFullNoMmuMaxPerf.scala as the template to configure VexRiscv for this test.
Ryan.zip

After programming the bitstream into the FPGA (Artix 7) and running the dhrystone project, the following information was shown in the terminal:


Dhrystone Benchmark, Version 2.1 (Language: C)

Program compiled without 'register' attribute

Please give the number of runs through the benchmark:
Execution starts, 500 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:

Int_Glob:            5
        should be:   5
Bool_Glob:           1
        should be:   1
Ch_1_Glob:
        should be:   A
Ch_2_Glob:           B
        should be:   B
Arr_1_Glob[8]:       7
        should be:   7
Arr_2_Glob[8][7]:    510
        should be:   Number_Of_Runs + 10
Ptr_Glob->
  Ptr_Comp:          805318660
        should be:   (implementation-dependent)
  Discr:             0
        should be:   0
  Enum_Comp:         2
        should be:   2
  Int_Comp:          17
        should be:   17
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
  Ptr_Comp:          805318660
        should be:   (implementation-dependent), same as above
  Discr:             0
        should be:   0
  Enum_Comp:         1
        should be:   1
  Int_Comp:          18
        should be:   18
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc:           1
        should be:   5
Int_2_Loc:           13
        should be:   13
Int_3_Loc:           7
        should be:   7
Enum_Loc:            1
        should be:   1
Str_1_Loc:           DHRYSTONE PROGRAM, 1'ST STRING
        should be:   DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc:           DHRYSTONE PROGRAM, 2'ND STRING
        should be:   DHRYSTONE PROGRAM, 2'ND STRING

Clock cycles=222809
                    DMIPS per Mhz:                              1.27
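
(The printed figure follows directly from the cycle count. A minimal sketch of the arithmetic, assuming the usual VAX 11/780 reference of 1757 Dhrystones/s per MIPS; the clock frequency cancels out, so it never appears:)

val runs   = 500        // runs reported above
val cycles = 222809L    // "Clock cycles" reported above
// Dhrystones/s = runs * f / cycles; DMIPS = Dhrystones/s / 1757;
// dividing by f in MHz cancels f entirely:
val dmipsPerMhz = runs * 1e6 / (cycles * 1757.0)   // ≈ 1.27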

The result does not seem to reach the figures described in the VexRiscv GitHub README:

VexRiscv full max perf (HZ*IPC) -> (RV32IM, 1.38 DMIPS/Mhz 2.57 Coremark/Mhz, 8KB-I$,8KB-D$, single cycle barrel shifter, debug module, catch exceptions, dynamic branch prediction in the fetch stage, branch and shift operations done in the Execute stage) ->
    Artix 7     -> 200 Mhz 1935 LUT 1216 FF 
    Cyclone V   -> 130 Mhz 1,166 ALMs
    Cyclone IV  -> 126 Mhz 2,484 LUT 1,120 FF 

Because the whole program runs from TCM, I expected the result to reach at least 1.38 DMIPS/MHz.
Could you give me any suggestions to improve the benchmark result?
Thanks

@Dolu1990
Member

Hi,

Did you try with the vanilla GenFullNoMmuMaxPerf config?
In which test environment did you run your version?

@piondeno
Author

Hi,
After removing the TCM and restoring the cache size to 8 KB each for the I and D buses:

/home/datakey/tools/riscv64-unknown-elf-gcc-2018/bin/riscv64-unknown-elf-gcc -fno-inline -fno-common -O3 -DPREALLOCATE=1 -DHOST_DEBUG=0 -DMSC_CLOCK  -march=rv32im  -mabi=ilp32 -g -O3  -fno-inline   -MD -fstrict-volatile-bitfields  -o build/dhrystone.elf build/src/main.o build/src/dhry_1.o build/src/dhry_2.o build/src/crt.o build/src/stdlib.o -lc -lc  -march=rv32im  -mabi=ilp32 -nostdlib -lgcc -mcmodel=medany -nostartfiles -ffreestanding -Wl,-Bstatic,-T,../libs/linkerAllInSramForSim.ld,-Map,build/dhrystone.map,--print-memory-usage 
Memory region         Used Size  Region Size  %age Used
       onChipRam:       26992 B        32 KB     82.37%
           sdram:          0 GB        64 MB      0.00%

After downloading the bitstream to the FPGA and running the program in release mode, the result is shown below:

Dhrystone Benchmark, Version 2.1 (Language: C)

Program compiled without 'register' attribute

Please give the number of runs through the benchmark:
Execution starts, 500 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:

Int_Glob:            5
        should be:   5
Bool_Glob:           1
        should be:   1
Ch_1_Glob:           A
        should be:   A
Ch_2_Glob:           B
        should be:   B
Arr_1_Glob[8]:       7
        should be:   7
Arr_2_Glob[8][7]:    510
        should be:   Number_Of_Runs + 10
Ptr_Glob->
  Ptr_Comp:          -2147459732
        should be:   (implementation-dependent)
  Discr:             0
        should be:   0
  Enum_Comp:         2
        should be:   2
  Int_Comp:          17
        should be:   17
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
  Ptr_Comp:          -2147459732
        should be:   (implementation-dependent), same as above
  Discr:             0
        should be:   0
  Enum_Comp:         1
        should be:   1
  Int_Comp:          18
        should be:   18
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc:           5
        should be:   5
Int_2_Loc:           13
        should be:   13
Int_3_Loc:           7
        should be:   7
Enum_Loc:            1
        should be:   1
Str_1_Loc:           DHRYSTONE PROGRAM, 1'ST STRING
        should be:   DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc:           DHRYSTONE PROGRAM, 2'ND STRING
        should be:   DHRYSTONE PROGRAM, 2'ND STRING

Clock cycles=213512
                    DMIPS per Mhz:                              1.33

The benchmark result is 1.33 DMIPS/MHz.
This result is better than with the TCM, which does not make sense to me.
Do you have any idea how I can verify this?
Thanks.

@Dolu1990
Member

Hi,

I looked at the code, and I think I found the reason why:

arbitration.haltItself setWhen(stages.dropWhile(_ != execute).tail.map(s => s.arbitration.isValid && s.input(HAS_SIDE_EFFECT)).orR)

Basically, the data cache has the advantage that writes are delayed until the writeback stage, while the tightly coupled dbus has the penalty that writes are scheduled early (in the execute stage) and must ensure there is no risk of them being unscheduled by a branch, an exception, or anything else.

So the tightly coupled dbus will sometimes have to wait for the pipeline to empty itself (when doing a store).
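
For reference, the quoted one-liner can be unpacked as follows (a plain restatement, not new source; stage and signal names are taken from the snippet above):

// One reading of the condition, per the explanation above: the store in
// Execute is held back while any later stage (Memory, WriteBack) still
// holds a valid instruction flagged HAS_SIDE_EFFECT, because the early
// dbus write could otherwise be issued and then un-scheduled by a flush.
val laterStages = stages.dropWhile(_ != execute).tail
arbitration.haltItself setWhen(
  laterStages.map(s => s.arbitration.isValid && s.input(HAS_SIDE_EFFECT)).orR
)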

@piondeno
Author

Hi,

Thanks for the reply.
I got it.

@piondeno
Author

piondeno commented Dec 13, 2023

Hi, @Dolu1990

May I ask one more question?

First, I changed the configuration for DivPlugin:

        //new DivPlugin,
        new MulDivIterativePlugin(genMul = false, genDiv = true, mulUnrollFactor = 1, divUnrollFactor = 2, dhrystoneOpt=true),

The benchmark improved as follows:
1.33 DMIPS/MHz (8 KB I$, 8 KB D$) ->
1.38 DMIPS/MHz (8 KB I$, 8 KB D$, divUnrollFactor = 2) ->
1.44 DMIPS/MHz (8 KB I$, 8 KB D$, divUnrollFactor = 2, dhrystoneOpt = true)
When setting dhrystoneOpt = true, is it really helpful in real workloads?

Second, I set genMul = true and mulUnrollFactor = 2 to replace MulPlugin:

        //new MulPlugin,
        //new DivPlugin,
        new MulDivIterativePlugin(genMul = true, genDiv = true, mulUnrollFactor = 2, divUnrollFactor = 2, dhrystoneOpt=true),

The benchmark decreased to 1.33 DMIPS/MHz.
So although genMul = true in MulDivIterativePlugin can replace MulPlugin, the performance is lower than with MulPlugin.
Is that right?

Thanks

@piondeno piondeno reopened this Dec 13, 2023
@Dolu1990
Member

When setting dhrystoneOpt = true, is it really helpful in real workloads?

I would say not really useful, as it only works for very small division numbers.
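
A hypothetical software model of that kind of shortcut (illustrative only, not the RTL): start the iteration at the dividend's top set bit instead of bit 31, so the tiny dividends Dhrystone uses finish in a handful of steps, while general workloads see little benefit.

def iterDiv(dividend: Long, divisor: Long): (Long, Long) = {
  require(divisor > 0 && dividend >= 0)
  var rem = 0L; var quo = 0L
  // Skip the dividend's leading zeros: small values need only a few steps.
  val top = math.max(0, 63 - java.lang.Long.numberOfLeadingZeros(dividend))
  for (i <- top to 0 by -1) {
    rem = (rem << 1) | ((dividend >> i) & 1)   // bring down the next bit
    quo <<= 1
    if (rem >= divisor) { rem -= divisor; quo |= 1 }
  }
  (quo, rem)   // e.g. iterDiv(13, 4) == (3, 1)
}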

the performance is lower than with MulPlugin.

Yes, at least in practice on FPGA.
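
A rough back-of-envelope consistent with that (assumed typical costs, not figures measured in this thread): MulPlugin maps the multiply onto the FPGA's DSP blocks, while the iterative plugin retires mulUnrollFactor bits per cycle.

// Assumed cycle costs for one 32-bit multiply (sketch):
val iterativeMulCycles = 32 / 2   // mulUnrollFactor = 2 -> ~16 cycles each
val dspMulCycles       = 1        // MulPlugin: pipelined DSP multiply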
