Regarding the result of dhrystone with TCM #383

Closed · piondeno opened this issue Dec 11, 2023 · 6 comments
@piondeno

Hi Dolu1990,
I generated an ITCM and a DTCM, each 16 KB in size, which lets the whole program and all data be loaded into the TCMs.
I want the entire test to run from TCM.

/home/datakey/tools/riscv64-unknown-elf-gcc-2018/bin/riscv64-unknown-elf-gcc -fno-inline -fno-common -O3 -DPREALLOCATE=1 -DHOST_DEBUG=0 -DMSC_CLOCK  -march=rv32im  -mabi=ilp32 -g -O3  -fno-inline   -MD -fstrict-volatile-bitfields  -o build/dhrystone.elf build/src/main.o build/src/dhry_1.o build/src/dhry_2.o build/src/crt.o build/src/stdlib.o -lc -lc  -march=rv32im  -mabi=ilp32 -nostdlib -lgcc -mcmodel=medany -nostartfiles -ffreestanding -Wl,-Bstatic,-T,../libs/linkerTcm.ld,-Map,build/dhrystone.map,--print-memory-usage 
Memory region         Used Size  Region Size  %age Used
            iTcm:       11624 B        16 KB     70.95%
            dTcm:       15376 B        16 KB     93.85%

I only modified one line of code in the dhrystone project, because CLK_TCK seems to be obsolete:

#include <time.h>
#define HZ	CLOCKS_PER_SEC
//#define HZ	CLK_TCK
#endif

I used GenFullNoMmuMaxPerf.scala as the template to configure VexRiscv for this test.
Ryan.zip

After programming the bitstream into the FPGA (Artix 7) and running the dhrystone project, the following information was shown in the terminal:


Dhrystone Benchmark, Version 2.1 (Language: C)

Program compiled without 'register' attribute

Please give the number of runs through the benchmark:
Execution starts, 500 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:

Int_Glob:            5
        should be:   5
Bool_Glob:           1
        should be:   1
Ch_1_Glob:
        should be:   A
Ch_2_Glob:           B
        should be:   B
Arr_1_Glob[8]:       7
        should be:   7
Arr_2_Glob[8][7]:    510
        should be:   Number_Of_Runs + 10
Ptr_Glob->
  Ptr_Comp:          805318660
        should be:   (implementation-dependent)
  Discr:             0
        should be:   0
  Enum_Comp:         2
        should be:   2
  Int_Comp:          17
        should be:   17
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
  Ptr_Comp:          805318660
        should be:   (implementation-dependent), same as above
  Discr:             0
        should be:   0
  Enum_Comp:         1
        should be:   1
  Int_Comp:          18
        should be:   18
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc:           1
        should be:   5
Int_2_Loc:           13
        should be:   13
Int_3_Loc:           7
        should be:   7
Enum_Loc:            1
        should be:   1
Str_1_Loc:           DHRYSTONE PROGRAM, 1'ST STRING
        should be:   DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc:           DHRYSTONE PROGRAM, 2'ND STRING
        should be:   DHRYSTONE PROGRAM, 2'ND STRING

Clock cycles=222809
                    DMIPS per Mhz:                              1.27
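
(The printed figure follows directly from the cycle count. A minimal sketch of the arithmetic, assuming the usual VAX 11/780 reference of 1757 Dhrystones/s per MIPS; the clock frequency cancels out, so it never appears:)

val runs   = 500        // runs reported above
val cycles = 222809L    // "Clock cycles" reported above
// Dhrystones/s = runs * f / cycles; DMIPS = Dhrystones/s / 1757;
// dividing by f in MHz cancels f entirely:
val dmipsPerMhz = runs * 1e6 / (cycles * 1757.0)   // ≈ 1.27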

The result does not seem to reach the figures described in the VexRiscv GitHub README:

VexRiscv full max perf (HZ*IPC) -> (RV32IM, 1.38 DMIPS/Mhz 2.57 Coremark/Mhz, 8KB-I$,8KB-D$, single cycle barrel shifter, debug module, catch exceptions, dynamic branch prediction in the fetch stage, branch and shift operations done in the Execute stage) ->
    Artix 7     -> 200 Mhz 1935 LUT 1216 FF 
    Cyclone V   -> 130 Mhz 1,166 ALMs
    Cyclone IV  -> 126 Mhz 2,484 LUT 1,120 FF 

Because the whole program runs from TCM, I expected the result to reach at least 1.38 DMIPS/MHz.
Could you give me any suggestions to improve the benchmark result?
Thanks

@Dolu1990
Member

Hi,

Did you try with the vanilla GenFullNoMmuMaxPerf config?
In which test environment did you run your version?

@piondeno
Author

Hi,
After removing the TCM and restoring the cache size to 8 KB each for the I and D buses:

/home/datakey/tools/riscv64-unknown-elf-gcc-2018/bin/riscv64-unknown-elf-gcc -fno-inline -fno-common -O3 -DPREALLOCATE=1 -DHOST_DEBUG=0 -DMSC_CLOCK  -march=rv32im  -mabi=ilp32 -g -O3  -fno-inline   -MD -fstrict-volatile-bitfields  -o build/dhrystone.elf build/src/main.o build/src/dhry_1.o build/src/dhry_2.o build/src/crt.o build/src/stdlib.o -lc -lc  -march=rv32im  -mabi=ilp32 -nostdlib -lgcc -mcmodel=medany -nostartfiles -ffreestanding -Wl,-Bstatic,-T,../libs/linkerAllInSramForSim.ld,-Map,build/dhrystone.map,--print-memory-usage 
Memory region         Used Size  Region Size  %age Used
       onChipRam:       26992 B        32 KB     82.37%
           sdram:          0 GB        64 MB      0.00%

After downloading the bitstream to the FPGA and running the program in release mode, the result is shown below:

Dhrystone Benchmark, Version 2.1 (Language: C)

Program compiled without 'register' attribute

Please give the number of runs through the benchmark:
Execution starts, 500 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:

Int_Glob:            5
        should be:   5
Bool_Glob:           1
        should be:   1
Ch_1_Glob:           A
        should be:   A
Ch_2_Glob:           B
        should be:   B
Arr_1_Glob[8]:       7
        should be:   7
Arr_2_Glob[8][7]:    510
        should be:   Number_Of_Runs + 10
Ptr_Glob->
  Ptr_Comp:          -2147459732
        should be:   (implementation-dependent)
  Discr:             0
        should be:   0
  Enum_Comp:         2
        should be:   2
  Int_Comp:          17
        should be:   17
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
  Ptr_Comp:          -2147459732
        should be:   (implementation-dependent), same as above
  Discr:             0
        should be:   0
  Enum_Comp:         1
        should be:   1
  Int_Comp:          18
        should be:   18
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc:           5
        should be:   5
Int_2_Loc:           13
        should be:   13
Int_3_Loc:           7
        should be:   7
Enum_Loc:            1
        should be:   1
Str_1_Loc:           DHRYSTONE PROGRAM, 1'ST STRING
        should be:   DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc:           DHRYSTONE PROGRAM, 2'ND STRING
        should be:   DHRYSTONE PROGRAM, 2'ND STRING

Clock cycles=213512
                    DMIPS per Mhz:                              1.33

The benchmark result is 1.33 DMIPS/MHz.
This result is better than with the TCM, which does not make sense to me.
Do you have any idea how I can verify this?
Thanks.

@Dolu1990
Member

Hi,

I looked at the code, and I think I found the reason why:

arbitration.haltItself setWhen(stages.dropWhile(_ != execute).tail.map(s => s.arbitration.isValid && s.input(HAS_SIDE_EFFECT)).orR)

Basically, the data cache has the advantage that writes are delayed until the writeback stage, while the tightly coupled dbus has the penalty that writes are scheduled early (in the execute stage) and must ensure there is no risk of them being unscheduled by a branch, an exception, or anything else.

So the tightly coupled dbus will sometimes have to wait for the pipeline to empty itself (when doing a store).
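
For reference, the quoted one-liner can be unpacked as follows (a plain restatement, not new source; stage and signal names are taken from the snippet above):

// One reading of the condition, per the explanation above: the store in
// Execute is held back while any later stage (Memory, WriteBack) still
// holds a valid instruction flagged HAS_SIDE_EFFECT, because the early
// dbus write could otherwise be issued and then un-scheduled by a flush.
val laterStages = stages.dropWhile(_ != execute).tail
arbitration.haltItself setWhen(
  laterStages.map(s => s.arbitration.isValid && s.input(HAS_SIDE_EFFECT)).orR
)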

@piondeno
Author

Hi,

Thanks for the reply.
I got it.

@piondeno
Author

piondeno commented Dec 13, 2023

Hi, @Dolu1990

May I ask one more question?

First, I changed the configuration for DivPlugin:

        //new DivPlugin,
        new MulDivIterativePlugin(genMul = false, genDiv = true, mulUnrollFactor = 1, divUnrollFactor = 2, dhrystoneOpt=true),

The benchmark improved as follows:
1.33 DMIPS/MHz (8 KB I$, 8 KB D$) ->
1.38 DMIPS/MHz (8 KB I$, 8 KB D$, divUnrollFactor = 2) ->
1.44 DMIPS/MHz (8 KB I$, 8 KB D$, divUnrollFactor = 2, dhrystoneOpt = true)
When setting dhrystoneOpt = true, is it really helpful in real workloads?

Second, I set genMul = true and mulUnrollFactor = 2 to replace MulPlugin:

        //new MulPlugin,
        //new DivPlugin,
        new MulDivIterativePlugin(genMul = true, genDiv = true, mulUnrollFactor = 2, divUnrollFactor = 2, dhrystoneOpt=true),

The benchmark decreased to 1.33 DMIPS/MHz.
So although genMul = true in MulDivIterativePlugin can replace MulPlugin, the performance is lower than with MulPlugin.
Is that right?

Thanks

@piondeno piondeno reopened this Dec 13, 2023
@Dolu1990
Member

When setting dhrystoneOpt = true, is it really helpful in real workloads?

I would say not really useful, as it only works for very small division numbers.
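
A hypothetical software model of that kind of shortcut (illustrative only, not the RTL): start the iteration at the dividend's top set bit instead of bit 31, so the tiny dividends Dhrystone uses finish in a handful of steps, while general workloads see little benefit.

def iterDiv(dividend: Long, divisor: Long): (Long, Long) = {
  require(divisor > 0 && dividend >= 0)
  var rem = 0L; var quo = 0L
  // Skip the dividend's leading zeros: small values need only a few steps.
  val top = math.max(0, 63 - java.lang.Long.numberOfLeadingZeros(dividend))
  for (i <- top to 0 by -1) {
    rem = (rem << 1) | ((dividend >> i) & 1)   // bring down the next bit
    quo <<= 1
    if (rem >= divisor) { rem -= divisor; quo |= 1 }
  }
  (quo, rem)   // e.g. iterDiv(13, 4) == (3, 1)
}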

the performance is lower than with MulPlugin.

Yes, at least in practice on FPGA.
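
A rough back-of-envelope consistent with that (assumed typical costs, not figures measured in this thread): MulPlugin maps the multiply onto the FPGA's DSP blocks, while the iterative plugin retires mulUnrollFactor bits per cycle.

// Assumed cycle costs for one 32-bit multiply (sketch):
val iterativeMulCycles = 32 / 2   // mulUnrollFactor = 2 -> ~16 cycles each
val dspMulCycles       = 1        // MulPlugin: pipelined DSP multiply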
