Cycle count of rocket chip zynq infrastructure #63

Closed
kenzhang82 opened this issue Aug 10, 2017 · 8 comments

Comments

@kenzhang82

Hi there,

Maybe I'm posting this in the wrong forum, but I have searched many places and couldn't find an answer.

How can we count how many cycles each instruction takes to execute for a C program running on the rocket chip FPGA infrastructure (say, the default config, i.e. the pre-built image)?

The closest answer I could find is to compile the C++ cycle emulator and simulate the program on it, but even a simple "hello world" C program takes a long time to simulate.

Any help would be much appreciated! Thanks.

Ken

@davidbiancolin
Contributor

I'm going to point you at section 2.8 of the RISC-V user level specification. https://riscv.org/specifications/ :)
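For readers following along: Section 2.8 of the user-level spec (v2.2) covers the CSR instructions, including the `rdcycle` and `rdinstret` counter reads. A minimal sketch of bracketing a code region with those counters might look like the following. The helper names (`read_cycles`, `read_instret`, `sum_to`, `measure_sum`) and the workload are hypothetical, and the non-RISC-V fallback exists only so the sketch compiles on a host machine. Whether user code may read the counters on a given system depends on how the privileged software has configured counter access.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the user-level cycle counter.
 * On RV32 this returns only the low 32 bits (rdcycleh holds the upper half).
 * The non-RISC-V branch is an assumption for host builds, not a cycle count. */
uint64_t read_cycles(void) {
#if defined(__riscv)
    unsigned long c;
    __asm__ volatile ("rdcycle %0" : "=r"(c));
    return (uint64_t)c;
#else
    return (uint64_t)clock();
#endif
}

/* Read the retired-instruction counter (same RV32 caveat as above). */
uint64_t read_instret(void) {
#if defined(__riscv)
    unsigned long n;
    __asm__ volatile ("rdinstret %0" : "=r"(n));
    return (uint64_t)n;
#else
    return 0; /* no portable equivalent on the host */
#endif
}

struct region_counts {
    uint64_t cycles;   /* cycles elapsed across the region */
    uint64_t instret;  /* instructions retired across the region */
    uint64_t result;   /* workload result, kept so the work is not optimized away */
};

/* Hypothetical workload: sum of 0..n-1. */
uint64_t sum_to(uint32_t n) {
    volatile uint64_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc += i;
    }
    return acc;
}

/* Sample both counters before and after the workload and report the deltas. */
struct region_counts measure_sum(uint32_t n) {
    struct region_counts r;
    uint64_t c0 = read_cycles();
    uint64_t i0 = read_instret();
    r.result = sum_to(n);
    r.cycles = read_cycles() - c0;
    r.instret = read_instret() - i0;
    return r;
}
```

Calling `measure_sum(1000)` and dividing `cycles` by `instret` gives an average cycles-per-instruction figure for the region as a whole. It does not give a per-instruction cost.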

@kenzhang82
Author

Thanks @davidbiancolin, much appreciated! I understand why the emulator is slow, but is there any way to know how many cycles each instruction (for example add, sub, mul, etc.) takes to execute? I tried turning verbose mode on, but the output doesn't include a cycle count. Did I do something wrong?

Also, I just figured out that we could use spike pk -s to do the same thing (I understand that spike is just a functional simulator), but what would be the best (and most accurate) way to profile the cycle count of a C program running on the rocket chip zynq infrastructure? Thanks.

@aswaterman
Member

Hi @z419379295 - you're asking for a metric that's fundamentally ambiguous, because pipelined processors overlap latencies. Suppose MUL has 3-cycle latency and LW has 2-cycle latency:

MUL x1, x1, x2
LW x2, 0(x2)
ADD x2, x2, x1

A single-issue in-order pipeline would incur one stall cycle before the ADD, so the sequence completes over the course of four cycles. But since the ADD is stalled on both the MUL and the LW, how do you decide how to apportion those cycles between the instructions?
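The stall arithmetic above can be checked with a toy issue model. This is only a sketch under stated assumptions: one instruction issues per cycle, an instruction waits until all of its source registers are ready, and nothing else (caches, structural hazards, write-port conflicts) is modeled. The MUL and LW latencies come from the example; the 1-cycle ADD latency and all names (`struct insn`, `schedule`, `example_prog`) are hypothetical.

```c
#include <assert.h>

#define NUM_REGS 32

struct insn {
    const char *name;
    int dest, src1, src2; /* register numbers; -1 means unused */
    int latency;          /* cycles from issue until the result is usable */
};

/* Single-issue, in-order issue model.
 * Fills issue[i] with the cycle in which instruction i issues and
 * returns the total number of stall cycles. */
int schedule(const struct insn *prog, int n, int *issue) {
    int ready[NUM_REGS] = {0}; /* cycle at which each register becomes usable */
    int next = 0;              /* earliest cycle the next instruction could issue */
    int stalls = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int t = next;
        if (prog[i].src1 >= 0 && ready[prog[i].src1] > t) t = ready[prog[i].src1];
        if (prog[i].src2 >= 0 && ready[prog[i].src2] > t) t = ready[prog[i].src2];
        stalls += t - next;
        issue[i] = t;
        if (prog[i].dest > 0) { /* x0 is never written */
            ready[prog[i].dest] = t + prog[i].latency;
        }
        next = t + 1;
    }
    return stalls;
}

/* The three-instruction sequence from the comment above. */
const struct insn example_prog[3] = {
    {"MUL", 1, 1, 2, 3},  /* MUL x1, x1, x2 : 3-cycle latency */
    {"LW",  2, 2, -1, 2}, /* LW  x2, 0(x2)  : 2-cycle latency */
    {"ADD", 2, 2, 1, 1},  /* ADD x2, x2, x1 : 1-cycle latency (assumed) */
};
```

In this model MUL issues in cycle 0, LW in cycle 1, and ADD in cycle 3, giving one stall cycle. Both x1 and x2 become ready in cycle 3, so the model has no basis for charging that stall to MUL rather than LW, which is the ambiguity described above.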

@kenzhang82
Author

Thanks @aswaterman for your help!! Aha, that makes sense to me now. Maybe I wasn't seeing the big picture. What I am really trying to do is estimate the power consumption of C code executing on a rocket chip synthesized in the Zynq PL, and I thought it would help to see which instructions take the most cycles. Is there any way to achieve this (i.e. power profiling of instructions)? Thanks.

@aswaterman
Member

I'm not really sure... maybe run several benchmarks, measure their power consumption and instruction mix, and then attempt to correlate power consumption with instruction mix?
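One way to read the correlation suggestion, sketched very roughly: for each benchmark, record the fraction of retired instructions that belong to one class (say, multiplies) and the measured average power, then fit a line through those points. The code below is an ordinary least-squares fit with a single predictor. The names (`struct linear_fit`, `fit_line`) are hypothetical, and a real study would likely need several instruction classes and many more benchmarks than a one-variable fit can support.

```c
#include <assert.h>
#include <stddef.h>

struct linear_fit {
    double intercept; /* estimated power when the class fraction is zero */
    double slope;     /* change in power per unit change in class fraction */
};

/* Ordinary least-squares fit of y = intercept + slope * x.
 * x[i]: fraction of instructions in the chosen class for benchmark i.
 * y[i]: measured average power for benchmark i. */
struct linear_fit fit_line(const double *x, const double *y, size_t n) {
    assert(n >= 2);

    double mean_x = 0.0, mean_y = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    double sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mean_x;
        sxx += dx * dx;
        sxy += dx * (y[i] - mean_y);
    }
    assert(sxx > 0.0); /* the benchmarks must differ in instruction mix */

    struct linear_fit f;
    f.slope = sxy / sxx;
    f.intercept = mean_y - f.slope * mean_x;
    return f;
}
```

A slope that stays large and stable across different benchmark sets would hint that the chosen instruction class matters for power. A noisy slope suggests that the instruction mix alone does not explain the measurements.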

@kenzhang82
Author

Which benchmarks? Do you mean running them on the cycle-accurate C++ emulator? And how do we measure the power consumption of a software algorithm running on a RISC-V processor?

@ben-k
Member

ben-k commented Aug 11, 2017

You would need some sort of RTL-based power model. The details are something of an open research question, so there's not going to be a push-button answer here.

@kenzhang82
Author

Cool, thanks!
