How is the average memory access latency measured or calculated? #11

Closed
learning-chip opened this issue Dec 28, 2022 · 2 comments

Comments

@learning-chip
Contributor

In the HDagg paper, the "Executor Evaluation" section states that "The average memory access latency is used as a metric to measure locality." and that "PAPI’s performance counters are used to measure architecture information needed in computations related to the locality and load balance metrics."

In the sptrsv_profiler.cpp example, the PAPI event list is:

std::vector<int> event_list = {PAPI_L1_DCM, PAPI_L1_TCM, PAPI_L2_DCM,
                               PAPI_L2_DCA, PAPI_L2_TCM, PAPI_L2_TCA,
                               PAPI_L3_TCM, PAPI_L3_TCA,
                               PAPI_TLB_DM, PAPI_TOT_CYC, PAPI_LST_INS,
                               PAPI_TOT_INS, PAPI_REF_CYC, PAPI_RES_STL,
                               PAPI_BR_INS, PAPI_BR_MSP};

How is the memory latency obtained from the counters above?

@cheshmi
Collaborator

cheshmi commented Dec 28, 2022

It is based on the average memory cycle defined in the computer architecture book (see page 75).
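
For reference, the standard textbook definition (hit time plus miss rate times miss penalty) can be expanded recursively for a three-level cache hierarchy. The symbols are shorthand introduced here for illustration, not notation from the paper or the book: $t_{L_i}$ is the access time of cache level $i$, $m_{L_i}$ is the local miss rate of that level (its misses divided by the accesses that reach it), and $t_{\mathrm{mem}}$ is the main-memory access time.

```latex
\mathrm{AMAT} = t_{L1} + m_{L1}\Bigl(t_{L2} + m_{L2}\bigl(t_{L3} + m_{L3}\, t_{\mathrm{mem}}\bigr)\Bigr)
```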

PAPI does not expose every counter you will need, and the available counters vary by architecture. We used something like the following:

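def compute_memory_cycle_for_one_group(row, arch_params):
    # row: one group of profiler results (e.g. a pandas DataFrame), one column per PAPI counter
    # arch_params: access cost of each memory level on the target machine
    # Miss counts at each cache level
    dl1_miss = row['PAPI_L1_DCM'].values
    dl2_miss = row['PAPI_L2_DCM'].values
    dl3_miss = row['PAPI_L3_TCM'].values
    # Load/store instructions stand in for the number of L1 data accesses
    dl1_access = row['PAPI_LST_INS'].values
    # Local miss rate of each level: its misses / the accesses that reach it
    l1_mr = dl1_miss / dl1_access
    l2_mr = dl2_miss / dl1_miss
    l3_mr = dl3_miss / dl2_miss
    l1_access_cost = arch_params['L1_ACCESS_TIME']
    l2_access_cost = arch_params['L2_ACCESS_TIME']
    l3_access_cost = arch_params['L3_ACCESS_TIME']
    mm_access_cost = arch_params['MAIN_MEMORY_ACCESS_TIME']
    # Average cost of a single memory access, expanded level by level
    avg_mem_cycle = l1_access_cost + l1_mr * (l2_access_cost + l2_mr * (l3_access_cost + l3_mr * mm_access_cost))
    # Total estimated memory cycles for the group (per-access average times access count)
    exec_cycle = avg_mem_cycle * dl1_access
    return exec_cycle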

You will need some architecture parameters. You can improve the code by finding more accurate counters.
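
A minimal, self-contained sketch of the same calculation, returning the per-access average rather than the total. Several things here are assumptions and do not come from the HDagg code: the latency values in `ARCH_PARAMS` are made-up placeholders (real values depend on the CPU), the counter values are invented for illustration, and the helper `safe_ratio` is a hypothetical addition that guards against division by zero when a level records no misses.

```python
import numpy as np

# Hypothetical access latencies in cycles; replace with values for your CPU.
ARCH_PARAMS = {
    "L1_ACCESS_TIME": 4,
    "L2_ACCESS_TIME": 12,
    "L3_ACCESS_TIME": 40,
    "MAIN_MEMORY_ACCESS_TIME": 200,
}


def safe_ratio(num, den):
    """Element-wise num / den, yielding 0 wherever den is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)


def average_memory_latency(counters, arch):
    """Estimated cycles per memory access, one value per profiled run."""
    # Load/store instruction count is used as a proxy for L1 data accesses.
    l1_mr = safe_ratio(counters["PAPI_L1_DCM"], counters["PAPI_LST_INS"])
    l2_mr = safe_ratio(counters["PAPI_L2_DCM"], counters["PAPI_L1_DCM"])
    l3_mr = safe_ratio(counters["PAPI_L3_TCM"], counters["PAPI_L2_DCM"])
    return arch["L1_ACCESS_TIME"] + l1_mr * (
        arch["L2_ACCESS_TIME"] + l2_mr * (
            arch["L3_ACCESS_TIME"] + l3_mr * arch["MAIN_MEMORY_ACCESS_TIME"]
        )
    )


# Invented counter values for two runs; the second run never misses in L1.
counters = {
    "PAPI_LST_INS": [1000, 2000],
    "PAPI_L1_DCM": [100, 0],
    "PAPI_L2_DCM": [50, 0],
    "PAPI_L3_TCM": [25, 0],
}

latency = average_memory_latency(counters, ARCH_PARAMS)
print(latency)
```

With these placeholder numbers, a run that never misses in L1 reduces to the L1 access time, which is a quick sanity check on the nesting of the formula.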

@learning-chip
Contributor Author

Thanks, that makes sense! I see that VTune also provides an average latency metric, but it's good to calculate and verify it from scratch.
