
Got completion with error #11

Closed · Dicridon opened this issue Sep 20, 2022 · 18 comments

Comments

@Dicridon

Hi, thanks for open-sourcing Sherman. We are happy to be able to run it on our cluster and learn more about the system.

The issue

We encountered a protection error and a deadlock when running multithreaded, multi-machine benchmarks.

Instructions executed

We run the following commands on each machine for the multithreaded, multi-machine benchmark, which produces runtime errors. The Memcached server is on a third machine.

./hugepage.sh
./restartMemc.sh
./benchmark 2 100 4

We run the following commands for the single-thread, single-machine benchmark, which works well:

./hugepage.sh
./restartMemc.sh
./benchmark 1 100 1

The total number of huge pages in hugepage.sh is set to 4096 to reduce preparation time, and the huge page size is 2 MiB.
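
For reference, 4096 huge pages of 2 MiB give an 8 GiB pool (8589934592 bytes). Below is a minimal sketch of how such a hugepage-backed buffer is typically reserved on Linux before being registered for RDMA; this is generic code, not Sherman's allocator:

// Minimal sketch, generic Linux code (not Sherman's allocator): reserve an
// 8 GiB hugepage-backed buffer of the kind later registered for RDMA.
// 4096 huge pages * 2 MiB = 8 GiB = 8589934592 bytes.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kSize = 4096ULL * 2 * 1024 * 1024;  // 8 GiB
  void *buf = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (buf == MAP_FAILED) {
    perror("mmap(MAP_HUGETLB)");  // fails if the hugepage pool is too small
    return 1;
  }
  printf("reserved %zu bytes at %p\n", kSize, buf);
  munmap(buf, kSize);
  return 0;
}

If the pool configured by hugepage.sh is smaller than what the process tries to reserve, the mapping typically fails up front, which is easier to diagnose than a later RDMA error.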

Error messages

We were able to run the single-thread benchmark on a single machine, but we encountered the following errors when running the multithreaded, multi-machine tests.
[screenshot: error messages]

Machine configuration

As shown above, RDMA poll failed due to a protection error, and a deadlock was detected. We are not sure whether this is caused by a wrong hardware configuration or by software bugs. The machine configuration is as follows:
[screenshot: machine configuration]
The hardware configuration seems to meet the requirements of Sherman (OFED version and firmware version).

Analysis

The protection error is caused by access to an invalid memory region, but we are not sure whether it comes from software bugs or a wrong hardware setup. The deadlock error is also confusing because the benchmarks are read-only. Can you give us some tips for debugging these errors?
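
For context on what the "protection error" means at the verbs level: it surfaces as a failed work completion when the completion queue is polled. A minimal libibverbs sketch with illustrative names, not Sherman's code:

// Minimal sketch, generic libibverbs (not Sherman's code): a protection or
// access error shows up as a failed work completion when polling the CQ.
#include <infiniband/verbs.h>
#include <cstdio>

void drain_cq(ibv_cq *cq) {
  ibv_wc wc;
  while (ibv_poll_cq(cq, 1, &wc) > 0) {
    if (wc.status != IBV_WC_SUCCESS) {
      // e.g. IBV_WC_LOC_PROT_ERR or IBV_WC_REM_ACCESS_ERR: the local or remote
      // HCA rejected the access, typically an address outside a registered
      // memory region or a region lacking the required access flags.
      fprintf(stderr, "wr_id %llu failed: %s (%d)\n",
              (unsigned long long)wc.wr_id,
              ibv_wc_status_str(wc.status), wc.status);
    }
  }
}

The status string and code printed this way usually narrow the failure down to either a bad address/rkey or missing access permissions.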

@Transpeptidase
Collaborator

Hi, ./restartMemc.sh only needs to be executed once before each run: execute it on machine 1, but not on machine 2.
Besides, memcached consumes almost no system resources, so you can co-locate it with the Sherman processes.

@Dicridon
Author

Thanks for your quick reply; we can now run Sherman successfully!

@Dicridon Dicridon reopened this Sep 20, 2022
@Dicridon
Author

Sorry for reopening this issue, but when running the multi-machine benchmark we get the following errors once the thread number exceeds 4:
on machine 0
[screenshot: errors on machine 0]
on machine 1
[screenshot: errors on machine 1]

And if we start the two servers at almost the same time, we get an assertion failure: Assertion page->hdr.sibling_ptr != GlobalAddress::Null() failed

@Transpeptidase
Collaborator

Can you provide a screenshot of the entire test?

@Dicridon
Author

Sorry for my late reply.
[screenshot: output of the entire test]

@Transpeptidase
Collaborator

Transpeptidase commented Sep 21, 2022

I cannot see the complete output of server 0 (right part of the screenshot).

@Dicridon
Author

The missing part is below:
[screenshot: remaining output of server 0]
The "registering 8589934592 memory region" lines are output we added to observe Sherman's execution (these outputs are too long and repetitive, so I can't capture them all).

@Transpeptidase
Collaborator

Can you check if the error is triggered when performing

auto root_addr = dsm->alloc(kLeafPageSize);

or
dsm->write_sync(page_buffer, root_addr, kLeafPageSize);

?

@Dicridon
Author

bool res = dsm->cas_sync(root_ptr_ptr, 0, root_addr.val, cas_buffer);

The above line triggers the error.
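
One generic thing worth checking when specifically a remote CAS fails with a protection error (whether it applies here is only a guess): the target memory region must have been registered with remote-atomic access, otherwise the NIC rejects the CAS even though plain reads and writes succeed. A minimal libibverbs sketch, not Sherman's code:

// Minimal sketch, generic libibverbs (not Sherman's code): a remote CAS only
// succeeds if the target buffer was registered with remote-atomic permission.
#include <infiniband/verbs.h>
#include <cstddef>

ibv_mr *register_for_remote_cas(ibv_pd *pd, void *buf, size_t len) {
  int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
               IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_ATOMIC;
  return ibv_reg_mr(pd, buf, len, access);  // returns nullptr on failure
}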

@Transpeptidase
Collaborator

Is it OK when the number of threads is 2?
Can you print the information of related variables?

@Dicridon
Author

Dicridon commented Sep 21, 2022

Unfortunately, the 2-thread benchmark currently fails too, and the error messages are the same (I wonder whether I should reboot the machines after each run?).
I have the following variables with -O0 optimization:

[screenshot: variable values under -O0]
The hex value of root_addr.val is 0x20000000001, which doesn't look like a valid value.
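
As a quick sanity check, 0x20000000001 can be decoded under an assumed split of a 64-bit global address into a node ID and an offset; the field order and widths below are guesses, not necessarily Sherman's actual GlobalAddress layout:

// Hedged sketch: decode 0x20000000001 under two assumed nodeID/offset splits.
// The real field order and widths in Sherman's GlobalAddress may differ.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t val = 0x20000000001ULL;
  // Interpretation A: low 16 bits = nodeID, high 48 bits = offset.
  printf("A: nodeID=%llu offset=0x%llx\n",
         (unsigned long long)(val & 0xffff),
         (unsigned long long)(val >> 16));
  // Interpretation B: high 16 bits = nodeID, low 48 bits = offset.
  printf("B: nodeID=%llu offset=0x%llx\n",
         (unsigned long long)(val >> 48),
         (unsigned long long)(val & 0xffffffffffffULL));
  return 0;
}

Under interpretation A the value would mean node 1 at a 32 MiB offset, which is not obviously invalid; under B it would be node 0 at an offset of roughly 2 TiB, far beyond an 8 GiB region. Either way this is only a guess at the encoding.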

@Transpeptidase
Collaborator

How about a single thread on each machine? Please check the RDMA network state by running ibv_write_bw.

@Dicridon
Author

Running single-thread benchmarks is sometimes OK but occasionally produces the same error.

ibv_write_bw works fine and our own programs also work.

@Dicridon
Author

This issue is weird because we successfully ran the multithreaded benchmark on two machines once, but currently it doesn't work. Maybe it is due to some machine-state issue?

@Transpeptidase
Collaborator

Can you insert while(true) {} after

tree = new Tree(dsm);

?
Let's check if these two servers can init the tree successfully

@Dicridon
Author

Sorry for the very late reply; I'm currently busy with another project.
The two servers can init the tree successfully after adding the loop.

@Transpeptidase
Collaborator

Hi, can you send your WeChat ID to q-wang18@mails.tsinghua.edu.cn? We can communicate more efficiently through WeChat.

@Dicridon
Author

Thank you so much for your help and I've sent my ID to you.
