
Got completion with error #11

Closed · Dicridon opened this issue Sep 20, 2022 · 18 comments

Comments

@Dicridon

Hi, thanks for open-sourcing Sherman. We are happy to be able to run it on our cluster and learn more about the system.

The issue

We encountered a protection error and a deadlock when running multithreaded, multi-machine benchmarks.

Instructions executed

We run the following commands on each machine for the multithreaded, multi-machine benchmark, which produces runtime errors. The Memcached server is on a third machine.

./hugepage.sh
./restartMemc.sh
./benchmark 2 100 4

We run the following commands for the single-thread, single-machine benchmark, which works well:

./hugepage.sh
./restartMemc.sh
./benchmark 1 100 1

The total number of huge pages in hugepage.sh is set to 4096 to reduce preparation time, and the huge page size is 2 MiB.
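
For reference, 4096 huge pages of 2 MiB give an 8 GiB pool (8589934592 bytes). Below is a minimal sketch of how such a hugepage-backed buffer is typically reserved on Linux before being registered for RDMA; this is generic code, not Sherman's allocator:

// Minimal sketch, generic Linux code (not Sherman's allocator): reserve an
// 8 GiB hugepage-backed buffer of the kind later registered for RDMA.
// 4096 huge pages * 2 MiB = 8 GiB = 8589934592 bytes.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kSize = 4096ULL * 2 * 1024 * 1024;  // 8 GiB
  void *buf = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (buf == MAP_FAILED) {
    perror("mmap(MAP_HUGETLB)");  // fails if the hugepage pool is too small
    return 1;
  }
  printf("reserved %zu bytes at %p\n", kSize, buf);
  munmap(buf, kSize);
  return 0;
}

If the pool configured by hugepage.sh is smaller than what the process tries to reserve, the mapping typically fails up front, which is easier to diagnose than a later RDMA error.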

Error messages

We were able to run the single-thread benchmark on a single machine, but we encountered the following errors when running the multithreaded, multi-machine tests.
[screenshot: error messages]

Machine configuration

As shown above, RDMA poll failed due to a protection error, and a deadlock was detected. We are not sure whether this is caused by a wrong hardware configuration or by software bugs. The machine configuration is as follows:
[screenshot: machine configuration]
The hardware configuration seems to meet the requirements of Sherman (OFED version and firmware version).

Analysis

The protection error is caused by access to an invalid memory region, but we are not sure whether it comes from software bugs or a wrong hardware setup. The deadlock error is also confusing because the benchmarks are read-only. Can you give us some tips for debugging these errors?
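
For context on what the "protection error" means at the verbs level: it surfaces as a failed work completion when the completion queue is polled. A minimal libibverbs sketch with illustrative names, not Sherman's code:

// Minimal sketch, generic libibverbs (not Sherman's code): a protection or
// access error shows up as a failed work completion when polling the CQ.
#include <infiniband/verbs.h>
#include <cstdio>

void drain_cq(ibv_cq *cq) {
  ibv_wc wc;
  while (ibv_poll_cq(cq, 1, &wc) > 0) {
    if (wc.status != IBV_WC_SUCCESS) {
      // e.g. IBV_WC_LOC_PROT_ERR or IBV_WC_REM_ACCESS_ERR: the local or remote
      // HCA rejected the access, typically an address outside a registered
      // memory region or a region lacking the required access flags.
      fprintf(stderr, "wr_id %llu failed: %s (%d)\n",
              (unsigned long long)wc.wr_id,
              ibv_wc_status_str(wc.status), wc.status);
    }
  }
}

The status string and code printed this way usually narrow the failure down to either a bad address/rkey or missing access permissions.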

@Transpeptidase
Collaborator

Hi, ./restartMemc.sh only needs to be executed once before each run: execute it on machine 1, but not on machine 2.
Besides, memcached consumes almost no system resources, so you can co-locate it with the Sherman processes.

@Dicridon
Author

Thanks for your quick reply; we can now run Sherman successfully!

@Dicridon Dicridon reopened this Sep 20, 2022
@Dicridon
Author

Sorry for reopening this issue, but when running the multi-machine benchmark we get the following errors once the thread number exceeds 4:
on machine 0
[screenshot: errors on machine 0]
on machine 1
[screenshot: errors on machine 1]

And if we start the two servers at almost the same time, we get an assertion failure: Assertion page->hdr.sibling_ptr != GlobalAddress::Null() failed

@Transpeptidase
Collaborator

Can you provide a screenshot of the entire test?

@Dicridon
Author

Sorry for my late reply.
[screenshot: output of the entire test]

@Transpeptidase
Collaborator

Transpeptidase commented Sep 21, 2022

I cannot see the complete output of server 0 (right part of the screenshot).

@Dicridon
Author

The missing part is below:
[screenshot: remaining output of server 0]
The "registering 8589934592 memory region" lines are output we added to observe Sherman's execution (these outputs are too long and repetitive, so I can't capture them all).

@Transpeptidase
Collaborator

Can you check if the error is triggered when performing

auto root_addr = dsm->alloc(kLeafPageSize);

or
dsm->write_sync(page_buffer, root_addr, kLeafPageSize);

?

@Dicridon
Author

bool res = dsm->cas_sync(root_ptr_ptr, 0, root_addr.val, cas_buffer);

The above line triggers the error.
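
One generic thing worth checking when specifically a remote CAS fails with a protection error (whether it applies here is only a guess): the target memory region must have been registered with remote-atomic access, otherwise the NIC rejects the CAS even though plain reads and writes succeed. A minimal libibverbs sketch, not Sherman's code:

// Minimal sketch, generic libibverbs (not Sherman's code): a remote CAS only
// succeeds if the target buffer was registered with remote-atomic permission.
#include <infiniband/verbs.h>
#include <cstddef>

ibv_mr *register_for_remote_cas(ibv_pd *pd, void *buf, size_t len) {
  int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
               IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_ATOMIC;
  return ibv_reg_mr(pd, buf, len, access);  // returns nullptr on failure
}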

@Transpeptidase
Collaborator

Is it OK when the number of threads is 2?
Can you print the information of related variables?

@Dicridon
Author

Dicridon commented Sep 21, 2022

Unfortunately, the 2-thread benchmark currently fails too, and the error messages are the same (I wonder whether I should reboot the machines after each run?).
I have the following variables with -O0 optimization:

[screenshot: variable values under -O0]
The hex value of root_addr.val is 0x20000000001, which doesn't look like a valid value.
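
As a quick sanity check, 0x20000000001 can be decoded under an assumed split of a 64-bit global address into a node ID and an offset; the field order and widths below are guesses, not necessarily Sherman's actual GlobalAddress layout:

// Hedged sketch: decode 0x20000000001 under two assumed nodeID/offset splits.
// The real field order and widths in Sherman's GlobalAddress may differ.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t val = 0x20000000001ULL;
  // Interpretation A: low 16 bits = nodeID, high 48 bits = offset.
  printf("A: nodeID=%llu offset=0x%llx\n",
         (unsigned long long)(val & 0xffff),
         (unsigned long long)(val >> 16));
  // Interpretation B: high 16 bits = nodeID, low 48 bits = offset.
  printf("B: nodeID=%llu offset=0x%llx\n",
         (unsigned long long)(val >> 48),
         (unsigned long long)(val & 0xffffffffffffULL));
  return 0;
}

Under interpretation A the value would mean node 1 at a 32 MiB offset, which is not obviously invalid; under B it would be node 0 at an offset of roughly 2 TiB, far beyond an 8 GiB region. Either way this is only a guess at the encoding.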

@Transpeptidase
Collaborator

How about a single thread on each machine? Please check the RDMA network state by running ibv_write_bw.

@Dicridon
Author

Running single-thread benchmarks is sometimes OK but occasionally produces the same error.

ibv_write_bw works fine and our own programs also work.

@Dicridon
Author

This issue is weird because we successfully ran the multithreaded benchmark on two machines once, but currently it doesn't work. Maybe it is due to some machine-state issue?

@Transpeptidase
Collaborator

Can you insert while(true) {} after

tree = new Tree(dsm);

?
Let's check if these two servers can init the tree successfully

@Dicridon
Author

Sorry for the very late reply; I'm currently busy with another project.
The two servers can init the tree successfully after adding the loop.

@Transpeptidase
Collaborator

Hi, can you send your WeChat ID to q-wang18@mails.tsinghua.edu.cn? We can communicate more efficiently through WeChat.

@Dicridon
Author

Thank you so much for your help and I've sent my ID to you.
