7-men generation #25

New thread for discussing 7-men generation.
The generation looks fine so far. As for discussion, I wish to follow up on my earlier memory bandwidth problem: I have observed via "perf top" that these are the hot spots during the iteration process. It no longer causes slowdowns when I use fewer threads, but this makes me wonder how people pulled this off a few years earlier, when I assume an interconnect a magnitude slower than UPI was required to aggregate physical memory from multiple machines in order to hold the intermediate table: even with a space-optimized indexing scheme, that still cannot justify the performance gap between UPI and such an interconnect. Any insights? |
What slows down your system the most must be the inter-node accesses. It might already help a lot to control what nodes are working on what parts of the table. Then inter-node accesses can be limited to looking up king moves. With 8 nodes, there only need to be inter-node accesses for moves of the white king. As it is now, moves of essentially all pieces can trigger inter-node accesses because a thread will usually be working on non-local memory. The Lomonosov generator runs on a cluster with very many nodes and 8 cores per node. I suppose each node holds at least one KK slice and inter-node accesses are somehow queued and performed in batches. Some links: I think the Lomonosov generator uses 2 bytes per position during generation. That avoids the need for reduction steps (which are more painful when generating DTM tables). Earlier approaches used less RAM and a lot more time. |
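A minimal sketch of the per-node slicing idea described above, assuming libnuma and simple round-robin placement of KK slices on nodes; "table", "slice_size" and "NUM_SLICES" are illustrative names, not the generator's own:

```c
/*
 * Hedged sketch (not the actual rtbgen code): bind each KK slice of the
 * in-memory table to a fixed NUMA node, so that apart from white-king
 * moves a worker thread mostly touches local memory.
 */
#include <numa.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_SLICES 462            /* legal KK placements under symmetry */

extern uint8_t *table;            /* assumed: base of the full table */
extern size_t slice_size;         /* assumed: bytes per KK slice */

static void bind_slices_to_nodes(int num_nodes)
{
    for (int s = 0; s < NUM_SLICES; s++) {
        int node = s % num_nodes;  /* simple round-robin placement */
        numa_tonode_memory(table + (size_t)s * slice_size, slice_size, node);
    }
}
```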
I think it makes sense to make the generator "NUMA aware" during generation, so I will start working on it. How much time does a permutation step take on your machine? The compression algorithm does many passes over the data, but it is very sequential, so it should scale quite well to a big machine. Making it NUMA aware should still help, but it might not be crucial. The other thing, of course, is to take into account same-piece symmetry. |
The permutation step of DTZ usually takes about 5 minutes on a 4v3 table. Time usage example:
The most time-consuming steps are the first few iterations (Iteration 1~3). The first save for reduction, if it reaches "Iteration 62...", can take some 30 minutes; the following saves are usually faster. |
I am using interleaved memory allocation and not binding threads (letting the OS re-balance automatically); it turns out to be the fastest way. Excessive threads just made everything significantly slower, so the symptom looked like OS page migration was the killer, which it wasn't. |
I have now created a numa branch. Only rtbgen works for the moment. This might allow you to run the generator with many more threads without slowing down. It supports 2, 4 and 8 nodes. I think you will have to disable "node interleaving" in the BIOS. (If I understand you correctly, you have it enabled now.) |
I will test it in a moment. In the BIOS there is no option that controls interleaving memory across NUMA nodes (only per-socket interleaving, if I read correctly). Manual Pg.77 This behavior was controlled by invoking "numactl --interleave=all" on the OS side; by observing "numactl -H" I can see interleaved allocation across nodes, which eventually gets migrated to busy nodes by the OS. So I will run it without memory interleaving and/or with the NUMA re-balancer disabled on the OS side and see how it goes. |
Ah I see, then no need for a reboot. I think some mainboards have an option to just interleave all memory over all nodes and to hide the NUMAness from the OS. It might still help to run with numactl --interleave=all, for example if the numa-aware rtbgen now does worse in the permutation and compression steps. |
Oh I forgot to say that you need to add the "-n" option to enable NUMA. |
I have figured them out. And good news, it does not cause any slowdowns. Some timings:
One exception is the first "broken" init step, during which the OS tries to migrate memory pages to the appropriate nodes, upgrade them to huge pages and/or handle parallel disk access, but I think it is not worth optimizing. Also, I have measured memory bandwidth and latency under load: local 106 GB/s (87 ns), remote 16 GB/s (160 ns~220 ns). So this brings a very big performance improvement. Cheers! |
Blazing fast confirmed, even with HT on.
I solved the slow first step problem by "cat *.rtbw > /dev/null". |
Excellent! The slow initialisation reminds me of a problem that Cfish had on NUMA machines when it used numa_alloc_interleaved() for allocating the TT. I changed that to letting each node fault in a part of the TT. It seems that numa_tonode_memory() also makes the initial allocation slow. I could try the same as I did for Cfish, but I wonder if the kernel might then start migrating pages between nodes. |
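A sketch of the per-node fault-in approach mentioned for the Cfish TT, assuming the usual Linux first-touch placement policy; the struct and function names here are made up for illustration, and it assumes at most 64 nodes:

```c
/*
 * The thread bound to node n touches only its slice, so first-touch
 * policy places those pages on node n. Modelled on the Cfish idea
 * described above, not taken from it.
 */
#include <numa.h>
#include <pthread.h>
#include <string.h>
#include <stddef.h>

struct fault_arg { char *base; size_t len; int node; };

static void *fault_in(void *p)
{
    struct fault_arg *a = p;
    numa_run_on_node(a->node);    /* execute on the target node */
    memset(a->base, 0, a->len);   /* first touch allocates local pages */
    return NULL;
}

static void fault_in_per_node(char *mem, size_t total, int nodes)
{
    pthread_t tid[64];
    struct fault_arg arg[64];
    size_t part = total / nodes;
    for (int n = 0; n < nodes; n++) {
        arg[n] = (struct fault_arg){ mem + n * part,
                                     n == nodes - 1 ? total - n * part : part,
                                     n };
        pthread_create(&tid[n], NULL, fault_in, &arg[n]);
    }
    for (int n = 0; n < nodes; n++)
        pthread_join(tid[n], NULL);
}
```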
Btw, please check that the generated tables are correct, just to make sure I made no mistake. You could test this with a 6-piece table, for example. |
That 7 second init was from a fresh reboot + *.rtbw in system cache, so I guess it is not a problem now. Tested a 6-piece table with the numa branch: generator stats are correct, compressed files are correct, but compression iterations became very slow. |
I overlooked the 7s initialisation of the second run. For speeding up the capture calculations it might help to precache the relevant *.rtbw files. |
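One hedged way to precache the relevant *.rtbw files from a program, rather than with "cat *.rtbw > /dev/null", would be posix_fadvise; the function name precache_file is just an illustration:

```c
/* Ask the kernel to read a table file into the page cache ahead of time. */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

static int precache_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) == 0)
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
    close(fd);
    return 0;
}
```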
Did compression get slower while using the -d option? This could probably be improved by letting the threads do the compression passes in a sort of random order instead of linearly from start to end. Even better might be to rebind the memory to the nodes and let each thread work on local memory. But it might be too expensive to migrate all those pages. |
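A sketch of a staggered-order variant of that idea: each thread starts its compression pass at a different offset so that concurrent accesses are spread over the nodes instead of all threads hammering the start of the table. NUM_CHUNKS, compress_chunk and the claiming scheme are assumptions, not the generator's code:

```c
#include <stdatomic.h>

#define NUM_CHUNKS 4096

extern void compress_chunk(int chunk);        /* assumed per-chunk work */

/* claimed[] must be cleared before each pass */
static _Atomic unsigned char claimed[NUM_CHUNKS];

static void compression_worker(int thread_id, int num_threads)
{
    /* Start each thread at a different point and wrap around. */
    int start = (int)(((long)thread_id * NUM_CHUNKS) / num_threads);
    for (int i = 0; i < NUM_CHUNKS; i++) {
        int chunk = (start + i) % NUM_CHUNKS;
        /* the first thread to reach a chunk claims it exactly once */
        if (!atomic_exchange(&claimed[chunk], 1))
            compress_chunk(chunk);
    }
}
```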
The last checking run was not using the -d option; with it there is the same slowdown (perf points at replace_pairs_u8). The situation on the other machine that is building pawnful tables is rather different: there it has zero slowdowns without NUMA partitioning while running with all cores (4 nodes), and I can't figure out why. |
How many threads are you using on the other machine? |
88 threads. E5-4669 v4; actually running with 176 threads works just fine too, whereas the newer machine suffers on more than 64 threads no matter how I fiddle, unless it runs the NUMA branch. And following up on the previous compression slowdown: it is the same with a 7-piece table, where the memory is allocated anew. I'm wrapping "run_threaded" with a thread-limit parameter and then skipping the while loop in "worker" for threads beyond the limit, applied only to the compression step; the results look good and the tables built are correct (see the sketch below). |
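A rough sketch of that thread-limit wrapper; the signature of run_threaded and the worker body are assumed here, not the generator's actual interface:

```c
/* assumed signature, standing in for the generator's run_threaded() */
extern void run_threaded(void (*func)(int), long *work, int report);

static int thread_limit;        /* 0 means "no limit" */

/* Example worker body: threads beyond the limit skip the work loop. */
static void limited_worker_body(int thread_id)
{
    if (thread_limit && thread_id >= thread_limit)
        return;
    /* ... normal work loop: grab work units and process them ... */
}

/* Wrapper used only for the compression step,
 * e.g. run_threaded_limited(limited_worker_body, work, 0, 64); */
static void run_threaded_limited(void (*func)(int), long *work,
                                 int report, int limit)
{
    thread_limit = limit;
    run_threaded(func, work, report);
    thread_limit = 0;
}
```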
Generation is now about 10x faster, permutation 5x (5 min -> 1 min), and compression remains the same because I limited its threads. |
10x faster is very good! Improving pawnful generation will be a bit more tricky. I think I will first NUMA-ify generation with 1 pawn and then use the same technique as in the ppppp branch for multiple pawns. I may also have a look at the reduction step. |
I don't know if the v4 is supposed to have higher bandwidth than the newer generation, but it makes sense that being NUMA aware is more important on 8 nodes than on 4 nodes. |
Stats verify failed for KQQNvKQB after reconstruction, comparing with master to find out why. |
Strange, the way you described it, it sounded OK. |
See my reference PR, the first version can skip other NUMA queues entirely. I'm regenerating with the third version to be sure. |
I made some changes, which I have tried to explain in the commit message. When running without -d, the generator now makes the permutation step NUMA aware. The reason the compression step is slow with many threads might be that big arrays are being accessed that are in remote memory. It shouldn't be too difficult to allocate those arrays per node, but I have not worked on that yet. I will study your PR later to see if this could be the explanation. For now, you could re-implement the changes in your PR by changing HIGH to LOW where necessary. |
Compression with a high thread count works better than before, with less congestion, but there is still a slowdown, as does transform. It should be problem-free: tables KQQNvKQR and KQQNvKQB were successfully built. |
OK, then I will move those big arrays to local memory for each node. Whether it makes a big difference or not, it cannot hurt. |
I can set them to LOW threads, and LOW can use a bit more than 64 threads without slowdowns; that's good enough, since those steps don't take long anyway. |
Ah, I see. Do you know if the cpus throttle down when this happens? Do they get hot and reduce their clock frequency? (Probably not if the problem lies with the memory subsystem.) |
Nope, it just seemed like (or actually was) taking a few hundred cycles per memory-operation instruction. |
Do you mean the [watchdog/xx] threads? What happens when they trigger? I also wonder a bit if it might be the barriers or perhaps the atomic queue counter that cannot handle so many threads. (Doesn't seem very likely but who knows.) |
There is a potential overflow in the compression code. I don't think it can affect correctness, but it probably can affect compression ratio. |
Just a bunch of "soft lockup" warnings. |
Starting from KQQNPvKQ and KQRBvKRB, will use new generator. |
New compression code works like a charm, now it can run with many threads. |
Nice! I read a bit about soft lockups. Do the warnings come with call traces? It seems the call trace will be empty if the thread got locked up in user space. If it happens in user space, then it's probably the hardware and not a kernel problem. Apparently some (or most) threads never (or at least not for many seconds) get their memory requests served. Strangely I can't seem to find any other reports on this problem. |
I just did a test to see what happens if I let the countfirst[] and countsecond[] counters overflow by making them 16 bit. This results in KQRvKQ.rtbw/z of 5610320 and 11540176 bytes instead of 3095888 and 9245968 bytes. As expected, the tables are still correct. But it seems not too likely that an overflow has happened.

The countfirst and countsecond counters are used to update the pairfreq[][] array after each round of replace_pairs(). pairfreq[a][b] counts the number of times ab occurs in the table. If ab is replaced with c, then xaby becoming xcy means that pairfreq[x][a] and pairfreq[b][y] must be decreased by 1 and pairfreq[x][c] and pairfreq[c][y] must be increased. So countfirst[pair#][x] counts the number of times that xab occurs in the table and countsecond[pair#][y] counts the number of times that aby occurs (where pair# identifies ab).

With 64 threads, each thread processes, for the largest tables, on average about 6 billion values (in the first round of replace_pairs()). Each pair is two values, so on average no pair will be replaced more than 3 billion times. Although there is probably some variation between threads, exceeding 2^32 seems to need very special circumstances. Most "vulnerable" seems to be an almost constant table like 6v1 WDL, but the pieceless 5v1 tables all have a smaller index range due to duplicate pieces... I think you have generated KRBNvKQN.rtbw several times now, each time with an identical result. So all seems fine for now. |
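For illustration, the pairfreq bookkeeping described above could look roughly like this with 64-bit counters (which would rule out the overflow entirely); MAXSYM, the function name and the flattened per-pair rows are assumptions, not the generator's code:

```c
#include <stdint.h>

#define MAXSYM 4096                         /* illustrative symbol limit */

extern uint64_t pairfreq[MAXSYM][MAXSYM];   /* pairfreq[a][b] = #occurrences of ab */

/*
 * After one round replacing pair (a,b) with symbol c: for every context
 * x a b y, countfirst[x] holds how often x preceded the pair and
 * countsecond[y] how often y followed it (the rows for this pair).
 */
static void update_pairfreq(int a, int b, int c,
                            const uint64_t *countfirst,
                            const uint64_t *countsecond)
{
    for (int x = 0; x < MAXSYM; x++) {
        pairfreq[x][a] -= countfirst[x];    /* x a b y : xa pairs vanish */
        pairfreq[x][c] += countfirst[x];    /* ... and xc pairs appear   */
    }
    for (int y = 0; y < MAXSYM; y++) {
        pairfreq[b][y] -= countsecond[y];   /* by pairs vanish */
        pairfreq[c][y] += countsecond[y];   /* cy pairs appear */
    }
}
```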
Yes, I have built KRBNvKQN multiple times and the WDL file is identical, and I have not used any thread count lower than 64 to build the tables. Also, there is an unfortunate disk failure (second time now, bummer) in the pawnful building machine; resuming when the replacement arrives. |
Master can now use libzstd to compress temporary files in parallel. The number of threads can be set in the Makefile. It should be set to a value that makes writing and reading temporary files I/O bound. I have now set it to 6 which is probably higher than necessary with the current compression-level setting. The compression level is now set to 1: Lines 282 to 286 in 99256be
Increasing it doesn't seem to gain a lot in compression ratio but does make compression slower. But feel free to experiment. Each compression thread will take about 20 MB of RAM (two buffers of approximately COPYSIZE bytes). I will now merge this with numa. |
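For reference, a minimal sketch of compressing one temporary-file buffer with libzstd at level 1, roughly as described; COPYSIZE and the buffer handling here are placeholders, only the ZSTD_* calls are real API:

```c
#include <zstd.h>
#include <stdlib.h>
#include <stdio.h>

#define COPYSIZE (10 * 1024 * 1024)   /* illustrative ~10 MB chunk size */

/* Compress src into a freshly allocated buffer; returns compressed size
 * (0 on failure) and hands back the buffer via dst_out. */
static size_t compress_chunk_zstd(const void *src, size_t src_size,
                                  void **dst_out)
{
    size_t bound = ZSTD_compressBound(src_size);
    void *dst = malloc(bound);        /* roughly one extra COPYSIZE buffer */
    if (!dst)
        return 0;
    size_t n = ZSTD_compress(dst, bound, src, src_size, 1 /* level */);
    if (ZSTD_isError(n)) {
        fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(n));
        free(dst);
        return 0;
    }
    *dst_out = dst;
    return n;
}
```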
Merged. Forgot to add that the compression ratio is much better with libzstd compared to LZ4. So this should speed up the reduction steps. |
Starting from KQRNvKRN I will use the new version, the machine has a 16-disk RAID-0 array, so I can probably use some more threads. |
Resuming pawnful table building, that machine runs fine using master branch on 176 threads. |
In the meantime I am working on making the pawn generator NUMA aware. |
The regenerated KQBNvKQB and KQBNvKQN are still missing on Lichess... |
Since you mentioned on talkchess that you will be trying to verify the pawnless tables, are you planning to run tbver on them? (It would take a non-trivial amount of time to run tbver on all of them.) At the moment rtbver won't work with the 16-bit compressed files. Since you're using ECC memory, I don't expect any errors in the pawnless tables. It seems unlikely that the pawnless generator generates incorrect results. The biggest danger is that there is still some internal limit that is exceeded by 7-men TBs but not by 6-men TBs. The good thing is that errors in DTZ tables cannot propagate to other tables. |
OK, the verification I intend to perform at this point is to at least check for the known issues that we had earlier and make sure that I have regenerated every one of them correctly, using the correct generator. Also, I found that some 5v2 tables failed as follows when generating KQRBNvKP: Line 142 in 99256be |
I'm retrying to get more information. Maybe it just takes more than 1 TB of memory without the -d option. |
False alarm, it just ran out of memory on that 1TB RAM machine. |
Yes, it needs just a bit more than 1TB. |
I ran rtbgenp -2 -t 64 -z KRBBPvKQ with the new master and it segfaulted, I guess somewhere in compress_data during the first permutation estimation; reconstruction and its stats verification passed. I will retry without -z and see what happens. |
I just tried and with -z it also crashes for me on e.g. KRPvKQ, but without -z it works. |
Yes, without -z it works(now at file c). |
Great. Hopefully the generated tables will be free of the problems we had with the earlier 16-bit pawnless tables. |
Was generation of KQPPvKPP deferred, or did it fail? |
@sf-x Deferred, because it needs reduction (thus the ppppp branch cannot be used) and it would be too slow to build for now. |
Generation of the full 7-piece set is done. Files available at: ftp://ftp.chessdb.cn/pub/syzygy/7men/ Thanks to everyone who helped during the process. |
👊👍 Great and much appreciated work! |
Fascinating thread between two masters :-) |