Merged
66 commits
73d78e7
Compile time classification
Feb 19, 2017
73511f0
Refactored combinatorials out of main
Feb 19, 2017
90f0c25
Refactoring of SWAR operations
Feb 19, 2017
9c8a465
Refactoring of SWAR part 2
Feb 19, 2017
ffe229b
Completes generic implementation of popcount
Feb 19, 2017
2b41569
Renamed file
Feb 19, 2017
74f3907
Changes
Feb 19, 2017
74a795a
Start of classification function
Feb 19, 2017
7887806
Counting function
Feb 21, 2017
9cef393
Corrects greater equal
Feb 21, 2017
15176d0
Implements straight
Feb 21, 2017
28af468
Almost complete implementation of winner
Feb 21, 2017
7c17cde
Benchmarking
Feb 22, 2017
cea30ea
Refactoring
Feb 22, 2017
24c570d
Refactoring 2
Feb 22, 2017
552456e
File copy Poker_io.h
Feb 22, 2017
2e30001
Refactoring
Feb 22, 2017
e17c553
Renaming Numbers to Ranks
Feb 22, 2017
7e2535d
Benchmarks
Feb 23, 2017
b84e252
Refactoring
Feb 23, 2017
664bdc6
Adding inc/ep/Classifications.h from Poker.h
Feb 23, 2017
7713cce
Refactoring
Feb 23, 2017
4f07aff
Refactoring
Feb 23, 2017
760dd79
inc/ep/CascadeComparisons.h from Poker.h
Feb 23, 2017
b4608ca
Refactoring
Feb 23, 2017
7f1608b
Clarification
Feb 23, 2017
f036aec
No comparison function
Feb 23, 2017
6ccd035
Egyptian multiplication algorithm for straights, hand ranking
Feb 23, 2017
665e30b
Corrected known bugs
Feb 24, 2017
dc0414a
First bug corrected by tests
Feb 24, 2017
7b3a6cc
Tests
Feb 24, 2017
f5946f4
Corrects Counted::clearAt and Full House results
Feb 26, 2017
e7689c6
Removes unwanted check
Feb 28, 2017
bb4883e
More tests
Feb 28, 2017
e6d9986
Completes tests, implements progressive subsets, benchmarks hand ranks
Mar 1, 2017
4e29958
Fastest hand ranking among known codebases!
Mar 3, 2017
8bbf2ec
Obsoleted dual representation
Mar 3, 2017
a2a0ff9
Partial implementation of CardSet
Mar 5, 2017
d6e997a
Flushes proved
Mar 5, 2017
30fd8d4
Straights tested
Mar 5, 2017
dcc79f3
Improved performance: clear of boolean is an XOR
Mar 6, 2017
a7ba15d
Tests Ok
Mar 7, 2017
54e227f
Corrected hand rank comparison bug
Mar 7, 2017
68ff97e
Extends Floyd algorithm for preselected subsets
Mar 7, 2017
bdf1a39
Create communities.md
Mar 10, 2017
6e64b00
Update communities.md
Mar 10, 2017
6da52fd
Update communities.md
Mar 10, 2017
061bbe4
Classifications, attempt 2
Mar 14, 2017
601c3a2
Merge branch 'master' of https://github.com/emadrid-at-ccm/singulargy
Mar 14, 2017
4f8bb72
SuitedPocket tested
Mar 15, 2017
52d506e
All pockets tested
Mar 15, 2017
586edd0
Kazone
Mar 30, 2017
982c357
Create README.md
thecppzoo Apr 5, 2017
8c371ef
Update README.md
thecppzoo Apr 5, 2017
2caaf21
Create Floyd-Sampling.md
thecppzoo Apr 6, 2017
0846779
Update Floyd-Sampling.md
thecppzoo Apr 6, 2017
03814c1
Rename Floyd-Sampling.md to Fastest-Floyd-Sampling.md
thecppzoo Apr 6, 2017
76c845f
Create What-is-noak-or-how-to-determine-pairs.md
thecppzoo Apr 6, 2017
fa4a3b5
Update What-is-noak-or-how-to-determine-pairs.md
thecppzoo Apr 6, 2017
07b25f0
Update What-is-noak-or-how-to-determine-pairs.md
thecppzoo Apr 7, 2017
a577c16
Update What-is-noak-or-how-to-determine-pairs.md
thecppzoo Apr 8, 2017
534915c
Create Hand-Ranking.md
thecppzoo Apr 8, 2017
0ac83d2
Update README.md
thecppzoo Sep 18, 2019
b7d1cf7
Add files via upload
thecppzoo Sep 20, 2019
e240d9a
Add 'pokerbotic/' from commit 'b7d1cf765ed2d5839386b36b5e3321387d705c03'
Jan 31, 2024
f56c9b3
Removes root/third_party, makes vscode workspace all relative
Feb 1, 2024
66 changes: 66 additions & 0 deletions pokerbotic/README.md
@@ -0,0 +1,66 @@
# Pokerbotic

## This repository will be changing very soon to incorporate feedback from my CPPCon 2019 presentation; please check back in a couple of weeks

**Pokerbotic** is a poker engine. It has been developed, little by little, by a professional software engineer and a semi-professional poker player with professional knowledge of stochastic processes.

Currently, we have the hand evaluator framework, which achieves, on commonly available machines, a rate of over 100 million evaluations per second: that is, every second it classifies more than 100 million poker hands into their categories ("four of a kind", etc.).

**The code today assumes the AMD64 architecture**, and support of the [BMI2 instructions](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#BMI2_.28Bit_Manipulation_Instruction_Set_2.29). AMD64/Intel is not essential to this code; the necessary adaptations simply have not been made yet. You are welcome to help with this.

Currently, the code is a header-only framework, with some use cases, written in C++14.

This code beats other poker engines, including the popular open source framework "PokerStove", in both ease of use and performance, thanks to the application of Generic Programming.

Generic Programming allows hoisting what would otherwise be run-time computation to compilation time; this is illustrated by the non-trivial `static_assert` in the code itself.
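
As a minimal illustration of this kind of hoisting (a sketch, not the `static_assert` found in the repository), the hypothetical C++14 `constexpr` function `choose` below counts card combinations. Because the `static_assert` conditions are evaluated by the compiler, this computation costs nothing at run time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not from the Pokerbotic sources: binomial coefficient
// evaluated by the compiler when used in a constant context. C++14 relaxed
// constexpr allows the loop and the local mutation.
constexpr uint64_t choose(uint64_t n, uint64_t k) {
    if (k > n) { return 0; }
    if (k > n - k) { k = n - k; }  // use the smaller of k and n - k
    uint64_t result = 1;
    for (uint64_t i = 1; i <= k; ++i) {
        // result == C(n - k + i - 1, i - 1) here, so the division is exact
        result = result * (n - k + i) / i;
    }
    return result;
}

// If either check failed, compilation itself would fail.
static_assert(choose(52, 5) == 2598960, "number of 5-card hands");
static_assert(choose(52, 7) == 133784560, "number of 7-card hands");
```

The same function remains callable with run-time arguments, in which case it runs as ordinary code.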

The documentation for the advanced programming techniques, including the Floyd sampling algorithm and the SWAR techniques, is being written.

## How to build it

### Prerequisites:

1. GCC-compatible compiler. We recommend Clang 3.9 or 4.0 specifically; benchmarks indicate that Clang generates noticeably faster code. The code uses GCC extensions in the form of builtins.
2. C++14. In GCC or Clang, do not forget the option `-std=c++14`.
3. Support for BMI2 instructions, activated with `-march=native` (the preferred way) or specifically with `-mbmi2`.
4. Test cases require the ["Catch" testing framework](https://github.com/philsquared/Catch).
5. Currently, the code does not require a Unix/POSIX operating system (it should be compilable on Windows64 with either GCC or Clang); however, **we reserve the option to make the code incompatible with any operating system**.

### There are several test programs available:

#### Unit tests at [src/main.cpp](https://github.com/thecppzoo/pokerbotic/blob/master/src/main.cpp)

Several unit tests. This program illustrates how to use the engine framework. To build it, at the project root, you may do this:

`clang++ -std=c++14 -Iinc -DTESTS -O3 -march=native -I../Catch/include src/main.cpp -o main`

Notice that you have to define `TESTS` and indicate the path to the "Catch" testing framework.

#### [src/benchmarks.cpp](https://github.com/thecppzoo/pokerbotic/blob/master/src/benchmarks.cpp)

A program that measures the execution speed of several internal mechanisms. To build it, at the project root, you may do this:

`clang++ -std=c++14 -Iinc -DBENCHMARKS -O3 -march=native src/benchmarks.cpp -o benchmarks`

This program can be run without arguments. It will generate all 7-card hands and time the execution of all evaluations.
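
The hands can be generated in several ways. As one possible approach (an assumption, not necessarily what `src/benchmarks.cpp` does), the well-known "Gosper's hack" visits every k-subset of an n-bit set in increasing numeric order; `nextSubset` and `forEachSubset` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Gosper's hack: the next larger integer with the same number of set bits,
// i.e. the next k-subset in colexicographic order.
inline uint64_t nextSubset(uint64_t set) {
    uint64_t lowest = set & (~set + 1);              // lowest set bit
    uint64_t ripple = set + lowest;                  // carry through the lowest run of ones
    uint64_t ones = ((set ^ ripple) >> 2) / lowest;  // leftover ones, moved to the bottom
    return ripple | ones;
}

// Visits every k-subset of {0, ..., n - 1} as a bitfield (requires 0 < k <= n < 64).
template<typename Visitor>
void forEachSubset(int n, int k, Visitor &&visit) {
    const uint64_t end = uint64_t(1) << n;
    for (uint64_t set = (uint64_t(1) << k) - 1; set < end; set = nextSubset(set)) {
        visit(set);
    }
}
```

Calling `forEachSubset(52, 7, visitor)` would visit all 133,784,560 7-card hands.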

#### [src/comparisonBenchmark.cpp](https://github.com/thecppzoo/pokerbotic/blob/master/src/comparisonBenchmark.cpp)

This program generates, as in Texas Hold'em Poker, all possible sets of 5 community cards, and then iterates over all 2-card "pocket cards" of two players.

Because of the size of this search space, this program emits a running tally every 100 million cases.
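
For a sense of scale, here is a rough count, assuming each case is one board plus one ordered assignment of pockets to the two players, drawn from the 47 cards left after the board (if the program treats the two pockets as unordered, the total halves):

$$
\binom{52}{5}\binom{47}{2}\binom{45}{2} = 2\,598\,960 \times 1\,081 \times 990 = 2\,781\,381\,002\,400
$$

At one tally per $10^8$ cases, that would be roughly 27,800 progress reports for a full run.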

To build, for example:

`clang++ -std=c++14 -Iinc -DHAND_COMPARISON -O3 -march=native -o cb src/comparisonBenchmark.cpp`

Can be run without arguments.

## Next feature to be implemented

Currently, multithreaded partitioning of evaluations is being implemented.

## Documentation/User manual

Not yet written. Most of the code available under the folder `ep/` is fully operational.

Binary file added pokerbotic/SWAR CPPCon 2019 - reduced.key
Binary file not shown.
35 changes: 35 additions & 0 deletions pokerbotic/design/Fastest-Floyd-Sampling.md
@@ -0,0 +1,35 @@
The Floyd sampling algorithm (you can see an excellent exposition [here](http://www.nowherenearithaca.com/2013/05/robert-floyds-tiny-and-beautiful.html)) is very convenient for use cases such as drawing a hand of cards from a deck.

The fastest way to represent sets over small, finite domains (such as a deck of cards) seems to be as bits in bitfields.

For a deck of 52 cards, for example, we may want to draw random 7-card hands. I wrote a straightforward implementation [here](https://github.com/thecppzoo/pokerbotic/blob/master/inc/ep/Floyd.h). Its interface is this:

```c++
template<int N, int K, typename Rng>
inline uint64_t floydSample(Rng &&g);
```

With speed in mind, the sizes of the set and of the subset are template parameters. Compilers such as Clang and GCC routinely generate optimal code for the given sizes, as can be seen in the Compiler Explorer; that is, they take advantage of those sizes being compile-time constants. The return value is the subset, expressed as the bits set within the least significant N bits of the resulting integer.
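
For readers unfamiliar with the algorithm, here is a minimal sketch of Floyd sampling on a bitfield with the same shape of interface. It is not the code in `inc/ep/Floyd.h`; the name `floydSampleSketch` is hypothetical, and it assumes `N <= 64` and a standard `<random>` engine.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Floyd's subset sampling: returns K distinct elements of {0, ..., N - 1}
// as the bits set in the result.
template<int N, int K, typename Rng>
inline uint64_t floydSampleSketch(Rng &&g) {
    static_assert(0 <= K && K <= N && N <= 64, "invalid sizes");
    uint64_t result = 0;
    for (int j = N - K; j < N; ++j) {
        // Pick uniformly in [0, j].
        std::uniform_int_distribution<int> dist(0, j);
        uint64_t bit = uint64_t(1) << dist(g);
        // If the pick is already in the sample, take j instead: j was outside
        // the range of every earlier iteration, so it cannot be taken yet.
        result |= (result & bit) ? uint64_t(1) << j : bit;
    }
    return result;
}
```

Each iteration adds exactly one new element and performs exactly one draw from a distribution, so a 7-card hand takes seven draws.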

However, what if the use case is to generate a sample (subset) of the *remaining* members of the set? For example, to draw two random cards *after* five cards have been selected?

That has been implemented too, in a function with this signature:

```c++
template<int N, int K, typename Rng>
inline uint64_t floydSample(Rng &&g, uint64_t preselected);
```

Here, `preselected` represents the cards already selected. If what is desired is to get two cards from the cards remaining after selecting `fiveAlreadySelected` cards, the call `ep::floydSample<47, 2>(randomGenerator, fiveAlreadySelected)` will suffice. Notice the template argument for `N` is now 47, reflecting the fact that the remaining set of cards has 47 cards. Unfortunately, it is difficult to guarantee at compilation time that the argument `fiveAlreadySelected` indeed has exactly five elements, because operations such as intersection or union result in sets with cardinalities that are fundamentally run-time values.

This overload of `ep::floydSample` requires calling a "deposit" operation. This is an interesting operation that is hard to implement efficiently without direct support from the processor: given a mask, the bits of the input, starting from the least significant, are "deposited" one at a time into the bit positions set in the mask. In the AMD64/Intel architecture (EM64T) this is supported by the instruction set extension "BMI2" as the instruction [`PDEP`](https://chessprogramming.wikispaces.com/BMI2). The implementation of the adaptation of the Floyd algorithm for a known number of preselected elements is then straightforward: subtract the number of preselected elements from the total, call the normal `floydSample`, and deposit the result into the complement of the preselection.
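
A portable sketch of the deposit operation and of the adaptation just described follows. `depositBits`, `floydLowBits` and `floydSampleExcluding` are hypothetical names; on BMI2 hardware, `_pdep_u64` from `<immintrin.h>` would replace the loop in `depositBits`. As in the text, `N` must equal the number of elements not in the preselection; with a 52-card deck in the low 52 bits, the result then stays within the deck.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Software stand-in for PDEP: the low bits of src are deposited, in order,
// into the positions of the set bits of mask.
inline uint64_t depositBits(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t srcBit = 1; mask != 0; srcBit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  // lowest remaining mask bit
        if (src & srcBit) { result |= lowest; }
        mask &= mask - 1;                      // consume that mask bit
    }
    return result;
}

// Plain Floyd sampling of K elements out of {0, ..., N - 1}.
template<int N, int K, typename Rng>
uint64_t floydLowBits(Rng &g) {
    uint64_t result = 0;
    for (int j = N - K; j < N; ++j) {
        uint64_t bit = uint64_t(1) << std::uniform_int_distribution<int>(0, j)(g);
        result |= (result & bit) ? uint64_t(1) << j : bit;
    }
    return result;
}

// Adaptation: sample among the N remaining elements, then spread the sample
// over the positions that are NOT preselected.
template<int N, int K, typename Rng>
uint64_t floydSampleExcluding(Rng &g, uint64_t preselected) {
    return depositBits(floydLowBits<N, K>(g), ~preselected);
}
```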

What are the costs of these implementations?

1. The programmer needs to indicate at compilation time the number of elements in the set. If this number is a runtime value, a `switch` will be needed to convert the runtime value into a compile-time one, which becomes an indexed jump at the assembly level.
2. All of the operations in the normal Floyd sampling algorithm have negligible execution cost compared to calling the random number generator, which is essential in each iteration.
3. The adaptation to account for preselections requires only two more assembly instructions: inverting the preselection and depositing the sample into it. `PDEP` has been measured to have a throughput of one per clock, which is excellent compared to implementing it in software; however, in current processors it can only be executed in a particular pipeline. In Pokerbotic we don't think we are oversubscribing this pipeline, so we suspect we get a 1-per-clock throughput for this use case.
4. However, the adaptation to account for preselections also requires the programmer to accurately indicate the cardinality of the preselection. This can add the same cost as item 1, plus that of a population count, another single-pipeline instruction with a throughput of one per clock.
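
Items 1 and 4 can be illustrated with the following sketch, in which a population count and a `switch` turn a run-time cardinality into a template argument. The names `floydFirstN` and `drawTwoFromRemaining` are hypothetical, the case labels are arbitrary, and the deposit step is left out to keep it short.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Floyd sampling of K elements from {0, ..., N - 1} with compile-time sizes.
template<int N, int K, typename Rng>
uint64_t floydFirstN(Rng &g) {
    uint64_t result = 0;
    for (int j = N - K; j < N; ++j) {
        uint64_t bit = uint64_t(1) << std::uniform_int_distribution<int>(0, j)(g);
        result |= (result & bit) ? uint64_t(1) << j : bit;
    }
    return result;
}

// The size of the remaining set is only known at run time, so a population
// count plus a switch select the instantiation (typically an indexed jump).
// The result is relative to the remaining cards; mapping it back onto the
// deck would need the deposit step, omitted here.
template<typename Rng>
uint64_t drawTwoFromRemaining(Rng &g, uint64_t preselected) {
    switch (__builtin_popcountll(preselected)) {
        case 0: return floydFirstN<52, 2>(g);
        case 3: return floydFirstN<49, 2>(g);
        case 4: return floydFirstN<48, 2>(g);
        case 5: return floydFirstN<47, 2>(g);
        default: return 0;  // cardinalities this sketch does not cover
    }
}
```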

We are interested in any way to implement a faster subset sample selection. This use case is at the heart of many operations in Pokerbotic.

63 changes: 63 additions & 0 deletions pokerbotic/design/Hand-Ranking.md
@@ -0,0 +1,63 @@
# Design of the hand classification mechanism in Pokerbotic

## Detection of N-of-a-kind

Detection of N-of-a-kind, two pairs and full house is described [here](https://github.com/thecppzoo/pokerbotic/blob/master/design/What-is-noak-or-how-to-determine-pairs.md).

## Detection of flush

Flush detection happens at [Poker.h:78](https://github.com/thecppzoo/pokerbotic/blob/master/inc/ep/Poker.h#L78): the hand is filtered by each suit, and the population-count builtin is called on the filtered set. This code assumes a hand can have a flush in only one suit, which is guaranteed for hands of up to 9 cards.
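
A minimal sketch of this kind of flush check, under an assumed layout in which the card of rank `r` and suit `s` is bit `4 * r + s` (consistent with the 4-bit shifts mentioned further down, but not copied from `Poker.h`). The names `card`, `suitMask` and `flushSuit` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: rank 0..12 (deuce..ace), suit 0..3, bit 4 * rank + suit.
constexpr uint64_t card(int rank, int suit) {
    return uint64_t(1) << (4 * rank + suit);
}

// One bit per rank for the given suit: bit `suit` of each of the 13 nibbles.
constexpr uint64_t suitMask(int suit) {
    return uint64_t(0x1'1111'1111'1111) << suit;
}

// Returns the suit in flush, or -1. Like the engine, it assumes at most one
// suit can reach five cards (true for hands of up to 9 cards).
inline int flushSuit(uint64_t hand) {
    for (int suit = 0; suit < 4; ++suit) {
        if (__builtin_popcountll(hand & suitMask(suit)) >= 5) { return suit; }
    }
    return -1;
}
```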

## Detection of straights

The straightforward way to detect straights, if the ranks were consecutive bits (which we will call the "packed representation"), is this:

```c++
unsigned straights(unsigned cards) {
    auto shifted1 = cards << 1;
    auto shifted2 = cards << 2;
    auto shifted3 = cards << 3;
    auto shifted4 = cards << 4;
    return cards & shifted1 & shifted2 & shifted3 & shifted4;
}
```

By shifting and taking the conjunction at the end, the only bits set to one in the result are those set in `cards` together with the four bits immediately below them; that is, the result marks the highest rank of each straight. Before accounting for aces acting as "ace or one", there are possible improvements to discuss:

### Checking for the presence of 5 or 10

In a deck of 13 ranks starting with 2, every straight must contain either a 5 or a 10. A hand with neither of those ranks occurs with a probability of nearly a third; however, testing for this explicitly hurts performance. It seems the branch is fundamentally unpredictable for the processor, so the misprediction penalty outweighs the benefit of the early exit. In the code above there are only 8 binary operators and 4 compile-time constants, so there is little budget for branch mispredictions. Older versions of the code had this check until benchmarks showed it to be a disadvantage.
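
To make "nearly a third" concrete, assuming 7-card hands: a hand has neither a 5 nor a 10 exactly when all 7 cards come from the 44 cards of the other ranks, so

$$
P(\text{no 5 and no 10}) = \frac{\binom{44}{7}}{\binom{52}{7}} = \frac{38\,320\,568}{133\,784\,560} \approx 0.286
$$

That is, the early exit would apply to about 29% of 7-card hands.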

### Checking for partial conjunctions

For the same reason, testing whether any of the partial conjunctions is zero, to return 0 early, is not advantageous either, as confirmed through benchmarking.

### Addition chain

There is one improvement that benchmarks confirm:

```c++
unsigned straights(unsigned cards) {
    // assume the point of view from the bit position for tens.
    auto shift1 = cards >> 1;
    // in shift1 the bit for the rank ten now contains the bit for jacks
    auto tj = cards & shift1;
    auto shift2 = tj >> 2;
    // in shift2, the position for the rank ten now contains the conjunction of the original queen and king bits
    auto tjqk = tj & shift2;
    return tjqk & (cards >> 4);
}
```

This implementation (which does not take into account the ace duality) requires 6 binary operations and 3 constants and accomplishes the same thing as the straightforward implementation, except that it marks the lowest rank of each straight instead of the highest. Benchmarks confirm that it takes roughly 3/4 of the time of the straightforward implementation.
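
The equivalence can be checked exhaustively, since a packed 13-rank set has only 8,192 possible values. In this sketch (function names are hypothetical) the straightforward version marks the highest rank of each straight and the addition-chain version the lowest, so the two agree up to a shift by 4.

```cpp
#include <cassert>

// Straightforward version: bit i is set when ranks i - 4 ... i are all present.
inline unsigned straightsNaive(unsigned cards) {
    return cards & (cards << 1) & (cards << 2) & (cards << 3) & (cards << 4);
}

// Addition-chain version: bit i is set when ranks i ... i + 4 are all present.
inline unsigned straightsChain(unsigned cards) {
    unsigned tj = cards & (cards >> 1);    // runs of 2
    unsigned tjqk = tj & (tj >> 2);        // runs of 4
    return tjqk & (cards >> 4);            // runs of 5
}

// Compares both versions over every packed 13-rank set.
inline bool chainMatchesNaive() {
    for (unsigned cards = 0; cards < (1u << 13); ++cards) {
        if (straightsNaive(cards) != (straightsChain(cards) << 4)) { return false; }
    }
    return true;
}
```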

The key insight here is to view the detection of a straight as adding up to 5 starting with 1. The straightforward implementation does the equivalent of `1 + 1 + 1 + 1 + 1`, while this new implementation does `auto two = 1 + 1; return two + two + 1`. This technique is called building an *addition chain*; it was inspired by the second chapter of the book ["From Mathematics To Generic Programming"](https://www.amazon.com/Mathematics-Generic-Programming-Alexander-Stepanov/dp/0321942043).
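
The addition-chain idea generalizes in the same way as Egyptian multiplication. The sketch below is a generalization for illustration, not Pokerbotic code, and `runsOfLength` is a hypothetical name. It finds the starts of runs of any length by binary decomposition of the length, using the rule that a run of length a + b starts at i when a run of length a starts at i and a run of length b starts at i + a. For length 5 it computes 1 + 4 rather than the 2 + 2 + 1 above; with a constant length and the loop unrolled, it should likewise come down to 6 shifts and conjunctions.

```cpp
#include <cassert>

// Positions where a run of `length` consecutive set bits starts
// (length below 32 for a 32-bit unsigned).
inline unsigned runsOfLength(unsigned bits, unsigned length) {
    unsigned result = ~0u;         // runs of length 0: every position qualifies
    unsigned resultLength = 0;
    unsigned power = bits;         // runs of length powerLength: 1, 2, 4, ...
    unsigned powerLength = 1;
    while (length != 0) {
        if (length & 1) {          // this power of two is part of the length
            result &= power >> resultLength;
            resultLength += powerLength;
        }
        length >>= 1;
        if (length != 0) {         // doubling step: run(2p) = run(p) & (run(p) >> p)
            power &= power >> powerLength;
            powerLength *= 2;
        }
    }
    return result;
}
```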

Accounting for the dual rank of aces simply means turning on the 'ones' bit when the hand has an ace, but this requires a left shift to make room for it. This can be done at the beginning of the straight check, and its cost can be amortized by the compiler doing a conditional move early, so that the result is ready by the time it is used.
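
A sketch of the ace handling on the packed representation, assuming bit 0 is the deuce and bit 12 the ace; `withLowAce` and `straightsWithAce` are hypothetical names. This version extracts the ace bit arithmetically, which avoids both a branch and a conditional move.

```cpp
#include <cassert>

// Shift every rank up one position and copy the ace into the new bit 0,
// so that the ace also plays as a "one".
inline unsigned withLowAce(unsigned ranks) {
    return (ranks << 1) | ((ranks >> 12) & 1u);
}

// Addition-chain check on the widened set: bit i of the result marks a
// straight whose lowest card is at position i (0 = ace low, 1 = deuce, ...).
inline unsigned straightsWithAce(unsigned ranks) {
    unsigned cards = withLowAce(ranks);
    unsigned two = cards & (cards >> 1);
    unsigned four = two & (two >> 2);
    return four & (cards >> 4);
}
```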

There is one further complication in the code: the engine uses the rank-array representation. As long as the shifts are by 4, 8, 12 and 16 bits instead of 1, 2, 3 and 4, there is no difference yet. There are two needs for straights:

1. Normal straights, in which the suit of the rank does not matter. This is accomplished by computing the 13 rank counts as described in how to detect pairs, etc., and applying the SWAR operation `greaterEqual<1>` before the straight code. Naturally, straights do not incur an extra cost for population counts, because that cost is amortized in the detection of pairs, three of a kind, etc., which needs them anyway. The `greaterEqual<N>(arg)` operation requires two constants and two or three assembly operations, depending on how the result is used; thus, for practical purposes, it has negligible cost compared to a packed rank representation.
2. Straights to detect straight flush: Since the bits for

We suspect our straight-detection code is optimal in terms of performance.
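
As an illustration of normal straights (item 1) on a rank-array representation, here is a sketch assuming each rank owns a nibble, with the card of rank `r` and suit `s` at bit `4 * r + s`. The names are hypothetical, and `greaterEqualSketch` is only a guess at how an operation like `greaterEqual<N>` can be built with two constants and two operations; the ace-low case and straight flushes are not covered.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kLowBits = 0x1'1111'1111'1111;  // bit 0 of each of 13 nibbles
constexpr uint64_t kHighBits = kLowBits << 3;       // bit 3 of each nibble

// SWAR popcount per nibble: each nibble ends up holding its rank count, 0..4.
inline uint64_t rankCounts(uint64_t cards) {
    uint64_t pairs = cards - ((cards >> 1) & 0x5555555555555555ull);
    return (pairs & 0x3333333333333333ull) + ((pairs >> 2) & 0x3333333333333333ull);
}

// Counts never exceed 4, so adding 8 - N cannot carry across nibbles, and
// bit 3 of a nibble ends up set exactly when its count is at least N.
template<int N>
uint64_t greaterEqualSketch(uint64_t counts) {
    static_assert(1 <= N && N <= 4, "rank counts range over 0..4");
    return (counts + (8 - N) * kLowBits) & kHighBits;
}

// Addition chain with shifts scaled by 4: bit 4 * r + 3 of the result marks
// a straight whose lowest rank is r.
inline uint64_t straightsRankArray(uint64_t cards) {
    uint64_t present = greaterEqualSketch<1>(rankCounts(cards));
    uint64_t two = present & (present >> 4);
    uint64_t four = two & (two >> 8);
    return four & (present >> 16);
}
```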