<!DOCTYPE html>
<html>
<!-- vim:set tw=100 ts=8 sw=4 et :-->
<head>
<title>Creating a Perfect Hash Table - A Performant, Parallel, Random Acyclic-Hypergraph Approach</title>
<meta name="msvalidate.01" content="E828541C73A98C315E3D6B8C88EF6057" />
<meta name="viewport" content="width=device-width, initial-scale=0.65, maximum-scale=1.0" />
<!-- https://www.google.com/fonts#UsePlace:use/Collection:Lato:200,300,300italic -->
<!--
<meta name="viewport" content="width=device-width, min-width=1100px, initial-scale=0.7, maximum-scale=1.0, shrink-to-fit=no" />
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:200,300,300italic">
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Merriweather:300,300i,400,400i">
-->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300,400" rel="stylesheet">
<link rel="stylesheet" href="//oss.maxcdn.com/normalize/3.0.1/normalize.min.css">
<link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css">
<link rel="stylesheet" href="../prism.css">
<link rel="stylesheet" href="../home.css">
<link rel="stylesheet" href="page.css">
<script src="//oss.maxcdn.com/jquery/2.1.1/jquery.min.js"></script>
<script src="../prism.js"></script>
<script src="../home.js"></script>
<script src="page.js"></script>
</head>
<body>
<header class="header">
<div class="header-logo">
<!--
<a class="homename" href="http://trent.me"><strong>T</strong>rent <strong>N</strong>elson</a>
-->
<a class="homename" href=".."><strong>T</strong>rent <strong>N</strong>elson</a>
</div>
<ul class="header-links">
<li><a href="#home"><i class="fa fa-home"></i>Creating a Perfect Hash Table</a></li>
<li><a href="#contents"><i class="fa fa-align-left"></i> Contents</a></li>
<li><a href="https://github.com/tpn/tracer/tree/master/PerfectHashTable" target="_blank"><i class="fa fa-github"></i> GitHub</a></li>
<li><a href="https://twitter.com/trentnelson" target="_blank"><i class="fa fa-twitter"></i> Twitter</a></li>
<!--
<li><a href="https://twitter.com/trentnelson" class="twitter-follow-button" data-show-count="false">Follow @trentnelson</a></li>
-->
</ul>
</header>
<a class="xref" name="home"></a>
<section class="section section-hero">
<div class="container">
<h1>
Creating a Perfect Hash Table
</h1>
<h3>
A Performant, Parallel, Random Acyclic-Hypergraph Approach
</h3>
</div>
</section>
<section class="section section-summary">
<div class="container">
<small>
Current status: <strong>early draft</strong>. Last update: 12th September, 2018.
Target publish date: Tuesday, 25th September, 2018.
<a href="https://github.com/tpn/website/blob/master/perfect-hash-table/index.html">
View this page's source on GitHub.</a>
<h3>TL;DR</h3>
<p>
This article documents a recent assignment regarding the implementation of a
perfect hash table. I discuss the initial goal as a set of requirements, then
capture design decisions and provide an implementation overview.
</p>
</small>
<hr/>
<h2>Requirements</h2>
<p>
Author a perfect hash table component that provides offline table
generation and fast lookup performance. Assume a key and value width of
<code>ULONG</code> (32 bits). Optimize for key set sizes ranging from
10,000 to 40,000 keys on average, with up to 100,000 keys on the high end.
</p>
<p>
Assume a key distribution similar to that of shared library address offsets;
not linear, but definitely not random, either. Do not make assumptions about
the presence of leading or trailing zeros (i.e. alignment), although feel free
to optimize for this if it doesn't detract from the component's performance on
less predictable key sets.
</p>
<p>
Prioritize lookup speed over generation speed. Optimize the lookup algorithm
to minimize latency involved in a single <code>Value = Lookup(Key)</code>
call. Table generation will be done ahead of time, in a separate process, and
only needs to ensure that tables can be generated in a reasonable amount of
time given the size of the input key set. Linear overhead is ideal, quadratic
less so, exponential would be infeasible.
</p>
<p>
Prioritize lookup speed over table size (memory requirements). A larger table
size, within reason, is an acceptable trade-off if it yields a faster lookup
speed. A minimal perfect hash table, where each key maps to exactly one table
location, is not a requirement, nor is it prohibited.
</p>
<p>
Note the inevitable size and performance tradeoff introduced by the masking
method used by the implementation. Modulus-oriented solutions (those that use
the % operator in C) tend to be slower (modulus division can take upward of 90
cycles), but yield smaller table sizes. Solutions relying on power-of-2 based
table sizes boast much faster masking routines (e.g. <code>Input &amp;
(Size-1)</code>), but incur greater table size overhead.
</p>
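<p>
The difference between the two styles can be sketched with two hypothetical
helper routines (the names here are illustrative, not from the actual
component):
</p>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of the two masking styles discussed above.
 * These helper names are hypothetical; the component's actual mask
 * routines are selected via the masking-type enum described later.
 */

/* Modulus masking: works for any table size, but integer division
   is expensive (upward of 90 cycles). */
static uint32_t MaskModulus(uint32_t Hash, uint32_t Size)
{
    return Hash % Size;
}

/* AND masking: requires a power-of-2 table size, but compiles down
   to a single logical AND instruction. */
static uint32_t MaskAnd(uint32_t Hash, uint32_t PowerOf2Size)
{
    return Hash & (PowerOf2Size - 1);
}
```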
<p>
The offline generation process takes, as its input, a key file. The file will
be an array of <code>ULONG</code> keys (i.e. binary data). The number of
elements in the array can be ascertained by dividing the file size by
<code>sizeof(ULONG)</code>. It produces, as its output, a perfect hash table
file that can be subsequently loaded and used in separate processes.
</p>
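<p>
As a rough sketch of the input side (using portable stdio for illustration,
rather than the memory maps the component actually uses; <code>LoadKeys</code>
is a hypothetical name):
</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t ULONG;

/* Hypothetical sketch: load a .keys file (a raw array of ULONGs) and
   derive the key count from the file size, as described above.  The
   real component memory-maps the file instead of copying it. */
ULONG *LoadKeys(const char *Path, size_t *NumberOfKeys)
{
    FILE *File = fopen(Path, "rb");
    if (!File) return NULL;

    fseek(File, 0, SEEK_END);
    long FileSize = ftell(File);
    fseek(File, 0, SEEK_SET);

    /* Number of elements = file size / sizeof(ULONG). */
    *NumberOfKeys = (size_t)FileSize / sizeof(ULONG);

    ULONG *Keys = malloc(*NumberOfKeys * sizeof(ULONG));
    if (Keys && fread(Keys, sizeof(ULONG), *NumberOfKeys, File) != *NumberOfKeys) {
        free(Keys);
        Keys = NULL;
    }
    fclose(File);
    return Keys;
}
```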
<p>
Callers wishing to use a given perfect hash table will need to load the file
produced in the step above. This yields an interface through which the hash
table can be used. At a bare minimum, the interface should support the
following semantics:
</p>
<pre class="code"><code class="language-c">extern PULONG Table;
ULONG Lookup(ULONG Key) { return Table[PerfectHashFunction(Key)]; }
VOID Insert(ULONG Key, ULONG Value) { Table[PerfectHashFunction(Key)] = Value; }
</code></pre>
<p>
The interface requirements are flexible and can be extended, as long as the
criteria above are met.
</p>
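<p>
To make the contract concrete, here is a toy, self-contained illustration of
those semantics. The trivial <code>PerfectHashFunction</code> below is a
contrived stand-in for a contrived three-key set (10, 20, 30); it is not how
a real table computes its index:
</p>

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ULONG;
typedef void VOID;

/* Toy illustration of the interface contract above.  The hypothetical
   PerfectHashFunction here happens to be perfect for the key set
   {10, 20, 30}; a real table derives its function from a solved graph. */
static ULONG Storage[3];
ULONG *Table = Storage;

static ULONG PerfectHashFunction(ULONG Key) { return Key / 10 - 1; }

ULONG Lookup(ULONG Key) { return Table[PerfectHashFunction(Key)]; }
VOID Insert(ULONG Key, ULONG Value) { Table[PerfectHashFunction(Key)] = Value; }
```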
<p>
The behavior of looking up or inserting a key that wasn't in the original input
set is undefined. That is, the implementation is not required to detect or
protect against this scenario &mdash; that is the responsibility of the caller.
</p>
<p>
Feel free to review existing works on the topic, particularly the cmph open
source project, the GNU gperf library, and the plethora of papers on the subject
of perfect hashing and minimal perfect hashing.
</p>
</div>
</section>
<hr/>
<section class="section section-toc">
<div class="container">
<a class="xref" name="contents"></a>
<h1>Contents</h1>
<p>
<ul class="toc-list">
<li>
<a href="#getting-started">Getting Started</a>
</li>
<li>
<a href="#algorithm-decisions">Algorithm Decisions</a>
</li>
<li>
<a href="#initial-design-decisions">Initial Design Decisions</a>
</li>
<li>
<a href="#current-performance">Current Performance</a>
</li>
<li>
<a href="#implementation-notes">Implementation Notes</a>
</li>
<li>
<a href="#code-walkthrough">Code Walkthrough</a>
<ul>
<li>
<a href="#test-data">Preparing Test Data</a>
</li>
<li>
<a href="#running-self-test">Running the Self-Test</a>
</li>
<li>
<a href="#the-info-stream">The <code>:Info</code> Stream</a>
</li>
<li>
<a href="#self-test-overview">Self-Test Overview</a>
</li>
<li>
<a href="#context-creation">Context Creation</a>
<ul>
<li>
<a href="#PERFECT_HASH_TABLE_CONTEXT">PERFECT_HASH_TABLE_CONTEXT</a>
</li>
<li>
<a href="#CreatePerfectHashTableContext">CreatePerfectHashTableContext()</a>
</li>
</ul>
</li>
<li>
<a href="#key-loading">Key Loading</a>
<ul>
<li>
<a href="#PERFECT_HASH_TABLE_KEYS">PERFECT_HASH_TABLE_KEYS</a>
</li>
<li>
<a href="#LoadPerfectHashTableKeys">LoadPerfectHashTableKeys()</a>
</li>
</ul>
</li>
<li>
<a href="#perfect-hash-table-creation">Perfect Hash Table Creation</a>
<ul>
<li>
<a href="#PERFECT_HASH_TABLE">PERFECT_HASH_TABLE</a>
</li>
<li>
<a href="#CreatePerfectHashTable">CreatePerfectHashTable()</a>
</li>
</ul>
</li>
<li>
<a href="#chm-algo-specific-details">Algorithm-Specific Implementation Details</a>
<ul>
<li>
<a href="#CreatePerfectHashTableImplChm01">CreatePerfectHashTableImplChm01()</a>
</li>
</ul>
</li>
</ul>
</li>
<li>
<a href="#appendix">Appendix</a>
<ul>
<li>
<a href="#CreateRandomObjectNames">Rtl-&gt;CreateRandomObjectNames()</a>
</li>
<li>
<a href="#CreateMultipleBuffers">Rtl-&gt;CreateMultipleBuffers()</a>
</li>
</ul>
</li>
</ul>
</p>
</div>
</section>
<hr/>
<section class="section section-body">
<div class="container">
<a class="xref" name="getting-started"></a>
<h1>Getting Started</h1>
<p>
This was an interesting project. I'd never written a perfect hash table before,
nor was I familiar with the landscape for doing such a thing. I spent about
three days reviewing existing work, including the <a
href="http://cmph.sourceforge.net">cmph</a> project's source code. (I ended up
collecting about 147 (!) documents on the topic (papers, PhD theses, slides,
etc) in my <a href="https://github.com/tpn/pdfs">PDFs</a> repo over the course
of the project.)
</p>
<a class="xref" name="algorithm-decisions"></a>
<h1>Algorithm Decisions</h1>
<p>
The algorithm I settled on is the acyclic random 2-part hypergraph (or r-graph,
where r = 2). The algorithm works as follows: for each key, generate two
independent hash values. Mask these values such that they fall within the
confines of the number of <em>vertices</em> picked for the table (more on this
later; for now, assume the number of <em>vertices</em> exceeds the number of
keys, or <em>edges</em>, by at least 2x). These masked hash values now become
the two vertices, and are added to a graph structure by a connecting edge. The
edge is simply the 0..N index being used for enumeration, e.g.:
</p>
<pre class="code"><code class="language-c">for (Index = 0; Index &lt; NumberOfKeys; Index++) {
Key = Keys[Index];
Hash1 = HashFunction1(Key);
Hash2 = HashFunction2(Key);
Vertex1 = MaskHashFunction(Hash1);
Vertex2 = MaskHashFunction(Hash2);
Edge = Index;
GraphAddEdge(Graph, Edge, Vertex1, Vertex2);
}</code></pre>
<p>
Once constructed, the graph is assessed to determine whether or not it is
acyclic, i.e. whether it contains no cycles. We want an acyclic graph. If
the graph is cyclic, the attempt has failed: the graph is thrown away, and a
new attempt is made, using new random seed data to drive the two hash
functions. Once an acyclic graph is found, it's relatively straightforward
to convert it into a data structure that can be used as a perfect hash table.
</p>
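<p>
Acyclicity is typically determined by "peeling": repeatedly removing edges
incident to a vertex of degree 1 until no more can be removed; the graph is
acyclic if and only if every edge gets peeled. A minimal, deliberately naive
sketch (not the actual graph code in the project):
</p>

```c
#include <assert.h>

#define MAX_VERTICES 64
#define MAX_EDGES 64

/* Illustrative sketch: test a 2-part hypergraph for acyclicity by
   repeatedly "peeling" edges incident to a degree-1 vertex.  The
   graph is acyclic iff every edge can be peeled. */
static int IsAcyclic(int NumberOfVertices, int NumberOfEdges,
                     const int Vertex1[], const int Vertex2[])
{
    int Degree[MAX_VERTICES] = {0};
    int Removed[MAX_EDGES] = {0};
    int RemovedCount = 0, Progress = 1;

    assert(NumberOfVertices <= MAX_VERTICES && NumberOfEdges <= MAX_EDGES);

    for (int e = 0; e < NumberOfEdges; e++) {
        Degree[Vertex1[e]]++;
        Degree[Vertex2[e]]++;
    }

    while (Progress) {
        Progress = 0;
        for (int e = 0; e < NumberOfEdges; e++) {
            if (Removed[e]) continue;
            /* Peel this edge if either endpoint has degree 1. */
            if (Degree[Vertex1[e]] == 1 || Degree[Vertex2[e]] == 1) {
                Removed[e] = 1;
                Degree[Vertex1[e]]--;
                Degree[Vertex2[e]]--;
                RemovedCount++;
                Progress = 1;
            }
        }
    }

    return RemovedCount == NumberOfEdges;
}
```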
<p>
This algorithm first has roots in <a href="https://github.com/tpn/pdfs/blob/master/A%20Versatile%20Graph%20Structure%20for%20Edge-Oriented%20Graph%20Algorithms%20-%201987%20(Ebert1987AVD).pdf">
A Versatile Data Structure for Edge-Oriented Graph Algorithms (Ebert, 1987)</a>.
Its application to perfect hashing appears in <a
href="https://github.com/tpn/pdfs/blob/master/A%20Family%20of%20Perfect%20Hashing%20Methods%20-%201996%20(TR0242).pdf">
A Family of Perfect Hashing Methods (Majewski, Wormald, Havas, Czech, 1996)</a>,
where they focus on more rigorous proofs of the runtime complexity associated
with acyclic r-graphs, extending on the work in their earlier paper, <a
href="https://github.com/tpn/pdfs/blob/3dc05fb22d87d86117802a2dc206926c79981ca3/Graphs,%20Hypergraphs%20and%20Hashing.pdf">Graphs, Hypergraphs and Hashing (1994)</a>, and their initial works on
the matter, <a
href="https://github.com/tpn/pdfs/blob/master/An%20Optimal%20Algorithm%20for%20Generating%20Minimal%20Perfect%20Hash%20Functions%20-%201992%20(10.1.1.51.5566).pdf">
An Optimal Algorithm for generating Minimal Perfect Hash Functions (Majewski,
Havas, Czech, 1992)</a>.
</p>
<p>
There is one thing that stood out in their 1996 paper (page 9) that I was able
to verify experimentally (after finally hacking the
<a href="https://github.com/tpn/cmph-2.0/blob/master/src/chm.c">CHM</a> algorithm
in the cmph project into a working state). To summarize, sans heavy math notation: the
probability that we find a perfect hash solution by identifying an acyclic
r-graph (r = 2) is 99.9% within 18 iterations. On average, a solution is found
in &radic;3 attempts.
</p>
<p>
This is referred to as a <a
href="https://en.wikipedia.org/wiki/Geometric_distribution">geometric
distribution</a>, something I hadn't come across before. It is a very desirable
trait, especially for this particular problem. In essence, the more attempts
we make, the more likely we are to find a solution. Assuming a solution exists
(i.e. the hash function is performing properly), the probability of never
solving it is provably infinitesimal, which is neat.
</p>
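<p>
The numbers above can be sanity-checked directly: an expected &radic;3
attempts per solution implies a per-attempt success probability of
p = 1/&radic;3 &asymp; 0.577, and the chance of at least one success within n
independent attempts is 1 - (1 - p)<sup>n</sup>:
</p>

```c
#include <assert.h>

/* Sketch of the geometric-distribution claim above.  The per-attempt
   success probability p = 1/sqrt(3) is implied by an expected sqrt(3)
   attempts per solution; this routine returns the probability of
   finding a solution within the given number of attempts. */
static double ProbabilityOfSolutionWithin(int Attempts)
{
    double p = 0.5773502691896258;   /* 1/sqrt(3) */
    double Failure = 1.0;
    int i;

    for (i = 0; i < Attempts; i++) {
        Failure *= (1.0 - p);        /* probability every attempt fails */
    }
    return 1.0 - Failure;
}
```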
<p>
Consider a gambler who is going all-in on heads against the Universe in an
infinite game of coin toss. We are the Universe. Play it long enough, and
we'll eventually see a tail.
</p>
<p>
I made the following notes around day 3 of the project:
</p>
<div class="blockquote"><small>
<p>
In my experiments, even with 10 million random keys, graph creation took about
6-7 seconds on the 64-bit release build. On average it found a solution usually
within 1-3 iterations. The worst-case I saw was 7 iterations.
</p>
<p>
The nice thing about the graph creation step is that each iteration can be
palmed off to a threadpool worker, such that you can attempt to find a graph
solution in parallel up to NCPU. On my 12 core box at home, there is a very
high probability I'll find a solution in the first batch of 12 iterations
submitted in parallel &mdash; thus, my clock time for solving the perfect hash
stays relatively consistent at 6-7 seconds, give or take.
</p>
<p>
This algorithm is not the fastest, nor the most state-of-the-art, nor does
it have the lowest bits-per-key storage requirement, nor will I be aiming
for a minimal perfect hash function. However, it's simple (relatively),
I understand it, I can explain it on a whiteboard without having to
continually reference a paper, and it's definitely fast enough in the
generation stage based on our target static key sets. It's also old; graph
theory has flourished since the 60s, and this particular algorithm came onto
the scene in 1992, and has been cited widely.
</p>
<p>
(The current state of the art, depicted in papers like <a
href="https://github.com/tpn/pdfs/blob/master/Fast%20Scalable%20Construction%20of%20(Minimal%20Perfect%20Hash)%20Functions%20-%2022%20Mar%202016,%20v2%20(1603.04330).pdf">
Fast Scalable Construction of (Minimal Perfect Hash) Functions (Genuzio,
Ottaviano, Vigna, 2016)</a>, uses "lazy Gaussian elimination" to tackle
the minimal perfect hash problem on ultra-large key sets (in the billions
and above). That is interesting, but not a wise choice for me to tackle in
a week, nor does it improve on our target static key sets, which are very
modest in size in comparison.)
</p>
<p>
Another reason I'm favoring the algorithm I've chosen is that because the
generation stage is so cheap, relatively, and we have that nice
probabilistic guarantee that we'll "probably" find a solution by iteration
18... that gives us a lot of leeway with regards to experimenting with the
underlying hash functions used. I envision there being much faster, non-mod
based hash functions we can experiment with, that actually have relatively
poor "randomness" qualities unless a particularly good seed is found.
Combined with the threadpool infrastructure for submitting iterations in
parallel, I have a hunch that I'll be able to find some very fast hash
functions that can still be solved in an acceptable amount of time. This
will help greatly with our evaluation time; reducing the latency and CPU
cycles required to perform the lookup.
</p>
<p>
There is one large risk item associated with my current plan: the key
validation step of the cmph command line program just flat-out isn't working
properly. The graph generation step <strong>*appears*</strong> to be doing
the right thing, at least with regards to finding cyclic graphs and
discarding them, etc. The validation step works on small key sets; however,
beyond about 500 keys, I'm seeing rampant collisions and severe degradation
of the underlying hash functions (i.e. everything is hashing to 394 or 85 or
something).
</p>
</small></div>
<p>
The last paragraph depicts some of the issues I had trying to get the cmph
program to validate the perfect hash tables it was supposedly generating. Try
as I might, I just couldn't get the thing to generate solutions that were
actually valid (despite it reporting that they were valid) after a few hours
of fiddling.
</p>
<p>
More impressive, though, is that the bug survived and still exists in my
complete reimplementation of their initial modulus-oriented masking approach.
This only became apparent in the latter stages of the project, when I authored
more robust validation and test logic. Thankfully, my power-of-2 based
approach <strong>does</strong> work, and it's a lot faster, so, who knows. The
modulus functionality was only implemented as a reference point; I never
anticipated using it as the final "fast" version of the perfect hash solution,
so I haven't investigated further why it doesn't work properly. It's a little
disconcerting, though.
</p>
<a class="xref" name="initial-design-decisions"></a>
<h1>Initial Design Decisions</h1>
<p>
A few of the initial guiding sentiments regarding the design follow:
<ul>
<li>
I wanted to be able to easily mix and match different algorithms, hash
functions and masking types. I figured this would make experimentation
easier, and generally promote sensible de-coupling of internal
components. Thus, the main interface for creating a perfect hash table
is parameterized by three enums for the desired
<a href="https://github.com/tpn/tracer/blob/c7092c78aac4e01baf324815f7fbc888cdb6faa0/PerfectHashTable/PerfectHashTable.h#L62">
algorithm</a>,
<a
href="https://github.com/tpn/tracer/blob/c7092c78aac4e01baf324815f7fbc888cdb6faa0/PerfectHashTable/PerfectHashTable.h#L111">
hash function</a>, and
<a
href="https://github.com/tpn/tracer/blob/c7092c78aac4e01baf324815f7fbc888cdb6faa0/PerfectHashTable/PerfectHashTable.h#L170">
masking type</a>, respectively. These identifiers, depicted below, are
also stored with the on-disk representation of the perfect hash table,
such that the loading component knows which implementations to select
when preparing an interface for use.
<a class="xref" name="enums"></a>
<div class="tab-box language box-enums">
<ul class="tabs">
<li data-content="content-enums-algorithm">Algorithm</li>
<li data-content="content-enums-hash-function">Hash Function</li>
<li data-content="content-enums-masking-type">Masking Type</li>
</ul>
<div class="content">
<pre class="code content-enums-algorithm"><code class="language-c">//
// Define an enumeration for identifying which backend algorithm variant to
// use for creating the perfect hash table.
//
typedef enum _PERFECT_HASH_TABLE_ALGORITHM_ID {
//
// Explicitly define a null algorithm to take the 0-index slot.
// This makes enum validation easier.
//
PerfectHashTableNullAlgorithmId = 0,
//
// Begin valid algorithms.
//
PerfectHashTableDefaultAlgorithmId = 1,
PerfectHashTableChm01AlgorithmId = 1,
//
// End valid algorithms.
//
//
// N.B. Keep the next value last.
//
PerfectHashTableInvalidAlgorithmId,
} PERFECT_HASH_TABLE_ALGORITHM_ID;
typedef PERFECT_HASH_TABLE_ALGORITHM_ID *PPERFECT_HASH_TABLE_ALGORITHM_ID;
//
// Provide a simple inline algorithm validation routine.
//
FORCEINLINE
BOOLEAN
IsValidPerfectHashTableAlgorithmId(
_In_ PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId
)
{
return (
AlgorithmId &gt; PerfectHashTableNullAlgorithmId &amp;&amp;
AlgorithmId &lt; PerfectHashTableInvalidAlgorithmId
);
}
</code></pre>
<pre class="code content-enums-hash-function"><code class="language-c">//
// Define an enumeration for identifying which hash function variant to use.
//
typedef enum _PERFECT_HASH_TABLE_HASH_FUNCTION_ID {
//
// Explicitly define a null algorithm to take the 0-index slot.
// This makes enum validation easier.
//
PerfectHashTableNullHashFunctionId = 0,
//
// Begin valid hash functions.
//
PerfectHashTableHashCrc32RotateFunctionId = 1,
PerfectHashTableDefaultHashFunctionId = 1,
PerfectHashTableHashJenkinsFunctionId = 2,
//
// N.B. The following three hash functions are purposefully terrible from
// the perspective of generating a good distribution of hash values.
// They all have very simple operations and are intended to test the
// theory that even with a poor hash function, once we find the right
// seed, the hash quality is unimportant.
//
PerfectHashTableHashRotateXorFunctionId = 3,
PerfectHashTableHashAddSubXorFunctionId = 4,
PerfectHashTableHashXorFunctionId = 5,
//
// End valid hash functions.
//
//
// N.B. Keep the next value last.
//
PerfectHashTableInvalidHashFunctionId,
} PERFECT_HASH_TABLE_HASH_FUNCTION_ID;
typedef PERFECT_HASH_TABLE_HASH_FUNCTION_ID
*PPERFECT_HASH_TABLE_HASH_FUNCTION_ID;
//
// Provide a simple inline hash function validation routine.
//
FORCEINLINE
BOOLEAN
IsValidPerfectHashTableHashFunctionId(
_In_ PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId
)
{
return (
HashFunctionId &gt; PerfectHashTableNullHashFunctionId &amp;&amp;
HashFunctionId &lt; PerfectHashTableInvalidHashFunctionId
);
}
</code></pre>
<pre class="code content-enums-masking-type"><code class="language-c">//
// Define an enumeration for identifying the type of table masking used by the
// underlying perfect hash table. This has performance and size implications.
// Modulus masking typically results in smaller tables at the expense of slower
// modulus-based hash functions. Non-modulus masking requires power-of-2 sized
// tables, which will be larger, but the resulting mask function can be done
// by logical AND instructions, which are fast.
//
typedef enum _PERFECT_HASH_TABLE_MASK_FUNCTION_ID {
//
// Null masking type.
//
PerfectHashTableNullMaskFunctionId = 0,
//
// Begin valid masking types.
//
PerfectHashTableModulusMaskFunctionId = 1,
PerfectHashTableAndMaskFunctionId = 2,
PerfectHashTableDefaultMaskFunctionId = 2,
PerfectHashTableXorAndMaskFunctionId = 3,
PerfectHashTableFoldAutoMaskFunctionId = 4,
PerfectHashTableFoldOnceMaskFunctionId = 5,
PerfectHashTableFoldTwiceMaskFunctionId = 6,
PerfectHashTableFoldThriceMaskFunctionId = 7,
//
// End valid masking types.
//
//
// N.B. Keep the next value last.
//
PerfectHashTableInvalidMaskFunctionId,
} PERFECT_HASH_TABLE_MASK_FUNCTION_ID;
typedef PERFECT_HASH_TABLE_MASK_FUNCTION_ID
*PPERFECT_HASH_TABLE_MASK_FUNCTION_ID;
//
// Provide a simple inline masking type validation routine.
//
FORCEINLINE
BOOLEAN
IsValidPerfectHashTableMaskFunctionId(
_In_ PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId
)
{
return (
MaskFunctionId &gt; PerfectHashTableNullMaskFunctionId &amp;&amp;
MaskFunctionId &lt; PerfectHashTableInvalidMaskFunctionId
);
}
//
// Masking tends to fall into one of two buckets: modulus and not-modulus.
// Provide an inline routine that guarantees to match all current and future
// modulus masking function IDs.
//
FORCEINLINE
BOOLEAN
IsModulusMasking(
_In_ PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId
)
{
return MaskFunctionId == PerfectHashTableModulusMaskFunctionId;
}
</code></pre>
</div>
</div>
</li>
</ul>
<ul>
<li>
The problem parallelizes incredibly well. The more threads you can have
looking for an acyclic graph, the better. There was no way I was
implementing this as a single-threaded solution; it's rare to get a
problem so well suited to a multithreaded approach, and the threadpool
scaffolding provided by NT is sublime, so, that was a no-brainer.
<br/>
<small>
(In fact, the way that it is currently written, you can't actually solve
the graph <strong>*without*</strong> using a threadpool. You can
dictate the level of concurrency you want, and specifying 1 means the
threadpool only gets 1 thread, which makes debugging easier, but there
is no way to isolate the solving process to avoid this. This is by
design.)
</small>
</li>
</ul>
<ul>
<li>
Memory map all the things. All file system interaction is achieved via
sections and memory maps. There is a separate threadpool used for
handling file-oriented work (such as saving the solution to disk in
parallel whilst the main thread verifies it is correct &mdash; which
it will be unless we've got internal bugs). Memory mappings are used
for the source .keys file, the resulting .pht1 file for the perfect
hash table, and the .pht1:Info NTFS stream used to capture metadata
about the perfect hash table (such as which algorithm, hash function,
and masking function was used, sizes, stats etc).
</li>
</ul>
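<p>
For readers more familiar with POSIX than NT, the equivalent of the
section/map approach for the .keys file looks roughly like this
(<code>mmap</code> standing in for <code>CreateFileMapping</code> /
<code>MapViewOfFile</code>; <code>MapKeysFile</code> is a hypothetical name):
</p>

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef uint32_t ULONG;

/* POSIX sketch of the memory-mapping approach described above (the
   project itself uses NT sections).  The whole keys file is mapped
   read-only and viewed directly as an array of ULONGs; no copy of
   the data is ever made. */
const ULONG *MapKeysFile(const char *Path, size_t *NumberOfKeys)
{
    int Fd = open(Path, O_RDONLY);
    if (Fd < 0) return NULL;

    struct stat Stat;
    if (fstat(Fd, &Stat) < 0) { close(Fd); return NULL; }

    void *View = mmap(NULL, (size_t)Stat.st_size, PROT_READ,
                      MAP_PRIVATE, Fd, 0);
    close(Fd);  /* the mapping stays valid after close */
    if (View == MAP_FAILED) return NULL;

    *NumberOfKeys = (size_t)Stat.st_size / sizeof(ULONG);
    return (const ULONG *)View;
}
```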
<ul>
<li>
For handling testing, due to the limited time constraints, I went for a
big-bang systems level "self-test". Coupled with aggressive internal
ASSERT()ing, this worked out pretty well. A self-test .exe is provided,
which, when pointed at a data directory and given a set of algorithm,
hash and mask IDs, will process all *.keys files in the directory; for
each one, a perfect hash file is created, the on-disk version is then
loaded and subsequently tested. This is all handled by the routine
<a
href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/SelfTestPerfectHashTable.c">
SelfTestPerfectHashTable</a>, which also serves as a good example for
how to exercise the entire system end-to-end.
</li>
</ul>
</p>
<a class="xref" name="current-performance"></a>
<h1>Current Performance</h1>
<small>
(Note: this section is an active work-in-progress. It documents observed
performance at the time of writing: 10th June, 2018. It is the most recently
authored section in this article, and thus, depicts newer code samples and
implementation details compared to the subsequent sections. (For example,
the <code>FastIndex()</code> and <code>FastIndexUnsafe()</code> routines
were only recently added as part of performance tuning and benchmarking work;
these routines aren't mentioned in subsequent sections yet.)
</small>
<p>
The performance of the solution can be evaluated in two distinct areas: the
performance of the perfect hash table generation functionality, i.e. how quickly
can we create a perfect hash table for any given key set, and the lookup
performance of the resulting table, i.e. how many cycles does it take to look up
a key (best case, worst case), how many memory references are made for a key
lookup, etc.
</p>
<h4>Performance TL;DR</h4>
<p>
<ul>
<li>
Generation is really fast. Although the parallelism helps immensely, the
underlying acyclic hypergraph algorithm described in the original papers
deserves all of the credit.
</li>
</ul>
<ul>
<li>
(That being said, I couldn't get their modulus-based algorithm to
generate valid solutions, either in their original code or my
reimplementation of their algorithm. I'm glad I got my AND-based
masking version working, otherwise this assignment would have been
a disaster.)
</li>
</ul>
<ul>
<li>
The hash function I use has three CRC32 instructions, one rotate, and
one XOR. The latency for just the hashing logic is sub 10 cycles, give
or take.
</li>
</ul>
<ul>
<li>
The Jenkins hash function, for comparison, clocks in at about 30 cycles
for just the hashing logic. If Jenkins is paired with the original
modulus masking, the minimum latency jumps to over 100 cycles.
</li>
</ul>
<ul>
<li>
I investigated even cheaper hash functions, such as simply XOR'ing
the input with two random seeds. They often work on smaller graphs,
but the CRC32 + rotate + XOR already offers very attractive latency
<strong>and</strong> good randomness, so it's a better general purpose
option. Also, even if we reduced the hashing cost from sub-10 to, say,
sub-6 cycles, we'd still be completely limited by the overhead of the
current function signature and table layout (i.e. we could shave
off more if we passed the seeds in as parameters via r8 and r9 instead
of constantly loading them from the table).
</li>
</ul>
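<p>
As an illustration of what a "purposefully cheap" rotate/XOR-style hash looks
like (the rotation amounts below are made up for the sketch, and the seed
values stand in for the per-table random seed data; the preferred production
function additionally mixes in hardware CRC32 instructions):
</p>

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ULONG;

/* Portable rotate-left helper (compiles to a single rol instruction
   on x64 with optimizing compilers). */
static uint32_t RotateLeft(uint32_t Value, unsigned Count)
{
    Count &= 31;
    return (Value << Count) | (Value >> ((32 - Count) & 31));
}

/* Hypothetical cheap two-vertex hash: XOR the key against two
   independent random seeds, then rotate.  Quality depends entirely
   on finding a good seed pair, per the theory discussed above. */
static void HashKey(ULONG Key, ULONG Seed1, ULONG Seed2,
                    ULONG *Vertex1, ULONG *Vertex2)
{
    *Vertex1 = RotateLeft(Key ^ Seed1, 15);
    *Vertex2 = RotateLeft(Key ^ Seed2, 7);
}
```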
<ul>
<li>
The <code>FastIndex()</code> routine (i.e. what <code>Lookup()</code>
and <code>Insert()</code> call in order to obtain the array offset for
a given key) requires a minimum of two memory references within the
assigned array in order to construct the final index. All other memory
references within the routine will be served from the L1 cache with high
probability (e.g. the seed values).
</li>
</ul>
<ul>
<li>
The best case scenario, where everything results in an L1 cache hit,
is about 23 cycles of latency. I think I could get this down to about
15 cycles with an optimized LEAF_ENTRY routine written in assembly;
however, it doesn't make sense to pursue that until more robust tools
are in place for measuring performance.
</li>
</ul>
<ul>
<li>
The worst-case scenario is a cache miss on both lookups, resulting in
fetches from main memory; that's hundreds of cycles of penalty.
Smaller table sizes should reduce the likelihood of misses on the lookup
fast path, however, access patterns will factor heavily into this; if,
say, the most frequent keys accessed fall within a smaller subset of the
total key set &mdash; the penalty for a larger table is offset. (E.g.
90% of the lookups are on 10% of the keys.) Making real-world key
frequency metrics available during table creation would definitely
add value when attempting to assess the cost of a larger table size.
</li>
</ul>
<ul>
<li>
Additional areas I'd look to improve performance:
<ul>
<li>
Analysis of the values in the input key set. What's the first
value, what's the last value, what's the minimum leading zeros,
maximum leading zeros, ditto for trailing zeros. What's the
value distribution look like? What's the gap between values
look like? Certain patterns will emerge for target key sets
that will allow for specific optimizations (like shifting away
known min trailing zeros, etc). I currently don't look at a
single key value during the creation step from the perspective
of directing the underlying algorithm.
</li>
</ul>
<ul>
<li>
Investigate prefetching. We could potentially dispatch a
prefetch (especially a prefetch with write-intent in the
<code>Insert()</code> case) early enough to offset some of
the memory latency involved in the two-part vertex lookup.
</li>
</ul>
<ul>
<li>
Write a LEAF_ENTRY in assembly that has zero error checking, no
COM dependencies (returns the index in rax, not HRESULT + output
parameter), and takes the seed values in registers. (Regarding
error checking: our <code>FastIndex()</code> routine still does
the sanity check that <code>Vertex1 != Vertex2</code> &mdash;
the only way to trigger an <code>E_FAIL</code> return code in
this case is passing in a key that isn't in the original input
set, <strong>AND</strong> having it trigger a collision with the
selected hash function and random seed data such that the same
value is produced for both values. The chances of this are
very, very slim with reasonable hash functions and good seed
data. So, we could omit the <code>S_OK</code> or
<code>E_FAIL</code> return codes in favor of returning the index
directly in rax.)
</li>
</ul>
<ul>
<li>
Currently, the user is responsible for providing the algorithm,
hash function and masking type to use in order to create the
table. With the current work on benchmarking functions, this
could be improved such that the creation routine is given a goal
to maximize, e.g. lowest latency for <code>FastIndex()</code>,
and then it tries all known algo/hash/masking variants until it
finds the best solution. This wouldn't take much more effort
and would ensure that the optimal combination is always picked
for a given key set.
</li>
</ul>
</li>
</ul>
</p>
<a class="xref" name="generation-performance"></a>
<h2>Generation Performance</h2>
<p>
Broadly speaking, generation performance is very good. The parallel approach to
searching for acyclic graphs has paid off quite nicely &mdash; the thing just
flies. The current implementation is incredibly defensive at the moment, too,
and includes numerous expensive operations that could be relegated to a debug
build or removed completely.
</p>
<p>
For example, in order to check that the incoming keys are unique, I simply
allocate a 512MB bitmap (one bit for every possible value in the range
0&ndash;MAX_ULONG) and set a bit for every key I see. I verify that no key is
seen more than once, and then, as an extra check, count the number of set bits
to ensure it matches the reported number of keys. This is overly defensive and
very abusive to the cache.
</p>
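<p>
A minimal sketch of that uniqueness check (illustrative helper, not the
project's actual code, which uses its own allocator, NT types and error
handling):
</p>

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

//
// Sketch of the defensive uniqueness check described above: one bit per
// possible 32-bit key (2^32 bits == 512MB). A duplicate key is detected
// when its bit is already set; as an extra check, the number of set bits
// must match the number of keys.
//
static bool
KeysAreUnique(const uint32_t *Keys, size_t Count)
{
    bool Unique = true;
    size_t BitsSet = 0;
    uint8_t *Bitmap;

    //
    // 2^32 bits / 8 bits-per-byte = 2^29 bytes = 512MB, zero-initialized.
    //

    Bitmap = calloc(1ULL << 29, 1);
    if (!Bitmap) {
        return false;
    }

    for (size_t i = 0; i < Count; i++) {
        uint32_t Key = Keys[i];
        uint8_t Mask = (uint8_t)(1u << (Key & 7));
        if (Bitmap[Key >> 3] & Mask) {
            Unique = false;
            break;
        }
        Bitmap[Key >> 3] |= Mask;
        BitsSet++;
    }

    //
    // Extra sanity check: number of set bits must equal number of keys.
    //

    if (Unique && BitsSet != Count) {
        Unique = false;
    }

    free(Bitmap);
    return Unique;
}
```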
<p>
Similarly, once a graph has been found, an initial internal verification routine
walks the entire graph and verifies none of the assigned indices for the entire
key set appear more than once. This happens before the final solution is saved
to disk. Then, after we load the table from the on-disk instance, we do another
thorough system test, exercising all of the public API routines (e.g.
<code>Index()</code>, <code>Lookup()</code>, <code>Insert()</code>, etc),
verifying the table is behaving correctly and keys are being mapped to unique
indices.
</p>
<p>
Even with all this additional overhead in place, a solution is generally found
within about 5 seconds on my machines, even for the largest key sets we support
(currently around ~750,000, for no particular reason other than that I haven't
tested larger sets yet). For key set sizes more representative of our target
domain (e.g. around 10,000 to 100,000), the generation time <em>feels</em>
basically instant, at least from the perspective of running the command
line tool and waiting for it to report back that it solved the graph.
</p>
<p>
One aspect worth noting is related to the initial hunch I expressed earlier:
</p>
<div class="blockquote"><small>
.... I envision there being much faster, non-mod based hash functions we
can experiment with, that actually have relatively poor "randomness"
qualities unless a particularly good seed is found. Combined with the
threadpool infrastructure for submitting iterations in parallel, I have a
hunch that I'll be able to find some very fast hash functions that can still
be solved in an acceptable amount of time.
</small></div>
<p>
Just to elaborate a bit further on that note &mdash; I was anticipating a
situation where, when fed an overly-simple hash function, the graph solving
would take far longer than the <em>99.9% probability of success by iteration
18</em> guarantee cited by the academic work on the subject.
</p>
<p>
For example, assume we've got a 37,000-key set that solves on the first set of
attempts (i.e. within 1..NCPU) when fed a good, but expensive, hash
function.
</p>
<p>
I pictured a scenario whereby, if fed a poorer-but-cheaper hash function, a
solution would be found in, say, 2 minutes. Or 5 minutes. Or basically any
time period that falls within our gut feeling of what's acceptable for the
given input key set size.
</p>
<p>
That hunch appears to be wrong; I have not seen any evidence of this in
practice. A graph will either be solved within those first ~20 attempts,
or it will never be solved. On numerous occasions, I left the solver running
overnight, such that I had 11 out of 12 cores searching for solutions at 5GHz
for 8 hours, and... nada. No solution was ever found. We often got really
close &mdash; I added functionality to track the highest number of deleted
edges seen during solving, and I remember regularly hitting 499,987 for a
key set size of 500,000 (so 13 away from finding a solution). Graphs would
either solve very quickly, or not at all.
</p>
<p>
Something else I found interesting is how quickly graphs would be solved
once I implemented support for detecting when the number of attempts
exceeds a certain threshold (currently 100), resizing the table such that
the number of underlying vertices is doubled, and then re-trying the graph
solving in parallel. Once this was in place, graphs would always be solved
rapidly. Supplying poor hash functions would result in a number of table
resize events, but as soon as the right size was found, the graph would be
solved within the first batch of attempts.
</p>
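<p>
A sketch of that resize strategy (purely illustrative names;
<code>TrySolveGraph()</code> here stands in for a full parallel batch of
graph-solving attempts, and is stubbed out so the control flow can be shown):
</p>

```c
#include <stdbool.h>
#include <stdint.h>

#define RESIZE_THRESHOLD 100

//
// Stub standing in for one graph-solving attempt; here it "solves" only
// once the table has enough vertices, purely to demonstrate control flow.
//
static bool
TrySolveGraph(uint64_t Vertices, uint64_t MinimumSolvableVertices)
{
    return Vertices >= MinimumSolvableVertices;
}

//
// Keep attempting to solve; once the attempt threshold is exceeded,
// double the number of vertices (a table resize event) and retry.
//
static uint64_t
SolveWithResizing(uint64_t InitialVertices, uint64_t MinimumSolvableVertices)
{
    uint64_t Vertices = InitialVertices;

    for (;;) {
        for (uint32_t Attempt = 0; Attempt < RESIZE_THRESHOLD; Attempt++) {
            if (TrySolveGraph(Vertices, MinimumSolvableVertices)) {
                return Vertices;
            }
        }
        Vertices *= 2;
    }
}
```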
<a class="xref" name="lookup-performance"></a>
<h2>Lookup Performance</h2>
<p>
Ironically, the most critical piece of this project &mdash; optimizing the
performance of the <code>Insert()</code> and <code>Lookup()</code> routines
&mdash; received the least amount of attention, at least in relation to the
other components like the graph solving or threadpool scaffolding.
</p>
<p>
The first hash function I half-heartedly wrote as a quick stop-gap is,
amusingly, still the best-performing one in terms of offering the best
trade-off between lookup latency and hash quality (with the latter being
proportional to the underlying table size). (It uses three CRC32
instructions, a rotate, and an XOR. I creatively gave it the moniker
<code>Crc32Rotate</code>.)
</p>
<p>
Smaller tables are ideal. The minimum table size we can hope for when
using power-of-2 masking is derived from the number of keys: round up to a
power of two, yielding our edge count, then round that up to the next
power of two, yielding our vertex count, and thus, table size.
</p>
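<p>
That sizing calculation can be sketched as follows (illustrative helper
names, not the project's actual identifiers):
</p>

```c
#include <stdint.h>

//
// Round a nonzero value up to a power of two.
//
static uint64_t
RoundUpPowerOf2(uint64_t Value)
{
    uint64_t Result = 1;
    while (Result < Value) {
        Result <<= 1;
    }
    return Result;
}

//
// Minimum table (vertex) size under power-of-2 masking: round the key
// count up to a power of two (edges), then take the next power of two
// (vertices). E.g. 37,000 keys -> 65,536 edges -> 131,072 vertices.
//
static uint64_t
MinimumTableSize(uint64_t NumberOfKeys)
{
    uint64_t Edges = RoundUpPowerOf2(NumberOfKeys);
    return Edges << 1;
}
```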
<p>
If two hash functions yield a solution with the same vertex count, the
function with the lower latency (cycles per lookup) is preferable, because
there is simply no benefit to spending extra cycles trying to improve
the quality of the hash.
</p>
<p>
However, the same isn't necessarily true for two hash functions that yield
different table sizes. Every power-of-2 increase in table size results in
a larger memory footprint, which translates to a larger cache footprint,
which translates to an increased likelihood of cache misses when looking
up the two vertex values from the assigned array &mdash; and that directly
impacts performance. The access pattern greatly affects performance in this
scenario, too: if a subset of keys is looked up far more frequently than
the others, those keys are likely to stay resident in the cache. However,
if the access pattern is truly random, the likelihood of a cache miss during
lookup increases, because there's a larger footprint of possible locations
the keys could be hashing to.
</p>
<p>
In order to give the <code>Crc32Rotate</code> hash function something to
be evaluated against, I eventually authored a version of the
Jenkins hash, which is widely used and cited, and is the hash function
used by the authors of the CMPH package. I wasn't particularly attracted
to it when first picking a hash function, as it has long dependency chains
in its instruction stream, which inhibit the out-of-order, superscalar
nature of contemporary CPUs (the Jenkins hash was authored in the late 90s,
for what it's worth).
</p>
<p>
Let's take a look at both implementations before reviewing performance.
We'll review the <code>FastIndex()</code> implementation of each routine,
which inlines the AND-based masking so that we don't have to make three
function calls through the COM vtbl pipeline to perform what equates to a
very simple operation (<code>Value &amp; (TableSize - 1)</code>). <small>
(Note: the <code>FastIndex()</code> idiom was only added very recently as
part of the benchmarking work; thus, it isn't mentioned in the later code
walkthrough sections yet.)</small>
</p>
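<p>
As a quick aside on that masking expression: for a power-of-two table size,
an AND with <code>TableSize - 1</code> yields the same result as a modulus,
without the expensive division (a minimal illustration, not project code):
</p>

```c
#include <stdint.h>

//
// For power-of-two table sizes, (Value & (TableSize - 1)) is equivalent
// to (Value % TableSize), without a division instruction.
//
static uint32_t
MaskIndex(uint32_t Value, uint32_t TableSize)
{
    return Value & (TableSize - 1);
}
```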
<a class="xref" name="hash-functions"></a>
<div class="tab-box language box-hash-functions">
<ul class="tabs">
<li data-content="content-hash-functions-crc32rotate">Crc32Rotate</li>
<li data-content="content-hash-functions-jenkins">Jenkins</li>
</ul>
<div class="content">
<pre class="code content-hash-functions-crc32rotate"><code class="language-c">_Use_decl_annotations_
HRESULT
PerfectHashTableFastIndexImplChm01Crc32RotateHashAndMask(
PPERFECT_HASH_TABLE Table,
ULONG Key,
PULONG Index
)
/*++
Routine Description:
Looks up given key in a perfect hash table and returns its index. This
is a fast version of the normal Index() routine that inlines the Crc32Rotate
hash function and AND masking.
N.B. If Key did not appear in the original set the hash table was created
from, the behavior of this routine is undefined. (In practice, the
key will hash to either an existing key's location or an empty slot,
so there is potential for returning a non-unique index.)
Arguments:
Table - Supplies a pointer to the table for which the key lookup is to be
performed.
Key - Supplies the key to look up.
Index - Receives the index associated with this key. The index will be
between 0 and Table-&gt;HashSize-1, and can be safely used to offset
directly into an appropriately sized array (e.g. Table-&gt;Values[]).
Return Value:
S_OK on success, E_FAIL if the underlying hash function returned a failure.
This will happen if the two hash values for a key happen to be identical.
It shouldn't happen once a perfect graph has been created (i.e. it only
happens when attempting to solve the graph). The Index parameter will
be cleared in the case of E_FAIL.
--*/
{
ULONG A;
ULONG B;
ULONG C;
ULONG D;
ULONG Seed1;
ULONG Seed2;
ULONG Seed3;
ULONG Input;
PULONG Seeds;
ULONG Masked;
ULONG Vertex1;
ULONG Vertex2;
PULONG Assigned;
ULONG MaskedLow;
ULONG MaskedHigh;
ULONGLONG Combined;
//IACA_VC_START();
//
// Initialize aliases.
//
Seeds = &amp;Table-&gt;Header-&gt;FirstSeed;
Seed1 = Seeds[0];
Seed2 = Seeds[1];
Seed3 = Seeds[2];
Input = Key;
Assigned = Table-&gt;Data;
//
// Calculate the individual hash parts.
//
A = _mm_crc32_u32(Seed1, Input);
B = _mm_crc32_u32(Seed2, _rotl(Input, 15));
C = Seed3 ^ Input;
D = _mm_crc32_u32(B, C);
//IACA_VC_END();
Vertex1 = A;
Vertex2 = D;
if (Vertex1 == Vertex2) {
goto Error;
}
//
// Mask each hash value such that it falls within the confines of the
// number of vertices.
//
MaskedLow = Vertex1 &amp; Table-&gt;HashMask;
MaskedHigh = Vertex2 &amp; Table-&gt;HashMask;
//
// Obtain the corresponding vertex values for the masked high and low hash
// values. These are derived from the "assigned" array that we construct
// during the creation routine's assignment step (GraphAssign()).
//
Vertex1 = Assigned[MaskedLow];
Vertex2 = Assigned[MaskedHigh];
//
// Combine the two values, then perform the index masking operation, such
// that our final index into the array falls within the confines of the
// number of edges, or keys, in the table. That is, make sure the index
// value is between 0 and Table-&gt;Keys-&gt;NumberOfElements-1.
//
Combined = (ULONGLONG)Vertex1 + (ULONGLONG)Vertex2;
Masked = Combined &amp; Table-&gt;IndexMask;
//
// Update the caller's pointer and return success. The resulting index
// value represents the array offset index for this given key in the
// underlying table, and is guaranteed to be unique amongst the original
// keys in the input set.
//
*Index = Masked;
//IACA_VC_END();
return S_OK;
Error:
//
// Clear the caller's pointer and return failure. We should only hit this
// point if the caller supplies a key that both: a) wasn't in the original
// input set, and b) happens to result in a hash value where both the high
// part and low part are identical, which is rare, but not impossible.
//
*Index = 0;
return E_FAIL;
}</code></pre>
<pre class="code content-hash-functions-jenkins"><code class="language-c">_Use_decl_annotations_
HRESULT
PerfectHashTableFastIndexImplChm01JenkinsHashAndMask(
PPERFECT_HASH_TABLE Table,
ULONG Key,
PULONG Index
)
/*++
Routine Description:
Looks up given key in a perfect hash table and returns its index. This
is a fast version of the normal Index() routine that inlines the Jenkins
hash function and AND masking.
N.B. If Key did not appear in the original set the hash table was created
from, the behavior of this routine is undefined. (In practice, the
key will hash to either an existing key's location or an empty slot,
so there is potential for returning a non-unique index.)
Arguments:
Table - Supplies a pointer to the table for which the key lookup is to be
performed.
Key - Supplies the key to look up.
Index - Receives the index associated with this key. The index will be
between 0 and Table-&gt;HashSize-1, and can be safely used to offset
directly into an appropriately sized array (e.g. Table-&gt;Values[]).
Return Value:
S_OK on success, E_FAIL if the underlying hash function returned a failure.
This will happen if the two hash values for a key happen to be identical.
It shouldn't happen once a perfect graph has been created (i.e. it only
happens when attempting to solve the graph). The Index parameter will
be cleared in the case of E_FAIL.
--*/
{
ULONG A;
ULONG B;
ULONG C;
ULONG D;
ULONG E;
ULONG F;
PBYTE Byte;
ULONG Seed1;
ULONG Seed2;
ULONG Input;
PULONG Seeds;
ULONG Masked;
ULONG Vertex1;
ULONG Vertex2;
PULONG Assigned;
ULONG MaskedLow;
ULONG MaskedHigh;
ULONGLONG Combined;
//
// Initialize aliases.
//
//IACA_VC_START();
Seeds = &amp;Table-&gt;Header-&gt;FirstSeed;
Seed1 = Seeds[0];
Seed2 = Seeds[1];
Input = Key;
Byte = (PBYTE)&amp;Input;
//
// Generate the first hash.
//
A = B = 0x9e3779b9;
C = Seed1;
A += (((ULONG)Byte[3]) &lt;&lt; 24);
A += (((ULONG)Byte[2]) &lt;&lt; 16);
A += (((ULONG)Byte[1]) &lt;&lt; 8);
A += ((ULONG)Byte[0]);
A -= B; A -= C; A ^= (C &gt;&gt; 13);
B -= C; B -= A; B ^= (A &lt;&lt; 8);
C -= A; C -= B; C ^= (B &gt;&gt; 13);
A -= B; A -= C; A ^= (C &gt;&gt; 12);
B -= C; B -= A; B ^= (A &lt;&lt; 16);
C -= A; C -= B; C ^= (B &gt;&gt; 5);
A -= B; A -= C; A ^= (C &gt;&gt; 3);
B -= C; B -= A; B ^= (A &lt;&lt; 10);
C -= A; C -= B; C ^= (B &gt;&gt; 15);
Vertex1 = C;
//
// Generate the second hash.
//
D = E = 0x9e3779b9;
F = Seed2;
D += (((ULONG)Byte[3]) &lt;&lt; 24);
D += (((ULONG)Byte[2]) &lt;&lt; 16);
D += (((ULONG)Byte[1]) &lt;&lt; 8);
D += ((ULONG)Byte[0]);
D -= E; D -= F; D ^= (F &gt;&gt; 13);
E -= F; E -= D; E ^= (D &lt;&lt; 8);
F -= D; F -= E; F ^= (E &gt;&gt; 13);
D -= E; D -= F; D ^= (F &gt;&gt; 12);
E -= F; E -= D; E ^= (D &lt;&lt; 16);
F -= D; F -= E; F ^= (E &gt;&gt; 5);
D -= E; D -= F; D ^= (F &gt;&gt; 3);
E -= F; E -= D; E ^= (D &lt;&lt; 10);
F -= D; F -= E; F ^= (E &gt;&gt; 15);
//IACA_VC_END();
Vertex2 = F;
if (Vertex1 == Vertex2) {
goto Error;
}
//
// Mask each hash value such that it falls within the confines of the
// number of vertices.
//
MaskedLow = Vertex1 &amp; Table-&gt;HashMask;
MaskedHigh = Vertex2 &amp; Table-&gt;HashMask;
//
// Obtain the corresponding vertex values for the masked high and low hash
// values. These are derived from the "assigned" array that we construct
// during the creation routine's assignment step (GraphAssign()).
//
Assigned = Table-&gt;Data;
Vertex1 = Assigned[MaskedLow];
Vertex2 = Assigned[MaskedHigh];
//
// Combine the two values, then perform the index masking operation, such
// that our final index into the array falls within the confines of the
// number of edges, or keys, in the table. That is, make sure the index
// value is between 0 and Table-&gt;Keys-&gt;NumberOfElements-1.
//
Combined = (ULONGLONG)Vertex1 + (ULONGLONG)Vertex2;
Masked = Combined &amp; Table-&gt;IndexMask;
//
// Update the caller's pointer and return success. The resulting index
// value represents the array offset index for this given key in the
// underlying table, and is guaranteed to be unique amongst the original
// keys in the input set.
//
*Index = Masked;
//IACA_VC_END();
return S_OK;
Error:
//
// Clear the caller's pointer and return failure. We should only hit this
// point if the caller supplies a key that both: a) wasn't in the original
// input set, and b) happens to result in a hash value where both the high
// part and low part are identical, which is rare, but not impossible.
//
*Index = 0;
return E_FAIL;
}</code></pre>
</div>
</div>
<p>
As you can see, the CRC32 routine does far less work to derive the
two 32-bit hash values for a given input key. The IACA profiles for each
routine tell a similar story (a block throughput of 19 cycles versus 42):
</p>
<div class="tab-box box-iaca language box-hash-functions-iaca">
<ul class="tabs">
<li data-content="content-hash-functions-iaca-crc32rotate">Crc32Rotate (IACA)</li>
<li data-content="content-hash-functions-iaca-jenkins">Jenkins (IACA)</li>
</ul>
<div class="content">
<pre class="code iaca content-hash-functions-iaca-crc32rotate"><code class="language-nasm">S:\tracer&gt;iaca x64\Release\PerfectHashTable.dll
Intel(R) Architecture Code Analyzer
Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - x64\Release\PerfectHashTable.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 19.00 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 2.5 0.0 | 2.0 | 7.0 7.0 | 7.0 7.0 | 2.0 | 2.5 | 2.0 | 2.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1* | | | | | | | | | mov rbx, r8
| 1 | | | 1.0 1.0 | | | | | | mov r9, qword ptr [rcx+0xc8]
| 1* | | | | | | | | | mov r10, rcx
| 1 | | | | 1.0 1.0 | | | | | mov edi, dword ptr [r9+0x64]
| 1 | | | 1.0 1.0 | | | | | | mov eax, dword ptr [r9+0x6c]
| 1 | | | | 1.0 1.0 | | | | | mov r11d, dword ptr [r9+0x68]
| 1 | | 0.5 | | | | | 0.5 | | xor eax, edx
| 0X | | | | | | | | | crc32 edi, edx
| 2 | 1.0 | | | | | | 1.0 | | rol edx, 0xf
| 0X | | | | | | | | | crc32 r11d, edx
| 0X | | | | | | | | | crc32 r11d, eax
| 1* | | | | | | | | | cmp edi, r11d
| 0*F | | | | | | | | | jnz 0x19
| 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [r8], 0x0
| 1 | | 0.5 | | | | 0.5 | | | mov eax, 0x80004005
| 1 | | | 1.0 1.0 | | | | | | mov rbx, qword ptr [rsp+0x8]
| 1 | | | | 1.0 1.0 | | | | | mov rdi, qword ptr [rsp+0x10]
| 3^ | | | 1.0 1.0 | | | | | | ret
| 1 | | | | 1.0 1.0 | | | | | mov r8, qword ptr [rcx+0x98]
| 1 | | | 1.0 1.0 | | | | | | mov ecx, dword ptr [rcx+0x44]
| 1* | | | | | | | | | mov eax, edi
| 1# | | 0.5 | | 1.0 1.0 | | 0.5 | | | mov rdi, qword ptr [rsp+0x10]
| 1 | 0.5 | | | | | 0.5 | | | and rax, rcx
| 1* | | | | | | | | | mov edx, r11d
| 1 | | 0.5 | | | | | 0.5 | | and rdx, rcx
| 1 | | | 1.0 1.0 | | | | | | mov ecx, dword ptr [r8+rdx*4]
| 2 | 0.5 | | | 1.0 1.0 | | 0.5 | | | add ecx, dword ptr [r8+rax*4]
| 1* | | | | | | | | | xor eax, eax
| 2^ | 0.5 | | 1.0 1.0 | | | 0.5 | | | and ecx, dword ptr [r10+0x48]
| 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rbx], ecx
| 1 | | | | 1.0 1.0 | | | | | mov rbx, qword ptr [rsp+0x8]
Total Num Of &#181;ops: 34
</code></pre>
<pre class="code iaca content-hash-functions-iaca-jenkins"><code class="language-nasm">S:\tracer&gt;iaca x64\Release\PerfectHashTable.dll
Intel(R) Architecture Code Analyzer
Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - x64\Release\PerfectHashTable.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 42.00 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 21.0 0.0 | 22.0 | 7.5 7.5 | 7.5 7.5 | 2.0 | 22.0 | 22.0 | 2.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov r10, qword ptr [rcx+0xc8]
| 1* | | | | | | | | | mov eax, edx
| 1 | | | | | | | 1.0 | | shr eax, 0x10
| 1* | | | | | | | | | mov r11d, edx
| 1* | | | | | | | | | movzx edx, al
| 1* | | | | | | | | | mov rbx, rcx
| 1 | 1.0 | | | | | | | | shr r11d, 0x18
| 1* | | | | | | | | | mov eax, r9d
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov esi, dword ptr [r10+0x64]
| 1* | | | | | | | | | mov rdi, r8
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov r10d, dword ptr [r10+0x68]
| 1 | | 1.0 | | | | | | | mov r8d, 0x9e3779b9
| 1 | | | | | | | 1.0 | | shr eax, 0x8
| 1* | | | | | | | | | movzx ecx, al
| 1* | | | | | | | | | mov eax, esi
| 1 | 1.0 | | | | | | | | shr eax, 0xd
| 1 | | | | | | | 1.0 | | shl r11d, 0x8
| 1 | | | | | | 1.0 | | | add r11d, edx
| 1* | | | | | | | | | movzx r9d, r9b
| 1* | | | | | | | | | mov edx, r9d
| 1 | 1.0 | | | | | | | | shl r11d, 0x8
| 1 | | 1.0 | | | | | | | sub edx, esi
| 1 | | | | | | 1.0 | | | add r11d, ecx
| 1* | | | | | | | | | mov ecx, r8d
| 1 | | | | | | | 1.0 | | shl r11d, 0x8
| 1 | | 1.0 | | | | | | | sub ecx, esi
| 1 | | | | | | 1.0 | | | add edx, r11d
| 1 | | 1.0 | | | | | | | xor edx, eax
| 1 | | | | | | 1.0 | | | sub r9d, r10d
| 1 | 1.0 | | | | | | | | sub ecx, edx
| 1* | | | | | | | | | mov eax, edx
| 1 | | | | | | | 1.0 | | shl eax, 0x8
| 1 | | 1.0 | | | | | | | sub r8d, r10d
| 1 | | | | | | 1.0 | | | xor ecx, eax
| 1 | 1.0 | | | | | | | | add r9d, r11d
| 1 | | 1.0 | | | | | | | sub esi, ecx
| 1* | | | | | | | | | mov eax, ecx
| 1 | | | | | | | 1.0 | | shr eax, 0xd
| 1 | | | | | | 1.0 | | | sub esi, edx
| 1 | 1.0 | | | | | | | | xor esi, eax
| 1 | | 1.0 | | | | | | | sub edx, esi
| 1* | | | | | | | | | mov eax, esi
| 1 | | | | | | | 1.0 | | shr eax, 0xc
| 1 | | | | | | 1.0 | | | sub edx, ecx
| 1 | 1.0 | | | | | | | | xor edx, eax
| 1 | | 1.0 | | | | | | | sub ecx, esi
| 1 | | | | | | 1.0 | | | sub ecx, edx
| 1* | | | | | | | | | mov eax, edx
| 1 | | | | | | | 1.0 | | shl eax, 0x10
| 1 | 1.0 | | | | | | | | xor ecx, eax
| 1 | | 1.0 | | | | | | | sub esi, ecx
| 1* | | | | | | | | | mov eax, ecx
| 1 | | | | | | | 1.0 | | shr eax, 0x5
| 1 | | | | | | 1.0 | | | sub esi, edx
| 1 | 1.0 | | | | | | | | xor esi, eax
| 1 | | 1.0 | | | | | | | sub edx, esi
| 1* | | | | | | | | | mov eax, esi
| 1 | | | | | | | 1.0 | | shr eax, 0x3
| 1 | | | | | | 1.0 | | | sub edx, ecx
| 1 | 1.0 | | | | | | | | xor edx, eax
| 1 | | 1.0 | | | | | | | sub ecx, esi
| 1 | | | | | | 1.0 | | | sub ecx, edx
| 1* | | | | | | | | | mov eax, edx
| 1 | | | | | | | 1.0 | | shl eax, 0xa
| 1 | 1.0 | | | | | | | | xor ecx, eax
| 1* | | | | | | | | | mov eax, r10d
| 1 | | | | | | | 1.0 | | shr eax, 0xd
| 1 | | 1.0 | | | | | | | sub esi, ecx
| 1 | | | | | | 1.0 | | | xor r9d, eax
| 1 | 1.0 | | | | | | | | shr ecx, 0xf
| 1 | | 1.0 | | | | | | | sub r8d, r9d
| 1 | | | | | | 1.0 | | | sub esi, edx
| 1* | | | | | | | | | mov eax, r9d
| 1 | | | | | | | 1.0 | | xor esi, ecx
| 1 | 1.0 | | | | | | | | shl eax, 0x8
| 1 | | 1.0 | | | | | | | xor r8d, eax
| 1 | | | | | | 1.0 | | | sub r10d, r8d
| 1* | | | | | | | | | mov eax, r8d
| 1 | | | | | | | 1.0 | | sub r10d, r9d
| 1 | 1.0 | | | | | | | | shr eax, 0xd
| 1 | | 1.0 | | | | | | | xor r10d, eax
| 1 | | | | | | 1.0 | | | sub r9d, r10d
| 1* | | | | | | | | | mov eax, r10d
| 1 | | | | | | | 1.0 | | sub r9d, r8d
| 1 | 1.0 | | | | | | | | shr eax, 0xc
| 1 | | 1.0 | | | | | | | xor r9d, eax
| 1 | | | | | | 1.0 | | | sub r8d, r10d
| 1 | | | | | | | 1.0 | | sub r8d, r9d
| 1* | | | | | | | | | mov eax, r9d
| 1 | 1.0 | | | | | | | | shl eax, 0x10
| 1 | | 1.0 | | | | | | | xor r8d, eax
| 1 | | | | | | 1.0 | | | sub r10d, r8d
| 1* | | | | | | | | | mov eax, r8d
| 1 | | | | | | | 1.0 | | sub r10d, r9d
| 1 | 1.0 | | | | | | | | shr eax, 0x5
| 1 | | 1.0 | | | | | | | xor r10d, eax
| 1 | | | | | | 1.0 | | | sub r9d, r10d
| 1* | | | | | | | | | mov eax, r10d
| 1 | | | | | | | 1.0 | | sub r9d, r8d
| 1 | 1.0 | | | | | | | | shr eax, 0x3
| 1 | | 1.0 | | | | | | | sub r8d, r10d
| 1 | | | | | | 1.0 | | | xor r9d, eax
| 1 | | | | | | | 1.0 | | sub r8d, r9d
| 1* | | | | | | | | | mov eax, r9d
| 1 | 1.0 | | | | | | | | shl eax, 0xa
| 1 | | 1.0 | | | | | | | xor r8d, eax
| 1 | | | | | | 1.0 | | | sub r10d, r8d
| 1 | | | | | | | 1.0 | | shr r8d, 0xf
| 1 | 1.0 | | | | | | | | sub r10d, r9d
| 1 | | 1.0 | | | | | | | xor r10d, r8d
| 1* | | | | | | | | | cmp esi, r10d
| 0*F | | | | | | | | | jnz 0x1d
| 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdi], 0x0
| 1 | | | | | | 1.0 | | | mov eax, 0x80004005
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov rbx, qword ptr [rsp+0x8]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov rsi, qword ptr [rsp+0x10]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov rdi, qword ptr [rsp+0x18]
| 3^ | | | 0.5 0.5 | 0.5 0.5 | | | | | ret
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov edx, dword ptr [rbx+0x44]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov r8, qword ptr [rbx+0x98]
| 1* | | | | | | | | | mov ecx, edx
| 1* | | | | | | | | | mov eax, r10d
| 1 | | | | | | | 1.0 | | and rcx, rax
| 1* | | | | | | | | | mov eax, esi
| 1# | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | mov rsi, qword ptr [rsp+0x10]
| 1 | | 1.0 | | | | | | | and rdx, rax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rcx*4]
| 2 | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | add eax, dword ptr [r8+rdx*4]
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | and eax, dword ptr [rbx+0x48]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov rbx, qword ptr [rsp+0x8]
| 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdi], eax
| 1* | | | | | | | | | xor eax, eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov rdi, qword ptr [rsp+0x18]
Total Num Of &#181;ops: 138
</code></pre>
</div>
</div>
<h3>A Note on Modulus Masking</h3>
<p>
For the sake of comparison, let's take a look at the IACA profile of the
modulus-masking approach used in the original academic papers. (Note that
I wasn't able to get the modulus-masking algorithm to generate valid graphs
without collisions &mdash; in either the original CMPH project or my
reimplementation of their algorithm. So, the following routine isn't used
anywhere; it is useful for depicting the impact of the modulus operator,
though.)
</p>
<p>
Converting the AND-based masking to modulus masking for the Jenkins hash
function results in a diff that looks like this:
</p>
<pre class="code"><code class="language-diff">
% diff -u JenkinsAnd.c JenkinsMod.c
--- JenkinsAnd.c 2018-06-10 20:48:49.330504900 -0700
+++ JenkinsMod.c 2018-06-10 20:50:06.373304600 -0700
@@ -1,6 +1,6 @@
_Use_decl_annotations_
HRESULT
-PerfectHashTableFastIndexImplChm01JenkinsHashAndMask(
+PerfectHashTableFastIndexImplChm01JenkinsHashModMask(
PPERFECT_HASH_TABLE Table,
ULONG Key,
PULONG Index
@@ -130,8 +130,8 @@
// number of vertices.
//
- MaskedLow = Vertex1 &amp; Table-&gt;HashMask;
- MaskedHigh = Vertex2 &amp; Table-&gt;HashMask;
+ MaskedLow = Vertex1 % Table-&gt;HashModulus;
+ MaskedHigh = Vertex2 % Table-&gt;HashModulus;
//
// Obtain the corresponding vertex values for the masked high and low hash
@@ -153,7 +153,7 @@
Combined = (ULONGLONG)Vertex1 + (ULONGLONG)Vertex2;
- Masked = Combined &amp; Table-&gt;IndexMask;
+ Masked = Combined % Table-&gt;IndexModulus;
//
// Update the caller's pointer and return success. The resulting index
</code></pre>
<p>
The IACA profile is as follows:
</p>
<div class="tab-box box-iaca language box-hash-functions-mod-iaca">
<ul class="tabs">
<li data-content="content-hash-functions-iaca-jenkins-mod">Jenkins-Modulus (IACA)</li>
</ul>
<div class="content">
<pre class="code iaca content-hash-functions-iaca-jenkins-mod"><code class="language-nasm">S:\tracer&gt;iaca x64\Release\PerfectHashTable.dll
Intel(R) Architecture Code Analyzer
Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - x64\Release\PerfectHashTable.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 119.00 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 30.0 0.0 | 28.0 | 7.0 7.0 | 7.0 7.0 | 2.0 | 29.0 | 29.0 | 2.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | mov r10, qword ptr [rcx+0xc8]
| 1* | | | | | | | | | mov r9d, edx
| 1* | | | | | | | | | mov eax, edx
| 1* | | | | | | | | | mov r11d, edx
| 1 | | | | | | | 1.0 | | shr eax, 0x10
| 1* | | | | | | | | | mov rdi, r8
| 1* | | | | | | | | | movzx edx, al
| 1 | | 1.0 | | | | | | | mov r8d, 0x9e3779b9
| 1 | | | | 1.0 1.0 | | | | | mov esi, dword ptr [r10+0x64]
| 1* | | | | | | | | | mov eax, r9d
| 1 | | | | | | | 1.0 | | shr eax, 0x8
| 1* | | | | | | | | | movzx ecx, al
| 1* | | | | | | | | | mov eax, esi
| 1 | 1.0 | | | | | | | | shr eax, 0xd
| 1 | | | | | | | 1.0 | | shr r11d, 0x18
| 1 | 1.0 | | | | | | | | shl r11d, 0x8
| 1 | | 1.0 | | | | | | | add r11d, edx
| 1* | | | | | | | | | movzx r9d, r9b
| 1* | | | | | | | | | mov edx, r9d
| 1 | | | | | | | 1.0 | | shl r11d, 0x8
| 1 | | | | | | 1.0 | | | sub edx, esi
| 1 | | 1.0 | | | | | | | add r11d, ecx
| 1* | | | | | | | | | mov ecx, r8d
| 1 | 1.0 | | | | | | | | shl r11d, 0x8
| 1 | | | | | | 1.0 | | | sub ecx, esi
| 1 | | 1.0 | | | | | | | add edx, r11d
| 1 | | | | | | 1.0 | | | xor edx, eax
| 1 | | | | | | | 1.0 | | sub ecx, edx
| 1* | | | | | | | | | mov eax, edx
| 1 | 1.0 | | | | | | | | shl eax, 0x8
| 1 | | 1.0 | | | | | | | xor ecx, eax
| 1 | | | | | | 1.0 | | | sub esi, ecx
| 1* | | | | | | | | | mov eax, ecx
| 1 | | | | | | | 1.0 | | shr eax, 0xd
| 1 | 1.0 | | | | | | | | sub esi, edx
| 1 | | 1.0 | | | | | | | xor esi, eax
| 1 | | | | | | 1.0 | | | sub edx, esi
| 1* | | | | | | | | | mov eax, esi
| 1 | | | | | | | 1.0 | | shr eax, 0xc
| 1 | 1.0 | | | | | | | | sub edx, ecx
| 1 | | 1.0 | | | | | | | xor edx, eax
| 1 | | | | | | 1.0 | | | sub ecx, esi
| 1 | | | | | | | 1.0 | | sub ecx, edx
| 1* | | | | | | | | | mov eax, edx
| 1 | 1.0 | | | | | | | | shl eax, 0x10
| 1 | | 1.0 | | | | | | | xor ecx, eax
| 1 | | | | | | 1.0 | | | sub esi, ecx
| 1* | | | | | | | | | mov eax, ecx
| 1 | | | | | | | 1.0 | | shr eax, 0x5
| 1 | 1.0 | | | | | | | | sub esi, edx
| 1 | | 1.0 | | | | | | | xor esi, eax
| 1 | | | | | | 1.0 | | | sub edx, esi
| 1* | | | | | | | | | mov eax, esi
| 1 | | | | | | | 1.0 | | shr eax, 0x3
| 1 | 1.0 | | | | | | | | sub edx, ecx
| 1 | | 1.0 | | | | | | | xor edx, eax
| 1 | | | | | | 1.0 | | | sub ecx, esi
| 1 | | | | | | | 1.0 | | sub ecx, edx
| 1* | | | | | | | | | mov eax, edx
| 1 | 1.0 | | | | | | | | shl eax, 0xa
| 1 | | 1.0 | | | | | | | xor ecx, eax
| 1 | | | | | | 1.0 | | | sub esi, ecx
| 1 | | | | | | | 1.0 | | shr ecx, 0xf
| 1 | 1.0 | | | | | | | | sub esi, edx
| 1 | | 1.0 | | | | | | | xor esi, ecx
| 1 | | | 1.0 1.0 | | | | | | mov ecx, dword ptr [r10+0x68]
| 1 | | | | | | 1.0 | | | sub r9d, ecx
| 1 | | | | | | | 1.0 | | sub r8d, ecx
| 1* | | | | | | | | | mov eax, ecx
| 1 | 1.0 | | | | | | | | add r9d, r11d
| 1 | | | | | | | 1.0 | | shr eax, 0xd
| 1 | | 1.0 | | | | | | | xor r9d, eax
| 1 | | | | | | 1.0 | | | sub r8d, r9d
| 1* | | | | | | | | | mov eax, r9d
| 1 | 1.0 | | | | | | | | shl eax, 0x8
| 1 | | 1.0 | | | | | | | xor r8d, eax
| 1 | | | | | | 1.0 | | | sub ecx, r8d
| 1* | | | | | | | | | mov eax, r8d
| 1 | | | | | | | 1.0 | | sub ecx, r9d
| 1 | 1.0 | | | | | | | | shr eax, 0xd
| 1 | | 1.0 | | | | | | | xor ecx, eax
| 1 | | | | | | 1.0 | | | sub r9d, ecx
| 1* | | | | | | | | | mov eax, ecx
| 1 | | | | | | | 1.0 | | sub r9d, r8d
| 1 | 1.0 | | | | | | | | shr eax, 0xc
| 1 | | 1.0 | | | | | | | xor r9d, eax
| 1 | | | | | | 1.0 | | | sub r8d, ecx
| 1 | | | | | | | 1.0 | | sub r8d, r9d
| 1* | | | | | | | | | mov eax, r9d
| 1 | 1.0 | | | | | | | | shl eax, 0x10
| 1 | | 1.0 | | | | | | | xor r8d, eax
| 1 | | | | | | 1.0 | | | sub ecx, r8d
| 1* | | | | | | | | | mov eax, r8d
| 1 | | | | | | | 1.0 | | sub ecx, r9d
| 1 | 1.0 | | | | | | | | shr eax, 0x5
| 1 | | 1.0 | | | | | | | xor ecx, eax
| 1 | | | | | | 1.0 | | | sub r9d, ecx
| 1* | | | | | | | | | mov eax, ecx
| 1 | | | | | | | 1.0 | | sub r9d, r8d
| 1 | 1.0 | | | | | | | | shr eax, 0x3
| 1 | | 1.0 | | | | | | | sub r8d, ecx
| 1 | | | | | | 1.0 | | | xor r9d, eax
| 1 | | | | | | | 1.0 | | sub r8d, r9d
| 1* | | | | | | | | | mov eax, r9d
| 1 | 1.0 | | | | | | | | shl eax, 0xa
| 1 | | 1.0 | | | | | | | xor r8d, eax
| 1 | | | | | | 1.0 | | | sub ecx, r8d
| 1 | | | | | | | 1.0 | | shr r8d, 0xf
| 1 | 1.0 | | | | | | | | sub ecx, r9d
| 1 | | 1.0 | | | | | | | xor ecx, r8d
| 1* | | | | | | | | | cmp esi, ecx
| 0*F | | | | | | | | | jnz 0x1d
| 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdi], 0x0
| 1 | | | | | | 1.0 | | | mov eax, 0x80004005
| 1 | | | | 1.0 1.0 | | | | | mov rbx, qword ptr [rsp+0x8]
| 1 | | | 1.0 1.0 | | | | | | mov rsi, qword ptr [rsp+0x10]
| 1 | | | | 1.0 1.0 | | | | | mov rdi, qword ptr [rsp+0x18]
| 3^ | | | 1.0 1.0 | | | | | | ret
| 1 | | | | 1.0 1.0 | | | | | mov r10, qword ptr [rbx+0x98]
| 1* | | | | | | | | | xor edx, edx
| 1* | | | | | | | | | mov eax, ecx
| 1 | | | 1.0 1.0 | | | | | | mov ecx, dword ptr [rbx+0x58]
| 0X | | | | | | | | | div dword ptr [rbx+0x54]
| 1* | | | | | | | | | mov eax, esi
| 1# | 1.0 | | | 1.0 1.0 | | | | | mov rsi, qword ptr [rsp+0x10]
| 1 | | | 1.0 1.0 | | | | | | mov r9d, dword ptr [r10+rdx*4]
| 1* | | | | | | | | | xor edx, edx
| 0X | | | | | | | | | div dword ptr [rbx+0x54]
| 1 | | | | 1.0 1.0 | | | | | mov rbx, qword ptr [rsp+0x8]
| 1 | | | 1.0 1.0 | | | | | | mov eax, dword ptr [r10+rdx*4]
| 1* | | | | | | | | | xor edx, edx
| 1 | | | | | | | 1.0 | | add rax, r9
| 32 | 9.0 | 7.0 | | | | 9.0 | 7.0 | | div rcx
| 1* | | | | | | | | | xor eax, eax
| 2^ | | | | | 1.0 | | | 1.0 | mov dword ptr [rdi], edx
| 1 | | | | 1.0 1.0 | | | | | mov rdi, qword ptr [rsp+0x18]
Total Num Of &#181;ops: 168
</code></pre>
</div>
</div>
<p>
The three div operations bump the &#181;op count from 119 up to 168, a 1.4x
increase. The block throughput is even worse, jumping from 43 to 119, a ~2.7x
increase. At 19 cycles, the CRC32-Rotate version is clearly in the lead: 2.2x
better than Jenkins, and 6.2x better than Jenkins with modulus masking as per
the original paper.
</p>
<h3>In Defense of Jenkins...</h3>
<p>
Jenkins definitely produces a better quality hash than the CRC32-based routine.
This is evidenced by the fact that, across our 50 auto-generated linear/random
key files of varying sizes, I've not seen Jenkins result in a single table
resize event.
</p>
<p>
On the other hand, on the same data set, the CRC32 routine typically needs
around 15 table resize events in order to find a solution. The 15 affected
files are all the random ones (I've not seen any linear ones need resizing,
even the large ones). There could be some pathological issue that is easily
fixed, but I haven't investigated in detail yet. On all of the key sets that
are more representative of our target, the CRC32 routine obtains a solution
easily.
</p>
<a class="xref" name="implementation-notes"></a>
<h1>Implementation Notes</h1>
<small>
(The following section serves as a general introduction to the project's
organization, concepts, files, etc. It aims to provide a bit of
background context to some of the idioms that will be observed when reviewing
the code. The next section, <a href="#code-walkthrough">Code Walkthrough</a>,
digs into the implementation details in a more end-to-end fashion.)
</small>
<p>
From an implementation perspective, I decided to develop the component as part
of my <a href="https://github.com/tpn/tracer">tracer</a> project for two
reasons: a) the tracer project already has a lot of scaffolding in place
(especially for the boring bits) that I'd be able to re-use, which is useful
when time-constrained, and b) a perfect hash table is actually a pretty
neat component that I can see myself using down the track.
</p>
<p>
The implementation is creatively-named <a
href="https://github.com/tpn/tracer/tree/master/PerfectHashTable">PerfectHashTable</a>,
and is a DLL.
</p>
<p>
The <a
href="https://github.com/tpn/tracer/tree/master/PerfectHashTableSelfTestExe">PerfectHashTableSelfTestExe</a>
project, also very creatively-named, is the standalone self-test .exe.
</p>
<p>
Development of the project was done in Visual Studio 2017 via the
<a href="https://github.com/tpn/tracer/tree/master/PerfectHashTable.sln">PerfectHashTable.sln</a>
solution, which includes Debug, Release, PGInstrument and PGOptimize
configurations.
</p>
<p>
The coding style and conventions are the same as the rest of the tracer project:
Cutler Normal Form C, heavily commented, and generally very NT-esque in style.
</p>
<p>
Components in the tracer project have an interesting restriction in that they
can't use the C runtime library in any way (due to the need to potentially be
remote injected into a target process, and not wanting to deal with varying
levels of C runtime availability). Instead, they make extensive, exclusive
use of the NT runtime primitives afforded by <code>ntdll.dll</code>,
<code>kernel32.dll</code>, and <code>ntoskrnl.exe</code>. This
functionality is wrapped up by a component named
<a href="https://github.com/tpn/tracer/tree/master/Rtl">Rtl</a> (inspired by the
NT runtime library of the same name). There is a huge structure named
<a
href="https://github.com/tpn/tracer/blob/7fc2741b27e5705f4dc93f5dbf1280e08d7dfa8a/Rtl/Rtl.h#L6380">RTL</a>,
which contains the kitchen sink of things we need across all components, as well
as function pointers to DDK-type functionality normally not available if you're
including <code>&lt;Windows.h&gt;</code> (such as bitmaps, AVL tables, prefix
tables, etc).
<small>(Note: one of the original design objectives of the RTL structure was to
facilitate writing components that could run in both kernel and user mode
without needing recompilation or #ifdefs. That is, you could author and test
the vast majority of your functionality in user mode, where development,
debugging and testing is easier, without needing to deploy it into the kernel
until much later in the development lifecycle.)</small>
</p>
<p>
The reason for mentioning this is that the first two parameters to the
PerfectHashTable component's public functions are usually <code>Rtl</code> and
<code>Allocator</code>. E.g.:
</p>
<pre class="code"><code class="language-c">_Use_decl_annotations_
BOOLEAN
CreatePerfectHashTable(
PRTL Rtl,
PALLOCATOR Allocator,
PPERFECT_HASH_TABLE_CONTEXT Context,
PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId,
PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId,
PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId,
PULARGE_INTEGER NumberOfTableElementsPointer,
PPERFECT_HASH_TABLE_KEYS Keys,
PCUNICODE_STRING HashTablePath
)</code></pre>
<p>
The <a
href="https://github.com/tpn/tracer/blob/7fc2741b27e5705f4dc93f5dbf1280e08d7dfa8a/Rtl/Memory.h#L464">ALLOCATOR</a>
structure encapsulates the memory management functions like
<code>Calloc()</code> and <code>Free()</code>, such that we're not dependent
upon CRT linkage to equivalent functions. Initially, it just had the basic set
of routines required to mimic Python's <code>PyMemAllocatorEx</code> structure:
<code>malloc()</code>, <code>calloc()</code>, <code>realloc()</code> and
<code>free()</code>. Since then, however, it has grown to support a plethora
of routines, many of them driven by the TraceStore component, along with more
general functionality, such as aligned memory allocation.
</p>
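<p>
As a rough illustration (a hypothetical, simplified sketch, not the actual
tracer structure; all names here are illustrative), an allocator of this style
is just a bundle of function pointers that components call through instead of
calling the CRT directly:
</p>
<pre class="code"><code class="language-c">#include &lt;assert.h&gt;
#include &lt;stdlib.h&gt;

//
// Hypothetical sketch of the ALLOCATOR idea: a structure of function
// pointers standing in for the CRT allocation routines.  Components
// call through the structure, so they never link against the CRT
// directly.
//

typedef struct _TOY_ALLOCATOR {
    void *Context;
    void *(*Malloc)(void *Context, size_t Size);
    void *(*Calloc)(void *Context, size_t Count, size_t Size);
    void (*Free)(void *Context, void *Address);
} TOY_ALLOCATOR;

//
// Default implementations that simply forward to the CRT.
//

static void *ToyMalloc(void *Context, size_t Size)
{
    (void)Context;
    return malloc(Size);
}

static void *ToyCalloc(void *Context, size_t Count, size_t Size)
{
    (void)Context;
    return calloc(Count, Size);
}

static void ToyFree(void *Context, void *Address)
{
    (void)Context;
    free(Address);
}

//
// Exercise the allocator the way a component would: allocate through
// the function pointers, verify the memory is zeroed, then free it.
// Returns 0 on success.
//

static int ToyAllocatorDemo(void)
{
    TOY_ALLOCATOR Allocator = { NULL, ToyMalloc, ToyCalloc, ToyFree };
    unsigned char *Buffer;

    Buffer = (unsigned char *)Allocator.Calloc(Allocator.Context, 16, 1);
    if (Buffer == NULL || Buffer[0] != 0) {
        return 1;
    }

    Allocator.Free(Allocator.Context, Buffer);
    return 0;
}</code></pre>
<p>
Swapping in a different backing allocator (e.g. one that tracks statistics, or
carves allocations out of a memory-mapped region) is then just a matter of
initializing the structure with different function pointers.
</p>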
<p>
Short story long: if you see <code>Rtl</code> or <code>Allocator</code>
variables anywhere, it's just our runtime glue or memory allocator
stuff.
</p>
<p>
Moving on, all public functions get a SAL-annotated function pointer typedef in
the main public header file (<a
href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/PerfectHashTable.h">PerfectHashTable.h</a>).
Functions are defined for creating and destroying keys and perfect hash tables,
creating and destroying contexts (a context is required to create a table), and
loading a previously-created table.
</p>
<p>
I use the same API pattern I used for the
<a href="https://github.com/tpn/tracer/tree/master/StringTable2">StringTable2</a>
component, which I discuss
<a href="https://trent.me/is-prefix-of-string-in-table/#implementation-considerations">here</a>.
This involves defining a structure named
<a
href="https://github.com/tpn/tracer/blob/515692eb7da838079f482e7ada27c6bda59617eb/PerfectHashTable/PerfectHashTable.h#L561">PERFECT_HASH_TABLE_API</a>,
and an inline function,
<a
href="https://github.com/tpn/tracer/blob/515692eb7da838079f482e7ada27c6bda59617eb/PerfectHashTable/PerfectHashTable.h#L624">LoadPerfectHashTableApi()</a>.
</p>
<p>
The structure looks like this:
</p>
<pre class="code"><code class="language-c">typedef struct _Struct_size_bytes_(SizeOfStruct) _PERFECT_HASH_TABLE_API {
//
// Size of the structure, in bytes. This is filled in automatically by
// LoadPerfectHashTableApi() based on the initial SizeOfAnyApi parameter.
//
_In_range_(sizeof(struct _PERFECT_HASH_TABLE_API),
sizeof(struct _PERFECT_HASH_TABLE_API_EX)) ULONG SizeOfStruct;
//
// Number of function pointers contained in the structure. This is filled
// in automatically by LoadPerfectHashTableApi() based on the initial
// SizeOfAnyApi parameter divided by the size of a function pointer.
//
ULONG NumberOfFunctions;
//
// Begin function pointers.
//
union {
PVOID FirstFunctionPointer;
PSET_C_SPECIFIC_HANDLER SetCSpecificHandler;
};
PLOAD_PERFECT_HASH_TABLE_KEYS LoadPerfectHashTableKeys;
PDESTROY_PERFECT_HASH_TABLE_KEYS DestroyPerfectHashTableKeys;
PCREATE_PERFECT_HASH_TABLE_CONTEXT CreatePerfectHashTableContext;
PDESTROY_PERFECT_HASH_TABLE_CONTEXT DestroyPerfectHashTableContext;
PCREATE_PERFECT_HASH_TABLE CreatePerfectHashTable;
PLOAD_PERFECT_HASH_TABLE LoadPerfectHashTable;
PTEST_PERFECT_HASH_TABLE TestPerfectHashTable;
PINITIALIZE_PERFECT_HASH_TABLE_ALLOCATOR InitializePerfectHashAllocator;
PINITIALIZE_PERFECT_HASH_TABLE_ALLOCATOR_FROM_RTL_BOOTSTRAP
InitializePerfectHashAllocatorFromRtlBootstrap;
} PERFECT_HASH_TABLE_API;
typedef PERFECT_HASH_TABLE_API *PPERFECT_HASH_TABLE_API;
</code></pre>
<p>
And the loader routine looks like this:
</p>
<pre class="code"><code class="language-c">FORCEINLINE
BOOLEAN
LoadPerfectHashTableApi(
_In_ PRTL Rtl,
_Inout_ HMODULE *ModulePointer,
_In_opt_ PUNICODE_STRING ModulePath,
_In_ ULONG SizeOfAnyApi,
_Out_writes_bytes_all_(SizeOfAnyApi) PPERFECT_HASH_TABLE_ANY_API AnyApi
)
/*++
Routine Description:
Loads the perfect hash table module and resolves all API functions for
either the PERFECT_HASH_TABLE_API or PERFECT_HASH_TABLE_API_EX structure.
The desired API is indicated by the SizeOfAnyApi parameter.
Example use:
PERFECT_HASH_TABLE_API_EX GlobalApi;
PPERFECT_HASH_TABLE_API_EX Api;
Success = LoadPerfectHashTableApi(Rtl,
NULL,
NULL,
sizeof(GlobalApi),
(PPERFECT_HASH_TABLE_ANY_API)&amp;GlobalApi);
ASSERT(Success);
Api = &amp;GlobalApi;
In this example, the extended API will be provided as our sizeof(GlobalApi)
will indicate the structure size used by PERFECT_HASH_TABLE_API_EX.
Arguments:
Rtl - Supplies a pointer to an initialized RTL structure.
ModulePointer - Optionally supplies a pointer to an existing module handle
for which the API symbols are to be resolved. May be NULL. If not
NULL, but the pointed-to value is NULL, then this parameter will
receive the handle obtained by LoadLibrary() as part of this call.
If the string table module is no longer needed, but the program will
keep running, the caller should issue a FreeLibrary() against this
module handle.
ModulePath - Optionally supplies a pointer to a UNICODE_STRING structure
representing a path name of the string table module to be loaded.
If *ModulePointer is not NULL, it takes precedence over this parameter.
If NULL, and no module has been provided via *ModulePointer, loading
will be attempted via LoadLibraryA("PerfectHashTable.dll").
SizeOfAnyApi - Supplies the size, in bytes, of the underlying structure
pointed to by the AnyApi parameter.
AnyApi - Supplies the address of a structure which will receive resolved
API function pointers. The API furnished will depend on the size
indicated by the SizeOfAnyApi parameter.
Return Value:
TRUE on success, FALSE on failure.
--*/
</code></pre>
<p>
This is a little quirky, but I've found it to be a useful way to
dynamically load and use small components at runtime without requiring
any static linking or dll export/import glue.
</p>
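<p>
The essence of the pattern can be captured in a few lines of plain C. (This is
a hypothetical miniature: the names and the static stand-in symbol table are
illustrative only; the real routine resolves the pointers from the loaded DLL
via GetProcAddress().) The caller communicates the desired API via the
sizeof() of whichever structure it declared, and the loader fills in as many
function pointers as that size allows:
</p>
<pre class="code"><code class="language-c">#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;

typedef int (*TOY_API_FUNC)(void);

typedef struct _TOY_ANY_API {
    unsigned long SizeOfStruct;
    unsigned long NumberOfFunctions;
    TOY_API_FUNC Functions[4];
} TOY_ANY_API;

static int ToyFunc1(void) { return 1; }
static int ToyFunc2(void) { return 2; }
static int ToyFunc3(void) { return 3; }
static int ToyFunc4(void) { return 4; }

//
// Stand-in for the DLL's exported symbols.
//

static const TOY_API_FUNC ToySymbols[4] = {
    ToyFunc1, ToyFunc2, ToyFunc3, ToyFunc4
};

static int ToyLoadApi(unsigned long SizeOfAnyApi, TOY_ANY_API *AnyApi)
{
    size_t Header = offsetof(TOY_ANY_API, Functions);
    unsigned long NumberOfFunctions;
    unsigned long Index;

    if (SizeOfAnyApi &lt; Header) {
        return 0;
    }

    //
    // Derive the number of function pointers to resolve from the
    // structure size provided by the caller.
    //

    NumberOfFunctions = (unsigned long)(
        (SizeOfAnyApi - Header) / sizeof(TOY_API_FUNC)
    );

    if (NumberOfFunctions &gt; 4) {
        return 0;
    }

    AnyApi-&gt;SizeOfStruct = SizeOfAnyApi;
    AnyApi-&gt;NumberOfFunctions = NumberOfFunctions;

    for (Index = 0; Index &lt; NumberOfFunctions; Index++) {
        AnyApi-&gt;Functions[Index] = ToySymbols[Index];
    }

    return 1;
}</code></pre>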
<p>
Publicly, the keys and context structures are opaque. Routines are exposed to
create them and destroy them, and that's it. The structure details are reserved
for main private header for the component,
<a
href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/PerfectHashTablePrivate.h">PerfectHashTablePrivate.h</a>,
which is available for inclusion by all of our internal .c files. Pre-compiled
headers are used, such that the only file each individual .c file needs to
include is <a
href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/stdafx.h">stdafx.h</a>.
</p>
<p>
The perfect hash table interface that implements the required
<code>Insert()</code> and <code>Lookup()</code> routines is exposed as a
COM-style vtbl structure named
<a href="https://github.com/tpn/tracer/blob/515692eb7da838079f482e7ada27c6bda59617eb/PerfectHashTable/PerfectHashTable.h#L438">
PERFECT_HASH_TABLE_VTBL</a>. It's not technically a COM interface, as we don't
expose a <code>QueryInterface()</code> function pointer in the first slot, but
it has the same initial <code>IUnknown</code> layout, and leverages the
reference counting facilities implied by <code>AddRef()</code> and
<code>Release()</code> to manage lifetime. (This differs from all of the other
structures, which will typically have both a Create and Destroy-type method.)
The structure, with an accompanying definition of the loader function pointer
and all public vtbl methods for interacting with the perfect hash table, looks
like this:
</p>
<pre class="code"><code class="language-c">//
// Forward definition of the interface.
//
typedef struct _PERFECT_HASH_TABLE_VTBL PERFECT_HASH_TABLE_VTBL;
typedef PERFECT_HASH_TABLE_VTBL *PPERFECT_HASH_TABLE_VTBL;
typedef PERFECT_HASH_TABLE_VTBL **PPPERFECT_HASH_TABLE;
//
// Define a minimal vtbl encapsulation structure if we're a public
// (i.e. non-internal) build. The actual structure is defined in
// PerfectHashTablePrivate.h.
//
#ifndef _PERFECT_HASH_TABLE_INTERNAL_BUILD
typedef struct _PERFECT_HASH_TABLE {
PPERFECT_HASH_TABLE_VTBL Vtbl;
} PERFECT_HASH_TABLE;
#else
typedef struct _PERFECT_HASH_TABLE PERFECT_HASH_TABLE;
#endif
typedef PERFECT_HASH_TABLE *PPERFECT_HASH_TABLE;
typedef
_Check_return_
_Success_(return != 0)
BOOLEAN
(NTAPI LOAD_PERFECT_HASH_TABLE)(
_In_ PRTL Rtl,
_In_ PALLOCATOR Allocator,
_In_opt_ PPERFECT_HASH_TABLE_KEYS Keys,
_In_ PCUNICODE_STRING Path,
_Out_ PPERFECT_HASH_TABLE *TablePointer
);
typedef LOAD_PERFECT_HASH_TABLE *PLOAD_PERFECT_HASH_TABLE;
//
// Define the public perfect hash table functions.
//
typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_INSERT)(
_In_ PPERFECT_HASH_TABLE Table,
_In_ ULONG Key,
_In_ ULONG Value,
_Out_opt_ PULONG PreviousValue
);
typedef PERFECT_HASH_TABLE_INSERT *PPERFECT_HASH_TABLE_INSERT;
typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_LOOKUP)(
_In_ PPERFECT_HASH_TABLE Table,
_In_ ULONG Key,
_Out_ PULONG Value
);
typedef PERFECT_HASH_TABLE_LOOKUP *PPERFECT_HASH_TABLE_LOOKUP;
typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_DELETE)(
_In_ PPERFECT_HASH_TABLE Table,
_In_ ULONG Key,
_Out_opt_ PULONG PreviousValue
);
typedef PERFECT_HASH_TABLE_DELETE *PPERFECT_HASH_TABLE_DELETE;
//
// Given a key, this routine returns the relative index of the key in the
// underlying hash table. This is guaranteed to be within the bounds of the
// table size.
//
typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_INDEX)(
_In_ PPERFECT_HASH_TABLE Table,
_In_ ULONG Key,
_Out_ PULONG Index
);
typedef PERFECT_HASH_TABLE_INDEX *PPERFECT_HASH_TABLE_INDEX;
typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_HASH)(
_In_ PPERFECT_HASH_TABLE Table,
_In_ ULONG Input,
_Out_ PULONGLONG Hash
);
typedef PERFECT_HASH_TABLE_HASH *PPERFECT_HASH_TABLE_HASH;
typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_MASK_HASH)(
_In_ PPERFECT_HASH_TABLE Table,
_In_ ULONG Input,
_Out_ PULONG Masked
);
typedef PERFECT_HASH_TABLE_MASK_HASH *PPERFECT_HASH_TABLE_MASK_HASH;
typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_MASK_INDEX)(
_In_ PPERFECT_HASH_TABLE Table,
_In_ ULONGLONG Input,
_Out_ PULONG Masked
);
typedef PERFECT_HASH_TABLE_MASK_INDEX *PPERFECT_HASH_TABLE_MASK_INDEX;
//
// Loaded hash tables are reference counted using the AddRef()/Release() COM
// semantics. The number of AddRef() calls should match the number of Release()
// calls. The resources will be released when the final Release() is called.
//
typedef
ULONG
(NTAPI PERFECT_HASH_TABLE_ADD_REF)(
_In_ PPERFECT_HASH_TABLE Table
);
typedef PERFECT_HASH_TABLE_ADD_REF *PPERFECT_HASH_TABLE_ADD_REF;
typedef
ULONG
(NTAPI PERFECT_HASH_TABLE_RELEASE)(
_In_ PPERFECT_HASH_TABLE Table
);
typedef PERFECT_HASH_TABLE_RELEASE *PPERFECT_HASH_TABLE_RELEASE;
//
// The interface as a vtbl. Note that we're *almost* a valid COM interface,
// except for the NULL pointer that will occupy the first slot where the impl
// for QueryInterface() is meant to live.
//
typedef struct _PERFECT_HASH_TABLE_VTBL {
PVOID Unused;
PPERFECT_HASH_TABLE_ADD_REF AddRef;
PPERFECT_HASH_TABLE_RELEASE Release;
PPERFECT_HASH_TABLE_INSERT Insert;
PPERFECT_HASH_TABLE_LOOKUP Lookup;
PPERFECT_HASH_TABLE_DELETE Delete;
PPERFECT_HASH_TABLE_INDEX Index;
PPERFECT_HASH_TABLE_HASH Hash;
PPERFECT_HASH_TABLE_MASK_HASH MaskHash;
PPERFECT_HASH_TABLE_MASK_INDEX MaskIndex;
} PERFECT_HASH_TABLE_VTBL;
typedef PERFECT_HASH_TABLE_VTBL *PPERFECT_HASH_TABLE_VTBL;</code></pre>
<p>
The COM-style vtbl pattern worked quite nicely here, and provided the perfect
amount of exposure between the public and private definitions of the table
interface. It allows us to mix and match who provides what with regards to
the vtbl function pointers, based on the selected algorithm, hash function
and masking type. The
<a
href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/PerfectHashTableConstants.h">PerfectHashTableConstants.h</a>
file exposes an internal routine named
<a
href="https://github.com/tpn/tracer/blob/515692eb7da838079f482e7ada27c6bda59617eb/PerfectHashTable/PerfectHashTableConstants.h#L86">InitializeExtendedVtbl</a>,
which looks like this:
</p>
<pre class="code"><code class="language-c">//
// Helper inline routine for initializing the extended vtbl interface.
//
FORCEINLINE
VOID
InitializeExtendedVtbl(
_In_ PPERFECT_HASH_TABLE Table,
_Inout_ PPERFECT_HASH_TABLE_VTBL_EX Vtbl
)
{
Vtbl-&gt;AddRef = PerfectHashTableAddRef;
Vtbl-&gt;Release = PerfectHashTableRelease;
Vtbl-&gt;Insert = PerfectHashTableInsert;
Vtbl-&gt;Lookup = PerfectHashTableLookup;
Vtbl-&gt;Delete = PerfectHashTableDelete;
Vtbl-&gt;Index = IndexRoutines[Table-&gt;AlgorithmId];
Vtbl-&gt;Hash = HashRoutines[Table-&gt;HashFunctionId];
Vtbl-&gt;MaskHash = MaskHashRoutines[Table-&gt;MaskFunctionId];
Vtbl-&gt;MaskIndex = MaskIndexRoutines[Table-&gt;MaskFunctionId];
Vtbl-&gt;SeededHash = SeededHashRoutines[Table-&gt;HashFunctionId];
Table-&gt;Vtbl = Vtbl;
}</code></pre>
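<p>
The dispatch-array idiom behind the <code>IndexRoutines[AlgorithmId]</code>
style lookups can be sketched in miniature as follows (hypothetical names and
trivial placeholder hash routines, not the actual constants file): each enum ID
simply indexes a constant array of function pointers, so vtbl initialization
reduces to array lookups rather than switch statements.
</p>
<pre class="code"><code class="language-c">#include &lt;assert.h&gt;

typedef unsigned (*TOY_HASH_ROUTINE)(unsigned Key);

typedef enum _TOY_HASH_FUNCTION_ID {
    ToyHashRotateFunctionId = 0,
    ToyHashAddFunctionId = 1,
    ToyNumberOfHashFunctions
} TOY_HASH_FUNCTION_ID;

//
// Trivial placeholder hash routines.
//

static unsigned ToyHashRotate(unsigned Key)
{
    return (Key &lt;&lt; 15) | (Key &gt;&gt; 17);
}

static unsigned ToyHashAdd(unsigned Key)
{
    return Key + 0x9e3779b9u;
}

//
// Constant dispatch array, indexed by the enum ID.
//

static const TOY_HASH_ROUTINE ToyHashRoutines[ToyNumberOfHashFunctions] = {
    ToyHashRotate,
    ToyHashAdd,
};

typedef struct _TOY_HASH_VTBL {
    TOY_HASH_ROUTINE Hash;
} TOY_HASH_VTBL;

static void InitializeToyHashVtbl(TOY_HASH_VTBL *Vtbl, TOY_HASH_FUNCTION_ID Id)
{
    Vtbl-&gt;Hash = ToyHashRoutines[Id];
}</code></pre>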
<p>
Note that, thanks to COM, the first five routines can actually be serviced by
generic, algorithm-agnostic implementations. That is, because the
<code><a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/PerfectHashTableInsert.c">Insert()</a></code>,
<code><a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/PerfectHashTableLookup.c">Lookup()</a></code>,
and
<code><a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/PerfectHashTableDelete.c">Delete()</a></code>
routines simply use the <code>Index()</code> routine to
obtain the array offset for the given input key, they don't
need to know anything about the backend algorithm, hash or
masking type. This actually makes all of these functions
very simple, as can be seen below:
</p>
<div class="tab-box language box-table-functions">
<ul class="tabs">
<li data-content="content-table-functions-insert">Insert</li>
<li data-content="content-table-functions-lookup">Lookup</li>
<li data-content="content-table-functions-delete">Delete</li>
<li data-content="content-table-functions-addref">AddRef</li>
<li data-content="content-table-functions-release">Release</li>
</ul>
<div class="content">
<pre class="code content-table-functions-insert"><code class="language-c">_Use_decl_annotations_
HRESULT
PerfectHashTableInsert(
PPERFECT_HASH_TABLE Table,
ULONG Key,
ULONG Value,
PULONG PreviousValue
)
/*++
Routine Description:
Inserts the given value for the given key into a perfect hash table,
optionally returning the previously-stored value. If no prior insertion has
taken place for this key, the previous value is guaranteed to be 0.
N.B. If Key did not appear in the original set the hash table was created
from, the behavior of this routine is undefined. (In practice, the
key will hash to either an existing key's location or an empty slot,
so there is potential to corrupt the table in the sense that previously
inserted values will be trampled over.)
Arguments:
Table - Supplies a pointer to the table to insert the key/value into.
Key - Supplies the key to insert.
Value - Supplies the value to insert.
PreviousValue - Optionally supplies a pointer that will receive the previous
value at the relevant table location prior to this insertion. If no
prior insertion, the previous value is guaranteed to be 0.
Return Value:
S_OK in all normal operating conditions. E_FAIL may be returned in some
cases when passed a key not in the original input set. The PreviousValue
parameter, if non-NULL, will be cleared in this case.
--*/
{
ULONG Index;
ULONG Existing;
HRESULT Result;
//
// Obtain the index for this key.
//
Result = Table-&gt;Vtbl-&gt;Index(Table, Key, &amp;Index);
if (FAILED(Result)) {
//
// Clear the caller's pointer if applicable and return error.
//
if (ARGUMENT_PRESENT(PreviousValue)) {
*PreviousValue = 0;
}
return E_FAIL;
}
//
// Get the existing value.
//
Existing = Table-&gt;Values[Index];
//
// Write the new value.
//
Table-&gt;Values[Index] = Value;
//
// Update the caller's pointer if applicable.
//
if (ARGUMENT_PRESENT(PreviousValue)) {
*PreviousValue = Existing;
}
return S_OK;
}</code></pre>
<pre class="code content-table-functions-lookup"><code class="language-c">_Use_decl_annotations_
HRESULT
PerfectHashTableLookup(
PPERFECT_HASH_TABLE Table,
ULONG Key,
PULONG Value
)
/*++
Routine Description:
Looks up given key in a perfect hash table and returns the value set by
the Insert() routine. If no insertion has taken place for this key, this
routine guarantees to return 0 as the value.
N.B. If key did not appear in the original set the hash table was created
from, the behavior of this routine is undefined. (In practice, the
value returned will be the value for some other key in the table that
hashes to the same location -- or potentially an empty slot in the
table.)
Arguments:
Table - Supplies a pointer to the table for which the key lookup is to be
performed.
Key - Supplies the key to look up.
Value - Receives the value for the given key.
Return Value:
S_OK in all normal operating conditions. E_FAIL may be returned in some
cases when passed a key not in the original input set. The Value parameter
will be set to NULL in this case.
--*/
{
ULONG Index;
HRESULT Result;
//
// Obtain the index for this key.
//
Result = Table-&gt;Vtbl-&gt;Index(Table, Key, &amp;Index);
if (FAILED(Result)) {
//
// Clear the caller's pointers and return the error code.
//
*Value = 0;
return E_FAIL;
}
*Value = Table-&gt;Values[Index];
return S_OK;
}</code></pre>
<pre class="code content-table-functions-delete"><code class="language-c">_Use_decl_annotations_
HRESULT
PerfectHashTableDelete(
PPERFECT_HASH_TABLE Table,
ULONG Key,
PULONG PreviousValue
)
/*++
Routine Description:
Deletes a key from a perfect hash table, optionally returning the value
prior to deletion back to the caller. Deletion simply clears the value
associated with the key, and thus, is a simple O(1) operation. Deleting
a key that has not yet been inserted has no effect other than potentially
returning 0 as the previous value. That is, a caller can safely issue
deletes of keys regardless of whether or not said keys were inserted first.
N.B. If key did not appear in the original set the hash table was created
from, the behavior of this routine is undefined. (In practice, the
key will hash to either an existing key's location or an empty slot,
so there is potential to corrupt the table in the sense that a
previously inserted value for an unrelated, valid key will be cleared.)
Arguments:
Table - Supplies a pointer to the table for which the key is to be deleted.
Key - Supplies the key to delete.
PreviousValue - Optionally supplies a pointer that will receive the previous
value at the relevant table location prior to this deletion. If no
prior insertion, the previous value is guaranteed to be 0.
Return Value:
S_OK in all normal operating conditions. E_FAIL may be returned in some
cases when passed a key not in the original input set. The PreviousValue
parameter, if non-NULL, will be cleared in this case.
--*/
{
ULONG Index;
ULONG Existing;
HRESULT Result;
//
// Obtain the index for this key.
//
Result = Table-&gt;Vtbl-&gt;Index(Table, Key, &amp;Index);
if (FAILED(Result)) {
//
// Clear the caller's pointer if applicable and return error.
//
if (ARGUMENT_PRESENT(PreviousValue)) {
*PreviousValue = 0;
}
return E_FAIL;
}
//
// Get the existing value.
//
Existing = Table-&gt;Values[Index];
//
// Clear the value.
//
Table-&gt;Values[Index] = 0;
//
// Update the caller's pointer if applicable.
//
if (ARGUMENT_PRESENT(PreviousValue)) {
*PreviousValue = Existing;
}
return S_OK;
}</code></pre>
<pre class="code content-table-functions-addref"><code class="language-c">_Use_decl_annotations_
ULONG
PerfectHashTableAddRef(
PPERFECT_HASH_TABLE Table
)
/*++
Routine Description:
Increments the reference count for a perfect hash table.
Arguments:
Table - Supplies a pointer to the table for which the reference count
is to be incremented.
Return Value:
The new reference count.
--*/
{
return InterlockedIncrement(&amp;Table-&gt;ReferenceCount);
}</code></pre>
<pre class="code content-table-functions-release"><code class="language-c">_Use_decl_annotations_
ULONG
PerfectHashTableRelease(
PPERFECT_HASH_TABLE Table
)
/*++
Routine Description:
Decrements the reference count associated with a perfect hash table. If
this is the last reference, the table is destroyed.
Arguments:
Table - Supplies a pointer to the table for which the reference count
is to be decremented.
Return Value:
The new reference count.
--*/
{
ULONG Count = InterlockedDecrement(&amp;Table-&gt;ReferenceCount);
PPERFECT_HASH_TABLE TablePointer = Table;
if (Count &gt; 0) {
return Count;
}
DestroyPerfectHashTable(&amp;TablePointer, NULL);
return Count;
}</code></pre>
</div>
</div>
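<p>
This layering can be captured in a minimal sketch (hypothetical names; the
stand-in <code>Index()</code> below just masks into an eight-slot array,
whereas the real routine performs the perfect hash computation): the generic
routines only ever call through the vtbl's <code>Index()</code> slot, so they
work unchanged regardless of the backend algorithm.
</p>
<pre class="code"><code class="language-c">#include &lt;assert.h&gt;

typedef struct _TOY_TABLE TOY_TABLE;

typedef int (*TOY_INDEX)(TOY_TABLE *Table, unsigned Key, unsigned *Index);

struct _TOY_TABLE {
    TOY_INDEX Index;
    unsigned Values[8];
};

//
// Stand-in Index(): mask into the eight-slot array.  A real perfect
// hash guarantees a unique slot for each key in the original set.
//

static int ToyIndex(TOY_TABLE *Table, unsigned Key, unsigned *Index)
{
    (void)Table;
    *Index = Key &amp; 7;
    return 0;
}

//
// Insert() and Lookup() are algorithm-agnostic: they only ever call
// through the vtbl's Index() slot.
//

static int ToyInsert(TOY_TABLE *Table, unsigned Key, unsigned Value)
{
    unsigned Index;
    if (Table-&gt;Index(Table, Key, &amp;Index) != 0) {
        return -1;
    }
    Table-&gt;Values[Index] = Value;
    return 0;
}

static int ToyLookup(TOY_TABLE *Table, unsigned Key, unsigned *Value)
{
    unsigned Index;
    if (Table-&gt;Index(Table, Key, &amp;Index) != 0) {
        *Value = 0;
        return -1;
    }
    *Value = Table-&gt;Values[Index];
    return 0;
}</code></pre>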
<p>
I prefer this COM-style vtbl approach to the cmph project's approach, which
uses giant switch statements to select appropriate backends, e.g.:
</p>
<pre class="code"><code class="language-c">cmph_t *cmph_new(cmph_config_t *mph)
{
cmph_t *mphf = NULL;
double c = mph-&gt;c;
DEBUGP("Creating mph with algorithm %s\n", cmph_names[mph-&gt;algo]);
switch (mph-&gt;algo)
{
case CMPH_CHM:
DEBUGP("Creating chm hash\n");
mphf = chm_new(mph, c);
break;
case CMPH_BMZ: /* included - Fabiano */
DEBUGP("Creating bmz hash\n");
mphf = bmz_new(mph, c);
break;
case CMPH_BMZ8: /* included - Fabiano */
DEBUGP("Creating bmz8 hash\n");
mphf = bmz8_new(mph, c);
break;
case CMPH_BRZ: /* included - Fabiano */
DEBUGP("Creating brz hash\n");
if (c &gt;= 2.0) brz_config_set_algo(mph, CMPH_FCH);
else brz_config_set_algo(mph, CMPH_BMZ8);
mphf = brz_new(mph, c);
break;
case CMPH_FCH: /* included - Fabiano */
DEBUGP("Creating fch hash\n");
mphf = fch_new(mph, c);
break;
case CMPH_BDZ: /* included - Fabiano */
DEBUGP("Creating bdz hash\n");
mphf = bdz_new(mph, c);
break;
case CMPH_BDZ_PH: /* included - Fabiano */
DEBUGP("Creating bdz_ph hash\n");
mphf = bdz_ph_new(mph, c);
break;
case CMPH_CHD_PH: /* included - Fabiano */
DEBUGP("Creating chd_ph hash\n");
mphf = chd_ph_new(mph, c);
break;
case CMPH_CHD: /* included - Fabiano */
DEBUGP("Creating chd hash\n");
mphf = chd_new(mph, c);
break;
default:
assert(0);
}
return mphf;
}
int cmph_dump(cmph_t *mphf, FILE *f)
{
switch (mphf-&gt;algo)
{
case CMPH_CHM:
return chm_dump(mphf, f);
case CMPH_BMZ: /* included - Fabiano */
return bmz_dump(mphf, f);
case CMPH_BMZ8: /* included - Fabiano */
return bmz8_dump(mphf, f);
case CMPH_BRZ: /* included - Fabiano */
return brz_dump(mphf, f);
case CMPH_FCH: /* included - Fabiano */
return fch_dump(mphf, f);
case CMPH_BDZ: /* included - Fabiano */
return bdz_dump(mphf, f);
case CMPH_BDZ_PH: /* included - Fabiano */
return bdz_ph_dump(mphf, f);
case CMPH_CHD_PH: /* included - Fabiano */
return chd_ph_dump(mphf, f);
case CMPH_CHD: /* included - Fabiano */
return chd_dump(mphf, f);
default:
assert(0);
}
assert(0);
return 0;
}
</code></pre>
<p>
The <code>Index()</code> routine is algorithm-specific, and forms the backbone of
the perfect hash table functionality: the ability to take a key and return a
unique index that can be used to offset into an array of data.
</p>
<p>
Here's what the <code>
<a
href="https://github.com/tpn/tracer/blob/db04ecfbd9162c838c4d747d745b10d786890cd3/PerfectHashTable/Chm_01.c#L3291">Index()</a></code>
routine looks like for our CHM algorithm implementation. Note that it too
leverages the COM interface for driving the hash and masking routines.
</p>
<pre class="code"><code class="language-c">_Use_decl_annotations_
HRESULT
PerfectHashTableIndexImplChm01(
PPERFECT_HASH_TABLE Table,
ULONG Key,
PULONG Index
)
/*++
Routine Description:
Looks up given key in a perfect hash table and returns its index.
N.B. If Key did not appear in the original set the hash table was created
from, the behavior of this routine is undefined. (In practice, the
key will hash to either an existing key's location or an empty slot,
so there is potential for returning a non-unique index.)
Arguments:
Table - Supplies a pointer to the table for which the key lookup is to be
performed.
Key - Supplies the key to look up.
Index - Receives the index associated with this key. The index will be
between 0 and NumberOfKeys-1, and can be safely used to offset directly
into an appropriately sized array (e.g. Table-&gt;Values[]).
Return Value:
S_OK on success, E_FAIL if the underlying hash function returned a failure.
This will happen if the two hash values for a key happen to be identical.
It shouldn't happen once a perfect graph has been created (i.e. it only
happens when attempting to solve the graph). The Index parameter will
be cleared in the case of E_FAIL.
--*/
{
ULONG Masked;
ULONG Vertex1;
ULONG Vertex2;
ULONG MaskedLow;
ULONG MaskedHigh;
PULONG Assigned;
ULONGLONG Combined;
ULARGE_INTEGER Hash;
//
// Hash the incoming key into the 64-bit representation, which is two
// 32-bit ULONGs in disguise, each one driven by a separate seed value.
//
if (FAILED(Table-&gt;Vtbl-&gt;Hash(Table, Key, &amp;Hash.QuadPart))) {
goto Error;
}
//
// Mask each hash value such that it falls within the confines of the
// number of vertices. That is, make sure the value is between 0 and
// Table-&gt;NumberOfVertices-1.
//
if (FAILED(Table-&gt;Vtbl-&gt;MaskHash(Table, Hash.LowPart, &amp;MaskedLow))) {
goto Error;
}
if (FAILED(Table-&gt;Vtbl-&gt;MaskHash(Table, Hash.HighPart, &amp;MaskedHigh))) {
goto Error;
}
//
// Obtain the corresponding vertex values for the masked high and low hash
// values. These are derived from the "assigned" array that we construct
// during the creation routine's assignment step (GraphAssign()).
//
Assigned = Table-&gt;Data;
Vertex1 = Assigned[MaskedLow];
Vertex2 = Assigned[MaskedHigh];
//
// Combine the two values, then perform the index masking operation, such
// that our final index into the array falls within the confines of the
// number of edges, or keys, in the table. That is, make sure the index
// value is between 0 and Table-&gt;Keys-&gt;NumberOfElements-1.
//
Combined = (ULONGLONG)Vertex1 + (ULONGLONG)Vertex2;
if (FAILED(Table-&gt;Vtbl-&gt;MaskIndex(Table, Combined, &amp;Masked))) {
goto Error;
}
//
// Update the caller's pointer and return success. The resulting index
// value represents the array offset index for this given key in the
// underlying table, and is guaranteed to be unique amongst the original
// keys in the input set.
//
*Index = Masked;
return S_OK;
Error:
//
// Clear the caller's pointer and return failure. We should only hit this
// point if the caller supplies a key that both: a) wasn't in the original
// input set, and b) happens to result in a hash value where both the high
// part and low part are identical, which is rare, but not impossible.
//
*Index = 0;
return E_FAIL;
}
</code></pre>
<p>
This routine lives in our
<a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/Chm_01.c">Chm_01.c</a>
file, which is our only algorithm backend at the moment. This file, together
with its counterpart
<a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/Chm_01.h">Chm_01.h</a>
header, features all of the algorithm-specific structures and routines related
to the implementation. It is here we define the graph structures specific to
the CHM algorithm: <code>GRAPH_DIMENSIONS</code>, <code>GRAPH_INFO</code>, and
<code>GRAPH</code>.
</p>
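<p>
Stripped of the COM plumbing, the arithmetic at the heart of <code>Index()</code>
is tiny: two lookups into the assigned array, a 64-bit addition, and a mask.
The following toy sketch is not the project's API; the assigned array and
vertex pairs are hand-built for a hypothetical two-key table, but it shows
the shape of the computation:
</p>
<pre class="code"><code class="language-c">//
// Hand-built assigned array for a toy two-key table with four vertices.
// Assume key A hashes and masks to vertices (0, 2), and key B to (1, 3).
//
static const uint32_t Assigned[4] = { 0, 0, 0, 1 };

static uint32_t
ToyIndex(uint32_t Vertex1, uint32_t Vertex2)
{
    //
    // Combine the two assigned values, then mask the result down to the
    // number of keys (IndexMask = NumberOfKeys - 1 = 1 for two keys).
    //
    uint64_t Combined = (uint64_t)Assigned[Vertex1] + Assigned[Vertex2];
    return (uint32_t)(Combined &amp; 1);
}

int main(void)
{
    printf("%u %u\n", ToyIndex(0, 2), ToyIndex(1, 3));
    return 0;
}</code></pre>
<p>
Key A yields index 0 and key B yields index 1; with a real assigned array
produced by the assignment step, every original key maps to a unique slot in
exactly this fashion.
</p>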
<div class="tab-box language box-chm01-structs">
<ul class="tabs">
<li data-content="content-chm01-structs-graph_dimensions">GRAPH_DIMENSIONS</li>
<li data-content="content-chm01-structs-graph_info">GRAPH_INFO</li>
<li data-content="content-chm01-structs-graph">GRAPH</li>
</ul>
<div class="content">
<pre class="code content-chm01-structs-graph_dimensions"><code class="language-c">//
// Define the primary dimensions governing the graph size.
//
typedef struct _GRAPH_DIMENSIONS {
//
// Number of edges in the graph. This corresponds to the number of keys
// in our input set. If modulus masking is active, the number of keys and
// the number of edges will be identical. Otherwise, the number of edges
// will be the number of keys rounded up to a power of 2.
//
ULONG NumberOfEdges;
//
// Total number of edges in the graph. This will be twice the size of the
// NumberOfEdges value above, due to the quirky way the underlying r-graph
// algorithm captures two hash values in the same list and offsets the
// second set after the first, e.g.:
//
// Edge2 = Edge1 + Graph-&gt;NumberOfEdges;
//
ULONG TotalNumberOfEdges;
//
// Number of vertices in the graph. This will vary based on the masking
// type. It is doubled every time a graph resize event is encountered.
//
ULONG NumberOfVertices;
//
// The number of edges in the graph, rounded up to a power of 2, and then
// shifted the appropriate amount to extract the exponent part for 2^n.
//
BYTE NumberOfEdgesPowerOf2Exponent;
//
// As above, but rounded up to the next power of 2 first.
//
BYTE NumberOfEdgesNextPowerOf2Exponent;
//
// The same exponent logic applied to the number of vertices as per the
// NumberOfVertices field above.
//
BYTE NumberOfVerticesPowerOf2Exponent;
//
// And again for the next value.
//
BYTE NumberOfVerticesNextPowerOf2Exponent;
} GRAPH_DIMENSIONS;
C_ASSERT(sizeof(GRAPH_DIMENSIONS) == 16);
typedef GRAPH_DIMENSIONS *PGRAPH_DIMENSIONS;</code></pre>
<pre class="code content-chm01-structs-graph_info"><code class="language-c">//
// Define various memory offsets associated with a given graph structure.
// This allows parallel worker threads to reset their local GRAPH instance
// back to the initial state each time they want to try a new random seed.
//
typedef struct _GRAPH_INFO {
//
// Number of pages consumed by the entire graph and all backing arrays.
//
ULONG NumberOfPagesPerGraph;
//
// Page size (e.g. 4096, 2MB).
//
ULONG PageSize;
//
// Total number of graphs created. This will match the maximum concurrency
// level of the upstream context.
//
ULONG NumberOfGraphs;
//
// Number of RTL_BITMAP structures used by the graph.
//
USHORT NumberOfBitmaps;
//
// Size of the GRAPH structure.
//
USHORT SizeOfGraphStruct;
//
// System allocation granularity. We align the memory map for the on-disk
// structure using this value initially.
//
ULONG AllocationGranularity;
//
// If a masking type other than modulus is active, the AbsoluteEdge() needs
// a way to mask edge values that exceed the number of edges in the table.
// It does this via EdgeMask, which is initialized to the number of edges
// (which will be power-of-2 sized for non-modulus masking), minus 1, such
// that all lower bits will be set.
//
ULONG EdgeMask;
//
// Also capture the mask required to isolate vertices.
//
ULONG VertexMask;
//
// Graph dimensions. This information is duplicated in the graph due to
// it being accessed frequently.
//
GRAPH_DIMENSIONS Dimensions;
//
// Pointer to the owning context.
//
PPERFECT_HASH_TABLE_CONTEXT Context;
//
// Base address of the entire graph allocation.
//
union {
PVOID BaseAddress;
struct _GRAPH *FirstGraph;
};
//
// Array sizes.
//
ULONGLONG EdgesSizeInBytes;
ULONGLONG NextSizeInBytes;
ULONGLONG FirstSizeInBytes;
ULONGLONG PrevSizeInBytes;
ULONGLONG AssignedSizeInBytes;
ULONGLONG ValuesSizeInBytes;
//
// Deleted edges bitmap buffer size.
//
ULONGLONG DeletedEdgesBitmapBufferSizeInBytes;
//
// Visited vertices bitmap buffer size.
//
ULONGLONG VisitedVerticesBitmapBufferSizeInBytes;
//
// Assigned bitmap buffer size.
//
ULONGLONG AssignedBitmapBufferSizeInBytes;
//
// Index bitmap buffer size.
//
ULONGLONG IndexBitmapBufferSizeInBytes;
//
// The allocation size of the graph, including structure size and all
// array and bitmap buffer sizes.
//
ULONGLONG AllocSize;
//
// Allocation size rounded up to the nearest page size multiple.
//
ULONGLONG FinalSize;
} GRAPH_INFO;
typedef GRAPH_INFO *PGRAPH_INFO;</code></pre>
<pre class="code content-chm01-structs-graph"><code class="language-c">//
// Define the graph structure. This represents an r-graph, or a hypergraph,
// or an r-partite 2-uniform graph, or any other seemingly unlimited number
// of names floating around in academia for what appears to be exactly the
// same thing.
//
typedef struct _Struct_size_bytes_(SizeOfStruct) _GRAPH {
//
// List entry used to push the graph onto the context's work list.
//
SLIST_ENTRY ListEntry;
//
// Edge and vertex masks that can be used when non-modulus masking is in
// place. Both of these values are duplicated from the info structure as
// they are accessed frequently.
//
ULONG EdgeMask;
ULONG VertexMask;
//
// Duplicate the mask type, as well, as this directs AbsoluteEdge()'s
// decision to use the two masks above.
//
PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId;
//
// Duplicate the number of keys, as this is also frequently referenced.
//
ULONG NumberOfKeys;
//
// Structure size, in bytes.
//
_Field_range_(== , sizeof(struct _GRAPH)) ULONG SizeOfStruct;
//
// Graph flags.
//
GRAPH_FLAGS Flags;
//
// Pointer to the info structure describing various sizes.
//
PGRAPH_INFO Info;
//
// Graph attempt. This ID is derived from an interlocked increment against
// Context-&gt;Attempts, and represents the attempt number across all threads.
//
ULONGLONG Attempt;
//
// A localized attempt number that reflects the number of attempts made
// by just this thread.
//
ULONG ThreadAttempt;
//
// Thread ID of the thread that owns us. Each callback thread is provided
// a single graph, and will attempt to solve the perfect hash table until
// told otherwise. Thus, there's a 1:1 relationship between graph instance
// and owning thread.
//
ULONG ThreadId;
//
// Counter that is incremented each time we delete an edge during the
// acyclic graph detection stage.
//
ULONG DeletedEdgeCount;
//
// Counter that is incremented each time we visit a vertex during the
// assignment stage.
//
ULONG VisitedVerticesCount;
//
// Capture collisions during assignment step.
//
ULONG Collisions;
//
// Inline the GRAPH_DIMENSIONS structure. This is available from the
// GRAPH_INFO structure, however, it's accessed frequently, so we inline
// it to avoid the extra level of indirection.
//
union {
struct {
ULONG NumberOfEdges;
ULONG TotalNumberOfEdges;
ULONG NumberOfVertices;
BYTE NumberOfEdgesPowerOf2Exponent;
BYTE NumberOfEdgesNextPowerOf2Exponent;
BYTE NumberOfVerticesPowerOf2Exponent;
BYTE NumberOfVerticesNextPowerOf2Exponent;
};
GRAPH_DIMENSIONS Dimensions;
};
//
// Duplicate the context pointer. (This is also available from Info.)
//
PPERFECT_HASH_TABLE_CONTEXT Context;
//
// Edges array. The number of elements in this array is governed by the
// TotalNumberOfEdges field, and will be twice the number of edges.
//
PEDGE Edges;
//
// Array of the "next" edge array, as per the referenced papers. The number
// of elements in this array is also governed by TotalNumberOfEdges.
//
PEDGE Next;
//
// Array of vertices. Number of elements is governed by the
// NumberOfVertices field.
//
PVERTEX First;
//
// The original CHM paper in 1996 references a "prev" array to "facilitate
// fast deletion". However, the chmp project appears to have switched to
// using bitmaps. Let's reserve a slot for the "prev" array anyway.
//
PVERTEX Prev;
//
// Array of assigned vertices. Number of elements is governed by the
// NumberOfVertices field.
//
PVERTEX Assigned;
//
// Array of values indexed by the offsets in the Assigned array. This
// essentially allows us to simulate a loaded table that supports the
// Insert(), Index() and Lookup() routines as part of graph validation.
//
PULONG Values;
//
// Bitmap used to capture deleted edges as part of the acyclic detection
// stage. The SizeOfBitMap will reflect TotalNumberOfEdges.
//
RTL_BITMAP DeletedEdges;
//
// Bitmap used to capture vertices visited as part of the assignment stage.
// The SizeOfBitMap will reflect NumberOfVertices.
//
RTL_BITMAP VisitedVertices;
//
// Bitmap used to test the correctness of the Assigned array.
//
RTL_BITMAP AssignedBitmap;
//
// Bitmap used to track indices during the assignment step.
//
RTL_BITMAP IndexBitmap;
//
// Capture the seeds used for each hash function employed by the graph.
//
ULONG NumberOfSeeds;
struct {
union {
struct {
union {
ULONG Seed1;
ULONG FirstSeed;
};
ULONG Seed2;
};
ULARGE_INTEGER Seeds12;
};
union {
struct {
ULONG Seed3;
union {
ULONG Seed4;
ULONG LastSeed;
};
};
ULARGE_INTEGER Seeds34;
};
};
} GRAPH;
typedef GRAPH *PGRAPH;</code></pre>
</div>
</div>
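<p>
The various <code>PowerOf2Exponent</code> fields in <code>GRAPH_DIMENSIONS</code>
boil down to rounding a count up to a power of two and recording the base-2
exponent of the result. The loop-based sketch below is purely illustrative
(it is not how the project computes these values internally), but it captures
the relationship between the fields:
</p>
<pre class="code"><code class="language-c">static uint32_t
RoundUpPowerOf2(uint32_t Value)
{
    uint32_t Result = 1;
    while (Result &lt; Value) {
        Result &lt;&lt;= 1;
    }
    return Result;
}

static uint8_t
ExtractPowerOf2Exponent(uint32_t PowerOf2)
{
    uint8_t Exponent = 0;
    while (PowerOf2 &gt;&gt;= 1) {
        Exponent++;
    }
    return Exponent;
}

int main(void)
{
    //
    // 37209 keys round up to 65536, i.e. 2^16.
    //
    printf("%u %u\n",
           RoundUpPowerOf2(37209),
           ExtractPowerOf2Exponent(RoundUpPowerOf2(37209)));
    return 0;
}</code></pre>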
<p>
In general, the structures and functions defined in our
<a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/PerfectHashTablePrivate.h">PerfectHashTablePrivate.h</a>
file are algorithm agnostic. That is, they don't deal with the notion of a
graph, as that's an implementation detail of the CHM algorithm.
</p>
<a class="xref" name="code-walkthrough"></a>
<h1>Code Walkthrough</h1>
<p>
Let's examine the code by walking through the behavior of the
<a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/SelfTestPerfectHashTable.c">
SelfTestPerfectHashTable()</a> routine.
</p>
<a class="xref" name="test-data"></a>
<h2>Test Data</h2>
<p>
To generate some test data, you have two options. If you have Python and NumPy
installed, you can generate a bunch of linear and random key files via the
<a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/data/generate.py">generate.py</a>
file that lives in the <code>data</code> subdirectory of the
<code>PerfectHashTable</code> component. Alternatively, you can check out a
small git repository that houses some existing sample test data.
</p>
<h3>Python Generated</h3>
<p>
The <code>generate.py</code> file and its output follow. It will generate linear and random
arrays of given sizes, saving a .txt representation of the values, a .keys
representation of the binary data (which is the actual file we load as part of
perfect hash table creation), and, if matplotlib is installed, a .png plot of
the data, which allows a quick sanity check that the value distribution is what
we expect.
</p>
<small>
(Note: the random routine creates a double-sized array, then calls
<code>np.unique()</code> against it to ensure all the key values are unique,
then isolates the <code>[:size]</code> portion we're actually interested in.
One side effect of calling <code>np.unique()</code> is that our resulting array
is sorted. We don't stipulate whether key files are required to be sorted at
the moment, as it makes no difference to the underlying perfect hash table
algorithm we're using; i.e. we don't suffer the sort of pathological worst-case
performance issues some data structures exhibit when inserting data in sorted
order. Nevertheless, if you compare the plots of the random files to those of
the linear files, they look quite similar, given that everything is sorted.)
</small>
<br/>
<div class="tab-box language box-generate">
<ul class="tabs">
<li data-content="content-generate-generate">generate.py</li>
<li data-content="content-generate-output">Output</li>
</ul>
<div class="content">
<pre class="code content-generate-generate"><code class="language-python">
from __future__ import print_function
import numpy as np
try:
import matplotlib.pyplot as plt
# Turn off interactive mode to disable plots being displayed
# prior to saving them to disk.
plt.ioff()
def save_array_plot_to_png_file(filename, a):
plt.plot(a)
plt.savefig(filename)
print("Generated %s." % filename)
except ImportError:
def save_array_plot_to_png_file(filename, a):
pass
def save_array_to_text_file(filename, a):
with open(filename, 'wb') as f:
for i in a:
f.write(str(i).zfill(9).encode('ascii'))
f.write(b'\n')
print("Generated %s." % filename)
def save_array_to_binary_file(filename, a):
fp = np.memmap(filename, dtype='uint32', mode='w+', shape=a.shape)
fp[:] = a[:]
del fp
print("Generated %s." % filename)
def save_array(prefix, a):
l = len(a)
png_filename = '%s-%d.png' % (prefix, l)
text_filename = '%s-%d.txt' % (prefix, l)
binary_filename = '%s-%d.keys' % (prefix, l)
save_array_plot_to_png_file(png_filename, a)
save_array_to_text_file(text_filename, a)
save_array_to_binary_file(binary_filename, a)
def gen_random_unique_array(size):
a = np.random.randint(0, (1 &lt;&lt; 32)-1, size * 2, dtype='uint32')
a = np.unique(a)[:size]
return a if len(a) == size else None
def gen_linear_array(size):
a = np.linspace(0, size, num=size, dtype='uint32')
return a
sizes = (
2000, 4000, 4050, 5000, 10000, 15000, 17000, 25000, 31000, 33000, 50000,
63000, 65500, 75000, 100000, 121000, 125000, 150000, 200000, 225000, 245000,
255000, 265000, 389161, 472374,
)
functions = (
('linear', gen_linear_array),
('random', gen_random_unique_array),
)
def main():
for (name, func) in functions:
for size in sizes:
a = func(size)
if a is None:
continue
save_array(name, a)
if __name__ == '__main__':
main()
</code></pre>
<pre class="code content-generate-output"><code class="language-nasm"> (py35) C:\Users\Trent\Home\src\tracer\PerfectHashTable\data&gt;python generate.py
Generated linear-2000.png.
Generated linear-2000.txt.
Generated linear-2000.keys.
Generated linear-4000.png.
Generated linear-4000.txt.
Generated linear-4000.keys.
Generated linear-4050.png.
Generated linear-4050.txt.
Generated linear-4050.keys.
Generated linear-5000.png.
Generated linear-5000.txt.
Generated linear-5000.keys.
Generated linear-10000.png.
Generated linear-10000.txt.
Generated linear-10000.keys.
Generated linear-15000.png.
Generated linear-15000.txt.
Generated linear-15000.keys.
Generated linear-17000.png.
Generated linear-17000.txt.
Generated linear-17000.keys.
Generated linear-25000.png.
Generated linear-25000.txt.
Generated linear-25000.keys.
Generated linear-31000.png.
Generated linear-31000.txt.
Generated linear-31000.keys.
Generated linear-33000.png.
Generated linear-33000.txt.
Generated linear-33000.keys.
Generated linear-50000.png.
Generated linear-50000.txt.
Generated linear-50000.keys.
Generated linear-63000.png.
Generated linear-63000.txt.
Generated linear-63000.keys.
Generated linear-65500.png.
Generated linear-65500.txt.
Generated linear-65500.keys.
Generated linear-75000.png.
Generated linear-75000.txt.
Generated linear-75000.keys.
Generated linear-100000.png.
Generated linear-100000.txt.
Generated linear-100000.keys.
Generated linear-121000.png.
Generated linear-121000.txt.
Generated linear-121000.keys.
Generated linear-125000.png.
Generated linear-125000.txt.
Generated linear-125000.keys.
Generated linear-150000.png.
Generated linear-150000.txt.
Generated linear-150000.keys.
Generated linear-200000.png.
Generated linear-200000.txt.
Generated linear-200000.keys.
Generated linear-225000.png.
Generated linear-225000.txt.
Generated linear-225000.keys.
Generated linear-245000.png.
Generated linear-245000.txt.
Generated linear-245000.keys.
Generated linear-255000.png.
Generated linear-255000.txt.
Generated linear-255000.keys.
Generated linear-265000.png.
Generated linear-265000.txt.
Generated linear-265000.keys.
Generated linear-389161.png.
Generated linear-389161.txt.
Generated linear-389161.keys.
Generated linear-472374.png.
Generated linear-472374.txt.
Generated linear-472374.keys.
Generated random-2000.png.
Generated random-2000.txt.
Generated random-2000.keys.
Generated random-4000.png.
Generated random-4000.txt.
Generated random-4000.keys.
Generated random-4050.png.
Generated random-4050.txt.
Generated random-4050.keys.
Generated random-5000.png.
Generated random-5000.txt.
Generated random-5000.keys.
Generated random-10000.png.
Generated random-10000.txt.
Generated random-10000.keys.
Generated random-15000.png.
Generated random-15000.txt.
Generated random-15000.keys.
Generated random-17000.png.
Generated random-17000.txt.
Generated random-17000.keys.
Generated random-25000.png.
Generated random-25000.txt.
Generated random-25000.keys.
Generated random-31000.png.
Generated random-31000.txt.
Generated random-31000.keys.
Generated random-33000.png.
Generated random-33000.txt.
Generated random-33000.keys.
Generated random-50000.png.
Generated random-50000.txt.
Generated random-50000.keys.
Generated random-63000.png.
Generated random-63000.txt.
Generated random-63000.keys.
Generated random-65500.png.
Generated random-65500.txt.
Generated random-65500.keys.
Generated random-75000.png.
Generated random-75000.txt.
Generated random-75000.keys.
Generated random-100000.png.
Generated random-100000.txt.
Generated random-100000.keys.
Generated random-121000.png.
Generated random-121000.txt.
Generated random-121000.keys.
Generated random-125000.png.
Generated random-125000.txt.
Generated random-125000.keys.
Generated random-150000.png.
Generated random-150000.txt.
Generated random-150000.keys.
Generated random-200000.png.
Generated random-200000.txt.
Generated random-200000.keys.
Generated random-225000.png.
Generated random-225000.txt.
Generated random-225000.keys.
Generated random-245000.png.
Generated random-245000.txt.
Generated random-245000.keys.
Generated random-255000.png.
Generated random-255000.txt.
Generated random-255000.keys.
Generated random-265000.png.
Generated random-265000.txt.
Generated random-265000.keys.
Generated random-389161.png.
Generated random-389161.txt.
Generated random-389161.keys.
Generated random-472374.png.
Generated random-472374.txt.
Generated random-472374.keys.
</code></pre>
</div>
</div>
<h3>Git Clone</h3>
<p>
If you don't have Python and NumPy installed, or you can't get
<code>generate.py</code> to work, you can clone a small git repository
named <a href="https://github.com/tpn/perfecthash">perfecthash</a>, which
contains a <code>data</code> directory that has a test file in it named
<code>mshtml-37209.keys</code>:
<hr/>
<pre>
C:\Users\Trent\Home\src&gt; git clone https://github.com/tpn/perfecthash
...
C:\Users\Trent\Home\src&gt; cd perfecthash\data
C:\Users\Trent\Home\src\perfecthash\data&gt; dir
...
</pre>
<hr/>
</p>
<a class="xref" name="running-self-test"></a>
<h2>Running the Self-Test</h2>
<p>
If you clone the <a href="https://github.com/tpn/tracer">tracer</a> project
and open
<a href="https://github.com/tpn/tracer/tree/master/PerfectHashTable.sln">PerfectHashTable.sln</a>
in Visual Studio 2017, you should be able to build the project from scratch.
Selecting the Debug build configuration will make the code easier to step
through if you'd like to follow along with an interactive debugging session.
</p>
<p>
Let's run the debug build of the self-test program against the data directory we
prepared earlier. (I'll use the <code>perfecthash\data</code> directory for now
as it only has two key files in it, <code>mshtml-10.keys</code> and
<code>mshtml-37209.keys</code>.) It's worth noting up-front how
<strong>not</strong> user friendly the <code>PerfectHashTableSelfTest.exe</code>
program is. It's the bare minimum command line wrapper around the
<a href="https://github.com/tpn/tracer/blob/master/PerfectHashTable/SelfTestPerfectHashTable.c">
SelfTestPerfectHashTable()</a> routine. It takes five mandatory parameters:
the test directory containing the *.keys files you wish to process, the
algorithm ID, the hash function ID, the mask function ID, and the maximum
concurrency (specifying 0 will use all cores). (You can optionally add a
dummy character as the final parameter which will print a little <code>Press
any key to continue.</code> message and then wait for a keypress before
exiting &mdash; this is useful when debugging via Visual Studio and you want the
console window to stay open so you can review results.) Usage:
<hr/>
<pre>C:\Users\Trent\Home\src\tracer&gt;x64\Debug\PerfectHashTableSelfTest.exe
Usage: PerfectHashTableSelfTest.exe &lt;TestDataDirectory (must be fully-qualified)&gt;
&lt;AlgorithmId&gt;
&lt;HashFunctionId&gt;
&lt;MaskFunctionId&gt;
&lt;MaximumConcurrency (0-ncpu)&gt;
[PauseBeforeExit (can be any character)]
E.g.: PerfectHashTableSelfTest.exe C:\Users\Trent\Home\src\perfecthash\data 1 1 2 0
</pre>
<hr/>
</p>
<p>
Let's try running it against our <code>perfecthash\data</code> directory. We'll use
the CHM algorithm (1... the only option), the CRC32-Rotate hash function (1),
the AND-based masking function (2), and a max concurrency of 0, which will
default to all cores, which will be 4 on this Surface Book.
<hr/>
<pre>S:\tracer&gt;x64\Debug\PerfectHashTableSelfTest.exe S:\perfecthash\data 1 1 2 0
Starting perfect hash self-test for directory: S:\perfecthash\data.
Processing key file: mshtml-10.keys
Successfully created perfect hash table: S:\perfecthash\data\mshtml-10.pht1.
Successfully loaded perfect hash table: S:\perfecthash\data\mshtml-10.pht1.
Algorithm: Chm01 (1).
Hash Function: Crc32Rotate (1).
Mask Function: And (2).
Successfully tested perfect hash table.
Concurrency: 4.
Number of attempts: 1.
Number of failed attempts: 0.
Number of solutions found: 1.
Number of keys: 10.
Number of table elements (vertices): 32.
Seed 1: 1086980845.
Seed 2: 3416976708.
Seed 3: 2208417421.
Seed 4: 3947037297.
Cycles to solve: 178412.
Cycles to verify: 3924.
Cycles to prepare file: 821965.
Cycles to save file: 9093112.
Microseconds to solve: 63.
Microseconds to verify: 1.
Microseconds to prepare file: 292.
Microseconds to save file: 3238.
Processing key file: mshtml-37209.keys
Successfully created perfect hash table: S:\perfecthash\data\mshtml-37209.pht1.
Successfully loaded perfect hash table: S:\perfecthash\data\mshtml-37209.pht1.
Algorithm: Chm01 (1).
Hash Function: Crc32Rotate (1).
Mask Function: And (2).
Successfully tested perfect hash table.
Concurrency: 4.
Number of attempts: 3.
Number of failed attempts: 0.
Number of solutions found: 3.
Number of keys: 37209.
Number of table elements (vertices): 131072.
Seed 1: 43757427.
Seed 2: 283337092.
Seed 3: 1819527900.
Seed 4: 1437745965.
Cycles to solve: 71649060.
Cycles to verify: 6385313.
Cycles to prepare file: 41221291.
Cycles to save file: 51229166.
Microseconds to solve: 25516.
Microseconds to verify: 2274.
Microseconds to prepare file: 14680.
Microseconds to save file: 18244.
</pre>
<hr/>
</p>
<p>
Your output should look similar to the output above. There will be two new
files, <code>mshtml-10.pht1</code> and <code>mshtml-37209.pht1</code> in that
<code>perfecthash\data</code> directory. The .pht1 file is the array of
"assigned" values we use as part of index construction for the CHM algorithm.
When the output says <em>Successfully loaded perfect hash table ...</em>, it
is referring to the .pht1 files.
</p>
<p>
The metadata for the perfect hash table is stored in an NTFS stream named
<code>:Info</code> that is attached to the .pht1 file. We can view this
via the <code>Get-Item -Path mshtml-37209.pht1 -Stream *</code> PowerShell
command:
<hr/>
<pre>PS S:\perfecthash\data&gt; Get-Item -Path mshtml-37209.pht1 -Stream *
PSPath : Microsoft.PowerShell.Core\FileSystem::S:\perfecthash\data\mshtml-37209.pht1::$DATA
PSParentPath : Microsoft.PowerShell.Core\FileSystem::S:\perfecthash\data
PSChildName : mshtml-37209.pht1::$DATA
PSDrive : S
PSProvider : Microsoft.PowerShell.Core\FileSystem
PSIsContainer : False
FileName : S:\perfecthash\data\mshtml-37209.pht1
Stream : :$DATA
Length : 524288
PSPath : Microsoft.PowerShell.Core\FileSystem::S:\perfecthash\data\mshtml-37209.pht1:Info
PSParentPath : Microsoft.PowerShell.Core\FileSystem::S:\perfecthash\data
PSChildName : mshtml-37209.pht1:Info
PSDrive : S
PSProvider : Microsoft.PowerShell.Core\FileSystem
PSIsContainer : False
FileName : S:\perfecthash\data\mshtml-37209.pht1
Stream : Info
Length : 256
</pre>
<hr/>
</p>
<p>
You can see the normal file data in the first stream (the default file stream)
named <code>$DATA</code>. This is reported as length 524,288 bytes. If we
divide this by 4 (the <code>sizeof(ULONG)</code>), we get 131,072 &mdash; which
corresponds to the number of vertices reported for the final table in the
output:
<hr/>
<pre>...
Number of table elements (vertices): 131072.
...
</pre><hr/>
</p>
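<p>
That relationship is easy to sanity check: the default stream is a flat array
of <code>ULONG</code>-sized assigned values, so dividing its byte length by
four recovers the element count. In plain C (with <code>uint32_t</code>
standing in for <code>ULONG</code>):
</p>
<pre class="code"><code class="language-c">int main(void)
{
    //
    // 524288 bytes, as reported by Get-Item, divided by sizeof(ULONG),
    // yields the 131072 table elements reported by the self-test.
    //
    uint64_t StreamLengthInBytes = 524288;
    uint64_t NumberOfElements = StreamLengthInBytes / sizeof(uint32_t);
    printf("%llu\n", (unsigned long long)NumberOfElements);
    return 0;
}</code></pre>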
<a class="xref" name="the-info-stream"></a>
<h3>The <code>:Info</code> Stream</h3>
<p>
Following that is our custom <code>:Info</code> NTFS stream of length 256 bytes.
As mentioned earlier, all of our file I/O is done through memory maps. During
the creation step, the <code>:Info</code> stream is created and mapped at a base
address accessible through <code>Table-&gt;InfoStreamBaseAddress</code> (we'll
discuss the <code>Table</code> structure soon). In our CHM implementation
routine, we literally cast the base address to an <em>on-disk</em> structure
named <code>GRAPH_INFO_ON_DISK</code>, which is required to embed a common
header structure named <code>TABLE_INFO_ON_DISK_HEADER</code> as its first
member.