Skip to content
Permalink
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
18505 lines (14727 sloc) 670 KB
<!DOCTYPE html>
<html>
<!-- vim:set tw=100 ts=8 sw=4 et :-->
<head>
<title>Is Prefix Of String In Table? A Journey Into SIMD String Processing.</title>
<meta name="msvalidate.01" content="E828541C73A98C315E3D6B8C88EF6057" />
<meta name="viewport" content="width=device-width, initial-scale=0.65, maximum-scale=1.0" />
<!-- https://www.google.com/fonts#UsePlace:use/Collection:Lato:200,300,300italic -->
<!--
<meta name="viewport" content="width=device-width, min-width=1100px, initial-scale=0.7, maximum-scale=1.0, shrint-to-fit=no" />
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:200,300,300italic">
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Merriweather:300,300i,400,400i">
-->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300,400" rel="stylesheet">
<link rel="stylesheet" href="//oss.maxcdn.com/normalize/3.0.1/normalize.min.css">
<link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css">
<link rel="stylesheet" href="../prism.css">
<link rel="stylesheet" href="../home.css">
<link rel="stylesheet" href="page.css">
<script src="//oss.maxcdn.com/jquery/2.1.1/jquery.min.js"></script>
<script src="../prism.js"></script>
<script src="../home.js"></script>
<script src="page.js"></script>
</head>
<body>
<header class="header">
<div class="header-logo" href="#">
<!--
<a class="homename" href="http://trent.me"><strong>T</strong>rent <strong>N</strong>elson</a>
-->
<a class="homename" href=".."><strong>T</strong>rent <strong>N</strong>elson</a>
</div>
<ul class="header-links">
<li><a href="#home"><i class="fa fa-home"></i> Is Prefix Of String In Table?</a></li>
<li><a href="#contents"><i class="fa fa-align-left"></i> Contents</a></li>
<li><a href="https://github.com/tpn/tracer/tree/v0.1.12/StringTable2" target="_blank"><i class="fa fa-github"></i> GitHub</a></li>
<li><a href="https://twitter.com/trentnelson" target="_blank"><i class="fa fa-twitter"></i> Twitter</a></li>
<!--
<li><a href="https://twitter.com/trentnelson" class="twitter-follow-button" data-show-count="false">Follow @trentnelson</a></li>
-->
</ul>
</header>
<a class="xref" name="home"></a>
<section class="section section-hero">
<div class="container">
<h1>
Is Prefix Of String In Table?
</h1>
<h3>
A Journey Into SIMD String Processing.
</h3>
</div>
</section>
<section class="section section-summary">
<div class="container">
<small>
Published: 4th May, 2018.
<!--
Updated: 4th May, 2018.
Target publish date: <del>20th April, 2018</del> <del>23rd April, 2018</del>
<del>25th April, 2018</del> <del>30th April, 2018</del> <del>2nd May, 2018</del>
7th May, 2018.
-->
Thanks to <a href="https://twitter.com/rygorous">Fabian Giesen</a>,
<a href="https://twitter.com/pshufb">Wojciech Mu&#322;a</a>,
<a href="https://twitter.com/geofflangdale">Geoff Langdale</a>,
<a href="https://twitter.com/lemire">Daniel Lemire</a>, and
<a href="https://twitter.com/KendallWillets">Kendall Willets</a>
for their valuable
<a href="https://twitter.com/trentnelson/status/985715037934440448">feedback</a>
on an early draft of this article. <a
href="https://github.com/tpn/website/blob/master/is-prefix-of-string-in-table/index.html">
View this page's source on GitHub.</a>
<!-- 15.6 + 48.53 + 2.42 + 33.85 + 42 + 49.67 + 11.55 + 9.12 + 12.95 + 4.87 -->
Hours spent on this article to date: 230.56.
<hr/>
<h3>TL;DR</h3>
<p>
Wrote some C and assembly code that uses SIMD instructions to perform prefix
matching of strings. The C code was between 4-7x faster than the baseline
implementation for prefix matching. The assembly code was 9-12x faster than the
baseline specifically for the negative match case (determining that an incoming
string definitely does <strong>not</strong> prefix match any of our known
strings). The fastest negative match could be done in around 6 CPU cycles, which
is pretty quick. (Integer division, for example, takes about 90 cycles.)
</p>
</small>
<hr/>
<h2>Overview</h2>
<p>
Goal: given a string, determine if it prefix-matches a set of known strings as
fast as possible. That is, in a set of known strings, do any of them prefix
match the incoming search string?
</p>
<p>
A reference implementation was written in C as a <a
href="#IsPrefixOfCStrInArray">baseline</a>, which simply looped
through an array of strings, comparing each one, byte-by-byte, looking for a
prefix match. Prefix match performance ranged from 28 CPU cycles to 130, and
negative match performance was around 74 cycles.
</p>
<p>
A SIMD-friendly C structure called <a href="#STRING_TABLE">STRING_TABLE</a> was
derived. It is optimized for up to 16 strings, ideally of length less than or
equal 16 characters. The table is created from the set of known strings
up-front; it is sorted by length, ascending, and a unique character (with
regards to other characters at the same byte offset) is then extracted, along
with its index. A 16 byte character array, <a
href="#STRING_SLOT">STRING_SLOT</a>, is used to capture the unique characters.
A 16 element array of unsigned characters, SLOT_INDEX, is used to capture the
index. Similarly, lengths are stored in the same fashion via SLOT_LENGTHS.
Finally, a 16 element array of STRING_SLOTs is used to capture up to the first
16 bytes of each string in the set.
</p>
<p>
An example of the memory layout of the STRING_TABLE structure at run time, using
sample <a href="#ntfs-reserved-names">test data</a>, is depicted below. Note
the width of each row is 16 bytes (128 bits), which is the size of an XMM register.
</p>
<a href="StringTable.svg" target="_blank">
<img class="svg-image" src="StringTable.svg"/>
</a>
<!--
<picture>
<source srcset="StringTableLayout2.png"/>
<img width="1042px" height="675px" srcset="StringTableLayout2.png"/>
</picture>
-->
<p>
The layout of the STRING_TABLE structure allows us to determine if a given
search string does <strong>not</strong> prefix match all 16 strings at once
in 12 assembly instructions. This breaks down into 18 &#181;ops, with a
block throughput of 3.48 cycles on Intel's Skylake architecture. (In practice,
this clocks in at around 6 CPU cycles.)
</p>
<div class="tab-box language box-intro">
<ul class="tabs">
<li data-content="content-intro-nasm">Assembly</li>
<li data-content="content-intro-iaca">IACA</li>
</ul>
<div class="content">
<pre class="code content-intro-nasm"><code class="language-nasm">
mov rax, String.Buffer[rdx] ; Load address of string buffer.
vpbroadcastb xmm4, byte ptr String.Length[rdx] ; Broadcast string length.
vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx] ; Load table lengths.
vmovdqu xmm0, xmmword ptr [rax] ; Load string buffer.
vpcmpgtb xmm1, xmm3, xmm4 ; Identify slots &gt; string len.
vpshufb xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string by unique index.
vpcmpeqb xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare rearranged to unique.
vptest xmm1, xmm5 ; Unique slots AND (!long slots).
jnc short Pfx10 ; CY=0, continue with routine.
xor eax, eax ; CY=1, no match.
not al ; al = -1 (NO_MATCH_FOUND).
ret ; Return NO_MATCH_FOUND.
</code></pre>
<pre class="code content-intro-iaca"><code class="language-nasm">
S:\Source\tracer>iaca x64\Release\StringTable2.dll
Intel(R) Architecture Code Analyzer
Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - x64\Release\StringTable2.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 3.48 Cycles Throughput Bottleneck: FrontEnd
Loop Count: 24
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
----------------------------------------------------------------------------
| Cycles | 2.0 0.0 | 1.0 | 3.5 3.5 | 3.5 3.5 | 0.0 | 3.0 | 2.0 | 0.0 |
----------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
| | Ports pressure in cycles | |
|&#181;ops|0DV| 1 | 2 - D | 3 - D |4| 5 | 6 |7|
-------------------------------------------
| 1 | | |0.5 0.5|0.5 0.5| | | | | mov rax, qword ptr [rdx+0x8]
| 2 | | |0.5 0.5|0.5 0.5| |1.0| | | vpbroadcastb xmm4, byte ptr [rdx]
| 1 | | |0.5 0.5|0.5 0.5| | | | | vmovdqa xmm3, xmmword ptr [rcx+0x20]
| 1 | | |0.5 0.5|0.5 0.5| | | | | vmovdqu xmm0, xmmword ptr [rax]
| 1 |1.0| | | | | | | | vpcmpgtb xmm1, xmm3, xmm4
| 2^ | | |0.5 0.5|0.5 0.5| |1.0| | | vpshufb xmm5, xmm0, xmmword ptr [rcx+0x10]
| 2^ | |1.0|0.5 0.5|0.5 0.5| | | | | vpcmpeqb xmm5, xmm5, xmmword ptr [rcx]
| 2 |1.0| | | | |1.0| | | vptest xmm1, xmm5
| 1 | | | | | | |1.0| | jnb 0x10
| 1* | | | | | | | | | xor eax, eax
| 1 | | | | | | |1.0| | not al
| 3^ | | |0.5 0.5|0.5 0.5| | | | | ret
Total Num Of &#181;ops: 18
</code></pre>
</div>
</div>
<p>
Here's a simplified walk-through of a negative match in action,
using the search string "CAT":
<a href="StringTable-NegativeMatch-v3.svg" target="_blank">
<img class="svg-image" src="StringTable-NegativeMatch-v3.svg"/>
</a>
</p>
<p>
Ten iterations of a function named IsPrefixOfStringInTable were authored. The
<a href="#IsPrefixOfStringInTable_10">tenth</a> and final iteration was the
fastest, prefix matching in as little as 19 cycles &mdash; a 4x improvement over
the baseline. Negative matching took 11 cycles &mdash; a 6.7x improvement.
</p>
<p>
An <a
href="#IsPrefixOfStringInTable_x64_2">assembly</a>
version of the algorithm was authored specifically to optimize for the negative
match case, and was able to do so in as little as 8 cycles, representing a 9x
improvement over the baseline. (It was a little bit slower than the fastest
C routine in the case of prefix matches, though, as can be seen below.)
</p>
<p>
Feedback for an early draft of this article was then solicited via <a
href="https://twitter.com/trentnelson/status/985715037934440448">Twitter</a>,
resulting in four more iterations of the C version, and three more iterations of
the assembly version. The PGO build of the fastest C version prefix matched in
about 16 cycles (and also had the best "worst case input string" performance
(where three slots needed comparison), negative matching in about 26 cycles).
The fifth iteration of the assembly version negative matched in about 6 cycles,
a 3 and 1 cycle improvement, respectively.
</p>
<p>
<a href="Benchmark-Overview-v2.svg" target="_blank">
<img class="svg-image" src="Benchmark-Overview-v2.svg"/>
</a>
</p>
<p>
We were then ready to publish, but felt compelled to investigate an odd
performance quirk we'd noticed with one of the assembly routines, which
yielded 7 more assembly versions. Were any of them faster? Let's find out.
</p>
</div>
</section>
<hr/>
<section class="section section-toc">
<div class="container">
<a class="xref" name="contents"></a>
<h1>Contents</h1>
<p>
<ul class="toc-list">
<li>
<a href="#background">Background</a>
<ul class="toc-list-2">
<li><a href="#tracer-project">The Tracer Project</a></li>
<li><a href="#baseline">Baseline C Implementation</a></li>
<li>
<a href="#proposed-interface">Proposed Interface</a>
<ul>
<li>
The <a href="#IsPrefixOfStringInTable">
IsPrefixOfStringInTable</a> function.
</li>
<li>
The <a href="#STRING_MATCH">STRING_MATCH</a> structure.
</li>
</ul>
</li>
<li><a href="#test-data">The Test Data</a></li>
<li>
<a href="#requirements-and-design-decisions">
Requirements and Design Decisions
</a>
</li>
</ul>
</li>
<li>
<a href="#data-structures">The Data Structures</a>
<ul class="toc-list-2">
<li>
<a href="#STRING_TABLE">STRING_TABLE</a>
</li>
<li><a href="#STRING_ARRAY">STRING_ARRAY</a></li>
<li><a href="#STRING_SLOT">STRING_SLOT</a></li>
<li><a href="#SLOT_INDEX">SLOT_INDEX</a></li>
<li><a href="#SLOT_LENGTHS">SLOT_LENGTHS</a></li>
<li><a href="CreateStringTable">String Table Construction</a>
</ul>
</li>
<li>
<a href="#benchmark">The Benchmark</a>
</li>
<li>
<a href="#implementations">The Implementations</a>
<ul class="toc-list-2">
<li>
Round 1
<ul class="toc-list-3">
<li><a href="#round1-c">C</a></li>
<ul class="toc-list-4">
<li>
<a href="#IsPrefixOfCStrInArray">IsPrefixOfCStrInArray</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_1">IsPrefixOfStringInTable_1</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_2">IsPrefixOfStringInTable_2</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_3">IsPrefixOfStringInTable_3</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_4">IsPrefixOfStringInTable_4</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_5">IsPrefixOfStringInTable_5</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_6">IsPrefixOfStringInTable_6</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_7">IsPrefixOfStringInTable_7</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_8">IsPrefixOfStringInTable_8</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_9">IsPrefixOfStringInTable_9</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_10">IsPrefixOfStringInTable_10</a>
</li>
</ul>
<li><a href="#round1-assembly">Assembly</a>
<ul class="toc-list-4">
<li>
<a href="#IsPrefixOfStringInTable_x64_1">IsPrefixOfStringInTable_x64_1</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_2">IsPrefixOfStringInTable_x64_2</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_3">IsPrefixOfStringInTable_x64_3</a>
</li>
</ul>
</ul>
</li>
<li>
<a href="#round2">Round 2; Post-Internet Feedback</a>
<ul class="toc-list-3">
<li>C</li>
<ul class="toc-list-4">
<li>
<a href="#IsPrefixOfStringInTable_11">IsPrefixOfStringInTable_11</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_12">IsPrefixOfStringInTable_12</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_13">IsPrefixOfStringInTable_13</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_14">IsPrefixOfStringInTable_14</a>
</li>
</ul>
<li>Assembly</a>
<ul class="toc-list-4">
<li>
<a href="#IsPrefixOfStringInTable_x64_4">IsPrefixOfStringInTable_x64_4</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_5">IsPrefixOfStringInTable_x64_5</a>
</li>
</ul>
</ul>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_3-review">Round 3; Investigating why IsPrefixOfStringInTable_x64_3 was so slow...</a>
<ul class="toc-list-3">
<ul class="toc-list-4">
<li>
<a href="#IsPrefixOfStringInTable_x64_7">IsPrefixOfStringInTable_x64_7</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_8">IsPrefixOfStringInTable_x64_8</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_9">IsPrefixOfStringInTable_x64_9</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_10">IsPrefixOfStringInTable_x64_10</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_11">IsPrefixOfStringInTable_x64_11</a>
</li>
<li>
<a href="#IsPrefixOfStringInTable_x64_12">IsPrefixOfStringInTable_x64_12</a>
</li>
</ul>
</ul>
</li>
</ul>
</li>
<li>
<a href="#other-applications">Other Applications</a>
</li>
<li>
<a href="#appendix">Appendix</a>
<ul class="toc-list-2">
<li><a href="#implementation-considerations">Implementation Considerations</a></li>
<li><a href="#release-vs-pgo">Release vs PGO</a></li>
<li>A list of all C <a href="#typedefs">typedefs</a> referenced in the article</li>
</ul>
</li>
</ul>
</p>
</div>
</section>
<hr/>
<section class="section section-body">
<div class="container">
<h1>The Background</h1>
<h2>The Tracer Project</h2>
<p>
One of the frustrations I had with existing Python profilers was that there was
no easy or efficient means to filter or exclude trace information based on the module
name of the code being executed. I tackled this in my
<a href="https://github.com/tpn/tracer">tracer</a> project, which allows you to
set an environment variable named TRACER_MODULE_NAMES to restrict which modules
should be traced, e.g.:
<pre>set TRACER_MODULE_NAMES=myproject1;myproject2;myproject3.subproject;numpy;pandas;scipy</pre>
</p>
<p>
If the code being executed is coming from the module
<code>myproject3.subproject.foo</code>, then we need to trace it, as that string
<strong>prefix matches</strong> the third entry on our list.
</p>
<p>
This article details the custom data structure and algorithm I came up with in
order to try and solve the prefix matching problem more optimally with a SIMD
approach. The resulting <a
href="https://github.com/tpn/tracer/tree/master/StringTable2">StringTable</a>
component is used extensively within the tracer project, and as such, must
conform to unique constraints such as no use of the C runtime library and
allocating all memory through TraceStore-backed allocators. Thus, it's not
really something you'd drop in to your current project in its current form.
Hopefully, the article still proves to be interesting.
</p>
<small>
<p>
Note: the code samples provided herein are copied directly from the tracer
project, which is written in C and assembly, and uses the Pascal-esque
<em>Cutler Normal Form</em> style for C. If you're used to the more UNIX-style
<a href="https://www.freebsd.org/cgi/man.cgi?query=style&sektion=9">
<em>Kernel Normal Form</em></a> of C, it's quite like that, except that it's
absolutely nothing like that, and all these code samples will probably be
very jarring.
<p>
</small>
<a class="xref" name="baseline"></a>
<h2>Baseline C Implementation</h2>
<p>
The simplest way of solving this in C is to have an array of C strings (i.e.
NULL terminated byte arrays), then for each string, loop through byte by byte
and see if it prefix matches the search string.
</p>
<div class="tab-box language box-simple">
<ul class="tabs">
<li data-content="content-simple-cnf">Baseline (Cutler Normal Form)</li>
<li data-content="content-simple-knf">Baseline (Kernel Normal Form)</li>
</ul>
<div class="content">
<pre class="code content-simple-cnf"><code class="language-c">
//
// Declare a set of module names to be used as a string array.
//
const PCSZ ModuleNames[] = {
"myproject1",
"myproject2",
"myproject3.subproject",
"numpy",
"pandas",
"scipy",
NULL,
};
//
// Define the function pointer typedef.
//
typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_CSTR_IN_ARRAY)(
_In_ PCSZ *StringArray,
_In_ PCSZ String,
_Out_opt_ PSTRING_MATCH Match
);
typedef IS_PREFIX_OF_CSTR_IN_ARRAY *PIS_PREFIX_OF_CSTR_IN_ARRAY;
//
// Forward declaration.
//
IS_PREFIX_OF_CSTR_IN_ARRAY IsPrefixOfCStrInArray;
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfCStrInArray(
PCSZ *StringArray,
PCSZ String,
PSTRING_MATCH Match
)
{
PCSZ Left;
PCSZ Right;
PCSZ *Target;
ULONG Index = 0;
ULONG Count;
for (Target = StringArray; *Target != NULL; Target++, Index++) {
Count = 0;
Left = String;
Right = *Target;
while (*Left &amp;&amp; *Right &amp;&amp; *Left++ == *Right++) {
Count++;
}
if (Count &gt; 0 &amp;&amp; !*Right) {
if (ARGUMENT_PRESENT(Match)) {
Match-&gt;Index = (BYTE)Index;
Match-&gt;NumberOfMatchedCharacters = (BYTE)Count;
Match-&gt;String = NULL;
}
return (STRING_TABLE_INDEX)Index;
}
}
return NO_MATCH_FOUND;
}
</code></pre>
<pre class="code content-simple-knf"><code class="language-c">
const char *module_names[] = {
"myproject1",
"myproject2",
"myproject3.subproject",
"numpy",
"pandas",
"scipy",
0,
};
struct string_match {
/* Index of the match. */
unsigned char index;
/* Number of characters matched. */
unsigned char number_of_chars_matched;
/* Pad out to an 8-byte boundary. */
unsigned short padding[3];
/* Pointer to the string that was matched. */
char *str;
};
unsigned char
is_prefix_of_c_str_in_array(const char **array,
const char *str,
struct string_match *match)
{
char *left, *right, **target;
unsigned int c, i = 0;
for (target = array; target; target++, i++) {
c = 0;
left = str;
right *target;
while (*left &amp;&amp; *right &amp;&amp; *left++ == *right) {
c++;
}
if (c &gt; 0 &amp;&amp; !*right) {
if (match) {
match-&gt;index = i;
match-&gt;chars_matched = c;
match-&gt;str = target[i];
}
return i;
}
}
return -1;
}
</code></pre>
</div>
</div>
<p>
Another type of code pattern that the string table attempts to replace is
anything that does a lot of if/else if/else if-type string comparisons to
look for keywords. For example, in the
<a href="https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb10315479fc00086f08e25d968b4b43c49/q3asm/q3asm.c#L609">
Quake III</a> source, there's some symbol/string processing logic that looks
like this:
</p>
<pre class="code content-q3"><code class="language-c">
// call instructions reset currentArgOffset
if ( !strncmp( token, "CALL", 4 ) ) {
EmitByte( &amp;segment[CODESEG], OP_CALL );
instructionCount++;
currentArgOffset = 0;
return;
}
// arg is converted to a reversed store
if ( !strncmp( token, "ARG", 3 ) ) {
EmitByte( &amp;segment[CODESEG], OP_ARG );
instructionCount++;
if ( 8 + currentArgOffset >= 256 ) {
CodeError( "currentArgOffset >= 256" );
return;
}
EmitByte( &amp;segment[CODESEG], 8 + currentArgOffset );
currentArgOffset += 4;
return;
}
// ret just leaves something on the op stack
if ( !strncmp( token, "RET", 3 ) ) {
EmitByte( &amp;segment[CODESEG], OP_LEAVE );
instructionCount++;
EmitInt( &amp;segment[CODESEG], 8 + currentLocals + currentArgs );
return;
}
// pop is needed to discard the return value of
// a function
if ( !strncmp( token, "pop", 3 ) ) {
EmitByte( &amp;segment[CODESEG], OP_POP );
instructionCount++;
return;
}
...
</code></pre>
<p>
An example of using the string table approach for this problem is discussed
in the <a href="#other-applications">Other Applications</a> section.
</p>
<a class="xref" name="proposed-interface"></a>
<h3>The Proposed Interface</h3>
<p>
Let's take a look at the interface we're proposing, the
<code>IsPrefixOfStringInTable</code> function, that this article is based upon:
</p>
<a class="xref" name="IsPrefixOfStringInTable"></a>
<pre class="code content-proposed-interface-cnf"><code class="language-c">
//
// Our string table index is simply a char, with -1 indicating no match found.
//
typedef CHAR STRING_TABLE_INDEX;
#define NO_MATCH_FOUND -1
typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_STRING_IN_TABLE)(
_In_ PSTRING_TABLE StringTable,
_In_ PSTRING String,
_Out_opt_ PSTRING_MATCH StringMatch
);
typedef IS_PREFIX_OF_STRING_IN_TABLE *PIS_PREFIX_OF_STRING_IN_TABLE;
IS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
</code></pre>
<p>
All implementations discussed in this article adhere to that function signature.
The <a href="#STRING_TABLE">STRING_TABLE</a> structure will be discussed shortly.
</p>
<p>
The STRING_MATCH structure is used to optionally communicate information about
the prefix match back to the caller. The index and characters matched fields
are often very useful when using the string table for text parsing; see the <a
href="#other-applications">other applications</a> section below for an example.
</p>
<p>
The structure is defined as follows:
</p>
<a class="xref" name="STRING_MATCH"></a>
<pre class="code content-string-match"><code class="language-c">
//
// This structure is used to communicate matches back to the caller.
//
typedef struct _STRING_MATCH {
//
// Index of the match.
//
BYTE Index;
//
// Number of characters matched.
//
BYTE NumberOfMatchedCharacters;
//
// Pad out to 8-bytes.
//
USHORT Padding[3];
//
// Pointer to the string that was matched. The underlying buffer will
// stay valid for as long as the STRING_TABLE struct persists.
//
PSTRING String;
} STRING_MATCH, *PSTRING_MATCH, **PPSTRING_MATCH;
C_ASSERT(sizeof(STRING_MATCH) == 16);
</code></pre>
<a class="xref" name="test-data"></a>
<h2>The Test Data</h2>
<p>
Instead of using some arbitrary Python module names, this article is going to
focus on a string table constructed out of a set of 16 strings that represent
reserved names of the NTFS file system, at least when it was first released
way back in the early 90s.
</p>
<p>
This list is desirable as it has good distribution of characters, there is
a good mix of both short and long entries, plus one oversized one
($INDEX_ALLOCATION, which clocks in at 17 characters), and almost all
strings lead with a common character (the dollar sign), preventing a simple
<em>first character</em> optimization used by <a href="https://github.com/tpn/tracer/blob/2018-04-18.1/StringTable/StringTable.h#L324">
the initial version of the StringTable component I wrote in 2016</a>.
</p>
<p>
So the scenario we'll be emulating, in this case, is that we've just been passed
a filename for creation, and we need to check if it prefix matches any of the
reserved names.
</p>
<p>
Here's the full list of NTFS names we'll be using. We're assuming 8-bit ASCII
encoding (no UTF-8) and case sensitive. (If this were actually the NT kernel,
we'd need to use wide characters with UTF-16 enconding, and be
case-insensitive.)
</p>
<a class="xref" name="ntfs-reserved-names"></a>
<h3>NTFS Reserved Names</h3>
<p>
<ul>
<li>$AttrDef</li>
<li>$BadClus</li>
<li>$Bitmap</li>
<li>$Boot</li>
<li>$Extend</li>
<li>$LogFile</li>
<li>$MftMirr</li>
<li>$Mft</li>
<li>$Secure</li>
<li>$UpCase</li>
<li>$Volume</li>
<li>$Cairo</li>
<li>$INDEX_ALLOCATION</li>
<li>$DATA</li>
<li>????</li>
<li>.</li>
</ul>
</p>
<p>
The ordering is important in certain cases. For example, when you have
overlapping strings, such as $MftMirr, and $Mft, you should put the longest
strings first. They will be matched first, and as our routine terminates upon
the first successful prefix match &mdash; if a longer string resided after a
shorter one, it would never get detected.
</p>
<p>
Let's review some guiding design requirements and cover some of the design
decisions I made, which should help shape your understanding of the
implementation.
</p>
<a class="xref" name="requirements-and-design-decisions"></a>
<h2>Requirements and Design Decisions</h2>
<p>
The STRING struct will be used to capture incoming search strings as well as the
representation of any strings registered in the table (or more accurately, in
the corresponding StringArray structure associated with the string table.
</p>
<pre class="code content-string-struct"><code class="language-c">
//
// The STRING structure used by the NT kernel. Our STRING_ARRAY structure
// relies on an array of these structures. We never pass raw 'char *'s
// around, only STRING/PSTRING structs/pointers.
//
typedef struct _STRING {
USHORT Length;
USHORT MaximumLength;
ULONG Padding;
PCHAR Buffer;
} STRING, *PSTRING;
typedef const STRING *PCSTRING;
</code></pre>
<p>
The design should optimize for string lengths less than or equal to 16. Lengths
greater than 16 are permitted, up to 128 bytes, but they incur more overhead during
the prefix lookup.
</p>
<p>
The design should prioritize the fast-path code where there is no match for a
given search string. Being able to terminate the search as early as possible is
ideal.
</p>
<p>
The performance hits taken by unaligned data access are non-nelgible, especially
when dealing with XMM/YMM loads. Pay special care to alignment constrants and
make sure that everything under our control is aligned on a suitable boundary.
(The only thing we can't really control in the real world is the alignment of
the incoming search string buffer, which will often be at undesirable alignments
like 2, 4, 6, etc. Our test program explicitly aligns the incoming search
strings on 32-byte boundaries to avoid the penalties associated with unaligned
access.)
</p>
<p>
The string table is geared toward a single-shot build. Once you've created it
with a given string array or used a delimited environment variable, that's it.
There are no AddString() or RemoveString() routines. The order you provided the
strings in will be the same order the table uses &mdash; no re-ordering will be
done. Thus, for prefix matching purposes, if two strings share a common prefix,
the longer one should go first, as the prefix search routine will check it first.
</p>
<p>
Only single matches are performed; the first match that qualifies as a prefix
match (target string in table had length less than or equal to the search
string, and all of its characters matched). There is no support for obtaining
multiple matches &mdash; if you've constructed your string tables properly
(no duplicate or incorrectly-ordered overlapping fields), you shouldn't need to.
</p>
<p>
So, to summarise, the design guidelines are as follows.
<ul>
<li>
Prioritize fast-path exit in the non-matched case. (I refer to this as
<strong>negative matching</strong> in a lot of places.)
</li>
<li>
Optimize for up to 16 string slots, where each slot has up to 16
characters, ideally. It can have up to 128 in total, however, any bytes
outside of the first sixteen live in the string array structure
supporting the string table (accessible via pStringArray).
</li>
<li>
If a slot is longer than 16 characters, optimize for the assumption that
it won't be *that* much longer. i.e. assume a string of length 18 bytes
is more common than 120 bytes.
</li>
</ul>
</p>
<a class="xref" name="data-structures"></a>
<h1>The Data Structures</h1>
<p>
The primary data structure employed by this solution is the STRING_TABLE
structure. It is composed of supporting structures: STRING_SLOT, SLOT_INDEX and
SLOT_LENGTH, and either embeds or points to the originating STRING_ARRAY
structure from which it was created.
</p>
<p>
Let's review the STRING_TABLE <small>
<a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/StringTable.h#L194">
(view on GitHub)</a></small> structure first and then touch on the supporting
structures.
</p>
<a class="xref" name="STRING_TABLE"></a>
<h2>STRING_TABLE</h2>
<div class="tab-box language box-string-table">
<ul class="tabs">
<li data-content="content-string-table-cnf">C - Cutler Normal Form</li>
<li data-content="content-string-table-knf">C - Kernel Normal Form</li>
<li data-content="content-string-table-masm">MASM</li>
</ul>
<div class="content">
<pre class="code content-string-table-cnf"><code class="language-c">//
// The STRING_TABLE struct is an optimized structure for testing whether a
// prefix entry for a string is in a table, with the expectation that the
// strings being compared will be relatively short (ideally &lt;= 16 characters),
// and the table of string prefixes to compare to will be relatively small
// (ideally &lt;= 16 strings).
//
// The overall goal is to be able to prefix match a string with the lowest
// possible (amortized) latency. Fixed-size, memory-aligned character arrays,
// and SIMD instructions are used to try and achieve this.
//
typedef struct _STRING_TABLE {
//
// A slot where each individual element contains a uniquely-identifying
// letter, with respect to the other strings in the table, of each string
// in an occupied slot.
//
STRING_SLOT UniqueChars;
//
// (16 bytes consumed.)
//
//
// For each unique character identified above, the following structure
// captures the 0-based index of that character in the underlying string.
// This is used as an input to vpshufb to rearrange the search string's
// characters such that it can be vpcmpeqb'd against the unique characters
// above.
//
SLOT_INDEX UniqueIndex;
//
// (32 bytes consumed.)
//
//
// Length of the underlying string in each slot.
//
SLOT_LENGTHS Lengths;
//
// (48 bytes consumed, aligned at 16 bytes.)
//
//
// Pointer to the STRING_ARRAY associated with this table, which we own
// (we create it and copy the caller's contents at creation time and
// deallocate it when we get destroyed).
//
// N.B. We use pStringArray here instead of StringArray because the
// latter is a field name at the end of the struct.
//
//
PSTRING_ARRAY pStringArray;
//
// (56 bytes consumed, aligned at 8 bytes.)
//
//
// String table flags.
//
STRING_TABLE_FLAGS Flags;
//
// (60 bytes consumed, aligned at 4 bytes.)
//
//
// A 16-bit bitmap indicating which slots are occupied.
//
USHORT OccupiedBitmap;
//
// A 16-bit bitmap indicating which slots have strings longer than 16 chars.
//
USHORT ContinuationBitmap;
//
// (64 bytes consumed, aligned at 64 bytes.)
//
//
// The 16-element array of STRING_SLOT structs. We want this to be aligned
// on a 64-byte boundary, and it consumes 256-bytes of memory.
//
STRING_SLOT Slots[16];
//
// (320 bytes consumed, aligned at 64 bytes.)
//
//
// We want the structure size to be a power of 2 such that an even number
// can fit into a 4KB page (and reducing the likelihood of crossing page
// boundaries, which complicates SIMD boundary handling), so we have an
// extra 192-bytes to play with here. The CopyStringArray() routine is
// special-cased to allocate the backing STRING_ARRAY structure plus the
// accommodating buffers in this space if it can fit.
//
// (You can test whether or not this occurred by checking the invariant
// `StringTable-&gt;pStringArray == &amp;StringTable-&gt;StringArray`, if this
// is true, the array was allocated within this remaining padding space.)
//
union {
STRING_ARRAY StringArray;
CHAR Padding[192];
};
} STRING_TABLE, *PSTRING_TABLE, **PPSTRING_TABLE;
//
// Assert critical size and alignment invariants at compile time.
//
C_ASSERT(FIELD_OFFSET(STRING_TABLE, UniqueIndex) == 16);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Lengths) == 32);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, pStringArray) == 48);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Slots) == 64);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Padding) == 320);
C_ASSERT(sizeof(STRING_TABLE) == 512);
</code></pre>
<pre class="code content-string-table-knf"><code class="language-c">struct string_table {
char unique_chars[16];
unsigned char unique_index[16];
unsigned char slot_lengths[16];
struct string_array *string_array_ptr;
struct string_table_flags flags;
unsigned short occupied_bitmap;
unsigned short continuation_bitmap;
char slots[16][16];
union {
struct string_array string_array;
char padding[184];
} u;
};
</code></pre>
<pre class="code content-string-table-masm"><code class="language-nasm">STRING_TABLE struct
UniqueChars CHAR 16 dup (?)
UniqueIndex BYTE 16 dup (?)
Lengths BYTE 16 dup (?)
pStringArray PSTRING_ARRAY ?
Flags ULONG ?
OccupiedBitmap USHORT ?
ContinuationBitmap USHORT ?
Slots STRING_SLOT 16 dup ({ })
union
StringArray STRING_ARRAY {?}
Padding CHAR 192 dup (?)
ends
STRING_TABLE ends
;
; Assert our critical field offsets and structure size as per the same approach
; taken in StringTable.h.
;
.erre (STRING_TABLE.UniqueIndex eq 16), @CatStr(&lt;UnexpectedOffset STRING_TABLE.UniqueIndex: &gt;, %(STRING_TABLE.UniqueIndex))
.erre (STRING_TABLE.Lengths eq 32), @CatStr(&lt;UnexpectedOffset STRING_TABLE.Lengths: &gt;, %(STRING_TABLE.Lengths))
.erre (STRING_TABLE.pStringArray eq 48), @CatStr(&lt;UnexpectedOffset STRING_TABLE.pStringArray: &gt;, %(STRING_TABLE.pStringArray))
.erre (STRING_TABLE.Slots eq 64), @CatStr(&lt;UnexpectedOffset STRING_TABLE.Slots: &gt;, %(STRING_TABLE.Slots))
.erre (STRING_TABLE.Padding eq 320), @CatStr(&lt;UnexpectedOffset STRING_TABLE.Padding: &gt;, %(STRING_TABLE.Padding))
.erre (size STRING_TABLE eq 512), @CatStr(&lt;IncorrectStructSize: STRING_TABLE: &gt;, %(size STRING_TABLE))
PSTRING_TABLE typedef ptr STRING_TABLE
;
; CamelCase typedefs that are nicer to work with in assembly
; than their uppercase counterparts.
;
StringTable typedef STRING_TABLE
</code></pre>
</div>
</div>
<p>
The following diagram depicts an in-memory representation of the STRING_TABLE
structure using our NTFS reserved prefix names. It is created via the
<a href="#CreateStringTable">CreateStringTable</a> routine, which we feature
in the appendix of this article.
</p>
<a href="StringTable.svg" target="_blank">
<img class="svg-image" src="StringTable.svg"/>
</a>
<!--
<picture>
<source srcset="StringTableLayout2.png"/>
<img width="1042px" height="675px" srcset="StringTableLayout2.png"/>
<source srcset="StringTableLayout2.png"/>
<img width="1641px" height="1020px" srcset="StringTableLayout2.png"/>
</picture>
-->
<p>
In order to improve the uniqueness of the unique characters selected from each
string, the strings are sorted by length during string table creation and
enumerated in this order whilst identifying unique characters. The rationale
behind this is that shorter strings simply have fewer characters to choose from,
longer strings have more to choose from. If we identified unique characters in
the order they appear in the string table, we may have longer strings preceeding
shorter ones, such that toward the end of the table, nothing unique can be
extracted from the short ones.
</p>
<p>
The utility of the string table is maximised by ensuring a unique character is
selected from every string, thus, we sort by length first. Note that the
uniqueness is actually determined by offset:character pairs, with the offsets
becoming the indices stored in the <em>UniqueIndex</em> slot. If you trace
through the diagram above, you'll see that the unique character in each slot
matches the character in the corresponding string slot, indicated by the
underlying index.
</p>
<p>
</p>
<a class="xref" name="supporting-structures"></a>
<h2>Supporting Structures</h2>
The string array captures a raw array representation of the underlying strings
making up the string table. It is either embedded within the padding area at the
end of the string table, or a separate allocation is made during string table
creation. The main interface to creating a string table is via a STRING_ARRAY
structure. The helper functions,
<a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/CreateStringTable.c#L471">
CreateStringTableFromDelimitedString
</a> and
<a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/CreateStringTable.c#L595">
CreateStringTableFromDelimitedEnvironmentVariable
</a> simply break down their input into a STRING_ARRAY representation first
before calling
<a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/CreateStringTable.c#L51">
CreateStringTable
</a>.
<a class="xref" name="STRING_ARRAY"></a>
<h3>STRING_ARRAY</h3>
<pre class="code content-string-array"><code class="language-c">typedef struct _Struct_size_bytes_(SizeInQuadwords&gt;&gt;3) _STRING_ARRAY {
//
// Size of the structure, in quadwords. Why quadwords? It allows us to
// keep this size field to a USHORT, which helps with the rest of the
// alignment in this struct (we want the STRING Strings[] array to start
// on an 8-byte boundary).
//
// N.B. We can't express the exact field size in the SAL annotation
// below, because the array of buffer sizes are inexpressible;
// however, we know the maximum length, so we can use the implicit
// invariant that the total buffer size can't exceed whatever num
// elements * max size is.
//
_Field_range_(&lt;=, (
sizeof(struct _STRING_ARRAY) +
((NumberOfElements - 1) * sizeof(STRING)) +
(MaximumLength * NumberOfElements)
) &gt;&gt; 3)
USHORT SizeInQuadwords;
//
// Number of elements in the array.
//
USHORT NumberOfElements;
//
// Minimum and maximum lengths for the String-&gt;Length fields. Optional.
//
USHORT MinimumLength;
USHORT MaximumLength;
//
// A pointer to the STRING_TABLE structure that "owns" us.
//
struct _STRING_TABLE *StringTable;
//
// The string array. Number of elements in the array is governed by the
// NumberOfElements field above.
//
STRING Strings[ANYSIZE_ARRAY];
} STRING_ARRAY, *PSTRING_ARRAY, **PPSTRING_ARRAY;
</code></pre>
<small>
<a class="xref" name="SAL"></a>
<p>
Note: the odd-looking macros <a
href="https://github.com/tpn/winsdk-10/blob/master/Include/10.0.16299.0/shared/sal.h#L597">
_Struct_size_bytes_</a> and
<a
href="https://github.com/tpn/winsdk-10/blob/master/Include/10.0.16299.0/shared/sal.h#L615">
_Field_range_</a> are
<a
href="https://docs.microsoft.com/en-us/visualstudio/code-quality/annotating-structs-and-classes">
SAL Annotations</a>. There's a neat deck called
<a
href="https://github.com/tpn/pdfs/blob/master/Program%20Analysis%20with%20PREfast%20and%20SAL%20-%20Erik%20Poll%20-%20Slides%20(3_StaticAnalysisPREfast).pdf"
>Engineering Better Software at Microsoft</a> which captures some interesting
details about SAL, for those wanting to read more. The Code Analysis engine
that uses the annotations is built upon the <a
href="https://github.com/Z3Prover/z3">Z3 Theorem Prover</a>, which is a
fascinating little project in its own right.
</p>
</small>
<p>
And finally, we're left with the smaller helper structs that we use to
encapsulate the various innards of the string table. (I use unions that
feature XMMWORD representations (which is a typedef of __m128i, representing
an XMM register) as well as underlying byte/character representations as I
personally find it makes the resulting C code a bit nicer.)
</p>
<a class="xref" name="STRING_SLOT"></a>
<h3>STRING_SLOT</h3>
<pre class="code content-string-slot"><code class="language-c">//
// String tables are composed of a 16 element array of 16 byte string "slots",
// which represent a unique character (with respect to other strings in the
// table) for a string in a given slot index. The STRING_SLOT structure
// provides a convenient wrapper around this construct.
//
typedef union DECLSPEC_ALIGN(16) _STRING_SLOT {
XMMWORD CharsXmm;
CHAR Char[16];
} STRING_SLOT, *PSTRING_SLOT, **PPSTRING_SLOT;
C_ASSERT(sizeof(STRING_SLOT) == 16);
</code></pre>
<a class="xref" name="SLOT_INDEX"></a>
<h3>SLOT_INDEX</h3>
<pre class="code content-slot-index"><code class="language-c">//
// An array of 1 byte unsigned integers used to indicate the 0-based index of
// a given unique character in the corresponding string.
//
typedef union DECLSPEC_ALIGN(16) _SLOT_INDEX {
XMMWORD IndexXmm;
BYTE Index[16];
} SLOT_INDEX, *PSLOT_INDEX, **PPSLOT_INDEX;
C_ASSERT(sizeof(SLOT_INDEX) == 16);
</code></pre>
<a class="xref" name="SLOT_LENGTHS"></a>
<h3>SLOT_LENGTHS</h3>
<pre class="code content-slot-lengths"><code class="language-c">//
// A 16 element array of 1 byte unsigned integers, used to capture the length
// of each string slot in a single XMM 128-bit register.
//
typedef union DECLSPEC_ALIGN(16) _SLOT_LENGTHS {
XMMWORD SlotsXmm;
BYTE Slots[16];
} SLOT_LENGTHS, *PSLOT_LENGTHS, **PPSLOT_LENGTHS;
C_ASSERT(sizeof(SLOT_LENGTHS) == 16);
</code></pre>
<a class="xref" name="CreateStringTable"></a>
<h2>String Table Construction</h2>
<p>
The <a
href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/CreateStringTable.c#L147">
CreateSingleStringTable</a> routine is responsible for construction of a new
STRING_TABLE. It is here we identify the unique set of characters (and their
indices) to store in the first two fields of the string table.
</p>
<div class="tab-box language box-create">
<ul class="tabs">
<li data-content="content-create-string-table">CreateSingleStringTable</li>
</ul>
<div class="content">
<pre class="code content-create-string-table"><code class="language-c">//
// Define private types used by this module.
//
typedef struct _LENGTH_INDEX_ENTRY {
BYTE Length;
BYTE Index;
} LENGTH_INDEX_ENTRY;
typedef LENGTH_INDEX_ENTRY *PLENGTH_INDEX_ENTRY;
typedef struct _LENGTH_INDEX_TABLE {
LENGTH_INDEX_ENTRY Entry[16];
} LENGTH_INDEX_TABLE;
typedef LENGTH_INDEX_TABLE *PLENGTH_INDEX_TABLE;
typedef union DECLSPEC_ALIGN(32) _CHARACTER_BITMAP {
YMMWORD Ymm;
XMMWORD Xmm[2];
LONG Bits[(256 / (4 &lt;&lt; 3))]; // 8
} CHARACTER_BITMAP;
C_ASSERT(sizeof(CHARACTER_BITMAP) == 32);
typedef CHARACTER_BITMAP *PCHARACTER_BITMAP;
typedef struct _SLOT_BITMAPS {
CHARACTER_BITMAP Bitmap[16];
} SLOT_BITMAPS;
typedef SLOT_BITMAPS *PSLOT_BITMAPS;
//
// Function implementation.
//
_Use_decl_annotations_
PSTRING_TABLE
CreateSingleStringTable(
PRTL Rtl,
PALLOCATOR StringTableAllocator,
PALLOCATOR StringArrayAllocator,
PSTRING_ARRAY StringArray,
BOOL CopyArray
)
/*++
Routine Description:
Allocates space for a STRING_TABLE structure using the provided allocators,
then initializes it using the provided STRING_ARRAY. If CopyArray is set
to TRUE, the routine will copy the string array such that the caller is
free to destroy it after the table has been successfully created. If it
is set to FALSE and StringArray-&gt;StringTable has a non-NULL value, it is
assumed that sufficient space has already been allocated for the string
table and this pointer will be used to initialize the rest of the structure.
DestroyStringTable() must be called against the returned PSTRING_TABLE when
the structure is no longer needed in order to ensure resources are released.
Arguments:
Rtl - Supplies a pointer to an initialized RTL structure.
StringTableAllocator - Supplies a pointer to an ALLOCATOR structure which
will be used for creating the STRING_TABLE.
StringArrayAllocator - Supplies a pointer to an ALLOCATOR structure which
may be used to create the STRING_ARRAY if it cannot fit within the
padding of the STRING_TABLE structure. This is kept separate from the
StringTableAllocator due to the stringent alignment requirements of the
string table.
StringArray - Supplies a pointer to an initialized STRING_ARRAY structure
that contains the STRING structures that are to be added to the table.
CopyArray - Supplies a boolean value indicating whether or not the
StringArray structure should be deep-copied during creation. This is
typically set when the caller wants to be able to free the structure
as soon as this call returns (or can't guarantee it will persist past
this function's invocation, i.e. if it was stack allocated).
Return Value:
A pointer to a valid PSTRING_TABLE structure on success, NULL on failure.
Call DestroyStringTable() on the returned structure when it is no longer
needed in order to ensure resources are cleaned up appropriately.
--*/
{
BYTE Byte;
BYTE Count;
BYTE Index;
BYTE Length;
BYTE NumberOfElements;
ULONG HighestBit;
ULONG OccupiedMask;
PULONG Bits;
USHORT OccupiedBitmap;
USHORT ContinuationBitmap;
PSTRING_TABLE StringTable;
PSTRING_ARRAY StringArray;
PSTRING String;
PSTRING_SLOT Slot;
STRING_SLOT UniqueChars;
SLOT_INDEX UniqueIndex;
SLOT_INDEX LengthIndex;
SLOT_LENGTHS Lengths;
LENGTH_INDEX_TABLE LengthIndexTable;
PCHARACTER_BITMAP Bitmap;
SLOT_BITMAPS SlotBitmaps;
PLENGTH_INDEX_ENTRY Entry;
//
// Validate arguments.
//
if (!ARGUMENT_PRESENT(StringTableAllocator)) {
return NULL;
}
if (!ARGUMENT_PRESENT(StringArrayAllocator)) {
return NULL;
}
if (!ARGUMENT_PRESENT(SourceStringArray)) {
return NULL;
}
if (SourceStringArray-&gt;NumberOfElements == 0) {
return NULL;
}
//
// Copy the incoming string array if applicable.
//
if (CopyArray) {
StringArray = CopyStringArray(
StringTableAllocator,
StringArrayAllocator,
SourceStringArray,
FIELD_OFFSET(STRING_TABLE, StringArray),
sizeof(STRING_TABLE),
&amp;StringTable
);
if (!StringArray) {
return NULL;
}
} else {
//
// We're not copying the array, so initialize StringArray to point at
// the caller's SourceStringArray, and StringTable to point at the
// array's StringTable field (which will be non-NULL if sufficient
// space has been allocated).
//
StringArray = SourceStringArray;
StringTable = StringArray-&gt;StringTable;
}
//
// If StringTable has no value, we've either been called with CopyArray set
// to FALSE, or CopyStringArray() wasn't able to allocate sufficient space
// for both the table and itself. Either way, we need to allocate space for
// the table.
//
if (!StringTable) {
StringTable = (PSTRING_TABLE)(
StringTableAllocator-&gt;AlignedCalloc(
StringTableAllocator-&gt;Context,
1,
sizeof(STRING_TABLE),
STRING_TABLE_ALIGNMENT
)
);
if (!StringTable) {
return NULL;
}
}
//
// Make sure the fields that are sensitive to alignment are, in fact,
// aligned correctly.
//
if (!AssertStringTableFieldAlignment(StringTable)) {
DestroyStringTable(StringTableAllocator,
StringArrayAllocator,
StringTable);
return NULL;
}
//
// At this point, we have copied the incoming StringArray if necessary,
// and we've allocated sufficient space for the StringTable structure.
// Enumerate over all of the strings, set the continuation bit if the
// length &gt; 16, set the relevant slot length, set the relevant unique
// character entry, then move the first 16-bytes of the string into the
// relevant slot via an aligned SSE mov.
//
//
// Initialize pointers and counters, clear stack-based structures.
//
Slot = StringTable-&gt;Slots;
String = StringArray-&gt;Strings;
OccupiedBitmap = 0;
ContinuationBitmap = 0;
NumberOfElements = (BYTE)StringArray-&gt;NumberOfElements;
UniqueChars.CharsXmm = _mm_setzero_si128();
UniqueIndex.IndexXmm = _mm_setzero_si128();
LengthIndex.IndexXmm = _mm_setzero_si128();
//
// Set all the slot lengths to 0x7f up front instead of defaulting
// to zero. This allows for simpler logic when searching for a prefix
// string, which involves broadcasting a search string's length to an XMM
// register, then doing _mm_cmpgt_epi8() against the lengths array and
// the string length. If we left the lengths as 0 for unused slots, they
// would get included in the resulting comparison register (i.e. the high
// bits would be set to 1), so we'd have to do a subsequent masking of
// the result at some point using the OccupiedBitmap. By defaulting the
// lengths to 0x7f, we ensure they'll never get included in any cmpgt-type
// SIMD matches. (We use 0x7f instead of 0xff because the _mm_cmpgt_epi8()
// intrinsic assumes packed signed integers.)
//
Lengths.SlotsXmm = _mm_set1_epi8(0x7f);
ZeroStruct(LengthIndexTable);
ZeroStruct(SlotBitmaps);
for (Count = 0; Count &lt; NumberOfElements; Count++) {
XMMWORD CharsXmm;
//
// Set the string length for the slot.
//
Length = Lengths.Slots[Count] = (BYTE)String-&gt;Length;
//
// Set the appropriate bit in the continuation bitmap if the string is
// longer than 16 bytes.
//
if (Length &gt; 16) {
ContinuationBitmap |= (Count == 0 ? 1 : 1 &lt;&lt; (Count + 1));
}
if (Count == 0) {
Entry = &amp;LengthIndexTable.Entry[0];
Entry-&gt;Index = 0;
Entry-&gt;Length = Length;
} else {
//
// Perform a linear scan of the length-index table in order to
// identify an appropriate insertion point.
//
for (Index = 0; Index &lt; Count; Index++) {
if (Length &lt; LengthIndexTable.Entry[Index].Length) {
break;
}
}
if (Index != Count) {
//
// New entry doesn't go at the end of the table, so shuffle
// everything else down.
//
Rtl-&gt;RtlMoveMemory(&amp;LengthIndexTable.Entry[Index + 1],
&amp;LengthIndexTable.Entry[Index],
(Count - Index) * sizeof(*Entry));
}
Entry = &amp;LengthIndexTable.Entry[Index];
Entry-&gt;Index = Count;
Entry-&gt;Length = Length;
}
//
// Copy the first 16-bytes of the string into the relevant slot. We
// have taken care to ensure everything is 16-byte aligned by this
// stage, so we can use SSE intrinsics here.
//
CharsXmm = _mm_load_si128((PXMMWORD)String-&gt;Buffer);
_mm_store_si128(&amp;(*Slot).CharsXmm, CharsXmm);
//
// Advance our pointers.
//
++Slot;
++String;
}
//
// Store the slot lengths.
//
_mm_store_si128(&amp;(StringTable-&gt;Lengths.SlotsXmm), Lengths.SlotsXmm);
//
// Loop through the strings in order of shortest to longest and construct
// the uniquely-identifying character table with corresponding index.
//
for (Count = 0; Count &lt; NumberOfElements; Count++) {
Entry = &amp;LengthIndexTable.Entry[Count];
Length = Entry-&gt;Length;
Slot = &amp;StringTable-&gt;Slots[Entry-&gt;Index];
//
// Iterate over each character in the slot and find the first one
// without a corresponding bit set.
//
for (Index = 0; Index &lt; Length; Index++) {
Bitmap = &amp;SlotBitmaps.Bitmap[Index];
Bits = (PULONG)&amp;Bitmap-&gt;Bits[0];
Byte = Slot-&gt;Char[Index];
if (!BitTestAndSet(Bits, Byte)) {
break;
}
}
UniqueChars.Char[Count] = Byte;
UniqueIndex.Index[Count] = Index;
LengthIndex.Index[Count] = Entry-&gt;Index;
}
//
// Loop through the elements again such that the unique chars are stored
// in the order they appear in the table.
//
for (Count = 0; Count &lt; NumberOfElements; Count++) {
for (Index = 0; Index &lt; NumberOfElements; Index++) {
if (LengthIndex.Index[Index] == Count) {
StringTable-&gt;UniqueChars.Char[Count] = UniqueChars.Char[Index];
StringTable-&gt;UniqueIndex.Index[Count] = UniqueIndex.Index[Index];
break;
}
}
}
//
// Generate and store the occupied bitmap. Each bit, from low to high,
// corresponds to the index of a slot. When set, the slot is occupied.
// When clear, it is not. So, fill bits from the highest bit set down.
//
HighestBit = (1 &lt;&lt; (StringArray-&gt;NumberOfElements-1));
OccupiedMask = _blsmsk_u32(HighestBit);
StringTable-&gt;OccupiedBitmap = (USHORT)OccupiedMask;
//
// Store the continuation bitmap.
//
StringTable-&gt;ContinuationBitmap = (USHORT)(ContinuationBitmap);
//
// Wire up the string array to the table.
//
StringTable-&gt;pStringArray = StringArray;
//
// And we're done, return the table.
//
return StringTable;
}
</code></pre>
</div>
</div>
<a class="xref" name="benchmark"></a>
<h1>The Benchmark</h1>
<p>
The performance comparison graphs in the subsequent sections were generated in
Excel, using CSV data output by the creatively-named program
<a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2BenchmarkExe/main.c#L227">
StringTable2BenchmarkExe</a>.
</p>
<p>
Modern CPUs are fast, timing is hard, especially when you're dealing with CPU
cycle comparisons. No approach is perfect. Here's what I settled on:
<ul>
<li>
The benchmark utility has <code>#pragma optimize("", off)</code> at the
start of the file, which disables global optimizations, even in release
(optimized) builds. This prevents the compiler doing clever things with
regards to scheduling of the timestamping logic, which affects reported
times.
</li>
<li>
The benchmark utility pins itself to a single core and sets its thread
priority to the highest permissible value at startup. (Turbo is
disabled on the computer, such that the frequency is pinned to 3.68GHz.)
</li>
<li>
The benchmark utility is fed an array of function pointers and test
inputs. It iterates over each test input, and then iterates over
each function, calling it with the test input and potentially verifying
the result (some functions are included for comparison purposes but
don't actually produce correct results, and thus, do not have their
results verified).
</li>
<li>
The test input string is copied into a local buffer that is aligned on a
32 byte boundary. This ensures that all test inputs are being compared
fairly. (The natural alignment of the buffers varies anywhere from 2 to
512 bytes, unaligned buffers have a significant impact on the timings.)
</li>
<li>
The function is run once, with the result captured. If verification has
been requested, the result is verified. We <code>__debugbreak()</code>
immediately if there's a mismatch, which is handy during development.
</li>
<li>
<code>NtDelayExecution(TRUE, 1)</code> is called, which results in a sleep of
approximately 100 nanoseconds. This is done to force a context switch,
such that the thread gets a new scheduling quantum before each function
is run.
</li>
<li>
The function is executed 100 times for warmup.
</li>
<li>
Timings are taken for 1000 iterations of the function using the given
test input. The <code>__rdtscp()</code> intrinsic is used (which forces
some serialization) to capture the timestamp counter before and after the
iterations.
</li>
<li>
This process is repeated 100 times. The minimum time observed to
perform 1000 iterations (out of 100 attempts) is captured as the
function's best time.
</li>
</ul>
</p>
<h4>Release vs PGO Oddities</h4>
<p>
All of the times in the graphs come from the profile-guided optimization build
of the StringTable component. The PGO build is faster than the normal release
build in every case, except one, where it is notably slower.
</p>
<p>
It's... odd. I haven't investigated it. The following graph depicts the
affected function, IsPrefixOfStringInTable_1, and a few other versions for
reference, and depicts the performance of the PGO build to the release build on
the input strings "$INDEX_ALLOCATION" and "$Bai123456789012".
</p>
<a href="Benchmark-Release-vs-PGO-v3.svg" target="_blank">
<img class="svg-image" src="Benchmark-Release-vs-PGO-v3.svg"/>
</a>
<p>
Only that function is affected, and the problem really only manifests on
the two example test strings depicted. As this routine essentially serves
as one of the initial baseline implementations, it would be misleading to
compare all of our optimized PGO versions to the abnormally-slow baseline
implementation. So, the release and PGO timings were blended together into
a single CSV, and the Excel PivotTables pick whatever the minimum time is
for a given function and test input.
</p>
<p>
Thus, you're always looking at the PGO timings, except for this outlier case
where the release versions are faster.
</p>
<a class="xref" name="implementations"></a>
<h1>The Implementations</h1>
<a class="xref" name="round1"></a>
<h2>Round 1</h2>
<p>
In this section, we'll take a look at the various implementations I experimented
with on the first pass, prior to soliciting any feedback. I figured there were
a couple of ways I could present this information. First, I could hand-pick
what I choose to show and hide, such that a nice rosey picture is presented that
makes it seem like I effortlessly arrived at the fastest implementation
without much actual effort whatsoever.
</p>
<p>
Or I could show the gritty reality of how everything <strong>actually</strong> went
down in a chronological fashion, errors and all. And there were definitely some
errors! For better or for worse, I've chosen to go down this route, so you'll
get to enjoy some pretty tedious tweaks (changing a single line, for example)
before the juicy stuff really kicks in.
</p>
<p>
Additionally, with the benefit of writing this little section introduction
retro-actively, iterations 4 and 5 aren't testing what I thought they were
initially testing. I've left them in as is; if anything, it demonstrates the
importance of only changing one thing at a time, and making sure you're testing
what you think you're testing. I'll discuss the errors with those iterations
later in the article.
</p>
<a class="xref" name="IsPrefixOfCStrInArray"></a>
<h2>IsPrefixOfCStrInArray</h2>
<small>
<a href="#IsPrefixOfStringInTable_1">IsPrefixOfStringInTable_1 <i class="fa fa-arrow-right"></i></a>
</small>
<p>
Let's review the baseline implementation again, as that's what we're ultimately
comparing ourselves against. This version enumerates the string array (and thus
has a slightly different function signature to the STRING_TABLE-based functions)
looking for prefix matches. No SIMD instructions are used. The timings
captured should be proportional to the location of the test input string in the
array. That is, it should take less time to prefix match strings that occur
earlier in the array versus those that appear later.
</p>
<pre class="code"><code class="language-c">
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfCStrInArray(
PCSZ *StringArray,
PCSZ String,
PSTRING_MATCH Match
)
{
PCSZ Left;
PCSZ Right;
PCSZ *Target;
ULONG Index = 0;
ULONG Count;
for (Target = StringArray; *Target != NULL; Target++, Index++) {
Count = 0;
Left = String;
Right = *Target;
while (*Left &amp;&amp; *Right &amp;&amp; *Left++ == *Right++) {
Count++;
}
if (Count &gt; 0 &amp;&amp; !*Right) {
if (ARGUMENT_PRESENT(Match)) {
Match-&gt;Index = (BYTE)Index;
Match-&gt;NumberOfMatchedCharacters = (BYTE)Count;
Match-&gt;String = NULL;
}
return (STRING_TABLE_INDEX)Index;
}
}
return NO_MATCH_FOUND;
}
</code></pre>
<hr/>
<a class="xref" name="IsPrefixOfStringInTable_1"></a>
<h2>IsPrefixOfStringInTable_1</h2>
<small>
<a href="#IsPrefixOfCStrInArray"><i class="fa fa-arrow-left"></i> IsPrefixOfCStrInArray</a> |
<a href="#IsPrefixOfStringInTable_2">IsPrefixOfStringInTable_2 <i class="fa fa-arrow-right"></i></a>
</small>
<p>
This version is similar to the <code>IsPrefixOfCStrInArray</code>
implementation, except it utilizes the slot length information provided by the
<code>STRING_ARRAY</code> structure, and conforms to our standard
<code>IsPrefixOfStringInTable</code> function signature. It uses no SIMD
instructions.
</p>
<pre class="code"><code class="language-c">
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_1(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This routine performs a simple linear scan of the string table looking for
a prefix match against each slot.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
BYTE Left;
BYTE Right;
ULONG Index;
ULONG Count;
PSTRING_ARRAY StringArray;
PSTRING TargetString;
//IACA_VC_START();
StringArray = StringTable-&gt;pStringArray;
if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
return NO_MATCH_FOUND;
}
for (Count = 0; Count &lt; StringArray-&gt;NumberOfElements; Count++) {
TargetString = &amp;StringArray-&gt;Strings[Count];
if (String-&gt;Length &lt; TargetString-&gt;Length) {
continue;
}
for (Index = 0; Index &lt; TargetString-&gt;Length; Index++) {
Left = String-&gt;Buffer[Index];
Right = TargetString-&gt;Buffer[Index];
if (Left != Right) {
break;
}
}
if (Index == TargetString-&gt;Length) {
if (ARGUMENT_PRESENT(Match)) {
Match-&gt;Index = (BYTE)Count;
Match-&gt;NumberOfMatchedCharacters = (BYTE)Index;
Match-&gt;String = TargetString;
}
return (STRING_TABLE_INDEX)Count;
}
}
//IACA_VC_END();
return NO_MATCH_FOUND;
}
</code></pre>
<p>
Here's the performance of these two baseline routines:
<a href="Benchmark-01-v6.svg" target="_blank">
<img class="svg-image" src="Benchmark-01-v6.svg"/>
</a>
</p>
<p>
That's an interesting result! Even without using any SIMD instructions, version
1, the <code>IsPrefixOfStringInTable_1</code> routine, is faster (in all but one case)
than the baseline <code>IsPrefixOfCStrInArray</code> routine, thanks to a more
sophisticated data structure.
</p>
<p>
(And really, it's not even using the sophisticated parts of the
<code>STRING_TABLE</code>; it's just leveraging the fact that we've captured the
lengths of each string in the backing <code>STRING_ARRAY</code> structure, by
virtue of the fact that we use the <code>STRING</code> structure to wrap our
strings (versus relying on the standard NULL-terminated C string approach).)
</p>
<hr/>
<a class="xref" name="IsPrefixOfStringInTable_2"></a>
<h2>IsPrefixOfStringInTable_2</h2>
<small>
<a href="#IsPrefixOfStringInTable_1"><i class="fa fa-arrow-left"></i> IsPrefixOfStringInTable_1</a> |
<a href="#IsPrefixOfStringInTable_3">IsPrefixOfStringInTable_3 <i class="fa fa-arrow-right"></i></a>
</small>
<p>
This version is the first of the routines to use SIMD instructions. It is
actually based on the prefix matching routine I wrote for the first version
of the StringTable component back in 2016. The layout of the STRING_TABLE
struct differed in the first version; only the first character of each slot
was used to do the initial exclusion (as opposed to the unique character),
and lengths were unsigned shorts instead of chars (16 bits instead of 8 bits),
so the match bitmap had to be constructed slightly differently.
</p>
<p>
None of those details really apply to our second attempt at the StringTable
component, detailed in this article. Our lengths are 8 bits, and we use unique
characters in the initial negative match fast-path. However, the first version
used an elaborate AVX2 prefix match routine that was geared toward matching
long strings, and attempted to use non-temporal streaming load instructions
where possible (which would only make sense for a large number of long strings
in a very small set of cache-thrashing scenarios).
Compare our simpler implementation, <code>IsPrefixMatch</code>, which we use in
version 3 onward, to the far more elaborate (and unncessary)
<code>IsPrefixMatchAvx2</code>:
</p>
<div class="tab-box language box-is-prefix-match">
<ul class="tabs">
<li data-content="content-is-prefix-match">IsPrefixMatch</li>
<li data-content="content-is-prefix-match-avx2">IsPrefixMatchAvx2</li>
</ul>
<div class="content">
<pre class="code content-is-prefix-match"><code class="language-c">FORCEINLINE
BYTE
IsPrefixMatch(
_In_ PCSTRING SearchString,
_In_ PCSTRING TargetString,
_In_ BYTE Offset
)
{
PBYTE Left;
PBYTE Right;
BYTE Matched = 0;
BYTE Remaining = (SearchString-&gt;Length - Offset) + 1;
Left = (PBYTE)RtlOffsetToPointer(SearchString-&gt;Buffer, Offset);
Right = (PBYTE)RtlOffsetToPointer(TargetString-&gt;Buffer, Offset);
while (--Remaining &amp;&amp; *Left++ == *Right++) {
Matched++;
}
Matched += Offset;
if (Matched != TargetString-&gt;Length) {
return NO_MATCH_FOUND;
}
return Matched;
}
</code></pre>
<pre class="code content-is-prefix-match-avx2"><code class="language-c">FORCEINLINE
USHORT
IsPrefixMatchAvx2(
_In_ PCSTRING SearchString,
_In_ PCSTRING TargetString,
_In_ USHORT Offset
)
{
USHORT SearchStringRemaining;
USHORT TargetStringRemaining;
ULONGLONG SearchStringAlignment;
ULONGLONG TargetStringAlignment;
USHORT CharactersMatched = Offset;
LONG Count;
LONG Mask;
PCHAR SearchBuffer;
PCHAR TargetBuffer;
STRING_SLOT SearchSlot;
XMMWORD SearchXmm;
XMMWORD TargetXmm;
XMMWORD ResultXmm;
YMMWORD SearchYmm;
YMMWORD TargetYmm;
YMMWORD ResultYmm;
SearchStringRemaining = SearchString-&gt;Length - Offset;
TargetStringRemaining = TargetString-&gt;Length - Offset;
SearchBuffer = (PCHAR)RtlOffsetToPointer(SearchString-&gt;Buffer, Offset);
TargetBuffer = (PCHAR)RtlOffsetToPointer(TargetString-&gt;Buffer, Offset);
//
// This routine is only called in the final stage of a prefix match when
// we've already verified the slot's corresponding original string length
// (referred in this routine as the target string) is less than or equal
// to the length of the search string.
//
// We attempt as many 32-byte comparisons as we can, then as many 16-byte
// comparisons as we can, then a final &lt; 16-byte comparison if necessary.
//
// We use aligned loads if possible, falling back to unaligned if not.
//
StartYmm:
if (SearchStringRemaining &gt;= 32 &amp;&amp; TargetStringRemaining &gt;= 32) {
//
// We have at least 32 bytes to compare for each string. Check the
// alignment for each buffer and do an aligned streaming load (non-
// temporal hint) if our alignment is at a 32-byte boundary or better;
// reverting to an unaligned load when not.
//
SearchStringAlignment = GetAddressAlignment(SearchBuffer);
TargetStringAlignment = GetAddressAlignment(TargetBuffer);
if (SearchStringAlignment &lt; 32) {
SearchYmm = _mm256_loadu_si256((PYMMWORD)SearchBuffer);
} else {
SearchYmm = _mm256_stream_load_si256((PYMMWORD)SearchBuffer);
}
if (TargetStringAlignment &lt; 32) {
TargetYmm = _mm256_loadu_si256((PYMMWORD)TargetBuffer);
} else {
TargetYmm = _mm256_stream_load_si256((PYMMWORD)TargetBuffer);
}
//
// Compare the two vectors.
//
ResultYmm = _mm256_cmpeq_epi8(SearchYmm, TargetYmm);
//
// Generate a mask from the result of the comparison.
//
Mask = _mm256_movemask_epi8(ResultYmm);
//
// There were at least 32 characters remaining in each string buffer,
// thus, every character needs to have matched in order for this search
// to continue. If there were less than 32 characters, we can terminate
// this prefix search here. (-1 == 0xffffffff == all bits set == all
// characters matched.)
//
if (Mask != -1) {
//
// Not all characters were matched, terminate the prefix search.
//
return NO_MATCH_FOUND;
}
//
// All 32 characters were matched. Update counters and pointers
// accordingly and jump back to the start of the 32-byte processing.
//
SearchStringRemaining -= 32;
TargetStringRemaining -= 32;
CharactersMatched += 32;
SearchBuffer += 32;
TargetBuffer += 32;
goto StartYmm;
}
//
// Intentional follow-on to StartXmm.
//
StartXmm:
//
// Update the search string's alignment.
//
if (SearchStringRemaining &gt;= 16 &amp;&amp; TargetStringRemaining &gt;= 16) {
//
// We have at least 16 bytes to compare for each string. Check the
// alignment for each buffer and do an aligned streaming load (non-
// temporal hint) if our alignment is at a 16-byte boundary or better;
// reverting to an unaligned load when not.
//
SearchStringAlignment = GetAddressAlignment(SearchBuffer);
if (SearchStringAlignment &lt; 16) {
SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);
} else {
SearchXmm = _mm_stream_load_si128((XMMWORD *)SearchBuffer);
}
TargetXmm = _mm_stream_load_si128((XMMWORD *)TargetBuffer);
//
// Compare the two vectors.
//
ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);
//
// Generate a mask from the result of the comparison.
//
Mask = _mm_movemask_epi8(ResultXmm);
//
// There were at least 16 characters remaining in each string buffer,
// thus, every character needs to have matched in order for this search
// to continue. If there were less than 16 characters, we can terminate
// this prefix search here. (-1 == 0xffff -&gt; all bits set -&gt; all chars
// matched.)
//
if ((SHORT)Mask != (SHORT)-1) {
//
// Not all characters were matched, terminate the prefix search.
//
return NO_MATCH_FOUND;
}
//
// All 16 characters were matched. Update counters and pointers
// accordingly and jump back to the start of the 16-byte processing.
//
SearchStringRemaining -= 16;
TargetStringRemaining -= 16;
CharactersMatched += 16;
SearchBuffer += 16;
TargetBuffer += 16;
goto StartXmm;
}
if (TargetStringRemaining == 0) {
//
// We'll get here if we successfully prefix matched the search string
// and all our buffers were aligned (i.e. we don't have a trailing
// &lt; 16 bytes comparison to perform).
//
return CharactersMatched;
}
//
// If we get here, we have less than 16 bytes to compare. Our target
// strings are guaranteed to be 16-byte aligned, so we can load them
// using an aligned stream load as in the previous cases.
//
TargetXmm = _mm_stream_load_si128((PXMMWORD)TargetBuffer);
//
// Loading the remainder of our search string's buffer is a little more
// complicated. It could reside within 15 bytes of the end of the page
// boundary, which would mean that a 128-bit load would cross a page
// boundary.
//
// At best, the page will belong to our process and we'll take a performance
// hit. At worst, we won't own the page, and we'll end up triggering a hard
// page fault.
//
// So, see if the current search buffer address plus 16 bytes crosses a page
// boundary. If it does, take the safe but slower approach of a ranged
// memcpy (movsb) into a local stack-allocated STRING_SLOT structure.
//
if (!PointerToOffsetCrossesPageBoundary(SearchBuffer, 16)) {
//
// No page boundary is crossed, so just do an unaligned 128-bit move
// into our Xmm register. (We could do the aligned/unaligned dance
// here, but it's the last load we'll be doing (i.e. it's not
// potentially on a loop path), so I don't think it's worth the extra
// branch cost, although I haven't measured this empirically.)
//
SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);
} else {
//
// We cross a page boundary, so only copy the the bytes we need via
// __movsb(), then do an aligned stream load into the Xmm register
// we'll use in the comparison.
//
__movsb((PBYTE)&amp;SearchSlot.Char,
(PBYTE)SearchBuffer,
SearchStringRemaining);
SearchXmm = _mm_stream_load_si128(&amp;SearchSlot.CharsXmm);
}
//
// Compare the final vectors.
//
ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);
//
// Generate a mask from the result of the comparison, but mask off (zero
// out) high bits from the target string's remaining length.
//
Mask = _bzhi_u32(_mm_movemask_epi8(ResultXmm), TargetStringRemaining);
//
// Count how many characters were matched and determine if we were a
// successful prefix match or not.
//
Count = __popcnt(Mask);
if ((USHORT)Count == TargetStringRemaining) {
//
// If we matched the same amount of characters as remaining in the
// target string, we've successfully prefix matched the search string.
// Return the total number of characters we matched.
//
CharactersMatched += (USHORT)Count;
return CharactersMatched;
}
//
// After all that work, our string match failed at the final stage! Return
// to the caller indicating we were unable to make a prefix match.
//
return NO_MATCH_FOUND;
}
</code></pre>
</div>
</div>
<p>
The AVX2 routine is overkill, especially considering the emphasis we put on
favoring short strings versus longer ones in the requirements section. But
we want to put broad statements like that to the test, so let's include it
as our first SIMD implementation so that we can see how it stacks up against
the simpler versions.
</p>
<p>
Also note, this is the first time we're seeing the full body of the SIMD-style
<code>IsPrefixOfStringInTable</code> implementation. It's commented heavily,
and, in general, the core algorithm doesn't fundamentally change across
iterations (things are just tweaked slightly), so I'd recommend reading through
it thoroughly to build up a mental model of how the matching algorithm works.
It's pretty straight forward, and the subsequent iterations will make a lot more
sense as they're typically presented as diffs against the previous version
first.
</p>
<pre class="code"><code class="language-c">
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_2(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This is our first AVX-optimized version of the routine.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable-&gt;pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
goto NoMatch;
}
//
// Unconditionally do the following five operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 3. Load the string table's slot lengths array into an XMM register.
//
// 4. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 5. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 3. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable-&gt;UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];
CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match-&gt;Index = (BYTE)Index;
Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}
</code></pre>
<p>
Let's see how version 2, our first SIMD attempt, performs in comparison to the
two baselines.
</p>
<p>
<a href="Benchmark-02-v1.svg" target="_blank">
<img class="svg-image" src="Benchmark-02-v1.svg"/>
</a>
</p>
<p>
Eek! Our first SIMD attempt actually has worse prefix matching performance in
most cases! The only area where it shows a performance improvement is negative
matching.
</p>
<hr/>
<a class="xref" name="IsPrefixOfStringInTable_3"></a>
<h2>IsPrefixOfStringInTable_3</h2>
<small>
<a href="#IsPrefixOfStringInTable_2"><i class="fa fa-arrow-left"></i> IsPrefixOfStringInTable_2</a> |
<a href="#IsPrefixOfStringInTable_4">IsPrefixOfStringInTable_4 <i class="fa fa-arrow-right"></i></a>
</small>
<p>
For version 3, let's replace the call to <code>IsPrefixMatchAvx2</code> with our
simpler version, <code>IsPrefixMatch</code>:
</p>
<div class="tab-box language box-3v2">
<ul class="tabs">
<li data-content="content-3v2-diff">Diff</li>
<li data-content="content-3-full">Full</li>
</ul>
<div class="content">
<pre class="code content-3v2-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_2.c IsPrefixOfStringInTable_3.c
--- IsPrefixOfStringInTable_2.c 2018-04-15 22:35:55.458773500 -0400
+++ IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
@@ -18,7 +18,7 @@
_Use_decl_annotations_
STRING_TABLE_INDEX
-IsPrefixOfStringInTable_2(
+IsPrefixOfStringInTable_3(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
@@ -278,7 +278,7 @@
TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];
- CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);
+ CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
</code></pre>
<pre class="code content-3-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_3(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This is our first AVX-optimized version of the routine.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable-&gt;pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
goto NoMatch;
}
//
// Unconditionally do the following five operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 3. Load the string table's slot lengths array into an XMM register.
//
// 4. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 5. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 3. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable-&gt;UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match-&gt;Index = (BYTE)Index;
Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}</code></pre>
</div>
</div>
<p>
<a href="Benchmark-03-v1.svg" target="_blank">
<img class="svg-image" src="Benchmark-03-v1.svg"/>
</a>
</p>
<p>
Phew! We finally see superior performance across the board. This ends the
short lived tenure of version 2, which is demonstrably worse in every case.
We'll also omit the <code>IsPrefixOfCStrInArray</code> routine from the graphs
for now (for the most part), as it has served its initial baseline purpose.
</p>
<hr/>
<a class="xref" name="IsPrefixOfStringInTable_4"></a>
<h2>IsPrefixOfStringInTable_4</h2>
<small>
<a href="#IsPrefixOfStringInTable_3"><i class="fa fa-arrow-left"></i> IsPrefixOfStringInTable_3</a> |
<a href="#IsPrefixOfStringInTable_5">IsPrefixOfStringInTable_5 <i class="fa fa-arrow-right"></i></a>
</small>
<p>
When I first wrote the initial string table code, I was playing around with
different strategies for loading the initial search string buffer. That
resulted in the file
<a
href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h">
StringLoadStoreOperations.h</a>, which defined a bunch of helper macros.
I've included them below, but don't spend too much time absorbing them, they're
not good practice, and they're all irrelevant anyway as soon as we switch to
<code>_mm_loadu_si128()</code> in a few versions. I'm including them because
they set the scene for versions 4, 5 and 6.
</p>
<pre class="code"><code class="language-c">
/*++
VOID
LoadSearchStringIntoXmmRegister_SEH(
_In_ STRING_SLOT Slot,
_In_ PSTRING String,
_In_ USHORT LengthVar
);
Routine Description:
Attempts an aligned 128-bit load of String-&gt;Buffer into Slot.CharXmm via
the _mm_load_si128() intrinsic. The intrinsic is surrounded in a __try/
__except block that catches EXCEPTION_ACCESS_VIOLATION exceptions.
If such an exception is caught, the routine will check to see if the string
buffer's address will cross a page boundary if 16-bytes are loaded. If a
page boundary would be crossed, a __movsb() intrinsic is used to copy only
the bytes specified by String-&gt;Length, otherwise, an unaligned 128-bit load
is attemped via the _mm_loadu_si128() intrinsic.
Arguments:
Slot - Supplies the STRING_SLOT local variable name within the calling
function that will receive the results of the load operation.
String - Supplies the name of the PSTRING variable that is to be loaded
into the slot. This will usually be one of the function parameters.
LengthVar - Supplies the name of a USHORT local variable that will receive
the value of min(String-&gt;Length, 16).
Return Value:
None.
--*/
#define LoadSearchStringIntoXmmRegister_SEH(Slot, String, LengthVar) \
LengthVar = min(String-&gt;Length, 16); \
TRY_SSE42_ALIGNED { \
Slot.CharsXmm = _mm_load_si128((PXMMWORD)String-&gt;Buffer); \
} CATCH_EXCEPTION_ACCESS_VIOLATION { \
if (PointerToOffsetCrossesPageBoundary(String-&gt;Buffer, 16)) { \
__movsb(Slot.Char, String-&gt;Buffer, LengthVar); \
} else { \
Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer); \
} \
}
/*++
VOID
LoadSearchStringIntoXmmRegister_AlignmentCheck(
_In_ STRING_SLOT Slot,
_In_ PSTRING String,
_In_ USHORT LengthVar
);
Routine Description:
This routine checks to see if a page boundary will be crossed if 16-bytes
are loaded from the address supplied by String-&gt;Buffer. If a page boundary
will be crossed, a __movsb() intrinsic is used to only copy String-&gt;Length
bytes into the given Slot.
If no page boundary will be crossed by a 128-bit load, the alignment of
the address supplied by String-&gt;Buffer is checked. If the alignment isn't
at least on a 16-byte boundary, an unaligned load will be issued via the
_mm_loadu_si128() intrinsic, otherwise, an _mm_load_si128() will be used.
Arguments:
Slot - Supplies the STRING_SLOT local variable name within the calling
function that will receive the results of the load operation.
String - Supplies the name of the PSTRING variable that is to be loaded
into the slot. This will usually be one of the function parameters.
LengthVar - Supplies the name of a USHORT local variable that will receive
the value of min(String-&gt;Length, 16).
Return Value:
None.
--*/
#define LoadSearchStringIntoXmmRegister_AlignmentCheck(Slot, String,LengthVar) \
LengthVar = min(String-&gt;Length, 16); \
if (PointerToOffsetCrossesPageBoundary(String-&gt;Buffer, 16)) { \
__movsb(Slot.Char, String-&gt;Buffer, LengthVar); \
} else if (GetAddressAlignment(String-&gt;Buffer) &lt; 16) { \
Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer); \
} else { \
Slot.CharsXmm = _mm_load_si128((PXMMWORD)String-&gt;Buffer); \
}
/*++
VOID
LoadSearchStringIntoXmmRegister_AlwaysUnaligned(
_In_ STRING_SLOT Slot,
_In_ PSTRING String,
_In_ USHORT LengthVar
);
Routine Description:
This routine performs an unaligned 128-bit load of the address supplied by
String-&gt;Buffer into the given Slot via the _mm_loadu_si128() intrinsic.
No checks are done regarding whether or not a page boundary will be crossed.
Arguments:
Slot - Supplies the STRING_SLOT local variable name within the calling
function that will receive the results of the load operation.
String - Supplies the name of the PSTRING variable that is to be loaded
into the slot. This will usually be one of the function parameters.
LengthVar - Supplies the name of a USHORT local variable that will receive
the value of min(String-&gt;Length, 16).
Return Value:
None.
--*/
#define LoadSearchStringIntoXmmRegister_Unaligned(Slot, String, LengthVar) \
LengthVar = min(String-&gt;Length, 16); \
if (PointerToOffsetCrossesPageBoundary(String-&gt;Buffer, 16)) { \
__movsb(Slot.Char, String-&gt;Buffer, LengthVar); \
} else if (GetAddressAlignment(String-&gt;Buffer) &lt; 16) { \
Slot.CharsXmm = _mm_loadu_si128(String-&gt;Buffer); \
} else { \
Slot.CharsXmm = _mm_load_si128(String-&gt;Buffer); \
}
/*++
VOID
LoadSearchStringIntoXmmRegister_AlwaysMovsb(
_In_ STRING_SLOT Slot,
_In_ PSTRING String,
_In_ USHORT LengthVar
);
Routine Description:
This routine copies min(String-&gt;Length, 16) bytes from String-&gt;Buffer
into the given Slot via the __movsb() intrinsic. The memory referenced by
the Slot is not cleared first via SecureZeroMemory().
Arguments:
Slot - Supplies the STRING_SLOT local variable name within the calling
function that will receive the results of the load operation.
String - Supplies the name of the PSTRING variable that is to be loaded
into the slot. This will usually be one of the function parameters.
LengthVar - Supplies the name of a USHORT local variable that will receive
the value of min(String-&gt;Length, 16).
Return Value:
None.
--*/
#define LoadSearchStringIntoXmmRegister_AlwaysMovsb(Slot, String, LengthVar) \
LengthVar = min(String-&gt;Length, 16); \
__movsb(Slot.Char, String-&gt;Buffer, LengthVar);
</code></pre>
<p>
In our <a
href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringTable2.vcxproj#L52">StringTable2.vcxproj</a>
file, we have the following:
</p>
<hr/>
<small><pre>
&lt;PropertyGroup Label="Globals"&gt;
...
&lt;LoadSearchStringStrategy&gt;AlwaysMovsb&lt;/LoadSearchStringStrategy&gt;
&lt;!--
&lt;LoadSearchStringStrategy&gt;SEH&lt;/LoadSearchStringStrategy&gt;
&lt;LoadSearchStringStrategy&gt;AlignmentCheck&lt;/LoadSearchStringStrategy&gt;
&lt;LoadSearchStringStrategy&gt;AlwaysUnaligned&lt;/LoadSearchStringStrategy&gt;
--&gt;
</pre></small>
<hr/>
<p>
This basically allowed me to toggle which of the strategies I wanted to use to
do load the search string into an XMM register. As you can see above, the
default is to use the <code>AlwaysMovsb</code> approach*; so, with version 4,
let's swap that out for the <code>SEH</code> approach, which wraps the aligned
load in a structured exception handler that falls back to <code>__movsb()</code>
if the aligned load fails and the pointer plus 16 bytes crosses a page boundary.
</p>
<p>
<small>
<p>[*]: Or was it?</p>
<p>Narrator: <a
href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h#L226">
it wasn't</a>.</p>
<small><p>(Note: these little <em>Narrator</em> interjections work best if you
imagine they're being read in <a
href="https://en.wikipedia.org/wiki/Arrested_Development_(TV_series)">Ron
Howard</a>'s voice.)</p>
</small>
</small>
</p>
<div class="tab-box language box-4v3">
<ul class="tabs">
<li data-content="content-4v3-diff">Diff</li>
<li data-content="content-4-full">Full</li>
</ul>
<div class="content">
<pre class="code content-4v3-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_4.c IsPrefixOfStringInTable_3.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_4.c 2018-04-15 22:35:55.453274200 -0400
@@ -18,7 +18,7 @@
_Use_decl_annotations_
STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_4(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
@@ -31,7 +31,8 @@
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
- This is our first AVX-optimized version of the routine.
+ This routine is a variant of version 3 that uses a structured exception
+ handler for loading the initial search string.
Arguments:
@@ -123,7 +124,7 @@
// Load the first 16-bytes of the search string into an XMM register.
//
- LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+ LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);
//
// Broadcast the search string's unique characters according to the string
</code></pre>
<pre class="code content-4-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_3(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This routine is a variant of version 3 that uses a structured exception
handler for loading the initial search string.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable-&gt;pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
goto NoMatch;
}
//
// Unconditionally do the following five operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 3. Load the string table's slot lengths array into an XMM register.
//
// 4. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 5. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 3. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable-&gt;UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match-&gt;Index = (BYTE)Index;
Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}</code></pre>
</div>
</div>
<p>
Performance of version 4 was slightly worse than 3 in every case:
<a href="Benchmark-04-v1.svg" target="_blank">
<img class="svg-image" src="Benchmark-04-v1.svg"/>
</a>
</p>
<p>
Version 3 is still in the lead with the <code>AlwaysMovsb</code>-based search string
loading approach.
<small>
<p>Narrator: except the
<a href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h#L112"> AlignmentCheck</a>
macro was actually active, not the
<a href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h#L112"> AlwaysMovsb</a>
one.
</small>
</p>
<hr/>
<a class="xref" name="IsPrefixOfStringInTable_5"></a>
<h2>IsPrefixOfStringInTable_5</h2>
<small>
<a href="#IsPrefixOfStringInTable_4"><i class="fa fa-arrow-left"></i> IsPrefixOfStringInTable_4</a> |
<a href="#IsPrefixOfStringInTable_6">IsPrefixOfStringInTable_6 <i class="fa fa-arrow-right"></i></a>
</small>
<p>
Version 5 is an interesting one. It's the first time we attempt to validate our
claim that it's more efficient to give the CPU a bunch of independent things to
do up-front, versus putting more branches in and attempting to terminate as
early as possible.
</p>
<p>
Note: we'll also explicitly use the <code>LoadSearchStringIntoXmmRegister_AlwaysMovsb</code>
macro here, instead of <code>LoadSearchStringIntoXmmRegister</code>, just to
make it more explicit that we're actually relying on the
<code>__movsb()</code>-based string loading routine.
</p>
<small><p>Narrator: can anyone spot the mistake with this logic?</p></small>
<div class="tab-box language box-5v3">
<ul class="tabs">
<li data-content="content-5v3-diff">Diff</li>
<li data-content="content-5-full">Full</li>
</ul>
<div class="content">
<pre class="code content-5v3-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_5.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_5.c 2018-04-15 13:24:52.480972900 -0400
@@ -16,9 +16,13 @@
#include "stdafx.h"
+//
+// Variant of v3 with early-exits.
+//
+
_Use_decl_annotations_
STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_5(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
@@ -31,7 +35,11 @@
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
- This is our first AVX-optimized version of the routine.
+ This routine is a variant of version 3 that uses early exits (i.e.
+ returning NO_MATCH_FOUND as early as we can). It is designed to evaluate
+ the assertion we've been making that it's more optimal to give the CPU
+ to do a bunch of things up front versus doing something, then potentially
+ branching, doing the next thing, potentially branching, etc.