
<!DOCTYPE html>
<html>
    <!-- vim:set tw=100 ts=8 sw=4 et                                                            :-->
    <head>
        <title>Is Prefix Of String In Table?  A Journey Into SIMD String Processing.</title>
        <meta name="msvalidate.01" content="E828541C73A98C315E3D6B8C88EF6057" />
        <meta name="viewport" content="width=device-width, initial-scale=0.65, maximum-scale=1.0" />

        <!-- https://www.google.com/fonts#UsePlace:use/Collection:Lato:200,300,300italic -->
        <!--
        <meta name="viewport" content="width=device-width, min-width=1100px, initial-scale=0.7, maximum-scale=1.0, shrint-to-fit=no" />
        <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:200,300,300italic">
        <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Merriweather:300,300i,400,400i">
        -->
        <link href="https://fonts.googleapis.com/css?family=Open+Sans:300,400" rel="stylesheet">
        <link rel="stylesheet" href="//oss.maxcdn.com/normalize/3.0.1/normalize.min.css">
        <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css">
        <link rel="stylesheet" href="../prism.css">
        <link rel="stylesheet" href="../home.css">
        <link rel="stylesheet" href="page.css">
        <script src="//oss.maxcdn.com/jquery/2.1.1/jquery.min.js"></script>
        <script src="../prism.js"></script>
        <script src="../home.js"></script>
        <script src="page.js"></script>
    </head>
    <body>

        <header class="header">
            <div class="header-logo" href="#">
                <!--
                <a class="homename" href="http://trent.me"><strong>T</strong>rent <strong>N</strong>elson</a>
                -->
                <a class="homename" href=".."><strong>T</strong>rent <strong>N</strong>elson</a>
            </div>
            <ul class="header-links">
                <li><a href="#home"><i class="fa fa-home"></i> Is Prefix Of String In Table?</a></li>
                <li><a href="#contents"><i class="fa fa-align-left"></i> Contents</a></li>
                <li><a href="https://github.com/tpn/tracer/tree/v0.1.12/StringTable2" target="_blank"><i class="fa fa-github"></i> GitHub</a></li>
                <li><a href="https://twitter.com/trentnelson" target="_blank"><i class="fa fa-twitter"></i> Twitter</a></li>
                <!--
                <li><a href="https://twitter.com/trentnelson" class="twitter-follow-button" data-show-count="false">Follow @trentnelson</a></li>
                -->
            </ul>
        </header>

        <a class="xref" name="home"></a>
        <section class="section section-hero">
            <div class="container">
                <h1>
                    Is Prefix Of String In Table?
                </h1>
                <h3>
                    A Journey Into SIMD String Processing.
                </h3>
            </div>
        </section>

        <section class="section section-summary">
            <div class="container">

                <small>
                    Published: 4th May, 2018.
                    <!--
                    Updated: 4th May, 2018.
                    Target publish date: <del>20th April, 2018</del> <del>23rd April, 2018</del>
                    <del>25th April, 2018</del> <del>30th April, 2018</del> <del>2nd May, 2018</del>
                    7th May, 2018.
                    -->
                    Thanks to <a href="https://twitter.com/rygorous">Fabian Giesen</a>,
                    <a href="https://twitter.com/pshufb">Wojciech Mu&#322;a</a>,
                    <a href="https://twitter.com/geofflangdale">Geoff Langdale</a>,
                    <a href="https://twitter.com/lemire">Daniel Lemire</a>, and
                    <a href="https://twitter.com/KendallWillets">Kendall Willets</a>
                    for their valuable
                    <a href="https://twitter.com/trentnelson/status/985715037934440448">feedback</a>
                    on an early draft of this article.  <a
                        href="https://github.com/tpn/website/blob/master/is-prefix-of-string-in-table/index.html">
                        View this page's source on GitHub.</a>

                    <!-- 15.6 + 48.53 + 2.42 + 33.85 + 42 + 49.67 + 11.55 + 9.12 + 12.95 + 4.87 -->
                    Hours spent on this article to date: 230.56.

                <hr/>
                <h3>TL;DR</h3>
                <p>
                    Wrote some C and assembly code that uses SIMD instructions to perform prefix
                    matching of strings.  The C code was between 4-7x faster than the baseline
                    implementation for prefix matching.  The assembly code was 9-12x faster than the
                    baseline specifically for the negative match case (determining that an incoming
                    string definitely does <strong>not</strong> prefix match any of our known
                    strings).  The fastest negative match could be done in around 6 CPU cycles, which
                    is pretty quick.  (Integer division, for example, takes about 90 cycles.)
                </p>
                </small>
                <hr/>

                <h2>Overview</h2>
                <p>

                    Goal: given a string, determine if it prefix-matches a set of known strings as
                    fast as possible.  That is, in a set of known strings, do any of them prefix
                    match the incoming search string?

                </p>
                <p>

                    A reference implementation was written in C as a <a
                    href="#IsPrefixOfCStrInArray">baseline</a>, which simply looped
                    through an array of strings, comparing each one, byte-by-byte, looking for a
                    prefix match.  Prefix match performance ranged from 28 CPU cycles to 130, and
                    negative match performance was around 74 cycles.

                </p>
                <p>

                    A SIMD-friendly C structure called <a href="#STRING_TABLE">STRING_TABLE</a> was
                    derived.  It is optimized for up to 16 strings, ideally of length less than or
                    equal to 16 characters.  The table is created from the set of known strings
                    up-front; it is sorted by length, ascending, and a unique character (with
                    regard to other characters at the same byte offset) is then extracted, along
                    with its index.  A 16 byte character array, <a
                    href="#STRING_SLOT">STRING_SLOT</a>, is used to capture the unique characters.
                    A 16 element array of unsigned characters, SLOT_INDEX, is used to capture the
                    index.  Similarly, lengths are stored in the same fashion via SLOT_LENGTHS.
                    Finally, a 16 element array of STRING_SLOTs is used to capture up to the first
                    16 bytes of each string in the set.

                </p>
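
                <p>

                    As a rough illustration of that layout, here is a minimal C11 sketch.  All of
                    the names are hypothetical, and the sketch assumes the four parts sit back to
                    back; the actual STRING_TABLE definition appears later in the article and may
                    carry additional fields.  The lengths are stored as signed bytes here because
                    the byte comparison used against them (<code>vpcmpgtb</code>) is a signed one.

                </p>

<pre class="code"><code class="language-c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/*
 * Hypothetical names throughout; only the 16-byte rows and their order are
 * taken from the description above.  The real STRING_TABLE may carry
 * additional fields that this sketch omits.
 */
typedef struct string_slot_sketch {
    _Alignas(16) char chars[16];
} string_slot_sketch;

typedef struct string_table_sketch {
    string_slot_sketch unique_chars;     /* One distinguishing character per string. */
    uint8_t            unique_index[16]; /* Byte offset of that character.           */
    int8_t             lengths[16];      /* Length of each string.                   */
    string_slot_sketch slots[16];        /* Up to the first 16 bytes of each string. */
} string_table_sketch;

/* Each row is exactly one XMM register (16 bytes) wide. */
static_assert(sizeof(string_slot_sketch) == 16, "slot must be 16 bytes");
static_assert(offsetof(string_table_sketch, unique_index) == 0x10, "second row");
static_assert(offsetof(string_table_sketch, lengths) == 0x20, "third row");
static_assert(offsetof(string_table_sketch, slots) == 0x30, "slots follow the rows");
</code></pre>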

                <p>

                    An example of the memory layout of the STRING_TABLE structure at run time, using
                    sample <a href="#ntfs-reserved-names">test data</a>, is depicted below.  Note
                    the width of each row is 16 bytes (128 bits), which is the size of an XMM register.

                </p>

                <a href="StringTable.svg" target="_blank">
                    <img class="svg-image" src="StringTable.svg"/>
                </a>

                <!--
                <picture>
                    <source srcset="StringTableLayout2.png"/>
                    <img width="1042px" height="675px" srcset="StringTableLayout2.png"/>
                </picture>
                -->

                <p>
                    The layout of the STRING_TABLE structure allows us to determine that a given
                    search string does <strong>not</strong> prefix match any of the 16 strings,
                    testing all of them at once, in 12 assembly instructions.  This breaks down
                    into 18 &#181;ops, with a
                    block throughput of 3.48 cycles on Intel's Skylake architecture.  (In practice,
                    this clocks in at around 6 CPU cycles.)
                </p>

                <div class="tab-box language box-intro">
                    <ul class="tabs">
                        <li data-content="content-intro-nasm">Assembly</li>
                        <li data-content="content-intro-iaca">IACA</li>
                    </ul>
                    <div class="content">
<pre class="code content-intro-nasm"><code class="language-nasm">
mov      rax,  String.Buffer[rdx]                   ; Load address of string buffer.
vpbroadcastb xmm4, byte ptr String.Length[rdx]      ; Broadcast string length.
vmovdqa  xmm3, xmmword ptr StringTable.Lengths[rcx] ; Load table lengths.
vmovdqu  xmm0, xmmword ptr [rax]                    ; Load string buffer.
vpcmpgtb xmm1, xmm3, xmm4                           ; Identify slots &gt; string len.
vpshufb  xmm5, xmm0, StringTable.UniqueIndex[rcx]   ; Rearrange string by unique index.
vpcmpeqb xmm5, xmm5, StringTable.UniqueChars[rcx]   ; Compare rearranged to unique.
vptest   xmm1, xmm5                                 ; Unique slots AND (!long slots).
jnc      short Pfx10                                ; CY=0, continue with routine.
xor      eax, eax                                   ; CY=1, no match.
not      al                                         ; al = -1 (NO_MATCH_FOUND).
ret                                                 ; Return NO_MATCH_FOUND.
</code></pre>

<pre class="code content-intro-iaca"><code class="language-nasm">
S:\Source\tracer>iaca x64\Release\StringTable2.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\StringTable2.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.48 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  24
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port   |  0  - DV  |  1  |  2  - D   |  3  - D   |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------------
| Cycles | 2.0   0.0 | 1.0 | 3.5   3.5 | 3.5   3.5 | 0.0 | 3.0 | 2.0 | 0.0 |
----------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred

|    | Ports pressure in cycles        | |
|&#181;ops|0DV| 1 | 2 - D | 3 - D |4| 5 | 6 |7|
-------------------------------------------
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | mov rax, qword ptr [rdx+0x8]
| 2  |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpbroadcastb xmm4, byte ptr [rdx]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm3, xmmword ptr [rcx+0x20]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqu xmm0, xmmword ptr [rax]
| 1  |1.0|   |       |       | |   |   | | vpcmpgtb xmm1, xmm3, xmm4
| 2^ |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpshufb xmm5, xmm0, xmmword ptr [rcx+0x10]
| 2^ |   |1.0|0.5 0.5|0.5 0.5| |   |   | | vpcmpeqb xmm5, xmm5, xmmword ptr [rcx]
| 2  |1.0|   |       |       | |1.0|   | | vptest xmm1, xmm5
| 1  |   |   |       |       | |   |1.0| | jnb 0x10
| 1* |   |   |       |       | |   |   | | xor eax, eax
| 1  |   |   |       |       | |   |1.0| | not al
| 3^ |   |   |0.5 0.5|0.5 0.5| |   |   | | ret
Total Num Of &#181;ops: 18
</code></pre>

                    </div>
                </div>

                <p>

                    Here's a simplified walk-through of a negative match in action,
                    using the search string "CAT":

                    <a href="StringTable-NegativeMatch-v3.svg" target="_blank">
                        <img class="svg-image" src="StringTable-NegativeMatch-v3.svg"/>
                    </a>

                </p>
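
                <p>

                    The same steps can be modelled in plain scalar C, one lane at a time.  This is
                    only a sketch for following the logic, not the routine used in the project: the
                    structure and function names are hypothetical, the loop stands in for the SIMD
                    instructions named in the comments, and it assumes unused slots are filled so
                    that they can never produce a hit (for example, with a length of 127).

                </p>

<pre class="code"><code class="language-c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define SLOTS 16

/* Hypothetical, simplified view of the three 16-byte rows used by the check. */
struct negative_match_rows {
    char    unique_chars[SLOTS];
    uint8_t unique_index[SLOTS];
    int8_t  lengths[SLOTS];      /* Assumed to be non-negative. */
};

/*
 * Returns 1 when no slot can possibly prefix match the search string (the
 * CY=1 path in the assembly), or 0 when at least one slot needs a closer look.
 */
static int
definitely_no_match(const struct negative_match_rows *rows,
                    const char *search, size_t search_len)
{
    char buffer[SLOTS] = { 0 };
    int lane;

    assert(rows != NULL &amp;&amp; search != NULL);

    /* vmovdqu: take the first 16 bytes of the search string (zero padded here). */
    memcpy(buffer, search, search_len &lt; SLOTS ? search_len : SLOTS);

    for (lane = 0; lane &lt; SLOTS; lane++) {
        /* vpcmpgtb: is this slot's string longer than the search string? */
        int too_long = (size_t)rows-&gt;lengths[lane] &gt; search_len;

        /* vpshufb + vpcmpeqb: does the search string hold the slot's unique
         * character at the slot's unique index? */
        int unique_hit =
            buffer[rows-&gt;unique_index[lane] &amp; 0x0f] == rows-&gt;unique_chars[lane];

        /* vptest: a hit in a slot that is not too long clears the carry flag. */
        if (unique_hit &amp;&amp; !too_long) {
            return 0;
        }
    }

    return 1;
}
</code></pre>

                <p>

                    Note that a return value of 0 only means a slot survived the filter; the full
                    routine still has to compare the surviving slot's bytes against the search
                    string to confirm a prefix match.

                </p>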

                <p>

                    Ten iterations of a function named IsPrefixOfStringInTable were authored.  The
                    <a href="#IsPrefixOfStringInTable_10">tenth</a> and final iteration was the
                    fastest, prefix matching in as little as 19 cycles &mdash; a 4x improvement over
                    the baseline.  Negative matching took 11 cycles &mdash; a 6.7x improvement.

                </p>

                <p>

                    An <a
                    href="#IsPrefixOfStringInTable_x64_2">assembly</a>
                    version of the algorithm was authored specifically to optimize for the negative
                    match case, and was able to do so in as little as 8 cycles, representing a 9x
                    improvement over the baseline.  (It was a little bit slower than the fastest
                    C routine in the case of prefix matches, though, as can be seen below.)

                </p>

                <p>

                    Feedback for an early draft of this article was then solicited via <a
                    href="https://twitter.com/trentnelson/status/985715037934440448">Twitter</a>,
                    resulting in four more iterations of the C version, and three more iterations of
                    the assembly version.  The PGO build of the fastest C version prefix matched in
                    about 16 cycles.  It also had the best "worst case input string" performance
                    (where three slots needed comparison), negative matching in about 26 cycles.
                    The fifth iteration of the assembly version negative matched in about 6 cycles.
                    Compared with the earlier figures of 19 and 8 cycles, those are improvements of
                    3 and 2 cycles, respectively.

                </p>

                <p>
                    <a href="Benchmark-Overview-v2.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-Overview-v2.svg"/>
                    </a>
                </p>

                <p>

                    We were then ready to publish, but felt compelled to investigate an odd
                    performance quirk we'd noticed with one of the assembly routines, which
                    yielded 7 more assembly versions.  Were any of them faster?  Let's find out.

                </p>

            </div>
        </section>
        <hr/>

        <section class="section section-toc">
            <div class="container">

                <a class="xref" name="contents"></a>
                <h1>Contents</h1>

                <p>
                    <ul class="toc-list">
                        <li>
                            <a href="#background">Background</a>
                            <ul class="toc-list-2">
                                <li><a href="#tracer-project">The Tracer Project</a></li>
                                <li><a href="#baseline">Baseline C Implementation</a></li>
                                <li>
                                    <a href="#proposed-interface">Proposed Interface</a>
                                    <ul>
                                        <li>
                                            The <a href="#IsPrefixOfStringInTable">
                                            IsPrefixOfStringInTable</a> function.
                                        </li>
                                        <li>
                                            The <a href="#STRING_MATCH">STRING_MATCH</a> structure.
                                        </li>
                                    </ul>
                                </li>
                                <li><a href="#test-data">The Test Data</a></li>
                                <li>
                                    <a href="#requirements-and-design-decisions">
                                        Requirements and Design Decisions
                                    </a>
                                </li>
                            </ul>
                        </li>
                        <li>
                            <a href="#data-structures">The Data Structures</a>
                            <ul class="toc-list-2">
                                <li>
                                    <a href="#STRING_TABLE">STRING_TABLE</a>
                                </li>
                                <li><a href="#STRING_ARRAY">STRING_ARRAY</a></li>
                                <li><a href="#STRING_SLOT">STRING_SLOT</a></li>
                                <li><a href="#SLOT_INDEX">SLOT_INDEX</a></li>
                                <li><a href="#SLOT_LENGTHS">SLOT_LENGTHS</a></li>
                                <li><a href="#CreateStringTable">String Table Construction</a></li>
                            </ul>
                        </li>

                        <li>
                            <a href="#benchmark">The Benchmark</a>
                        </li>
                        <li>
                            <a href="#implementations">The Implementations</a>
                            <ul class="toc-list-2">
                                <li>
                                    Round 1
                                    <ul class="toc-list-3">
                                        <li><a href="#round1-c">C</a></li>
                                        <ul class="toc-list-4">
                                            <li>
                                                <a href="#IsPrefixOfCStrInArray">IsPrefixOfCStrInArray</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_1">IsPrefixOfStringInTable_1</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_2">IsPrefixOfStringInTable_2</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_3">IsPrefixOfStringInTable_3</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_4">IsPrefixOfStringInTable_4</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_5">IsPrefixOfStringInTable_5</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_6">IsPrefixOfStringInTable_6</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_7">IsPrefixOfStringInTable_7</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_8">IsPrefixOfStringInTable_8</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_9">IsPrefixOfStringInTable_9</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_10">IsPrefixOfStringInTable_10</a>
                                            </li>

                                        </ul>

                                        <li><a href="#round1-assembly">Assembly</a>
                                        <ul class="toc-list-4">
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_1">IsPrefixOfStringInTable_x64_1</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_2">IsPrefixOfStringInTable_x64_2</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_3">IsPrefixOfStringInTable_x64_3</a>
                                            </li>
                                        </ul>
                                    </ul>
                                </li>
                                <li>
                                    <a href="#round2">Round 2; Post-Internet Feedback</a>
                                    <ul class="toc-list-3">
                                        <li>C</li>
                                        <ul class="toc-list-4">
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_11">IsPrefixOfStringInTable_11</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_12">IsPrefixOfStringInTable_12</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_13">IsPrefixOfStringInTable_13</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_14">IsPrefixOfStringInTable_14</a>
                                            </li>
                                        </ul>

                                        <li>Assembly
                                        <ul class="toc-list-4">
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_4">IsPrefixOfStringInTable_x64_4</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_5">IsPrefixOfStringInTable_x64_5</a>
                                            </li>
                                        </ul>
                                    </ul>
                                </li>
                                <li>
                                    <a href="#IsPrefixOfStringInTable_x64_3-review">Round 3; Investigating why IsPrefixOfStringInTable_x64_3 was so slow...</a>
                                    <ul class="toc-list-3">
                                        <ul class="toc-list-4">
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_7">IsPrefixOfStringInTable_x64_7</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_8">IsPrefixOfStringInTable_x64_8</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_9">IsPrefixOfStringInTable_x64_9</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_10">IsPrefixOfStringInTable_x64_10</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_11">IsPrefixOfStringInTable_x64_11</a>
                                            </li>
                                            <li>
                                                <a href="#IsPrefixOfStringInTable_x64_12">IsPrefixOfStringInTable_x64_12</a>
                                            </li>
                                        </ul>
                                    </ul>
                                </li>
                            </ul>
                        </li>

                        <li>
                            <a href="#other-applications">Other Applications</a>
                        </li>
                        <li>
                            <a href="#appendix">Appendix</a>
                            <ul class="toc-list-2">
                                <li><a href="#implementation-considerations">Implementation Considerations</a></li>
                                <li><a href="#release-vs-pgo">Release vs PGO</a></li>
                                <li>A list of all C <a href="#typedefs">typedefs</a> referenced in the article</li>
                            </ul>
                        </li>
                    </ul>
                </p>
            </div>
        </section>
        <hr/>

        <section class="section section-body">
            <div class="container">

                <h1>The Background</h1>

                <h2>The Tracer Project</h2>
                <p>

                    One of the frustrations I had with existing Python profilers was that there was
                    no easy or efficient means to filter or exclude trace information based on the module
                    name of the code being executed.  I tackled this in my
                    <a href="https://github.com/tpn/tracer">tracer</a> project, which allows you to
                    set an environment variable named TRACER_MODULE_NAMES to restrict which modules
                    should be traced, e.g.:
                    <pre>set TRACER_MODULE_NAMES=myproject1;myproject2;myproject3.subproject;numpy;pandas;scipy</pre>

                </p>

                <p>

                    If the code being executed is coming from the module
                    <code>myproject3.subproject.foo</code>, then we need to trace it, as that string
                    <strong>prefix matches</strong> the third entry on our list.

                </p>
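
                <p>

                    To make the direction of the match concrete, here is a small sketch that walks a
                    semicolon-delimited list in the same format as TRACER_MODULE_NAMES and reports
                    the first entry that is a prefix of a module name.  The function name is
                    hypothetical and this is not how the tracer project parses the variable; it
                    simply shows that the <em>entry</em> must be a prefix of the incoming name, not
                    the other way around.

                </p>

<pre class="code"><code class="language-c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/*
 * Return the zero-based position of the first entry in a semicolon-delimited
 * list that is a prefix of name, or -1 if no entry is.  (Hypothetical helper.)
 */
static int
find_prefix_entry(const char *list, const char *name)
{
    size_t name_len;
    int index = 0;

    assert(list != NULL &amp;&amp; name != NULL);
    name_len = strlen(name);

    while (*list != '\0') {
        /* Length of the current entry, up to the next ';' or the end. */
        size_t entry_len = strcspn(list, ";");

        /* An entry longer than the name can never be a prefix of it. */
        if (entry_len &gt; 0 &amp;&amp; entry_len &lt;= name_len &amp;&amp;
            strncmp(list, name, entry_len) == 0) {
            return index;
        }

        /* Step over the entry and its delimiter, if there is one. */
        list += entry_len;
        if (*list == ';') {
            list++;
        }
        index++;
    }

    return -1;
}
</code></pre>

                <p>

                    With the list above, <code>myproject3.subproject.foo</code> yields 2 (the third
                    entry), whereas <code>myproject3</code> on its own yields -1, because every
                    entry that starts with those characters is longer than the name.

                </p>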

                <p>

                    This article details the custom data structure and algorithm I came up with to
                    try to solve the prefix matching problem more efficiently with a SIMD
                    approach.  The resulting <a
                    href="https://github.com/tpn/tracer/tree/master/StringTable2">StringTable</a>
                    component is used extensively within the tracer project, and as such, must
                    conform to unique constraints such as no use of the C runtime library and
                    allocating all memory through TraceStore-backed allocators.  Thus, it's not
                    really something you'd drop into your current project in its current form.
                    Hopefully, the article still proves to be interesting.

                </p>

                <small>

                    <p>

                        Note: the code samples provided herein are copied directly from the tracer
                        project, which is written in C and assembly, and uses the Pascal-esque
                        <em>Cutler Normal Form</em> style for C.  If you're used to the more UNIX-style
                        <a href="https://www.freebsd.org/cgi/man.cgi?query=style&sektion=9">
                        <em>Kernel Normal Form</em></a> of C, it's quite like that, except that it's
                        absolutely nothing like that, and all these code samples will probably be
                        very jarring.

                    </p>

                </small>

                <a class="xref" name="baseline"></a>
                <h2>Baseline C Implementation</h2>

                <p>

                    The simplest way of solving this in C is to have an array of C strings (i.e.
                    NUL-terminated byte arrays), then for each string, loop through byte by byte
                    and see if it prefix matches the search string.

                </p>

                <div class="tab-box language box-simple">
                    <ul class="tabs">
                        <li data-content="content-simple-cnf">Baseline (Cutler Normal Form)</li>
                        <li data-content="content-simple-knf">Baseline (Kernel Normal Form)</li>
                    </ul>
                    <div class="content">
<pre class="code content-simple-cnf"><code class="language-c">
//
// Declare a set of module names to be used as a string array.
//

const PCSZ ModuleNames[] = {
    "myproject1",
    "myproject2",
    "myproject3.subproject",
    "numpy",
    "pandas",
    "scipy",
    NULL,
};

//
// Define the function pointer typedef.
//

typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_CSTR_IN_ARRAY)(
    _In_ PCSZ *StringArray,
    _In_ PCSZ String,
    _Out_opt_ PSTRING_MATCH Match
    );
typedef IS_PREFIX_OF_CSTR_IN_ARRAY *PIS_PREFIX_OF_CSTR_IN_ARRAY;

//
// Forward declaration.
//

IS_PREFIX_OF_CSTR_IN_ARRAY IsPrefixOfCStrInArray;

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfCStrInArray(
    PCSZ *StringArray,
    PCSZ String,
    PSTRING_MATCH Match
    )
{
    PCSZ Left;
    PCSZ Right;
    PCSZ *Target;
    ULONG Index = 0;
    ULONG Count;

    for (Target = StringArray; *Target != NULL; Target++, Index++) {
        Count = 0;
        Left = String;
        Right = *Target;

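        //
        // Advance only while both strings agree.  Incrementing inside the
        // comparison would step past a mismatching final character and could
        // report a false match.
        //

        while (*Left &amp;&amp; *Right &amp;&amp; *Left == *Right) {
            Left++;
            Right++;
            Count++;
        }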

        if (Count &gt; 0 &amp;&amp; !*Right) {
            if (ARGUMENT_PRESENT(Match)) {
                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)Count;
                Match-&gt;String = NULL;
            }
            return (STRING_TABLE_INDEX)Index;
        }
    }

    return NO_MATCH_FOUND;
}

</code></pre>
<pre class="code content-simple-knf"><code class="language-c">
const char *module_names[] = {
    "myproject1",
    "myproject2",
    "myproject3.subproject",
    "numpy",
    "pandas",
    "scipy",
    0,
};

struct string_match {
    /* Index of the match. */
    unsigned char index;

    /* Number of characters matched. */
    unsigned char number_of_chars_matched;

    /* Pad out to an 8-byte boundary. */
    unsigned short padding[3];

    /* Pointer to the string that was matched. */
    const char *str;
};

signed char
is_prefix_of_c_str_in_array(const char **array,
                            const char *str,
                            struct string_match *match)
{
    const char *left, *right, **target;
    unsigned int c, i = 0;

    for (target = array; *target; target++, i++) {
        c = 0;
        left = str;
        right = *target;
        /* Advance both pointers only while the characters agree. */
        while (*left &amp;&amp; *right &amp;&amp; *left == *right) {
            left++;
            right++;
            c++;
        }
        /* The whole table entry was consumed, so it is a prefix of str. */
        if (c &gt; 0 &amp;&amp; !*right) {
            if (match) {
                match-&gt;index = (unsigned char)i;
                match-&gt;number_of_chars_matched = (unsigned char)c;
                match-&gt;str = *target;
            }
            return (signed char)i;
        }
    }

    return -1;
}
</code></pre>
                    </div>
                </div>

                <p>

                    Another type of code pattern that the string table attempts to replace is
                    anything that does a lot of if/else if/else if-type string comparisons to
                    look for keywords.  For example, in the
                    <a href="https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb10315479fc00086f08e25d968b4b43c49/q3asm/q3asm.c#L609">
                    Quake III</a> source, there's some symbol/string processing logic that looks
                    like this:

                </p>

<pre class="code content-q3"><code class="language-c">
	// call instructions reset currentArgOffset
	if ( !strncmp( token, "CALL", 4 ) ) {
		EmitByte( &amp;segment[CODESEG], OP_CALL );
		instructionCount++;
		currentArgOffset = 0;
		return;
	}

	// arg is converted to a reversed store
	if ( !strncmp( token, "ARG", 3 ) ) {
		EmitByte( &amp;segment[CODESEG], OP_ARG );
		instructionCount++;
		if ( 8 + currentArgOffset >= 256 ) {
			CodeError( "currentArgOffset >= 256" );
			return;
		}
		EmitByte( &amp;segment[CODESEG], 8 + currentArgOffset );
		currentArgOffset += 4;
		return;
	}

	// ret just leaves something on the op stack
	if ( !strncmp( token, "RET", 3 ) ) {
		EmitByte( &amp;segment[CODESEG], OP_LEAVE );
		instructionCount++;
		EmitInt( &amp;segment[CODESEG], 8 + currentLocals + currentArgs );
		return;
	}

	// pop is needed to discard the return value of
	// a function
	if ( !strncmp( token, "pop", 3 ) ) {
		EmitByte( &amp;segment[CODESEG], OP_POP );
		instructionCount++;
		return;
	}

        ...
</code></pre>

                <p>

                    An example of using the string table approach for this problem is discussed
                    in the <a href="#other-applications">Other Applications</a> section.

                </p>

                <a class="xref" name="proposed-interface"></a>

                <h3>The Proposed Interface</h3>

                <p>

                    Let's take a look at the interface we're proposing, the
                    <code>IsPrefixOfStringInTable</code> function, that this article is based upon:

                </p>

                <a class="xref" name="IsPrefixOfStringInTable"></a>

<pre class="code content-proposed-interface-cnf"><code class="language-c">
//
// Our string table index is simply a char, with -1 indicating no match found.
//

typedef CHAR STRING_TABLE_INDEX;
#define NO_MATCH_FOUND -1

typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_STRING_IN_TABLE)(
    _In_ PSTRING_TABLE StringTable,
    _In_ PSTRING String,
    _Out_opt_ PSTRING_MATCH StringMatch
    );
typedef IS_PREFIX_OF_STRING_IN_TABLE *PIS_PREFIX_OF_STRING_IN_TABLE;

IS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether the search string "starts with or is
    equal to" any string in the table.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
</code></pre>

                <p>

                    All implementations discussed in this article adhere to that function signature.
                    The <a href="#STRING_TABLE">STRING_TABLE</a> structure will be discussed shortly.

                </p>

                <p>

                    The STRING_MATCH structure is used to optionally communicate information about
                    the prefix match back to the caller.  The index and characters matched fields
                    are often very useful when using the string table for text parsing; see the <a
                    href="#other-applications">other applications</a> section below for an example.

                </p>

                <p>

                    The structure is defined as follows:

                </p>

                <a class="xref" name="STRING_MATCH"></a>

<pre class="code content-string-match"><code class="language-c">
//
// This structure is used to communicate matches back to the caller.
//

typedef struct _STRING_MATCH {

    //
    // Index of the match.
    //

    BYTE Index;

    //
    // Number of characters matched.
    //

    BYTE NumberOfMatchedCharacters;

    //
    // Pad out to 8-bytes.
    //

    USHORT Padding[3];

    //
    // Pointer to the string that was matched.  The underlying buffer will
    // stay valid for as long as the STRING_TABLE struct persists.
    //

    PSTRING String;

} STRING_MATCH, *PSTRING_MATCH, **PPSTRING_MATCH;
C_ASSERT(sizeof(STRING_MATCH) == 16);
</code></pre>
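
                <p>

                    To illustrate why the index and matched-character count are handy for text
                    parsing, here is a minimal sketch in portable C.  It is not the article's
                    implementation: <code>struct match_info</code>, <code>match_prefix</code> and
                    <code>consume_keyword</code> are hypothetical names, and the lookup is a
                    plain linear scan standing in for the string table.

                </p>

<pre class="code"><code class="language-c">#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* Hypothetical, simplified stand-in for STRING_MATCH. */
struct match_info {
    unsigned char index;          /* Index of the matching table entry. */
    unsigned char chars_matched;  /* Number of characters matched.      */
    const char *str;              /* The table entry that matched.      */
};

/*
 * Return the index of the first entry in the NULL-terminated table that is
 * a prefix of input, or -1 if there is none.  Fills in *match when non-NULL.
 */
static int
match_prefix(const char *const *table, const char *input,
             struct match_info *match)
{
    int i;

    for (i = 0; table[i] != NULL; i++) {
        size_t len = strlen(table[i]);

        if (len &gt; 0 &amp;&amp; strncmp(input, table[i], len) == 0) {
            if (match != NULL) {
                match-&gt;index = (unsigned char)i;
                match-&gt;chars_matched = (unsigned char)len;
                match-&gt;str = table[i];
            }
            return i;
        }
    }

    return -1;
}

/*
 * Consume a leading keyword from input.  On a match, store the keyword's
 * index in *keyword and return a pointer to the text that follows it.
 * Returns NULL when no keyword matches.
 */
static const char *
consume_keyword(const char *const *table, const char *input, int *keyword)
{
    struct match_info match;

    if (match_prefix(table, input, &amp;match) &lt; 0) {
        return NULL;
    }

    *keyword = match.index;
    return input + match.chars_matched;
}
</code></pre>

                <p>

                    The index selects which handler to run and the character count tells the
                    parser how far to advance, so the caller never has to call
                    <code>strlen</code> on the keyword it just matched.

                </p>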

                <a class="xref" name="test-data"></a>
                <h2>The Test Data</h2>

                <p>

                    Instead of using some arbitrary Python module names, this article is going to
                    focus on a string table constructed out of a set of 16 strings that represent
                    reserved names of the NTFS file system, at least when it was first released
                    way back in the early 90s.

                </p>

                <p>

                    This list is desirable for several reasons: it has a good distribution of
                    characters; there is a good mix of short and long entries, plus one oversized
                    one ($INDEX_ALLOCATION, which clocks in at 17 characters); and almost all
                    strings lead with a common character (the dollar sign), preventing a simple
                    <em>first character</em> optimization used by <a href="https://github.com/tpn/tracer/blob/2018-04-18.1/StringTable/StringTable.h#L324">
                    the initial version of the StringTable component I wrote in 2016</a>.

                </p>
                <p>

                    So the scenario we'll be emulating, in this case, is that we've just been passed
                    a filename for creation, and we need to check if it prefix matches any of the
                    reserved names.

                </p>

                <p>

                    Here's the full list of NTFS names we'll be using.  We're assuming 8-bit ASCII
                    encoding (no UTF-8) and case-sensitive matching.  (If this were actually the NT
                    kernel, we'd need to use wide characters with UTF-16 encoding, and be
                    case-insensitive.)

                </p>

                <a class="xref" name="ntfs-reserved-names"></a>
                <h3>NTFS Reserved Names</h3>

                <p>
                    <ul>
                        <li>$AttrDef</li>
                        <li>$BadClus</li>
                        <li>$Bitmap</li>
                        <li>$Boot</li>
                        <li>$Extend</li>
                        <li>$LogFile</li>
                        <li>$MftMirr</li>
                        <li>$Mft</li>
                        <li>$Secure</li>
                        <li>$UpCase</li>
                        <li>$Volume</li>
                        <li>$Cairo</li>
                        <li>$INDEX_ALLOCATION</li>
                        <li>$DATA</li>
                        <li>????</li>
                        <li>.</li>
                    </ul>
                </p>

                <p>

                    The ordering is important in certain cases.  For example, when you have
                    overlapping strings, such as $MftMirr and $Mft, you should put the longest
                    strings first so that they are tested first.  Our routine terminates upon the
                    first successful prefix match, so a longer string that resided after a shorter
                    one sharing its prefix would never get detected.

                </p>
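
                <p>

                    The effect of ordering is easy to reproduce with a naive first-match scan.
                    The sketch below is only an illustration under that assumption;
                    <code>first_prefix_match</code> and the two small tables are hypothetical
                    and not part of the StringTable component.

                </p>

<pre class="code"><code class="language-c">#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* Index of the first entry in table that is a prefix of input, or -1. */
static int
first_prefix_match(const char *const *table, size_t count, const char *input)
{
    size_t i;

    for (i = 0; i &lt; count; i++) {
        if (strncmp(input, table[i], strlen(table[i])) == 0) {
            return (int)i;
        }
    }

    return -1;
}

/* Shorter string first: "$MftMirr" can never be reported. */
static const char *const shadowed[] = { "$Mft", "$MftMirr" };

/* Longer string first: both entries are reachable. */
static const char *const ordered[] = { "$MftMirr", "$Mft" };
</code></pre>

                <p>

                    Searching for $MftMirr in <code>shadowed</code> stops at index 0 ($Mft), so
                    the longer entry at index 1 is unreachable.  In <code>ordered</code>, the
                    same search reports $MftMirr at index 0, while a search for $Mft falls
                    through to index 1.

                </p>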

                <p>

                    Let's review some guiding design requirements and cover some of the design
                    decisions I made, which should help shape your understanding of the
                    implementation.

                </p>

                <a class="xref" name="requirements-and-design-decisions"></a>
                <h2>Requirements and Design Decisions</h2>

                <p>

                    The STRING struct will be used to capture incoming search strings as well as the
                    representation of any strings registered in the table (or more accurately, in
                    the corresponding StringArray structure associated with the string table).

                </p>
<pre class="code content-string-struct"><code class="language-c">
//
// The STRING structure used by the NT kernel.  Our STRING_ARRAY structure
// relies on an array of these structures.  We never pass raw 'char *'s
// around, only STRING/PSTRING structs/pointers.
//

typedef struct _STRING {
    USHORT Length;
    USHORT MaximumLength;
    ULONG  Padding;
    PCHAR Buffer;
} STRING, *PSTRING;
typedef const STRING *PCSTRING;
</code></pre>

                <p>

                    The design should optimize for string lengths less than or equal to 16.  Lengths
                    greater than 16 are permitted, up to 128 bytes, but they incur more overhead during
                    the prefix lookup.

                </p>

                <p>

                    The design should prioritize the fast-path code where there is no match for a
                    given search string.  Being able to terminate the search as early as possible is
                    ideal.

                </p>

                <p>

                    The performance hits taken by unaligned data access are non-negligible,
                    especially when dealing with XMM/YMM loads.  Pay special attention to alignment
                    constraints and make sure that everything under our control is aligned on a
                    suitable boundary.

                    (The only thing we can't really control in the real world is the alignment of
                    the incoming search string buffer, which will often be at undesirable alignments
                    like 2, 4, 6, etc.  Our test program explicitly aligns the incoming search
                    strings on 32-byte boundaries to avoid the penalties associated with unaligned
                    access.)

                </p>
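
                <p>

                    As a rough sketch of how a test harness might place a search string on a
                    32-byte boundary, the following uses C11's <code>aligned_alloc</code>.  The
                    helper names are hypothetical, and this is not necessarily how the
                    article's test program does it; on MSVC, which lacks
                    <code>aligned_alloc</code>, <code>_aligned_malloc</code> and
                    <code>_aligned_free</code> would be the equivalent pair.

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

#define SEARCH_ALIGNMENT 32

/*
 * Copy len bytes of src into a freshly allocated, zero-padded buffer that
 * is aligned on a 32-byte boundary.  Release the result with free().
 * Returns NULL on allocation failure.
 */
static char *
copy_to_aligned_buffer(const char *src, size_t len)
{
    /*
     * aligned_alloc() requires the size to be a multiple of the alignment.
     * Rounding (len + 32) down to a multiple of 32 always leaves at least
     * one trailing zero byte after the copied characters.
     */
    size_t size = (len + SEARCH_ALIGNMENT) &amp; ~(size_t)(SEARCH_ALIGNMENT - 1);
    char *buffer = aligned_alloc(SEARCH_ALIGNMENT, size);

    if (buffer == NULL) {
        return NULL;
    }

    memset(buffer, 0, size);
    memcpy(buffer, src, len);
    return buffer;
}

/* Non-zero if p sits on a 32-byte boundary. */
static int
is_search_aligned(const void *p)
{
    return ((uintptr_t)p &amp; (SEARCH_ALIGNMENT - 1)) == 0;
}
</code></pre>

                <p>

                    Zero-padding the tail means a full 16- or 32-byte load from the start of the
                    buffer never reads uninitialized memory, which keeps SIMD comparisons
                    deterministic.

                </p>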

                <p>

                    The string table is geared toward a single-shot build.  Once you've created it
                    with a given string array or used a delimited environment variable, that's it.
                    There are no AddString() or RemoveString() routines.  The order you provided the
                    strings in will be the same order the table uses &mdash; no re-ordering will be
                    done.  Thus, for prefix matching purposes, if two strings share a common prefix,
                    the longer one should go first, as the prefix search routine will check it first.

                </p>

                <p>

                    Only single matches are performed: the routine returns the first entry that
                    qualifies as a prefix match (the target string in the table has a length less
                    than or equal to that of the search string, and all of its characters match).
                    There is no support for obtaining multiple matches; if you've constructed your
                    string tables properly (no duplicate or incorrectly-ordered overlapping
                    strings), you shouldn't need it.

                </p>

                <p>
                    So, to summarise, the design guidelines are as follows.

                    <ul>

                        <li>

                            Prioritize fast-path exit in the non-matched case.  (I refer to this as
                            <strong>negative matching</strong> in a lot of places.)

                        </li>

                        <li>

                            Optimize for up to 16 string slots, where each slot ideally has up to
                            16 characters.  A string can have up to 128 characters in total;
                            however, any bytes beyond the first sixteen live in the string array
                            structure supporting the string table (accessible via pStringArray).

                        </li>

                        <li>

                            If a slot is longer than 16 characters, optimize for the assumption that
                            it won't be <em>that</em> much longer.  That is, assume a string of
                            length 18 bytes is more common than one of 120 bytes.

                        </li>

                    </ul>
               </p>

                <a class="xref" name="data-structures"></a>
                <h1>The Data Structures</h1>

                <p>

                    The primary data structure employed by this solution is the STRING_TABLE
                    structure.  It is composed of supporting structures: STRING_SLOT, SLOT_INDEX and
                    SLOT_LENGTHS, and either embeds or points to the originating STRING_ARRAY
                    structure from which it was created.

                </p>

                <p>

                    Let's review the STRING_TABLE <small>
                    <a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/StringTable.h#L194">
                    (view on GitHub)</a></small> structure first and then touch on the supporting
                    structures.

                </p>

                <a class="xref" name="STRING_TABLE"></a>
                <h2>STRING_TABLE</h2>

                <div class="tab-box language box-string-table">
                    <ul class="tabs">
                        <li data-content="content-string-table-cnf">C - Cutler Normal Form</li>
                        <li data-content="content-string-table-knf">C - Kernel Normal Form</li>
                        <li data-content="content-string-table-masm">MASM</li>
                    </ul>
                    <div class="content">
<pre class="code content-string-table-cnf"><code class="language-c">//
// The STRING_TABLE struct is an optimized structure for testing whether a
// prefix entry for a string is in a table, with the expectation that the
// strings being compared will be relatively short (ideally &lt;= 16 characters),
// and the table of string prefixes to compare to will be relatively small
// (ideally &lt;= 16 strings).
//
// The overall goal is to be able to prefix match a string with the lowest
// possible (amortized) latency.  Fixed-size, memory-aligned character arrays,
// and SIMD instructions are used to try and achieve this.
//

typedef struct _STRING_TABLE {

    //
    // A slot where each individual element contains a uniquely-identifying
    // letter, with respect to the other strings in the table, of each string
    // in an occupied slot.
    //

    STRING_SLOT UniqueChars;

    //
    // (16 bytes consumed.)
    //

    //
    // For each unique character identified above, the following structure
    // captures the 0-based index of that character in the underlying string.
    // This is used as an input to vpshufb to rearrange the search string's
    // characters such that it can be vpcmpeqb'd against the unique characters
    // above.
    //

    SLOT_INDEX UniqueIndex;

    //
    // (32 bytes consumed.)
    //

    //
    // Length of the underlying string in each slot.
    //

    SLOT_LENGTHS Lengths;

    //
    // (48 bytes consumed, aligned at 16 bytes.)
    //

    //
    // Pointer to the STRING_ARRAY associated with this table, which we own
    // (we create it and copy the caller's contents at creation time and
    // deallocate it when we get destroyed).
    //
    // N.B.  We use pStringArray here instead of StringArray because the
    //       latter is a field name at the end of the struct.
    //
    //

    PSTRING_ARRAY pStringArray;

    //
    // (56 bytes consumed, aligned at 8 bytes.)
    //

    //
    // String table flags.
    //

    STRING_TABLE_FLAGS Flags;

    //
    // (60 bytes consumed, aligned at 4 bytes.)
    //

    //
    // A 16-bit bitmap indicating which slots are occupied.
    //

    USHORT OccupiedBitmap;

    //
    // A 16-bit bitmap indicating which slots have strings longer than 16 chars.
    //

    USHORT ContinuationBitmap;

    //
    // (64 bytes consumed, aligned at 64 bytes.)
    //

    //
    // The 16-element array of STRING_SLOT structs.  We want this to be aligned
    // on a 64-byte boundary, and it consumes 256-bytes of memory.
    //

    STRING_SLOT Slots[16];

    //
    // (320 bytes consumed, aligned at 64 bytes.)
    //

    //
    // We want the structure size to be a power of 2 such that an even number
    // can fit into a 4KB page (and reducing the likelihood of crossing page
    // boundaries, which complicates SIMD boundary handling), so we have an
    // extra 192-bytes to play with here.  The CopyStringArray() routine is
    // special-cased to allocate the backing STRING_ARRAY structure plus the
    // accommodating buffers in this space if it can fit.
    //
    // (You can test whether or not this occurred by checking the invariant
    //  `StringTable-&gt;pStringArray == &amp;StringTable-&gt;StringArray`, if this
    //  is true, the array was allocated within this remaining padding space.)
    //

    union {
        STRING_ARRAY StringArray;
        CHAR Padding[192];
    };

} STRING_TABLE, *PSTRING_TABLE, **PPSTRING_TABLE;

//
// Assert critical size and alignment invariants at compile time.
//

C_ASSERT(FIELD_OFFSET(STRING_TABLE, UniqueIndex) == 16);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Lengths) == 32);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, pStringArray) == 48);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Slots)   == 64);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Padding) == 320);
C_ASSERT(sizeof(STRING_TABLE) == 512);

</code></pre>

<pre class="code content-string-table-knf"><code class="language-c">struct string_table {
    char                       unique_chars[16];
    unsigned char              unique_index[16];
    unsigned char              slot_lengths[16];
    struct string_array       *string_array_ptr;
    struct string_table_flags  flags;
    unsigned short             occupied_bitmap;
    unsigned short             continuation_bitmap;
    char                       slots[16][16];
    union {
        struct string_array    string_array;
        char                   padding[192];
    } u;
};

</code></pre>
<pre class="code content-string-table-masm"><code class="language-nasm">STRING_TABLE struct
    UniqueChars         CHAR 16 dup  (?)
    UniqueIndex         BYTE 16 dup  (?)
    Lengths             BYTE 16 dup  (?)
    pStringArray        PSTRING_ARRAY ?
    Flags               ULONG         ?
    OccupiedBitmap      USHORT        ?
    ContinuationBitmap  USHORT        ?
    Slots               STRING_SLOT 16 dup ({ })
    union
        StringArray STRING_ARRAY {?}
        Padding CHAR 192 dup (?)
    ends
STRING_TABLE ends

;
; Assert our critical field offsets and structure size as per the same approach
; taken in StringTable.h.
;

.erre (STRING_TABLE.UniqueIndex  eq  16), @CatStr(&lt;UnexpectedOffset STRING_TABLE.UniqueIndex: &gt;, %(STRING_TABLE.UniqueIndex))
.erre (STRING_TABLE.Lengths      eq  32), @CatStr(&lt;UnexpectedOffset STRING_TABLE.Lengths: &gt;, %(STRING_TABLE.Lengths))
.erre (STRING_TABLE.pStringArray eq  48), @CatStr(&lt;UnexpectedOffset STRING_TABLE.pStringArray: &gt;, %(STRING_TABLE.pStringArray))
.erre (STRING_TABLE.Slots        eq  64), @CatStr(&lt;UnexpectedOffset STRING_TABLE.Slots: &gt;, %(STRING_TABLE.Slots))
.erre (STRING_TABLE.Padding      eq 320), @CatStr(&lt;UnexpectedOffset STRING_TABLE.Padding: &gt;, %(STRING_TABLE.Padding))
.erre (size STRING_TABLE eq 512), @CatStr(&lt;IncorrectStructSize: STRING_TABLE: &gt;, %(size STRING_TABLE))

PSTRING_TABLE typedef ptr STRING_TABLE

;
; CamelCase typedefs that are nicer to work with in assembly
; than their uppercase counterparts.
;

StringTable typedef STRING_TABLE
</code></pre>

                    </div>
                </div>
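
                <p>

                    The two bitmap fields can be derived directly from the slot lengths.  The
                    sketch below shows one way to do that in portable C; it assumes raw,
                    uncapped string lengths as input, and <code>struct slot_bitmaps</code> and
                    <code>build_slot_bitmaps</code> are hypothetical names rather than the
                    routine's actual code.

                </p>

<pre class="code"><code class="language-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define SLOT_COUNT 16
#define SLOT_WIDTH 16

struct slot_bitmaps {
    uint16_t occupied;      /* Bit i set: slot i holds a string.           */
    uint16_t continuation;  /* Bit i set: slot i's string exceeds 16 chars. */
};

/*
 * Build both bitmaps from an array of string lengths, one per slot.
 * A length of zero denotes an empty slot.
 */
static struct slot_bitmaps
build_slot_bitmaps(const unsigned char *lengths, size_t count)
{
    struct slot_bitmaps bitmaps = { 0, 0 };
    size_t i;

    for (i = 0; i &lt; count &amp;&amp; i &lt; SLOT_COUNT; i++) {
        if (lengths[i] == 0) {
            continue;
        }

        bitmaps.occupied |= (uint16_t)(1u &lt;&lt; i);

        if (lengths[i] &gt; SLOT_WIDTH) {
            bitmaps.continuation |= (uint16_t)(1u &lt;&lt; i);
        }
    }

    return bitmaps;
}
</code></pre>

                <p>

                    For the sixteen NTFS names (lengths 8, 8, 7, 5, 7, 8, 8, 4, 7, 7, 7, 6, 17,
                    5, 4 and 1), this yields an occupied bitmap of 0xFFFF and a continuation
                    bitmap of 0x1000, because only $INDEX_ALLOCATION (slot 12) exceeds 16
                    characters.

                </p>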

                <p>

                    The following diagram depicts an in-memory representation of the STRING_TABLE
                    structure using our NTFS reserved prefix names.  It is created via the
                    <a href="#CreateStringTable">CreateStringTable</a> routine, which we feature
                    in the appendix of this article.

                </p>

                <a href="StringTable.svg" target="_blank">
                    <img class="svg-image" src="StringTable.svg"/>
                </a>

                <!--
                <picture>
                    <source srcset="StringTableLayout2.png"/>
                    <img width="1042px" height="675px" srcset="StringTableLayout2.png"/>
                    <source srcset="StringTableLayout2.png"/>
                    <img width="1641px" height="1020px" srcset="StringTableLayout2.png"/>
                </picture>
                -->

                <p>

                    In order to improve the uniqueness of the unique characters selected from each
                    string, the strings are sorted by length during string table creation and
                    enumerated in this order whilst identifying unique characters.  The rationale
                    behind this is that shorter strings simply have fewer characters to choose
                    from, whereas longer strings have more.  If we identified unique characters in
                    the order they appear in the string table, we may have longer strings preceding
                    shorter ones, such that toward the end of the table, nothing unique can be
                    extracted from the short ones.

                </p>

                <p>

                    The utility of the string table is maximised by ensuring a unique character is
                    selected from every string, thus, we sort by length first.  Note that the
                    uniqueness is actually determined by offset:character pairs, with the offsets
                    becoming the indices stored in the <em>UniqueIndex</em> slot.  If you trace
                    through the diagram above, you'll see that the unique character in each slot
                    matches the character in the corresponding string slot, indicated by the
                    underlying index.

                </p>
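
                <p>

                    The selection strategy described above can be sketched in a few lines of
                    portable C.  This is an approximation under stated assumptions, not the
                    actual construction routine: <code>struct unique_choice</code> and
                    <code>select_unique_chars</code> are hypothetical, a byte array stands in
                    for the per-offset character bitmaps, and only the first 16 characters of
                    each string are considered.

                </p>

<pre class="code"><code class="language-c">#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

#define SLOT_COUNT 16
#define SLOT_WIDTH 16

struct unique_choice {
    unsigned char index;  /* Offset of the chosen character in the string. */
    char ch;              /* The chosen character.                         */
    int found;            /* Non-zero if a unique pair was available.      */
};

/*
 * For each string, pick the first offset:character pair that no previously
 * processed string has claimed.  Strings are processed shortest-first, but
 * choices[] is indexed by each string's original position in the table.
 */
static void
select_unique_chars(const char *const *strings, size_t count,
                    struct unique_choice *choices)
{
    unsigned char used[SLOT_WIDTH][256] = { { 0 } };
    size_t order[SLOT_COUNT];
    size_t i, j;

    if (count &gt; SLOT_COUNT) {
        count = SLOT_COUNT;
    }

    /* Stable insertion sort of the table indices by ascending length. */
    for (i = 0; i &lt; count; i++) {
        size_t len = strlen(strings[i]);

        j = i;
        while (j &gt; 0 &amp;&amp; strlen(strings[order[j - 1]]) &gt; len) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = i;
    }

    for (i = 0; i &lt; count; i++) {
        const char *s = strings[order[i]];
        struct unique_choice *choice = &amp;choices[order[i]];
        size_t len = strlen(s);

        choice-&gt;index = 0;
        choice-&gt;ch = 0;
        choice-&gt;found = 0;

        if (len &gt; SLOT_WIDTH) {
            len = SLOT_WIDTH;
        }

        for (j = 0; j &lt; len; j++) {
            unsigned char c = (unsigned char)s[j];

            if (!used[j][c]) {
                used[j][c] = 1;
                choice-&gt;index = (unsigned char)j;
                choice-&gt;ch = s[j];
                choice-&gt;found = 1;
                break;
            }
        }
    }
}
</code></pre>

                <p>

                    With the two-entry table { "ab", "a" }, processing in table order would let
                    "ab" claim the pair 0:'a' and leave nothing for "a".  Processing
                    shortest-first gives "a" the pair 0:'a' and "ab" the pair 1:'b', so both
                    slots end up with a unique character.

                </p>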



                <a class="xref" name="supporting-structures"></a>
                <h2>Supporting Structures</h2>

                <p>

                    The string array captures a raw array representation of the underlying strings
                    making up the string table.  It is either embedded within the padding area at
                    the end of the string table, or a separate allocation is made during string
                    table creation.  The main interface to creating a string table is via a
                    STRING_ARRAY structure.  The helper functions,
                    <a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/CreateStringTable.c#L471">
                    CreateStringTableFromDelimitedString</a> and
                    <a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/CreateStringTable.c#L595">
                    CreateStringTableFromDelimitedEnvironmentVariable</a>, simply break down their
                    input into a STRING_ARRAY representation first before calling
                    <a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/CreateStringTable.c#L51">
                    CreateStringTable</a>.

                </p>

                <a class="xref" name="STRING_ARRAY"></a>
                <h3>STRING_ARRAY</h3>

<pre class="code content-string-array"><code class="language-c">typedef struct _Struct_size_bytes_(SizeInQuadwords &lt;&lt; 3) _STRING_ARRAY {

    //
    // Size of the structure, in quadwords.  Why quadwords?  It allows us to
    // keep this size field to a USHORT, which helps with the rest of the
    // alignment in this struct (we want the STRING Strings[] array to start
    // on an 8-byte boundary).
    //
    // N.B.  We can't express the exact field size in the SAL annotation
    //       below, because the array of buffer sizes are inexpressible;
    //       however, we know the maximum length, so we can use the implicit
    //       invariant that the total buffer size can't exceed whatever num
    //       elements * max size is.
    //

    _Field_range_(&lt;=, (
        sizeof(struct _STRING_ARRAY) +
        ((NumberOfElements - 1) * sizeof(STRING)) +
        (MaximumLength * NumberOfElements)
    ) &gt;&gt; 3)
    USHORT SizeInQuadwords;

    //
    // Number of elements in the array.
    //

    USHORT NumberOfElements;

    //
    // Minimum and maximum lengths for the String-&gt;Length fields.  Optional.
    //

    USHORT MinimumLength;
    USHORT MaximumLength;

    //
    // A pointer to the STRING_TABLE structure that "owns" us.
    //

    struct _STRING_TABLE *StringTable;

    //
    // The string array.  Number of elements in the array is governed by the
    // NumberOfElements field above.
    //

    STRING Strings[ANYSIZE_ARRAY];

} STRING_ARRAY, *PSTRING_ARRAY, **PPSTRING_ARRAY;
</code></pre>

                <small>

                    <a class="xref" name="SAL"></a>
                    <p>

                        Note: the odd-looking macros <a
                        href="https://github.com/tpn/winsdk-10/blob/master/Include/10.0.16299.0/shared/sal.h#L597">
                        _Struct_size_bytes_</a> and
                        <a
                        href="https://github.com/tpn/winsdk-10/blob/master/Include/10.0.16299.0/shared/sal.h#L615">
                        _Field_range_</a> are
                        <a
                        href="https://docs.microsoft.com/en-us/visualstudio/code-quality/annotating-structs-and-classes">
                        SAL Annotations</a>.  There's a neat deck called
                        <a
                        href="https://github.com/tpn/pdfs/blob/master/Program%20Analysis%20with%20PREfast%20and%20SAL%20-%20Erik%20Poll%20-%20Slides%20(3_StaticAnalysisPREfast).pdf"
                        >Engineering Better Software at Microsoft</a> which captures some interesting
                        details about SAL, for those wanting to read more.  The Code Analysis engine
                        that uses the annotations is built upon the <a
                        href="https://github.com/Z3Prover/z3">Z3 Theorem Prover</a>, which is a
                        fascinating little project in its own right.

                    </p>

                </small>

                <p>

                    And finally, we're left with the smaller helper structs that we use to
                    encapsulate the various innards of the string table.  (I use unions that
                    feature XMMWORD representations (which is a typedef of __m128i, representing
                    an XMM register) as well as underlying byte/character representations as I
                    personally find it makes the resulting C code a bit nicer.)

                </p>

                <a class="xref" name="STRING_SLOT"></a>
                <h3>STRING_SLOT</h3>

<pre class="code content-string-slot"><code class="language-c">//
// String tables are composed of a 16 element array of 16 byte string "slots",
// which represent a unique character (with respect to other strings in the
// table) for a string in a given slot index.  The STRING_SLOT structure
// provides a convenient wrapper around this construct.
//

typedef union DECLSPEC_ALIGN(16) _STRING_SLOT {
    XMMWORD CharsXmm;
    CHAR Char[16];
} STRING_SLOT, *PSTRING_SLOT, **PPSTRING_SLOT;
C_ASSERT(sizeof(STRING_SLOT) == 16);
</code></pre>

                <a class="xref" name="SLOT_INDEX"></a>
                <h3>SLOT_INDEX</h3>
<pre class="code content-slot-index"><code class="language-c">//
// An array of 1 byte unsigned integers used to indicate the 0-based index of
// a given unique character in the corresponding string.
//

typedef union DECLSPEC_ALIGN(16) _SLOT_INDEX {
    XMMWORD IndexXmm;
    BYTE Index[16];
} SLOT_INDEX, *PSLOT_INDEX, **PPSLOT_INDEX;
C_ASSERT(sizeof(SLOT_INDEX) == 16);
</code></pre>

                <a class="xref" name="SLOT_LENGTHS"></a>
                <h3>SLOT_LENGTHS</h3>
<pre class="code content-slot-lengths"><code class="language-c">//
// A 16 element array of 1 byte unsigned integers, used to capture the length
// of each string slot in a single XMM 128-bit register.
//

typedef union DECLSPEC_ALIGN(16) _SLOT_LENGTHS {
    XMMWORD SlotsXmm;
    BYTE Slots[16];
} SLOT_LENGTHS, *PSLOT_LENGTHS, **PPSLOT_LENGTHS;
C_ASSERT(sizeof(SLOT_LENGTHS) == 16);
</code></pre>
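
                <p>

                    To make the role of these three 16-byte vectors concrete, here is a scalar
                    emulation of the shuffle-and-compare step that the STRING_TABLE comments
                    describe (vpshufb followed by vpcmpeqb).  It is a sketch, not the article's
                    SIMD code: <code>struct table_vectors</code> and
                    <code>candidate_slots</code> are hypothetical, and the length filter is an
                    assumption drawn from the prefix-match definition given earlier.

                </p>

<pre class="code"><code class="language-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define SLOT_COUNT 16

/* Hypothetical flattened view of the first fields of the string table. */
struct table_vectors {
    char unique_chars[SLOT_COUNT];
    unsigned char unique_index[SLOT_COUNT];
    unsigned char slot_lengths[SLOT_COUNT];
    uint16_t occupied_bitmap;
};

/*
 * Return a 16-bit bitmap of slots that might prefix-match the search string.
 * Bit i is set when slot i is occupied, is no longer than the search string,
 * and the search string's character at the slot's unique index equals the
 * slot's unique character.  A set bit is only a candidate; the slot's full
 * contents still have to be compared to confirm the match.
 */
static uint16_t
candidate_slots(const struct table_vectors *table,
                const char *search, size_t search_length)
{
    uint16_t candidates = 0;
    size_t i;

    for (i = 0; i &lt; SLOT_COUNT; i++) {
        unsigned char offset = table-&gt;unique_index[i];

        /* A slot longer than the search string cannot be a prefix of it. */
        if (table-&gt;slot_lengths[i] &gt; search_length) {
            continue;
        }

        /* Never read past the end of the search string. */
        if (offset &gt;= search_length) {
            continue;
        }

        if (search[offset] == table-&gt;unique_chars[i]) {
            candidates |= (uint16_t)(1u &lt;&lt; i);
        }
    }

    return candidates &amp; table-&gt;occupied_bitmap;
}
</code></pre>

                <p>

                    An empty result means no slot can possibly match, which is the early exit
                    the design guidelines prioritize.  The SIMD version would compute all
                    sixteen lanes at once instead of looping.

                </p>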

                <a class="xref" name="CreateStringTable"></a>
                <h2>String Table Construction</h2>

                <p>

                    The <a
                    href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2/CreateStringTable.c#L147">
                    CreateSingleStringTable</a> routine is responsible for construction of a new
                    STRING_TABLE.  It is here we identify the unique set of characters (and their
                    indices) to store in the first two fields of the string table.

                </p>

                <div class="tab-box language box-create">
                    <ul class="tabs">
                        <li data-content="content-create-string-table">CreateSingleStringTable</li>
                    </ul>
                    <div class="content">
<pre class="code content-create-string-table"><code class="language-c">//
// Define private types used by this module.
//

typedef struct _LENGTH_INDEX_ENTRY {
    BYTE Length;
    BYTE Index;
} LENGTH_INDEX_ENTRY;
typedef LENGTH_INDEX_ENTRY *PLENGTH_INDEX_ENTRY;

typedef struct _LENGTH_INDEX_TABLE {
    LENGTH_INDEX_ENTRY Entry[16];
} LENGTH_INDEX_TABLE;
typedef LENGTH_INDEX_TABLE *PLENGTH_INDEX_TABLE;

typedef union DECLSPEC_ALIGN(32) _CHARACTER_BITMAP {
    YMMWORD Ymm;
    XMMWORD Xmm[2];
    LONG Bits[(256 / (4 &lt;&lt; 3))];  // 8
} CHARACTER_BITMAP;
C_ASSERT(sizeof(CHARACTER_BITMAP) == 32);
typedef CHARACTER_BITMAP *PCHARACTER_BITMAP;

typedef struct _SLOT_BITMAPS {
    CHARACTER_BITMAP Bitmap[16];
} SLOT_BITMAPS;
typedef SLOT_BITMAPS *PSLOT_BITMAPS;

//
// Function implementation.
//

_Use_decl_annotations_
PSTRING_TABLE
CreateSingleStringTable(
    PRTL Rtl,
    PALLOCATOR StringTableAllocator,
    PALLOCATOR StringArrayAllocator,
    PSTRING_ARRAY SourceStringArray,
    BOOL CopyArray
    )
/*++

Routine Description:

    Allocates space for a STRING_TABLE structure using the provided allocators,
    then initializes it using the provided STRING_ARRAY.  If CopyArray is set
    to TRUE, the routine will copy the string array such that the caller is
    free to destroy it after the table has been successfully created.  If it
    is set to FALSE and SourceStringArray-&gt;StringTable has a non-NULL value, it is
    assumed that sufficient space has already been allocated for the string
    table and this pointer will be used to initialize the rest of the structure.

    DestroyStringTable() must be called against the returned PSTRING_TABLE when
    the structure is no longer needed in order to ensure resources are released.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    StringTableAllocator - Supplies a pointer to an ALLOCATOR structure which
        will be used for creating the STRING_TABLE.

    StringArrayAllocator - Supplies a pointer to an ALLOCATOR structure which
        may be used to create the STRING_ARRAY if it cannot fit within the
        padding of the STRING_TABLE structure.  This is kept separate from the
        StringTableAllocator due to the stringent alignment requirements of the
        string table.

    SourceStringArray - Supplies a pointer to an initialized STRING_ARRAY
        structure that contains the STRING structures that are to be added to
        the table.

    CopyArray - Supplies a boolean value indicating whether or not the
        StringArray structure should be deep-copied during creation.  This is
        typically set when the caller wants to be able to free the structure
        as soon as this call returns (or can't guarantee it will persist past
        this function's invocation, i.e. if it was stack allocated).

Return Value:

    A pointer to a valid PSTRING_TABLE structure on success, NULL on failure.
    Call DestroyStringTable() on the returned structure when it is no longer
    needed in order to ensure resources are cleaned up appropriately.

--*/
{
    BYTE Byte;
    BYTE Count;
    BYTE Index;
    BYTE Length;
    BYTE NumberOfElements;
    ULONG HighestBit;
    ULONG OccupiedMask;
    PULONG Bits;
    USHORT OccupiedBitmap;
    USHORT ContinuationBitmap;
    PSTRING_TABLE StringTable;
    PSTRING_ARRAY StringArray;
    PSTRING String;
    PSTRING_SLOT Slot;
    STRING_SLOT UniqueChars;
    SLOT_INDEX UniqueIndex;
    SLOT_INDEX LengthIndex;
    SLOT_LENGTHS Lengths;
    LENGTH_INDEX_TABLE LengthIndexTable;
    PCHARACTER_BITMAP Bitmap;
    SLOT_BITMAPS SlotBitmaps;
    PLENGTH_INDEX_ENTRY Entry;

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(StringTableAllocator)) {
        return NULL;
    }

    if (!ARGUMENT_PRESENT(StringArrayAllocator)) {
        return NULL;
    }

    if (!ARGUMENT_PRESENT(SourceStringArray)) {
        return NULL;
    }

    if (SourceStringArray-&gt;NumberOfElements == 0) {
        return NULL;
    }

    //
    // Copy the incoming string array if applicable.
    //

    if (CopyArray) {

        StringArray = CopyStringArray(
            StringTableAllocator,
            StringArrayAllocator,
            SourceStringArray,
            FIELD_OFFSET(STRING_TABLE, StringArray),
            sizeof(STRING_TABLE),
            &amp;StringTable
        );

        if (!StringArray) {
            return NULL;
        }

    } else {

        //
        // We're not copying the array, so initialize StringArray to point at
        // the caller's SourceStringArray, and StringTable to point at the
        // array's StringTable field (which will be non-NULL if sufficient
        // space has been allocated).
        //

        StringArray = SourceStringArray;
        StringTable = StringArray-&gt;StringTable;

    }

    //
    // If StringTable has no value, we've either been called with CopyArray set
    // to FALSE, or CopyStringArray() wasn't able to allocate sufficient space
    // for both the table and itself.  Either way, we need to allocate space for
    // the table.
    //

    if (!StringTable) {

        StringTable = (PSTRING_TABLE)(
            StringTableAllocator-&gt;AlignedCalloc(
                StringTableAllocator-&gt;Context,
                1,
                sizeof(STRING_TABLE),
                STRING_TABLE_ALIGNMENT
            )
        );

        if (!StringTable) {
            return NULL;
        }
    }

    //
    // Make sure the fields that are sensitive to alignment are, in fact,
    // aligned correctly.
    //

    if (!AssertStringTableFieldAlignment(StringTable)) {
        DestroyStringTable(StringTableAllocator,
                           StringArrayAllocator,
                           StringTable);
        return NULL;
    }

    //
    // At this point, we have copied the incoming StringArray if necessary,
    // and we've allocated sufficient space for the StringTable structure.
    // Enumerate over all of the strings, set the continuation bit if the
    // length &gt; 16, set the relevant slot length, set the relevant unique
    // character entry, then move the first 16-bytes of the string into the
    // relevant slot via an aligned SSE mov.
    //

    //
    // Initialize pointers and counters, clear stack-based structures.
    //

    Slot = StringTable-&gt;Slots;
    String = StringArray-&gt;Strings;

    OccupiedBitmap = 0;
    ContinuationBitmap = 0;
    NumberOfElements = (BYTE)StringArray-&gt;NumberOfElements;
    UniqueChars.CharsXmm = _mm_setzero_si128();
    UniqueIndex.IndexXmm = _mm_setzero_si128();
    LengthIndex.IndexXmm = _mm_setzero_si128();

    //
    // Set all the slot lengths to 0x7f up front instead of defaulting
    // to zero.  This allows for simpler logic when searching for a prefix
    // string, which involves broadcasting a search string's length to an XMM
    // register, then doing _mm_cmpgt_epi8() against the lengths array and
    // the string length.  If we left the lengths as 0 for unused slots, they
    // would get included in the resulting comparison register (i.e. the high
    // bits would be set to 1), so we'd have to do a subsequent masking of
    // the result at some point using the OccupiedBitmap.  By defaulting the
    // lengths to 0x7f, we ensure they'll never get included in any cmpgt-type
    // SIMD matches.  (We use 0x7f instead of 0xff because the _mm_cmpgt_epi8()
    // intrinsic assumes packed signed integers.)
    //

    Lengths.SlotsXmm = _mm_set1_epi8(0x7f);

    ZeroStruct(LengthIndexTable);
    ZeroStruct(SlotBitmaps);

    for (Count = 0; Count &lt; NumberOfElements; Count++) {

        XMMWORD CharsXmm;

        //
        // Set the string length for the slot.
        //

        Length = Lengths.Slots[Count] = (BYTE)String-&gt;Length;

        //
        // Set the appropriate bit in the continuation bitmap if the string is
        // longer than 16 bytes.
        //

        if (Length &gt; 16) {
            ContinuationBitmap |= (USHORT)(1 &lt;&lt; Count);
        }

        if (Count == 0) {

            Entry = &amp;LengthIndexTable.Entry[0];
            Entry-&gt;Index = 0;
            Entry-&gt;Length = Length;

        } else {

            //
            // Perform a linear scan of the length-index table in order to
            // identify an appropriate insertion point.
            //

            for (Index = 0; Index &lt; Count; Index++) {
                if (Length &lt; LengthIndexTable.Entry[Index].Length) {
                    break;
                }
            }

            if (Index != Count) {

                //
                // New entry doesn't go at the end of the table, so shuffle
                // everything else down.
                //

                Rtl-&gt;RtlMoveMemory(&amp;LengthIndexTable.Entry[Index + 1],
                                   &amp;LengthIndexTable.Entry[Index],
                                   (Count - Index) * sizeof(*Entry));
            }

            Entry = &amp;LengthIndexTable.Entry[Index];
            Entry-&gt;Index = Count;
            Entry-&gt;Length = Length;
        }

        //
        // Copy the first 16-bytes of the string into the relevant slot.  We
        // have taken care to ensure everything is 16-byte aligned by this
        // stage, so we can use SSE intrinsics here.
        //

        CharsXmm = _mm_load_si128((PXMMWORD)String-&gt;Buffer);
        _mm_store_si128(&amp;(*Slot).CharsXmm, CharsXmm);

        //
        // Advance our pointers.
        //

        ++Slot;
        ++String;

    }

    //
    // Store the slot lengths.
    //

    _mm_store_si128(&amp;(StringTable-&gt;Lengths.SlotsXmm), Lengths.SlotsXmm);

    //
    // Loop through the strings in order of shortest to longest and construct
    // the uniquely-identifying character table with corresponding index.
    //


    for (Count = 0; Count &lt; NumberOfElements; Count++) {
        Entry = &amp;LengthIndexTable.Entry[Count];
        Length = Entry-&gt;Length;
        Slot = &amp;StringTable-&gt;Slots[Entry-&gt;Index];

        //
        // Iterate over each character in the slot and find the first one
        // without a corresponding bit set.
        //

        for (Index = 0; Index &lt; Length; Index++) {
            Bitmap = &amp;SlotBitmaps.Bitmap[Index];
            Bits = (PULONG)&amp;Bitmap-&gt;Bits[0];
            Byte = Slot-&gt;Char[Index];
            if (!BitTestAndSet(Bits, Byte)) {
                break;
            }
        }

        UniqueChars.Char[Count] = Byte;
        UniqueIndex.Index[Count] = Index;
        LengthIndex.Index[Count] = Entry-&gt;Index;
    }

    //
    // Loop through the elements again such that the unique chars are stored
    // in the order they appear in the table.
    //

    for (Count = 0; Count &lt; NumberOfElements; Count++) {
        for (Index = 0; Index &lt; NumberOfElements; Index++) {
            if (LengthIndex.Index[Index] == Count) {
                StringTable-&gt;UniqueChars.Char[Count] = UniqueChars.Char[Index];
                StringTable-&gt;UniqueIndex.Index[Count] = UniqueIndex.Index[Index];
                break;
            }
        }
    }

    //
    // Generate and store the occupied bitmap.  Each bit, from low to high,
    // corresponds to the index of a slot.  When set, the slot is occupied.
    // When clear, it is not.  So, fill bits from the highest bit set down.
    //

    HighestBit = (1 &lt;&lt; (StringArray-&gt;NumberOfElements-1));
    OccupiedMask = _blsmsk_u32(HighestBit);
    StringTable-&gt;OccupiedBitmap = (USHORT)OccupiedMask;

    //
    // Store the continuation bitmap.
    //

    StringTable-&gt;ContinuationBitmap = (USHORT)(ContinuationBitmap);

    //
    // Wire up the string array to the table.
    //

    StringTable-&gt;pStringArray = StringArray;

    //
    // And we're done, return the table.
    //

    return StringTable;
}
</code></pre>
                    </div>
                </div>
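
                <p>

                    The unique character selection in the second loop of the routine above is
                    easier to follow in isolation.  What follows is a minimal, portable sketch of
                    that one step, not the routine itself.  The names
                    <code>pick_unique_chars</code>, <code>char_bitmap</code> and
                    <code>test_and_set</code> are hypothetical, plain NUL-terminated strings stand
                    in for the <code>STRING</code> structures, and the fallback used when a string
                    has no unclaimed character (its last examined byte) is an assumption of the
                    sketch.

                </p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SLOTS 16    /* slots in the table; also bytes examined per string */

/* One 256-bit "already claimed" bitmap per character position. */
typedef struct {
    uint32_t bits[256 / 32];
} char_bitmap;

/* Set the bit for c and report whether it was already set. */
static int test_and_set(char_bitmap *bitmap, uint8_t c)
{
    uint32_t mask = (uint32_t)1 << (c % 32);
    int was_set = (bitmap->bits[c / 32] & mask) != 0;

    bitmap->bits[c / 32] |= mask;
    return was_set;
}

/*
 * Visit the strings shortest-first.  For each one, walk its first 16 bytes
 * and claim the first (position, character) pair that no earlier string has
 * claimed.  Results are written in slot order, i.e. indexed like `strings`.
 */
void pick_unique_chars(const char *const *strings, size_t count,
                       uint8_t *unique_char, uint8_t *unique_index)
{
    char_bitmap seen[MAX_SLOTS];
    size_t order[MAX_SLOTS];
    size_t i, j;

    assert(count > 0 && count <= MAX_SLOTS);
    memset(seen, 0, sizeof(seen));

    /* Stable insertion sort of slot numbers by string length. */
    for (i = 0; i < count; i++) {
        size_t len = strlen(strings[i]);

        for (j = i; j > 0 && strlen(strings[order[j - 1]]) > len; j--) {
            order[j] = order[j - 1];
        }
        order[j] = i;
    }

    for (i = 0; i < count; i++) {
        const char *s = strings[order[i]];
        size_t len = strlen(s);
        size_t limit = len < MAX_SLOTS ? len : MAX_SLOTS;
        size_t pos = 0;
        uint8_t c = 0;

        while (pos < limit) {
            c = (uint8_t)s[pos];
            if (!test_and_set(&seen[pos], c)) {
                break;          /* first unclaimed character at this position */
            }
            pos++;
        }

        if (pos == limit && limit > 0) {
            pos = limit - 1;    /* nothing unique: fall back to the last byte */
        }

        unique_char[order[i]] = c;
        unique_index[order[i]] = (uint8_t)pos;
    }
}
```

                <p>

                    With <code>$AttrDef</code>, <code>$BadClus</code>, <code>$Bitmap</code> and
                    <code>$Boot</code> in slots 0 through 3, the sketch visits <code>$Boot</code>
                    first and gives it <code>'$'</code> at position 0.  <code>$Bitmap</code> then
                    gets <code>'B'</code> at position 1, <code>$AttrDef</code> gets
                    <code>'A'</code> at position 1, and <code>$BadClus</code>, finding both
                    <code>'$'</code> and <code>'B'</code> taken, settles on <code>'a'</code> at
                    position 2.

                </p>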

                <a class="xref" name="benchmark"></a>
                <h1>The Benchmark</h1>

                <p>

                    The performance comparison graphs in the subsequent sections were generated in
                    Excel, using CSV data output by the creatively-named program
                    <a href="https://github.com/tpn/tracer/blob/2018-04-18.2/StringTable2BenchmarkExe/main.c#L227">
                    StringTable2BenchmarkExe</a>.

                </p>

                <p>

                    Modern CPUs are fast and timing is hard, especially when you're comparing
                    results at the granularity of CPU cycles.  No approach is perfect.  Here's
                    what I settled on:

                    <ul>

                        <li>

                            The benchmark utility has <code>#pragma optimize("", off)</code> at the
                            start of the file, which disables optimizations for the code that
                            follows, even in release (optimized) builds.  This prevents the compiler
                            from doing clever things with regard to scheduling of the timestamping
                            logic, which would affect the reported times.

                        </li>

                        <li>

                            The benchmark utility pins itself to a single core and sets its thread
                            priority to the highest permissible value at startup.  (Turbo is
                            disabled on the computer, such that the frequency is pinned to 3.68GHz.)

                        </li>

                        <li>

                            The benchmark utility is fed an array of function pointers and test
                            inputs.  It iterates over each test input, and then iterates over
                            each function, calling it with the test input and potentially verifying
                            the result (some functions are included for comparison purposes but
                            don't actually produce correct results, and thus, do not have their
                            results verified).

                        </li>

                        <li>

                            The test input string is copied into a local buffer that is aligned on a
                            32 byte boundary.  This ensures that all test inputs are being compared
                            fairly.  (The natural alignment of the buffers varies anywhere from 2 to
                            512 bytes, and unaligned buffers have a significant impact on the
                            timings.)

                        </li>

                        <li>

                            The function is run once, with the result captured.  If verification has
                            been requested, the result is verified.  We <code>__debugbreak()</code>
                            immediately if there's a mismatch, which is handy during development.

                        </li>

                        <li>

                            <code>NtDelayExecution(TRUE, 1)</code> is called, which results in a sleep of
                            approximately 100 nanoseconds.  This is done to force a context switch,
                            such that the thread gets a new scheduling quantum before each function
                            is run.

                        </li>

                        <li>

                            The function is executed 100 times for warmup.

                        </li>

                        <li>

                            Timings are taken for 1000 iterations of the function using the given
                            test input.  The <code>__rdtscp()</code> intrinsic is used (which forces
                            some serialization) to capture the timestamp counter before and after the
                            iterations.

                        </li>

                        <li>

                            This process is repeated 100 times.  The minimum time observed to
                            perform 1000 iterations (out of 100 attempts) is captured as the
                            function's best time.

                        </li>

                    </ul>

                </p>
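
                <p>

                    The warmup, batching and best-of steps above can be sketched as a small
                    harness.  This is an illustrative outline under stated assumptions, not the
                    benchmark program itself: <code>best_batch_time</code>,
                    <code>clock_ticks</code> and <code>count_call</code> are hypothetical names,
                    the tick source is passed in as a parameter so the sketch isn't tied to x86,
                    and the portable <code>clock()</code> stand-in is far coarser than
                    <code>__rdtscp()</code>.  Core pinning, thread priority, input alignment and
                    the <code>NtDelayExecution()</code> call are all omitted.

                </p>

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef void (*bench_fn)(void *context);
typedef uint64_t (*tick_fn)(void);

/* Portable stand-in for a cycle counter; much coarser than __rdtscp(). */
uint64_t clock_ticks(void)
{
    return (uint64_t)clock();
}

/* Trivial workload that just counts how often the harness calls it. */
void count_call(void *context)
{
    ++*(unsigned long *)context;
}

/*
 * Warm up, then time `attempts` batches of `iterations` calls each and
 * return the smallest elapsed tick count observed for one whole batch.
 */
uint64_t best_batch_time(bench_fn fn, void *context, tick_fn now,
                         unsigned warmup, unsigned attempts,
                         unsigned iterations)
{
    uint64_t best = UINT64_MAX;
    unsigned attempt, i;

    assert(fn != NULL && now != NULL && attempts > 0);

    for (i = 0; i < warmup; i++) {
        fn(context);
    }

    for (attempt = 0; attempt < attempts; attempt++) {
        uint64_t start = now();
        uint64_t elapsed;

        for (i = 0; i < iterations; i++) {
            fn(context);
        }

        elapsed = now() - start;
        if (elapsed < best) {
            best = elapsed;
        }
    }

    return best;
}
```

                <p>

                    Calling <code>best_batch_time(count_call, &amp;calls, clock_ticks, 100, 100,
                    1000)</code> mirrors the counts described above: 100 warmup calls followed by
                    100 timed batches of 1000 calls.  Taking the minimum rather than the mean is
                    a design choice that discards batches disturbed by interrupts or scheduling.

                </p>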

                <h4>Release vs PGO Oddities</h4>

                <p>

                    All of the times in the graphs come from the profile-guided optimization build
                    of the StringTable component.  The PGO build is faster than the normal release
                    build in every case, except one, where it is notably slower.

                </p>

                <p>

                    It's... odd.  I haven't investigated it.  The following graph shows the
                    affected function, IsPrefixOfStringInTable_1, alongside a few other versions
                    for reference, comparing the performance of the PGO build to the release build
                    on the input strings "$INDEX_ALLOCATION" and "$Bai123456789012".

                </p>

                <a href="Benchmark-Release-vs-PGO-v3.svg" target="_blank">
                    <img class="svg-image" src="Benchmark-Release-vs-PGO-v3.svg"/>
                </a>

                <p>

                    Only that function is affected, and the problem really only manifests on
                    the two example test strings depicted.  As this routine essentially serves
                    as one of the initial baseline implementations, it would be misleading to
                    compare all of our optimized PGO versions to the abnormally-slow baseline
                    implementation.  So, the release and PGO timings were blended together into
                    a single CSV, and the Excel PivotTables pick whatever the minimum time is
                    for a given function and test input.

                </p>

                <p>

                    Thus, you're always looking at the PGO timings, except for this outlier case
                    where the release versions are faster.

                </p>


                <a class="xref" name="implementations"></a>
                <h1>The Implementations</h1>

                <a class="xref" name="round1"></a>
                <h2>Round 1</h2>

                <p>

                    In this section, we'll take a look at the various implementations I experimented
                    with on the first pass, prior to soliciting any feedback.  I figured there were
                    a couple of ways I could present this information.  First, I could hand-pick
                    what I choose to show and hide, such that a nice rosy picture is presented that
                    makes it seem like I arrived at the fastest implementation with barely any
                    effort whatsoever.

                </p>

                <p>

                    Or I could show the gritty reality of how everything <strong>actually</strong> went
                    down in a chronological fashion, errors and all.  And there were definitely some
                    errors!  For better or for worse, I've chosen to go down this route, so you'll
                    get to enjoy some pretty tedious tweaks (changing a single line, for example)
                    before the juicy stuff really kicks in.

                </p>

                <p>

                    Additionally, I have the benefit of writing this little section introduction
                    retroactively, so I can tell you now that iterations 4 and 5 aren't testing
                    what I initially thought they were testing.  I've left them in as is; if
                    anything, it demonstrates the
                    importance of only changing one thing at a time, and making sure you're testing
                    what you think you're testing.  I'll discuss the errors with those iterations
                    later in the article.

                </p>

                <a class="xref" name="IsPrefixOfCStrInArray"></a>
                <h2>IsPrefixOfCStrInArray</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_1">IsPrefixOfStringInTable_1  <i class="fa fa-arrow-right"></i></a>
                </small>
                <p>

                    Let's review the baseline implementation again, as that's what we're ultimately
                    comparing ourselves against.  This version enumerates the string array (and thus
                    has a slightly different function signature to the STRING_TABLE-based functions)
                    looking for prefix matches.  No SIMD instructions are used.  The timings
                    captured should be proportional to the location of the test input string in the
                    array.  That is, it should take less time to prefix match strings that occur
                    earlier in the array versus those that appear later.

                </p>
<pre class="code"><code class="language-c">
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfCStrInArray(
    PCSZ *StringArray,
    PCSZ String,
    PSTRING_MATCH Match
    )
{
    PCSZ Left;
    PCSZ Right;
    PCSZ *Target;
    ULONG Index = 0;
    ULONG Count;

    for (Target = StringArray; *Target != NULL; Target++, Index++) {
        Count = 0;
        Left = String;
        Right = *Target;

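        //
        // Only advance past characters that actually matched, so that Right
        // is left pointing at the first mismatch (or the terminating NUL).
        //

        while (*Left &amp;&amp; *Right &amp;&amp; *Left == *Right) {
            Left++;
            Right++;
            Count++;
        }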

        if (Count &gt; 0 &amp;&amp; !*Right) {
            if (ARGUMENT_PRESENT(Match)) {
                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)Count;
                Match-&gt;String = NULL;
            }
            return (STRING_TABLE_INDEX)Index;
        }
    }

    return NO_MATCH_FOUND;
}
</code></pre>
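
                <p>

                    To make the matching semantics concrete, here is a small, portable restatement
                    of the same idea with a few sample lookups.  It is a sketch rather than the
                    benchmarked routine: <code>first_prefix_in_array</code> is a hypothetical
                    name, it returns <code>-1</code> instead of <code>NO_MATCH_FOUND</code>, and
                    it leaves out the <code>STRING_MATCH</code> output.

                </p>

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first entry in the NULL-terminated `array` that is
 * a non-empty prefix of `string`, or -1 if there is none.
 */
int first_prefix_in_array(const char *const *array, const char *string)
{
    int index;

    assert(array != NULL && string != NULL);

    for (index = 0; array[index] != NULL; index++) {
        const char *left = string;
        const char *right = array[index];

        /* Walk both strings while the entry has characters and they agree. */
        while (*right != '\0' && *left == *right) {
            left++;
            right++;
        }

        /* A match means the whole (non-empty) entry was consumed. */
        if (*right == '\0' && right != array[index]) {
            return index;
        }
    }

    return -1;
}
```

                <p>

                    Given the array <code>{ "$AttrDef", "$BadClus", "$Bitmap", "$Boot", NULL
                    }</code>, the search string <code>"$BadClus:$Bad"</code> yields 1 and
                    <code>"$Bootstrap"</code> yields 3, because an array entry is a prefix of the
                    search string in both cases.  <code>"$Bit"</code> yields -1: it is a prefix of
                    <code>"$Bitmap"</code>, not the other way around.

                </p>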
                <hr/>

                <a class="xref" name="IsPrefixOfStringInTable_1"></a>
                <h2>IsPrefixOfStringInTable_1</h2>
                <small>
                    <a href="#IsPrefixOfCStrInArray"><i class="fa fa-arrow-left"></i>  IsPrefixOfCStrInArray</a> |
                    <a href="#IsPrefixOfStringInTable_2">IsPrefixOfStringInTable_2  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    This version is similar to the <code>IsPrefixOfCStrInArray</code>
                    implementation, except it utilizes the slot length information provided by the
                    <code>STRING_ARRAY</code> structure, and conforms to our standard
                    <code>IsPrefixOfStringInTable</code> function signature.  It uses no SIMD
                    instructions.

                </p>

<pre class="code"><code class="language-c">
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_1(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether the search string "starts with or is
    equal to" any string in the table.

    This routine performs a simple linear scan of the string table looking for
    a prefix match against each slot.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a STRING_MATCH structure.  This
        will be populated with additional details about the match if a
        non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    BYTE Left;
    BYTE Right;
    ULONG Index;
    ULONG Count;
    PSTRING_ARRAY StringArray;
    PSTRING TargetString;

    //IACA_VC_START();

    StringArray = StringTable-&gt;pStringArray;

    if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
        return NO_MATCH_FOUND;
    }

    for (Count = 0; Count &lt; StringArray-&gt;NumberOfElements; Count++) {

        TargetString = &amp;StringArray-&gt;Strings[Count];

        if (String-&gt;Length &lt; TargetString-&gt;Length) {
            continue;
        }

        for (Index = 0; Index &lt; TargetString-&gt;Length; Index++) {
            Left = String-&gt;Buffer[Index];
            Right = TargetString-&gt;Buffer[Index];
            if (Left != Right) {
                break;
            }
        }

        if (Index == TargetString-&gt;Length) {

            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Count;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)Index;
                Match-&gt;String = TargetString;

            }

            return (STRING_TABLE_INDEX)Count;
        }

    }

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
</code></pre>

                <p>

                    Here's the performance of these two baseline routines:

                    <a href="Benchmark-01-v6.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-01-v6.svg"/>
                    </a>

                </p>

                <p>

                    That's an interesting result!  Even without using any SIMD instructions, version
                    1, the <code>IsPrefixOfStringInTable_1</code> routine, is faster (in all but one case)
                    than the baseline <code>IsPrefixOfCStrInArray</code> routine, thanks to a more
                    sophisticated data structure.

                </p>

                <p>

                    (And really, it's not even using the sophisticated parts of the
                    <code>STRING_TABLE</code>; it's just leveraging the fact that we've captured the
                    lengths of each string in the backing <code>STRING_ARRAY</code> structure, by
                    virtue of the fact that we use the <code>STRING</code> structure to wrap our
                    strings (versus relying on the standard NULL-terminated C string approach).)

                </p>
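
                <p>

                    A rough way to see what the stored lengths buy is to count byte comparisons.
                    The sketch below is illustrative only: <code>counted_string</code> and
                    <code>first_prefix_counted</code> are hypothetical stand-ins for the
                    <code>STRING</code> structure and the version 1 routine, and the
                    <code>byte_compares</code> counter exists purely for demonstration.

                </p>

```c
#include <assert.h>
#include <stddef.h>

/* A length-prefixed string; the buffer need not be NUL-terminated. */
typedef struct {
    size_t length;
    const char *buffer;
} counted_string;

/*
 * Return the index of the first table entry that is a prefix of `search`,
 * or -1.  Entries longer than the search string are rejected on length
 * alone, before any bytes are compared.  (As written, a zero-length entry
 * matches everything.)
 */
int first_prefix_counted(const counted_string *table, size_t count,
                         const counted_string *search,
                         size_t *byte_compares)
{
    size_t slot, i;

    assert(table != NULL && search != NULL && byte_compares != NULL);
    *byte_compares = 0;

    for (slot = 0; slot < count; slot++) {
        const counted_string *target = &table[slot];

        if (target->length > search->length) {
            continue;               /* too long to be a prefix: skip it */
        }

        for (i = 0; i < target->length; i++) {
            ++*byte_compares;
            if (search->buffer[i] != target->buffer[i]) {
                break;
            }
        }

        if (i == target->length) {
            return (int)slot;
        }
    }

    return -1;
}
```

                <p>

                    Against a table holding <code>$AttrDef</code>, <code>$BadClus</code>,
                    <code>$Bitmap</code> and <code>$Boot</code>, searching for <code>$Bit</code>
                    returns -1 after zero byte comparisons, since every entry is longer than the
                    search string.  Searching for <code>$Boot</code> skips the first three entries
                    on length alone and spends its five comparisons on the entry that matches.

                </p>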

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_2"></a>
                <h2>IsPrefixOfStringInTable_2</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_1"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_1</a> |
                    <a href="#IsPrefixOfStringInTable_3">IsPrefixOfStringInTable_3  <i class="fa fa-arrow-right"></i></a>
                </small>


                <p>

                    This version is the first of the routines to use SIMD instructions.  It is
                    actually based on the prefix matching routine I wrote for the first version
                    of the StringTable component back in 2016.  The layout of the STRING_TABLE
                    struct differed in the first version; only the first character of each slot
                    was used to do the initial exclusion (as opposed to the unique character),
                    and lengths were unsigned shorts instead of chars (16 bits instead of 8 bits),
                    so the match bitmap had to be constructed slightly differently.

                </p>

                <p>

                    None of those details really apply to our second attempt at the StringTable
                    component, detailed in this article.  Our lengths are 8 bits, and we use unique
                    characters in the initial negative match fast-path.  However, the first version
                    used an elaborate AVX2 prefix match routine that was geared toward matching
                    long strings, and attempted to use non-temporal streaming load instructions
                    where possible (which would only make sense for a large number of long strings
                    in a very small set of cache-thrashing scenarios).

                    Compare our simpler implementation, <code>IsPrefixMatch</code>, which we use in
                    version 3 onward, to the far more elaborate (and unnecessary)
                    <code>IsPrefixMatchAvx2</code>:

                </p>

                <div class="tab-box language box-is-prefix-match">
                    <ul class="tabs">
                        <li data-content="content-is-prefix-match">IsPrefixMatch</li>
                        <li data-content="content-is-prefix-match-avx2">IsPrefixMatchAvx2</li>
                    </ul>
                    <div class="content">
<pre class="code content-is-prefix-match"><code class="language-c">FORCEINLINE
BYTE
IsPrefixMatch(
    _In_ PCSTRING SearchString,
    _In_ PCSTRING TargetString,
    _In_ BYTE Offset
    )
{
    PBYTE Left;
    PBYTE Right;
    BYTE Matched = 0;
    BYTE Remaining = (SearchString-&gt;Length - Offset) + 1;

    Left = (PBYTE)RtlOffsetToPointer(SearchString-&gt;Buffer, Offset);
    Right = (PBYTE)RtlOffsetToPointer(TargetString-&gt;Buffer, Offset);

    while (--Remaining &amp;&amp; *Left++ == *Right++) {
        Matched++;
    }

    Matched += Offset;
    if (Matched != TargetString-&gt;Length) {
        return NO_MATCH_FOUND;
    }

    return Matched;
}
</code></pre>
<pre class="code content-is-prefix-match-avx2"><code class="language-c">FORCEINLINE
USHORT
IsPrefixMatchAvx2(
    _In_ PCSTRING SearchString,
    _In_ PCSTRING TargetString,
    _In_ USHORT Offset
    )
{
    USHORT SearchStringRemaining;
    USHORT TargetStringRemaining;
    ULONGLONG SearchStringAlignment;
    ULONGLONG TargetStringAlignment;
    USHORT CharactersMatched = Offset;

    LONG Count;
    LONG Mask;

    PCHAR SearchBuffer;
    PCHAR TargetBuffer;

    STRING_SLOT SearchSlot;

    XMMWORD SearchXmm;
    XMMWORD TargetXmm;
    XMMWORD ResultXmm;

    YMMWORD SearchYmm;
    YMMWORD TargetYmm;
    YMMWORD ResultYmm;

    SearchStringRemaining = SearchString-&gt;Length - Offset;
    TargetStringRemaining = TargetString-&gt;Length - Offset;

    SearchBuffer = (PCHAR)RtlOffsetToPointer(SearchString-&gt;Buffer, Offset);
    TargetBuffer = (PCHAR)RtlOffsetToPointer(TargetString-&gt;Buffer, Offset);

    //
    // This routine is only called in the final stage of a prefix match when
    // we've already verified the slot's corresponding original string length
    // (referred in this routine as the target string) is less than or equal
    // to the length of the search string.
    //
    // We attempt as many 32-byte comparisons as we can, then as many 16-byte
    // comparisons as we can, then a final &lt; 16-byte comparison if necessary.
    //
    // We use aligned loads if possible, falling back to unaligned if not.
    //

StartYmm:

    if (SearchStringRemaining &gt;= 32 &amp;&amp; TargetStringRemaining &gt;= 32) {

        //
        // We have at least 32 bytes to compare for each string.  Check the
        // alignment for each buffer and do an aligned streaming load (non-
        // temporal hint) if our alignment is at a 32-byte boundary or better;
        // reverting to an unaligned load when not.
        //

        SearchStringAlignment = GetAddressAlignment(SearchBuffer);
        TargetStringAlignment = GetAddressAlignment(TargetBuffer);

        if (SearchStringAlignment &lt; 32) {
            SearchYmm = _mm256_loadu_si256((PYMMWORD)SearchBuffer);
        } else {
            SearchYmm = _mm256_stream_load_si256((PYMMWORD)SearchBuffer);
        }

        if (TargetStringAlignment &lt; 32) {
            TargetYmm = _mm256_loadu_si256((PYMMWORD)TargetBuffer);
        } else {
            TargetYmm = _mm256_stream_load_si256((PYMMWORD)TargetBuffer);
        }

        //
        // Compare the two vectors.
        //

        ResultYmm = _mm256_cmpeq_epi8(SearchYmm, TargetYmm);

        //
        // Generate a mask from the result of the comparison.
        //

        Mask = _mm256_movemask_epi8(ResultYmm);

        //
        // There were at least 32 characters remaining in each string buffer,
        // thus, every character needs to have matched in order for this search
        // to continue.  If there were less than 32 characters, we can terminate
        // this prefix search here.  (-1 == 0xffffffff == all bits set == all
        // characters matched.)
        //

        if (Mask != -1) {

            //
            // Not all characters were matched, terminate the prefix search.
            //

            return NO_MATCH_FOUND;
        }

        //
        // All 32 characters were matched.  Update counters and pointers
        // accordingly and jump back to the start of the 32-byte processing.
        //

        SearchStringRemaining -= 32;
        TargetStringRemaining -= 32;

        CharactersMatched += 32;

        SearchBuffer += 32;
        TargetBuffer += 32;

        goto StartYmm;
    }

    //
    // Intentional follow-on to StartXmm.
    //

StartXmm:

    //
    // Update the search string's alignment.
    //

    if (SearchStringRemaining &gt;= 16 &amp;&amp; TargetStringRemaining &gt;= 16) {

        //
        // We have at least 16 bytes to compare for each string.  Check the
        // alignment for each buffer and do an aligned streaming load (non-
        // temporal hint) if our alignment is at a 16-byte boundary or better;
        // reverting to an unaligned load when not.
        //

        SearchStringAlignment = GetAddressAlignment(SearchBuffer);

        if (SearchStringAlignment &lt; 16) {
            SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);
        } else {
            SearchXmm = _mm_stream_load_si128((XMMWORD *)SearchBuffer);
        }

        TargetXmm = _mm_stream_load_si128((XMMWORD *)TargetBuffer);

        //
        // Compare the two vectors.
        //

        ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);

        //
        // Generate a mask from the result of the comparison.
        //

        Mask = _mm_movemask_epi8(ResultXmm);

        //
        // There were at least 16 characters remaining in each string buffer,
        // thus, every character needs to have matched in order for this search
        // to continue.  If there were less than 16 characters, we can terminate
        // this prefix search here.  (-1 == 0xffff -&gt; all bits set -&gt; all chars
        // matched.)
        //

        if ((SHORT)Mask != (SHORT)-1) {

            //
            // Not all characters were matched, terminate the prefix search.
            //

            return NO_MATCH_FOUND;
        }

        //
        // All 16 characters were matched.  Update counters and pointers
        // accordingly and jump back to the start of the 16-byte processing.
        //

        SearchStringRemaining -= 16;
        TargetStringRemaining -= 16;

        CharactersMatched += 16;

        SearchBuffer += 16;
        TargetBuffer += 16;

        goto StartXmm;
    }

    if (TargetStringRemaining == 0) {

        //
        // We'll get here if we successfully prefix matched the search string
        // and all our buffers were aligned (i.e. we don't have a trailing
        // &lt; 16 bytes comparison to perform).
        //

        return CharactersMatched;
    }

    //
    // If we get here, we have less than 16 bytes to compare.  Our target
    // strings are guaranteed to be 16-byte aligned, so we can load them
    // using an aligned stream load as in the previous cases.
    //

    TargetXmm = _mm_stream_load_si128((PXMMWORD)TargetBuffer);

    //
    // Loading the remainder of our search string's buffer is a little more
    // complicated.  It could reside within 15 bytes of the end of the page
    // boundary, which would mean that a 128-bit load would cross a page
    // boundary.
    //
    // At best, the page will belong to our process and we'll take a performance
    // hit.  At worst, we won't own the page, and we'll end up triggering an
    // access violation.
    //
    // So, see if the current search buffer address plus 16 bytes crosses a page
    // boundary.  If it does, take the safe but slower approach of a ranged
    // memcpy (movsb) into a local stack-allocated STRING_SLOT structure.
    //

    if (!PointerToOffsetCrossesPageBoundary(SearchBuffer, 16)) {

        //
        // No page boundary is crossed, so just do an unaligned 128-bit move
        // into our Xmm register.  (We could do the aligned/unaligned dance
        // here, but it's the last load we'll be doing (i.e. it's not
        // potentially on a loop path), so I don't think it's worth the extra
        // branch cost, although I haven't measured this empirically.)
        //

        SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);

    } else {

        //
        // We cross a page boundary, so only copy the bytes we need via
        // __movsb(), then do an aligned stream load into the Xmm register
        // we'll use in the comparison.
        //

        __movsb((PBYTE)&amp;SearchSlot.Char,
                (PBYTE)SearchBuffer,
                SearchStringRemaining);

        SearchXmm = _mm_stream_load_si128(&amp;SearchSlot.CharsXmm);
    }

    //
    // Compare the final vectors.
    //

    ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);

    //
    // Generate a mask from the result of the comparison, but mask off (zero
    // out) high bits from the target string's remaining length.
    //

    Mask = _bzhi_u32(_mm_movemask_epi8(ResultXmm), TargetStringRemaining);

    //
    // Count how many characters were matched and determine if we were a
    // successful prefix match or not.
    //

    Count = __popcnt(Mask);

    if ((USHORT)Count == TargetStringRemaining) {

        //
        // If we matched the same amount of characters as remaining in the
        // target string, we've successfully prefix matched the search string.
        // Return the total number of characters we matched.
        //

        CharactersMatched += (USHORT)Count;
        return CharactersMatched;
    }

    //
    // After all that work, our string match failed at the final stage!  Return
    // to the caller indicating we were unable to make a prefix match.
    //

    return NO_MATCH_FOUND;
}
</code></pre>
                    </div>
                </div>
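
                <p>

                    The tail handling at the end of <code>IsPrefixMatchAvx2</code> boils down to
                    two ideas: only do the full 16-byte read when it can't spill onto the next page,
                    and mask off the lanes past the target's remaining length before counting
                    matches.  Here's a minimal scalar sketch of both, in portable C.  None of this
                    is the actual implementation: the <code>sketch_*</code> names are made up for
                    illustration, the 4 KB page size is an assumption, and plain loops stand in for
                    the <code>_mm_cmpeq_epi8</code>, <code>_mm_movemask_epi8</code>,
                    <code>_bzhi_u32</code> and <code>__popcnt</code> intrinsics.

                </p>

<pre class="code"><code class="language-c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Assumed page size (hypothetical constant; 4 KB is typical but not universal). */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Returns nonzero if reading `size` bytes starting at `p` would touch more
 * than one page.  This is a guess at what a helper like
 * PointerToOffsetCrossesPageBoundary() might do, not its real definition.
 */
static int
sketch_crosses_page(const void *p, size_t size)
{
    uintptr_t offset = (uintptr_t)p &amp; (SKETCH_PAGE_SIZE - 1);
    return offset + size &gt; SKETCH_PAGE_SIZE;
}

/* Portable stand-in for _bzhi_u32(): keep only the low `n` bits (n &lt; 32). */
static uint32_t
sketch_bzhi(uint32_t value, unsigned n)
{
    assert(n &lt; 32);
    return value &amp; ((1u &lt;&lt; n) - 1u);
}

/* Portable stand-in for __popcnt(): count the set bits. */
static unsigned
sketch_popcount(uint32_t value)
{
    unsigned count = 0;
    while (value) {
        value &amp;= value - 1;     /* clear the lowest set bit */
        count++;
    }
    return count;
}

/*
 * Returns 1 if the first `remaining` bytes of `search` and `target` agree,
 * 0 otherwise.  Both inputs are copied into zero-padded 16-byte buffers so
 * that we never read past the bytes we were given.
 */
static int
sketch_tail_is_match(const char *search, const char *target, unsigned remaining)
{
    unsigned char s[16] = {0};
    unsigned char t[16] = {0};
    uint32_t mask = 0;
    unsigned i;

    assert(remaining &lt; 16);
    memcpy(s, search, remaining);       /* plays the role of __movsb() */
    memcpy(t, target, remaining);

    /* Compare all 16 lanes and build a bit per lane (cmpeq + movemask). */
    for (i = 0; i &lt; 16; i++) {
        if (s[i] == t[i]) {
            mask |= 1u &lt;&lt; i;
        }
    }

    /* Lanes past `remaining` don't count; zero them before counting. */
    mask = sketch_bzhi(mask, remaining);

    return sketch_popcount(mask) == remaining;
}
</code></pre>

                <p>

                    The masking step matters because the padding lanes compare equal (zero against
                    zero), so the raw mask has high bits set that say nothing about the strings.
                    With those bits cleared, <code>sketch_tail_is_match("abcdef", "abcxyz", 3)</code>
                    reports a match, as only the first three lanes are considered.

                </p>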

                <p>

                    The AVX2 routine is overkill, especially considering the emphasis we put on
                    favoring short strings versus longer ones in the requirements section.  But
                    we want to put broad statements like that to the test, so let's include it
                    as our first SIMD implementation so that we can see how it stacks up against
                    the simpler versions.

                </p>

                <p>

                    Also note, this is the first time we're seeing the full body of the SIMD-style
                    <code>IsPrefixOfStringInTable</code> implementation.  It's commented heavily,
                    and, in general, the core algorithm doesn't fundamentally change across
                    iterations (things are just tweaked slightly), so I'd recommend reading through
                    it thoroughly to build up a mental model of how the matching algorithm works.
                    It's pretty straightforward, and the subsequent iterations will make a lot more
                    sense as they're typically presented as diffs against the previous version
                    first.

                </p>

<pre class="code"><code class="language-c">
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_2(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether the search string "starts with or is
    equal to" any string in the table.

    This is our first AVX-optimized version of the routine.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable-&gt;pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their bytes will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table that share
        // the search string's unique character and are of a lesser or equal
        // length.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
</code></pre>
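
                <p>

                    If the intrinsics make it hard to see the forest for the trees, a scalar model
                    of the two key steps might help: building the candidate bitmap, then walking
                    its set bits.  The following is a rough sketch under stated assumptions, not
                    the real code: the <code>sketch_*</code> names are hypothetical, the table is
                    simplified to three parallel 16-entry arrays, and the search length is assumed
                    to fit in a signed byte (which is what the signed
                    <code>_mm_cmpgt_epi8</code> comparison needs in order to behave).

                </p>

<pre class="code"><code class="language-c">
#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/*
 * Hypothetical scalar model of the slot filtering step.  Each of the 16
 * slots contributes one bit: set if the slot's unique character matches the
 * search string at that slot's unique index AND the slot isn't longer than
 * the search string.
 */
static uint32_t
sketch_candidate_bitmap(const char *search,
                        unsigned search_length,
                        const unsigned char unique_index[16],
                        const char unique_chars[16],
                        const signed char lengths[16])
{
    char padded[16] = {0};
    uint32_t bitmap = 0;
    unsigned slot;

    /* Assumption: the length fits in a signed byte. */
    assert(search_length &lt;= 127);
    memcpy(padded, search, search_length &lt; 16 ? search_length : 16);

    for (slot = 0; slot &lt; 16; slot++) {
        /* _mm_shuffle_epi8: pick the search byte at this slot's index. */
        char probe = padded[unique_index[slot] &amp; 15];

        /* _mm_cmpeq_epi8 against the table's unique characters. */
        int same_char = (probe == unique_chars[slot]);

        /* _mm_cmpgt_epi8 is a signed compare; empty slots hold 0x7f. */
        int too_long = (lengths[slot] &gt; (signed char)search_length);

        if (same_char &amp;&amp; !too_long) {
            bitmap |= 1u &lt;&lt; slot;       /* _mm_movemask_epi8 */
        }
    }

    return bitmap;
}

/*
 * Walks the set bits from lowest to highest using the same "count trailing
 * zeros, then shift past the 1" scheme as the loop above.  Writes each slot
 * index to `indexes` and returns how many were found.
 */
static unsigned
sketch_walk_bitmap(uint32_t bitmap, unsigned indexes[16])
{
    unsigned count = 0;
    unsigned shift = 0;

    assert(bitmap &lt;= 0xffffu);          /* only 16 slots */

    while (bitmap) {
        unsigned zeros = 0;
        unsigned index;

        /* Portable stand-in for _tzcnt_u32(). */
        while (!((bitmap &gt;&gt; zeros) &amp; 1u)) {
            zeros++;
        }

        index = zeros + shift;
        indexes[count++] = index;

        bitmap &gt;&gt;= (zeros + 1);
        shift = index + 1;
    }

    return count;
}
</code></pre>

                <p>

                    Note how empty slots fall out of the model: their length of
                    <code>0x7f</code> is greater than any search length we allow, so they're
                    excluded even if their unique character happens to match a padding byte.

                </p>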

                <p>

                    Let's see how version 2, our first SIMD attempt, performs in comparison to the
                    two baselines.

                </p>

                <p>

                    <a href="Benchmark-02-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-02-v1.svg"/>
                    </a>

                </p>

                <p>

                    Eek!  Our first SIMD attempt actually has worse prefix matching performance in
                    most cases!  The only area where it shows a performance improvement is negative
                    matching.

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_3"></a>
                <h2>IsPrefixOfStringInTable_3</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_2"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_2</a> |
                    <a href="#IsPrefixOfStringInTable_4">IsPrefixOfStringInTable_4  <i class="fa fa-arrow-right"></i></a>
                </small>


                <p>

                    For version 3, let's replace the call to <code>IsPrefixMatchAvx2</code> with our
                    simpler version, <code>IsPrefixMatch</code>:

                </p>

                <div class="tab-box language box-3v2">
                    <ul class="tabs">
                        <li data-content="content-3v2-diff">Diff</li>
                        <li data-content="content-3-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-3v2-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_2.c IsPrefixOfStringInTable_3.c
--- IsPrefixOfStringInTable_2.c 2018-04-15 22:35:55.458773500 -0400
+++ IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_2(
+IsPrefixOfStringInTable_3(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -278,7 +278,7 @@

             TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

-            CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);
+            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

             if (CharactersMatched == NO_MATCH_FOUND) {
</code></pre>
<pre class="code content-3-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_3(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether the search string "starts with or is
    equal to" any string in the table.

    This is our first AVX-optimized version of the routine.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable-&gt;pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their bytes will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table that share
        // the search string's unique character and are of a lesser or equal
        // length.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}</code></pre>
                    </div>
                </div>

                <p>

                    <a href="Benchmark-03-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-03-v1.svg"/>
                    </a>

                </p>


                <p>

                    Phew!  We finally see superior performance across the board.  This ends the
                    short-lived tenure of version 2, which is demonstrably worse in every case.
                    We'll also omit the <code>IsPrefixOfCStrInArray</code> routine from the graphs
                    for now (for the most part), as it has served its initial baseline purpose.

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_4"></a>
                <h2>IsPrefixOfStringInTable_4</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_3"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_3</a> |
                    <a href="#IsPrefixOfStringInTable_5">IsPrefixOfStringInTable_5  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    When I first wrote the initial string table code, I was playing around with
                    different strategies for loading the initial search string buffer.  That
                    resulted in the file
                    <a
                    href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h">
                    StringLoadStoreOperations.h</a>, which defined a bunch of helper macros.
                    I've included them below, but don't spend too much time absorbing them;
                    they're not good practice, and they all become irrelevant once we switch to
                    <code>_mm_loadu_si128()</code> in a few versions.  I'm including them because
                    they set the scene for versions 4, 5 and 6.

                </p>

<pre class="code"><code class="language-c">
/*++

    VOID
    LoadSearchStringIntoXmmRegister_SEH(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    Attempts an aligned 128-bit load of String-&gt;Buffer into Slot.CharsXmm via
    the _mm_load_si128() intrinsic.  The intrinsic is surrounded in a __try/
    __except block that catches EXCEPTION_ACCESS_VIOLATION exceptions.

    If such an exception is caught, the routine will check to see if the string
    buffer's address will cross a page boundary if 16-bytes are loaded.  If a
    page boundary would be crossed, a __movsb() intrinsic is used to copy only
    the bytes specified by String-&gt;Length, otherwise, an unaligned 128-bit load
    is attempted via the _mm_loadu_si128() intrinsic.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String-&gt;Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_SEH(Slot, String, LengthVar)   \
    LengthVar = min(String-&gt;Length, 16);                               \
    TRY_SSE42_ALIGNED {                                                \
        Slot.CharsXmm = _mm_load_si128((PXMMWORD)String-&gt;Buffer);      \
    } CATCH_EXCEPTION_ACCESS_VIOLATION {                               \
        if (PointerToOffsetCrossesPageBoundary(String-&gt;Buffer, 16)) {  \
            __movsb(Slot.Char, String-&gt;Buffer, LengthVar);             \
        } else {                                                       \
            Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer); \
        }                                                              \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlignmentCheck(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine checks to see if a page boundary will be crossed if 16-bytes
    are loaded from the address supplied by String-&gt;Buffer.  If a page boundary
    will be crossed, a __movsb() intrinsic is used to only copy String-&gt;Length
    bytes into the given Slot.

    If no page boundary will be crossed by a 128-bit load, the alignment of
    the address supplied by String-&gt;Buffer is checked.  If the alignment isn't
    at least on a 16-byte boundary, an unaligned load will be issued via the
    _mm_loadu_si128() intrinsic, otherwise, an _mm_load_si128() will be used.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String-&gt;Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_AlignmentCheck(Slot, String,LengthVar) \
    LengthVar = min(String-&gt;Length, 16);                                       \
    if (PointerToOffsetCrossesPageBoundary(String-&gt;Buffer, 16)) {              \
        __movsb(Slot.Char, String-&gt;Buffer, LengthVar);                         \
    } else if (GetAddressAlignment(String-&gt;Buffer) &lt; 16) {                     \
        Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);             \
    } else {                                                                   \
        Slot.CharsXmm = _mm_load_si128((PXMMWORD)String-&gt;Buffer);              \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlwaysUnaligned(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine performs an unaligned 128-bit load of the address supplied by
    String-&gt;Buffer into the given Slot via the _mm_loadu_si128() intrinsic.
    No checks are done regarding whether or not a page boundary will be crossed.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String-&gt;Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_Unaligned(Slot, String, LengthVar) \
    LengthVar = min(String-&gt;Length, 16);                                   \
    if (PointerToOffsetCrossesPageBoundary(String-&gt;Buffer, 16)) {          \
        __movsb(Slot.Char, String-&gt;Buffer, LengthVar);                     \
    } else if (GetAddressAlignment(String-&gt;Buffer) &lt; 16) {                 \
        Slot.CharsXmm = _mm_loadu_si128(String-&gt;Buffer);                   \
    } else {                                                               \
        Slot.CharsXmm = _mm_load_si128(String-&gt;Buffer);                    \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlwaysMovsb(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine copies min(String-&gt;Length, 16) bytes from String-&gt;Buffer
    into the given Slot via the __movsb() intrinsic.  The memory referenced by
    the Slot is not cleared first via SecureZeroMemory().

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String-&gt;Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_AlwaysMovsb(Slot, String, LengthVar) \
    LengthVar = min(String-&gt;Length, 16);                                     \
    __movsb(Slot.Char, String-&gt;Buffer, LengthVar);
</code></pre>
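                <p>

                    The macros above lean on a <code>PointerToOffsetCrossesPageBoundary()</code>
                    helper whose definition isn't shown here.  As a rough illustration only, a
                    check of that kind could look like the sketch below.  The function name
                    <code>crosses_page_boundary</code>, the integer-address signature and the 4 KB
                    page size are all assumptions made for this example, not the project's actual
                    code.

                </p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size; 4 KB is the usual small-page size on x64. */
#define PAGE_SIZE_BYTES ((uintptr_t)4096)

/*
 * Hypothetical stand-in for PointerToOffsetCrossesPageBoundary().
 *
 * Returns 1 if reading `length` bytes starting at `address` would touch the
 * following page, 0 if the whole read stays within the current page.
 */
int
crosses_page_boundary(uintptr_t address, size_t length)
{
    uintptr_t offset_in_page = address & (PAGE_SIZE_BYTES - 1);

    assert(length > 0 && length <= PAGE_SIZE_BYTES);

    /*
     * The last byte read sits at offset_in_page + length - 1, so the read
     * stays on this page as long as offset_in_page + length <= page size.
     */
    return (offset_in_page + length) > PAGE_SIZE_BYTES;
}
```

                <p>

                    With 4 KB pages, a 16-byte read that starts at page offset 4080 still fits in
                    the page, whereas one that starts at offset 4081 spills into the next page.
                    That second case is the one the <code>__movsb()</code> fallback guards against,
                    because the next page may not be mapped.

                </p>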
                <p>

                    In our <a
                    href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringTable2.vcxproj#L52">StringTable2.vcxproj</a>
                    file, we have the following:

                </p>

                <hr/>
<small><pre>
  &lt;PropertyGroup Label="Globals"&gt;
    ...
    &lt;LoadSearchStringStrategy&gt;AlwaysMovsb&lt;/LoadSearchStringStrategy&gt;
    &lt;!--
    &lt;LoadSearchStringStrategy&gt;SEH&lt;/LoadSearchStringStrategy&gt;
    &lt;LoadSearchStringStrategy&gt;AlignmentCheck&lt;/LoadSearchStringStrategy&gt;
    &lt;LoadSearchStringStrategy&gt;AlwaysUnaligned&lt;/LoadSearchStringStrategy&gt;
    --&gt;
</pre></small>
                <hr/>

                <p>

                    This basically allowed me to toggle which strategy I wanted to use to load
                    the search string into an XMM register.  As you can see above, the
                    default is to use the <code>AlwaysMovsb</code> approach*; so, with version 4,
                    let's swap that out for the <code>SEH</code> approach, which wraps the aligned
                    load in a structured exception handler that falls back to <code>__movsb()</code>
                    if the aligned load fails and a 16-byte read from the pointer would cross a
                    page boundary.

                </p>

                <p>

                    <small>

                    <p>[*]: Or was it?</p>
                    <p>Narrator: <a
                    href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h#L226">
                    it wasn't</a>.</p>

                    <small><p>(Note: these little <em>Narrator</em> interjections work best if you
                    imagine they're being read in <a
                    href="https://en.wikipedia.org/wiki/Arrested_Development_(TV_series)">Ron
                    Howard</a>'s voice.)</p>
                    </small>

                    </small>

                </p>


                <div class="tab-box language box-4v3">
                    <ul class="tabs">
                        <li data-content="content-4v3-diff">Diff</li>
                        <li data-content="content-4-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-4v3-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_4.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_4.c 2018-04-15 22:35:55.453274200 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_4(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine is a variant of version 3 that uses a structured exception
+    handler for loading the initial search string.

 Arguments:

@@ -123,7 +124,7 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);

     //
     // Broadcast the search string's unique characters according to the string
</code></pre>
<pre class="code content-4-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_4(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is a variant of version 3 that uses a structured exception
    handler for loading the initial search string.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable-&gt;pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}</code></pre>
                    </div>
                </div>
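                <p>

                    The candidate loop in the full listing can be modelled in portable C, which may
                    help if you want to step through it without BMI or SSE support.  The following
                    is a minimal sketch rather than the project's code:
                    <code>trailing_zeros32</code>, <code>popcount32</code>,
                    <code>zero_high_bits32</code> and <code>collect_slot_indexes</code> are
                    hypothetical stand-ins for <code>_tzcnt_u32()</code>, <code>__popcnt()</code>,
                    <code>_bzhi_u32()</code> and the index extraction at the top of the
                    <code>do</code>/<code>while</code> loop.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32(); returns 32 when value is zero. */
uint32_t
trailing_zeros32(uint32_t value)
{
    uint32_t count = 0;

    if (value == 0) {
        return 32;
    }
    while ((value & 1) == 0) {
        value >>= 1;
        count++;
    }
    return count;
}

/* Portable stand-in for __popcnt(). */
uint32_t
popcount32(uint32_t value)
{
    uint32_t count = 0;

    while (value != 0) {
        value &= value - 1;     /* Clear the lowest set bit. */
        count++;
    }
    return count;
}

/*
 * Portable stand-in for _bzhi_u32() for the small indexes used here: zero
 * every bit at position >= index.
 */
uint32_t
zero_high_bits32(uint32_t value, uint32_t index)
{
    if (index >= 32) {
        return value;
    }
    return value & ((UINT32_C(1) << index) - 1);
}

/*
 * Walk the set bits of a 16-slot bitmap the same way the loop does, writing
 * each slot index to `indexes` (room for 16 entries is assumed).  Returns the
 * number of indexes written.
 */
uint32_t
collect_slot_indexes(uint16_t slot_bitmap, uint32_t *indexes)
{
    uint32_t bitmap = slot_bitmap;
    uint32_t count = popcount32(bitmap);
    uint32_t shift = 0;
    uint32_t i;

    for (i = 0; i < count; i++) {
        uint32_t zeros = trailing_zeros32(bitmap);
        uint32_t index = zeros + shift;

        indexes[i] = index;

        /* Skip past the zeros and the 1 bit that was just consumed. */
        bitmap >>= (zeros + 1);
        shift = index + 1;
    }
    return count;
}
```

                <p>

                    For example, a bitmap of <code>0x8005</code> has bits 0, 2 and 15 set, so the
                    walk visits slots 0, 2 and 15 in that order.  Likewise,
                    <code>zero_high_bits32(0x00f7, 4)</code> keeps only the low four comparison
                    bits (<code>0x7</code>), and a popcount of that gives three matched characters.

                </p>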

                <p>

                    Performance of version 4 was slightly worse than 3 in every case:

                    <a href="Benchmark-04-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-04-v1.svg"/>
                    </a>

                </p>

                <p>

                    Version 3 is still in the lead with the <code>AlwaysMovsb</code>-based search string
                    loading approach.

                    <small>

                    <p>Narrator: except the
                    <a href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h#L112">AlignmentCheck</a>
                    macro was actually active, not the
                    <a href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h#L112">AlwaysMovsb</a>
                    one.</p>
                    </small>

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_5"></a>
                <h2>IsPrefixOfStringInTable_5</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_4"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_4</a> |
                    <a href="#IsPrefixOfStringInTable_6">IsPrefixOfStringInTable_6  <i class="fa fa-arrow-right"></i></a>
                </small>


                <p>

                    Version 5 is an interesting one.  It's the first time we attempt to validate our
                    claim that it's more efficient to give the CPU a bunch of independent things to
                    do up-front, versus putting more branches in and attempting to terminate as
                    early as possible.

                </p>
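                <p>

                    To make the two control-flow shapes concrete, here is a scalar sketch of the
                    slot filtering done both ways.  It is a simplified model under stated
                    assumptions: <code>toy_table</code> and every function name are invented for
                    illustration, and the byte loops merely stand in for the
                    <code>_mm_shuffle_epi8()</code>, <code>_mm_cmpeq_epi8()</code>,
                    <code>_mm_cmpgt_epi8()</code> and <code>_mm_movemask_epi8()</code> sequence.

                </p>

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_COUNT 16

/* Hypothetical, simplified view of the per-slot metadata in a string table. */
typedef struct {
    uint8_t unique_index[SLOT_COUNT];  /* Which search byte to test per slot. */
    uint8_t unique_chars[SLOT_COUNT];  /* Expected byte at that position.     */
    int8_t  lengths[SLOT_COUNT];       /* Slot lengths; empty slots are 0x7f. */
} toy_table;

/*
 * Scalar model of shuffle + compare-equal + movemask: bit i is set when the
 * search byte selected for slot i equals that slot's unique character.
 * Indexes are assumed to be in the range 0..15.
 */
uint32_t
char_bitmap(const toy_table *table, const uint8_t *search)
{
    uint32_t bitmap = 0;
    int i;

    for (i = 0; i < SLOT_COUNT; i++) {
        uint8_t selected = search[table->unique_index[i] & 15];

        if (selected == table->unique_chars[i]) {
            bitmap |= UINT32_C(1) << i;
        }
    }
    return bitmap;
}

/*
 * Scalar model of the signed greater-than compare followed by the XOR
 * inversion: bit i is set when slot i is no longer than the search string.
 */
uint32_t
length_bitmap(const toy_table *table, int8_t search_length)
{
    uint32_t bitmap = 0;
    int i;

    for (i = 0; i < SLOT_COUNT; i++) {
        if (!(table->lengths[i] > search_length)) {
            bitmap |= UINT32_C(1) << i;
        }
    }
    return bitmap;
}

/* Version 3 shape: compute both filters unconditionally, then branch once. */
uint32_t
candidates_upfront(const toy_table *table, const uint8_t *search,
                   int8_t search_length)
{
    uint32_t chars = char_bitmap(table, search);
    uint32_t lengths = length_bitmap(table, search_length);

    return chars & lengths;
}

/* Version 5 shape: test each intermediate result and bail out early. */
uint32_t
candidates_early_exit(const toy_table *table, const uint8_t *search,
                      int8_t search_length)
{
    uint32_t chars;
    uint32_t lengths;

    chars = char_bitmap(table, search);
    if (!chars) {
        return 0;
    }

    lengths = length_bitmap(table, search_length);
    if (!lengths) {
        return 0;
    }

    return chars & lengths;
}
```

                <p>

                    Both functions return the same bitmap for any input; the only difference is
                    whether the length work is skipped once the character filter has already ruled
                    every slot out.  Which shape is faster on real hardware is an empirical
                    question, which is what the benchmark is for.

                </p>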

                <p>

                    Note: we'll also use the <code>LoadSearchStringIntoXmmRegister_AlwaysMovsb</code>
                    macro here, instead of <code>LoadSearchStringIntoXmmRegister</code>, to make it
                    explicit that we're actually relying on the
                    <code>__movsb()</code>-based string loading routine.

                </p>

                <small><p>Narrator: can anyone spot the mistake with this logic?</p></small>


                <div class="tab-box language box-5v3">
                    <ul class="tabs">
                        <li data-content="content-5v3-diff">Diff</li>
                        <li data-content="content-5-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-5v3-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_5.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_5.c 2018-04-15 13:24:52.480972900 -0400
@@ -16,9 +16,13 @@

 #include "stdafx.h"

+//
+// Variant of v3 with early-exits.
+//
+
 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_5(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +35,11 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine is a variant of version 3 that uses early exits (i.e.
+    returning NO_MATCH_FOUND as early as we can).  It is designed to evaluate
+    the assertion we've been making that it's more optimal to give the CPU
+    to do a bunch of things up front versus doing something, then potentially
+    branching, doing the next thing, potentially branching, etc.

 Arguments:

@@ -51,6 +59,8 @@
 --*/
 {
     ULONG Bitmap;
+    ULONG CharBitmap;
+    ULONG LengthBitmap;
     ULONG Mask;
     ULONG Count;
     ULONG Length;
@@ -71,7 +81,6 @@
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
     XMMWORD IncludeSlotsByLength;
-    XMMWORD IncludeSlots;
     const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     StringArray = StringTable-&gt;pStringArray;
@@ -123,7 +132,7 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength);

     //
     // Broadcast the search string's unique characters according to the string
@@ -133,11 +142,6 @@
     UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                   StringTable-&gt;UniqueIndex.IndexXmm);

-    //
-    // Load the slot length array into an XMM register.
-    //
-
-    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

     //
     // Load the string table's unique character array into an XMM register.
@@ -146,13 +150,6 @@
     TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

     //
-    // Broadcast the search string's length into an XMM register.
-    //
-
-    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
-    LengthXmm = _mm_broadcastb_epi8(LengthXmm);
-
-    //
     // Compare the search string's unique character with all of the unique
     // characters of strings in the table, saving the results into an XMM
     // register.  This comparison will indicate which slots we can ignore
@@ -162,6 +159,25 @@

     IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

+    CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar);
+
+    if (!CharBitmap) {
+        return NO_MATCH_FOUND;
+    }
+
+    //
+    // Load the slot length array into an XMM register.
+    //
+
+    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);
+
+    //
+    // Broadcast the search string's length into an XMM register.
+    //
+
+    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
+    LengthXmm = _mm_broadcastb_epi8(LengthXmm);
+
     //
     // Find all slots that are longer than the incoming string length, as these
     // are the ones we're going to exclude from any prefix match.
@@ -182,31 +198,16 @@

     IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

-    //
-    // We're now ready to intersect the two XMM registers to determine which
-    // slots should still be included in the comparison (i.e. which slots have
-    // the exact same unique character as the string and a length less than or
-    // equal to the length of the search string).
-    //
-
-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
+    LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength);

-    //
-    // Generate a mask.
-    //
+    if (!LengthBitmap) {
+        return NO_MATCH_FOUND;
+    }

-    Bitmap = _mm_movemask_epi8(IncludeSlots);
+    Bitmap = CharBitmap &amp; LengthBitmap;

     if (!Bitmap) {
-
-        //
-        // No bits were set, so there are no strings in this table starting
-        // with the same character and of a lesser or equal length as the
-        // search string.
-        //
-
-        goto NoMatch;
+        return NO_MATCH_FOUND;
     }

     //
</code></pre>
<pre class="code content-5-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_5(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is a variant of version 3 that uses early exits (i.e.
    returning NO_MATCH_FOUND as early as we can).  It is designed to evaluate
    the assertion we've been making that it's more optimal to give the CPU
    to do a bunch of things up front versus doing something, then potentially
    branching, doing the next thing, potentially branching, etc.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG CharBitmap;
    ULONG LengthBitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable-&gt;pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);


    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar);

    if (!CharBitmap) {
        return NO_MATCH_FOUND;
    }

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength);

    if (!LengthBitmap) {
        return NO_MATCH_FOUND;
    }

    Bitmap = CharBitmap &amp; LengthBitmap;

    if (!Bitmap) {
        return NO_MATCH_FOUND;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

// vim:set ts=8 sw=4 sts=4 tw=80 expandtab                                     :
</code></pre>


                    </div>
                </div>

                <p>

                    If our theory is correct, the performance of this version should be worse, due
                    to all the extra branches in the initial test.  Let's see if we're right:

                </p>

                <p>

                    <a href="Benchmark-05-v2.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-05-v2.svg"/>
                    </a>

                </p>

                <p>

                    Holy smokes, version 5 <strong>is</strong> bad!  It's so bad it's actually
                    closest in territory to the bunk version 2 that had the elaborate AVX2 prefix
                    matching routine.  <small>(Note: actually it was so close I ended up double-checking
                    the two routines were correct; they were, so this is just a
                    coincidence.)</small>

                </p>

                <small>

                    <p>Narrator: nice "double-checking", you putz.</p>

                </small>

                <p>

                    That's good news though, as it validates this assumption that we've been working
                    with since inception:

                </p>

<pre class="code"><code class="language-c">    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //
</code></pre>

                <p>

                    That's the end of version 5's tenure.  TL;DR: fewer branches &gt; more branches.

                </p>

                <small>

                    <p>

                        Narrator: more accurate TL;DR: <code>__movsb()</code> is slow, and always
                        make sure you're testing what you think you're testing.

                    </p>

                </small>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_6"></a>
                <h2>IsPrefixOfStringInTable_6</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_5"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_5</a> |
                    <a href="#IsPrefixOfStringInTable_7">IsPrefixOfStringInTable_7  <i class="fa fa-arrow-right"></i></a>
                </small>


                <p>

                    Version 6 is boring.  We tweak the initial loading of the search string,
                    explicitly loading it via an unaligned load.  If the underlying buffer is
                    aligned on a 16 byte boundary, this is just as fast as an aligned load.
                    If not, hey, at least it doesn't crash &mdash; it's just slow.

                </p>

                <p>

                    (If you attempt an aligned load on an address that isn't aligned on a 16-byte
                    boundary, the processor will generate an exception, which will crash your
                    program unless you have a structured exception handler in place to catch
                    it.)

                </p>
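
                <p>

                    To make the distinction concrete, here's a minimal sketch.  It isn't part of
                    the StringTable sources: both helper names are made up for illustration, and
                    it assumes an x86-64 target where SSE2 is available.
                    <code>IsAligned16()</code> tests the property an aligned load depends on, and
                    <code>ByteAfterUnalignedLoad()</code> shows that
                    <code>_mm_loadu_si128()</code> accepts any address, provided all 16 bytes are
                    readable:

                </p>

<pre class="code"><code class="language-c">#include &lt;emmintrin.h&gt;  /* SSE2: _mm_loadu_si128, _mm_storeu_si128 */
#include &lt;stdint.h&gt;

/*
 * Hypothetical helper: returns nonzero if Address sits on a 16-byte boundary,
 * which is what an aligned load such as _mm_load_si128() requires.
 */
static int
IsAligned16(const void *Address)
{
    return ((uintptr_t)Address &amp; 15) == 0;
}

/*
 * Hypothetical helper: loads 16 bytes starting at Buffer with an unaligned
 * load, then returns the byte at Position (0..15).  Buffer may sit at any
 * address, but all 16 bytes must be readable.
 */
static unsigned char
ByteAfterUnalignedLoad(const char *Buffer, unsigned Position)
{
    unsigned char Bytes[16];
    __m128i Chars;

    Chars = _mm_loadu_si128((const __m128i *)Buffer);
    _mm_storeu_si128((__m128i *)Bytes, Chars);

    return Bytes[Position &amp; 15];
}
</code></pre>

                <p>

                    Calling <code>ByteAfterUnalignedLoad(Buffer + 1, 0)</code> on a 32-byte buffer
                    is fine.  Swapping in <code>_mm_load_si128()</code> for the same call can
                    fault unless <code>Buffer + 1</code> happens to land on a 16-byte boundary.

                </p>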

                <div class="tab-box language box-6v3">
                    <ul class="tabs">
                        <li data-content="content-6v3-diff">Diff</li>
                        <li data-content="content-6-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-6v3-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_6.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_6.c 2018-04-26 18:29:40.594556800 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_6(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine differs from version 3 in that we do an unaligned load of
+    the search string buffer without any SEH wrappers or alignment checks.

 Arguments:

@@ -123,7 +124,8 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    SearchLength = min(String-&gt;Length, 16);
+    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

     //
     // Broadcast the search string's unique characters according to the string
</code></pre>
<pre class="code content-6-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_6(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine differs from version 3 in that we do an unaligned load of
    the search string buffer without any SEH wrappers or alignment checks.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable-&gt;pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    SearchLength = min(String-&gt;Length, 16);
    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

</code></pre>

                    </div>
                </div>
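
                <p>

                    If the intrinsics make the filtering step hard to follow, a scalar model may
                    help.  This is a sketch rather than anything from the StringTable sources:
                    <code>CandidateBitmap()</code> and its parameters are hypothetical names, and
                    it assumes every unique index is below 16.  It computes, one slot at a time,
                    the same 16-bit bitmap that the shuffle, the two compares, the XOR, the AND
                    and the movemask produce in the listing above:

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;

#define SLOT_COUNT 16

/*
 * Scalar model of the candidate filter.  Bit N of the result is set when:
 *
 *   - the search string's character at slot N's unique index equals slot N's
 *     unique character (the _mm_shuffle_epi8 + _mm_cmpeq_epi8 pair), and
 *
 *   - slot N is not longer than the search string (the signed _mm_cmpgt_epi8
 *     followed by the XOR that inverts it).
 *
 * Search must point at 16 readable bytes, mirroring the 128-bit load.
 */
static uint32_t
CandidateBitmap(const uint8_t Search[16],
                uint8_t StringLength,
                const uint8_t UniqueIndex[SLOT_COUNT],
                const uint8_t UniqueChars[SLOT_COUNT],
                const uint8_t SlotLengths[SLOT_COUNT])
{
    uint32_t Bitmap = 0;
    uint32_t Slot;

    for (Slot = 0; Slot &lt; SLOT_COUNT; Slot++) {
        int CharMatches;
        int TooLong;

        CharMatches = (Search[UniqueIndex[Slot] &amp; 15] == UniqueChars[Slot]);

        /* The SIMD compare is signed, so model it with int8_t. */
        TooLong = ((int8_t)SlotLengths[Slot] &gt; (int8_t)StringLength);

        if (CharMatches &amp;&amp; !TooLong) {
            Bitmap |= (1u &lt;&lt; Slot);
        }
    }

    return Bitmap;
}
</code></pre>

                <p>

                    With a table holding only <code>"foo"</code> and <code>"bar"</code> (empty
                    slots given a length of <code>0x7f</code>), searching for
                    <code>"foobar"</code> leaves just bit 0 set, whereas searching for
                    <code>"ba"</code> leaves nothing set, because the only slot with a matching
                    character is longer than the search string.

                </p>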

                <p>

                    Version 6 should be faster than version 3; we omit alignment checks, all of our
                    input buffers are aligned at 32 bytes, and an unaligned XMM load of an aligned
                    buffer should definitely be faster than a <code>__movsb()</code>.  Let's see:

                </p>

                <p>

                    <a href="Benchmark-06-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-06-v1.svg"/>
                    </a>

                </p>

                <p>

                    We have a new winner!  Version 3 had a good run, but it's time to retire.
                    Let's tweak version 6 going forward.

                </p>

                <small>

                    <p>

                        Narrator: this is actually testing <code>_mm_loadu_si128()</code> against
                        the
                        <a href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringLoadStoreOperations.h#L112">AlignmentCheck</a>
                        routine, which first calls
                        <code>PointerToOffsetCrossesPageBoundary()</code>, and then checks the
                        address alignment before calling <code>_mm_load_si128()</code>.
                        As unaligned loads are just as fast as aligned loads as long as the
                        underlying buffer is aligned, all this is really showing is that it's
                        slightly faster not doing the pointer boundary check and address
                        alignment check, which shouldn't be that surprising.

                    </p>

                </small>
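
                <p>

                    For a feel of what that extra work looks like, here's a rough sketch of such a
                    gate.  It isn't the code behind the AlignmentCheck link: the names, the 4 KB
                    page size and the choice of fallbacks for the two non-aligned paths are all
                    assumptions made for illustration.

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;

#define SKETCH_PAGE_SIZE 4096u  /* Assumption: 4 KB pages. */

typedef enum {
    LOAD_BYTEWISE,   /* e.g. copy only the string's bytes, one at a time */
    LOAD_ALIGNED,    /* e.g. _mm_load_si128()  */
    LOAD_UNALIGNED   /* e.g. _mm_loadu_si128() */
} LOAD_STRATEGY;

/*
 * Returns nonzero if reading Bytes bytes starting at Address would run past
 * the end of the page that Address lives in.
 */
static int
ReadCrossesPage(uintptr_t Address, uintptr_t Bytes)
{
    uintptr_t OffsetInPage = Address &amp; (SKETCH_PAGE_SIZE - 1);

    return (OffsetInPage + Bytes) &gt; SKETCH_PAGE_SIZE;
}

/*
 * Picks a way of loading 16 bytes from Address: first the page boundary
 * check, then the alignment check.  Two branches before any data moves.
 */
static LOAD_STRATEGY
ChooseLoad16(uintptr_t Address)
{
    if (ReadCrossesPage(Address, 16)) {
        return LOAD_BYTEWISE;
    }

    if ((Address &amp; 15) == 0) {
        return LOAD_ALIGNED;
    }

    return LOAD_UNALIGNED;
}
</code></pre>

                <p>

                    Version 6 skips both tests and always takes the equivalent of the
                    <code>LOAD_UNALIGNED</code> path.

                </p>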


                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_7"></a>
                <h2>IsPrefixOfStringInTable_7</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_6"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_6</a> |
                    <a href="#IsPrefixOfStringInTable_8">IsPrefixOfStringInTable_8  <i class="fa fa-arrow-right"></i></a>
                </small>


                <p>

                    Version 7 tweaks version 6 a little bit.  We don't need the search string length
                    calculated so early in the routine.  Let's move it to later.

                </p>
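
                <p>

                    To see why the value can wait, consider the only place it's consumed: trimming
                    the per-slot comparison mask down to the characters that belong to the search
                    string, then counting what's left.  The sketch below models that step in
                    portable C.  The helper names are hypothetical, <code>ZeroHighBits()</code>
                    stands in for <code>_bzhi_u32()</code> for the indices used here (0 to 16),
                    and <code>CountSetBits()</code> stands in for <code>__popcnt()</code>:

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;

/* Stand-in for _bzhi_u32(): clears every bit at position Index and above. */
static uint32_t
ZeroHighBits(uint32_t Value, uint32_t Index)
{
    if (Index &gt;= 32) {
        return Value;
    }

    return Value &amp; ((1u &lt;&lt; Index) - 1u);
}

/* Stand-in for __popcnt(): counts the bits set in Value. */
static uint32_t
CountSetBits(uint32_t Value)
{
    uint32_t Count = 0;

    while (Value) {
        Value &amp;= Value - 1;   /* Clear the lowest set bit. */
        Count++;
    }

    return Count;
}

/*
 * CompareMask has bit N set when byte N of the slot equals byte N of the
 * search string (i.e. the result of _mm_movemask_epi8 on the comparison).
 * Only the first min(StringLength, 16) bits are meaningful; anything above
 * that compared bytes past the end of the search string.
 */
static uint32_t
CountMatchedCharacters(uint32_t CompareMask, uint32_t StringLength)
{
    uint32_t SearchLength = (StringLength &lt; 16 ? StringLength : 16);

    return CountSetBits(ZeroHighBits(CompareMask, SearchLength));
}
</code></pre>

                <p>

                    Nothing in the up-front filtering depends on <code>SearchLength</code>, so
                    computing it after the early exit costs nothing on the paths that bail out.

                </p>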

                <div class="tab-box language box-7v6">
                    <ul class="tabs">
                        <li data-content="content-7v6-diff">Diff</li>
                        <li data-content="content-7-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-7v6-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_6.c IsPrefixOfStringInTable_7.c
--- IsPrefixOfStringInTable_6.c 2018-04-15 22:35:55.450273700 -0400
+++ IsPrefixOfStringInTable_7.c 2018-04-26 10:00:53.905933700 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_6(
+IsPrefixOfStringInTable_7(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,9 +31,10 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine differs from version 3 in that we do an aligned load of the
-    search string buffer without any SEH wrappers or alignment checks.  (Thus,
-    this routine will fault if the buffer is unaligned.)
+    This routine is based off version 6, but alters when we calculate the
+    "search length" for the given string, which is done via the expression
+    'min(String-&gt;Length, 16)'.  We don't need this value until later in the
+    routine, when we're ready to start comparing strings.

 Arguments:

@@ -125,7 +126,6 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    SearchLength = min(String-&gt;Length, 16);
     Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

     //
@@ -213,6 +213,13 @@
     }

     //
+    // Calculate the "search length" of the incoming string, which ensures we
+    // only compare up to the first 16 characters.
+    //
+
+    SearchLength = min(String-&gt;Length, 16);
+
+    //
     // A popcount against the mask will tell us how many slots we matched, and
     // thus, need to compare.
     //
</code></pre>
<pre class="code content-7-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_7(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is based off version 6, but alters when we calculate the
    "search length" for the given string, which is done via the expression
    'min(String-&gt;Length, 16)'.  We don't need this value until later in the
    routine, when we're ready to start comparing strings.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable-&gt;pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String-&gt;Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}</code></pre>

                    </div>
                </div>

                <p>

                    This is a tiny change.  If it makes any performance difference at all, it
                    should be a positive one, although the compiler may already have noticed that
                    we don't use the expression until much later and deferred computing it until
                    after the initial negative match logic.  Let's see:

                </p>

                <p>

                    <a href="Benchmark-07-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-07-v1.svg"/>
                    </a>

                </p>

                <p>

                    Tiny change, tiny performance improvement!  Looks like this saves a couple of
                    cycles, thus ending the short-lived reign of version 6.

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_8"></a>
                <h2>IsPrefixOfStringInTable_8</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_7"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_7</a> |
                    <a href="#IsPrefixOfStringInTable_9">IsPrefixOfStringInTable_9  <i class="fa fa-arrow-right"></i></a>
                </small>


                <p>

                    Version 8 is based on version 7, but omits the initial minimum length test.
                    It's another small change, but if version 5 was anything to go by, the fewer
                    branches, the better.

                </p>
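
                <p>

                    Why is it safe to drop the test?  Here's a minimal standalone sketch (not taken
                    from the StringTable sources) of the vectorized length filter on its own.  The
                    helper name <code>LengthFilterBitmap</code> is made up for illustration, and
                    the slot lengths are passed in as a plain 16-byte array rather than read from
                    a <code>STRING_TABLE</code>; the 0x7f length for empty slots follows the
                    comments in the listings.  If every occupied slot is longer than the search
                    string, the filter alone should produce an empty bitmap, so the up-front
                    <code>MinimumLength</code> check only adds a branch.

                </p>

<pre class="code"><code class="language-c">#include &lt;emmintrin.h&gt;
#include &lt;stdint.h&gt;

//
// Hypothetical helper: returns a 16-bit bitmap with bit N set when slot N
// holds a string no longer than SearchLength (i.e. a prefix candidate).
// Mirrors the cmpgt, xor and movemask steps in the listings.
//

static unsigned int
LengthFilterBitmap(
    const uint8_t SlotLengths[16],
    uint8_t SearchLength
    )
{
    const __m128i AllOnesXmm = _mm_set1_epi8((char)0xff);
    __m128i LengthsXmm = _mm_loadu_si128((const __m128i *)SlotLengths);
    __m128i LengthXmm = _mm_set1_epi8((char)SearchLength);

    //
    // 0xff where the slot is longer than the search string.  This is a
    // signed byte comparison, so empty slots (0x7f) always land here for
    // search lengths below 0x7f.
    //

    __m128i IgnoreSlotsByLength = _mm_cmpgt_epi8(LengthsXmm, LengthXmm);

    //
    // Invert: 0xff where the slot is short enough to be a prefix.
    //

    __m128i IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength,
                                                 AllOnesXmm);

    return (unsigned int)_mm_movemask_epi8(IncludeSlotsByLength);
}
</code></pre>

                <p>

                    With slot lengths of 5, 8 and 12 and the other thirteen slots empty, a 3-byte
                    search string yields a bitmap of 0, while an 8-byte search string yields 0x3
                    (slots 0 and 1).  The real routine additionally intersects this with the unique
                    character comparison, which can only clear more bits.

                </p>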

                <div class="tab-box language box-8v7">
                    <ul class="tabs">
                        <li data-content="content-8v7-diff">Diff</li>
                        <li data-content="content-8-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-8v7-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_7.c IsPrefixOfStringInTable_8.c
--- IsPrefixOfStringInTable_7.c 2018-04-26 10:21:43.253466500 -0400
+++ IsPrefixOfStringInTable_8.c 2018-04-26 10:21:27.109761800 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_7(
+IsPrefixOfStringInTable_8(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,10 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine is based off version 6, but alters when we calculate the
-    "search length" for the given string, which is done via the expression
-    'min(String-&gt;Length, 16)'.  We don't need this value until later in the
-    routine, when we're ready to start comparing strings.
+    This routine is based off version 7, but omits the initial minimum
+    length test of the string array.

 Arguments:

@@ -63,7 +61,6 @@
     ULONG NumberOfTrailingZeros;
     ULONG SearchLength;
     PSTRING TargetString;
-    PSTRING_ARRAY StringArray;
     STRING_SLOT Slot;
     STRING_SLOT Search;
     STRING_SLOT Compare;
@@ -77,17 +74,6 @@
     XMMWORD IncludeSlots;
     const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

-    StringArray = StringTable-&gt;pStringArray;
-
-    //
-    // If the minimum length of the string array is greater than the length of
-    // our search string, there can't be a prefix match.
-    //
-
-    if (StringArray-&gt;MinimumLength &gt; String-&gt;Length) {
-        goto NoMatch;
-    }
-
     //
     // Unconditionally do the following five operations before checking any of
     // the results and determining how the search should proceed:
</code></pre>
<pre class="code content-8-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_8(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is based off version 7, but omits the initial minimum
    length test of the string array.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String-&gt;Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
</code></pre>
                    </div>
                </div>

                <p>

                    <a href="Benchmark-08-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-08-v1.svg"/>
                    </a>

                </p>

                <p>

                    Hey, look at that, another win across the board!  Omitting the length test
                    shaves off a few more cycles for both prefix and negative matching.  Version 7's
                    one-round reign has come to a timely end.

                </p>
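
                <p>

                    The slot-walking loop these versions share can also be exercised in isolation.
                    The following is a small sketch, assuming portable stand-ins for the
                    <code>_tzcnt_u32</code> and <code>__popcnt</code> intrinsics so that it compiles
                    without BMI support; <code>CountTrailingZeros</code>, <code>PopCount</code> and
                    <code>CollectSlotIndexes</code> are hypothetical names that don't appear in the
                    StringTable sources.

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;

//
// Portable stand-in for _tzcnt_u32; returns 32 for a zero input.
//

static uint32_t
CountTrailingZeros(uint32_t Value)
{
    uint32_t Count = 0;

    if (Value == 0) {
        return 32;
    }

    while ((Value &amp; 1) == 0) {
        Value &gt;&gt;= 1;
        Count++;
    }

    return Count;
}

//
// Portable stand-in for __popcnt.
//

static uint32_t
PopCount(uint32_t Value)
{
    uint32_t Count = 0;

    while (Value) {
        Value &amp;= Value - 1;     // Clear the lowest set bit.
        Count++;
    }

    return Count;
}

//
// Writes the index of every set bit in Bitmap to Indexes, lowest first,
// using the same tzcnt-and-shift walk as the listings.  Returns the number
// of indexes written.  Bitmap is assumed to fit in 16 bits, as a movemask
// result over an XMM register does.
//

static uint32_t
CollectSlotIndexes(
    uint32_t Bitmap,
    uint8_t Indexes[16]
    )
{
    uint32_t Shift = 0;
    uint32_t Written = 0;
    uint32_t Count = PopCount(Bitmap);

    while (Count--) {
        uint32_t NumberOfTrailingZeros = CountTrailingZeros(Bitmap);
        uint32_t Index = NumberOfTrailingZeros + Shift;

        //
        // Shift past the zeros and the 1 we just found, and remember how
        // far we've shifted so the next index is relative to bit 0.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        Indexes[Written++] = (uint8_t)Index;
    }

    return Written;
}
</code></pre>

                <p>

                    For a bitmap of 0x8025 (bits 0, 2, 5 and 15 set), this writes the indexes 0, 2,
                    5 and 15 in that order, which is the order the slots get compared against the
                    search string.

                </p>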

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_9"></a>
                <h2>IsPrefixOfStringInTable_9</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_8"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_8</a> |
                    <a href="#IsPrefixOfStringInTable_10">IsPrefixOfStringInTable_10  <i class="fa fa-arrow-right"></i></a>
                </small>


                <p>

                    Version 9 tweaks version 8 and simply does <code>return NO_MATCH_FOUND</code>
                    after the initial bitmap check instead of <code>goto NoMatch</code>.  (The use of
                    goto was a bit peculiar there, anyway.  And we're going to rewrite the body in a
                    similar fashion for version 10, but let's try to stick to making one change at a
                    time.)

                </p>

                <div class="tab-box language box-9v8">
                    <ul class="tabs">
                        <li data-content="content-9v8-diff">Diff</li>
                        <li data-content="content-9-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-9v8-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_8.c IsPrefixOfStringInTable_9.c
--- IsPrefixOfStringInTable_8.c 2018-04-26 10:30:52.337935400 -0400
+++ IsPrefixOfStringInTable_9.c 2018-04-26 10:32:04.986734400 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_8(
+IsPrefixOfStringInTable_9(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,8 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine is based off version 7, but omits the initial minimum
-    length test of the string array.
+    This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
+    after the initial bitmap check versus 'goto NoMatch'.

 Arguments:

@@ -195,7 +195,7 @@
         // search string.
         //

-        goto NoMatch;
+        return NO_MATCH_FOUND;
     }

     //
@@ -330,8 +330,6 @@
     // If we get here, we didn't find a match.
     //

-NoMatch:
-
     //IACA_VC_END();

     return NO_MATCH_FOUND;
</code></pre>
<pre class="code content-9-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_9(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
    after the initial bitmap check versus 'goto NoMatch'.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String-&gt;Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match-&gt;Index = (BYTE)Index;
                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

</code></pre>

                    </div>
                </div>

                <p>

                    <a href="Benchmark-09-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-09-v1.svg"/>
                    </a>

                </p>

                <p>

                    This is an interesting one.  Swapping the goto for a return looks to have cost
                    us a little on the first few test inputs, but only a tiny amount: we're talking
                    about 0.2 more cycles or so, which is nothing in the grand scheme of things.
                    (Although let's not pull on that thread too much, or the entire premise of the
                    whole article will quickly unravel!)

                </p>

                <p>
                    Version 9 improves the negative match performance by a few cycles, though, so
                    let's keep it.

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_10"></a>
                <h2>IsPrefixOfStringInTable_10</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_9"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_9</a> |
                    <a href="#IsPrefixOfStringInTable_11">IsPrefixOfStringInTable_11  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    At this point, we've exhausted all the little easy tweaks.  Let's re-write
                    the inner loop that does the character comparison and see how that affects
                    performance.

                </p>

                <p>

                    This should be an interesting one because the way it's written now is... a tad
                    odd.  (I've clearly made some assumptions regarding optimal branch organization
                    at the very least.)

                </p>


                <div class="tab-box language box-10v9">
                    <ul class="tabs">
                        <li data-content="content-10v9-diff">Diff</li>
                        <li data-content="content-10-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-10v9-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_9.c IsPrefixOfStringInTable_10.c
--- IsPrefixOfStringInTable_9.c 2018-04-26 10:32:04.986734400 -0400
+++ IsPrefixOfStringInTable_10.c        2018-04-26 10:38:09.357890400 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_9(
+IsPrefixOfStringInTable_10(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,8 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
-    after the initial bitmap check versus 'goto NoMatch'.
+    This version is based off version 9, but rewrites the inner loop that
+    checks for comparisons.

 Arguments:

@@ -264,7 +264,17 @@

         CharactersMatched = __popcnt(Mask);

-        if ((USHORT)CharactersMatched == 16 &amp;&amp; Length &gt; 16) {
+        if ((USHORT)CharactersMatched &lt; Length &amp;&amp; Length &lt;= 16) {
+
+            //
+            // The slot length is longer than the number of characters matched
+            // from the search string; this isn't a prefix match.  Continue.
+            //
+
+            continue;
+        }
+
+        if (Length &gt; 16) {

             //
             // The first 16 characters in the string matched against this
@@ -283,46 +293,24 @@
                 //

                 continue;
-
-            } else {
-
-                //
-                // We successfully prefix matched the search string against
-                // this slot.  The code immediately following us deals with
-                // handling a successful prefix match at the initial slot
-                // level; let's avoid an unnecessary branch and just jump
-                // directly into it.
-                //
-
-                goto FoundMatch;
             }
         }

-        if ((USHORT)CharactersMatched == Length) {
-
-FoundMatch:
-
-            //
-            // This slot is a prefix match.  Fill out the Match structure if the
-            // caller provided a non-NULL pointer, then return the index of the
-            // match.
-            //
-
-
-            if (ARGUMENT_PRESENT(Match)) {
+        //
+        // This slot is a prefix match.  Fill out the Match structure if the
+        // caller provided a non-NULL pointer, then return the index of the
+        // match.
+        //

-                Match-&gt;Index = (BYTE)Index;
-                Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
-                Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];
+        if (ARGUMENT_PRESENT(Match)) {

-            }
+            Match-&gt;Index = (BYTE)Index;
+            Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
+            Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

-            return (STRING_TABLE_INDEX)Index;
         }

-        //
-        // Not enough characters matched, so continue the loop.
-        //
+        return (STRING_TABLE_INDEX)Index;

     } while (--Count);
</code></pre>
<pre class="code content-10-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_10(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 9, but rewrites the inner loop that
    checks for comparisons.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String-&gt;Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched &lt; Length &amp;&amp; Length &lt;= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match-&gt;Index = (BYTE)Index;
            Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
</code></pre>
                    </div>
                </div>

                <p>

                    That's a nicer bit of logic.  More C-like, less assembly-like.  Arguably
                    clearer.  Let's see how they compare.  (This is an interesting one as I
                    genuinely don't have a strong hunch what sort of a performance impact this
                    will have; obviously I thought the first way of structuring the loop was
                    optimal, and I had that in place for two years before deciding to embark
                    on this article, which led to the rework we just saw.)

                </p>

                <p>

                    <a href="Benchmark-10-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-10-v1.svg"/>
                    </a>

                </p>

                <p>

                    Hey, look at that, we've shaved off a few more cycles in most cases, especially
                    for the negative matches!

                </p>
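
                <p>

                    As a reading aid, the per-slot decision that version 10 makes can be modelled
                    in a few lines of portable C.  This is only a sketch under assumptions:
                    <code>LowBits</code> and <code>CountBits</code> are scalar stand-ins for
                    <code>_bzhi_u32</code> and <code>__popcnt</code>, and <code>ClassifySlot</code>
                    and the <code>SLOT_OUTCOME</code> names are hypothetical, not part of the
                    StringTable code.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome labels for one candidate slot (not in the original). */
typedef enum { SLOT_SKIP, SLOT_CHECK_TAIL, SLOT_MATCH } SLOT_OUTCOME;

/* Scalar stand-in for _bzhi_u32: keep only the low `count` bits of `value`. */
static uint32_t LowBits(uint32_t value, uint32_t count)
{
    return (count >= 32) ? value : (value & ((1u << count) - 1u));
}

/* Scalar stand-in for __popcnt: count the set bits in `value`. */
static uint32_t CountBits(uint32_t value)
{
    uint32_t n = 0;
    while (value) {
        value &= value - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * compareMask is the 16-bit byte-equality mask for one slot (what
 * _mm_movemask_epi8 returns), searchLength is the search string's length and
 * slotLength is the length of the string stored in the slot.
 */
static SLOT_OUTCOME ClassifySlot(uint32_t compareMask,
                                 uint32_t searchLength,
                                 uint32_t slotLength)
{
    uint32_t capped;
    uint32_t matched;

    assert(compareMask <= 0xFFFFu);     /* a 128-bit movemask has 16 bits */

    capped = (searchLength < 16) ? searchLength : 16;
    matched = CountBits(LowBits(compareMask, capped));

    if (matched < slotLength && slotLength <= 16) {
        return SLOT_SKIP;               /* 'continue' in the loop */
    }
    if (slotLength > 16) {
        return SLOT_CHECK_TAIL;         /* defer to IsPrefixMatch() */
    }
    return SLOT_MATCH;                  /* fill out Match and return Index */
}
```

                <p>

                    For example, a three-character slot whose three bytes all match a
                    five-character search string gives <code>ClassifySlot(0x0007, 5, 3) ==
                    SLOT_MATCH</code>, whereas a mask of <code>0x0003</code> for the same slot
                    yields <code>SLOT_SKIP</code>.

                </p>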

                <hr/>
                <h2>Speeding Up Negative Matches with Assembly</h2>

                <p>

                    <small><small>(Note: if you build the <a
                    href="https://github.com/tpn/tracer">Tracer</a> project, you can run a helper
                    batch file in the root directory called cdb-simple.bat, which uses cdb to launch
                    one of the project's executables called ModuleLoader.exe, which will start up,
                    load all of our tracing project's DLLs, then allow the debugger to break in,
                    which yields us a debugger prompt from which we can easily disassemble functions
                    and inspect runtime function entries etc.  This is the approach I used for
                    capturing the output over the next couple of sections.)</small></small>

                </p>

                <p>

                    Now for the fun part!  Let's take a look at the disassembly of the initial part
                    of version 10 that is responsible for doing the negative match logic and see if
                    there are any improvements we can make.

                </p>

                <hr/>
                <small><pre>
0:000&gt; uf StringTable2!IsPrefixOfStringInTable_10
StringTable2!IsPrefixOfStringInTable_10:
00007fff`f69c1df0 48896c2418      mov     qword ptr [rsp+18h],rbp
00007fff`f69c1df5 4889742420      mov     qword ptr [rsp+20h],rsi
00007fff`f69c1dfa 4155            push    r13
00007fff`f69c1dfc 4156            push    r14
00007fff`f69c1dfe 4157            push    r15
00007fff`f69c1e00 4883ec20        sub     rsp,20h
00007fff`f69c1e04 c5fa6f5920      vmovdqu xmm3,xmmword ptr [rcx+20h]
00007fff`f69c1e09 4c8b6a08        mov     r13,qword ptr [rdx+8]
00007fff`f69c1e0d 4d8bf0          mov     r14,r8
00007fff`f69c1e10 440fb63a        movzx   r15d,byte ptr [rdx]
00007fff`f69c1e14 33ed            xor     ebp,ebp
00007fff`f69c1e16 44883c24        mov     byte ptr [rsp],r15b
00007fff`f69c1e1a 488bf1          mov     rsi,rcx
00007fff`f69c1e1d c4e279780c24    vpbroadcastb xmm1,byte ptr [rsp]
00007fff`f69c1e23 c4c17a6f6500    vmovdqu xmm4,xmmword ptr [r13]
00007fff`f69c1e29 c4e259004110    vpshufb xmm0,xmm4,xmmword ptr [rcx+10h]
00007fff`f69c1e2f c5f97411        vpcmpeqb xmm2,xmm0,xmmword ptr [rcx]
00007fff`f69c1e33 c5e164c9        vpcmpgtb xmm1,xmm3,xmm1
00007fff`f69c1e37 c5f1ef0d41320000 vpxor   xmm1,xmm1,xmmword ptr [StringTable2!_xmmffffffffffffffffffffffffffffffff (00007fff`f69c5080)]
00007fff`f69c1e3f c5e9dbd1        vpand   xmm2,xmm2,xmm1
00007fff`f69c1e43 c579d7c2        vpmovmskb r8d,xmm2
00007fff`f69c1e47 c5fa7f5c2410    vmovdqu xmmword ptr [rsp+10h],xmm3
00007fff`f69c1e4d 4585c0          test    r8d,r8d
00007fff`f69c1e50 0f849a000000    je      StringTable2!IsPrefixOfStringInTable_10+0x100 (00007fff`f69c1ef0)
                </pre></small>
                <hr/>
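
                <p>

                    As a reading aid for the listing above, here is a scalar model of what the
                    instructions from the first <code>vmovdqu</code> through to the
                    <code>test</code> compute.  It is a sketch under assumptions, not the author's
                    code: <code>FILTER_TABLE</code> and <code>CandidateSlots</code> are hypothetical
                    names, the struct is a simplified view of the three table fields the filter
                    reads, and the unique index values are assumed to lie in the range 0 to 15.

                </p>

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_COUNT 16

/* Hypothetical, simplified view of the table fields read by the filter. */
typedef struct {
    uint8_t UniqueChars[SLOT_COUNT];    /* one distinguishing character per slot */
    uint8_t UniqueIndex[SLOT_COUNT];    /* where that character sits in a string */
    int8_t  Lengths[SLOT_COUNT];        /* slot lengths; empty slots hold 0x7f   */
} FILTER_TABLE;

/*
 * Returns a 16-bit bitmap with bit i set when slot i has the same unique
 * character as the search string and is not longer than it.  A result of
 * zero corresponds to the early 'return NO_MATCH_FOUND'.
 *
 * `search` must point at 16 readable bytes, mirroring the 128-bit load.
 */
static uint32_t CandidateSlots(const FILTER_TABLE *table,
                               const uint8_t *search,
                               uint8_t searchLength)
{
    uint32_t bitmap = 0;
    int i;

    assert(table && search);

    for (i = 0; i < SLOT_COUNT; i++) {
        /* vpshufb: pick the search byte at this slot's unique index. */
        uint8_t uniqueChar = search[table->UniqueIndex[i] & 0x0f];

        /* vpcmpeqb: does it equal the slot's unique character? */
        int sameChar = (uniqueChar == table->UniqueChars[i]);

        /* vpcmpgtb (signed bytes): is the slot longer than the search string? */
        int tooLong = (table->Lengths[i] > (int8_t)searchLength);

        /* vpxor + vpand + vpmovmskb: keep slots passing both filters. */
        if (sameChar && !tooLong) {
            bitmap |= 1u << i;
        }
    }

    return bitmap;
}
```

                <p>

                    With a table holding "foo" in slot 0 and "bar" in slot 1 (both keyed on their
                    first character), a search for "food" leaves only bit 0 set, while a search
                    for "fo" produces an empty bitmap because slot 0 is longer than the search
                    string.

                </p>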

                <p>

                    There's a bit of cruft there at the start with regard to setting up the
                    function's prologue (pushing non-volatile registers to the stack, etc.).  That's
                    to be expected for C (and C++, and basically every language); as the programmer,
                    you don't really have any <strong>direct</strong> control over how many registers
                    a compiler uses for a routine, how much stack space it uses, which registers it
                    uses when, etc.

                </p>

                <p>

                    However, with assembly, we're on the other end of the spectrum: we can control
                    everything!  We also have a little trick up our sleeves: the venerable
                    <code>LEAF_ENTRY</code>.

                </p>

                <p>

                    First, some background.  The Windows x64 ABI and calling convention dictate that
                    there are two types of functions: <a
                    href="https://github.com/tpn/winsdk-10/blob/master/Include/10.0.16299.0/shared/macamd64.inc#L524">
                    <code>NESTED_ENTRY</code></a> and <a
                    href="https://github.com/tpn/winsdk-10/blob/master/Include/10.0.16299.0/shared/macamd64.inc#L353">
                    <code>LEAF_ENTRY</code></a>.  <code>NESTED_ENTRY</code> is by far the most common; C and C++ functions are
                    all implicitly <code>NESTED_ENTRY</code> functions.  (The <code>LEAF_ENTRY</code> and <code>NESTED_ENTRY</code> symbols
                    are MASM (ml64.exe) macro names, but the concept applies to all languages.)

                </p>

                <p>

                    A <code>LEAF_ENTRY</code> can only be implemented in assembly.  It is constrained in that it
                    may not manipulate any of the non-volatile x64 registers (rbx, rdi, rsi, rsp,
                    rbp, r12, r13, r14, r15, xmm6-15), nor may it <code>call</code> any other functions
                    (because <code>call</code> implicitly modifies the stack pointer), nor may it have a structured
                    exception handler (because handling an exception for a given stack frame also
                    manipulates the stack pointer).

                </p>

                <p>

                    The reason behind all of these constraints is that <code>LEAF_ENTRY</code> routines do not
                    have any unwind information generated for them in their runtime function
                    entries.  Unwind information is used by the kernel to do just that: unwind
                    the modifications made to non-volatile registers whilst traversing back up
                    through the call stack looking for an exception handler in the case of an
                    exception.

                </p>

                <p>

                    For example, here's the function entry and associated unwind information for the
                    PGO build of the IsPrefixOfStringInTable_10 function:

                </p>

                <hr/>
<small><pre>
0:000&gt; .fnent StringTable2!IsPrefixOfStringInTable_10
Debugger function entry 000001d8`2ea03cf8 for:
(00007fff`f8411df0)   StringTable2!IsPrefixOfStringInTable_10
Exact matches:
    StringTable2!IsPrefixOfStringInTable_10 (struct _STRING_TABLE *,
                                             struct _STRING *,
                                             struct _STRING_MATCH *)

BeginAddress      = 00000000`00001df0
EndAddress        = 00000000`00001e59
UnwindInfoAddress = 00000000`000054f8

Unwind info at 00007fff`f84154f8, 14 bytes
  version 1, flags 0, prolog 14, codes 8
  00: offs 14, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 58 reg: rsi.
  02: offs 14, unwind op 4, op info 5   UWOP_SAVE_NONVOL FrameOffset: 50 reg: rbp.
  04: offs 14, unwind op 2, op info 3   UWOP_ALLOC_SMALL.
  05: offs 10, unwind op 0, op info f   UWOP_PUSH_NONVOL reg: r15.
  06: offs e, unwind op 0, op info e    UWOP_PUSH_NONVOL reg: r14.
  07: offs c, unwind op 0, op info d    UWOP_PUSH_NONVOL reg: r13.
</pre></small>
                <hr/>

                <p>

                    We can see that this routine manipulates 6 non-volatile registers in total,
                    including the stack pointer.  The first instructions of the routine constitute
                    the function's prologue; in the disassembly, you can see that three of the rxx
                    registers are pushed to the stack and then 0x20 (32) bytes of stack space is
                    allocated:

                </p>
                <hr/>
<small><pre>
0:000&gt; uf StringTable2!IsPrefixOfStringInTable_10
StringTable2!IsPrefixOfStringInTable_10:
00007fff`f69c1df0 48896c2418      mov     qword ptr [rsp+18h],rbp
00007fff`f69c1df5 4889742420      mov     qword ptr [rsp+20h],rsi
00007fff`f69c1dfa 4155            push    r13
00007fff`f69c1dfc 4156            push    r14
00007fff`f69c1dfe 4157            push    r15
00007fff`f69c1e00 4883ec20        sub     rsp,20h
</pre></small>
                <hr/>
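
                <p>

                    The numbers in the unwind info can be cross-checked against the prologue with a
                    little arithmetic.  The sketch below assumes the documented encoding in which a
                    <code>UWOP_ALLOC_SMALL</code> entry allocates <code>(op info * 8) + 8</code>
                    bytes; the helper names (<code>AllocSmallBytes</code>,
                    <code>PrologueBytes</code>, <code>FrameOffset</code>) are hypothetical and exist
                    only for illustration.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/* Unwind operation codes that appear in the .fnent listing. */
enum {
    UWOP_PUSH_NONVOL = 0,   /* push of a non-volatile register (8 bytes)   */
    UWOP_ALLOC_SMALL = 2,   /* stack allocation of (op info * 8) + 8 bytes */
    UWOP_SAVE_NONVOL = 4    /* register stored with mov instead of push    */
};

/* Bytes allocated by a UWOP_ALLOC_SMALL entry with the given op info. */
static uint32_t AllocSmallBytes(uint32_t opInfo)
{
    assert(opInfo <= 15);   /* op info is a 4-bit field */
    return opInfo * 8 + 8;
}

/* Total amount the prologue lowers rsp: the pushes plus the allocation. */
static uint32_t PrologueBytes(uint32_t pushCount, uint32_t allocOpInfo)
{
    return pushCount * 8 + AllocSmallBytes(allocOpInfo);
}

/*
 * A register stored at [rsp + entryOffset] before any pushes ends up at
 * this offset from rsp once the prologue has finished.
 */
static uint32_t FrameOffset(uint32_t entryOffset, uint32_t prologueBytes)
{
    return entryOffset + prologueBytes;
}
```

                <p>

                    Plugging in the listing's values: op info 3 gives 0x20 bytes, three pushes plus
                    that allocation lower <code>rsp</code> by 0x38 bytes, and the registers stored
                    at <code>[rsp+18h]</code> and <code>[rsp+20h]</code> on entry therefore sit at
                    frame offsets 0x50 (rbp) and 0x58 (rsi), matching the
                    <code>UWOP_SAVE_NONVOL</code> lines.

                </p>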

                <p>

                    It also cheekily uses the home parameter space for stashing rbp and rsi, instead
                    of pushing them to the stack.  That's fair game though; this is the PGO build,
                    and I'd expect it to use some extra tricks for shaving off a few more cycles
                    here and there.  I'd do the same thing if I were writing assembly.  (Side note:
                    if you view the source of this page, there's a commented-out section below that
                    depicts the runtime function entry for the release build of version 10; it uses
                    9 registers instead of 6, and 40 bytes of stack space instead of 32.  I wrote it
                    before switching to using the PGO build for everything.)

                </p>

                <p>

                    The home parameter space is a 32-byte area that immediately follows the return
                    address (i.e. it starts 8 bytes above the value of <code>rsp</code> when the
                    function is entered); it is mandated by the x64 calling convention on Windows,
                    and is primarily intended to provide some scratch space for a routine to
                    <em>home</em> its parameter registers (i.e. the registers used for the first four
                    arguments of a function: rcx, rdx, r8 and r9).  This allows the four volatile
                    registers to be repurposed within a routine, but also have a way to refer to the
                    parameters again if need be.  At least, that's what it was intended for; however,
                    it's not something that is enforced, so you can basically treat the area as a
                    free 32-byte scratch area if you're writing assembly.

                </p>
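
                <p>

                    To make the layout concrete, the following sketch computes where each home slot
                    lives relative to <code>rsp</code> at function entry, assuming the usual layout
                    in which <code>[rsp]</code> holds the return address and the four 8-byte slots
                    follow it in parameter order.  <code>HomeSlotOffset</code> and
                    <code>HomeSlotRegister</code> are hypothetical helper names.

                </p>

```c
#include <assert.h>
#include <stdint.h>

enum { HOME_SLOT_COUNT = 4, HOME_SPACE_BYTES = 32 };

/*
 * Offset from rsp (at function entry) of the home slot for integer
 * parameter 0..3.  [rsp] is the return address, so the slots start at +8.
 */
static uint32_t HomeSlotOffset(uint32_t parameterIndex)
{
    assert(parameterIndex < HOME_SLOT_COUNT);
    return 8 + parameterIndex * 8;
}

/* Name of the parameter register whose home slot sits at [rsp + entryOffset]. */
static const char *HomeSlotRegister(uint32_t entryOffset)
{
    static const char *const names[HOME_SLOT_COUNT] = { "rcx", "rdx", "r8", "r9" };

    assert(entryOffset >= 8);
    assert(entryOffset < 8 + HOME_SPACE_BYTES);
    assert(entryOffset % 8 == 0);

    return names[(entryOffset - 8) / 8];
}
```

                <p>

                    Under that layout, the PGO prologue's stores to <code>[rsp+18h]</code> and
                    <code>[rsp+20h]</code> land in the home slots nominally reserved for r8 and r9,
                    which is exactly the "free scratch area" usage described above.

                </p>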

                <p>

                    <small>
                    (On a semi-related note, I'd highly recommend reading
                    <a href="https://github.com/tpn/pdfs/blob/4d2296269d3737b649def585a19eb103cda9c3d0/A%20History%20of%20Modern%2064-bit%20Computing%20-%20Feb%202007%20(CSEP590A).pdf">
                    A History of Modern 64-bit Computing</a> if you have some spare time, it's a
                    fascinating insight into contemporary x64 conventions we often take for granted,
                    drawing on numerous interviews with industry luminaries like Dave Cutler and Linus
                    Torvalds.  I found it incredibly useful for understanding the <em>why</em>
                    behind things like home parameter space, structured exception handling, runtime
                    function entries, and why you can't write inline assembly for x64 with MSVC
                    anymore &mdash; it provides a direct vector for obliterating the mechanisms
                    relied upon by the kernel stack unwinding functionality.  (At least, I think
                    that's the reason &mdash; can anyone from Microsoft confirm?))
                    </small>

                </p>

<!--
    This is the runtime function entry for the release build of version 10, including follow-up
    commentary.  I wrote this before switching to the PGO version for everything.

<small><pre>
0:000&gt; .fnent StringTable2!IsPrefixOfStringInTable_10
Debugger function entry 000001f9`048edf98 for:
Exact matches:
    StringTable2!IsPrefixOfStringInTable_10 (struct _STRING_TABLE *, struct _STRING *, struct _STRING_MATCH *)

BeginAddress      = 00000000`00001fe0
EndAddress        = 00000000`00002165
UnwindInfoAddress = 00000000`00004370

Unwind info at 00007ffd`15594370, 1e bytes
  version 1, flags 0, prolog 81, codes d
  00: offs 81, unwind op 4, op info 7   UWOP_SAVE_NONVOL FrameOffset: 20 reg: rdi.
  02: offs 7c, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 60 reg: rsi.
  04: offs 77, unwind op 4, op info 5   UWOP_SAVE_NONVOL FrameOffset: 58 reg: rbp.
  06: offs 72, unwind op 4, op info 3   UWOP_SAVE_NONVOL FrameOffset: 50 reg: rbx.
  08: offs c, unwind op 2, op info 4    UWOP_ALLOC_SMALL.
  09: offs 8, unwind op 0, op info f    UWOP_PUSH_NONVOL reg: r15.
  0a: offs 6, unwind op 0, op info e    UWOP_PUSH_NONVOL reg: r14.
  0b: offs 4, unwind op 0, op info d    UWOP_PUSH_NONVOL reg: r13.
  0c: offs 2, unwind op 0, op info c    UWOP_PUSH_NONVOL reg: r12.
0:000&gt;
</pre></small>
                <hr/>

                <p>

                    We can see that the C version of our routine manipulates 9 non-volatile
                    registers in total, including the stack pointer.  The first instructions
                    of the C version constitute the function's prologue; in the disassembly,
                    you can see that four of the rxx registers are pushed to the stack and then
                    0x28 (40) bytes of stack space is allocated:

                </p>

                <hr/>
<small><pre>
0:000&gt; uf StringTable2!IsPrefixOfStringInTable_10
StringTable2!IsPrefixOfStringInTable_10:
00007ffd`15591fe0 4154            push    r12
00007ffd`15591fe2 4155            push    r13
00007ffd`15591fe4 4156            push    r14
00007ffd`15591fe6 4157            push    r15
00007ffd`15591fe8 4883ec28        sub     rsp,28h
--- END OF PROLOGUE ---
00007ffd`15591fec c5fa6f5920      vmovdqu xmm3,xmmword ptr [rcx+20h]
</pre></small>
                <hr/>
-->

                <hr/>
                <a class="xref" name="round1-assembly"></a>
                <a class="xref" name="IsPrefixOfStringInTable_x64_1"></a>
                <h2>IsPrefixOfStringInTable_x64_1</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_2">IsPrefixOfStringInTable_x64_2  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    So, knowing what we now know about the venerable little <code>LEAF_ENTRY</code> trick, let's
                    see if we can construct a simple routine in assembly that just deals with the
                    negative match case.

                </p>

<pre class="code"><code class="language-nasm">;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_1, _TEXT$00

        ;IACA_VC_START

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result back into xmm0.
;

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm0, xmm0, xmm1

;
; Load the string table's unique character array into xmm2, and the lengths for
; each string slot into xmm3.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm5 to all ones.  This is used later.
;

        vpcmpeqq    xmm5, xmm5, xmm5                    ; Set xmm5 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's unique character array (xmm0) against the string
; table's unique chars (xmm2), saving the result back into xmm0.
;

        vpcmpeqb    xmm0, xmm0, xmm2            ; Compare unique chars.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm4, xmm3            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm5            ; Invert the result.

;
; Intersect-via-test xmm0 and xmm1 to identify string slots of a suitable
; length with a matching unique character.
;

        vptest      xmm0, xmm1                  ; Check for no match.
        ;jnz        short @F                    ; There was a match.
                                                ; (Not yet implemented.)

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ;
        not         al                          ; rax = -1
        ret

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_1, _TEXT$00

; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
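                <p>

                    If the vector instructions above are hard to follow, a scalar model may help.
                    The following is a minimal sketch in C, not the article's implementation: it
                    walks the 16 lanes one at a time and produces the same candidate mask that the
                    shuffle, compare and test sequence is meant to produce.  The
                    <code>ModelStringTable</code> struct, the example table and the helper names
                    are all hypothetical, and giving unused slots a length of 0x7f is an assumption
                    made here so they can never pass the length check.  The length comparison
                    follows the intent stated in the comments (slot longer than search string),
                    which is the operand order used in version 2.

                </p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the article's STRING_TABLE. */
typedef struct {
    uint8_t unique_chars[16];  /* One distinguishing character per slot.    */
    uint8_t unique_index[16];  /* Offset of that character within its slot. */
    uint8_t lengths[16];       /* Length of the string held in each slot.   */
} ModelStringTable;

/*
 * Scalar model of the fast path.  Returns a 16-bit mask with bit i set when
 * slot i has a matching unique character and is not longer than the search
 * string.  A result of zero corresponds to the "no match" early exit.
 */
static uint16_t candidate_slots(const ModelStringTable *table,
                                const uint8_t search[16],
                                uint8_t search_length)
{
    uint16_t bitmap = 0;

    for (int i = 0; i < 16; i++) {
        uint8_t index = table->unique_index[i];

        /* vpshufb: a set high bit in the index selects zero. */
        uint8_t shuffled = (index & 0x80) ? 0 : search[index & 0x0f];

        /* vpcmpeqb: does the search string have this slot's unique char? */
        int chars_equal = (shuffled == table->unique_chars[i]);

        /* vpcmpgtb is a signed byte compare; the vpxor then inverts it. */
        int slot_too_long =
            ((int8_t)table->lengths[i] > (int8_t)search_length);

        if (chars_equal && !slot_too_long) {
            bitmap |= (uint16_t)(1u << i);
        }
    }

    return bitmap;
}

/* Example table holding "foo", "bar" and "bazaar" in slots 0, 1 and 2. */
static ModelStringTable example_table(void)
{
    ModelStringTable t;

    memset(t.unique_chars, 0, sizeof(t.unique_chars));
    memset(t.unique_index, 0, sizeof(t.unique_index));
    memset(t.lengths, 0x7f, sizeof(t.lengths));   /* Assumption: unused. */

    t.unique_chars[0] = 'f'; t.unique_index[0] = 0; t.lengths[0] = 3;
    t.unique_chars[1] = 'b'; t.unique_index[1] = 0; t.lengths[1] = 3;
    t.unique_chars[2] = 'z'; t.unique_index[2] = 2; t.lengths[2] = 6;

    return t;
}

/* Convenience wrapper: zero-pads the search string to 16 bytes first. */
static uint16_t example_candidates(const char *search)
{
    ModelStringTable table = example_table();
    uint8_t buffer[16] = {0};
    size_t length = strlen(search);

    assert(length < 0x7f);
    memcpy(buffer, search, length < 16 ? length : 16);

    return candidate_slots(&table, buffer, (uint8_t)length);
}
```

                <p>

                    With this example table, <code>example_candidates("quux")</code> returns zero,
                    which is the case the assembly routine above short-circuits on.  A non-zero
                    mask only says a slot is worth a closer look; "bazaars" flags both "bar" and
                    "bazaar" even though only the latter is a real prefix.

                </p>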

                <p>

                    Note how we don't need to push anything to the stack, as we didn't modify any
                    non-volatile registers.  If an exception occurs within the body of our
                    implementation (say we dereference a NULL pointer), the kernel knows it doesn't
                    have to undo any non-volatile register modifications (using offsets specified by
                    the unwind information) because there aren't any unwind codes to process.  It
                    can simply advance to the frame before us (the return address sits at rsp at
                    the time of the fault, so the caller's frame starts 8 bytes above that) as it
                    continues its search for runtime function entries and associated unwind
                    information.  As you can see, the unwind info is effectively empty:

                </p>

                <hr/>
<small><pre>
0:000&gt; .fnent StringTable2!IsPrefixOfStringInTable_x64_1
Debugger function entry 000001f9`048edf98 for:
Exact matches:
    StringTable2!IsPrefixOfStringInTable_x64_1 (void)

BeginAddress      = 00000000`00003290
EndAddress        = 00000000`000032cb
UnwindInfoAddress = 00000000`00004468

Unwind info at 00007ffd`15594468, 4 bytes
  version 1, flags 0, prolog 0, codes 0
</pre></small>
                <hr/>

                <p>

                    Let's see how this scrappy little fellow (who always returns NO_MATCH_FOUND but
                    still mimics the steps required to successfully negative match) does against the
                    leading C implementation at this point, version 10:

                </p>

                <p>
                    <a href="Benchmark-x64-01-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-01-v1.svg"/>
                    </a>
                </p>

                <p>

                    Fwoah, look at that, we've shaved about three cycles off the C version!

                </p>

                <small><p>

                    (Note that when I first wrote this, I was comparing the assembly version against
                    the release build (not the PGO build), which was clocking in at about 13-14
                    cycles for negative matching.  So getting it down to ~7.5 from 13-14 was a bit
                    more exciting.  Damn the PGO build and its 10.9-ish cycles for negative
                    matching!)

                </p></small>

                <p>

                    The good news is that our theory about the performance of the <code>LEAF_ENTRY</code> looks
                    like it's paid off: we can reliably get about 7.5 cycles for negative matching.

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_2"></a>
                <h2>IsPrefixOfStringInTable_x64_2</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_1"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_x64_1</a> |
                    <a href="#IsPrefixOfStringInTable_x64_3">IsPrefixOfStringInTable_x64_3  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    The bad news is that we now need to implement the rest of the functionality
                    within the constraints of a <code>LEAF_ENTRY</code>!

                </p>

                <p>

                    The problem with a <code>LEAF_ENTRY</code> for anything more than a trivial bit of code is
                    that you only have a handful of volatile registers to work with, and no stack
                    space can be used for register spilling or temporaries.  (Technically I could
                    use the home parameter space, but, eh, we're already avoiding stack spills, why
                    not make life harder for ourselves and try to avoid <strong>all</strong> memory
                    spilling.)

                </p>

                <p>

                    If you can't spill to memory, your only option is really spilling to XMM
                    registers via <code>vpinsr</code> and <code>vpextr</code> combinations, which,
                    as you can see in the implementation of version 2 below, I have to do a lot.
                </p>
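                <p>

                    To make the stashing idea concrete, here is a small sketch in C that models a
                    128-bit register as 16 raw bytes and mimics the insert and extract operations
                    at byte, dword and qword granularity.  The <code>ModelXmm</code> type and the
                    helper names are hypothetical; the lane assignments follow the register
                    summary given in the comments of the version 2 listing (the match pointer in
                    the low qword of xmm5, the bitmap and slot index in dwords 2 and 3, and the
                    two lengths in the low bytes of xmm4).

                </p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 128-bit register modelled as 16 raw bytes (hypothetical helper type). */
typedef struct { uint8_t bytes[16]; } ModelXmm;

/* Byte lanes: the moral equivalent of vpinsrb / vpextrb. */
static void insert_byte(ModelXmm *x, uint8_t value, unsigned lane)
{
    assert(lane < 16);
    x->bytes[lane] = value;
}

static uint8_t extract_byte(const ModelXmm *x, unsigned lane)
{
    assert(lane < 16);
    return x->bytes[lane];
}

/* Dword lanes: the moral equivalent of vpinsrd / vpextrd. */
static void insert_dword(ModelXmm *x, uint32_t value, unsigned lane)
{
    assert(lane < 4);
    memcpy(&x->bytes[lane * 4], &value, sizeof(value));
}

static uint32_t extract_dword(const ModelXmm *x, unsigned lane)
{
    uint32_t value;
    assert(lane < 4);
    memcpy(&value, &x->bytes[lane * 4], sizeof(value));
    return value;
}

/* Qword lanes: the moral equivalent of vpinsrq / vpextrq. */
static void insert_qword(ModelXmm *x, uint64_t value, unsigned lane)
{
    assert(lane < 2);
    memcpy(&x->bytes[lane * 8], &value, sizeof(value));
}

static uint64_t extract_qword(const ModelXmm *x, unsigned lane)
{
    uint64_t value;
    assert(lane < 2);
    memcpy(&value, &x->bytes[lane * 8], sizeof(value));
    return value;
}

/*
 * Mirrors the xmm5 layout: qword 0 holds the match pointer, dword 2 the
 * bitmap and dword 3 the current slot index.  Returns 1 when all three
 * values come back out unchanged, i.e. the lanes don't overlap.
 */
static int stash_round_trips(uint64_t match_pointer,
                             uint32_t bitmap,
                             uint32_t index)
{
    ModelXmm xmm5;

    memset(&xmm5, 0, sizeof(xmm5));          /* Clear the register.       */
    insert_qword(&xmm5, match_pointer, 0);   /* Stash the third parameter. */
    insert_dword(&xmm5, bitmap, 2);          /* Stash the bitmap.          */
    insert_dword(&xmm5, index, 3);           /* Stash the slot index.      */

    return extract_qword(&xmm5, 0) == match_pointer
        && extract_dword(&xmm5, 2) == bitmap
        && extract_dword(&xmm5, 3) == index;
}

/*
 * Mirrors the xmm4 layout: byte 0 holds the search string's length and
 * byte 1 holds min(length, 16).  Returns the value read back from byte 1.
 */
static uint8_t stashed_search_length(uint8_t length)
{
    ModelXmm xmm4;

    memset(&xmm4, 0, sizeof(xmm4));
    insert_byte(&xmm4, length, 0);
    insert_byte(&xmm4, length > 16 ? 16 : length, 1);

    return extract_byte(&xmm4, 1);
}
```

                <p>

                    The point of the model is simply that each stashed value owns its own lanes, so
                    writing one never disturbs another.  What it doesn't capture is the cost: every
                    insert and extract is an extra instruction on the critical path, which is part
                    of why this approach gets tedious quickly.

                </p>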

                <small><p>

                    (Also note: when I wrote this version, I didn't use the disassembly from the C
                    routines for guidance.  I find that as soon as you start to grok the disassembly
                    for a given routine, it becomes harder to think of ways to approach it from a
                    fresh angle.  Also, the <code>LEAF_ENTRY</code> aspect significantly limited what I could do
                    anyway, so I figured I may as well just give it a crack from scratch and see
                    what I could come up with.  It would be an interesting point of reference
                    compared to a future iteration that tries to improve on the disassembly of an
                    optimized PGO version, for example.)

                </p></small>

                <p>

                    The diff view for this routine is less useful given the vast majority of the
                    code is new, so I've put the full version of the code first.  It's based more or
                    less on the approach used by version 8 of the C routine.  (I actually wrote it
                    after I wrote version 8; versions 9 and 10 of the C routine came later, with
                    version 10 being the one with the improved loop logic.)

                </p>
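                <p>

                    As a rough map of the control flow in the assembly that follows, here is a
                    scalar sketch in C of the main comparison loop.  It is not version 8 of the C
                    routine; it's a simplified model written for this explanation.  The flat
                    <code>EXAMPLE_STRINGS</code> array, the function names and the choice to loop
                    until the bitmap is empty (the assembly uses a popcnt-derived counter instead)
                    are all assumptions.  The model also assumes the search string contains no
                    embedded NUL bytes.

                </p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NO_MATCH_FOUND (-1)

/* Hypothetical table contents; unused slots are NULL. */
static const char *const EXAMPLE_STRINGS[16] = {
    "foo", "bar", "bazaar", "a_prefix_longer_than_16"
};

/* Portable stand-in for tzcnt; the caller guarantees bitmap != 0. */
static unsigned count_trailing_zeros(uint32_t bitmap)
{
    unsigned count = 0;

    assert(bitmap != 0);
    while ((bitmap & 1u) == 0) {
        bitmap >>= 1;
        count++;
    }
    return count;
}

/*
 * Walks the candidate bitmap from the lowest set bit upwards and returns the
 * index of the first slot whose string is a prefix of the search string, or
 * NO_MATCH_FOUND if none is.
 */
static int first_prefix_match(const char *const strings[16],
                              uint32_t bitmap,
                              const char *search)
{
    size_t search_length = strlen(search);
    size_t search16 = search_length < 16 ? search_length : 16;
    unsigned shift = 0;

    assert(bitmap <= 0xffffu);

    while (bitmap != 0) {
        /* tzcnt plus the running shift gives the slot index. */
        unsigned zeros = count_trailing_zeros(bitmap);
        unsigned index = zeros + shift;
        bitmap >>= (zeros + 1);
        shift = index + 1;

        const char *slot = strings[index];
        assert(slot != NULL);
        size_t slot_length = strlen(slot);

        /* Count equal bytes within the first min(length, 16) positions;
         * this models the compare, bzhi and popcnt sequence.  Slots
         * shorter than 16 bytes are treated as zero padded. */
        size_t matched = 0;
        for (size_t i = 0; i < search16; i++) {
            char slot_byte = (i < slot_length) ? slot[i] : '\0';
            if (slot_byte == search[i]) {
                matched++;
            }
        }

        if (slot_length <= 16) {
            /* Short slot: every one of its characters must have matched. */
            if (matched == slot_length) {
                return (int)index;
            }
        } else if (matched == 16 && slot_length <= search_length) {
            /* Long slot: compare the bytes beyond the first 16. */
            if (memcmp(slot + 16, search + 16, slot_length - 16) == 0) {
                return (int)index;
            }
        }
    }

    return NO_MATCH_FOUND;
}

/* Convenience wrapper that treats all four example slots as candidates;
 * a real bitmap would normally be narrower than this. */
static int example_lookup(const char *search)
{
    return first_prefix_match(EXAMPLE_STRINGS, 0x000fu, search);
}
```

                <p>

                    For instance, <code>example_lookup("bazaars")</code> rejects "foo" and "bar"
                    before accepting slot 2, and a 24-character search string beginning with
                    "a_prefix_longer_than_16" exercises the byte-by-byte tail comparison.  The
                    assembly has to do all of this without spilling to the stack, which is where
                    most of its extra bulk comes from.

                </p>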

                <div class="tab-box language box-x64-2v1">
                    <ul class="tabs">
                        <li data-content="content-x64-2-full">Full</li>
                        <li data-content="content-x64-2v1-diff">Diff</li>
                    </ul>
                    <div class="content">
<pre class="code content-x64-2-full"><code class="language-nasm">;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether the search string "starts with or is
;   equal to" any string in the table.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (later saved to xmm5d[2]).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap initially, then slot length.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
        xor         edx, edx                    ; Clear edx.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
; it, we can load rax (the current index) into xmm1 and use it to extract the
; slot length via shuffle.  (The length will be in the lowest byte of xmm1
; after the shuffle, which we can then vpextrb.)
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  If the slot length is greater than 16, we need
; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
; then great, that's a slot match, we're done.
;

@@:     cmp         dl, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt; 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the slot; if equal, this is a match, if not, no match, continue.
;

Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
        jne         @F                          ; No match found.
        jmp         short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rdx at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rdx, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rdx, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rdx                ; Free up rdx, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        xor         eax, eax                ; Clear eax.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement cx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00

; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
<pre class="code content-x64-2v1-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_1.asm IsPrefixOfStringInTable_x64_2.asm
--- IsPrefixOfStringInTable_x64_1.asm   2018-04-29 11:03:46.403568800 -0400
+++ IsPrefixOfStringInTable_x64_2.asm   2018-04-26 14:15:53.805409700 -0400
@@ -50,12 +50,12 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_1, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
-; result back into xmm0.
+; result into xmm5.
 ;

         ;IACA_VC_START
@@ -63,34 +63,36 @@
         mov     rax, String.Buffer[rdx]
         vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
         vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
-        vpshufb xmm0, xmm0, xmm1
+        vpshufb xmm5, xmm0, xmm1

 ;
-; Load the string table's unique character array into xmm2, and the lengths for
-; each string slot into xmm3.
-;
+; Load the string table's unique character array into xmm2.

         vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
-        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

 ;
-; Set xmm5 to all ones.  This is used later.
+; Compare the search string's unique character array (xmm5) against the string
+; table's unique chars (xmm2), saving the result back into xmm5.
 ;

-        vpcmpeqq    xmm5, xmm5, xmm5                    ; Set xmm5 to all ones.
+        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

 ;
-; Broadcast the byte-sized string length into xmm4.
+; Load the lengths of each string table slot into xmm3.
 ;
+        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

-        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.
+;
+; Set xmm2 to all ones.  We use this later to invert the length comparison.
+;
+
+        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

 ;
-; Compare the search string's unique character array (xmm0) against the string
-; table's unique chars (xmm2), saving the result back into xmm0.
+; Broadcast the byte-sized string length into xmm4.
 ;

-        vpcmpeqb    xmm0, xmm0, xmm2            ; Compare unique chars.
+        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

 ;
 ; Compare the search string's length, which we've broadcasted to all 8-bit
@@ -100,30 +102,378 @@
 ; a slot with a length less than or equal to our search string's length.
 ;

-        vpcmpgtb    xmm1, xmm4, xmm3            ; Identify long slots.
-        vpxor       xmm1, xmm1, xmm5            ; Invert the result.
+        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
+        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

 ;
-; Intersect-and-test the unique character match xmm mask register (xmm0) with
+; Intersect-and-test the unique character match xmm mask register (xmm5) with
 ; the length match mask xmm register (xmm1).  This affects flags, allowing us
 ; to do a fast-path exit for the no-match case (where ZF = 1).
 ;

-        vptest      xmm0, xmm1                  ; Check for no match.
-        ;jnz        short @F                    ; There was a match.
-                                                ; (Not yet implemented.)
+        vptest      xmm5, xmm1                  ; Check for no match.
+        jnz         short Pfx10                 ; There was a match.

 ;
 ; No match, set rax to -1 and return.
 ;

-        xor         eax, eax                    ;
-        not         al                          ; rax = -1
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+        ;IACA_VC_END
+
+;
+; (There was at least one match, continue with processing.)
+;
+
+;
+; Calculate the "search length" for the incoming search string, which is
+; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
+; currently lives in xmm4, albeit as a byte-value broadcasted across the
+; entire register, so extract that first.)
+;
+; Once the search length is calculated, deposit it back at the second byte
+; location of xmm4.
+;
+;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
+;
+;   r11 - String length (String-&gt;Length)
+;
+
+Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
+        mov         rax, 16                     ; Load 16 into rax.
+        mov         r10, r11                    ; Copy into r10.
+        cmp         r10w, ax                    ; Compare against 16.
+        cmova       r10w, ax                    ; Use 16 if length is greater.
+        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].
+
+;
+; Home our parameter registers into xmm registers instead of their stack-backed
+; location, to avoid memory writes.
+;
+
+        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
+        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
+        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].
+
+;
+; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
+; yielding a mask identifying indices we need to perform subsequent matches
+; upon.  Convert this into a bitmap in edx (later saved to xmm5d[2]).
+;
+
+        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
+        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.
+
+;
+; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
+;
+
+        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
+        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].
+
+;
+; Summary of xmm register stashing for the rest of the routine:
+;
+; xmm2:
+;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
+;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
+;
+; xmm4:
+;       0:7     (vpinsrb 0)     length of search string
+;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
+;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
+;      24:31    (vpinsrb 3)     shift count
+;
+; xmm5:
+;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
+;      64:95    (vpinsrd 2)     bitmap of slots to compare
+;      96:127   (vpinsrd 3)     index of slot currently being processed
+;
+
+;
+; Initialize rcx as our counter register by doing a popcnt against the bitmap
+; we just generated in edx, and clear our shift count register (r9).
+;
+
+        popcnt      ecx, edx                    ; Count bits in bitmap.
+        xor         r9, r9                      ; Clear r9.
+
+        align 16
+
+;
+; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
+; trailing zeros of the bitmap, and then add in the shift count, producing an
+; index (rax) we can use to load the corresponding slot.
+;
+; Register usage at top of loop:
+;
+;   rax - Index.
+;
+;   rcx - Loop counter.
+;
+;   rdx - Bitmap initially, then slot length.
+;
+;   r9 - Shift count.
+;
+;   r10 - Search length.
+;
+;   r11 - String length.
+;
+
+Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
+        mov         eax, r8d                    ; Copy tzcnt to rax,
+        add         rax, r9                     ; Add shift to create index.
+        inc         r8                          ; tzcnt + 1
+        shrx        rdx, rdx, r8                ; Reposition bitmap.
+        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
+        xor         edx, edx                    ; Clear edx.
+        mov         r9, rax                     ; Copy index back to shift.
+        inc         r9                          ; Shift = Index + 1
+        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].
+
+;
+; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
+; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
+;
+; Then, load the string table slot at this index into xmm1, then shift rax back.
+;
+
+        shl         eax, 4
+        vpextrq     r8, xmm2, 0
+        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
+        shr         eax, 4
+
+;
+; The search string's first 16 characters are already in xmm0.  Compare this
+; against the slot that has just been loaded into xmm1, storing the result back
+; into xmm1.
+;
+
+        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.
+
+;
+; Convert the XMM mask into a 32-bit representation, then zero high bits after
+; our "search length", which allows us to ignore the results of the comparison
+; above for bytes that were after the search string's length, if applicable.
+; Then, count the number of bits remaining, which tells us how many characters
+; we matched.
+;
+
+        vpmovmskb   r8d, xmm1                   ; Convert into mask.
+        bzhi        r8d, r8d, r10d              ; Zero high bits.
+        popcnt      r8d, r8d                    ; Count bits.
+
+;
+; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
+; it, we can load rax (the current index) into xmm1 and use it to extract the
+; slot length via shuffle.  (The length will be in the lowest byte of xmm1
+; after the shuffle, which we can then vpextrb.)
+;
+
+        movd        xmm1, rax                   ; Load index into xmm1.
+        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
+        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.
+
+;
+; If 16 characters matched, and the search string's length is longer than 16,
+; we're going to need to do a comparison of the remaining strings.
+;
+
+        cmp         r8w, 16                     ; Compare chars matched to 16.
+        je          short @F                    ; 16 chars matched.
+        jmp         Pfx30                       ; Less than 16 matched.
+
+;
+; All 16 characters matched.  If the slot length is greater than 16, we need
+; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
+; then great, that's a slot match, we're done.
+;
+
+@@:     cmp         dl, 16                      ; Compare length to 16.
+        ja          Pfx50                       ; Length is &gt; 16.
+        je          short Pfx35                 ; Lengths match!
+                                                ; Length &lt;= 16, fall through...
+
+;
+; Less than or equal to 16 characters were matched.  Compare this against the
+; length of the slot; if equal, this is a match, if not, no match, continue.
+;
+
+Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
+        jne         @F                          ; No match found.
+        jmp         short Pfx35                 ; Match found!
+
+;
+; No match against this slot, decrement counter and either continue the loop
+; or terminate the search and return no match.
+;
+
+@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
+        dec         cx                          ; Decrement counter.
+        jnz         Pfx20                       ; cx != 0, continue.
+
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+;
+; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
+; former is used when we need to copy the number of characters matched from r8
+; back to rax.  The latter jump target doesn't require this.
+;
+
+Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.
+
+;
+; Load the match parameter back into r8 and test to see if it's not-NULL, in
+; which case we need to fill out a STRING_MATCH structure for the match.
+;
+
+Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
+        test        r8, r8                      ; Is NULL?
+        jnz         short @F                    ; Not zero, need to fill out.
+
+;
+; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
+;
+
+        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
+        ret                                     ; StringMatch == NULL, finish.
+
+;
+; StringMatch is not NULL.  Fill out characters matched (currently rax), then
+; reload the index from xmm5 into rax and save.
+;
+
+@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
+        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
+        mov         byte ptr StringMatch.Index[r8], al
+
+;
+; Final step, loading the address of the string in the string array.  This
+; involves going through the StringTable, so we need to load that parameter
+; back into rcx, then resolving the string array address via pStringArray,
+; then the relevant STRING offset within the StringArray.Strings structure.
+;
+
+        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
+        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.
+
+        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
+        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
+        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
+        shr         eax, 4                  ; Revert the scaling.
+
         ret

+;
+; 16 characters matched and the length of the underlying slot is greater than
+; 16, so we need to do a little memory comparison to determine if the search
+; string is a prefix match.
+;
+; The slot length is stored in rax at this point, and the search string's
+; length is stored in r11.  We know that the search string's length will
+; always be longer than or equal to the slot length at this point, so, we
+; can subtract 16 (currently stored in r10) from rax, and use the resulting
+; value as a loop counter, comparing the search string with the underlying
+; string slot byte-by-byte to determine if there's a match.
+;
+
+Pfx50:  sub         rdx, r10                ; Subtract 16 from search length.
+
+;
+; Free up some registers by stashing their values into various xmm offsets.
+;
+
+        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
+        mov         rcx, rdx                ; Free up rdx, rcx is now counter.
+
+;
+; Load the search string buffer and advance it 16 bytes.
+;
+
+        vpextrq     r11, xmm2, 1            ; Extract String into r11.
+        mov         r11, String.Buffer[r11] ; Load buffer address.
+        add         r11, r10                ; Advance buffer 16 bytes.
+
+;
+; Loading the slot is more involved as we have to go to the string table, then
+; the pStringArray pointer, then the relevant STRING offset within the string
+; array (which requires re-loading the index from xmm5d[3]), then the string
+; buffer from that structure.
+;
+
+        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
+        mov         r8, StringTable.pStringArray[r8] ; Load string array.
+
+        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
+
+        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
+        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
+        add         r8, r10                 ; Advance buffer 16 bytes.
+
+        xor         eax, eax                ; Clear eax.
+
+;
+; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
+; Do a byte-by-byte comparison.
+;
+
+        align 16
+@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
+        jne         short Pfx60                 ; If not equal, jump.
+
+;
+; The two bytes were equal, update rax, decrement rcx and potentially continue
+; the loop.
+;
+
+        inc         ax                          ; Increment index.
+        loopnz      @B                          ; Decrement cx and loop back.
+
+;
+; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
+; how many characters we matched, and then jump to Pfx40 for finalization.
+;
+
+        add         rax, r10
+        jmp         Pfx40
+
+;
+; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
+; it.  If it's zero, we have no more strings to compare, so we can do a quick
+; exit.  If there are still comparisons to be made, restore the other registers
+; we trampled then jump back to the start of the loop Pfx20.
+;
+
+Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
+        dec         cx                          ; Decrement counter.
+        jnz         short @F                    ; Jump forward if not zero.
+
+;
+; No more comparisons remaining, return.
+;
+
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+;
+; More comparisons remain; restore the registers we clobbered and continue loop.
+;
+
+@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
+        vpextrb     r11, xmm4, 0                ; Restore r11.
+        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
+        jmp         Pfx20                       ; Continue comparisons.
+
         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_1, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>

                    </div>
                </div>
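
                <p>

                    To make the main comparison loop in the listing above easier to follow, here
                    is a rough C model of its two bit tricks: walking the candidate bitmap with a
                    trailing-zero count plus a running shift (the <code>tzcnt</code> and
                    <code>shrx</code> pair at <code>Pfx20</code>), and counting matched characters
                    by zeroing the bits at or beyond the search length (<code>bzhi</code> followed
                    by <code>popcnt</code>).  The function names are hypothetical, and the helpers
                    are portable stand-ins for the instructions; treat it as a sketch, not as the
                    routine's actual code.

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;

/* Portable stand-in for tzcnt; x must be non-zero. */
static unsigned count_trailing_zeros(uint32_t x)
{
    unsigned n = 0;
    while ((x &amp; 1u) == 0) {
        x &gt;&gt;= 1;
        n++;
    }
    return n;
}

/* Portable stand-in for popcnt. */
static unsigned population_count(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &amp;= x - 1;     /* Clear the lowest set bit. */
        n++;
    }
    return n;
}

/*
 * Hypothetical helper: visit every set bit of a 16-bit candidate bitmap in
 * the same order as the Pfx20 loop, writing each slot index to 'out'.
 * Returns the number of indexes written.
 */
unsigned collect_slot_indexes(uint32_t bitmap, unsigned char out[16])
{
    unsigned remaining;
    unsigned shift = 0;
    unsigned count = 0;

    bitmap &amp;= 0xffffu;                        /* Only 16 slots exist. */
    remaining = population_count(bitmap);     /* popcnt ecx, edx */

    for (; remaining != 0; remaining--) {
        unsigned tz = count_trailing_zeros(bitmap);   /* tzcnt r8d, edx */
        unsigned index = tz + shift;                  /* add rax, r9 */
        bitmap &gt;&gt;= (tz + 1);                          /* shrx rdx, rdx, r8 */
        shift = index + 1;                            /* Shift = Index + 1 */
        out[count++] = (unsigned char)index;
    }
    return count;
}

/*
 * Hypothetical helper: given the byte-equality mask produced by vpmovmskb,
 * ignore bits at or beyond the search length (bzhi), then count what is
 * left (popcnt).
 */
unsigned count_matched_chars(uint32_t equal_mask, unsigned search_length)
{
    if (search_length &lt; 32) {
        equal_mask &amp;= (1u &lt;&lt; search_length) - 1u;
    }
    return population_count(equal_mask);
}
</code></pre>

                <p>

                    For example, a bitmap of <code>0x0105</code> (bits 0, 2 and 8 set) yields the
                    slot indexes 0, 2 and 8, in that order.

                </p>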

                <p>

                    Looking back on my time logs (shout out to my favorite iPhone app,
                    <a
                    href="https://itunes.apple.com/us/app/hourstracker-hours-and-pay/id336456412?mt=8">HoursTracker</a>!),
                    the routine above took about 8 hours to implement, spread over roughly two
                    days.  Writing assembly is slow; writing correct assembly is even
                    slower.  I generally find that there's a noticeable hump I need to get over in
                    the first 30 minutes or so of any assembly programming session, but once I get
                    into the zone, things can start flowing quite nicely.  I'm an aggressive
                    debugger user; often, to get started, I'll write a simple <code>LEAF_ENTRY</code>
                    that looks like this:

                    <small><pre>
        LEAF_ENTRY Foo, _TEXT$00
                int 3
                xor eax, eax
                ret
        LEAF_END Foo, _TEXT$00
                    </pre></small>
                    That'll allow me to attach the debugger and at least inspect the parameter
                    registers so I can write the next couple of instructions.  I find it definitely
                    helps get me into the zone quicker.

                </p>

                <p>

                    Anyway, enough about that.  Let's look at performance.  Again, this will be an
                    interesting one &mdash; other than the optimal negative match logic that I
                    copied from version 1, the sole focus was on getting a working assembly version;
                    I wasn't giving any thought to performance at this stage.

                </p>

                <p>

                    So, it'll be interesting to see how it compares to a) version 1 in the negative
                    matching case (it should be very close), and b) against the C versions in the
                    prefix matching case (it hopefully won't be prohibitively worse).

                </p>

                <p>
                    <a href="Benchmark-x64-02-Negative-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-02-Negative-v1.svg"/>
                    </a>
                </p>

                <p>

                    Hmmm, that's not too bad!  We're very close to version 1 for negative matching,
                    within about 0.5 cycles or so.  That sounds about right, given that our initial
                    logic had to be tweaked a bit to play nicer with the rest of the implementation.
                    And we're still about 3-4 cycles faster than the fastest C version.

                </p>

                <p>

                    What about prefix matching performance?

                </p>

                <p>
                    <a href="Benchmark-x64-02-v2.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-02-v2.svg"/>
                    </a>
                </p>

                <p>

                    The prefix matching performance isn't too bad either!  We're definitely slower
                    than the C version, by about 4 to 10 cycles in most cases,
                    with the $INDEX_ALLOCATION input about 13 cycles slower.

                </p>

                <p>

                    <a class="xref" name="oddities"></a>
                    <small>

                    (I've just noticed a pattern: the first 8 entries, $AttrDef to
                    $Mft, clock in at about 18 and 24 cycles respectively, but the next four
                    entries, $Secure to $Cairo, consistently clock in at about 24 and 34 cycles
                    respectively.  $Secure is the 9th slot, which puts it at memory offset 192 bytes
                    from the start of the string table.  And then the 18 and 24 cycle behavior
                    returns for the last two items, <code>????</code> and <code>.</code>, which are
                    at the end of the string table's inner slot array.  This pattern is prevalent in
                    all of our iterations.  Very peculiar!  We'll investigate later.)

                    </small>

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_3"></a>
                <h2>IsPrefixOfStringInTable_x64_3</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_2"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_x64_2</a> |
                    <a href="#IsPrefixOfStringInTable_x64_4">IsPrefixOfStringInTable_x64_4  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p><small>

                    (We're nearly at the end of the first round of iterations, I promise!)

                </small></p>

                <p>

                    Seeing the performance of the second version in assembly, I decided to try
                    whipping up a third version, which would switch from a <code>LEAF_ENTRY</code>
                    to a <code>NESTED_ENTRY</code>, and use <code>rep cmps</code> for the byte
                    comparison for long strings (instead of the byte-by-byte approach used in
                    version 2).

                </p>

                <p>

                    In order to use <code>rep cmps</code>, you need to use two non-volatile
                    registers, <code>rsi</code> (the source index) and <code>rdi</code> (the
                    destination index).  You also need to specify the direction of the comparison,
                    which means mutating the flags, which are also classed as non-volatile, so they
                    need to be pushed to the stack in the prologue and popped back off in the
                    epilogue.

                </p>
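
                <p>

                    As a rough mental model, here is a C sketch of what a forward
                    <code>repe cmpsb</code> (the byte-sized form used in the listing below) does
                    once the direction flag has been cleared.  The struct and function names are
                    hypothetical, and the sketch assumes the "equal" state starts out set when the
                    count is zero, whereas the real instruction simply leaves the flags untouched
                    in that case.

                </p>

<pre class="code"><code class="language-c">#include &lt;stddef.h&gt;

/* Hypothetical result type: the two pieces of state the instruction leaves. */
typedef struct {
    size_t remaining;   /* Models rcx once the instruction finishes. */
    int    equal;       /* Models ZF: non-zero if the last compare matched. */
} RepeCmpsbResult;

/*
 * Model of a forward 'repe cmpsb': compare bytes at rsi and rdi, advancing
 * both pointers and decrementing the count after every comparison, until
 * the count reaches zero or a pair of bytes differs.
 */
RepeCmpsbResult repe_cmpsb_model(const unsigned char *rsi,
                                 const unsigned char *rdi,
                                 size_t rcx)
{
    RepeCmpsbResult result;

    result.remaining = rcx;
    result.equal = 1;   /* Assumption: treated as "equal" if rcx is 0. */

    while (result.remaining != 0) {
        result.equal = (*rsi++ == *rdi++);
        result.remaining--;     /* Decremented even when bytes differ. */
        if (!result.equal) {
            break;
        }
    }
    return result;
}
</code></pre>

                <p>

                    Note that a mismatch on the very last byte also leaves the counter at zero, so
                    in this model it is the <code>equal</code> field (the zero flag on real
                    hardware) that distinguishes a full match from a late mismatch.

                </p>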

                <p>

                    I didn't really expect this to offer a measurable speedup, but it was a tangible
                    reason to use a <code>NESTED_ENTRY</code>, and otherwise allowed me to stay within the
                    confines of the version 2 implementation.

                </p>

                <p>

                    Let's take a look at the implementation.  At the very least, it's useful to see
                    how you can go about organizing your prologue in MASM.  For
                    <code>NESTED_ENTRY</code> routines, I always define a <code>Locals</code>
                    structure that incorporates the return address and home parameter space for easy
                    access, mainly because it allows me to write code like this:

                </p>

<pre class="code"><code class="language-nasm">        mov     Locals.HomeRcx[rsp], rcx        ; Home our first param.
        mov     Locals.HomeRdx[rsp], rdx        ; Home our second param.
        mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
</code></pre>

                <p>

                    Instead of working with offsets, like this:

                </p>

<pre class="code"><code class="language-nasm">        mov     qword ptr [rsp+30h], rcx        ; Home our first param.
        mov     qword ptr [rsp+38h], rdx        ; Home our second param.
        mov     rsi, qword ptr [rsp+10h]        ; Restore rsi.
        mov     rdi, qword ptr [rsp+8]          ; Restore rdi.
</code></pre>
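
                <p>

                    To see how named fields map onto displacements, here is a small C sketch that
                    mirrors the field order of the <code>Locals</code> struct declared in the
                    listing below and evaluates the same <code>LOCALS_SIZE</code> expression.  The
                    C types are an assumption (each <code>dq</code> is modelled as a
                    <code>uint64_t</code>), and the sketch only computes offsets within the struct
                    as declared; it says nothing about where <code>rsp</code> points once the
                    prologue has run.

                </p>

<pre class="code"><code class="language-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Same field order as the MASM 'Locals struct'; each dq is 8 bytes. */
typedef struct {
    uint64_t Padding;
    uint64_t SavedRdi;
    uint64_t SavedRsi;
    uint64_t SavedFlags;

    uint64_t ReturnAddress;
    uint64_t HomeRcx;
    uint64_t HomeRdx;
    uint64_t HomeR8;
    uint64_t HomeR9;
} Locals;

/*
 * Mirrors:
 *   LOCALS_SIZE equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))
 * The two sizeof terms cancel, leaving the offset of ReturnAddress, i.e. the
 * size of everything that precedes the return address.
 */
enum {
    LOCALS_SIZE = (int)sizeof(Locals) +
                  ((int)offsetof(Locals, ReturnAddress) - (int)sizeof(Locals))
};
</code></pre>

                <p>

                    With this layout the struct is 72 bytes, <code>LOCALS_SIZE</code> works out to
                    32, and <code>SavedRdi</code> and <code>SavedRsi</code> sit at offsets 8 and
                    10h, the same displacements used for the two restores in the raw-offset
                    snippet above.

                </p>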

                <p>

                    This routine was written last, after version 10 of the C routine, so it
                    incorporates the slightly re-arranged loop logic that proved to be faster for
                    that version.  Other than that, the main changes involved converting all the
                    early exit returns in the body of the function to jump to a single exit point,
                    <code>Pfx90</code>, mainly to simplify epilogue exit code.

                </p>

                <div class="tab-box language box-x64-3v2">
                    <ul class="tabs">
                        <li data-content="content-x64-3v2-diff">Diff</li>
                        <li data-content="content-x64-3-full">Full</li>
                    </ul>
                    <div class="content">

<pre class="code content-x64-3v2-diff"><code class="language-diff"> % diff -u IsPrefixOfStringInTable_x64_2.asm IsPrefixOfStringInTable_x64_3.asm
--- IsPrefixOfStringInTable_x64_2.asm   2018-04-26 14:15:53.805409700 -0400
+++ IsPrefixOfStringInTable_x64_3.asm   2018-04-29 16:01:10.033827200 -0400
@@ -18,6 +18,31 @@

 include StringTable.inc

+;
+; Define a locals struct for saving flags, rsi and rdi.
+;
+
+Locals struct
+
+    Padding             dq      ?
+    SavedRdi            dq      ?
+    SavedRsi            dq      ?
+    SavedFlags          dq      ?
+
+    ReturnAddress       dq      ?
+    HomeRcx             dq      ?
+    HomeRdx             dq      ?
+    HomeR8              dq      ?
+    HomeR9              dq      ?
+
+Locals ends
+
+;
+; Exclude the return address onward from the frame calculation size.
+;
+
+LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))
+
 ;++
 ;
 ; STRING_TABLE_INDEX
@@ -33,6 +58,14 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
+;   This routine is based off version 2.  It has been converted into a nested
+;   entry (version 2 is a leaf entry), and uses 'repe cmpsb' to do the string
+;   comparison for long strings (instead of the byte-by-byte comparison used
+;   in version 2).  This requires use of the rsi and rdi registers, and the
+;   direction flag.  These are all non-volatile registers and thus, must be
+;   saved to the stack in the function prologue (hence the need to make this
+;   a nested entry).
+;
 ; Arguments:
 ;
 ;   StringTable - Supplies a pointer to a STRING_TABLE struct.
@@ -50,7 +83,19 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00
+        NESTED_ENTRY IsPrefixOfStringInTable_x64_3, _TEXT$00
+
+;
+; Begin prologue.  Allocate stack space and save non-volatile registers.
+;
+
+        alloc_stack LOCALS_SIZE                     ; Allocate stack space.
+
+        push_eflags                                 ; Save flags.
+        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
+        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
+
+        END_PROLOGUE

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
@@ -120,7 +165,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        ret                                     ; Return.
+        jmp         Pfx90                       ; Return.

         ;IACA_VC_END

@@ -214,7 +259,7 @@
 ;
 ;   rcx - Loop counter.
 ;
-;   rdx - Bitmap initially, then slot length.
+;   rdx - Bitmap.
 ;
 ;   r9 - Shift count.
 ;
@@ -228,8 +273,6 @@
         add         rax, r9                     ; Add shift to create index.
         inc         r8                          ; tzcnt + 1
         shrx        rdx, rdx, r8                ; Reposition bitmap.
-        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
-        xor         edx, edx                    ; Clear edx.
         mov         r9, rax                     ; Copy index back to shift.
         inc         r9                          ; Shift = Index + 1
         vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].
@@ -252,7 +295,7 @@
 ; into xmm1.
 ;

-        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.
+        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

 ;
 ; Convert the XMM mask into a 32-bit representation, then zero high bits after
@@ -267,17 +310,6 @@
         popcnt      r8d, r8d                    ; Count bits.

 ;
-; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
-; it, we can load rax (the current index) into xmm1 and use it to extract the
-; slot length via shuffle.  (The length will be in the lowest byte of xmm1
-; after the shuffle, which we can then vpextrb.)
-;
-
-        movd        xmm1, rax                   ; Load index into xmm1.
-        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
-        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.
-
-;
 ; If 16 characters matched, and the search string's length is longer than 16,
 ; we're going to need to do a comparison of the remaining strings.
 ;
@@ -287,37 +319,38 @@
         jmp         Pfx30                       ; Less than 16 matched.

 ;
-; All 16 characters matched.  If the slot length is greater than 16, we need
-; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
-; then great, that's a slot match, we're done.
+; All 16 characters matched.  Load the underlying slot's length from the
+; relevant offset in the xmm3 register, then check to see if it's greater than,
+; equal or less than 16.
 ;

-@@:     cmp         dl, 16                      ; Compare length to 16.
+@@:     movd        xmm1, rax                   ; Load into xmm1.
+        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
+        vpextrb     rax, xmm1, 0                ; And extract back into rax.
+        cmp         al, 16                      ; Compare length to 16.
         ja          Pfx50                       ; Length is &gt; 16.
         je          short Pfx35                 ; Lengths match!
                                                 ; Length &lt;= 16, fall through...

 ;
 ; Less than or equal to 16 characters were matched.  Compare this against the
-; length of the slot; if equal, this is a match, if not, no match, continue.
+; length of the search string; if equal, this is a match.
 ;

-Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
-        jne         @F                          ; No match found.
-        jmp         short Pfx35                 ; Match found!
+Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
+        je          short Pfx35                 ; Match found!

 ;
 ; No match against this slot, decrement counter and either continue the loop
 ; or terminate the search and return no match.
 ;

-@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
         dec         cx                          ; Decrement counter.
         jnz         Pfx20                       ; cx != 0, continue.

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        ret                                     ; Return.
+        jmp         Pfx90                       ; Return.

 ;
 ; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
@@ -341,7 +374,7 @@
 ;

         vpextrd     eax, xmm5, 3                ; Extract raw index for match.
-        ret                                     ; StringMatch == NULL, finish.
+        jmp         Pfx90                       ; StringMatch == NULL, finish.

 ;
 ; StringMatch is not NULL.  Fill out characters matched (currently rax), then
@@ -367,7 +400,7 @@
         mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
         shr         eax, 4                  ; Revert the scaling.

-        ret
+        jmp         Pfx90

 ;
 ; 16 characters matched and the length of the underlying slot is greater than
@@ -382,14 +415,15 @@
 ; string slot byte-by-byte to determine if there's a match.
 ;

-Pfx50:  sub         rdx, r10                ; Subtract 16 from search length.
+Pfx50:  sub         rax, r10                ; Subtract 16 from search length.

 ;
 ; Free up some registers by stashing their values into various xmm offsets.
 ;

+        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
         vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
-        mov         rcx, rdx                ; Free up rdx, rcx is now counter.
+        mov         rcx, rax                ; Free up rax, rcx is now counter.

 ;
 ; Load the search string buffer and advance it 16 bytes.
@@ -409,31 +443,27 @@
         vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
         mov         r8, StringTable.pStringArray[r8] ; Load string array.

+        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
         shl         eax, 4                  ; Scale the index; sizeof STRING=16.

         lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
         mov         r8, String.Buffer[r8]   ; Load string table buffer address.
         add         r8, r10                 ; Advance buffer 16 bytes.

-        xor         eax, eax                ; Clear eax.
+        mov         rax, rcx                ; Copy counter.

 ;
 ; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
-; Do a byte-by-byte comparison.
+; Set up rsi/rdi so we can do a 'rep cmps'.
 ;

-        align 16
-@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
-        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
-        jne         short Pfx60                 ; If not equal, jump.
-
-;
-; The two bytes were equal, update rax, decrement rcx and potentially continue
-; the loop.
-;
+        cld
+        mov         rsi, r11
+        mov         rdi, r8
+        repe        cmpsb

-        inc         ax                          ; Increment index.
-        loopnz      @B                          ; Decrement cx and loop back.
+        test        cl, 0
+        jnz         short Pfx60                 ; Not all bytes compared, jump.

 ;
 ; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
@@ -460,7 +490,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        ret                                     ; Return.
+        jmp Pfx90                               ; Return.

 ;
 ; More comparisons remain; restore the registers we clobbered and continue loop.
@@ -473,7 +503,17 @@

         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00
+        align   16
+
+Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
+        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
+        popfq                                   ; Restore flags.
+        add     rsp, LOCALS_SIZE                ; Deallocate stack space.
+
+        ret
+
+        NESTED_END   IsPrefixOfStringInTable_x64_3, _TEXT$00
+

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
<pre class="code content-x64-3-full"><code class="language-full">;
; Define a locals struct for saving flags, rsi and rdi.
;

Locals struct

    Padding             dq      ?
    SavedRdi            dq      ?
    SavedRsi            dq      ?
    SavedFlags          dq      ?

    ReturnAddress       dq      ?
    HomeRcx             dq      ?
    HomeRdx             dq      ?
    HomeR8              dq      ?
    HomeR9              dq      ?

Locals ends

;
; Exclude the return address onward from the frame calculation size.
;

LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 2.  It has been converted into a nested
;   entry (version 2 is a leaf entry), and uses 'repe cmpsb' to do the string
;   comparison for long strings (instead of the byte-by-byte comparison used
;   in version 2).  This requires use of the rsi and rdi registers, and the
;   direction flag.  These are all non-volatile registers and thus, must be
;   saved to the stack in the function prologue (hence the need to make this
;   a nested entry).
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        NESTED_ENTRY IsPrefixOfStringInTable_x64_3, _TEXT$00

;
; Begin prologue.  Allocate stack space and save non-volatile registers.
;

        alloc_stack LOCALS_SIZE                     ; Allocate stack space.

        push_eflags                                 ; Save flags.
        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

        END_PROLOGUE

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcast to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap and save in xmm2d[2].
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt; 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        jmp         Pfx90                       ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        jmp         Pfx90

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Set up rsi/rdi so we can do a 'rep cmps'.
;

        cld
        mov         rsi, r11
        mov         rdi, r8
        repe        cmpsb

        jne         short Pfx60                 ; Last compare mismatched, jump.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp Pfx90                               ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        align   16

Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
        popfq                                   ; Restore flags.
        add     rsp, LOCALS_SIZE                ; Deallocate stack space.

        ret

        NESTED_END   IsPrefixOfStringInTable_x64_3, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
                    </div>
                </div>
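
                <p>

                    For readers less familiar with the x86 string instructions, here is a minimal C
                    sketch of what a <code>repe cmpsb</code> sequence does with the bytes past the
                    first 16.  The <code>RepeCmpsbModel</code> helper and its result struct are
                    hypothetical names introduced purely for illustration; they are not part of the
                    string table sources.  The model assumes the direction flag is clear (as after
                    <code>cld</code>), and it reports "equal" for a zero count, whereas the real
                    instruction simply leaves the flags untouched in that case.

                </p>

<pre class="code"><code class="language-c">#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;

/*
 * Hypothetical scalar model of 'repe cmpsb'.  Remaining models rcx and
 * LastEqual models ZF once the instruction has finished.
 */
typedef struct {
    size_t Remaining;
    int LastEqual;
} REPE_CMPSB_RESULT;

static REPE_CMPSB_RESULT
RepeCmpsbModel(
    const unsigned char *Left,      /* Models rsi. */
    const unsigned char *Right,     /* Models rdi. */
    size_t Count                    /* Models rcx on entry. */
    )
{
    REPE_CMPSB_RESULT Result = { Count, 1 };

    assert(Count == 0 || (Left != NULL &amp;&amp; Right != NULL));

    while (Result.Remaining != 0) {

        /* Compare one byte, advance both pointers, decrement the count. */
        Result.LastEqual = (*Left++ == *Right++);
        Result.Remaining--;

        /* The 'repe' prefix stops the repetition at the first mismatch. */
        if (!Result.LastEqual) {
            break;
        }
    }

    return Result;
}
</code></pre>

                <p>

                    One detail worth noticing in this model: the count is decremented before the
                    mismatch check, so <code>Remaining</code> can reach zero both when every byte
                    matched and when only the very last byte differed.  It is
                    <code>LastEqual</code> (the zero flag, in the real instruction) that tells the
                    two cases apart.

                </p>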

                <p>

                    I don't have a strong hunch as to how this will perform; like I said earlier, it
                    was mainly done to set up the scaffolding for using a <code>NESTED_ENTRY</code> in the
                    future, such that we'll have the glue in place if we want to iterate on the
                    disassembly of the PGO versions.  If I had to guess, I suspect it will be
                    slightly slower than version 2, but surely not by much, right?  It's a pretty
                    minor change in the grand scheme of things.  Let's take a look.

                </p>

                <p>
                    <a href="Benchmark-x64-03-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-03-v1.svg"/>
                    </a>
                </p>

                <p>

                    Hah!  Version 3 is much, much worse!  Even its negative matching performance is
                    terrible, which is the one thing the assembly versions have been good at so far.
                    How peculiar.

                </p>

                <p>

                    Now, in the interest of keeping events chronological, as much as I'd like to
                    dive in now and figure out why, I'll have to defer to my behavior when I
                    encountered this performance gap: I laughed, shelved the version 3 experiment,
                    and moved on.

                </p>

                <p>

                    That's a decidedly unsatisfying end to the matter, though, I'll admit.  We'll
                    come back to it later in the article and try and get some closure as to why it
                    was so slow, comparatively.

                </p>

                <hr/>
                <h2>Internet Feedback</h2>

                <p>

                    So, at this point, with version 10 of the C routine and version 2 of the
                    assembly routine in hand, and a very early draft of this article, I solicited
                    <a href="https://twitter.com/trentnelson/status/985715037934440448">feedback on
                    Twitter</a> and got some great responses.  Thanks again to
                    <a href="https://twitter.com/rygorous">Fabian Giesen</a>,
                    <a href="https://twitter.com/pshufb">Wojciech Mu&#322;a</a>,
                    <a href="https://twitter.com/geofflangdale">Geoff Langdale</a>,
                    <a href="https://twitter.com/lemire">Daniel Lemire</a>, and
                    <a href="https://twitter.com/KendallWillets">Kendall Willets</a>
                    for their discussion and input over the course of a few days!

                </p>

                <hr/>
                <a class="xref" name="round2"></a>
                <h1>Round 2 &mdash; Post-Internet Feedback</h1>

                <p>

                    Let's take a look at the iterations that came about after receiving feedback.

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_11"></a>
                <h2>IsPrefixOfStringInTable_11</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_10"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_10</a> |
                    <a href="#IsPrefixOfStringInTable_12">IsPrefixOfStringInTable_12  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    Both <a href="https://twitter.com/rygorous">Fabian Giesen</a> and
                    <a href="https://twitter.com/pshufb">Wojciech Mu&#322;a</a> pointed out that
                    we could use <code>_mm_andnot_si128()</code> to avoid the need to invert the
                    results of the <code>IncludeSlotsByLength</code> XMM register (via
                    <code>_mm_xor_si128()</code>).  Let's try that.

                </p>
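
                <p>

                    As a quick sanity check of the idea, here is a small, self-contained sketch
                    that builds the slot bitmap both ways using only SSE2 intrinsics.  The function
                    names are hypothetical and exist only for this illustration; the two mask
                    arguments stand in for the <code>IncludeSlotsByUniqueChar</code> and
                    <code>IgnoreSlotsByLength</code> registers.  Note the argument order:
                    <code>_mm_andnot_si128(a, b)</code> computes <code>(~a) &amp; b</code>, so the
                    mask to be inverted goes first.

                </p>

<pre class="code"><code class="language-c">#include &lt;assert.h&gt;
#include &lt;emmintrin.h&gt;  /* SSE2 intrinsics. */

/* Version 10 style: invert the "ignore" mask with XOR, then AND. */
static int
IncludeBitmapViaXorAnd(__m128i IncludeByChar, __m128i IgnoreByLength)
{
    const __m128i AllOnes = _mm_set1_epi8((char)0xff);
    __m128i IncludeByLength = _mm_xor_si128(IgnoreByLength, AllOnes);

    return _mm_movemask_epi8(_mm_and_si128(IncludeByChar, IncludeByLength));
}

/* Version 11 style: (~IgnoreByLength) &amp; IncludeByChar in one operation. */
static int
IncludeBitmapViaAndNot(__m128i IncludeByChar, __m128i IgnoreByLength)
{
    return _mm_movemask_epi8(_mm_andnot_si128(IgnoreByLength, IncludeByChar));
}

/* Returns the bitmap after asserting that both formulations agree. */
static int
CheckBothAgree(__m128i IncludeByChar, __m128i IgnoreByLength)
{
    int Bitmap = IncludeBitmapViaAndNot(IncludeByChar, IgnoreByLength);

    assert(Bitmap == IncludeBitmapViaXorAnd(IncludeByChar, IgnoreByLength));
    return Bitmap;
}
</code></pre>

                <p>

                    For example, if the unique character matched slots 0, 2, 4, 5, 8 and 15, and
                    slots 2, 3, 5 and 7 are too long, both functions return <code>0x8111</code>,
                    i.e. slots 0, 4, 8 and 15 remain as candidates.

                </p>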

                <div class="tab-box language box-11v10">
                    <ul class="tabs">
                        <li data-content="content-11v10-diff">Diff</li>
                        <li data-content="content-11-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-11v10-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_10.c IsPrefixOfStringInTable_11.c
--- IsPrefixOfStringInTable_10.c        2018-04-26 10:38:09.357890400 -0400
+++ IsPrefixOfStringInTable_11.c        2018-04-26 12:43:44.184528000 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_10(
+IsPrefixOfStringInTable_11(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,8 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 8, but rewrites the inner loop that
-    checks for comparisons.
+    This version is based off version 10, but with the vpandn used at the
+    end of the initial test, as suggested by Wojciech Mula (@pshufb).

 Arguments:

@@ -70,9 +70,7 @@
     XMMWORD TableUniqueChars;
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
-    XMMWORD IncludeSlotsByLength;
     XMMWORD IncludeSlots;
-    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     //
     // Unconditionally do the following five operations before checking any of
@@ -158,28 +156,25 @@
     // N.B. Because we default the length of empty slots to 0x7f, they will
     //      handily be included in the ignored set (i.e. their words will also
     //      be set to 0xff), which means they'll also get filtered out when
-    //      we invert the mask shortly after.
+    //      we do the "and not" intersection with the include slots next.
     //

     IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

     //
-    // Invert the result of the comparison; we want 0xff for slots to include
-    // and 0x0 for slots to ignore (it's currently the other way around).  We
-    // can achieve this by XOR'ing the result against our all-ones XMM register.
-    //
-
-    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
-
-    //
     // We're now ready to intersect the two XMM registers to determine which
     // slots should still be included in the comparison (i.e. which slots have
     // the exact same unique character as the string and a length less than or
     // equal to the length of the search string).
     //
+    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+    // at the moment (we want 0xff for slots to include, and 0x00 for slots
+    // to ignore; it's currently the other way around), we use _mm_andnot_si128
+    // instead of just _mm_and_si128.
+    //

-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
+    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+                                    IncludeSlotsByUniqueChar);

     //
     // Generate a mask.
</code></pre>
<pre class="code content-11-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_11(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 10, but with the vpandn used at the
    end of the initial test, as suggested by Wojciech Mula (@pshufb).

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlots;

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an XMM register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second XMM register.
    //
    //  2. Load the string table's unique character array into an XMM register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we do the "and not" intersection with the include slots next.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //
    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
    // at the moment (we want 0xff for slots to include, and 0x00 for slots
    // to ignore; it's currently the other way around), we use _mm_andnot_si128
    // instead of just _mm_and_si128.
    //

    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
                                    IncludeSlotsByUniqueChar);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String-&gt;Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched &lt; Length &amp;&amp; Length &lt;= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match-&gt;Index = (BYTE)Index;
            Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
</code></pre>

                    </div>
                </div>

                <p>

                    We're only shaving one instruction off here, so the performance gain, if any,
                    should be very modest.

                </p>

                <p>

                    <a href="Benchmark-11-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-11-v1.svg"/>
                    </a>

                </p>

                <p>

                    Definitely a slight improvement over version 10 in most cases!

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_4"></a>
                <h2>IsPrefixOfStringInTable_x64_4</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_3"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_x64_3</a> |
                    <a href="#IsPrefixOfStringInTable_x64_5">IsPrefixOfStringInTable_x64_5  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    Something I didn't know about <code>vptest</code> that Fabian pointed out is
                    that it actually does two operations.  The first essentially does an AND of the
                    two input registers and sets the zero flag (ZF=1) if the result is all 0s.
                    We've been using that aspect in the assembly version up to now.

                </p>

                <p>

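                    However, for <code>vptest xmm0, xmm1</code>, it also does the equivalent of
                    <code>((not xmm0) and xmm1)</code>, and sets the carry flag (CY=1) if that
                    expression evaluates to all zeros.  (Note the operand order: it's the first
                    operand that gets inverted.)  That's handy, because if the first operand is our
                    "slots to ignore" length mask, it's exactly the expression we want to evaluate!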

                </p>
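
                <p>

                    To make the two flag computations concrete, here's a minimal scalar sketch of
                    what <code>vptest a, b</code> does to ZF and CY.  It's only a model, not code
                    from the string table library: the <code>XMM128</code> type and the
                    <code>VptestZf</code> and <code>VptestCf</code> helpers are hypothetical names
                    made up for illustration, and each 128-bit register is modelled as two 64-bit
                    halves instead of going through the real intrinsics.

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;

/* Hypothetical stand-in for an XMM register: two 64-bit halves. */
typedef struct {
    uint64_t Low;
    uint64_t High;
} XMM128;

/* Models ZF for 'vptest a, b': returns 1 if (a AND b) is all zeros. */
static int
VptestZf(XMM128 a, XMM128 b)
{
    return ((a.Low &amp; b.Low) | (a.High &amp; b.High)) == 0;
}

/* Models CY for 'vptest a, b': returns 1 if ((NOT a) AND b) is all zeros. */
static int
VptestCf(XMM128 a, XMM128 b)
{
    return ((~a.Low &amp; b.Low) | (~a.High &amp; b.High)) == 0;
}
</code></pre>

                <p>

                    With <code>a</code> as the slots-to-ignore mask and <code>b</code> as the unique
                    character match mask, <code>VptestCf(a, b)</code> returning 1 means every slot
                    that matched on the unique character is also a slot we're ignoring due to
                    length, so there's nothing left to compare.

                </p>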

                <p>

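                    So, let's take version 2 of our assembly routine, remove the
                    <code>vpxor</code> inversion, swap the later <code>vpand</code> for a
                    <code>vpandn</code> (as the length mask is no longer inverted), and re-arrange
                    the <code>vptest</code> inputs such that we can do a <code>jnc</code> instead of
                    <code>jnz</code>: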

                </p>

                <div class="tab-box language box-x64-4v2">
                    <ul class="tabs">
                        <li data-content="content-x64-4v2-diff">Diff</li>
                        <li data-content="content-x64-4-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-x64-4v2-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_2.asm IsPrefixOfStringInTable_x64_4.asm
--- IsPrefixOfStringInTable_x64_2.asm   2018-04-26 14:15:53.805409700 -0400
+++ IsPrefixOfStringInTable_x64_4.asm   2018-04-26 14:16:37.909717200 -0400
@@ -33,6 +33,10 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
+;   This routine is based off version 2, but leverages the fact that
+;   vptest sets the carry flag if '(xmm0 and (not xmm1))' evaluates
+;   to all 0s, avoiding the need to do the pxor or pandn steps.
+;
 ; Arguments:
 ;
 ;   StringTable - Supplies a pointer to a STRING_TABLE struct.
@@ -50,7 +54,7 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_4, _TEXT$00

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
@@ -83,12 +87,6 @@
         vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

 ;
-; Set xmm2 to all ones.  We use this later to invert the length comparison.
-;
-
-        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.
-
-;
 ; Broadcast the byte-sized string length into xmm4.
 ;

@@ -103,16 +101,16 @@
 ;

         vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
-        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

 ;
 ; Intersect-and-test the unique character match xmm mask register (xmm5) with
-; the length match mask xmm register (xmm1).  This affects flags, allowing us
-; to do a fast-path exit for the no-match case (where ZF = 1).
+; the inverted length match mask xmm register (xmm1).  This will set the carry
+; flag (CY = 1) if the result of 'xmm5 and (not xmm1)' is all 0s, which allows
+; us to do a fast-path exit for the no-match case.
 ;

-        vptest      xmm5, xmm1                  ; Check for no match.
-        jnz         short Pfx10                 ; There was a match.
+        vptest      xmm1, xmm5                  ; Check for no match.
+        jnc         short Pfx10                 ; There was a match.

 ;
 ; No match, set rax to -1 and return.
@@ -159,12 +157,12 @@
         vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

 ;
-; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
+; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
 ; yielding a mask identifying indices we need to perform subsequent matches
 ; upon.  Convert this into a bitmap and save in xmm2d[2].
 ;

-        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
+        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
         vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

 ;
@@ -473,7 +471,7 @@

         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_4, _TEXT$00

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
<pre class="code content-x64-4-full"><code class="language-nasm">
;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 2, but leverages the fact that
;   vptest sets the carry flag if '(xmm0 and (not xmm1))' evaluates
;   to all 0s, avoiding the need to do the pxor or pandn steps.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_4, _TEXT$00

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Unlike version 2, we don't
; invert the result, so we're left with a masked register where each 0xff
; element indicates a slot that is longer than our search string (and thus
; should be ignored).
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the inverted length match mask xmm register (xmm1).  This will set the carry
; flag (CY = 1) if the result of 'xmm5 and (not xmm1)' is all 0s, which allows
; us to do a fast-path exit for the no-match case.
;

        vptest      xmm1, xmm5                  ; Check for no match.
        jnc         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap and save in xmm2d[2].
;

        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap initially, then slot length.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
        xor         edx, edx                    ; Clear edx.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
; it, we can load rax (the current index) into xmm1 and use it to extract the
; slot length via shuffle.  (The length will be in the lowest byte of xmm1
; after the shuffle, which we can then vpextrb.)
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  If the slot length is greater than 16, we need
; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
; then great, that's a slot match, we're done.
;

@@:     cmp         dl, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt;= 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the slot; if equal, this is a match, if not, no match, continue.
;

Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
        jne         @F                          ; No match found.
        jmp         short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rdx at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rdx, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rdx, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rdx                ; Free up rdx, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        xor         eax, eax                ; Clear eax.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement cx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_4, _TEXT$00
</code></pre>
                    </div>
                </div>
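
                <p>

                    The main loop in the listing above (<code>Pfx20</code>) walks the candidate
                    bitmap by counting trailing zeros, adding the running shift count to recover
                    the slot index, and then shifting the bitmap past the bit it just consumed.
                    Here's a rough, portable C sketch of just that bookkeeping.  The names
                    <code>WalkBitmap</code> and <code>CountTrailingZeros</code> are hypothetical,
                    the trailing zero count is a plain loop rather than <code>tzcnt</code>, and the
                    loop terminates on an empty bitmap instead of using a
                    <code>popcnt</code>-derived counter like the real routines do.

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;

/*
 * Hypothetical helper: counts trailing zero bits.  The caller must ensure
 * that Value is non-zero, otherwise the loop would never terminate.
 */
static unsigned
CountTrailingZeros(uint32_t Value)
{
    unsigned Count = 0;

    while ((Value &amp; 1) == 0) {
        Value &gt;&gt;= 1;
        Count++;
    }

    return Count;
}

/*
 * Hypothetical helper: writes the index of every set bit in Bitmap to the
 * Indexes array (lowest first) and returns how many indexes were written.
 * Indexes must have room for one entry per set bit (16 for a slot bitmap).
 */
static unsigned
WalkBitmap(uint32_t Bitmap, unsigned *Indexes)
{
    unsigned Found = 0;
    unsigned Shift = 0;

    while (Bitmap != 0) {
        unsigned Zeros = CountTrailingZeros(Bitmap);

        /* The slot index is the trailing zero count plus prior shifts. */
        unsigned Index = Zeros + Shift;

        /*
         * Shift the bitmap past the bit we just consumed.  The shift is done
         * on a 64-bit value so that a shift count of 32 is still well-defined.
         */
        Bitmap = (uint32_t)((uint64_t)Bitmap &gt;&gt; (Zeros + 1));
        Shift = Index + 1;

        Indexes[Found++] = Index;
    }

    return Found;
}
</code></pre>

                <p>

                    In the real routines, each index produced this way is immediately used to
                    load and compare the corresponding slot, rather than being collected into an
                    array first.

                </p>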

                <p>

                    Let's see how that stacks up against the existing version 2 of the assembly
                    routine:

                </p>

                <p>
                    <a href="Benchmark-x64-04-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-04-v1.svg"/>
                    </a>
                </p>

                <p>

                    Nice, we've shaved an entire cycle off the negative match path!  I say that both
                    seriously and sarcastically.  A single cycle, wow, stop the press!  On the other
                    hand, going from 8 cycles to 7 cycles is usually a lot harder than, say, going
                    from 100,000 cycles to 80,000 cycles.  We're so close to the lower bound that
                    squeezing out additional cycles is a lot like trying to get blood out of a stone.

                </p>


                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_12"></a>
                <h2>IsPrefixOfStringInTable_12</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_11"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_11</a> |
                    <a href="#IsPrefixOfStringInTable_13">IsPrefixOfStringInTable_13  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    The <code>vptest</code> fast-path exit definitely yielded a repeatable and measurable gain
                    for the assembly version.  Let's replicate it in a C version.

                </p>

                <div class="tab-box language box-12v10">
                    <ul class="tabs">
                        <li data-content="content-12v10-diff">Diff</li>
                        <li data-content="content-12-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-12v10-diff"><code class="language-diff">
% diff -u IsPrefixOfStringInTable_10.c IsPrefixOfStringInTable_12.c
--- IsPrefixOfStringInTable_10.c        2018-04-26 13:28:06.006627100 -0400
+++ IsPrefixOfStringInTable_12.c        2018-04-26 17:47:54.970331600 -0400
@@ -19,7 +19,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_10(
+IsPrefixOfStringInTable_12(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -32,8 +32,15 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 8, but rewrites the inner loop that
-    checks for comparisons.
+    This version is based off version 10, but factors in the improvements
+    made to version 4 of the x64 assembly version, thanks to suggestions from
+    both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).
+
+    Like version 11, we omit the vpxor to invert the lengths, but instead of
+    an initial vpandn, we leverage the fact that vptest sets the carry flag
+    if all 0s result from the expression: "(not param1) and param2".  This
+    allows us to do a fast-path early exit (like x64 version 2 does) if no
+    match is found.

 Arguments:

@@ -71,9 +78,7 @@
     XMMWORD TableUniqueChars;
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
-    XMMWORD IncludeSlotsByLength;
     XMMWORD IncludeSlots;
-    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     //
     // Unconditionally do the following six operations before checking any of
@@ -159,47 +164,58 @@
     // N.B. Because we default the length of empty slots to 0x7f, they will
     //      handily be included in the ignored set (i.e. their words will also
     //      be set to 0xff), which means they'll also get filtered out when
-    //      we invert the mask shortly after.
+    //      we do the "and not" intersection with the include slots next.
     //

     IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

     //
-    // Invert the result of the comparison; we want 0xff for slots to include
-    // and 0x0 for slots to ignore (it's currently the other way around).  We
-    // can achieve this by XOR'ing the result against our all-ones XMM register.
+    // We can do a fast-path test for no match here via _mm_testc_si128(),
+    // which is essentially equivalent to the following logic, just with
+    // fewer instructions:
     //
-
-    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
-
-    //
-    // We're now ready to intersect the two XMM registers to determine which
-    // slots should still be included in the comparison (i.e. which slots have
-    // the exact same unique character as the string and a length less than or
-    // equal to the length of the search string).
+    //      IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+    //                                      IncludeSlotsByUniqueChar);
     //
-
-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
-
+    //      if (!IncludeSlots) {
+    //          return NO_MATCH_FOUND;
+    //      }
     //
-    // Generate a mask.
     //

-    Bitmap = _mm_movemask_epi8(IncludeSlots);
-
-    if (!Bitmap) {
+    if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {

         //
-        // No bits were set, so there are no strings in this table starting
-        // with the same character and of a lesser or equal length as the
-        // search string.
+        // No remaining slots were left after we intersected the slots with
+        // matching unique characters with the inverted slots to ignore due
+        // to length.  Thus, no prefix match was found.
         //

         return NO_MATCH_FOUND;
     }

     //
+    // Continue with the remaining logic, including actually generating the
+    // IncludeSlots, which we need for bitmap generation as part of our
+    // comparison loop.
+    //
+    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+    // at the moment (we want 0xff for slots to include, and 0x00 for slots
+    // to ignore; it's currently the other way around), we use _mm_andnot_si128
+    // instead of just _mm_and_si128.
+    //
+
+    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+                                    IncludeSlotsByUniqueChar);
+
+    //
+    // Generate a mask, count the number of bits, and initialize the search
+    // length.
+    //
+
+    Bitmap = _mm_movemask_epi8(IncludeSlots);
+
+    //
     // Calculate the "search length" of the incoming string, which ensures we
     // only compare up to the first 16 characters.
     //
</code></pre>
<pre class="code content-12-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_12(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 10, but factors in the improvements
    made to version 4 of the x64 assembly version, thanks to suggestions from
    both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).

    Like version 11, we omit the vpxor to invert the lengths, but instead of
    an initial vpandn, we leverage the fact that vptest sets the carry flag
    if all 0s result from the expression: "(not param1) and param2".  This
    allows us to do a fast-path early exit (like x64 version 2 does) if no
    match is found.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlots;

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register, and load
    //     the string table's slot lengths array into a second XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we do the "and not" intersection with the include slots next.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // We can do a fast-path test for no match here via _mm_testc_si128(),
    // which is essentially equivalent to the following logic, just with
    // fewer instructions:
    //
    //      IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
    //                                      IncludeSlotsByUniqueChar);
    //
    //      if (!IncludeSlots) {
    //          return NO_MATCH_FOUND;
    //      }
    //
    //

    if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {

        //
        // No remaining slots were left after we intersected the slots with
        // matching unique characters with the inverted slots to ignore due
        // to length.  Thus, no prefix match was found.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Continue with the remaining logic, including actually generating the
    // IncludeSlots, which we need for bitmap generation as part of our
    // comparison loop.
    //
    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
    // at the moment (we want 0xff for slots to include, and 0x00 for slots
    // to ignore; it's currently the other way around), we use _mm_andnot_si128
    // instead of just _mm_and_si128.
    //

    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
                                    IncludeSlotsByUniqueChar);

    //
    // Generate a mask, count the number of bits, and initialize the search
    // length.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String-&gt;Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched &lt; Length &amp;&amp; Length &lt;= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match-&gt;Index = (BYTE)Index;
            Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
</code></pre>
                    </div>
                </div>

                <p>

                    <a href="Benchmark-12-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-12-v1.svg"/>
                    </a>

                </p>

                <p>

                    Eh, there's not much in this one.  The negative match fast path is basically
                    identical, and the normal prefix matches are a tiny bit slower.

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_13"></a>
                <h2>IsPrefixOfStringInTable_13</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_12"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_12</a> |
                    <a href="#IsPrefixOfStringInTable_14">IsPrefixOfStringInTable_14  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    <a href="https://twitter.com/rygorous/status/985737342156652544">Another tip
                    from Fabian:</a> we can tweak the loop logic further.  Instead of
                    shifting the bitmap right each iteration (and keeping a separate shift count),
                    we can just leverage the <code>blsr</code> instruction (exposed in C via the
                    <code>_blsr_u32</code> intrinsic), which stands for <em>reset lowest set
                    bit</em>, and is equivalent to doing <code>x &amp; (x - 1)</code>.
                    This allows us to tweak the loop organization as well, such that we can simply
                    do <code>while (Bitmap) { }</code> instead of the <code>do { } while (--Count)</code>
                    approach we've been using.

                </p>
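
                <p>

                    To make the bit trick concrete, here's a minimal, portable sketch of
                    walking the set bits of a bitmap by repeatedly clearing the lowest one.
                    The helper names (<code>TrailingZeros</code> and
                    <code>ExtractSetBitIndexes</code>) are made up for illustration and aren't
                    part of the string table code; the plain-C loop stands in for
                    <code>_tzcnt_u32</code>, and <code>Bitmap &amp;= Bitmap - 1</code> stands in
                    for <code>_blsr_u32</code>.

                </p>

```c
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32(); the caller guarantees x != 0. */
static uint32_t TrailingZeros(uint32_t x)
{
    uint32_t n = 0;

    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }

    return n;
}

/*
 * Writes the index of every set bit in Bitmap to Indexes (lowest first) and
 * returns how many were written.  Indexes must have room for 32 entries.
 */
static uint32_t ExtractSetBitIndexes(uint32_t Bitmap, uint32_t *Indexes)
{
    uint32_t Count = 0;

    while (Bitmap) {
        Indexes[Count++] = TrailingZeros(Bitmap);

        /* Clear the lowest set bit; this is the operation blsr performs. */
        Bitmap &= Bitmap - 1;
    }

    return Count;
}
```

                <p>

                    For a bitmap of <code>0x8052</code>, this should yield the indexes 1, 4, 6
                    and 15, in that order.  Note there's no shift count to maintain: each index
                    is already relative to bit 0 of the original bitmap.

                </p>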

                <div class="tab-box language box-13v10">
                    <ul class="tabs">
                        <li data-content="content-13v10-diff">Diff</li>
                        <li data-content="content-13-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-13v10-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_10.c IsPrefixOfStringInTable_13.c
--- IsPrefixOfStringInTable_10.c        2018-04-26 18:22:23.926168500 -0400
+++ IsPrefixOfStringInTable_13.c        2018-04-26 19:16:34.926170200 -0400
@@ -19,7 +19,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_10(
+IsPrefixOfStringInTable_13(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -32,8 +32,10 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 8, but rewrites the inner loop that
-    checks for comparisons.
+    This version is based off version 10, but does away with the bitmap
+    shifting logic and `do { } while (--Count)` loop, instead simply using
+    blsr in conjunction with `while (Bitmap) { }`.  Credit goes to Fabian
+    Giesen (@rygorous) for pointing this approach out.

 Arguments:

@@ -54,12 +56,9 @@
 {
     ULONG Bitmap;
     ULONG Mask;
-    ULONG Count;
     ULONG Length;
     ULONG Index;
-    ULONG Shift = 0;
     ULONG CharactersMatched;
-    ULONG NumberOfTrailingZeros;
     ULONG SearchLength;
     PSTRING TargetString;
     STRING_SLOT Slot;
@@ -206,31 +205,26 @@

     SearchLength = min(String-&gt;Length, 16);

-    //
-    // A popcount against the mask will tell us how many slots we matched, and
-    // thus, need to compare.
-    //
-
-    Count = __popcnt(Bitmap);
-
-    do {
+    while (Bitmap) {

         //
         // Extract the next index by counting the number of trailing zeros left
-        // in the bitmap and adding the amount we've already shifted by.
+        // in the bitmap.
         //

-        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
-        Index = NumberOfTrailingZeros + Shift;
+        Index = _tzcnt_u32(Bitmap);

         //
-        // Shift the bitmap right, past the zeros and the 1 that was just found,
-        // such that it's positioned correctly for the next loop's tzcnt. Update
-        // the shift count accordingly.
+        // Clear the bitmap's lowest set bit, such that it's ready for the next
+        // loop's tzcnt if no match is found in this iteration.  Equivalent to
+        //
+        //      Bitmap &amp;= Bitmap - 1;
+        //
+        // (Which the optimizer will convert into a blsr instruction anyway in
+        //  non-debug builds.  But it's nice to be explicit.)
         //

-        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
-        Shift = Index + 1;
+        Bitmap = _blsr_u32(Bitmap);

         //
         // Load the slot and its length.
@@ -313,7 +307,7 @@

         return (STRING_TABLE_INDEX)Index;

-    } while (--Count);
+    }

     //
     // If we get here, we didn't find a match.
</code></pre>
<pre class="code content-13-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_13(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 10, but does away with the bitmap
    shifting logic and `do { } while (--Count)` loop, instead simply using
    blsr in conjunction with `while (Bitmap) { }`.  Credit goes to Fabian
    Giesen (@rygorous) for pointing this approach out.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Length;
    ULONG Index;
    ULONG CharactersMatched;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register, and load
    //     the string table's slot lengths array into a second XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String-&gt;Length, 16);

    while (Bitmap) {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap.
        //

        Index = _tzcnt_u32(Bitmap);

        //
        // Clear the bitmap's lowest set bit, such that it's ready for the next
        // loop's tzcnt if no match is found in this iteration.  Equivalent to
        //
        //      Bitmap &amp;= Bitmap - 1;
        //
        // (Which the optimizer will convert into a blsr instruction anyway in
        //  non-debug builds.  But it's nice to be explicit.)
        //

        Bitmap = _blsr_u32(Bitmap);

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched &lt; Length &amp;&amp; Length &lt;= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match-&gt;Index = (BYTE)Index;
            Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    }

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
</code></pre>

                    </div>
                </div>

                <p>

                    I like this change.  It was a great suggestion from Fabian.  Let's see how it
                    performs.  Hopefully it'll do slightly better at prefix matching, given that
                    we're effectively reducing the number of instructions required as part of the
                    string comparison logic.

                </p>
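
                <p>

                    As a quick sanity check on the refactoring, here's a rough sketch that runs
                    the old shift-and-count walk and the new <code>blsr</code>-style walk over
                    every possible 16-bit bitmap and confirms they visit the same slot indexes
                    in the same order.  Everything here is a hypothetical test harness, not code
                    from the project: <code>PopCount</code> and <code>TrailingZeros</code> are
                    portable stand-ins for <code>__popcnt</code> and <code>_tzcnt_u32</code>,
                    and both walks record every index rather than returning on the first match.

                </p>

```c
#include <stdint.h>

/* Portable stand-in for __popcnt(). */
static uint32_t PopCount(uint32_t x)
{
    uint32_t n = 0;

    while (x) {
        x &= x - 1;
        n++;
    }

    return n;
}

/* Portable stand-in for _tzcnt_u32(); the caller guarantees x != 0. */
static uint32_t TrailingZeros(uint32_t x)
{
    uint32_t n = 0;

    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }

    return n;
}

/* Old approach: shift the bitmap right and track how far we've shifted. */
static uint32_t WalkWithShift(uint32_t Bitmap, uint32_t *Indexes)
{
    uint32_t Written = 0;
    uint32_t Shift = 0;
    uint32_t Count = PopCount(Bitmap);

    if (!Count) {
        /* The real routines have already returned by this point. */
        return 0;
    }

    do {
        uint32_t NumberOfTrailingZeros = TrailingZeros(Bitmap);
        uint32_t Index = NumberOfTrailingZeros + Shift;

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;
        Indexes[Written++] = Index;
    } while (--Count);

    return Written;
}

/* New approach: clear the lowest set bit each iteration. */
static uint32_t WalkWithBlsr(uint32_t Bitmap, uint32_t *Indexes)
{
    uint32_t Written = 0;

    while (Bitmap) {
        Indexes[Written++] = TrailingZeros(Bitmap);
        Bitmap &= Bitmap - 1;
    }

    return Written;
}

/* Returns 1 if both walks agree for every possible 16-bit bitmap. */
static int WalksAgree(void)
{
    uint32_t Old[16];
    uint32_t New[16];

    for (uint32_t Bitmap = 0; Bitmap <= 0xffff; Bitmap++) {
        uint32_t OldCount = WalkWithShift(Bitmap, Old);
        uint32_t NewCount = WalkWithBlsr(Bitmap, New);

        if (OldCount != NewCount) {
            return 0;
        }

        for (uint32_t i = 0; i < OldCount; i++) {
            if (Old[i] != New[i]) {
                return 0;
            }
        }
    }

    return 1;
}
```

                <p>

                    The sketch only covers 16-bit bitmaps, which is all
                    <code>_mm_movemask_epi8</code> can produce from a single XMM register.
                    The new walk drops the popcount, the add and the shift-count bookkeeping,
                    which is where the hoped-for instruction savings come from.

                </p>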

                <p>

                    <a href="Benchmark-13-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-13-v1.svg"/>
                    </a>

                </p>

                <p>

                    Ah!  A measurable, repeatable speed-up!  Excellent!

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_14"></a>
                <h2>IsPrefixOfStringInTable_14</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_13"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_13</a> |
                    <a href="#IsPrefixOfStringInTable_15">IsPrefixOfStringInTable_15  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    Let's give the C version the same chance as the assembly version with regard to
                    negative matching; we'll take version 13 above and factor in the
                    <code>vptest</code> logic from version 12.

                </p>
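
                <p>

                    As a reminder of why the fast path is safe, here's a scalar model of the two
                    operations involved, assuming (as is the case for our comparison results)
                    that every byte is either <code>0x00</code> or <code>0xff</code>.  The
                    function names <code>TestC</code> and <code>AndNotMoveMask</code> are
                    invented for this sketch; the real code uses
                    <code>_mm_testc_si128</code>, <code>_mm_andnot_si128</code> and
                    <code>_mm_movemask_epi8</code> directly.

                </p>

```c
#include <stdint.h>

#define SLOTS 16

/*
 * Scalar model of _mm_testc_si128(a, b): returns 1 when ((~a) & b) is zero
 * across all 16 bytes, 0 otherwise.
 */
static int TestC(const uint8_t *a, const uint8_t *b)
{
    uint8_t Remaining = 0;

    for (int i = 0; i < SLOTS; i++) {
        Remaining |= (uint8_t)(~a[i] & b[i]);
    }

    return Remaining == 0;
}

/*
 * Scalar model of _mm_movemask_epi8(_mm_andnot_si128(a, b)): gathers the
 * high bit of each byte of ((~a) & b) into a 16-bit bitmap.
 */
static uint32_t AndNotMoveMask(const uint8_t *a, const uint8_t *b)
{
    uint32_t Bitmap = 0;

    for (int i = 0; i < SLOTS; i++) {
        uint8_t Byte = (uint8_t)(~a[i] & b[i]);

        Bitmap |= (uint32_t)(Byte >> 7) << i;
    }

    return Bitmap;
}
```

                <p>

                    With <code>a</code> as the slots to ignore by length and <code>b</code> as
                    the slots to include by unique character, <code>TestC</code> returning 1
                    corresponds to an empty bitmap, so returning early there skips the
                    and-not and the movemask entirely for negative matches.

                </p>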

                <div class="tab-box language box-14v13">
                    <ul class="tabs">
                        <li data-content="content-14v13-diff">Diff (14 vs 13)</li>
                        <li data-content="content-14v12-diff">Diff (14 vs 12)</li>
                        <li data-content="content-14-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-14v13-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_13.c IsPrefixOfStringInTable_14.c
--- IsPrefixOfStringInTable_13.c        2018-04-26 19:16:34.926170200 -0400
+++ IsPrefixOfStringInTable_14.c        2018-04-26 19:32:30.674199200 -0400
@@ -19,7 +19,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_13(
+IsPrefixOfStringInTable_14(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -32,10 +32,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 10, but does away with the bitmap
-    shifting logic and `do { } while (--Count)` loop, instead simply using
-    blsr in conjunction with `while (Bitmap) { }`.  Credit goes to Fabian
-    Giesen (@rygorous) for pointing this approach out.
+    This version combines the altered bitmap logic from version 13 with the
+    fast-path _mm_testc_si128() exit from version 12.

 Arguments:

@@ -70,9 +68,7 @@
     XMMWORD TableUniqueChars;
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
-    XMMWORD IncludeSlotsByLength;
     XMMWORD IncludeSlots;
-    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     //
     // Unconditionally do the following five operations before checking any of
@@ -164,22 +160,43 @@
     IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

     //
-    // Invert the result of the comparison; we want 0xff for slots to include
-    // and 0x0 for slots to ignore (it's currently the other way around).  We
-    // can achieve this by XOR'ing the result against our all-ones XMM register.
+    // We can do a fast-path test for no match here via _mm_testc_si128(),
+    // which is essentially equivalent to the following logic, just with
+    // fewer instructions:
     //
+    //      IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+    //                                      IncludeSlotsByUniqueChar);
+    //
+    //      if (!IncludeSlots) {
+    //          return NO_MATCH_FOUND;
+    //      }
+    //
+    //
+
+    if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {

-    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
+        //
+        // No remaining slots were left after we intersected the slots with
+        // matching unique characters with the inverted slots to ignore due
+        // to length.  Thus, no prefix match was found.
+        //
+
+        return NO_MATCH_FOUND;
+    }

     //
-    // We're now ready to intersect the two XMM registers to determine which
-    // slots should still be included in the comparison (i.e. which slots have
-    // the exact same unique character as the string and a length less than or
-    // equal to the length of the search string).
+    // Continue with the remaining logic, including actually generating the
+    // IncludeSlots, which we need for bitmap generation as part of our
+    // comparison loop.
+    //
+    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+    // at the moment (we want 0xff for slots to include, and 0x00 for slots
+    // to ignore; it's currently the other way around), we use _mm_andnot_si128
+    // instead of just _mm_and_si128.
     //

-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
+    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+                                    IncludeSlotsByUniqueChar);

     //
     // Generate a mask.
@@ -187,17 +204,6 @@

     Bitmap = _mm_movemask_epi8(IncludeSlots);

-    if (!Bitmap) {
-
-        //
-        // No bits were set, so there are no strings in this table starting
-        // with the same character and of a lesser or equal length as the
-        // search string.
-        //
-
-        return NO_MATCH_FOUND;
-    }
-
     //
     // Calculate the "search length" of the incoming string, which ensures we
     // only compare up to the first 16 characters.

</code></pre>
<pre class="code content-14v12-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_12.c IsPrefixOfStringInTable_14.c
--- IsPrefixOfStringInTable_12.c        2018-04-26 17:47:54.970331600 -0400
+++ IsPrefixOfStringInTable_14.c        2018-04-26 19:32:30.674199200 -0400
@@ -19,7 +19,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_12(
+IsPrefixOfStringInTable_14(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -32,15 +32,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 10, but with factors in the improvements
-    made to version 4 of the x64 assembly version, thanks to suggestions from
-    both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).
-
-    Like version 11, we omit the vpxor to invert the lengths, but instead of
-    an initial vpandn, we leverage the fact that vptest sets the carry flag
-    if all 0s result from the expression: "param1 and (not param2)".  This
-    allows us to do a fast-path early exit (like x64 version 2 does) if no
-    match is found.
+    This version combines the altered bitmap logic from version 13 with the
+    fast-path _mm_testc_si128() exit from version 12.

 Arguments:

@@ -61,12 +54,9 @@
 {
     ULONG Bitmap;
     ULONG Mask;
-    ULONG Count;
     ULONG Length;
     ULONG Index;
-    ULONG Shift = 0;
     ULONG CharactersMatched;
-    ULONG NumberOfTrailingZeros;
     ULONG SearchLength;
     PSTRING TargetString;
     STRING_SLOT Slot;
@@ -118,7 +108,7 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    Search.CharsXmm = _mm_load_si128((PXMMWORD)String-&gt;Buffer);
+    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

     //
     // Broadcast the search string's unique characters according to the string
@@ -164,7 +154,7 @@
     // N.B. Because we default the length of empty slots to 0x7f, they will
     //      handily be included in the ignored set (i.e. their words will also
     //      be set to 0xff), which means they'll also get filtered out when
-    //      we do the "and not" intersection with the include slots next.
+    //      we invert the mask shortly after.
     //

     IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
@@ -209,8 +199,7 @@
                                     IncludeSlotsByUniqueChar);

     //
-    // Generate a mask, count the number of bits, and initialize the search
-    // length.
+    // Generate a mask.
     //

     Bitmap = _mm_movemask_epi8(IncludeSlots);
@@ -222,31 +211,26 @@

     SearchLength = min(String-&gt;Length, 16);

-    //
-    // A popcount against the mask will tell us how many slots we matched, and
-    // thus, need to compare.
-    //
-
-    Count = __popcnt(Bitmap);
-
-    do {
+    while (Bitmap) {

         //
         // Extract the next index by counting the number of trailing zeros left
-        // in the bitmap and adding the amount we've already shifted by.
+        // in the bitmap.
         //

-        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
-        Index = NumberOfTrailingZeros + Shift;
+        Index = _tzcnt_u32(Bitmap);

         //
-        // Shift the bitmap right, past the zeros and the 1 that was just found,
-        // such that it's positioned correctly for the next loop's tzcnt. Update
-        // the shift count accordingly.
+        // Clear the bitmap's lowest set bit, such that it's ready for the next
+        // loop's tzcnt if no match is found in this iteration.  Equivalent to
+        //
+        //      Bitmap &amp;= Bitmap - 1;
+        //
+        // (Which the optimizer will convert into a blsr instruction anyway in
+        //  non-debug builds.  But it's nice to be explicit.)
         //

-        Bitmap &gt;&gt;= (NumberOfTrailingZeros + 1);
-        Shift = Index + 1;
+        Bitmap = _blsr_u32(Bitmap);

         //
         // Load the slot and its length.
@@ -329,7 +313,7 @@

         return (STRING_TABLE_INDEX)Index;

-    } while (--Count);
+    }

     //
     // If we get here, we didn't find a match.
</code></pre>
<pre class="code content-14-full"><code class="language-c">_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_14(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version combines the altered bitmap logic from version 13 with the
    fast-path _mm_testc_si128() exit from version 12.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Length;
    ULONG Index;
    ULONG CharactersMatched;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlots;

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String-&gt;Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable-&gt;UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&amp;StringTable-&gt;Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&amp;StringTable-&gt;UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String-&gt;Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // We can do a fast-path test for no match here via _mm_testc_si128(),
    // which is essentially equivalent to the following logic, just with
    // fewer instructions:
    //
    //      IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
    //                                      IncludeSlotsByUniqueChar);
    //
    //      if (!IncludeSlots) {
    //          return NO_MATCH_FOUND;
    //      }
    //
    //

    if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {

        //
        // No remaining slots were left after we intersected the slots with
        // matching unique characters with the inverted slots to ignore due
        // to length.  Thus, no prefix match was found.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Continue with the remaining logic, including actually generating the
    // IncludeSlots, which we need for bitmap generation as part of our
    // comparison loop.
    //
    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
    // at the moment (we want 0xff for slots to include, and 0x00 for slots
    // to ignore; it's currently the other way around), we use _mm_andnot_si128
    // instead of just _mm_and_si128.
    //

    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
                                    IncludeSlotsByUniqueChar);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String-&gt;Length, 16);

    while (Bitmap) {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap.
        //

        Index = _tzcnt_u32(Bitmap);

        //
        // Clear the bitmap's lowest set bit, such that it's ready for the next
        // loop's tzcnt if no match is found in this iteration.  Equivalent to
        //
        //      Bitmap &amp;= Bitmap - 1;
        //
        // (Which the optimizer will convert into a blsr instruction anyway in
        //  non-debug builds.  But it's nice to be explicit.)
        //

        Bitmap = _blsr_u32(Bitmap);

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&amp;StringTable-&gt;Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched &lt; Length &amp;&amp; Length &lt;= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length &gt; 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match-&gt;Index = (BYTE)Index;
            Match-&gt;NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match-&gt;String = &amp;StringTable-&gt;pStringArray-&gt;Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    }

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
</code></pre>

                    </div>
                </div>
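
                <p>

                    To make the loop change in that diff a bit more concrete, here's a minimal
                    sketch of the same "find lowest set bit, then clear it" pattern in portable C.
                    This is not code from the StringTable project: <code>ctz32</code> and
                    <code>collect_set_bits</code> are hypothetical helpers, standing in for
                    <code>_tzcnt_u32()</code> and <code>_blsr_u32()</code> so the sketch doesn't
                    need BMI intrinsics.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical portable stand-in for _tzcnt_u32(): counts trailing zero
 * bits.  This sketch only ever calls it with a non-zero value.
 */
static unsigned ctz32(uint32_t value)
{
    unsigned count = 0;

    assert(value != 0);

    while ((value & 1u) == 0) {
        value >>= 1;
        count++;
    }

    return count;
}

/*
 * Walks every set bit of a candidate-slot bitmap, lowest bit first, and
 * stores each slot index into `indexes`.  Returns how many were stored.
 */
static unsigned collect_set_bits(uint32_t bitmap, unsigned indexes[32])
{
    unsigned found = 0;

    while (bitmap) {
        indexes[found++] = ctz32(bitmap);   /* Index = _tzcnt_u32(Bitmap)  */
        bitmap &= bitmap - 1;               /* Bitmap = _blsr_u32(Bitmap)  */
    }

    return found;
}
```

                <p>

                    Because clearing the lowest set bit leaves every other bit where it was, each
                    trailing-zero count is already the slot index.  That's why the
                    <code>Shift</code>, <code>Count</code> and
                    <code>NumberOfTrailingZeros</code> bookkeeping could be dropped.

                </p>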

                <p>

                    We're obviously clutching at straws here when it comes to eking out more
                    performance.  The <code>_mm_testc_si128()</code> alteration was a tiny
                    bit slower for <a href="#IsPrefixOfStringInTable_12">version 12</a> across the
                    board.  However, version 4 of our assembly, which uses <code>vptest</code>
                    (the underlying instruction that the <code>_mm_testc_si128()</code> intrinsic
                    maps to), was definitely a little bit faster than the other versions.
                    Let's see how our final C version performs:

                </p>

                <p>

                    <a href="Benchmark-14-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-14-v1.svg"/>
                    </a>


                </p>

                <p>

                    Welp, at least it's consistent!  Like version 12, the <code>_mm_testc_si128()</code>
                    change doesn't really offer a compelling improvement for version 14.  That makes
                    version 13 officially our fastest C implementation for round 2.

                </p>
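
                <p>

                    For reference, the carry-flag behaviour that versions 12 and 14 lean on can be
                    modelled in plain scalar C.  This is only a sketch of the semantics, not how
                    the routine is implemented: <code>testc_model</code> and
                    <code>no_candidate_slots</code> are hypothetical names, and the byte arrays
                    stand in for the XMM registers.

                </p>

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_COUNT 16

/*
 * Scalar model of the carry flag set by vptest / _mm_testc_si128(a, b):
 * returns 1 when ((~a) & b) is zero across all 16 bytes, otherwise 0.
 */
static int testc_model(const uint8_t a[SLOT_COUNT],
                       const uint8_t b[SLOT_COUNT])
{
    uint8_t leftover = 0;
    int i;

    assert(a && b);

    for (i = 0; i < SLOT_COUNT; i++) {
        leftover |= (uint8_t)(~a[i] & b[i]);
    }

    return leftover == 0;
}

/*
 * Hypothetical helper mirroring the fast-path exit: builds the "ignore by
 * length" mask with a signed byte compare (like _mm_cmpgt_epi8), then
 * reports whether any slot survives the intersection with the unique
 * character matches.  Returns 1 when no slot can possibly match.
 */
static int no_candidate_slots(const uint8_t slot_lengths[SLOT_COUNT],
                              const uint8_t unique_char_match[SLOT_COUNT],
                              uint8_t search_length)
{
    uint8_t ignore_by_length[SLOT_COUNT];
    int i;

    for (i = 0; i < SLOT_COUNT; i++) {
        ignore_by_length[i] =
            ((int8_t)slot_lengths[i] > (int8_t)search_length) ? 0xff : 0x00;
    }

    return testc_model(ignore_by_length, unique_char_match);
}
```

                <p>

                    With empty slots defaulting to a length of 0x7f, they always end up in the
                    ignored set for the search lengths this routine deals with, so they can never
                    keep the fast path from firing.

                </p>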

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_5"></a>
                <h2>IsPrefixOfStringInTable_x64_5</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_4"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_x64_4</a> |
                    <a href="#IsPrefixOfStringInTable_x64_6">IsPrefixOfStringInTable_x64_6  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    Before we conclude round 2, let's see if we can eke any more performance out of
                    the negative match fast path of our fastest assembly version so far: version 4.
                    For this step, I'm going to leverage <a
                    href="https://software.intel.com/en-us/articles/intel-architecture-code-analyzer">
                    Intel Architecture Code Analyzer</a>, or IACA, for short.

                </p>

                <p>

                    This is a handy little static analysis tool that can provide useful information
                    for fine-tuning performance sensitive code.  Let's take a look at the output
                    from IACA for our assembly version 4.  To do this, I uncomment the two macros,
                    <code>IACA_VC_START</code> and <code>IACA_VC_END</code>, which reside at the
                    start and end of the negative match logic.  These macros are defined in
                    <a
                    href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringTable.inc#L20">StringTable.inc</a>,
                    and look like this:

                </p>

<pre class="code"><code class="language-nasm">IACA_VC_START macro Name

        mov     byte ptr gs:[06fh], 06fh

        endm

IACA_VC_END macro Name

        mov     byte ptr gs:[0deh], 0deh

        endm
</code></pre>

                <p>

                    The equivalent versions for C are defined in
                    <a href="https://github.com/tpn/tracer/blob/v0.1.11/Rtl/Rtl.h#L427">Rtl.h</a>,
                    and look like this:

                </p>

<pre class="code"><code class="language-c">//
// Define start/end markers for IACA.
//

#define IACA_VC_START() __writegsbyte(111, 111)
#define IACA_VC_END()   __writegsbyte(222, 222)
</code></pre>

                <p>

                    You may have noticed commented-out versions of these macros in both the C and
                    assembly code.  When enabled, they emit a specific byte pattern into the
                    generated machine code that the IACA tool can detect.  You place the start and
                    end markers around the code you're interested in, recompile it, then run IACA
                    against the final executable (or library).

                </p>
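
                <p>

                    As a rough sketch of how the markers get used on the C side, something like the
                    following would work.  The <code>ENABLE_IACA</code> switch and the
                    <code>count_matching_bytes</code> routine are assumptions made up for this
                    example (the project itself just comments the macros in and out); when the
                    switch is off, the markers compile away to nothing.

                </p>

```c
#include <assert.h>

/*
 * ENABLE_IACA is a hypothetical build switch for this sketch.  When it is
 * not defined (or we're not building with MSVC for x64), the markers
 * expand to no-ops so the code builds and runs normally.
 */
#if defined(ENABLE_IACA) && defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#define IACA_VC_START() __writegsbyte(111, 111)
#define IACA_VC_END()   __writegsbyte(222, 222)
#else
#define IACA_VC_START() ((void)0)
#define IACA_VC_END()   ((void)0)
#endif

/*
 * Hypothetical routine: only the instructions emitted between the two
 * markers would be picked up by IACA.
 */
static unsigned count_matching_bytes(const unsigned char *left,
                                     const unsigned char *right,
                                     unsigned length)
{
    unsigned matched = 0;
    unsigned i;

    assert(left && right);

    IACA_VC_START();

    for (i = 0; i < length; i++) {
        matched += (left[i] == right[i]);
    }

    IACA_VC_END();

    return matched;
}
```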

                <p>

                    Let's see what happens when we do this for our version 4 assembly routine.  I'll
                    include the relevant assembly snippet, reformatted into a more concise fashion,
                    followed by the IACA output (also reformatted into a more concise fashion):

                </p>

                <a class="xref" name="IsPrefixOfStringInTable_x64_5-diff"></a>
                <div class="tab-box language box-x64-v4">
                    <ul class="tabs">
                        <li data-content="content-x64-v4-nasm">Assembly</li>
                        <li data-content="content-x64-v4-iaca">IACA</li>
                    </ul>
                    <div class="content">
<pre class="code content-x64-v4-nasm"><code class="language-nasm">
mov      rax,  String.Buffer[rdx]                       ; Load address of string buffer.
vmovdqu  xmm0, xmmword ptr [rax]                        ; Load search buffer.
vmovdqa  xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
vpshufb  xmm5, xmm0, xmm1                               ; Rearrange string by uniq. ix.
vmovdqa  xmm2, xmmword ptr StringTable.UniqueChars[rcx] ; Load unique chars.
vpcmpeqb xmm5, xmm5, xmm2                               ; Compare unique chars.
vmovdqa  xmm3, xmmword ptr StringTable.Lengths[rcx]     ; Load table lengths.
vpbroadcastb xmm4, byte ptr String.Length[rdx]          ; Broadcast string length.
vpcmpgtb xmm1, xmm3, xmm4                               ; Identify long slots.
vptest   xmm1, xmm5                                     ; Unique slots AND (!long slots).
jnc      short Pfx10                                    ; CY=0, continue with routine.
xor      eax, eax                                       ; CY=1, no match.  Clear rax.
not      al                                             ; al = -1 (NO_MATCH_FOUND)
ret                                                     ; Return NO_MATCH_FOUND.
</code></pre>
<pre class="code content-x64-v4-iaca"><code class="language-nasm">
S:\Source\tracer&gt;iaca x64\Release\StringTable2.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\StringTable2.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.74 Cycles       Throughput Bottleneck: Dependency Chains
Loop Count:  22
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port   |  0  - DV  |  1  |  2  - D   |  3  - D   |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------------
| Cycles | 2.0   0.0 | 1.0 | 3.5   3.5 | 3.5   3.5 | 0.0 | 3.0 | 2.0 | 0.0 |
----------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred

|    | Ports pressure in cycles        | |
|&#181;ops|0DV| 1 | 2 - D | 3 - D |4| 5 | 6 |7|
-------------------------------------------
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | mov rax, qword ptr [rdx+0x8]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqu xmm0, xmmword ptr [rax]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm1, xmmword ptr [rcx+0x10]
| 1  |   |   |       |       | |1.0|   | | vpshufb xmm5, xmm0, xmm1
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm2, xmmword ptr [rcx]
| 1  |1.0|   |0.5 0.5|0.5 0.5| |   |   | | vpcmpeqb xmm5, xmm5, xmm2
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm3, xmmword ptr [rcx+0x20]
| 2  |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpbroadcastb xmm4, byte ptr [rdx]
| 1  |   |1.0|       |       | |   |   | | vpcmpgtb xmm1, xmm3, xmm4
| 2  |1.0|   |       |       | |1.0|   | | vptest xmm1, xmm5
| 1  |   |   |       |       | |   |1.0| | jnb 0x10
| 1* |   |   |       |       | |   |   | | xor eax, eax
| 1  |   |   |       |       | |   |1.0| | not al
| 3^ |   |   |0.5 0.5|0.5 0.5| |   |   | | ret
Total Num Of &#181;ops: 18
</code></pre>
                    </div>
                </div>

                <p>

                    The <a
                    href="https://github.com/tpn/pdfs/blob/4d2296269d3737b649def585a19eb103cda9c3d0/Intel%20Architecture%20Code%20Analyzer%20-%20User's%20Guide%20-%20v3.0%20(2017).pdf">
                    Intel Architecture Code Analyzer User's Guide (v3.0)</a> provides decent
                    documentation about how to interpret the output, so I won't go over the gory
                    details.  What I'm really looking at in this pass is what my block throughput
                    is, and potentially what the bottleneck is.

                </p>

                <p>

                    In this case, our block throughput is being reported as 3.74 cycles.  IACA's
                    throughput analysis treats the marked block as if it were a loop body, so this
                    is basically how many CPU cycles each pass through the block costs once things
                    reach a steady state.  Our bottleneck is dependency chains, which refers to the
                    situation where, say, instruction C can't start because the results from
                    instruction A aren't ready yet.  (This... this is a vastly simplified
                    explanation.)

                </p>
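
                <p>

                    If dependency chains are a new concept, the classic illustration is summing an
                    array.  The sketch below is a general example and has nothing to do with what
                    version 5 actually changes; <code>sum_single_chain</code> and
                    <code>sum_four_chains</code> are hypothetical names.  Both functions return the
                    same result, but the second gives the CPU four independent chains of additions
                    to overlap instead of one long one.  (An optimizing compiler may well rewrite
                    either loop, so treat this as a picture of the idea, not a benchmark.)

                </p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One accumulator: every addition depends on the result of the previous
 * one, forming a single long dependency chain.
 */
static uint64_t sum_single_chain(const uint32_t *values, size_t count)
{
    uint64_t total = 0;
    size_t i;

    assert(values != NULL || count == 0);

    for (i = 0; i < count; i++) {
        total += values[i];
    }

    return total;
}

/*
 * Four accumulators: four shorter, independent chains.  The additions in
 * one chain don't have to wait on the results of the other three.
 */
static uint64_t sum_four_chains(const uint32_t *values, size_t count)
{
    uint64_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;

    assert(values != NULL || count == 0);

    for (; i + 4 <= count; i += 4) {
        a += values[i];
        b += values[i + 1];
        c += values[i + 2];
        d += values[i + 3];
    }

    /* Handle any leftover elements that didn't fill a group of four. */
    for (; i < count; i++) {
        a += values[i];
    }

    return a + b + c + d;
}
```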

                <p>

                    Alright, well, what can we do?  A good answer would be that with an intimate
                    understanding of contemporary Intel CPU architecture, you can pin-point exactly
                    what needs changing in order to reduce dependencies, and maximise port
                    utilization, and leverage macro fusion, but also not forgetting about
                    microfusion, and remembering microcode latencies, and generally become one with
                    the Intel optimization manual, but never at the expense of under-utilizing
                    your front-back &#181;op-frobulator, unless the inverted cache re-up and
                    re-vigor policy is in its hybrid coalesced L9 state, at which point any increase
                    in thermal unit pivot-bracketing will nullify all efforts across product lines 4
                    and 5 but exponentially accelerate entropy discount licensing for models 6 and
                    7, but only after recent microcode patches if you're in the Northern hemisphere
                    during an unseasonably cold Spring.

                </p>

                <p>

                    Or you can just move shit around until the number gets smaller.  That's what I did.

                </p>

                <p>

                    Well, that's not entirely true.  Fabian made a good observation when he was
                    reviewing some of my assembly: I was often needlessly doing a load into an
                    XMM register only to use it once in a subsequent operation.  Instead of doing
                    that, I could just use the <em>load-op</em> version of the instruction, which
                    allows one of the instruction's input operands to be sourced directly from memory.

                </p>

                <p>

                    For example, instead of this:

                </p>

<pre class="code"><code class="language-nasm">vmovdqa  xmm2, xmmword ptr StringTable.UniqueChars[rcx]       ; Load unique chars.
vpcmpeqb xmm5, xmm5, xmm2                                     ; Compare unique chars.
</code></pre>

                <p>

                    You can do this:

                </p>

<pre class="code"><code class="language-nasm">vpcmpeqb xmm5, xmm5, xmmword ptr StringTable.UniqueChars[rcx] ; Compare unique chars.
</code></pre>

                <p>

                    But yeah, other than a few load-op tweaks, I basically just shuffled shit around
                    until the reported block throughput got lower.  Very rigorous methodology, I
                    know.  Here's the final version, which also happens to be the version quoted in
                    the introduction of this article:

                </p>

                <div class="tab-box language box-intro2">
                    <ul class="tabs">
                        <li data-content="content-intro2-nasm">Assembly</li>
                        <li data-content="content-intro2-iaca">IACA</li>
                    </ul>
                    <div class="content">
<pre class="code content-intro2-nasm"><code class="language-nasm">
mov      rax,  String.Buffer[rdx]                   ; Load address of string buffer.
vpbroadcastb xmm4, byte ptr String.Length[rdx]      ; Broadcast string length.
vmovdqa  xmm3, xmmword ptr StringTable.Lengths[rcx] ; Load table lengths.
vmovdqu  xmm0, xmmword ptr [rax]                    ; Load string buffer.
vpcmpgtb xmm1, xmm3, xmm4                           ; Identify slots &gt; string len.
vpshufb  xmm5, xmm0, StringTable.UniqueIndex[rcx]   ; Rearrange string by unique index.
vpcmpeqb xmm5, xmm5, StringTable.UniqueChars[rcx]   ; Compare rearranged to unique.
vptest   xmm1, xmm5                                 ; Unique slots AND (!long slots).
jnc      short Pfx10                                ; CY=0, continue with routine.
xor      eax, eax                                   ; CY=1, no match.
not      al                                         ; al = -1 (NO_MATCH_FOUND).
ret                                                 ; Return NO_MATCH_FOUND.
</code></pre>
<pre class="code content-intro2-iaca"><code class="language-nasm">
S:\Source\tracer&gt;iaca x64\Release\StringTable2.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\StringTable2.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.48 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  24
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port   |  0  - DV  |  1  |  2  - D   |  3  - D   |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------------
| Cycles | 2.0   0.0 | 1.0 | 3.5   3.5 | 3.5   3.5 | 0.0 | 3.0 | 2.0 | 0.0 |
----------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred

|    | Ports pressure in cycles        | |
|&#181;ops|0DV| 1 | 2 - D | 3 - D |4| 5 | 6 |7|
-------------------------------------------
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | mov rax, qword ptr [rdx+0x8]
| 2  |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpbroadcastb xmm4, byte ptr [rdx]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm3, xmmword ptr [rcx+0x20]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqu xmm0, xmmword ptr [rax]
| 1  |1.0|   |       |       | |   |   | | vpcmpgtb xmm1, xmm3, xmm4
| 2^ |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpshufb xmm5, xmm0, xmmword ptr [rcx+0x10]
| 2^ |   |1.0|0.5 0.5|0.5 0.5| |   |   | | vpcmpeqb xmm5, xmm5, xmmword ptr [rcx]
| 2  |1.0|   |       |       | |1.0|   | | vptest xmm1, xmm5
| 1  |   |   |       |       | |   |1.0| | jnb 0x10
| 1* |   |   |       |       | |   |   | | xor eax, eax
| 1  |   |   |       |       | |   |1.0| | not al
| 3^ |   |   |0.5 0.5|0.5 0.5| |   |   | | ret
Total Num Of &#181;ops: 18
</code></pre>

                    </div>
                </div>

                <p>

                    As you can see, that is reporting a block throughput of 3.48 cycles instead of
                    3.74.  A whopping 0.26 cycle reduction (about 7%)!  Also note the bottleneck is
                    now being reported as <code>FrontEnd</code>, which basically means that the thing
                    holding up this code now is literally the CPU's ability to decode the actual
                    instruction stream into actionable internal work.  (Again, super simplistic
                    explanation of a very complex process.)

                </p>


                <p>

                    For the sake of completeness, here's the proper diff and full version of
                    assembly version 5:

                </p>

                <div class="tab-box language box-x64-5v4">
                    <ul class="tabs">
                        <li data-content="content-x64-5v4-diff">Diff</li>
                        <li data-content="content-x64-5-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-x64-5v4-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_4.asm IsPrefixOfStringInTable_x64_5.asm
--- IsPrefixOfStringInTable_x64_4.asm   2018-04-26 17:56:37.934374900 -0400
+++ IsPrefixOfStringInTable_x64_5.asm   2018-04-26 18:17:26.087861100 -0400
@@ -33,9 +33,14 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 2, but leverages the fact that
-;   vptest sets the carry flag if '(xmm0 and (not xmm1))' evaluates
-;   to all 0s, avoiding the the need to do the pxor or pandn steps.
+;   This routine is identical to version 4, but has the initial negative match
+;   instructions re-ordered and tweaked in order to reduce the block throughput
+;   reported by IACA (from 3.74 to 3.48).
+;
+;   N.B. Although this does result in a measurable speedup, the clarity suffers
+;        somewhat due to the fact that instructions that were previously paired
+;        together are now spread out (e.g. moving the string buffer address into
+;        rax and then loading that into xmm0 three instructions later).
 ;
 ; Arguments:
 ;
@@ -54,32 +59,21 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_4, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_5, _TEXT$00

 ;
-; Load the string buffer into xmm0, and the unique indexes from the string table
-; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
-; result into xmm5.
+; Load the address of the string buffer into rax.
 ;

         ;IACA_VC_START

-        mov     rax, String.Buffer[rdx]
-        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
-        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
-        vpshufb xmm5, xmm0, xmm1
-
-;
-; Load the string table's unique character array into xmm2.
-
-        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
+        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

 ;
-; Compare the search string's unique character array (xmm5) against the string
-; table's unique chars (xmm2), saving the result back into xmm5.
+; Broadcast the byte-sized string length into xmm4.
 ;

-        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.
+        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

 ;
 ; Load the lengths of each string table slot into xmm3.
@@ -88,26 +82,38 @@
         vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

 ;
-; Broadcast the byte-sized string length into xmm4.
+; Load the search string buffer into xmm0.
 ;

-        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.
+        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

 ;
 ; Compare the search string's length, which we've broadcasted to all 8-byte
 ; elements of the xmm4 register, to the lengths of the slots in the string
-; table, to find those that are greater in length.  Invert the result, such
-; that we're left with a masked register where each 0xff element indicates
-; a slot with a length less than or equal to our search string's length.
+; table, to find those that are greater in length.
 ;

         vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

 ;
+; Shuffle the buffer in xmm0 according to the unique indexes, and store the
+; result into xmm5.
+;
+
+        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.
+
+;
+; Compare the search string's unique character array (xmm5) against the string
+; table's unique chars (xmm2), saving the result back into xmm5.
+;
+
+        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.
+
+;
 ; Intersect-and-test the unique character match xmm mask register (xmm5) with
-; the inverted length match mask xmm register (xmm1).  This will set the carry
-; flag (CY = 1) if the result of 'xmm5 and (not xmm1)' is all 0s, which allows
-; us to do a fast-path exit for the no-match case.
+; the length match mask xmm register (xmm1).  This affects flags, allowing us
+; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
+; been inverted).
 ;

         vptest      xmm1, xmm5                  ; Check for no match.
@@ -472,7 +478,7 @@

         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_4, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_5, _TEXT$00

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
<pre class="code content-x64-5-full"><code class="language-nasm">;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is identical to version 4, but has the initial negative match
;   instructions re-ordered and tweaked in order to reduce the block throughput
;   reported by IACA (from 3.74 to 3.48).
;
;   N.B. Although this does result in a measurable speedup, the clarity suffers
;        somewhat due to the fact that instructions that were previously paired
;        together are now spread out (e.g. moving the string buffer address into
;        rax and then loading that into xmm0 three instructions later).
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_5, _TEXT$00

;
; Load the address of the string buffer into rax.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Load the lengths of each string table slot into xmm3.
;

        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

;
; Load the search string buffer into xmm0.
;

        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

;
; Compare the search string's length, which we've broadcast to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

;
; Shuffle the buffer in xmm0 according to the unique indexes, and store the
; result into xmm5.
;

        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (a memory operand), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
; been inverted).
;

        vptest      xmm1, xmm5                  ; Check for no match.
        jnc         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; the equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap (later saved in xmm5d[2]).
;

        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap initially, then slot length.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
        xor         edx, edx                    ; Clear edx.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
; it, we can load rax (the current index) into xmm1 and use it to extract the
; slot length via shuffle.  (The length will be in the lowest byte of xmm1
; after the shuffle, which we can then vpextrb.)
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  If the slot length is greater than 16, we need
; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
; then great, that's a slot match, we're done.
;

@@:     cmp         dl, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt; 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the slot; if equal, this is a match, if not, no match, continue.
;

Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
        jne         @F                          ; No match found.
        jmp         short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rdx at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rdx, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rdx, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rdx                ; Free up rdx, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        xor         eax, eax                ; Clear eax.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement cx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_5, _TEXT$00

</code></pre>
                    </div>
                </div>
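
                <p>

                    If the vector instructions in the fast path are hard to follow, here is a
                    minimal scalar C sketch of what the reordered negative match logic computes.
                    It is a model for illustration only, not the routine itself: the
                    <code>vec16</code> type and every helper name are made up for this example,
                    and the idea that an unused slot carries a length of <code>0x7f</code> is an
                    assumption made purely so the example has something to exclude.

                </p>

<pre class="code"><code class="language-c">#include &lt;stdint.h&gt;

/* Scalar stand-in for one xmm register viewed as 16 bytes (hypothetical). */
typedef struct { uint8_t b[16]; } vec16;

/* vpcmpgtb: per-byte signed greater-than; each lane becomes 0xff or 0x00. */
static vec16 cmpgt_bytes(vec16 a, vec16 b)
{
    vec16 r;
    for (int i = 0; i &lt; 16; i++)
        r.b[i] = ((int8_t)a.b[i] &gt; (int8_t)b.b[i]) ? 0xff : 0x00;
    return r;
}

/* vpshufb: pick src bytes by index; an index with its high bit set gives 0. */
static vec16 shuffle_bytes(vec16 src, vec16 idx)
{
    vec16 r;
    for (int i = 0; i &lt; 16; i++)
        r.b[i] = (idx.b[i] &amp; 0x80) ? 0 : src.b[idx.b[i] &amp; 0x0f];
    return r;
}

/* vpcmpeqb: per-byte equality; each lane becomes 0xff or 0x00. */
static vec16 cmpeq_bytes(vec16 a, vec16 b)
{
    vec16 r;
    for (int i = 0; i &lt; 16; i++)
        r.b[i] = (a.b[i] == b.b[i]) ? 0xff : 0x00;
    return r;
}

/* Carry flag of 'vptest a, b': set when ((not a) and b) is all zeros. */
static int vptest_carry(vec16 a, vec16 b)
{
    uint8_t any = 0;
    for (int i = 0; i &lt; 16; i++)
        any |= (uint8_t)(~a.b[i] &amp; b.b[i]);
    return any == 0;
}

/* Returns 1 when the fast path would exit with no match, 0 otherwise. */
static int no_match_fast_path(vec16 lengths, vec16 unique_index,
                              vec16 unique_chars, vec16 search,
                              uint8_t search_length)
{
    vec16 len;                                      /* vpbroadcastb xmm4     */
    for (int i = 0; i &lt; 16; i++)
        len.b[i] = search_length;

    vec16 too_long = cmpgt_bytes(lengths, len);     /* vpcmpgtb xmm1         */
    vec16 picked = shuffle_bytes(search, unique_index); /* vpshufb xmm5      */
    vec16 same = cmpeq_bytes(picked, unique_chars); /* vpcmpeqb xmm5         */

    return vptest_carry(too_long, same);            /* vptest xmm1, xmm5     */
}
</code></pre>

                <p>

                    For example, with a two-slot table holding <code>ab</code> and
                    <code>xyz</code> (unique character index 0 for both), searching for
                    <code>abc</code> leaves slot 0 as a candidate, so the function returns 0.
                    Searching for <code>xy</code> returns 1: the only slot whose unique character
                    matches is also longer than the search string, so nothing survives the
                    intersection.  Note that the length mask never needs to be inverted
                    explicitly; the carry flag computed by <code>vptest</code> does that for us.

                </p>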

                <p>

                    Did it make a difference?  Were we able to shave any time off the negative match
                    fast path?  Let's find out:

                </p>

                <p>
                    <a href="Benchmark-x64-05-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-05-v1.svg"/>
                    </a>
                </p>

                <p>

                    Hurrah!  We've got a new winner!  Our final tweaks yielded a very small but
                    measurable and repeatable improvement in both prefix matching and negative
                    matching!  Let's mark that up as a win.

                </p>

                <!--
                <p>

                    Actually, now that we've got VTune working without immediately causing a kernel
                    panic (updating motherboard BIOS appeared to fix that problem), let's compare
                    VTune's summary of version 4 versus version 5:

                </p>

                <hr/>
                <pre>
    IsPrefixOfStringInTable_x64_4       |       IsPrefixOfStringInTable_x64_5
Elapsed Time:	        0.456s          |   Elapsed Time:	    0.456s
Clockticks:	        1,553,630,000   |   Clockticks:	            1,551,040,000
Instructions Retired:	5,047,170,000   |   Instructions Retired:   5,047,170,000
CPI Rate:	        0.308           |   CPI Rate:	            0.307
MUX Reliability:	0.872           |   MUX Reliability:        0.859
Front-End Bound:	 7.3%           |   Front-End Bound:         0.8%
Bad Speculation:	 4.8%           |   Bad Speculation:         0.0%
Back-End Bound:	         4.4%           |   Back-End Bound:          0.0%
Retiring:	        83.5%           |   Retiring:              100.0%
   General Retirement:  82.6%           |      General Retirement: 100.0%
   Microcode Sequencer:  0.9%           |      Microcode Sequencer:  0.4%
</pre>
                <hr/>

                <p>

                    100% retirement of pipeline slots; not bad for just shuffling shit around!
                    You can see the tiny decrease in CPI, from 0.308 to 0.307, which matches the
                    tiny performance improvement we observed for negative matching.

                </p>

                <p>

                    Note: the command line invocations for version 4 and 5 above were as follows:

                </p>

                <hr/>
<pre>

</pre>
                <hr/>
                -->



                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_3-review"></a>
                <h2>Reviewing IsPrefixOfStringInTable_x64_3...</h2>

                <p>

                    Alright, we need to get some closure on why
                    <code><a href="#IsPrefixOfStringInTable_x64_3">IsPrefixOfStringInTable_x64_3</a></code>
                    was so bad in comparison to
                    <code><a href="#IsPrefixOfStringInTable_x64_2">IsPrefixOfStringInTable_x64_2</a></code>.
                    Let's review the performance chart again quickly:

                </p>

                <p>
                    <a href="Benchmark-x64-03-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-03-v1.svg"/>
                    </a>
                </p>

                <p>

                    What immediately stands out to me with those results is how
                    <strong>everything</strong> seems to be impacted; it's not just the prefix
                    matching performance that's bad, it's the negative match performance as well.
                    This is odd, as we didn't really change anything in the negative match logic.

                </p>

                <p>

                    Except for that pesky prologue we added to stash the values of <code>rsi</code>,
                    <code>rdi</code>, and the flags register.  Hmmm!  That seems like as good a place
                    as any to start investigating.  Let's whip up another version that defers the
                    prologue until <strong>after</strong> the initial negative match logic.  This
                    exploits a little detail regarding prologues in that they need to appear in the
                    first 255 bytes of the function byte code &mdash; but don't necessarily need to
                    appear at the very start.  As long as the prologue definition for the register
                    is the first time the register is mutated, you've got a bit of room to play with
                    regarding where to actually put it.

                </p>

                <p>

                    So, here's version 7 of the routine, based off version 3, that simply relocates
                    the prologue code to appear after the initial negative match logic:

                </p>

                <div class="tab-box language box-x64-7v3">
                    <ul class="tabs">
                        <li data-content="content-x64-7v3-diff">Diff</li>
                        <li data-content="content-x64-7-full">Full</li>
                    </ul>
                    <div class="content">

<pre class="code content-x64-7v3-diff"><code class="language-diff"> % diff -u IsPrefixOfStringInTable_x64_3.asm IsPrefixOfStringInTable_x64_7.asm
--- IsPrefixOfStringInTable_x64_3.asm   2018-04-29 16:13:23.879193700 -0400
+++ IsPrefixOfStringInTable_x64_7.asm   2018-04-29 19:33:06.374193900 -0400
@@ -58,13 +58,8 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 2.  It has been converted into a nested
-;   entry (version 2 is a leaf entry), and uses 'rep cmpsb' to do the string
-;   comparison for long strings (instead of the byte-by-byte comparison used
-;   in version 2).  This requires use of the rsi and rdi registers, and the
-;   direction flag.  These are all non-volatile registers and thus, must be
-;   saved to the stack in the function prologue (hence the need to make this
-;   a nested entry).
+;   This routine is based off version 3, but relocates the prologue code to
+;   after the initial negative match logic (jump target Pfx10).
 ;
 ; Arguments:
 ;
@@ -83,19 +78,7 @@
 ;
 ;--

-        NESTED_ENTRY IsPrefixOfStringInTable_x64_3, _TEXT$00
-
-;
-; Begin prologue.  Allocate stack space and save non-volatile registers.
-;
-
-        alloc_stack LOCALS_SIZE                     ; Allocate stack space.
-
-        push_eflags                                 ; Save flags.
-        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
-        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
-
-        END_PROLOGUE
+        NESTED_ENTRY IsPrefixOfStringInTable_x64_7, _TEXT$00

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
@@ -165,11 +148,23 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp         Pfx90                       ; Return.
+        ret                                     ; Return.

         ;IACA_VC_END

 ;
+; Begin prologue.  Allocate stack space and save non-volatile registers.
+;
+
+Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.
+
+        push_eflags                                 ; Save flags.
+        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
+        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
+
+        END_PROLOGUE
+
+;
 ; (There was at least one match, continue with processing.)
 ;

@@ -187,7 +182,7 @@
 ;   r11 - String length (String-&gt;Length)
 ;

-Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
+        vpextrb     r11, xmm4, 0                ; Load length.
         mov         rax, 16                     ; Load 16 into rax.
         mov         r10, r11                    ; Copy into r10.
         cmp         r10w, ax                    ; Compare against 16.
@@ -512,7 +507,7 @@

         ret

-        NESTED_END   IsPrefixOfStringInTable_x64_3, _TEXT$00
+        NESTED_END   IsPrefixOfStringInTable_x64_7, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
<pre class="code content-x64-7-full"><code class="language-nasm">;
; Define a locals struct for saving flags, rsi and rdi.
;

Locals struct

    Padding             dq      ?
    SavedRdi            dq      ?
    SavedRsi            dq      ?
    SavedFlags          dq      ?

    ReturnAddress       dq      ?
    HomeRcx             dq      ?
    HomeRdx             dq      ?
    HomeR8              dq      ?
    HomeR9              dq      ?

Locals ends

;
; Exclude the return address onward from the frame calculation size.
;

LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 3, but relocates the prologue code to
;   after the initial negative match logic (jump target Pfx10).
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        NESTED_ENTRY IsPrefixOfStringInTable_x64_7, _TEXT$00

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcast to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; Begin prologue.  Allocate stack space and save non-volatile registers.
;

Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.

        push_eflags                                 ; Save flags.
        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

        END_PROLOGUE

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; the equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

        vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] when needed).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt; 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        jmp         Pfx90                       ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        jmp         Pfx90

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Set up rsi/rdi so we can do a 'rep cmps'.
;

        cld
        mov         rsi, r11
        mov         rdi, r8
        repe        cmpsb

        test        cl, 0
        jnz         short Pfx60                 ; Not all bytes compared, jump.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp Pfx90                               ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        align   16

Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
        popfq                                   ; Restore flags.
        add     rsp, LOCALS_SIZE                ; Deallocate stack space.

        ret

        NESTED_END   IsPrefixOfStringInTable_x64_7, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

</code></pre>
                    </div>
                </div>
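
                <p>

                    The bit manipulation in the main comparison loop is easier to follow outside
                    of assembly.  Here's a rough C sketch of just two pieces: walking the candidate
                    bitmap (the <code>tzcnt</code>/<code>shrx</code> dance at <code>Pfx20</code>),
                    and counting matched characters (the <code>vpcmpeqb</code>,
                    <code>vpmovmskb</code>, <code>bzhi</code>, <code>popcnt</code> sequence).  It
                    uses plain loops instead of intrinsics, and the helper names are made up for
                    illustration; they don't appear in the StringTable source.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for tzcnt on a non-zero 16-bit bitmap. */
static unsigned count_trailing_zeros16(uint16_t bitmap)
{
    unsigned n = 0;
    assert(bitmap != 0);
    while ((bitmap & 1u) == 0) {
        bitmap >>= 1;
        n++;
    }
    return n;
}

/* Portable stand-in for popcnt. */
static unsigned popcount16(uint16_t value)
{
    unsigned n = 0;
    while (value != 0) {
        value &= (uint16_t)(value - 1);     /* Clear the lowest set bit. */
        n++;
    }
    return n;
}

/*
 * Compare 16 bytes, build a one-bit-per-byte mask, zero the bits at or above
 * search_length (the bzhi step), then count what is left (the popcnt step).
 */
static unsigned matched_chars(const uint8_t search[16],
                              const uint8_t slot[16],
                              unsigned search_length)
{
    uint16_t mask = 0;
    unsigned i;

    assert(search_length <= 16);
    for (i = 0; i < 16; i++) {
        if (search[i] == slot[i]) {
            mask |= (uint16_t)(1u << i);
        }
    }
    if (search_length < 16) {
        mask &= (uint16_t)((1u << search_length) - 1u);
    }
    return popcount16(mask);
}

/*
 * Walk the candidate bitmap the way Pfx20 does:
 *   index  = tzcnt(bitmap) + shift
 *   bitmap = bitmap >> (tzcnt + 1)
 *   shift  = index + 1
 * Writes the slot indices in ascending order and returns how many there were.
 */
static unsigned candidate_slots(uint16_t bitmap, unsigned out[16])
{
    unsigned count = 0;
    unsigned shift = 0;

    while (bitmap != 0) {
        unsigned tz = count_trailing_zeros16(bitmap);
        unsigned index = tz + shift;

        out[count++] = index;
        bitmap = (uint16_t)((bitmap >> 1) >> tz);
        shift = index + 1;
    }
    return count;
}
```

                <p>

                    A slot matches over its first 16 bytes when <code>matched_chars()</code> equals
                    the search length, which is the comparison made at <code>Pfx30</code>.  One
                    difference worth noting: the assembly drives its loop with a counter seeded by
                    <code>popcnt</code>, whereas the sketch just tests the bitmap for zero.

                </p>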

                <p>

                    Let's see how that change impacts the runtime function entry with regard to the
                    unwind code information.  Here's the entry for assembly version 3:

                </p>

                <hr/>
<small><pre>
0:000&gt; .fnent StringTable2!IsPrefixOfStringInTable_x64_3
Debugger function entry 00000185`395f2a88 for:
Exact matches:
    StringTable2!IsPrefixOfStringInTable_x64_3 (void)

BeginAddress      = 00000000`00003dc0
EndAddress        = 00000000`00003fb0
UnwindInfoAddress = 00000000`00005508

Unwind info at 00007fff`f8425508, 10 bytes
  version 1, flags 0, prolog f, codes 6
  00: offs f, unwind op 4, op info 7    UWOP_SAVE_NONVOL FrameOffset: 8 reg: rdi.
  02: offs a, unwind op 4, op info 6    UWOP_SAVE_NONVOL FrameOffset: 10 reg: rsi.
  04: offs 5, unwind op 2, op info 0    UWOP_ALLOC_SMALL.
  05: offs 4, unwind op 2, op info 3    UWOP_ALLOC_SMALL.
</pre></small>
                <hr/>

                <p>

                    Compare that to the entry for the routine we just wrote, version 7, with the
                    prologue appearing much later in the routine:

                </p>

                <hr/>
<small><pre>
0:000&gt; .fnent StringTable2!IsPrefixOfStringInTable_x64_7
Debugger function entry 00000185`395f2a88 for:
Exact matches:
    StringTable2!IsPrefixOfStringInTable_x64_7 (void)

BeginAddress      = 00000000`00004540
EndAddress        = 00000000`00004730
UnwindInfoAddress = 00000000`00005530

Unwind info at 00007fff`f8425530, 10 bytes
  version 1, flags 0, prolog 4c, codes 6
  00: offs 4c, unwind op 4, op info 7   UWOP_SAVE_NONVOL FrameOffset: 8 reg: rdi.
  02: offs 47, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 10 reg: rsi.
  04: offs 42, unwind op 2, op info 0   UWOP_ALLOC_SMALL.
  05: offs 41, unwind op 2, op info 3   UWOP_ALLOC_SMALL.
</pre></small>
                <hr/>

                <p>

                    As you can see, the <code>prolog</code> value has changed to <code>0x4c</code>,
                    and the offsets for each entry have also changed accordingly.  Let's disassemble
                    the function and see if we can correlate the addresses of our prologue
                    instructions to the offsets indicated above:

                </p>

                <hr/>
<small><pre>
0:000&gt; uf StringTable2!IsPrefixOfStringInTable_x64_7
StringTable2!IsPrefixOfStringInTable_x64_7:
00007fff`f8424540 488b4208        mov     rax,qword ptr [rdx+8]
00007fff`f8424544 c5fa6f00        vmovdqu xmm0,xmmword ptr [rax]
00007fff`f8424548 c5f96f4910      vmovdqa xmm1,xmmword ptr [rcx+10h]
00007fff`f842454d c4e27900e9      vpshufb xmm5,xmm0,xmm1
00007fff`f8424552 c5f96f11        vmovdqa xmm2,xmmword ptr [rcx]
00007fff`f8424556 c5d174ea        vpcmpeqb xmm5,xmm5,xmm2
00007fff`f842455a c5f96f5920      vmovdqa xmm3,xmmword ptr [rcx+20h]
00007fff`f842455f c4e26929d2      vpcmpeqq xmm2,xmm2,xmm2
00007fff`f8424564 c4e2797822      vpbroadcastb xmm4,byte ptr [rdx]
00007fff`f8424569 c5e164cc        vpcmpgtb xmm1,xmm3,xmm4
00007fff`f842456d c5f1efca        vpxor   xmm1,xmm1,xmm2
00007fff`f8424571 c4e27917e9      vptest  xmm5,xmm1
00007fff`f8424576 7505            jne     StringTable2!IsPrefixOfStringInTable_x64_7+0x3d (00007fff`f842457d)

StringTable2!IsPrefixOfStringInTable_x64_7+0x38:
00007fff`f8424578 33c0            xor     eax,eax
00007fff`f842457a f6d0            not     al
00007fff`f842457c c3              ret

StringTable2!IsPrefixOfStringInTable_x64_7+0x3d:
00007fff`f842457d 4883ec20        sub     rsp,20h
00007fff`f8424581 9c              pushfq
00007fff`f8424582 4889742410      mov     qword ptr [rsp+10h],rsi
00007fff`f8424587 48897c2408      mov     qword ptr [rsp+8],rdi
00007fff`f842458c c4c37914e300    vpextrb r11d,xmm4,0
00007fff`f8424592 48c7c010000000  mov     rax,10h
</pre></small>
                <hr/>

                <p>

                    All of the addresses share <code>0x00007fff`f8424</code> as the first 13 digits,
                    so we can ignore that part to simplify the values we're working with.  Let's
                    take a look at the first of our prologue instructions, <code>sub rsp,
                    20h</code>.  This maps to our <code>alloc_stack LOCALS_SIZE</code> line:

                </p>

                <hr/>
<small><pre>
Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.

        push_eflags                                 ; Save flags.
        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

        END_PROLOGUE
</pre></small>
                <hr/>

                <p>

                    The <code>sub rsp, 20h</code> line appears at byte offset <code>0x57d</code>.
                    If we subtract the address of the very first instruction,
                    <code>0x540</code>, from that, we get 61, or <code>0x3d</code> in hex.

                </p>

                <p>

                    Hmmmm.  That doesn't map to any of the offsets that appear in version 7's
                    runtime function entry.  Let's try the address of the <code>pushfq</code>
                    instruction, which is at offset <code>0x581</code>.  If we subtract the
                    start address <code>0x540</code> from that, we're left with 65, which in hex
                    is, <em>drum roll</em>, <code>0x41</code>!  That matches the last line of the
                    runtime function entry:

                </p>

<small><pre>
  00: offs 4c, unwind op 4, op info 7   UWOP_SAVE_NONVOL FrameOffset: 8 reg: rdi.
  02: offs 47, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 10 reg: rsi.
  04: offs 42, unwind op 2, op info 0   UWOP_ALLOC_SMALL.
  05: offs 41, unwind op 2, op info 3   UWOP_ALLOC_SMALL.
      ^^^^^^^
</pre></small>

                <p>

                    That makes sense if we think about the purpose of the unwind entries.  They are
                    there for the kernel to compare against a faulting instruction's address (i.e.
                    the value contained in the RIP register at the time of the fault) in order to
                    determine what needs to be unwound as part of exception handling.  In this case,
                    at byte offset <code>0x41</code>, the <code>sub rsp, 20h</code> instruction will
                    have already been executed, so the kernel knows it needs to unwind this (e.g. by
                    doing what will effectively equate to <code>add rsp, 20h</code>) within the
                    exception handling logic when it needs to unwind the entire frame and restore
                    all of the non-volatile registers.

                </p>
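
                <p>

                    To make that concrete, here's a small C sketch of the decision being described:
                    given a faulting offset from the start of the function, which prologue
                    operations have already taken effect and therefore need undoing?  The types and
                    function names are hypothetical, and this is a drastic simplification of what
                    the real unwinder does.  The table is transcribed from version 7's
                    <code>.fnent</code> output, with the two <code>UWOP_ALLOC_SMALL</code> entries
                    decoded as 8 and 32 bytes (op info &times; 8 + 8, per the documented encoding).

                </p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a decoded unwind code. */
typedef enum {
    PROLOGUE_ALLOC_STACK,
    PROLOGUE_SAVE_REGISTER
} prologue_op_kind;

typedef struct {
    uint8_t          offset;        /* "offs": start of the *next* instruction. */
    prologue_op_kind kind;
    unsigned         stack_bytes;   /* Only used for PROLOGUE_ALLOC_STACK.      */
} prologue_op;

/* Version 7's prologue, in the same (reverse) order .fnent prints it. */
static const prologue_op version7_prologue[4] = {
    { 0x4c, PROLOGUE_SAVE_REGISTER, 0  },   /* mov [rsp+8], rdi   */
    { 0x47, PROLOGUE_SAVE_REGISTER, 0  },   /* mov [rsp+10h], rsi */
    { 0x42, PROLOGUE_ALLOC_STACK,   8  },   /* pushfq             */
    { 0x41, PROLOGUE_ALLOC_STACK,   32 },   /* sub rsp, 20h       */
};

/* An operation has taken effect once RIP has reached the next instruction. */
static int has_executed(const prologue_op *op, unsigned rip_offset)
{
    return rip_offset >= op->offset;
}

/* Stack bytes that would need to be released for a fault at rip_offset. */
static unsigned stack_bytes_to_release(const prologue_op *ops, size_t count,
                                       unsigned rip_offset)
{
    unsigned bytes = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (ops[i].kind == PROLOGUE_ALLOC_STACK &&
            has_executed(&ops[i], rip_offset)) {
            bytes += ops[i].stack_bytes;
        }
    }
    return bytes;
}

/* Non-volatile registers that would need restoring for a fault at rip_offset. */
static unsigned registers_to_restore(const prologue_op *ops, size_t count,
                                     unsigned rip_offset)
{
    unsigned registers = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (ops[i].kind == PROLOGUE_SAVE_REGISTER &&
            has_executed(&ops[i], rip_offset)) {
            registers++;
        }
    }
    return registers;
}
```

                <p>

                    With this table, a fault at offset <code>0x3d</code> (or anywhere in the
                    fast-path code before it) has nothing to undo, whereas a fault at
                    <code>0x41</code> needs the 32 bytes from <code>sub rsp, 20h</code> released.

                </p>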

                <p>

                    If we take a look at the first instruction <strong>after</strong>
                    our last prologue instruction, <code>vpextrb r11d, xmm4, 0</code>, it resides at
                    address offset <code>0x58c</code>.  Subtracting the start address
                    <code>0x540</code> from that, we get 76, which is <code>0x4c</code> in hex.
                    That matches the offset of the unwind entry for our last prologue instruction
                    (the first one listed), as well as the prologue end point:

                </p>

<small><pre>
Unwind info at 00007fff`f8425530, 10 bytes
  version 1, flags 0, prolog 4c, codes 6
                      ^^^^^^^^^
  00: offs 4c, unwind op 4, op info 7   UWOP_SAVE_NONVOL FrameOffset: 8 reg: rdi.
      ^^^^^^^
  02: offs 47, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 10 reg: rsi.
  04: offs 42, unwind op 2, op info 0   UWOP_ALLOC_SMALL.
  05: offs 41, unwind op 2, op info 3   UWOP_ALLOC_SMALL.
</pre></small>
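
                <p>

                    The same subtraction works for every entry, not just the two we checked by
                    hand.  Here's a quick C sketch of the arithmetic: each <code>offs</code> value
                    is the address of the <em>next</em> instruction minus the function's start
                    address.  The addresses and instruction lengths are copied from the
                    disassembly above; the struct and function names are just for illustration.

                </p>

```c
#include <stdint.h>

/* One disassembled instruction: where it starts and how many bytes it is. */
typedef struct {
    uint64_t address;
    unsigned length;
} instruction;

/* Address of the first instruction of IsPrefixOfStringInTable_x64_7. */
static const uint64_t version7_start = 0x00007ffff8424540ULL;

/* The four prologue instructions, in program order. */
static const instruction version7_prologue[4] = {
    { 0x00007ffff842457dULL, 4 },   /* 4883ec20    sub rsp,20h        */
    { 0x00007ffff8424581ULL, 1 },   /* 9c          pushfq             */
    { 0x00007ffff8424582ULL, 5 },   /* 4889742410  mov [rsp+10h],rsi  */
    { 0x00007ffff8424587ULL, 5 },   /* 48897c2408  mov [rsp+8],rdi    */
};

/*
 * The unwind offset for a prologue instruction is the offset of the
 * instruction that follows it, relative to the start of the function.
 */
static unsigned unwind_offset(uint64_t function_start, instruction insn)
{
    return (unsigned)((insn.address + insn.length) - function_start);
}
```

                <p>

                    Running the four instructions through <code>unwind_offset()</code> yields
                    <code>0x41</code>, <code>0x42</code>, <code>0x47</code> and <code>0x4c</code>,
                    which are the four <code>offs</code> values in the entry, in reverse order.

                </p>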

                <p>

                    The prologue must end within the first 255 bytes of the function because the
                    prologue size and the unwind code offsets are each stored in a single byte, so
                    255 is the maximum value that can be represented.  When writing a
                    <code>NESTED_ENTRY</code> with MASM, you need to have the
                    <code>END_PROLOGUE</code> macro (which expands to <code>.endprolog</code>)
                    occur within the first 255 bytes of your function.

                </p>
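
                <p>

                    For reference, here's roughly what that looks like as C.  The field layout
                    follows Microsoft's documented x64 <code>UNWIND_INFO</code> and
                    <code>UNWIND_CODE</code> structures, but the type names are my own, and I've
                    flattened the bit-fields into whole bytes to sidestep compiler-specific
                    bit-field layout.  Treat it as a sketch for reasoning about sizes, not as a
                    drop-in replacement for the SDK definitions.

                </p>

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed four-byte header that precedes the unwind code array. */
typedef struct {
    uint8_t version_and_flags;          /* Version: low 3 bits; flags: high 5. */
    uint8_t size_of_prolog;             /* One byte, hence the 255 limit.      */
    uint8_t count_of_codes;             /* Number of two-byte slots following. */
    uint8_t frame_register_and_offset;  /* Two 4-bit fields.                   */
} unwind_info_header;

/* One two-byte unwind code slot. */
typedef struct {
    uint8_t code_offset;                /* Also one byte: the "offs" column.   */
    uint8_t op_and_info;                /* Unwind op: low 4 bits; info: high 4. */
} unwind_code;

/* Total bytes occupied by the header plus its unwind code slots. */
static size_t unwind_info_size(const unwind_info_header *header)
{
    return sizeof(*header) + header->count_of_codes * sizeof(unwind_code);
}

/* UWOP_ALLOC_SMALL encodes an allocation of (op info * 8) + 8 bytes. */
static unsigned alloc_small_bytes(unsigned op_info)
{
    return (op_info * 8u) + 8u;
}

/* Can a prologue ending at this offset be described by size_of_prolog? */
static int prolog_size_fits(unsigned end_of_prolog_offset)
{
    return end_of_prolog_offset <= UINT8_MAX;
}
```

                <p>

                    Plugging in version 7's numbers (six code slots) gives 4 + 6 &times; 2 = 16
                    bytes, which lines up with the <code>10 bytes</code> reported by
                    <code>.fnent</code> if that figure is in hex.  Likewise, op info values of 3 and
                    0 decode to 32 and 8 bytes, matching the <code>sub rsp, 20h</code> and the
                    <code>pushfq</code>.

                </p>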

                <p>

                    If we move the <code>END_PROLOGUE</code> line in version 7 way down to the
                    bottom of the routine and try to assemble it, MASM balks:

                </p>

<small><pre>
IsPrefixOfStringInTable_x64_7.asm(511): error A2247: size of prolog too big, must be &lt; 256 bytes
</pre></small>

                <p>

                    <small>
                        (Note: I have no idea why the spelling of prolog vs prologue and epilog vs
                        epilogue is so inconsistent within the Microsoft tooling and docs.)
                    </small>

                </p>

                <p>

                    Let's get back on track.  We need to review the performance of version 7 to see
                    if relocating the prologue has any impact on the negative matching performance
                    of the routine.  If it does, that's a strong indicator the prologue is at
                    fault, especially if the prefix matching still shows the same performance
                    issues.  Here's the comparison:

                </p>

                <p>
                    <a href="Benchmark-x64-07-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-07-v1.svg"/>
                    </a>
                </p>

                <p>

                    Hah!  Look at that, the negative match performance is back on par with version
                    2.  So, the blame now squarely points to something peculiar in the prologue
                    inducing a huge (well, relatively huge) performance hit.  But the prologue is so
                    simple!  It's only pushing flags, and two registers!

                </p>

                <a class="xref" name="IsPrefixOfStringInTable_x64_8"></a>
                <h3>IsPrefixOfStringInTable_x64_8</h3>

                <p>

                    I know register pushing is cheap.  Borderline free in the grand scheme of
                    things.  Flags though.  Flags are an interesting one.  The bane of the out-of-order
                    CPU pipeline, they could very well be forcing a synchronization point within the
                    code, preventing all the contemporary goodies you get when you let the CPU do
                    its thing whenever it wants, rather than when you need.  (Goodies like...
                    Meltdown!)

                </p>

                <p>

                    Let's test the theory.  We'll take version 7 and simply comment out the flag
                    pushing and popping behavior.

                </p>

                <p><small>

                    (Technically we're not allowed to do that; the direction flag is classed as
                    non-volatile; if the calling function has it set to reverse, and on return,
                    we've set it to forward, things are going to be problematic if it actually
                    <strong>wanted</strong> it set to reverse.  In practice, this isn't that common.
                    At least with our current stack, what with our aversion to even using a C
                    runtime library, we know nothing in our benchmark environment is going to be
                    faced with that predicament.)

                </small></p>

                <div class="tab-box language box-x64-8v7">
                    <ul class="tabs">
                        <li data-content="content-x64-8v7-diff">Diff</li>
                        <li data-content="content-x64-8-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-x64-8v7-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_7.asm IsPrefixOfStringInTable_x64_8.asm
--- IsPrefixOfStringInTable_x64_7.asm   2018-04-29 21:10:09.061479900 -0400
+++ IsPrefixOfStringInTable_x64_8.asm   2018-04-29 22:08:02.761164300 -0400
@@ -58,8 +58,9 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 3, but relocates the prologue code to
-;   after the initial negative match logic (jump target Pfx10).
+;   This routine is based off version 7, but comments-out the pushing and
+;   popping of flags to the stack in the prologue and epilogue, respectively,
+;   in order to test a theory regarding performance.
 ;
 ; Arguments:
 ;
@@ -78,8 +79,7 @@
 ;
 ;--

-        NESTED_ENTRY IsPrefixOfStringInTable_x64_7, _TEXT$00
-
+        NESTED_ENTRY IsPrefixOfStringInTable_x64_8, _TEXT$00
 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
@@ -158,7 +158,7 @@

 Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.

-        push_eflags                                 ; Save flags.
+       ;push_eflags                                 ; Save flags.
         save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
         save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

@@ -502,12 +502,12 @@

 Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
         mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
-        popfq                                   ; Restore flags.
+       ;popfq                                   ; Restore flags.
         add     rsp, LOCALS_SIZE                ; Deallocate stack space.

         ret

-        NESTED_END   IsPrefixOfStringInTable_x64_7, _TEXT$00
+        NESTED_END   IsPrefixOfStringInTable_x64_8, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
<pre class="code content-x64-8-full"><code class="language-nasm">;
; Define a locals struct for saving flags, rsi and rdi.
;

Locals struct

    Padding             dq      ?
    SavedRdi            dq      ?
    SavedRsi            dq      ?
    SavedFlags          dq      ?

    ReturnAddress       dq      ?
    HomeRcx             dq      ?
    HomeRdx             dq      ?
    HomeR8              dq      ?
    HomeR9              dq      ?

Locals ends

;
; Exclude the return address onward from the frame calculation size.
;

LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 7, but comments out the pushing and
;   popping of flags to the stack in the prologue and epilogue, respectively,
;   in order to test a theory regarding performance.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        NESTED_ENTRY IsPrefixOfStringInTable_x64_8, _TEXT$00
;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; Begin prologue.  Allocate stack space and save non-volatile registers.
;

Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.

       ;push_eflags                                 ; Save flags.
        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

        END_PROLOGUE

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is the
; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

        vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] if needed).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt; 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        jmp         Pfx90                       ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        jmp         Pfx90

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Set up rsi/rdi so we can do a 'rep cmps'.
;

        cld
        mov         rsi, r11
        mov         rdi, r8
        repe        cmpsb

        test        cl, 0
        jnz         short Pfx60                 ; Not all bytes compared, jump.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp Pfx90                               ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        align   16

Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
       ;popfq                                   ; Restore flags.
        add     rsp, LOCALS_SIZE                ; Deallocate stack space.

        ret

        NESTED_END   IsPrefixOfStringInTable_x64_8, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>

                    </div>
                </div>
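                <p>

                    If the register juggling in that listing is hard to follow, here's a rough,
                    portable C sketch of three ideas it relies on: building the candidate bitmap
                    (the <code>vpshufb</code>, <code>vpcmpeqb</code> and <code>vpcmpgtb</code>
                    steps), walking that bitmap (<code>tzcnt</code> plus a running shift), and
                    counting matched characters (<code>bzhi</code> then <code>popcnt</code>).
                    Everything named here (<code>table_sketch</code>,
                    <code>candidate_bitmap</code>, <code>walk_bitmap</code>,
                    <code>matched_chars</code>) is made up for illustration; it is not the real
                    <code>STRING_TABLE</code> layout, and it is not a transcription of the
                    assembly.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for the three 16-byte table fields. */
typedef struct {
    uint8_t unique_index[16];   /* Offset of each slot's unique character. */
    uint8_t unique_chars[16];   /* The unique character for each slot.     */
    int8_t  lengths[16];        /* Slot lengths (signed, like vpcmpgtb).   */
} table_sketch;

/* Portable replacements for the popcnt and tzcnt instructions. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    for (; v != 0; v &= v - 1)
        n++;
    return n;
}

static unsigned ctz32(uint32_t v)
{
    unsigned n = 0;
    assert(v != 0);
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Bit i is set if slot i passes both the unique-char and length filters. */
static uint32_t candidate_bitmap(const table_sketch *t,
                                 const uint8_t search16[16],
                                 int8_t search_len)
{
    uint32_t bitmap = 0;
    for (unsigned i = 0; i < 16; i++) {
        /* Assumes every unique index is in the range 0..15. */
        int unique_ok = search16[t->unique_index[i] & 15] == t->unique_chars[i];
        int length_ok = !(t->lengths[i] > search_len);  /* Inverted "greater than". */
        if (unique_ok && length_ok)
            bitmap |= 1u << i;
    }
    return bitmap;
}

/* Writes the index of every set bit, lowest first; returns how many. */
static unsigned walk_bitmap(uint32_t bitmap, unsigned indexes[16])
{
    unsigned count = popcount32(bitmap);    /* Loop counter (rcx). */
    unsigned shift = 0;                     /* Shift count (r9).   */

    assert(bitmap <= 0xffffu);
    for (unsigned n = 0; n < count; n++) {
        unsigned tz = ctz32(bitmap);
        unsigned index = tz + shift;        /* Slot index (rax).   */
        bitmap >>= tz + 1;                  /* Reposition bitmap.  */
        shift = index + 1;
        indexes[n] = index;
    }
    return count;
}

/* Counts equal bytes below search_len (0..16) in a 16-bit compare mask. */
static unsigned matched_chars(uint32_t eq_mask, unsigned search_len)
{
    assert(search_len <= 16);
    return popcount32(eq_mask & ((1u << search_len) - 1u));
}
```

                <p>

                    For example, handing <code>walk_bitmap</code> a bitmap of
                    <code>0x2502</code> visits slots 1, 8, 10 and 13, in that order.

                </p>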

                <p>

                    Let's see how we perform when we cheat by not saving flags:

                </p>

                <p>
                    <a href="Benchmark-x64-08-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-08-v1.svg"/>
                    </a>
                </p>

                <p>

                    Crikey!  Flags were clearly at fault!  Not only that, but look at the
                    performance of the routine in comparison with version 2 for prefix matching;
                    there's a definite improvement in performance!  (I also looked up the latency of
                    <code>pushfq</code>: 9 cycles!  I had no idea it was that expensive.)

                </p>

                <p>

                    ....wait wait wait.  Shut the front door!  This new assembly version is nearly
                    as fast as the fastest C versions, and it doesn't even have the optimized
                    negative match re-work in place.  Plot twist!

                </p>

                <p>

                    That means the slight tweak we made re-arranging the logic for <a
                    href="#IsPrefixOfStringInTable_x64_3">IsPrefixOfStringInTable_x64_3</a> actually
                    provided a tangible speed-up, but it was lost in the noise of <code>pushfq</code>
                    slowing things down so much.  Or perhaps
                    <a href="#IsPrefixOfStringInTable_x64_2">IsPrefixOfStringInTable_x64_2</a> is
                    just doing something particularly bad.

                </p>

                <p>

                    Either way, it means we might be able to wrangle an assembly version
                    that can dominate the negative matching fast path <strong>and</strong> give the
                    C version a run for its money with prefix matching, which would be a great way
                    to end the article!  Let's give it a shot.

                </p>

                <p>

                    <small>

                        (Note: this is the first point in the article where I'm not retroactively
                        documenting what I've done &mdash; it's all live!  I have no idea if I'll
                        be able to produce a final assembly version that's competitive with C in
                        all aspects.  Then again, I'm persistent and stubborn, so who knows.)

                    </small>

                </p>

                <p>

                    We'll do this in a couple of pieces.  First, we'll convert version 8 (which has
                    version 3's logic) into a <code>LEAF_ENTRY</code> and restore the byte-by-byte
                    comparison logic instead of <code>repe cmpsb</code>, but keep everything else
                    identical.  This will be version 9.  For version 10, we can tidy up version 9 a
                    bit and replace some of the jumps to the epilogue area (<code>Pfx90</code>) with
                    a simple <code>ret</code> where applicable.

                </p>

                <p>

                    From there, we'll make version 11, which will combine version 10 and the
                    optimized negative match logic we established in the assembly version 5.
                    After that, we can use versions 12 onward to try to replicate the superior
                    inner loop approach identified by Fabian that led to the C routine
                    <a href="#IsPrefixOfStringInTable_13">IsPrefixOfStringInTable_13</a>.  And to
                    think we were almost going to publish this article without investigating the
                    slowdown associated with version 3 of the assembly!

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_9"></a>
                <h2>IsPrefixOfStringInTable_x64_9</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_8"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_x64_8</a> |
                    <a href="#IsPrefixOfStringInTable_x64_10">IsPrefixOfStringInTable_x64_10  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    As mentioned, let's take the version 8 <code>NESTED_ENTRY</code> and convert
                    it into a <code>LEAF_ENTRY</code> with the least amount of code churn possible.
                    As version 8 is essentially version 3 with a relocated prologue and the
                    <code>push_eflags</code>/<code>popfq</code> bits commented out, I'll provide a
                    diff against version 3 as well.

                </p>
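                <p>

                    For reference, the byte-by-byte comparison we're going back to boils down to
                    something like the C below.  It's a sketch under stated assumptions rather
                    than a transcription of the assembly: <code>tail_matches</code> and its
                    parameters are hypothetical names, and it assumes the first 16 bytes have
                    already matched and that the slot's string is longer than 16 bytes.

                </p>

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if bytes [16, slot_len) of the search string and the slot's
 * string are identical, 0 otherwise.  The caller is assumed to have already
 * checked that the search string is at least slot_len bytes long.
 */
static int tail_matches(const char *search, const char *slot, size_t slot_len)
{
    assert(slot_len > 16);

    const char *a = search + 16;        /* Skip the 16 bytes already compared. */
    const char *b = slot + 16;
    size_t remaining = slot_len - 16;   /* Number of bytes left to compare.    */

    for (size_t i = 0; i < remaining; i++) {
        if (a[i] != b[i])
            return 0;                   /* Mismatch: try the next candidate.   */
    }
    return 1;                           /* Every remaining byte was equal.     */
}
```

                <p>

                    A return of 0 corresponds to the jump to <code>Pfx60</code> (move on to the
                    next candidate slot); a return of 1 corresponds to falling through to the
                    match path.

                </p>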

                <div class="tab-box language box-x64-9">
                    <ul class="tabs">
                        <li data-content="content-x64-9v8-diff">Diff (9 vs 8)</li>
                        <li data-content="content-x64-9v3-diff">Diff (9 vs 3)</li>
                        <li data-content="content-x64-9-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-x64-9v8-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_8.asm IsPrefixOfStringInTable_x64_9.asm
--- IsPrefixOfStringInTable_x64_8.asm   2018-04-29 22:08:02.761164300 -0400
+++ IsPrefixOfStringInTable_x64_9.asm   2018-04-30 20:14:58.067237400 -0400
@@ -18,31 +18,6 @@

 include StringTable.inc

-;
-; Define a locals struct for saving flags, rsi and rdi.
-;
-
-Locals struct
-
-    Padding             dq      ?
-    SavedRdi            dq      ?
-    SavedRsi            dq      ?
-    SavedFlags          dq      ?
-
-    ReturnAddress       dq      ?
-    HomeRcx             dq      ?
-    HomeRdx             dq      ?
-    HomeR8              dq      ?
-    HomeR9              dq      ?
-
-Locals ends
-
-;
-; Exclude the return address onward from the frame calculation size.
-;
-
-LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))
-
 ;++
 ;
 ; STRING_TABLE_INDEX
@@ -58,9 +33,11 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 7, but comments out the pushing and
-;   popping of flags to the stack in the prologue and epilogue, respectively,
-;   in order to test a theory regarding performance.
+;   This routine is based off version 8, but reverts the 'repe cmps' to the
+;   same byte-by-byte comparison loop we used in all the previous versions.
+;   As this removes the dependency on rsi, rdi and the direction flag, we
+;   no longer need to push those values to the stack, so we also revert back
+;   to a LEAF_ENTRY.
 ;
 ; Arguments:
 ;
@@ -79,7 +56,7 @@
 ;
 ;--

-        NESTED_ENTRY IsPrefixOfStringInTable_x64_8, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_9, _TEXT$00
 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
@@ -153,18 +130,6 @@
         ;IACA_VC_END

 ;
-; Begin prologue.  Allocate stack space and save non-volatile registers.
-;
-
-Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.
-
-       ;push_eflags                                 ; Save flags.
-        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
-        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
-
-        END_PROLOGUE
-
-;
 ; (There was at least one match, continue with processing.)
 ;

@@ -182,7 +147,7 @@
 ;   r11 - String length (String-&gt;Length)
 ;

-        vpextrb     r11, xmm4, 0                ; Load length.
+Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
         mov         rax, 16                     ; Load 16 into rax.
         mov         r10, r11                    ; Copy into r10.
         cmp         r10w, ax                    ; Compare against 16.
@@ -449,16 +414,21 @@

 ;
 ; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
-; Set up rsi/rdi so we can do a 'rep cmps'.
+; Do a byte-by-byte comparison.
 ;

-        cld
-        mov         rsi, r11
-        mov         rdi, r8
-        repe        cmpsb
+        align 16
+@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
+        jne         short Pfx60                 ; If not equal, jump.
+
+;
+; The two bytes were equal, update rax, decrement rcx and potentially continue
+; the loop.
+;

-        test        cl, 0
-        jnz         short Pfx60                 ; Not all bytes compared, jump.
+        inc         ax                          ; Increment index.
+        loopnz      @B                          ; Decrement cx and loop back.

 ;
 ; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
@@ -485,7 +455,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp Pfx90                               ; Return.
+        ret                                     ; Return.

 ;
 ; More comparisons remain; restore the registers we clobbered and continue loop.
@@ -498,16 +468,9 @@

         ;IACA_VC_END

-        align   16
-
-Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
-        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
-       ;popfq                                   ; Restore flags.
-        add     rsp, LOCALS_SIZE                ; Deallocate stack space.
-
-        ret
+Pfx90:  ret

-        NESTED_END   IsPrefixOfStringInTable_x64_8, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_9, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
 </code></pre>
<pre class="code content-x64-9v3-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_3.asm IsPrefixOfStringInTable_x64_9.asm
--- IsPrefixOfStringInTable_x64_3.asm   2018-04-29 16:13:23.879193700 -0400
+++ IsPrefixOfStringInTable_x64_9.asm   2018-04-30 20:14:58.067237400 -0400
@@ -18,31 +18,6 @@

 include StringTable.inc

-;
-; Define a locals struct for saving flags, rsi and rdi.
-;
-
-Locals struct
-
-    Padding             dq      ?
-    SavedRdi            dq      ?
-    SavedRsi            dq      ?
-    SavedFlags          dq      ?
-
-    ReturnAddress       dq      ?
-    HomeRcx             dq      ?
-    HomeRdx             dq      ?
-    HomeR8              dq      ?
-    HomeR9              dq      ?
-
-Locals ends
-
-;
-; Exclude the return address onward from the frame calculation size.
-;
-
-LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))
-
 ;++
 ;
 ; STRING_TABLE_INDEX
@@ -58,13 +33,11 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 2.  It has been converted into a nested
-;   entry (version 2 is a leaf entry), and uses 'rep cmpsb' to do the string
-;   comparison for long strings (instead of the byte-by-byte comparison used
-;   in version 2).  This requires use of the rsi and rdi registers, and the
-;   direction flag.  These are all non-volatile registers and thus, must be
-;   saved to the stack in the function prologue (hence the need to make this
-;   a nested entry).
+;   This routine is based off version 8, but reverts the 'repe cmps' to the
+;   same byte-by-byte comparison loop we used in all the previous versions.
+;   As this removes the dependency on rsi, rdi and the direction flag, we
+;   no longer need to push those values to the stack, so we also revert back
+;   to a LEAF_ENTRY.
 ;
 ; Arguments:
 ;
@@ -83,20 +56,7 @@
 ;
 ;--

-        NESTED_ENTRY IsPrefixOfStringInTable_x64_3, _TEXT$00
-
-;
-; Begin prologue.  Allocate stack space and save non-volatile registers.
-;
-
-        alloc_stack LOCALS_SIZE                     ; Allocate stack space.
-
-        push_eflags                                 ; Save flags.
-        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
-        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
-
-        END_PROLOGUE
-
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_9, _TEXT$00
 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
@@ -165,7 +125,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp         Pfx90                       ; Return.
+        ret                                     ; Return.

         ;IACA_VC_END

@@ -454,16 +414,21 @@

 ;
 ; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
-; Set up rsi/rdi so we can do a 'rep cmps'.
+; Do a byte-by-byte comparison.
 ;

-        cld
-        mov         rsi, r11
-        mov         rdi, r8
-        repe        cmpsb
+        align 16
+@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
+        jne         short Pfx60                 ; If not equal, jump.
+
+;
+; The two bytes were equal, update rax, decrement rcx and potentially continue
+; the loop.
+;

-        test        cl, 0
-        jnz         short Pfx60                 ; Not all bytes compared, jump.
+        inc         ax                          ; Increment index.
+        loopnz      @B                          ; Decrement cx and loop back.

 ;
 ; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
@@ -490,7 +455,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp Pfx90                               ; Return.
+        ret                                     ; Return.

 ;
 ; More comparisons remain; restore the registers we clobbered and continue loop.
@@ -503,16 +468,9 @@

         ;IACA_VC_END

-        align   16
-
-Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
-        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
-        popfq                                   ; Restore flags.
-        add     rsp, LOCALS_SIZE                ; Deallocate stack space.
-
-        ret
+Pfx90:  ret

-        NESTED_END   IsPrefixOfStringInTable_x64_3, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_9, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
 </code></pre>
<pre class="code content-x64-9-full"><code class="language-nasm">;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 8, but reverts the 'repe cmps' to the
;   same byte-by-byte comparison loop we used in all the previous versions.
;   As this removes the dependency on rsi, rdi and the direction flag, we
;   no longer need to push those values to the stack, so we also revert back
;   to a LEAF_ENTRY.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_9, _TEXT$00
;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] if needed).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt; 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        jmp         Pfx90                       ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        jmp         Pfx90

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement rcx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

Pfx90:  ret

        LEAF_END   IsPrefixOfStringInTable_x64_9, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
                    </div>
                </div>
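
                <p>

                    The main loop at <code>Pfx20</code> can be easier to follow in scalar form.
                    The following is a minimal C sketch, not the code from the listing: the helper
                    names (<code>count_trailing_zeros</code>, <code>population_count</code>,
                    <code>walk_candidates</code> and <code>matched_chars</code>) are hypothetical,
                    and the portable loops stand in for the <code>tzcnt</code>,
                    <code>popcnt</code>, <code>vpcmpeqb</code>, <code>vpmovmskb</code> and
                    <code>bzhi</code> instructions.  It assumes a 16-bit candidate bitmap and a
                    search length that has already been clamped to 16.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for tzcnt (assumption: the real code uses the instruction). */
static unsigned count_trailing_zeros(uint32_t value)
{
    unsigned count = 0;
    if (value == 0) {
        return 32;
    }
    while ((value & 1u) == 0) {
        value >>= 1;
        count++;
    }
    return count;
}

/* Portable stand-in for popcnt. */
static unsigned population_count(uint32_t value)
{
    unsigned count = 0;
    while (value != 0) {
        value &= value - 1;     /* Clear the lowest set bit. */
        count++;
    }
    return count;
}

/*
 * Visit every set bit of the candidate bitmap in the same order as the
 * Pfx20 loop, storing each slot index and returning how many were visited.
 */
static unsigned walk_candidates(uint32_t bitmap, unsigned indexes[16])
{
    unsigned count;
    unsigned shift = 0;                                 /* xor r9, r9 */
    unsigned i;

    assert(bitmap <= 0xffffu);                          /* One bit per slot. */
    count = population_count(bitmap);                   /* popcnt ecx, edx */

    for (i = 0; i < count; i++) {
        unsigned zeros = count_trailing_zeros(bitmap);  /* tzcnt r8d, edx */
        unsigned index = zeros + shift;                 /* add rax, r9 */
        bitmap >>= (zeros + 1);                         /* inc r8; shrx */
        shift = index + 1;                              /* mov r9, rax; inc r9 */
        indexes[i] = index;
    }
    return count;
}

/*
 * Count how many of the first 'search_length' bytes are equal between the
 * search string and a slot (vpcmpeqb, vpmovmskb, bzhi, popcnt).
 */
static unsigned matched_chars(const uint8_t search[16],
                              const uint8_t slot[16],
                              unsigned search_length)
{
    uint32_t mask = 0;
    unsigned i;

    assert(search_length <= 16);

    for (i = 0; i < 16; i++) {
        if (search[i] == slot[i]) {
            mask |= (1u << i);
        }
    }

    mask &= (1u << search_length) - 1u;     /* Zero bits at and above the length. */
    return population_count(mask);
}
```

                <p>

                    With a bitmap of <code>0x0141</code> (bits 0, 6 and 8 set), this sketch of the
                    loop visits slot indexes 0, 6 and 8 in that order.  The masking step in
                    <code>matched_chars</code> is what discards equal bytes that lie beyond the
                    search length, such as trailing zero padding that happens to match.

                </p>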

                <p>

                    Let's go straight into version 10, as it's only a very minor tweak to version 9
                    above.
                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_10"></a>
                <h2>IsPrefixOfStringInTable_x64_10</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_9"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_x64_9</a> |
                    <a href="#IsPrefixOfStringInTable_x64_11">IsPrefixOfStringInTable_x64_11  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    Remove the final remnant of the <code>NESTED_ENTRY</code> version, the shared
                    exit label <code>Pfx90</code>, by replacing each jump to it with a plain
                    <code>ret</code> instruction.

                </p>

                <div class="tab-box language box-x64-10v9">
                    <ul class="tabs">
                        <li data-content="content-x64-10v9-diff">Diff</li>
                        <li data-content="content-x64-10-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code context-x64-10v9-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_9.asm IsPrefixOfStringInTable_x64_10.asm
--- IsPrefixOfStringInTable_x64_9.asm   2018-04-30 20:14:58.067237400 -0400
+++ IsPrefixOfStringInTable_x64_10.asm  2018-05-02 08:16:39.672110400 -0400
@@ -33,11 +33,8 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 8, but reverts the 'repe cmps' to the
-;   same byte-by-byte comparison loop we used in all the previous version.
-;   As this removes the dependency on rsi, rdi and the direction flag, we
-;   no longer need to push those values to the stack, so we also revert back
-;   to a LEAF_ENTRY.
+;   This routine is identical to version 9, except the 'jmp Pfx90' lines have
+;   been replaced by normal 'ret' lines.
 ;
 ; Arguments:
 ;
@@ -56,7 +53,7 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_9, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_10, _TEXT$00
 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
@@ -310,7 +307,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp         Pfx90                       ; Return.
+        ret                                     ; Return.

 ;
 ; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
@@ -334,7 +331,7 @@
 ;

         vpextrd     eax, xmm5, 3                ; Extract raw index for match.
-        jmp         Pfx90                       ; StringMatch == NULL, finish.
+        ret                                     ; StringMatch == NULL, finish.

 ;
 ; StringMatch is not NULL.  Fill out characters matched (currently rax), then
@@ -360,7 +357,7 @@
         mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
         shr         eax, 4                  ; Revert the scaling.

-        jmp         Pfx90
+        ret

 ;
 ; 16 characters matched and the length of the underlying slot is greater than
@@ -468,9 +465,7 @@

         ;IACA_VC_END

-Pfx90:  ret
-
-        LEAF_END   IsPrefixOfStringInTable_x64_9, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_10, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
<pre class="code content-x64-10-full"><code class="language-nasm">;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is identical to version 9, except the 'jmp Pfx90' lines have
;   been replaced by normal 'ret' lines.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_10, _TEXT$00
;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcast to every byte
; element of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is the
; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcast across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] if needed).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt; 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement rcx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_10, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
                    </div>
                </div>

                <p>

                    Let's review performance.  I'll omit the C versions from the graphs for now
                    whilst we focus on optimizing the assembly versions.  In this next comparison,
                    we want to verify that we're still seeing the performance gains we saw in
                    version 8 in versions 9 and 10.  If the timings for version 9 and 10 differ,
                    I'd expect version 10 to be better &mdash; but it won't be by much.

                </p>

                <p>

                    <small>

                        (Note: I had to generate new CSV files for these graphs, as the old ones
                        didn't have any timings for these new functions we've added.  It's easier
                        to just regenerate timings for everything, versus trying to splice the
                        new timings into the old files.  So, there will be small differences in
                        the numbers you see here for old routines referenced earlier (i.e. the
                        timings for assembly versions 2, 4, 5 and 8 aren't identical to earlier
                        graphs).  The differences are negligible (a handful of cycles per 1000
                        iterations).  I'll put the GitHub URL of the corresponding source data
                        used to generate each graph herein.  They'll all live within
                        <a
                        href="https://github.com/tpn/website/blob/master/is-prefix-of-string-in-table/">this
                        directory</a>, which contains all the source for everything in this
                        article.)

                    </small>

                </p>

                <p>
                    <a href="Benchmark-x64-10-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-10-v1.svg"/>
                    </a>
                </p>

                <p>

                    Excellent!  Version 10 is a tiny bit faster than 9, but both retain the speed
                    advantages we saw from version 8.  We can also see how expensive the setup cost
                    is for <code>repe cmpsb</code>, which version 8 used.  It's not necessarily
                    a fair comparison, as only one byte is being compared ($INDEX_ALLOCATION is 17
                    bytes long, so we're only comparing the final letter, N), and there's a fixed
                    overhead with the <code>repe cmp/stos/lods</code>-type instructions that can't
                    be avoided.  (They can prove optimal for longer sequences, though.)

                </p>
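                <p>

                    To make the "only one byte" point concrete, here is a minimal C sketch of the
                    byte-by-byte tail comparison that runs once the first 16 bytes have already
                    matched.  The function names <code>tail_length</code> and
                    <code>tail_match_count</code> are hypothetical and don't appear in the
                    article's source; the sketch assumes both buffers hold at least
                    <code>slot_length</code> bytes, which the earlier length filter is meant to
                    guarantee.

                </p>

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes left to inspect after the 16 bytes handled by SIMD. */
size_t
tail_length(size_t slot_length)
{
    return slot_length > 16 ? slot_length - 16 : 0;
}

/*
 * Hypothetical model of the byte-by-byte loop: walk the bytes after the first
 * 16 and return how many of them matched before the first mismatch (or before
 * running out of slot bytes).
 */
size_t
tail_match_count(const char *search, const char *slot, size_t slot_length)
{
    size_t index;

    assert(slot_length >= 16);

    for (index = 16; index < slot_length; index++) {
        if (search[index] != slot[index]) {
            break;
        }
    }

    return index - 16;
}
```

                <p>

                    For a 17-byte string such as $INDEX_ALLOCATION, <code>tail_length</code> is 1,
                    so the loop body runs once.  With so little work per call, any fixed startup
                    cost is likely to dominate the measurement.

                </p>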

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_11"></a>
                <h2>IsPrefixOfStringInTable_x64_11</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_10"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_x64_10</a> |
                    <a href="#IsPrefixOfStringInTable_x64_12">IsPrefixOfStringInTable_x64_12  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>


                    Let's take version 10 and blend in the optimal negative match instruction
                    ordering we used for version 5.  (Version 10 is essentially derived from
                    version 3, and we wrote that before we'd come up with the optimizations
                    explored in versions 4 and 5.)

                </p>
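                <p>

                    As a rough guide to what that reordering buys us, here is a scalar C sketch of
                    the two ways of detecting "no candidate slots".  It assumes one bit per slot
                    rather than one byte per slot, and all of the names (<code>slot_mask</code>,
                    <code>ptest_zf</code>, <code>ptest_cf</code>, <code>no_match_v10</code>,
                    <code>no_match_v11</code>, <code>candidates</code>) are made up for
                    illustration.  It models the flag behaviour of <code>vptest</code> only; it is
                    not a translation of the routine.

                </p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per string table slot, standing in for 16 mask bytes. */
typedef uint16_t slot_mask;

/* vptest a, b sets ZF when (a AND b) is all zeros. */
bool ptest_zf(slot_mask a, slot_mask b) { return (a & b) == 0; }

/* vptest a, b sets CF when ((NOT a) AND b) is all zeros. */
bool ptest_cf(slot_mask a, slot_mask b) { return ((slot_mask)~a & b) == 0; }

/* Version 10 style: invert the "too long" mask explicitly, then test ZF. */
bool no_match_v10(slot_mask too_long, slot_mask unique_match)
{
    slot_mask short_enough = (slot_mask)~too_long;  /* vpxor with all ones */
    return ptest_zf(unique_match, short_enough);    /* vptest, then jnz    */
}

/* Version 11 style: skip the inversion and let CF do the NOT for us. */
bool no_match_v11(slot_mask too_long, slot_mask unique_match)
{
    return ptest_cf(too_long, unique_match);        /* vptest, then jnc    */
}

/* Slots that still need a full comparison (the vpandn step). */
slot_mask candidates(slot_mask too_long, slot_mask unique_match)
{
    return (slot_mask)(~too_long & unique_match);
}

/* Spot-check that both formulations agree over a sample of mask pairs. */
void check_equivalence(void)
{
    for (uint32_t a = 0; a <= 0xFFFF; a += 257) {
        for (uint32_t b = 0; b <= 0xFFFF; b += 251) {
            assert(no_match_v10((slot_mask)a, (slot_mask)b) ==
                   no_match_v11((slot_mask)a, (slot_mask)b));
        }
    }
}
```

                <p>

                    Both functions compute the same predicate; the second simply avoids
                    materializing the inverted mask before the early exit.

                </p>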

                <div class="tab-box language box-x64-11v10">
                    <ul class="tabs">
                        <li data-content="content-x64-11v10-diff">Diff</li>
                        <li data-content="content-x64-11-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-x64-11v10-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_10.asm IsPrefixOfStringInTable_x64_11.asm
--- IsPrefixOfStringInTable_x64_10.asm  2018-05-02 08:16:39.672110400 -0400
+++ IsPrefixOfStringInTable_x64_11.asm  2018-05-03 17:21:18.181161900 -0400
@@ -33,8 +33,8 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is identical to version 9, except the 'jmp Pfx90' lines have
-;   been replaced by normal 'ret' lines.
+;   This routine is based off version 10, with the optimized negative prefix
+;   matching logic in place from version 5.
 ;
 ; Arguments:
 ;
@@ -53,68 +53,65 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_10, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_11, _TEXT$00
+
 ;
-; Load the string buffer into xmm0, and the unique indexes from the string table
-; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
-; result into xmm5.
+; Load the address of the string buffer into rax.
 ;

         ;IACA_VC_START

-        mov     rax, String.Buffer[rdx]
-        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
-        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
-        vpshufb xmm5, xmm0, xmm1
+        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

 ;
-; Load the string table's unique character array into xmm2.
+; Broadcast the byte-sized string length into xmm4.
+;

-        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
+        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

 ;
-; Compare the search string's unique character array (xmm5) against the string
-; table's unique chars (xmm2), saving the result back into xmm5.
+; Load the lengths of each string table slot into xmm3.
 ;

-        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.
+        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

 ;
-; Load the lengths of each string table slot into xmm3.
+; Load the search string buffer into xmm0.
 ;
-        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.
+
+        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

 ;
-; Set xmm2 to all ones.  We use this later to invert the length comparison.
+; Compare the search string's length, which we've broadcasted to all 8-bit
+; elements of the xmm4 register, to the lengths of the slots in the string
+; table, to find those that are greater in length.
 ;

-        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.
+        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

 ;
-; Broadcast the byte-sized string length into xmm4.
+; Shuffle the buffer in xmm0 according to the unique indexes, and store the
+; result into xmm5.
 ;

-        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.
+        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.

 ;
-; Compare the search string's length, which we've broadcasted to all 8-byte
-; elements of the xmm4 register, to the lengths of the slots in the string
-; table, to find those that are greater in length.  Invert the result, such
-; that we're left with a masked register where each 0xff element indicates
-; a slot with a length less than or equal to our search string's length.
+; Compare the search string's unique character array (xmm5) against the string
+; table's unique chars (read from memory), saving the result back into xmm5.
 ;

-        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
-        vpxor       xmm1, xmm1, xmm2            ; Invert the result.
+        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.

 ;
 ; Intersect-and-test the unique character match xmm mask register (xmm5) with
 ; the length match mask xmm register (xmm1).  This affects flags, allowing us
-; to do a fast-path exit for the no-match case (where ZF = 1).
+; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
+; been inverted).
 ;

-        vptest      xmm5, xmm1                  ; Check for no match.
-        jnz         short Pfx10                 ; There was a match.
+        vptest      xmm1, xmm5                  ; Check for no match.
+        jnc         short Pfx10                 ; There was a match.

 ;
 ; No match, set rax to -1 and return.
@@ -161,12 +158,12 @@
         vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

 ;
-; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
+; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
 ; yielding a mask identifying indices we need to perform subsequent matches
 ; upon.  Convert this into a bitmap and save in xmm2d[2].
 ;

-        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
+        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
         vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

 ;
@@ -465,7 +462,7 @@

         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_10, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_11, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
<pre class="code content-x64-11-full"><code class="language-nasm">;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 10, with the optimized negative prefix
;   matching logic in place from version 5.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_11, _TEXT$00

;
; Load the address of the string buffer into rax.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Load the lengths of each string table slot into xmm3.
;

        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

;
; Load the search string buffer into xmm0.
;

        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

;
; Shuffle the buffer in xmm0 according to the unique indexes, and store the
; result into xmm5.
;

        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (read from memory), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
; been inverted).
;

        vptest      xmm1, xmm5                  ; Check for no match.
        jnc         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap and save in xmm2d[2].
;

        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining characters.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is &gt; 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length &lt; 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be greater than or equal to the slot length at this point, so we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement rcx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_11, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

</code></pre>
                    </div>
                </div>
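                <p>

                    For readers who find the main loop easier to follow in C, the per-slot
                    comparison (<code>vpcmpeqb</code>, <code>vpmovmskb</code>, <code>bzhi</code>,
                    <code>popcnt</code>) can be modelled roughly as below.  This is a scalar
                    sketch with hypothetical helper names, assuming each slot is exactly 16 bytes
                    and that the search buffer has at least 16 readable bytes; it is not the
                    article's C implementation.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for vpcmpeqb + vpmovmskb: bit i is set when byte i is equal. */
uint32_t equal_byte_mask(const unsigned char search[16],
                         const unsigned char slot[16])
{
    uint32_t mask = 0;

    for (unsigned i = 0; i < 16; i++) {
        if (search[i] == slot[i]) {
            mask |= (uint32_t)1 << i;
        }
    }

    return mask;
}

/* Stand-in for bzhi: keep only the low 'count' bits (count <= 16 here). */
uint32_t zero_high_bits(uint32_t value, unsigned count)
{
    assert(count <= 16);
    return value & (((uint32_t)1 << count) - 1);
}

/* Stand-in for popcnt: clear the lowest set bit until none remain. */
unsigned count_bits(uint32_t value)
{
    unsigned count = 0;

    while (value != 0) {
        value &= value - 1;
        count++;
    }

    return count;
}

/* Characters matched within the first min(string_length, 16) bytes. */
unsigned characters_matched(const unsigned char search[16],
                            const unsigned char slot[16],
                            unsigned string_length)
{
    unsigned search_length = string_length > 16 ? 16 : string_length;
    uint32_t mask = equal_byte_mask(search, slot);

    return count_bits(zero_high_bits(mask, search_length));
}
```

                <p>

                    Masking with the search length matters because bytes past the end of the
                    search string may compare equal or unequal by accident, and neither outcome
                    should affect the count.

                </p>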

                <p>

                    Notice the similarity between the diff above and the one for
                    <a href="#IsPrefixOfStringInTable_x64_5-diff">IsPrefixOfStringInTable_x64_5</a>.
                    Let's see how the performance compares.  The negative match performance for
                    version 11 should be on par with version 5.

                </p>

                <p>
                    <a href="Benchmark-x64-11-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-11-v1.svg"/>
                    </a>
                </p>

                <p>

                    We have a new winner!  Version 11 is now the fastest assembly version across the
                    board for both prefix and negative matching.  Before we start our final pass on
                    version 12, let's take a quick look at how we currently compare against the
                    fastest C version:

                </p>

                <p>
                    <a href="Benchmark-x64-11v14C-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-11v14C-v1.svg"/>
                    </a>
                </p>

                <p>

                    It's already very close!  We just need to shave off a few more cycles on the
                    assembly version to take the crown.

                </p>

                <hr/>
                <a class="xref" name="IsPrefixOfStringInTable_x64_12"></a>
                <h2>IsPrefixOfStringInTable_x64_12</h2>
                <small>
                    <a href="#IsPrefixOfStringInTable_x64_11"><i class="fa fa-arrow-left"></i>  IsPrefixOfStringInTable_x64_11</a> |
                    <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">IsPrefixOfStringInTable_x64_13  <i class="fa fa-arrow-right"></i></a>
                </small>

                <p>

                    Let's start with updating the main loop logic such that it matches
                    <a href="#IsPrefixOfStringInTable_13">IsPrefixOfStringInTable_13</a>.  We'll
                    drop the bitmap shifting and loop count in favor of the <code>blsr</code>
                    approach.

                </p>
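                <p>

                    The difference between the two loop styles can be sketched in portable C.
                    The helper names here are hypothetical, the bit tricks are plain-C stand-ins
                    for <code>tzcnt</code>, <code>popcnt</code> and <code>blsr</code>, and the
                    functions merely collect slot indexes instead of comparing strings.  The
                    sketch assumes a bitmap with at most 16 bits in use.

                </p>

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for tzcnt; the bitmap must be non-zero. */
unsigned trailing_zeros(uint32_t value)
{
    unsigned count = 0;

    assert(value != 0);
    while ((value & 1) == 0) {
        value >>= 1;
        count++;
    }

    return count;
}

/* Portable stand-in for popcnt. */
unsigned count_set_bits(uint32_t value)
{
    unsigned count = 0;

    for (; value != 0; value &= value - 1) {
        count++;
    }

    return count;
}

/* Portable stand-in for blsr: clear the lowest set bit. */
uint32_t reset_lowest_set_bit(uint32_t value)
{
    return value & (value - 1);
}

/* Older style: explicit loop counter, plus a shift count to rebuild indexes. */
unsigned visit_with_shifts(uint32_t bitmap, unsigned indexes[16])
{
    unsigned count = 0;
    unsigned shift = 0;

    assert(bitmap <= 0xFFFF);
    for (unsigned remaining = count_set_bits(bitmap); remaining != 0; remaining--) {
        unsigned zeros = trailing_zeros(bitmap);
        unsigned index = zeros + shift;     /* Add shift to create index.  */

        bitmap >>= zeros + 1;               /* Reposition bitmap.          */
        shift = index + 1;                  /* Shift = index + 1.          */
        indexes[count++] = index;
    }

    return count;
}

/* blsr style: no counter and no shift; stop when the bitmap is empty. */
unsigned visit_with_blsr(uint32_t bitmap, unsigned indexes[16])
{
    unsigned count = 0;

    assert(bitmap <= 0xFFFF);
    while (bitmap != 0) {
        indexes[count++] = trailing_zeros(bitmap);
        bitmap = reset_lowest_set_bit(bitmap);
    }

    return count;
}
```

                <p>

                    Both functions visit the same indexes in the same order; the second needs
                    less bookkeeping per iteration, which is where the freed-up registers in the
                    assembly version come from.

                </p>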

                <p>

                    <em>(About 5 hours pass...)</em>

                </p>

                <p>

                    Alright, I'm back!  Version 12 of our assembly routine is complete!  This was
                    the first major change to the routine since version 2, and I had the
                    benefit of the past ~220 hours already spent obsessing over this topic, so I'm
                    actually pretty happy with the result!  Let's take a look.  (The diff view of
                    this version is pretty messy compared to the others, given the increased amount
                    of code churn that was involved.)

                </p>

                <div class="tab-box language box-x64-12">
                    <ul class="tabs">
                        <li data-content="content-x64-12v11-diff">Diff</li>
                        <li data-content="content-x64-12-full">Full</li>
                    </ul>
                    <div class="content">
<pre class="code content-x64-12v11-diff"><code class="language-diff">% diff -u IsPrefixOfStringInTable_x64_11.asm IsPrefixOfStringInTable_x64_12.asm
--- IsPrefixOfStringInTable_x64_11.asm  2018-05-03 17:21:18.181161900 -0400
+++ IsPrefixOfStringInTable_x64_12.asm  2018-05-04 12:36:24.773963100 -0400
@@ -33,8 +33,11 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 10, with the optimized negative prefix
-;   matching logic in place from version 5.
+;   This routine is based on version 11, but leverages the inner loop logic
+;   tweak we used in version 13 of the C version, pointed out by Fabian Giesen
+;   (@rygorous).  That is, we do away with the shifting logic and explicit loop
+;   counting, and simply use blsr to keep iterating through the bitmap until it
+;   is empty.
 ;
 ; Arguments:
 ;
@@ -53,7 +56,7 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_11, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_12, _TEXT$00

 ;
 ; Load the address of the string buffer into rax.
@@ -129,9 +132,7 @@

 ;
 ; Calculate the "search length" for the incoming search string, which is
-; equivalent of 'min(String-&gt;Length, 16)'.  (The search string's length
-; currently lives in xmm4, albeit as a byte-value broadcasted across the
-; entire register, so extract that first.)
+; equivalent of 'min(String-&gt;Length, 16)'.
 ;
 ; Once the search length is calculated, deposit it back at the second byte
 ; location of xmm4.
@@ -141,21 +142,18 @@
 ;   r11 - String length (String-&gt;Length)
 ;

-Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
-        mov         rax, 16                     ; Load 16 into rax.
-        mov         r10, r11                    ; Copy into r10.
-        cmp         r10w, ax                    ; Compare against 16.
-        cmova       r10w, ax                    ; Use 16 if length is greater.
-        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].
+Pfx10:  vpextrb     r11, xmm4, 0                ; Load string length.
+        mov         r9, 16                      ; Load 16 into r9.
+        mov         r10, r11                    ; Copy length into r10.
+        cmp         r10w, r9w                   ; Compare against 16.
+        cmova       r10w, r9w                   ; Use 16 if length is greater.

 ;
-; Home our parameter registers into xmm registers instead of their stack-backed
-; location, to avoid memory writes.
+; Home our parameter register rdx into the base of xmm2.
 ;

         vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
-        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
-        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].
+        vmovq       xmm2, rdx                   ; Save rdx into xmm2q[0].

 ;
 ; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
@@ -171,77 +169,70 @@
 ;

         vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
-        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].
+        vmovq       xmm5, r8                    ; Save r8 into xmm5q[0].

 ;
 ; Summary of xmm register stashing for the rest of the routine:
 ;
-; xmm2:
-;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
-;       64:127  (vpinsrq 1)     rdx (2nd function paramter, String)
-;
-; xmm4:
-;       0:7     (vpinsrb 0)     length of search string
-;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16)
-;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
-;      24:31    (vpinsrb 3)     shift count
+;   xmm2:
+;        0:63   (vpinsrq 0)     rdx (2nd function parameter, String)
 ;
-; xmm5:
+;   xmm4:
+;       0:7     (vpinsrb 0)     length of search string [r11]
+;       8:15    (vpinsrb 1)     min(String-&gt;Length, 16) [r10]
+;
+;   xmm5:
 ;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
 ;      64:95    (vpinsrd 2)     bitmap of slots to compare
 ;      96:127   (vpinsrd 3)     index of slot currently being processed
 ;
-
+; Non-stashing xmm register use:
 ;
-; Initialize rcx as our counter register by doing a popcnt against the bitmap
-; we just generated in edx, and clear our shift count register (r9).
+;   xmm0: First 16 characters of search string.
+;
+;   xmm3: Slot lengths.
+;
+;   xmm1: Freebie!
 ;
-
-        popcnt      ecx, edx                    ; Count bits in bitmap.
-        xor         r9, r9                      ; Clear r9.

         align 16

 ;
 ; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
-; trailing zeros of the bitmap, and then add in the shift count, producing an
-; index (rax) we can use to load the corresponding slot.
+; trailing zeros of the bitmap, producing an index (rax) we can use to load the
+; corresponding slot.
 ;
-; Register usage at top of loop:
+; Volatile register usage at top of loop:
 ;
-;   rax - Index.
-;
-;   rcx - Loop counter.
+;   rcx - StringTable.
 ;
 ;   rdx - Bitmap.
 ;
-;   r9 - Shift count.
+;   r9 - Constant value of 16.
+;
+;   r10 - Search length (min(String-&gt;Length, 16))
+;
+;   r11 - Search string length (String-&gt;Length).
 ;
-;   r10 - Search length.
+; Use of remaining volatile registers during loop:
 ;
-;   r11 - String length.
+;   rax - Index.
+;
+;   r8 - Freebie!
 ;

-Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
-        mov         eax, r8d                    ; Copy tzcnt to rax,
-        add         rax, r9                     ; Add shift to create index.
-        inc         r8                          ; tzcnt + 1
-        shrx        rdx, rdx, r8                ; Reposition bitmap.
-        mov         r9, rax                     ; Copy index back to shift.
-        inc         r9                          ; Shift = Index + 1
-        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].
+Pfx20:  tzcnt       eax, edx                    ; Count trailing zeros = index.

 ;
 ; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
 ; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
 ;
-; Then, load the string table slot at this index into xmm1, then shift rax back.
+; Then, load the string table slot at this index into xmm1.
 ;

-        shl         eax, 4
-        vpextrq     r8, xmm2, 0
-        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
-        shr         eax, 4
+        mov         r8, rax                     ; Copy index (rax) into r8.
+        shl         r8, 4                       ; "Scale" the index.
+        vmovdqa     xmm1, xmmword ptr [r8 + StringTable.Slots[rcx]]

 ;
 ; The search string's first 16 characters are already in xmm0.  Compare this
@@ -264,187 +255,220 @@
         popcnt      r8d, r8d                    ; Count bits.

 ;
-; If 16 characters matched, and the search string's length is longer than 16,
-; we're going to need to do a comparison of the remaining strings.
+; Determine if less than 16 characters matched, as this avoids needing to do
+; a more convoluted test to see if a byte-by-byte string comparison is needed
+; (for lengths longer than 16).
 ;

-        cmp         r8w, 16                     ; Compare chars matched to 16.
-        je          short @F                    ; 16 chars matched.
-        jmp         Pfx30                       ; Less than 16 matched.
+        cmp         r8w, r9w                    ; Compare chars matched to 16.
+        jl          short Pfx30                 ; Less than 16 matched.

 ;
 ; All 16 characters matched.  Load the underlying slot's length from the
-; relevant offset in the xmm3 register, then check to see if it's greater than,
-; equal or less than 16.
-;
-
-@@:     movd        xmm1, rax                   ; Load into xmm1.
-        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
-        vpextrb     rax, xmm1, 0                ; And extract back into rax.
-        cmp         al, 16                      ; Compare length to 16.
-        ja          Pfx50                       ; Length is &gt; 16.
-        je          short Pfx35                 ; Lengths match!
-                                                ; Length &lt;= 16, fall through...
+; relevant offset in the xmm3 register into r11, then check to see if it's
+; greater than 16.  If it is, we're going to need to do a string compare,
+; handled by Pfx50.
+;
+; N.B. The approach for loading the slot length here is a little quirky.  We
+;      have all the lengths for slots in xmm3, and we have the current match
+;      index in rax.  If we move rax into an xmm register (xmm1 in this case),
+;      we can use it to shuffle xmm3, such that the length we're interested in
+;      will be deposited back into the lowest byte, which we can then extract
+;      via vpextrb.
+;
+
+        movd        xmm1, rax                   ; Load index into xmm1.
+        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length by index.
+        vpextrb     r11, xmm1, 0                ; Extract slot length into r11.
+        cmp         r11w, r9w                   ; Compare length to 16.
+        ja          short Pfx50                 ; Length is &gt; 16.
+        jmp         short Pfx40                 ; Lengths match!

 ;
-; Less than or equal to 16 characters were matched.  Compare this against the
-; length of the search string; if equal, this is a match.
+; Less than 16 characters were matched.  Compare this against the length of the
+; search string; if equal, this is a match.
 ;

 Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
-        je          short Pfx35                 ; Match found!
+        je          short Pfx40                 ; Match found!

 ;
-; No match against this slot, decrement counter and either continue the loop
-; or terminate the search and return no match.
+; No match against this slot.  Clear the lowest set bit of the bitmap and check
+; to see if there are any bits remaining in it.
 ;

-        dec         cx                          ; Decrement counter.
-        jnz         Pfx20                       ; cx != 0, continue.
+        blsr        edx, edx                    ; Clear lowest set bit.
+        test        edx, edx                    ; Is bitmap empty?
+        jnz         short Pfx20                 ; Bits remain, continue loop.
+
+;
+; No more bits remain set in the bitmap, we're done.  Indicate no match found
+; and return.
+;

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
         ret                                     ; Return.

 ;
-; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
-; former is used when we need to copy the number of characters matched from r8
-; back to rax.  The latter jump target doesn't require this.
+; Load the match parameter into r9 and test to see if it's not-NULL, in which
+; case we need to fill out a STRING_MATCH structure for the match, handled by
+; jump target Pfx80 at the end of this routine.
 ;

-Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.
+Pfx40:  vpextrq     r9, xmm5, 0                 ; Extract StringMatch.
+        test        r9, r9                      ; Is NULL?
+        jnz         Pfx80                       ; Not zero, need to fill out.

 ;
-; Load the match parameter back into r8 and test to see if it's not-NULL, in
-; which case we need to fill out a STRING_MATCH structure for the match.
+; StringMatch is NULL, we're done.  We can return straight from here, rax will
+; still have the index stored.
 ;

-Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
-        test        r8, r8                      ; Is NULL?
-        jnz         short @F                    ; Not zero, need to fill out.
+        ret                                     ; StringMatch == NULL, finish.

 ;
-; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
+; 16 characters matched and the length of the underlying slot is greater than
+; 16, so we need to do a little memory comparison to determine if the search
+; string is a prefix match.
 ;
-
-        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
-        ret                                     ; StringMatch == NULL, finish.
-
+; Register use on block entry:
 ;
-; StringMatch is not NULL.  Fill out characters matched (currently rax), then
-; reload the index from xmm5 into rax and save.
+;   rax - Index.
 ;
-
-@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
-        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
-        mov         byte ptr StringMatch.Index[r8], al
-
+;   rcx - StringTable.
 ;
-; Final step, loading the address of the string in the string array.  This
-; involves going through the StringTable, so we need to load that parameter
-; back into rcx, then resolving the string array address via pStringArray,
-; then the relevant STRING offset within the StringArray.Strings structure.
+;   rdx - Bitmap.
 ;
-
-        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
-        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.
-
-        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
-        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
-        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
-        shr         eax, 4                  ; Revert the scaling.
-
-        ret
-
+;   r9 - Constant value of 16.
 ;
-; 16 characters matched and the length of the underlying slot is greater than
-; 16, so we need to do a little memory comparison to determine if the search
-; string is a prefix match.
+;   r10 - Search length (min(String-&gt;Length, 16))
 ;
-; The slot length is stored in rax at this point, and the search string's
-; length is stored in r11.  We know that the search string's length will
-; always be longer than or equal to the slot length at this point, so, we
-; can subtract 16 (currently stored in r10) from rax, and use the resulting
-; value as a loop counter, comparing the search string with the underlying
-; string slot byte-by-byte to determine if there's a match.
+;   r11 - Slot length.
+;
+; Register use during the block (after we've freed things up and loaded the
+; values we need):
+;
+;   rax - Index/accumulator.
+;
+;   rcx - Loop counter (for byte comparison).
+;
+;   rdx - Byte loaded into dl for comparison.
+;
+;   r8 - Target string buffer.
+;
+;   r9 - Search string buffer.
 ;
-
-Pfx50:  sub         rax, r10                ; Subtract 16 from search length.

 ;
-; Free up some registers by stashing their values into various xmm offsets.
+; Initialize r8 such that it's pointing to the slot's String-&gt;Buffer address.
+; This is a bit fiddly as we need to go through StringTable.pStringArray first
+; to get the base address of the STRING_ARRAY, then the relevant STRING offset
+; within the array, then the String-&gt;Buffer address from that structure.  Then,
+; add 16 to it, such that it's ready as the base address for comparison.
 ;

-        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
-        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
-        mov         rcx, rax                ; Free up rax, rcx is now counter.
+Pfx50:  mov         r8, StringTable.pStringArray[rcx] ; Load string array addr.
+        mov         r9, rax                 ; Copy index into r9.
+        shl         r9, 4                   ; "Scale" index; sizeof STRING=16.
+        lea         r8, [r9 + StringArray.Strings[r8]] ; Load STRING address.
+        mov         r8, String.Buffer[r8]   ; Load String-&gt;Buffer address.
+        add         r8, r10                 ; Advance it 16 bytes.

 ;
-; Load the search string buffer and advance it 16 bytes.
+; Load the string's buffer address into r9.  We need to get the original
+; function parameter value (rdx) from xmm2q[0], then load the String-&gt;Buffer
+; address, then advance it 16 bytes.
 ;

-        vpextrq     r11, xmm2, 1            ; Extract String into r11.
-        mov         r11, String.Buffer[r11] ; Load buffer address.
-        add         r11, r10                ; Advance buffer 16 bytes.
+        vpextrq     r9, xmm2, 0             ; Extract String into r9.
+        mov         r9, String.Buffer[r9]   ; Load buffer address.
+        add         r9, r10                 ; Advance buffer 16 bytes.

 ;
-; Loading the slot is more involved as we have to go to the string table, then
-; the pStringArray pointer, then the relevant STRING offset within the string
-; array (which requires re-loading the index from xmm5d[3]), then the string
-; buffer from that structure.
+; Save the StringTable parameter, currently in rcx, into xmm1, which is a free
+; use xmm register at this point.  This frees up rcx, allowing us to copy the
+; slot length, currently in r11, and then subtracting 16 (currently in r10),
+; in order to account for the fact that we've already matched 16 bytes.  This
+; allows us to then use rcx as the loop counter for the byte-by-byte comparison.
 ;

-        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
-        mov         r8, StringTable.pStringArray[r8] ; Load string array.
+        vmovq       xmm1, rcx               ; Free up rcx.
+        mov         rcx, r11                ; Copy slot length.
+        sub         rcx, r10                ; Subtract 16.

-        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
-        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
+;
+; We'd also like to use rax as the accumulator within the loop.  It currently
+; stores the index, which is important, so, stash that in r10 for now.  (We
+; know r10 is always 16 at this point, so it's easy to restore afterward.)
+;

-        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
-        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
-        add         r8, r10                 ; Advance buffer 16 bytes.
+        mov         r10, rax                ; Save rax to r10.
+        xor         eax, eax                ; Clear rax.

-        mov         rax, rcx                ; Copy counter.
+;
+; And we'd also like to use rdx/dl to load each byte of the search string.  It
+; currently holds the bitmap, which we need, so stash that in r11 for now, which
+; is the last of our free volatile registers at this point (after we've copied
+; the slot length from it above).
+;
+
+        mov         r11, rdx                ; Save rdx to r11.
+        xor         edx, edx                ; Clear rdx.

 ;
-; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
-; Do a byte-by-byte comparison.
+; We've got both buffer addresses + 16 bytes loaded in r8 and r9 respectively.
+; We need to do a byte-by-byte comparison.  The loop count is in rcx, and rax
+; is initialized to 0.  We're ready to go!
 ;

-        align 16
-@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
-        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
-        jne         short Pfx60                 ; If not equal, jump.
+@@:     mov         dl, byte ptr [rax + r9] ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8] ; Compare to byte in slot.
+        jne         short Pfx60             ; Bytes didn't match, exit loop.

 ;
-; The two bytes were equal, update rax, decrement rcx and potentially continue
+; The two bytes were equal, update rax, decrement rcx, and potentially continue
 ; the loop.
 ;
+        inc         al                      ; Increment index.
+        dec         cl                      ; Decrement counter.
+        jnz         short @B                ; Continue if not 0.

-        inc         ax                          ; Increment index.
-        loopnz      @B                          ; Decrement cx and loop back.
+;
+; All bytes matched!  The number of characters matched will live in rax, and
+; we also need to add 16 to it to account for the first chunk that was already
+; matched.  However, rax is also our return value, and needs to point at the
+; index of the slot that matched.  Exchange it with r8 first, as if we do have
+; a StringMatch parameter, the jump target Pfx80 will be expecting r8 to hold
+; the number of characters matched.
+;
+
+        mov         r8, rax                     ; Save characters matched.
+        mov         rax, r10                    ; Re-load index from r10.
+        vpextrq     r9, xmm5, 0                 ; Extract StringMatch.
+        test        r9, r9                      ; Is NULL?
+        jnz         short Pfx75                 ; Not zero, need to fill out.

 ;
-; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
-; how many characters we matched, and then jump to Pfx40 for finalization.
+; StringMatch is NULL, we're done.  Return rax, which will have the index in it.
 ;

-        add         rax, r10
-        jmp         Pfx40
+        ret                                     ; StringMatch == NULL, finish.

 ;
-; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
-; it.  If it's zero, we have no more strings to compare, so we can do a quick
-; exit.  If there are still comparisons to be made, restore the other registers
-; we trampled then jump back to the start of the loop Pfx20.
+; The byte comparisons were not equal.  Re-load the bitmap from r11 into rdx,
+; reposition it by clearing the lowest set bit, and potentially exit if there
+; are no more bits remaining.
 ;

-Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
-        dec         cx                          ; Decrement counter.
-        jnz         short @F                    ; Jump forward if not zero.
+Pfx60:  mov         rdx, r11                    ; Reload bitmap.
+        blsr        edx, edx                    ; Clear lowest set bit.
+        test        edx, edx                    ; Is bitmap empty?
+        jnz         short Pfx65                 ; Bits remain.

 ;
-; No more comparisons remaining, return.
+; No more bits remain set in the bitmap, we're done.  Indicate no match found
+; and return.
 ;

         xor         eax, eax                    ; Clear rax.
@@ -452,17 +476,65 @@
         ret                                     ; Return.

 ;
-; More comparisons remain; restore the registers we clobbered and continue loop.
+; We need to continue the loop, having had this oversized string test (length &gt;
+; 16 characters) fail.  Before we do that though, restore the registers we
+; clobbered to comply with Pfx20's top-of-the-loop register use assumptions.
 ;

-@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
-        vpextrb     r11, xmm4, 0                ; Restore r11.
-        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
+Pfx65:  vpextrb     r11, xmm4, 0                ; Restore string length.
+        vpextrq     rcx, xmm1, 0                ; Restore rcx (StringTable).
+        mov         r9, 16                      ; Restore constant 16 to r9.
+        mov         r10, r9                     ; Restore search length.
+                                                ; (We know it's always 16 here.)
         jmp         Pfx20                       ; Continue comparisons.

+;
+; This is the target for when we need to fill out the StringMatch structure.
+; It's located at the end of this routine because we're optimizing for the
+; case where the parameter is NULL in the loop body above, and we don't want
+; to pollute the code cache with this logic (which is quite convoluted).
+
+; N.B. Pfx75 is the jump target when we need to add 16 to the characters matched
+;      count stored in r8.  This particular path is exercised by the long string
+;      matching logic (i.e. when strings are longer than 16 and the prefix match
+;      is confirmed via byte-by-byte comparison).  We also need to reload rcx
+;      from xmm1.
+;
+; Expected register use at this point:
+;
+;   rax - Index of match.
+;
+;   rcx - StringTable.
+;
+;   r8 - Number of characters matched.
+;
+;   r9 - StringMatch.
+;
+;
+
+Pfx75:  add         r8, 16                                  ; Add 16 to count.
+        vpextrq     rcx, xmm1, 0                            ; Reload rcx.
+
+Pfx80:  mov         byte ptr StringMatch.NumberOfMatchedCharacters[r9], r8b
+        mov         byte ptr StringMatch.Index[r9], al
+
+;
+; Final step, loading the address of the string in the string array.  This
+; involves going through the StringTable to find the string array address via
+; pStringArray, then the relevant STRING offset within the StringArray.Strings
+; structure.
+;
+
+        mov         rcx, StringTable.pStringArray[rcx]      ; Load string array.
+        mov         r8, rax                                 ; Copy index to r8.
+        shl         r8, 4                                   ; "Scale" index.
+        lea         rdx, [r8 + StringArray.Strings[rcx]]    ; Resolve address.
+        mov         qword ptr StringMatch.String[r9], rdx   ; Save STRING ptr.
+        ret                                                 ; Return!
+
         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_11, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_12, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
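<p>The loop change in this version (drop the shift count and the explicit loop counter, and let <code>tzcnt</code> plus <code>blsr</code> walk the bitmap until it is empty) can be sketched in portable C. This is only an illustration, not the routine itself: the helper names <code>count_trailing_zeros</code>, <code>clear_lowest_set_bit</code> and <code>collect_candidate_slots</code> are hypothetical, and plain C expressions stand in for the two instructions.</p>

```c
/*
 * Portable stand-in for tzcnt (hypothetical helper).  The caller must pass
 * a non-zero bitmap, otherwise the loop would never terminate.
 */
static unsigned count_trailing_zeros(unsigned bitmap)
{
    unsigned index = 0;

    while ((bitmap & 1u) == 0) {
        bitmap >>= 1;
        index++;
    }
    return index;
}

/*
 * Portable stand-in for blsr (hypothetical helper): clears the lowest set
 * bit and leaves every other bit untouched.
 */
static unsigned clear_lowest_set_bit(unsigned bitmap)
{
    return bitmap & (bitmap - 1u);
}

/*
 * Hypothetical helper: records the slot index of every set bit in the
 * bitmap, lowest first, and returns how many indexes were written.  The
 * slots array is assumed to have room for one entry per set bit.
 */
static unsigned collect_candidate_slots(unsigned bitmap, unsigned *slots)
{
    unsigned count = 0;

    while (bitmap != 0) {
        /* The index is the absolute bit position, so no shift count. */
        slots[count++] = count_trailing_zeros(bitmap);

        /* An empty bitmap ends the loop, so no separate counter. */
        bitmap = clear_lowest_set_bit(bitmap);
    }
    return count;
}
```

<p>Because the bitmap is never shifted, each trailing-zero count is already the slot index, which is why the assembly no longer needs to stash an index or a shift count in an xmm register.</p>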
<pre class="code content-x64-12-full"><code class="language-nasm">;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based on version 11, but leverages the inner loop logic
;   tweak we used in version 13 of the C version, pointed out by Fabian Giesen
;   (@rygorous).  That is, we do away with the shifting logic and explicit loop
;   counting, and simply use blsr to keep iterating through the bitmap until it
;   is empty.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a STRING_MATCH structure.  This
;       will be populated with additional details about the match if a non-NULL
;       pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_12, _TEXT$00

;
; Load the address of the string buffer into rax.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Load the lengths of each string table slot into xmm3.
;

        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

;
; Load the search string buffer into xmm0.
;

        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

;
; Compare the search string's length, which we've broadcast to every byte
; element of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

;
; Shuffle the buffer in xmm0 according to the unique indexes, and store the
; result into xmm5.
;

        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (read directly from memory), saving the result back into
; xmm5.
;

        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
; been inverted).
;

        vptest      xmm1, xmm5                  ; Check for no match.
        jnc         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; the equivalent of 'min(String-&gt;Length, 16)'.
;
;   r10 - Search length (min(String-&gt;Length, 16))
;
;   r11 - String length (String-&gt;Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load string length.
        mov         r9, 16                      ; Load 16 into r9.
        mov         r10, r11                    ; Copy length into r10.
        cmp         r10w, r9w                   ; Compare against 16.
        cmova       r10w, r9w                   ; Use 16 if length is greater.

;
; Home our parameter register rdx into the base of xmm2.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vmovq       xmm2, rdx                   ; Save rdx (String).

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap and save it in edx.
;

        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vmovq       xmm5, r8                    ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
;   xmm2:
;        0:63   (vmovq)         rdx (2nd function parameter, String)
;
;   xmm4:
;       0:127   (vpbroadcastb)  length of search string, in every byte [r11]
;
;   xmm5:
;       0:63    (vmovq)         r8 (3rd function parameter, StringMatch)
;
; Non-stashing xmm register use:
;
;   xmm0: First 16 characters of search string.
;
;   xmm3: Slot lengths.
;
;   xmm1: Freebie!
;

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, producing an index (rax) we can use to load the
; corresponding slot.
;
; Volatile register usage at top of loop:
;
;   rcx - StringTable.
;
;   rdx - Bitmap.
;
;   r9 - Constant value of 16.
;
;   r10 - Search length (min(String-&gt;Length, 16))
;
;   r11 - Search string length (String-&gt;Length).
;
; Use of remaining volatile registers during loop:
;
;   rax - Index.
;
;   r8 - Freebie!
;

Pfx20:  tzcnt       eax, edx                    ; Count trailing zeros = index.

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1.
;

        mov         r8, rax                     ; Copy index (rax) into r8.
        shl         r8, 4                       ; "Scale" the index.
        vmovdqa     xmm1, xmmword ptr [r8 + StringTable.Slots[rcx]]

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Determine if less than 16 characters matched, as this avoids needing to do
; a more convoluted test to see if a byte-by-byte string comparison is needed
; (for lengths longer than 16).
;

        cmp         r8w, r9w                    ; Compare chars matched to 16.
        jl          short Pfx30                 ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register into r11, then check to see if it's
; greater than 16.  If it is, we're going to need to do a string compare,
; handled by Pfx50.
;
; N.B. The approach for loading the slot length here is a little quirky.  We
;      have all the lengths for slots in xmm3, and we have the current match
;      index in rax.  If we move rax into an xmm register (xmm1 in this case),
;      we can use it to shuffle xmm3, such that the length we're interested in
;      will be deposited back into the lowest byte, which we can then extract
;      via vpextrb.
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length by index.
        vpextrb     r11, xmm1, 0                ; Extract slot length into r11.
        cmp         r11w, r9w                   ; Compare length to 16.
        ja          short Pfx50                 ; Length is &gt; 16.
        jmp         short Pfx40                 ; Lengths match!

;
; Less than 16 characters were matched.  Compare this against the length of the
; search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx40                 ; Match found!

;
; No match against this slot.  Clear the lowest set bit of the bitmap and check
; to see if there are any bits remaining in it.
;

        blsr        edx, edx                    ; Clear lowest set bit.
        test        edx, edx                    ; Is bitmap empty?
        jnz         short Pfx20                 ; Bits remain, continue loop.

;
; No more bits remain set in the bitmap, we're done.  Indicate no match found
; and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Load the match parameter into r9 and test to see if it's not-NULL, in which
; case we need to fill out a STRING_MATCH structure for the match, handled by
; jump target Pfx80 at the end of this routine.
;

Pfx40:  vpextrq     r9, xmm5, 0                 ; Extract StringMatch.
        test        r9, r9                      ; Is NULL?
        jnz         Pfx80                       ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done.  We can return straight from here, rax will
; still have the index stored.
;

        ret                                     ; StringMatch == NULL, finish.

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; Register use on block entry:
;
;   rax - Index.
;
;   rcx - StringTable.
;
;   rdx - Bitmap.
;
;   r9 - Constant value of 16.
;
;   r10 - Search length (min(String-&gt;Length, 16))
;
;   r11 - Slot length.
;
; Register use during the block (after we've freed things up and loaded the
; values we need):
;
;   rax - Index/accumulator.
;
;   rcx - Loop counter (for byte comparison).
;
;   rdx - Byte loaded into dl for comparison.
;
;   r8 - Target string buffer.
;
;   r9 - Search string buffer.
;

;
; Initialize r8 such that it's pointing to the slot's String-&gt;Buffer address.
; This is a bit fiddly as we need to go through StringTable.pStringArray first
; to get the base address of the STRING_ARRAY, then the relevant STRING offset
; within the array, then the String-&gt;Buffer address from that structure.  Then,
; add 16 to it, such that it's ready as the base address for comparison.
;

Pfx50:  mov         r8, StringTable.pStringArray[rcx] ; Load string array addr.
        mov         r9, rax                 ; Copy index into r9.
        shl         r9, 4                   ; "Scale" index; sizeof STRING=16.
        lea         r8, [r9 + StringArray.Strings[r8]] ; Load STRING address.
        mov         r8, String.Buffer[r8]   ; Load String-&gt;Buffer address.
        add         r8, r10                 ; Advance it 16 bytes.

;
; Load the string's buffer address into r9.  We need to get the original
; function parameter value (rdx) from xmm2q[0], then load the String-&gt;Buffer
; address, then advance it 16 bytes.
;

        vpextrq     r9, xmm2, 0             ; Extract String into r9.
        mov         r9, String.Buffer[r9]   ; Load buffer address.
        add         r9, r10                 ; Advance buffer 16 bytes.

;
; Save the StringTable parameter, currently in rcx, into xmm1, which is a free
; use xmm register at this point.  This frees up rcx, allowing us to copy the
; slot length, currently in r11, and then subtracting 16 (currently in r10),
; in order to account for the fact that we've already matched 16 bytes.  This
; allows us to then use rcx as the loop counter for the byte-by-byte comparison.
;

        vmovq       xmm1, rcx               ; Free up rcx.
        mov         rcx, r11                ; Copy slot length.
        sub         rcx, r10                ; Subtract 16.

;
; We'd also like to use rax as the accumulator within the loop.  It currently
; stores the index, which is important, so, stash that in r10 for now.  (We
; know r10 is always 16 at this point, so it's easy to restore afterward.)
;

        mov         r10, rax                ; Save rax to r10.
        xor         eax, eax                ; Clear rax.

;
; And we'd also like to use rdx/dl to load each byte of the search string.  It
; currently holds the bitmap, which we need, so stash that in r11 for now, which
; is the last of our free volatile registers at this point (after we've copied
; the slot length from it above).
;

        mov         r11, rdx                ; Save rdx to r11.
        xor         edx, edx                ; Clear rdx.

;
; We've got both buffer addresses + 16 bytes loaded in r8 and r9 respectively.
; We need to do a byte-by-byte comparison.  The loop count is in rcx, and rax
; is initialized to 0.  We're ready to go!
;

@@:     mov         dl, byte ptr [rax + r9] ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8] ; Compare to byte in slot.
        jne         short Pfx60             ; Bytes didn't match, exit loop.

;
; The two bytes were equal, update rax, decrement rcx, and potentially continue
; the loop.
;
        inc         al                      ; Increment index.
        dec         cl                      ; Decrement counter.
        jnz         short @B                ; Continue if not 0.

;
; All bytes matched!  The number of characters matched will live in rax, and
; we also need to add 16 to it to account for the first chunk that was already
; matched.  However, rax is also our return value, and needs to point at the
; index of the slot that matched.  Exchange it with r8 first, as if we do have
; a StringMatch parameter, the jump target Pfx80 will be expecting r8 to hold
; the number of characters matched.
;

        mov         r8, rax                     ; Save characters matched.
        mov         rax, r10                    ; Re-load index from r10.
        vpextrq     r9, xmm5, 0                 ; Extract StringMatch.
        test        r9, r9                      ; Is NULL?
        jnz         short Pfx75                 ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done.  Return rax, which will have the index in it.
;

        ret                                     ; StringMatch == NULL, finish.

;
; The byte comparisons were not equal.  Re-load the bitmap from r11 into rdx,
; reposition it by clearing the lowest set bit, and potentially exit if there
; are no more bits remaining.
;

Pfx60:  mov         rdx, r11                    ; Reload bitmap.
        blsr        edx, edx                    ; Clear lowest set bit.
        test        edx, edx                    ; Is bitmap empty?
        jnz         short Pfx65                 ; Bits remain.

;
; No more bits remain set in the bitmap, we're done.  Indicate no match found
; and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; We need to continue the loop, having had this oversized string test (length &gt;
; 16 characters) fail.  Before we do that though, restore the registers we
; clobbered to comply with Pfx20's top-of-the-loop register use assumptions.
;

Pfx65:  vpextrb     r11, xmm4, 0                ; Restore string length.
        vpextrq     rcx, xmm1, 0                ; Restore rcx (StringTable).
        mov         r9, 16                      ; Restore constant 16 to r9.
        mov         r10, r9                     ; Restore search length.
                                                ; (We know it's always 16 here.)
        jmp         Pfx20                       ; Continue comparisons.

;
; This is the target for when we need to fill out the StringMatch structure.
; It's located at the end of this routine because we're optimizing for the
; case where the parameter is NULL in the loop body above, and we don't want
; to pollute the code cache with this logic (which is quite convoluted).
;
; N.B. Pfx75 is the jump target when we need to add 16 to the characters matched
;      count stored in r8.  This particular path is exercised by the long string
;      matching logic (i.e. when strings are longer than 16 and the prefix match
;      is confirmed via byte-by-byte comparison).  We also need to reload rcx
;      from xmm1.
;
; Expected register use at this point:
;
;   rax - Index of match.
;
;   rcx - StringTable.
;
;   r8 - Number of characters matched.
;
;   r9 - StringMatch.
;
;

Pfx75:  add         r8, 16                                  ; Add 16 to count.
        vpextrq     rcx, xmm1, 0                            ; Reload rcx.

Pfx80:  mov         byte ptr StringMatch.NumberOfMatchedCharacters[r9], r8b
        mov         byte ptr StringMatch.Index[r9], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable to find the string array address via
; pStringArray, then the relevant STRING offset within the StringArray.Strings
; structure.
;

        mov         rcx, StringTable.pStringArray[rcx]      ; Load string array.
        mov         r8, rax                                 ; Copy index to r8.
        shl         r8, 4                                   ; "Scale" index.
        lea         rdx, [r8 + StringArray.Strings[rcx]]    ; Resolve address.
        mov         qword ptr StringMatch.String[r9], rdx   ; Save STRING ptr.
        ret                                                 ; Return!

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_12, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
</code></pre>
                    </div>
                </div>

                <p>

                    I'm really happy with how that turned out!  Switching to <code>blsr</code>
                    really improved the layout of the inner loop, and vastly reduced our register
                    pressure, which means less XMM register spilling is required, which is always a
                    good thing.

                </p>
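
                <p>

                    If you haven't bumped into <code>blsr</code> before: it clears the lowest set
                    bit of its source operand, which is the same thing as
                    <code>x &amp; (x - 1)</code> in C.  Here's a minimal, portable sketch of the
                    walk-the-bitmap loop shape it enables.  It's an illustration rather than a
                    translation of the routine above: the function names are made up, and the
                    bit-scan is written as a plain loop (standing in for a hardware instruction
                    such as <code>tzcnt</code>) so that it compiles anywhere.

                </p>

<pre class="code"><code class="language-c">
#include &lt;stdint.h&gt;

//
// Portable equivalent of blsr: clear the lowest set bit.
//

static uint32_t
ClearLowestSetBit(uint32_t Bitmap)
{
    return Bitmap &amp; (Bitmap - 1);
}

//
// Portable stand-in for a hardware bit-scan: index of the lowest set bit.
// The caller must ensure Bitmap is non-zero.
//

static unsigned
LowestSetBitIndex(uint32_t Bitmap)
{
    unsigned Index = 0;

    while ((Bitmap &amp; 1) == 0) {
        Bitmap &gt;&gt;= 1;
        Index++;
    }

    return Index;
}

//
// Visit each candidate slot in the bitmap, lowest first, writing the slot
// indices into Slots (which must have room for 32 entries, one per possible
// set bit) and returning how many were written.
//

static unsigned
CollectCandidateSlots(uint32_t Bitmap, unsigned *Slots)
{
    unsigned Count = 0;

    while (Bitmap != 0) {
        Slots[Count++] = LowestSetBitIndex(Bitmap);
        Bitmap = ClearLowestSetBit(Bitmap);
    }

    return Count;
}
</code></pre>

                <p>

                    For a bitmap of <code>0x8005</code>, the loop visits slots 0, 2 and 15 and then
                    stops because the bitmap has been reduced to zero.

                </p>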

                <p>

                    But does it improve performance?  Eek!  It's our final Hail Mary attempt at an
                    improvement.  Can we beat the fastest profile-guided optimization build of the C
                    version in both prefix matching and negative matching?

                </p>

                <p>

                    *Drum roll*

                </p>

                <hr/>
                <p>
                    <br/>
                    <br/>
                    <small>

                        (This page doesn't have any ads.  But if it did, I'd totally put them here.
                        All sneaky like, just as the article gets interesting.)

                    </small>
                    <br/>
                    <br/>
                    <br/>
                </p>
                <hr/>

                <p>

                    The performance for version 12 of the assembly is...

                </p>

                <p>
                    <a href="Benchmark-x64-12-v1.svg" target="_blank">
                        <img class="svg-image" src="Benchmark-x64-12-v1.svg"/>
                    </a>
                </p>

                <p>

                    &#70;&#36;&#35;&#42;&#64;&#37;&#105;&#110;&#103; ey, look at that!  :-)

                </p>

                <p>

                    The assembly version brings in gold across the board!  Hot damn!  A quick run
                    through VTune suggests the routine is clocking in with a CPI of 0.266 (about
                    3.76 instructions per cycle), which is pretty darn close to the theoretical
                    minimum of 0.25 (which corresponds to 4 instructions retired per clock cycle).

                </p>

                <a class="xref" name="other-applications"></a>
                <h1>Other Applications</h1>
                <p>

                    Once I'd written the first version of the StringTable component, for better or
                    worse, it became the hammer for all of my string-related problems!  My favorite
                    example of this is the code I wrote for parsing the output of Windows debug
                    engine's <em>examine symbols</em> command.

                </p>

                <p>

                    Here's an example of a few lines of output from the cdb command
                    <small><code>x /v /t Rtl!*</code></small>:

                </p>

<small>
<pre>
prv global 00007ffd`1570d100   10 struct _STRING Rtl!ExtendedLengthVolumePrefixA = struct _STRING "\\?\"
prv global 00007ffd`1570d110   10 struct _UNICODE_STRING Rtl!ExtendedLengthVolumePrefixW = "\\?\"
prv global 00007ffd`1570da30  5a8 char *[181] Rtl!RtlFunctionNames = char *[181]
prv global 00007ffd`15711018    8 &lt;function&gt; * Rtl!__C_specific_handler_impl = 0x00007ffd`214c0f00
prv global 00007ffd`1570d820  208 char *[65] Rtl!RtlExFunctionNames = char *[65]
prv global 00007ffd`15711000    8 &lt;function&gt; * Rtl!atexit_impl = 0x00007ffd`15704370
...
prv global 00007ffd`1570d120  1a8 char *[53] Rtl!CuFunctionNames = char *[53]
prv func   00007ffd`15708be0   5d &lt;function&gt; Rtl!AppendCharBufferToCharBuffer (char **, char *, unsigned long)
prv func   00007ffd`15702450   2e &lt;function&gt; Rtl!RtlHeapAllocatorFreePointer (void *, void **)
prv func   00007ffd`15702730   3c &lt;function&gt; Rtl!RtlHeapAllocatorAlignedFreePointer (void *, void **)
prv func   00007ffd`157093b0   1e &lt;CLR type&gt; Rtl!UnregisterRtlAtExitEntry$fin$0 (void)
...
prv func   00007ffd`157061f0   48 &lt;function&gt; Rtl!RtlCryptGenRandom (struct _RTL *, unsigned long, unsigned char *)
prv func   00007ffd`15707500   b2 &lt;function&gt; Rtl!AppendTailGuardedListTsx (struct _GUARDED_LIST *, struct _LIST_ENTRY *)
prv func   00007ffd`157075c0    d &lt;function&gt; Rtl!DummyVectorCall1 (union __m128i *, union __m128i *, ...
prv func   00007ffd`157025c0   4f &lt;function&gt; Rtl!RtlHeapAllocatorAlignedMalloc (void *, unsigned int64, unsigned int64)
prv func   00007ffd`15703cc0   9e &lt;function&gt; Rtl!DisableCreateSymbolicLinkPrivilege (void)
</pre>
</small>

                <p>

                    The function <a
                    href="https://github.com/tpn/tracer/blob/c2f36bcc686ce6633c91671650f58b62bffb126e/DebugEngine/DebugEngineExamineSymbols.c#L143">
                    ExamineSymbolsParseLine</a> is called for each line of output and is responsible
                    for parsing it into a <a
                    href="https://github.com/tpn/tracer/blob/c2f36bcc686ce6633c91671650f58b62bffb126e/DebugEngine/DebugEngine.h#L1655">
                    DEBUG_ENGINE_EXAMINED_SYMBOL</a> structure.  It's some good ol' fashioned string
                    processing using nothing but pointer arithmetic and a bunch of string tables.

                </p>

                <p>

                    It was the first time I needed to match more than 16 strings in a given category,
                    though.  A pattern emerged that was quite reasonable, and it became my de facto
                    way of dealing with multiple string tables for a given category.

                </p>

                <p>

                    Let's look at the <em>basic type</em> category.  Two string tables were
                    constructed from the following constant delimited strings
                    <small>
                        <a href="https://github.com/tpn/tracer/blob/c2f36bcc686ce6633c91671650f58b62bffb126e/DebugEngine/DebugEngineConstants.c#L114">
                            (view on GitHub)
                        </a>
                    </small>:

                </p>


<pre class="code"><code class="language-c">
#define DSTR(String) String ";"

//
// ExamineSymbolsBasicTypes
//

const STRING ExamineSymbolsBasicTypes1 = RTL_CONSTANT_STRING(
    DSTR("&lt;NoType&gt;")
    DSTR("&lt;function&gt;")
    DSTR("char")
    DSTR("wchar_t")
    DSTR("short")
    DSTR("long")
    DSTR("int64")
    DSTR("int")
    DSTR("unsigned char")
    DSTR("unsigned wchar_t")
    DSTR("unsigned short")
    DSTR("unsigned long")
    DSTR("unsigned int64")
    DSTR("unsigned int")
    DSTR("union")
    DSTR("struct")
);

const STRING ExamineSymbolsBasicTypes2 = RTL_CONSTANT_STRING(
    DSTR("&lt;CLR type&gt;")
    DSTR("bool")
    DSTR("void")
    DSTR("class")
    DSTR("float")
    DSTR("double")
    DSTR("_SAL_ExecutionContext")
    DSTR("__enative_startup_state")
);

</code></pre>
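
                <p>

                    The <code>DSTR</code> macro leans on C's adjacent string literal concatenation:
                    each use appends a <code>";"</code>, and the compiler glues the pieces into a
                    single literal such as <code>"char;wchar_t;short;"</code>.  To make that layout
                    concrete, here's a small sketch that walks such a literal and reports where each
                    entry lives.  It's for illustration only;
                    <code>CountDelimitedEntries</code>, <code>FindDelimitedEntry</code> and
                    <code>ExampleTypes</code> are hypothetical names, and this is not how the
                    StringTable component builds its tables.

                </p>

<pre class="code"><code class="language-c">
#include &lt;stddef.h&gt;

#define DSTR(String) String ";"

//
// Hypothetical helper: count the entries in a delimiter-terminated string
// such as "char;wchar_t;short;".
//

static size_t
CountDelimitedEntries(const char *Text, char Delimiter)
{
    size_t Count = 0;

    for (; *Text != '\0'; Text++) {
        if (*Text == Delimiter) {
            Count++;
        }
    }

    return Count;
}

//
// Hypothetical helper: locate the Nth (zero-based) entry.  Returns a pointer
// to its first character and stores its length (excluding the delimiter) in
// *Length, or returns NULL if there are fewer than N + 1 entries.
//

static const char *
FindDelimitedEntry(
    const char *Text,
    char Delimiter,
    size_t N,
    size_t *Length
    )
{
    const char *Start = Text;

    for (; *Text != '\0'; Text++) {
        if (*Text != Delimiter) {
            continue;
        }
        if (N == 0) {
            *Length = (size_t)(Text - Start);
            return Start;
        }
        N--;
        Start = Text + 1;
    }

    return NULL;
}

//
// The compiler sees this as the single literal "char;wchar_t;short;".
//

static const char ExampleTypes[] =
    DSTR("char")
    DSTR("wchar_t")
    DSTR("short");
</code></pre>

                <p>

                    Entry 1 of <code>ExampleTypes</code> starts at offset 5 and is 7 characters long
                    (<code>wchar_t</code>).  The position of an entry within the literal is the only
                    thing tying it to an index, which is why the ordering matters so much in what
                    follows.

                </p>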

                <p>

                    In concert with the two string tables, an enumeration was defined
                    <small>
                        <a href="https://github.com/tpn/tracer/blob/c2f36bcc686ce6633c91671650f58b62bffb126e/DebugEngine/DebugEngine.h#L1029">
                            (view on GitHub)
                        </a>
                    </small>:

                </p>

<pre class="code"><code class="language-c">
//
// The order of these enumeration symbols must match the exact order of the
// corresponding string in the relevant ExamineSymbolsBasicTypes[1..n] STRING
// structure (see DebugEngineConstants.c).  This is because string tables are
// created from the delimited strings and the match index is cast directly to
// an enum of this type.
//

typedef enum _DEBUG_ENGINE_EXAMINE_SYMBOLS_TYPE {
    UnknownType = -1,

    //
    // First 16 types captured by BasicTypeStringTable1.
    //

    NoType = 0,
    FunctionType,

    CharType,
    WideCharType,
    ShortType,
    LongType,
    Integer64Type,
    IntegerType,

    UnsignedCharType,
    UnsignedWideCharType,
    UnsignedShortType,
    UnsignedLongType,
    UnsignedInteger64Type,
    UnsignedIntegerType,

    UnionType,
    StructType,

    //
    // Next 16 types captured by BasicTypeStringTable2.
    //

    CLRType = 16,
    BoolType,
    VoidType,
    ClassType,
    FloatType,
    DoubleType,
    SALExecutionContextType,
    ENativeStartupStateType,

    //
    // Any types that don't map directly to literal type names extracted from
    // the output string are listed here.  The first one starts at 48 in order
    // to differentiate it from the string tables.
    //

    //
    // Call site of an inline function.
    //

    InlineCallerType = 48,

    //
    // Enum is special in that it doesn't map to a string in the string table;
    // if a type can't be inferred from the list above, it defaults to Enum.
    //

    EnumType,

    //
    // Any enumeration value >= InvalidType is invalid.  Make sure this always
    // comes last in the enum layout.
    //

    InvalidType

} DEBUG_ENGINE_EXAMINE_SYMBOLS_TYPE;

</code></pre>

                <p>

                    Here's the part of the logic within
                    <a href="https://github.com/tpn/tracer/blob/c2f36bcc686ce6633c91671650f58b62bffb126e/DebugEngine/DebugEngineExamineSymbols.c#L407"
                    >ExamineSymbolsParseLine</a> that deals with matching the <em>basic type</em>
                    part of the line.  This refers to the 5th column of the output, e.g. the
                    <small><code>struct</code></small>,
                    <small><code>char *[181]</code></small>,
                    <small><code>&lt;function&gt;</code></small>,
                    <small><code>&lt;CLR type&gt;</code></small>
                    bits in the following output:
                    <small><pre>
prv global 00007ffd`1570d110   10 struct _UNICODE_STRING Rtl!ExtendedLengthVolumePrefixW = "\\?\"
prv global 00007ffd`1570da30  5a8 char *[181] Rtl!RtlFunctionNames = char *[181]
prv global 00007ffd`15711018    8 &lt;function&gt; * Rtl!__C_specific_handler_impl = 0x00007ffd`214c0f00
prv func   00007ffd`157093b0   1e &lt;CLR type&gt; Rtl!UnregisterRtlAtExitEntry$fin$0 (void)
                    </pre></small>

                </p>

<pre class="code"><code class="language-c">    //
    // (Type declarations of the variables being referenced shortly.)
    //

    SHORT MatchOffset;
    USHORT MatchIndex;
    USHORT MatchAttempts;
    USHORT NumberOfStringTables;
    STRING BasicType;
    STRING_MATCH Match;
    PSTRING_TABLE StringTable;
    DEBUG_ENGINE_EXAMINE_SYMBOLS_TYPE SymbolType;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

    ...

    //
    // The basic type will be next.  Set up the variable then search the string
    // table for a match.  Set the length to the BytesRemaining for now; as long
    // as it's greater than or equal to the basic type length (which it should
    // always be), that will be fine.
    //

    BasicType.Buffer = Char;
    BasicType.Length = (USHORT)BytesRemaining;
    BasicType.MaximumLength = (USHORT)BytesRemaining;

    StringTable = Session-&gt;ExamineSymbolsBasicTypeStringTable1;
    IsPrefixOfStringInTable = Session-&gt;StringTableApi-&gt;IsPrefixOfStringInTable;
    MatchOffset = 0;
    MatchAttempts = 0;
    NumberOfStringTables = Session-&gt;NumberOfBasicTypeStringTables;
    ZeroStruct(Match);

RetryBasicTypeMatch:

    MatchIndex = IsPrefixOfStringInTable(StringTable, &amp;BasicType, &amp;Match);

    if (MatchIndex == NO_MATCH_FOUND) {
        if (++MatchAttempts &gt;= NumberOfStringTables) {

            //
            // We weren't able to match the name to any known types.
            // Default to the enum type.
            //

            SymbolType = EnumType;

        } else {

            //
            // There are string tables remaining.  Attempt another match.
            //

            StringTable++;
            MatchOffset += MAX_STRING_TABLE_ENTRIES;
            goto RetryBasicTypeMatch;

        }

    } else {

        //
        // We found a match.  Our enums are carefully offset in order to allow
        // the following `index + offset = enum value` logic to work.
        //

        SymbolType = MatchIndex + MatchOffset;
    }

    //
    // N.B. This next part doesn't occur in the source file, but I wanted
    //      to include it to demonstrate how you could then simply switch
    //      on the resulting symbol type directly, e.g.:
    //

    switch (SymbolType) {
        case CharType:
        case WideCharType:
            ...
            break;

        case UnionType:
        case StructType:
            ...
            break;

        default:
            ...
            break;

    }
</code></pre>

                <p>

                    If there's no match found, we check to see if we've performed the maximum number
                    of attempts, that is, whether or not we've exhausted all our string tables.  If
                    we have, we just default to the EnumType.

                </p>

                <p>

                    Otherwise, bump the StringTable pointer (which relies on the fact that the
                    underlying string table pointers in the session structure are contiguous &mdash;
                    a handy implementation detail), bump the match offset by the number of entries
                    per string table, and try the match again.

                </p>

                <p>

                    If we found a match, we can obtain the SymbolType enum representation of the
                    underlying match by simply adding the match index to the match offset.  I like
                    that.  It's simple and fast.  It also plays nicely with switch statements; do
                    your lookup, resolve the underlying enum value, and process each possible path
                    in a case statement like you'd do with any other integer representation of an
                    option.

                </p>
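
                <p>

                    Stripped of the debug engine specifics, the retry logic has a simple shape.  The
                    sketch below is a portable approximation, not the real thing: a naive linear scan
                    (<code>MatchTable</code>) stands in for
                    <code>IsPrefixOfStringInTable</code>, the tables are plain arrays of C strings,
                    and every name in it is invented for illustration.  All it's meant to show is the
                    <code>index + offset = enum value</code> trick.

                </p>

<pre class="code"><code class="language-c">
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

#define MAX_ENTRIES_PER_TABLE 16
#define NO_MATCH -1

typedef struct _SKETCH_TABLE {
    const char *Entries[MAX_ENTRIES_PER_TABLE];
    int Count;
} SKETCH_TABLE;

//
// Naive stand-in for IsPrefixOfStringInTable: return the index of the first
// entry that is a prefix of Text, or NO_MATCH.
//

static int
MatchTable(const SKETCH_TABLE *Table, const char *Text)
{
    int Index;

    for (Index = 0; Index &lt; Table-&gt;Count; Index++) {
        const char *Entry = Table-&gt;Entries[Index];
        if (strncmp(Text, Entry, strlen(Entry)) == 0) {
            return Index;
        }
    }

    return NO_MATCH;
}

//
// Probe each table in turn.  On a hit, the result is the match index plus
// 16 for every table already skipped; on a miss, fall back to Default.
//

static int
ResolveType(
    const SKETCH_TABLE *Tables,
    int NumberOfTables,
    const char *Text,
    int Default
    )
{
    int Attempt;
    int Offset = 0;

    for (Attempt = 0; Attempt &lt; NumberOfTables; Attempt++) {
        int Index = MatchTable(&amp;Tables[Attempt], Text);
        if (Index != NO_MATCH) {
            return Index + Offset;
        }
        Offset += MAX_ENTRIES_PER_TABLE;
    }

    return Default;
}

//
// Two cut-down example tables (the real ones have more entries).
//

static const SKETCH_TABLE SketchTables[2] = {
    { { "char", "wchar_t", "short", "long" }, 4 },
    { { "&lt;CLR type&gt;", "bool", "void" }, 3 },
};
</code></pre>

                <p>

                    Here, <code>ResolveType(SketchTables, 2, "void *", 49)</code> misses the first
                    table, hits index 2 of the second, and returns 18, which lines up with
                    <code>VoidType</code> in the enumeration above.  Anything unrecognized falls
                    through to the supplied default (49 being the value <code>EnumType</code> ends
                    up with).

                </p>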

                <p>

                    The other nice side-effect is that it forces you to pick which table a given
                    string should go in.  I made this decision by looking at which types occurred
                    most frequently, and simply put those in the first table.  Less frequent types
                    go in subsequent tables.

                </p>
                <p>

                    I have a hunch there's a lot of mileage in that approach; that is, linearly
                    scanning an array of string tables until a match is found.  There will be an
                    inflection point where some form of O(log n) binary tree search will perform
                    better overall, but it would be very interesting to see how many strings you
                    need to potentially match against before that point is hit.

                </p>
                <p>

                    Unless the likelihood of matching any given string in your set is completely
                    random, by ordering the strings in your tables by how frequently they occur,
                    the amortized cost of parsing a chunk of text would be very competitive using
                    this approach, I would think.

                </p>
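
                <p>

                    A back-of-the-envelope way to reason about that: if a lookup that hits the
                    first table costs one probe, a hit in the second costs two, and so on, then the
                    expected number of probes is just the hit rates weighted by table position.  The
                    sketch below computes that figure.  The hit rates are made-up inputs you'd
                    replace with real measurements, <code>ExpectedProbes</code> is a hypothetical
                    name, and it deliberately ignores everything else that affects real-world cost
                    (caches, branch prediction, string lengths).

                </p>

<pre class="code"><code class="language-c">
#include &lt;stddef.h&gt;

//
// HitRates[i] is the fraction of lookups satisfied by table i (zero-based);
// whatever is left over is assumed to miss every table.  A hit in table i
// costs i + 1 probes; a miss costs NumberOfTables probes.
//

static double
ExpectedProbes(const double *HitRates, size_t NumberOfTables)
{
    size_t Index;
    double Expected = 0.0;
    double Remaining = 1.0;

    for (Index = 0; Index &lt; NumberOfTables; Index++) {
        Expected += HitRates[Index] * (double)(Index + 1);
        Remaining -= HitRates[Index];
    }

    return Expected + Remaining * (double)NumberOfTables;
}
</code></pre>

                <p>

                    With two tables and hypothetical hit rates of 0.75 and 0.25, putting the popular
                    table first gives 1.25 probes per lookup; swapping the order gives 1.75.

                </p>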
                <p>

                    A fun experiment for next time, perhaps!

                </p>

                <a class="xref" name="appendix"></a>
                <h1>Appendix</h1>

                <p>

                    And now here's all the stuff that wasn't important enough to occur earlier in
                    the article.

                </p>

                <a class="xref" name="implementation-considerations"></a>
                <h2>Implementation Considerations</h2>

                <p>

                    One issue with writing so many versions of the exact same function is... how do
                    you actually handle this?  Downstream consumers of the component don't need to
                    access the 30 different function pointers for each function you've experimented
                    with, but things like unit tests and benchmark programs do.

                </p>

                <p>

                    Here's what I did for the StringTable component.  Define two API structures, a
                    normal one and an "extended" one.  The extended one mirrors the normal one, and
                    then adds all of its additional functions to the end.

                </p>

                <p>

                    I use a .def file to control the DLL function exports, with an alias to easily
                    control which version of a function is the official version.  The main header
                    file then contains some bootstrap glue (in the form of an inline function) that
                    dynamically loads the target library and resolves the number of API methods
                    according to the size of the API structure provided.

                </p>

                <p>

                    This currently means that the StringTable2.dll includes all 14 C and 5 assembly
                    variants, which is harmless, but it does increase the size of the module
                    unnecessarily.  (The module is currently about 19KB in size, whereas it would be
                    under 4KB if only the official versions were included.)  What I'll probably end
                    up doing is setting up a second project called StringTableEx, and, in
                    conjunction with some #ifdefs, have that be the version of the module that
                    contains all the additional functions, with the normal version just containing
                    the official versions.

                </p>
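
                <p>

                    The size-driven resolution is easier to see in miniature.  Because both
                    structures contain nothing but function pointers, and the extended one starts
                    with the same members in the same order, the number of symbols to resolve is
                    simply the structure size divided by the size of a pointer.  The sketch below
                    fakes the DLL with a hard-coded lookup so that it runs anywhere;
                    <code>MINI_API</code>, <code>MINI_API_EX</code>, <code>ResolveSymbol</code> and
                    <code>LoadMiniApi</code> are all hypothetical stand-ins, not the real StringTable
                    or Rtl interfaces.

                </p>

<pre class="code"><code class="language-c">
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

typedef int (*PMINI_FUNCTION)(int);

static int Official(int Value) { return Value + 1; }
static int Variant1(int Value) { return Value + 2; }
static int Variant2(int Value) { return Value + 3; }

typedef struct _MINI_API {
    PMINI_FUNCTION Function;
} MINI_API;

typedef struct _MINI_API_EX {
    PMINI_FUNCTION Function;        // Mirrors MINI_API.
    PMINI_FUNCTION Function_1;      // Extended members follow.
    PMINI_FUNCTION Function_2;
} MINI_API_EX;

//
// Names must match MINI_API_EX exactly, including the order.
//

static const char *const Names[] = { "Function", "Function_1", "Function_2" };

//
// Stand-in for looking up an export in the loaded module.  "Function" plays
// the role of the .def alias that points at the official version.
//

static PMINI_FUNCTION
ResolveSymbol(const char *Name)
{
    if (strcmp(Name, "Function") == 0)   { return Official; }
    if (strcmp(Name, "Function_1") == 0) { return Variant1; }
    if (strcmp(Name, "Function_2") == 0) { return Variant2; }
    return NULL;
}

//
// Fill in as many function pointers as the caller's structure has room for.
// Assumes the structure is a packed run of PMINI_FUNCTION members.  Returns
// 1 on success, 0 on failure.
//

static int
LoadMiniApi(void *AnyApi, size_t SizeOfAnyApi)
{
    size_t Index;
    size_t NumberOfSymbols = SizeOfAnyApi / sizeof(PMINI_FUNCTION);
    PMINI_FUNCTION *Pointers = (PMINI_FUNCTION *)AnyApi;

    if (NumberOfSymbols == 0 ||
        NumberOfSymbols &gt; sizeof(Names) / sizeof(Names[0])) {
        return 0;
    }

    for (Index = 0; Index &lt; NumberOfSymbols; Index++) {
        Pointers[Index] = ResolveSymbol(Names[Index]);
        if (Pointers[Index] == NULL) {
            return 0;
        }
    }

    return 1;
}
</code></pre>

                <p>

                    A downstream consumer would pass a <code>MINI_API</code> and
                    <code>sizeof(MINI_API)</code> and only ever see the official function, whereas a
                    benchmark would pass a <code>MINI_API_EX</code> and get every variant.  Neither
                    caller needs to know which variant is currently the official one.

                </p>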

                <p>

                    Here's the <a
                    href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringTable.h#L1342">bootstrap glue
                    from StringTable.h</a> and the
                    <a
                    href="https://github.com/tpn/tracer/blob/v0.1.11/StringTable2/StringTable.def">StringTable.def</a>
                    file I currently use.  (Note: this routine uses the
                    <a
                    href="https://github.com/tpn/tracer/blob/master/Rtl/SymbolLoader.c#L117">LoadSymbols()</a>
                    function from the Rtl component.)

                </p>

                <div class="tab-box language box-api">
                    <ul class="tabs">
                        <li data-content="content-api-c">Bootstrap Header Glue</li>
                        <li data-content="content-api-def">StringTable.def</li>
                    </ul>
                    <div class="content">
<pre class="code content-api-c"><code class="language-c">
//
// Define the string table API structure.
//

typedef struct _STRING_TABLE_API {

    PSET_C_SPECIFIC_HANDLER SetCSpecificHandler;

    PCOPY_STRING_ARRAY CopyStringArray;
    PCREATE_STRING_TABLE CreateStringTable;
    PDESTROY_STRING_TABLE DestroyStringTable;

    PINITIALIZE_STRING_TABLE_ALLOCATOR
        InitializeStringTableAllocator;

    PINITIALIZE_STRING_TABLE_ALLOCATOR_FROM_RTL_BOOTSTRAP
        InitializeStringTableAllocatorFromRtlBootstrap;

    PCREATE_STRING_ARRAY_FROM_DELIMITED_STRING
        CreateStringArrayFromDelimitedString;

    PCREATE_STRING_TABLE_FROM_DELIMITED_STRING
        CreateStringTableFromDelimitedString;

    PCREATE_STRING_TABLE_FROM_DELIMITED_ENVIRONMENT_VARIABLE
        CreateStringTableFromDelimitedEnvironmentVariable;

    PIS_STRING_IN_TABLE IsStringInTable;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

} STRING_TABLE_API;
typedef STRING_TABLE_API *PSTRING_TABLE_API;

typedef struct _STRING_TABLE_API_EX {

    //
    // Inline STRING_TABLE_API.
    //

    PSET_C_SPECIFIC_HANDLER SetCSpecificHandler;

    PCOPY_STRING_ARRAY CopyStringArray;
    PCREATE_STRING_TABLE CreateStringTable;
    PDESTROY_STRING_TABLE DestroyStringTable;

    PINITIALIZE_STRING_TABLE_ALLOCATOR
        InitializeStringTableAllocator;

    PINITIALIZE_STRING_TABLE_ALLOCATOR_FROM_RTL_BOOTSTRAP
        InitializeStringTableAllocatorFromRtlBootstrap;

    PCREATE_STRING_ARRAY_FROM_DELIMITED_STRING
        CreateStringArrayFromDelimitedString;

    PCREATE_STRING_TABLE_FROM_DELIMITED_STRING
        CreateStringTableFromDelimitedString;

    PCREATE_STRING_TABLE_FROM_DELIMITED_ENVIRONMENT_VARIABLE
        CreateStringTableFromDelimitedEnvironmentVariable;

    PIS_STRING_IN_TABLE IsStringInTable;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

    //
    // Extended API methods used for benchmarking.
    //

    PIS_PREFIX_OF_CSTR_IN_ARRAY IsPrefixOfCStrInArray;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_1;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_2;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_3;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_4;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_5;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_6;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_7;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_8;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_9;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_10;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_11;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_12;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_13;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_14;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_1;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_2;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_3;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_4;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_5;
    PIS_PREFIX_OF_STRING_IN_TABLE IntegerDivision_x64_1;

} STRING_TABLE_API_EX;
typedef STRING_TABLE_API_EX *PSTRING_TABLE_API_EX;

typedef union _STRING_TABLE_ANY_API {
    STRING_TABLE_API Api;
    STRING_TABLE_API_EX ApiEx;
} STRING_TABLE_ANY_API;
typedef STRING_TABLE_ANY_API *PSTRING_TABLE_ANY_API;

FORCEINLINE
BOOLEAN
LoadStringTableApi(
    _In_ PRTL Rtl,
    _Inout_ HMODULE *ModulePointer,
    _In_opt_ PUNICODE_STRING ModulePath,
    _In_ ULONG SizeOfAnyApi,
    _Out_writes_bytes_all_(SizeOfAnyApi) PSTRING_TABLE_ANY_API AnyApi
    )
/*++

Routine Description:

    Loads the string table module and resolves all API functions for either
    the STRING_TABLE_API or STRING_TABLE_API_EX structure.  The desired API
    is indicated by the SizeOfAnyApi parameter.

    Example use:

        STRING_TABLE_API_EX GlobalApi;
        PSTRING_TABLE_API_EX Api;

        Success = LoadStringTableApi(Rtl,
                                     NULL,
                                     NULL,
                                     sizeof(GlobalApi),
                                     (PSTRING_TABLE_ANY_API)&amp;GlobalApi);
        ASSERT(Success);
        Api = &amp;GlobalApi;

    In this example, the extended API will be provided as our sizeof(GlobalApi)
    will indicate the structure size used by STRING_TABLE_API_EX.

    See ../StringTable2BenchmarkExe/main.c for a complete example.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    ModulePointer - Optionally supplies a pointer to an existing module handle
        for which the API symbols are to be resolved.  May be NULL.  If not
        NULL, but the pointed-to value is NULL, then this parameter will
        receive the handle obtained by LoadLibrary() as part of this call.
        If the string table module is no longer needed, but the program will
        keep running, the caller should issue a FreeLibrary() against this
        module handle.

    ModulePath - Optionally supplies a pointer to a UNICODE_STRING structure
        representing a path name of the string table module to be loaded.
        If *ModulePointer is not NULL, it takes precedence over this parameter.
        If NULL, and no module has been provided via *ModulePointer, an attempt
        will be made to load the library via 'LoadLibraryA("StringTable2.dll")'.

    SizeOfAnyApi - Supplies the size, in bytes, of the underlying structure
        pointed to by the AnyApi parameter.

    AnyApi - Supplies the address of a structure which will receive resolved
        API function pointers.  The API furnished will depend on the size
        indicated by the SizeOfAnyApi parameter.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{
    BOOL Success;
    HMODULE Module = NULL;
    ULONG NumberOfSymbols;
    ULONG NumberOfResolvedSymbols;

    //
    // Define the API names.
    //
    // N.B. These names must match STRING_TABLE_API_EX exactly (including the
    //      order).
    //

    const PCSTR Names[] = {
        "SetCSpecificHandler",
        "CopyStringArray",
        "CreateStringTable",
        "DestroyStringTable",
        "InitializeStringTableAllocator",
        "InitializeStringTableAllocatorFromRtlBootstrap",
        "CreateStringArrayFromDelimitedString",
        "CreateStringTableFromDelimitedString",
        "CreateStringTableFromDelimitedEnvironmentVariable",
        "IsStringInTable",
        "IsPrefixOfStringInTable",
        "IsPrefixOfCStrInArray",
        "IsPrefixOfStringInTable_1",
        "IsPrefixOfStringInTable_2",
        "IsPrefixOfStringInTable_3",
        "IsPrefixOfStringInTable_4",
        "IsPrefixOfStringInTable_5",
        "IsPrefixOfStringInTable_6",
        "IsPrefixOfStringInTable_7",
        "IsPrefixOfStringInTable_8",
        "IsPrefixOfStringInTable_9",
        "IsPrefixOfStringInTable_10",
        "IsPrefixOfStringInTable_11",
        "IsPrefixOfStringInTable_12",
        "IsPrefixOfStringInTable_13",
        "IsPrefixOfStringInTable_14",
        "IsPrefixOfStringInTable_x64_1",
        "IsPrefixOfStringInTable_x64_2",
        "IsPrefixOfStringInTable_x64_3",
        "IsPrefixOfStringInTable_x64_4",
        "IsPrefixOfStringInTable_x64_5",
        "IntegerDivision_x64_1",
    };

    //
    // Define an appropriately sized bitmap we can pass to Rtl-&gt;LoadSymbols().
    //

    ULONG BitmapBuffer[(ALIGN_UP(ARRAYSIZE(Names), sizeof(ULONG) &lt;&lt; 3) &gt;&gt; 5)+1];
    RTL_BITMAP FailedBitmap = { ARRAYSIZE(Names)+1, (PULONG)&amp;BitmapBuffer };

    //
    // Determine the number of symbols we want to resolve based on the size of
    // the API indicated by the caller.
    //

    if (SizeOfAnyApi == sizeof(AnyApi-&gt;Api)) {
        NumberOfSymbols = sizeof(AnyApi-&gt;Api) / sizeof(ULONG_PTR);
    } else if (SizeOfAnyApi == sizeof(AnyApi-&gt;ApiEx)) {
        NumberOfSymbols = sizeof(AnyApi-&gt;ApiEx) / sizeof(ULONG_PTR);
    } else {
        return FALSE;
    }

    //
    // Attempt to load the underlying string table module if necessary.
    //

    if (ARGUMENT_PRESENT(ModulePointer)) {
        Module = *ModulePointer;
    }

    if (!Module) {
        if (ARGUMENT_PRESENT(ModulePath)) {
            Module = LoadLibraryW(ModulePath-&gt;Buffer);
        } else {
            Module = LoadLibraryA("StringTable2.dll");
        }
    }

    if (!Module) {
        return FALSE;
    }

    //
    // We've got a handle to the string table module.  Load the symbols we want
    // dynamically via Rtl-&gt;LoadSymbols().
    //

    Success = Rtl-&gt;LoadSymbols(
        Names,
        NumberOfSymbols,
        (PULONG_PTR)AnyApi,
        NumberOfSymbols,
        Module,
        &amp;FailedBitmap,
        TRUE,
        &amp;NumberOfResolvedSymbols
    );

    ASSERT(Success);

    //
    // Debug helper: if the breakpoint below is hit, then the symbol names
    // have potentially become out of sync.  Look at the name of the first
    // failed symbol to assist in determining the cause.
    //

    if (NumberOfSymbols != NumberOfResolvedSymbols) {
        PCSTR FirstFailedSymbolName;
        ULONG FirstFailedSymbol;
        ULONG NumberOfFailedSymbols;

        NumberOfFailedSymbols = Rtl-&gt;RtlNumberOfSetBits(&amp;FailedBitmap);
        FirstFailedSymbol = Rtl-&gt;RtlFindSetBits(&amp;FailedBitmap, 1, 0);
        FirstFailedSymbolName = Names[FirstFailedSymbol-1];
        __debugbreak();
    }

    //
    // Set the C specific handler for the module, such that structured
    // exception handling will work.
    //

    AnyApi-&gt;Api.SetCSpecificHandler(Rtl-&gt;__C_specific_handler);

    //
    // Update the caller's pointer and return success.
    //

    if (ARGUMENT_PRESENT(ModulePointer)) {
        *ModulePointer = Module;
    }

    return TRUE;
}

</code></pre>
<pre class="code content-api-def"><code class="language-c">
LIBRARY StringTable2
EXPORTS
    SetCSpecificHandler
    CopyStringArray
    CreateStringTable
    DestroyStringTable
    InitializeStringTableAllocator
    InitializeStringTableAllocatorFromRtlBootstrap
    CreateStringArrayFromDelimitedString
    CreateStringTableFromDelimitedString
    CreateStringTableFromDelimitedEnvironmentVariable
    TestIsPrefixOfStringInTableFunctions
    IsStringInTable
    IsPrefixOfStringInTable_1
    IsPrefixOfStringInTable_2
    IsPrefixOfStringInTable_3
    IsPrefixOfStringInTable_4
    IsPrefixOfStringInTable_5
    IsPrefixOfStringInTable_6
    IsPrefixOfStringInTable_7
    IsPrefixOfStringInTable_8
    IsPrefixOfStringInTable_9
    IsPrefixOfStringInTable_10
    IsPrefixOfStringInTable_11
    IsPrefixOfStringInTable_12
    IsPrefixOfStringInTable_13
    IsPrefixOfStringInTable_14
    IsPrefixOfStringInTable_x64_1
    IsPrefixOfStringInTable_x64_2
    IsPrefixOfStringInTable_x64_3
    IsPrefixOfStringInTable_x64_4
    IsPrefixOfStringInTable_x64_5
    IsPrefixOfCStrInArray
    IntegerDivision_x64_1
    IsPrefixOfStringInTable=IsPrefixOfStringInTable_13
</code></pre>
                    </div>
                </div>
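
                <p>

                    The size-based dispatch in <code>LoadStringTableApi()</code> above can be hard
                    to see among the Windows-specific plumbing.  The following is a minimal,
                    portable sketch of just that idea, not the library's actual code.  Every name
                    in it (<code>DEMO_API</code>, <code>DEMO_API_EX</code>,
                    <code>ResolveSymbol()</code>, <code>LoadDemoApi()</code>) is hypothetical, and a
                    small lookup table stands in for <code>LoadLibrary()</code> and
                    <code>Rtl-&gt;LoadSymbols()</code>.  Like the original, it assumes a structure
                    made up solely of function pointers has no padding between members, so it can
                    be filled in like an array.

                </p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical function pointer type shared by every API slot.
typedef int (*PDEMO_FUNCTION)(void);

// The base API, and an extended API that begins with the same members.
typedef struct _DEMO_API {
    PDEMO_FUNCTION First;
    PDEMO_FUNCTION Second;
} DEMO_API;

typedef struct _DEMO_API_EX {
    PDEMO_FUNCTION First;
    PDEMO_FUNCTION Second;
    PDEMO_FUNCTION Third;
    PDEMO_FUNCTION Fourth;
} DEMO_API_EX;

typedef union _DEMO_ANY_API {
    DEMO_API Api;
    DEMO_API_EX ApiEx;
} DEMO_ANY_API;

// Four trivial "exports" so there is something to resolve.
int DemoFirst(void)  { return 1; }
int DemoSecond(void) { return 2; }
int DemoThird(void)  { return 3; }
int DemoFourth(void) { return 4; }

// Stand-in for GetProcAddress(): look up a function by name.
PDEMO_FUNCTION ResolveSymbol(const char *Name)
{
    static const struct {
        const char *Name;
        PDEMO_FUNCTION Function;
    } Exports[] = {
        { "First",  DemoFirst  },
        { "Second", DemoSecond },
        { "Third",  DemoThird  },
        { "Fourth", DemoFourth },
    };
    size_t Index;

    for (Index = 0; Index < sizeof(Exports) / sizeof(Exports[0]); Index++) {
        if (strcmp(Exports[Index].Name, Name) == 0) {
            return Exports[Index].Function;
        }
    }
    return NULL;
}

// Fill in as many slots as the caller's structure size implies.
// Returns 1 on success, 0 on failure.
int LoadDemoApi(size_t SizeOfAnyApi, DEMO_ANY_API *AnyApi)
{
    // N.B. The order must match the structure member order exactly.
    static const char *Names[] = { "First", "Second", "Third", "Fourth" };
    PDEMO_FUNCTION *Slots = (PDEMO_FUNCTION *)AnyApi;
    size_t NumberOfSymbols;
    size_t Index;

    if (SizeOfAnyApi == sizeof(AnyApi->Api)) {
        NumberOfSymbols = sizeof(AnyApi->Api) / sizeof(PDEMO_FUNCTION);
    } else if (SizeOfAnyApi == sizeof(AnyApi->ApiEx)) {
        NumberOfSymbols = sizeof(AnyApi->ApiEx) / sizeof(PDEMO_FUNCTION);
    } else {
        return 0;
    }

    assert(NumberOfSymbols <= sizeof(Names) / sizeof(Names[0]));

    for (Index = 0; Index < NumberOfSymbols; Index++) {
        Slots[Index] = ResolveSymbol(Names[Index]);
        if (Slots[Index] == NULL) {
            return 0;
        }
    }
    return 1;
}
```

                <p>

                    Passing <code>sizeof(DEMO_API)</code> resolves only the first two names, passing
                    <code>sizeof(DEMO_API_EX)</code> resolves all four, and any other size is
                    rejected.  The price of this convenience is the same as in the real loader: the
                    name table and the structure members must be kept in exactly the same order.

                </p>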

                <a class="xref" name="release-vs-pgo"></a>
                <h3>Release Build versus Profile Guided Optimization Build</h3>

                <p>

                    It's interesting to see the optimized release build side by side with the PGO
                    build.  Almost all of the differences come down to branching and jump
                    direction.

                </p>

                <a href="IsPrefixOfStringInTable_13-Release-vs-PGO.png" target="_blank">
                    <img width="800px" height="1550px" src="IsPrefixOfStringInTable_13-Release-vs-PGO-Small.png"/>
                </a>

                <a class="xref" name="typedefs"></a>
                <h3>Typedefs</h3>

                <p>

                    If there's one thing you can't argue about with the Pascal-style Cutler Normal
                    Form, it's that it loves a good typedef.  For the sake of completeness, here's a
                    list of all the explicit or implied typedefs featured in the code on this page.

                </p>

<pre class="code content-typedefs"><code class="language-c">

//
// Standard NT/Windows typedefs (typically living in minwindef.h or winnt.h).
//

typedef void *PVOID;
typedef char CHAR;
typedef CHAR *PCHAR;
typedef short SHORT;
typedef long LONG;
typedef unsigned long ULONG;
typedef ULONG *PULONG;
typedef unsigned short USHORT;
typedef USHORT *PUSHORT;
typedef unsigned char UCHAR;
typedef UCHAR *PUCHAR;
typedef _Null_terminated_ char *PSZ;
typedef const _Null_terminated_ char *PCSZ;

typedef int BOOL;
typedef unsigned char BYTE;
typedef unsigned short WORD;

typedef BYTE BOOLEAN;
typedef BOOLEAN *PBOOLEAN;

//
// The STRING structure used by the NT kernel.  Our STRING_ARRAY structure
// relies on an array of these structures.  We never pass raw 'char *'s
// around, only STRING/PSTRING structs/pointers.
//

typedef struct _STRING {
    USHORT Length;
    USHORT MaximumLength;
    ULONG  Padding;
    PCHAR Buffer;
} STRING, *PSTRING;
typedef const STRING *PCSTRING;

//
// Our SIMD register typedefs.
//

typedef __m128i DECLSPEC_ALIGN(16) XMMWORD, *PXMMWORD, **PPXMMWORD;
typedef __m256i DECLSPEC_ALIGN(32) YMMWORD, *PYMMWORD, **PPYMMWORD;
typedef __m512i DECLSPEC_ALIGN(64) ZMMWORD, *PZMMWORD, **PPZMMWORD;

</code></pre>
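
                <p>

                    To make the counted-string convention concrete, here is a small portable
                    sketch, assuming fixed-width types from <code>stdint.h</code> in place of the
                    Windows typedefs.  <code>DEMO_STRING</code> mirrors the layout of the
                    <code>STRING</code> structure above; <code>InitDemoString()</code> and
                    <code>IsDemoPrefix()</code> are hypothetical helpers written for this
                    illustration and are not part of the StringTable component.

                </p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Same field layout as the STRING structure above, using portable types.
typedef struct _DEMO_STRING {
    uint16_t Length;         // Bytes in use, excluding any trailing NUL.
    uint16_t MaximumLength;  // Total bytes available in Buffer.
    uint32_t Padding;        // Keeps Buffer 8-byte aligned on 64-bit targets.
    char *Buffer;
} DEMO_STRING;

// Wrap a NUL-terminated C string in a counted string without copying it.
// Returns 1 on success, 0 if the string is too long for a 16-bit length.
int InitDemoString(DEMO_STRING *String, char *Source)
{
    size_t Length = strlen(Source);

    // MaximumLength must be able to hold Length + 1.
    if (Length >= UINT16_MAX) {
        return 0;
    }

    String->Length = (uint16_t)Length;
    String->MaximumLength = (uint16_t)(Length + 1);
    String->Padding = 0;
    String->Buffer = Source;
    return 1;
}

// Returns 1 if Prefix is a prefix of String, 0 otherwise.  Because both
// lengths are already known, no scan for a NUL terminator is needed.
int IsDemoPrefix(const DEMO_STRING *Prefix, const DEMO_STRING *String)
{
    if (Prefix->Length > String->Length) {
        return 0;
    }
    return memcmp(Prefix->Buffer, String->Buffer, Prefix->Length) == 0;
}
```

                <p>

                    Carrying the length alongside the buffer is what lets a length comparison
                    reject a candidate before any bytes are compared at all.

                </p>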

            <hr/>
            <h3>Contact</h3>
            <p>

                Like the article?  Let me know!  E-mail: &#116;&#114;&#101;&#110;&#116;&#64;&#116;&#114;&#101;&#110;&#116;&#46;&#109;&#101;

            </p>

            </div>

        </section>

        <script type="text/javascript">
            // Google Analytics
            (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
            (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
            m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
            })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

            ga('create', 'UA-24686252-1', 'auto');
            ga('send', 'pageview');
        </script>

        <section class="section section-comments">
            <div class="container">
                <p>
                    <hr/>
                </p>
            </div>
        </section>

        <section class="section section-footer">
            <div class="container">
                <small>
                    <p>
                        <a
                        href="https://github.com/tpn/website/blob/master/is-prefix-of-string-in-table/index.html">
                        View this page's source on GitHub.</a>
                    </p>
                    <p>
                        <a href="https://twitter.com/trentnelson" class="twitter-follow-button" data-show-count="false">Follow @trentnelson</a>
                        <iframe src="https://ghbtns.com/github-btn.html?user=tpn&type=follow" frameborder="0" scrolling="0" width="170px" height="20px"></iframe>
                    </p>
                </small>
            </div>
        </section>

        <script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

    </body>
</html>