Skip to content

TMBitset checklist #245

@yakra

Description

@yakra

Benchmark:

  • Route::clinched_by_traveler: branching vs indirection
    Branching is faster.
  • Graph generation with different units
    Bigger is faster.
  • Explicitly inline matching_vertices_and_edges (Change name? It finds travelers too)
    Slightly faster on all machines except lab2. (Noise?)
  • Try branchless variant of eaa6:
    for (size_t index = 0; index < traveler_lists.size(); ++index)
    	code[index/4] += clinched_by(traveler_lists[index]) << index%4;
    Branching is faster.
  • Revert eaa6; apply 4e05 directly to prev commit
    For Computing Stats, 96e9...
    • is THE top performer so far for lab1, lab3, lab4.
    • outperforms 4e05 on BiggaTomato.
    • on lab2 lags behind top performer 4e05 by only 3.8 ms. Just noise? A very tight race here.
  • 5c76 inconclusive, but does appear slightly slower for userlogs, very consistently. Try:
    • comparing against 6922 for a more apples-to-apples comparison.
    • comparing against the final selected commit once all the dust settles (see 4a64).
    • Out of curiosity, is there a diff if I build with GCC instead of clang? Yes.
  • What if traveler_lists is nixed altogether?
    As we iterate thru traveler_set, just keep a count instead of creating a vector.
    The trade-off is doing one more iteration thru traveler_set at the end of the traveled graph for traveler names.
    Is this outweighed by not having to construct the vector and do a modest number of allocations/reallocations?
    Preliminary results: helps on BiggaTomato; hurts on lab1. Be interesting to see results on bsdlab, at higher thread counts, and after doing more work on the RAM bandwidth bottleneck.
    Final: No speed advantage. Leaving as-is. Maybe re-examine after more RAM bandwidth improvements are implemented.

Try out:

  • What happens to traveled graphs if a devel system comes before concurrent a/p in systems.csv?
  • How much does initial HGEdge construction slow down if I force an active/preview canonical HighwaySegment?
    ~ 0.01 - 0.02 s on BiggaTomato -- 0.9 - 1.9 % more time.
  • Simple iteration optimization (with different units)
    The trade-off: Larger units = skip back farther but less often.
    • 64-bit is clearly suboptimal.
    • 8, 16 & 32-bit all very close to margin of error. Maybe a slight edge for 32-bit; makes sense to use it anyway because good |= performance.
  • Complex iteration optimization
    • 8-bit f749 underperforms 8-32-bit simple iteration on lab{1..4}; slight lead on BiggaTomato.
    • 16-bit ec4e is 1st place for BiggaTomato & lab1; underperforms f749 & even 64-bit f8cf on lab2.
      Other machines TBD.
      • [bits >> 1] solution
      • ternaries
        Both underperform ec4e on BiggaTomato & lab1. Successively better on lab2 though, with ternaries 1st place overall & [bits >> 1] falling between 16 & 32-bit SIO2. Other machines TBD.
    • How much time does the 15-bit lookup table take to construct? Does this outweigh the advantage of using 16-bit ComItOpt?
      • About 0.3 ms on BiggaTomato.
      • N/A -- there isn't an advantage to using 16-bit ComItOpt.
    • LOL what if I constexpr the damn thing by brute force? Never mind. Nothing to be gained by doing this.
    • 32k is 1/16 of 512k. Will ec4e perform poorly on Epoch? No. Performs well; outperforms SIO2. Similar to BiggaTomato.
  • Pointer punning |=
    Try simplified versions for larger units:
    • 64-32-16-bit
    • 64-16-bit
    • 64-32-bit
  • Replace TravelerList::traveler_num with one
    unsigned int* traveler_num = new unsigned int[TravelerList::allusers.size()]; per thread.
    Index via for (TravelerList *t : traveler_lists) traveler_nums[t-TravelerList::allusers.data()] = travnum++;
  • A variant of eaa6 with TMBitset [] operator? LOLNOPE. Different vectors (everything vs subset); indices don't match up.

Kinda both:

  • Init TravelerLists at beginning; init segments with size not capacity
  • Or better yet, init segments with TravelerList::ids.size()
    Segments set via TMArray::size, set via TravelerList::ids.size()
  • Timestamp for first task, with & without. Python Too?
  • TMArray<TravelerList> means threaded construction in place (via placement new) without separate read_list function.
  • TMBitset<HGVertex*> #251

Clean up:

  • fix comment
  • Constexpr the static edge format constants
  • Template specialization
  • As of ecff, cmath no longer explicitly needed in HighwayGraph.cpp
  • Extraneous else after continue; parens can be clarified
    else if ((w->vertex->incident_c_edges.front() == w->vertex->incident_t_edges.front()
    && w->vertex->incident_c_edges.back() == w->vertex->incident_t_edges.back())
    || (w->vertex->incident_c_edges.front() == w->vertex->incident_t_edges.back()
    && w->vertex->incident_c_edges.back() == w->vertex->incident_t_edges.front()))
    new HGEdge(w->vertex, HGEdge::collapsed | HGEdge::traveled);
  • extraneous visibility == 1 check just before that
  • (4db4) maxbits less useful in SimItOpt2. Delete it; replace with 8*sizeof(unit); let -1 and +1 cancel out.
  • bits should be unsigned char in pun branch. Dumb luck that it ran without errors. Fixed for ComItOpt.
    Never mind. Using a branch that retains unit instead.
  • uint8_tetc.
  • Explicit instantiation?
  • Inlining MV&E (2d1b) means #include "../../templates/TMBitset.cpp" not needed in HighwayGraph.h. Can lose the include guard too LOL. 🤠
  • segments SQL table: only iterate clinched_by for active/preview systems
  • (c8b9) Check for diffs due to constant folding: 8/sizeof(unit)
  • Comment for != and |= operators
  • Review TMBitset variable names
  • Cast (unit)1 before <<, lest the unsigned long bug make a reappearance. See f8cf on SimItOpt2 branch.
  • (4db4) Switch: are additions in for loop reordered? Make it look pretty!
  • nix () and add_value until needed

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions