forked from TravelMapping/DataProcessing
Closed
TravelMapping/DataProcessing
#592
Description
Benchmark:
- Route::clinched_by_traveler: branching vs indirection
  Branching is faster.
- Graph generation with different units
  Bigger is faster.
- Explicitly inline matching_vertices_and_edges (Change name? It finds travelers too)
  Slightly faster on all machines except lab2. (Noise?)
- Try branchless variant of eaa6:
  for (size_t index = 0; index < traveler_lists.size(); ++index) code[index/4] += clinched_by(traveler_lists[index]) << index%4;
  Branching is faster.
- Revert eaa6; apply 4e05 directly to prev commit
  For Computing Stats, 96e9...
  • is THE top performer so far for lab1, lab3, lab4.
  • outperforms 4e05 on BiggaTomato.
  • on lab2 lags behind top performer 4e05 by only 3.8 ms. Just noise? A very tight race here.
- 5c76 inconclusive, but does appear slightly slower for userlogs, very consistently. Try:
  - comparing against 6922 for a more apples-to-apples comparison.
  - comparing against the final selected commit once all the dust settles (see 4a64).
  - Out of curiosity, is there a diff if I build with GCC instead of clang? Yes.
- What if traveler_lists is nixed altogether?
  As we iterate thru traveler_set, just keep a count instead of creating a vector.
  The trade-off is doing one more iteration thru traveler_set at the end of the traveled graph for traveler names.
  Is this outweighed by not having to construct the vector and do a modest number of allocations/reallocations?
  Preliminary results: helps on BiggaTomato; hurts on lab1. Be interesting to see results on bsdlab, at higher thread counts, and after doing more work on the RAM bandwidth bottleneck.
  Final: No speed advantage. Leaving as-is. Maybe re-examine after more RAM bandwidth improvements are implemented.
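The branching vs branchless comparison in the eaa6 item can be sketched roughly as below. This is a minimal illustration, not the code from either commit: pack_branching, pack_branchless, and the std::vector<bool> input are hypothetical stand-ins for clinched_by and traveler_lists, and code holds raw values 0..15 rather than whatever representation the real code uses. Both functions pack four travelers per element, the same way as the loop quoted above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in: clinched[i] says whether traveler i clinched the segment.

// Branching variant: only touch code[] when the traveler clinched the segment.
std::vector<unsigned char> pack_branching(const std::vector<bool>& clinched) {
    std::vector<unsigned char> code((clinched.size() + 3) / 4, 0);
    for (std::size_t index = 0; index < clinched.size(); ++index)
        if (clinched[index])
            code[index / 4] += 1 << index % 4;
    return code;
}

// Branchless variant: always add, letting the 0/1 value select the bit.
std::vector<unsigned char> pack_branchless(const std::vector<bool>& clinched) {
    std::vector<unsigned char> code((clinched.size() + 3) / 4, 0);
    for (std::size_t index = 0; index < clinched.size(); ++index)
        code[index / 4] += static_cast<unsigned char>(clinched[index]) << index % 4;
    return code;
}
```

Both variants produce identical output; which one is faster depends on the machine and on how predictable the clinched pattern is, which is what the benchmark was measuring.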
Try out:
- What happens to traveled graphs if a devel system comes before concurrent a/p in systems.csv?
- How much does initial HGEdge construction slow down if I force an active/preview canonical HighwaySegment?
  ~ 0.01 - 0.02 s on BiggaTomato -- 0.9 - 1.9 % more time.
- Simple iteration optimization (with different units)
  The trade-off: Larger units = skip back farther but less often.
  - 64-bit is clearly suboptimal.
  - 8, 16 & 32-bit all very close to margin of error. Maybe a slight edge for 32-bit; makes sense to use it anyway because good |= performance.
- Complex iteration optimization
  - 8-bit
    f749 underperforms 8-32-bit simple iteration on lab{1..4}; slight lead on BiggaTomato.
  - 16-bit
    ec4e is 1st place for BiggaTomato & lab1; underperforms f749 & even 64-bit f8cf on lab2.
    Other machines TBD.
    - [bits >> 1] solution
    - ternaries
    Both underperform ec4e on BiggaTomato & lab1. Successively better on lab2 though, with ternaries 1st place overall & [bits >> 1] falling between 16 & 32-bit SIO2. Other machines TBD.
    - How much time does the 15-bit lookup table take to construct? Does this outweigh the advantage of using 16-bit ComItOpt?
      - About 0.3 ms on BiggaTomato.
      - N/A -- there isn't an advantage to using 16-bit ComItOpt.
    - LOL what if I constexpr the damn thing by brute force? Never mind. Nothing to be gained by doing this.
    - 32k is 1/16 of 512k. Will ec4e perform poorly on Epoch? No. Performs well; outperforms SIO2. Similar to BiggaTomato.
- Pointer punning |=
  Try simplified versions for larger units:
  - 64-32-16-bit
  - 64-16-bit
  - 64-32-bit
- Replace TravelerList::traveler_num with one
  unsigned int* traveler_nums = new unsigned int[TravelerList::allusers.size()];
  per thread.
  Index via
  for (TravelerList *t : traveler_lists) traveler_nums[t-TravelerList::allusers.data()] = travnum++;
- A variant of eaa6 with TMBitset [] operator? LOLNOPE. Different vectors (everything vs subset); indices don't match up.
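The per-thread traveler_nums idea above relies on pointer subtraction to turn a TravelerList* into an index into the master array. A minimal sketch of that indexing, assuming a contiguous master array: Traveler and number_subset are hypothetical names, and a std::vector stands in for the new unsigned int[] allocation so the example cleans up after itself.

```cpp
#include <cstddef>
#include <vector>

struct Traveler { int dummy = 0; };  // hypothetical stand-in for TravelerList

// `all` plays the role of TravelerList::allusers (contiguous storage);
// `subset` plays the role of traveler_lists (pointers into `all`).
// Returns a per-call array, sized like `all`, holding each subset member's number.
std::vector<unsigned int> number_subset(const std::vector<Traveler>& all,
                                        const std::vector<const Traveler*>& subset) {
    std::vector<unsigned int> nums(all.size(), 0);
    unsigned int travnum = 0;
    for (const Traveler* t : subset)
        // t - all.data() is the element's position in the master array.
        nums[static_cast<std::size_t>(t - all.data())] = travnum++;
    return nums;
}
```

Because each thread would own its own array, no traveler number is shared between threads; the cost is one allocation sized to the full user list per thread. Entries for travelers outside the subset are left at 0 here, so callers must only read slots they know were set.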
Kinda both:
- Init TravelerLists at beginning; init segments with size not capacity
  - Or better yet, init segments with TravelerList::ids.size()
    Segments set via TMArray::size, set via TravelerList::ids.size()
- Timestamp for first task, with & without. Python Too?
- TMArray<TravelerList> means threaded construction in place (via placement new) without separate read_list function.
- TMBitset<HGVertex*> #251
Clean up:
- fix comment
- Constexpr the static edge format constants
- Template specialization
- As of ecff, cmath no longer explicitly needed in HighwayGraph.cpp
- Extraneous else after continue; parens can be clarified
  DataProcessing/siteupdate/cplusplus/classes/GraphGeneration/HighwayGraph.cpp
  Lines 112 to 116 in 80f2724
  else if ((w->vertex->incident_c_edges.front() == w->vertex->incident_t_edges.front() && w->vertex->incident_c_edges.back() == w->vertex->incident_t_edges.back()) || (w->vertex->incident_c_edges.front() == w->vertex->incident_t_edges.back() && w->vertex->incident_c_edges.back() == w->vertex->incident_t_edges.front())) new HGEdge(w->vertex, HGEdge::collapsed | HGEdge::traveled);
- extraneous visibility == 1 check just before that
- (4db4) maxbits less useful in SimItOpt2. Delete it; replace with 8*sizeof(unit); let -1 and +1 cancel out.
- bits should be unsigned char in punbranch. Dumb luck that it ran without errors. Fixed for ComItOpt.
  Never mind. Using a branch that retains unit instead.
- uint8_t etc.
- Explicit instantiation?
- Inlining MV&E (2d1b) means #include "../../templates/TMBitset.cpp" not needed in HighwayGraph.h. Can lose the include guard too LOL. 🤠
- segments SQL table: only iterate clinched_by for active/preview systems
- (c8b9) Check for diffs due to constant folding: 8/sizeof(unit)
- Comment for != and |= operators
- Review TMBitset variable names
- Cast (unit)1 before <<, lest the unsigned long bug make a reappearance. See f8cf on SimItOpt2 branch.
- (4db4) Switch: are additions in for loop reordered? Make it look pretty!
- nix () and add_value until needed
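Two of the cleanup items above, casting (unit)1 before << and using 8*sizeof(unit) in place of a stored maxbits, can be illustrated together. This is a sketch under assumptions, not TMBitset's actual code: unit_bits, bit_mask, and test_bit are hypothetical helpers, and the details of the earlier unsigned long bug are not reproduced here. The general hazard is that a bare literal 1 is an int, so shifting it by 32 or more is undefined even when the result is stored in a 64-bit unit.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers; `unit` mirrors the template parameter named in the notes.
template <class unit>
constexpr std::size_t unit_bits() { return 8 * sizeof(unit); }

template <class unit>
constexpr unit bit_mask(std::size_t bit) {
    // Cast before shifting so the shift happens at the width of `unit`
    // (or wider, after integer promotion), not at the width of an int literal.
    return static_cast<unit>(static_cast<unit>(1) << (bit % unit_bits<unit>()));
}

// Test bit `index` in an array of units.
template <class unit>
bool test_bit(const unit* data, std::size_t index) {
    return (data[index / unit_bits<unit>()] & bit_mask<unit>(index)) != 0;
}
```

With the cast inside one helper, changing unit between 8, 16, 32 and 64 bits for benchmarking does not require revisiting every shift expression.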