
Dictionary timing measurements #472

Open
timwoj opened this issue Jul 17, 2019 · 9 comments


@timwoj
Contributor

commented Jul 17, 2019

I started looking into converting the Dictionary class into more of a template the other day, similar to the recent work on List and Queue. That led to a lot of questions about the implementation, and to some discussion in Slack about reworking the class and about timing concerns. Below are some benchmarks of Dictionary against std::map and std::unordered_map, to get a sense of what the actual performance differences are.

The primary reason Dictionary is implemented the way it is is to avoid a spike in CPU usage when the table needs to be resized. It's implemented as a pair of Lists of buckets based on HashKeys. When a resize occurs, the actual rehashing is spread over a span of time instead of being done all at once. std::unordered_map is typically implemented as a hash table of buckets based on a hash of the key, similar to Dictionary; the main difference is that unordered_map does the entire resize at once, so you see the timing hit as it rehashes. std::map is typically implemented as a balanced tree (usually a red-black tree) ordered by the key itself via its comparator, not by a hash.

The graphs below are for inserting 1000000 elements into each of the tables. For map and unordered_map the key is an int and the value is an int*. For Dictionary, it uses a PDict where the key is a HashKey built from an int and the value is an int*. The graphs show the time (in microseconds) it takes to insert each element, measured with clock_gettime and plotted with Python's matplotlib.

Note that these graphs are for inserts only and don't measure any of the other operations on the containers, but inserts are where the primary complication of Dictionary's implementation lies.

Dictionary

coregeek:dict testing tim$ time ./a.out d
Dictionary
real	0m1.758s
user	0m1.697s
sys	0m0.057s

Average operation time (us): 1.389912
Max operation time (us): 1734

[plot: per-insert times, Dictionary]

map

coregeek:dict testing tim$ time ./a.out r
std::map
real	0m2.015s
user	0m1.733s
sys	0m0.276s

Average operation time (us): 2.226538
Max operation time (us): 68

[plot: per-insert times, std::map]

unordered_map

coregeek:dict testing tim$ time ./a.out u
std::unordered_map
real	0m1.088s
user	0m0.806s
sys	0m0.277s

Average operation time (us): 1.266901
Max operation time (us): 98

[plot: per-insert times, std::unordered_map]

Timing code: test.cc.txt
Plotting code: plot.py.txt

@timwoj

Contributor Author

commented Jul 17, 2019

@JustinAzoff asked in Slack whether we had any benchmarks for creating/destroying small Dictionaries, so I added that below. The data below is for creating 1000000 3-element std::map and Dictionary objects, deleting each one before creating the next. I didn't create graphs for these.

Dictionary

real	0m2.548s
user	0m2.489s
sys	0m0.048s

std::map

real	0m1.410s
user	0m1.174s
sys	0m0.230s

Updated timing code: test.cc.txt

@rsmmr

Member

commented Jul 17, 2019

Interesting, nice plots!

@timwoj

Contributor Author

commented Jul 18, 2019

Just for comparison, I re-plotted the Dictionary data with any point above 100us removed. This makes the lower points more pronounced and allows an easier comparison against the individual std::map inserts.

Average operation time (us): 1.383076
Max operation time (us): 76

[plot: per-insert times, Dictionary, points above 100us removed]

@timwoj

Contributor Author

commented Jul 18, 2019

@JustinAzoff pointed out that the map/unordered_map code might be slightly faster because I was storing pointers in Dictionary but plain ints in the other two, so the call to new int() may have been skewing the data. I fixed the code to store <int, int*> in all three and updated the graphs/text in the original post.

It was definitely skewing the data, with the runtime for both going up about 200ms and the average operation time going up very slightly (~5us).

@timwoj

Contributor Author

commented Jul 18, 2019

Both @rsmmr and @JustinAzoff mentioned that some benchmarks for lookups would be useful. These tests use containers with const char* keys and int* values. The code creates each map and then times how long it takes to look up every key in reverse order of creation, run 5 times in a row.

Dictionary

Time for 50 lookups (us): 21000
Time for 50 lookups (us): 15000
Time for 50 lookups (us): 15000
Time for 50 lookups (us): 15000
Time for 50 lookups (us): 14000

std::map

Time for 50 lookups (us): 13000
Time for 50 lookups (us): 13000
Time for 50 lookups (us): 13000
Time for 50 lookups (us): 12000
Time for 50 lookups (us): 11000

std::unordered_map

Time for 50 lookups (us): 9000
Time for 50 lookups (us): 9000
Time for 50 lookups (us): 9000
Time for 50 lookups (us): 9000
Time for 50 lookups (us): 9000

There's also the question of iterating over a dictionary. For the two std containers this can be done with a normal range-based for loop in C++; for Dictionary it requires creating an IterCookie. As above, these are containers using const char* keys and int* values.

Dictionary

Time for 50 iterations (us): 7000
Time for 50 iterations (us): 7000
Time for 50 iterations (us): 7000
Time for 50 iterations (us): 6000
Time for 50 iterations (us): 6000

std::map

Time for 50 iterations (us): 2000
Time for 50 iterations (us): 3000
Time for 50 iterations (us): 3000
Time for 50 iterations (us): 3000
Time for 50 iterations (us): 3000

std::unordered_map

Time for 50 iterations (us): 1000
Time for 50 iterations (us): 1000
Time for 50 iterations (us): 2000
Time for 50 iterations (us): 2000
Time for 50 iterations (us): 1000

Updated test code supporting both of the above cases: test.cc.txt

@rsmmr

Member

commented Jul 19, 2019

Nice. Clear win all around.

@timwoj

Contributor Author

commented Jul 31, 2019

As I started working on this in further detail, one benchmark missing above is the case where the key is a complex object, such as a connection identified by two IP addresses and two port numbers.

With the old Dict code, a lookup walks all of the elements in the bucket's list, comparing each one via memcmp against the provided HashKey. For std::map the comparison uses std::less, i.e. the operator< defined for the complex object; since the map is implemented as a balanced tree, we'd expect the search to be faster than linear time, so depending on the speed of operator<, this ought to beat Dict. For std::unordered_map the lookup is a comparison against the hash of the object, however that is computed, and is also faster than linear time, depending on how the hash table is implemented.

The test below is the same as the regular test at the start of this issue: insert 1000000 elements into a container, which causes at least one comparison per inserted object. I didn't add fifo_map tests because earlier testing showed it was slow enough to be unusable. Surprisingly, unordered_map comes out slower than std::map here; I'm not sure whether that's due to the cost of hashing the elements versus just inserting them directly.

Dictionary

real	0m1.968s
user	0m1.903s
sys	0m0.058s

Average operation time (us): 1.517161
Max operation time (us): 100

[plot: per-insert times, Dictionary, complex keys]

std::map

real	0m1.506s
user	0m1.248s
sys	0m0.252s

Average operation time (us): 1.546464
Max operation time (us): 73

[plot: per-insert times, std::map, complex keys]

std::unordered_map

real	0m1.755s
user	0m1.459s
sys	0m0.286s

Average operation time (us): 1.444918
Max operation time (us): 96

[plot: per-insert times, std::unordered_map, complex keys]

Updated benchmarking code: test.cc.txt

@timwoj

Contributor Author

commented Jul 31, 2019

@jsiwek brought up the following on the PR for simple types:

Generally, I think this is the decision process I use:

  • Need ordering: use map
  • Don't need ordering, not on perf. critical path: still use map
  • Don't need ordering, on perf. critical path: measure if unordered_map is better and use it if so

Reason for needing measurements is it's not always intuitive that unordered will be better. E.g. commonly suspect small unordered-maps with long string keys actually may have hash operations which end up causing worse lookup performance than ordered-map. Memory usage differences could be another consideration to add even more nuance.

I think that last paragraph matches the above benchmarks: unordered_map definitely fares worse than in the earlier benchmarks once larger keys are used.

@JustinAzoff

Contributor

commented Aug 2, 2019

One thing the benchmarks may be missing is adding and removing many items over time.
