
Update ripser backend #106

Merged · 25 commits · Nov 2, 2020

Conversation

reds-heig

Hello,

Description

This PR updates the C++ ripser backend to the latest code available at Ripser and also improves some parts of the code.

Changes

These are some of the main changes:

  • Update to the latest ripser.cpp available
  • Add support for the robinhood hashmap
  • Flatten the coefficient binomial table
  • Use a specialized function to compute modulo 2 with a mask operator
  • Refactor to remove duplicated code
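As a side note on the modulo-2 item above, this is the usual bit-mask trick, sketched here in Python purely for illustration (the backend itself is C++, and `mod2` is a hypothetical name, not a function from the PR):

```python
# Illustrative sketch (not the actual C++ backend code): for a non-negative
# integer, the value mod 2 is just its lowest bit, so a bitwise AND with 1
# can replace the more general modulo operator.
def mod2(x: int) -> int:
    """Compute x mod 2 for non-negative x using a bit mask."""
    return x & 1

# Sanity check against the % operator.
assert all(mod2(x) == x % 2 for x in range(16))
```

The win comes from AND being a single cheap instruction, whereas a general modulo may compile to a division.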

The profiling of ripser showed a lot of time spent in std::unordered_map. From personal experience, and as you'll see in the benchmark below, robinhood::unordered_map adds a noticeable speed-up just by replacing std::unordered_map. I don't have memory-consumption numbers for ripser.py, but in a different project, memory usage was reduced by 10% just by switching to robinhood::unordered_map.

Comments

std::unordered_map

I did not add the robinhood hashmap directly as a dependency, but I left an #if defined guard in the code in case someone (myself, in this case) would like to use it.

I would really like not to lose the performance gains brought by the new unordered_map.

Enclosing radius

I added a third table to the benchmark discussing the enclosing radius optimization. In ripser, this optimization is used when no threshold is set explicitly. In ripser.py, to enable the use of the enclosing radius, we need to set the threshold parameter to threshold=np.finfo(np.float32).max.

In ripser.py the default value of the threshold is infinity, meaning that the enclosing radius isn't used. But as described in the ripser paper, p. 11, section 4, input:

If no threshold is specified, the minimum enclosing radius of the input is used as a threshold, as suggested by Henselman-Petrusek [16]. Above that threshold the Vietoris–Rips complex is a simplicial cone with apex a minimizing point x, and so the homology remains trivial afterwards.

From what I understand, there's no point in computing PH above this radius, but maybe I am wrong?
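For context, the quantity being discussed can be computed directly from a distance matrix; here is a minimal NumPy sketch (my own illustration, not code from ripser.py): for each candidate apex, take its maximum distance to all other points, then minimize over apexes.

```python
import numpy as np

def enclosing_radius(D: np.ndarray) -> float:
    """Minimum enclosing radius of a finite metric space with distance matrix D.

    For each candidate apex x, max_y d(x, y) is the radius needed to cover
    every point from x; the enclosing radius is the minimum of that over x.
    """
    return float(np.min(np.max(D, axis=1)))

# Tiny example: three points on a line at 0, 1 and 3.
pts = np.array([0.0, 1.0, 3.0])
D = np.abs(pts[:, None] - pts[None, :])
print(enclosing_radius(D))  # prints 2.0: the point at 1 covers everything within distance 2
```

Past this radius every point is within the apex's ball, so the complex becomes a cone and the homology stays trivial, which is why computing further cannot add barcodes.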

One possibility would be to change the condition inside ripser so that the enclosing radius is also used as the threshold when the threshold is set to infinity. What do you think? I emailed Prof. Bauer directly, but I haven't heard back yet.

Test

I ran the available tests with pytest, and I also verified barcodes and cocycles in more detail on different datasets.
But please, feel free to test on your side :)

Please let me know if I need to make further changes or if you would rather not merge this PR.

Best,
Julián


Benchmark

I ran some benchmarks to show the run-time differences on the same datasets used in the original ripser paper:

Ripser.py

| Dataset | Size | Threshold | Dim | Coeff | Time [s] |
| --- | --- | --- | --- | --- | --- |
| sphere3 | 192 | – | 2 | 2 | 1.5 |
| dragon | 2000 | – | 1 | 2 | 3.8 |
| o3 | 1024 | 1.8 | 3 | 2 | 3.1 |
| random16 | 50 | – | 7 | 2 | \* |
| fractal | 512 | – | 2 | 2 | 18.3 |
| o3 | 4096 | 1.4 | 3 | 2 | \*\* |

\* Ran out of memory; the enclosing radius is necessary
\*\* I'm surprised it runs out of memory

Ripser.py updated

| Dataset | Size | Threshold | Dim | Coeff | Time [s] | Time robinhood [s] |
| --- | --- | --- | --- | --- | --- | --- |
| sphere3 | 192 | – | 2 | 2 | 1.5 | 1.2 |
| dragon | 2000 | – | 1 | 2 | 3.8 | 3.3 |
| o3 | 1024 | 1.8 | 3 | 2 | 2.9 | 2.2 |
| random16 | 50 | – | 7 | 2 | \* | \* |
| fractal | 512 | – | 2 | 2 | 17.7 | 14 |
| o3 | 4096 | 1.4 | 3 | 2 | 68.6 | 53.4 |

\* Ran out of memory; the enclosing radius is necessary

Ripser.py Using enclosing radius

In order to use the enclosing radius optimization implemented in ripser, it is
necessary to pass the value np.finfo(np.float32).max as the threshold. In the table below, this value is used whenever no threshold is specified.

| Dataset | Size | Threshold | Dim | Coeff | Time [s] |
| --- | --- | --- | --- | --- | --- |
| sphere3 | 192 | – | 2 | 2 | 1.5 |
| dragon | 2000 | – | 1 | 2 | 2.9 |
| o3 | 1024 | 1.8 | 3 | 2 | 2.9 |
| random16 | 50 | – | 7 | 2 | 8.4 |
| fractal | 512 | – | 2 | 2 | 17.7 |
| o3 | 4096 | 1.4 | 3 | 2 | 68.6 |

@MonkeyBreaker
Contributor

I'll fix the issue with the Windows compilation ASAP, sorry for that ...

@ulupo
Contributor

ulupo commented Sep 24, 2020

  • Concerning benchmarks, I guess the ultimate table would combine the enclosing radius fix with robinhood hashmap, correct? Perhaps it would be worth showing it? @reds-heig

  • As far as I understand, some changes to the project CI should be made so that robinhood hashmap is available when building Python wheels, right? The user who compiles from source, on the other hand, should not need it installed. @ctralie @sauln is this correct?

  • @ctralie: I think (to be double-checked) the newer C++ backend does not suffer from the issues that led me to opening Incorrect output on COO matrices instantiated with rows and columns not in lexicographic order #103. So perhaps the changes made in Fix #103 #104 can be reverted if this is merged.

@codecov

codecov bot commented Sep 25, 2020

Codecov Report

Merging #106 into master will not change coverage.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #106   +/-   ##
=======================================
  Coverage   96.75%   96.75%           
=======================================
  Files           3        3           
  Lines         154      154           
  Branches       26       26           
=======================================
  Hits          149      149           
  Misses          4        4           
  Partials        1        1           
Impacted Files Coverage Δ
ripser/_version.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fc90fb9...7e1107b.

@bdice
Collaborator

bdice commented Sep 25, 2020

@MonkeyBreaker Should the Windows compatibility changes be suggested in a pull request to the upstream repository? They were initially just a hack I did to make it build so that we could have Windows conda-forge builds (users were asking @sauln for Windows compatibility, if I recall).

@MonkeyBreaker
Contributor

@bdice by upstream repository, do you mean the main Ripser repository?

@bdice
Collaborator

bdice commented Sep 25, 2020

@MonkeyBreaker Yes. The changes made here should enable cross-platform compatibility without any major degradations in the functionality, and it would make this Windows compatible Python/Cython library easier to maintain in the future as the C++ library continues to evolve. It seems like a win-win. https://github.com/Ripser/ripser

setup.py (resolved review thread)
@MonkeyBreaker
Contributor

I just tried compiling ripser on Windows, and indeed the same issues occur.
I think it would be a good idea to prepare a separate PR on the upstream repository to enable compilation on Windows.

@MonkeyBreaker
Contributor

To give more details @ulupo:

Concerning benchmarks, I guess the ultimate table would combine the enclosing radius fix with robinhood hashmap, correct? Perhaps it would be worth showing it? @reds-heig

The last table shows the results without robinhood; would it be worth adding them?

As far as I understand, some changes to the project CI should be made so that robinhood hashmap is available when building Python wheels, right? The user who compiles from source, on the other hand, should not need it installed. @ctralie @sauln is this correct?

Well, I didn't want to add robinhood as a dependency of ripser.py. But I left the possibility in the C++ code to use robinhood in case it's already present in your project.

@ctralie: I think (to be double-checked) the newer C++ backend does not suffer from the issues that led me to opening #103. So perhaps the changes made in #104 can be reverted if this is merged.

Could this be related to this fix added in ripser?

@ulupo
Contributor

ulupo commented Sep 25, 2020

@MonkeyBreaker

The last table shows the results without robinhood, would it be worth adding them ?

I guess I was suggesting that it may be worthwhile to do so to have the "ultimate" performance figures.

Well, I didn't want to add robinhood as a dependency of ripser.py. But I left the possibility in the C++ code to use robinhood in case it's already present in your project.

What I meant was that I guess the CI for this project builds Python wheels using compiled extensions and that maybe these compiled extensions should be made using robinhood so that the final Python wheels can benefit from the performance boost. I am not suggesting any changes for the end user who compiles from source. I am just saying that it would be good if the Python user who installs from PyPI could also benefit from this particular addition.

Could this be related to this fix added in ripser?

Maybe, though I only think this is the case from experimenting with you on some inputs, not from looking deeply into the git history.

@sauln
Member

sauln commented Sep 25, 2020

Thanks @reds-heig for this awesome work, and thanks @ulupo and @bdice for fielding this 🙇

these compiled extensions should be made using robinhood so that the final Python wheels can benefit from the performance boost

This sounds good to me, but unfortunately the CI is only running automated tests at the moment, not builds and deploys. It's been a minute since I've touched the travis code, so that might be easier to convert to github-actions in the long run if we want to add CI/CD.

We should probably also set it up so it runs tests both with and without robinhood present.

@MonkeyBreaker
Contributor

I updated the last table with the robinhood timings added:

| Dataset | Size | Threshold | Dim | Coeff | Time [s] | Time robinhood [s] |
| --- | --- | --- | --- | --- | --- | --- |
| sphere3 | 192 | – | 2 | 2 | 1.5 | 1.2 |
| dragon | 2000 | – | 1 | 2 | 2.9 | 2.5 |
| o3 | 1024 | 1.8 | 3 | 2 | 2.9 | 2.2 |
| random16 | 50 | – | 7 | 2 | 8.4 | 6.0 |
| fractal | 512 | – | 2 | 2 | 17.7 | 14 |
| o3 | 4096 | 1.4 | 3 | 2 | 68.6 | 53.4 |

About robinhood inside the CI, one easy solution could be as follows:

  1. Add robinhood as a submodule of the repository: git submodule add https://github.com/martinus/robin-hood-hashing ripser/<something>
  2. Inside setup.py, test whether the folder is present and set the following flags for the compilation:
    • os.path.isdir('ripser/<something>'); if true,
    • append the following to define_macros: ("USE_ROBINHOOD_HASHMAP", 1)
    • add the correct include path: -Iripser/<something>/src/include for gcc/clang, or /Iripser/<something>/src/include for msvc

@MonkeyBreaker
Contributor

MonkeyBreaker commented Oct 6, 2020

Hi !

I had a bit of time, and I tried to add the robinhood hashmap into the CI.
I did it as follows:

  • git clone https://github.com/martinus/robin-hood-hashing ripser/robinhood
  • Inside setup.py, if the folder ripser/robinhood exists, I enable the robinhood hashmap when compiling ripser.

From what I can observe so far:

  • On travis-ci and appveyor, the robinhood hashmap is correctly downloaded
  • In appveyor, which manages the Windows deploy, the compilation flags are correctly set: 3.7 and 3.7 x64
  • In travis-ci, the robinhood hashmap is correctly downloaded, but I do not have access to the compilation flags. I should add -v before starting the setup.py script inside ci_scripts/install.sh

Otherwise, this seems to work on my machine, but I encountered an issue that only developers of the library should run into:

If you have already run pip install . and you download the robinhood hashmap afterwards, running pip install . again unfortunately won't recompile ripser. You first need to clean the previous build to ensure that it recompiles using the robinhood hashmap.

Of course, these may not be exactly the changes you expect; I'm really not an expert on CI, but it's a first draft :). I didn't want to migrate travis to github-actions, because I think that should be done in a separate PR.

Best,
Julián

.travis.yml Outdated
export DISPLAY=:99.0;
sh -e /etc/init.d/xvfb start;
sleep 3;
fi
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew list python &>/dev/null || brew install python; fi
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew list python3 &>/dev/null || brew install python3; fi
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install pyenv-virtualenv; fi
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install git; fi
Member

It looks like this line is not needed for osx. The travis error says Error: git 2.24.1 is already installed

Contributor

@MonkeyBreaker Oct 6, 2020

I removed it, but I didn't want to assume that git is already installed. To be honest, I didn't expect the pipeline to fail if git was already present.

@sauln
Member

sauln commented Oct 6, 2020

Thanks for taking a pass at this! The changes look reasonable, but I'm not much of an expert myself.

@sauln
Member

sauln commented Oct 6, 2020

Could @bdice, @ulupo, or @ctralie confirm the C++ changes look good? I don't have enough C++ background to judge them.

If one of you confirms and the CI runs successfully, I suggest we merge the changes.

@bdice
Collaborator

bdice commented Oct 7, 2020

@sauln The C++ is somewhat hard to review because the upstream file has been updated, in addition to the Windows compatibility fixes and optional Robinhood dependency. I gave it a quick overview and I think it's fine. Overall the PR state is "working" and the integrations with CI, etc. appear to be functioning as desired.

I think it's in the interest of the maintainers to reduce the downstream burden on ripser.py, and I would push just a little more on a topic I raised previously: this PR is mostly C++ changes and this package's purpose/scope (as I understand it) is to wrap the C++ library and offer convenient Python bindings. Filing an upstream PR to https://github.com/Ripser/ripser for the Windows compatibility fixes and getting these two ripser.cpp files in sync seems like the appropriate next step. If the files are kept "in sync" then future upstream changes can be easily merged into this repository by simply copying the new C++ code into this repo.

As for Robinhood, it seems like this PR offers a 28% average speedup at the cost of potentially breaking the "copy from upstream" method for keeping this repo's C++ internals up to date. Obviously performance improvements are good, but it's a tradeoff to consider carefully. This isn't an issue if someone (PR author(s)? @reds-heig @MonkeyBreaker) is willing to help maintain that patch. Also, if the upstream project is abandoned at some point, then there's no longer a problem -- this repository is already incurring that maintenance cost if the upstream is not accepting new PRs (it doesn't appear to have changed much recently). I just wanted to throw that in there as a perspective from a fellow OSS maintainer with finite time/resources. This code appears to work well and I would approve it for merge if project maintainers agree upon consideration.

@MonkeyBreaker
Contributor

@bdice you raised some good points.

  • About a PR to the upstream repository for Windows compatibility: I can do it, but I'm not sure it will be merged; I don't know whether it is still maintained, or whether the author is simply quite busy.
  • This PR aligns with the upstream C++ implementation, but there are some differences:
    • ripser.py supports lower-star filtrations, and for that, non-zero vertex births need to be supported; as far as I understood, this isn't the case in the upstream implementation.
    • The upstream implementation takes a slightly different approach to setting the maximal dimension: it uses dim_max(std::min(_dim_max, index_t(dist.size() - 2))), whereas ripser.py simply uses dim_max(_dim_max). The reason is that if the user asks for up to dimension 5 (for example) and the upstream rule clamps dim_max to, say, 3, the output will only contain 3 dimensions instead of 5, because the two missing dimensions have empty barcodes.
    • Currently in ripser.py, the ratio parameter isn't exposed to Python. From the upstream documentation, ratio is used as follows: "ratio r: only show persistence pairs with death/birth ratio > r." To support this in ripser.py, it would be necessary to expose this parameter. In ripser.py, ratio is set to 1.
    • Maybe it's trivial, but the source is indented with my own personal rules; I use clang-format. I think that for future changes it would be good to use a defined indentation style, which is much easier to maintain :). All this to say that I couldn't reproduce the upstream indentation because I did not find its configuration.
  • About Robinhood: the upstream repository has optional support for Google's sparsehash; I just find Robinhood easier to maintain as a dependency than Google's sparsehash. If my memory doesn't fail me, just replacing std::unordered_map with Robinhood upstream won't compile, because in one place in the code an insert does not support the constructor used. But I'm not 100% sure; this was some time ago. In any case, nothing difficult to fix.
  • As for helping to maintain it: sure, I'm more than happy to do it, but it will be done in my spare time ...

@bdice
Collaborator

bdice commented Oct 7, 2020

@MonkeyBreaker Thanks for the helpful insight! I wasn't aware that there were other changes in this repo's copy of the C++ code. Since that's the case, my intent of aligning the two C++ implementations may not be a realistic goal. Thanks again for the PR and for thinking about the questions I raised. I'll let project maintainers finalize and merge this PR.

@sauln
Member

sauln commented Oct 8, 2020

Alright, let's ship it! @MonkeyBreaker could you update the version number (https://github.com/scikit-tda/ripser.py/blob/master/ripser/_version.py) to 0.6.0 and write a brief summary in the changelog (CHANGELOG.md), and then I'll merge the changes and redeploy everything.

@ctralie
Member

ctralie commented Oct 8, 2020 via email

@MonkeyBreaker
Contributor

MonkeyBreaker commented Oct 8, 2020

Hi !

@ctralie thank you for your feedback! I hope that despite everything going on in the US, all will go well.
Well, if we're ready to drop features (like ratio), we could maybe discuss in a separate issue whether we could/should remove unnecessary computation; I'm an optimization guy, and in my opinion work that we don't use at all should be removed. About the changes you made to ripser, in my opinion they were worth it.

Anyway, I do think because of this, it's probably more effort than it's
worth to keep this synced with the original ripser, but then again, there
are other capabilities there like persistent homology instead of
cohomology, where it's possible to actually extract representative cycles.
So I am open to discussion later. But for now, let's keep it its own thing.

I think we should be able to update the code to integrate these kinds of possibilities in the future, even if we differ a bit from the upstream repository. For me, it's important that we match the upstream repository as much as possible, so that new upstream changes are easy to integrate.

@sauln Let me proceed with the changes, but before this PR is merged, I would like everyone's opinion on one of the points I raised in the description of the PR:

Enclosing radius

I added a third table to the benchmark discussing the enclosing radius optimization. In ripser, this optimization is used when no threshold is set explicitly. In ripser.py, to enable the use of the enclosing radius, we need to set the threshold parameter to threshold=np.finfo(np.float32).max.
In ripser.py the default value of the threshold is infinity, meaning that the enclosing radius isn't used. But as described in the ripser paper, p. 11, section 4, input:
If no threshold is specified, the minimum enclosing radius of the input is used as a threshold, as suggested by Henselman-Petrusek [16]. Above that threshold the Vietoris–Rips complex is a simplicial cone with apex a minimizing point x, and so the homology remains trivial afterwards.
From what I understand, there's no point in computing PH above this radius, but maybe I am wrong?
One possibility would be to change the condition inside ripser so that the enclosing radius is also used as the threshold when the threshold is set to infinity. What do you think? I emailed Prof. Bauer directly, but I haven't heard back yet.

Currently, in order to use the enclosing radius optimization, we need to set the threshold in Python to np.finfo(np.float32).max. Otherwise it will use the one set by the user, or inf by default. As I said earlier, and from what I understood from the papers, computing homology beyond the enclosing radius won't output more barcodes. I think we should modify the condition here:

if (threshold == std::numeric_limits<value_t>::max() ||
    threshold == std::numeric_limits<value_t>::infinity()) {
  ....

@ctralie, @sauln, @bdice, @ulupo, what do you think about this? Should I update the code, or is there a reason to compute, in some cases, "To Infinity... and Beyond!"?

Best,
Julián

@ulupo
Contributor

ulupo commented Oct 9, 2020

@MonkeyBreaker knows my opinion on this issue from the conversations we've had on it, but to share with everybody else: I think we should make sure that the enclosing radius optimization is used when appropriate, as the performance benefits can be large (if slightly unpredictable).

In C++ ripser, what seems to happen in https://github.com/Ripser/ripser/blob/286d3696796a707eecd0f71e6377880f60c936da/ripser.cpp#L1022-L1039 is this:

  1. if the user does not pass a threshold via the --threshold option in the command line, the threshold is internally set to std::numeric_limits<value_t>::max() and the enclosing radius optimization is used;
  2. if the user passes std::numeric_limits<value_t>::infinity() explicitly as a threshold, no optimization is used.

I think 2 is a small unintentional design flaw. It seems clear to me that std::numeric_limits<value_t>::infinity() should also mean "we will not be using a threshold", and hence that the enclosing radius optimization should be used.

So I would be in favour of implementing @MonkeyBreaker's suggested modification of the if clause.

When interfacing with Python, are we sure that np.inf will be passed correctly by the binding code as std::numeric_limits<value_t>::infinity()?

@ulupo
Contributor

ulupo commented Oct 9, 2020

Additionally, I'd like to repeat a previous point I made: I think that with this update of the C++ backend one should be able to fully revert #104 and the lexicographic ordering should not be necessary. If this is the case, I suggest this is done in this PR or at least as part of the 0.6 release, and that an example such as the one I gave in #103 is added as a test to avoid regressions.

@sauln
Member

sauln commented Oct 9, 2020

@ulupo and @MonkeyBreaker, you've made a good case for modifying the behavior. Again, I am not as familiar with the C++ backend as I should be, so I will trust both of your judgements.

@ulupo Could you add regression tests and revert the changes in a follow up PR?

@ulupo
Contributor

ulupo commented Oct 9, 2020

@ulupo Could you add regression tests and revert the changes in a follow up PR?

Sure OK! 👍

@ctralie
Member

ctralie commented Oct 9, 2020 via email

@ulupo
Contributor

ulupo commented Oct 12, 2020

@MonkeyBreaker thanks for the extra commit! I repeat one small question I had, just to be sure:

When interfacing with Python, are we sure that np.inf will be passed correctly by the binding code as std::numeric_limits<value_t>::infinity()?

If yes, I have nothing more to add and leave it for the maintainers to decide on whether the state is good for merging.

@MonkeyBreaker
Contributor

@ulupo About infinity, I verified one thing: whether float infinity equals double infinity, and from my results that is the case.

About np.inf, the information I have is from the official NumPy documentation.

NumPy uses the IEEE Standard for Binary Floating-Point for Arithmetic (IEEE 754). This means that Not a Number is not equivalent to infinity. Also that positive infinity is not equivalent to negative infinity. But infinity is equivalent to positive infinity.

But I cannot find information about Python inf vs. C++ inf. At the moment, in cython or pybind11, np.inf and std::numeric_limits<value_t>::infinity() are equal. I think the best way to make sure this is always the case is to add a test.
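On the np.inf question, a quick Python-side sanity check might look like this (a sketch of such a test, under the assumption that value_t is a single-precision float on the C++ side):

```python
import numpy as np

# IEEE 754: single- and double-precision infinities compare equal,
# and Python's float("inf") is the same value as np.inf.
assert np.float32(np.inf) == np.float64(np.inf)
assert float("inf") == np.inf

# Casting np.inf to float32 (the assumed value_t) preserves infinity,
# and it stays strictly greater than the largest finite float32, so
# thresh=np.finfo(np.float32).max and thresh=np.inf remain distinguishable.
assert np.isinf(np.float32(np.inf))
assert np.float32(np.inf) > np.finfo(np.float32).max
```

This only checks the NumPy side, of course; a binding-level test would still be needed to confirm the value arriving in the C++ code.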

julian added 9 commits October 29, 2020 20:34
Signed-off-by: julian <julian.burellaperez@heig-vd.ch> (×9)
@MonkeyBreaker
Contributor

Well, I'll go to sleep; I'm struggling with Windows (yay ...).

With the new CI, the Windows job uses conda. So far so good, but for a reason I cannot find, it then uses the gcc.exe compiler by default. This implies that the flags should start with - and not /, while currently the flags are configured depending on the platform (Windows, Darwin, etc.).

I haven't found yet how to detect the compiler used for the compilation and, based on that, choose the correct flag format. I'll give it a try tomorrow or over the weekend.

Julián

julian added 6 commits October 30, 2020 13:18
Signed-off-by: julian <julian.burellaperez@heig-vd.ch> (×6)
@MonkeyBreaker
Contributor

Hurray! It seems to work now.

To make it work, I followed this answer on SO.
The "hack" I implemented is to create a setup.cfg into which I add the following two lines:

[build]
compiler=msvc

If you have another solution, please feel free to integrate it :)

Julián

julian added 3 commits October 30, 2020 17:00
Signed-off-by: julian <julian.burellaperez@heig-vd.ch> (×3)
@MonkeyBreaker
Contributor

Hi everyone,

As @ubauer requested, I benchmarked and also added some changes:

  • PACK now also works on Windows; see below for a short discussion of this
  • Replaced std::hash with robinhood::hash; the benchmark shows better performance.

About PACK on Windows: the Windows compiler requires that all data fields have the same type in order to pack correctly. To achieve this, I changed coefficient_t coefficient into index_t coefficient inside entry_t. This works because both are signed types. From my tests and observations, the performance and the behaviour are the same; please feel free to double-check.

robinhood has two different memory layouts. Currently I use the robinhood hashmap in "auto" mode to choose the memory layout. I benchmarked our case and got the following results:

| Dataset | Size | Threshold | Dim | Coeff | unordered_map [s] | unordered_flat_map [s] | unordered_node_map [s] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| sphere3 | 192 | – | 2 | 2 | 1.4 | 1.4 | 1.4 |
| dragon | 2000 | – | 1 | 2 | 2.6 | 2.6 | 2.7 |
| o3 | 1024 | 1.8 | 3 | 2 | 2.3 | 2.3 | 2.4 |
| random16 | 50 | – | 7 | 2 | 6.5 | 6.5 | 6.9 |
| fractal | 512 | – | 2 | 2 | 16.1 | 16.1 | 16.5 |
| o3 | 4096 | 1.4 | 3 | 2 | 57.9 | 57.9 | 58.4 |

unordered_map is the auto mode, and apparently in our case it chooses the best memory layout without any performance losses. From what I understood, this choice is made at compile time by detecting the type of key used inside the hashmap.
By the way, if these results seem slower than the ones at the beginning of the PR, it's because my computer was doing other heavy tasks, which I think slowed down the execution a bit.

Reading the robinhood documentation, I stumbled on a hash function that is directly available. The benchmark showed that this hash function provides a bit more speed-up :)

| Dataset | Size | Threshold | Dim | Coeff | std::hash [s] | robin_hood::hash [s] |
| --- | --- | --- | --- | --- | --- | --- |
| sphere3 | 192 | – | 2 | 2 | 1.4 | 1.4 |
| dragon | 2000 | – | 1 | 2 | 2.6 | 2.6 |
| o3 | 1024 | 1.8 | 3 | 2 | 2.3 | 2.3 |
| random16 | 50 | – | 7 | 2 | 6.5 | 6.2 |
| fractal | 512 | – | 2 | 2 | 16.1 | 15 |
| o3 | 4096 | 1.4 | 3 | 2 | 57.9 | 53.9 |

Please let me know if you have any question/suggestion.

@ubauer now that you have write access to the repository, feel free to add all the changes we discussed :)

Best,
Julián

@sauln
Member

sauln commented Oct 30, 2020

This is great! I'll give it one last review this weekend and give a few days for anyone else to make comments before shipping.

Thanks @MonkeyBreaker for all your work with this improvement 🙇

Collaborator

@bdice left a comment

Looks good to me. Really nice work, this took a lot of effort.

@sauln sauln merged commit f784e1f into scikit-tda:master Nov 2, 2020
@sauln
Member

sauln commented Nov 2, 2020

@MonkeyBreaker thank you for the hard work putting this together and your patience getting it merged! 0.6.0 is out. I still have a bit of work to do to get the documentation rolled out.

I was hoping to get one more brief PR from you! Could you add a blurb about the robinhood installation in the README and on the docs site? I think a copy of the benchmarking table would also be really helpful on the docs site!

Thank you :D

@MonkeyBreaker
Contributor

@sauln thank you for the merge !

Sure, about the installation and the benchmarking: let me prepare it, hopefully by the end of the week.

And also, thank you everyone for the already amazing work on the library :D

Julián

@ctralie
Member

ctralie commented Nov 2, 2020 via email

7 participants