Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topk feature request #67

Closed
gitboy16 opened this issue Aug 5, 2022 · 16 comments
Closed

Topk feature request #67

gitboy16 opened this issue Aug 5, 2022 · 16 comments
Labels
enhancement New feature or request performance

Comments

@gitboy16
Copy link

gitboy16 commented Aug 5, 2022

Hi,
Thanks for the package. Would it be possible to have the ability to return the index of the values instead of the values when using topk?
Thanks

@sl-solution sl-solution added the enhancement New feature or request label Aug 6, 2022
@sl-solution
Copy link
Owner

Yes, we can add a new function, it should be straightforward. Probably we can call the new function topkperm(?)

@giantmoa
Copy link
Contributor

giantmoa commented Aug 6, 2022

It would be better to add an extra argument to topk, wouldn't be?

@gitboy16
Copy link
Author

gitboy16 commented Aug 6, 2022

It is really up to you, I don't mind an extra argument or a new function as long as it is efficient and as fast as topk. Thank you.

@sl-solution
Copy link
Owner

I have added the new functionality to the master branch.

topk(x, k; rev = false, output_indices = false)

Return upto k largest nonmissing elements of x. When rev = true it returns upto k smallest nonmissing elements of x. When all elements are missing, the function returns [missing].

If output_indices = true, the function returns the values and their indices.

The topk function uses < for comparing values

@gitboy16
Copy link
Author

gitboy16 commented Aug 6, 2022

Thank you. I will have a look on Monday.

@gitboy16
Copy link
Author

gitboy16 commented Aug 8, 2022

What I feared happened. topk is now 2 times slower in my benchmark.

x = rand(8*(10^8));
@time InMemoryDatasets.topk(x, 6)
  2.618424 seconds (1 allocation: 112 bytes)

Before I had it at 1.237111 seconds.

@sl-solution sl-solution reopened this Aug 9, 2022
@sl-solution
Copy link
Owner

sl-solution commented Aug 9, 2022

What I feared happened. topk is now 2 times slower in my benchmark.

x = rand(8*(10^8));
@time InMemoryDatasets.topk(x, 6)
  2.618424 seconds (1 allocation: 112 bytes)

Before I had it at 1.237111 seconds.

I am fixing this. I am also going to go with topkperm, since adding extra keyword argument would cause inference problem.

@sl-solution
Copy link
Owner

check master.

@gitboy16
Copy link
Author

gitboy16 commented Aug 9, 2022

Hi, thank you. I think you have a performance regression issue on topk. On master topk seems slower. 1.754926 seconds (2 allocations: 224 bytes) . It was just 1 allocation before. Did you change topk? On master topk and topkperm seem to have the same performance.

@sl-solution
Copy link
Owner

On master topk seems slower. 1.754926 seconds (2 allocations: 224 bytes) . It was just 1 allocation before. Did you change topk? On master topk and topkperm seem to have the same performance.

This is expected, since topk now supports any DataType (previously it only supported numeric values) plus the compiler can infer its output correctly (this is important for combine, modify, etc. ). One extra allocation is due to calling allowmissing on the output.

@sl-solution
Copy link
Owner

Hi, thank you. I think you have a performance regression issue on topk. On master topk seems slower. 1.754926 seconds (2 allocations: 224 bytes) . It was just 1 allocation before. Did you change topk? On master topk and topkperm seem to have the same performance.

Would you please check the improve_topk branch? It still will be 2 allocations, however, it should be much faster and additionally supports two extra keyword arguments: lt and by.

@gitboy16
Copy link
Author

Hi, thanks, (using the same benchmark as above) yes topk is a bit faster at 1.65sec and topkperm much faster at 1.35 sec. Thank you

@sl-solution
Copy link
Owner

Hi, thanks, (using the same benchmark as above) yes topk is a bit faster at 1.65sec and topkperm much faster at 1.35 sec. Thank you

Both should be about the same timing - I guess the difference here is due to something else.

One more thing...

The latest commit adds multithreading to topk and topkperm. check it out and pass threads=true to have a sneak peek.

@gitboy16
Copy link
Author

Nice work! Yes I see that both functions have similar performance now but I only see a difference for topk when using multithreading. It doea not seem to work for topkperm. Do you have a private email I can contact you on? Thank you.

@sl-solution
Copy link
Owner

Nice work! Yes I see that both functions have similar performance now but I only see a difference for topk when using multithreading. It doea not seem to work for topkperm. Do you have a private email I can contact you on? Thank you.

It should be a bug, I am working on that.
You can use my github email directly github.sl.solution@icloud.com.

@sl-solution
Copy link
Owner

Currently we pass < as the default value for lt, however, in new release we will set isless as the default value for lt. This is because < is not safe when data contains NaN. Thus, to gain the maximum performance, you should pass lt = < when it is suitable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

No branches or pull requests

3 participants