Benchmark GapEncoder divergence #593

Merged: 31 commits merged into skrub-data:main on Aug 7, 2023

Conversation

@LilianBoulard (Member) commented Jun 12, 2023

Following a discussion with @Vincent-Maladiere, here is a benchmark meant to monitor some values during the internal iterations of the GapEncoder, which could help us understand what's going wrong.

This PR also improves upon #574 by fixing a bug with n_repeat, and adding support for multiple return dictionaries (which, here, is useful to save the values at each iteration).
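
For illustration, here is a hypothetical sketch of what returning multiple dictionaries from a monitored function can look like; the function name, the fields, and the wiring below are assumptions, not the actual benchmarks/utils/monitor.py API:

```python
def bench_gap_iterations(dataset_name, n_iter=5):
    """Hypothetical monitored benchmark returning one dict per iteration."""
    results = []
    for gap_iter in range(n_iter):
        # ... one outer GapEncoder iteration would run here ...
        results.append({"dataset": dataset_name, "gap_iter": gap_iter, "score": 0.0})
    # Returning a list of dicts means one recorded row per iteration,
    # instead of a single row for the whole run.
    return results
```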

Please read the docstring of bench_gap_divergence.py for an in-depth explanation of the process, and give us your insight on what I could add before running it. @GaelVaroquaux

Part of the saga #342

@LilianBoulard LilianBoulard requested review from GaelVaroquaux and removed request for GaelVaroquaux June 12, 2023 18:47
@Vincent-Maladiere (Member) left a comment:

Hey @LilianBoulard, here are some quick comments. My main point is that benchmarks should be simpler to read and program.


print(self.W_.shape, self.A_.shape, self.B_.shape)

self.benchmark_results_.append(
Member:

I'd change 2 things here:

  1. Only take the last values for this run instead of accumulating data, meaning this append() should be placed after the break.
  2. It's safer and easier for debugging to write results to a txt/csv file than to accumulate things in RAM: if you cancel the run, the results are not lost (see the sketch below).
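
A minimal sketch of the second suggestion, assuming a hypothetical file path and row layout (this is not the benchmark's actual code):

```python
import csv
from pathlib import Path

RESULTS_FILE = Path("gap_divergence_results.csv")  # hypothetical path

def save_row(row: dict) -> None:
    """Append one result row to the CSV, writing the header on first use."""
    write_header = not RESULTS_FILE.exists()
    with RESULTS_FILE.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)

# e.g. inside the fitting loop:
# save_row({"dataset": "employee_salaries", "gap_iter": i, "score": score})
```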

Member Author:

  1. Okay, I see your point
  2. True, I'll look into that

@LilianBoulard (Member Author), Jun 14, 2023:

I've improved monitor to automatically save intermediary results, thanks for the suggestion :)
For the first point, I've left it before the break for now so that we can see in greater detail what's going on (and to be honest, the matrices aren't that large for employee_salaries, only ~3700 x 10 each); we might reconsider this when working with the other datasets!

Member:

For the last part, it's just that we're going to have mostly duplicated points, so it will be more work for you to clean your data for plotting ;)

benchmarks/bench_gap_divergence.py (resolved)
benchmarks/bench_gap_divergence.py (outdated, resolved)
@Vincent-Maladiere (Member):

Unless I'm missing something, we want to benchmark the GapEncoder quickly, since it's currently the main bottleneck for the TableVectorizer. Improving the benchmark script brings value, but we should aim for simplicity and getting results first, then iterate on the benchmark.

This way, we can parallelize work and improve the GapEncoder default hyper-parameters, for example.

WDYT?

@LilianBoulard added the benchmarks label and removed the enhancement label on Jun 26, 2023
@LilianBoulard (Member Author) commented Jul 17, 2023:

Here's the result of the --plot:
[Benchmark plots omitted: multi-panel figure comparing the datasets across GapEncoder iterations]
Note:

  • max_iter_e_step is the inner loop
  • gap_iter is the outer loop
  • The red dashed line in the center-top plot is the tolerance (parameter tol); a hedged sketch of the corresponding GapEncoder arguments follows below
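
A minimal sketch of how these quantities presumably map onto GapEncoder's constructor arguments; the values below are illustrative, not the grid actually used by the benchmark:

```python
from skrub import GapEncoder  # the encoder being benchmarked in this PR

enc = GapEncoder(
    n_components=10,
    max_iter=5,           # outer loop (what the plots call gap_iter)
    max_iter_e_step=20,   # inner loop: E-step iterations per batch
    tol=1e-4,             # convergence threshold, the red dashed line
    random_state=0,
)
```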

@GaelVaroquaux (Member):

Looking at the plots above, I think we can see that the drug_directory dataset is particularly difficult.

In particular, on the top left plot, the score of the corresponding problem seems to be going up as a function of iterations, although it really shouldn't be.

Maybe that would be an interesting specific case to investigate.
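
If useful, here is a hedged sketch of one way to probe that behaviour: compare GapEncoder's score for increasing max_iter on a text column. The toy strings below merely stand in for a drug_directory-like column, and the numbers are not claimed to reproduce the plot.

```python
import pandas as pd

from skrub import GapEncoder

# Toy strings standing in for a drug_directory-like column (illustrative only).
X = pd.DataFrame({"name": [
    "ibuprofen 200 mg tablet", "ibuprofen 400 mg tablet",
    "acetaminophen 500 mg caplet", "aspirin 81 mg tablet",
]})

for max_iter in (1, 2, 5, 10):
    enc = GapEncoder(n_components=3, max_iter=max_iter, random_state=0)
    enc.fit(X)
    # GapEncoder.score() returns the KL divergence between the n-gram counts
    # of X and their factorization; in principle this divergence should not
    # increase as max_iter grows.
    print(max_iter, enc.score(X))
```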

…_gap_divergence

# Conflicts:
#	benchmarks/bench_tablevectorizer_tuning.py
#	benchmarks/utils/__init__.py
#	benchmarks/utils/_various.py
#	benchmarks/utils/monitor.py
@LeoGrin (Contributor) left a comment:

Nice improvements to the benchmarking suite, thanks! The doc is much clearer, and the hot-load is cool

_monitored[key].append(value)
# To avoid repeating code, we move the result
# mapping(s) to a list.
result_mappings = []
Contributor:

Am I right that this part is a bit complicated in order to support the monitored function returning a list of dicts? Is this a possible situation?

@LilianBoulard (Member Author), Aug 5, 2023:

The code is a bit complex, I agree, but I haven't found a better way of doing it. And yes, we do want to support the monitored function returning a list of dicts; the "why" is explained in the docstring :)
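
For readers of the thread, a rough, simplified sketch of the idea (not the exact monitor.py code): a single returned dict and a returned list of dicts are handled the same way by first moving everything into one list before the values are recorded.

```python
def normalize_results(returned):
    """Wrap a single dict in a list so both return styles are handled alike."""
    if isinstance(returned, dict):
        return [returned]
    return list(returned)

# normalize_results({"score": 1.0})                    -> [{"score": 1.0}]
# normalize_results([{"score": 1.0}, {"score": 0.9}])  -> kept as two rows
```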

Contributor:

Oh yeah sorry, we're actually using this in this benchmark 🤦‍♂️

benchmarks/utils/monitor.py (outdated, resolved)
Co-authored-by: LeoGrin <45738728+LeoGrin@users.noreply.github.com>
@jovan-stojanovic (Member) left a comment:

Thanks!

@jovan-stojanovic merged commit 5e22f20 into skrub-data:main on Aug 7, 2023
24 checks passed
LeoGrin added a commit to LeoGrin/skrub that referenced this pull request Aug 24, 2023
* Improve framework

* Add Gap divergence benchmark

* Set initial iter values

* Add omitted value (score)

* Force keyword arguments and add progress saving

* Minor fixes

* Update to main

* Add pyarrow to benchmark requirements

* Implement cross-validation

* Update README

* Parallelize cross-validation

* Fix attribute access

* Fix attribute access

* Fix unpacking

* Fix results naming

* Fix results bug

* Multiple columns support and W_change plot v1

* Refactor dataset getters

* Adapt getters usage to new format

* Small fixes

* New plots

* Fix dataset categorization

* Update used datasets

* Add score per inner iteration plot

* Add benchmark results

* Compute the score after tuning

* Add issue link for `road_safety`

* Update results

* Update benchmarks/utils/monitor.py

Co-authored-by: LeoGrin <45738728+LeoGrin@users.noreply.github.com>

---------

Co-authored-by: LeoGrin <45738728+LeoGrin@users.noreply.github.com>
Labels: benchmarks, no changelog needed