Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconstruction measures #9

Merged

Conversation

agoscinski
Copy link
Collaborator

@agoscinski agoscinski commented Nov 10, 2020

Implements reconstruction measures from the paper arXiv 2009.02741

This first suggestion how the reconstruction measures would be laid out. Feedback is welcomed.

Still missing:

@agoscinski agoscinski self-assigned this Nov 10, 2020
Copy link
Collaborator

@Luthaf Luthaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a very quick look, the code looks fine but does need a lot more doc. Why did you include a separate version of the scalers instead of basing your work on top of #8?

setup.cfg Outdated Show resolved Hide resolved
skcosmo/metrics/reconstruction_measures.py Outdated Show resolved Hide resolved
@Luthaf
Copy link
Collaborator

Luthaf commented Nov 10, 2020

Another question to address is how to deal with datasets for examples. I would like to make sure skcosmo is NOT coupled to librascal or ASE, since it should be usable and useful for people with different workflows.

One way of doing this would be to host pre-computed SOAP vectors for the example datasets and make them available in some way like

from skcosmo_datasets import CSD

X = CSD.descriptor.SOAP_power_spectrum
Y = CSD.properties.CS_local

@agoscinski
Copy link
Collaborator Author

I had a very quick look, the code looks fine but does need a lot more doc. Why did you include a separate version of the scalers instead of basing your work on top of #8?

This branch was not ready when I started this branch. I will rebase the branch as soon as #8 is merged.

Another question to address is how to deal with datasets for examples. I would like to make sure skcosmo is NOT coupled to librascal or ASE, since it should be usable and useful for people with different workflows.

One way of doing this would be to host pre-computed SOAP vectors for the example datasets and make them available in some way like

For the degenerate datasets this interface is offered.

from skcosmo.datasets import load_degenerate_manifold

degenerate_manifold = load_degenerate_manifold()
degenerate_manifold.data.SOAP_power_spectrum
degenerate_manifold.data.SOAP_bispectrum
degenerate_manifold.DESCR

I did not use descriptor because this becomes a bit ambiguous with the description of the dataset as used in sklearn. I havent added the CSD dataset because this would be something for a dataset with a downloadable link which we havent decided yet. The dataset is not essential for this pull request.

@rosecers
Copy link
Collaborator

@agoscinski now that a documentation infrastructure has been merged into main, please add documentation to docs/source as a part of this PR, before we can merge it.

Copy link
Collaborator

@rosecers rosecers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Alex,

This should be split up into multiple PRs. I would separate PRs for:

  • The small changes to setup.cfg and docs/requirements that can be merged quickly
  • The small changes to the docs setup (napoleon) that can be merged quickly
  • loading degenerate manifold data
  • orthogonal procrustes
  • ridgeregression2foldCV
  • metrics/reconstruction_measures
  • model_selection

Any tests/examples/gitignore can be put with the corresponding functions. Right now this PR is way to large to feasibly go through piece by piece, and it will be much easier to merge smaller bits at a time, even if some merges have to wait for others to be put in.

docs/source/conf.py Outdated Show resolved Hide resolved
skcosmo/datasets/descr/degenerate_manifold.rst Outdated Show resolved Hide resolved
skcosmo/linear_model/_base.py Show resolved Hide resolved
skcosmo/linear_model/_base.py Show resolved Hide resolved
skcosmo/linear_model/_base.py Outdated Show resolved Hide resolved
skcosmo/metrics/_reconstruction_measures.py Show resolved Hide resolved
skcosmo/metrics/_reconstruction_measures.py Show resolved Hide resolved
skcosmo/metrics/_reconstruction_measures.py Show resolved Hide resolved
skcosmo/metrics/_reconstruction_measures.py Outdated Show resolved Hide resolved
skcosmo/metrics/_reconstruction_measures.py Outdated Show resolved Hide resolved
@agoscinski
Copy link
Collaborator Author

agoscinski commented Nov 16, 2020

This should be split up into multiple PRs. I would separate PRs for:

  • The small changes to setup.cfg and docs/requirements that can be merged quickly
  • The small changes to the docs setup (napoleon) that can be merged quickly
  • loading degenerate manifold data
  • orthogonal procrustes
  • ridgeregression2foldCV
  • metrics/reconstruction_measures
  • model_selection

I would suggest to split these into individual commits when rebasing the PR instead of individual PRs, since everything is part of reconstruction measures and is used for the examples, therefore self contained. It makes more sense to me and is less work.
Except of

The small changes to setup.cfg and docs/requirements that can be merged quickly
The small changes to the docs setup (napoleon) that can be merged quickly
This one we can put into another PR

@rosecers
Copy link
Collaborator

This should be split up into multiple PRs. I would separate PRs for:

  • The small changes to setup.cfg and docs/requirements that can be merged quickly
  • The small changes to the docs setup (napoleon) that can be merged quickly
  • loading degenerate manifold data
  • orthogonal procrustes
  • ridgeregression2foldCV
  • metrics/reconstruction_measures
  • model_selection

I would suggest to split these into individual commits when rebasing the PR instead of individual PRs, since everything is part of reconstruction measures and is used for the examples, therefore self contained. It makes more sense to me and is less work.

I'm just wary of making such a large PR, when some of it can be ready to merge fairly quickly, whereas other pieces will take a bit more iteration. Is there any reason not to make separate PRs?

@agoscinski
Copy link
Collaborator Author

I'm just wary of making such a large PR, when some of it can be ready to merge fairly quickly, whereas other pieces will take a bit more iteration. Is there any reason not to make separate PRs?

It makes more sense to me and is less work.

skcosmo/metrics/_reconstruction_measures.py Outdated Show resolved Hide resolved
skcosmo/metrics/_reconstruction_measures.py Outdated Show resolved Hide resolved
@rosecers
Copy link
Collaborator

I'm just wary of making such a large PR, when some of it can be ready to merge fairly quickly, whereas other pieces will take a bit more iteration. Is there any reason not to make separate PRs?

It makes more sense to me and is less work.

It will likely result in better code, quicker merging, and will be easier for your fellow developers to review your code. I don't think it will be significantly more work for you to split it up and focus on getting smaller bits of code merged at a time.

@agoscinski
Copy link
Collaborator Author

All of the bits depend on the reconstruction measures. If you change something in OrthogonalRegression then it also changes the reconstruction measure. It is code + example how it is intended be used. I think this would construct more complexity splitting this up because there is a logical dependency between the pieces.
I would rebase the code to individual commits which should already simplify the PR review.

@codecov-io
Copy link

codecov-io commented Nov 18, 2020

Codecov Report

Merging #9 (6b620b8) into main (9a20760) will decrease coverage by 23.32%.
The diff coverage is 76.59%.

Impacted file tree graph

@@             Coverage Diff              @@
##              main       #9       +/-   ##
============================================
- Coverage   100.00%   76.67%   -23.33%     
============================================
  Files            1       12       +11     
  Lines            1      283      +282     
  Branches         0       35       +35     
============================================
+ Hits             1      217      +216     
- Misses           0       57       +57     
- Partials         0        9        +9     
Impacted Files Coverage Δ
skcosmo/preprocessing/scalers.py 28.57% <28.57%> (ø)
skcosmo/metrics/_reconstruction_measures.py 94.20% <94.20%> (ø)
skcosmo/linear_model/_ridge.py 96.15% <96.15%> (ø)
skcosmo/__init__.py 100.00% <100.00%> (ø)
skcosmo/datasets/__init__.py 100.00% <100.00%> (ø)
skcosmo/datasets/_base.py 100.00% <100.00%> (ø)
skcosmo/linear_model/__init__.py 100.00% <100.00%> (ø)
skcosmo/linear_model/_base.py 100.00% <100.00%> (ø)
skcosmo/metrics/__init__.py 100.00% <100.00%> (ø)
skcosmo/model_selection/__init__.py 100.00% <100.00%> (ø)
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9a20760...6b620b8. Read the comment docs.

docs/source/conf.py Outdated Show resolved Hide resolved
docs/source/reference.rst Outdated Show resolved Hide resolved
examples/metrics/plot_gfrm.py Outdated Show resolved Hide resolved
examples/metrics/plot_lfre.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
@rosecers rosecers mentioned this pull request Dec 2, 2020
3 tasks
@agoscinski agoscinski force-pushed the feat/feature-space-measures branch 2 times, most recently from 00699d8 to de78bcd Compare December 7, 2020 19:21
@agoscinski
Copy link
Collaborator Author

agoscinski commented Dec 7, 2020

Starting from commit b7b6c80 ("RidgeRegression2FoldCV added alpha type reg method") are new changes. Everything before is just a rebase. The commit message are relatively verbose about the additional changes.

EDIT: it actually starts one commit before the one mentioned 788c769 "replace setUp to setUpClass to reduce test time"

@agoscinski agoscinski marked this pull request as ready for review December 7, 2020 19:26
Copy link
Collaborator

@Luthaf Luthaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good overall, I have a few high level questions/remarks!

pyproject.toml Show resolved Hide resolved
skcosmo/linear_model/_ridge.py Outdated Show resolved Hide resolved
skcosmo/metrics/__init__.py Show resolved Hide resolved
@Luthaf
Copy link
Collaborator

Luthaf commented Dec 9, 2020

This looks good to me overall, however given the size of the PR I would appreciate if someone else could give it a look as well!

Also, did you complete the last point in your first comment?

check if reconstruction measures give similar results as in the paper

Finally, the git history is a bit all over the place, could you squash related commits together?

@agoscinski
Copy link
Collaborator Author

Okay let me squash some things, check the last point (havent done it yet) and then @rosecers can maybe have a second look on it?

Copy link
Collaborator Author

@agoscinski agoscinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I resolved the comments because they were repetitive and I wanted to reduce the discussion to a single comment.

skcosmo/linear_model/_base.py Show resolved Hide resolved
ortho_group.rvs(cls.features_small.shape[1])
)

def test_global_reconstruction_error_identity(self):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a representative for all review comments about adding docs to the two-liner tests:

But you understood this from the code instantly, the code is self-explanatory. I don't think one should be forced to add doc to such simple code. Especially it is just repetitive. I will add the suggestions you send to it, because you already put the work of writing them. But I dont think one should be required to add doc to such simple code.

tests/test_metrics.py Show resolved Hide resolved
@rosecers
Copy link
Collaborator

I resolved the comments because they were repetitive and I wanted to reduce the discussion to a single comment.

If you're planning to commit the suggestions, you can click "commit" and it will apply the suggestion and resolve the issue. I'd prefer to leave them unresolved until this happens, so the patches do not get lost in the fray.

Copy link
Collaborator

@rosecers rosecers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @agoscinski, it looks overall good, although there are outstanding comments and questions that need to be addressed before rebasing/merging.

Copy link
Collaborator

@rosecers rosecers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase the commits to get a clean commit history then feel free to merge.

the bash for loop did not include notebooks in subdirectories. the
simplest way to solve this seemed to be using find.
when copying the example notebooks to ./tox/examples no subdirectories
are created.
allows the same test function being executed for multiple inputs
the initialization has only be done one time for the all class related tests
which can additionally return overlapping train and test splits
it is linear regression with the additional constraint that the weight matrix has to be an orthogonal matrix/projector
it is in certain cases a more efficient/accurate ridge regression, using a 2-fold cross-validation scheme, in comparison to sklearn.linear_model.RidgeCV
it includes GFRE, GFRD and LFRE like reconstruction measures
modules: model_selection, metrics and linear_model
@agoscinski agoscinski merged commit 97adbd5 into scikit-learn-contrib:main Jan 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants