Reconstruction measures #9

agoscinski · 2020-11-10T08:36:48Z

Implements the reconstruction measures from the paper arXiv:2009.02741

This is a first suggestion for how the reconstruction measures could be laid out. Feedback is welcome.

Still missing:

overall documentation
merge with the preprocessing branch (Add KernelCentering and NormalizeScaler classes for preprocessing, #8)
LFRE implementation and example
code formatting
check if reconstruction measures give similar results as in the paper

Luthaf

I had a very quick look, the code looks fine but does need a lot more doc. Why did you include a separate version of the scalers instead of basing your work on top of #8?

setup.cfg

skcosmo/metrics/reconstruction_measures.py

Luthaf · 2020-11-10T10:42:43Z

Another question to address is how to deal with datasets for examples. I would like to make sure skcosmo is NOT coupled to librascal or ASE, since it should be usable and useful for people with different workflows.

One way of doing this would be to host pre-computed SOAP vectors for the example datasets and make them available in some way like

from skcosmo_datasets import CSD

X = CSD.descriptor.SOAP_power_spectrum
Y = CSD.properties.CS_local

agoscinski · 2020-11-11T09:44:31Z

> I had a very quick look, the code looks fine but does need a lot more doc. Why did you include a separate version of the scalers instead of basing your work on top of #8?

The branch for #8 was not ready when I started this one. I will rebase this branch as soon as #8 is merged.

> Another question to address is how to deal with datasets for examples. I would like to make sure skcosmo is NOT coupled to librascal or ASE, since it should be usable and useful for people with different workflows.
>
> One way of doing this would be to host pre-computed SOAP vectors for the example datasets and make them available in some way like

For the degenerate datasets this interface is offered:

from skcosmo.datasets import load_degenerate_manifold

degenerate_manifold = load_degenerate_manifold()
degenerate_manifold.data.SOAP_power_spectrum
degenerate_manifold.data.SOAP_bispectrum
degenerate_manifold.DESCR

I did not use descriptor as the attribute name because it is easily confused with the description of the dataset as used in sklearn (DESCR). I haven't added the CSD dataset because it would have to be a dataset with a downloadable link, and we haven't decided how to handle those yet. That dataset is not essential for this pull request.
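As a rough illustration of the layout described here, the sketch below mimics a loader that returns an object with a data attribute for the feature matrices and a DESCR attribute for the description. The function name load_toy_manifold and the placeholder values are hypothetical, and SimpleNamespace stands in for the Bunch container that sklearn loaders return; this is not the skcosmo implementation.

```python
from types import SimpleNamespace


def load_toy_manifold():
    """Return a container shaped like the interface above (placeholder values only)."""
    # Hypothetical stand-ins for the real feature matrices: 3 samples each.
    data = SimpleNamespace(
        SOAP_power_spectrum=[[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
        SOAP_bispectrum=[[0.0, 1.0, 2.0], [1.0, 0.0, 2.0], [0.5, 0.5, 2.0]],
    )
    # DESCR holds the human-readable description, so "data" stays unambiguous.
    return SimpleNamespace(data=data, DESCR="Toy stand-in for a degenerate manifold.")


toy = load_toy_manifold()
print(len(toy.data.SOAP_power_spectrum), "samples")
print(toy.DESCR)
```

Keeping the features under data and the text under DESCR avoids a second attribute whose name could be read as either "descriptor" or "description".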

rosecers · 2020-11-12T15:14:33Z

@agoscinski now that a documentation infrastructure has been merged into main, please add documentation to docs/source as a part of this PR, before we can merge it.

rosecers

Hey Alex,

This should be split up into multiple PRs. I would separate PRs for:

The small changes to setup.cfg and docs/requirements that can be merged quickly
The small changes to the docs setup (napoleon) that can be merged quickly
loading degenerate manifold data
orthogonal procrustes
ridgeregression2foldCV
metrics/reconstruction_measures
model_selection

Any tests/examples/gitignore changes can be put with the corresponding functions. Right now this PR is way too large to feasibly go through piece by piece, and it will be much easier to merge smaller bits at a time, even if some merges have to wait for others to be put in.

docs/source/conf.py

skcosmo/datasets/descr/degenerate_manifold.rst

skcosmo/linear_model/_base.py

skcosmo/metrics/_reconstruction_measures.py

agoscinski · 2020-11-16T11:43:12Z

> This should be split up into multiple PRs. I would separate PRs for:
>
> The small changes to setup.cfg and docs/requirements that can be merged quickly
>
> The small changes to the docs setup (napoleon) that can be merged quickly
>
> loading degenerate manifold data
>
> orthogonal procrustes
>
> ridgeregression2foldCV
>
> metrics/reconstruction_measures
>
> model_selection

I would suggest to split these into individual commits when rebasing the PR instead of individual PRs, since everything is part of reconstruction measures and is used for the examples, therefore self contained. It makes more sense to me and is less work.
The exceptions are

> The small changes to setup.cfg and docs/requirements that can be merged quickly
>
> The small changes to the docs setup (napoleon) that can be merged quickly

Those we can put into another PR.

rosecers · 2020-11-16T12:18:02Z

>> This should be split up into multiple PRs. I would separate PRs for:
>>
>> The small changes to setup.cfg and docs/requirements that can be merged quickly
>>
>> The small changes to the docs setup (napoleon) that can be merged quickly
>>
>> loading degenerate manifold data
>>
>> orthogonal procrustes
>>
>> ridgeregression2foldCV
>>
>> metrics/reconstruction_measures
>>
>> model_selection
>
> I would suggest to split these into individual commits when rebasing the PR instead of individual PRs, since everything is part of reconstruction measures and is used for the examples, therefore self contained. It makes more sense to me and is less work.

I'm just wary of making such a large PR, when some of it can be ready to merge fairly quickly, whereas other pieces will take a bit more iteration. Is there any reason not to make separate PRs?

agoscinski · 2020-11-16T12:48:26Z

> I'm just wary of making such a large PR, when some of it can be ready to merge fairly quickly, whereas other pieces will take a bit more iteration. Is there any reason not to make separate PRs?

It makes more sense to me and is less work.

skcosmo/metrics/_reconstruction_measures.py

rosecers · 2020-11-16T12:52:27Z

>> I'm just wary of making such a large PR, when some of it can be ready to merge fairly quickly, whereas other pieces will take a bit more iteration. Is there any reason not to make separate PRs?
>
> It makes more sense to me and is less work.

It will likely result in better code, quicker merging, and will be easier for your fellow developers to review your code. I don't think it will be significantly more work for you to split it up and focus on getting smaller bits of code merged at a time.

agoscinski · 2020-11-16T13:05:09Z

All of the bits depend on the reconstruction measures. If you change something in OrthogonalRegression, the reconstruction measure changes as well. The PR is code plus examples of how it is intended to be used. I think splitting it up would create more complexity, because there is a logical dependency between the pieces.
I would rebase the code into individual commits, which should already simplify the PR review.

.github/workflows/tests.yml

codecov-io · 2020-11-18T11:57:11Z

Codecov Report

Merging #9 (6b620b8) into main (9a20760) will decrease coverage by 23.32%.
The diff coverage is 76.59%.

@@             Coverage Diff              @@
##              main       #9       +/-   ##
============================================
- Coverage   100.00%   76.67%   -23.33%     
============================================
  Files            1       12       +11     
  Lines            1      283      +282     
  Branches         0       35       +35     
============================================
+ Hits             1      217      +216     
- Misses           0       57       +57     
- Partials         0        9        +9

Impacted Files	Coverage Δ
skcosmo/preprocessing/scalers.py	`28.57% <28.57%> (ø)`
skcosmo/metrics/_reconstruction_measures.py	`94.20% <94.20%> (ø)`
skcosmo/linear_model/_ridge.py	`96.15% <96.15%> (ø)`
skcosmo/__init__.py	`100.00% <100.00%> (ø)`
skcosmo/datasets/__init__.py	`100.00% <100.00%> (ø)`
skcosmo/datasets/_base.py	`100.00% <100.00%> (ø)`
skcosmo/linear_model/__init__.py	`100.00% <100.00%> (ø)`
skcosmo/linear_model/_base.py	`100.00% <100.00%> (ø)`
skcosmo/metrics/__init__.py	`100.00% <100.00%> (ø)`
skcosmo/model_selection/__init__.py	`100.00% <100.00%> (ø)`
... and 13 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

docs/source/conf.py

examples/linear_model/plot_orthogonal_regression_nonanalytic_behavior.py

docs/source/reference.rst

examples/metrics/plot_gfrm.py

examples/metrics/plot_lfre.py

pyproject.toml

setup.py

agoscinski · 2020-12-07T19:22:53Z

The changes starting from commit b7b6c80 ("RidgeRegression2FoldCV added alpha type reg method") are new. Everything before that is just a rebase. The commit messages are relatively verbose about the additional changes.

EDIT: the new changes actually start one commit before the one mentioned, at 788c769 ("replace setUp to setUpClass to reduce test time").

Luthaf

The code looks good overall, I have a few high level questions/remarks!

pyproject.toml

skcosmo/linear_model/_ridge.py

skcosmo/metrics/__init__.py

skcosmo/linear_model/_ridge.py

Luthaf · 2020-12-09T09:43:58Z

This looks good to me overall. However, given the size of the PR, I would appreciate it if someone else could have a look as well!

Also, did you complete the last point in your first comment?

> check if reconstruction measures give similar results as in the paper

Finally, the git history is a bit all over the place, could you squash related commits together?

agoscinski · 2020-12-09T10:19:49Z

Okay, let me squash some things and check the last point (I haven't done it yet); then @rosecers can maybe have a second look at it?

agoscinski · 2020-12-10T10:25:41Z

So I was able to get very similar results to Figure 3 in the paper. I guess the differences are due to the slightly different regularization parameters (alphas) used.
https://user-images.githubusercontent.com/2772557/101759095-7595d500-3ad9-11eb-904f-c5a0c1d953ad.png
https://user-images.githubusercontent.com/2772557/101759097-762e6b80-3ad9-11eb-8e54-d04272dec8ee.png
https://user-images.githubusercontent.com/2772557/101759098-762e6b80-3ad9-11eb-948c-1bea08b98681.png
https://user-images.githubusercontent.com/2772557/101759101-76c70200-3ad9-11eb-90b8-e87b39441244.png
https://user-images.githubusercontent.com/2772557/101759103-76c70200-3ad9-11eb-94ea-ebd5c170637c.png
https://user-images.githubusercontent.com/2772557/101759106-775f9880-3ad9-11eb-96ff-e991162bdb76.png
And the degenerate manifold from Fig. 8
https://user-images.githubusercontent.com/2772557/101759566-0cfb2800-3ada-11eb-8841-c2b55ab71316.png

I would rebase it again at the end, once I have the final go-ahead to merge, since it might still be useful to see the descriptive commit messages during the review.

tests/test_metrics.py

tests/test_model_selection.py

agoscinski

I resolved the comments because they were repetitive and I wanted to reduce the discussion to a single comment.

skcosmo/linear_model/_base.py

agoscinski · 2020-12-16T17:35:00Z

tests/test_metrics.py

+            ortho_group.rvs(cls.features_small.shape[1])
+        )
+
+    def test_global_reconstruction_error_identity(self):


This is representative of all review comments about adding docs to the two-liner tests:

But you understood this from the code instantly; the code is self-explanatory. I don't think one should be forced to add docs to such simple code, especially since it is just repetitive. I will add the suggestions you sent, because you already put in the work of writing them. But I don't think one should be required to add docs to such simple code.

tests/test_metrics.py

rosecers · 2020-12-18T13:01:07Z

> I resolved the comments because they were repetitive and I wanted to reduce the discussion to a single comment.

If you're planning to commit the suggestions, you can click "commit" and it will apply the suggestion and resolve the issue. I'd prefer to leave them unresolved until this happens, so the patches do not get lost in the fray.

skcosmo/model_selection/_split.py

examples/linear_model/plot_orthogonal_regression_nonanalytic_behavior.ipynb

rosecers

Hey @agoscinski, it looks good overall, although there are outstanding comments and questions that need to be addressed before rebasing/merging.

rosecers

Please rebase the commits to get a clean commit history then feel free to merge.

the bash for loop did not include notebooks in subdirectories. the simplest way to solve this seemed to be using find. when copying the example notebooks to ./tox/examples no subdirectories are created.

allows the same test function to be executed for multiple inputs
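One way to run a single test function over several inputs is shown below as a minimal sketch. It uses unittest's built-in subTest, which is a swapped-in, standard-library technique and not necessarily the mechanism this commit introduced; the helper is_symmetric and the test class are hypothetical.

```python
import unittest


def is_symmetric(matrix):
    """Check whether a square nested-list matrix equals its transpose."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


class TestSymmetry(unittest.TestCase):
    def test_symmetric_inputs(self):
        cases = {
            "identity": [[1, 0], [0, 1]],
            "full": [[2, 3], [3, 2]],
        }
        for name, matrix in cases.items():
            # Each input is reported separately if its assertion fails.
            with self.subTest(case=name):
                self.assertTrue(is_symmetric(matrix))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSymmetry)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```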

the initialization has to be done only once for all tests of a class

which can additionally return overlapping train and test splits
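A minimal sketch of what overlapping train and test splits could look like at the index level is given below. The function split_indices, its signature, and the overlap flag are illustrative assumptions and do not reproduce the splitter added in this PR.

```python
import numpy as np


def split_indices(n_samples, train_size, test_size, overlap=False, seed=0):
    """Return (train, test) index arrays; name and signature are illustrative."""
    rng = np.random.default_rng(seed)
    if overlap:
        # Draw both sets independently, so a sample may appear in both.
        train = rng.choice(n_samples, size=train_size, replace=False)
        test = rng.choice(n_samples, size=test_size, replace=False)
    else:
        # Standard behaviour: one permutation, cut into disjoint pieces.
        if train_size + test_size > n_samples:
            raise ValueError("disjoint splits need train_size + test_size <= n_samples")
        permutation = rng.permutation(n_samples)
        train = permutation[:train_size]
        test = permutation[train_size:train_size + test_size]
    return train, test


train, test = split_indices(10, train_size=8, test_size=8, overlap=True)
print(np.intersect1d(train, test).size)
```

With overlap allowed, the two sizes no longer have to sum to at most the number of samples, which is useful when a measure should be evaluated on the same points it was fitted on.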

it is linear regression with the additional constraint that the weight matrix has to be an orthogonal matrix/projector
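The constraint described above corresponds to the orthogonal Procrustes problem, which has a closed-form solution via the singular value decomposition. The sketch below covers only the case where both feature matrices have the same number of columns; the function name orthogonal_regression is a placeholder and this is not the class added in this PR.

```python
import numpy as np


def orthogonal_regression(X, Y):
    """Return the orthogonal matrix W minimising ||X W - Y||_F (Procrustes solution)."""
    # SVD of the cross term X^T Y; W = U V^T is the maximiser of trace(W^T X^T Y).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
# Build a random orthogonal matrix from a QR decomposition and rotate X with it.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ Q

W = orthogonal_regression(X, Y)
print(np.allclose(W.T @ W, np.eye(4)), np.allclose(W, Q))
```

Because W can only rotate or reflect, it cannot rescale features, which is what distinguishes this fit from an unconstrained linear regression.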

it is a ridge regression using a 2-fold cross-validation scheme that is, in certain cases, more efficient/accurate than sklearn.linear_model.RidgeCV
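The idea of selecting the regularization strength with a 2-fold cross-validation can be sketched naively as below. This is an assumption-laden illustration: the function names are hypothetical, the ridge problem is solved with the plain closed form for every alpha, and none of the efficiency tricks or regularization options of the class in this PR are reproduced.

```python
import numpy as np


def ridge_weights(X, Y, alpha):
    """Closed-form ridge solution (X^T X + alpha I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def ridge_2fold_cv(X, Y, alphas, seed=0):
    """Pick alpha by 2-fold cross-validation, then refit on all samples."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(X.shape[0])
    half = X.shape[0] // 2
    folds = [permutation[:half], permutation[half:]]

    errors = []
    for alpha in alphas:
        error = 0.0
        # Each half serves once as training set and once as validation set.
        for train, valid in (folds, folds[::-1]):
            W = ridge_weights(X[train], Y[train], alpha)
            error += np.sum((X[valid] @ W - Y[valid]) ** 2)
        errors.append(error)

    best_alpha = alphas[int(np.argmin(errors))]
    return best_alpha, ridge_weights(X, Y, best_alpha)


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
W_true = rng.normal(size=(3, 2))
Y = X @ W_true + 0.01 * rng.normal(size=(40, 2))

alphas = [1e-6, 1e-3, 1.0, 1e3]
best_alpha, W = ridge_2fold_cv(X, Y, alphas)
print(best_alpha, np.abs(W - W_true).max())
```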

it includes reconstruction measures such as GFRE, GFRD and LFRE
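To make the first of these measures concrete, the sketch below computes a global reconstruction error under a stated assumption: that it is the root-mean-square error of linearly reconstructing one standardized feature set from another, with the linear map fitted on training rows and evaluated on test rows. The function names are hypothetical, the standardization is applied to the full dataset for brevity, and the details may differ from both the paper and the code in this PR.

```python
import numpy as np


def standardize(X):
    """Center the columns and scale so the total variance of the matrix is one."""
    X = X - X.mean(axis=0)
    return X / np.sqrt(np.sum(X ** 2) / X.shape[0])


def global_reconstruction_error(X, X_target, train, test):
    """RMS error of linearly reconstructing X_target from X on the test rows."""
    X, X_target = standardize(X), standardize(X_target)
    # Least-squares linear map fitted on the training rows only.
    P, *_ = np.linalg.lstsq(X[train], X_target[train], rcond=None)
    residual = X_target[test] - X[test] @ P
    return np.sqrt(np.sum(residual ** 2) / len(test))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced = X[:, :2]  # drops three of the five features
train, test = np.arange(0, 50), np.arange(50, 100)

error_full_to_reduced = global_reconstruction_error(X, X_reduced, train, test)
error_reduced_to_full = global_reconstruction_error(X_reduced, X, train, test)
print(error_full_to_reduced)  # close to zero: nothing is lost
print(error_reduced_to_full)  # clearly nonzero: three features cannot be recovered
```

The asymmetry between the two directions is the point of the measure: a small error one way and a large error the other way indicates that one feature set contains information the other lacks.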

modules: model_selection, metrics and linear_model

agoscinski requested a review from Luthaf November 10, 2020 08:36

agoscinski self-assigned this Nov 10, 2020

Luthaf reviewed Nov 10, 2020

setup.cfg (outdated, resolved)

skcosmo/metrics/reconstruction_measures.py (outdated, resolved)

agoscinski force-pushed the feat/feature-space-measures branch from 4310f3f to 4c997a8 Compare November 16, 2020 08:16

rosecers requested changes Nov 16, 2020

skcosmo/metrics/_reconstruction_measures.py (outdated, resolved)

skcosmo/metrics/_reconstruction_measures.py (outdated, resolved)

rosecers reviewed Nov 17, 2020

.github/workflows/tests.yml (outdated, resolved)

Luthaf mentioned this pull request Nov 18, 2020

Set the setuptools package version without importing skcosmo #16

Merged

rosecers reviewed Nov 20, 2020

Luthaf mentioned this pull request Nov 23, 2020

Ensure dataset are packaged with the code #24

Merged

rosecers mentioned this pull request Dec 2, 2020

Adding PCovR #31

Merged

agoscinski force-pushed the feat/feature-space-measures branch 2 times, most recently from 00699d8 to de78bcd Compare December 7, 2020 19:21

agoscinski marked this pull request as ready for review December 7, 2020 19:26

agoscinski requested a review from Luthaf December 7, 2020 19:26

Luthaf reviewed Dec 8, 2020

pyproject.toml (resolved)

skcosmo/linear_model/_ridge.py (outdated, resolved)

skcosmo/metrics/__init__.py (resolved)

Luthaf reviewed Dec 8, 2020

skcosmo/linear_model/_ridge.py (outdated, resolved)