
MPCA pipeline #91

Merged: 14 commits merged into master from the MPCA-Pipeline branch on Apr 11, 2021

Conversation

@shuo-zhou (Member) commented Apr 10, 2021

PR for the MPCA pipeline card.

Description

Pipeline: MPCA -> feature selection by Fisher score -> SVM/logistic regression.
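A minimal usage sketch of the proposed pipeline (the MPCATrainer class name and constructor signature are assumed here for illustration; the module path follows kale/pipeline/mpca_trainer.py):

import numpy as np
from kale.pipeline.mpca_trainer import MPCATrainer  # assumed class name

x = np.random.random((20, 20, 25, 20))  # e.g. 20 small 3-D images
y = np.random.randint(0, 2, 20)

trainer = MPCATrainer(classifier="svc", classifier_params="auto")  # assumed signature
trainer.fit(x, y)
y_pred = trainer.predict(x)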

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • In-line docstrings updated and documentation docs updated.

@shuo-zhou added the enhancement (Improvement of existing code) and tests (Tests and coverage) labels on Apr 10, 2021
@shuo-zhou added this to In progress in v0.1.0 via automation on Apr 10, 2021
@codecov-io commented Apr 10, 2021

Codecov Report

Merging #91 (9c6385d) into master (8d30ff8) will increase coverage by 1.93%.
The diff coverage is 91.04%.


@@            Coverage Diff            @@
##           master     #91      +/-   ##
=========================================
+ Coverage    4.47%   6.41%   +1.93%     
=========================================
  Files          36      37       +1     
  Lines        2927    2994      +67     
=========================================
+ Hits          131     192      +61     
- Misses       2796    2802       +6     
Impacted Files Coverage Δ
kale/pipeline/mpca_trainer.py 91.04% <91.04%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d30ff8...9c6385d.

@haipinglu changed the title from "Mpca pipeline" to "MPCA pipeline" on Apr 10, 2021
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
Member:

Have you considered LinearSVC (liblinear)? Or, can we make it an option (with one as the default)? SVC uses libsvm while LinearSVC uses liblinear.

I noticed this when checking https://scikit-learn.org/stable/modules/feature_selection.html and found that the examples there use LinearSVC rather than SVC.

Did you check the Matlab SVM code I sent you to see which version was used?

Member Author:

The SVC object has a predict_proba method, which can give the probability of each class, while LinearSVC does not.

Member:

If LinearSVC gives better accuracy and the user does not care about the probability, then it is still a decent choice. How often have we reported the probability of each class in our papers so far?

I understand probability is a nice feature, but I do not consider it essential or a must-have.

Member Author:

Probability is a feature requested by Cameron. How about making it optional, supporting both LinearSVC and SVC(kernel="linear")?

Member:

A feature requested by one user does not mean we should enforce it for all users; otherwise, sklearn would not offer LinearSVC. It is good to have options.
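A minimal sketch of how this option could be exposed (the helper name and option keys are illustrative, not the merged API): SVC(kernel="linear", probability=True) keeps predict_proba for users who need class probabilities, while LinearSVC (liblinear) is available for those who do not.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

def build_classifier(name="svc"):
    # "svc" keeps probability estimates; "linear_svc" trades them for liblinear speed.
    options = {
        "svc": SVC(kernel="linear", probability=True),
        "linear_svc": LinearSVC(),
        "lr": LogisticRegression(),
    }
    return options[name]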


from ..embed.mpca import MPCA

classifiers = {"svc": [SVC, {"kernel": ["linear"], "C": np.logspace(-3, 2, 6)}],
Member:

Have you checked the Matlab code options?
Does this cover the options used in our CMR Matlab code? It seems that we used the default there, which seems to be 1/n (an adaptive value) according to https://uk.mathworks.com/help/stats/fitclinear.html#d123e312449
At least this value of C should be covered.

Member Author:

There is a close value in np.logspace(-3, 2, 6), which gives the list [0.001, 0.01, 0.1, 1, 10, 100]. The optimal value of C will be determined by grid search if classifier_params="auto". I will add 1/n to this list if necessary.

Member:

@sz144 1/n is an interesting and smart choice because when n --> infinity, 1/n --> 0 and no regularisation is needed.
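As a sketch, adding the adaptive value to the grid could look like this (n_samples is an illustrative name for the number of training samples):

import numpy as np

n_samples = 100  # illustrative
c_grid = np.append(np.logspace(-3, 2, 6), 1.0 / n_samples)  # [0.001, 0.01, 0.1, 1, 10, 100, 1/n]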

from ..embed.mpca import MPCA

classifiers = {"svc": [SVC, {"kernel": ["linear"], "C": np.logspace(-3, 2, 6)}],
"lr": [LogisticRegression, {"C": np.logspace(-3, 2, 6)}]}
Member:

Again, consider defining the repeated value np.logspace(-3, 2, 6) as a (global) variable.
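For example, a sketch mirroring the existing dictionary:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

C_GRID = np.logspace(-3, 2, 6)  # defined once at module level
classifiers = {"svc": [SVC, {"kernel": ["linear"], "C": C_GRID}],
               "lr": [LogisticRegression, {"C": C_GRID}]}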

classifiers = {"svc": [SVC, {"kernel": ["linear"], "C": np.logspace(-3, 2, 6)}],
"lr": [LogisticRegression, {"C": np.logspace(-3, 2, 6)}]}

default_search_params = {'cv': 5}
Member:

We did not do CV in Matlab, see https://uk.mathworks.com/help/stats/fitclinear.html#d123e314573
Is it an option here or a must? It may not be necessary, considering how Matlab deals with it (and we get better results there). CV on a small sample may overfit as well.

Member Author:

CV is used to determine the value of C only if classifier_params is set to "auto".

Member:

OK. Some light documentation/comments in the tests will be helpful for review / reading / future changes.
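A minimal sketch of the behaviour described above, with assumed variable names (grid search with cv=5 is only built when classifier_params == "auto"; otherwise the supplied parameters are used directly):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

classifier_params = "auto"  # or a dict such as {"C": 1.0}
if classifier_params == "auto":
    # CV is used only here, to pick C by grid search.
    clf = GridSearchCV(SVC(kernel="linear"),
                       param_grid={"C": np.logspace(-3, 2, 6)}, cv=5)
else:
    clf = SVC(kernel="linear", **classifier_params)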

classifier (str, optional): Classifier for training. Options: support vector machine (svc) or
logistic regression (lr). Defaults to 'svc'.
classifier_params (dict, optional): Parameters of classifier. Defaults to 'auto'.
mpca_params (dict, optional): Parameters of Multi-linear PCA. Defaults to None.
Member:

Why Multi-linear here? Be consistent.

Member Author:

Will change

self.auto_classifier_param = True
clf_param_gird = classifiers[classifier][1]
self.grid_search = GridSearchCV(classifiers[classifier][0](),
param_grid=clf_param_gird,
Member:

typo: gird

else:
f_score, p_val = f_classif(x_proj, y)
self.feature_order = (-1 * f_score).argsort()
x_proj = x_proj[:, self.feature_order][:, :self.n_features]
Member:

Would a new name be better for the selected features from x_proj?

Member Author:

Is x_train better?

Member:

Who can tell the difference? Isn't x also x_train?
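For reference, a self-contained sketch of the Fisher-score selection step under discussion (x_selected is just an illustrative name for the reduced matrix, to distinguish it from x_proj):

import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
x_proj = rng.normal(size=(20, 10))   # stand-in for the MPCA-projected features
y = rng.integers(0, 2, size=20)
n_features = 5

# Rank features by ANOVA F-value (Fisher-style score) and keep the top n_features.
f_score, _ = f_classif(x_proj, y)
feature_order = (-f_score).argsort()
x_selected = x_proj[:, feature_order[:n_features]]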

check_is_fitted(self.clf)

x_proj = self.mpca.transform(x)
x_new = x_proj[:, self.feature_order][:, :self.n_features]
Member:

Use a consistent naming convention: here it is x_new, but x_proj was reused above.

trainer.fit(x, y)
y_pred = trainer.predict(x)
testing.assert_equal(np.unique(y), np.unique(y_pred))
assert accuracy_score(y, y_pred) >= 0.8
Member:

Expected training error (to be >0.8)?

Member Author:

Training accuracy > 0.8, i.e. training error < 0.2.

@shuo-zhou (Member Author) left a comment:

Thanks for the comments.


assert accuracy_score(y, y_pred) >= 0.8

if classifier == "linear_svc":
with pytest.raises(Exception):
Member:

Very glad to learn this from you. Thanks.
This may be another important piece of documentation for developers to refer to (besides fixtures): https://docs.pytest.org/en/stable/assert.html#assertions-about-expected-exceptions
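For reference, the idiom from the linked pytest docs in its simplest form:

import pytest

def test_expected_exception():
    # The test passes only if the block raises the named exception.
    with pytest.raises(ZeroDivisionError):
        1 / 0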

@haipinglu (Member) commented Apr 10, 2021

@sz144 In-line docstrings seem fine, but please have the documentation in docs updated before merging. Otherwise, this new API won't appear in the docs. Thanks.
This is actually the last checkbox in the description, which you left unchecked/ignored. You should complete and tick the applicable checkboxes when done; they are a checklist for you to review.

@shuo-zhou (Member Author) replied:

Thank you for pointing this out. The last checkbox is now checked. Is there anything else I need to do before merging?

@haipinglu (Member) commented Apr 11, 2021

@sz144 Come on. Have you done the docs update? I made the text "have documentation in docs updated" bold. How can you tick something that you have not done yet? Could you check the docs yourself before asking me to check?


Args:
classifier (str, optional): Classifier for training. Options: support vector machine (svc) or
logistic regression (lr). Defaults to 'svc'.
Member:

Outdated docstring. There are now three options, and you need to explain them a bit so users can learn the difference from the docs rather than having to read the code.
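A possible revision, as a sketch (exact wording is up to the author; the option names follow the test code: svc, linear_svc, lr):

classifier (str, optional): Classifier for training. Options: "svc" (sklearn.svm.SVC with a linear
    kernel; supports predict_proba), "linear_svc" (sklearn.svm.LinearSVC, liblinear-based, no
    probability estimates), or "lr" (logistic regression). Defaults to "svc".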

# Haiping Lu, h.lu@sheffield.ac.uk or hplu@ieee.org
# =============================================================================

"""Implementation of MPCA->Feature Selection->Linear SVM/LogisticRegression Pipeline
Member:

Add references to papers using this pipeline:
For Cardiac MRI: https://doi.org/10.1093/ehjci/jeaa001
For brain fMRI: https://doi.org/10.1007/978-3-319-24553-9_75
For gait videos (though KNN rather than SVM/LR was used): https://doi.org/10.1109/TNN.2007.901277

classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention
(pp. 613-620). Springer, Cham.
[3] Lu, H., Plataniotis, K. N., & Venetsanopoulos, A. N. (2008). MPCA: Multilinear principal component analysis of
tensor objects. IEEE transactions on Neural Networks, 19(1), 18-39.
Member:

"transactions" needs to be capitalized.

@haipinglu merged commit 16cb102 into master on Apr 11, 2021
v0.1.0 automation moved this from In progress to Done on Apr 11, 2021
@haipinglu deleted the MPCA-Pipeline branch on April 11, 2021 09:56
Labels: enhancement (Improvement of existing code), tests (Tests and coverage)
Projects: v0.1.0 (Done)
Linked issues: None yet
3 participants