MPCA pipeline #91
Conversation
Codecov Report
@@ Coverage Diff @@
## master #91 +/- ##
=========================================
+ Coverage 4.47% 6.41% +1.93%
=========================================
Files 36 37 +1
Lines 2927 2994 +67
=========================================
+ Hits 131 192 +61
- Misses 2796 2802 +6
Continue to review full report at Codecov.
kale/pipeline/mpca_trainer.py
Outdated
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
Have you considered LinearSVC (liblinear)? Or can we make it an option (with one as the default)? SVC uses libsvm while LinearSVC uses liblinear. I noticed this when checking https://scikit-learn.org/stable/modules/feature_selection.html and found that the examples there use LinearSVC rather than SVC.
Did you check the Matlab SVM I sent you to see which version was used?
The SVC object has a predict_proba function, which gives the probability of each class, while LinearSVC does not.
If LinearSVC gives better accuracy and the user does not care about the probability, it is still a decent choice. How often have we reported the probability of each class in our papers so far?
I understand probability is good to have, but I do not consider it essential or a must-have.
Probability is a feature requested by Cameron. How about making it optional (LinearSVC and SVC(kernel="linear"))?
A feature requested by one user does not mean it should be enforced for all users; otherwise, sklearn would not have LinearSVC. It is good to have options.
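The trade-off discussed above can be sketched as follows; this is a minimal illustration on toy data (not the PR's actual MPCA features), showing that SVC with probability=True exposes predict_proba while LinearSVC only offers decision_function:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Toy binary-classification data for illustration only.
x, y = make_classification(n_samples=100, n_features=10, random_state=0)

# SVC with a linear kernel (libsvm): supports predict_proba when
# probability=True, at the cost of an internal cross-validation step.
svc = SVC(kernel="linear", probability=True, random_state=0).fit(x, y)
proba = svc.predict_proba(x)  # shape (n_samples, 2); each row sums to 1

# LinearSVC (liblinear): usually faster on large samples, but exposes
# only decision_function, not predict_proba.
lin = LinearSVC(random_state=0, max_iter=10000).fit(x, y)
scores = lin.decision_function(x)  # signed distances, not probabilities
```

Making both available as options, as suggested, lets users pick probabilities (SVC) or speed (LinearSVC) per task.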
kale/pipeline/mpca_trainer.py
Outdated
from ..embed.mpca import MPCA

classifiers = {"svc": [SVC, {"kernel": ["linear"], "C": np.logspace(-3, 2, 6)}],
Have you checked the Matlab code options? Does this cover the options used in our CMR Matlab code? It seems we used the default there, which is 1/n (an adaptive value) according to https://uk.mathworks.com/help/stats/fitclinear.html#d123e312449
At least this value of C should be covered.
There is a close value in np.logspace(-3, 2, 6), which is the list [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]. The optimal value of C will be determined by grid search if classifier_params="auto". I will add 1/n to this list if necessary.
@sz144 1/n is an interesting and smart choice because as n → ∞, 1/n → 0 and no regularisation is needed.
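Adding 1/n alongside the logspace grid, as discussed, could look like the sketch below; the variable names (c_grid, search) are illustrative, not the PR's actual code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for the MPCA-projected features.
x, y = make_classification(n_samples=60, n_features=8, random_state=0)

# Candidate C values: the logspace grid plus the adaptive 1/n value
# that mirrors the Matlab fitclinear default discussed above.
c_grid = np.logspace(-3, 2, 6).tolist() + [1.0 / x.shape[0]]

# Grid search picks the best C by cross-validation.
search = GridSearchCV(SVC(kernel="linear"), param_grid={"C": c_grid}, cv=5)
search.fit(x, y)
best_c = search.best_params_["C"]
```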
kale/pipeline/mpca_trainer.py
Outdated
from ..embed.mpca import MPCA

classifiers = {"svc": [SVC, {"kernel": ["linear"], "C": np.logspace(-3, 2, 6)}],
               "lr": [LogisticRegression, {"C": np.logspace(-3, 2, 6)}]}
Again, consider defining the repeated value np.logspace(-3, 2, 6) as a (global) variable.
classifiers = {"svc": [SVC, {"kernel": ["linear"], "C": np.logspace(-3, 2, 6)}],
               "lr": [LogisticRegression, {"C": np.logspace(-3, 2, 6)}]}

default_search_params = {'cv': 5}
We did not do CV in Matlab, see https://uk.mathworks.com/help/stats/fitclinear.html#d123e314573
Is it an option here or a must? It may not be necessary, considering how Matlab deals with it (and we get better results). CV on small samples may overfit as well.
CV is used to determine the value of C only if classifier_params is set to "auto".
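The branching described here can be sketched as below; build_classifier is a made-up name for illustration, not the actual kale API, and grid-search CV runs only when automatic tuning is requested:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

C_GRID = np.logspace(-3, 2, 6)

def build_classifier(classifier_params="auto", cv=5):
    # "auto": wrap the estimator in GridSearchCV so C is tuned by CV.
    if classifier_params == "auto":
        return GridSearchCV(SVC(kernel="linear"),
                            param_grid={"C": C_GRID}, cv=cv)
    # Otherwise use the user-supplied parameters directly, with no CV.
    return SVC(kernel="linear", **classifier_params)

auto_clf = build_classifier()             # wrapped in GridSearchCV
fixed_clf = build_classifier({"C": 1.0})  # plain SVC, no CV
```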
OK. Some light documentation/comments in the tests will be helpful for review, reading, and future changes.
kale/pipeline/mpca_trainer.py
Outdated
classifier (str, optional): Classifier for training. Options: support vector machine (svc) or
    logistic regression (lr). Defaults to 'svc'.
classifier_params (dict, optional): Parameters of classifier. Defaults to 'auto'.
mpca_params (dict, optional): Parameters of Multi-linear PCA. Defaults to None.
Why Multi-linear here? Be consistent.
Will change
kale/pipeline/mpca_trainer.py
Outdated
self.auto_classifier_param = True
clf_param_gird = classifiers[classifier][1]
self.grid_search = GridSearchCV(classifiers[classifier][0](),
                                param_grid=clf_param_gird,
typo: gird
kale/pipeline/mpca_trainer.py
Outdated
else:
    f_score, p_val = f_classif(x_proj, y)
    self.feature_order = (-1 * f_score).argsort()
    x_proj = x_proj[:, self.feature_order][:, :self.n_features]
Will a new name be better for selected features from x_proj?
Is x_train better?
Who can tell the difference? Isn't x also x_train?
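The naming question above is about the selection step in the diff; a sketch with a distinct name such as x_selected (illustrative, not from the PR) makes the step explicit instead of overwriting x_proj:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Toy data standing in for MPCA-projected features.
x_proj, y = make_classification(n_samples=50, n_features=10,
                                n_informative=3, random_state=0)

# Rank features by descending F-score and keep the top n_features.
n_features = 4
f_score, p_val = f_classif(x_proj, y)
feature_order = (-f_score).argsort()
x_selected = x_proj[:, feature_order][:, :n_features]
```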
kale/pipeline/mpca_trainer.py
Outdated
check_is_fitted(self.clf)

x_proj = self.mpca.transform(x)
x_new = x_proj[:, self.feature_order][:, :self.n_features]
Use a consistent naming convention: here it is x_new, but you reused x_proj above.
trainer.fit(x, y)
y_pred = trainer.predict(x)
testing.assert_equal(np.unique(y), np.unique(y_pred))
assert accuracy_score(y, y_pred) >= 0.8
Expected training error (to be >0.8)?
Training accuracy > 0.8, i.e. training error < 0.2.
Thanks for the comments.
assert accuracy_score(y, y_pred) >= 0.8

if classifier == "linear_svc":
    with pytest.raises(Exception):
Very glad to learn this from you. Thanks.
This may be another useful piece of documentation for developers to refer to (besides fixtures): https://docs.pytest.org/en/stable/assert.html#assertions-about-expected-exceptions
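For reference, the pattern from those pytest docs looks like this in isolation; the function name below is made up for the example:

```python
import pytest

def probability_unsupported():
    # Stands in for calling predict_proba on a classifier that lacks it.
    raise AttributeError("predict_proba is not available for LinearSVC")

def test_expected_exception():
    # The test passes only if the expected exception is actually raised.
    with pytest.raises(AttributeError):
        probability_unsupported()
```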
@sz144 In-line docstrings seem fine but please have documentation in
Thank you for pointing this out. The last checkbox is checked. Is there anything I need to do before merging?
@sz144 Come on. Have you done the docs update? I have made the text have documentation in
kale/pipeline/mpca_trainer.py
Outdated
Args:
    classifier (str, optional): Classifier for training. Options: support vector machine (svc) or
        logistic regression (lr). Defaults to 'svc'.
Outdated docstrings. Now there are three options, and you need to explain them a bit so users can learn the difference from the docs rather than having to read the code.
# Haiping Lu, h.lu@sheffield.ac.uk or hplu@ieee.org
# =============================================================================

"""Implementation of MPCA->Feature Selection->Linear SVM/LogisticRegression Pipeline
Add references to papers using this pipeline:
For Cardiac MRI: https://doi.org/10.1093/ehjci/jeaa001
For brain fMRI: https://doi.org/10.1007/978-3-319-24553-9_75
For gait videos (though KNN rather than SVM/LR was used): https://doi.org/10.1109/TNN.2007.901277
kale/pipeline/mpca_trainer.py
Outdated
classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention
(pp. 613-620). Springer, Cham.
[3] Lu, H., Plataniotis, K. N., & Venetsanopoulos, A. N. (2008). MPCA: Multilinear principal component analysis of
tensor objects. IEEE transactions on Neural Networks, 19(1), 18-39.
"transactions" needs to be capitalized.
PR for the MPCA pipeline card.
Description
Pipeline: MPCA -> Feature selection by Fisher score -> SVM/Logistic Regression.
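The three stages of this pipeline can be sketched end to end as below; plain PCA stands in for kale's MPCA (which operates on tensors), and all names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA  # stand-in for MPCA on vector data
from sklearn.feature_selection import f_classif
from sklearn.svm import SVC

x, y = make_classification(n_samples=80, n_features=20, random_state=0)

# 1. Dimensionality reduction (PCA here; the PR applies MPCA to tensors).
x_proj = PCA(n_components=10, random_state=0).fit_transform(x)

# 2. Feature selection by Fisher score, via the f_classif F-statistic.
f_score, _ = f_classif(x_proj, y)
x_sel = x_proj[:, (-f_score).argsort()][:, :5]

# 3. Linear SVM (LogisticRegression would slot in the same way).
clf = SVC(kernel="linear").fit(x_sel, y)
y_pred = clf.predict(x_sel)
```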
Status
Ready
Types of changes
docs updated.