Suggestion: Linear Subspace Classifier #14139

Open
agamemnonc opened this issue Jun 21, 2019 · 5 comments
Labels: help wanted, module:linear_model, Needs Benchmarks, New Feature

Comments

@agamemnonc
Contributor

agamemnonc commented Jun 21, 2019

Do folks believe that the following classifier is worth implementing?

Naseem, I., Togneri, R. and Bennamoun, M., 2010. Linear regression for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11), pp. 2106-2112.

Link to paper (~1K citations).

This is a simple classifier that looks at the distance of a test point to the linear subspace spanned by the training examples of each class, and predicts the class whose subspace is closest. From my understanding, it only works when the number of features is larger than the number of samples in every class; otherwise the per-class linear kernel (Gram) matrix is singular and therefore not invertible. Posterior probability estimates are not natively supported.
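
For concreteness, a rough numpy sketch of the idea (my own illustration, not code from the paper):

import numpy as np

# Distance of a test point x to the subspace spanned by the rows of a
# class training matrix Xc (random toy data: 10 samples, 50 features).
rng = np.random.RandomState(0)
Xc = rng.randn(10, 50)
x = rng.randn(50)
# Hat matrix H = Xc^T (Xc Xc^T)^{-1} Xc projects onto that subspace.
H = Xc.T @ np.linalg.inv(Xc @ Xc.T) @ Xc
dist = np.linalg.norm(x - H @ x)  # residual distance; predict the argmin over classes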

In the case of high input dimensionality with few samples, however, it might be useful. From a quick stab at implementing the classifier and playing around with toy datasets where n_features > n_samples, it seems to do a much better job than e.g. LogisticRegression or LinearDiscriminantAnalysis, which suffer from variable collinearity.

If the core team thinks this is a feature worth implementing, I am happy to submit a PR.

@agramfort
Member

agramfort commented Jun 22, 2019 via email

@agamemnonc
Contributor Author

Sure.

I am also copying in the implementation so that people can run the benchmark themselves.

If we decide to include the feature, I will of course properly submit the code via a PR.

import warnings
warnings.filterwarnings("ignore")

import numpy as np

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import check_X_y
from sklearn.utils.multiclass import check_classification_targets

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import (LogisticRegression, SGDClassifier,
                                  RidgeClassifier)
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier

class LinearSubspaceClassifier(BaseEstimator, ClassifierMixin):
    """Linear Subspace Classifier

    The classifier works by comparing the distance of a test point to the
    subspace spanned by the collection of training points in each class. It
    computes the distance for each class and picks the class with minimum
    distance. Prediction probabilities are not natively supported.

    Attributes
    ----------
    hat_ : list of array-like, shape = [n_features, n_features]
        Hat matrix for each class.

    References
    ----------
    .. [1] I. Naseem, R. Togneri, M. Bennamoun.
           "Linear regression for face recognition." IEEE Transactions on
           Pattern Analysis and Machine Intelligence, 32(11), 2106-2112, 2010.
    """

    def __init__(self):
        super(LinearSubspaceClassifier, self).__init__()

    def fit(self, X, y):
        """Fit the model according to the given training data and parameters.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.

        y : array, shape = [n_samples]
            Target values.
        """
        X, y = check_X_y(X, y)
        check_classification_targets(y)
        self.classes_, y = np.unique(y, return_inverse=True)
        n_samples, n_features = X.shape
        n_classes = len(self.classes_)

        _, counts = np.unique(y, return_counts=True)
        if np.any(counts > n_features):
            warnings.warn("Found classes with more samples than input "
                          "features; their Gram matrices are singular. "
                          "Results may be unstable.")

        self.hat_ = []

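        # One "hat" (projection) matrix per class:
        # H_c = Xc^T (Xc Xc^T)^{-1} Xc projects onto the subspace spanned
        # by the class-c training samples.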
        for ind in range(n_classes):
            Xg = X[y == ind, :]
            Gg = np.dot(Xg, Xg.T)
            self.hat_.append(np.dot(np.dot(Xg.T, np.linalg.inv(Gg)), Xg))

        return self

    def decision_function(self, X):
        """Apply decision function to an array of samples.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Array of samples (test vectors).

        Returns
        -------
        C : array, shape = [n_samples, n_classes]
            Distance of each sample to the subspace of each class. Note
            that, unlike the common scikit-learn convention, smaller
            values indicate a better match.
        """
        n_samples, n_features = X.shape
        n_classes = len(self.classes_)
        dec_func = np.zeros((n_samples, n_classes))
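        # Norm of the residual X (I - H_c): how far each sample lies from
        # the class-c subspace.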
        for ind in range(n_classes):
            dec_func[:, ind] = np.linalg.norm(
                np.dot(X, np.eye(n_features) - self.hat_[ind]),
                axis=1)

        return dec_func

    def predict(self, X):
        """Perform classification on an array of test vectors X.

        The predicted class C for each sample in X is returned.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Array of samples (test vectors).

        Returns
        -------
        C : array, shape = [n_samples,]
        """
        D = self.decision_function(X)
        # Smallest subspace distance wins; map indices back to class labels.
        return self.classes_[np.argmin(D, axis=1)]

    
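# Benchmark on a toy dataset with n_features (350) > n_samples (150), the
# regime discussed above where the per-class Gram matrices stay invertible.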
X, y = make_classification(n_samples=150, n_features=350, n_classes=5,
                           n_redundant=0, n_informative=350,
                           n_clusters_per_class=1, random_state=10)
estimators = [
    LinearDiscriminantAnalysis(),
    LinearSubspaceClassifier(),
    SGDClassifier(),
    LogisticRegression(multi_class='auto', solver='liblinear'),
    RandomForestClassifier(),
    RidgeClassifier(),
    GaussianProcessClassifier()
]

print("Cross-validated (3-fold) accuracy score")
print("---------------------------------------")
for estimator in estimators:
    print(estimator.__class__.__name__, "{:.2f}".format(
        np.mean(np.asarray(cross_val_score(
            X=X, y=y, estimator=estimator, cv=3)))
    ))

Output

Cross-validated (3-fold) accuracy score
---------------------------------------
LinearDiscriminantAnalysis 0.22
LinearSubspaceClassifier 0.83
SGDClassifier 0.33
LogisticRegression 0.39
RandomForestClassifier 0.26
RidgeClassifier 0.41
GaussianProcessClassifier 0.19

@agramfort
Member

I played with it quickly and it does indeed seem surprisingly efficient in prediction. I see some scalability issues in the implementation that could maybe be improved.
To me, if it outperforms out-of-the-box LDA and RidgeClassifierCV on a comprehensive list of datasets, that would be even more convincing.

@agamemnonc
Contributor Author

I was kind of surprised myself too!

Regarding scalability, as I mentioned above, I think the algorithm only works when n_features > n_samples; otherwise the kernel matrix is rank-deficient and therefore non-invertible. In that regime, though, it seems it can outperform other classifiers, although I have only tested on artificial datasets. Would the fact that it only works in this special case be a blocker for including it in sklearn? Or could it perhaps be included with an appropriate warning in the documentation?

Apologies, I didn't quite understand your comment on LDA and RidgeClassifierCV, could you please clarify? And do you have any specific datasets in mind? Again, if we decide to include it, I think we should benchmark on datasets where n_features > n_samples.
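
If the singularity in the n_samples > n_features case turns out to be the main blocker, a pseudoinverse-based variant might be worth benchmarking. A minimal sketch (my own suggestion, untested beyond toy data; class_hat_matrix is a hypothetical helper, not part of the implementation above):

import numpy as np

def class_hat_matrix(Xg):
    # Moore-Penrose pseudoinverse: identical to the inverse-based hat
    # matrix when the Gram matrix is full rank, and still defined when a
    # class has more samples than features (where np.linalg.inv would
    # raise or be numerically unstable).
    Gg = np.dot(Xg, Xg.T)
    return np.dot(np.dot(Xg.T, np.linalg.pinv(Gg)), Xg)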

@agramfort
Member

agramfort commented Jun 30, 2019 via email

@cmarmo added the module:linear_model, Needs Benchmarks, and help wanted labels on Mar 10, 2022