Suggestion: Linear Subspace Classifier #14139
Comments
Can you share a benchmark?
Sure. I am also copying in the implementation so that people can run the benchmark themselves. If we decide to include the feature, I will of course properly submit the code via a PR.

```python
import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import check_X_y
from sklearn.utils.multiclass import check_classification_targets
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import (LogisticRegression, SGDClassifier,
                                  RidgeClassifier)
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier


class LinearSubspaceClassifier(BaseEstimator, ClassifierMixin):
    """Linear Subspace Classifier.

    The classifier works by comparing the distance of a test point to the
    subspace spanned by the collection of training points in each class. It
    computes the distance for each class and picks the class with minimum
    distance. Prediction probabilities are not natively supported.

    Attributes
    ----------
    hat_ : list of array-like, shape = [n_features, n_features]
        Hat matrix for each class.

    References
    ----------
    .. [1] I. Naseem, R. Togneri, M. Bennamoun.
           "Linear regression for face recognition." IEEE Transactions on
           Pattern Analysis and Machine Intelligence, 32(11), 2106-2112, 2010.
    """

    def __init__(self):
        super(LinearSubspaceClassifier, self).__init__()

    def fit(self, X, y):
        """Fit the model according to the given training data and parameters.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        y : array, shape = [n_samples]
            Target values.
        """
        X, y = check_X_y(X, y)
        check_classification_targets(y)
        self.classes_, y = np.unique(y, return_inverse=True)
        n_samples, n_features = X.shape
        n_classes = len(self.classes_)
        _, counts = np.unique(y, return_counts=True)
        if np.any(counts > n_features):
            warnings.warn("Found some classes with more samples than input "
                          "features. Results may be unstable.")
        self.hat_ = []
        for ind in range(n_classes):
            # Projection (hat) matrix onto the subspace spanned by the
            # training samples of this class: Xg.T (Xg Xg.T)^-1 Xg.
            Xg = X[y == ind, :]
            Gg = np.dot(Xg, Xg.T)
            self.hat_.append(np.dot(np.dot(Xg.T, np.linalg.inv(Gg)), Xg))
        return self

    def decision_function(self, X):
        """Apply the decision function to an array of samples.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Array of samples (test vectors).

        Returns
        -------
        C : array, shape = [n_samples, n_classes]
            Distance of each sample to the subspace of each class. Unlike
            most scikit-learn decision functions, smaller values indicate
            a better fit.
        """
        n_samples, n_features = X.shape
        n_classes = len(self.classes_)
        dec_func = np.zeros((n_samples, n_classes))
        for ind in range(n_classes):
            # Norm of the residual left after projecting each sample onto
            # the class subspace.
            dec_func[:, ind] = np.linalg.norm(
                np.dot(X, np.eye(n_features) - self.hat_[ind]),
                axis=1)
        return dec_func

    def predict(self, X):
        """Perform classification on an array of test vectors X.

        The predicted class C for each sample in X is returned.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Array of samples (test vectors).

        Returns
        -------
        C : array, shape = [n_samples,]
        """
        D = self.decision_function(X)
        # argmin, not argmax: the decision function returns distances.
        # Map the winning indices back to the original class labels.
        return self.classes_[np.argmin(D, axis=1)]


X, y = make_classification(n_samples=150, n_features=350, n_classes=5,
                           n_redundant=0, n_informative=350,
                           n_clusters_per_class=1, random_state=10)

estimators = [
    LinearDiscriminantAnalysis(),
    LinearSubspaceClassifier(),
    SGDClassifier(),
    LogisticRegression(multi_class='auto', solver='liblinear'),
    RandomForestClassifier(),
    RidgeClassifier(),
    GaussianProcessClassifier()
]

print("Cross-validated (3-fold) accuracy score")
print("---------------------------------------")
for estimator in estimators:
    print(estimator.__class__.__name__, "{:.2f}".format(
        np.mean(np.asarray(cross_val_score(
            X=X, y=y, estimator=estimator, cv=3)))
    ))
```
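For a quick standalone sanity check, here is a small usage sketch (my addition, reusing the `X`, `y`, and classifier defined above):

```python
from sklearn.model_selection import train_test_split

# Hold out a test set and score the classifier directly; score() comes
# from ClassifierMixin and reports plain accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LinearSubspaceClassifier().fit(X_train, y_train)
print("held-out accuracy: {:.2f}".format(clf.score(X_test, y_test)))
```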
I played with it quickly, and it does indeed seem surprisingly effective in prediction. I see some scalability issues in the implementation that could maybe be improved.
I was kind of surprised myself too! Regarding scalability, as I mentioned above, I think the algorithm only works when `n_features` exceeds the number of samples in each class. Apologies, I didn't quite understand your comment.
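On the scalability point, one possible improvement (a sketch of mine, not something worked out in this thread): avoid materializing the `n_features x n_features` hat matrix per class. Since `Xg.T @ inv(Xg @ Xg.T) @ Xg` equals `Q @ Q.T` for a reduced QR decomposition `Xg.T = Q R`, it is enough to store the orthonormal basis `Q` of each class subspace and compute residuals directly:

```python
import numpy as np

def fit_bases(X, y):
    # One orthonormal basis per class (assumes integer labels 0..K-1 and
    # that each class has at most n_features linearly independent samples).
    bases = []
    for k in range(y.max() + 1):
        Q, _ = np.linalg.qr(X[y == k].T)  # Q has shape (n_features, n_k)
        bases.append(Q)
    return bases

def subspace_distances(X, bases):
    # Distance of each sample to each class subspace, computed as
    # ||x - Q (Q.T x)|| without forming the full projector.
    dists = np.empty((X.shape[0], len(bases)))
    for k, Q in enumerate(bases):
        residual = X - (X @ Q) @ Q.T
        dists[:, k] = np.linalg.norm(residual, axis=1)
    return dists
```

This drops the per-class memory from O(n_features^2) to O(n_features * n_k) and should also be better conditioned than explicitly inverting the Gram matrix.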
I meant: take some datasets on OpenML and bench against `LinearDiscriminantAnalysis` and `RidgeClassifierCV`, which are linear methods, so you have a fair comparison to start with. I think there must be a reasonable solution to address the case `n_features < n_samples`.
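For what it's worth, a minimal sketch of such a benchmark (the dataset choice here is my own assumption; any high-dimensional OpenML dataset would do):

```python
from sklearn.datasets import fetch_openml
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import cross_val_score

# data_id=1515 should be the high-dimensional "micro-mass" dataset
# (far more features than samples per class); adjust as needed.
X, y = fetch_openml(data_id=1515, return_X_y=True, as_frame=False)

for est in [LinearDiscriminantAnalysis(), RidgeClassifierCV()]:
    scores = cross_val_score(est, X, y, cv=3)
    print(est.__class__.__name__, "{:.2f}".format(scores.mean()))
```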
Do folks believe that the following classifier is worth implementing?
Link to paper (~1K citations).
This is a simple classifier that looks at the distance of a test point to the linear subspace spanned by the training examples in each class, and then predicts the class with the smallest distance. From my understanding, it only works when the number of features is larger than the number of samples in every class; otherwise the computed linear kernel matrix is singular and therefore not invertible. Posterior probability estimates are not natively supported.

In the case of high input dimensionality with few samples, however, it might be useful. From a quick stab I had at implementing the classifier and playing around with toy datasets where `n_features > n_samples`, it seems to do a much better job than e.g. `LogisticRegression` or `LinearDiscriminantAnalysis`, which suffers from variable collinearity. If the core team thinks this is a feature worth implementing, I am happy to submit a PR.
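To make the singularity point concrete, a tiny check (my illustration, not from the paper): the per-class Gram matrix `Xg @ Xg.T` is rank-deficient as soon as a class has more samples than features.

```python
import numpy as np

rng = np.random.RandomState(0)
Xg_tall = rng.randn(10, 3)  # 10 samples, 3 features: more samples than features
Xg_wide = rng.randn(3, 10)  # 3 samples, 10 features: the intended regime

# The 10x10 Gram matrix has rank 3, so it cannot be inverted ...
print(np.linalg.matrix_rank(Xg_tall @ Xg_tall.T))  # 3
# ... while the 3x3 Gram matrix is full rank and invertible.
print(np.linalg.matrix_rank(Xg_wide @ Xg_wide.T))  # 3
```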