# API proposal for metric-learn to enhance scikit-learn compatibility

This notebook proposes a draft API for distance metric learning algorithms that is compatible with scikit-learn, so that metric learners can easily be used in pipelines, in cross-validation, and with other scikit-learn objects.

In [103]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from sklearn.utils import check_random_state
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.model_selection import cross_val_score
from scipy.sparse import csr_matrix
import warnings
warnings.filterwarnings('ignore')  # silence the warnings raised by the sparse matrix used in Pairs (to be fixed)
from sklearn.pipeline import Pipeline, make_pipeline

# Description of the API

Most metric learning algorithms that are not supervised in the classical way (with inputs ``X`` and labels ``y``) take as input pairs of points, each with a label (1 for a positive pair, 0 for a negative pair; these are also called positive and negative constraints). One might want to evaluate such an algorithm by cross-validating a score (the roc_auc score, for instance), splitting train/test on the **constraints** rather than on the points. To make this possible, we could create a ``Pairs`` object that holds both the points (``X``) **and** the pairs, represented as two lists of indices: ``a``, the index of the first point of each pair, and ``b``, the index of the second.

In [68]:
class Pairs():

    def __init__(self, X, a, b):
        self.a = a
        self.b = b
        self.X = X
        self.shape = (len(a), X.shape[1])

    def __getitem__(self, item):
        # To avoid useless memory consumption, slicing drops the points
        # that are not referenced by any pair of the slice.
        a_sliced = self.a[item]
        b_sliced = self.b[item]
        # Sorted indices of the points that are still needed
        unique_array = np.unique(np.concatenate([np.array(a_sliced), np.array(b_sliced)]))
        # Maps an index in the original X to its position in the pruned X
        inverted_index = self._build_inverted_index(unique_array)
        pruned_X = self.X[unique_array].copy()  # copy so that the behaviour is always the same
        rescaled_sliced_a = inverted_index[a_sliced].toarray().ravel()
        rescaled_sliced_b = inverted_index[b_sliced].toarray().ravel()
        return Pairs(pruned_X, rescaled_sliced_a, rescaled_sliced_b)

    def __len__(self):
        return self.shape[0]  # number of pairs; __len__ must return an int

    def __str__(self):
        return np.stack([self.X[self.a], self.X[self.b]], axis=1).__str__()
    
    def __repr__(self):
        return np.stack([self.X[self.a], self.X[self.b]], axis=1).__repr__()

    def asarray(self):
        return np.stack([self.X[self.a], self.X[self.b]], axis=1)

    @staticmethod
    def _build_inverted_index(unique_array):
        inverted_index = csr_matrix((np.max(unique_array) + 1, 1), dtype=int)
        inverted_index[unique_array] = np.arange(len(unique_array))[:, None]
        return inverted_index
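
The re-indexing done when slicing can also be sketched with plain NumPy. The helper name ``prune_and_remap`` below is hypothetical and not part of the proposed API; it only illustrates, under the assumption that ``a`` and ``b`` are integer index arrays, how the pairs are remapped onto the pruned set of points.

```python
import numpy as np

def prune_and_remap(X, a, b):
    """Keep only the rows of X referenced by a or b and re-index the pairs.

    Hypothetical helper, for illustration only.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # unique_idx: sorted original row indices that are still needed.
    # inverse: position of each entry of concat(a, b) inside unique_idx.
    unique_idx, inverse = np.unique(np.concatenate([a, b]), return_inverse=True)
    inverse = inverse.ravel()
    new_a, new_b = inverse[:len(a)], inverse[len(a):]
    return X[unique_idx], new_a, new_b

X = np.arange(20)[:, None] * np.ones((20, 4))
a = np.array([4, 6, 7])
b = np.array([6, 7, 1])
X_small, new_a, new_b = prune_and_remap(X, a, b)

# The pruned representation describes exactly the same pairs of points.
same_first = np.array_equal(X_small[new_a], X[a])
same_second = np.array_equal(X_small[new_b], X[b])
```

Only the four points involved in the three pairs are kept, while ``new_a`` and ``new_b`` index into the smaller array.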

Next, we want a metric learner that can be trained on such a ``Pairs`` object and on the labels of the constraints. We create a dummy metric learner just for testing purposes. It takes a single step in the direction of the gradient of the pairwise distances of similar points. It can be used as a transformer and as a classifier.
- Its ``fit`` method takes as input a ``Pairs`` object that represents pairs of points, and an array-like (or list-like) ``y`` that holds the labels of the constraints (positive or negative), so ``len(y) == len(a) == len(b)``.
- When ``decision_function`` is called on a ``Pairs`` object, the metric learner returns the (squared) distance between the two points of each pair. This is what allows us to compute the cross-validated roc_auc score when splitting train/test on the pairs.
- When ``transform`` is called on a ``Pairs`` object, the metric learner returns a transformation of the **points** contained in the ``Pairs`` object. This is either a pairwise distance matrix on the points, or an embedding of the points in the new space (if the algorithm can be expressed as such). The return type is chosen with a flag when the metric learner is created.

In [116]:
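class DummyMetricLearner(BaseEstimator, TransformerMixin):
    def __init__(self, return_embedding=True):
        # If True, transform returns the embedded points;
        # otherwise it returns a pairwise distance matrix.
        self.return_embedding = return_embedding
        
    def fit(self, X_wrapped, y):
        X, constraints = self.prepare_input(X_wrapped, y)
        # constraints[0] and constraints[1] hold the two ends of the pairs labelled 0
        diffs = X[constraints[0]] - X[constraints[1]]
        self.metric = diffs.T.dot(diffs)
        return self  # scikit-learn convention: fit returns the estimator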
    
    def fit_transform(self, X_wrapped, y):
        self.fit(X_wrapped, y)
        return self.transform(X_wrapped)

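    def decision_function(self, X_wrapped):
        # Embed the points directly so that the result does not depend on return_embedding
        X_embedded = X_wrapped.X.dot(self.metric)
        squared_distances = np.sum((X_embedded[:, None] - X_embedded)**2,
                                   axis=2)
        # Keep only the entries that correspond to the given pairs
        return squared_distances[X_wrapped.a, X_wrapped.b]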
    
    def transform(self, X_wrapped):
        X_embedded = X_wrapped.X.dot(self.metric)
        if self.return_embedding:
            return X_embedded
        else: 
            return np.sqrt(np.sum((X_embedded[:, None] - X_embedded)**2, axis=2))
    
    @staticmethod
    def prepare_input(X, y):
        a = X.a[y==0]
        b = X.b[y==0]
        c = X.a[y==1]
        d = X.b[y==1]
        X = X.X
        return X, [a, b, c, d]
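
As a side note on ``decision_function``: the same per-pair squared distances can be obtained without building the full points-by-points matrix. The sketch below uses a hypothetical helper, ``pair_squared_distances``, and random data standing in for the embedded points; it is an illustration, not part of the proposed API.

```python
import numpy as np

def pair_squared_distances(X_embedded, a, b):
    """Squared Euclidean distance between rows a[i] and b[i], for each pair i.

    Hypothetical helper, for illustration only.
    """
    diffs = X_embedded[a] - X_embedded[b]
    return np.sum(diffs ** 2, axis=1)

rng = np.random.RandomState(0)
X_embedded = rng.randn(20, 4)  # stand-in for the embedded points
a = np.array([1, 2, 4, 6, 7])
b = np.array([5, 3, 6, 7, 1])

per_pair = pair_squared_distances(X_embedded, a, b)

# Reference: full (n_points, n_points) matrix, then pick the pair entries,
# as done in DummyMetricLearner.decision_function.
full = np.sum((X_embedded[:, None] - X_embedded) ** 2, axis=2)
reference = full[a, b]
```

Both computations give the same values; the direct version needs memory proportional to the number of pairs rather than to the square of the number of points.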
        

# Examples

Let's create a very simple synthetic dataset of pairs with a label for each pair, then build the ``Pairs`` object and a metric learner.

In [117]:
# #RNG = check_random_state(0)
# X = np.random.randn(20, 5)
X = np.arange(0, 20)[:, None] * np.ones((20, 4))
a = np.array([1, 2, 4, 6, 7, 10, 11, 14, 17, 10])
b = np.array([5, 3, 6, 7, 1, 16, 14, 16, 17, 11])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

pairs = Pairs(X, a, b)
dml = DummyMetricLearner()

A ``Pairs`` object gives us an array of pairs of points without replicating a point in memory for every constraint it is involved in. It prints the way we think of it: a list of pairs of samples, in the form of a 3D numpy array.

In [118]:
pairs

array([[[  1.,   1.,   1.,   1.],
        [  5.,   5.,   5.,   5.]],

       [[  2.,   2.,   2.,   2.],
        [  3.,   3.,   3.,   3.]],

       [[  4.,   4.,   4.,   4.],
        [  6.,   6.,   6.,   6.]],

       [[  6.,   6.,   6.,   6.],
        [  7.,   7.,   7.,   7.]],

       [[  7.,   7.,   7.,   7.],
        [  1.,   1.,   1.,   1.]],

       [[ 10.,  10.,  10.,  10.],
        [ 16.,  16.,  16.,  16.]],

       [[ 11.,  11.,  11.,  11.],
        [ 14.,  14.,  14.,  14.]],

       [[ 14.,  14.,  14.,  14.],
        [ 16.,  16.,  16.,  16.]],

       [[ 17.,  17.,  17.,  17.],
        [ 17.,  17.,  17.,  17.]],

       [[ 10.,  10.,  10.,  10.],
        [ 11.,  11.,  11.,  11.]]])
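
To give a rough idea of the memory saved, the sketch below compares the index-based storage (``X`` plus the two index arrays) with the materialized 3D array. The sizes are arbitrary values chosen for illustration.

```python
import numpy as np

# Arbitrary sizes, for illustration only
n_points, n_features, n_pairs = 1000, 50, 20000
rng = np.random.RandomState(0)
X = rng.randn(n_points, n_features)
a = rng.randint(0, n_points, size=n_pairs)
b = rng.randint(0, n_points, size=n_pairs)

# Index-based storage: the points once, plus two index arrays
indexed_bytes = X.nbytes + a.nbytes + b.nbytes

# Materialized storage: every pair holds its own copy of both points
materialized = np.stack([X[a], X[b]], axis=1)
materialized_bytes = materialized.nbytes

print(indexed_bytes, materialized_bytes)
```

The gap grows with the number of pairs each point takes part in.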

We can slice this object, and the slice selects pairs, not points. This is what makes it possible to cross-validate metric learning algorithms.

In [119]:
pairs[2: 5]

array([[[ 4.,  4.,  4.,  4.],
        [ 6.,  6.,  6.,  6.]],

       [[ 6.,  6.,  6.,  6.],
        [ 7.,  7.,  7.,  7.]],

       [[ 7.,  7.,  7.,  7.],
        [ 1.,  1.,  1.,  1.]]])

We can now do what we want: a cross-validation of the roc-auc score of the metric learner, splitting on the pairs.

In [123]:
cross_val_score(dml, pairs, y, scoring='roc_auc', n_jobs=1)

array([ 0.  ,  0.75,  1.  ])
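
To make "splitting on the pairs" concrete, here is a minimal sketch that splits the pair indices into contiguous folds by hand. This only approximates what a scikit-learn splitter does, and ``n_folds = 3`` is an assumption made for the illustration. It also shows that a point may appear on both sides of a split even though the pairs themselves are disjoint.

```python
import numpy as np

a = np.array([1, 2, 4, 6, 7, 10, 11, 14, 17, 10])
b = np.array([5, 3, 6, 7, 1, 16, 14, 16, 17, 11])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

n_folds = 3  # assumption, for illustration
all_idx = np.arange(len(y))
folds = np.array_split(all_idx, n_folds)

splits = []
for test_idx in folds:
    # The split is made on pair indices, never on point indices
    train_idx = np.setdiff1d(all_idx, test_idx)
    train_points = np.union1d(a[train_idx], b[train_idx])
    test_points = np.union1d(a[test_idx], b[test_idx])
    # Points used by both the training pairs and the test pairs
    shared_points = np.intersect1d(train_points, test_points)
    splits.append((train_idx, test_idx, shared_points))

for train_idx, test_idx, shared_points in splits:
    print(test_idx, shared_points)
```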

We can also do clustering, passing the labels of the constraints as ``y``.

In [132]:
from sklearn.cluster import KMeans
pipe = make_pipeline(dml, KMeans())
pipe.fit_predict(pairs, y)

array([5, 5, 5, 1, 1, 1, 6, 6, 2, 2, 2, 7, 7, 0, 0, 0, 4, 4, 3, 3], dtype=int32)

We can also chain a metric learner with unsupervised learning algorithms.

In [134]:
from sklearn.decomposition import PCA
pipe_unsupervised = make_pipeline(dml, PCA())
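# This call uses `pipe` (the KMeans pipeline defined above), not `pipe_unsupervised`,
# so the output below is the distance from each point to each of the 8 KMeans centers.
pipe.fit_transform(pairs, y)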

array([[ 8320.,  2080.,  4420.,   520.,  6760.,  9620.,  3380.,  5460.],
       [ 7800.,  1560.,  3900.,     0.,  6240.,  9100.,  2860.,  4940.],
       [ 7280.,  1040.,  3380.,   520.,  5720.,  8580.,  2340.,  4420.],
       [ 6760.,   520.,  2860.,  1040.,  5200.,  8060.,  1820.,  3900.],
       [ 6240.,     0.,  2340.,  1560.,  4680.,  7540.,  1300.,  3380.],
       [ 5720.,   520.,  1820.,  2080.,  4160.,  7020.,   780.,  2860.],
       [ 5200.,  1040.,  1300.,  2600.,  3640.,  6500.,   260.,  2340.],
       [ 4680.,  1560.,   780.,  3120.,  3120.,  5980.,   260.,  1820.],
       [ 4160.,  2080.,   260.,  3640.,  2600.,  5460.,   780.,  1300.],
       [ 3640.,  2600.,   260.,  4160.,  2080.,  4940.,  1300.,   780.],
       [ 3120.,  3120.,   780.,  4680.,  1560.,  4420.,  1820.,   260.],
       [ 2600.,  3640.,  1300.,  5200.,  1040.,  3900.,  2340.,   260.],
       [ 2080.,  4160.,  1820.,  5720.,   520.,  3380.,  2860.,   780.],
       [ 1560.,  4680.,  2340.,  6240.,     0.,  28

<div class="alert alert-block alert-info">
Note that a ``Pairs`` object keeps the whole ``X`` it was created with. This makes it possible to use the metric learner as a transformer on this ``X``. However, when we split on the constraints, each slice keeps only the points from ``X`` that appear in its pairs. This is more memory efficient and yields two datasets that do not share storage.
</div>

# Discussion