# Large Margin Nearest Neighbor classifier benchmark
*Comparing a 1-NN baseline with the metric-learn and scikit-learn LMNN implementations on MNIST (optionally a subset)*


*Date*: 18 May 2017

**Dependencies:** Python 3.6, scikit-learn 0.18.1, metric-learn 0.3.0, memory_profiler, psutil

In [6]:
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_mldata
from sklearn.utils.random import choice as random_choice
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
# implementation from PR https://github.com/scikit-learn/scikit-learn/pull/8602
from lmnn import LargeMarginNearestNeighbor
from metric_learn.lmnn import python_LMNN

%load_ext memory_profiler

random_state = 43

n_samples = None   # set to an int to benchmark metric-learn on a subset of MNIST data
n_components = 50

mnist = fetch_mldata('MNIST original')

if n_samples is not None:
    mask = random_choice(np.arange(len(mnist.target)), size=(n_samples,), replace=False,
                         random_state=random_state)
else:
    mask = slice(None)

X_train, X_test, y_train, y_test = train_test_split(mnist.data[mask], mnist.target[mask],
                                                    test_size=0.3, random_state=random_state)

pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print('Applying PCA (n_components={}) on a {} sample subset of MNIST'
      .format(n_components, n_samples))
print('Classes in the test set:', np.sort(np.unique(y_test)))
print('Classes in the training set:', np.sort(np.unique(y_train)))

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler
Applying PCA (n_components=50) on a None sample subset of MNIST
Classes in the test set: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
Classes in the training set: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]


In [7]:
%%time
%%memit -r 1

nn = KNeighborsClassifier(n_neighbors=1, algorithm='brute')
nn.fit(X_train_pca, y_train)
print('Accuracy score: {:.3f}'.format(nn.score(X_test_pca, y_test)))

Accuracy score: 0.975
peak memory: 16158.90 MiB, increment: 15643.79 MiB
CPU times: user 12.4 s, sys: 3.08 s, total: 15.5 s
Wall time: 10.6 s
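
The peak memory above is dominated by the brute-force search itself. A rough back-of-the-envelope check follows, assuming the full 70,000-sample MNIST set, the 70/30 split used here, and that the brute-force search materializes the whole test-by-train distance matrix in float64. The last point is an assumption about the implementation's internals, not something measured in this notebook.

```python
# Back-of-the-envelope estimate of the brute-force distance matrix size.
# Assumptions: 70,000 MNIST samples, test_size=0.3, float64 distances,
# and a fully materialized (n_test x n_train) matrix.
n_total = 70000
n_test = int(n_total * 0.3)        # 21,000 test samples
n_train = n_total - n_test         # 49,000 training samples

bytes_per_float64 = 8
matrix_bytes = n_test * n_train * bytes_per_float64
matrix_mib = matrix_bytes / 2 ** 20

print('One distance matrix: {:.0f} MiB'.format(matrix_mib))
```

Under these assumptions a single matrix takes about 7,851 MiB, roughly half of the 15,643.79 MiB increment reported. A second temporary of the same shape would plausibly account for the rest, but that is a guess.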


In [14]:
%%time
%%memit -r 1

# metric-learn implementation
if n_samples is None or n_samples > 2000:
    print('Skipping the metric-learn LMNN implementation for large datasets '
          'as it does not scale well.')
else:
    lmnn = python_LMNN(k=1)

    lmnn.fit(X_train_pca, y_train)

    # project into the learned (large-margin) space
    X_train_pca_lmnn = lmnn.transform(X_train_pca)
    X_test_pca_lmnn = lmnn.transform(X_test_pca)
                                 
    nn = KNeighborsClassifier(n_neighbors=1, algorithm='brute')
    nn.fit(X_train_pca_lmnn, y_train)
    print('Accuracy score: {:.3f}'.format(nn.score(X_test_pca_lmnn, y_test)))

Skipping the metric-learn LMNN implementation for large datasets as it does not scale well.
peak memory: 312.84 MiB, increment: 0.00 MiB
CPU times: user 44 ms, sys: 60 ms, total: 104 ms
Wall time: 238 ms


In [15]:
%%time
%%memit -r 1

nn = LargeMarginNearestNeighbor(n_neighbors=1, random_state=random_state)
nn.fit(X_train_pca, y_train)
print('Accuracy score: {:.3f}'.format(nn.score(X_test_pca, y_test)))

Accuracy score: 0.977
peak memory: 2105.35 MiB, increment: 1792.51 MiB
CPU times: user 3h 46min 32s, sys: 3min 7s, total: 3h 49min 39s
Wall time: 1h 2min 31s
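
For reference, the quantity LMNN tries to minimize can be written down in a few lines. The sketch below evaluates the textbook LMNN objective (a pull term over same-class target neighbors plus a hinge push term over differently labeled points) for a given linear map `L`. The function name `lmnn_loss` and the `push_weight` parameter are illustrative choices, not part of either benchmarked library, and this O(n^2)-memory version is only meant for tiny inputs; it does not reflect how the implementations above compute or optimize the loss.

```python
import numpy as np


def lmnn_loss(L, X, y, k=1, push_weight=0.5):
    """Textbook LMNN objective for a linear map L (illustrative only).

    Assumes every class has at least k + 1 samples.
    """
    n = X.shape[0]
    Z = X @ L.T
    # Squared Euclidean distances in the input and transformed spaces.
    d_orig = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)

    same = y[:, None] == y[None, :]
    # Target neighbors: the k nearest same-class points in the input space.
    masked = np.where(same, d_orig, np.inf)
    np.fill_diagonal(masked, np.inf)
    targets = np.argsort(masked, axis=1)[:, :k]

    pull = 0.0
    push = 0.0
    for i in range(n):
        different = ~same[i]
        for j in targets[i]:
            # Pull term: keep target neighbors close.
            pull += d[i, j]
            # Push term: differently labeled points should be farther
            # away than the target neighbor by a margin of 1.
            push += np.maximum(0.0, 1.0 + d[i, j] - d[i, different]).sum()
    return (1.0 - push_weight) * pull + push_weight * push


# Tiny usage example: two well-separated classes, identity transform.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])
loss_identity = lmnn_loss(np.eye(2), X_toy, y_toy)
loss_collapsed = lmnn_loss(np.zeros((2, 2)), X_toy, y_toy)
print(loss_identity, loss_collapsed)
```

With the identity map only the pull term contributes, since no differently labeled point violates the margin. Collapsing everything to a single point removes the pull cost but makes every differently labeled point a margin violator, which is why the collapsed map scores worse here.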
