# Visual search with k-NN

This is the third notebook of this project.


## Build X and y vectors

In this notebook, as in the others, I need to build X_train, X_valid, X_test, y_train, y_valid and y_test from the data. Since this has to be done several times, I add it to the *mylib.py* file as a library function.

The code is basic:

    def loadXy(data=None, concatenate=[]):
        """
        Return the data together with the X and y dictionaries built from it.
        X and y are keyed by dataset name, so X['train'], y['train'], X['valid'],
        y['valid'], X['test'] and y['test'] hold the feature and label arrays.
        If data is None, loadNpz() is called with its default parameters to load it.
        The data passed as parameter must follow the structure defined in notebook 1.
        """

        if data is None:
            data=loadNpz()

        X=dict()
        y=dict()
        # Get X from the high-level features and y from the labels of each dataset
        for name in data['DATASET_NAME']:
            X[name]=data[name]['features']
            y[name]=data[name]['labels']
            print("X {} shape:".format(name),X[name].shape)
            print("y {} shape:".format(name),y[name].shape)

        return (data, X, y)
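
To make the expected layout concrete, here is a minimal sketch of the same split applied to a toy dictionary. The name `load_xy_sketch`, the toy sizes and the random values are assumptions made for illustration only; the real structure is the one produced in notebook 1.

```python
import numpy as np


def load_xy_sketch(data):
    """Split a nested dataset dict into per-dataset feature and label dicts."""
    X = {name: data[name]['features'] for name in data['DATASET_NAME']}
    y = {name: data[name]['labels'] for name in data['DATASET_NAME']}
    return data, X, y


# Toy stand-in for the NPZ content (hypothetical sizes and values)
rng = np.random.default_rng(0)
toy = {'DATASET_NAME': ['train', 'valid', 'test']}
for name, n in zip(toy['DATASET_NAME'], (8, 4, 2)):
    toy[name] = {
        'features': rng.normal(size=(n, 5)),     # one feature row per image
        'labels': rng.integers(0, 3, size=n),    # one integer label per image
    }

_, X_toy, y_toy = load_xy_sketch(toy)
print(X_toy['train'].shape, y_toy['train'].shape)
```

Features are 2-D (one row per image) while labels are 1-D, which is the property the shape printout in `loadXy` is meant to confirm.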



## Load X,y data from NPZ

Using the function added to the *mylib.py* file, it is now easy to get the data and the X/y vectors, ready for model training and tuning.

In [40]:
# Run content of mylib.py file
%run mylib.py

# Load data from NPZ file
#data=loadNpz()
(data, X, y)=loadXy()

Loading 'train' set
  loading  data
     shape: (281, 299, 299, 3) - dtype: float64
  loading  features
     shape: (281, 2048) - dtype: float64
  loading  filenames
     shape: (281,) - dtype: <U46
  loading  labels
     shape: (281,) - dtype: int32


Loading 'test' set
  loading  data
     shape: (51, 299, 299, 3) - dtype: float64
  loading  features
     shape: (51, 2048) - dtype: float64
  loading  filenames
     shape: (51,) - dtype: <U50
  loading  labels
     shape: (51,) - dtype: int32


Loading 'valid' set
  loading  data
     shape: (139, 299, 299, 3) - dtype: float64
  loading  features
     shape: (139, 2048) - dtype: float64
  loading  filenames
     shape: (139,) - dtype: <U30
  loading  labels
     shape: (139,) - dtype: int32


X train shape: (281, 2048)
y train shape: (281,)
X test shape: (51, 2048)
y test shape: (51,)
X valid shape: (139, 2048)
y valid shape: (139,)


## Fit and tune a k-NN classifier



In [46]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create k-NN classifier
knn = KNeighborsClassifier(
    n_neighbors=len(data['class_name']), # Set k to the number of classes in the datasets
    # Use the simple 'brute' strategy to find nearest neighbors.
    # It's faster in this case!
    algorithm='brute',
    n_jobs=-1
)

# Create the pipeline and fit it to training data
knn_pipe = Pipeline([
    ('scaler', StandardScaler()), # With standardization
    # ('scaler', None), # Better performance without standardization!
    ('knn', knn)
])
knn_pipe.fit(X['train'], y['train'])

# Evaluate on validation set
accuracy = knn_pipe.score(X['valid'], y['valid'])

# Print accuracy
print('k-nearest neighbors (k={}) accuracy: {:.3f}'.format(len(data['class_name']), accuracy))

k-nearest neighbors (k=6) accuracy: 0.928
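
The cell above fixes k to the number of classes. A possible way to tune it is to refit the same pipeline for several values of k and keep the one with the best validation accuracy. The sketch below is an assumption about how that could be done, not the tuning actually run in this notebook: `tune_k` is a hypothetical helper and the data comes from `make_classification` as a stand-in for `X['train']`, `y['train']`, `X['valid']` and `y['valid']`.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_k(X_train, y_train, X_valid, y_valid, k_values):
    """Return (best_k, scores), where scores maps each k to its validation accuracy."""
    scores = {}
    for k in k_values:
        pipe = Pipeline([
            ('scaler', StandardScaler()),
            ('knn', KNeighborsClassifier(n_neighbors=k, algorithm='brute')),
        ])
        pipe.fit(X_train, y_train)
        scores[k] = pipe.score(X_valid, y_valid)
    # On ties, max() keeps the first k encountered in k_values
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in data (hypothetical, not the image features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_tr, y_tr = X_demo[:200], y_demo[:200]
X_va, y_va = X_demo[200:], y_demo[200:]

best_k, scores = tune_k(X_tr, y_tr, X_va, y_va, k_values=range(1, 16, 2))
print('best k: {} (validation accuracy: {:.3f})'.format(best_k, scores[best_k]))
```

With the real data, the call would presumably be `tune_k(X['train'], y['train'], X['valid'], y['valid'], range(1, 16, 2))`, keeping the test set aside for the final evaluation.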


Pick an image from the test set and plot its 10 nearest neighbors from the training set.