10. The classic Olivetti faces dataset contains 400 grayscale 64 × 64 pixel images of faces. Each image is flattened to a 1D vector of size 4,096. 40 different people were photographed (10 times each), and the usual task is to train a model that can predict which person is represented in each picture. Load the dataset using the **sklearn.datasets.fetch_olivetti_faces()** function, then split it into a training set, a validation set, and a test set (note that the dataset is already scaled between 0 and 1). Since the dataset is quite small, you probably want to use stratified sampling to ensure that there are the same number of images per person in each set. Next, cluster the images using K-Means, and ensure that you have a good number of clusters (using one of the techniques discussed in this chapter). Visualize the clusters: do you see similar faces in each cluster?

In [11]:
# Loading in data

from sklearn.datasets import fetch_olivetti_faces

olivetti = fetch_olivetti_faces()

In [13]:
# Splitting the data using stratified shuffling

'''
This is a small dataset, so we use stratified sampling to ensure that every
person (class) is represented equally in each of the sets.
'''

## Imports
from sklearn.model_selection import StratifiedShuffleSplit

## Splitting it once with the test size having 40 observations
strat_split = StratifiedShuffleSplit(n_splits=1, test_size=40, random_state=42)

## Getting the indices for the combined train/validation set and the test set
train_valid_idx, test_idx = next(strat_split.split(olivetti.data, olivetti.target))

## Getting the combined train/validation set
X_train_valid = olivetti.data[train_valid_idx]
y_train_valid = olivetti.target[train_valid_idx]

## Getting test set (labels come from olivetti.target, not olivetti.data)
X_test = olivetti.data[test_idx]
y_test = olivetti.target[test_idx]

## Splitting the combined train/validation set into a train set and a validation set (validation set consisting of 80 images)
strat_split = StratifiedShuffleSplit(n_splits=1, test_size=80, random_state=43)

## Getting the indices for train and validation set
train_idx, valid_idx = next(strat_split.split(X_train_valid, y_train_valid))

## Getting the train set
X_train = X_train_valid[train_idx]
y_train = y_train_valid[train_idx]

## Getting the validation set
X_valid = X_train_valid[valid_idx]
y_valid = y_train_valid[valid_idx]
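
A quick way to confirm that a two-stage stratified split like the one above is balanced is to count the images per person in each set. The following is a minimal sketch on stand-in labels (40 classes with 10 samples each, mirroring the Olivetti layout) so that it runs without downloading the dataset; the `demo_*` names are hypothetical and not part of the notebook. With 280, 80, and 40 samples spread over 40 people, one would expect 7, 2, and 1 images per person respectively.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Stand-in labels with the same layout as the Olivetti targets:
# 40 classes, 10 samples each. The features are placeholders.
demo_y = np.repeat(np.arange(40), 10)
demo_X = np.zeros((len(demo_y), 1))

# First split: hold out 40 samples for the test set
outer = StratifiedShuffleSplit(n_splits=1, test_size=40, random_state=42)
demo_trval_idx, demo_test_idx = next(outer.split(demo_X, demo_y))
demo_X_trval, demo_y_trval = demo_X[demo_trval_idx], demo_y[demo_trval_idx]

# Second split: hold out 80 of the remaining 360 samples for validation
inner = StratifiedShuffleSplit(n_splits=1, test_size=80, random_state=43)
demo_train_idx, demo_valid_idx = next(inner.split(demo_X_trval, demo_y_trval))

# Count how many samples of each class ended up in each set
demo_counts = {
    "train": np.bincount(demo_y_trval[demo_train_idx], minlength=40),
    "valid": np.bincount(demo_y_trval[demo_valid_idx], minlength=40),
    "test": np.bincount(demo_y[demo_test_idx], minlength=40),
}
for name, per_class in demo_counts.items():
    print(f"{name}: min={per_class.min()}, max={per_class.max()} samples per class")
```

The same `np.bincount` check can be applied to `y_train`, `y_valid`, and `y_test` from the cell above.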

In [None]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

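## Initializing KMeans object
kmeans = KMeans(n_clusters=5)

## Implementing GridSearch over the number of clusters
## Note: with scoring=None, GridSearchCV uses KMeans.score, i.e. the negative
## inertia on the held-out fold; the labels passed to fit are ignored by KMeans.
## This score tends to keep improving as n_clusters grows.
param_grid = dict(n_clusters=range(1, 100))
grid_clf = GridSearchCV(kmeans, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)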

Fitting 3 folds for each of 99 candidates, totalling 297 fits
[CV] n_clusters=1 ....................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ..................................... n_clusters=1, total=   0.2s
[CV] n_clusters=1 ....................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV] ..................................... n_clusters=1, total=   0.2s
[CV] n_clusters=1 ....................................................
[CV] ..................................... n_clusters=1, total=   0.2s
[CV] n_clusters=2 ....................................................
[CV] ..................................... n_clusters=2, total=   0.5s
[CV] n_clusters=2 ....................................................
[CV] ..................................... n_clusters=2, total=   0.5s
[CV] n_clusters=2 ....................................................
[CV] ..................................... n_clusters=2, total=   0.4s
[CV] n_clusters=3 ....................................................
[CV] ..................................... n_clusters=3, total=   0.6s
[CV] n_clusters=3 ....................................................
[CV] ..................................... n_clusters=3, total=   0.6s
[CV] n_clusters=3 ....................................................
[CV] .

[Parallel(n_jobs=1)]: Done 297 out of 297 | elapsed: 19.0min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=KMeans(algorithm='auto', copy_x=True, init='k-means++',
                              max_iter=300, n_clusters=5, n_init=10,
                              n_jobs=None, precompute_distances='auto',
                              random_state=None, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None,
             param_grid={'n_clusters': range(1, 100)}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=2)
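
Because the grid search above ranks candidates by negative inertia, it may simply favor the largest `n_clusters`. One alternative for choosing the number of clusters is the silhouette score. The sketch below is an illustration on synthetic blobs, not on the Olivetti faces; the data, the range of `k`, and the `demo_*` names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data: four well-separated blobs (an assumption for illustration,
# not the Olivetti faces).
demo_centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
demo_X, _ = make_blobs(n_samples=400, centers=demo_centers,
                       cluster_std=0.5, random_state=42)

demo_results = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(demo_X)
    demo_results[k] = (model.inertia_, silhouette_score(demo_X, model.labels_))
    print(f"k={k}: inertia={demo_results[k][0]:.1f}, "
          f"silhouette={demo_results[k][1]:.3f}")

# Pick the k with the highest mean silhouette coefficient
demo_best_k = max(demo_results, key=lambda k: demo_results[k][1])
print(f"Best k by silhouette score: {demo_best_k}")
```

On the real data, the same loop could be run over `X_train` with a wider and coarser range of `k`, at the cost of longer run time.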


In [None]:
## Viewing the best parameters
print(f'Best Parameters: {grid_clf.best_params_}')

## Scoring the refit best estimator on the test set
print(f'Best Score: {grid_clf.score(X_test, y_test)}')


Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] kmeans__n_clusters=90 ...........................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............................ kmeans__n_clusters=90, total=   8.0s
[CV] kmeans__n_clusters=90 ...........................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.0s remaining:    0.0s


[CV] ............................ kmeans__n_clusters=90, total=   8.4s
[CV] kmeans__n_clusters=90 ...........................................
[CV] ............................ kmeans__n_clusters=90, total=   7.2s
[CV] kmeans__n_clusters=91 ...........................................
[CV] ............................ kmeans__n_clusters=91, total=   7.9s
[CV] kmeans__n_clusters=91 ...........................................
[CV] ............................ kmeans__n_clusters=91, total=   8.4s
[CV] kmeans__n_clusters=91 ...........................................
[CV] ............................ kmeans__n_clusters=91, total=   7.0s
[CV] kmeans__n_clusters=92 ...........................................
[CV] ............................ kmeans__n_clusters=92, total=   7.5s
[CV] kmeans__n_clusters=92 ...........................................
[CV] ............................ kmeans__n_clusters=92, total=   8.1s
[CV] kmeans__n_clusters=92 ...........................................
[CV] .

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:  8.0min finished


Best Parameters: {'kmeans__n_clusters': 93}
Best Score: 0.98
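
The exercise also asks to visualize the clusters. A minimal sketch of one way to do this is given below: a hypothetical helper, `plot_cluster_faces`, that draws every image assigned to a given cluster on a grid. It is demonstrated on random pixels and random labels so that it is self-contained; none of these names come from the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_cluster_faces(faces, labels, cluster_id, n_cols=5):
    """Plot every 64x64 image assigned to one cluster (hypothetical helper)."""
    members = faces[labels == cluster_id]
    n_rows = max(1, int(np.ceil(len(members) / n_cols)))
    fig, axes = plt.subplots(n_rows, n_cols, squeeze=False,
                             figsize=(n_cols * 1.5, n_rows * 1.5))
    for ax in axes.ravel():
        ax.axis("off")  # hide ticks everywhere, including unused grid cells
    for ax, face in zip(axes.ravel(), members):
        ax.imshow(face.reshape(64, 64), cmap="gray")
    fig.suptitle(f"Cluster {cluster_id} ({len(members)} images)")
    return fig


# Stand-in data: random pixels and random cluster labels, not real faces.
demo_rng = np.random.default_rng(42)
demo_faces = demo_rng.random((30, 64 * 64))
demo_labels = demo_rng.integers(0, 3, size=30)
demo_fig = plot_cluster_faces(demo_faces, demo_labels, cluster_id=0)
```

With a fitted model, the helper could be called as `plot_cluster_faces(X_train, kmeans.labels_, cluster_id)` for a few cluster ids, followed by `plt.show()`, to judge by eye whether each cluster groups similar faces.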
