## Using Clustering for Preprocessing
- Clustering can be an efficient approach to dimensionality reduction, in particular as a
preprocessing step before a supervised learning algorithm.
- Let's tackle the digits dataset, which is a simple
MNIST-like dataset containing 1,797 grayscale 8 × 8 images representing the digits
0 to 9.
- Each image is a 64-dimensional vector (8 × 8 pixels). We reduce this dimensionality with clustering and compare the classifier's test score before and after.
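
To make the idea concrete: a fitted `KMeans` model's `transform` method maps each sample to its distances from every cluster centroid, so the number of clusters becomes the new feature count. The snippet below is a minimal sketch of that step on its own; the choice of 30 clusters and the `n_init` and `random_state` values are assumptions made here for reproducibility.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X, y = datasets.load_digits(return_X_y=True)

# n_clusters=30, n_init=10 and random_state=0 are illustrative assumptions.
kmeans = KMeans(n_clusters=30, n_init=10, random_state=0)

# fit_transform returns one column per cluster: the distance from each
# image to that cluster's centroid.
X_reduced = kmeans.fit_transform(X)

print(X.shape)          # original pixel space
print(X_reduced.shape)  # cluster-distance space
```

The reduced matrix has one row per image and one column per cluster, which is what the classifier sees when KMeans is used as a preprocessing step.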

## Importing the packages

In [None]:
import numpy as np
from sklearn import datasets
from sklearn import model_selection
from sklearn import linear_model

## Loading the dataset

In [35]:
X_digits, y_digits = datasets.load_digits(return_X_y=True)
print(f"{X_digits.shape} {y_digits.shape}")

(1797, 64) (1797,)


## Train and Test Split

In [40]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_digits, y_digits,
                                                                    random_state=0,
                                                                    test_size=0.25,
                                                                    shuffle=True,
                                                                    stratify=y_digits)

print(f"TRAINING INFO: {X_train.shape} {y_train.shape}")
print(f"TEST INFO: {X_test.shape} {y_test.shape}")

TRAINING INFO: (1347, 64) (1347,)
TEST INFO: (450, 64) (450,)


## Model training and prediction

In [53]:
log_res = linear_model.LogisticRegression(n_jobs=-1, max_iter=100)
log_res.fit(X_train, y_train)

In [69]:
score = log_res.score(X_test, y_test)
print(f"SCORE: {score:.4f}")

SCORE: 0.9578


## Model Improvement
### 1. Dimensionality reduction from 64-dim to 30-dim using KMeans 
- Let's arbitrarily select n_clusters=30, so that each image is represented by its distances to 30 cluster centroids.

In [101]:
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans 

pipeline = Pipeline([
            ("kmeans", KMeans(n_clusters=30)),
            ("log_reg", linear_model.LogisticRegression(n_jobs=-1, max_iter=200, random_state=0))
          ])

pipeline.fit(X_train, y_train)

In [102]:
score = pipeline.score(X_test, y_test)
print(f"SCORE: {score:.4f}")

SCORE: 0.9667
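
What the pipeline does at scoring time can be sketched by calling its fitted steps by hand: the `kmeans` step transforms the test images into cluster distances, and the `log_reg` step scores them. The snippet below is a self-contained sketch, not the exact run above; the `n_init`, `random_state`, and `max_iter` values are assumptions chosen here, so its score may differ from the one printed above.

```python
from sklearn import datasets, linear_model, model_selection
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameter values below are illustrative assumptions.
pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=30, n_init=10, random_state=0)),
    ("log_reg", linear_model.LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Step 1: map the 64 pixel features to 30 cluster distances.
X_test_reduced = pipeline.named_steps["kmeans"].transform(X_test)
# Step 2: score the classifier on the reduced features.
manual_score = pipeline.named_steps["log_reg"].score(X_test_reduced, y_test)

# Pipeline.score chains the same two steps internally.
pipeline_score = pipeline.score(X_test, y_test)
print(f"manual: {manual_score:.4f}  pipeline: {pipeline_score:.4f}")
```

Both numbers come from the same fitted objects, so they should agree; the pipeline simply saves the bookkeeping.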


### 2. Find the best n_clusters using GridSearchCV

In [103]:
from sklearn.model_selection import GridSearchCV

param_grid = dict(kmeans__n_clusters=range(2, 120))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)

Fitting 3 folds for each of 118 candidates, totalling 354 fits
[CV] END ...............................kmeans__n_clusters=2; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=2; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=2; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=3; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=3; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=3; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=4; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=4; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=4; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=5; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=5; total time=   0.2s
[CV] END ...............................kmeans

In [104]:
print(grid_clf.best_params_)

{'kmeans__n_clusters': 109}


In [105]:
grid_clf.score(X_test, y_test)

0.9555555555555556
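
`GridSearchCV` picks `best_params_` by mean cross-validated accuracy on the training folds, not by test accuracy, so the selected value need not give the highest score on the held-out test set. The sketch below shows one way to inspect those cross-validated scores through `best_score_` and `cv_results_`. It uses a deliberately small grid of three candidate values to keep the runtime short; that grid and the `n_init`, `random_state`, and `max_iter` values are assumptions made here, not the settings of the search above.

```python
from sklearn import datasets, linear_model, model_selection
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("kmeans", KMeans(n_init=10, random_state=0)),
    ("log_reg", linear_model.LogisticRegression(max_iter=1000)),
])

# A small illustrative grid (assumption); the search above covers range(2, 120).
param_grid = {"kmeans__n_clusters": [10, 30, 50]}
grid = model_selection.GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print(f"best mean CV accuracy: {grid.best_score_:.4f}")

# Mean cross-validated accuracy for every candidate in the grid.
results = grid.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, f"{mean_score:.4f}")

# Accuracy of the refitted best pipeline on the held-out test set.
print(f"held-out test accuracy: {grid.score(X_test, y_test):.4f}")
```

Comparing the per-candidate cross-validated means with the held-out accuracy helps judge whether differences between cluster counts are larger than the noise between splits.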