## Using Clustering for Preprocessing
- Clustering can be an efficient approach to dimensionality reduction, in particular as a
preprocessing step before a supervised learning algorithm.
- Let's tackle the digits dataset, a simple MNIST-like dataset containing 1,797 grayscale
8 × 8 images representing the digits 0 to 9.
- Each image is a 64-dimensional vector. We reduce its dimensionality with clustering and check how the test score changes.
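
To make the idea of clustering as dimensionality reduction concrete, here is a minimal sketch of what `KMeans` does when it is used as a transformer: each sample is replaced by its distances to the cluster centroids. The values `n_clusters=30`, `n_init=10` and `random_state=0` are illustrative assumptions, not tuned settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# n_clusters, n_init and random_state are illustrative choices
kmeans = KMeans(n_clusters=30, n_init=10, random_state=0)

# fit_transform returns, for every image, its distance to each centroid
X_reduced = kmeans.fit_transform(X)

print(X.shape)          # (1797, 64)
print(X_reduced.shape)  # (1797, 30)
```

The 64 pixel features become 30 distance features, which is the representation a downstream classifier would see.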

## Importing the packages

In [10]:
import numpy as np
from sklearn import datasets
from sklearn import model_selection
from sklearn import linear_model

## Loading the dataset:

In [2]:
X_digits, y_digits = datasets.load_digits(return_X_y=True)
print(f"{X_digits.shape} {y_digits.shape}")

(1797, 64) (1797,)


## Train and Test Split

In [3]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_digits, y_digits,
                                                                    random_state=0,
                                                                    test_size=0.25,
                                                                    shuffle=True,
                                                                    stratify=y_digits)

print(f"TRAINING INFO: {X_train.shape} {y_train.shape}")
print(f"TEST INFO: {X_test.shape} {y_test.shape}")

TRAINING INFO: (1347, 64) (1347,)
TEST INFO: (450, 64) (450,)


## Model training and prediction

In [4]:
log_res = linear_model.LogisticRegression(n_jobs=-1, max_iter=100)
log_res.fit(X_train, y_train)

In [5]:
score = log_res.score(X_test, y_test)
print(f"SCORE: {score:.4f}")

SCORE: 0.9578


## Model Improvement
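### 1. Dimensionality reduction from 64-dim to 30-dim using KMeans
- Let's arbitrarily select n_clusters=30.
- Inside a `Pipeline`, `KMeans` acts as a transformer: each image is replaced by its distances to the 30 cluster centroids.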

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans 

pipeline = Pipeline([
            ("kmeans", KMeans(n_clusters=30)),
            ("log_reg", linear_model.LogisticRegression(n_jobs=-1, max_iter=200, random_state=0))
          ])

pipeline.fit(X_train, y_train)

In [8]:
score = pipeline.score(X_test, y_test)
print(f"SCORE: {score:.4f}")

SCORE: 0.9689
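
To put the two scores in perspective, the short calculation below compares error rates rather than accuracies. It only uses the two numbers reported above; since `KMeans` in the pipeline has no `random_state`, the exact pipeline score can vary from run to run.

```python
baseline_acc = 0.9578  # logistic regression on raw pixels (reported above)
pipeline_acc = 0.9689  # KMeans(n_clusters=30) + logistic regression (reported above)

baseline_err = 1 - baseline_acc
pipeline_err = 1 - pipeline_acc
relative_drop = (baseline_err - pipeline_err) / baseline_err

print(f"error rate: {baseline_err:.2%} -> {pipeline_err:.2%}")  # 4.22% -> 3.11%
print(f"relative reduction: {relative_drop:.1%}")               # 26.3%
```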


### 2. Find the best n_clusters using GridSearchCV

In [103]:
# GridSearchCV lives in sklearn.model_selection, which is already imported
param_grid = dict(kmeans__n_clusters=range(2, 120))
grid_clf = model_selection.GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)

Fitting 3 folds for each of 118 candidates, totalling 354 fits
[CV] END ...............................kmeans__n_clusters=2; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=2; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=2; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=3; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=3; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=3; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=4; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=4; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=4; total time=   0.1s
[CV] END ...............................kmeans__n_clusters=5; total time=   0.2s
[CV] END ...............................kmeans__n_clusters=5; total time=   0.2s
...

In [104]:
print(grid_clf.best_params_)

{'kmeans__n_clusters': 109}


In [105]:
grid_clf.score(X_test,  y_test)

0.9555555555555556
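
The search above picks `n_clusters=109`, yet the resulting test score (0.9556) is lower than the 0.9689 obtained with 30 clusters. One way to investigate is to look at the cross-validation scores themselves instead of only `best_params_`. The sketch below is an assumption-laden variant, not the run above: it uses a much coarser grid (`[10, 30, 50]`), fixes `random_state=0` and `n_init=3` on `KMeans` so results are repeatable, and raises `max_iter` to 1000.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fixed seeds so that differences between candidates are not just init noise
pipe = Pipeline([
    ("kmeans", KMeans(n_init=3, random_state=0)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])

# Coarse, illustrative grid (hypothetical values, chosen to keep the run short)
coarse_grid = {"kmeans__n_clusters": [10, 30, 50]}
search = GridSearchCV(pipe, coarse_grid, cv=3)
search.fit(X_train, y_train)

# Compare mean and spread of the CV accuracy per candidate
results = pd.DataFrame(search.cv_results_)[
    ["param_kmeans__n_clusters", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))
print("test accuracy:", search.score(X_test, y_test))
```

If the mean scores of neighbouring candidates differ by less than their standard deviation, the "best" value is likely within noise, and a smaller `n_clusters` may generalize just as well.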

In [11]:
param_grid = {
    "n_clusters": range(2, 100),
    "n_init": range(10, 20, 2)
}

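# No `scoring` is given, so candidates are ranked by KMeans.score (negative inertia).
# KMeans is unsupervised: y_train is accepted but ignored.
kmeans_grid = model_selection.GridSearchCV(estimator=KMeans(), param_grid=param_grid, cv=3, verbose=2, n_jobs=-1)
kmeans_grid.fit(X_train, y_train)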

Fitting 3 folds for each of 490 candidates, totalling 1470 fits


In [12]:
kmeans_grid.best_params_

{'n_clusters': 99, 'n_init': 18}

In [13]:
kmeans_grid.score(X_test, y_test)

-177861.87401148502
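
The large negative number is not an accuracy. When `GridSearchCV` wraps a bare `KMeans` and no `scoring` is passed, it falls back to `KMeans.score`, which returns the negative of the K-means objective (the inertia, i.e. the sum of squared distances to the nearest centroid). Because inertia generally shrinks as clusters are added, such a search tends to favour the largest `n_clusters` in the grid. The sketch below illustrates this on the full digits data; the values of `k`, `n_init=3` and `random_state=0` are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

inertias = {}
for k in (5, 20, 80):  # illustrative cluster counts
    km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
    inertias[k] = km.inertia_
    # On the training data, score(X) matches -inertia_ up to rounding
    print(k, km.score(X), -km.inertia_)
```

For choosing `n_clusters` as a preprocessing step, the accuracy of the full pipeline (as in the previous grid search) is the more meaningful criterion.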

In [25]:
import pandas as pd
pd.DataFrame(kmeans_grid.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_clusters | param_n_init | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.356559 | 0.016050 | 0.000000 | 0.000000 | 2 | 10 | {'n_clusters': 2, 'n_init': 10} | -480572.278791 | -473756.970156 | -481771.986561 | -478700.411836 | 3529.687089 | 489 |
| 1 | 0.361184 | 0.018612 | 0.000000 | 0.000000 | 2 | 12 | {'n_clusters': 2, 'n_init': 12} | -480534.546585 | -473756.970156 | -481771.986561 | -478687.834434 | 3523.055617 | 486 |
| 2 | 0.371480 | 0.019883 | 0.002667 | 0.003771 | 2 | 14 | {'n_clusters': 2, 'n_init': 14} | -480572.278791 | -473756.970156 | -481771.986561 | -478700.411836 | 3529.687089 | 489 |
| 3 | 0.445814 | 0.060449 | 0.000000 | 0.000000 | 2 | 16 | {'n_clusters': 2, 'n_init': 16} | -480550.241248 | -473756.970156 | -481771.986561 | -478693.065988 | 3525.804585 | 487 |
| 4 | 0.405945 | 0.021559 | 0.000000 | 0.000000 | 2 | 18 | {'n_clusters': 2, 'n_init': 18} | -480550.241248 | -473756.970156 | -481771.986561 | -478693.065988 | 3525.804585 | 487 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 485 | 1.005214 | 0.092913 | 0.002667 | 0.003771 | 99 | 10 | {'n_clusters': 99, 'n_init': 10} | -176227.048829 | -176215.214301 | -177048.642860 | -176496.968663 | 390.122484 | 4 |
| 486 | 1.236349 | 0.152575 | 0.000000 | 0.000000 | 99 | 12 | {'n_clusters': 99, 'n_init': 12} | -178006.024083 | -175745.260238 | -178992.746054 | -177581.343458 | 1359.364130 | 9 |
| 487 | 1.312833 | 0.061384 | 0.001142 | 0.001615 | 99 | 14 | {'n_clusters': 99, 'n_init': 14} | -180011.623171 | -175759.098211 | -176943.158833 | -177571.293405 | 1792.002030 | 8 |
| 488 | 1.290635 | 0.010339 | 0.000000 | 0.000000 | 99 | 16 | {'n_clusters': 99, 'n_init': 16} | -175147.615602 | -173709.914062 | -177993.799225 | -175617.109630 | 1780.119192 | 2 |


In [14]:
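try:
    # Path is relative to the notebook's working directory
    path = "./data/Cust_Segmentation-1.csv"
    df = pd.read_csv(path)

except FileNotFoundError as e:
    # Report a missing file instead of silently swallowing every exception
    print(e)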

[Errno 2] No such file or directory: './data/Cust_Segmentation-1.csv'


In [15]:
df.head()

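| | Customer Id | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | Address | DebtIncomeRatio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41 | 2 | 6 | 19 | 0.124 | 1.073 | 0.0 | NBA001 | 6.3 |
| 1 | 2 | 47 | 1 | 26 | 100 | 4.582 | 8.218 | 0.0 | NBA021 | 12.8 |
| 2 | 3 | 33 | 2 | 10 | 57 | 6.111 | 5.802 | 1.0 | NBA013 | 20.9 |
| 3 | 4 | 29 | 2 | 4 | 19 | 0.681 | 0.516 | 0.0 | NBA009 | 6.3 |
| 4 | 5 | 47 | 1 | 31 | 253 | 9.308 | 8.908 | 0.0 | NBA008 | 7.2 |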


In [33]:
try:
    print("Reading dataframe...")
    # "file_path" is a placeholder; point it at a real Parquet file
    cdf = pd.read_parquet("file_path")
    # Report the shape of the frame that was just read (cdf, not df)
    print(f"The Dataframe has {cdf.shape[0]} rows and {cdf.shape[1]} columns")
    print("Preprocessing of the data...")

except Exception as e:
    print()
    print(f"Error Msg: {e}")

Reading dataframe...

Error Msg: [Errno 2] No such file or directory: 'file_path'
