For this analysis I'm going to use the Fashion MNIST data set. 

The Fashion MNIST training set is composed of 60,000 small 28x28 grayscale images of 10 types of clothing items, such as shoes, T-shirts, and dresses. Each item label is mapped to an integer from 0 to 9:

- 0: T-shirt/top
- 1: Trouser
- 2: Pullover
- 3: Dress
- 4: Coat
- 5: Sandal
- 6: Shirt
- 7: Sneaker
- 8: Bag
- 9: Ankle boot
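
As a small illustration of the encoding above, the sketch below maps an integer label to its class name and reshapes one flattened row of 784 pixel values into a 28x28 image. The names `LABEL_NAMES` and `row_to_image` are hypothetical helpers, and the random row is only a stand-in for real pixel data.

```python
import numpy as np

# Class names in label order, copied from the list above.
LABEL_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def row_to_image(row):
    """Reshape a flat vector of 784 pixel values into a 28x28 array."""
    pixels = np.asarray(row)
    if pixels.size != 28 * 28:
        raise ValueError(f"expected 784 pixel values, got {pixels.size}")
    return pixels.reshape(28, 28)


# Stand-in for one row of pixel columns (assumption: integer values in 0-255).
rng = np.random.default_rng(0)
fake_row = rng.integers(0, 256, size=784)

image = row_to_image(fake_row)
label = 9
print(image.shape, LABEL_NAMES[label])
```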

I will apply dimension reduction techniques combined with classification methods to build a classifier.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('fashion-mnist_train.csv')
data.drop_duplicates(inplace=True)
X = data.drop('label',axis=1)
y = data.label

# To keep the run time manageable, use only the rows with index labels 0 through 1500.
# Note: .loc slicing is label-based and inclusive, so this keeps up to 1,501 rows.
X = X.loc[0:1500,]
y = y.loc[0:1500,] 

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 862)
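
One detail in the cell above is worth checking: `.loc[0:1500]` slices by index label and includes both endpoints, whereas `.iloc[:1500]` slices by position and excludes the end. The sketch below shows the difference on a synthetic frame (the column names and the 2,000-row size are assumptions made for illustration) and then applies the same default split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 2,000 rows with a default integer index.
df = pd.DataFrame({"label": np.arange(2000) % 10, "pixel1": np.zeros(2000)})

by_label = df.loc[0:1500]      # label-based slice: both endpoints included
by_position = df.iloc[:1500]   # position-based slice: end excluded

print(len(by_label), len(by_position))

# With the default test_size of 0.25, check how the rows are divided.
train, test = train_test_split(by_label, random_state=862)
print(len(train), len(test))
```

On this stand-in the label-based slice keeps 1,501 rows, and the default split should give 1,125 training rows and 376 test rows. A 376-row test set is consistent with the accuracies reported in this notebook, for example 0.8298 ≈ 312/376.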

In [13]:
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.manifold import Isomap
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

This data set is a multi-class data set. Some general rules you should follow:

1. Tune the dimension reduction technique
2. Tune the model
3. Select the hyperparameters based on a hold-out set (either via CV or train/validate/test split)
4. Report the accuracy on the test set

I will be using kernel PCA, LLE, and Isomap. For each dimension reduction technique, I will perform classification with two classifiers.
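
The four rules above can be sketched as a single tune-then-test loop. This is a minimal illustration on scikit-learn's bundled 8x8 digits data rather than Fashion MNIST, with a deliberately tiny grid; the variable names (`X_demo`, `demo_grid`, `search`) are assumptions and the grid values are not the ones used in this analysis.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Stand-in data: a 400-image subset of scikit-learn's bundled digits.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:400], y_demo[:400]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

pipe = Pipeline([
    ("kpca", KernelPCA()),
    ("knn", KNeighborsClassifier()),
])

# Rules 1 and 2: tune the reducer and the classifier in one grid.
demo_grid = {
    "kpca__n_components": [5, 10],
    "kpca__kernel": ["linear", "poly"],
    "knn__n_neighbors": [3, 5],
}

# Rule 3: 5-fold CV on the training split selects the hyperparameters.
search = GridSearchCV(pipe, demo_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# Rule 4: refit=True (the default) has already refit the best pipeline on the
# whole training split, so it can be scored on the held-out test split directly.
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 4))
```

Scoring through `search` (or `search.best_estimator_`) avoids rebuilding the winning pipeline by hand, which is what the `# TEST` blocks in the cells below do explicitly.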

#### Kernel PCA

In [16]:
# TRAIN
param_grid = {
    "kpca__n_components": [2, 10, 25, 50, 100],
    "kpca__gamma": np.linspace(0.03, 0.05, 5),  # used by "sigmoid" and "poly"; ignored by "linear"
    "kpca__kernel": ["linear", "sigmoid", "poly"],
}

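clf = Pipeline([
    ("kpca", KernelPCA()),
    # Note: ccp_alpha=2 exceeds any attainable impurity decrease, so pruning collapses
    # the tree to a single leaf that always predicts one class.
    ("dtc", DecisionTreeClassifier(max_depth=2, max_leaf_nodes=2, ccp_alpha=2))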
])

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring="accuracy")
grid_search.fit(X_train, y_train)

print("Best Parameters (Train Set):")
for param in grid_search.best_params_:
    print(f"    {param}: {grid_search.best_params_[param]}")
print(f"Best CV Score: {round(grid_search.best_score_, 4)}")

# TEST
best_params = grid_search.best_params_
best_kpca = KernelPCA(n_components=best_params["kpca__n_components"],
                     gamma=best_params["kpca__gamma"],
                     kernel=best_params["kpca__kernel"])
best_clf = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=2, ccp_alpha=2)

best_pipeline = Pipeline([
    ("kpca", best_kpca),
    ("dtc", best_clf)
])

best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on Test Set: {round(accuracy, 4)}")


Best Parameters (Train Set):
    kpca__gamma: 0.03
    kpca__kernel: linear
    kpca__n_components: 2
Best CV Score: 0.1147

Accuracy on Test Set: 0.0984


In [17]:
# TRAIN
param_grid = {
    "kpca__n_components": [2, 10, 25, 50, 100],
    "kpca__kernel": ["linear", "sigmoid", "poly"],
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "knn__n_neighbors": range(2, 10),
}

clf = Pipeline([
    ("kpca", KernelPCA()),
    ("knn", KNeighborsClassifier())
])

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters (Train Set):")
for param in grid_search.best_params_:
    print(f"    {param}: {grid_search.best_params_[param]}")
print(f"\nBest CV Score: {round(grid_search.best_score_, 4)}")

# TEST
best_params = grid_search.best_params_
best_kpca = KernelPCA(n_components=best_params["kpca__n_components"],
                     kernel=best_params["kpca__kernel"],
                     gamma=best_params["kpca__gamma"])
best_knn = KNeighborsClassifier(n_neighbors=best_params["knn__n_neighbors"])

best_pipeline = Pipeline([
    ("kpca", best_kpca),
    ("knn", best_knn)
])

best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on Test Set: {round(accuracy, 4)}")


Best Parameters (Train Set):
    knn__n_neighbors: 5
    kpca__gamma: 0.03
    kpca__kernel: linear
    kpca__n_components: 50

Best CV Score: 0.7511

Accuracy on Test Set: 0.8298


#### LLE

In [19]:
# TRAIN
param_grid = {
    "lle__n_components": [2, 10, 25, 50, 100],
    "lle__n_neighbors": range(2, 7, 2),
    "dtc__max_depth": [2, 4, 8, 16]
}

clf = Pipeline([
    ("lle", LocallyLinearEmbedding()),
    ("dtc", DecisionTreeClassifier(max_leaf_nodes=2, ccp_alpha=2))
])

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters (Train Set):")
for param in grid_search.best_params_:
    print(f"    {param}: {grid_search.best_params_[param]}")
print(f"\nBest CV Score: {round(grid_search.best_score_, 4)}")

# TEST
best_params = grid_search.best_params_
best_lle = LocallyLinearEmbedding(n_components=best_params["lle__n_components"],
                                  n_neighbors=best_params["lle__n_neighbors"])
best_clf = DecisionTreeClassifier(max_depth=best_params["dtc__max_depth"], max_leaf_nodes=2, ccp_alpha=2)

best_pipeline = Pipeline([
    ("lle", best_lle),
    ("dtc", best_clf)
])

best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on Test Set: {round(accuracy, 4)}")


Best Parameters (Train Set):
    dtc__max_depth: 2
    lle__n_components: 2
    lle__n_neighbors: 4

Best CV Score: 0.1147

Accuracy on Test Set: 0.0984


In [20]:
# TRAIN
param_grid = {
    "lle__n_components": [2, 10, 25, 50, 100],
    "lle__n_neighbors": range(2, 11, 2),
    "knn__n_neighbors": range(2, 10),
}

clf = Pipeline([
    ("lle", LocallyLinearEmbedding()),
    ("knn", KNeighborsClassifier())
])

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters (Train Set):")
for param in grid_search.best_params_:
    print(f"    {param}: {grid_search.best_params_[param]}")
print(f"\nBest CV Score: {round(grid_search.best_score_, 4)}")

# TEST
best_params = grid_search.best_params_
best_lle = LocallyLinearEmbedding(n_components=best_params["lle__n_components"],
                                  n_neighbors=best_params["lle__n_neighbors"])
best_knn = KNeighborsClassifier(n_neighbors=best_params["knn__n_neighbors"])

best_pipeline = Pipeline([
    ("lle", best_lle),
    ("knn", best_knn)
])

best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on Test Set: {round(accuracy, 4)}")

Best Parameters (Train Set):
    knn__n_neighbors: 7
    lle__n_components: 50
    lle__n_neighbors: 8

Best CV Score: 0.7209

Accuracy on Test Set: 0.7739


#### Isomap

In [21]:
# TRAIN
param_grid = {
    "isomap__n_components": [2, 10, 25, 50, 100],
    "isomap__n_neighbors": range(2, 11, 2),
    "dtc__max_depth": [2, 4, 8, 16]
}

clf = Pipeline([
    ("isomap", Isomap()),
    ("dtc", DecisionTreeClassifier(max_leaf_nodes=2, ccp_alpha=2))
])

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters (Train Set):")
for param in grid_search.best_params_:
    print(f"    {param}: {grid_search.best_params_[param]}")
print(f"\nBest CV Score: {round(grid_search.best_score_, 4)}")

# TEST
best_params = grid_search.best_params_
best_isomap = Isomap(n_components=best_params["isomap__n_components"],
                     n_neighbors=best_params["isomap__n_neighbors"])
best_clf = DecisionTreeClassifier(max_depth=best_params["dtc__max_depth"], max_leaf_nodes=2, ccp_alpha=2)

best_pipeline = Pipeline([
    ("isomap", best_isomap),
    ("dtc", best_clf)
])

best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on Test Set: {round(accuracy, 4)}")

Best Parameters (Train Set):
    dtc__max_depth: 2
    isomap__n_components: 2
    isomap__n_neighbors: 2

Best CV Score: 0.1147

Accuracy on Test Set: 0.0984


In [22]:
# TRAIN
param_grid = {
    "isomap__n_components": [2, 10, 25, 50, 100],
    "isomap__n_neighbors": range(2, 11, 2),
    "knn__n_neighbors": range(2, 10),
}

clf = Pipeline([
    ("isomap", Isomap()),
    ("knn", KNeighborsClassifier())
])

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters (Train Set):")
for param in grid_search.best_params_:
    print(f"    {param}: {grid_search.best_params_[param]}")
print(f"\nBest CV Score: {round(grid_search.best_score_, 4)}")

# TEST
best_params = grid_search.best_params_
best_isomap = Isomap(n_components=best_params["isomap__n_components"],
                     n_neighbors=best_params["isomap__n_neighbors"])
best_knn = KNeighborsClassifier(n_neighbors=best_params["knn__n_neighbors"])

best_pipeline = Pipeline([
    ("isomap", best_isomap),
    ("knn", best_knn)
])

best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on Test Set: {round(accuracy, 4)}")

Best Parameters (Train Set):
    isomap__n_components: 25
    isomap__n_neighbors: 4
    knn__n_neighbors: 7

Best CV Score: 0.7129

Accuracy on Test Set: 0.8059


#### What is the best combination according to your accuracy score on the test set?

A summary of the results is presented in the following table:

| Dimension Reduction | Classifier    | Accuracy Score |
| ------------------- | ------------- | -------------- |
| KPCA                | Decision Tree | 0.0984         |
| **KPCA**            | **KNN**       | **0.8298**     |
| LLE                 | Decision Tree | 0.0984         |
| LLE                 | KNN           | 0.7739         |
| Isomap              | Decision Tree | 0.0984         |
| Isomap              | KNN           | 0.8059         |

The best combination appears to be KPCA with KNN, which achieved an accuracy score of 0.8298 on the test set. 

KPCA seems to be effective in preserving important information about the dataset. Because the selected kernel is linear, KPCA here is equivalent to ordinary PCA: it projects the data onto the directions of greatest variance rather than mapping it into a higher-dimensional feature space.

KNN can work well when the dimensionality of the data is reduced effectively because it relies on distance metrics to make predictions. Dimensionality reduction techniques like KPCA can help in finding meaningful representations of data points in lower dimensions.
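
Since the grid search selected the linear kernel, it may help to see how linear KPCA relates to ordinary PCA. The sketch below compares the two on synthetic data (the array `X_demo` and its shape are assumptions made for illustration); the projections should agree up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

# Synthetic data with clearly different variances per column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10)) * np.arange(1, 11)

Z_pca = PCA(n_components=3).fit_transform(X_demo)
Z_kpca = KernelPCA(n_components=3, kernel="linear").fit_transform(X_demo)

# Components are only defined up to sign, so compare absolute values.
same_projection = np.allclose(np.abs(Z_pca), np.abs(Z_kpca), atol=1e-6)
print(same_projection)
```

If the check passes, any benefit of linear KPCA for KNN comes from keeping the highest-variance directions and discarding the rest, not from a nonlinear feature map.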

---

Now, using the original (non-reduced) data set and the same two classifiers, I'll run the procedure again without any dimension reduction.

In [23]:
# TRAIN
param_grid = {
    "dtc__max_depth": range(2, 8),
    "dtc__max_leaf_nodes": range(2, 8),
    "dtc__ccp_alpha": np.linspace(0, 10, 5)
}

clf = Pipeline([
    ("dtc", DecisionTreeClassifier())
])

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring="accuracy")
grid_search.fit(X_train, y_train)

print("Best Parameters (Train Set):")
for param in grid_search.best_params_:
    print(f"    {param}: {grid_search.best_params_[param]}")
print(f"\nBest CV Score: {round(grid_search.best_score_, 4)}")

# TEST
best_params = grid_search.best_params_
best_dtc = DecisionTreeClassifier(max_depth=best_params["dtc__max_depth"],
                     max_leaf_nodes=best_params["dtc__max_leaf_nodes"],
                     ccp_alpha=best_params["dtc__ccp_alpha"])

best_pipeline = Pipeline([
    ("dtc", best_dtc)
])

best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on Test Set: {round(accuracy, 4)}")

Best Parameters (Train Set):
    dtc__ccp_alpha: 0.0
    dtc__max_depth: 4
    dtc__max_leaf_nodes: 7

Best CV Score: 0.5378

Accuracy on Test Set: 0.5878


In [26]:
# TRAIN
param_grid = {
    "knn__n_neighbors": range(2, 10),
    "knn__weights": ['uniform', 'distance'],
    "knn__p": [1, 2],
}

knn = Pipeline([
    ("knn", KNeighborsClassifier())
])

grid_search = GridSearchCV(knn, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters (Train Set):")
for param in grid_search.best_params_:
    print(f"    {param}: {grid_search.best_params_[param]}")
print(f"\nBest CV Score: {round(grid_search.best_score_, 4)}")

# TEST
best_params = grid_search.best_params_
best_knn = KNeighborsClassifier(n_neighbors=best_params["knn__n_neighbors"],
                                weights=best_params["knn__weights"],
                                p=best_params["knn__p"])

best_pipeline = Pipeline([
    ("knn", best_knn)
])

best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on Test Set: {round(accuracy, 4)}")

Best Parameters (Train Set):
    knn__n_neighbors: 6
    knn__p: 1
    knn__weights: distance

Best CV Score: 0.7502

Accuracy on Test Set: 0.8298


Observations:

Accuracy Scores for Classifiers without Dimension Reduction:
* Decision Tree: 0.5878
* KNN: 0.8298

The accuracy score for KNN in this part (0.8298) is identical to the accuracy score achieved when KNN was used with KPCA in the first part (0.8298). This suggests that, for this specific dataset, the dimension reduction technique (KPCA) didn't significantly improve or degrade the performance of KNN.

The Decision Tree behaves very differently across the two parts: it reaches 0.5878 here, compared with 0.0984 for every dimension-reduction pipeline in the first part. That earlier score is close to chance for 10 classes and most likely reflects the fixed `max_leaf_nodes=2` and `ccp_alpha=2` settings, which prune the tree down to a single leaf, rather than an effect of the dimension reduction itself. Even with its hyperparameters tuned, the Decision Tree remains well below KNN, so KNN seems to be a more suitable choice for this dataset.
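
One way to check how the fixed `max_leaf_nodes=2, ccp_alpha=2` settings from the first part behave is to fit such a tree on stand-in data and count its nodes. The sketch below uses scikit-learn's bundled digits data rather than Fashion MNIST, so it only illustrates the pruning behaviour, not the reported scores; the names `pruned` and `unpruned` are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_digits(return_X_y=True)

# Same fixed settings as the decision trees in the dimension-reduction cells.
pruned = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=2, ccp_alpha=2.0,
                                random_state=0).fit(X_demo, y_demo)

# Identical tree without cost-complexity pruning, for comparison.
unpruned = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=2, ccp_alpha=0.0,
                                  random_state=0).fit(X_demo, y_demo)

print("nodes with ccp_alpha=2:", pruned.tree_.node_count)
print("nodes with ccp_alpha=0:", unpruned.tree_.node_count)
print("distinct predictions with ccp_alpha=2:", len(set(pruned.predict(X_demo))))
```

A tree that reports a single node predicts the same class for every input, which on a roughly balanced 10-class problem gives an accuracy near 0.10.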

In summary, in this specific scenario and dataset, KNN consistently performs better than Decision Trees in terms of accuracy, both with and without dimension reduction techniques. The dimensionality reduction techniques applied in the first part did not provide an accuracy advantage in this case: KNN on the original pixels matched the best reduced pipeline (KPCA with KNN) at 0.8298.