# **Problem 3: PCA**

To build a k-Nearest Neighbor (kNN) classifier for the gisette data set and evaluate the effect of dimensionality reduction using PCA, I followed the steps below.

Steps 1 to 5 are run twice: once on the original data and once on the data reduced with PCA.

1. Load the trainSet, trainLabels, and testSet data.

2. Define the cross-validation split. The plan names Leave One Out (LOO); the code below uses 5-fold cross-validation (`cv=5`).

3. For each k from 1 to 10, run kNN under the cross-validation split and record the score.

4. Pick the k with the lowest error (highest mean accuracy) as the best k.

5. Predict the labels of the test data with the best k.

Finally, compare the test-set accuracy of the two models, computed from the given and predicted test labels.
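Steps 2 to 4 can be sketched with a small brute-force LOO loop. This is only an illustration under stated assumptions: the helper `loo_knn_error`, the toy arrays `X_toy` and `y_toy`, the -1/+1 label coding, and the tie-breaking rule are all hypothetical and not part of the notebook. The pairwise-distance array it builds would be far too large for the full 6000 x 5000 training set.

```python
import numpy as np

def loo_knn_error(X, y, k):
    """Leave-one-out error of a kNN classifier with Euclidean distance."""
    # Pairwise squared Euclidean distances between all samples.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sum(diffs ** 2, axis=-1)
    # A held-out sample must not vote for itself.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest remaining samples for each held-out sample.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Majority vote over neighbor labels (assumed -1 / +1; ties go to +1, an arbitrary choice).
    votes = y[neighbors].sum(axis=1)
    pred = np.where(votes >= 0, 1.0, -1.0)
    return np.mean(pred != y)

# Hypothetical toy data: two well-separated 2-D clusters labeled -1 and +1.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-5, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y_toy = np.array([-1.0] * 20 + [1.0] * 20)

# Step 3: LOO error for each k from 1 to 10. Step 4: keep the k with the lowest error.
errors = {k: loo_knn_error(X_toy, y_toy, k) for k in range(1, 11)}
best_k = min(errors, key=errors.get)  # first k that reaches the lowest error
print(errors, best_k)
```

Because the dictionary is keyed by k itself, the selected value is already the number of neighbors and needs no index conversion.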

In [4]:
import numpy as np

trainX = np.loadtxt("/content/gisette_trainSet.txt", dtype=float)

trainY = np.loadtxt("/content/gisette_trainLabels.txt", dtype=float)

In [5]:
trainX.shape

(6000, 5000)

In [6]:
testX = np.loadtxt("/content/gisette_testSet.txt", dtype=float)

testY = np.loadtxt("/content/gisette_testLabels.txt", dtype=float)

In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


score_val = []
for k in range(1, 11):
    print("Executing k=", k)
    model = KNeighborsClassifier(n_neighbors=k)
    score_val.append(np.mean(cross_val_score(model, trainX, trainY, cv=5)))

print(score_val)

max_value = max(score_val)
# np.argmax returns a 0-based position; score_val[0] belongs to k=1, so the best k is position + 1.
max_index = np.argmax(score_val)
print("Best k = {}".format(max_index + 1))

Executing k= 1
Executing k= 2
Executing k= 3
Executing k= 4
Executing k= 5
Executing k= 6
Executing k= 7
Executing k= 8
Executing k= 9
Executing k= 10
[0.9560000000000001, 0.9593333333333334, 0.9636666666666667, 0.9633333333333333, 0.9593333333333334, 0.9630000000000001, 0.9606666666666668, 0.9633333333333333, 0.9613333333333334, 0.9629999999999999]
Best k = 3
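One detail worth checking in the selection step: `np.argmax` returns a zero-based position, while the loop runs k from 1 to 10, so position p corresponds to k = p + 1. A minimal sketch of that mapping, using the cross-validation scores listed above rounded to four decimals (the names `scores`, `k_values`, and `position` are illustrative only):

```python
import numpy as np

# Mean 5-fold accuracies for k = 1..10, rounded from the scores printed above.
scores = [0.9560, 0.9593, 0.9637, 0.9633, 0.9593, 0.9630, 0.9607, 0.9633, 0.9613, 0.9630]
k_values = list(range(1, 11))

position = int(np.argmax(scores))  # 0-based position of the highest score
best_k = k_values[position]        # map the position back to the k it was computed for
print(position, best_k)
```

Indexing an explicit list of k values avoids having to remember the offset if the range of k ever changes.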


In [11]:
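# max_index is the raw np.argmax position (0-based), so the model below uses n_neighbors = max_index;
# the k with the highest cross-validation score is max_index + 1.
model = KNeighborsClassifier(n_neighbors=max_index)
model.fit(trainX, trainY)
print(" Without PCA, Score for best k is {}".format(model.score(testX, testY)))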

 Without PCA, Score for best k is 0.961


# **With PCA**

In [16]:
from sklearn.decomposition import PCA

# Perform PCA on the training data and reduce its dimensionality
model_pca = PCA(n_components=50, random_state=0)
trainX_reduced = model_pca.fit_transform(trainX)

# Find the best value of k using cross-validation
score_val = []
for k in range(1, 11):
    pca_knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(pca_knn, trainX_reduced, trainY, cv=5)
    score_val.append(np.mean(cv_scores))
    
print(score_val)
# np.argmax returns a 0-based position; score_val[0] belongs to k=1, so the best k is position + 1.
max_index = np.argmax(score_val)
print("Best k = {}".format(max_index + 1))

[0.9671666666666667, 0.9665000000000001, 0.9696666666666666, 0.9706666666666667, 0.9724999999999999, 0.9719999999999999, 0.9706666666666667, 0.9716666666666667, 0.9708333333333334, 0.9705]
Best k = 5
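In the cell above, PCA is fitted on all training rows before cross-validation, so every validation fold has already influenced the projection. PCA does not use the labels, so the effect is probably small, but one way to avoid it is to wrap PCA and kNN in a scikit-learn pipeline so the projection is re-fitted inside each fold. The sketch below runs on hypothetical random data (`X_demo`, `y_demo`, 5 components) rather than the gisette files:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for trainX / trainY: 200 samples, 30 features, labels -1 / +1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, 1.0, -1.0)

mean_scores = []
for k in range(1, 11):
    # PCA is re-fitted on the training part of every fold; kNN then runs on the projection.
    pipe = make_pipeline(
        PCA(n_components=5, random_state=0),
        KNeighborsClassifier(n_neighbors=k),
    )
    mean_scores.append(np.mean(cross_val_score(pipe, X_demo, y_demo, cv=5)))

best_k = int(np.argmax(mean_scores)) + 1  # +1 because position 0 holds the score for k=1
print(best_k, mean_scores[best_k - 1])
```

The fitted pipeline can also be applied to the test data directly with `pipe.fit(...)` followed by `pipe.score(...)`, which keeps the train-only PCA fit and the test-time transform in one object.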


In [14]:
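# max_index is the raw np.argmax position (0-based), so the model below uses n_neighbors = max_index;
# the k with the highest cross-validation score is max_index + 1.
knn_pca = KNeighborsClassifier(n_neighbors=max_index)
knn_pca.fit(trainX_reduced, trainY)

# Project the test data with the PCA that was fitted on the training data.
testX_reduced = model_pca.transform(testX)

print(" With PCA, Score for best k is {}".format(knn_pca.score(testX_reduced, testY)))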

 With PCA, Score for best k is 0.971


# **Comparison**

In [15]:
print(" Without PCA, Score for best k is {}".format(model.score(testX, testY, sample_weight=None)))
print(" With PCA, Score for best k is {}".format(knn_pca.score(testX_reduced, testY, sample_weight=None)))

 Without PCA, Score for best k is 0.961
 With PCA, Score for best k is 0.971


**Observation:** Test accuracy improves by 1 percentage point, from 0.961 on the original 5000-feature data to 0.971 on the data reduced to 50 principal components with PCA.
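To put the gap in context, the same two numbers can also be read as error rates. The short calculation below uses only the two test accuracies reported above; the variable names are illustrative:

```python
acc_original = 0.961  # test accuracy without PCA, from the output above
acc_pca = 0.971       # test accuracy with PCA, from the output above

# Absolute gain in percentage points.
gain_points = round((acc_pca - acc_original) * 100, 2)

# Relative reduction of the error rate (1 - accuracy).
err_original = 1 - acc_original
err_pca = 1 - acc_pca
relative_error_reduction = round((err_original - err_pca) / err_original * 100, 1)

print(gain_points, relative_error_reduction)
```

The error rate drops from 3.9% to 2.9%, which is roughly a quarter of the original errors.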

# **Cross Validation Approach:**

Cross-validation is a technique used in machine learning and statistics to evaluate the performance of a predictive model by partitioning the available data into multiple sets, or "folds," and using these folds to train and test the model iteratively.

The basic idea of cross-validation is to split the available data into k subsets, or "folds," of approximately equal size (this k is the number of folds, not the number of neighbors in kNN). One fold is held out as the test set, and the remaining k-1 folds form the training set. The model is trained on the training set and evaluated on the held-out fold. This process is repeated k times, with each fold serving as the test set exactly once, and the results are averaged to obtain an overall estimate of model performance. The code above uses 5 folds (`cv=5`); Leave One Out is the special case where the number of folds equals the number of samples.
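The fold rotation described above can be sketched in a few lines. This is a simplified illustration, not how `cross_val_score` works internally (scikit-learn stratifies the folds for classifiers), and the helper name `k_fold_indices` is hypothetical:

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample is in the test fold exactly once."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

# Hypothetical example: 10 samples split into 5 folds of 2.
test_counts = np.zeros(10, dtype=int)
for train_idx, test_idx in k_fold_indices(10, 5):
    test_counts[test_idx] += 1  # count how often each sample is held out
    print("train:", train_idx, "test:", test_idx)
```

Calling `k_fold_indices(n, n)` gives one sample per test fold, which is the Leave One Out split.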

The advantage of cross-validation is that it provides a more reliable estimate of model performance than a single train/test split, since it averages over multiple test sets and thus reduces the variance of the estimate. It also lets every sample be used for both training and testing, which matters for small data sets. In short, it gives a sounder basis for choosing hyperparameters such as k in kNN, which is where the bias-variance tradeoff is controlled: a small k gives a flexible, high-variance classifier, while a large k gives a smoother, higher-bias one.