# PCA for dimensionality reduction before classification

PCA (or any other unsupervised dimensionality reduction) can be used to transform data before fitting a supervised learning model. In this problem, we will apply PCA for dimensionality reduction, then use a support vector classifier fitted on a subset of the PCA-transformed features.

We will use K-fold cross validation to decide the number of principal components to use, according to the one-SE rule.
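As a quick illustration of the one-SE rule, the sketch below applies it to a handful of made-up cross-validation results. The numbers and the `*_demo` names are hypothetical and are not taken from `data.csv`. The idea is to choose the simplest model whose mean validation accuracy is within one standard error of the best model's.

```python
import numpy as np

# Hypothetical CV results for 5 candidate models,
# ordered from simplest (index 0) to most flexible (index 4).
acc_mean_demo = np.array([0.70, 0.80, 0.84, 0.85, 0.83])
acc_se_demo = np.array([0.03, 0.02, 0.02, 0.02, 0.03])

# Model with the best mean validation accuracy
best_idx = np.argmax(acc_mean_demo)

# One SE below the best mean accuracy
threshold = acc_mean_demo[best_idx] - acc_se_demo[best_idx]

# Simplest model (lowest index) whose mean accuracy reaches the threshold
one_se_idx = np.flatnonzero(acc_mean_demo >= threshold)[0]

print("best model index:", best_idx)
print("one-SE model index:", one_se_idx)
```

In this toy case the one-SE rule picks a simpler model than the 'best mean validation accuracy' rule does, because the small gap between them is within the noise of the cross-validation estimate.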

In this workspace, write code to split the data into training and test sets, and perform the analysis described above on the training set.

| Name | Type | Description |
| --- | --- | --- |
| `acc_mean` | 1d numpy array | Mean validation accuracy for each candidate model. |
| `acc_se` | 1d numpy array | Standard error of the mean of validation accuracy for each candidate model. |
| `n_pca_opt` | integer | Optimal number of components according to 'best mean validation accuracy' rule. |
| `n_pca_one_se` | integer | Optimal number of components according to 'one SE' rule. |


In [23]:
import numpy as np
import random

from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Upload the `data.csv` file to this workspace, then read the features into a `numpy` array `X` and the labels into `y`. If you want to, you can add code to the following cell to explore `X` (for example, to see its shape).


In [24]:
dat = np.genfromtxt('data.csv', delimiter=',')
X = dat[:, :-1]
y = dat[:, -1]

Use `train_test_split` to split the data into training and test sets. Reserve 30% of the data for the test set. 

Make sure to shuffle the data, and pass `random_state = 42` so that your random split will match the auto-grader's.

In [25]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

You will use the training data to fit a support vector classifier. However, instead of fitting the training data directly, you will first transform it using PCA. Then, you will use only a subset of features (the first `n_comp` principal components) as input to your classifier.

You will use K-fold cross validation to find the optimal value of `n_comp`. You should consider every possible value of `n_comp`, from 1 component (simplest possible model) to all of the components (most flexible model).

In the next cell,

* Use the `sklearn` implementation of `KFold` to split the training data into folds, and evaluate every candidate model on each fold. In your `KFold`, use 5 splits, and don't shuffle the data (you already shuffled it when dividing into training and test sets).
* Use the `sklearn` implementation of `PCA` to transform the data. Pass `random_state = 42` to `PCA` so that your result will match the auto-grader's.
* Use the `sklearn` implementation of `SVC` to classify the data using the first `n_comp` principal components.  Pass `random_state = 42` to `SVC` so that your result will match the auto-grader's.
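One way to avoid refitting PCA for every candidate `n_comp` is to fit it once with all components and then keep only the first `n_comp` columns of the transformed data. The sketch below checks, on synthetic data, that this gives the same features as fitting a separate PCA with `n_components=k`. The synthetic array and the explicit `svd_solver="full"` are assumptions made for the illustration and are not part of the assignment.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for one training fold (not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))

# Fit PCA once with all components, then transform
pca_all = PCA(n_components=5, svd_solver="full").fit(X_demo)
Z_all = pca_all.transform(X_demo)

# Fit a separate PCA that keeps only the first k components
k = 2
pca_k = PCA(n_components=k, svd_solver="full").fit(X_demo)
Z_k = pca_k.transform(X_demo)

# The first k columns of the full transform match the k-component transform
same = np.allclose(Z_all[:, :k], Z_k)
print("slicing matches refitting:", same)
```

Note that PCA is still fitted separately inside each fold, using only that fold's training portion, so that no information from the validation portion leaks into the transformation.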

In [26]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
n_components = np.arange(1, X_train.shape[1] + 1)
acc_mean = np.zeros(len(n_components))
acc_se = np.zeros(len(n_components))
kf = KFold(n_splits=5, shuffle=False)

# Initialize accuracy lists for each n_components
accs = [[] for _ in n_components]

for train_index, val_index in kf.split(X_train):
    X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    # Fit PCA with max components ONCE per fold
    pca_full = PCA(n_components=X_train.shape[1], random_state=42)
    X_train_pca_full = pca_full.fit_transform(X_train_fold)
    X_val_pca_full = pca_full.transform(X_val_fold)

    for j, n_comp in enumerate(n_components):
        # Slice the PCA-transformed data for the current n_components
        X_train_pca = X_train_pca_full[:, :n_comp]
        X_val_pca = X_val_pca_full[:, :n_comp]

        svc = SVC(random_state=42)
        svc.fit(X_train_pca, y_train_fold)
        y_pred = svc.predict(X_val_pca)
        accuracy = accuracy_score(y_val_fold, y_pred)

        accs[j].append(accuracy)

Compute the mean validation accuracy and the standard error of the mean validation accuracy across the folds. Save the results in `acc_mean` and `acc_se`, respectively. 
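As a small worked example, the standard error of the mean is the sample standard deviation (with `ddof=1`) divided by the square root of the number of folds. The sketch below uses made-up fold accuracies for two candidate models. The values and the `*_demo` names are hypothetical.

```python
import numpy as np

# Hypothetical fold accuracies: one row per candidate model, one column per fold
accs_demo = np.array([
    [0.70, 0.75, 0.65, 0.70, 0.70],
    [0.80, 0.85, 0.80, 0.90, 0.85],
])
n_folds = accs_demo.shape[1]

# Mean accuracy across folds for each model
mean_demo = accs_demo.mean(axis=1)

# Standard error of the mean: sample std (ddof=1) over sqrt(number of folds)
se_demo = accs_demo.std(axis=1, ddof=1) / np.sqrt(n_folds)

print("mean:", mean_demo)
print("SE:  ", se_demo)
```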

In [27]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
acc_mean = np.array([np.mean(acc) for acc in accs])
acc_se = np.array([np.std(acc, ddof=1) / np.sqrt(len(acc)) for acc in accs])

Then, find the value of `n_comp` with the best mean validation accuracy, and save this in `n_pca_opt`.

In [28]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
n_pca_opt = n_components[np.argmax(acc_mean)]

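Finally, compute the optimal `n_comp` according to the one-SE rule (the smallest `n_comp` whose mean validation accuracy is within one standard error of the best mean validation accuracy), and save this in `n_pca_one_se`.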

In [29]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
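# Best mean validation accuracy and the standard error of that model
best_acc = np.max(acc_mean)
best_acc_se = acc_se[np.argmax(acc_mean)]
# Any model at or above this threshold is within one SE of the best
threshold = best_acc - best_acc_se
# n_components is sorted ascending, so the first match is the simplest such model
n_pca_one_se = n_components[np.where(acc_mean >= threshold)[0][0]]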
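
The graded part of this problem ends with selecting `n_pca_one_se`. A natural follow-up, which is not required here, is to refit PCA and the classifier on the full training set with the selected number of components, and then evaluate once on the held-out test set. The sketch below shows one way this might look. It uses synthetic data from `make_classification` and a hypothetical `n_comp_demo` in place of `data.csv` and `n_pca_one_se`, so the accuracy it prints says nothing about the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for data.csv (assumption: the real data is not used here)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=42
)

n_comp_demo = 4  # hypothetical stand-in for n_pca_one_se

# Fit PCA on the training set only, then apply the same transform to the test set
pca_final = PCA(n_components=n_comp_demo, random_state=42)
X_tr_pca = pca_final.fit_transform(X_tr)
X_te_pca = pca_final.transform(X_te)

# Fit the classifier on the reduced training features and score on the test set
svc_final = SVC(random_state=42)
svc_final.fit(X_tr_pca, y_tr)
test_acc = accuracy_score(y_te, svc_final.predict(X_te_pca))
print("test accuracy:", test_acc)
```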