# Step 1: Translation and Standardization

In this step, we center and standardize the dataset: we subtract each feature's mean from its values (centering, i.e. translating the data to the origin), then divide by that feature's standard deviation (standardization). This puts all features on the same scale, which matters because PCA looks for directions of maximum variance and would otherwise be dominated by features with large numeric ranges.

In [None]:
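import pandas as pd
import numpy as np

data = pd.read_csv('breast_dataset.csv')
# Assumption: only numeric columns are features; any text columns (e.g. labels) are dropped
X = data.select_dtypes(include='number').to_numpy(dtype=float)
# Center each feature, then scale it to unit (population) standard deviation
stdata = (X - X.mean(axis=0)) / X.std(axis=0)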
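
As a quick sanity check, the standardization can be verified on synthetic data. The sketch below is illustrative only: the `standardize` helper and the random matrix are assumptions, not part of the original notebook, and it presumes no feature is constant (a zero standard deviation would cause a division by zero).

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit population standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Synthetic stand-in for the real dataset: 200 samples, 3 features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -5.0, 0.5], scale=[100.0, 1.0, 0.01], size=(200, 3))

Z = standardize(X)
col_means = Z.mean(axis=0)  # expected to be numerically zero
col_stds = Z.std(axis=0)    # expected to be one
print(col_means, col_stds)
```

After standardization every column should have mean 0 and standard deviation 1, regardless of its original scale.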

# Step 2: Compute the Covariance Matrix

The covariance matrix records how each pair of features varies together: entry (i, j) is the covariance between feature i and feature j. PCA is built directly on this matrix, because its eigenvectors are the principal directions of the data.

In [None]:
covmat = np.cov(stdata, rowvar=False)  # rows are samples, columns are features
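
One detail worth seeing in code: because the data were standardized first, this covariance matrix is the correlation matrix of the original features, up to a factor of n / (n - 1). The factor appears because `np.std` divides by n by default while `np.cov` divides by n - 1. The sketch below checks this on synthetic data (the random matrix is an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
# Mixing independent columns produces correlated synthetic features
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0), as in Step 1
covmat = np.cov(Z, rowvar=False)          # sample covariance (divides by n - 1)
corr = np.corrcoef(X, rowvar=False)       # correlation of the raw features

matches = np.allclose(covmat, corr * n / (n - 1))
print(matches)
```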

# Step 3: Eigendecomposition of the Covariance Matrix

Now we perform eigendecomposition on the covariance matrix. This will give us the eigenvalues and eigenvectors, which represent the variance and direction of the principal components, respectively.

In [None]:
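evals, evecs = np.linalg.eigh(covmat)  # symmetric input: real eigenvalues, ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # largest-variance component first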
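
The decomposition can be checked against its definition. The following sketch, on a synthetic covariance matrix (an assumption for illustration), uses `np.linalg.eigh`, sorts the components by decreasing eigenvalue, and confirms that each eigenvector column v satisfies covmat @ v = λ v and that the eigenvectors are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 5))
covmat = np.cov(A, rowvar=False)

evals, evecs = np.linalg.eigh(covmat)
order = np.argsort(evals)[::-1]           # indices of eigenvalues, largest first
evals, evecs = evals[order], evecs[:, order]

# Column j of evecs is scaled by evals[j] through broadcasting
eig_ok = np.allclose(covmat @ evecs, evecs * evals)
ortho_ok = np.allclose(evecs.T @ evecs, np.eye(5))
print(eig_ok, ortho_ok)
```

Sorting matters for the later steps: "the first 2 principal components" only means the two highest-variance directions when the columns are ordered this way.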

# Step 4: Project the Data onto the Principal Components

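In this step, we project the standardized data onto the eigenvectors (principal components). This expresses the data in a new coordinate system whose axes are the principal components. The projection itself keeps every dimension; dimensionality is reduced only when we keep the first few columns of the result, which retain most of the variance provided the components are ordered by decreasing eigenvalue.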

In [None]:
projdata = np.dot(stdata, evecs)
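
A useful property of the projected data is that its coordinates are uncorrelated, and the variance along each one equals the corresponding eigenvalue. The sketch below checks this on synthetic data (the random matrix is an assumption; variable names mirror the cells above).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

covmat = np.cov(Z, rowvar=False)
evals, evecs = np.linalg.eigh(covmat)
evals, evecs = evals[::-1], evecs[:, ::-1]  # largest eigenvalue first

projdata = Z @ evecs
proj_cov = np.cov(projdata, rowvar=False)

# Off-diagonal entries vanish; the diagonal reproduces the eigenvalues
diagonal_ok = np.allclose(proj_cov, np.diag(evals))
print(diagonal_ok)
```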

# Step 5: Calculate the Percentage of Variance Explained by Each Principal Component

To evaluate the performance of PCA, we calculate how much variance each principal component explains. This helps us decide how many components to retain for optimal dimensionality reduction.

In [None]:
explvar = evals / np.sum(evals)
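
To turn the explained-variance ratios into a decision, one common approach is to keep the smallest number of components whose cumulative ratio reaches a chosen threshold. The helper `components_for` and the example eigenvalues below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def components_for(evals, threshold):
    """Smallest k whose top-k eigenvalues explain at least `threshold` of the variance."""
    evals = np.sort(np.asarray(evals, dtype=float))[::-1]
    cumulative = np.cumsum(evals) / np.sum(evals)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(evals))  # guard against rounding at threshold = 1.0

example_evals = [4.0, 3.0, 2.0, 1.0]  # made-up eigenvalues: ratios 0.4, 0.3, 0.2, 0.1
k = components_for(example_evals, 0.8)
print(k)
```

Here the cumulative ratios are 0.4, 0.7, 0.9, 1.0, so three components are needed to explain at least 80% of the variance.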

# Step 6: Reconstruct the Data

Projecting back with all of the eigenvectors should recover the standardized data exactly (up to floating-point error), because the eigenvector matrix is orthogonal. This is a sanity check on the transformation; no information is lost until components are dropped, as in Step 8.

In [None]:
recondata = np.dot(projdata, evecs.T)
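
The round trip can be verified numerically. This sketch uses a small synthetic matrix (an assumption for illustration) and measures the largest absolute difference between the data and its full reconstruction.

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(50, 3))
Z = Z - Z.mean(axis=0)  # centered synthetic data

evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
projdata = Z @ evecs
recondata = projdata @ evecs.T  # evecs @ evecs.T is the identity, so this undoes the projection

max_abs_error = np.max(np.abs(recondata - Z))
print(max_abs_error)
```

The error should be at the level of floating-point rounding rather than exactly zero.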

# Step 7: Visualizing the Data Using the First 2 and 3 Principal Components

We can visualize the data projected onto the first 2 principal components (2D plot) and the first 3 principal components (3D plot) to see how well PCA separates the data.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))
plt.scatter(projdata[:, 0], projdata[:, 1])
plt.title('2D PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(projdata[:, 0], projdata[:, 1], projdata[:, 2])
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.title('3D PCA')
plt.show()
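
To judge how well the components separate the data, it helps to color the points by class. The sketch below assumes a label array is available alongside the projected data; both `labels` and the two synthetic clusters are hypothetical stand-ins, since the notebook does not show which column of the dataset holds the class.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical stand-ins: two clusters of projected points and their 0/1 class labels
projdata = np.vstack([rng.normal(-2, 1, size=(60, 2)), rng.normal(2, 1, size=(40, 2))])
labels = np.array([0] * 60 + [1] * 40)

fig, ax = plt.subplots(figsize=(8, 6))
for cls in np.unique(labels):
    mask = labels == cls  # select the points belonging to this class
    ax.scatter(projdata[mask, 0], projdata[mask, 1], label=f"class {cls}", alpha=0.6)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("2D PCA colored by class")
ax.legend()
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` added at the end.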

# Step 8: Compare Original Data and Reconstructed Data

Finally, we reconstruct the data using fewer principal components (e.g., the first 2) and compare it with the original standardized data to see how much information is retained.

In [None]:
reddata = projdata[:, :2]
reconred = np.dot(reddata, evecs[:, :2].T)

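plt.figure(figsize=(8,6))
# np.asarray makes positional indexing work whether stdata is a DataFrame or an array
orig = np.asarray(stdata)
plt.scatter(orig[:, 0], orig[:, 1], label='Original', alpha=0.5)
plt.scatter(reconred[:, 0], reconred[:, 1], label='Reconstructed', alpha=0.5)
plt.xlabel('Feature 1 (standardized)')
plt.ylabel('Feature 2 (standardized)')
plt.legend()
plt.show()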
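
The scatter plot compares only two features, so a numeric measure of what was lost is a helpful complement. For centered data, the squared reconstruction error from keeping k components, divided by n - 1, equals the sum of the discarded eigenvalues. The sketch below checks that on synthetic data (the random matrix is an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n = Z.shape[0]

evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
evals, evecs = evals[::-1], evecs[:, ::-1]  # largest eigenvalue first

errors = []
for k in range(1, 6):
    # Project onto the top k components, then map back to feature space
    recon_k = (Z @ evecs[:, :k]) @ evecs[:, :k].T
    errors.append(np.sum((Z - recon_k) ** 2) / (n - 1))
errors = np.array(errors)

# Sum of the eigenvalues left out when keeping k components
discarded = np.array([evals[k:].sum() for k in range(1, 6)])
print(np.allclose(errors, discarded))
```

The error shrinks as k grows and reaches zero (up to rounding) when all components are kept.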