
# Principal Component Analysis

In [None]:
# Import libraries
import pandas as pd  # Data manipulation
import numpy as np  # Numerical operations
from sklearn.model_selection import train_test_split  # Split data
from sklearn.preprocessing import StandardScaler  # Standardize features
from sklearn.decomposition import PCA  # Dimensionality reduction
from sklearn.linear_model import LogisticRegression  # Logistic regression model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # Model evaluation metrics

# Load the Built-in Dataset

In [None]:
from sklearn.datasets import load_breast_cancer  # Load dataset

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()  # Load the dataset into a variable

# Separate features and target variable
X = data.data  # Features (input data)
y = data.target  # Target (labels)

In [None]:
# Convert data to DataFrame for easier manipulation
# - Create a DataFrame from the features
# - Add the target variable to the DataFrame
df = pd.DataFrame(X, columns=data.feature_names)  # Create DataFrame from features
df['target'] = y  # Add target variable to DataFrame

In [None]:
# Display first few rows of the dataset
# - Show the first 5 rows to get a glimpse of the data
print(df.head())  # Print the first 5 rows of the DataFrame


   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             

In [None]:
# Display basic statistics of the DataFrame
# - Provide summary statistics like mean and standard deviation
print(df.describe())  # Print summary statistics (count, mean, std, min, max)

       mean radius  mean texture  mean perimeter    mean area  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       mean smoothness  mean compactness  mean concavity  mean concave points  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   
min           0.052630          0.019380        0.000000             0.000000   
25%      

In [None]:
# Display class distribution (number of samples per class)
# - Count how many samples belong to each class
print(df['target'].value_counts())  # Print the count of each unique target value

target
1    357
0    212
Name: count, dtype: int64


In [None]:
# Split the dataset into training and testing sets
# - Use 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Standardize the features
# - Scale the features to have a mean of 0 and a standard deviation of 1
scaler = StandardScaler()  # Create a scaler object
X_train_scaled = scaler.fit_transform(X_train)  # Fit the scaler and transform the training data
X_test_scaled = scaler.transform(X_test)  # Transform the test data using the fitted scaler

In [None]:
# Initialize and train the logistic regression model
# - Create a logistic regression model with a maximum of 1000 iterations
log_reg = LogisticRegression(max_iter=1000)  # Initialize model
log_reg.fit(X_train_scaled, y_train)  # Train the model on the scaled training data

In [None]:
# Make predictions
# - Use the trained model to predict labels for the test set
y_pred = log_reg.predict(X_test_scaled)  # Predict on the scaled test data

In [None]:
# Evaluate the performance
# - Calculate and display the accuracy of the model
print("Accuracy without PCA:", accuracy_score(y_test, y_pred))  # Print accuracy score
# - Show detailed classification metrics (precision, recall, F1-score)
print("Classification Report without PCA:\n", classification_report(y_test, y_pred))  # Print classification report
# - Display the confusion matrix to visualize true vs. predicted labels
print("Confusion Matrix without PCA:\n", confusion_matrix(y_test, y_pred))  # Print confusion matrix

Accuracy without PCA: 0.9736842105263158
Classification Report without PCA:
               precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Confusion Matrix without PCA:
 [[41  2]
 [ 1 70]]


In [None]:
# Apply PCA
# - Keep the smallest number of principal components whose cumulative explained variance reaches 95%
pca = PCA(n_components=0.95)  # Initialize PCA to preserve 95% of variance
X_train_pca = pca.fit_transform(X_train_scaled)  # Fit PCA on training data and transform
X_test_pca = pca.transform(X_test_scaled)  # Transform the test data using the fitted PCA

In [None]:
# Display explained variance
# - Show the fraction of the total variance explained by each retained component
print("Explained Variance Ratio:", pca.explained_variance_ratio_)  # Print the explained variance ratio
# - Print the number of components that preserve 95% of the variance
print("Number of Components:", pca.n_components_)  # Print the number of components

Explained Variance Ratio: [0.43502782 0.19500007 0.09781519 0.06486409 0.05253378 0.041128
 0.0223559  0.01647952 0.01380052 0.01205262]
Number of Components: 10


In [None]:
# Train logistic regression on PCA data
log_reg_pca = LogisticRegression(max_iter=1000)  # Create model
log_reg_pca.fit(X_train_pca, y_train)  # Fit model to training data

In [None]:
# Make predictions
y_pred_pca = log_reg_pca.predict(X_test_pca)  # Predict on test data

In [None]:
# Evaluate performance
print("Accuracy with PCA:", accuracy_score(y_test, y_pred_pca))  # Show accuracy
print("Classification Report with PCA:\n", classification_report(y_test, y_pred_pca))  # Show detailed metrics
print("Confusion Matrix with PCA:\n", confusion_matrix(y_test, y_pred_pca))  # Show confusion matrix

Accuracy with PCA: 0.9824561403508771
Classification Report with PCA:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        43
           1       0.99      0.99      0.99        71

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

Confusion Matrix with PCA:
 [[42  1]
 [ 1 70]]


Impact of PCA on Performance:
PCA reduces the number of input dimensions while keeping most of the variance in the data. Here, 10 principal components replace the 30 original features, which makes the model simpler and can make it faster to train when a dataset has many features.
Reducing dimensions discards some information, which can affect the model's performance, so compare the metrics before and after applying PCA. On this split, test accuracy went from 0.974 without PCA (3 errors in 114 samples) to 0.982 with PCA (2 errors). That is a difference of a single test sample, so the two models perform essentially the same.
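
Because a single train/test split can swing by one or two samples, a cross-validated comparison gives a steadier picture. The sketch below is one way to do that, not the procedure used in the cells above: the pipeline objects, the 5-fold `StratifiedKFold` setup, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Keeping the scaler and PCA inside the pipeline refits them on each training
# fold, so the held-out fold never influences the preprocessing.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=1000)
)

# Assumed setup: 5 stratified folds with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_baseline = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")
scores_pca = cross_val_score(with_pca, X, y, cv=cv, scoring="accuracy")

print(f"Without PCA: {scores_baseline.mean():.4f} +/- {scores_baseline.std():.4f}")
print(f"With PCA:    {scores_pca.mean():.4f} +/- {scores_pca.std():.4f}")
```

If the two means differ by less than the fold-to-fold spread, the difference is probably not meaningful.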
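
Data Standardization:
Scaling each feature to a mean of 0 and a standard deviation of 1 is essential before applying PCA. PCA picks the directions of largest variance, so without scaling, features measured on large scales (such as mean area, with a standard deviation of about 352) would dominate features on small scales (such as mean smoothness, about 0.014). Standardization also helps logistic regression: the solver converges faster, and the default L2 regularization penalizes all coefficients on a comparable scale.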
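
To see the effect of scaling on PCA, the following sketch fits PCA once on the raw features and once on standardized features, then compares how much variance the first component claims. It uses the full dataset rather than the training split, and the variable names (`pca_raw`, `pca_scaled`, `dominant`) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data

# PCA on the raw features: the variance is dominated by large-scale columns.
pca_raw = PCA().fit(X)
# PCA on standardized features: every feature starts with unit variance.
pca_scaled = PCA().fit(StandardScaler().fit_transform(X))

# Feature with the largest absolute loading on the first raw component.
dominant = data.feature_names[np.argmax(np.abs(pca_raw.components_[0]))]

print(f"First component, raw data:    {pca_raw.explained_variance_ratio_[0]:.3f}")
print(f"First component, scaled data: {pca_scaled.explained_variance_ratio_[0]:.3f}")
print("Dominant raw feature:", dominant)
```

On the raw data, the first component mostly tracks an area measurement, so it says little about the other features.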

PCA Explained Variance:
Setting n_components=0.95 tells PCA to keep the smallest number of principal components whose cumulative explained variance reaches 95%. The components are linear combinations of all the original features, not a subset of them. Here, 10 components are kept, and their explained variance ratios sum to about 0.951.
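
The selection rule can be checked by hand. The sketch below fits PCA with all components on the same split used above, accumulates the explained variance ratios, and finds the first count that reaches 95%. The names `pca_full`, `cumulative`, and `k_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit PCA with every component to get the full variance spectrum.
pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.95.
k_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Let scikit-learn make the same choice from the 0.95 threshold.
pca_95 = PCA(n_components=0.95).fit(X_train_scaled)

print("Components chosen by hand:   ", k_manual)
print("Components chosen by sklearn:", pca_95.n_components_)
print(f"Variance retained: {cumulative[k_manual - 1]:.4f}")
```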

Model Evaluation Metrics:
Accuracy is a good overall measure, but it can be misleading when the classes are imbalanced. This dataset is moderately imbalanced (357 samples of class 1 versus 212 of class 0).
Precision, recall, and F1-score describe performance per class, which is more informative on imbalanced datasets.
The confusion matrix shows the specific errors the model makes: each row is a true class and each column is a predicted class, so the off-diagonal entries count the misclassifications.
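
As a worked example, the metrics for class 1 can be recomputed from the confusion matrix printed for the model without PCA. The matrix values come from that output. Treating class 1 as the positive class is a choice made here for illustration (in this dataset, class 1 is benign and class 0 is malignant).

```python
import numpy as np

# Confusion matrix without PCA: rows are true classes, columns are predictions.
cm = np.array([[41, 2],
               [1, 70]])

# With class 1 as the positive class, the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the samples predicted as class 1, how many are class 1
recall = tp / (tp + fn)     # of the true class 1 samples, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.9737 precision=0.9722 recall=0.9859 f1=0.9790
```

These values round to the 0.97, 0.99, and 0.98 shown for class 1 in the classification report.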

Dimensionality Reduction Benefits:
PCA can reduce computational cost by decreasing the number of features, which matters most for large, high-dimensional datasets.
Fewer features can also help reduce overfitting, especially when the original dataset has many features but few samples. This is not guaranteed, because PCA ignores the labels when it chooses components.
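
A quick way to see the size of the reduction is to compare array shapes before and after the transform. This sketch repeats the split and preprocessing from the cells above, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler and PCA on the training data only, then reuse them for the test data.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95).fit(scaler.transform(X_train))

X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))

print("Before PCA:", X_train.shape, X_test.shape)
print("After PCA: ", X_train_pca.shape, X_test_pca.shape)
```

The downstream model then fits one coefficient per retained component instead of one per original feature.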