SelectKBest:

SelectKBest scores every feature against the target variable with the ANOVA F-test (f_classif) and keeps only the highest-scoring features.

The k parameter is set to 10, so the 10 features with the largest F-values are kept.
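
To make the ranking step concrete, the sketch below computes the F-values directly with f_classif and checks that the 10 largest match the mask SelectKBest reports. It is an illustration separate from the notebook cell, and the variable names (f_scores, top10_manual, top10_selector) are assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# f_classif returns one F-statistic and one p-value per feature.
f_scores, p_values = f_classif(X, y)

# Rank features by hand: indices of the 10 largest F-statistics.
top10_manual = set(np.argsort(f_scores)[-10:].tolist())

# Let SelectKBest do the same job and read back its boolean mask.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
top10_selector = set(np.flatnonzero(selector.get_support()).tolist())

print("Same features chosen:", top10_manual == top10_selector)
```

After fitting, the per-feature scores are also available as selector.scores_ and selector.pvalues_, which helps when judging how sharply the scores drop off around the cutoff.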

SelectPercentile:

SelectPercentile uses the same ANOVA F-value scoring, but keeps a percentage of the features rather than a fixed number.

The percentile parameter is set to 50, so the top-scoring half of the features is kept (15 of the 30 here).
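
The relationship between a percentile and a feature count can be checked with a short sketch. It fits SelectPercentile, counts the features it keeps, and compares the mask with a SelectKBest fitted for that same count. The names n_kept and same_mask are assumptions for illustration, and the comparison assumes no tied scores at the cutoff.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the top-scoring 50% of the 30 features.
pct = SelectPercentile(score_func=f_classif, percentile=50).fit(X, y)
n_kept = int(pct.get_support().sum())

# A SelectKBest with k equal to that count should pick the same features.
kbest = SelectKBest(score_func=f_classif, k=n_kept).fit(X, y)
same_mask = bool((pct.get_support() == kbest.get_support()).all())

print("Features kept:", n_kept)
print("Matches SelectKBest with the same count:", same_mask)
```

SelectPercentile is convenient when the number of input features varies between datasets, because the fraction kept stays fixed while the count adapts.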

PCA:

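Step 3 of the notebook cell applies PCA directly to the raw features and keeps 10 principal components (n_components=10). It does not standardize the data first.

PCA is sensitive to feature scale, so features with large numeric ranges dominate the leading components; this is why the first component in the output accounts for about 98% of the variance. Standardizing with StandardScaler beforehand would give each feature a mean of 0 and a standard deviation of 1, so that all features contribute on an equal footing.

The explained_variance_ratio_ attribute gives the fraction of the dataset's total variance captured by each principal component.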
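
As a rough sketch of the standardize-then-project workflow, the snippet below scales the features with StandardScaler before fitting a 2-component PCA. It is an illustration separate from the notebook cell, and the variable names (X_scaled, pca_2d, X_2d) are assumptions introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Rescale every feature to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

# Project the standardized data onto its first two principal components.
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

print("Shape after PCA:", X_2d.shape)
print("Explained variance ratio:", pca_2d.explained_variance_ratio_)
```

On this dataset, the first component of the standardized data should account for a noticeably smaller share of the variance than in the unscaled case, because no single large-range feature dominates.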

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif
from sklearn.decomposition import PCA
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

print("Original dataset shape:", X.shape)

# 1. Reduce dimensions using SelectKBest
k_best = SelectKBest(score_func=f_classif, k=10)  # Select top 10 features
X_k_best = k_best.fit_transform(X, y)
k_best_features = X.columns[k_best.get_support()]
print("After SelectKBest shape:", X_k_best.shape)
print("Selected features using SelectKBest:", list(k_best_features))

# 2. Reduce dimensions using SelectPercentile
percentile_best = SelectPercentile(score_func=f_classif, percentile=50)  # Select top 50% features
X_percentile = percentile_best.fit_transform(X, y)
percentile_features = X.columns[percentile_best.get_support()]
print("After SelectPercentile shape:", X_percentile.shape)
print("Selected features using SelectPercentile:", list(percentile_features))

# 3. Reduce dimensions using PCA
pca = PCA(n_components=10)  # Reduce to 10 principal components
X_pca = pca.fit_transform(X)
print("After PCA shape:", X_pca.shape)
print("Explained variance ratio using PCA:", pca.explained_variance_ratio_)


Original dataset shape: (569, 30)
After SelectKBest shape: (569, 10)
Selected features using SelectKBest: ['mean radius', 'mean perimeter', 'mean area', 'mean concavity', 'mean concave points', 'worst radius', 'worst perimeter', 'worst area', 'worst concavity', 'worst concave points']
After SelectPercentile shape: (569, 15)
Selected features using SelectPercentile: ['mean radius', 'mean perimeter', 'mean area', 'mean compactness', 'mean concavity', 'mean concave points', 'radius error', 'perimeter error', 'area error', 'worst radius', 'worst perimeter', 'worst area', 'worst compactness', 'worst concavity', 'worst concave points']
After PCA shape: (569, 10)
Explained variance ratio using PCA: [9.82044672e-01 1.61764899e-02 1.55751075e-03 1.20931964e-04
 8.82724536e-05 6.64883951e-06 4.01713682e-06 8.22017197e-07
 3.44135279e-07 1.86018721e-07]
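
One way to read the explained variance ratios above is to accumulate them and see how many components are needed to reach a chosen share of the total variance. The sketch below does this for the unscaled data; the 0.99 target and the names cumulative and n_needed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)

# Fit PCA with all components so the ratios cover the full variance.
pca_full = PCA().fit(X)

# Running total of the variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target.
target = 0.99  # hypothetical threshold
n_needed = int(np.searchsorted(cumulative, target) + 1)

print("Components needed for", target, "of the variance:", n_needed)
```

With the ratios printed above, the first two components already sum to more than 0.99, so most of the 10 components kept in the cell carry very little of the variance in the unscaled data.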
