# **Feature Selection and Dimension Reduction**
Both feature selection and dimension reduction are techniques used in machine learning and data analysis to handle high-dimensional data, but they work in different ways:
## **Feature Selection**
Feature selection involves identifying and selecting a subset of the original features that are most relevant to the prediction task. This process eliminates redundant or irrelevant features while preserving the most informative ones.
## **Dimension Reduction**
Dimension reduction transforms the original high-dimensional data into a lower-dimensional space while trying to preserve the essential information and structure of the data.
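The difference between the two approaches can be sketched on the breast cancer dataset used later in this notebook. This is only an illustration: `PCA` and `StandardScaler` are assumptions chosen here to stand in for "dimension reduction" and do not appear elsewhere in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Feature selection: keep 5 of the original 30 columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_sel = selector.transform(X)
kept = selector.get_support(indices=True)  # indices of the retained columns

# Dimension reduction: build 5 new columns, each a linear mix of all 30.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=5).fit_transform(X_scaled)

print(X_sel.shape, X_pca.shape)
```

Both results have five columns, but only the selected ones are original measurements that can be looked up by name.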

**Feature Selection**

In [9]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (
    SelectKBest,
    SelectPercentile,
    chi2,
    mutual_info_classif,
    mutual_info_regression,
)

X, y = load_breast_cancer(return_X_y=True)


In [10]:
from sklearn.datasets import load_breast_cancer
# _mutual_info is a private module and is not needed; import only the public functions.
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, mutual_info_classif, mutual_info_regression

In [11]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SelectFdr, SelectFpr, SelectPercentile, chi2, mutual_info_classif, mutual_info_regression

In [12]:
X,y=load_breast_cancer(return_X_y=True)
print(X.shape)
print(y.shape)

(569, 30)
(569,)



# 📘 Explanation of Methods
## ✅ chi2
Type: Filter method

Use Case: Classification tasks

How it works: Calculates the chi-squared statistic between each feature and the class label. It compares the feature totals observed within each class against the totals expected if the feature and the class were independent; a larger statistic indicates stronger dependence.

Limitation: Requires non-negative features.
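As a rough sketch of how `chi2` behaves when called directly, the snippet below scores every feature and then checks the non-negativity requirement. The variable names (`top`, `X_neg`, `raised`) are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2

data = load_breast_cancer()
X, y = data.data, data.target

# chi2 returns one statistic and one p-value per feature.
scores, pvalues = chi2(X, y)
top = np.argsort(scores)[::-1][:5]  # indices of the 5 highest scores
print(data.feature_names[top])

# A single negative entry violates the non-negativity requirement.
X_neg = X.copy()
X_neg[0, 0] = -1.0
try:
    chi2(X_neg, y)
    raised = False
except ValueError:
    raised = True
print("negative input rejected:", raised)
```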

## ✅ mutual_info_classif
Type: Filter method

Use Case: Classification tasks

How it works: Computes mutual information between each feature and the target. Measures the amount of information gained about the target by knowing the feature. Can capture non-linear dependencies.

Advantages: Works with both discrete and continuous variables.
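To illustrate the point about non-linear dependencies, here is a small sketch on synthetic data (an assumption made for this example, not part of the notebook's dataset). The target depends on the absolute value of one feature, so the class means of that feature nearly coincide and a mean-based score such as `f_classif` would be expected to rate it weakly, while mutual information should still detect it.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
signal = rng.uniform(-1, 1, size=n)  # informative, but only non-linearly
noise = rng.uniform(-1, 1, size=n)   # unrelated to the target
y = (np.abs(signal) > 0.5).astype(int)
X = np.column_stack([signal, noise])

# random_state fixes the nearest-neighbour estimator's internal jitter.
mi = mutual_info_classif(X, y, random_state=0)
f_scores, _ = f_classif(X, y)

print("mutual information:", mi)
print("ANOVA F-scores:    ", f_scores)
```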

## ✅ mutual_info_regression
Type: Filter method

Use Case: Regression tasks

How it works: Same concept as mutual_info_classif but designed for continuous target variables.
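A minimal sketch with a continuous target follows. It assumes the `load_diabetes` dataset (a regression dataset shipped with scikit-learn) and wraps the scorer with `functools.partial` so that a fixed `random_state` can be passed through `SelectKBest`; both choices are illustrative.

```python
from functools import partial

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, mutual_info_regression

X, y = load_diabetes(return_X_y=True)  # y is a continuous measurement

# One non-negative score per feature; higher means more shared information.
mi = mutual_info_regression(X, y, random_state=0)
print(mi.round(3))

# SelectKBest accepts a scorer that returns only scores (no p-values).
scorer = partial(mutual_info_regression, random_state=0)
X_top = SelectKBest(score_func=scorer, k=4).fit_transform(X, y)
print(X_top.shape)
```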

## ✅ SelectKBest
Description: Selects the top K features with the highest scores using the chosen scoring function.

## ✅ SelectPercentile
Description: Selects the top X% of features ranked by score. For example, percentile=20 keeps the highest-scoring 20% of features.

In [14]:

# fit_transform returns a plain NumPy array of the selected columns.
k_chi = SelectKBest(chi2, k=5).fit_transform(X, y)
print(k_chi.shape)
per_chi = SelectPercentile(chi2, percentile=14).fit_transform(X, y)
print(per_chi.shape)
# The array has no selector methods (e.g. get_support); keep the fitted
# selector object itself if you need to inspect which features were chosen.

(569, 5)
(569, 5)
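Because `fit_transform` returns a NumPy array, selector methods are not available on its result. One way to inspect the chosen features, sketched here under the assumption that the breast cancer data is loaded as a `Bunch` so that feature names are available, is to fit the selector first and transform afterwards.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

data = load_breast_cancer()
X, y = data.data, data.target

# Keep a reference to the fitted selector instead of chaining fit_transform.
selector = SelectKBest(chi2, k=5).fit(X, y)
X_k = selector.transform(X)

mask = selector.get_support()  # boolean mask over the 30 original columns
print(X_k.shape)
print(data.feature_names[mask])
```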

In [8]:
# Top 3 features by mutual information, then top 10% of features by chi2.
k_mi = SelectKBest(mutual_info_classif, k=3).fit_transform(X, y)
print(k_mi.shape)
per_chi = SelectPercentile(chi2, percentile=10).fit_transform(X, y)
print(per_chi.shape)

ValueError: Expected 2D array, got 1D array instead:
array=[0 0 0 ... 0 0 1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Note: the array printed in this message holds 0/1 class labels, so the first argument that reached `fit_transform` was the 1-D target rather than the 2-D feature matrix. The cell number (In [8]) precedes the data-loading cells, which suggests it ran against stale notebook state. Re-running the loading cell so that `X` has shape (569, 30) should resolve it; reshaping is not the right fix here.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

# Load dataset
bc = load_breast_cancer()
X = bc.data
y = bc.target
feature_name = bc.feature_names  # array holding the 30 column names

# Select top 15 features using SelectKBest with ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=15).fit(X, y)

# Transform the data to include only selected features
X_selected = selector.transform(X)

# Print selected feature names
print("Selected feature names:")
print(feature_name[selector.get_support()])


Selected feature names:
['mean radius' 'mean perimeter' 'mean area' 'mean compactness'
 'mean concavity' 'mean concave points' 'radius error' 'perimeter error'
 'area error' 'worst radius' 'worst perimeter' 'worst area'
 'worst compactness' 'worst concavity' 'worst concave points']



## ✅ VarianceThreshold
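Removes features whose variance does not exceed a given threshold.

Good for removing constant or near-constant columns. It is unsupervised: only X is used and the target is ignored. Because variance depends on scale, a single threshold is only meaningful across features measured in comparable units.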
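A toy example makes the behaviour easy to check by eye. The matrix below is made up for illustration: its first and last columns are constant, so the default threshold of 0.0 should drop them.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 3 are constant; columns 1 and 2 vary.
X_toy = np.array([[0, 2, 0, 3],
                  [0, 1, 4, 3],
                  [0, 1, 1, 3]], dtype=float)

vt = VarianceThreshold()  # default threshold=0.0 removes constant columns
X_reduced = vt.fit_transform(X_toy)

print(vt.variances_)     # per-column variance computed during fit
print(vt.get_support())  # True for the columns that were kept
print(X_reduced)
```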

In [None]:
from sklearn.feature_selection import VarianceThreshold

# VarianceThreshold is unsupervised, so only X (breast cancer features) is needed.
X_var = VarianceThreshold(threshold=0.5).fit_transform(X)
print(X_var.shape)

(569, 10)


## ✅ `f_classif` (ANOVA F-test for classification)
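Computes the one-way ANOVA F-statistic for each feature: the ratio of between-class variance to within-class variance. A high score means the feature's mean differs clearly across classes.

The test is most reliable when each feature is roughly normally distributed within each class, with similar variances.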
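To make the statistic concrete, the sketch below recomputes the F-value for the first feature by hand and compares it with what `f_classif` reports. The helper names (`groups`, `ss_between`, `ss_within`, `F_manual`) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X, y = load_breast_cancer(return_X_y=True)
F, p = f_classif(X, y)

# Manual one-way ANOVA for the first feature.
col = X[:, 0]
groups = [col[y == c] for c in np.unique(y)]
grand_mean = col.mean()

# Between-class and within-class sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(col) - len(groups)
F_manual = (ss_between / df_between) / (ss_within / df_within)

print(F_manual, F[0])
```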

In [30]:
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 10 features with the highest ANOVA F-scores.
anova_selector = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)
print(anova_selector.shape)

(569, 10)


## ✅ `f_regression`
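Like f_classif but for regression problems: each feature is scored by an F-statistic derived from its linear correlation with a continuous target.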
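The link between the F-statistic and correlation can be checked directly. This sketch assumes the `load_diabetes` regression dataset and the default `center=True` setting, under which the statistic is r² / (1 − r²) · (n − 2) for Pearson correlation r.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import f_regression

X, y = load_diabetes(return_X_y=True)  # continuous target
F, p = f_regression(X, y)

# Pearson correlation of each feature with the target.
n = X.shape[0]
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Convert correlation to an F-statistic with (1, n - 2) degrees of freedom.
F_manual = r**2 / (1 - r**2) * (n - 2)

print(np.allclose(F, F_manual))
```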

In [32]:
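from sklearn.feature_selection import SelectKBest, f_regression

# Note: y here is the binary breast cancer label treated as a numeric target;
# for a genuine regression problem, use a continuous target instead.
reg_selector = SelectKBest(score_func=f_regression, k=5).fit_transform(X, y)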
print(reg_selector.shape)

(569, 5)


## ✅ `SelectFpr` / `SelectFdr` / `SelectFwe`
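These select features by thresholding the p-values returned by the scoring function, with `alpha` as the significance level. `SelectFpr` keeps features with p-value below `alpha` (controls the false positive rate), `SelectFdr` applies the Benjamini-Hochberg procedure (controls the false discovery rate), and `SelectFwe` keeps features with p-value below `alpha / n_features` (controls the family-wise error rate).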
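As a rough check of how the thresholds work, the sketch below rebuilds the `SelectFpr` and `SelectFwe` masks from raw `chi2` p-values and compares them with the fitted selectors. The mask names are illustrative; `SelectFdr` is only counted, not re-derived.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFdr, SelectFpr, SelectFwe, chi2

X, y = load_breast_cancer(return_X_y=True)
alpha = 0.05
_, pvalues = chi2(X, y)
n_features = X.shape[1]

# Manual thresholds on the p-values.
fpr_mask = pvalues < alpha               # uncorrected test per feature
fwe_mask = pvalues < alpha / n_features  # Bonferroni-style correction

fpr = SelectFpr(chi2, alpha=alpha).fit(X, y)
fdr = SelectFdr(chi2, alpha=alpha).fit(X, y)
fwe = SelectFwe(chi2, alpha=alpha).fit(X, y)

print(np.array_equal(fpr_mask, fpr.get_support()))
print(np.array_equal(fwe_mask, fwe.get_support()))
# Number of features kept by each selector, from least to most strict.
print(fpr.get_support().sum(), fdr.get_support().sum(), fwe.get_support().sum())
```

The stricter the error control, the fewer features survive, so the family-wise selector should never keep more than the other two.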

In [34]:
from sklearn.feature_selection import SelectFdr, SelectFpr, SelectFwe, chi2

# Each selector keeps the features whose chi2 p-value passes its alpha-based test.
fpr_selector = SelectFpr(score_func=chi2, alpha=0.05).fit_transform(X, y)
print(fpr_selector.shape)

fdr_selector = SelectFdr(score_func=chi2, alpha=0.05).fit_transform(X, y)
print(fdr_selector.shape)

fwe_selector = SelectFwe(score_func=chi2, alpha=0.05).fit_transform(X, y)
print(fwe_selector.shape)

(569, 17)
(569, 17)
(569, 16)
