# Module 1: Data Analysis and Data Preprocessing

## Section 4: Feature selection

In machine learning, feature selection is a crucial step in preparing data for modeling. It means keeping the most relevant features from the original dataset, which can improve model performance and reduce computational cost. scikit-learn provides several feature selection methods; two common univariate ones are SelectKBest and SelectPercentile.

### Part 1: SelectKBest / SelectPercentile

SelectKBest scores every feature with a specified scoring function and keeps the k highest-scoring features. For classification tasks, f_classif is a popular scoring function: it computes the ANOVA F-value between each feature and the target, and features with higher F-values are ranked as more informative.
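To make the ranking concrete, the sketch below calls f_classif directly on the full Iris dataset and prints the features from highest to lowest F-value. It is only an illustration; the variable names (f_scores, p_values, ranking) are chosen here and are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

data = load_iris()
X, y = data.data, data.target

# f_classif returns one F-statistic and one p-value per feature
f_scores, p_values = f_classif(X, y)

# Sort the features from highest to lowest F-value
ranking = sorted(
    zip(data.feature_names, f_scores, p_values),
    key=lambda item: item[1],
    reverse=True,
)
for name, f_score, p_value in ranking:
    print(f"{name:20s} F = {f_score:8.2f}  p = {p_value:.2e}")
```

SelectKBest performs the same scoring internally and then keeps the k features at the top of this ranking.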

### 1.1 Using SelectKBest

Parameters:
- score_func: The scoring function used to evaluate the features' importance. Common choices for classification tasks include f_classif (ANOVA F-value) and mutual_info_classif (mutual information).
- k: The number of top features to select (default 10). Passing k="all" keeps every feature.
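As a rough illustration of the score_func parameter, the sketch below fits SelectKBest twice on Iris, once with f_classif and once with mutual_info_classif. Wrapping mutual_info_classif in functools.partial to fix random_state is a choice made here for repeatability, not something the original notebook does.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Fixing random_state makes the mutual information estimate repeatable
mi_score = partial(mutual_info_classif, random_state=0)

results = {}
for label, score_func in [("f_classif", f_classif), ("mutual_info_classif", mi_score)]:
    selector = SelectKBest(score_func=score_func, k=2).fit(X, y)
    results[label] = selector.get_support(indices=True)
    print(label, "scores:", selector.scores_.round(3))
    print(label, "selected indices:", results[label])
```

The two scoring functions use different scales, so only the ranking within each run is comparable, not the raw scores across runs.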

Advantages:
- Simple and straightforward to use.
- Allows specifying the number of features to keep, which can be helpful when you have prior knowledge about the optimal feature count.

Disadvantages:
- Scores each feature on its own (univariate), so it cannot detect features that are only useful in combination with others.
- Ignores redundancy between features: two highly correlated features can both be selected even though the second adds little new information.
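The redundancy problem can be sketched with a deliberately artificial setup: append an exact copy of the petal length column to Iris and ask for the top two features. The duplicated column (X_dup) is a construction made up for this illustration. Because petal length has the highest F-value on Iris, both slots are expected to go to the original column and its copy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Append an exact copy of petal length (column 2) as a fifth column (index 4)
X_dup = np.hstack([X, X[:, [2]]])

selector = SelectKBest(score_func=f_classif, k=2).fit(X_dup, y)
selected = selector.get_support(indices=True)

print("F-scores:", selector.scores_.round(1))
print("Selected column indices:", selected)
```

The selected pair carries the information of a single feature, which is the kind of suboptimal result a univariate method cannot avoid on its own.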

Here's how you can use it:

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
feature_names = data.feature_names
print("Original feature names:", feature_names)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Create the SelectKBest object with k=2 (select top 2 features)
selector = SelectKBest(score_func=f_classif, k=2)
# Fit the selector to the training data and transform it
X_train_selected = selector.fit_transform(X_train, y_train)
# Get the selected feature indices
selected_feature_indices = selector.get_support(indices=True)
print("\nSelected feature names:", [feature_names[i] for i in selected_feature_indices])

# Train a classifier using the selected features
clf = KNeighborsClassifier()
clf.fit(X_train_selected, y_train)
# Transform the test data using the selected features
X_test_selected = selector.transform(X_test)
# Make predictions on the test data
y_pred = clf.predict(X_test_selected)
# Calculate accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy of KNeighborsClassifier after feature selection:", accuracy)

In this example, we use the Iris dataset, which contains 150 samples of iris flowers with four features and a target variable giving the species. We store the original feature names in feature_names, fit SelectKBest with the f_classif score function on the training data only, and keep the top two features. The same fitted selector then transforms the test data, and a KNeighborsClassifier trained on the two selected features is evaluated by its accuracy on the held-out test set.
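An accuracy figure is easier to judge against a baseline. The following sketch, which uses the same split and settings as the example above, compares a KNeighborsClassifier trained on all four features with one trained on the two selected features. The names baseline_acc and selected_acc are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Baseline: classifier trained on all four features
baseline_acc = KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)

# Same classifier trained on the two top-scoring features only
selector = SelectKBest(score_func=f_classif, k=2).fit(X_train, y_train)
selected_acc = (
    KNeighborsClassifier()
    .fit(selector.transform(X_train), y_train)
    .score(selector.transform(X_test), y_test)
)

print("Accuracy with all features:     ", baseline_acc)
print("Accuracy with selected features:", selected_acc)
```

On a dataset as small as Iris, a single split gives only a rough comparison, so the two numbers should be read as indicative rather than conclusive.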

### 1.2 SelectPercentile

SelectPercentile is similar to SelectKBest, but instead of a fixed count it keeps a given percentage of the highest-scoring features. This is useful when you don't know exactly how many features to keep but want to retain a fixed share of the best ones.

Parameters:
- score_func: The scoring function used to evaluate the features' importance. Common choices for classification tasks include f_classif (ANOVA F-value) and mutual_info_classif (mutual information).
- percentile: The percentage of features to keep, from 0 to 100 (default 10).
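To show how the percentile parameter translates into a feature count, the sketch below fits SelectPercentile on Iris with several percentile values and records how many of the four features survive. The dictionary name kept is chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = load_iris(return_X_y=True)

kept = {}
for percentile in (25, 50, 75, 100):
    selector = SelectPercentile(score_func=f_classif, percentile=percentile).fit(X, y)
    # get_support() returns a boolean mask with one entry per original feature
    kept[percentile] = int(selector.get_support().sum())
    print(f"percentile={percentile:3d} keeps {kept[percentile]} of {X.shape[1]} features")
```

With only four features the mapping is coarse; on wider datasets the same percentile adapts automatically to the number of columns.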

Advantages:
- Allows selecting a specified percentage of top features, which is useful when you want to retain a proportion of the best features.
- Handles scenarios where the optimal number of features is unknown or varies across datasets.

Disadvantages:
- Like SelectKBest, it scores each feature on its own, so it ignores feature interactions and redundancy between features.
- May exclude relevant features if the percentile is set too low.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
feature_names = data.feature_names
print("Original feature names:", feature_names)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Create the SelectPercentile object with percentile=50 (select top 50% of features)
selector = SelectPercentile(score_func=f_classif, percentile=50)
# Fit the selector to the training data and transform it
X_train_selected = selector.fit_transform(X_train, y_train)
# Get the selected feature indices
selected_feature_indices = selector.get_support(indices=True)
selected_feature_names = [feature_names[i] for i in selected_feature_indices]
print("\nSelected feature names:", selected_feature_names)

# Train a KNN classifier using the selected features
clf = KNeighborsClassifier()
clf.fit(X_train_selected, y_train)
# Transform the test data using the selected features
X_test_selected = selector.transform(X_test)
# Make predictions on the test data
y_pred = clf.predict(X_test_selected)
# Calculate accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy of KNeighborsClassifier after feature selection:", accuracy)

This example repeats the workflow on the Iris dataset, replacing SelectKBest with SelectPercentile and keeping the f_classif score function. With percentile=50 and four original features, the selector keeps the top two features. As before, the selector is fitted on the training data only, and a KNeighborsClassifier trained on the selected features is evaluated by its accuracy on the held-out test set.
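Because Iris has four features, percentile=50 and k=2 should keep the same columns. The sketch below checks this on the same training split, assuming no tied scores; the names k_best, top_half, and same_selection are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit both selectors on the same training data
k_best = SelectKBest(score_func=f_classif, k=2).fit(X_train, y_train)
top_half = SelectPercentile(score_func=f_classif, percentile=50).fit(X_train, y_train)

# Compare the boolean masks of kept features
same_selection = np.array_equal(k_best.get_support(), top_half.get_support())
print("SelectKBest mask:     ", k_best.get_support())
print("SelectPercentile mask:", top_half.get_support())
print("Same features selected:", same_selection)
```

The two selectors differ only in how the cut-off is specified, so on a dataset with a different number of features the same percentile would correspond to a different k.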

### 1.3 Summary

Both SelectKBest and SelectPercentile are simple univariate feature selection methods that can improve model performance, reduce overfitting, and make models easier to interpret. Use SelectKBest when you know how many features you want to retain, and SelectPercentile when you would rather keep a percentage of the top-scoring features.