# Module 1: Data Analysis and Data Preprocessing

## Section 4: Feature selection

### Part 5: SelectFdr, SelectFpr, and SelectFwe

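In this part, we explore three univariate feature selection methods from scikit-learn: SelectFdr, SelectFpr, and SelectFwe. Each one runs a statistical hypothesis test on every feature and keeps the features whose p-values pass a threshold. They differ in which error rate that threshold controls: the false discovery rate (FDR), the false positive rate (FPR), or the family-wise error rate (FWER).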

### 5.1 SelectFdr, SelectFpr, and SelectFwe

1. SelectFdr (False Discovery Rate): This method controls the expected proportion of false positives among the selected features, that is, among the rejected null hypotheses. scikit-learn applies the Benjamini-Hochberg procedure to the per-feature p-values to do this. The higher the FDR you allow, the more false positives you are willing to tolerate among the selected features.
    
2. SelectFpr (False Positive Rate): This method controls the expected proportion of truly irrelevant features that are selected by mistake. The FPR is the ratio of false positives to the total number of irrelevant features (true null hypotheses), not to the number of features selected. No multiple-testing correction is applied: each feature is tested at level alpha on its own, so when many irrelevant features are present, roughly a fraction alpha of them is expected to pass.
   
3. SelectFwe (Family-wise Error Rate): This method controls the FWER, which is the probability of making at least one false positive (Type I error) among all the hypotheses tested. scikit-learn uses a Bonferroni-style correction, comparing each p-value to alpha divided by the number of features. This is the most stringent of the three criteria, and lowering the target FWER makes it stricter still.
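
The three criteria above can be sketched directly on a vector of p-values. The block below is an illustration, not part of the original example: it appends 200 random noise columns to the Wine data (an assumption, so that the rules have something to disagree about), applies each rule by hand, and checks the result against the corresponding scikit-learn selector. The manual rules reflect my reading of scikit-learn's behaviour; the printed comparison shows whether they agree on this data.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectFdr, SelectFpr, SelectFwe, f_classif

# Wine data plus 200 pure-noise columns (illustrative assumption).
X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])

alpha = 0.05
_, pvalues = f_classif(X_noisy, y)  # one ANOVA F-test p-value per feature
n_features = pvalues.size

# FPR: compare each p-value to alpha, with no correction.
mask_fpr = pvalues < alpha

# FWER: Bonferroni-style correction, alpha divided by the number of tests.
mask_fwe = pvalues < alpha / n_features

# FDR: Benjamini-Hochberg. Sort the p-values, find those under the rising
# line alpha * rank / n_features, and keep everything up to the largest one.
sorted_p = np.sort(pvalues)
bh_line = alpha / n_features * np.arange(1, n_features + 1)
passing = sorted_p[sorted_p <= bh_line]
if passing.size:
    mask_fdr = pvalues <= passing.max()
else:
    mask_fdr = np.zeros(n_features, dtype=bool)

manual = {"fpr": mask_fpr, "fdr": mask_fdr, "fwe": mask_fwe}
selectors = {
    "fpr": SelectFpr(alpha=alpha),
    "fdr": SelectFdr(alpha=alpha),
    "fwe": SelectFwe(alpha=alpha),
}
matches = {}
for name, selector in selectors.items():
    sk_mask = selector.fit(X_noisy, y).get_support()
    matches[name] = bool(np.array_equal(sk_mask, manual[name]))
    print(f"{name}: {manual[name].sum()} features kept, "
          f"matches scikit-learn: {matches[name]}")
```

Because the Bonferroni cutoff is never above the Benjamini-Hochberg cutoff, which in turn is never above alpha, the number of features kept can only grow when moving from FWER to FDR to FPR control.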

Parameters:
- score_func: The function used to score each feature. It takes X and y and returns a pair of arrays (scores, p-values). The default is f_classif, the ANOVA F-test for classification. These selectors do not take an estimator.
- alpha: 
    - In SelectFdr: The upper bound on the expected FDR. The Benjamini-Hochberg procedure derives the p-value cutoff from alpha, so the effective cutoff is at most alpha and usually lower.
    - In SelectFpr: The p-value threshold. Features with p-values lower than alpha will be selected.
    - In SelectFwe: The upper bound on the FWER. Features with p-values lower than alpha divided by the number of features will be selected.
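
As a small sketch of how these two parameters behave, the block below passes `score_func` explicitly and sweeps `alpha` for SelectFpr. The noise columns and the particular alpha values are assumptions chosen for illustration; `scores_` and `pvalues_` are the fitted attributes that hold the output of the score function.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectFpr, f_classif

# Wine data plus 200 pure-noise columns (illustrative assumption).
X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])

alphas = [0.001, 0.01, 0.05, 0.2]  # hypothetical values for the sweep
counts = []
for alpha in alphas:
    # score_func=f_classif is the default; it is spelled out here for clarity.
    selector = SelectFpr(score_func=f_classif, alpha=alpha).fit(X_noisy, y)
    counts.append(int(selector.get_support().sum()))
    print(f"alpha={alpha}: {counts[-1]} of {X_noisy.shape[1]} features kept")

# After fitting, the per-feature test statistics and p-values are available.
print("scores_ shape:", selector.scores_.shape)
print("pvalues_ shape:", selector.pvalues_.shape)
```

A larger alpha can only keep the same features or more, since the p-values themselves do not depend on alpha.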

Let's illustrate how to use these feature selection methods with a simple example using the Wine dataset:

In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFdr, SelectFpr, SelectFwe
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print("All features:")
print(data.feature_names)

# Create the logistic regression model
clf = LogisticRegression(max_iter=5000)

# SelectFdr (False Discovery Rate) feature selection
selector_fdr = SelectFdr(alpha=0.05)
X_train_selected_fdr = selector_fdr.fit_transform(X_train, y_train)
X_test_selected_fdr = selector_fdr.transform(X_test)
# Train the classifier on the selected features using Fdr
clf.fit(X_train_selected_fdr, y_train)
y_pred_fdr = clf.predict(X_test_selected_fdr)
accuracy_fdr = accuracy_score(y_test, y_pred_fdr)
print("\nSelected features using SelectFdr:")
print(selector_fdr.get_support(indices=True))

# SelectFpr (False Positive Rate) feature selection
selector_fpr = SelectFpr(alpha=0.05)
X_train_selected_fpr = selector_fpr.fit_transform(X_train, y_train)
X_test_selected_fpr = selector_fpr.transform(X_test)
# Train the classifier on the selected features using Fpr
clf.fit(X_train_selected_fpr, y_train)
y_pred_fpr = clf.predict(X_test_selected_fpr)
accuracy_fpr = accuracy_score(y_test, y_pred_fpr)
print("\nSelected features using SelectFpr:")
print(selector_fpr.get_support(indices=True))

# SelectFwe (Family-wise Error Rate) feature selection
selector_fwe = SelectFwe(alpha=0.05)
X_train_selected_fwe = selector_fwe.fit_transform(X_train, y_train)
X_test_selected_fwe = selector_fwe.transform(X_test)
# Train the classifier on the selected features using Fwe
clf.fit(X_train_selected_fwe, y_train)
y_pred_fwe = clf.predict(X_test_selected_fwe)
accuracy_fwe = accuracy_score(y_test, y_pred_fwe)
print("\nSelected features using SelectFwe:")
print(selector_fwe.get_support(indices=True))

print("\nAccuracy using SelectFdr:", accuracy_fdr)
print("Accuracy using SelectFpr:", accuracy_fpr)
print("Accuracy using SelectFwe:", accuracy_fwe)

In this example, each selector scores the features with its default score function, f_classif (the ANOVA F-test). Logistic regression is only the downstream classifier and plays no role in the selection itself. We apply SelectFdr, SelectFpr, and SelectFwe from the sklearn.feature_selection module with the same significance level (alpha=0.05), so any difference in the selected features comes from the error rate each method controls. We then train the logistic regression classifier on the selected features and evaluate the accuracy of the model on the test set.

The SelectFdr, SelectFpr, and SelectFwe methods take the significance level alpha as a parameter to control the false discovery rate, false positive rate, and family-wise error rate, respectively. By using these methods, we can choose features based on their statistical significance, which can reduce the risk of overfitting by discarding features that show no detectable relationship to the target. Because the tests are univariate, they do not account for redundancy or interactions between features. The accuracy results for each feature selection method can be compared to determine which method performs best for the given dataset and model, although a single train/test split gives only a noisy comparison.
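
One way to make the comparison between methods less dependent on a single split is to wrap each selector in a pipeline and cross-validate it. The sketch below is an addition to the original example: the StandardScaler step and the choice of 5 folds are assumptions. Placing the selector inside the pipeline means it is refit on the training part of every fold, so the held-out part never influences which features are chosen.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectFdr, SelectFpr, SelectFwe
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

selectors = {
    "SelectFdr": SelectFdr(alpha=0.05),
    "SelectFpr": SelectFpr(alpha=0.05),
    "SelectFwe": SelectFwe(alpha=0.05),
}

cv_scores = {}
for name, selector in selectors.items():
    # Scaling, selection, and classification are all fit per fold.
    model = make_pipeline(
        StandardScaler(),
        selector,
        LogisticRegression(max_iter=5000),
    )
    cv_scores[name] = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {cv_scores[name].mean():.3f} "
          f"(std {cv_scores[name].std():.3f})")
```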

### 5.3 Summary

The SelectFdr, SelectFpr, and SelectFwe feature selection methods provide different ways to control the false discovery rate, false positive rate, and family-wise error rate, respectively. By using these methods, you can tailor your feature selection strategy based on the desired level of statistical significance and the type of errors you want to control in your machine learning model. The significance level alpha is best treated as a hyperparameter: fix it in advance or tune it with cross-validation on the training data, because tuning it against the test set leads to overfitting.
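
A minimal sketch of tuning alpha with cross-validation follows, assuming a pipeline with hypothetical step names `scale`, `select`, and `clf`, a hypothetical grid of alpha values, and 50 appended noise columns so that alpha has an effect. The test set is used only once, after the search has finished.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectFdr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Wine data plus 50 pure-noise columns (illustrative assumption).
X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 50))])

X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.2, random_state=1, stratify=y
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFdr()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Candidate alpha values are hypothetical; adjust them to the problem.
param_grid = {"select__alpha": [0.001, 0.01, 0.05, 0.2]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)  # alpha is chosen using training folds only

test_accuracy = search.score(X_test, y_test)
n_kept = int(search.best_estimator_.named_steps["select"].get_support().sum())
print("Best alpha:", search.best_params_["select__alpha"])
print("Features kept:", n_kept, "of", X_noisy.shape[1])
print("Test accuracy:", test_accuracy)
```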