# Filter Methods in Feature Selection

## Mutual Information

- Mutual information (MI) is a filter method.
- It selects features based only on their relationship with the target, not on any trained model.
- A feature is good if it tells you a lot about the target.
- MI measures how much knowing X reduces uncertainty about Y.
- Feature and target are independent → MI=0
- Strong dependence → MI is large.

$$
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
$$

- If X and Y are independent, then $p(x,y) = p(x)p(y)$, every log term is $\log 1 = 0$, and MI is 0.

$$
H(Y) = - \sum_{y}^{} p(y) \log p(y)
$$

$$
H(Y|X) = - \sum_{x,y}^{} p(x,y) \log p(y|x)
$$

$$
I(X;Y) = H(Y) - H(Y|X) 
$$

So, MI = Reduction in uncertainty about Y when X is known. 
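
A minimal sketch of the formulas above in plain Python. The two joint tables are hypothetical toy distributions, and the function names are illustrative. It computes MI directly from the joint table and then checks the identity $I(X;Y) = H(Y) - H(Y|X)$.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def mutual_information(joint):
    """MI (in nats) from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                         # 0 * log 0 is treated as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

def conditional_entropy(joint):
    """H(Y|X) = -sum over x,y of p(x,y) * log p(y|x)."""
    h = 0.0
    for row in joint:
        px = sum(row)
        for pxy in row:
            if pxy > 0:
                h -= pxy * math.log(pxy / px)   # p(y|x) = p(x,y) / p(x)
    return h

# Hypothetical joint distributions over a binary feature X and binary target Y.
independent = [[0.25, 0.25], [0.25, 0.25]]      # p(x,y) = p(x)p(y)
dependent = [[0.45, 0.05], [0.05, 0.45]]        # X mostly determines Y

mi_indep = mutual_information(independent)
mi_dep = mutual_information(dependent)

# Check I(X;Y) = H(Y) - H(Y|X) on the dependent table.
h_y = entropy([sum(col) for col in zip(*dependent)])
mi_via_entropy = h_y - conditional_entropy(dependent)

print(mi_indep, mi_dep, mi_via_entropy)
```

The independent table gives an MI of zero, while the dependent table gives a clearly positive value that matches the entropy-difference form.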

## ANOVA For Feature Selection

- Identify features that differ significantly across target classes.
- Used mainly for classification tasks with a categorical target.
- Key question: does the mean of the feature vary between classes?
- ANOVA computes the ratio of variance between classes to variance within classes.

$$
F = \frac{\text{Variance between classes}}{\text{Variance within classes}}
$$

- Algorithm
    - We have a feature $X_j$
    - Classes $C_1, C_2, \dots, C_k$
    - Number of samples $n_i$ in class $i$, with $N = \sum_i n_i$ samples in total
    - $\bar{x}_i$ is the mean of the feature in class $i$
    - $\bar{x}$ is the overall mean of the feature
    - Compute the between-class and within-class sums of squares, then the F statistic, using the following equations.

$$
SSB_j = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2
$$

$$
SSW_j = \sum_{i=1}^{k} \sum_{x \in C_i} (x-\bar{x}_i)^2
$$

The between-class sum of squares measures how much the class means deviate from the overall mean.

The within-class sum of squares measures the variability inside each class.

$$
F_j = \frac{SSB_j/(k-1)}{SSW_j/(N-k)}
$$

- High $F_j$ → Feature strongly separates classes
- Low $F_j$ → Feature weakly separates classes
- This method works well when the target is categorical.
- It only captures differences in class means (it ignores variance structure and feature interactions).
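
The SSB, SSW and F equations above can be sketched in a few lines of plain Python. The feature values and labels are hypothetical toy data, and `anova_f` is an illustrative name rather than a library function.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature against class labels."""
    groups = {}
    for x, c in zip(values, labels):
        groups.setdefault(c, []).append(x)

    n_total = len(values)                  # N
    k = len(groups)                        # number of classes
    grand_mean = sum(values) / n_total     # overall mean of the feature

    ssb = 0.0                              # between-class sum of squares
    ssw = 0.0                              # within-class sum of squares
    for xs in groups.values():
        class_mean = sum(xs) / len(xs)
        ssb += len(xs) * (class_mean - grand_mean) ** 2
        ssw += sum((x - class_mean) ** 2 for x in xs)

    return (ssb / (k - 1)) / (ssw / (n_total - k))

# Hypothetical toy data: two classes, three samples each.
labels = ["a", "a", "a", "b", "b", "b"]
separating = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]    # class means 2 and 8
overlapping = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]   # both class means are 5

f_sep = anova_f(separating, labels)
f_overlap = anova_f(overlapping, labels)
print(f_sep, f_overlap)
```

For the separating feature, SSB = 54 and SSW = 4, so F = (54/1)/(4/4) = 54. For the overlapping feature the class means coincide, so SSB = 0 and F = 0 even though the two classes have different spreads, which illustrates the last bullet above.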

## Variance Inflation Factor - VIF

- High VIF → Feature is highly correlated with others → May consider removing it.

$$
VIF_j = \frac{1}{1-R_j^2}
$$

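- $R_j^2$ is the $R^2$ obtained by regressing feature $X_j$ on all the other features.
- If $X_j$ is highly predictable from the other features → $R_j^2$ is close to 1 → High VIF.
- VIF = 1 → No correlation with the other features
- 1 < VIF < 5 → Moderate correlation
- VIF > 5 → High correlation, consider removing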
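
A rough sketch of the VIF computation using only the standard library. Each feature is regressed on the others by ordinary least squares, solved here through the normal equations with a small Gaussian elimination helper. That solver is an implementation choice for this sketch, not a prescribed method. The three feature columns are hypothetical toy data: `x2` is roughly `2 * x1`, and `x3` is constructed to be uncorrelated with both.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(target, predictors):
    """R^2 of a least squares fit of target on predictors plus an intercept."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(target))]
    d = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(d)] for p in range(d)]
    xty = [sum(r[p] * y for r, y in zip(rows, target)) for p in range(d)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot

def vif(features):
    """VIF_j = 1 / (1 - R_j^2) for every column in a dict of feature columns."""
    out = {}
    for name, col in features.items():
        others = [c for other, c in features.items() if other != name]
        out[name] = 1.0 / (1.0 - r_squared(col, others))
    return out

# Hypothetical toy features.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],   # about 2 * x1
    "x3": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],   # unrelated to x1, x2
}
vifs = vif(features)
print(vifs)
```

Here `x1` and `x2` come out far above the threshold of 5, while `x3` stays at roughly 1.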

## Weight of Evidence

WoE measures how strongly a feature value (or bin) separates good customers from bad customers.

$$
\text{Weight of Evidence} = \log \left(\frac{\text{\% of Good in the bin}} {\text{\% of Bad in the bin}} \right)
$$

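- Is this bin more associated with a good or a bad outcome?
- For continuous features we can use equal-width bins or optimal binning.
- For each bin, compute the % of Good (share of all good customers that fall in the bin) and the % of Bad (share of all bad customers that fall in the bin).
- Plug these two percentages into the formula above to get the WoE of the bin.
- If WoE > 0 → the bin holds a larger share of the goods than of the bads; WoE < 0 → the opposite.
- After this, we can compute the Information Value (IV):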

$$
IV = \sum_{i} (\text{\% of Good}_i - \text{\% of Bad}_i) \times WoE_i
$$

- If IV > 0.3, the predictor is strong.
- If IV < 0.1, it's a weak predictor.
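
A small sketch of the WoE and IV computation for a feature that is already binned. The bin labels and counts are hypothetical. The sketch reads "% of Good" as the share of all good customers that fall in the bin, and assumes every bin contains at least one good and one bad customer so that the log is defined.

```python
import math

def woe_iv(bins):
    """bins maps a bin label to (good_count, bad_count).

    Returns ({label: WoE}, IV). Assumes no bin has a zero count.
    """
    total_good = sum(g for g, _ in bins.values())
    total_bad = sum(b for _, b in bins.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bins.items():
        pct_good = good / total_good    # share of all goods in this bin
        pct_bad = bad / total_bad       # share of all bads in this bin
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv

# Hypothetical counts of good and bad customers per income bin.
bins = {"low": (100, 60), "mid": (300, 30), "high": (400, 10)}
woe, iv = woe_iv(bins)
print(woe, iv)
```

With these counts the "low" bin gets a negative WoE (it holds 60% of the bads but only 12.5% of the goods), the "high" bin gets a positive WoE, and the IV is well above 0.3.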

## Chi-Square Test of Independence

- A feature is useful when its distribution differs across classes.
- A feature is informative when it is dependent on the target.
- Null hypothesis: X and Y are independent.
- We create the contingency table of observed counts $O_{ij}$.
- We compute the expected counts $E_{ij}$ under independence.
- We compute the chi-square statistic.

$$
E_{ij} = \frac{(\text{Row } i \text{ Total}) \times (\text{Column } j \text{ Total})}{N}
$$

$$
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

If the feature has $r$ categories and the target column has $c$ classes, then the degrees of freedom will be $(r-1)(c-1)$.
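
The expected counts, the statistic and the degrees of freedom can be sketched as follows. The contingency table is hypothetical: rows are feature categories, columns are target classes.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table of observed counts.
observed = [[30, 10],
            [20, 40]]
stat, dof = chi_square(observed)

# A table whose rows have identical class proportions gives a statistic of 0.
stat_indep, _ = chi_square([[20, 20], [30, 30]])
print(stat, dof, stat_indep)
```

For this table the expected counts are 20, 20, 30, 30, so the statistic is 5 + 5 + 10/3 + 10/3 = 50/3 with 1 degree of freedom. For reference, the 5% critical value with 1 degree of freedom is about 3.84, so this toy table would reject independence.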

# Wrapper Methods in Feature Selection

- Iterative procedures that train a model on candidate feature subsets.
- One independent variable at a time is added or deleted based on a criterion such as the p-value or a validation score.

## Forward Selection

Start with an empty feature set and, in each round, pick the feature that improves performance the most.

- For each candidate feature:
    - Add it to current feature set
    - Train model
    - Evaluate with validation/cv
- Compare scores
- Choose the feature giving highest score
- Fix it and move to next round
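
The loop above can be sketched generically. Here `score_fn` is a hypothetical stand-in for "train the model and evaluate with validation/CV", and the `n_select` stopping rule is an assumption, since the notes do not say when to stop. The toy scorer and its gains are made up for illustration.

```python
def forward_selection(candidates, score_fn, n_select):
    """Greedy forward selection.

    score_fn(feature_list) -> float stands in for training a model on those
    features and evaluating it; higher is better.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n_select:
        # Score each candidate when added to the current feature set.
        scores = {f: score_fn(selected + [f]) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)          # fix it and move to the next round
        remaining.remove(best)
    return selected

# Toy scorer (hypothetical): each feature has a standalone gain, but
# "income" and "salary" are redundant, so together only the larger one counts.
GAIN = {"income": 0.30, "salary": 0.28, "age": 0.15, "noise": 0.01}

def toy_score(features):
    feats = set(features)
    if {"income", "salary"} <= feats:
        feats.discard("salary")
    return sum(GAIN[f] for f in feats)

chosen = forward_selection(GAIN, toy_score, n_select=2)
print(chosen)
```

Round one picks `income`. In round two `salary` adds nothing on top of `income`, so `age` wins even though its standalone gain is lower than that of `salary`.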

## Backward Selection

Start big, remove one at a time, and keep the removal that hurts performance the least.

## Stepwise Selection

Hybrid of forward selection and backward elimination.

Forward selection problem:

- Once you add a feature, it never gets removed.
- But later, it might become redundant when new features are added.

Stepwise selection fixes this by adding good features and also removing features once they stop helping.

We add features one at a time, and after each addition we run a backward elimination check.
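
One possible way to sketch this add-then-check loop is shown below. It uses a generic score rather than a p-value, and the rule "only add or drop when the score strictly improves" is an assumption chosen so that the loop cannot cycle. `score_fn` and the toy scorer are hypothetical.

```python
def stepwise_selection(candidates, score_fn):
    """Forward steps with a backward elimination check after each addition."""
    selected = []
    best = score_fn(selected)
    while True:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        # Forward step: find the single best feature to add.
        add = max(remaining, key=lambda f: score_fn(selected + [f]))
        add_score = score_fn(selected + [add])
        if add_score <= best:
            break                      # no addition helps, so stop
        selected.append(add)
        best = add_score
        # Backward check: drop features whose removal now improves the score.
        while len(selected) > 1:
            drop = max(selected,
                       key=lambda f: score_fn([g for g in selected if g != f]))
            drop_score = score_fn([g for g in selected if g != drop])
            if drop_score <= best:
                break
            selected.remove(drop)
            best = drop_score
    return selected

# Toy scorer (hypothetical): "a" is the best single feature, but "b" and "c"
# together explain everything "a" does. Each feature costs 0.01.
def toy_score(features):
    feats = set(features)
    if {"b", "c"} <= feats:
        explained = 0.50
    elif "a" in feats and feats & {"b", "c"}:
        explained = 0.40
    elif "a" in feats:
        explained = 0.30
    elif feats:
        explained = 0.25
    else:
        explained = 0.0
    return explained - 0.01 * len(feats)

result = stepwise_selection(["a", "b", "c"], toy_score)
print(result)
```

Plain forward selection would keep `a` forever. Here `a` is added first, then `b` and `c`, and the backward check drops `a` once it has become redundant.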

## Recursive Feature Elimination

- In backward elimination, removal is based on: which feature can I remove with the least drop in model performance?
- In RFE, removal is based on: which feature is least important according to the model itself?
- So we train the model, get the feature importance scores or coefficients, remove the weakest feature, retrain, and repeat.
- Backward Elimination: If I remove you, what happens to performance?
- RFE: How important does the model think you are?
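
To make the contrast concrete, here is a sketch of both loops side by side. `score_fn` and `importance_fn` are hypothetical stand-ins for "train and evaluate" and "train and read off importances or coefficients", and the toy functions are made up for illustration. Note the cost difference: backward elimination evaluates one candidate subset per remaining feature in every round, while RFE needs one fit per round.

```python
def backward_elimination(features, score_fn, n_keep):
    """Drop the feature whose removal hurts score_fn the least, then repeat."""
    current = list(features)
    while len(current) > n_keep:
        # Score every subset with exactly one feature left out.
        scores = {f: score_fn([g for g in current if g != f]) for f in current}
        current.remove(max(scores, key=scores.get))
    return current

def rfe(features, importance_fn, n_keep):
    """Drop the feature the model ranks least important, refit, then repeat."""
    current = list(features)
    while len(current) > n_keep:
        importances = importance_fn(current)    # one fit per round
        current.remove(min(importances, key=importances.get))
    return current

# Toy stand-ins (hypothetical): "income" and "salary" are redundant.
GAIN = {"income": 0.30, "salary": 0.28, "age": 0.15, "noise": 0.01}

def toy_score(features):
    feats = set(features)
    if {"income", "salary"} <= feats:
        feats.discard("salary")                 # the redundant pair counts once
    return sum(GAIN[f] for f in feats)

def toy_importance(features):
    # Assumption: a fitted model splits credit between redundant features.
    redundant = {"income", "salary"} <= set(features)
    return {f: GAIN[f] / 2 if redundant and f in ("income", "salary") else GAIN[f]
            for f in features}

kept_backward = backward_elimination(GAIN, toy_score, n_keep=2)
kept_rfe = rfe(GAIN, toy_importance, n_keep=2)
print(kept_backward, kept_rfe)
```

On this toy setup both procedures end with the same two features but remove them in a different order: backward elimination drops `salary` first, RFE drops `noise` first.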

# Embedded Methods in Feature Selection

## Lasso - L1 Regularisation

- Feature selection happens automatically during training.
- Add a penalty that pushes some coefficients to exactly zero.
- The L1 penalty has a diamond-shaped constraint region.
- The optimal solution often lands on a corner of the diamond, where one or more coefficients are exactly zero.

$$
\lambda \sum_j |\beta_j|
$$

- As $\lambda$ increases, small coefficients shrink to 0 first.
- Non-zero coefficients = selected features.
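
A sketch of why small coefficients vanish first, under a simplifying assumption: if the feature columns are orthonormal and the loss is $\frac{1}{2}\lVert y - X\beta \rVert^2 + \lambda \sum_j |\beta_j|$, each Lasso coefficient is the least-squares coefficient passed through a soft-threshold. The unpenalised coefficients below are hypothetical.

```python
def soft_threshold(b, lam):
    """Minimiser of 0.5 * (b - beta)**2 + lam * abs(beta) over beta."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0                  # coefficients with |b| <= lam become exactly zero

# Hypothetical least-squares coefficients before any penalty.
ols = {"x1": 2.0, "x2": -0.6, "x3": 0.1}

selected_at = {}
for lam in (0.0, 0.2, 1.0):
    shrunk = {name: soft_threshold(b, lam) for name, b in ols.items()}
    # Non-zero coefficients are the selected features.
    selected_at[lam] = [name for name, beta in shrunk.items() if beta != 0.0]
    print(lam, shrunk, selected_at[lam])
```

At $\lambda = 0$ all three features survive, at $\lambda = 0.2$ the smallest coefficient (`x3`) is zeroed, and at $\lambda = 1.0$ only `x1` remains. With correlated features the path is no longer this simple, and solvers typically iterate this kind of update coordinate by coordinate.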