## Feature Selection - Forward Selection & Backward Elimination

# 🧠 Feature Selection: Forward and Backward Selection

Feature selection keeps only the most relevant features, which can improve model performance, reduce overfitting, and make the model easier to interpret.

---

## 🚀 Forward Feature Selection

**Definition**:  
Starts with no features and adds one feature at a time based on performance improvement until no further improvement is seen.

### 🔍 How It Works:
1. Start with an empty model (no predictors).
2. Evaluate all features individually and select the one that improves model performance the most.
3. Add this feature to the model.
4. Repeat steps 2–3: evaluate each remaining feature together with the features already selected, and add the one that helps the most.
5. Stop when:
   - The performance does not improve significantly.
   - A maximum number of features is reached.
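
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation any library uses: the helper names `holdout_r2` and `forward_select`, the `min_gain` threshold, and the synthetic regression data are all assumptions made for the example.

```python
import numpy as np

def holdout_r2(X_train, y_train, X_test, y_test, cols):
    """R^2 on held-out data of a least-squares fit that uses only `cols`."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_test = np.column_stack([np.ones(len(X_test)), X_test[:, cols]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    return 1.0 - (resid @ resid) / np.sum((y_test - y_test.mean()) ** 2)

def forward_select(score, n_features, max_features, min_gain=0.01):
    """Greedy forward selection; `score` maps a list of column indices to a number."""
    selected, best = [], -np.inf
    while len(selected) < min(max_features, n_features):
        remaining = [j for j in range(n_features) if j not in selected]
        # Score each candidate together with the features already selected.
        trials = {j: score(selected + [j]) for j in remaining}
        j_best = max(trials, key=trials.get)
        # Stop when the best candidate no longer improves the score enough.
        if trials[j_best] - best < min_gain:
            break
        selected.append(j_best)
        best = trials[j_best]
    return selected, best

# Hypothetical data: only columns 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=400)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

selected, best = forward_select(
    lambda cols: holdout_r2(X_tr, y_tr, X_te, y_te, cols),
    n_features=X.shape[1],
    max_features=4,
)
print(selected, round(best, 3))
```

Here `min_gain` and `max_features` correspond to the two stopping rules listed above. A single hold-out split keeps the sketch short; a cross-validated score is usually the safer choice in practice.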

### ✅ Advantages:
- Simple and interpretable.
- Useful when the number of features is small.

### ❌ Disadvantages:
- Greedy approach: Once a feature is added, it can't be removed.
- Can miss features that are only useful in combination, because each candidate is judged against the features already selected.

---

## 🔁 Backward Feature Elimination

**Definition**:  
Starts with all features and removes the least important one at each step based on performance degradation until further removal worsens the model.

### 🔍 How It Works:
1. Start with all features.
2. Remove the feature whose removal reduces model performance the least.
3. Repeat the process with the remaining features.
4. Stop when:
   - Removing any remaining feature significantly reduces performance.
   - The desired number of features is reached.
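
The elimination loop mirrors forward selection in reverse. The following is a minimal NumPy sketch under stated assumptions: `holdout_r2`, `backward_eliminate`, the `max_drop` tolerance, and the synthetic data are hypothetical names and values chosen for illustration, not part of any library.

```python
import numpy as np

def holdout_r2(X_train, y_train, X_test, y_test, cols):
    """R^2 on held-out data of a least-squares fit that uses only `cols`."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_test = np.column_stack([np.ones(len(X_test)), X_test[:, cols]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    return 1.0 - (resid @ resid) / np.sum((y_test - y_test.mean()) ** 2)

def backward_eliminate(score, n_features, min_features=1, max_drop=0.01):
    """Greedy backward elimination; `score` maps a list of column indices to a number."""
    selected = list(range(n_features))
    current = score(selected)
    while len(selected) > min_features:
        # Score every subset obtained by dropping exactly one feature.
        trials = {j: score([f for f in selected if f != j]) for j in selected}
        j_drop = max(trials, key=trials.get)  # the cheapest feature to lose
        # Stop when even the cheapest removal costs too much performance.
        if current - trials[j_drop] > max_drop:
            break
        selected.remove(j_drop)
        current = trials[j_drop]
    return selected, current

# Hypothetical data: only columns 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=400)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

kept, final_score = backward_eliminate(
    lambda cols: holdout_r2(X_tr, y_tr, X_te, y_te, cols),
    n_features=X.shape[1],
)
print(kept, round(final_score, 3))
```

Note that each pass fits one model per remaining feature, and the early passes fit the largest models, which is where the extra cost of backward elimination comes from.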

### ✅ Advantages:
- Starts from the full model, so features that are only useful together are evaluated jointly.
- Useful when the initial set of features is not too large.

### ❌ Disadvantages:
- Computationally expensive (especially with many features).
- May overfit on small datasets, because the first models are fit with every feature included.

---

## 📊 Comparison Table

| Criteria                    | Forward Selection          | Backward Elimination       |
|----------------------------|----------------------------|-----------------------------|
| Start Point                | No features                 | All features                |
| Process                    | Add one feature at a time   | Remove one feature at a time|
| Computation Cost           | Lower                       | Higher                      |
| Best For                   | Small feature sets          | Medium-sized feature sets   |
| Feature Interactions       | Not captured well           | Captured better             |
| Risk of Overfitting        | Lower                       | Higher                      |

---

## 📌 Conclusion

- Use **Forward Selection** when you have a **small number of features** or when computation is a concern.
- Use **Backward Elimination** when you start with a **reasonable number of features** and want to consider interactions.
- For high-dimensional data, consider methods that scale better, such as:
  - Recursive Feature Elimination (RFE)
  - Lasso Regression (L1 Regularization)
  - Tree-based Feature Importance
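
As a rough illustration of two of these alternatives, the sketch below runs scikit-learn's `RFE` and reads `feature_importances_` from a random forest. The synthetic dataset, the choice of four features, and the variable names are assumptions made for the example; they do not come from the diabetes notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical high-dimensional data: 20 features, 4 of them informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Recursive Feature Elimination: repeatedly drop the feature with the
# smallest coefficient magnitude until 4 features remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
rfe_idx = np.flatnonzero(rfe.support_)

# Tree-based importance: rank features by the forest's impurity-based scores.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
top4 = np.argsort(importances)[::-1][:4]

print("RFE keeps:", rfe_idx)
print("Forest top 4:", top4)
```

Both approaches fit far fewer models than a full forward or backward search, which is why they tend to be preferred when the feature count is large.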

---

✅ Tip: Always use **cross-validation** to evaluate performance at each step to avoid overfitting during selection.
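
To make the tip concrete, here is a minimal sketch of a k-fold score for one candidate feature subset, written with NumPy only. The function name `cv_accuracy`, the nearest-centroid classifier used as a stand-in model, and the synthetic data are assumptions for illustration; any estimator could take the classifier's place.

```python
import numpy as np

def cv_accuracy(X, y, cols, k=5, seed=0):
    """Mean k-fold accuracy of a nearest-centroid classifier restricted to `cols`."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        X_tr, X_te = X[np.ix_(train, cols)], X[np.ix_(test, cols)]
        classes = np.unique(y[train])
        # One centroid per class, computed from the training folds only.
        centroids = np.stack([X_tr[y[train] == c].mean(axis=0) for c in classes])
        dists = ((X_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        pred = classes[dists.argmin(axis=1)]
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))

# Hypothetical data: column 0 drives the label, the other columns are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

acc_informative = cv_accuracy(X, y, [0])
acc_noise = cv_accuracy(X, y, [1])
print(acc_informative, acc_noise)
```

Because every fold is scored on rows the model did not see during fitting, a feature that only fits noise in the training rows gets little credit, which is what keeps the selection step itself from overfitting.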


In [1]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# List every file under the read-only Kaggle input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/diabetescsv/diabetes.csv


In [2]:
import pandas as pd

In [3]:
dataset = pd.read_csv("/kaggle/input/diabetescsv/diabetes.csv")
dataset

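|     | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|-----|-------------|---------|---------------|---------------|---------|-----|--------------------------|-----|---------|
| 0   | 6           | 148     | 72            | 35            | 0       | 33.6 | 0.627                   | 50  | 1       |
| 1   | 1           | 85      | 66            | 29            | 0       | 26.6 | 0.351                   | 31  | 0       |
| 2   | 8           | 183     | 64            | 0             | 0       | 23.3 | 0.672                   | 32  | 1       |
| 3   | 1           | 89      | 66            | 23            | 94      | 28.1 | 0.167                   | 21  | 0       |
| 4   | 0           | 137     | 40            | 35            | 168     | 43.1 | 2.288                   | 33  | 1       |
| ... | ...         | ...     | ...           | ...           | ...     | ...  | ...                     | ... | ...     |
| 763 | 10          | 101     | 76            | 48            | 180     | 32.9 | 0.171                   | 63  | 0       |
| 764 | 2           | 122     | 70            | 27            | 0       | 36.8 | 0.340                   | 27  | 0       |
| 765 | 5           | 121     | 72            | 23            | 112     | 26.2 | 0.245                   | 30  | 0       |
| 766 | 1           | 126     | 60            | 0             | 0       | 30.1 | 0.349                   | 47  | 1       |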


In [4]:
# Features: every column except the last one (Outcome)
x = dataset.iloc[:, :-1]
# Target: the binary Outcome column
y = dataset["Outcome"]
x

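|     | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|-----|-------------|---------|---------------|---------------|---------|-----|--------------------------|-----|
| 0   | 6           | 148     | 72            | 35            | 0       | 33.6 | 0.627                   | 50  |
| 1   | 1           | 85      | 66            | 29            | 0       | 26.6 | 0.351                   | 31  |
| 2   | 8           | 183     | 64            | 0             | 0       | 23.3 | 0.672                   | 32  |
| 3   | 1           | 89      | 66            | 23            | 94      | 28.1 | 0.167                   | 21  |
| 4   | 0           | 137     | 40            | 35            | 168     | 43.1 | 2.288                   | 33  |
| ... | ...         | ...     | ...           | ...           | ...     | ...  | ...                     | ... |
| 763 | 10          | 101     | 76            | 48            | 180     | 32.9 | 0.171                   | 63  |
| 764 | 2           | 122     | 70            | 27            | 0       | 36.8 | 0.340                   | 27  |
| 765 | 5           | 121     | 72            | 23            | 112     | 26.2 | 0.245                   | 30  |
| 766 | 1           | 126     | 60            | 0             | 0       | 30.1 | 0.349                   | 47  |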


In [5]:

from mlxtend.feature_selection import SequentialFeatureSelector


In [6]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr


In [7]:
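# Forward selection of 5 features. With no cv or scoring argument, mlxtend
# uses 5-fold cross-validation and, for classifiers, accuracy as the score.
fe = SequentialFeatureSelector(lr, k_features=5, forward=True)
fe.fit(x, y)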

In [8]:
fe.feature_names

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

In [9]:
fe.k_feature_names_

('Pregnancies', 'Glucose', 'Insulin', 'BMI', 'Age')

In [10]:
# Cross-validated score of the selected 5-feature subset
fe.k_score_

0.7708768355827178