# Feature Reduction 2 - model accuracy

Recursive Feature Elimination (RFE) is a feature selection technique that repeatedly fits a model, removes the least important feature(s), and refits on the remaining features until the requested number of features is left. It ranks the features by the order in which they were eliminated and keeps the best-ranked subset for model building.
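
To make the elimination loop concrete, here is a minimal from-scratch sketch. It assumes a binary logistic regression whose absolute coefficients serve as the importance score; `manual_rfe`, `X_demo`, and `y_demo` are hypothetical names used only for illustration, and this is not how `scikit-learn` implements `RFE` internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: drop the weakest feature until n remain."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    eliminated = []                      # column indices in order of removal
    while len(remaining) > n_features_to_select:
        # Refit on the surviving columns only
        model = LogisticRegression().fit(X[:, remaining], y)
        importance = np.abs(model.coef_[0])
        # The feature with the smallest absolute coefficient goes next
        weakest = remaining[int(np.argmin(importance))]
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated


# Synthetic data, scaled so that coefficient sizes are comparable
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

kept, dropped = manual_rfe(X_demo, y_demo, n_features_to_select=5)
print("Kept:", kept)
print("Dropped (first to last):", dropped)
```

Features dropped first would receive the highest rank numbers in an RFE-style ranking, and the features in `kept` would all share rank 1.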

Here's a brief overview of how to use RFE in Python with `scikit-learn`:

1. **Import necessary libraries:**
   ```python
   from sklearn.feature_selection import RFE
   from sklearn.linear_model import LogisticRegression
   ```

2. **Create the model and RFE object:**
   ```python
   model = LogisticRegression()
   rfe = RFE(model, n_features_to_select=5)
   ```

3. **Fit the RFE object to the data:**
   ```python
   rfe = rfe.fit(X, y)
   ```

4. **Use `rfe.support_` to get a boolean mask of the selected features:**
   ```python
   selected_features = rfe.support_
   ```

5. **Use `rfe.ranking_` to get the ranking of all features:**
   ```python
   feature_ranking = rfe.ranking_
   ```

### Example Code


In [1]:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Generate a dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Create a model
model = LogisticRegression()

# Create the RFE object
rfe = RFE(model, n_features_to_select=5)

# Fit the RFE object to the data
rfe = rfe.fit(X, y)

# Get the boolean mask of selected features
selected_features = rfe.support_

# Get the ranking of all features
feature_ranking = rfe.ranking_

print("Selected Features:", selected_features)
print("Feature Ranking:", feature_ranking)

Selected Features: [ True  True  True  True False False  True False False False]
Feature Ranking: [1 1 1 1 3 4 1 5 2 6]




- `rfe.support_`: A boolean array indicating which features are selected.
- `rfe.ranking_`: An array of feature rankings, where 1 marks the selected features and higher numbers mark features that were eliminated earlier.
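
As a quick check of how the two attributes relate, the sketch below rebuilds the mask from the ranking and then uses `rfe.transform` to keep only the selected columns. The names `X_demo`, `y_demo`, `rfe_demo`, and `X_reduced` are illustrative and not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=42)
rfe_demo = RFE(LogisticRegression(), n_features_to_select=5).fit(X_demo, y_demo)

# Every selected feature has rank 1, so both attributes describe the same subset
mask_from_ranking = rfe_demo.ranking_ == 1
print(np.array_equal(mask_from_ranking, rfe_demo.support_))

# Column indices of the selected features
selected_idx = np.flatnonzero(rfe_demo.support_)
print(selected_idx)

# transform() returns the data restricted to the selected columns
X_reduced = rfe_demo.transform(X_demo)
print(X_reduced.shape)
```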

In [2]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Generate a dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(10)])

# Create a model
model = LogisticRegression()

# Create the RFE object
rfe = RFE(model, n_features_to_select=5)

# Fit the RFE object to the data
rfe = rfe.fit(X, y)

# Get the ranking of all features
feature_ranking = rfe.ranking_

# Create a dictionary mapping feature names to their rankings
feature_ranking_dict = dict(zip(X.columns, feature_ranking))

print("Feature Ranking:", feature_ranking_dict)

Feature Ranking: {'feature_0': 1, 'feature_1': 1, 'feature_2': 1, 'feature_3': 1, 'feature_4': 3, 'feature_5': 4, 'feature_6': 1, 'feature_7': 5, 'feature_8': 2, 'feature_9': 6}


## Diabetes classifier

This section uses the Pima Indians dataset to predict diabetes with logistic regression.

In [3]:
import pandas as pd

# import PimaIndians.csv
pima = pd.read_csv('PimaIndians.csv')


In [4]:
print(pima.head())

   pregnant  glucose  diastolic  triceps  insulin   bmi  family  age      test
0         1       89         66       23       94  28.1   0.167   21  negative
1         0      137         40       35      168  43.1   2.288   33  positive
2         3       78         50       32       88  31.0   0.248   26  positive
3         2      197         70       45      543  30.5   0.158   53  positive
4         1      189         60       23      846  30.1   0.398   59  positive


In [8]:
# create train, test data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X = pima.drop('test', axis=1)
y = pima['test']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
# fit scaler and transform training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# fit model
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)

# scale test data and make predictions
X_test_scaled = scaler.transform(X_test)

# predict
y_pred = lr.predict(X_test_scaled)

# report test accuracy and absolute feature coefficients
print(f"{accuracy_score(y_test, y_pred):.1%} accuracy on test set.")
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

77.2% accuracy on test set.
{'pregnant': 0.36, 'glucose': 1.12, 'diastolic': 0.13, 'triceps': 0.23, 'insulin': 0.13, 'bmi': 0.32, 'family': 0.4, 'age': 0.2}


Can the number of features be reduced without hurting the model accuracy?

In [10]:
# Start from all eight features (nothing removed yet) to get a baseline.
# The split below differs from the previous cell, so the accuracy changes too.
X = pima[['pregnant', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'family', 'age']]

# Performs a 75-25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print(f"{acc:.1%} accuracy on test set.") 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

79.6% accuracy on test set.
{'pregnant': 0.05, 'glucose': 1.23, 'diastolic': 0.03, 'triceps': 0.24, 'insulin': 0.19, 'bmi': 0.38, 'family': 0.35, 'age': 0.34}


Remove `diastolic`, the feature with the lowest coefficient (0.03).

In [12]:
# Remove the feature with the lowest model coefficient (diastolic)
X = pima[['pregnant', 'glucose', 'triceps', 'insulin', 'bmi', 'family', 'age']]

# Performs a 75-25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print(f"{acc:.1%} accuracy on test set.") 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

80.6% accuracy on test set.
{'pregnant': 0.05, 'glucose': 1.24, 'triceps': 0.24, 'insulin': 0.2, 'bmi': 0.39, 'family': 0.34, 'age': 0.35}


Remove the two features with the next-lowest coefficients: `pregnant` (0.05) and `insulin` (0.2).

In [13]:
# Remove the two features with the lowest model coefficients (pregnant, insulin)
X = pima[['glucose', 'triceps', 'bmi', 'family', 'age']]

# Performs a 75-25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print(f"{acc:.1%} accuracy on test set.") 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

79.6% accuracy on test set.
{'glucose': 1.13, 'triceps': 0.25, 'bmi': 0.34, 'family': 0.34, 'age': 0.37}


Keep only the feature with the highest coefficient: `glucose`.

In [14]:
# Keep only the feature with the highest model coefficient (glucose)
X = pima[['glucose']]

# Performs a 75-25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print(f"{acc:.1%} accuracy on test set.") 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

75.5% accuracy on test set.
{'glucose': 1.28}
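
The manual steps above follow one pattern: split, scale, fit, read the coefficients, drop the weakest feature, and repeat. A minimal sketch of that loop is shown here on a synthetic stand-in for the `pima` DataFrame, since the CSV file is not assumed to be available. All names ending in `_demo`, plus `features` and `history`, are illustrative and not part of the original notebook, and the printed accuracies will not match the Pima results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pima DataFrame (assumption, not the real data)
X_arr, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

features = list(X_demo.columns)
history = []  # (number of features, test accuracy, feature dropped next)

while features:
    # Same split settings as the manual cells: 75% train, 25% test
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo[features], y_demo, test_size=0.25, random_state=0
    )
    scaler_demo = StandardScaler()
    lr_demo = LogisticRegression().fit(scaler_demo.fit_transform(X_tr), y_tr)
    acc_demo = accuracy_score(y_te, lr_demo.predict(scaler_demo.transform(X_te)))

    # The feature with the smallest absolute coefficient is the next to go
    coefs = pd.Series(abs(lr_demo.coef_[0]), index=features)
    weakest = coefs.idxmin()
    history.append((len(features), acc_demo, weakest))
    features.remove(weakest)

for n, acc_n, dropped in history:
    print(f"{n} features: {acc_n:.1%} accuracy, next to drop: {dropped}")
```

Reading the printed table from top to bottom shows where accuracy starts to fall, which is the same judgment made by hand in the cells above.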


Automatic recursive feature elimination with `RFE`.

In [None]:
# Start from all eight features and let RFE decide which ones to drop
X = pima[['pregnant', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'family', 'age']]

# Performs a 75-25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


# scale train, test data 
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Create the RFE with a LogisticRegression estimator and 3 features to select
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, verbose=1)

# Fits the eliminator to the data
rfe.fit(X_train_scaled, y_train)

# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))

# Print the features that are not eliminated
print(X.columns[rfe.support_])

# Calculates the test set accuracy
acc = accuracy_score(y_test, rfe.predict(X_test_scaled))
print(f"{acc:.1%} accuracy on test set.") 

Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
{'pregnant': 5, 'glucose': 1, 'diastolic': 6, 'triceps': 3, 'insulin': 4, 'bmi': 1, 'family': 2, 'age': 1}
Index(['glucose', 'bmi', 'age'], dtype='object')
80.6% accuracy on test set.
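
The cell above fixes the number of features at 3 by hand. One possible way to let the data choose that number is `RFECV` from `sklearn.feature_selection`, which runs the same elimination under cross-validation on the training set. The sketch below is only an illustration on synthetic data; the `_demo` names, `cv=5`, and `scoring="accuracy"` are assumptions rather than choices made in the original notebook, and scaling once before cross-validation is a simplification.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features (assumption, not the Pima data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Scale with statistics learned from the training split only
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Eliminate one feature per step and score each subset size with 5-fold CV
rfecv_demo = RFECV(estimator=LogisticRegression(), step=1, cv=5, scoring="accuracy")
rfecv_demo.fit(X_tr_scaled, y_tr)

print("Number of features chosen:", rfecv_demo.n_features_)
print("Selected mask:", rfecv_demo.support_)
print(f"{rfecv_demo.score(X_te_scaled, y_te):.1%} accuracy on test set.")
```

Because the subset size is picked with cross-validation on the training data, the test set is used only for the final accuracy figure.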
