# Lab session: Feature selection

### Objective

1. Understand the need for feature selection.

2. Apply various feature selection techniques.

3. Evaluate the performance of models with and without feature selection.


Feature selection is a crucial step in machine learning that helps improve the performance of models by selecting only the most relevant features (or columns) in a dataset and removing the rest. This reduces noise, speeds up computation, and can improve model accuracy. In our lab session, we'll go through the entire process step-by-step using generated data, covering different techniques for feature selection.

## Step 1: Generate a Synthetic Dataset

We'll start by generating a dataset with the following characteristics:

- 1000 samples (rows)

- 15 features (columns), with some of them irrelevant to the target variable

To simulate real-world data, we'll include a binary target variable (0 or 1, representing two classes) and a mix of relevant and irrelevant features.

In [1]:
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Generate 10 informative features (correlated with the target)
X_informative = np.random.randn(1000, 10) * 0.5 
# Generate 5 noise features (not correlated with the target)
X_noise = np.random.randn(1000, 5)
# Generate binary target variable based on informative features
target = (X_informative.sum(axis=1) + np.random.randn(1000) * 0.1 > 0.3).astype(int)

# Combine informative and noise features
X = np.hstack([X_informative, X_noise])
feature_names = [f'feature_{i+1}' for i in range(X.shape[1])]
data = pd.DataFrame(X, columns=feature_names)
data['target'] = target

data.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15,target
0,0.248357,-0.069132,0.323844,0.761515,-0.117077,-0.117068,0.789606,0.383717,-0.234737,0.27128,-0.678495,-0.305499,-0.597381,0.110418,1.197179,1
1,-0.231709,-0.232865,0.120981,-0.95664,-0.862459,-0.281144,-0.506416,0.157124,-0.454012,-0.706152,-0.771042,1.00082,-0.781672,-0.847627,0.818595,0
2,0.732824,-0.112888,0.033764,-0.712374,-0.272191,0.055461,-0.575497,0.187849,-0.300319,-0.145847,0.921936,0.85141,-1.315797,-0.465951,0.822989,0
3,-0.300853,0.926139,-0.006749,-0.528855,0.411272,-0.610422,0.104432,-0.979835,-0.664093,0.098431,0.041542,-1.073693,0.458318,-0.714807,1.794525,0
4,0.369233,0.085684,-0.057824,-0.150552,-0.739261,-0.359922,-0.230319,0.528561,0.171809,-0.88152,1.544841,0.604097,1.361007,0.064791,0.765437,0


In [2]:
print(data['target'].value_counts())

target
0    569
1    431
Name: count, dtype: int64
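As an alternative to building the arrays by hand, scikit-learn's `make_classification` can produce a dataset of the same shape. This is a minimal sketch, not the lab's method: the class structure it generates differs from the sum-threshold rule used above, and the names `X_alt`, `y_alt` and `data_alt` are introduced here only for illustration. With `shuffle=False`, the informative columns come first and the remaining columns are pure noise.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Assumed settings mirroring the lab: 1000 rows, 10 informative + 5 noise columns
X_alt, y_alt = make_classification(
    n_samples=1000,
    n_features=15,
    n_informative=10,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    shuffle=False,      # keep informative columns first, noise columns last
    random_state=42,
)

feature_names = [f'feature_{i+1}' for i in range(X_alt.shape[1])]
data_alt = pd.DataFrame(X_alt, columns=feature_names)
data_alt['target'] = y_alt

print(data_alt.shape)
print(data_alt['target'].value_counts())
```

Because `shuffle=False` also leaves the rows grouped by class, the rows should be shuffled before any split (`train_test_split` shuffles by default).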


## Step 2: Why Feature Selection?

In real datasets, we often have many features, some of which do not contribute to predicting the target variable. Including unnecessary features can:

1. Increase model complexity and risk of overfitting.

2. Reduce model interpretability.

3. Slow down training and make the model less accurate.

By selecting only the most important features, we can create simpler, faster, and often more accurate models.
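The effect of irrelevant features can be illustrated with a small experiment. The sketch below is an assumption-laden toy example, not part of the lab: it uses a k-nearest-neighbours classifier (chosen here because distance-based models are known to be sensitive to noise dimensions) and adds a growing number of pure-noise columns to five informative ones.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Assumed toy data: 5 informative columns decide the class
X_info = rng.normal(size=(500, 5))
y = (X_info.sum(axis=1) > 0).astype(int)

scores = {}
for n_noise in (0, 20, 100):
    # Append columns of pure noise that carry no information about y
    X_all = np.hstack([X_info, rng.normal(size=(500, n_noise))])
    knn = KNeighborsClassifier(n_neighbors=5)
    # Mean accuracy over 5 cross-validation folds
    scores[n_noise] = cross_val_score(knn, X_all, y, cv=5).mean()
    print(f'{n_noise:3d} noise features -> mean CV accuracy {scores[n_noise]:.2f}')
```

Other model families react differently; a regularised linear model, for example, is typically much less affected by a handful of noise columns.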

## Step 3: Techniques for Feature Selection

We'll explore three main types of feature selection techniques:

1. **Filter Methods**

2. **Wrapper Methods**

3. **Embedded Methods**

Each has different strengths, and understanding when to use each is important.


### 3.1 Filter Methods

Filter methods use statistical measures to assess the importance of each feature independently of the model.

#### Example: Correlation-based Feature Selection

We can calculate the correlation of each feature with the target variable and remove those with low correlation.

In [3]:
correlations = data.corr()['target'].abs().sort_values(ascending=False)
print('correlation:',correlations)

correlation: target        1.000000
feature_8     0.285276
feature_9     0.279393
feature_2     0.267687
feature_5     0.261309
feature_3     0.258959
feature_4     0.242934
feature_10    0.235953
feature_6     0.218040
feature_1     0.209317
feature_7     0.186391
feature_15    0.045910
feature_14    0.031970
feature_11    0.015926
feature_13    0.005452
feature_12    0.001614
Name: target, dtype: float64
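The cell above only ranks the features. A minimal sketch of the removal step could look like the following, assuming a cut-off of 0.1 on the absolute correlation (this threshold is a hypothetical choice, not one prescribed by the lab). The dataset is rebuilt so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the Step 1 dataset so this sketch is self-contained
np.random.seed(42)
X_informative = np.random.randn(1000, 10) * 0.5
X_noise = np.random.randn(1000, 5)
target = (X_informative.sum(axis=1) + np.random.randn(1000) * 0.1 > 0.3).astype(int)
X = np.hstack([X_informative, X_noise])
feature_names = [f'feature_{i+1}' for i in range(X.shape[1])]
data = pd.DataFrame(X, columns=feature_names)
data['target'] = target

# Absolute correlation of every feature with the target (target itself excluded)
correlations = data.corr()['target'].drop('target').abs()

# Assumed cut-off: keep features whose |correlation| exceeds 0.1
threshold = 0.1
kept = correlations[correlations > threshold].index.tolist()
dropped = correlations[correlations <= threshold].index.tolist()

data_filtered = data[kept + ['target']]
print('kept:', kept)
print('dropped:', dropped)
```

With the correlations printed above, any threshold between roughly 0.05 and 0.18 separates the ten informative features from the five noise features; on real data the gap is rarely this clean.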


### 3.2 Wrapper Methods

Wrapper methods evaluate feature subsets by actually training and validating models on them. This is more computationally expensive than a filter, but because each subset is scored with the model itself, it often gives better results.

#### Example: Recursive Feature Elimination (RFE)

RFE is a popular wrapper method that recursively removes the least important features, training the model multiple times to rank the features.


In [4]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import pandas as pd

# Define the model using linear regression
model = LinearRegression()

# Define RFE with the model and specify the number of features to select
rfe = RFE(model, n_features_to_select=10)

# Separate features and target variable
X = data.drop(columns=['target'])
y = data['target']

# Fit RFE to the data
rfe.fit(X, y)

# Get the ranking of features
feature_ranking = pd.DataFrame({
    'feature': X.columns,
    'rank': rfe.ranking_
}).sort_values(by='rank')

print(feature_ranking)


       feature  rank
0    feature_1     1
1    feature_2     1
2    feature_3     1
3    feature_4     1
4    feature_5     1
5    feature_6     1
6    feature_7     1
7    feature_8     1
8    feature_9     1
9   feature_10     1
14  feature_15     2
10  feature_11     3
11  feature_12     4
13  feature_14     5
12  feature_13     6
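In the cell above the number of features to keep is fixed by hand. One possible variation, sketched below, lets cross-validation choose it with scikit-learn's `RFECV`. Two things here are assumptions rather than the lab's choices: `LogisticRegression` is swapped in for `LinearRegression` because the target is binary, and 5-fold accuracy is used as the selection criterion.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Rebuild the Step 1 dataset so this sketch is self-contained
np.random.seed(42)
X_informative = np.random.randn(1000, 10) * 0.5
X_noise = np.random.randn(1000, 5)
target = (X_informative.sum(axis=1) + np.random.randn(1000) * 0.1 > 0.3).astype(int)
X = pd.DataFrame(np.hstack([X_informative, X_noise]),
                 columns=[f'feature_{i+1}' for i in range(15)])
y = pd.Series(target, name='target')

# Eliminate one feature per round and score each subset size with 5-fold CV
selector = RFECV(
    estimator=LogisticRegression(max_iter=200),
    step=1,
    cv=5,
    scoring='accuracy',
    min_features_to_select=1,
)
selector.fit(X, y)

print('Number of features kept:', selector.n_features_)
print('Kept features:', list(X.columns[selector.support_]))
```

The fitted selector also exposes `ranking_`, which can be read in the same way as the ranking table above.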


### 3.3 Embedded Methods

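Embedded methods perform feature selection as part of the model training process itself. They are usually faster than wrapper methods, because the model is trained once rather than once per candidate subset, and often more accurate than filter methods, because the importance scores come from the model itself.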

#### Example: Feature Importance from a Tree-based Model

Tree-based algorithms like **Random Forest** and **Gradient Boosting** calculate feature importance as part of their training. Here we use a Random Forest: it trains an ensemble of decision trees and scores each feature by how much its splits reduce impurity across the nodes of those trees.

In [5]:
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest model
model = RandomForestClassifier()
model.fit(X, y)

# Get feature importances
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values(by='importance', ascending=False)

print(importances)

       feature  importance
7    feature_8    0.105836
4    feature_5    0.094658
1    feature_2    0.092188
2    feature_3    0.085131
8    feature_9    0.084862
3    feature_4    0.082718
9   feature_10    0.078795
0    feature_1    0.072479
5    feature_6    0.067805
6    feature_7    0.062819
10  feature_11    0.036191
14  feature_15    0.036125
13  feature_14    0.035001
12  feature_13    0.032843
11  feature_12    0.032550
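The importance table still has to be turned into a selection. One way to do this, sketched below, is scikit-learn's `SelectFromModel`, which keeps the features whose importance reaches a threshold. The threshold `'mean'` (keep features at or above the average importance) and the fixed `random_state` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Rebuild the Step 1 dataset so this sketch is self-contained
np.random.seed(42)
X_informative = np.random.randn(1000, 10) * 0.5
X_noise = np.random.randn(1000, 5)
target = (X_informative.sum(axis=1) + np.random.randn(1000) * 0.1 > 0.3).astype(int)
X = pd.DataFrame(np.hstack([X_informative, X_noise]),
                 columns=[f'feature_{i+1}' for i in range(15)])
y = pd.Series(target, name='target')

# Fit the forest inside the selector and keep features whose
# impurity-based importance is at least the mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold='mean',
)
selector.fit(X, y)

kept = X.columns[selector.get_support()].tolist()
X_selected = selector.transform(X)

print('kept:', kept)
print('shape after selection:', X_selected.shape)
```

Since impurity-based importances sum to 1, the mean threshold is 1/15 for this dataset. An informative feature with a below-average score can be dropped by this rule, so the threshold is worth tuning.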


## Step 4: Evaluating Model Performance

Finally, we’ll test the performance of a model built with and without feature selection to see the impact. For simplicity, we’ll use a logistic regression classifier:

1. **Without Feature Selection**: Train a model using all features.

2. **With Feature Selection**: Train a model using only the selected features.

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Model without feature selection
model_all_features = LogisticRegression(max_iter=200)
model_all_features.fit(X_train, y_train)
predictions_all = model_all_features.predict(X_test)
accuracy_all = accuracy_score(y_test, predictions_all)
print(f'Accuracy without feature selection: {accuracy_all:.2f}')

# Model with selected features (the 10 features ranked 1 by RFE)
selected_features = feature_ranking[feature_ranking['rank'] == 1]['feature']
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

model_selected_features = LogisticRegression(max_iter=200)
model_selected_features.fit(X_train_selected, y_train)
predictions_selected = model_selected_features.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, predictions_selected)

print(f'Accuracy with feature selection: {accuracy_selected:.2f}')


Accuracy without feature selection: 0.97
Accuracy with feature selection: 0.97


In [7]:
selected_features

0     feature_1
1     feature_2
2     feature_3
3     feature_4
4     feature_5
5     feature_6
6     feature_7
7     feature_8
8     feature_9
9    feature_10
Name: feature, dtype: object
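A single train/test split without a fixed seed gives one noisy estimate, and in the cells above RFE was fitted on the full dataset before the split. A more careful comparison, sketched here under the assumption that 5-fold cross-validation is acceptable for this lab, puts the selection step inside a `Pipeline` so that it is refitted on the training part of each fold. `LogisticRegression` is used as the RFE estimator in this sketch instead of `LinearRegression`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Rebuild the Step 1 dataset so this sketch is self-contained
np.random.seed(42)
X_informative = np.random.randn(1000, 10) * 0.5
X_noise = np.random.randn(1000, 5)
target = (X_informative.sum(axis=1) + np.random.randn(1000) * 0.1 > 0.3).astype(int)
X = pd.DataFrame(np.hstack([X_informative, X_noise]),
                 columns=[f'feature_{i+1}' for i in range(15)])
y = pd.Series(target, name='target')

# Baseline: logistic regression on all 15 features
all_features = LogisticRegression(max_iter=200)

# Selection + classifier in one estimator, so RFE only sees training folds
with_rfe = Pipeline([
    ('select', RFE(LogisticRegression(max_iter=200), n_features_to_select=10)),
    ('clf', LogisticRegression(max_iter=200)),
])

scores_all = cross_val_score(all_features, X, y, cv=5, scoring='accuracy')
scores_rfe = cross_val_score(with_rfe, X, y, cv=5, scoring='accuracy')

print(f'All features: {scores_all.mean():.3f} +/- {scores_all.std():.3f}')
print(f'With RFE:     {scores_rfe.mean():.3f} +/- {scores_rfe.std():.3f}')
```

Reporting the spread across folds makes it easier to judge whether a difference between the two models is larger than the fold-to-fold variation.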



###  Key Takeaways

1. **Feature Selection** is essential for building efficient models, especially with large datasets.

2. **Filter methods** are fast and easy but don't consider feature interactions.

3. **Wrapper methods** provide good results but can be slow.

4. **Embedded methods** are a good balance, leveraging the model’s structure to determine feature importance.

By selecting features carefully, we can build simpler, faster, and potentially more accurate models.

## Task 1: **Combining Feature Selection Methods**

1. Generate the synthetic dataset of your project.

2. First, use the filter method (correlation-based) to remove irrelevant features.

3. Then, apply RFE to further eliminate any unnecessary features.

4. Finally, train a Random Forest model and use the feature importance scores to select the top features.

5. Compare and evaluate the model performance at each step (using all features, after filtering, after RFE, and after feature importance selection).
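The steps above can be chained as in the following sketch. It runs on the Step 1 dataset rather than a project dataset, and every cut-off in it (correlation above 0.05, 8 features after RFE, top 5 by importance) is a hypothetical value chosen for illustration, as is the helper name `cv_accuracy`. Note that the selection here is done on the full data before cross-validation, which makes the reported accuracies somewhat optimistic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data: the Step 1 dataset
np.random.seed(42)
X_informative = np.random.randn(1000, 10) * 0.5
X_noise = np.random.randn(1000, 5)
target = (X_informative.sum(axis=1) + np.random.randn(1000) * 0.1 > 0.3).astype(int)
X = pd.DataFrame(np.hstack([X_informative, X_noise]),
                 columns=[f'feature_{i+1}' for i in range(15)])
y = pd.Series(target, name='target')

def cv_accuracy(columns):
    """Mean 5-fold accuracy of logistic regression on the given columns."""
    clf = LogisticRegression(max_iter=200)
    return cross_val_score(clf, X[columns], y, cv=5).mean()

results = {'all features': cv_accuracy(list(X.columns))}

# 1) Filter: drop features with low absolute correlation to the target
corr = X.corrwith(y).abs()
after_filter = corr[corr > 0.05].index.tolist()
results['after filter'] = cv_accuracy(after_filter)

# 2) Wrapper: RFE on the surviving features
n_keep = min(8, len(after_filter))
rfe = RFE(LogisticRegression(max_iter=200), n_features_to_select=n_keep)
rfe.fit(X[after_filter], y)
after_rfe = [c for c, keep in zip(after_filter, rfe.support_) if keep]
results['after RFE'] = cv_accuracy(after_rfe)

# 3) Embedded: keep the top features by Random Forest importance
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X[after_rfe], y)
ranked = pd.Series(forest.feature_importances_, index=after_rfe)
top = ranked.sort_values(ascending=False).head(5).index.tolist()
results['after RF importance'] = cv_accuracy(top)

for step, acc in results.items():
    print(f'{step:22s} {acc:.3f}')
```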




## Task 2: **Visualizing Feature Selection**

1. Use the synthetic dataset you generated in Task 1.

2. Apply any feature selection technique (e.g., correlation-based, RFE, or Random Forest importance).

3. Plot the accuracy of the model against the number of features selected.

4. Use bar plots to show how the importance of features changes during the selection process.

5. Observe and analyze visually how feature selection impacts performance.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from GenerativeModel import *

In [2]:
# Generation of synthetic data
np.random.seed(1)
gen = ComplexDependentSatisfaction(1000)
X_noise = np.random.randint(1,5, size=(1000,5))
X = np.random.randint(1,5, size=(5,1000))
X = np.hstack([X_noise, gen.data])
feature_names = ["noise_1", "noise_2", "noise_3", "noise_4", "noise_5", "price", "punctuality", "duration",
                 "frequency", "overcrowding", "satisfaction"]
data = pd.DataFrame(X, columns=feature_names)
data.head()

NameError: name 'data' is not defined

In [None]:
# Filter method
correlations = data.corr()['satisfaction'].abs().sort_values(ascending=False)
print('correlation:',correlations)

### Observations:
Among the five real features, the filter method ranks frequency and duration as the least relevant. Before removing anything, let's try other methods to see whether we get different results.

Note: price, overcrowding and punctuality have the strongest correlations with satisfaction.

In [None]:
# RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Define the model using linear regression
model = LinearRegression()

# Define RFE with the model and specify the number of features to select
rfe = RFE(model, n_features_to_select=5)

# Separate features and target variable
X = data.drop(columns=['satisfaction'])
y = data['satisfaction']

# Fit RFE to the data
rfe.fit(X, y)

# Get the ranking of features
feature_ranking = pd.DataFrame({
    'feature': X.columns,
    'rank': rfe.ranking_
}).sort_values(by='rank')

print(feature_ranking)

### Observation:
With RFE we get different results from the filter method. We want to choose 4 out of 5 features, and RFE chose punctuality, price and overcrowding. The least relevant feature is frequency. Let's try one more method to see whether we can conclude anything.

In [None]:
# Random Forest

from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest model
model = RandomForestClassifier()
model.fit(X, y)

# Get feature importances
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values(by='importance', ascending=False)

print(importances)

### Observations:
Training a Random Forest model gives us an estimate of the importance of each feature, and the top 3 is the same as with the filter method. This time, however, the importance of punctuality is not far from that of duration and frequency, so we can try training our model with only the two main features, price and overcrowding.

In [None]:
# comparing the different models

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model without feature selection
model_all_features = LogisticRegression(max_iter=200)
model_all_features.fit(X_train, y_train)
predictions_all = model_all_features.predict(X_test)
accuracy_all = accuracy_score(y_test, predictions_all)
print(f'Accuracy without feature selection: {accuracy_all:.2f}')

# Model with selected features (top 5 from RFE)
selected_features = feature_ranking[feature_ranking['rank'] == 1]['feature']
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

model_selected_features = LogisticRegression(max_iter=200)
model_selected_features.fit(X_train_selected, y_train)
predictions_selected = model_selected_features.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, predictions_selected)

print(f'Accuracy with feature selection (RFE): {accuracy_selected:.2f}')

# Model with selected features (correlation filter > 0.4)
X_train_selected = X_train.loc[:,["price", "overcrowding"]]
X_test_selected = X_test.loc[:,["price", "overcrowding"]]

model_selected_features = LogisticRegression(max_iter=200)
model_selected_features.fit(X_train_selected, y_train)
predictions_selected = model_selected_features.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, predictions_selected)

print(f'Accuracy with feature selection (Filter): {accuracy_selected:.2f}')

# Model with selected features (Random Forest)
selected_features = importances[importances['importance'] > 0.1]['feature']
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

model_selected_features = LogisticRegression(max_iter=200)
model_selected_features.fit(X_train_selected, y_train)
predictions_selected = model_selected_features.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, predictions_selected)

print(f'Accuracy with feature selection (Random Forest): {accuracy_selected:.2f}')
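The four evaluations above repeat the same fit, predict and score pattern. A small helper could remove the duplication; the sketch below shows one possible shape for it. The function name `evaluate_subset` is hypothetical, and since `GenerativeModel` is not available here, the data is a stand-in with two signal columns and two noise columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_subset(columns, X_train, X_test, y_train, y_test):
    """Train logistic regression on the given columns and return test accuracy."""
    clf = LogisticRegression(max_iter=200)
    clf.fit(X_train[columns], y_train)
    return accuracy_score(y_test, clf.predict(X_test[columns]))

# Stand-in data (assumption): the class depends on the two signal columns only
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)),
                 columns=['signal_1', 'signal_2', 'noise_1', 'noise_2'])
y = (X['signal_1'] + X['signal_2'] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Each entry plays the role of one selection method's feature list
subsets = {
    'all features': list(X.columns),
    'signals only': ['signal_1', 'signal_2'],
    'noise only': ['noise_1', 'noise_2'],
}
results = {name: evaluate_subset(cols, X_train, X_test, y_train, y_test)
           for name, cols in subsets.items()}

for name, acc in results.items():
    print(f'Accuracy with {name}: {acc:.2f}')
```

Keeping the results in a dictionary also avoids overwriting `accuracy_selected` after each method, so the values remain available for plotting.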

In [None]:
accuracies = np.zeros(X.shape[1])
for i in range(1, X.shape[1] + 1):
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=i)
    rfe.fit(X, y)

    # Get the ranking of features
    feature_ranking = pd.DataFrame({
        'feature': X.columns,
        'rank': rfe.ranking_
    }).sort_values(by='rank')
    # top i from RFE
    selected_features = feature_ranking[feature_ranking['rank'] == 1]['feature']
    X_train_selected = X_train[selected_features]
    X_test_selected = X_test[selected_features]

    
    model_selected_features = LogisticRegression(max_iter=200)
    model_selected_features.fit(X_train_selected, y_train)
    predictions_selected = model_selected_features.predict(X_test_selected)
    accuracies[i - 1] = accuracy_score(y_test, predictions_selected)

In [None]:
# Plot accuracy vs. number of features
plt.plot(np.arange(1, X.shape[1] + 1), accuracies, marker='o')
plt.xlabel("Number of Features Selected")
plt.ylabel("Accuracy")
plt.title("Model Accuracy vs. Number of Features Selected")
plt.grid()
plt.show()

In [None]:
i = np.argmax(accuracies)
print(f"Max accuracy is {accuracies[i]}, with {i + 1} features selected !")
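In the loop above, RFE is fitted on the full `X` and `y`, so the test rows influence which features are selected. A variant that fits RFE on the training rows only is sketched below. The data is a stand-in (three signal and three noise columns with hypothetical names), because the project generator is not available here; the estimators are the same as in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): the class depends on s1, s2 and s3 only
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(400, 6)),
                 columns=['s1', 's2', 's3', 'n1', 'n2', 'n3'])
y = (X['s1'] + X['s2'] + X['s3'] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

accuracies = np.zeros(X.shape[1])
for k in range(1, X.shape[1] + 1):
    # Feature selection sees the training rows only
    rfe = RFE(LinearRegression(), n_features_to_select=k)
    rfe.fit(X_train, y_train)
    selected = X.columns[rfe.support_]

    # Score the selected subset on the held-out rows
    clf = LogisticRegression(max_iter=200)
    clf.fit(X_train[selected], y_train)
    accuracies[k - 1] = accuracy_score(y_test, clf.predict(X_test[selected]))

best = np.argmax(accuracies)
print(f'Best accuracy {accuracies[best]:.2f} with {best + 1} features')
```

`np.argmax` returns the first maximum, so when several feature counts tie, the smallest one is reported.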

In [None]:
# Exact values from user-provided results
features = ["noise_1", "noise_2", "noise_3", "noise_4", "noise_5", "price", "punctuality", "duration", "frequency", "overcrowding"]

# Correlation-Based Importance
correlation_importance = [0.006054, 0.021606, 0.006088, 0.038890, 0.004496, 0.519185, 0.269080, 0.065455, 0.028504, 0.465135]

# RFE-Based Importance (inverse of rank for better visualization)
rfe_ranking = [7, 6, 9, 8, 10, 1, 2, 3, 4, 5]  # Rankings from RFE
rfe_importance = [1 / rank if rank != 0 else 0 for rank in rfe_ranking]

# Random Forest-Based Importance
random_forest_importance = [0.037449, 0.034916, 0.037431, 0.03365, 0.041367, 0.319403, 0.124820, 0.036293, 0.035772, 0.298895]

# DataFrame
importance_df = pd.DataFrame({
    'Feature': features,
    'Correlation-Based': correlation_importance,
    'RFE': rfe_importance,
    'Random Forest': random_forest_importance
})

# Bar plots for feature importance changes
importance_df.set_index('Feature').plot(kind='bar', figsize=(14, 8))
plt.title("Feature Importance Across Selection Methods (User-Provided Values)")
plt.ylabel("Importance")
plt.xlabel("Features")
plt.xticks(rotation=45)
plt.legend(title="Method")
plt.tight_layout()
plt.show()

### Observations:
Key features such as price and overcrowding rank high in all methods, which demonstrates their strong predictive power, while the noise features keep a low importance, which highlights their irrelevance.
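The bar chart puts three different scales on one axis: absolute correlations, inverse RFE ranks, and impurity-based importances. One way to compare the methods more directly, sketched here as a suggestion rather than as part of the task, is to convert each column to a within-method rank. The numbers are the ones hard-coded in the cell above.

```python
import pandas as pd

features = ["noise_1", "noise_2", "noise_3", "noise_4", "noise_5",
            "price", "punctuality", "duration", "frequency", "overcrowding"]

# Values copied from the plotting cell above
scores = pd.DataFrame({
    'Correlation-Based': [0.006054, 0.021606, 0.006088, 0.038890, 0.004496,
                          0.519185, 0.269080, 0.065455, 0.028504, 0.465135],
    'RFE': [1 / r for r in [7, 6, 9, 8, 10, 1, 2, 3, 4, 5]],
    'Random Forest': [0.037449, 0.034916, 0.037431, 0.03365, 0.041367,
                      0.319403, 0.124820, 0.036293, 0.035772, 0.298895],
}, index=features)

# Rank 1 = most important feature within each method
ranks = scores.rank(ascending=False).astype(int)

# Average position across the three methods, for sorting only
ranks['mean rank'] = ranks.mean(axis=1)

print(ranks.sort_values('mean rank'))
```

Read this way, price is ranked first by all three methods, while overcrowding is second for the correlation filter and the Random Forest but fifth for RFE.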