# Feature Selection

Feature selection is the process of choosing a subset of relevant features (variables) from the larger set available in a dataset. The goal is to improve model performance, reduce overfitting, and enhance interpretability. Some common techniques are described below.

### Filter Methods:
These methods select features based on statistical properties such as correlation, mutual information, or significance tests, without training a model. Examples include:

#### Correlation:
Features that are highly correlated with the target variable are selected.
#### Mutual Information:
Measures how much information one variable provides about another; unlike correlation, it can also capture non-linear relationships.
#### Chi-square Test:
Tests whether two categorical variables (for example, a categorical feature and a categorical target) are independent.


### Wrapper Methods:
These methods select features by evaluating the performance of a specific machine learning algorithm on different feature subsets. Examples include:

#### Forward Selection:
Starts with an empty set of features and adds one feature at a time, each time choosing the one that improves model performance the most.
#### Backward Elimination:
Starts with all features and removes one feature at a time, each time dropping the one whose removal hurts model performance the least (or improves it the most).
#### Recursive Feature Elimination (RFE):
Repeatedly fits the model and removes the least important features, as ranked by their coefficients or feature importances.
### Embedded Methods:
These methods perform feature selection as part of the model training process. They use regularization techniques or built-in feature selection algorithms to penalize irrelevant features during model training. Examples include:

#### Lasso Regression:
Adds an L1 penalty term to the loss function, which encourages sparsity in the coefficients, effectively performing feature selection.
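
As a reference point, the L1-penalized objective can be written as follows. This is the form scikit-learn documents for its Lasso estimator, with $n$ samples, design matrix $X$, target $y$, coefficient vector $w$, and penalty strength $\alpha$; other texts scale the terms differently.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Larger values of $\alpha$ push more coefficients to exactly zero, which is what makes the penalty usable for feature selection.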

#### Random Forest Feature Importance:
Measures the importance of each feature based on how much it decreases impurity in decision trees.
#### Gradient Boosting Feature Importance:
Measures the contribution of each feature to the model's predictive power.
### Dimensionality Reduction Techniques:
While not strictly feature selection, dimensionality reduction methods like Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) reduce the number of features by transforming them into a lower-dimensional space while preserving as much information as possible.

Choosing the appropriate feature selection technique depends on the dataset, the machine learning algorithm used, and the specific goals of the analysis, such as improving model performance or enhancing interpretability.


## Example:
Predicting house prices with the following features:

- Number of bedrooms
- Square footage of the house
- Presence of a garage (yes/no)
- Distance to the nearest school (in miles)
- Age of the house (in years)
- Presence of a swimming pool (yes/no)

### Filter Methods:

#### Correlation:
We calculate the correlation coefficient between each feature and the target variable (house price). For example, if the number of bedrooms and square footage have high positive correlations with house price, we select them as relevant features.
#### Mutual Information:
We measure how much information each feature provides about the house price. Features with high mutual information scores are considered relevant.
#### Chi-square Test:
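The chi-square test applies to categorical variables, so the house price is first binned into categories (for example low, medium, and high). If the presence of a garage or swimming pool is significantly associated with the price category, these features are selected.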
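
A minimal sketch of the chi-square idea, assuming the price is binned into three quantile bands. The data, the price model, and the names `has_garage` and `price_band` are made up for illustration and are not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: a yes/no garage flag and a price that depends on it
rng = np.random.default_rng(0)
n = 500
has_garage = rng.integers(0, 2, size=n)
price = 200_000 + 50_000 * has_garage + rng.normal(0, 30_000, size=n)

# The chi-square test needs categorical variables, so bin the price into three bands
price_band = pd.qcut(price, q=3, labels=["low", "mid", "high"])

# Contingency table of garage (rows) against price band (columns)
table = pd.crosstab(
    pd.Series(has_garage, name="has_garage"),
    pd.Series(price_band, name="price_band"),
)

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2_stat:.1f}, dof = {dof}, p-value = {p_value:.3g}")
```

A small p-value suggests the feature and the price band are not independent, which is the signal a chi-square filter uses to keep the feature.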

### Wrapper Methods:

#### Forward Selection:
We start with no features and add them one by one, each time selecting the feature that improves the model's predictive performance the most.
#### Backward Elimination:
We start with all features and remove one feature at a time, each time dropping the one whose removal hurts the model's performance the least.
#### Recursive Feature Elimination (RFE):
We train the model with all features and then iteratively remove the least important feature based on their coefficients or feature importance until the desired number of features is reached.
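
Forward selection and backward elimination are not implemented in the cells that follow. One possible sketch uses scikit-learn's `SequentialFeatureSelector`. The choice of `LinearRegression`, `n_informative=3`, three features to keep, and 5-fold cross-validation are assumptions made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data in which only 3 of the 6 features drive the target (assumption)
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=5.0, random_state=0)

# Forward selection: start empty, add the feature that most improves the CV score
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
forward.fit(X, y)

# Backward elimination: start full, drop the feature whose removal costs the least
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
)
backward.fit(X, y)

print("Forward selected columns: ", forward.get_support(indices=True))
print("Backward selected columns:", backward.get_support(indices=True))
```

The two directions usually agree on easy problems like this one, but they are greedy searches and can return different subsets on real data.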

### Embedded Methods:

#### Lasso Regression:
We train a Lasso regression model that automatically penalizes irrelevant features by shrinking their coefficients to zero, effectively performing feature selection.
#### Random Forest Feature Importance:
We train a random forest model and measure the importance of each feature based on how much it decreases impurity in the decision trees. Features with higher importance are selected.
#### Gradient Boosting Feature Importance: 
Similar to random forest, we measure the contribution of each feature to the model's predictive power using gradient boosting.

### Dimensionality Reduction Techniques:

#### Principal Component Analysis (PCA):
We transform the original features into a lower-dimensional space while preserving as much variance as possible. The new components can be used as features in the model.
#### t-distributed Stochastic Neighbor Embedding (t-SNE):
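t-SNE is a technique for visualizing high-dimensional data by embedding it in a lower-dimensional space, usually two or three dimensions. It is primarily a visualization tool: scikit-learn's TSNE cannot project new, unseen samples, so the embedded coordinates are rarely suitable as inputs to a predictive model.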
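
To make "preserving as much variance as possible" concrete, the sketch below fits PCA on a training split and reports how much variance each component explains. Standardizing with `StandardScaler` first is an assumption added here, a common practice rather than something the notebook does.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=6, noise=0.1, random_state=42)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit the scaler and PCA on the training data only, then reuse them on the test data
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))

# Fraction of total variance captured by each of the two components
print("Explained variance ratio:", pca.explained_variance_ratio_)
# Each row shows how one component mixes the six original features
print("Component loadings:\n", pca.components_)
```

Each component is a mix of all six original features, which is why PCA reduces dimensionality without selecting individual features.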

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [2]:
# Generate sample dataset
X, y = make_regression(n_samples=100, n_features=6, noise=0.1, random_state=42)

In [3]:
print(X)

[[-0.98150865  0.01023306  0.46210347  1.44127329 -1.43586215  1.16316375]
 [-0.0660798  -1.66152006 -1.2110162   0.2597225  -0.90431663  0.63859246]
 [-0.39210815 -0.32766215 -1.46351495  0.09707755  0.96864499 -0.70205309]
 [ 0.87232064 -0.76734756  0.18334201  1.45114361  0.95927083  2.15318246]
 [-1.62754244 -1.8048821   0.04808495 -0.66262376  0.57059867 -0.76325916]
 [-1.53411417 -1.12464209  1.27767682  0.12029563  0.51443883  0.71161488]
 [-1.39856757  1.90941664  0.56296924 -0.87561825 -1.38279973  0.92617755]
 [ 0.51504769  0.51378595  3.85273149 -1.37766937 -0.93782504  0.51503527]
 [-0.58936476 -0.49300093  0.8496021   0.28099187 -0.62269952 -0.20812225]
 [-1.03724615  0.53891004 -0.19033868  0.07736831 -0.8612842   1.52312408]
 [ 0.08658979 -0.60398519 -0.15567724  0.57707213 -0.20304539  0.37114587]
 [ 0.08704707  0.8219025  -0.29900735 -0.03582604  1.56464366 -2.6197451 ]
 [ 0.64537595  0.97157095  1.36863156  2.06074792  1.75534084 -0.24896415]
 [ 1.14282281 -1.16867804

In [4]:
print(y)

[ -29.17548975 -190.1583977  -105.96389021  278.60982475 -254.92791649
  -18.21174044    2.70799058  215.51299974  -79.65861277    1.55121017
  -16.0012651   -33.88880277  337.34227617   60.00467584 -101.01259043
   -1.30581048  173.12608722 -282.88324463   99.93859788  -38.25125729
 -105.52954341   23.23034043 -115.17438183   38.39614316  -92.38293737
 -346.29961548 -151.71606868  -14.6795519    57.51866659   28.82612202
  -55.68757681 -137.00700912  133.20182865  -50.74656439   -7.60259898
  -59.63739971   32.26111076  225.03962981  116.6521056    35.05452593
  223.06656703  -43.41061132  -51.05177822 -189.91455799   -3.32767918
   86.59206255 -159.18764186  -93.32153107  166.62479837  -13.58393135
  164.79807396  100.74248531   42.01697297  -70.41377735    8.37574186
  123.0880192    60.93874066 -201.73242598 -362.27406285  -28.80358615
  164.13156638 -108.47528484  -70.81086474  355.68618196 -252.22734413
 -109.88960811  112.72076189  129.37637087  -36.30516981   57.31973911
   88.

We generate a synthetic dataset using make_regression from sklearn.datasets. The six features are random numeric columns; for illustration we label them as number of bedrooms, square footage, presence of a garage, distance to the nearest school, age of the house, and presence of a swimming pool, although the values are continuous and do not behave like real counts or yes/no flags.

1. n_samples=100: The number of samples (rows) in the dataset. Here we generate 100 samples.
2. n_features=6: The number of features (independent variables). Because n_informative defaults to 10 and is capped at n_features, all six features contribute to the target.
3. noise=0.1: The standard deviation of the Gaussian noise added to the target variable (y). This simulates the randomness or measurement error found in real-world data.
4. random_state=42: The random seed. Fixing it makes the function generate the same data on every run, which helps with debugging and reproducing results.
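
When evaluating feature selection methods, it helps to know which features truly matter. A small sketch, assuming we set `n_informative=3` and `coef=True` (neither is used in the notebook), shows how `make_regression` can return the ground-truth coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True also returns the true linear coefficients used to build y
X, y, coef = make_regression(
    n_samples=100,
    n_features=6,
    n_informative=3,   # assumption: only 3 of the 6 features influence y
    noise=0.1,
    coef=True,
    random_state=42,
)

# Features with a non-zero true coefficient are the informative ones
informative = np.flatnonzero(coef)
print("True coefficients:", np.round(coef, 2))
print("Informative feature indices:", informative)
```

A selection method can then be judged by whether it recovers these indices.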

In [16]:
# Convert to DataFrame
df = pd.DataFrame(X, columns=["num_bedrooms", "square_footage", "has_garage", "distance_to_school", "age_of_house", "has_pool"])
df['price'] = y


In [7]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('price', axis=1), df['price'], test_size=0.2, random_state=42)

# We split the dataset into training and testing sets using train_test_split from sklearn.model_selection.

In [8]:
# 1. Filter Methods
# Correlation
correlation_scores = df.drop('price', axis=1).corrwith(df['price'])
relevant_features_correlation = correlation_scores[abs(correlation_scores) > 0.5].index.tolist()
print("Correlation Selected Features:", relevant_features_correlation)

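# We calculate the correlation coefficient between each feature and the target variable (price). Features whose absolute correlation exceeds 0.5 are selected as relevant. Note that this cell uses the full DataFrame; to avoid leaking test data into the selection, the same calculation can be run on X_train and y_train only.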

Correlation Selected Features: ['num_bedrooms']
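
A variant of the correlation filter restricted to the training split might look like the sketch below. It rebuilds the same synthetic data so that it runs on its own; the 0.5 threshold is carried over from the cell above, and the selected list may differ from the full-data result.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

cols = ["num_bedrooms", "square_footage", "has_garage",
        "distance_to_school", "age_of_house", "has_pool"]
X, y = make_regression(n_samples=100, n_features=6, noise=0.1, random_state=42)
df = pd.DataFrame(X, columns=cols)
df["price"] = y

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="price"), df["price"], test_size=0.2, random_state=42
)

# Correlate each training feature with the training target only
train_corr = X_train.corrwith(y_train)
selected = train_corr[train_corr.abs() > 0.5].index.tolist()

print(train_corr.round(3))
print("Selected on training data:", selected)
```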


In [9]:
# Mutual Information
mi_scores = mutual_info_regression(X_train, y_train)
relevant_features_mi = df.columns[:-1][mi_scores > 0.1]  # Adjust threshold as needed
print("Mutual Information Selected Features:", relevant_features_mi.tolist())

# We estimate the mutual information between each feature and the target variable using mutual_info_regression from sklearn.feature_selection. Features with scores greater than 0.1 are selected. The estimate is randomized, so the scores (and the selection) can change between runs unless random_state is passed.

Mutual Information Selected Features: ['num_bedrooms', 'has_pool']
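
The notebook imports `SelectKBest` and `f_regression` but never uses them. A hedged sketch of how they could replace the manual thresholds above: keep the `k` best-scoring features, with `k=3` chosen arbitrarily here.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split

cols = ["num_bedrooms", "square_footage", "has_garage",
        "distance_to_school", "age_of_house", "has_pool"]
X, y = make_regression(n_samples=100, n_features=6, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(X, columns=cols), y, test_size=0.2, random_state=42
)

# Score every feature with a univariate F-test and keep the top 3
selector = SelectKBest(score_func=f_regression, k=3)
X_train_top = selector.fit_transform(X_train, y_train)

selected = X_train.columns[selector.get_support()].tolist()
print("F-scores:", dict(zip(cols, selector.scores_.round(1))))
print("SelectKBest selected features:", selected)
```

Passing `mutual_info_regression` as `score_func` gives the mutual-information version of the same filter.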


In [10]:
# 2. Wrapper Methods
# Recursive Feature Elimination (RFE)
model = RandomForestRegressor(random_state=42)
rfe = RFE(model, n_features_to_select=3)  # Select top 3 features
rfe.fit(X_train, y_train)
relevant_features_rfe = df.columns[:-1][rfe.support_]
print("RFE Selected Features:", relevant_features_rfe.tolist())

# We use RFE from sklearn.feature_selection with a random forest regressor as the estimator. It recursively removes features until the desired number of features is reached (in this case, 3).

RFE Selected Features: ['num_bedrooms', 'square_footage', 'has_pool']
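
Besides `support_`, a fitted RFE object exposes `ranking_`, which records the order in which features were eliminated. The sketch below rebuilds the data so it can run standalone; `n_estimators=50` is an assumption made to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

cols = ["num_bedrooms", "square_footage", "has_garage",
        "distance_to_school", "age_of_house", "has_pool"]
X, y = make_regression(n_samples=100, n_features=6, noise=0.1, random_state=42)

rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=42),
          n_features_to_select=3)
rfe.fit(X, y)

# Rank 1 = kept; higher ranks were eliminated earlier (rank 4 was dropped first)
for name, rank in sorted(zip(cols, rfe.ranking_), key=lambda pair: pair[1]):
    print(f"{name}: rank {rank}")
```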


In [11]:
# 3. Embedded Methods
# Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
relevant_features_lasso = df.columns[:-1][lasso.coef_ != 0]
print("Lasso Selected Features:", relevant_features_lasso.tolist()) 

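# We train a Lasso regression model using Lasso from sklearn.linear_model with alpha=0.1. Features with non-zero coefficients are selected. With such a weak penalty, and because every feature in this synthetic dataset contributes to the target, no coefficient is shrunk to zero and all six features are kept.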

Lasso Selected Features: ['num_bedrooms', 'square_footage', 'has_garage', 'distance_to_school', 'age_of_house', 'has_pool']
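
How many features Lasso keeps depends on alpha. The sketch below sweeps a few illustrative alpha values (chosen arbitrarily, not tuned) on the same synthetic data and counts the non-zero coefficients at each one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=6, noise=0.1, random_state=42)

alphas = [0.1, 1.0, 10.0, 1000.0]  # illustrative penalty strengths
n_selected = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10_000)
    lasso.fit(X, y)
    # A feature counts as selected when its coefficient is not exactly zero
    n_selected.append(int(np.sum(lasso.coef_ != 0)))
    print(f"alpha={alpha:>7}: {n_selected[-1]} non-zero coefficients")
```

In practice alpha is usually chosen by cross-validation (for example with `LassoCV`) rather than fixed by hand.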


In [12]:
# Random Forest Feature Importance
forest = RandomForestRegressor(random_state=42)
forest.fit(X_train, y_train)
feature_importances = forest.feature_importances_
relevant_features_forest = df.columns[:-1][feature_importances > 0.1]  # Adjust threshold as needed
print("Random Forest Feature Importance Selected Features:", relevant_features_forest.tolist())

# We train a random forest model using RandomForestRegressor from sklearn.ensemble. Features with importance scores greater than 0.1 are selected.

Random Forest Feature Importance Selected Features: ['num_bedrooms', 'square_footage', 'has_pool']


In [13]:
# Gradient Boosting Feature Importance
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)
feature_importances_gb = gb.feature_importances_
relevant_features_gb = df.columns[:-1][feature_importances_gb > 0.1]  # Adjust threshold as needed
print("Gradient Boosting Feature Importance Selected Features:", relevant_features_gb.tolist())


# We train a gradient boosting model using GradientBoostingRegressor from sklearn.ensemble. Features with importance scores greater than 0.1 are selected.

Gradient Boosting Feature Importance Selected Features: ['num_bedrooms', 'square_footage', 'has_pool']
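
The manual importance threshold used in the two cells above can also be expressed with `SelectFromModel`, which the notebook imports but does not use. This is a sketch under the same 0.1 threshold. One small difference is that `SelectFromModel` keeps features whose importance is greater than or equal to the threshold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

cols = np.array(["num_bedrooms", "square_footage", "has_garage",
                 "distance_to_school", "age_of_house", "has_pool"])
X, y = make_regression(n_samples=100, n_features=6, noise=0.1, random_state=42)

# Fit the forest inside the selector and keep features with importance >= 0.1
selector = SelectFromModel(RandomForestRegressor(random_state=42), threshold=0.1)
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
print("Importances:", dict(zip(cols, importances.round(3))))
print("Selected:", cols[selector.get_support()].tolist())

# The selector can then reduce any array with the same columns
X_reduced = selector.transform(X)
```

Impurity-based importances sum to 1 across features, so a fixed cutoff such as 0.1 becomes stricter as the number of features grows.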


In [14]:
# 4. Dimensionality Reduction Techniques
# PCA
pca = PCA(n_components=2)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
print("PCA Transformed Features:", X_train_pca)

# We use PCA from sklearn.decomposition to transform the original features into a lower-dimensional space with two components.

PCA Transformed Features: [[ 2.59337862  0.15156365]
 [ 1.48524782 -0.4056936 ]
 [-0.06427952 -0.73711137]
 [-1.6360316   0.11958012]
 [ 1.35946222 -1.07186992]
 [ 0.80560908 -0.62442595]
 [ 2.3283638   0.326333  ]
 [ 0.41169391  0.8998266 ]
 [ 0.63088141 -0.74070555]
 [ 0.67831997  0.27811325]
 [-2.12825639  0.50790456]
 [ 1.12470988 -0.33356721]
 [-0.49778537 -2.17069466]
 [ 1.12911366 -0.376078  ]
 [-2.51477929 -2.28885415]
 [ 0.40387127 -1.11170357]
 [ 1.69914573  0.57365716]
 [-0.90892344 -0.83666118]
 [-0.62633384  1.17628112]
 [ 0.02530362  1.27704228]
 [-1.51193858 -0.57934443]
 [ 0.26967514  0.17305881]
 [ 3.49091235 -0.45758697]
 [ 1.07920453 -1.09107241]
 [ 0.70045465 -0.02077993]
 [-1.13918128  0.56588618]
 [-0.90856487 -0.70608363]
 [-0.12448329 -1.67902054]
 [-0.66811449 -0.75911085]
 [ 0.55601811  1.21456166]
 [ 0.17182178 -0.5709772 ]
 [ 0.81058494  1.62240363]
 [-1.1450194  -1.98937226]
 [ 0.15828109  0.06503539]
 [ 0.50148791 -0.73993157]
 [-0.18398209  2.49597728]
 [

In [15]:
# t-SNE
tsne = TSNE(n_components=2)
X_train_tsne = tsne.fit_transform(X_train)
print("t-SNE Transformed Features:", X_train_tsne)

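# We use TSNE from sklearn.manifold to embed the original features in a two-dimensional space for visualization.

AttributeError: 'NoneType' object has no attribute 'split'

# Note: this error most likely comes from the environment rather than from the t-SNE call itself. It is commonly reported when an outdated threadpoolctl package is installed alongside a newer NumPy build, and upgrading threadpoolctl usually resolves it.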
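
For reference, a self-contained t-SNE sketch on a fresh copy of the synthetic data. The values `perplexity=20` and `random_state=42` are illustrative assumptions. Perplexity must be smaller than the number of samples, and fixing the seed makes the embedding repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.manifold import TSNE

X, y = make_regression(n_samples=100, n_features=6, noise=0.1, random_state=42)

# t-SNE learns an embedding for exactly the samples it is given
tsne = TSNE(n_components=2, perplexity=20, init="pca", random_state=42)
X_embedded = tsne.fit_transform(X)

print("Embedding shape:", X_embedded.shape)
print("First rows:\n", X_embedded[:3])

# TSNE offers fit_transform but no separate transform for unseen data
print("Has transform method:", hasattr(tsne, "transform"))
```

Because there is no `transform` step for new samples, the embedding is best used for plotting rather than as model input.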