## Feature Selection in Machine Learning
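Feature selection can reduce overfitting, often improves accuracy, and shortens training time by removing irrelevant or redundant inputs. It is especially useful for high-dimensional datasets. The examples below use the Pima Indians diabetes dataset, which has eight numeric input features and a binary `class` label.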



In [1]:
from pandas import read_csv

url = 'https://raw.githubusercontent.com/erojaso/MLMasteryEndToEnd/master/data/pima-indians-diabetes.data.csv'
column_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Load CSV into DataFrame
data = read_csv(url, names=column_names)

# Convert to NumPy array and split into input and output
array = data.values
Input = array[:, 0:8]
Output = array[:, 8]

## SelectKBest with Chi-Squared Test
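This method scores each feature with the chi-squared statistic against the class label and keeps the `k` highest-scoring features. The chi-squared test requires non-negative feature values, which holds for this dataset.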
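To make the statistic less of a black box, here is a minimal pure-Python sketch of how such a score can be computed. It assumes the approach scikit-learn documents for `chi2`: non-negative feature values are treated as counts, and the per-class sum of each feature is compared with the sum expected if the feature were independent of the class. The function name `chi2_scores` and the four-row dataset are made up for illustration.

```python
def chi2_scores(rows, labels):
    """Chi-squared score per column, treating non-negative values as counts.

    Illustrative only: a column that sums to zero would divide by zero here.
    """
    n_samples = len(rows)
    n_features = len(rows[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_features):
        column_total = sum(row[j] for row in rows)
        score = 0.0
        for c in classes:
            # Share of samples that belong to class c.
            class_share = labels.count(c) / n_samples
            # Observed: total of feature j inside class c.
            observed = sum(row[j] for row, y in zip(rows, labels) if y == c)
            # Expected under independence: class share times the column total.
            expected = class_share * column_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# Made-up data: column 0 separates the classes, column 1 is constant.
rows = [[1, 5], [2, 5], [8, 5], [9, 5]]
labels = [0, 0, 1, 1]

scores = chi2_scores(rows, labels)
print(scores)
```

The first column, whose values differ sharply between the two classes, gets a much higher score than the constant second column, which scores zero.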

In [2]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from numpy import set_printoptions

# Apply SelectKBest to extract top 4 features
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(Input, Output)

# Print scores
set_printoptions(precision=3)
print("Chi2 scores for each feature:\n", fit.scores_)

# Transform input data to include only selected features
features = fit.transform(Input)
print("\nTop 4 features (first 5 rows):\n", features[0:5, :])

Chi2 scores for each feature:
 [ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]

Top 4 features (first 5 rows):
 [[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
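The score array is easier to read once each value is paired with its column name. The sketch below copies the names and scores from the output above into plain lists and sorts them; the variable names `ranked` and `top_4` are introduced here for illustration.

```python
# Feature names and chi-squared scores copied from the output above.
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]

# Pair each name with its score and sort from highest to lowest.
ranked = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
top_4 = [name for name, _ in ranked[:4]]

for name, score in ranked:
    print(f"{name:>5}: {score:9.3f}")
print("Top 4:", top_4)
```

The four highest scores belong to `test`, `plas`, `age`, and `mass`. The transformed array keeps the selected columns in their original order (`plas`, `test`, `mass`, `age`), which matches the first printed row. On a fitted selector, `get_support()` returns the same selection as a boolean mask.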


## Recursive Feature Elimination (RFE)
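RFE fits a model, ranks the features by the model's coefficients or importances, removes the weakest one, and repeats on the remaining features until the requested number is left. Here the estimator is logistic regression and three features are kept.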
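The elimination loop itself can be sketched in a few lines. This is an illustration of the control flow only, not scikit-learn's implementation: `rfe_sketch` is a hypothetical helper, and `toy_importances` stands in for refitting a real estimator by returning made-up fixed weights.

```python
def rfe_sketch(n_features, n_keep, fit_importances):
    """Eliminate one feature per round; return (support, ranking) lists."""
    active = list(range(n_features))   # indices of the surviving features
    ranking = [1] * n_features         # survivors keep rank 1
    while len(active) > n_keep:
        # "Refit" on the survivors and get one importance per active feature.
        importances = fit_importances(active)
        weakest_pos = min(range(len(active)), key=importances.__getitem__)
        weakest = active.pop(weakest_pos)
        # Earlier eliminations get larger ranks; the last one removed gets 2.
        ranking[weakest] = len(active) - n_keep + 2
    support = [i in active for i in range(n_features)]
    return support, ranking


# Hypothetical stand-in for a fitted model: made-up weights, not real
# coefficients. A real estimator's values would change after every refit.
toy_weights = [0.9, 0.1, 0.5, 0.3, 0.7]


def toy_importances(active):
    return [toy_weights[i] for i in active]


support, ranking = rfe_sketch(n_features=5, n_keep=2, fit_importances=toy_importances)
print("support:", support)
print("ranking:", ranking)
```

Kept features end up with rank 1, and the feature dropped in the first round gets the largest rank.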

In [3]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')
rfe = RFE(estimator=model, n_features_to_select=3)
fit = rfe.fit(Input, Output)

print("Num Features Selected:", fit.n_features_)
print("Selected Features (True=Selected):", fit.support_)
print("Feature Ranking:", fit.ranking_)

Num Features Selected: 3
Selected Features (True=Selected): [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
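The boolean mask and the ranking are positional, so they need to be lined up with the column names to be readable. The sketch below copies both arrays from the output above. It assumes RFE's default of removing one feature per round, in which case a larger rank means the feature was eliminated earlier.

```python
# Feature names, support mask, and ranking copied from the output above.
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
support = [True, False, False, False, False, True, True, False]
ranking = [1, 2, 3, 5, 6, 1, 1, 4]

# Features that RFE kept (rank 1).
selected = [name for name, keep in zip(feature_names, support) if keep]

# Eliminated features, from first removed (largest rank) to last removed.
elimination_order = [
    name
    for rank, name in sorted(zip(ranking, feature_names), reverse=True)
    if rank > 1
]

print("Selected:", selected)
print("Eliminated, in order:", elimination_order)
```

Under that assumption, `preg`, `mass`, and `pedi` were kept, and `test` was the first feature dropped.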


## Principal Component Analysis (PCA)
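PCA is a dimensionality-reduction technique rather than feature selection in the strict sense: it projects the original features onto a smaller set of uncorrelated components that retain most of the variance. Each component is a linear combination of all original features, so the result is harder to interpret than a subset of columns.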

In [4]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
fit = pca.fit(Input)

print("Explained Variance Ratio:", fit.explained_variance_ratio_)
print("Principal Components:\n", fit.components_)

Explained Variance Ratio: [0.889 0.062 0.026]
Principal Components:
 [[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [ 2.265e-02  9.722e-01  1.419e-01 -5.786e-02 -9.463e-02  4.697e-02
   8.168e-04  1.402e-01]
 [ 2.246e-02 -1.434e-01  9.225e-01  3.070e-01 -2.098e-02  1.324e-01
   6.400e-04  1.255e-01]]
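The first component above puts a loading of about 0.993 on the fifth feature (`test`) and explains roughly 89% of the variance. That pattern is what one would expect when PCA runs on unscaled data, because a feature measured on a larger numeric scale contributes more variance. The sketch below illustrates the effect on a made-up two-feature example, using the closed-form eigenvalues of a 2x2 covariance matrix; the helper names and the data are assumptions for illustration only.

```python
from math import sqrt


def first_component_share(x, y):
    """Share of total variance captured by the first principal component
    of two variables, from the eigenvalues of their 2x2 covariance matrix."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Largest eigenvalue of [[var_x, cov_xy], [cov_xy, var_y]].
    centre = (var_x + var_y) / 2
    radius = sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    largest = centre + radius
    # The trace (var_x + var_y) equals the sum of both eigenvalues.
    return largest / (var_x + var_y)


def standardize(values):
    """Rescale to zero mean and unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


# Made-up data: two correlated features on very different scales.
small = [1, 2, 3, 4, 5]
large = [100, 300, 200, 500, 400]

raw_share = first_component_share(small, large)
scaled_share = first_component_share(standardize(small), standardize(large))

print(f"First component share, raw data:          {raw_share:.4f}")
print(f"First component share, standardized data: {scaled_share:.4f}")
```

On the raw data the large-scale feature accounts for almost all of the variance; after standardization the share reflects only the correlation between the two features. With scikit-learn, a common way to get the same effect is to apply `StandardScaler` before `PCA`.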


## Feature Importance using Extra Trees Classifier
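Tree ensembles such as extra trees assign each feature an importance score based on how much its splits reduce impurity, averaged over all trees. The scores sum to 1, and because no `random_state` is set here they vary slightly between runs.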

In [5]:
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(n_estimators=100)
model.fit(Input, Output)

print("Feature Importances from Extra Trees:\n", model.feature_importances_)

Feature Importances from Extra Trees:
 [0.111 0.232 0.097 0.078 0.074 0.143 0.12  0.144]
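scikit-learn documents `feature_importances_` as impurity-based: the normalized total reduction of the split criterion (Gini impurity by default) attributed to each feature. As a rough illustration of the quantity being accumulated, the sketch below computes the Gini decrease of a single threshold split on made-up data. The helpers `gini` and `impurity_decrease` are hypothetical names, and a real forest sums such decreases over many splits and trees.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def impurity_decrease(values, labels, threshold):
    """Gini decrease from splitting the samples at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Child impurities are weighted by the share of samples in each child.
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children


# Made-up data: one feature separates the classes, the other does not.
labels = [0, 0, 1, 1]
informative = [1, 2, 8, 9]
uninformative = [1, 9, 2, 8]

good_split = impurity_decrease(informative, labels, threshold=5)
poor_split = impurity_decrease(uninformative, labels, threshold=5)

print("Decrease with informative feature:  ", good_split)
print("Decrease with uninformative feature:", poor_split)
```

A feature whose splits remove more impurity accumulates a larger importance; a split that leaves both children as mixed as the parent contributes nothing.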
