Feature selection
Feature selection, also known as variable selection, attribute selection, or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons, including simpler models that are easier to interpret, shorter training times, and a lower risk of overfitting.

In [67]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import StandardScaler

In [68]:
df=pd.read_csv("https://raw.githubusercontent.com/digipodium/Datasets/main/diabetes.csv")
df.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


FEATURE SELECTION METHODS:
- FILTER METHOD (score every feature against the target with a statistical test, keep the best-scoring subset, pass it to the algorithm, check the performance), using scikit-learn classes such as SelectKBest
- WRAPPER METHOD (take all the features, generate a subset, pass it to the algorithm, check the performance); the second and third steps run in a loop, selecting the best subset iteratively
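The difference between the two approaches can be sketched side by side. This is a minimal illustration, not the notebook's own code: it assumes the breast cancer dataset bundled with scikit-learn as a stand-in, and it uses greedy forward selection as one possible wrapper strategy. The names `X_demo`, `y_demo`, `filter_idx` and `chosen` are made up for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): bundled with scikit-learn, all values non-negative
data = load_breast_cancer()
X_demo, y_demo = data.data, data.target

# Filter: score every column once with a statistic; no model is trained
filter_idx = SelectKBest(chi2, k=3).fit(X_demo, y_demo).get_support(indices=True)

# Wrapper: greedy forward selection; the model is re-scored for every candidate column
model = KNeighborsClassifier(n_neighbors=9)
chosen = []
for _ in range(3):
    remaining = [j for j in range(X_demo.shape[1]) if j not in chosen]
    scores = {
        j: cross_val_score(model, X_demo[:, chosen + [j]], y_demo, cv=5).mean()
        for j in remaining
    }
    chosen.append(max(scores, key=scores.get))

print("filter picks :", data.feature_names[filter_idx])
print("wrapper picks:", data.feature_names[chosen])
```

The filter step runs once, while the wrapper loop trains and evaluates the model many times, which is why wrapper methods tend to cost more on wide datasets.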

In [69]:
df.rename({'DiabetesPedigreeFunction':'predigree'},axis=1,inplace=True)
df.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  predigree  Age  Outcome
0            6      148             72             35        0  33.6      0.627   50        1
1            1       85             66             29        0  26.6      0.351   31        0
2            8      183             64              0        0  23.3      0.672   32        1
3            1       89             66             23       94  28.1      0.167   21        0
4            0      137             40             35      168  43.1      2.288   33        1


In [70]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [107]:
featSelector=SelectKBest(chi2,k=2)
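One practical constraint of the chi2 score function is that it only accepts non-negative inputs, so it has to run on the raw (or min-max scaled) columns rather than on standardised ones. The sketch below illustrates this on made-up toy data; `X_toy`, `y_toy` and `chi2_rejected_negatives` are hypothetical names used only here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (assumption): three non-negative columns and a balanced binary target
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 50, size=(40, 3)).astype(float)
y_toy = np.arange(40) % 2

# Standardised data is centred on zero, so it contains negative values
try:
    SelectKBest(chi2, k=2).fit(StandardScaler().fit_transform(X_toy), y_toy)
    chi2_rejected_negatives = False
except ValueError as err:
    chi2_rejected_negatives = True
    print("chi2 refused the input:", err)

# Min-max scaling keeps every value in [0, 1], which chi2 accepts
selector = SelectKBest(chi2, k=2).fit(MinMaxScaler().fit_transform(X_toy), y_toy)
print(selector.get_support())
```

This is why the selector in this notebook is fitted before StandardScaler is applied.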

In [108]:
X=df.iloc[:,:-1]
print(X.shape)
y=df['Outcome']
print(y.shape)


(768, 8)
(768,)


In [109]:
featSelector.fit(X,y)

SelectKBest(k=2, score_func=<function chi2 at 0x00000238B74CC790>)

In [110]:
import numpy as np
np.set_printoptions(precision=2)



In [111]:

featSelector.scores_

array([ 111.52, 1411.89,   17.61,   53.11, 2175.57,  127.67,    5.39,
        181.3 ])

Columns with higher scores should be selected. With k=2, the two highest scores are kept: 2175.57 (Insulin) and 1411.89 (Glucose).
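The scores are easier to read when paired with their column names. A small sketch, with the column order and the rounded scores copied by hand from this notebook's outputs rather than recomputed:

```python
import pandas as pd

# Column order as used by the selector; scores copied from the output above
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
           'Insulin', 'BMI', 'predigree', 'Age']
scores = [111.52, 1411.89, 17.61, 53.11, 2175.57, 127.67, 5.39, 181.3]

# Sort from the highest chi2 score to the lowest
ranked = pd.Series(scores, index=columns).sort_values(ascending=False)
print(ranked)

# With k=2 the selector keeps the two top-scoring columns
top2 = ranked.head(2).index.tolist()
print(top2)
```

With a fitted selector, the same series could be built from `featSelector.scores_` and `featSelector.feature_names_in_` instead of hand-copied values.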

In [112]:
print(f" COL{featSelector.feature_names_in_}\n SEL{featSelector.get_feature_names_out()}")

 COL['Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI'
 'predigree' 'Age']
 SEL['Glucose' 'Insulin']


In [113]:
features=featSelector.transform(X)
print(features.shape)

(768, 2)


In [114]:
print(featSelector.get_feature_names_out())
features
# transform() returns the selected columns as a NumPy array, not a DataFrame

['Glucose' 'Insulin']


array([[148.,   0.],
       [ 85.,   0.],
       [183.,   0.],
       ...,
       [121., 112.],
       [126.,   0.],
       [ 93.,   0.]])
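Because the transformed data is a plain NumPy array, the column labels are lost. If a labelled DataFrame is preferred, one option is to index the original frame with the selector's boolean mask. A minimal sketch, assuming the iris dataset bundled with scikit-learn as a stand-in; `X_iris`, `mask` and `X_selected` are illustrative names:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in dataset (assumption), loaded as a pandas DataFrame
iris = load_iris(as_frame=True)
X_iris, y_iris = iris.data, iris.target

selector = SelectKBest(chi2, k=2).fit(X_iris, y_iris)

# get_support() returns one boolean per input column; True marks a kept column
mask = selector.get_support()
X_selected = X_iris.loc[:, mask]
print(X_selected.head())
```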

In [115]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [116]:
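# Standardise the two selected columns.
# Note: the selector and the scaler are both fitted on all rows before the split,
# so test-set information leaks into training; fitting them on xtrain only is safer.
scaler=StandardScaler()
scaledX=scaler.fit_transform(features)
# Hold out 20% of the rows for testing
xtrain,xtest,ytrain,ytest=train_test_split(scaledX,y,test_size=.2,random_state=1)
#xtrain.shape,xtest.shape
# Fit a 9-nearest-neighbours classifier and evaluate it on the held-out rows
m=KNeighborsClassifier(n_neighbors=9)
m.fit(xtrain,ytrain)
ypred=m.predict(xtest)
cm=confusion_matrix(ytest,ypred)  # rows: true class, columns: predicted class
print(cm)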

[[87 12]
 [24 31]]
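The confusion matrix already contains everything needed for the per-class metrics. As a worked check, the sketch below recomputes precision, recall and accuracy for class 1 from the four counts printed above (copied by hand; rows are true classes, columns are predicted classes).

```python
import numpy as np

# Counts copied from the printed confusion matrix
cm = np.array([[87, 12],
               [24, 31]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)     # 31 / 43
recall_1 = tp / (tp + fn)        # 31 / 55
accuracy = (tp + tn) / cm.sum()  # 118 / 154

print(f"precision={precision_1:.2f} recall={recall_1:.2f} accuracy={accuracy:.2f}")
```

The low recall for class 1 comes from the 24 positive cases predicted as negative.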


In [117]:
print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       0.78      0.88      0.83        99
           1       0.72      0.56      0.63        55

    accuracy                           0.77       154
   macro avg       0.75      0.72      0.73       154
weighted avg       0.76      0.77      0.76       154
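One way to keep feature selection and scaling from seeing the test rows is to wrap all three steps in a scikit-learn Pipeline and fit it on the training split only. This is a sketch under assumptions, not a rerun of the cells above: it uses the breast cancer dataset bundled with scikit-learn as a stand-in, so its numbers are not comparable with the report above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption); split first so the test rows stay unseen
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=1
)

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=2)),  # chi2 needs non-negative input, so it runs before scaling
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=9)),
])

# Selector, scaler and classifier are all fitted on the training rows only
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(classification_report(y_test, pipe.predict(X_test)))
```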



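WRAPPER METHOD
- RECURSIVE FEATURE ELIMINATION (RFE): fits the given estimator repeatedly and, in each round, removes the least important column (judged by coefficients or feature importances) until the requested number of columns remains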
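The elimination loop can be sketched by hand to show roughly what happens inside. This is a simplified illustration rather than scikit-learn's implementation: it assumes the bundled breast cancer dataset as a stand-in, standardises the columns so coefficient sizes are comparable, and drops one column per round based on the absolute coefficient. `n_keep` and `remaining` are made-up names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption), standardised so coefficients share a scale
data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)
y_bc = data.target

n_keep = 5  # hypothetical target number of columns
remaining = list(range(X_std.shape[1]))

while len(remaining) > n_keep:
    # Refit the same kind of estimator on the columns still in play
    clf = LogisticRegression(solver="liblinear").fit(X_std[:, remaining], y_bc)
    # Drop the column whose coefficient is closest to zero
    weakest = int(np.argmin(np.abs(clf.coef_).ravel()))
    remaining.pop(weakest)

print(data.feature_names[remaining])
```

The order in which columns are dropped is what a ranking such as `ranking_` records: columns removed earlier get higher rank numbers.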

In [118]:
from sklearn.feature_selection import RFE  # RFE: recursive feature elimination
from sklearn.linear_model import LogisticRegression

In [119]:
clf=LogisticRegression(solver='liblinear')
rfe = RFE(clf)  # n_features_to_select defaults to None, which keeps half of the columns
rfe.fit(X,y)

RFE(estimator=LogisticRegression(solver='liblinear'))

In [120]:
print("features selected: ",rfe.n_features_)

features selected:  4


In [121]:
rfe.support_  # True marks a selected column

array([ True,  True, False, False, False,  True,  True, False])

In [122]:
rfe.ranking_  # rank 1 marks a selected column; higher ranks were eliminated earlier

array([1, 1, 2, 4, 5, 1, 1, 3])

In [123]:
X.columns.tolist()

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'predigree',
 'Age']
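Putting the mask, the ranking and the column names in one table makes the result easier to read. A small sketch with the three outputs above copied by hand:

```python
import pandas as pd

# Values copied from the outputs above
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
           'Insulin', 'BMI', 'predigree', 'Age']
support = [True, True, False, False, False, True, True, False]
ranking = [1, 1, 2, 4, 5, 1, 1, 3]

# A stable sort keeps the original column order among equal ranks
summary = pd.DataFrame({"selected": support, "rank": ranking}, index=columns)
summary = summary.sort_values("rank", kind="mergesort")
print(summary)

selected = [name for name, keep in zip(columns, support) if keep]
print(selected)
```

Insulin has the highest rank number, so it was eliminated first here, even though it had the highest chi2 score in the filter step. The two methods measure different things, and chi2 scores on unscaled columns also depend on the magnitude of the values, so such disagreements are not unusual.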

In [124]:
features=X[rfe.get_feature_names_out()]

In [125]:
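# Standardise the four columns chosen by RFE (as before, fitted on all rows before the split)
scaler=StandardScaler()
scaledX=scaler.fit_transform(features)
# Same split and classifier as in the filter-method run, so the two results are comparable
xtrain,xtest,ytrain,ytest=train_test_split(scaledX,y,test_size=.2,random_state=1)
#xtrain.shape,xtest.shape
m=KNeighborsClassifier(n_neighbors=9)
m.fit(xtrain,ytrain)
ypred=m.predict(xtest)
cm=confusion_matrix(ytest,ypred)  # rows: true class, columns: predicted class
print(cm)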

[[87 12]
 [25 30]]


In [126]:
print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       0.78      0.88      0.82        99
           1       0.71      0.55      0.62        55

    accuracy                           0.76       154
   macro avg       0.75      0.71      0.72       154
weighted avg       0.75      0.76      0.75       154

