<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 6.5
## Feature Selection

### Data

**Predict the onset of diabetes based on diagnostic measures.**

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

[Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database/download)

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

#### 1. Load Data

In [2]:
# Read Data
diabetes_csv = 'diabetes.csv'
df = pd.read_csv(diabetes_csv)

#### 2. Perform EDA

Explore the data, check for null values, and impute if necessary.

In [4]:
# Display basic information about the dataset
# (df.info() prints its report directly and returns None, so no print() is needed)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [5]:
# Check for null values
print("\nNull values in the dataset:")
print(df.isnull().sum())


Null values in the dataset:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [6]:
# Display basic statistics of the dataset
print("\nBasic statistics of the dataset:")
print(df.describe())


Basic statistics of the dataset:
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
m
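The null check above reports no missing values, yet the summary statistics show a minimum of 0 for columns such as Glucose and BloodPressure, which is not physiologically plausible. A common reading is that zeros stand in for measurements that were never recorded. The sketch below uses a small made-up frame to count such zeros and impute them with the column median. Which columns to treat this way, and the choice of the median, are assumptions rather than part of the lab; applying the same steps to the real data would change the scores reported later.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame that borrows column names from this lab;
# the values are made up for illustration only.
demo = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
    "Pregnancies": [6, 1, 8, 1, 0],  # zero is a legitimate value here
})

# Assumption: a zero in these columns means "not recorded".
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Count suspicious zeros per column before changing anything.
zero_counts = (demo[zero_as_missing] == 0).sum()
print(zero_counts)

# Replace zeros with NaN, then fill each column with its own median.
cleaned = demo[zero_as_missing].replace(0, np.nan)
cleaned = cleaned.fillna(cleaned.median())

imputed = demo.copy()
imputed[zero_as_missing] = cleaned
print(imputed)
```

Columns where zero is meaningful, such as Pregnancies, are left untouched.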

#### 3. Set Target

- Set `Outcome` as target.
- Set Features

In [7]:
# Set target and features
target = 'Outcome'
features = df.drop(target, axis=1).columns

print("\nTarget variable:", target)
print("Features:", list(features))


Target variable: Outcome
Features: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


#### 4. Select Features

The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.

##### 4.1 Univariate Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

- SelectKBest removes all but the k highest scoring features
- Use sklearn.feature_selection.chi2 as score function
    > Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.


More Reads:
[Univariate feature selection](https://scikit-learn.org/stable/modules/feature_selection.html)
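To make the chi-square score less of a black box, here is a minimal sketch that computes the statistic by hand and compares it with `sklearn.feature_selection.chi2` on synthetic data. The helper `manual_chi2` and the data are hypothetical; the formula (per-class feature sums as observed counts, class share times feature total as expected counts) is my reading of how the score is defined, and the comparison below checks it only on this example.

```python
import numpy as np
from sklearn.feature_selection import chi2

def manual_chi2(X, y):
    """Chi-square statistic per feature from observed and expected class sums."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Observed: sum of each feature's values within each class.
    observed = np.vstack([X[y == c].sum(axis=0) for c in classes])
    # Expected under independence: class share times the feature's total.
    class_share = np.array([(y == c).mean() for c in classes])
    expected = np.outer(class_share, X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Synthetic non-negative data (made up for illustration).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = np.column_stack([
    rng.poisson(5 + 3 * y_demo),   # depends on the class
    rng.poisson(5, size=200),      # independent of the class
])

manual_scores = manual_chi2(X_demo, y_demo)
sklearn_scores, _ = chi2(X_demo, y_demo)
print(manual_scores, sklearn_scores)

# Multiplying a feature by 10 multiplies its score by 10.
scaled_scores = manual_chi2(X_demo * np.array([10, 1]), y_demo)
print(scaled_scores / manual_scores)
```

Two practical consequences follow from this form of the statistic: the score grows with the scale of a feature, so features measured in large units can rank highly partly for that reason, and `chi2` requires non-negative inputs.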

- Create an instance of SelectKBest
    - Use sklearn.feature_selection.chi2 as score_func
    - Use k of your choice
- Fit X, y
- Find top 4 features
- Transform features to a DataFrame

In [None]:
# Create an instance of SelectKBest

In [None]:
# Fit

In [8]:
# Univariate Selection
selector = SelectKBest(score_func=chi2, k=4)
X = df[features]
y = df[target]

#Fit 
selector.fit(X, y)

In [21]:
# Print Score

scores = selector.scores_
feature_scores = list(zip(features, scores))
feature_scores.sort(key=lambda x: x[1], reverse=True)

print("Feature scores from Univariate Selection:")
for feature, score in feature_scores:
    print(f"{feature}: {score}")

# Find Top 4 Features
# Get the indices of the selected features
selected_feature_indices = selector.get_support(indices=True)

# Get the names of the selected features
selected_features = X.columns[selected_feature_indices]

print("\nTop 4 features selected by Univariate Selection:")
print(selected_features)

Feature scores from Univariate Selection:
Insulin: 2175.5652729220137
Glucose: 1411.887040644141
Age: 181.30368904430023
BMI: 127.66934333103643
Pregnancies: 111.51969063588255
SkinThickness: 53.10803983632434
BloodPressure: 17.605373215320718
DiabetesPedigreeFunction: 5.392681546971445

Top 4 features selected by Univariate Selection:
Index(['Glucose', 'Insulin', 'BMI', 'Age'], dtype='object')
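Besides `scores_`, a fitted `SelectKBest` also exposes `pvalues_`, and it can be convenient to view scores, p-values, and the selection mask side by side. The sketch below does this on synthetic data; the column names and values are hypothetical stand-ins, not the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in data (hypothetical column names and values).
rng = np.random.default_rng(42)
y_demo = pd.Series(rng.integers(0, 2, size=300), name="target")
X_demo = pd.DataFrame({
    "signal_a": rng.poisson(4 + 4 * y_demo.to_numpy()),
    "signal_b": rng.poisson(10 + 2 * y_demo.to_numpy()),
    "noise": rng.poisson(6, size=300),
})

demo_selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)

# One row per feature, sorted from highest to lowest score.
summary = (
    pd.DataFrame(
        {
            "score": demo_selector.scores_,
            "p_value": demo_selector.pvalues_,
            "selected": demo_selector.get_support(),
        },
        index=X_demo.columns,
    )
    .sort_values("score", ascending=False)
)
print(summary)
```

The same pattern should work with the lab's `selector` and `X`, since only fitted attributes of the selector are used.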


In [9]:
# Print scores after fitting
print("\nScores after fitting:")
feature_scores = list(zip(features, selector.scores_))
for feature, score in feature_scores:
    print(f"{feature}: {score}")


Scores after fitting:
Pregnancies: 111.51969063588255
Glucose: 1411.887040644141
BloodPressure: 17.605373215320718
SkinThickness: 53.10803983632434
Insulin: 2175.5652729220137
BMI: 127.66934333103643
DiabetesPedigreeFunction: 5.392681546971445
Age: 181.30368904430023


In [10]:
#top 4 features
selected_feature_indices = selector.get_support(indices=True)
selected_features = X.columns[selected_feature_indices]

print("\nTop 4 features selected:")
print(selected_features)


Top 4 features selected:
Index(['Glucose', 'Insulin', 'BMI', 'Age'], dtype='object')


In [11]:
# Transform X to Features
X_selected = selector.transform(X)

In [12]:
# Transform features to a dataframe

X_selected_df = pd.DataFrame(X_selected, columns=selected_features)

print("\nTransformed DataFrame (first 5 rows):")
print(X_selected_df.head())


Transformed DataFrame (first 5 rows):
   Glucose  Insulin   BMI   Age
0    148.0      0.0  33.6  50.0
1     85.0      0.0  26.6  31.0
2    183.0      0.0  23.3  32.0
3     89.0     94.0  28.1  21.0
4    137.0    168.0  43.1  33.0
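The transformed frame above shows every column as a float (for example 148.0 and 50.0) because `transform` returns a NumPy array, which needs a single dtype. If the original dtypes and index matter, one alternative is to index the original DataFrame with the boolean mask from `get_support()`. The sketch below contrasts the two approaches on a small made-up frame; the column names are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Small made-up frame with mixed integer and float columns.
X_demo = pd.DataFrame(
    {
        "count_a": [1, 9, 2, 8, 1, 9],
        "count_b": [5, 5, 6, 5, 6, 5],
        "ratio_c": [0.2, 3.9, 0.1, 4.2, 0.3, 4.0],
    },
    index=[10, 11, 12, 13, 14, 15],  # deliberately not 0..n-1
)
y_demo = pd.Series([0, 1, 0, 1, 0, 1], index=X_demo.index)

demo_selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)
mask = demo_selector.get_support()

# transform() returns a NumPy array, so mixed dtypes are upcast to float
# and the original index is replaced by a default RangeIndex.
as_array = pd.DataFrame(
    demo_selector.transform(X_demo), columns=X_demo.columns[mask]
)

# Indexing with the boolean mask keeps the original dtypes and index.
as_frame = X_demo.loc[:, mask]

print(as_array.dtypes)
print(as_frame.dtypes)
```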


In [None]:
# Prepare X and y (the DataFrame loaded earlier is named df)
X = df[features]
y = df[target]

In [None]:
# Univariate Selection
print("\n4.1 Univariate Selection")

In [None]:
# instance of SelectKBest
selector = SelectKBest(score_func=chi2, k=4)

In [None]:
# Fit X, y
selector.fit(X, y)


##### 4.2 Recursive feature elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through either a `coef_` attribute or a `feature_importances_` attribute. Then, the least important features are pruned from the current set. This procedure is repeated on the pruned set until the desired number of features is reached.

More Reads:
[Recursive feature elimination](https://scikit-learn.org/stable/modules/feature_selection.html)
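The elimination loop described above can be sketched by hand. The version below is a simplified illustration, not scikit-learn's implementation: it assumes a binary target, removes one feature per round, and uses the squared coefficient as the importance. The helper `manual_rfe` and the synthetic data are hypothetical. It then compares the result with `RFE` on the same data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (made up); features are already on comparable scales.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)

def manual_rfe(X, y, n_features_to_select):
    """Drop the weakest feature one at a time until the target count remains."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        # Importance of each remaining feature: squared coefficient.
        importance = model.coef_.ravel() ** 2
        weakest = int(np.argmin(importance))
        del remaining[weakest]
    return remaining

manual_kept = manual_rfe(X_demo, y_demo, n_features_to_select=3)

rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe_demo.fit(X_demo, y_demo)
sklearn_kept = [int(i) for i in np.flatnonzero(rfe_demo.support_)]

print("manual :", manual_kept)
print("sklearn:", sklearn_kept)
```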

- Use RFE to extract features
    - Use LogisticRegression as the estimator
    - Choose n_features_to_select yourself
- Fit X, y to RFE
- Find the selected features

In [None]:
# ANSWER

In [None]:
# Print Score
# Find Features

In [None]:
# RFE
print("Recursive Feature Elimination")

In [17]:
# Use RFE to extract features
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=5, step=1)

In [18]:
# Fit X, y to RFE
rfe_selector = rfe_selector.fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
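The warning above suggests two remedies: raise `max_iter` or scale the data. Scaling also matters for RFE itself, because coefficient magnitudes depend on each feature's units, so rankings based on unscaled coefficients are hard to compare. The sketch below standardises a synthetic frame before running RFE; the column names and values are made up, and `max_iter=1000` is an arbitrary choice. Applying the same steps to the lab data may select different features from those reported here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic frame with features on very different scales (values made up).
rng = np.random.default_rng(1)
n = 400
y_demo = rng.integers(0, 2, size=n)
X_demo = pd.DataFrame({
    "small_signal": rng.normal(0.5 * y_demo, 0.3),
    "large_signal": rng.normal(100 + 40 * y_demo, 30.0),
    "large_noise": rng.normal(500, 120, size=n),
    "small_noise": rng.normal(0, 1, size=n),
})

# Standardise to zero mean and unit variance, keeping the column names.
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X_demo), columns=X_demo.columns
)

rfe_scaled = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    step=1,
).fit(X_scaled, y_demo)

selected_scaled = list(X_scaled.columns[rfe_scaled.support_])
print(selected_scaled)
```

When a held-out test set is involved, the scaler would normally be fitted on the training split only; that step is omitted here to keep the sketch short.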


In [19]:
# Find Selected Features
rfe_selected_features = X.columns[rfe_selector.support_]

print("Features selected by Recursive Feature Elimination:")
print(rfe_selected_features)

Features selected by Recursive Feature Elimination:
Index(['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age'], dtype='object')
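In addition to `support_`, a fitted `RFE` exposes `ranking_`, where selected features have rank 1 and higher ranks mark features that were eliminated earlier. The sketch below prints the ranking on synthetic data with hypothetical column names `f0` to `f5`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with made-up feature names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 = kept; the largest rank = eliminated first.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```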


In [20]:
# Transform features to a DataFrame
X_rfe_selected = rfe_selector.transform(X)
X_rfe_selected_df = pd.DataFrame(X_rfe_selected, columns=rfe_selected_features)

print("\nTransformed DataFrame:")
print(X_rfe_selected_df.head())


Transformed DataFrame:
   Pregnancies  Glucose   BMI  DiabetesPedigreeFunction   Age
0          6.0    148.0  33.6                     0.627  50.0
1          1.0     85.0  26.6                     0.351  31.0
2          8.0    183.0  23.3                     0.672  32.0
3          1.0     89.0  28.1                     0.167  21.0
4          0.0    137.0  43.1                     2.288  33.0




---

> © 2024 Institute of Data

---



