# <center> <u> Feature Selection </u> </center>

<h2> What Is Feature Selection in Machine Learning?</h2>
<p>
The goal of feature selection in machine learning is to find the set of features that allows us to build the best-performing model of the phenomenon being studied.
</p>
<p>Many learning algorithms perform poorly on high-dimensional data. This is known as the <b>curse of dimensionality</b>.</p>
<p>There are other reasons we may wish to reduce the number of features, including:</p>
<p>1. Reducing computational cost</p>
<p>2. Reducing the cost associated with data collection</p>
<p>3. Improving interpretability</p>

The techniques for feature selection in machine learning can be broadly classified into the following categories:

<b>Supervised Techniques:</b> These techniques are used with labeled data to identify the features that are most relevant to supervised models for classification and regression, for example linear regression, decision trees, SVM, etc.

<b>Unsupervised Techniques:</b> These techniques can be used with unlabeled data, for example with K-Means Clustering, Principal Component Analysis, Hierarchical Clustering, etc.

<p>From a taxonomic point of view, these techniques are classified into filter, wrapper, embedded, and hybrid methods.</p>

**Problem Statement:**

Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?

In [224]:
import pandas as pd
import numpy as np

# 1. Filter Methods



A filter method applies a statistical measure to assign a score to each feature. We can then decide to keep or remove features based on those scores. The methods are often univariate and consider each feature independently, or only with regard to the dependent variable.

In this section we will cover below approaches:

1. Missing Value Ratio Threshold
2. Variance Threshold
3. $\chi^2$ Test
4. ANOVA Test

## (a) Missing Value Ratio Threshold

<p style='text-align: right;'> 20 points</p>


Data Dict:
---

**Pregnancies:** Number of times pregnant <br>
**Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test.<br>
**BloodPressure:** Diastolic blood pressure (mm Hg).<br>
**SkinThickness:** Triceps skin fold thickness (mm).<br>
**Insulin:** 2-Hour serum insulin (mu U/ml).<br>
**BMI:** Body mass index (weight in kg/(height in m)^2). <br>
**DiabetesPedigreeFunction:** A function which scores likelihood of diabetes based on family history<br>
**Age:** Age (years)<br>
**Outcome:** Class variable (0 or 1)




In [225]:
# create a data frame named diabetes and load the csv file

diabetes = pd.read_csv(r"C:\Users\VICTUS\Documents\Data-Science\Machine_Learning\Datasets\diabetes.csv")
#print the head 
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [226]:
diabetes.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


We know that some features cannot be zero (e.g. a person's blood pressure cannot be 0), so the zeros in these features are really missing values, and we will replace them with NaN.

Reference to impute: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.replace.html

In [227]:
# Glucose, BloodPressure, SkinThickness, Insulin and BMI cannot be zero, so treat zeros in these columns as missing (NaN).
columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Assign the result back instead of using replace(..., value=None, inplace=True) on a column selection, whose behaviour differs between pandas versions.
diabetes[columns] = diabetes[columns].replace(0, np.nan)



In [228]:
#display the no of null values in each feature
diabetes.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Now let's see what percentage of values is missing for each feature.

In [229]:
#percentage of missing values for Glucose(sum null values , divide by length and multiply by 100)
(diabetes['Glucose'].isnull().sum()/len(diabetes))*100


0.6510416666666667

In [230]:
# calculate the percentage for SkinThickness
(diabetes['SkinThickness'].isnull().sum()/len(diabetes))*100



29.557291666666668

In [231]:
# calculate the percentage for Insulin

(diabetes['Insulin'].isnull().sum()/len(diabetes))*100


48.69791666666667

In [232]:
# calculate the percentage for BMI

(diabetes['BMI'].isnull().sum()/len(diabetes))*100


1.4322916666666665

Hey, can you see that a large share of the data is missing in SkinThickness and Insulin?
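The per-column calculations above can also be done for every column in one step. Here is a minimal sketch, assuming a small made-up DataFrame (`toy`, not the diabetes data) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the diabetes data
toy = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89],
    "Insulin": [np.nan, np.nan, np.nan, 94],
    "Age": [50, 31, 32, 21],
})

# isnull() gives booleans; their mean is the fraction of missing values
missing_percent = toy.isnull().mean() * 100
print(missing_percent)

# Names of the columns whose missing percentage is above a 10% threshold
too_sparse = missing_percent[missing_percent > 10].index.tolist()
print(too_sparse)
```

Taking the mean of the boolean mask is equivalent to summing the nulls and dividing by the number of rows, which is what the cells above do one column at a time.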

## **`Watch Video 2: Identify and Dropping Missing Values`**

File used: https://drive.google.com/file/d/1xD7xnBXRYLGg5Qy8676lcj3QepoxmxMn/view?usp=sharing

Dataset: https://drive.google.com/file/d/18iVghbe09MKOMc49imA1Phuw-dNvMm4X/view?usp=sharing

Here we will use 10% as our threshold and keep only those features that have less than 10% missing data.


You can also check the official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [233]:

# keep only the features with less than 10% missing data
diabetes_missing_value_threshold = 10

# iterate over a copy of the column names, since columns are dropped inside the loop
for column in list(diabetes.columns):
    null_value_percent = diabetes[column].isnull().mean() * 100
    if null_value_percent > diabetes_missing_value_threshold:
        diabetes = diabetes.drop(columns=column)
        print(f"Null values exceeded for {column}:: Dropping it")


# print the first rows of the reduced dataframe
diabetes.head()


Null values exceeded for SkinThickness:: Dropping it
Null values exceeded for Insulin:: Dropping it


Unnamed: 0,Pregnancies,Glucose,BloodPressure,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,33.6,0.627,50,1
1,1,85,66,26.6,0.351,31,0
2,8,183,64,23.3,0.672,32,1
3,1,89,66,28.1,0.167,21,0
4,0,137,40,43.1,2.288,33,1


Let's now separate the diabetes data into features and labels.

<b>Label</b> is something which is dependent on other features for its outcome. You can also call it our target variable, which we predict using ML algorithms.

Can you think which column would be considered the label?


In [234]:

diabetes_features = diabetes.drop('Outcome',axis=1)

diabetes_label= diabetes['Outcome']


In [235]:
# print the first rows of diabetes_features

diabetes_features.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,33.6,0.627,50
1,1,85,66,26.6,0.351,31
2,8,183,64,23.3,0.672,32
3,1,89,66,28.1,0.167,21
4,0,137,40,43.1,2.288,33


In [236]:
#print diabetes_label
diabetes_label.head()



0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

## (b) Variance Threshold

<p style='text-align: right;'> 20 points</p>

If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

Variance will also be very low for a feature if only a handful of observations of that feature differ from a constant value.


## **`Watch Video 3: Variance Threshold Removal`**

File used: https://drive.google.com/file/d/1aF_h1juJT6wY49ijjLh-7cMFxMkbXkt-/view?usp=sharing

Dataset used: https://drive.google.com/file/d/1-Sx-VZwuUYH8O48aw5mWYEXPLrLRlKKK/view?usp=sharing

Are you ready to implement feature selection using variance threshold? Smile and download the dataset diabetes_cleaned.csv from https://github.com/arupbhunia/Data-Pre-processing/blob/master/datasets/diabetes_cleaned.csv

Dataset used - diabetes_cleaned.csv

In [237]:
# load the csv to dataframe name "diabetes" and print the head values
diabetes = pd.read_csv(r'C:\Users\VICTUS\Documents\Data-Science\Machine_Learning\Datasets\diabetes_cleaned.csv')

# print head
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,218.93776,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,70.189298,26.6,0.351,31.0,0
2,8.0,183.0,64.0,29.0,269.968908,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1


In [238]:
# separate the features and the target as X and Y
X = diabetes.drop('Outcome', axis=1)
Y = diabetes['Outcome']

If you have seen the video, then Krish must have told you to use the sklearn library to apply a variance threshold. But first we will use the var function to calculate the variances ourselves, so that you understand the concept from the ground up.

In [239]:
# Return  the variance for X along the specified axis=0.

X.var()

Pregnancies                   11.354056
Glucose                      929.680350
BloodPressure                146.321591
SkinThickness                 77.285567
Insulin                     9484.259268
BMI                           48.813618
DiabetesPedigreeFunction       0.109779
Age                          138.303046
dtype: float64

We can see that the variance of DiabetesPedigreeFunction is much lower than that of the other features. A near-zero variance would mean the feature is almost constant and carries little information, which could justify removing the column. However, variance depends on the unit of measurement, and these features are on very different scales, so we should scale them before comparing variances.
    
Reference for minmax scaling: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Let's use the sklearn min-max scaler here.


In [240]:
from sklearn.preprocessing import MinMaxScaler

# scale every feature to the range (0, 10) and store the result, with the original column names, in X_scaled_df
scaler = MinMaxScaler(feature_range=(0, 10))
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.529412,6.709677,4.897959,3.043478,2.740295,3.149284,2.34415,4.833333
1,0.588235,2.645161,4.285714,2.391304,1.018185,1.717791,1.16567,1.666667
2,4.705882,8.967742,4.081633,2.391304,3.331099,1.042945,2.536294,1.833333
3,0.588235,2.903226,4.285714,1.73913,1.29385,2.02454,0.380017,0.0
4,0.0,6.0,1.632653,3.043478,2.150572,5.092025,9.436379,2.0



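Wait a minute! What's min-max scaling?

It is the simplest scaling method: each feature is shifted and rescaled linearly so that its minimum and maximum map onto a fixed range, commonly [0, 1] or [−1, 1]. Above we chose the range [0, 10] through feature_range=(0, 10).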
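To make the rescaling concrete, the sketch below applies the min-max formula by hand and compares it with MinMaxScaler. The `ages` array is made up for illustration; only the target range (0, 10) comes from the cell above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with values between 21 and 81
ages = np.array([[21.0], [31.0], [50.0], [81.0]])

lo, hi = 0, 10  # target range, as in feature_range=(0, 10)

# x_scaled = (x - min) / (max - min) * (hi - lo) + lo
manual = (ages - ages.min()) / (ages.max() - ages.min()) * (hi - lo) + lo

# The same transformation done by scikit-learn
sk = MinMaxScaler(feature_range=(lo, hi)).fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, sk))
```

The smallest value always maps to the lower bound and the largest to the upper bound, so a single extreme outlier can squeeze all other values into a narrow band.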

hey hey heyieeee! Fun fact time:

There is another scaling method called StandardScaler, which standardizes each feature by subtracting its mean and dividing by its standard deviation. The result has mean = 0 and unit variance, but the shape of the distribution is unchanged, so the data is not made normal.

Cool right? :)

In [241]:
# Again return  the variance for X along the specified axis=0 to check the scales after using minmax scaler.
X_scaled_df.var()

Pregnancies                 3.928739
Glucose                     3.869637
BloodPressure               1.523548
SkinThickness               0.913109
Insulin                     1.271218
BMI                         2.041377
DiabetesPedigreeFunction    2.001447
Age                         3.841751
dtype: float64

Now again you can check the previous video

In [242]:
# import variancethreshold
from sklearn.feature_selection import VarianceThreshold

# create a VarianceThreshold with threshold=1.0 and store it in the variable selector
selector = VarianceThreshold(threshold=1.0)

Implement fit_transform on selector, passing X_scaled_df into it, and save the result in the variable X_variance_threshold_df.


In [243]:
X_variance_threshold_df = selector.fit_transform(X_scaled_df)


Were you wondering about fit_transform? We are here to help you understand.

fit_transform() first learns parameters from the data it is given (fit) and then applies them to that same data (transform). For a scaler the learned parameters are, for example, the minimum and maximum of each feature; for VarianceThreshold they are the per-feature variances, which decide which columns are kept. It is normally called on the training data, and the learned parameters are then reused, through transform() alone, on the test data.

Don't worry, you will get a lot of chances to use these things in our other assignments.

In [244]:
print(X_variance_threshold_df)

[[3.52941176 6.70967742 4.89795918 ... 3.14928425 2.3441503  4.83333333]
 [0.58823529 2.64516129 4.28571429 ... 1.71779141 1.16567037 1.66666667]
 [4.70588235 8.96774194 4.08163265 ... 1.04294479 2.53629377 1.83333333]
 ...
 [2.94117647 4.96774194 4.89795918 ... 1.63599182 0.71306576 1.5       ]
 [0.58823529 5.29032258 3.67346939 ... 2.43353783 1.15713066 4.33333333]
 [0.58823529 3.16129032 4.69387755 ... 2.49488753 1.01195559 0.33333333]]


In [245]:
#Convert X_variance_threshold_df into dataframe
X_variance_threshold_df = pd.DataFrame(X_variance_threshold_df, columns=selector.get_feature_names_out())

In [246]:
# print of head values of X_variance_threshold_df 
X_variance_threshold_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.529412,6.709677,4.897959,2.740295,3.149284,2.34415,4.833333
1,0.588235,2.645161,4.285714,1.018185,1.717791,1.16567,1.666667
2,4.705882,8.967742,4.081633,3.331099,1.042945,2.536294,1.833333
3,0.588235,2.903226,4.285714,1.29385,2.02454,0.380017,0.0
4,0.0,6.0,1.632653,2.150572,5.092025,9.436379,2.0


Super! You can see that the SkinThickness feature is not selected, because its variance after scaling (0.913) is below the threshold of 1.0.

## (c) Chi-Squared statistical test (SelectKBest)

<p style='text-align: right;'> 20 points</p>

Chi2 is a measure of dependency between two variables. It compares the observed distribution of a feature across the target classes with the distribution that would be expected if the feature and the target were independent; the larger the statistic, the less plausible independence is. Note that scikit-learn's chi2 requires non-negative feature values.

Scikit-Learn offers a feature selection estimator named SelectKBest, which selects the K highest-scoring features according to a statistical test.



Reference link: https://chrisalbon.com/machine_learning/feature_selection/chi-squared_for_feature_selection/

The function generate_feature_scores_df below pairs each feature name with its score, so that the results of the Chi-Squared statistical test explained below are easier to read.

In [247]:
def generate_feature_scores_df(X, Score):
    # build a two-column table: one row per feature with its score
    feature_score = pd.DataFrame({"Features": X.columns, "Score": Score})
    return feature_score

Hey coder! Let's use the dataset diabetes.csv from the link below

https://github.com/arupbhunia/Data-Pre-processing/blob/master/datasets/diabetes.csv

In [248]:
# create a data frame named diabetes and load the csv file again
diabetes = pd.read_csv(r"C:\Users\VICTUS\Documents\Data-Science\Machine_Learning\Datasets\diabetes.csv")
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [249]:
# assign features to X variable and 'outcome' to y variable from the dataframe diabetes
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

Reference Doc: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

In [250]:
from sklearn.feature_selection import SelectKBest, chi2
# cast the data to float type
X = X.astype(float)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    float64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
dtypes: float64(8)
memory usage: 48.1 KB


Let's use SelectKBest to keep the best-scoring features. Use chi2 as the score function and set the number of features to keep, k, to 4.


In [251]:
# Initialise SelectKBest with the above parameters
chi2_test = SelectKBest(score_func=chi2, k=4)

# and fit it with X and y
chi2_model = chi2_test.fit(X, y)


In [252]:
#print the scores of chi2_model
chi2_model.scores_

array([ 111.51969064, 1411.88704064,   17.60537322,   53.10803984,
       2175.56527292,  127.66934333,    5.39268155,  181.30368904])

In [253]:
# use the generate_feature_scores_df function to get the features and their respective scores, passing X and chi2_model.scores_ as parameters
feature_score_df = generate_feature_scores_df(X, chi2_model.scores_)
# return feature_score_df
feature_score_df

Unnamed: 0,Features,Score
0,Pregnancies,111.519691
1,Glucose,1411.887041
2,BloodPressure,17.605373
3,SkinThickness,53.10804
4,Insulin,2175.565273
5,BMI,127.669343
6,DiabetesPedigreeFunction,5.392682
7,Age,181.303689


Did you see the features and their corresponding chi-square scores? Reading them is easy: the higher the score, the stronger the dependence between the feature and the target. Just like the higher the marks in an assignment, the better the student of ours.

In [254]:
## Reduce X to the features selected by chi2_model using the transform function, giving X_new

X_new = chi2_model.transform(X)
X_new_df = pd.DataFrame(X_new, columns=chi2_model.get_feature_names_out())
X_new_df.head()


Unnamed: 0,Glucose,Insulin,BMI,Age
0,148.0,0.0,33.6,50.0
1,85.0,0.0,26.6,31.0
2,183.0,0.0,23.3,32.0
3,89.0,94.0,28.1,21.0
4,137.0,168.0,43.1,33.0


whaaaat! transform()? How is it different from fit_transform()? You know buddy, fit() can also confuse you. Hey you inquisitive learner, we will tell you the difference.

The fit() function learns the required parameters from the data (here, the chi-squared score of every feature). The transform() function applies those learned parameters to the data (here, it keeps only the k best-scoring columns). The fit_transform() function performs both in the same step. Note that the result is the same whether we do it in two steps or in a single step.
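The equivalence of the two routes can be checked directly. This is a small sketch on made-up random data (the `train` and `test` arrays are assumptions, not part of the assignment), using MinMaxScaler as the example transformer:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 3))  # made-up training data
test = rng.normal(size=(5, 3))    # made-up test data

# Two steps: learn the min/max from train, then apply them to train
two_step = MinMaxScaler().fit(train).transform(train)

# One step: fit and transform the same data together
one_step = MinMaxScaler().fit_transform(train)
print(np.allclose(two_step, one_step))

# For new data, reuse the already-fitted scaler: transform only, no refit
scaler = MinMaxScaler().fit(train)
test_scaled = scaler.transform(test)
print(test_scaled.shape)
```

Because the test data is transformed with parameters learned from the training data, its scaled values may fall slightly outside the [0, 1] range, which is expected.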

For more info on this, refer to the video below.

As you can see chi-squared test helps us to select  important independent features out of the original features that have the strongest relationship with the target feature.

You did it well!

## (d) Anova-F Test

<p style='text-align: right;'> 20 points</p>

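The ANOVA F-value tests whether, when we group a numerical feature by the classes of the target vector, the means of the groups are significantly different. It compares the variance between the group means with the variance within the groups: a large F-value suggests the feature helps separate the classes.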
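To see what the F-value measures, the sketch below computes a one-way ANOVA F statistic by hand for a single feature and compares it with scikit-learn's f_classif. The feature `x` and target `y` are synthetic and only meant as an illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic data: one numeric feature whose mean differs between two classes
y = np.array([0] * 30 + [1] * 30)
x = np.concatenate([rng.normal(100, 10, 30), rng.normal(120, 10, 30)])

# Group the feature values by class
groups = [x[y == label] for label in np.unique(y)]
n, k = len(x), len(groups)
grand_mean = x.mean()

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (between-group variance) / (within-group variance)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_sklearn, p_values = f_classif(x.reshape(-1, 1), y)
print(f_manual, f_sklearn[0])
```

The further apart the class means are relative to the spread inside each class, the larger the F-value becomes.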

## **`Watch Video 5: One Way ANOVA`**

File used: https://drive.google.com/file/d/1myYGdzEfmUqoWc0A3YNuK0oaId1_CCnI/view?usp=sharing

Dataset used: https://drive.google.com/file/d/1g50xZ7xqzHaXNoh0KuUyAaaui7i5ud8n/view?usp=sharing


In [255]:
# import f_classif and SelectPercentile from sklearn
from sklearn.feature_selection import SelectPercentile, f_classif

# initialise SelectPercentile with f_classif as the score function and percentile=80
Anova_test = SelectPercentile(score_func=f_classif, percentile=80)

# fit the above object to the features and target, i.e. X and y
Anova_model = Anova_test.fit(X, y)


Here you have used f_classif for the ANOVA F-test. To know more about this test, you can check this article.

https://towardsdatascience.com/anova-for-feature-selection-in-machine-learning-d9305e228476



In [256]:
# return scores of anova model
Anova_model.scores_

array([ 39.67022739, 213.16175218,   3.2569504 ,   4.30438091,
        13.28110753,  71.7720721 ,  23.8713002 ,  46.14061124])

In [257]:
# use generate_feature_scores_df function to get features and their respective scores by passing X and Anova_model.scores_ as score in function 
feature_scores_df = generate_feature_scores_df(X, Anova_model.scores_)
feature_scores_df

Unnamed: 0,Features,Score
0,Pregnancies,39.670227
1,Glucose,213.161752
2,BloodPressure,3.25695
3,SkinThickness,4.304381
4,Insulin,13.281108
5,BMI,71.772072
6,DiabetesPedigreeFunction,23.8713
7,Age,46.140611


In [258]:
# get the names of the columns selected by Anova_model
cols = Anova_model.get_feature_names_out()
print(f"Columns:: {cols}")
# reduce X to the selected features of the ANOVA model using transform
X_new = pd.DataFrame(Anova_model.transform(X), columns=cols)

Columns:: ['Pregnancies' 'Glucose' 'Insulin' 'BMI' 'DiabetesPedigreeFunction' 'Age']


In [259]:
X_new.head()

Unnamed: 0,Pregnancies,Glucose,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6.0,148.0,0.0,33.6,0.627,50.0
1,1.0,85.0,0.0,26.6,0.351,31.0
2,8.0,183.0,0.0,23.3,0.672,32.0
3,1.0,89.0,94.0,28.1,0.167,21.0
4,0.0,137.0,168.0,43.1,2.288,33.0


Hey brighty! Hope you learned to implement the ANOVA F-test method for feature selection. With percentile=80 it has kept the 6 best-scoring of the 8 features, as you can see in the above output.

# 2. Wrapper Methods

<p style='text-align: right;'> 30 points</p>


Wrapper methods select a set of features by preparing different combinations of features; each combination is then evaluated and compared with the other combinations. A predictive model is trained on each combination, and a score based on model accuracy is used to decide which combination to keep.

In [260]:
# load and read the csv using pandas and print the head values
diabetes = pd.read_csv(r'C:\Users\VICTUS\Documents\Data-Science\Machine_Learning\Datasets\diabetes.csv')

# print head
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [261]:
# assign features to X and target 'outcome' to Y(Think why the 'outcome' column is taken as the target)
X = diabetes.drop('Outcome', axis=1)
Y = diabetes['Outcome']

#return X,Y
X.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [262]:
Y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

## (a) Recursive Feature Elimination

<p style='text-align: right;'> 25 points</p>

Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller subsets of features, pruning the least important feature at each step. A model is fitted in each iteration, the feature with the smallest importance (for example, the smallest coefficient) is removed, and the process continues until the requested number of features remains. Each feature is then ranked by its elimination order: the selected features get rank 1 and the features eliminated earliest get the highest rank numbers. RFE is a greedy search: when one feature is removed per step, a dataset with N features needs at most about N model fits, rather than an evaluation of all $2^N$ possible subsets.
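The elimination loop can be sketched by hand. The example below is an illustration on synthetic data from make_classification (an assumption, not the diabetes data): it repeatedly fits a logistic regression, drops the feature with the smallest squared coefficient, and then compares the surviving columns with those chosen by scikit-learn's RFE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # column indices still in play
n_keep = 3
fits = 0

while len(remaining) > n_keep:
    clf = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    fits += 1
    # position (within `remaining`) of the feature with the smallest squared coefficient
    weakest = int(np.argmin(clf.coef_[0] ** 2))
    remaining.pop(weakest)

# scikit-learn's RFE with the same estimator and target size
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_keep)
rfe.fit(X_demo, y_demo)
rfe_selected = [int(i) for i in np.flatnonzero(rfe.support_)]

print(remaining, rfe_selected, fits)
```

Going from 6 features to 3 takes three elimination fits here, which shows why the cost grows with the number of features removed rather than with the number of possible subsets.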


## **`Watch Video 6: Recursive Feature Elimination`**

Note: the video is using random forest classifier, but we are going to use logistic regression as our model.

File used: https://drive.google.com/file/d/1NPomAlZP1g054xlRWUZ9_GE8pZxu2WFA/view?usp=sharing

Dataset used: https://drive.google.com/file/d/1WC4UwIx56Muc5lX3io7UJK_Is6gIv4tM/view?usp=sharing

In [263]:
# import required libraries RFE, LogisticRegression and dependencies
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Initialise the model variable with LogisticRegression, using max_iter=200
model = LogisticRegression(max_iter=200)

# rfe is an RFE instance that takes model as its estimator and n_features_to_select=4
rfe = RFE(estimator=model, n_features_to_select=4)

In [264]:
# fit rfe with X and Y
fit = rfe.fit(X, Y)

In [265]:
# print fit.n_features_, fit.support_, fit.ranking_
print(f"Columns          :: {list(X.columns)}")
print(f"Support          :: {fit.support_}")
print(f"Feature Rankings :: {fit.ranking_}")
print(f"Number of Selected features= {fit.n_features_}")
print(f"Selected features= {fit.get_feature_names_out()}")


Columns          :: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Support          :: [ True  True False False False  True  True False]
Feature Rankings :: [1 1 3 5 4 1 1 2]
Number of Selected features= 4
Selected features= ['Pregnancies' 'Glucose' 'BMI' 'DiabetesPedigreeFunction']


In [266]:
# use below function to get ranks of all the features
def feature_ranks(X,Rank,Support):
    # build a table: one row per feature with its RFE rank and whether it was selected
    feature_rank = pd.DataFrame({"Features": X.columns, "Rank": Rank, "Selected": Support})
    return feature_rank


In [267]:
#Get all feature's ranks using feature_ranks function with suitable parameters in variable called feature_rank_df
feature_rank_df = feature_ranks(X, fit.ranking_, fit.support_)

# print feature_rank_df
feature_rank_df.head(20)


Unnamed: 0,Features,Rank,Selected
0,Pregnancies,1,True
1,Glucose,1,True
2,BloodPressure,3,False
3,SkinThickness,5,False
4,Insulin,4,False
5,BMI,1,True
6,DiabetesPedigreeFunction,1,True
7,Age,2,False


We can see there are four features with rank 1. RFE reports these as the most significant features.

In [268]:
# get the names of the features selected by RFE and save them in RFE_selected_features
RFE_selected_features = fit.get_feature_names_out()

# reduce X to the selected features and print the head
X_rfe = pd.DataFrame(fit.transform(X), columns=RFE_selected_features)
X_rfe.head()

Unnamed: 0,Pregnancies,Glucose,BMI,DiabetesPedigreeFunction
0,6.0,148.0,33.6,0.627
1,1.0,85.0,26.6,0.351
2,8.0,183.0,23.3,0.672
3,1.0,89.0,28.1,0.167
4,0.0,137.0,43.1,2.288


# 3. Embedded Method using random forest

<p style='text-align: right;'> 25 points</p>


## **`Watch Video 7: Model based feature selection`**

File used: https://drive.google.com/file/d/1Vfmyrbz4HzRMtgYwasrHeXS7Zuo3Yk1b/view?usp=sharing

dataset : https://drive.google.com/file/d/1IyDpzY-9G5rO5sPsY0lN1o3hTzWNAPEY/view?usp=sharing

Feature selection using a random forest comes under the category of embedded methods. Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection mechanism. Some of the commonly cited benefits of embedded methods are:
1. They are often highly accurate.
2. They tend to generalize better.
3. They are interpretable.

In [269]:
#Importing libraries pd, RandomForestClassifier, SelectFromModel
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

In [270]:
# load and read the csv using pandas and print the head values
diabetes = pd.read_csv(r'C:\Users\VICTUS\Documents\Data-Science\Machine_Learning\Datasets\diabetes.csv')

# print head
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In all feature selection procedures, it is good practice to select the features by examining only the training set. Otherwise information from the test set leaks into the selection, and the test score becomes an overly optimistic estimate.
So, given a train and a test dataset, we select the features using the train set and then apply the same selection to the test set later.
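One common way to keep the selection inside the training data is to put the selector and the model in a scikit-learn Pipeline, so that every cross-validation fold refits the selector on its own training part. The sketch below assumes synthetic data and a SelectKBest selector with k=4; these choices are for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 10 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

# The selector is a pipeline step, so it only ever sees the training folds
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=4)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```

With a plain train/test split, the same idea means calling fit on the training set and only transform on the test set.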

In [271]:
# assign features to X and target 'outcome' to Y(Think why the 'outcome' column is taken as the target)
X = diabetes.drop('Outcome', axis=1)
Y = diabetes['Outcome']

In [272]:
# import the train_test_split function
from sklearn.model_selection import train_test_split

# splitting of dataset(test_size=0.3)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=54)


Here we will do the model fitting and the feature selection together in a single step.

First, specify the random forest instance, indicating the number of trees.

Then use the SelectFromModel object from sklearn to automatically select the features. Simple, right? Don't worry, trust your code. It helps.

Reference link to use selectFromModel: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

In [273]:
# create an instance of SelectFromModel, passing a RandomForestClassifier with n_estimators=100 as the estimator
model = RandomForestClassifier(n_estimators=100)
sel = SelectFromModel(estimator=model)

# fit sel on the training data only
sel.fit(X_train, y_train)

By default, with a random forest, SelectFromModel will select those features whose importance is greater than or equal to the mean importance of all the features, but we can alter this threshold if we want.

To see which features are important, we can use the get_support method on the fitted model.
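The link between the importances, the threshold and the selection mask can be inspected on the fitted object. The sketch below uses synthetic data and a fixed random_state (both assumptions for reproducibility) and reads the fitted forest through `estimator_` and the applied cut-off through `threshold_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

sel_demo = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
sel_demo.fit(X_demo, y_demo)

# Importances of the fitted forest and the threshold that was applied
importances = sel_demo.estimator_.feature_importances_
print(importances.round(3))
print(sel_demo.threshold_)    # the mean importance for this estimator
print(sel_demo.get_support())  # True where importance >= threshold
```

A random forest without a fixed random_state can give slightly different importances on every run, so the selected set may change between runs when importances are close to the threshold.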

In [274]:
# Using sel.get_support() print the boolean values for the features selected. 
sel.get_support()

array([False,  True, False, False, False,  True, False,  True])

In [275]:
#make a list named selected_feat with all columns which are True
selected_feat = sel.get_feature_names_out()

# print length of selected_feat
len(selected_feat)

3

In [276]:
# Print selected_feat
print(selected_feat)

['Glucose' 'BMI' 'Age']


Well done Champ! Let us implement SelectFromModel using a LinearSVC model as well.

## Feature selection using SelectFromModel

<p style='text-align: right;'> 25 points</p>


SelectFromModel is a meta-transformer that can be used along with any estimator that has a `coef_` or `feature_importances_` attribute after fitting. The features are considered unimportant and removed if the corresponding `coef_` or `feature_importances_` values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
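The string heuristics can be compared side by side. The sketch below is only an illustration: it uses synthetic data and standardizes the features first, which is an assumption made here because the coefficients of a linear model depend on the scale of each feature.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data: 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_std = StandardScaler().fit_transform(X_demo)

# Count how many features survive under each threshold heuristic
counts = {}
for threshold in ["mean", "median", "0.1*mean"]:
    selector = SelectFromModel(LinearSVC(dual=False), threshold=threshold)
    selector.fit(X_std, y_demo)
    counts[threshold] = int(selector.get_support().sum())

print(counts)
```

A lower threshold such as "0.1*mean" keeps at least as many features as "mean", so the string argument is a simple way to make the selection stricter or more lenient.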

Let's use SelectFromModel again, this time with LinearSVC.

In [277]:
# import libraries LinearSVC, SelectFromModel , and dependencies

from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

In [278]:
#Use SelectFromModel with LinearSVC() as its parameter and save it in variable 'm'

m = SelectFromModel(estimator=LinearSVC())

# fit on X, y
m.fit(X_train, y_train)




In [279]:
#make a list named selected_feat with all columns which are supported and count the selected features.

selected_feat = m.get_feature_names_out()

# print selected_feat
print(selected_feat)


['Pregnancies' 'DiabetesPedigreeFunction']


# 4. Handling Multicollinearity with VIF

<p style='text-align: right;'> 15 points</p>


Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if, for example, the correlation between two independent variables is equal to 1 or −1.

The variance inflation factor measures how much the variance of an estimated regression coefficient is inflated by the correlation of its variable with the other independent variables.

VIF has a longer formal definition, but for now understand that:
Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables.
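For a feature $i$, the usual formula is $VIF_i = 1 / (1 - R_i^2)$, where $R_i^2$ is obtained by regressing feature $i$ on all the other features. The sketch below implements that formula with scikit-learn's LinearRegression on made-up data; the helper `vif` and the `demo` frame are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)

# Made-up data: "a_plus_noise" is nearly a copy of "a", "b" is independent
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_noise": a + 0.1 * rng.normal(size=500),
})

def vif(frame, column):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing `column` on the other columns."""
    others = frame.drop(columns=column)
    r2 = LinearRegression().fit(others, frame[column]).score(others, frame[column])
    return 1.0 / (1.0 - r2)

vifs = {col: vif(demo, col) for col in demo.columns}
print(vifs)
```

A VIF close to 1 means the feature is almost uncorrelated with the others, while large values (a common rule of thumb is above 5 or 10) point to strong multicollinearity.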

## **`Watch Video 8: Variance Inflation Factor`**

File : https://drive.google.com/file/d/1j1npRRM9iu_m61ElAdSHaP5mftNtGHQc/view?usp=sharing

In [280]:
#load and read the diabetes_cleaned.csv file using pandas and print the head values
dia_df = pd.read_csv(r"C:\Users\VICTUS\Documents\Data-Science\Machine_Learning\Datasets\diabetes_cleaned.csv")

# print dia_df

dia_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,218.93776,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,70.189298,26.6,0.351,31.0,0
2,8.0,183.0,64.0,29.0,269.968908,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1


In [281]:
# describe the dataframe using .describe()
dia_df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,121.539062,72.405184,29.108073,152.222767,32.307682,0.471876,33.240885,0.348958
std,3.369578,30.49066,12.096346,8.791221,97.387162,6.986674,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,-17.757186,18.2,0.078,21.0,0.0
25%,1.0,99.0,64.0,25.0,89.647494,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.202592,29.0,130.0,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,188.448695,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


As we can see, the ranges of these features are very different, which means they are on different scales. Let's standardize the features using sklearn's `scale` function.

reference doc: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html

In [282]:
# import preprocessing
from sklearn.preprocessing import scale

# iterate over all columns in dia_df and scale each one
# (note: this also scales the 'Outcome' target column)

for i in dia_df:
    dia_df[[i]]= scale(dia_df[[i]])

In [283]:
dia_df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,-6.476301e-17,-4.625929e-18,6.915764e-16,-1.526557e-16,-3.4694470000000005e-17,1.272131e-16,2.174187e-16,1.931325e-16,7.401487e-17
std,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652
min,-1.141852,-2.5447,-4.004245,-2.516429,-1.746542,-2.020543,-1.189553,-1.041549,-0.7321202
25%,-0.8448851,-0.7396938,-0.695306,-0.4675972,-0.64296,-0.7172147,-0.6889685,-0.7862862,-0.7321202
50%,-0.2509521,-0.1489643,-0.01675912,-0.01230129,-0.2283386,-0.04406715,-0.3001282,-0.3608474,-0.7321202
75%,0.6399473,0.6140612,0.6282695,0.3291706,0.3722209,0.6147581,0.4662269,0.6602056,1.365896
max,3.906578,2.542136,4.102655,7.955377,7.128551,4.983056,5.883565,4.063716,1.365896
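The `std` row above shows 1.000652 rather than exactly 1. The sketch below, which uses NumPy on made-up data, suggests why: `scale` divides by the population standard deviation (ddof=0), while pandas' `describe()` reports the sample standard deviation (ddof=1). The array `x` and its parameters are assumptions for illustration.

```python
import numpy as np

# made-up column with the same number of rows as the dataset (768)
rng = np.random.default_rng(0)
x = rng.normal(loc=120.0, scale=30.0, size=768)

# standardize by hand: subtract the mean, divide by the population std (ddof=0)
z = (x - x.mean()) / x.std()

pop_std = z.std(ddof=0)      # what the standardization targets
sample_std = z.std(ddof=1)   # what describe() reports
print(pop_std, sample_std)
print(np.sqrt(768 / 767))    # ratio between the two definitions
```

The sample standard deviation is larger by a factor of sqrt(n / (n - 1)), which for n = 768 is about 1.000652.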


In [284]:
#import variance inflation factor 
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split

In [285]:
# assign features to X and target to Y by analyzing which columns to be dropped and which is to be considered as target
X = dia_df.drop('Outcome', axis=1)
Y = dia_df['Outcome']


# split the data to test and train with test_size=0.2
x_train,x_test,y_train,y_test = train_test_split(X, Y, test_size=0.2)


In [286]:
#assign an empty dataframe to variable vif
vif = pd.DataFrame()

# make a new column 'VIF Factor' in vif dataframe and calculate the variance_inflation_factor for each X 
vif['VIF Factor'] = [variance_inflation_factor(x_train.values, i) for i in range(len(x_train.columns)) ] 
vif['features'] = x_train.columns


In [287]:
# display the VIF dataframe
vif.head(20)

Unnamed: 0,VIF Factor,features
0,1.397837,Pregnancies
1,2.109495,Glucose
2,1.227851,BloodPressure
3,1.417704,SkinThickness
4,2.074083,Insulin
5,1.550375,BMI
6,1.053065,DiabetesPedigreeFunction
7,1.547607,Age


* VIF = 1: Not correlated
* VIF = 1 to 5: Moderately correlated
* VIF > 5: Highly correlated
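To see where such scores come from, here is a minimal sketch that computes 1 / (1 - R²) by hand with NumPy on synthetic data. The function name `vif_by_hand` and the columns `a`, `b`, `c` are hypothetical. The sketch centres the data and uses ordinary least squares, so it is an illustration and not the statsmodels implementation.

```python
import numpy as np

def vif_by_hand(X):
    """Return 1 / (1 - R^2) for each column of X regressed on the other columns."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                 # centre so no intercept term is needed
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # least-squares fit of column j on the remaining columns
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - (residual @ residual) / (target @ target)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                   # independent of a
c = a + 0.1 * rng.normal(size=500)         # almost a copy of a

demo_vifs = vif_by_hand(np.column_stack([a, b, c]))
print(demo_vifs.round(2))
```

The independent column `b` should score close to 1, while `a` and `c`, which nearly duplicate each other, should score well above 5.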

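Glucose and Insulin have the highest VIF scores (about 2.1). All scores are below 5, so by the thresholds above the collinearity is only moderate, but to illustrate the procedure let's drop Glucose, Insulin, and Age.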



In [288]:
# according to above observation , drop  'Glucose', 'Insulin' and 'Age' from X
X = X.drop(['Glucose', 'Insulin', 'Age'], axis=1)
X.head()

Unnamed: 0,Pregnancies,BloodPressure,SkinThickness,BMI,DiabetesPedigreeFunction
0,0.639947,-0.033518,0.670643,0.185089,0.468492
1,-0.844885,-0.529859,-0.012301,-0.817471,-0.365061
2,1.23388,-0.695306,-0.012301,-1.290106,0.604397
3,-0.844885,-0.529859,-0.695245,-0.602636,-0.920763
4,-1.141852,-2.680669,0.670643,1.545707,5.484909


Now we calculate the VIF again for the remaining features.

Repeat the previous steps: assign an empty DataFrame to `vif`, make a new column 'VIF Factor', and calculate the `variance_inflation_factor` for each column of X.


In [289]:
#assign an empty dataframe to variable vif
vif = pd.DataFrame()

# make a new column 'VIF Factor' in vif dataframe and calculate the variance_inflation_factor for each X 
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns)) ] 
vif['features'] = X.columns

In [290]:
# round to 2 decimal places
vif.round(2).head(10)

Unnamed: 0,VIF Factor,features
0,1.05,Pregnancies
1,1.13,BloodPressure
2,1.42,SkinThickness
3,1.5,BMI
4,1.03,DiabetesPedigreeFunction


The collinearity among the features has now been reduced, as the lower VIF scores show.
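Since the VIF table was built twice with the same steps, those steps could be wrapped in a small helper. The sketch below shows one way to do that. The name `vif_table` and the synthetic `demo` DataFrame are hypothetical, and only the `variance_inflation_factor` call comes from the notebook.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(features):
    """Return one row per column of `features` with its VIF score (2 decimals)."""
    values = features.values
    table = pd.DataFrame()
    table['VIF Factor'] = [variance_inflation_factor(values, i)
                           for i in range(values.shape[1])]
    table['features'] = features.columns
    return table.round(2)

# synthetic, roughly zero-mean data: 'c' is almost a copy of 'a'
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({'a': a,
                     'b': rng.normal(size=500),
                     'c': a + 0.1 * rng.normal(size=500)})

demo_table = vif_table(demo)
print(demo_table)
```

On the notebook's data, the equivalent calls would be `vif_table(x_train)` before dropping columns and `vif_table(X)` afterwards.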


Whether you need to fix multicollinearity depends primarily on the following considerations:
---

1. When you care more about how much each individual feature, rather than a group of features, affects the target variable, removing multicollinearity may be a good option.

2. If multicollinearity is not present in the features you are interested in, it may not be a problem.

## **`Summary:`**

1. In the feature selection assignment, we learnt the methods used to find the best features, i.e. filter methods, wrapper methods and embedded methods.

2. In the filter methods, we learnt the Chi2 test, the ANOVA test, the variance threshold and the missing value ratio threshold.

3. In the wrapper method, we learnt forward selection, backward elimination and recursive feature elimination techniques.

4. In the embedded method, feature selection happens as part of model training, for example with SelectFromModel and an estimator such as LinearSVC.

## **`Conclusion:`**
Feature selection is a very important step in the construction of Machine Learning models. <br>

It can speed up training time, make our models simpler, easier to debug, and reduce the time to market of Machine Learning products. 

------------------------------

# Hip Hip Hurray! Congratulations, you have completed the 7th assignment too! Very well done.

-------------------------------------

# It's Feedback Time!

We hope you’ve enjoyed this course so far. We’re committed to helping you use the "AI for All" course to its full potential, so that you have a great learning experience. That’s why we need your help in the form of feedback here.


---
**Please fill this feedback form**<br>

https://forms.gle/VfxaKpjnmim8m4Yv8

------------------------------