<a href="https://colab.research.google.com/github/sureshmecad/CloudyML-AI-FOR-ALL/blob/main/8_Feature_Selection_CloudyML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> <u> Feature Selection </u> </center>

<h2>What's the Purpose of Feature Selection</h2>
<p>Many learning algorithms perform poorly on high-dimensional data. This is known as the <b>curse of dimensionality</b>.</p>
<p>There are other reasons we may wish to reduce the number of features, including:</p>
<ol>
<li>Reducing computational cost</li>
<li>Reducing the cost associated with data collection</li>
<li>Improving interpretability</li>
</ol>
                    
Reference for entire topic-

https://www.youtube.com/watch?v=EqLBAmtKMnQ

![image.png](attachment:image.png)

In [None]:
#import these libraries as we are going to use them. 
# Note: just have a look what all libraries you have imported
import pandas as pd
import numpy as np

from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from math import sqrt

# 1. Filter Methods



A filter method applies a statistical measure to assign a score to each feature. We then decide whether to keep or remove each feature based on its score. These methods are often univariate: they consider each feature independently, or only in relation to the dependent variable.

In this section we will cover below approaches:

1. Missing Value Ratio Threshold
2. Variance Threshold
3. $\chi^2$ Test
4. ANOVA F-Test

Download the dataset from https://github.com/arupbhunia/Data-Pre-processing/blob/master/datasets/diabetes.csv

Dataset used - diabetes.csv

Download instructions: open the link above, click the Raw button in the top-right corner, press Ctrl+S, and save the file with a .csv extension.

## (a) Missing Value Ratio Threshold

<p style='text-align: right;'> 20 points</p>




In [None]:
# create a data frame named diabetes and load the csv file

#print the head 


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


We know that some features cannot be zero (e.g. a person's blood pressure cannot be 0), so a zero in these features really means the value is missing. We will therefore replace the zeros in these features with NaN.

Reference to impute: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.replace.html
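
To make the idea concrete, here is a minimal sketch of replacing zeros with NaN and counting the missing values. The small `toy` dataframe and its values are made up for illustration; they are not taken from diabetes.csv.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the real diabetes.csv)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Pregnancies": [6, 0, 8, 1],  # zero is a valid value here, so we leave it alone
})

# Columns in which a zero cannot be a real measurement
zero_invalid = ["Glucose", "BloodPressure"]

# Replace 0 with NaN only in those columns
toy[zero_invalid] = toy[zero_invalid].replace(0, np.nan)

# Count the missing values per column
null_counts = toy.isnull().sum()
print(null_counts)
```

Restricting the replacement to a list of columns keeps legitimate zeros (such as zero pregnancies) untouched.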

In [None]:
# Glucose, BloodPressure, SkinThickness, Insulin and BMI cannot be zero, so replace the zeros in these features with NaN.



In [None]:
#display the no of null values in each feature


Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Now let's see what percentage of values is missing in each feature.

In [None]:
#percentage of missing values for Glucose


0.6510416666666667

In [None]:
# calculate the percentage for Bloodpressure


4.557291666666666

In [None]:
# calculate the percentage for SkinThickness


29.557291666666668

In [None]:
# calculate the percentage for Insulin


48.69791666666667

In [None]:
# calculate the percentage for BMI


1.4322916666666665

Hey, can you see that a large share of the data is missing in SkinThickness (about 30%) and Insulin (about 49%)?

We will keep only those features whose share of missing data is below our threshold of 10%.

Reference to check methods for dropping nan in pandas- https://www.youtube.com/watch?v=57vFbsiZYHg

You can also check its document official: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
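
One possible way to apply a missing-value threshold is sketched below. It uses a boolean column mask rather than `dropna`, which is just one option among several. The `toy` dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with 20 rows (hypothetical, not the real dataset)
n = 20
toy = pd.DataFrame({
    "A": np.arange(n, dtype=float),
    "B": np.arange(n, dtype=float),
    "C": np.arange(n, dtype=float),
})
toy.loc[:5, "B"] = np.nan  # rows 0..5 -> 6 of 20 values missing (30%)
toy.loc[0, "C"] = np.nan   # 1 of 20 values missing (5%)

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Keep only the columns with less than 10% missing data
threshold = 10
kept = toy.loc[:, missing_pct < threshold]
print(missing_pct)
print(list(kept.columns))
```

Column B is dropped because 30% of its values are missing, while A and C stay.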

In [None]:
# keep only those features that have less than 10% missing data
diabetes_missing_value_threshold=#code

# print diabetes_missing_value_threshold

Unnamed: 0,Pregnancies,Glucose,BloodPressure,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,33.6,0.627,50,1
1,1,85.0,66.0,26.6,0.351,31,0
2,8,183.0,64.0,23.3,0.672,32,1
3,1,89.0,66.0,28.1,0.167,21,0
4,0,137.0,40.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...
763,10,101.0,76.0,32.9,0.171,63,0
764,2,122.0,70.0,36.8,0.340,27,0
765,5,121.0,72.0,26.2,0.245,30,0
766,1,126.0,60.0,30.1,0.349,47,1


Let's now separate diabetes_missing_value_threshold into features and label.

Hey buddy! The label is the column whose value depends on the other features. You can also call it the target variable, which we predict using ML algorithms.

Can you think which column should be considered the label?


In [None]:

diabetes_missing_value_threshold_features = #code

diabetes_missing_value_threshold_label= #code


In [None]:
#print diabetes_missing_value_threshold_features


Unnamed: 0,Pregnancies,Glucose,BloodPressure,BMI,DiabetesPedigreeFunction,Age
0,6,148.0,72.0,33.6,0.627,50
1,1,85.0,66.0,26.6,0.351,31
2,8,183.0,64.0,23.3,0.672,32
3,1,89.0,66.0,28.1,0.167,21
4,0,137.0,40.0,43.1,2.288,33
...,...,...,...,...,...,...
763,10,101.0,76.0,32.9,0.171,63
764,2,122.0,70.0,36.8,0.340,27
765,5,121.0,72.0,26.2,0.245,30
766,1,126.0,60.0,30.1,0.349,47


In [None]:
#print diabetes_missing_value_threshold_label


0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

## (b) Variance Threshold

<p style='text-align: right;'> 20 points</p>

If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

Variance will also be very low for a feature if only a handful of observations of that feature differ from a constant value.


Reference-
https://www.youtube.com/watch?v=uMlU2JaiOd8

Are you ready to implement feature selection using variance threshold? Smile and download the dataset diabetes_cleaned.csv from https://github.com/arupbhunia/Data-Pre-processing/blob/master/datasets/diabetes_cleaned.csv

Dataset used - diabetes_cleaned.csv

In [None]:
# load the csv to dataframe name "diabetes" and print the head values


# display diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,218.93776,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,70.189298,26.6,0.351,31.0,0
2,8.0,183.0,64.0,29.0,269.968908,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1


In [None]:
# separate the features and the target as X and y



If you have seen the video, Krish must have told you to use the sklearn library to apply a variance threshold. We will first use the pandas var function to calculate the variance ourselves, so that you understand the concept from the ground up.

In [None]:
# Return  the variance for X along the specified axis=0.


Pregnancies                   11.354056
Glucose                      929.680350
BloodPressure                146.321591
SkinThickness                 77.285567
Insulin                     9484.259268
BMI                           48.813618
DiabetesPedigreeFunction       0.109779
Age                          138.303046
dtype: float64

Hey smarty! Did you see that the variance of DiabetesPedigreeFunction is very small? That might look like a justification for removing the DiabetesPedigreeFunction column as (almost) constant, but variance depends on the scale of a feature, and these features are on very different scales. So before deciding, we should bring them to a common scale.
    
Reference for minmax scaling: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Lets use sklearn minmax scaler here.


In [None]:
# import minmax_scale

# use minmax scale with feature_range=(0,10) and columns=X.columns,to scale the features of dataframe and store them into X_scaled_df 



Wait a minute! What's min-max scaling?

It is the simplest scaling method: it linearly rescales each feature so that its values fall in a fixed range, usually [0, 1] or [−1, 1]. Here we use the range [0, 10].

hey hey heyieeee! Fun fact time:

There is another scaling method called StandardScaler. It standardizes each feature by subtracting the mean and dividing by the standard deviation, so the result has mean = 0 and unit variance. It does not change the shape of the distribution.

Cool right? :)
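
Here is a small sketch comparing the two scalers on a made-up single-column array. Min-max scaling to a range $[a, b]$ maps each value to $a + (x - x_{min})(b - a)/(x_{max} - x_{min})$.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale, StandardScaler

# One toy feature with three observations (hypothetical values)
x = np.array([[1.0], [2.0], [3.0]])

# Min-max scaling to the range [0, 10]: smallest value -> 0, largest -> 10
scaled_0_10 = minmax_scale(x, feature_range=(0, 10))

# Standardization: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(x)

print(scaled_0_10.ravel())
print(standardized.ravel())
```

After min-max scaling the values sit exactly at the ends and middle of the chosen range, while the standardized values are centred on zero.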

In [None]:
# return X_scaled_df


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.529412,6.709677,4.897959,3.043478,2.740295,3.149284,2.344150,4.833333
1,0.588235,2.645161,4.285714,2.391304,1.018185,1.717791,1.165670,1.666667
2,4.705882,8.967742,4.081633,2.391304,3.331099,1.042945,2.536294,1.833333
3,0.588235,2.903226,4.285714,1.739130,1.293850,2.024540,0.380017,0.000000
4,0.000000,6.000000,1.632653,3.043478,2.150572,5.092025,9.436379,2.000000
...,...,...,...,...,...,...,...,...
763,5.882353,3.677419,5.306122,4.456522,2.289500,3.006135,0.397096,7.000000
764,1.176471,5.032258,4.693878,2.173913,2.044244,3.803681,1.118702,1.000000
765,2.941176,4.967742,4.897959,1.739130,1.502241,1.635992,0.713066,1.500000
766,0.588235,5.290323,3.673469,2.391304,2.217956,2.433538,1.157131,4.333333


In [None]:
# Again return  the variance for X along the specified axis=0 to check the scales after using minmax scaler.


Pregnancies                 3.928739
Glucose                     3.869637
BloodPressure               1.523548
SkinThickness               0.913109
Insulin                     1.271218
BMI                         2.041377
DiabetesPedigreeFunction    2.001447
Age                         3.841751
dtype: float64

Now you can check the previous video again: https://www.youtube.com/watch?v=uMlU2JaiOd8

In [None]:
# import variancethreshold

# set threshold=1 and define it to variable select_features


Implement fit_transform on select_features, passing X_scaled_df into it, and save the result in the variable X_variance_threshold_df


In [None]:
X_variance_threshold_df=#code


Were you thinking about fit_transform? We are here to help you understand.

fit_transform() first learns parameters from the data (fit) and then applies them to the same data (transform). For a scaler, the learned parameters are statistics such as the mean and variance of each training feature, and the same parameters are later reused to scale the test data. For VarianceThreshold, fit computes the variance of each feature and transform drops the features whose variance does not exceed the threshold.

Don't worry you will get lot of challenges to use these things in our other assignments
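
As a rough illustration, the sketch below runs VarianceThreshold on a made-up two-column dataframe. The column names and values are assumptions for the example only. Note that VarianceThreshold works with the population variance, while the pandas `var` function divides by n − 1 by default, so the two numbers can differ slightly.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): one nearly constant column, one spread-out column
toy = pd.DataFrame({
    "nearly_constant": [0.0, 0.0, 0.0, 0.1],
    "spread_out": [0.0, 5.0, 10.0, 5.0],
})

selector = VarianceThreshold(threshold=1)

# fit_transform returns a NumPy array, so the column names are lost
reduced = selector.fit_transform(toy)

# get_support() gives a boolean mask of the columns that were kept
kept_columns = toy.columns[selector.get_support()]
reduced_df = pd.DataFrame(reduced, columns=kept_columns)
print(reduced_df)
```

Using the `get_support()` mask is an alternative to comparing columns value by value when you want to recover the names of the kept features.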

In [None]:
#print X_variance_threshold_df


array([[3.52941176, 6.70967742, 4.89795918, ..., 3.14928425, 2.3441503 ,
        4.83333333],
       [0.58823529, 2.64516129, 4.28571429, ..., 1.71779141, 1.16567037,
        1.66666667],
       [4.70588235, 8.96774194, 4.08163265, ..., 1.04294479, 2.53629377,
        1.83333333],
       ...,
       [2.94117647, 4.96774194, 4.89795918, ..., 1.63599182, 0.71306576,
        1.5       ],
       [0.58823529, 5.29032258, 3.67346939, ..., 2.43353783, 1.15713066,
        4.33333333],
       [0.58823529, 3.16129032, 4.69387755, ..., 2.49488753, 1.01195559,
        0.33333333]])

In [None]:
#Convert X_variance_threshold_df into dataframe


In [None]:
# print of head values of X_variance_threshold_df 


Unnamed: 0,0,1,2,3,4,5,6
0,3.529412,6.709677,4.897959,2.740295,3.149284,2.34415,4.833333
1,0.588235,2.645161,4.285714,1.018185,1.717791,1.16567,1.666667
2,4.705882,8.967742,4.081633,3.331099,1.042945,2.536294,1.833333
3,0.588235,2.903226,4.285714,1.29385,2.02454,0.380017,0.0
4,0.0,6.0,1.632653,2.150572,5.092025,9.436379,2.0


Below mentioned is the function get_selected_features for returning selected_features to be used further.

Warning ;)
If we have provided you a readymade function, don't just use it but understand it too.

In [None]:
def get_selected_features(raw_df,processed_df):
    selected_features=[]
    for i in range(len(processed_df.columns)):
        for j in range(len(raw_df.columns)):
            if (processed_df.iloc[:,i].equals(raw_df.iloc[:,j])):
                selected_features.append(raw_df.columns[j])
    return selected_features

In [None]:
# pass the X_scaled_df as raw_df and X_variance_threshold_df as processed_df inside get_selected_features function
selected_features= # code

# print selected_features


['Pregnancies',
 'Glucose',
 'BloodPressure',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

Super! You can see that the SkinThickness feature is not selected, because its scaled variance (0.913) is below the threshold of 1.

Let's give column names to our X_variance_threshold_df

In [None]:
# use selected_features as the column names and save the result in the variable X_variance_threshold_df

#print X_variance_threshold_df


Unnamed: 0,Pregnancies,Glucose,BloodPressure,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.529412,6.709677,4.897959,2.740295,3.149284,2.344150,4.833333
1,0.588235,2.645161,4.285714,1.018185,1.717791,1.165670,1.666667
2,4.705882,8.967742,4.081633,3.331099,1.042945,2.536294,1.833333
3,0.588235,2.903226,4.285714,1.293850,2.024540,0.380017,0.000000
4,0.000000,6.000000,1.632653,2.150572,5.092025,9.436379,2.000000
...,...,...,...,...,...,...,...
763,5.882353,3.677419,5.306122,2.289500,3.006135,0.397096,7.000000
764,1.176471,5.032258,4.693878,2.044244,3.803681,1.118702,1.000000
765,2.941176,4.967742,4.897959,1.502241,1.635992,0.713066,1.500000
766,0.588235,5.290323,3.673469,2.217956,2.433538,1.157131,4.333333


## (c) Chi-Squared statistical test (SelectKBest)

<p style='text-align: right;'> 20 points</p>

Chi2 is a measure of dependence between two variables. It compares the observed distribution of a feature across the target classes with the distribution that would be expected if the feature and the target were independent. The larger the score, the stronger the evidence of dependence. Note that scikit-learn's chi2 requires non-negative feature values.

Scikit-Learn offers a feature selection estimator named SelectKBest, which selects the K highest-scoring features according to a chosen statistical test.

Reference link: https://chrisalbon.com/machine_learning/feature_selection/chi-squared_for_feature_selection/
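
Before working on the real dataset, here is a minimal sketch of SelectKBest with chi2 on made-up data. The feature names `signal`, `weak` and `constant` are hypothetical and chosen so that the outcome is easy to reason about.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative features (hypothetical values)
X_toy = pd.DataFrame({
    "signal": [0, 1, 0, 9, 10, 9],    # clearly higher for class 1
    "weak": [3, 4, 3, 4, 3, 4],       # almost the same in both classes
    "constant": [5, 5, 5, 5, 5, 5],   # identical everywhere
})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])

# Keep the 2 features with the highest chi-square score
selector = SelectKBest(score_func=chi2, k=2).fit(X_toy, y_toy)

scores = pd.Series(selector.scores_, index=X_toy.columns)
selected = list(X_toy.columns[selector.get_support()])
print(scores)
print(selected)
```

The constant feature gets a score of zero because its class totals match what independence would predict, so it is the one left out.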

The function generate_feature_scores_df below pairs each feature with its score. We will use it in the Chi-Squared statistical test explained below.

In [None]:
def generate_feature_scores_df(X,Score):
    # Pair each column name of X with its score, one row per feature
    feature_score=pd.DataFrame({"Features":X.columns,"Score":Score})
    return feature_score

Hey coder! lets use dataset - diabetes.csv from below link

https://github.com/arupbhunia/Data-Pre-processing/blob/master/datasets/diabetes.csv

In [None]:
# create a data frame named diabetes and load the csv file again


In [None]:
# assign features to X variable and 'Outcome' to y variable from the dataframe diabetes



In [None]:
#import chi2 and SelectKBest



In [None]:
# cast the data to float type


Let's use SelectKBest to compute the feature scores. Use chi2 as the score function and k = 4 as the number of features to keep.


In [None]:
# Initialise SelectKBest with above parameters 
chi2_test=#

# fit it with X and Y
chi2_model=#



In [None]:
#print the scores of chi2_model


array([ 111.51969064, 1411.88704064,   17.60537322,   53.10803984,
       2175.56527292,  127.66934333,    5.39268155,  181.30368904])

In [None]:
# use generate_feature_scores_df function to get features and their respective scores passing X and chi2_model.scores_ as paramter
feature_score_df=#


# return feature_score_df



Unnamed: 0,Features,Score
0,Pregnancies,111.519691
1,Glucose,1411.887041
2,BloodPressure,17.605373
3,SkinThickness,53.10804
4,Insulin,2175.565273
5,BMI,127.669343
6,DiabetesPedigreeFunction,5.392682
7,Age,181.303689


Did you see the features and their corresponding chi-square scores? It's that easy: the higher the score, the stronger the dependence between the feature and the target. Just like the higher the marks in an assignment, the better the student of ours.

In [None]:
#Let's get X with the selected features of chi2_model using the transform function, so we will have X_new
X_new=#

Whaaaat! transform()? How is it different from fit_transform()? And fit() can also confuse you, buddy. Hey you inquisitive learner, we will tell you the difference.

The fit() function learns the required parameters from the data (here, the chi-square score of each feature). The transform() function applies those learned parameters to the data (here, it keeps only the selected columns; for a scaler, it would return the scaled values). The fit_transform() function performs both in a single step. Note that we get the same result whether we do it in two steps or in one.

for more info on this refer: https://www.youtube.com/watch?v=BotYLBQfd5M
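
The equivalence of the two-step and one-step calls can be checked with a short sketch like this one. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (hypothetical): column 0 depends on the class, column 1 does not
X_toy = np.array([[0, 3], [1, 4], [9, 3], [10, 4]])
y_toy = np.array([0, 0, 1, 1])

# Two steps: fit learns the scores, transform keeps the best column
selector = SelectKBest(score_func=chi2, k=1)
selector.fit(X_toy, y_toy)
two_steps = selector.transform(X_toy)

# One step: fit_transform does both at once
one_step = SelectKBest(score_func=chi2, k=1).fit_transform(X_toy, y_toy)

same = np.array_equal(two_steps, one_step)
print(same)
```

The two-step form is the one to use when the selector is fitted on training data and later applied to new data with `transform` alone.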

In [None]:
# Convert X_new into a dataframe

X_new=#

In [None]:
# repeat the previous step of calling the get_selected_features function (pass X as raw_df and X_new as processed_df)
selected_features=#

# return selected_features


['Glucose', 'Insulin', 'BMI', 'Age']

Let's take from X all the features in the list selected_features and save this dataframe in the variable chi2_best_features

In [None]:


# print chi2_best_features.head()


Unnamed: 0,Glucose,Insulin,BMI,Age
0,148.0,0.0,33.6,50.0
1,85.0,0.0,26.6,31.0
2,183.0,0.0,23.3,32.0
3,89.0,94.0,28.1,21.0
4,137.0,168.0,43.1,33.0


As you can see, the chi-squared test helps us select, out of the original features, the ones that have the strongest relationship with the target.

You did it well!

## (d) Anova-F Test

<p style='text-align: right;'> 20 points</p>

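The ANOVA F-value is computed by grouping a numerical feature by the target classes and comparing the variance between the group means with the variance within the groups. A high F-value indicates that the group means are significantly different, so the feature helps to separate the classes.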

References:
1. https://www.youtube.com/watch?v=9zrQ_c5RZkI
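
As a rough sketch of the idea, the example below scores four synthetic features with f_classif and keeps the top half with SelectPercentile. The data is randomly generated (seeded) and the feature names `x0` to `x3` are hypothetical. Only `x0` is built to differ between the two classes.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)

# 20 samples of class 0 followed by 20 samples of class 1
y_toy = np.repeat([0, 1], 20)

X_toy = pd.DataFrame({
    "x0": 3.0 * y_toy + rng.normal(size=40),  # class means differ by 3
    "x1": rng.normal(size=40),                # pure noise
    "x2": rng.normal(size=40),                # pure noise
    "x3": rng.normal(size=40),                # pure noise
})

# Keep the top 50% of the features ranked by ANOVA F-value
selector = SelectPercentile(score_func=f_classif, percentile=50).fit(X_toy, y_toy)

f_scores = pd.Series(selector.scores_, index=X_toy.columns)
selected = list(X_toy.columns[selector.get_support()])
print(f_scores)
print(selected)
```

The feature whose group means differ gets by far the largest F-value, which is what the test is designed to detect.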


In [None]:
#import libraries
from sklearn.feature_selection import f_classif,SelectPercentile


# Initialise SelectPercentile function with parameters f_classif and percentile as 80
Anova_test=#



#Fit the above object to the features and target i.e X and Y
Anova_model=#

Here you have used f_classif for the ANOVA F-test. To know more about this test you can check this article.

https://towardsdatascience.com/anova-for-feature-selection-in-machine-learning-d9305e228476



In [None]:
# return scores of anova model


array([ 39.67022739, 213.16175218,   3.2569504 ,   4.30438091,
        13.28110753,  71.7720721 ,  23.8713002 ,  46.14061124])

In [None]:
# use generate_feature_scores_df function to get features and their respective scores by passing X and Anova_model.scores_ as score in function 

feature_scores_df=#code

# print feature_scores_df

Unnamed: 0,Features,Score
0,Pregnancies,39.670227
1,Glucose,213.161752
2,BloodPressure,3.25695
3,SkinThickness,4.304381
4,Insulin,13.281108
5,BMI,71.772072
6,DiabetesPedigreeFunction,23.8713
7,Age,46.140611


In [None]:
# Get all supported columns values in Anova_model with indices=True
cols = #
# Reduce X to the selected features of the ANOVA model using transform

X_new = #



In [None]:
#print X_new.head()


Unnamed: 0,Pregnancies,Glucose,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6.0,148.0,0.0,33.6,0.627,50.0
1,1.0,85.0,0.0,26.6,0.351,31.0
2,8.0,183.0,0.0,23.3,0.672,32.0
3,1.0,89.0,94.0,28.1,0.167,21.0
4,0.0,137.0,168.0,43.1,2.288,33.0


Hey brighty! Hope you learned to implement the ANOVA F-test method for feature selection. With percentile set to 80, it has kept the 6 best of the 8 features, as you can see in the output above.

# 2. Wrapper Methods

<p style='text-align: right;'> 30 points</p>


Wrapper methods select a set of features by preparing different combinations of features, then evaluating each combination and comparing it with the others. A predictive model is trained on each combination, and the combination is scored based on the model's accuracy.

In [None]:
# load and read the diabetes.csv using pandas and print the head values



Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
# assign features to X and target 'Outcome' to Y variable (think about why the 'Outcome' column is taken as the target)
X=#
Y=#

#return X,Y



(     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
 0              6      148             72             35        0  33.6   
 1              1       85             66             29        0  26.6   
 2              8      183             64              0        0  23.3   
 3              1       89             66             23       94  28.1   
 4              0      137             40             35      168  43.1   
 ..           ...      ...            ...            ...      ...   ...   
 763           10      101             76             48      180  32.9   
 764            2      122             70             27        0  36.8   
 765            5      121             72             23      112  26.2   
 766            1      126             60              0        0  30.1   
 767            1       93             70             31        0  30.4   
 
      DiabetesPedigreeFunction  Age  
 0                       0.627   50  
 1                    

## (a) Recursive Feature Elimination

<p style='text-align: right;'> 25 points</p>

Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller subsets of features, pruning the least important feature at each step. A model is fitted in each iteration, the feature it considers least important is removed, and the process continues until the requested number of features remains. Each feature is then ranked by its elimination order, with the selected features sharing rank 1. Because the search is greedy, a dataset with N features needs on the order of N model fits when one feature is removed per step, rather than an evaluation of every possible feature subset.


reference video: https://www.youtube.com/watch?v=MYnxxRoPiwI

Note: the video is using random forest classifier, but we are going to use logistic regression as our model.
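
Here is a minimal sketch of RFE with logistic regression on synthetic data. The dataset comes from `make_classification` and the feature names `f0` to `f5` are made up; it is not the diabetes data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): 6 features, 3 of them informative
X_arr, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

model = LogisticRegression(solver="liblinear")

# Remove one feature per iteration until 3 features remain
rfe = RFE(estimator=model, n_features_to_select=3).fit(X_toy, y_toy)

ranking = pd.DataFrame({
    "Features": X_toy.columns,
    "Rank": rfe.ranking_,       # 1 = selected, larger = eliminated earlier
    "Selected": rfe.support_,
})
print(ranking)
```

All selected features share rank 1, and the eliminated features get ranks 2, 3, 4 in the reverse order of their elimination.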

In [None]:
# import RFE and LogisticRegression



In [None]:
# Initialise model variable with LogisticRegression function with solver = 'liblinear'
model=#

# rfe variable holds an RFE instance, which should have model and n_features_to_select=4 as parameters

rfe=#

In [None]:
# fit rfe with X and Y

fit=#



In [None]:
print('Number of selected features',fit.n_features_)
print('Selected Features',fit.support_)
print('Feature rankings',fit.ranking_)

Number of selected features 4
Selected Features [ True  True False False False  True  True False]
Feature rankings [1 1 2 4 5 1 1 3]


In [None]:
# use the function below to get the ranks of all the features
def feature_ranks(X,Rank,Support):
    # One row per feature: its name, its RFE rank and whether it was selected
    feature_rank=pd.DataFrame({"Features":X.columns,"Rank":Rank,'Selected':Support})
    return feature_rank


In [None]:
#Get all features' ranks using the feature_ranks function with suitable parameters. Store the result in a variable called feature_rank_df
feature_rank_df=#

# print feature_rank_df


Unnamed: 0,Features,Rank,Selected
0,Pregnancies,1,True
1,Glucose,1,True
2,BloodPressure,2,False
3,SkinThickness,4,False
4,Insulin,5,False
5,BMI,1,True
6,DiabetesPedigreeFunction,1,True
7,Age,3,False


We can see there are four features with rank 1. RFE considers these the most significant features.

In [None]:
# filter feature_rank_df  with selected column values as True and save result in variable called recursive_feature_names 
recursive_feature_names=#

# print recursive_feature_names



Unnamed: 0,Features,Rank,Selected
0,Pregnancies,1,True
1,Glucose,1,True
5,BMI,1,True
6,DiabetesPedigreeFunction,1,True


In [None]:
# finally get dataframe X with all the features selected by RFE and store this result in variable called RFE_selected_features
RFE_selected_features=#

# print RFE head()



Unnamed: 0,Pregnancies,Glucose,BMI,DiabetesPedigreeFunction
0,6,148,33.6,0.627
1,1,85,26.6,0.351
2,8,183,23.3,0.672
3,1,89,28.1,0.167
4,0,137,43.1,2.288


We recommend you to watch this video to know about working of RFE:  https://www.youtube.com/watch?v=Yo1vYRdJ95k

# 3. Embedded Method using random forest

<p style='text-align: right;'> 25 points</p>


Reference: https://www.youtube.com/watch?v=em4OFr-4C34

Feature selection using a random forest comes under the category of embedded methods. Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods. Some of the commonly cited benefits of embedded methods are:
1. They are highly accurate.
2. They generalize better.
3. They are interpretable.

In [None]:
#Importing libraries RandomForestClassifier and SelectFromModel



In [None]:
# load the csv file using pandas and print the head values

# print diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In all feature selection procedures, it is good practice to select the features by examining only the training set. This avoids leaking information from the test set into the selection and helps to avoid overfitting.
So, assuming we have a train and a test dataset, we select the features using the train set and then apply the same selection to the test set later.

In [None]:
# assign features to X and target 'Outcome' to Y (think about why the 'Outcome' column is taken as the target)


In [None]:
# import train_test_split

# splitting of dataset(test_size=0.3)


Here we will do the model fitting and feature selection together in one line of code.

First, specify the random forest instance, indicating the number of trees.

Then use the SelectFromModel object from sklearn to automatically select the features. Simple, right? Don't worry, trust your code. It helps.

Reference link to use selectFromModel: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

In [None]:
#create an instance of Select from Model. Pass an object of Random Forest Classifier with n_estimators=100 as argument. 
sel = #code


# fit sel on training data



SelectFromModel(estimator=RandomForestClassifier())

By default, SelectFromModel selects the features whose importance is greater than or equal to the mean importance of all the features, but we can alter this threshold if we want.

 To see which features are important we can use get_support method on the fitted model.
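
The sketch below shows one way this could look on synthetic data: the selector is fitted on the training split only and the same selection is then applied to the test split. The dataset is generated with `make_classification` and is not the diabetes data. The `random_state` values are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 8 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Fit the forest and the selector on the training data only
sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
sel.fit(X_train, y_train)

importances = sel.estimator_.feature_importances_
mask = sel.get_support()  # True for features at or above the mean importance

# Apply the selection learned on the training set to both splits
X_train_sel = sel.transform(X_train)
X_test_sel = sel.transform(X_test)
print(mask)
```

The fitted threshold is available as `sel.threshold_`, which makes it easy to check which importance value the cut-off corresponds to.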

In [None]:
# Using sel.get_support() print the boolean values for the features selected. 



array([False,  True, False, False, False,  True, False,  True])

In [None]:
#make a list named selected_feat with all columns which are True
selected_feat= #code

# print length of selected_feat


3

In [None]:
# Print selected_feat


Index(['Glucose', 'BMI', 'Age'], dtype='object')


Well done, Champ! Let us implement SelectFromModel using the LinearSVC model as well.

## Feature selection using SelectFromModel

<p style='text-align: right;'> 25 points</p>


SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. Features are considered unimportant and removed if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.

Lets use selectfrommodel again with LinearSVC
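
As an illustration of the string heuristics, here is a sketch that uses LinearSVC with `threshold="median"` on synthetic, standardized data. The data, the choice of `dual=False` and the median threshold are assumptions made for this example; the exercise below uses the default settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data (hypothetical): 6 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Put the features on a common scale so the coefficients are comparable
X_scaled = StandardScaler().fit_transform(X_toy)

# Keep the features whose absolute coefficient is at or above the median
m = SelectFromModel(LinearSVC(dual=False), threshold="median")
m.fit(X_scaled, y_toy)

mask = m.get_support()
X_new = m.transform(X_scaled)
print(mask)
print(X_new.shape)
```

With a linear model the selection is driven by the size of the coefficients, so scaling the features first tends to make the comparison fairer.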

In [None]:
# import LinearSVC 


In [None]:
#Use SelectFromModel with LinearSVC() as its parameter and save it in variable 'm'

m = #code

#fit m with X and Y





SelectFromModel(estimator=LinearSVC())

In [None]:
#make a list named selected_feat with all columns which are supported

selected_feat=#

print(selected_feat)

Index(['Pregnancies', 'DiabetesPedigreeFunction'], dtype='object')


# 4. Handling Multicollinearity with VIF

<p style='text-align: right;'> 15 points</p>


Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if, for example, the correlation between two independent variables is equal to 1 or −1.

The variance inflation factor measures how much the variance of an independent variable's estimated regression coefficient is inflated by its correlation with the other independent variables.

VIF has a long definition, but for now understand that:
Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. For feature j it is computed as $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is the $R^2$ obtained by regressing feature j on all the other features.

Reference: https://www.youtube.com/watch?v=6JpmgzCAusI
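
To see where the number comes from, the sketch below computes the VIF by hand as $1/(1 - R^2)$, regressing each feature on the others with sklearn's LinearRegression. The helper `vif_by_regression` and the synthetic columns `x1`, `x2`, `x3` are hypothetical names for this example. In the exercise you would normally use the `variance_inflation_factor` function that statsmodels provides.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)

# Synthetic data (hypothetical): x2 is almost a copy of x1, x3 is unrelated
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})

def vif_by_regression(df):
    """Return 1 / (1 - R^2) for each column regressed on the other columns."""
    values = {}
    for col in df.columns:
        others = df.drop(columns=col)
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        values[col] = 1.0 / (1.0 - r2)
    return pd.Series(values, name="VIF")

vif = vif_by_regression(toy).round(2)
print(vif)
```

The two nearly identical columns get a very large VIF, while the unrelated column stays close to 1.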

In [None]:
#load and read the diabetes_cleaned.csv file using pandas and print the head values

dia_df=

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,218.93776,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,70.189298,26.6,0.351,31.0,0
2,8.0,183.0,64.0,29.0,269.968908,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1


In [None]:
# describe the dataframe using .describe()



Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,121.539062,72.405184,29.108073,152.222767,32.307682,0.471876,33.240885,0.348958
std,3.369578,30.49066,12.096346,8.791221,97.387162,6.986674,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,-17.757186,18.2,0.078,21.0,0.0
25%,1.0,99.0,64.0,25.0,89.647494,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.202592,29.0,130.0,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,188.448695,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


As we can see, the ranges of these features are very different, which means they are all on different scales. So let's standardize the features using sklearn's preprocessing scale function.

Reference doc: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html

In [None]:
from sklearn import preprocessing

#iterate over all features in dia_df and scale



In [None]:
# describe dataframe using .describe()



Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,5.89806e-17,-7.372575e-18,1.1637100000000001e-17,-2.82254e-17,-4.192248e-18,2.2768250000000002e-17,3.715199e-17,-9.107298e-18,2.353442e-16
std,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652
min,-1.141852,-2.5447,-4.004245,-2.516429,-1.746542,-2.020543,-1.189553,-1.041549,-0.7321202
25%,-0.8448851,-0.7396938,-0.695306,-0.4675972,-0.64296,-0.7172147,-0.6889685,-0.7862862,-0.7321202
50%,-0.2509521,-0.1489643,-0.01675912,-0.01230129,-0.2283386,-0.04406715,-0.3001282,-0.3608474,-0.7321202
75%,0.6399473,0.6140612,0.6282695,0.3291706,0.3722209,0.6147581,0.4662269,0.6602056,1.365896
max,3.906578,2.542136,4.102655,7.955377,7.128551,4.983056,5.883565,4.063716,1.365896


In [None]:
#import variance inflation factor


In [None]:
# assign features to X and target to Y 



# split the data to test and train with test_size=0.2


In [None]:
#assign an empty dataframe to variable vif
vif=#

# make a new column 'VIF Factor' in vif dataframe and calculate the variance_inflation_factor for each X 
vif['VIF Factor']=#

In [None]:
# define vif['Features'] with columns names in X

vif['Features']=

In [None]:
#  round off all the decimal values in the dataframe to 2 decimal places for VIF dataframe and print it.



Unnamed: 0,VIF Factor,Features
0,1.43,Pregnancies
1,2.07,Glucose
2,1.24,BloodPressure
3,1.43,SkinThickness
4,2.04,Insulin
5,1.58,BMI
6,1.05,DiabetesPedigreeFunction
7,1.62,Age


* VIF = 1: Not correlated
* VIF =1-5: Moderately correlated
* VIF >5: Highly correlated

Glucose, Insulin, and Age have the largest VIF scores (although all of them are still below 5), so let's drop them.



In [None]:
# according to above observation , drop  'Glucose', 'Insulin' and 'Age' from X

X=

Now again we calculate the VIF for the rest of the features

Repeat the previous steps: assign an empty DataFrame() to vif, make a new column 'VIF Factor' and calculate the variance_inflation_factor for each feature in X.


In [None]:
#code here



In [None]:
#define vif['Features'] as columns of X and return vif with round off to 2 decimal places



Unnamed: 0,VIF Factor,Features
0,1.05,Pregnancies
1,1.13,BloodPressure
2,1.42,SkinThickness
3,1.5,BMI
4,1.03,DiabetesPedigreeFunction


So now the collinearity of the features has been reduced using VIF.

The need to fix multicollinearity depends primarily on the following:

1. When you care more about how much each individual feature, rather than a group of features, affects the target variable, removing multicollinearity may be a good option.
2. If multicollinearity is not present in the features you are interested in, then multicollinearity may not be a problem.

------------------------------

# Hip Hip Hurray! Congratulations you have completed the 7th assignment too! Very well done.

-------------------------------------

# It's Feedback Time!

We hope you’ve enjoyed this course so far. We’re committed to helping you use the "AI for All" course to its full potential, so that you have a great learning experience. And that’s why we need your help in the form of feedback here.

**Please fill this feedback form**
 https://zfrmz.in/MtRG5oWXBdesm6rmSM7N

------------------------------