<a href="https://colab.research.google.com/github/sureshmecad/Google-Colab/blob/master/1_Embedded_Method_Random_forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embedded Method Using Random Forest

<p style='text-align: right;'> 25 points</p>


Reference: https://www.youtube.com/watch?v=em4OFr-4C34

Feature selection using a random forest falls under the category of embedded methods. Embedded methods combine the qualities of filter and wrapper methods: they are implemented by algorithms that perform feature selection as part of their own training. Commonly cited benefits of embedded methods are:
1. They tend to be accurate.
2. They tend to generalize well.
3. They are interpretable.
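
To make the "built-in" part concrete, here is a minimal sketch of how a random forest produces feature scores as a by-product of training. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the diabetes data used in this notebook), and the names `X_demo`, `y_demo` and `forest` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset from this notebook)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Fitting the forest is the only step needed; importances are computed during training
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

importances = forest.feature_importances_  # one non-negative score per feature
ranking = np.argsort(importances)[::-1]    # feature indices, most important first

for idx in ranking:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

The scores are normalized to sum to 1, which is why comparing each one against the mean importance is a natural selection rule.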

In [None]:
# Import RandomForestClassifier and SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

In [None]:
# Load the CSV file with pandas and display the first rows
import pandas as pd

diabetes = pd.read_csv("diabetes1.csv")

# Display the first five rows
diabetes.head()

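| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |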


In all feature selection procedures, it is good practice to select the features by examining only the training set. This is to avoid overfitting: if the test set influences which features are kept, the test score becomes an overly optimistic estimate.
So, given a train set and a test set, we select the features from the train set and then apply the same selection to the test set later.

In [None]:
# Assign the features to X and the target 'Outcome' to y (think about why the 'Outcome' column is taken as the target)
X = diabetes.drop('Outcome', axis = 1)

y = diabetes["Outcome"]

In [None]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split
# Split the dataset (test_size=0.3)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

Here we do the model fitting and the feature selection together in a single step.

First, create the random forest instance, specifying the number of trees.

Then use the SelectFromModel object from sklearn to select the features automatically. Simple, right? Don't worry, trust your code. It helps.

Reference for SelectFromModel: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

In [None]:
# Create a SelectFromModel instance, passing a RandomForestClassifier with n_estimators=100 as the estimator
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# No random_state is set on the forest, so the selected features can vary between runs
sel = SelectFromModel(estimator=RandomForestClassifier(n_estimators=100))

# fit sel on training data
fit = sel.fit(X_train, y_train)

By default, SelectFromModel selects the features whose importance is greater than or equal to the mean importance of all the features, but we can change this threshold if we want.
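
As a rough check of that default, the sketch below compares the fitted `threshold_` with the mean of the forest's importances, then lowers the threshold with the string `"0.5*mean"`. The data is synthetic and the names `X_demo`, `y_demo`, `default_sel` and `loose_sel` are hypothetical; the forest's `random_state` is fixed here only so both selectors see identical importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption: not the diabetes dataset from this notebook)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Default threshold: for a random forest this resolves to the mean importance
default_sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
default_sel.fit(X_demo, y_demo)
mean_importance = default_sel.estimator_.feature_importances_.mean()
print("threshold used :", default_sel.threshold_)
print("mean importance:", mean_importance)

# A lower threshold keeps at least as many features as the default
loose_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0), threshold="0.5*mean"
)
loose_sel.fit(X_demo, y_demo)

n_default = int(default_sel.get_support().sum())
n_loose = int(loose_sel.get_support().sum())
print(n_default, "features with the default threshold,", n_loose, "with 0.5*mean")
```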

To see which features were selected, we can call the get_support method on the fitted selector.

In [None]:
# Use sel.get_support() to get the boolean mask of the selected features

sel_support = sel.get_support()
sel_support

array([False,  True, False, False, False,  True,  True,  True])

In [None]:
# Make a list named selected_feat with the columns whose mask value is True
selected_feat = X_train.loc[:, sel_support].columns.tolist()

# Print the number of selected features
print(len(selected_feat), 'selected features')

4 selected features


In [None]:
# Print selected_feat
print(selected_feat)

['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age']
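
As noted earlier, a selection learned on the training set is then applied to the test set. A minimal sketch of that step, using synthetic data and hypothetical names (`Xtr`, `Xte`, `selector`) rather than this notebook's variables:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the diabetes dataset from this notebook)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Learn which features to keep from the training split only
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(Xtr, ytr)

# Apply the same column selection to both splits; the test split is never used for fitting
Xtr_sel = selector.transform(Xtr)
Xte_sel = selector.transform(Xte)

print(Xtr.shape, "->", Xtr_sel.shape)
print(Xte.shape, "->", Xte_sel.shape)
```

In this notebook the equivalent calls would be `sel.transform(X_train)` and `sel.transform(X_test)`.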


Well done, champ! Let us also implement SelectFromModel using a LinearSVC model.

## Feature selection using SelectFromModel

<p style='text-align: right;'> 25 points</p>


SelectFromModel is a meta-transformer that can be used with any estimator that has a `coef_` or `feature_importances_` attribute after fitting. Features are considered unimportant and removed if the corresponding `coef_` or `feature_importances_` values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
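
The string heuristics can be compared side by side. The sketch below is illustrative only: it uses synthetic, standardized data and `LinearSVC(dual=False)`, and the names `X_scaled`, `counts` and `selector` are hypothetical. Scaling is an added assumption here, since linear-model coefficients depend on the scale of each feature.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: not the diabetes dataset from this notebook)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
# Put all features on the same scale so coefficient magnitudes are comparable
X_scaled = StandardScaler().fit_transform(X_demo)

counts = {}
for threshold in ("mean", "median", "0.1*mean"):
    # The selector compares the absolute value of each coefficient with the threshold
    selector = SelectFromModel(LinearSVC(dual=False), threshold=threshold)
    selector.fit(X_scaled, y_demo)
    counts[threshold] = int(selector.get_support().sum())
    print(f"threshold={threshold!r}: {counts[threshold]} features kept")
```

A smaller multiple of the mean gives a lower cut-off, so it keeps at least as many features as "mean" does.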

Let's use SelectFromModel again, this time with LinearSVC.

In [None]:
# import LinearSVC 
from sklearn.svm import LinearSVC

In [None]:
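# Use SelectFromModel with LinearSVC() as its estimator and save it in the variable 'm'

m = SelectFromModel(estimator=LinearSVC())

# Fit m with X and y
# Note: this fits on the full dataset; per the advice above, fit on X_train, y_train when a held-out test set is used for evaluation
fit = m.fit(X, y)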



In [None]:
# Make a list named selected_feat with the columns that the selector supports
sel_support = m.get_support()

selected_feat = X.loc[:, sel_support].columns.tolist()

print(selected_feat)

['Pregnancies', 'DiabetesPedigreeFunction']
