### Why do we need feature selection?

* Too many features increase model training time.
* Models built with more features are more prone to overfitting.
* When the number of features exceeds the optimum, the model's prediction accuracy may decrease.
* These problems are collectively known as the "curse of dimensionality".
* To avoid them, we should reduce the number of features.
* We will perform feature selection with the BorutaPy algorithm.

In [47]:
!pip install Boruta
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns


In [49]:
df=pd.read_csv("../input/mobile/mobile_train.csv")
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


In [51]:
y = df["price_range"]  # dependent (target) variable
X = df.drop(["price_range"], axis=1)  # independent variables

# Split with the holdout method: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=45)

In [52]:
rf_all_features = RandomForestClassifier(random_state=123, n_estimators=1000, max_depth=5)
rf_all_features.fit(X_train, y_train)
accuracy_score(y_test, rf_all_features.predict(X_test))

0.844

* We first build a random forest model on all variables, without any variable selection. Its accuracy is 0.844.
* After the variable selection process, we will build the model again and compare the accuracy.

* The Boruta algorithm is built on random forests, so we first create a random forest object for it.
* BorutaPy works with NumPy arrays, so we have to convert the DataFrames to NumPy arrays.
* Variable selection is done with the BorutaPy class.
* BorutaPy takes the random forest model object as its first argument, followed by parameters such as n_estimators.
* We then call fit with the X and y values.
* ranking_ gives the rank of each variable, and n_features_ gives the number of variables selected as important.
* Variables with ranking=1 are the ones Boruta confirmed as important.
* We drop the remaining variables and build the model with the ranking=1 variables only.
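
The idea behind the steps above can be sketched without the Boruta package. This is a simplified illustration, not BorutaPy's actual implementation: it copies each column, shuffles the copy into a "shadow" feature, fits a random forest on real plus shadow columns, and counts how often a real feature's importance beats the best shadow. The synthetic dataset, the number of rounds, and all variable names are assumptions chosen for the example, and the statistical test that Boruta applies to the hit counts is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): with shuffle=False the
# first 3 columns are informative and the remaining 5 are pure noise.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    n_repeated=0, shuffle=False, random_state=0,
)

rng = np.random.default_rng(0)
n_rounds = 10
n_real = X_toy.shape[1]
hits = np.zeros(n_real, dtype=int)

for _ in range(n_rounds):
    # Shadow features: every column is permuted independently, which breaks
    # any link to the target while keeping each column's distribution.
    shadows = np.column_stack([rng.permutation(col) for col in X_toy.T])

    forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
    forest.fit(np.hstack([X_toy, shadows]), y_toy)

    importances = forest.feature_importances_
    real_imp, shadow_imp = importances[:n_real], importances[n_real:]

    # A real feature scores a "hit" when it beats the best shadow feature.
    hits += real_imp > shadow_imp.max()

print("hits per feature:", hits)
```

Features that collect hits in most rounds are the candidates to keep; features that rarely beat the shadows behave like noise.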


In [61]:
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)
boruta_selector.fit(np.array(X), np.array(y))  # BorutaPy expects NumPy arrays

print("Ranking: ", boruta_selector.ranking_)  # rank 1 = confirmed feature
print("No. of significant features: ", boruta_selector.n_features_)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	20
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	20
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	20
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	20
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	20
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	20
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	20
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	4
Tentative: 	2
Rejected: 	14
Iteration: 	9 / 100
Confirmed: 	4
Tentative: 	2
Rejected: 	14
Iteration: 	10 / 100
Confirmed: 	4
Tentative: 	2
Rejected: 	14
Iteration: 	11 / 100
Confirmed: 	4
Tentative: 	2
Rejected: 	14
Iteration: 	12 / 100
Confirmed: 	4
Tentative: 	2
Rejected: 	14
Iteration: 	13 / 100
Confirmed: 	4
Tentative: 	2
Rejected: 	14
Iteration: 	14 / 100
Confirmed: 	4
Tentative: 	2
Rejected: 	14
Iteration: 	15 / 100
Confirmed: 	4
Tentative: 	2
Rejected: 	14
Iteration: 	16 / 100
Confirmed: 	4
Tentative: 	1
Rejected: 	15
...

In [53]:
selected_rf_features = pd.DataFrame({'Feature': list(X.columns),
                                     'Ranking': boruta_selector.ranking_}).sort_values(by='Ranking')
selected_rf_features

Unnamed: 0,Feature,Ranking
0,battery_power,1
13,ram,1
11,px_height,1
12,px_width,1
8,mobile_wt,2
6,int_memory,3
15,sc_w,4
10,pc,5
16,talk_time,6
4,fc,7


* We keep battery_power, ram, px_width and px_height, the four features with ranking=1.
* X_new is the new DataFrame containing these four independent variables.
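
As a small illustration of this selection step, a boolean mask over the ranking array picks the rank-1 columns without relying on positional indices. The tiny DataFrame and the ranking values below are made-up stand-ins, not the notebook's data; `ranking_demo` only mimics the shape of a selector's ranking_ array.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values, not the notebook's data).
X_demo = pd.DataFrame({
    "battery_power": [842, 1021, 563],
    "fc": [1, 0, 2],
    "ram": [2549, 2631, 2603],
})
ranking_demo = np.array([1, 7, 1])  # one rank per column, in column order

# Boolean mask over the columns: True where the rank is 1.
keep = ranking_demo == 1
X_demo_selected = X_demo.loc[:, keep]

print(list(X_demo_selected.columns))  # ['battery_power', 'ram']
```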

In [57]:
# Select the rank-1 features by name rather than by column position
new_columns = selected_rf_features.loc[selected_rf_features["Ranking"] == 1, "Feature"].tolist()
X_new = df[new_columns]
X_new.head()

Unnamed: 0,battery_power,ram,px_height,px_width
0,842,2549,20,756
1,1021,2631,905,1988
2,563,2603,1263,1716
3,615,2769,1216,1786
4,1821,1411,1208,1212


In [58]:
# Create a new train-test split that uses only the selected features
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.25, random_state=45)

In [60]:
rf_final = RandomForestClassifier(random_state=123, n_estimators=1000, max_depth=5)
rf_final.fit(X_train, y_train)
accuracy_score(y_test, rf_final.predict(X_test))

0.882

* The accuracy of the random forest model built with the selected variables is 0.882.
* That is, reducing the model from 20 variables to 4 with the variable selection method raised the accuracy on this test split from 0.844 to 0.882.
* Using fewer variables also reduces model training time.
* And when the model is built with fewer variables, the risk of overfitting is reduced.
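
A single holdout split gives only one accuracy estimate, so it can help to repeat the comparison with cross-validation. The sketch below is a rough illustration under stated assumptions: it uses a synthetic dataset rather than the mobile price data, and it swaps in scikit-learn's SelectFromModel as a stand-in for Boruta so that the example depends on scikit-learn only. Placing the selector inside a Pipeline means the features are chosen from the training folds alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic 4-class problem (an assumption): 20 features, 4 of them informative.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=20, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Baseline: random forest on all features.
all_features = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=123)

# Selection + model in one pipeline, so selection is refit on each training fold.
# SelectFromModel is a stand-in here, not the Boruta algorithm.
selected_features = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=1),
        threshold="mean",  # keep features with above-average importance
    )),
    ("model", RandomForestClassifier(n_estimators=100, max_depth=5, random_state=123)),
])

scores_all = cross_val_score(all_features, X_syn, y_syn, cv=5, scoring="accuracy")
scores_sel = cross_val_score(selected_features, X_syn, y_syn, cv=5, scoring="accuracy")

print(f"all features:      {scores_all.mean():.3f} +/- {scores_all.std():.3f}")
print(f"selected features: {scores_sel.mean():.3f} +/- {scores_sel.std():.3f}")
```

If the selected-feature model scores at least as well across the folds, the improvement is less likely to be an artifact of one particular split.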