### Name: KHUN Dararith
### ID: M060813

***
# <center>Feature Selection Technique</center>
***

### What is Feature Selection?
Feature selection is the process of reducing the number of input variables used to build a predictive model by keeping only the most significant and relevant features in the dataset.

### Why do we need Feature Selection?
Some features may be irrelevant or only weakly related to the dependent variable. Including them in the model unnecessarily leads to:

* Increased model complexity, which makes the model harder to interpret.
* Longer training time.
* A poorly performing model with inaccurate or less reliable predictions.

#### Key Benefits:
Feature selection helps find the smallest set of features that results in:
* Training a machine learning algorithm faster.
* Reducing the complexity of a model and making it easier to interpret.
* Building a sensible model with better prediction power.
* Reducing over-fitting by selecting the right set of features.

### Feature Selection Methods/Techniques
There are two main types of feature selection techniques:

* Supervised Feature Selection technique
  * Supervised feature selection techniques take the target variable into account and can be used on labelled datasets.
* Unsupervised Feature Selection technique
  * Unsupervised feature selection techniques ignore the target variable and can be used on unlabelled datasets.
  
***There are three main families of techniques under supervised feature selection:***

##### 1. Wrapper Methods
* Forward Feature Selection
* Backward Feature Selection
* Exhaustive Feature Selection
* Recursive Feature Elimination
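
As an illustration of the recursive approach listed above, here is a minimal sketch using recursive feature elimination (`RFE`) from scikit-learn. The data is a synthetic stand-in generated with `make_classification`, and the names `X_demo`, `y_demo` and `rfe` are illustrative only; they are not part of the heart disease example in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 samples, 8 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# RFE fits the estimator, drops the feature with the smallest coefficient
# magnitude, and repeats until only n_features_to_select features remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X_demo, y_demo)

print("kept mask:", rfe.support_)  # True for the retained features
print("ranking:  ", rfe.ranking_)  # 1 = retained, larger = eliminated earlier
```

Unlike forward selection, this starts from the full feature set and removes one feature per round.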

##### 2. Filter Methods
* Information Gain
* Chi-square Test
* Fisher's Score
* Missing Value Ratio
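
Filter methods score each feature against the target without training the final model. The sketch below shows two of the scores listed above with scikit-learn's `SelectKBest`: mutual information (an information-gain style score) and the chi-square test. It runs on synthetic data, not on heart.csv, and all variable names are illustrative assumptions. Since `chi2` requires non-negative inputs, the features are rescaled to [0, 1] first.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 300 samples, 10 features, 4 informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Mutual information between each feature and the label.
# random_state is fixed so the estimated scores are reproducible.
mi_score = partial(mutual_info_classif, random_state=0)
mi_selector = SelectKBest(score_func=mi_score, k=4).fit(X_demo, y_demo)

# Chi-square test: rescale to [0, 1] because chi2 rejects negative values.
X_nonneg = MinMaxScaler().fit_transform(X_demo)
chi_selector = SelectKBest(score_func=chi2, k=4).fit(X_nonneg, y_demo)

print("top 4 by mutual information:", mi_selector.get_support(indices=True))
print("top 4 by chi-square:        ", chi_selector.get_support(indices=True))
```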

##### 3. Embedded Methods
* Regularization L1, L2
* Random Forest Importance
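
Embedded methods perform the selection as part of model training. A minimal sketch of the two ideas listed above, assuming synthetic data and illustrative names (`l1_model`, `forest`, and so on): an L1-penalized logistic regression, whose penalty can shrink weak coefficients to exactly zero, and a random forest, which exposes `feature_importances_` after fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 10 features, 4 informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# L1 regularization: standardize first so the penalty treats features equally.
# A small C means a strong penalty; features whose coefficient ends up at
# exactly zero are effectively dropped by the model.
X_scaled = StandardScaler().fit_transform(X_demo)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l1_model.fit(X_scaled, y_demo)
kept_by_l1 = np.flatnonzero(l1_model.coef_[0])

# Random forest importance: impurity-based importances, which sum to 1.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_demo, y_demo)
ranked_by_forest = np.argsort(forest.feature_importances_)[::-1]

print("features with non-zero L1 coefficient:", kept_by_l1)
print("features ranked by forest importance: ", ranked_by_forest)
```

The value of `C` and the number of trees are arbitrary choices for the sketch and would normally be tuned.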

## Forward Selection
Forward selection is an iterative process that begins with an empty set of features. In each iteration, it adds the one remaining feature that most improves model performance and then re-evaluates the model. The process stops when adding a new feature no longer improves the performance of the model.

***Process Of Wrapper Methods***
<center>
    <tt>
    |Set of Features|>>|Generate subset ⇄ Algorithm|>>|Performance|
    </tt>
</center>

**Steps to perform Forward Feature Selection:**

1. Train n models, one for each of the n features individually, and evaluate the performance of each.
2. Keep the feature that gives the best performance.
3. Add each remaining feature, one at a time, to the features already kept and retrain.
4. Retain the feature that produces the largest improvement.
5. Repeat steps 3 and 4 until there is no significant improvement in the model's performance.
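
The steps above can be sketched from scratch as a greedy loop. This is only an illustration under stated assumptions: `forward_select` is a hypothetical helper (not the mlxtend implementation used later), performance is measured as mean cross-validated accuracy, and `min_gain` is an arbitrary threshold for what counts as a significant improvement.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, estimator, max_features, cv=5, min_gain=1e-3):
    """Greedy forward selection: add the feature that most improves the CV score."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score every candidate set formed by adding one remaining feature.
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best_j = max(scores, key=scores.get)
        # Stop when the best candidate no longer gives a meaningful improvement.
        if scores[best_j] - best_score < min_gain:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score


# Synthetic stand-in data (assumption): 200 samples, 6 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
chosen, score = forward_select(
    X_demo, y_demo, LogisticRegression(max_iter=1000), max_features=3
)
print("selected feature indices:", chosen, "CV accuracy:", round(score, 3))
```

Each round retrains one model per remaining feature, so the cost grows quickly with the number of features; this is the main drawback of wrapper methods.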

### Implementation:
The dataset is the heart.csv file from Kaggle, which is used for heart disease detection.

In [110]:
#importing the libraries
import pandas as pd

In [111]:
data = pd.read_csv('/Users/dararithkhun/Documents/ITC/M1-Program Semester 1/Machine_Learning/ML_Assignments/Final_Project_M1Semester1/Heart Disease/heart.csv')

In [112]:
data.shape

(303, 14)

In [113]:
data.head()

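| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |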


In [114]:
# check for missing values
data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [115]:
# define the feature matrix X (every column except 'chol' and the label 'target') and the target vector y
X = data.drop(['chol', 'target'], axis=1)
y = data['target']

In [116]:
X.shape, y.shape

((303, 12), (303,))

In [117]:
# to install the mlxtend library:
# conda install mlxtend --channel conda-forge

In [118]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score as acc
from sklearn.ensemble import RandomForestClassifier

In [127]:
# forward selection of 5 features, using a linear regression model scored by R^2
linearR = LinearRegression()
sfsTest = sfs(linearR, k_features=5, forward=True, verbose=2, scoring='r2')

In [128]:
sfsTest = sfsTest.fit(X, y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    0.1s finished

[2022-05-25 05:55:17] Features: 1/5 -- score: 0.0449193701526039
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed:    0.1s finished

[2022-05-25 05:55:17] Features: 2/5 -- score: 0.07194095902184601
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished

[2022-05-25 05:55:17] Features: 3/5 -- score: 0.08448908286073334
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   

In [129]:
feature_names = list(sfsTest.k_feature_names_)
print(feature_names)

['sex', 'restecg', 'exang', 'oldpeak', 'thal']


In [130]:
# creating a new dataframe using the above variables and adding the target variable
# .copy() makes new_data an independent DataFrame, so adding a column
# does not trigger pandas' SettingWithCopyWarning
new_data = data[feature_names].copy()
new_data['target'] = data['target']

# first five rows of the new data
new_data.head()

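| | sex | restecg | exang | oldpeak | thal | target |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 2.3 | 1 | 1 |
| 1 | 1 | 1 | 0 | 3.5 | 2 | 1 |
| 2 | 0 | 0 | 0 | 1.4 | 2 | 1 |
| 3 | 1 | 1 | 0 | 0.8 | 2 | 1 |
| 4 | 0 | 1 | 1 | 0.6 | 2 | 1 |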


In [131]:
# shape of new and original data
new_data.shape, data.shape

((303, 6), (303, 14))

In [132]:
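# define training data
# note: 'sex' is dropped along with the label, so the model below is trained
# on four of the five selected features (restecg, exang, oldpeak, thal)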
X1 = new_data.drop(['sex', 'target'], axis=1)
y1 = new_data['target']

In [160]:
# Build full model with selected features
clf = RandomForestClassifier(n_estimators=1000, random_state=40, max_depth=8)
clf.fit(X1, y1)

y_train_pred = clf.predict(X1)
print('Training accuracy on selected features: %.3f' % acc(y1, y_train_pred))

Training accuracy on selected features: 0.888
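
The accuracy reported above is measured on the same rows the model was trained on, so it tends to overestimate how well the model generalizes. A minimal sketch of a held-out evaluation follows. It assumes a synthetic stand-in dataset with 303 rows and 4 features rather than heart.csv, and `X_demo`, `y_demo` and `clf_demo` are illustrative names, so the printed numbers say nothing about the heart disease model itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption): 303 rows and 4 features.
X_demo, y_demo = make_classification(
    n_samples=303, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# Hold out 25% of the rows; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=40
)

clf_demo = RandomForestClassifier(n_estimators=200, random_state=40, max_depth=8)
clf_demo.fit(X_train, y_train)

# Compare accuracy on seen rows, unseen rows, and 5-fold cross-validation.
train_acc = accuracy_score(y_train, clf_demo.predict(X_train))
test_acc = accuracy_score(y_test, clf_demo.predict(X_test))
cv_acc = cross_val_score(clf_demo, X_demo, y_demo, cv=5).mean()

print('Training accuracy: %.3f' % train_acc)
print('Test accuracy:     %.3f' % test_acc)
print('5-fold CV accuracy: %.3f' % cv_acc)
```

For a fully fair estimate, the feature selection step itself would also be fitted on the training split only, since selecting features on all rows lets information from the test rows leak into the model.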
