
# Supervised Learning


## Regression: feature selection
### Selecting the correct features:
- Reduces overfitting
- Improves accuracy
- Increases interpretability
- Reduces training time

### Feature selection methods
- Filter: Rank features based on statistical performance
- Wrapper: Use an ML method to evaluate performance
- Embedded: Iterative model training to extract features
- Feature importance: tree-based ML models


### Compare and contrast methods
|Method |Use an ML model |Select best subset |Can overfit|
| -------- | -------- | -------- | -------- |
|Filter |No |No |No|
|Wrapper |Yes |Yes |Sometimes|
|Embedded |Yes |Yes |Yes|
|Feature importance |Yes |Yes |Yes|

### Correlation coefficient statistical tests
|Feature/Response |Continuous |Categorical|
| -------- | -------- | -------- |
|Continuous |Pearson's Correlation |LDA|
|Categorical |ANOVA |Chi-Square|
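
As a rough illustration of the continuous feature / continuous response cell of this table, the sketch below ranks features with a univariate F-test derived from Pearson's correlation. It assumes scikit-learn's bundled `load_diabetes` data as a stand-in for the notebook's `diabetes` frame, and `k=3` is an arbitrary choice made only for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Assumption: the bundled diabetes data stands in for the notebook's dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)

# f_regression scores each feature separately with an F-statistic
# computed from its Pearson correlation with the response
selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(X, y)

# Names of the k features with the highest scores
selected = list(X.columns[selector.get_support()])
print(selected)
```

Because every feature is scored independently of the others, this is a filter method: no predictive model is trained.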

### Filter functions
|Function |returns|
|---|---|
|df.corr() |Pearson's correlation matrix|
|sns.heatmap(corr_object) |heatmap plot|
|abs() |absolute value|

In [0]:
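# Imports for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# `diabetes` is assumed to be a pandas DataFrame loaded earlier,
# with the response in the "progression" column

# Create correlation matrix and print it
cor = diabetes.corr()
print(cor)

# Correlation matrix heatmap
plt.figure()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

# Absolute correlation of each feature with the output variable
# (the target's correlation with itself is dropped)
cor_target = cor["progression"].drop("progression").abs()

# Selecting highly correlated features
best_features = cor_target[cor_target > 0.5]
print(best_features)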

### Wrapper methods
1. Forward selection (LARS: least angle regression)
    - Starts with no features, adds one at a time
2. Backward elimination
    - Starts with all features, eliminates one at a time
3. Forward selection/backward elimination combination (bidirectional elimination)
4. Recursive feature elimination
    - RFECV
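
Items 1 and 2 in the list above can be sketched with scikit-learn's `SequentialFeatureSelector`. Note that this is a greedy, cross-validated selector and not LARS itself. The bundled `load_diabetes` data, the `LinearRegression` estimator, and `n_features_to_select=3` are all assumptions made for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Assumption: the bundled diabetes data stands in for the notebook's X and y
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Forward selection: start with no features and add the one that
# improves the cross-validated score the most at each step
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
forward.fit(X, y)
forward_features = list(X.columns[forward.get_support()])

# Backward elimination: start with all features and drop the one
# whose removal hurts the cross-validated score the least
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
)
backward.fit(X, y)
backward_features = list(X.columns[backward.get_support()])

print(forward_features)
print(backward_features)
```

The two directions need not agree on the final subset, which is one reason to compare them before committing to a feature set.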

In [0]:
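# Import modules
from sklearn.svm import SVR
from sklearn.feature_selection import RFECV

# X (feature DataFrame) and y (response) are assumed to be defined earlier

# Instantiate estimator and feature selector
# (a linear kernel exposes coef_, which RFECV uses to rank features)
svr_mod = SVR(kernel="linear")
feat_selector = RFECV(svr_mod, cv=5)

# Fit
feat_selector = feat_selector.fit(X, y)

# Print support (boolean mask), ranking (selected=1) and column names
print(feat_selector.support_)
print(feat_selector.ranking_)
print(X.columns)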

In [0]:
# Import modules
from sklearn.linear_model import LarsCV

# Drop the feature flagged as unimportant in step 2
X = X.drop('sex', axis=1)

# Instantiate
# (`normalize` is omitted: it has been removed from recent scikit-learn
# releases, and False was the value used here)
lars_mod = LarsCV(cv=5)

# Fit
lars_mod.fit(X, y)

# Print r-squared score and estimated alpha
print(lars_mod.score(X, y))
print(lars_mod.alpha_)

### Embedded methods
1. Lasso Regression
2. Ridge Regression
3. ElasticNet
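
A minimal sketch of the first embedded method, assuming the bundled `load_diabetes` data in place of the notebook's `X` and `y`. The L1 penalty in Lasso can shrink some coefficients to exactly zero, so selection happens as part of fitting the model.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV

# Assumption: the bundled diabetes data stands in for the notebook's X and y
X, y = load_diabetes(return_X_y=True, as_frame=True)

# LassoCV picks the regularization strength alpha by cross-validation
lasso = LassoCV(cv=5, random_state=0)
lasso.fit(X, y)

# Features with a nonzero coefficient are the ones the model kept
kept = list(X.columns[lasso.coef_ != 0])
dropped = list(X.columns[lasso.coef_ == 0])

print("alpha:", lasso.alpha_)
print("kept:", kept)
print("dropped:", dropped)
```

Ridge regression shrinks coefficients without setting them to exactly zero, so it needs a threshold on coefficient size to be used this way. ElasticNet mixes both penalties.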

### Tree-based feature importance methods
- Random Forest --> `sklearn.ensemble.RandomForestRegressor`
- Extra Trees --> `sklearn.ensemble.ExtraTreesRegressor`
- After model fit --> `tree_mod.feature_importances_`
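
The attribute in the last bullet can be read as follows. This is a sketch assuming the bundled `load_diabetes` data. The name `tree_mod` follows the bullet above, while `n_estimators=200` and `random_state=0` are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Assumption: the bundled diabetes data stands in for the notebook's X and y
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Fit the forest; random_state fixes the result for reproducibility
tree_mod = RandomForestRegressor(n_estimators=200, random_state=0)
tree_mod.fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1
importances = pd.Series(
    tree_mod.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Swapping in `ExtraTreesRegressor` requires changing only the import and the constructor, since both estimators expose `feature_importances_` after fitting.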

|Function |returns|
|---|---|
|`sklearn.svm.SVR` |support vector regression estimator|
|`sklearn.feature_selection.RFECV` |recursive feature elimination with cross-val|
|`rfe_mod.support_` |boolean array of selected features|
|`rfe_mod.ranking_` |feature ranking, selected=1|
|`sklearn.linear_model.LinearRegression` |linear model estimator|
|`sklearn.linear_model.LarsCV` |least angle regression with cross-val|
|`LarsCV.score` |r-squared score|
|`LarsCV.alpha_` |estimated regularization parameter|