[![Group-1subset-selection.png](https://i.postimg.cc/ZRBGJ8Xh/Group-1subset-selection.png)](https://postimg.cc/njxRkDHd)

## Why?
Selecting a subset of predictors is useful in certain situations. Assuming the underlying relationship between the predictors $X$ and the target $y$ is linear, we may want a subset of predictors for two reasons.
* **Accuracy**. When the relationship between $X$ and $y$ is linear, a least squares model has low bias. If the number of samples is much greater than the number of predictors, the model also has low variance. As the two numbers get closer, variance increases, which shows up as overfitting and poor performance on the test set. When there are more predictors than samples, the least squares fit cannot uniquely estimate the coefficients. Reducing the number of predictors can lower the variance and give a better model.
* **Interpretability**. Not all predictors affect the target equally. Some have a negligible effect on the output, and removing them gives a less complex, more interpretable model.
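
To make the "more predictors than samples" case concrete, here is a minimal sketch on synthetic data. The sizes, the seed and all variable names are illustrative assumptions and have nothing to do with the Credit dataset. With fewer samples than predictors, the design matrix cannot have full column rank, so two different coefficient vectors reproduce the training targets equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_feats = 5, 8            # fewer samples than predictors (assumed sizes)
X_small = rng.normal(size=(n_samples, n_feats))
y_small = rng.normal(size=n_samples)

# The rank is at most min(n_samples, n_feats), so it is below n_feats here.
rank = np.linalg.matrix_rank(X_small)

# lstsq returns the minimum-norm solution among infinitely many exact fits.
beta, *_ = np.linalg.lstsq(X_small, y_small, rcond=None)

# Adding a vector from the null space of X_small gives another exact fit.
_, _, vt = np.linalg.svd(X_small)
null_vec = vt[-1]                    # X_small @ null_vec is numerically zero
beta_alt = beta + 3.0 * null_vec

print("rank:", rank, "predictors:", n_feats)
print("both fit exactly:",
      np.allclose(X_small @ beta, y_small),
      np.allclose(X_small @ beta_alt, y_small))
```

Both coefficient vectors fit the training data, which is why the least squares estimate is not unique in this setting.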

Reducing or eliminating predictors can be done in three main ways:
- Subset Selection
- Shrinkage
- Dimension Reduction

Shrinkage covers methods like Ridge Regression and Lasso Regression. Dimension Reduction techniques include Principal Component Analysis, among others.

# Subset Selection
There are a few approaches to take here. We will begin with the simplest one.

## Best Subset
It simply trains a model on every possible combination of predictors and keeps the best one by error. Computationally this is an expensive approach, but it is still worth learning once. For $p$ predictors, there are $2^p$ models to fit.
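
As a quick sanity check of the $2^p$ figure, the sketch below counts the subsets of each size and also enumerates them directly. The value `p = 10` is an arbitrary choice for illustration.

```python
from itertools import combinations
from math import comb

p = 10  # number of predictors (assumed for illustration)

# Number of subsets of each size k = 0..p.
sizes = [comb(p, k) for k in range(p + 1)]
total = sum(sizes)

# Enumerating the subsets directly gives the same count for a small p.
enumerated = sum(1 for k in range(p + 1) for _ in combinations(range(p), k))

print(total, enumerated, 2 ** p)
```

The count doubles with every extra predictor, which is what makes best subset impractical beyond a few dozen predictors.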
### Algorithm

- Define $M_0$ as the null model. It contains no predictors and predicts the sample mean for every observation.
- For $k = 1, 2, ..., p$:
    - Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
    - Pick the best among those models and call it $M_k$. Best here means the lowest RSS or, equivalently, the highest $R^2$.
- Select a single best model from $M_0, ..., M_p$ using cross-validated prediction error, $C_p$ (AIC), or adjusted $R^2$.
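
For reference, the criteria named in the last step are usually written as follows, following the definitions in ISLR. Here $d$ is the number of predictors in the model, $n$ the number of samples, $\hat{\sigma}^2$ an estimate of the error variance (typically taken from the full model), and TSS the total sum of squares.

$$
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right),
\qquad
\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}
$$

A lower $C_p$ is better, while a higher adjusted $R^2$ is better. Plain training RSS or $R^2$ cannot be used to compare models of different sizes, because they always improve as predictors are added.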

### Implementation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from itertools import permutations
import pandas as pd

In [None]:
class BestSubset:
    def __init__(self, metric='Cp'):
        self.metric = metric
        self.modelContainer = []
    
    def fit(self, X, y):
        nSamples, nFeats = X.shape
        Mo = LinearRegression().fit(np.zeros((nSamples,1)),y) # this will return mean always
        self.modelContainer.append([]) #empty predictors
        
        for k in range(1,nFeats+1):
            print(f'Training for {k} predictors')
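            # combinations() yields each k-subset exactly once, already sorted,
            # so there is no need to build all permutations and de-duplicate them.
            from itertools import combinations  # local import keeps this cell self-contained
            feature_combinations = list(combinations(range(nFeats), k))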
            tmpModelContainer = []
            
            for t in range(len(feature_combinations)):
                p = np.array(feature_combinations[t])
#                 print(p.shape, p)
                tmpX = X[:,p]
                model = LinearRegression().fit(tmpX, y)
                r2 = model.score(tmpX, y)
                tmpModelContainer.append((p,r2))
            
            tmpModelContainer = sorted(tmpModelContainer, key=lambda x: x[1], reverse=True)
            self.modelContainer.append(tmpModelContainer[0][0]) # only best model's predictors
        
        print("Comparing final K models", X.shape)
        # Cp needs one estimate of the error variance, taken from the full model.
        full = LinearRegression().fit(X, y)
        est_var = np.sum((y - full.predict(X))**2) / (nSamples - nFeats - 1)
        Cp = []
        for k in range(nFeats + 1): # nFeats + 1 so the full model M_p is compared too
            predictors = self.modelContainer[k]
            print(f"{k}th model: ",len(predictors))
            
            if len(predictors) == 0:
                yhat = np.full_like(y, np.mean(y), dtype=float) # null model predicts the mean
            else:
                model = LinearRegression().fit(X[:, predictors], y)
                yhat = model.predict(X[:, predictors])
                
            rss = np.sum((y - yhat)**2)
            _Cp = ((rss + (2 * len(predictors) * est_var)) / nSamples)
            Cp.append(_Cp)
            
        idxmin = np.argmin(Cp)
        return self.modelContainer[idxmin], Cp

## Trying out the model
We try it on the Credit dataset used in the ISLR book.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
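df = pd.read_csv("../input/ISLR-Auto/Credit.csv")
df.drop(['Unnamed: 0'], axis=1, inplace=True)
predictiors = df.columns[df.columns!='Rating']
# Encode only the categorical (non-numeric) columns; numeric columns keep their values.
cat_cols = df[predictiors].select_dtypes(exclude='number').columns
oe = OrdinalEncoder().fit(df[cat_cols])
df[cat_cols] = oe.transform(df[cat_cols])
X = df[predictiors].values.astype(float)
y = df['Rating'].values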

In [None]:
%%time
bs = BestSubset()
best_predictors, scores = bs.fit(X,y)
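# Index into the predictor names, not df.columns, which still contains 'Rating'.
print(f"Best predictors are {predictiors[np.asarray(best_predictors, dtype=int)].values}")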

## Stepwise Selection
It does a similar job of comparing models but is computationally more efficient than best subset. Stepwise selection comes in two types:
- Forward Stepwise Selection
- Backward Stepwise Selection

Forward stepwise begins with no predictors, then adds, one at a time, the predictor that improves the fit the most. Here is the algorithm:
1. $M_0$ is the null model, which has no predictors and only returns the mean of the target variable.
2. For $k=0, ... ,p-1$:
    1. Consider all $p-k$ predictors that are not in $M_k$. Augment $M_k$ with each of them, one at a time, and train.
    2. Choose the best among those $p-k$ models, and call it $M_{k+1}$.
3. Select the best model from $M_0, ..., M_p$ using $C_p$ or other metrics.

In each iteration of step $(2)$, we train $p-k$ models. Counting the null model, that gives a total of $1 + \sum_{k=0}^{p-1} (p-k) = 1 + \frac{p(p+1)}{2}$ models to train, far fewer than the $2^p$ models that Best Subset has to train.
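
The two counts can be compared with a few lines of arithmetic. This is only a sketch: the helper names and the chosen values of `p` are assumptions for illustration.

```python
def forward_stepwise_count(p):
    """Null model plus (p - k) fits at each step k = 0..p-1."""
    return 1 + sum(p - k for k in range(p))

def best_subset_count(p):
    """Every subset of p predictors, including the empty one."""
    return 2 ** p

for p in (5, 10, 20):
    print(p, forward_stepwise_count(p), best_subset_count(p))
```

For 20 predictors this is 211 fits against 1,048,576, which shows how quickly the gap grows.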

### Implementation

In [None]:
class ForwardStepwise:
    def __init__(self):
        pass
    
    def fit(self, X, y):
        nSamples, nFeats = X.shape
        predictors = [] # predictors[0:i] will have best model for i number of predictors.
        for k in range(0, nFeats):
            unused_indices = [i for i in range(nFeats) if i not in predictors]
            tmp_scores = []
            for j in unused_indices:
                p = predictors + [j]
                model = LinearRegression().fit(X[:, p], y)
                score = model.score(X[:, p], y)
                tmp_scores.append(score)
            mx = np.argmax(tmp_scores)
            predictors.append(unused_indices[mx])
            
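        # Cp needs one estimate of the error variance, taken from the full model.
        full = LinearRegression().fit(X, y)
        est_var = np.sum((y - full.predict(X))**2) / (nSamples - nFeats - 1)
        Cp = []
        
        for k in range(nFeats + 1): # k = 0 is the null model, k = nFeats the full model
            p = predictors[0:k]
            if k == 0:
                yhat = np.full_like(y, np.mean(y), dtype=float) # null model predicts the mean
            else:
                model = LinearRegression().fit(X[:, p], y)
                yhat = model.predict(X[:, p])
                
            rss = np.sum((y - yhat)**2)
            _Cp = ((rss + (2 * len(p) * est_var)) / nSamples)
            Cp.append(_Cp)
            
        mn = np.argmin(Cp) # index k equals the number of predictors in M_k
        best_predictors = predictors[:mn]
        return best_predictors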

In [None]:
%%time
fs = ForwardStepwise()
best = fs.fit(X, y)
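# Index into the predictor names, not df.columns, which still contains 'Rating'.
print(f"Best predictors are {predictiors[np.asarray(best, dtype=int)].values}")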

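As we can see, *Best Subset* and *Forward Stepwise* both give the same result on this dataset, but the latter takes less time. In general, forward stepwise is not guaranteed to find the best possible subset, because each step only extends the model chosen in the previous step.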

## Backward Stepwise
It works like forward stepwise, but in reverse. We begin with the full model containing all $p$ predictors in step $(1)$, and then in step $(2)$ we remove the least useful predictor, one at a time.
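
A minimal sketch of backward stepwise is given below, assuming plain least squares via NumPy and $C_p$ as the final criterion. The helper names `rss` and `backward_stepwise` and the synthetic data are hypothetical, not part of the notebook above. Note that this approach needs more samples than predictors so that the full model can be fit.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of a least squares fit (with intercept) on columns `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_stepwise(X, y):
    """Return the predictor subset with the lowest Cp along the backward path."""
    n, p = X.shape
    current = list(range(p))
    path = [list(current)]                      # M_p, M_{p-1}, ..., M_0
    while current:
        # Drop the predictor whose removal increases RSS the least.
        drop = min(current, key=lambda j: rss(X, y, [c for c in current if c != j]))
        current = [c for c in current if c != drop]
        path.append(list(current))
    sigma2 = rss(X, y, range(p)) / (n - p - 1)  # error variance from the full model
    cp = [(rss(X, y, m) + 2 * len(m) * sigma2) / n for m in path]
    return path[int(np.argmin(cp))], cp

# Synthetic data: only columns 0 and 3 influence the target (illustrative assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=200)

selected, cp_values = backward_stepwise(X_demo, y_demo)
print(sorted(selected))
```

Like forward stepwise, this fits $1 + \frac{p(p+1)}{2}$ models in total, and it is not guaranteed to find the best possible subset.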

# Reference
1. James, Gareth, et al. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.