<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

## Lab 4.2.2: Feature Selection

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

### 5. Forward Feature Selection

> Forward Selection: Forward selection is an iterative method that starts with no features in the model. In each iteration, we add the feature that most improves the model, and we stop when adding a new variable no longer improves performance.

Create a regression model using forward feature selection: loop over the features, adding one at a time, until the prediction metric stops improving ($R^2$ and adjusted $R^2$ in this case).

#### 5.1 Load Wine Data & Define Predictor and Target

In [2]:
## Load the wine quality dataset

# Load the wine dataset from csv
wine = pd.read_csv('../../DATA/winequality_merged.csv')

# define the target variable (dependent variable) as y
y = wine['quality']

# Take all columns except the target as predictor columns
predictor_columns = [c for c in wine.columns if c != 'quality']  # list comprehension
# Build the predictor DataFrame X from those columns
X = pd.DataFrame(wine, columns=predictor_columns)

In [3]:
## Create training and testing subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

#### 5.2 Overview of the code below

The outer `while` loop runs until a full pass over the remaining features yields no improvement to the model. This is tracked by the flag `changed`: the loop ends when the flag stays `False`.
The inner `for` loop fits a model for each feature not yet included and calculates its $R^2$ and adjusted $R^2$. If a model improves on the best adjusted $R^2$ found so far, the record in `best` is updated.
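
The control flow described above can be sketched without any modelling library. This is a minimal sketch, not the lab's solution: `forward_select`, `toy_score` and the `gains` values are hypothetical names and numbers introduced for illustration, and the scoring function stands in for "fit a model and compute adjusted $R^2$".

```python
def forward_select(features, score_fn):
    """Greedy forward selection: add the best-scoring feature each round."""
    included = []          # features chosen so far
    best_score = 0.0       # best score seen so far
    while True:
        changed = False
        best_feature = None
        # inner loop: try each feature that is not yet in the model
        for candidate in features:
            if candidate in included:
                continue
            score = score_fn(included + [candidate])
            if score > best_score:
                best_score, best_feature = score, candidate
                changed = True
        if not changed:
            break          # no candidate improved the score: stop
        included.append(best_feature)
    return included, best_score


# Toy scoring function (an assumption for illustration): each feature has a
# fixed "gain", and every feature in the model costs a small penalty.
gains = {'a': 0.20, 'b': 0.05, 'c': 0.001}

def toy_score(cols):
    return sum(gains[c] for c in cols) - 0.01 * len(cols)

selected, score = forward_select(list(gains), toy_score)
print(selected, round(score, 3))
```

Feature `c` is never added because its gain is smaller than the penalty it incurs, which mirrors how adjusted $R^2$ can reject a predictor that barely raises $R^2$.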

#### Code variables
- `included`: list of the features (predictors) that were included in the model; starts empty.
- `excluded`: list of features that have **not** been included in the model; starts as the full list of features.
- `best`: dictionary to keep record of the best model found at any stage; starts 'empty'.
- `model`: object of class LinearRegression, with default values for all parameters.

#### Methods of the `LinearRegression` object to investigate
- `fit()`
- `score()` (called as `fit.score()` in the code below, because `fit()` returns the fitted model)
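
As a quick way to investigate these two methods, the sketch below fits a model on synthetic data and checks `score()` against $R^2$ computed by hand. The data and the names `X_toy`, `y_toy` and `r2_manual` are made up for illustration and are not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): y depends on column 0 exactly,
# column 1 is pure noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] + 1.0

model = LinearRegression()
fitted = model.fit(X_toy, y_toy)     # fit() returns the estimator itself
r2 = fitted.score(X_toy, y_toy)      # score() returns R^2 for regressors

# R^2 computed by hand: 1 - SS_res / SS_tot
pred = model.predict(X_toy)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(fitted is model)
print(np.isclose(r2, r2_manual))
```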

#### Adjusted $R^2$ formula
$$Adjusted \; R^2 = 1 - { (1 - R^2) (n - 1)  \over n - k - 1 }$$
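
In this formula, $n$ is the number of observations and $k$ is the number of predictors in the model. A minimal helper, assuming those meanings, shows how the penalty grows with $k$; the function name `adjusted_r2` and the guard against a non-positive denominator are additions for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError('need n > k + 1 observations')
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same R^2, more predictors: the adjusted value drops.
print(adjusted_r2(0.30, n=100, k=2))
print(adjusted_r2(0.30, n=100, k=12))
```

With a large training set the correction is small, which is why $R^2$ and adjusted $R^2$ stay close to each other in the outputs of this lab.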

#### Linear Regression [reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

## Forward feature selection the hard way

In [4]:
## Flag intermediate output

show_steps = True   # for testing/debugging
# show_steps = False  # without showing steps

In [5]:
X_train.shape[0]

5197

**What does ```%s``` mean in a Python format string?**
> The ```%s``` token inserts a value into the string, converted to text with `str()`. Each ```%s``` token is replaced by the value passed after the ```%``` operator that follows the string.
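
A few hedged examples of the format tokens that appear in this lab (`%s`, `%.3f`, `%-4s` style padding), with an f-string equivalent for comparison. The variable names and values are made up for illustration.

```python
feature = 'alcohol'
r2 = 0.2014

line_s = 'Trying %s...' % feature              # %s inserts str(value)
line_f = '%s: R^2 = %.3f' % (feature, r2)      # several values go in a tuple
line_pad = 'Added feature %-10s|' % feature    # %-10s left-aligns in 10 characters
line_fstr = f'{feature}: R^2 = {r2:.3f}'       # f-string equivalent of line_f

print(line_s)
print(line_f)
print(line_pad)
print(line_fstr)
```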

In [6]:
## Use Forward Feature Selection to pick a good model

## Code from Answer Lab

# start with no predictors
included = []
# keep track of model and parameters
best = {'feature': '', 'r2': 0, 'a_r2': 0}
# create a model object to hold the modelling parameters
model = LinearRegression()
# get the number of cases in the training data
n = X_train.shape[0]

while True:
    changed = False
    
    if show_steps:
        print('') 

    # list the features to be evaluated
    excluded = list(set(X.columns) - set(included))
    
    if show_steps:
        print('(Step) Excluded = %s' % ', '.join(excluded))  

    # for each remaining feature to be evaluated
    for new_column in excluded:
        
        if show_steps:
            print('(Step) Trying %s...' % new_column)
            print('(Step) - Features = %s' % ', '.join(included + [new_column]))

        # fit the model with the Training data
        fit = model.fit(X_train[included + [new_column]], y_train)
        # calculate the score (R^2 for Regression)
        r2 = fit.score(X_train[included + [new_column]], y_train)
        # number of predictors in this model
        k = len(included + [new_column])
        # calculate the adjusted R^2
        adjusted_r2 = 1 - ( ( (1 - r2) * (n - 1) ) / (n - k - 1) )

        if show_steps:
            print('(Step) - Adjusted R^2: This = %.3f; Best = %.3f' % 
                  (adjusted_r2, best['a_r2']))

        # if model improves
        if adjusted_r2 > best['a_r2']:
            # record new parameters
            best = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2}
            # flag that found a better model
            changed = True
            if show_steps:
                print('(Step) - New Best!   : Feature = %s; R^2 = %.3f; Adjusted R^2 = %.3f' % 
                      (best['feature'], best['r2'], best['a_r2']))
    # END for

    # if found a better model after testing all remaining features
    if changed:
        # update control details
        included.append(best['feature'])
        # remove the feature name itself; set('pH') would split it into characters
        excluded = list(set(excluded) - {best['feature']})
        print('Added feature %-4s with R^2 = %.3f and adjusted R^2 = %.3f' % 
              (best['feature'], best['r2'], best['a_r2']))
    else:
        # terminate if no better model
        print('*'*50)
        break

print('')
print('Resulting features:')
print(', '.join(included))


(Step) Excluded = residual sugar, fixed acidity, pH, chlorides, density, citric acid, alcohol, volatile acidity, sulphates, red_wine, total sulfur dioxide, free sulfur dioxide
(Step) Trying residual sugar...
(Step) - Features = residual sugar
(Step) - Adjusted R^2: This = 0.002; Best = 0.000
(Step) - New Best!   : Feature = residual sugar; R^2 = 0.002; Adjusted R^2 = 0.002
(Step) Trying fixed acidity...
(Step) - Features = fixed acidity
(Step) - Adjusted R^2: This = 0.004; Best = 0.002
(Step) - New Best!   : Feature = fixed acidity; R^2 = 0.004; Adjusted R^2 = 0.004
(Step) Trying pH...
(Step) - Features = pH
(Step) - Adjusted R^2: This = 0.000; Best = 0.004
(Step) Trying chlorides...
(Step) - Features = chlorides
(Step) - Adjusted R^2: This = 0.037; Best = 0.004
(Step) - New Best!   : Feature = chlorides; R^2 = 0.037; Adjusted R^2 = 0.037
(Step) Trying density...
(Step) - Features = density
(Step) - Adjusted R^2: This = 0.091; Best = 0.037
(Step) - New Best!   : Feature = density; R^2

In [7]:
## Use Forward Feature Selection to pick a good model

## Code without print statements from Answer Lab

# start with no predictors
included = []
# keep track of model and parameters
best = {'feature': '', 'r2': 0, 'a_r2': 0}
# create a model object to hold the modelling parameters
model = LinearRegression()
# get the number of cases in the training data
n = X_train.shape[0]

while True:
    changed = False
    
    # list the features to be evaluated
    excluded = list(set(X.columns) - set(included))
    
    # for each remaining feature to be evaluated
    for new_column in excluded:
        
        # fit the model with the Training data
        fit = model.fit(X_train[included + [new_column]], y_train)
        # calculate the score (R^2 for Regression)
        r2 = fit.score(X_train[included + [new_column]], y_train)
        # number of predictors in this model
        k = len(included + [new_column])
        # calculate the adjusted R^2
        adjusted_r2 = 1 - ( ( (1 - r2) * (n - 1) ) / (n - k - 1) )

        # if model improves
        if adjusted_r2 > best['a_r2']:
            # record new parameters
            best = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2}
            # flag that found a better model
            changed = True
            
    # END for loop

    # if found a better model after testing all remaining features
    if changed:
        # update control details
        included.append(best['feature'])
        excluded = list(set(excluded) - {best['feature']})
        
    else:
        # terminate if no better model
        break

print('Resulting features:', len(included))
print(', '.join(included))
print(f'Scores: R^2={np.round(best["r2"],3)}, adjusted R^2={np.round(best["a_r2"], 3)}')

Resulting features: 12
alcohol, volatile acidity, sulphates, residual sugar, red_wine, free sulfur dioxide, total sulfur dioxide, density, chlorides, pH, fixed acidity, citric acid
Scores: R^2=0.303, adjusted R^2=0.301


In [8]:
# Shorten inner for loop (Linear Regression model)

include = [] #predictors included in final model
best_score = {'feature': '', 'r2': 0, 'adj_r2': 0} #keep track of scores

while True:
    changed = False
    
    evaluate = list(set(X.columns) - set(include)) #predictors to be evaluated
    
    for col in evaluate:
        r2 = LinearRegression().fit(X_train[include + [col]], y_train).score(X_train[include + [col]], y_train)
        adjusted_r2 = 1 - ( ( (1 - r2) * (len(X_train) - 1) ) / (len(X_train) - len(include + [col]) - 1) )

        if adjusted_r2 > best_score['adj_r2']:
            best_score = {'feature': col, 'r2': r2, 'adj_r2': adjusted_r2}
            changed = True
            
    if changed:
        include.append(best_score['feature'])
        evaluate = list(set(evaluate) - {best_score['feature']})
        
    else:
        break

print('Number of resulting predictors:', len(include))
print('Predictor names: ' + ', '.join(include))
print(f'Scores: R^2={np.round(best_score["r2"],3)}, adjusted R^2={np.round(best_score["adj_r2"], 3)}')

Number of resulting predictors: 12
Predictor names: alcohol, volatile acidity, sulphates, residual sugar, red_wine, free sulfur dioxide, total sulfur dioxide, density, chlorides, pH, fixed acidity, citric acid
Scores: R^2=0.303, adjusted R^2=0.301


In [None]:
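# Different loop options for the outer loop (headers only; each still needs a body)

tol = 0.001
diff = 0

# while diff < tol:
# while diff == 0 or diff < 0:   # `or`, not `|`: the bitwise operator binds tighter than `==`
# while best_r2 >= adjusted_r2:
# while True:                    # exit with `break`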

**Break, continue and pass in Python**

**Break statement:**
> The break statement in Python terminates the current (innermost) loop and resumes execution at the first statement after the loop.

**Continue statement:**
> The continue statement in Python skips the remaining statements in the current iteration and returns control to the top of the enclosing `for` or `while` loop.

**Pass statement:**
> The pass statement in Python is used when a statement is required syntactically but you 
do not want any command or code to execute.

**The else statement used with Loops:**
> If the else statement is used with a for loop, it is executed when the loop has finished iterating over the sequence.
If the else statement is used with a while loop, it is executed when the condition becomes false. In both cases the else block is skipped if the loop was ended by a break statement.
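
The four constructs can be seen together in one small sketch. The function `first_above` and the class `Placeholder` are hypothetical names used only for illustration.

```python
def first_above(values, threshold):
    """Return the first non-negative value above threshold, or None."""
    for v in values:
        if v < 0:
            continue        # skip the rest of this iteration
        if v > threshold:
            found = v
            break           # leave the loop; the else branch is skipped
    else:
        found = None        # runs only if the loop finished without break
    return found


class Placeholder:
    pass                    # syntactically required body that does nothing


print(first_above([-5, 1, 7, 9], threshold=3))
print(first_above([-5, 1, 2], threshold=3))
```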

In [9]:
# Shorten outer while loop

include = [] #predictors included in final model
inner_score = {'feature': '', 'r2': 0, 'adj_r2': 0} #keep track of scores in inner loop
best_score = {'feature': '', 'r2': 0, 'adj_r2': 0} #keep track of scores in outer loop

while True:
    
    evaluate = list(set(X.columns) - set(include)) #predictors to be evaluated
        
    for col in evaluate:
        r2 = LinearRegression().fit(X_train[include + [col]], y_train).score(X_train[include + [col]], y_train)
        adjusted_r2 = 1 - ( ( (1 - r2) * (len(X_train) - 1) ) / (len(X_train) - len(include + [col]) - 1) )
        
        if adjusted_r2 > inner_score['adj_r2']:
            inner_score = {'feature': col, 'r2': r2, 'adj_r2': adjusted_r2}
                                    
    if inner_score['adj_r2'] > best_score['adj_r2']:
        best_score = inner_score
        include.append(best_score['feature'])
        evaluate = list(set(evaluate) - {best_score['feature']})
            
    else:
        break

print('Number of resulting predictors:', len(include))
print('Predictor names: ' + ', '.join(include))
print(f'Scores: R^2={np.round(best_score["r2"],3)}, adjusted R^2={np.round(best_score["adj_r2"], 3)}')

Number of resulting predictors: 12
Predictor names: alcohol, volatile acidity, sulphates, residual sugar, red_wine, free sulfur dioxide, total sulfur dioxide, density, chlorides, pH, fixed acidity, citric acid
Scores: R^2=0.303, adjusted R^2=0.301


Formatting:
* Make printed text **bold and italic**: wrap it in the ANSI escape sequences "\033[1;3m" (start) and "\033[0m" (reset). Use "\033[1m" for bold only.
* New line character: "\n"
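
A small sketch of these formatting codes in use. The constant names and the helper `emphasise` are assumptions made for illustration; whether the styling is actually rendered depends on the terminal or notebook front end.

```python
BOLD_ITALIC = '\033[1;3m'   # SGR codes: 1 = bold, 3 = italic
BOLD = '\033[1m'
RESET = '\033[0m'           # switches all styling off again

def emphasise(text, style=BOLD):
    """Wrap text in an ANSI style code and a reset code."""
    return f'{style}{text}{RESET}'

# "\n" starts the message on a new line
message = f'\nFFS round No. {emphasise(1, BOLD_ITALIC)}: added {emphasise("alcohol")}'
print(message)
```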

In [10]:
# Shorten print statements

include = [] #predictors included in final model
inner_score = {'feature': '', 'r2': 0, 'adj_r2': 0} #keep track of scores in inner loop
best_score = {'feature': '', 'r2': 0, 'adj_r2': 0} #keep track of scores in outer loop
n = 1 #loop counter

while True:
    evaluate = list(set(X.columns) - set(include)) #predictors to be evaluated
    print(f'\n\033[1;3mFFS round No. {n}:\033[0m')
        
    for col in evaluate:
        r2 = LinearRegression().fit(X_train[include + [col]], y_train).score(X_train[include + [col]], y_train)
        adjusted_r2 = 1 - ( ( (1 - r2) * (len(X_train) - 1) ) / (len(X_train) - len(include + [col]) - 1) )
        
        if adjusted_r2 > inner_score['adj_r2']:
            inner_score = {'feature': col, 'r2': r2, 'adj_r2': adjusted_r2}
            print(f'<Current New Best! Feature> {inner_score["feature"]}, R^2: {np.round(inner_score["r2"], 3)}, adjusted R^2 {np.round(inner_score["adj_r2"], 3)}')
                            
    if inner_score['adj_r2'] > best_score['adj_r2']:
        best_score = inner_score
        include.append(best_score['feature'])
        evaluate = list(set(evaluate) - {best_score['feature']})
        print(f'Added new best feature \033[1;3m{best_score["feature"]}\033[0m to list. Current best scores: R^2={np.round(best_score["r2"],3)}, adjusted R^2={np.round(best_score["adj_r2"], 3)}')
        n+=1
    
    else:
        print('\n', 47*'!', '\n Selection terminated. No better model was found\n', 47*'!')
        break

print('\nNumber of resulting predictors:', len(include))
print('Predictor names: ' + ', '.join(include))
print(f'Final scores: R^2={np.round(best_score["r2"],3)}, adjusted R^2={np.round(best_score["adj_r2"], 3)}')


FFS round No. 1:
<Current New Best! Feature> residual sugar, R^2: 0.002, adjusted R^2 0.002
<Current New Best! Feature> fixed acidity, R^2: 0.004, adjusted R^2 0.004
<Current New Best! Feature> chlorides, R^2: 0.037, adjusted R^2 0.037
<Current New Best! Feature> density, R^2: 0.091, adjusted R^2 0.091
<Current New Best! Feature> alcohol, R^2: 0.201, adjusted R^2 0.201
Added new best feature alcohol to list. Current best scores: R^2=0.201, adjusted R^2=0.201

FFS round No. 2:
<Current New Best! Feature> residual sugar, R^2: 0.218, adjusted R^2 0.218
<Current New Best! Feature> volatile acidity, R^2: 0.262, adjusted R^2 0.261
Added new best feature volatile acidity to list. Current best scores: R^2=0.262, adjusted R^2=0.261

FFS round No. 3:
<Current New Best! Feature> residual sugar, R^2: 0.268, adjusted R^2 0.268
<Current New Best! Feature> density, R^2: 0.271, adjusted R^2 0.27
<Current New Best! Feature> sulphates, R^2: 0.273, adjus

In [22]:
# Alternatively (outer while loop with valueChanged=True/False) ...

include = [] #predictors included in final model
best_score = {'feature': '', 'r2': 0, 'adj_r2': 0}
n = 1 #loop counter

while True:
    valueChanged = False
    evaluate = list(set(X.columns) - set(include)) #predictors to be evaluated
    print(f'\n\033[1;3mFFS round No. {n}:\033[0m')
         
    for col in evaluate:
        r2 = LinearRegression().fit(X_train[include + [col]], y_train).score(X_train[include + [col]], y_train)
        adjusted_r2 = 1 - ( ( (1 - r2) * (len(X_train) - 1) ) / (len(X_train) - len(include + [col]) - 1) )
        
        if adjusted_r2 > best_score['adj_r2']:
            valueChanged = True
            best_score = {'feature': col, 'r2': r2, 'adj_r2': adjusted_r2}
            #print(f'<Current New Best! Feature> {best_score["feature"]}, R^2: {np.round(best_score["r2"], 3)}, adjusted R^2 {np.round(best_score["adj_r2"], 3)}')
                                    
    if valueChanged:
        include.append(best_score['feature'])
        evaluate = list(set(evaluate) - {best_score['feature']})
        print(f'Added new best feature \033[1;3m{best_score["feature"]}\033[0m to list. Current best scores: R^2={np.round(best_score["r2"],3)}, adjusted R^2={np.round(best_score["adj_r2"], 3)}')
        n+=1
    
    else:
        print('\n', 47*'!', '\n Selection terminated. No better model was found\n', 47*'!')
        break

print('\nNumber of resulting predictors:', len(include))
print('Predictor names: ' + ', '.join(include))
print(f'Scores: R^2={np.round(best_score["r2"],3)}, adjusted R^2={np.round(best_score["adj_r2"], 3)}')


FFS round No. 1:
Added new best feature alcohol to list. Current best scores: R^2=0.201, adjusted R^2=0.201

FFS round No. 2:
Added new best feature volatile acidity to list. Current best scores: R^2=0.262, adjusted R^2=0.261

FFS round No. 3:
Added new best feature sulphates to list. Current best scores: R^2=0.273, adjusted R^2=0.272

FFS round No. 4:
Added new best feature residual sugar to list. Current best scores: R^2=0.282, adjusted R^2=0.282

FFS round No. 5:
Added new best feature red_wine to list. Current best scores: R^2=0.287, adjusted R^2=0.286

FFS round No. 6:
Added new best feature free sulfur dioxide to list. Current best scores: R^2=0.291, adjusted R^2=0.29

FFS round No. 7:
Added new best feature total sulfur dioxide to list. Current best scores: R^2=0.295, adjusted R^2=0.294

FFS round No. 8:
Added new best feature [1

**Wrap forward feature selection in a function**

In [12]:
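def ffs(X=X, y=y, model=LinearRegression(), ttsplit=0.2):
    """
    Select features by forward feature selection on a train split.

    The defaults for X, y and model are bound when the function is defined.
    Returns (include, best_score): the selected feature names in order
    of addition, and the scores of the final model.
    """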
    
    include = [] #predictors included in final model
    inner_score = {'feature': '', 'r2': 0, 'adj_r2': 0} #keep track of scores in inner loop
    best_score = {'feature': '', 'r2': 0, 'adj_r2': 0} #keep track of scores in outer loop
    n = 1 #loop counter
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = ttsplit, random_state=42)

    while True:
        evaluate = list(set(X.columns) - set(include)) #predictors to be evaluated
        print(f'\n\033[1;3mFFS round No. {n}:\033[0m')

        for col in evaluate:
            r2 = model.fit(X_train[include + [col]], y_train).score(X_train[include + [col]], y_train)
            adjusted_r2 = 1 - ( ( (1 - r2) * (len(X_train) - 1) ) / (len(X_train) - len(include + [col]) - 1) )

            if adjusted_r2 > inner_score['adj_r2']:
                inner_score = {'feature': col, 'r2': r2, 'adj_r2': adjusted_r2}
                #print(f'<Current New Best! Feature> {inner_score["feature"]}, R^2: {np.round(inner_score["r2"], 3)}, adjusted R^2 {np.round(inner_score["adj_r2"], 3)}')
            
        if inner_score['adj_r2'] > best_score['adj_r2']:
            best_score = inner_score
            include.append(best_score['feature'])
            evaluate = list(set(evaluate) - {best_score['feature']})
            print(f'Added new best feature \033[1;3m{best_score["feature"]}\033[0m to list. Current best scores: R^2={np.round(best_score["r2"],3)}, adjusted R^2={np.round(best_score["adj_r2"], 3)}')
            n+=1

        else:
            print('\n', 47*'!', '\n Selection terminated. No better model was found\n', 47*'!')
            break

    print('\nNumber of resulting predictors:', len(include))
    print('Predictor names: ' + ', '.join(include))
    print(f'Final scores: R^2={np.round(best_score["r2"],3)}, adjusted R^2={np.round(best_score["adj_r2"], 3)}')
    
    return include, best_score

In [13]:
predictor_list = ffs()


FFS round No. 1:
Added new best feature alcohol to list. Current best scores: R^2=0.201, adjusted R^2=0.201

FFS round No. 2:
Added new best feature volatile acidity to list. Current best scores: R^2=0.262, adjusted R^2=0.261

FFS round No. 3:
Added new best feature sulphates to list. Current best scores: R^2=0.273, adjusted R^2=0.272

FFS round No. 4:
Added new best feature residual sugar to list. Current best scores: R^2=0.282, adjusted R^2=0.282

FFS round No. 5:
Added new best feature red_wine to list. Current best scores: R^2=0.287, adjusted R^2=0.286

FFS round No. 6:
Added new best feature free sulfur dioxide to list. Current best scores: R^2=0.291, adjusted R^2=0.29

FFS round No. 7:
Added new best feature total sulfur dioxide to list. Current best scores: R^2=0.295, adjusted R^2=0.294

FFS round No. 8:
Added new best feature [1

In [14]:
predictor_list

(['alcohol',
  'volatile acidity',
  'sulphates',
  'residual sugar',
  'red_wine',
  'free sulfur dioxide',
  'total sulfur dioxide',
  'density',
  'chlorides',
  'pH',
  'fixed acidity',
  'citric acid'],
 {'feature': 'citric acid',
  'r2': 0.3029447101867323,
  'adj_r2': 0.3013311562751275})

In [15]:
# Retrieve top 5 features
predictor_list[0][0:5]

['alcohol', 'volatile acidity', 'sulphates', 'residual sugar', 'red_wine']

In [16]:
# Retrieve adjusted R^2 value
predictor_list[1]['adj_r2']

0.3013311562751275

## Feature selection the lazy way 

In [17]:
# Forward feature selection (the lazy way) using mlxtend

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

lreg = LinearRegression()
sfs = SFS(lreg, k_features='best', forward=True, verbose=0, scoring='r2', cv=5) #verbose: level of output details (0, 1 or 2)
sfs.fit(X_train, y_train)

print('Number of selected features:', len(sfs.k_feature_names_))
print('Selected features:', sfs.k_feature_names_)
print('Cross validation average score:', sfs.k_score_)

Number of selected features: 11
Selected features: ('fixed acidity', 'volatile acidity', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'red_wine')
Cross validation average score: 0.2983779201525915


In [18]:
# Backward feature selection (the lazy way) using mlxtend

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

lreg = LinearRegression()
sfs = SFS(lreg, k_features='best', forward=False, verbose=0, scoring='r2', cv=5) #verbose: level of output details (0, 1 or 2)
sfs.fit(X_train, y_train)

print('Number of selected features:', len(sfs.k_feature_names_))
print('Selected features:', sfs.k_feature_names_)
print('Cross validation average score:', sfs.k_score_)

Number of selected features: 11
Selected features: ('fixed acidity', 'volatile acidity', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'red_wine')
Cross validation average score: 0.2983779201525915


In [19]:
# FORWARD feature selection (the lazy way) with sklearn

from sklearn.feature_selection import SequentialFeatureSelector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

lreg = LinearRegression()
# bind the selector to its own name so the imported class is not overwritten
sfs_forward = SequentialFeatureSelector(lreg, n_features_to_select='auto', tol=0.001, direction='forward', scoring='r2', cv=5)
sfs_forward.fit(X_train, y_train)

forward = sfs_forward.get_feature_names_out()
print('Number of selected features:', len(forward))
print('Selected features:', forward)

Number of selected features: 8
Selected features: ['volatile acidity' 'residual sugar' 'free sulfur dioxide'
 'total sulfur dioxide' 'density' 'sulphates' 'alcohol' 'red_wine']


In [20]:
# BACKWARD feature selection (the lazy way) with sklearn

from sklearn.feature_selection import SequentialFeatureSelector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

lreg = LinearRegression()
# bind the selector to its own name so the imported class is not overwritten
sfs_backward = SequentialFeatureSelector(lreg, n_features_to_select='auto', tol=0.001, direction='backward', scoring='r2', cv=5)
sfs_backward.fit(X_train, y_train)

backward = sfs_backward.get_feature_names_out()
print('Number of selected features:', len(backward))
print('Selected features:', backward)

Number of selected features: 11
Selected features: ['fixed acidity' 'volatile acidity' 'residual sugar' 'chlorides'
 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH' 'sulphates'
 'alcohol' 'red_wine']


In [21]:
list(set(backward) - set(forward))

['pH', 'chlorides', 'fixed acidity']



---




> > > > > > > > > © 2022 Institute of Data


---



