# Lab 5
## Feature Selection
Enda McCarthy - 19159269

October 2019

Lab 5 builds on top of Lab 4 by introducing feature selection into the process of training and comparing predictive models (i.e. classification and numeric prediction models). 

In general, the more features/attributes a dataset has (with a fixed number of examples), the more difficult it might be to train an accurate predictive model. For datasets with a large number of features, it is almost always necessary to select the most important features and train the model only on them.

The goal of Lab 5 is to understand how to evaluate a model trained with a subset of features without overestimating its accuracy. It also introduces scikit-learn model-training pipelines and implements feature-selection methods within such pipelines.

## 1 - Preparation 

### a) Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statistics

from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler, RobustScaler, MinMaxScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression, SelectPercentile, RFE

sns.set(style="darkgrid")
%matplotlib inline

### b) Load and prepare dataset

In [None]:
df = pd.read_csv("../input/winequality_red.csv")
df.head()

In [None]:
df.describe()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x="quality", data=df, palette="muted")

From this we can see the following:
- there are 1599 instances in the dataset
- all of the attributes are numeric
- the `quality` attribute is a whole number score between 0-10
- all of the instances have scores between 3-8, with the majority between 5-6
- there are quite a few attributes here, so we can use feature selection to find the most relevant ones for predicting `quality`

Check for any missing values:

In [None]:
df.isnull().sum()

Looks good, no missing values.

Next we can define our target attribute and set the rest to our predictors.

In [None]:
# target attribute
target_attribute_name = 'quality'
target = df[target_attribute_name]

# predictor attributes
predictors = df.drop(target_attribute_name, axis=1).values

# predictor attributes names
predictors_col_names = list(df.drop(target_attribute_name, axis=1).columns)

We then need to convert the `quality` attribute from numerical to categorical. We can do this using scikit-learn's `LabelEncoder`, which maps each distinct score to an integer class label.

In [None]:
labelencoder = LabelEncoder()
target = labelencoder.fit_transform(target)

Now our target has 6 categories, one for each score between 3-8 (there are no scores outside this range).
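
As a small illustration of what the encoder does, the sketch below runs `LabelEncoder` on a handful of made-up scores (the `scores` array is hypothetical, not taken from the dataset). It shows the score-to-label mapping and how to get the original scores back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# toy quality scores covering the range observed in the dataset (3-8)
scores = np.array([5, 3, 8, 6, 4, 7, 5, 6])

encoder = LabelEncoder()
encoded = encoder.fit_transform(scores)

# classes_ holds the original values in sorted order; the position is the label
mapping = {int(original): label for label, original in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original scores from the encoded labels
recovered = encoder.inverse_transform(encoded)
print(recovered)
```

Keeping the fitted encoder around is useful later, since predictions come back as labels 0-5 and can be converted to scores with `inverse_transform`.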

__Note:__ If we wanted, we could abstract our quality scores into fewer categories (good, okay, poor) to make the final predictions a bit easier.
The following code would replace the two code blocks above this:

In [None]:
#        quality = df["quality"].values
#        category = []
#        for number in quality:
#            if number > 6:
#                category.append("Good")
#            elif number > 3:
#                category.append("Okay")
#            else:
#                category.append("Poor")
#        category = pd.DataFrame(data=category, columns=["category"])
#        data = pd.concat([df, category], axis=1)
#        data.drop(columns="quality", axis=1, inplace=True)
#
#        # target attribute
#        target_attribute_name = 'category'
#        target = data[target_attribute_name]
#
#        # predictor attributes
#        predictors = data.drop(target_attribute_name, axis=1).values
#
#        labelencoder = LabelEncoder()
#        target = labelencoder.fit_transform(target)

Now we split the data set into training (80%) and test (20%) datasets.

In [None]:
# prepare independent stratified data sets for training and test of the final model
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, 
                                                                                target, 
                                                                                test_size=0.20, 
                                                                                shuffle=True, 
                                                                                stratify=target)

Scale all predictor values to the range [0, 1]. 

The target attribute is now treated as a categorical attribute so we do not need to scale it.

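Note that the MinMaxScaler is fitted on the training dataset only, and the fitted scaler is then applied to the testing dataset. 
This ensures that the scaling parameters (the minimum and maximum of each attribute) are learned without any information from the testing dataset, and that both datasets are transformed in exactly the same way.

In [None]:
min_max_scaler = MinMaxScaler()
# learn the min/max of each predictor from the training data only
predictors_train = min_max_scaler.fit_transform(predictors_train)
# reuse the training min/max on the test data (no refitting)
predictors_test = min_max_scaler.transform(predictors_test)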
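
To make the scaling step concrete, here is a minimal sketch on made-up one-column data (`X_train` and `X_test` are hypothetical). It fits the scaler on the training values and reuses the learned minimum and maximum for the test values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data: one predictor, three training rows and two test rows
X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[2.0], [6.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=1 and max=5
X_test_scaled = scaler.transform(X_test)        # reuses the training min and max

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

A test value larger than the training maximum is mapped above 1, which is expected: the test data is described relative to what was seen during training.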

## 2 - Feature Selection

#### a) Apply RFE with SVM for selecting the best features

In [None]:
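# create a base estimator used to evaluate a subset of attributes
# (a linear SVR is used here; RFE needs an estimator that exposes coef_ or feature_importances_)
estimatorSVM = svm.SVR(kernel="linear")
# create the RFE model and select 3 relevant attributes
selectorSVM = RFE(estimatorSVM, n_features_to_select=3)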
selectorSVM = selectorSVM.fit(predictors_train, target_train)
# summarize the selection of the attributes
print(selectorSVM.support_)
print(selectorSVM.ranking_)

We can see that the 3 selected attributes are marked True and are all ranked 1, while the remaining attributes are marked False and are ranked from 2 upwards (the higher the rank, the earlier the attribute was eliminated).
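
The meaning of `support_` and `ranking_` is easier to see on data where the answer is known. The sketch below is an assumption-laden toy example: it builds synthetic data in which only the first two columns determine the label, and uses Logistic Regression as the base estimator.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 6))
# only the first two columns determine the label; the rest are noise
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)

selector = RFE(LogisticRegression(solver="lbfgs"), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 for kept features, larger numbers for eliminated ones
```

RFE drops the weakest feature, refits, and repeats until the requested number of features is left, which is why every eliminated feature gets its own rank.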

#### b) Apply RFE with Logistic Regression for selecting the best features

In [None]:
# create a base classifier used to evaluate a subset of attributes
estimatorLR = LogisticRegression(solver='lbfgs', multi_class='auto')
# create the RFE model and select 3 relevant attributes
selectorLR = RFE(estimatorLR, n_features_to_select=3)
selectorLR = selectorLR.fit(predictors_train, target_train)
# summarize the selection of the attributes
print(selectorLR.support_)
print(selectorLR.ranking_)

## 3 - Evaluate on the Test Dataset

Apply the selectors to prepare training data sets with only the selected features.

__Note:__ The same fitted selectors are applied to the test data set. However, it is important that the selectors were fitted on the training data only, so that the test data set stays invisible to them.

In [None]:
# select only relevant features from training and test separately - SVM
predictors_train_SVMselected = selectorSVM.transform(predictors_train)
predictors_test_SVMselected = selectorSVM.transform(predictors_test)

# select only relevant features from training and test separately - LR
predictors_train_LRselected = selectorLR.transform(predictors_train)
predictors_test_LRselected = selectorLR.transform(predictors_test)

#### Train and evaluate SVM classifiers with both the selected features and all features 

Here we train three models:
* model1 - with the features selected by SVM
* model2 - with the features selected by Logistic Regression
* model3 - with all features (i.e. without feature selection)

We use an SVM classifier for all 3 models. 

In other words, the same SVM classifier is trained on three versions of the training data: one for each set of features chosen by an RFE selector, and one with all features.

In [None]:
# create SVM classifier
classifier = svm.SVC(gamma='scale')

In [None]:
# run final classifier with only features selected using RFE with SVM
model1 = classifier.fit(predictors_train_SVMselected, target_train)
accuracy1 = model1.score(predictors_test_SVMselected, target_test)

In [None]:
# run final classifier with only features selected using RFE with LR
model2 = classifier.fit(predictors_train_LRselected, target_train)
accuracy2 = model2.score(predictors_test_LRselected, target_test)

In [None]:
# run final classifier with all the features
model3 = classifier.fit(predictors_train, target_train)
accuracy3 = model3.score(predictors_test, target_test)

In [None]:
print("Model 1 Accuracy = %.4f" % (accuracy1))
print("Model 2 Accuracy = %.4f" % (accuracy2))
print("Model 3 Accuracy = %.4f" % (accuracy3))
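
One detail worth knowing about the cells above: `fit` returns the estimator itself, so `model1`, `model2` and `model3` all refer to the same `classifier` object, refitted each time. The accuracies are still valid because each score is computed right after its own fit. If the fitted models need to be kept separately, `sklearn.base.clone` is one option, as in this sketch on hypothetical toy data.

```python
import numpy as np
from sklearn import svm
from sklearn.base import clone

# toy data, only used to have something to fit on
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])

classifier = svm.SVC(gamma="scale")

# fit() returns self, so both names point at the same (refitted) object
fitted_a = classifier.fit(X, y)
fitted_b = classifier.fit(X, y)
same_object = fitted_a is fitted_b

# clone() creates a fresh, unfitted copy with the same hyperparameters
independent_a = clone(classifier).fit(X, y)
independent_b = clone(classifier).fit(X, y)
distinct = independent_a is not independent_b

print(same_object, distinct)
```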

### Conclusion

The results above do not vary much, and because the train/test split is random, a different model is likely to come out as the most accurate each time we run it.

A single split is therefore not enough to rank the models. It is better to run the whole experiment multiple times, measure the __variance__ in the results, and then pick the model that gives the best results consistently.

## 4 - Iteration

We will use iteration to repeat the experiment multiple times, each time with a different percentage of random test data selected from the dataset. The splits we will use are:
- 15%
- 20%
- 25%

In [None]:
# create base classifiers
estimatorSVM = svm.SVR(kernel="linear")
estimatorLR = LogisticRegression(solver='lbfgs', multi_class='auto')

# create RFE model for both classifiers to find 3 best features
selectorSVM = RFE(estimatorSVM, n_features_to_select=3)
selectorLR = RFE(estimatorLR, n_features_to_select=3)

# create SVM classifier for final evaluation
classifier = svm.SVC(gamma='scale')

# store the results from loop in a dataframe 
results_df = pd.DataFrame(columns=('score', 'split', 'model'))

# create list of different % test splits (15%, 20% and 25%)
test_sizes = [0.15, 0.20, 0.25]

# list of model numbers
model = [1, 2, 3]

# counter
row = 0

for i in range(len(test_sizes)):
    for j in range(20):
        predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, 
                                                                                        target, 
                                                                                        test_size=test_sizes[i], 
                                                                                        shuffle=True, 
                                                                                        stratify=target)

        # scale predictors (fit on the training data only, then apply to the test data)
        predictors_train = min_max_scaler.fit_transform(predictors_train)
        predictors_test = min_max_scaler.transform(predictors_test)

        # use RFE models on data to identify best features
        selectorSVM = selectorSVM.fit(predictors_train, target_train)
        selectorLR = selectorLR.fit(predictors_train, target_train)

        # select only relevant features from training and test separately - SVM
        predictors_train_SVMselected = selectorSVM.transform(predictors_train)
        predictors_test_SVMselected = selectorSVM.transform(predictors_test)

        # select only relevant features from training and test separately - LR
        predictors_train_LRselected = selectorLR.transform(predictors_train)
        predictors_test_LRselected = selectorLR.transform(predictors_test)

        # run final classifier with only the features selected using RFE with SVM
        model1 = classifier.fit(predictors_train_SVMselected, target_train)
        model1_score = model1.score(predictors_test_SVMselected, target_test)
        results_df.loc[row] = [model1_score, (test_sizes[i]*100), model[0]]
        row+=1

        # run final classifier with only the features selected using RFE with LR
        model2 = classifier.fit(predictors_train_LRselected, target_train)
        model2_score = model2.score(predictors_test_LRselected, target_test)
        results_df.loc[row] = [model2_score, (test_sizes[i]*100), model[1]]
        row+=1

        # run final classifier with all features
        model3 = classifier.fit(predictors_train, target_train)
        model3_score = model3.score(predictors_test, target_test)
        results_df.loc[row] = [model3_score, (test_sizes[i]*100), model[2]]
        row+=1
            

Now we can boxplot the 3 models for each test split and compare them.

In [None]:
plt.figure(figsize=(14, 8))
sns.boxplot(x="split", y="score", hue="model", data=results_df, width = .4, palette="Set3")

We can also look at the variance of the scores for each combination of model and test split.

In [None]:
variance_df = pd.DataFrame(index=[0], columns=('0.15 split, model 1', '0.15 split, model 2', '0.15 split, model 3',
                                               '0.20 split, model 1', '0.20 split, model 2', '0.20 split, model 3',
                                               '0.25 split, model 1', '0.25 split, model 2', '0.25 split, model 3'))

for i in test_sizes:
    for j in model:
        variance = np.var(list(results_df.score[(results_df['model'] == j) & (results_df['split'] == (i*100))]))
        variance_df.at[0, '%.2f split, model %d' % (i,j)] = variance

In [None]:
plt.figure(figsize=(14, 8))
ax = sns.barplot(data=variance_df, palette="Set3")
for item in ax.get_xticklabels():
    item.set_rotation(60)

From examining the above information we can conclude that:
- all 9 models have similar median accuracies
- model 1 with a split of 25% test data has the lowest variance
- model 3 with a split of 20% test data has the highest variance
- __model 1 (features selected using SVM) with a split of 25%__ is the best performing

However, the splits are random, so even though we repeated each configuration 20 times, running the whole experiment again may lead to different conclusions. 

In practice, this means that there is very little difference between training on the best 3 attributes (selected using either SVM or Logistic Regression) and training on all the attributes when building a classification model for this dataset.
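
The per-combination statistics can also be summarised in one table with a pandas `groupby`. The sketch below uses a small made-up frame (`toy_results`, with invented scores) that has the same columns as `results_df`. It assumes population variance (`ddof=0`), which is what `np.var` computes by default.

```python
import pandas as pd

# hypothetical scores with the same columns as results_df
toy_results = pd.DataFrame({
    "score": [0.60, 0.62, 0.64, 0.59, 0.60, 0.61],
    "split": [20.0, 20.0, 20.0, 25.0, 25.0, 25.0],
    "model": [1, 1, 1, 1, 1, 1],
})

grouped = toy_results.groupby(["split", "model"])["score"]
summary = pd.DataFrame({
    "mean": grouped.mean(),
    "variance": grouped.var(ddof=0),  # population variance, like np.var
})
print(summary)
```

Looking at the mean and the variance side by side makes it easier to see whether a low variance comes with a competitive accuracy.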

In [None]:
accuracy1_25 = statistics.mean(list(results_df.score[(results_df['model'] == 1) & (results_df['split'] == 25)]))
print("Model 1 with 0.25 Split Accuracy = %.4f" % (accuracy1_25))

This performs about the same as Model 1 did earlier with a single 20% split.

## 5 - Pipelines

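We can use pipelines to search for the best combination of scaler, feature selector and learning algorithm for our model. Because every step of a pipeline is fitted only on the training portion of each cross-validation fold, the feature selection never sees the data used for validation, which helps us avoid overestimating the accuracy.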
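
As a rough sketch of the idea, the example below cross-validates a three-step pipeline on synthetic data. The data, the step names and the use of `f_classif` as the scoring function are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 8))
# only columns 0 and 3 carry signal in this synthetic example
y = (X[:, 0] - X[:, 3] > 0).astype(int)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),               # refitted on each training fold
    ("select", SelectKBest(f_classif, k=2)),  # refitted on each training fold
    ("clf", SVC(gamma="scale")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print(scores.mean())
```

Passing the whole pipeline to `cross_val_score` is what keeps the held-out fold out of the scaling and selection steps. Selecting features on the full data first and cross-validating afterwards would not give that guarantee.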

We will set up the data from the beginning.

In [None]:
# target attribute
target_attribute_name = 'quality'
target = df[target_attribute_name]
labelencoder = LabelEncoder()
target = labelencoder.fit_transform(target)

# predictor attributes
predictors = df.drop(target_attribute_name, axis=1).values

# predictor attributes names
predictors_col_names = list(df.drop(target_attribute_name, axis=1).columns)

In [None]:
# prepare independent stratified data sets for training and test of the final model
X_train, X_test, y_train, y_test = train_test_split(predictors, 
                                                    target, 
                                                    test_size=0.25, 
                                                    shuffle=True, 
                                                    stratify=target)

In [None]:
# set up pipeline
#   step 1 - scale the data
#   step 2 - select the best features to use (reduce dimensionality)
#   step 3 - use a learning algorithm on the selected features
# note: we will add more options to each step in the next block
pipe = Pipeline([('scaler', MinMaxScaler()),
                 ('reduce_dim', SelectPercentile(f_regression)),
                 ('regressor', svm.SVC(gamma='scale'))])

In [None]:
# as stated above we can add more options for each step
# first we add options for scaling
scalers_to_test = [StandardScaler(), RobustScaler(), MinMaxScaler()]
# next we add options for learning algorithms
regressors_to_test = [svm.SVC(gamma='scale'), LogisticRegression(solver='lbfgs', multi_class='auto')]
# then we can vary the number of selected features from 1-11 for each variation
n_features_to_test = np.arange(1, 12)

In [None]:
# params will be passed in alongside our pipeline
# we have two sets of params here, each with a different method of selecting features:
#   first we use the SelectPercentile method (percentile is the PERCENTAGE of attributes to keep,
#   so the values 1-11 used here keep only the top 1%-11% of attributes, not 1-11 attributes)
#   then we use the SelectKBest method (where k is the number of attributes to keep)
params = [{'scaler': scalers_to_test,
           'reduce_dim': [SelectPercentile(f_regression)],
           'reduce_dim__percentile': n_features_to_test,
           'regressor': regressors_to_test},

          {'scaler': scalers_to_test,
           'reduce_dim': [SelectKBest(f_regression)],
           'reduce_dim__k': n_features_to_test,
           'regressor': regressors_to_test}]
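
The difference between the two selectors is easy to miss, so here is a minimal sketch on synthetic data with 11 predictors (the data and the linear target are made up for illustration). `SelectPercentile` takes a percentage of the features, while `SelectKBest` takes a count.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 11))                   # 11 predictors, like the wine data
y = X @ np.arange(1, 12) + rng.normal(size=100)  # synthetic numeric target

# percentile=5 keeps roughly the top 5% of 11 features
n_kept_pct_5 = int(SelectPercentile(f_regression, percentile=5).fit(X, y).get_support().sum())
# percentile=100 keeps every feature
n_kept_pct_100 = int(SelectPercentile(f_regression, percentile=100).fit(X, y).get_support().sum())
# k=5 keeps exactly five features
n_kept_k_5 = int(SelectKBest(f_regression, k=5).fit(X, y).get_support().sum())

print(n_kept_pct_5, n_kept_pct_100, n_kept_k_5)
```

To sweep `SelectPercentile` over a comparable range of feature counts, the percentile values would need to span something like 10 to 100 rather than 1 to 11.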

In [None]:
# then we can train our final model using the pipeline and params (5-fold cross-validation is used)
# note: the iid argument was removed in scikit-learn 0.24, so it is not passed here
gridsearch = GridSearchCV(pipe, params, cv=5, verbose=1).fit(X_train, y_train)
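
To get a feel for how much work this grid search does, the sketch below counts the candidate combinations with `ParameterGrid`. Placeholder strings stand in for the scaler, selector and estimator objects (an assumption made to keep the example light), but the grid has the same shape as `params`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# placeholder strings stand in for the estimator objects used in the notebook
scalers = ["standard", "robust", "minmax"]
learners = ["svc", "logreg"]
n_features = np.arange(1, 12)

grid = [
    {"scaler": scalers, "reduce_dim": ["percentile"],
     "reduce_dim__percentile": n_features, "regressor": learners},
    {"scaler": scalers, "reduce_dim": ["kbest"],
     "reduce_dim__k": n_features, "regressor": learners},
]

n_candidates = len(ParameterGrid(grid))  # 2 grids x 3 scalers x 11 values x 2 learners
n_fits = n_candidates * 5                # each candidate is fitted once per CV fold
print(n_candidates, n_fits)
```

Keys of the form `step__parameter` (for example `reduce_dim__k`) set a parameter on the named pipeline step, while a bare step name (for example `scaler`) swaps the whole step.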

In [None]:
# we can see which params were chosen as the best performing
gridsearch.best_params_

In [None]:
# we can see which features were selected from the pipeline
final_pipeline = gridsearch.best_estimator_
final_classifier = final_pipeline.named_steps['regressor']
# boolean mask marking the features kept by the selection step
mask = final_pipeline.named_steps['reduce_dim'].get_support()
feature_names = df.drop(target_attribute_name, axis=1).columns
selected_features = feature_names[mask].tolist()
selected_features

In [None]:
# and finally we can see the accuracy of the final model
print('Final score is: ', gridsearch.score(X_test, y_test))

We can see the following from these results:
- best performing scaler is StandardScaler
- best performing feature selector is SelectKBest
- best performing learning algorithm is SVC
- ideal number of features for learning is 11 (i.e. all of them, so no attribute is discarded)

This gives us an accuracy of 58% on the test dataset for our final model.