# 0. Introduction

In this notebook, we will explore various feature selection and dimensionality reduction techniques using the [Wisconsin breast cancer dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) on Kaggle. This dataset contains 569 breast cancer observations, of which 357 are benign and 212 are malignant. The goal is to train a machine learning model that can classify a given observation as either benign or malignant. I have chosen the random forest classifier for this particular problem, but feel free to try out other classification models of your choice.

The techniques covered in this notebook are as follows:

- Variance inflation factors (VIF)
- Univariate feature selection
- Recursive feature elimination
- Model-based feature selection
- Principal component analysis (PCA)

We will compare the effectiveness of each technique by examining how accurately our model makes predictions on a held-out test set. More specifically, we will report the accuracy and F1 score together with the confusion matrix, which is a common way to assess the performance of a binary classifier.
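
As a rough illustration of what the confusion matrix tells us, the sketch below counts its four cells for a handful of made-up labels (they are not taken from the dataset) and derives accuracy, precision, recall and F1 from them. It assumes the encoding used later in this notebook, 0 for benign and 1 for malignant.

```python
import numpy as np

# Hypothetical labels for illustration only (0 = benign, 1 = malignant)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# The four cells of a binary confusion matrix
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # malignant, predicted malignant
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # benign, predicted benign
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # benign, predicted malignant
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # malignant, predicted benign

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy: {:.2f}, precision: {:.2f}, recall: {:.2f}, F1: {:.2f}".format(
    accuracy, precision, recall, f1))
```

In a medical setting the false negatives (malignant cases predicted as benign) are usually the costliest cell, which is one reason to look at the full matrix rather than accuracy alone.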

I will include some additional resources at the end of this notebook to help you learn more about the concepts that are discussed in this notebook as well as links to my platforms and the other projects that I am currently working on. I hope you will gain some value out of this notebook.

Cheers\
Jason

# 1. Import libraries

In [None]:
# Data wrangling
import pandas as pd
import numpy as np

# Data visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, RFE, RFECV
from sklearn.decomposition import PCA

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

# 2. Import and read data

In [None]:
data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
data.head()

In [None]:
print("Shape of dataframe: ", data.shape)

# 3. Check for missing values

In [None]:
# Missing data

missing = data.isnull().sum()
missing[missing > 0]

`Unnamed: 32` is the only column with missing values in the dataframe. In fact, every value in the column is missing, so it is safe for us to drop it.

I will also remove the id column as it does not provide us with any information regarding the classification of cancer cells. 

In [None]:
# Drop ID and Unnamed columns

print("Before: ", data.shape)
data = data.drop(['id', 'Unnamed: 32'], axis = 1)
print("After: ", data.shape)

# 4. Data description

In [None]:
data.dtypes.value_counts()

We have 30 numerical variables and only 1 categorical variable, which is our target variable (diagnosis). Let's have a look at these features in detail.

In [None]:
data.dtypes

According to the [data description](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) of this dataset, the columns represent 10 real-valued features of each cell nucleus:

1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness 
6. Compactness
7. Concavity
8. Concave points
9. Symmetry
10. Fractal dimension

The mean, standard error and worst of each feature were also computed, resulting in a total of 10 x 3 = 30 features (columns) in the dataset excluding the target variable.

Just from the names of the features, we can already foresee an issue of multicollinearity, the most obvious case being radius, perimeter and area. But before we deal with multicollinearity, let's first standardise our data and divide the features into 3 groups:

- feature_mean
- feature_se
- feature_worst
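
To make the standardisation step concrete, here is a minimal sketch on a synthetic, right-skewed feature (the values are generated, not taken from the dataset). It shows that a z-score transform gives the feature zero mean and unit standard deviation, while the shape of the distribution, measured here by skewness, stays exactly the same. The helper `skewness` is written for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature (synthetic, not from the dataset)
x = rng.exponential(scale=2.0, size=1000)

# Z-score using the sample standard deviation (ddof=1), matching pandas' default .std()
z = (x - x.mean()) / x.std(ddof=1)

def skewness(v):
    """Sample skewness: the third standardised moment."""
    c = v - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5

print("Mean after:", round(float(z.mean()), 6))
print("Std after:", round(float(z.std(ddof=1)), 6))
print("Skewness before:", skewness(x))
print("Skewness after:", skewness(z))
```

So standardisation puts all features on a comparable scale, which matters for PCA later on, but it does not turn a skewed feature into a Gaussian one.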

In [None]:
# Standardise all features to zero mean and unit variance (z-scores); this rescales the data but does not make it Gaussian

original_features = data.drop('diagnosis', axis = 1)
standard_features = (original_features - original_features.mean()) / original_features.std()
standard_data = pd.concat([data['diagnosis'], standard_features], axis = 1)

In [None]:
# Divide the standardised features into 3 groups 

feature_mean = standard_data.iloc[:, 1:11]
feature_se = standard_data.iloc[:, 11: 21]
feature_worst = standard_data.iloc[:, 21:31]

Now, we can move on to exploring the features in the dataset. 

# 5. Exploratory data analysis (EDA)

In [None]:
standard_data.head()

# 5.1 Target variable

In [None]:
# Encode target variable

data['diagnosis'] = data['diagnosis'].map({'B': 0, 'M': 1})
standard_data['diagnosis'] = standard_data['diagnosis'].map({'B': 0, 'M': 1})

In [None]:
# Value counts

target = data['diagnosis']
target.value_counts()

There are more benign observations than malignant ones.

In [None]:
total = len(data)
plt.figure(figsize = (6, 6))
plt.title('Diagnosis Value Counts')
ax = sns.countplot(x = target)
for p in ax.patches:
    percentage = '{:.0f}%'.format(p.get_height() / total * 100)
    x = p.get_x() + p.get_width() / 2
    y = p.get_height() + 5
    ax.annotate(percentage, (x, y), ha = 'center')
plt.show()

# 5.2 Predictor variables

## 5.2.1 Issue of multicollinearity

In [None]:
# Heatmap

correlation = feature_mean.corr()
plt.figure(figsize = (10, 8))
plt.title('Correlation Between Predictor Variables')
sns.heatmap(correlation, annot = True, fmt = '.2f', cmap = 'coolwarm')

As expected, we have a severe problem of multicollinearity in our data. From the heatmap, we can observe that the following features are positively correlated with each other:

- Radius
- Perimeter
- Area
- Compactness
- Concavity
- Concave points
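
Reading correlated pairs off a heatmap gets tedious with many features, so here is one possible way to list them programmatically. The sketch uses a synthetic table (`toy`, with made-up values) in which perimeter and area are derived from radius, and an arbitrary cut-off of 0.9 for the absolute correlation; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for cell measurements (hypothetical values, not the real data)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius + rng.normal(0, 1, size=200),
    'area': np.pi * radius ** 2 + rng.normal(0, 20, size=200),
    'texture': rng.normal(20, 4, size=200),  # unrelated to radius by construction
})

corr = toy.corr()
# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > 0.9].sort_values(ascending=False)
print(high_pairs)
```

The same few lines should work on `feature_mean.corr()` from the cell above, with the threshold adjusted to taste.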

In [None]:
# Pairplot between correlated features

sns.pairplot(feature_mean[['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean']])

## 5.2.2 Explore the relationship between predictor variables and target variable

In this section, we will visualise the relationship between our predictor variables and the target variable. The goal here is to investigate and determine the features that are most important at distinguishing whether a cancer cell is benign or malignant.

In [None]:
feature_mean = pd.concat([target, feature_mean], axis = 1)
feature_mean.head()

In [None]:
mean_melt = pd.melt(feature_mean, id_vars = 'diagnosis', var_name = 'feature', value_name = 'value')
mean_melt.head()

In [None]:
# Violinplot

plt.figure(figsize = (12, 8))
sns.violinplot(x = 'feature', y = 'value', hue = 'diagnosis', data = mean_melt, split = True, inner = 'quart')
plt.legend(loc = 2)
plt.xticks(rotation = 90)

Besides fractal dimension, all the features look promising at classifying cancer cells.

We can see that cancer cells that are malignant tend to have higher values in all of the features.

In [None]:
# Boxplot

plt.figure(figsize = (12, 8))
sns.boxplot(x = 'feature', y = 'value', hue = 'diagnosis', data = mean_melt)
plt.xticks(rotation = 90)

Again, this plot illustrates that fractal dimension is not as good at classifying cancer cells as the other features in the dataset.

The boxplot also lets us spot outliers in our dataset, but let's set that problem aside for now.

# 6. Feature selection

Now that we have a better sense of our data, we can move on to selecting features for our training set to build our model.

Before we do that, let's set a base case for our feature selection i.e. use all the features in the dataset to train our model.

In [None]:
data.head()

Here, I will assign 70% of the dataset as the training set and the remaining 30% as the test set to test the accuracy of our model.
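
To see what a 70/30 split means in terms of row counts, the sketch below splits stand-in arrays that have the same shape and class counts as the dataset (the feature values are just zeros). It also passes `stratify`, which the split in this notebook does not use, to show how the class balance can be preserved in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the dataset's shape and class counts (feature values are zeros)
X_demo = np.zeros((569, 30))
y_demo = np.array([0] * 357 + [1] * 212)  # 0 = benign, 1 = malignant

# stratify is an addition for illustration; it keeps class proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, stratify=y_demo
)

print("Train size:", len(y_tr), "Test size:", len(y_te))
print("Malignant share overall: {:.3f}".format(y_demo.mean()))
print("Malignant share in train: {:.3f}".format(y_tr.mean()))
print("Malignant share in test: {:.3f}".format(y_te.mean()))
```

With 569 rows the test set ends up with 171 observations, so each test observation is worth roughly 0.58 percentage points of accuracy. That is worth keeping in mind when comparing scores that differ by one or two points.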

In [None]:
# Train test split 

X_train, X_test, Y_train, Y_test = train_test_split(original_features, target, test_size = 0.3, random_state = 10)
print("X_train shape: ", X_train.shape)
print("Y_train shape: ", Y_train.shape)
print("X_test shape: ", X_test.shape)
print("Y_test shape: ", Y_test.shape)

# 6.1 Base case

In [None]:
# Fit random forest classifier to training set and make predictions on test set

rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)

In [None]:
# Evaluate model accuracy 

accuracy = accuracy_score(Y_test, Y_pred) * 100
print("Accuracy: {:.2f}%".format(accuracy))
f1 = f1_score(Y_test, Y_pred)
print("F1 score: {:.2f}".format(f1))
# scikit-learn metrics expect (y_true, y_pred): rows of cm are true labels, columns are predictions
cm = confusion_matrix(Y_test, Y_pred)
sns.heatmap(cm, annot = True, fmt = 'd')

Our model achieved an accuracy of 98.25%. Not too shabby at all.
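
A score from a single train/test split depends on which rows happen to land in the test set. As a hedged sanity check, the sketch below runs 5-fold cross-validation on the base case. It assumes that scikit-learn's bundled `load_breast_cancer` data can stand in for the Kaggle CSV (as far as I am aware it is the same UCI dataset, although its target is encoded with 0 for malignant and 1 for benign, the opposite of this notebook). The fold settings are my own choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: scikit-learn's bundled copy of the dataset stands in for the Kaggle CSV
X_cv, y_cv = load_breast_cancer(return_X_y=True)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
scores = cross_val_score(
    RandomForestClassifier(random_state=42), X_cv, y_cv, cv=cv, scoring='accuracy'
)

print("Fold accuracies:", scores.round(4))
print("Mean: {:.4f}, std: {:.4f}".format(scores.mean(), scores.std()))
```

The spread across folds gives a feel for how much of the difference between two techniques could simply be down to the split.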

Let's now explore the different feature selection and dimensionality reduction techniques and see if we can achieve a similar level of accuracy with a smaller training set, i.e. fewer features.

# 6.2 Variance inflation factors (VIF)

In [None]:
# Define a function which computes VIF

def calculate_vif(df):
    vif = pd.DataFrame()
    vif['Feature'] = df.columns
    vif['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif

In [None]:
# Construct VIF dataframe

vif_table = calculate_vif(original_features)
vif_table = vif_table.sort_values(by = 'VIF', ascending = False, ignore_index = True)
vif_table

In [None]:
# Top 5 features with highest VIF

features_to_drop = list(vif_table['Feature'])[:5]
features_to_drop

Unsurprisingly, the top 5 features with the highest VIF are the suspects that we have already identified earlier on. 
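
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it with NumPy alone on synthetic data (the columns `a`, `b` and `c` are made up), using the fact that for a regression with an intercept the VIFs sit on the diagonal of the inverse correlation matrix. The helper `vif_from_correlation` is a name chosen for this example. Note that, as far as I understand, statsmodels' `variance_inflation_factor` regresses on the columns exactly as given, so its values can differ from these unless the design matrix includes a constant column.

```python
import numpy as np

def vif_from_correlation(X):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
# Synthetic features (hypothetical): b is nearly a copy of a, c is independent
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
X_toy = np.column_stack([a, b, c])

vif_values = vif_from_correlation(X_toy)
print(dict(zip(['a', 'b', 'c'], vif_values.round(2))))

# Cross-check for column a: regress a on the other columns, with an intercept
others = np.column_stack([np.ones(500), b, c])
coef, *_ = np.linalg.lstsq(others, a, rcond=None)
resid = a - others @ coef
r_squared = 1 - resid.var() / a.var()
print("VIF of a via 1 / (1 - R^2):", round(1 / (1 - r_squared), 2))
```

The near-duplicate columns get a very large VIF while the independent column stays close to 1, which mirrors the pattern we see for radius, perimeter and area.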

In [None]:
# Drop top 5 features with highest VIF

new_features = original_features.drop(features_to_drop, axis = 1)

In [None]:
# Train test split
X_train, X_test, Y_train, Y_test = train_test_split(new_features, target, test_size = 0.3, random_state = 10)

# Fit model to data and make predictions
rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)

# Evaluate model accuracy 
accuracy = accuracy_score(Y_test, Y_pred) * 100
print("Accuracy: {:.2f}%".format(accuracy))
f1 = f1_score(Y_test, Y_pred)
print("F1 score: {:.2f}".format(f1))
# scikit-learn metrics expect (y_true, y_pred): rows of cm are true labels, columns are predictions
cm = confusion_matrix(Y_test, Y_pred)
sns.heatmap(cm, annot = True, fmt = 'd')

After removing the 5 features with the highest VIF, our model accuracy did not decrease; in fact, it increased marginally (98.83% versus 98.25%).

# 6.3 Univariate feature selection

In [None]:
# Train test split

X_train, X_test, Y_train, Y_test = train_test_split(original_features, target, test_size = 0.3, random_state = 10)

In [None]:
# Instantiate select features
select_features = SelectKBest(chi2, k = 5).fit(X_train, Y_train)

# Top 5 features
selected_features = select_features.get_support()
print("Top 5 features: ", list(X_train.columns[selected_features]))

In [None]:
# Apply select features to training and test set
X_train = select_features.transform(X_train)
X_test = select_features.transform(X_test)

# Fit model to data and make predictions
rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)

# Evaluate model accuracy 
accuracy = accuracy_score(Y_test, Y_pred) * 100
print("Accuracy: {:.2f}%".format(accuracy))
f1 = f1_score(Y_test, Y_pred)
print("F1 score: {:.2f}".format(f1))
# scikit-learn metrics expect (y_true, y_pred): rows of cm are true labels, columns are predictions
cm = confusion_matrix(Y_test, Y_pred)
sns.heatmap(cm, annot = True, fmt = 'd')

Wow, this is remarkable!

Despite using only 5 features (1/6 of the original feature set), our model accuracy has gone down by only about 3 percentage points. This suggests that these 5 features contain most of the information our model needs to classify cancer cells accurately.
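
One thing to keep in mind is that the `chi2` score requires non-negative features and depends on their scale, so large-valued features such as area tend to score highly. A possible alternative is the ANOVA F-test (`f_classif`), which does not change when a feature is rescaled. The sketch below compares the two selections. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the Kaggle CSV, so the column names look slightly different (for example `mean radius` instead of `radius_mean`).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the dataset stands in for the Kaggle CSV
bc = load_breast_cancer(as_frame=True)
X_uni, y_uni = bc.data, bc.target
X_tr, X_te, y_tr, y_te = train_test_split(X_uni, y_uni, test_size=0.3, random_state=10)

# Fit each scoring function on the training set only and list the 5 chosen features
for name, score_func in [('chi2', chi2), ('f_classif', f_classif)]:
    selector = SelectKBest(score_func, k=5).fit(X_tr, y_tr)
    chosen = list(X_tr.columns[selector.get_support()])
    print(name, "->", chosen)
```

If the two lists differ, that is a hint that the univariate ranking is sensitive to the scoring function, and it may be worth evaluating both.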

# 6.4 Recursive feature elimination

In [None]:
# Train test split

X_train, X_test, Y_train, Y_test = train_test_split(original_features, target, test_size = 0.3, random_state = 10)

In [None]:
# Instantiate recursive feature elimination to select the top 5 features
rf = RandomForestClassifier(random_state = 42)
rfe = RFE(estimator = rf, n_features_to_select = 5).fit(X_train, Y_train)

# Top 5 features
print("Top 5 features: ", list(X_train.columns[rfe.support_]))

The top 5 features chosen under recursive feature elimination are slightly different to those selected under univariate feature selection.

Let's now test the model accuracy.

In [None]:
# Make predictions on test set
Y_pred = rfe.predict(X_test)

# Evaluate model accuracy 
accuracy = accuracy_score(Y_test, Y_pred) * 100
print("Accuracy: {:.2f}%".format(accuracy))
f1 = f1_score(Y_test, Y_pred)
print("F1 score: {:.2f}".format(f1))
# scikit-learn metrics expect (y_true, y_pred): rows of cm are true labels, columns are predictions
cm = confusion_matrix(Y_test, Y_pred)
sns.heatmap(cm, annot = True, fmt = 'd')

Again, we achieved an accuracy of just under 96%, similar to that under univariate feature selection.
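
Choosing 5 features was an arbitrary decision on my part. `RFECV` (imported at the top but not used so far) lets cross-validation pick the number of features instead. The sketch below is one way to set it up. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the Kaggle CSV, and uses a smaller forest, a step of 5 and 3 folds purely to keep the run time short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the dataset stands in for the Kaggle CSV
bc = load_breast_cancer(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(bc.data, bc.target, test_size=0.3, random_state=10)

# Smaller forest, coarser step and 3 folds keep the run time short (illustrative settings)
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=5, cv=3, scoring='accuracy', min_features_to_select=5,
)
rfecv.fit(X_tr, y_tr)

print("Number of features chosen:", rfecv.n_features_)
print("Selected:", list(X_tr.columns[rfecv.support_]))
print("Test accuracy: {:.4f}".format(rfecv.score(X_te, y_te)))
```

With `step = 5` only feature counts of 30, 25, 20, 15, 10 and 5 are tried; a step of 1 searches more finely at the cost of many more model fits.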

# 6.5 Model-based feature selection

In [None]:
# Train test split

X_train, X_test, Y_train, Y_test = train_test_split(original_features, target, test_size = 0.3, random_state = 10)

In [None]:
# Fit model to data

rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train, Y_train)
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis = 0)
indices = np.argsort(importances)[::-1]

In [None]:
# Evaluate feature importances

print("Feature ranking: ")
for f in range(X_train.shape[1]):
    print("%d. %s (%f)" % (f + 1, X_train.columns[indices[f]], importances[indices[f]]))

In [None]:
# Plot feature importances

plt.figure(figsize = (15, 8))
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
        color = "r", yerr = std[indices], align="center")
plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation = 90)
plt.xlim([-1, X_train.shape[1]])
plt.show()

In [None]:
# Keep the 9 most important features (those with importances above 5% in this run)

new_features = original_features.iloc[:, list(indices)[:9]]
print("Number of features above 5%: ", len(new_features.columns))
list(new_features.columns) 

This part is slightly subjective but I decided to only retain features that have importances over 5% as the training set.
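
Rather than counting the bars by hand, the 5% cut-off can also be applied directly with scikit-learn's `SelectFromModel`. This is a sketch of that alternative, not what the cells in this notebook do. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the Kaggle CSV, so the number of features kept may not match the 9 found above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the dataset stands in for the Kaggle CSV
bc = load_breast_cancer(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(bc.data, bc.target, test_size=0.3, random_state=10)

# Keep every feature whose random forest importance is at least 0.05
sfm = SelectFromModel(RandomForestClassifier(random_state=42), threshold=0.05)
sfm.fit(X_tr, y_tr)

kept = list(X_tr.columns[sfm.get_support()])
print("Features with importance >= 0.05:", len(kept))
print(kept)
```

Because the selector is fitted on the training set only, `sfm.transform` can then be applied to both the training and test sets without the test data influencing which features are kept.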

In [None]:
# Train test split
X_train, X_test, Y_train, Y_test = train_test_split(new_features, target, test_size = 0.3, random_state = 10)

# Fit model to data and make predictions
rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)

# Evaluate model accuracy 
accuracy = accuracy_score(Y_test, Y_pred) * 100
print("Accuracy: {:.2f}%".format(accuracy))
f1 = f1_score(Y_test, Y_pred)
print("F1 score: {:.2f}".format(f1))
# scikit-learn metrics expect (y_true, y_pred): rows of cm are true labels, columns are predictions
cm = confusion_matrix(Y_test, Y_pred)
sns.heatmap(cm, annot = True, fmt = 'd')

With only 9 features, our model accuracy (97.08%) came very close to the base case of 98.25%.

# 6.6 Principal component analysis (PCA)

In [None]:
# Train test split using standardised features

X_train, X_test, Y_train, Y_test = train_test_split(standard_features, target, test_size = 0.3, random_state = 10)

In [None]:
# Instantiate and fit PCA to training set

pca = PCA()
pca.fit(X_train)

In [None]:
# Visualise explained variance ratio to the number of components

plt.figure(figsize = (10, 6))
plt.clf()
plt.axes([.2, .2, .7, .7])
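component_numbers = np.arange(1, len(pca.explained_variance_ratio_) + 1)  # start the x-axis at 1, not 0
plt.plot(component_numbers, pca.explained_variance_ratio_, linewidth = 2)
plt.axis('tight')
plt.xlabel('Principal component')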
plt.ylabel('Explained variance ratio')

From the visualisation above, the elbow method suggests that around 4 components is a reasonable choice.
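
The elbow is a judgement call, so it can help to look at the cumulative explained variance as well. The sketch below prints how much variance the first few components retain and finds the smallest number of components that reaches 95%, a threshold I picked for illustration. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the Kaggle CSV and standardises with `StandardScaler` fitted on the training set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the dataset stands in for the Kaggle CSV
X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=10)

X_tr_std = StandardScaler().fit_transform(X_tr)
pca_full = PCA().fit(X_tr_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k in (1, 2, 4, 10):
    print("{:2d} components: {:.1%} of variance".format(k, cumulative[k - 1]))

# Smallest number of components whose cumulative share reaches 95%
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95%:", n_95)
```

Whether 4 components or the 95% count is the better choice depends on how much accuracy we are willing to trade for a smaller model.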

In [None]:
# Instantiate PCA with 4 components and transform both training set and test set

pca = PCA(n_components = 4)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

In [None]:
# Fit model to data and make predictions
rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)

# Evaluate model accuracy 
accuracy = accuracy_score(Y_test, Y_pred) * 100
print("Accuracy: {:.2f}%".format(accuracy))
f1 = f1_score(Y_test, Y_pred)
print("F1 score: {:.2f}".format(f1))
# scikit-learn metrics expect (y_true, y_pred): rows of cm are true labels, columns are predictions
cm = confusion_matrix(Y_test, Y_pred)
sns.heatmap(cm, annot = True, fmt = 'd')

With just 4 principal components, our model is able to achieve a similar accuracy score to that under the model-based feature selection technique where 9 different features were used to train the model.

However, despite PCA's impressive dimensionality reduction abilities, one of its disadvantages is that our predictor variables become less interpretable. In other words, PCA makes it more difficult for us to determine which features are important in classifying cancer cells. This is because PCA transforms the original features into principal components, each of which is a linear combination of all the original features.
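
Some interpretability can be recovered by inspecting the weights (loadings) that each principal component places on the original features. The sketch below builds a loadings table from `pca.components_` and lists the features that weigh most heavily on the first component. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the Kaggle CSV, and for simplicity fits on the whole dataset, since the aim here is only to inspect the components.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the dataset stands in for the Kaggle CSV
bc = load_breast_cancer(as_frame=True)
X_std = StandardScaler().fit_transform(bc.data)

pca_demo = PCA(n_components=4).fit(X_std)

# Rows are original features, columns are principal components
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=bc.data.columns,
    columns=['PC1', 'PC2', 'PC3', 'PC4'],
)

# The five features with the largest absolute weight in the first component
top_pc1 = loadings['PC1'].abs().sort_values(ascending=False).head(5)
print(top_pc1)
```

Each column of the table is a unit-length vector, so the weights show the relative contribution of each feature to that component rather than an importance score for the classifier.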

Nevertheless, PCA remains a very useful technique for summarising a large number of features into a few key components, allowing our model to capture most of the important information in the dataset and make accurate predictions.
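
One practical detail: in this notebook the features were standardised using the mean and standard deviation of the whole dataset before the split, which lets a little information from the test set leak into the training data. A possible way to avoid this is to chain the steps in a scikit-learn pipeline, so that scaling and PCA are fitted on the training set only. The sketch below assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the Kaggle CSV, so the accuracy it prints need not match the figure reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the dataset stands in for the Kaggle CSV
X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=10)

# Scaling and PCA are learned from the training data inside the pipeline
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=4),
    RandomForestClassifier(random_state=42),
)
pipe.fit(X_tr, y_tr)

pipe_accuracy = accuracy_score(y_te, pipe.predict(X_te))
print("Pipeline accuracy: {:.2f}%".format(pipe_accuracy * 100))
```

On a dataset of this size the leakage is likely to be small, but the pipeline form also makes it easier to reuse the same steps inside cross-validation.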

# 7. Conclusion

To summarise, feature selection and dimensionality reduction allow us to reduce the number of features in our dataset by keeping only those that are important. In other words, we want to retain the features that carry the most useful information for our model to make accurate predictions, while discarding those that carry little to none.

In this notebook, we have considered the following techniques:

- Variance inflation factors (98.83% accuracy with 25 features)
- Univariate feature selection (95.32% accuracy with 5 features)
- Recursive feature elimination (95.91% accuracy with 5 features)
- Model-based feature selection (97.08% accuracy with 9 features)
- Principal component analysis (97.08% accuracy with 4 principal components)

As we saw, despite using significantly fewer features, we still managed to come very close to the accuracy score under the base case scenario (98.25% accuracy), where all the features in the dataset were used to train our model.

# 8. Additional resources

- [Random forest](https://www.youtube.com/watch?v=J4Wdy0Wc_xQ)
- [Confusion matrix](https://www.youtube.com/watch?v=8Oog7TXHvFY&t=1681s)
- [Principal component analysis](https://www.youtube.com/watch?v=FgakZw6K1QQ)
- [Scikit-learn feature selection documentation](https://scikit-learn.org/stable/modules/feature_selection.html)

# 9. Follow me on other platforms

- [Facebook](https://www.facebook.com/chongjason914)
- [Instagram](https://www.instagram.com/chongjason914)
- [Twitter](https://www.twitter.com/chongjason914)
- [LinkedIn](https://www.linkedin.com/in/chongjason914)
- [YouTube](https://www.youtube.com/channel/UCQXiCnjatxiAKgWjoUlM-Xg?view_as=subscriber)
- [Medium](https://www.medium.com/@chongjason)