# This notebook is derived from the excellent notebook on RFE by Gabriel Daely. Many Thanks!
https://www.kaggle.com/gabrieldaely/recursive-feature-elimination-rfe-implementation

# **We are going to use a new Kaggle Utility Script called Featurewiz-Py to perform Recursive Feature Elimination (RFE) to Predict Customer Churn** 
Let us see how simply and quickly this new Python library reduces the number of features!

Steps:
1. Add "Utility Script" from the File menu at the top.
2. Look for "featurewiz" in the list of scripts shown in the pop-up screen.
3. Select it and add it to your notebook.
4. You can then import the library using the command:
from featurewiz_py import featurewiz

In [None]:
from featurewiz_py import featurewiz

## Contents
1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
2. [Feature Engineering and Selection](#Feature-Engineering-and-Selection)
3. [Build Some ML Models](#Build-Some-ML-Models)
4. [Model Evaluation](#Model-Evaluation)
5. [Feature Importance](#Feature-Importance)
6. [Summary](#Summary)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_validate
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
# plot_confusion_matrix is unused here and was removed in scikit-learn 1.2, so it is not imported
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
random_state = 123

Import the dataset.

In [None]:
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv", index_col="customerID")
print(df.shape)
df.head()

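Convert "TotalCharges" to a numeric type (entries that cannot be parsed become NaN) and recode "SeniorCitizen" from 0/1 to "No"/"Yes" so that it is treated as a categorical feature.

In [None]:
# Blank strings in TotalCharges become NaN instead of raising an error
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
# In this dataset 1 marks a senior citizen, so 1 maps to "Yes"
df["SeniorCitizen"] = df["SeniorCitizen"].apply(lambda x: "Yes" if x == 1 else "No")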
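
As a small illustration of the type conversion above, the sketch below runs `pd.to_numeric` with `errors="coerce"` on a toy column. The values are made up for this example, and the blank string only stands in for an entry that cannot be parsed.

```python
import pandas as pd

# Toy column (hypothetical values); the blank string cannot be parsed as a number.
raw = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)
print(numeric.isna().sum())
```

Rows that end up as NaN this way are the ones that a later `dropna()` call removes.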

# Now let's run Featurewiz to select features

In [None]:
help(featurewiz)

In [None]:
target = 'Churn'

In [None]:
feats = featurewiz(df,target)

Notice that it took just 10 seconds to run on this data. It also detected the GPUs on this Kaggle machine automatically and used them to speed up its feature selection. Let us see which features it selected...

In [None]:
print(len(feats))
feats

In [None]:
df = df.dropna()

In [None]:
df.isnull().sum()

<a id='Feature-Engineering-and-Selection'></a>
# B. Feature Engineering and Selection

Separate features and label column into two variables, X and y.

In [None]:
X = df.drop(columns=["Churn"])
y = df["Churn"]

Use StandardScaler to standardize all numerical features, so that each has a mean of zero and a standard deviation of one.

In [None]:
scaler = StandardScaler()
X[X.select_dtypes("number").columns] = scaler.fit_transform(X.select_dtypes("number"))
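
To make the effect of standardization concrete, here is a minimal sketch on a toy DataFrame. The column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric data (hypothetical values).
toy = pd.DataFrame({"tenure": [1, 12, 24, 60],
                    "MonthlyCharges": [20.0, 55.5, 80.0, 110.0]})

scaled = StandardScaler().fit_transform(toy)

# The same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = ((toy - toy.mean()) / toy.std(ddof=0)).to_numpy()

print(scaled.mean(axis=0))  # each column is centred on zero (up to rounding)
print(scaled.std(axis=0))   # each column has unit standard deviation
```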

Encode each categorical feature using an ordinal encoder.

In [None]:
# np.int was removed in NumPy 1.24; the built-in int is equivalent here
ordEnc = OrdinalEncoder(dtype=int)
X[X.select_dtypes("object").columns] = ordEnc.fit_transform(X.select_dtypes("object"))
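
The sketch below shows what the ordinal encoder does to a single categorical column. The toy values are chosen for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column (hypothetical rows).
toy = pd.DataFrame({"Contract": ["Month-to-month", "Two year",
                                 "One year", "Month-to-month"]})

enc = OrdinalEncoder(dtype=int)
codes = enc.fit_transform(toy)

# Categories are sorted, and each value is replaced by its position in that order.
print(list(enc.categories_[0]))
print(codes.ravel().tolist())
```

Note that the integer codes follow the sorted order of the category labels, so the numeric ordering is not necessarily meaningful to a model.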

Also, don't forget to encode the label.

In [None]:
labEnc = LabelEncoder()
y = labEnc.fit_transform(y)
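
As a quick check of how the label is encoded, this sketch fits a LabelEncoder on a toy list of "Yes"/"No" labels (made-up values) and shows which class becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical).
toy_labels = ["No", "Yes", "No", "Yes"]

toy_enc = LabelEncoder()
encoded = toy_enc.fit_transform(toy_labels)

print(list(toy_enc.classes_))   # classes in sorted order
print(encoded.tolist())         # "No" maps to 0 and "Yes" maps to 1
print(list(toy_enc.inverse_transform([1])))
```

Because "Yes" maps to 1, churn is the positive class for metrics such as ROC AUC.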

Do feature selection by using recursive feature elimination (RFE). Use Logistic Regression classifier as the estimator, and set the fold (k) for cross-validation to 10.

In [None]:
estimator = LogisticRegression(random_state=random_state)
rfecv = RFECV(estimator=estimator, cv=StratifiedKFold(10, random_state=random_state, shuffle=True), scoring="accuracy")
rfecv.fit(X, y)
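
To show the idea behind recursive elimination, here is a simplified hand-written loop on synthetic data. It is only a sketch and not scikit-learn's implementation: the data, the variable names, and the fixed target of three features are assumptions made for this example. RFECV additionally uses cross-validation to decide how many features to keep.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 10 features, 3 of them informative.
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

remaining = list(range(X_toy.shape[1]))  # indices of features still in play
eliminated = []                          # indices in the order they were dropped

# Repeatedly fit the model and drop the feature with the smallest |coefficient|.
while len(remaining) > 3:
    model = LogisticRegression(max_iter=1000).fit(X_toy[:, remaining], y_toy)
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    eliminated.append(remaining.pop(weakest))

print("Kept features:", remaining)
print("Elimination order:", eliminated)
```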

Make a line plot of number of selected features against cross-validation score. Then, print the optimal number of features.

In [None]:
plt.figure(figsize=(8, 6))
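# grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the mean CV score per feature count
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(mean_scores)+1), mean_scores)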
plt.grid()
plt.xticks(range(1, X.shape[1]+1))
plt.xlabel("Number of Selected Features")
plt.ylabel("CV Score")
plt.title("Recursive Feature Elimination (RFE)")
plt.show()

print("The optimal number of features: {}".format(rfecv.n_features_))

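List the features chosen by RFECV, then make a new DataFrame called "X_rfe" that contains the selected features. Note that "X_rfe" is built from "feats", the Featurewiz selection, so the comparisons that follow evaluate the Featurewiz features.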

In [None]:
rfe_feats = X.columns[rfecv.support_]
print(len(rfe_feats))
rfe_feats

In [None]:
X_rfe = X[feats]
X_rfe.head(1)

Compare the dimension of DataFrame "X" and "X_rfe".

In [None]:
print("\"X\" dimension: {}".format(X.shape))
print("\"X\" column list:", X.columns.tolist())
print("\"X_rfe\" dimension: {}".format(X_rfe.shape))
print("\"X_rfe\" column list:", X_rfe.columns.tolist())

After the steps above, the data is reduced from the 19 features in the original data to only 9 features.

Now, let's compare their performance on various machine learning models.

<a id='Build-Some-ML-Models'></a>
# C. Build Some ML Models

Split the feature-selected DataFrame into train and test sets, and do the same for the original DataFrame. Passing both DataFrames to a single call keeps their rows aligned.

In [None]:
X_train, X_test, X_rfe_train, X_rfe_test, y_train, y_test = train_test_split(X, X_rfe, y, 
                                                                             train_size=0.8, 
                                                                             stratify=y,
                                                                             random_state=random_state)
print("Train size: {}".format(len(y_train)))
print("Test size: {}".format(len(y_test)))
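
The sketch below illustrates, on toy arrays, why the two feature sets and the labels are passed to one `train_test_split` call: every array is split with the same row indices, and `stratify` keeps the class ratio in both parts. The array names and sizes are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 rows, 2 features, 25% positive labels.
X_full = np.arange(40).reshape(20, 2)
X_sub = X_full[:, :1]                 # "feature-selected" view: first column only
y_toy = np.array([0] * 15 + [1] * 5)

Xf_tr, Xf_te, Xs_tr, Xs_te, y_tr, y_te = train_test_split(
    X_full, X_sub, y_toy, train_size=0.8, stratify=y_toy, random_state=123)

# The same rows land in the train and test parts of both feature sets.
print(np.array_equal(Xf_tr[:, :1], Xs_tr), np.array_equal(Xf_te[:, :1], Xs_te))
# The 25% positive rate is preserved in both parts.
print(y_tr.mean(), y_te.mean())
```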

Let's try the following classifiers to build the machine learning models, and compare their performance on the original and feature-selected datasets.
* Logistic Regression
* Support Vector Machine (linear kernel)
* Naive Bayes
* k-Nearest Neighbors
* Stochastic Gradient Descent
* Decision Tree
* AdaBoost
* Multi-layer Perceptron

In [None]:
clf_keys = ["Logistic Regression", "Support Vector Machine", "Naive Bayes", "k-Nearest Neighbors",
            "Stochastic Gradient Descent", "Decision Tree", "AdaBoost", "Multi-layer Perceptron"]
clf_values = [LogisticRegression(random_state=random_state), SVC(kernel="linear", random_state=random_state),
              GaussianNB(), KNeighborsClassifier(), SGDClassifier(random_state=random_state),
              DecisionTreeClassifier(random_state=random_state), AdaBoostClassifier(random_state=random_state), 
              MLPClassifier(random_state=random_state, max_iter=1000)]
clf_rfe_keys = ["Logistic Regression", "Support Vector Machine", "Naive Bayes", "k-Nearest Neighbors",
                "Stochastic Gradient Descent", "Decision Tree", "AdaBoost", "Multi-layer Perceptron"]
clf_rfe_values = [LogisticRegression(random_state=random_state), SVC(kernel="linear",random_state=random_state),
                  GaussianNB(), KNeighborsClassifier(), SGDClassifier(random_state=random_state),
                  DecisionTreeClassifier(random_state=random_state), AdaBoostClassifier(random_state=random_state), 
                  MLPClassifier(random_state=random_state, max_iter=1000)]
clfs = dict(zip(clf_keys, clf_values))
clfs_rfe = dict(zip(clf_rfe_keys, clf_rfe_values))

# Original dataset
print("Model training using original data: started!")
for clf_name, clf in clfs.items():
    clf.fit(X_train, y_train)
    clfs[clf_name] = clf
    print(clf_name, "training: done!")
print("Model training using original data: done!\n")

# Feature-selected dataset
print("Model training using feature-selected data: started!")
for clf_rfe_name, clf_rfe in clfs_rfe.items():
    clf_rfe.fit(X_rfe_train, y_train)
    clfs_rfe[clf_rfe_name] = clf_rfe
    print(clf_rfe_name, "training: done!")
print("Model training using feature-selected data: done!")

For now, check the test-set accuracy of both sets of models.

In [None]:
# Original dataset
acc = []
for clf_name, clf in clfs.items():
    y_pred = clf.predict(X_test)
    acc.append(accuracy_score(y_test, y_pred))

# Feature selected dataset
acc_rfe = []
for clf_rfe_name, clf_rfe in clfs_rfe.items():
    y_rfe_pred = clf_rfe.predict(X_rfe_test)
    acc_rfe.append(accuracy_score(y_test, y_rfe_pred))
    
acc_all = pd.DataFrame({"Original dataset": acc, "Feature-selected dataset": acc_rfe},
                       index=clf_keys)
acc_all

Make a bar plot of all accuracy results to visualize them.

In [None]:
print("Accuracy\n" + acc_all.mean().to_string())

ax = acc_all.plot.bar(figsize=(10, 8))
for p in ax.patches:
    ax.annotate(str(p.get_height().round(3)), (p.get_x()*0.985, p.get_height()*1.002))
plt.ylim((0.7, 0.82))
plt.xticks(rotation=90)
plt.title("All Classifier Accuracies")
plt.grid()
plt.show()

From the result above, the mean accuracy on feature-selected data is slightly higher (0.3% higher) than the mean accuracy on the original data. The most accurate model is the Support Vector Machine trained on feature-selected data, with 79.6% accuracy. Multi-layer Perceptron accuracy improved by 2.3% when trained on feature-selected data. However, some classifiers (Naive Bayes, k-Nearest Neighbors, Stochastic Gradient Descent, and AdaBoost) do not benefit from training on feature-selected data.

To confirm this result, evaluate the models using cross-validation.

<a id='Model-Evaluation'></a>
# D. Model Evaluation

To validate the accuracy result and further evaluate the performance of these two sets of models, do k-fold cross-validation with $k = 10$ on the whole dataset.
The metrics to validate are accuracy and ROC AUC score.

In [None]:
scoring = ["accuracy", "roc_auc"]

scores = []
# Original dataset
print("Cross-validation on original data: started!")
for clf_name, clf in clfs.items():
    score = pd.DataFrame(cross_validate(clf, X, y, cv=StratifiedKFold(10, random_state=random_state, shuffle=True), scoring=scoring)).mean()
    scores.append(score)
    print(clf_name, "cross-validation: done!")
cv_scores = pd.concat(scores, axis=1).rename(columns=dict(zip(range(len(clf_keys)), clf_keys)))
print("Cross-validation on original data: done!\n")

scores = []
# Feature-selected dataset
print("Cross-validation on feature-selected data: started!")
for clf_name, clf in clfs_rfe.items():
    score = pd.DataFrame(cross_validate(clf, X_rfe, y, cv=StratifiedKFold(10, random_state=random_state, shuffle=True), scoring=scoring)).mean()
    scores.append(score)
    print(clf_name, "cross-validation: done!")
cv_scores_rfe = pd.concat(scores, axis=1).rename(columns=dict(zip(range(len(clf_keys)), clf_keys)))
print("Cross-validation on feature-selected data: done!")
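
For reference, this minimal sketch shows the shape of what `cross_validate` returns when several metrics are requested. The synthetic data and the choice of 5 folds are assumptions for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary classification data (assumption).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

cv = StratifiedKFold(5, shuffle=True, random_state=0)
results = cross_validate(LogisticRegression(max_iter=1000), X_toy, y_toy,
                         cv=cv, scoring=["accuracy", "roc_auc"])

# One entry per fold for each key; averaging gives one number per metric.
print(sorted(results.keys()))
summary = pd.DataFrame(results).mean()
print(summary)
```

The "fit_time" and "score_time" entries come for free, which is why fit time can be compared alongside the two requested metrics.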

Let's visualize cross-validation accuracy, ROC AUC score, and fit time results.

In [None]:
# Accuracy
cv_acc_all = pd.concat([cv_scores.loc["test_accuracy"].rename("Original data"), cv_scores_rfe.loc["test_accuracy"].rename("Feature-selected data")], 
                       axis=1)

print("Cross-validation accuracy\n" + cv_acc_all.mean().to_string())
ax = cv_acc_all.plot.bar(figsize=(10, 8))
for p in ax.patches:
    ax.annotate(str(p.get_height().round(3)), (p.get_x()*0.985, p.get_height()*1.003))
plt.xticks(rotation=90)
plt.ylim((0.7, 0.82))
plt.title("Cross-validation Accuracy")
plt.grid()
plt.legend()
plt.show()

In [None]:
# ROC AUC
cv_roc_auc_all = pd.concat([cv_scores.loc["test_roc_auc"].rename("Original data"), cv_scores_rfe.loc["test_roc_auc"].rename("Feature-selected data")], 
                           axis=1)

print("Cross-validation ROC AUC score\n" + cv_roc_auc_all.mean().to_string())
ax = cv_roc_auc_all.plot.bar(figsize=(10, 8))
for p in ax.patches:
    ax.annotate(str(p.get_height().round(3)), (p.get_x()*0.985, p.get_height()*1.003))
plt.xticks(rotation=90)
plt.ylim((0.63, 0.88))
plt.title("Cross-validation ROC AUC Score")
plt.grid()
plt.legend()
plt.show()

In [None]:
# Fit time
cv_fit_time_all = pd.concat([cv_scores.loc["fit_time"].rename("Original data"), cv_scores_rfe.loc["fit_time"].rename("Feature-selected data")], 
                           axis=1)

print("Cross-validation fit time\n" + cv_fit_time_all.mean().to_string())
ax = cv_fit_time_all.plot.bar(figsize=(10, 8))
for p in ax.patches:
    ax.annotate(str(p.get_height().round(3)), (p.get_x()*0.985, p.get_height()*1.003))
plt.xticks(rotation=90)
plt.yscale("log")
plt.title("Cross-validation Fit Time")
plt.grid()
plt.legend()
plt.show()

From the accuracy result, the mean accuracy on feature-selected data is 0.75% higher than the mean accuracy on the original data. The best accuracy here belongs to the Logistic Regression model trained on feature-selected data, with 80.4% accuracy. Multi-layer Perceptron accuracy shows the largest improvement, 2.8%, when trained on feature-selected data. Both SVM and AdaBoost accuracies on feature-selected data are slightly lower (only 0.1% lower) than their accuracies on the original data. Remember, the feature-selected data has only **9 features** while the original data has 19 features.

The models with the best ROC AUC score are Logistic Regression and AdaBoost, with an ROC AUC score of 0.844. The ROC AUC result is not much different from the accuracy result. However, some classifiers (Logistic Regression, Naive Bayes, and AdaBoost) have a slightly lower ROC AUC score on feature-selected data than on the original data.

All models trained on feature-selected data have a faster fit time than those trained on the original data. This is expected, because those models are trained on fewer features.

<a id='Feature-Importance'></a>
# E. Feature Importance

Find the feature importance of the predictive model that has been made. In this case, use Logistic Regression because it has the highest accuracy among all models.

In [None]:
importance = abs(clfs["Logistic Regression"].coef_[0])
plt.barh(X.columns.values[importance.argsort()], importance[importance.argsort()])
plt.title("Logistic Regression - Feature Importance (Original Data)")
plt.grid()
plt.show()

importance_rfe = abs(clfs_rfe["Logistic Regression"].coef_[0])
plt.barh(X_rfe.columns.values[importance_rfe.argsort()], importance_rfe[importance_rfe.argsort()])
plt.title("Logistic Regression - Feature Importance (Feature-selected Data)")
plt.grid()
plt.show()
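
The ranking used in the plots above can be sketched without any plotting: take the absolute value of each coefficient and sort. The synthetic data and the feature names "f0" to "f3" are placeholders for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with placeholder feature names (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
names = np.array(["f0", "f1", "f2", "f3"])

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# For a binary problem coef_ has shape (1, n_features); use |coefficient| as importance.
importance = np.abs(model.coef_[0])
order = importance.argsort()[::-1]  # indices from most to least important

for name, value in zip(names[order], importance[order]):
    print(f"{name}: {value:.3f}")
```

Coefficient magnitudes depend on the scale of each feature, so this ranking is most meaningful when the features are on comparable scales.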

The top 5 important features of both Logistic Regression models are the same ("tenure", "PhoneService", "Contract", "TotalCharges", and "MonthlyCharges"). The remaining features rank similarly in both models.

Let's check the feature importance of AdaBoost classifier for comparison.

In [None]:
importance = clfs["AdaBoost"].feature_importances_
plt.barh(X.columns.values[importance.argsort()], importance[importance.argsort()])
plt.title("AdaBoost - Feature Importance (Original Data)")
plt.grid()
plt.show()

importance_rfe = clfs_rfe["AdaBoost"].feature_importances_
plt.barh(X_rfe.columns.values[importance_rfe.argsort()], importance_rfe[importance_rfe.argsort()])
plt.title("AdaBoost - Feature Importance (Feature-selected Data)")
plt.grid()
plt.show()

The top 5 important features of the two AdaBoost models are slightly different. The AdaBoost classifier trained on the original data ranks "PaymentMethod" fifth in its feature importance, while this feature is not selected during the RFE step. The remaining features rank similarly in both models.

Also find the feature importance of Support Vector Machine.

In [None]:
importance = abs(clfs["Support Vector Machine"].coef_[0])
plt.barh(X.columns.values[importance.argsort()], importance[importance.argsort()])
plt.title("Support Vector Machine - Feature Importance (Original Data)")
plt.grid()
plt.show()

importance_rfe = abs(clfs_rfe["Support Vector Machine"].coef_[0])
plt.barh(X_rfe.columns.values[importance_rfe.argsort()], importance_rfe[importance_rfe.argsort()])
plt.title("Support Vector Machine - Feature Importance (Feature-selected Data)")
plt.grid()
plt.show()

The top 5 important features of the two Support Vector Machine models differ somewhat. The SVM classifier trained on the original data ranks "tenure" fifth, while the other model ranks "tenure" first.

Across these three models, "tenure", "MonthlyCharges", and "TotalCharges" always appear among the top 5 important features.

<a id='Summary'></a>
# F. Summary

Recursive feature elimination (RFE) is very useful for selecting only the necessary features, saving training time, and still achieving accuracy similar to, or even higher than, that on the original data. RFE is popular because it is easy to configure and use, and because it is effective at selecting the features (columns) in a training dataset that are most relevant to predicting the target variable. Based on the observations above, the feature importance on feature-selected data is also largely preserved and is close to that on the original data.