In [None]:
import numpy as np
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
%pip install evalml

In [None]:
import evalml
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv('../input/credit-card-customers/BankChurners.csv')
data.head()

In [None]:
print(data.info())

In [None]:
data.describe()

In [None]:
data.shape

**The first thing we'll do is drop CLIENTNUM from the data, since a unique client identifier carries no information about attrition. The features are clearly of mixed types, and at first glance there appear to be no null or missing values. That seems unlikely for a dataset of this size, so we'll look more closely.**

In [None]:
data = data.drop(['CLIENTNUM'], axis=1)

In [None]:
for feature in data.columns:
    if data[feature].dtype not in ['int64', 'float64']:
        print(f'{feature}: {data[feature].unique()}')

**Education_Level, Marital_Status, and Income_Category all contain Unknown as a value. We'll have to deal with this before model training, since Unknown isn't an acceptable value for any of these features.**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(16, 28))
sns.set(font_scale=1.6)
cols_ = ["Education_Level", "Marital_Status", "Income_Category"]

for ind, col in enumerate(cols_):
    sns.countplot(x=col, data=data, ax=ax[ind])

**Checking how prevalent Unknown is relative to the other values. Based on the count plots above, Unknown is not the most common value in any of the three features, but its frequency is high enough that we probably don't want to drop the rows containing it.**
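
The count plots give a visual impression; the same check can be made numeric by taking the mean of a boolean comparison per column. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are hypothetical, not taken from the dataset); in the notebook the same expression would be applied to `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "College", "Graduate"],
    "Marital_Status": ["Married", "Single", "Unknown", "Married"],
    "Income_Category": ["Unknown", "$40K - $60K", "Unknown", "$120K +"],
})

# Fraction of rows equal to "Unknown" in each column.
unknown_share = {col: float((toy[col] == "Unknown").mean()) for col in toy.columns}
print(unknown_share)
```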

In [None]:
data.columns

In [None]:
naive_bayes_cols = [
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2',
]
data.drop(naive_bayes_cols, axis=1, inplace=True)

**We're also going to look at the correlation matrix to see whether any features are too closely tied to others. Avg_Open_To_Buy turns out to be almost perfectly correlated with Credit_Limit, so we're going to drop Credit_Limit.**

In [None]:
fig, ax = plt.subplots(figsize=(20, 16))
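# Restrict to numeric columns; recent pandas versions raise on string columns in corr().
df_corr = data.select_dtypes(include="number").corr(method="pearson")
# Boolean mask that hides the upper triangle (including the diagonal).
mask = np.triu(np.ones_like(df_corr, dtype=bool))
ax = sns.heatmap(df_corr, mask=mask, annot=True, ax=ax)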
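
Reading strongly correlated pairs off a large heatmap is error-prone, so it can help to list them programmatically. The following is a rough sketch on a hypothetical frame (`toy`, its columns, and the `threshold` of 0.95 are all assumptions made for illustration); applied to the notebook's numeric columns, the same logic would be expected to surface the Credit_Limit / Avg_Open_To_Buy pair.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame; column "b" is an exact linear function of "a".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr(method="pearson").abs()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off, not a value from the notebook
high_pairs = [
    (row, col, float(upper.loc[row, col]))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]
print(high_pairs)
```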

In [None]:
data.columns

In [None]:
data['Attrition_Flag'].value_counts()

**The target is imbalanced, so accuracy alone would be misleading. We will use the F1 score as our metric.**
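
To see why accuracy is a poor guide on an imbalanced target, consider a trivial model that always predicts the majority class. The label counts below are hypothetical and chosen only to illustrate the effect; they are not the counts from this dataset.

```python
# Hypothetical label counts, chosen only to illustrate imbalance.
y_true = [0] * 90 + [1] * 10
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the denominator is 0.
denom = 2 * tp + fp + fn
f1 = 2 * tp / denom if denom else 0.0
print(accuracy, f1)
```

The majority-class baseline scores 90% accuracy while never identifying a single positive case, which F1 exposes immediately.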

**Encoding**

In [None]:
X = data.copy()
X = X.drop(['Credit_Limit'], axis=1) # dropping Credit Limit since it is highly correlated with Avg_Open_To_Buy
y = X.pop('Attrition_Flag')

X['Income_Category'] = X['Income_Category'].replace({'Less than $40K':0,
                                                     '$40K - $60K':1,
                                                     '$60K - $80K':2,
                                                     '$80K - $120K':3,
                                                     '$120K +':4})
X['Card_Category'] = X['Card_Category'].replace({'Blue':0,
                                                 'Silver':1,
                                                 'Gold':2,
                                                 'Platinum':3})
X['Education_Level'] = X['Education_Level'].replace({'Uneducated':0,
                                                     'High School':1,
                                                     'College':2,
                                                     'Graduate':3,
                                                     'Post-Graduate':4,
                                                     'Doctorate':5})

**Encoding the Target feature**

In [None]:
y = y.replace({'Existing Customer':0,
               'Attrited Customer':1})

**Replacing the Unknown values that we saw earlier with the most frequent value encountered in that feature using SimpleImputer.**

In [None]:
from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer

def preprocessing(X, y):
    imputer = SimpleImputer(impute_strategy="most_frequent", missing_values="Unknown")
    X = imputer.fit_transform(X, y)
    return X

X = preprocessing(X, y)
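
For intuition, most-frequent imputation of a placeholder can be sketched in plain pandas. This is not EvalML's implementation, only an illustration of the idea; the frame `toy` and the helper `impute_most_frequent` are hypothetical names introduced here.

```python
import pandas as pd

# Hypothetical column containing the "Unknown" placeholder.
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Unknown", "Single", "Married", "Unknown", "Divorced"],
})

def impute_most_frequent(series, placeholder="Unknown"):
    """Replace `placeholder` with the most frequent of the remaining values."""
    known = series[series != placeholder]
    most_frequent = known.mode().iloc[0]
    # Keep values that are not the placeholder; fill the rest with the mode.
    return series.where(series != placeholder, most_frequent)

toy["Marital_Status"] = impute_most_frequent(toy["Marital_Status"])
print(toy["Marital_Status"].tolist())
```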

In [None]:
from evalml.utils import infer_feature_types

In [None]:
X = infer_feature_types(X, feature_types={'Income_Category': 'categorical',
                                          'Education_Level': 'categorical'})
X

**Splitting the dataset into 80% train and 20% test.**

In [None]:
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary', test_size=0.2)

**Initializing AutoMLSearch from EvalML**

In [None]:
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary", objective="F1", 
                      allowed_model_families=['random_forest' , 'xgboost', 'lightgbm'],
                      additional_objectives=['accuracy binary'], max_batches=5)
automl.search()

**Pipelines Review**

A lot just happened, so let's review the pipelines that were created and tested. The best-performing pipeline used the LightGBM estimator. We can learn more about it with the describe_pipeline method. Notice that the pipeline includes an imputation step as part of its preprocessing. In this case it ended up being unnecessary, because of our earlier SimpleImputer and the absence of null values in the numerical features. However, AutoMLSearch has the built-in capacity to iterate over the hyperparameters of this preprocessing step as well.

In [None]:
automl.rankings

**Obtaining the complete pipeline of the best model**

In [None]:
best_pipeline_ = automl.best_pipeline
# rankings is sorted best-first, so row 0 holds the best pipeline's id.
automl.describe_pipeline(automl.rankings.iloc[0]["id"])

**The best-performing pipeline uses the LightGBM classifier.**

In [None]:
best_pipeline_.fit(X_train, y_train)
predictions = best_pipeline_.predict(X_test)

In [None]:
from evalml.model_understanding.graphs import (
    graph_binary_objective_vs_threshold, 
    graph_permutation_importance, 
    graph_confusion_matrix
)

graph_binary_objective_vs_threshold(best_pipeline_, X_test, y_test, "F1")

In [None]:
graph_permutation_importance(best_pipeline_, X_test, y_test, "F1")

**Total_Trans_Ct has the highest permutation importance score, followed by Total_Trans_Amt.**
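
Permutation importance measures how much a score drops when one feature's values are shuffled, which breaks its relationship with the target. The sketch below is a generic NumPy illustration of that idea, not EvalML's implementation: the function `permutation_importance`, the toy data, and the use of accuracy instead of F1 are all assumptions made for brevity.

```python
import numpy as np

def permutation_importance(predict, X, y, score, n_repeats=5, seed=0):
    """Mean drop in `score` when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = score(y, predict(X))
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, leaving the other columns intact.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - score(y, predict(X_perm)))
        importances.append(float(np.mean(drops)))
    return importances

# Toy setup: the label depends on column 0 only; column 1 is pure noise.
data_rng = np.random.default_rng(42)
X_toy = data_rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def predict(X):
    # Stand-in "model" that thresholds column 0.
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

importances = permutation_importance(predict, X_toy, y_toy, accuracy)
print(importances)
```

Shuffling the informative column should cost roughly half the accuracy, while shuffling the noise column costs nothing because the stand-in model never reads it.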

In [None]:
graph_confusion_matrix(y_test, predictions)

**The model makes (1685 + 273) = 1958 correct predictions and (52 + 16) = 68 incorrect predictions.**
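
These counts are enough to work out accuracy and F1 by hand. The sketch below assumes that 1685 are true negatives (existing customers) and 273 are true positives (attrited customers); which of 52 and 16 is the false-positive count is not stated here, but F1 depends only on their sum.

```python
# Counts read from the confusion matrix.
true_negatives = 1685  # assumption: correctly predicted existing customers
true_positives = 273   # assumption: correctly predicted attrited customers
errors = 52 + 16       # false positives + false negatives (split not stated)

total = true_negatives + true_positives + errors
accuracy = (true_negatives + true_positives) / total
# F1 = 2*TP / (2*TP + FP + FN)
f1 = 2 * true_positives / (2 * true_positives + errors)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```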

In [None]:
from evalml.objectives.standard_metrics import AccuracyBinary, AUC, F1, PrecisionWeighted, Recall

acc = AccuracyBinary()
auc = AUC()
f1 = F1()
pre_w = PrecisionWeighted()
rec = Recall()

print(f"Accuracy (Binary): {acc.score(y_true=y_test, y_predicted=predictions)}")
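# AUC needs scores rather than hard labels; this assumes predict_proba returns a
# DataFrame whose second column is the positive (attrited) class probability.
print(f"Area Under Curve: {auc.score(y_true=y_test, y_predicted=best_pipeline_.predict_proba(X_test).iloc[:, 1])}")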
print(f"F1: {f1.score(y_true=y_test, y_predicted=predictions)}")
print(f"Precision (Weighted): {pre_w.score(y_true=y_test, y_predicted=predictions)}")
print(f"Recall: {rec.score(y_true=y_test, y_predicted=predictions)}")

**We get an F1 score of 0.88 on the test set, which is a strong result for an imbalanced target.**