This mini-study has three goals: to find relationships in the data, to identify why customers leave the bank, and to build a model that predicts churn.

In [None]:
import pandas as pd
df=pd.read_csv('../input/credit-card-customers/BankChurners.csv')
df=df.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
            'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2','CLIENTNUM'],axis=1)
df.head()

In [None]:
df.info()

The plots below explore the data; the conclusions follow the correlation matrix.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize = (15, 8))
sns.scatterplot(data=df,x='Total_Ct_Chng_Q4_Q1',y='Total_Amt_Chng_Q4_Q1' ,hue='Attrition_Flag')

In [None]:
plt.figure(figsize = (15, 8))
sns.scatterplot(data=df,x='Total_Trans_Amt',y='Total_Trans_Ct',hue='Attrition_Flag')

In [None]:
plt.figure(figsize = (15, 8))
sns.histplot(df['Total_Amt_Chng_Q4_Q1'])

In [None]:
plt.figure(figsize = (15, 8))
sns.histplot(df['Total_Ct_Chng_Q4_Q1'])

In [None]:
plt.figure(figsize = (15, 8))
sns.histplot(df['Avg_Utilization_Ratio'])

In [None]:
plt.figure(figsize = (15, 8))
sns.histplot(df['Total_Trans_Amt'])

In [None]:
plt.figure(figsize = (15, 8))
sns.histplot(df['Total_Trans_Ct'])

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize = (15, 8))
# Pass columns by keyword; recent seaborn versions no longer accept positional data
sns.countplot(data=df, x='Card_Category', hue='Attrition_Flag')


In [None]:
plt.figure(figsize = (15,8))
sns.countplot(x= df['Education_Level'], edgecolor = 'black', saturation = 0.55,hue=df['Attrition_Flag'])
plt.show()

In [None]:
plt.figure(figsize = (15,8))
sns.countplot(x = df['Marital_Status'], edgecolor = 'black', saturation = 0.55,hue=df['Attrition_Flag'])
plt.show()

In [None]:
plt.figure(figsize = (15,8))
sns.countplot(x = df['Income_Category'], edgecolor = 'black', saturation = 0.55,hue=df['Attrition_Flag'])
plt.show()

In [None]:
plt.figure(figsize = (15, 8))

sns.histplot(df['Customer_Age'],kde=True)

In [None]:
plt.figure(figsize = (15, 8))

# Pass columns by keyword; recent seaborn versions no longer accept positional data
sns.countplot(data=df, x='Attrition_Flag', hue='Gender')

In [None]:
# value_counts() sorts by frequency, so the larger "Existing" group comes first
labels=['Existing','Attrited']
colors = ["cyan","red"]
plt.pie(df['Attrition_Flag'].value_counts(),labels=labels,colors=colors,
        autopct='%1.2f%%', shadow=True, startangle=140) 
plt.show()

In [None]:
import seaborn as sns
sns.set_theme(style="darkgrid")
plt.figure(figsize = (15, 8))

ax = sns.countplot(x="Attrition_Flag", data=df)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize = (15, 8)) 
sns.scatterplot(data=df,x='Total_Trans_Amt',y='Avg_Open_To_Buy' ,hue='Attrition_Flag')

In [None]:
plt.figure(figsize=(15,15))
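# Correlate numeric columns only; pandas 2.x raises an error on string columns otherwise
sns.heatmap(df.corr(numeric_only=True),annot=True);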

From the visual analysis and the correlation matrix above, we can draw the following conclusions:
1) Most customers hold a Blue card, and most of the customers who left the bank were also Blue card holders. Perhaps competitors offer more favorable terms.
2) The data set is unbalanced: about 16% of the customers left. That's a lot!
3) The most common income category is less than $40,000 per year.
4) Total_Trans_Amt and Total_Trans_Ct have a correlation of 0.81, a strong positive relationship.
5) Most of the customers who left the bank made between 30 and 80 transactions, with a total transaction amount between 2,500 and 10,000 currency units.
6) Most of the bank's customers are women, and so are most of the customers who left.
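
Conclusions 1 and 6 compare raw counts, which mostly reflect how large each group is. A churn rate per group is a fairer comparison. Below is a minimal sketch in plain Python; the helper `churn_rate_by_group` and the toy rows are made up for illustration and are not taken from BankChurners.csv. It assumes the flag values are the strings 'Existing Customer' and 'Attrited Customer'.

```python
from collections import Counter

def churn_rate_by_group(rows, group_key, flag_key="Attrition_Flag",
                        churned_value="Attrited Customer"):
    """Return {group: share of rows in that group whose flag equals churned_value}."""
    totals = Counter()
    churned = Counter()
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[flag_key] == churned_value:
            churned[group] += 1
    return {group: churned[group] / totals[group] for group in totals}

# Made-up toy rows (hypothetical, NOT real data): 10 Blue and 2 Gold customers
toy_rows = (
    [{"Card_Category": "Blue", "Attrition_Flag": "Existing Customer"}] * 8
    + [{"Card_Category": "Blue", "Attrition_Flag": "Attrited Customer"}] * 2
    + [{"Card_Category": "Gold", "Attrition_Flag": "Existing Customer"}] * 1
    + [{"Card_Category": "Gold", "Attrition_Flag": "Attrited Customer"}] * 1
)

rates = churn_rate_by_group(toy_rows, "Card_Category")
print(rates)
```

In the toy rows, Blue has more churned customers in absolute terms (2 against 1) but a lower churn rate (0.2 against 0.5). On the real data frame, `pd.crosstab(df['Card_Category'], df['Attrition_Flag'], normalize='index')` should give the same kind of table, provided it runs before the label encoding step.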

Let's start building the model. We go through the standard steps: label encoding, normalization and rebalancing of the data.
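
To read the metrics later, it helps to know which integer each class receives. scikit-learn's LabelEncoder numbers the classes in sorted order; the sketch below imitates that in plain Python. The helper `label_encode` is hypothetical, and the two flag strings are assumed to match the values in the Attrition_Flag column.

```python
def label_encode(values):
    """Map each distinct label to an integer, numbering labels in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Assumed flag values, for illustration only
flags = ["Existing Customer", "Attrited Customer", "Existing Customer"]
codes, mapping = label_encode(flags)
print(mapping)
print(codes)
```

Under this assumption the churned customers end up as class 0 and the existing customers as class 1, which matters when reading the classification report and the confusion matrix.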

In [None]:


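from sklearn.preprocessing import LabelEncoder
# Encode every string column as integers. Classes are numbered in sorted order,
# so Attrition_Flag becomes 0 = 'Attrited Customer', 1 = 'Existing Customer'
for c in df.columns:
    if df[c].dtype == object:
        df[c] = LabelEncoder().fit_transform(df[c].astype(str))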





In [None]:
X=df.drop('Attrition_Flag',axis=1)
y=df['Attrition_Flag']

In [None]:
from sklearn import preprocessing
norm = preprocessing.StandardScaler()
ndf=norm.fit_transform(X)
X = pd.DataFrame(ndf, index=X.index, columns=X.columns)
X.head(10)

In [None]:
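from sklearn.model_selection import train_test_split
# Split before resampling so the test set contains only real, untouched customers
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42, shuffle=True)

In [None]:
from imblearn.over_sampling import ADASYN
# Oversample the minority class in the training set only; resampling before the
# split would leak synthetic neighbours of test rows into the training data
X_train, y_train = ADASYN().fit_resample(X_train, y_train)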

In [None]:
from catboost import CatBoostClassifier
catboost_params = {'loss_function' : 'CrossEntropy',
            'iterations': 2000,
            'depth': 5,
            'learning_rate': 0.01,
            'eval_metric': 'AUC',
            'random_seed': 4,
            'l2_leaf_reg': 15.0,
            'bagging_temperature': 0.75,
            'allow_writing_files': False, 'border_count':50
        }
model = CatBoostClassifier(**catboost_params)
model.fit(X_train, y_train,verbose=True)

In [None]:
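from sklearn.metrics import classification_report, roc_auc_score
y_pred=model.predict(X_test)
print(classification_report(y_test, y_pred))
# ROC AUC is a ranking metric, so score it on the predicted probability of class 1, not on hard labels
y_proba=model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test,y_proba))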
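
ROC AUC measures how well the model ranks positives above negatives, so it depends on whether we feed it scores or hard 0/1 predictions. The sketch below computes AUC in plain Python as the share of (positive, negative) pairs that are ranked correctly, counting ties as one half. The function `pairwise_auc` and the toy numbers are made up for illustration; they are not output of the model above.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical values)
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.4, 0.7, 0.9]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

auc_scores = pairwise_auc(y_true, scores)
auc_hard = pairwise_auc(y_true, hard)
print(auc_scores, auc_hard)
```

Here the two values differ: thresholding collapses the scores into two levels, so many pairs become ties and the ranking information is lost.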



In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_predictions(y_test, model.predict(X_test))

In [None]:
import numpy as np
def plot_feature_importance(importance,names,model_type):
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    plt.figure(figsize=(15,8))
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    plt.title(model_type + ' FEATURE IMPORTANCE')  # use the model name passed by the caller
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
plot_feature_importance(model.get_feature_importance(),X_train.columns,'CATBOOST')

Our model performs reasonably well. Based on the feature importance graph, the features that influence the model most are the total number of transactions, the total transaction amount and customer inactivity.