# Bank Customer Churn


- RowNumber: the record (row) number; it has no effect on the output.
- CustomerId: contains random values and has no effect on whether a customer leaves the bank.
- Surname: a customer’s surname has no impact on their decision to leave the bank.
- CreditScore: can affect churn, since a customer with a higher credit score is less likely to leave the bank.
- Geography: a customer’s location can affect their decision to leave the bank.
- Gender: it is interesting to explore whether gender plays a role in a customer leaving the bank.
- Age: certainly relevant, since older customers are less likely to leave their bank than younger ones.
- Tenure: the number of years that the customer has been a client of the bank. Normally, long-standing clients are more loyal and less likely to leave a bank.
- Balance: also a very good indicator of churn, as people with a higher account balance are less likely to leave the bank than those with a lower balance.
- NumOfProducts: the number of products that a customer has purchased through the bank.
- HasCrCard: whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.
- IsActiveMember: active customers are less likely to leave the bank.
- EstimatedSalary: as with balance, people with lower salaries are more likely to leave the bank than those with higher salaries.
- Exited: whether or not the customer left the bank.
- Complain: whether or not the customer has filed a complaint.
- Satisfaction Score: the score provided by the customer for their complaint resolution.
- Card Type: the type of card held by the customer.
- Points Earned: the points earned by the customer for using their credit card.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv('churn.csv')
data.head()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.describe().T

In [None]:
data.duplicated().sum()

In [None]:
data.nunique()

In [None]:
data[['Exited']].value_counts()

In [None]:
data[['Geography']].value_counts()

In [None]:
plt.figure(figsize=(10,6))
# Sort by class label so the slices line up with the labels (0 = retained, 1 = churned)
churn_counts = data['Exited'].value_counts().sort_index()
plt.pie(churn_counts, labels=['Retained', 'Churned'], autopct='%1.1f%%', startangle=90)
plt.title('Churn Distribution')
plt.show()

In [None]:
sns.countplot(x='Gender',hue='Exited', data=data)
plt.title('Churn Distribution by Gender')
plt.show()

In [None]:
sns.countplot(x='Geography',hue='Exited', data=data)
plt.title('Churn Count by Geography')
plt.show()

In [None]:
sns.countplot(x='Geography',hue='Gender', data=data)
plt.title('Gender Distribution by Geography')
plt.show()

In [None]:
sns.histplot(data=data, x='Age', hue='Exited', bins=30, multiple='stack', kde=True)
plt.show()

In [None]:
sns.countplot(x='Tenure', hue='Exited', data=data)
plt.show()

In [None]:
sns.histplot(data=data, x='Balance', hue='Exited', bins=30, multiple='stack', kde=True)
plt.show()

In [None]:
data[['NumOfProducts']].value_counts()

Customers who have purchased more than two products have a higher churn rate than those who have purchased two or fewer.

In [None]:
sns.countplot(data=data, x='NumOfProducts', hue='Exited')
plt.show()
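
The count plot shows raw counts, so the churn rate per product count has to be read off by eye. A minimal sketch of computing it directly with a group-by is given below. The `sample` frame and its values are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Tiny made-up sample; only the column names mirror the churn dataset.
sample = pd.DataFrame({
    'NumOfProducts': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    'Exited':        [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the churn rate per group.
churn_by_products = (
    sample.groupby('NumOfProducts')['Exited']
    .agg(customers='count', churn_rate='mean')
)
print(churn_by_products)
```

Swapping `sample` for `data` should give the same table for the real dataset. The `customers` column is worth keeping next to the rate, since a rate computed from only a handful of rows is not very reliable.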

In [None]:
data[['HasCrCard']].value_counts()

In [None]:
sns.countplot(data=data, x='HasCrCard', hue='Exited')
plt.xticks(ticks=[0, 1], labels=['No', 'Yes'])
plt.show()

In [None]:
sns.histplot(data=data, x='EstimatedSalary', hue='Exited', bins=30, multiple='stack', kde=True)
plt.show()

In [None]:
sns.countplot(data=data, x='IsActiveMember', hue='Exited')
plt.xticks(ticks=[0, 1], labels=['No', 'Yes'])
plt.show()

# Feature Engineering

In [23]:
# Drop identifier columns that carry no predictive signal
data.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

In [24]:
# Separate the features from the target
X = data.drop(columns='Exited')
y = data['Exited']

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train , y_test = train_test_split(X,y, test_size=0.33, random_state=42)

In [None]:
X_train.shape

In [None]:
X_train.head()

In [28]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

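# Binary flags (HasCrCard, IsActiveMember) are scaled along with the continuous columns
numerical_features = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

# Columns not listed here are dropped, because ColumnTransformer defaults to remainder='drop'
preprocessor = ColumnTransformer([
    ('geo', OrdinalEncoder(), ['Geography']),
    ('gender', OneHotEncoder(), ['Gender']),
    ('num', StandardScaler(), numerical_features)
])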

In [29]:
# Fit the preprocessor on the training split only, then reuse it on the test split
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
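
Encoders and scalers should learn their categories and statistics from the training split only and then be applied unchanged to the test split. One way to make that hard to get wrong is to wrap the preprocessing and the model in a scikit-learn `Pipeline`. The sketch below assumes a small made-up frame (`toy`) with a subset of the dataset's column names; the one-hot encoding of `Geography` and the choice of `LogisticRegression` are illustrative, not the notebook's setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration; only the column names follow the dataset.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France',
                  'Germany', 'Spain', 'France', 'Germany'],
    'Age': [42, 35, 58, 29, 61, 33, 47, 52],
    'Balance': [0.0, 83807.9, 159660.8, 0.0,
                125510.8, 113755.8, 0.0, 142051.1],
    'Exited': [0, 0, 1, 0, 1, 0, 0, 1],
})
X_toy = toy.drop(columns='Exited')
y_toy = toy['Exited']
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

toy_preprocessor = ColumnTransformer([
    # handle_unknown='ignore' avoids an error if the test split
    # contains a category that was not seen during fitting.
    ('geo', OneHotEncoder(handle_unknown='ignore'), ['Geography']),
    ('num', StandardScaler(), ['Age', 'Balance']),
])

pipe = Pipeline([
    ('prep', toy_preprocessor),
    ('clf', LogisticRegression()),
])

# fit() fits the preprocessor on the training rows only;
# predict() then only transforms the test rows before classifying them.
pipe.fit(X_toy_train, y_toy_train)
toy_preds = pipe.predict(X_toy_test)
print(toy_preds)
```

With a pipeline, the fit/transform distinction is handled by `fit` and `predict`, so there is no separate "prepared" test matrix to keep in sync.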

# Model Training

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

models = [ LogisticRegression() ,SVC(), DecisionTreeClassifier(), GradientBoostingClassifier(), AdaBoostClassifier(), RandomForestClassifier(), 
    KNeighborsClassifier(), GaussianNB()]

model_names = ['Logistic Regression', 'SVC', 'Decision Tree', 'Gradient Boosting', 'AdaBoost', 'Random Forest', 'KNN','Naive Bayes']

# Despite its name, this list holds each model's full classification report
# (precision, recall, F1 and accuracy), not just the accuracy score
accuracy = []

for name, clf in zip(model_names, models):
    clf.fit(X_train_prepared, y_train)
    y_pred = clf.predict(X_test_prepared)
    print(f"----- {name} ------")
    accuracy.append(classification_report(y_test, y_pred))
    print(accuracy[-1])
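
The loop above prints one full report per model, which is hard to compare at a glance. A possible follow-up is to collect a single score per model into a sorted pandas `Series`. This is a sketch only: it runs on synthetic data from `make_classification` rather than the churn dataset, uses just two of the models, and every `demo_`-prefixed name is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the churn dataset), so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# One accuracy value per model, keyed by the model's display name.
demo_scores = {}
for name, clf in demo_models.items():
    clf.fit(X_demo_train, y_demo_train)
    demo_scores[name] = accuracy_score(y_demo_test, clf.predict(X_demo_test))

# Sort so the best-scoring model comes first.
demo_summary = pd.Series(demo_scores, name='accuracy').sort_values(ascending=False)
print(demo_summary)
```

Accuracy alone can flatter a model when one class dominates, so for the churn data it may be worth adding recall or F1 for the churned class as further columns.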