# **Bank Customer Churn Analysis**

-------------

## **Objective**

This project analyzes customer churn at a bank with a support vector classifier (SVC). The goal is to predict whether a customer will exit, using features such as geography, gender, balance, and number of products, and to compare three ways of training on the imbalanced target: the original data, random under-sampling, and random over-sampling.

## **Data Source**

The dataset used in this analysis is the [Bank Customer Churn Dataset](https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset?resource=download).

## **Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

## **Import Data**

In [None]:
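# The Kaggle link is an HTML dataset page, so pd.read_csv cannot parse it directly.
# Download the CSV from the Data Source link first, then read the local copy.
csv_path = 'bank_customer_churn.csv'  # placeholder file name; adjust to the downloaded file
df = pd.read_csv(csv_path)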

## **Describe Data**

In [None]:
# Only the last expression of a cell is displayed, so print the preview explicitly
print(df.head())
df.info()

## **Data Visualization**

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='Exited', data=df)
plt.title('Distribution of Exited Customers')
plt.show()
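
The count plot shows the imbalance visually; the same information can be read off as numbers. The following is a minimal sketch on a toy series (the values are illustrative, not taken from the dataset); in the notebook, `df['Exited']` would take the place of the hypothetical `exited` series.

```python
import pandas as pd

# Toy stand-in for df['Exited']; values are illustrative, not from the dataset
exited = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name='Exited')

# Absolute counts and relative shares per class
counts = exited.value_counts().sort_index()
shares = exited.value_counts(normalize=True).sort_index()

# Ratio of majority-class size to minority-class size
imbalance_ratio = counts.max() / counts.min()

print(counts.to_dict())
print(shares.round(2).to_dict())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```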

## **Data Preprocessing**

In [None]:
df.set_index('CustomerId', inplace=True)

# Encoding categorical variables
# (Series.map sidesteps the downcasting FutureWarning that replace can raise in pandas 2.2+)
df['Geography'] = df['Geography'].map({'France': 2, 'Germany': 1, 'Spain': 0})
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Encode number of products as 0 and 1 (grouping 2, 3, and 4 into one category)
df['NumberOfProducts'] = df['NumberOfProducts'].replace({1: 0, 2: 1, 3: 1, 4: 1})

# Convert Credit Card and Active Member to binary (0 and 1)
df['CreditCard'] = df['CreditCard'].replace({'Yes': 1, 'No': 0})
df['ActiveMember'] = df['ActiveMember'].replace({'Yes': 1, 'No': 0})

# Create a new feature for zero bank balance
df['ZeroBankBalance'] = (df['Balance'] == 0).astype(int)

# Features and target variable
X = df.drop(['Surname', 'Exited'], axis=1)
y = df['Exited']
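
The integer codes for `Geography` (France 2, Germany 1, Spain 0) imply an ordering between countries, and a distance-based model such as an RBF-kernel SVC will treat that ordering as meaningful. One common alternative is one-hot encoding. The sketch below uses a toy frame (the rows are illustrative, and `toy` and `encoded` are hypothetical names) to show what `pd.get_dummies` would produce; it is not what the notebook itself does.

```python
import pandas as pd

# Toy frame reusing the notebook's column names; rows are illustrative
toy = pd.DataFrame({
    'Geography': ['France', 'Germany', 'Spain', 'France'],
    'Balance': [0.0, 1200.5, 0.0, 310.0],
})

# One indicator column per country; drop_first removes the redundant first level
encoded = pd.get_dummies(toy, columns=['Geography'], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

With `drop_first=True`, France becomes the implicit baseline: a row with zeros in both indicator columns is a French customer.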

## **Train Test Split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
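
Because churners are the minority class, it can help to keep the churn rate identical in the training and test sets. The `stratify` argument of `train_test_split` does this. The sketch below illustrates the effect on toy labels (80 negatives, 20 positives, chosen for illustration); `X_toy`, `y_toy`, `y_tr`, and `y_te` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 negatives, 20 positives (illustrative only)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the positive share equal in both subsets
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```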

## **Data Standardization**

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
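
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation, so no test-set statistics leak into training. A small sketch on random data (assumed here purely for illustration; `train`, `test`, and `manual` are hypothetical names) makes the computation explicit:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random data standing in for two numeric features (illustrative only)
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# transform computes z = (x - mean_) / scale_ column by column
manual = (test - scaler.mean_) / scaler.scale_

print(np.allclose(manual, test_scaled))
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected.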

## **Handling Imbalanced Data**

In [None]:
# Handling imbalanced data with Random Under Sampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)

# Handling imbalanced data with Random Over Sampling
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

## **Modeling**

In [None]:
svc = SVC()

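# Hyperparameter grid shared by all three searches
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.1, 1, 10],
    'kernel': ['rbf'],
    'class_weight': ['balanced']  # little to no effect once classes are resampled to equal size
}

# Hyperparameter tuning for the raw (unresampled) data
grid_search = GridSearchCV(svc, param_grid, cv=5, n_jobs=-1, scoring='f1')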
grid_search.fit(X_train, y_train)
best_model_raw = grid_search.best_estimator_
y_pred_raw = best_model_raw.predict(X_test)

# Hyperparameter tuning for Random Under Sampling data
grid_search_rus = GridSearchCV(svc, param_grid, cv=5, n_jobs=-1, scoring='f1')
grid_search_rus.fit(X_rus, y_rus)
best_model_rus = grid_search_rus.best_estimator_
y_pred_rus = best_model_rus.predict(X_test)

# Hyperparameter tuning for Random Over Sampling data
grid_search_ros = GridSearchCV(svc, param_grid, cv=5, n_jobs=-1, scoring='f1')
grid_search_ros.fit(X_ros, y_ros)
best_model_ros = grid_search_ros.best_estimator_
y_pred_ros = best_model_ros.predict(X_test)

## **Model Evaluation**

In [None]:
def evaluate_model(y_test, y_pred, model_name):
    print(f"Evaluation for {model_name}")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print()

# evaluate_model already prints the model name as a header
evaluate_model(y_test, y_pred_raw, "Raw Data Model")
evaluate_model(y_test, y_pred_rus, "Random Under Sampling Model")
evaluate_model(y_test, y_pred_ros, "Random Over Sampling Model")
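
The three reports are easier to compare when the churn-class metrics sit side by side. The sketch below builds such a table from toy labels and predictions (all values and the names `model_a`, `model_b`, `y_true`, `predictions`, and `summary` are illustrative); in the notebook, `y_test` and the three `y_pred_*` arrays would take their place.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels and predictions; in the notebook these would be y_test and y_pred_*
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
predictions = {
    'model_a': [0, 0, 0, 0, 1, 0, 0, 0],
    'model_b': [0, 1, 0, 1, 1, 1, 1, 0],
}

# Collect positive-class (churn) metrics for each model
rows = []
for name, y_pred in predictions.items():
    rows.append({
        'model': name,
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    })

summary = pd.DataFrame(rows).set_index('model')
print(summary.round(3))
```

In the toy example, `model_a` is precise but misses most positives, while `model_b` catches all of them at the cost of false alarms, which is the kind of trade-off resampling tends to shift.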