# Import Dataset

In [89]:
import pandas as pd
import numpy as np

# Read file
df = pd.read_csv('churn.csv')

In [None]:
# Initial data inspection
df.head()

In [None]:
# Get basic information of dataset

df.info()

In [None]:
# Shape of dataset

df.shape

The dataset has 10000 customer entries with 14 multiple features. Out of these 14 features only 10 can have impact on customer churn. 

1. CreditScore— Customer with a higher credit score is less likely to leave the bank.
2. Geography— A customer’s location can affect their decision to leave the bank.
3. Gender— It’s interesting to explore whether gender plays a role in a customer leaving the bank.
4. Age— This is certainly relevant, since older customers are less likely to leave their bank than younger ones.
5. Tenure— Refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
6. Balance— Also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
7. NumOfProducts— Refers to the number of products that a customer has purchased through the bank.
8. HasCrCard— People with a credit card are less likely to leave the bank.
9. IsActiveMember— Active customers are less likely to leave the bank.
10. EstimatedSalary— People with lower salaries are more likely to leave the bank compared to those with higher salaries.

Target vector:
1. Exited— whether or not the customer left the bank.

All other columns like RowNumber, CustomerID, Surname have no relation with customer churn.


# Data Preparation
The dataset was inspected to check for missing values, duplicates, and basic statistics.

## 1. Drop unnecessary columns

In [93]:
# Drop unnecessary columns

df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

In [None]:
df.head()

In [None]:
# Statistical description of dataset
df.describe().T

## 2. Handle Missing Values

In [None]:
# Check for missing values
df.isnull().sum()

- No missing values present.

## 3. Handle Duplicates

In [None]:
# Check for duplicate entries
df.duplicated().sum()

- No duplicate entries present.

## 4. Renaming columns

In [98]:
# Rename column 'Exited' to 'Churn'

df.rename(columns={'Exited':'Churn'}, inplace=True)

In [None]:
df.head()

# Exploratory Data Anaysis
For Exploratory data analysis, we will graphically analyse: How many churned and churning wrt all other features.

## 1. Total Churning Distribution

In [100]:
# 1. Churning customer distribution

from matplotlib import pyplot as plt 
import seaborn as sns; sns.set_theme()

churn_counts= df['Churn'].value_counts()

In [None]:
plt.pie(churn_counts, labels=['Not Churned','Churned'], autopct='%1.2f%%')
plt.title('Total churn distribution')
plt.show()

- Majority of the customers (~80%) continue to use the service without churning. Only 20% churned.
- We can say that the data is imbalanced. 
- We can use SMOTE later to make a balanced datset for accurate prediction and compare results with the originally imbalanced dataset.

## 2. Gender

In [None]:
df.head()

### (a) Gender distribution among all customers

In [None]:
gender_counts = df['Gender'].value_counts()
gender_counts

In [None]:
plt.pie(gender_counts, labels=['Male', 'Female'], autopct='%1.2f%%')
plt.title('Gender Distribution Across All Bank Customers')
plt.show()

### (b) Gender Distribution among Churned and Not Churned

In [None]:
sns.histplot(df, x='Churn', hue='Gender', multiple='dodge', binwidth=0.25)
plt.title("Gender Distribution and Churning")
plt.show()

Gender distribution reveals that the majority of customers are male, while more churned customers are female.

## 3. Age

In [None]:
df.head()

In [None]:
sns.histplot(data= df, x= 'Age', hue='Churn', kde=True)
plt.xlabel('Age of customers')
plt.ylabel('Customer counts')
plt.title('Customer Age dostribution')
plt.show()

In [None]:
sns.violinplot(data=df, x='Churn', y='Age', hue='Churn')
plt.title("Age distribution and Churning")
plt.show()

Histogram and violin plot show that churn rates are higher among customers aged 40-50.

## 4. Geography

In [None]:
df.head()

In [None]:
df['Geography'].value_counts()

In [None]:
# Another method to find value_counts
df.groupby(['Geography']).count()['Churn']

In [None]:
sns.histplot(data=df, x='Geography', hue='Churn',  multiple='dodge')
plt.xlabel('Customer Location')
plt.ylabel('Number of Customers')
plt.title('Customer Churning Distribution across Customer Location')
plt.show()

Most of the customers are located in France. However, the highest number of customers churned are from Germany, which also has minimum number of customers.

## 5. CreditScore

In [None]:
df.head()

In [None]:
sns.histplot(data=df, x='CreditScore', hue='Churn', kde=True)
plt.xlabel('Customer Credit Score')
plt.ylabel('Customer Counts')
plt.title('Customer Churning wrt Credit Score')

In [None]:
sns.violinplot(data=df, x= 'Churn', y= 'CreditScore', hue= 'Churn')
plt.legend(title='Churning status', loc= 'upper center')

Credit scores reflect the overall financial behaviour of a customer. Thus is surely can be a factor to predict churning. Some factors like older accounts, ontime loan/credit card payments etc increases the credit score of a customer. Which means, higher the credit score higher are the chances of stability, thus less chance of churning. The customers with low and high credit scores have lesser tendency of churning while the customers with intermediate credit score have higher tendency to churn. But the number of customers who did not churn also follow the same curve.

## 6. Tenure

In [None]:
df.head()

In [None]:
df['Tenure'].value_counts()

In [None]:
sns.histplot(data=df, x='Tenure', hue='Churn', multiple='dodge', binwidth=0.5).legend(['Churned', 'Not Churned'])
plt.title('Tenure wise churning distribution')
plt.show()

Customers with tenure very high (>9) seems less likely to churn. Same goes for the new customers. However, the customers with tenure 1-9 seem to churn highly.

## 7. Bank Balance

In [None]:
# Bank balance
df.head()

In [None]:
sns.histplot(data=df, x= 'Balance', hue='Churn', kde=True, multiple='dodge')
plt.title("Balance and Churn")

Customers with zero bank balance are highest among churners. Whereas the customers with very high bank balance (>1.5L) are less likely to leave the bank.

## 8. Number of Products

In [None]:
df.head()

In [None]:
sns.histplot(data=df, x='NumOfProducts', hue='Churn', multiple='dodge', binwidth=0.5)
plt.title("Number of Products and Churn")


Most customers have 1-2 products, and this range is also prevalent among churners.

## 9. Has Credit Card

In [None]:
df.head()

In [None]:
sns.countplot(data=df, x= 'HasCrCard', hue= 'Churn')
plt.title("Has Credit Card and Churn")

The majority of customers possess credit cards. In both churn and non-churn groups, those with credit cards are more prevalent.

## 10. Is Active Member

In [None]:
df.head()

In [None]:
sns.countplot(df, x='IsActiveMember', hue= 'Churn')
plt.title("Is Active Member and Churn")


Churners seems to be less active. However, the active memebers churning is also considerable.

## 11. Estimated Salary

In [None]:
df.head()

In [None]:
sns.histplot(df, x='EstimatedSalary', hue='Churn', multiple='dodge')
plt.title("Estimated Salary and Churn")

Churning pattern seems similar among all the salary ranges.

## Encoding

There are two columns Geography and Gender which are object type and need to be encoded. The type of encoding depends on the model we are going to use for prediction. The Categorical data is converted to numerical values so that our ML model can understand it.
1. Linear models: For linear models numerical values has meaning- equivalent to their magnitude.Thus one-hot-encoding is used. For example, if male and female are encoded as 1 and 0, then it will take 1 > 0 and train model accordingly, which will obviously not give the correct result.
2. DT, RF, XGBoosts have no problem with label encoding because they do not use order of categories directly.

So, we are going to perform one-hot-encoding, so that we can fit on different models and compare the accuracy.

In [None]:
df.head()

There are two ways to perform one hot encoding. Using pd.get_dummies and sklearn.preprocessing.OneHotEncoding.

**One hot Econding using Pandas**

In [130]:
# One hot encoding using pandas
df_encoded_pd = pd.get_dummies(df, columns=['Geography', 'Gender']).astype(int)

In [None]:
df_encoded_pd.head()

**One hot encoding using sklearn**

In [None]:
df.head()

In [133]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)

sk_encoded = encoder.fit_transform(df[['Geography', 'Gender']])
df_sk = pd.DataFrame(sk_encoded, columns=encoder.get_feature_names_out())
df_encoded_sk = pd.concat([df.drop(columns=['Geography', 'Gender']), df_sk], axis=1)

In [None]:
df_encoded_sk.head()

# Normalization

- We have column values that range widely different. For efficient model generation and convergence using Gradient decsent we must scale them to a uniform level. As the data range varies widely and units are also not the same, we will use StandaedScaler for normalization. Standard Scler or Z-score normalization makes the data distribution uniform for features. When we fit the model, it stores the mean and SD for each column. When we transform the data, it normalizes the columns by subtracting with their mean and dividing by their respective SD.

- We will apply the normalization on features [CreditScore, Age, Balance, EstimatedSalary]. We have already encoded the categorical features, so there is no need to scale them.

In [135]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
model = scaler.fit_transform(df_encoded_pd[['CreditScore', 'Age', 'Balance', 'EstimatedSalary']])
# model_transformed = scaler.transform(df_encoded_pd)

normalized_df = pd.DataFrame(model, columns= scaler.get_feature_names_out())

# Concat with other columns as well

normalized_df_concat = pd.concat([df_encoded_pd.drop(columns=['CreditScore', 'Age', 'Balance', 'EstimatedSalary']),normalized_df], axis=1)

In [None]:
normalized_df_concat.head()

In [None]:
normalized_df_concat.describe().T

In [None]:
normalized_df_concat.shape

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=5, figsize=(20,12))

axes = axes.flatten()  # Flatten axes to make readable in for loop
for i, col in enumerate(normalized_df_concat.columns):
    sns.histplot(data=normalized_df_concat, x=col, ax=axes[i], kde=True)

All the features are scaled down to similar level although they were varying very widely before.

# Outliers

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(normalized_df_concat)
plt.xticks(rotation= 90)
plt.title("Box Plot Representing Outliers for Each Feature")

All the outliers seems reasonable and doesn't seem like noise which should be removed.

# Correlation

To find if our features are independent or not we will plot correlation matrix.


In [None]:
from scipy import stats

((normalized_df_concat.corr()>0.5) & (normalized_df_concat.corr()!=1)).sum()

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(normalized_df_concat.corr(), cmap='coolwarm', annot=True, fmt='.2f')

- There is no significant correlation among features.

# Feature matrix and Target vector

In [None]:
X = normalized_df_concat.drop(columns='Churn')
X.head()

In [None]:
y = normalized_df_concat['Churn']
y.head()

# Train-Test-Split

We split the dataset into training and testing sets to ensure fair evaluation of model performance.

In [145]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) 

In [None]:
print(f"{X_train.shape=}, {y_train.shape=}\n{X_test.shape=}, {y_test.shape=}")

# Fitting the Model

Two models are employed for churn prediction:
1. Decision tree
2. Random forest

## 1. Decision Tree

In [None]:
X_train.head()

For hyperparameter tuning we are using grid serach with cross-valiadtion. This will help us provide the optimal parameter with enhanced model performance. 

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

grid = {'criterion': ['gini', 'entropy'],
        'max_depth': [5,10,15],
        'random_state': [0,42]
        }

grid_tree= GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=grid, cv=5)

grid_tree.fit(X_train, y_train)

In [None]:
grid_tree.best_params_  # Get best hyperparameters

In [151]:
dtree = DecisionTreeClassifier(criterion='entropy', max_depth= 6, random_state=None).fit(X_train, y_train)  # Train model for best hyperparameters

In [None]:
from sklearn import tree 
print(tree.plot_tree(dtree))

In [None]:
acc_dtree = dtree.score(X_test, y_test)  # Accuracy of DT model
print(f"Accuracy of DT: {acc_dtree}")

In [154]:
y_dtree = dtree.predict(X_test)  # Predict y using DT model

## 2. Random Forest

Similar hyperparameter tuning will be conducted for the Random Forest model.

In [None]:
from sklearn.ensemble import RandomForestClassifier


grid2 = { 'n_estimators': [100,150,200],
         'max_depth': [5,10,15],
           'criterion': ['gini', 'entropy'],
           'random_state': [0, 42], 
           }

grid_rf = GridSearchCV(estimator=RandomForestClassifier(), param_grid=grid2, cv=5)

grid_rf.fit(X_train, y_train)

In [None]:
grid_rf.best_params_  # Get best fitted hyperparameters

In [None]:
rf_model = RandomForestClassifier(n_estimators=grid_rf.best_params_['n_estimators'],criterion= grid_rf.best_params_["criterion"],max_depth= grid_rf.best_params_["max_depth"], random_state= grid_rf.best_params_["random_state"]).fit(X_train, y_train)  # Train model using best hyperparameters

acc_rf = rf_model.score(X_test, y_test)
print(f"Accuracy of RF: {acc_rf}")

In [159]:
y_rf = rf_model.predict(X_test)  # Predicted y using RF model

The accuracies of all the models are as follows:

1. Decision tree: 86.08%
2. Random forest: 86.76%

Accuracies of DT and RF are more or less similar.

# Model Evaluations

## 1. Confusion matrix
Confusion matrix shows the true positives, true negatives, false positives, and false negatives for both models.

In [None]:
from sklearn.metrics import confusion_matrix

figure, axes = plt.subplots(nrows=1,ncols=2,figsize=(12,5))

cm_dtree = confusion_matrix(y_test, y_dtree)
cm_rf = confusion_matrix(y_test, y_rf)

sns.heatmap(cm_dtree, annot=True, fmt='g', cmap='coolwarm', ax=axes[0])
axes[0].set_title('Confusion Matrix for Decision Tree Model')

sns.heatmap(cm_rf, ax=axes[1], annot=True, fmt='g', cmap='coolwarm' )
axes[1].set_title('Confusion Matrix for Random Forest Model')


## 2. Classification Report

- To get scores: Precision, Recall, and F1-Score. To understand the report, we should know that:
  1. Lower the precision means higher is the chances of false positives.(i.e. predicting many churn even when not churned).
  2. Lower the recall means higher is the chances of false negatives. (i.e. unable to predict the many churned ones as churned).
  3. F1-score finds a balance between precision and recall. Lower F1 score means, it is struggling to balance between precision and recall.

- Higher the metrics values, good is the prediction model.

In [None]:
from sklearn.metrics import classification_report

cr_dtree = classification_report(y_test, y_dtree, output_dict=True)
print(f'Classification Report for Decision Tree Model:\n {cr_dtree}')

In [None]:
cr_rf = classification_report(y_test, y_rf, output_dict=True)
print(f"Classification Report for RF model: \n{cr_rf}")

For Customer churn prediction it is very important to not miss any class '1' prediction. This means that our recall should be high and FN should be low. 
But as per the classification report, the recall for class '1' is low in both the models- 0.43 and 0.46.
The lower F1-score for class '1' also indicates difficulty while balancing the recall and precision. This may be due to the imbalanced data with nearly 80%-20% ratio between class '0' and class '1'. 
The Accuracy was observed higher due to the 80% of class '0' data. 

We will check if balancing the dataset may help improve the recall or not. For this we will perform SMOTE analsyis on our training data and then check various metrics on original test data.

# SMOTE Analysis

In [163]:
from imblearn.over_sampling import SMOTE

sm= SMOTE(random_state=42)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

In [None]:
y_train_sm.value_counts()

After performing SMOTE our training dataset is now balanced. Let's train the models again.

So, our data includes:
1. training data: X_train_sm, y_train_sm
2. testing data: X_test, y_test

# Retraining models

## 1. Decision tree smote model

Grid search cv is used for hyperparameter tuning. The training dataset is changed based on SMOTE analysis.

In [None]:
grid = {'criterion': ['gini', 'entropy'],
        'max_depth': range(2,15),
        'random_state': [0,42]
        }

grid_tree_sm= GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=grid, cv=5)

grid_tree_sm.fit(X_train_sm, y_train_sm)

In [None]:
grid_tree_sm.best_params_  # Get optimized hyperparameters

In [174]:
dtree_sm = DecisionTreeClassifier(criterion= grid_tree_sm.best_params_["criterion"],max_depth= grid_tree_sm.best_params_["max_depth"], random_state= grid_tree_sm.best_params_["random_state"]).fit(X_train_sm, y_train_sm)  # Train model based on optimized hyperparameters

In [None]:
acc_dtree_smote = dtree_sm.score(X_test, y_test)  # Accuracy of DT SMOTE model
acc_dtree_smote

In [176]:
y_dtree_sm = dtree_sm.predict(X_test)  # y predicted using DT SMOTE model

In [None]:
# Classification Report
cr_dtree_smote = classification_report(y_dtree_sm, y_test, output_dict=True)

print(cr_dtree_smote)  # Recall, precision, and f1 score

## 2. Random forest

Similarly, random forest model is also trained with smote training dataset along with hyperparameter tuning.

In [None]:
grid2 = {'max_depth': range(2,15),
           'criterion': ['gini', 'entropy'],
           'random_state': [0,42],
           'n_estimators':[100,150,200] 
           }

grid_rf_sm = GridSearchCV(RandomForestClassifier(), param_grid=grid2, cv=5)

grid_rf_sm.fit(X_train_sm, y_train_sm)

In [None]:
grid_rf_sm.best_params_ # Get optimized hyperparameters

In [181]:
rf_sm = RandomForestClassifier(n_estimators=grid_rf_sm.best_params_["n_estimators"],criterion= grid_rf_sm.best_params_["criterion"],max_depth= grid_rf_sm.best_params_["max_depth"], random_state= grid_rf_sm.best_params_["random_state"]).fit(X_train_sm, y_train_sm) # Train model based on optimized hyperparameters

In [182]:
y_rf_sm = rf_sm.predict(X_test)  # y predicted using RF SMOTE model

In [None]:
cr_rf_smote = classification_report(y_rf_sm, y_test, output_dict=True)  # Recall, precision, and f1 score
print(cr_rf_smote)

In [None]:
acc_rf_smote = rf_sm.score(X_test, y_test)  # Accuracy of RF smote model
acc_rf_smote

# Model Evaluations


In [None]:
# Extract relevant metrics
data = {
    'Model': ['Decision Tree', 'Random Forest', 'Decision Tree (SMOTE)', 'Random Forest (SMOTE)'],
    'Precision': [cr_dtree['1']['precision'], 
                  cr_rf['1']['precision'], 
                  cr_dtree_smote['1']['precision'], 
                  cr_dtree_smote['1']['precision']],
    'Recall': [cr_dtree['1']['recall'], 
               cr_rf['1']['recall'], 
               cr_dtree_smote['1']['recall'], 
               cr_rf_smote['1']['recall']],
    'F1-Score': [cr_dtree['1']['f1-score'], 
                 cr_rf['1']['f1-score'], 
                 cr_dtree_smote['1']['f1-score'], 
                 cr_rf_smote['1']['f1-score']],
    'Accuracy': [acc_dtree,
                acc_rf,
                acc_dtree_smote,
                acc_rf_smote]
    
   
}

# Create DataFrame
results_df = pd.DataFrame(data)
results_df


Precision tells us how good the model is at making correct positive predictions:

- The Random Forest model shines here, correctly predicting 78% of churners. It’s a reliable choice when it does say someone is likely to churn.

Recall shows how many actual churners the model can spot:
- The Random Forest with SMOTE does best in this category, catching 57.4% of actual churners. This means it's better at identifying customers who might leave.

F1-Score balances precision and recall:
- The Random Forest with SMOTE gets the highest F1-score (60%), indicating it does a good job of balancing being accurate and catching churners.

Accuracy reflects overall correct predictions:
- Looking at all other metrics Random Forest (SMOTE) has slightly lower accuracy that its respective non-SMOTE model. But the higher recall and F1-score compensates for that.

If we want to be sure about churn predictions, we will go with the Random Forest without SMOTE for its high precision. If spotting churners is our priority, the Random Forest with SMOTE is better. For a balanced approach, consider the Random Forest with SMOTE; it gives a good mix of catching churners while still being accurate.