# Predicting Customer Churn Using ML

**About Dataset**

This Global Customer Churn Dataset is curated to aid in understanding and predicting customer churn behavior across various industries. With detailed customer profiles, including demographics, product interactions, and banking behaviors, it is a useful resource for developing machine learning models that identify at-risk customers and for devising targeted retention strategies.

**Data Description:**

Each column in the dataset represents the following:

**RowNumber:** A unique identifier for each row in the dataset.

**CustomerId:** Unique customer identification number.

**Surname:** The last name of the customer (for privacy reasons, consider anonymizing this data if not already done).

**CreditScore:** The customer's credit score at the time of data collection.

**Geography:** The customer's country or region, providing insights into location-based trends in churn.

**Gender:** The customer's gender.

**Age:** The customer's age, valuable for demographic analysis.

**Tenure:** The number of years the customer has been with the bank.

**Balance:** The customer's account balance.

**NumOfProducts:** The number of products the customer has purchased or subscribed to.

**HasCrCard:** Indicates whether the customer has a credit card (1) or not (0).

**IsActiveMember:** Indicates whether the customer is an active member (1) or not (0).

**EstimatedSalary:** The customer's estimated salary.

**Exited:** The target variable, indicating whether the customer has churned (1) or not (0).

This dataset is well suited to exploratory data analysis, customer segmentation, predictive modeling of churn behavior, and the development of customer retention strategies. It offers rich insights for business strategists, data scientists, and researchers interested in improving customer loyalty and reducing churn rates.

Dataset Link: https://www.kaggle.com/datasets/anandshaw2001/customer-churn-dataset/data
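
The column descriptions above can also be written down as a simple schema check. The sketch below is illustrative only: `EXPECTED_COLUMNS`, `BINARY_COLUMNS`, and `check_schema` are hypothetical names introduced here, not part of the dataset or of pandas, and the three sample rows are copied from the dataset preview shown further down as a stand-in for the full CSV.

```python
import pandas as pd

# Column names as listed in the data description above.
EXPECTED_COLUMNS = [
    "RowNumber", "CustomerId", "Surname", "CreditScore", "Geography",
    "Gender", "Age", "Tenure", "Balance", "NumOfProducts",
    "HasCrCard", "IsActiveMember", "EstimatedSalary", "Exited",
]

# Columns documented as 0/1 flags.
BINARY_COLUMNS = ["HasCrCard", "IsActiveMember", "Exited"]


def check_schema(df):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in BINARY_COLUMNS:
        # Only check flags that are actually present in the frame.
        if col in df.columns and not df[col].isin([0, 1]).all():
            problems.append(f"{col} contains values other than 0 and 1")
    return problems


# Three sample rows standing in for the full CSV (assumption for illustration).
sample = pd.DataFrame(
    [
        [1, 15634602, "Hargrave", 619, "France", "Female", 42, 2, 0.00, 1, 1, 1, 101348.88, 1],
        [2, 15647311, "Hill", 608, "Spain", "Female", 41, 1, 83807.86, 1, 0, 1, 112542.58, 0],
        [3, 15619304, "Onio", 502, "France", "Female", 42, 8, 159660.80, 3, 1, 0, 113931.57, 1],
    ],
    columns=EXPECTED_COLUMNS,
)

print(check_schema(sample))
```

Running a check like this right after loading can catch a renamed or missing column before it causes a confusing error later in preprocessing.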


## Loading Required Libraries

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import FeatureUnion

In [2]:
# DatasetLoader class
class DatasetLoader:
    """Load a CSV dataset from a given file path."""

    def __init__(self, file_path):
        self.file_path = file_path

    def load_csv(self):
        """Return the CSV as a DataFrame, or None if loading fails."""
        try:
            return pd.read_csv(self.file_path)
        except FileNotFoundError:
            print(f"File '{self.file_path}' not found.")
            return None
        except Exception as e:
            print(f"An error occurred while loading the dataset: {e}")
            return None



In [None]:
def main():
    # Define the file path of the CSV dataset
    file_path = "Dataset/Churn_Modelling.csv"
    
    # Create an instance of DatasetLoader
    loader = DatasetLoader(file_path)
    
    # Load the CSV dataset (load_csv returns None on failure)
    dataset = loader.load_csv()
    
    # Check if the dataset was successfully loaded before using it
    if dataset is not None:
        print("Dataset loaded successfully.")
        #print(dataset.head())  # Display the first few rows of the dataset
        print(dataset.describe())
    else:
        print("Failed to load the dataset.")

if __name__ == "__main__":
    main()

In [9]:
dataset_churn = pd.read_csv("Dataset/Churn_Modelling.csv")
dataset_churn

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 9996 | 15606229 | Obijiaku | 771 | France | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | 9997 | 15569892 | Johnstone | 516 | France | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | 9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | 9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |


In [11]:
dataset_churn.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


The `info()` output above yields some useful findings. Of the 14 columns, 11 are numeric and only 3 (`Surname`, `Geography`, `Gender`) are of type object (string). However, although they are stored as integers, `Tenure`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, and the target column `Exited` are effectively categorical, so only 6 of the numeric columns are non-categorical. Every column has 10,000 non-null values. These findings matter for the data preprocessing step.
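
One way to surface numeric columns that behave like categories is to count the distinct values in each column. The sketch below is a rough heuristic under stated assumptions: `split_numeric_columns` and its `max_unique` cutoff are hypothetical names chosen for illustration, and the frame is toy data rather than the real dataset. On the full dataset, a cutoff of 11 would be enough to flag `Tenure`, which ranges from 0 to 10.

```python
import pandas as pd


def split_numeric_columns(df, max_unique=11):
    """Split numeric columns into (categorical_like, continuous_like).

    A numeric column is treated as categorical-like when it has at most
    `max_unique` distinct values. The cutoff is an assumption, not a rule.
    """
    numeric = df.select_dtypes(include="number")
    counts = numeric.nunique()
    categorical_like = counts[counts <= max_unique].index.tolist()
    continuous_like = counts[counts > max_unique].index.tolist()
    return categorical_like, continuous_like


# Toy data for illustration only (not the real dataset).
toy = pd.DataFrame(
    {
        "CreditScore": [619, 608, 502, 699, 850, 645],
        "Geography": ["France", "Spain", "France", "France", "Spain", "Spain"],
        "NumOfProducts": [1, 1, 3, 2, 1, 2],
        "Exited": [1, 0, 1, 0, 0, 1],
    }
)

# A small cutoff is used here because the toy frame has only six rows.
categorical_like, continuous_like = split_numeric_columns(toy, max_unique=4)
print(categorical_like)
print(continuous_like)
```

A heuristic like this cannot tell identifiers from measurements: columns such as `RowNumber` and `CustomerId` would land in the continuous group and still need to be excluded by hand.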

In [14]:
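# Positional indices of the columns left out of the summary:
# 0 RowNumber, 1 CustomerId, 2 Surname, 4 Geography, 5 Gender, 10 HasCrCard, 13 Exited
columns_to_omit = [0, 1, 2, 4, 5, 10, 13]
columns_to_keep = [col for col in range(len(dataset_churn.columns)) if col not in columns_to_omit]
data_filtered = dataset_churn.iloc[:, columns_to_keep]
data_filtered.describe()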

| | CreditScore | Age | Tenure | Balance | NumOfProducts | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.5151 | 100090.239881 |
| std | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.499797 | 57510.492818 |
| min | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11.58 |
| 25% | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 51002.11 |
| 50% | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 100193.915 |
| 75% | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 149388.2475 |
| max | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 199992.48 |


Here, the identifier columns (`RowNumber`, `CustomerId`), the string-based columns (`Surname`, `Geography`, `Gender`), and the binary columns `HasCrCard` and `Exited` have been omitted to give a clearer summary of the dataset (the binary `IsActiveMember` column is retained). <br>
From the summary: the dataset contains information on 10,000 bank customers, including their credit scores, ages, tenure with the bank, account balances, number of products held, activity status, and estimated salaries. On average, customers have a credit score of 650.53, are about 38.92 years old, and have been with the bank for approximately 5.01 years. The average account balance is 76,485.89, with a standard deviation of 62,397.41. Slightly more than half of the customers (51.51%) are active members. The estimated salary ranges from 11.58 to 199,992.48, with a median of 100,193.92. <br>
`Balance` has the largest standard deviation of any column (approximately 62,397.41), indicating high variability in account balances among customers.
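
Because the columns are on very different scales, standard deviations are easier to compare after dividing each by its mean (the coefficient of variation). The sketch below applies that to the means and standard deviations reported in the `describe()` output above. Restricting it to the four continuous columns, and the variable names `summary` and `cv`, are choices made for this illustration.

```python
# (mean, std) pairs copied from the describe() output above.
summary = {
    "CreditScore": (650.5288, 96.653299),
    "Age": (38.9218, 10.487806),
    "Balance": (76485.889288, 62397.405202),
    "EstimatedSalary": (100090.239881, 57510.492818),
}

# Coefficient of variation = std / mean (unitless).
cv = {name: std / mean for name, (mean, std) in summary.items()}

# Rank columns from most to least variable relative to their mean.
for name, value in sorted(cv.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

On these figures `Balance` also ranks highest on this relative measure (about 0.82), which is consistent with the observation above.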