# **Customer Churn Prediction**

**Problem Statement:**
The goal is to build a model that predicts customer churn for a subscription-based service or business. Using historical customer data, including usage behavior and customer demographics, the model should predict whether a customer is likely to churn. We compare several machine learning algorithms (Logistic Regression, Random Forest, and Gradient Boosting) on this task. A model that accurately identifies likely churners lets the business target its retention efforts and reduce customer attrition.

* Use algorithms like Logistic Regression, Random Forests, or Gradient Boosting to predict churn.

This is a binary classification problem: given a list of bank customers and their details, the goal is to predict whether a customer will stay with the bank or exit.

# Loading and Analyzing the dataset

In [1]:
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
df = pd.read_csv("/content/drive/MyDrive/CustomerChurn_Modelling.csv")

In [3]:
df.head()

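| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |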


In [4]:
df.shape

(10000, 14)

In [5]:
df.describe()

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |
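
The `Exited` mean of 0.2037 in the summary above means that roughly 20% of customers churned, so the classes are imbalanced: a model that always predicts "stay" would already reach about 80% accuracy. A minimal sketch of checking the class balance directly, using a small made-up series as a stand-in for `df["Exited"]` (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for df["Exited"]; 1 = churned, 0 = stayed.
exited = pd.Series([1, 0, 1, 0, 0, 0, 0, 0, 0, 0], name="Exited")

# Share of each class; normalize=True returns fractions instead of counts.
class_share = exited.value_counts(normalize=True)

# For a 0/1 column, the mean equals the fraction of 1s (the churn rate).
churn_rate = exited.mean()

print(class_share)
print("Churn rate:", churn_rate)
```

On the real data the same two calls can be applied to `df["Exited"]` unchanged.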


# Data Preprocessing and cleaning

Checking for Null values

In [6]:
df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

Checking for Duplicate values

In [7]:
df.duplicated().sum()

0

Removing unnecessary columns

In [8]:
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])
df.head()

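| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |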


Convert the **Gender** column from strings to numbers, encoding Male as 1 and Female as 0.

In [9]:
# Map the category labels to integers and assign the result back to the column
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df["Gender"].value_counts()

1    5457
0    4543
Name: Gender, dtype: int64

**One-hot encode** the Geography column. With `drop_first=True`, the first category (France) is dropped and serves as the baseline, leaving two indicator columns: Geography_Germany and Geography_Spain.

This can be done with the built-in *get_dummies()* function in pandas.

In [10]:
# dtype=int yields 0/1 indicator columns (newer pandas versions default to booleans)
df = pd.get_dummies(df, columns=["Geography"], drop_first=True, dtype=int)
df

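| | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 |
| 1 | 608 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 |
| 2 | 502 | 0 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 |
| 3 | 699 | 0 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 0 |
| 4 | 850 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 771 | 1 | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 | 0 | 0 |
| 9996 | 516 | 1 | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 | 0 | 0 |
| 9997 | 709 | 0 | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 | 0 | 0 |
| 9998 | 772 | 1 | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 | 1 | 0 |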
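
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up four-row frame (the rows are illustrative, not from the dataset). The first category in sorted order, France, gets no column of its own and is represented by zeros in both remaining columns:

```python
import pandas as pd

# Hypothetical toy frame with the three Geography values seen in the data.
toy = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})

# drop_first=True drops the first category (France), which becomes the baseline.
encoded = pd.get_dummies(toy, columns=["Geography"], drop_first=True, dtype=int)

print(encoded)
```

Dropping one category avoids a redundant column: a row that is neither Germany nor Spain must be France.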


# Splitting Training and Testing data

In [11]:
X = df.drop(columns=["Exited"])
Y = df["Exited"]

In [12]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

In [13]:
x_train.shape

(8000, 11)
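
Because only about 20% of customers churn, a purely random split can shift the churn rate slightly between the training and test sets. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 3 random features, exactly 20% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print("Train churn rate:", y_tr.mean())
print("Test churn rate:", y_te.mean())
```

In the notebook this would amount to adding `stratify=Y` to the existing call.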

In [14]:
x_train

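| | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2694 | 628 | 1 | 29 | 3 | 113146.98 | 2 | 0 | 1 | 124749.08 | 1 | 0 |
| 5140 | 626 | 0 | 29 | 4 | 105767.28 | 2 | 0 | 0 | 41104.82 | 0 | 0 |
| 2568 | 612 | 0 | 47 | 6 | 130024.87 | 1 | 1 | 1 | 45750.21 | 1 | 0 |
| 3671 | 646 | 0 | 52 | 6 | 111739.40 | 2 | 0 | 1 | 68367.18 | 1 | 0 |
| 7427 | 714 | 1 | 33 | 8 | 122017.19 | 1 | 0 | 0 | 162515.17 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2895 | 621 | 1 | 47 | 7 | 107363.29 | 1 | 1 | 1 | 66799.28 | 1 | 0 |
| 7813 | 684 | 0 | 63 | 3 | 81245.79 | 1 | 1 | 0 | 69643.31 | 1 | 0 |
| 905 | 672 | 0 | 45 | 9 | 0.00 | 1 | 1 | 1 | 92027.69 | 0 | 0 |
| 5192 | 663 | 0 | 39 | 8 | 0.00 | 2 | 1 | 1 | 101168.90 | 0 | 0 |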


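Scale the training and test features using *StandardScaler*. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.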

In [20]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(x_train)
X_test_scaled = scaler.transform(x_test)

In [21]:
X_train_scaled

array([[-0.23082038,  0.91509065, -0.94449979, ...,  0.42739449,
         1.71490137, -0.57273139],
       [-0.25150912, -1.09278791, -0.94449979, ..., -1.02548708,
        -0.58312392, -0.57273139],
       [-0.3963303 , -1.09278791,  0.77498705, ..., -0.94479772,
         1.71490137, -0.57273139],
       ...,
       [ 0.22433188, -1.09278791,  0.58393295, ..., -0.14096853,
        -0.58312392, -0.57273139],
       [ 0.13123255, -1.09278791,  0.01077067, ...,  0.01781218,
        -0.58312392, -0.57273139],
       [ 1.1656695 ,  0.91509065,  0.29735181, ..., -1.15822478,
         1.71490137, -0.57273139]])
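
As a sanity check on what the scaler does, the sketch below reproduces the transformation by hand on a tiny made-up array: each column is shifted by the training mean and divided by the training standard deviation, and the test rows reuse those same training statistics. The arrays are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 training rows and 1 test row, 2 features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[4.0, 30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the training mean/std

# Manual version: z = (x - train_mean) / train_std, with population std (ddof=0).
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled)
print(manual_test)
```

The two printed arrays should match, and the scaled training columns have mean 0 and standard deviation 1.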

# Model Selection and Training

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

**Logistic Regression** Implementation

In [22]:
# Initialize and train Logistic Regression
log_reg_model = LogisticRegression()
log_reg_model.fit(X_train_scaled, y_train)

# Make predictions using Logistic Regression
y_pred_lr = log_reg_model.predict(X_test_scaled)

# Evaluate Logistic Regression model
print("Logistic Regression Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Classification Report:")
print(classification_report(y_test, y_pred_lr))

Logistic Regression Performance:
Accuracy: 0.8125
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.97      0.89      1585
           1       0.64      0.22      0.33       415

    accuracy                           0.81      2000
   macro avg       0.73      0.59      0.61      2000
weighted avg       0.79      0.81      0.77      2000
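
The report above shows a recall of only 0.22 for class 1: the model catches roughly one in five actual churners, even though overall accuracy is 0.81. One common way to trade some precision for recall is `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below compares the two settings on synthetic imbalanced data generated with `make_classification`; it is an illustration under those assumptions, not a result for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly an 80/20 class split (not the churn data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Fit once with default weights and once with balanced class weights,
# recording recall on the minority class (label 1) for each.
recalls = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, model.predict(X_te))

print(recalls)
```

Whether the trade-off is worthwhile depends on the relative cost of missing a churner versus contacting a customer who would have stayed.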



**Random Forest** Implementation

In [23]:
# Initialize and train Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = rf_model.predict(X_test_scaled)

# Evaluate model
print("Random Forest Classifier Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Random Forest Classifier Performance:
Accuracy: 0.8645
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.97      0.92      1585
           1       0.80      0.46      0.58       415

    accuracy                           0.86      2000
   macro avg       0.84      0.72      0.75      2000
weighted avg       0.86      0.86      0.85      2000
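
A fitted random forest also exposes `feature_importances_`, which can hint at which inputs drive the predictions. The sketch below shows how to rank them on synthetic data; the feature names `f0` to `f4` are placeholders, and in the notebook one would pair `rf_model.feature_importances_` with `x_train.columns` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
feature_names = ["f0", "f1", "f2", "f3", "f4"]
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, sorted from high to low.
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

These impurity-based scores are relative, and they are best read as a rough ranking rather than a precise measure of effect.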



**Gradient Boosting** Implementation

In [24]:
# Initialize and train Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train_scaled, y_train)

# Make predictions using Gradient Boosting
y_pred_gb = gb_model.predict(X_test_scaled)

# Evaluate Gradient Boosting model
print("\nGradient Boosting Classifier Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_gb))
print("Classification Report:")
print(classification_report(y_test, y_pred_gb))


Gradient Boosting Classifier Performance:
Accuracy: 0.8615
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.96      0.92      1585
           1       0.77      0.47      0.59       415

    accuracy                           0.86      2000
   macro avg       0.82      0.72      0.75      2000
weighted avg       0.85      0.86      0.85      2000
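
On this single train/test split, Random Forest (accuracy 0.8645, class-1 F1 0.58) and Gradient Boosting (0.8615, 0.59) are nearly tied, and both are clearly ahead of Logistic Regression (0.8125, 0.33). A gap as small as the one between the two ensembles may not hold on a different split, so a cross-validated comparison is a reasonable next step. The sketch below shows one way to set that up on synthetic data; the dataset, the smaller `n_estimators`, and the choice of F1 as the metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in data (not the churn dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

# The pipeline refits the scaler inside every fold, so validation folds
# never influence the scaling statistics.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validated F1 on the positive class for each model.
mean_f1 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
    mean_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {mean_f1[name]:.3f}")
```

Applied to the notebook's data, the same loop would take the unscaled `X` and `Y` as inputs.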

