## **Bank Customer Churn Prediction**

### **1. Introduction**

#### **1.1 Objectives**
1. Conduct data exploration and data cleaning.
2. Conduct descriptive analysis of the data.
3. Determine the relationships between variables.
4. Conduct churn prediction, given the variables in the dataset.
5. Share insights and findings through visualizations and observations.

#### **1.2 Methodology**
1. Use Python and the necessary packages to conduct exploratory data analysis and create data visualizations.
2. Check the dataset for incorrect datatypes and invalid data.
3. Check the dataset for null values.
4. If incorrect datatypes, invalid data, or null values are present, take the necessary actions to correct the data.
5. Create additional columns (if necessary) by using data extracted from an existing column.
6. Create data visualizations for descriptive analysis.
7. Plot the relationships between variables using visualization tools such as Seaborn and Matplotlib.
8. Write observations based on the results of the analysis.
9. Create a machine learning model to predict the likelihood that a customer will churn, based on the variables in the dataset.
10. Evaluate the results of the machine learning model and share insights and recommendations.
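
Step 5 of the methodology (deriving columns from an existing one) could look like the following minimal sketch. It assumes `last_transaction` can be parsed as a date; the sample frame and the derived column names `last_txn_month` and `last_txn_dayofweek` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample mimicking the `last_transaction` column, with one missing value.
sample = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "last_transaction": ["2019-05-21", "2019-11-01", None],
})

# Parse strings into datetimes; missing or unparseable entries become NaT.
sample["last_transaction"] = pd.to_datetime(sample["last_transaction"], errors="coerce")

# Derive new columns from the existing datetime column (names are illustrative).
sample["last_txn_month"] = sample["last_transaction"].dt.month
sample["last_txn_dayofweek"] = sample["last_transaction"].dt.day_name()

print(sample)
```

Rows without a last transaction keep a missing value in the derived columns, so they remain distinguishable from real dates.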

#### **1.3 Description of Variables**

Column | Description
----- | -----
customer_id |unique identifier for each customer
vintage |the duration of the customer's relationship with the company
age |age of the customer
gender |gender of the customer
dependents |number of dependents the customer has
occupation |the occupation of the customer
city |city in which the customer is located
customer_nw_category |net worth category of the customer
branch_code |code identifying the branch associated with the customer
current_balance |current balance in the customer's account
previous_month_end_balance |account balance at the end of the previous month
average_monthly_balance_prevQ |average monthly balance in the previous quarter
average_monthly_balance_prevQ2 |average monthly balance in the second previous quarter
current_month_credit |credit amount in the current month
previous_month_credit |credit amount in the previous month
current_month_debit |debit amount in the current month
previous_month_debit |debit amount in the previous month
current_month_balance |account balance in the current month
previous_month_balance |account balance in the previous month
churn |variable indicating whether the customer has churned (1 - churned, 0 - not churned)
last_transaction |timestamp of the customer's last transaction


### **2. Data Preparation**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [3]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\Documents\CSV Datasets\Bank Churn\churn_prediction.csv")

In [4]:
df.head()

| | customer_id | vintage | age | gender | dependents | occupation | city | customer_nw_category | branch_code | current_balance | ... | average_monthly_balance_prevQ | average_monthly_balance_prevQ2 | current_month_credit | previous_month_credit | current_month_debit | previous_month_debit | current_month_balance | previous_month_balance | churn | last_transaction |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | 1 | 2101 | 66 | Male | 0.0 | self_employed | 187.0 | 2 | 755 | 1458.71 | ... | 1458.71 | 1449.07 | 0.2 | 0.2 | 0.2 | 0.2 | 1458.71 | 1458.71 | 0 | 2019-05-21 |
| 1 | 2 | 2348 | 35 | Male | 0.0 | self_employed | NaN | 2 | 3214 | 5390.37 | ... | 7799.26 | 12419.41 | 0.56 | 0.56 | 5486.27 | 100.56 | 6496.78 | 8787.61 | 0 | 2019-11-01 |
| 2 | 4 | 2194 | 31 | Male | 0.0 | salaried | 146.0 | 2 | 41 | 3913.16 | ... | 4910.17 | 2815.94 | 0.61 | 0.61 | 6046.73 | 259.23 | 5006.28 | 5070.14 | 0 | NaT |
| 3 | 5 | 2329 | 90 | NaN | NaN | self_employed | 1020.0 | 2 | 582 | 2291.91 | ... | 2084.54 | 1006.54 | 0.47 | 0.47 | 0.47 | 2143.33 | 2291.91 | 1669.79 | 1 | 2019-08-06 |
| 4 | 6 | 1579 | 42 | Male | 2.0 | self_employed | 1494.0 | 3 | 388 | 927.72 | ... | 1643.31 | 1871.12 | 0.33 | 714.61 | 588.62 | 1538.06 | 1157.15 | 1677.16 | 1 | 2019-11-03 |


### **3. Data Cleaning**

In [10]:
## Check datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28382 entries, 0 to 28381
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   customer_id                     28382 non-null  int64  
 1   vintage                         28382 non-null  int64  
 2   age                             28382 non-null  int64  
 3   gender                          27857 non-null  object 
 4   dependents                      25919 non-null  float64
 5   occupation                      28302 non-null  object 
 6   city                            27579 non-null  float64
 7   customer_nw_category            28382 non-null  int64  
 8   branch_code                     28382 non-null  int64  
 9   current_balance                 28382 non-null  float64
 10  previous_month_end_balance      28382 non-null  float64
 11  average_monthly_balance_prevQ   28382 non-null  float64
 12  average_monthly_balance_prevQ2  

In [12]:
## Check the percentage of null values for each column in the dataset
for col in df.columns:
    # np.round() without a decimals argument rounds to the nearest whole percent
    print('Null Values for column {} is {}%'.format(col, np.round(df[col].isnull().sum() * 100 / len(df[col]))))

Null Values for column customer_id is 0.0%
Null Values for column vintage is 0.0%
Null Values for column age is 0.0%
Null Values for column gender is 2.0%
Null Values for column dependents is 9.0%
Null Values for column occupation is 0.0%
Null Values for column city is 3.0%
Null Values for column customer_nw_category is 0.0%
Null Values for column branch_code is 0.0%
Null Values for column current_balance is 0.0%
Null Values for column previous_month_end_balance is 0.0%
Null Values for column average_monthly_balance_prevQ is 0.0%
Null Values for column average_monthly_balance_prevQ2 is 0.0%
Null Values for column current_month_credit is 0.0%
Null Values for column previous_month_credit is 0.0%
Null Values for column current_month_debit is 0.0%
Null Values for column previous_month_debit is 0.0%
Null Values for column current_month_balance is 0.0%
Null Values for column previous_month_balance is 0.0%
Null Values for column churn is 0.0%
Null Values for column last_transaction is 0.0%
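
The same check can be written without a loop. The sketch below is one possible variant, not the notebook's own code: it uses a small hypothetical frame named `demo` in place of `df` and rounds to two decimals, which keeps small shares visible that whole-percent rounding would show as 0.0%.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`; in the notebook, the loaded DataFrame would be used.
demo = pd.DataFrame({
    "gender": ["Male", None, "Male", "Female"],
    "dependents": [0.0, np.nan, np.nan, 2.0],
    "age": [66, 35, 31, 90],
})

# isnull() yields booleans; their column-wise mean is the fraction of nulls.
null_pct = (demo.isnull().mean() * 100).round(2)

for col, pct in null_pct.items():
    print(f"Null values for column {col}: {pct}%")
```

Because `null_pct` is a Series, it can also be sorted or filtered (for example `null_pct[null_pct > 0]`) to list only the affected columns.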


### Findings
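1. The dataset contains incorrect datatypes. Both columns below hold whole numbers but are stored as `float64`, which is how pandas stores numeric columns that contain null values:
   - dependents (datatype is `float64` - will be changed into `int64`)
   - city (datatype is `float64` - will be changed into `int64`)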
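2. The dataset contains null values in the following columns (percentages are rounded to the nearest whole percent):
   - gender (2.0% of the total)
   - dependents (9.0% of the total)
   - city (3.0% of the total)
   - occupation (80 null values according to `df.info()`, about 0.3% of the total, which the rounding displays as 0.0%)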

### Actions
#### Incorrect datatypes
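1. Change the datatype of the `dependents` and `city` columns to `int64`. Because `int64` cannot represent null values, this conversion is only possible once the null values in these columns have been handled.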
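
A direct cast to `int64` raises an error while nulls are present. The sketch below illustrates this and shows two possible ways around it; both are assumptions rather than the notebook's chosen approach. `demo` is a hypothetical sample, and `Int64` (capital "I") is pandas' nullable integer dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same problem: whole numbers stored as floats, plus nulls.
demo = pd.DataFrame({
    "dependents": [0.0, np.nan, 2.0],
    "city": [187.0, np.nan, 146.0],
})

# Plain int64 cannot represent nulls, so this cast raises while NaN is present.
try:
    demo["dependents"].astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# Option A (assumption): the nullable "Int64" dtype keeps nulls as <NA>.
nullable = demo.astype({"dependents": "Int64", "city": "Int64"})

# Option B (assumption): handle nulls first (here by dropping rows), then cast.
dropped = demo.dropna(subset=["dependents", "city"]).astype("int64")

print(cast_failed)
print(nullable.dtypes)
print(dropped.dtypes)
```

Option A keeps every row at the cost of a pandas-specific dtype; Option B yields plain `int64` but loses the rows that contain nulls.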

#### Null values
2. The `dependents` column contains values of 0.0, which indicate that a customer has no dependents, so a null value in this column does not mean that the customer has no dependents. Imputing values for this column is not advisable for two reasons: more than 5.0% of its values are null, so imputation might cause over-representation of the imputed value, and the values are customer-provided, so they cannot be imputed using the column's mean, median, or mode.
3. The same applies to the `gender` and `city` columns. Although less than 5.0% of their values are null, these values are also customer-provided and cannot be imputed using the column's mean, median, or mode.
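
With imputation ruled out, it can help to measure how many rows contain a null in these columns and whether those rows differ in churn rate before deciding how to treat them. The following is a minimal sketch on a hypothetical frame named `demo`; the labels `has_null` and `complete` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for `df`.
demo = pd.DataFrame({
    "gender": ["Male", None, "Male", "Female", "Male"],
    "dependents": [0.0, 0.0, np.nan, 2.0, 1.0],
    "city": [187.0, np.nan, 146.0, 1020.0, 1494.0],
    "churn": [0, 0, 0, 1, 1],
})

cols = ["gender", "dependents", "city"]

# True for rows with at least one null among the three columns.
has_null = demo[cols].isnull().any(axis=1)
print(f"Rows with a null: {has_null.sum()} of {len(demo)}")

# Compare churn rates between rows with nulls and complete rows.
group = has_null.map({True: "has_null", False: "complete"})
churn_by_missing = demo.groupby(group)["churn"].mean()
print(churn_by_missing)
```

A large gap between the two churn rates would suggest that removing rows with nulls could shift the target distribution.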