# Prediction of Churn

From the data visualisation, we found that the mean number of days since the last transaction differed significantly between churned and non-churned customers. In the following code, we therefore use days_since_last_transaction as the only feature to predict churn.

## Random Forest

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import os

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestRegressor

In [2]:
# Reading the dataset
path = os.path.join('c:' + os.sep, 'Users', 'Isaac', 'Desktop', 'Churn', 'Dataset', 'visathon_train_data.csv')
df = pd.read_csv(path)

In [3]:
# Find the datatype of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17654 entries, 0 to 17653
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   customer_id                     17654 non-null  float64
 1   vintage                         17654 non-null  float64
 2   age                             17654 non-null  float64
 3   gender                          17654 non-null  object 
 4   dependents                      17654 non-null  float64
 5   occupation                      17472 non-null  object 
 6   customer_nw_category            17654 non-null  object 
 7   branch_code                     17654 non-null  float64
 8   days_since_last_transaction     17654 non-null  float64
 9   current_balance                 16939 non-null  float64
 10  previous_month_end_balance      17654 non-null  float64
 11  average_monthly_balance_prevQ   17654 non-null  float64
 12  average_monthly_balance_prevQ2  

In [4]:
# Convert churn from object to bool: True where the value is 'Yes', False otherwise
df['churn'] = (df['churn'] == 'Yes')

As we see below, the days_since_last_transaction column contains no null values, so there is no need to remove any rows.

In [5]:
df.isnull().sum()

customer_id                         0
vintage                             0
age                                 0
gender                              0
dependents                          0
occupation                        182
customer_nw_category                0
branch_code                         0
days_since_last_transaction         0
current_balance                   715
previous_month_end_balance          0
average_monthly_balance_prevQ       0
average_monthly_balance_prevQ2      0
current_month_credit                0
previous_month_credit               0
current_month_debit                 0
previous_month_debit              887
current_month_balance             816
previous_month_balance              0
churn                               0
dtype: int64

In [6]:
# Split dataset into train and test
features = df['days_since_last_transaction']
labels = df['churn']


train_features, test_features, train_labels, test_labels = train_test_split(
    features.values.reshape(-1,1),labels,test_size = 0.25, random_state = 42)

In [7]:
# Ensure train and test sets are correctly split 
print(train_features.shape)
print(train_labels.shape)
print(test_features.shape)
print(test_labels.shape)


(13240, 1)
(13240,)
(4414, 1)
(4414,)
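The split above is purely random. When one class is much more common than the other, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below illustrates this on synthetic stand-in data (an assumption, not the Visathon dataset); the names `X`, `y`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one numeric feature and a
# boolean label that is True for roughly 80% of the rows.
rng = np.random.default_rng(0)
X = rng.integers(0, 365, size=(1000, 1))
y = rng.random(1000) < 0.8

# stratify=y asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(X_tr.shape, X_te.shape)
print(f"share of True: all={y.mean():.3f}, "
      f"train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```

With the notebook's data, the equivalent call would add `stratify=labels` to the existing arguments.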


In [8]:
# Building model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)

In [9]:
# Train model on training data
rf.fit(train_features, train_labels)

RandomForestRegressor(n_estimators=1000, random_state=42)

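The random forest reaches an accuracy of only about 2%. Note that RandomForestRegressor returns continuous scores rather than class labels, so the exact-equality check below counts a prediction as correct only when the score is exactly 0 or 1.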

In [10]:
# Prediction on test data
predictions = rf.predict(test_features)

# Accuracy
accuracy = sum(predictions == test_labels)/test_labels.shape[0] * 100
print(accuracy)

2.174898051653829


Hence, we shall try logistic regression instead.
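The very low score is likely due in part to how it was measured: a regressor outputs averaged scores between 0 and 1, and these are compared for exact equality with boolean labels. A minimal sketch of two possible alternatives follows: thresholding the regressor's scores at 0.5, or using `RandomForestClassifier`, which returns labels directly. The data here is synthetic (an assumption, not the Visathon dataset), `n_estimators=100` is chosen only to keep the run short, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one numeric feature and a
# noisy boolean label that depends on it.
rng = np.random.default_rng(42)
X = rng.integers(0, 365, size=(2000, 1)).astype(float)
y = (X[:, 0] + rng.normal(0, 80, size=2000)) > 180

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Regressor: predictions are averaged scores in [0, 1], not labels.
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train.astype(float))
reg_scores = reg.predict(X_test)

# Exact equality only holds when a score is exactly 0.0 or 1.0.
exact_match_acc = np.mean(reg_scores == y_test)

# Option 1: turn scores into labels with a 0.5 threshold.
thresholded_acc = np.mean((reg_scores >= 0.5) == y_test)

# Option 2: use the classifier, which predicts labels directly.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
clf_acc = clf.score(X_test, y_test)

print(f"exact match on regressor scores: {exact_match_acc:.3f}")
print(f"regressor thresholded at 0.5:    {thresholded_acc:.3f}")
print(f"classifier accuracy:             {clf_acc:.3f}")
```

The numbers printed here describe the synthetic data only and say nothing about how a classifier would perform on the churn dataset.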

## Logistic Regression

In [18]:
logModel = LogisticRegression()
logModel.fit(train_features, train_labels)
predLogReg = logModel.predict(test_features)

In [19]:
# Classification report and confusion matrix

print(classification_report(test_labels,predLogReg))
print(confusion_matrix(test_labels,predLogReg))

              precision    recall  f1-score   support

       False       0.00      0.00      0.00       848
        True       0.81      1.00      0.89      3566

    accuracy                           0.81      4414
   macro avg       0.40      0.50      0.45      4414
weighted avg       0.65      0.81      0.72      4414

[[   0  848]
 [   0 3566]]




In [13]:
# Predictions on test data for submission

path2 = os.path.join('c:' + os.sep, 'Users', 'Isaac', 'Desktop', 'Churn', 'Dataset', 'visathon_test_data.csv')
df2 = pd.read_csv(path2)
features2 = df2['days_since_last_transaction']

predLogReg2 = logModel.predict(features2.values.reshape(-1,1))
submit = np.concatenate((df2['customer_id'].values.reshape(-1,1),predLogReg2.reshape(-1,1)), axis=1)

np.savetxt('pred.csv', submit, delimiter=',', fmt='%i', header='customer_id,churn', comments='')

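Logistic regression reaches an accuracy of 81%, which is higher. However, the confusion matrix shows that the model predicts True for every test sample, so this figure simply equals the share of the True class in the test set (3566 of 4414).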
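To put the 81% in context, the accuracy can be recomputed from the confusion matrix printed earlier and compared with two reference figures: the accuracy of always predicting the majority class, and the balanced accuracy (the mean of the per-class recalls). The sketch below uses only the four counts from that matrix; the variable names are hypothetical.

```python
# Confusion matrix copied from the logistic regression output.
# Rows are true labels (False, True); columns are predicted labels.
cm = [[0, 848],
      [0, 3566]]

total = sum(sum(row) for row in cm)

# Correct predictions sit on the diagonal.
accuracy = (cm[0][0] + cm[1][1]) / total

# Accuracy obtained by always predicting the most common true class.
majority_baseline = max(sum(row) for row in cm) / total

# Recall for each true class, then their mean (balanced accuracy).
recalls = [cm[i][i] / sum(cm[i]) for i in range(2)]
balanced_accuracy = sum(recalls) / len(recalls)

print(f"accuracy:          {accuracy:.3f}")           # 0.808
print(f"majority baseline: {majority_baseline:.3f}")  # 0.808
print(f"balanced accuracy: {balanced_accuracy:.3f}")  # 0.500
```

The accuracy and the majority baseline coincide, and the balanced accuracy is 0.5, which is what a model that ignores the feature would achieve.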