# Churn Analysis

Customer churn is an important and challenging problem for e-commerce and other online businesses. When several competitors operate in the same market, it is essential to re-engage existing customers and keep them from churning.

Customer churn (also known as customer attrition) can be grouped into several categories. Contractual churn, which applies to businesses such as cable companies and SaaS providers, occurs when customers decide not to renew an expired contract. Voluntary churn occurs when a customer decides to cancel an existing service, and applies to companies such as prepaid cellphone and streaming subscription providers. Non-contractual churn occurs when consumers abandon a possible purchase without completing the transaction, and applies to businesses that rely on retail locations, online stores or online borrowing services. Lastly, involuntary churn occurs when, for instance, a customer cannot pay their credit card bill and therefore no longer stays with the credit card company.

The reasons for churn vary and require domain knowledge to define properly, but common ones include lack of product usage, poor service and a better price elsewhere. Whatever the industry-specific reasons, one thing holds in every domain: it costs more to acquire new customers than to retain existing ones. This has a direct impact on operating costs and marketing budgets.

Over a given period, a business's customers fall into three major categories:
   
  a) Newly Acquired Customers  
  b) Existing Customers  
  c) Churned Customers  
  
  
  
Churned customers are those who have decided to end their relationship with the company. This can happen for a variety of reasons, such as:  

   a) Bad customer service   
   b) Bad onboarding   
   c) Lack of ongoing customer success   
   
Each churned customer represents a direct loss of marketing acquisition cost and of the revenue that could have been earned after the sale. Predicting in advance which customers are likely to churn therefore helps avoid this loss.

Once we know our best customers through segmentation and lifetime value prediction, we should also work hard to retain them. That is what makes Retention Rate one of the most critical metrics.

Retention Rate indicates how good your product-market fit (PMF) is. If your PMF is not satisfactory, you should see your customers churning very soon. One powerful tool for improving Retention Rate (and hence PMF) is churn prediction, which estimates who is likely to churn in a given period.
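
As a minimal sketch of how these rates can be computed, assuming one common definition (retained customers divided by customers at the start of the period): the function names and the numbers below are illustrative and are not taken from the dataset used later.

```python
def retention_rate(customers_at_start, customers_at_end, newly_acquired):
    """Share of the starting customers still present at the end of the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    # Customers acquired during the period do not count as retained.
    retained = customers_at_end - newly_acquired
    return retained / customers_at_start


def churn_rate(customers_at_start, customers_at_end, newly_acquired):
    """Complement of the retention rate under the same definition."""
    return 1.0 - retention_rate(customers_at_start, customers_at_end, newly_acquired)


# Hypothetical period: 1000 customers at the start, 950 at the end, 150 of them new.
print(f"Retention rate: {retention_rate(1000, 950, 150):.1%}")
print(f"Churn rate: {churn_rate(1000, 950, 150):.1%}")
```

Other definitions exist (for example, cohort-based retention), so the formula should match whatever the business already reports.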

The best way to avoid customer churn is to know your customers, and the best way to know your customer is through historical and new customer data.

Aims:

“Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.” 

  1. Investigate how the features affect retention using logistic regression.
  2. Build a classification model.

### Import required libraries

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

## About the Data

### Read the data

In [None]:
df = pd.read_csv('Churn_Modelling.csv')

### Review the data

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

In [None]:
df.nunique()

### Clean the data

In [None]:
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

In [None]:
new_names = {
    'CreditScore': 'credit_score',
    'Geography': 'country',
    'Gender': 'gender',
    'Age': 'age',
    'Tenure': 'tenure',
    'Balance': 'balance',
    'NumOfProducts': 'number_products',
    'HasCrCard': 'owns_credit_card',
    'IsActiveMember': 'is_active_member',
    'EstimatedSalary': 'estimated_salary',
    'Exited': 'exited'
}

In [None]:
df.rename(columns=new_names, inplace=True)

## Exploratory Data Analysis

In [None]:
amount_retained = df[df['exited'] == 0]['exited'].count() / df.shape[0] * 100
amount_lost = df[df['exited'] == 1]['exited'].count() / df.shape[0] * 100

In [None]:
fig, ax = plt.subplots()
sns.countplot(x='exited', palette="Set3", data=df)
plt.xticks([0, 1], ['Retained', 'Lost'])
plt.xlabel('Condition', size=15, labelpad=12, color='grey')
plt.ylabel('Amount of customers', size=15, labelpad=12, color='grey')
plt.title("Proportion of customers lost and retained", size=15, pad=20)
plt.ylim(0, 9000)
plt.text(-0.15, 7000, f"{round(amount_retained, 2)}%", fontsize=12)
plt.text(0.85, 1000, f"{round(amount_lost, 2)}%", fontsize=12)
sns.despine()
plt.show()

In [None]:
categorical_labels = [['gender', 'country'], ['owns_credit_card', 'is_active_member']]
colors = [['Set1', 'Set2'], ['Set3', 'PuRd']]

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
for i in range(2):
    for j in range(2):
        feature = categorical_labels[i][j]
        color = colors[i][j]
        ax1 = sns.countplot(x=feature, hue='exited', palette=color, data=df, ax=ax[i][j])
        ax1.set_xlabel(feature, labelpad=10)
        ax1.set_ylim(0, 6000)
        ax1.legend(title='Exited', labels= ['No', 'Yes'])
        if i == 1:
            ax1.set_xticklabels(['No', 'Yes'])
sns.despine()

In [None]:
df.columns

In [None]:
numerical_labels = [['age', 'credit_score'], 
                    ['tenure', 'balance'],
                   ['number_products', 'estimated_salary']]
num_colors = [['Set1', 'Set2'], 
              ['Set3', 'PuRd'],
              ['Spectral', 'Wistia']]

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(12, 12))
for i in range(3):
    for j in range(2):
        feature = numerical_labels[i][j]
        color = num_colors[i][j]
        ax1 = sns.boxplot(x='exited', y=feature, palette=color, data=df, ax=ax[i][j])
        ax1.set_xlabel('Exited', labelpad=10)
        ax1.set_xticklabels(['No', 'Yes'])
sns.despine()

In [None]:
sns.pairplot(df, vars=['age', 'credit_score', 'balance', 'estimated_salary'], 
             hue="exited", palette='husl')
sns.despine()

## Feature Engineering

### New Variable Creation

In [None]:
df['creditscore_age_ratio'] = df['credit_score'] / df['age']

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
sns.boxplot(y='creditscore_age_ratio', x='exited', palette='summer', data=df)
ax.set_xticklabels(['No', 'Yes'])
sns.despine()

In [None]:
df['balance_salary_ratio'] = df['balance'] / df['estimated_salary']

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
sns.boxplot(y='balance_salary_ratio', x='exited', palette='winter', data=df)
ax.set_xticklabels(['No', 'Yes'])
ax.set_ylim(-1, 6)
sns.despine()

### Encoding Categorical Variables

In [None]:
x = df.drop('exited', axis=1)
y = df['exited']

In [None]:
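# Replace each categorical column with integer codes.
# Assigning the whole column avoids dtype issues with in-place .loc updates.
for label in ['gender', 'country']:
    le = LabelEncoder()
    x[label] = le.fit_transform(x[label])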
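
To make the encoding step concrete, the sketch below mimics in plain Python what label encoding does: each distinct category is mapped to an integer code in sorted order. The helper names and the sample values are hypothetical; the notebook itself relies on `LabelEncoder`.

```python
def fit_label_mapping(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def transform(values, mapping):
    """Replace every category by its integer code."""
    return [mapping[value] for value in values]


# Made-up sample of the 'country' column.
countries = ["France", "Spain", "Germany", "France"]
mapping = fit_label_mapping(countries)
encoded = transform(countries, mapping)
print(mapping)
print(encoded)
```

Integer codes impose an arbitrary order on the categories. For a column with more than two values, such as `country`, one-hot encoding is a common alternative when that order could mislead a linear model.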

### Split the data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, 
                                                    shuffle=True, stratify=y)
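
`stratify=y` asks `train_test_split` to keep the share of churned customers roughly equal in both subsets. The sketch below illustrates the idea in plain Python with made-up labels; `stratified_split` is a hypothetical helper, not scikit-learn's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Split indices so that each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 80 retained (0) and 20 churned (1) customers.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
test_churn_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_churn_share)
```

Passing a fixed `random_state` to `train_test_split` plays the same role as `seed` here and makes the split reproducible between runs.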

## Model fitting

In [None]:
def print_best_model(model):
    print(f"The best parameters are: {model.best_params_}")
    print(f"The best model score is: {model.best_score_}")    
    print(f"The best estimator is: {model.best_estimator_}")

In [None]:
def get_auc_scores(y_actual, method, method2):
    # method: predictions used for the AUC score; method2: scores used for the ROC curve
    auc_score = roc_auc_score(y_actual, method)
    fpr_df, tpr_df, _ = roc_curve(y_actual, method2)
    return (auc_score, fpr_df, tpr_df)
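
The AUC can be read as the probability that a randomly chosen churned customer receives a higher score than a randomly chosen retained one. The sketch below computes it by brute force on made-up labels (`pairwise_auc` is a hypothetical helper, not scikit-learn's algorithm) and suggests why passing hard 0/1 predictions instead of continuous scores can lower the reported value.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: continuous scores rank every churner above every retained customer.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8]
hard_predictions = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_predictions)
print(auc_from_scores, auc_from_labels)
```

Thresholding throws away the ranking among customers on the same side of the cut-off, so the AUC computed from hard predictions is usually the less informative of the two.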

### 1. Parameter Searching

#### Logistic Regression

In [None]:
param_grid_log = {
    'C': [0.1, 1, 10, 50, 100, 200],
    'max_iter': [200, 300],
    'penalty': ['l2'],
    'tol':[0.00001, 0.0001],
}

In [None]:
log_first = LogisticRegression(solver='lbfgs')

In [None]:
log_grid = GridSearchCV(log_first, param_grid=param_grid_log, cv=10, verbose=1)
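
Before launching the search it can help to estimate how many models will be trained. The sketch below expands the logistic regression grid defined above in plain Python and multiplies by the number of folds; `expand_grid` is a hypothetical helper used only for counting.

```python
from itertools import product


def expand_grid(param_grid):
    """List every parameter combination in the grid."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[key] for key in keys))]


# Same grid and fold count as in the cells above.
param_grid_log = {
    'C': [0.1, 1, 10, 50, 100, 200],
    'max_iter': [200, 300],
    'penalty': ['l2'],
    'tol': [0.00001, 0.0001],
}
cv_folds = 10

candidates = expand_grid(param_grid_log)
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With `refit=True`, the default, `GridSearchCV` trains the best candidate one more time on all the data passed to `fit`.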

In [None]:
log_grid.fit(x, y)

In [None]:
best_log_estimator = log_grid.best_estimator_

In [None]:
print_best_model(log_grid)

#### Support Vector Machine

In [None]:
param_grid_svm = {
    'C': [0.5, 100, 150],
    'kernel': ['rbf'],
    'gamma': [0.1, 0.01, 0.001]
}

In [None]:
svm_first = SVC()

In [None]:
svm_grid = GridSearchCV(svm_first, param_grid=param_grid_svm, cv=3, verbose=3, n_jobs=-2)

In [None]:
svm_grid.fit(x, y)

In [None]:
best_svm_estimator = svm_grid.best_estimator_

In [None]:
print_best_model(svm_grid)

In [None]:
param_grid_svm_poly = {
    'C': [0.5, 1, 10],
    'kernel': ['poly'],
    'degree': [2, 3],
    'gamma': [0.1, 0.01, 0.001]
}

In [None]:
svm_poly_first = SVC()

In [None]:
svm_grid_poly = GridSearchCV(svm_poly_first, param_grid=param_grid_svm_poly, cv=3, verbose=3, n_jobs=-2)

In [None]:
svm_grid_poly.fit(x, y)

### 2. Fitting Best Models

#### Logistic Regression

In [None]:
best_log_estimator.fit(X_train, y_train)
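
To investigate how features affect retention, one common reading of a fitted logistic regression is to convert each coefficient into an odds ratio. The sketch below uses made-up coefficient values, not results from this notebook. In practice they would come from `best_log_estimator.coef_`, and because the features here are not scaled, the magnitudes are not directly comparable across features.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds of churn for a one-unit increase in the feature."""
    return math.exp(coefficient)


# Hypothetical coefficients, for illustration only.
hypothetical_coefficients = {"is_active_member": -1.0, "age": 0.07}

for feature, coefficient in hypothetical_coefficients.items():
    direction = "raises" if coefficient > 0 else "lowers"
    print(f"{feature}: odds ratio {odds_ratio(coefficient):.3f} ({direction} the odds of churn)")
```

An odds ratio above 1 means the feature is associated with higher odds of churn, below 1 with lower odds, holding the other features fixed.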

#### Support Vector Machine

In [None]:
best_svm_estimator.fit(X_train, y_train)

### 3. Metrics of Best Models

#### Logistic Regression

In [None]:
log_predict_train = best_log_estimator.predict(X_train)

In [None]:
log_predict_test = best_log_estimator.predict(X_test)

#### Support Vector Machine

In [None]:
svm_predict_train = best_svm_estimator.predict(X_train)

In [None]:
svm_predict_test = best_svm_estimator.predict(X_test)
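
Accuracy alone can be misleading when retained customers far outnumber churned ones, which is why `classification_report` also shows precision and recall. A minimal sketch with made-up labels follows; `binary_metrics` is a hypothetical helper that only mirrors the standard definitions.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for the positive class (1 = churned)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels: 8 retained and 2 churned customers, and a model that never predicts churn.
y_true = [0] * 8 + [1] * 2
always_retained = [0] * 10

accuracy, precision, recall = binary_metrics(y_true, always_retained)
print(accuracy, precision, recall)
```

The useless model still reaches a high accuracy while finding no churners at all, so recall on the churned class is worth checking for both estimators.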

In [None]:
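# Score both models on the held-out test set; the names match the ROC plot below.
# First argument after y_test: hard predictions (AUC score); second: continuous scores (ROC curve).
auc_log_primal, fpr_log_primal, tpr_log_primal = get_auc_scores(
    y_test, best_log_estimator.predict(X_test), best_log_estimator.predict_proba(X_test)[:, 1])
# SVC is fitted without probability=True, so decision_function supplies the scores.
auc_SVM_RBF, fpr_SVM_RBF, tpr_SVM_RBF = get_auc_scores(
    y_test, best_svm_estimator.predict(X_test), best_svm_estimator.decision_function(X_test))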

In [None]:
plt.figure(figsize = (12,6), linewidth= 1)
plt.plot(fpr_log_primal, tpr_log_primal, label = 'log primal Score: ' + str(round(auc_log_primal, 5)))
plt.plot(fpr_SVM_RBF, tpr_SVM_RBF, label = 'SVM RBF Score: ' + str(round(auc_SVM_RBF, 5)))
plt.plot([0,1], [0,1], 'k--', label = 'Random: 0.5')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()