## Solve Customer Churn by Using Data Science

#### BACKGROUND:
Suppose a car insurance company has millions of customers and spends a lot of money to acquire them. However, 40% of customers cancel their policy every year, and the average customer stays no more than 8 months. For this business, churn is a big deal. This analysis is meant to show how data science can be used to tackle customer churn.


#### PROBLEMS:

* Given an existing customer, can we use the historical data to predict:

 1. Whether the customer will churn or not;
 2. When the customer will churn;
 3. Why the customer will churn.


* Contacting every customer (e.g. through outbound calls or direct mail) does not make sense. If we can identify the high-risk customers who are most likely to churn, we can prioritize retention efforts on them and improve productivity.


**SOLUTIONS:** 

1. Create the customer dataset by linking all available data sources (Customer, Finance, Marketing, Survey and Third party).
2. Identify the most important features for predicting churn.
3. Use machine learning to train and evaluate models, then output a risk score for each customer.




**RESULTS:**

1. We can score high-risk customers with three models (Random model, Premium model, Predictive model). Random forest performs best.
2. The geographic ZIP code feature is always the most important variable when predicting the probability of churn.
3. The KMeans clustering model indicates that the richer the area, the higher the retention rate, but the lower the profitability.

**TAKEAWAYS:**

1. If customer A has a very high predicted probability of churning (e.g. 95%) but produces less revenue (e.g. $2) than others, should we really assign that customer a high priority? It turns out we should also predict LTV (lifetime value) from the customer's attributes. Please refer to the Customer LTV section.
2. A linear model that predicts how long a customer stays on the plan does not always fit well. A survival model is a better solution, since it is designed to predict the duration until an event occurs.
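
The first takeaway can be made concrete with a small sketch. It assumes a churn probability and an LTV estimate are already available per customer. The customer ids, probabilities and LTV figures below are hypothetical, not taken from the dataset. One simple option is to rank by expected revenue at risk (probability times LTV) instead of by probability alone:

```python
# Hypothetical customers: (id, predicted churn probability, estimated LTV in dollars)
customers_at_risk = [
    ("A", 0.95, 2.0),
    ("B", 0.60, 400.0),
    ("C", 0.30, 1000.0),
]

def expected_loss(p_churn, ltv):
    """Expected revenue at risk if the customer churns."""
    return p_churn * ltv

# Ranking by churn probability alone puts the low-value customer A first
by_probability = sorted(customers_at_risk, key=lambda c: c[1], reverse=True)

# Ranking by expected loss pushes A to the bottom
by_expected_loss = sorted(
    customers_at_risk, key=lambda c: expected_loss(c[1], c[2]), reverse=True
)

print([c[0] for c in by_probability])    # ['A', 'B', 'C']
print([c[0] for c in by_expected_loss])  # ['C', 'B', 'A']
```

Other weightings are possible (for example, subtracting the cost of a retention offer); this is only the simplest version of the idea.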

## Load Data

In [67]:
import pandas as pd
import numpy as np

In [68]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [100]:
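# Read ZIP codes as strings so leading zeros are preserved
# (np.str is no longer available in recent NumPy releases; the built-in str is equivalent)
customers = pd.read_csv('customers.csv', dtype={'ZIP_CODE': str})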

In [101]:
# Transform column names to lower case
customers.columns = customers.columns.str.lower()

## Process Data

In [102]:
customers.shape

(40000, 16)

In [103]:
customers.head()

,customer_id,cust_churn,cust_churn_status,duration,productline_type,carrier_name,channel,zip_code,gender,age,has_email,cost,median_income,unemployment_rate,poverty_rate,high_school_rate
0,C1,0,Retain,4,Product A,Carrier 4,Channel A,32817,F,74,Y,23.073607,54188,4.5,9.8,34.4
1,C2,0,Retain,6,Product A,Carrier 8,Channel A,46804,M,60,N,387.354991,30900,5.9,30.7,32.8
2,C3,0,Retain,7,Product C,Carrier 8,Channel A,51034,F,77,Y,27.843773,43347,3.9,9.8,28.7
3,C4,0,Retain,8,Product A,Carrier 3,Channel C,49891,M,61,N,24.496505,43496,3.9,8.5,28.9
4,C5,1,Churn,25,Product A,Carrier 1,Channel A,49776,M,59,Y,0.0,48935,5.2,11.4,25.2


In [104]:
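def cleanup_zipcode(zipcode):
    # Keep the 5-digit part of ZIP+4 codes (e.g. '12345-6789' -> '12345')
    base = str(zipcode).split('-')[0].strip()
    # Restore leading zeros that were lost when the code was stored as a number
    return base.zfill(5)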

In [105]:
customers['gender'] = customers['gender'].apply(lambda s: 1 if s =='M' else 0)
customers['has_email'] = customers['has_email'].apply(lambda s: 1 if s=='Y' else 0)
products = pd.get_dummies(customers['productline_type'], drop_first=True).astype(np.int8)
carriers = pd.get_dummies(customers['carrier_name'], drop_first=True).astype(np.int8)
channels = pd.get_dummies(customers['channel'], drop_first=True).astype(np.int8)
zipcodes = pd.get_dummies(customers['zip_code'], drop_first=True).astype(np.int8)
customers = pd.concat([customers, products, carriers, channels, zipcodes], axis=1)
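
To see what `drop_first=True` does in the encoding step above, here is a minimal sketch on a toy column (the three rows are made up for illustration). The first category becomes the implicit baseline, so k categories produce k-1 indicator columns:

```python
import pandas as pd

# Toy column with three product lines (illustrative values only)
toy = pd.Series(['Product A', 'Product B', 'Product C'], name='productline_type')

# 'Product A' is dropped and becomes the baseline (all indicators equal to 0)
encoded = pd.get_dummies(toy, drop_first=True).astype('int8')
print(encoded)
```

With a high-cardinality column such as `zip_code`, the same call produces one column per distinct ZIP code minus one, so the feature matrix can become very wide.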

## Classification Model

In [106]:
from sklearn.model_selection import train_test_split

In [107]:
from sklearn.tree import DecisionTreeClassifier

In [108]:
from sklearn.ensemble import RandomForestClassifier

In [109]:
from sklearn.metrics import classification_report, confusion_matrix

In [110]:
class SklearnTreeClassifierModel(object):
    
    """ Sklearn Tree Model Object """
    X = None
    y = None
    X_train = None
    X_test = None
    y_train = None
    y_test = None
    predictions = None
    model = None
    name = None
    
    def __init__(self, data, response, split=0.3, name='dtc'):
        self.data = data
        self.response = response
        self.split = split
        self.name = name
        self.X = data.drop(response, axis=1)
        self.y = data[response]
        self.X_train, self.X_test, self.y_train, self.y_test=train_test_split(self.X, self.y, test_size=split, random_state=101)
        
    def create_build_model(self):
        if self.name == 'rfc':
            tree = RandomForestClassifier(n_estimators=50)
        else:
            tree = DecisionTreeClassifier()
        self.model = tree.fit(self.X_train, self.y_train)
        self.predictions = self.model.predict(self.X_test)
    
    def get_confustion_metrix(self):
        return confusion_matrix(self.y_test, self.predictions)
    
    def get_classification_report(self):
        return classification_report(self.y_test, self.predictions)
    
    def get_predict_proba(self):
        return self.model.predict_proba(self.X_test)
    
    def get_scoring_dataframe(self):
        results = pd.DataFrame(self.get_predict_proba(), columns= self.model.classes_.tolist())
        y_test_copy = self.y_test.copy()
        y_test_copy.index = results.index
        return pd.concat([results, y_test_copy], axis=1)

In [111]:
final_data = customers.drop(['customer_id', 'cust_churn', 'duration', 'productline_type', 'carrier_name','channel', 'zip_code'], axis=1)

In [118]:
# Create a decision tree model
dtc = SklearnTreeClassifierModel(final_data, response='cust_churn_status')

In [119]:
dtc.create_build_model()

In [121]:
print(dtc.get_confustion_metrix())
print(dtc.get_classification_report())

[[1588 2502]
 [2033 5877]]
             precision    recall  f1-score   support

      Churn       0.44      0.39      0.41      4090
     Retain       0.70      0.74      0.72      7910

avg / total       0.61      0.62      0.62     12000



In [112]:
# Create a random forest model
tcm = SklearnTreeClassifierModel(final_data, response='cust_churn_status', name='rfc')

In [113]:
tcm.create_build_model()

In [116]:
tcm.model.classes_

array(['Churn', 'Retain'], dtype=object)

In [117]:
print(tcm.get_confustion_metrix())
print(tcm.get_classification_report())

[[ 909 3181]
 [ 708 7202]]
             precision    recall  f1-score   support

      Churn       0.56      0.22      0.32      4090
     Retain       0.69      0.91      0.79      7910

avg / total       0.65      0.68      0.63     12000
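
The `Churn` row of the random forest report can be reproduced by hand from the confusion matrix printed above. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes, here in the order `['Churn', 'Retain']`. A short worked calculation:

```python
# Confusion matrix from the random forest output above
cm = [[909, 3181],
      [708, 7202]]

tp = cm[0][0]  # actual Churn, predicted Churn
fn = cm[0][1]  # actual Churn, predicted Retain
fp = cm[1][0]  # actual Retain, predicted Churn
tn = cm[1][1]  # actual Retain, predicted Retain

precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.56 0.22 0.32 0.68
```

The low recall (0.22) means the forest misses most actual churners at the default decision threshold, which is why the next section ranks customers by predicted probability instead of using the hard class labels.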



## Prediction

In [127]:
# Baseline: randomly pick 10% of customers and treat them as high risk
random = customers.sample(frac=0.1)
# Share of actual churners in that random sample:
print(random['cust_churn'].mean())

0.32875


In [123]:
# Score every customer in the X_test dataset with the random forest
scores = tcm.get_scoring_dataframe()

In [124]:
scores.head()

,Churn,Retain,cust_churn_status
0,0.16,0.84,Churn
1,0.26,0.74,Retain
2,0.22,0.78,Retain
3,0.28,0.72,Churn
4,0.26,0.74,Retain


In [125]:
# Sort customers from highest to lowest predicted churn probability:
# what share of the top 4,000 actually churn?
scores.sort_values('Churn', ascending=False).head(4000)['cust_churn_status'].value_counts(normalize=True)

Retain    0.50225
Churn     0.49775
Name: cust_churn_status, dtype: float64
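
The two churn shares above can be combined into a lift figure. The sketch below only reuses numbers printed in this notebook. Note that the random baseline was drawn from the full customer table while the ranked list comes from the 12,000-row test set, so the comparison is approximate:

```python
top_n = 4000                 # customers contacted, ranked by predicted churn probability
test_size = 12000            # size of the test set (total support in the report)
test_churners = 4090         # actual churners in the test set
churn_rate_top = 0.49775     # churn share among the top 4,000 by score
churn_rate_random = 0.32875  # churn share in the random 10% sample

# How much better the ranked list is than random targeting
lift = churn_rate_top / churn_rate_random

# Churners reached by contacting only the top 4,000
churners_reached = churn_rate_top * top_n
share_of_churners_reached = churners_reached / test_churners
share_contacted = top_n / test_size

print(round(lift, 2), round(churners_reached),
      round(share_of_churners_reached, 2), round(share_contacted, 2))
# 1.51 1991 0.49 0.33
```

In other words, contacting about a third of the test customers reaches roughly half of the actual churners, about 1.5 times what random selection would achieve.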

## KMeans Clustering Model

In [50]:
from sklearn.cluster import KMeans

In [59]:
def create_build_kmeans(df, file_name, list_to_drop, label_name='cluster', clusters=8):
    # Work on a copy so the caller's DataFrame is left untouched
    dfc = df.copy()
    dfc.columns = dfc.columns.str.lower()
    # Cluster on the remaining columns, treating missing values as 0
    df_values = dfc.drop(list_to_drop, axis=1).fillna(0).values
    kmeans = KMeans(n_clusters=clusters)
    kmeans.fit(df_values)
    dfc[label_name] = kmeans.labels_
    # Save the labelled data, then return it
    dfc.to_csv(file_name, header=True, index=False)
    return dfc

In [52]:
# Read ZIP codes as strings (np.str is no longer available in recent NumPy releases)
zipcodes = pd.read_csv('zipcodes.csv', dtype={'ZIP_CODE': str})

In [53]:
# Transform column names to lower case
zipcodes.columns = zipcodes.columns.str.lower()

In [54]:
zipcodes['zip_code'] = zipcodes['zip_code'].apply(cleanup_zipcode)

In [55]:
# Columns excluded from the clustering features
list_to_drop = ['zip_code', 'population', 'churn_counts']

In [56]:
zipcodes.head()

,zip_code,customers,total_cost,churn_counts,median_income,high_school_rate,unemployment_rate,poverty_rate,avg_age,population
0,1001,31,825.49,7,60161,29.6,4.4,4.7,68.6,17438
1,1002,16,793.97,2,50540,12.3,4.9,8.2,64.8,29780
2,1005,9,135.71,4,68786,40.7,5.4,1.3,67.4,5201
3,1007,21,657.0,3,76881,23.3,4.2,5.9,62.8,14838
4,1009,15,14.95,0,40833,28.2,16.1,28.0,65.0,429
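
These columns sit on very different scales: `median_income` is in the tens of thousands while the rate columns are percentages. KMeans groups rows by Euclidean distance, and `create_build_kmeans` passes the raw values to it, so income is likely to dominate the clusters. The sketch below uses three columns from the five rows shown above to illustrate the effect. The z-score scaling is an assumption about what one might try, not something this notebook does:

```python
import pandas as pd

# Three feature columns from the five rows of zipcodes.head() above
sample = pd.DataFrame({
    'median_income': [60161, 50540, 68786, 76881, 40833],
    'poverty_rate': [4.7, 8.2, 1.3, 5.9, 28.0],
    'unemployment_rate': [4.4, 4.9, 5.4, 4.2, 16.1],
})

# Share of total variance carried by each raw column
share_raw = sample.var() / sample.var().sum()

# Z-score each column so every feature has unit variance
scaled = (sample - sample.mean()) / sample.std()
share_scaled = scaled.var() / scaled.var().sum()

print(share_raw.round(6))
print(share_scaled.round(6))
```

On the raw values nearly all of the variance comes from `median_income`. After scaling, each column contributes equally. Whether equal weighting is the right choice depends on the business question.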


In [57]:
zipcodes.dtypes

zip_code              object
customers              int64
total_cost           float64
churn_counts           int64
median_income          int64
high_school_rate     float64
unemployment_rate    float64
poverty_rate         float64
avg_age              float64
population             int64
dtype: object

In [62]:
clusters = create_build_kmeans(zipcodes, 'zipcodes_clusters.csv', list_to_drop)

In [63]:
clusters.head()

,zip_code,customers,total_cost,churn_counts,median_income,high_school_rate,unemployment_rate,poverty_rate,avg_age,population,cluster
0,1001,31,825.49,7,60161,29.6,4.4,4.7,68.6,17438,0
1,1002,16,793.97,2,50540,12.3,4.9,8.2,64.8,29780,4
2,1005,9,135.71,4,68786,40.7,5.4,1.3,67.4,5201,7
3,1007,21,657.0,3,76881,23.3,4.2,5.9,62.8,14838,7
4,1009,15,14.95,0,40833,28.2,16.1,28.0,65.0,429,6


In [66]:
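# Average the numeric columns per cluster (zip_code is a string column and is skipped)
clusters.groupby(['cluster']).mean(numeric_only=True)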

cluster,customers,total_cost,churn_counts,median_income,high_school_rate,unemployment_rate,poverty_rate,avg_age,population
0,25.212135,2025.338729,5.872903,61352.854976,31.613951,4.159044,6.758883,65.436773,14008.238566
1,23.338284,1516.393383,4.523102,123650.592409,13.831023,3.362376,2.425248,68.776403,17809.235974
2,24.850281,2513.462447,6.970506,28248.153371,36.644522,6.723596,24.643736,63.164663,10081.33764
3,26.717763,1829.885914,5.778947,95548.797368,19.755724,3.780526,3.519079,67.626382,19467.491447
4,23.839956,1984.384402,5.714421,50047.565933,35.651638,4.456324,9.617983,64.829587,11786.67722
5,20.068182,829.125379,3.386364,174996.318182,9.134848,3.093939,1.987879,69.729545,12587.30303
6,25.802218,2450.093434,6.774948,39857.316752,37.366812,5.217006,14.443242,63.947767,11650.912047
7,27.757009,2063.513353,6.383567,75929.308801,25.500156,4.037227,4.802142,66.49965,17685.790498
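
The cluster means above can be turned into a rough churn ratio per cluster to check the claim that richer areas retain better. The sketch below divides mean `churn_counts` by mean `customers`. This is a ratio of means, not an average of per-ZIP churn rates, so treat it as an approximation. How `total_cost` relates to profitability is not stated in this notebook, so the cost-per-customer ratio is shown without interpretation:

```python
# (cluster, mean customers, mean total_cost, mean churn_counts, mean median_income)
# copied from the groupby output above
cluster_means = [
    (0, 25.212135, 2025.338729, 5.872903, 61352.854976),
    (1, 23.338284, 1516.393383, 4.523102, 123650.592409),
    (2, 24.850281, 2513.462447, 6.970506, 28248.153371),
    (3, 26.717763, 1829.885914, 5.778947, 95548.797368),
    (4, 23.839956, 1984.384402, 5.714421, 50047.565933),
    (5, 20.068182, 829.125379, 3.386364, 174996.318182),
    (6, 25.802218, 2450.093434, 6.774948, 39857.316752),
    (7, 27.757009, 2063.513353, 6.383567, 75929.308801),
]

rows = []
for cluster, n_customers, total_cost, churn_counts, income in cluster_means:
    rows.append({
        'cluster': cluster,
        'median_income': income,
        'churn_ratio': churn_counts / n_customers,
        'cost_per_customer': total_cost / n_customers,
    })

# Order clusters from richest to poorest area
rows.sort(key=lambda r: r['median_income'], reverse=True)

for r in rows:
    print(r['cluster'], round(r['median_income']),
          round(r['churn_ratio'], 3), round(r['cost_per_customer'], 2))
```

Read from richest to poorest, the churn ratio rises at every step, from about 0.17 in cluster 5 to about 0.28 in cluster 2, which is consistent with the retention finding in the results.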
