# Customer Churn Prediction: applying the model to predict whether a given client will churn


Adapted from original by Heba El-Shimy https://raw.githubusercontent.com/IBM/customer-churn-prediction/master/notebooks/customer-churn-prediction.ipynb 

-------------------

- This article (https://medium.com/@markryan_69718/watson-studio-desktop-first-impressions-5a85309597d0) describes adapting the original Python-based churn solution described here: https://developer.ibm.com/patterns/predict-customer-churn-using-watson-studio-and-jupyter-notebooks/
- The article describes a simple Modeler flow (https://github.com/ryanmark1867/shared_ml/blob/master/churn%20flow%20Feb%202019.str?raw=true) that implements a subset of the original Python-based churn solution

This notebook adapts the approach in the main notebook https://github.com/ryanmark1867/IDUG2019_ML_bootcamp/blob/master/churn_match_modeler.ipynb by incorporating the data preparation steps from that notebook in an sklearn pipeline. By using a pipeline:
- you can train the data preparation steps and the model itself in one pass
- you can conveniently apply the trained model to new clients to predict whether or not they will churn




# Load Libraries

In [44]:
# ensure library loaded for data entry controls
!pip install ipywidgets

tensorflow 1.3.0 requires tensorflow-tensorboard<0.2.0,>=0.1.0, which is not installed.


In [45]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import combinations
from collections import defaultdict
from scipy import stats
from scipy.stats import zscore
import sklearn.feature_selection
from sklearn import metrics, preprocessing, svm
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder, StandardScaler
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display
# dictionary of interactively entered values for the client to be scored
val = {}

# fraction of the rows held back for testing
testproportion = 0.3


# The Dataset

The same dataset used in this notebook was used in the Modeler flow described above: https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv


# Ingest Dataset

- the cell below brings the dataset into a Pandas dataframe directly from the repo using the URL of the CSV file



In [46]:
url="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/WA_Fn-UseC_-Telco-Customer-Churn.csv"
customer_data = pd.read_csv(url)
customer_data_orig = customer_data.copy()
customer_data.head()



Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [47]:
customer_data.shape

(7043, 21)

# Define column subsets

- retain_columns - the list of columns from the original dataset that you want to train the model on
- categorical_columns - the subset of columns that are categorical (where the values need to be replaced with numeric values)
- continuous_columns - the subset of columns that hold numeric values (where the values need to be scaled and outliers need to be dealt with)

In [48]:
# define the columns that we are going to keep; Churn is the label, so it is not listed here
categorical_columns = ['InternetService', 'PaymentMethod', 'OnlineSecurity', 'Contract']
continuous_columns = ['MonthlyCharges', 'TotalCharges','tenure']
retain_columns = ['MonthlyCharges','TotalCharges','InternetService','PaymentMethod','OnlineSecurity','Contract','tenure']

In [49]:
# get the unique values for the categorical columns - we will need these later
# to get input values to test the model
InternetService_list = customer_data['InternetService'].unique()
PaymentMethod_list = customer_data['PaymentMethod'].unique()
OnlineSecurity_list = customer_data['OnlineSecurity'].unique()
Contract_list = customer_data['Contract'].unique()
InternetService_list

array(['DSL', 'Fiber optic', 'No'], dtype=object)

# Split the dataset into test and train

The following code splits the dataset into test and train subsets for both X (the columns used to train the model) and y (the target: Churn):

- train: subset of the data used to train the models
- test: subset of the data held back from training so it can be used to assess the accuracy of the model on predicting churn on data that it has not seen before


In [50]:
# define label
y_le = customer_data['Churn']
# transform label
label_le = LabelEncoder()
label_le.fit(y_le.tolist())
y_le = label_le.transform(y_le)
# define input values
X_selected = customer_data.drop(['Churn'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_selected, y_le,
                                                    test_size=testproportion, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(4930, 20) (4930,)
(2113, 20) (2113,)
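
The row counts above follow from how `train_test_split` handles a fractional `test_size`: the test set gets the ceiling of `test_size * n_rows` and the training set gets the remaining rows. A minimal sketch that reproduces the arithmetic on a placeholder array of the same length (the array contents are an assumption for illustration, not the churn data):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 7043          # number of rows in the churn dataset
test_fraction = 0.3    # same value as testproportion above

# placeholder data: only the length matters for this illustration
placeholder = np.arange(n_rows)
train_part, test_part = train_test_split(placeholder, test_size=test_fraction, random_state=42)

# the split sizes implied by the fraction
expected_test = math.ceil(test_fraction * n_rows)
expected_train = n_rows - expected_test
print(len(train_part), len(test_part))
print(expected_train, expected_test)
```

Because `random_state` is fixed, rerunning the split returns the same rows each time, which keeps the accuracy figures later in the notebook reproducible.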


In [51]:
X_test.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
185,1024-GUALD,Female,0,Yes,No,1,No,No phone service,DSL,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,24.8,24.8
2715,0484-JPBRU,Male,0,No,No,41,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,Yes,Bank transfer (automatic),25.25,996.45
3825,3620-EHIMZ,Female,0,Yes,Yes,52,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.35,1031.7
1807,6910-HADCM,Female,0,No,No,1,Yes,No,Fiber optic,No,No,Yes,No,No,No,Month-to-month,No,Electronic check,76.35,76.35
132,8587-XYZSF,Male,0,No,No,67,Yes,No,DSL,No,No,No,Yes,No,No,Two year,No,Bank transfer (automatic),50.55,3260.1


In [52]:
y_test

array([1, 0, 0, ..., 0, 0, 0])

In [53]:
X_test.tail()


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
5522,2619-WFQWU,Female,0,No,No,1,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,70.15,70.15
6377,7851-FLGGQ,Male,0,No,No,1,No,No phone service,DSL,No,Yes,No,Yes,No,Yes,Month-to-month,Yes,Mailed check,44.65,44.65
5500,7139-JZFVG,Male,0,Yes,Yes,60,Yes,No,DSL,Yes,Yes,Yes,No,No,No,Two year,No,Bank transfer (automatic),60.5,3694.45
2392,3771-PZOBW,Male,0,No,No,20,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,No,Month-to-month,Yes,Credit card (automatic),90.7,1781.35
6705,3733-ZEECP,Male,0,Yes,Yes,22,Yes,No,DSL,No,No,No,Yes,No,No,Month-to-month,Yes,Electronic check,51.1,1232.9


# Defining a pipeline to combine data preparation and the model
We want to predict whether a given client will churn.

To predict whether a client will churn, we need to:
- train the model
- apply all the transforms that were performed on the training data set to the data for the client we want to score
- feed the transformed data for this client into the trained model to predict whether or not the client will churn 

To make this process efficient, we will use the sklearn pipeline structure to combine all the data processing steps (selecting a subset of columns, replacing missing values, scaling & outlier processing for continuous columns, replacing categorical values with numeric IDs) together with the model.

Once the pipeline has been defined, we can use it to:
- train the data preparation steps and the model in one pass
- score a given client, that is, predict whether or not that client will churn. Because the pipeline combines data preparation with the model, once it is trained we can feed in a client's raw data and get a churn prediction directly

There are two steps to creating the pipeline:
1. For each data preparation step, define classes derived from the appropriate sklearn classes that specify: (a) the parameters for the step, (b) what transformations the step performs, (c) whether the training dataset defines any aspects of the transformation
2. Build the pipeline using instances of the data preparation step classes, along with the model
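
The two steps above can be sketched on a toy dataset. The class `CenterColumns`, the toy column names, and the data are hypothetical and only illustrate the pattern: parameters are passed to the constructor (a), `transform` does the work (b), and `fit` learns what it needs from the training data (c). This follows the general scikit-learn convention rather than the exact classes defined later in this notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class CenterColumns(TransformerMixin, BaseEstimator):
    """Hypothetical step: subtract the training mean from selected columns."""

    def __init__(self, col_list=None):
        # (a) parameters for the step
        self.col_list = col_list

    def fit(self, X, y=None):
        # (c) learned from the training data only
        self.means_ = X[self.col_list].mean()
        return self

    def transform(self, X):
        # (b) the transformation itself, applied to a copy
        X = X.copy()
        X[self.col_list] = X[self.col_list] - self.means_
        return X


# toy training data (made up for illustration)
toy_X = pd.DataFrame({"charges": [20.0, 30.0, 80.0, 90.0],
                      "tenure": [50.0, 40.0, 2.0, 1.0]})
toy_y = [0, 0, 1, 1]

# step 2: build the pipeline from the step instance plus the model
toy_pipeline = Pipeline([
    ("center", CenterColumns(col_list=["charges", "tenure"])),
    ("lr", LogisticRegression(solver="lbfgs")),
])
toy_pipeline.fit(toy_X, toy_y)

# score one new client; the means learned in fit are reused automatically
new_client = pd.DataFrame({"charges": [85.0], "tenure": [3.0]})
toy_prediction = toy_pipeline.predict(new_client)
print(toy_prediction)
```

A single call to `fit` trains both the preparation step and the model, and a single call to `predict` applies both to the new row.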




# Define classes for each data preparation step

The sklearn pipeline structure requires classes to be defined for each data preparation step:
- subset_columns - select the subset of columns that will be used to train the model
- fill_empty - replace empty values with a placeholder. Note that unlike the modeler flow and the initial notebook, we apply this step to all columns.
- encode_categorical - replace values in categorical columns with numeric IDs
- scale_continuous - use zscore to scale continuous values
- replace_outliers - for each continuous column, identify values that are beyond the threshold and replace those values with the threshold
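
As a worked example of the last two steps, the sketch below z-scores a small made-up column and then caps values above a threshold expressed in standard deviations. The numbers and the multiplier of 1.5 are assumptions chosen so the effect is visible on five values; the notebook itself uses a multiplier of 3.0.

```python
import numpy as np

# made-up monthly charges with one unusually large value
charges = np.array([20.0, 25.0, 30.0, 35.0, 140.0])

# z-score: subtract the mean, then divide by the standard deviation
mean = charges.mean()
sd = charges.std()               # population standard deviation (ddof=0)
z = (charges - mean) / sd

# after z-scoring, the standard deviation is 1, so the cap equals the multiplier
sd_mult = 1.5
capped = np.minimum(z, sd_mult * z.std())

print(np.round(z, 3))
print(np.round(capped, 3))
```

Only the upper tail is capped here, which mirrors the behaviour of the `sd_max` helper used by `replace_outliers`.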

In [54]:

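# class to subset columns
class subset_columns(BaseEstimator, TransformerMixin):
    """Keep only the columns listed in retain_cols."""

    def set_params(self, **kwargs):
        # called through churn_pipeline.set_params(sc__retain_cols=...)
        print("subsetting")
        self.retain_cols = kwargs.get('retain_cols', None)
        return self

    def transform(self, X, **transform_params):
        print("subset xform")
        trimmed_X = X[self.retain_cols]
        return trimmed_X

    def fit(self, X, y=None, **fit_params):
        # nothing to learn from the training data
        return self


# class to fill empty values
class fill_empty(BaseEstimator, TransformerMixin):
    """Replace missing values with 0 in every column."""

    def transform(self, X, **transform_params):
        print("fill empty xform")
        # fillna returns a new dataframe, so the caller's data is not modified
        filled_X = X.fillna(0)
        return filled_X

    def fit(self, X, y=None, **fit_params):
        # nothing to learn from the training data
        return self


# class to encode categorical columns
class encode_categorical(BaseEstimator, TransformerMixin):
    """Replace the values in each categorical column with integer IDs."""

    def __init__(self):
        # one LabelEncoder per column, keyed by column name
        self.le = {}

    def set_params(self, **kwargs):
        self.col_list = kwargs.get('col_list', None)
        return self

    def fit(self, X, y=None, **fit_params):
        # learn the value-to-ID mapping for each column from the training data
        for col in self.col_list:
            print("col is ", col)
            self.le[col] = LabelEncoder()
            self.le[col].fit(X[col].tolist())
        return self

    def transform(self, X, y=None, **transform_params):
        # work on a copy so the dataframe passed in is not modified
        X = X.copy()
        # apply the mappings learned in fit; values not seen in fit raise an error
        for col in self.col_list:
            print("transform col is ", col)
            X[col] = self.le[col].transform(X[col])
            print("after transform col is ", col)
        return X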
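
To make the encoding step concrete, the sketch below shows what a single `LabelEncoder` does with one column. The list of contract values is made up for illustration, and `"Five year"` is a hypothetical value that does not appear in the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# fit on the categories seen in training
contract_encoder = LabelEncoder()
contract_encoder.fit(["Month-to-month", "One year", "Two year", "One year"])

# classes_ is sorted, and each label maps to its position in that list
print(list(contract_encoder.classes_))
encoded = contract_encoder.transform(["Two year", "Month-to-month"])
print(list(encoded))

# a label that was not present at fit time cannot be encoded
try:
    contract_encoder.transform(["Five year"])  # hypothetical unseen value
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

This is why the widgets later in the notebook offer only the values found in the dataset: a category the encoder never saw would make the pipeline fail at scoring time.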
    

   

    
    

In [55]:
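# scale continuous columns
class scale_continuous(BaseEstimator, TransformerMixin):
    """Z-score the continuous columns."""

    def __init__(self, **kwargs):
        # per-column standard deviation and mean learned in fit
        self.sd = {}
        self.mean = {}

    def set_params(self, **kwargs):
        self.col_list = kwargs.get('col_list', None)
        return self

    def fit(self, X, y=None, **fit_params):
        # record the training mean and standard deviation for each column
        for col in self.col_list:
            self.sd[col] = X.loc[:, col].std()
            self.mean[col] = X.loc[:, col].mean()
        return self

    def transform(self, X, y=None, **transform_params):
        print("scale xform")
        # work on a copy so the dataframe passed in is not modified
        X = X.copy()
        if len(X.index) > 1:
            # more than one row: z-score using the statistics of the rows passed in
            X[self.col_list] = X[self.col_list].apply(zscore)
        else:
            # a single row has no spread of its own, so use the training statistics
            for col in self.col_list:
                X[col] = (X[col] - self.mean[col]) / self.sd[col]
        return X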

In [56]:
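def sd_max(x, sd, multiplier):
    """Cap x at multiplier * sd; values at or below the threshold pass through unchanged."""
    threshold = multiplier * sd
    return threshold if x > threshold else x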
    


In [57]:
# replace outliers
class replace_outliers(BaseEstimator, TransformerMixin):
    """Cap values in the continuous columns at sd_mult standard deviations."""

    def __init__(self, **kwargs):
        # per-column standard deviation learned in fit
        self.sd = {}

    def set_params(self, **kwargs):
        self.sd_mult = kwargs.get('sd_mult', None)
        self.col_list = kwargs.get('col_list', None)
        return self

    def fit(self, X, y=None, **fit_params):
        # this step runs after scale_continuous, so these are the standard deviations of the scaled columns
        for col in self.col_list:
            self.sd[col] = X.loc[:, col].std()
        return self

    def transform(self, X, y=None, **transform_params):
        print("outliers xform")
        # work on a copy so the dataframe passed in is not modified
        X = X.copy()
        for col in self.col_list:
            # values above sd_mult * sd are replaced with that threshold
            X[col] = X[col].apply(sd_max, args=(self.sd[col], self.sd_mult))
        print("after outliers xform")
        return X
      

In [58]:
y_train.shape

(4930,)

In [59]:
y_train

array([0, 0, 0, ..., 0, 1, 0])

# Build  and train the pipeline using the data preparation step classes

The pipeline will be built of layers made up of the data preparation steps followed by the model. To build and train the pipeline:
- create instances of the data preparation classes defined above
- define the pipeline using these instances and the Logistic Regression model
- specify any parameters for the data preparation steps and train (fit) the pipeline



In [60]:
# define instances of the data preparation classes
sc = subset_columns()
ec = encode_categorical()
fe = fill_empty()
ro = replace_outliers()
scale_c = scale_continuous()
clf_lr = LogisticRegression(solver = 'lbfgs')

# define the pipeline layers
churn_pipeline = Pipeline([('fe',fe),('sc',sc),('e_categorical',ec),('scale_c',scale_c),('ro',ro),('lr',clf_lr)])


# fit the pipeline - note that X_train is a dataframe and y_train is a 1-D NumPy array of shape (4930,)

churn_pipeline.set_params(sc__retain_cols = retain_columns,
                          e_categorical__col_list = categorical_columns,
                          scale_c__col_list = continuous_columns, 
                          ro__sd_mult = 3.0, ro__col_list = continuous_columns
                        ).fit(X_train, y_train)


subsetting
fill empty xform
subset xform
col is  InternetService
col is  PaymentMethod
col is  OnlineSecurity
col is  Contract
transform col is  InternetService
after transform col is  InternetService
transform col is  PaymentMethod
after transform col is  PaymentMethod
transform col is  OnlineSecurity
after transform col is  OnlineSecurity
transform col is  Contract
after transform col is  Contract
scale xform
outliers xform
after outliers xform


Pipeline(memory=None,
     steps=[('fe', fill_empty()), ('sc', subset_columns()), ('e_categorical', encode_categorical()), ('scale_c', scale_continuous()), ('ro', replace_outliers()), ('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False))])

# Evaluate the accuracy of the pipeline

Exercise the pipeline by:
- applying the test set to the pipeline to get predictions for the test set
- measuring the accuracy of the pipeline by comparing the predictions made for the test set with the actual Churn values for the test set

In [61]:
# predict using the pipeline
prediction = churn_pipeline.predict(X_test)

fill empty xform
subset xform
transform col is  InternetService
after transform col is  InternetService
transform col is  PaymentMethod
after transform col is  PaymentMethod
transform col is  OnlineSecurity
after transform col is  OnlineSecurity
transform col is  Contract
after transform col is  Contract
scale xform
outliers xform
after outliers xform


In [62]:
# assess the accuracy of the pipeline
acc_lr = accuracy_score(y_test, prediction)
print(acc_lr)

0.794131566493
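
Accuracy is a single number. In a churn problem, where churners are often the smaller class, a confusion matrix together with precision and recall shows which kind of mistake the model makes. A minimal sketch on made-up labels (the two lists are assumptions for illustration, not output of the pipeline above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# made-up labels: 1 = churn, 0 = no churn
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# rows are actual classes, columns are predicted classes
cm = confusion_matrix(actual, predicted)
print(cm)

toy_accuracy = accuracy_score(actual, predicted)
toy_precision = precision_score(actual, predicted)   # of the predicted churners, how many churned
toy_recall = recall_score(actual, predicted)         # of the actual churners, how many were caught
print(toy_accuracy, toy_precision, toy_recall)
```

In this notebook the same functions could be called with `y_test` and `prediction` in place of the toy lists.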


# Enter values for the client to be scored with the pipeline

The following cells allow you to interactively enter values for a client that you can score using the pipeline that you have trained above.

- define controls for categorical column data entry
- enter continuous values
- enter categorical values

In [63]:
# define controls for all the categorical columns
# (IPython.html.widgets is the old location of these functions; they now live in ipywidgets)
from ipywidgets import interact, interactive
def f(InternetService):
    print(InternetService)
    
def g(PaymentMethod):
    print(PaymentMethod)
    
def h(OnlineSecurity):
    print(OnlineSecurity)
    
def i(Contract):
    print(Contract)
    
control = {}
control['InternetService'] = interactive(f, InternetService = InternetService_list)
control['PaymentMethod'] = interactive(g, PaymentMethod = PaymentMethod_list)
control['OnlineSecurity'] = interactive(h, OnlineSecurity = OnlineSecurity_list)
control['Contract'] = interactive(i, Contract = Contract_list)





# Enter continuous values for this client

Run this cell to get prompts where you can interactively enter the continuous values for this client.

In [64]:
# interactively enter the continuous values

for col in continuous_columns:
    val[col] = input("what is the value for  "+col)
    val[col] = [float(val[col])]

what is the value for  MonthlyCharges33.4
what is the value for  TotalCharges100.5
what is the value for  tenure33


# Enter categorical values for this client

Use the controls following this cell to enter categorical values for the new client you want to score as a churn risk.

Once you have updated the controls, run the cell after them to save the selected values.

In [65]:
# interactively enter the categorical values.

for col in categorical_columns:
    # show control for categorical columns
    display(control[col])
    
    

Month-to-month


In [66]:
# save values selected
for col in categorical_columns:
    val[col] = [[*control[col].kwargs.values()][0]]
    print("value",val[col])

value ['DSL']
value ['Electronic check']
value ['No']
value ['Month-to-month']


In [67]:
val

{'Contract': ['Month-to-month'],
 'InternetService': ['DSL'],
 'MonthlyCharges': [33.4],
 'OnlineSecurity': ['No'],
 'PaymentMethod': ['Electronic check'],
 'TotalCharges': [100.5],
 'tenure': [33.0]}

# Score this client using the pipeline

- take the values entered for this client and load them into a dataframe
- run the pipeline with this dataframe as input
- the predicted outcome for this client is the output of the pipeline for these values

In [68]:
val_df = pd.DataFrame.from_dict(val)
val_df = val_df[['MonthlyCharges','TotalCharges','InternetService','PaymentMethod','OnlineSecurity','Contract','tenure']]
val_df.head()

Unnamed: 0,MonthlyCharges,TotalCharges,InternetService,PaymentMethod,OnlineSecurity,Contract,tenure
0,33.4,100.5,DSL,Electronic check,No,Month-to-month,33.0


In [69]:
# show the single row that will be scored
print(val_df.loc[[0]])

   MonthlyCharges  TotalCharges InternetService     PaymentMethod  \
0            33.4         100.5             DSL  Electronic check   

  OnlineSecurity        Contract  tenure  
0             No  Month-to-month    33.0  


In [70]:
# score the client: the pipeline applies every data preparation step, then the model

prediction_val = churn_pipeline.predict(val_df)

fill empty xform
subset xform
transform col is  InternetService
after transform col is  InternetService
transform col is  PaymentMethod
after transform col is  PaymentMethod
transform col is  OnlineSecurity
after transform col is  OnlineSecurity
transform col is  Contract
after transform col is  Contract
scale xform
outliers xform
after outliers xform


# Get the predicted outcome for this client

See what the pipeline predicts for this client.
- if prediction_val = 1, the pipeline predicts that this client will churn
- if prediction_val = 0, the pipeline predicts that this client will not churn

In [71]:
# prediction_val is an array with one element because val_df has one row
if prediction_val[0] == 1:
    print("The model predicts that this client will churn")
else:
    print("The model predicts that this client will not churn")
val_df.head()    
    

The model predicts that this client will not churn


Unnamed: 0,MonthlyCharges,TotalCharges,InternetService,PaymentMethod,OnlineSecurity,Contract,tenure
0,33.4,100.5,DSL,Electronic check,No,Month-to-month,33.0
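
A hard 0/1 label hides how confident the model is. Pipelines whose final step is a `LogisticRegression` also expose `predict_proba`, which returns one probability per class. The sketch below uses a small made-up dataset and a stock `StandardScaler` in place of the custom steps of this notebook, so the names and numbers are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# made-up training data: [monthly charges, tenure in months]
toy_X = np.array([[20.0, 60.0], [30.0, 48.0], [45.0, 36.0],
                  [80.0, 3.0], [90.0, 2.0], [95.0, 1.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

proba_pipeline = Pipeline([("scale", StandardScaler()),
                           ("lr", LogisticRegression(solver="lbfgs"))])
proba_pipeline.fit(toy_X, toy_y)

new_client = np.array([[85.0, 4.0]])
label = proba_pipeline.predict(new_client)
# one column per class, in the order given by proba_pipeline.classes_
proba = proba_pipeline.predict_proba(new_client)
print(label, proba)
```

In this notebook the equivalent call would be `churn_pipeline.predict_proba(val_df)`, which returns the probability of no churn and of churn for the client entered above.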


# Sample client values that produce Churn = Yes / Churn = No

The following two cells have sample value sets that should produce Churn = Yes (prediction = 1) / Churn = No (prediction = 0)

Exercise:
- try entering variations on these values to see if you can determine which column / value combinations push these clients from one churn prediction state to another

In [72]:
# example of a set of input values that produces churn = yes
'''val = {'MonthlyCharges': [90.70],
 'TotalCharges': [1781.35],
 'tenure': [20.0],
 'InternetService': ['Fiber optic'],
 'PaymentMethod': ['Credit card (automatic)'],
 'OnlineSecurity': ['No'],
 'Contract': ['Month-to-month']}'''

"val = {'MonthlyCharges': [90.70],\n 'TotalCharges': [1781.35],\n 'tenure': [20.0],\n 'InternetService': ['Fiber optic'],\n 'PaymentMethod': ['Credit card (automatic)'],\n 'OnlineSecurity': ['No'],\n 'Contract': ['Month-to-month']}"

In [73]:
# example of a set of input values that produces churn = no 
'''val = {'MonthlyCharges': [90.70],
 'TotalCharges': [178.35],
 'tenure': [200.0],
 'InternetService': ['Fiber optic'],
 'PaymentMethod': ['Credit card (automatic)'],
 'OnlineSecurity': ['No'],
 'Contract': ['Month-to-month']}'''

"val = {'MonthlyCharges': [90.70],\n 'TotalCharges': [178.35],\n 'tenure': [200.0],\n 'InternetService': ['Fiber optic'],\n 'PaymentMethod': ['Credit card (automatic)'],\n 'OnlineSecurity': ['No'],\n 'Contract': ['Month-to-month']}"

# Experiment to determine optimal columns for the model

- for simplicity's sake, we chose a subset of columns to train the model with
- the following analysis looks for the optimal set of columns
- first, we need to define a new pipeline that omits the column filtering and Logistic Regression layers
- next, we apply that pipeline (careful to define a set of categorical columns that includes all the candidate columns) to the X and Y subsets of the original dataset
- finally, we use random forest to come up with a score for the columns


In [74]:
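# define the pipeline layers just for data transformation - remove subset columns and LR layers
# use fresh instances of the step classes so that setting parameters here does not change the steps inside churn_pipeline
transform_pipeline = Pipeline([('fe',fill_empty()),('e_categorical',encode_categorical()),('scale_c',scale_continuous()),('ro',replace_outliers())])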

# list of categorical columns needs to include all the categorical columns
max_categorical_columns = [
 'gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod'
  ]

# fit the transform-only pipeline - note that X_train is a dataframe and y_train is a 1-D NumPy array of shape (4930,)

transform_pipeline.set_params(
                          e_categorical__col_list = max_categorical_columns,
                          scale_c__col_list = continuous_columns, 
                          ro__sd_mult = 3.0, ro__col_list = continuous_columns
                        ).fit(X_train, y_train)

fill empty xform
col is  gender
col is  SeniorCitizen
col is  Partner
col is  Dependents
col is  PhoneService
col is  MultipleLines
col is  InternetService
col is  OnlineSecurity
col is  OnlineBackup
col is  DeviceProtection
col is  TechSupport
col is  StreamingTV
col is  StreamingMovies
col is  Contract
col is  PaperlessBilling
col is  PaymentMethod
transform col is  gender
after transform col is  gender
transform col is  SeniorCitizen
after transform col is  SeniorCitizen
transform col is  Partner
after transform col is  Partner
transform col is  Dependents
after transform col is  Dependents
transform col is  PhoneService
after transform col is  PhoneService
transform col is  MultipleLines
after transform col is  MultipleLines
transform col is  InternetService
after transform col is  InternetService
transform col is  OnlineSecurity
after transform col is  OnlineSecurity
transform col is  OnlineBackup
after transform col is  OnlineBackup
transform col is  DeviceProtection
after transf

Pipeline(memory=None,
     steps=[('fe', fill_empty()), ('e_categorical', encode_categorical()), ('scale_c', scale_continuous()), ('ro', replace_outliers())])

In [75]:
# apply transform pipeline to input data
transformed_customer_data = transform_pipeline.transform(customer_data_orig)
transformed_customer_data.head()

fill empty xform
transform col is  gender
after transform col is  gender
transform col is  SeniorCitizen
after transform col is  SeniorCitizen
transform col is  Partner
after transform col is  Partner
transform col is  Dependents
after transform col is  Dependents
transform col is  PhoneService
after transform col is  PhoneService
transform col is  MultipleLines
after transform col is  MultipleLines
transform col is  InternetService
after transform col is  InternetService
transform col is  OnlineSecurity
after transform col is  OnlineSecurity
transform col is  OnlineBackup
after transform col is  OnlineBackup
transform col is  DeviceProtection
after transform col is  DeviceProtection
transform col is  TechSupport
after transform col is  TechSupport
transform col is  StreamingTV
after transform col is  StreamingTV
transform col is  StreamingMovies
after transform col is  StreamingMovies
transform col is  Contract
after transform col is  Contract
transform col is  PaperlessBilling
after 

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,0,0,1,0,-1.277445,0,1,0,0,...,0,0,0,0,0,1,2,-1.160323,-0.992611,No
1,5575-GNVDE,1,0,0,0,0.066327,1,0,0,2,...,2,0,0,0,1,0,3,-0.259629,-0.172165,No
2,3668-QPYBK,1,0,0,0,-1.236724,1,0,0,2,...,0,0,0,0,0,1,3,-0.36266,-0.958066,Yes
3,7795-CFOCW,1,0,0,0,0.514251,0,1,0,2,...,2,2,0,0,1,0,0,-0.746535,-0.193672,No
4,9237-HQITU,0,0,0,0,-1.236724,1,0,1,0,...,0,0,0,0,0,1,2,0.197365,-0.938874,Yes


In [76]:
# Import `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# Isolate data, class labels and column names

# X is the transformed original dataset minus customerID and Churn; Y is the Churn column with numeric IDs
X = transformed_customer_data.drop(['Churn','customerID'],axis=1)
Y = y_le
# take the names from X itself so that each name lines up with its importance score
names = X.columns.values

# Build the model
rfc = RandomForestClassifier()

# Fit the model
rfc.fit(X, Y)

# Print the results
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rfc.feature_importances_), names), reverse=True))

Features sorted by their score:
[(0.1872, 'TotalCharges'), (0.1736, 'MonthlyCharges'), (0.1516, 'tenure'), (0.102, 'Contract'), (0.0475, 'PaymentMethod'), (0.0457, 'OnlineSecurity'), (0.0336, 'TechSupport'), (0.033, 'InternetService'), (0.0297, 'OnlineBackup'), (0.0289, 'gender'), (0.0268, 'PaperlessBilling'), (0.0236, 'Partner'), (0.0211, 'MultipleLines'), (0.0202, 'SeniorCitizen'), (0.019, 'DeviceProtection'), (0.0189, 'Dependents'), (0.0175, 'StreamingMovies'), (0.0149, 'StreamingTV'), (0.0052, 'PhoneService')]
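
When ranking features, the names must come from the same frame that was passed to `fit`, because `feature_importances_` has one entry per fitted column, in column order. One way to keep the two aligned is a pandas Series indexed by the columns of the fitted frame. A minimal sketch on made-up data (the column names, values, `n_estimators`, and `random_state` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "signal": rng.normal(size=n),   # drives the label below
    "noise_a": rng.normal(size=n),  # unrelated to the label
    "noise_b": rng.normal(size=n),  # unrelated to the label
})
toy_label = (toy["signal"] > 0).astype(int)

toy_rfc = RandomForestClassifier(n_estimators=50, random_state=0)
toy_rfc.fit(toy, toy_label)

# index by the columns of the frame that was actually fitted
importances = pd.Series(toy_rfc.feature_importances_, index=toy.columns)
ranked = importances.sort_values(ascending=False)
print(ranked.round(4))
```

Fixing `random_state` makes the ranking repeatable; without it, the scores of weak features can change order from run to run.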
