In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
path = '/kaggle/input/applied-ml-microcourse-telco-churn'
data = pd.read_csv('{}/features.csv'.format(path))

In [None]:
data.head()

We will start by looking at our churn label.  In applied machine learning tasks, we generally do not expect the classes to be perfectly balanced.  In fact, in many cases the classes are very unbalanced, because we typically model rare events such as churn, fraud, etc.  In this case, the imbalance is moderate, so we do not need to make any adjustments.

In [None]:
data['churn'].value_counts().to_frame()
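Raw counts are easier to judge as proportions.  The sketch below shows one way to do this with `value_counts(normalize=True)`.  It uses a small made-up series rather than the real `data['churn']` column, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for data['churn']; the values are made up for illustration.
churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name='churn')

# normalize=True returns the share of each class instead of the raw count.
class_share = churn.value_counts(normalize=True)
print(class_share)

# Ratio of the majority class share to the minority class share.
imbalance_ratio = class_share.max() / class_share.min()
print('Majority-to-minority ratio:', round(imbalance_ratio, 2))
```

A ratio close to 1 means the classes are nearly balanced; the larger the ratio, the more likely it is that resampling or class weights are worth considering.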

One quick way to explore the relationship between a binary classification target variable and categorical features is to group the data frame by the feature and calculate the mean of the target variable (treating churn as a numeric variable).  Note that this is only meaningful if our binary target has values of 0 or 1.  In our case, the customer has churned if the churn field is 1, so higher mean values of churn imply a larger proportion of churned users for the grouped value.  

In [None]:
categorical_columns = ['gender', 'SeniorCitizen', 'Dependents', 'PaperlessBilling', 'PaymentMethod', 'Contract', 
                       'DeviceProtection', 'InternetService', 'MultipleLines', 'OnlineBackup', 'OnlineSecurity',
                       'PhoneService', 'StreamingMovies', 'StreamingTV', 'TechSupport']
for column in categorical_columns:
    display(data.groupby(column)['churn'].mean().to_frame())
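A group mean says nothing about how many customers it is based on, and a churn rate from a handful of rows is much less reliable than one from thousands.  As a minimal sketch (using a made-up toy frame, not the course data), the mean and the group size can be reported side by side with named aggregation:

```python
import pandas as pd

# Toy data for illustration only; the values are not taken from the course data set.
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'One year', 'Two year',
                 'Month-to-month', 'One year', 'Two year', 'Month-to-month'],
    'churn': [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 target is the fraction of churned customers in each group;
# the count shows how many customers that fraction is based on.
summary = toy.groupby('Contract')['churn'].agg(churn_rate='mean', customers='count')
print(summary)
```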

From the output above, we can see some relationships emerging.  For instance, customers with 'No internet service' across any of the service columns seem to have a very low proportion of churn, around 5.6%.  40% of customers with Fibre Optic internet services have churned, 40% of customers on month-to-month contracts have churned, and the churn rate is 43% for customers who pay by electronic check.  These signals seem quite strong and we should achieve a good model with this set of features.  

For numeric features, a first step is often to calculate the mean value of the feature split by the churn target (treating it as a categorical variable).  

In [None]:
numeric_columns = ['MonthlyCharges', 'Tenure', 'MeanMonthlyCharge', 'MeanMonthlyUsage']
for column in numeric_columns:
    display(data.groupby('churn')[column].mean().to_frame())

As in the previous notebook, we can also plot histograms of each numeric feature split by the churn target to see how the distributions vary. 

In [None]:
import matplotlib.pyplot as plt

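def plot_churn_hist(column):
    # Overlay the distributions for retained (0) and churned (1) customers.
    # alpha makes the bars semi-transparent so neither histogram hides the other.
    plt.figure()
    plt.hist(data[data['churn'] == 0][column], bins=20, alpha=0.5, label='churn = 0')
    plt.hist(data[data['churn'] == 1][column], bins=20, alpha=0.5, label='churn = 1')
    plt.title(column)
    plt.legend()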

for column in numeric_columns:
    plot_churn_hist(column)

## Feature transformations for modelling

### Categorical variables

There are several ways to encode categorical variables in Python.  One simple way is to use the pandas function get_dummies().  Let's try this on the gender variable.

In [None]:
display(data['gender'].head())
display(pd.get_dummies(data['gender']).head())

One consequence of one-hot encoding that can cause problems is that of collinearity - when features are correlated or have a dependence on each other. For instance, if we know that the customer in the first row of data has Female gender, we also know that this customer does not have Male gender.  We can therefore just use the encoding for either Female or Male genders and no information is lost.  In general, when we have n values we only need n-1 encoded features to capture all the information since one of the features is completely determined by the others.

The get_dummies() method has an argument to do this, drop_first, which removes the encoding for the first value in alphabetical order.

In [None]:
display(pd.get_dummies(data['gender'], drop_first=True).head())
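The redundancy described above can be checked directly.  The following sketch uses a small made-up gender column (not the course data) to show that the full one-hot columns always sum to 1 per row, and that the dropped column can be rebuilt from the one that remains.  `dtype=int` is passed only so the output prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy column for illustration only.
gender = pd.Series(['Female', 'Male', 'Male', 'Female', 'Male'], name='gender')

# Full one-hot encoding: one column per value.
full = pd.get_dummies(gender, dtype=int)
print(full)

# Every row has exactly one 1, so the columns are linearly dependent.
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# With drop_first=True only the 'Male' column is kept ...
reduced = pd.get_dummies(gender, drop_first=True, dtype=int)

# ... and the dropped 'Female' column is fully determined by it.
recovered_female = 1 - reduced['Male']
print((recovered_female == full['Female']).all())
```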

We will now apply one-hot encoding with the first value dropped to the whole data frame.  We can do this by simply passing the data frame to get_dummies(), and all variables of type 'object' will be encoded.  Numeric variables will be unchanged.  We must first remove the customerID column; otherwise it would be encoded as well and we would get one column per customer (minus one, because the first value is dropped).

In [None]:

data_transformed = data.drop(columns='customerID')
data_transformed = pd.get_dummies(data_transformed, drop_first=True)

print('Number of columns before one-hot encoding:', data.shape[1])
print('Number of columns after one-hot encoding:', data_transformed.shape[1])
data_transformed.head()

As a last step, we will add the customerID back into the data (this is fine since we haven't reordered the data set).  We will then write the output to disk to use in the next module.
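A minimal sketch of this step is shown below on a small made-up frame.  The output file name `features_transformed.csv` and the temporary directory are assumptions for illustration; in a Kaggle notebook the file would normally go to `/kaggle/working/`.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    'customerID': ['A-001', 'B-002', 'C-003'],
    'gender': ['Female', 'Male', 'Male'],
    'MonthlyCharges': [29.85, 56.95, 53.85],
    'churn': [0, 0, 1],
})

# Encode everything except the ID column.
encoded = pd.get_dummies(toy.drop(columns='customerID'), drop_first=True)

# The rows have not been reordered or filtered, so the IDs align by index.
encoded.insert(0, 'customerID', toy['customerID'])

# Hypothetical output location and file name (assumptions, not from the course).
output_path = os.path.join(tempfile.gettempdir(), 'features_transformed.csv')
encoded.to_csv(output_path, index=False)
print(encoded)
```

If the rows had been shuffled or filtered in between, a merge on a retained key would be the safer way to reattach the IDs.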

### Scaling numeric variables

We will briefly demonstrate how to rescale variables using two different methods.  Some machine learning models are sensitive to the scale of their inputs and others are not, so we won't apply these methods to our data just yet.

We will use the scikit-learn module 'preprocessing', which provides a number of classes for scaling numeric variables.

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# get the names of all the numeric variables (float64)
numeric_variables = data_transformed.select_dtypes(include='float64').columns

# display descriptive statistics before transformations
display(data_transformed[numeric_variables].describe())

In [None]:
# Min Max transformation first

min_max = MinMaxScaler().fit_transform(data_transformed[numeric_variables])
# fit_transform returns a NumPy array, so wrap it in a data frame with the original column names
min_max = pd.DataFrame(min_max, columns=numeric_variables)
display(min_max.describe())

As expected, every feature output by the min-max scaler has a minimum of 0 and a maximum of 1.
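With its default feature range of (0, 1), min-max scaling maps each value x to (x - min) / (max - min), computed per column.  The sketch below applies that formula by hand with pandas on a made-up frame, so the column names and values are illustrative only.

```python
import pandas as pd

# Toy numeric features for illustration only.
toy = pd.DataFrame({
    'MonthlyCharges': [20.0, 50.0, 80.0, 110.0],
    'Tenure': [1.0, 12.0, 24.0, 72.0],
})

# Per-column minimum and range.
col_min = toy.min()
col_range = toy.max() - toy.min()

# Min-max scaling: subtract the minimum, divide by the range.
min_max_manual = (toy - col_min) / col_range
print(min_max_manual)
```

Because the minimum and maximum are taken from the data the scaler was fitted on, a single extreme outlier compresses all other values into a narrow part of the [0, 1] interval.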

In [None]:
# Standard Scaler transformation

std_scaler = StandardScaler().fit_transform(data_transformed[numeric_variables])
# fit_transform returns a NumPy array, so wrap it in a data frame with the original column names
std_scaler = pd.DataFrame(std_scaler, columns=numeric_variables)
display(std_scaler.describe())

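As expected, every feature output by the standard scaler has a mean very close to 0 (on the order of machine precision, around 1e-16) and a standard deviation of approximately 1.  The standard deviation reported by describe() is slightly above 1 because pandas uses the sample standard deviation (dividing by n-1), whereas StandardScaler divides by the population standard deviation (dividing by n).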
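Standardization maps each value x to (x - mean) / std per column, where std is the population standard deviation.  The sketch below reproduces this by hand on a made-up frame (illustrative values only) and compares the population and sample standard deviations of the result.

```python
import numpy as np
import pandas as pd

# Toy numeric features for illustration only.
toy = pd.DataFrame({
    'MonthlyCharges': [20.0, 50.0, 80.0, 110.0],
    'Tenure': [1.0, 12.0, 24.0, 72.0],
})

# Standardize with the population standard deviation (ddof=0).
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# The population std of the result is 1 ...
pop_std = standardized.std(ddof=0)

# ... while the sample std (the pandas default, ddof=1) is sqrt(n / (n - 1)).
sample_std = standardized.std(ddof=1)

print(standardized.mean())
print(pop_std)
print(sample_std)
```

With only four rows the gap between the two standard deviations is large; on a data set with thousands of rows it shrinks to a difference in the later decimal places.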