# 4.2 Preparing data for churn modelling

In this notebook, we will load the data sets and create our churn indicator.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Defining the entity to model

Let's have a look at some data for the churn use case. Here are two files extracted from the data lake: a customer entity file and a contract entity file. Download the files, then we'll load them into pandas in Python and inspect them.

In [None]:
path = '/kaggle/input/applied-ml-microcourse-telco-churn'

customer = pd.read_csv('{}/customer.csv'.format(path))
contract = pd.read_csv('{}/contract.csv'.format(path))
customer.head()

In [None]:
contract.head()

In [None]:
customer.info()

It is common for a single customer to have multiple contracts, for instance one for each contract term. In this data set, however, each customer has only one contractID associated with it. To verify this, use the pandas duplicated() method. Applied inside the data frame subset operation, it returns every row whose customerID has already appeared earlier in the file. In our case the result has zero rows, indicating that there are no duplicated customers.

Check to see if there are any duplicate customerIDs in the contract file.

In [None]:
contract[contract['customerID'].duplicated()].shape
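
The default behaviour of duplicated() is easy to misread, so here is a minimal sketch on a made-up table (the `toy_contract` frame and its values are assumptions for illustration, not rows from the course data). It shows that the default `keep='first'` flags only the repeat occurrences, while `keep=False` flags every row involved in a duplicate.

```python
import pandas as pd

# Hypothetical toy contract table: customer 'A' appears twice
toy_contract = pd.DataFrame({
    'customerID': ['A', 'B', 'A', 'C'],
    'Contract': ['One year', 'Month-to-month', 'Two year', 'One year'],
})

# Default keep='first': only the second and later occurrences are flagged
repeats = toy_contract[toy_contract['customerID'].duplicated()]

# keep=False: every row that shares its customerID with another row is flagged
all_dupes = toy_contract[toy_contract['customerID'].duplicated(keep=False)]

print(repeats.shape)    # (1, 2)
print(all_dupes.shape)  # (2, 2)
```

A first element of 0 in the shape means no customerID is repeated, which is the result we expect on the real contract file.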

Let us now start creating our feature set. We will merge the customer and contract data sets together and define our churn label. Note that customerID is the common linking field for these two data sets and is unique in both files. Always check this first, because duplicate keys cause the merge to repeat rows and inflate the merged data set.

Ensure that customerID is unique on customer and check the size of both data sets.

In [None]:
print(customer[customer['customerID'].duplicated()])
print('Customer file shape is: {}'.format(customer.shape))
print('Contract file shape is: {}'.format(contract.shape))

In [None]:
data = customer.merge(contract, on='customerID')
data.head()
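
As an optional safeguard, pandas can enforce the uniqueness assumption during the merge itself. The sketch below uses made-up frames (`toy_customer`, `toy_contract` and their columns other than customerID are assumptions for illustration) to show the `validate` and `indicator` parameters of `merge`.

```python
import pandas as pd

# Hypothetical toy tables; 'C' has no contract and 'D' has no customer record
toy_customer = pd.DataFrame({
    'customerID': ['A', 'B', 'C'],
    'gender': ['F', 'M', 'F'],
})
toy_contract = pd.DataFrame({
    'customerID': ['A', 'B', 'D'],
    'MonthlyCharges': [29.85, 56.95, 42.30],
})

# validate='one_to_one' raises pandas.errors.MergeError if customerID
# is repeated on either side, instead of silently multiplying rows.
# how='outer' with indicator=True adds a '_merge' column that records
# whether each row came from the left table, the right table, or both.
checked = toy_customer.merge(
    toy_contract,
    on='customerID',
    how='outer',
    validate='one_to_one',
    indicator=True,
)
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['customerID', '_merge']])

# The default inner join keeps only customers present in both tables
inner = toy_customer.merge(toy_contract, on='customerID', validate='one_to_one')
print(inner.shape)  # (2, 3)
```

Comparing the merged row count with the shapes of the two input files gives the same reassurance on the real data: with a unique key on both sides, an inner join can never return more rows than the smaller input.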

## Defining the prediction label

Let's have a look at the contract data again and see if there is enough information to construct a churn label.  The file has the following fields:

- Contract: contract term, either one year, month-to-month or two years
- MonthlyCharges: contracted amount to charge per month
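- StartDate: notice that contracts seem to roll over when the initial term expires, and the start date is not updated
- EndDate: this is only populated for one contract in the top 10. It appears to be filled in only when the contract has been terminated

Because EndDate appears to be populated only for terminated contracts, we flag a customer as churned (1) when EndDate is present and as active (0) when it is missing.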

In [None]:
data['churn'] = np.where(data['EndDate'].isna(), 0, 1)
print('Merged data has shape: {}'.format(data.shape))
data.head()
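
To see the labelling rule in isolation, here is a small sketch on made-up data (the `toy` frame and its EndDate values are assumptions, not taken from the contract file). It applies the same `np.where` rule, shows an equivalent pandas-only form, and checks the resulting class balance.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a missing EndDate means the contract is still active
toy = pd.DataFrame({
    'customerID': ['A', 'B', 'C', 'D'],
    'EndDate': [np.nan, '2019-11-01', np.nan, '2019-12-01'],
})

# Same rule as above: 0 when EndDate is missing, 1 otherwise
toy['churn'] = np.where(toy['EndDate'].isna(), 0, 1)

# Equivalent pandas-only form
alt = toy['EndDate'].notna().astype(int)

# Inspect the class balance before modelling
print(toy['churn'].value_counts())
print(toy['churn'].mean())  # 0.5
```

The mean of a 0/1 label is the churn rate, so it is a quick way to see how imbalanced the target is on the real data.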