# 4.5 Creating features and data exploration

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In Section 4.2 we created a feature set from the customer and contract files, and we also created our churn label. Now we will load this file and explore variables that may be useful features, such as demographics, payment methods and basic contract details.

In [None]:
path = '/kaggle/input/applied-ml-microcourse-telco-churn'
data = pd.read_csv('{}/customer_contract_churn.csv'.format(path))
data.head()

### Initial Exploration

Let us first have a look at the data, and explore some of the variables.  We will start by counting the values of key categorical variables.  It can be handy to use loops to iteratively run a group by and count over each variable.  

In [None]:
columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PaperlessBilling', 'PaymentMethod', 'Contract']
for column in columns:
    display(data.groupby(column)['customerID'].count().to_frame())
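
As a variation on the loop above, here is a minimal sketch using `value_counts()`. The toy DataFrame and its values are invented for illustration and are not taken from the course data. One reason to consider it: `groupby()` drops missing group keys by default, whereas `value_counts(dropna=False)` reports them as their own row, which makes gaps easier to spot.

```python
import pandas as pd

# Toy stand-in for the customer data; the values are invented for illustration.
toy = pd.DataFrame({
    'customerID': ['a', 'b', 'c', 'd', 'e'],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month', None, 'Two year'],
    'Partner': ['Yes', 'No', 'No', 'Yes', 'No'],
})

counts = {}
for column in ['Contract', 'Partner']:
    # dropna=False keeps a row for missing values instead of silently dropping them
    counts[column] = toy[column].value_counts(dropna=False)
    print(counts[column].to_frame())

# For comparison: groupby skips the row whose Contract is missing
grouped_total = toy.groupby('Contract')['customerID'].count().sum()
```

In this toy example the `value_counts()` totals add up to all five rows, while the `groupby()` counts only cover four.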

This data looks very clean: there are no missing values, and the number of unique values in each column is low. The only numeric variable of interest is the MonthlyCharges column, so let's have a look at that. The pandas method describe() calculates several descriptive statistics over our data, giving us the count, mean, standard deviation, min and max, and the median (50%) and the quartiles (25%, 75%) that bound the interquartile range. These values show that this variable is bounded between 18.25 and 118.75, with a mean of 64.76, so it does not seem to have any outliers or missing values.

It is also good practice, with numeric variables in particular, to plot the distribution. We can use the hist() function from matplotlib.pyplot. In this case, we see two clear peaks in the data: one large peak at a MonthlyCharges value of around 20, close to the minimum, and another at a value of 80 or so.

In [None]:
display(data['MonthlyCharges'].describe().to_frame())

import matplotlib.pyplot as plt
plot = plt.hist(data['MonthlyCharges'], bins=20)
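
The checks for missing values and outliers described above can also be made explicit. The following is a minimal sketch on invented toy values, using the common 1.5 × IQR rule as an assumed definition of an outlier; the text itself does not prescribe a particular rule.

```python
import pandas as pd

# Toy values chosen for illustration; they are not taken from the course data.
charges_toy = pd.Series([18.25, 20.0, 45.5, 70.0, 80.0, 118.75, None],
                        name='MonthlyCharges')

# Count missing values directly rather than inferring them from describe()
n_missing = int(charges_toy.isna().sum())

# Flag points beyond 1.5 * IQR from the quartiles (an assumed rule of thumb)
q1, q3 = charges_toy.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = charges_toy[(charges_toy < lower) | (charges_toy > upper)]

print(n_missing, len(outliers))
```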

We will now create some additional features from this data.

### Tenure

We know the start date of each customer, but we cannot use dates as features in machine learning models since they are not numeric without specific transformations.  One way we can use this information is to calculate the tenure of the customer, which is the time that has passed since their contract started.  However, you need to be careful here since the end date will be different depending on the churn class.  If the customer has churned, we want to predict their churn event so the tenure should be at or before their churn date.  For simplicity, we will just set it to be the churn date.  For customers who are currently active, we set the end date to be the date we run this analysis.  In this case, it is the 1st of January 2020.

To do this in Python, we first need to insert the current date into our EndDate column, and make sure the date columns are formatted as pandas dates.

In [None]:
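# Active customers have no end date, so use the analysis date instead
data.loc[data['EndDate'].isna(), 'EndDate'] = '2020-01-01'
data['EndDate'] = pd.to_datetime(data['EndDate'])
data['StartDate'] = pd.to_datetime(data['StartDate'])
# Tenure in months, using the average month length (365.25 / 12 days);
# recent pandas versions may reject np.timedelta64(1, 'M') as a divisor
data['Tenure'] = (data['EndDate'] - data['StartDate']).dt.days / (365.25 / 12)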

data.head()
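
If a whole number of months is preferred, tenure can instead be counted in calendar months. The sketch below runs on an invented two-row toy frame and assumes that start and end dates fall on the first of a month; the column name `TenureMonths` is hypothetical.

```python
import pandas as pd

# Toy data for illustration; None marks a customer who is still active.
toy = pd.DataFrame({
    'StartDate': ['2019-01-01', '2018-07-01'],
    'EndDate': ['2019-07-01', None],
})

snapshot = pd.Timestamp('2020-01-01')  # analysis date used in the text
toy['StartDate'] = pd.to_datetime(toy['StartDate'])
toy['EndDate'] = pd.to_datetime(toy['EndDate']).fillna(snapshot)

# Difference in calendar months between the two dates
toy['TenureMonths'] = (
    (toy['EndDate'].dt.year - toy['StartDate'].dt.year) * 12
    + (toy['EndDate'].dt.month - toy['StartDate'].dt.month)
)
print(toy)
```

The first toy customer churned after 6 months; the second is still active and has a tenure of 18 months at the analysis date.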

### Services data

Now we will introduce some further data.  Often there are multiple services that can be bundled with a contract.  After some investigation, we find that there is another file which contains the list of services for each contract.  We will load this file and have a look.

In [None]:
services = pd.read_csv('{}/services.csv'.format(path))
display(services.shape)
services.sample(5)

This file looks to have a deep (long) structure since it has 63,387 rows, likely one row per contractID and service. We should explore the count of contractIDs for each combination of service and service value.

In [None]:
services.groupby(['Service', 'ServiceValue'])['contractID'].count().to_frame()

To turn this data into features, we will have to reshape the data to the contractID level (i.e. one row per contract), and convert each service to its own column. We can do this using the pandas pivot() function, then merge the results into our growing feature set.

In [None]:
services_pivot = services.pivot(index='contractID', columns='Service', values='ServiceValue').reset_index()
data = data.merge(services_pivot, on='contractID')
data.head()
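
To make the reshaping step concrete, here is a minimal sketch of the same pivot on a four-row toy table. The service names and values are invented for illustration. Since `pivot()` raises a `ValueError` when an index/column pair appears more than once, the sketch checks for duplicates first.

```python
import pandas as pd

# Toy long-format table: one row per contract and service (invented values).
services_toy = pd.DataFrame({
    'contractID': [1, 1, 2, 2],
    'Service': ['PhoneService', 'InternetService', 'PhoneService', 'InternetService'],
    'ServiceValue': ['Yes', 'DSL', 'No', 'Fiber optic'],
})

# pivot() fails if a (contractID, Service) pair is repeated, so check first
n_duplicates = int(services_toy.duplicated(subset=['contractID', 'Service']).sum())

# Long to wide: one row per contract, one column per service
wide = services_toy.pivot(index='contractID', columns='Service',
                          values='ServiceValue').reset_index()
wide.columns.name = None  # drop the leftover 'Service' label on the column axis
print(wide)
```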

### Charges

We have also come across some data related to historic charges for each contract. This data set looks to be recorded per month, without the day being specified (all the days given are the first). Let's load this data and have a look.

In [None]:
charges = pd.read_csv('{}/charges.csv'.format(path))
charges.head()

Let's also plot the charge variable. Compare this plot to the earlier one for the MonthlyCharges variable. You will notice that the values are slightly higher here. For example, the second peak in the distribution is at around 100, whereas previously it was around 80. This indicates that customers may be charged amounts each month on top of their contracted rates.

In [None]:
plot = plt.hist(charges['charge'], bins=20)

Transactional data sets can often contain interesting patterns. For instance, a customer may get a large and unexpected bill one month for additional charges. This may prompt a churn event.  For more complex features, we could look at the variation in the charge amount over time.  To keep things simple for now, we will just calculate the average charge paid by the customer per month, and join this feature to our growing feature set.

In [None]:
total_charges = charges.groupby('contractID')['charge'].mean().to_frame().reset_index()
total_charges.rename(columns={'charge': 'MeanMonthlyCharge'}, inplace=True)
data = data.merge(total_charges, on='contractID')
data.head()
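
The more complex features mentioned above, such as the variation in charges over time, could be sketched as follows. This is an illustration on invented toy data, not part of the feature set built in this section; the names `StdMonthlyCharge`, `MaxMonthlyCharge` and `MaxToMeanCharge` are hypothetical.

```python
import pandas as pd

# Toy monthly charges (invented): contract 1 has one unusually large bill.
charges_toy = pd.DataFrame({
    'contractID': [1, 1, 1, 2, 2, 2],
    'charge': [50.0, 50.0, 80.0, 20.0, 20.0, 20.0],
})

# Several summary statistics per contract in a single pass
charge_features = (
    charges_toy.groupby('contractID')['charge']
    .agg(MeanMonthlyCharge='mean', StdMonthlyCharge='std', MaxMonthlyCharge='max')
    .reset_index()
)

# Ratio of the largest bill to the typical bill; values well above 1 hint at a bill shock
charge_features['MaxToMeanCharge'] = (
    charge_features['MaxMonthlyCharge'] / charge_features['MeanMonthlyCharge']
)
print(charge_features)
```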

### Phone data usage

Given that most of the customers have phone services, one potentially useful data source would be their phone service data usage.   We have managed to source a data set containing the total phone service data usage for each contract per month. 

In [None]:
usage = pd.read_csv('{}/phone_usage.csv'.format(path))
usage.head()

As with the charges data, there are many ways to create features from this monthly usage data. For now, we will just calculate the mean monthly data usage for each customer, and join this feature to our feature set.

In [None]:
total_usage = usage.groupby('contractID')['MonthlyUsage'].mean().to_frame().reset_index()
total_usage.rename(columns={'MonthlyUsage': 'MeanMonthlyUsage'}, inplace=True)
data = data.merge(total_usage, on='contractID')
data.head()
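
One point to watch with the merge above: `merge()` performs an inner join by default, so any contract without usage records would be dropped from the feature set. Whether that happens in the course data is not something this section establishes. The toy sketch below, with invented values, shows the difference and one possible alternative, assuming that missing usage can reasonably be treated as zero.

```python
import pandas as pd

# Toy feature set with three contracts; contract 3 has no usage rows (invented data).
features_toy = pd.DataFrame({'contractID': [1, 2, 3]})
usage_toy = pd.DataFrame({'contractID': [1, 1, 2],
                          'MonthlyUsage': [2.0, 4.0, 1.0]})

mean_usage = (
    usage_toy.groupby('contractID')['MonthlyUsage'].mean()
    .rename('MeanMonthlyUsage')
    .reset_index()
)

# Default inner join: contract 3 disappears
inner = features_toy.merge(mean_usage, on='contractID')

# Left join keeps every contract; missing usage is filled with 0 (an assumption)
left = features_toy.merge(mean_usage, on='contractID', how='left')
left['MeanMonthlyUsage'] = left['MeanMonthlyUsage'].fillna(0.0)
print(len(inner), len(left))
```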

We have now created a good set of features for our task.  You can also use the info() method on pandas dataframes to see a summary of all the columns.

In [None]:
data.info()

We will now remove the columns that we no longer need, and save the data set to a file called 'features.csv'.

In [None]:
data.drop(columns=['contractID', 'StartDate', 'EndDate'], inplace=True)
# Write the final feature set to the working directory
data.to_csv('features.csv', index=False)