# Prepare Fictitious Churn Data
This notebook takes a publicly available, fictitious dataset for a telco churn use case and restructures it into multiple monthly files: one set with customer details and one set with call data.

The prepared data files are intended as sample data for running associated notebooks on selecting a subset and transforming, combining, and aggregating large datasets. They are much too small to require any of the techniques demonstrated in those notebooks (they don't require Spark, for example), but they are enough to run the code and show the intended use and results.

## Preliminaries

In [1]:
import pandas as pd

from calendar import month_abbr   # List: 1 = 'Jan', 2 = 'Feb', etc.

import numpy as np
import random
import wget

import os
import glob

local_path = os.environ['DSX_PROJECT_DIR'] + '/datasets/'    # Handy for reading and writing to local project

In [2]:
# Allow multiple outputs from a cell, even without print() statements
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Load source data directly from the remote site
Find the datasets in the [IBM Watson Community](https://dataplatform.cloud.ibm.com/community?context=analytics&query=telco%20churn&format=dataset).

In [3]:
customers_url = 'https://dataplatform.cloud.ibm.com/data/exchange-api/v1/entries/4c1779a61392dedf1fd3caad4c8c8517/data?accessKey=946c86bf96596c10cc651e52502344e2'
calls_url     = 'https://dataplatform.cloud.ibm.com/data/exchange-api/v1/entries/4c1779a61392dedf1fd3caad4cc20e64/data?accessKey=c8d0403d844a82df9ecd264df078c9cd'

In [4]:
df_cust = pd.read_csv(customers_url, header=0, dtype=str)

df_cust.shape                       # 1000 rows, 23 columns
df_cust['churned'].value_counts()   # about 2:1 not-churned:churned
df_cust.info()                      # all columns are read as strings; only twitter_handle has na's

(1000, 23)

N    676
Y    324
Name: churned, dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 23 columns):
customer_id                     1000 non-null object
first_name                      1000 non-null object
last_name                       1000 non-null object
twitter_handle                  320 non-null object
number                          1000 non-null object
gender                          1000 non-null object
age                             1000 non-null object
type                            1000 non-null object
location                        1000 non-null object
location_lat                    1000 non-null object
location_lon                    1000 non-null object
callcenter_callcount            1000 non-null object
text_package_deal               1000 non-null object
voice_package_deal              1000 non-null object
4g_handset_deal                 1000 non-null object
all_u_can_eat_data_deal         1000 non-null object
accidental_damage_cover_deal    1000 non-null o

In [5]:
n_total = df_cust.shape[0]                       # total number of rows
n_churn = df_cust['churned'].value_counts()['Y'] # number of churners

## Randomly assign to churners the month in which they left
Churners leave in months 1=January, 2=February, ...; non-churners "leave" in month 13, that is, not during the period covered by this dataset.

In [6]:
random.seed(123)                     # make the churn-month assignment repeatable (seeds Python's random only, not np.random)
months     = 12                      # the call data covers a whole year (see below)
churn_rate = n_churn/n_total/months  # monthly churn rate
join_rate  = churn_rate              # arbitrary choice: set the monthly join rate equal to the churn rate (customers come and go)

df_cust['left_in_period'] = df_cust.apply(lambda r: months+1 if r['churned'] == 'N' else random.randint(1, months),
                                          axis='columns')

### Verify that all churns are reasonably distributed over the months
The non-churners are represented in month 13.

In [7]:
df_cust['left_in_period'].value_counts().sort_index().to_dict()

{1: 33,
 2: 26,
 3: 23,
 4: 18,
 5: 21,
 6: 39,
 7: 34,
 8: 24,
 9: 38,
 10: 20,
 11: 27,
 12: 21,
 13: 676}

## Randomly assign all customers to a month when they joined
Most customers were there when data collection started; mark them as having joined in "month" 0 (zero). But mark some fraction (which depends on the arbitrary monthly join rate) as joining in a random month within the data collection period, with the constraint that they cannot be marked as having joined in a month later than when they left (in the case of churners).

There's a somewhat obscure calculation behind assigning probabilities to each of the months. The intent is to have a randomly fluctuating monthly join rate that on average matches the set value in `join_rate`.

In [8]:
counts                 = df_cust['left_in_period'].value_counts().sort_index().to_dict()
cumulative_counts_from = [sum([counts[j+1] for j in range(i, months+1)]) for i in range(1, months+1)]
prob                   = [join_rate*n_total/c for c in cumulative_counts_from]
p_0s                   = [1.0 - sum(prob[:m+1]) for m in range(0, months)]

def join_period(row):
    period = row['left_in_period']
    if period == 1:           # A churner who leaves in period 1 must have been there since period 0
        return 0
    elif 1 < period <= months+1:  # Randomly pick a month, with the constraint that a customer can't join after they leave
        return np.random.choice(period, p=[p_0s[period-2]]+prob[:period-1])
    else:
        raise ValueError('Churn period {} is outside the range [1 ... {}]'.format(period, months+1))

df_cust['joined_in_period'] = df_cust.apply(join_period, axis='columns')
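The probability calculation above can be unpacked with a small standalone sketch. It assumes the monthly churn counts shown in the earlier output and recomputes the same quantities in plain Python. The names `eligible`, `expected_joiners`, and `join_probabilities` are introduced here for illustration only and do not appear in the notebook. The idea is that a customer can only join in month `i` if they leave later than month `i`, so dividing the target number of joiners per month by the number of eligible customers gives a per-customer probability that yields the same expected number of joiners every month.

```python
months = 12

# Customers per value of left_in_period, copied from the output shown earlier
counts = {1: 33, 2: 26, 3: 23, 4: 18, 5: 21, 6: 39, 7: 34,
          8: 24, 9: 38, 10: 20, 11: 27, 12: 21, 13: 676}

n_total = sum(counts.values())            # all customers
n_churn = n_total - counts[months + 1]    # everyone not in "month 13"
join_rate = n_churn / n_total / months    # same arbitrary choice as in the notebook

# Customers who can join in month i: those who leave in month i+1 or later
eligible = [sum(counts[k] for k in range(i + 1, months + 2))
            for i in range(1, months + 1)]

# Per-customer probability of joining in month i
prob = [join_rate * n_total / c for c in eligible]

# Probability of having joined in period 0, indexed by left_in_period - 2
p_0s = [1.0 - sum(prob[:m + 1]) for m in range(months)]

# Expected number of joiners in each month: the same for every month
expected_joiners = [p * c for p, c in zip(prob, eligible)]

def join_probabilities(period):
    """Probabilities over join periods 0 .. period-1 for a given left_in_period."""
    return [p_0s[period - 2]] + prob[:period - 1]

print(expected_joiners[0], sum(join_probabilities(months + 1)))
```

Because `np.random.choice` draws from these probabilities independently for each customer, the actual monthly numbers fluctuate around this expected value.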

### Check the results
Most joiners in any period never left (marked as having left in month 13); and those who joined in period 0 (before data collection) form the largest group. Those who left in period 1 must have joined in period 0. Those who left in period 2 must have joined in period 0 or 1, and so on.

In [9]:
pd.options.display.max_rows = 999

df_cust.groupby(['joined_in_period', 'left_in_period']).size()

joined_in_period  left_in_period
0                 1                  33
                  2                  26
                  3                  23
                  4                  15
                  5                  20
                  6                  37
                  7                  25
                  8                  14
                  9                  27
                  10                 13
                  11                 21
                  12                  9
                  13                411
1                 11                  1
                  12                  1
                  13                 13
2                 8                   3
                  9                   2
                  10                  1
                  12                  2
                  13                 17
3                 4                   3
                  7                   1
                  8                   2
       

## Write the monthly customer files

In [10]:
filename = os.path.join(local_path, 'customers_{}.csv')

for m in range(1, months+1):
    (df_cust.loc
     [(df_cust['joined_in_period'] <= m) & (df_cust['left_in_period'] >= m)] # Only customers who have joined this month or before and have not left yet
     .assign(churned = lambda r: np.where(r['left_in_period'] == m, 'Y', 'N'))   # Only mark churned in the month in which they left
     .drop(['joined_in_period', 'left_in_period'], axis='columns')               # Don't write the join/leave supporting columns
     .to_csv(filename.format(month_abbr[m]), header=True, index=False)
    )
    
glob.glob(os.path.join(local_path, 'customers_*.csv'))

['/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_May.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Jan.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Feb.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Mar.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Apr.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Jun.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Jul.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Aug.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Sep.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Oct.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Nov.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/customers_Dec.csv']
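
To make the selection logic in the loop easier to follow, here is a minimal sketch on a made-up four-customer frame. The helper `monthly_snapshot` and the toy data are hypothetical; only the column names `joined_in_period`, `left_in_period`, and `churned` come from the notebook. A customer appears in month `m` if they joined in or before `m` and have not left before `m`, and they are flagged as churned only in the month in which they leave.

```python
import numpy as np
import pandas as pd

# Made-up customers: A never leaves, B leaves in month 2,
# C joins in month 3 and stays, D joins in month 5 and leaves in month 6
toy = pd.DataFrame({
    'customer_id':      ['A', 'B', 'C', 'D'],
    'joined_in_period': [0, 0, 3, 5],
    'left_in_period':   [13, 2, 13, 6],
})

def monthly_snapshot(df, m):
    """Customers present in month m, with churned set only for those leaving in m."""
    active = df.loc[(df['joined_in_period'] <= m) & (df['left_in_period'] >= m)]
    return (active
            .assign(churned=np.where(active['left_in_period'] == m, 'Y', 'N'))
            .drop(['joined_in_period', 'left_in_period'], axis='columns'))

feb = monthly_snapshot(toy, 2)
jun = monthly_snapshot(toy, 6)
print(feb)
print(jun)
```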

## Load the call data
Pay attention to the datetime column; coerce it into a datetime type so it can be properly manipulated and interpreted. And derive a "date" column, which ignores the time-of-day part.

In [11]:
schema = {
    'from'     : str,
    'to'       : str,
    'dt'       : object,
    'duration' : int,    # Need to be able to compute sums
    'dropped'  : float   # For some reason, these are presented with decimal fractions (which are always zero), so keep them
}

df_calls         = pd.read_csv(calls_url, dtype=schema, header=0, parse_dates=[2])
df_calls['date'] = pd.to_datetime(df_calls['dt'].dt.date)            # Precompute dates from full datetimes

In [12]:
df_calls.dtypes
df_calls['date'].describe()  # Notice that the timestamps cover the range from January 1st to December 30th, (almost) a full year

from                object
to                  object
dt          datetime64[ns]
duration             int64
dropped            float64
date        datetime64[ns]
dtype: object

count                  710275
unique                    364
top       2013-05-16 00:00:00
freq                     2077
first     2013-01-01 00:00:00
last      2013-12-30 00:00:00
Name: date, dtype: object
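
The datetime handling can be tried out without downloading anything. The sketch below uses a few made-up rows with the same column names as the call data and derives the date with `dt.normalize()`, which sets the time of day to midnight while keeping a datetime type. That is an alternative to the `pd.to_datetime(... .dt.date)` round trip used above, not the notebook's own code.

```python
import io
import pandas as pd

# Made-up rows with the same column names as the call data
toy_csv = io.StringIO(
    "from,to,dt,duration,dropped\n"
    "2295782445,5550000001,2013-01-01 08:15:00,12,0.0\n"
    "2295782445,5550000002,2013-01-01 17:40:00,3,1.0\n"
    "5550000001,2295782445,2013-01-02 09:05:00,7,0.0\n"
)

# Keep phone numbers as strings; parse the dt column as datetimes
toy_calls = pd.read_csv(toy_csv, dtype={'from': str, 'to': str}, parse_dates=['dt'])

# Drop the time-of-day part but keep a datetime column
toy_calls['date'] = toy_calls['dt'].dt.normalize()

print(toy_calls.dtypes)
print(toy_calls['date'].nunique())
```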

## Roll up individual call data into daily counts per phone number
Each record in the call data represents a single call, with to and from numbers, duration, and whether the call was dropped or not (1 or 0, for some reason listed with a decimal fraction which is always zero). What we need is, for each number, the number of outgoing calls, minutes, and dropped calls per day, and the same for incoming calls.

One wrinkle is not addressed in this simulated dataset, in which customers leave and join each month: every customer who churns in a given month still has calls throughout that month. In other words, the day of the month on which they stop using the service is not randomized. This does not affect anything in the sample notebooks that are expected to use this dataset.

In [13]:
# Outgoing calls from the number
df_from = (df_cust
             .merge(df_calls, how='inner', left_on='number', right_on='from')
             [['customer_id','number','date','duration','dropped']]
             .groupby(['customer_id','number','date'])                      # customer_id and number are linked one-to-one, but no matter
             .agg({'number':'size', 'duration':'sum', 'dropped':'sum'}, axis='columns')    # arbitrary column name for count; size(), not count()
             .rename(columns={'number':'from_calls', 'duration':'from_duration', 'dropped':'from_dropped'})
          )

# Incoming calls to the number
df_to   = (df_cust
             .merge(df_calls, how='inner', left_on='number', right_on='to')[['customer_id','number','date','duration','dropped']]
             .groupby(['customer_id','number','date'])
             .agg({'number':'size', 'duration':'sum', 'dropped':'sum'}, axis='columns')
             .rename(columns={'number':'to_calls', 'duration':'to_duration', 'dropped':'to_dropped'})
          )

df_from.head()
df_to.head()

customer_id,number,date,from_calls,from_duration,from_dropped
C0000000000,2295782445,2013-01-01,7,107,0.0
C0000000000,2295782445,2013-01-02,5,25,1.0
C0000000000,2295782445,2013-01-03,4,61,0.0
C0000000000,2295782445,2013-01-04,2,13,0.0
C0000000000,2295782445,2013-01-05,6,63,0.0


customer_id,number,date,to_calls,to_duration,to_dropped
C0000000000,2295782445,2013-01-02,1,1,1.0
C0000000000,2295782445,2013-01-03,3,42,1.0
C0000000000,2295782445,2013-01-12,1,10,0.0
C0000000000,2295782445,2013-01-13,1,12,0.0
C0000000000,2295782445,2013-01-19,1,1,1.0
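
The roll-up itself can be illustrated on a handful of made-up calls. This sketch uses named aggregation (available since pandas 0.25) as an alternative spelling of the dict-plus-rename pattern above. The toy data and the variable names `toy` and `daily` are assumptions for illustration, and only the outgoing direction is shown.

```python
import pandas as pd

# Made-up outgoing calls: customer C1 makes two calls on Jan 1 and one on Jan 2
toy = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C1', 'C2'],
    'number':      ['111', '111', '111', '222'],
    'date':        pd.to_datetime(['2013-01-01', '2013-01-01',
                                   '2013-01-02', '2013-01-01']),
    'duration':    [10, 5, 7, 3],
    'dropped':     [0.0, 1.0, 0.0, 0.0],
})

daily = (toy
         .groupby(['customer_id', 'number', 'date'])
         .agg(from_calls=('duration', 'size'),      # number of calls per day
              from_duration=('duration', 'sum'),    # total duration per day
              from_dropped=('dropped', 'sum'))      # dropped calls per day
        )

print(daily)
```

Counting with `size` rather than `count` means every row is counted, whether or not the aggregated column holds a value.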


In [14]:
# Put customer_id into the df_cust index, in preparation for a join
if not df_cust.index.name: df_cust = df_cust.set_index('customer_id')

# Put to and from calls together.
# Fill in any na's from the outer join: na means zero calls (and minutes, dropped) for that category.
# Columns with na's have to be of float type (Pandas/numpy restriction), so coerce to int after filling.
# Derive a month column from the date column (integer 1,...,12, can be compared to the joined/left helper columns).
# Add in the joined/left helper columns for later selection
df_daily = (df_from.join(df_to, how='outer')
            .fillna(0)                         # na means no calls in that direction on that day
            .astype({'from_calls':int, 'from_duration':int, 'to_calls':int, 'to_duration':int})
            .reset_index(['number','date'])
            .assign(month=lambda x: x['date'].dt.month)
            .join(df_cust[['joined_in_period', 'left_in_period']])
           )

df_daily.head()

customer_id,number,date,from_calls,from_duration,from_dropped,to_calls,to_duration,to_dropped,month,joined_in_period,left_in_period
C0000000000,2295782445,2013-01-01,7,107,0.0,0,0,0.0,1,0,13
C0000000000,2295782445,2013-01-02,5,25,1.0,1,1,1.0,1,0,13
C0000000000,2295782445,2013-01-03,4,61,0.0,3,42,1.0,1,0,13
C0000000000,2295782445,2013-01-04,2,13,0.0,0,0,0.0,1,0,13
C0000000000,2295782445,2013-01-05,6,63,0.0,0,0,0.0,1,0,13
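
The reason for the `fillna` and `astype` steps is easier to see on a tiny example. The sketch below uses made-up incoming and outgoing counts for a single customer: the outer join keeps days that appear on only one side, the missing side becomes NaN (which forces a float column), and filling with zero allows the counts to be turned back into integers. The frames `toy_from` and `toy_to` are hypothetical stand-ins for `df_from` and `df_to`.

```python
import pandas as pd

names = ['customer_id', 'date']
idx_from = pd.MultiIndex.from_tuples(
    [('C1', '2013-01-01'), ('C1', '2013-01-02')], names=names)
idx_to = pd.MultiIndex.from_tuples(
    [('C1', '2013-01-02'), ('C1', '2013-01-03')], names=names)

toy_from = pd.DataFrame({'from_calls': [7, 5]}, index=idx_from)
toy_to = pd.DataFrame({'to_calls': [1, 3]}, index=idx_to)

# Outer join: days with calls in only one direction get NaN on the other side
combined = toy_from.join(toy_to, how='outer').sort_index()

# NaN means zero calls; after filling, the counts can be integers again
filled = combined.fillna(0).astype({'from_calls': int, 'to_calls': int})

print(combined)
print(filled)
```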


## Write the monthly call files

In [15]:
filename = os.path.join(local_path, 'calls_{}.csv')

for m in range(1, months+1):
    (df_daily
     .loc[(df_daily['month']            == m) &                           # filter by month
          (df_daily['joined_in_period']  < m) &                           # must have joined previously
          (df_daily['left_in_period']   >= m)]                            # must not have left yet
     .drop(['month','joined_in_period','left_in_period'], axis='columns')
     .to_csv(filename.format(month_abbr[m]), header=True, index=True)
    )
    
glob.glob(os.path.join(local_path, 'calls_*.csv'))

['/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_May.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Jan.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Feb.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Mar.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Apr.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Jun.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Jul.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Aug.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Sep.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Oct.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Nov.csv',
 '/user-home/1053/DSX_Projects/Big-Churn/datasets/calls_Dec.csv']

### Developed by IBM Data Science Elite Team, IBM Data Science and AI:
- Robert Uleman, Data Science Engineer

Copyright (c) 2019 IBM Corporation