# Exploring Comcast Customer Interaction Data
Load sample data, flatten JSON structures, understand the meaning of various status and reason codes, construct time series, look for identifying patterns and construct candidate features.

## Preliminaries
Set up credentials and functions for file I/O, enable PixieDust for visualization.

In [2]:
# The code was removed by DSX for sharing.

In [3]:
# The code was removed by DSX for sharing.

In [4]:
import pixiedust

Pixiedust database opened successfully


## Load the raw sample data
And take a look at what's in it.

In [5]:
import pandas as pd  # harmless if the hidden setup cells already import it

df = pd.read_csv(
        get_object_storage_file_with_credentials(
            'CableCompany', 'limitedJsonMachinelearning.csv'),
        sep="|", header=None)
df.head()

Unnamed: 0,0,1
0,10050788972,"{""msg"":{""body"":{""Account"":{""AccountId"":1005078..."
1,10050788972,"{""msg"":{""body"":{""Account"":{""AccountId"":1005078..."
2,10021901902,"{""msg"":{""body"":{""Account"":{""AccountId"":1002190..."
3,10021901902,"{""msg"":{""body"":{""Account"":{""AccountId"":1002190..."
4,10021901902,"{""msg"":{""body"":{""Account"":{""AccountId"":1002190..."


### There are several observations per account ID

In [6]:
df[0].value_counts()[0:5]

02300449241    24
00240118225    23
00121471183    21
00522654023    21
44010354197    20
Name: 0, dtype: int64

### Fix formatting issues
Account and order IDs look numeric, but they have leading zeroes, so they
must be treated as strings.

In [7]:
import json

def fix(row):
    # Convert AccountId values from int to str to handle leading zeroes
    accountid = row[1][row[1].find("AccountId"):row[1].find("ConnectDate")-2]
    accountid = accountid[accountid.find(":")+1:]
    accountid_fix = row[1].replace(accountid,'"{}"'.format(accountid))
    
    # Convert OrderId values from int to str to handle leading zeroes
    orderid = accountid_fix[accountid_fix.find("OrderId"):accountid_fix.find("Class")-2]
    orderid = orderid[orderid.find(":")+1:]
    orderid_fix = accountid_fix.replace(orderid,'"{}"'.format(orderid))
    
    # Convert str to dict
    return json.loads(orderid_fix)



In [8]:
df[2] = df.apply(fix, axis=1)
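
The string slicing in `fix` relies on `ConnectDate` following `AccountId`, and `Class` following `OrderId`, in every record. A regular-expression variant, sketched below, avoids that dependence on key order. The helper name `quote_ids` and the sample record are made up for illustration; this is one possible alternative, not the method used in this notebook.

```python
import json
import re

# Matches "AccountId": or "OrderId": followed by a bare run of digits.
# Values that are already quoted do not match, so re-applying it is harmless.
_ID_PATTERN = re.compile(r'"(AccountId|OrderId)"\s*:\s*(\d+)')

def quote_ids(raw):
    """Wrap bare numeric AccountId/OrderId values in quotes, then parse."""
    quoted = _ID_PATTERN.sub(r'"\1":"\2"', raw)
    return json.loads(quoted)

# Hypothetical record with leading zeroes, which plain json.loads would reject
sample = ('{"Account":{"AccountId":00762520805,"ConnectDate":"2012-11-09"},'
          '"Order":{"OrderId":0076251,"Class":"S"}}')
record = quote_ids(sample)
print(record["Account"]["AccountId"])
```

Because the pattern is anchored on the key names, it only touches the two ID fields and leaves any other occurrence of the same digits alone.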

### Make sure the format fix worked
Column 1 holds the original data and column 2 the fixed data. The dictionary elements have been reordered, and the truncated Pandas table display no longer shows the Account ID, so print a single value to see the entire dictionary.

In [9]:
df.head()

Unnamed: 0,0,1,2
0,10050788972,"{""msg"":{""body"":{""Account"":{""AccountId"":1005078...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
1,10050788972,"{""msg"":{""body"":{""Account"":{""AccountId"":1005078...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
2,10021901902,"{""msg"":{""body"":{""Account"":{""AccountId"":1002190...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
3,10021901902,"{""msg"":{""body"":{""Account"":{""AccountId"":1002190...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
4,10021901902,"{""msg"":{""body"":{""Account"":{""AccountId"":1002190...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...


In [10]:
# Print the value in column 2 of the first row (row 0)
df[2][0]

{'msg': {'body': {'Account': {'AccountId': '10050788972',
    'ConnectDate': '2012-11-09',
    'Delinquency': {'Status': 'C'},
    'DisconnectDate': '2017-07-31'},
   'Order': {'Class': 'S',
    'OrderId': '7625192014',
    'ReasonCode': 'O2',
    'ReasonCodeDescription': 'S-Verizon Fios (Fiber)',
    'Status': 'C'}},
  'head': {'Origin': 'LAZIA:17-BEP ORDER DETAIL UPDATED',
   'TriggerDate': '2017-07-31',
   'TriggerTime': '04:17:56'}}}

## Flatten the dictionary structures
Put each element in its own column. Omit elements deemed irrelevant.

In [11]:
# Brute-force method for flattening: list each column and element as needed
# instead of relying on fully automated traversal
def flatten_row(row, i):
    info = pd.DataFrame(row['msg']['body']['Order'], index = [i])
    info['AccountId'] = row['msg']['body']['Account']['AccountId']
    info['ConnectDate'] = row['msg']['body']['Account']['ConnectDate']
    info['DelinquencyStatus'] = row['msg']['body']['Account']['Delinquency']['Status']
    info['DisconnectDate'] = row['msg']['body']['Account']['DisconnectDate']
    info2 = pd.DataFrame(row['msg']['head'], index = [i])   
    return pd.concat([info, info2],axis=1)

In [12]:
df1 = flatten_row(df[2][0],0)
for i in range(1,df.shape[0]):
    df1 = df1.append(flatten_row(df[2][i],i))
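# Drop the stray column named S'=['tatus, apparently created by a
# garbled 'Status' key in one of the records
del df1["S'=['tatus"]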
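
For comparison, the fully automated traversal mentioned in the comment above could look roughly like the sketch below. The function name `flatten_dict` and the dotted key format are assumptions made for illustration; the notebook itself uses the explicit `flatten_row` approach.

```python
def flatten_dict(nested, parent_key="", sep="."):
    """Recursively flatten nested dicts into a single-level dict.

    Keys are joined with `sep`, e.g. {'a': {'b': 1}} -> {'a.b': 1}.
    """
    flat = {}
    for key, value in nested.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Abbreviated copy of the record printed earlier
record = {'msg': {'body': {'Account': {'AccountId': '10050788972',
                                       'Delinquency': {'Status': 'C'}},
                           'Order': {'Class': 'S', 'Status': 'C'}},
                  'head': {'TriggerDate': '2017-07-31'}}}
flat = flatten_dict(record)
print(sorted(flat))
```

A list of such flat dictionaries can be passed to `pd.DataFrame` in one call. The trade-off is that every element becomes a column, so irrelevant ones have to be dropped afterwards rather than simply left out.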

#### Rename columns using the JSON structure

In [13]:
def rename_cols(s):
    """Prefix a column name with the JSON section it came from."""
    order = ['Class', 'ReasonCode', 'ReasonCodeDescription', 'Status']
    account = ['ConnectDate', 'DelinquencyStatus', 'DisconnectDate']
    head = ['TriggerDate', 'TriggerTime', 'Origin']
    if s in order:
        return "Order" + s
    elif s in account:
        return "Account" + s
    elif s in head:
        return "Head" + s
    else:
        # AccountId and OrderId already name their section
        return s

In [14]:
df1.columns = [rename_cols(col) for col in df1.columns]

In [15]:
# Check that we got what we wanted
df1.columns.tolist()

['AccountId',
 'OrderClass',
 'AccountConnectDate',
 'AccountDelinquencyStatus',
 'AccountDisconnectDate',
 'OrderId',
 'HeadOrigin',
 'OrderReasonCode',
 'OrderReasonCodeDescription',
 'OrderStatus',
 'HeadTriggerDate',
 'HeadTriggerTime']

## Explore data after JSON flattening

In [16]:
df1.shape

(3308, 12)

### Lookup dictionaries for status and class codes
These come in handy when interpreting one-letter codes

In [17]:
status_codes = {
    '':'Normal',
    'A':'Open non-pay disconnect and equipment is active',
    'C':'Voluntary disconnect',
    'E':'Non-pay disconnect',
    'F':'Open non-pay disconnect and equipment is force tuned',
    'P':'Pending non-pay disconnect and services are restored; CSG assigns this status in real time',
    'S':'Pending change of service job (applies to subscription billing)',
    'T':'PPV ordering restricted',
    'V':'Open voluntary disconnect job',
    'W':'Open non-pay disconnect and equipment is disabled',
    'Z':'Charged off'
}

class_codes = {
    'M':'Special request',
    'S':'Service order',
    'T':'Trouble call'
}

In [18]:
(df1['AccountDelinquencyStatus'].apply(lambda x: status_codes.get(x))
                        .value_counts())

Normal                                                                                        2581
Open non-pay disconnect and equipment is disabled                                              192
Open non-pay disconnect and equipment is active                                                137
Open voluntary disconnect job                                                                  133
Voluntary disconnect                                                                           118
Pending non-pay disconnect and services are restored; CSG assigns this status in real time      98
PPV ordering restricted                                                                         31
Non-pay disconnect                                                                              14
Charged off                                                                                      4
Name: AccountDelinquencyStatus, dtype: int64

In [19]:
(df1['OrderClass'].apply(lambda x: class_codes.get(x))
             .value_counts())

Service order      2623
Trouble call        437
Special request     248
Name: OrderClass, dtype: int64
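
One thing to keep in mind with `dict.get` lookups: a code that is missing from the table maps to `None`, and `value_counts()` leaves such entries out by default. The sketch below shows one way to keep unmapped codes visible. The `decode` helper and the sample sequence are hypothetical.

```python
from collections import Counter

# Copy of the class-code lookup table defined earlier
class_codes = {'M': 'Special request', 'S': 'Service order', 'T': 'Trouble call'}

def decode(code, table):
    """Return the description for `code`, or a visible placeholder if unmapped."""
    return table.get(code, 'Unknown code: {!r}'.format(code))

# Hypothetical sequence of codes, including one ('Q') that is not in the table
observed = ['S', 'S', 'T', 'Q', 'M', 'S']
counts = Counter(decode(c, class_codes) for c in observed)
print(counts.most_common())
```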

In [20]:
# This is an example of using PixieDust to create a visual representation
df_status = (df1['AccountDelinquencyStatus'].value_counts()
                                    .fillna('N')
                                    .to_frame()
                                    .reset_index()
                                    .replace('', 'N')
            )
df_status['status'] = df_status.index
display(df_status)

index,AccountDelinquencyStatus,status
N,2581,0
W,192,1
A,137,2
V,133,3
C,118,4
P,98,5
T,31,6
E,14,7
Z,4,8


In [21]:
df_or = df1[['OrderReasonCode', 'OrderReasonCodeDescription']].drop_duplicates() #.sort_values('OrderReasonCode')
df_or.head(20)
# display(df_or)

Unnamed: 0,OrderReasonCode,OrderReasonCodeDescription
0,O2,S-Verizon Fios (Fiber)
2,SJ,Channel-Web Order
3,18,Tech Assist
8,OT,P-Transfers Of Service
9,NP,P-Non Pay
13,OZ,P-Student
14,NT,Install-No Truck
17,25,Equip Prob
19,10,Digital Prblm
25,SC,Cssr Sale


## Construct per-account time series
The original data has one record per customer interaction (Order ID), with multiple records
per customer (Account ID). Since each customer's history is independent of all the others, it
makes sense to aggregate each history into a single row, one for each Account ID. The codes
for each interaction can be concatenated, in time order, to form a time series of codes. These
will form the basis of features and labels we can use to build models.

#### Coerce the reason codes to string type
Reason codes consist of two characters (OT, DF, etc.). Because some look like numbers (00), Pandas infers a type of mixed string/numeric for that column. This causes problems during aggregation, so make sure they're treated as strings.

In [22]:
df1['OrderReasonCode'] = df1['OrderReasonCode'].astype(str)
pd.api.types.infer_dtype(df1['OrderReasonCode'])

'string'
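
The problem is easy to reproduce outside Pandas. In the minimal sketch below, the list `mixed` is a made-up stand-in for a column that holds both strings and integers.

```python
# Hypothetical reason codes: some are strings, some were parsed as integers
mixed = ['OT', 10, 'NP', 25]

try:
    '.'.join(mixed)
    join_failed = False
except TypeError:
    # str.join only accepts strings, so the integer entries raise here
    join_failed = True

# Coercing every element to str first makes the join safe
joined = '.'.join(str(code) for code in mixed)
print(join_failed, joined)
```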

#### Create the concatenated time series
Turn each sequence of single-character codes into a single string. For the two-character
reason codes, separate them with an unobtrusive character, such as a period. (Avoid a comma,
which would cause problems when writing a CSV file.)

In [23]:
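# 'sum' on a string column concatenates the values, so each account's
# one-letter codes become a single string in time order
df2 = (df1.fillna('N')
          .replace('','N')
          .sort_values(['HeadTriggerDate', 'HeadTriggerTime'])
          .groupby('AccountId')
          .agg({
                'OrderClass':'sum',
                'AccountDelinquencyStatus':'sum',
                # Two-character codes need a separator to stay readable
                'OrderReasonCode':lambda x: '.'.join(x),
                'OrderStatus':'sum'
          })
          .reset_index()
)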

In [24]:
df2.head()

Unnamed: 0,AccountId,OrderClass,OrderStatus,AccountDelinquencyStatus,OrderReasonCode
0,10043819,SSSSSSSSSS,OOCOOOOOOC,NNNNNNNNNN,DF.DF.DF.DF.DF.DF.DF.DF.DF.DF
1,10221159,SSSSSSSSSTTSSSSSS,OOXOOXOOXOXCOOOXO,AWNAWNAWNNNNNNNNN,NP.NP.NP.NP.NP.NP.NP.NP.NP.01.01.NT.SJ.SJ.SJ.S...
2,10271483,SS,OC,NN,NT.SE
3,10271491,SS,OC,NN,NT.SE
4,10380306,SSSSSSSSTTTSSS,OOXOOXOXOOCOOX,APNAWNANTTTAPN,NP.NP.NP.NP.NP.NP.NP.NP.D2.D2.D2.NP.NP.NP


## Write the per-account time series to a file
To make the data accessible to other notebooks in the project.

NOTE: Do this only once.
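
The write step itself is not shown here. As a rough sketch of the "only once" idea, the code below writes to a local file and refuses to overwrite it. The `write_once` helper, the file name and the local path are all assumptions; the project's object storage helper would take the place of `open`.

```python
import csv
import os
import tempfile

def write_once(path, header, rows):
    """Write rows to a CSV file, refusing to overwrite an existing file."""
    # Mode 'x' raises FileExistsError if the file is already there
    with open(path, 'x', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

# Hypothetical output location and a two-row excerpt of the per-account table
path = os.path.join(tempfile.mkdtemp(), 'account_time_series.csv')
header = ['AccountId', 'OrderClass', 'OrderReasonCode']
rows = [['10271483', 'SS', 'NT.SE'], ['10271491', 'SS', 'NT.SE']]

write_once(path, header, rows)
try:
    write_once(path, header, rows)
    second_write_blocked = False
except FileExistsError:
    second_write_blocked = True
print(second_write_blocked)
```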

## Minor problem diagnosis
### One account ID from the original data is missing from the current dataframe
Find out why.

#### The current dataframe has 573 rows, but the original data has 574 unique account IDs

In [25]:
df2.shape

(573, 5)

In [26]:
df[0].nunique()

574

#### The difference is one corrupted version of an existing account ID
See row 1774 in the table below.

As part of data cleansing, we may want to look for things like this and fix them. For now, assume it is an artifact of preparing the sample data set and, in any case, insignificant, so it is OK to ignore that record.

In [27]:
set(df[0]).difference(set(df2['AccountId']))

{'][][00762520805'}

In [28]:
df[1773:1778]

Unnamed: 0,0,1,2
1773,00841533637,"{""msg"":{""body"":{""Account"":{""AccountId"":0084153...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
1774,][][00762520805,"{""msg"":{""body"":{""Account"":{""AccountId"":0076252...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
1775,00762520805,"{""msg"":{""body"":{""Account"":{""AccountId"":0076252...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
1776,00762520805,"{""msg"":{""body"":{""Account"":{""AccountId"":0076252...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
1777,00710757699,"{""msg"":{""body"":{""Account"":{""AccountId"":0071075...",{'msg': {'body': {'Order': {'ReasonCodeDescrip...
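
A cleansing pass of the kind suggested above might look like the sketch below. It assumes that a valid account ID consists only of digits, which matches the sample but is not confirmed anywhere in the data description. The helper name `clean_account_id` is made up.

```python
import re

def clean_account_id(raw):
    """Return (cleaned_id, was_corrupted) for a raw account ID string."""
    cleaned = re.sub(r'\D', '', raw)   # drop every non-digit character
    return cleaned, cleaned != raw

# The account IDs from rows 1773 to 1777 of the table above
ids = ['00841533637', '][][00762520805', '00762520805',
       '00762520805', '00710757699']

# Flag anything that is not made up purely of digits
flagged = [i for i in ids if not re.fullmatch(r'\d+', i)]
cleaned_ids = {clean_account_id(i)[0] for i in ids}
print(flagged, len(cleaned_ids))
```

After cleaning, the corrupted entry collapses into the existing account, so the number of unique IDs would agree with the aggregated dataframe.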
