# Derive Features from Per-Account Time Series
Examine each account's sequence of interactions and orders to identify classes and patterns that suggest features for distinguishing at-risk customers from those who are not.

## Preliminaries
Set up credentials for file I/O and enable PixieDust for visualization.

In [74]:
# The code was removed by DSX for sharing.

In [75]:
# Allow display of multiple values without using print()
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [76]:
import pixiedust

## Load the prepared time series data

In [77]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read
           .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
           .option('header', 'true')
           .load(bmos.url('CableCompany', 'accountTimeSeries.csv')))
# df.show(5)

In [78]:
display(df)

AccountId,OrderClass,OrderStatus,AccountDelinquencyStatus,OrderReasonCode,IsNormal,NumInteractions,IsVoluntaryDisconnect,IsExcessiveCancelation
10043819,SSSSSSSSSS,OOCOOOOOOC,NNNNNNNNNN,DF.DF.DF.DF.DF.DF.DF.DF.DF.DF,1.0,10,0.0,0.0
10221159,SSSSSSSSSTTSSSSSS,OOXOOXOOXOXCOOOXO,AWNAWNAWNNNNNNNNN,NP.NP.NP.NP.NP.NP.NP.NP.NP.01.01.NT.SJ.SJ.SJ.SJ.DF,0.0,17,0.0,1.0
10271483,SS,OC,NN,NT.SE,1.0,2,0.0,0.0
10271491,SS,OC,NN,NT.SE,1.0,2,0.0,0.0
10380306,SSSSSSSSTTTSSS,OOXOOXOXOOCOOX,APNAWNANTTTAPN,NP.NP.NP.NP.NP.NP.NP.NP.D2.D2.D2.NP.NP.NP,0.0,14,0.0,1.0
10504548,SS,OC,VC,OT.OT,0.0,2,1.0,0.0
10678689,SSSS,OOOX,AWPN,NP.NP.NP.NP,0.0,4,0.0,0.0
11650153,TSSSTSSSTTS,OCOOOOCCOCC,NNNNNNNNNNN,H5.NT.NT.NT.H5.NT.NT.DF.H5.H5.SJ,1.0,11,0.0,0.0
11915788,SSSSS,CCCCC,NNCNN,NT.NT.NT.NT.NT,0.0,5,0.0,0.0
12258879,SSS,OOC,NNN,DF.DF.DF,1.0,3,0.0,0.0


## Explore Features

### Lookup dictionaries for status and class codes
These come in handy when interpreting the one-letter codes in the AccountDelinquencyStatus and OrderClass columns.

In [79]:
status_codes = {
    '':'Normal',
    'A':'Open non-pay disconnect and equipment is active',
    'C':'Voluntary disconnect',
    'E':'Non-pay disconnect',
    'F':'Open non-pay disconnect and equipment is force tuned',
    'P':'Pending non-pay disconnect and services are restored; CSG assigns this status in real time',
    'S':'Pending change of service job (applies to subscription billing)',
    'T':'PPV ordering restricted',
    'V':'Open voluntary disconnect job',
    'W':'Open non-pay disconnect and equipment is disabled',
    'Z':'Charged off'
}

class_codes = {
    'M':'Special request',
    'S':'Service order',
    'T':'Trouble call'
}
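
As a rough sketch of how these dictionaries can be used, the helper below expands a status string one character at a time. The name `describe_codes` is hypothetical, the dictionary is only a short excerpt of `status_codes`, and mapping 'N' to the 'Normal' entry is an assumption based on the sample data, where 'N' appears rather than an empty code.

```python
# Short excerpt of the status_codes dictionary defined above.
status_excerpt = {
    '': 'Normal',
    'A': 'Open non-pay disconnect and equipment is active',
    'W': 'Open non-pay disconnect and equipment is disabled',
}

def describe_codes(sequence, lookup, aliases=None):
    """Return one description per character in sequence (hypothetical helper)."""
    aliases = aliases or {}
    descriptions = []
    for code in sequence:
        # Translate aliased codes (for example 'N' -> '') before the lookup.
        key = aliases.get(code, code)
        descriptions.append(lookup.get(key, 'Unknown code: ' + code))
    return descriptions

# Assumption: 'N' in the data corresponds to the '' (Normal) entry.
for line in describe_codes('AWN', status_excerpt, aliases={'N': ''}):
    print(line)
```

Unknown codes are reported rather than raising an error, which seems safer while the code lists are still incomplete.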

### Lookup dictionary for order reason codes
These are derived from provided sample data containing codes and descriptions. The list may not be complete and some descriptions may be too terse for someone not intimately familiar with operations.

In [80]:
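# NOTE: We are missing a complete set of descriptions for reason codes (orders).
# NOTE: The codes 01, 10, 11, C2, CD, SC, and SH appear twice below with different
# descriptions; a dict literal silently keeps only the last description for each.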
reason_codes = {
    "00": "Install-Sik",
    "01": "TV All Out",
    "01": "No Cable",
    "04": "P-Dc No Contact",
    "0B": "30 Day Mbg",
    "10": "Drop Bury",
    "10": "Digital Prblm",
    "11": "Cnv/Rmt Prblm",
    "11": "Verify Svcs",
    "18": "Tech Assist",
    "20": "No Dial Tone",
    "22": "Mdu Postwire",
    "23": "Raise Drop",
    "25": "Equip Prob",
    "28": "Vlt/Ped Rpl/Rpr",
    "36": "Xi-Can't Surf",
    "75": "Inside Ingress",
    "7L": "Initial Visit",
    "8Z": "Bc Hard-Down Tc",
    "A0": "Crpesc Mustr0Ll",
    "A2": "X1Rfprb-Mustgo",
    "A3": "Priority",
    "AD": "Replace Aerial Drop",
    "BA": "Bc-Tech Assist 2Hr",
    "C0": "TV Single Chout",
    "C1": "TV Cblecard Prb",
    "C2": "TV Out Some Sts",
    "C2": "Equip Del/Swap",
    "C3": "TV Audio Prob",
    "C4": "TV Tiling",
    "C6": "TV Remote Prob",
    "C7": "TV Vod Prob",
    "CB": "TV Term Rcd/Pbk",
    "CD": "Cust Device Optimization",
    "CD": "Cdo Proactive Visit",
    "CF": "TV Hddta/Dta",
    "CW": "Comm Preinstall Work",
    "D0": "Dv No Dial Tone",
    "D2": "Dv Conn Prob",
    "DF": "Channel-Care",
    "DR": "Drop Bore",
    "EC": "Esl Proactive",
    "F0": "Drp Dwn Servout",
    "H2": "Int Out",
    "H3": "Int Imt Blksync",
    "H4": "Int Gw/Hm Ntwrk",
    "H5": "Int Speed Prob",
    "HA": "Make Drop Hot",
    "HI": "Make Tap Hot",
    "HN": "X1 Proactive Visit",
    "LX": "Legacy To X1",
    "MP": "Network Health-Prem",
    "NO": "Not Home Install",
    "NP": "P-Non Pay",
    "NT": "Install-No Truck",
    "NU": "P-Port Out",
    "O2": "S-Verizon Fios (Fiber)",
    "OD": "P-Promotion Expired",
    "OE": "P-Low Usage",
    "OG": "P-Can't Afford",
    "OJ": "P-Move Outside Srv Area",
    "OL": "P-Moved To Connected Hse",
    "ON": "P-Non Pay Manual",
    "OO": "Account Correction",
    "OQ": "P-Military",
    "OR": "P-Dissatisfied Product",
    "OS": "P-Move Within Srv Area",
    "OT": "P-Transfers Of Service",
    "OV": "P-Competition",
    "OY": "P-Dissatisfied Cust Exp",
    "OZ": "P-Student",
    "P1": "Poe Filtr Inst",
    "PM": "Pr Make Tap Hot",
    "RR": "Residential Rewire",
    "SB": "Channel-Cbs",
    "SC": "Cssr Sale",
    "SC": "Channel-Care",
    "SE": "Employee/Office Acct",
    "SF": "Channel-Technician",
    "SH": "Channel-Dsr",
    "SH": "Bc Proactive Visit",
    "SJ": "Channel-Web Order",
    "SK": "Channel-Outbound",
    "SL": "I New Sub",
    "SP": "Channel-Front Counter",
    "ST": "D Move No Trnfr",
    "TA": "Tech Assist",
    "TO": "Tech Ops Follow-Up",
    "TS": "Tech Upgrade",
    "TW": "Cbs-Teleworker",
    "WS": "Comm Premise Survey",
    "X3": "Xh Tch Scn Prb",
    "X5": "Xh Camera Prb",
    "XM": "Mis",
    "XU": "Tech Cmpl Xh Consult",
    "XX": "Is/Macro Change",
    "XY": "X Data Integrty",
    "YH": "Cbs-Business Closed",
    "YJ": "Bulk-Residential",
    "YN": "Sik Failed",
    "YS": "2Nd Job-Pri/Bve",
    "ZA": "Equip Pickup"
}
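
Because a dict literal keeps only the last value for a repeated key, one possible way to preserve every description is to hold the codes as a list of pairs and collect them into lists. The sketch below uses a small excerpt of the pairs above; the names `reason_pairs`, `reason_lookup`, and `decode_reasons` are hypothetical.

```python
from collections import defaultdict

# Excerpt of the code/description pairs above, kept as a list so that
# repeated codes are not overwritten.
reason_pairs = [
    ("01", "TV All Out"),
    ("01", "No Cable"),
    ("NT", "Install-No Truck"),
    ("SE", "Employee/Office Acct"),
]

# Map each code to the list of all descriptions seen for it.
reason_lookup = defaultdict(list)
for code, description in reason_pairs:
    reason_lookup[code].append(description)

def decode_reasons(reason_string):
    """Split a dot-separated OrderReasonCode value and look up each code."""
    return [(code, reason_lookup.get(code, ["Unknown"]))
            for code in reason_string.split(".")]

print(decode_reasons("NT.SE"))
print(decode_reasons("01"))
```

Which of two competing descriptions applies to a given order cannot be decided from the sample data alone, so the sketch returns both.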

### Is everything normal?
If AccountDelinquencyStatus is 'N' for every interaction, the account was never delinquent, and presumably nothing in its history indicates that the customer is at risk. We don't need to look into the history of such accounts in further detail.

In [81]:
# Create new column with normal/not-normal indicator for every account history
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

udfIsAllNormal = udf(lambda s: float(s == len(s) * 'N'), FloatType())

# Add a column indicating normal or not-normal
df = df.withColumn('IsNormal', udfIsAllNormal('AccountDelinquencyStatus'))
df.show(5)

# Count the occurrence of each distinct value (there are only two here)
df.groupBy('IsNormal').count().show()

+-----------+-----------------+-----------------+------------------------+--------------------+--------+
|  AccountId|       OrderClass|      OrderStatus|AccountDelinquencyStatus|     OrderReasonCode|IsNormal|
+-----------+-----------------+-----------------+------------------------+--------------------+--------+
|00010043819|       SSSSSSSSSS|       OOCOOOOOOC|              NNNNNNNNNN|DF.DF.DF.DF.DF.DF...|     1.0|
|00010221159|SSSSSSSSSTTSSSSSS|OOXOOXOOXOXCOOOXO|       AWNAWNAWNNNNNNNNN|NP.NP.NP.NP.NP.NP...|     0.0|
|00010271483|               SS|               OC|                      NN|               NT.SE|     1.0|
|00010271491|               SS|               OC|                      NN|               NT.SE|     1.0|
|00010380306|   SSSSSSSSTTTSSS|   OOXOOXOXOOCOOX|          APNAWNANTTTAPN|NP.NP.NP.NP.NP.NP...|     0.0|
+-----------+-----------------+-----------------+------------------------+--------------------+--------+
only showing top 5 rows

+--------+-----+
|IsNormal|cou
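
The lambda above raises a `TypeError` if the status is missing (`len(None)`) and returns 1.0 for an empty string. A minimal null-safe sketch in plain Python follows; the name `is_all_normal` is hypothetical, and treating a missing or empty history as not normal is an assumption, not something the data confirms.

```python
def is_all_normal(status):
    """Return 1.0 if every character is 'N', else 0.0 (hypothetical helper).

    Assumption: a missing or empty history is not counted as normal.
    """
    if not status:
        return 0.0
    return float(set(status) == {'N'})

print(is_all_normal('NNNNNNNNNN'))  # every interaction is 'N'
print(is_all_normal('AWPN'))        # contains delinquency codes
print(is_all_normal(None))          # missing value
```

If this behavior is wanted, the function could be wrapped with `udf(is_all_normal, FloatType())` in place of the lambda.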

### How active is this account?
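Simply count the number of interactions. Each character in OrderClass represents one interaction, so the string length is the interaction count.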

In [82]:
import pyspark.sql.functions as F
df.select(F.length('OrderClass')).show(5)

+------------------+
|length(OrderClass)|
+------------------+
|                10|
|                17|
|                 2|
|                 2|
|                14|
+------------------+
only showing top 5 rows



In [83]:
# OrderClass, OrderStatus, and AccountDelinquencyStatus should all have the same length, so it doesn't
# matter which column we use
import pyspark.sql.functions as F

df = df.withColumn('NumInteractions', F.length('OrderClass'))
df.show(5)

+-----------+-----------------+-----------------+------------------------+--------------------+--------+---------------+
|  AccountId|       OrderClass|      OrderStatus|AccountDelinquencyStatus|     OrderReasonCode|IsNormal|NumInteractions|
+-----------+-----------------+-----------------+------------------------+--------------------+--------+---------------+
|00010043819|       SSSSSSSSSS|       OOCOOOOOOC|              NNNNNNNNNN|DF.DF.DF.DF.DF.DF...|     1.0|             10|
|00010221159|SSSSSSSSSTTSSSSSS|OOXOOXOOXOXCOOOXO|       AWNAWNAWNNNNNNNNN|NP.NP.NP.NP.NP.NP...|     0.0|             17|
|00010271483|               SS|               OC|                      NN|               NT.SE|     1.0|              2|
|00010271491|               SS|               OC|                      NN|               NT.SE|     1.0|              2|
|00010380306|   SSSSSSSSTTTSSS|   OOXOOXOXOOCOOX|          APNAWNANTTTAPN|NP.NP.NP.NP.NP.NP...|     0.0|             14|
+-----------+-----------------+-
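
The assumption that the sequence columns all have the same length can be checked directly. The sketch below does so in plain Python on three rows copied from the data shown earlier; the helper name `has_consistent_lengths` is hypothetical, and it further assumes that OrderReasonCode holds one dot-separated code per interaction, which matches the sample rows.

```python
# Sample rows copied from the data shown earlier:
# (OrderClass, OrderStatus, AccountDelinquencyStatus, OrderReasonCode)
sample_rows = [
    ('SS', 'OC', 'NN', 'NT.SE'),
    ('SSSS', 'OOOX', 'AWPN', 'NP.NP.NP.NP'),
    ('SSS', 'OOC', 'NNN', 'DF.DF.DF'),
]

def has_consistent_lengths(order_class, order_status, delinquency, reasons):
    """True if all four sequences describe the same number of interactions."""
    lengths = {
        len(order_class),
        len(order_status),
        len(delinquency),
        len(reasons.split('.')),  # assumed: one code per interaction
    }
    return len(lengths) == 1

for row in sample_rows:
    print(row[0], has_consistent_lengths(*row))
```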

## Did the account disconnect at the customer's request?
As a first cut at a possible "churn" label, identify customers who left at their request.

We still don't know how to tell if they were unhappy or left for other reasons, such as a move.

In [84]:
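import re

# Match histories that end in 'C' (voluntary disconnect) preceded by a 'V'
# (open voluntary disconnect job), with no other 'V' or 'C' in between
udfIsVoluntaryDisconnect = udf(lambda s: 1.0 if re.search(r'V[^VC]*C$', s) else 0.0, FloatType())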

# Add a column indicating whether the customer was disconnected at their request
df = df.withColumn('IsVoluntaryDisconnect', udfIsVoluntaryDisconnect('AccountDelinquencyStatus'))

# Count the occurrence of each distinct value (there are only two here)
df.groupBy('IsVoluntaryDisconnect').count().show()

+---------------------+-----+
|IsVoluntaryDisconnect|count|
+---------------------+-----+
|                  1.0|   54|
|                  0.0|  519|
+---------------------+-----+
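
To make the pattern's behavior concrete, here is a small sketch that applies the same regular expression to a few status strings in plain Python. 'VC' and 'NNCNN' come from the sample data; the other strings and the name `is_voluntary_disconnect` are made up for illustration.

```python
import re

# Same pattern as in the UDF above.
VOLUNTARY_DISCONNECT = re.compile(r'V[^VC]*C$')

def is_voluntary_disconnect(status):
    """1.0 if the history ends with V ... C and no V or C in between."""
    return 1.0 if VOLUNTARY_DISCONNECT.search(status) else 0.0

for status in ['VC', 'VNC', 'NNCNN', 'NN', 'VCN']:
    print(status, is_voluntary_disconnect(status))
```

Note that the `$` anchor means only a disconnect at the very end of the history counts; an account that disconnected and later returned ('VCN') is not flagged.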



### Verification: voluntary disconnect is not normal

In [85]:
# There should be zero rows where IsNormal and IsVoluntaryDisconnect are both one (1.0)
df.filter('IsVoluntaryDisconnect * IsNormal == 1.0').count()

0

## Were there a lot of canceled orders?
This is speculative, but a high number of canceled orders (assumed here to be the 'X' entries in OrderStatus) may indicate dissatisfaction, the customer changing their mind, orders not carried out as expected, etc.

In [86]:
# Flag accounts with strictly more than this many canceled orders
threshold = 2

udfExcessiveCancelation = udf(lambda s: float(s.count('X') > threshold), FloatType())

df = df.withColumn('IsExcessiveCancelation', udfExcessiveCancelation('OrderStatus'))
df.groupBy('IsExcessiveCancelation').count().show()

+----------------------+-----+
|IsExcessiveCancelation|count|
+----------------------+-----+
|                   1.0|    7|
|                   0.0|  566|
+----------------------+-----+



In [87]:
df.select('OrderStatus', 'IsExcessiveCancelation').show(10)

+-----------------+----------------------+
|      OrderStatus|IsExcessiveCancelation|
+-----------------+----------------------+
|       OOCOOOOOOC|                   0.0|
|OOXOOXOOXOXCOOOXO|                   1.0|
|               OC|                   0.0|
|               OC|                   0.0|
|   OOXOOXOXOOCOOX|                   1.0|
|               OC|                   0.0|
|             OOOX|                   0.0|
|      OCOOOOCCOCC|                   0.0|
|            CCCCC|                   0.0|
|              OOC|                   0.0|
+-----------------+----------------------+
only showing top 10 rows
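
Since the comparison is strict, an account needs at least three 'X' entries to be flagged when the threshold is 2. The plain-Python sketch below shows the boundary; 'XX' and 'XXX' are made-up strings, and `is_excessive_cancelation` is a hypothetical name for the same logic as the UDF.

```python
threshold = 2

def is_excessive_cancelation(order_status, limit=threshold):
    """1.0 when the number of 'X' entries is strictly greater than limit."""
    return float(order_status.count('X') > limit)

# Print each string, its number of 'X' entries, and the resulting flag.
for status in ['OOOX', 'XX', 'XXX', 'OOXOOXOXOOCOOX']:
    print(status, status.count('X'), is_excessive_cancelation(status))
```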

