# Know your customer - marketing a new product to customers

### Data Source:
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Dataset description from the UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

## Load Data

In [191]:
import graphlab as gl

In [192]:
# already stored dataset locally as Turi SFrame
train = gl.SFrame('bankCustomerTrain.sf')
test = gl.SFrame('bankCustomerTest.sf')

# data = graphlab.SFrame('s3://' or 'hdfs://')
# data # pySpark RDD or SchemaRDD / Spark DataFrame
# data = graphlab.SFrame.read_json('')
# With a DB: configure ODBC manager / driver on the machine
#    graphlab.connect_odbc?
#    graphlab.from_sql?

## Data Dictionary
The original dataset came with the following attribute information:


| Field Num | Field Name | Description |
|---|---|---|
| 1 | age | (numeric) |
| 2 | job | type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')|
| 3 | marital | marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed) |
| 4 | education | (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown') |
| 5 | default | has credit in default? (categorical: 'no','yes','unknown') |
| 6 | housing | has housing loan? (categorical: 'no','yes','unknown') |
| 7 | loan | has personal loan? (categorical: 'no','yes','unknown') |
|---|---|---|
| 8 | contact | contact communication type (categorical: 'cellular','telephone') |
| 9 | month | last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') |
| 10 | day_of_week | last contact day of the week (categorical: 'mon','tue','wed','thu','fri') |
| 11 | duration | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. |
|---|---|---|
| 12 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| 13 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |
| 14 | previous | number of contacts performed before this campaign and for this client (numeric) |
| 15 | poutcome | outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success') |
|---|---|---|
| 16 | emp.var.rate | employment variation rate - quarterly indicator (numeric) |
| 17 | cons.price.idx | consumer price index - monthly indicator (numeric) |
| 18 | cons.conf.idx | consumer confidence index - monthly indicator (numeric) |
| 19 | euribor3m | euribor 3 month rate - daily indicator (numeric) |
| 20 | nr.employed | number of employees - quarterly indicator (numeric) |
|---|---|---|
| **21** | **y** | **has the client subscribed a term deposit? (binary: 'yes','no')**|

## Data Exploration - get a sense of the data with GraphLab Canvas

In [193]:
gl.canvas.set_target('browser')
train.show()

Canvas is accessible via web browser at the URL: http://localhost:53104/index.html
Opening Canvas in default web browser.


## ROI Calculation - how we will measure the effectiveness of our lead scoring model

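Before we start, let's assume that each phone call to a contact costs \$1 and that the customer lifetime value for a contact who purchases a term deposit is \$100. The function below ranks contacts by lead score, calls the top fraction of the list, and returns the ROI as net return divided by total call cost. We then apply it to every contact in our test dataset: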

In [194]:
def calc_call_roi(contactList, leadScore, percentToCall):
    #assumptions
    costOfCall = 1.00
    custLTV = 100.00
    
    numberCalls = int(len(contactList)*percentToCall)
    if 'lead_score' in contactList.column_names():
        contactList.remove_column('lead_score')
    contactList = contactList.add_column(leadScore,name='lead_score')
    sortedByModel = contactList.sort('lead_score', ascending=False)
    callList = sortedByModel[:numberCalls]
    numSubscriptions = len(callList[callList['y']=='yes']) 
    roi = (numSubscriptions*custLTV - numberCalls*costOfCall) / (numberCalls*costOfCall)
    return roi
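
To make the arithmetic concrete, here is a minimal pure-Python sketch of the same calculation on a toy contact list, without GraphLab. The helper name `toy_call_roi` and the five made-up contacts are illustrative assumptions; the \$1 call cost and \$100 lifetime value are the ones stated above.

```python
def toy_call_roi(contacts, percent_to_call, cost_of_call=1.00, cust_ltv=100.00):
    """Return ROI (net return per dollar of call cost) for the top-scored contacts.

    `contacts` is a list of (lead_score, subscribed) pairs; hypothetical toy data.
    """
    number_calls = int(len(contacts) * percent_to_call)
    # Call the highest-scored contacts first.
    ranked = sorted(contacts, key=lambda c: c[0], reverse=True)
    call_list = ranked[:number_calls]
    num_subscriptions = sum(1 for _, subscribed in call_list if subscribed)
    total_cost = number_calls * cost_of_call
    return (num_subscriptions * cust_ltv - total_cost) / total_cost


toy_contacts = [(0.9, True), (0.8, False), (0.4, False), (0.2, False), (0.1, False)]
roi_all = toy_call_roi(toy_contacts, 1.0)  # 1 sale from 5 calls: (100 - 5) / 5
roi_top = toy_call_roi(toy_contacts, 0.4)  # 1 sale from 2 calls: (100 - 2) / 2
```

In this toy list, calling only the two highest-scored contacts captures the same single sale at lower cost, which is why a good lead score raises ROI. The sketch assumes at least one call is made; with zero calls the division is undefined.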

### Call everyone (assuming you have budget & time), ROI is 9.59%

In [195]:
initLeadScores = gl.SArray([1 for _ in test])
initROI = calc_call_roi(test, initLeadScores, 1)
print 'ROI for calling all contacts: ' + '{0:.2f}'.format(initROI) + '%'

ROI for calling all contacts: 9.59%


### Call only the first 20%, ROI drops to 1.47%

In [196]:
initLeadScores = gl.SArray([1 for _ in test])
initROI = calc_call_roi(test, initLeadScores, 0.2)
print 'ROI for calling a 20% subset of contacts: ' + '{0:.2f}'.format(initROI) + '%'

ROI for calling a 20% subset of contacts: 1.47%


## Modeling Part 1 - Query Data for customer segment - age less than median age of 38.

The SFrame, one of Turi's underlying data structures, allows users to build flexible pipelines. Here we show how to quickly retrieve the percentage of clients who opened a deposit account, both overall and among clients younger than the median age.

In [197]:
numClients = float(len(train))
numY = gl.Sketch(train['y']).frequency_count('yes')
print "%.2f%% of clients in training set opened deposit accounts." % (numY/numClients*100.0)

medianAge = gl.Sketch(train['age']).quantile(0.5)
numUnderMedianAge = float(len(train[train['age']<medianAge]))
numPurchasingAndUnderMedianAge = sum(train.apply(lambda x: 1 if x['age'] < medianAge 
                                           and x['y'] == 'yes' else 0))
probYGivenUnderMedianAge = numPurchasingAndUnderMedianAge/numUnderMedianAge*100

print "%.2f%% clients with age < %g (median) opened deposit account." % (probYGivenUnderMedianAge, medianAge)

11.43% of clients in training set opened deposit accounts.
12.16% clients with age < 38 (median) opened deposit account.
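
The two percentages above are an overall conversion rate and a conversion rate conditioned on a segment. A minimal sketch of that comparison in plain Python follows; the eight toy records and the helper `conversion_rate` are hypothetical, and only the age cutoff of 38 comes from the output above.

```python
# Hypothetical toy records: (age, subscribed 'yes'/'no').
toy_clients = [(25, 'yes'), (30, 'no'), (35, 'no'), (40, 'no'),
               (45, 'no'), (50, 'yes'), (28, 'yes'), (60, 'no')]


def conversion_rate(clients):
    """Percentage of clients whose outcome is 'yes'."""
    return 100.0 * sum(1 for _, y in clients if y == 'yes') / len(clients)


age_cutoff = 38  # the median age reported in the output above
young = [c for c in toy_clients if c[0] < age_cutoff]

overall_rate = conversion_rate(toy_clients)  # 3 of 8 clients
segment_rate = conversion_rate(young)        # 2 of 4 clients under the cutoff
```

When the segment rate exceeds the overall rate, as it does in this toy data, the segment is a candidate for targeting.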


### From this analysis we see that a larger percentage of people under 38 opened accounts than overall. So let's target them as leads and measure our ROI.

In [198]:
ageTargetingROI = calc_call_roi(test, test['age'].apply(lambda x: 1 if x < medianAge else 0), 0.2)
print 'ROI for age targeted calls to 20% of contacts: ' + '{0:.2f}'.format(ageTargetingROI) + '%'

ROI for age targeted calls to 20% of contacts: 15.71%


### ROI for age targeted 20% of contacts: 15.71% - big jump over calling everyone and huge jump over random 20% - this is a good start, but we can do better.

## Modeling Part 2 - Train a Machine Learning model instead - learn from ALL features, not just age, and use GraphLab Create AutoML to choose the most effective classifier model automatically.

In [199]:
# remove features that give away results/prediction
features = train.column_names()
features.remove('duration')
features.remove('y')

Turi's classifier toolkit can help marketers predict whether a client is likely to open an account.

In [200]:
toolkit_model = gl.classifier.create(train, features = features, target='y')

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, DecisionTreeClassifier, SVMClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier          : 0.893958866596
PROGRESS: RandomForestClassifier          : 0.888817489147
PROGRESS: DecisionTreeClassifier          : 0.890102803707
PROGRESS: SVMClassifier                   : 0.884319
PROGRESS: LogisticClassifier              : 0.888175
PROGRESS: ---------------------------------------------
PROGRESS: Selecting BoostedTreesClassifier based on validation set performance.


### The toolkit automatically evaluates several types of algorithms, including Boosted Trees, Random Forests, Decision Trees, Support Vector Machines, and Logistic Regression, with intelligent default parameters. Based on a validation set, it chooses the most accurate model. We can then evaluate this model on the test dataset.
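
The selection rule described here amounts to picking the candidate with the highest validation accuracy. The sketch below applies that rule to the accuracies printed in the progress log; it illustrates the rule only and is not the toolkit's actual implementation.

```python
# Validation accuracies copied from the progress log above.
validation_accuracy = {
    'BoostedTreesClassifier': 0.893958866596,
    'RandomForestClassifier': 0.888817489147,
    'DecisionTreeClassifier': 0.890102803707,
    'SVMClassifier': 0.884319,
    'LogisticClassifier': 0.888175,
}

# Pick the model name whose validation accuracy is largest.
best_model = max(validation_accuracy, key=validation_accuracy.get)
```

The gap between the best and worst candidates is under one percentage point, so the choice is sensitive to how the 5 percent validation split happens to fall.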

In [201]:
results = toolkit_model.evaluate(test)
print "accuracy: %g, precision: %g, recall: %g" % (results['accuracy'], results['precision'], results['recall'])

accuracy: 0.904351, precision: 0.627692, recall: 0.237485


This initial model can be considered accurate, given that it correctly predicts the purchasing decisions of 90% of the contacts. However, the toolkit model leaves room for improvement. Specifically, only 63% of predicted sales convert to actual sales (precision), and only 24% of actual sales were predicted by the model (recall). To understand the model, we can review the importance of the input features.
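
To see how accuracy can be high while recall stays low on an imbalanced dataset, here is a small sketch that computes the three metrics from confusion-matrix counts. The counts are hypothetical round numbers chosen for illustration, not the counts behind the results above.

```python
# Hypothetical confusion-matrix counts for 1000 contacts.
true_pos = 24    # predicted 'yes', actually 'yes'
false_pos = 16   # predicted 'yes', actually 'no'
false_neg = 76   # predicted 'no',  actually 'yes'
true_neg = 884   # predicted 'no',  actually 'no'

total = true_pos + false_pos + false_neg + true_neg

accuracy = float(true_pos + true_neg) / total
precision = float(true_pos) / (true_pos + false_pos)  # share of predicted sales that convert
recall = float(true_pos) / (true_pos + false_neg)     # share of actual sales that were predicted
```

Because most contacts do not subscribe, the large true-negative count dominates accuracy, while recall depends only on the much smaller group of actual subscribers.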

In [202]:
toolkit_model.get_feature_importance()

name,index,count
age,,95
euribor3m,,88
campaign,,59
pdays,,32
contact,telephone,18
job,admin.,16
day_of_week,mon,16
cons.conf.idx,,14
education,high.school,14
nr.employed,,13


After scoring the list by probability to purchase, the ROI for calling the top 20% of the list is:

In [203]:
toolkitLeadScore = toolkit_model.predict(test,output_type='probability')
toolkitROI = calc_call_roi(test, toolkitLeadScore, 0.2 )
print 'ROI for calling 20% of highest predicted contacts: ' + '{0:.2f}'.format(toolkitROI) + '%'

ROI for calling 20% of highest predicted contacts: 33.40%


----
## Huge improvement in ROI for 20% called: *33.40%* (over 3x improvement from calling everyone, while only calling 20% of the contacts).
----

### Modeling Part 3 - Continued experimentation / iteration - we can continue to tweak the model by generating features, doing experiments, adding different data sources etc.

```python
# One option to explore is using quadratic features to see if interactions between the features have predictive power.
quadratic = gl.feature_engineering.create(train,
                gl.feature_engineering.QuadraticFeatures(features=['campaign',
                                                                   'pdays',
                                                                   'previous',
                                                                   'emp.var.rate',
                                                                   'cons.price.idx',
                                                                   'euribor3m',
                                                                   'nr.employed']))

# Transform the training data.
qTrain = quadratic.transform(train)

qFeatures = qTrain.column_names()
qFeatures.remove('y')
qFeatures.remove('duration')

# We create a random forest classifier with the enriched dataset.
new_rf_model = gl.random_forest_classifier.create(qTrain, features = qFeatures, target='y', 
                                                  class_weights='auto', max_depth = 50,
                                                  row_subsample = 0.75, max_iterations = 50, column_subsample=0.5)

results = new_rf_model.evaluate(quadratic.transform(test))
print "accuracy: %g, precision: %g, recall: %g" % (results['accuracy'], results['precision'], results['recall'])       

# see which features are most important in this tree model
new_rf_model.get_feature_importance()

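# show ROI for experimentation model
# (score the transformed test set, since the model was trained on the quadratic features)
rfLeadScore = new_rf_model.predict(quadratic.transform(test), output_type='probability')
rfROI = calc_call_roi(test, rfLeadScore, 0.1)
print 'ROI for calling 10% of highest predicted contacts: ' + '{0:.2f}'.format(rfROI) + '%'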
```
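
For intuition about what the quadratic transformation adds, the sketch below builds pairwise products of numeric columns in plain Python. The helper `toy_quadratic_features` is hypothetical and only mimics the idea; the `'a, b'` key format follows the `quadratic_features` entries in the query response later in this notebook, and the input values are taken from that same response.

```python
from itertools import combinations_with_replacement


def toy_quadratic_features(row, features):
    """Return pairwise products of the given numeric columns, keyed 'a, b'."""
    return {
        '%s, %s' % (a, b): row[a] * row[b]
        for a, b in combinations_with_replacement(sorted(features), 2)
    }


row = {'campaign': 2, 'pdays': 999, 'previous': 0}
products = toy_quadratic_features(row, ['campaign', 'pdays', 'previous'])
```

Three input columns yield six product terms (three squares and three cross terms), so the feature count grows quadratically with the number of columns selected.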

## Integration Part 1 - Ranked Lists for Marketing / Sales - who should be prioritized to be called next!

In [204]:
rfList = test.sort('lead_score', ascending=False)
rfList['lead_score', 'age','campaign','euribor3m','job','loan'].print_rows(num_rows=20)

+----------------+-----+----------+-----------+-------------+------+
|   lead_score   | age | campaign | euribor3m |     job     | loan |
+----------------+-----+----------+-----------+-------------+------+
| 0.883403599262 |  48 |    3     |   0.904   |    admin.   |  no  |
| 0.883403599262 |  58 |    1     |    0.9    | blue-collar |  no  |
| 0.883403599262 |  51 |    1     |   0.903   | blue-collar |  no  |
| 0.882390379906 |  61 |    1     |   0.695   | blue-collar | yes  |
| 0.882390379906 |  77 |    1     |   0.682   |   retired   |  no  |
| 0.876977026463 |  53 |    1     |    0.84   | blue-collar |  no  |
| 0.876977026463 |  55 |    1     |   0.802   |    admin.   |  no  |
| 0.866948366165 |  58 |    1     |   0.878   |    admin.   |  no  |
| 0.866948366165 |  63 |    1     |   0.846   |   retired   |  no  |
| 0.863215148449 |  60 |    1     |   0.861   |    admin.   |  no  |
| 0.859958946705 |  66 |    1     |   0.655   |   retired   |  no  |
| 0.859958946705 |  55 |    3     

## Integration Part 2 - Deploy models as a fault-tolerant, scalable REST service, so marketing and sales dashboards (Salesforce/Tableau) can easily integrate lead scores

We can deploy a real-time model to help the marketers understand potential clients as soon as the contacts come to the bank. Here we deploy on AWS, but Turi also supports hosting models on premise and on Azure.

```python
# define the state path - this is where Turi will store the models, logs, and metadata for this deployment
ps_state_path = 's3://gl-rajat-testing/predictive_service/lead_scoring_app'


# setup your own AWS credentials.
# gl.aws.set_credentials(<key>,<secret key>)

# create an EC2 config - this is how you define the EC2 configuration for the cluster being deployed
ec2 = gl.deploy.Ec2Config(region='us-west-2', instance_type='m3.xlarge')

# use the EC2 config to launch a new Predictive Service
# num_hosts specifies how many machines the Predictive Service cluster has. 
#     You can scale up and down later after initial creation.

deployment = gl.deploy.predictive_service.create(name = 'rajat-lead-scoring-app', 
                                                    ec2_config = ec2, state_path = ps_state_path, num_hosts = 3)

```

In [205]:
ps_state_path = 's3://gl-rajat-testing/predictive_service/lead_scoring_app'
deployment = gl.deploy.predictive_service.load(ps_state_path)



In [206]:
# see the status of the deployment and which endpoints are deployed on it
deployment

Name                  : rajat-lead-scoring-app
State Path            : s3://gl-rajat-testing/predictive_service/lead_scoring_app
Description           : None
API Key               : 8a0244c4-497b-4969-a18d-3a3bfdfc8fcd
CORS origin           : 
Global Cache State    : enabled
Load Balancer DNS Name: rajat-lead-scoring-app-1226522522.us-west-2.elb.amazonaws.com

Deployed endpoints:
	lead_score [model]

No Pending changes.

### Creating a new intelligent service is as simple as defining a Python function (can deploy anything in Python)

In [207]:
# inputs and returns of this function map directly to the io of the endpoint for the REST service
def get_lead_score(json_row):
    json_row = {key:[value] for key,value in json_row.items()}
    client_info = quadratic.transform(gl.SFrame(json_row))
    client_info['lead_score'] = toolkit_model.predict(client_info, output_type='probability')
    return client_info
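
The first line of `get_lead_score` reshapes a single incoming record into columnar form, where each key maps to a list of values, so that one contact becomes a one-row table. A minimal sketch of just that reshaping step, using a shortened hypothetical record:

```python
# A single incoming contact as a flat JSON-style record (shortened, hypothetical example).
json_row = {'age': 29, 'job': 'management', 'campaign': 2}

# Wrap each scalar in a one-element list so every key becomes a column of length 1.
columnar = {key: [value] for key, value in json_row.items()}
```

Every column ends up with the same length (one), which is what a column-oriented table constructor needs in order to build a consistent single-row frame.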

In [208]:
deployment.update('lead_score', get_lead_score)

2016-05-26 11:52:36,006 [INFO] graphlab.deploy._predictive_service._predictive_service, 1527: Endpoint 'lead_score' is updated. Use apply_changes to deploy all pending changes, or continue other modification.


In [209]:
deployment.apply_changes()

2016-05-26 11:52:37,246 [INFO] graphlab.deploy._predictive_service._predictive_service, 1733: Persisting endpoint changes.
2016-05-26 11:52:37,260 [INFO] graphlab.util.file_util, 190: Uploading local path /var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv to s3 path: s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3


upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/ef811c97-fd83-453f-8b32-656dc5901a7c/objects.bin to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/ef811c97-fd83-453f-8b32-656dc5901a7c/objects.bin
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/91303ef3-eac1-409f-b1b7-baee17e3bce4/dir_archive.ini to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/91303ef3-eac1-409f-b1b7-baee17e3bce4/dir_archive.ini
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/version to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/version
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/91303ef3-eac1-409f-b1b7-baee17e3bce4/m_6d53cd4bb3428a97.sidx to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predict

2016-05-26 11:52:39,066 [INFO] graphlab.util.file_util, 245: Successfully uploaded to s3 path s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3


upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/91303ef3-eac1-409f-b1b7-baee17e3bce4/objects.bin to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/91303ef3-eac1-409f-b1b7-baee17e3bce4/objects.bin


In [210]:
deployment.get_status()

[{u'cache': {u'healthy': True, u'num_keys': 64, u'type': u'cluster'},
  u'dns_name': u'ec2-52-26-115-238.us-west-2.compute.amazonaws.com',
  u'graphlab_service_status': {u'ip-10-0-0-53:10000': {u'reason': None,
    u'status': u'healthy'}},
  u'id': u'i-ecd60631',
  u'models': [{u'lead_score': {u'ip-10-0-0-53:10000': {u'cache_enabled': True,
      u'reason': None,
      u'status': u'LoadSuccessful',
      u'type': u'model',
      u'version': 3}}}],
  u'reason': u'N/A',
  u'state': u'InService',
  u'system': {u'cpu_count': 4,
   u'cpu_usage': [0.3, 0.1, 0.1, 0.0],
   u'disk_usage': {u'root': {u'free': 3858726912,
     u'percent': 48.3,
     u'total': 8320901120,
     u'used': 4015902720},
    u'tmp': {u'free': 37426188288,
     u'percent': 0.1,
     u'total': 39490912256,
     u'used': 51879936}},
   u'memory': {u'active': 804380672,
    u'available': 15022342144,
    u'buffers': 91893760,
    u'cached': 2650820608,
    u'free': 12279627776,
    u'inactive': 2304663552,
    u'percent': 4

Now we can score incoming contacts using the REST endpoint.

In [214]:
# High lead score: 7720, 8070, 7924
# Low lead score: 0, 5000, 7000
deployment.query('lead_score', test[5000])

{u'from_cache': False,
 u'model': u'lead_score',
 u'response': [{u'age': 29,
   u'campaign': 2,
   u'cons.conf.idx': -42.0,
   u'cons.price.idx': 93.2,
   u'contact': u'cellular',
   u'day_of_week': u'wed',
   u'default': u'no',
   u'duration': 117,
   u'education': u'basic.9y',
   u'emp.var.rate': -0.1,
   u'euribor3m': 4.12,
   u'housing': u'yes',
   u'job': u'management',
   u'lead_score': 0.0692317932844162,
   u'loan': u'no',
   u'marital': u'married',
   u'month': u'nov',
   u'nr.employed': 5195,
   u'pdays': 999,
   u'poutcome': u'nonexistent',
   u'previous': 0,
   u'quadratic_features': {u'campaign, campaign': 4,
    u'campaign, cons.price.idx': 186,
    u'campaign, emp.var.rate': 0,
    u'campaign, euribor3m': 8,
    u'campaign, nr.employed': 10390,
    u'campaign, pdays': 1998,
    u'campaign, previous': 0,
    u'cons.price.idx, cons.price.idx': 8686.24,
    u'cons.price.idx, emp.var.rate': -9.32,
    u'cons.price.idx, euribor3m': 383.98400000000004,
    u'cons.price.idx, nr

In [None]:
# deployment.terminate_service()

## Summary: By applying Machine Learning to our existing historical customer data, we can prioritize the customers with the highest propensity to buy a new product.

## Using Turi's Platform a development team can easily implement a lead scoring model and deploy it as a REST API for integration into Marketing tools and Dashboards

### Want to find out more? Let's talk: rajat@turi.com