# Introduction

Data Science at its core is about understanding entities.  What’s an entity?  Well, an entity is the “thing” that you are trying to understand.  For example, an entity can be a store, a customer, a machine or an employee.  Data science allows us to better understand and gain insight into what an entity does or thinks.  

Entities are usually complex.  They have multiple inclinations and are typically moving targets.  Imagine you are a data scientist working at a telco and consumers are your entity.  Telco consumers can do many things.  They can cancel, increase, decrease or maintain their spend.  They can buy product x and cancel product y.  Or, they can cancel products x and y, but increase product z.  As you dive into the behavior and inclinations of an entity and examine all the things it can do, it gets hairy.

This article deals with combining multiple insights from one entity into a single view of the entity.  Recently, I have heard this called “Creating a Profile of One”.  I’m not sure about that name, but as you continue to examine the entities in your business, you will inevitably look for ways to combine multiple insights about an entity into a distinct summary that is actionable.

In this article I will explore a company that sells advertising to small businesses.  Small businesses include things like restaurants, painting contractors and attorneys.  The data in this example is 100% fake.  I created it and it belongs to me.  Although it is fake, I feel that it accurately reflects several “real-world” examples I have personally experienced in the past.  The use case should give you a flavor of the ways you can combine multiple insights about an entity.  It is in no way meant to be comprehensive or inclusive of all situations you may encounter.

The company in our use case has three types of advertising (products) that it can offer to its small business customer base.  The first is Direct Mail.  The second is Yellow Pages.  The third is On-line/Internet.


Direct Mail:  Traditional print advertising delivered to consumers (households) through the Postal Service.  https://en.wikipedia.org/wiki/Advertising_mail

Yellow Pages:  Traditional print advertising.  Directory of businesses that is delivered once a year.  https://en.wikipedia.org/wiki/Yellow_pages

On-Line/Internet:  Includes all digital advertising, including paid search, banner ads and preferential search placement. https://en.wikipedia.org/wiki/Online_advertising


In our first example, we will assume that one and only one product can be offered to the small business customer. You may see this if you are extending an offer via a bill insert or billing message. https://en.wikipedia.org/wiki/Insert_(print_advertising)  In their December bill, there is room to present the customer with one and only one offer.  We have to decide which product should be offered to each customer.  Another situation where you may be limited to one and only one offer is with an internet banner ad https://en.wikipedia.org/wiki/Online_advertising#Display_ads or offer presented in email https://en.wikipedia.org/wiki/Online_advertising#Email_advertising. In all these situations, the interaction with the end small business customer is minimal.  You need to make a point quickly and succinctly.


In our second example, we will look at a scenario where multiple offers are extended to a small business customer. This is common where you have a direct sales channel.  A direct sales channel means a human is calling on the customer either over the phone or door to door.   In this example, we will create a holistic view of the customer that quickly summarizes what product should be proposed first, second and third.  This gives the salesperson a guide on what product to lead with and how likely each customer is to buy each product.  The second example highlights situations where the interaction with the small business customer is extensive and more time is available to explore the small business customer's needs and wants.


## Table of contents

1.0 [Getting Setup](#setup)<br>
    1.1 [Import Relevant Libraries](#import)<br>
2.0 [Explore Data](#explore)<br>
    2.1 [Field Descriptions](#descr)<br>
3.0 [Methods and techniques to create a "Profile of One"](#methods)<br>
    3.1 [One Product per Customer](#oneper)<br>
        3.11 [Technique One -- Winner Take All](#T1)<br>
        3.12 [Technique Two -- Winner Take All with a Strategic Minimum](#T2)<br>
        3.13 [Technique Three -- Standardize each probability so each is measured on a common scale (Mean of 0.5 and a standard deviation of 0.1)](#T3) <br>
        3.14 [Technique Four --  Combine Expected product Revenue and the probability to buy.](#T4) <br>
     3.2 [Use the Probabilities and Revenue Prediction to Summarize each Small Business Customer.](#wgt) <br>
        3.21 [Technique Five -- Summarize the Prospect into Gold Star and Silver Star Accounts.](#T5) <br>
        3.22 [Technique Seven -- Summarize the Prospect into Gold Star and Silver Star Accounts while Overlaying other Business Characteristics](#T7) <br>
4.0 [Conclusion](#concl)<br>

# 1.0 Getting Setup <a id="setup"></a>

### 1.1 Load the training data from GitHub and Import required libraries <a id="import"></a>

In [1]:
# pandas for data handling, NumPy for vectorized conditionals and random numbers
import numpy as np
import pandas as pd



In [2]:

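# Remove any stale copy (-f: no error if the file is missing), then download the data
!rm -f leads_for_humans.csv
!wget https://raw.githubusercontent.com/shadgriffin/profile_of_1/main/leads_for_humans.csv

# Load the CSV into a DataFrame
df_datax = pd.read_csv("leads_for_humans.csv", sep=",", header=0)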



--2021-01-08 19:51:03--  https://raw.githubusercontent.com/shadgriffin/profile_of_1/main/leads_for_humans.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 537616 (525K) [text/plain]
Saving to: ‘leads_for_humans.csv’


2021-01-08 19:51:03 (44.7 MB/s) - ‘leads_for_humans.csv’ saved [537616/537616]



# 2.0 Explore Data <a id="explore"></a> 

In [3]:
df_datax.head()


| | cust_id | yp_probability | predicted_yp_revenue | internet_probability | predicted_internet_rev | dm_probability | predicted_dm_rev | credit_flag |
|---|---|---|---|---|---|---|---|---|
| 0 | 516211000001 | 0.031483 | 1142.26 | 0.029965 | 1076.37 | 0.001829 | 1414.29 | 0.0 |
| 1 | 516211000002 | 0.03 | 1360.42 | 0.022816 | 1189.9 | 0.001289 | 1349.06 | 0.0 |
| 2 | 516211000003 | 0.029754 | 1278.21 | 0.01397 | 1105.58 | 0.000167 | 113.140541 | 0.0 |
| 3 | 516211000004 | 0.005741 | 496.518557 | 0.007751 | 621.390408 | 0.001444 | 1260.87 | 0.0 |
| 4 | 516211000005 | 0.0205 | 1128.02 | 0.010573 | 906.8225 | 0.00155 | 1126.51 | 0.0 |


### 2.1 Field Descriptions <a id="descr"></a>

cust_id:  A unique identifier of a customer.  In this case a customer is a small business entity.

yp_probability:  This comes from a machine learning/predictive model.  Indicates the probability that a customer will buy yellow pages advertising.

predicted_yp_revenue:  This comes from a machine learning/predictive model.  Indicates the expected revenue from a customer if they buy yellow pages advertising.

internet_probability:  This comes from a machine learning/predictive model.  Indicates the probability that a customer will buy internet advertising.

predicted_internet_rev:  This comes from a machine learning/predictive model.  Indicates the expected revenue from a customer if they buy internet advertising.

dm_probability:  This comes from a machine learning/predictive model.  Indicates the probability that a customer will buy direct mail advertising.

predicted_dm_rev:  This comes from a machine learning/predictive model.  Indicates the expected revenue from a customer if they buy direct mail advertising.

credit_flag:  Indicates if the customer has credit issues and poses a risk of delinquency.

# 3.0 Methods and techniques to create a "Profile of One" <a id="methods"></a> 

### 3.1 One Product per Customer  <a id="oneper"></a> 

In our first scenario, we are looking to match products with customers.  Each customer can only receive one and only one offer.  There are many different ways to find the best product for each customer.  Here are a few examples. 

#### 3.11 Technique One -- Winner Take All <a id="T1"></a>

The Winner Take All method is the simplest.  Our data contains the probability that each small business customer will buy each product.  The most obvious solution is to assign each customer to the product they are most likely to buy.


Create a new field called "LEAD_PRODUCT"  that contains the product for each customer that they are most likely to buy, based on their probability scores.

In [4]:
df1=df_datax.copy()

In [5]:

df1["BEST"] = df1[["yp_probability", "internet_probability","dm_probability"]].max(axis=1)

df1['LEAD_PRODUCT']=np.where((df1.yp_probability==df1.BEST),'YP',
                                      (np.where((df1.internet_probability==df1.BEST),'INTERNET',
                                                (np.where((df1.dm_probability==df1.BEST),'DM','DOODOO')))))
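
The nested `np.where` calls above can also be written with `DataFrame.idxmax`, which returns the name of the column holding the row-wise maximum. The following is a minimal sketch on a three-row toy frame, not the notebook's own cell: the values are made up, and the `labels` mapping is an assumption chosen to mirror the labels used above. Ties resolve to the first column listed, which matches the Yellow Pages, Internet, Direct Mail priority of the nested version.

```python
import pandas as pd

# Toy probabilities, invented for illustration only.
toy = pd.DataFrame({
    "yp_probability":       [0.030, 0.005, 0.001],
    "internet_probability": [0.020, 0.008, 0.001],
    "dm_probability":       [0.002, 0.001, 0.004],
})

prob_cols = ["yp_probability", "internet_probability", "dm_probability"]
# Assumed mapping from column name to the product label used in this notebook.
labels = {
    "yp_probability": "YP",
    "internet_probability": "INTERNET",
    "dm_probability": "DM",
}

# Highest probability per row, and the product it belongs to.
toy["BEST"] = toy[prob_cols].max(axis=1)
toy["LEAD_PRODUCT"] = toy[prob_cols].idxmax(axis=1).map(labels)
print(toy[["BEST", "LEAD_PRODUCT"]])
```

Because the label comes straight from the winning column, no fallback label is needed for rows that match none of the conditions.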

Count the number of offers by product.

In [7]:
df_sum = pd.DataFrame(df1.groupby(['LEAD_PRODUCT'])['BEST'].agg(['mean','count', 'std']))
df_sum['pct']=(df_sum['count'])/df_sum['count'].sum()

In [8]:
df_sum

| LEAD_PRODUCT | mean | count | std | pct |
|---|---|---|---|---|
| DM | 0.00322 | 4 | 0.002107 | 0.000823 |
| INTERNET | 0.004406 | 2216 | 0.004634 | 0.455686 |
| YP | 0.007152 | 2643 | 0.00821 | 0.543492 |


From the table above, we can see that based on our "Winner Take All" method there will be 4 Direct Mail Offers, 2,216 Internet Offers and 2,643 Yellow Pages Offers.


Uhhh?  That's a bit of a problem, huh?  Of the 4,863 offers, only 4 are Direct Mail?  That usually won't work. 

Actually, that makes total sense.  If you look at the "mean" column in the table above, the average probability to buy DM is substantially lower than for the other products.  

It turns out that Direct Mail is a new strategic product and the organization only started selling the media in the last couple of years.  Because few of the businesses in the prospect base have purchased the product, the base odds of a conversion are low and, thus, the probability to buy yielded by a machine learning model will be low as well.

So, how can we stay true to the probability to buy but increase the number of Direct Mail Offers?

#### 3.12 Technique Two -- Winner Take All with a Strategic Minimum <a id="T2"></a>

In our second scenario, we will use the same data. This time, however, we will set a restriction that at least 15% of the small business prospects get a direct mail offer.

There are several ways to do this.  The code below enforces the 15% restriction as follows.

1.  Assign a Direct Mail offer to the 15% of the prospects with the highest probability to buy direct mail.
2. For the remaining 85% of the prospects, assign the offer (either Internet or Yellow Pages) with the highest probability.
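
The two steps above can be sketched directly. This is a simplified illustration rather than the notebook's own implementation (the notebook cells reach a similar split by sorting and using cumulative sums): the helper name `assign_with_dm_minimum`, its `dm_share` parameter and the random toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def assign_with_dm_minimum(df, dm_share=0.15):
    """Hypothetical helper: reserve the top `dm_share` of rows by
    dm_probability for Direct Mail; everyone else gets the likelier
    of Yellow Pages or Internet."""
    out = df.copy()
    n_dm = int(round(dm_share * len(out)))
    # Index labels of the n_dm customers most likely to buy Direct Mail.
    dm_idx = out["dm_probability"].nlargest(n_dm).index
    # Step 2 first: winner between Yellow Pages and Internet for every row.
    out["LEAD_PRODUCT"] = np.where(
        out["yp_probability"] >= out["internet_probability"],
        "YELLOW_PAGES", "INTERNET")
    # Step 1 then overrides the reserved rows with Direct Mail.
    out.loc[dm_idx, "LEAD_PRODUCT"] = "DM"
    return out

# Made-up probabilities for 200 toy prospects.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "yp_probability": rng.uniform(0, 0.03, 200),
    "internet_probability": rng.uniform(0, 0.03, 200),
    "dm_probability": rng.uniform(0, 0.005, 200),
})
result = assign_with_dm_minimum(toy)
print(result["LEAD_PRODUCT"].value_counts(normalize=True))
```

With 200 toy prospects, exactly 30 (15%) receive the Direct Mail offer, and they are the 30 with the highest `dm_probability`.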


In [9]:
df3=df_datax.copy()

Sort the data so that the customers with the highest probability of buying direct mail are at the bottom. This will ensure that those assigned a direct mail offer will have the highest probability of buying direct mail, even if they are more likely to buy YP or internet.

In [10]:
df3=df3.sort_values(by=['dm_probability'])

Apply the "Winner Take All"  Methodology.

In [11]:

df3["BEST"] = df3[["yp_probability", "internet_probability","dm_probability"]].max(axis=1)

df3['LEAD_PRODUCTX']=np.where((df3.yp_probability==df3.BEST),'YELLOW_PAGES',
                              (np.where((df3.internet_probability==df3.BEST),'INTERNET',
                                        (np.where((df3.dm_probability==df3.BEST),'DM','DOODOO')))))

Limit the number of fields for simplicity

In [12]:
df3=df3[['yp_probability','internet_probability', 'dm_probability', 'LEAD_PRODUCTX']]

Flip the LEAD_PRODUCTX field into three different dummy variables.  We'll use these new fields to calculate cumulative sums.

In [13]:
# One 0/1 indicator column per product; dtype=int keeps them numeric for the cumulative sums
df_dv = pd.get_dummies(df3['LEAD_PRODUCTX'], dtype=int)
df3= pd.concat([df3, df_dv], axis=1)

Count the total records in the data frame.

In [14]:
total_n=df3['DM'].count()

Calculate the cumulative percentage of records with each offer.

In [15]:
df3['dm_cuml'] = df3.DM.cumsum() / total_n
df3['internet_cuml'] = df3.INTERNET.cumsum() / total_n
df3['yp_cuml'] = df3.YELLOW_PAGES.cumsum() / total_n

If the cumulative number of Yellow Pages and Internet/on-line offers is less than 85% of the total, use the "Winner Take All" methodology; otherwise, offer Direct Mail.

In [16]:
df3["BEST"] = df3[["yp_probability", "internet_probability","dm_probability"]].max(axis=1)

df3['LEAD_PRODUCT'] = np.where((df3.yp_cuml+df3.internet_cuml<0.85), df3.LEAD_PRODUCTX, 'DM')

In [17]:
df3=df3[['yp_probability','internet_probability', 'dm_probability', 'LEAD_PRODUCT','BEST']]

In [18]:
df_sum = pd.DataFrame(df3.groupby(['LEAD_PRODUCT'])['BEST'].agg(['mean','count', 'std','sum']))
df_sum['total']=df_sum['count'].sum()
df_sum['pct']=df_sum['count']/df_sum['total']

In [19]:
df_sum

| LEAD_PRODUCT | mean | count | std | sum | total | pct |
|---|---|---|---|---|---|---|
| DM | 0.011348 | 730 | 0.009387 | 8.284322 | 4863 | 0.150113 |
| INTERNET | 0.003849 | 1862 | 0.004301 | 7.166258 | 4863 | 0.382891 |
| YELLOW_PAGES | 0.005825 | 2271 | 0.006852 | 13.228323 | 4863 | 0.466996 |


Now, 15% of the business customers will get a Direct Mail Offer.  The others will get an Internet or Yellow Pages offer. 

#### 3.13 Technique Three -- Standardize each probability so each is measured on a common scale (Mean of 0.5 and a standard deviation of 0.1) <a id="T3"></a>

In this technique we will standardize the three product probabilities on a common scale (mean of 0.5 and a standard deviation of 0.1).  After we standardize the three predictions, we will use the standardized values to assign a product offer.


In [20]:
df2=df_datax.copy()

Calculate the mean and a standard deviation of each metric.

In [21]:
df2['YP_STD']=df2['yp_probability'].std()
df2['YP_MEAN']=df2['yp_probability'].mean()
df2['DM_STD']=df2['dm_probability'].std()
df2['DM_MEAN']=df2['dm_probability'].mean()
df2['INTERNET_STD']=df2['internet_probability'].std()
df2['INTERNET_MEAN']=df2['internet_probability'].mean()

Use the mean and standard deviation to standardize the metrics.

In [22]:
df2['S_yp_probability']=(50+(df2['yp_probability']-df2['YP_MEAN'])*(10/df2['YP_STD']))/100
df2['S_dm_probability']=(50+(df2['dm_probability']-df2['DM_MEAN'])*(10/df2['DM_STD']))/100
df2['S_internet_probability']=(50+(df2['internet_probability']-df2['INTERNET_MEAN'])*(10/df2['INTERNET_STD']))/100
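
Each of the three lines above applies the same transformation: a z-score that is then shifted and scaled, which is algebraically `0.5 + 0.1 * z`. The sketch below wraps that in a small helper and checks the resulting mean and standard deviation on made-up data. The helper name `rescale`, its parameters and the toy series are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def rescale(series, target_mean=0.5, target_std=0.1):
    """Hypothetical helper: z-score a column, then shift and scale it."""
    z = (series - series.mean()) / series.std()
    return target_mean + target_std * z

# Small, right-skewed toy "probabilities" (invented for illustration).
rng = np.random.default_rng(1)
p = pd.Series(rng.beta(1, 300, size=1000))

s = rescale(p)
# The same values written the way the notebook cell writes them.
same = (50 + (p - p.mean()) * (10 / p.std())) / 100

print(round(s.mean(), 6), round(s.std(), 6))
print(np.allclose(s, same))
```

Note that the rescaled values are scores rather than probabilities: a customer far above a product's average can land above 1, and one far below can land below 0.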

Apply the "Winner Take All" methodology on the standardized variables.

In [23]:
df2["BEST"] = df2[["S_yp_probability", "S_internet_probability","S_dm_probability"]].max(axis=1)
df2['LEAD_PRODUCT']=np.where((df2.S_yp_probability==df2.BEST),'YELLOW PAGES',
                                      (np.where((df2.S_internet_probability==df2.BEST),'INTERNET',
                                                (np.where((df2.S_dm_probability==df2.BEST),'DM','DOODOO')))))

In [24]:
df_sum = pd.DataFrame(df2.groupby(['LEAD_PRODUCT'])['BEST'].agg(['mean','count', 'std','sum']))

In [25]:
df_sum['pct']=(df_sum['count'])/df_sum['count'].sum()

In [26]:
df_sum

| LEAD_PRODUCT | mean | count | std | sum | pct |
|---|---|---|---|---|---|
| DM | 0.503814 | 2776 | 0.126713 | 1398.588953 | 0.570841 |
| INTERNET | 0.613629 | 1027 | 0.147781 | 630.197082 | 0.211187 |
| YELLOW PAGES | 0.607707 | 1060 | 0.136285 | 644.169195 | 0.217972 |


Notice that after we standardized the variables, the number of small business customers who receive Direct Mail offers jumps substantially.  There is a good reason for this.  Standardizing measures each probability relative to that product's own average, so Direct Mail's low base rate no longer counts against it.  If you dig into the data, you will also notice that the customer base for Direct Mail is very different from the customer base for Internet and Yellow Pages.  In other words, the customers who have the highest likelihood to buy Direct Mail are different from the ones who have a high likelihood to buy Internet and Yellow Pages.

#### 3.14 Technique Four -- Combine Expected product Revenue and the probability to buy. <a id="T4"></a>

When you choose which advertising product to present to a small business customer, considering the expected revenue is probably smart.  In this technique, we weight each probability to buy by the expected revenue.  

In this example, we simply multiply the expected revenue by the probability to buy.  

'Expected Revenue' * 'Probability to Buy'

There are many other ways to combine the two.  What's best depends on the context of your business problem.
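
As a worked example of the weighting, the sketch below applies the product to rows 0 and 3 of the `df_datax.head()` output shown earlier. The values are copied from that output. The `expected` frame and its column labels are names introduced here for illustration.

```python
import pandas as pd

# Rows 0 and 3 from the df_datax.head() output shown earlier.
rows = pd.DataFrame({
    "cust_id": [516211000001, 516211000004],
    "yp_probability": [0.031483, 0.005741],
    "predicted_yp_revenue": [1142.26, 496.518557],
    "internet_probability": [0.029965, 0.007751],
    "predicted_internet_rev": [1076.37, 621.390408],
    "dm_probability": [0.001829, 0.001444],
    "predicted_dm_rev": [1414.29, 1260.87],
}).set_index("cust_id")

# Probability to buy times predicted revenue, one column per product.
expected = pd.DataFrame({
    "YELLOW_PAGES": rows["yp_probability"] * rows["predicted_yp_revenue"],
    "INTERNET": rows["internet_probability"] * rows["predicted_internet_rev"],
    "DM": rows["dm_probability"] * rows["predicted_dm_rev"],
})
# The product with the largest weighted value wins.
expected["LEAD_PRODUCT"] = expected.idxmax(axis=1)
print(expected.round(2))
```

For the first customer the weighted Yellow Pages value is about 35.96, ahead of Internet (about 32.25) and Direct Mail (about 2.59). For the second customer, Internet comes out on top.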

In [27]:
df4=df_datax.copy()

Multiply the predicted revenue times the probability to buy

In [28]:
df4['wgt_yp']=df4['yp_probability']*df4['predicted_yp_revenue']
df4['wgt_internet']=df4['internet_probability']*df4['predicted_internet_rev']
df4['wgt_dm']=df4['dm_probability']*df4['predicted_dm_rev']

Use the "Winner Take All" methodology on the new variables.

In [29]:
df4["BEST"] = df4[["wgt_yp", "wgt_internet","wgt_dm"]].max(axis=1)

df4['LEAD_PRODUCT']=np.where((df4.wgt_yp==df4.BEST),'YELLOW_PAGES',
                                      (np.where((df4.wgt_internet==df4.BEST),'INTERNET',
                                                (np.where((df4.wgt_dm==df4.BEST),'DM','DOODOO')))))

In [30]:

df_sum = pd.DataFrame(df4.groupby(['LEAD_PRODUCT'])['BEST'].agg(['median','count']))
df_sum['pct']=(df_sum['count'])/df_sum['count'].sum()
df_sum

| LEAD_PRODUCT | median | count | pct |
|---|---|---|---|
| DM | 0.173561 | 288 | 0.059223 |
| INTERNET | 0.826904 | 1994 | 0.410035 |
| YELLOW_PAGES | 1.188124 | 2581 | 0.530742 |


### 3.2 Use the Probabilities and Revenue Prediction to Summarize each Small Business Customer <a id="wgt"></a>

In the previous example, our goal was to select one and only one product to offer to each small business customer.  As I mentioned earlier, this is often useful when you have little interaction with the end customer.  In the second example, I will summarize the machine learning model output and present a complete view of the customer.  As I said before, there are many ways to do this.  I will highlight only two.  There is no "right" answer.  The best answer depends on the context of your business problem.  Hopefully, this will give you some ideas.

#### 3.21 Technique Five -- Summarize the Prospect into Gold Star and Silver Star Accounts.<a id="T5"></a>

Copy the original dataframe.

In [31]:
df5=df_datax.copy()

Create a very, very tiny random number and add it to our weighted values before we create our deciles.  This breaks ties between identical values, which makes sure that the edges of our deciles are clean.

In [32]:
df5['wookie'] = (np.random.randint(0, 1000, df5.shape[0]))/100000000000000000

Multiply Predicted Revenue times the probability to buy.

In [33]:


df5['wgt_yp']=(df5['yp_probability']*df5['predicted_yp_revenue'])+df5['wookie']
df5['wgt_internet']=(df5['internet_probability']*df5['predicted_internet_rev'])+df5['wookie']
df5['wgt_dm']=(df5['dm_probability']*df5['predicted_dm_rev'])+df5['wookie']



Create Deciles where the highest weighted values are in decile 100 and the lowest are in decile 10.

In [34]:


df5['internet_decile'] = pd.to_numeric(pd.qcut(df5['wgt_internet'], 10, labels=False))

df5['dm_decile'] = pd.to_numeric(pd.qcut(df5['wgt_dm'], 10, labels=False))
df5['yp_decile'] = pd.to_numeric(pd.qcut(df5['wgt_yp'], 10, labels=False))
df5['yp_decile']=(df5['yp_decile']+1)*10
df5['dm_decile']=(df5['dm_decile']+1)*10
df5['internet_decile']=(df5['internet_decile']+1)*10
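
To see why tie-breaking matters before binning, the sketch below builds a toy series with many identical values and shows that `pd.qcut` rejects it. It then uses a different tie-breaking approach from the random jitter used in this notebook: ranking with `method="first"` so that every value gets a distinct rank before the deciles are cut. The toy series and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy scores with many exact ties (half the values are 0.0).
scores = pd.Series([0.0] * 50 + [float(i) for i in range(1, 51)])

try:
    pd.qcut(scores, 10, labels=False)
    qcut_failed = False
except ValueError:
    # Several quantile edges coincide at 0.0, so 10 distinct bins cannot be formed.
    qcut_failed = True

# Break ties by order of appearance, then bin the ranks instead of the raw values.
ranks = scores.rank(method="first")
decile = (pd.qcut(ranks, 10, labels=False) + 1) * 10
print(qcut_failed, decile.value_counts().sort_index().tolist())
```

Ranking is deterministic, whereas random jitter assigns tied customers to neighboring deciles differently on each run unless a seed is fixed. Which behavior is preferable depends on the use case.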

Simplify the dataframe, keeping only the columns you need.

In [35]:
df_one=df5[['cust_id','internet_decile','dm_decile','yp_decile']].copy()

Now we will summarize the deciles into three different groups.

The first group is 'GOLD STAR'.  These are your A+++ customers for the product.  If someone has a Gold Star for Internet Advertising, they are very likely to buy and we need to spend resources getting the small business customer introduced to and sold on the product.  The odds of conversion are high.

The second group is 'SILVER STAR'.  These are your second-tier prospects.  There is a good likelihood to buy the product, but not an extremely good one.

The third group is 'NO STAR'.  This group has very little likelihood to buy the product in question.

Note that I classify 'GOLD STAR' as the top decile and 'SILVER STAR' as the next three deciles.  'NO STAR' is the bottom six deciles.  How you split the data into groups and the number of groups really depends on your business problem and the amount of resources you have to contact each small business customer.  For example, if you only have enough salespeople to contact 5% of the prospects, this segmentation would make little sense. 
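
One compact way to express the decile-to-tier mapping described above is `np.select`, which evaluates a list of conditions in order and takes the first match. This is a sketch on made-up decile scores. The helper name `star_tier` is hypothetical and the cut-offs simply restate the ones in the paragraph above.

```python
import numpy as np
import pandas as pd

def star_tier(decile):
    """Hypothetical helper: map decile scores (10..100) to a tier label."""
    conditions = [decile > 90, decile >= 70]
    choices = ["GOLD STAR", "SILVER STAR"]
    # The first matching condition wins; everything else falls to the default.
    return pd.Series(np.select(conditions, choices, default="NO STAR"),
                     index=decile.index)

# Toy decile scores covering each tier and its boundaries.
deciles = pd.Series([100, 90, 70, 60, 10])
tiers = star_tier(deciles)
print(tiers.tolist())
```

Keeping the cut-offs in one function makes it easier to apply the same rule to every product and to change the thresholds when sales capacity changes.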

In [36]:
df_one['internet']=np.where((df_one.internet_decile>90),'GOLD STAR', 
                         np.where(((df_one.internet_decile<=90) & (df_one.internet_decile>=70)),'SILVER STAR', 
                                  np.where(((df_one.internet_decile<=60)),'NO STAR', 'POOPY')))

df_one['yp']=np.where((df_one.yp_decile>90),'GOLD STAR', 
                         np.where(((df_one.yp_decile<=90) & (df_one.yp_decile>=70)),'SILVER STAR', 
                                  np.where(((df_one.yp_decile<=60)),'NO STAR', 'POOPY')))

df_one['dm']=np.where((df_one.dm_decile>90),'GOLD STAR', 
                         np.where(((df_one.dm_decile<=90) & (df_one.dm_decile>=70)),'SILVER STAR', 
                                  np.where(((df_one.dm_decile<=60)),'NO STAR', 'POOPY')))

df_one.head()

| | cust_id | internet_decile | dm_decile | yp_decile | internet | yp | dm |
|---|---|---|---|---|---|---|---|
| 0 | 516211000001 | 100 | 100 | 100 | GOLD STAR | GOLD STAR | GOLD STAR |
| 1 | 516211000002 | 100 | 100 | 100 | GOLD STAR | GOLD STAR | GOLD STAR |
| 2 | 516211000003 | 100 | 20 | 100 | GOLD STAR | GOLD STAR | NO STAR |
| 3 | 516211000004 | 90 | 100 | 80 | SILVER STAR | SILVER STAR | GOLD STAR |
| 4 | 516211000005 | 100 | 100 | 100 | GOLD STAR | GOLD STAR | GOLD STAR |


Summarize to look at the number of accounts in each category.

In [37]:
df_sum = pd.DataFrame(df_one.groupby(['internet'])['yp_decile'].agg(['median','count']))
df_sum['pct']=(df_sum['count'])/df_sum['count'].sum()
df_sum

| internet | median | count | pct |
|---|---|---|---|
| GOLD STAR | 90 | 487 | 0.100144 |
| NO STAR | 50 | 2918 | 0.600041 |
| SILVER STAR | 70 | 1458 | 0.299815 |


In [38]:
df_one=df_one[['cust_id','yp','dm', 'internet']]

Now let's take a look at a few records to see how the profile recommends what you should offer each small business customer.

In [39]:
sample=df_one[df_one['cust_id']==516211000001]

This is an A+++++++++ prospect.  They have a high likelihood to buy all three advertising products.  We should make this prospect a priority.  It makes a lot of sense to use many resources on this prospect.  Maybe we take them out to dinner.  Maybe we assign the account to our best rep.  Whatever is possible or makes sense, this prospect is a great one.

In [40]:
sample

| | cust_id | yp | dm | internet |
|---|---|---|---|---|
| 0 | 516211000001 | GOLD STAR | GOLD STAR | GOLD STAR |


This prospect is not a good one.  It makes little sense to spend many resources trying to convert this prospect, because the probability they will buy any product is low.

In [41]:
sample=df_one[df_one['cust_id']==516211000042]
sample

| | cust_id | yp | dm | internet |
|---|---|---|---|---|
| 41 | 516211000042 | NO STAR | NO STAR | NO STAR |


This prospect should be approached with Internet advertising first, followed by Yellow Pages.  Little or no resources should be used to sell Direct Mail, given that there is little possibility they will buy it.

In [42]:
sample=df_one[df_one['cust_id']==516211000010]
sample

| | cust_id | yp | dm | internet |
|---|---|---|---|---|
| 9 | 516211000010 | SILVER STAR | NO STAR | GOLD STAR |


#### 3.22 Technique Seven -- Summarize the Prospect into Gold Star and Silver Star Accounts while Overlaying other Business Characteristics <a id="T7"></a>

Sometimes it makes sense to layer in information about the prospect in addition to the machine learning model output.  Credit risk is often an important element to consider.  If a small business prospect has a bad credit rating, you probably want to make sure you get your money up front.  If that's not possible, you probably should skip over them and spend your resources selling to someone else.

In the code below, I assign each prospect to a "STAR" category, but if a prospect has a credit flag, I classify them as "NO CREDIT".
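
The overlay can also be thought of as a second pass: compute the tiers first, then overwrite every tier in a flagged row. The sketch below shows that idea on three made-up prospects. The toy `tiers` frame and `credit_flag` series are assumptions for illustration, and any non-zero flag is treated as a credit issue.

```python
import pandas as pd

# Toy tier assignments for three prospects (invented for illustration).
tiers = pd.DataFrame({
    "yp": ["SILVER STAR", "GOLD STAR", "NO STAR"],
    "dm": ["NO STAR", "GOLD STAR", "NO STAR"],
    "internet": ["GOLD STAR", "SILVER STAR", "NO STAR"],
})
# Toy credit flags: 0 means no known credit issue.
credit_flag = pd.Series([1.0, 0.0, 0.0])

# Overwrite every product tier in rows where the flag is set.
flagged = credit_flag != 0
overlaid = tiers.copy()
overlaid.loc[flagged, :] = "NO CREDIT"
print(overlaid)
```

Separating the overlay from the tier logic keeps the star cut-offs in one place and makes it simple to add further business rules later.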

In [43]:
df_one=df5.copy()

In [44]:
df_one['internet']=np.where((df_one.internet_decile>90 ) & (df_one.credit_flag==0),'GOLD STAR', 
                         np.where(((df_one.internet_decile<=90) & (df_one.internet_decile>=70) & (df_one.credit_flag==0)),'SILVER STAR', 
                                  np.where(((df_one.internet_decile<=60) & (df_one.credit_flag==0)),'NO STAR', 'NO CREDIT')))

df_one['dm']=np.where((df_one.dm_decile>90 ) & (df_one.credit_flag==0),'GOLD STAR', 
                         np.where(((df_one.dm_decile<=90) & (df_one.dm_decile>=70) & (df_one.credit_flag==0)),'SILVER STAR', 
                                  np.where(((df_one.dm_decile<=60) & (df_one.credit_flag==0)),'NO STAR', 'NO CREDIT')))

df_one['yp']=np.where((df_one.yp_decile>90 ) & (df_one.credit_flag==0),'GOLD STAR', 
                         np.where(((df_one.yp_decile<=90) & (df_one.yp_decile>=70) & (df_one.credit_flag==0)),'SILVER STAR', 
                                  np.where(((df_one.yp_decile<=60) & (df_one.credit_flag==0)),'NO STAR', 'NO CREDIT')))


There are 344 prospects who have bad credit and should be avoided.

In [45]:
df_sum = pd.DataFrame(df_one.groupby(['yp'])['internet_probability'].agg(['median','count']))
df_sum['pct']=(df_sum['count'])/df_sum['count'].sum()
df_sum

| yp | median | count | pct |
|---|---|---|---|
| GOLD STAR | 0.006488 | 451 | 0.092741 |
| NO CREDIT | 0.002085 | 344 | 0.070738 |
| NO STAR | 0.001762 | 2720 | 0.559326 |
| SILVER STAR | 0.002649 | 1348 | 0.277195 |


And finally, let's take a look at one of our customers from above.  Previously, this prospect showed promise.  When you factor in credit, however, you should probably stay away.

In [46]:
df_one=df_one[['cust_id','yp','dm', 'internet']]

In [47]:
sample=df_one[df_one['cust_id']==516211000010]
sample

| | cust_id | yp | dm | internet |
|---|---|---|---|---|
| 9 | 516211000010 | NO CREDIT | NO CREDIT | NO CREDIT |


# 4.0 Conclusion <a id="concl"></a>

As I mentioned above, this is by no means meant to be a comprehensive solution to the topic of summarizing machine learning models for a specific entity.  Rather, I am just trying to give you some ideas on what is possible.  The best solution for a specific problem always depends on context.  

Hope this was helpful.