#### The Cumulative Gains Curve

this section will be about visualizations of model performance that are relevant for business

the auc is very useful for data scientists, but it is a single number that is hard to explain and doesn't capture everything about a model, so it is less suitable for discussing your models with business stakeholders

the cumulative gains curve is easy to explain and can guide you to better business decisions
to make a cumulative gains curve:
- order all the observations according to the output of the model
- the left hand side are the observations that have the highest probability to be the target according to the model
- on the right hand side are observations with the lowest probability to be target
- the horizontal axis has the percentage of observations that is considered, for example at 30% the top 30% of observations with the highest probability to be target is considered
- the vertical axis is which percentage of all targets is included in this group, for example if the cumulative gains is 70% at 30% it means that when taking the top 30% of observations with highest probability to be target, this group already contains 70% of all targets
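
the steps above can be sketched by hand, which may help when explaining the curve to stakeholders. this is a minimal sketch in plain Python, not what scikit-plot does internally; the helper name `cumulative_gains` and the toy data are made up for illustration

```python
def cumulative_gains(true_values, scores, perc_selected):
    """Share of all targets found in the top `perc_selected` of observations."""
    # order the observations from highest to lowest model score
    ranked = sorted(zip(scores, true_values), key=lambda pair: pair[0], reverse=True)
    # keep only the top share of observations (horizontal axis)
    n_selected = round(perc_selected * len(ranked))
    targets_in_selection = sum(target for _, target in ranked[:n_selected])
    # divide by the number of all targets (vertical axis)
    return targets_in_selection / sum(true_values)

# toy data (made up): 10 observations, 4 of them are targets
true_values = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

for perc in (0.1, 0.3, 0.5, 1.0):
    print(perc, cumulative_gains(true_values, scores, perc))
```

in this toy data the top 30% of observations holds 2 of the 4 targets, so the cumulative gains at 30% is 50%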

the cumulative gains curve is good for comparing models
the closer a curve is to the upper left corner, the better the model
sometimes the curves of two models cross each other, which makes it harder to decide which one is better
you might end up with something like: model 2 is better in the top 10%, while model 1 is better at separating the top 70% of the observations from the rest

In [None]:
# cumulative gains 
import scikitplot as skplt
import matplotlib.pyplot as plt

# the first argument is an array with the true values of the target
# the second argument holds the model's predictions for the same observations
# predictions should contain the predicted probabilities for both classes (target 0 and target 1),
# for example the output of a scikit-learn model's predict_proba()
skplt.metrics.plot_cumulative_gain(true_values, predictions)
plt.show()

#### The Lift Curve

the lift curve is another widely used visualization of model performance

to construct a lift curve:
- order all the observations according to the output of the model
- the horizontal axis has which percentage of the observations is considered 
- the vertical axis shows how many times more targets this group contains than the average across all observations

for example, if the average percentage of targets is 5% and the top 50% of observations contains 10% targets, then the lift at 50% is 2, because 10% is 2 times 5%

a random model will have about an equal percentage of targets for each group so the baseline is 1
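
the lift can also be computed by hand. this is a minimal sketch in plain Python, assuming at least one observation is selected; the helper name `lift` and the toy data are made up for illustration and are not part of scikit-plot

```python
def lift(true_values, scores, perc_selected):
    """How many times more targets the top `perc_selected` holds than average."""
    # order the observations from highest to lowest model score
    ranked = sorted(zip(scores, true_values), key=lambda pair: pair[0], reverse=True)
    n_selected = round(perc_selected * len(ranked))
    # percentage of targets within the selected group
    incidence_in_selection = sum(t for _, t in ranked[:n_selected]) / n_selected
    # average percentage of targets over all observations
    overall_incidence = sum(true_values) / len(true_values)
    return incidence_in_selection / overall_incidence

# toy data (made up): 20 observations, 4 targets, so the average incidence is 20%
true_values = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0] + [0] * 10
scores = [(20 - i) / 20 for i in range(20)]  # already ordered from high to low

for perc in (0.2, 0.5, 1.0):
    print(perc, lift(true_values, scores, perc))
```

when all observations are selected the group is the whole population, so the lift is 1 by construction, the same value a random model hovers around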

better models have higher lifts: a higher curve means the model concentrates more targets in the top-ranked groups
just like cumulative gains curves, the lift curves of two models can cross each other, so one can be higher at 10% and the other higher at 80%, which makes it harder to say which model is best because it depends on the situation

In [None]:
# lift curve
import scikitplot as skplt
import matplotlib.pyplot as plt

# the first argument is an array with the true values of the target
# the second argument has the model predictions for the population
skplt.metrics.plot_lift_curve(true_values, predictions)
plt.show()

#### Guiding Business to Better Decisions

lift graphs and cumulative gains are great tools to help make better business decisions

lift graphs can be used to estimate the profit that you can make with a campaign
for example, say there are 100,000 candidate donors of which 5% are targets, each target is expected to donate $50, and sending a letter to a candidate donor costs $2
given that info you can calculate the expected profit of the campaign, as the profit() function in the code cell below does

the cumulative gains graph can be used to decide how many donors to address if you want to reach a certain profit or number of targets
for example, you have a pool of 1,000,000 candidate donors with 2% targets, but you don't want to send an email to everyone because you don't want to bother candidate donors who aren't interested in donating to this campaign
you'd build a model to predict which candidate donors are most likely to donate, and then read from its cumulative gains graph how many donors to address
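
because the expected profit depends on both the lift and the share of donors addressed, one way to use a lift curve is to compare the profit at several cut-offs. this is a minimal sketch using the campaign numbers from the first example; the helper `expected_profit` is hypothetical and the lift values in `lift_at` are made up for illustration (in practice you would read them from your own lift curve)

```python
population_size = 100_000
target_incidence = 0.05
reward_target = 50
cost_campaign = 2

# hypothetical lift values at a few cut-offs (made up for illustration)
lift_at = {0.1: 3.0, 0.2: 2.5, 0.4: 1.8, 0.6: 1.4, 1.0: 1.0}

def expected_profit(perc_selected, lift):
    """Expected reward minus cost when addressing the top `perc_selected`."""
    n_addressed = perc_selected * population_size
    # the selected group holds lift * average incidence targets
    n_targets = lift * target_incidence * n_addressed
    return reward_target * n_targets - cost_campaign * n_addressed

profits = {perc: expected_profit(perc, lift) for perc, lift in lift_at.items()}
best_cutoff = max(profits, key=profits.get)

for perc, value in profits.items():
    print(perc, round(value))
print("most profitable cut-off:", best_cutoff)
```

with these made-up lift values the most profitable choice is neither the smallest group nor the whole population, which is the kind of trade-off the lift curve helps make visible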

In [None]:
# estimating profit with lift graphs
population_size = 100000
target_incidence = 0.05
reward_target = 50
cost_campaign = 2

# the profit depends on the 5 arguments passed to it
def profit(perc_targets, perc_selected, population_size, reward_target, cost_campaign):
    cost = cost_campaign * perc_selected * population_size
    reward = reward_target * perc_targets * perc_selected * population_size
    return reward - cost

# address the top 20% of donors with the highest probability to donate according to the model
perc_selected = 0.20
lift = 2.5
perc_targets = lift * target_incidence
print(profit(perc_targets, perc_selected, population_size, reward_target, cost_campaign))
# from the lift curve you can read that the lift at 20% is 2.5, so the top 20% contains 2.5 * 5% = 12.5% targets
# profit() then gives an expected profit of $85,000 (a good result)

# if you addressed all candidate donors
print(profit(target_incidence, 1, population_size, reward_target, cost_campaign))
# you would expect a profit of only $50,000, because every letter costs $2 but only 5% of the recipients donate

In [None]:
# how many candidates to target with cumulative gains 
population_size = 1000000
target_incidence = 0.02
# number of targets you want to reach
number_targets_to_reach = 16000 # this is 80% of all targets
perc_targets_to_reach = number_targets_to_reach / (target_incidence * population_size)
print(perc_targets_to_reach)
# you could then read from the cumulative gains curve that to reach 80% of the targets you would need to address
# the top 60% of the candidate donors, which is 600,000 donors
cumulative_gains = 0.6 # share of donors to address, read from the horizontal axis
# number of donors to reach
number_donors_to_reach = cumulative_gains * population_size
print(number_donors_to_reach)
# you'll see that the answer is to send the email to 600,000 donors; without the model you would have had to send it to
# 800,000 donors in order to reach the same number of targets

In [None]:
# exercise example, comparing a targeted campaign with addressing everyone
# Read the lift at 40% from the lift curve (rounded up to the nearest tenth)
perc_selected = 0.4
lift = 1.5

# Information about the campaign
population_size, target_incidence, campaign_cost, campaign_reward = 100000, 0.01, 1, 100

# Profit if all donors are targeted
# (profit() as defined above takes the reward before the cost)
profit_all = profit(target_incidence, 1, population_size, campaign_reward, campaign_cost)
print(profit_all)

# Profit if top 40% of donors are targeted
profit_40 = profit(lift * target_incidence, perc_selected, population_size, campaign_reward, campaign_cost)
print(profit_40)

In [None]:
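# exercise example, how many donors to target to reach a profit goal
# If you know the reward per target, it follows how many donors should be targeted to reach a certain profit.

# Plot the cumulative gains
skplt.metrics.plot_cumulative_gain(targets_test, predictions_test)
plt.show()

# Number of targets you want to reach (here 30000 appears to be the profit goal and 50 the reward per target)
number_targets_toreach = 30000 / 50
# Share of all targets this represents (the divisor implies 1,000 targets in total)
perc_targets_toreach = number_targets_toreach / 1000
# Share of donors to address, read from the cumulative gains curve at that share of targets
cumulative_gains = 0.4
# Number of donors to address (the multiplier implies 10,000 candidate donors)
number_donors_toreach = cumulative_gains * 10000
print(number_donors_toreach)
# you would need to address 4000 donors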