#### Predictor Insight Graphs

in a typical predictive modeling project you proceed as follows:
1. build the predictive model
2. evaluate the model using the AUC metric
3. evaluate the model using cumulative gains and lift curves
4. verify whether the variables in the model are interpretable: does the link between each variable and the target make sense? (predictor insight graphs can be used for this)

if the variable is continuous, an additional discretization step that divides the continuous variable into bins is needed

constructing a predictor insight graph (pig)
- discretize the variable if it is continuous
- calculate the predictor insight graph table, which holds the values needed to make the plot
- plot it and interpret
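
the first two steps can be sketched on a small made-up table. The data, the column names `age` and `target`, and the choice of 2 bins are assumptions for illustration only, not part of the donor basetable used in these notes.

```python
import pandas as pd

# hypothetical toy data: six donors with a continuous variable and a binary target
toy = pd.DataFrame({
    "age":    [22, 25, 31, 47, 52, 68],
    "target": [0, 0, 1, 0, 1, 1],
})

# step 1: discretize the continuous variable into 2 bins with equal counts
toy["disc_age"] = pd.qcut(toy["age"], 2)

# step 2: one row per bin, with the mean target (incidence) and the number of rows (size)
toy_pig_table = (
    toy.groupby("disc_age", observed=True)["target"]
       .agg(Incidence="mean", Size="size")
       .reset_index()
)
print(toy_pig_table)
```

here the younger bin has an incidence of 1/3 and the older bin 2/3, with 3 donors in each.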


In [None]:
# access elements in the predictor insight graph table using boolean indexing
print(pig_table["Size"][pig_table["income"] == "low"])

In [None]:
# example exercise
# The target incidence of USA and India donors is the same, 
# indicating that country is not a good variable to predict donations.

# Inspect the predictor insight graph table of Country
print(pig_table)

# Print the number of UK donors
print(pig_table["Size"][pig_table["Country"]=="UK"])

# Check the target incidence of USA and India donors
print(pig_table["Incidence"][pig_table["Country"]=="USA"])
print(pig_table["Incidence"][pig_table["Country"]=="India"])

#### Discretization of Continuous Variables 

the first step of creating predictor insight graphs is to discretize the continuous variables
you can discretize pandas columns using qcut()

In [None]:
# divide the variable maximum gift into 3 bins of equal size
variable = "max_gift"
number_bins = 3
basetable["disc_max_gift"] = pd.qcut(basetable[variable], number_bins)
# output is a new column that can be added to the basetable

# to check that the qcut method divided the variable in equal size bins, use groupby() and size() methods
basetable.groupby("disc_max_gift").size()
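
`qcut()` places its cut points at quantiles, so heavily tied values can make two edges coincide. pandas then raises a `ValueError` unless `duplicates="drop"` is passed, in which case fewer bins come back than were requested. A small sketch on made-up gift amounts (the values are hypothetical):

```python
import pandas as pd

# hypothetical gift amounts with many ties at 10
gifts = pd.Series([10, 10, 10, 10, 10, 10, 20, 50])

# asking for 3 quantile bins puts several edges at the same value, which qcut rejects
try:
    pd.qcut(gifts, 3)
    raised = False
except ValueError:
    raised = True

# duplicates="drop" merges the coinciding edges, so fewer than 3 bins are returned
binned = pd.qcut(gifts, 3, duplicates="drop")
print(raised, len(binned.cat.categories))
```

when this happens the bins are no longer equal in size, so it is worth checking the group sizes after discretizing.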



In [None]:
# if you have a bunch of continuous variables
# first figure out which variables are continuous then loop over them to discretize them 
# list of variables
variables_model = ["income_average", "mean_gift", "gender_M", "min_gift", "age"]
# not all of these make sense to discretize, variables like gender_M and income_average only take 2 values
# here's a function to check whether a variable should be discretized
def check_discretize(basetable, variable, threshold):
    return(len(basetable.groupby(variable)) > threshold)
# returns True if the number of distinct values is higher than the threshold and False otherwise
check_discretize(basetable, "mean_gift", 5)

# with this function you can loop through all the variables and only discretize those that need to be
threshold = 5
number_bins = 5
for variable in variables_model:
    if check_discretize(basetable, variable, threshold):
        new_variable = "disc_" + variable
        basetable[new_variable] = pd.qcut(basetable[variable], number_bins)

In [None]:
# clean cuts, you can specify the wanted cuts
# qcut() divides the variable in equal size bins but sometimes this makes ugly intervals 
basetable["disc_age"] = pd.cut(basetable["age"], [18, 30, 40, 50, 60, 110])
basetable.groupby("disc_age").size()
# this won't have equal bin sizes but the result is much easier to read and interpret 
# suggested workflow is to use qcut to see how the bins should be chosen and then clean up the cuts
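
the suggested workflow can be sketched as follows. The ages and the hand-picked clean edges are hypothetical; `retbins=True` is the `qcut()` option that returns the proposed edges alongside the binned values.

```python
import pandas as pd

# hypothetical ages, for illustration only
ages = pd.Series([19, 23, 27, 34, 38, 41, 45, 52, 58, 63, 71, 80])

# step 1: let qcut propose the edges
_, quantile_edges = pd.qcut(ages, 4, retbins=True)
print(quantile_edges)

# step 2: pick round numbers near those edges by hand (these cuts are a judgment call)
clean_edges = [18, 30, 45, 60, 80]
disc_age = pd.cut(ages, clean_edges)

# step 3: check how far the group sizes drift from equal
print(disc_age.value_counts(sort=False))
```

the cleaned bins hold 3, 4, 2 and 3 observations instead of 3 each, which is the price paid for readable intervals.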

In [None]:
# example exercise
# Print the columns in the original basetable
print(basetable.columns)

# Get all the variable names except "target"
variables = list(basetable.columns)
variables.remove("target")

# Loop through all the variables and discretize in 10 bins if there are more than 5 different values
for variable in variables:
    if len(basetable.groupby(variable))>5:
        new_variable = "disc_" + variable
        basetable[new_variable] = pd.qcut(basetable[variable], 10)
        
# Print the columns in the new basetable
print(basetable.columns)

In [None]:
# example exercise, clean cuts
# notice how the edges [0, 5, 10, 20] give 3 right-closed groups: (0, 5], (5, 10], (10, 20]
# Discretize the variable 
basetable["disc_number_gift"] = pd.cut(basetable["number_gift"],[0, 5, 10, 20])

# Count the number of observations per group
print(basetable.groupby("disc_number_gift").size())
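
`cut()` builds right-closed intervals by default, so a value equal to the lowest edge, or one above the highest edge, falls in no bin and becomes missing. A small sketch on made-up gift counts (the values are hypothetical); `include_lowest=True` is the `cut()` option that closes the first interval on the left:

```python
import pandas as pd

# hypothetical gift counts, including the edge values 0 and 5 and one value above 20
number_gift = pd.Series([0, 1, 5, 6, 10, 20, 25])

default_bins = pd.cut(number_gift, [0, 5, 10, 20])
# include_lowest=True keeps the value 0 in the first bin
with_lowest = pd.cut(number_gift, [0, 5, 10, 20], include_lowest=True)

print(default_bins.isna().sum())  # 0 and 25 fall outside the bins
print(with_lowest.isna().sum())   # only 25 falls outside the bins
```

missing bins are dropped by `groupby()`, so those observations silently disappear from the group counts unless the edges cover the full range.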

#### Preparing the Predictor Insight Graph Table

the predictor insight graph table has all the info you need to create the pig
it has one row for each group of the variable that you want to plot, and 3 columns
* the first column holds the group names, or the intervals the variable was discretized into if it's a continuous variable
* the second column shows the target incidence of each group, i.e. the mean of the target within the group
* the third column shows the size of each group, i.e. the number of observations that belong to it
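
a hand-checkable example of such a table, built on made-up data (the countries, targets and group sizes are hypothetical). It also shows a quick sanity check: the size-weighted average of the group incidences should equal the overall incidence.

```python
import pandas as pd

# hypothetical donors: 4 from the UK, 4 from the USA, 2 from India
toy = pd.DataFrame({
    "country": ["UK"] * 4 + ["USA"] * 4 + ["India"] * 2,
    "target":  [1, 0, 0, 0,  1, 1, 0, 0,  1, 0],
})

toy_pig_table = (
    toy.groupby("country")["target"]
       .agg(Incidence="mean", Size="size")
       .reset_index()
)
print(toy_pig_table)

# sanity check: weight each group incidence by its size and compare with the overall mean
weighted = (toy_pig_table["Incidence"] * toy_pig_table["Size"]).sum() / toy_pig_table["Size"].sum()
print(weighted, toy["target"].mean())
```

the UK row has incidence 0.25, while USA and India both have 0.5, and both printed averages are 0.4.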

In [None]:
# a function to create the pig table
import numpy as np

def create_pig_table(df, target, variable):
    # group by the variable you want to plot, you only need the variable and target values for these groups
    groups = df[[target, variable]].groupby(variable)
    # use the aggregate function on these groups to create 2 columns
    # calculate the size and incidence of each group
    # named aggregation is used because recent pandas versions reject the dict-of-names form
    pig_table = groups[target].agg(Incidence="mean", Size="size").reset_index()
    return pig_table

print(create_pig_table(basetable, "target", "country"))

In [None]:
# calculate multiple predictor insight graph tables 
# instead of doing them one-by-one you could do them automatically and store the pigs in a dictionary
# for example, you want to graph these 4 variables
variables = ["country", "gender", "disc_mean_gift", "age"]
# create an empty dictionary, this will keep track of the pig tables
pig_tables = {}
# loop over the variables and calculate the pig tables each time
for variable in variables:
    # create the pig table
    pig_table = create_pig_table(basetable, "target", variable)
    # store the table in the dictionary, the key is the name of the variable
    pig_tables[variable] = pig_table
    
# if you want to plot a specific variable you can look up its pig table in the dictionary
print(pig_tables["country"])

#### Plotting the Predictor Insight Graph

once you have the table ready, it's easy to create the final pig that can be used to interpret your model

the graph will be plotted in two steps
1. plot the target incidence line
2. plot the bars that show the group size

In [None]:
# plot the target incidence
import matplotlib.pyplot as plt
import numpy as np

# assume the pig table is already loaded in the pig_table variable
pig_table["Incidence"].plot()
# this graph will need some editing to make it interpretable
# first show the group names on the horizontal axis
plt.xticks(np.arange(len(pig_table)), pig_table["income"])
# second center the groups by adding margins to the left and right
width = 0.5
plt.xlim([-width, len(pig_table)-width])
# finally, add label to the vertical axis and another to the horizontal axis
plt.ylabel("Incidence", rotation=0, rotation_mode="anchor", ha="right")
plt.xlabel("Income")
plt.show()

In [None]:
# add bars for the group sizes behind the incidence line
# assume the pig table is already loaded in the pig_table variable
fig, ax_size = plt.subplots()
# plot the bars on the left axis, kind="bar" draws bars instead of the default line, width is the bar width
pig_table["Size"].plot(kind="bar", width=0.5, color="lightgray", edgecolor="none", ax=ax_size)
ax_size.set_ylabel("Size", rotation=0, rotation_mode="anchor", ha="right")
# plot the incidence line on a second vertical axis that shares the horizontal axis
ax_incidence = ax_size.twinx()
ax_incidence.plot(np.arange(len(pig_table)), pig_table["Incidence"])
ax_incidence.set_ylabel("Incidence", rotation=0, rotation_mode="anchor", ha="left")
# show the group names on the horizontal axis
ax_size.set_xticks(np.arange(len(pig_table)))
ax_size.set_xticklabels(pig_table["income"], rotation=0)
# center the groups by adding margins to the left and right
width = 0.5
ax_size.set_xlim([-width, len(pig_table) - width])
# finally, label the horizontal axis
ax_size.set_xlabel("Income")
plt.show()
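
to draw a graph for each table in a dictionary of pig tables, the plotting steps can be wrapped in a function. This is a minimal sketch: `plot_pig` is a hypothetical helper name, the toy table is made up, and the `Incidence` and `Size` column names are assumed to match the tables built earlier.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_pig(pig_table, variable):
    """Hypothetical helper: gray bars for group size, a line for target incidence."""
    positions = np.arange(len(pig_table))
    fig, ax_size = plt.subplots()
    # bars on the left axis show how many observations each group holds
    ax_size.bar(positions, pig_table["Size"], width=0.5, color="lightgray")
    ax_size.set_ylabel("Size")
    ax_size.set_xlabel(variable)
    ax_size.set_xticks(positions)
    ax_size.set_xticklabels(pig_table[variable].astype(str))
    # the line on the right axis shows the target incidence of each group
    ax_incidence = ax_size.twinx()
    ax_incidence.plot(positions, pig_table["Incidence"], marker="o")
    ax_incidence.set_ylabel("Incidence")
    return fig, ax_size, ax_incidence

# made-up table, for illustration only
toy_table = pd.DataFrame({
    "income": ["low", "average", "high"],
    "Incidence": [0.03, 0.05, 0.08],
    "Size": [300, 500, 200],
})
fig, ax_size, ax_incidence = plot_pig(toy_table, "income")
```

calling `plt.show()` afterwards displays the figure; looping over the keys of the dictionary and calling the function for each one gives one graph per variable.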