# PREDICTING DONATION PROPENSITY OF A CAMPAIGN COHORT USING BINOMIAL REGRESSION

### With this model, I seek to predict the proportion of new members joining on a given campaign who will become donors within their first 90 days after joining.  I will make my predictions using features of the campaigns themselves, such as age, virality, and topic (discovered using my LDA model), and aggregate statistics about the campaign joiners, such as regional proportions and activity in their first 7 days on list.

In [90]:
#import the first Python modules I will be using in the project
import pandas as pd  #for working with data in table form within Python
import numpy as np   #important math and logic functions
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

### First, I read in the final CSV file that I prepared in the Data Wrangling phase.  The file contains:
### * page id
### * mailing_id
### * date the mailing was sent
### * total number of actions taken from the mailing
### * size of the new member cohort
### * percentage of new members from each of the 7 SoU regions
### * opens, clicks, actions, and donations per person in the cohort in their first 7 days on list
### and 
### * the dependent variable we are trying to predict: the proportion of the new member cohort that made a donation in their first 90 days on list.

In [79]:
data = pd.read_csv('../capstone/final_logistic_export.csv', encoding = "ISO-8859-1")  #import CSV as a Pandas dataframe
data.tail(10)

Unnamed: 0,page_id,mailing_id,mailing_date,mail_acts,cohort_size,us,can,uk,aus,nz,enuk,rest,opens7_pp,clicks7_pp,acts7_pp,donates7_pp,donates90_dv
281,12557,18116,2016-03-25 16:44:13,82229,1525,45,15,2,7,2,19,6,0.862951,0.184918,0.059016,0.000656,0.0
282,12630,17783,2016-03-13 11:37:17,30465,1178,36,5,44,2,0,8,2,0.674873,0.143463,0.057725,,0.0
283,12689,17977,2016-03-18 17:31:03,4725,1760,0,0,0,96,0,0,0,0.879545,0.206818,0.089205,0.001136,0.001705
284,12799,18144,2016-03-26 10:41:34,53411,4264,0,0,99,0,0,0,0,0.697467,0.142824,0.078799,0.002814,0.003049
285,12912,18437,2016-04-07 22:15:30,13665,553,0,0,0,96,0,1,1,0.9783,0.22604,0.106691,0.001808,0.0
286,14483,19334,2016-05-11 23:48:04,13256,557,0,0,0,97,0,0,0,1.089767,0.228007,0.109515,0.001795,0.005386
287,14728,19635,2016-05-25 17:47:46,133762,5905,28,13,17,12,3,21,3,1.066384,0.218459,0.110754,0.006266,0.003387
288,14887,19656,2016-05-24 16:14:54,33401,2481,1,0,93,0,0,2,0,1.125353,0.24869,0.115276,0.008464,0.004031
289,15109,19830,2016-05-30 17:34:44,143421,7128,8,4,18,3,1,60,3,1.091611,0.242985,0.139169,0.001824,0.003788
290,15669,20870,2016-07-17 11:03:58,75643,2012,77,8,2,4,1,3,2,0.675944,0.134195,0.062624,0.000497,0.001491


### Next, I read in the topics data that I generated using my LDA model.

In [80]:
topics = pd.read_csv('../capstone/mailing_topic.csv', index_col=0, encoding = "ISO-8859-1")  #import topics as a Pandas dataframe
topics = topics.drop(['text_clean','freq','topics'], axis=1) # drop everything except the page_id and the topic % from the LDA
topics.tail(10)

Unnamed: 0,page_id,health,private,palm,fossil,econ,rights,food,trade
1390,16102,0.0,0.0,0.0,0.0,0.0,98.0,0.0,0.0
1391,16118,0.0,0.0,0.0,0.0,74.0,13.0,0.0,11.0
1392,16148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97.0
1393,16154,0.0,0.0,0.0,0.0,0.0,0.0,0.0,98.0
1394,16160,0.0,0.0,0.0,0.0,0.0,0.0,98.0,0.0
1395,16416,10.0,72.0,1.0,15.0,1.0,1.0,1.0,1.0
1396,16458,34.0,0.0,0.0,0.0,0.0,49.0,0.0,15.0
1397,16479,0.0,0.0,0.0,21.0,74.0,0.0,0.0,4.0
1398,16736,0.0,29.0,0.0,7.0,32.0,0.0,0.0,31.0
1399,16940,0.0,0.0,0.0,98.0,0.0,0.0,0.0,0.0


### I combine the topics columns with the other data columns to create a single data frame. Merging the two frames (an inner join on page_id) leaves me with 176 observations for my regression model.

In [81]:
joined = pd.merge(data, topics, on='page_id')
joined.tail(10)

Unnamed: 0,page_id,mailing_id,mailing_date,mail_acts,cohort_size,us,can,uk,aus,nz,...,donates7_pp,donates90_dv,health,private,palm,fossil,econ,rights,food,trade
166,12523,17700,2016-03-09 20:13:38,126031,1670,32,9,18,6,1,...,0.002395,0.005988,0.0,13.0,20.0,54.0,0.0,0.0,0.0,12.0
167,12557,18116,2016-03-25 16:44:13,82229,1525,45,15,2,7,2,...,0.000656,0.0,98.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
168,12630,17783,2016-03-13 11:37:17,30465,1178,36,5,44,2,0,...,,0.0,38.0,0.0,0.0,0.0,11.0,50.0,0.0,0.0
169,12689,17977,2016-03-18 17:31:03,4725,1760,0,0,0,96,0,...,0.001136,0.001705,17.0,68.0,0.0,0.0,0.0,0.0,0.0,14.0
170,12912,18437,2016-04-07 22:15:30,13665,553,0,0,0,96,0,...,0.001808,0.0,8.0,0.0,0.0,0.0,90.0,0.0,0.0,0.0
171,14483,19334,2016-05-11 23:48:04,13256,557,0,0,0,97,0,...,0.001795,0.005386,0.0,20.0,0.0,0.0,72.0,7.0,0.0,0.0
172,14728,19635,2016-05-25 17:47:46,133762,5905,28,13,17,12,3,...,0.006266,0.003387,0.0,0.0,0.0,0.0,0.0,0.0,98.0,0.0
173,14887,19656,2016-05-24 16:14:54,33401,2481,1,0,93,0,0,...,0.008464,0.004031,0.0,85.0,0.0,0.0,0.0,0.0,0.0,14.0
174,15109,19830,2016-05-30 17:34:44,143421,7128,8,4,18,3,1,...,0.001824,0.003788,0.0,0.0,0.0,0.0,0.0,0.0,69.0,30.0
175,15669,20870,2016-07-17 11:03:58,75643,2012,77,8,2,4,1,...,0.000497,0.001491,0.0,99.0,0.0,0.0,0.0,0.0,0.0,0.0
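
### The merge above is an inner join, so campaigns without a matching row in the topics table are dropped without warning (the tail of `data` ends at index 290, while the tail of `joined` ends at index 175). Here is a minimal sketch of how one might audit that loss, using small made-up frames; `data_demo` and `topics_demo` are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Toy frames standing in for `data` and `topics`; all values are made up.
data_demo = pd.DataFrame({"page_id": [1, 2, 3], "cohort_size": [100, 200, 300]})
topics_demo = pd.DataFrame({"page_id": [2, 3, 4], "food": [98.0, 0.0, 50.0]})

# An outer join with indicator=True labels each row by where it came from:
# "both", "left_only" (no topics row) or "right_only" (no campaign row).
audit = pd.merge(data_demo, topics_demo, on="page_id", how="outer", indicator=True)
counts = audit["_merge"].value_counts()

# The default inner join keeps only the "both" rows.
inner = pd.merge(data_demo, topics_demo, on="page_id")

print(counts)
print(len(inner))
```

### Running the same outer merge on the real `data` and `topics` frames would show how many campaigns fall into each bucket.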


### Before I start modelling, I need to calculate two derived features that I didn't calculate in my SQL.  

### The first is the 'age' of the campaign - or, the number of months that elapsed from the time the organization was founded until the time the mailing was sent.  This is important because we believe that cohort quality has been decreasing over time.  

### The second feature is 'virality' - or, the ratio of new joiners (cohort size) to the number of members who took action via email. 

### Since these are very simple and straightforward calculations, I have left this step as a script, rather than create a function for it.

In [82]:
transform = joined.copy() #copy the data to a new frame

transform['virality'] = transform['cohort_size'] /transform['mail_acts'] #calc virality

org_founded = pd.to_datetime('2011-12-01') # the month that SumOfUs was founded
transform['age'] = pd.to_datetime(transform['mailing_date']) - org_founded #subtract org founding date from mail date
transform['age'] = transform['age'].astype('timedelta64[M]') #convert to number of months

transform[['page_id','mailing_date','age','mail_acts','cohort_size','virality']].tail(10)

Unnamed: 0,page_id,mailing_date,age,mail_acts,cohort_size,virality
166,12523,2016-03-09 20:13:38,51.0,126031,1670,0.013251
167,12557,2016-03-25 16:44:13,51.0,82229,1525,0.018546
168,12630,2016-03-13 11:37:17,51.0,30465,1178,0.038667
169,12689,2016-03-18 17:31:03,51.0,4725,1760,0.372487
170,12912,2016-04-07 22:15:30,52.0,13665,553,0.040468
171,14483,2016-05-11 23:48:04,53.0,13256,557,0.042019
172,14728,2016-05-25 17:47:46,53.0,133762,5905,0.044146
173,14887,2016-05-24 16:14:54,53.0,33401,2481,0.074279
174,15109,2016-05-30 17:34:44,53.0,143421,7128,0.0497
175,15669,2016-07-17 11:03:58,55.0,75643,2012,0.026599
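
### As a worked check of the two features, the sketch below recomputes them for two of the rows shown above. It counts whole calendar months instead of converting a timedelta, which is my own substitution: newer pandas versions may not accept `astype('timedelta64[M]')`, and the two approaches agree on the rows shown here but could differ by a month near month boundaries. The name `age_months` is hypothetical.

```python
import pandas as pd

org_founded = pd.Timestamp("2011-12-01")  # founding month used in the notebook

# Mailing dates for page_id 12689 and 15669, copied from the table above.
mailing_dates = pd.to_datetime(pd.Series(["2016-03-18 17:31:03",
                                          "2016-07-17 11:03:58"]))

# Calendar months between the founding month and the mailing month.
age_months = ((mailing_dates.dt.year - org_founded.year) * 12
              + (mailing_dates.dt.month - org_founded.month))

# Virality for page_id 12689: cohort_size / mail_acts.
virality = 1760 / 4725

print(age_months.tolist())
print(round(virality, 6))
```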


### Next, I want to 'normalize' all the data that I will be using in the regression model.  By 'normalize', I mean rescaling each column so that every value lies between 0 and 1, relative to the minimum and maximum of that column.  Even though the columns have different scales and units of measurement, normalization lets me put everything on the same scale so that I can make fair comparisons between them.  This will be helpful later when I want to interpret the coefficients.

In [105]:
norm = transform.copy() #copy the data to a new frame

def normalize(df,cols):
    df[cols] = df[cols].apply(lambda x: (x - x.min()) / (x.max() - x.min())) #applies the normalization formula
    df[cols] = df[cols].fillna(0) #fills in empty data points with zeros
    return df

# sublists of all the independent variables I will use in the regression, by type
calcs = ['age','virality',]
regions = ['us','uk','can','aus','nz','enuk','rest']
activity = ['opens7_pp','clicks7_pp','acts7_pp','donates7_pp']
topics = ['health','private','palm','fossil','econ','rights','food','trade']

feature_cols = calcs + regions + activity + topics  #creates the master list of feature cols by adding the sublists

norm = normalize(norm,feature_cols)

new_norm_cols = ['page_id','donates90_dv', 'cohort_size'] + feature_cols #adds the page_id, the dependent variable, and the cohort size to the normed cols
norm = norm[new_norm_cols] # drops everything except the page_id, the dependent variable, the cohort size, and the normed columns
norm.tail(10)

Unnamed: 0,page_id,donates90_dv,cohort_size,age,virality,us,uk,can,aus,nz,...,acts7_pp,donates7_pp,health,private,palm,fossil,econ,rights,food,trade
166,12523,0.005988,1670,0.878788,0.004317,0.326531,0.183673,0.091837,0.061856,0.010638,...,0.376103,0.103102,0.0,0.131313,0.20202,0.545455,0.0,0.0,0.0,0.121212
167,12557,0.0,1525,0.878788,0.007015,0.459184,0.020408,0.153061,0.072165,0.021277,...,0.14086,0.0271,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
168,12630,0.0,1178,0.878788,0.017268,0.367347,0.44898,0.05102,0.020619,0.0,...,0.137273,0.0,0.387755,0.0,0.0,0.0,0.111111,0.505051,0.0,0.0
169,12689,0.001705,1760,0.878788,0.187359,0.0,0.0,0.0,0.989691,0.0,...,0.224707,0.0481,0.173469,0.686869,0.0,0.0,0.0,0.0,0.0,0.141414
170,12912,0.0,553,0.909091,0.018185,0.0,0.0,0.0,0.989691,0.0,...,0.273275,0.077459,0.081633,0.0,0.0,0.0,0.909091,0.0,0.0,0.0
171,14483,0.005386,557,0.939394,0.018975,0.0,0.0,0.0,1.0,0.0,...,0.28112,0.076892,0.0,0.20202,0.0,0.0,0.727273,0.070707,0.0,0.0
172,14728,0.003387,5905,0.939394,0.020059,0.285714,0.173469,0.132653,0.123711,0.031915,...,0.28456,0.272222,0.0,0.0,0.0,0.0,0.0,0.0,0.989899,0.0
173,14887,0.004031,2481,0.939394,0.035413,0.010204,0.94898,0.0,0.0,0.0,...,0.297121,0.368278,0.0,0.858586,0.0,0.0,0.0,0.0,0.0,0.141414
174,15109,0.003788,7128,0.939394,0.022889,0.081633,0.183673,0.040816,0.030928,0.010638,...,0.363484,0.078135,0.0,0.0,0.0,0.0,0.0,0.0,0.69697,0.30303
175,15669,0.001491,2012,1.0,0.011118,0.785714,0.020408,0.081633,0.041237,0.010638,...,0.150881,0.020165,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
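
### Two edge cases of the min-max formula are worth seeing on a toy example: a constant column divides zero by zero and becomes NaN, and a missing value stays missing, so `fillna(0)` then maps both to the column minimum (this is why the missing `donates7_pp` for page_id 12630 shows up as 0.0 above). A small sketch with made-up values; `demo` is a hypothetical frame:

```python
import pandas as pd

# Made-up values: an ordinary column, a constant column, and a column with a gap.
demo = pd.DataFrame({
    "age": [0.0, 33.0, 55.0],
    "palm": [0.0, 0.0, 0.0],
    "donates7_pp": [0.001, None, 0.003],
})

# Same min-max formula as normalize(); min() and max() skip missing values.
scaled = (demo - demo.min()) / (demo.max() - demo.min())
nan_counts = scaled.isna().sum()  # NaNs present before any filling

# Filling with 0 treats constant columns and missing values as the minimum.
filled = scaled.fillna(0)

print(nan_counts)
print(filled)
```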


### I need to create variables for the dependent variable of my model (what I am trying to predict), and the independent variables (all features I am using in my prediction).

### This is a 'binomial' regression, so my dependent variable is an array with two columns: 'successes' - the number of members who donated in their first 90 days, and 'failures' -  the number of members who did not donate.

In [92]:
# build the independent (features) and dependent (successes, failures) variables

cohorts = norm.copy() #copy the data to a new frame

ind = cohorts[feature_cols] #independent variables
dep = pd.DataFrame()
dep['sucesses'] = cohorts['cohort_size'] * cohorts['donates90_dv']
dep['failures'] = cohorts['cohort_size'] - dep['sucesses']
dep.tail(10)

Unnamed: 0,sucesses,failures
166,10.0,1660.0
167,0.0,1525.0
168,0.0,1178.0
169,3.0,1757.0
170,0.0,553.0
171,3.0,554.0
172,20.0,5885.0
173,10.0,2471.0
174,27.0,7101.0
175,3.0,2009.0
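
### Written out, a binomial regression with a logit link (the default link for the statsmodels Binomial family) can be summarized as follows. The symbols are my own notation, not taken from the notebook: $y_i$ is the number of successes for campaign $i$, $n_i$ is its cohort size (successes plus failures), $p_i$ is the donation probability, and $x_{ij}$ are the normalized features.

```latex
y_i \sim \mathrm{Binomial}(n_i,\, p_i),
\qquad
\log\frac{p_i}{1 - p_i} = \sum_{j} \beta_j \, x_{ij}
```

### Passing the two columns (successes, failures) gives the model both $y_i$ and $n_i$, so large cohorts carry more weight than small ones.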


### Next, I create my model, and then fit the model to the data.  I can then print out summary statistics about the model results. 'GLM' stands for Generalized Linear Model.

In [106]:
glm_binom = sm.GLM(dep, ind, family=sm.families.Binomial()) #create the model
result = glm_binom.fit() #fit the model to the data
print(result.summary()) #print summary stats

                    Generalized Linear Model Regression Results                     
Dep. Variable:     ['sucesses', 'failures']   No. Observations:                  176
Model:                                  GLM   Df Residuals:                      155
Model Family:                      Binomial   Df Model:                           20
Link Function:                        logit   Scale:                             1.0
Method:                                IRLS   Log-Likelihood:                -670.13
Date:                      Mon, 05 Dec 2016   Deviance:                       641.72
Time:                              19:31:20   Pearson chi2:                     652.
No. Iterations:                          12                                         
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
age             1.5287      0.123     12.441      0.000         1.288     1
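
### Because the link function is the logit, a coefficient acts on the log-odds scale. The sketch below shows one way to read the `age` coefficient from the summary: since `age` was scaled to run from 0 to 1, exponentiating the coefficient gives the multiplicative change in the odds of donating between the youngest and oldest campaign, holding the other features fixed. The helper `inv_logit` and the example value of `eta` are my own illustrations, not part of the notebook.

```python
import math

def inv_logit(eta):
    """Map a linear predictor (log-odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

age_coef = 1.5287  # 'age' coefficient from the summary above

# Change in odds when age moves from 0 to 1 (its full normalized range).
odds_ratio = math.exp(age_coef)

# Illustrative only: a made-up linear predictor turned into a proportion.
eta = -5.5
p = inv_logit(eta)

print(round(odds_ratio, 2))
print(p)
```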

### Using the model, I can make a prediction about each campaign, then insert that prediction into the dataframe.  I can compare the column 'predict' with the column 'donates90_dv' (actual value) to get a sense of the accuracy of the prediction.

In [103]:
predicts = cohorts.copy()

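def add_predict(model,df,features,new_col):
    """Insert the model's predicted proportion for each row as column `new_col`."""
    df.insert(1, new_col, 0) #create the new column at position 1 (right after the first column)
    df[new_col] = model.predict(df[features]) #predicted 90-day donor proportion per campaign
    return df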

predicts = add_predict(result,predicts,feature_cols,'predict')
predicts

Unnamed: 0,page_id,predict,donates90_dv,cohort_size,age,virality,us,uk,can,aus,...,acts7_pp,donates7_pp,health,private,palm,fossil,econ,rights,food,trade
0,824,0.003658,0.004594,73784,0.000000,0.237978,0.346939,0.091837,0.183673,0.051546,...,0.673310,0.006739,0.000000,0.000000,0.000000,0.000000,0.000000,0.232323,0.767677,0.000000
1,830,0.002078,0.002389,3348,0.000000,0.082239,0.000000,0.979592,0.000000,0.000000,...,0.503736,0.024550,0.255102,0.282828,0.000000,0.000000,0.030303,0.434343,0.000000,0.000000
2,831,0.009376,0.007702,1558,0.000000,0.005330,0.265306,0.244898,0.142857,0.134021,...,0.948528,0.082581,0.051020,0.000000,0.000000,0.393939,0.151515,0.000000,0.191919,0.212121
3,834,0.009832,0.008496,1177,0.000000,0.003268,0.336735,0.234694,0.132653,0.092784,...,0.991657,0.258303,0.000000,0.080808,0.000000,0.212121,0.494949,0.000000,0.000000,0.212121
4,841,0.006094,0.007679,1172,0.000000,0.007131,0.173469,0.765306,0.000000,0.010309,...,0.815876,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000
5,876,0.001983,0.002942,9177,0.030303,0.035978,0.428571,0.112245,0.122449,0.082474,...,0.480867,0.036538,0.000000,0.434343,0.000000,0.242424,0.000000,0.000000,0.313131,0.000000
6,883,0.007993,0.004577,8739,0.030303,0.167571,0.000000,0.979592,0.000000,0.000000,...,0.894508,0.008448,0.153061,0.292929,0.000000,0.060606,0.161616,0.000000,0.000000,0.343434
7,886,0.002568,0.002946,4753,0.030303,0.094885,0.010204,0.897959,0.000000,0.010309,...,0.635522,0.016834,0.479592,0.000000,0.000000,0.000000,0.000000,0.525253,0.000000,0.000000
8,898,0.000535,0.001612,13647,0.030303,0.277229,0.010204,0.000000,0.979592,0.000000,...,0.182298,0.011256,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000
9,924,0.003198,0.005697,2984,0.030303,0.093851,0.030612,0.030612,0.010204,0.876289,...,0.466540,0.027734,0.000000,0.000000,0.000000,0.111111,0.000000,0.000000,0.000000,0.888889
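
### To put a number on the comparison between 'predict' and 'donates90_dv', one option is the mean absolute error, optionally weighted by cohort size. The sketch below uses only the ten rows displayed above, copied by hand into a hypothetical frame called `shown`. Note that these are in-sample predictions (the model was fit on the same campaigns), so the error is likely to be optimistic compared with held-out data.

```python
import numpy as np
import pandas as pd

# The ten rows shown above: predicted proportion, actual proportion, cohort size.
shown = pd.DataFrame({
    "predict": [0.003658, 0.002078, 0.009376, 0.009832, 0.006094,
                0.001983, 0.007993, 0.002568, 0.000535, 0.003198],
    "donates90_dv": [0.004594, 0.002389, 0.007702, 0.008496, 0.007679,
                     0.002942, 0.004577, 0.002946, 0.001612, 0.005697],
    "cohort_size": [73784, 3348, 1558, 1177, 1172,
                    9177, 8739, 4753, 13647, 2984],
})

abs_err = (shown["predict"] - shown["donates90_dv"]).abs()

mae = abs_err.mean()  # every campaign counts equally
weighted_mae = np.average(abs_err, weights=shown["cohort_size"])  # big cohorts count more

print(mae, weighted_mae)
```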
