# Which Email Themes Increase Conversion Rate in Connect?

## Five problems to solve:
* Does the theme of emails affect conversion rate?
* Does the timing of emails affect conversion rate?
* Does the combination of theme and timing affect conversion rate?
* Does the number of emails received affect conversion rate?
* Suggestions for further experiments.

In [123]:
import pandas as pd
import numpy as np
from scipy import stats
from math import sqrt

Data:

In [78]:
df = pd.read_csv('connect_email_scrubbed_list.csv')
df.head()

Unnamed: 0,EmailAddress,Date Sent,List,Lead_Source,Theme,Enrolled
0,imayes@juniper.net,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0
1,pandyanandan007@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0
2,chrisdidato@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0
3,muthurengan@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0
4,minda_aguhob@post.harvard.edu,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0


### Answer Question 1: Does theme of emails affect conversion rate?

Merge "Not_Too_Late" into "not_too_late" and "personal_app_invite" into "personal_invite_to_apply" in "Theme":

In [83]:
df['Theme'] = df['Theme'].map({'Not_Too_Late':'not_too_late',
                               'personal_app_invite':'personal_invite_to_apply',
                               'cancelled_program':'cancelled_program',
                               'not_too_late':'not_too_late',
                               'deadline_extended':'deadline_extended',
                               'personal_invite_to_apply':'personal_invite_to_apply',
                               'personal_invite_to_submit':'personal_invite_to_submit'
                              })
df.groupby(['Theme','Enrolled'])['EmailAddress'].count().unstack(level=1)

Theme,Enrolled=0,Enrolled=1
cancelled_program,24.0,
deadline_extended,107.0,2.0
not_too_late,511.0,57.0
personal_invite_to_apply,629.0,20.0
personal_invite_to_submit,4.0,
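
`Series.map` with a dict returns NaN for any value missing from the dict, which is why the cell above lists every theme, including the ones that map to themselves. A minimal sketch of an alternative, assuming only the two aliases need merging, uses `Series.replace`, which leaves unlisted values untouched. The `themes` series below is made up for illustration and is not the real file.

```python
import pandas as pd

# Hypothetical sample of raw theme labels (not the real data).
themes = pd.Series(['Not_Too_Late', 'not_too_late', 'personal_app_invite',
                    'deadline_extended', 'some_new_theme'])

# Only the aliases are listed; every other label passes through unchanged.
aliases = {'Not_Too_Late': 'not_too_late',
           'personal_app_invite': 'personal_invite_to_apply'}
cleaned = themes.replace(aliases)

# With .map, any label missing from the dict becomes NaN instead.
mapped = themes.map(aliases)
print(cleaned.tolist())
print(mapped.isna().sum())
```

With the notebook's frame this would read `df['Theme'] = df['Theme'].replace(aliases)`.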


There are 5 Themes: 
    * cancelled_program
    * deadline_extended
    * not_too_late
    * personal_invite_to_apply
    * personal_invite_to_submit
Of these 5, "cancelled_program" (24 emails) and "personal_invite_to_submit" (4 emails) have no enrollments and are excluded from this analysis.

Compute the conversion rate for each of 3 groups:
    1 - deadline_extended
    2 - not_too_late
    3 - personal_invite_to_apply

In [84]:
n1 = df.loc[df['Theme']=='deadline_extended'].shape[0]
obs_v1 = df.loc[(df['Theme']=='deadline_extended') & (df['Enrolled']==1)].shape[0]

n2 = df.loc[df['Theme']=='not_too_late'].shape[0]
obs_v2 = df.loc[(df['Theme']=='not_too_late') & (df['Enrolled']==1)].shape[0]

n3 = df.loc[df['Theme']=='personal_invite_to_apply'].shape[0]
obs_v3 = df.loc[(df['Theme']=='personal_invite_to_apply') & (df['Enrolled']==1)].shape[0]

print("{0} students received deadline_extended emails, {1} of them enrolled. ({2:.1%})".format(n1,obs_v1,obs_v1/n1))
print("{0} students received not_too_late emails, {1} of them enrolled. ({2:.1%})".format(n2,obs_v2,obs_v2/n2))
print("{0} students received personal_invite_to_apply emails, {1} of them enrolled. ({2:.1%})".format(n3,obs_v3,obs_v3/n3))

109 students received deadline_extended emails, 2 of them enrolled. (1.8%)
568 students received not_too_late emails, 57 of them enrolled. (10.0%)
649 students received personal_invite_to_apply emails, 20 of them enrolled. (3.1%)
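
The three `n`/`obs_v` pairs above can also be produced in one pass with a group-by. This is a sketch, assuming `Enrolled` is a 0/1 column as in the data shown; the frame is rebuilt from the printed counts rather than read from the CSV, and `demo` and `summary` are names introduced here.

```python
import pandas as pd

# Rebuild a row-level frame from the counts printed above (illustration only).
counts = {'deadline_extended': (109, 2),
          'not_too_late': (568, 57),
          'personal_invite_to_apply': (649, 20)}
rows = []
for theme, (n, enrolled) in counts.items():
    rows += [{'Theme': theme, 'Enrolled': 1}] * enrolled
    rows += [{'Theme': theme, 'Enrolled': 0}] * (n - enrolled)
demo = pd.DataFrame(rows)

# sum = enrollments, count = emails sent, mean = conversion rate.
summary = demo.groupby('Theme')['Enrolled'].agg(
    enrolled='sum', sent='count', rate='mean')
print(summary)
```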


Just by looking at the conversion rates, theme not_too_late converts at 10.0%, well above deadline_extended (1.8%) and personal_invite_to_apply (3.1%). Note that deadline_extended has the lowest exposure (109 emails), so its rate is the least precisely estimated.

Let's calculate the 95% CI for all 3 groups:

In [85]:
# conversion rate (sample mean) of each group
m1 = obs_v1/n1
m2 = obs_v2/n2
m3 = obs_v3/n3
# variance of the sample proportion, p*(1-p)/n (the name sd is kept because the t-tests below use it)
sd1 = (obs_v1/n1*(1-obs_v1/n1))/n1
sd2 = (obs_v2/n2*(1-obs_v2/n2))/n2
sd3 = (obs_v3/n3*(1-obs_v3/n3))/n3
# 95% CI: rate +/- 1.96 * standard error; sd already includes the division by n
g1_lower = m1-1.96*np.sqrt(sd1)
g1_upper = m1+1.96*np.sqrt(sd1)
g2_lower = m2-1.96*np.sqrt(sd2)
g2_upper = m2+1.96*np.sqrt(sd2)
g3_lower = m3-1.96*np.sqrt(sd3)
g3_upper = m3+1.96*np.sqrt(sd3)

print("for students who received deadline_extended emails, 95% CI of conversion rate: [{0:.2%},{1:.2%}]".format(g1_lower,g1_upper))
print("for students who received not_too_late emails, 95% CI of conversion rate: [{0:.2%},{1:.2%}]".format(g2_lower,g2_upper))
print("for students who received personal_invite_to_apply emails, 95% CI of conversion rate: [{0:.2%},{1:.2%}]".format(g3_lower,g3_upper))

for students who received deadline_extended emails, 95% CI of conversion rate: [-0.68%,4.35%]
for students who received not_too_late emails, 95% CI of conversion rate: [7.56%,12.51%]
for students who received personal_invite_to_apply emails, 95% CI of conversion rate: [1.75%,4.41%]
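
With only 2 enrollments out of 109 emails, the normal approximation behind a "rate plus or minus 1.96 standard errors" interval is shaky for deadline_extended. The Wilson score interval is one common alternative that always stays inside [0, 1]. The sketch below is an addition, not part of the original analysis; `wilson_interval` is a name introduced here and the counts are taken from the conversion-rate output above.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Enrolled and total counts per theme, from the output above.
for name, x, n in [('deadline_extended', 2, 109),
                   ('not_too_late', 57, 568),
                   ('personal_invite_to_apply', 20, 649)]:
    lower, upper = wilson_interval(x, n)
    print("{0}: [{1:.2%}, {2:.2%}]".format(name, lower, upper))
```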


T-test between deadline_extended and not_too_late:

In [86]:
s_t = np.sqrt(((n1-1)*n1*sd1+(n2-1)*n2*sd2)/(n1+n2-2))
t = (m2-m1)/(s_t*np.sqrt(1/n1+1/n2))
tscore = stats.t.ppf(.95,n1+n2-2)
print("t stats is {0}; 95% t score is {1}".format(t,tscore))

t stats is 2.795034714714805; 95% t score is 1.6471141829675946


T-test between not_too_late and personal_invite_to_apply:

In [87]:
s_t = np.sqrt(((n2-1)*n2*sd2+(n3-1)*n3*sd3)/(n2+n3-2))
t = (m3-m2)/(s_t*np.sqrt(1/n2+1/n3))
tscore = stats.t.ppf(.95,n2+n3-2)
print("t stats is {0}; 95% t score is {1}".format(t,tscore))

t stats is -5.022470896919387; 95% t score is 1.646108720535874


T-test between deadline_extended and personal_invite_to_apply:

In [88]:
s_t = np.sqrt(((n1-1)*n1*sd1+(n3-1)*n3*sd3)/(n1+n3-2))
t = (m3-m1)/(s_t*np.sqrt(1/n1+1/n3))
tscore = stats.t.ppf(.95,n1+n3-2)
print("t stats is {0}; 95% t score is {1}".format(t,tscore))

t stats is 0.7175939162507401; 95% t score is 1.6468716817714208
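
The three cells above repeat the same arithmetic, so it can be wrapped in a helper. This sketch reproduces the pooled-variance t statistic used above from raw counts; `pooled_t` is a name introduced here, not part of the original notebook. It also returns a two-sided p-value, which is an addition: the cells above compare against a one-sided 95% critical value instead.

```python
import numpy as np
from scipy import stats

def pooled_t(x1, n1, x2, n2):
    """Pooled-variance t statistic for two 0/1 samples (group 2 minus group 1)."""
    p1, p2 = x1 / n1, x2 / n2
    dof = n1 + n2 - 2
    # p*(1-p) is the variance of a 0/1 variable, as in the cells above.
    s_pooled = np.sqrt(((n1 - 1) * p1 * (1 - p1) + (n2 - 1) * p2 * (1 - p2)) / dof)
    t = (p2 - p1) / (s_pooled * np.sqrt(1 / n1 + 1 / n2))
    p_two_sided = 2 * stats.t.sf(abs(t), dof)
    return t, p_two_sided

# deadline_extended (2 of 109) vs not_too_late (57 of 568)
t, p = pooled_t(2, 109, 57, 568)
print("t = {0:.3f}, two-sided p = {1:.4f}".format(t, p))
```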


### Result:
    * Theme not_too_late is significantly more effective than deadline_extended and personal_invite_to_apply (|t| = 2.80 and 5.02, both above the one-sided 95% critical value of about 1.65).
    * There is no significant difference between deadline_extended and personal_invite_to_apply (t = 0.72).

### Answer Question 2: Does time of emails affect conversion rate?

In [89]:
df.groupby(['Date Sent','Enrolled'])['EmailAddress'].count().unstack(level=1)

Date Sent,Enrolled=0,Enrolled=1
"Aug, 2017",463,8
"Dec, 2017",193,18
"Nov, 2017",80,3
"Oct, 2017",476,49
"Sept, 2017",63,1


The file covers five months of data: Aug, Sept, Oct, Nov and Dec.
Compute the conversion rate for each of these 5 months:
    1 - Aug
    2 - Sep
    3 - Oct
    4 - Nov
    5 - Dec

In [48]:
n1 = df.loc[df['Date Sent']=='Aug, 2017'].shape[0]
obs_v1 = df.loc[(df['Date Sent']=='Aug, 2017') & (df['Enrolled']==1)].shape[0]

n2 = df.loc[df['Date Sent']=='Sept, 2017'].shape[0]
obs_v2 = df.loc[(df['Date Sent']=='Sept, 2017') & (df['Enrolled']==1)].shape[0]

n3 = df.loc[df['Date Sent']=='Oct, 2017'].shape[0]
obs_v3 = df.loc[(df['Date Sent']=='Oct, 2017') & (df['Enrolled']==1)].shape[0]

n4 = df.loc[df['Date Sent']=='Nov, 2017'].shape[0]
obs_v4 = df.loc[(df['Date Sent']=='Nov, 2017') & (df['Enrolled']==1)].shape[0]

n5 = df.loc[df['Date Sent']=='Dec, 2017'].shape[0]
obs_v5 = df.loc[(df['Date Sent']=='Dec, 2017') & (df['Enrolled']==1)].shape[0]

print("{0} students received emails in Aug, {1} of them enrolled. ({2:.1%})".format(n1,obs_v1,obs_v1/n1))
print("{0} students received emails in Sept, {1} of them enrolled. ({2:.1%})".format(n2,obs_v2,obs_v2/n2))
print("{0} students received emails in Oct, {1} of them enrolled. ({2:.1%})".format(n3,obs_v3,obs_v3/n3))
print("{0} students received emails in Nov, {1} of them enrolled. ({2:.1%})".format(n4,obs_v4,obs_v4/n4))
print("{0} students received emails in Dec, {1} of them enrolled. ({2:.1%})".format(n5,obs_v5,obs_v5/n5))

471 students received emails in Aug, 8 of them enrolled. (1.7%)
64 students received emails in Sept, 1 of them enrolled. (1.6%)
525 students received emails in Oct, 49 of them enrolled. (9.3%)
83 students received emails in Nov, 3 of them enrolled. (3.6%)
211 students received emails in Dec, 18 of them enrolled. (8.5%)


We sent 525 emails in Oct, 471 in Aug and 211 in Dec. Oct has the highest conversion rate (9.3%).

Compute 95% CI:

In [49]:
n = [n1,n2,n3,n4,n5]
obs = [obs_v1,obs_v2,obs_v3,obs_v4,obs_v5]
month = ['Aug','Sept','Oct','Nov','Dec']
m = []
sd = []
g_lower = []
g_upper = []

for i in range(5):
    m.append(obs[i]/n[i])
    sd.append((obs[i]/n[i]*(1-obs[i]/n[i]))/n[i])
    # sd[i] is already p*(1-p)/n, so its square root is the standard error
    g_lower.append(m[i]-1.96*sqrt(sd[i]))
    g_upper.append(m[i]+1.96*sqrt(sd[i]))
    print("for students who received emails in {0}, 95% CI of conversion rate: [{1:.2%},{2:.2%}]".format(month[i],g_lower[i],g_upper[i]))

for students who received emails in Aug, 95% CI of conversion rate: [0.53%,2.87%]
for students who received emails in Sept, 95% CI of conversion rate: [-1.48%,4.60%]
for students who received emails in Oct, 95% CI of conversion rate: [6.84%,11.82%]
for students who received emails in Nov, 95% CI of conversion rate: [-0.40%,7.63%]
for students who received emails in Dec, 95% CI of conversion rate: [4.76%,12.30%]


T-test:

In [67]:
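# build independent rows; [[0]*5]*5 would make all 5 rows the same list object
t = [[0]*5 for _ in range(5)]
t_score = [[0]*5 for _ in range(5)]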
for i in range(5):
    for j in range(5):
        if i < j:
            s_t = np.sqrt(((n[i]-1)*n[i]*sd[i]+(n[j]-1)*n[j]*sd[j])/(n[i]+n[j]-2))
            t[i][j] = (m[j]-m[i])/(s_t*np.sqrt(1/n[i]+1/n[j]))
            t_score[i][j] = stats.t.ppf(.95,n[i]+n[j]-2)
            print("t-test for {0} and {1}: t stats is {2:.4}; 95% t score is {3:.4}".format(month[i],month[j],t[i][j],t_score[i][j]))

t-test for Aug and Sept: t stats is -0.07938; 95% t score is 1.648
t-test for Aug and Oct: t stats is 5.25; 95% t score is 1.646
t-test for Aug and Nov: t stats is 1.156; 95% t score is 1.648
t-test for Aug and Dec: t stats is 4.369; 95% t score is 1.647
t-test for Sept and Oct: t stats is 2.112; 95% t score is 1.647
t-test for Sept and Nov: t stats is 0.7594; 95% t score is 1.655
t-test for Sept and Dec: t stats is 1.937; 95% t score is 1.65
t-test for Oct and Nov: t stats is -1.735; 95% t score is 1.647
t-test for Oct and Dec: t stats is -0.3423; 95% t score is 1.647
t-test for Nov and Dec: t stats is 1.478; 95% t score is 1.65
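
Ten pairwise tests, each run at the 5% level, can turn up a few "significant" pairs by chance alone. One common complement, sketched here as an addition rather than part of the original analysis, is a single chi-square test of independence across all five months using `scipy.stats.chi2_contingency`. The counts are copied from the Date Sent table above.

```python
from scipy import stats

# Rows: Aug, Sept, Oct, Nov, Dec; columns: not enrolled, enrolled.
observed = [[463, 8],
            [63, 1],
            [476, 49],
            [80, 3],
            [193, 18]]

# Null hypothesis: enrollment rate does not depend on the month.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print("chi2 = {0:.2f}, dof = {1}, p = {2:.2g}".format(chi2, dof, p_value))
```

Sept and Nov have expected enrollment counts below 5 (about 3.7 and 4.8), so the chi-square approximation is rough for those rows.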


### Result:
    * Recall conversion rate:
        * Aug (471): 1.7%
        * Sept (64): 1.6%
        * Oct (525): 9.3%
        * Nov (83): 3.6%
        * Dec (211): 8.5%
    * Conversion rates significantly different (|t| above the one-sided 95% critical value):
        * Aug and Oct
        * Aug and Dec
        * Sept and Oct
        * Sept and Dec
        * Oct and Nov
    * Conversion rates not significantly different:
        * Aug and Sept
        * Aug and Nov
        * Sept and Nov
        * Oct and Dec
        * Nov and Dec
    * Aug had a large number of emails sent (471) but a conversion rate of only 1.7%, which is low.
    * Q4 (Oct to Dec) has a higher conversion rate than Q3 (Aug, Sept) in general.
    * Within Q4, Oct and Dec have higher conversion rates than Nov, although the Nov vs. Dec difference is not significant.

### Answer Question 3: Does combination of theme and time of emails affect conversion rate?

In [90]:
df.groupby(['Date Sent','Theme','Enrolled'])['EmailAddress'].count().unstack(level=2)

Date Sent,Theme,Enrolled=0,Enrolled=1
"Aug, 2017",deadline_extended,68.0,1.0
"Aug, 2017",personal_invite_to_apply,395.0,7.0
"Dec, 2017",not_too_late,193.0,18.0
"Nov, 2017",personal_invite_to_apply,76.0,3.0
"Nov, 2017",personal_invite_to_submit,4.0,
"Oct, 2017",not_too_late,318.0,39.0
"Oct, 2017",personal_invite_to_apply,158.0,10.0
"Sept, 2017",cancelled_program,24.0,
"Sept, 2017",deadline_extended,39.0,1.0


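Recall that the conversion rate is highest for theme "not_too_late" and for Oct. The table shows that "not_too_late" was sent only in Oct and Dec, so theme and month cannot be fully separated with this data.
See Question 5 for further comments.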
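
To read the theme and month combinations as rates rather than raw counts, one option is a pivot table of the mean of `Enrolled` alongside the cell sizes. This is a sketch on a tiny made-up frame that only borrows the notebook's column names; the values are not real data.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the notebook's df.
demo = pd.DataFrame({
    'Date Sent': ['Oct, 2017'] * 4 + ['Aug, 2017'] * 4,
    'Theme': ['not_too_late', 'not_too_late',
              'personal_invite_to_apply', 'personal_invite_to_apply'] * 2,
    'Enrolled': [1, 0, 0, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column is the conversion rate; count is the cell size.
rate = demo.pivot_table(index='Date Sent', columns='Theme',
                        values='Enrolled', aggfunc='mean')
size = demo.pivot_table(index='Date Sent', columns='Theme',
                        values='Enrolled', aggfunc='count')
print(rate)
print(size)
```

Showing `size` next to `rate` makes it easier to spot cells whose rate rests on only a handful of emails.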

### Answer Question 4: Does number of emails received affect conversion rate?

In [103]:
num_emails_received = df.groupby(['EmailAddress'])['List'].count().reset_index()
num_emails_received.columns = ['EmailAddress','num_emails_received']

In [105]:
df1 = pd.merge(df,num_emails_received,on=['EmailAddress'],how='left')
df1.head()

Unnamed: 0,EmailAddress,Date Sent,List,Lead_Source,Theme,Enrolled,num_emails_received
0,imayes@juniper.net,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0,1
1,pandyanandan007@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0,2
2,chrisdidato@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0,2
3,muthurengan@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0,1
4,minda_aguhob@post.harvard.edu,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0,1


In [107]:
df1.groupby(['num_emails_received','Enrolled'])['EmailAddress'].count().unstack(level=1)

num_emails_received,Enrolled=0,Enrolled=1
1,735.0,21.0
2,198.0,44.0
3,177.0,6.0
4,112.0,8.0
5,40.0,
6,6.0,
7,7.0,


In [110]:
n1 = df1.loc[df1['num_emails_received']==1].shape[0]
obs_v1 = df1.loc[(df1['num_emails_received']==1) & (df1['Enrolled']==1)].shape[0]

n2 = df1.loc[df1['num_emails_received']==2].shape[0]
obs_v2 = df1.loc[(df1['num_emails_received']==2) & (df1['Enrolled']==1)].shape[0]

print("{0} students received 1 email, {1} of them enrolled. ({2:.1%})".format(n1,obs_v1,obs_v1/n1))
print("{0} students received 2 emails, {1} of them enrolled. ({2:.1%})".format(n2,obs_v2,obs_v2/n2))

756 students received 1 email, 21 of them enrolled. (2.8%)
242 students received 2 emails, 44 of them enrolled. (18.2%)


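Students who received exactly 2 emails have by far the highest conversion rate (18.2%), compared with 2.8% for 1 email, 3.3% for 3 emails (6/183) and 6.7% for 4 emails (8/120). Nobody who received 5 or more emails enrolled. Note that these counts are email rows rather than distinct students, so a student who received k emails is counted k times.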
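
A per-student view of the same question can be sketched as follows. Since `df1` holds one row per email, the sketch first collapses the rows to one per address, under the assumption that a student counts as enrolled if any of their rows has `Enrolled == 1`. The frame and the names `emails`, `students` and `by_count` are made up for illustration.

```python
import pandas as pd

# Made-up email-level rows: one row per email sent.
emails = pd.DataFrame({
    'EmailAddress': ['a@x.org', 'b@x.org', 'b@x.org',
                     'c@x.org', 'c@x.org', 'd@x.org'],
    'Enrolled': [0, 1, 1, 0, 0, 0],
})

# Collapse to one row per student.
students = emails.groupby('EmailAddress')['Enrolled'].agg(
    num_emails_received='count', enrolled='max').reset_index()

# Conversion rate per number of emails, counting each student once.
by_count = students.groupby('num_emails_received')['enrolled'].agg(
    students='count', rate='mean')
print(by_count)
```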

### Answer Question 5: Suggestions on further experiment.

By Status of Students:
* Students who have already applied, and whom we email to encourage enrollment, are more responsive than students we email in general to encourage them to apply.

By Month:
* Looking closer, Q3's conversion rate is lower than Q4's because the 'not_too_late' emails were sent only in Q4.

By Email Theme:
* For email 'deadline_extended', we sent 69 emails in Aug and 40 in Sept. Only 2 students converted.
* For email "personal_invite_to_apply", the conversion rate in Oct (10/168 = 6.0%) is higher than in Aug (7/402 = 1.7%) and Nov (3/79 = 3.8%). This is worth investigating further to see whether the theme is month-sensitive. It may help to check whether more cohorts opened in Oct, or where the student lists came from.
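
One way to start the investigation suggested above is to test Oct against Aug within the personal_invite_to_apply theme alone. This is a sketch using Fisher's exact test (`scipy.stats.fisher_exact`), chosen here because the enrolled counts are small. It is an addition, not part of the original analysis. The counts are enrolled / not enrolled from the Question 3 table.

```python
from scipy import stats

# personal_invite_to_apply only.
# Rows: Oct, Aug; columns: enrolled, not enrolled.
table = [[10, 158],
         [7, 395]]

# Two-sided test of whether the odds of enrolling differ between the months.
odds_ratio, p_value = stats.fisher_exact(table, alternative='two-sided')
print("odds ratio = {0:.2f}, p = {1:.4f}".format(odds_ratio, p_value))
```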

In [111]:
df['Lead_Source'].unique()

array(['App_Created_Sept', 'Info_session_Oct_30', 'app_created_Oct',
       'app_created_Aug', 'Created_app_Aug', 'app_created_aug',
       'info_session_Oct_30', 'app_created_Sept_FEND', 'created_app_Oct',
       'created_app_DAND', 'created_app_MLND', 'created_submitted_app_Dec'], dtype=object)

#### What can we do next:
* I think Theme has a stronger effect on conversion rate than time (month).
* Does Lead_Source affect conversion rate?
* Does a new ND vs. a stable ND (one that already has reviews/reputation) affect conversion rate?
* Likewise for location: does a new city vs. a stable city affect conversion rate?

### Let's do a Lead_Source t-test on the current data.

In [113]:
df['Lead_Source'] = df['Lead_Source'].map({'App_Created_Sept':'App_Created',
                                           'Info_session_Oct_30':'Info_Session',
                                           'app_created_Oct':'App_Created',
                                           'app_created_Aug':'App_Created',
                                           'Created_app_Aug':'App_Created',
                                           'app_created_aug':'App_Created',
                                           'info_session_Oct_30':'Info_Session',
                                           'app_created_Sept_FEND':'App_Created',
                                           'created_app_Oct':'App_Created',
                                           'created_app_DAND':'App_Created',
                                           'created_app_MLND':'App_Created',
                                           'created_submitted_app_Dec':'App_Submitted'
                                          })
df.groupby(['Lead_Source','Enrolled'])['EmailAddress'].count().unstack(level=1)

Lead_Source,Enrolled=0,Enrolled=1
App_Created,924,51
App_Submitted,193,18
Info_Session,158,10
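
The twelve-entry dict above has to be extended by hand whenever a new label appears, and `.map` turns any label it does not list into NaN. A rule-based helper is one alternative. The sketch below infers its rules from the mapping above; `normalize_lead_source` is a name introduced here, and the keyword rules are an assumption that happens to reproduce the twelve labels shown.

```python
def normalize_lead_source(raw):
    """Collapse a raw Lead_Source label into one of three buckets."""
    label = raw.lower()
    if 'info_session' in label:
        return 'Info_Session'
    if 'submitted' in label:
        return 'App_Submitted'
    if 'app' in label:
        return 'App_Created'
    return raw  # keep unknown labels visible instead of turning them into NaN

raw_labels = ['App_Created_Sept', 'Info_session_Oct_30', 'app_created_Oct',
              'app_created_Aug', 'Created_app_Aug', 'app_created_aug',
              'info_session_Oct_30', 'app_created_Sept_FEND', 'created_app_Oct',
              'created_app_DAND', 'created_app_MLND', 'created_submitted_app_Dec']
for label in raw_labels:
    print(label, '->', normalize_lead_source(label))
```

Applied to the notebook's frame, this would read `df['Lead_Source'].map(normalize_lead_source)`.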


In [120]:
n = []
obs = []
lead_source = ['App_Created','App_Submitted','Info_Session']

for i in range(len(lead_source)):
    n.append(df.loc[df['Lead_Source']==lead_source[i]].shape[0])
    obs.append(df.loc[(df['Lead_Source']==lead_source[i]) & (df['Enrolled']==1)].shape[0])
    print("{0} students received {1} emails, {2} of them enrolled. ({3:.1%})".format(lead_source[i],n[i],obs[i],obs[i]/n[i]))

App_Created students received 975 emails, 51 of them enrolled. (5.2%)
App_Submitted students received 211 emails, 18 of them enrolled. (8.5%)
Info_Session students received 168 emails, 10 of them enrolled. (6.0%)


In [121]:
m = []
sd = []
g_lower = []
g_upper = []

for i in range(len(lead_source)):
    m.append(obs[i]/n[i])
    sd.append((obs[i]/n[i]*(1-obs[i]/n[i]))/n[i])
    # sd[i] is already p*(1-p)/n, so its square root is the standard error
    g_lower.append(m[i]-1.96*sqrt(sd[i]))
    g_upper.append(m[i]+1.96*sqrt(sd[i]))
    print("for students who received {0} emails, 95% CI of conversion rate: [{1:.2%},{2:.2%}]".format(lead_source[i],g_lower[i],g_upper[i]))

for students who received App_Created emails, 95% CI of conversion rate: [3.83%,6.63%]
for students who received App_Submitted emails, 95% CI of conversion rate: [4.76%,12.30%]
for students who received Info_Session emails, 95% CI of conversion rate: [2.37%,9.53%]


In [122]:
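# build independent rows; [[0]*3]*3 would make all 3 rows the same list object
t = [[0]*3 for _ in range(3)]
t_score = [[0]*3 for _ in range(3)]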
for i in range(len(lead_source)):
    for j in range(len(lead_source)):
        if i < j:
            s_t = np.sqrt(((n[i]-1)*n[i]*sd[i]+(n[j]-1)*n[j]*sd[j])/(n[i]+n[j]-2))
            t[i][j] = (m[j]-m[i])/(s_t*np.sqrt(1/n[i]+1/n[j]))
            t_score[i][j] = stats.t.ppf(.95,n[i]+n[j]-2)
            print("t-test for {0} and {1}: t stats is {2:.4}; 95% t score is {3:.4}".format(lead_source[i],lead_source[j],t[i][j],t_score[i][j]))

t-test for App_Created and App_Submitted: t stats is 1.86; 95% t score is 1.646
t-test for App_Created and Info_Session: t stats is 0.3844; 95% t score is 1.646
t-test for App_Submitted and Info_Session: t stats is -0.9544; 95% t score is 1.649


### Results:
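    * Students who have submitted an application are more responsive to emails than students who have created but not submitted one (t = 1.86, above the one-sided 95% critical value of 1.646). The other two comparisons are not significant.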

### If we look at lead_source combined with Theme - personal_invite_to_apply/submit?

In [125]:
df[df['Theme']=='personal_invite_to_apply'].groupby(['Lead_Source','Theme','Enrolled'])['EmailAddress'].count().unstack(level=2)

Lead_Source,Theme,Enrolled=0,Enrolled=1
App_Created,personal_invite_to_apply,471,10
Info_Session,personal_invite_to_apply,158,10


For students who already have an app, conversion rate: 10/481 = 2.08%.
For students who might not have an app, conversion rate: 10/168 = 5.95%.

Conclusion: Students who make the effort to attend an info session have a higher conversion rate than students in general.