# Which Email Themes Increase Conversion Rate in Connect?

## Four questions to answer:
* Does the theme of an email affect conversion rate?
* Does the timing of an email affect conversion rate?
* Does the combination of theme and timing affect conversion rate?
* Does the number of emails received affect conversion rate?
* Suggestions for further experiments.

In [44]:
import pandas as pd
import numpy as np
from scipy import stats
from math import sqrt
#import seaborn as sns

Data:

In [26]:
df = pd.read_csv('connect_email_scrubbed_list.csv')
df.head()

Unnamed: 0,EmailAddress,Date Sent,List,Lead_Source,Theme,Enrolled
0,imayes@juniper.net,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0
1,pandyanandan007@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0
2,chrisdidato@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0
3,muthurengan@gmail.com,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0
4,minda_aguhob@post.harvard.edu,"Sept, 2017",Connect application extended email list-->IPND,App_Created_Sept,cancelled_program,0


### Answer question 1: Does the theme of an email affect conversion rate?

Merge the duplicate labels in "Theme": "Not_Too_Late" becomes "not_too_late" and "personal_app_invite" becomes "personal_invite_to_apply":

In [13]:
df['Theme'] = df['Theme'].map({'Not_Too_Late':'not_too_late',
                               'personal_app_invite':'personal_invite_to_apply',
                               'cancelled_program':'cancelled_program',
                               'deadline_extended':'deadline_extended',
                               'personal_invite_to_apply':'personal_invite_to_apply',
                               'personal_invite_to_submit':'personal_invite_to_submit'
                              })
df.groupby(['Theme','Enrolled'])['EmailAddress'].count()

Theme                      Enrolled
cancelled_program          0            24
deadline_extended          0           107
                           1             2
not_too_late               0            23
                           1            17
personal_invite_to_apply   0           629
                           1            20
personal_invite_to_submit  0             4
Name: EmailAddress, dtype: int64
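
The label merge can also be written with only the two aliases spelled out, so that any label missing from the table passes through unchanged (with `Series.map` and a dict, a label that is not a key becomes NaN). A minimal sketch in plain Python; `ALIASES` and `normalize_theme` are hypothetical names, not part of the notebook:

```python
# Only the labels that need renaming; everything else is kept as-is.
ALIASES = {
    "Not_Too_Late": "not_too_late",
    "personal_app_invite": "personal_invite_to_apply",
}

def normalize_theme(theme):
    """Return the canonical theme label, leaving unknown labels untouched."""
    return ALIASES.get(theme, theme)

raw = ["Not_Too_Late", "personal_app_invite", "deadline_extended"]
cleaned = [normalize_theme(label) for label in raw]
print(cleaned)
```

On the DataFrame itself, `df['Theme'].replace(ALIASES)` should give the same pass-through behavior.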

There are 5 themes:
* cancelled_program
* deadline_extended
* not_too_late
* personal_invite_to_apply
* personal_invite_to_submit

Of these 5, "cancelled_program" and "personal_invite_to_submit" are excluded from this analysis; in the counts above they have only 24 and 4 recipients and no enrollments.

Compute the conversion rate for each of the 3 remaining groups:
1. deadline_extended
2. not_too_late
3. personal_invite_to_apply

In [32]:
n1 = df.loc[df['Theme']=='deadline_extended'].shape[0]
obs_v1 = df.loc[(df['Theme']=='deadline_extended') & (df['Enrolled']==1)].shape[0]

n2 = df.loc[df['Theme']=='not_too_late'].shape[0]
obs_v2 = df.loc[(df['Theme']=='not_too_late') & (df['Enrolled']==1)].shape[0]

n3 = df.loc[df['Theme']=='personal_invite_to_apply'].shape[0]
obs_v3 = df.loc[(df['Theme']=='personal_invite_to_apply') & (df['Enrolled']==1)].shape[0]

print("{0} students received deadline_extended emails, {1} of them enrolled. ({2:.1%})".format(n1,obs_v1,obs_v1/n1))
print("{0} students received not_too_late emails, {1} of them enrolled. ({2:.1%})".format(n2,obs_v2,obs_v2/n2))
print("{0} students received personal_invite_to_apply emails, {1} of them enrolled. ({2:.1%})".format(n3,obs_v3,obs_v3/n3))

109 students received deadline_extended emails, 2 of them enrolled. (1.8%)
528 students received not_too_late emails, 40 of them enrolled. (7.6%)
79 students received personal_invite_to_apply emails, 3 of them enrolled. (3.8%)


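Going by the grouped counts, not_too_late has a conversion rate of 42.5% (17 of 40), far higher than deadline_extended (1.8%, 2 of 109) and personal_invite_to_apply (3.1%, 20 of 649). Note that not_too_late also has the fewest recipients. The printed output above reports different counts for not_too_late (40 of 528) and personal_invite_to_apply (3 of 79), so it appears to come from a different state of the data than the grouped table.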
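
As a cross-check, the rates quoted from the grouped table can be recomputed directly from its counts. A minimal sketch in plain Python; the `counts` dictionary is typed in by hand from the `groupby` output, and `conversion_rate` is a hypothetical helper name:

```python
# (not enrolled, enrolled) per theme, copied from the groupby output
counts = {
    "deadline_extended": (107, 2),
    "not_too_late": (23, 17),
    "personal_invite_to_apply": (629, 20),
}

def conversion_rate(not_enrolled, enrolled):
    """Share of recipients who enrolled; NaN if nobody received the email."""
    total = not_enrolled + enrolled
    return enrolled / total if total else float("nan")

rates = {theme: conversion_rate(*pair) for theme, pair in counts.items()}
for theme, rate in rates.items():
    print(f"{theme}: {sum(counts[theme])} recipients, {rate:.1%} enrolled")
```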

Let's calculate the 95% CI for all 3 groups:

In [33]:
# mean and variance of the sample proportion
m1 = obs_v1/n1
m2 = obs_v2/n2
m3 = obs_v3/n3
# sd1..sd3 hold p*(1-p)/n, the squared standard error (reused by the t-tests below)
sd1 = m1*(1-m1)/n1
sd2 = m2*(1-m2)/n2
sd3 = m3*(1-m3)/n3
# 95% CI: mean +/- 1.96 * standard error
# (the normal approximation can give a lower bound below 0 when counts are small)
g1_lower = m1-1.96*np.sqrt(sd1)
g1_upper = m1+1.96*np.sqrt(sd1)
g2_lower = m2-1.96*np.sqrt(sd2)
g2_upper = m2+1.96*np.sqrt(sd2)
g3_lower = m3-1.96*np.sqrt(sd3)
g3_upper = m3+1.96*np.sqrt(sd3)

print("For students who received deadline_extended emails, 95% CI of conversion rate: [{0:.2%},{1:.2%}]".format(g1_lower,g1_upper))
print("For students who received not_too_late emails, 95% CI of conversion rate: [{0:.2%},{1:.2%}]".format(g2_lower,g2_upper))
print("For students who received personal_invite_to_apply emails, 95% CI of conversion rate: [{0:.2%},{1:.2%}]".format(g3_lower,g3_upper))

For students who received deadline_extended emails, 95% CI of conversion rate: [-0.68%,4.35%]
For students who received not_too_late emails, 95% CI of conversion rate: [5.32%,9.83%]
For students who received personal_invite_to_apply emails, 95% CI of conversion rate: [-0.42%,8.01%]
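
For reference, the usual normal-approximation (Wald) 95% interval for a proportion is p ± 1.96·sqrt(p(1−p)/n), with n appearing once under the square root. A minimal sketch in plain Python; `wald_ci` is a hypothetical helper name, and the example counts (2 of 109) are the deadline_extended figures from the printed output:

```python
from math import sqrt

def wald_ci(successes, n, z=1.96):
    """Normal-approximation CI for a proportion: p +/- z * sqrt(p*(1-p)/n)."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lower, upper = wald_ci(2, 109)  # deadline_extended: 2 of 109 enrolled
print(f"[{lower:.2%}, {upper:.2%}]")
```

With only a handful of conversions the lower bound can fall below zero, which is a known limitation of this approximation rather than a meaningful negative rate.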


T-test between deadline_extended and not_too_late:

In [34]:
s_t = np.sqrt(((n1-1)*n1*sd1+(n2-1)*n2*sd2)/(n1+n2-2))
t = (m2-m1)/(s_t*np.sqrt(1/n1+1/n2))
tscore = stats.t.ppf(.95,n1+n2-2)
print("t stats is {0}; 95% t score is {1}".format(t,tscore))

t stats is 2.2062746908246575; 95% t score is 1.6472567893565855


T-test between not_too_late and personal_invite_to_apply:

In [35]:
s_t = np.sqrt(((n2-1)*n2*sd2+(n3-1)*n3*sd3)/(n2+n3-2))
t = (m3-m2)/(s_t*np.sqrt(1/n2+1/n3))
tscore = stats.t.ppf(.95,n2+n3-2)
print("t stats is {0}; 95% t score is {1}".format(t,tscore))

t stats is -1.2219255109196285; 95% t score is 1.6473761381549312


T-test between deadline_extended and personal_invite_to_apply:

In [36]:
s_t = np.sqrt(((n1-1)*n1*sd1+(n3-1)*n3*sd3)/(n1+n3-2))
t = (m3-m1)/(s_t*np.sqrt(1/n1+1/n3))
tscore = stats.t.ppf(.95,n1+n3-2)
print("t stats is {0}; 95% t score is {1}".format(t,tscore))

t stats is 0.8272764526766723; 95% t score is 1.6530871383957708
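
The three comparisons above repeat the same arithmetic: a pooled two-sample t statistic in which each group's variance is the Bernoulli variance p(1−p). A minimal sketch that wraps it in one function, in plain Python; `pooled_t` is a hypothetical name, and the counts are those from the printed output (2 of 109, 40 of 528, 3 of 79):

```python
from math import sqrt

def pooled_t(x1, n1, x2, n2):
    """Pooled two-sample t statistic for the difference x2/n2 - x1/n1."""
    p1, p2 = x1 / n1, x2 / n2
    # Bernoulli variances p*(1-p), pooled with (n-1) weights
    pooled_var = ((n1 - 1) * p1 * (1 - p1) + (n2 - 1) * p2 * (1 - p2)) / (n1 + n2 - 2)
    return (p2 - p1) / (sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2))

t_stat = pooled_t(2, 109, 40, 528)  # deadline_extended vs not_too_late
print(f"t = {t_stat:.4f}")
```

The result is then compared against the critical value from `stats.t.ppf(.95, n1+n2-2)`, as in the cells above.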


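### Result:
* Theme not_too_late is significantly more effective than deadline_extended (t = 2.21, above the critical value of 1.65).
* Against personal_invite_to_apply, the grouped counts favor not_too_late (42.5% vs 3.1%), but the printed t statistic (-1.22) does not exceed the critical value, so this comparison needs to be re-run on consistent counts.
* There's no significant difference between deadline_extended and personal_invite_to_apply (t = 0.83).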

### Answer Question 2: Does the timing of emails affect conversion rate?

In [27]:
df.groupby(['Date Sent','Enrolled'])['EmailAddress'].count()

Date Sent   Enrolled
Aug, 2017   0           463
            1             8
Dec, 2017   0           193
            1            18
Nov, 2017   0            80
            1             3
Oct, 2017   0           476
            1            49
Sept, 2017  0            63
            1             1
Name: EmailAddress, dtype: int64

The file covers 5 months of data: Aug, Sept, Oct, Nov and Dec 2017.
Compute the conversion rate for each of these 5 months:
1. Aug
2. Sept
3. Oct
4. Nov
5. Dec

In [48]:
n1 = df.loc[df['Date Sent']=='Aug, 2017'].shape[0]
obs_v1 = df.loc[(df['Date Sent']=='Aug, 2017') & (df['Enrolled']==1)].shape[0]

n2 = df.loc[df['Date Sent']=='Sept, 2017'].shape[0]
obs_v2 = df.loc[(df['Date Sent']=='Sept, 2017') & (df['Enrolled']==1)].shape[0]

n3 = df.loc[df['Date Sent']=='Oct, 2017'].shape[0]
obs_v3 = df.loc[(df['Date Sent']=='Oct, 2017') & (df['Enrolled']==1)].shape[0]

n4 = df.loc[df['Date Sent']=='Nov, 2017'].shape[0]
obs_v4 = df.loc[(df['Date Sent']=='Nov, 2017') & (df['Enrolled']==1)].shape[0]

n5 = df.loc[df['Date Sent']=='Dec, 2017'].shape[0]
obs_v5 = df.loc[(df['Date Sent']=='Dec, 2017') & (df['Enrolled']==1)].shape[0]

print("{0} students received emails in Aug, {1} of them enrolled. ({2:.1%})".format(n1,obs_v1,obs_v1/n1))
print("{0} students received emails in Sept, {1} of them enrolled. ({2:.1%})".format(n2,obs_v2,obs_v2/n2))
print("{0} students received emails in Oct, {1} of them enrolled. ({2:.1%})".format(n3,obs_v3,obs_v3/n3))
print("{0} students received emails in Nov, {1} of them enrolled. ({2:.1%})".format(n4,obs_v4,obs_v4/n4))
print("{0} students received emails in Dec, {1} of them enrolled. ({2:.1%})".format(n5,obs_v5,obs_v5/n5))

471 students received emails in Aug, 8 of them enrolled. (1.7%)
64 students received emails in Sept, 1 of them enrolled. (1.6%)
525 students received emails in Oct, 49 of them enrolled. (9.3%)
83 students received emails in Nov, 3 of them enrolled. (3.6%)
211 students received emails in Dec, 18 of them enrolled. (8.5%)


We sent 525 emails in Oct, 471 in Aug and 211 in Dec; Sept (64) and Nov (83) had far fewer. The conversion rate is highest in Oct (9.3%), followed by Dec (8.5%).

Compute 95% CI:

In [49]:
n = [n1,n2,n3,n4,n5]
obs = [obs_v1,obs_v2,obs_v3,obs_v4,obs_v5]
month = ['Aug','Sept','Oct','Nov','Dec']
m = []
sd = []  # p*(1-p)/n per month: the squared standard error, reused by the t-tests
g_lower = []
g_upper = []

for i in range(5):
    m.append(obs[i]/n[i])
    sd.append(m[i]*(1-m[i])/n[i])
    # 95% CI: mean +/- 1.96 * standard error
    g_lower.append(m[i]-1.96*sqrt(sd[i]))
    g_upper.append(m[i]+1.96*sqrt(sd[i]))
    print("For students who received emails in {0}, 95% CI of conversion rate: [{1:.2%},{2:.2%}]".format(month[i],g_lower[i],g_upper[i]))

For students who received emails in Aug, 95% CI of conversion rate: [0.53%,2.87%]
For students who received emails in Sept, 95% CI of conversion rate: [-1.48%,4.60%]
For students who received emails in Oct, 95% CI of conversion rate: [6.84%,11.82%]
For students who received emails in Nov, 95% CI of conversion rate: [-0.40%,7.63%]
For students who received emails in Dec, 95% CI of conversion rate: [4.76%,12.30%]


T-test:

In [67]:
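# build independent rows; [[0]*5]*5 would repeat one shared row five times
t = [[0]*5 for _ in range(5)]
t_score = [[0]*5 for _ in range(5)]
for i in range(5):
    for j in range(i+1, 5):
        # pooled standard deviation; n*sd is the Bernoulli variance p*(1-p)
        s_t = np.sqrt(((n[i]-1)*n[i]*sd[i]+(n[j]-1)*n[j]*sd[j])/(n[i]+n[j]-2))
        t[i][j] = (m[j]-m[i])/(s_t*np.sqrt(1/n[i]+1/n[j]))
        # one-sided 95% critical value
        t_score[i][j] = stats.t.ppf(.95,n[i]+n[j]-2)
        print("t-test for {0} and {1}: t stats is {2:.4}; 95% t score is {3:.4}".format(month[i],month[j],t[i][j],t_score[i][j]))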

t-test for Aug and Sept: t stats is -0.07938; 95% t score is 1.648
t-test for Aug and Oct: t stats is 5.25; 95% t score is 1.646
t-test for Aug and Nov: t stats is 1.156; 95% t score is 1.648
t-test for Aug and Dec: t stats is 4.369; 95% t score is 1.647
t-test for Sept and Oct: t stats is 2.112; 95% t score is 1.647
t-test for Sept and Nov: t stats is 0.7594; 95% t score is 1.655
t-test for Sept and Dec: t stats is 1.937; 95% t score is 1.65
t-test for Oct and Nov: t stats is -1.735; 95% t score is 1.647
t-test for Oct and Dec: t stats is -0.3423; 95% t score is 1.647
t-test for Nov and Dec: t stats is 1.478; 95% t score is 1.65


### Result:
* Recall the conversion rates (emails sent in parentheses):
    * Aug (471): 1.7%
    * Sept (64): 1.6%
    * Oct (525): 9.3%
    * Nov (83): 3.6%
    * Dec (211): 8.5%
* Pairs whose conversion rates differ significantly (|t| above the 95% t score):
    * Aug and Oct
    * Aug and Dec
    * Sept and Oct
    * Sept and Dec
    * Oct and Nov
* Pairs with no significant difference:
    * Aug and Sept
    * Aug and Nov
    * Sept and Nov
    * Oct and Dec
    * Nov and Dec
* Aug had a large number of emails sent (471) but only a 1.7% conversion rate, which is low compared with Oct and Dec.
* Q4 (Oct to Dec) generally has a higher conversion rate than Q3 (Aug and Sept in this data).
* Within Q4, Oct and Dec have higher conversion rates than Nov.