**Rationale** Hypothesis testing is foundational to the field of statistics. Without the Central Limit Theorem and hypothesis testing, we would not have modern-day (social) science. In marketing, hypothesis testing matters most when A/B testing ads. In this assignment, you will practice computing basic summary statistics and conducting hypothesis tests.

[Datasets](https://drive.google.com/drive/folders/1D6D-zRU3oiP12c0dFMQjs66JDhBZBtax?usp=sharing) required:
1. [FB ad campaign data](https://drive.google.com/file/d/1JfwvumxS2oys8ZcQtsb8GGmeRfnkzFdy/view?usp=sharing).
1. [Starbucks Promos](https://drive.google.com/file/d/1W7x5_PU5KszT8mqXqqEmRfXuR8CRr1DX/view?usp=sharing)

# 1. (2 points) FB Ad campaigns

1. Use a groupby operation to create a dataframe called `sumstats` consisting of the total impressions and clicks for each `xyz_campaign_id`.
1. Using `sumstats`, create a new column called `ctr` that represents the [click through rate](https://support.google.com/google-ads/answer/2615875?hl=en) for each campaign. Which campaign had the highest CTR?
1. Create a column called `sd` for the standard deviation of the click through rate.

Hint: Recall from the notes that the standard deviation of a binomial distribution with a single trial (a Bernoulli outcome) and $Pr(Success) = p$ is $SD(p) = \sqrt{p(1-p)}$. So if the probability of a binary outcome (such as clicking on an ad) is observed to be .3, the standard deviation is $\sqrt{.3(.7)}=\sqrt{.21} \approx .458$.
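
The hint's arithmetic can be checked directly. The following is a minimal sketch using only the standard library; `bernoulli_sd` is a hypothetical helper name introduced for illustration, not something from the notes.

```python
import math

def bernoulli_sd(p):
    """Standard deviation of a single binary outcome with success probability p."""
    return math.sqrt(p * (1 - p))

# the hint's example: sqrt(.3 * .7) = sqrt(.21)
sd_example = bernoulli_sd(0.3)
print(round(sd_example, 4))  # 0.4583
```

The same expression applies unchanged to a CTR, with `p` set to clicks divided by impressions.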

In [3]:
# imports and mount google drive
import os
import pandas
import numpy as np
from scipy import stats

##from google.colab import drive
##drive.mount('drive')

In [4]:
!pwd

/home/johndoe/A5


In [5]:
# set the path to the datasets for A5

fpath = '/home/johndoe/A5'
os.listdir(fpath)

['starbucks_promos.csv',
 '.ipynb_checkpoints',
 'A5_F2021_Stats_Review.ipynb',
 'facebook_ads.csv']

In [6]:
# read in the facebook_ads.csv file as the dataframe variable named fb_df
fb_df = pandas.read_csv('facebook_ads.csv')

In [7]:
# Take a look at the first 5 rows
# Observe that each row represents 1 ad from 1 campaign and a target demographic group (age, gender, interest)
fb_df.head(5)

Unnamed: 0,ad_id,xyz_campaign_id,fb_campaign_id,age,gender,interest,Impressions,Clicks,Spent,Total_Conversion,Approved_Conversion
0,708746,916,103916,30-34,M,15,7350,1,1.43,2,1
1,708749,916,103917,30-34,M,16,17861,2,1.82,2,0
2,708771,916,103920,30-34,M,20,693,0,0.0,1,0
3,708815,916,103928,30-34,M,28,4259,1,1.25,1,0
4,708818,916,103928,30-34,M,28,4133,1,1.29,1,1


In [8]:
# use a groupby to create new dataframe, sumstats that tabulates the total clicks and impressions for each campaign.
# Generic groupby syntax: sumstats = df.groupby('groupbyvariable')[['summary_variable1', 'summary_variable2']].summaryfunction().reset_index()
# use the correct dataframe name, variables (i.e. column names) and summaryfunction
sumstats_df = fb_df.groupby('xyz_campaign_id')[['Clicks', 'Impressions']].sum().reset_index()
sumstats_df
            

Unnamed: 0,xyz_campaign_id,Clicks,Impressions
0,916,113,482925
1,936,1984,8128187
2,1178,36068,204823716


In [9]:
# create a new column in sumstats called 'ctr' (short for click through rate)
# recall that ctr = clicks / impressions
sumstats_df['ctr']  = (sumstats_df['Clicks'])/(sumstats_df['Impressions'])




In [10]:
# take a look at the whole sumstats dataframe
sumstats_df


Unnamed: 0,xyz_campaign_id,Clicks,Impressions,ctr
0,916,113,482925,0.000234
1,936,1984,8128187,0.000244
2,1178,36068,204823716,0.000176
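
Summing clicks and impressions before dividing is not the same as averaging each ad's own CTR. The sketch below illustrates the difference using only the five campaign-916 rows shown in the `head(5)` preview, so its numbers do not match the full-campaign totals in the table. The variable names are illustrative assumptions.

```python
# (impressions, clicks) for the five campaign-916 ads shown in the head(5) preview
ads = [(7350, 1), (17861, 2), (693, 0), (4259, 1), (4133, 1)]

total_impressions = sum(imp for imp, _ in ads)
total_clicks = sum(clk for _, clk in ads)

# pooled CTR: total clicks / total impressions
# (this is what summing in the groupby and then dividing computes)
pooled_ctr = total_clicks / total_impressions

# unweighted mean of per-ad CTRs: every ad counts equally,
# regardless of how many impressions it received
mean_of_ad_ctrs = sum(clk / imp for imp, clk in ads) / len(ads)

print(total_clicks, total_impressions)
print(pooled_ctr, mean_of_ad_ctrs)
```

The pooled figure treats every impression as one trial, which is the quantity the binomial standard deviation in the hint refers to.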


In [11]:
# create a new column ctr_sd for the std. dev of the ctr (See hint above)
# pay attention to PEMDAS, esp. that you want to take the sqrt after computing p*(1-p)
# np.sqrt(x) returns the sqrt of some number x
# alternatively, x**.5 also returns the sqrt of x
# SD(p) = sqrt(p*(1-p)), where p = sumstats_df['ctr']

sumstats_df['ctr_sd'] = np.sqrt(sumstats_df['ctr']*(1 - sumstats_df['ctr']))
sumstats_df

Unnamed: 0,xyz_campaign_id,Clicks,Impressions,ctr,ctr_sd
0,916,113,482925,0.000234,0.015295
1,936,1984,8128187,0.000244,0.015621
2,1178,36068,204823716,0.000176,0.013269


In [12]:
# take another look at the whole sumstats dataframe
# compare the ctr_sd to calculating it by hand, make sure you did this correctly.
p = 0.000234
sd = (p*(1.0 - p))**(.5)
print('sd for line 0 equals: ', sd)
sumstats_df


sd for line 0 equals:  0.015295268680216115


Unnamed: 0,xyz_campaign_id,Clicks,Impressions,ctr,ctr_sd
0,916,113,482925,0.000234,0.015295
1,936,1984,8128187,0.000244,0.015621
2,1178,36068,204823716,0.000176,0.013269


## Answer (edit this cell)
The campaign with the highest CTR is: **936**



# 2. (4 points) Hypothesis testing and confidence intervals

1. Compute the 95% confidence interval for the CTR for each campaign. Compare the confidence intervals of campaign 916 and 936. What can you conclude about the relative performance of the 2 ads in the population (e.g. are they very different, similar, etc.)? How about 916 vs. 1178?
1. Was campaign 936 statistically different compared to campaign 916? How about 936 vs. 1178? Use the `ttest_2sample` function from the notes. Remember, you must define it in your Colab session in order to use it (execute the cell w/ the function).

Given these statistical tests what would you recommend in terms of allocation of the ad budget? 

In [13]:
# use stats.norm.ppf([.025, .975], mean, std error)
# where mean is sample mean and std error is std dev / sqrt(obs) (from Central Limit Theorem)
#
# first compute the std error of the ctr for every campaign
# mean = ctr
# std = ctr_sd
# obs = impressions
sumstats_df['std_error'] = sumstats_df['ctr_sd']/((sumstats_df['Impressions'])**(.5))
sumstats_df

Unnamed: 0,xyz_campaign_id,Clicks,Impressions,ctr,ctr_sd,std_error
0,916,113,482925,0.000234,0.015295,2.200943e-05
1,936,1984,8128187,0.000244,0.015621,5.479288e-06
2,1178,36068,204823716,0.000176,0.013269,9.271341e-07


In [16]:
# confidence interval for ad 916
conf_interval_916 = stats.norm.ppf([.025, .975], 0.000234, .00002200943)
span_916 = conf_interval_916[1] - conf_interval_916[0]
print('916 conf interval =', conf_interval_916)
print('span = ', span_916)

916 conf interval = [0.00019086 0.00027714]
span =  8.627538024051079e-05


In [17]:
# confidence interval for ad 936
conf_interval_936 = stats.norm.ppf([.025, .975], 0.000244, .000005479288)
span_936 = conf_interval_936[1] - conf_interval_936[0]
print('936 conf interval =', conf_interval_936)
print('span = ', span_936)

936 conf interval = [0.00023326 0.00025474]
span =  2.1478414281845008e-05


In [18]:
# confidence interval for ad 1178
conf_interval_1178 = stats.norm.ppf([.025, .975], 0.000176, .0000009271341)
span_1178 = conf_interval_1178[1] - conf_interval_1178[0]
print('1178 conf interval =', conf_interval_1178)
print('span = ', span_1178)

1178 conf interval = [0.00017418 0.00017782]
span =  3.6342988896778997e-06
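
The three intervals above can also be rebuilt from the raw click and impression counts, which avoids retyping rounded values. This is a sketch only: it swaps the standard library's `statistics.NormalDist().inv_cdf` in for `stats.norm.ppf`, and `overlaps` is a hypothetical helper added for the comparison.

```python
import math
from statistics import NormalDist

# (clicks, impressions) per campaign, copied from the sumstats table above
campaigns = {916: (113, 482925), 936: (1984, 8128187), 1178: (36068, 204823716)}

# critical value for a 95% two-sided interval (about 1.96)
z = NormalDist().inv_cdf(0.975)

intervals = {}
for cid, (clicks, impressions) in campaigns.items():
    p = clicks / impressions
    se = math.sqrt(p * (1 - p) / impressions)  # std dev / sqrt(obs)
    intervals[cid] = (p - z * se, p + z * se)
    print(cid, intervals[cid])

def overlaps(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

print('916 vs 936 overlap:', overlaps(intervals[916], intervals[936]))
print('916 vs 1178 overlap:', overlaps(intervals[916], intervals[1178]))
```

Because the sketch starts from unrounded CTRs, its endpoints can differ from the printed intervals in the last displayed digit.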


**Edit this cell**

1. (write 1-2 sentences to compare the confidence interval of 916 vs 936's CTR):
**The 95% confidence interval for campaign 916's CTR is about 4 times as wide as the interval for 936. A confidence interval is the range of values within which we expect the true mean to lie (here at the 95% level), so our estimate for 916 is less precise and our estimate for 936 is more precise. For campaign 916, the interval runs from .0001909 to .0002771; for campaign 936, from .0002333 to .0002547. Because the two intervals overlap (936's lies entirely inside 916's), we cannot conclude that the two campaigns perform differently in the population.**
1. (write 1-2 sentences to compare the confidence interval of 916 vs 1178's CTR) 
**The 95% confidence interval for campaign 1178 is also narrower than the interval for 916: the true mean click through rate for 1178 lies within a tight range (.000174 to .000178), whereas for 916 it lies within a broader range (.0001909 to .0002771). The two intervals do not overlap, and 1178's lies entirely below 916's, which suggests that 1178's population CTR is lower than 916's.**

In [20]:
# Use this custom function

def ttest_2sample(m1,sd1,N1,m2,sd2,N2, twotail = True, equalvar = False):
    """
    This function requires you to supply:
    m1: mean of sample 1
    sd1: std. dev of sample 1
    N1: number of obs of sample 1
    m2: mean of sample 2
    sd2: std dev of sample 2
    N2: number of obs of sample 2

    Optional inputs:
    twotail = True (default) / False. If False, then 1 tail
    equalvar = True / False (default). If True, assumes equal population variance.
    """

    # The difference between equal and unequal variance is only in how to compute
    # the test statistic and degree of freedom.
    if equalvar:
        spsquare = ((N1-1)*sd1**2+(N2-1)*sd2**2)/(N1+N2-2) 
        T = (m1-m2)/np.sqrt(spsquare*(1/N1+1/N2))
        nu = N1+N2-2
    else:
        nu = (sd1**2/N1+sd2**2/N2)**2/((sd1**2/N1)**2/(N1-1)+(sd2**2/N2)**2/(N2-1)) # new degree of freedom
        T = (m1-m2)/(np.sqrt(sd1**2/N1+sd2**2/N2))
    
    # If the first mean is bigger, we need to do 1- cdf
    # Otherwise just compute cdf
    if m1>m2:
        pval = 1-stats.t.cdf(T, df = nu)
    else:
        pval = stats.t.cdf(T, df = nu)

    # return p values
    # If 2 tail, we must multiply by 2
    # otherwise we just return the computed pval
    if twotail == True:
        return pval*2
    else:
        return pval

In [29]:
# run a 2 tail 2 sample t-test to compare the population CTRs for 936 and 916
## ttest_2sample(m1,sd1,N1,m2,sd2,N2, twotail = True, equalvar = False)

# from the dataframe

m936 = 0.000244
sd936 = 0.015621
imp936 = 8128187

m916 = 0.000234
sd916 = 0.015295
imp916 = 482925

m1178 = 0.000176
sd1178 = 0.013269
imp1178 = 204823716

test_result = ttest_2sample(m936, sd936, imp936, m916, sd916, imp916, twotail = True, equalvar = False)
print(test_result)

0.659290373515292


In [30]:
# run a 2 tail 2 sample t-test to compare the population CTRs for 936 and 1178
test_result = ttest_2sample(m936, sd936, imp936, m1178, sd1178, imp1178, twotail = True, equalvar = False)
print(test_result)

0.0


**Edit this cell to answer**

(Write a few sentences to describe how to proceed with the allocation of the ad budget going forward. Be sure to use the 2 tail t-tests above to support your strategy)  
**The p-value comparing campaigns 936 and 1178 (printed as 0.0, far below .05) indicates that the difference between their mean click through rates is statistically significant, and that campaign 936's CTR is reliably higher than 1178's. The p-value comparing campaigns 936 and 916 (about .66) is not statistically significant, so we can't rely on the difference between those two means. As a result, when allocating advertising spend, we should favor campaign 936 or 916 over 1178. One other important consideration is the cost per click for 936 versus 916: if neither is clearly better on CTR, it may be better to go with the cheaper one.**
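
As a cross-check on the two t-tests, a two-proportion z-test can be run from the raw counts. This is a different technique from the `ttest_2sample` function in the notes: with this many impressions the t distribution is practically normal, so the results should be close but not identical. The sketch assumes unpooled standard errors (matching the `equalvar = False` branch), and `two_proportion_z_pvalue` is a hypothetical helper name.

```python
import math

def two_proportion_z_pvalue(clicks1, n1, clicks2, n2):
    """Return (z statistic, two-sided p-value) for H0: both CTRs are equal."""
    p1, p2 = clicks1 / n1, clicks2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_stat = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) but stays accurate
    # far into the tail, where 1 - cdf would round to exactly 0.0
    return z_stat, math.erfc(abs(z_stat) / math.sqrt(2))

# counts copied from the sumstats table: 936 vs 916, then 936 vs 1178
z_916, p_916 = two_proportion_z_pvalue(1984, 8128187, 113, 482925)
z_1178, p_1178 = two_proportion_z_pvalue(1984, 8128187, 36068, 204823716)
print('936 vs 916: ', z_916, p_916)
print('936 vs 1178:', z_1178, p_1178)
```

The 0.0 printed for 936 vs. 1178 most likely reflects floating-point rounding in `1 - cdf` rather than a p-value that is literally zero; the `erfc` form reports a tiny but positive value instead.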

# 3. (4 points)

Using the Starbucks dataset, conduct an appropriate statistical test to determine:
1. Is there any statistically significant difference in the redemption rates of BOGO vs. Discount promotions?
1. Among those who redeemed an offer and reported their income, is there any statistical difference in the average income associated with BOGO vs. discount redemptions? Note that not every observation reports income (select only the observations that do).

In [31]:
# read in the starbucks_promos.csv file as the dataframe variable named sb_df, use index_col = 0 option
sb_df = pandas.read_csv('starbucks_promos.csv', index_col = 0)

In [32]:
# preview first 5 rows to get a sense of the contents of the dataframe
# note that each observation represents an offer received by a customer
sb_df.head(5)

Unnamed: 0,uid,event,time,gender,age,register_date,income,offer_id,offer_reward,channels,difficulty,duration,offer_type,offer_time,transaction_amount,redeem_time,redeemed
1,0020c2b971eb4e9188eac86d93036a77,offer received,0,F,59,20160304,90000.0,fafdcd668e3743c1bb461111dcafc2a4,2.0,"['web', 'email', 'mobile', 'social']",10.0,240.0,discount,0.0,17.63,54.0,1
4,005500a7188546ff8a767329a2f7c76a,offer received,0,M,56,20171209,47000.0,ae264e3637204a6fb9bb56bc8210ddfd,10.0,"['email', 'mobile', 'social']",10.0,168.0,bogo,0.0,,,0
5,0056df74b63b4298809f0b375a304cf4,offer received,0,M,54,20160821,91000.0,9b98b8c7a33c4b65b9aebfe6a799e6d9,5.0,"['web', 'email', 'mobile']",5.0,168.0,bogo,0.0,27.86,132.0,1
6,00715b6e55c3431cb56ff7307eb19675,offer received,0,F,58,20171207,119000.0,ae264e3637204a6fb9bb56bc8210ddfd,10.0,"['email', 'mobile', 'social']",10.0,168.0,bogo,0.0,27.26,12.0,1
8,00840a2ca5d2408e982d56544dc14ffd,offer received,0,M,26,20141221,61000.0,2906b810c7d4411798c6938adc9daaa5,2.0,"['web', 'email', 'mobile']",10.0,168.0,discount,0.0,6.05,540.0,1


In [85]:
import numpy
# select the redeemed column for discount offers, name this data the variable disc
new_df = sb_df.loc[sb_df['offer_type'] == 'discount']
pre_disc = new_df['redeemed'].tolist()
disc = numpy.array(pre_disc)
disc_mean = disc.mean()
print(disc)
print(disc_mean)

[1 1 1 ... 1 0 1]
0.6066856562878564


In [86]:
# select the redeemed column for buy one get one (bogo) offers, name this data the variable bogo
new_df = sb_df.loc[sb_df['offer_type'] == 'bogo']
pre_bogo = new_df['redeemed'].tolist()
bogo = numpy.array(pre_bogo)
bogo_mean = bogo.mean()
print(bogo)
print(bogo_mean)

[0 1 1 ... 1 0 1]
0.5375586084789665


In [61]:
# use stats.ttest_ind to test whether the 2 types of offers yield different redemption rates
stats.ttest_ind(disc, bogo)


Ttest_indResult(statistic=17.30160421286854, pvalue=6.611390792545181e-67)

In [77]:
# Select the income column for those who redeemed the BOGO offer, call this the variable income_bogo
# note you need to use 3 conditions for the row selection: one for redeemed, one for bogo offer, and one for income is not missing
# FYI: dataframe.somecolumn.notnull() is the Boolean evaluation for whether the somecolumn is not missing data
new_df = sb_df.loc[(sb_df['offer_type'] == 'bogo') & (sb_df['redeemed'] >= 1) & (sb_df['income'].notnull() == True)]
pre_income = new_df['income']
income_bogo = numpy.array(pre_income)
income_bogo

array([ 91000., 119000.,  81000., ...,  93000.,  73000.,  34000.])

In [78]:
# Select the income column for those who redeemed the discount offer, call this the variable income_discount
# note you need to use 3 conditions for the row selection: one for redeemed, one for discount offer, and one for income is not missing
# FYI: dataframe.somecolumn.notnull() is the Boolean evaluation for whether the somecolumn is not missing data
new_df = sb_df.loc[(sb_df['offer_type'] == 'discount') & (sb_df['redeemed'] >= 1) & (sb_df['income'].notnull() == True)]
pre_income = new_df['income']
income_discount = numpy.array(pre_income)
income_discount


array([90000., 61000., 66000., ..., 32000., 47000., 62000.])

In [79]:
# What is the average income for bogo redeemers (income_bogo.mean())?
bogo_mean = income_bogo.mean()
bogo_mean


70114.8559593297

In [80]:
# What is the average income for discount redeemers?
discount_mean = income_discount.mean()
discount_mean


68617.39130434782

In [83]:
# use stats.ttest_ind to test whether the average income of the redeemers of the 2 offer types are different

stats.ttest_ind(income_bogo, income_discount)

Ttest_indResult(statistic=6.343377852366808, pvalue=2.2763463492984577e-10)

**Edit this cell**
1. (Write a sentence to describe the relative redemption rates of the 2 offer types based on the result of the t-test).
**Because the p-value is less than .05, there is a statistically significant difference between the mean redemption rates of bogo and discount offers, with the discount redemption rate (about .607) greater than the bogo redemption rate (about .538).**
1. (Write a sentence to describe the relative incomes of the redeemers of the 2 offer types based on the result of the t-test).
**Because the p-value is less than .05, there is a statistically significant difference between the mean incomes, and the average income of bogo redeemers (about 70,115) is greater than that of discount redeemers (about 68,617).**
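
One detail worth noting: `stats.ttest_ind` assumes equal population variances by default (`equal_var=True`), whereas `ttest_2sample` above defaults to `equalvar = False`. The sketch below computes both test statistics by hand with the same formulas as `ttest_2sample`. It uses only the six income values visible in each truncated array printout above, so it illustrates the mechanics and says nothing about the full dataset. `pooled_and_welch` is a hypothetical helper name.

```python
import math
import statistics

def pooled_and_welch(a, b):
    """Return (pooled t, Welch t, Welch degrees of freedom) for two samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.fmean(a), statistics.fmean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances

    # equal-variance (pooled) statistic, df = n1 + n2 - 2
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    pooled_t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

    # unequal-variance (Welch) statistic and its adjusted degrees of freedom
    se2 = v1 / n1 + v2 / n2
    welch_t = (m1 - m2) / math.sqrt(se2)
    welch_df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return pooled_t, welch_t, welch_df

# the six values visible in each truncated printout (illustrative only)
income_bogo_preview = [91000, 119000, 81000, 93000, 73000, 34000]
income_discount_preview = [90000, 61000, 66000, 32000, 47000, 62000]

pooled_t, welch_t, welch_df = pooled_and_welch(income_bogo_preview, income_discount_preview)
print(pooled_t, welch_t, welch_df)
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ. In the real data the groups are unequal in size, so passing `equal_var=False` to `stats.ttest_ind` would be the closer match to `ttest_2sample`'s default.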