**Outline**

1. Topic Review
2. Sample Size for Rental Properties
3. Sample Size for Marketing Campaign
4. Priming

In [1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm
import matplotlib.pyplot as plt

## Topic Review

#### Sample Size Estimate
- Sample size estimates are useful when designing an experiment, so that our inferences meet a chosen standard of precision
- There are two approaches to estimating the sample size:
  - Based on a desired standard error
  - Based on a desired power, confidence level, and effect size
- Larger samples lead to smaller standard errors
- Higher power and a higher confidence level require a larger sample
- A smaller effect size requires a larger sample
- Sample size calculation for a regression coefficient works the same way as for a mean, where the standard deviation (and hence the standardized effect size) is estimated from the residual standard deviation of a previous study
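The direction of these relationships can be checked numerically. The following is a minimal sketch, not part of the original notebook: the function name `required_n` and the input values are illustrative, and it uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` so that it has no third-party dependencies.

```python
from statistics import NormalDist


def required_n(sigma, delta, alpha=0.05, power=0.80):
    """Sample size for a single mean from power, significance level and effect size."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z-value for the confidence level
    z_beta = NormalDist().inv_cdf(power)           # z-value for the power (1 - beta)
    return (sigma * (z_alpha + z_beta) / delta) ** 2


sigma = 500  # illustrative standard deviation

# Smaller effect size -> larger required sample
for delta in (100, 50, 25):
    print(f"delta = {delta:>3}   n = {required_n(sigma, delta):9.1f}")

# Higher power -> larger required sample
for power in (0.80, 0.90, 0.95):
    print(f"power = {power:.2f}  n = {required_n(sigma, 50, power=power):9.1f}")
```

Halving the effect size multiplies the required sample size by four, because the effect size enters the formula squared.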

#### Sample Size for Estimating Single Mean

**1. Based on Standard Error**
$$
n=\left (\frac{\sigma}{\text{SE}(\bar{y})}  \right )^2
$$
  - $n$ is sample size estimate
  - $\sigma$ is standard deviation of population
  - $\text{SE}(\bar{y})$ is desired standard error

  
*Helper Function*
  


In [2]:
def sample_size_single_mean_se(sigma, standard_error):

    # Calculate sample size
    n = (sigma/standard_error)**2

    return n
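As a quick sanity check, the returned sample size can be plugged back into $\text{SE}(\bar{y}) = \sigma/\sqrt{n}$ to confirm that it reproduces the target standard error. This is a small sketch with illustrative input values; the helper is repeated so the block runs on its own.

```python
import math


def sample_size_single_mean_se(sigma, standard_error):
    # Same formula as the helper above: n = (sigma / SE)^2
    return (sigma / standard_error) ** 2


sigma = 500      # illustrative population standard deviation
target_se = 50   # illustrative desired standard error

n = sample_size_single_mean_se(sigma, target_se)
achieved_se = sigma / math.sqrt(n)  # standard error obtained with n observations

# Halving the desired standard error quadruples the required sample size
n_half = sample_size_single_mean_se(sigma, target_se / 2)

print(f"n = {n:.0f}, achieved SE = {achieved_se:.2f}, n for half the SE = {n_half:.0f}")
```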

**2. Based on Desired Power and Confidence Level**
- Sample size for estimating a single mean based on power, confidence level, and effect size
$$
n = \frac{\sigma^2 (z_{1-{\frac{\alpha}{2}}}+z_{1-\beta})^2}{\delta^2}
$$
  - $n$ is sample size estimate
  - $\sigma$ is standard deviation of population
  - $\delta$ is the difference from the true population mean that we want to detect (effect size)
  - $z_{1-\frac{\alpha}{2}}$ represents z-value under desired significance level ($\alpha$)
  - $z_{1-\beta}$ represents z-value under the desired power

  *Helper Function*

In [3]:
def sample_size_single_mean(alpha, beta, effect_size, sigma):

    # Calculate z-value
    z_alpha = norm.ppf(1 - alpha/2)
    z_beta = norm.ppf(1 - beta)

    # Calculate sample size
    n = ( (sigma**2) * (z_alpha + z_beta)**2 ) / effect_size**2

    return n
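A short derivation may help show where this formula comes from. It is a sketch that assumes a two-sided z-test with known $\sigma$ and ignores the negligible probability of rejecting in the wrong direction; $\Phi$ denotes the standard normal CDF.

$$
\begin{aligned}
\text{SE}(\bar{y}) &= \frac{\sigma}{\sqrt{n}} \\
1-\beta &\approx \Phi\left(\frac{\delta}{\text{SE}(\bar{y})} - z_{1-\frac{\alpha}{2}}\right) \\
\frac{\delta\sqrt{n}}{\sigma} &= z_{1-\frac{\alpha}{2}}+z_{1-\beta} \\
n &= \frac{\sigma^2 (z_{1-\frac{\alpha}{2}}+z_{1-\beta})^2}{\delta^2}
\end{aligned}
$$

With 95% confidence and 80% power, $z_{1-\frac{\alpha}{2}}+z_{1-\beta} \approx 1.96 + 0.84 = 2.8$, so the effect size must be about 2.8 standard errors away from zero.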

#### Sample Size for Estimating Difference between Two Means

**1. Based on Standard Error**

$$
n=\left (\frac{2\sigma}{\text{SE}(\bar{y}_1-\bar{y}_2)}  \right )^2
$$
  - $n$ is sample size estimate
  - $\sigma$ is standard deviation of population
  - $\text{SE}(\bar{y}_1-\bar{y}_2)$ is desired standard error of difference of two means

  *Helper Function*

In [4]:
def sample_size_diff_mean_se(sigma, standard_error):

    # Calculate sample size
    n = ( (2 * sigma) / standard_error )**2

    return n
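The factor of 2 can be traced with a short derivation. It is a sketch that assumes the total sample of size $n$ is split equally into two groups of $n/2$ observations, each with the same standard deviation $\sigma$:

$$
\begin{aligned}
\text{SE}(\bar{y}_1-\bar{y}_2) &= \sqrt{\frac{\sigma^2}{n/2}+\frac{\sigma^2}{n/2}} = \frac{2\sigma}{\sqrt{n}} \\
n &= \left (\frac{2\sigma}{\text{SE}(\bar{y}_1-\bar{y}_2)}  \right )^2
\end{aligned}
$$

Under these assumptions $n$ is the total sample size, so each group needs $n/2$ observations.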

**2. Based on Desired Power and Confidence Level**
$$
n = \frac{(2\sigma(z_{1-{\frac{\alpha}{2}}}+z_{1-\beta}))^2}{\delta^2}
$$
  - $n$ is the total sample size estimate across the two groups (each group gets $n/2$)
  - $\sigma$ is standard deviation of population
  - $\delta$ is the difference between the two means that we want to detect
  - $z_{1-\frac{\alpha}{2}}$ represents z-value under desired significance level ($\alpha$)
  - $z_{1-\beta}$ represents z-value under the desired power

In [5]:
def sample_size_diff_mean(alpha, beta, effect_size, sigma):

    # Calculate z-value
    z_alpha = norm.ppf(1 - alpha/2)
    z_beta = norm.ppf(1 - beta)

    # Calculate sample size
    n =  (2 * sigma * (z_alpha+z_beta))**2 / (effect_size**2)

    return n
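One way to check this helper is to invert it: given a total sample size, compute the approximate power and confirm that it returns the requested level. The sketch below is not part of the original notebook. The function `power_diff_mean` and the input values are illustrative, the groups are assumed to be of equal size, and `statistics.NormalDist` stands in for `scipy.stats.norm` so the block has no third-party dependencies.

```python
from statistics import NormalDist


def sample_size_diff_mean(alpha, beta, effect_size, sigma):
    # Same formula as the helper above, using the standard library normal distribution
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return (2 * sigma * (z_alpha + z_beta)) ** 2 / effect_size ** 2


def power_diff_mean(n_total, effect_size, sigma, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two equal-sized groups."""
    se = 2 * sigma / n_total ** 0.5                 # SE of the difference in means
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size / se - z_alpha)


# Illustrative inputs: detect a difference of 5 when sigma is 15
n = sample_size_diff_mean(alpha=0.05, beta=0.20, effect_size=5, sigma=15)
achieved_power = power_diff_mean(n, effect_size=5, sigma=15, alpha=0.05)

print(f"total n = {n:.2f}, achieved power = {achieved_power:.3f}")
```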

# 1. Sample Size for Rental Properties
---
- The firm in charge of rental properties is facing a decision regarding whether or not to expand into a costly area in downtown San Francisco.
- The amount of money the firm earns is dependent on the rents of the properties they manage.
- In order to make the expansion financially feasible, the firm must ensure that the average rent in this area is more than `$1500` per month.
- From a previous study, they know that the standard deviation of rent prices in San Francisco is `$500`

## 1.1
What is the estimated sample size if they want the standard error to be no larger than `$50`?

In [6]:
# Given

standard_error = 50
sigma = 500

In [7]:
# then calculate the number of sample size
n =  sample_size_single_mean_se(sigma, standard_error)

print(f"Number of sample size needed            : {n:.2f}")

Number of sample size needed            : 100.00


## 1.2
If they want to conduct statistical inference to support the decision, what is the estimated sample size needed for 80% power to find a statistically significant result, at a 95% confidence level, when the true population mean differs by `$50`?

Set up the known quantities

In [8]:
# Given
effect_size = 50 # absolute desired difference with the true population mean
conf = 0.95      # confidence level
power = 0.8      # power
sigma = 500      # standard deviation of population (assumed / approximated)

In [9]:
# calculate beta
beta = 1 - power

# calculate alpha
alpha = 1 - conf

Calculate the estimated sample size using the helper function. You can also compute it by hand with a calculator; the result should be approximately the same.

In [10]:
# then calculate the number of sample size
n =  sample_size_single_mean(alpha, beta, effect_size, sigma)

print(f"Number of sample size needed            : {n:.2f}")

Number of sample size needed            : 784.89


**Conclusion**

The estimated sample size to achieve a standard error of `$50` in our inferences is:
- We need 100 observations

The estimated sample size to achieve 80% power at a 95% confidence level in our inferences is:
- We need 785  $\approx$ 800 observations
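Because observations come in whole units, the raw results are rounded up rather than to the nearest integer. A minimal sketch of that step follows. Rounding further up to the next hundred is shown only as one possible planning convention, which is an assumption here rather than a statistical requirement.

```python
import math

# Raw sample sizes taken from the outputs above
n_se = 100.00      # based on the desired standard error
n_power = 784.89   # based on power and confidence level

# Round up so the target precision or power is still met
n_se_min = math.ceil(n_se)
n_power_min = math.ceil(n_power)

# Optional planning buffer: round up to the next multiple of 100 (assumed convention)
n_power_planned = math.ceil(n_power / 100) * 100

print(n_se_min, n_power_min, n_power_planned)
```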

# 2. Sample Size for Marketing Campaign
---

- Marketing companies want to run successful campaigns.
- They have an idea for a new campaign, let's say campaign B.
- Before implementing campaign B, they conduct an experiment to compare the existing campaign (campaign A) with the new campaign (campaign B).
- With the new campaign, the revenue per user is expected to increase by `$5`.
- One month ago, the standard deviation of revenue per user was `$15`.

## 2.1
What is the estimated sample size needed to infer the difference in revenue per user with a standard error of `$3`?


In [11]:
# Given
standard_error = 3  # desired standard error
sigma = 15          # standard deviation of population (assumed / approximated)

In [12]:
# then calculate the number of sample size
n =  sample_size_diff_mean_se(sigma, standard_error)

print(f"Number of total sample size needed            : {n:.2f}")
print(f"Number of sample size of each group needed    : {n/2:.2f}")

Number of total sample size needed            : 100.00
Number of sample size of each group needed    : 50.00


## 2.2
With a confidence level of 95% and a power of 80%, what is the minimum sample size needed?

Set up the known quantities

In [13]:
# Given
effect_size = 5    # absolute desired difference of two groups means
conf = 0.95        # confidence level
power = 0.8        # power
sigma = 15         # standard deviation of population (assumed / approximated)

Calculate the alpha and beta from confidence level and power level

In [14]:
# calculate beta
beta = 1 - power

# calculate alpha
alpha = 1 - conf

Calculate the estimated sample size using the helper function. You can also compute it by hand with a calculator; the result should be the same.

In [15]:
# then calculate the number of total sample size
n =  sample_size_diff_mean(alpha, beta, effect_size, sigma)

print(f"Number of total sample size needed            : {n:.2f}")
print(f"Number of sample size of each group needed    : {n/2:.2f}")

Number of total sample size needed            : 282.56
Number of sample size of each group needed    : 141.28


**Conclusion**

The estimated sample size to achieve a standard error of `$3` in our inferences is:
- We need 100 observations in total across the two groups.
- We need 50 observations in each group

Assuming each group has an equal sample size, the estimated sample size to achieve 80% power at a 95% confidence level in our inferences is:
- We need 284 $\approx$ 300 observations in total across the two groups.
- We need 142 $\approx$ 150 observations in each group

**Additional Question**

1. What if they reduce the confidence level to 90% but keep the power at 80%? What is the estimated sample size?



In [16]:
# then calculate the number of total sample size
n =  sample_size_diff_mean(alpha = 0.1, beta = 0.2, effect_size = 5, sigma = 15)

print(f"Number of total sample size needed            : {n:.2f}")
print(f"Number of sample size of each group needed    : {n/2:.2f}")


Number of total sample size needed            : 222.57
Number of sample size of each group needed    : 111.29


2. What if they increase the power to 95% but keep the confidence level at 95%? What is the estimated sample size?

In [17]:
# then calculate the number of total sample size
n =  sample_size_diff_mean(alpha = 0.05, beta = 0.05, effect_size = 5, sigma = 15)

print(f"Number of total sample size needed            : {n:.2f}")
print(f"Number of sample size of each group needed    : {n/2:.2f}")

Number of total sample size needed            : 467.81
Number of sample size of each group needed    : 233.90
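To compare scenarios side by side, the same calculation can be looped over several confidence and power levels. This is a sketch rather than part of the original notebook. It converts each confidence level to `alpha = 1 - conf` and each power to `beta = 1 - power` before calling the helper, and uses `statistics.NormalDist` in place of `scipy.stats.norm` so the block has no third-party dependencies.

```python
from statistics import NormalDist


def sample_size_diff_mean(alpha, beta, effect_size, sigma):
    # Same formula as the helper defined earlier
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return (2 * sigma * (z_alpha + z_beta)) ** 2 / effect_size ** 2


effect_size = 5   # expected increase in revenue per user
sigma = 15        # standard deviation of revenue per user

results = {}
for conf in (0.90, 0.95):
    for power in (0.80, 0.95):
        # alpha and beta are the complements of confidence level and power
        n = sample_size_diff_mean(1 - conf, 1 - power, effect_size, sigma)
        results[(conf, power)] = n
        print(f"confidence = {conf:.2f}  power = {power:.2f}  "
              f"total n = {n:8.2f}  per group = {n / 2:8.2f}")
```

Passing the confidence level and power through one conversion step keeps `alpha` and `beta` from being mistyped when they are entered by hand.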


# 3. Priming
---

- To generate interest in a new product before its launch, priming strategies can lead to increased demand when the product becomes available for sale.
- FedEx wanted to introduce new services and raise *awareness* through memorable television advertisements.
- FedEx also sent sales representatives to existing clients. This application presents data on how 125 customers responded to this promotional campaign, as there are questions about its costs and benefits.
- Although the promotional visits may have increased the use of Courier Paks, their high cost made this an expensive campaign.
- You are asked to do a regression analysis to answer this question:
> Was the promotion more effective at generating mailings among customers who were already aware of FedEx’s business?

### Load Data


In [19]:
priming = pd.read_csv("sample_priming.csv")
priming.head()

Unnamed: 0,Mailings,Hours,Aware,Awareness
0,74,3.46,1,YES
1,3,0.51,1,YES
2,9,1.68,0,NO
3,0,0.31,0,NO
4,65,3.8,0,NO


For each customer, the data give the number of times the customer used Courier Paks in the following month (`Mailings`) and the amount of time, in hours, that a sales representative spent with the customer (`Hours`). The indicator variable `Aware` records whether or not the customer had been aware of Courier Paks prior to the visit.

## 3.1

Estimate the regression line of `Mailings` on `Hours` and `Aware`

---

In [20]:
def print_coef_std_err(results):
    """
    Function to combine estimated coefficients and standard error in one DataFrame
    :param results: <statsmodels RegressionResultsWrapper> OLS regression results from statsmodel
    :return df: <pandas DataFrame> combined estimated coefficient and standard error of model estimate
    """
    coef = results.params
    std_err = results.bse

    df = pd.DataFrame(data = np.transpose([coef, std_err]),
                      index = coef.index,
                      columns=["coef","std err"])
    return df

For simplicity, we do not include interaction in the model

In [21]:
# Create OLS model object
model = smf.ols("Mailings ~ Hours + Aware", priming)

# Fit the model
results = model.fit()

# Extract the results (Coefficient and Standard Error) to DataFrame
results_priming_inter = print_coef_std_err(results)
results_priming_inter

Unnamed: 0,coef,std err
Intercept,-4.072176,3.649764
Hours,18.164421,1.490154
Aware,4.660434,3.873062


#### Interpretation of Aware Coefficient

- 4.7 is the expected difference in mailings between customers who were and were not already aware of Courier Paks, for the same number of hours of visits

#### Insight
- On average, customers with priming used Courier Paks approximately 5 more times than customers without priming, for the same number of hours of contact with a sales representative.
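To make this interpretation concrete, the fitted line can be evaluated for an aware and an unaware customer at the same number of hours. The sketch below hard-codes the coefficients from the output above, and the `predicted_mailings` function and the chosen hours are illustrative. Because the model has no interaction term, the gap is the same at every value of `Hours`.

```python
# Coefficients copied from the regression output above
INTERCEPT = -4.072176
B_HOURS = 18.164421
B_AWARE = 4.660434


def predicted_mailings(hours, aware):
    """Predicted mailings from the fitted line Mailings ~ Hours + Aware."""
    return INTERCEPT + B_HOURS * hours + B_AWARE * aware


for hours in (1.0, 2.0, 3.0):  # illustrative visit lengths
    aware = predicted_mailings(hours, aware=1)
    unaware = predicted_mailings(hours, aware=0)
    print(f"hours = {hours:.1f}  aware = {aware:6.2f}  "
          f"not aware = {unaware:6.2f}  difference = {aware - unaware:.2f}")
```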

---
## 3.2

Next, if they want to conduct a priming analysis for another similar product, what sample size is needed to reduce the current standard error?

### 3.2.1

Estimate the sample size needed to obtain a smaller standard error for the `Aware` coefficient, say 1

---



In [22]:
# Given
standard_error = 1
sigma = np.sqrt(results.mse_resid) # standard deviation of population (estimated from residual standard deviation of regression)

In [23]:
# then calculate the number of total sample size
n =  sample_size_single_mean_se(sigma, standard_error)

print(f"Number of total sample size needed    : {n:.2f}")

Number of total sample size needed    : 124.71


### 3.2.2

Estimate the sample size needed to have 80% power at a 95% confidence level to detect a smaller effect size, say 3.

---


Set up the known quantities

In [24]:
# Given
effect_size = 3                    # absolute desired effect_size
conf = 0.95                        # confidence level
power = 0.8                        # power
sigma = np.sqrt(results.mse_resid) # standard deviation of population (estimated from residual standard deviation of regression)

Calculate the alpha and beta from confidence level and power level

In [25]:
# calculate beta
beta = 1 - power

# calculate alpha
alpha = 1 - conf

Calculate the estimated sample size using the helper function. You can also compute it by hand with a calculator; the result should be approximately the same.

In [26]:
# then calculate the number of sample size
n =  sample_size_single_mean(alpha, beta, effect_size, sigma)

print(f"Number of sample size needed            : {n:.2f}")

Number of sample size needed            : 108.76


**Conclusion**

- The estimated sample size needed to reduce the standard error of the `Aware` coefficient to `1` is 125

- The estimated sample size needed to detect a smaller effect size of `3` with 80% power at a 95% confidence level is approximately 110