# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis, along with your data exploration and processing skills. The data you'll investigate were collected from the homepage of Audacity, a music app.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
    * Start investigating the id column:
        * How many viewers also clicked?
        * Are there any anomalies with the data; did anyone click who didn't view?
        * Is there any overlap between the control and experiment groups? 
            * If so, how do you plan to account for this in your experimental design?

In [1]:
#Your code here
import pandas as pd
data = pd.read_csv('homepage_actions.csv')
# Check unique IDs
unique_ids = data['id'].nunique()
print(f"Number of unique viewers: {unique_ids}")


Number of unique viewers: 6328


In [2]:
# Get IDs of users who viewed
viewed_ids = data[data['action'] == 'view']['id'].unique()

# Get IDs of users who clicked
clicked_ids = data[data['action'] == 'click']['id'].unique()

# Find how many users who viewed also clicked
viewers_who_clicked = len(set(viewed_ids).intersection(set(clicked_ids)))
print(f"Number of viewers who also clicked: {viewers_who_clicked}")


Number of viewers who also clicked: 1860


In [3]:
# Find IDs of users who clicked but did not view
clicked_without_viewing = set(clicked_ids) - set(viewed_ids)
print(f"Number of users who clicked but didn't view: {len(clicked_without_viewing)}")


Number of users who clicked but didn't view: 0


In [4]:
# Get IDs for control and experiment groups
control_ids = data[data['group'] == 'control']['id'].unique()
experiment_ids = data[data['group'] == 'experiment']['id'].unique()

# Check for overlap
overlap_ids = set(control_ids).intersection(set(experiment_ids))
print(f"Number of overlapping IDs between control and experiment groups: {len(overlap_ids)}")


Number of overlapping IDs between control and experiment groups: 0


## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.
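One common way to frame this comparison is a two-sample test of proportions. The notation below is introduced here for illustration and is not part of the lab: $x_c, x_e$ are the click counts and $n_c, n_e$ the view counts in the control and experiment groups. A pooled z statistic of this form is what a two-sample proportions z-test typically computes under the null hypothesis of equal click-through rates:

$$H_0: p_c = p_e \qquad H_1: p_c \neq p_e$$

$$\hat{p}_c = \frac{x_c}{n_c}, \qquad \hat{p}_e = \frac{x_e}{n_e}, \qquad \hat{p} = \frac{x_c + x_e}{n_c + n_e}$$

$$z = \frac{\hat{p}_c - \hat{p}_e}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_c} + \dfrac{1}{n_e}\right)}}$$

If the question is strictly whether the experimental page is *more* effective, the alternative can instead be stated one-sided as $H_1: p_e > p_c$.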

In [5]:
#Your code here
from statsmodels.stats.proportion import proportions_ztest

# Split the actions loaded in cell 1 (`data`) into control and experiment groups
control_group = data[data['group'] == 'control']
experiment_group = data[data['group'] == 'experiment']

# Get the number of clicks and views for both groups
control_clicks = control_group[control_group['action'] == 'click'].shape[0]
control_views = control_group[control_group['action'] == 'view'].shape[0]

experiment_clicks = experiment_group[experiment_group['action'] == 'click'].shape[0]
experiment_views = experiment_group[experiment_group['action'] == 'view'].shape[0]

# Perform the z-test (two-sided by default)
counts = [control_clicks, experiment_clicks]  # Number of clicks in each group
nobs = [control_views, experiment_views]  # Number of views in each group

z_stat, p_value = proportions_ztest(counts, nobs)
print(f"Z-statistic: {z_stat:.4f}, P-value: {p_value:.4f}")

# Interpretation of the two-sided test at the 5% significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the click-through rates of the two homepages differ.")
else:
    print("Fail to reject the null hypothesis: no significant difference between control and experimental homepage.")

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.
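A minimal sketch of that per-individual formulation is shown below. It uses a small hypothetical table (`toy`) that only assumes the same `id`, `group`, and `action` columns as the lab data; the values are made up for illustration. Because a visitor who clicked has both a `view` row and a `click` row, the rows are first collapsed to one per visitor.

```python
import pandas as pd

# Hypothetical toy log shaped like the lab data: one row per action,
# so a visitor who clicked appears twice (a 'view' row and a 'click' row).
toy = pd.DataFrame({
    "id":     [1, 1, 2, 3, 3, 4],
    "group":  ["control", "control", "control",
               "experiment", "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "click", "view"],
})

# Collapse to one row per visitor: clicked = 1 if any of their rows is a click.
per_user = (
    toy.assign(clicked=(toy["action"] == "click").astype(int))
       .groupby(["group", "id"], as_index=False)["clicked"]
       .max()
)

# Click-through rate per group, now measured as clicks per unique visitor.
ctr = per_user.groupby("group")["clicked"].mean()
print(per_user)
print(ctr)
```

With this layout, the mean of `clicked` within a group is that group's click-through rate, and the number of rows in a group is its number of visitors.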

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$
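As a quick sanity check on this formula, the sketch below computes the variance directly from the binomial probability mass function and compares it with $n \cdot p \cdot (1-p)$. The values of `n` and `p` are arbitrary and chosen only for illustration.

```python
from math import comb

# Arbitrary illustrative values, not taken from the lab data.
n, p = 20, 0.3

# Probability of exactly k successes in n trials, for k = 0..n.
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Mean and variance of the number of successes, computed from the pmf.
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"Variance from the pmf: {variance:.6f}")
print(f"n * p * (1 - p):       {n * p * (1 - p):.6f}")
```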

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
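The three steps above can be sketched as a single function. This is an illustration under stated assumptions, not the lab's solution: the function name `verify_by_hand` and the counts passed to it are hypothetical, and the standard library's `statistics.NormalDist` stands in for `scipy.stats.norm` so the snippet has no third-party dependencies.

```python
from math import sqrt
from statistics import NormalDist


def verify_by_hand(control_clicks, control_n, experiment_clicks, experiment_n):
    """Return (expected clicks, z-score, two-sided p-value) for the experiment group."""
    # Step 1: expected experiment clicks at the control click-through rate.
    p_control = control_clicks / control_n
    expected = p_control * experiment_n

    # Step 2: binomial standard deviation sqrt(n * p * (1 - p)), then the z-score.
    std_dev = sqrt(experiment_n * p_control * (1 - p_control))
    z = (experiment_clicks - expected) / std_dev

    # Step 3: two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return expected, z, p_value


# Hypothetical counts, chosen only to exercise the function.
expected, z, p_value = verify_by_hand(200, 1000, 230, 1000)
print(f"Expected clicks: {expected:.1f}, z-score: {z:.3f}, p-value: {p_value:.4f}")
```

Here the sample sizes are numbers of individuals, matching the binary-variable formulation described earlier in this section.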

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [None]:
#Your code here

# Create a binary column 'clicked': 1 for a 'click' row, 0 for a 'view' row
data['clicked'] = (data['action'] == 'click').astype(int)

# Separate control and experiment groups
control_group = data[data['group'] == 'control']
experiment_group = data[data['group'] == 'experiment']

# Click rate for the control group. Each row is one action, so this mean is
# clicks / (view rows + click rows), not clicks per unique visitor.
control_click_rate = control_group['clicked'].mean()

# Number of rows (view and click actions) in the experiment group
n_experiment = experiment_group.shape[0]

# Expected number of clicks in the experiment group at the control click rate
expected_clicks_experiment = control_click_rate * n_experiment

print(f"Expected clicks in experiment group: {expected_clicks_experiment}")



### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [None]:
#Your code here
# Calculate the actual number of clicks in the experiment group
actual_clicks_experiment = experiment_group['clicked'].sum()

# Variance n * p * (1 - p) and standard deviation of the click count at the control rate
variance_experiment = n_experiment * control_click_rate * (1 - control_click_rate)
std_dev_experiment = variance_experiment ** 0.5  # Square root of variance to get standard deviation

# Z-score: number of standard deviations between actual and expected clicks
z_score = (actual_clicks_experiment - expected_clicks_experiment) / std_dev_experiment

print(f"Expected clicks in experiment group: {expected_clicks_experiment}")
print(f"Actual clicks in experiment group: {actual_clicks_experiment}")
print(f"Standard deviation of clicks: {std_dev_experiment}")
print(f"Z-score: {z_score}")

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [None]:
#Your code here
from scipy import stats

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

print(f"P-value: {p_value}")
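The calculation above is two-sided. Because the question asks whether the experimental homepage was *more* effective, a one-sided p-value is also a reasonable quantity to report. The sketch below contrasts the two for a hypothetical z-score (`z_example` is made up, not the lab's result), again using the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`.

```python
from statistics import NormalDist

z_example = 2.0  # hypothetical z-score, not taken from the lab data
std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two-sided: probability of a result at least this extreme in either direction.
p_two_sided = 2 * (1 - std_normal.cdf(abs(z_example)))

# One-sided (upper tail): probability of a result at least this far above expected.
p_one_sided = 1 - std_normal.cdf(z_example)

print(f"Two-sided p-value: {p_two_sided:.4f}")
print(f"One-sided p-value: {p_one_sided:.4f}")
```

For a positive z-score the one-sided p-value is half the two-sided one, so the choice of alternative hypothesis should be made before looking at the data.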


### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

#### Answer:
Yes, the p-value of 0.0066 from this test is very close to the previously calculated p-value of 0.0088. Both p-values are below the common significance level threshold of 0.05, meaning that in both cases we can reject the null hypothesis.

#### Analysis:
This p-value indicates that the difference in click-through rates between the experiment group (new homepage) and the control group (old homepage) is statistically significant at the 0.05 level. Because the test is two-sided, it shows that the two homepages performed differently; whether the experimental homepage performed better is read from the sign of the z-score.

Since the p-value is small, a difference this large would be unlikely if the two homepages truly had the same click-through rate, which suggests that the change in the homepage has had a measurable effect on user behavior.

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the central limit theorem, standard deviation, z-scores, and their accompanying p-values.