# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Additional Notebook (Ungraded): Hypothesis Testing - Regression Approach

## Learning Objectives



At the end of this Additional Notebook, you will be able to:


* Gain a high-level understanding of A/B testing with a regression approach, a way to compare two or more versions (A or B?).

* Determine not only which version (A or B) performs better, but also whether the difference between them is statistically significant.

* Learn how to set up and interpret a regression model specifically designed for comparing groups in A/B testing scenarios.

* Gain skills to interpret the regression coefficients and p-values to draw meaningful conclusions about the performance differences and their statistical significance.





## Introduction

A/B tests are very commonly performed by data analysts and data scientists, so it is important to get some practice working with them.

A/B testing is a crucial technique in data-driven decision-making, yet it is often covered only superficially. This notebook aims to address this gap by providing a consolidated overview of A/B testing principles and practices.

For this Additional Notebook, you will be working to understand the results of an A/B test run by an e-commerce website. Your goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.

## Dataset

The dataset chosen for this experiment is the **ab_data.csv** dataset, which is publicly available on [Kaggle](https://www.kaggle.com/datasets/abdelrahmanrezk7/ab-testing-e-commerce-website).

This dataset consists of 294,478 records. Each record is made up of 5 fields: 'user_id', 'timestamp', 'group', 'landing_page' and 'converted'.

* **user_id:** A unique identifier assigned to each user participating in the experiment.

* **timestamp:** The timestamp indicating the time at which the user interacted with the webpage or was exposed to the experimental condition.

* **group:** The group to which the user was assigned, typically denoted as either 'control' or 'treatment'. This field helps categorize users into different experimental conditions.

* **landing_page:** Specifies the type of landing page or webpage variant that the user was directed to upon interaction. It distinguishes between different versions of the webpage used in the experiment.

* **converted:** A binary indicator representing whether the user performed the desired action or conversion after interacting with the webpage. It typically indicates whether the user made a purchase, signed up for a service, or completed any other desired action.

## Problem Statement

The biggest e-commerce company, FaceZonGoogAppFlix, has approached a data science consulting firm as a new client!

They have designed a potential new webpage with the intention of increasing their current conversion rate of 12% by 0.35% or more. With such an ambiguous task, they are trusting the data science consulting firm to recommend whether to implement the new webpage or keep the old one. Unfortunately, they haven't built up a data science capability in their company, but they ran an external software tool called 'A/B Tester' for 23 days and have come back to the firm with a dataset. Given this scenario, what should the data science consulting firm do?

In [None]:
# @title Download the Dataset
! wget -q https://cdn.exec.talentsprint.com/static/cds/content/ab_data.csv
! wget -q https://cdn.exec.talentsprint.com/static/cds/content/countries.csv
print("The dataset was downloaded")

# **Part I - Probability**

#### Import required packages

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as ss
import statsmodels.api as sm
import math as mt
import itertools
import random
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline
# Set the seed so that every member of your Additional Notebook peer group gets the same answers
random.seed(42)

#### Load the dataset

In [None]:
# YOUR CODE HERE
# a. Read in the dataset and take a look at the top few rows here:
data = pd.read_csv('/content/ab_data.csv') # 294,478 rows and 5 columns
df = data.copy()
df.head()

In [None]:
# b. Use the below cell to find the number of rows in the dataset.
df.shape

## Pre-processing

### Step 1: Data Cleaning

* Check the number of unique users in the dataset.

* Check the proportion of users converted.

    **Hint:** query(), count()

* Count how many times new_page and treatment don't line up, and how many times old_page and control don't line up.

* Display the total number of rows that don't line up.

* Check whether any of the rows have missing values.

#### **Treatment Group & Control Group**
* **Treatment Group (New Webpage):**
Users in this group are exposed to the new webpage design. Its effectiveness is measured by comparing the conversion rate of these users (those who actually purchase the company's products after visiting the new webpage) with the conversion rates in the control groups.
* **Control Group 1 (Placebo):**
Users in this group are presented with an identical-looking webpage that serves as a placebo. This group represents the baseline scenario in which users see the current webpage design without any changes. In other words, the page looks exactly like the current one and contains no actual changes, so it shows how users typically behave without any alterations and gives a reference against which to compare the users who see the real changes in the new webpage.
* **Control Group 2 (Old Webpage or Existing Treatment):**
Users in this group are shown the old webpage that is currently in use and that has already proven effective at converting visitors into purchasers (customers).

-- This group serves as a benchmark: the conversion rates observed in Control Group 2 are used to assess whether the changes made in the new webpage design lead to better results than the current page.

In [None]:
# c. The number of unique users in the dataset.
df.user_id.nunique()

In [None]:
# d. The proportion of users converted.
df.query('converted == 1')['converted'].count() / df.shape[0]

In [None]:
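# Count rows where group is "treatment" but landing_page is not "new_page"
# (.shape[0] avoids the positional Series lookup in .count()[0], which newer pandas versions deprecate)
N1 = df.query('group == "treatment" and landing_page != "new_page"').shape[0]
N1

In [None]:
# Count rows where group is not "treatment" (i.e. control) but landing_page is "new_page"
N2 = df.query('group != "treatment" and landing_page == "new_page"').shape[0]
N2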

In [None]:
# Total no. of non-line up
N = N1 + N2
N
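
As a cross-check on counts like the ones above, a single cross-tabulation of `group` against `landing_page` shows all four combinations at once. The sketch below uses a small made-up DataFrame (`toy`) rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as ab_data.csv
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment", "control"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "new_page", "old_page"],
})

# Rows are groups, columns are landing pages
table = pd.crosstab(toy["group"], toy["landing_page"])
print(table)

# The control/new_page and treatment/old_page cells are the mismatches
mismatches = table.loc["control", "new_page"] + table.loc["treatment", "old_page"]
print("Mismatched rows:", mismatches)
```

On the real data, the same two cells should add up to the total computed in `N`.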

In [None]:
# Check for any missing values
df.isnull().sum().sum()

In [None]:
# Check datatype of each column
df.dtypes

### Step 2: Identify the rows that are not aligned

<u>**Part-2a:**</u>

Using the dataset from Step 1, first identify the rows where the treatment group is aligned with the new_page and where the control group is aligned with the old_page.

**Hint:** Create a new DataFrame containing these filtered rows, for example with the query
**('group == "treatment" and landing_page == "new_page"')**

<u>**Part-2b:**</u>

Now, with the help of the new dataset from Part-2a, identify the misaligned rows in the dataset from Step 1, where treatment is not aligned with new_page or control is not aligned with old_page.

This can be done by checking whether the values 'treatment' and 'control' in the 'group' column fail to correspond to the values 'new_page' and 'old_page' in the 'landing_page' column, respectively.

For the rows where treatment is not aligned with new_page or control is not aligned with old_page, we cannot be sure whether the user truly received the new or the old page. Write code that shows how we should handle these rows.
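
One way to express the alignment check is a single boolean mask: a row is aligned when "is in treatment" and "saw new_page" are either both true or both false. The sketch below runs on a made-up DataFrame (`toy`) and assumes that `group` only takes the values 'control' and 'treatment' and `landing_page` only 'old_page' and 'new_page'.

```python
import pandas as pd

# Hypothetical toy data; user_id values are made up
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "group": ["control", "control", "treatment", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "new_page"],
})

# True when the group and the page agree (treatment/new_page or control/old_page)
aligned_mask = (toy["group"] == "treatment") == (toy["landing_page"] == "new_page")

toy_aligned = toy[aligned_mask]       # rows we can trust
toy_misaligned = toy[~aligned_mask]   # rows where the page actually shown is uncertain
print(len(toy_aligned), len(toy_misaligned))
```

Because it is unclear which page the misaligned users saw, a simple and conservative choice is to exclude those rows from the analysis.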

In [None]:
# Part-2a
# Create a new dataset that keeps only the aligned rows:
# treatment with new_page, or control with old_page
# (query() already returns the matching rows with their original index labels)
df2 = df.query('group == "treatment" and landing_page == "new_page"')

df3 = df.query('group == "control" and landing_page == "old_page"')

In [None]:
df2 = pd.concat([df2, df3], ignore_index=False)

In [None]:
df2.shape

In [None]:
# Part-2b
# Identify misaligned rows where treatment is not aligned with new_page or control is not aligned with old_page
df_misaligned = df[~df.index.isin(df2.index)]
df_misaligned

In [None]:
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

### Step 3: Using the new dataset above, check the following points.

* How many unique user_ids are in the new dataset created above in Step 2?

* There is one user_id repeated in this dataset. Which one is it? (Here we need to show only the user_id.)

* What is the row information for the repeated user_id? (Here we need to show the complete rows, including 'user_id', 'timestamp', 'group', 'landing_page' and 'converted'.)

* Remove one of the rows with the duplicate user_id, but keep the same dataframe name.
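
The duplicate-handling steps above can be sketched with `duplicated` and `drop_duplicates`, which avoid hard-coding an index label. The DataFrame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data in which user_id 11 appears twice
toy = pd.DataFrame({
    "user_id": [10, 11, 11, 12],
    "timestamp": ["2017-01-02", "2017-01-03", "2017-01-05", "2017-01-06"],
    "converted": [0, 0, 0, 1],
})

# keep=False flags every row of a repeated user_id, so all copies are shown
repeats = toy[toy.duplicated("user_id", keep=False)]
print(repeats)

# keep="first" retains the first row encountered for each user_id
deduped = toy.drop_duplicates(subset="user_id", keep="first")
print(deduped["user_id"].is_unique)
```

Whether to keep the first or the last copy is a judgment call; when both rows carry the same group, page and outcome, the choice does not affect the conversion rates.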

In [None]:
df2.user_id.nunique()

In [None]:
duplicated_user_ids = df2[df2.duplicated(['user_id'])]['user_id'].unique()
print("Duplicated user_id:", duplicated_user_ids)

In [None]:
df2.query('user_id == 773192')

In [None]:
# Drop one of the two rows returned by the query above, identified by its index label
df2.drop(1899, inplace=True)
df2.head()

## Finding Probabilities

### Step 4: After removing the duplicated user_id, answer the following:

##### Exercise 1: What is the probability of an individual converting regardless of the page they receive?

In [None]:
# YOUR CODE HERE
df2.converted.mean()

##### Exercise 2: Given that an individual was in the control group, what is the probability they converted?

In [None]:
# YOUR CODE HERE
control_df = df2.query('group == "control"')
control_convert = control_df.converted.mean()
control_convert

##### Exercise 3: Given that an individual was in the treatment group, what is the probability they converted?

In [None]:
# YOUR CODE HERE
treatment_df = df2.query('group == "treatment"')
treatment_convert = treatment_df.converted.mean()
treatment_convert
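
Both conditional probabilities can also be obtained in one step, because the mean of a 0/1 column within each group is exactly P(converted | group). The sketch below uses a made-up DataFrame (`toy`), so the rates are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: 4 control users and 4 treatment users
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Mean of the binary outcome within each group = conditional conversion rate
rates = toy.groupby("group")["converted"].mean()
print(rates)
```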


##### Exercise 4: What is the probability that an individual received the new page?

In [None]:
# YOUR CODE HERE

P_receiving_new_page = df2.query('landing_page == "new_page"').shape[0] / df2.shape[0]
print("Probability of receiving new_page:", P_receiving_new_page)

In [None]:
# What is the probability that an individual received the old page?
P_receiving_old_page = 1 - P_receiving_new_page
print("Probability of receiving old_page:", P_receiving_old_page)

In [None]:
# Observed difference in conversion rates (treatment minus control)
obs_mean = treatment_convert - control_convert
obs_mean
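
To judge whether an observed difference like `obs_mean` could plausibly be due to chance, one common option is a pooled two-proportion z-test. The sketch below is not part of the original notebook: the function name `two_proportion_ztest` and the counts passed to it are hypothetical and chosen only for illustration. It reports a two-sided p-value; for a one-sided question ("is the new page better?") the p-value would be half of this when z is positive.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for H0: p_a == p_b. Returns (z, p_value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both groups share one conversion rate, estimated from the pooled data
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 120/1000 conversions in control (a), 130/1000 in treatment (b)
z, p = two_proportion_ztest(conv_a=120, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```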

# **Part II - A regression approach**

1. In this final part, you will see that the result you achieved in the previous A/B test can also be achieved by performing regression.


**a.** Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?

Because the outcome takes only two categorical values (converted or not converted), we should fit a logistic regression model, which models a binary response rather than a quantitative one.

**b.** The goal is to use statsmodels to fit the regression model you specified in part **a.** to see if there is a significant difference in conversion based on which page a customer receives.
* However, you first need to create a column for the intercept, and create a dummy variable column for which page each user received.
* Add an intercept column, as well as an ab_page column, which is **1** when an individual receives the treatment and **0** if control.

In [None]:
df2['intercept'] = 1

In [None]:
df2['ab_page'] = pd.get_dummies(df2.group)['treatment']
df2.head()

**c.** Import your regression model from statsmodels. Instantiate the model, and fit the model using the two columns you created in part **b.** to predict whether or not an individual converts.

In [None]:
# Convert 'converted' column to numeric
df2['converted'] = pd.to_numeric(df2['converted'])

# The 'intercept' column is a constant 1; make sure it is stored as a numeric type
df2['intercept'] = pd.to_numeric(df2['intercept'])

# Convert 'ab_page' column to binary numeric values (0 and 1)
df2['ab_page'] = df2['ab_page'].astype(int)

# Fit the logistic regression model
ls = sm.Logit(df2['converted'], df2[['intercept', 'ab_page']])
result = ls.fit()

**d.** Provide the summary of your model below, and use it as necessary to answer the following questions.

In [None]:
result.summary()
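
With a single 0/1 predictor, the fitted logistic model reproduces the two group conversion rates exactly, so its coefficients can be written in closed form: the intercept is the log-odds of the control rate, and the `ab_page` coefficient is the difference in log-odds between treatment and control. The sketch below illustrates this with hypothetical rates (`p_control`, `p_treatment`), not the values from the dataset.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Probability corresponding to log-odds x."""
    return 1 / (1 + math.exp(-x))

p_control = 0.12     # hypothetical control conversion rate
p_treatment = 0.11   # hypothetical treatment conversion rate

# Closed-form coefficients of converted ~ intercept + ab_page
intercept = logit(p_control)
ab_page_coef = logit(p_treatment) - logit(p_control)
print(f"intercept = {intercept:.4f}, ab_page = {ab_page_coef:.4f}")

# Plugging the coefficients back in recovers the two rates
print(inv_logit(intercept), inv_logit(intercept + ab_page_coef))
```

This is why the regression and the direct comparison of conversion rates lead to the same conclusion about the page: the sign of the `ab_page` coefficient matches the sign of the difference in rates.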

In [None]:
dfc = pd.read_csv('/content/countries.csv')
dfc.head()

In [None]:
df2 = df2.merge(dfc, on='user_id')
df2.head()

In [None]:
df2[['CA', 'UK']] = pd.get_dummies(df2.country)[['CA', 'UK']]
df2.head()

We keep the CA and UK dummy columns and drop the US column so that the design matrix is full rank; US becomes the baseline country.
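
The reason one dummy has to be dropped can be checked numerically: with an intercept, the three country indicators always sum to the intercept column, so the four columns are linearly dependent. The sketch below uses a handful of made-up users.

```python
import numpy as np

# Hypothetical users and their countries
country = ["US", "CA", "UK", "US", "UK", "CA"]
us = np.array([c == "US" for c in country], dtype=float)
ca = np.array([c == "CA" for c in country], dtype=float)
uk = np.array([c == "UK" for c in country], dtype=float)
intercept = np.ones(len(country))

# All three dummies plus an intercept: intercept == ca + uk + us, so one column is redundant
X_all = np.column_stack([intercept, ca, uk, us])
# Dropping US removes the redundancy and makes US the baseline
X_drop = np.column_stack([intercept, ca, uk])

print(X_all.shape[1], np.linalg.matrix_rank(X_all))
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop))
```

When the rank is smaller than the number of columns, the regression coefficients are not uniquely determined, which is why one category is left out.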

In [None]:
df2['new_page'] = pd.get_dummies(df2.landing_page)['new_page']
df2.head()

In [None]:
# Convert 'new_page' column to binary numeric values (0 and 1)
df2['new_page'] = df2['new_page'].astype(int)

# Convert 'UK' column to binary numeric values (0 and 1)
df2['UK'] = df2['UK'].astype(int)

# Convert 'CA' column to binary numeric values (0 and 1)
df2['CA'] = df2['CA'].astype(int)

In [None]:
# Create a logistic regression model with baselines as US and old_page
logit = sm.Logit(df2.converted, df2[['intercept', 'new_page', 'CA', 'UK']])
result = logit.fit()
result.summary()

-- The predicted difference in the log-odds of conversion for a user in CA as compared to the US, holding other variables constant: **-0.0407**

-- The predicted difference in the log-odds of conversion for a user in UK as compared to the US, holding other variables constant: **0.0099**

-- Switching from the old page to the new page (new_page going from 0 to 1) changes the predicted log-odds of conversion by **-0.0150**, holding all other variables constant.

-- The predicted log-odds of conversion for a user in the US who views the old page (the intercept): **-1.9893**
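
Logistic regression coefficients are on the log-odds scale, so they are often easier to read after exponentiating them into odds ratios, and the intercept can be turned into a probability with the inverse logit. The sketch below simply applies these transformations to the four values quoted above.

```python
import math

# Coefficients quoted above (log-odds scale); baseline is US + old_page
coefs = {"intercept": -1.9893, "new_page": -0.0150, "CA": -0.0407, "UK": 0.0099}

# exp(coefficient) is the odds ratio relative to the baseline category
odds_ratios = {name: math.exp(b) for name, b in coefs.items() if name != "intercept"}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")

# Inverse logit of the intercept = predicted conversion probability at the baseline
baseline_prob = 1 / (1 + math.exp(-coefs["intercept"]))
print(f"baseline conversion probability = {baseline_prob:.3f}")
```

An odds ratio close to 1 means the variable barely shifts the odds of conversion relative to the baseline.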

Let's calculate the **Variance Inflation Factor (VIF)** of each feature to determine whether we have multicollinearity in our model.

In [None]:
y, X = dmatrices('converted ~ new_page + CA + UK', df2, return_type='dataframe')

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

vif

Since none of the features has a VIF larger than 10, a common rule-of-thumb threshold, we conclude that we don't have problematic multicollinearity in our model.
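
For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The from-scratch sketch below follows this usual definition; the helper name `vif` and the synthetic columns are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

# Synthetic data: x3 is almost a copy of x1, x2 is unrelated to both
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

for j, name in enumerate(["x1", "x2", "x3"]):
    print(name, round(vif(X, j), 2))
```

The nearly duplicated columns get a large VIF, while the unrelated column stays close to 1.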

**h.** Though you have now looked at the individual effects of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model.

Provide the summary results, and your conclusions based on the results.

In [None]:
df2['CA_new'] = df2['new_page'] * df2['CA']
df2['UK_new'] = df2['new_page'] * df2['UK']

In [None]:
# Create a second logistic regression model with baselines as US and old_page
logit = sm.Logit(df2.converted, df2[['intercept', 'new_page', 'CA_new', 'UK_new', 'CA', 'UK']])
result = logit.fit()
result.summary()

Based on the results, only the intercept is statistically significant. The coefficients of the interaction variables CA_new and UK_new differ only slightly from the coefficient of new_page itself, and neither is statistically significant, so there is no evidence that the effect of the new page on conversion differs by country. Adding this higher-order term between page and country therefore does not appear to help in predicting conversion.

-- For users in the UK, the new page changes the predicted log-odds of conversion by an additional **0.0315** compared with the effect of the new page in the US.

-- For users in CA, the new page changes the predicted log-odds of conversion by an additional **-0.0468** compared with the effect of the new page in the US.

-- The predicted difference in the log-odds of conversion between users in CA and users in the US who view the old page is **-0.0175.**

-- The predicted difference in the log-odds of conversion between users in UK and users in the US who view the old page is **-0.0057.**
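
In a model with interaction terms, the effect of the new page in a given country is the sum of the `new_page` coefficient and that country's interaction coefficient. The sketch below shows the bookkeeping. The four country-related values are the ones quoted above, while the `intercept` and `new_page` values are assumed placeholders (they are not quoted in this section) and should be replaced by the values from `result.params`.

```python
import math

params = {
    "intercept": -1.99,   # assumed placeholder, not taken from the model summary
    "new_page": -0.02,    # assumed placeholder, not taken from the model summary
    "CA": -0.0175,
    "UK": -0.0057,
    "CA_new": -0.0468,
    "UK_new": 0.0315,
}

def log_odds(country, new_page):
    """Predicted log-odds of conversion; US and old_page (new_page=0) are the baselines."""
    value = params["intercept"] + params["new_page"] * new_page
    if country in ("CA", "UK"):
        value += params[country] + params[f"{country}_new"] * new_page
    return value

def prob(country, new_page):
    """Predicted conversion probability via the inverse logit."""
    return 1 / (1 + math.exp(-log_odds(country, new_page)))

for country in ("US", "CA", "UK"):
    effect = log_odds(country, 1) - log_odds(country, 0)
    print(f"{country}: new-page effect on log-odds = {effect:+.4f}, "
          f"P(convert | new) = {prob(country, 1):.4f}, P(convert | old) = {prob(country, 0):.4f}")
```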