# **4 - Ensuring Experimentation Trustworthiness**

Experimental Design and A/B testing

## **Topic Review**
---

From the material covered in the video lessons:

**Some mechanisms to ensure trustworthiness are:**
1. Validate data quality
2. Avoid threats to internal validity
3. Avoid threats to external validity
4. Mitigate the effect of Simpson's paradox


However, this notebook only discusses data quality and Sample Ratio Mismatch (SRM). SRM is one of the threats to internal validity.

### **1. Data Quality**

We can use the following checklist to assess data quality:
- Missing rates: How many values are missing in the dataset?
- Uniqueness: Are there duplicate records?
- Invalid values: Do the values follow the proper format? Are the values valid for the variable/column?
- Data delays: How much data is available for the experiment period? How long does it take between when the events were logged and when the data is available for analysis?
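
The first two checklist items can be computed in a few lines of pandas. The sketch below is illustrative only: the helper name `quality_summary`, the column names, and the toy rows are assumptions made for this example and are not taken from the PacMusic dataset.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, id_col: str) -> dict:
    """Summarise missing rates and duplicate identifiers (hypothetical helper)."""
    return {
        # Share of missing values per column, in percent
        "missing_rate_pct": (df.isna().mean() * 100).round(3).to_dict(),
        # Rows whose identifier already appeared earlier in the data
        "duplicate_ids": int(df.duplicated(subset=[id_col]).sum()),
        "n_rows": len(df),
    }


# Hypothetical toy data, used only to show the shape of the result
toy = pd.DataFrame({
    "user_id": ["a", "b", "b", "c"],
    "minutes": [30.0, None, 45.0, 20.0],
})

summary = quality_summary(toy, id_col="user_id")
print(summary)
```

Returning a plain dictionary keeps the result easy to log or compare before and after cleaning.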

### **2. Sample Ratio Mismatch (SRM)**
* Sample Ratio Mismatch (SRM) occurs when the observed sample ratio in the experiment differs from the expected ratio.

* **Chi-square test** can be used to detect whether an experiment has SRM or not.

The steps for doing a chi-square test in order to detect SRM are:
1. Define the null and alternative hypotheses (H0 and H1)
2. Calculate the chi-square statistic
3. Define decision rules
4. Make decisions and draw a conclusion


In this notebook, we will learn how to validate data quality and detect SRM in an A/B testing case.

## **Case : Music Streaming Platform**
___________________________



- PacMusic is a music streaming service provider with over 10 million monthly active users, including more than 8 million paying subscribers.
- PacMusic has a feature called "Made for You", a **song recommendation system** based on the user's behavior or favorite songs.
- The song recommendation system uses an algorithm that predicts and offers suitable songs to users based on the characteristics of the music they have listened to previously.
- The PacMusic engineers have taken the **initiative to improve** the algorithm of the song recommendation system.
- Before implementing the new song recommendation algorithm, they conducted an A/B test to compare the performance of the **existing algorithm (control)** and the **new one (treatment)**.
- Here is the dataset collected from the PacMusic experiment.

Let's check the trustworthiness of the experiment based on data quality and the occurrence of SRM!

## Data Quality

#### **Import Data**
First, we must import the data.
- The PacMusic data is stored in a `.csv` file named `pacmusic_dataset.csv`
- Import the data into Python with `pd.read_csv(...)` to start the analysis

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Set the filename
filename = "pacmusic_dataset.csv"

# Import data
data = pd.read_csv(filename)

# Display the first 5 rows of the data
data.head()

Unnamed: 0,RecordID,Visit_Date,IP Address,Variant,Device,Total_Daily_usage
0,1,01-03-22,39.13.114.2,0.0,Android,28.359429
1,2,01-03-22,13.3.25.8,1.0,Web,48.37441
2,3,01-03-22,247.8.211.8,1.0,Android,39.70099
3,4,01-03-22,124.8.220.3,0.0,Android,52.900249
4,5,01-03-22,60.10.192.7,0.0,Web,34.054143


- There are 6 columns (`RecordID`, `Visit_Date`, `IP Address`, `Variant`, `Device`, `Total_Daily_usage`)
- `RecordID` : record order ID
- `Visit_Date` : the date the user used the platform
- `IP Address` : the user's IP address, used as the unique user identifier
- `Variant` : variant assignment (0 for control, 1 for treatment)
- `Device` : the device the user used; it can be IOS, Android, or Web
- `Total_Daily_usage` : the user's total daily usage of the platform (in minutes)

In [3]:
#Overview data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184588 entries, 0 to 184587
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   RecordID           184588 non-null  int64  
 1   Visit_Date         184588 non-null  object 
 2   IP Address         184588 non-null  object 
 3   Variant            184586 non-null  float64
 4   Device             184588 non-null  object 
 5   Total_Daily_usage  183421 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 8.4+ MB


In [4]:
#dimension of data
data.shape

(184588, 6)

#### **Missing Rates**



In [5]:
# Checking missing value
data.isna().sum()

Unnamed: 0,0
RecordID,0
Visit_Date,0
IP Address,0
Variant,2
Device,0
Total_Daily_usage,1167


In [6]:
# Missing rate (%) in column Variant
miss_var = data['Variant'].isna().sum()/data.shape[0]*100
print(f"Missing rate variant : {miss_var:.3f}%")

Missing rate variant : 0.001%


In [7]:
# Missing rate (%) in column Total_Daily_usage
miss_usage = data['Total_Daily_usage'].isna().sum()/data.shape[0]*100
print(f"Missing rate Total daily usage : {miss_usage:.3f}%")

Missing rate Total daily usage : 0.632%


In [8]:
#drop missing value
data = data.dropna()

In [9]:
# Checking missing value after drop
data.isna().sum()

Unnamed: 0,0
RecordID,0
Visit_Date,0
IP Address,0
Variant,0
Device,0
Total_Daily_usage,0


#### **Uniqueness**

In [10]:
# Check for duplicate users (the same IP address appearing more than once)
data.duplicated(['IP Address']).sum()

np.int64(84159)

In [11]:
# Drop duplicate rows, keeping the first record for each IP address
data = data.drop_duplicates(subset='IP Address')
data.shape

(99260, 6)

#### **Invalid values**

  We will check whether the following columns contain invalid values:
  * Variant : should contain 0 or 1
    * 0 = control
    * 1 = treatment
  * Device : should contain IOS, Android, or Web
  * Total_Daily_usage : should be a numeric value that does not exceed 1,440, because a day has only 1,440 minutes

In [12]:
#check for Variant
unique_variant=data["Variant"].unique()
print(unique_variant)

[0. 1.]


In [13]:
#check for Device
unique_device=data["Device"].unique()
print(unique_device)

['Android' 'Web' 'IOS']


In [14]:
#check for total daily usage
invalid_minutes = data[data['Total_Daily_usage'] > 1440]
invalid_minutes

Unnamed: 0,RecordID,Visit_Date,IP Address,Variant,Device,Total_Daily_usage
85,86,01-03-22,227.2.113.6,1.0,Android,10000.0
151555,151556,24-03-22,201.0.194.8,1.0,Android,10000.0


In [15]:
#drop invalid values
#keep only rows whose total daily usage is at most 1440 minutes, selected with .loc
#.copy() avoids a SettingWithCopyWarning when columns are modified later
data=data.loc[data['Total_Daily_usage'] <= 1440].copy()
data.head()

Unnamed: 0,RecordID,Visit_Date,IP Address,Variant,Device,Total_Daily_usage
0,1,01-03-22,39.13.114.2,0.0,Android,28.359429
1,2,01-03-22,13.3.25.8,1.0,Web,48.37441
2,3,01-03-22,247.8.211.8,1.0,Android,39.70099
3,4,01-03-22,124.8.220.3,0.0,Android,52.900249
4,5,01-03-22,60.10.192.7,0.0,Web,34.054143


In [16]:
#sanity check invalid values
invalid_minutes = data[data['Total_Daily_usage'] > 1440]
invalid_minutes

Unnamed: 0,RecordID,Visit_Date,IP Address,Variant,Device,Total_Daily_usage
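
The three validity rules in this section can also be combined into a single boolean mask. The following is a minimal sketch on hypothetical rows: the helper name `invalid_rows` and the toy data are assumptions, and the lower bound of 0 minutes is an extra assumption that the notebook itself does not check.

```python
import pandas as pd

VALID_VARIANTS = [0, 1]
VALID_DEVICES = ["IOS", "Android", "Web"]
MAX_MINUTES_PER_DAY = 1440


def invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that break at least one validity rule (hypothetical helper)."""
    ok = (
        df["Variant"].isin(VALID_VARIANTS)
        & df["Device"].isin(VALID_DEVICES)
        # between() is inclusive on both ends; the lower bound 0 is an assumption
        & df["Total_Daily_usage"].between(0, MAX_MINUTES_PER_DAY)
    )
    return df.loc[~ok]


# Hypothetical toy rows, not taken from the PacMusic dataset
toy = pd.DataFrame({
    "Variant": [0.0, 1.0, 2.0, 1.0],
    "Device": ["Android", "Web", "IOS", "Tablet"],
    "Total_Daily_usage": [28.4, 10000.0, 39.7, 52.9],
})

bad = invalid_rows(toy)
print(bad)
```

Inspecting the invalid rows before dropping them makes it easier to judge whether they come from a logging problem or from genuine outliers.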


#### **Data delays**

Suppose that the PacMusic engineers ran the experiment for 4 weeks, from **1 March 2022** until **28 March 2022**.
We need to make sure that there are no data delays.

We will check whether there is any data outside the experimental period (after 28 March 2022).

In [17]:
# Convert the date to datetime
data['Visit_Date'] = pd.to_datetime(data['Visit_Date'], format='%d-%m-%y')

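# Rows logged after the end of the experiment (data delay)
# An explicit Timestamp avoids relying on how the string '28-03-22' is parsed
data.loc[data['Visit_Date'] > pd.Timestamp('2022-03-28')]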

Unnamed: 0,RecordID,Visit_Date,IP Address,Variant,Device,Total_Daily_usage
184584,184585,2022-03-29,207.2.110.5,0.0,IOS,36.220716


In [18]:
# Keep only the rows within the experiment period
data_final = data.loc[data['Visit_Date'] <= pd.Timestamp('2022-03-28')]

# print tail data
data_final.tail()

Unnamed: 0,RecordID,Visit_Date,IP Address,Variant,Device,Total_Daily_usage
184569,184570,2022-03-28,60.3.87.6,0.0,Web,48.796757
184570,184571,2022-03-28,11.2.87.5,0.0,IOS,28.436697
184573,184574,2022-03-28,18.6.86.9,0.0,IOS,48.682794
184578,184579,2022-03-28,199.14.104.7,0.0,IOS,48.474243
184580,184581,2022-03-28,93.3.115.6,0.0,IOS,52.380493
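
The check above only guards the end of the experiment. A possible variation, sketched here on hypothetical dates, is to test both ends of the window at once with `Series.between`. The variable names `start`, `end`, and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Experiment window stated in this notebook
start = pd.Timestamp("2022-03-01")
end = pd.Timestamp("2022-03-28")

# Hypothetical toy dates in the same day-month-year format as the dataset
toy = pd.DataFrame({"Visit_Date": ["28-02-22", "01-03-22", "28-03-22", "29-03-22"]})
toy["Visit_Date"] = pd.to_datetime(toy["Visit_Date"], format="%d-%m-%y")

# between() is inclusive on both ends by default
in_window = toy["Visit_Date"].between(start, end)

inside = toy.loc[in_window]
outside = toy.loc[~in_window]
print(outside)
```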


## Sample Ratio Mismatch (SRM)

We will detect SRM using a chi-square test.

#### Pre-Analysis


- Before doing the chi-square test, let's fill in the following table to make it easier to detect the presence of SRM.


<center>

|Group|# user|Percentage|
|:--|:--:|:--:|
|Control|-|-|
|Treatment|-|-|

</center>

* # user is the sample size of each group in the dataset
* Percentage is the share of the total sample size that falls in each group


In [19]:
# Make control data
data_control = data_final[data_final["Variant"] == 0]

data_control.head()

Unnamed: 0,RecordID,Visit_Date,IP Address,Variant,Device,Total_Daily_usage
0,1,2022-03-01,39.13.114.2,0.0,Android,28.359429
3,4,2022-03-01,124.8.220.3,0.0,Android,52.900249
4,5,2022-03-01,60.10.192.7,0.0,Web,34.054143
7,8,2022-03-01,97.6.126.6,0.0,Android,55.781004
10,11,2022-03-01,32.6.213.1,0.0,IOS,59.190174


In [20]:
# Make treatment data
data_treatment = data_final[data_final["Variant"] == 1]

data_treatment.head()

Unnamed: 0,RecordID,Visit_Date,IP Address,Variant,Device,Total_Daily_usage
1,2,2022-03-01,13.3.25.8,1.0,Web,48.37441
2,3,2022-03-01,247.8.211.8,1.0,Android,39.70099
5,6,2022-03-01,23.5.199.2,1.0,Web,25.432637
6,7,2022-03-01,195.12.126.2,1.0,Android,51.153328
8,9,2022-03-01,93.10.165.4,1.0,Android,47.174201


In [21]:
# Number of users in the control group
n_control = data_control.shape[0]

n_control

49445

In [22]:
# Number of users in the treatment group
n_treatment = data_treatment.shape[0]

n_treatment

49812

In [23]:
# Percentage in each group
n_total = data_final.shape[0]
persen_control = n_control/n_total * 100
persen_treat = n_treatment/n_total * 100

print(f"% control   : {persen_control:.2f}%")
print(f"% treatment : {persen_treat:.2f}%")

% control   : 49.82%
% treatment : 50.18%


### Chi-Square Test to Detect SRM

The steps for doing a chi-square test in order to detect SRM are:


**1. Define the null and alternative hypotheses ($H_0$ and $H_1$)**

$H_0$  : The observed sample ratio matches the expected ratio (no SRM)

$H_1$  : The observed sample ratio differs from the expected ratio (SRM)

**2. Calculate the chi-square statistic**

$$ \chi^2 = \sum \frac{\left ( \text{observed} - \text{expected} \right )^2}{\text{expected}} $$

Where :
- Observed: the control and treatment traffic volumes (sample sizes), i.e. # user in each group
- Expected: the expected values for control and treatment. With an equal (50/50) split, this is the total observed divided by 2


In [24]:
observed = [ n_control, n_treatment ]
total_traffic= sum(observed)
expected = [ total_traffic/2, total_traffic/2 ]
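
The statistic can also be computed by hand from the formula in step 2, which is a useful cross-check of the library result. The sketch below uses only the standard library and the group sizes reported in cells In [21] and In [22]. The p-value shortcut with `math.erfc` is valid only for 1 degree of freedom, which is the case here because there are two groups.

```python
import math

# Group sizes reported earlier in this notebook
n_control, n_treatment = 49445, 49812

observed = [n_control, n_treatment]
expected = [sum(observed) / 2] * 2  # equal split between the two groups

# Sum of (observed - expected)^2 / expected over the groups
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.4f}, p-value = {p_value:.4f}")
# chi-square = 1.3570, p-value = 0.2441
```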

Then we can calculate the chi-square statistic with the `chisquare` function from the `scipy` library:

1. Import the function
  - `from scipy.stats import chisquare`
2. Use the function `chisquare(f_obs, f_exp=...)`
    - `f_obs`: Observed frequencies in each category (array)
    - `f_exp`: Expected frequencies in each category. By default the categories are assumed to be equally likely.



In [25]:
#calculate chi-square statistics
from scipy.stats import chisquare
chi = chisquare(observed, f_exp=expected)
print(chi)

Power_divergenceResult(statistic=np.float64(1.3569723042203572), pvalue=np.float64(0.2440628976499173))


**3. Define decision rules**

In making statistical test decisions, we can use:
- Comparison of chi-square statistics with critical value
     -  $\chi^2 > \chi^2_{\alpha,df}$ → reject $H_0$

- Comparison of p-value with alpha
   - pvalue < $\alpha$ → reject $H_0$


Normally, one would treat a p-value of 0.05 or less as evidence of SRM. However, 0.05 is not strict enough for this purpose, because it may give a false signal of SRM. We therefore use a stricter significance level of 1%.

The degrees of freedom (df) for this goodness-of-fit test are calculated as:
$$ df = k - 1 $$

where $k$ is the number of groups. With control and treatment, $k = 2$, so $df = 1$.


In [26]:
# Comparison of chi-square statistics with critical value
# We must calculate the critical value first

# critical value is the chi-square value at alpha
alpha = 0.01
df = len(observed) - 1  # number of groups minus 1

from scipy.stats import chi2
chi_critical = chi2.ppf(1 - alpha, df)
print(f"Critical value: {chi_critical:.3f}")

Critical value: 6.635


In [27]:
#Make decisions from chi-square statistics and critical value
if chi.statistic > chi_critical:
  print("Reject H0 : SRM may be present.")
else:
  print("Fail to reject H0 : No SRM")

Fail to reject H0 : No SRM


In [28]:
# Comparison of p-value with alpha (alpha = 0.01 was defined above)
if chi.pvalue < alpha:
    print('Reject H0 : SRM may be present.')
else:
    print('Fail to reject H0 : No SRM.')

Fail to reject H0 : No SRM.
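
The four steps can be wrapped into one reusable function. This is a sketch under stated assumptions rather than the notebook's own method: the name `detect_srm`, its parameters, and the 70/30 example counts are hypothetical. Only `scipy.stats.chisquare` and the group sizes from this notebook come from the material above.

```python
from scipy.stats import chisquare


def detect_srm(observed, expected_ratio=None, alpha=0.01):
    """Chi-square SRM check (hypothetical helper).

    observed       : list of sample sizes, one per group
    expected_ratio : designed traffic split; equal split when None
    alpha          : significance level, 1% as used in this notebook
    """
    total = sum(observed)
    if expected_ratio is None:
        expected_ratio = [1] * len(observed)
    # Normalise so the expected counts sum to the observed total
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]

    result = chisquare(observed, f_exp=expected)
    srm_flag = bool(result.pvalue < alpha)
    return srm_flag, float(result.statistic), float(result.pvalue)


# Group sizes from this notebook, designed as a 50/50 split
flag_equal, stat_equal, p_equal = detect_srm([49445, 49812])
print(flag_equal, round(stat_equal, 4), round(p_equal, 4))

# Hypothetical counts for an experiment designed as a 70/30 split
flag_uneven, stat_uneven, p_uneven = detect_srm([7000, 3300], expected_ratio=[0.7, 0.3])
print(flag_uneven)
```

Passing the designed ratio explicitly matters when traffic is not split equally, because the default of `chisquare` assumes equally likely categories.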


## Conclusion

Based on the data quality checks, we cleaned the data so that the data we use is of sufficient quality. However, we still need to check whether the sample size after cleaning is sufficient (according to the experimental design), so that there is enough power to draw credible conclusions.

Based on the chi-square test, SRM was not detected, even though the sample sizes of the cleaned control and treatment groups are slightly different.
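
The sample-size check mentioned above could look like the sketch below, which uses the standard two-sided formula for comparing two means, $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / \delta^2$ per group. The values of `sigma` (standard deviation of daily usage) and `delta` (minimum detectable difference in minutes) are hypothetical placeholders; the real values should come from the experimental design, which this notebook does not state.

```python
import math
from statistics import NormalDist


def required_n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided test on the difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical design inputs, NOT taken from the PacMusic experiment
sigma = 10.0   # assumed standard deviation of Total_Daily_usage, in minutes
delta = 0.5    # assumed minimum detectable effect, in minutes

n_required = required_n_per_group(sigma, delta)

# Cleaned group sizes reported in this notebook
n_control, n_treatment = 49445, 49812
enough = min(n_control, n_treatment) >= n_required
print(f"Required per group: {n_required}, smallest group: {min(n_control, n_treatment)}, enough: {enough}")
```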