# <font color=blue>A/B Test Data Generator</font>
### <font color=#4D4D4D>Warren Silva</font>
---
This notebook generates sample data suitable for A/B testing. The variables in the **Settings** section control the size, shape, and noise of the output.

In [1]:
# libraries
import pandas as pd
import numpy as np
import csv
import random
import string
from datetime import datetime, timedelta
from faker import Faker
fake = Faker()
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.5f}'.format
import scipy.stats as stats

## <font color=blue>Settings</font>

The following variables control the characteristics of the generated data. Each one is described briefly below.

#### General Settings
- <code>record_count</code>: Specifies the total number of records to generate
- <code>export</code>: If set to <code>True</code>, the generated data is exported as a CSV file
- <code>binary_outcome</code>: Determines the type of outcome variable; <code>True</code> for binary outcomes and <code>False</code> for continuous outcomes

#### ID Settings
- <code>prefix_length</code>: Sets the length of the alphanumeric prefix added to each value in the id column
- <code>start_id</code>: Specifies the starting number for generating sequential IDs
- <code>skip_chance</code>: Defines the probability that a sequential ID will be skipped. Set to 0 if you don't want any gaps 
- <code>repeat_chance</code>: Defines the probability that a sequential ID value will be repeated. Set to 0 if you don't want any duplicate ID values
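
To make the effect of <code>skip_chance</code> and <code>repeat_chance</code> concrete, here is a minimal sketch of the numbering logic with the prefixes left out. The helper name `sketch_ids` and its `seed` parameter are hypothetical and not part of this notebook; the sketch uses its own `random.Random` instance only so that runs are repeatable.

```python
import random

def sketch_ids(start, records, skip_rate, repeat_rate, seed=0):
    """Return `records` numeric IDs with random gaps and duplicates (illustrative only)."""
    rng = random.Random(seed)  # local generator; an assumption made for repeatability
    ids = []
    current = start
    while len(ids) < records:
        if rng.random() >= skip_rate:   # keep this number unless it is skipped
            ids.append(current)
        if rng.random() < repeat_rate:  # emit the same number again
            ids.append(current)
        else:
            current += 1                # otherwise move on to the next number
    return ids[:records]                # trim a possible overshoot of one

# With both probabilities at 0 the IDs are strictly sequential.
print(sketch_ids(start=11303, records=5, skip_rate=0, repeat_rate=0))
# With a high skip probability, gaps appear between consecutive IDs.
print(sketch_ids(start=11303, records=5, skip_rate=0.5, repeat_rate=0.1))
```

Higher skip values spread the same number of records over a wider ID range, while higher repeat values produce more duplicate IDs to clean up downstream.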

#### Date Settings
- <code>start_date</code>: The earliest possible date for the date column
- <code>end_date</code>: The latest possible date for the date column

#### Group Settings
- <code>group_names</code>: Names for the two values in the group column; the first is used for the control group and the second for the treatment group
- <code>group_ratio</code>: The proportion of records assigned to the first group (control group)
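
As a rough illustration of what <code>group_ratio</code> means, the sketch below assigns each record to the first group with probability `ratio` and then measures the resulting share. The function `assign_groups` and its `seed` parameter are hypothetical names used only for this example.

```python
import random

def assign_groups(records, ratio, names=("control", "treatment"), seed=0):
    """Assign each record to names[0] with probability `ratio`, otherwise to names[1]."""
    rng = random.Random(seed)  # local generator; an assumption made for repeatability
    return [names[0] if rng.random() < ratio else names[1] for _ in range(records)]

groups = assign_groups(records=100_000, ratio=0.517)
control_share = groups.count("control") / len(groups)
print(f"control share: {control_share:.3f}")
```

Because assignment is random per record, the observed share is close to the setting but will rarely match it exactly.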

#### Binary Outcome Settings
Settings from this section are ignored if <code>binary_outcome == False</code>
- <code>outcomes</code>: The possible values for a binary outcome (e.g., <code>0</code> and <code>1</code>) 
- <code>rate_1</code>: The probability of a <code>1</code> outcome in the control group
- <code>rate_2</code>: The probability of a <code>1</code> outcome in the treatment group
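
The gap between <code>rate_1</code> and <code>rate_2</code> determines the effect an analysis of the data should be able to detect. As a small worked example using the values set in this notebook, the sketch below computes the absolute and relative lift; the names `absolute_lift` and `relative_lift` are introduced here for illustration only.

```python
rate_1 = 0.165  # control rate (value used in this notebook)
rate_2 = 0.210  # treatment rate (value used in this notebook)

absolute_lift = rate_2 - rate_1          # difference in percentage points
relative_lift = absolute_lift / rate_1   # difference relative to the control rate

print(f"absolute lift: {absolute_lift:.3f}")  # absolute lift: 0.045
print(f"relative lift: {relative_lift:.1%}")  # relative lift: 27.3%
```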

#### Continuous Outcome Settings
Settings from this section are ignored if <code>binary_outcome == True</code>
- <code>zero_pct_1</code>: The proportion of <code>0</code> values in the control group for continuous outcomes
- <code>zero_pct_2</code>: The proportion of <code>0</code> values in the treatment group for continuous outcomes
- <code>min_val</code>: The minimum value for non-zero continuous outcomes
- <code>max_val</code>: The maximum value for non-zero continuous outcomes
- <code>treatment_scale</code>: A multiplier applied to the treatment group's non-zero continuous outcomes; values below 1 shrink them and values above 1 inflate them
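
To see what these settings imply for the group averages, the sketch below computes the expected outcome per record. It assumes non-zero values are drawn uniformly between <code>min_val</code> and <code>max_val</code>, as the notebook's `create_continuous` function does, and it ignores outliers. The helper `expected_mean` is a hypothetical name used only here.

```python
def expected_mean(zero_pct, min_val, max_val, scale=1.0):
    """Expected outcome per record: P(non-zero) * mean of Uniform(min, max) * scale."""
    return (1 - zero_pct) * (min_val + max_val) / 2 * scale

# Values taken from the settings used in this notebook.
control_mean = expected_mean(zero_pct=0.962, min_val=5, max_val=175)
treatment_mean = expected_mean(zero_pct=0.966, min_val=5, max_val=175, scale=0.9905)

print(f"control: {control_mean:.3f}, treatment: {treatment_mean:.3f}")
# control: 3.420, treatment: 3.031
```

With these values, the treatment average is lower both because more of its records are zero and because its non-zero values are scaled down slightly.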

#### Outlier Settings
Settings from this section are ignored if <code>binary_outcome == True</code>
- <code>outlier_pct</code>: The proportion of records whose continuous outcome is replaced with an outlier value
- <code>outlier_min_val</code>: The minimum value for generated outliers
- <code>outlier_max_val</code>: The maximum value for generated outliers
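
The number of outliers is fixed rather than random: the notebook's `create_continuous` function computes it as `int(records * outlier_pct)` and overwrites that many randomly chosen records. Because `int` truncates and floating-point products are not always exact, the count can come out one lower than intended. The sketch below compares truncation with rounding; using `round` is a suggestion, not what the notebook does.

```python
records = 100_000       # record_count used in this notebook
outlier_pct = 0.00009   # outlier_pct used in this notebook

truncated = int(records * outlier_pct)  # the notebook's approach: drops any fractional part
rounded = round(records * outlier_pct)  # alternative: nearest whole number

print(truncated, rounded)
```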

In [2]:
# general settings 
record_count = 100000
export = False
binary_outcome = False # set to False for a continuous outcome

# id settings
prefix_length = 2
start_id = 11303
skip_chance = 0.3
repeat_chance = 0.009

# date settings
start_date = datetime(2023, 8, 1)
end_date = datetime(2023, 12, 31)

# group settings
group_names = ['control', 'treatment']
group_ratio = 0.517

# binary outcome settings
outcomes = [0,1]
rate_1 = 0.165
rate_2 = 0.210

# continuous outcome settings
zero_pct_1 = 0.962
zero_pct_2 = 0.966
min_val = 5
max_val = 175
treatment_scale = 0.9905

# outlier settings
outlier_pct = 0.00009  
outlier_min_val = 800  
outlier_max_val = 2500  

## <font color=blue>Functions</font>

In [3]:
# functions
def id_prefix(num_characters=prefix_length):
    # build a random lowercase prefix of the requested length
    characters = string.ascii_lowercase
    prefix = ''.join(random.choice(characters) for _ in range(num_characters))
    return prefix

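def create_ids(start=start_id, records=record_count, skip_rate=skip_chance, repeat_rate=repeat_chance):
    id_list = []
    current_val = start
    while len(id_list) < records:
        id_val = id_prefix() + str(current_val)
        # keep this ID unless it is skipped (uses the arguments, not the globals)
        if random.random() > skip_rate:
            id_list.append(id_val)
        # occasionally emit a duplicate; otherwise advance to the next number
        if random.random() < repeat_rate:
            id_list.append(id_val)
        else:
            current_val += 1
    # the loop can overshoot by one when a duplicate is appended last
    return id_list[:records]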

def create_dates(start=start_date, end=end_date, records=record_count):
    date_list = []
    while len(date_list) < records:
        # draw a random date within the requested range (uses the arguments, not the globals)
        current_val = fake.date_between_dates(start, end)
        date_list.append(current_val)
    return date_list

def create_binary(records=record_count, ratio=group_ratio, rate_1=rate_1, rate_2=rate_2):
    group_list = []
    outcome_list = []
    while len(group_list) < records:
        # determine group; a record lands in the control group with probability `ratio`
        if random.random() < ratio:
            group_list.append(group_names[0]) # control group
            # outcome is outcomes[1] with probability rate_1, otherwise outcomes[0]
            if random.random() > rate_1:
                outcome_list.append(outcomes[0])
            else:
                outcome_list.append(outcomes[1])
        else:
            group_list.append(group_names[1]) # treatment group
            # outcome is outcomes[1] with probability rate_2, otherwise outcomes[0]
            if random.random() > rate_2:
                outcome_list.append(outcomes[0])
            else:
                outcome_list.append(outcomes[1])
    return group_list, outcome_list

def create_continuous(records=record_count, ratio=group_ratio, zero_rt_1=zero_pct_1, zero_rt_2=zero_pct_2,
                      min_value=min_val, max_value=max_val, group2_scale=treatment_scale,
                      outlier_pct=outlier_pct, outlier_min=outlier_min_val, outlier_max=outlier_max_val):
    group_list = []
    outcome_list = []
    
    # calculate the number of outliers
    num_outliers = int(records * outlier_pct)
    
    while len(group_list) < records:
        # determine group; a record lands in the control group with probability `ratio`
        if random.random() < ratio:
            group_list.append(group_names[0])  # control group
            # determine if the outcome is zero or a random value
            if random.random() < zero_rt_1:
                outcome_list.append(0)
            else:
                outcome_list.append(np.random.uniform(min_value, max_value))
        else:
            group_list.append(group_names[1])  # treatment group
            # determine if the outcome is zero or a random value
            if random.random() < zero_rt_2:
                outcome_list.append(0)
            else:
                outcome_list.append(np.random.uniform(min_value, max_value) * group2_scale)
    
    # introduce outliers
    outlier_indices = np.random.choice(records, num_outliers, replace=False)
    for idx in outlier_indices:
        outcome_list[idx] = np.random.uniform(outlier_min, outlier_max)
        
    return group_list, outcome_list
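
The loop-based functions above draw one record at a time. As an alternative sketch, the same zero-inflated continuous outcome can be drawn in one pass with NumPy arrays. The function name `sketch_continuous`, the `names` and `seed` parameters, and the use of `np.random.default_rng` are assumptions for illustration; outliers are left out to keep the example short.

```python
import numpy as np

def sketch_continuous(records, ratio, zero_rt_1, zero_rt_2, min_value, max_value,
                      group2_scale, names=("control", "treatment"), seed=0):
    """Vectorized draw of groups and zero-inflated outcomes (outliers omitted)."""
    rng = np.random.default_rng(seed)
    # a record belongs to the control group with probability `ratio`
    is_control = rng.random(records) < ratio
    groups = np.where(is_control, names[0], names[1])
    # each group has its own probability of a zero outcome
    zero_rate = np.where(is_control, zero_rt_1, zero_rt_2)
    is_zero = rng.random(records) < zero_rate
    # non-zero values are uniform; treatment values are scaled
    values = rng.uniform(min_value, max_value, size=records)
    values = np.where(is_control, values, values * group2_scale)
    outcomes = np.where(is_zero, 0.0, values)
    return groups, outcomes

groups, outcomes = sketch_continuous(
    records=100_000, ratio=0.517, zero_rt_1=0.962, zero_rt_2=0.966,
    min_value=5, max_value=175, group2_scale=0.9905)
print((groups == "control").mean(), (outcomes == 0).mean())
```

The trade-off is readability against speed: the array version avoids Python-level loops but returns NumPy arrays rather than lists.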



## <font color=blue>Populate Dataframe</font>

In [4]:
# build series for each feature
id_vals = create_ids()
date_vals = create_dates()
if binary_outcome:
    group_vals, outcome_vals = create_binary()
else:
    group_vals, outcome_vals = create_continuous()

# make dataframe
df = pd.DataFrame({
    'id':id_vals,
    'date':date_vals,
    'group':group_vals,
    'outcome':outcome_vals})

# preview data
df.head()

|   | id | date | group | outcome |
|---|---|---|---|---|
| 0 | cd11303 | 2023-10-18 | treatment | 0.0 |
| 1 | ft11304 | 2023-09-28 | treatment | 0.0 |
| 2 | ih11305 | 2023-12-20 | control | 0.0 |
| 3 | ag11306 | 2023-09-25 | treatment | 159.71133 |
| 4 | gc11307 | 2023-12-02 | treatment | 0.0 |
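
A quick way to confirm that generated data matches the settings is to summarise the outcome by group. The sketch below uses a tiny hand-made frame with hypothetical values (not output from this notebook) so that it runs on its own; in the notebook, the same `groupby` call could be applied to `df`. The column names `records`, `zero_share`, and `mean_outcome` are chosen for this example.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df, kept tiny so the example is self-contained.
sample = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment"],
    "outcome": [0.0, 20.0, 0.0, 0.0, 30.0],
})

# One row per group: record count, share of zero outcomes, and mean outcome.
summary = sample.groupby("group")["outcome"].agg(
    records="count",
    zero_share=lambda s: (s == 0).mean(),
    mean_outcome="mean",
)
print(summary)
```

On real output, `records` should track <code>group_ratio</code>, and `zero_share` should land near <code>zero_pct_1</code> and <code>zero_pct_2</code>.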


## <font color=blue>Export</font>

In [5]:
datestring = datetime.now().strftime("%Y%m%d_%H%M")

if export:
    # name the file after the outcome type and the current timestamp
    outcome_type = 'binary' if binary_outcome else 'continuous'
    out_file = outcome_type + '_sample_' + datestring + '.csv'
    df.to_csv(out_file, index=False)
    print(out_file + ' successfully exported')
else:
    print('No file saved. Set export=True to save results')

No file saved. Set export=True to save results
