# Data
- `data.csv` is taken from Kaggle [here](https://www.kaggle.com/datasets/carrie1/ecommerce-data/data)
- It is stored in the folder `data/`

# Problem
- In this note, we build an example to illustrate a variance reduction technique and its usefulness

Imagine we run an experiment with the company in October 2011 in the UK, in the form of an A/B test. We hypothesize that the treatment will affect unit prices.
However, an analysis of the orders from September 2011 shows that the average `UnitPrice` is 3.91 with a variance of about 2300, which is very large relative to the mean.

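So instead of using the raw unit price to measure the treatment effect, we use a capped version of it: every value above a cutoff (for example 50) is replaced by the cutoff.
The variance of the capped metric is about 14, which is roughly 166 times lower. Because the required sample size is proportional to the variance for a fixed detectable difference, the experiment needs roughly 166 times fewer samples, provided the cap does not also shrink the effect we want to detect.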
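
To make the link between variance and sample size concrete, the sketch below uses the standard two-sample z-test approximation, in which the sample size per arm is proportional to the variance divided by the squared detectable difference. It is an illustration only: the function name `samples_per_arm`, the detectable difference of 0.5, and the 5% significance / 80% power settings are assumptions and not part of the original analysis. It also assumes the same absolute difference is targeted for both metrics, which capping may not preserve.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(variance, mde, alpha=0.05, power=0.8):
    """Approximate samples per arm for a two-sided, two-sample z-test.

    Hypothetical helper: n = 2 * variance * (z_{1-alpha/2} + z_{power})^2 / mde^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * variance * (z_alpha + z_power) ** 2 / mde ** 2)


# Variances reported in this note for Sep 2011; mde=0.5 is an assumed value
n_raw = samples_per_arm(2324.52, mde=0.5)
n_capped = samples_per_arm(14.01, mde=0.5)
print(f"raw: {n_raw}, capped: {n_capped}, ratio: {n_raw / n_capped:.1f}")
```

The ratio of the two sample sizes tracks the ratio of the variances, whichever detectable difference is chosen, as long as it is the same for both metrics.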

In [54]:
import chardet
import pandas as pd

# Encoding detection with chardet is left commented out; the encoding is hard-coded instead.
# with open('../data/data.csv', 'rb') as f:
#     result = chardet.detect(f.read())
#     charenc = result['encoding']
charenc = 'ISO-8859-1'
print(charenc)
df = pd.read_csv('../data/data.csv', encoding=charenc)

ISO-8859-1


In [46]:
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France


In [55]:
# remove rows with UnitPrice<=0
df=df[df['UnitPrice']>0]

In [56]:
# format the column `InvoiceDate` to 'YYYY-MM-DD HH:MM:SS'
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['InvoiceDate'] = df['InvoiceDate'].dt.strftime('%Y-%m-%d %H:%M:%S')
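
The cell above converts the dates back to strings, so the later month filters rely on string comparison, which works because the `YYYY-MM-DD HH:MM:SS` layout sorts chronologically. An alternative, sketched below on made-up rows, is to keep the datetime dtype and filter with `pd.Timestamp` bounds on a half-open interval. The frame `toy` and its values are hypothetical and only mimic the raw `InvoiceDate` layout.

```python
import pandas as pd

# Made-up rows in the same 'M/D/YYYY H:MM' layout as the raw InvoiceDate column
toy = pd.DataFrame({
    'InvoiceDate': ['8/31/2011 23:59', '9/1/2011 0:00', '9/15/2011 10:30', '10/1/2011 0:00'],
    'UnitPrice': [1.0, 2.0, 3.0, 4.0],
})

# Parse once and keep the datetime64 dtype instead of formatting back to strings
toy['InvoiceDate'] = pd.to_datetime(toy['InvoiceDate'], format='%m/%d/%Y %H:%M')

# Half-open interval [start, end): September 2011 only
start, end = pd.Timestamp('2011-09-01'), pd.Timestamp('2011-10-01')
sep = toy[(toy['InvoiceDate'] >= start) & (toy['InvoiceDate'] < end)]
print(sep)
```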



In [59]:
# Select the rows from Sep 2011 for Country `United Kingdom` only.
# The amount column is `UnitPrice` and the time column is `InvoiceDate`.
df_orders = df[(df['InvoiceDate']>'2011-09-01') & (df['InvoiceDate']<'2011-10-01') & (df['Country']=='United Kingdom')].copy()

print(df_orders)

       InvoiceNo StockCode                         Description  Quantity  \
320705    565080     20677                  PINK POLKADOT BOWL         8   
320706    565080     22128          PARTY CONES CANDY ASSORTED        24   
320708    565082     22423            REGENCY CAKESTAND 3 TIER         2   
320709    565082    15060B          FAIRY CAKE DESIGN UMBRELLA         8   
320710    565082     23245          SET OF 3 REGENCY CAKE TINS         4   
...          ...       ...                                 ...       ...   
370926    569202     22486                   PLASMATRONIC LAMP         1   
370927    569202     22495      SET OF 2 ROUND TINS CAMEMBERT          1   
370928    569202     22539              MINI JIGSAW DOLLY GIRL         2   
370929    569202     22540          MINI JIGSAW CIRCUS PARADE          2   
370930    569202     22805  BLUE DRAWER KNOB ACRYLIC EDWARDIAN        10   

                InvoiceDate  UnitPrice  CustomerID         Country  
320705  2011-09-01

In [61]:
print(f"Average order value in Sep 2011 is {df_orders['UnitPrice'].mean():.2f}. Number of orders: {len(df_orders)}")
print(f"The variance is {df_orders['UnitPrice'].var():.2f}")

Average order value in Sep 2011 is 3.91. Number of orders: 45373
The variance is 2324.52


In [62]:
# Cap UnitPrice at 50 (presumably UK pounds): values above 50 are replaced by 50.
df_orders['UnitPrice_cutoff_50'] = df_orders['UnitPrice'].clip(upper=50)
print(f"Average is {df_orders['UnitPrice_cutoff_50'].mean():.2f}. Number of orders: {len(df_orders)}")
print(f"The variance is {df_orders['UnitPrice_cutoff_50'].var():.2f}")
# Ratio of the raw variance to the capped variance
print(f"So the variance is reduced {df_orders['UnitPrice'].var()/df_orders['UnitPrice_cutoff_50'].var():.2f} times.")


Average is 3.15. Number of orders: 45373
The variance is 14.01
So the variance is reduced 165.91 times.
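
The cutoff of 50 is one choice among many, and it trades variance against bias: the output above shows the mean moving from 3.91 to 3.15 once values are capped. The sketch below illustrates that trade-off for several cutoffs. It uses synthetic lognormal data, not the Kaggle file, and the cutoff grid and distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heavy-tailed "prices"; NOT the Kaggle data
prices = pd.Series(rng.lognormal(mean=1.0, sigma=1.5, size=10_000))

raw_var = prices.var()
rows = []
for cutoff in [10, 50, 100, 500]:
    capped = prices.clip(upper=cutoff)  # values above the cutoff become the cutoff
    rows.append({
        'cutoff': cutoff,
        'mean': capped.mean(),
        'variance': capped.var(),
        'reduction': raw_var / capped.var(),
    })

summary = pd.DataFrame(rows)
print(f"raw mean {prices.mean():.2f}, raw variance {raw_var:.2f}")
print(summary)
```

A lower cutoff never increases the variance, but it pulls the capped mean further from the raw mean, so the capped metric measures something slightly different from the raw unit price.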


In [63]:
# Repeat the same computation for Aug 2011
df_orders = df[(df['InvoiceDate']>'2011-08-01') & (df['InvoiceDate']<'2011-09-01') & (df['Country']=='United Kingdom')].copy()
print(f"Average order value in Aug 2011 is {df_orders['UnitPrice'].mean():.2f}. Number of orders: {len(df_orders)}")
print(f"The variance is {df_orders['UnitPrice'].var():.2f}")

# Cap UnitPrice at 50, as for Sep 2011
df_orders['UnitPrice_cutoff_50'] = df_orders['UnitPrice'].clip(upper=50)
print(f"Average is {df_orders['UnitPrice_cutoff_50'].mean():.2f}. Number of orders: {len(df_orders)}")
print(f"The variance is {df_orders['UnitPrice_cutoff_50'].var():.2f}")
# Ratio of the raw variance to the capped variance
print(f"So the variance is reduced {df_orders['UnitPrice'].var()/df_orders['UnitPrice_cutoff_50'].var():.2f} times.")


Average order value in Aug 2011 is 5.03. Number of orders: 31007
The variance is 9561.83
Average is 3.28. Number of orders: 31007
The variance is 17.62
So the variance is reduced 542.65 times.
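
Since the September and August cells repeat the same steps, the computation could be wrapped in a small helper. The sketch below is one possible shape: the function name `capped_variance_summary` and the toy rows are hypothetical. It assumes `InvoiceDate` holds `YYYY-MM-DD HH:MM:SS` strings, as in this notebook, so string comparison selects the month.

```python
import pandas as pd


def capped_variance_summary(df, start, end, country='United Kingdom', cutoff=50):
    """Mean and variance of UnitPrice, raw and capped, for rows in [start, end)."""
    mask = (
        (df['InvoiceDate'] >= start)
        & (df['InvoiceDate'] < end)
        & (df['Country'] == country)
    )
    prices = df.loc[mask, 'UnitPrice']
    capped = prices.clip(upper=cutoff)
    return {
        'n_rows': len(prices),
        'mean': prices.mean(),
        'variance': prices.var(),
        'capped_mean': capped.mean(),
        'capped_variance': capped.var(),
        'variance_ratio': prices.var() / capped.var(),
    }


# Made-up rows for illustration only
toy = pd.DataFrame({
    'InvoiceDate': ['2011-08-05 09:00:00', '2011-08-20 14:30:00',
                    '2011-08-25 11:00:00', '2011-09-02 10:00:00'],
    'UnitPrice': [2.0, 4.0, 300.0, 5.0],
    'Country': ['United Kingdom'] * 4,
})
aug = capped_variance_summary(toy, '2011-08-01', '2011-09-01')
print(aug)
```

With the real frame, the two months would be `capped_variance_summary(df, '2011-09-01', '2011-10-01')` and `capped_variance_summary(df, '2011-08-01', '2011-09-01')`.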
