
# Greenweez Home Page

The home page is one of the most important pages of a website: it generates a lot of traffic and acts as the site's showcase. The traffic optimisation team wants to optimise it and is hesitating between two versions.



### Here are the two versions:

[Variant A](https://drive.google.com/file/d/1LqPXgeOJ8QQ1ZfcO4_Mz26lehmyOkles/view) - Slider with a white design

[Variant B](https://drive.google.com/file/d/1rBydNNlrg5d1AmGXo8-9DsfrbE-tuAox/view) - Static page with a green design

### We need to split the users

Before we can actually run the A/B test, we need to split our users into two groups. Let's start by importing the user data from the customers tab in [this spreadsheet](https://docs.google.com/spreadsheets/d/1lpyAhs6Yh2WZ-zqKrpfxKN08fZ3PTISvS2ajl3L6Avk/edit#gid=386045473).

In [1]:
# Import the data (also import the necessary packages)
import pandas as pd
from sklearn.utils import shuffle

In [None]:
sheet_id = '1HwstwFa5Iy4xPPGNvmPjJqtqaqO04LP-OfvtkwO5CBw'
sheet_name = 'customers'
# url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/edit#gid=386045473'
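The `sheet_id` and `sheet_name` variables above are defined but never used, since the data is read from a local Excel file in the next cell. As a minimal sketch of how they could be used instead, the helper below builds a CSV export URL for one tab of the sheet. The `gviz/tq?tqx=out:csv` URL pattern and the helper names `build_sheet_csv_url` and `load_sheet` are assumptions made for illustration, and the download only works if the sheet is readable without logging in.

```python
from urllib.parse import quote

import pandas as pd


def build_sheet_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Build a CSV export URL for one tab of a Google Sheet (assumed URL pattern)."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


def load_sheet(sheet_id: str, sheet_name: str) -> pd.DataFrame:
    """Download the tab as CSV; needs network access and a publicly readable sheet."""
    return pd.read_csv(build_sheet_csv_url(sheet_id, sheet_name))


# Only the URL is built here; call load_sheet(...) to actually fetch the data.
url = build_sheet_csv_url("1HwstwFa5Iy4xPPGNvmPjJqtqaqO04LP-OfvtkwO5CBw", "4 weeks")
print(url)
```

The tab name is URL-encoded with `quote` so that names containing spaces, such as `4 weeks`, stay valid in the query string.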


In [3]:
file_path = '/content/Greenweez Home Page Results.xlsx'
df_customer = pd.read_excel(file_path, sheet_name='customers', engine='openpyxl')

In [5]:
# Let's take a look at our dataframe
df_customer

Unnamed: 0,customers_id,avg_basket
0,9731,202.59
1,61582,22.92
2,305054,32.05
3,305036,30.46
4,10969,87.93
...,...,...
39995,273264,35.46
39996,273371,87.03
39997,70803,50.49
39998,6743,86.19


Let's adopt a naive strategy first - splitting by median customers_id

In [6]:
median_customers_id = df_customer['customers_id'].median()
print(median_customers_id)
# Check whether any customer has an id exactly equal to the median
print(df_customer[df_customer['customers_id'] == median_customers_id])

218866.5
Empty DataFrame
Columns: [customers_id, avg_basket]
Index: []


In [7]:
A_group = df_customer[df_customer['customers_id'] < median_customers_id]
B_group = df_customer[df_customer['customers_id'] > median_customers_id]

Did we do a good job? Let's look at the mean avg_basket for both groups

In [8]:
A_group['avg_basket'].mean(), B_group['avg_basket'].mean()

(76.670484, 52.311416)

That's quite a difference! Should we try another strategy?
Let's divide the two groups randomly. Check out [this](https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows) StackOverflow thread on how to do that.

In [9]:
# Shuffle the rows, then cut the shuffled DataFrame into two groups
shuffled = df_customer.sample(frac=1)
group1 = shuffled[:20000]
group2 = shuffled[20000:]
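The cell above shuffles without a seed and hard-codes the cut at row 20000, so the groups change on every run. A minimal sketch of a reproducible variant follows. The helper name `random_split`, the seed value, and the toy data are assumptions for illustration only; they are not part of the original notebook.

```python
import pandas as pd


def random_split(df: pd.DataFrame, random_state: int = 0):
    """Shuffle the rows with a fixed seed and cut the result in two halves."""
    shuffled = df.sample(frac=1, random_state=random_state)
    half = len(shuffled) // 2
    return shuffled.iloc[:half], shuffled.iloc[half:]


# Toy data standing in for df_customer (hypothetical values).
toy = pd.DataFrame(
    {
        "customers_id": range(1, 11),
        "avg_basket": [float(x) for x in range(10, 110, 10)],
    }
)

group_a, group_b = random_split(toy, random_state=42)
print(len(group_a), len(group_b))
```

Fixing `random_state` makes the assignment repeatable, which helps when the same split has to be handed to the developers and re-derived later for the analysis.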

Let's check the avg_basket again. We should have done a better job!

In [10]:
group1['avg_basket'].mean(),group2['avg_basket'].mean()

(64.357677, 64.624223)

### The results are in

After 4 weeks, the web developers have gotten back to you with the results of the [test](https://docs.google.com/spreadsheets/d/1lpyAhs6Yh2WZ-zqKrpfxKN08fZ3PTISvS2ajl3L6Avk/edit?usp=sharing). Let's analyse them to see which variant is the best. Take some time to make sense of the different columns in the *4 weeks* table. Then download the file and load that table in the next cell.

In [11]:
# Load the '4 weeks' sheet of the results file.
df4 = pd.read_excel(file_path, sheet_name='4 weeks', engine='openpyxl')

In [12]:
# Have a look at your newly created dataframe
df4


Unnamed: 0,AB test group,Nb sessions,Nb bounces,% bounces,Nb pages,Page / Sessions,Nb transactions,% conversions
0,Slider blank,243210,90310,0.371325,406734,1.672357,16904,0.069504
1,Static green,243920,92031,0.3773,405872,1.663955,16699,0.068461
2,Total,487130,182341,0.374317,812606,1.66815,33603,0.068982


In [13]:
# Let's set the index to the "AB test group" column
df4.set_index('AB test group', inplace=True)

In [14]:
df4

AB test group,Nb sessions,Nb bounces,% bounces,Nb pages,Page / Sessions,Nb transactions,% conversions
Slider blank,243210,90310,0.371325,406734,1.672357,16904,0.069504
Static green,243920,92031,0.3773,405872,1.663955,16699,0.068461
Total,487130,182341,0.374317,812606,1.66815,33603,0.068982


In [25]:
# Make sure you know how to access the individual values - try displaying the number of sessions for the blank slider
# Try using the column/index names and not numbers to make the code more readable
df4.loc['Slider blank','Nb sessions']

243210

### The bounce variable

The first metric we want to analyse is bounce! What kind of test would best suit this metric?

*Answer: Chi-Square test, because bounce is a discrete binary variable: a customer either bounces or doesn't!*

Now that we've chosen the appropriate test, you might notice that we're lacking something: the theoretical or expected value. Since neither of these variants has been implemented before and we don't have a baseline, we'll have to create our own. Our hypothesis is that the bounce rate is the same for both variants, equal to the average bounce rate of 37.43%.

Compute the theoretical number of bounces for both variants using the average bounce rate!

In [16]:
df4.head()

AB test group,Nb sessions,Nb bounces,% bounces,Nb pages,Page / Sessions,Nb transactions,% conversions
Slider blank,243210,90310,0.371325,406734,1.672357,16904,0.069504
Static green,243920,92031,0.3773,405872,1.663955,16699,0.068461
Total,487130,182341,0.374317,812606,1.66815,33603,0.068982


In [17]:
# Compute the theoretical number of bounces for both variants using the average bounce rate!
bounce_moyen = df4.loc['Total','% bounces']
bounce_moyen

0.3743169175

In [26]:
blanc = bounce_moyen * df4.loc['Slider blank','Nb sessions']
vert = bounce_moyen * df4.loc['Static green','Nb sessions']
blanc,vert

(91037.617505175, 91303.3825166)
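The same expected counts can be obtained in one vectorised step. The sketch below rebuilds the relevant part of the *4 weeks* table by hand (the counts are copied from the output shown earlier) and recomputes the pooled bounce rate from the raw counts rather than reading the rounded `% bounces` value. The names `results`, `pooled_rate`, and `expected_bounces` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Counts copied from the '4 weeks' table displayed above.
results = pd.DataFrame(
    {"Nb sessions": [243210, 243920], "Nb bounces": [90310, 92031]},
    index=["Slider blank", "Static green"],
)

# Pooled bounce rate under the null hypothesis (same rate for both variants).
pooled_rate = results["Nb bounces"].sum() / results["Nb sessions"].sum()

# Expected bounces per variant = pooled rate * sessions of that variant.
expected_bounces = results["Nb sessions"] * pooled_rate
print(expected_bounces)
```

Because the rate is derived from the counts themselves, the expected bounces sum to the observed total, which is what the chi-square goodness-of-fit test requires.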

Now that we have all the elements we need, compute the Chi-Square test below, first by hand with the formula (and the table) and then using the [scipy function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html)

In [28]:
## With the formula

## Chi-square statistic = sum over categories of
## (observed frequency - expected frequency)^2 / expected frequency
chi_square = (
    (df4.loc['Slider blank', 'Nb bounces'] - blanc) ** 2 / blanc
    + (df4.loc['Static green', 'Nb bounces'] - vert) ** 2 / vert
)
print(chi_square)

## With Scipy
# Import the right modules (also import numpy)
from scipy.stats import chisquare
import numpy as np

# Create arrays for the observed and expected bounce values
f_obs_bounce = np.array([df4.loc['Slider blank', 'Nb bounces'], df4.loc['Static green', 'Nb bounces']])
f_exp_bounce = np.array([blanc, vert])

# Calculate chisquare (the result is stored, not displayed)
chi_square_bounce = chisquare(f_obs_bounce, f_exp_bounce)

11.614027402426252


What do you make of the results? Can we safely reject the null hypothesis?

*Yes, we can - the p-value is low enough (lower than our 5% threshold)*
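The cell above prints only the statistic, so the p-value behind this answer is never shown. A minimal, self-contained sketch of how it could be checked follows, using the counts from the *4 weeks* table. It assumes one degree of freedom (two categories minus one), which is also the default `scipy.stats.chisquare` uses for two categories. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chisquare

# Counts copied from the '4 weeks' table; floats avoid integer overflow in the product.
observed = np.array([90310, 92031], dtype=float)
sessions = np.array([243210, 243920], dtype=float)

# Expected bounces if both variants shared the pooled bounce rate.
expected = sessions * observed.sum() / sessions.sum()

# Goodness-of-fit test: returns both the statistic and the p-value.
result = chisquare(f_obs=observed, f_exp=expected)

# The same p-value obtained by hand from the chi-square survival function.
p_manual = chi2.sf(result.statistic, df=len(observed) - 1)

print(result.statistic, result.pvalue, p_manual)
```

Comparing `result.pvalue` with the 5% threshold makes the decision explicit instead of relying on the size of the statistic alone.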

### What about the other metrics?

Let's repeat what we just did for the other valid metric: number of transactions made. Again, we need to compute the theoretical values first.

Could we also run this test on the number of pages visited? Why/why not?

#### Number of transactions made

In [37]:
# Compute the theoretical transactions for both variants using the conversion rate!
blanc_theor_transaction = df4.loc['Slider blank', 'Nb sessions'] * df4.loc['Total', '% conversions']
green_theor_transaction = df4.loc['Static green', 'Nb sessions'] * df4.loc['Total', '% conversions']

blanc_theor_transaction, green_theor_transaction

(16777.0115359242, 16825.9884619984)

In [47]:
# Chi-Square with the formula
chi_square_transaction = (
    (df4.loc['Slider blank', 'Nb transactions'] - blanc_theor_transaction) ** 2 / blanc_theor_transaction
    + (df4.loc['Static green', 'Nb transactions'] - green_theor_transaction) ** 2 / green_theor_transaction
)
print(f"Using the formula: {chi_square_transaction}")

# Chi-Square with the Scipy function
f_obs_transaction = np.array([df4.loc['Slider blank', 'Nb transactions'], df4.loc['Static green', 'Nb transactions']])
# Use the theoretical transaction counts computed in the previous cell
f_exp_transaction = np.array([blanc_theor_transaction, green_theor_transaction])

chi_square_transactions = chisquare(f_obs=f_obs_transaction, f_exp=f_exp_transaction)
chi_square_transactions


Using the formula: 1.9196028930112474


Power_divergenceResult(statistic=1.9196028930112474, pvalue=0.1659004437802039)

Is the resulting p-value satisfactory?
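The p-value of about 0.166 is above the 5% threshold used for the bounce metric, so the difference in transactions would not be considered significant at that level. As a cross-check, the sketch below runs a different but related test that the notebook does not use: a chi-square test of independence on the full 2x2 table (transaction vs. no transaction, per variant) with `scipy.stats.chi2_contingency`. Switching off the continuity correction is a choice made here so the result stays comparable with the formula above. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the '4 weeks' table.
sessions = np.array([243210, 243920])
transactions = np.array([16904, 16699])

# 2x2 table: rows are variants, columns are (transaction, no transaction).
table = np.column_stack([transactions, sessions - transactions])

# Test of independence between variant and conversion, without Yates' correction.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)

alpha = 0.05  # same 5% threshold as for the bounce metric
print(f"statistic={stat:.4f}, p-value={p_value:.4f}, dof={dof}")
print("reject H0" if p_value < alpha else "cannot reject H0")
```

This version also accounts for the sessions that did not convert, so its statistic is slightly larger than the goodness-of-fit value, but the conclusion at the 5% level is the same.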