# Which version of the website should you use?

## 📖 Background
You work for an early-stage startup in Germany. Your team has been working on a redesign of the landing page. The team believes a new design will increase the number of people who click through and join your site. 

They have been testing the changes for a few weeks and now want to measure their impact. They need you to determine whether the observed increase could be due to random chance or is statistically significant.

## 💾 The data
The team assembled the following file:

#### Redesign test data
- "treatment" - "yes" if the user saw the new version of the landing page, "no" otherwise.
- "new_images" - "yes" if the page used a new set of images, "no" otherwise.
- "converted" - 1 if the user joined the site, 0 otherwise.

The control group is those users with "no" in both columns: the old version with the old set of images.

In [None]:
import pandas as pd
df = pd.read_csv('./data/redesign.csv')
df.head()

Unnamed: 0,treatment,new_images,converted
0,yes,yes,0
1,yes,yes,0
2,yes,yes,0
3,yes,no,0
4,no,yes,0


## 💪 Challenge
Complete the following tasks:

1. Analyze the conversion rates for each of the four groups: the new/old design of the landing page and the new/old pictures.
2. Can the increases observed be explained by randomness? (Hint: Think A/B test)
3. Which version of the website should they use?

## 🧑‍⚖️ Judging criteria

We will randomly select ten winners from the correct submissions for this challenge.

The winners will receive DataCamp merchandise.

## ✅ Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your answers.
- Check that all the cells run without error.

In [None]:


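# Cross-tabulate conversions by group; margins=True adds an "All" total row and column.
df222 = pd.crosstab(index=[df['treatment'], df['new_images']], columns=df['converted'], margins=True).reset_index()

# Divide each count by its row total to get the share of non-converted (0) and converted (1) users.
df222[0] = df222[0] / df222['All']
df222[1] = df222[1] / df222['All']
df222['All'] = df222['All'] / df222['All']
df222

# The third group (treatment = yes, new_images = no) has the highest observed conversion rate:
# about 12.0%, versus 10.7% for the control group. Whether that gap is significant still needs a test.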



converted,treatment,new_images,0,1,All
0,no,no,0.892896,0.107104,1.0
1,no,yes,0.887462,0.112538,1.0
2,yes,no,0.879953,0.120047,1.0
3,yes,yes,0.886276,0.113724,1.0
4,All,,0.886647,0.113353,1.0
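
The same rates can also be read off a `groupby`, since `converted` is 0/1 and its mean within a group is that group's conversion rate. The sketch below is an illustration only: it rebuilds a row-per-user frame (named `demo` here, a hypothetical name) from the counts shown in this notebook's crosstab output, so it runs without the CSV.

```python
import pandas as pd

# Counts copied from the crosstab output in this notebook:
# (treatment, new_images) -> (not converted, converted)
counts = {
    ("no", "no"): (9037, 1084),
    ("no", "yes"): (8982, 1139),
    ("yes", "no"): (8906, 1215),
    ("yes", "yes"): (8970, 1151),
}

# Rebuild one row per user so the example is self-contained (assumption:
# this mirrors the structure of redesign.csv, not its original row order).
rows = []
for (treatment, new_images), (n_no, n_yes) in counts.items():
    rows += [(treatment, new_images, 0)] * n_no
    rows += [(treatment, new_images, 1)] * n_yes
demo = pd.DataFrame(rows, columns=["treatment", "new_images", "converted"])

# Group size, number of conversions, and conversion rate per group.
rates = demo.groupby(["treatment", "new_images"])["converted"].agg(
    users="count", conversions="sum", rate="mean"
)
print(rates)
```

With the real data, replacing `demo` with `df` should give the same table in one step, and it keeps the group sizes visible next to the rates.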


In [None]:
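# Observed counts of non-converted (0) and converted (1) users for each of the four groups.
df22 = pd.crosstab(index=[df['treatment'], df['new_images']], columns=df['converted'], margins=False).reset_index()

# Keep only the two count columns; rows stay in the order (no, no), (no, yes), (yes, no), (yes, yes).
df22 = df22.iloc[:, -2:]
df22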



converted,0,1
0,9037,1084
1,8982,1139
2,8906,1215
3,8970,1151


In [None]:
# Pearson's chi-squared test of independence on the 4x2 table of group vs. conversion.
# H0: the conversion rate is the same in all four groups.
# The p-value is below 0.05, so H0 is rejected: at least one group's conversion rate differs from the others.
# The test alone does not identify which group differs, nor does it show that every pair of groups differs.

from scipy.stats import chi2_contingency

stat, p, dof, expected = chi2_contingency(df22)

alpha = 0.05
print("p value is " + str(p))
print("Stat", stat)
print("DOF", dof)
print("expected", expected)

p value is 0.03630328708083606
Stat 8.526056765102425
DOF 3
expected [[8973.75 1147.25]
 [8973.75 1147.25]
 [8973.75 1147.25]
 [8973.75 1147.25]]
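
To make the numbers above less of a black box, here is a minimal sketch that recomputes the statistic by hand from the observed counts in this notebook. Under H0 every group shares the overall conversion rate, so each expected count is row total × column total / grand total, and the statistic is the sum of (observed − expected)² / expected. The variable names (`observed`, `expected_by_hand`, and so on) are illustrative and not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts copied from the crosstab output: columns are [not converted, converted].
observed = np.array([
    [9037, 1084],   # treatment = no,  new_images = no (control)
    [8982, 1139],   # treatment = no,  new_images = yes
    [8906, 1215],   # treatment = yes, new_images = no
    [8970, 1151],   # treatment = yes, new_images = yes
])

# Expected counts if conversion were independent of group.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# Pearson's statistic and its degrees of freedom: (rows - 1) * (columns - 1).
stat_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()
dof_by_hand = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution gives the p-value.
p_by_hand = chi2.sf(stat_by_hand, dof_by_hand)

print("Stat", stat_by_hand)
print("DOF", dof_by_hand)
print("p value is", p_by_hand)
```

These values should agree with the `chi2_contingency` output above, because SciPy only applies the Yates continuity correction when there is a single degree of freedom.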


In [None]:
alpha = 0.05
print("p value is " + str(p))
# "Dependent" means conversion depends on the group, i.e. the four conversion rates are not all equal.
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

p value is 0.03630328708083606
Dependent (reject H0)
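
The overall test says the groups are not all alike, but it does not say which version beats the control. One possible follow-up, which is not part of the original analysis, is to compare each variant with the control in its own 2x2 chi-squared test and apply a Bonferroni correction for the three comparisons. The sketch below assumes that approach and uses the counts from this notebook's crosstab output; the names `groups`, `variants`, and `adjusted_p` are illustrative.

```python
from scipy.stats import chi2_contingency

# Counts copied from the crosstab output in this notebook:
# (treatment, new_images) -> [not converted, converted]
groups = {
    ("no", "no"): [9037, 1084],    # control: old design, old images
    ("no", "yes"): [8982, 1139],
    ("yes", "no"): [8906, 1215],
    ("yes", "yes"): [8970, 1151],
}
control = groups[("no", "no")]
variants = {name: c for name, c in groups.items() if name != ("no", "no")}

alpha = 0.05
adjusted_p = {}
for name, variant_counts in variants.items():
    # 2x2 table of control vs. one variant. correction=False requests the plain
    # Pearson statistic (no Yates correction), matching the overall 4x2 test.
    _, p_raw, _, _ = chi2_contingency([control, variant_counts], correction=False)
    # Bonferroni: multiply by the number of comparisons, capped at 1.
    adjusted_p[name] = min(1.0, p_raw * len(variants))
    verdict = "differs from control" if adjusted_p[name] <= alpha else "no significant difference"
    print(name, "adjusted p =", round(adjusted_p[name], 4), "->", verdict)
```

A variant whose adjusted p-value falls below `alpha` differs from the control at the 5% level under this correction. Bonferroni is conservative, so a non-significant result here does not show that a variant has no effect.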
