
# **Analyze A/B Test Results**

# **Table of Contents**
* Introduction
* Part I - Probability
* Part II - A/B Test
* Part III - Regression

# **Introduction**

A/B tests are commonly performed by data analysts and data scientists, so it is important to get some practice working through the difficulties of these tests.

For this project, you will work to understand the results of an A/B test run by an e-commerce website. Your goal is to work through this notebook to help the company decide whether it should implement the new page, keep the old page, or perhaps run the experiment longer before making its decision.

# **Part I - Probability**
To get started, let's import our libraries.

In [2]:
import pandas as pd
import numpy as np
import random 
import time
import matplotlib.pyplot as plt
%matplotlib inline
# Set the seed so that your results match the expected quiz answers
random.seed(42)
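
Note that `random.seed` only affects Python's built-in `random` module; NumPy keeps its own generator state. The following is a minimal sketch, assuming later cells draw samples with NumPy (the variable names and the probability `0.12` are illustrative, not taken from the notebook):

```python
import random
import numpy as np

# Seed Python's built-in generator, as in the cell above.
random.seed(42)
first_draw = random.random()

# Re-seeding with the same value reproduces the same draw.
random.seed(42)
second_draw = random.random()

# NumPy has separate state; a seeded Generator makes its draws reproducible too.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
sample_a = rng_a.binomial(n=1, p=0.12, size=5)  # p is an illustrative value
sample_b = rng_b.binomial(n=1, p=0.12, size=5)
```

Two generators created with the same seed produce identical samples, which is what makes a simulation repeatable.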

a. Read in the dataset and take a look at the top few rows here:

In [3]:
#reading the dataset
df = pd.read_csv('ab_testing_data.csv')
df.head()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |
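
By default the `timestamp` column is read as plain strings. If date arithmetic is needed later, it can be parsed while loading. The sketch below uses two rows copied from the preview above in an in-memory buffer instead of the real CSV file; the names `sample_csv`, `sample`, and `earliest` are illustrative:

```python
import io
import pandas as pd

# Two rows copied from the preview above, standing in for the real CSV file.
sample_csv = io.StringIO(
    "user_id,timestamp,group,landing_page,converted\n"
    "851104,2017-01-21 22:11:48.556739,control,old_page,0\n"
    "661590,2017-01-11 16:55:06.154213,treatment,new_page,0\n"
)

# parse_dates converts the named column to a datetime dtype while reading.
sample = pd.read_csv(sample_csv, parse_dates=["timestamp"])

# With a datetime column, comparisons such as min/max work chronologically.
earliest = sample["timestamp"].min()
```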


b. Use the cell below to find the number of rows in the dataset.

In [4]:
#Number of rows
df.shape[0]

294478

c. The number of unique users in the dataset.

In [5]:
#Number of unique users
df['user_id'].nunique()

290584

d. The proportion of users converted.

In [6]:
# Proportion of users converted
# Note: the numerator counts converted rows, while the denominator counts unique users
len(df[df['converted'] == 1]) / df['user_id'].nunique()

0.12126269856564711
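
The ratio above divides a count of rows by a count of unique users, so the two sides use different units whenever a user appears more than once. The sketch below compares three possible definitions on a made-up five-row frame (hypothetical data, not the project dataset):

```python
import pandas as pd

# Hypothetical toy data: user 2 appears twice and converted both times.
toy = pd.DataFrame({
    "user_id":   [1, 2, 2, 3, 4],
    "converted": [0, 1, 1, 0, 0],
})

# Row-level rate: converted rows / all rows.
row_rate = toy["converted"].mean()

# Ratio in the style of the cell above: converted rows / unique users.
mixed_rate = len(toy[toy["converted"] == 1]) / toy["user_id"].nunique()

# User-level rate: users with at least one conversion / unique users.
user_rate = toy.groupby("user_id")["converted"].max().mean()
```

On this toy frame the three definitions give 0.4, 0.5, and 0.25 respectively, so it is worth stating which one a reported conversion rate refers to.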

e. The number of times the new_page and treatment don't match.

In [7]:
# Count the rows where group and landing_page disagree:
#   group is treatment but landing_page is not new_page, or
#   group is not treatment but landing_page is new_page
mismatched = df.query('(group == "treatment" and landing_page != "new_page") or (group != "treatment" and landing_page == "new_page")')
mismatched.shape[0]

3893
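
A cross-tabulation is another way to see where the mismatches come from, since it lists every group and page combination. A minimal sketch on made-up rows (the frame `toy` is hypothetical, not the project data):

```python
import pandas as pd

# Hypothetical toy data with two mismatched rows (rows 1 and 3).
toy = pd.DataFrame({
    "group":        ["control", "control", "treatment", "treatment", "control"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "old_page"],
})

# Cross-tabulate to see every group/page combination at a glance.
counts = pd.crosstab(toy["group"], toy["landing_page"])

# A row is mismatched when exactly one of the two conditions holds.
mismatch = (toy["group"] == "treatment") != (toy["landing_page"] == "new_page")
n_mismatch = int(mismatch.sum())
```

The off-diagonal cells of `counts` (control with new_page, treatment with old_page) add up to the same total as `n_mismatch`.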

f. Do any of the rows have missing values?

In [8]:
#Checking data for Null Values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB
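
In the `info()` output above, every column's non-null count equals the number of entries, so no values are missing. A more direct check counts nulls per column. The sketch below runs it on a small hypothetical frame that does contain a missing value, so the result is visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in 'converted'.
toy = pd.DataFrame({
    "user_id":   [1, 2, 3],
    "converted": [0, np.nan, 1],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
```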


a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz. Store your new dataframe in df2.

In [9]:
#Dropping rows that may have wrong information
df2 = df.drop(df.query('(group == "treatment" & landing_page != "new_page") or (group != "treatment" & landing_page == "new_page")').index)

In [10]:
# Double-check that all mismatched rows were removed; this should be 0
df2[(df2['group'] == 'treatment') != (df2['landing_page'] == 'new_page')].shape[0]

0
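
The same cleanup can also be written as a boolean-mask selection that keeps the aligned rows, instead of dropping the mismatched ones by index. This is a sketch of an equivalent approach on hypothetical data, not the notebook's own method:

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 are mismatched.
toy = pd.DataFrame({
    "group":        ["control", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
    "converted":    [0, 1, 1, 0],
})

# True where group and landing_page agree.
aligned = (toy["group"] == "treatment") == (toy["landing_page"] == "new_page")

# Keep only the aligned rows; copy() avoids modifying a view of the original.
toy_clean = toy[aligned].copy()

# Sanity check: no mismatched rows should remain.
remaining_mismatches = int((~aligned.loc[toy_clean.index]).sum())
```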

a. How many **unique user_ids** are in **df2**?

In [11]:
#Unique users in the new dataframe
df2.user_id.nunique()

290584

b. There is **one user_id** repeated in **df2**. What is it?

In [12]:
#Duplicated user
df2[df2.duplicated(['user_id'])]['user_id']

2893    773192
Name: user_id, dtype: int64
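
With its default `keep='first'`, `duplicated` flags only the second and later occurrences, which is why a single row is listed above. Passing `keep=False` flags every occurrence, so both copies can be compared side by side. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data: user 20 appears twice.
toy = pd.DataFrame({
    "user_id":   [10, 20, 20, 30],
    "converted": [0, 0, 1, 0],
})

# Default keep='first': only the later copy is flagged.
later_copies = toy[toy.duplicated(["user_id"])]

# keep=False: every row sharing a repeated user_id is flagged.
all_copies = toy[toy.duplicated(["user_id"], keep=False)]
```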