Project description

You are an analyst at a big online store. Together with the marketing department, you've compiled a list of hypotheses that may help boost revenue.<br>
**You need to prioritize these hypotheses, launch an A/B test, and analyze the results.**

------------------------------------

Description of the data
Data used in the first part of the project
/datasets/hypotheses_us.csv

    Hypotheses — brief descriptions of the hypotheses
    Reach — user reach, on a scale of one to ten
    Impact — impact on users, on a scale of one to ten
    Confidence — confidence in the hypothesis, on a scale of one to ten
    Effort — the resources required to test a hypothesis, on a scale of one to ten. The higher the Effort value, the more resource-intensive the test.
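
These four scores are the inputs to the ICE and RICE frameworks named in the research plan. A minimal sketch, assuming the common definitions ICE = Impact × Confidence / Effort and RICE = Reach × Impact × Confidence / Effort; the three hypotheses and their scores are invented for illustration and are not taken from hypotheses_us.csv:

```python
import pandas as pd

# Hypothetical scores, for illustration only (not the real dataset)
toy = pd.DataFrame({
    'hypothesis': ['H1', 'H2', 'H3'],
    'reach': [3, 10, 5],
    'impact': [10, 5, 6],
    'confidence': [8, 8, 5],
    'effort': [4, 5, 3],
})

# Assumed definitions: ICE = impact * confidence / effort, RICE = reach * ICE
toy['ice'] = toy['impact'] * toy['confidence'] / toy['effort']
toy['rice'] = toy['reach'] * toy['ice']

# Rank hypotheses in descending order of priority under each framework
ice_order = toy.sort_values('ice', ascending=False)['hypothesis'].tolist()
rice_order = toy.sort_values('rice', ascending=False)['hypothesis'].tolist()
print(ice_order, rice_order)
```

In this toy example H2 ranks last under ICE but first under RICE, because it has the highest Reach score.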

Data used in the second part of the project
/datasets/orders_us.csv

    transactionId — order identifier
    visitorId — identifier of the user who placed the order
    date — date of the order
    revenue — revenue from the order
    group — the A/B test group that the user belongs to

/datasets/visits_us.csv

    date — date
    group — A/B test group
    visits — the number of visits on the specified date in the specified A/B test group
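
As a sketch of how the orders and visits tables fit together, daily conversion per group can be computed by counting orders per date and group and dividing by the number of visits. The rows below are made up for illustration; only the column names follow the descriptions above:

```python
import pandas as pd

# Made-up rows, for illustration only
orders = pd.DataFrame({
    'transactionId': [1, 2, 3, 4],
    'visitorId': [11, 12, 13, 14],
    'date': ['2019-08-01', '2019-08-01', '2019-08-01', '2019-08-02'],
    'revenue': [30.0, 20.0, 50.0, 10.0],
    'group': ['A', 'A', 'B', 'B'],
})
visits = pd.DataFrame({
    'date': ['2019-08-01', '2019-08-01', '2019-08-02', '2019-08-02'],
    'group': ['A', 'B', 'A', 'B'],
    'visits': [100, 50, 80, 40],
})

# Parse dates so both tables share a datetime key
orders['date'] = pd.to_datetime(orders['date'])
visits['date'] = pd.to_datetime(visits['date'])

# Orders and revenue per day and group
daily_orders = (
    orders.groupby(['date', 'group'], as_index=False)
          .agg(orders=('transactionId', 'nunique'), revenue=('revenue', 'sum'))
)

# A left merge keeps days that have visits but no orders
daily = visits.merge(daily_orders, on=['date', 'group'], how='left')
daily[['orders', 'revenue']] = daily[['orders', 'revenue']].fillna(0)
daily['conversion'] = daily['orders'] / daily['visits']
print(daily)
```

The left merge matters here: a day with visits and no orders should show a conversion of zero, not disappear from the table.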

 

 Research plan:
<a class="anchor" id="table_of_contents"></a>

[Table of Contents](#table_of_contents)

1. [Look at general information about the data](#general_information)
    * Load data
    * Explore data
2. [Preprocess](#preprocess)
    * Replace the column names (make them lowercase).
    * Convert the data to the required types.
    * Describe the columns where the data types have been changed and why.
    * If necessary, decide how to deal with missing values:
        * Explain why you filled in the missing values as you did or why you decided to leave them blank.
        * Why do you think the values are missing? Give possible reasons.
        * Pay attention to the abbreviation TBD (to be determined). Specify how you intend to handle such cases.
    * Calculate the total sales (the sum of sales in all regions) for each game and put these values in a separate column.
3. [Prioritizing Hypotheses:](#Prioritizing_Hypotheses)
    * Apply the ICE framework to prioritize hypotheses. Sort them in descending order of priority.
    * Apply the RICE framework to prioritize hypotheses. Sort them in descending order of priority.
    * Show how the prioritization of hypotheses changes when you use RICE instead of ICE. Provide an explanation for the changes.
4. [A/B Test Analysis](#A/B_Test_Analysis)
    * Graph cumulative revenue by group. Make conclusions and conjectures.
    * Graph cumulative average order size by group. Make conclusions and conjectures.
    * Graph the relative difference in cumulative average order size for group B compared with group A. Make conclusions and conjectures.
    * Calculate each group's conversion rate as the ratio of orders to the number of visits for each day. Plot the daily conversion rates of the two groups and describe the difference. Draw conclusions and make conjectures.
    * Plot a scatter chart of the number of orders per user. Make conclusions and conjectures.
    * Calculate the 95th and 99th percentiles for the number of orders per user. Define the point at which a data point becomes an anomaly.
    * Plot a scatter chart of order prices. Make conclusions and conjectures.
    * Calculate the 95th and 99th percentiles of order prices. Define the point at which a data point becomes an anomaly.
    * Find the statistical significance of the difference in conversion between the groups using the raw data. Make conclusions and conjectures.
    * Find the statistical significance of the difference in average order size between the groups using the raw data. Make conclusions and conjectures.
    * Find the statistical significance of the difference in conversion between the groups using the filtered data. Make conclusions and conjectures.
    * Find the statistical significance of the difference in average order size between the groups using the filtered data. Make conclusions and conjectures.
    * Make a decision based on the test results. The possible decisions are: 1. Stop the test, consider one of the groups the leader. 2. Stop the test, conclude that there is no difference between the groups. 3. Continue the test.
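
The percentile steps in the plan can be sketched with `np.percentile`. The order counts below are invented, and using the 95th percentile as the cut-off is an assumption made here for illustration; the plan leaves that choice to the analysis:

```python
import numpy as np
import pandas as pd

# Invented orders-per-user counts, for illustration only
orders_per_user = pd.Series([1] * 90 + [2] * 7 + [3] * 2 + [11])

# 95th and 99th percentiles (NumPy's default linear interpolation)
p95, p99 = np.percentile(orders_per_user, [95, 99])

# Assumption: users above the 95th percentile are treated as anomalies
threshold = p95
anomalous_users = orders_per_user[orders_per_user > threshold]
print(p95, p99, len(anomalous_users))
```

The same pattern applies to order prices: compute the two percentiles on the `revenue` column and pick a cut-off between them.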

Here’s what project reviewers look for when assessing your project:

    How you prepare the data for analysis
    How you prioritize hypotheses
    How you interpret the resulting graphs
    How you calculate statistical significance
    What conclusions you draw based on the A/B test results
    Whether you follow the project structure and keep the code tidy
    The conclusions you make
    Whether you leave comments at each step
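
For the statistical significance steps, one common choice for skewed data such as order sizes is the Mann-Whitney U test from `scipy.stats`. The plan does not name a test, so this choice, the 0.05 significance level, and the sample values below are all assumptions:

```python
import pandas as pd
from scipy import stats

# Invented order revenues per group, for illustration only
revenue_a = pd.Series([10, 12, 15, 20, 22, 25, 30, 35])
revenue_b = pd.Series([11, 14, 18, 21, 24, 28, 33, 400])

alpha = 0.05  # assumed significance level

# Two-sided test: H0 is that both samples come from the same distribution
result = stats.mannwhitneyu(revenue_a, revenue_b, alternative='two-sided')

# Relative difference in mean order size, group B versus group A
relative_diff = revenue_b.mean() / revenue_a.mean() - 1

print(f'p-value: {result.pvalue:.3f}')
print(f'relative difference in mean order size: {relative_diff:.3f}')
print('reject H0' if result.pvalue < alpha else 'fail to reject H0')
```

The single very large order in the invented group B inflates the difference in means, which illustrates why the plan repeats each test on filtered data.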


<a class='anchor' id='general_information'></a>
[Go back to the Table of Contents](#table_of_contents)
## [General information](#general_information)

In [2]:
!pip install plotly==5.1.0 

Collecting plotly==5.1.0
  Downloading plotly-5.1.0-py2.py3-none-any.whl (20.6 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.1.0 tenacity-8.0.1


In [3]:
pip install sidetable

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install squarify

Collecting squarify
  Downloading squarify-0.4.3-py3-none-any.whl (4.3 kB)
Installing collected packages: squarify
Successfully installed squarify-0.4.3
Note: you may need to restart the kernel to use updated packages.


In [5]:
# Load relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import sidetable
import squarify
import warnings
import plotly.express as px
import plotly.graph_objects as go
warnings.simplefilter('ignore')

from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')

In [13]:
# Load data
try:
    visits_df = pd.read_csv('./visits_us.csv')
    orders_df = pd.read_csv('./orders_us.csv')
    hypotheses_df = pd.read_csv('./hypotheses_us.csv')
except FileNotFoundError:
    # Fall back to the platform paths listed in the data description
    visits_df = pd.read_csv('/datasets/visits_us.csv')
    orders_df = pd.read_csv('/datasets/orders_us.csv')
    hypotheses_df = pd.read_csv('/datasets/hypotheses_us.csv')

dfs = {'visits_df': visits_df, 'orders_df': orders_df, 'hypotheses_df': hypotheses_df}
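
The local-then-`/datasets/` fallback can also be written as a small helper that tries each candidate path in turn. `read_first_existing` is a hypothetical name introduced here, and the demo writes a throwaway CSV to a temporary directory instead of touching the project files:

```python
import os
import tempfile

import pandas as pd


def read_first_existing(candidates):
    """Return a DataFrame read from the first path in `candidates` that exists."""
    for path in candidates:
        if os.path.exists(path):
            return pd.read_csv(path)
    raise FileNotFoundError(f'None of these files exist: {candidates}')


# Demo with a throwaway file (hypothetical data, not the project datasets)
with tempfile.TemporaryDirectory() as tmp:
    real_path = os.path.join(tmp, 'demo.csv')
    pd.DataFrame({'date': ['2019-08-01'], 'visits': [100]}).to_csv(real_path, index=False)
    # The first candidate is missing, so the helper falls through to the second
    demo_df = read_first_existing([os.path.join(tmp, 'missing.csv'), real_path])

print(demo_df.shape)
```

Unlike a bare `except:`, this raises a single `FileNotFoundError` that lists every path that was tried, which makes a wrong file name easier to spot.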

Let's explore the data:

In [None]:
for df_name in dfs:
    print(df_name)
    dfs[df_name].info()
    display(dfs[df_name].head())
    print()

Let's look for duplicates and missing values:

In [None]:
for df_name, df in dfs.items():
    print(df_name)
    display(df[df.duplicated()])  # fully duplicated rows
    display(df.isna().sum())      # missing values per column
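
A compact way to summarize both checks for a table is to count fully duplicated rows and missing values per column. The toy frame below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented rows, for illustration only: one duplicated row, one missing value
toy = pd.DataFrame({
    'visitorId': [1, 2, 2, 3],
    'revenue': [10.0, 20.0, 20.0, np.nan],
})

# duplicated() marks every repeat of an earlier identical row
n_duplicates = int(toy.duplicated().sum())

# isna().sum() counts missing values column by column
na_per_column = toy.isna().sum()

print(n_duplicates)
print(na_per_column)
```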
