05 Oct 2017

# Using Hypothesis Testing to Understand Drivers of Sales

We were contracted by an e-commerce marketplace to help them understand what factors drive sales. They collected data on 10,000 recent product page visits, including how much time was spent on the page, how many reviews the product had, the product rating, and whether the visit ended in a purchase.

### The data

The company has collected data on 10,000 purchases. They know:
* The amount of time in seconds an individual user spent on that page
* The number of product reviews
* The product rating
* Whether the user purchased the product or not.

### The task

My task was to use hypothesis testing to check whether the mean time spent on the page, the mean number of reviews, and the mean product rating differ between users who purchased and users who did not.

The goal of this work was to make a recommendation for how to identify products that will sell better as well as what factors of a product contribute as a sales driver.

In [None]:
# Import packages
import pandas as pd
import numpy as np
from scipy import stats


import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

In [None]:
# Create Pandas dataframe from csv file
results_df = pd.read_csv('./_data/sales_table.csv')

In [None]:
# Display data info
print(results_df.info())
print('\ndf shape:', results_df.shape)
print('\n', results_df.describe())
results_df.head()

You can see that there are 10,000 rows of data and 4 columns, as well as summary statistics for each numeric column.

In [None]:
# Create a boolean mask on the results_df['purchase'] column that's True if the customer bought and False if they didn't.
purchase_mask = results_df['purchase'] == 'yes'
purchase_mask.head()

In [None]:
# The same filter can also be applied inline, without naming the mask.
results_df[results_df['purchase'] == 'yes'].head()

In [None]:
# Create a boolean mask on the results_df['purchase'] column that's False if the customer bought and True if they didn't.
no_purchase_mask = results_df['purchase'] == 'no'
no_purchase_mask.head()

In [None]:
# Use the masks to create new DataFrames
purchase_df = results_df[purchase_mask]
no_purchase_df = results_df[no_purchase_mask]
display(purchase_df.head())
display(no_purchase_df.head())

In [None]:
# Use the masks to create your new DataFrames
# Another way to do it

purchase_df = results_df[results_df['purchase'] == 'yes']
no_purchase_df = results_df[results_df['purchase'] == 'no']
display(purchase_df.head())
display(no_purchase_df.head())

Two dataframes have now been created. One dataframe is for those who purchased, and one dataframe is for those who did not.
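Before plotting, the group means can also be compared in a single step with `groupby`. The sketch below uses a small made-up table (`toy_df` and its values are hypothetical) that only borrows the column names from this post; it is an illustration, not the marketplace data.

```python
import pandas as pd

# Hypothetical toy data that mirrors the column names used in this post.
toy_df = pd.DataFrame({
    'time_on_page_sec': [30.0, 45.0, 50.0, 20.0],
    'num_product_reviews': [3, 8, 10, 1],
    'product_rating': [3.5, 4.5, 4.0, 2.5],
    'purchase': ['no', 'yes', 'yes', 'no'],
})

# One row per purchase outcome, one column per numeric feature.
group_means = toy_df.groupby('purchase').mean()
print(group_means)
```

The same call on `results_df` should give the per-group means that the hypothesis tests further down compare.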

# Create Visualizations:

In [None]:
# Plot both dataframes on top of each other

# Dictionary to convert geeky dataframe column names to fancy AF strings 
col_name_dict = {'time_on_page_sec': 'Time on Page (seconds)',
                 'num_product_reviews': 'Number of Product Reviews',
                 'product_rating': 'The Product Rating'}

fig = plt.figure(figsize=(10,30))
for i, col in enumerate(results_df.columns[:3]):
    fig.add_subplot(5,1,1+i)
    # Add a histogram for the column in the loop from the purchase dataframe.
    plt.hist(purchase_df[col], alpha = 0.5, label="User Purchased", bins=20)
    plt.legend(prop={'size': 10})
    
    # Add a histogram for the column in the loop from the no purchase dataframe.
    plt.hist(no_purchase_df[col], alpha = 0.5, label="User Did Not Purchase",bins=20)
    plt.legend(prop={'size': 10})
    
    # Plot a dotted line at each group's mean.
    # (Passing both linestyle= and ls= is an alias conflict, so only linestyle is used.)
    plt.axvline(purchase_df[col].mean(), color='b', linestyle='dotted', linewidth=2)
    plt.axvline(no_purchase_df[col].mean(), color='r', linestyle='dotted', linewidth=2)
    
    plt.ylabel('Number of Purchases')
    plt.xlabel(col_name_dict[col])

# Save the figure once, after all subplots have been drawn.
plt.savefig('./_assets/01.png', bbox_inches='tight')

The distributions of each numerical column for the two datasets suggest two observations:
* The mean time spent on the page appears to be the same for purchases and non-purchases.
* For number of reviews and product rating, the purchase mean appears higher than the non-purchase mean. Note that the means in each figure are marked by the dotted red/blue lines.

---

## How many purchases? How many non-purchases?


In [None]:
# Function to find number of purchases in case I need it later
def num_purchases(df):
    return len(df)

In [None]:
# Same as above as one line lambda function
num_purchases = lambda df: len(df)

In [None]:
# Print purchase numbers for each dataframe
print('The number of purchases was', num_purchases(purchase_df))
print('The number of non-purchases was', num_purchases(no_purchase_df))

# Hypothesis Testing

In [None]:
# Make function to give results of hypothesis testing by passing the p-value
def hyp_test(pval):
    if pval < 0.05:
        return 'The p-value is less than 0.05. We should reject our null hypothesis.' 
    else:
        return 'The p-value is not less than 0.05. We should NOT reject our null hypothesis.'

## `time_on_page_sec`

> **Null Hypothesis ($H_0$):** The difference in mean time spent on a page between purchases and non-purchases is zero.

> **Alternative Hypothesis ($H_1$):** The difference in mean time spent on a page between purchases and non-purchases is not zero.

In [None]:
# Obtain pvalue.
pvalue = stats.ttest_ind(purchase_df['time_on_page_sec'],no_purchase_df['time_on_page_sec']).pvalue
hyp_test(pvalue)

### Interpretation of the p-value:

The difference in mean time spent on the page between purchases and non-purchases is not statistically significant, so we do not reject the null hypothesis.
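By default, `stats.ttest_ind` assumes both groups share the same variance. When the two groups differ in size and spread, Welch's variant (`equal_var=False`) is a common alternative. The sketch below runs both on synthetic samples (the arrays `buyers` and `non_buyers` are made up for illustration, not drawn from the marketplace data) and recomputes Welch's t statistic by hand as a sanity check.

```python
import numpy as np
from scipy import stats

# Synthetic samples (not the marketplace data): equal means, unequal spread and size.
rng = np.random.default_rng(0)
buyers = rng.normal(loc=60.0, scale=10.0, size=500)
non_buyers = rng.normal(loc=60.0, scale=25.0, size=1500)

# Default call: pooled-variance (Student) t-test.
student = stats.ttest_ind(buyers, non_buyers)

# Welch's variant: each group keeps its own variance estimate.
welch = stats.ttest_ind(buyers, non_buyers, equal_var=False)

# Welch's t statistic computed by hand for comparison.
se = np.sqrt(buyers.var(ddof=1) / buyers.size
             + non_buyers.var(ddof=1) / non_buyers.size)
t_manual = (buyers.mean() - non_buyers.mean()) / se

print('Student p-value:', student.pvalue)
print('Welch p-value:  ', welch.pvalue)
print('Manual Welch t: ', t_manual, 'vs scipy:', welch.statistic)
```

If the two p-values lead to the same decision on the real data, the equal-variance assumption is probably not driving the result.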

---

## `num_product_reviews`

> **Null Hypothesis ($H_0$):** The difference in mean number of product reviews between purchases and non-purchases is zero.

> **Alternative Hypothesis ($H_1$):** The difference in mean number of product reviews between purchases and non-purchases is *not* zero.


In [None]:
# Obtain pvalue.
pvalue = stats.ttest_ind(purchase_df['num_product_reviews'],no_purchase_df['num_product_reviews']).pvalue
hyp_test(pvalue)

### Interpretation of the p-value

The difference in mean number of product reviews between purchases and non-purchases is statistically significant, so we reject the null hypothesis.
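With 10,000 rows, even a small difference in means can come out as statistically significant, so it can help to look at the size of the difference as well. One common measure is Cohen's d, the difference in means divided by the pooled standard deviation. The helper `cohens_d` below is a hypothetical addition, not part of the original analysis, and the inputs are toy numbers.

```python
import numpy as np

def cohens_d(a, b):
    """Difference in means divided by the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((a.size - 1) * a.var(ddof=1)
                  + (b.size - 1) * b.var(ddof=1)) / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy example: means 4 and 3, both with sample standard deviation 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))
```

On the real data, this could be called as `cohens_d(purchase_df['num_product_reviews'], no_purchase_df['num_product_reviews'])`.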

---

## `product_rating`

> **Null Hypothesis ($H_0$):** The difference in mean product rating between purchases and non-purchases is zero.

> **Alternative Hypothesis ($H_1$):** The difference in mean product rating between purchases and non-purchases is *not* zero.

In [None]:
# Obtain your pvalue.
pvalue = stats.ttest_ind(purchase_df['product_rating'],no_purchase_df['product_rating']).pvalue
hyp_test(pvalue)

### Interpretation of the p-value

The difference in mean product rating between purchases and non-purchases is statistically significant, so we reject the null hypothesis.

---

## Conclusion: Sales Drivers


The company should encourage product reviews and product ratings in an effort to increase purchases. In this data, purchases are associated with a higher mean number of reviews and a higher mean product rating than non-purchases.

Based on the data provided, the differences in both factors between purchases and non-purchases are statistically significant.

The time spent on a product page showed no statistically significant difference between purchases and non-purchases.

# Part 2: A/B Testing Using the chi-square test

The buy button on the e-commerce marketplace's product pages is red. However, they wanted to see whether a yellow button would do better. An experiment (an A/B test) was designed to test this.

Here's how the A/B test was set up:

* Control Group: Red Button
* Experimental Group: Yellow Button

The experiment was conducted with 1000 users. Each user was randomly assigned to a group (experimental or control).

#### Chi-square Test

A chi-square test tests the null hypothesis that the observed proportions are the same between the groups. In this case, there is only one proportion: the proportion of purchases. Thus,

> **Null Hypothesis ($H_0$):** The observed proportion of purchases is the same between control and experimental groups.

and:

> **Alternative Hypothesis ($H_1$):** The observed proportion of purchases is not the same between control and experimental groups.
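To make the mechanics concrete, the chi-square statistic compares each observed count with the count expected if the purchase proportion were identical in both groups. The sketch below works this out by hand on a hypothetical 2x2 table (the counts in `observed` are invented, not the experiment's results) and checks it against `stats.chi2_contingency`. Note that SciPy applies Yates' continuity correction to 2x2 tables by default, so `correction=False` is passed here to match the textbook formula.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are groups (control, experimental),
# columns are outcomes (no purchase, purchase).
observed = np.array([[400, 100],
                     [380, 120]])

# Expected counts under the null: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Same statistic from SciPy, with the continuity correction switched off.
chi2_scipy, pval, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)

print('expected counts:\n', expected)
print('manual chi2:', chi2_manual, 'scipy chi2:', chi2_scipy, 'p-value:', pval)
```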

**Conclusion:** Based on the data from the experiment, is the red or the yellow button recommended?

In [None]:
# Create Pandas dataframe from csv file
abdata_df = pd.read_csv('./_data/ab-data.csv')

In [None]:
# Display data info
print(abdata_df.info())
print('\ndf shape:', abdata_df.shape)
print('\n', abdata_df.describe())
abdata_df.head()

In [None]:
# Complete the crosstab to generate the contingency table.
# It should take two arguments, each should be a column.
contingency_table = pd.crosstab(abdata_df['group'], abdata_df['purchase'])
contingency_table

In [None]:
chistat, chipval, dof, exp_p = stats.chi2_contingency(contingency_table)

In [None]:
chipval
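The p-value says whether the two purchase proportions differ, but not which group did better. One way to read the direction is to normalize the contingency table by row, which turns counts into per-group purchase rates. The sketch below does this on a toy frame; `toy_ab`, the group labels, and the `'yes'`/`'no'` values are assumptions for illustration, since the actual contents of `ab-data.csv` are not shown here.

```python
import pandas as pd

# Hypothetical A/B data; the real labels in ab-data.csv may differ.
toy_ab = pd.DataFrame({
    'group': ['control'] * 4 + ['experimental'] * 4,
    'purchase': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no'],
})

# normalize='index' divides each row by its total, giving rates per group.
rates = pd.crosstab(toy_ab['group'], toy_ab['purchase'], normalize='index')
print(rates)
```

Applied to `abdata_df`, the row with the higher purchase rate indicates which button performed better in the experiment.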

## Conclusion of A-B Test

* The p-value is less than 0.05. Thus, we should reject our null hypothesis.

* The observed proportion of purchases is not the same between control and experimental groups.

* The company should adopt the experimental yellow button, provided the contingency table shows that its purchase rate is the higher of the two.