# A/B Test for MuscleHub
***

# Introduction

This project is an A/B test practice project from Codecademy. It concerns a fictional gym, MuscleHub, that wants to gain insight into its visitors. The current process for onboarding new members has three steps. First, the visitor takes a fitness test with a personal trainer. Second, the prospective member fills out an application. Third, the prospective member submits a payment for the first month. Janet, a manager at MuscleHub, thinks that the fitness test intimidates some prospective members, so she set up an A/B test. For a period of time, Janet collected data by randomly assigning visitors to one of two groups: Group A was still asked to take a fitness test with a personal trainer, and Group B skipped the fitness test. Janet's hypothesis is that visitors assigned to Group B will be more likely to eventually purchase a membership than visitors assigned to Group A.

## Project Goals
* Analyze the data from an A/B test with Python
* Answer the following questions:
1. Were approximately the same number of visitors assigned to each group?
1. Was the number of applications submitted affected by which group the visitor was assigned to?
1. What should MuscleHub do: keep the fitness test or get rid of it?


# Data

Janet of MuscleHub has a SQLite database, which contains several tables that will be helpful in this investigation. I have already created a csv file for each table.

Import the four csv files as pandas DataFrames and examine them. Create the following four pandas DataFrames:
* *visits* from the **visits.csv** file, which contains information about potential gym customers who have visited MuscleHub.
* *fitness_tests* from the **fitness_tests.csv** file, which contains information about potential customers in “Group A”, who were given a fitness test.
* *applications* from the **applications.csv** file, which contains information about any potential customers (both “Group A” and “Group B”) who filled out an application. Not everyone in the **visits.csv** file will have filled out an application.
* *purchases* from the **purchases.csv** file, which contains information about customers who purchased a membership to MuscleHub.

### Load data

In [71]:
# Load necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for creating visualizations
import seaborn as sns # for statistical visualizations

%matplotlib inline

In [73]:
# Import and read the csv files
applications = pd.read_csv("../input/musclehub-abtest/applications.csv")
fitness_tests = pd.read_csv("../input/musclehub-abtest/fitness_tests.csv")
purchases = pd.read_csv("../input/musclehub-abtest/purchases.csv")
visits = pd.read_csv("../input/musclehub-abtest/visits.csv")

In [None]:
# To preview the dataframe to see what data we are working with
print("applications :")
applications.head()

In [None]:
print("fitness_test :")
fitness_tests.head()

In [None]:
print("purchases :")
purchases.head()

In [None]:
print("visits :")
visits.head()

## Process Data 

Since we have four separate DataFrames, we need to combine them into a single DataFrame. Note that not all visits in visits.csv occurred during the A/B test, so we first filter out visits dated before 7-1-17 (visits on 7-1-17 itself are kept).
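
One caveat on the date filter: `visit_date` is stored as text such as `7-1-17`, so a `>=` comparison on it is lexical rather than chronological. The sketch below uses four made-up dates (not rows from visits.csv) and assumes the M-D-YY format seen in the data to show where the two comparisons can disagree.

```python
import pandas as pd

# Hypothetical sample dates in the same M-D-YY text format as visit_date
sample = pd.DataFrame({"visit_date": ["6-30-17", "7-1-17", "9-9-17", "10-2-17"]})

# Text comparison goes character by character, so "10-2-17" sorts before "7-1-17"
text_mask = sample["visit_date"] >= "7-1-17"

# Parsing to real dates first gives a chronological comparison
parsed = pd.to_datetime(sample["visit_date"], format="%m-%d-%y")
date_mask = parsed >= pd.Timestamp("2017-07-01")

print(text_mask.tolist())
print(date_mask.tolist())
```

The two masks agree as long as every month has a single digit. If the data contained visits from October onward, parsing the dates before filtering would be the safer choice.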

In [74]:
# Create a new visits DataFrame based on date
visits = visits[visits['visit_date'] >= '7-1-17']

# Merge all four DataFrames
df = visits.merge(fitness_tests,on=['first_name', 'last_name', 'email', 'gender'], how='left').merge(
    applications,on=['first_name', 'last_name', 'email', 'gender'], how='left').merge(
    purchases,on=['first_name', 'last_name', 'email', 'gender'], how='left')

# Examine the new DataFrame
df.head()

| | first_name | last_name | email | gender | visit_date | fitness_test_date | application_date | purchase_date |
|---|---|---|---|---|---|---|---|---|
| 0 | Kim | Walter | KimWalter58@gmail.com | female | 7-1-17 | 2017-07-03 | NaN | NaN |
| 1 | Tom | Webster | TW3857@gmail.com | male | 7-1-17 | 2017-07-02 | NaN | NaN |
| 2 | Edward | Bowen | Edward.Bowen@gmail.com | male | 7-1-17 | NaN | 2017-07-04 | 2017-07-04 |
| 3 | Marcus | Bauer | Marcus.Bauer@gmail.com | male | 7-1-17 | 2017-07-01 | 2017-07-03 | 2017-07-05 |
| 4 | Roberta | Best | RB6305@hotmail.com | female | 7-1-17 | 2017-07-02 | NaN | NaN |
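
The missing values in the merged result come from the `how='left'` merges: every visit is kept, and a visitor with no matching row in another table gets NaN in that table's columns. A minimal sketch with two hypothetical miniature tables (keyed on `email` only, for brevity) illustrates this. The `validate` argument is an optional extra that is not used in the notebook; it makes pandas raise an error if a key is repeated on either side.

```python
import pandas as pd

# Hypothetical miniature versions of two of the tables (not the real data)
visits_demo = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "visit_date": ["7-1-17", "7-1-17", "7-2-17"],
})
tests_demo = pd.DataFrame({
    "email": ["a@example.com", "c@example.com"],
    "fitness_test_date": ["2017-07-03", "2017-07-04"],
})

# how='left' keeps every visit; validate raises if a key repeats on either side
merged_demo = visits_demo.merge(
    tests_demo, on="email", how="left", validate="one_to_one"
)

print(merged_demo)
```

The visitor without a fitness test keeps their row and simply has NaN in `fitness_test_date`.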


In [None]:
# Double check if there are any duplicates after merging
# Selecting duplicate rows except first occurrence based on all columns
duplicate = df[df.duplicated()]
 
print("Duplicate Rows :")
 
# Print the resultant Dataframe
duplicate

We further prepare the DataFrame by adding a column called ab_test_group. Its value is A if fitness_test_date is not NaN (the visitor took a fitness test) and B if fitness_test_date is NaN.

In [76]:
# Create new ab_test_group variable and add to dataframe
df['ab_test_group'] = df.fitness_test_date.apply(lambda x:
                                                'A' if pd.notnull(x) else 'B')

In [77]:
df.head()

| | first_name | last_name | email | gender | visit_date | fitness_test_date | application_date | purchase_date | ab_test_group |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Kim | Walter | KimWalter58@gmail.com | female | 7-1-17 | 2017-07-03 | NaN | NaN | A |
| 1 | Tom | Webster | TW3857@gmail.com | male | 7-1-17 | 2017-07-02 | NaN | NaN | A |
| 2 | Edward | Bowen | Edward.Bowen@gmail.com | male | 7-1-17 | NaN | 2017-07-04 | 2017-07-04 | B |
| 3 | Marcus | Bauer | Marcus.Bauer@gmail.com | male | 7-1-17 | 2017-07-01 | 2017-07-03 | 2017-07-05 | A |
| 4 | Roberta | Best | RB6305@hotmail.com | female | 7-1-17 | 2017-07-02 | NaN | NaN | A |
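
The same labeling rule can also be written without `apply`. The sketch below runs both versions on a small made-up column (not the real data) and uses `np.where` as a vectorized alternative; it is one possible style, not a required change.

```python
import numpy as np
import pandas as pd

# Hypothetical fitness test dates; None stands for a visitor with no test
demo = pd.DataFrame({"fitness_test_date": ["2017-07-03", None, "2017-07-01", None]})

# Row-by-row version, as used in the notebook
via_apply = demo["fitness_test_date"].apply(lambda x: "A" if pd.notnull(x) else "B")

# Vectorized version: pick "A" where a date exists, "B" otherwise
via_where = np.where(demo["fitness_test_date"].notna(), "A", "B")

print(via_where.tolist())
```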


# Analysis



* Null Hypothesis = There will be no difference between Group A and Group B in the proportion of visitors who purchase a membership.
* Alternate Hypothesis = A larger proportion of visitors in Group B than in Group A will purchase a membership.

The significance threshold for deciding whether to reject or fail to reject the null hypothesis is 𝛼 = 0.05.

First, we need to determine whether the A/B test was conducted properly, with the visitors split approximately in half between Group A and Group B.

In [None]:
# Obtain value counts of each group
df['ab_test_group'].value_counts()

In [None]:
# Obtain percentages of each group
df['ab_test_group'].value_counts(normalize=True)
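
To make "approximately in half" more concrete, the split can be compared with what a fair 50/50 assignment would produce. The sketch below is an extra check that is not part of the original analysis. It uses a normal approximation to the binomial distribution, and the group sizes are the row totals implied by the contingency tables used later in this notebook (2504 and 2500).

```python
import math

# Group sizes implied by the row totals of the contingency tables in this notebook
n_a, n_b = 2504, 2500
n = n_a + n_b

# Under a fair 50/50 assignment, the Group A count is approximately
# Normal(mean = n / 2, variance = n / 4)
z = (n_a - n / 2) / math.sqrt(n / 4)

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.3f}, p = {p_value:.3f}")
```

A p-value far above 0.05 here would mean the observed split is consistent with random 50/50 assignment.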

## Visualize the data in a pie chart.

In [None]:
# Create a pie chart of test group, taking the labels from the counts themselves
group_counts = df['ab_test_group'].value_counts().sort_index()
plt.pie(group_counts, labels=group_counts.index, autopct='%0.2f%%')
plt.axis('equal')
plt.show()

The visitors were split approximately in half, so we can proceed with a statistical analysis of the data.
Recall that the sign-up process was the following:
1. Take a fitness test with a personal trainer (only Group A).
2. Fill out an application for the gym.
3. Send in their payment for their first month's membership.

We will determine the percentage of people in each group who complete Step 2.

In [None]:
# Create is_application variable
df['is_application'] = df.application_date.apply(lambda x: 'Application'
                                                  if pd.notnull(x) else 'No Application')

In [None]:
# Create new app_counts DataFrame
app_counts = df.groupby(['ab_test_group', 'is_application'])\
               .first_name.count().reset_index()

# Check DataFrame
app_counts.head()

In [None]:
# Pivot app_counts DataFrame for easier readability
app_pivot = app_counts.pivot(columns='is_application',
                            index='ab_test_group',
                            values='first_name')\
            .reset_index()

# View app_pivot
app_pivot

In [None]:
# Create the total variable
app_pivot['Total'] = app_pivot.Application + app_pivot['No Application']

# Create the percent with application variable
app_pivot['Percent with Application'] = app_pivot.Application / app_pivot.Total

# View app_pivot
app_pivot
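
The groupby, pivot, and percentage steps above can also be done in one step with `pd.crosstab`. The sketch below uses eight made-up visitors (not the real data) to show counts and row-wise rates; it is an alternative style, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical visitors: four per group, with made-up application outcomes
demo = pd.DataFrame({
    "ab_test_group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "is_application": [
        "Application", "No Application", "No Application", "No Application",
        "Application", "Application", "No Application", "No Application",
    ],
})

# Counts per group and outcome
counts = pd.crosstab(demo["ab_test_group"], demo["is_application"])

# normalize="index" turns each row into proportions that sum to 1
rates = pd.crosstab(demo["ab_test_group"], demo["is_application"], normalize="index")

print(counts)
print(rates)
```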

From the pivot table, we see that a larger share of Group B turned in an application: 13.0% compared to about 10.0% for Group A. To determine whether this difference is statistically significant, we run a chi-square test and compare the resulting p-value to the significance threshold of 0.05.

In [None]:
# Import hypothesis test module
from scipy.stats import chi2_contingency

# Calculate the p-value
contingency = [[250, 2254], [325, 2175]]
chi2_contingency(contingency)

A p-value of 0.00096 relative to a significance threshold of 0.05 indicates that there is a statistically significant difference between the two groups. The open question is why visitors in Group B are more likely to turn in an application.
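
The p-value above can be cross-checked by hand. The sketch below computes the chi-square statistic for a 2x2 table using only the standard library, assuming `chi2_contingency` was run with its default continuity (Yates) correction for tables with one degree of freedom. The helper name `chi2_yates_2x2` is made up for this illustration.

```python
import math

def chi2_yates_2x2(table):
    """Chi-square statistic and p-value for a 2x2 table with Yates's correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and outcome were independent
            expected = row_totals[i] * col_totals[j] / n
            # Yates's correction: shrink |observed - expected| by 0.5, never below 0
            diff = max(abs(observed - expected) - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Same contingency table as in the cell above
chi2, p_value = chi2_yates_2x2([[250, 2254], [325, 2175]])
print(f"chi2 = {chi2:.3f}, p = {p_value:.5f}")
```

Each expected count is the row total times the column total divided by the grand total, which is the count we would see if application rates were identical in both groups.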


Of those who picked up an application, how many purchased a membership?

In [None]:
# Create an is_member variable
df['is_member'] = df.purchase_date.apply(lambda x: 'Member' if pd.notnull(x) else 'Not Member')

# Create the just_apps DataFrame
just_apps = df[df.is_application == 'Application']

# Create member_count DataFrame
member_count = just_apps.groupby(['ab_test_group', 'is_member'])\
                 .first_name.count().reset_index()

# Pivot member_count
member_pivot = member_count.pivot(columns='is_member',
                                  index='ab_test_group',
                                  values='first_name')\
                           .reset_index()

# Create the Total variable
member_pivot['Total'] = member_pivot.Member + member_pivot['Not Member']

# Create the Percent Purchase variable
member_pivot['Percent Purchase'] = member_pivot.Member / member_pivot.Total
member_pivot

In [None]:
# Calculate the p-value
contingency = [[200, 50], [250, 75]]
chi2_contingency(contingency)

Here we test whether the difference between the following groups is statistically significant:

* The customers that picked up an application and took a fitness test.
* The customers that did not take a fitness test and picked up an application.

Among visitors who picked up an application, those who took the fitness test purchased a membership at a slightly higher rate (80% versus about 77%). However, a p-value of 0.432 relative to a significance threshold of 0.05 does not reflect a statistically significant difference between the two groups, so we fail to reject the null hypothesis.

Previously, we looked at what percentage of people who picked up applications purchased memberships.
Now, we determine what percentage of ALL visitors purchased memberships.

In [None]:
# Create final_member_count DataFrame
final_member_count = df.groupby(['ab_test_group', 'is_member'])\
                 .first_name.count().reset_index()
# Pivot final_member_count
final_member_pivot = final_member_count.pivot(columns='is_member',
                                  index='ab_test_group',
                                  values='first_name')\
                           .reset_index()

# Create the Total variable
final_member_pivot['Total'] = final_member_pivot.Member + final_member_pivot['Not Member']

# Create the Percent Purchase variable
final_member_pivot['Percent Purchase'] = final_member_pivot.Member / final_member_pivot.Total
final_member_pivot

# Calculate the p-value
contingency = [[200, 2304], [250, 2250]]
chi2_contingency(contingency)

Previously, when we only considered people who had already picked up an application, we saw that there was no significant difference in membership between Group A and Group B.

Now, when we consider all people who visit MuscleHub, we see that there might be a significant difference in memberships between Group A and Group B.

A p-value of 0.0147 relative to a significance threshold of 0.05 indicates that there is a statistically significant difference between the two groups. We therefore reject the null hypothesis, and the data support Janet's hypothesis that visitors assigned to Group B are more likely to eventually purchase a membership to MuscleHub than visitors assigned to Group A.

However, it is worth noting that among customers who filled out an application, those who completed a fitness test (Group A) purchased at a slightly higher rate than those who did not (Group B), although that difference was not statistically significant.
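
The chi-square test on a 2x2 table does not take direction into account, while Janet's hypothesis is directional (Group B higher than Group A). One possible complement, sketched here as a swapped-in technique that the notebook itself does not use, is a one-sided two-proportion z-test. The counts come from the all-visitors contingency table above.

```python
import math

# Members and visitors per group, taken from the all-visitors contingency table
members_a, visitors_a = 200, 200 + 2304
members_b, visitors_b = 250, 250 + 2250

rate_a = members_a / visitors_a
rate_b = members_b / visitors_b

# Pooled purchase rate and standard error under the null hypothesis of equal rates
pooled = (members_a + members_b) / (visitors_a + visitors_b)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

# z-score of the observed difference (Group B minus Group A)
z = (rate_b - rate_a) / std_err

# One-sided p-value: upper tail of the standard normal distribution
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))

print(f"rate A = {rate_a:.3f}, rate B = {rate_b:.3f}")
print(f"z = {z:.3f}, one-sided p = {p_one_sided:.4f}")
```

This version has no continuity correction, so its p-value is not expected to match the chi-square result exactly.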

## Visualize the results
Create visualizations for Janet that show the difference between Group A (people who were given the fitness test) and Group B (people who were not given the fitness test) at each stage of the process:

* Percent of visitors who apply.
* Percent of applicants who purchase a membership.
* Percent of visitors who purchase a membership.

In [None]:
# Percent of Visitors who Apply
ax = plt.subplot()
plt.bar(range(len(app_pivot)),
       app_pivot['Percent with Application'].values)
ax.set_xticks(range(len(app_pivot)))
ax.set_xticklabels(['Fitness Test', 'No Fitness Test'])
ax.set_yticks([0, 0.05, 0.10, 0.15, 0.20])
ax.set_yticklabels(['0%', '5%', '10%', '15%', '20%'])
plt.show()
# plt.savefig('percent_visitors_apply.png')

In [None]:
# Percent of Applicants who Purchase
ax = plt.subplot()
plt.bar(range(len(member_pivot)),
       member_pivot['Percent Purchase'].values)
ax.set_xticks(range(len(member_pivot)))
ax.set_xticklabels(['Fitness Test', 'No Fitness Test'])
ax.set_yticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
ax.set_yticklabels(['0%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%', '90%', '100%'])
plt.show()
# plt.savefig('percent_apply_purchase.png')

In [None]:
# Percent of Visitors who Purchase
ax = plt.subplot()
plt.bar(range(len(final_member_pivot)),
       final_member_pivot['Percent Purchase'].values)
ax.set_xticks(range(len(final_member_pivot)))
ax.set_xticklabels(['Fitness Test', 'No Fitness Test'])
ax.set_yticks([0, 0.05, 0.10, 0.15, 0.20])
ax.set_yticklabels(['0%', '5%', '10%', '15%', '20%'])
plt.show()
# plt.savefig('percent_visitors_purchase.png')

# Conclusion

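The A/B test supports Janet's hypothesis. Visitors were split almost evenly between the two groups. Visitors who skipped the fitness test (Group B) submitted applications at a higher rate than those who took it (13.0% versus about 10.0%, p-value of 0.00096). Among applicants, the purchase rates of the two groups did not differ significantly (p-value of 0.432). Across all visitors, Group B purchased memberships at a higher rate than Group A (10.0% versus about 8.0%, p-value of 0.0147). Based on these results, the recommendation is that MuscleHub get rid of the fitness test in its onboarding process.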

# Challenge Assignment
Create a wordcloud visualization that can be used in an ad for MuscleHub with the data in the interviews.txt file.

In [None]:
# Import modules
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator # for creating wordclouds
from collections import Counter  # for counting objects
from matplotlib.pyplot import figure # to create a figure in matplotlib

In [None]:
# Open and read the interviews.txt file (the with block closes the file afterwards)
with open("../input/musclehub-abtest/interviews.txt", encoding='utf8') as interviews:
    txtContent = interviews.read()
print("The content of the text file is:", txtContent)

In [None]:
# Print the number of words (len(txtContent) alone would count characters)
print('There are {} words in the total interviews.txt file.'.format(len(txtContent.split())))

In [None]:
# Create a wordcloud object
wordcloud = WordCloud(width=2500, height=1250).generate(txtContent)

# Display the wordcloud with MatplotLib and save figure
figure(num=None, figsize=(20, 16), facecolor='w', edgecolor='k')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
# plt.savefig('response_data/responses_wordcloud.png')
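
`Counter` is imported above but not used. One possible use is to inspect the most frequent words before rendering the cloud. The sketch below runs on a made-up sentence rather than the real interviews.txt text, and its stopword set is a tiny illustrative list, not the `STOPWORDS` set from the wordcloud package.

```python
from collections import Counter
import re

# Hypothetical stand-in for txtContent (not the real interviews.txt text)
sample_text = "I love the trainers. The trainers are friendly and the gym is clean."

# Tiny illustrative stopword list; wordcloud's STOPWORDS set is much larger
stopwords = {"i", "the", "are", "and", "is"}

# Lowercase the text, pull out the words, and drop the stopwords
words = re.findall(r"[a-z']+", sample_text.lower())
counts = Counter(w for w in words if w not in stopwords)

print(counts.most_common(3))
```

A frequency mapping like `counts` can also be passed to `WordCloud.generate_from_frequencies` when more control over which words appear is wanted.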