## What is A/B Testing?

A/B testing is a method for testing two or more different ideas against each other in the real world and choosing the one that performs statistically better.

Because users are assigned to the variants at random, it provides accurate answers and a statistically sound way to establish causality.

A/B test process
- Define a hypothesis about the product or business.
- Randomly assign users to two different groups.
- Expose group 1 to the current product rules.
- Expose group 2 to a product that tests the hypothesis.
- Pick whichever performs better according to a set of KPIs.
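
The random-assignment step can be sketched as follows. This is a minimal illustration, not the notebook's own method: the `users` table, the group labels and the seed are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical user table; in practice this would come from your user database.
users = pd.DataFrame({'uid': np.arange(1000)})

# Seeded generator so the assignment is reproducible.
rng = np.random.default_rng(seed=0)

# Randomly assign each user to one of the two groups with equal probability.
users['group'] = rng.choice(['control', 'treatment'], size=len(users))

# Sanity check: the two groups should be of roughly similar size.
print(users['group'].value_counts())
```

Assigning by chance, rather than by signup date or device, is what lets a difference in KPIs be attributed to the change itself.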

KPIs
- A/B tests measure the impact of a change on KPIs (key performance indicators).
- Examples: likelihood of a side effect, revenue, conversion rate.

Data:
    A mobile app company's paid subscription and in-app purchase data

Goal:
    Maintain a high free -> paid conversion rate

In [None]:
# Import the libraries used throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the customer_data
customer_data = pd.read_csv('customer_data.csv')

# Load the app_purchases
app_purchases = pd.read_csv('inapp_purchases.csv')

# Preview the first rows of customer_data
print(customer_data.head())

# Preview the first rows of app_purchases
print(app_purchases.head())

In [None]:
# Merge on the 'uid' and 'date' field
uid_date_combined_data = app_purchases.merge(customer_data, on=['uid', 'date'], how='inner')

# Examine the results 
print(uid_date_combined_data.head())
print(len(uid_date_combined_data))
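
To see what merging on both 'uid' and 'date' does compared with merging on 'uid' alone, here is a small sketch on made-up data. The `customers` and `purchases` frames and their values are invented for illustration only.

```python
import pandas as pd

# Toy stand-ins for customer_data and app_purchases (values are made up).
customers = pd.DataFrame({
    'uid': [1, 2, 3],
    'date': ['2018-01-01', '2018-01-02', '2018-01-03'],
    'device': ['iOS', 'and', 'iOS'],
})
purchases = pd.DataFrame({
    'uid': [1, 1, 2, 4],
    'date': ['2018-01-01', '2018-01-05', '2018-01-02', '2018-01-03'],
    'price': [499, 199, 299, 99],
})

# Inner merge on both keys: keeps only rows where uid AND date match.
by_uid_date = purchases.merge(customers, on=['uid', 'date'], how='inner')

# Inner merge on uid only: keeps every purchase of a known user;
# the two 'date' columns are kept apart as date_x and date_y.
by_uid = purchases.merge(customers, on='uid', how='inner')

print(len(by_uid_date), len(by_uid))
```

Merging on both keys is much stricter, so it is worth checking the row count after the merge, as the cell above does.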

Conversion Rate: percentage of users who subscribe after their free trial. Properties to check when choosing it as a KPI:
    - stability over time
    - importance across different users (generalizability to different demographic groups)
    - correlation with other business metrics

KPI Computation 

In [None]:
# Calculate the mean purchase price 
purchase_price_mean = purchase_data.price.agg('mean')

# Examine the output 
print(purchase_price_mean)

In [None]:
# Calculate the mean and median purchase price 
purchase_price_summary = purchase_data.price.agg(['mean', 'median'])

# Examine the output 
print(purchase_price_summary)

In [None]:
# Calculate the mean and median of price and age
purchase_summary = purchase_data.agg({'price': ['mean', 'median'], 'age': ['mean', 'median']})

# Examine the output 
print(purchase_summary)

In [None]:
# Group the data 
grouped_purchase_data = purchase_data.groupby(by = ['device', 'gender'])

# Aggregate the data
purchase_summary = grouped_purchase_data.agg({'price': ['mean', 'median', 'std']})

# Examine the results
print(purchase_summary)
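
The dict-of-lists form used above produces two-level column names. As an alternative sketch, pandas named aggregation gives flat, self-describing column names. The toy frame below is invented; only the column names 'device', 'gender' and 'price' follow the notebook.

```python
import pandas as pd

# Toy purchase data (values are made up for illustration).
toy_purchases = pd.DataFrame({
    'device': ['iOS', 'iOS', 'and', 'and'],
    'gender': ['F', 'M', 'F', 'F'],
    'price': [100, 300, 200, 400],
})

# Named aggregation: output_column=(input_column, aggregation).
summary = toy_purchases.groupby(['device', 'gender']).agg(
    mean_price=('price', 'mean'),
    median_price=('price', 'median'),
)

print(summary)
```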

## Computing Conversion Rate

Goal: Examine the KPI 'user conversion rate' after the free trial

Week One Conversion Rate: Limit to users who convert in their first week after the trial ends

Maximum lapse date? 
    lapse date: date the trial ends for a given user

In [None]:
from datetime import datetime, timedelta
current_date = pd.to_datetime('2018-03-17')

In [None]:
# What is the maximum lapse date in our data
print(sub_data_demo.lapse_date.max())

Remove users who lapsed today or on any of the prior 7 days, so that every remaining user has had a full week to subscribe.

Conversion Rate = subscribers / users

In [None]:
# Users
# latest lapse date: a week before today
max_lapse_date = current_date - timedelta(days=7)
# restrict to users lapsed before max_lapse_date
conv_sub_data = sub_data_demo[(sub_data_demo.lapse_date < max_lapse_date)]
# count the users remaining in our data
total_users_count = conv_sub_data.price.count()
print(total_users_count)

In [None]:
# Subscriber
# latest subscription date: within 7 days of lapsing
max_sub_date = conv_sub_data.lapse_date + timedelta(days=7)
# filter to users with a non-zero subscription price who subscribed on or before max_sub_date
total_subs = conv_sub_data[
    (conv_sub_data.price > 0) &
    (conv_sub_data.subscription_date <= max_sub_date)
]
# count the users remaining in our data
total_subs_count = total_subs.price.count()
print(total_subs_count)

In [None]:
# calculate the conversion rate with our previous values
conversion_rate = total_subs_count/total_users_count
print(conversion_rate)
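
The three cells above can be folded into one helper. The sketch below is one possible way to do that, run on invented data. The function name `week_one_conversion_rate` and the toy frame are assumptions; the column names follow `sub_data_demo`.

```python
import pandas as pd

def week_one_conversion_rate(df, current_date):
    """Share of eligible users who subscribed within 7 days of their lapse date."""
    week = pd.Timedelta(days=7)
    # Only users who lapsed more than a week ago have had a full week to subscribe.
    eligible = df[df.lapse_date < current_date - week]
    if len(eligible) == 0:
        return float('nan')
    # Subscribers: paid a non-zero price within 7 days of lapsing.
    # Rows with a missing subscription_date compare as False and are excluded.
    converted = eligible[
        (eligible.price > 0) &
        (eligible.subscription_date <= eligible.lapse_date + week)
    ]
    return len(converted) / len(eligible)

# Toy data (made up): five users, one of whom lapsed too recently to count.
toy = pd.DataFrame({
    'lapse_date': pd.to_datetime(
        ['2018-03-01', '2018-03-01', '2018-03-02', '2018-03-05', '2018-03-15']),
    'subscription_date': pd.to_datetime(
        ['2018-03-03', None, '2018-03-15', '2018-03-06', '2018-03-16']),
    'price': [499, 0, 499, 299, 499],
})

rate = week_one_conversion_rate(toy, pd.to_datetime('2018-03-17'))
print(rate)
```

Four users are eligible and two of them subscribed within a week, so the rate on this toy data is 0.5.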

Cohort Conversion Rate

In [None]:
# keep users who lapsed prior to the last 2 weeks
max_lapse_date = current_date - timedelta(days=14)
# filter first, then copy, so that adding columns later modifies an independent DataFrame
conv_sub_data = sub_data_demo[
    (sub_data_demo.lapse_date <= max_lapse_date)
].copy()

Sub Time = how long it took a user to subscribe

In [None]:
# Find the days between lapse and subscription if they subscribed, and NaN otherwise
sub_time = np.where( # equivalent of R's ifelse function
    # if: a subscription date exists
        conv_sub_data.subscription_date.notnull(),
    # then: find how many days since their lapse
        (conv_sub_data.subscription_date - conv_sub_data.lapse_date).dt.days,
    # else: set the value to NaN (a numeric missing value; pd.NaT is meant for datetimes)
        np.nan
)

In [None]:
# create a new column 'sub_time'
conv_sub_data['sub_time'] = sub_time

In [None]:
# gcr7() and gcr14() are user-defined functions that must be defined before this cell runs
# group by the relevant cohorts
purchase_cohorts = conv_sub_data.groupby(by=['gender','device'], as_index=False)
# find the conversion rate for each cohort; pass the function objects, not their names as strings
purchase_cohorts.agg({'sub_time': [gcr7, gcr14]})
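
The notebook does not show how `gcr7` and `gcr14` are defined. The sketch below is one plausible definition, assuming they return the share of a cohort whose `sub_time` is at most 7 or 14 days; both function bodies and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def gcr7(sub_time):
    """Share of users in the cohort who subscribed within 7 days of lapsing."""
    # NaN (never subscribed) compares as False, so those users count as not converted.
    return (sub_time <= 7).mean()

def gcr14(sub_time):
    """Share of users in the cohort who subscribed within 14 days of lapsing."""
    return (sub_time <= 14).mean()

# Toy cohort data (made up): days from lapse to subscription, NaN if never subscribed.
toy = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'F', 'M', 'M'],
    'device': ['iOS', 'iOS', 'iOS', 'iOS', 'and', 'and'],
    'sub_time': [3, 10, np.nan, 20, 1, np.nan],
})

# Function objects are passed directly; pandas names the result columns after them.
rates = toy.groupby(['gender', 'device']).agg({'sub_time': [gcr7, gcr14]})
print(rates)
```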

Think about how long it takes to determine a KPI
1. A monthly KPI takes too long to measure

Relevance to Business goals?

Conversion Rate
- strong measure of growth
- see how changes impact different groups differently

### Average Purchase 

This KPI can provide a sense of the popularity of different in-app purchase price points to users within their first month.

In [None]:
# Compute max_purchase_date 
max_purchase_date = current_date - timedelta(days=28)
# Filter to only include users who registered before our max date
purchase_data_filt = purchase_data[purchase_data.reg_date < max_purchase_date]
# Filter to contain only purchases that occurred within the first 28 days of registration
purchase_data_filt = purchase_data_filt[(purchase_data_filt.date <= 
                        purchase_data_filt.reg_date + timedelta(days=28))]
# Output the mean price paid per purchase
print(purchase_data_filt.price.mean())

We now look at the same KPI, average purchase price, and a similar one, median purchase price, within the first 28 days.
We can calculate these metrics across a set of cohorts and see what differences emerge, which helps us understand how behavior varies across cohorts.

In [None]:
# Set the max registration date to be one month before today
max_reg_date = current_date - timedelta(days=28)

Use np.where to create an array month1 containing:

1. the price of the purchase, if
    - the user's registration date .reg_date was at least 28 days ago (i.e. before max_reg_date), and
    - the purchase date .date fell within 28 days of the registration date .reg_date;
2. NaN, otherwise.

In [None]:
# Find the month 1 values
month1 = np.where((purchase_data.reg_date < max_reg_date) &
                 (purchase_data.date < purchase_data.reg_date + timedelta(days=28)),
                  purchase_data.price, 
                  np.nan)  # np.nan, since the np.NaN alias was removed in NumPy 2.0
# Update the value in the DataFrame 
purchase_data['month1'] = month1

In [None]:
# Aggregate the month1 and price data
# (purchase_data is the DataFrame that received the 'month1' column above)
purchase_summary = purchase_data.agg(
                        {'month1': ['mean', 'median'],
                        'price': ['mean', 'median']})

# Examine the results 
print(purchase_summary)

## Time Series Data

### Parsing dates with strftime format codes

In [None]:
# Saturday January 27, 2017
# Provide the correct format for the date
date_data_one = pd.to_datetime(date_data_one, format='%A %B %d, %Y')
print(date_data_one)

In [None]:
# 2017-08-01
date_data_two = pd.to_datetime(date_data_two, format='%Y-%m-%d')
print(date_data_two)

In [None]:
# 08/17/1978
date_data_three = pd.to_datetime(date_data_three, format='%m/%d/%Y')
print(date_data_three)

In [None]:
# 2016 March 01 01:56
date_data_four = pd.to_datetime(date_data_four, format='%Y %B %d %H:%M')
print(date_data_four)
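
The `date_data_*` variables are not defined in this notebook, so here is a self-contained sketch of the same format strings applied to literal values. The `errors='coerce'` line is an addition for illustration: it turns unparseable entries into NaT instead of raising.

```python
import pandas as pd

# Each format string mirrors the layout of the text being parsed.
iso = pd.to_datetime('2017-08-01', format='%Y-%m-%d')
us_style = pd.to_datetime('08/17/1978', format='%m/%d/%Y')
with_time = pd.to_datetime('2016 March 01 01:56', format='%Y %B %d %H:%M')

# With errors='coerce', values that do not match the format become NaT.
mixed = pd.to_datetime(
    pd.Series(['08/17/1978', 'not a date']),
    format='%m/%d/%Y',
    errors='coerce',
)

print(iso, us_style, with_time)
print(mixed)
```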

### Plot multiple time series

Plotting time series data
In trying to boost purchases, we have made some changes to our introductory in-app purchase pricing. In this exercise, we check if this is having an impact on the number of purchases made by purchasing users during their first week.

The dataset user_purchases has been joined to the demographics data and properly filtered. The column 'first_week_purchases' that is 1 for a first week purchase and 0 otherwise has been added. This column is converted to the average number of purchases made per day by users in their first week.

We will try to view the impact of this change by looking at a graph of purchases.

In [None]:
# Group the data and aggregate first_week_purchases
user_purchases = user_purchases.groupby(by=['reg_date', 'uid']).agg({'first_week_purchases': ['sum']})

# Reset the indexes
user_purchases.columns = user_purchases.columns.droplevel(level=1)
user_purchases.reset_index(inplace=True)

# Find the average number of purchases per day by first-week users
user_purchases = user_purchases.groupby(by=['reg_date']).agg({'first_week_purchases': ['mean']})
user_purchases.columns = user_purchases.columns.droplevel(level=1)
user_purchases.reset_index(inplace=True)

# Plot the results
user_purchases.plot(x='reg_date', y='first_week_purchases')
plt.show()
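
The two-step aggregation above (sum per user, then mean per registration date) can be checked on a tiny made-up table. This sketch uses named aggregation with `as_index=False`, which avoids the `droplevel` and `reset_index` steps; the data is invented.

```python
import pandas as pd

# Toy first-week purchase flags (made up): one row per purchase event.
purchases = pd.DataFrame({
    'reg_date': pd.to_datetime(['2018-01-01'] * 3 + ['2018-01-02'] * 3),
    'uid': [1, 1, 2, 3, 3, 3],
    'first_week_purchases': [1, 1, 0, 1, 1, 1],
})

# Step 1: total first-week purchases per user.
per_user = (purchases
            .groupby(['reg_date', 'uid'], as_index=False)
            .agg(first_week_purchases=('first_week_purchases', 'sum')))

# Step 2: average of those per-user totals for each registration date.
per_day = (per_user
           .groupby('reg_date', as_index=False)
           .agg(first_week_purchases=('first_week_purchases', 'mean')))

print(per_day)
```

On this toy data the first date averages users with 2 and 0 purchases (1.0) and the second date has a single user with 3 purchases (3.0).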

Pivoting our data
There does seem to be an increase in the number of purchases by purchasing users within their first week. Let's now confirm that this is not driven by only one segment of users. We'll do this by pivoting our data first by 'country' and then by 'device'. Our change is designed to impact all of these groups equally.

In [None]:
# Pivot the user_purchases_country table such that we have our first_week_purchases as our values, the country as the column, and our reg_date as the row.
country_pivot = pd.pivot_table(user_purchases_country, values=['first_week_purchases'], columns=['country'], index=['reg_date'])
print(country_pivot.head())

In [None]:
# pivot the user_purchases_device table such that we have our first_week_purchases as our values, the device as the column, and our reg_date as the row.
device_pivot = pd.pivot_table(user_purchases_device, values=['first_week_purchases'], columns=['device'], index=['reg_date'])
print(device_pivot.head())
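
One detail worth knowing about the pivots above: passing `values` as a list produces two-level column names, while passing a single string produces flat ones. A small sketch on made-up data shows the difference.

```python
import pandas as pd

# Toy per-country averages (made up for illustration).
toy = pd.DataFrame({
    'reg_date': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-02']),
    'country': ['USA', 'CAN', 'USA', 'CAN'],
    'first_week_purchases': [1.0, 2.0, 3.0, 4.0],
})

# values as a list: columns become a (value, country) MultiIndex.
nested = pd.pivot_table(toy, values=['first_week_purchases'],
                        columns=['country'], index=['reg_date'])

# values as a string: columns are just the country codes.
flat = pd.pivot_table(toy, values='first_week_purchases',
                      columns='country', index='reg_date')

print(nested.columns.tolist())
print(flat.columns.tolist())
```

In both cases 'reg_date' ends up as the index rather than a column, which matters when choosing the x-axis for a plot.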

Plot by 'country' and then by 'device' and examine the results. If the lift appears across all groups, as designed, that points to the change being the cause of the lift rather than some other event affecting the purchase rate.

In [None]:
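# Plot the average first week purchases for each country by registration date.
# reg_date is the pivot's index (used as the x-axis by default) and the columns are a
# (value, country) MultiIndex, so select the 'first_week_purchases' level before plotting.
country_pivot['first_week_purchases'].plot(y=['USA', 'CAN', 'FRA', 'BRA', 'TUR', 'DEU'])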
plt.show()

In [None]:
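# Plot the average first week purchases for each device by registration date.
# As with the country pivot, reg_date is the index and the columns are a
# (value, device) MultiIndex, so select the 'first_week_purchases' level first.
device_pivot['first_week_purchases'].plot(y=['and', 'iOS'])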
plt.show()

### Removing Noise: Seasonality and moving averages

Next we look at the overall revenue data for our meditation app. We saw strong purchase growth in one of our products, and now we want to see whether that is leading to a corresponding rise in revenue. Revenue is very seasonal, so we want to correct for that to reveal the macro trends.

We will correct for weekly, monthly, and yearly seasonality with rolling averages and plot these over our raw data, which makes the underlying trend much easier to see.

In [None]:
# Compute 7_day_rev
daily_revenue['7_day_rev'] = daily_revenue.revenue.rolling(window=7,center=False).mean()

# Compute 28_day_rev
daily_revenue['28_day_rev'] = daily_revenue.revenue.rolling(window=28,center=False).mean()
    
# Compute 365_day_rev
daily_revenue['365_day_rev'] = daily_revenue.revenue.rolling(window=365,center=False).mean()
    
# Plot date and revenue, along with the 3 rolling averages (in order)
daily_revenue.plot(x='date', y=['revenue', '7_day_rev', '28_day_rev', '365_day_rev'])
plt.show()
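
To see why a 7-day rolling mean removes weekly seasonality, consider a made-up series that repeats the same weekly pattern with no trend. Every full 7-day window contains each weekday exactly once, so the smoothed series is flat. The series below is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic revenue: the same weekly pattern (weekend spike) repeated for 4 weeks.
weekly_pattern = [10, 10, 10, 10, 10, 20, 30]
revenue = pd.Series(
    weekly_pattern * 4,
    index=pd.date_range('2018-01-01', periods=28, freq='D'),
    dtype=float,
)

# A full window is required by default, so the first 6 values are NaN.
smoothed = revenue.rolling(window=7).mean()

# min_periods=1 averages over whatever is available instead of returning NaN.
partial = revenue.rolling(window=7, min_periods=1).mean()

print(smoothed.dropna().unique())
```

The same reasoning explains the leading gap in the plot above: the 365-day average has no value until a full year of data is available.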

We saw that our revenue is somewhat flat over time. We will now dive deeper into the data to see if we can determine why this is the case. We will look at the revenue for a single in-app purchase product to see if this reveals any trends. As this has less data than our overall revenue, it will be much noisier. To account for this, we will smooth the data using an exponentially weighted moving average.

In [None]:
# Calculate 'small_scale'
daily_revenue['small_scale'] = daily_revenue.revenue.ewm(span=10).mean()

# Calculate 'medium_scale'
daily_revenue['medium_scale'] = daily_revenue.revenue.ewm(span=100).mean()

# Calculate 'large_scale'
daily_revenue['large_scale'] = daily_revenue.revenue.ewm(span=500).mean()

# Plot 'date' on the x-axis and, our three averages and 'revenue'
# on the y-axis
daily_revenue.plot(x = 'date', y =['revenue', 'small_scale', 'medium_scale', 'large_scale'])
plt.show()
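
As a sketch of what `span` means: pandas converts it to a smoothing factor alpha = 2 / (span + 1). With `adjust=False` the average follows the simple recursion y_t = (1 - alpha) * y_(t-1) + alpha * x_t, which is checked by hand below. The cells above use the default `adjust=True`, which weights the early observations slightly differently. The three-value series is invented.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0])
span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

# Recursive form of the exponentially weighted mean.
recursive = values.ewm(span=span, adjust=False).mean()

# The same recursion written out by hand.
manual = [values.iloc[0]]
for x in values.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

# Default form (adjust=True): weighted average with weights (1 - alpha)**i.
adjusted = values.ewm(span=span).mean()

print(recursive.tolist(), manual)
print(adjusted.tolist())
```

A larger span gives a smaller alpha, so each new observation moves the average less and the curve is smoother.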

## Putting Everything Together

Visualizing user spending
Recently, the Product team made some big changes to both the Android & iOS apps. They do not have any direct concerns about the impact of these changes, but want you to monitor the data to make sure that the changes don't hurt company revenue. Additionally, the product team believes that some of these changes may impact female users more than male users.

Plot the monthly revenue for one of the updated products and evaluate the results.

In [None]:
# Pivot user_revenue
pivoted_data = pd.pivot_table(user_revenue, values ='revenues', columns=['device', 'gender'], index='month')

# Remove the first and last row of the DataFrame once pivoted to prevent discontinuities from distorting the results.
pivoted_data = pivoted_data.iloc[1:-1]

# Create and show the plot
pivoted_data.plot()
plt.show()