# A/B testing. Mobile game "Cookie Cats"

<img src="https://tactilegames.com/wp-content/uploads/2018/05/cookie-cats.png" height=400 width=800>

# Intro

Cookie Cats is a hugely popular mobile puzzle game. Players are supposed to connect tiles of the same color to clear the board and win the game.

While progressing through the game, players encounter "gates" which force them to wait for some time before moving on to the next level, or to make an in-app purchase. Sometimes players are unwilling to do either and delete the game.

The goal of this A/B test is to find out whether moving the gate from level 30 to level 40 is worth doing and does not harm the retention rate.

# 1. Loading raw data & EDA

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px  # the standalone plotly_express package is deprecated

df = pd.read_csv('../input/mobile-games-ab-testing/cookie_cats.csv')
df.head()

<p>The data is from 90,189 players who installed the game while the A/B test was running.
When a player installed the game, they were randomly assigned to one of two versions: <code>gate_30</code> or <code>gate_40</code>.</p>
<p>The variables are:</p>
<ul>
<li><code>userid</code> - a unique number that identifies each player.</li>
<li><code>version</code> - whether the player was put in the control group (<code>gate_30</code> - a gate at level 30) or the test group (<code>gate_40</code> - a gate at level 40).</li>
<li><code>sum_gamerounds</code> - the number of game rounds played by the player during the first week after installation.</li>
<li><code>retention_1</code> - did the player come back and play 1 day after installing?</li>
<li><code>retention_7</code> - did the player come back and play 7 days after installing?</li>
</ul>

In [None]:
df.info()

In [None]:
# Checking whether any user appears more than once (e.g. was falsely attributed to both groups)

round_cnts = df['userid'].value_counts(ascending=False)
duplicate_users = round_cnts[round_cnts > 1].count()
print(f'There are {duplicate_users} duplicate users in the dataset.')

Luckily, there are none.
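
The check above counts repeated `userid` values. A slightly more targeted variant asks whether any user appears under more than one `version` label. The following is a minimal sketch on a tiny made-up frame; the helper name `users_in_multiple_groups` and the toy ids are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

def users_in_multiple_groups(df, user_col="userid", group_col="version"):
    """Return the user ids that appear under more than one group label."""
    groups_per_user = df.groupby(user_col)[group_col].nunique()
    return groups_per_user[groups_per_user > 1].index.tolist()

# Tiny synthetic frame (made-up ids), only to exercise the helper
toy = pd.DataFrame({
    "userid": [1, 2, 3, 3, 4],
    "version": ["gate_30", "gate_40", "gate_30", "gate_40", "gate_40"],
})
leaked = users_in_multiple_groups(toy)
print(leaked)
```

On the real data the same call would be `users_in_multiple_groups(df)`; an empty list means no user was assigned to both groups.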

In [None]:
# Number of players in each group

df.groupby('version').count()

<p>The number of players is roughly the same in the control and test groups, so the split looks balanced enough for testing.</p>
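
"Roughly the same" can also be checked formally with a sample ratio mismatch test. Below is a minimal sketch using a chi-square goodness-of-fit test against an equal split. The helper name `sample_ratio_check`, the strict `alpha=0.001` threshold, and the group sizes are all assumptions made for illustration; the counts are not taken from the dataset.

```python
from scipy.stats import chisquare

def sample_ratio_check(group_sizes, alpha=0.001):
    """Chi-square goodness-of-fit test of the group sizes against an equal split."""
    # With no expected frequencies given, chisquare assumes equal counts per group
    stat, p_value = chisquare(list(group_sizes))
    return {"chi2": stat, "p_value": p_value, "mismatch": bool(p_value < alpha)}

# Made-up group sizes, for illustration only
balanced = sample_ratio_check([5030, 4970])
skewed = sample_ratio_check([6000, 4000])
print(balanced)
print(skewed)
```

With the notebook's data, the input could be `df['version'].value_counts()`. A flagged mismatch would suggest a problem with the random assignment rather than with the game itself.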

# 2. Looking for insights & handling outliers

Below is a boxplot showing the distribution of the number of rounds played per person. This will help us find outliers.

In [None]:
fig = px.box(df, y="sum_gamerounds", width=600, height=400)
fig.show()

We clearly have an outlier here: a person who played roughly 50k rounds in a week. We need to exclude this player from further analysis. Let's see the percentiles.

In [None]:
df.describe([0.01, 0.05, 0.10, 0.20, 0.80, 0.90, 0.95, 0.99])[["sum_gamerounds"]]

In [None]:
# Removing the single outlier

df = df[df.sum_gamerounds < df.sum_gamerounds.max()].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
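
The cell above drops only the single largest value. If several extreme values were a concern, one alternative would be to trim everything above a chosen upper quantile. This is a sketch of that alternative, not what this notebook does; the helper name `drop_above_quantile`, the cutoff `q`, and the toy data are assumptions.

```python
import pandas as pd

def drop_above_quantile(df, column, q=0.999):
    """Keep rows whose `column` value is at or below the q-th quantile."""
    cutoff = df[column].quantile(q)
    return df[df[column] <= cutoff].copy(), cutoff

# Synthetic data: 99 ordinary players plus one extreme value (made up)
toy = pd.DataFrame({"sum_gamerounds": list(range(1, 100)) + [50000]})
trimmed, cutoff = drop_above_quantile(toy, "sum_gamerounds", q=0.99)
print(len(toy), "->", len(trimmed), "rows; cutoff =", cutoff)
```

A quantile cutoff removes a fixed share of the heaviest players, so it is worth checking how many rows are lost before adopting it.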

In [None]:
# Flag players who came back on both day 1 and day 7 (1 = retained on both days, 0 = otherwise)
df["Retention"] = np.where((df.retention_1 == True) & (df.retention_7 == True), 1, 0)
df.groupby(["version", "Retention"])["sum_gamerounds"].agg(["count", "median", "mean", "std", "max"])

In [None]:
df.groupby("version").sum_gamerounds.agg(["count", "median", "mean", "std", "max"])

In [None]:
zero_rounds = df[df['sum_gamerounds'] == 0].userid.count()
print(f'{zero_rounds} players played no rounds at all within the first week of downloading the game.')

As we can see, <b>50%</b> of players played fewer than 16 rounds during the first week after installation, and the 75th percentile is 51 rounds in the first week (a figure worth keeping in mind as the "normal" number of rounds).

As many as <b>~4000</b> players did not play a single round after installation. Possible reasons:
<ul>
<li>They downloaded several games "for fun" and opted for another one.</li>
<li>They were confused by the gameplay (because it seemed too complicated).</li>
<li>They never even opened the app.</li>
<li>The server was down while the game was loading.</li>
</ul>

Also, <b>25%</b> of players played <b>fewer than 5</b> rounds, which is also a lot; the company producing the game should try to collect feedback at an early stage (e.g. with an in-app survey).

# 3. A/B Testing

Steps:
* Split the groups into **control** (A) and **test** (B)
* Run the Shapiro–Wilk test for normality on each group
    * If either group is not normally distributed, use the Mann-Whitney U test (non-parametric test)
    * If both groups are normally distributed, run the Levene test for homogeneity of variances
        * If the variances are homogeneous, use the T-test (parametric test)
        * If the variances are not homogeneous, use the Welch test (parametric test)

In [None]:
df["version"] = np.where(df.version == "gate_30", "A", "B")
df.head()

**A quick QQ-plot to check the normality of the distribution before we dive into the detailed testing.**

In [None]:
import matplotlib.pyplot as plt
from scipy import stats

groupA = df[df['version'] == "A"]["sum_gamerounds"]
groupB = df[df['version'] == "B"]["sum_gamerounds"]
    
fig = plt.figure()
ax = fig.add_subplot()
res = stats.probplot(groupA, plot=ax)  # QQ-plot of group A against a normal distribution

In [None]:
fig = plt.figure()
ax = fig.add_subplot()
res = stats.probplot(groupB, plot=ax)  # QQ-plot of group B against a normal distribution

The distribution is far from normal, but we will still embed a similar check in the function below.

In [None]:
def AB_testing(df, group, target, alpha=0.05):
    """Compare `target` between groups "A" and "B" and return a one-row summary."""
    from scipy import stats

    groupA = df[df[group] == "A"][target]
    groupB = df[df[group] == "B"][target]

    # Normality test (Shapiro-Wilk)
    # H0: the distribution is normal; H1: the distribution is not normal
    # Note: SciPy warns that the p-value may be inaccurate for N > 5000
    not_normal_A = stats.shapiro(groupA)[1] < alpha  # True -> reject normality
    not_normal_B = stats.shapiro(groupB)[1] < alpha
    parametric = (not not_normal_A) and (not not_normal_B)

    if parametric:
        # Homogeneity of variances test (Levene)
        # H0: variances are homogeneous; H1: variances are heterogeneous
        heterogeneous = stats.levene(groupA, groupB)[1] < alpha

        # equal_var=True -> Student's t-test, equal_var=False -> Welch's t-test
        p_value = stats.ttest_ind(groupA, groupB, equal_var=not heterogeneous)[1]
    else:
        # Non-parametric test (Mann-Whitney U); two-sided to match H1: A != B
        p_value = stats.mannwhitneyu(groupA, groupB, alternative="two-sided")[1]

    # Result
    # H0: A == B (no difference in `target`); H1: A != B
    reject_h0 = p_value < alpha
    result = pd.DataFrame({
        "Test Type": ["Parametric (normal dist.)" if parametric else "Non-Parametric (not normal dist.)"],
        "A/B Hypothesis": ["Reject H0, accept H1." if reject_h0 else "Do not reject H0."],
        "p-value": [p_value],
        "Comment": ["Difference in A/B is statistically significant." if reject_h0
                    else "A/B groups are similar in performance."],
    })

    # Columns: the Levene result is only meaningful on the parametric path
    if parametric:
        result.insert(1, "Homogeneity", "No" if heterogeneous else "Yes")

    # Print Hypothesis
    print("# A/B hypotheses testing:")
    print("H0: A == B")
    print("H1: A != B", "\n")

    return result


# Apply A/B Testing
AB_testing(df=df, group="version", target="sum_gamerounds")

# 4. Interpreting results

The analysis indicates a statistically significant difference between the two groups in the number of game rounds played when the first gate is moved from level 30 to level 40. Note that the test was run on `sum_gamerounds`, so it measures engagement and serves only as a proxy for `retention_1` and `retention_7`.

The conclusion is as follows: if we want to keep the retention rate high, we should not move the gate from level 30 to level 40.
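
Since the conclusion is phrased in terms of retention, it may be worth comparing the retention flags between the groups directly. The following is a minimal sketch using a chi-square test of independence on a version-by-retention table. The helper name `compare_retention` and the toy numbers are assumptions for illustration, not results from this dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def compare_retention(df, group_col="version", flag_col="retention_7"):
    """Return per-group retention rates and the chi-square p-value."""
    table = pd.crosstab(df[group_col], df[flag_col])  # counts per group / flag
    chi2, p_value, dof, expected = chi2_contingency(table)
    rates = df.groupby(group_col)[flag_col].mean()  # share of True per group
    return rates, p_value

# Made-up data: 1000 players per group, 20% vs 15% retained
toy = pd.DataFrame({
    "version": ["A"] * 1000 + ["B"] * 1000,
    "retention_7": [True] * 200 + [False] * 800 + [True] * 150 + [False] * 850,
})
rates, p_value = compare_retention(toy)
print(rates)
print(f"p-value: {p_value:.4f}")
```

Applied to the notebook's frame, the call would be `compare_retention(df, flag_col="retention_1")` or `compare_retention(df, flag_col="retention_7")`. A small p-value there would support the retention claim more directly than the test on game rounds.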