

## What are gates?
As players progress through the game, they will occasionally encounter gates that force them to wait for some time or make an in-app purchase to progress. 

## What is the purpose of the gate?
The main objective of these gates is to drive in-app sales, although they are also presented as a way to encourage players to take a break from the game.

## What is the purpose of the experiment?

The purpose of the experiment is to test the effect of moving the first gate in Cookie Cats from level 30 to level 40 on player retention.

## What is the null hypothesis?

The null hypothesis is that moving the first gate in Cookie Cats from level 30 to level 40 has no effect on player retention.



## What are the target metrics?

The target metrics are retention_1 and retention_7. These metrics are binary, where 1 means that the player came back to play the game after 1 or 7 days and 0 means that the player did not come back to play the game after 1 or 7 days.

In [None]:
import kagglehub
import pandas as pd
import numpy as np
import plotly.express as px

path = kagglehub.dataset_download("mursideyarkin/mobile-games-ab-testing-cookie-cats")
df = pd.read_csv(path + "/cookie_cats.csv")
df

## What are the fields from the dataset?

* *userid*: A unique number that identifies each player.
* *version*: Whether the player was put in the control group (gate_30 - a gate at level 30) or the group with the moved gate (gate_40 - a gate at level 40).
* *sum_gamerounds*: the number of game rounds played by the player during the first 14 days after install.
* *retention_1*: Did the player come back and play 1 day after installing?
* *retention_7*: Did the player come back and play 7 days after installing?
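Because both retention columns are binary, the mean of each column within a group is that group's retention rate. A minimal sketch on a handful of made-up rows (the `toy` frame and all of its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy rows that mirror the dataset's columns.
toy = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5, 6],
    "version": ["gate_30", "gate_30", "gate_30", "gate_40", "gate_40", "gate_40"],
    "sum_gamerounds": [3, 40, 120, 5, 38, 90],
    "retention_1": [True, True, False, True, False, False],
    "retention_7": [False, True, False, False, False, True],
})

# The mean of a boolean (or 0/1) column is the share of players who came back.
retention_rates = toy.groupby("version")[["retention_1", "retention_7"]].mean()
print(retention_rates)
```

Applied to the real data, the same `groupby(...).mean()` call on `df` would give the per-version retention rates.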

## What is the sample size?

In [None]:
df.shape[0]

The sample size is 90,189 players who installed the game while the AB-test was running.

## What is the test group size?

In [None]:
df.groupby("version").size()

We have 44,700 players in the gate_30 group and 45,489 players in the gate_40 group, so the two groups are close in size.

## What are the variant proportions?

In [None]:
df.groupby("version").size() / df.shape[0]

The variant proportions are 0.496 (gate_30) and 0.504 (gate_40). The split is close to even, which is what we expect from random assignment; it is not by itself evidence that the sample is large enough.

# Statistical Tests

## Looking at the data, there are a couple of ways to test the hypothesis:

For metrics like `retention_1` and `retention_7`:
* Chi-square test: first a goodness-of-fit test on the group sizes to check for Sample Ratio Mismatch (SRM), then a test of independence on the binary metric.
* Binomial (two-proportion) test: compare the retention rate of gate_40 against the baseline, which in our case is the retention rate of the control group, gate_30.
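The two chi-square steps could be sketched as follows. This is only an illustration under stated assumptions: the planned 50/50 split is assumed (the source does not state the intended allocation), the group sizes are the ones printed with the Mann-Whitney results in this notebook, and the retention counts passed to `retention_chi2` are hypothetical placeholders. Both helper names are made up for this sketch.

```python
from scipy import stats

def srm_check(counts, expected_ratios=(0.5, 0.5)):
    """Chi-square goodness-of-fit test of observed group sizes against the planned split."""
    total = sum(counts)
    expected = [total * ratio for ratio in expected_ratios]
    return stats.chisquare(f_obs=counts, f_exp=expected)

def retention_chi2(retained_a, n_a, retained_b, n_b):
    """Chi-square test of independence on a 2x2 table: retained vs. not retained."""
    table = [
        [retained_a, n_a - retained_a],
        [retained_b, n_b - retained_b],
    ]
    chi2, p, dof, _expected = stats.chi2_contingency(table, correction=False)
    return chi2, p, dof

# Group sizes as reported in this notebook; the 50/50 target is an assumption.
srm_stat, srm_p = srm_check([44_700, 45_489])
print(f"SRM check: chi2 = {srm_stat:.3f}, p = {srm_p:.4f}")

# Hypothetical retention counts, for illustration only.
chi2, p, dof = retention_chi2(retained_a=450, n_a=1000, retained_b=440, n_b=1000)
print(f"Retention test: chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
```

SRM checks are often run with a stricter threshold than 0.05 (0.001 is a common choice), because they act as a data-quality alarm rather than a test of the treatment; which threshold to use here is a judgment call.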

In [None]:
# Histogram of game rounds per version, with a marginal box plot above it
fig = px.histogram(df, x='sum_gamerounds',
                   title='Distribution of Game Rounds',
                   labels={'sum_gamerounds': 'Number of Game Rounds'},
                   color='version',
                   marginal='box',  # This adds a box plot above
                   hover_data=['version'])

# Update layout for better readability
fig.update_layout(
    title_x=0.5,  # Center the title
    bargap=0.1,  # Add space between bars
    height=600,  # Make plot taller for better visibility
    xaxis_title="Number of Game Rounds",
    yaxis_title="Count of Players"
)

# If you want to zoom in to see the main distribution better
# (since there might be extreme outliers), we can set the x-axis range
# Let's show up to the 95th percentile
x_limit = np.percentile(df['sum_gamerounds'], 95)
fig.update_xaxes(range=[0, x_limit])

fig.show()

* The distribution is heavily right-skewed:
    * Most players complete relatively few game rounds
    * Some players complete many game rounds
    * This is very far from a normal distribution (bell curve)

* The box plots tell the same story:
    * The boxes (representing the middle 50% of players) are compressed near the left
    * The whiskers (the lines extending from the boxes) are very long to the right
    * There are many dots beyond the whiskers, representing outliers

* The two versions look alike:
    * Both gate_30 (blue) and gate_40 (red) show similar patterns
    * The distributions largely overlap
    * Both versions have similar shapes and outlier patterns
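The visual impression of right skew can also be put into numbers: in a right-skewed sample the mean sits above the median and the skewness coefficient is positive. A small sketch on synthetic data (the log-normal sample is a stand-in chosen for illustration, not the real `sum_gamerounds` column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; parameters are arbitrary, for illustration only.
rng = np.random.default_rng(0)
rounds = pd.Series(rng.lognormal(mean=3.0, sigma=1.2, size=10_000)).round()

summary = {
    "mean": rounds.mean(),
    "median": rounds.median(),
    "skewness": rounds.skew(),       # > 0 indicates a longer right tail
    "p95": rounds.quantile(0.95),
    "max": rounds.max(),
}

for name, value in summary.items():
    print(f"{name:>9}: {value:.2f}")
```

Running the same summary on `df['sum_gamerounds']`, per version, would show whether both groups are skewed to a similar degree.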

Given these observations, the [Mann-Whitney U](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html#scipy.stats.mannwhitneyu) test is a better fit than a t-test because:

* The data is not normally distributed: it does not follow the symmetric bell curve that a t-test assumes
* We have many outliers
* The variances (how widely the values spread around the mean) may differ between groups, and the plot alone does not settle this

The Mann-Whitney U test will be more reliable because it:

* Doesn't assume normality
* Works with skewed distributions
* Is less sensitive to outliers
* Compares ranks rather than raw values
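To make the point about outliers concrete, here is a small sketch comparing a Welch t-test with the Mann-Whitney U test on made-up data, before and after replacing one value with an extreme outlier. The samples `x`, `y`, and `x_outlier` and the helper `compare` are hypothetical and exist only for this illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y is x shifted up by 5, so the groups genuinely differ.
x = np.arange(1, 21, dtype=float)
y = x + 5

# Same x, but with its largest value replaced by one extreme outlier.
x_outlier = x.copy()
x_outlier[-1] = 1000.0

def compare(a, b):
    """Return (Welch t-test p-value, Mann-Whitney U p-value) for two samples."""
    t_p = stats.ttest_ind(a, b, equal_var=False).pvalue
    u_p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    return t_p, u_p

t_clean, u_clean = compare(x, y)
t_out, u_out = compare(x_outlier, y)

print(f"Without outlier: t-test p = {t_clean:.4f}, Mann-Whitney p = {u_clean:.4f}")
print(f"With outlier:    t-test p = {t_out:.4f}, Mann-Whitney p = {u_out:.4f}")
```

In this construction the single outlier inflates the variance enough that the t-test no longer flags the shift at the 0.05 level, while the rank-based test still does, because the outlier only moves one observation to the top rank.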

In [None]:
from scipy import stats

# 1. Split the data into two groups
gate_30_rounds = df[df['version'] == 'gate_30']['sum_gamerounds']
gate_40_rounds = df[df['version'] == 'gate_40']['sum_gamerounds']

# 2. Perform Mann-Whitney U test
statistic, p_value = stats.mannwhitneyu(
    gate_30_rounds,
    gate_40_rounds,
    alternative='two-sided'  # Test if there's any difference between groups (not just one being larger)
)

# 3. Calculate some descriptive statistics to help interpret results
stats_summary = {
    'gate_30': {
        'median': gate_30_rounds.median(),
        'mean': gate_30_rounds.mean(),
        'count': len(gate_30_rounds)
    },
    'gate_40': {
        'median': gate_40_rounds.median(),
        'mean': gate_40_rounds.mean(),
        'count': len(gate_40_rounds)
    }
}

# 4. Print results in a clear format
print("Mann-Whitney U Test Results:")
print(f"p-value: {p_value:.4f}")
print(f"\nDescriptive Statistics:")
for version, stats_dict in stats_summary.items():
    print(f"\n{version}:")
    print(f"  Median rounds: {stats_dict['median']:.2f}")
    print(f"  Mean rounds: {stats_dict['mean']:.2f}")
    print(f"  Sample size: {stats_dict['count']}")

# 5. Print conclusion
alpha = 0.05  # Standard significance level
print("\nConclusion:")
if p_value < alpha:
    print(f"There is a statistically significant difference between the groups (p < {alpha})")
else:
    print(f"There is no statistically significant difference between the groups (p >= {alpha})")

Test Significance (p-value = 0.0502):

* This is just barely above our standard significance level of 0.05
* It is very close to being significant, but technically is not
* This kind of "borderline" result warrants careful interpretation

Group Comparisons:

Gate 30:

* Median: 17 rounds
* Mean: 52.46 rounds
* Sample size: 44,700 players

Gate 40:

* Median: 16 rounds
* Mean: 51.30 rounds
* Sample size: 45,489 players

Key Insights:

* The difference is very small in practical terms
* Mean values are much higher than medians (confirms our earlier observation about the skewed distribution)
* Sample sizes are very large and almost equal between groups

Business Interpretation:

* Players in Gate 30 play slightly more rounds on average (about 1 round more)
* The difference is not statistically significant at the 0.05 level, even though a sample this large can detect small effects. Whether a gap of about one round matters in practice is a question for domain experts.
* Given that Gate 30 shows slightly better numbers and is the control configuration, it might be the safer choice
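One way to judge practical relevance separately from the p-value is an effect size derived from the U statistic. The sketch below computes the common-language effect size (the estimated probability that a random player from one group plays more rounds than a random player from the other, counting ties as one half) and the rank-biserial correlation. It assumes SciPy 1.7 or later, where `mannwhitneyu` returns the U statistic of the first sample; the function name and the synthetic samples are hypothetical.

```python
import numpy as np
from scipy import stats

def common_language_effect_size(a, b):
    """Estimate P(A > B) + 0.5 * P(A == B) from the Mann-Whitney U statistic."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Assumes SciPy >= 1.7: the statistic is U for the first sample.
    u1, p_value = stats.mannwhitneyu(a, b, alternative="two-sided")
    return u1 / (len(a) * len(b)), p_value

# Synthetic stand-ins for the two groups, for illustration only.
rng = np.random.default_rng(42)
group_a = rng.lognormal(mean=3.0, sigma=1.2, size=5_000).round()
group_b = rng.lognormal(mean=3.0, sigma=1.2, size=5_000).round()

cles, p_value = common_language_effect_size(group_a, group_b)
rank_biserial = 2 * cles - 1  # 0 means no tendency for either group to be larger

print(f"Common-language effect size: {cles:.3f}")
print(f"Rank-biserial correlation:   {rank_biserial:.3f}")
print(f"p-value:                     {p_value:.4f}")
```

A value near 0.5 (rank-biserial near 0) would indicate that neither version tends to produce more rounds per player, regardless of which side of 0.05 the p-value lands on.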