Cookie Cats is a hugely popular mobile puzzle game developed by Tactile Entertainment. It's a classic "connect three" style puzzle game where the player must connect tiles of the same color in order to clear the board and win the level. It also features singing cats. We're not kidding!

As players progress through the game, they encounter gates that force them to either wait some time or make an in-app purchase before they can continue. In this project, we analyze the results of an A/B test in which the first gate in Cookie Cats was moved from level 30 to level 40. In particular, we analyze the impact on player retention.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [34]:
# import csv and show head
df = pd.read_csv('cookie_cats.csv')
df.head()

   userid  version  sum_gamerounds  retention_1  retention_7
0     116  gate_30               3        False        False
1     337  gate_30              38         True        False
2     377  gate_40             165         True        False
3     483  gate_40               1        False        False
4     488  gate_40             179         True         True


In [35]:
# look for nulls and check data type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90189 entries, 0 to 90188
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   userid          90189 non-null  int64 
 1   version         90189 non-null  object
 2   sum_gamerounds  90189 non-null  int64 
 3   retention_1     90189 non-null  bool  
 4   retention_7     90189 non-null  bool  
dtypes: bool(2), int64(2), object(1)
memory usage: 2.2+ MB


In [36]:
# no nulls
# change version datatype to category for quick processing
df['version'] = df.version.astype('category')

Data dictionary
    userid (int) - unique id for each user
    version (category) - version of the game they played (either gate at level 30 or gate at level 40)
    sum_gamerounds (int) - rounds played
    retention_1 (bool) - did the player come back and play 1 day after installing?
    retention_7 (bool) - did the player come back and play 7 days after installing?
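
Given these columns, one plausible first summary of the A/B test is the retention rate per version, since the mean of a boolean column is the share of True values. The sketch below is illustrative only: the `sample` frame is an assumption that reuses the five rows shown by `df.head()` as a stand-in for the full dataset, and `retention_rates` is a hypothetical name.

```python
import pandas as pd

# Stand-in data (assumption): the five rows printed by df.head() above.
sample = pd.DataFrame({
    "version": ["gate_30", "gate_30", "gate_40", "gate_40", "gate_40"],
    "retention_1": [False, True, True, False, True],
    "retention_7": [False, False, False, False, True],
})

# The mean of a boolean column is the fraction of True values,
# so this yields the 1-day and 7-day retention rate for each version.
retention_rates = sample.groupby("version")[["retention_1", "retention_7"]].mean()
print(retention_rates)
```

On the real data, the same `groupby` call would run on `df` directly; five rows are far too few to say anything about the experiment itself.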

In [37]:
# sanity checks: confirm the data matches our understanding of each column

# userid --> unique or not?
print("Is userid unique:",df.userid.nunique() == df.shape[0])

# version --> only two? is naming consistent?
print("Versions:", tuple(df.version.unique()))

# sum_gamerounds --> are there any weird values
display(df.sum_gamerounds.describe())

Is userid unique: True
Versions: ('gate_30', 'gate_40')


count    90189.000000
mean        51.872457
std        195.050858
min          0.000000
25%          5.000000
50%         16.000000
75%         51.000000
max      49854.000000
Name: sum_gamerounds, dtype: float64
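
The maximum of 49854 rounds sits far above the 75th percentile of 51, so it may be worth flagging extreme values before plotting. The sketch below uses the 1.5 × IQR rule, which is one common heuristic and not necessarily the check this notebook goes on to use. The `rounds` series is an assumption built from the min, quartiles, and max in the `describe()` output above; `upper_fence` and `outliers` are hypothetical names.

```python
import pandas as pd

# Stand-in data (assumption): min, 25%, 50%, 75%, and max from describe() above.
rounds = pd.Series([0, 5, 16, 51, 49854], name="sum_gamerounds")

# Interquartile range and the conventional upper fence at Q3 + 1.5 * IQR.
q1, q3 = rounds.quantile(0.25), rounds.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are candidates for closer inspection, not automatic removal.
outliers = rounds[rounds > upper_fence]
print(f"upper fence: {upper_fence}, flagged values: {outliers.tolist()}")
```

On the full dataset, `df.sum_gamerounds.nlargest(5)` is another quick way to see how isolated the largest values are.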

In [39]:
# retention_1, retention_7 --> assign a retention level to each player
RETENTION_LEVELS = ['No retention', 'Some retention', 'High retention']
def assign_retention_levels(row):
    # row is a single player's record (df.apply with axis=1 passes one row at a time)
    if row.retention_1 and row.retention_7:
        return RETENTION_LEVELS[2]  # came back on day 1 and on day 7
    elif not row.retention_1 and not row.retention_7:
        return RETENTION_LEVELS[0]  # came back on neither day
    else:
        return RETENTION_LEVELS[1]  # came back on exactly one of the two days

df['retention_level'] = df.apply(assign_retention_levels, axis=1)

df['retention_level'] = pd.Categorical(df['retention_level'], categories = RETENTION_LEVELS, ordered=True)

df['retention_level']

0          No retention
1        Some retention
2        Some retention
3          No retention
4        High retention
              ...      
90184    Some retention
90185      No retention
90186    Some retention
90187    Some retention
90188      No retention
Name: retention_level, Length: 90189, dtype: category
Categories (3, object): ['No retention' < 'Some retention' < 'High retention']
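
A row-wise `apply` over roughly 90,000 rows works but can be slow. A vectorized alternative is sketched below, assuming the same three levels: counting how many of the two checkpoints a player returned on gives 0, 1, or 2, which maps directly onto the ordered categories. The `sample` frame (the five rows from `df.head()`) and the name `days_retained` are assumptions for illustration.

```python
import pandas as pd

RETENTION_LEVELS = ["No retention", "Some retention", "High retention"]

# Stand-in data (assumption): retention flags of the five rows from df.head().
sample = pd.DataFrame({
    "retention_1": [False, True, True, False, True],
    "retention_7": [False, False, False, False, True],
})

# Number of checkpoints (day 1, day 7) on which the player returned: 0, 1, or 2.
days_retained = sample["retention_1"].astype(int) + sample["retention_7"].astype(int)

# Use the counts as category codes, so 0 -> 'No retention', 2 -> 'High retention'.
sample["retention_level"] = pd.Categorical.from_codes(
    days_retained.to_numpy(), categories=RETENTION_LEVELS, ordered=True
)
print(sample["retention_level"])
```

The result for these five rows matches the first five values of the `apply`-based column shown above.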

In [None]:
# Graphical EDA

# number of players in each group

# how many game rounds each group plays

# retention levels in each group
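
One way the three items above might be filled in is sketched below. It is not the notebook's own implementation: `eda_tables` and `plot_eda` are hypothetical helpers, the `sample` frame is an assumption that reuses the five rows from `df.head()`, and the log scale for rounds played is a choice made here because of the long right tail seen in `describe()`.

```python
import numpy as np
import pandas as pd

RETENTION_LEVELS = ["No retention", "Some retention", "High retention"]

# Stand-in data (assumption): the five rows from df.head() with their retention levels.
sample = pd.DataFrame({
    "version": pd.Categorical(["gate_30", "gate_30", "gate_40", "gate_40", "gate_40"]),
    "sum_gamerounds": [3, 38, 165, 1, 179],
    "retention_level": pd.Categorical(
        ["No retention", "Some retention", "Some retention", "No retention", "High retention"],
        categories=RETENTION_LEVELS, ordered=True,
    ),
})

def eda_tables(data):
    """Return the three summary tables behind the plots (hypothetical helper)."""
    # Number of players in each group.
    group_sizes = data["version"].value_counts().sort_index()
    # Median rounds played per group; the median is robust to extreme values.
    median_rounds = data.groupby("version", observed=True)["sum_gamerounds"].median()
    # Share of each retention level within each group (rows sum to 1).
    retention_share = pd.crosstab(data["version"], data["retention_level"], normalize="index")
    return group_sizes, median_rounds, retention_share

def plot_eda(data):
    """Draw one panel per question (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    group_sizes, _, retention_share = eda_tables(data)
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    group_sizes.plot(kind="bar", ax=axes[0], title="Players per group")
    for version, grp in data.groupby("version", observed=True):
        # log1p keeps zero-round players and compresses the long right tail.
        axes[1].hist(np.log1p(grp["sum_gamerounds"]), bins=30, alpha=0.5, label=str(version))
    axes[1].set_title("log(1 + rounds played)")
    axes[1].legend()
    retention_share.plot(kind="bar", stacked=True, ax=axes[2], title="Retention level share")
    fig.tight_layout()
    return fig

group_sizes, median_rounds, retention_share = eda_tables(sample)
print(group_sizes, median_rounds, retention_share, sep="\n\n")
```

In the notebook, calling `plot_eda(df)` on the full DataFrame would draw the three panels; the tables from `eda_tables(df)` give the same information numerically.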