# User Retention - Meta Kaggle Dataset

Retention is among the most important metrics to understand. It is crucial in the development of a successful product. In this notebook, we'll use the Meta Kaggle dataset to learn how to visualize, understand, and explore this key performance indicator.

This notebook focuses on users who have created or forked at least one notebook (i.e., kernel).

* Forked from: https://www.kaggle.com/wguedes/learning-about-user-retention-meta-kaggle

## Overview - What is retention?
Retention is our ability to make users come back to our product within a certain time period. 

In our example, we would like users who created a notebook today to use Kernels again when they write their next notebook.

#### What's a good retention rate?
*There's no magic number.*
Higher is better.

* Good lecture about retention/growth/churn by Facebook's Alex Schultz:  [lecture](https://www.youtube.com/watch?v=n_yHZ_vKjno) 



## Exploring User Retention of Kernels

We will be looking at what percentage of new users come back in the following week, the week after, and so on. This will tell us how *sticky* Kernels is. Once we have a grasp of the overall user retention, we'll look at ways to make our insights actionable.

In the following sections we will:
- Import libraries and load tables
- Prepare the data for exploration (Data Wrangling).
- Compute overall week-over-week (WoW) user retention.
- Explore ways to make our analysis actionable.
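
As a rough illustration of the week-over-week computation outlined above, here is a minimal sketch on a tiny made-up event log. The `events` frame and its values are hypothetical (not taken from Meta Kaggle), and it assumes that a user's "week 0" is the week of their first recorded activity.

```python
import pandas as pd

# Hypothetical toy event log: one row per (user, active day).
events = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2, 3],
    "ActivityDate": pd.to_datetime([
        "2020-01-01", "2020-01-09", "2020-01-16",  # user 1: weeks 0, 1, 2
        "2020-01-02", "2020-01-17",                # user 2: weeks 0, 2
        "2020-01-03",                              # user 3: week 0 only
    ]),
})

# Week index relative to each user's own first activity.
first_activity = events.groupby("UserId")["ActivityDate"].transform("min")
events["weeks_from_start"] = (events["ActivityDate"] - first_activity).dt.days // 7

# Share of all users who were active in each week since their start.
n_users = events["UserId"].nunique()
retention = (
    events.drop_duplicates(["UserId", "weeks_from_start"])
    .groupby("weeks_from_start")["UserId"]
    .nunique()
    / n_users
)
print(retention)
```

Plotting `retention` against its index gives the usual retention curve; a flatter tail means a stickier product.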


I'm particularly interested in the usage of Kernels. Therefore, this notebook will only look at users who have created/forked at least one notebook.

In the end, I hope you'll be able to apply what you've learned and look at the retention of your own product.

<img src="https://image.slidesharecdn.com/metricsandksfsforbuildingahighperformingcustomersuccessteam-150410171710-conversion-gate01/95/customer-success-best-practices-for-saas-retention-metrics-and-ksfs-for-building-a-high-performing-customer-success-team-10-638.jpg?cb=1428686468" alt="thatdbegreat" style="width: 400px"/>

## Overview
Before we dive into the data, let's understand what user retention is and why it's important. 

### What is retention?
At a high level, retention is our ability to make users come back to our product within a certain time period. 

In our example, we would like users who created a notebook today to use Kernels again when they write their next notebook.

### Why is retention important?
***Retention tells us if we're building something worth building***. It helps us understand if we're providing value to our users. In other words, retention tells us whether or not our product has market fit.

If we provide no value, we don't have a sustainable business.

Acquiring new users is also expensive. And if we can't retain them, all the money put into user acquisition and growth hacking tactics will go to waste. 

### Spoiler Alert! What's a good retention rate?
Maybe you already know what retention is. You might be simply trying to figure out what is a good number for your retention rate.

A good retention rate is 15%! No wait... It's 10%. But it could also be 28%, and 95% is definitely good, I guess. *There's no magic number.*

Personally, I was frustrated when I couldn't find that magic number. Only after watching Alex Schultz's [lecture](https://www.youtube.com/watch?v=n_yHZ_vKjno) about retention did I understand why no particular number exists. I highly encourage you to do the same!


## Libraries and Tables

To compute user retention, we need only two tables:
* A table with user signup dates
* A table with user events (used to determine the dates a user was active)

We'll derive these tables from the `KernelVersions` table plus competition activity (the `Submissions` table).

* Note that we set the "start" date from the first activity in these tables, not from the user registration date. (A user might register, then have no activity for a while, then become active. We'll use the registration date as a separate feature, but not for defining the target.)
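
To make the distinction concrete, here is a small sketch on made-up data. The `users_toy` and `events_toy` frames are hypothetical; only the column names `UserId`, `RegisterDate`, `ActivityDate` and `first_Activity` follow this notebook. It starts the retention clock at the first activity and keeps the registration date as a separate feature.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
users_toy = pd.DataFrame({
    "UserId": [1, 2],
    "RegisterDate": pd.to_datetime(["2019-01-01", "2019-03-01"]),
})
events_toy = pd.DataFrame({
    "UserId": [1, 1, 2],
    "ActivityDate": pd.to_datetime(["2019-06-01", "2019-06-20", "2019-03-01"]),
})

# The retention clock starts at the first recorded activity, not at registration.
start = (
    events_toy.groupby("UserId", as_index=False)["ActivityDate"].min()
    .rename(columns={"ActivityDate": "first_Activity"})
)
merged = users_toy.merge(start, on="UserId", how="inner")

# The registration date survives as a feature: days of dormancy before the first activity.
merged["days_registered_before_active"] = (
    merged["first_Activity"] - merged["RegisterDate"]
).dt.days
print(merged)
```

User 1 registered five months before doing anything; measuring from registration would count that dormant period as "retention".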

In [None]:
# Imports
import datetime
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

# Return a pandas DataFrame for the given CSV filename (without the .csv extension).
def load(name):
    return pd.read_csv('../input/{}.csv'.format(name))

In [None]:

# Used to filter out users who became active too recently to measure retention.
# RECENT_CUTOFF_DATE is currently 4 months before END_DATE, the latest date in this version of the dataset.
# (We could read END_DATE directly from the data, but it's cleaner to define these constants upfront.)
END_DATE = pd.to_datetime("2020-06-02")  # previously "26.5.2019"
RECENT_CUTOFF_DATE = pd.to_datetime("2020-02-01")  # previously "25.1.2019"

MINIMAL_RETENTION_HISTORY = 12  # minimal history per user, in weeks
MINIMAL_RETENTION_WEEKS_TARGET = 24  # target we want to predict, in weeks (previously 36). Users first active fewer than this many weeks before END_DATE are filtered out.

## Data Wrangling/loading
* Load the datasets (and parse as datetime). 

* We can strip the hours/time component and leave just the date (then drop duplicates). Our aggregation would do this for us anyway, though.
    * `df['dates'].dt.floor('d')` (or:)
    * `df['dates'].dt.date`
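
The two options are not quite equivalent. A minimal sketch on hypothetical timestamps (the `ts` series is made up) shows the difference in resulting dtype, which matters if we later want date arithmetic or the `.dt` accessor:

```python
import pandas as pd

# Hypothetical timestamps: two events on the same day, one just after midnight.
ts = pd.Series(pd.to_datetime([
    "2020-01-01 08:30:00", "2020-01-01 21:15:00", "2020-01-02 00:05:00",
]))

floored = ts.dt.floor("d")  # still a datetime64 column, time set to midnight
as_date = ts.dt.date        # object column holding datetime.date values

print(floored.dtype, as_date.dtype)
print(floored.drop_duplicates().shape[0], "distinct active days")
```

Flooring keeps the column as datetimes, so subtraction and `.dt.year` keep working downstream; this is presumably why the loading code uses `dt.floor('d')`.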

In [None]:
# Information about public kernel versions.
# Signup date table and user events table will be derived from this dataframe.
kernel_versions = pd.read_csv('../input/KernelVersions.csv', usecols=['AuthorUserId', 'CreationDate'],
                              parse_dates=["CreationDate"])
print("kernel versions", kernel_versions.shape)
kernel_versions["CreationDate"] = kernel_versions["CreationDate"].dt.floor('d')
kernel_versions.drop_duplicates(inplace=True)
print("Date-level deduplicated kernel versions", kernel_versions.shape[0])

# Competition submissions:
# an additional source of data on user events/activity, to merge with kernels.
# Some rows are missing the user ID. We drop them, then cast the ID back to integer (the NaNs forced it to float).
submissions = pd.read_csv('../input/Submissions.csv', usecols=["SubmittedUserId", "SubmissionDate"],
                          parse_dates=["SubmissionDate"]).dropna()
print("submissions", submissions.shape)
submissions.SubmittedUserId = submissions.SubmittedUserId.astype(int)
submissions["SubmissionDate"] = submissions["SubmissionDate"].dt.floor('d')
print(f"Last/max submissions date {submissions['SubmissionDate'].max()}")
submissions.drop_duplicates(inplace=True)
print("Date-level deduplicated submissions", submissions.shape[0])


In [None]:
submissions.rename(columns={"SubmittedUserId":"UserId","SubmissionDate":"ActivityDate"},inplace=True)
kernel_versions.rename(columns={"AuthorUserId":"UserId","CreationDate":"ActivityDate"},inplace=True)

# New dataframe that will contain aggregated data about user activity, for defining target and prediction instances:
df_labels = pd.concat([kernel_versions,submissions]).drop_duplicates()
print(df_labels.shape)
df_labels.tail()

In [None]:
## basic users info (name..).  It can also be joined with more data, e.g. teams, organization
## PerformanceTier is leaky! it's not time stamped.
users = pd.read_csv('../input/Users.csv').set_index("Id").drop_duplicates().drop(["PerformanceTier","UserName"],axis=1) # drop UserName, it's probably redundant if we have display name
print(users.shape)
users.head()

#### Concatenate and aggregate the user event tables
* Concatenate the tables (after renaming their columns to match).
    * Renaming columns after aggregating multiple functions on the same column: https://nbviewer.jupyter.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0
    * https://stackoverflow.com/questions/32958526/pandas-agg-multiple-summaries-on-same-column
* Get the "duration" of each user (the time between their first and last event).
* Filter out users who weren't active for a minimal duration. (We saw ~80% dropout over the first 1-2 weeks; many users could be "tests", temporary, or not serious.)
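
The steps above can be sketched on a toy table as follows. This is an alternative formulation, not the code used in this notebook: it uses pandas named aggregation (available since pandas 0.25), which avoids the rename-and-droplevel step. The `activity` frame and the `MIN_WEEKS` threshold are hypothetical.

```python
import pandas as pd

MIN_WEEKS = 1  # hypothetical minimal duration, in weeks

# Hypothetical, already-concatenated activity table.
activity = pd.DataFrame({
    "UserId": [1, 1, 2, 3, 3],
    "ActivityDate": pd.to_datetime(
        ["2019-01-01", "2019-03-12", "2019-02-01", "2019-01-05", "2019-01-08"]
    ),
})

# Named aggregation: one output column per (source column, function) pair.
agg = activity.groupby("UserId", as_index=False).agg(
    first_Activity=("ActivityDate", "min"),
    last_Activity=("ActivityDate", "max"),
)

# Duration between first and last event, rounded to whole weeks.
agg["weeks_retention"] = (
    (agg["last_Activity"] - agg["first_Activity"]) / pd.Timedelta(weeks=1)
).round()

# Keep only users active for at least the minimal duration.
active = agg[agg["weeks_retention"] >= MIN_WEEKS]
print(active)
```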

In [None]:
## https://stackoverflow.com/questions/32958526/pandas-agg-multiple-summaries-on-same-column
## https://nbviewer.jupyter.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0

# df_labels_agg = df_labels.groupby("UserId", as_index=True).agg({'ActivityDate':{"first_Activity": "min", "last_Activity": "max"}})
df_labels_agg = df_labels.groupby("UserId", as_index=True).agg({'ActivityDate':["max","min"]}).rename(columns={ "min":"first_Activity", "max":"last_Activity"})
df_labels_agg.columns = df_labels_agg.columns.droplevel() # drop ActivityDate multiaxis level
df_labels_agg.reset_index(inplace=True)
df_labels_agg.drop_duplicates(inplace=True)

# add time between first and last event, in weeks:
df_labels_agg['weeks_retention'] = round((df_labels_agg['last_Activity'] - df_labels_agg['first_Activity']) / np.timedelta64(1, 'W'))

print(df_labels_agg.shape)
df_labels_agg.sample(5)

Kaggle claims ~1 [million users](http://blog.kaggle.com/2017/06/06/weve-passed-1-million-members/), but the numbers we see here are far smaller. This may be because the dataset was sampled, or simply because only a small portion of registered users are actually "serious", i.e. code-writing, competition-entering data scientists :)


* We already know how the [retention curve](https://www.kaggle.com/wguedes/learning-about-user-retention-meta-kaggle) looks, and regardless, we want users who can be considered "active". We could define this by medals (analogous to "premium"/"gold" clients), but that is a bit too harsh.
Let's filter by a minimal duration (e.g. >1 month, although we see that a cutoff of ~1 week is the real separator), then predict subsequent activity (possibly multiple times per user).


In [None]:
print(df_labels_agg['weeks_retention'].describe())

## 25%           0.000000
## 50%           1.000000
## over half aren't active after first usage/week!

print("\n If excluding users with <=2 weeks retention, median is:",df_labels_agg[df_labels_agg['weeks_retention']>2]["weeks_retention"].median())

df_labels_agg['weeks_retention'].hist();

We still see a heavy-tailed distribution (as expected). Many users may open an account just for a course, or without being serious, or Kaggle may simply not be "sticky" enough.

Defining a "real" user is tricky. We can use 30 days, but a timespan longer than a single competition (2-4 months) might also be valid. We can also go for something more data-driven (based on typical retention).

* We'll want to remove the newest users: users who registered less than 3 months before the present time ("today") can't possibly have 6 months of retention!
* Let's also look at the users with very long tenure (maybe they're Kaggle founders? Grandmasters?)

In [None]:
# MINIMAL_RETENTION_HISTORY = minimal amount of history for users
print(f"MINIMAL_RETENTION_HISTORY (Weeks): {MINIMAL_RETENTION_HISTORY}\n")
df = df_labels_agg.loc[df_labels_agg['weeks_retention']>=MINIMAL_RETENTION_HISTORY].copy() 
df.weeks_retention = df.weeks_retention.astype(int)

print(df['weeks_retention'].describe(percentiles=[.1,.25, .5, .75,.9]))
df['weeks_retention'].hist();

In [None]:
print("Last date in data",df.last_Activity.max())
print("%i users with 7.5+ years veterancy"% df.loc[df['weeks_retention']>(52*7.5)].shape[0])

print("\n counts of first activity over time:")
df.first_Activity.hist();


## Filter earliest and most recent users:

* We will **drop** the **early years'** (2010-~2012) data, both to remove bias from "Kaggle admins" and because of the big shift in behavior and growth seen since.
    * I'll keep users from the past ~6 years. I'll admit to [personal bias due to being registered for 6 years](https://www.kaggle.com/danofer), albeit active for only ~4-5.
* We could focus on just the last 2-3 years, but that may be a bit much.

* We also **drop** users who registered very **recently** (past ~3 months, as of 26.5.2019, the date this kernel was last updated).
    * They won't have had enough time to show retention for our target.
    * *RECENT_CUTOFF_DATE* = cutoff for users who registered within less time than our retention prediction target (i.e. "more than 5 months").
    * *MINIMAL_RETENTION_WEEKS_TARGET* = the target we want to predict (user retention time in weeks). We'll filter out users whose first activity is too recent for them to have been observed for this long.
    * We'll add a "helper" column with the number of weeks between the first activity and the end date. This will make it easier to filter by duration rather than by dates (useful for future versions of this data).
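
Here is a minimal sketch of both filters on a made-up table. The `toy` frame is hypothetical; `END_DATE` and `MINIMAL_RETENTION_WEEKS_TARGET` are redefined with the values used in this notebook so that the snippet runs on its own.

```python
import pandas as pd

END_DATE = pd.to_datetime("2020-06-02")
MINIMAL_RETENTION_WEEKS_TARGET = 24

# Hypothetical users: one very early, one in range, one too recent.
toy = pd.DataFrame({
    "UserId": [1, 2, 3],
    "first_Activity": pd.to_datetime(["2011-05-01", "2018-01-15", "2020-03-01"]),
})

# Helper column: weeks between first activity and the end of the data.
toy["weeks_from_endDate"] = (
    (END_DATE - toy["first_Activity"]) / pd.Timedelta(weeks=1)
).round()

not_too_early = toy["first_Activity"].dt.year >= 2012
observed_long_enough = toy["weeks_from_endDate"] >= MINIMAL_RETENTION_WEEKS_TARGET
keep = toy[not_too_early & observed_long_enough]
print(keep)
```

User 3 is dropped because only about 13 weeks of data exist after their first activity, so we cannot tell whether they would have reached the 24-week target.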


In [None]:
df["weeks_from_endDate"] = round((END_DATE - df['first_Activity']) / np.timedelta64(1, 'W'))
# df.weeks_from_endDate.describe()

print(df.shape[0])
df = df.loc[df.first_Activity.dt.year>=2012]
print("Early years dropped:",df.shape[0])

print(f" \nMINIMAL_RETENTION_WEEKS_TARGET (Weeks of minimum target history): {MINIMAL_RETENTION_WEEKS_TARGET} \n")
df = df.loc[df["weeks_from_endDate"] >= MINIMAL_RETENTION_WEEKS_TARGET] 
# Ensure we keep only users who had enough time to reach the retention target.
### RECENT_CUTOFF_DATE = users who registered within less time than our retention prediction target (i.e. "more than 5 months")
print(f"After dropping too recent registrations:",df.shape[0]) ## f strings! :)

df.weeks_retention.describe()

In [None]:
# ### joiners per month stats
# df.set_index("last_Activity")["UserId"].resample("M").count().describe()

* We see massive growth in newly registered users over the past few years!
* Growth seems to accelerate around 2015, and even more so after the Google acquisition (~2017).

* If we were more interested in classic retention, we would look at retention based on the start date (i.e. cohorts). It seems reasonable that retention will be higher for later start dates.
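
A cohort view by start date could look roughly like the sketch below. The `toy` frame and its numbers are invented purely to show the mechanics; they say nothing about actual Kaggle retention. Cohorts are defined here by the year of first activity, which is one possible choice among many.

```python
import pandas as pd

TARGET_WEEKS = 24  # same value as MINIMAL_RETENTION_WEEKS_TARGET in this notebook

# Hypothetical per-user table (made-up numbers).
toy = pd.DataFrame({
    "first_Activity": pd.to_datetime(
        ["2016-02-01", "2016-07-01", "2018-03-01", "2018-09-01"]
    ),
    "weeks_retention": [14, 30, 40, 50],
})

# Cohort = year of first activity; outcome = retained past the target window.
toy["cohort_year"] = toy["first_Activity"].dt.year
retained = toy["weeks_retention"] > TARGET_WEEKS
by_cohort = retained.groupby(toy["cohort_year"]).mean()
print(by_cohort)
```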


In [None]:
print(df[["UserId","last_Activity"]].nunique())

## Clean data for export
* Add the prediction time-point.
* Add the target column, defined by MINIMAL_RETENTION_WEEKS_TARGET (a binary label, instead of modelling the problem as regression).
* Drop some target-related columns (last activity, ...).
* Merge with the users metadata (mainly to get the creation date).
    * I will get features from the remaining data externally.
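
To illustrate how the prediction time-point and the binary target relate, here is a small sketch on two made-up users. The `toy` frame is hypothetical, and the constants are redefined with this notebook's values so that the snippet is self-contained.

```python
import pandas as pd

MINIMAL_RETENTION_HISTORY = 12       # weeks of history available at prediction time
MINIMAL_RETENTION_WEEKS_TARGET = 24  # weeks a user must exceed to count as retained

# Hypothetical users with the same start date but different lifetimes.
toy = pd.DataFrame({
    "UserId": [1, 2],
    "first_Activity": pd.to_datetime(["2018-01-01", "2018-01-01"]),
    "weeks_retention": [13, 60],
})

# The prediction is made once the minimal history has elapsed.
toy["time_of_prediction"] = toy["first_Activity"] + pd.Timedelta(
    weeks=MINIMAL_RETENTION_HISTORY
)

# Binary label: did the user stay active beyond the target window?
toy["retention_target"] = (
    toy["weeks_retention"] > MINIMAL_RETENTION_WEEKS_TARGET
).astype(int)
print(toy)
```

Both users pass the 12-week history filter, but only user 2 is labelled as retained. Any features built later should only use events dated before `time_of_prediction`, otherwise they would leak the target.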

In [None]:
df["time_of_prediction"] = df["first_Activity"] + pd.to_timedelta(MINIMAL_RETENTION_HISTORY,unit="W")
df[f"retention_target_{MINIMAL_RETENTION_WEEKS_TARGET}"] = (df["weeks_retention"]>MINIMAL_RETENTION_WEEKS_TARGET).astype(int)
print(df[f"retention_target_{MINIMAL_RETENTION_WEEKS_TARGET}"].mean())
df.drop(["last_Activity"],axis=1,inplace=True)

In [None]:
# some users aren't in the user data!
# + Drop earlier users (different behavior)
print(df.join(users,on="UserId",how="inner").shape)

df = df.join(users,on="UserId",how="left")
print(df.shape[0])
df.RegisterDate = pd.to_datetime(df.RegisterDate)
df =  df.loc[df.RegisterDate.dt.year>2012]
print("Users with no registration date dropped + earliest users;",df.shape[0])
df.tail(3)

In [None]:
df.to_csv(f"metaKaggle_churn-{MINIMAL_RETENTION_WEEKS_TARGET}_retention-{MINIMAL_RETENTION_HISTORY}_hist-labels_v1.csv.gz",index=False,compression="gzip")

### A Side Note
There are many different ways to look at retention. 

For example, we can look at the percentage of users who are active N weeks after they've signed up. Other common approaches are to look at month-over-month retention, or at 7-day (or 30-day) active users instead of 1-day active users.
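
As a quick sketch of how the activity window changes who counts as "active", the hypothetical helper `active_users` below returns the users with at least one event in a trailing window. The event log and the reference date are made up.

```python
import pandas as pd

# Hypothetical event log.
events = pd.DataFrame({
    "UserId": [1, 1, 2, 3],
    "ActivityDate": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-05", "2019-12-20"]
    ),
})
as_of = pd.Timestamp("2020-01-10")


def active_users(events, as_of, window_days):
    """Users with at least one event in the trailing window ending at as_of (inclusive)."""
    start = as_of - pd.Timedelta(days=window_days - 1)
    in_window = events["ActivityDate"].between(start, as_of)
    return set(events.loc[in_window, "UserId"])


one_day = active_users(events, as_of, 1)
seven_day = active_users(events, as_of, 7)
thirty_day = active_users(events, as_of, 30)
print(len(one_day), len(seven_day), len(thirty_day))
```

Wider windows are more forgiving, so the same product can look much stickier under a 30-day definition than under a 1-day one.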

### So What?

* To take our insights to the next level, we must make them actionable.

## Actionability - Turn Insights Into Action
We now have our baseline. We know that by the 8th week, 2% of the users are still using Kernels. It's time to explore ways to increase retention.

Cohort analysis is a great way to develop more actionable insights. The overall user retention is what we've plotted above. However, there are certain groups of users who are more engaged than others. Cohort analysis can help us identify them.

### Kernel Categories
Authors can *tag* their notebooks. To tag a notebook is to associate it with a category.

<img src="https://image.ibb.co/gZhAu8/tag.png" alt="tag" style="width: 400px" />

Let's see if users who add tags to their first notebook behave any differently from our current baseline (all users). For that, we'll join our user events table with the `KernelTags` dataset.
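
One way to build the "first notebook has a tag" flag is sketched below on made-up tables. The `first_kernels` and `kernel_tags` frames are hypothetical stand-ins; the column names `KernelId` and `TagId` are assumed to match the `KernelTags` table. It uses a membership test rather than a left join, so a kernel with several tags does not get duplicated.

```python
import pandas as pd

# Hypothetical tables: each user's first kernel, and kernel-to-tag links.
first_kernels = pd.DataFrame({"AuthorUserId": [1, 2, 3], "KernelId": [10, 20, 30]})
kernel_tags = pd.DataFrame({"KernelId": [10, 10, 30], "TagId": [7, 8, 9]})

# Membership test: a kernel is "tagged" if it appears at least once in kernel_tags.
tagged_ids = set(kernel_tags["KernelId"])
first_kernels["has_category"] = (
    first_kernels["KernelId"]
    .isin(tagged_ids)
    .map({True: "has_category", False: "no_category"})
)
print(first_kernels)
```

The resulting `has_category` column can then serve as the cohort key when computing a retention curve per group.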


In [None]:
# # Load KernelTags table.
# kernel_tags = load('KernelTags')

In [None]:
# # Create temporary table to determine if a user's first kernel has a tag.
# user_first_kernel = kernel_versions.iloc[kernel_versions.groupby('AuthorUserId')['CreationDate'].idxmin()]
# user_first_kernel = pd.merge(
#     user_first_kernel,
#     kernel_tags,
#     how='left',
#     on='KernelId',
#     suffixes=('', '_kernel_tags'))

# # If right side of join is n/a, it's because user's first notebook has no tag/category.
# user_first_kernel.loc[pd.notnull(user_first_kernel.TagId), 'TagId'] = 'has_category'
# user_first_kernel.loc[pd.isnull(user_first_kernel.TagId), 'TagId'] = 'no_category'
# user_first_kernel = user_first_kernel.rename(columns={'TagId': 'has_category'})

In [None]:
# cohort = 'has_category'

# augmented_kernel_users = pd.merge(
#     kernel_users,
#     user_first_kernel,
#     left_on='Id',
#     right_on='AuthorUserId',
#     suffixes=('', '_b'))[['Id', cohort, 'RegisterDate']]

# dim_users = pd.merge(
#     kernel_user_events,
#     augmented_kernel_users,
#     how='left',
#     on='Id')
# dim_users['weeks_from_signup'] = round((dim_users['Date'] - dim_users['RegisterDate']) / np.timedelta64(1, 'W'))
# dim_users = dim_users[['Id', 'weeks_from_signup', cohort]].drop_duplicates()
# dim_users = dim_users[dim_users['weeks_from_signup'] <= 6]

# assert dim_users['Id'].nunique() == dim_users[dim_users['weeks_from_signup'] == 0].shape[0]

# cohort_size = (
#     dim_users[dim_users['weeks_from_signup'] == 0]
#     .groupby([cohort], as_index=False).agg('count')[[cohort, 'Id']]
#     .rename(columns={'Id': 'cohort_size'})
# )
# cohort_size = cohort_size[cohort_size['cohort_size'] > 1000]


# users_by_cohort = (pd.merge(
#     dim_users,
#     cohort_size,
#     on=cohort)
#  .groupby(['weeks_from_signup', cohort, 'cohort_size'], as_index=False)
#  .agg('count')
#  .rename(columns={'Id': 'user_count'})
# )

# users_by_cohort['pct'] = users_by_cohort['user_count'] / users_by_cohort['cohort_size'] * 100

In [None]:
# plt.figure(figsize=(8, 6))
# for a, b in users_by_cohort.groupby([cohort]):
#     plt.plot(b['weeks_from_signup'], b['user_count'] / b['cohort_size'] * 100.0, label=a)
# plt.title('Kernels Retention Curve')
# plt.ylabel('% Active Users')
# plt.xlabel('Weeks From Signup')
# plt.legend()
# plt.show()

It appears that users who added a tag to their first kernel have substantially higher retention rates! Let's look at it in a table.