<a href="https://colab.research.google.com/github/theventurecity/data-toolkit/blob/master/Mini_Pipeline_Cohort_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![TheVentureCity](https://theventure.city/wp-content/uploads/2017/06/Theventurecity-logoweb-1.png)

# Mini-Pipeline: Cohort Analysis
1. Extract raw event log data from a CSV
2. Transform that data into a cohort analysis dataframe
3. Load the transformed data into Google Sheets 
4. Visualize insights in Google Data Studio

## Before you begin

- This notebook is shared with read-only access. To run this notebook yourself, first click "**Open in Playground**" in the toolbar above. That will create a separate instance that you can run and/or save a copy of to your own Google Drive. 

- To run each cell, press **Shift-Enter**, which runs the contents of the active cell and moves to the next cell. This includes the Markdown cells (such as this one).

- When you run the first block of Python code, you will get a message that says, "**Warning: This notebook was not authored by Google.**" Please be aware that we are **NOT** accessing your data shared with Google or reading data and credentials from other sessions. This notebook reads data from GitHub and writes to a Google Sheet that only you have access to and can control. We recommend you click the box to "**Reset all runtimes before running**" for extra information security.

## Import relevant Python libraries

In [1]:
### Pandas, Numpy, and date functions to read the data from its source
### and manipulate it in memory
import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import datetime
import math


### Google account authentication to give the code access to your Google account
### and gspread to enable writing to Google Sheets
from google.colab import auth
from oauth2client.client import GoogleCredentials
import gspread
from gspread_dataframe import set_with_dataframe


### The IPython.display library allows us to embed an iFrame within this 
### notebook
from IPython.display import IFrame


### To run this with functions from TheVentureCity's GitHub repository,
### clone the repository to the Google Colaboratory runtime environment
### and then import the code. This code allows us to run pre-existing functions 
### rather than having to define them inline within the notebook
### THIS IS ONLY APPLICABLE IF YOU WANT TO ACCESS THOSE FUNCTIONS ###
from importlib.machinery import SourceFileLoader
!git clone https://github.com/theventurecity/data-toolkit.git /tmp/theventurecity
!mv /tmp/theventurecity/python/tvc_transform.py tvc_transform.py
!mv /tmp/theventurecity/python/tvc_load_colab.py tvc_load_colab.py
!rm -r /tmp/theventurecity
tvct = SourceFileLoader('tvc_transform', 'tvc_transform.py').load_module()
tvcl = SourceFileLoader('tvc_load_colab', 'tvc_load_colab.py').load_module()

Cloning into '/tmp/theventurecity'...
remote: Enumerating objects: 108, done.
remote: Counting objects: 100% (108/108), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 108 (delta 53), reused 62 (delta 19), pack-reused 0
Receiving objects: 100% (108/108), 37.93 MiB | 21.58 MiB/s, done.
Resolving deltas: 100% (53/53), done.


## 1. Extract raw event log data from a CSV

This example uses a data file for a Sample Company from our GitHub repository. In this step we read the data file into memory as a Pandas dataframe we name "t."

In [2]:
filename = 'https://raw.githubusercontent.com/theventurecity/Analytics/master/data/ServBiz_transactions.csv'
t = pd.read_csv(filename)
t.tail(10)

Unnamed: 0,client_id,date,value_usd,segment
420781,27902A,2019-02-28,8.75,Enterprise
420782,34181A,2019-02-28,18.97,SMB
420783,30168A,2019-02-28,17.73,SMB
420784,30844A,2019-02-28,19.98,SMB
420785,35815A,2019-02-28,17.98,SMB
420786,16958A,2019-02-28,17.45,SMB
420787,13090A,2019-02-28,13.48,SMB
420788,19162A,2019-02-28,13.64,Enterprise
420789,28409A,2019-02-28,14.72,SMB
420790,12080A,2019-02-28,18.32,SMB


## 2. Transform the raw data into engagement analysis dataframes
**Note**: For a more detailed discussion about creating the DAU and DAU Decorated dataframes, complete with inline code, visit [Create the DAU Decorated Data Set](https://colab.research.google.com/drive/12uehG2EcIqxcTazKs-pNQRTQSckllOmE)
### 2.1 Create Daily Active Users (DAU) dataframe
The **DAU** dataframe aggregates all activity by user and day. 

In [3]:
# Run the create_dau_df function and show the first ten rows of the resulting dataframe
dau = tvct.create_dau_df(t, 
                         user_id = 'client_id', 
                         activity_date = 'date', 
                         inc_amt = 'value_usd',
                         segment_col = 'segment'
                        )
dau.head(10)

Unnamed: 0,user_id,activity_date,segment,inc_amt
0,10000A,2015-10-14,SMB,11.75
1,10001A,2015-10-14,SMB,13.75
2,10001A,2015-11-02,SMB,7.5
3,10001A,2015-11-22,SMB,18.0
4,10001A,2017-04-04,SMB,6.25
5,10001A,2017-12-08,SMB,8.75
6,10002A,2015-10-14,Enterprise,11.75
7,10002A,2015-10-26,Enterprise,12.25
8,10002A,2015-11-16,Enterprise,12.25
9,10002A,2015-11-23,Enterprise,12.25
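
The source of `tvct.create_dau_df` is not shown in this notebook. As a rough sketch of the aggregation described above, and assuming it amounts to a group-by-and-sum per user and day, the following uses only pandas. The function name `sketch_dau` and the three sample rows are hypothetical and are not part of the toolkit.

```python
import pandas as pd

def sketch_dau(events, user_id, activity_date, inc_amt, segment_col):
    """Hypothetical stand-in: one row per user, day, and segment with summed income."""
    # Rename the caller's columns to the standard names used in the DAU dataframe
    renamed = events.rename(columns={user_id: 'user_id',
                                     activity_date: 'activity_date',
                                     inc_amt: 'inc_amt',
                                     segment_col: 'segment'})
    # Collapse multiple events by the same user on the same day into one row
    return (renamed
            .groupby(['user_id', 'activity_date', 'segment'], as_index=False)['inc_amt']
            .sum())

# Made-up sample: client 10001A has two events on the same day
sample = pd.DataFrame({
    'client_id': ['10001A', '10001A', '10002A'],
    'date': ['2015-10-14', '2015-10-14', '2015-10-14'],
    'value_usd': [10.00, 3.75, 11.75],
    'segment': ['SMB', 'SMB', 'Enterprise'],
})
dau_sketch = sketch_dau(sample, 'client_id', 'date', 'value_usd', 'segment')
print(dau_sketch)
```

In this sample the two events for `10001A` collapse into a single row, so the result has one row per user and day.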


### 2.2 - 2.3 Calculate First Date and DAU Decorated dataframes
The `create_dau_decorated_df` function calls `create_first_dt_df` itself if no first_dt dataframe is passed in, so a single call covers both steps.

In [4]:
# Run the create_dau_decorated_df function and show the first ten rows of the resulting dataframe
dau_decorated = tvct.create_dau_decorated_df(dau)
dau_decorated.head(10)

Creating DAU Decorated dataframe
Creating first_dt dataframe


Unnamed: 0,user_id,activity_date,segment,inc_amt,first_dt,first_week,first_month
0,10000A,2015-10-14,SMB,11.75,2015-10-14,2015-10-12/2015-10-18,2015-10
1,10001A,2015-10-14,SMB,13.75,2015-10-14,2015-10-12/2015-10-18,2015-10
2,10001A,2015-11-02,SMB,7.5,2015-10-14,2015-10-12/2015-10-18,2015-10
3,10001A,2015-11-22,SMB,18.0,2015-10-14,2015-10-12/2015-10-18,2015-10
4,10001A,2017-04-04,SMB,6.25,2015-10-14,2015-10-12/2015-10-18,2015-10
5,10001A,2017-12-08,SMB,8.75,2015-10-14,2015-10-12/2015-10-18,2015-10
6,10002A,2015-10-14,Enterprise,11.75,2015-10-14,2015-10-12/2015-10-18,2015-10
7,10002A,2015-10-26,Enterprise,12.25,2015-10-14,2015-10-12/2015-10-18,2015-10
8,10002A,2015-11-16,Enterprise,12.25,2015-10-14,2015-10-12/2015-10-18,2015-10
9,10002A,2015-11-23,Enterprise,12.25,2015-10-14,2015-10-12/2015-10-18,2015-10


Combining the basic DAU data with the first date, week, and month for each user, **the DAU Decorated dataframe is our basic building block for many different analyses**. It allows us to use user-level data to inspect engagement, retention, and growth accounting.
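
To make the "decoration" step concrete, here is a minimal pandas sketch of how a first date, first week, and first month could be attached to each row. It is an illustration under our own assumptions, not the toolkit's implementation; the sample rows and the variable names `dau_sample` and `decorated` are hypothetical.

```python
import pandas as pd

# Made-up DAU rows for two users
dau_sample = pd.DataFrame({
    'user_id': ['10001A', '10001A', '10002A'],
    'activity_date': pd.to_datetime(['2015-10-14', '2015-11-02', '2015-10-26']),
    'inc_amt': [13.75, 7.50, 12.25],
})

decorated = dau_sample.copy()

# Each user's earliest activity date, repeated on every row for that user
decorated['first_dt'] = decorated.groupby('user_id')['activity_date'].transform('min')

# The week and month that contain that first date
decorated['first_week'] = decorated['first_dt'].dt.to_period('W')
decorated['first_month'] = decorated['first_dt'].dt.to_period('M')

print(decorated)
```

Because `first_dt` is repeated on every row for a user, any later activity can be compared directly with the period in which that user first appeared.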

### 2.4 Calculate cohort analysis metrics
Now that we have the DAU Decorated dataframe, we can use it to calculate cohort retention and LTV metrics, which are especially important for an early-stage startup. The **create_xau_decorated_df** function below creates a monthly roll-up of the DAU Decorated data, which we call MAU Decorated.

In [5]:
### For discrete time period calculations, this helps set the variable names
### in the different dataframes 
def get_time_period_dict(time_period):
    
    time_fields_dict = {
                        'day' : {'grouping_col' : 'activity_date',
                                  'first_period_col' : 'first_dt',
                                  'frequency' : 'Daily',
                                  'unit' : 'Day',
                                  'period_abbr' : 'D',
                                  'python_period' : 'days',
                                  'days' : 1
                                  },
                        'week' : {'grouping_col' : 'Week',
                                  'first_period_col' : 'first_week',
                                  'frequency' : 'Weekly',
                                  'unit' : 'Week',
                                  'period_abbr' : 'W',
                                  'python_period' : 'weeks',
                                  'days' : 7
                                  },
                        'month' : {'grouping_col' : 'Month_Year',
                                   'first_period_col' : 'first_month',
                                   'frequency' : 'Monthly',
                                   'unit' : 'Month',
                                   'period_abbr' : 'M',
                                   'python_period' : 'months',
                                   'days' : 28
                                  }
                        }
                    
    # if time_period passed in is a valid choice, then return the dictionary
    # associated with that choice from the dictionary above
    if time_period in time_fields_dict:
        time_fields = time_fields_dict[time_period]
    else:
        time_fields = None
    
    return time_fields
  
  
  
### This is another helper function that allows us to determine the next week
### or month for any given week or month. We need this because Pandas
### date math is not consistent: you can use timedelta to add weeks, but you 
### have to use DateOffset to add months.
def increment_period(xau_grouping_col, time_period):
  
    # Call the get_time_period_dict function above, passing in the time period
    time_fields = get_time_period_dict(time_period)
    
    # Set the one-letter period abbreviation to whatever that function returns
    period_abbr = time_fields['period_abbr']
    
    # Depending on the time period, increment the week or month by one after
    # first converting it to the date time of the start of the period
    if time_period == 'week':
        start_of_next_period = pd.to_datetime(pd.PeriodIndex(xau_grouping_col).start_time + timedelta(weeks = 1))
    elif time_period == 'month':
        start_of_next_period = pd.to_datetime(pd.PeriodIndex(xau_grouping_col, freq = period_abbr).start_time) + pd.DateOffset(months = 1)
    else:
        start_of_next_period = None
        
    # Convert the date time from the previous step back into the appropriate 
    # period (week or month) 
    if start_of_next_period is not None:
        next_period = pd.Series(start_of_next_period).dt.to_period(period_abbr)
    else:
        next_period = None
    
    # Return the next period
    return next_period
  
  
### create_xau_decorated_df is a generic function that allows us to find WAU
### Decorated or MAU Decorated from a DAU Decorated based on the time period
### that gets passed in
def create_xau_decorated_df(dau_decorated_df, time_period, use_segment):
    
    # These are the parameters that are set from the get_time_period_dict 
    # function above
    time_fields = get_time_period_dict(time_period)
    grouping_col = time_fields['grouping_col']
    frequency = time_fields['frequency']
    first_period_col = time_fields['first_period_col']
    period_abbr = time_fields['period_abbr']
    
    # Print a notification message indicating that this function has been called
    print('Creating ' + frequency + ' Active Users Decorated dataframe')
    
    # We are grouping by the grouping_col (which is either "Week" or "Month_Year"),
    # the user_id, and the first_period_col (either "first_week" or "first_month")
    # For each user_id, there is one and only one first_period_col.
    
    groupby_cols = [grouping_col, 'user_id', first_period_col]
    
    # If we are using the segment column, we include that in the groupby_cols
    # list as well
    if use_segment:
        groupby_cols = groupby_cols + ['segment']
        
    # Start by making a copy of the dataframe that gets passed in so as not to
    # affect the original
    dau_decorated = dau_decorated_df.copy()
    
    # Convert the activity_date for each transaction in dau_decorated to a 
    # period the same timeframe as the period in question (either a week or a 
    # month)
    dau_decorated[grouping_col] = pd.to_datetime(dau_decorated['activity_date']).dt.to_period(period_abbr)
    
    # Group dau_decorated into the grouping_cols defined above and aggregate the
    # sum of the inc_amt field
    xau = (dau_decorated.groupby(groupby_cols, as_index = False)['inc_amt'].sum())
    
    # Set a new column with the next time period by calling the increment_period
    # function defined above
    xau['Next_' + grouping_col] = increment_period(xau[grouping_col], time_period)
    
    # Select a subset of the resultant columns from the groupby to output
    output_cols = [grouping_col, 'user_id', 'inc_amt', first_period_col, 'Next_' + grouping_col]
    if use_segment:
        output_cols = output_cols + ['segment']
    xau = xau[output_cols]
    
    # Return the resultant dataset
    return xau
  
  
# Run create_xau_decorated_df for MAU's (using 'month'), unsegmented
mau_decorated = create_xau_decorated_df(dau_decorated, 'month', use_segment=False)
mau_decorated.tail(10)

Creating Monthly Active Users Decorated dataframe


Unnamed: 0,Month_Year,user_id,inc_amt,first_month,Next_Month_Year
93658,2019-02,9784A,533.66,2015-09,2019-03
93659,2019-02,9794A,24.72,2015-09,2019-03
93660,2019-02,9808A,85.12,2015-09,2019-03
93661,2019-02,9868A,65.75,2015-10,2019-03
93662,2019-02,9876A,19.48,2015-10,2019-03
93663,2019-02,9902A,39.76,2015-10,2019-03
93664,2019-02,9952A,53.5,2015-10,2019-03
93665,2019-02,9986A,116.64,2015-10,2019-03
93666,2019-02,9989A,21.28,2015-10,2019-03
93667,2019-02,9995A,85.12,2015-10,2019-03


The **create_xau_cohort_df** function below helps us calculate what happens to the cohort of users that comes in as a new customer each month. It looks at how many of them continue to use the product (expressed as a number and as a percentage), how much income they generate each month, the income per customer, and the cumulative income per customer. This information leads to several insightful visuals as we will see below.

In [6]:
### Calculate the user retention and empirical CLTV using inc_amt by cohort 
### defined by any weekly or monthly time period
def create_xau_cohort_df(xau_decorated_df, 
                         time_period, 
                         use_segment = False,
                         recent_periods_back_to_exclude = 1, 
                         date_limit = None,
                         create_period_n_inc_cols = False,
                         add_hours = False,
                         use_standard_col_names = False):
    
    # These are the parameters that are set from the get_time_period_dict 
    # function above    
    time_fields = get_time_period_dict(time_period)
    grouping_col = time_fields['grouping_col']
    first_period_col = time_fields['first_period_col']
    unit = time_fields['unit']
    period_abbr = time_fields['period_abbr']
        
    # If a date_limit is set, keep only the rows of the xau_decorated_df 
    # dataframe whose period starts on or before that date.
    # Otherwise, make a copy of the entire input xau_decorated_df.
    if date_limit is not None:
        xau_d = (xau_decorated_df[pd.PeriodIndex(xau_decorated_df[grouping_col], 
                                                 freq = period_abbr)
                                  .start_time <= date_limit]
                 .copy()
                )
    else:
        xau_d = xau_decorated_df.copy()
    
    # Set the since_col variable to say "Months Since First" or "Weeks Since First"
    since_col = '%ss Since First' % unit
    
    # Calculate the value in the since_col to be the number of periods between
    # the current period and the user's first period
    xau_d[since_col] = xau_d[grouping_col] - xau_d[first_period_col]
    
    # Since we are aggregating it all by the cohort of users that started in a
    # particular period, we set the group by columns for the first aggregation
    # as the first_period_col, grouping_col, and since_col
    # first_period_col = {'first_month' | 'first_week' | 'first_day'}
    # grouping_col = {'Month_Year' | 'Week' | 'activity_date'}
    # since_col = {'Months Since First' | 'Weeks Since First' | 'Days Since First'}
    first_groupby_cols = [first_period_col, grouping_col, since_col]
    
    # If we are including the segment in the calculations, include it in the 
    # groupby columns
    if use_segment:
        first_groupby_cols = first_groupby_cols + ['segment']
    
    # Group xau_d by the first_groupby_cols to find the sum of inc_amt and
    # the number of unique user_ids in each grouping
    xau_d = xau_d.groupby(first_groupby_cols)\
                    .agg({'inc_amt' : 'sum', 
                          'user_id' : 'nunique'})\
                    .rename(columns = { 'user_id' : 'cust_ct' })
                    
    # For the second groupby, we reduce the columns down to the first_period_col
    # and the segment (if applicable)
    second_groupby_cols = [first_period_col]
    if use_segment:
        second_groupby_cols = second_groupby_cols + ['segment']
    
    # The second groupby is used to make calculations about the cohort as a whole
    # rather than at the individual period level
    # The first of such calculations is to take the first value for cust_ct
    # (customer count) as being the number of customers in the cohort
    xau_d['cohort_cust_ct'] = xau_d.groupby(second_groupby_cols)['cust_ct'].transform('first')
    
    # The second calculation at the cohort level is to get the cumulative sum
    # of the inc_amt for each period's cohort
    xau_d['cum_inc_amt'] = xau_d.groupby(second_groupby_cols)['inc_amt'].cumsum()
    
    # These ratios are calculated per period using the per-cohort numbers calculated
    # using the second groupby
    # First we calculate cum_inc_per_cohort_cust (cumulative income per cohort
    # customer) as that period's cumulative inc_amt divided by the total number 
    # of customers in the first period of the cohort
    xau_d['cum_inc_per_cohort_cust'] = xau_d['cum_inc_amt'] / xau_d['cohort_cust_ct']
    
    # We also calculate what share of the original cohort customers are
    # still active in the current period. This is the cust_ret_pct (customer
    # retention percentage)
    xau_d['cust_ret_pct'] = xau_d['cust_ct'] / xau_d['cohort_cust_ct']
    
    # Reset the index on the Pandas dataframe
    xau_d = xau_d.reset_index()
    
    # The code below removes rows, adds columns, or tweaks some of the time
    # columns to allow for presentation in certain cases.
    
    # If the time period in question is 'day', then the logic works somewhat
    # differently than it does if it is a 'week' or 'month'. Week/Month are in 
    # the top "if" clause, while Day is in the "else" clause.
    if time_period != 'day':
      
        # If the time period is a month, and we have recent periods back to 
        # exclude (an input parameter), we use this code to exclude those 
        # periods from xau_d. The periods back is measured from TODAY's date.
        if time_period == 'month':
            td = pd.DateOffset(months = recent_periods_back_to_exclude)
        elif time_period == 'week':
            td = timedelta(weeks = recent_periods_back_to_exclude)
        
        last_period = pd.to_datetime(datetime.today() - td).to_period(period_abbr)
        xau_d = xau_d.loc[xau_d[grouping_col] <= last_period]
        
        # If we want to add seven hours to the first_period_col and grouping_col
        # we do so here if the add_hours input parameter is True. It defaults
        # to False
        if add_hours:
            xau_d[first_period_col] = (pd.PeriodIndex(xau_d[first_period_col], 
                                                      freq = period_abbr)
                                       .start_time + timedelta(hours = 7))
            xau_d[grouping_col] = (pd.PeriodIndex(xau_d[grouping_col], 
                                                  freq = period_abbr)
                                   .start_time + timedelta(hours = 7))
        
        # Using the segment column requires us to specify which segment each
        # first_period_col goes with. 
        # Creator's note: needs string to handle the weekly case as well as the monthly
        if use_segment:
            xau_d['segment_first_' + time_period] = (xau_d[first_period_col]
                                                     .dt
                                                     .strftime('%Y-%m') + '-' + xau_d['segment'])  
        
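        # If we want to add new columns for weekly trend analysis, we would do
        # so by setting the create_period_n_inc_cols equal to True, which would
        # trigger the add_period_n_cum_inc_per_cohort_cust_columns function
        # to run. The parameter defaults to False. Note: that helper is not
        # defined in this notebook, so leave the parameter at False here
        # unless you define the helper first.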
        if create_period_n_inc_cols:
            xau_d = add_period_n_cum_inc_per_cohort_cust_columns(xau_d, since_col, unit)

    else:
        
        # In the case of days, excluding the last period involves using 
        # timedelta to subtract days from today's date
        last_period = (datetime.today() - timedelta(days = recent_periods_back_to_exclude)).date()
        xau_d = xau_d.loc[xau_d[grouping_col] <= last_period]
        
        # The add_hours piece uses timedelta(hours = 7)
        if add_hours:
            xau_d[first_period_col] = xau_d[first_period_col] + timedelta(hours = 7)
            xau_d[grouping_col] = xau_d[grouping_col] + timedelta(hours = 7)
            
        # Change the data type on the since_col to be a number and not a timedelta    
        xau_d[since_col] = xau_d[since_col].astype(timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
        
        # Using the segment column requires us to specify which segment each
        # first_period_col goes with. 
        if use_segment:
            xau_d['segment_first_' + time_period] = (xau_d[first_period_col]
                                                     .dt
                                                     .strftime('%Y-%m-%d') + '-' + xau_d['segment'])
        
        # Similar to above for adding weekly trend analysis, we would do
        # so by setting the create_period_n_inc_cols equal to True, which would
        # trigger the add_period_n_cum_inc_per_cohort_cust_columns function
        # to run. The parameter defaults to False.
        if create_period_n_inc_cols:
            xau_d = add_period_n_cum_inc_per_cohort_cust_columns(xau_d, since_col, unit)
            
    # This last bit of code changes column names if the input parameter
    # use_standard_col_names is True. It defaults to False
    if use_standard_col_names:
        mapping = {'month_year' : 'month',
                   'cust_ret_pct' : 'retained_pctg',
                   since_col : time_period + 's_since_first'
                   }
        
        cols = xau_d.columns
        new_xau_cols = []
        for c in cols:
            new_c = '_'.join(c.lower().split(' '))
            if new_c in mapping:
                new_c = mapping[new_c]
            new_xau_cols.append(new_c)
        print(new_xau_cols)
        xau_d.columns = new_xau_cols

    
    return xau_d

  
mau_retention_by_cohort = create_xau_cohort_df(mau_decorated, 'month')
mau_retention_by_cohort.tail(10)

Unnamed: 0,first_month,Month_Year,Months Since First,inc_amt,cust_ct,cohort_cust_ct,cum_inc_amt,cum_inc_per_cohort_cust,cust_ret_pct
1071,2018-11,2018-11,0,17178.12,518,518,17178.12,33.162394,1.0
1072,2018-11,2018-12,1,20136.67,356,518,37314.79,72.036274,0.687259
1073,2018-11,2019-01,2,18325.46,300,518,55640.25,107.41361,0.579151
1074,2018-11,2019-02,3,14171.36,263,518,69811.61,134.771448,0.507722
1075,2018-12,2018-12,0,26134.01,740,740,26134.01,35.31623,1.0
1076,2018-12,2019-01,1,31004.43,497,740,57138.44,77.214108,0.671622
1077,2018-12,2019-02,2,22879.87,379,740,80018.31,108.132851,0.512162
1078,2019-01,2019-01,0,20557.47,583,583,20557.47,35.261527,1.0
1079,2019-01,2019-02,1,21454.77,388,583,42012.24,72.062161,0.665523
1080,2019-02,2019-02,0,17109.64,537,537,17109.64,31.861527,1.0
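
The retention and cumulative-income ratios can be checked by hand against the output above. The sketch below rebuilds a few of those rows (the `cust_ct` and `inc_amt` values are copied from the table), recomputes the ratios, and pivots retention into a cohort-by-month matrix. The variable names `rows` and `retention_matrix` are illustrative only.

```python
import pandas as pd

# A few rows copied from the mau_retention_by_cohort output above
rows = pd.DataFrame({
    'first_month': ['2018-11', '2018-11', '2018-11', '2018-12', '2018-12', '2019-01'],
    'Months Since First': [0, 1, 2, 0, 1, 0],
    'cust_ct': [518, 356, 300, 740, 497, 583],
    'inc_amt': [17178.12, 20136.67, 18325.46, 26134.01, 31004.43, 20557.47],
})

# Cohort size is the customer count in the cohort's first period
rows['cohort_cust_ct'] = rows.groupby('first_month')['cust_ct'].transform('first')

# Share of the original cohort still active in each period
rows['cust_ret_pct'] = rows['cust_ct'] / rows['cohort_cust_ct']

# Running income per original cohort customer
rows['cum_inc_amt'] = rows.groupby('first_month')['inc_amt'].cumsum()
rows['cum_inc_per_cohort_cust'] = rows['cum_inc_amt'] / rows['cohort_cust_ct']

# One row per cohort, one column per Months Since First
retention_matrix = rows.pivot(index='first_month',
                              columns='Months Since First',
                              values='cust_ret_pct')
print(retention_matrix.round(4))
```

For the November 2018 cohort, for example, Month 1 retention is 356 / 518, or about 0.6873, which matches the `cust_ret_pct` value in the table. Cells for periods a cohort has not yet reached are left empty (NaN) in the matrix.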


## 3. Load the transformed data into Google Sheets 
### Establish connection to Google Sheets for writing output files
The first time you run this cell, or after some time of inactivity, you will be asked to click on a link. That link will take you to a new tab that will authorize this script to write to Google Sheets spreadsheets in your Google Account. To enable this feature, copy the code you get into the box below and hit Enter.

In [0]:
auth.authenticate_user()
gc = gspread.authorize(GoogleCredentials.get_application_default())

### Function to facilitate writing from Pandas to Google Sheets
For use further down in this notebook.

In [0]:
def write_to_google_sheet(dataframe, spreadsheet_key, worksheet_name, goog_creds = gc):
  
  sh = goog_creds.open_by_key(spreadsheet_key)
  
  ws = None
  worksheet_list = sh.worksheets()
  for worksheet in worksheet_list:
    if worksheet.title == worksheet_name:
      ws = worksheet
      
  if ws is None:
    ws = sh.add_worksheet(title = worksheet_name, rows="1", cols = "1")
    
  set_with_dataframe(ws, dataframe, row=1, col=1, include_index=False, 
                     include_column_header=True, resize=True, allow_formulas=True)

### Create Google Sheet for writing output files if one is not already established

***Warning: you must choose either Option A or Option B below, not both***.

**Option A**: If you already have a Google Sheet where you store transformed data for analytics, go to that sheet, copy the long ID string from the sheet's URL in a browser, and paste it between the quotes below. Then make sure the Option A code is uncommented and run the cell. If you use Option A, leave the Option B code commented out and do NOT run it.

This option is especially important if you have previously linked a reporting dashboard to this Google Sheet and would like to use this ETL process to update that dashboard.

In [0]:
### Option A Code
GOOGLE_SHEET_KEY = '1-XnO_eWkRwX-E1fiA2Jkbe3kJvoyoPFsdeW7vnF6zS0' 

**Option B**: If you are running this for the first time and do not already have a Google Sheet where you store transformed data for analytics, uncomment and run the cell below to set the GOOGLE_SHEET_KEY variable, which will be used later in the process. Be sure to replace the "Sample Company Analytics" filename with one of your own.

In [0]:
### Option B Code
# GOOGLE_SHEET_KEY = gc.create('Sample Company Analytics').id

After executing the cell above, a new spreadsheet with the name you supply will be shown in your sheets list on [sheets.google.com](http://sheets.google.com/). To go directly to the newly created Google Sheet, run the code block below and visit the link that it outputs. You can keep this Sheet open in a separate tab to see the data get updated whenever **write_to_google_sheet** (defined above) is called.

In [11]:
print('https://docs.google.com/spreadsheets/d/' + GOOGLE_SHEET_KEY)

https://docs.google.com/spreadsheets/d/1-XnO_eWkRwX-E1fiA2Jkbe3kJvoyoPFsdeW7vnF6zS0


In [0]:
write_to_google_sheet(mau_retention_by_cohort, GOOGLE_SHEET_KEY, 'MAU Retention by Cohort')

## 4. Visualize insights in Google Data Studio
A Google Data Studio dashboard, preconfigured to read from the Google Sheet created above, is [available at this link](https://datastudio.google.com/open/1xjS__Q6ZUXuUUARkgRvY4spYUw1ePksV) or by clicking on the Google Data Studio logo at the bottom of the chart embedded below. It is available in read-only mode for you to copy, link to your own Google Sheet tabs, and see your own data visualized. 

Note: sometimes there is an issue embedding the Google Data Studio dashboard in this notebook, particularly when it is rendered in the GitHub notebook reader. In that case, click on the "Open in Colab" button at the very top of the notebook to see a notebook that can embed the dashboard properly.
### 4.1 Monthly Cohort User Retention
Cohort user retention starts from the fact that, typically, not all users are retained from Month 0 to Month 1, and then to Month 2, 3, 4, and so on. The chart below depicts this as a downward curve. If you mouse over the bars, you can see that the January 2018 cohort at the farthest right is at 23.7% user retention in Month 13, compared with 60.94% for that same cohort in Month 1.

Sometimes this data is depicted as downward curving lines, one for each cohort. We prefer the bar chart because (a) the lines often end up looking jumbled and spaghetti-like; and (b) the bars allow us to see the retention trend within each Months Since First. For example, the most recent Month 1 retention of 66.55% for the January 2019 cohort is down from the Month 1 peak of 72.8% for the July 2018 cohort. In this way, we can see the general downward slope that the spaghetti lines would give us, and also the trends within each Months Since First.

In [13]:
IFrame('https://datastudio.google.com/embed/reporting/1xjS__Q6ZUXuUUARkgRvY4spYUw1ePksV/page/EOyj', 
       width=600, 
       height=450)
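
To follow the trend within a single Months Since First value outside the dashboard, one option is to filter the cohort dataframe to that value and compare consecutive cohorts. The sketch below uses only the three Month 1 rows visible in the `mau_retention_by_cohort` output in section 2.4, so it covers a short window rather than the full history; the variable names are illustrative.

```python
import pandas as pd

# Month 1 rows copied from the mau_retention_by_cohort output in section 2.4
cohorts = pd.DataFrame({
    'first_month': ['2018-11', '2018-12', '2019-01'],
    'Months Since First': [1, 1, 1],
    'cust_ret_pct': [0.687259, 0.671622, 0.665523],
})

# Keep only Month 1 and index by cohort so the values read as a time series
month_1 = (cohorts[cohorts['Months Since First'] == 1]
           .set_index('first_month')['cust_ret_pct'])

# Change in Month 1 retention relative to the previous cohort
change_vs_prior_cohort = month_1.diff()

print(pd.DataFrame({'month_1_retention': month_1,
                    'change_vs_prior_cohort': change_vs_prior_cohort}))
```

Within these three cohorts, each Month 1 figure is lower than the one before it, consistent with the decline described above.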

### 4.2 Cumulative Customer Revenue LTV by Monthly Cohort
The visual below shows the cumulative inc_amt *per original cohort customer* that each cohort has generated in the time it has been active. These lines show how well a group of ServBiz's users keeps generating revenue after they are initially acquired. This gives us an empirical way to look at customer long-term value (LTV) over a range of possible scenarios. For example, if you mouse over 16 Months Since First in the chart below, the four cohorts that have been around that long range from `$`440.47 to `$`588.87 per cohort customer. We can use that as a 16-month LTV range for projection or for calculating an LTV-to-CAC ratio.

Another feature of the curves below is that, while they are slightly bending downward, the rate of change in the slope is very gradual. This implies that ServBiz's customers are active for a long period of time, meaning that the time horizon for calculating LTV may be extended and that these customers are quite valuable in the long term.

In [14]:
IFrame('https://datastudio.google.com/embed/reporting/1xjS__Q6ZUXuUUARkgRvY4spYUw1ePksV/page/pVBk', 
       width=600, 
       height=450)
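
As a worked example of the LTV-to-CAC calculation mentioned above, the sketch below divides the 16-month LTV range quoted in the text by a customer acquisition cost. The CAC value of 150.00 is purely hypothetical and is not a ServBiz figure; substitute your own number. The helper name `ltv_to_cac` is also ours.

```python
# 16-month cumulative income per cohort customer, as quoted in the text above
ltv_16m_low = 440.47
ltv_16m_high = 588.87

# ASSUMPTION: hypothetical customer acquisition cost, for illustration only
assumed_cac = 150.00

def ltv_to_cac(ltv, cac):
    """Return the LTV-to-CAC ratio; CAC must be positive."""
    if cac <= 0:
        raise ValueError('cac must be greater than zero')
    return ltv / cac

low_ratio = ltv_to_cac(ltv_16m_low, assumed_cac)
high_ratio = ltv_to_cac(ltv_16m_high, assumed_cac)

print('16-month LTV-to-CAC range: %.2f to %.2f' % (low_ratio, high_ratio))
```

With this assumed CAC, the ratio ranges from roughly 2.94 to 3.93. Because the LTV here is empirical and capped at 16 months, the ratio is a conservative floor if customers keep generating income beyond that horizon.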