<a href="https://colab.research.google.com/github/theventurecity/data-toolkit/blob/master/Mini_Pipeline_Engagement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://theventure.city"><img src="https://github.com/theventurecity/data-toolkit/blob/master/img/tvc_horiz_junglegreen.png?raw=true" alt='TheVentureCity' style="width: 400px;"></a>

# Mini-Pipeline: Engagement
1. Extract raw event log data from a CSV
2. Transform that data into engagement analysis dataframes
3. Load the transformed data into Google Sheets
4. Visualize insights in Google Data Studio

## Before you begin

- This notebook is shared in Google Colaboratory with read-only access. To run this notebook yourself, first click "**Open in Playground**" in the toolbar above. That will create a separate instance that you can run and/or save a copy of to your own Google Drive.

- To run each cell, hit **Shift-Enter**, which will run the contents of the active cell and move to the next cell. This includes the Markdown cells (such as this one).

- When you run the first block of Python code, you will get a message that says, "**Warning: This notebook was not authored by Google.**" Please be aware that we are **NOT** accessing your data shared with Google or reading data and credentials from other sessions. This notebook reads data from GitHub and writes to a Google Sheet that only you have access to and can control. We recommend you click the box to "**Reset all runtimes before running**" for extra information security.

## Import relevant Python libraries

In [1]:
### To run this with functions from TheVentureCity's GitHub repository,
### clone the repository to the Google Colaboratory runtime environment.
from importlib.machinery import SourceFileLoader
!git clone https://github.com/theventurecity/data-toolkit.git /tmp/theventurecity

### Installs all the necessary library versions for this environment
!pip install -r /tmp/theventurecity/requirements.txt

### Import the code. This code allows us to run pre-existing functions
### rather than having to define them inline within the notebook
### THIS IS ONLY APPLICABLE IF YOU WANT TO ACCESS THOSE FUNCTIONS ###
!mv /tmp/theventurecity/python/tvc_load_colab.py tvc_load_colab.py
!mv /tmp/theventurecity/python/tvc_transform.py tvc_transform.py
!rm -r /tmp/theventurecity
tvcl = SourceFileLoader('tvc_load_colab', 'tvc_load_colab.py').load_module()
tvct = SourceFileLoader('tvc_transform', 'tvc_transform.py').load_module()

### Pandas, Numpy, and date functions to read the data from its source
### and manipulate it in memory
import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import datetime
import math

### The IPython.display library allows us to embed an iFrame within this
### notebook
from IPython.display import IFrame

Cloning into '/tmp/theventurecity'...
remote: Enumerating objects: 389, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 389 (delta 6), reused 0 (delta 0), pack-reused 377
Receiving objects: 100% (389/389), 63.65 MiB | 9.11 MiB/s, done.
Resolving deltas: 100% (240/240), done.
Updating files: 100% (28/28), done.
Collecting gspread==6.1.0 (from -r /tmp/theventurecity/requirements.txt (line 3))
  Downloading gspread-6.1.0-py3-none-any.whl (55 kB)
Collecting gspread-formatting==1.1.2 (from -r /tmp/theventurecity/requirements.txt (line 5))
  Downloading gspread_formatting-1.1.2-py2.py3-none-any.whl (22 kB)
Collecting google-auth-httplib2==0.2.0 (from -r /tmp/theventurecity/requirements.txt (line 8))
  Downloading google_auth_httplib2-0.2.0-py2.py3-none-any.whl (9.3 kB)
Collecting google-api-pyt

## 1. Extract raw event log data from a CSV

This example uses a data file from our GitHub repository for a sample company called ServBiz. In this step we read the data file into memory as a Pandas dataframe we name "t."

In [2]:
filename = 'https://raw.githubusercontent.com/theventurecity/analytics/master/data/ServBiz_transactions.csv'
t = pd.read_csv(filename)
t.tail(10)

Unnamed: 0,client_id,date,value_usd,segment
420781,27902A,2024-04-30,8.75,Enterprise
420782,34181A,2024-04-30,18.97,SMB
420783,30168A,2024-04-30,17.73,SMB
420784,30844A,2024-04-30,19.98,SMB
420785,35815A,2024-04-30,17.98,SMB
420786,16958A,2024-04-30,17.45,SMB
420787,13090A,2024-04-30,13.48,SMB
420788,19162A,2024-04-30,13.64,Enterprise
420789,28409A,2024-04-30,14.72,SMB
420790,12080A,2024-04-30,18.32,SMB
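
Before transforming the data, it can help to confirm that the expected columns are present and that the dates parse cleanly. The snippet below is a minimal sketch on a three-row inline sample copied from the output above. The `REQUIRED_COLUMNS` name and the checks themselves are our own assumptions for illustration; they are not part of the toolkit.

```python
import io
import pandas as pd

# Three rows copied from the tail of the ServBiz output above
sample_csv = io.StringIO(
    "client_id,date,value_usd,segment\n"
    "27902A,2024-04-30,8.75,Enterprise\n"
    "34181A,2024-04-30,18.97,SMB\n"
    "30168A,2024-04-30,17.73,SMB\n"
)

# Hypothetical list of the columns the later steps rely on
REQUIRED_COLUMNS = ['client_id', 'date', 'value_usd', 'segment']

sample = pd.read_csv(sample_csv, parse_dates=['date'])

# Fail early if a column the pipeline needs is absent
missing = [c for c in REQUIRED_COLUMNS if c not in sample.columns]
assert not missing, f"Missing columns: {missing}"

# Rows with unparseable dates or non-positive amounts would deserve a look
n_bad_dates = int(sample['date'].isna().sum())
n_bad_amounts = int((sample['value_usd'] <= 0).sum())
print(n_bad_dates, n_bad_amounts)
```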


## 2. Transform the raw data into engagement analysis dataframes
**Note**: For a more detailed discussion about creating the DAU and DAU Decorated dataframes, complete with inline code, visit [Create the DAU Decorated Data Set](https://colab.research.google.com/drive/12uehG2EcIqxcTazKs-pNQRTQSckllOmE)
### 2.1 Create Daily Active Users (DAU) dataframe
The **DAU** dataframe aggregates all activity by user and day.

In [3]:
# Run the create_dau_df function and show the first ten rows of the resulting dataframe
dau = tvct.create_dau_df(t,
                         user_id = 'client_id',
                         activity_date = 'date',
                         inc_amt = 'value_usd',
                         segment_col = 'segment'
                        )
dau.head(10)

Unnamed: 0,user_id,activity_date,segment,inc_amt
0,10000A,2020-12-14,SMB,11.75
1,10001A,2020-12-14,SMB,13.75
2,10001A,2021-01-02,SMB,7.5
3,10001A,2021-01-22,SMB,18.0
4,10001A,2022-06-05,SMB,6.25
5,10001A,2023-02-08,SMB,8.75
6,10002A,2020-12-14,Enterprise,11.75
7,10002A,2020-12-26,Enterprise,12.25
8,10002A,2021-01-16,Enterprise,12.25
9,10002A,2021-01-23,Enterprise,12.25
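
To make the aggregation concrete, here is a minimal sketch of the kind of group-by that produces one row per user per day. It is an assumption about the general technique, not the actual body of `tvct.create_dau_df`, and the toy `events` rows are invented for illustration.

```python
import pandas as pd

# Toy event log: user 10001A has two transactions on the same day
events = pd.DataFrame({
    'client_id': ['10001A', '10001A', '10001A', '10002A'],
    'date': ['2020-12-14', '2020-12-14', '2021-01-02', '2020-12-14'],
    'value_usd': [10.00, 3.75, 7.50, 11.75],
    'segment': ['SMB', 'SMB', 'SMB', 'Enterprise'],
})

# Rename to the standard column names, then collapse to one row
# per user, day, and segment by summing the income amount
dau_sketch = (events
              .rename(columns={'client_id': 'user_id',
                               'date': 'activity_date',
                               'value_usd': 'inc_amt'})
              .groupby(['user_id', 'activity_date', 'segment'], as_index=False)
              ['inc_amt'].sum()
              )
print(dau_sketch)
```

The two same-day transactions for user 10001A collapse into a single row, which is what lets later steps count active days rather than transactions.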


### 2.2 - 2.3 Calculate First Date and DAU Decorated dataframes
The create_dau_decorated_df function calls the create_first_dt_df function if no first_dt dataframe is specified.

In [4]:
# Run the create_dau_decorated_df function and show the first ten rows of the resulting dataframe
dau_decorated = tvct.create_dau_decorated_df(dau)
dau_decorated.head(10)

Creating DAU Decorated dataframe
Creating first_dt dataframe


Unnamed: 0,user_id,activity_date,segment,inc_amt,first_dt,first_week,first_month
0,10000A,2020-12-14,SMB,11.75,2020-12-14,2020-12-14/2020-12-20,2020-12
1,10001A,2020-12-14,SMB,13.75,2020-12-14,2020-12-14/2020-12-20,2020-12
2,10001A,2021-01-02,SMB,7.5,2020-12-14,2020-12-14/2020-12-20,2020-12
3,10001A,2021-01-22,SMB,18.0,2020-12-14,2020-12-14/2020-12-20,2020-12
4,10001A,2022-06-05,SMB,6.25,2020-12-14,2020-12-14/2020-12-20,2020-12
5,10001A,2023-02-08,SMB,8.75,2020-12-14,2020-12-14/2020-12-20,2020-12
6,10002A,2020-12-14,Enterprise,11.75,2020-12-14,2020-12-14/2020-12-20,2020-12
7,10002A,2020-12-26,Enterprise,12.25,2020-12-14,2020-12-14/2020-12-20,2020-12
8,10002A,2021-01-16,Enterprise,12.25,2020-12-14,2020-12-14/2020-12-20,2020-12
9,10002A,2021-01-23,Enterprise,12.25,2020-12-14,2020-12-14/2020-12-20,2020-12


Combining the basic DAU data with the first date, week, and month for each user, **the DAU Decorated dataframe is our basic building block for many different analyses**. It allows us to use user-level data to inspect engagement, retention, and growth accounting.
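
The decoration step can be sketched as follows. This is a minimal illustration of one way to derive the first date, week, and month per user with pandas; it is not the toolkit's own implementation, and the `dau_toy` dataframe is invented for the example.

```python
import pandas as pd

dau_toy = pd.DataFrame({
    'user_id': ['10001A', '10001A', '10002A'],
    'activity_date': pd.to_datetime(['2020-12-14', '2021-01-02', '2020-12-26']),
})

# Earliest activity date per user, broadcast back onto every row
dau_toy['first_dt'] = dau_toy.groupby('user_id')['activity_date'].transform('min')

# Week and month periods that contain that first date
dau_toy['first_week'] = dau_toy['first_dt'].dt.to_period('W')
dau_toy['first_month'] = dau_toy['first_dt'].dt.to_period('M')
print(dau_toy)
```

Because every row carries the user's first period, cohort-style analyses can group on `first_week` or `first_month` without another join.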

### 2.4 Calculate User Periodic Usage
Now that we have the DAU Decorated dataframe, we can use it to calculate engagement metrics, which are especially important for an early-stage startup. The **calc_user_periodic_usage** function below calculates the number of active days for each user in a 28-day window of time. It outputs a dataframe that holds the active day count for every user, sorted in descending order of inc_amt.

In [5]:
### For discrete time period calculations, this helps set the variable names
### in the different dataframes
def get_time_period_dict(time_period):

    time_fields_dict = {
                        'day' : {'grouping_col' : 'activity_date',
                                  'first_period_col' : 'first_dt',
                                  'frequency' : 'Daily',
                                  'unit' : 'Day',
                                  'period_abbr' : 'D',
                                  'python_period' : 'days',
                                  'days' : 1
                                  },
                        'week' : {'grouping_col' : 'Week',
                                  'first_period_col' : 'first_week',
                                  'frequency' : 'Weekly',
                                  'unit' : 'Week',
                                  'period_abbr' : 'W',
                                  'python_period' : 'weeks',
                                  'days' : 7
                                  },
                        'month' : {'grouping_col' : 'Month_Year',
                                   'first_period_col' : 'first_month',
                                   'frequency' : 'Monthly',
                                   'unit' : 'Month',
                                   'period_abbr' : 'M',
                                   'python_period' : 'months',
                                   'days' : 28
                                  }
                        }

    # if time_period passed in is a valid choice, then return the dictionary
    # associated with that choice from the dictionary above
    if time_period in time_fields_dict:
        time_fields = time_fields_dict[time_period]
    else:
        time_fields = None

    return time_fields



# The calc_user_periodic_usage function takes the dau_decorated dataframe
# calculated above, determines a range of dates using the last_date and
# window_days inputs, and calculates the total number of active periods
# and the total income amount for each user. The breakouts input allows us to
# see if a user_id is above an active periods threshold (T|F).
def calc_user_periodic_usage(dau_decorated_df,
                             time_period,
                             last_date,
                             window_days,
                             breakouts,
                             use_segment
                             ):

    # We need to know the start date of our window. We calculate it by
    # subtracting window_days-1 days from the last_date input parameter

    window_start_date = last_date - timedelta(days = window_days-1)

    # Create a copy of the dau_decorated_df input dataframe that isolates the
    # activity_dates between the start and end dates of the window. Call it xau.

    xau = (dau_decorated_df
           .loc[(dau_decorated_df['activity_date'] >= window_start_date) &
                (dau_decorated_df['activity_date'] <= last_date)]
           .copy()
           )

    # Make sure the activity_date column is a datetime type
    xau['activity_date'] = pd.to_datetime(xau['activity_date'])

    # These are the parameters that are set from the get_time_period_dict
    # function above
    time_fields = get_time_period_dict(time_period)
    period_abbr = time_period[0] # first letter of the time period (lowercase)
    active_col_name = 'active_' + time_period + 's'
    if time_fields is None:
        period_days = 1
    else:
        period_days = time_fields['days']

    # These three lines create a dataframe called periods_df that has one row
    # for each date in the window range, and determines which period (day, week,
    # or month) that date is in. The window_period_number is the ordinal number
    # of that period in the dataframe. In the case of days, there are as many
    # window_period_number's as there are days in the range.
    date_range = pd.date_range(start = window_start_date, end = last_date, freq = 'D')
    periods_df = pd.DataFrame({'activity_date' : date_range})
    periods_df['window_period_number'] = periods_df.index // period_days
    ### Future enhancement: Put in option for going with discrete time periods

    # If we are NOT using segment, we go ahead and set a column called segment
    # to make the later groupby's easier
    if not use_segment:
        xau['segment'] = 'All'

    # This chain of Pandas functions merges the xau dataframe (which starts at
    # the daily level from dau_decorated_df) with the periods_df dataframe
    # to attach a window_period_number to each record. Then it groups by
    # user_id, window_period_number, and segment to find the sum of inc_amt
    # for each user in each period and segment. Then we do a second
    # group by, this time removing the window_period_number. This lets us count
    # the number of periods per user/segment. After that we just reset the
    # index and clean up column names
    xau_grouped = (xau.merge(periods_df, on = 'activity_date', how = 'left')
                    .groupby(['user_id', 'window_period_number', 'segment'], as_index = False)
                    ['inc_amt'].sum()
                    .groupby(['user_id', 'segment'])
                    ['inc_amt'].agg(['count', 'sum'])
                    .reset_index()
                    .rename(columns = {'count' : active_col_name, 'sum' : 'inc_amt'})
                    )

    # Breakouts allow us to see very easily which users are above the number
    # of periods specified in the breakout list. For example, if I want to know
    # which users are active 2+ periods and also 4+ periods, I would set
    # breakouts = [2,4].
    for b in breakouts:
        col_name = '%s%s+ users' % (b, period_abbr)
        xau_grouped[col_name] = (xau_grouped[active_col_name] >= b)

    # Before returning the dataframe to the calling function, sort the values
    # in descending order of inc_amt
    xau_grouped_sorted = xau_grouped.sort_values('inc_amt', ascending = False)

    return xau_grouped_sorted


### Run calc_user_periodic_usage and display the first 10 rows
user_daily_usage = calc_user_periodic_usage(dau_decorated,
                                            'day',
                                            dau_decorated['activity_date'].max(),
                                            window_days=28,
                                            breakouts=[2, 4],
                                            use_segment=False
                                           )
user_daily_usage.head(10)

Unnamed: 0,user_id,segment,active_days,inc_amt,2d+ users,4d+ users
1810,24598A,All,22,1171.16,True,True
4049,35405A,All,19,1053.99,True,True
163,11855A,All,20,829.21,True,True
1547,22782A,All,22,819.04,True,True
349,13429A,All,18,763.4,True,True
3035,29597A,All,18,692.09,True,True
1126,19896A,All,18,665.12,True,True
567,15010A,All,22,593.77,True,True
87,11119A,All,22,576.4,True,True
1964,25417A,All,22,559.36,True,True


The dataframe above shows the top 10 rows, where every user has been active for at least 18 days, so each of the breakouts is True. In the bottom 5 rows, shown below, we see that most of these low inc_amt customers are active just 1 day in the 28-day window, maybe 2.

In [6]:
user_daily_usage.tail()

Unnamed: 0,user_id,segment,active_days,inc_amt,2d+ users,4d+ users
4351,36871A,All,1,1.5,False,False
4545,37769A,All,1,1.5,False,False
259,12623A,All,2,0.78,True,False
3104,29965A,All,1,0.75,False,False
1486,22385A,All,1,0.72,False,False


The calc_user_periodic_usage function above is a key helper for the functions below, which summarize its output into the data used for the visualizations.
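
One detail of calc_user_periodic_usage worth seeing in isolation is how each date in the window gets its `window_period_number`. The sketch below reproduces just that step for weekly periods. The date is taken from the data set above; the `periods_toy` name is ours.

```python
import pandas as pd

last_date = pd.Timestamp('2024-04-30')
window_days = 28
period_days = 7   # the 'days' value for 'week' in get_time_period_dict

# The window covers window_days dates, ending on last_date
window_start_date = last_date - pd.Timedelta(days=window_days - 1)
date_range = pd.date_range(start=window_start_date, end=last_date, freq='D')

periods_toy = pd.DataFrame({'activity_date': date_range})
# Integer division of the row position assigns rows 0-6 to period 0,
# rows 7-13 to period 1, and so on
periods_toy['window_period_number'] = periods_toy.index // period_days

print(periods_toy['window_period_number'].value_counts().sort_index())
```

Note that these periods are counted forward from the window start date, so they are rolling 7-day blocks rather than calendar weeks.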

### 2.5 Calculate the DAU Histogram
The DAU histogram puts each user into a bin according to the number of active days they have in the 28-day window. It outputs the user count in each bin and also calculates the average number of days active across all users during the window.

In [8]:
def calc_xau_hist(dau_decorated, time_period, last_date, window_days, use_segment):

    # Call calc_user_periodic_usage, the function defined above
    xau_grouped = calc_user_periodic_usage(dau_decorated_df = dau_decorated,
                                           time_period = time_period,
                                           last_date = last_date,
                                           window_days = window_days,
                                           breakouts = [],
                                           use_segment = use_segment
                                           )

    # Define three column names based on the time period parameter
    active_col_name = f'active_{time_period}s'
    active_bin_name = f'{active_col_name}_bin'
    avg_active_name = f'avg_{time_period}s_active'

    # Create a dataframe called counts_df that counts the values of the
    # column that holds the number of active periods. In this way it counts all
    # users with the same number of active periods as part of the same bin.
    if active_col_name in xau_grouped.columns:
        counts_df = (xau_grouped[active_col_name]
                     .value_counts()
                     .reset_index()
                     )

        counts_df.columns = [active_bin_name, 'user_count']  # Directly set new names
        counts_df = counts_df.sort_values(by=active_bin_name, ascending=True)

    else:
        # Without this column there is nothing to count, so fail fast here
        # instead of hitting a NameError on counts_df further down
        raise KeyError(f"Column {active_col_name} not found in DataFrame.")

    # Create a blank dataframe with bin names and zeros to handle the cases
    # where a bin has no users
    blank_hist_df = pd.DataFrame({active_bin_name : range(1, window_days + 1),
                                  'user_count' : 0})

    # This is like an SQL union that appends the two data frames together. Then
    # we take the max value in each bin
    hist_df = (pd.concat([counts_df, blank_hist_df]).
               groupby(active_bin_name, as_index=False)
               .max()
               )

    # Calculate the weighted average active days in the 28-day period and add
    # a column to hold that constant
    avg_active = ((hist_df[active_bin_name] *
                   hist_df['user_count']).sum() /
                    hist_df['user_count'].sum()
                    )
    hist_df[avg_active_name] = avg_active

    # Make sure the active_days_bin column is a string category, not an integer
    hist_df[active_bin_name] = hist_df[active_bin_name].astype('str', copy=False)

    return hist_df


### Run
dau_hist_L28 = calc_xau_hist(dau_decorated, 'day', dau_decorated['activity_date'].max(), window_days=28, use_segment=False)
dau_hist_L28

Unnamed: 0,active_days_bin,user_count,avg_days_active
0,1,921,4.072579
1,2,1006,4.072579
2,3,487,4.072579
3,4,1148,4.072579
4,5,104,4.072579
5,6,165,4.072579
6,7,135,4.072579
7,8,305,4.072579
8,9,52,4.072579
9,10,116,4.072579
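
The two ideas in the histogram, filling empty bins with zero and taking a weighted average, can be seen on a toy example. This sketch uses `reindex` to fill the empty bins, which is a different technique from the concat-and-max approach in calc_xau_hist above. The active-day counts are invented and the window is shortened to 7 days to keep the output small.

```python
import pandas as pd

window_days = 7  # shortened window so the toy output stays small
# Hypothetical active-day counts for six users
active_days = pd.Series([1, 1, 2, 4, 4, 4])

# Count users per bin, then make sure every bin from 1 to window_days
# exists, with a zero where no user landed
hist_toy = (active_days
            .value_counts()
            .reindex(range(1, window_days + 1), fill_value=0)
            .rename_axis('active_days_bin')
            .reset_index(name='user_count')
            )

# Weighted average: each bin value weighted by the number of users in it
avg_days_active = ((hist_toy['active_days_bin'] * hist_toy['user_count']).sum()
                   / hist_toy['user_count'].sum())
print(hist_toy)
print(avg_days_active)
```

Here the average is (1×2 + 2×1 + 4×3) / 6 = 16/6, or about 2.67 active days per user.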


### 2.6 Calculate Rolling DAU/MAU Ratios over Time
The DAU Histogram shown above helps us understand engagement during a fixed 28-day period. What if we could calculate its key attributes to look at engagement trends over time? We can do so by performing a similar analysis and looping through it on a daily basis. In doing so, we can build out a dataframe that contains engagement metrics for every 28-day period over the life of the company.

In [12]:
### The calc_engagement_ratios_for_window function runs calc_user_periodic_usage
### on one window, which has a length defined by window_days. It then calculates
### summary statistics and stores them in a small dataframe just for that window

def calc_engagement_ratios_for_window(dau_decorated_df,
                                      time_period,
                                      last_date,
                                      window_days,
                                      breakouts,
                                      use_segment
                                      ):

    # Call calc_user_periodic_usage, the function defined above
    xau_grouped = calc_user_periodic_usage(dau_decorated_df,
                                           time_period,
                                           last_date,
                                           window_days,
                                           breakouts,
                                           use_segment)


    # These are the parameters that are set from the get_time_period_dict
    # function above
    time_fields = get_time_period_dict(time_period)
    period_abbr = time_period[0]
    active_col_name = 'active_' + time_period + 's'
    if time_fields is None:
        period_days = 1
    else:
        period_days = time_fields['days']
    total_users_col = '1' + period_abbr + '+ users'

    # Create a blank dataframe to store the statistics about xau_grouped
    xau_agg = pd.DataFrame()

    # Set the grouped_df to either be ungrouped or grouped by segment
    if use_segment:
        grouped_df = xau_grouped.groupby('segment')
    else:
        grouped_df = xau_grouped

    # In the blank dataframe, store the sum of the active periods
    # in a column (one row per segment if applicable)
    xau_agg[active_col_name] = pd.Series(grouped_df[active_col_name].sum())

    # In another column, store the count of unique user_id's
    xau_agg[total_users_col] = grouped_df['user_id'].nunique()

    # In another column, store the share of possible periods in the window
    # on which the average user was active. This is like the DAU/MAU ratio.
    xau_agg[period_abbr + 'au_window_ratio'] = (xau_agg[active_col_name] / (window_days/period_days)) / xau_agg[total_users_col]

    # In another column store the window_frequency, which is the average number
    # of active periods per user during the window.
    xau_agg['window_frequency'] = xau_agg[period_abbr + 'au_window_ratio'] * (window_days/period_days)

    # For each of the breakouts, calculate the gross number and the number
    # as a percentage of all active users
    for b in breakouts:
        col_name = '%s%s+ users' % (b, period_abbr)
        xau_agg[col_name] = grouped_df[col_name].sum()
        ratio_col_name = '%s%s+ users / total %sd users' % (b, period_abbr, window_days)
        xau_agg[ratio_col_name] = xau_agg[col_name] / xau_agg[total_users_col]

    # Store the window end date, because this will be appended to other iterations
    # of this function call. Then reset the index and return the dataframe.
    xau_agg['window_end_dt'] = last_date
    xau_agg = xau_agg.reset_index()

    return xau_agg




### The create_xau_window_df is what iterates through all possible windows
### of length window_days in the data set and builds a dataframe with summary
### statistics of each window, as calculated by calc_engagement_ratios_for_window
### defined above. Be forewarned, it takes a few minutes to run. That's why
### we include progress statements periodically during the loop.

def create_xau_window_df(dau_decorated_df,
                         time_period = 'day',
                         window_days = 28,
                         breakouts = [2, 4],
                         use_segment = False,
                         use_final_day = True
                         ):

    # Set the start date as window_days after the first activity_date in the
    # dau_decorated_df dataframe.
    start_dt = dau_decorated_df['activity_date'].min() + timedelta(days = window_days)

    # Set the final day as either the last activity_date in the data set
    # or the next-to-last activity_date. (You may want to set use_final_day
    # equal to False if you have an incomplete day of data.)
    if use_final_day:
        end_dt = dau_decorated_df['activity_date'].max()
    else:
        end_dt = dau_decorated_df['activity_date'].max() - timedelta(days = 1)

    # Set a Pandas date_range from the start date to the end date, by day
    date_range = pd.date_range(start = start_dt, end = end_dt, freq = 'D')

    # Initialize the dataframe that will house the engagement stats from every
    # window in the loop
    rolling_engagement_df = pd.DataFrame()
    i = 0
    total_dates = len(date_range)
    print(('%s total ' + time_period + 's to process...') % total_dates)

    # Loop through each date in the date range, calling calc_engagement_ratios_for_window
    # each time, storing it in the this_window dataframe, and then appending
    # this_window to rolling_engagement_df
    for d in date_range:
        if i % 100 == 0:
            print(('Processing ' + time_period + ' %s of %s...') % (i, total_dates))

        d2 = d.date()
        this_window = calc_engagement_ratios_for_window(dau_decorated_df,
                                                    time_period = time_period,
                                                    last_date = d2,
                                                    window_days = window_days,
                                                    breakouts = breakouts,
                                                    use_segment = use_segment)
        rolling_engagement_df = pd.concat([rolling_engagement_df, this_window])
        i+=1
    print(('Finished processing all %s ' + time_period + 's!') % total_dates)

    # Make sure the window_end_dt field is a Pandas datetime
    rolling_engagement_df['window_end_dt'] = pd.to_datetime(rolling_engagement_df['window_end_dt'])

    return rolling_engagement_df


In [13]:
# Run create_xau_window_df and show the last ten rows (the most recent ones)
rolling_dau_mau = create_xau_window_df(dau_decorated, breakouts = [2, 4, 8, 12])
rolling_dau_mau.tail(10)

1372 total days to process...
Processing day 0 of 1372...
Processing day 100 of 1372...
Processing day 200 of 1372...
Processing day 300 of 1372...
Processing day 400 of 1372...
Processing day 500 of 1372...
Processing day 600 of 1372...
Processing day 700 of 1372...
Processing day 800 of 1372...
Processing day 900 of 1372...
Processing day 1000 of 1372...
Processing day 1100 of 1372...
Processing day 1200 of 1372...
Processing day 1300 of 1372...
Finished processing all 1372 days!


Unnamed: 0,index,active_days,1d+ users,dau_window_ratio,window_frequency,2d+ users,2d+ users / total 28d users,4d+ users,4d+ users / total 28d users,8d+ users,8d+ users / total 28d users,12d+ users,12d+ users / total 28d users,window_end_dt
0,0,18775,4690,0.142971,4.003198,3719,0.792964,2185,0.465885,682,0.145416,186,0.039659,2024-04-21
0,0,18789,4694,0.142956,4.002769,3711,0.790584,2187,0.465914,681,0.145079,186,0.039625,2024-04-22
0,0,18821,4695,0.143169,4.008733,3715,0.791267,2200,0.468584,677,0.144196,186,0.039617,2024-04-23
0,0,18844,4709,0.142918,4.001699,3720,0.789977,2214,0.470164,678,0.14398,186,0.039499,2024-04-24
0,0,18853,4712,0.142895,4.001061,3722,0.789898,2216,0.470289,678,0.143888,186,0.039474,2024-04-25
0,0,18871,4697,0.143488,4.017671,3722,0.792421,2221,0.472855,680,0.144773,188,0.040026,2024-04-26
0,0,18865,4691,0.143626,4.021531,3724,0.793861,2216,0.472394,674,0.143679,187,0.039864,2024-04-27
0,0,18868,4684,0.143864,4.028181,3719,0.79398,2219,0.47374,680,0.145175,186,0.03971,2024-04-28
0,0,18901,4675,0.144393,4.042995,3728,0.797433,2229,0.476791,681,0.145668,188,0.040214,2024-04-29
0,0,18966,4657,0.145449,4.072579,3736,0.802233,2243,0.481641,691,0.148379,188,0.040369,2024-04-30
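
As a sanity check, the two headline columns can be recomputed by hand from the counts in the final row. The numbers below are copied from the 2024-04-30 row above; only the variable names are ours.

```python
# Values copied from the 2024-04-30 row of the table above
active_days = 18966      # total active days across all users in the window
total_users = 4657       # users with at least one active day ("1d+ users")
window_days = 28

# Average number of active days per user in the window
window_frequency = active_days / total_users
# Share of possible days on which the average user was active (DAU/MAU)
dau_window_ratio = window_frequency / window_days

print(round(window_frequency, 6), round(dau_window_ratio, 6))
```

Both results agree with the `window_frequency` and `dau_window_ratio` columns in that row.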


## 3. Load the transformed data into Google Sheets
**Note**: For a more detailed discussion about loading data into Google Sheets, complete with inline code, visit [Section 3 of Mini-Pipeline: Cohort Analysis](https://colab.research.google.com/drive/1oYy-wJl6VZFgOsv8uw7iGChQxUjrR5rf#scrollTo=hzG-fUwkyM9K)
### Establish connection to Google Sheets for writing output files
The first time you run the cell below, or after some time of inactivity, you will be asked to click on a link. That link will take you to a new tab that will authorize this script to write to Google Sheets spreadsheets in your Google Account. To enable this feature, copy the code you get into the box below and hit Enter.

In [14]:
gc = tvcl.google_authenticate()

In [15]:
### Be sure to set this value to refer to your Google Sheets workbook
GOOGLE_SHEET_KEY = '1-XnO_eWkRwX-E1fiA2Jkbe3kJvoyoPFsdeW7vnF6zS0'

In [16]:
print('https://docs.google.com/spreadsheets/d/' + GOOGLE_SHEET_KEY)

https://docs.google.com/spreadsheets/d/1-XnO_eWkRwX-E1fiA2Jkbe3kJvoyoPFsdeW7vnF6zS0


In [17]:
### Execute this function to write the data in the dataframe to the google sheet
### and tab name specified using the gc Google credentials
tvcl.write_to_google_sheet(dau_hist_L28,
                           GOOGLE_SHEET_KEY,
                           'DAU Histogram L28',
                           gc)

In [18]:
### Execute this function to write the data in the dataframe to the google sheet
### and tab name specified using the gc Google credentials
tvcl.write_to_google_sheet(rolling_dau_mau,
                           GOOGLE_SHEET_KEY,
                           'Rolling DAU/MAU',
                           gc)
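
For readers curious about what has to happen to a dataframe before it can be written to a spreadsheet, the sketch below shows one plausible preparation step: turning it into a header row plus a list of rows, with dates rendered as strings. `df_to_sheet_values` is a hypothetical helper written for this illustration; it is not the toolkit's `write_to_google_sheet`, and it makes no network call.

```python
import pandas as pd

def df_to_sheet_values(df):
    """Hypothetical helper: header row followed by one list per data row."""
    out = df.copy()
    # Spreadsheet cells cannot hold pandas Timestamps, so render
    # datetime columns as ISO date strings first
    for col in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[col]):
            out[col] = out[col].dt.strftime('%Y-%m-%d')
    return [out.columns.tolist()] + out.values.tolist()

# Toy frame shaped like a slice of rolling_dau_mau
toy = pd.DataFrame({
    'active_days': [18901, 18966],
    'window_end_dt': pd.to_datetime(['2024-04-29', '2024-04-30']),
})
values = df_to_sheet_values(toy)
print(values)
```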

## 4. Visualize insights in Google Data Studio
A Google Data Studio dashboard preconfigured to read from the Google Sheet created above is [available at this link](https://datastudio.google.com/open/1xjS__Q6ZUXuUUARkgRvY4spYUw1ePksV) or by clicking on the Google Data Studio logo at the bottom of the chart embedded below. It is available in read-only mode for you to copy, link to your own Google Sheet tabs, and see your own data visualized.
### 4.1 User DAU Histogram
The chart below shows the number of users in each active days bin to illustrate the distribution of usage. The more users come back to the service on multiple days, the more engaged they are. We see that the highest bar is at 4 active days, or once per week. We also see another, smaller spike at 8 active days, or twice per week. The 4.1 figure for "Average Days Active in Last 28" is another way of expressing the DAU/MAU ratio. Using the unrounded value of 4.07, the DAU/MAU ratio in this case is 4.07/28, or 0.145, or 14.5%. That means users are active on 14.5% of possible days. We think it is easier to understand what this number means by thinking of it as average active days out of 28 rather than as a ratio/percentage.

In [19]:
IFrame('https://datastudio.google.com/embed/reporting/1xjS__Q6ZUXuUUARkgRvY4spYUw1ePksV/page/IEWi',
       width=600,
       height=450)

### 4.2 Rolling Active Days per 28
The chart below takes the 4.1 Average Active Days in Last 28 from the previous chart, which was calculated as of the most recent date in the event log, and plots the trend of that number over time. By looking at this number's trend, we can judge whether engagement is getting better, getting worse, or staying the same. In the case of ServBiz, engagement has leveled out around 4.0 over the last year-and-a-half or so after showing steady improvement in the first few years of the business. The exception is the regular seasonal dips when usage frequency wanes for a few weeks.

In [20]:
IFrame('https://datastudio.google.com/embed/reporting/1xjS__Q6ZUXuUUARkgRvY4spYUw1ePksV/page/hXXk',
       width=600,
       height=450)

### 4.3 Rolling 28-Day Multi-Day Users
A richer display of the distribution of active usage combines the histogram from 4.1 above with the rolling trends from 4.2. In effect, we are turning the histogram on its side and dragging it through time to see the percentage of total active users above various thresholds. In the example below we have plotted four lines. They show the percentage of active users with...

- 2+ active days out of the previous 28, or at least once every other week (about 80% in the most recent 28-day window; see table below the chart)
- 4+ active days, or at least once a week (about 48%)
- 8+ active days, or at least twice a week (about 15%)
- 12+ active days, or at least 3x per week (about 4%)

Going with more than 12 days would make for a lousy visual with a horizontal line bouncing off zero.

In this visual we see that the percentage of users with 2+ days out of 28 continues to increase a bit over time. It is the more frequent categories (4+, 8+) that are flat and dragging down the overall average seen in 4.2.
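
The threshold percentages plotted in this chart come down to a simple calculation: for each breakout, the share of active users whose active-day count is at or above it. Here is a minimal sketch with NumPy on ten invented users; the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical active-day counts for ten users in one 28-day window
active_days = np.array([1, 1, 2, 3, 4, 4, 5, 8, 9, 12])
breakouts = [2, 4, 8, 12]

# The mean of a boolean mask is the share of users at or above the threshold
shares = {b: float((active_days >= b).mean()) for b in breakouts}
print(shares)
```

Because every user counted at 4+ is also counted at 2+, the lines in the chart are nested: a higher threshold can never sit above a lower one.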

In [21]:
IFrame('https://datastudio.google.com/embed/reporting/1xjS__Q6ZUXuUUARkgRvY4spYUw1ePksV/page/qdXk',
       width=600,
       height=450)

In [22]:
rolling_dau_mau.tail(1)

Unnamed: 0,index,active_days,1d+ users,dau_window_ratio,window_frequency,2d+ users,2d+ users / total 28d users,4d+ users,4d+ users / total 28d users,8d+ users,8d+ users / total 28d users,12d+ users,12d+ users / total 28d users,window_end_dt
0,0,18966,4657,0.145449,4.072579,3736,0.802233,2243,0.481641,691,0.148379,188,0.040369,2024-04-30
