<a href="https://colab.research.google.com/github/theventurecity/data-toolkit/blob/master/Create_the_DAU_Decorated_Data_Set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![TheVentureCity](https://theventure.city/wp-content/uploads/2017/06/Theventurecity-logoweb-1.png)

# Create the "DAU Decorated" Data Set
1. Extract raw event log data from a CSV
2. Transform that data into the "DAU Decorated" dataframe

## Before you begin

- This notebook is shared with read-only access. To run this notebook yourself, first click "**Open in Playground**" in the toolbar above. That will create a separate instance that you can run and/or save a copy of to your own Google Drive. 

- To run each cell, hit **Shift-Enter**, which runs the contents of the active cell and moves to the next cell. This includes the Markdown cells (such as this one).

- When you run the first block of Python code, you will get a message that says, "**Warning: This notebook was not authored by Google.**" Please be aware that we are **NOT** accessing your data shared with Google or reading data and credentials from other sessions. This notebook reads data from GitHub and writes to a Google Sheet that only you have access to and can control. We recommend you click the box to "**Reset all runtimes before running**" for extra information security.

## Import relevant Python libraries

In [0]:
### Load the Pandas library
import pandas as pd

## 1. Extract raw event log data from a CSV

This example uses a data file from our GitHub repository for a sample company called ServBiz. In this step we read the data file into memory as a Pandas dataframe named "t" and show its last ten rows.

In [2]:
# Edit this filename to your local filename.csv if using a local CSV file
filename = 'https://raw.githubusercontent.com/theventurecity/Analytics/master/data/ServBiz_transactions.csv'

t = pd.read_csv(filename)
t.tail(10)

Unnamed: 0,client_id,date,value_usd,segment
420781,27902A,2019-02-28,8.75,Enterprise
420782,34181A,2019-02-28,18.97,SMB
420783,30168A,2019-02-28,17.73,SMB
420784,30844A,2019-02-28,19.98,SMB
420785,35815A,2019-02-28,17.98,SMB
420786,16958A,2019-02-28,17.45,SMB
420787,13090A,2019-02-28,13.48,SMB
420788,19162A,2019-02-28,13.64,Enterprise
420789,28409A,2019-02-28,14.72,SMB
420790,12080A,2019-02-28,18.32,SMB


## 2. Transform the raw data into the "DAU Decorated" dataframe
### 2.1 Create Daily Active Users (DAU) dataframe
The **DAU** dataframe aggregates all activity by user and day. By taking this approach, if any of a user's events fall within the 24 hours of a day, the user is considered active for that day. 

Some notes:

- One of the most important jobs of this function is **standardizing the column names**. Because this is the starting point, all subsequent transformation functions expect the columns to be called **user_id**, **activity_date**, **inc_amt**, and **segment** (if applicable). You map the original column names to these standard names when you call this function.
- The meaning of **inc_amt** depends on the situation. As the name implies, it often means "income": revenue, gross margin, contribution margin, or some other monetary amount that is measurable at the transaction level. But it can be any number you want to count, such as shares, likes, page views, or event counts. If you set this parameter to None when you call the function (which you might do if all you have is a list of dates and user_ids), it creates a column of ones so it can count the number of events for each user and day.
- If you have a **segment** column in the original input data set, there is no reason not to include it here. Subsequent transformation functions contain options for including or excluding the segment column.
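
As a minimal sketch of the notes above, the snippet below mimics what happens when no amount column is available. The `events` dataframe and its `uid` and `ts` columns are made up for illustration and are not part of the ServBiz data. Timestamps are rolled up to calendar days, a column of ones stands in for **inc_amt**, and the sum per user and day becomes an event count.

```python
import pandas as pd

# Hypothetical event log (assumed for illustration): timestamps, no amount column
events = pd.DataFrame({
    'uid': ['a', 'a', 'a', 'b'],
    'ts': ['2019-02-27 09:15', '2019-02-27 18:40',
           '2019-02-28 07:05', '2019-02-28 12:00'],
})

# Roll each timestamp up to its calendar day and use the standard column names
events['activity_date'] = pd.to_datetime(events['ts']).dt.date
events['user_id'] = events['uid'].astype('str')

# With no amount available, every event counts as 1
events['inc_amt'] = 1

# Sum the ones per user and day to get an event count
event_counts = (events
                .groupby(['user_id', 'activity_date'], as_index = False)
                .agg({'inc_amt' : 'sum'}))
print(event_counts)
```

User `a` has two events on the same day, so that user/day pair collapses into a single row with an **inc_amt** of 2.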

In [3]:
# The create_dau_df function takes as inputs a dataframe of transactions and 
# the names of the three key event log columns: User ID, Activity Date, and 
# Income Amount (could be revenue or contribution margin). It can handle a 
# fourth event log column that designates a segment. Next it ensures that
# the Activity Date column is a date and the User ID is a string. Then it groups
# all of the transaction records to calculate the sum of the Income Amount
# by User ID and Activity Date (and Segment, if chosen) combination

def create_dau_df(transactions, 
                  user_id = 'user_id', 
                  activity_date = 'activity_date', 
                  inc_amt = 'inc_amt', 
                  segment_col = None,
                  include_zero_inc = False):
    
    # Ensure correct data types
    # If the activity_date is in date-time format, it gets rolled up into the
    # day on which that event occurred. 
    transactions[activity_date] = pd.to_datetime(transactions[activity_date]).dt.date
    transactions[user_id] = transactions[user_id].astype('str')
    
    # If there is no inc_amt available in the data set, add a column of ones
    # Set the value of the inc_amt variable to 'inc_amt'
    if inc_amt is None:
        transactions['inc_amt'] = 1
        inc_amt = 'inc_amt'
    
    # By default, this function keeps only transactions where inc_amt > 0,
    # which excludes zero amounts and negative amounts such as returns.
    # Setting include_zero_inc = True keeps every transaction instead.
    # Either way we work on a copy so the steps below leave the input alone
    if include_zero_inc:
        trans_df = transactions.copy()
    else:
        trans_df = transactions.loc[transactions[inc_amt] > 0].copy()
        
    # By default we group by user_id and activity_date. If a segment column is
    # specified when the function is called, we include that column's name in
    # the groupby as well. We also make sure that the segment is a string type
    # in trans_df, which is the dataframe that actually gets grouped
    groupby_cols = [user_id, activity_date]
    if segment_col is not None:
        groupby_cols += [segment_col]
        trans_df[segment_col] = trans_df[segment_col].astype('str')
    
    
    # Group by user_id and activity_date, calculate the sum of the inc_amt
    # and return standardized names for each column
    dau = (trans_df
           .groupby(groupby_cols, as_index = False)
           .agg({inc_amt : 'sum'})
           .rename(columns = {user_id : 'user_id', 
                              activity_date : 'activity_date', 
                              inc_amt : 'inc_amt'})
                        )

    # If we are using a segment column, it gets its own standardized name 'segment'
    if segment_col is not None:
        dau = dau.rename(columns = {segment_col : 'segment'})
        
    return dau
  
# Run the function above and show the first ten rows  
dau = create_dau_df(t, 
                    user_id = 'client_id', 
                    activity_date = 'date', 
                    inc_amt = 'value_usd',
                    segment_col = 'segment'
                   )
dau.head(10)

Unnamed: 0,user_id,activity_date,segment,inc_amt
0,10000A,2015-10-14,SMB,11.75
1,10001A,2015-10-14,SMB,13.75
2,10001A,2015-11-02,SMB,7.5
3,10001A,2015-11-22,SMB,18.0
4,10001A,2017-04-04,SMB,6.25
5,10001A,2017-12-08,SMB,8.75
6,10002A,2015-10-14,Enterprise,11.75
7,10002A,2015-10-26,Enterprise,12.25
8,10002A,2015-11-16,Enterprise,12.25
9,10002A,2015-11-23,Enterprise,12.25


### 2.2 Create a separate dataframe to hold each user's first activity date
Calculate the **first-ever activity date** for each user_id in the DAU dataframe and store it in its own dataframe (**first_dt**). This step is optional because the **create_dau_decorated_df** function below can run it for you, but we include it here to better illustrate the mechanics of what is happening.

In [4]:
# The create_first_dt_df function takes as its input the DAU dataframe created
# above. After creating a copy of the original DAU dataframe so as not to 
# affect the original, it creates a new first_dt dataframe. Using the groupby
# and agg functions, it finds the minimum Activity Date for each User ID. Then 
# it specifies the week ('first_week') and month ('first_month') in which the 
# first Activity Date is found. 

def create_first_dt_df(dau_df):
    print('Creating first_dt dataframe')
    
    # Create copy of input dataframe
    dau = dau_df.copy()
    
    # Use groupby to find the minimum activity_date for each user_id
    first_dt = (dau.groupby(['user_id'], as_index = False)
                .agg({'activity_date' : 'min'})
                .rename(columns = { 'activity_date' : 'first_dt' })
               )
    
    # Ensure that the first_dt field is a date
    first_dt['first_dt'] = pd.to_datetime(first_dt['first_dt']).dt.date
    
    # Add two new columns with the first_week and first_month of the first_dt
    first_dt['first_week'] = pd.to_datetime(first_dt['first_dt']).dt.to_period('W')
    first_dt['first_month'] = pd.to_datetime(first_dt['first_dt']).dt.to_period('M')
    
    return first_dt
  
  
# Run the function above and show the first ten rows  
first_dt = create_first_dt_df(dau)
first_dt.head(10)

Creating first_dt dataframe


Unnamed: 0,user_id,first_dt,first_week,first_month
0,10000A,2015-10-14,2015-10-12/2015-10-18,2015-10
1,10001A,2015-10-14,2015-10-12/2015-10-18,2015-10
2,10002A,2015-10-14,2015-10-12/2015-10-18,2015-10
3,10003A,2015-10-16,2015-10-12/2015-10-18,2015-10
4,10004A,2015-10-15,2015-10-12/2015-10-18,2015-10
5,10005A,2015-10-25,2015-10-19/2015-10-25,2015-10
6,10011A,2015-10-16,2015-10-12/2015-10-18,2015-10
7,10012A,2015-10-19,2015-10-19/2015-10-25,2015-10
8,10013A,2015-10-15,2015-10-12/2015-10-18,2015-10
9,10014A,2015-10-28,2015-10-26/2015-11-01,2015-10
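
To make the **first_week** and **first_month** values easier to read, here is a small sketch using three of the dates from the output above. In pandas, the `'W'` frequency defaults to weeks that end on Sunday, so each weekly period runs from Monday through Sunday.

```python
import pandas as pd

# Three first_dt values taken from the output above
dates = pd.Series(pd.to_datetime(['2015-10-14', '2015-10-25', '2015-10-28']))

# Map each date to the weekly and monthly period that contains it
weeks = dates.dt.to_period('W')
months = dates.dt.to_period('M')

print(weeks.astype('str').tolist())
print(months.astype('str').tolist())

# Each weekly period starts on a Monday
print(weeks.dt.start_time.dt.day_name().tolist())
```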


### 2.3 Join DAU with First Date
Merge the **dau** dataframe with the **first_dt** dataframe and call it "**DAU Decorated**."

In [5]:
# The create_dau_decorated_df function takes the two dataframes created above, DAU and
# first_dt, and merges them together based on user_id. This results in a DAU
# dataframe "decorated" with information about the user's first activity date,
# first week, and first month, as shown below. Note: it is not necessary to 
# pass in the first_dt dataframe. If none is provided, the function will run
# create_first_dt_df so it has something to merge to the DAU dataframe.

def create_dau_decorated_df(dau_df, first_dt_df = None):
    print('Creating DAU Decorated dataframe')
    
    # If no first_dt_df is provided, create it
    if first_dt_df is None:
        first_dt_df = create_first_dt_df(dau_df)
        
    # Do a left merge of first_dt_df into dau_df on User ID
    dau_decorated_df = dau_df.merge(first_dt_df, how = 'left', on = 'user_id')

    # If segment is included in this dataframe, ensure that it is a string type
    if 'segment' in dau_decorated_df.columns:
        dau_decorated_df['segment'] = dau_decorated_df['segment'].astype('str')
    
    return dau_decorated_df
  
  
# Run the function above and show the last ten rows  
dau_decorated = create_dau_decorated_df(dau, first_dt_df = first_dt)
dau_decorated.tail(10)

Creating DAU Decorated dataframe


Unnamed: 0,user_id,activity_date,segment,inc_amt,first_dt,first_week,first_month
407208,9995A,2019-01-24,SMB,21.28,2015-10-19,2015-10-19/2015-10-25,2015-10
407209,9995A,2019-01-31,SMB,21.28,2015-10-19,2015-10-19/2015-10-25,2015-10
407210,9995A,2019-02-07,SMB,21.28,2015-10-19,2015-10-19/2015-10-25,2015-10
407211,9995A,2019-02-14,SMB,21.28,2015-10-19,2015-10-19/2015-10-25,2015-10
407212,9995A,2019-02-21,SMB,21.28,2015-10-19,2015-10-19/2015-10-25,2015-10
407213,9995A,2019-02-28,SMB,21.28,2015-10-19,2015-10-19/2015-10-25,2015-10
407214,9998A,2015-10-14,SMB,15.0,2015-10-14,2015-10-12/2015-10-18,2015-10
407215,9998A,2015-11-29,SMB,15.0,2015-10-14,2015-10-12/2015-10-18,2015-10
407216,9998A,2015-12-30,SMB,7.5,2015-10-14,2015-10-12/2015-10-18,2015-10
407217,9999A,2015-10-14,Enterprise,11.75,2015-10-14,2015-10-12/2015-10-18,2015-10
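
One optional safeguard, which is not part of the function above: the left merge assumes that the first-date dataframe holds exactly one row per user_id. The pandas `merge` method accepts a `validate` argument that raises an error if that assumption is broken. The sketch below uses toy data, and the `_toy` names are invented for illustration.

```python
import pandas as pd

# Toy DAU dataframe with the standardized column names
dau_toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b'],
    'activity_date': pd.to_datetime(['2019-01-01', '2019-01-08', '2019-01-03']).date,
    'inc_amt': [5.0, 7.5, 3.0],
})

# One row per user with that user's earliest activity date
first_dt_toy = (dau_toy.groupby(['user_id'], as_index = False)
                .agg({'activity_date' : 'min'})
                .rename(columns = {'activity_date' : 'first_dt'}))

# validate = 'many_to_one' raises a MergeError if first_dt_toy
# contains duplicate user_id values
decorated_toy = dau_toy.merge(first_dt_toy, how = 'left', on = 'user_id',
                              validate = 'many_to_one')
print(decorated_toy)
```

Because the right-hand side is unique on user_id, the merged dataframe has the same number of rows as the DAU dataframe.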


Combining the basic DAU data with the first date, week, and month for each user, **the DAU Decorated dataframe is our basic building block for many different analyses**. It allows us to use user-level data to inspect engagement, retention, and growth accounting.
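
As a rough sketch of why these columns are useful, the snippet below builds a simple cohort view from toy data: it counts distinct active users for each combination of first month and activity month. The `toy` dataframe and the `activity_month` and `active_users` names are assumptions made for this illustration, and the downstream notebooks may compute their metrics differently.

```python
import pandas as pd

# Toy activity log with the standardized column names
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b', 'c'],
    'activity_date': pd.to_datetime(['2019-01-05', '2019-02-10', '2019-01-20',
                                     '2019-01-25', '2019-02-02']),
})

# Decorate each row with the user's first activity date and first month
toy['first_dt'] = toy.groupby('user_id')['activity_date'].transform('min')
toy['first_month'] = toy['first_dt'].dt.to_period('M')

# Month in which each row's activity took place
toy['activity_month'] = toy['activity_date'].dt.to_period('M')

# Count distinct active users per (first_month, activity_month) pair
cohort_counts = (toy
                 .groupby(['first_month', 'activity_month'], as_index = False)
                 .agg(active_users = ('user_id', 'nunique')))
print(cohort_counts)
```

Of the two users who first appear in January 2019, only one is active again in February, which is the kind of retention signal this structure makes visible.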

### 2.4 Use DAU Decorated in Subsequent Data Transformations
To see how DAU Decorated is used downstream, please visit the Mini-Pipeline notebooks about the subjects below:


*   [MAU Growth Accounting](https://colab.research.google.com/drive/1moHa4Mcycwsz7Fq6T_5Zou1Zunt0afiI)
*   [Cohort Analysis](https://colab.research.google.com/drive/1oYy-wJl6VZFgOsv8uw7iGChQxUjrR5rf)
*   [Engagement](https://colab.research.google.com/drive/1nznm8WRU0dJcMNAR4U5CpkHpbMI-nmWD)
