# User Engagement Data Analysis

Goals:
* define an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period
* identify which factors predict future user adoption
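
To make the definition concrete, here is a minimal sketch of the rule applied to a single user's login dates. The function name `is_adopted` and the sample dates are hypothetical and only illustrate the rule; the analysis further down uses its own implementation.

```python
from datetime import date, timedelta


def is_adopted(login_days, window_days=7, min_logins=3):
    """Return True if some `window_days`-day period holds `min_logins` distinct login days."""
    days = sorted(set(login_days))  # distinct calendar days only
    for i in range(len(days) - min_logins + 1):
        # days[i] .. days[i + min_logins - 1] are `min_logins` consecutive distinct days;
        # they fit in one window if first and last are less than `window_days` apart
        if days[i + min_logins - 1] - days[i] < timedelta(days=window_days):
            return True
    return False


# Hypothetical login histories
dense = [date(2014, 1, 1), date(2014, 1, 3), date(2014, 1, 7)]   # all within Jan 1-7
sparse = [date(2014, 1, 1), date(2014, 1, 5), date(2014, 1, 8)]  # needs an 8-day span
print(is_adopted(dense), is_adopted(sparse))
```

Repeated logins on the same day count once, which matches the "three separate days" wording.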

### About the data 

The data is available as two attached CSV files:
* takehome_user_engagement.csv
* takehome_users.csv


The data has the following two tables:

1. A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. This table includes:
* name: the user's name
* object_id: the user's id
* email: email address
* creation_source: how their account was created. This takes on one of 5 values:
    * PERSONAL_PROJECTS: invited to join another user's personal workspace
    * GUEST_INVITE: invited to an organization as a guest (limited permissions)
    * ORG_INVITE: invited to an organization (as a full member)
    * SIGNUP: signed up via the website
    * SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)
* creation_time: when they created their account
* last_session_creation_time: unix timestamp of last login
* opted_in_to_mailing_list: whether they have opted into receiving marketing emails
* enabled_for_marketing_drip: whether they are on the regular marketing email drip
* org_id: the organization (group of users) they belong to
* invited_by_user_id: which user invited them to join (if applicable).

2. A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.
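
Since `last_session_creation_time` is described as a Unix timestamp while `creation_time` is a date string, it can help to convert it to a datetime before comparing the two. A minimal sketch, assuming the values are seconds since the epoch; the two-row frame is hypothetical sample data, not the real file:

```python
import pandas as pd

# Hypothetical sample mimicking two columns of takehome_users.csv
sample = pd.DataFrame({
    "object_id": [1, 8],
    "last_session_creation_time": [1398138810.0, float("nan")],
})

# unit="s" interprets the numbers as seconds since 1970-01-01 UTC;
# missing values (users who never logged in) become NaT
sample["last_session"] = pd.to_datetime(
    sample["last_session_creation_time"], unit="s"
)
print(sample)
```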

## Import Data

In [1]:
import warnings
import pandas as pd
import numpy as np

warnings.simplefilter(action="ignore", category=FutureWarning)

In [3]:
# import provided dataset as pandas DataFrame
user_file = "data/takehome_users.csv"
eng_file = "data/takehome_user_engagement.csv"

df_user = pd.read_csv(user_file, encoding="ISO-8859-1")
df_eng = pd.read_csv(eng_file, parse_dates=["time_stamp"])

df_user = df_user.rename({"object_id":"user_id"}, axis=1)

In [8]:
# preview user data
print("Shape of user data = ", df_user.shape)
df_user.head(10)

Shape of user data =  (12000, 10)


Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0
5,6,2013-12-17 03:37:06,Cunha Eduardo,EduardoPereiraCunha@yahoo.com,GUEST_INVITE,1387424000.0,0,0,197,11241.0
6,7,2012-12-16 13:24:32,Sewell Tyler,TylerSewell@jourrapide.com,SIGNUP,1356010000.0,0,1,37,
7,8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,,1,1,74,
8,9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,,0,0,302,
9,10,2013-01-16 22:08:03,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,1401833000.0,1,1,318,4143.0


In [9]:
# preview engagement data
print("Shape of engagement data = ", df_eng.shape)
df_eng.head(10)

Shape of engagement data =  (207917, 3)


Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1
5,2013-12-31 03:45:04,2,1
6,2014-01-08 03:45:04,2,1
7,2014-02-03 03:45:04,2,1
8,2014-02-08 03:45:04,2,1
9,2014-02-09 03:45:04,2,1


## Adopted User

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, **identify which factors predict future user adoption**.

In [12]:
unique = df_user.user_id.unique()

In [29]:
df_agg = df_eng.set_index("time_stamp")

users = df_agg["user_id"].unique()
adoption = []

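for i in users:
    id_filter = df_agg["user_id"] == i  # rows belonging to this user
    df_filter = df_agg[id_filter].resample("1D").count()  # logins per calendar day
    df_filter = df_filter.rolling(window=7).sum()  # logins in each 7-day window
    # caution: the first six rows of the rolling result are NaN, so a user whose
    # history spans fewer than seven days is dropped here and flagged as not adopted
    df_filter = df_filter.dropna()
    adoption.append(any(df_filter["visited"].values >= 3))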
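
The per-user loop above resamples every user separately, and with the default `min_periods` the rolling window yields no values for histories shorter than seven days. As a hedged alternative, the sketch below checks the rule directly: a user qualifies if their first and third login days in some run of three are less than seven days apart. The function name `adopted_flags` and the toy frame are hypothetical; only the column names `time_stamp` and `user_id` come from the engagement table.

```python
import pandas as pd


def adopted_flags(eng, window_days=7, min_days=3):
    """Return a boolean Series indexed by user_id (True = meets the adoption rule)."""
    # keep one row per user per calendar day, ordered in time within each user
    days = (
        eng.assign(day=eng["time_stamp"].dt.normalize())
           .drop_duplicates(["user_id", "day"])
           .sort_values(["user_id", "day"])
    )
    # for each login day, look ahead to the (min_days - 1)-th following login day
    later = days.groupby("user_id")["day"].shift(-(min_days - 1))
    span = later - days["day"]  # NaT where the user has too few later logins
    hit = span < pd.Timedelta(days=window_days)  # NaT compares as False
    return hit.groupby(days["user_id"]).any().rename("adopted_user")


# Hypothetical toy data: user 1 has a short but dense history,
# user 2 logs in rarely, user 3 logs in twice on one day plus the next day
toy = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2014-01-01 09:00", "2014-01-02 09:00", "2014-01-03 09:00",
        "2014-01-01 09:00", "2014-01-10 09:00", "2014-01-20 09:00",
        "2014-01-01 09:00", "2014-01-01 18:00", "2014-01-02 09:00",
    ]),
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "visited": 1,
})
print(adopted_flags(toy))
```

Because this treats a three-day history as eligible, its flags can differ from the loop's for users with less than a week of activity.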

In [30]:
# applying 'adopted_user' logic onto df #
user_adoption = list(zip(users, adoption))

# create a new DataFrame for user adoption
df_adopt = pd.DataFrame(user_adoption)
df_adopt.columns = ["user_id", "adopted_user"]

df = df_user.merge(df_adopt, on="user_id", how="left")

In [32]:
# mapping 'adopted_user' #
# users with no engagement rows are NaN after the left merge; treat them as not adopted
df["adopted_user"] = df["adopted_user"].fillna(False).astype(int)

In [33]:
# mapping 'invited_by_user' #
# 1 if the user was invited by another user, 0 otherwise
df["invited_by_user"] = df["invited_by_user_id"].notna().astype(int)

In [34]:
# final df #
df = df[["adopted_user", "invited_by_user", "creation_source",
         "opted_in_to_mailing_list", "enabled_for_marketing_drip"]]

## Modeling

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# machine learning pipeline #
X = df[df.columns[1:]]
y = df[df.columns[0]]

# note: test_size=0.7 holds out 70% of the rows, so the grid search trains on the remaining 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.7, random_state=42)

pipeline = Pipeline(steps=[("encoder", OneHotEncoder()), 
                           ("rf", RandomForestClassifier(random_state = 42))])

params = {"rf__n_estimators" : [50, 75, 100],
          "rf__max_depth" : [5, 10, 15]}

cv = GridSearchCV(pipeline, param_grid=params, cv=5)
cv.fit(X_train, y_train)

print(f"Best parameters: {cv.best_params_}")
print(f"Training accuracy score from tuned model: {cv.best_score_*100:.1f}%")

Best parameters: {'rf__max_depth': 5, 'rf__n_estimators': 50}
Training accuracy score from tuned model: 86.7%


In [39]:
# test set score #
y_pred = cv.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {test_accuracy*100:.2f}%")

Model accuracy: 86.69%
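
Accuracy on an imbalanced target is easier to interpret next to a majority-class baseline. The sketch below shows one way to compute that baseline with scikit-learn's `DummyClassifier`. The labels are hypothetical (20 positives out of 100) and are not the notebook's data; in the notebook the same calls would take `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels, for illustration only
y_true = np.array([1] * 20 + [0] * 80)
X_dummy = np.zeros((len(y_true), 1))  # the baseline ignores the features

# always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_acc = accuracy_score(y_true, y_base)
base_bal = balanced_accuracy_score(y_true, y_base)
print(f"Baseline accuracy: {base_acc:.2f}")
print(f"Baseline balanced accuracy: {base_bal:.2f}")
```

A model whose accuracy is close to this baseline may simply be predicting the majority class, so a metric such as balanced accuracy is a useful second check.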


In [40]:
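# refit the pipeline on one-hot encoded test data to get #
# "labeled" feature importance                           #
# caution: `pipeline` still contains the OneHotEncoder step, so X_ohe is encoded
# a second time and the forest sees more columns than X_ohe.columns lists; zip()
# below then pairs the names with only the first len(X_ohe.columns) importances.
# This also refits on the test split instead of reusing the tuned model in `cv`.

X_ohe = pd.get_dummies(X_test)
pipeline.fit(X_ohe, y_test)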

fe = pipeline.named_steps["rf"].feature_importances_

feature_importance = zip(X_ohe.columns, fe)
feature_importance = sorted(feature_importance, key=lambda x:x[1], reverse=True)

for i, j in feature_importance:
    print(f"Weight: {j:.3f} | Feature: {i}")

Weight: 0.067 | Feature: creation_source_ORG_INVITE
Weight: 0.063 | Feature: creation_source_GUEST_INVITE
Weight: 0.062 | Feature: creation_source_PERSONAL_PROJECTS
Weight: 0.054 | Feature: enabled_for_marketing_drip
Weight: 0.044 | Feature: creation_source_SIGNUP_GOOGLE_AUTH
Weight: 0.034 | Feature: creation_source_SIGNUP
Weight: 0.019 | Feature: opted_in_to_mailing_list
Weight: 0.014 | Feature: invited_by_user


From the raw dataset, we used the following features:

* invited_by_user: whether a user was referred by another user (custom feature)
* creation_source: how the account was created (stock feature)
* opted_in_to_mailing_list: whether the user has opted into receiving marketing emails (stock feature)
* enabled_for_marketing_drip: whether they are on the regular marketing email drip (stock feature)

The model's test accuracy (86.69%) is comparable to the best cross-validation score on the training split (86.7%), so it does not appear to overfit. We therefore use the pipeline's feature ranking as a guide to which factors predict user adoption, keeping in mind that importance weights measure how much a feature contributes to the model's splits, not the direction of its effect. Because creation_source is one-hot encoded, the ranking is broken out per sign-up channel, which suggests specific steps the business could take to boost the likelihood of user adoption:

* Organization invite, guest invite, and personal workspace rank highest among the account-creation sources, so the business could realign its marketing to focus more on highly collaborative user groups.
* The marketing drip is the highest-ranked feature outside creation source (0.054), so it is worth retaining that effort to keep the user base engaged.
* Whether a user opts in to the mailing list is one of the weakest predictors (0.019; only invited_by_user ranks lower, at 0.014), so the emphasis placed on the newsletter call-to-action matters little for adoption. De-emphasizing it would also let the UI team keep the app feeling less commercial.
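
Since importance weights do not show whether a feature raises or lowers adoption, a per-group adoption rate is a simple complementary check before acting on these recommendations. A minimal sketch on hypothetical rows; in the notebook the same `groupby` would run on `df`, which has both columns:

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "creation_source": ["ORG_INVITE", "ORG_INVITE", "SIGNUP",
                        "SIGNUP", "GUEST_INVITE", "GUEST_INVITE"],
    "adopted_user": [1, 0, 0, 0, 1, 1],
})

# mean of a 0/1 column is the adoption rate; count shows the group size
rates = (toy.groupby("creation_source")["adopted_user"]
            .agg(["mean", "count"])
            .sort_values("mean", ascending=False))
print(rates)
```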