<font color="navy">Defining an **adopted user** as a user who has logged into the product on three separate days in at least one sevenday period, identify which factors predict future user adoption.</font>

In [1]:
import warnings
import pandas as pd
import numpy as np

warnings.simplefilter(action="ignore", category=FutureWarning)

# loading the data #
user = r"data\takehome_users.csv"
eng = r"data\takehome_user_engagement.csv"

df_user = pd.read_csv(user, encoding="ISO-8859-1")
df_eng = pd.read_csv(eng, parse_dates=["time_stamp"])

df_user = df_user.rename({"object_id":"user_id"}, axis=1)

In [2]:
# defining an 'adopted user' #
df_agg = df_eng.set_index("time_stamp")

users = df_agg["user_id"].unique()
adoption = []

for i in users:
    id_filter = df_agg["user_id"] == i
    df_filter = df_agg[id_filter].resample("1D").count()
    df_filter = df_filter.rolling(window=7).sum()
    df_filter = df_filter.dropna()
    adoption.append(any(df_filter["visited"].values >= 7))

In [3]:
# applying 'adopted_user' logic onto df #
user_adoption = list(zip(users, adoption))

df_adopt = pd.DataFrame(user_adoption)
df_adopt.columns = ["user_id", "adopted_user"]

df = df_user.merge(df_adopt, on="user_id", how="left")

In [4]:
# mapping 'adopted_user' #
df.loc[:, "adopted_user"] = df["adopted_user"].map({False:0, True:1, np.nan:0})
df.dropna(subset=["adopted_user"], inplace=True)
df["adopted_user"] = df["adopted_user"].astype(int)

In [5]:
# mapping 'invited_by_user' #
invite = lambda row: 0 if np.isnan(row) else 1
df["invited_by_user"] = df["invited_by_user_id"].apply(invite)

In [6]:
# final df #
df = df[["adopted_user", "invited_by_user", "creation_source", \
         "opted_in_to_mailing_list", "enabled_for_marketing_drip"]]

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# machine learning pipeline #
X = df[df.columns[1:]]
y = df[df.columns[0]]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.6, random_state=42)

pipeline = Pipeline(steps=[("encoder", OneHotEncoder()), \
                           ("rf", RandomForestClassifier(random_state = 42))])

params = {"rf__n_estimators" : [50, 75, 100],
          "rf__max_depth" : [5, 10, 15]}

cv = GridSearchCV(pipeline, param_grid=params, cv=3)
cv.fit(X_train, y_train)

print(f"Best parameters: {cv.best_params_}")
print(f"Training accuracy score from tuned model: \
       {cv.best_score_*100:.1f}%")

Best parameters: {'rf__max_depth': 5, 'rf__n_estimators': 50}
Training accuracy score from tuned model:        94.8%


In [8]:
# test set score #
y_pred = cv.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {test_accuracy*100:.2f}%")

Model accuracy: 94.82%


In [9]:
# replicating the pipeline without #
# using the pipeline itself to get #
# "labeled" feature importance     #

X_ohe = pd.get_dummies(X_test)
pipeline.fit(X_ohe, y_test)

fe = pipeline.named_steps["rf"].feature_importances_

feature_importance = zip(X_ohe.columns, fe)
feature_importance = sorted(feature_importance, key=lambda x:x[1], reverse=True)

for i, j in feature_importance:
    print(f"Weight: {j:.3f} | Feature: {i}")

Weight: 0.127 | Feature: creation_source_PERSONAL_PROJECTS
Weight: 0.110 | Feature: creation_source_GUEST_INVITE
Weight: 0.084 | Feature: enabled_for_marketing_drip
Weight: 0.074 | Feature: creation_source_ORG_INVITE
Weight: 0.055 | Feature: invited_by_user
Weight: 0.031 | Feature: creation_source_SIGNUP
Weight: 0.005 | Feature: creation_source_SIGNUP_GOOGLE_AUTH
Weight: 0.000 | Feature: opted_in_to_mailing_list


## Answer ###
From the raw dataset, we've utilized the following as our features:
 1. **invited_by_user** - if a user was *referred* by another user (custom feature)
 2. **creation_source** - how the account was created (stock feature)
 3. **opted_in_to_mailing_list** - whether user has opted into receiving marketing emails (stock feature)
 4. **enabled_for_marketing_drip** - whether they are on the regular marketing email drip (stock feature)
 
Our model proved itself well having a final accuracy metric comparable to the cross-validation training score (both at ~94%). Which would mean that our pipeline's *feature ranking* is likewise reliable in determining what a good predictor for user adoption is. Thanks to *one-hot encoding* we can clearly see quite specifically what the business could do to potentially boost the likelihood user engagement:
1. Because **personal workspace** and **guest invite** rank highest on how users signed up and caught on, the business could realign its marketing goals to focus more on highly-collaborative user groups.
2. The **marketing drip** scheme works and so it's important to retain this effort in keeping the user base intact.
3. Whether a user opts in the **mailing list** or not shows to be the least effective predictor and so it doesn't quite matter when there is any emphasis on the newsletter call-to-action. At least it'll help the UI team to keep the app less *commercial* and enable a good vibe for the user.