## Introduction

Understanding and predicting user adoption is critical for the growth and retention of any digital product. In this project, we explore the user behavior patterns that lead to successful engagement, defined as a user logging into the product on **three separate days within a rolling seven-day period** — a common benchmark for early product adoption.

We are provided with two datasets:
1. **User Metadata (`takehome_users`)** — containing account creation details, sign-up methods, organizational information, and marketing preferences for ~12,000 users.
2. **User Engagement (`takehome_user_engagement`)** — capturing individual login timestamps for users across time.

The primary goal of this project is to:
- Identify which users can be classified as “adopted” based on login activity.
- Engineer relevant features from the user metadata.
- Build a predictive model to uncover which characteristics are most closely associated with user adoption.

By doing so, we aim to provide insights that can guide **product onboarding strategies, marketing efforts**, and **user experience improvements** — ultimately helping the business better engage new users and boost long-term retention.


In [1]:
import pandas as pd
import numpy as np

users = pd.read_csv('takehome_users.csv', encoding='latin-1')
engagement = pd.read_csv('takehome_user_engagement.csv', parse_dates=['time_stamp'])

### Define "Adopted Users"

In [2]:
# Drop duplicates of logins on the same day per user
engagement['login_date'] = engagement['time_stamp'].dt.date
unique_logins = engagement[['user_id', 'login_date']].drop_duplicates()
unique_logins = unique_logins.sort_values(by=['user_id', 'login_date'])

In [3]:
# Identify adopted users
adopted_users = set()

for user_id, group in unique_logins.groupby('user_id'):
    dates = pd.Series(group['login_date'].tolist())
    for i in range(len(dates) - 2):
        if (dates[i+2] - dates[i]).days <= 7:
            adopted_users.add(user_id)
            break

In [4]:
# Label Adopted Users in the User Table
users['is_adopted'] = users['object_id'].isin(adopted_users).astype(int)

### Feature Engineering

In [5]:
# Convert timestamps
users['creation_time'] = pd.to_datetime(users['creation_time'])
users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit='s')

  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)


In [6]:
users['days_to_first_login'] = (users['last_session_creation_time'] - users['creation_time']).dt.days
users['used_google_auth'] = (users['creation_source'] == 'SIGNUP_GOOGLE_AUTH').astype(int)
users['invited'] = users['invited_by_user_id'].notnull().astype(int)

In [7]:
# One-hot encode creation_source
users = pd.get_dummies(users, columns=['creation_source'], prefix='source')

### Build a Model or Analyze Feature Impact

For this project, we will use logistic regression because it is well-suited for binary classification tasks like predicting user adoption. It provides interpretable results, is quick to implement, and performs well on datasets with a limited number of features. Logistic regression also gives probability outputs, allowing us to not only classify users but also estimate how likely they are to become adopted.


In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Select features
features = [
    'opted_in_to_mailing_list',
    'enabled_for_marketing_drip',
    'used_google_auth',
    'invited',
    'days_to_first_login',
] + [col for col in users.columns if col.startswith('source_')]

X = users[features].fillna(0)
y = users['is_adopted']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [10]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [11]:
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.98      2586
           1       0.90      0.86      0.88       414

    accuracy                           0.97      3000
   macro avg       0.94      0.92      0.93      3000
weighted avg       0.97      0.97      0.97      3000



##  Conclusion

In this project, we aimed to identify the key factors that predict **user adoption**, defined as logging into the product on **three separate days within a seven-day period**. Using data from 12,000 users and their login activity, we engineered an "adopted user" label and built a predictive model to uncover patterns in user behavior and sign-up characteristics.

Our final logistic regression model achieved an overall **accuracy of 97%**, with a precision of **90%** and recall of **86%** for predicting adopted users. This indicates the model performs well in correctly identifying users who are likely to become highly engaged with the product.

###  Key Insights

- Users who signed up via **organizational or personal invitations** were more likely to become adopted, suggesting **social or team-based onboarding** plays a significant role in engagement.
- Sign-ups via **Google authentication** also had a positive effect, likely due to frictionless onboarding.
- Users who opted into marketing communications showed slightly higher engagement, though the effect was modest.

###  Opportunities for Improvement

While the model performs well, we note a slight trade-off in **recall** for adopted users, meaning some engaged users are still missed. Future work could include:
- Enhancing the model using more detailed behavioral features (e.g., session duration, actions taken),
- Testing advanced models like **Random Forest** or **Gradient Boosting**,
- Implementing **oversampling** or **threshold tuning** to capture more adopted users.

---

Ultimately, this analysis provides actionable insights to support **user onboarding strategies**, improve retention, and prioritize product features that align with early user success signals.
