# Goal

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day
period, identify which factors predict future user adoption.

Before we begin, let's load the data and take a quick look.

In [102]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error,roc_auc_score,precision_score

In [103]:
import chardet
with open('takehome_users.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result

{'encoding': 'ISO-8859-1', 'confidence': 0.7296934, 'language': ''}

In [104]:
log= pd.read_csv('takehome_user_engagement.csv')
users = pd.read_csv('takehome_users.csv',encoding='ISO-8859-1')

In [105]:
log.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [106]:
log.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [107]:
log.iloc[1,0]

'2013-11-15 03:45:04'

In [108]:
log['time_stamp']= pd.to_datetime(log['time_stamp'],format='%Y-%m-%d %H:%M:%S')

In [109]:
log.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


We need to determine which users have, at least once, logged in on three separate days within a seven-day period. This is mostly a coding exercise. First, let's create a dictionary of all the users in the log and set their values to False.

In [110]:
unique_users = log['user_id'].unique()
dict_users = {user : False for user in unique_users}

The idea is to pull all the log entries for a user, sort them in ascending order, and loop over them: for each index i, compare the timestamps at i and i+2. If the timestamp at i+2 is no more than 7 days after the timestamp at i, set that user's dictionary entry to True and move on to the next user.

In [111]:
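for user in dict_users.keys():
    # All log rows for this user, oldest first; column 0 is time_stamp
    user_log = log[log['user_id']==user].sort_values(by='time_stamp',ascending=True, ignore_index=True)
    i = 0
    # Stop at the first qualifying window or when fewer than three rows remain
    while i < len(user_log) - 2 and dict_users[user] is False:
        # Rows i and i+2 bracket three logins; this compares raw timestamps,
        # so it assumes each row falls on a different calendar day
        if user_log.iloc[i+2,0] - user_log.iloc[i,0] <= pd.to_timedelta(7, unit='d'):
            dict_users[user] = True
        else:
            i+=1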
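
The adoption rule from the Goal (three separate days within one seven-day period) can also be written as a small standalone function. This is only a sketch, not the method used above: the name `is_adopted` and the parameters `min_days` and `window_days` are hypothetical, and it assumes that "seven-day period" means the first and last qualifying days are at most 6 calendar days apart. Unlike the loop, it collapses several logins on the same calendar day into one.

```python
from datetime import datetime, timedelta

def is_adopted(timestamps, min_days=3, window_days=7):
    """Return True if the logins fall on at least `min_days` distinct
    calendar days inside some span of `window_days` consecutive days."""
    # Collapse repeated logins on the same calendar day, then sort the days
    days = sorted({ts.date() for ts in timestamps})
    # Assumption: day 1 and day 7 belong to the same seven-day period
    span = timedelta(days=window_days - 1)
    # Slide over the sorted days; days[i] and days[i + min_days - 1]
    # bracket `min_days` distinct login days
    for i in range(len(days) - min_days + 1):
        if days[i + min_days - 1] - days[i] <= span:
            return True
    return False

# Toy example (made-up timestamps, not taken from the dataset)
logins = [datetime(2014, 1, 1, 9), datetime(2014, 1, 3, 9), datetime(2014, 1, 7, 9)]
print(is_adopted(logins))
```

Because pandas `Timestamp` objects also provide `.date()`, the same function could plausibly be applied per user with something like `log.groupby('user_id')['time_stamp'].apply(is_adopted)`, which avoids filtering the whole log once per user.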

Now that we have a flag for each user indicating whether they adopted, we can add it to our dataframe `users`.

In [112]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [113]:
users['adopted'] = users['object_id'].map(dict_users)

In [114]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,False
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,True
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,False
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,False
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,False


Now we need to do some cleanup.

In [115]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_for_marketing_drip  12000 non-null  int64  
 8   org_id                      12000 non-null  int64  
 9   invited_by_user_id          6417 non-null   float64
 10  adopted                     8823 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 1.0+ MB


In [118]:
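# Convert creation_time to Unix epoch seconds, the same scale as last_session_creation_time
users['creation_time'] = pd.to_datetime(users['creation_time']).map(pd.Timestamp.timestamp)
users['creation_source']= users['creation_source'].astype('category')
# Users with no recorded session get 0 as a sentinel value
users['last_session_creation_time'] = users['last_session_creation_time'].fillna(0)
users['org_id']= users['org_id'].astype('category')
# A missing inviter is encoded as 0, then turned into a boolean "invited" flag
users['invited_by_user_id']= users['invited_by_user_id'].fillna(0)
users['invited_by_user_id']= users['invited_by_user_id'].astype('int64')
users['invited']= users['invited_by_user_id'] > 0 
# Users absent from the engagement log have no recorded logins, so they are not adopted
users['adopted']= users['adopted'].fillna(False).astype(bool)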


Let's build a simple gradient-boosted tree classifier with LightGBM, and we will use ROC AUC as the main metric.

In [119]:
X = users.drop(['object_id','name', 'email', 'invited_by_user_id', 'adopted'], axis=1)
y = users['adopted']

In [120]:
import lightgbm as lgb

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1004)
d_train=lgb.Dataset(X_train, label=y_train)

params={}
params['learning_rate']=0.03
params['boosting_type']='gbdt'
params['objective']='binary' 
params['max_depth']=7
params['num_class']=1
params['verbose']=-1

clf=lgb.train(params,d_train,1000)

y_pred=clf.predict(X_test)

In [121]:
y_pred=y_pred.round(0)
y_pred=y_pred.astype(int)

In [122]:
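# Caution: roc_auc_score expects (y_true, y_score). The arguments here are swapped
# and y_pred holds rounded labels rather than probabilities.
roc_auc_score(y_pred,y_test)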

0.9535714285714285

We could use a grid search to tune the hyperparameters here. Also, since we do not have a good sense of what the business problem is, ROC AUC might not be the best metric, and we could consider other options such as precision or recall. We should also check how the model compares to a trivial baseline that predicts all False or all True.
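
The comparison against "just guessing all false or true" can be sketched as follows. This is a minimal illustration with made-up labels, not results from the dataset; the helper names `majority_baseline_accuracy` and `accuracy` are hypothetical and use only the standard library.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(bool(v) for v in y_true)
    return max(counts.values()) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(bool(t) == bool(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Toy labels (hypothetical): one adopted user out of five
y_true = [False, False, True, False, False]
y_pred = [False, False, True, True, False]

print(majority_baseline_accuracy(y_true))  # 0.8
print(accuracy(y_true, y_pred))            # 0.8
```

In this toy case the model's accuracy only matches the all-False baseline, which is why the baseline is worth reporting next to any score when the classes are imbalanced. A constant predictor has a ROC AUC of 0.5, so AUC is judged against 0.5 while accuracy is judged against the majority-class rate.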