# Import Packages & Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
user_df = pd.read_csv('takehome_users.csv', encoding='latin-1')
engage_df = pd.read_csv('takehome_user_engagement.csv')

# Data Exploration & Cleaning

In [3]:
print("Number of duplicate rows: {}\n".format(sum(user_df.duplicated())))
print(user_df.info())

Number of duplicate rows: 0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_for_marketing_drip  12000 non-null  int64  
 8   org_id                      12000 non-null  int64  
 9   invited_by_user_id          6417 non-null   float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB
None


Luckily there are no duplicated rows, but we do have some missing data. We will transform the invited_by_user_id column into a boolean indicating whether the customer was invited by someone else: if the column holds a value it becomes True, otherwise False. For the missing values of last_session_creation_time, we have to either specify a constant value or use the creation time. It is odd that some users have no last session time, since their creation time should count as their last session if they never used the product. Let's convert the last session time from a Unix timestamp to a datetime and fill in the missing values with the creation time.
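
As a quick illustration of the plan above, the sketch below applies the same three steps to a made-up three-row frame. Only the column names come from the data set; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for user_df; the values are invented for illustration
toy = pd.DataFrame({
    'creation_time': ['2014-04-22 03:53:30', '2013-11-15 03:45:04', '2013-03-19 23:14:52'],
    'last_session_creation_time': [1398138810.0, None, 1363734892.0],
    'invited_by_user_id': [10803.0, None, 1525.0],
})

# True when an inviter id is present, False when it is missing
toy['invited_by_user'] = toy['invited_by_user_id'].notna()

# Parse the creation time and convert Unix seconds to datetimes (NaN becomes NaT)
toy['creation_time'] = pd.to_datetime(toy['creation_time'])
toy['last_session_creation_time'] = pd.to_datetime(toy['last_session_creation_time'], unit='s')

# Fill missing last-session times with the creation time from the same row
toy['last_session_creation_time'] = toy['last_session_creation_time'].fillna(toy['creation_time'])
print(toy[['creation_time', 'last_session_creation_time', 'invited_by_user']])
```

Assigning the result of `fillna` back to the column keeps the operation on the original frame and avoids chained indexing.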

In [4]:
user_df['invited_by_user'] = ~user_df['invited_by_user_id'].isnull()
user_df.drop(columns=['invited_by_user_id'], inplace=True)

In [5]:
user_df['creation_time'] = pd.to_datetime(user_df['creation_time'])
user_df['last_session_creation_time'] = pd.to_datetime(user_df['last_session_creation_time'], unit='s')
# Fill missing last-session times with the creation time; assigning the whole column avoids chained indexing
user_df['last_session_creation_time'] = user_df['last_session_creation_time'].fillna(user_df['creation_time'])


Now that the missing data is filled in, let's look quickly for any erroneous values.

In [6]:
# No data issues, expected values only
user_df['creation_source'].value_counts()

ORG_INVITE            4254
GUEST_INVITE          2163
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: creation_source, dtype: int64

In [7]:
# Lots of organizations if it is to be used as a categorical variable, but no apparent data issues
print(len(user_df['org_id'].unique()))
user_df['org_id'].value_counts().head(10)

417


0     319
1     233
2     201
3     168
4     159
6     138
5     128
9     124
7     119
10    104
Name: org_id, dtype: int64

In [8]:
# No data issues, expected values only
user_df['opted_in_to_mailing_list'].value_counts()

0    9006
1    2994
Name: opted_in_to_mailing_list, dtype: int64

In [9]:
# No data issues, values in expected range
print(user_df['creation_time'].min())
print(user_df['creation_time'].max())
print(user_df['last_session_creation_time'].min())
print(user_df['last_session_creation_time'].max())

2012-05-31 00:43:27
2014-05-30 23:59:19
2012-05-31 08:20:06
2014-06-06 14:58:50


Overall, there do not seem to be any issues with the quality of the data values. Our next step is to do feature engineering and drop any unnecessary columns.

# Feature Engineering

First, let's identify the columns that we are going to drop. An individual's name provides no usable information; we could use it to check whether the person's email address is some form of their name, but it is doubtful that would help much. The org id will also be eliminated because it is a categorical feature with 417 unique values, which would result in a sparse matrix. The user's unique object id will also be dropped, but only after we use it to merge with the other table we are creating, which identifies each person as an adopted user or not. The email address is also a unique identifier, so we don't want to use the entire email, but we can extract the company portion to see how many different ones there are; some of the more frequent ones may be useful.

In [10]:
user_df[['email_name','email_company']] = user_df['email'].str.split('@', expand=True)
print('Number unique company email addresses: {}'.format(len(user_df['email_company'].unique())))
print('\nTop 10 company email addresses')
print(user_df['email_company'].value_counts().head(10))

Number unique company email addresses: 1184

Top 10 company email addresses
gmail.com         3562
yahoo.com         2447
jourrapide.com    1259
cuvox.de          1202
gustr.com         1179
hotmail.com       1165
rerwl.com            2
oqpze.com            2
qgjbc.com            2
dqwln.com            2
Name: email_company, dtype: int64


So although there are 1184 unique company email addresses, 6 of them are very common. We can break this out so that any company not in the top 6 is changed to "other". But first things first, let's eliminate some of our garbage features.
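
The "top 6, everything else is other" idea can also be written as a small reusable helper instead of a hard-coded list. The function `lump_rare` below is a hypothetical name introduced for illustration, and the sample addresses are made up (only the domains appear in the data).

```python
import pandas as pd

def lump_rare(series, top_n, other_label='other'):
    """Keep the top_n most frequent values and replace the rest with other_label."""
    top_values = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top_values), other_label)

# Invented addresses for illustration
emails = pd.Series(['a@gmail.com', 'b@gmail.com', 'c@yahoo.com',
                    'd@yahoo.com', 'e@rerwl.com', 'f@oqpze.com'])
domains = emails.str.split('@').str[1]
lumped = lump_rare(domains, top_n=2)
print(lumped.value_counts())
```

Deriving the list from `value_counts` keeps the cutoff in one place, although ties at the cutoff are broken by order of appearance, so a hard-coded list can still be the safer choice when the top values are known.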

In [11]:
user_df.drop(columns=['email', 'email_name', 'name', 'org_id'],inplace=True)
email_bool = ~user_df['email_company'].isin(['gmail.com', 'yahoo.com', 'jourrapide.com', 'cuvox.de', 'gustr.com', 'hotmail.com'])
# Assign through a single .loc call to avoid chained indexing
user_df.loc[email_bool, 'email_company'] = 'other'


The next features to consider are the creation time and last session time. We could convert the datetimes to numeric features, since newer versus older accounts could have an impact. However, it makes more sense to truncate the creation time to the year and month, so there are fewer unique values, and convert it to a string to use as a categorical feature. For the last session time, we will calculate the number of days between account creation and the last session.
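
A minimal sketch of both transformations on a made-up two-row frame (the frame `toy` and its timestamps are assumptions for illustration):

```python
import pandas as pd

# Invented timestamps for illustration
toy = pd.DataFrame({
    'creation_time': pd.to_datetime(['2014-04-22 03:53:30', '2013-11-15 03:45:04']),
    'last_session_creation_time': pd.to_datetime(['2014-04-22 03:53:30', '2014-03-31 03:45:04']),
})

# Whole days between account creation and the last session
delta = toy['last_session_creation_time'] - toy['creation_time']
toy['last_session_creation_delta'] = delta.dt.days

# Year-month bucket of the creation time, stored as a string category
toy['creation_date'] = toy['creation_time'].dt.strftime('%Y-%m')
print(toy[['last_session_creation_delta', 'creation_date']])
```

A user whose last session equals the creation time gets a delta of 0, which is also the value produced for users whose missing last session was filled with the creation time.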

In [12]:
user_df['last_session_creation_delta'] = user_df['last_session_creation_time'] - user_df['creation_time']
user_df['last_session_creation_delta'] = user_df['last_session_creation_delta'].dt.days
user_df['creation_date'] = user_df['creation_time'].dt.strftime("%Y-%m")
user_df.drop(columns=['creation_time', 'last_session_creation_time'], inplace=True)

OK, we should now be ready to go with our predictor features, so now we need to create our target feature.

In [13]:
engage_df['time_stamp'] = pd.to_datetime(engage_df['time_stamp'])
print('number of unique user ids: {}'.format(len(engage_df['user_id'].unique())))
engage_df.set_index('time_stamp', inplace=True)

number of unique user ids: 8823


The first thing to note is that there are more user ids in the user table (12000) than in the user engagement table (8823). This means that data is missing for some users, some users never logged in, or whoever created the tables made a mistake. The 8823 engaged users match the 8823 non-null last_session_creation_time values seen earlier, which suggests that the remaining users simply never logged in. Either way, let's continue by creating the target feature, which flags whether a user had 3 logins within a 7-day window.
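
One way to list the users that have no engagement rows is a simple membership test. The sketch below uses two tiny made-up tables; only the column names `object_id`, `user_id` and `visited` come from the data.

```python
import pandas as pd

# Invented tables for illustration
users = pd.DataFrame({'object_id': [1, 2, 3, 4]})
engagement = pd.DataFrame({'user_id': [1, 1, 3], 'visited': [1, 1, 1]})

# Users whose id never appears in the engagement log
has_logins = users['object_id'].isin(engagement['user_id'])
never_logged_in = users.loc[~has_logins, 'object_id']
print(never_logged_in.tolist())
```

On the real tables the same test could be run with `user_df` and `engage_df` to confirm how many users have no recorded logins.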

In [14]:
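# Count logins per user in a trailing 7-day window ending at each login
adopted = engage_df.groupby('user_id').rolling('7d').count()
# A user is adopted if any window contains at least 3 logins
adopted_users = adopted[adopted['visited'] >= 3].index.get_level_values(0).unique()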
user_df['adopted'] = user_df['object_id'].isin(adopted_users)
user_df.drop(columns=['object_id'], inplace=True)
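
The rolling-window logic above can be checked on a toy login log. In this sketch the frame `log` and its dates are invented: user 1 logs in three times within five days, while user 2 logs in three times spread over almost three weeks.

```python
import pandas as pd

# Invented login log for illustration
log = pd.DataFrame({
    'time_stamp': pd.to_datetime(['2014-01-01', '2014-01-03', '2014-01-05',
                                  '2014-01-01', '2014-01-10', '2014-01-20']),
    'user_id': [1, 1, 1, 2, 2, 2],
    'visited': [1, 1, 1, 1, 1, 1],
})

# Time-based rolling windows need a datetime index that is sorted within each user
log = log.sort_values(['user_id', 'time_stamp']).set_index('time_stamp')

# Number of logins in the 7 days up to and including each login, per user
counts = log.groupby('user_id')['visited'].rolling('7D').count()

# Users with at least one window holding 3 or more logins
adopted_ids = counts[counts >= 3].index.get_level_values('user_id').unique()
print(list(adopted_ids))
```

With the default settings the window excludes a login made exactly 7 days earlier, and it counts rows rather than distinct days, so the result depends on the log holding at most one row per user per day.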

# Model Creation

Since we only want to determine which factors are the most important in predicting user adoption, I am only going to evaluate one model type, as long as its metrics are acceptable. The model will be a Random Forest with the default hyperparameters. We will also need to encode the categorical features, since scikit-learn's random forest only accepts numerical values.

In [41]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

y = user_df['adopted']
X = user_df.drop(columns=['adopted'])

# Note: scikit-learn 1.2 renamed this argument to sparse_output (sparse was removed in 1.4)
encoder = OneHotEncoder(sparse=False).fit(X.drop(columns=['last_session_creation_delta']))
X_cat = encoder.transform(X.drop(columns=['last_session_creation_delta']))
X_cat = pd.DataFrame(X_cat, columns=encoder.get_feature_names_out())
X = X_cat.join(X['last_session_creation_delta'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, stratify=y)

model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
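# This AUC uses hard predictions on the training split, so it measures fit rather than generalization
print('train AUC: {}'.format(roc_auc_score(y_train, model.predict(X_train))))
print(classification_report(y_test, y_pred))

train AUC: 1.0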
              precision    recall  f1-score   support

       False       0.98      0.98      0.98      2599
        True       0.89      0.87      0.88       401

    accuracy                           0.97      3000
   macro avg       0.93      0.92      0.93      3000
weighted avg       0.97      0.97      0.97      3000
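
Note that the AUC printed by this cell is computed from hard class predictions on the training split, so a value of 1.0 mostly reflects how well the forest fits data it has already seen. A hedged sketch of a held-out alternative follows; it uses synthetic data from `make_classification`, and none of its variables exist in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the real features (assumption for illustration)
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, weights=[0.87],
                                   flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                          stratify=y_toy, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUC is a ranking metric, so score the predicted probability of the positive class
train_auc = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print('train AUC: {:.3f}, test AUC: {:.3f}'.format(train_auc, test_auc))
```

The gap between the two numbers gives a rough sense of how much of the training score comes from memorization.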



So the model worked really well. Let's now look at the most important features.

In [42]:
pd.DataFrame(model.feature_importances_, index = model.feature_names_in_, columns=['feat_importance']).sort_values('feat_importance',ascending=False)

                              feat_importance
last_session_creation_delta           0.88583
creation_date_2014-05                0.006102
creation_date_2014-03                0.004903
email_company_gmail.com              0.004676
opted_in_to_mailing_list_0           0.004469
opted_in_to_mailing_list_1           0.004057
email_company_other                  0.003976
email_company_yahoo.com              0.003827
creation_source_GUEST_INVITE         0.003798
enabled_for_marketing_drip_1         0.003789


Conclusion: one of the key indicators of whether someone is an adopted user is whether they are still an active user. The feature importance for the time between sign-up and last session is suspiciously large, but there shouldn't be any confounding or use of future values to make the prediction. It does make sense, though, that someone who has been using the software longer has a greater likelihood of meeting the adopted user classification. Let's make another quick model without that feature and see what we come up with.

In [43]:
# Reassign instead of dropping in place, since the splits may be views of X
X_train = X_train.drop(columns=['last_session_creation_delta'])
X_test = X_test.drop(columns=['last_session_creation_delta'])

model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
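# This AUC uses hard predictions on the training split, so it measures fit rather than generalization
print('train AUC: {}'.format(roc_auc_score(y_train, model.predict(X_train))))
print(classification_report(y_test, y_pred))

train AUC: 0.5735255133693671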
              precision    recall  f1-score   support

       False       0.87      0.98      0.92      2599
        True       0.14      0.02      0.04       401

    accuracy                           0.85      3000
   macro avg       0.50      0.50      0.48      3000
weighted avg       0.77      0.85      0.80      3000



In [40]:
pd.DataFrame(model.feature_importances_, index = model.feature_names_in_, columns=['feat_importance']).sort_values('feat_importance',ascending=False).head(20)

                              feat_importance
opted_in_to_mailing_list_1           0.048764
opted_in_to_mailing_list_0           0.048658
enabled_for_marketing_drip_1         0.040913
enabled_for_marketing_drip_0         0.040791
email_company_jourrapide.com         0.038773
email_company_gustr.com              0.036504
email_company_gmail.com              0.036193
email_company_other                  0.033952
email_company_cuvox.de               0.033104
creation_date_2014-05                0.030326


So overall the model accuracy dropped only a little, but the model is now very poor at predicting whether someone is an adopted user (recall of 0.02 for the True class). We would get better accuracy by assuming that nobody is an adopted user.
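
The baseline comparison can be checked directly from the support counts in the classification report above (2599 non-adopted and 401 adopted users in the test split):

```python
# Support counts taken from the test-set classification report
n_not_adopted = 2599
n_adopted = 401

# Accuracy of a baseline that predicts "not adopted" for every user
baseline_accuracy = n_not_adopted / (n_not_adopted + n_adopted)
model_accuracy = 0.85  # accuracy reported for the model without the delta feature

print('majority-class baseline accuracy: {:.3f}'.format(baseline_accuracy))
print('baseline beats model: {}'.format(baseline_accuracy > model_accuracy))
```

This is why accuracy alone is a weak yardstick on a data set where only about 13% of users are adopted.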