# Take-home challenge: Relax, Inc.

### The objective of this analysis is to identify which factors predict future user adoption. In this case, an "adopted user" is defined as a user who has logged into the product on three separate days in at least one seven-day period.

In [1]:
import pandas as pd

In [2]:
# import user engagement data
user_engagement = pd.read_csv('data/takehome_user_engagement.csv')
user_engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


### Here we see that the timestamp is stored as a string, so we should convert it to datetime. In addition, since we are looking at daily logins, the time of login is not needed, so let's create a new feature holding only the date. user_id and visited are both integers, which makes sense in this application, so we'll leave them alone.

In [3]:
user_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [4]:
# convert time_stamp series to datetime dtype
user_engagement.time_stamp = pd.to_datetime(user_engagement.time_stamp)
# create new feature with date only
user_engagement['date'] = user_engagement.time_stamp.dt.date

In [5]:
user_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 4 columns):
time_stamp    207917 non-null datetime64[ns]
user_id       207917 non-null int64
visited       207917 non-null int64
date          207917 non-null object
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 6.3+ MB
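
The `date` column above ends up with `object` dtype because `.dt.date` returns plain Python `date` objects. A minimal sketch of one alternative, assuming the same timestamp format as the engagement file: `dt.normalize()` truncates each timestamp to midnight while keeping a datetime dtype, so later date arithmetic stays vectorized. The two timestamps below are toy values copied from the head of the table.

```python
import pandas as pd

# Two toy timestamps in the same string format as takehome_user_engagement.csv
ts = pd.to_datetime(pd.Series(['2013-11-15 03:45:04', '2013-11-29 03:45:04']))

# normalize() sets the time of day to 00:00:00 but keeps a datetime64 dtype
day = ts.dt.normalize()

print(day.dtype)
# difference between consecutive rows, expressed in whole days
print(day.diff().dt.days.tolist())
```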


In [6]:
user_engagement.head()

Unnamed: 0,time_stamp,user_id,visited,date
0,2014-04-22 03:53:30,1,1,2014-04-22
1,2013-11-15 03:45:04,2,1,2013-11-15
2,2013-11-29 03:45:04,2,1,2013-11-29
3,2013-12-09 03:45:04,2,1,2013-12-09
4,2013-12-25 03:45:04,2,1,2013-12-25


### Now that we have a date feature, let's use it for some additional feature engineering to determine whether or not each user can be considered an adopted user.

In [7]:
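# NOTE: diff(periods=2) compares each row with the row two positions above it,
# so this relies on the rows being ordered by user_id and then by date
# (the head() output suggests the file is already in that order)
# calculate the span/range across three logins in a row
user_engagement['three_row_range'] = user_engagement.date.diff(periods=2)
# compare the rows used for the range calc to make sure user_id is the same
user_engagement['same_user'] = user_engagement.user_id.diff(periods=2) == 0
# an adopted user will have a span less than or equal to 7 days and greater than 0
# (a span of exactly 7 days covers 8 calendar days; use <= 6 for a strict seven-day window)
# in addition the span must be across the same user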
user_engagement['adopted_user'] = (user_engagement.three_row_range.dt.days <= 7) \
                                & (user_engagement.three_row_range.dt.days > 0) \
                                & user_engagement.same_user

# create a new df with a single instance for each user.
# label as adopted user if any series of 3 logins qualified them
adopted_user = user_engagement[['user_id', 'adopted_user']].groupby(['user_id']).any()                                   
user_engagement.head(30)

Unnamed: 0,time_stamp,user_id,visited,date,three_row_range,same_user,adopted_user
0,2014-04-22 03:53:30,1,1,2014-04-22,NaT,False,False
1,2013-11-15 03:45:04,2,1,2013-11-15,NaT,False,False
2,2013-11-29 03:45:04,2,1,2013-11-29,-144 days,False,False
3,2013-12-09 03:45:04,2,1,2013-12-09,24 days,True,False
4,2013-12-25 03:45:04,2,1,2013-12-25,26 days,True,False
5,2013-12-31 03:45:04,2,1,2013-12-31,22 days,True,False
6,2014-01-08 03:45:04,2,1,2014-01-08,14 days,True,False
7,2014-02-03 03:45:04,2,1,2014-02-03,34 days,True,False
8,2014-02-08 03:45:04,2,1,2014-02-08,31 days,True,False
9,2014-02-09 03:45:04,2,1,2014-02-09,6 days,True,True
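
The row-offset approach above depends on the row order and on each user having at most one login row per day. A hedged sketch of a more defensive variant follows: it deduplicates login days, sorts explicitly, and takes the two-row difference within each user. The helper name `flag_adopted`, its `max_span_days` parameter, and the toy data are assumptions for illustration and are not part of the original notebook. The default of 7 mirrors the threshold used above; passing 6 applies the stricter reading of "seven-day period".

```python
import pandas as pd

def flag_adopted(engagement, max_span_days=7):
    """Hypothetical helper: True per user_id if any three distinct login
    days are at most max_span_days apart (first to third)."""
    days = (
        engagement
        .assign(day=pd.to_datetime(engagement['time_stamp']).dt.normalize())
        .drop_duplicates(subset=['user_id', 'day'])   # one row per user per day
        .sort_values(['user_id', 'day'])              # make the ordering explicit
    )
    # gap between each login day and the login day two rows earlier, per user
    span = days.groupby('user_id')['day'].diff(periods=2)
    # NaT (fewer than three login days so far) compares as False
    hit = span <= pd.Timedelta(days=max_span_days)
    return hit.groupby(days['user_id']).any()

# Toy data, deliberately unsorted; user 1 logs in twice on the same day
toy = pd.DataFrame({
    'time_stamp': ['2014-02-09 03:45:04', '2014-01-01 09:00:00',
                   '2014-02-03 03:45:04', '2014-01-01 18:00:00',
                   '2014-02-08 03:45:04', '2014-01-02 09:00:00',
                   '2014-04-22 03:53:30'],
    'user_id': [2, 1, 2, 1, 2, 1, 3],
})
adopted = flag_adopted(toy)
print(adopted)
```

In this toy example user 1 has three login rows but only two distinct days, so the deduplication step keeps that user from being flagged.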


In [8]:
adopted_user.head(10)

Unnamed: 0_level_0,adopted_user
user_id,Unnamed: 1_level_1
1,False
2,True
3,False
4,False
5,False
6,False
7,False
10,True
11,False
13,False


In [9]:
# import user data
users = pd.read_csv('data/takehome_users.csv', encoding='latin-1')
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [10]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


### Above we see two time columns that are not stored as datetime objects: creation_time is a string and last_session_creation_time is a float. The head above even shows the latter as a raw number, which is a Unix timestamp in seconds. We also see that creation_source may be a good candidate for a categorical variable. Lastly, 'invited_by_user_id' looks as though it could be an integer, but let's check whether it contains any NaNs first.

In [11]:
users.invited_by_user_id.value_counts(dropna=False)

NaN        5583
10741.0      13
2527.0       12
1525.0       11
2308.0       11
           ... 
7941.0        1
4134.0        1
6101.0        1
129.0         1
594.0         1
Name: invited_by_user_id, Length: 2565, dtype: int64

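### invited_by_user_id contains 5583 NaNs, so it cannot be cast to an integer directly. The 6417 non-null values match the combined count of ORG_INVITE and GUEST_INVITE accounts shown below. creation_source does not have any NaNs.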

In [12]:
users.creation_source.value_counts(dropna=False)

ORG_INVITE            4254
GUEST_INVITE          2163
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: creation_source, dtype: int64

In [13]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [14]:
# convert creation time to datetime
users.creation_time = pd.to_datetime(users.creation_time)
# convert creation_source to categorical
users.creation_source = users.creation_source.astype('category')
# make dummy features for each creation_source category
creation_source_dummies = pd.get_dummies(users.creation_source)

# join creation_source dummies to user df 
users = users.join(creation_source_dummies)

# fill invited_by_users nan's with 0's to allow int conversion
users['invited_by_user_id'] = users.invited_by_user_id.fillna(0).astype('int64')

# convert last_session_creation_time to datetime
users.last_session_creation_time = pd.to_datetime(users.last_session_creation_time, unit='s')
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,GUEST_INVITE,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803,1,0,0,0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316,0,1,0,0,0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525,0,1,0,0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151,1,0,0,0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240,1,0,0,0,0


In [15]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 15 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null datetime64[ns]
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null category
last_session_creation_time    8823 non-null datetime64[ns]
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null int64
GUEST_INVITE                  12000 non-null uint8
ORG_INVITE                    12000 non-null uint8
PERSONAL_PROJECTS             12000 non-null uint8
SIGNUP                        12000 non-null uint8
SIGNUP_GOOGLE_AUTH            12000 non-null uint8
dtypes: category(1), datetime64[ns](2), int64(5), object(2), uint8(5)
memory usage: 914.4+ KB
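
Filling the missing `invited_by_user_id` values with 0 makes the integer cast possible, but it also introduces an id that does not belong to any real user. A minimal sketch of two alternatives, using toy values: pandas' nullable `Int64` dtype keeps the gaps as missing values, and a separate binary flag (called `was_invited` here, a name chosen for illustration) records whether the user was invited at all.

```python
import numpy as np
import pandas as pd

# Toy column: two inviter ids from the head of the users table plus one missing value
invited_by = pd.Series([10803.0, 316.0, np.nan], name='invited_by_user_id')

# Nullable integer dtype: missing values stay <NA> instead of becoming a fake id 0
invited_by_int = invited_by.astype('Int64')

# Binary indicator that can be passed to a model directly
was_invited = invited_by.notna().astype(int)

print(invited_by_int)
print(was_invited.tolist())
```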


In [16]:
# merge adopted_user classification back to user table
# (merge defaults to how='inner', so users with no engagement records are dropped here)
users = users.merge(adopted_user, left_on='object_id', right_index=True)
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,GUEST_INVITE,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH,adopted_user
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803,1,0,0,0,0,False
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316,0,1,0,0,0,True
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525,0,1,0,0,0,False
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151,1,0,0,0,0,False
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240,1,0,0,0,0,False
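
The merge above uses the default inner join, so users who never appear in the engagement table (for example ids 8, 9 and 12, which are absent from the `adopted_user` output) do not reach the model. If those users should count as non-adopted instead, a left join is one option. The sketch below uses small toy frames with the same column names as the notebook; treating never-seen users as `False` is an assumption about the intended analysis.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames: user 3 has no engagement records
users = pd.DataFrame({'object_id': [1, 2, 3]})
adopted_user = pd.DataFrame({'adopted_user': [False, True]},
                            index=pd.Index([1, 2], name='user_id'))

# how='left' keeps every row of users; unmatched rows get NaN
merged = users.merge(adopted_user, left_on='object_id',
                     right_index=True, how='left')

# NaN == True evaluates to False, so users without logins become non-adopted
merged['adopted_user'] = merged['adopted_user'].eq(True)

print(merged)
```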


### Now let's run a simple logistic regression to assess feature importance

In [17]:
from sklearn.linear_model import LogisticRegression

#X = users[['opted_in_to_mailing_list', 'enabled_for_marketing_drip', 'GUEST_INVITE', 'ORG_INVITE', 'PERSONAL_PROJECTS', 'SIGNUP', 'SIGNUP_GOOGLE_AUTH', 'org_id', 'invited_by_user_id']]
X = users[['opted_in_to_mailing_list', 'enabled_for_marketing_drip', 'GUEST_INVITE', 'ORG_INVITE', 'PERSONAL_PROJECTS', 'SIGNUP', 'SIGNUP_GOOGLE_AUTH']]
y = users['adopted_user']

model = LogisticRegression(solver='lbfgs')

model.fit(X, y)

feat_importance = model.coef_[0]

for feat, score in enumerate(feat_importance):
    print('Feature: %d, Score: %.5f' % (feat, score))

Feature: 0, Score: 0.04035
Feature: 1, Score: 0.01524
Feature: 2, Score: 0.24012
Feature: 3, Score: -0.08049
Feature: 4, Score: 0.19882
Feature: 5, Score: -0.22849
Feature: 6, Score: -0.13096
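
The scores above are printed by position, which makes them hard to read. A small sketch that pairs the printed coefficients with the column names of `X` (in the same order) and converts them to odds ratios; the numbers are copied from the output above and are not refitted. Inside the notebook, `zip(X.columns, model.coef_[0])` would give the same pairing.

```python
import numpy as np
import pandas as pd

# Column order of X in the cell above
feature_names = ['opted_in_to_mailing_list', 'enabled_for_marketing_drip',
                 'GUEST_INVITE', 'ORG_INVITE', 'PERSONAL_PROJECTS',
                 'SIGNUP', 'SIGNUP_GOOGLE_AUTH']
# Coefficients copied from the printed output
coefs = [0.04035, 0.01524, 0.24012, -0.08049, 0.19882, -0.22849, -0.13096]

coef_table = pd.DataFrame({'coef': coefs}, index=feature_names)
# exp(coef) is the multiplicative change in the odds of adoption for a one-unit increase
coef_table['odds_ratio'] = np.exp(coef_table['coef'])

print(coef_table.sort_values('coef', ascending=False))
```

Read this way, the GUEST_INVITE coefficient corresponds to an odds ratio of roughly 1.27, which is a modest effect.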


### The feature scores above (listed in the same order as the columns of X) show that accounts created via GUEST_INVITE (0.240) or PERSONAL_PROJECTS (0.199) have the largest positive coefficients, so these users were more likely to become adopted users. SIGNUP (-0.228), SIGNUP_GOOGLE_AUTH (-0.131) and ORG_INVITE (-0.080) have negative coefficients, meaning these users were less likely to become adopted users. Overall, this suggests that creation source is the strongest predictor of adoption among the features considered. One possible explanation of this behavior is that users who joined with a specific project-based goal were more likely to become adopted users.

### We also see that users who opted in to the mailing list were slightly more likely to be adopted users (0.040), but this is a much weaker predictor than the creation source.
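
One simple way to sanity-check conclusions like these is to compare raw adoption rates per creation source alongside the regression coefficients. The sketch below shows the idea on a made-up toy frame; the rows are illustrative only and are not results from the data. In the notebook, the same `groupby` could be run on `users` with the `creation_source` and `adopted_user` columns.

```python
import pandas as pd

# Made-up toy rows, only to illustrate the calculation
toy = pd.DataFrame({
    'creation_source': ['GUEST_INVITE', 'GUEST_INVITE', 'SIGNUP', 'SIGNUP', 'ORG_INVITE'],
    'adopted_user': [True, False, False, False, True],
})

# The mean of a boolean column is the share of True values, i.e. the adoption rate
rates = toy.groupby('creation_source')['adopted_user'].agg(['mean', 'size'])

print(rates.sort_values('mean', ascending=False))
```

Reporting the group size next to the rate helps show when a high or low rate rests on only a handful of users.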