In [8]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from collections import Counter

%matplotlib inline

The data is available as two attached CSV files:<br>
takehome_user_engagement.csv<br>
takehome_users.csv<br>

The data has the following two tables:<br>
1] A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years.<br>

This table includes:

    ● name: the user's name
    ● object_id: the user's id
    ● email: email address
    ● creation_source: how their account was created.
    This takes on one of 5 values:
        ○ PERSONAL_PROJECTS: invited to join another user's personal workspace
        ○ GUEST_INVITE: invited to an organization as a guest (limited permissions)
        ○ ORG_INVITE: invited to an organization (as a full member)
        ○ SIGNUP: signed up via the website
        ○ SIGNUP_GOOGLE_AUTH: signed up using Google

Authentication (using a Google email account for their login id)

    ● creation_time: when they created their account
    ● last_session_creation_time: unix timestamp of last login
    ● opted_in_to_mailing_list: whether they have opted into receiving marketing emails
    ● enabled_for_marketing_drip: whether they are on the regular marketing email drip
    ● org_id: the organization (group of users) they belong to
    ● invited_by_user_id: which user invited them to join (if applicable).
    
2] A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.<br>
    Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven<br>
    ­day period, identify which factors predict future user adoption.
We suggest spending 1-2 hours on this, but you're welcome to spend more or less. <br>
Please send us a brief writeup of your findings (the more concise, the better, no more than one page),
along with any summary tables, graphs, code, or queries that can help us understand your approach. 
    
<br>
Please note any factors you considered or investigation you did, even if they did not pan out.<br> Feel free to identify any further research or data you think would be valuable.

# Solutions:

In [372]:
# Load the data files:
data1 = pd.read_csv('takehome_users.csv', encoding = "ISO-8859-1")
data2 = pd.read_csv('takehome_user_engagement.csv')

In [373]:
# Renaming the object_id column to the user_id column: 
data1 = data1.rename(columns={'object_id': 'user_id'})
data1.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [375]:
# Converting time_stamp to yymmdd format:
data2['time_stamp'] = pd.to_datetime(pd.to_datetime(data2['time_stamp']).dt.strftime('%Y-%m-%d'))

In [377]:
# Filtering the samples for three logins each on separate dates
# in at least seven day periods:(adopted users)
j0 = 1
i0 = 0
users = []
for i, j in enumerate(data2.user_id.values):
    base_time = data2.time_stamp[i0]
    if j == j0 and data2.user_id.values[i+3] == data2.user_id.values[i]:
        diff = (data2.time_stamp[i+3]-base_time)/(np.timedelta64(1, 'D'))
        if diff < 7.: # in seven days period
            users.append(j)
    j0 = j; i0 = i

In [308]:
# adopted users in dataframe:
adopt = pd.DataFrame(list(set(users)), index=range(len(set(users))), columns=['user_id'])

In [309]:
# Creating a columns with values=1 for the adopted users:
adopt['target'] = 1

In [310]:
# Merge the 'adopt' table to the data set data1:
df = pd.merge(data1, adopt, on='user_id', how='outer')

In [311]:
# Assigning target value '0' for non-adopted users 
df.target.fillna(0, inplace=True)

In [312]:
# Dropping the irrelavant features:
drop_cols = ['user_id', 'name'] 
df.drop(drop_cols, axis=1, inplace=True)

Here, we found that only 8 features remaining. For the problem with categorical variables with binary classification<br> 
adopting class 1 and non-adopting class 0. We use ExtraTreeClassifier from sklearn to observe the feature importance. <br> Categorical variables should be changed to numerical by using LabelEncoder.

In [378]:
catgorical = ['creation_time', 'creation_source',
              'last_session_creation_time',
              'org_id', 'invited_by_user_id',
              'email']
numerical = ['opted_in_to_mailing_list',
             'enabled_for_marketing_drip',
             'target']

In [380]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import ExtraTreesClassifier

In [381]:
df_le = df[catgorical].apply(LabelEncoder().fit_transform).reset_index()
df_nu = df[numerical].reset_index()
df = pd.merge(df_le, df_nu,
              on='index',
              how='inner').drop('index', axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 9 columns):
creation_time                 12000 non-null int64
creation_source               12000 non-null int64
last_session_creation_time    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null int64
email                         12000 non-null int64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
target                        12000 non-null float64
dtypes: float64(1), int64(8)
memory usage: 937.5 KB


In [382]:
# Fitting the data features to the target 'target' by sklearn ExtraTreeClassifier:
etc = ExtraTreesClassifier(bootstrap=True, n_estimators=100, random_state=33)
X = df.drop('target', axis=1)
y = df['target']
etc.fit(X, y)

ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=33, verbose=0,
                     warm_start=False)

In [411]:
# Extracting the feature importances:
Important_features = pd.Series(rf.feature_importances_,
                               index=X.columns[:-1]).nlargest(7)

In [444]:
# Table of the feature importances from the tree algorithms:
feat = pd.DataFrame(Important_features)
feat['index'] = feat[0]
feat.drop(0, axis=1)

Unnamed: 0,index
creation_time,0.348515
invited_by_user_id,0.299693
org_id,0.263838
last_session_creation_time,0.040207
creation_source,0.027771
email,0.010767
opted_in_to_mailing_list,0.009209


ExtraTreeClassifier calculates the feature importance as: decrease of node impurity weighted by <br> 
nodee probability. From the table above, we found that the importance of the features ranks as:
#### 1. Creation Time
#### 2. invited_by_user_id
#### 3. org_id (invited to an organization)
#### 4. last_session_creation time
#### 5. Creation Source
#### 6. email id used 
and the least important features are:
#### 7. Opted_in_to mailing list 
#### 8. enabled for marketing drip