# Relax Data Science Challenge

The data is available as two attached CSV files:

takehome_user_engagement.csv

takehome_users.csv

The data has the following two tables:

1] A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. This table includes:

● name: the user's name

● object_id: the user's id

● email: email address

● creation_source: how their account was created. This takes on one of 5 values:

○ PERSONAL_PROJECTS: invited to join another user's personal workspace

○ GUEST_INVITE: invited to an organization as a guest (limited permissions)

○ ORG_INVITE: invited to an organization (as a full member)

○ SIGNUP: signed up via the website

○ SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)

● creation_time: when they created their account

● last_session_creation_time: unix timestamp of last login

● opted_in_to_mailing_list: whether they have opted into receiving marketing emails

● enabled_for_marketing_drip: whether they are on the regular marketing email drip

● org_id: the organization (group of users) they belong to

● invited_by_user_id: which user invited them to join (if applicable).

2] A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

We suggest spending 1-2 hours on this, but you're welcome to spend more or less.

Please send us a brief writeup of your findings (the more concise, the better; no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

In [2]:
#import user df
users = pd.read_csv('takehome_users.csv',encoding = "ISO-8859-1")
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,4/22/2014 3:53,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,11/15/2013 3:45,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,3/19/2013 23:14,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,5/21/2013 8:09,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,1/17/2013 10:14,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
#import user engagement df
engage = pd.read_csv('takehome_user_engagement.csv',encoding = "ISO-8859-1")
engage.time_stamp = pd.to_datetime(engage.time_stamp)
engage.index=engage.time_stamp
engage.drop(labels='time_stamp',axis=1,inplace=True)

In [5]:
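#group by calendar week (weeks ending Sunday) and user_id, then sum the visits per week
#note: calendar weeks only approximate "any seven-day period"; logins that straddle a week boundary are split
df_agg = engage.groupby([pd.Grouper(freq='W'),'user_id']).sum()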

In [6]:
#find all user ids with at least one week of 3 or more visits, which indicates an adopted user
adopted_ids = df_agg[df_agg.visited>=3].index.get_level_values('user_id').unique()
adopted_users = pd.DataFrame({'user_id': adopted_ids})
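
The weekly resample above counts logins per calendar week, so three logins that fall within seven days but cross a week boundary are not counted together. A minimal sketch of a sliding-window check follows, assuming one row per user per login as the task describes. The toy data and the helper name `is_adopted` are hypothetical and not part of the original analysis.

```python
import pandas as pd

# Toy engagement log (hypothetical data): one row per user per login.
toy = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2014-01-04", "2014-01-06", "2014-01-08",  # user 1: three days inside one 7-day span
        "2014-01-01", "2014-01-10", "2014-01-20",  # user 2: logins too far apart
    ]),
    "user_id": [1, 1, 1, 2, 2, 2],
})

def is_adopted(timestamps: pd.Series, n_days: int = 3, window_days: int = 7) -> bool:
    """Return True if n_days distinct login days fall inside any window_days-day span."""
    days = (
        timestamps.dt.normalize()
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
    )
    if len(days) < n_days:
        return False
    # Gap between each login day and the login day (n_days - 1) positions later.
    span = days.shift(-(n_days - 1)) - days
    # Trailing NaT values compare as False, so they never count as a match.
    return bool((span < pd.Timedelta(days=window_days)).any())

adopted = toy.groupby("user_id")["time_stamp"].apply(is_adopted)
```

In this toy log, user 1's first login falls on a Saturday and the other two in the following Monday-to-Sunday week, so a `freq='W'` grouping would split them, while the sliding check treats them as one qualifying period.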

In [7]:
#create df of features: the inner merge keeps only the users that appear in adopted_users
df_join = users.merge(adopted_users,how='inner',left_on='object_id',right_on='user_id')
df_join.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_id
0,2,11/15/2013 3:45,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2
1,10,1/16/2013 22:08,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,1401833000.0,1,1,318,4143.0,10
2,20,3/6/2014 11:46,Helms Mikayla,lqyvjilf@uhzdq.com,SIGNUP,1401364000.0,0,0,58,,20
3,33,3/11/2014 6:29,Araujo José,JoseMartinsAraujo@cuvox.de,GUEST_INVITE,1401518000.0,0,0,401,79.0,33
4,42,11/11/2012 19:05,Pinto Giovanna,GiovannaCunhaPinto@cuvox.de,SIGNUP,1401045000.0,1,0,235,,42
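
Because the inner merge drops every non-adopted user, the joined frame has no comparison group. One possible way to keep all users and attach a binary label is a left merge with `indicator=True`. This is a sketch on toy stand-ins for `users` and `adopted_users`; the values and the column name `adopted` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for `users` and `adopted_users` (hypothetical values).
toy_users = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "creation_source": ["GUEST_INVITE", "ORG_INVITE", "ORG_INVITE", "SIGNUP"],
})
toy_adopted = pd.DataFrame({"user_id": [2, 4]})

# A left merge keeps every user; `_merge` records whether a match was found.
labeled = toy_users.merge(
    toy_adopted, how="left", left_on="object_id", right_on="user_id", indicator=True
)
labeled["adopted"] = (labeled["_merge"] == "both").astype(int)
labeled = labeled.drop(columns=["user_id", "_merge"])

# Share of adopted users per creation source.
rate_by_source = labeled.groupby("creation_source")["adopted"].mean()
```

With a label like this, adoption rates can be compared across any categorical feature before moving on to a model.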


In [8]:
#drop unnecessary columns
drop_cols = list(df_join.columns[0:4])
drop_cols.append('user_id')
df_join = df_join.drop(drop_cols,axis=1)

In [9]:
#fill missing values in invited_by_user_id column (0 = not invited by another user)
df_join['invited_by_user_id'] = df_join['invited_by_user_id'].fillna(0)

# One Hot Encoder

One-hot encoding is a common technique for working with categorical features. It replaces each categorical value with a 
binary indicator column so that ML algorithms can use the feature for prediction. 

In [10]:
#one hot encode creation_source feature
df_create = pd.get_dummies(df_join['creation_source'])
df_features = pd.concat([df_join,df_create],axis=1)
df_features.drop('creation_source',axis=1,inplace=True)

#convert all columns to float64
df_features = df_features.astype('float64')
df_features.head()

Unnamed: 0,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,GUEST_INVITE,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
0,1396238000.0,0.0,0.0,1.0,316.0,0.0,1.0,0.0,0.0,0.0
1,1401833000.0,1.0,1.0,318.0,4143.0,0.0,1.0,0.0,0.0,0.0
2,1401364000.0,0.0,0.0,58.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1401518000.0,0.0,0.0,401.0,79.0,1.0,0.0,0.0,0.0,0.0
4,1401045000.0,1.0,0.0,235.0,0.0,0.0,0.0,0.0,1.0,0.0


# PCA

PCA is used for dimensionality reduction: it projects correlated variables onto a smaller set of uncorrelated components, 
which can lower the likelihood of overfitting. It is also used for feature extraction and, less directly, for feature elimination. 

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#scale data
scaler = StandardScaler()
features = scaler.fit_transform(df_features)

#fit PCA
pca = PCA()
components = pca.fit_transform(features)

In [20]:
np.sum(pca.explained_variance_ratio_[0:9])

1.0

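The first nine components account for 100% of the variance in the data, so the tenth component carries no information. This is expected: the five one-hot columns always sum to 1, which makes one of them linearly redundant.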
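
To see why the last component can carry no variance, here is a small sketch on a toy three-level column (hypothetical data). It computes PCA through NumPy's SVD rather than scikit-learn's `PCA`, which gives the same explained-variance ratios on standardized data. Since the dummy columns of one categorical feature always sum to 1, the standardized block loses one rank.

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical data) with three levels.
source = pd.Series(
    ["SIGNUP", "ORG_INVITE", "GUEST_INVITE", "SIGNUP", "ORG_INVITE", "SIGNUP"]
)
dummies = pd.get_dummies(source).astype("float64")

# Standardize each column the way StandardScaler does (population std, ddof=0).
scaled = (dummies - dummies.mean()) / dummies.std(ddof=0)

# PCA via SVD: squared singular values are proportional to explained variance.
singular_values = np.linalg.svd(scaled.to_numpy(), compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
```

Here `explained_ratio` has three entries, and the last one is zero up to floating-point error. Passing `drop_first=True` to `pd.get_dummies` is one common way to remove the redundant column.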

In [13]:
df_comp = pd.DataFrame(pca.components_,columns=df_features.columns,index=['PC-1','PC-2','PC-3','PC-4','PC-5','PC-6','PC-7','PC-8','PC-9','PC-10'])
#absolute values of the component loadings, keeping only those above 0.1
best_features = np.absolute(df_comp[np.absolute(df_comp) > 0.1])
best_features.head()

Unnamed: 0,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,GUEST_INVITE,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
PC-1,,,,,0.637653,0.232628,0.486575,0.233976,0.390324,0.296733
PC-2,,0.695712,0.69891,,,,,,,
PC-3,0.116892,,,,,0.78406,0.590037,,,
PC-4,,,,0.261476,,,,0.108246,0.648711,0.697782
PC-5,0.230407,,,0.219055,,,,0.825688,0.296054,0.350811


In [21]:
#sum absolute loadings over the first nine components as a rough relative estimate of feature importance
best_features.head(9).sum(axis=0).sort_values(ascending=False)

SIGNUP_GOOGLE_AUTH            1.976223
SIGNUP                        1.822364
org_id                        1.792770
last_session_creation_time    1.700941
PERSONAL_PROJECTS             1.674344
ORG_INVITE                    1.418675
enabled_for_marketing_drip    1.405251
opted_in_to_mailing_list      1.401398
invited_by_user_id            1.400076
GUEST_INVITE                  1.305649
dtype: float64

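These nine components capture all of the variance among the adopted users. Because PCA is unsupervised and was fit on adopted
users only, the summed loadings are a rough indication of which features vary most within that group rather than a direct
measure of predictive power. The top three features that stand out are:

SIGNUP_GOOGLE_AUTH            1.976223

SIGNUP                        1.822364

org_id                        1.792770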