# Relax Challenge

**The Data Science Method**  

1.   Problem Identification 

2.   Data Wrangling
  * Data Collection
  * Data Organization
  * Data Definition
  * Data Cleaning

3.   Exploratory Data Analysis
  * Build data profile tables and plots
    - Outliers & Anomalies
  * Explore data relationships
  * Identification and creation of features

4.   Pre-processing and Training Data Development
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set

5.   Modeling
  * Fit Models with Training Data Set
  * Review Model Outcomes — Iterate over additional models as needed.
  * Identify the Final Model

6.   Documentation
  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code 
  * Finalize Documentation

## Data Wrangling

In [88]:
#load common python packages
import os
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.4f' % x) #get rid of scientific notations
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import time
import math
import re
from sklearn import preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.metrics import mean_squared_error, mean_absolute_error
from IPython.display import Image
%matplotlib inline

In [89]:
# switch folder
os.chdir('C:\\Users\\tc18f\\Desktop\\springboard\\relax_challenge')
os.getcwd()

'C:\\Users\\tc18f\\Desktop\\springboard\\relax_challenge'

In [131]:
# load the user table (takehome_users.csv)
df = pd.read_csv('takehome_users.csv', encoding = "ISO-8859-1")
df.head() # take a quick look

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398138810.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396237504.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363734892.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210168.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358849660.0,0,0,193,5240.0


In [91]:
dfe = pd.read_csv('takehome_user_engagement.csv')
dfe.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


Our goal is to find the users who logged into the product on three separate days within at least one seven-day period (the "adopted" users), and to identify which factors predict future user adoption. First, let's find out which user_ids count as adopted.

In [92]:
# check dfe.info() to see each column's Dtype
dfe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [93]:
# truncate each time stamp to the whole day and keep it as a datetime dtype
dfe['time_stamp'] = pd.to_datetime(dfe['time_stamp']).dt.normalize()
dfe.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22,1,1
1,2013-11-15,2,1
2,2013-11-29,2,1
3,2013-12-09,2,1
4,2013-12-25,2,1


In [94]:
# sort by time_stamp and check whether every calendar day in the range has at least one login
dfe=dfe.sort_values('time_stamp')
start = min(dfe.time_stamp)
end = max(dfe.time_stamp)
display('starting date is {}'.format(start))
display('total number of days logged is {}'.format(str(end-start)))
display('total number of unique dates is {} days'.format((dfe.time_stamp.nunique())))
display('last day is {}'.format(end))

'starting date is 2012-05-31 00:00:00'

'total number of days logged is 736 days 00:00:00'

'total number of unique dates is 736 days'

'last day is 2014-06-06 00:00:00'

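Now we know the date range: 2012-05-31 to 2014-06-06. The first and last dates are 736 days apart, which makes 737 calendar dates counting both ends, while only 736 unique dates appear in the data. So one calendar date in the range has no logins.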
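
One way to check for calendar gaps directly is to compare the observed dates with a complete daily range. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the challenge data) and assumes the time stamps have already been truncated to whole days.

```python
import pandas as pd

# Hypothetical toy data standing in for dfe: 2012-06-02 has no logins.
toy = pd.DataFrame({
    "time_stamp": pd.to_datetime(
        ["2012-05-31", "2012-06-01", "2012-06-01", "2012-06-03"]
    ),
    "user_id": [1, 1, 2, 1],
})

# Every calendar day between the first and last observed date.
full_range = pd.date_range(toy["time_stamp"].min(), toy["time_stamp"].max(), freq="D")

# Calendar days that never appear in the data.
observed = pd.DatetimeIndex(toy["time_stamp"].unique())
missing_days = full_range.difference(observed)

print(len(full_range), len(observed))
print(list(missing_days.strftime("%Y-%m-%d")))
```

If `missing_days` is empty, the number of unique dates equals the length of the full range.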

In [95]:
# get the ids that appear more than 2 times over the whole period
value_count = dfe['user_id'].value_counts()
keep_id = value_count[value_count > 2].index.tolist() # user_ids with at least 3 logins
len(keep_id)

2248

In [96]:
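# add a time_ranked column: 0 for 2012-05-31, then +1 for each later date that appears in the data
# (dfe is already sorted, so unique() returns the dates in order; a date with no logins is skipped,
# so time_ranked is a rank of observed dates rather than a day offset)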
dfe['time_ranked'] = dfe['time_stamp'].replace([i for i in dfe.time_stamp.unique()], [i for i in range(dfe.time_stamp.nunique())])
dfe.head()

Unnamed: 0,time_stamp,user_id,visited,time_ranked
26821,2012-05-31,1693,1,0
59486,2012-05-31,3428,1,0
178140,2012-05-31,10012,1,0
175638,2012-05-31,9899,1,0
179759,2012-06-01,10163,1,1


In [97]:
# subset dfe to users who logged in at least 3 times over the whole date range (saves time in the next cell)
dfe = dfe[dfe['user_id'].isin(keep_id)]
dfe.head()

Unnamed: 0,time_stamp,user_id,visited,time_ranked
26821,2012-05-31,1693,1,0
59486,2012-05-31,3428,1,0
140780,2012-06-01,8068,1,1
126542,2012-06-02,7170,1,2
60374,2012-06-02,3514,1,2


In [98]:
# collect the ids that appear more than 2 times inside a window of time_ranked values
# caveat: the outer loop counts users rather than days, and range(i+7) selects ranks
# 0..i+6, so every window starts at rank 0 and grows instead of sliding; the later
# windows contain every row, which is why the result matches len(keep_id)
adopted_id_list=[]
for i in range(dfe.user_id.nunique()-6): # i runs over 0..(number of users - 7)
    week_subset = dfe[dfe['time_ranked'].isin([j for j in range(i+7)])] # rows with rank 0..i+6
    value_count = week_subset['user_id'].value_counts() # login counts per user in the subset
    for k in range(week_subset.user_id.nunique()): # iterate over the users in the subset
        idx = value_count.index[k] # get the id number
        idx_count = value_count[idx] # get that id's count in the subset
        if idx_count > 2: # append it to the adopted id list if the count is larger than 2
            adopted_id_list.append(idx)
# reduce adopted_id_list to unique ids
adopted_id = pd.Series(adopted_id_list).unique()
len(adopted_id)

2248
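
The adoption rule stated earlier (three separate login days within some seven-day period) can also be checked per user with a sliding comparison. This is a minimal sketch, not the notebook's method: `adopted_users` and the `toy` frame are hypothetical names, and it assumes `time_stamp` is already a datetime column.

```python
import pandas as pd

def adopted_users(logins: pd.DataFrame) -> set:
    """Return user_ids with 3 distinct login days inside some 7-day window."""
    # One row per (user, calendar day), sorted so shifts compare neighbours in time.
    days = (
        logins.assign(day=logins["time_stamp"].dt.normalize())
        .drop_duplicates(subset=["user_id", "day"])
        .sort_values(["user_id", "day"])
    )
    # For each login day, the day of the same user's login two visits later.
    two_later = days.groupby("user_id")["day"].shift(-2)
    # Three days fit in a 7-day window when the first and third are at most 6 days apart.
    in_window = (two_later - days["day"]) <= pd.Timedelta(days=6)
    return set(days.loc[in_window, "user_id"].tolist())

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time_stamp": pd.to_datetime([
        "2014-01-01", "2014-01-03", "2014-01-07",   # user 1: 3 days within 7
        "2014-01-01", "2014-01-10", "2014-01-20",   # user 2: spread out
        "2014-01-01", "2014-01-01", "2014-01-02",   # user 3: only 2 distinct days
    ]),
})
print(adopted_users(toy))
```

Working on calendar dates rather than on ranks of observed dates keeps the window exactly seven days wide even when some date has no logins.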

Now that we have the adopted user_ids, let's add a column to df that indicates whether each user is adopted.

In [132]:
df['adopted'] = df['object_id'].isin(adopted_id)
df.head(2)

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398138810.0,1,0,11,10803.0,False
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396237504.0,0,0,1,316.0,True


In [101]:
# let's take a look at invited_by_user_id
df['invited_by_user_id'].value_counts(ascending=False)

10741.0000    13
2527.0000     12
2308.0000     11
1525.0000     11
11770.0000    11
              ..
2746.0000      1
10456.0000     1
8371.0000      1
6266.0000      1
3572.0000      1
Name: invited_by_user_id, Length: 2564, dtype: int64

We're not going to make 2564 dummy variables for this, so we will create a new column that states whether or not the user was invited by another user.

In [133]:
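# intended: True when the user was invited by another user
# caveat: `x == True` only matches an inviter id equal to 1, so rows with any other
# inviter id come out False; df['invited_by_user_id'].notna() matches the intent
df['invited_by_user'] = df['invited_by_user_id'].apply(lambda x: True if x ==True else False)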
#drop the invited_by_user_id
df.drop('invited_by_user_id', inplace=True, axis=1)
# drop the last_session_creation_time column, it's highly biased
df.drop('last_session_creation_time', inplace=True, axis=1)
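
A flag for "has an inviter" can be built from whether `invited_by_user_id` is present. The sketch below contrasts `notna()` with an equality test against `True` on a made-up frame (`toy` and both flag names are hypothetical).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"invited_by_user_id": [10803.0, np.nan, 1.0, 316.0]})

# Equality with True only matches the value 1.0, because True == 1 in Python.
flag_eq_true = toy["invited_by_user_id"].apply(lambda x: bool(x == True))

# notna() marks every row that has an inviter id, whatever its value.
flag_notna = toy["invited_by_user_id"].notna()

print(flag_eq_true.tolist())  # [False, False, True, False]
print(flag_notna.tolist())    # [True, False, True, True]
```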

In [134]:
# parse creation_time as datetime and keep only the hour of day (as a string)
df['creation_time'] = pd.to_datetime(df.creation_time).dt.strftime('%H')
df.head(3)

Unnamed: 0,object_id,creation_time,name,email,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted,invited_by_user
0,1,3,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1,0,11,False,False
1,2,3,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,0,0,1,True,False
2,3,23,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,0,0,94,False,False


In [135]:
# cast creation_time to int so we can flag accounts created between 06:00 and 18:59 (hours 6 through 18)
df['creation_time'] = df['creation_time'].astype('int64')
df['created_6to18'] = df['creation_time'].between(6, 18) # inclusive on both ends
df.head(10)

Unnamed: 0,object_id,creation_time,name,email,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted,invited_by_user,created_6to18
0,1,3,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1,0,11,False,False,False
1,2,3,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,0,0,1,True,False,False
2,3,23,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,0,0,94,False,False,False
3,4,8,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,0,0,1,False,False,True
4,5,10,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,0,0,193,False,False,True
5,6,3,Cunha Eduardo,EduardoPereiraCunha@yahoo.com,GUEST_INVITE,0,0,197,False,False,False
6,7,13,Sewell Tyler,TylerSewell@jourrapide.com,SIGNUP,0,1,37,False,False,True
7,8,5,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,1,1,74,False,False,False
8,9,4,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,0,0,302,False,False,False
9,10,22,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,1,1,318,True,False,False


In [136]:
# check whether every name has at least two parts (first and last)
df['full_name'] = df['name'].apply(lambda x: len(x.split()) > 1)

In [137]:
# check for a plausible email: must contain @ and end with .com
df['valid_email'] = df['email'].apply(lambda x: x.endswith('.com') and '@' in x)

In [138]:
# names seem to appear in the email, so add a column 'full_name_in_email' that is True when both the first and last name are in the email
name_check_list=[]
for i in range(len(df)):
    first=df['name'][i].split()[0]
    last=df['name'][i].split()[-1]
    if (first in df['email'][i]) & (last in df['email'][i]):
        name_check_list.append(True)
    else:
        name_check_list.append(False)
df['full_name_in_email'] = name_check_list
df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted,invited_by_user,created_6to18,full_name,valid_email,full_name_in_email
0,1,3,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1,0,11,False,False,False,True,True,True
1,2,3,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,0,0,1,True,False,False,True,True,True
2,3,23,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,0,0,94,False,False,False,True,True,True
3,4,8,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,0,0,1,False,False,True,True,True,True
4,5,10,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,0,0,193,False,False,True,True,True,True


In [139]:
df['full_name'].value_counts()

True    12000
Name: full_name, dtype: int64

In [140]:
# looks like all names have at least 2 parts, so let's drop the column
df.drop('full_name', axis=1, inplace=True)

In [141]:
df['valid_email'].value_counts()

True     10798
False     1202
Name: valid_email, dtype: int64

In [142]:
df['full_name_in_email'].value_counts()

True     10248
False     1752
Name: full_name_in_email, dtype: int64

In [143]:
#let's take a look at org_id and see what can we do about it
df['org_id'].value_counts()

0      319
1      233
2      201
3      168
4      159
      ... 
396      9
400      8
397      8
386      7
416      2
Name: org_id, Length: 417, dtype: int64

In [144]:
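# add a column called org_size: the number of users that share the row's org_id
# caveat: value_counts() is ordered by count, so position i is generally not org_id i;
# appending i pairs org_id i with the size of the i-th largest org, whereas appending
# idx would pair each org_id with its own count
org_id_list=[]
org_size=[]
value_count = df['org_id'].value_counts() # computed once, outside the loop
for i in range(df.org_id.nunique()):
    idx = value_count.index[i] # the org_id at position i
    idx_count = value_count[idx] # the number of users in that org
    org_id_list.append(i)
    org_size.append(idx_count)
df['org_size'] = df['org_id'].replace(org_id_list, org_size)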
df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted,invited_by_user,created_6to18,valid_email,full_name_in_email,org_size
0,1,3,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1,0,11,False,False,False,True,True,87
1,2,3,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,0,0,1,True,False,False,True,True,233
2,3,23,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,0,0,94,False,False,False,True,True,31
3,4,8,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,0,0,1,False,False,True,True,True,233
4,5,10,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,0,0,193,False,False,True,True,True,23
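
Organization size can also be looked up row by row by mapping each `org_id` through its value counts, which avoids any dependence on the order of the counts. A minimal sketch on made-up data (`toy` and `org_counts` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"org_id": [7, 7, 7, 2, 2, 5]})

# value_counts() is indexed by org_id, so map() looks up each row's own org.
org_counts = toy["org_id"].value_counts()
toy["org_size"] = toy["org_id"].map(org_counts)

print(toy["org_size"].tolist())  # [3, 3, 3, 2, 2, 1]
```

`toy.groupby("org_id")["org_id"].transform("size")` is an equivalent one-liner.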


In [145]:
# take a look at creation_source
df['creation_source'].value_counts() # we'll dummy-encode it

ORG_INVITE            4254
GUEST_INVITE          2163
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: creation_source, dtype: int64

In [146]:
# remove unnecessary columns
df.drop(columns=['name','email','creation_time','org_id','object_id'], inplace=True)
df.head(3)

Unnamed: 0,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,adopted,invited_by_user,created_6to18,valid_email,full_name_in_email,org_size
0,GUEST_INVITE,1,0,False,False,False,True,True,87
1,ORG_INVITE,0,0,True,False,False,True,True,233
2,ORG_INVITE,0,0,False,False,False,True,True,31


Create dummy features for the creation_source

In [147]:
dfd= pd.DataFrame(df.drop('creation_source',axis =1)).merge(pd.get_dummies(df['creation_source']),left_index=True,right_index=True)
dfd.head(3)

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,adopted,invited_by_user,created_6to18,valid_email,full_name_in_email,org_size,GUEST_INVITE,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
0,1,0,False,False,False,True,True,87,1,0,0,0,0
1,0,0,True,False,False,True,True,233,0,1,0,0,0
2,0,0,False,False,False,True,True,31,0,1,0,0,0


In [148]:
# Create the X and y matrices from the dataframe, where y = dfd.adopted
X = dfd.drop('adopted', axis=1)
y = dfd.adopted

In [149]:
# apply standard scaler (note: it is fit on all rows before the split, so the test rows also shape the scaling statistics)
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

In [150]:
#Split the X_scaled and y into 75/25 training and testing data subsets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=.25, random_state=18)
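
A common alternative is to split first and fit the scaler on the training rows only, then reuse those statistics for the test rows. The sketch below uses synthetic data; `X_toy`, `y_toy` and the `_tr`/`_te` names are hypothetical, and `stratify` is an added choice that is not in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(18)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical features
y_toy = rng.integers(0, 2, size=200)                      # hypothetical labels

# Split first, so the test rows play no part in estimating the scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=18, stratify=y_toy
)

scaler = StandardScaler().fit(X_tr)   # means and variances from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

For tree-based models such as gradient boosting, scaling rarely changes the fit, but keeping the order split-then-scale avoids leaking test information into any model that does depend on it.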

In [151]:
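# grid search over learning rate, number of estimators, and tree depth
# note: valid_score is measured on X_test, so the test set is also used for model selection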
learn_rate_list = []
n_est_list=[]
max_dep_list=[]
train_score=[]
valid_score=[]
for learn_rate in [0.05, 0.1, 0.25, 0.5, 0.75, 1]:
    for estimator in [25,50,100,200]:
        for depth in [2,3,4,5]:
            gb = GradientBoostingClassifier(n_estimators=estimator, learning_rate=learn_rate, max_depth=depth, random_state = 18)
            gb.fit(X_train, y_train)
            learn_rate_list.append(learn_rate)
            n_est_list.append(estimator)
            max_dep_list.append(depth)
            train_score.append(gb.score(X_train, y_train))
            valid_score.append(gb.score(X_test, y_test))
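
The same kind of search can be expressed with scikit-learn's `GridSearchCV`, which cross-validates inside the training set and leaves the test set for one final check. This is a sketch on synthetic data with a deliberately small grid; the data, the grid values, and the choice of `scoring="f1"` are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical synthetic data, imbalanced roughly 80/20.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=18
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=18, stratify=y_toy
)

# A small grid; the values are illustrative, not tuned.
param_grid = {
    "learning_rate": [0.05, 0.25],
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=18),
    param_grid,
    cv=3,           # 3-fold cross-validation inside the training set
    scoring="f1",   # F1 of the positive class instead of accuracy
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # test set touched once, after selection
```

Scoring on F1 rather than accuracy is one way to keep an imbalanced target from rewarding a model that predicts only the majority class.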

In [152]:
gb_df = pd.DataFrame({
    'n_estimator':n_est_list,
    'max_depth': max_dep_list,
    'learning_rate': learn_rate_list,
    'train_score': train_score,
    'validation_score': valid_score,
})
gb_df.head()

Unnamed: 0,n_estimator,max_depth,learning_rate,train_score,validation_score
0,25,2,0.05,0.8144,0.8073
1,25,3,0.05,0.8144,0.8073
2,25,4,0.05,0.8144,0.8073
3,25,5,0.05,0.8144,0.8073
4,50,2,0.05,0.8144,0.8073


In [153]:
# display the top 10 validation scores
display(gb_df.sort_values('validation_score', ascending=False).head(10))

Unnamed: 0,n_estimator,max_depth,learning_rate,train_score,validation_score
84,50,2,1.0,0.8147,0.8087
68,50,2,0.75,0.8147,0.8087
64,25,2,0.75,0.8147,0.8087
52,50,2,0.5,0.8147,0.8083
56,100,2,0.5,0.8147,0.8083
80,25,2,1.0,0.8148,0.8083
72,100,2,0.75,0.8147,0.8083
44,200,2,0.25,0.8147,0.8083
45,200,3,0.25,0.8163,0.8083
33,25,3,0.25,0.8147,0.808


In [154]:
# looks like top scores are around 0.80, we want train_score and validation_score to be as close as possible
# let's add score_diff column that equals train_score - validation score then subset the score df
gb_df['score_diff'] = abs(gb_df.train_score - gb_df.validation_score)
gb_df[gb_df['validation_score'] >= 0.80].sort_values('score_diff').head()

Unnamed: 0,n_estimator,max_depth,learning_rate,train_score,validation_score,score_diff
84,50,2,1.0,0.8147,0.8087,0.006
68,50,2,0.75,0.8147,0.8087,0.006
64,25,2,0.75,0.8147,0.8087,0.006
52,50,2,0.5,0.8147,0.8083,0.0063
72,100,2,0.75,0.8147,0.8083,0.0063


We will pick the parameters from row 64 of gb_df (n_estimators=25, learning_rate=0.75, max_depth=2). It uses the fewest estimators among the top rows, so it should train the fastest while scoring about the same.

In [155]:
# using the params from row 64 of gb_df
gb = GradientBoostingClassifier(n_estimators=25, learning_rate = 0.75, max_depth = 2, random_state = 18)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
# print the classification report
print(classification_report(y_test, y_pred))
# confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('True Positive:', tp, '\n'
     'False Positive :', fp, '\n'
     'True Negative:', tn, '\n'
     'False Negative:', fn)

              precision    recall  f1-score   support

       False       0.81      1.00      0.89      2422
        True       1.00      0.01      0.01       578

    accuracy                           0.81      3000
   macro avg       0.90      0.50      0.45      3000
weighted avg       0.85      0.81      0.72      3000

True Positive: 4 
False Positive : 0 
True Negative: 2422 
False Negative: 574


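The overall accuracy is 0.81, but that is barely above the 0.807 obtained by predicting False for every user (2422 of the 3000 test users are not adopted). Recall for the adopted class is 0.01: the model finds only 4 of the 578 adopted users, so it is not yet a useful classifier for adoption.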
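
The headline figures can be recomputed from the confusion-matrix counts printed above, alongside the accuracy of always predicting False. The variable names below are only for illustration.

```python
# Counts copied from the confusion matrix printed above.
tp, fp, tn, fn = 4, 0, 2422, 574
total = tp + fp + tn + fn

accuracy = (tp + tn) / total
baseline = (tn + fp) / total      # accuracy of always predicting False
recall_true = tp / (tp + fn)      # share of adopted users that were found
precision_true = tp / (tp + fp)   # share of predicted adopters that were right

print(f"accuracy={accuracy:.4f} baseline={baseline:.4f} "
      f"recall={recall_true:.4f} precision={precision_true:.4f}")
# accuracy=0.8087 baseline=0.8073 recall=0.0069 precision=1.0000
```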

In [156]:
# let's find out how important each feature was
ft_df = pd.DataFrame({
    'Feature':X.columns,
    'feature_importance':gb.feature_importances_
})
ft_df.set_index('Feature', inplace=True)
ft_df.sort_values('feature_importance', ascending=False)

Unnamed: 0_level_0,feature_importance
Feature,Unnamed: 1_level_1
org_size,0.4865
PERSONAL_PROJECTS,0.2536
SIGNUP_GOOGLE_AUTH,0.0798
GUEST_INVITE,0.0553
SIGNUP,0.0407
created_6to18,0.0298
full_name_in_email,0.0276
valid_email,0.0138
enabled_for_marketing_drip,0.007
ORG_INVITE,0.0036


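Organization size was the most important feature (0.49), followed by whether the account was created for personal projects (PERSONAL_PROJECTS, 0.25). Because the model rarely predicts adoption at all, these importances should be read with caution.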