Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

We suggest spending 1-2 hours on this, but you're welcome to spend more or less. Please send us a brief writeup of your findings (the more concise, the better - no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.

# Import and read csv files

In [17]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
takehome_users = pd.read_csv('./takehome_users.csv',encoding="ISO-8859-1")
takehome_user_engagement = pd.read_csv('./takehome_user_engagement.csv',parse_dates=["time_stamp"])

The user table ("takehome_users") holds data on 12,000 users who signed up for the product in the last two years. It includes:

* name: the user's name
* object_id: the user's id
* email: email address
* creation_source: how their account was created. This takes on one of 5 values:
    * PERSONAL_PROJECTS: invited to join another user's personal workspace
    * GUEST_INVITE: invited to an organization as a guest (limited permissions)
    * ORG_INVITE: invited to an organization (as a full member)
    * SIGNUP: signed up via the website
    * SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)
* creation_time: when they created their account
* last_session_creation_time: Unix timestamp of last login
* opted_in_to_mailing_list: whether they have opted into receiving marketing emails
* enabled_for_marketing_drip: whether they are on the regular marketing email drip
* org_id: the organization (group of users) they belong to
* invited_by_user_id: which user invited them to join (if applicable).
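The column list above mixes two time formats: creation_time is a date string and last_session_creation_time is a Unix timestamp. A minimal sketch of bringing both to pandas datetimes follows. It assumes the timestamp is in seconds (consistent with values around 1.4e9), and the three rows are made up for illustration rather than read from the CSV.

```python
import pandas as pd

# Made-up rows shaped like takehome_users; the values are illustrative only.
users = pd.DataFrame({
    "object_id": [1, 2, 3],
    "creation_time": ["2014-04-22 03:53:30", "2013-11-15 03:45:04", "2013-03-19 23:14:52"],
    "last_session_creation_time": [1398138810.0, 1396237504.0, None],
})

# Parse the date strings into datetime64 values.
users["creation_time"] = pd.to_datetime(users["creation_time"])

# Interpret the floats as seconds since the Unix epoch (assumption); missing values become NaT.
users["last_session_creation_time"] = pd.to_datetime(
    users["last_session_creation_time"], unit="s"
)

print(users.dtypes)
```

With both columns as datetimes, differences between them can be taken directly, which may be useful if account age or recency is later considered as a factor.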

# Data analysis and cleaning

In [3]:
takehome_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
len(takehome_users['object_id'].unique())


12000

In [5]:
len(takehome_users['name'].unique())

11355

In [6]:
len(takehome_user_engagement['user_id'].unique())

8823

The usage summary table ("takehome_user_engagement") has one row for each day that a user logged into the product.

In [7]:
takehome_user_engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [9]:
#review data in table
takehome_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [10]:
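# Drop rows where last_session_creation_time is null, keeping all other columns
# (selecting the column before dropna() would replace the DataFrame with a single Series)
takehome_users = takehome_users.dropna(subset=['last_session_creation_time'])
len(takehome_users)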

8823
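The 8,823 rows with a last session time match the 8,823 distinct user_id values found in the engagement table, which suggests (but does not prove) that the users with a null last_session_creation_time are exactly those who never logged in. A minimal sketch of how that could be checked is below; the two small frames are made-up stand-ins for the real tables.

```python
import pandas as pd

# Made-up stand-ins for takehome_users and takehome_user_engagement.
users = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "last_session_creation_time": [1398138810.0, None, 1396237504.0, None],
})
engagement = pd.DataFrame({"user_id": [1, 1, 3]})

# Users that never appear in the engagement table.
engaged_ids = set(engagement["user_id"])
never_logged_in = users.loc[~users["object_id"].isin(engaged_ids), "object_id"]

# Users with no recorded last session.
missing_last_session = users.loc[
    users["last_session_creation_time"].isna(), "object_id"
]

# True when both groups contain exactly the same user ids.
same_users = set(never_logged_in) == set(missing_last_session)
print(same_users)
```

If the two groups coincide on the real data, dropping the null rows removes only users who cannot be adopted by definition, which is worth stating in the writeup because it changes the base rate.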

In [11]:
#review data in table - no null values
takehome_user_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null datetime64[ns]
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 4.8 MB


In [21]:
# Floor each timestamp to midnight so every login maps to a calendar day,
# then store it as int64 nanoseconds since the epoch (rolling windows need numeric data)
takehome_user_engagement['time_stamp'] = takehome_user_engagement['time_stamp'].dt.floor('d').astype(np.int64)

# Sort by user and day, and drop repeated (user, day) rows so each login day counts once
takehome_user_engagement = takehome_user_engagement.sort_values(['user_id', 'time_stamp']).drop_duplicates(subset=['user_id', 'time_stamp'])


In [27]:
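# For each user, slide a window over every 3 consecutive login days
# (rows are already sorted and de-duplicated per user)
a = takehome_user_engagement.groupby('user_id')['time_stamp'].rolling(window=3)

# Span in days between the first and last login day of each window;
# NaN until a user has at least 3 login days
b = pd.to_timedelta(a.max() - a.min()).dt.days
print(b.head())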

user_id   
1        0     NaN
2        1     NaN
         2     NaN
         3    24.0
         4    26.0
Name: time_stamp, dtype: float64
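To make the rolling-window output easier to read, here is a small worked example for a single hypothetical user. It uses whole day numbers instead of nanoseconds, which is a simplification of the cell above, and the four dates are invented.

```python
import pandas as pd

# Invented login days for one hypothetical user, already floored and de-duplicated.
days = pd.Series(
    pd.to_datetime(["2014-01-01", "2014-01-03", "2014-01-06", "2014-01-20"])
)

# Whole days since the Unix epoch, so the rolling window works on plain integers.
day_number = (days - pd.Timestamp("1970-01-01")).dt.days

# For each row, look at that login day and the two before it.
window = day_number.rolling(window=3)

# Days between the earliest and latest login in each 3-row window.
span_days = window.max() - window.min()
print(span_days)
```

The first two rows are NaN because fewer than three login days are available. The third row covers January 1 to January 6 (a span of 5 days), and the fourth covers January 3 to January 20 (a span of 17 days), so only the third window would fall inside a seven-day period.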


In [36]:
print(b[:10])

user_id   
1        0     NaN
2        1     NaN
         2     NaN
         3    24.0
         4    26.0
         5    22.0
         6    14.0
         7    34.0
         8    31.0
         9     6.0
Name: time_stamp, dtype: float64
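The same span can be computed without converting timestamps to integers by comparing each login day with the login day two rows earlier for the same user. The sketch below is one possible alternative, not the method used in this notebook. The function name find_adopted_users and its parameters are hypothetical, the toy data is invented, and it assumes a seven-day period means the first and last login day are at most 6 days apart.

```python
import pandas as pd

def find_adopted_users(engagement, min_days=3, window_days=7):
    """Return sorted user ids with `min_days` distinct login days
    inside at least one `window_days`-day period (hypothetical helper)."""
    # One row per (user, calendar day), ordered by user and day.
    logins = pd.DataFrame({
        "user_id": engagement["user_id"],
        "day": engagement["time_stamp"].dt.floor("D"),
    }).drop_duplicates().sort_values(["user_id", "day"])

    # Login day (min_days - 1) rows earlier for the same user; NaT if there is none.
    earlier_day = logins.groupby("user_id")["day"].shift(min_days - 1)

    # Days between that earlier login and the current one.
    span_days = (logins["day"] - earlier_day).dt.days

    # Assumption: a span of 6 days or less fits inside a seven-day period.
    adopted = logins.loc[span_days < window_days, "user_id"].unique()
    return sorted(adopted.tolist())

# Invented engagement rows for three hypothetical users.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time_stamp": pd.to_datetime([
        "2014-01-01 09:00", "2014-01-03 10:00", "2014-01-07 23:00",  # 3 days, 6 apart
        "2014-01-01 09:00", "2014-01-05 09:00", "2014-01-09 09:00",  # 3 days, 8 apart
        "2014-01-01 08:00", "2014-01-01 20:00", "2014-01-02 09:00",  # only 2 distinct days
    ]),
})

adopted = find_adopted_users(toy)
print(adopted)
```

Only the first toy user qualifies: the second spreads three logins over more than seven days, and the third has two logins on the same day, which count once.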


In [30]:
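# Three login days fit inside a seven-day period when the first and last are at most 6 days apart.
# unique() keeps each user once even if several of their windows qualify.
adopted_users = b[b < 7].index.get_level_values('user_id').unique().tolist()
print(adopted_users)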

In [26]:
len(c)

6677
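The brief asks which factors predict adoption, so one natural next step is to attach the adoption label to the user table and compare adoption rates across a candidate factor such as creation_source. The sketch below shows the mechanics only. The rows and the list of adopted ids are invented, and the resulting rates say nothing about the real data.

```python
import pandas as pd

# Invented rows shaped like takehome_users.
users = pd.DataFrame({
    "object_id": [1, 2, 3, 4, 5, 6],
    "creation_source": [
        "GUEST_INVITE", "ORG_INVITE", "ORG_INVITE",
        "SIGNUP", "SIGNUP", "GUEST_INVITE",
    ],
})

# Invented list of adopted user ids, standing in for the list computed above.
adopted_ids = [2, 3, 6]

# Boolean label: True when the user id appears in the adopted list.
users["adopted"] = users["object_id"].isin(adopted_ids)

# Share of adopted users within each creation source.
adoption_rate = users.groupby("creation_source")["adopted"].mean()
print(adoption_rate)
```

The same groupby can be repeated for opted_in_to_mailing_list, enabled_for_marketing_drip, or whether invited_by_user_id is present, before moving on to a model if one is needed.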