# <font color='black'>Question 1 - Use the user_message dataset</font>

### <font color='crimson'>1. Write a function that calculates the total number of content a user had created over the last year and report the users who have greater than 500 pieces of content created.</font>

In [1]:
# import libraries
import numpy as np
import pandas as pd
import itertools
import math
import plotly
import plotly.graph_objs as go
import plotly.offline as plt
import plotly.figure_factory as ff
plt.init_notebook_mode(connected=True)

In [2]:
# dataset path
excel_file = "../spreadsheets/Data science take home Datasets.xlsx"

# loading the sheet 'user_message' into dataframe
user_message_df = pd.read_excel(excel_file, "user_message")

<h3> <font color='crimson'>Content created by each user</font> </h3>

In [3]:
# group by user_id and sum the content count to obtain total content created by each user
content_by_each_user = user_message_df.groupby(['user_id'])['content_count'].sum()
print("Total content created by each user over the last year (2015): ")
content_by_each_user

Total content created by each user over the last year (2015): 


user_id
20         87
134        94
635       252
950       122
1034       83
2052      640
2434      526
2442      117
2474      105
2487      223
2495      186
2499       81
3063       87
3195      211
3532      523
3723       25
3734      123
3924      686
4062        9
4249      279
4309      281
4324       15
4377      180
4527      135
4609      175
4711      214
4809      112
5307       45
5639      268
5651       71
         ... 
152515     47
154372     39
154379    189
165262    383
167651     47
168874     48
188773    140
189813     42
190103     96
190423    119
190993    161
191609     87
193221     90
194611    408
210992    294
211062     92
211124     49
211224     31
211312    119
211350     96
211387    244
211558    213
211605    290
211659    102
211735    128
211744     94
211797    250
211809     73
211828     80
211942    195
Name: content_count, Length: 257, dtype: int64

In [4]:
# displaying total content by each user using a scatter plot
data = [go.Scatter(x=content_by_each_user.index.map(lambda x: "|"+str(x)+"|"),
                   y=content_by_each_user.values, mode='markers')]
layout = go.Layout(title='<b>Total content by each user</b>', width=1200,
                   xaxis={'title': "UserId"}, yaxis={'title': "Total number of content"})
plot = go.Figure(data=data, layout=layout)
plt.iplot(plot)

<h3> <font color='crimson'>Users with total number of content pieces greater than 500</font> </h3>

In [5]:
# filter and sort the users with total content greater than 500
content_pieces_500 = content_by_each_user[content_by_each_user > 500].sort_values(ascending=False)
print("Users who have created more than 500 content pieces over the last year (2015): ")
content_pieces_500

Users who have created more than 500 content pieces over the last year (2015): 


user_id
9484     1163
9676      722
12116     688
3924      686
2052      640
10878     601
5999      566
8962      551
11578     544
17616     526
2434      526
3532      523
10530     521
11271     503
Name: content_count, dtype: int64
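
The question asks for a function, so the two steps above can be wrapped into one. This is a minimal sketch, not the notebook's original code: the function name `total_content_by_user`, the `min_total` parameter, and the toy data are assumptions made for illustration. Only the `user_id` and `content_count` column names come from the dataset.

```python
import pandas as pd


def total_content_by_user(df, min_total=None):
    """Sum content_count per user_id; optionally keep only totals above min_total."""
    totals = df.groupby("user_id")["content_count"].sum()
    if min_total is not None:
        totals = totals[totals > min_total]
    return totals.sort_values(ascending=False)


# Toy data standing in for the user_message sheet (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "content_count": [300, 250, 100, 50, 501],
})

heavy_users = total_content_by_user(toy, min_total=500)
print(heavy_users)
```

With the real data, `total_content_by_user(user_message_df, min_total=500)` should reproduce the listing above.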

In [6]:
# displaying users with total content greater than 500 using a bar chart
data = [go.Bar(x=content_pieces_500.index.map(lambda x: "|"+str(x)+"|"),
               y=content_pieces_500.values)]
layout = go.Layout(title='<b>Users with total number of content pieces > 500</b>',
                   xaxis={'title': "UserId"}, yaxis={'title': "Total number of content"})
plot = go.Figure(data=data, layout=layout)
plt.iplot(plot)

### <font color='crimson'>2. Define a metric and a corresponding function that determines which are the fastest growing users in terms of positive customer engagement over the last year. Report the top 10 users based on the metric that defines “fastest growing user”.</font>

<h3> <font color='crimson'>Average Log Growth Rate (ALGR)</font> </h3>
<p> We define the "Average Log Growth Rate" (ALGR) as the metric for identifying the fastest growing users in terms of positive customer engagement. If a user's engagement on two consecutive content-creation days is x<sub>i</sub> and x<sub>i+1</sub>, the growth rate between them is r<sub>i</sub> = (x<sub>i+1</sub> - x<sub>i</sub>) / x<sub>i</sub>. A user who created content on n different days in the year, with engagement values x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, ... x<sub>n</sub> in chronological order, therefore has n-1 growth rates r<sub>1</sub>, r<sub>2</sub>, r<sub>3</sub>, ... r<sub>n-1</sub>. The average log growth rate is calculated using the following formula: </p>
<p><font color='crimson'>
\begin{equation*}
ALGR = \frac{1}{n-1}\sum_{i=1}^{n-1} \log_{10}(1 + r_{i})
\end{equation*}
</font></p>
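<p> A short derivation shows what this metric measures. It assumes the growth rate is r<sub>i</sub> = (x<sub>i+1</sub> - x<sub>i</sub>) / x<sub>i</sub> (the "(final - initial) / initial" used in the code), that every x<sub>i</sub> is positive, and that the log is base 10 as in the code's <code>math.log10</code>. Then 1 + r<sub>i</sub> = x<sub>i+1</sub> / x<sub>i</sub>, and the sum telescopes: </p>
<p>
\begin{equation*}
ALGR = \frac{1}{n-1}\sum_{i=1}^{n-1} \log_{10}\frac{x_{i+1}}{x_{i}} = \frac{1}{n-1}\log_{10}\frac{x_{n}}{x_{1}}
\end{equation*}
</p>
<p> Under these assumptions, ALGR depends only on the first and last engagement values and on the number of observations. </p>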

<h3><font color='crimson'>Reasoning</font></h3>
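<p> A plain average of growth rates rewards gains more than it penalizes losses: doubling gives a rate of +1, while halving gives only -0.5. Taking the log removes this bias. We add 1 to the rate so that the argument of the log, 1 + rate = final/initial, is always positive (zeros in the data are replaced by 1 so that this holds). For example, if positive customer engagement doubles from 8 to 16, the log growth is log<sub>10</sub>(1+1) ≈ 0.30, and if it halves from 16 to 8, it is log<sub>10</sub>(1-0.5) ≈ -0.30, so the two moves cancel. For rates of equal magnitude, a decline is penalized more than a rise is rewarded: a rate of +0.5 gives about 0.18, while -0.5 gives about -0.30. Since we are trying to find growing users, this asymmetry is what we want.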
</p> 
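
<p> The effect of the log can be checked numerically. The sketch below assumes base-10 logs (as in the code's <code>math.log10</code>) and a hypothetical helper <code>log_growth</code> that is not part of the notebook. </p>

```python
import math


def log_growth(initial, final):
    """Return log10(1 + r), where r = (final - initial) / initial."""
    rate = (final - initial) / initial
    return math.log10(1 + rate)


up = log_growth(8, 16)    # engagement doubles
down = log_growth(16, 8)  # engagement halves
print(round(up, 3), round(down, 3))      # 0.301 -0.301

# Equal-magnitude rates are not treated equally: the decline weighs more.
gain = math.log10(1 + 0.5)
loss = math.log10(1 - 0.5)
print(round(gain, 3), round(loss, 3))    # 0.176 -0.301
```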

<h3><font color='crimson'>Assumptions</font></h3>
<p> We consider a user's positive customer engagement only on the days that user created content, so growth is not measured across days on which no content was created.
</p>

In [7]:
# creating a multilevel index to access positive customer engagement using user_id and content_created_date
pce_by_each_user = user_message_df.set_index(['user_id', 'content_created_date'])['total_engagement']
pce_by_each_user.head()

user_id  content_created_date
20       2015-01-01               52
         2015-01-02               72
         2015-01-10               83
         2015-01-12               45
         2015-01-16              102
Name: total_engagement, dtype: int64

In [8]:
# loading each user_id and the corresponding metric (ALGR) into a series
growing_users = pd.Series(dtype=float)

# iterate over each user and that user's engagement values
for user_id, udf in pce_by_each_user.groupby(level=0):

    # metric is set to zero if the user created content only once in the entire year
    if len(udf) == 1:
        growing_users.loc[user_id] = 0
        continue

    # replacing all the 0's with 1's so that log(1 + rate) is always defined
    # (assigning the result avoids modifying a group slice in place)
    udf = udf.replace({0: 1})

    # calculating rate as (final - initial) / initial; the first entry is NaN
    rate_df = (udf - udf.shift(1)) / udf.shift(1)
    rate_df = rate_df.apply(lambda rate: math.log10(rate + 1))

    # sum() skips the leading NaN, so dividing by n-1 gives the average log growth rate
    growing_users.loc[user_id] = rate_df.sum()/(len(udf)-1)
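
Since the question asks for a corresponding function, the loop can also be written as a function applied per user. This is a minimal sketch, not the notebook's original code: the name `average_log_growth_rate` and the toy data are assumptions made for illustration, and the rows are sorted by date first on the assumption that the sheet may not already be in chronological order.

```python
import numpy as np
import pandas as pd


def average_log_growth_rate(engagement):
    """ALGR for one user's chronologically ordered engagement values."""
    # Replace zeros with ones so every ratio final/initial is defined and positive.
    values = pd.Series(engagement, dtype=float).replace(0, 1).to_numpy()
    if len(values) < 2:
        return 0.0  # a single observation carries no growth information
    ratios = values[1:] / values[:-1]  # equals 1 + rate
    return float(np.log10(ratios).mean())


# Toy data standing in for the user_message sheet (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "content_created_date": pd.to_datetime([
        "2015-01-01", "2015-01-05", "2015-02-01",
        "2015-03-01", "2015-01-02", "2015-01-03",
    ]),
    "total_engagement": [10, 100, 1000, 7, 100, 10],
})

algr = (toy.sort_values("content_created_date")
           .groupby("user_id")["total_engagement"]
           .apply(average_log_growth_rate)
           .sort_values(ascending=False))
print(algr)
```

In the toy data, user 1 grows tenfold twice (ALGR of 1), user 2 has a single observation (ALGR of 0), and user 3 drops tenfold (ALGR of -1).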

<h3> <font color='crimson'>Top 10 fast growing users</font> </h3>

In [9]:
# sort the series by metric(ALGR) in descending order
growing_users.sort_values(ascending=False, inplace=True)
print("Top 10 fast growing users and their corresponding metric(ALGR) are: ")
growing_users.iloc[:10]

Top 10 fast growing users and their corresponding metric(ALGR) are: 


15368     0.528274
152475    0.093704
13817     0.087146
11767     0.077894
17594     0.075257
7752      0.039760
11741     0.036975
15249     0.030348
10701     0.029447
168874    0.021116
dtype: float64

In [10]:
# displaying top 10 fast growing users with positive customer engagement and their respective ALGR
data = [go.Bar(x=growing_users.iloc[:10].index.map(lambda x: "|"+str(x)+"|"),
               y=growing_users.iloc[:10].values)]
layout = go.Layout(title='<b>Top 10 fast growing users</b>',
                   xaxis={'title': "UserId"}, yaxis={'title': "Metric (ALGR)"})
plot = go.Figure(data=data, layout=layout)
plt.iplot(plot)