# Challenge Description

Company XYZ is a very early stage startup. They allow people to stream music from their mobile
for free. Right now, they still only have songs from the Beatles in their music collection, but they
are planning to expand soon.
They still have all their data in json files and they are interested in getting some basic info about
their users as well as building a very preliminary song recommendation model in order to
increase user engagement.
Working with json files is important. If you join a very early stage start-up, they might not have a
nice database and all data will be in jsons. Third party data are often stored in json files as well.

### My Goal

1. What are the top 3 and the bottom 3 states in terms of number of users?
2. What are the top 3 and the bottom 3 states in terms of user engagement? You can
choose how to mathematically define user engagement. What the CEO cares about here
is in which states users are using the product a lot/very little.
3. The CEO wants to send a gift to the first user who signed-up for each state. That is, the
first user who signed-up from California, from Oregon, etc. Can you give him a list of
those users?
4. Build a function that takes as an input any of the songs in the data and returns the most
likely song to be listened next. That is, if, for instance, a user is currently listening to
"Eight Days A Week", which song has the highest probability of being played right after it
by the same user? This is going to be v1 of a song recommendation model.
5. How would you set up a test to check whether your model works well and is improving
engagement?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import warnings
warnings.simplefilter('ignore')

In [2]:
data = pd.read_json('day8.json',convert_dates = ['user_sign_up_date','time_played'])

In [3]:
data.head(5)

Unnamed: 0,id,user_id,user_state,user_sign_up_date,song_played,time_played
0,GOQMMKSQQH,122,Louisiana,2015-05-16,Hey Jude,2015-06-11 21:51:35
1,HWKKBQKNWI,3,Ohio,2015-05-01,We Can Work It Out,2015-06-06 16:49:19
2,DKQSXVNJDH,35,New Jersey,2015-05-04,Back In the U.S.S.R.,2015-06-14 02:11:29
3,HLHRIDQTUW,126,Illinois,2015-05-16,P.s. I Love You,2015-06-08 12:26:10
4,SUKJCSBCYW,6,New Jersey,2015-05-01,Sgt. Pepper's Lonely Hearts Club Band,2015-06-28 14:57:00


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 6 columns):
id                   4000 non-null object
user_id              4000 non-null int64
user_state           4000 non-null object
user_sign_up_date    4000 non-null datetime64[ns]
song_played          4000 non-null object
time_played          4000 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(3)
memory usage: 187.6+ KB


In [5]:
data.describe()

Unnamed: 0,user_id
count,4000.0
mean,101.574
std,58.766835
min,1.0
25%,48.0
50%,102.0
75%,155.0
max,200.0


### EDA

In [6]:
print(data.groupby('user_state').count().sort_values('id')[:3])
print(data.groupby('user_state').count().sort_values('id')[-3:])

             id  user_id  user_sign_up_date  song_played  time_played
user_state                                                           
Kansas        8        8                  8            8            8
Connecticut  16       16                 16           16           16
New Mexico   17       17                 17           17           17
             id  user_id  user_sign_up_date  song_played  time_played
user_state                                                           
Texas       230      230                230          230          230
California  425      425                425          425          425
New York    469      469                469          469          469


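#### Ranked by the number of rows (play records) per state, the top 3 states are New York, California and Texas.
#### The bottom 3 states are Kansas, Connecticut and New Mexico.
#### Note that `count()` after `groupby` counts play records rather than distinct users, so this ranking is only a proxy for the number of users; counting unique `user_id` values per state would answer the question directly.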
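A minimal sketch of counting distinct users per state with `nunique()`. The `toy` frame is made-up data for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real dataset
toy = pd.DataFrame({
    'user_id':    [1, 1, 2, 3, 3, 3, 4],
    'user_state': ['Ohio', 'Ohio', 'Ohio', 'Texas', 'Texas', 'Texas', 'Kansas'],
})

# Number of distinct users per state, smallest first
users_per_state = toy.groupby('user_state')['user_id'].nunique().sort_values()

bottom = users_per_state.head(3)   # states with the fewest users
top = users_per_state.tail(3)      # states with the most users
print(users_per_state)
```

In the toy data Texas has three play records but only one user, which is exactly the case where row counts and user counts disagree.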

#### To analyze user engagement, we define engagement as the number of songs a user has played divided by the number of days since they registered their account. This shows how frequently users use the app to listen to music, which reflects the real value of the product.

In [7]:
data.head()

Unnamed: 0,id,user_id,user_state,user_sign_up_date,song_played,time_played
0,GOQMMKSQQH,122,Louisiana,2015-05-16,Hey Jude,2015-06-11 21:51:35
1,HWKKBQKNWI,3,Ohio,2015-05-01,We Can Work It Out,2015-06-06 16:49:19
2,DKQSXVNJDH,35,New Jersey,2015-05-04,Back In the U.S.S.R.,2015-06-14 02:11:29
3,HLHRIDQTUW,126,Illinois,2015-05-16,P.s. I Love You,2015-06-08 12:26:10
4,SUKJCSBCYW,6,New Jersey,2015-05-01,Sgt. Pepper's Lonely Hearts Club Band,2015-06-28 14:57:00


In [8]:
# Days since sign-up, measured from the date this cell is run (the values grow with the run date)
data['have_registered_dates'] = (pd.to_datetime('today') - data['user_sign_up_date']).dt.days
data.head(5)

Unnamed: 0,id,user_id,user_state,user_sign_up_date,song_played,time_played,have_registered_dates
0,GOQMMKSQQH,122,Louisiana,2015-05-16,Hey Jude,2015-06-11 21:51:35,1695
1,HWKKBQKNWI,3,Ohio,2015-05-01,We Can Work It Out,2015-06-06 16:49:19,1710
2,DKQSXVNJDH,35,New Jersey,2015-05-04,Back In the U.S.S.R.,2015-06-14 02:11:29,1707
3,HLHRIDQTUW,126,Illinois,2015-05-16,P.s. I Love You,2015-06-08 12:26:10,1695
4,SUKJCSBCYW,6,New Jersey,2015-05-01,Sgt. Pepper's Lonely Hearts Club Band,2015-06-28 14:57:00,1710


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 7 columns):
id                       4000 non-null object
user_id                  4000 non-null int64
user_state               4000 non-null object
user_sign_up_date        4000 non-null datetime64[ns]
song_played              4000 non-null object
time_played              4000 non-null datetime64[ns]
have_registered_dates    4000 non-null int64
dtypes: datetime64[ns](2), int64(2), object(3)
memory usage: 218.9+ KB


In [10]:
frequency = data[['user_id','user_state','time_played']].groupby(['user_id','user_state']).count().reset_index()

In [11]:
frequency = pd.merge(frequency,data[['user_id','user_state','have_registered_dates']],how = 'left',on=['user_id','user_state'])
frequency = frequency.drop_duplicates()
frequency.head(5)

Unnamed: 0,user_id,user_state,time_played,have_registered_dates
0,1,Oregon,10,1710
10,2,North Carolina,18,1710
28,3,Ohio,18,1710
46,4,New Mexico,17,1710
63,5,Alabama,21,1710


In [16]:
state_frequency = frequency.groupby(['user_state']).sum().reset_index()
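# drop() returns a new frame; the result is not assigned here, so the summed user_id column (which has no meaning) still appears in the output
state_frequency.drop('user_id',axis = 1)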
state_frequency['frequency'] = state_frequency['time_played'] / state_frequency['have_registered_dates']
state_frequency.sort_values('frequency',ascending = False).head(5)

Unnamed: 0,user_state,user_id,time_played,have_registered_dates,frequency
22,Nebraska,134,36,1695,0.021239
1,Alaska,301,58,3390,0.017109
33,South Carolina,254,85,5102,0.01666
20,Mississippi,147,85,5117,0.016611
32,Rhode Island,174,27,1692,0.015957


In [17]:
state_frequency.sort_values('frequency',ascending = True).head(5)

Unnamed: 0,user_state,user_id,time_played,have_registered_dates,frequency
13,Kansas,177,8,1692,0.004728
37,Virginia,286,17,3387,0.005019
19,Minnesota,234,42,6817,0.006161
39,West Virginia,404,38,5088,0.007469
11,Indiana,517,55,6781,0.008111


#### Based on the metric defined here, the top 3 states by user engagement are Nebraska, Alaska and South Carolina, and the bottom 3 states are Kansas, Virginia and Minnesota.
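The day counts used above are measured from the date the notebook is run, so the ratios shrink as time passes. As a sketch of one alternative, assuming account age is measured up to the last play in the data instead, the same plays-per-day metric can be computed as follows. The `toy` frame and the names `reference_date`, `days_active` and `plays_per_user_day` are illustrative, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real dataset
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'user_state': ['Ohio', 'Ohio', 'Ohio', 'Texas'],
    'user_sign_up_date': pd.to_datetime(
        ['2015-05-01', '2015-05-01', '2015-05-11', '2015-05-21']),
    'time_played': pd.to_datetime(
        ['2015-05-10 08:00', '2015-05-31 12:00', '2015-05-20 09:00', '2015-05-30 10:00']),
})

# Assumption: account age is measured up to the day of the last play in the data
reference_date = toy['time_played'].max().normalize()

# One row per user with the number of plays
per_user = (toy.groupby(['user_id', 'user_state', 'user_sign_up_date'])
               .size()
               .reset_index(name='plays'))
per_user['days_active'] = (reference_date - per_user['user_sign_up_date']).dt.days

# Total plays divided by total user-days in each state
per_state = per_user.groupby('user_state')[['plays', 'days_active']].sum()
per_state['plays_per_user_day'] = per_state['plays'] / per_state['days_active']
print(per_state.sort_values('plays_per_user_day', ascending=False))
```

Fixing the reference date makes the ranking reproducible no matter when the analysis is rerun.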


In [23]:
# looking for the first sign up date for each state
earliest_state = data[['user_state','user_sign_up_date']].groupby('user_state').min()
earliest_state.head(5)

Unnamed: 0_level_0,user_sign_up_date
user_state,Unnamed: 1_level_1
Alabama,2015-05-01
Alaska,2015-05-12
Arizona,2015-05-12
Arkansas,2015-05-08
California,2015-05-04


In [27]:
earliest_state

Unnamed: 0_level_0,user_sign_up_date
user_state,Unnamed: 1_level_1
Alabama,2015-05-01
Alaska,2015-05-12
Arizona,2015-05-12
Arkansas,2015-05-08
California,2015-05-04
Colorado,2015-05-19
Connecticut,2015-05-16
Florida,2015-05-04
Georgia,2015-05-02
Idaho,2015-05-19


In [34]:
earliest_list = pd.merge(earliest_state, data[['user_id','user_state','user_sign_up_date']], how = 'left', on=['user_state','user_sign_up_date'])
earliest_list = earliest_list.drop_duplicates()
earliest_list

Unnamed: 0,user_state,user_sign_up_date,user_id
0,Alabama,2015-05-01,5
21,Alaska,2015-05-12,106
47,Arizona,2015-05-12,105
69,Arkansas,2015-05-08,78
78,California,2015-05-04,39
79,California,2015-05-04,44
119,Colorado,2015-05-19,173
122,Colorado,2015-05-19,166
154,Connecticut,2015-05-16,127
170,Florida,2015-05-04,41


### Recommendation System based on the probabilities

In [76]:
def nextsong():
    """Return the song that most often appears alongside the entered song in users' listening histories."""
    currsong = input('Please enter the current song:')
    # Set of distinct songs played by each user
    song_list = data.groupby('user_id')['song_played'].apply(set)
    # Keep only the histories of users who have played the current song
    playlist = [songs for songs in song_list if currsong in songs]
    song = data[['song_played']].drop_duplicates()

    # For each song, count how many of those users have also played it
    song['count'] = song['song_played'].apply(lambda x: sum(x in songs for songs in playlist))
    # Exclude the current song itself, then return the most frequent remaining song
    song = song[song['song_played'] != currsong]
    return song.sort_values('count',ascending = False).iloc[0,0]

In [79]:
nextsong()

Please enter the current song:A Day In The Life


'Revolution'
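The function above ranks songs by how many users have played both songs, regardless of order. The challenge asks for the song most likely to be played right after the current one by the same user, which can be estimated from consecutive plays instead. A minimal sketch under that reading, on made-up data (the `toy` frame and the helper names `build_transitions` and `next_song` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real dataset
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'song_played': ['Hey Jude', 'Revolution', 'Hey Jude',
                    'Hey Jude', 'Revolution', 'Let It Be'],
    'time_played': pd.to_datetime([
        '2015-06-01 10:00', '2015-06-01 10:05', '2015-06-01 10:10',
        '2015-06-02 09:00', '2015-06-02 09:04', '2015-06-02 09:09',
    ]),
})

def build_transitions(plays):
    """Count how often each song is followed by each other song for the same user."""
    ordered = plays.sort_values(['user_id', 'time_played'])
    # Next song played by the same user (NaN for each user's last play)
    ordered = ordered.assign(
        next_song=ordered.groupby('user_id')['song_played'].shift(-1))
    pairs = ordered.dropna(subset=['next_song'])
    return pairs.groupby(['song_played', 'next_song']).size()

def next_song(transitions, current_song):
    """Return the most frequent follower of current_song, or None if it has none."""
    if current_song not in transitions.index.get_level_values('song_played'):
        return None
    return transitions.loc[current_song].idxmax()

transitions = build_transitions(toy)
print(next_song(transitions, 'Hey Jude'))
```

Dividing each count by the total for its `song_played` value would turn the counts into estimated transition probabilities; the most likely next song is the same either way.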

### Test for recommendation system

#### To test the impact of the recommendation system, we follow several steps to determine whether users actually like the songs we recommend to them.

#### The primary metric is the users' play time for the songs we recommend to them.
We will randomly select 30% of all users for an A/B test. Half of these users will be in the test group, which receives recommendations from the new system, and half will be in the control group, which keeps the current experience.

#### Our null hypothesis is that song playing time under the new system is equal to or smaller than under the current one.

#### Our alternative hypothesis is that song playing time under the new system is larger than under the current one.

The test will run for 1 month. At the end of the month, we compare the average song playing time between the two groups.
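One way to evaluate the one-sided hypothesis is a permutation test on a per-user metric. The sketch below is an illustration only: the numbers are made up, plays per user stands in for the engagement metric, and `permutation_p_value` is a hypothetical helper rather than part of the original analysis.

```python
import random

def permutation_p_value(control, test, n_permutations=10_000, seed=0):
    """One-sided p-value for the alternative: mean(test) > mean(control)."""
    rng = random.Random(seed)
    observed = sum(test) / len(test) - sum(control) / len(control)
    pooled = list(control) + list(test)
    n_test = len(test)
    n_control = len(pooled) - n_test
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Reassign users to groups at random and recompute the difference
        rng.shuffle(pooled)
        diff = sum(pooled[:n_test]) / n_test - sum(pooled[n_test:]) / n_control
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly positive
    return (at_least_as_large + 1) / (n_permutations + 1)

# Made-up plays-per-user figures for illustration only
control = [10, 12, 9, 11, 10, 13, 8, 11]
test = [14, 15, 13, 16, 12, 15, 14, 17]

p = permutation_p_value(control, test)
print(f"one-sided p-value: {p:.4f}")
```

A small p-value would lead us to reject the null hypothesis that the new system performs no better than the current one. The significance level and the required sample size should be fixed before the test starts.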