# Create Behavior Features
## Features based on my behavior - how and when I listen to songs and artists

The first version of the features uses only the raw listening data, specifically the frequency and recency of listening to songs and artists.

The aim is to then analyse and cluster the data set to understand more about my listening behavior. 

Once I am happy with the first iteration I would like to visualise the results in an app (probably R Shiny).

In future iterations I want to bring in external data about the music itself (genre, song length, year of release etc.) and re-run analysis and clustering.

This will hopefully spark ideas for predictive and recommendation projects I can do with this data, e.g. recommend songs using other users' lastfm data, predict what I might listen to next, or look for commonality with other users (Carl!).

## Technical Debt
- The year (2019) is imposed on the play times, because the lastfm timestamps do not include a year

## To-do list

- Tidy up play_time field to get a datetime field, being careful with the "hours ago" fields
- Make time of day/day of week features, e.g. hour, day, time of day (seconds), weekend, grouped times (requires pre-analysis)
- Make features about the frequency and recency of songs listened to
- Make features about the frequency and recency of artists
- Make features about the distinct songs and frequency of distinct songs played by artists (may require analysis)

Once this is done, there are hopefully enough features to do simple clustering!
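As a rough sketch of the frequency and recency items on the list, the features could be built with a `groupby`. The toy data, the helper name `frequency_recency` and the choice of the latest play as the recency reference point are all assumptions for illustration, not something already in this notebook.

```python
import pandas as pd

# Hypothetical toy data; in practice this would be the cleaned scrobble history
plays = pd.DataFrame({
    "song": ["Hungry", "Good Thing", "Hungry", "Kentucky Woman"],
    "artist": ["Paul Revere & The Raiders", "Paul Revere & The Raiders",
               "Paul Revere & The Raiders", "Deep Purple"],
    "play_datetime": pd.to_datetime([
        "2019-08-20 09:00", "2019-08-21 18:30",
        "2019-08-22 17:09", "2019-08-22 17:19",
    ]),
})

# Assumption: recency is measured against the latest play in the data
reference_time = plays["play_datetime"].max()


def frequency_recency(df, keys):
    """Play count and days since the last play for each group in `keys`."""
    features = df.groupby(keys).agg(
        play_count=("play_datetime", "count"),
        last_played=("play_datetime", "max"),
    ).reset_index()
    features["days_since_last_play"] = (
        (reference_time - features["last_played"]).dt.total_seconds() / 86400
    )
    return features


song_features = frequency_recency(plays, ["artist", "song"])
artist_features = frequency_recency(plays, ["artist"])

# Number of distinct songs played per artist
distinct_songs = plays.groupby("artist")["song"].nunique()
artist_features["distinct_songs"] = artist_features["artist"].map(distinct_songs)
```

Measuring recency against the latest play keeps the feature stable when the notebook is re-run later; using the download time instead would be an equally reasonable choice.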

## Import packages

In [5]:
import pandas as pd # allows making dataframes of results
import numpy as np
from datetime import datetime, date
import matplotlib.pyplot as plt # for analysis to determine which features to create
import seaborn as sns # for analysis to determine which features to create

## Parameters

In [201]:
# data filepath
download_history_path = "/Users/rosiedempsey/Desktop/MusicProject/finely_tuned/DataExports/RawScrobbles_master.csv"

# strings for classifying play_time types
minutes_string = 'minute'
hours_string = 'hour'

# scrobble year to fill in missing year in time field
scrobble_year = datetime.now().year # pd.datetime was removed from pandas; use the standard library datetime
# format of datetimes from lastfm if they are not X minutes/hours ago
datetime_format = "%Y%d%b%I:%M%p"
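To check what this format string expects, here is a minimal sketch that parses one timestamp with the standard library. The sample value `"1Jul5:21pm"` is hypothetical; it assumes the play times have had their whitespace stripped (as the "3minutesago" style values suggest) and that the year is prepended before parsing.

```python
from datetime import datetime

datetime_format = "%Y%d%b%I:%M%p"  # same format string as in the parameters
scrobble_year = 2019               # assumed year for this example

# Hypothetical whitespace-stripped play time: 1 Jul, 5:21pm
raw_play_time = "1Jul5:21pm"

# Prepend the year, then parse; %I with %p gives a 12-hour clock with am/pm
parsed = datetime.strptime(str(scrobble_year) + raw_play_time, datetime_format)
print(parsed)
```

Because `%Y` always consumes four digits, the year prefix cannot be confused with the day that follows it.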


## Functions

In [214]:

# Functions for transforming date times

# intermediate functions
# transform datetime plays
def clean_datetime_from_datetime_plays(datetime_df):
    """
    add year to datetimes, make proper datetime field, drop temporary columns
    """
    datetime_df['play_time_year'] = str(scrobble_year) + datetime_df['play_time']
    datetime_df['play_datetime'] = pd.to_datetime(datetime_df['play_time_year'],
                                                  format=datetime_format)
    return datetime_df.drop(columns='play_time_year')

# clean plays in the form "Xunitago", unit is minutes or hours
def clean_datetime_from_ago_plays(ago_df):
    """
    Extract numbers from the time stamp field
    Fill with 1 if na, as this represents when it says "anhourago"
    Turn that into a timedelta in minutes or hours depending on the time_type field
    Subtract this from download time
    Drop calculation fields
    """
    
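    # raw string for the regex; expand=False returns a Series instead of a one-column DataFrame
    ago_df['unit_ago'] = ago_df["play_time"].str.extract(r'(\d+)', expand=False).fillna('1').astype(int)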
    ago_df['unit_ago_delta'] = pd.to_timedelta(ago_df['unit_ago'], unit='h')
    ago_df.loc[ago_df['time_type']=='minutes_ago','unit_ago_delta']= pd.to_timedelta(ago_df['unit_ago'], unit='m')
    ago_df['play_datetime'] = ago_df['download_time']-ago_df['unit_ago_delta']
    
    return ago_df.drop(columns=['unit_ago','unit_ago_delta'])
    

# Transform all play times
def get_datetime_of_play(scrobble_df):
    """
    Create a helper field that says what type of play time is given (minutes ago, hours ago or datetime-like)
    Create two dfs for the different types, then concat them at the end
    For datetime types, add the year to them, then transform to datetime
    For non-datetimes, extract the number of units, turn it into a timedelta, then subtract it from the download time
    """
    # helper field of time type
    scrobble_df['time_type'] = 'datetime'
    scrobble_df.loc[scrobble_df['play_time'].str.contains(minutes_string), 'time_type'] = 'minutes_ago'
    scrobble_df.loc[scrobble_df['play_time'].str.contains(hours_string), 'time_type'] = 'hours_ago'
    
    # take datetimes
    datetimes_raw_df = scrobble_df[scrobble_df['time_type']=='datetime']
    datetimes_clean_df = clean_datetime_from_datetime_plays(datetimes_raw_df)
    
    # take ago fields
    agotime_raw_df = scrobble_df[scrobble_df['time_type'].isin(['minutes_ago','hours_ago'])]
    agotime_clean_df = clean_datetime_from_ago_plays(agotime_raw_df)
        
    return pd.concat([agotime_clean_df,datetimes_clean_df])
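The "ago" conversion can be tried on its own with a few made-up rows. This is only a sketch of the same idea, not a replacement for the functions above: the sample rows are hypothetical, and the unit is picked per row with `Series.where` instead of a `.loc` assignment.

```python
import pandas as pd

# Hypothetical rows covering minutes, "an hour" (no digits) and hours
ago = pd.DataFrame({
    "play_time": ["3minutesago", "anhourago", "2hoursago"],
    "download_time": pd.Timestamp("2019-08-22 17:26:05"),
})

# Digits give the number of units; no digits ("anhourago") means one unit
amount = (
    ago["play_time"].str.extract(r"(\d+)", expand=False).fillna("1").astype(int)
)

# Use minutes where the string mentions "minute", otherwise hours
is_minutes = ago["play_time"].str.contains("minute")
delta = pd.to_timedelta(amount, unit="h").where(
    ~is_minutes, pd.to_timedelta(amount, unit="m")
)

ago["play_datetime"] = ago["download_time"] - delta
print(ago[["play_time", "play_datetime"]])
```

Note that "X hours ago" values are only accurate to the hour, so the resulting `play_datetime` is an approximation for those rows.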



# Create date features

## Read in data

In [215]:
get_datetime_of_play(scrobble_history).head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

Unnamed: 0,song,artist,play_time,download_time,time_type,play_datetime
0,The Circle Game,Buffy Sainte-Marie,3minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019-08-22 17:23:05.650883
1,Kentucky Woman,Deep Purple,7minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019-08-22 17:19:05.650883
2,Jenny Take a Ride,Mitch Ryder,11minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019-08-22 17:15:05.650883
3,Choo Choo Train,The Box Tops,14minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019-08-22 17:12:05.650883
4,Hungry,Paul Revere & The Raiders,17minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019-08-22 17:09:05.650883
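
The warning printed above is pandas' SettingWithCopyWarning, raised when a column is assigned on a filtered subset of a DataFrame. A minimal sketch of one common way to avoid it, assuming it is acceptable for the subset to be an independent copy (the toy frame and the `flag` column are hypothetical):

```python
import pandas as pd

scrobbles = pd.DataFrame({
    "song": ["The Circle Game", "Kentucky Woman"],
    "time_type": ["minutes_ago", "datetime"],
})

# Filtering gives a subset that pandas may treat as linked to the original;
# .copy() makes it an explicitly independent DataFrame
minutes_only = scrobbles[scrobbles["time_type"] == "minutes_ago"].copy()

# Adding a column to the copy is unambiguous and leaves the original untouched
minutes_only["flag"] = True
print(scrobbles.columns.tolist())
```

The same `.copy()` could be applied where the datetime and "ago" subsets are taken in `get_datetime_of_play`.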


In [213]:
# read in history of scrobbles
scrobble_history = pd.read_csv(download_history_path)
# make download_time datetime
scrobble_history['download_time'] = pd.to_datetime(scrobble_history['download_time'])
scrobble_history.head(5)

Unnamed: 0,song,artist,play_time,download_time
0,The Circle Game,Buffy Sainte-Marie,3minutesago,2019-08-22 17:26:05.650883
1,Kentucky Woman,Deep Purple,7minutesago,2019-08-22 17:26:05.650883
2,Jenny Take a Ride,Mitch Ryder,11minutesago,2019-08-22 17:26:05.650883
3,Choo Choo Train,The Box Tops,14minutesago,2019-08-22 17:26:05.650883
4,Hungry,Paul Revere & The Raiders,17minutesago,2019-08-22 17:26:05.650883


In [165]:
# Two types of dates
# "Ago" data, i.e. 8 hours ago
# Time data
# First label the different time types
scrobble_history['time_type'] = 'datetime'
# scrobble_history.loc[scrobble_history['play_time'].str.contains(pm_string), 'time_type'] = 'datetime_pm'
scrobble_history.loc[scrobble_history['play_time'].str.contains(minutes_string), 'time_type'] = 'minutes_ago'
scrobble_history.loc[scrobble_history['play_time'].str.contains(hours_string), 'time_type'] = 'hours_ago'
scrobble_history['year'] = scrobble_year

In [166]:
scrobble_history.head(5)

Unnamed: 0,song,artist,play_time,download_time,time_type,year
0,The Circle Game,Buffy Sainte-Marie,3minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019
1,Kentucky Woman,Deep Purple,7minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019
2,Jenny Take a Ride,Mitch Ryder,11minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019
3,Choo Choo Train,The Box Tops,14minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019
4,Hungry,Paul Revere & The Raiders,17minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019


In [167]:
# do datetimes first (other time features can be extracted now!)
datetime_scrobbles = scrobble_history[scrobble_history['time_type']=='datetime']
datetime_scrobbles['play_time_year'] = str(scrobble_year) + datetime_scrobbles['play_time']
# make datetime
datetime_scrobbles['play_datetime'] = pd.to_datetime(datetime_scrobbles['play_time_year'],
                                                     format=datetime_format)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [168]:
datetime_scrobbles.dtypes

song                      object
artist                    object
play_time                 object
download_time     datetime64[ns]
time_type                 object
year                       int64
play_time_year            object
play_datetime     datetime64[ns]
dtype: object

In [184]:
# now do hours and minutes
ago_scrobbles = scrobble_history[scrobble_history['time_type'].isin(['minutes_ago','hours_ago'])]
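# raw string for the regex; expand=False returns a Series instead of a one-column DataFrame
ago_scrobbles['unit_ago'] = ago_scrobbles["play_time"].str.extract(r'(\d+)', expand=False).fillna('1').astype(int)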
ago_scrobbles['unit_ago_delta'] = pd.to_timedelta(ago_scrobbles['unit_ago'], unit='h')
ago_scrobbles.loc[ago_scrobbles['time_type']=='minutes_ago','unit_ago_delta']= pd.to_timedelta(ago_scrobbles['unit_ago'], unit='m')
ago_scrobbles['play_datetime'] = ago_scrobbles['download_time']-ago_scrobbles['unit_ago_delta']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [185]:
ago_scrobbles.head(20)

Unnamed: 0,song,artist,play_time,download_time,time_type,year,unit_ago,unit_ago_delta,play_datetime
0,The Circle Game,Buffy Sainte-Marie,3minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,3,00:03:00,2019-08-22 17:23:05.650883
1,Kentucky Woman,Deep Purple,7minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,7,00:07:00,2019-08-22 17:19:05.650883
2,Jenny Take a Ride,Mitch Ryder,11minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,11,00:11:00,2019-08-22 17:15:05.650883
3,Choo Choo Train,The Box Tops,14minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,14,00:14:00,2019-08-22 17:12:05.650883
4,Hungry,Paul Revere & The Raiders,17minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,17,00:17:00,2019-08-22 17:09:05.650883
5,Good Thing,Paul Revere & The Raiders,20minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,20,00:20:00,2019-08-22 17:06:05.650883
6,Tanya Tanning Butter Advertisement,Bristol-Myers,21minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,21,00:21:00,2019-08-22 17:05:05.650883
7,Paxton Quigley's Had The Course,Chad & Jeremy,24minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,24,00:24:00,2019-08-22 17:02:05.650883
8,Son Of A Lovin' Man,Buchanan Brothers,27minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,27,00:27:00,2019-08-22 16:59:05.650883
9,Hector,The Village Callers,29minutesago,2019-08-22 17:26:05.650883,minutes_ago,2019,29,00:29:00,2019-08-22 16:57:05.650883


In [106]:
# make datetime
datetime_am_scrobbles['play_date_time'] = pd.to_datetime(datetime_am_scrobbles['play_time_year'],\
                                                         format=datetime_am_format)
# extract month
# datetime_am_scrobbles['play_month'] = datetime_am_scrobbles['play_date_time'].dt.month

# # extract day
# datetime_am_scrobbles['play_date_time'] = pd.to_datetime(datetime_am_scrobbles['play_time_year'],\
#                                                          format=datetime_am_format)
# # extract hour

# # get 24 hour, hour

# # get minutes


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
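
The month, day, hour and minute extraction sketched in the comments above can be done with the pandas `.dt` accessor once a proper datetime column exists. This is a minimal sketch on made-up timestamps; the feature column names are hypothetical.

```python
import pandas as pd

plays = pd.DataFrame({"play_datetime": pd.to_datetime([
    "2019-08-22 17:23:05",  # a Thursday
    "2019-08-24 09:05:00",  # a Saturday
])})

dt = plays["play_datetime"].dt
plays["play_month"] = dt.month
plays["play_day"] = dt.day
plays["play_hour"] = dt.hour            # already on a 24-hour clock
plays["play_minute"] = dt.minute
plays["day_of_week"] = dt.dayofweek     # Monday=0 ... Sunday=6
plays["is_weekend"] = dt.dayofweek >= 5
# time of day expressed as seconds since midnight
plays["seconds_since_midnight"] = dt.hour * 3600 + dt.minute * 60 + dt.second
print(plays)
```

Grouped times (morning, evening and so on) would need bin edges chosen from the pre-analysis, so they are left out here.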


In [2]:
#  Useful code from previous notebook

# date time doesn't come with year! Need to concatenate year on 
# I guess add current year at time of scrape, not perfect but would do
# from datetime import datetime
# # 1 Jul 5:21pm 
# # "%d %mmm %H:%Mpm"
# # datetimeObj = datetime.strptime('2018-09-11T15::11::45.456777', '%Y-%m-%dT%H::%M::%S.%f')
# datetimeObj = datetime.strptime('20191 Jul 5:21pm', "%Y%d %b %I:%M%p")  # %I with %p so that pm is applied to the hour
# print(datetimeObj)
# print(type(datetimeObj))