# LastFM Song Plays

This notebook uses the [Music Recommendation Dataset (lastfm-1K)](https://www.dtic.upf.edu/%7Eocelma/MusicRecommendationDataset/lastfm-1K.html) provided by Òscar Celma, built from data collected from [Last.fm](https://www.last.fm).

In [1]:
import pandas as pd

Load the dataset, which should be downloaded from the link above and placed at the root of the git repository.

In [2]:
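df = pd.read_table(
    '../userid-timestamp-artid-artname-traid-traname.tsv',
    header=None,
    names=["userid", "timestamp", "artist_id", "artist_name", "track_id", "track_name"],
    # Skip malformed rows instead of raising (pandas >= 1.3;
    # replaces the deprecated error_bad_lines=False)
    on_bad_lines="skip"
)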

df['timestamp'] = pd.to_datetime(df['timestamp'], format="%Y-%m-%dT%H:%M:%SZ")
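
To see what the load step does without the full download, here is a minimal sketch on two made-up rows in the same six-column, tab-separated layout. The row values and the `sample` name are assumptions for illustration only, not taken from the dataset.

```python
import io

import pandas as pd

# Two invented rows in the same tab-separated, six-column layout (not real data)
sample_tsv = (
    "user_a\t2009-05-04T23:08:57Z\tart-1\tArtist One\ttrk-1\tSong One\n"
    "user_a\t2009-05-04T13:54:10Z\tart-2\tArtist Two\ttrk-2\tSong Two\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["userid", "timestamp", "artist_id", "artist_name", "track_id", "track_name"],
)

# The trailing "Z" is matched literally, so the result is a timezone-naive datetime
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%dT%H:%M:%SZ")

print(sample.dtypes)
```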

## How many songs has each user played?

According to the README provided with the dataset, there are 992 unique users.
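
One way to check that figure against the loaded frame is `Series.nunique()`. The sketch below runs on a toy frame with hypothetical user ids. On the real data the same call would be `df["userid"].nunique()`, and the result could in principle differ from the README if malformed rows were skipped on load.

```python
import pandas as pd

# Toy play log with hypothetical user ids (not taken from the dataset)
toy = pd.DataFrame(
    {
        "userid": ["u1", "u1", "u2", "u3", "u3"],
        "track_name": ["a", "b", "a", "c", "c"],
    }
)

# Number of distinct users that appear at least once
n_users = toy["userid"].nunique()
print(n_users)
```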

In [3]:
unique_song_plays = (
    df
    .drop_duplicates(['userid', 'track_name'])
    .groupby('userid')
    .size()
    .reset_index(name='n_unique_songs')
)

unique_song_plays

| | userid | n_unique_songs |
|---|---|---|
| 0 | user_000001 | 3092 |
| 1 | user_000002 | 8129 |
| 2 | user_000003 | 4565 |
| 3 | user_000004 | 5974 |
| 4 | user_000005 | 1974 |
| 5 | user_000006 | 7733 |
| 6 | user_000007 | 1093 |
| 7 | user_000008 | 608 |
| 8 | user_000009 | 2555 |
| 9 | user_000010 | 874 |
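
The count above de-duplicates on `track_name` alone, so two different songs that share a title are counted once per user. A possible variant, sketched here on invented rows, also keys on `artist_name`. Whether that is the better definition for this dataset is a judgment call, not something the README settles.

```python
import pandas as pd

# Invented rows: user u1 plays two different songs that are both titled "Intro"
toy = pd.DataFrame(
    {
        "userid": ["u1", "u1", "u1", "u2"],
        "artist_name": ["Artist A", "Artist B", "Artist A", "Artist A"],
        "track_name": ["Intro", "Intro", "Intro", "Intro"],
    }
)

# De-duplicate on title only (the approach used in the cell above)
by_title = toy.drop_duplicates(["userid", "track_name"]).groupby("userid").size()

# De-duplicate on artist and title, so same-named songs stay distinct
by_artist_and_title = (
    toy.drop_duplicates(["userid", "artist_name", "track_name"])
    .groupby("userid")
    .size()
)

print(by_title)
print(by_artist_and_title)
```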


## Most popular songs

The hundred most-played songs in the dataset, ranked by total play count.

In [4]:
most_popular_songs = (
    df
    .groupby(['artist_name', 'track_name'])
    .size()
    .reset_index(name='n_plays')
    .sort_values('n_plays', ascending=False)
    [:100]
)

most_popular_songs

| | artist_name | track_name | n_plays |
|---|---|---|---|
| 1297507 | The Postal Service | Such Great Heights | 3991 |
| 182347 | Joy Division | Love Will Tear Us Apart | 3651 |
| 1010920 | Radiohead | Karma Police | 3533 |
| 319890 | Death Cab For Cutie | Soul Meets Body | 3479 |
| 868054 | Muse | Supermassive Black Hole | 3463 |
| 1273109 | The Knife | Heartbeats | 3155 |
| 83410 | Arcade Fire | Rebellion (Lies) | 3047 |
| 868024 | Muse | Starlight | 3040 |
| 191434 | Britney Spears | Gimme More | 3002 |
| 1271575 | The Killers | When You Were Young | 2997 |
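
The same ranking pattern can be tried on a toy frame. This sketch uses `Series.nlargest` in place of sorting and slicing. That is a swapped-in alternative, not the cell's own method, and the artist and track values are made up.

```python
import pandas as pd

# Made-up play log: ("A", "x") is played 3 times, ("B", "y") twice, ("C", "z") once
toy = pd.DataFrame(
    {
        "artist_name": ["A", "A", "B", "A", "B", "C"],
        "track_name": ["x", "x", "y", "x", "y", "z"],
    }
)

# Count plays per (artist, track) pair and keep the 2 largest counts
top_n = (
    toy.groupby(["artist_name", "track_name"])
    .size()
    .nlargest(2)
    .reset_index(name="n_plays")
)

print(top_n)
```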


## Sessions

If we define a user’s “session” of Last.fm usage as one or more songs played by that user,
where each song starts within 20 minutes of the previous song’s start time, then we can
list the 10 longest sessions by duration.
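
As a quick check of this definition, here is a minimal sketch on four made-up plays for one hypothetical user. It assumes the rows are already sorted by timestamp within each user, and it treats a gap of exactly 20 minutes as the start of a new session, since "within 20 minutes" is read here as strictly less than 20.

```python
import pandas as pd

# Four invented plays for one hypothetical user, already sorted by time
plays = pd.DataFrame(
    {
        "userid": ["u1"] * 4,
        "timestamp": pd.to_datetime(
            [
                "2009-01-01 10:00:00",
                "2009-01-01 10:05:00",  # 5 min after previous: same session
                "2009-01-01 10:25:00",  # exactly 20 min after previous: new session
                "2009-01-01 10:30:00",  # 5 min after previous: same session
            ]
        ),
    }
)

# Time since the same user's previous play (NaT for each user's first play)
gap = plays.groupby("userid")["timestamp"].diff()

# A session starts at a user's first play or after a gap of 20 minutes or more
new_session = gap.isna() | (gap >= pd.Timedelta(minutes=20))

# A running count of session starts per user gives the session id
plays["session_id"] = new_session.astype(int).groupby(plays["userid"]).cumsum()

print(plays)
```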

In [5]:
def gen_sessions(group_df):
    """
    Assign a session id to each play in one user's group of plays.
    """
    # Sort a copy so the frame passed in by groupby is not mutated
    group_df = group_df.sort_values("timestamp").copy()

    # Time since the user's previous play (NaT for the first play)
    gap = group_df["timestamp"].diff()

    # Will increment on the user's first play and every time the
    # difference between rows is not less than 20 minutes
    group_df["session_id"] = (
        (gap.isna() | (gap >= pd.Timedelta(minutes=20)))
        .astype("int")
        .cumsum()
    )

    return group_df


df = df.groupby("userid", group_keys=False).apply(gen_sessions).reset_index(drop=True)


def list_of_songs(series):
    """
    Return the songs as a list, in the order of play.
    """
    return series.tolist()


sessions_df = (
    df.groupby(["userid", "session_id"]).agg(
        {"timestamp": ["max", "min"], "track_name": list_of_songs}
    )
    .reset_index()
)

sessions_df["session_length"] = sessions_df["timestamp"]["max"] - sessions_df["timestamp"]["min"]


# Keep the 10 longest sessions by duration
sessions_df = sessions_df.sort_values("session_length", ascending=False).head(10)

In [7]:
sessions_df

| | userid | session_id | timestamp (max) | timestamp (min) | track_name (list_of_songs) | session_length |
|---|---|---|---|---|---|---|
| 989087 | user_000949 | 151 | 2006-02-27 11:29:37 | 2006-02-12 17:49:31 | [Chained To You, The Animal Song, The Lover Af... | 14 days 17:40:06 |
| 1033397 | user_000997 | 18 | 2007-05-10 17:55:03 | 2007-04-26 00:36:02 | [Unentitled States Of Hysteria, Dim Allentown ... | 14 days 17:19:01 |
| 989495 | user_000949 | 559 | 2007-05-14 00:05:52 | 2007-05-01 02:41:15 | [White Daisy Passing, The Night'S Disguise, He... | 12 days 21:24:37 |
| 571717 | user_000544 | 75 | 2007-02-23 00:51:08 | 2007-02-12 13:03:52 | [Finally Woken, One Hundred Things You Should ... | 10 days 11:47:16 |
| 989075 | user_000949 | 139 | 2005-12-18 04:40:04 | 2005-12-09 08:26:38 | [Neighborhood #2 (Laika), Rainbows, Une Année ... | 8 days 20:13:26 |
| 989061 | user_000949 | 125 | 2005-11-18 22:50:07 | 2005-11-11 03:30:37 | [Excuse Me Miss Again, Gone To Earth, Rock N R... | 7 days 19:19:30 |
| 989125 | user_000949 | 189 | 2006-03-26 18:13:45 | 2006-03-18 23:04:14 | [Disco Science, Here Is Gone, Meet Virginia, T... | 7 days 19:09:31 |
| 571697 | user_000544 | 55 | 2007-01-13 13:57:45 | 2007-01-06 01:07:04 | [La Murga, Breathe Through, Heathen Town, For ... | 7 days 12:50:41 |
| 277761 | user_000250 | 1285 | 2008-02-28 21:18:03 | 2008-02-21 15:31:45 | [Lazarus Heart, Space Station No. 5, Cubert, S... | 7 days 05:46:18 |
| 989088 | user_000949 | 152 | 2006-03-06 19:52:35 | 2006-02-27 17:47:28 | [Y-Control, Banquet, The Swing, Happy Face, A ... | 7 days 02:05:07 |
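
Passing a dict with a list of functions to `agg` produces two-level column labels, which is why `sessions_df["timestamp"]["max"]` is indexed twice in the cell above. A possible alternative is pandas named aggregation, which yields flat column names. The sketch below uses a toy frame, and the output names `start`, `end`, and `tracks` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy plays that already carry a session id (values are made up)
toy = pd.DataFrame(
    {
        "userid": ["u1", "u1", "u1"],
        "session_id": [1, 1, 2],
        "timestamp": pd.to_datetime(
            ["2009-01-01 10:00", "2009-01-01 10:05", "2009-01-01 11:00"]
        ),
        "track_name": ["a", "b", "c"],
    }
)

# Named aggregation: each keyword becomes a single-level output column
flat = (
    toy.groupby(["userid", "session_id"])
    .agg(
        start=("timestamp", "min"),
        end=("timestamp", "max"),
        tracks=("track_name", list),
    )
    .reset_index()
)

# Session duration from first play start to last play start
flat["session_length"] = flat["end"] - flat["start"]

print(flat)
```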
