# LastFM Popularity

<b>What</b> are the most popular musical genres in LastFM? <b>Who</b> are the most popular artists in each genre, and <b>where</b> do they come from? <br>
<b>When</b> was this music made? <b>What</b> is its mood?

These are the questions this project asks. We will leverage the [dataset](https://www.kaggle.com/pieca111/music-artists-popularity) made available on Kaggle by [Piotr](https://www.kaggle.com/pieca111) to answer them.

### Let's take a look at the dataset.

In [125]:
import pandas as pd
df = pd.read_csv('Datasets/artists.csv', dtype=dict(
    artist_lastfm=str, country_lastfm=str, tags_lastfm=str))

df.head(1)

Unnamed: 0,mbid,artist_mb,artist_lastfm,country_mb,country_lastfm,tags_mb,tags_lastfm,listeners_lastfm,scrobbles_lastfm,ambiguous_artist
0,cc197bad-dc9c-440d-a5b5-d52ba2e14234,Coldplay,Coldplay,United Kingdom,United Kingdom,rock; pop; alternative rock; british; uk; brit...,rock; alternative; britpop; alternative rock; ...,5381567.0,360111850.0,False


To answer the questions we posed at the beginning, we will need:<br>
- <b>ambiguous_artist</b> | Sometimes more than one artist may share the same name
- <b>artist_lastfm</b> | Artist name in LastFM
- <b>country_lastfm</b> | Their countries
- <b>tags_lastfm</b> | Their tags
- <b>listeners_lastfm</b> | You get the idea.
- <b>scrobbles_lastfm</b>




In [126]:
df = df[['ambiguous_artist', 'artist_lastfm', 'country_lastfm', 'tags_lastfm', 'listeners_lastfm', 'scrobbles_lastfm']]
df.head()

Unnamed: 0,ambiguous_artist,artist_lastfm,country_lastfm,tags_lastfm,listeners_lastfm,scrobbles_lastfm
0,False,Coldplay,United Kingdom,rock; alternative; britpop; alternative rock; ...,5381567.0,360111850.0
1,False,Radiohead,United Kingdom,alternative; alternative rock; rock; indie; el...,4732528.0,499548797.0
2,False,Red Hot Chili Peppers,United States,rock; alternative rock; alternative; Funk Rock...,4620835.0,293784041.0
3,False,Rihanna,Barbados; United States,pop; rnb; female vocalists; dance; Hip-Hop; Ri...,4558193.0,199248986.0
4,False,Eminem,United States,rap; Hip-Hop; Eminem; hip hop; pop; american; ...,4517997.0,199507511.0


Way better.
### <b>However</b>

In [127]:
df.tail()

Unnamed: 0,ambiguous_artist,artist_lastfm,country_lastfm,tags_lastfm,listeners_lastfm,scrobbles_lastfm
1466078,False,,South Korea,,,
1466079,False,,,,,
1466080,False,,,,,
1466081,False,,South Korea,,,
1466082,False,,South Korea,,,


Looks like we have our fair share of null values

In [128]:
print('Number of null values by columns: \n' ,df.isna().sum())
print('\n Number of rows in the dataset: \n', len(df))

Number of null values by columns: 
 ambiguous_artist          0
artist_lastfm        479327
country_lastfm      1254585
tags_lastfm         1085008
listeners_lastfm     479323
scrobbles_lastfm     479323
dtype: int64

 Number of rows in the dataset: 
 1466083
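The null counts printed above are easier to compare as rates. A minimal sketch, assuming a small made-up frame (`toy` and its values are illustrative, not taken from the dataset): the mean of `isna()` gives the fraction of nulls per column.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "artist_lastfm": ["A", None, "C", "D"],
    "country_lastfm": [None, None, "Poland", None],
})

# isna() yields booleans; their mean is the fraction of nulls per column.
null_rate = toy.isna().mean().sort_values(ascending=False)
print(null_rate)
```

Applied to the real `df`, the same expression would give the null rate of each column directly.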


The column with the highest number of null values is country_lastfm, with a null rate of approximately 86%. <br>
It might be reasonable to assume that artists with no country filled in on their LastFM page are not among the most popular. Let's drop all null values and see what we end up with.

In [129]:
print('Total scrobbles (plays) before filtering null values:')
print(f'{df["scrobbles_lastfm"].sum():,}')

df.dropna(axis=0, how='any', inplace=True)

print('...and after:')
print(f'{df["scrobbles_lastfm"].sum():,}')

Total scrobbles (plays) before filtering null values:
120,324,972,038.0
...and after:
102,827,413,080.0


So we are losing about 15% of the total plays by doing so.<br><br>
Now, let's see how much we lose by dropping the ambiguous artists.

In [130]:
df.drop(df[df['ambiguous_artist']].index, inplace=True)
df.drop(df[df['scrobbles_lastfm'] <= 0].index, inplace=True) # Also drop artists with no plays
print('Total number of plays:' ,f'{df["scrobbles_lastfm"].sum():,}')
print('Total number of unique artists:' ,f'{len(df):,}')

Total number of plays: 87,214,810,301.0
Total number of unique artists: 171,128
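The shares of plays lost and kept can be checked from the totals printed in the cells above. This is plain arithmetic on those three numbers; the variable names are only for illustration.

```python
# Totals copied from the cell outputs above.
plays_raw = 120_324_972_038         # before any filtering
plays_no_nulls = 102_827_413_080    # after dropping rows with null values
plays_final = 87_214_810_301        # after also dropping ambiguous artists

# Share of plays lost by dropping nulls, and share kept after all filters.
lost_by_nulls = (plays_raw - plays_no_nulls) / plays_raw
kept_final = plays_final / plays_raw

print(f"Lost by dropping nulls: {lost_by_nulls:.1%}")
print(f"Kept after all filters: {kept_final:.1%}")
```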


So we are left with about 72% (close to 3/4) of the original number of plays and over 170 thousand unique artists. That will do. <br><br>
Some standard dataprep:

In [131]:
df.drop(['ambiguous_artist'], axis=1, inplace=True) # We don't need this column anymore
df.columns = ['artist', 'country', 'tags','listeners', 'scrobbles'] # Shorter names are good, as long as clarity is not sacrificed
df[['scrobbles', 'listeners']] = df[['scrobbles', 'listeners']].astype(int) # There are no half listeners (nor half scrobbles)
df.sort_values(by='scrobbles', ascending=False, inplace=True)
df.head()

Unnamed: 0,artist,country,tags,listeners,scrobbles
17,The Beatles,United Kingdom,classic rock; rock; british; 60s; pop,3674017,517126254
1,Radiohead,United Kingdom,alternative; alternative rock; rock; indie; el...,4732528,499548797
0,Coldplay,United Kingdom,rock; alternative; britpop; alternative rock; ...,5381567,360111850
8,Muse,United Kingdom,alternative rock; rock; alternative; Progressi...,4089612,344838631
23,Arctic Monkeys,United Kingdom,indie rock; indie; british; rock; alternative;...,3501680,332306552


### We could really use these tags, if only they were each in their own row...

In [132]:
df['tags'] = df['tags'].apply(lambda x: x.split('; '))
df['country'] = df['country'].apply(lambda x: x.split('; '))
df = df.explode('tags')
df = df.explode('country').copy()  # explode the already tag-exploded frame by country
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,artist,country,tags,listeners,scrobbles
0,The Beatles,United Kingdom,classic rock,3674017,517126254
1,The Beatles,United Kingdom,rock,3674017,517126254
2,The Beatles,United Kingdom,british,3674017,517126254
3,The Beatles,United Kingdom,60s,3674017,517126254
4,The Beatles,United Kingdom,pop,3674017,517126254
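Exploding both columns multiplies rows: an artist with 3 tags and 2 countries ends up with 6 rows, each repeating the same listener and scrobble values. A minimal sketch on a made-up single-row frame (the tag list is shortened and the scrobble value is invented):

```python
import pandas as pd

# One artist with three tags and two countries (toy values).
toy = pd.DataFrame({
    "artist": ["Rihanna"],
    "country": ["Barbados; United States"],
    "tags": ["pop; rnb; dance"],
    "scrobbles": [100],
})

# Split the '; '-separated strings into lists, then give each item its own row.
toy["tags"] = toy["tags"].str.split("; ")
toy["country"] = toy["country"].str.split("; ")
exploded = toy.explode("tags").explode("country").reset_index(drop=True)

print(len(exploded))                # 6 rows: 3 tags x 2 countries
print(exploded["scrobbles"].sum())  # 600: the same 100 plays counted six times
```

This repetition is why sums over the exploded frame need care later on.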


In [135]:
tags = df.groupby(by='tags').size().sort_values(ascending=False)
print(f'{len(tags):,}')

323,038


We have <b>323,038</b> unique tags.

<b>Which is quite a lot.</b> <br>
<br>
We will start off by selecting the <b>top 300 tags</b> by number of occurrences, which gives us something more manageable to work with.

In [136]:
top_tags = tags[:300].copy()
print(top_tags.index[:50])

Index(['seen live', 'rock', 'electronic', 'pop', 'All', 'indie', 'alternative',
       'female vocalists', 'under 2000 listeners', 'experimental', 'american',
       'folk', 'metal', 'punk', 'male vocalists', 'electronica', 'ambient',
       'spotify', 'USA', 'jazz', 'singer-songwriter', 'instrumental',
       'hardcore', '00s', 'dance', 'alternative rock', 'indie rock', 'german',
       'japanese', 'british', 'Hip-Hop', 'punk rock', '90s', 'psychedelic',
       'electro', 'world', '80s', 'chillout', 'death metal', 'acoustic',
       'french', 'black metal', 'rap', 'indie pop', 'female vocalist',
       'hard rock', 'hip hop', 'techno', 'heavy metal', 'soul'],
      dtype='object', name='tags')


### Warning! Arbitrary decisions below
We are going to look through the top 300 tags and manually classify each of them as one of the following:
- Musical genre
- Mood
- Decade

In [145]:
genres = [
    'rock', 'pop', 'alternative', 'indie',
    'Dance', 'hardcore punk',
    'dark ambient', 'folk rock', 'lounge', 'rnb', 'hip hop',
    'grindcore', 'doom metal', 'Psychedelic Rock', 'indie folk',
    'post-hardcore', 'electropop', 'dub', 'shoegaze',
    'JPop', 'Disco', 'country', 'ska', 'Brutal Death Metal',
    'Melodic Death Metal', 'Drum and bass', 'darkwave',
    'screamo', 'pop punk', 'hiphop', 'Power metal',
    'contemporary classical', 'dream pop', 'J-rock', 'new age',
    'synth pop', 'Stoner Rock', 'Pop-Rock', 'Garage Rock',
    'Grunge', 'Gothic Rock', 'Sludge', 'acid jazz', 
    'metal', 'punk', 'ambient', 'jazz', 'death metal',
    'hardcore', 'alternative rock', 'indie rock',
    'Hip-Hop', 'punk rock', 'electro', 'chillout', 
    'black metal', 'rap', 'indie pop', 'hard rock', 
    'techno', 'heavy metal', 'soul', 'House', 'Experimental Rock',
    'Progressive rock', 'industrial', 'new wave', 'post-punk',
    'Classical', 'pop rock', 'Avant-Garde', 'downtempo',
    'blues', 'post-rock', 'trance', 'thrash metal', 
    'classic rock', 'synthpop', 'idm', 'j-pop',
    'metalcore', 'reggae', 'minimal', 'Fusion', 'r&b',
    'trip-hop', 'Progressive metal', 'Lo-Fi', 'noise rock', 
    'Gothic Metal', 'blues rock', 
    'glitch', 'melodic metal', 'Post punk', 'garage',
    'soft rock', 'britpop', 'breakbeat', 'power pop'
]

moods = [
    'Love', 'beautiful', 'sexy',
    'Mellow', 'groove', 'comedy',
    'funky', 'ethereal', 'fun',
    'melancholic', 'romantic', 'melancholy',
    'relaxing', 'Dreamy',
]

decade = ['50s', '60s', '70s', '80s', '90s', '00s', '10s']


Let's keep only the tags that we selected by hand.

In [161]:
tags_to_keep = genres + moods + decade
df.drop(df[~df['tags'].isin(tags_to_keep)].index, axis=0, inplace=True)
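One thing to keep in mind is that `isin` compares strings exactly, so a hand-typed `'Dance'` does not match a tag stored as `'dance'`. Whether this affects the real data has not been checked here. The sketch below uses made-up tags to show the difference between exact and case-insensitive matching (`keep_lower` is a hypothetical helper name).

```python
import pandas as pd

# Toy tags with mixed casing (made up for illustration).
toy = pd.DataFrame({"tags": ["dance", "Dance", "rock", "seen live"]})
tags_to_keep = ["Dance", "rock"]

# Exact matching, as in the cell above: 'dance' is dropped.
exact = toy[toy["tags"].isin(tags_to_keep)]

# Case-insensitive matching: compare lowercased tags with a lowercased keep set.
keep_lower = {t.lower() for t in tags_to_keep}
relaxed = toy[toy["tags"].str.lower().isin(keep_lower)]

print(exact["tags"].tolist())
print(relaxed["tags"].tolist())
```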

And let's flag each tag with its category (i.e. whether it is a genre, a mood or a decade).

In [167]:
tags = [genres, moods, decade]
categories = ['genre', 'mood', 'decade']
tags_categories = dict()

for tag_group, category in zip(tags, categories):
    for tag in tag_group:
        tags_categories[tag] = category

df['tag_category'] = df['tags'].map(tags_categories)
df.head()

Unnamed: 0,artist,country,tags,listeners,scrobbles,tag_category
0,The Beatles,United Kingdom,classic rock,3674017,517126254,genre
1,The Beatles,United Kingdom,rock,3674017,517126254,genre
3,The Beatles,United Kingdom,60s,3674017,517126254,decade
4,The Beatles,United Kingdom,pop,3674017,517126254,genre
5,Radiohead,United Kingdom,alternative,4732528,499548797,genre


Some tags can be found written in more than one way, such as melancholic and melancholy. Let's create a mapping between them and standardize our tags.

In [171]:
duplicate_tag_mapping = {
    'hip hop': 'Hip-Hop',
    'hiphop': 'Hip-Hop',
    'JPop': 'j-pop',
    'pop rock': 'Pop-Rock',
    'r&b': 'rnb',
    'synth pop': 'synthpop',
    'melancholic': 'melancholy'
}
df['tags'] = df['tags'].map(duplicate_tag_mapping).fillna(df['tags'])
df.drop_duplicates(inplace=True) # Merging tag variants can create duplicate rows, so drop them
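To see why the `map(...).fillna(...)` pattern is followed by `drop_duplicates`, here is a minimal sketch on a made-up frame. `map` returns NaN for tags missing from the mapping, `fillna` restores their original values, and an artist that carried both spellings ends up with two identical rows.

```python
import pandas as pd

# Artist X carries two spellings of the same tag (toy data).
toy = pd.DataFrame({
    "artist": ["X", "X", "Y"],
    "tags": ["hip hop", "Hip-Hop", "rock"],
})
mapping = {"hip hop": "Hip-Hop"}

# Unmapped tags become NaN after map(); fillna() puts the original tag back.
toy["tags"] = toy["tags"].map(mapping).fillna(toy["tags"])

# Both X rows now read 'Hip-Hop', so one of them is a duplicate.
deduped = toy.drop_duplicates()

print(toy["tags"].tolist())
print(len(deduped))
```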


# you stopped here

### Avoiding Value Duplication
When calculating the total listeners for an artist, we don't want to add up the listener values of every row in which the artist appears. Each artist has one row for every tag-country combination (some groups are composed of members from multiple countries).

In [474]:
df_explode['artist_unique'] =  df_explode.groupby(['artist']).cumcount() + 1
df_explode['tags_artist_unique'] = df_explode.groupby(['artist', 'tags']).cumcount() + 1
df_explode['country_artist_unique'] = df_explode.groupby(['artist', 'country']).cumcount() + 1

In [475]:
# Some artists have too many tags; they appear to be ordered by relevance (up to a point),
# so let's keep only the top 3 tags per artist (counted within each artist-country pair)
df_explode.reset_index(inplace=True, drop=True)
df_explode.drop(df_explode[df_explode['country_artist_unique'] > 3].index, inplace=True)

In [476]:
cols_to_bool = ['artist_unique', 'tags_artist_unique', 'country_artist_unique']
for col in cols_to_bool:
    df_explode[col] = df_explode[col] == 1  # True only for the first occurrence

### Artist Rank by Genre

In [477]:
df_explode.sort_values(by=['tags', 'scrobbles'], ascending=False, inplace=True)
df_explode['artist_rank_by_genre'] = df_explode.groupby(by=['tags'])['scrobbles'].rank(method='dense', ascending=False)
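With `method='dense'`, artists tied on scrobbles share a rank and the next rank follows without a gap. A minimal sketch on made-up data, ranking within each tag as in the cell above:

```python
import pandas as pd

# Two jazz artists are tied on scrobbles (toy data).
toy = pd.DataFrame({
    "artist": ["A", "B", "C", "D"],
    "tags": ["jazz", "jazz", "jazz", "rock"],
    "scrobbles": [50, 50, 20, 70],
})

# Rank within each tag, highest scrobbles first; ties share a rank.
toy["rank"] = toy.groupby("tags")["scrobbles"].rank(method="dense", ascending=False)

print(toy)
```

Here A and B both get rank 1 within jazz, C gets rank 2, and D is rank 1 within rock.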

### Export

In [478]:
df_explode['tags'] = df_explode['tags'].str.capitalize()
df_explode.to_csv('Datasets/lastfm_tags.csv', index=False)
df.to_csv('Datasets/lastfm_artists.csv', index=False)

In [484]:
df_explode[df_explode['tags']=='Jazz']

Unnamed: 0,artist,country,tags,listeners,scrobbles,artist_unique,tags_artist_unique,country_artist_unique,artist_rank_by_genre
174,Lana Del Rey,United States,Jazz,1881065.0,217157209.0,False,True,False,1.0
641,Christina Aguilera,United States,Jazz,2788515.0,128756438.0,True,True,True,2.0
959,John Mayer,United States,Jazz,2453071.0,110319385.0,False,True,False,3.0
1579,Portishead,United Kingdom,Jazz,2094008.0,83662916.0,True,True,True,4.0
1714,Sia,Australia,Jazz,2124548.0,80113588.0,False,True,False,5.0
...,...,...,...,...,...,...,...,...,...
892445,Lesław Lic Jazz Ensemble,Poland,Jazz,10.0,14.0,True,True,True,8684.0
892503,Sylvain Kassap & François Corneloup,France,Jazz,2.0,12.0,True,True,True,8685.0
892553,Jerzy Herman Jazz Ensemble,Poland,Jazz,7.0,9.0,True,True,True,8686.0
892706,Shaydze Ov Colour,United Kingdom,Jazz,3.0,4.0,True,True,True,8687.0
