# Music Emotion Recognition Using "in the Wild" Data
## - Datasets for Practical Work

This notebook gives a brief overview of the datasets that will probably be used for the project.
The project has two main goals:
 - Collect and unify a dataset from existing large datasets derived from web crawling and from self-created datasets crawled from various online music platforms.
 - Test several methods for creating emotion labels for the whole dataset.
 
An important detail is that the labels are meant to describe a **music listening event** rather than a particular piece of music, thus taking the content and context of both the listener and the track into consideration.

In [227]:
# just some imports and stuff
import pandas as pd
from tqdm import tqdm
from glob import glob

folder = "resources/FINAL/"

## Datasets - Summary

Below is a table summarizing the number of files from each source. Some of these are existing datasets published by researchers and some were crawled by me. "Metadata" is simply the list of tracks and artists that form the intersection of all the main datasets.

In [234]:
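# collect all CSV files one level below the resources folder
# (recursive=True only matters for "**" patterns, so it is omitted)
files = glob(folder + "*/*.csv")
files_df = pd.DataFrame(files)[0].str.split("/", expand=True).drop([0, 1], axis=1).rename({2: "Source", 3: "File"}, axis=1)
# sort by source first, then by file name within each source
files_df = files_df.sort_values(["Source", "File"]).reset_index(drop=True)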

# How many datasets in each folder / group / source
files_df.groupby("Source").count().rename({"File": "number of files"}, axis=1)

Source,number of files
AudioDB,1
LFM,6
LJ2M,3
Lyrics,1
MMTD,4
Metadata,1
MuMu,7
SMPD,2
Spotify,2
SpotifyPlaylists,5


Each of the following sections presents four pieces of information:
- the file names
- a short sample of the dataset
- the column names
- the shape of the dataset (number of samples x number of columns)
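
The four items listed above could be gathered by one small helper. The following is only a sketch: `summarize_csv` is a hypothetical name that does not appear in the notebook, and it assumes each file is a plain CSV that `pd.read_csv` can read without extra arguments.

```python
import io

import pandas as pd


def summarize_csv(source, name, n=3):
    """Collect file name, sample, column names and shape for one CSV.

    Hypothetical helper, not part of the original notebook.
    `source` is a path or file-like object accepted by pd.read_csv.
    """
    df = pd.read_csv(source)
    return {
        "file": name,
        "sample": df.head(n),
        "columns": list(df.columns),
        "shape": df.shape,
    }


# Tiny in-memory CSV standing in for a real file on disk.
demo = io.StringIO("Artist,Track\n10 Years,Wasteland\n10cc,Dreadlock Holiday\n")
summary = summarize_csv(demo, "demo.csv")
print(summary["columns"], summary["shape"])
```

With real data, `source` would be one of the paths in `files`.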

## 1. Metadata - Final Intersection

The first dataset to be introduced is the list of track-artist pairs that form the intersection of the main datasets in this collection. The list contains more than 8000 tracks that are present in datasets such as the LFM subset, LJ2M and others. Based on this intersection, additional data was then crawled, which is presented in later sections.
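
The notebook does not show how the intersection was computed. One way it could be done is sketched below, assuming every source exposes `Artist` and `Track` columns and that pairs are matched on exact strings. Both are assumptions, and `intersect_tracks` is a hypothetical helper.

```python
from functools import reduce

import pandas as pd


def intersect_tracks(frames, keys=("Artist", "Track")):
    """Keep only the (Artist, Track) pairs present in every DataFrame.

    Hypothetical helper; exact string matching is an assumption and
    real data may need normalization (case, whitespace, etc.).
    """
    keys = list(keys)
    pairs = [frame[keys].drop_duplicates() for frame in frames]
    return reduce(lambda a, b: a.merge(b, on=keys, how="inner"), pairs)


# Made-up miniature sources for illustration only.
a = pd.DataFrame({
    "Artist": ["10cc", "10cc", "Nirvana"],
    "Track": ["Rubber Bullets", "Dreadlock Holiday", "Smells Like Teen Spirit"],
})
b = pd.DataFrame({
    "Artist": ["10cc", "Nirvana"],
    "Track": ["Rubber Bullets", "Smells Like Teen Spirit"],
})
c = pd.DataFrame({
    "Artist": ["10cc", "10 Years"],
    "Track": ["Rubber Bullets", "Wasteland"],
})

common = intersect_tracks([a, b, c])
print(common)
```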

In [236]:
file = [file for file in files if "Metadata" in file][0]
file

'resources/FINAL/Metadata/final_intersection.csv'

In [239]:
df = pd.read_csv(file, index_col=0)
df.head(3)

Unnamed: 0,Artist,Track
0,10 Years,Wasteland
1,10cc,Dreadlock Holiday
2,10cc,Rubber Bullets


In [96]:
df.columns

Index(['Artist', 'Track'], dtype='object')

In [98]:
df.shape

(8770, 2)

## 2. AudioDB

This data was extracted using the AudioDB API. It contains some metadata, such as IDs from other libraries, as well as descriptive texts about the track and user rating statistics.

In [222]:
file = [file for file in files if "AudioDB" in file][0]
file

'resources/FINAL/AudioDB/AudioDB_track_metadata.csv'

In [223]:
df = pd.read_csv(file, index_col=0)
df.head(3)

Unnamed: 0,idTrack,idAlbum,idArtist,idLyric,idIMVDB,strTrack,strAlbum,strArtist,strArtistAlternate,intCD,...,intScoreVotes,intTotalListeners,intTotalPlays,strMusicBrainzID,strMusicBrainzAlbumID,strMusicBrainzArtistID,strLocked,Track,Artist,lfm_id
0,32997498,2132356,114613,0.0,,Wasteland,Killing All That Holds You,10 Years,,,...,4.0,250701.0,1534647.0,f9a525b3-51f2-4ffa-8689-1df8c82788b8,02835573-67b8-3268-a442-56afd9770453,b18bc9c4-6f22-4f1b-a918-e9c86a39fe7a,Unlocked,Wasteland,10 Years,44101852
1,32828322,2118689,112493,0.0,,Dreadlock Holiday,Windows in the Jungle,10cc,,,...,2.0,283360.0,1165259.0,5d3affdf-1790-420f-8bad-d4308abf8137,4532b2fa-640e-4726-8a52-4d71bf26bd41,f37c537b-3557-4031-bfd6-ab63ced32854,Unlocked,Dreadlock Holiday,10cc,12754290
2,32828371,2118694,112493,0.0,,Rubber Bullets,Two Classic Albums: 10cc / Sheet Music,10cc,,,...,,51121.0,168300.0,2deac7a4-3e2d-4449-855e-5b81812c8f83,9a01e280-371e-3234-9a33-ce63074d2ffd,f37c537b-3557-4031-bfd6-ab63ced32854,Unlocked,Rubber Bullets,10cc,33773799


In [224]:
df.columns

Index(['idTrack', 'idAlbum', 'idArtist', 'idLyric', 'idIMVDB', 'strTrack',
       'strAlbum', 'strArtist', 'strArtistAlternate', 'intCD', 'intDuration',
       'strGenre', 'strMood', 'strStyle', 'strTheme', 'strDescriptionEN',
       'strDescriptionDE', 'strDescriptionFR', 'strDescriptionCN',
       'strDescriptionIT', 'strDescriptionJP', 'strDescriptionRU',
       'strDescriptionES', 'strDescriptionPT', 'strDescriptionSE',
       'strDescriptionNL', 'strDescriptionHU', 'strDescriptionNO',
       'strDescriptionIL', 'strDescriptionPL', 'strTrackThumb',
       'strTrack3DCase', 'strTrackLyrics', 'strMusicVid',
       'strMusicVidDirector', 'strMusicVidCompany', 'strMusicVidScreen1',
       'strMusicVidScreen2', 'strMusicVidScreen3', 'intMusicVidViews',
       'intMusicVidLikes', 'intMusicVidDislikes', 'intMusicVidFavorites',
       'intMusicVidComments', 'intTrackNumber', 'intLoved', 'intScore',
       'intScoreVotes', 'intTotalListeners', 'intTotalPlays',
       'strMusicBrainzID', 'st

In [225]:
df.shape

(8279, 57)

## 3. LFM

The [LFM](http://www.cp.jku.at/datasets/LFM-2b/) dataset, or more specifically the RecSys'22 subset of the LFM-2b dataset, contains information and metadata on the tracks, user listening events, user information, and user-created tags and genres. This data was retrieved from the Last.FM music platform.

In [None]:
lfm_files = [file for file in files if "LFM" in file]
lfm_files

['resources/FINAL/LFM/lfm_song_id_spotify_uri.csv',
 'resources/FINAL/LFM/lfm_users_listening_counts.csv',
 'resources/FINAL/LFM/lfm_genres.csv',
 'resources/FINAL/LFM/lfm_albumns.csv',
 'resources/FINAL/LFM/lfm_listening_events.csv',
 'resources/FINAL/LFM/lfm_tags.csv']

#### 3.1 LFM IDs and Track - Artist

In [11]:
lfm_files[0]

'resources/FINAL/LFM/lfm_song_id_spotify_uri.csv'

In [12]:
df = pd.read_csv(lfm_files[0])
df.head(3)

Unnamed: 0,Artist,Track,lfm_id,lfm_spotify_uri
0,10 Years,Wasteland,44101852,3pO37BXsjMC2wApALxGbuB
1,10cc,Dreadlock Holiday,12754290,1LOZMYF5s8qhW7Rv4w2gun
2,10cc,Rubber Bullets,33773799,1QQgSUKCG8GakzMOwi4lFS


In [13]:
df.columns

Index(['Artist', 'Track', 'lfm_id', 'lfm_spotify_uri'], dtype='object')

In [14]:
df.shape

(8772, 4)

#### 3.2 LFM - User listening counts

In [15]:
lfm_files[1]

'resources/FINAL/LFM/lfm_users_listening_counts.csv'

In [16]:
df = pd.read_csv(lfm_files[1])
df.head(3)

Unnamed: 0,user_id,in,out
0,2,60.0,662.0
1,6,31.0,2725.0
2,14,614.0,17479.0


In [17]:
df.columns

Index(['user_id', 'in', 'out'], dtype='object')

In [18]:
df.shape

(15258, 3)

#### 3.3 LFM Genres

In [19]:
lfm_files[2]

'resources/FINAL/LFM/lfm_genres.csv'

In [20]:
df = pd.read_csv(lfm_files[2])
df.head(3)

Unnamed: 0,Artist,Track,variable,value
0,Nirvana,Smells Like Teen Spirit,grunge,100
1,Nirvana,Smells Like Teen Spirit,rock,69
2,Nirvana,Smells Like Teen Spirit,alternative rock,23


In [21]:
df.columns

Index(['Artist', 'Track', 'variable', 'value'], dtype='object')

In [22]:
df.shape

(89731, 4)
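
The genre file is stored in long format, with one row per track and genre. A sketch of reshaping it into one row per track follows. It assumes that `variable` holds the genre name and `value` its weight, as the sample suggests, and the frame below only re-types the three sample rows.

```python
import pandas as pd

# The three sample rows shown above, re-typed for illustration.
genres = pd.DataFrame({
    "Artist": ["Nirvana"] * 3,
    "Track": ["Smells Like Teen Spirit"] * 3,
    "variable": ["grunge", "rock", "alternative rock"],
    "value": [100, 69, 23],
})

# One row per (Artist, Track), one column per genre, 0 where a genre is absent.
wide = genres.pivot_table(
    index=["Artist", "Track"],
    columns="variable",
    values="value",
    aggfunc="max",
    fill_value=0,
)
print(wide)
```

The same reshaping should apply to the tag file if `tag` is used in place of `variable`.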

#### 3.4 LFM Albums

In [23]:
lfm_files[3]

'resources/FINAL/LFM/lfm_albumns.csv'

In [24]:
df = pd.read_csv(lfm_files[3])
df.head(3)

Unnamed: 0,lfm_album_id,artist,album
0,16391976,Mtume,R&B: From Doo-Wop To Hip-Hop
1,4180281,Sum 41,"All Killer, No Filler"
2,22269931,Weezer,Weezer (Green Album)


In [25]:
df.columns

Index(['lfm_album_id', 'artist', 'album'], dtype='object')

In [26]:
df.shape

(35260, 3)

#### 3.5 LFM Listening Events

In [27]:
lfm_files[4]

'resources/FINAL/LFM/lfm_listening_events.csv'

In [28]:
df = pd.read_csv(lfm_files[4])
df.head(3)

Unnamed: 0,lfm_user_id,lfm_track_id,lfm_album_id,timestamp
0,14807,21889387,16391976,2020-01-01 00:00:03
1,21778,14820813,4180281,2020-01-01 00:00:07
2,2007,17825087,22269931,2020-01-01 00:00:21


In [29]:
df.columns

Index(['lfm_user_id', 'lfm_track_id', 'lfm_album_id', 'timestamp'], dtype='object')

In [30]:
df.shape

(994660, 4)

#### 3.6 LFM Tags

In [31]:
lfm_files[5]

'resources/FINAL/LFM/lfm_tags.csv'

In [32]:
df = pd.read_csv(lfm_files[5])
df.head(3)

Unnamed: 0,Artist,Track,tag,value
0,Nirvana,Smells Like Teen Spirit,Grunge,100
1,Nirvana,Smells Like Teen Spirit,rock,69
2,Nirvana,Smells Like Teen Spirit,Nirvana,43


In [33]:
df.columns

Index(['Artist', 'Track', 'tag', 'value'], dtype='object')

In [34]:
df.shape

(637620, 4)

## 4. LJ2M

The [LJ2M dataset](https://ieeexplore.ieee.org/document/6890172) (download no longer available) is a collection of blog posts from the LiveJournal platform, where users write a post, label it with an emotional keyword and associate a song with it. Presented here is the list of triplets of text (in BoW format), emotion and track, together with the user who created the post. In addition, there is a list of the emotional keywords as well as the BoW vocabulary needed to decode the text.

In [37]:
lj2m_files = [file for file in files if "LJ2M" in file]
lj2m_files

['resources/FINAL/LJ2M/lj2m_emotions.csv',
 'resources/FINAL/LJ2M/LJ2M_dictionary_for_BoW_nostemming.csv',
 'resources/FINAL/LJ2M/lj2m_songs_user_words_emotions.csv']

#### 4.1 LJ2M emotion list

In [38]:
lj2m_files[0]

'resources/FINAL/LJ2M/lj2m_emotions.csv'

In [39]:
df = pd.read_csv(lj2m_files[0])
df.head(3)

Unnamed: 0,Emotion,AllMusic,AllMusic_ANEW,LiveJournal,LiveJournal_ANEW
0,acerbic,True,False,False,False
1,aggressive,True,True,False,False
2,ambitious,True,True,False,False


In [40]:
df.columns

Index(['Emotion', 'AllMusic', 'AllMusic_ANEW', 'LiveJournal',
       'LiveJournal_ANEW'],
      dtype='object')

In [41]:
df.shape

(301, 5)

#### 4.2 LJ2M BoW Words

In [45]:
lj2m_files[1]

'resources/FINAL/LJ2M/LJ2M_dictionary_for_BoW_nostemming.csv'

In [42]:
df = pd.read_csv(lj2m_files[1])
df.head(3)

Unnamed: 0,i
0,to
1,and
2,the


In [43]:
df.columns

Index(['i'], dtype='object')

In [44]:
df.shape

(46959, 1)

#### 4.3 LJ2M Songs - Users - Emotional Label - Text(BoW)

In [47]:
lj2m_files[2]

'resources/FINAL/LJ2M/lj2m_songs_user_words_emotions.csv'

In [48]:
df = pd.read_csv(lj2m_files[2])
df.head(3)

Unnamed: 0,emotion_id,word_counts,user ID,artist,song title
0,accomplished_3,0:7 1:1 2:7 4:7 5:4 6:1 7:2 8:2 10:1 11:3 12:2...,u480993,David Bisbal,Buleria
1,accomplished_7,0:6 1:1 2:1 5:3 6:2 7:2 8:3 12:1 13:1 15:2 17:...,u574291,Yellowcard,Only One
2,accomplished_8,0:3 1:5 2:4 3:7 4:3 5:1 6:4 7:2 9:1 10:2 12:1 ...,u402088,Yellowcard,Ocean Avenue


In [49]:
df.columns

Index(['emotion_id', 'word_counts', 'user ID', 'artist', 'song title'], dtype='object')

In [50]:
df.shape

(741204, 5)
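
The `word_counts` field appears to encode each post as space-separated `index:count` tokens, where the index points into the BoW dictionary. A minimal parsing sketch under that assumption follows. The helper names and the three-word vocabulary are hypothetical, and reading the part of `emotion_id` before the last underscore as the emotion keyword is also an assumption. The samples printed above are truncated with `...` for display, while the sketch expects the full field.

```python
def parse_word_counts(field):
    """Parse 'index:count' tokens into a {word_index: count} dict."""
    counts = {}
    for token in field.split():
        index, _, count = token.partition(":")
        counts[int(index)] = int(count)
    return counts


def emotion_from_id(emotion_id):
    """Assumed convention: '<emotion>_<number>' -> '<emotion>'."""
    return emotion_id.rsplit("_", 1)[0]


# Hypothetical vocabulary; the real index-to-word mapping has to be
# taken from the BoW dictionary file and is not verified here.
vocab = ["w0", "w1", "w2"]

counts = parse_word_counts("0:7 1:1 2:7")
words = {vocab[i]: c for i, c in counts.items()}
emotion = emotion_from_id("accomplished_3")
print(words, emotion)
```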

## 5. Lyrics
This list of Lyrics was created using the Genius API.

In [57]:
lyrics_file = [file for file in files if "Lyrics" in file][0]
lyrics_file

'resources/FINAL/Lyrics/CrawledLyrics.csv'

In [58]:
df = pd.read_csv(lyrics_file)
df.head(3)

Unnamed: 0,Artist,Track,Lyrics
0,10 Years,Wasteland,"Wasteland Lyrics[Intro]\nChange my attempt, go..."
1,10cc,Dreadlock Holiday,
2,10cc,Rubber Bullets,Rubber Bullets Lyrics[Verse 1: Lol Creme]\nI w...


In [59]:
df.columns

Index(['Artist', 'Track', 'Lyrics'], dtype='object')

In [60]:
df.shape

(8770, 3)

## 6. MMTD - Million Music Tweet Dataset (#NowPlaying)

The [Million Music Tweet Dataset (MMTD)](http://www.cp.jku.at/datasets/MMTD/) is composed of some ID information, the tweeting events that represent music listening events (including user ID, location and other information), as well as a more detailed description of the different locations and country data.

In [62]:
mmtd_files = [file for file in files if "MMTD" in file]
mmtd_files

['resources/FINAL/MMTD/mmtd_songs_ids.csv',
 'resources/FINAL/MMTD/MMTD_tweets_events.csv',
 'resources/FINAL/MMTD/locations.csv',
 'resources/FINAL/MMTD/countryInfo.csv']

#### 6.1 MMTD - Song ID's

In [63]:
file = mmtd_files[0]
file

'resources/FINAL/MMTD/mmtd_songs_ids.csv'

In [64]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,Artist,Track,track_id
0,10 Years,Wasteland,8512
1,10cc,Dreadlock Holiday,12225
2,10cc,Rubber Bullets,12349


In [65]:
df.columns

Index(['Artist', 'Track', 'track_id'], dtype='object')

In [66]:
df.shape

(8847, 3)

#### 6.2 MMTD - Listening Events

In [67]:
file = mmtd_files[1]
file

'resources/FINAL/MMTD/MMTD_tweets_events.csv'

In [68]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,tweet_userId,tweet_datetime,tweet_weekday,tweet_longitude,tweet_latitude,Artist,Track,track_id
0,199729912,2012-02-09 00:48:30,3,-46.5955,-23.6828,Slipknot,Iowa,6721161
1,169895729,2012-05-12 00:31:42,5,-40.3339,-20.2784,Slipknot,Iowa,6721161
2,78146702,2012-01-27 16:49:19,4,-34.9394,-8.0436,Slipknot,Iowa,6721161


In [69]:
df.columns

Index(['tweet_userId', 'tweet_datetime', 'tweet_weekday', 'tweet_longitude',
       'tweet_latitude', 'Artist', 'Track', 'track_id'],
      dtype='object')

In [70]:
df.shape

(97062, 8)
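
Since each row is a listening event with a timestamp, simple context features such as hour of day or a weekend flag can be derived from `tweet_datetime`. The sketch below re-types the three sample rows. The new column names are hypothetical, and the time zone of the timestamps is not stated in this notebook, so the hour is not necessarily the user's local hour.

```python
import pandas as pd

# The three sample rows shown above, re-typed for illustration.
events = pd.DataFrame({
    "tweet_datetime": [
        "2012-02-09 00:48:30",
        "2012-05-12 00:31:42",
        "2012-01-27 16:49:19",
    ],
    "tweet_weekday": [3, 5, 4],
})

ts = pd.to_datetime(events["tweet_datetime"])
events["hour"] = ts.dt.hour                  # hour of the stored timestamp
events["weekday_check"] = ts.dt.dayofweek    # Monday=0 ... Sunday=6
events["is_weekend"] = ts.dt.dayofweek >= 5  # Saturday or Sunday

print(events)
```

For these three rows `weekday_check` equals `tweet_weekday`, which suggests that the dataset also counts Monday as 0.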

#### 6.3 MMTD - Locations

In [86]:
file = mmtd_files[2]
file

'resources/FINAL/MMTD/locations.csv'

In [87]:
df = pd.read_csv(file, index_col=0)
df.head(3)

Unnamed: 0,location_id,latitude,longitude,country,state,county,city,postalCode,street,timezone
0,1,-82.863,-135.0,AQ,,,,,,12.0
1,2,-64.28,-63.116,AQ,,,,,,-4.0
2,3,-53.17,-70.918,CL,Magallanes and Antartica Chilena Region,,Punta Arenas,,Capitán Ramón Serrano,-4.0


In [88]:
df.columns

Index(['location_id', 'latitude', 'longitude', 'country', 'state', 'county',
       'city', 'postalCode', 'street', 'timezone'],
      dtype='object')

In [89]:
df.shape

(244932, 10)

#### 6.4 MMTD - Country Info

In [90]:
file = mmtd_files[3]
file

'resources/FINAL/MMTD/countryInfo.csv'

In [91]:
df = pd.read_csv(file, index_col=0)
df.head(3)

Unnamed: 0,countryCode,countryName,isoAlpha3,fipsCode,continent,continentName,capital,areaInSqKm,population,currencyCode,languages,west,north,east,south
0,AD,Andorra,AND,AN,EU,Europe,Andorra la Vella,468.0,84000,EUR,ca,1.40719,42.656,1.78654,42.4285
1,AE,United Arab Emirates,ARE,AE,AS,Asia,Abu Dhabi,82880.0,4975593,AED,"ar-AE,fa,en,hi,ur",51.5833,26.0842,56.3817,22.6333
2,AF,Afghanistan,AFG,AF,AS,Asia,Kabul,647500.0,29121286,AFN,"fa-AF,ps,uz-AF,tk",60.4784,38.4834,74.8794,29.3775


In [92]:
df.columns

Index(['countryCode', 'countryName', 'isoAlpha3', 'fipsCode', 'continent',
       'continentName', 'capital', 'areaInSqKm', 'population', 'currencyCode',
       'languages', 'west', 'north', 'east', 'south'],
      dtype='object')

In [93]:
df.shape

(250, 15)

## 7. MuMu

The [MuMu](https://www.upf.edu/web/mtg/mumu) dataset contains valuable information about Amazon album purchases, such as "bought together" data, reviews and genre labeling.

In [127]:
mumu_files = [file for file in files if "MuMu" in file]
mumu_files

['resources/FINAL/MuMu/amazon_buy_after_viewing.csv',
 'resources/FINAL/MuMu/amazon_album_metadata.csv',
 'resources/FINAL/MuMu/amazon_also_viewed.csv',
 'resources/FINAL/MuMu/amazon_bought_together.csv',
 'resources/FINAL/MuMu/amazon_reviews_MuMu.csv',
 'resources/FINAL/MuMu/MuMu_dataset_single-label.csv',
 'resources/FINAL/MuMu/MuMu_dataset_multi-label.csv']

#### 7.1 MuMu - Amazon - Buy After Viewing

In [104]:
file = mumu_files[0]
file

'resources/FINAL/MuMu/amazon_buy_after_viewing.csv'

In [105]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,amazon_id,buy_after_viewing_amazon_id
0,3937406328,B000007WK4
1,4266950926,B000BNM7LQ
2,5555799063,B0000037C6


In [106]:
df.columns

Index(['amazon_id', 'buy_after_viewing_amazon_id'], dtype='object')

In [107]:
df.shape

(10975, 2)

#### 7.2 MuMu - Amazon - Album Metadata

In [108]:
file = mumu_files[1]
file

'resources/FINAL/MuMu/amazon_album_metadata.csv'

In [109]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,title,price,imUrl,amazon_id,Rank - Music
0,Comfort Zone,20.69,http://ecx.images-amazon.com/images/I/51fdvJLW...,1458389375,
1,Kirtan,22.94,http://ecx.images-amazon.com/images/I/51GGK0zo...,1591791065,89340.0
2,Rough Guide to Gypsy Music,21.5,http://ecx.images-amazon.com/images/I/51Li1pqK...,1906063443,337752.0


In [ ]:
df.columns

Index(['title', 'price', 'imUrl', 'amazon_id', 'Rank - Music'], dtype='object')

In [ ]:
df.shape

#### 7.3 MuMu - Amazon - Also Viewed

In [110]:
file = mumu_files[2]
file

'resources/FINAL/MuMu/amazon_also_viewed.csv'

In [111]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,amazon_id,also_viewed_amazon_id
0,5554329039,B000005XG5
1,5555728123,B000003FY9
2,5555728123,B000001G7R


In [112]:
df.columns

Index(['amazon_id', 'also_viewed_amazon_id'], dtype='object')

In [113]:
df.shape

(39252, 2)

#### 7.4 MuMu - Amazon - Bought Together

In [116]:
file = mumu_files[3]
file

'resources/FINAL/MuMu/amazon_bought_together.csv'

In [117]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,amazon_id,bought_together_amazon_id
0,1591791065,B000B6FXWS
1,1591791065,B000AA4JHK
2,1929243766,B003KWWDE6


In [118]:
df.columns

Index(['amazon_id', 'bought_together_amazon_id'], dtype='object')

In [119]:
df.shape

(21746, 2)

#### 7.5 MuMu - Amazon - Album Reviews

In [128]:
file = mumu_files[4]
file

'resources/FINAL/MuMu/amazon_reviews_MuMu.csv'

In [129]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,reviewerID,reviewerName,helpful,amazon_id,unixReviewTime,reviewText,overall,reviewTime,summary
0,A1T7P8KS0NYQ7I,Elizabeth,"[0, 0]",1458389375,1312588800,"Earth and Sky Dancing Music, intelligent on so...",5.0,"08 6, 2011","Smooth, Fun & Deep - And What A Voice!"
1,A1OV48FXDQ6DZJ,"A. Billings ""Drace""","[14, 14]",1591791065,1127952000,I bought this based on the snippets that they ...,5.0,"09 29, 2005",Kirtan is Really Addictive
2,A1XDPR9ZJGC8JY,A.Makote DeGuevara,"[0, 0]",1591791065,1395705600,"I really love Jai's voice, it's soothing and t...",5.0,"03 25, 2014",Devotional Voice


In [130]:
df.columns

Index(['reviewerID', 'reviewerName', 'helpful', 'amazon_id', 'unixReviewTime',
       'reviewText', 'overall', 'reviewTime', 'summary'],
      dtype='object')

In [131]:
df.shape

(447583, 9)

#### 7.6 MuMu - Amazon - Metadata and Genres

In [132]:
file = mumu_files[5]
file

'resources/FINAL/MuMu/MuMu_dataset_single-label.csv'

In [133]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,amazon_id,album_mbid,MSD_track_id,recording_mbid,artist_mbid,genres
0,B0018ZB6ZO,0164b5ce-42f8-46bf-b140-61ecd2ed449e,TRYDUXK12903CE0C37,66dd470f-4c4f-4caa-b8d4-2ebc80a83cb6,7434b85a-4a06-42ba-9e1d-9c568c044842,Dance & Electronic
1,B0018ZB6ZO,0164b5ce-42f8-46bf-b140-61ecd2ed449e,TRGWXYX128F426E27F,b45aef36-c3a3-42f2-90fb-728e8b3c54a2,7434b85a-4a06-42ba-9e1d-9c568c044842,Dance & Electronic
2,B000003HGR,7fe6e337-9115-3719-855b-0441c42a2c36,TRVMIIJ12903CDBF1D,36b15f42-b441-4d64-8ba9-ca264ed1c6f1,1b54e90c-638e-4fdd-a20e-4ab09db9fdaf,Alternative Rock


In [ ]:
df.columns

Index(['amazon_id', 'album_mbid', 'MSD_track_id', 'recording_mbid',
       'artist_mbid', 'genres'],
      dtype='object')

In [ ]:
df.shape

#### 7.7 MuMu - Amazon - Metadata and Multi-Labeled Genres

In [134]:
file = mumu_files[6]
file

'resources/FINAL/MuMu/MuMu_dataset_multi-label.csv'

In [135]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,amazon_id,album_mbid,MSD_track_id,recording_mbid,artist_mbid,genres
0,B00005YQOV,77944b8c-f753-4c7c-84ba-a48fbf518667,TRJIKJU128F930BF28,68c38213-65ba-4c1e-ac20-76883b512993,0a6f37da-2a2a-4308-896a-7c34b968b0b3,"Vocal Jazz,Jazz,Traditional Vocal Pop,Pop,Mode..."
1,B00005YQOV,77944b8c-f753-4c7c-84ba-a48fbf518667,TRLCVKT128F930BF18,f5c25488-4fbc-4ade-9cd4-431ae3fe3737,0a6f37da-2a2a-4308-896a-7c34b968b0b3,"Vocal Jazz,Jazz,Traditional Vocal Pop,Pop,Mode..."
2,B00005YQOV,77944b8c-f753-4c7c-84ba-a48fbf518667,TRBQSIG128F930BEFC,e76bbfcc-f94a-425d-bf13-d4e8d32df173,0a6f37da-2a2a-4308-896a-7c34b968b0b3,"Vocal Jazz,Jazz,Traditional Vocal Pop,Pop,Mode..."


In [136]:
df.columns

Index(['amazon_id', 'album_mbid', 'MSD_track_id', 'recording_mbid',
       'artist_mbid', 'genres'],
      dtype='object')

In [137]:
df.shape

(147295, 6)

## 8. SMPD

The [Spotify Million Playlist Dataset](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge) contains, as the name says, a million playlists. The extracted information is the playlist's name, the number of followers, the total number of tracks, and a list of the songs that are part of the intersection.

In [144]:
smpd_files = [file for file in files if "SMPD" in file]
smpd_files

['resources/FINAL/SMPD/smpd_playlist_info.csv',
 'resources/FINAL/SMPD/smpd_playlistID_trackID.csv']

#### 8.1 SMPD Playlist Info

In [147]:
file = smpd_files[0]
file

'resources/FINAL/SMPD/smpd_playlist_info.csv'

In [148]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,playlist_id,playlist_name,playlist_followers,playlist_num_tracks
0,698001,Slow Jams,9,1
1,698002,Español,21,1
2,698003,finals,50,1


In [149]:
df.columns

Index(['playlist_id', 'playlist_name', 'playlist_followers',
       'playlist_num_tracks'],
      dtype='object')

In [150]:
df.shape

(559185, 4)

#### 8.2 SMPD - Tracks-Playlists

In [151]:
file = smpd_files[1]
file

'resources/FINAL/SMPD/smpd_playlistID_trackID.csv'

In [152]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,playlist_id,track_id
0,961001,6183
1,961001,7198
2,961001,2957


In [153]:
df.columns

Index(['playlist_id', 'track_id'], dtype='object')

In [154]:
df.shape

(3764767, 2)

## 9. Spotify
This information was extracted from the Spotify API. It contains a large number of features, such as danceability, energy, key, tempo and valence. It also contains a more detailed analysis of the content of each song.

In [161]:
spotify_files = [file for file in files if "Spotify/" in file]
spotify_files

['resources/FINAL/Spotify/spotify_features.csv',
 'resources/FINAL/Spotify/spotify_analysis.csv']

#### 9.1 Spotify Features

In [162]:
file = spotify_files[0]
file

'resources/FINAL/Spotify/spotify_features.csv'

In [163]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,lfm_spotify_uri
0,0.409,0.783,6,-5.083,0,0.0731,0.000422,0.000354,0.0664,0.313,145.541,audio_features,3pO37BXsjMC2wApALxGbuB,spotify:track:3pO37BXsjMC2wApALxGbuB,https://api.spotify.com/v1/tracks/3pO37BXsjMC2...,https://api.spotify.com/v1/audio-analysis/3pO3...,229867,4,3pO37BXsjMC2wApALxGbuB
1,0.837,0.38,7,-13.341,0,0.064,0.541,0.00789,0.198,0.892,104.995,audio_features,1LOZMYF5s8qhW7Rv4w2gun,spotify:track:1LOZMYF5s8qhW7Rv4w2gun,https://api.spotify.com/v1/tracks/1LOZMYF5s8qh...,https://api.spotify.com/v1/audio-analysis/1LOZ...,267947,4,1LOZMYF5s8qhW7Rv4w2gun
2,0.548,0.877,7,-4.458,1,0.124,0.0437,1.3e-05,0.0921,0.466,148.799,audio_features,1QQgSUKCG8GakzMOwi4lFS,spotify:track:1QQgSUKCG8GakzMOwi4lFS,https://api.spotify.com/v1/tracks/1QQgSUKCG8Ga...,https://api.spotify.com/v1/audio-analysis/1QQg...,318853,4,1QQgSUKCG8GakzMOwi4lFS


In [164]:
df.columns

Index(['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
       'time_signature', 'lfm_spotify_uri'],
      dtype='object')

In [165]:
df.shape

(8497, 19)

#### 9.2 Spotify - Detailed Song Analysis

In [170]:
file = spotify_files[1]
file

'resources/FINAL/Spotify/spotify_analysis.csv'

In [167]:
df = pd.read_csv(file)
df.head(3)


Unnamed: 0,start,duration,confidence,loudness,tempo,tempo_confidence,key,key_confidence,mode,mode_confidence,time_signature,time_signature_confidence,lfm_spotify_uri
0,0.0,13.65422,1.0,-12.152,147.905,0.328,6,0.576,0,0.644,4,1.0,3pO37BXsjMC2wApALxGbuB
1,13.65422,58.1663,1.0,-6.114,147.457,0.393,6,0.627,0,0.781,4,1.0,3pO37BXsjMC2wApALxGbuB
2,71.82053,26.8085,0.123,-4.192,147.186,0.13,6,0.331,0,0.701,4,1.0,3pO37BXsjMC2wApALxGbuB


In [171]:
df.columns

Index(['start', 'duration', 'confidence', 'loudness', 'tempo',
       'tempo_confidence', 'key', 'key_confidence', 'mode', 'mode_confidence',
       'time_signature', 'time_signature_confidence', 'lfm_spotify_uri'],
      dtype='object')

In [172]:
df.shape

(90280, 13)
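
Each row of the analysis file appears to describe one time span of a track (`start`, `duration`), so a track has several rows. One possible way to collapse them into a single value per track is a duration-weighted mean, sketched below on the three sample rows. The aggregation choice and the output name `loudness_weighted` are assumptions, not something the notebook prescribes.

```python
import pandas as pd

# The three sample rows shown above (same track), re-typed for illustration.
analysis = pd.DataFrame({
    "lfm_spotify_uri": ["3pO37BXsjMC2wApALxGbuB"] * 3,
    "duration": [13.65422, 58.1663, 26.8085],
    "loudness": [-12.152, -6.114, -4.192],
})

# Weight each row's loudness by how long that part of the track lasts.
weighted = analysis["loudness"] * analysis["duration"]
per_track = (
    weighted.groupby(analysis["lfm_spotify_uri"]).sum()
    / analysis.groupby("lfm_spotify_uri")["duration"].sum()
).rename("loudness_weighted")

print(per_track)
```

The same pattern could be applied to `tempo` or any other per-row numeric column.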

## 10. Spotify Playlists (SP)
This collection of Spotify playlists by [Eva Zangerle](https://evazangerle.at/publication/pichl-ijmdem-2017/pichl-ijmdem-2017.pdf) also contains information on Spotify playlists, but additionally includes user IDs.

In [173]:
sp_files = [file for file in files if "SpotifyPlaylist" in file]
sp_files

['resources/FINAL/SpotifyPlaylists/user_count.csv',
 'resources/FINAL/SpotifyPlaylists/sp_data.csv',
 'resources/FINAL/SpotifyPlaylists/track_count.csv',
 'resources/FINAL/SpotifyPlaylists/playlist_count.csv',
 'resources/FINAL/SpotifyPlaylists/artist_count.csv']

#### 10.1 SP - User Count

In [176]:
file = sp_files[0]
file

'resources/FINAL/SpotifyPlaylists/user_count.csv'

In [177]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,user,counts
0,9cc0cfd4d7d7885102480dd99e7a90d6,104
1,07f0fc3be95dcd878966b1f9572ff670,1486
2,944c80d26922ae634d6ce445b1fdff7f,961


In [178]:
df.columns

Index(['user', 'counts'], dtype='object')

In [179]:
df.shape

(15918, 2)

#### 10.2 SP - Tracks - Playlist Names - UserID

In [191]:
file = sp_files[1]
file

'resources/FINAL/SpotifyPlaylists/sp_data.csv'

In [192]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,Artist,Track,Playlist,User
0,Elvis Costello,Alison,HARD ROCK 2010,9cc0cfd4d7d7885102480dd99e7a90d6
1,Crowded House,Don't Dream It's Over,HARD ROCK 2010,9cc0cfd4d7d7885102480dd99e7a90d6
2,Joshua Radin,Winter,HARD ROCK 2010,9cc0cfd4d7d7885102480dd99e7a90d6


In [193]:
df.columns

Index(['Artist', 'Track', 'Playlist', 'User'], dtype='object')

In [194]:
df.shape

(617153, 4)

#### 10.3 SP - Track Count

In [195]:
file = sp_files[2]
file

'resources/FINAL/SpotifyPlaylists/track_count.csv'

In [196]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,track,counts
0,(The Angels Wanna Wear My) Red Shoes - Elvis C...,76
1,"(What's So Funny 'Bout) Peace, Love And Unders...",85
2,7 Years Too Late - Tiffany Page,1


In [197]:
df.columns

Index(['track', 'counts'], dtype='object')

In [198]:
df.shape

(2824004, 2)

#### 10.4 SP - Playlist Count

In [202]:
file = sp_files[3]
file

'resources/FINAL/SpotifyPlaylists/playlist_count.csv'

In [203]:
df = pd.read_csv(file)
df.head(3)

Unnamed: 0,playlist,counts
0,HARD ROCK 2010,67
1,IOW 2012,37
2,2080,10


In [204]:
df.columns

Index(['playlist', 'counts'], dtype='object')

In [205]:
df.shape

(157537, 2)

#### 10.5 SP - Artist Count

In [ ]:
# artist_count.csv is the fifth entry of sp_files
file = sp_files[4]
file

'resources/FINAL/SpotifyPlaylists/artist_count.csv'

In [ ]:
df = pd.read_csv(file)
df.head(3)

In [ ]:
df.columns

In [ ]:
df.shape

## 11. Twitter

This dataset was crawled using the Twitter API with queries of the form "[Artist] - [Track]" and can easily be extended.

In [214]:
file = [file for file in files if "Twitter" in file][0]

In [215]:
tweets = pd.read_csv(file, lineterminator='\n')
tweets.head(3)

Unnamed: 0,text,id_str,location,utc_offset,followers_count,friends_count,verified,statuses_count,geo,coordinates,place,i
0,@LabradorYuki 10 Years\n\n10 Years - Wasteland...,1460474131,United States,,518,1833,False,26217,,,,0
1,"@CrootMatt I couldn’t agree more mate, his ten...",20863217,Bristol,,431,545,False,6931,,,,0
2,@SpotifyMexico Wasteland - 10 years. Hasta aho...,1037216346977828864,Ciudad de México,,34,268,False,139,,,,0


In [216]:
tweets.columns

Index(['text', 'id_str', 'location', 'utc_offset', 'followers_count',
       'friends_count', 'verified', 'statuses_count', 'geo', 'coordinates',
       'place', 'i'],
      dtype='object')

In [217]:
tweets.shape

(205423, 12)

## 12. YouTube - [yet to be included...]

# Current situation and next steps

As seen above, the datasets have already been selected and filtered according to the intersection, so that the sets contain no unnecessary samples. They are also all of a size that is manageable on a normal computer.
The next steps are the following:

1. Most of the data is still in a raw format and hard to use directly. Some feature extraction is required to make it more usable. This includes:
 - Creating feature vectors for text, such as BoWs
 - Extracting emotional content from text using existing libraries
 - Extracting meaningful information from the data, for example from dates: the day of the week, the time of day for the user, etc.
 - Calculating similarity measures for tracks using playlists and other sources
 - In general, many more features can be extracted, following the principles of transfer learning and feature engineering.

2. Data exploration: compute descriptive statistics for each variable and for combinations of variables.

3. Identify further features that are currently missing but could be useful.

4. Define the emotional model based on the existing information.

5. Develop the testing metrics.

6. Combine the datasets so that they can be used by the algorithms to be tested.

7. Develop the relevant algorithms:
 - Test common regressors for the inference of missing values
 - An autoencoder that can fill in the missing values of a feature vector
 - Graph-based semi-supervised learning (GSSL) algorithms
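
One of the feature-extraction steps listed above, track similarity derived from playlists, could be sketched as follows. Jaccard similarity over the sets of playlists that contain each track is used purely as an illustration, since no specific measure has been chosen yet. The helper names and the `(playlist_id, track_id)` pairs are made up.

```python
from collections import defaultdict


def track_playlists(pairs):
    """Map each track_id to the set of playlist_ids that contain it."""
    index = defaultdict(set)
    for playlist_id, track_id in pairs:
        index[track_id].add(playlist_id)
    return index


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up (playlist_id, track_id) pairs for illustration only.
pairs = [(1, "t1"), (1, "t2"), (2, "t1"), (2, "t2"), (3, "t1"), (3, "t3")]
index = track_playlists(pairs)

sim_12 = jaccard(index["t1"], index["t2"])  # share playlists 1 and 2
sim_23 = jaccard(index["t2"], index["t3"])  # share no playlist
print(sim_12, sim_23)
```

On the real playlist tables the pairs would come from the playlist and track ID columns. At that size a sparse-matrix formulation would likely be needed instead of plain Python sets.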