# Data Summary 2 - Processed Data

In the previous summary, I presented the data collected from various sources. This time, I present the data after organizing it and, to some extent, preprocessing and operationalizing it (making it usable by the algorithms).

As a reminder of what the project is about, here is a short summary:

The objective of the project is to label listening events, taking into consideration the listener's properties, the context of the listening event, the track's content and the track's context. To do this, two approaches will be tested: 

**1st approach:**

For the first approach, I will feed a neural network with feature vectors and a mask hiding some of the values. The neural network should then return the hidden (missing) values. This requires well-structured input vectors that contain all of the data.
The target values will not be only the emotional labels but the whole vector, so that the network can classify the emotion of a listening event and also return information on the listener, the music content or other properties, based on the available emotional or non-emotional data.
In the end, the resulting NN will be able to fill in the whole dataset of emotionally labeled listening events.
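
A minimal sketch of how the masked input for this approach could be assembled, assuming purely numeric feature vectors. The helper name `mask_features`, the masking rate and the choice of 0 as the fill value are illustrative assumptions, not the final design of the project.

```python
import numpy as np

def mask_features(x, mask_rate=0.3, rng=None):
    """Hide a random subset of the entries of x.

    Returns the network input (masked values with the mask appended)
    and the boolean mask itself (True = hidden).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    hidden = rng.random(x.shape) < mask_rate
    x_masked = np.where(hidden, 0.0, x)  # hidden entries are zeroed out
    net_input = np.concatenate([x_masked, hidden.astype(float)], axis=-1)
    return net_input, hidden

# Two toy feature vectors with four features each (made-up values).
x = np.array([[0.4, 0.8, 6.0, -5.1], [0.8, 0.4, 7.0, -13.3]])
net_input, hidden = mask_features(x, rng=np.random.default_rng(0))
# The training target would be the full vector x.
```

Appending the mask to the input lets the network distinguish a hidden value from a genuine zero.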

**2nd approach:**

Create a graph where the nodes represent music listening events and the edges carry vectors describing the similarity relationship between the events at different levels, such as similarity based on the listeners' properties and context, a song's content and context, etc. By finding a suitable mathematical transformation based on the similarity of a node to its neighbours, the neighbours update and improve the music events' missing values. In other words, this approach uses Graph-Based Semi-Supervised Learning to infer the missing values and thus create a dataset of emotionally labeled listening events.
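
To make the idea concrete, here is a small sketch of one common form of graph-based semi-supervised learning: iterative propagation in which known values are clamped. It assumes a single scalar similarity per edge rather than the similarity vectors described above, and the function name `propagate` is hypothetical; the transformation actually used in the project may differ.

```python
import numpy as np

def propagate(W, values, known, n_iter=100):
    """Fill unknown node values by repeated weighted averaging over neighbours.

    W      : (n, n) non-negative similarity matrix (edge weights)
    values : (n,) array; entries where `known` is False are ignored
    known  : (n,) boolean array marking nodes with an observed value
    """
    W = np.asarray(W, dtype=float)
    values = np.asarray(values, dtype=float)
    known = np.asarray(known, dtype=bool)
    row_sums = W.sum(axis=1, keepdims=True)
    # Row-normalise the weights; isolated nodes keep an all-zero row.
    P = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    f = np.where(known, values, 0.0)
    for _ in range(n_iter):
        f = P @ f
        f[known] = values[known]  # clamp the observed values
    return f

# Toy chain graph 0 - 1 - 2 in which the middle node has no label.
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
values = np.array([1.0, np.nan, 0.0])
known = np.array([True, False, True])
filled = propagate(W, values, known)
```

In this toy graph the middle node ends up with the average of its two labeled neighbours.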


### Index

**1) Data**:<br>
1.1) Track's Metadata<br>
1.2) Track Content<br>
1.3) Track Context<br>
1.4) User Context and Properties<br>
1.5) Event Data<br>

**2) Summary and Conclusion**




In [3]:
import pandas as pd
from tqdm import tqdm
from glob import glob
import numpy as np

folder = "resources/DataSummary-5-Features/"

# 1) DATA
In the following sections I describe the data, which has now been preprocessed and is almost ready to be used. What is still missing is the separation into training and test sets and a deeper analysis of each feature. The preprocessing steps taken so far do not violate the training-testing separation, except where explicitly noted; in those cases the steps will be recomputed to ensure a clean separation between training and testing preprocessing. The main difference from the first notebook is that the data is now operationalized and ready to use (with very few exceptions).

For the presentation of the datasets, I have separated everything into the following categories. First comes the tracks' metadata, a collection of the IDs used in the different sources, which allows easy merging. Second come the features of the tracks' content, followed by data relating to the tracks' context. Fourth is a collection of user data, and finally a list of all the events, which will be the actual samples.


The data described in the following sections is derived from the sources listed below:

- Using the [AudioDB](https://www.theaudiodb.com) API, important information was crawled that helped in merging the different datasets, as well as providing a few features.

- The [LFM](http://www.cp.jku.at/datasets/LFM-2b/) dataset, or more specifically, the RecSys'22 subset of the LFM-2b dataset contains information and metadata on the tracks, user listening events, user information and user created tags and genres. This data was originally retrieved from the Last.FM music platform.

- The [LJ2M dataset](https://ieeexplore.ieee.org/document/6890172) (download no longer available) is a collection of blog posts from the LiveJournal platform, where users create posts, label them with an emotional keyword and associate a song with them.

- The list of Lyrics was created using the [Genius](https://genius.com/) API.

- The [Million Music Tweet Dataset (MMTD)](http://www.cp.jku.at/datasets/MMTD/) is composed of some ID information, the Tweeting events representing a music listening event (including user id, location and other info) as well as a more detailed description on the different locations and country data.

- The [Spotify Million Playlist Dataset](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge) contains, as the name says, a million playlists. The extracted info was the playlist's name, the number of followers, the total number of tracks, and a list of the songs.

- The [Spotify](https://open.spotify.com/) API provides a large number of track features, such as danceability, energy, key, tempo and valence.

- The list of Spotify playlists by [Eva Zangerle](https://evazangerle.at/publication/pichl-ijmdem-2017/pichl-ijmdem-2017.pdf) also contains information on Spotify playlists, but additionally includes user IDs.

Additionally, there are two other sources for which the crawling process has not finished, as it takes a long time due to the daily limit on items that can be collected from the respective services. If they are crawled successfully, they will be added in the future, within the scope of the Bachelor's thesis. These are:

- Crawled Tweets using the Twitter API for the keywords "[Artist] - [Track]".
- YouTube comments, using the Google API.

The following dataset, mentioned in a previous submission, will not be included, since it overlaps only marginally with the rest of the data and would therefore probably add little valuable information:

- The [MuMu](https://www.upf.edu/web/mtg/mumu) dataset contains information about Amazon album purchases, such as "bought together" data, reviews and genre labeling.

## 1.1) Tracks - Meta:

This first dataset is not that interesting. It only serves as a link between all the types of IDs that are used in the rest of the datasets.


One important detail: it contains 8770 unique tracks, some of them repeated because they appear in different albums or have different IDs in the same dataset. In total, there are 9037 entries.

In [7]:
tracks_ids = pd.read_csv(folder+"track_ids.csv", dtype=object)

In [8]:
tracks_ids.columns

Index(['Artist', 'Track', 'lfm_id', 'lfm_spotify_uri', 'audiodb_idTrack',
       'audiodb_idAlbum', 'audiodb_idArtist', 'audiodb_strAlbum',
       'audiodb_strArtistAlternate', 'audiodb_strMusicVid',
       'audiodb_strMusicBrainzID', 'audiodb_strMusicBrainzAlbumID',
       'audiodb_strMusicBrainzArtistID', 'mmtd_track_id', 'smpd_my_track_id'],
      dtype='object')

In [9]:
tracks_ids.shape

(9037, 15)

In [10]:
tracks_ids.drop_duplicates(["Artist", "Track"]).shape

(8770, 15)

In [12]:
tracks_ids.sample(10)

Unnamed: 0,Artist,Track,lfm_id,lfm_spotify_uri,audiodb_idTrack,audiodb_idAlbum,audiodb_idArtist,audiodb_strAlbum,audiodb_strArtistAlternate,audiodb_strMusicVid,audiodb_strMusicBrainzID,audiodb_strMusicBrainzAlbumID,audiodb_strMusicBrainzArtistID,mmtd_track_id,smpd_my_track_id
8930,X-Ray Spex,Identity,20117808,4nkYn6fooFedT9r3dRQn8Z,33611252,2182511,124035,Germfree Adolescents,,,78da0c94-7ebb-4f19-a8aa-e90437dab946,02c1c972-b5a1-3b8f-be3f-83dfb9373def,f4c967a4-7591-4bba-8237-82e171a2fa7f,8309398,10221
3214,Groove Armada,My Friend,27514001,5LRxSyiIRHQD26h1mdM0Ir,32778709,2114681,111793,Goodbye Country (Hello Nightclub),,http://www.youtube.com/watch?v=JxohJX9ElpE,bca38fa0-1f36-494c-8158-145f2afd6e73,b69719c2-4da9-3e10-b271-282f0945743a,35723b60-732e-4bd8-957f-320b416e7b7f,3013418,3703
7389,Stone Temple Pilots,Pop's Love Suicide,31501296,4mnvnildEMsGIQtgTBdH07,32792595,2115812,112007,Tiny Music... Songs From the Vatican Gift Shop,STP,,649313c1-9571-411e-9284-1a04114dac33,88c2b58d-ab28-3d7c-b719-3f8b4e3062a1,8c32bb01-58a3-453b-8050-8c0620edb0e5,6950770,8438
5330,My Bloody Valentine,What You Want,44594110,6TQMx46BOs5GHS3hcshYQf,32929800,2126630,113761,Loveless,,,d180f484-2440-4fb8-96e9-377a2d854edc,cb76227e-3ac0-3002-9a10-615a5b73cc59,8ca01f46-53ac-4af2-8516-55a909c0905e,5251319,6132
7808,The Clash,Stay Free,37236501,32rUsiLCvd1yZWr4UFUoiu,32753721,2112508,111450,Give 'Em Enough Rope,,,d7d216e6-f056-48c0-8563-1dc674cee977,fcd23bde-a71f-3a1e-8c7a-cde4eb427532,8f92558c-2baa-4758-8c38-615519e9deda,7278984,8934
6100,Prince,Let's Go Crazy,23936860,0QeI79sp1vS8L3JgpEO7mD,33163099,2145593,111308,Purple Rain,The Artist Formerly Known As Prince,https://www.youtube.com/watch?v=aXJhDltzYVQ,715138fa-c498-4fce-883d-f5e1f6e90797,b93a7c47-a6d4-33f2-9034-53fdd991f4ba,070d193a-845c-479f-980e-bef15710653e,5928918,7026
4749,Madness,One Step Beyond,29674997,1G6eFFDRaLr9EbThnhzMBD,32914775,2125515,113569,One Step Beyond...,,https://www.youtube.com/watch?v=SOJSM46nWwo,a7f3cd9e-3ed3-410c-9d4d-6f7a0ef13e88,f1e1b952-ec8d-363a-bd63-4c72d248cb51,5f58803e-8c4c-478e-8b51-477f38483ede,4718196,5449
5081,Misery Signals,The Stinging Rain,40642973,372bENyH7fpwQEHRb987e7,33001920,2132681,114697,Of Malice and the Magnum Heart,,,0adf276d-9029-4cc3-b35d-5ad0df3d90dc,3788b96e-4578-3b35-97e1-952fa9514331,0b0b523f-e9d5-4f18-919c-ebf084a58c26,5128785,5847
2045,Dario G,Sunchyme,37937587,5vNvbqy0OI3wG5MFeC0Qkq,33862465,2202684,128141,Sunmachine,,,f000d618-1a55-4783-98af-f20ad9436631,4ad18305-c2c8-3795-9e59-a4768bc30c93,8b763a0b-7951-4d24-b12a-58c2ab6cce8e,1749196,2352
5838,Paris Hilton,Screwed,34643764,6YhyLXlXUIR6CZCP9x5Crj,33111453,2141531,116305,Paris,,,789f3d0c-72b7-4a9b-808d-b0f039ce4504,0bec2fbb-505c-3f6f-bd34-ee723986acc3,0d9ff9c7-394e-4228-93aa-c4afd4765324,5646676,6719


## 1.2) Track Content:

Here we have a table describing the content of each song. It contains information for all of the songs in the dataset. Its features are the following:

**Lyrics:** Here the lyrics are still in their textual format; however, they have been cleaned so that they no longer contain descriptive tags such as *[Intro]* and similar. They have all been set to lower case, and punctuation has been reduced to only "?" and "!". Numbers have been removed, and any letter that repeats more than 3 times has been reduced to 2 consecutive appearances ("Pleeeeease" --> "Pleease") in order to limit such stylistic figures somewhat. The features extracted from the text are described in the explanation of the DataFrame after this one.

**audiodb_intDuration:** This describes the duration of the song in milliseconds as an integer value, as retrieved from the AudioDB API.

**spotify_... :** These are features that were crawled using the Spotify API. They are extracted from the song's content and, even though we do not know how they were computed, they are easily accessible. They contain a lot of interesting information, such as energy, key, loudness, speechiness, acousticness, instrumentalness etc., which intuitively seems relevant to the emotion of the listener.
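
The lyrics cleaning steps described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code actually used: the function name `clean_lyrics` and the exact regular expressions are assumptions, and the sketch only keeps the ASCII letters a-z.

```python
import re

def clean_lyrics(text):
    text = re.sub(r"\[[^\]]*\]", " ", text)         # drop tags such as [Intro]
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                # remove numbers
    text = re.sub(r"([a-z])\1{3,}", r"\1\1", text)  # letter repeated more than 3 times -> 2
    text = re.sub(r"[^a-z?!\s]", " ", text)         # keep only letters, "?" and "!"
    text = re.sub(r"([?!])", r" \1 ", text)         # treat "?" and "!" as separate tokens
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

cleaned = clean_lyrics("[Intro] Pleeeeease, could I? You've tried 100 times!")
print(cleaned)  # pleease could i ? you ve tried times !
```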

Below are some examples of the data: the head of the DataFrame, its shape and columns, and an example of the cleaned lyrics.

In [21]:
track_content = pd.read_csv(folder+"track_content.csv")
track_content.head(3)

Unnamed: 0,lfm_id,Lyrics,audiodb_intDuration,spotify_danceability,spotify_energy,spotify_key,spotify_loudness,spotify_mode,spotify_speechiness,spotify_acousticness,spotify_instrumentalness,spotify_liveness,spotify_valence,spotify_tempo,spotify_duration_ms,spotify_time_signature
0,44101852,change my attempt good intentions crouched ov...,231933.0,0.409,0.783,6.0,-5.083,0.0,0.0731,0.000422,0.000354,0.0664,0.313,145.541,229867.0,4.0
1,12754290,,300000.0,0.837,0.38,7.0,-13.341,0.0,0.064,0.541,0.00789,0.198,0.892,104.995,267947.0,4.0
2,33773799,i went to a party at the local county jail al...,319400.0,0.548,0.877,7.0,-4.458,1.0,0.124,0.0437,1.3e-05,0.0921,0.466,148.799,318853.0,4.0


In [20]:
track_content.shape

(8772, 16)

In [18]:
track_content.columns

Index(['lfm_id', 'Lyrics', 'audiodb_intDuration', 'spotify_danceability',
       'spotify_energy', 'spotify_key', 'spotify_loudness', 'spotify_mode',
       'spotify_speechiness', 'spotify_acousticness',
       'spotify_instrumentalness', 'spotify_liveness', 'spotify_valence',
       'spotify_tempo', 'spotify_duration_ms', 'spotify_time_signature'],
      dtype='object')

In [23]:
# Lyrics Example
track_content.loc[0, "Lyrics"]

' change my attempt good intentions crouched over you were not there living in fear but signs were not really that scarce obvious tears but i will not hide you through this i want you to help them please see the bleeding heart perched on my shirt die withdraw hide in cold sweat quivering lips ignore remorse naming a kid living wasteland this time you ve tried all that you can turning you red change my attempt good intentions should i ? could i ? here we are with your obsession should i ? could i ? crowned hopeless the article read living wasteland this time you ve tried all that you can turning you red but i will not hide you through this i want you to help change my attempt good intentions should i ? could i ? here we are with your obsession should i ? could i ? heave the silver hollow sliver piercing through another victim turn and tremble be judgmental ignorant to all the symbols blind the face with beauty paste eventually you ll one day know change my attempt good intentions limbs 

### Lyrics - Features
After extracting more relevant features from the lyrics, this part of the dataset can easily be used by the algorithms described above, as all of its data is numeric.

Different algorithms and tools have been applied to extract textual features, in order to create a vector that summarizes the information in the lyrics:
- Describe the lyrics as the mean value of their words within an embedding ([GloVe](https://nlp.stanford.edu/projects/glove/) embedding pretrained on Twitter data).
- Use tools such as [TextBlob](https://textblob.readthedocs.io/en/dev/), [VADER](https://www.nltk.org/_modules/nltk/sentiment/vader.html), [pysentimiento](https://github.com/pysentimiento/pysentimiento) and the [ANEW](https://pdodds.w3.uvm.edu/teaching/courses/2009-08UVM-300/docs/others/everything/bradley1999a.pdf) word lexicon.

The text extraction procedure described in this section has been applied to all the NLP feature extraction processes in this notebook, so the following explanation is valid whenever I refer to "NLP features" in the rest of this notebook.
- The embedding values are the mean of the embedding values of all the words in the text. If a word was not in the dictionary, it was ignored. If no words were found, the vector is filled with "NaN". The text was set to lower case, and stopwords and numbers were removed before processing; however, words were not lemmatized or stemmed in any way.
- The emotional features are the following:
  - TextBlob provides a value for polarity [-1 to 1] and subjectivity [0 to 1].
  - VADER provides a value for the positive, neutral, negative and compound sentiment of the full sentence.
  - pysentimiento's emotion analyzer provides a probability of the text containing the emotion anger, surprise, fear, disgust, joy, sadness or "others".
  - Using the ANEW lexicon, the Valence, Arousal and Dominance values of each word of the text that is present in the lexicon were collected. Their mean and the 0th to 100th percentiles of the distribution, in steps of 10, should roughly describe the emotional distribution of a sentence.
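
The two aggregation steps above (mean word embedding and the percentile summary of lexicon scores) can be sketched as follows. The toy embedding and valence lexicon are made up for illustration, and the helper names `mean_embedding` and `percentile_profile` are hypothetical; the real features use the GloVe Twitter vectors and the ANEW lexicon.

```python
import numpy as np

def mean_embedding(tokens, embedding, dim):
    """Mean of the vectors of all tokens found in `embedding`; NaN vector if none."""
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return np.full(dim, np.nan)
    return np.mean(vectors, axis=0)

def percentile_profile(scores):
    """Mean plus the 0th..100th percentiles in steps of 10 (NaN if no scores)."""
    if len(scores) == 0:
        return np.full(12, np.nan)
    pcts = np.percentile(scores, np.arange(0, 101, 10))
    return np.concatenate([[np.mean(scores)], pcts])

# Toy 2-dimensional "embedding" and valence lexicon (assumed values).
toy_embedding = {"good": np.array([1.0, 0.0]), "night": np.array([0.0, 1.0])}
toy_valence = {"good": 2.0, "night": 1.0}

tokens = "good night moon".split()  # "moon" is not in the dictionaries and is ignored
vec = mean_embedding(tokens, toy_embedding, dim=2)
profile = percentile_profile([toy_valence[t] for t in tokens if t in toy_valence])
```

With only a few matching words, the percentiles are linear interpolations between the available scores, which is why very short texts such as playlist names produce evenly spaced or constant percentile columns.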



In [4]:
#mean of embedding values of each word in lyrics
pd.read_csv(folder+"lyrics_embedding_glove_twitter_100.csv", index_col=0)

lfm_id,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
44101852,0.033952,0.087054,-0.078017,0.221227,-0.029613,0.281334,0.266406,-0.151590,-0.039592,0.074412,...,0.105830,0.055721,0.124386,-0.096698,-0.100704,-0.039003,-0.110011,-0.001672,0.094188,0.074915
33773799,0.001411,0.228059,-0.042129,-0.206601,-0.010182,-0.040911,0.140350,0.163569,0.018888,-0.014891,...,-0.039776,0.103009,-0.165613,-0.079112,-0.109839,0.056597,0.145956,0.089629,0.026967,-0.007331
7843169,0.015949,0.086746,-0.045775,0.147699,-0.139444,0.228159,0.363864,-0.022935,0.152875,0.278725,...,0.032939,-0.082183,0.227534,0.035462,-0.154588,0.034772,-0.017234,-0.084629,0.130964,0.103405
10346274,-0.102411,0.126315,0.045461,-0.029653,-0.110125,0.324240,0.337981,-0.121796,-0.064328,0.104915,...,0.143041,0.075079,0.008869,-0.014097,-0.212343,-0.018924,0.028680,-0.055990,0.211677,-0.004865
14580667,-0.298739,0.023444,0.049075,-0.006300,-0.028839,0.374516,0.407108,0.019072,-0.101141,0.367689,...,-0.047637,0.170273,-0.025706,-0.027376,-0.124376,-0.095619,-0.060320,-0.147475,0.115549,0.145992
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39042005,0.103655,0.375215,0.038044,0.184149,-0.355622,0.037989,0.430183,0.005687,0.195466,0.278224,...,0.043457,0.019154,-0.072690,0.106155,-0.111427,-0.037355,-0.207661,0.026142,0.343325,0.160644
11713650,-0.008193,0.071272,0.120118,-0.050687,-0.074789,0.272519,0.211622,-0.097406,-0.092352,0.207866,...,0.132661,0.162597,0.216178,-0.002175,-0.057239,-0.047344,-0.026961,0.038471,0.116563,-0.035992
18647416,-0.005132,0.194548,0.165425,0.048033,0.001130,0.224265,0.253001,-0.021075,0.172924,0.065538,...,-0.002660,0.161117,0.005536,-0.055045,-0.158890,0.112453,-0.014793,0.017185,-0.104653,0.056243
12856976,0.009055,0.375269,0.262648,-0.296742,-0.213797,0.151595,0.314310,0.129625,0.039465,0.298660,...,-0.031837,0.130994,-0.137951,-0.116435,-0.115925,-0.050213,-0.037468,0.188657,0.063615,-0.107474


In [4]:
pd.read_csv(folder+"lyrics_emo.csv", index_col=0)

lfm_id,TextBlob_polarity,TextBlob_subjectivity,VADER_neg,VADER_neu,VADER_pos,VADER_compound,PySentimiento_anger,PySentimiento_disgust,PySentimiento_fear,PySentimiento_joy,...,Anew_Dominance10_percentile,Anew_Dominance20_percentile,Anew_Dominance30_percentile,Anew_Dominance40_percentile,Anew_Dominance50_percentile,Anew_Dominance60_percentile,Anew_Dominance70_percentile,Anew_Dominance80_percentile,Anew_Dominance90_percentile,Anew_Dominance100_percentile
44101852,0.139031,0.410884,0.153,0.662,0.185,0.7824,0.008647,0.160927,0.003609,0.004359,...,-1.005,-0.430,-0.170,0.200,0.595,0.680,0.825,1.370,1.810,2.74
33773799,0.023148,0.144074,0.040,0.849,0.112,0.9774,0.006620,0.614601,0.004955,0.004260,...,-1.090,-0.818,0.000,0.380,0.585,1.082,1.617,1.680,1.710,2.21
7843169,-0.269805,0.446949,0.361,0.564,0.076,-0.9983,0.002003,0.008848,0.005399,0.001576,...,-1.240,-1.240,-0.545,-0.170,0.060,0.380,0.720,0.780,1.000,2.16
10346274,0.320833,0.604167,0.053,0.708,0.240,0.9959,0.002251,0.007050,0.013107,0.015905,...,-0.640,0.166,0.500,0.590,0.780,0.890,0.944,1.280,1.500,1.95
14580667,-0.056173,0.693210,0.032,0.670,0.297,0.9980,0.002055,0.006389,0.010033,0.011690,...,-0.370,-0.370,-0.370,0.060,0.830,0.890,0.940,1.232,1.560,1.88
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39042005,-0.322500,0.683333,0.170,0.673,0.157,-0.9666,0.007879,0.096990,0.006758,0.003417,...,-0.640,-0.060,-0.060,-0.060,0.120,0.390,0.390,0.870,1.244,1.88
11713650,0.065335,0.454316,0.117,0.802,0.081,-0.8860,0.001529,0.003065,0.004598,0.004759,...,-0.710,0.220,0.300,0.450,0.590,0.780,0.890,1.220,1.725,2.42
18647416,0.019630,0.654815,0.121,0.738,0.141,0.7383,0.001978,0.022632,0.006423,0.005070,...,-0.804,-0.500,-0.170,0.170,0.610,1.090,1.272,1.622,1.770,2.00
12856976,-0.131111,0.687778,0.055,0.799,0.146,0.8945,0.000918,0.003083,0.005378,0.010652,...,-0.897,-0.258,0.330,0.408,0.670,0.692,0.940,1.354,1.677,1.77


## 1.3) Track Context:

#### Track Similarity based on co-occurrence in playlists:

Here is a very interesting dataset: it is a matrix of Tracks x Tracks (rows x columns), where each cell contains a list (separated by "\t") of the playlists in which the track in the column co-occurs with the track in the row. In the example below, we see that the track with lfm_id=23570 does not share any playlist with track lfm_id=60189, indicating little or no similarity between the two, while track lfm_id=189564 and track lfm_id=1675423 both appear in the playlists "Starred" and "Last.fm Recommendations", indicating a small but non-zero similarity.
The strength of such a similarity still needs to be converted into numerical values, and there are several ways to do so:
- Simply count the number of playlists: this gives the raw number of playlists in which two tracks co-occur. There is, however, more information to be extracted here.
- Weight the count by the inverse of the number of songs in each playlist. For example, "Starred" occurs quite often in these 3 rows, which suggests that many songs are in a playlist called "Starred" (not always the same playlist, but this cannot be recovered from the original dataset). In this case, co-occurring in "Starred" should not weigh as much as being in the same playlist called "When I am feeling lonely", which contains only a handful of songs. Songs that co-occur in the latter playlist should be judged to be more similar than songs that co-occur in "Starred".
- The name of the playlist is also relevant: as in the example above, "When I am feeling lonely" is a playlist name that not only serves as an identifier but also carries information. One can use several methods to judge the emotional or semantic content of the name and then weight the count using this additional information, resulting in similarities between songs that are, for example, based on emotion: in terms of the loneliness expressed in the listener's description, two songs show a high degree of similarity.
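
The first two options can be sketched as follows, assuming the playlists are available as a mapping from a playlist identifier to its track IDs. The function name `cooccurrence` and the toy playlists are illustrative assumptions; how the weights are finally defined is still open.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(playlists, weighted=False):
    """Pairwise track similarity from playlist membership.

    playlists : dict mapping a playlist id to the list of its track ids
    weighted  : if True, each shared playlist contributes 1 / (playlist size)
                instead of 1, so large catch-all playlists count less.
    """
    sim = defaultdict(float)
    for tracks in playlists.values():
        unique = sorted(set(tracks))
        if len(unique) < 2:
            continue  # no pairs to count
        weight = 1.0 / len(unique) if weighted else 1.0
        for a, b in combinations(unique, 2):
            sim[(a, b)] += weight
    return dict(sim)

playlists = {
    "Starred": [1, 2, 3, 4],
    "When I am feeling lonely": [1, 2],
}
raw = cooccurrence(playlists)                  # plain playlist counts
inv = cooccurrence(playlists, weighted=True)   # inverse-size weighting
```

In the toy data, tracks 1 and 2 share both playlists, and the small playlist contributes twice as much to their weighted similarity as "Starred" does.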

The resulting similarity matrices can then be used by the algorithms in the following ways:
- Using Latent Semantic Analysis (LSA) or other algorithms, one can reduce the dimensionality of the matrix, resulting in a vector that describes each track's position within an embedding. This vector can then be added to the feature vector that goes into the neural network.
- A matrix with songs as rows and columns serves as an adjacency matrix of a graph, describing the paths/similarities between track nodes.

It is worth noting that this method based on co-occurrence in Spotify playlists can be extended to other domains, such as festival line-ups and other settings where tracks or artists co-occur.
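
As a rough sketch of the first option, a plain singular value decomposition (the core operation behind LSA) can turn each row of a similarity matrix into a low-dimensional track vector. The function name `svd_embedding` and the choice of `k` are assumptions; the small matrix is the block for tracks 0, 1, 2 and 4 taken from the SMPD co-occurrence counts shown in this notebook.

```python
import numpy as np

def svd_embedding(S, k):
    """Project each row of the similarity matrix S onto its top-k singular directions."""
    U, s, _ = np.linalg.svd(np.asarray(S, dtype=float), full_matrices=False)
    return U[:, :k] * s[:k]

# Co-occurrence counts for tracks 0, 1, 2 and 4 of the SMPD count matrix.
S = np.array([
    [1353,   0,  0,  28],
    [   0, 331,  4,   0],
    [   0,   4, 20,   0],
    [  28,   0,  0, 133],
])
track_vectors = svd_embedding(S, k=2)  # one 2-dimensional vector per track
```

For the full matrix, a truncated or sparse SVD would likely be preferable to the dense decomposition used in this sketch.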

The following cells show:
- An example of the co-occurrence matrix using the Spotify Playlist (by Eva Zangerle) dataset, with each cell being a list of the playlist names.
- An example of the co-occurrence matrix using the Spotify Million Playlist Dataset (SMPD), where each cell contains a list of playlist IDs.
- The same co-occurrence matrix as above, but with only the total counts, showing how this matrix can be turned into numerical values that then allow the computation of similarity between songs.
- Additional information that can be used for playlist weighting, for example the number of followers, the total number of tracks (only counting those that are in our dataset) and the playlist names.
- The original size of the playlists, as presented by the original authors of the dataset, counting tracks that are and are not present in our dataset.

In [33]:
# An example of the co-occurrence matrix using the Spotify Playlist (by Eva Zangerle) 
# dataset, with each cell being a list of the playlist names
for sp_track_cooccurence_list in pd.read_csv(folder+"sp_track_cooccurrence_playlist_list.csv", chunksize=10, index_col=0):
    break
sp_track_cooccurence_list.head(3)

Unnamed: 0,23570,60189,189564,1592801,1636891,1675423,1711943,1763141,1770945,1988698,...,46187719,46202411,46210252,46225268,46229272,46231620,46242855,46244293,46275910,47112160
23570,Starred\tRap\thip hop\tJuvenile — 400 Degreez\...,,Starred\t,Starred\t,Starred\t,Starred\t,Starred\t,Starred\t,Starred\t,Starred\t,...,Starred\t,Starred\t,Starred\t,Starred\t,Starred\t,Starred\t,,Starred\t,Starred\t,Starred\t
60189,,Dr_Dre-Lintegrale-3CD-2010-H5N1\t,,,,,,,,,...,,,,,,,,,,
189564,Starred\t,,Starred\tLast.fm Recommendations\t2014\tEveryt...,Starred\t,Starred\t,Starred\tLast.fm Recommendations\t,Starred\t,Starred\t,Starred\t,Starred\tMusica!\t,...,Starred\t,Starred\t,Starred\t,Starred\tMusica!\t,Starred\tsongs of my stolen ipod\tMusica!\t,Starred\t,,Starred\tMusica!\t,Starred\t,Starred\tEverything\tMusica!\t


In [39]:
# An example of the co-occurrence matrix using the Spotify Million Playlist Dataset (SMPD),
# where each cell contains a list of playlist IDs
for smpd_track_cooccurence_list in pd.read_csv(folder+"smpd_track_cooccurrence_playlist_lists.csv", chunksize=10, index_col=0):
    break
smpd_track_cooccurence_list.head(3)

Unnamed: 0,0,1,2,4,5,6,7,8,9,10,...,10324,10325,10326,10327,10328,10329,10330,10331,10333,10334
0,961150\t961688\t961705\t893339\t248564\t248714...,,,655452\t365390\t584394\t765844\t711513\t464421...,754742\t480616\t655452\t765844\t485747\t472066...,765844\t180269\t6824\t,,765844\t6824\t46400\t,765844\t46400\t,973965\t655452\t765844\t464421\t155492\t657173...,...,,,,,,,,,784503\t,
1,,437933\t352390\t580274\t527761\t454316\t454596...,838665\t220400\t998075\t692920\t,,,,,,,,...,,,,,,,,,,
2,,838665\t220400\t998075\t692920\t,903940\t757961\t675579\t838665\t950633\t220400...,,,,,,,,...,,,,,,,,,,


In [43]:
# The same co-occurrence matrix as above, however only with the total counts, thus describing how this matrix can be 
# turned into numerical values that then allow for the computation of similarity between songs
for smpd_total_cooccurrence in pd.read_csv(folder+"smpd_total_track_cooccurrence.csv", chunksize=10, index_col=0):
    break
smpd_total_cooccurrence

Unnamed: 0,0,1,2,4,5,6,7,8,9,10,...,10324,10325,10326,10327,10328,10329,10330,10331,10333,10334
0,1353,0,0,28,8,3,0,3,2,15,...,0,0,0,0,0,0,0,0,1,0
1,0,331,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,4,20,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,28,0,0,133,20,5,0,5,4,36,...,0,0,0,0,0,0,0,0,2,0
5,8,0,0,20,49,5,0,4,3,19,...,0,0,0,0,0,0,0,0,0,0
6,3,0,0,5,5,13,0,4,1,7,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,7,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,3,0,0,5,4,4,0,9,2,7,...,0,0,0,0,0,0,0,0,0,0
9,2,0,0,4,3,1,0,2,5,4,...,0,0,0,0,0,0,0,0,0,0
10,15,0,0,36,19,7,1,7,4,86,...,0,0,0,0,0,0,0,0,0,0
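
The step from the list-valued cells to the counts shown above can be sketched as follows. The helper name `count_playlists` is hypothetical, and the precomputed count file may have been produced differently.

```python
def count_playlists(cell):
    """Number of playlist entries in a tab-separated cell; empty or missing cells give 0."""
    if not isinstance(cell, str):
        return 0  # pandas reads empty cells as NaN (a float)
    return sum(1 for name in cell.split("\t") if name)

cells = ["Starred\tLast.fm Recommendations\t", "Starred\t", float("nan")]
counts = [count_playlists(c) for c in cells]
print(counts)  # [2, 1, 0]
```

With pandas, such a function could be applied element-wise to the whole matrix, for example via `DataFrame.applymap` (called `DataFrame.map` in recent pandas versions).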


In [45]:
# Additional information that can be used for playlist weighting, for example using the number of followers,
# the number of total tracks (only counting those that are in our dataset) and the playlist names.
pd.read_csv(folder+"smpd_playlist_data.csv", index_col=0)

Unnamed: 0,playlist_id,playlist_name,playlist_followers,playlist_num_tracks
0,698001,Slow Jams,9,1
1,698002,Español,21,1
2,698003,finals,50,1
3,698005,2017,20,1
4,698006,Sunny Days,128,1
...,...,...,...,...
559180,766993,Goood,59,1
559181,766995,Coldplay,94,1
559182,766996,Classics,103,3
559183,766998,Kitchen,9,1


In [46]:
# The original size of the playlists, as presented by the original authors of the dataset, 
# with tracks that are and are not present in our dataset.
pd.read_csv(folder+"smpd_playlist_counts.csv", index_col=0)

Unnamed: 0,playlist,counts
0,HARD ROCK 2010,67
1,IOW 2012,37
2,2080,10
3,C418,34
4,Chill out,570
...,...,...
157532,Mom's Indian Songs,41
157533,Music for bae,42
157534,Nov 2014 for Alp,19
157535,Various Artists – Top Latino V.7,14


#### Playlist Names' NLP features:
Here we also have the precomputed embedding and emotional values for the playlist names, both for the SMPD and the SP datasets shown above.

In [16]:
pd.read_csv(folder+"smpd_playlist_name_embedding_glove_twitter_100.csv")

Unnamed: 0,playlist_id,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,698001,0.174735,-0.318845,0.157661,-0.608225,0.032369,0.376055,-0.061510,0.196331,0.065102,...,-0.036555,-0.046360,-0.661415,0.262195,0.304895,0.111800,0.302695,0.875705,-0.440380,-0.394495
1,698002,-0.019015,-0.662710,-0.443445,0.253393,-0.191645,-0.092112,-0.414070,-0.139075,-0.359580,...,-0.203580,-0.294395,-0.117185,0.425525,0.389865,-0.399035,0.277790,0.207365,0.147010,-0.390790
2,698003,-0.633660,0.115840,1.138700,-0.778970,-0.454960,0.554680,0.813060,0.566880,-0.319350,...,-0.046501,0.346640,-0.036071,0.289710,-0.494390,-0.762520,0.049187,1.326700,0.424830,0.064332
3,698005,,,,,,,,,,...,,,,,,,,,,
4,698006,0.022596,-0.463255,0.181744,-0.029399,0.397010,0.380395,0.292585,-0.207900,-0.799810,...,0.058335,0.337458,0.003600,-0.386580,0.128993,-0.212567,0.182214,0.301670,-0.392470,-0.005475
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
559172,766993,0.091552,0.093336,-0.028113,0.369900,0.189560,0.431910,0.102660,0.340920,-0.387170,...,0.142490,-0.023226,-0.313460,0.225370,-0.427860,-0.516060,0.450300,0.331730,-0.087436,-0.555060
559173,766995,-0.514980,-0.208900,0.013223,0.696470,-0.779460,0.424420,0.586620,-0.213240,-0.325490,...,0.170980,-0.333370,-0.120740,-0.335100,-1.340400,0.474320,-0.380760,-0.343710,-0.227450,-0.055588
559174,766996,-0.281130,-1.253800,-0.258950,-0.223500,-0.350450,0.705420,-0.442670,0.118580,0.311180,...,0.469520,0.441510,-0.536690,-0.362720,0.249480,-0.395800,0.072521,0.802980,0.095195,-0.394790
559175,766998,-0.034679,-0.032174,0.713220,-0.252050,0.286430,0.377340,-0.241320,0.661780,0.453720,...,0.087682,0.365650,-0.085284,-0.228720,-0.084207,0.651260,0.486620,-0.113060,-0.489250,0.372330


In [17]:
pd.read_csv(folder+"smpd_playlist_name_emo.csv")

Unnamed: 0,playlist_id,TextBlob_polarity,TextBlob_subjectivity,VADER_neg,VADER_neu,VADER_pos,VADER_compound,PySentimiento_anger,PySentimiento_disgust,PySentimiento_fear,...,Anew_Dominance10_percentile,Anew_Dominance20_percentile,Anew_Dominance30_percentile,Anew_Dominance40_percentile,Anew_Dominance50_percentile,Anew_Dominance60_percentile,Anew_Dominance70_percentile,Anew_Dominance80_percentile,Anew_Dominance90_percentile,Anew_Dominance100_percentile
0,698001,-0.3,0.4,0.0,1.000,0.000,0.0000,0.005828,0.023104,0.013448,...,0.440,0.490,0.540,0.590,0.64,0.690,0.740,0.790,0.840,0.89
1,698002,0.0,0.0,0.0,1.000,0.000,0.0000,0.004011,0.004363,0.009632,...,,,,,,,,,,
2,698003,0.0,0.0,0.0,1.000,0.000,0.0000,0.021469,0.015812,0.038565,...,-1.060,-1.060,-1.060,-1.060,-1.06,-1.060,-1.060,-1.060,-1.060,-1.06
3,698005,0.0,0.0,0.0,1.000,0.000,0.0000,0.024942,0.036124,0.057735,...,,,,,,,,,,
4,698006,0.0,0.0,0.0,0.263,0.737,0.4215,0.004395,0.003234,0.005524,...,0.158,0.206,0.254,0.302,0.35,0.398,0.446,0.494,0.542,0.59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
559172,766993,0.0,0.0,0.0,1.000,0.000,0.0000,0.006411,0.003337,0.005151,...,1.410,1.410,1.410,1.410,1.41,1.410,1.410,1.410,1.410,1.41
559173,766995,0.0,0.0,0.0,1.000,0.000,0.0000,0.013458,0.033821,0.018151,...,,,,,,,,,,
559174,766996,0.0,0.0,0.0,1.000,0.000,0.0000,0.009419,0.016498,0.021154,...,1.290,1.290,1.290,1.290,1.29,1.290,1.290,1.290,1.290,1.29
559175,766998,0.0,0.0,0.0,1.000,0.000,0.0000,0.034707,0.044327,0.088906,...,0.890,0.890,0.890,0.890,0.89,0.890,0.890,0.890,0.890,0.89


In [18]:
pd.read_csv(folder+"sp_playlist_name_embedding_glove_twitter_100.csv")

Unnamed: 0,playlist,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,HARD ROCK 2010,-0.345170,-0.393250,-0.093355,-0.350670,-0.220450,0.445455,0.261060,0.060919,-0.224915,...,0.024910,0.055206,0.011435,-0.336450,-0.176315,0.079925,0.174486,0.016430,0.130921,-0.325805
1,IOW 2012,-0.374930,0.413750,-0.793090,-0.691420,-0.692290,-1.330800,0.712770,0.449390,-0.241600,...,0.552160,-0.219590,-0.559890,0.398880,-1.197600,-0.257290,0.030922,0.602610,0.612550,0.067323
2,Chill out,-0.008690,0.391270,-0.152280,-0.733080,0.201500,0.078276,-0.030451,0.168980,0.003582,...,-0.246050,0.369350,-0.381100,0.042372,-0.181370,0.060421,-0.016971,0.874910,-0.092507,-0.199630
3,Daft Punk,-0.001295,-0.365260,-0.338665,0.217550,-0.067742,0.544880,0.008115,-0.374615,-0.171830,...,-0.414025,0.242870,-0.260655,0.009095,-0.336944,0.570005,-0.261565,0.133741,0.197075,-0.517800
4,Electro,-0.416150,-0.647540,-0.629790,-0.377450,0.559920,0.061103,0.549060,-1.039400,-0.205030,...,-0.176600,0.220910,0.407330,-0.028020,-0.481280,0.494590,0.514940,-0.182910,-0.248260,-0.221660
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51476,Party! Time,-0.366199,0.361617,0.224295,-0.495690,0.377505,0.532065,0.485325,0.058635,-0.021540,...,-0.152825,0.436575,-0.146957,-0.145609,-0.477490,0.106456,0.075139,0.037293,-0.332360,-0.475125
51477,Rum! Time,-0.431319,-0.267328,0.061605,-0.085675,0.203165,0.357527,0.211860,0.091340,0.430875,...,-0.339295,-0.082810,-0.129375,-0.328855,-0.475935,-0.383434,0.096852,0.214823,-0.305995,-0.284790
51478,Brandon,0.039364,0.344590,0.250170,-0.453590,-0.285900,-0.844550,0.863780,0.561310,-1.146400,...,-0.560660,-0.630750,-0.007525,0.826760,-0.000563,0.520180,-0.495890,0.406340,0.174190,0.276030
51479,Hallie's Songs,-0.051525,-0.196858,0.031600,0.042216,-0.037331,0.455515,-0.056910,0.371375,0.178109,...,0.594350,0.163515,0.245325,0.139860,0.152095,0.526785,-0.383945,0.398020,0.177965,-0.235270


In [19]:
pd.read_csv(folder+"sp_playlist_name_emo.csv")

Unnamed: 0,playlist,TextBlob_polarity,TextBlob_subjectivity,VADER_neg,VADER_neu,VADER_pos,VADER_compound,PySentimiento_anger,PySentimiento_disgust,PySentimiento_fear,...,Anew_Dominance10_percentile,Anew_Dominance20_percentile,Anew_Dominance30_percentile,Anew_Dominance40_percentile,Anew_Dominance50_percentile,Anew_Dominance60_percentile,Anew_Dominance70_percentile,Anew_Dominance80_percentile,Anew_Dominance90_percentile,Anew_Dominance100_percentile
0,HARD ROCK 2010,-0.291667,0.541667,0.516,0.484,0.000,-0.2808,0.003750,0.003708,0.003430,...,-0.010,0.020,0.050,0.080,0.110,0.140,0.170,0.200,0.230,0.26
1,IOW 2012,0.000000,0.000000,0.000,1.000,0.000,0.0000,0.002498,0.003677,0.003749,...,,,,,,,,,,
2,Chill out,0.000000,0.000000,0.000,1.000,0.000,0.0000,0.074133,0.040585,0.012467,...,0.550,0.550,0.550,0.550,0.550,0.550,0.550,0.550,0.550,0.55
3,Daft Punk,0.000000,0.000000,0.000,1.000,0.000,0.0000,0.007053,0.009683,0.012615,...,-0.954,-0.868,-0.782,-0.696,-0.610,-0.524,-0.438,-0.352,-0.266,-0.18
4,Electro,0.000000,0.000000,0.000,1.000,0.000,0.0000,0.028493,0.035238,0.062673,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51476,Party! Time,0.000000,0.000000,0.000,0.251,0.749,0.4574,0.006623,0.002500,0.003455,...,-0.484,-0.328,-0.172,-0.016,0.140,0.296,0.452,0.608,0.764,0.92
51477,Rum! Time,0.000000,0.000000,0.000,1.000,0.000,0.0000,0.005733,0.005388,0.002188,...,-0.459,-0.278,-0.097,0.084,0.265,0.446,0.627,0.808,0.989,1.17
51478,Brandon,0.000000,0.000000,0.000,1.000,0.000,0.0000,0.031186,0.047769,0.066677,...,,,,,,,,,,
51479,Hallie's Songs,0.000000,0.000000,0.000,1.000,0.000,0.0000,0.002244,0.002918,0.002051,...,1.950,1.950,1.950,1.950,1.950,1.950,1.950,1.950,1.950,1.95


#### Artist / Album co-occurrence:
As briefly mentioned in the section above, co-occurrence can be measured at different levels. Two such levels are whether tracks are performed by the same artist and whether they appear on the same album, since songs that share an artist or an album are likely to be similar to some degree. Like the co-occurrence matrices above, these can serve as adjacency matrices in graphs or, once reduced with LSA to a meaningful number of dimensions, as part of a song's feature vector.

The following tables show these matrices, in which rows and columns are both tracks and a cell is 1 if the two tracks are linked and 0 otherwise. Below each matrix, we compute the sum per track to show that, although the matrix is very sparse, links between tracks do exist.
These values are computed for albums and for artists.
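
To make the construction concrete, here is a minimal sketch of how such a binary matrix could be derived from a track-to-album mapping. The toy table, its column names `track_id` and `album_id`, and the choice to leave the diagonal at 0 are assumptions for illustration; the stored CSV files may have been built differently.

```python
import pandas as pd

# Hypothetical track-to-album mapping; the column names are assumptions.
track_album = pd.DataFrame({
    "track_id": [23570, 60189, 189564, 1592801],
    "album_id": ["a1", "a1", "a2", "a1"],
})

# Self-join on the album: every pair of tracks sharing an album becomes one row.
pairs = track_album.merge(track_album, on="album_id", suffixes=("_x", "_y"))
# Drop the trivial pairing of a track with itself (assumption: zero diagonal).
pairs = pairs[pairs["track_id_x"] != pairs["track_id_y"]]

# Cross-tabulate the pairs and clip to 0/1 to obtain a binary adjacency matrix.
cooccurrence = pd.crosstab(pairs["track_id_x"], pairs["track_id_y"]).clip(upper=1)

# Reindex so that tracks without any link still get an all-zero row and column.
tracks = sorted(track_album["track_id"].unique())
cooccurrence = cooccurrence.reindex(index=tracks, columns=tracks, fill_value=0)

# Number of linked tracks per track, analogous to calling .sum() on the matrix.
links_per_track = cooccurrence.sum()
print(links_per_track)
```

The same pattern would apply to artists by swapping the album column for an artist column.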

In [9]:
# binary track-by-track matrix of album co-occurrence
track_album_cooccurrence = pd.read_csv(folder+"lfm_events_track_album_cooccurence.csv", index_col=0)
track_album_cooccurrence

Unnamed: 0,23570,60189,189564,1592801,1636891,1675423,1711943,1763141,1770945,1988698,...,46187719,46202411,46210252,46225268,46229272,46231620,46242855,46244293,46275910,47112160
23570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60189,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
189564,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1592801,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1636891,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46231620,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46242855,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46244293,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46275910,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# number of album co-occurrences for each track
track_album_cooccurrence.sum()

23570         4
60189         6
189564       14
1592801       6
1636891      28
           ... 
46231620     50
46242855     18
46244293     44
46275910     78
47112160    238
Length: 8772, dtype: int64

In [3]:
del track_album_cooccurrence

In [5]:
# binary track-by-track matrix of artist co-occurrence
track_artist_cooccurrence = pd.read_csv(folder+"lfm_events_track_artist_cooccurence.csv", index_col=0)
track_artist_cooccurrence

Unnamed: 0,23570,60189,189564,1592801,1636891,1675423,1711943,1763141,1770945,1988698,...,46187719,46202411,46210252,46225268,46229272,46231620,46242855,46244293,46275910,47112160
23570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60189,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
189564,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1592801,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1636891,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46231620,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46242855,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46244293,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46275910,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# number of artist co-occurrences for each track
track_artist_cooccurrence.sum()

23570       16
60189       28
189564      52
1592801     14
1636891     52
            ..
46231620    16
46242855    26
46244293    52
46275910    34
47112160    48
Length: 8772, dtype: int64

In [8]:
del track_artist_cooccurrence
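
The LSA reduction mentioned for these matrices could be sketched as follows. This is only an illustration on a toy 6 x 6 matrix with two groups of linked tracks; the number of components and the use of a SciPy sparse matrix are assumptions, not settings taken from the project.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy symmetric 0/1 co-occurrence matrix standing in for the 8772 x 8772 one.
# Tracks 0-2 share one album, tracks 3-5 share another.
dense = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
# A sparse representation stores only the non-zero links.
cooccurrence = sparse.csr_matrix(dense)

# LSA: truncated SVD, keeping a small number of latent dimensions per track.
svd = TruncatedSVD(n_components=2, random_state=42)
track_embedding = svd.fit_transform(cooccurrence)

print(track_embedding.shape)
```

Each row of `track_embedding` is a dense vector that could be appended to a song's feature vector.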

#### Track Similarity based on Temporal proximity:
If a user listens to track A and shortly afterwards to track B, one can assume some kind of similarity between the two tracks. If B is played right after A, the similarity is probably higher than if one hour lies between the two listening events.
The table below shows this similarity between tracks for each user: values close to 1 mean that the track in column "lfm_track_id_y" was played right after the track in column "lfm_track_id_x", and values close to 0 mean that almost 2h lay between them. Pairs with a time difference >2h were discarded, since the influence the tracks have on each other may no longer hold.

Additionally, the values in column "time_diff" are the mean, for each user, of the time differences <2h, rescaled to go from a maximum similarity of 1 to a minimum of 0. Later, the analysis can be done per listener or aggregated over all users.

Given that we have 3 dimensions (track_x, track_y and user_id), displaying this as a matrix is not feasible. Nevertheless, after filtering, it can be used in the same way as the matrices above.
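
As a rough sketch of the idea, the snippet below pairs the listening events of each user and maps the time gap linearly onto a similarity between 0 and 1. The linear mapping `1 - gap / 2h`, the toy events and the `timestamp` column are assumptions; the text only fixes the endpoints (1 for immediate succession, 0 at two hours).

```python
import pandas as pd

MAX_GAP = pd.Timedelta(hours=2)

# Hypothetical listening events; "timestamp" is an assumed column name.
events = pd.DataFrame({
    "lfm_user_id": [1, 1, 1, 2, 2],
    "lfm_track_id": [10, 20, 30, 10, 20],
    "timestamp": pd.to_datetime([
        "2020-01-01 10:00", "2020-01-01 10:30", "2020-01-01 13:00",
        "2020-01-01 09:00", "2020-01-01 10:00",
    ]),
})

# Pair every event with every later event of the same user.
pairs = events.merge(events, on="lfm_user_id", suffixes=("_x", "_y"))
pairs = pairs[pairs["timestamp_y"] > pairs["timestamp_x"]].copy()

# Keep only pairs less than two hours apart.
pairs["gap"] = pairs["timestamp_y"] - pairs["timestamp_x"]
pairs = pairs[pairs["gap"] < MAX_GAP].copy()

# Assumed linear mapping: a gap of 0 gives 1, a gap of 2h gives 0.
pairs["time_diff"] = 1 - pairs["gap"] / MAX_GAP

# Average per user and track pair.
similarity = (
    pairs.groupby(["lfm_user_id", "lfm_track_id_x", "lfm_track_id_y"])["time_diff"]
    .mean()
    .reset_index()
)
print(similarity)
```

Filtering `similarity` on one user, or averaging over `lfm_user_id`, yields a track-by-track table that can be pivoted into a matrix.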

In [6]:
pd.read_csv(folder+"lfm_events_2h_similarity(max=1).csv")

Unnamed: 0,lfm_track_id_x,lfm_track_id_y,time_diff,lfm_user_id
0,3004204,7126882,0.888626,14807
1,3004204,9398794,0.620968,14807
2,3004204,11998240,0.429714,14807
3,3004204,12961344,0.877781,14807
4,3004204,14854540,0.937013,14807
...,...,...,...,...
1462183,22632712,13804585,0.262430,48596
1462184,22632712,34811199,0.135315,48596
1462185,22632712,41403882,0.398770,48596
1462186,41403882,13804585,0.863660,48596


#### User-generated genre tags
The LFM-2b subset contains a list of user-generated tags for each track. Below is the same information, arranged such that each song has a feature vector in which each feature is the count (user-given relevance) for a given genre. In total there are >1000 genres (and thus features). This can be used as a feature vector in itself, or its dimensionality can be reduced in different ways.

In [6]:
genres = pd.read_csv(folder+"lfm_genres.csv", index_col=0)
genres

Unnamed: 0,Artist,Track,lfm_id,8-bit,a cappella,abstract,abstract beats,abstract hip hop,accordeon,accordion,...,world fusion,worship,wrestling,yacht rock,yaoi,ye ye,yodeling,yoga,zen,zouk
0,10 Years,Wasteland,44101852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,10cc,Dreadlock Holiday,12754290,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,10cc,Rubber Bullets,33773799,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,12 Stones,Broken,7843169,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12 Stones,Crash,10346274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9032,cLOUDDEAD,Dead Dogs Two,11245084,0.0,0.0,25.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9033,dEUS,Little Arithmetics,24313420,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9034,dEUS,Suds & Soda,37737921,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9035,mewithoutYou,January 1979,21517830,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
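
A wide genre table like the one above could be obtained from a long list of tags with a pivot. The snippet is a sketch under assumptions: the toy `tags` table, the `known_genres` whitelist and the column names are hypothetical and only illustrate the reshaping step.

```python
import pandas as pd

# Hypothetical long-format tags: one row per (track, tag) with its relevance.
tags = pd.DataFrame({
    "lfm_id": [1, 1, 1, 2, 2],
    "tag": ["grunge", "rock", "90s", "rock", "seen live"],
    "value": [100, 69, 42, 80, 10],
})
# Hypothetical whitelist of tags that count as genres.
known_genres = ["grunge", "rock", "zouk"]

# Keep only genre tags, then spread them into one column per genre.
genre_tags = tags[tags["tag"].isin(known_genres)]
genre_vectors = genre_tags.pivot_table(
    index="lfm_id", columns="tag", values="value", aggfunc="sum", fill_value=0
)
# Make sure every genre has a column, even if no track in the sample carries it.
genre_vectors = genre_vectors.reindex(columns=known_genres, fill_value=0)
print(genre_vectors)
```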


#### More track context: user-generated tags
The genre tags are extracted from a more general list of user-generated tags associated with each song, each of which carries a value, or relevance. Below is a list of these tags and the track each one is associated with. In its raw form it is not very useful; however, a lot of information can be extracted from each tag.

In the second cell below, we summarize the "value", which resembles a user-attributed relevance of the tag.

Then the NLP features are presented, i.e. the embedding and the emotional features of each tag, computed according to the NLP preprocessing method described above.

In [2]:
tags = pd.read_csv(folder+"lfm_tags.csv", index_col=0)
tags

Unnamed: 0,Artist,Track,tag,value
0,Nirvana,Smells Like Teen Spirit,grunge,100
1,Nirvana,Smells Like Teen Spirit,rock,69
2,Nirvana,Smells Like Teen Spirit,nirvana,43
3,Nirvana,Smells Like Teen Spirit,90s,42
4,Nirvana,Smells Like Teen Spirit,alternative,41
...,...,...,...,...
637615,Eraserheads,Kailan,opm,100
637616,Eraserheads,Kailan,eraserheads,100
637617,Eraserheads,Kailan,songs by bands with the suffix head,50
637618,Eraserheads,Kailan,tagalog songs,50


In [3]:
tags_summary = pd.read_csv(folder+"lfm_tags_summary.csv", index_col=0)
tags_summary

Unnamed: 0,Artist,Track,lfm_id,sum,mean,count,min,std,median,skew
0,10 Years,Wasteland,44101852,598.0,5.980000,100.0,2.0,13.748080,2.0,5.014670
1,10cc,Dreadlock Holiday,12754290,409.0,8.019608,51.0,2.0,16.851695,2.0,4.118480
2,10cc,Rubber Bullets,33773799,838.0,8.380000,100.0,4.0,15.704197,4.0,4.542838
3,12 Stones,Broken,7843169,592.0,5.920000,100.0,2.0,13.738304,2.0,4.841314
4,12 Stones,Crash,10346274,623.0,6.230000,100.0,2.0,14.507299,2.0,4.551917
...,...,...,...,...,...,...,...,...,...,...
9032,cLOUDDEAD,Dead Dogs Two,11245084,1136.0,20.285714,56.0,13.0,20.768608,13.0,3.149441
9033,dEUS,Little Arithmetics,24313420,1008.0,10.080000,100.0,6.0,16.370508,6.0,4.424193
9034,dEUS,Suds & Soda,37737921,928.0,9.280000,100.0,4.0,16.666715,4.0,4.075889
9035,mewithoutYou,January 1979,21517830,970.0,9.700000,100.0,5.0,14.784991,5.0,4.637763
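
The per-track statistics in the summary table above could be derived from the long tag list with a single group-by. This is a sketch on a small excerpt of the tag rows shown earlier (so the numbers do not match the full data); grouping by `Artist` and `Track` rather than by `lfm_id` is an assumption.

```python
import pandas as pd

# Excerpt of the tag list; only the rows visible in the table above are used.
tags = pd.DataFrame({
    "Artist": ["Nirvana"] * 5 + ["Eraserheads"] * 4,
    "Track": ["Smells Like Teen Spirit"] * 5 + ["Kailan"] * 4,
    "tag": ["grunge", "rock", "nirvana", "90s", "alternative",
            "opm", "eraserheads", "songs by bands with the suffix head", "tagalog songs"],
    "value": [100, 69, 43, 42, 41, 100, 100, 50, 50],
})

# One row per track, one column per statistic of the tag relevance values.
tags_summary = tags.groupby(["Artist", "Track"])["value"].agg(
    ["sum", "mean", "count", "min", "std", "median", "skew"]
)
print(tags_summary)
```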


In [15]:
# Embeddings of the tag
pd.read_csv(folder+"lfm_tags_embedding_glove_twitter_100.csv", index_col=0)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.993200,-0.943520,-1.429700,0.728390,0.912280,-0.248420,-0.089259,-0.528780,0.465920,-0.841810,...,0.006518,0.877780,-0.004188,-0.695630,-0.269390,0.019452,0.101450,-0.046923,-0.038568,0.156870
1,-0.512270,-0.493800,-0.367660,-0.266960,0.248130,0.658950,0.392170,-0.043532,-0.118320,-0.527790,...,-0.398020,-0.025849,-0.325490,-0.516890,-0.903580,0.430570,0.343310,-0.139090,-0.021698,-0.511740
2,-0.195930,-0.449240,-0.286630,0.259450,-0.005274,0.122810,-0.675900,0.339630,-0.300520,-0.625070,...,0.003571,0.174110,0.072275,-0.151620,-1.134400,0.417240,-0.560190,-0.345120,0.143320,0.260560
3,,,,,,,,,,,...,,,,,,,,,,
4,-0.039742,-0.667870,-0.476440,0.497770,0.801190,0.176540,-0.124440,-0.811710,-0.121510,-0.522170,...,-0.223440,-0.182450,-0.348200,-0.246750,-0.708170,-0.234670,-0.206360,0.857330,0.141460,-0.343440
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
637615,-1.543600,0.053963,-0.407890,-0.046700,-0.718410,-0.350390,-0.270660,0.586300,0.756770,-0.160830,...,0.653250,0.100750,0.478540,-0.425600,-0.166150,0.385950,0.272890,0.082261,-0.005118,-0.779320
637616,-0.894220,-0.175000,-0.232750,0.209840,-0.656370,0.439220,0.612250,0.155200,0.367550,-1.355800,...,0.233580,-1.010400,-0.076186,-0.307810,0.168550,0.334810,0.084215,0.683980,-0.145920,0.007488
637617,-0.316575,0.059000,-0.095398,0.045582,0.084365,0.362607,0.043337,0.075323,-0.134851,-0.352550,...,0.030745,0.235927,-0.352607,-0.015895,-0.313470,-0.002918,0.011385,0.122093,0.321602,0.031872
637618,-0.020300,-0.145164,-0.211676,0.337640,0.128744,0.206615,0.452510,0.544920,0.110019,-0.162865,...,1.064115,0.158290,-0.463140,-0.207000,-0.161038,-0.279690,0.087661,0.168220,0.227734,-0.056155


In [12]:
# Emotional values extracted for each tag
tags_emo = pd.read_csv(folder+"lfm_tags_emo.csv", index_col=0)
tags_emo

Unnamed: 0,TextBlob_polarity,TextBlob_subjectivity,VADER_neg,VADER_neu,VADER_pos,VADER_compound,PySentimiento_anger,PySentimiento_disgust,PySentimiento_fear,PySentimiento_joy,...,Anew_Dominance10_percentile,Anew_Dominance20_percentile,Anew_Dominance30_percentile,Anew_Dominance40_percentile,Anew_Dominance50_percentile,Anew_Dominance60_percentile,Anew_Dominance70_percentile,Anew_Dominance80_percentile,Anew_Dominance90_percentile,Anew_Dominance100_percentile
0,0.0,0.0,0.0,1.0,0.0,0.0,0.016438,0.027353,0.032218,0.030007,...,,,,,,,,,,
1,0.0,0.0,0.0,1.0,0.0,0.0,0.027198,0.043684,0.051776,0.046403,...,0.260,0.260,0.260,0.260,0.26,0.260,0.260,0.260,0.260,0.26
2,0.0,0.0,0.0,1.0,0.0,0.0,0.007660,0.009644,0.008943,0.023637,...,,,,,,,,,,
3,0.0,0.0,0.0,1.0,0.0,0.0,0.028505,0.042176,0.063775,0.046629,...,,,,,,,,,,
4,0.0,0.0,0.0,1.0,0.0,0.0,0.030830,0.048031,0.057703,0.038197,...,1.330,1.330,1.330,1.330,1.33,1.330,1.330,1.330,1.330,1.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
637615,0.0,0.0,0.0,1.0,0.0,0.0,0.021294,0.038223,0.046962,0.034553,...,,,,,,,,,,
637616,0.0,0.0,0.0,1.0,0.0,0.0,0.012624,0.024692,0.016272,0.020135,...,,,,,,,,,,
637617,0.0,0.0,0.0,1.0,0.0,0.0,0.002072,0.007240,0.003852,0.009678,...,0.560,0.560,0.560,0.560,0.56,0.838,1.116,1.394,1.672,1.95
637618,0.0,0.0,0.0,1.0,0.0,0.0,0.002635,0.003207,0.004368,0.022412,...,1.950,1.950,1.950,1.950,1.95,1.950,1.950,1.950,1.950,1.95


#### Emotions
Emotions are part of the user context of the listening event, because they are features of the user and at the same time may vary from event to event. However, we can also extract the emotional spectrum related to a specific song, which then describes context features of that track. LiveJournal provides such information: users stated the emotion they experienced when writing about an event and related a song to that text and emotion. This information is captured in the following table, which shows the emotion, the text, the user ID, and the track and artist.

In [13]:
lj2m_emotions = pd.read_csv("resources/DataSummary-5-Features/lj2m_data.csv")
lj2m_emotions

Unnamed: 0,user ID,Artist,Track,text,emotion,lfm_id
0,u480993,David Bisbal,Buleria,to to to to to to to and the the the the the t...,accomplished,7970735
1,u574291,Yellowcard,Only One,to to to to to to and the of of of my my that ...,accomplished,29730378
2,u402088,Yellowcard,Ocean Avenue,to to to and and and and and the the the the a...,accomplished,29222390
3,u535133,Maroon 5,This Love,to to to to to to to to to to to and and and a...,accomplished,41207406
4,u481022,Crossfade,Cold,to to to to to to to to to to to to to and and...,accomplished,9651905
...,...,...,...,...,...,...
741199,u574228,Ryan Cabrera,True,to to to to to and and and and the the the the...,worried,42245802
741200,u263576,Marilyn Manson,Great Big White World,to to to to to to to to to to to to to to and ...,worried,17232166
741201,u638919,Incubus,I Miss You,to to to to to to to and and and the the the t...,worried,19455382
741202,u623682,Eisley,I Wasn't Prepared,to to and and and and a a of that you you in i...,worried,19616463


After removing the users that belong to the test set, one can easily create a table that counts how often each emotion was reported for each song, thus giving the spectrum of emotions for each song.

In [14]:
# will need to be recomputed after removing test set
lj2m_emotions_counts = lj2m_emotions.groupby(["Artist", "Track", "emotion"])["emotion"].count().to_frame()
lj2m_emotions_counts = lj2m_emotions_counts.rename({"emotion": "counts"}, axis=1).reset_index().set_index(["Artist", "Track"])
lj2m_emotions_counts = lj2m_emotions_counts.pivot(columns="emotion")
lj2m_emotions_counts.columns = lj2m_emotions_counts.columns.droplevel()
lj2m_emotions_counts = lj2m_emotions_counts.reset_index()
lj2m_emotions_counts.head(5)

emotion,Artist,Track,accomplished,aggravated,amused,angry,annoyed,anxious,apathetic,artistic,...,sympathetic,thankful,thirsty,thoughtful,tired,touched,uncomfortable,weird,working,worried
0,10 Years,Wasteland,1.0,,,,,,,,...,,,,,,,,,,
1,10cc,Dreadlock Holiday,,,,,,,,,...,,,,,1.0,,,,,
2,10cc,Rubber Bullets,,,,,,,,1.0,...,,,,,,,,,,
3,12 Stones,Broken,4.0,3.0,6.0,2.0,4.0,2.0,2.0,1.0,...,1.0,1.0,1.0,2.0,11.0,1.0,1.0,2.0,,3.0
4,12 Stones,Crash,,2.0,1.0,1.0,,,,2.0,...,,1.0,,1.0,4.0,,,,1.0,1.0


Just out of curiosity, if we take the transpose of the above table, we end up with a list of emotions in which each feature is the number of times the emotion was linked to a given song. Applying LSA to it yields an embedding of the emotions, from which we can compute the cosine similarities between them. It is not perfect, but for the emotions "happy", "sad", "angry" and "amused" the result below shows a more interesting relationship than valence and arousal alone: happy and amused are the most similar pair (0.92), both are less similar to sad and angry, and sad and angry are the least similar pair (0.64).

In [15]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

# emotion counts per song; missing counts are treated as 0
emotion_counts = lj2m_emotions_counts.drop(["Artist", "Track"], axis=1).fillna(0)
# transpose: rows = emotions, columns = songs
emotions = emotion_counts.T.to_numpy()
# LSA: reduce each emotion to a 50-dimensional embedding
svd = TruncatedSVD(n_components=50, n_iter=10, random_state=42)
emotions_transformed = svd.fit_transform(emotions)
# pairwise cosine similarity between the emotion embeddings
emotion_similarity = pd.DataFrame(cosine_similarity(emotions_transformed), index=emotion_counts.columns, columns=emotion_counts.columns)
check_emotions = ["happy", "sad", "angry", "amused"]
emotion_similarity.loc[check_emotions, check_emotions]

emotion,happy,sad,angry,amused
happy,1.0,0.788586,0.67181,0.919236
sad,0.788586,1.0,0.635913,0.744266
angry,0.67181,0.635913,1.0,0.7396
amused,0.919236,0.744266,0.7396,1.0


The texts, in this case the words and the number of their occurrences, can also be used to extract further features. However, they will only be used at the event level.

## Users
Before jumping straight to the users, there is another thing we need to look at. User records usually contain some reference to their location, so we first need to make sure that this information is reasonably complete.

For most of the data, i.e. the lfm events, we only know the country of origin, and this for a large number of the users. For this reason, it is important to be able to use this information in some way. To operationalize it, I picked the non-numerical values, i.e. countryCode, continentName and currencyCode as well as the main language, and created one-hot vectors of each.
The reason for picking those features is that there might be hidden relationships between some of the countries, such as a shared language, a monetary union or a geographic region, all of which influence the culture.
The DataFrame, although not visible in this display, also contains information on the main/mean time zone, such that we can calculate the time of day of a certain event.
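
The one-hot encoding of the categorical country attributes can be sketched with `pd.get_dummies`. The three example rows follow the table below; the `language` column name and its values are assumptions, and the prefixed column names are a choice made here to avoid clashes between, for example, a country code and a language code.

```python
import pandas as pd

# Small excerpt of the country table; "language" is an assumed column.
countries = pd.DataFrame({
    "countryCode": ["AD", "AE", "AF"],
    "continentName": ["Europe", "Asia", "Asia"],
    "currencyCode": ["EUR", "AED", "AFN"],
    "language": ["ca", "ar", "ps"],  # illustrative values only
})

categorical = ["countryCode", "continentName", "currencyCode", "language"]

# One indicator column per distinct value, prefixed with the source column name.
onehot = pd.get_dummies(countries[categorical], prefix=categorical, dtype=float)

# Keep the original columns next to their one-hot encodings.
countries_encoded = pd.concat([countries, onehot], axis=1)
print(countries_encoded.filter(like="continentName_"))
```

Countries that share a continent, currency or language then share a 1 in the corresponding column, which is what exposes the relationships described above.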

In [19]:
countries = pd.read_csv("resources/DataSummary-5-Features/mmtd_countries.csv", index_col=0)
countries.head(3)

Unnamed: 0,countryCode,countryName,continentName,areaInSqKm,population,currencyCode,west,east,north,south,...,VN,VU,WF,WS,XK,YE,YT,ZA,ZM,ZW
0,AD,Andorra,Europe,468.0,84000,EUR,1.40719,1.78654,42.656,42.4285,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,AE,United Arab Emirates,Asia,82880.0,4975593,AED,51.5833,56.3817,26.0842,22.6333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,AF,Afghanistan,Asia,647500.0,29121286,AFN,60.4784,74.8794,38.4834,29.3775,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Since we also have the coordinates when we work with tweets, we can get more fine-grained information, especially considering large countries such as Brazil and the US, where tastes might differ depending on where the event happens. Below is a list of all the individual locations and their coordinates from the Million Musical Tweets Dataset.
To make it more useful, but also not too granular, both the state and the city are encoded as one-hot vectors if more than 100 events were recorded for them, as shown in the second and third cells.
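
The thresholding step could look like the sketch below, which keeps only cities with more than 100 events and gives every other event an all-zero vector. The toy event table and its `city` column are assumptions; the stored files appear to hold one row per location rather than per event, so this only illustrates the filtering logic.

```python
import pandas as pd

MIN_EVENTS = 100  # threshold named in the text

# Hypothetical per-event city labels.
events = pd.DataFrame({
    "city": ["Manaus"] * 150 + ["Potsdam"] * 120 + ["Shaver Lake"] * 3,
})

# Count events per city and keep the cities above the threshold.
counts = events["city"].value_counts()
frequent = counts[counts > MIN_EVENTS].index

# Rare cities are replaced by NaN, which get_dummies encodes as an all-zero row.
city_or_nan = events["city"].where(events["city"].isin(frequent))
city_onehot = pd.get_dummies(city_or_nan, dtype=float)
print(city_onehot.sum())
```

The same code would apply to states by swapping the column.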

In [40]:
locations = pd.read_csv("resources/DataSummary-5-Features/mmtd_locations.csv")
locations

In [44]:
pd.read_csv("resources/DataSummary-5-Features/mmtd_city_onehot.csv", index_col=0)

Unnamed: 0,City of Westminster,Potsdam,New Westminster,Diadema,Kuala Lumpur,RW 02,Karaağaç,Manaus,Mesquita,Los Angeles,...,RW 03,Columbus,Shaver Lake,Johor Bahru,Ribeirão Pires,Birmingham,Denpasar,Sacramento,Petaling Jaya,Guayaquil
City of Westminster,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Potsdam,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
New Westminster,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Diadema,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Kuala Lumpur,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Birmingham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Denpasar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Sacramento,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Petaling Jaya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [45]:
pd.read_csv("resources/DataSummary-5-Features/mmtd_state_onehot.csv", index_col=0)

Unnamed: 0,England,Brandenburg,São Paulo,CA,Jakarta Special Capital Region,BC,TX,Île-de-France,Rio de Janeiro,NY,...,UT,IA,Department of Lima,Aquitaine,Guayas,Bogotá,Maharashtra,Kinki Region,DE,Galicia
England,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Brandenburg,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
São Paulo,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CA,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jakarta Special Capital Region,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Bogotá,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Maharashtra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Kinki Region,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
DE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Now that we have the locations sorted out, let's look at some other user properties.

### LFM Users
The LFM user data contains some relevant features that might be quite interesting: country, age, gender and creation time.

The country feature will later be replaced by the country information from above. The creation time was converted into "days_since_creation", which counts the days between the creation_time feature and the last listening event in the lfm dataset; the original creation_time is ignored from here on. While filtering the songs to decide which would go into the intersection, I also counted for each user how many were part of the final intersection ("in") and how many were not ("out"). Together they sum up to the "total". "Coverage" is "in"/"total" and gives some indication of how well the user is going to be represented within the existing dataset. The age, which was a numeric value, has been changed to a one-hot vector. The reason is that this vector means "with 100% certainty, the user was this age". Later, when inferring, the algorithm might return different ages with different confidences. Note that -1 means that the information was not provided, and some ages, such as 0-10 or >100, might in some cases mean that the user provided a fake age.

The same one-hot representation applies to gender, with values for female (f), male (m) and neutral (n).
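
The derived user features can be sketched as follows. The `in` and `out` counts and creation times are taken from the first rows of the table below, while the ages, the `age` and `gender` column names and the date of the last listening event are placeholders chosen for illustration.

```python
import pandas as pd

# Excerpt of the user table; "age" values are hypothetical.
users = pd.DataFrame({
    "user_id": [2, 6, 20],
    "creation_time": pd.to_datetime(
        ["2002-10-29 01:00:00", "2003-07-23 02:00:00", "2003-03-19 13:18:50"]
    ),
    "age": [31, 27, -1],
    "gender": ["m", "n", "n"],
    "in": [60.0, 31.0, 2.0],
    "out": [662.0, 2725.0, 339.0],
})

# Placeholder for the timestamp of the last listening event in the dataset.
last_event = pd.Timestamp("2020-03-20")

users["days_since_creation"] = (last_event - users["creation_time"]).dt.days
users["total"] = users["in"] + users["out"]
users["coverage"] = users["in"] / users["total"]

# One column per possible age from -1 (not provided) to 114, as in the table.
age_onehot = pd.get_dummies(users["age"], dtype=float).reindex(
    columns=range(-1, 115), fill_value=0.0
)
# One column per gender code, including codes absent from this excerpt.
gender_onehot = pd.get_dummies(users["gender"], dtype=float).reindex(
    columns=["f", "m", "n"], fill_value=0.0
)

users_encoded = pd.concat([users, age_onehot, gender_onehot], axis=1)
print(users_encoded[["user_id", "days_since_creation", "coverage", "f", "m", "n"]])
```

Reindexing to the full range keeps the vector length fixed, so an inferred age distribution can later be compared against it column by column.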

In [46]:
lfm_users = pd.read_csv("resources/DataSummary-5-Features/lfm_users_sub.csv")
lfm_users

Unnamed: 0,user_id,country,creation_time,days_since_creation,in,out,total,coverage,-1,0,...,108,109,110,111,112,113,114,f,m,n
0,2,UK,2002-10-29 01:00:00,6352,60.0,662.0,722.0,0.083102,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0.0,1.0,0.0
1,6,AT,2003-07-23 02:00:00,6085,31.0,2725.0,2756.0,0.011248,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,1.0
2,14,UK,2003-02-18 21:44:13,6239,614.0,17479.0,18093.0,0.033936,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0.0,1.0,0.0
3,20,,2003-03-19 13:18:50,6210,2.0,339.0,341.0,0.005865,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,1.0
4,24,,2003-07-17 02:00:00,6091,2.0,208.0,210.0,0.009524,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13179,119942,NL,2012-05-28 17:50:26,2852,83.0,896.0,979.0,0.084780,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,1.0,0.0,0.0
13180,119957,PL,2012-05-28 19:59:37,2852,34.0,4453.0,4487.0,0.007577,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,1.0,0.0,0.0
13181,119965,,2012-05-28 20:12:04,2852,9.0,4660.0,4669.0,0.001928,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,1.0
13182,120095,RU,2012-05-29 21:19:14,2851,6.0,201.0,207.0,0.028986,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,1.0,0.0,0.0


#### LJ2M emotional spectrum
Using the emotions that the users posted on LJ2M, we can create an emotional spectrum for each user. It contains a lot of missing values, so it is hard to tell yet whether it is going to be useful. Nevertheless, it might capture some interesting information for some users.
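
A per-user spectrum of this kind could be computed with a cross-tabulation, as in the sketch below. The toy posts are hypothetical, and turning zero counts into missing values is an assumption made to mirror the look of the table; the normalized variant is an optional alternative, not something the stored file is known to contain.

```python
import numpy as np
import pandas as pd

# Hypothetical posts: one row per (user, reported emotion).
posts = pd.DataFrame({
    "user ID": ["u1000", "u1000", "u1000", "u100004", "u0"],
    "emotion": ["thoughtful", "tired", "tired", "uncomfortable", "tired"],
})

# Count how often each user reported each emotion; 0 becomes a missing value.
spectrum = pd.crosstab(posts["user ID"], posts["emotion"]).replace(0, np.nan)

# Alternative: shares per user, so that heavy and light posters are comparable.
spectrum_share = pd.crosstab(posts["user ID"], posts["emotion"], normalize="index")
print(spectrum)
```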

In [11]:
pd.read_csv("resources/DataSummary-5-Features/lj2m_user_emotion_summary.csv")

Unnamed: 0,user ID,accomplished,aggravated,amused,angry,annoyed,anxious,apathetic,artistic,awake,...,sympathetic,thankful,thirsty,thoughtful,tired,touched,uncomfortable,weird,working,worried
0,u0,,,,,,,,,,...,,,,,,,,,,
1,u1000,,,,,,,,,,...,,,,1.0,,,,,,
2,u10000,,,,,,,,,,...,,,,,,,,,,
3,u100001,,,,,,,,,,...,,,,,,,,,,
4,u100004,,,,,,,,,,...,,,,,,,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
400110,u99982,,,,,,,,,,...,,,,,,,,,,
400111,u99988,,,,,,,,,,...,,,,,,,,,,
400112,u9999,,,,,,,,,,...,,,,,,,,,,
400113,u99993,,,,,,,,,,...,,,,,,,,,,


#### LJ2M text, aggregated by user
By concatenating all the texts that each user provided, we can differentiate between users using the semantic (embedding) and the emotional content of their texts. Below are the DataFrames with the texts in their LJ2M format, the embeddings and the texts' emotional values.
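
The aggregation can be sketched as a group-by followed by an embedding step. The toy posts and the 3-dimensional `toy_vectors` are hypothetical stand-ins for the LJ2M texts and the 100-dimensional GloVe Twitter vectors, and averaging the word vectors is an assumption about how a text embedding might be formed, not a description of the preprocessing actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical posts in the bag-of-words style of LJ2M.
posts = pd.DataFrame({
    "user ID": ["u0", "u0", "u1000"],
    "text": ["to to and the", "my that", "and and a"],
})

# Concatenate all texts of a user into one string.
user_text = posts.groupby("user ID")["text"].agg(" ".join).reset_index()

# Toy word vectors standing in for a pretrained embedding.
toy_vectors = {
    "to": np.array([1.0, 0.0, 0.0]),
    "and": np.array([0.0, 1.0, 0.0]),
    "the": np.array([0.0, 0.0, 1.0]),
}

def mean_embedding(text, vectors, dim=3):
    """Average the vectors of all known words; NaN if no word is known."""
    found = [vectors[word] for word in text.split() if word in vectors]
    return np.mean(found, axis=0) if found else np.full(dim, np.nan)

# One embedding row per user, in the order of user_text.
user_embedding = np.vstack(
    [mean_embedding(text, toy_vectors) for text in user_text["text"]]
)
print(user_text)
print(user_embedding)
```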

In [12]:
pd.read_csv("resources/DataSummary-5-Features/lj2m_user_total_text.csv")

Unnamed: 0,user ID,text
0,u0,to to to to to to to and and and the the the t...
1,u1000,to and and and and the the the the the a a a a...
2,u10000,to to to and a a of that that you so so s is i...
3,u100001,to to the the the the the the a you you you in...
4,u100004,to to to to and and and and a a a a a it it my...
...,...,...
400110,u99982,to to to to to to to to to to to to and and an...
400111,u99988,to to to to and and and and the the the a a it...
400112,u9999,to to to to to to to to to to to to to to to a...
400113,u99993,to to to to to to to to to to to to to to to t...


In [22]:
pd.read_csv(folder+"lj2m_user_total_text_embedding_glove_twitter_100.csv")

Unnamed: 0,user ID,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,u0,-0.018698,0.029178,0.110071,0.101854,-0.036385,0.092955,0.225958,0.034222,0.006401,...,0.017720,0.301723,0.055720,0.068713,-0.049797,-0.083248,-0.138404,-0.082758,0.071996,-0.006199
1,u1000,0.118499,0.104947,0.004998,-0.037055,-0.038516,-0.013582,0.148588,0.038515,0.081217,...,-0.101630,-0.021785,0.021197,-0.071599,0.005077,0.034234,-0.055358,0.000562,0.053952,0.002734
2,u10000,-0.007958,0.050466,0.078728,0.039875,0.030995,0.211283,0.390878,-0.040201,-0.073213,...,-0.184727,0.301685,-0.074408,-0.020765,-0.038711,0.032753,-0.076952,0.007679,0.069787,-0.077843
3,u100001,0.100385,0.206977,0.142843,-0.063728,0.001191,-0.004611,0.182074,-0.011257,0.018906,...,-0.012605,0.052539,0.095069,-0.002301,-0.153952,-0.064632,0.042534,-0.029748,0.013648,0.082490
4,u100004,0.071876,0.103079,0.061939,0.001675,-0.019356,0.128655,0.181921,0.032280,0.017077,...,0.007239,0.218067,0.014791,0.098984,-0.191433,-0.056514,0.028248,0.022998,0.137858,-0.012365
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
400110,u99982,0.099139,0.100407,0.040894,0.007159,-0.046733,0.166010,0.258399,0.030039,-0.050881,...,-0.085469,0.231958,-0.037773,0.041008,-0.115438,-0.092264,-0.068844,-0.024307,0.120050,-0.071987
400111,u99988,0.113329,0.155983,0.075374,0.058440,-0.019345,0.200470,0.214894,0.034920,-0.123624,...,0.061971,0.161445,0.108136,-0.029947,-0.184785,-0.131793,-0.063728,-0.112603,0.130668,-0.023567
400112,u9999,0.043873,0.120641,0.041481,0.017668,-0.075188,0.018365,0.270723,-0.015486,0.050448,...,-0.096632,0.159654,0.054096,0.064705,-0.052580,-0.013213,-0.053450,-0.001375,0.124463,0.005854
400113,u99993,0.086157,0.172574,0.104767,-0.011321,-0.132366,0.120103,0.249130,0.082931,0.004161,...,-0.079009,0.242756,0.010542,0.032084,-0.117222,-0.033031,0.009273,0.003720,0.102940,-0.057164


In [41]:
pd.read_csv(folder+"lj2m_user_total_text_emo.csv")

Unnamed: 0,user ID,TextBlob_polarity,TextBlob_subjectivity,VADER_neg,VADER_neu,VADER_pos,VADER_compound,PySentimiento_anger,PySentimiento_disgust,PySentimiento_fear,...,Anew_Dominance10_percentile,Anew_Dominance20_percentile,Anew_Dominance30_percentile,Anew_Dominance40_percentile,Anew_Dominance50_percentile,Anew_Dominance60_percentile,Anew_Dominance70_percentile,Anew_Dominance80_percentile,Anew_Dominance90_percentile,Anew_Dominance100_percentile
0,u0,-0.016414,0.553950,0.130,0.772,0.098,-0.7579,0.011193,0.077125,0.002392,...,-0.736,-0.248,0.366,0.396,0.560,0.784,0.936,1.000,1.256,2.26
1,u1000,-0.173333,0.680000,0.216,0.660,0.124,-0.8360,0.169355,0.180433,0.003173,...,-0.660,-0.048,0.368,0.656,0.720,0.870,1.065,1.114,1.174,1.38
2,u10000,0.276667,0.536667,0.081,0.830,0.088,0.1027,0.002371,0.004296,0.001888,...,0.000,0.190,0.515,0.610,0.820,0.840,1.260,1.370,1.885,2.21
3,u100001,0.015625,0.468750,0.042,0.939,0.019,-0.2960,0.003622,0.011413,0.001674,...,-0.345,0.040,0.170,0.430,0.465,0.590,0.650,0.940,1.245,1.88
4,u100004,-0.048214,0.550595,0.070,0.790,0.141,0.6550,0.008252,0.017637,0.001736,...,-0.034,0.364,0.438,0.780,0.890,0.900,1.280,1.358,1.542,2.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
400110,u99982,0.129720,0.514423,0.000,0.882,0.118,0.9524,0.003477,0.009952,0.001658,...,-0.078,0.180,0.220,0.408,0.650,0.802,0.940,1.354,1.660,2.21
400111,u99988,-0.109184,0.590816,0.188,0.697,0.116,-0.7003,0.003572,0.014562,0.001607,...,-0.645,-0.220,0.365,0.780,0.795,0.810,0.915,1.410,1.785,2.16
400112,u9999,0.018154,0.455139,0.086,0.714,0.200,0.9928,0.005923,0.015154,0.001493,...,-0.740,0.120,0.330,0.500,0.780,0.940,1.282,1.380,1.588,2.37
400113,u99993,0.037003,0.518034,0.091,0.711,0.198,0.9950,0.002958,0.011513,0.001261,...,-0.640,-0.170,0.220,0.434,0.710,0.780,0.889,1.240,1.533,2.04


#### User profile by listening habits
A problem yet to be solved is how to handle the following very sparse matrices in terms of memory. They hold the data that can most easily describe users across platforms: their listening habits. 
The matrices below are interaction matrices: the rows are the users, the columns are the tracks, and each value is the number of times a given user listened to a given track.
By applying, for example, LSA (latent semantic analysis) to all of the matrices combined (after normalizing the values), one can describe users through latent semantic variables, independently of the source of the data, thus unifying the datasets.
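A minimal sketch of this idea, assuming the play counts are available in long format (one row per user-track pair). The toy data, the row-wise L2 normalization and the number of components are illustrative assumptions, not choices made in the project. Storing the matrix in SciPy's CSR format keeps only the non-zero entries, which is one possible answer to the memory problem, and truncated SVD on such a matrix is the usual way to compute LSA.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Toy play counts in long format; the user prefixes are hypothetical and only
# mark which source (LFM, MMTD, LJ2M) a user comes from.
plays = pd.DataFrame({
    "user":  ["lfm_2", "lfm_2", "lfm_14", "mmtd_3249", "mmtd_3249", "lj2m_u0", "lj2m_u0"],
    "track": [23570, 60189, 60189, 23570, 189564, 1592801, 1636891],
    "count": [3, 1, 2, 1, 5, 1, 4],
})

# Map users and tracks to row and column positions.
user_codes, user_index = pd.factorize(plays["user"])
track_codes, track_index = pd.factorize(plays["track"])

# Only non-zero entries are stored, so memory grows with the number of
# observed user-track pairs rather than with n_users * n_tracks.
matrix = sparse.csr_matrix(
    (plays["count"].to_numpy(dtype=np.float64), (user_codes, track_codes)),
    shape=(len(user_index), len(track_index)),
)

# Assumed normalization: scale every user row to unit L2 norm so that
# heavy listeners do not dominate the decomposition.
matrix = normalize(matrix, norm="l2", axis=1)

# LSA as truncated SVD on the sparse matrix; n_components=2 is illustrative.
svd = TruncatedSVD(n_components=2, random_state=0)
user_profiles = pd.DataFrame(svd.fit_transform(matrix), index=user_index)
```

Each row of `user_profiles` is then a dense, low-dimensional description of one user, regardless of the dataset the user comes from.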


The first cell shows the LFM user-track matrix, followed by the MMTD (Million Musical Tweets Dataset) matrix (split into two files due to its size).

In [2]:
lfm_iter = pd.read_csv("resources/DataSummary-5-Features/lfm_interaction_matrix.csv", index_col=0)

In [3]:
lfm_iter

lfm_user_id,23570,60189,189564,1592801,1636891,1675423,1711943,1763141,1770945,1988698,...,46187719,46202411,46210252,46225268,46229272,46231620,46242855,46244293,46275910,47112160
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
119957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
119965,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
120095,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
# just to show that it's not only 0s
lfm_iter.sum()

23570         1.0
60189         3.0
189564       43.0
1592801     206.0
1636891       3.0
            ...  
46231620    100.0
46242855     17.0
46244293     71.0
46275910    362.0
47112160    368.0
Length: 8772, dtype: float64

In [2]:
pd.read_csv("resources/DataSummary-5-Features/mmtd_users_m1_interaction_matrix.csv", index_col=0)

tweet_userId,23570,60189,1592801,1636891,1675423,1711943,1770945,1988698,2019862,2055061,...,46187719,46202411,46210252,46225268,46229272,46231620,46242855,46244293,46275910,47112160
3249,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10440,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
11916,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21803,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
625263,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174587171,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
174593632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
174607283,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
174608192,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
pd.read_csv("resources/DataSummary-5-Features/mmtd_users_m2_interaction_matrix.csv", index_col=0)

tweet_userId,23570,60189,189564,1592801,1675423,1711943,1763141,1770945,1988988,2153869,...,46088687,46160108,46187719,46210252,46225268,46229272,46231620,46244293,46275910,47112160
174646237,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
174675482,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
174700677,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
174705653,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
174713486,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1348283684,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1353709112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1359477876,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1366274922,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### LJ2M
In the LJ2M dataset, many listeners listened to no more than a handful of tracks. The resulting interaction matrix is not very useful, and it is so large and sparse that it does not fit in RAM. Therefore, the interaction matrix only includes users who listened to 5 or more tracks; for the remaining users, the listening profile is treated as missing values. This has the additional benefit that listeners are not judged on the basis of a single song, which might otherwise have too large an influence on the results.
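The filtering step could look roughly like the sketch below. It assumes an event table with the `user ID` and `lfm_id` columns of `lj2m_data.csv` and interprets "5 or more tracks" as five or more distinct tracks, which is an assumption; the toy data is made up.

```python
import pandas as pd

# Toy stand-in for the LJ2M event table (column names as in lj2m_data.csv).
events = pd.DataFrame({
    "user ID": ["u1"] * 5 + ["u2"] * 2 + ["u3"] * 6,
    "lfm_id":  [1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 5],
})

MIN_TRACKS = 5  # threshold stated in the text

# Count the distinct tracks per user (assumed reading of "5 or more tracks").
tracks_per_user = events.groupby("user ID")["lfm_id"].nunique()
active_users = tracks_per_user[tracks_per_user >= MIN_TRACKS].index

# Keep only the events of sufficiently active users, then count plays
# per user-track pair to obtain the interaction matrix.
active_events = events[events["user ID"].isin(active_users)]
interaction = pd.crosstab(active_events["user ID"], active_events["lfm_id"])
```

In this toy example `u2` is dropped because only two distinct tracks are known for that user.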

In [26]:
pd.read_csv(folder+"lj2m_interaction_matrix.csv")

Unnamed: 0,user ID,23570,60189,189564,1592801,1636891,1675423,1711943,1763141,1988988,...,46064077,46094401,46160108,46187719,46210252,46225268,46229272,46231620,46275910,47112160
0,u256056,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,u256063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,u256064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,u256065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,u256071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20551,u649689,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20552,u649695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20553,u649702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20554,u649704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Listening Events
These are the items that form the basis of the samples being fed into the system: listening events. 
They are characterized by the following: a specific **user** listens/interacts with a specific **track** at a specific **time** while in a specific **context**.

Below is a list of such events. The information is not complete for every event, but it is hopefully enough to learn a lot.

The first dataset is the LFM dataset, which provides the user id, the track id, and the local time (calculated by adding the time-zone offset of the user's country). Since a datetime object is not simple to use in our system, and since the time span of the dataset is limited, only the weekday and the hour will be considered. 
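As an illustration, the weekday and hour can be derived with pandas as sketched below. The column names `utc_time` and `utc_offset_hours` and the offsets themselves are hypothetical; only `local_time`, `weekday` and `hour` appear in the preprocessed file.

```python
import pandas as pd

# Two toy events with a UTC timestamp and a hypothetical per-country offset.
events = pd.DataFrame({
    "utc_time": pd.to_datetime(["2020-01-01 00:00:03", "2020-01-01 00:00:07"]),
    "utc_offset_hours": [1, -3],
})

# Local time = UTC time plus the offset of the user's country.
events["local_time"] = events["utc_time"] + pd.to_timedelta(
    events["utc_offset_hours"], unit="h"
)

# Keep only the coarse features: weekday (Monday = 0) and hour of day.
events["weekday"] = events["local_time"].dt.dayofweek
events["hour"] = events["local_time"].dt.hour
```

Note that a single offset per country ignores daylight saving time and countries spanning several time zones, so the resulting hour is an approximation.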

For the second dataset, MMTD provides some additional information: besides the track, user_id and time, we also have more specific information on the location of the event, including the coordinates. Whether the location will play a role or be ignored has not yet been decided.

In [None]:
# lfm listening events

In [28]:
pd.read_csv("resources/DataSummary-5-Features/lfm_events_preprocessed_time.csv")

Unnamed: 0,lfm_user_id,lfm_track_id,local_time,day,weekday,hour
0,14807,21889387,2020-01-01 01:00:03,0,2,1
1,21778,14820813,2019-12-31 21:00:07,0,1,21
2,2007,17825087,2019-12-31 17:00:21,0,1,17
3,19326,28215546,2020-01-01 01:00:45,0,2,1
4,55135,7006853,2020-01-01 01:00:47,0,2,1
...,...,...,...,...,...,...
994655,26320,31933800,2020-03-20 12:59:23,79,4,12
994656,64572,6171792,2020-03-20 09:59:27,79,4,9
994657,76161,10192989,2020-03-20 13:59:40,79,4,13
994658,38545,15779436,2020-03-20 13:59:43,79,4,13


In [None]:
# mmtd tweet listening event

In [29]:
pd.read_csv("resources/DataSummary-5-Features/mmtd_tweeting_events.csv", index_col=0)

tweet_userId,tweet_datetime,tweet_weekday,tweet_longitude,tweet_latitude,lfm_id,country,state,city,timezone,local_time,hour,weekday,day
199729912.0,2012-02-09 00:48:30,3.0,-46.596,-23.683,21046049.0,BR,São Paulo,Diadema,-1 days +21:00:00,2012-02-08 21:48:30,21.0,2.0,91.0
169895729.0,2012-05-12 00:31:42,5.0,-40.334,-20.278,21046049.0,BR,Espírito Santo,Vitória,-1 days +21:00:00,2012-05-11 21:31:42,21.0,4.0,184.0
78146702.0,2012-01-27 16:49:19,4.0,-34.939,-8.044,21046049.0,BR,Pernambuco,Recife,-1 days +21:00:00,2012-01-27 13:49:19,13.0,4.0,79.0
199729912.0,2012-01-28 01:35:21,5.0,-46.596,-23.683,21046049.0,BR,São Paulo,Diadema,-1 days +21:00:00,2012-01-27 22:35:21,22.0,4.0,79.0
156019258.0,2012-04-02 13:05:02,0.0,-51.092,-27.653,21046049.0,BR,Santa Catarina,Anita Garibaldi,-1 days +21:00:00,2012-04-02 10:05:02,10.0,0.0,145.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
27323694.0,2013-04-09 16:05:09,1.0,-75.411,7.984,9812686.0,CO,Córdoba,Montelíbano,0 days 05:00:00,2013-04-09 21:05:09,21.0,1.0,517.0
354245578.0,2013-04-09 18:40:31,1.0,32.695,39.985,40489942.0,TR,Ankara,Batıkent,0 days 02:00:00,2013-04-09 20:40:31,20.0,1.0,517.0
354961894.0,2013-04-09 18:51:01,1.0,30.329,59.934,4586285.0,RU,Saint Petersburg,Муниципальный округ № 78,0 days 04:00:00,2013-04-09 22:51:01,22.0,1.0,517.0
101532682.0,2013-04-10 00:20:46,2.0,-48.460,-1.273,7038514.0,BR,Pará,Belém,-1 days +21:00:00,2013-04-09 21:20:46,21.0,1.0,517.0


#### Textual Events
Lastly, the LJ2M dataset also contains events; however, they carry no information about the time. These values will be treated as missing data, but the text of each event provides some context for it.

For now, the only link between these datasets is the tracks that are listened to. Later, with more datasets from Twitter and YouTube, there will be a stronger connection between the temporal aspect and the textual/contextual aspect of music listening.

Below is the list of events, followed by the NLP features: the embedding of each text and its emotional values.
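One common way to turn a bag of words into a fixed-length vector is to average the pre-trained vectors of its words. The sketch below shows this with made-up 3-dimensional vectors standing in for the 100-dimensional GloVe Twitter vectors; whether the stored embedding files were produced by exactly this pooling is an assumption, and `embed_text` is a hypothetical helper.

```python
import numpy as np

# Made-up 3-dimensional word vectors; the real GloVe Twitter vectors
# have 100 dimensions and would be loaded from file.
word_vectors = {
    "to":  np.array([0.1, 0.0, 0.2]),
    "and": np.array([0.3, 0.1, 0.0]),
    "the": np.array([0.0, 0.2, 0.1]),
}
DIM = 3


def embed_text(text):
    """Average the vectors of all known tokens; return zeros if none is known."""
    vectors = [word_vectors[tok] for tok in text.split() if tok in word_vectors]
    if not vectors:
        return np.zeros(DIM)
    return np.mean(vectors, axis=0)


# Repeated words are counted once per occurrence; unknown words are skipped.
embedding = embed_text("to to and the unknownword")
```

Because the LJ2M texts are stored as word lists with repetitions, such an average implicitly weights every word by its frequency in the text.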

In [10]:
# lj2m commenting event
pd.read_csv("resources/DataSummary-5-Features/lj2m_data.csv")

Unnamed: 0,user ID,Artist,Track,text,emotion,lfm_id
0,u480993,David Bisbal,Buleria,to to to to to to to and the the the the the t...,accomplished,7970735
1,u574291,Yellowcard,Only One,to to to to to to and the of of of my my that ...,accomplished,29730378
2,u402088,Yellowcard,Ocean Avenue,to to to and and and and and the the the the a...,accomplished,29222390
3,u535133,Maroon 5,This Love,to to to to to to to to to to to and and and a...,accomplished,41207406
4,u481022,Crossfade,Cold,to to to to to to to to to to to to to and and...,accomplished,9651905
...,...,...,...,...,...,...
741199,u574228,Ryan Cabrera,True,to to to to to and and and and the the the the...,worried,42245802
741200,u263576,Marilyn Manson,Great Big White World,to to to to to to to to to to to to to to and ...,worried,17232166
741201,u638919,Incubus,I Miss You,to to to to to to to and and and the the the t...,worried,19455382
741202,u623682,Eisley,I Wasn't Prepared,to to and and and and a a of that you you in i...,worried,19616463


In [32]:
pd.read_csv(folder+"lj2m_text_embedding_glove_twitter_100.csv", index_col=0)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.100051,0.091393,0.047161,-0.053448,-0.068121,0.084335,0.190180,0.086093,0.024028,0.100992,...,-0.064083,0.090171,-0.012714,0.073783,-0.048405,-0.041748,-0.072649,-0.004187,0.012007,-0.037510
1,-0.050872,0.056631,0.040612,-0.061663,0.070380,0.146213,0.268352,0.131356,-0.112119,0.157449,...,0.023440,0.265152,0.017009,0.065344,-0.026274,-0.119465,-0.072494,0.024055,0.030977,-0.120700
2,0.064653,0.056814,0.003300,-0.011294,-0.079213,0.065656,0.192188,0.023729,-0.030252,0.041595,...,-0.043433,0.171674,-0.046267,0.073043,-0.018012,0.096863,-0.093037,-0.119330,0.096027,-0.076608
3,0.160131,0.206913,0.056646,-0.085565,-0.073290,-0.060308,0.203218,0.118658,-0.134713,0.017045,...,0.030030,0.129533,-0.109807,0.045477,-0.189301,0.081680,-0.029653,-0.031589,0.106273,-0.001746
4,0.017266,0.132926,0.101180,-0.080306,-0.004875,0.100443,0.229950,0.071523,-0.070328,0.091078,...,-0.087798,0.188011,-0.006996,0.051310,-0.131943,-0.052801,0.011637,-0.017380,0.090757,-0.051881
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
741199,0.081560,0.107166,0.118414,-0.083207,-0.006886,0.053342,0.259320,0.084762,0.072651,0.119559,...,-0.102838,0.165551,0.090444,0.082418,-0.087254,-0.161675,-0.094361,0.041281,0.035553,-0.045858
741200,0.011536,0.139774,0.053557,0.026574,-0.025702,0.083013,0.357147,0.029079,0.036002,-0.066802,...,0.027691,0.238813,0.019730,0.078050,-0.033584,-0.017024,0.045487,0.033575,0.026703,-0.025444
741201,0.100653,0.076352,0.030791,0.073395,-0.006770,0.124359,0.175431,0.064076,-0.010600,0.186473,...,-0.062049,0.149413,0.015447,0.139372,-0.034288,-0.039904,-0.086798,-0.094509,0.113498,0.030957
741202,0.072596,0.151326,0.157539,-0.034349,-0.130824,0.008158,0.321873,0.062502,-0.065468,0.062827,...,-0.052319,0.198383,0.051453,-0.024349,-0.140099,-0.058381,-0.083570,-0.059348,0.243565,-0.083383


In [40]:
pd.read_csv(folder+"lj2m_text_emo.csv", index_col=0)

Unnamed: 0,TextBlob_polarity,TextBlob_subjectivity,VADER_neg,VADER_neu,VADER_pos,VADER_compound,PySentimiento_anger,PySentimiento_disgust,PySentimiento_fear,PySentimiento_joy,...,Anew_Dominance10_percentile,Anew_Dominance20_percentile,Anew_Dominance30_percentile,Anew_Dominance40_percentile,Anew_Dominance50_percentile,Anew_Dominance60_percentile,Anew_Dominance70_percentile,Anew_Dominance80_percentile,Anew_Dominance90_percentile,Anew_Dominance100_percentile
0,-0.023260,0.518681,0.118,0.754,0.128,-0.0772,0.004417,0.010661,0.001706,0.006051,...,-0.838,-0.092,0.049,0.342,0.425,0.674,0.806,1.144,1.524,2.39
1,0.150000,0.525000,0.067,0.768,0.164,0.8807,0.001750,0.003345,0.001575,0.014853,...,-0.049,0.062,0.220,0.404,0.575,0.836,1.069,1.370,1.408,2.21
2,0.034207,0.493910,0.190,0.660,0.150,-0.9435,0.005672,0.020590,0.001617,0.005091,...,-0.752,-0.082,0.330,0.554,0.830,1.000,1.250,1.298,1.500,2.74
3,-0.123437,0.518750,0.084,0.882,0.034,-0.7096,0.004311,0.016640,0.001814,0.004700,...,-0.330,0.120,0.195,0.280,0.355,0.410,0.680,1.060,1.230,1.43
4,0.092905,0.496795,0.062,0.790,0.148,0.9823,0.002599,0.007065,0.001291,0.008718,...,-0.521,0.012,0.276,0.566,0.680,0.810,1.006,1.228,1.473,2.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
741199,0.148052,0.545887,0.211,0.615,0.174,-0.8338,0.746941,0.128604,0.002634,0.004683,...,-0.175,0.590,0.680,0.890,0.940,0.940,1.095,1.120,1.485,1.89
741200,0.266176,0.529412,0.041,0.718,0.241,0.9861,0.001744,0.002136,0.001211,0.044421,...,0.120,0.180,0.338,0.432,0.590,0.800,0.896,1.380,2.100,2.33
741201,0.143902,0.557668,0.119,0.605,0.276,0.9905,0.011503,0.165909,0.002042,0.004950,...,-1.242,-0.812,-0.220,0.096,0.350,0.604,1.000,1.410,1.551,1.79
741202,-0.001786,0.600000,0.079,0.772,0.149,0.6486,0.002317,0.017170,0.001941,0.004620,...,-0.104,0.256,0.584,0.702,0.790,0.932,1.098,1.434,1.884,1.95


# Conclusion

The most time-consuming part is now finished: the data has been cleaned and preprocessed and is ready to be used. 
To summarize: the purpose of this project is to create a dataset of music listening events labeled with emotion. To do so, one needs a set of listening events and, through the similarity of the tracks' and listeners' properties and context, tries to transfer the emotional labels from events whose labels are known to events whose labels are not yet known.

In the last section of the Data chapter, the listening events were given, each containing a user_id, a track_id, a timestamp and some event-specific user context (LJ2M text). These map to the information provided in the other sections of the Data chapter, which cover the following: 

1. Track Content:
- Spotify's algorithms compute and openly provide some features describing the song's content, such as "danceability", "energy" or the musical key.
- From the lyrics, we can extract features regarding the semantics and the emotion/sentiment of the text.
2. Track Context:
- Track co-occurrence in a playlist provides similarity measures between tracks, especially if we interpret the name of the playlist
- artist and album co-occurrence can also be expected to indicate a strong degree of similarity
- based on event history, two tracks being played within a short period of time can show a high degree of similarity
- user-generated genre tags are an almost universal way of labeling tracks
- other user-generated tags can also provide emotional or semantic context for a song
- emotional labels given to songs by users can form an emotional spectrum that can be expected for a particular song

3. Users:
- general location data
- metadata about LFM users
- LJ2M's user emotional spectrum
- LJ2M's text aggregated by users
- user profile based on listening habits.

# Future Steps
1) The next step is to split the data into training and test sets, so that from this point on there is no danger of violating the i.i.d. assumption. Some of the preprocessing will be recomputed to take this into account. 
In addition, this requires thinking about how to design the experiment, in order to make sure that the final evaluation is valid.

2) Prototyping and testing out the first models

3) Some analysis of the existing features, such as a statistical and visual description of the data, will be very handy at this point, as it will help interpret and verify the results.
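For step 1, one possible design is to split by user rather than by event, so that no user contributes events to both sets. The sketch below uses scikit-learn's `GroupShuffleSplit` on toy data; grouping by user, the 20% test share and the column names are assumptions, since the experimental design has not been fixed yet.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy listening events: five users with two events each.
events = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "track_id": [10, 11, 10, 12, 13, 14, 11, 15, 12, 16],
})

# Hold out about 20% of the users (not of the events) for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(events, groups=events["user_id"]))

train = events.iloc[train_idx]
test = events.iloc[test_idx]
```

Any preprocessing that is fitted on data (normalization, LSA, and so on) would then be fitted on `train` only and merely applied to `test`.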