Task : 
Data pre-processing
We need to convert the ListenBrainz data dump to a matrix containing user ids, artist ids, and play counts (how many times someone listened to a given artist in 2022). Write a script to do the following:



1. Read the jsonlines data
2. For each item, extract user information and the MessyBrainz id
3. Look up a recording MusicBrainz id in the mapping. Note that not all items in the dataset have a mapping available, so you will need to skip these items.
4. Look up an artist for the recording. If a recording has multiple artists, you can choose how to handle it:
  * Take only the first artist in the list
  * Add multiple entries, one for each artist in the list
  * Treat the artist credit as a single entity (in this case you can use the “artist credit name” from the metadata file as your mapping in the next step)
5. Build a mapping of artist id to a textual name so that you can use the name to show results

You may wish to process only a part of this dataset during your initial development and then process the full dataset once you are ready to evaluate it.
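
As a rough illustration of steps 1 to 5, here is a minimal sketch on toy data. The listens, the `msid_to_mbid` and `recording_to_artists` dictionaries and the artist names are all made up for the example, and it assumes the "one entry per artist" option for recordings with several artists.

```python
import json
from collections import Counter

# Toy inputs (hypothetical values, not taken from the dump)
raw_listens = [
    '{"user_id": 1, "recording_msid": "msid-a"}',
    '{"user_id": 1, "recording_msid": "msid-a"}',
    '{"user_id": 1, "recording_msid": "msid-b"}',
    '{"user_id": 2, "recording_msid": "msid-c"}',  # no mapping available: skipped
]
msid_to_mbid = {"msid-a": "rec-1", "msid-b": "rec-2"}
recording_to_artists = {"rec-1": ["art-x"], "rec-2": ["art-x", "art-y"]}
artist_names = {"art-x": "Artist X", "art-y": "Artist Y"}

play_counts = Counter()  # (user_id, artist_id) -> play count
for raw in raw_listens:
    listen = json.loads(raw)                        # step 1: read the JSON line
    user_id = listen["user_id"]                     # step 2: user information
    msid = listen["recording_msid"]                 # step 2: MessyBrainz id
    mbid = msid_to_mbid.get(msid)                   # step 3: MusicBrainz id
    if mbid is None:
        continue                                    # unmapped items are skipped
    for artist_id in recording_to_artists.get(mbid, []):  # step 4: one entry per artist
        play_counts[(user_id, artist_id)] += 1

# step 5: use the id -> name mapping to display the result
for (user_id, artist_id), count in sorted(play_counts.items()):
    print(user_id, artist_names[artist_id], count)
```

Keeping the counts keyed by `(user_id, artist_id)` makes it straightforward to turn them into the rows, columns and values of a sparse matrix later on.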


# Import & setup

In [None]:
import pandas as pd
import sys
import csv
import re
import time
import ast
import numpy as np
csv.field_size_limit(sys.maxsize)
from scipy.sparse import coo_matrix, csr_matrix
import h5py
import json
!pip install implicit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
%pwd
%cd /content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/
%ls -lh

/content/gdrive/.shortcut-targets-by-id/13rfNpcbYnyNWcuTmgPUHIW7xtCpQw_Nr/ListenBrainzData
total 5.5G
dr-x------ 2 root root 4.0K Jan 26 15:15 listenbrainz-2022/
-r-------- 1 root root 5.4G Jan 26 14:10 listenbrainz_msid_mapping.csv
dr-x------ 2 root root 4.0K Jan 26 14:12 metabrainz-metadata-dump-20230117-172210/
-r-------- 1 root root 103M Jan 26 14:11 musicbrainz_artist_mbid_name.csv


# CSV : Linking a user_id to all of its recording_msids

We create a dictionary : 
> Key : the user_id of a user

> Value : list of all the recording_msid values of this user

For every line of every *.listens file, we append the recording_msid found to the list of the user_id found.

We then save this dictionary in a csv file : `/content/gdrive/MyDrive/Term2/ASMLab/dict_user-msid.csv` that will be used later. This file weighs 1.8 GB and contains the listens of nearly 7,700 users.

In [None]:
#!head listenbrainz-2022/1.listens
#!head listenbrainz-2022/1.listens > /content/gdrive/MyDrive/Term2/ASMLab/test.csv
!wc -l /content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/listenbrainz-2022/1.listens
!cat /content/gdrive/MyDrive/Term2/ASMLab/test.csv

head: cannot open 'listenbrainz-2022/1.listens' for reading: No such file or directory
4339352 /content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/listenbrainz-2022/1.listens
{"user_id":17240,"user_name":"Winterbay","timestamp":1640995200,"track_metadata":{"artist_name":"Tiken Jah Fakoly","release_name":"Coup De Gueule","additional_info":{"artist_msid":"f03fa31b-a428-47e2-8dcb-bceec9ca1221","release_msid":"238bd863-a293-4718-a66b-093ea54bf8f3","listening_from":"lastfm","recording_msid":"f216e2fb-784d-470a-81b2-6e27cd532204","lastfm_artist_mbid":"edef3cfa-4e5e-4d64-8bd8-20f9dc1d8cad","lastfm_release_mbid":"9dc7fe6a-3fa4-4461-8975-ecb7218b39a3"},"track_name":"Alou Maye"},"recording_msid":"f216e2fb-784d-470a-81b2-6e27cd532204"}
{"user_id":16930,"user_name":"kazoo","timestamp":1640995200,"track_metadata":{"artist_name":"Fraktus","release_name":"Millennium Edition","additional_info":{"artist_msid":"afe0b08d-f47d-4adf-bd78-3e95ca276f7c","tracknumber":11,"release_msid":"9cd2285c-4860-4e83-b

 Structure : see below
 
 ```
 {
	"user_id":17240,
	"user_name":"Winterbay",
	"timestamp":1640995200,
	"track_metadata":
		{
		"artist_name":"Tiken Jah Fakoly",
		"release_name":"Coup De Gueule",
		"additional_info":
			{
			"artist_msid":"f03fa31b-a428-47e2-8dcb-bceec9ca1221",
			"release_msid":"238bd863-a293-4718-a66b-093ea54bf8f3",
			"listening_from":"lastfm",
			"recording_msid":"f216e2fb-784d-470a-81b2-6e27cd532204",
			"lastfm_artist_mbid":"edef3cfa-4e5e-4d64-8bd8-20f9dc1d8cad",
			"lastfm_release_mbid":"9dc7fe6a-3fa4-4461-8975-ecb7218b39a3"
			},
		"track_name":"Alou Maye"
	}
	,"recording_msid":"f216e2fb-784d-470a-81b2-6e27cd532204"
}


 ```
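
Since each line is a standalone JSON object, the two fields we need can be read with `json.loads`. A minimal sketch on a shortened copy of the first line shown above (the `additional_info` block is left out for brevity):

```python
import json

# Shortened copy of the first line of 1.listens
line = (
    '{"user_id":17240,"user_name":"Winterbay","timestamp":1640995200,'
    '"track_metadata":{"artist_name":"Tiken Jah Fakoly",'
    '"release_name":"Coup De Gueule","track_name":"Alou Maye"},'
    '"recording_msid":"f216e2fb-784d-470a-81b2-6e27cd532204"}'
)

listen = json.loads(line)
user_id = listen["user_id"]                # integer user id
recording_msid = listen["recording_msid"]  # top-level MessyBrainz id (a 36-character UUID)

print(user_id, recording_msid)
```

Reading the fields by name does not depend on the order of the keys in the line, which a positional or regexp-based extraction would.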

In [None]:
d = {}
listens_dir = '/content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/listenbrainz-2022'

start = time.monotonic()

# For every file (1.listens to 12.listens)
for i in range(12):
  path = f'{listens_dir}/{i+1}.listens'

  # We count the lines of the new file without loading it in memory - used in printing
  with open(path) as fp:
    lines = sum(1 for _ in fp)
    print(f'There is {lines} lines for the {i+1} file')

  # We open and work on a new file ; the counter is reset so that the percentage is per file
  n_done = 0
  with open(path) as fp:
      for line in fp:
          n_done += 1
          if n_done % 100000 == 0:
            print(f'File n°{i+1} : {n_done} lines processed so far on {lines} ; {n_done/lines*100}%')

          # Each line is one JSON object : we parse it instead of using csv + regexp
          listen = json.loads(line)
          user_id = str(listen['user_id'])
          recording_msid = listen['recording_msid']

          if len(recording_msid) != 36:
            print(f'/!\\ recording_msid has an unexpected length : {recording_msid} ; there might be an issue')
            break

          # setdefault creates the list the first time we see the user ; then we append
          d.setdefault(user_id, []).append(recording_msid)

#We write the final results in the .csv
# Number of keys of the dictionary : 
nb_keys = len(d)
keys_list = list(d.keys())

df = pd.DataFrame(
    data={
        "index": keys_list,
        "data": [d.get(k) for k in keys_list]
    }
)


file_path = "/content/gdrive/MyDrive/Term2/ASMLab/dict_user-msid.csv"
df.to_csv(file_path)
end = time.monotonic()
total = end - start
print('This operation took {:.2f} seconds'.format(total))

There is 4339352 lines for the 1 file
File n°1 : 100000 lines processed so far on 4339352 ; 2.304491546203212%
File n°1 : 200000 lines processed so far on 4339352 ; 4.608983092406424%
File n°1 : 300000 lines processed so far on 4339352 ; 6.913474638609636%
File n°1 : 400000 lines processed so far on 4339352 ; 9.217966184812848%
File n°1 : 500000 lines processed so far on 4339352 ; 11.52245773101606%
File n°1 : 600000 lines processed so far on 4339352 ; 13.826949277219272%
File n°1 : 700000 lines processed so far on 4339352 ; 16.131440823422484%
File n°1 : 800000 lines processed so far on 4339352 ; 18.435932369625696%
File n°1 : 900000 lines processed so far on 4339352 ; 20.740423915828906%
File n°1 : 1000000 lines processed so far on 4339352 ; 23.04491546203212%
File n°1 : 1100000 lines processed so far on 4339352 ; 25.34940700823533%
File n°1 : 1200000 lines processed so far on 4339352 ; 27.653898554438545%
File n°1 : 1300000 lines processed so far on 4339352 ; 29.958390100641758%
Fil

In [None]:
# The file is too big to be printed -> we still want the number of lines of the newly created file.
# We write the head of the file to a test file
!head /content/gdrive/MyDrive/Term2/ASMLab/dict_user-msid.csv > /content/gdrive/MyDrive/Term2/ASMLab/test_read.csv
!wc -l /content/gdrive/MyDrive/Term2/ASMLab/dict_user-msid.csv

7684 /content/gdrive/MyDrive/Term2/ASMLab/dict_user-msid.csv


In [None]:
# We try to read the data from the created file and put it in a dictionary.
# For now, we only try it on the test file created above

d={}

with open('/content/gdrive/MyDrive/Term2/ASMLab/test_read.csv') as fp:
#with open('/content/gdrive/MyDrive/Term2/ASMLab/dict_user-msid.csv') as fp:
    csvreader = csv.reader(fp)
    for line in csvreader:
      if line[1] == 'index':
        pass
      else:
        d[str(line[1])] = list(line[2][2:-2].split("', '")) # We parse the data to have a list

print(d['249'])
print(d['249'][0])
print(len(d['249']))

['1160f670-c889-4f09-9e34-e4e22d5717d8', '822b3da6-ac99-4e5e-bf07-56c20868eca2', '1fed2068-50b0-44ca-817f-b420db3c3ee1', '7484c826-7392-4636-8b6b-0b8213b9066b', '874a15c9-1af3-44fd-a1f3-3ce1f5d8ee50', '874a15c9-1af3-44fd-a1f3-3ce1f5d8ee50', '17dd1275-a72f-48c3-8ac1-0937ff7aa879', '17dd1275-a72f-48c3-8ac1-0937ff7aa879', 'f5f78cc8-e11f-4798-98a9-77bf31ba07be', 'f5f78cc8-e11f-4798-98a9-77bf31ba07be', '8f60a74b-07e6-4019-9687-ee0090cbc636', '8f60a74b-07e6-4019-9687-ee0090cbc636', 'bce3aaea-c080-4b6e-b38e-789cf21a3859', 'bce3aaea-c080-4b6e-b38e-789cf21a3859', '1ec91b62-e082-4d2e-a999-d29752721c79', '1ec91b62-e082-4d2e-a999-d29752721c79', 'da88eaee-4aa0-4264-843d-d4d84721ffb0', 'a7a1dffd-0160-4e0f-80cd-3839e09addf7', '058f7259-07ee-4123-a5e3-905ef02a0607', '058f7259-07ee-4123-a5e3-905ef02a0607', '87802984-e1d9-46cf-b6fd-fec0c9cd43cf', '87802984-e1d9-46cf-b6fd-fec0c9cd43cf', '4400d7b2-df54-4a9b-83b2-1ce854d6be61', '4400d7b2-df54-4a9b-83b2-1ce854d6be61', '04b0e92c-26ed-4ae4-9296-3ad0214a0cd4',

# CSV : Linking the recording_mbid to the artists_mbid

We parse the `ListenBrainzData/metabrainz-metadata-dump-20230117-172210/metabrainz/canonical_musicbrainz_data.csv` database, in order to create a dictionary : 
> Key : recording_mbid 

> Value : associated artists_mbid value

The goal is to link the recording_mbid with the artists_mbid.

We then save this dictionary in a csv file : `recording_mbid2artists_mbid.csv` that will be used later. This file weighs 1.7 GB and contains about 22.4 million links.
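
As a small illustration of this step, the sketch below reads the header and the first data row of the canonical file with `csv.DictReader`, so that columns are accessed by name rather than by position. It assumes that several artist MBIDs are comma-separated inside the braces of `artist_mbids`, which is also what the later `.split(',')` in this notebook relies on.

```python
import csv
import io

# Header and first data row of canonical_musicbrainz_data.csv
sample = io.StringIO(
    "id,artist_credit_id,artist_mbids,artist_credit_name,release_mbid,release_name,"
    "recording_mbid,recording_name,combined_lookup,score,year\n"
    "28939355,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,"
    "430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,"
    "5f5f649f-1938-4a3e-a879-a95693a99a71,Heavenly Flow,michieoneheavenlyflow,371181,2006\n"
)

recording_to_artists = {}
for row in csv.DictReader(sample):
    # artist_mbids looks like "{mbid1,mbid2}" : drop the braces, then split on commas
    artist_mbids = row["artist_mbids"].strip("{}").split(",")
    recording_to_artists[row["recording_mbid"]] = artist_mbids

print(recording_to_artists)
```

With `DictReader` the header row is consumed automatically, so no explicit test on the first line is needed.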

In [None]:
!head "/content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/metabrainz-metadata-dump-20230117-172210/metabrainz/canonical_musicbrainz_data.csv" > /content/gdrive/MyDrive/Term2/ASMLab/test_read_canonicdb.csv

In [None]:
!wc -l /content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/metabrainz-metadata-dump-20230117-172210/metabrainz/canonical_musicbrainz_data.csv
!cat /content/gdrive/MyDrive/Term2/ASMLab/test_read_canonicdb.csv

22396747 /content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/metabrainz-metadata-dump-20230117-172210/metabrainz/canonical_musicbrainz_data.csv
id,artist_credit_id,artist_mbids,artist_credit_name,release_mbid,release_name,recording_mbid,recording_name,combined_lookup,score,year
28939355,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,5f5f649f-1938-4a3e-a879-a95693a99a71,Heavenly Flow,michieoneheavenlyflow,371181,2006
28939356,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,6ee381af-e9b4-46c4-a4f0-541dce2f03ea,Party,michieoneparty,371181,2006
28939357,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,8b371ea0-dee1-4fbf-bacd-3aa87a4aef13,Free Like Jah,michieonefreelikejah,371181,2006
28939358,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,9d904f0f-314

In [None]:
recording_mbid2artists_mbid = {}
iter = 0
with open('/content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/metabrainz-metadata-dump-20230117-172210/metabrainz/canonical_musicbrainz_data.csv') as fp:
    csvreader = csv.reader(fp)
    for line in csvreader:
      iter += 1
      if(iter % 1000000 == 0):
        print(f'We processed {iter} lines on 22396747  - {iter/22396747*100} %')
      if line[0] != 'id':
        # We associate recording_mbid (line[6]) to artist_mbids (line[2])
        recording_mbid2artists_mbid[line[6]] = line[2][1:-1]

print(len(recording_mbid2artists_mbid))


We processed 1000000 lines on 22396747  - 4.46493412637112 %
We processed 2000000 lines on 22396747  - 8.92986825274224 %
We processed 3000000 lines on 22396747  - 13.39480237911336 %
We processed 4000000 lines on 22396747  - 17.85973650548448 %
We processed 5000000 lines on 22396747  - 22.3246706318556 %
We processed 6000000 lines on 22396747  - 26.78960475822672 %
We processed 7000000 lines on 22396747  - 31.25453888459784 %
We processed 8000000 lines on 22396747  - 35.71947301096896 %
We processed 9000000 lines on 22396747  - 40.18440713734008 %
We processed 10000000 lines on 22396747  - 44.6493412637112 %
We processed 11000000 lines on 22396747  - 49.11427539008232 %
We processed 12000000 lines on 22396747  - 53.57920951645344 %
We processed 13000000 lines on 22396747  - 58.044143642824565 %
We processed 14000000 lines on 22396747  - 62.50907776919568 %
We processed 15000000 lines on 22396747  - 66.9740118955668 %
We processed 16000000 lines on 22396747  - 71.43894602193792 %
We pr

In [None]:
with open('/content/gdrive/MyDrive/Term2/ASMLab/recording_mbid2artists_mbid.csv', 'w') as fpw:
    writer = csv.DictWriter(fpw, fieldnames=['recording_mbid', 'artists_mbid'])
    writer.writeheader()
    for key in recording_mbid2artists_mbid.keys():
      writer.writerow({'recording_mbid' : key, 'artists_mbid' : recording_mbid2artists_mbid[key]})

In [None]:
!wc -l /content/gdrive/MyDrive/Term2/ASMLab/recording_mbid2artists_mbid.csv

22396747 /content/gdrive/MyDrive/Term2/ASMLab/recording_mbid2artists_mbid.csv


# CSV : Linking recording_msid to artist_mbids




We parse the `ListenBrainzData/listenbrainz_msid_mapping.csv` file, in order to create a dictionary : 
> Key : recording_msid 

> Value : associated artists_mbid value

The goal is to link the recording_msid with the artists_mbid.

The mapping file has more than 77 million lines, so the resulting dictionary is too heavy to be kept in RAM. We therefore build it in small portions and write each portion to disk before starting the next one.

For this, we need to have `recording_mbid2artists_mbid` loaded in memory first : https://colab.research.google.com/drive/1apVN3G1hm0v1s9y-4YNoWV4OlFIyFUO_#scrollTo=P2dPRhVhJ9RV&line=2&uniqifier=1

We then save this dictionary in a csv file : `recording-msid2artists_mbid.csv` that will be used later. This file weighs 3.34 GB and contains about 45 million links.
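
The idea of filtering on the match quality and writing the links portion by portion can be sketched as follows. The function name `link_in_chunks`, the toy rows and the tiny `chunk_size` are hypothetical and only serve the example; the column order (recording_msid, recording_mbid, match_type) is the one used in this notebook.

```python
import csv
import io

KEPT_MATCH_TYPES = {"exact_match", "high_quality"}

def link_in_chunks(mapping_rows, recording_to_artists, out_file, chunk_size):
    """Write (recording_msid, artists) rows, flushing every `chunk_size` kept links."""
    writer = csv.writer(out_file)
    writer.writerow(["recording_msid", "artists_mbid"])
    chunk, kept = {}, 0
    for recording_msid, recording_mbid, match_type in mapping_rows:
        if match_type not in KEPT_MATCH_TYPES:
            continue
        kept += 1
        # -1 flags a recording_mbid that is missing from the recording -> artists dictionary
        chunk[recording_msid] = recording_to_artists.get(recording_mbid, -1)
        if len(chunk) >= chunk_size:
            writer.writerows(chunk.items())
            chunk = {}  # reset to free memory
    writer.writerows(chunk.items())  # remaining links
    return kept

rows = [
    ("msid-1", "rec-1", "exact_match"),
    ("msid-2", "rec-2", "low_quality"),   # filtered out
    ("msid-3", "rec-9", "high_quality"),  # unknown recording -> -1
]
out = io.StringIO()
kept = link_in_chunks(rows, {"rec-1": "art-x"}, out, chunk_size=1)
print(out.getvalue())
```

Flushing after the loop as well as inside it means the last, incomplete portion is written without having to know the total number of lines in advance.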

In [None]:
!wc -l /content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/listenbrainz_msid_mapping.csv

77317915 /content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/listenbrainz_msid_mapping.csv


In [None]:
# We have the list of all possible qualities :
quality = []
with open('/content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/listenbrainz_msid_mapping.csv') as fp:
  csvreader = csv.reader(fp)
  for line in csvreader:
    if line[2] not in quality:
      quality.append(line[2])
    
print(quality)
    

['match_type', 'exact_match', 'high_quality', 'med_quality', 'low_quality', 'no_match']


In [None]:
# We count every element for every possible qualities
quality_quantity = {"exact_match" : 0, "high_quality" : 0, "med_quality" : 0, "low_quality" : 0, "no_match" : 0, "match_type" : 0}
with open('/content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/listenbrainz_msid_mapping.csv') as fp:
  csvreader = csv.reader(fp)
  for line in csvreader:
    quality_quantity[line[2]] += 1

print(quality_quantity)
total = 0
for key in quality_quantity.keys():
  print(f'For this category : {key} we have {quality_quantity[key]} elements')
  total += quality_quantity[key]

print(f'We have a total of : {total}')

{'exact_match': 44071334, 'high_quality': 1077293, 'med_quality': 2252381, 'low_quality': 6391631, 'no_match': 23525275, 'match_type': 1}
For this category : exact_match we have 44071334 elements
For this category : high_quality we have 1077293 elements
For this category : med_quality we have 2252381 elements
For this category : low_quality we have 6391631 elements
For this category : no_match we have 23525275 elements
For this category : match_type we have 1 elements
We have a total of : 77317915


In [None]:
# NEED TO LOAD SOMETHING IN MEMORY FIRST : recording_mbid2artists_mbid
# https://colab.research.google.com/drive/1apVN3G1hm0v1s9y-4YNoWV4OlFIyFUO_#scrollTo=P2dPRhVhJ9RV&line=2&uniqifier=1

recording_msid2artists_mbid = {}
iter = 0
iter_all = 0
error = 0

with open('/content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/listenbrainz_msid_mapping.csv') as fp:
    with open('/content/gdrive/MyDrive/Term2/ASMLab/recording-msid2artists_mbid.csv', 'w') as fpw:
        writer = csv.DictWriter(fpw, fieldnames=['recording_msid', 'artists_mbid'])
        writer.writeheader()
        csvreader = csv.reader(fp)
        for line in csvreader:
          iter_all += 1

          # We only take into account the link if it's exact_match or high_quality
          if line[2] in {'exact_match', 'high_quality'}: 
            iter += 1
            # Line[0] : recording_msid 
            # Line[1] : recording_mbid
            # We work with a list of all the artists of the song.
            artists = recording_mbid2artists_mbid.get(line[1])
            if artists is None:
              # recording_mbid absent from the canonical data : we flag the link with -1
              error += 1
              artists_mbid_list = -1
            else:
              artists_mbid_list = artists.split(',')

            recording_msid2artists_mbid[line[0]] = artists_mbid_list

            # Every 10M 'interesting' link, we save to the CSV and we reset the dictionary -> This is to save RAM
            if(iter % 10000000 == 0):
              print(f'We processed {iter_all} lines on 77317915 - {iter_all/77317915 *100} %')
              print (f'error rate : {error}/{iter} - {error/iter}')
              for key in recording_msid2artists_mbid.keys():
                writer.writerow({'recording_msid' : key, 'artists_mbid' : recording_msid2artists_mbid[key]})
              recording_msid2artists_mbid = {}
        
        # At the end, we write the remaining links
        print(f'We processed {iter_all} lines on 77317915 - {iter_all/77317915 *100} %')
        for key in recording_msid2artists_mbid.keys():
          writer.writerow({'recording_msid' : key, 'artists_mbid' : recording_msid2artists_mbid[key]})
        recording_msid2artists_mbid = {}


print (f'Final error rate : {error}/{iter}')

We processed 10948582 lines on 77000000 - 14.218937662337664 %
error rate : 359913/10000000 - 0.0359913
We processed 21889997 lines on 77000000 - 28.42856753246753 %
error rate : 718521/20000000 - 0.03592605
We processed 37268225 lines on 77000000 - 48.400292207792205 %
error rate : 1065613/30000000 - 0.03552043333333333
We processed 55220907 lines on 77000000 - 71.71546363636364 %
error rate : 1433598/40000000 - 0.03583995
We processed 77317915 lines on 77000000 - 58.634580519480515 %
Final error rate : 1647175/45148627


Based on what we saw before, we work on 44071334 + 1077293 = 45148627 elements (exact_match + high_quality). 

For about 3.6% of those links (1647175 out of 45148627), the recording_mbid was not found in `recording_mbid2artists_mbid`.
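
These figures can be checked with a few lines of arithmetic, using the counts printed by the cells above:

```python
exact_match = 44071334
high_quality = 1077293
failed_lookups = 1647175  # numerator of the "Final error rate" printed above

kept_links = exact_match + high_quality
error_rate = failed_lookups / kept_links

print(kept_links)           # 45148627
print(f"{error_rate:.2%}")  # 3.65%
```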

In [None]:
!wc -l /content/gdrive/MyDrive/Term2/ASMLab/recording-msid2artists_mbid.csv

45148628 /content/gdrive/MyDrive/Term2/ASMLab/recording-msid2artists_mbid.csv


In [None]:
!head /content/gdrive/MyDrive/Term2/ASMLab/recording-msid2artists_mbid.csv

recording_msid,artists_mbid
00000737-3a59-4499-b30a-31fe2464555d,['5b24fbab-c58f-4c37-a59d-ab232e2d98c4']
000013b3-dbb4-43a0-8fd4-ca92ff5ed033,['797bcf41-0e02-431d-ab99-020e1cb3d0fd']
00002714-6f74-409d-9fa4-441c8dfb195f,['ae6362d6-e1a3-4e31-b26f-e413d9e3d1c5']
00003a81-2a6c-4d6c-ad43-990c0806458b,['4fdde741-92d7-4c8b-b85a-d1840e5b3aa8']
00005660-7eb0-4592-a74b-14f3de9cc4cb,['b7ffd2af-418f-4be2-bdd1-22f8b48613da']
00006a3b-babd-4bb0-89b3-e2835aba6425,['d3b2711f-2baa-441a-be95-14945ca7e6ea']
0000825d-b547-43a7-b294-948e8e472766,['c3e34e4f-85ec-42e6-ae00-85039d88cb13']
00009fc5-4f28-4020-b02c-966d6c5e4202,['391d85ad-2130-4732-8145-853908013354']
0000b284-1f92-404f-97ab-12c89356de22,['f7861eb7-a14c-41ec-813d-15a28a220a06']


# CSV : Linking user_id to artists_mbid

Here, we link each user_id with its artists_mbid values in a dictionary and write it to a csv file.

>    Key : user_id

>    Value : associated list of artist_mbids values.



For this, we need to have `user_id2recordings_msid` loaded in memory first. This is done in one of the cells below.

We then save this dictionary in a csv file : `user_id2artists_mbid.csv` that will be used later. This file weighs 1.3 GB and contains 7,684 lines (one per user, plus the header).


In [None]:
!head /content/gdrive/MyDrive/Term2/ASMLab/dict_user-msid.csv

In [None]:
!head /content/gdrive/MyDrive/Term2/ASMLab/recording-msid2artists_mbid.csv #> /content/gdrive/MyDrive/Term2/ASMLab/test_recording-msid2artists_mbid.csv

recording_msid,artists_mbid
00000737-3a59-4499-b30a-31fe2464555d,['5b24fbab-c58f-4c37-a59d-ab232e2d98c4']
000013b3-dbb4-43a0-8fd4-ca92ff5ed033,['797bcf41-0e02-431d-ab99-020e1cb3d0fd']
00002714-6f74-409d-9fa4-441c8dfb195f,['ae6362d6-e1a3-4e31-b26f-e413d9e3d1c5']
00003a81-2a6c-4d6c-ad43-990c0806458b,['4fdde741-92d7-4c8b-b85a-d1840e5b3aa8']
00005660-7eb0-4592-a74b-14f3de9cc4cb,['b7ffd2af-418f-4be2-bdd1-22f8b48613da']
00006a3b-babd-4bb0-89b3-e2835aba6425,['d3b2711f-2baa-441a-be95-14945ca7e6ea']
0000825d-b547-43a7-b294-948e8e472766,['c3e34e4f-85ec-42e6-ae00-85039d88cb13']
00009fc5-4f28-4020-b02c-966d6c5e4202,['391d85ad-2130-4732-8145-853908013354']
0000b284-1f92-404f-97ab-12c89356de22,['f7861eb7-a14c-41ec-813d-15a28a220a06']


In [None]:
#Loading the user_id2recordings_msid in memory. It will be used to link everything

user_id2recordings_msid = {}

with open('/content/gdrive/MyDrive/Term2/ASMLab/dict_user-msid.csv') as fp:
    csvreader = csv.reader(fp)
    for line in csvreader:
      if line[1] == 'index':
        pass
      else:
        user_id2recordings_msid[str(line[1])] = list(line[2][2:-2].split("', '")) # We parse the data to have a list

In [None]:
print(user_id2recordings_msid['249'])
print(user_id2recordings_msid['249'][0])
print(len(user_id2recordings_msid['249']))

['1160f670-c889-4f09-9e34-e4e22d5717d8', '822b3da6-ac99-4e5e-bf07-56c20868eca2', '1fed2068-50b0-44ca-817f-b420db3c3ee1', '7484c826-7392-4636-8b6b-0b8213b9066b', '874a15c9-1af3-44fd-a1f3-3ce1f5d8ee50', '874a15c9-1af3-44fd-a1f3-3ce1f5d8ee50', '17dd1275-a72f-48c3-8ac1-0937ff7aa879', '17dd1275-a72f-48c3-8ac1-0937ff7aa879', 'f5f78cc8-e11f-4798-98a9-77bf31ba07be', 'f5f78cc8-e11f-4798-98a9-77bf31ba07be', '8f60a74b-07e6-4019-9687-ee0090cbc636', '8f60a74b-07e6-4019-9687-ee0090cbc636', 'bce3aaea-c080-4b6e-b38e-789cf21a3859', 'bce3aaea-c080-4b6e-b38e-789cf21a3859', '1ec91b62-e082-4d2e-a999-d29752721c79', '1ec91b62-e082-4d2e-a999-d29752721c79', 'da88eaee-4aa0-4264-843d-d4d84721ffb0', 'a7a1dffd-0160-4e0f-80cd-3839e09addf7', '058f7259-07ee-4123-a5e3-905ef02a0607', '058f7259-07ee-4123-a5e3-905ef02a0607', '87802984-e1d9-46cf-b6fd-fec0c9cd43cf', '87802984-e1d9-46cf-b6fd-fec0c9cd43cf', '4400d7b2-df54-4a9b-83b2-1ce854d6be61', '4400d7b2-df54-4a9b-83b2-1ce854d6be61', '04b0e92c-26ed-4ae4-9296-3ad0214a0cd4',

Principle: 
- Load user_id2recordings_msid in memory
- Read the recording-msid2artists_mbid.csv file
- Create a new dictionary, user_id2artists_mbid :
    - key : user_id
    - value : list of the artists the user listened to

- Every 5M lines of recording-msid2artists_mbid.csv :
  - For each user, go through the user's list of recording_msid
  - If a msid is found in the portion being processed, append the related artists_mbid to the list of this user in the new dictionary.
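
The principle can be sketched on toy data as follows. The helper name `add_chunk` and all identifiers in the sample dictionaries are hypothetical; the two small dictionaries in `chunks` stand in for successive 5M-line portions of the file.

```python
def add_chunk(user_to_recordings, chunk, user_to_artists):
    """Append, for each user, the artists of every recording found in `chunk`."""
    for user_id, recordings in user_to_recordings.items():
        for msid in recordings:
            # A msid absent from this portion contributes nothing
            user_to_artists[user_id].extend(chunk.get(msid, []))

user_to_recordings = {"249": ["msid-a", "msid-a", "msid-b"], "250": ["msid-c"]}
user_to_artists = {user_id: [] for user_id in user_to_recordings}

# Two portions of the recording_msid -> artists mapping
chunks = [
    {"msid-a": ["art-x"]},
    {"msid-b": ["art-x", "art-y"]},
]
for chunk in chunks:
    add_chunk(user_to_recordings, chunk, user_to_artists)

print(user_to_artists)
```

Each portion requires a full pass over every user's listens, so larger portions mean fewer passes at the cost of more memory.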

In [None]:
# /!\ Needing to have user_id2recordings_msid loaded in memory, cf above

recording_msid2artists_mbid = {}
iter = 0
error = 0

# Create dictionary of an empty list for every key in the user_id2recording_msid
user_id2artists_mbid = {key: [] for key in user_id2recordings_msid}

def flush_chunk(chunk):
  # For every user, we iterate through the list of his recording_msid ;
  # if one of them is in the chunk, we append its artists mbid to this user one by one
  for user_id, user_recordings in user_id2recordings_msid.items():
    for user_rec_msid in user_recordings:
      artists = chunk.get(user_rec_msid)
      if artists is not None:
        user_id2artists_mbid[user_id].extend(artists)

with open('/content/gdrive/MyDrive/Term2/ASMLab/recording-msid2artists_mbid.csv') as fp:
  csvreader = csv.reader(fp)
  for line in csvreader:
      iter += 1

      # Working for the line where there was no issue (the header and the -1 flags are skipped)
      if line[1] not in {'artists_mbid', '-1'}:
        recording_msid2artists_mbid[line[0]] = ast.literal_eval(line[1]) # We parse the data to have a list
      else:
        error += 1

      # Every 5M lines, we link the chunk to the users and reset the dictionary to save RAM
      if iter % 5000000 == 0:
        print(f'We processed {iter} lines on 45000000 - {iter/45000000*100} %')
        print(f'Error : {error} - rate : {error/iter*100} %')
        flush_chunk(recording_msid2artists_mbid)
        recording_msid2artists_mbid = {}

# At the end, we link the remaining lines. Doing it after the loop avoids
# hard-coding the number of lines, which made the last line of the file be ignored
print(f'We processed {iter} lines on 45000000 - {iter/45000000*100} %')
print(f'Error : {error} - rate : {error/iter*100} %')
flush_chunk(recording_msid2artists_mbid)
recording_msid2artists_mbid = {}

        
# Finally, we write to a csv   
with open('/content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_mbid.csv', 'w') as fpw:
    writer = csv.DictWriter(fpw, fieldnames=['user_id', 'artists_mbid'])
    writer.writeheader()
    for key in user_id2artists_mbid.keys():
      writer.writerow({'user_id' : key, 'artists_mbid' : user_id2artists_mbid[key]})

We processed 5000000 lines on 45000000 - 11.11111111111111 %
Error : 183291 - rate : 3.66582 %
We processed 10000000 lines on 45000000 - 22.22222222222222 %
Error : 359914 - rate : 3.59914 %
We processed 15000000 lines on 45000000 - 33.33333333333333 %
Error : 538605 - rate : 3.5907 %
We processed 20000000 lines on 45000000 - 44.44444444444444 %
Error : 718522 - rate : 3.59261 %
We processed 25000000 lines on 45000000 - 55.55555555555556 %
Error : 901147 - rate : 3.604588 %
We processed 30000000 lines on 45000000 - 66.66666666666666 %
Error : 1065614 - rate : 3.5520466666666666 %
We processed 35000000 lines on 45000000 - 77.77777777777779 %
Error : 1186333 - rate : 3.389522857142857 %
We processed 40000000 lines on 45000000 - 88.88888888888889 %
Error : 1433599 - rate : 3.5839975 %
We processed 45000000 lines on 45000000 - 100.0 %
Error : 1644798 - rate : 3.6551066666666667 %
We processed 45148627 lines on 45000000 - 100.33028222222222 %
Error : 1647176 - rate : 3.6483412884294353 %


In [None]:
!wc -l /content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_mbid.csv

7684 /content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_mbid.csv


In [None]:
# Too big to be displayed in full
!head /content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_mbid.csv

# CSV : We want the play count

Idea : go from 
> user_id -> list of artist_mbid 

to

>user_id -> artist_mbid -> count

For this, we are going to have a dictionary of dictionaries :
* key : user_id
* value : dictionary of
  * key : artist_mbid
  * value : play_count

We compute this from the data we just produced and store the result on disk in this file : `user_id2artists_count.csv`, which weighs 144 MB.
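
A minimal sketch of this transformation on toy data, using `collections.Counter` as one possible way to count (the artist ids below are made up):

```python
from collections import Counter

# user_id -> list of artist mbids, one entry per listen
user_id2artists = {
    "17240": ["art-x", "art-y", "art-x"],
    "249": ["art-y"],
}

# user_id -> {artist_mbid: play_count}
user_id2artists_count = {
    user_id: dict(Counter(artists)) for user_id, artists in user_id2artists.items()
}

print(user_id2artists_count)
```

`Counter` counts in a single pass over each list, so there is no need to collect the unique artists first.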


In [None]:
user_id2artists_mbid = {}

In [None]:
!head /content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_mbid.csv

In [None]:
iter = 0
with open('/content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_mbid.csv') as fp:  
  with open('/content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_count.csv', 'w') as fpw:
      writer = csv.DictWriter(fpw, fieldnames=['user_id', 'artists_count_dico'])
      writer.writeheader()
      csvreader = csv.reader(fp)

      for line in csvreader:
        iter += 1
        if(iter % 100 == 0):
          print(f'{iter} user processed on 76XX - {iter/7600*100}')
        if line[0] not in {'user_id'}:
          user_id = line[0]
          list_artists_mbid = ast.literal_eval(line[1]) # a list of the artist mbids
          
          # Creation of the dictionary for the user
          temp_dico = {}
          unique_list_artists_mbid = np.unique(list_artists_mbid)
          for item in unique_list_artists_mbid:
            temp_dico[item] = 0

          # Filling the temporary dico
          for item in list_artists_mbid:
            temp_dico[item] += 1

          # Writing to the CSV file
          writer.writerow({'user_id' : user_id, 'artists_count_dico' : str(temp_dico)})



    


100 user processed on 76XX - 1.3157894736842104
200 user processed on 76XX - 2.631578947368421
300 user processed on 76XX - 3.9473684210526314
400 user processed on 76XX - 5.263157894736842
500 user processed on 76XX - 6.578947368421052
600 user processed on 76XX - 7.894736842105263
700 user processed on 76XX - 9.210526315789473
800 user processed on 76XX - 10.526315789473683
900 user processed on 76XX - 11.842105263157894
1000 user processed on 76XX - 13.157894736842104
1100 user processed on 76XX - 14.473684210526317
1200 user processed on 76XX - 15.789473684210526
1300 user processed on 76XX - 17.105263157894736
1400 user processed on 76XX - 18.421052631578945
1500 user processed on 76XX - 19.736842105263158
1600 user processed on 76XX - 21.052631578947366
1700 user processed on 76XX - 22.36842105263158
1800 user processed on 76XX - 23.684210526315788
1900 user processed on 76XX - 25.0
2000 user processed on 76XX - 26.31578947368421
2100 user processed on 76XX - 27.631578947368425
2

In [None]:
# Now that the data is smaller, we can !head it

!head /content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_count.csv

user_id,artists_count_dico
17240,"{'0004d643-96e3-486a-bafe-eb792871ce9f': 2, '0008af7d-2aa1-4b4d-80af-b3b64ee3cac6': 81, '000fc734-b7e1-4a01-92d1-f544261b43f5': 3, '00159758-bdd1-4d54-aa0b-555ed1c2f10b': 2, '002a7949-e949-4826-afeb-271e65e6b5ba': 2, '00336255-caf9-4117-b887-05e4ec6ab099': 2, '00376321-ce0f-4bd7-a98f-fcabdbf06ea7': 19, '0039c7ae-e1a7-4a7d-9b49-0cbc716821a6': 28, '003b2747-b74a-46c1-a51e-aeaffe88256c': 37, '003ca7a0-60ef-4c4b-a4d6-f8e33256577e': 2, '00467da8-2a92-498f-8b10-a80889bcded7': 2, '00486b89-09b6-40c0-8689-59a7a01dc88b': 2, '004c1bc0-adb3-47b7-b5be-f470930e1b8f': 1, '004e5eed-e267-46ea-b504-54526f1f377d': 1, '0052c396-de2c-4fca-8bd8-d30f73cc92d7': 11, '0068ae6c-7156-40f9-a81f-39294af6a549': 2, '0070ca77-26e8-4ce2-a28a-d1b76ec15485': 2, '007aebae-6e6c-4d0f-a772-b086d9f9f2a8': 2, '0094b33e-b865-40fd-8913-ba6d66752fa6': 1, '00a1afe3-79c6-44a5-a666-c77e281ee7fe': 65, '00a537c2-915b-466c-873e-e401843c5dd1': 3, '00a86c43-ac30-4b1a-8c02-f53b509a2225': 2, '00a9f935-ba9

# CSV : Final step - follow the template

We start from the file created in the previous step, and we want to have data following this template : 

```
user_id,artists_mbid,artist_name,count
17240,0004d643-96e3-486a-bafe-eb792871ce9f,Siddhartha,2
17240,0008af7d-2aa1-4b4d-80af-b3b64ee3cac6,Miklós Rózsa,81
17240,000fc734-b7e1-4a01-92d1-f544261b43f5,Cocteau Twins,3
17240,00159758-bdd1-4d54-aa0b-555ed1c2f10b,TIMPANA,2
```

We obtain this final csv file : `user_id2artists_count_better_formatted.csv`, which weighs nearly 200 MB.
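
The flattening itself can be sketched as below, with the first two rows of the template as sample data. The third artist id, `mbid-without-name`, is a made-up entry showing what happens when an artist is missing from the mbid to name mapping: in this sketch it is counted and skipped.

```python
import csv
import io

user_counts = {
    "17240": {
        "0004d643-96e3-486a-bafe-eb792871ce9f": 2,
        "0008af7d-2aa1-4b4d-80af-b3b64ee3cac6": 81,
        "mbid-without-name": 5,  # hypothetical id, absent from the name mapping
    }
}
artist_names = {
    "0004d643-96e3-486a-bafe-eb792871ce9f": "Siddhartha",
    "0008af7d-2aa1-4b4d-80af-b3b64ee3cac6": "Miklós Rózsa",
}

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["user_id", "artists_mbid", "artist_name", "count"])

skipped = 0
for user_id, counts in user_counts.items():
    for artist_mbid, count in counts.items():
        name = artist_names.get(artist_mbid)
        if name is None:
            skipped += 1  # no textual name available for this artist
            continue
        writer.writerow([user_id, artist_mbid, name, count])

print(out.getvalue())
print(f"skipped: {skipped}")
```

One row per (user, artist) pair is convenient for the next stage, since each row maps directly to one cell of the user by artist matrix.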

In [None]:
# Dictionary of artists_mbid to name 

artists_mbid2name = {}
iter = 0
with open('/content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/musicbrainz_artist_mbid_name.csv') as fp:
    csvreader = csv.reader(fp)
    for line in csvreader:
      iter += 1
      if(iter % 100000 == 0):
        print(f'We processed {iter} lines on 2102548 - {iter/2102548*100} %')
      if line[0] != 'mbid':
        artists_mbid2name[line[0]] = line[1]

print(len(artists_mbid2name))
print(str(artists_mbid2name)[:1000])

We processed 100000 lines on 2102548 - 4.756133986001746 %
We processed 200000 lines on 2102548 - 9.512267972003492 %
We processed 300000 lines on 2102548 - 14.268401958005239 %
We processed 400000 lines on 2102548 - 19.024535944006985 %
We processed 500000 lines on 2102548 - 23.78066993000873 %
We processed 600000 lines on 2102548 - 28.536803916010477 %
We processed 700000 lines on 2102548 - 33.29293790201223 %
We processed 800000 lines on 2102548 - 38.04907188801397 %
We processed 900000 lines on 2102548 - 42.80520587401572 %
We processed 1000000 lines on 2102548 - 47.56133986001746 %
We processed 1100000 lines on 2102548 - 52.31747384601921 %
We processed 1200000 lines on 2102548 - 57.073607832020954 %
We processed 1300000 lines on 2102548 - 61.8297418180227 %
We processed 1400000 lines on 2102548 - 66.58587580402445 %
We processed 1500000 lines on 2102548 - 71.3420097900262 %
We processed 1600000 lines on 2102548 - 76.09814377602794 %
We processed 1700000 lines on 2102548 - 80.8542

In [None]:
# Requires the artists_mbid2name dictionary built in the cell above.

error = 0
iter = 0
iter_art = 0

with open('/content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_count.csv') as fp:
  with open('/content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_count_better_formatted.csv', 'w') as fpw:
    writer = csv.DictWriter(fpw, fieldnames=['user_id', 'artists_mbid', 'artist_name', 'count'])
    writer.writeheader()
    csvreader = csv.reader(fp)
    for line in csvreader:
      iter += 1
      if(iter % 500 == 0):
          print(f'{iter} user processed on 76XX - {iter/7600*100} %')
          print(f'Error rate : {error}/{iter_art}')
      user_id = line[0]
      tmp = line[1]
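      if tmp not in {'artists_count_dico'}:  # skip the header row
        # Parse the string representation of the {artist_mbid: count} dict
        artists2count_dico = ast.literal_eval(tmp)
        for key in artists2count_dico.keys():
          iter_art += 1
          try:
            writer.writerow({'user_id' : user_id, 'artists_mbid' : key, 'artist_name' : artists_mbid2name[key], 'count' : artists2count_dico[key] })
          except KeyError:
            # MBID missing from artists_mbid2name: count it and skip the row
            error += 1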

500 user processed on 76XX - 6.578947368421052 %
Error rate : 17/528970
1000 user processed on 76XX - 13.157894736842104 %
Error rate : 31/937090
1500 user processed on 76XX - 19.736842105263158 %
Error rate : 40/1321756
2000 user processed on 76XX - 26.31578947368421 %
Error rate : 47/1699042
2500 user processed on 76XX - 32.89473684210527 %
Error rate : 53/2048706
3000 user processed on 76XX - 39.473684210526315 %
Error rate : 59/2319465
3500 user processed on 76XX - 46.05263157894737 %
Error rate : 62/2576217
4000 user processed on 76XX - 52.63157894736842 %
Error rate : 64/2783731
4500 user processed on 76XX - 59.210526315789465 %
Error rate : 66/2914962
5000 user processed on 76XX - 65.78947368421053 %
Error rate : 68/3024561
5500 user processed on 76XX - 72.36842105263158 %
Error rate : 69/3148598
6000 user processed on 76XX - 78.94736842105263 %
Error rate : 75/3272982
6500 user processed on 76XX - 85.52631578947368 %
Error rate : 75/3359686
7000 user processed on 76XX - 92.1052

In [None]:
!head /content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_count_better_formatted.csv

user_id,artists_mbid,artist_name,count
17240,0004d643-96e3-486a-bafe-eb792871ce9f,Siddhartha,2
17240,0008af7d-2aa1-4b4d-80af-b3b64ee3cac6,Miklós Rózsa,81
17240,000fc734-b7e1-4a01-92d1-f544261b43f5,Cocteau Twins,3
17240,00159758-bdd1-4d54-aa0b-555ed1c2f10b,TIMPANA,2
17240,002a7949-e949-4826-afeb-271e65e6b5ba,Lux,2
17240,00336255-caf9-4117-b887-05e4ec6ab099,Shila Amzah,2
17240,00376321-ce0f-4bd7-a98f-fcabdbf06ea7,Raappana,19
17240,0039c7ae-e1a7-4a7d-9b49-0cbc716821a6,Death Cab for Cutie,28
17240,003b2747-b74a-46c1-a51e-aeaffe88256c,Erdmöbel,37


# Implicit

We are now going to use the Implicit library to apply collaborative filtering to the information we gathered.

For this, we need :  
* The file that we generated earlier : user_id2artists_count_better_formatted.csv
* The dictionary of artists_mbid to name : cf cell below

In [None]:
# Dictionary of artists_mbid to name 

artists_mbid2name = {}
iter = 0
with open('/content/gdrive/MyDrive/Term2/ASMLab/ListenBrainzData/musicbrainz_artist_mbid_name.csv') as fp:
    csvreader = csv.reader(fp)
    for line in csvreader:
      iter += 1
      if(iter % 100000 == 0):
        print(f'We processed {iter} lines on 2102548 - {iter/2102548*100} %')
      if line[0] != 'mbid':
        artists_mbid2name[line[0]] = line[1]

print(len(artists_mbid2name))
print(str(artists_mbid2name)[:1000])

We processed 100000 lines on 2102548 - 4.756133986001746 %
We processed 200000 lines on 2102548 - 9.512267972003492 %
We processed 300000 lines on 2102548 - 14.268401958005239 %
We processed 400000 lines on 2102548 - 19.024535944006985 %
We processed 500000 lines on 2102548 - 23.78066993000873 %
We processed 600000 lines on 2102548 - 28.536803916010477 %
We processed 700000 lines on 2102548 - 33.29293790201223 %
We processed 800000 lines on 2102548 - 38.04907188801397 %
We processed 900000 lines on 2102548 - 42.80520587401572 %
We processed 1000000 lines on 2102548 - 47.56133986001746 %
We processed 1100000 lines on 2102548 - 52.31747384601921 %
We processed 1200000 lines on 2102548 - 57.073607832020954 %
We processed 1300000 lines on 2102548 - 61.8297418180227 %
We processed 1400000 lines on 2102548 - 66.58587580402445 %
We processed 1500000 lines on 2102548 - 71.3420097900262 %
We processed 1600000 lines on 2102548 - 76.09814377602794 %
We processed 1700000 lines on 2102548 - 80.8542

In [None]:
pd.read_csv('/content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_count_better_formatted.csv')

Unnamed: 0,user_id,artists_mbid,artist_name,count
0,17240,0004d643-96e3-486a-bafe-eb792871ce9f,Siddhartha,2
1,17240,0008af7d-2aa1-4b4d-80af-b3b64ee3cac6,Miklós Rózsa,81
2,17240,000fc734-b7e1-4a01-92d1-f544261b43f5,Cocteau Twins,3
3,17240,00159758-bdd1-4d54-aa0b-555ed1c2f10b,TIMPANA,2
4,17240,002a7949-e949-4826-afeb-271e65e6b5ba,Lux,2
...,...,...,...,...
3493467,8294,eb5b64ff-7189-4976-975b-654134029fa8,Ásdís María Viðarsdóttir,1
3493468,8294,f5ecff2e-501f-4ed1-a32f-2010dd610552,Leony,1
3493469,21151,2eb55bd9-cfbf-44fd-a5da-92dd825284c7,AJR,1
3493470,21151,47c8a452-4efb-45b4-a0d3-df506dbe42b2,Wang Chung,1


In [None]:
def my_read_dataframe(filename):
    data = pd.read_csv(
        filename, usecols=[0, 1, 3], skiprows=1, names=["user_id", "artist_mbid", "plays"], na_filter=False
    )

    data["user_id"] = data["user_id"].astype("category")
    data['artist_mbid'] = data["artist_mbid"].astype('category')
    data["plays"] = data["plays"].astype(int)

    data["artist_mbid_cat"] = data["artist_mbid"]
    data["artist_mbid"] = pd.factorize(data['artist_mbid'])[0]

    return data

data_postfunc = my_read_dataframe('/content/gdrive/MyDrive/Term2/ASMLab/user_id2artists_count_better_formatted.csv')

print(data_postfunc)

        user_id  artist_mbid  plays                       artist_mbid_cat
0         17240            0      2  0004d643-96e3-486a-bafe-eb792871ce9f
1         17240            1     81  0008af7d-2aa1-4b4d-80af-b3b64ee3cac6
2         17240            2      3  000fc734-b7e1-4a01-92d1-f544261b43f5
3         17240            3      2  00159758-bdd1-4d54-aa0b-555ed1c2f10b
4         17240            4      2  002a7949-e949-4826-afeb-271e65e6b5ba
...         ...          ...    ...                                   ...
3493467    8294         5854      1  eb5b64ff-7189-4976-975b-654134029fa8
3493468    8294        26853      1  f5ecff2e-501f-4ed1-a32f-2010dd610552
3493469   21151         1198      1  2eb55bd9-cfbf-44fd-a5da-92dd825284c7
3493470   21151        17475      1  47c8a452-4efb-45b4-a0d3-df506dbe42b2
3493471   21151         5324      1  d5cc67b8-1cc4-453b-96e8-44487acdebea

[3493472 rows x 4 columns]


In [None]:
# Dict mapping each artist_mbid index to its value (the MBID)
artist_mbid_indices2artist_mbid = data_postfunc.set_index('artist_mbid')['artist_mbid_cat'].to_dict()

print(f'Index is {0} and value is : {artist_mbid_indices2artist_mbid[0]}')

data_postfunc = data_postfunc.drop('artist_mbid_cat', axis=1)
print(data_postfunc)

Index is 0 and value is : 0004d643-96e3-486a-bafe-eb792871ce9f
        user_id  artist_mbid  plays
0         17240            0      2
1         17240            1     81
2         17240            2      3
3         17240            3      2
4         17240            4      2
...         ...          ...    ...
3493467    8294         5854      1
3493468    8294        26853      1
3493469   21151         1198      1
3493470   21151        17475      1
3493471   21151         5324      1

[3493472 rows x 3 columns]
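
As a side note, `pd.factorize` already returns the lookup table alongside the integer codes, so the index-to-MBID mapping can be built without an extra column. The following is a minimal sketch on toy data; the variable names `codes`, `uniques` and `index2mbid` are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy column of artist MBIDs, with one repeated value
mbids = pd.Series([
    "0004d643-96e3-486a-bafe-eb792871ce9f",
    "0008af7d-2aa1-4b4d-80af-b3b64ee3cac6",
    "0004d643-96e3-486a-bafe-eb792871ce9f",
])

# codes[i] is the integer index of mbids[i]; uniques[code] is the original MBID
codes, uniques = pd.factorize(mbids)

# Plain dict: integer index -> MBID
index2mbid = dict(enumerate(uniques))
print(codes, index2mbid)
```

Codes are assigned in order of first appearance, which matches the behaviour seen in the frame above (the first MBID gets index 0).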


In [None]:
# Keys : unique user_ids ; Values : corresponding integer indices
user_id2his_index = {user: index for index, user in enumerate(set(data_postfunc['user_id']))}

# User indices in the data_postfunc frame
user_indices_from_data_postfun = np.array([user_id2his_index[user] for user in data_postfunc['user_id']])

print(f'user_id to his index dict : {user_id2his_index}')
print(f'user_id n°{1} corresponds to index n°{user_id2his_index[1]}')
print(f'Users : We have {len(user_indices_from_data_postfun)} rows with {len(np.unique(user_indices_from_data_postfun))} unique users')

# Artists indices in the data_postfunc frame
artist_indices_from_data_postfun = data_postfunc['artist_mbid']
print(f'Artists : We have {len(artist_indices_from_data_postfun)} rows with {len(np.unique(artist_indices_from_data_postfun))} unique artists')

# List of unique elements - used later
unique_users = np.unique(user_indices_from_data_postfun)
unique_artists = np.unique(artist_indices_from_data_postfun)

user_id to his index dict : {1: 0, 2: 1, 6: 2, 8: 3, 9: 4, 10: 5, 11: 6, 13: 7, 14: 8, 16: 9, 17: 10, 19: 11, 20: 12, 21: 13, 23: 14, 24: 15, 25: 16, 27: 17, 32: 18, 34: 19, 35: 20, 40: 21, 44: 22, 48: 23, 49: 24, 51: 25, 54: 26, 56: 27, 57: 28, 59: 29, 60: 30, 62: 31, 66: 32, 68: 33, 70: 34, 71: 35, 74: 36, 76: 37, 78: 38, 82: 39, 83: 40, 84: 41, 87: 42, 91: 43, 94: 44, 99: 45, 100: 46, 104: 47, 106: 48, 107: 49, 117: 50, 120: 51, 122: 52, 128: 53, 130: 54, 131: 55, 135: 56, 140: 57, 141: 58, 144: 59, 145: 60, 149: 61, 150: 62, 153: 63, 155: 64, 157: 65, 164: 66, 165: 67, 172: 68, 173: 69, 174: 70, 176: 71, 180: 72, 184: 73, 187: 74, 190: 75, 195: 76, 196: 77, 200: 78, 211: 79, 213: 80, 216: 81, 217: 82, 222: 83, 224: 84, 228: 85, 238: 86, 241: 87, 243: 88, 245: 89, 249: 90, 254: 91, 258: 92, 270: 93, 276: 94, 277: 95, 279: 96, 281: 97, 283: 98, 284: 99, 285: 100, 287: 101, 288: 102, 296: 103, 299: 104, 300: 105, 301: 106, 302: 107, 304: 108, 309: 109, 310: 110, 319: 111, 325: 112, 32
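
The iteration order of a Python `set` is not guaranteed, so the mapping above may differ between runs. A minimal sketch of a reproducible variant, with an inverse dictionary to recover a `user_id` from a matrix row index, could look like this. The toy ids and the names `user_id2index` and `index2user_id` are assumptions for illustration.

```python
# Toy list of user ids as they might appear in the frame (with repeats)
user_ids = [17240, 8294, 21151, 8294]

# Sorting the unique ids makes the id -> index mapping deterministic
user_id2index = {user: index for index, user in enumerate(sorted(set(user_ids)))}

# Inverse mapping: matrix row index -> original user id
index2user_id = {index: user for user, index in user_id2index.items()}

print(user_id2index)
print(index2user_id)
```

With the inverse mapping, a row index returned by the model can be translated back into the original ListenBrainz user id.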

In [None]:
# Sparse matrix with COO format
artist_user_plays = coo_matrix((data_postfunc['plays'], (artist_indices_from_data_postfun, user_indices_from_data_postfun)))
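
To illustrate how the COO constructor interprets its arguments, here is a small sketch with made-up values: each triple (value, row, column) places one play count at position (artist, user). The array names are assumptions for this example.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy data: 4 (artist, user, plays) triples for 3 artists and 2 users
plays = np.array([2, 81, 3, 1])
artist_idx = np.array([0, 1, 2, 1])  # row indices
user_idx = np.array([0, 0, 0, 1])    # column indices

# shape is optional; when omitted it is inferred from the largest indices
toy = coo_matrix((plays, (artist_idx, user_idx)), shape=(3, 2))

dense = toy.toarray()
print(dense)
```

Rows are artists and columns are users here, which is why the matrix is transposed later before being passed to the model.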

In [None]:
from implicit.nearest_neighbours import bm25_weight

# weight the matrix, both to reduce impact of users that have played the same artist thousands of times
# and to reduce the weight given to popular items
artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)

# get the transpose, since most of the functions in implicit expect (user, item) sparse matrices instead of (item, user)
user_plays = artist_user_plays.T.tocsr()



In [None]:
from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=2.0)
model.fit(user_plays)

  0%|          | 0/15 [00:00<?, ?it/s]

In [None]:
# Get recommendations for a single random user
# (userid is a row index of user_plays, not a ListenBrainz user_id)
import random
userid = random.randrange(user_plays.shape[0])
ids, scores = model.recommend(userid, user_plays[userid], N=10, filter_already_liked_items=False)
list_ = []
for i in ids:
  list_.append(artists_mbid2name[artist_mbid_indices2artist_mbid[unique_artists[i]]])

print(f'For this user : {userid}, the collaborative filtering recommends :')
pd.DataFrame({"artist": list_, "score": scores, "already_liked": np.isin(ids, user_plays[userid].indices)})

For this user : 6278, the collaborative filtering recommends :


Unnamed: 0,artist,score,already_liked
0,Alias Conrad Coldwood,0.528886,True
1,Pentagon,0.388292,True
2,永田権太,0.281805,True
3,Toby Fox,0.275514,False
4,Richard Jacques,0.274436,False
5,祖堅正慶,0.273329,False
6,Jackal Queenston,0.262407,False
7,steventhedreamer,0.261463,False
8,Darius,0.2606,False
9,RushJet1,0.258647,False


In [None]:
def get_key_from_value(d, val):
    return [k for k, v in d.items() if v == val]

# Mall Grab as an example : https://musicbrainz.org/artist/845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac

val = get_key_from_value(artist_mbid_indices2artist_mbid,'845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac')
print(f'For this MBID : 845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac, we have the artist number {val} in our list')

val = get_key_from_value(artists_mbid2name, 'Mall Grab')
print(f'And the artist name Mall Grab links to this MBID : {val}')

[24507]
['845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac']
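
`get_key_from_value` scans the whole dictionary on every call, which is slow on a mapping with about two million entries. A minimal sketch of an alternative is to invert the dictionary once and then do constant-time lookups. The helper name `invert_mapping` and the toy entries (including the two placeholder MBIDs) are assumptions for illustration.

```python
def invert_mapping(d):
    """Return {value: [keys, ...]}, keeping every key that shares a value."""
    inverted = {}
    for key, value in d.items():
        inverted.setdefault(value, []).append(key)
    return inverted

# Toy mapping; the last two MBIDs are made-up placeholders
toy_mbid2name = {
    "845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac": "Mall Grab",
    "aaaaaaaa-0000-0000-0000-000000000001": "Pentagon",
    "aaaaaaaa-0000-0000-0000-000000000002": "Pentagon",
}

name2mbids = invert_mapping(toy_mbid2name)
print(name2mbids["Mall Grab"])
print(name2mbids["Pentagon"])
```

Values are kept as lists because several artists can share the same name, so a name does not always identify a single MBID.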


In [None]:
# Check whether the input looks like an MBID (36-character string with 4 dashes); despite the function name, the ids checked here are artist MBIDs
def is_msid(data_in):
  if type(data_in) == str and len(data_in) == 36:
    cpt = 0
    for c in data_in:
      if c == '-':
        cpt +=1
    if cpt == 4:
      return True
  return False

# Get artists similar to a given artist; accepts an artist name, an MBID, or an integer index
def get_Reco_from_artist(data_in):
  if type(data_in) == str and not is_msid(data_in):
    try:
      corresp_numb = get_key_from_value(artist_mbid_indices2artist_mbid,get_key_from_value(artists_mbid2name, data_in)[0])[0]
    except:
      print(f'Error : Nothing found for this input : {data_in}')
      return -1

  elif type(data_in) == int:
    corresp_numb = data_in

  # If the input is an MBID:
  elif is_msid(data_in):
    try:
      corresp_numb = get_key_from_value(artist_mbid_indices2artist_mbid, data_in)[0]
    except:
      print(f'Error : Nothing found for this input : {data_in}')
      return -1

  else:
    print(f'Error : Nothing found for this input : {data_in}')
    return -1

  try:
    ids, scores = model.similar_items(corresp_numb)
  except:
    print(f'Error, not found for this number {corresp_numb}')
    return -1

  # Making the out dataframe
  list_ = []
  for i in ids:
    list_.append(artists_mbid2name[artist_mbid_indices2artist_mbid[unique_artists[i]]])
  out = pd.DataFrame({"artist": list_, "score": scores })

  return out
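
The `is_msid` check above only counts characters and dashes, so a string such as 32 letters `z` with four dashes would pass. A stricter check could rely on the standard `uuid` module; the sketch below is one possible variant, and the name `looks_like_mbid` is an assumption, not a function used elsewhere in this notebook.

```python
import uuid

def looks_like_mbid(value):
    """Return True if `value` is a canonical hyphenated UUID string."""
    if not isinstance(value, str) or len(value) != 36:
        return False
    try:
        # Round-trip through uuid.UUID: non-hex characters raise ValueError,
        # and misplaced dashes fail the comparison with the canonical form
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False

print(looks_like_mbid("845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac"))
print(looks_like_mbid("Mall Grab"))
```

This keeps the same interface as the dash-counting version, so it could be swapped in without changing the callers.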

We choose Mall Grab as an example https://musicbrainz.org/artist/845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac

MBID : 845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac

Name : 'Mall Grab'

Index in our matrix : 24507

In [None]:
# From the number in the list :
get_Reco_from_artist(24507)

Unnamed: 0,artist,score
0,Mall Grab,1.0
1,Overmono,0.830906
2,DJ Seinfeld,0.821291
3,DJ Boring,0.814272
4,Harrison BDP,0.809491
5,Asquith,0.80553
6,Dax J,0.80223
7,Riohv,0.798247
8,Baltra,0.795541
9,DJ Swagger,0.793326


In [None]:
# From the name :
get_Reco_from_artist('Mall Grab')

Unnamed: 0,artist,score
0,Mall Grab,1.0
1,Overmono,0.830906
2,DJ Seinfeld,0.821291
3,DJ Boring,0.814272
4,Harrison BDP,0.809491
5,Asquith,0.80553
6,Dax J,0.80223
7,Riohv,0.798247
8,Baltra,0.795541
9,DJ Swagger,0.793326


In [None]:
# From the MBID :
get_Reco_from_artist('845ddc8b-36ba-4d7b-9c6f-bd7623cd3cac')

Unnamed: 0,artist,score
0,Mall Grab,1.0
1,Overmono,0.830906
2,DJ Seinfeld,0.821291
3,DJ Boring,0.814272
4,Harrison BDP,0.809491
5,Asquith,0.80553
6,Dax J,0.80223
7,Riohv,0.798247
8,Baltra,0.795541
9,DJ Swagger,0.793326


In [None]:
get_Reco_from_artist('Leon Vynehall')

Unnamed: 0,artist,score
0,Leon Vynehall,1.0
1,Actress,0.843939
2,Floating Points,0.842779
3,Four Tet,0.828306
4,Daphni,0.82319
5,Mount Kimbie,0.811539
6,Anthony Naples,0.808387
7,Axel Boman,0.807609
8,Andy Stott,0.807032
9,Djrum,0.80381
