This notebook serves as study and record material for the course MAFS6010S Machine Learning and Applications. It tries to understand and apply some techniques to the KKBOX music recommendation challenge.
The notebook starts with a basic analysis of the raw data, followed by model development and training.

## 1. Initial Exploratory Study of the Raw Data Set

In [1]:
import numpy as np
import pandas as pd
import sklearn.metrics
import datetime
# scan through the raw data files
# import source data for review
from os import listdir
f = listdir("./data/")
f

['songs.csv',
 '.DS_Store',
 'test.csv',
 'members.csv',
 'kkbox-music-recommendation-challenge.zip',
 'train.csv',
 'sample_submission.csv',
 'song_extra_info.csv']

In [2]:
# loading training data set for initial study
train_df = pd.read_csv("./data/train.csv")
train_df.head(5)

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,target
0,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=,explore,Explore,online-playlist,1
1,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM=,my library,Local playlist more,local-playlist,1
2,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY=,my library,Local playlist more,local-playlist,1
3,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs=,my library,Local playlist more,local-playlist,1
4,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc=,explore,Explore,online-playlist,1


In [3]:
# start with some basic study of the data source
# 1. dimension
train_df.shape

(7377418, 6)

In [4]:
# 2. total number of replayed user-song pairs (rows with target = 1)
train_df["target"].sum()

3714656

In [5]:
# 3. number of unique users
train_df["msno"].nunique()

30755

In [6]:
# 4. number of unique songs in training set
train_df["song_id"].nunique()

359966

In [7]:
# 5. check for duplicate user-song pairs:
# if the row count after dropping duplicates still equals train_df.shape[0], every pair is unique
train_user_song_pr = train_df[['msno','song_id']].drop_duplicates()
train_user_song_pr.shape

(7377418, 2)
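
An explicit way to count repeated user-song pairs is sketched below on a toy frame. The `toy_df` rows are made up for illustration; only the column names come from `train.csv`.

```python
import pandas as pd

# Hypothetical toy data: the pair (u1, s1) appears twice
toy_df = pd.DataFrame({
    "msno":    ["u1", "u1", "u2", "u1"],
    "song_id": ["s1", "s2", "s1", "s1"],
    "target":  [1, 0, 1, 1],
})

# True for every row whose (msno, song_id) pair already appeared in an earlier row
dup_mask = toy_df.duplicated(subset=["msno", "song_id"])
n_duplicates = int(dup_mask.sum())
print(n_duplicates)
```

On the real data the same call would be `train_df.duplicated(subset=['msno', 'song_id']).sum()`; a result of 0 would support the statement that every row is a unique user-song pair.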

Based on the study above, we have collected the following information:
1. there are no duplicate rows; each row stands for one unique user-song pair
2. the total number of unique users in the training data set is 30755
3. the total number of unique songs in the training data set is 359966, far more than the number of users, so the user-song matrix is sparse
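
The sparsity can be quantified directly from the counts reported above. The sketch below assumes each training row fills exactly one cell of a full user-by-song matrix; the variable names are illustrative.

```python
# Figures reported by the cells above
n_rows = 7377418    # rows (user-song pairs) in train.csv
n_users = 30755     # unique msno
n_songs = 359966    # unique song_id

cells = n_users * n_songs           # size of a full user-by-song matrix
density = n_rows / cells            # share of cells that are observed
songs_per_user = n_rows / n_users   # average number of rows per user

print(f"density = {density:.6f}, rows per user = {songs_per_user:.1f}")
```

With these figures roughly 0.07% of the cells are filled, and a user has about 240 rows on average.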

In [8]:
# From the observations above, the user-song matrix will be wide
# More importantly, the matrix will be very sparse:
# on average one user has around 240 songs in the training set (7377418 / 30755), far fewer than the total number of songs (359966)
# First, try dimension reduction for the songs based on the song attributes
song_feature_df = pd.read_csv("./data/songs.csv")
song_feature_df.head(5)

Unnamed: 0,song_id,song_length,genre_ids,artist_name,composer,lyricist,language
0,CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E=,247640,465,張信哲 (Jeff Chang),董貞,何啟弘,3.0
1,o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU=,197328,444,BLACKPINK,TEDDY| FUTURE BOUNCE| Bekuh BOOM,TEDDY,31.0
2,DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0=,231781,465,SUPER JUNIOR,,,31.0
3,dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE=,273554,465,S.H.E,湯小康,徐世珍,3.0
4,W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o=,140329,726,貴族精選,Traditional,Traditional,52.0


In [9]:
# total number of songs from the song-base
song_feature_df["song_id"].nunique()

2296320

In [10]:
# total number of genre from the song-base
song_feature_df["genre_ids"].nunique()

1045

## 2. Data Cleaning/ Restructuring and Model Development
As the data set is huge, we first reduce its dimension by grouping the songs by music genre only. After this transformation the song dimension is smaller than the user dimension, so item-based (here genre-based) collaborative filtering is considered.

In [11]:
# training data preprocessing
user_id=list(train_df.msno.unique())
song_id=list(train_df.song_id.unique())

In [12]:
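# look up the genre_ids of every training row; a left merge keeps the row order of train_df
df1=pd.merge(train_df[["song_id"]], song_feature_df[['song_id','genre_ids']],how='left',on='song_id')
# assumes song_id is unique in songs.csv, so df1 has exactly one row per training row
train_df["genre"]=df1["genre_ids"]
train_df.head(3)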

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,target,genre
0,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=,explore,Explore,online-playlist,1,359
1,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM=,my library,Local playlist more,local-playlist,1,1259
2,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY=,my library,Local playlist more,local-playlist,1,1259


After this first manipulation we convert the raw data into a user-genre matrix and use its genre columns to calculate the similarity between genres.
We use the ratio below to represent a user's preference for (probability of listening to) a given genre:
N(U,G)/N(U), the number of songs from genre G that user U replayed divided by the total number of songs that user U replayed.
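
A small worked example of the index as the later cells compute it, that is, a user's replays in one genre divided by all of that user's replays. The toy rows and the column name `user_total` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy log: one row per (user, song), labelled with the song's genre
toy = pd.DataFrame({
    "msno":   ["u1", "u1", "u1", "u1", "u2", "u2"],
    "genre":  ["465", "465", "458", "458", "465", "458"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Replayed songs per (user, genre)
per_genre = toy.groupby(["msno", "genre"], as_index=False)["target"].sum()
# Replayed songs per user, broadcast back onto the per-genre rows
per_genre["user_total"] = per_genre.groupby("msno")["target"].transform("sum")
# Share of the user's replays that fall into each genre
per_genre["percent"] = per_genre["target"] / per_genre["user_total"]
print(per_genre)
```

Here user u1 replayed three songs, two of them in genre 458, so the index for (u1, 458) is 2/3. A user with no replays at all would give 0/0, which needs separate handling.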

In [40]:
# then group the songs into same genres:
train_clean = train_df[["msno","genre","target"]].groupby(["msno","genre"],as_index=False).agg({"target":["sum","count"]})
train_clean.columns = train_clean.columns.get_level_values(0)
train_clean.columns = ["msno","genre","sum","count"]
train_clean.head(5)

Unnamed: 0,msno,genre,sum,count
0,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1180,0,1
1,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1259,27,66
2,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,139,5,5
3,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1609,12,19
4,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1616,0,9


In [27]:
# group the data by users to obtain aggregated stat for each user
train_total = train_df[["msno","genre","target"]].groupby(["msno"],as_index=False).agg({"target":["sum","count"]})
train_total.columns = train_total.columns.get_level_values(0)
train_total.columns = ["msno","sum","count"]
train_total["percent"] = round(train_total["sum"]/train_total["count"],5)
train_total.head(3)

Unnamed: 0,msno,sum,count,percent
0,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,293,589,0.49745
1,++AH7m/EQ4iKe6wSlfO/xXAJx50p+fCeTyF90GoE9Pg=,141,220,0.64091
2,++e+jsxuQ8UEnmW40od9Rq3rW7+wAum4wooXyZTKJpk=,76,108,0.7037


In [41]:
# calculate the relative interest level of certain genre by using # of replayed songs in Genre A/ All # of replayed songs.
train_clean = pd.merge(train_clean, train_total[["msno","sum"]], on='msno',how="left")
train_clean["percent"]=train_clean["sum_x"].div(train_clean["sum_y"],level=0)
train_clean.head(10)

Unnamed: 0,msno,genre,sum_x,count,sum_y,percent
0,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1180,0,1,293,0.0
1,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1259,27,66,293,0.09215
2,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,139,5,5,293,0.017065
3,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1609,12,19,293,0.040956
4,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1616,0,9,293,0.0
5,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,1616|1609,4,15,293,0.013652
6,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,2022,3,13,293,0.010239
7,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,359,26,37,293,0.088737
8,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,451,6,15,293,0.020478
9,++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,458,42,76,293,0.143345


In [43]:
df_output = train_clean.pivot(index='msno', columns='genre', values='percent').replace(np.nan,0)
df_output.head(5)

msno,1000,1007,1011,1011|2189|367,1011|359,1011|691,1019,1026,1033,1040,...,958|2022,958|2122,958|691,958|786,958|947,965,972,979,986,993|751
++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++AH7m/EQ4iKe6wSlfO/xXAJx50p+fCeTyF90GoE9Pg=,0.0,0.0,0.021277,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++e+jsxuQ8UEnmW40od9Rq3rW7+wAum4wooXyZTKJpk=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++xWL5Pbi2CpG4uUugigQahauM0J/sBIRloTNPBybIU=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
+/SKX44s4ryWQzYzuV7ZKMXqIKQMN1cPz3M8CJ8CFKU=,0.0,0.0,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
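
The long-to-wide step can be seen on a toy frame. This is only a sketch with made-up values; `fillna(0)` plays the same role as `replace(np.nan,0)` in the cell above.

```python
import pandas as pd

# Hypothetical long table: one row per (user, genre) with its interest index
long_df = pd.DataFrame({
    "msno":    ["u1", "u1", "u2"],
    "genre":   ["458", "465", "458"],
    "percent": [0.75, 0.25, 1.0],
})

# One row per user, one column per genre; pairs that never occur become 0
wide = long_df.pivot(index="msno", columns="genre", values="percent").fillna(0)
print(wide)
```

User u2 has no row for genre 465 in the long table, so that cell is filled with 0 in the wide matrix.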


In [44]:
# calculate cosine similarity
df_item_similarity = sklearn.metrics.pairwise.cosine_similarity(df_output.T, dense_output=True)
df_item_similarity[:5]

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 2.50360875e-01, 2.57546787e-01],
       [0.00000000e+00, 1.00000000e+00, 1.78711569e-05, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.78711569e-05, 1.00000000e+00, ...,
        3.97123654e-02, 5.44494310e-03, 1.27131084e-03],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

## 3. Train the Model and Produce Prediction
One limitation of collaborative filtering is the "cold start" problem, where no information about a user or song was captured in the training data set.
With the similarity matrix obtained above, the model is first fitted to the training data to estimate its performance.
For a user U:
1. if the user is included in the training data set:

    1.1 for songs in genre G, if G is among the genres included in the training data set:
        use the replay rates of the top x most similar genres to calculate the score (where x is the tuning parameter)

    1.2 if G is not included in the training data set:
        use the average replay rate of user U over all genres

2. if the user is not included in the training data set:

    2.1 if the genre G is included in the training set:
        use the average replay rate of genre G across all users

    2.2 if the genre G is not included in the training set:
        use the average replay rate for all genres across all users

After obtaining the replay probability P(U,G), choose a threshold pr: if P(U,G) > pr, we set Replay(U,G) = 1, otherwise Replay(U,G) = 0.
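
The four cases and the threshold can be sketched as a small dispatch function. Everything here is hypothetical scaffolding: `user_rate`, `genre_rate`, `global_rate` and `similarity_score` stand in for lookups that would have to be built from the training data, and the numbers are made up.

```python
def replay_score(user, genre, user_rate, genre_rate, global_rate, similarity_score):
    """Choose the score source for one (user, genre) pair following cases 1.1 to 2.2."""
    known_user = user in user_rate
    known_genre = genre in genre_rate
    if known_user and known_genre:
        return similarity_score(user, genre)  # 1.1: similarity-weighted score
    if known_user:
        return user_rate[user]                # 1.2: user's average replay rate
    if known_genre:
        return genre_rate[genre]              # 2.1: genre's average replay rate
    return global_rate                        # 2.2: overall average replay rate


def predict_replay(score, pr):
    """Apply the threshold pr: 1 if the score exceeds it, else 0."""
    return 1 if score > pr else 0


# Made-up lookups for illustration
user_rate = {"u1": 0.6}
genre_rate = {"465": 0.4}
global_rate = 0.5

def similarity_score(user, genre):
    return 0.7  # stand-in for the top-x similar-genre computation

pairs = [("u1", "465"), ("u1", "999"), ("u9", "465"), ("u9", "999")]
scores = [replay_score(u, g, user_rate, genre_rate, global_rate, similarity_score)
          for u, g in pairs]
predictions = [predict_replay(s, pr=0.5) for s in scores]
print(scores, predictions)
```

The four toy pairs hit cases 1.1, 1.2, 2.1 and 2.2 in that order. Because the comparison is strict, a score exactly equal to pr is predicted as 0.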

In [295]:
# initialize the tuning parameters
ls_x = list(range(3,5))
ls_pr = np.arange(0.2, 0.65, 0.15).tolist()

In [296]:
# for model training
def model(x, pr, n_rows=1000000):
    # score the first n_rows of train_df (columns msno, genre) and threshold each score at pr
    # fallback for rows whose song has no genre: overall replay rate, computed once
    global_rate = train_clean["sum_x"].sum() / train_clean["count"].sum()
    # map each genre label to its row/column position in df_item_similarity
    genre_pos = {g: i for i, g in enumerate(df_output.columns)}
    train_pred = []
    for userid, genre in zip(train_df["msno"][:n_rows], train_df["genre"][:n_rows]):
        if pd.isna(genre):
            k = global_rate
        else:
            lt_genre = df_item_similarity[genre_pos[genre]]  # similarity of this genre to every genre
            # top x+1 positions: the genre itself (similarity 1) plus its x most similar genres
            idx_genre = np.argsort(-lt_genre)[:x + 1]
            simi_genre = lt_genre[idx_genre]
            # the user's interest in those genres, selected by position
            interest_user = df_output.loc[userid].to_numpy()[idx_genre]
            # similarity-weighted sum of the user's interest
            k = float(np.dot(interest_user, simi_genre))
        train_pred.append(1 if k > pr else 0)
    return train_pred

In [297]:
column_names = ["no of reference","threshold","precision"]
prediction_results = pd.DataFrame(columns=column_names)
observed = list(train_df["target"][:1000000])

In [298]:
# collect one result row per (x, pr) combination
results = []
for x in ls_x:
    for pr in ls_pr:
        pred = model(x, pr)
        tn, fp, fn, tp = sklearn.metrics.confusion_matrix(observed, pred).ravel()
        precision = tp/(tp+fp)
        results.append(pd.DataFrame([[x,pr,precision]],columns=column_names))
# DataFrame.append was removed in pandas 2.0, so concatenate the rows instead
prediction_results = pd.concat(results)

In [299]:
prediction_results

Unnamed: 0,no of reference,threshold,precision
0,3,0.2,0.712162
0,3,0.35,0.71373
0,3,0.5,0.718878
0,4,0.2,0.712173
0,4,0.35,0.713762
0,4,0.5,0.718394
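
The grid above records precision only, although the confusion matrix also yields recall and accuracy. A minimal sketch of how the three relate is given below; the helper name `confusion_summary` and the counts are made up and are not results from this notebook.

```python
def confusion_summary(tn, fp, fn, tp):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of predicted replays that were real
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of real replays that were predicted
    accuracy = (tp + tn) / (tn + fp + fn + tp)         # share of all rows predicted correctly
    return precision, recall, accuracy


# Made-up counts for illustration only
precision, recall, accuracy = confusion_summary(tn=40, fp=10, fn=30, tp=20)
print(precision, recall, accuracy)
```

A high threshold can raise precision while lowering recall, so looking at both may help when choosing pr.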
