### What is LightFM?

**LightFM is a hybrid matrix factorisation model representing users and items as linear combinations of their content features’ latent factors. The model outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios (using both user and item metadata), and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant.**

In LightFM, as in a collaborative filtering model, users and items are represented as latent vectors (embeddings). As in a content-based model, however, these vectors are defined entirely by the embeddings of the content features that describe each user or item. For example, if the movie ‘Wizard of Oz’ is described by the features ‘musical fantasy’, ‘Judy Garland’, and ‘Wizard of Oz’, then its latent representation is given by the sum of these features’ latent representations. In doing so, LightFM unites the advantages of content-based and collaborative recommenders.
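As a minimal sketch of the idea (not LightFM's internal code), the snippet below builds an item vector by summing feature embeddings. The embedding values are random and the 4-dimensional size is an assumption chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-dimensional embeddings for three content features (random values).
feature_embeddings = {
    "musical fantasy": rng.normal(size=4),
    "Judy Garland": rng.normal(size=4),
    "Wizard of Oz": rng.normal(size=4),
}

# The item's latent vector is the sum of its features' latent vectors.
item_features = ["musical fantasy", "Judy Garland", "Wizard of Oz"]
item_vector = np.sum([feature_embeddings[f] for f in item_features], axis=0)

print(item_vector.shape)
```

An item described only by its own ID feature reduces to a plain collaborative embedding, which is why the model can fall back to pure matrix factorisation when no metadata is supplied.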

**How LightFM works**: Put simply, the LightFM model learns embeddings (latent representations in a high-dimensional space) for users and items in a way that encodes user preferences over items. When multiplied together (as a dot product), these representations produce a score for every item for a given user; items that score highly are more likely to interest the user.
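The scoring step can be sketched with plain NumPy. This is an illustration under simplifying assumptions: the vectors are random, the sizes are arbitrary, and the user and item bias terms that LightFM also learns are left out.

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, n_components = 5, 3  # illustrative sizes, not taken from the dataset

# Hypothetical learned embeddings: one user vector and one vector per item.
user_vector = rng.normal(size=n_components)
item_vectors = rng.normal(size=(n_items, n_components))

# One dot product per item gives that item's score for this user.
scores = item_vectors @ user_vector

# Items sorted from highest to lowest score form the recommendation list.
ranking = np.argsort(-scores)
print(scores.round(3), ranking)
```

Only the ordering of the scores matters for recommendation; the raw values are not probabilities.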

In [None]:
import numpy as np
import pandas as pd
from sklearn import model_selection
# import lightgbm as lgb
import os
import sys
import shutil
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
# from catboost import CatBoostClassifier


from lightfm import LightFM
import scipy.sparse as sp


!pip install pyunpack
!pip install patool
from pyunpack.cli import Archive
os.system('apt-get install p7zip')
print(os.getcwd()) #/kaggle/working

In [None]:
# Using datatable library for managing large datasets on Kaggle without fearing the out of memory error
import datatable as dt

In [None]:
%%time
directory = '/kaggle/working/'
Archive('/kaggle/input/kkbox-music-recommendation-challenge/train.csv.7z').extractall(directory)
Archive('/kaggle/input/kkbox-music-recommendation-challenge/test.csv.7z').extractall(directory)
Archive('/kaggle/input/kkbox-music-recommendation-challenge/songs.csv.7z').extractall(directory)
Archive('/kaggle/input/kkbox-music-recommendation-challenge/members.csv.7z').extractall(directory)
Archive('/kaggle/input/kkbox-music-recommendation-challenge/song_extra_info.csv.7z').extractall(directory)

#sys.exit("Error message")
train = dt.fread('./train.csv').to_pandas()
test = dt.fread('./test.csv').to_pandas()
songs = dt.fread('./songs.csv').to_pandas() #'composer', 'lyricist'
members = dt.fread('./members.csv').to_pandas()
songs_extra = dt.fread('./song_extra_info.csv',fill=True).to_pandas()

print('Data loading completed!')
print(train.shape, test.shape, songs.shape, members.shape, songs_extra.shape)

## EDA and Feature preprocessing
Exploring the train dataset

In [None]:
print("Train users: ", len(train.msno.unique()),"Train songs: ", len(train.song_id.unique()))
print("Test users: ", len(test.msno.unique()),"Test songs: ", len(test.song_id.unique()))

In [None]:
# List of songs and their repeat frequency
# Series.iteritems() was removed in pandas 2.0; to_dict() builds the same mapping
dict_count_song_played_train = train['song_id'].value_counts().to_dict()
dict_count_song_played_test = test['song_id'].value_counts().to_dict()
# dict_count_song_played_train

In [None]:
for col in train.columns:
    if train[col].dtype == object:
        train[col] = train[col].astype('category')
        test[col] = test[col].astype('category')

#### Treatment of songs and user names
The userid and songid values are long alphanumeric strings that are difficult to interpret, so we will map them onto readable names to make the insights drawn from them easier to understand. <br>

Approach: Although we have 30,755 users in the train set, we have been able to curate only 149 Korean names. We could map these names to users at random, or we could map them to the top users of this music app to draw deeper insight into their music listening behaviour.

In [None]:
# Fetching Song names
train= train.merge(songs_extra, on= 'song_id', how='left')
train.rename(columns= {'msno':'userid','target':'repeat_listener','name':'song'}, inplace=True)
train[:3]

I have randomly selected 139 Korean usernames to rename some of the top user IDs, just to make our results easier to interpret!

In [None]:
username= pd.read_csv('/kaggle/input/names/Korean_names.csv')
names= username.loc[:,'name'].tolist()
names[:5]

In [None]:
%%time
# User name manipulation

# Storing the first 10 characters of the userid in the user column
train['user']= train['userid'].str[:10]

# Fetching a list of the top users
top_users= train.user.value_counts()[:139].index.tolist()

# Creating a dict with username mapping
user_mapping= {top_users[i]: names[i] for i in range(len(top_users))}

# Assign names wrt name mapping
train= train.replace({'user': user_mapping, 'repeat_listener':{True:1, False:0}})
train.user.value_counts()[:15]

In [None]:
train.sample(5)

#### Missing Value Imputation

To feed data into the recommendation engine, we need a user-item dataset in the form of a sparse matrix. For this purpose we will use the columns user and song.

In [None]:
# For recommendation algorithm we need the dataframe in the user-item matrix format
df= train[['user','song','repeat_listener']]
df.head(5)

In [None]:
df.isnull().sum()

In [None]:
# Checking out a user
df.loc[df.user=='Gyeong-nim']

### Sparsity challenge
Even though it might be possible to pivot the data, pivoting isn't the best strategy because user-item data is notoriously sparse. Instead, we'll create a sparse user-item matrix with the coo_matrix function from scipy.sparse. The following is lifted from the coo_matrix docstring:
<pre>

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

sp.coo_matrix((data, (row, col)), shape=(3, 3)).todense()
</pre>

In [None]:
# Convert boolean to integer


In [None]:
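# coo_matrix needs integer row/column indices, so encode the string labels first
pairs= train[['user','song','repeat_listener']].dropna()
user_idx, user_labels= pd.factorize(pairs['user'])
song_idx, song_labels= pd.factorize(pairs['song'])
ratings= pairs['repeat_listener'].to_numpy()

sp.coo_matrix((ratings, (user_idx, song_idx)), shape=(len(user_labels), len(song_labels)))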

In [None]:
# Creating a simple class to return the coo matrix in the form of an interactions matrix

class Interactions:
    def __init__(self):
        self.user_encoder= LabelEncoder()
        self.song_encoder= LabelEncoder()
    
    def fit(self, users, songs):
        self.user_encoder.fit(users)
        self.song_encoder.fit(songs)
        return self
    
    def transform(self, users, songs, ratings=None):
        if ratings is None:
            ratings= [1]* len(users)
        uid= self.user_encoder.transform(users)
        iid= self.song_encoder.transform(songs)
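        # Size the matrix by the fitted vocabularies so the shape is the same for any subset
        n_users= len(self.user_encoder.classes_)
        n_songs= len(self.song_encoder.classes_)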
        interactions= sp.coo_matrix((ratings, (uid, iid)), shape= (n_users, n_songs))
        return interactions

In [None]:
# Instantiate an interactions machine

interactions= Interactions()
interactions.fit(df['user'], df['song'])

matrix= interactions.transform(df['user'], df['song'], df['repeat_listener'])

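print("Original train size:",sys.getsizeof(train))
# sys.getsizeof only measures the wrapper object of a sparse matrix, so sum its underlying arrays
print("Coordinate train size:",matrix.data.nbytes + matrix.row.nbytes + matrix.col.nbytes)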

In [None]:
# You can take a peek at a small corner using toarray(); densifying the full matrix may exhaust memory
matrix.tocsr()[:5, :10].toarray()

## LightFM

In [None]:
model= LightFM()
model.fit(matrix)

# model= LightFM(no_components=100, k=5, learning_rate=0.05, random_state=33)

# model.fit(matrix,epochs= 50, num_threads= 2)

#### Recommendation analysis for a particular user
Predict how likely the user Dong-geon is to have recurring listening event(s) triggered within a month for the songs 'Good Grief' and 'Sleep Without You'.

In [None]:
print("User encoding: ", interactions.user_encoder.transform(['Dong-geon']))
print("Song encoding: ", interactions.song_encoder.transform(['Good Grief','Sleep Without You']))

In [None]:
model.predict(7425, [51441, 121825])

# These scores carry no meaning on their own (they are not probabilities); they only rank items for a user, and lower (negative) values signify a lower likelihood

In [None]:
# Taking Dong-geon once again and running through all the songs to get the closest matches of his liking
model.predict(7425, np.arange(len(interactions.song_encoder.classes_)))[:10]

# Storing them into a dataframe
songs_7425= pd.DataFrame({'song': interactions.song_encoder.classes_,
                          'pred': model.predict(7425, np.arange(len(interactions.song_encoder.classes_)))}).sort_values('pred', ascending=False).head(10)

songs_7425

In [None]:
matrix

 **EVALUATION AND TUNING OUR MODEL**

In [None]:
from lightfm.evaluation import auc_score, precision_at_k
print("auc:",auc_score(model, matrix, num_threads=4).mean())
print("prec:",precision_at_k(model, matrix, k=10, num_threads=4).mean())

**AUC SCORE**<br>
AUC measures the quality of the overall ranking. In the binary case, it can be interpreted as the probability that a randomly chosen positive item is ranked higher than a randomly chosen negative item.<br> 
Consequently, an AUC close to 1.0 will suggest that, by and large, your ordering is correct: and this can be true even if none of the first K items are positives. This metric may be more appropriate if you do not exert full control on which results will be presented to the user; it may be that the first K recommended items are not available any more (say, they are out of stock), and you need to move further down the ranking. A high AUC score will then give you confidence that your ranking is of high quality throughout.
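To make the pairwise interpretation concrete, here is a small sketch that computes AUC for a single user by comparing every positive item with every negative item. The helper name `pairwise_auc` and the toy scores are assumptions for illustration; this is not how `lightfm.evaluation.auc_score` is implemented.

```python
import numpy as np

def pairwise_auc(scores, is_positive):
    """Fraction of (positive, negative) pairs in which the positive item scores higher."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos = scores[is_positive]
    neg = scores[~is_positive]
    # Compare every positive score with every negative score; ties count as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: items 0 and 3 are the known positives.
toy_scores = [0.9, 0.2, 0.7, 0.4]
toy_positives = [True, False, False, True]
print(pairwise_auc(toy_scores, toy_positives))  # 3 of the 4 pairs are ordered correctly: 0.75
```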

**PRECISION AT K**<BR>
Measure the precision at k metric for a model: the fraction of known positives in the first k positions of the ranked list of results. A perfect score is 1.0.
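The definition can be checked by hand on a toy example. The helper `precision_at_k_single` below is a hypothetical name for a single-user sketch, not the LightFM function.

```python
import numpy as np

def precision_at_k_single(scores, is_positive, k):
    """Fraction of the k highest-scoring items that are known positives."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    top_k = np.argsort(-scores)[:k]  # indices of the k best-scored items
    return is_positive[top_k].sum() / k

# Toy example: items 0 and 3 are the known positives.
toy_scores = [0.9, 0.2, 0.7, 0.4, 0.1]
toy_positives = [True, False, False, True, False]
print(precision_at_k_single(toy_scores, toy_positives, k=2))  # top 2 are items 0 and 2: 0.5
```

Unlike AUC, this metric ignores everything below position k, so it is the more relevant number when only the top of the list is shown to the user.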

In [None]:
from lightfm.cross_validation import random_train_test_split
train, test= random_train_test_split(matrix, test_percentage=0.2)

In [None]:
# model = LightFM()
# model.fit(train, epochs=500)

In [None]:
model = LightFM()

scores = []
for e in range(100):
    model.fit_partial(train, epochs=1)
    auc_train = auc_score(model, train).mean()
    auc_test = auc_score(model, test).mean()
    scores.append((auc_train, auc_test))
    
scores = np.array(scores)

In [None]:
from matplotlib import pyplot as plt

%matplotlib inline

plt.plot(scores[:, 0], label='train')
plt.plot(scores[:, 1], label='test')
plt.legend()

In [None]:
# random_train_test_split(matrix)

### Loss Evaluation

**WARP**: Weighted Approximate-Rank Pairwise loss. Maximises the rank of positive examples by repeatedly sampling negative examples until a rank-violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision@k) is desired.
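The sampling idea can be sketched as follows. This is a simplified illustration, not LightFM's implementation: the function name `warp_sample`, the margin of 1.0, the trial limit, and the assumption that every other item counts as a negative are all choices made for the example.

```python
import numpy as np

def warp_sample(scores, pos_item, rng, margin=1.0, max_trials=10):
    """Sample negatives until one violates the ranking.

    Returns (neg_item, trials, weight), or None if no violation is found.
    """
    scores = np.asarray(scores, dtype=float)
    n_items = len(scores)
    # Simplification: treat every item other than the positive as a negative.
    candidates = np.delete(np.arange(n_items), pos_item)
    for trial in range(1, max_trials + 1):
        neg_item = int(rng.choice(candidates))
        # A violation: the negative scores within `margin` of the positive, or above it.
        if scores[neg_item] > scores[pos_item] - margin:
            # Finding a violator quickly suggests the positive is ranked low,
            # so the update receives a larger weight.
            weight = np.log(max(1.0, (n_items - 1) // trial))
            return neg_item, trial, weight
    return None

rng = np.random.default_rng(0)
violation = warp_sample([0.0, 5.0, 5.0, 5.0], pos_item=0, rng=rng)
no_violation = warp_sample([5.0, 0.0, 0.0, 0.0], pos_item=0, rng=rng)
print(violation, no_violation)
```

When the positive item already outranks every negative by the margin, no update is made, which is why WARP concentrates its effort on the top of the list.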

In [None]:
from lightfm.evaluation import auc_score, precision_at_k
model= LightFM(loss='warp')
scores=[]
for e in range(25):
    model.fit_partial(train, epochs=1, num_threads=4)
    auc_train= auc_score(model, train, num_threads=4).mean()
    auc_test= auc_score(model, test, num_threads=4).mean()
    scores.append((auc_train, auc_test))
    
scores = np.array(scores)

In [None]:
from matplotlib import pyplot as plt

%matplotlib inline

plt.plot(scores[:,0], label='train')
plt.plot(scores[:,1], label='test')
plt.legend()

**BPR**: <br>Bayesian Personalised Ranking - pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
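For a single (positive, negative) pair, the BPR objective is commonly written as the negative log-sigmoid of the score difference. The sketch below shows only that formula on made-up scores; the helper name `bpr_pair_loss` is an assumption, and regularisation and the sampling loop are omitted.

```python
import numpy as np

def bpr_pair_loss(pos_score, neg_score):
    """-log(sigmoid(pos_score - neg_score)) for one positive/negative pair."""
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp evaluates it stably.
    return np.logaddexp(0.0, -(pos_score - neg_score))

# The loss shrinks as the positive item is scored further above the negative one.
print(bpr_pair_loss(2.0, 0.0))  # positive ranked above negative: small loss
print(bpr_pair_loss(0.0, 2.0))  # positive ranked below negative: large loss
```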

In [None]:
#Loss- 'bpr'
model= LightFM(loss='bpr')

scores=[]
for e in range(25):
    model.fit_partial(train, epochs=1, num_threads=4)
    auc_train = auc_score(model, train, num_threads=4).mean()
    auc_test= auc_score(model, test, num_threads=4).mean()
    scores.append((auc_train, auc_test))
    
scores= np.array(scores)

In [None]:
from matplotlib import pyplot as plt

plt.plot(scores[:,0], label='train')
plt.plot(scores[:,1], label='test')
plt.legend()

**Let's design EARLY STOPPING to obtain the best model in the fewest training epochs**

In [None]:
from copy import deepcopy

model= LightFM(loss='bpr')

count = 0
best = 0
scores = []
for e in range(50):
    if count>5:
        break
    model.fit_partial(train, epochs=1)
    auc_train= auc_score(model, train).mean()
    auc_test= auc_score(model, test).mean()
    print(f'Epoch: {e}, Train AUC={auc_train:.3f}, Test AUC={auc_test:.3f}')
    scores.append((auc_train, auc_test))
    if auc_test > best:
        best_model = deepcopy(model)
        best = auc_test
        count = 0  # reset the patience counter whenever the test AUC improves
    else:
        count += 1

model= deepcopy(best_model)

References: 
https://making.lyst.com/lightfm/docs/lightfm.html
https://towardsdatascience.com/using-pythons-datatable-library-seamlessly-on-kaggle-f221d02838c7

Do refer to this well-documented kernel on LightFM:
   https://www.kaggle.com/niyamatalmass/lightfm-hybrid-recommendation-system

**Why LightFM**:<br>

In both cold-start and low-density scenarios, LightFM performs at least as well as pure content-based models, substantially outperforming them when either collaborative information is available in the training set or user features are included in the model. This is really useful for our music recommendation system because we will have many new songs and users, which is exactly the setting in which the cold-start problem arises.

When collaborative data is abundant (warm-start, dense user-item matrix), LightFM performs at least as well as the Matrix Factorization model.

Embeddings produced by LightFM encode important semantic information about features, and can be used for related recommendation tasks such as tag recommendations. This is also relevant to our problem: similar feature embeddings (for example, for genres or artists) could help the model recommend songs that resemble those a user already listens to.
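One common way to use such embeddings is a cosine-similarity lookup. The sketch below uses made-up 2-dimensional vectors and hypothetical genre labels; with a fitted model the rows would instead come from the learned item or feature embeddings.

```python
import numpy as np

def most_similar(embeddings, labels, query, top_n=2):
    """Return the top_n labels whose embeddings are closest to the query by cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = labels.index(query)
    sims = normed @ normed[q]  # cosine similarity of every row with the query row
    order = [i for i in np.argsort(-sims) if i != q][:top_n]
    return [(labels[i], float(sims[i])) for i in order]

# Hypothetical feature labels and embeddings, chosen only for illustration.
labels = ["rock", "hard rock", "classical"]
embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0]])
print(most_similar(embeddings, labels, "rock", top_n=1))
```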

***Want to learn more about the LightFM library?*** <br>
If you want to dive deeper into how to use this library, please visit its official page: https://making.lyst.com/lightfm/docs/index.html.