# [LightFM](https://github.com/lyst/lightfm)
[Docs](http://lyst.github.io/lightfm/docs/home.html)

Recommendation engines are complicated. 
There is much to discuss, but I will be brief.
Check out [this example](https://making.lyst.com/lightfm/docs/examples/hybrid_crossvalidated.html) and [this](https://medium.com/analytics-vidhya/matrix-factorization-made-easy-recommender-systems-7e4f50504477).

### Get the data

In real life, data doesn't arrive as nice sparse matrices, 
so we start from a tabular format and build the sparse matrices ourselves.
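
As a minimal sketch of that idea, here is how a table of (user, question) pairs can be turned into a SciPy sparse interaction matrix. The ids below are toy values made up for illustration, not the StackExchange data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interaction table: one entry per (user, question) pair.
users = np.array([0, 0, 1, 2])
questions = np.array([0, 2, 1, 2])

# Ids are 0-based, so the matrix needs max id + 1 rows and columns.
n_users = int(users.max()) + 1
n_questions = int(questions.max()) + 1

# Every observed pair becomes a 1 in the (user, question) cell.
interactions = csr_matrix(
    (np.ones(len(users)), (users, questions)),
    shape=(n_users, n_questions),
)
dense = interactions.toarray()
```

Only the observed pairs are stored, which is what keeps a matrix with many users and questions manageable.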

# The data
An optional helper vaex class for getting sparse matrices.

In [194]:
import vaex
import numpy as np
from scipy.sparse import csr_matrix,dok_matrix
import pyarrow as pa

@vaex.register_dataframe_accessor('sparse', override=True)
class Sparse(object):
    def __init__(self, df):
        self.df = df
        
    def to_csr(self):
        # Drop ids beyond the known maximum, in case we get new questions or users
        max_question = self.df.variables['max_question']
        max_user = self.df.variables['max_user']
        data = self.df[(self.df['question'] <= max_question) &
               (self.df['user'] <= max_user)]
        length = len(data)
        # Ids are 0-based, so the matrix needs max id + 1 rows and columns
        return csr_matrix((np.ones(length), 
                   (data['user'].values, data['question'].values)), 
                  shape=(max_user + 1, max_question + 1))
    
    def side_features(self):
        tag_count = self.df.variables['tags_count']
        max_question = self.df.variables['max_question']
        data = self.df[(self.df['question'] <= max_question)]
        S = dok_matrix((max_question + 1, 
                        tag_count + 1), 
                        dtype=np.int32)
        cache = set()
        questions = data.question.tolist()
        tags_ids = data.tags_ids.tolist()
        for row, row_tags in zip(questions, tags_ids):
            if row not in cache:
                for t in row_tags:
                    if t is not None:
                        S[row, t] = 1
                cache.add(row)
        return S.tocsr()
    
    @staticmethod
    def get_similar_tags(model, tag_id):
        # Define similarity as the cosine of the angle
        # between the tag latent vectors

        # Normalize the vectors to unit length
        tag_embeddings = (model.item_embeddings.T
                          / np.linalg.norm(model.item_embeddings, axis=1)).T

        query_embedding = tag_embeddings[tag_id]
        similarity = np.dot(tag_embeddings, query_embedding)
        most_similar = np.argsort(-similarity)[1:4]

        return most_similar


df = vaex.open('data/stack_exchange.parquet')
train = df[df['dataset']=='train']
test = df[df['dataset']=='test']
df.head(2)

| # | user | question | dataset | tags |
|---|------|----------|---------|------|
| 0 | 3 | 0 | train | ['bayesian', 'prior', 'elicitation'] |
| 1 | 13 | 0 | train | ['bayesian', 'prior', 'elicitation'] |
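
The `get_similar_tags` helper in the class above defines similarity as the cosine between embedding vectors. Here is a standalone NumPy sketch of the same ranking on a toy embedding matrix; the values are invented and stand in for a fitted model's embeddings.

```python
import numpy as np

# Toy embedding matrix: one row per tag, values chosen for illustration.
embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

def most_similar(embeddings, query_id, topk=2):
    # Scale every row to unit length so the dot product is the cosine.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit[query_id]
    # Sort descending and skip position 0, which is normally the query itself.
    return np.argsort(-similarity)[1:topk + 1]

neighbours = most_similar(embeddings, query_id=0)
```

Tag 1 points in almost the same direction as tag 0, so it ranks first; tag 3 points the opposite way and ranks last.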


### Reality vs Toy examples

If you have all the possible tags ahead of time, you can work with every tag and every question. That means counting the tags with `df` and not with `train`.

In reality, you might get new tags in production.

A similar issue applies to the shape of the sparse matrix. 
If you only have the train data at the beginning, `max_question` and `max_user` should be the maximum from `train` instead of from `df`.

To challenge ourselves, we will pretend we don't have the test data at all, which is the worst case.   
This gets us more robust modeling and more realistic results. 
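
A small pure-Python sketch of what "train-only" means for the tags: the vocabulary is built from the training rows, and any tag first seen later maps to `None` instead of raising. The tag names and the `encode` helper are made up for illustration.

```python
# Hypothetical toy rows; the tag names are invented.
train_tags = [["bayesian", "prior"], ["regression"]]
test_tags = [["bayesian", "neural-networks"]]

# Build the vocabulary from the training rows only.
vocabulary = {
    tag: i
    for i, tag in enumerate(sorted({t for row in train_tags for t in row}))
}

def encode(rows, vocabulary):
    # Unknown tags map to None instead of raising a KeyError.
    return [[vocabulary.get(tag) for tag in row] for row in rows]

encoded_test = encode(test_tags, vocabulary)

# Downstream code can then skip the unknown entries.
known_only = [[t for t in row if t is not None] for row in encoded_test]
```

Sorting the tags before enumerating is a choice made here so the ids are the same on every run; enumerating a plain `set` gives an arbitrary order.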

In [196]:
import numpy as np

tags = {tag:i for i,tag in enumerate(
    set([item for sublist in train['tags'].tolist() for item in sublist]))}

@vaex.register_function(on_expression=True)
def ids(ar):    
    return np.array([[tags.get(tag) for tag in x] for x in ar.tolist()], dtype=object)

train.add_function('ids', ids)
train['tags_ids'] = train.func.ids('tags')
train.variables['tags_count'] = len(tags)
train.variables['max_question'] = int(train.question.max())
train.variables['max_user'] = int(train.user.max())
train.head(2)

| # | user | question | dataset | tags | tags_ids |
|---|------|----------|---------|------|----------|
| 0 | 3 | 0 | train | ['bayesian', 'prior', 'elicitation'] | array([1038, 967, 931], dtype=object) |
| 1 | 13 | 0 | train | ['bayesian', 'prior', 'elicitation'] | array([1038, 967, 931], dtype=object) |


# A pure collaborative filtering model

In [197]:
# Import the model
from lightfm import LightFM
from lightfm.evaluation import auc_score
# Set the number of threads; you can increase this
# if you have more physical cores available.
NUM_THREADS = 16
NUM_COMPONENTS = 30
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

# Let's fit a WARP model: these generally have the best performance.
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Run 3 epochs and time it.
train_csr = train.sparse.to_csr()
%time model = model.fit(train_csr, epochs=NUM_EPOCHS, num_threads=NUM_THREADS)

train_auc = auc_score(model, train_csr, num_threads=NUM_THREADS).mean()
print('Collaborative filtering train AUC: %s' % train_auc)

CPU times: user 219 ms, sys: 13.5 ms, total: 232 ms
Wall time: 233 ms
Collaborative filtering train AUC: 0.8706144


How did we do on the test data?

In [198]:
from goldilox import Pipeline

pipeline = Pipeline.from_vaex(train)
test = pipeline.inference(test)
test_csr = test.sparse.to_csr()

test_auc = auc_score(model, test_csr, train_interactions=train_csr, num_threads=NUM_THREADS).mean()
print('Collaborative filtering test AUC: %s' % test_auc)



Collaborative filtering test AUC: 0.53614616


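Not very well: this is the cold-start problem. Pure collaborative filtering has little to go on for questions with few or no interactions in the training data.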
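
For intuition about the number: an AUC of 0.5 is what a random ranking would score, and 1.0 is a perfect ranking. Below is a rough NumPy sketch of a per-user AUC, computed as the fraction of (positive, negative) item pairs that are ranked correctly. It illustrates the metric and is not LightFM's implementation; the scores and labels are made up.

```python
import numpy as np

def user_auc(scores, positives):
    """Fraction of (positive, negative) item pairs where the positive scores higher."""
    scores = np.asarray(scores, dtype=float)
    positives = np.asarray(positives, dtype=bool)
    pos = scores[positives]
    neg = scores[~positives]
    # Compare every positive score against every negative score; ties count as half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Both positives outrank both negatives.
good = user_auc([0.9, 0.8, 0.1, 0.2], [True, True, False, False])
# One positive is ranked first, the other last.
mixed = user_auc([0.9, 0.1, 0.8, 0.2], [True, True, False, False])
```

Seen this way, a test AUC close to 0.5 means the model orders unseen questions barely better than chance.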

# Let's use side features
The StackExchange data comes with content information in the form of the tags users apply to their questions.
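
Roughly speaking, in a hybrid LightFM model an item's latent vector is the sum of the embeddings of its features, so a question with no interactions still gets a meaningful vector from its tags. A small NumPy sketch of that idea, using invented numbers rather than a fitted model:

```python
import numpy as np

# Rows are questions, columns are tags (1 = the question carries the tag).
item_features = np.array([
    [1, 1, 0],   # question 0: tags 0 and 1
    [0, 1, 1],   # question 1: tags 1 and 2
])

# One embedding per tag; invented values standing in for learned ones.
tag_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
])

# Each question vector is the sum of its tags' embeddings.
question_vectors = item_features @ tag_embeddings
```

Two questions that share tags end up with overlapping vectors, which is what lets the model generalise to questions it has not seen interactions for.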

In [199]:
# Define a new model instance
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)


train_item_features = train.sparse.side_features()
model = model.fit(train_csr,
                item_features=train_item_features,
                epochs=NUM_EPOCHS,
                num_threads=NUM_THREADS)

train_auc = auc_score(model,
                      train_csr,
                      item_features=train_item_features,
                      num_threads=NUM_THREADS).mean()
print(f"Hybrid training set AUC: {train_auc}")

Hybrid training set AUC: 0.8841671347618103


How do we do now?

In [200]:
test_item_features = test.sparse.side_features()
test_auc = auc_score(model,
                    test_csr,
                    train_interactions=train_csr,
                    item_features=test_item_features,
                    num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)

Hybrid test set AUC: 0.91697025


# Let's get it into production
* We use all the data.
* We create a recommendation column

In [201]:
tags = {tag:i for i,tag in enumerate(set([item for sublist in train['tags'].tolist() for item in sublist]))}

@vaex.register_function(on_expression=True)
def ids(ar):    
    return np.array([[tags.get(tag) for tag in x] for x in ar.tolist()], dtype=object)

df.add_function('ids', ids)
df['tags_ids'] = df.func.ids('tags')
df.variables['tags_count'] = len(tags)
df.variables['max_question'] = int(df.question.max())
df.variables['max_user'] = int(df.user.max())
df.head(2)

| # | user | question | dataset | tags | tags_ids |
|---|------|----------|---------|------|----------|
| 0 | 3 | 0 | train | ['bayesian', 'prior', 'elicitation'] | array([1038, 967, 931], dtype=object) |
| 1 | 13 | 0 | train | ['bayesian', 'prior', 'elicitation'] | array([1038, 967, 931], dtype=object) |


In [202]:
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)


item_features = df.sparse.side_features()
model = model.fit(df.sparse.to_csr(),
                item_features=item_features,
                epochs=NUM_EPOCHS,
                num_threads=NUM_THREADS)

In [208]:
# groupby-concatenate currently not supported in vaex
users = df[['user','question']].to_pandas_df() 
users_history = users.groupby(['user'])['question'].apply(list).to_dict()
questions = set(df['question'].unique())
# Candidate questions per user: everything the user has not interacted with yet
users_options = {user: questions.difference(history) for user, history in users_history.items()}
# value_counts is sorted by count; the index holds the question ids
most_popular = list(df['question'].value_counts().index[:5])
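
The same bookkeeping on toy data, as a pure-Python sketch: per-user history, the remaining candidate questions, and a most-popular fallback. The ids are made up, and `Counter` is used here instead of the pandas and vaex calls above.

```python
from collections import Counter

# Toy (user, question) interactions; the ids are made up.
interactions = [(1, 10), (1, 11), (2, 10), (3, 12), (3, 10)]

# Questions each user has already interacted with.
history = {}
for user, question in interactions:
    history.setdefault(user, set()).add(question)

# Candidates are all known questions minus the user's own history.
all_questions = {question for _, question in interactions}
options = {user: all_questions - seen for user, seen in history.items()}

# Fallback for users without candidates: the most frequent question ids.
counts = Counter(question for _, question in interactions)
most_popular = [question for question, _ in counts.most_common(2)]
```

Note that the fallback list holds question ids, not their counts.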

In [227]:
@vaex.register_function()
def recommend(ar, topk=5):
    ret = []
    for user in ar.tolist():
        # Unknown users have no entry; treat them like users with no options left
        user_options = list(users_options.get(user, []))
        if not user_options:
            ret.append(most_popular)
        else:
            scores = model.predict(np.repeat(user, len(user_options)), 
                                   user_options,
                                   item_features=item_features)
            # argsort gives positions in user_options; map them back to question ids
            top = scores.argsort()[-topk:][::-1]
            recommendations = [user_options[i] for i in top]
            if len(recommendations) == 0:
                recommendations = most_popular
            ret.append(recommendations)
    return np.array(ret)
df.add_function('recommend', recommend)
df['recommendations'] = df.user.recommend()
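
A standalone NumPy sketch of the top-k selection used in `recommend`, with made-up candidate ids and scores. The point to watch is that `argsort` returns positions within the candidate list, so they have to be looked up in that list to get question ids.

```python
import numpy as np

# Candidate question ids and their predicted scores (made-up values).
candidates = np.array([40, 7, 19, 3])
scores = np.array([0.2, 1.5, -0.3, 0.9])

topk = 2
# argsort is ascending: take the last k positions and reverse them.
positions = scores.argsort()[-topk:][::-1]
# The positions index into `candidates`; look up the actual ids.
top_ids = candidates[positions]
```

Here the best two scores sit at positions 1 and 3, which correspond to question ids 7 and 3.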

In [None]:
pipeline = Pipeline.from_vaex(df)
pipeline.raw = {"user":5}
pipeline.inference(pipeline.raw)

# A few words

This might look like a somewhat complicated way to do it.   
Recommendation engines are not very simple, and there are many edge cases.   
Things to look out for, none of which should crash your solution:
* New users.
* New tags.
* No user.
* No tags.
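
One way to handle the user-related cases is a thin wrapper that falls back to a default list instead of raising. This is a sketch under assumptions: `safe_recommend` and its `score` argument are hypothetical names, with `score` standing in for the model's predict step.

```python
def safe_recommend(user, options_by_user, score, fallback, topk=5):
    """Return up to `topk` item ids, falling back instead of raising.

    `score` is a hypothetical callable (user, items) -> list of floats
    standing in for the model's predict step.
    """
    # No user at all, or a user never seen in training.
    if user is None or user not in options_by_user:
        return list(fallback)[:topk]
    candidates = list(options_by_user[user])
    # A known user who has already seen everything.
    if not candidates:
        return list(fallback)[:topk]
    scores = score(user, candidates)
    # Pair each score with its item id and keep the highest-scoring ids.
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [item for _, item in ranked[:topk]]

# Toy stand-ins: score items by their own id.
options = {1: {10, 11, 12}, 2: set()}
popular = [99, 98]
by_id = lambda user, items: [float(i) for i in items]
```

With these stand-ins, `safe_recommend(1, options, by_id, popular, topk=2)` ranks the known user's candidates, while `None`, an unseen user id, or a user with no candidates left all return the fallback list.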