# Model Input and Output Examples

This notebook demonstrates and clarifies the inputs and outputs of LightFM recommender models. I've modified some of the code from examples in the LightFM documentation to do this. We'll start by installing lightfm (if it isn't already installed), import a demo dataset, fit a model, and then walk through the inputs and outputs of prediction.

Please note that this code is for demonstration only. Given its purpose, no effort is made to evaluate the models. It is not polished code and is only intended to be illustrative.

In [None]:
# Run only if lightfm is not installed. 
!conda install -c conda-forge lightfm -y 

In [52]:
# Import required libraries.  
import numpy as np
from lightfm import LightFM
from lightfm import evaluation
from lightfm.datasets import fetch_stackexchange


## Stack Exchange Demo Dataset

If you're curious about the dataset, you can find [more information here](https://making.lyst.com/lightfm/docs/_modules/lightfm/datasets/stackexchange.html). The most important thing to note is that the train and test datasets cover the same users and items (i.e., they are matrices with the same shape). The difference between them is that some of the user-item interactions are held out of the training set and placed in the test set for evaluating the accuracy of the recommender system.
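To make the "same shape, different interactions" idea concrete, here is a minimal sketch on a made-up toy matrix. The random hold-out below is an assumption for illustration only; it is not necessarily how fetch_stackexchange builds its split, and the real matrices are sparse rather than dense.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interaction matrix: 4 users x 5 items, 1 marks an observed interaction.
interactions = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
])

# Randomly pick roughly a quarter of the observed interactions to hold out.
rows, cols = np.nonzero(interactions)
held_out = rng.random(len(rows)) < 0.25

train = interactions.copy()
train[rows[held_out], cols[held_out]] = 0   # remove held-out interactions

test = np.zeros_like(interactions)
test[rows[held_out], cols[held_out]] = 1    # keep only held-out interactions

print(train.shape, test.shape)  # both matrices keep the users x items shape
print(train.sum(), test.sum())  # interactions are divided between the two
```

Every observed interaction ends up in exactly one of the two matrices, while the user and item indices mean the same thing in both.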

The item features here are tags applied to Stack Exchange questions (this example uses the Cross Validated site). In Pushly's case, the item features would typically be a score or value for each vocabulary term used in the corpus (i.e., the body of documents we are making predictions about).

In [2]:
# Getting example dataset. 
data = fetch_stackexchange('crossvalidated',
                           test_set_fraction=0.1,
                           indicator_features=False,
                           tag_features=True)

train = data['train']
test = data['test']

print('The dataset has %s users and %s items, '
      'with %s interactions in the test and %s interactions in the training set.\n'
      % (train.shape[0], train.shape[1], test.getnnz(), train.getnnz()))

item_features = data['item_features']
tag_labels = data['item_feature_labels']

print('There are %s distinct tags, with values like %s.' % (item_features.shape[1], tag_labels[:3].tolist()))

The dataset has 3213 users and 72360 items, with 4453 interactions in the test and 57795 interactions in the training set.

There are 1246 distinct tags, with values like ['bayesian', 'prior', 'elicitation'].


## Model Training

Here we train the model. Note that we pass in our training dataset (i.e., the matrix of user-item interactions) and the item features. If you train the model with item features, you have to pass item features at prediction time as well. Remember that the training and test datasets differ only in their interactions (i.e., the test dataset contains the held-out interactions) but have the same users and items. As a result, the item features used to train the model are appropriate to pass when we make predictions.
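One way to see why item features are needed again at prediction time: in a hybrid model of this kind, an item's latent vector is built by summing the vectors of its features, so there is no item vector without the feature matrix. The sketch below illustrates that idea with plain numpy. The embeddings are random placeholders, all names are hypothetical, and biases are left out, so treat it as a conceptual picture rather than LightFM's actual internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_tags, n_components = 3, 4, 5, 2

# Binary item x tag matrix: row i marks the tags attached to item i.
item_features = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
], dtype=float)

# Hypothetical learned parameters (random here, purely for illustration).
tag_embeddings = rng.normal(size=(n_tags, n_components))
user_embeddings = rng.normal(size=(n_users, n_components))

# Each item's vector is the sum of the vectors of its tags.
item_embeddings = item_features @ tag_embeddings

# Score for every (user, item) pair; bias terms omitted for brevity.
scores = user_embeddings @ item_embeddings.T
print(scores.shape)  # (3, 4)
```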

In [3]:
# Set the number of threads; you can increase this
# if you have more physical cores available.
NUM_THREADS = 2
NUM_COMPONENTS = 30
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

# Define a new model instance
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Fit the hybrid model. Note that we pass in the item features matrix
# along with the training matrix of user-item interactions.
%time model = model.fit(train, item_features=item_features, \
                        epochs=NUM_EPOCHS,num_threads=NUM_THREADS)


CPU times: user 309 ms, sys: 1.93 ms, total: 311 ms
Wall time: 309 ms


## Making Predictions 

When making a prediction, use the model.predict method. This method takes two one-dimensional numpy arrays, users and items, of equal length. The user array holds the user ids (i.e., row indices in the interaction matrix) of the users for whom we wish to make predictions. The item array holds the item ids (i.e., column indices in the matrix) of the items we want to score. In a real-world application, we would create mappings between indices and actual user and item ids (e.g., a Python dictionary of index:id) in order to link predictions back to the users and items.
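As a rough sketch of the index-to-id mappings mentioned above, plain dictionaries are enough. The ids here ("u-1001", "article-a", and so on) are invented for illustration and do not come from the demo dataset.

```python
import numpy as np

# Hypothetical external ids, listed in the order of their matrix rows/columns.
user_ids = ["u-1001", "u-1002", "u-1003"]
item_ids = ["article-a", "article-b", "article-c", "article-d"]

# id -> index is used when building the matrices and the predict inputs;
# index -> id is used to translate predictions back.
user_index = {uid: idx for idx, uid in enumerate(user_ids)}
item_index = {iid: idx for idx, iid in enumerate(item_ids)}
index_to_item = dict(enumerate(item_ids))

# Index arrays for scoring user "u-1002" on two articles.
users = np.array([user_index["u-1002"]] * 2, dtype=np.int32)
items = np.array([item_index["article-a"], item_index["article-d"]], dtype=np.int32)

print(users.tolist(), items.tolist())              # [1, 1] [0, 3]
print([index_to_item[i] for i in items.tolist()])  # ['article-a', 'article-d']
```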

Importantly, when making predictions we need to repeat the user id for every item we wish to predict, and the set of item ids should be repeated for each user. For example, if we wanted to predict the ratings for users 4, 6, and 8 on items 3, 12, and 48, the arrays passed would be:

- user = [4, 4, 4, 6, 6, 6, 8, 8, 8]

- item = [3, 12, 48, 3, 12, 48, 3, 12, 48]

I've written a function below to show how this can be done programmatically and give you some idea of the output.  



In [4]:
def pred_demo(model, u, i=None, mtrx=None):
    """Generates predictions using a fitted lightfm model.

    Function is designed to demonstrate how multiple input types can be
    manipulated efficiently into the required format and outputs predictions
    for inspection.  Also returns the user and item indices so predictions
    can be linked back to them.

    Args:
        model (LightFM): Fitted model used to generate predictions.
        u (array-like): Indices of users for whom we want to make
                        predictions.
        i (array-like): Indices of items we wish to score. Ignored if
                        mtrx is supplied.
        mtrx (sparse matrix): users x items matrix (e.g. test or train
                              matrix); only its shape is used.

    Returns:
        np array: Array of user indices corresponding to predictions.
        np array: Array of item indices corresponding to predictions.
        np array: Array of prediction scores.

    Note: relies on the notebook-level item_features variable.
    """
    
    if mtrx is not None: 
        _, n_items = mtrx.shape
        items = np.tile(np.arange(n_items), len(u))
    elif i is not None: 
        n_items = len(i)
        items = np.tile(i, len(u)) 
    else: 
        # Raise instead of returning None, which callers could not unpack.
        raise ValueError("No items or matrix supplied.")
    
    users = np.repeat(u, n_items)
    
    assert len(users) == len(items), "Users and items are different lengths."
    
    predictions = model.predict(users, 
                                items, 
                                item_features=item_features)

    return users, items, predictions 

#### Making Predictions for a Single User 

Let's make a prediction for a single user using the function above. It will return an array of prediction scores. To figure out which items to recommend, we simply take the indices of the top N values in the array.

In [56]:
# Number of recommendations we would like. 
top_n = 5

np.set_printoptions(suppress=True) # suppressing scientific notation in output

users = [4]
user_id, item_id, predictions = pred_demo(model, users, mtrx=data['train'])
print(np.column_stack((user_id, item_id, predictions)))


print("\nNumber of users: {}".format(len(users)))
_, n_items = data['train'].shape
print("Number of items: {}".format(n_items))
print("Users X items: {}".format(len(users)*n_items))
print("\nNumber of predictions calculated: {}".format(len(predictions)))




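# Get indices of the top predictions. Note that np.argpartition returns the
# top_n indices in no particular order (they are not sorted by score).
indices = np.argpartition(predictions, -top_n)[-top_n:]

print(f"\nFor user {users[0]} the top {top_n} predictions would be: {indices}")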

[[    4.             0.            -1.93492508]
 [    4.             1.            -1.12955868]
 [    4.             2.            -0.95335263]
 ...
 [    4.         72357.            -2.59813523]
 [    4.         72358.            -2.08621049]
 [    4.         72359.            -2.92063475]]

Number of users: 1
Number of items: 72360
Users X items: 72360

Number of predictions calculated: 72360

For user 4 the top 5 predictions would be: [71944 27431  9068 10534  4045]
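np.argpartition only guarantees that the last N positions hold the N largest scores; it does not order them. If a ranked list is needed, one option is to sort just those N indices afterwards. The sketch below uses a small made-up score array in place of real model output.

```python
import numpy as np

# Made-up scores standing in for the output of model.predict.
predictions = np.array([-1.9, 0.4, -0.3, 2.1, 0.9, -2.6, 1.5])
top_n = 3

# Unordered indices of the top_n largest scores.
top_idx = np.argpartition(predictions, -top_n)[-top_n:]

# Sort only those few indices by score, highest first.
ranked = top_idx[np.argsort(predictions[top_idx])[::-1]]

print(ranked.tolist())               # [3, 6, 4]
print(predictions[ranked].tolist())  # [2.1, 1.5, 0.9]
```

Partitioning first and sorting only N entries avoids sorting the full score array, which matters when there are tens of thousands of items per user.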


#### Predictions for a subset of Users and Items 

Remember that when calling model.predict, the user and item arrays need to have the format:

- user = [4, 4, 4, 6, 6, 6, 8, 8, 8] 
- item = [3, 12, 48, 3, 12, 48, 3, 12, 48]

By using numpy's built-in functions np.repeat and np.tile (for users and items, respectively), the function above lets us pass simple lists of single indices:

- user = [4, 6, 8] 
- item = [3, 12, 48]
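In isolation, the expansion from the short lists to the paired arrays looks like this (a small standalone sketch of the two numpy calls, separate from the function above):

```python
import numpy as np

user = [4, 6, 8]
item = [3, 12, 48]

users = np.repeat(user, len(item))  # each user repeated once per item
items = np.tile(item, len(user))    # the whole item list repeated once per user

print(users.tolist())  # [4, 4, 4, 6, 6, 6, 8, 8, 8]
print(items.tolist())  # [3, 12, 48, 3, 12, 48, 3, 12, 48]
```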

In [5]:
users = [4, 6, 8]
items = [3, 12, 48]

user_id, item_id, predictions = pred_demo(model, users, items)

print(np.column_stack((user_id, item_id, predictions)))
print("\nNumber of users: {}".format(len(users)))
print("Number of items: {}".format(len(items)))
print("Users X items: {}".format(len(users)*len(items)))
print("\nNumber of predictions calculated: {}".format(len(predictions)))

[[ 4.          3.         -1.79501843]
 [ 4.         12.         -1.53563535]
 [ 4.         48.         -2.71290112]
 [ 6.          3.         -0.60396689]
 [ 6.         12.         -0.38265514]
 [ 6.         48.         -1.86493695]
 [ 8.          3.         -1.58035123]
 [ 8.         12.         -5.30431223]
 [ 8.         48.          0.47020158]]

Number of users: 3
Number of items: 3
Users X items: 9

Number of predictions calculated: 9


#### Using Matrix Shape to Make Predictions for all Items

If we pass the training or test matrix to the function above, it will calculate predictions for all items for the given users. Note that it uses only the matrix shape, not its content, so in this case we can pass either the test or training matrix, as they have the same shape. In this example, I've implemented it to use all items if a matrix is provided, but you could also code it to use all users as well, or to fall back on the matrix shape only when user and item arrays are not provided.

In [24]:
users = [4, 6, 8]
user_id, item_id, predictions = pred_demo(model, users, mtrx=data['train'])
print(np.column_stack((user_id, item_id, predictions)))


print("\nNumber of users: {}".format(len(users)))
_, n_items = data['train'].shape
print("Number of items: {}".format(n_items))
print("Users X items: {}".format(len(users)*n_items))
print("\nNumber of predictions calculated: {}".format(len(predictions)))

[[    4.             0.            -1.93492508]
 [    4.             1.            -1.12955868]
 [    4.             2.            -0.95335263]
 ...
 [    8.         72357.            -2.22958422]
 [    8.         72358.             0.36491704]
 [    8.         72359.             0.34049028]]

Number of users: 3
Number of items: 72360
Users X items: 217080

Number of predictions calculated: 217080
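Because the user array is built with np.repeat, the flat prediction array is grouped by user, so it can be reshaped into a users x items score matrix and ranked row by row. The sketch below uses random numbers as a stand-in for the model's scores and a small hypothetical item count.

```python
import numpy as np

users = [4, 6, 8]
n_items = 6
top_n = 2

# Stand-in for the flat array returned by model.predict (random scores).
rng = np.random.default_rng(0)
predictions = rng.normal(size=len(users) * n_items)

# Row r holds all item scores for users[r].
scores = predictions.reshape(len(users), n_items)

# Top-N item indices per user, best first.
top_items = np.argsort(-scores, axis=1)[:, :top_n]

for user, row in zip(users, top_items):
    print(user, row.tolist())
```

This relies on the user-major ordering used throughout this notebook; if the arrays were built in a different order, the reshape would need to change accordingly.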


#### Predictions for a Single Item

We can input a single item and get an output for all users, as long as the user array contains all user indices! This tells us which users should be most interested in that item.

In [9]:
# All users (3,213) in the dataset x a single item
n_users,_ = data['test'].shape 
users = np.arange(n_users)

item = [1] # we only want predictions for item at index position 1.

user_id, item_id, predictions = pred_demo(model, users, item)
print(np.column_stack((user_id, item_id, predictions)))
print("\nNumber of users: {}".format(n_users))
print("Number of predictions calculated: {}".format(len(predictions)))



[[   0.            1.           -1.27421498]
 [   1.            1.           -1.76793253]
 [   2.            1.           -0.74393654]
 ...
 [3210.            1.           -0.19669344]
 [3211.            1.           -0.04075264]
 [3212.            1.            0.31518275]]

Number of users: 3213
Number of predictions calculated: 3213


#### Predicting Ratings for All Items for All Users

One question you might ask is how to make predictions where the input is all items and the output covers all users (i.e., all items x all users). You can do that, but in the real world Pushly likely would not, as there is no point recommending items that the user has already seen, and you would not want to recommend very old articles.

I decided to see how lightfm performed predicting all items in the dataset for all users, to stress test the prediction method with the data I've got on hand. I make no promises regarding how this scales, but considering that Pushly would likely only want to predict three or four weeks of items (i.e., back of napkin, 80 articles per day x 28 days = 2,240 items), this is encouraging. You will have a much larger user base but a smaller number of items that need prediction than what is represented here (approx. 72,000). Locally on my laptop, this run made about 232 million predictions in about 26 seconds of wall time.

In [8]:
# All users (3,213) in the dataset x all items (72,360)
n_users,_ = data['test'].shape 
users = np.arange(n_users)

%time user_id, item_id, predictions = pred_demo(model, users, mtrx=data['train'])

print("\nNumber of predictions calculated: {}".format(len(predictions)))

CPU times: user 20.5 s, sys: 5.15 s, total: 25.6 s
Wall time: 26.4 s

Number of predictions calculated: 232492680
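Building all user and item index pairs at once is memory hungry: with 8-byte integers, each index array of 232,492,680 entries takes roughly 1.9 GB before any scores are computed. One possible way to keep memory bounded is to generate the pairs for a batch of users at a time. The helper names below are hypothetical, and fake_predict is only a stand-in for model.predict so the sketch runs without a fitted model.

```python
import numpy as np

def iter_user_batches(n_users, n_items, batch_size):
    """Yield (users, items) index arrays covering batch_size users at a time."""
    all_items = np.arange(n_items, dtype=np.int32)
    for start in range(0, n_users, batch_size):
        batch = np.arange(start, min(start + batch_size, n_users), dtype=np.int32)
        users = np.repeat(batch, n_items)        # each user once per item
        items = np.tile(all_items, len(batch))   # all items once per user
        yield users, items

def fake_predict(users, items):
    """Hypothetical stand-in for model.predict (returns arbitrary scores)."""
    return (users * 0.1 - items * 0.01).astype(np.float32)

total = 0
for users, items in iter_user_batches(n_users=10, n_items=7, batch_size=4):
    scores = fake_predict(users, items)
    # In practice, keep only what is needed per batch (e.g., each user's top N).
    total += len(scores)

print(total)  # 70
```

Each batch can be reduced to a per-user top N before moving on, so the full set of scores never has to sit in memory at once.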


# Conclusion 

As you can see, it's relatively easy to manipulate a list of indices into the correct format, and if you are predicting more than a single item for more than one user, you can easily massage the results into a three-column data structure (i.e., user id, item id, and prediction score).

Want to learn more? Microsoft has an in-depth guide to dozens of recommender models, including LightFM: https://github.com/microsoft/recommenders