# Exercise 2

Now that you have discovered relevant patterns that translate into better retention and engagement of our customers, propose one improvement to our product that can be solved via a ML model (for example, after one end-user finishes a task, recommend a new tool they are likely to need via a real-time notification), with a high-level explanation on how you would deploy it in production. You may assume that you have access to any data source you consider relevant.

# Exercise 2: Recommendation using Scikit-Learn

#### Import libraries

In [1]:
import numpy as np
import pandas as pd

#### Read dataset

In [178]:
data = pd.read_csv('ds-dataset.csv')

The other columns are not relevant for our tool recommendation, so we will simply drop those:

In [179]:
data = data[['user_id' , 'page']]
data.head()

Unnamed: 0,user_id,page
0,196669322373702694527343919754227674361,merge
1,196669322373702694527343919754227674361,delete
2,212955203693754102065312977639302287127,jpg
3,212955203693754102065312977639302287127,rotate
4,212955203693754102065312977639302287127,compress


In [180]:
data.shape

(164933, 2)

We want to see what tools have been used by which user:

In [181]:
DataGrouped = data.groupby(['user_id', 'page']).sum().reset_index() # Group together
DataGrouped.head()

Unnamed: 0,user_id,page
0,100057565415361423597239221229734238436,edit
1,100057565415361423597239221229734238436,merge
2,10008297250197642640412434822899674026,ppt
3,100103232984930506871964919813308121190,compress
4,100103232984930506871964919813308121190,delete


In [182]:
DataGrouped.shape

(14695, 2)

Our Collaborative Filtering will be based on binary data. For every dataset we will add a 1 as used. That means, that this user has used this tool, no matter how many the user actually has used in the past. 
We use this binary data approach for our recommending example. 

Another approach would be to use the amount of tools has been used and normalize it, in case you want to treat the amount of tools used as a kind of taste factor, meaning that someone who used the tool x 100 times- while another user used that same tool x only 5 times- does not like it as much. 
I believe that very often in Sales Recommendations a binary approach makes more sense.

In [183]:
def create_DataBinary(DataGrouped):
    DataBinary = DataGrouped.copy()
    DataBinary['UsedYes'] = 1
    return DataBinary

In [184]:
DataBinary = create_DataBinary(DataGrouped)
DataBinary.head()

Unnamed: 0,user_id,page,UsedYes
0,100057565415361423597239221229734238436,edit,1
1,100057565415361423597239221229734238436,merge,1
2,10008297250197642640412434822899674026,ppt,1
3,100103232984930506871964919813308121190,compress,1
4,100103232984930506871964919813308121190,delete,1


Now our dataframe is eventually prepared for building the recommender.

In [185]:
from scipy.sparse import coo_matrix, csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import LabelEncoder

Let’s now calculate the Tool-Tool cosine similarity:

In [187]:
def GetToolToolSim(user_ids, tool_ids):
    ToolUserMatrix = csr_matrix(([1]*len(user_ids), (tool_ids, user_ids)))
    # print("Tool User Ma÷trix: ", ToolUserMatrix)
    similarity = cosine_similarity(ToolUserMatrix)
    # print("Similarity: ", similarity)
    return similarity, ToolUserMatrix

Receiving the top 3 tool recommendations per user in a dataframe, we will use the Tool-Tool Similarity Matrix from above cell via creating a SalesToolUserMatrix (tool per rows and user as columns filled binary incidence).

In [188]:
def get_recommendations_from_similarity(similarity_matrix, ToolUserMatrix, top_n=3):
    UserToolMatrix = csr_matrix(ToolUserMatrix.T)
    # print("User Tool Matrix: ", UserToolMatrix)
    UserToolScores = UserToolMatrix.dot(similarity_matrix) # sum of similarities to all used tools
    # print("User Tool Scores: ", UserToolScores)
    RecForUsr = []
    for user_id in range(UserToolScores.shape[0]):
        scores = UserToolScores[user_id, :]
        # print("score: ", scores)
        used_tools = UserToolMatrix.indices[UserToolMatrix.indptr[user_id]:UserToolMatrix.indptr[user_id+1]]
        # print("Used tools: " ,used_tools)
        scores[used_tools] = -1 # do not recommend already used tools
        top_tools_ids = np.argsort(scores)[-top_n:][::-1]
        recommendations = pd.DataFrame(top_tools_ids.reshape(1, -1),index=[user_id],columns=['Top%s' % (i+1) for i in range(top_n)])
        RecForUsr.append(recommendations)
        return pd.concat(RecForUsr)

Compute the recommendations:

In [190]:
def get_recommendations(used_tools):
    user_label_encoder = LabelEncoder()
    # print("User label encoder: ", user_label_encoder)
    user_ids = user_label_encoder.fit_transform(used_tools.user_id)
    tool_label_encoder = LabelEncoder()
    # print("Tool label encoder: ", tool_label_encoder)
    tool_ids = tool_label_encoder.fit_transform(used_tools.page)
    # compute recommendations
    similarity_matrix, ToolUserMatrix = GetToolToolSim(user_ids, tool_ids)
    recommendations = get_recommendations_from_similarity(similarity_matrix, ToolUserMatrix)
    recommendations.index = user_label_encoder.inverse_transform(recommendations.index)
    for i in range(recommendations.shape[1]):
        recommendations.iloc[:, i] = tool_label_encoder.inverse_transform(recommendations.iloc[:, i])
    return recommendations

Let’s start our recommender:

In [191]:
recommendations = get_recommendations(DataBinary)

In [192]:
print(recommendations)

                                          Top1    Top2 Top3
100057565415361423597239221229734238436  split  delete  jpg


Export the recommendations to a csv file:

In [195]:
dfrec = recommendations
dfrec.to_csv("ExportUserId_ToolName.csv")

## Exercise 2: Recommendation using Tensorflow Recommender

Real-world recommender systems are often composed of two stages:

1. The retrieval stage is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.

2. The ranking stage takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.


In this project, we're going to focus on the first stage, retrieval.

Retrieval models are often composed of two sub-models:

1. A query model computing the query representation (normally a fixed-dimensionality embedding vector) using query features.
2. A candidate model computing the candidate representation (an equally-sized vector) using the candidate features

The outputs of the two models are then multiplied together to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.

In this project, we're going to build and train such a two-tower model using the given dataset.

We're going to:

1. Get our data and split it into a training and test set.
2. Implement a retrieval model.
3. Fit and evaluate it.

In [196]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-recommenders

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-text 2.5.0 requires tensorflow<2.6,>=2.5.0, but you have tensorflow 2.6.0 which is incompatible.
You should consider upgrading via the 'c:\users\saeed\appdata\local\programs\python\python37\python.exe -m pip install --upgrade pip' command.
You should consider upgrading via the 'c:\users\saeed\appdata\local\programs\python\python37\python.exe -m pip install --upgrade pip' command.


#### Imports required libraries

In [199]:
import pandas as pd
import os
import pprint
import tempfile
from typing import Dict, Text
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

#### Reading the data file

In [200]:
data = pd.read_csv('ds-dataset.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,ts,user_id,os,browser,plan,page
0,0,2020-01-02 20:17:20.623000,196669322373702694527343919754227674361,mac_os,Firefox,monthly,merge
1,1,2020-01-03 10:22:31.619000,196669322373702694527343919754227674361,mac_os,Firefox,monthly,delete
2,2,2020-01-05 02:21:24.924000,212955203693754102065312977639302287127,windows,Firefox,monthly,jpg
3,3,2020-01-05 02:21:44.378000,212955203693754102065312977639302287127,windows,Firefox,monthly,rotate
4,4,2020-01-19 02:23:02.320000,212955203693754102065312977639302287127,windows,Firefox,monthly,compress


#### Removing unrelevant columns

In [201]:
data = data.drop(["Unnamed: 0", "ts", "os" , "browser", "plan"], axis=1)

#### For working with tfrs first we need to convert our dataset to a tf.data.Dataset object

In [202]:
dataset = tf.data.Dataset.from_tensor_slices(dict(data))

The dataset returns a dictionary of page name and user id:

In [203]:
for x in dataset.take(1).as_numpy_iterator():
    pprint.pprint(x)

{'page': b'merge', 'user_id': b'196669322373702694527343919754227674361'}


This model architecture is quite flexible. The inputs can be anything: user ids, os, or timestamps on the query side; page names, plan on the candidate side.

In this project, we're going to keep things simple and stick to user ids for the query tower, and pages for the candidate tower.

In [204]:
user_usage = dataset.map(lambda x: {
    "page": x["page"],
    "user_id": x["user_id"],
})
pages = dataset.map(lambda x: x["page"])

To fit and evaluate the model, we need to split it into a training and evaluation set. In an industrial recommender system, this would most likely be done by time: the data up to time *T* would be used to predict interactions after *T*.

In this project, however, we'll use a random split, putting 80% of the ratings in the train set, and 20% in the test set.

In [205]:
tf.random.set_seed(42)
shuffled = user_usage.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

Let's also figure out unique user ids and pages present in the data.

This is important because we need to be able to map the raw values of our categorical features to embedding vectors in our models. To do that, we need a vocabulary that maps a raw feature value to an integer in a contiguous range: this allows us to look up the corresponding embeddings in our embedding tables.

In [206]:
page_names = pages.batch(1_000)
user_ids = user_usage.batch(1_000_000).map(lambda x: x["user_id"])

unique_page_names = np.unique(np.concatenate(list(page_names)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

unique_page_names[:10]

array([b'compress', b'delete', b'edit', b'excel', b'extract', b'jpg',
       b'merge', b'number-pages', b'ppt', b'protect'], dtype=object)

## Implementing a model

As we are building a two-tower retrieval model, we can build each tower separately and then combine them in the final model.

### The query tower
The first step is to decide on the dimensionality of the query and candidate representations:

In [207]:
embedding_dimension = 32

Higher values will correspond to models that may be more accurate, but will also be slower to fit and more prone to overfitting.

The second is to define the model itself. Here, we're going to use Keras preprocessing layers to first convert user ids to integers, and then convert those to user embeddings via an Embedding layer. Note that we use the list of unique user ids we computed earlier as a vocabulary:

In [208]:
user_model = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.StringLookup(
      vocabulary=unique_user_ids, mask_token=None),
  # We add an additional embedding to account for unknown tokens.
  tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])

### The candidate tower

We can do the same with the candidate tower.

In [209]:
page_model = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.StringLookup(
      vocabulary=unique_page_names, mask_token=None),
  tf.keras.layers.Embedding(len(unique_page_names) + 1, embedding_dimension)
])

### Metrics

In our training data we have positive (user, page) pairs. To figure out how good our model is, we need to compare the affinity score that the model calculates for this pair to the scores of all the other possible candidates: if the score for the positive pair is higher than for all other candidates, our model is highly accurate.

To do this, we can use the **tfrs.metrics.FactorizedTopK** metric. The metric has one required argument: the dataset of candidates that are used as implicit negatives for evaluation.

In our case, that's the page dataset, converted into embeddings via our page model:

In [210]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=pages.batch(128).map(page_model)
)

### Loss
The next component is the loss used to train our model.

In this project, we'll make use of the Retrieval task object: a convenience wrapper that bundles together the loss function and metric computation:

In [211]:
task = tfrs.tasks.Retrieval(
  metrics=metrics
)

The task itself is a Keras layer that takes the query and candidate embeddings as arguments, and returns the computed loss: we'll use that to implement the model's training loop.

### The full model
We can now put it all together into a model. TFRS exposes a base model class (tfrs.models.Model) which streamlines building models: all we need to do is to set up the components in the __init__ method, and implement the compute_loss method, taking in the raw features and returning a loss value.

The base model will then take care of creating the appropriate training loop to fit our model.

In [220]:
class PageRecModel(tfrs.Model):
    def __init__(self, user_model, page_model):
        super().__init__()
        self.page_model: tf.keras.Model = page_model
        self.user_model: tf.keras.Model = user_model
        self.task: tf.keras.layers.Layer = task
    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # And pick out the page feature and pass it into the page model,
        # getting embeddings back.
        positive_page_embeddings = self.page_model(features["page"])
        # The task computes the loss and the metrics.
        return self.task(user_embeddings, positive_page_embeddings)

## Fitting and evaluating
After defining the model, we can use standard Keras fitting and evaluation routines to fit and evaluate the model.

In [221]:
model = PageRecModel(user_model, page_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

We'll shuffle, batch, and cache the training and evaluation data.

In [222]:
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

Then train the model:

In [224]:
model.fit(cached_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x2e294396748>

As the model trains, the loss is falling and a set of top-k retrieval metrics is updated. These tell us whether the true positive is in the top-k retrieved items from the entire candidate set. For example, a top-5 categorical accuracy metric of 0.2 would tell us that, on average, the true positive is in the top 5 retrieved items 20% of the time.

Finally, we can evaluate our model on the test set:

In [225]:
model.evaluate(cached_test, return_dict=True)



{'factorized_top_k/top_1_categorical_accuracy': 0.32760000228881836,
 'factorized_top_k/top_5_categorical_accuracy': 0.3787499964237213,
 'factorized_top_k/top_10_categorical_accuracy': 0.3787499964237213,
 'factorized_top_k/top_50_categorical_accuracy': 0.3787499964237213,
 'factorized_top_k/top_100_categorical_accuracy': 0.3787499964237213,
 'loss': 24980.08984375,
 'regularization_loss': 0,
 'total_loss': 24980.08984375}

## Making predictions

Now that we have a model, we would like to be able to make predictions. We can use the tfrs.layers.factorized_top_k.BruteForce layer to do this.

In [226]:
# Create a model that takes in raw query features, and
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
# recommends pages out of the entire pages dataset.
index.index_from_dataset(
  tf.data.Dataset.zip((pages.batch(100), pages.batch(100).map(model.page_model)))
)

# Get recommendations.
_, page_name = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {page_name[0, :3]}")

Recommendations for user 42: [b'excel' b'excel' b'excel']


## Model serving

After the model is trained, we need a way to deploy it.

In a two-tower retrieval model, serving has two components:

 

*   a serving query model, taking in features of the query and transforming them into a query embedding
*   a serving candidate model. This most often takes the form of an approximate nearest neighbours (ANN) index which allows fast approximate lookup of candidates in response to a query produced by the query model

In TFRS, both components can be packaged into a single exportable model, giving us a model that takes the raw user id and returns the names of top pages for that user. This is done via exporting the model to a SavedModel format, which makes it possible to serve using TensorFlow Serving.

To deploy a model like this, we simply export the BruteForce layer we created above:

In [228]:
# Export the query model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model")
    
    # Save the index.
    tf.saved_model.save(index, path)
    
    # Load it back; can also be done in TensorFlow Serving.
    loaded = tf.saved_model.load(path)
    
    # Pass a user id in, get top predicted page names back.
    scores, names = loaded(["42"])
    
    print(f"Recommendations: {page_name[0][:3]}")



INFO:tensorflow:Assets written to: C:\Users\saeed\AppData\Local\Temp\tmpvc1e8b6m\model\assets


INFO:tensorflow:Assets written to: C:\Users\saeed\AppData\Local\Temp\tmpvc1e8b6m\model\assets


Recommendations: [b'excel' b'excel' b'excel']


# End!
@saeid Vaghefi, September 26th 2021