<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Data transformation (collaborative filtering)

Real-world datasets often record several types of interactions between users and items. In addition, the same type of interaction (e.g., clicking an item on a website or viewing a movie) may appear more than once in a user's history. Because this is a common problem in practical recommendation system design, this notebook shares data transformation techniques that can be used in different scenarios.

The discussion in this notebook applies only to collaborative filtering algorithms.

## 0 Global settings

In [1]:
# Import the libraries used throughout the notebook
import sys

import pandas as pd
import numpy as np
import datetime
import math

print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]


## 1 Data creation

Two dummy datasets are created to illustrate the ideas in the notebook. 

### 1.1 Explicit feedback

In the "explicit feedback" scenario, interactions between users and items are numerical / ordinal **ratings** or binary preferences such as **like** or **dislike**. These types of interactions are termed *explicit feedback*.

The following shows a dummy dataset for the explicit rating type of feedback. In the data,
* There are 3 users whose IDs are 1, 2, 3.
* There are 3 items whose IDs are 1, 2, 3.
* Each user gives an item a single rating, so the rating stays the same even when the user interacts with the item at different timestamps. This is seen in some use cases such as movie recommendations, where users' ratings do not change dramatically over a short period of time.
* Timestamps of when the ratings are given are also recorded.

In [2]:
data1 = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "ItemId": [1, 1, 2, 2, 2, 1, 2, 1, 2, 3, 3, 3, 3, 3, 1],
    "Rating": [4, 4, 3, 3, 3, 4, 5, 4, 5, 5, 5, 5, 5, 5, 4],
    "Timestamp": [
        '2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
        '2000-01-01', '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03',
        '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03', '2000-01-04'
    ]
})

In [3]:
data1

Unnamed: 0,UserId,ItemId,Rating,Timestamp
0,1,1,4,2000-01-01
1,1,1,4,2000-01-01
2,1,2,3,2000-01-02
3,1,2,3,2000-01-02
4,1,2,3,2000-01-02
5,2,1,4,2000-01-01
6,2,2,5,2000-01-01
7,2,1,4,2000-01-03
8,2,2,5,2000-01-03
9,2,3,5,2000-01-03


### 1.2 Implicit feedback

Often users give no explicit ratings or preferences, and the interactions are implicit. For example, a user may purchase something on a website, click an item in a mobile app, or order food from a restaurant. This information may reflect users' preferences towards the items in an **implicit** manner.

The following dataset is created to illustrate the implicit feedback scenario.

In the data,
* There are 3 users whose IDs are 1, 2, 3.
* There are 3 items whose IDs are 1, 2, 3.
* There are no ratings or explicit feedback given by the users. Instead, each record carries the type of event. In this dummy dataset, for illustration purposes, there are three types of interactions between users and items, that is, **click**, **add** and **purchase**, meaning "click on the item", "add the item into cart" and "purchase the item", respectively.
* Sometimes other contextual or associated information is available for the interactions, e.g., "time spent visiting a site before clicking". For simplicity, only the type of interaction is considered in this notebook.
* The timestamp of each interaction is also given.

In [4]:
data2 = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "ItemId": [1, 1, 2, 2, 2, 1, 2, 1, 2, 3, 3, 3, 3, 3, 1],
    "Type": [
        'click', 'click', 'click', 'click', 'purchase',
        'click', 'purchase', 'add', 'purchase', 'purchase',
        'click', 'click', 'add', 'purchase', 'click'
    ],
    "Timestamp": [
        '2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
        '2000-01-01', '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03',
        '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03', '2000-01-04'
    ]
})

In [5]:
data2

Unnamed: 0,UserId,ItemId,Type,Timestamp
0,1,1,click,2000-01-01
1,1,1,click,2000-01-01
2,1,2,click,2000-01-02
3,1,2,click,2000-01-02
4,1,2,purchase,2000-01-02
5,2,1,click,2000-01-01
6,2,2,purchase,2000-01-01
7,2,1,add,2000-01-03
8,2,2,purchase,2000-01-03
9,2,3,purchase,2000-01-03


## 2 Data transformation

Many collaborative filtering algorithms are built on a user-item sparse matrix. This requires that the input data for building the recommender should contain unique user-item pairs. 

For explicit feedback datasets, this can simply be done by deduplicating the repeated user-item-rating tuples.

In [6]:
data1 = data1.drop_duplicates()

In [7]:
data1

Unnamed: 0,UserId,ItemId,Rating,Timestamp
0,1,1,4,2000-01-01
2,1,2,3,2000-01-02
5,2,1,4,2000-01-01
6,2,2,5,2000-01-01
7,2,1,4,2000-01-03
8,2,2,5,2000-01-03
9,2,3,5,2000-01-03
10,3,3,5,2000-01-01
11,3,3,5,2000-01-03
14,3,1,4,2000-01-04
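
Note that `drop_duplicates()` with no arguments compares whole rows, so a user-item pair rated at two different timestamps still appears twice above. The following is a minimal sketch of one way to obtain unique user-item pairs, assuming that keeping only the most recent rating per pair is acceptable for the use case. The frame and the names `ratings` and `ratings_unique` are illustrative and not part of the notebook.

```python
import pandas as pd

# Small illustrative frame: user 2 rated item 1 on two different days.
ratings = pd.DataFrame({
    "UserId": [1, 2, 2, 2],
    "ItemId": [1, 1, 1, 2],
    "Rating": [4, 4, 4, 5],
    "Timestamp": ["2000-01-01", "2000-01-01", "2000-01-03", "2000-01-01"],
})

# Sort by time, then keep only the latest row for each user-item pair.
ratings_unique = (
    ratings.sort_values("Timestamp")
    .drop_duplicates(subset=["UserId", "ItemId"], keep="last")
    .sort_values(["UserId", "ItemId"])
    .reset_index(drop=True)
)
print(ratings_unique)
```

Whether to keep the first, the last, or an averaged rating is a design choice that depends on the scenario.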


In the implicit feedback use cases, there are several methods to perform the deduplication, depending on the requirements of the actual business use cases.

### 2.1 Data aggregation

Usually, data is aggregated by user and item to generate scores that represent preferences. In some algorithms, such as SAR, the score is called the *affinity score*; for simplicity, the scores are hereafter termed *affinity*.

It is worth mentioning that in such cases, the affinity scores differ from the ratings in the explicit dataset in terms of value distribution. This is usually termed an [ordinal regression](https://en.wikipedia.org/wiki/Ordinal_regression) problem, which has been studied in [Koren's paper](https://pdfs.semanticscholar.org/934a/729409d6fbd9894a94d4af66bd82222b5515.pdf). In this case, the algorithm used for training a recommender should be carefully chosen so that it accounts for the distribution of the affinity scores rather than assuming discrete integer values.

#### 2.1.1 Count

The simplest technique is to count the number of interactions between a user and an item to produce affinity scores. The following shows the aggregation of counts of user-item interactions in `data2`, regardless of the interaction type.

In [8]:
data2_count = data2.groupby(['UserId', 'ItemId']).agg({'Timestamp': 'count'}).reset_index()
data2_count.columns = ['UserId', 'ItemId', 'Affinity']

In [9]:
data2_count

Unnamed: 0,UserId,ItemId,Affinity
0,1,1,2
1,1,2,3
2,2,1,2
3,2,2,2
4,2,3,1
5,3,1,1
6,3,3,4
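
An equivalent way to express the same count, sketched here on a small stand-in frame (the `events` and `counts` names are illustrative), is `groupby(...).size()`. It counts rows per group without having to pick a column to count, and unlike `count` it also includes rows whose values are missing.

```python
import pandas as pd

# Stand-in interaction log with the same columns as the implicit dataset.
events = pd.DataFrame({
    "UserId": [1, 1, 1, 2],
    "ItemId": [1, 1, 2, 1],
    "Type": ["click", "click", "purchase", "add"],
})

# Number of rows per user-item pair, stored in an 'Affinity' column.
counts = (
    events.groupby(["UserId", "ItemId"])
    .size()
    .reset_index(name="Affinity")
)
print(counts)
```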


#### 2.1.2 Weighted count

It is useful to treat the different interaction types as weights in the count aggregation. For example, assume the weights of the three types, "click", "add", and "purchase", are 1, 2, and 3, respectively. A weighted count can then be computed as follows.

In [10]:
# Add column of weights
data2_w = data2.copy()

conditions = [
    data2_w['Type'] == 'click',
    data2_w['Type'] == 'add',
    data2_w['Type'] == 'purchase'
]

choices = [1, 2, 3]

# Interaction types not listed above get weight 0, so they do not contribute to the count.
data2_w['Weight'] = np.select(conditions, choices, default=0)

In [11]:
# Do count with weight.
data2_wcount = data2_w.groupby(['UserId', 'ItemId'])['Weight'].sum().reset_index()
data2_wcount.columns = ['UserId', 'ItemId', 'Affinity']

In [12]:
data2_wcount

Unnamed: 0,UserId,ItemId,Affinity
0,1,1,2
1,1,2,5
2,2,1,3
3,2,2,6
4,2,3,3
5,3,1,1
6,3,3,7
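
The type-to-weight assignment can also be written as a dictionary lookup. The following is a minimal sketch on a stand-in frame; the names `events`, `type_weights`, and `weighted` are illustrative, and the weights are the ones assumed in the text above.

```python
import pandas as pd

# Stand-in interaction log with the same columns as the implicit dataset.
events = pd.DataFrame({
    "UserId": [1, 1, 1, 2],
    "ItemId": [1, 1, 2, 1],
    "Type": ["click", "click", "purchase", "add"],
})

# Same weights as in the text: click=1, add=2, purchase=3.
type_weights = {"click": 1, "add": 2, "purchase": 3}

# map() yields NaN for types missing from the dictionary,
# which makes unexpected interaction types easy to detect.
events["Weight"] = events["Type"].map(type_weights)
assert events["Weight"].notna().all(), "unmapped interaction type found"

# Sum the weights per user-item pair to get the weighted count.
weighted = (
    events.groupby(["UserId", "ItemId"])["Weight"]
    .sum()
    .reset_index(name="Affinity")
)
print(weighted)
```

Keeping the weights in one dictionary makes them easy to tune or load from configuration.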


#### 2.1.3 Time dependent count

In many scenarios, time dependency plays a critical role in preparing a dataset for a collaborative filtering model that captures how user interests drift over time. A common technique for achieving a time-dependent count is to add a time decay factor to the counting. This technique is used in [SAR](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_deep_dive.ipynb). The formula for the affinity score of each user-item pair is

$$a_{ij}=\sum_k w_k \left(\frac{1}{2}\right)^{\frac{t_0-t_k}{T}} $$

where $a_{ij}$ is the affinity score, $w_k$ is the interaction weight, $t_0$ is a reference time, $t_k$ is the timestamp for the $k$-th interaction, and $T$ is a hyperparameter that controls the speed of decay.

The following shows how SAR applies time decay in aggregating counts for the implicit feedback scenario. 

In this case, we use 5 days as the half-life parameter, and use the latest time in the dataset as the time reference.
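
As a worked instance of the formula, take user 1 and item 1 in `data2`: two clicks on 2000-01-01, each with the click weight $w_k = 1$ assumed in the weighted-count section. With the latest timestamp in the data, 2000-01-04, as $t_0$ and $T = 5$ days, both interactions are 3 days old, so

$$a_{11} = 1 \cdot \left(\frac{1}{2}\right)^{\frac{3}{5}} + 1 \cdot \left(\frac{1}{2}\right)^{\frac{3}{5}} \approx 0.659754 + 0.659754 = 1.319508$$

An interaction that is exactly $T = 5$ days old would contribute half of its weight.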

In [13]:
T = 5

t_ref = pd.to_datetime(data2_w['Timestamp']).max()

In [14]:
# Calculate the weighted count with time decay.

data2_w['Timedecay'] = data2_w.apply(
    lambda x: x['Weight'] * np.power(0.5, (t_ref - pd.to_datetime(x['Timestamp'])).days / T), 
    axis=1
)

In [15]:
data2_w

Unnamed: 0,UserId,ItemId,Type,Timestamp,Weight,Timedecay
0,1,1,click,2000-01-01,1,0.659754
1,1,1,click,2000-01-01,1,0.659754
2,1,2,click,2000-01-02,1,0.757858
3,1,2,click,2000-01-02,1,0.757858
4,1,2,purchase,2000-01-02,3,2.273575
5,2,1,click,2000-01-01,1,0.659754
6,2,2,purchase,2000-01-01,3,1.979262
7,2,1,add,2000-01-03,2,1.741101
8,2,2,purchase,2000-01-03,3,2.611652
9,2,3,purchase,2000-01-03,3,2.611652


Affinity scores of user-item pairs can then be calculated by summing the 'Timedecay' column values.

In [16]:
data2_wt = data2_w.groupby(['UserId', 'ItemId'])['Timedecay'].sum().reset_index()
data2_wt.columns = ['UserId', 'ItemId', 'Affinity']

In [17]:
data2_wt

Unnamed: 0,UserId,ItemId,Affinity
0,1,1,1.319508
1,1,2,3.789291
2,2,1,2.400855
3,2,2,4.590914
4,2,3,2.611652
5,3,1,1.0
6,3,3,5.883057
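
The row-wise `apply` used above can become slow on large interaction logs. The following is a minimal sketch of a column-wise (vectorized) version of the same decay, run on a small stand-in frame. The names `events`, `half_life_days`, and `affinity` are illustrative, and the reference time is fixed by hand here rather than taken from the data.

```python
import numpy as np
import pandas as pd

# Stand-in log that already carries the interaction weights.
events = pd.DataFrame({
    "UserId": [1, 1, 2],
    "ItemId": [1, 1, 1],
    "Weight": [1, 1, 2],
    "Timestamp": ["2000-01-01", "2000-01-01", "2000-01-03"],
})

half_life_days = 5                  # T in the formula
t_ref = pd.Timestamp("2000-01-04")  # reference time t_0, chosen for illustration

# Age of every interaction in whole days, computed for the column at once.
age_days = (t_ref - pd.to_datetime(events["Timestamp"])).dt.days

# Weight multiplied by the half-life decay factor.
events["Timedecay"] = events["Weight"] * np.power(0.5, age_days / half_life_days)

# Sum the decayed weights per user-item pair.
affinity = (
    events.groupby(["UserId", "ItemId"])["Timedecay"]
    .sum()
    .reset_index(name="Affinity")
)
print(affinity)
```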


### 2.2 Negative sampling

The above aggregation assumes that user-item interactions can be interpreted as preferences by taking into account factors such as the number of interactions, weights, and time decay. Sometimes these assumptions are biased, and only the interactions themselves matter. In that case, the original dataset of implicit interaction records can be binarized into one that contains only 1 or 0, indicating whether or not a user has interacted with an item.

For example, the following generates data that contains existing interactions between users and items. 

In [18]:
data2_b = data2[['UserId', 'ItemId']].copy()
data2_b['Feedback'] = 1
data2_b = data2_b.drop_duplicates()

In [19]:
data2_b

Unnamed: 0,UserId,ItemId,Feedback
0,1,1,1
2,1,2,1
5,2,1,1
6,2,2,1
9,2,3,1
10,3,3,1
14,3,1,1


"Negative sampling" is a technique that samples negative feedback. As with the aggregation techniques, negative feedback can be defined differently in different scenarios. In this case, for example, we can regard the items that a user has not interacted with as those that the user does not like. This may be a strong assumption in many use cases, but it is a reasonable basis for a model when the number of interactions between users and items is small.

The following shows that, on top of `data2_b`, another 2 negative samples are generated, which are tagged with "0" in the "Feedback" column.

In [20]:
users = data2['UserId'].unique()
items = data2['ItemId'].unique()

In [21]:
interaction_lst = []
for user in users:
    for item in items:
        interaction_lst.append([user, item, 0])

data_all = pd.DataFrame(data=interaction_lst, columns=["UserId", "ItemId", "FeedbackAll"])

In [22]:
data_all

Unnamed: 0,UserId,ItemId,FeedbackAll
0,1,1,0
1,1,2,0
2,1,3,0
3,2,1,0
4,2,2,0
5,2,3,0
6,3,1,0
7,3,2,0
8,3,3,0


In [23]:
data2_ns = pd.merge(data_all, data2_b, on=['UserId', 'ItemId'], how='outer').fillna(0).drop('FeedbackAll', axis=1)

In [24]:
data2_ns

Unnamed: 0,UserId,ItemId,Feedback
0,1,1,1.0
1,1,2,1.0
2,1,3,0.0
3,2,1,1.0
4,2,2,1.0
5,2,3,1.0
6,3,1,1.0
7,3,2,0.0
8,3,3,1.0
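
The same full user-item grid can be built without an explicit double loop. The following is a minimal sketch, assuming as above that every unobserved pair is treated as a negative; the names `positives`, `all_pairs`, and `with_negatives` are illustrative. It uses `pd.MultiIndex.from_product` and `reindex`, which also keeps the "Feedback" column as integers.

```python
import pandas as pd

# Stand-in set of observed (positive) user-item pairs.
positives = pd.DataFrame({
    "UserId": [1, 1, 2, 3],
    "ItemId": [1, 2, 1, 2],
    "Feedback": [1, 1, 1, 1],
})

# Every possible user-item combination.
all_pairs = pd.MultiIndex.from_product(
    [positives["UserId"].unique(), positives["ItemId"].unique()],
    names=["UserId", "ItemId"],
)

# Reindex onto the full grid; pairs with no observed interaction get Feedback 0.
with_negatives = (
    positives.set_index(["UserId", "ItemId"])
    .reindex(all_pairs, fill_value=0)
    .reset_index()
)
print(with_negatives)
```

For large catalogs the full grid grows as users times items, so in practice only a subset of the negatives is usually sampled per user.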


Note that negative sampling may also affect the count-based aggregation scheme. That is, the count may start from 0 instead of 1, where 0 means there is no interaction between the user and the item.

# References

1. X. He *et al*, Neural Collaborative Filtering, WWW 2017. 
2. Y. Hu *et al*, Collaborative filtering for implicit feedback datasets, ICDM 2008.
3. Simple Algorithm for Recommendation (SAR). See notebook [sar_deep_dive.ipynb](../02_model_collaborative_filtering/sar_deep_dive.ipynb).
4. Y. Koren and J. Sill, OrdRec: an ordinal model for predicting personalized item rating distributions, RecSys 2011.