In [None]:
import numpy as np
import pandas as pd

from polara import get_movielens_data
from polara.preprocessing.dataframes import leave_one_out, reindex

from dataprep import transform_indices, verify_time_split, generate_interactions_matrix
from evaluation import topn_recommendations, model_evaluate, downvote_seen_items

# Task

Implement two variants of user-based KNN for the top-$n$ recommendations task when:
1. similarity matrix is symmetric,
2. similarity matrix is asymmetric.

Recall that there is no reason to implement a row-wise weighting scheme in user-based KNN, so choose the weighting scheme wisely.
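To make the distinction between the two variants concrete, here is a minimal sketch on a toy binary matrix. It assumes cosine similarity for the symmetric case and, as one possible asymmetric choice, the share of a user's items that a neighbor has also rated. The matrix `A` and all variable names are illustrative and are not part of the assignment code.

```python
import numpy as np

# Toy binary user-item matrix: 3 users x 4 items (illustrative data only).
A = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

overlap = A @ A.T          # number of co-rated items for each pair of users
n_rated = A.sum(axis=1)    # number of items rated by each user

# Symmetric case: cosine similarity, normalized by both users' norms.
norms = np.sqrt(n_rated)
sim_sym = overlap / np.outer(norms, norms)

# Asymmetric case (one possible choice): the overlap is normalized
# by the activity of the row user only, so sim[u, v] != sim[v, u] in general.
sim_asym = overlap / n_rated[:, None]

# Self-similarity is usually excluded from the neighborhood.
np.fill_diagonal(sim_sym, 0)
np.fill_diagonal(sim_asym, 0)
```

In this toy case the light user 1 is fully "covered" by the heavy user 0 (`sim_asym[1, 0]` equals 1), while the reverse similarity is only 0.5.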

In your experiments:
- Test your solution against both weak and strong generalization. 
  - In total you'll have 4 different experiments.
- Follow the "most-recent-item" sampling strategy for constructing holdout.
  - Explain potential issues of this scheme in relation to both weak and strong generalization.  
- Report evaluation metrics, compare the models, and analyse the results.  
- Use Movielens-1M data.

**Note**: you can reuse some code from seminars if necessary.

In [None]:
data = get_movielens_data(include_time=True)

# Weak generalization test

## Preparing data (1 pts)

Your task is to:
- split the data into training and holdout parts
- build a new internal contiguous representation of the user and item indices based on the training data
- make sure the same index is used in the holdout data
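The three steps above can be illustrated with plain pandas on a toy log. This is only a sketch under stated assumptions: the column names and the `toy_*` variables are hypothetical, and the seminar helpers (`leave_one_out`, `transform_indices`, `reindex`) may behave differently in details.

```python
import pandas as pd

# Toy interaction log (illustrative data, not Movielens).
toy = pd.DataFrame({
    'userid':    [10, 10, 10, 42, 42, 77],
    'movieid':   [500, 300, 900, 300, 700, 900],
    'timestamp': [1, 5, 3, 2, 4, 6],
})

# Hold out the most recent interaction of every user.
last_idx = toy.groupby('userid')['timestamp'].idxmax()
toy_holdout = toy.loc[last_idx]
toy_training = toy.drop(last_idx)

# Build a contiguous 0-based index from the training data only.
user_codes, user_index = pd.factorize(toy_training['userid'], sort=True)
item_codes, item_index = pd.factorize(toy_training['movieid'], sort=True)
toy_training = toy_training.assign(userid=user_codes, movieid=item_codes)

# Apply the same index to the holdout; users and items that are unknown
# to the training data are mapped to -1 and dropped.
toy_holdout = toy_holdout.assign(
    userid=pd.Index(user_index).get_indexer(toy_holdout['userid'].values),
    movieid=pd.Index(item_index).get_indexer(toy_holdout['movieid'].values),
).query('userid >= 0 and movieid >= 0')
```

Note that user 77 has a single interaction and item 700 appears only in the holdout, so both vanish from the training data and the corresponding holdout rows are filtered out.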

In [None]:
# split most recent holdout item from each user
training_, holdout_ = ...

# check correct time splitting
verify_time_split(training_, holdout_)

In [None]:
# reindex data to make contiguous index starting from 0 for user and item IDs
training, data_index = ...

# apply new index to the holdout data
holdout = ...
holdout = holdout.sort_values('userid')

- Let's also populate a data description dictionary for convenience.
- It allows using uniform names for the user and item fields.
  - This way the code doesn't depend on the actual names in your dataset.
  - So later you can easily switch to another dataset without changing the code of the pipeline.


In [None]:
data_description = dict(
    users = data_index['users'].name,
    items = data_index['items'].name,
    feedback = 'rating',
    n_users = len(data_index['users']),
    n_items = len(data_index['items']),
)

As previously, let's also explicitly store our testset (i.e., ratings of test users excluding holdout items).

In [None]:
userid = data_description['users']
seen_idx_mask = ...
testset = ...

## Models implementation

### Symmetric case (5 pts)

- You can consult the code from seminars or implement your own solution as long as it is fast enough.

- Recall that subsampling of the neighborhood not only makes the algorithm run faster, but can also improve the results.  
- **Make sure to implement some kind of neighborhood subsampling.**
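One simple form of neighborhood subsampling is to keep only the `k` most similar users in every row of the similarity matrix. The sketch below uses a small dense matrix for readability; the helper name `keep_top_k_neighbors` is hypothetical, and a real solution would likely operate on sparse matrices.

```python
import numpy as np

def keep_top_k_neighbors(similarity, k):
    """Zero out all but the k largest entries in every row of a dense matrix."""
    n_cols = similarity.shape[1]
    if k >= n_cols:
        return similarity.copy()
    # column indices of the k largest values per row (unordered within the group)
    top_idx = np.argpartition(similarity, n_cols - k, axis=1)[:, n_cols - k:]
    truncated = np.zeros_like(similarity)
    rows = np.arange(similarity.shape[0])[:, None]
    truncated[rows, top_idx] = similarity[rows, top_idx]
    return truncated

# Toy similarity matrix with zero diagonal (illustrative values only).
sim = np.array([
    [0.0, 0.9, 0.1, 0.4],
    [0.9, 0.0, 0.3, 0.2],
    [0.1, 0.3, 0.0, 0.8],
    [0.4, 0.2, 0.8, 0.0],
])
sim_top2 = keep_top_k_neighbors(sim, k=2)
```

Keep in mind that row-wise truncation does not preserve symmetry in general, even if the input matrix is symmetric.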

In [None]:
def build_uknn_model(config, data, data_description):
    user_item_mtx = ...

    # compute similarity matrix
    user_similarity = ...
    return user_item_mtx, user_similarity


def uknn_model_scoring(params, testset, testset_description):
    # implement the scoring function to assign scores
    # to all items for test users
    user_item_mtx, user_similarity = params
    # write your code for scoring, don't forget to return a dense array
    ...
    return scores

In [None]:
uknn_params = ...

In [None]:
uknn_scores = ...

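Note: recommending items from the user's history doesn't make sense, so make sure that already seen items are excluded from the recommendations (e.g., with the imported `downvote_seen_items` helper).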
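The idea of excluding seen items can be sketched as follows: the scores of already seen user-item pairs are pushed below every other score, so these items never reach the top of the list. The arrays are toy data, and this is not necessarily how the seminar's `downvote_seen_items` is implemented.

```python
import numpy as np

# Toy scores for 2 test users over 4 items (illustrative values only).
scores = np.array([
    [0.9, 0.2, 0.5, 0.1],
    [0.3, 0.8, 0.7, 0.6],
])
# Seen interactions as parallel arrays of user and item indices.
seen_users = np.array([0, 1, 1])
seen_items = np.array([0, 1, 2])

# Push seen items below every other score.
masked_scores = scores.copy()
masked_scores[seen_users, seen_items] = scores.min() - 1

# The best remaining item for each user is now an unseen one.
best_unseen = masked_scores.argmax(axis=1)
```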

### Asymmetric case (5 pts)

- Your task here is to implement user-based KNN with asymmetric similarity.

In [None]:
def build_uknn_model_asym(config, data, data_description):
    ...
    return ...


def uknn_model_scoring_asym(params, testset, testset_description):
    ...
    return ...

In [None]:
uknn_params_asym = ...

In [None]:
uknn_scores_asym = ...

## Evaluation (1 pts)

### Generate top-$n$ recommendations for both models

In [None]:
uknn_recs = ...

In [None]:
uknn_recs_asym = ...

### Calculate metrics

In [None]:
modes = ['symmetric', 'asymmetric']
uknn_recs = dict(zip(modes, [uknn_recs, uknn_recs_asym]))


uknn_metrics = {}
for mode, recs in uknn_recs.items():
    if recs is None: continue
    uknn_metrics[mode] = metrics = model_evaluate(recs, holdout, data_description)
    print(
        f'Similarity type: {mode}\n'\
        'HR={:.3}, MRR={:.3}, COV={:.3}\n'.format(*metrics)
    )
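For reference, the three reported metrics (hit rate, mean reciprocal rank, and catalog coverage) can be computed along the following lines. This is a sketch on toy arrays that assumes one holdout item per test user and no duplicate items within a recommendation list; the function `toy_evaluate` is hypothetical, and the actual `model_evaluate` may differ in details.

```python
import numpy as np

def toy_evaluate(recs, holdout_items, n_items):
    """Hit rate, mean reciprocal rank and coverage for top-n lists.

    recs: (n_test_users, topn) array of recommended item indices
    holdout_items: (n_test_users,) array with one held-out item per user
    """
    hits_mask = recs == holdout_items[:, None]
    hr = hits_mask.any(axis=1).mean()
    # 1-based rank of each hit; users without a hit contribute zero
    hit_rank = np.where(hits_mask)[1] + 1
    mrr = np.sum(1.0 / hit_rank) / recs.shape[0]
    # share of the catalog that appears in at least one recommendation list
    cov = np.unique(recs).size / n_items
    return hr, mrr, cov

recs = np.array([[3, 1, 0], [2, 4, 1], [0, 1, 2]])
holdout_items = np.array([1, 5, 0])
hr, mrr, cov = toy_evaluate(recs, holdout_items, n_items=6)
```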

# Strong generalization test

- Recall that in the strong generalization test you work with the warm-start scenario.
- It means that the set of test users is disjoint from the set of users in the training.
- You're provided with the basic functions to help you perform correct splitting, but there are still a few places where your input is required. Make sure you understand the logic of data splitting in this scenario.

## Preparing data (2 pts)

- Your task is to select a subset of users who have the most recent interactions in their history across the entire dataset.
- You will apply holdout splitting to only this subset.
  - Think about why simply taking all users (as in the weak generalization test) makes no sense in this scenario.

In [None]:
def split_by_time(data, time_q=0.95, timeid='timestamp'):
    '''
    Split the input `data` DataFrame into two parts based on the timestamp, with the split point
    being determined by the quantile value `time_q`. The function returns a tuple `(before, after)`
    containing the two DataFrames. The `after` DataFrame contains the rows with timestamps greater
    than or equal to the split point, while the `before` DataFrame contains the remaining rows. 

    Details:
    The `quantile` method of the pandas Series is used to calculate the time point (i.e., timestamp)
    that divides the data into two parts based on the given quantile value `time_q`. Specifically,
    the time point `split_timepoint` is calculated as the `time_q`-th quantile of the values in the `timeid`
    column of the `data` DataFrame, using the interpolation method of `nearest`. This means that
    `split_timepoint` is an actually observed timestamp, and approximately a `time_q` fraction of the
    data points occur at or before it.
    '''
    split_timepoint = data[timeid].quantile(q=time_q, interpolation='nearest')
    after = data.query(f'{timeid} >= @split_timepoint') 
    before = data.drop(after.index)
    return before, after

First, you need to select a candidate subset of observations, from which you'll construct the training, testset, and holdout datasets. Check the `split_by_time` function and its description in the cell above.
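To see what the time-based split does, it may help to run it on a toy log with 20 consecutive timestamps. The function body is repeated here only to keep the sketch self-contained; the toy data and the `toy_*` and `candidate_test_users` names are illustrative.

```python
import pandas as pd

def split_by_time(data, time_q=0.95, timeid='timestamp'):
    # same logic as the function defined earlier in the notebook
    split_timepoint = data[timeid].quantile(q=time_q, interpolation='nearest')
    after = data.query(f'{timeid} >= @split_timepoint')
    before = data.drop(after.index)
    return before, after

# 4 users with 5 interactions each, ordered in time (illustrative data).
toy = pd.DataFrame({
    'userid': [u for u in range(4) for _ in range(5)],
    'timestamp': range(1, 21),
})
toy_before, toy_after = split_by_time(toy, time_q=0.95)

# Candidate test users are those with at least one interaction
# at or after the split timepoint.
candidate_test_users = toy_after['userid'].unique()
```

Only the last user is active after the timepoint here, so only that user becomes a candidate test user; the earlier part of that user's history still sits in `toy_before`.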

In [None]:
before, after = split_by_time(data, time_q=0.95)

- Now it's time to perform holdout sampling based on the obtained timepoint splitting. 
- Remember, you only sample from the test users.

In [None]:
testset_part_, holdout_ = ... # your code for holdout sampling

# verify correctness of time-based splitting,
# i.e., for each test user, the holdout contains only future interactions w.r.t. the testset
verify_time_split(testset_part_, holdout_)

In [None]:
training_ = ... # recall that training and testset must be disjoint by users

- Note that `testset_part_` only contains interactions of the test users **after the timepoint**.
- You need to combine it with the remaining histories of these users.
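The overall logic of this split can be sketched on a toy log. This is one possible reading of the steps above and not a reference solution: the split timepoint is fixed by hand instead of coming from `split_by_time`, and all `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy interaction log (illustrative data only).
toy = pd.DataFrame({
    'userid':    [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'movieid':   [10, 11, 12, 10, 13, 14, 11, 12, 13, 14],
    'timestamp': [1, 2, 3, 2, 4, 5, 6, 8, 9, 10],
})
split_timepoint = 8  # assumed to come from a quantile-based time split
after = toy[toy['timestamp'] >= split_timepoint]
before = toy.drop(after.index)

# Holdout: the most recent interaction of every user active after the timepoint.
holdout_idx = after.groupby('userid')['timestamp'].idxmax()
toy_holdout = after.loc[holdout_idx]
testset_part = after.drop(holdout_idx)

# Training: only users with no interactions after the timepoint.
test_users = after['userid'].unique()
toy_training = before[~before['userid'].isin(test_users)]

# Testset: the full pre-holdout history of the test users.
toy_testset = pd.concat(
    [before[before['userid'].isin(test_users)], testset_part],
    axis=0,
)
```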

In [None]:
# combine all test users' data into a single `testset_` DataFrame
testset_ = pd.concat(
    [..., ...],
    axis = 0,
    ignore_index=False
)

### Building internal representation of user and item index

Use the `transform_indices` function for building a contiguous index starting from 0.

In [None]:
training, data_index = ...

- Before applying the new index to the test data:
  - note that the users in the `testset` must be the same as the users in the `holdout`.
- Below is the corresponding function `align_test_by_users` that keeps these two datasets aligned.

In [None]:
def align_test_by_users(testset, holdout):
    test_users = np.intersect1d(holdout['userid'].values, testset['userid'].values)
    # only allow the same users to be present in both datasets
    testset = testset.query('userid in @test_users').sort_values('userid')
    holdout = holdout.query('userid in @test_users').sort_values('userid')
    return testset, holdout

Let's apply the new item index to the test data and finalize the test split:

In [None]:
holdout = reindex(holdout_, data_index['items'], filter_invalid=True)
testset = reindex(testset_, data_index['items'], filter_invalid=True)

testset, holdout = align_test_by_users(testset, holdout)

- Think about why we do not apply the new index to users here.

## Models implementation

- In this section you'll need to implement user-based KNN models for the warm-start scenario.
- Think carefully which data must be generated at the build time and which data must be generated in the scoring function.
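One plausible division of work in the warm-start scenario is sketched below: at build time only the training matrix (and anything derived from it alone) can be prepared, while the similarity between test users and training users has to be computed in the scoring function, because test users are unknown beforehand. The sketch assumes cosine similarity and an unweighted sum of neighbors' interactions; the toy matrices and variable names are illustrative.

```python
import numpy as np

# Build time: only the training interactions are available (toy binary matrix).
train_matrix = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
], dtype=float)
train_norms = np.linalg.norm(train_matrix, axis=1)

# Scoring time: histories of previously unseen test users arrive.
test_matrix = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
], dtype=float)
test_norms = np.linalg.norm(test_matrix, axis=1)

# Cosine similarity between every test user and every training user.
test_similarity = (test_matrix @ train_matrix.T) / np.outer(test_norms, train_norms)

# Scores: similarity-weighted aggregation of the neighbors' interactions.
warm_scores = test_similarity @ train_matrix
```

The similarity matrix here is rectangular (test users by training users), so the notion of symmetry has to be reconsidered in terms of which side's activity is used for normalization.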

### Symmetric case (5 pts)

In [None]:
def build_uknn_model(config, data, data_description):
    ...
    return ...


def uknn_model_scoring(params, testset, testset_description):
    ...    
    return ...

In [None]:
uknn_params = ...

In [None]:
uknn_scores = ...

### Asymmetric case (5 pts)

In [None]:
def build_uknn_model_asym(config, data, data_description):
    
    return ...

def uknn_model_scoring_asym(params, testset, testset_description):

    return ...

In [None]:
uknn_params_asym = ...

In [None]:
uknn_scores_asym = ...

## Evaluation (1 pts)

### Generate recommendations for both models

In [None]:
uknn_recs = ...

In [None]:
uknn_recs_asym = ...

### Calculate metrics

In [None]:
modes = ['symmetric', 'asymmetric']
uknn_recs = dict(zip(modes, [uknn_recs, uknn_recs_asym]))


uknn_metrics = {}
for mode, recs in uknn_recs.items():
    if recs is None: continue
    uknn_metrics[mode] = metrics = model_evaluate(recs, holdout, data_description)
    print(
        f'Similarity type: {mode}\n'\
        'HR={:.3}, MRR={:.3}, COV={:.3}\n'.format(*metrics)
    )

## Tuning (2 pts)
- Try to find a neighborhood size that gives you better results.
- Perform a simple grid-search experiment and report your findings.
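A grid search over the neighborhood size can be organized as a plain loop over candidate values. In the sketch below, `grid_search_neighbors` is a hypothetical helper, and `dummy_metric` is a stand-in with no relation to real results; in the actual experiment it would be replaced by a function that builds the model, generates recommendations, and returns the target metric.

```python
def grid_search_neighbors(candidate_sizes, evaluate):
    """Evaluate every neighborhood size and return (best_size, results)."""
    results = {k: evaluate(k) for k in candidate_sizes}
    best_size = max(results, key=results.get)
    return best_size, results

def dummy_metric(k):
    # Placeholder objective that peaks at k=50 so that the sketch runs;
    # it is NOT a measurement of any model.
    return -abs(k - 50)

best_k, all_results = grid_search_neighbors([10, 50, 100], dummy_metric)
```

When tuning on real data, it is safer to select the neighborhood size on a separate validation split rather than on the final holdout.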

# Final analysis (3 pts)

1. Provide an analysis on which model performs the best and explain why.
2. Explain the difference in computational complexity of your models. Consider how the training and the recommendation generation differ for different models in terms of
    - the amount of RAM,
    - the amount of disk storage,
    - the load on CPU.
3. How else would you modify the model to improve either the quality of recommendations or computational performance? Describe at least one modification and its envisioned effect.