# Content-based models (20 PTS)

In this set, you need to implement a content-based approach to solve the ranking problem. You should also personalize the model so that it provides a list of personal recommendations for every user. To do so, we need to exploit the information on user-item interactions. This can be done in two ways:

**1)** Constructing content-based models for every user in the dataset

**2)** Constructing user profiles

To evaluate your solution, you need a new metric. As this is a ranking problem, we will use $Recall@n$. $Recall@n$ will be calculated for each user individually.

$$
Recall_u @n = \frac{|anime_u \cap holdout_u|}{|holdout_u|}
$$

Holdout items are the items our model will not see during training.
Each user has their own holdout items.
You will need the holdout in the evaluation step.
In this step, we predict the top $N$ recommended anime for each user ($anime_u$ in the formula above) and expect the holdout items to be among them.
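
As a worked illustration of the formula, the sketch below computes $Recall_u@n$ for a single user. The helper name `recall_at_n` and all item IDs are hypothetical and are not part of the assignment code.

```python
def recall_at_n(recommended, holdout, n):
    """Fraction of holdout items that appear among the first n recommendations."""
    top_n = set(recommended[:n])
    return len(top_n & set(holdout)) / len(holdout)


# hypothetical anime IDs, for illustration only
recommended = [11, 42, 7, 93, 5]   # ranked recommendations for one user
holdout = [42, 5, 100]             # items hidden from the model

recall_at_3 = recall_at_n(recommended, holdout, n=3)  # only item 42 is hit
recall_at_5 = recall_at_n(recommended, holdout, n=5)  # items 42 and 5 are hit
```

With these toy values, one of three holdout items is found in the top 3 and two of three in the top 5, so the metric can only stay the same or grow as $n$ increases.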



## Content-based models with personalization (10 PTS)

In this problem you need to implement a simple content-based model for each user individually in order to achieve some level of personalization. Thus your model may be considered an ensemble of personal models.

Here we present some default functions that are used in the code below. Do not change them. Note that the functions below are improved versions of the functions from the seminar.

In [3]:
from ast import literal_eval
import numpy as np
import pandas as pd
from evaluation import topidx

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression


## Loading the Data

In [None]:
anime_cut = (
    pd.read_csv('anime_cut.csv')
    .dropna() # remove items w/o description
    # remove items with empty string as descriptions
    .loc[lambda x: x['synopsis'].str.strip().apply(len)>0]
)

In [None]:
def string_ids_to_ints(line, allowed_ids):
    '''
    Convert text representation of ids list into python list of integers.
    Filter out ids that are not present in allowed ids.
    '''
    return [int(x) for x in literal_eval(line) if int(x) in allowed_ids]

In [None]:

allowed_items = set(anime_cut['anime_id'].values)

reviews_cut = (
    pd.read_csv('reviews_cut.csv')
    .drop_duplicates(subset=['user_id', 'anime_id'])
    .query('anime_id in @allowed_items') # keep only reviews of items that have a synopsis
    .assign(# convert favorites data into lists of integer ids 
        favorites_anime = lambda x:
            x['favorites_anime']
            .apply(string_ids_to_ints, args=(allowed_items,))
    )
    .loc[lambda x: x['favorites_anime'].apply(len)>0] # drop users without favorites
)

## Getting test triplets 

In [None]:
def split_train_test(reviews, n_pairs, score_cutoff=5, seed=0):
    """
    Splits anime rating data into training and test sets for content-based filtering.

    Parameters
    ----------
    reviews : pandas.DataFrame
        DataFrame containing ratings data.
    n_pairs : int
        The number of liked, disliked anime items to select per test user.
    score_cutoff : int, optional
        The cutoff threshold for item ratings. Items with ratings below this threshold are considered disliked. Default is 5.
    seed : int, optional
        Random seed to ensure reproducibility. Default is 0.

    Returns
    -------
    tuple of pandas.DataFrame
        A tuple containing the training and test datasets.
        - The training DataFrame contains the user ID, anime ID, and rating for each item.
        - The test DataFrame contains triplets of liked, disliked, and favorite anime items for a subset of test users.

    TL;DR
    -----
    Collects triplets of liked, disliked and favorite items for a subset of test users.
    The remaining items of the selected test users are used for training CB models.
    """    
    # select only users with at least 1 anime in favorites
    subset = reviews.loc[
        reviews["favorites_anime"].apply(lambda x : len(x) >0),
        ['user_id', 'anime_id', 'score', 'favorites_anime']
    ]
    valid_users = users_with_enough_data(subset, score_cutoff, n_pairs)
    # select only valid users (i.e. with enough likes and dislikes) and shuffle
    user_selection = subset.query('user_id in @valid_users')
    likes_dislikes = gather_user_feedback(user_selection, score_cutoff, n_pairs, seed)
    # extract favorites data
    favorites = (
        user_selection
        .drop_duplicates(subset=['user_id'])
        .set_index('user_id')
        ['favorites_anime']
    )
    # combine likes, dislikes, and favorites into a single dataframe
    test_triplets = pd.merge(
        likes_dislikes,
        favorites,
        left_index=True,
        right_index=True,
        how='inner'
    )
    # for each user, exclude test items from training
    test_data = (
        test_triplets
        .eval('likes + dislikes + favorites_anime')
        .explode()
        .to_frame('anime_id')
        .reset_index()
    )
    all_data = user_selection[['user_id', 'anime_id', 'score']]
    train_data = pd.merge(
        all_data,
        test_data,
        on=['user_id', 'anime_id'],
        indicator=True, # test entries will be indicated as "both" or "right_only"
        how='left', # train entries will be indicated as "left_only"
    ).query('_merge == "left_only"') # select train entries only
    return train_data.drop('_merge', axis=1), test_triplets.sort_index()


def users_with_enough_data(data, score_cutoff, n_pairs):
    '''
    Return users that have enough positive and negative items.
    '''
    valid_users = (
        (data["score"] >= score_cutoff)
        .groupby(data['user_id'])
        .agg(total='size', n_positive='sum')
        .assign(n_negative=lambda x: x.eval('total - n_positive'))
        .eval('n_positive >= @n_pairs and n_negative >= @n_pairs')
    )
    return valid_users.loc[lambda x: x].index


def gather_user_feedback(data, score_cutoff, n_pairs, seed):
    '''
    Extract fixed number of likes and dislikes per each user.
    '''
    likes_dislikes = (
        data
        # shuffle data to randomize selection of items
        .sample(frac=1, random_state=seed)
        # group items by positive/negative score for each user
        .groupby(['user_id', data['score']>=score_cutoff])
        ['anime_id'].apply(list) # make pos/neg item lists
        .str[:n_pairs] # select fixed number of pos/neg items per user
        .unstack('score') # pos/neg class as columns in dataframe
    )
    return likes_dislikes.rename(columns={False: 'dislikes', True: 'likes'})

## Preparing the data

Here we prepare the data. We divide the original dataset into two disjoint parts so that, for every user, the training history does not include the likes, dislikes, or favorites from the test triplets.
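
The exclusion of test items inside `split_train_test` relies on a left merge with `indicator=True`. The sketch below reproduces that idea on toy data; all IDs and scores are made up, and the variable names only mirror those used in the function.

```python
import pandas as pd

# toy interaction log; all IDs and scores are made up
all_data = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "anime_id": [10, 20, 30, 10, 40],
    "score":    [8, 3, 9, 7, 2],
})
# (user, item) pairs reserved for testing
test_data = pd.DataFrame({"user_id": [1, 2], "anime_id": [20, 40]})

train_data = (
    all_data
    # the `_merge` column tells whether a row was also found in `test_data`
    .merge(test_data, on=["user_id", "anime_id"], how="left", indicator=True)
    .query('_merge == "left_only"')   # keep rows absent from the test pairs
    .drop(columns="_merge")
)
```

Note that item 10 stays in the training data for both users: the exclusion is applied per (user, item) pair, not per item.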

In [None]:
reviews_train, test_triplets_ = split_train_test(reviews_cut, 3, score_cutoff=5, seed=0)
test_triplets = test_triplets_[test_triplets_.index.isin(reviews_train['user_id'].unique())]

## Creating the model

Let's get down to business!

- Build a collection of regression-based CB models on the anime data from Lecture 2.

- Note that you are now asked to build $N$ CB models, one for each user, taking only that user's history into account.

However, if a model considers only the synopses of the anime from one user's history, it may not see all the words that occur across all synopses. Hence, during evaluation, some features (words) will be omitted, which affects the model's predictions. Suggest a solution to this problem.
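
To make the issue concrete, the sketch below fits a `TfidfVectorizer` on a toy "user history" and transforms an item whose words never occurred in it. The texts are invented for illustration, and the snippet only demonstrates the problem, not a solution to it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy synopses from one user's history (invented texts)
user_history = [
    "a pirate crew sails the sea",
    "pirate treasure on a hidden island",
]
# an item from the catalog that shares no content words with the history
unseen_item = ["a robot pilot defends the colony"]

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(user_history)                  # vocabulary comes from the history only

unseen_vector = vectorizer.transform(unseen_item)
n_known_words = unseen_vector.nnz             # number of non-zero features
```

Here `n_known_words` is zero: every word of the new synopsis is silently dropped, so a regression model on top of these features sees an all-zero vector and can only return its intercept.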






In [None]:
cb_config = {
    "tfidf": dict( # TfIDF Vectorizer config
        ngram_range = (1, 1),
        min_df=1, max_df=4,
        strip_accents='unicode',
        stop_words = 'english',
        analyzer = 'word',
        use_idf=True,
        smooth_idf=True,
        sublinear_tf=True
    ),
}
# we also define a general representation of our dataset
anime_description = {
    'users': 'user_id',
    'items': 'anime_id',
    'favorites': 'favorites_anime',
    'feedback' : 'score',
    'feature_map': anime_cut.set_index('anime_id')['synopsis'],
    'train_items': reviews_train['anime_id'].unique()
}

In [None]:
def build_cb_model(config, trainset, trainset_description):
    """
    Build a set of content-based models to recommend items to users based on their history of feedback.
    Each user has a separate model.
    
    Parameters
    ----------
    config : dict
        A dictionary containing configuration settings for the model.
        
    trainset : pd.DataFrame
        A pandas DataFrame containing user-item-feedback tuples for training the model.
        
    trainset_description : dict
        A dictionary containing the description of the trainset with the following keys:
            - 'users': string
                The name of the column containing user IDs in the trainset.
            - 'items': string
                The name of the column containing item IDs in the trainset.
            - 'feedback': string
                The name of the column containing feedback values in the trainset.
            - 'feature_map': pd.Series
                A Series containing item features mapped to their respective IDs.
        
    Returns
    -------
    dict
        A dictionary containing trained Linear Regression models and TfidfVectorizer objects 
        for each user ID in the trainset.
    """
    userid = trainset_description['users']
    itemid = trainset_description['items']
    feedback = trainset_description['feedback']
    feature_map = trainset_description['feature_map']
    
    train_data = trainset[[userid, itemid, feedback]].groupby(userid).agg(list)
    users_dict = {}
    for user_id, items, ratings in train_data.itertuples(name=None):
        # we iterate over rows of `train_data` Dataframe
        # note that by construction, `train_data`'s index encodes user IDs,
        # and the two columns of `train_data` correspond to items and their ratings from user history
        word_vectorizer = TfidfVectorizer(**config['tfidf'])
        item_features = ...
        tfidf_matrix = ...
        reg =...
        users_dict[user_id] = (reg, word_vectorizer)
    return users_dict

In [None]:
cb_params = build_cb_model(cb_config, reviews_train, anime_description)

# Generating recommendations

In order to evaluate the model, you need to pass the model's recommendations to the evaluation function.

- To get predictions, you need to provide the model with the synopses of the anime from the triplets (likes/dislikes/favorites). In addition, you need to provide the model with the rest of the anime from the catalog, $anime_{cat}$.

$$
anime_{test} = likes + dislikes + favorites + anime_{cat}
$$

- When you pass $anime_{test}$ to your model, you get predicted scores. To get $anime_u$ (the list of our top $N$ recommendations), sort $anime_{test}$ by predicted score in descending order and take the $N$ items with the highest scores:

$$
anime_{u} = anime_{test}[\text{sorted scores}][:N]
$$
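
The selection step can be sketched with plain NumPy. The item IDs and scores below are hypothetical; the `topidx` helper imported at the top of the notebook is not shown in this document, so the sketch does not rely on it.

```python
import numpy as np

candidate_items = np.array([101, 102, 103, 104, 105])    # hypothetical anime IDs
predicted_scores = np.array([6.1, 8.7, 3.2, 9.4, 7.0])   # hypothetical model outputs
N = 3

order = np.argsort(-predicted_scores)     # indices from highest to lowest score
top_n_items = candidate_items[order[:N]]  # IDs of the N best-scored items
```

For large catalogs, a partial sort such as `np.argpartition` may be cheaper than a full `argsort`, at the cost of an extra step to order the selected items.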

In [None]:
def cb_model_recommendations(params, training, testset, data_description, topn=10):
    """
    Uses an ensemble of individual content-based models to generate recommendations for each test user.

    Parameters
    ----------
    params : dict
        A dictionary containing the regression model and word vectorizer for each user.
    training : pandas.DataFrame
        DataFrame containing the user ID, item ID, and rating for each item in the training dataset.
    testset : pandas.DataFrame
        DataFrame containing the user ID, liked items, disliked items, and favorite items for each test user.
    data_description : dict
        A dictionary containing metadata information for the dataset.
        - feature_map : pandas.Series
            Series mapping each item ID to its feature representation (the synopsis text).
        - users : str
            The name of the user ID column.
        - items : str
            The name of the anime ID column.
        - train_items : numpy.ndarray
            Array containing the unique item IDs from the training dataset.
    topn : int, optional
        The number of recommendations to generate per user. Default is 10.

    Returns
    -------
    numpy.ndarray
        Array containing the top n recommended anime IDs for each test user
        with preserved ordering of rows corresponding to the order of users in testset.

    TL;DR
    -----
    This function generates item recommendations for test users using a content-based model.
    For each user in the testset, the function selects items not in the user's history and 
    combines them with the user's likes, dislikes, and favorites.
    The function then applies the user's regression model to the feature representation of 
    these items and generates a score for each of them. The top-n scoring items are recommended to the user.

    """
    feature_map = data_description['feature_map']
    userid = data_description['users']
    itemid = data_description['items']
    user_history = training.groupby(userid)[itemid].apply(list)
    all_items = data_description['train_items']
    recs = []
    for user_id, likes, dislikes, favs in testset.itertuples(name=None):
        items_not_from_history = np.setdiff1d(all_items, user_history[user_id])
        scoring_items = ...
        reg, word_vectorizer = ...
        tfidf_matrix = ...
        user_scores = ...
        user_recs = ...
        recs.append(user_recs)
    return np.array(recs)

In [None]:
cb_recs = cb_model_recommendations(cb_params, reviews_train, test_triplets, anime_description)

# EVALUATION AND HOLDOUT

## HOLDOUT
- Before evaluation, you should pick out the $holdout$ items, i.e. the items our model does not see during training. For this purpose, you need to sample $k$ elements from each user's likes and favorites. The holdout is used in the evaluation step: we predict the top $N$ recommended anime and expect the holdout items to be among them.
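
One possible way to follow these steps is sketched below on toy data. The names `toy_triplets` and `toy_holdout` and all IDs are hypothetical; only the column names follow the test triplets built earlier. This is an illustration under those assumptions, not the reference solution.

```python
import pandas as pd

# toy test triplets indexed by user_id; all IDs are made up
toy_triplets = pd.DataFrame(
    {
        "likes": [[1, 2, 3], [4, 5, 6]],
        "dislikes": [[7, 8, 9], [10, 11, 12]],
        "favorites_anime": [[13], [14, 15]],
    },
    index=pd.Index([100, 200], name="user_id"),
)
k = 2  # holdout size per user

candidates = (
    (toy_triplets["likes"] + toy_triplets["favorites_anime"])  # row-wise list concatenation
    .explode()                        # one (user, item) pair per row
    .sample(frac=1, random_state=0)   # shuffle all pairs
)
toy_holdout = (
    candidates
    .groupby(level="user_id")
    .apply(list)      # collect the shuffled items of each user
    .str[:k]          # keep the first k of them
)
```

Dislikes are deliberately left out of `candidates`, since the holdout is meant to contain only items the user is expected to enjoy.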



## EVALUATION


1) In this task you need to compute the metrics $Recall@n$ and $Recall$.

- In this task, you will solve the top-$N$ recommendation problem. For this purpose, you need a more complex evaluation function, which is why we use $Recall@n$. This metric takes as input the list of personal recommendations and the holdout, and it is computed for each user separately. If the holdout anime are in the top $N$ recommendations for the user, the recommendation counts as valid. To evaluate the ensemble of models, we average $Recall@n$ over all users. Here $anime_u$ denotes the top $N$ predicted anime for user $u$.


$$
Recall_u @n = \frac{|anime_u \cap holdout_u|}{|holdout_u|}
$$






- To evaluate the whole model (model of models =)), you need to average personal recalls

$$
Recall = \frac{1}{\# users} \sum_u Recall_u
$$
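
The arithmetic of the two formulas can be traced on a tiny example. The recommendation array and the holdout below are invented; the sketch assumes, as in the functions of this notebook, that row `i` of the recommendation array belongs to the `i`-th user of the holdout index.

```python
import numpy as np
import pandas as pd

# toy top-3 recommendations, one row per user (made-up IDs)
recs = np.array([
    [11, 42, 7],
    [5, 6, 8],
])
# toy holdout, indexed by user ID in the same order as the rows of `recs`
holdout = pd.Series({100: [42, 99], 200: [1, 2]})

# share of each user's holdout items found among that user's recommendations
per_user = [
    np.isin(items, recs[row]).mean()
    for row, items in enumerate(holdout)
]
mean_recall = float(np.mean(per_user))
```

User 100 gets one of two holdout items recommended (recall 0.5), user 200 gets none (recall 0), so the averaged recall is 0.25.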

In [None]:
def cb_model_evaluate(recs, holdout, topn=10):
    '''
    Evaluate the recommendation system using the recall metric.

    Parameters:
    recs (numpy.ndarray): A 2D numpy array containing the recommended items for each user.
                          The shape of the array is (num_users, num_items).
    holdout (pandas.Series): A pandas Series containing the ground truth for each user.
                             The index of the series corresponds to the user IDs and the values are lists
                             of item IDs.
    topn (int): The number of top recommendations to consider.

    Returns:
    float: The recall score of the recommendation system.
    ''' 
    recall = []
    recs = recs[:, :topn]
    for idx, user_id in enumerate(holdout.index):
        user_recs = recs[idx]
        user_items = holdout.loc[user_id]
        user_recall = ...
        recall.append(user_recall)
    return np.mean(recall)


def sample_holdout(test_triplets, k=3):
    '''
    Pick out the holdout elements:
    choose likes and favorites_anime for every user,
    shuffle the obtained dataset,
    group it by user_id,
    and limit the holdout size to k items per user.
    
    '''
    #   Complete the holdout function.

    holdout = (
        ...
    )
    return holdout

In [None]:
holdout = sample_holdout(test_triplets)

In [None]:
cb_recall = cb_model_evaluate(cb_recs, holdout)

# Similarity-based models (8 PTS)

The similarity-based approach is another attempt to personalize the content-based approach. We create so-called user profiles: each user is represented by the weighted sum of the TfIdf vectors of the anime from that user's history. Afterwards, we compute cosine similarities between the user profile vectors and the vectors of all the anime in the catalog. In the following part you need to improve the similarity-based approach presented in Lecture 2.


## USER PROFILE

Consider a user $u_{i}$ who gave ratings $r_{i,j}$ to anime $a_{j}$ (in our case, $a_{j}$ is the TfIdf representation of the anime). The user profile vector is then:

$$
u_{i} = \frac{\sum_{j} (r_{i,j} \times a_{j})}{\sum_{j} r_{i,j}}
$$

To provide recommendations for a specific user, we compare each anime's vector representation to the user profile vector.
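
The profile formula is a rating-weighted average of item vectors. A minimal sketch on invented synopses and ratings, assuming default `TfidfVectorizer` settings rather than the configuration used in this notebook:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy synopses of the anime rated by one user (invented texts)
synopses = [
    "pirates search for treasure",
    "robots fight in space",
    "pirates fight robots",
]
ratings = np.array([9.0, 2.0, 7.0])   # hypothetical ratings r_{i,j} of that user

tfidf = TfidfVectorizer().fit_transform(synopses)   # row j is the item vector a_j

# weighted sum of item vectors divided by the sum of ratings
user_profile = tfidf.T @ ratings / ratings.sum()
```

The result is a dense vector with one entry per vocabulary term, so it can be compared directly with any row of the TfIdf matrix.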


 
**1)** Construct user profiles for users from test triplets. 

- In the first seminar, we used only users' positive reviews to create user profiles. Is this a good idea, or should we use the user's whole history? Comment on this question.


**2)** Is cosine similarity the only similarity that can be chosen? Try different [metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html) and study how they affect the model performance.


**3)** Construct the scoring function in the manner of the CB-based scoring function from above.
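
For the second question, `sklearn.metrics.pairwise_distances` accepts the metric as a string, which makes it easy to swap. The sketch below ranks three toy item vectors against one toy profile vector; the numbers are invented and the choice of metrics is only an example.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# toy vectors standing in for one user profile and three item vectors
profile = np.array([[0.6, 0.8, 0.0]])
items = np.array([
    [0.6, 0.8, 0.0],   # identical to the profile
    [0.0, 0.0, 1.0],   # orthogonal to the profile
    [0.8, 0.6, 0.0],   # close to the profile
])

rankings = {}
for metric in ("cosine", "euclidean", "manhattan"):
    distances = pairwise_distances(profile, items, metric=metric)[0]
    rankings[metric] = np.argsort(distances)   # smaller distance = more similar
```

On this toy input all three metrics produce the same ordering. On real TfIdf data the orderings may differ, in particular when vectors are not normalized to the same length.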

In [None]:
def user_profiles(tf_idf, test_pairs, reviews, anime):
    aid_index_dict, index_aid_dict = re_index(anime, 'anime_id')
    
    user_profile = dict()
    for i in test_pairs.index:
        ...
    return user_profile

def re_index(dataframe, column):
    '''
    Naive reindexing of data via two dictionaries:
    value-to-index and index-to-value mappings over the unique values of `column`.
    '''
    column_uniques = dataframe[column].unique()
    indexes = np.arange(len(column_uniques))
    item_index_dict = dict(zip(column_uniques, indexes))
    index_item_dict = dict(zip(indexes, column_uniques))
    return item_index_dict, index_item_dict

In [None]:
def build_sim_model(config, trainset, trainset_description):
    word_vectorizer = TfidfVectorizer(**config['tfidf'])
    tfidf_matrix = ...
    users_profiles = ...

    return word_vectorizer,  users_profiles

In [None]:

sim_config = {
    "tfidf": dict(
        ngram_range = (1, 1),
        min_df=5, max_df=0.9,
        strip_accents='unicode',
        stop_words = 'english',
        analyzer = 'word',
        use_idf=True,
        smooth_idf=True,
        sublinear_tf=True
    ),
    "reviews" :  reviews_clean,
    "anime" : anime_train,
    'test_pairs' : test_pairs
}


anime_description = {
    'feedback' : "rating",
    "items": "anime_id",
    "item_features": "synopsis",}

In [None]:
params = build_sim_model(sim_config, anime_train, anime_description)

In [None]:
def sim_model_scoring(params, test_pairs, anime, reviews):
    word_vectorizer,  users_profiles = params
    ...
    return numerator / (n*m)


# MODELS COMPARISON (2 PTS)

**1)** Compare the discussed models. Which of them works better and why? Comment on your results.