# CONTENT BASED MODELS

Imagine we run a service (like YouTube, Netflix, or similar) and some new users arrive. We want to recommend something they would like, but since they are new, we know nothing about them. How do we build a recommender system in such a situation? A related question is how to score and recommend new items that have known features but no known ratings.

The answer is simple: we have items (videos, movies, etc.). Each item has some features and its own rating (calculated from the grades given by users or critics). We want to recommend items with higher ratings to our newcomers. Below are three ways to do that.

But first, let me show the datasets we are going to work with.

Download the preprocessed dataset [here](https://drive.google.com/drive/folders/1YPqWaZYW2axz91YKkaM7j_xPqF5rSVx2?usp=drive_link).

In [None]:
import os
import ast
import numpy as np
import pandas as pd
from scipy.sparse import diags
from scipy.sparse.linalg import norm as sparse_norm

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import shap

## Dataset

### Reviews


Let's have a look at our *reviews* dataset. In this dataset we need only four columns: *user_id*, *anime_id*, *favorites_anime* and *score*.

- ***user_id*** User profile name. Some profile names are literally insane =)

- ***anime_id*** The mapping of the anime titles column onto integers.

- ***score*** The grade that the user with the specific *user_id* gave to the anime with the specific *anime_id*.

-  ***favorites_anime*** List of animes that the user with the specific *user_id* considers favorites. **ATTENTION** the animes from this list may not occur in the user's history (the animes the user graded). A user may have marked some animes as favorites without grading them. By default, we will assume that the score of a favorite anime is **10**.



In [None]:
USER_COL = "user_id"
ITEM_COL = "anime_id"
RELEVANCE_COL = "score"

In [None]:
base_path = "./anime_data"

In [None]:
review_data = pd.read_csv(os.path.join(base_path, 'reviews.gz'))

In [None]:
review_data.head()

#### Do you remember what is the difference between explicit and implicit feedback?

In [None]:
def get_log_info(log, user_id='user_id', item_id='item_id'):
    print(f'Num reviews = {log.shape[0]},\nnum users = {log[user_id].nunique()},\nnum items = {log[item_id].nunique()}')

In [None]:
get_log_info(review_data, user_id=USER_COL, item_id=ITEM_COL)

In [None]:
review_data[RELEVANCE_COL].plot.hist(bins=10, figsize=(10, 5), title='Score distribution from user reviews');

#### How does the distribution of ratings change if we have, for example, marketplace data?

In [None]:
def group_by_and_plot(df, group_by_name, rating_col_name, quantile=0.99, title=''):
    grouped = df.groupby(group_by_name)[rating_col_name].count()
    print(grouped.describe(percentiles=[0.05, .25, .5, .75, 0.95]))
    grouped[grouped < grouped.quantile(quantile)].plot(kind='hist', bins=20, figsize=(10, 5), title=title)
    return grouped

In [None]:
group_by_and_plot(review_data, group_by_name=USER_COL, rating_col_name=RELEVANCE_COL, quantile=0.99, title='Num reviews per user');

In [None]:
group_by_and_plot(review_data, group_by_name=ITEM_COL, rating_col_name=RELEVANCE_COL, quantile=0.99, title='Num reviews per anime');

#### How many animes could we reliably recommend using popularity-based methods? 

### Animes


The main columns for us are:

- ***anime_id*** - the same id we have in the table above.

- ***synopsis*** - the description of the anime with a specific **anime_id**

- ***score*** - the average of all grades given to the corresponding anime by users or critics


In [None]:
anime_data = pd.read_csv(os.path.join(base_path, 'animes.gz'), na_filter=False)

In [None]:
anime_data.head(1).T

In [None]:
anime_data['genre'].str.strip().str.split(", ").explode().value_counts()

In [None]:
anime_data.shape # 7636 in reviews, the rest won't be covered by the popularity-based models

In [None]:
anime_data[RELEVANCE_COL].plot.hist(bins=10, title='Score distribution from item features', figsize=(10, 5));

In [None]:
(anime_data[RELEVANCE_COL] > 0).sum()

In [None]:
all_anime_data = (
    anime_data[['anime_id', 'score']]
    .assign(tokens=anime_data[['title', 'genre', 'synopsis']].apply('; '.join, axis=1))
    .set_index('anime_id')
    # ['tokens']
)

In [None]:
all_anime_data.head()

#### What can we do with these tokens?

### User profiles

This dataset contains additional information about users, including their favorite anime.

- ***user_id*** - the same user id as in the reviews table.

- ***gender*** - user gender

- ***birthday*** - user birthday

- ***favorites_anime*** - a list of user favorites


In [None]:
profile_data = pd.read_csv(os.path.join(base_path, 'profiles.gz'), converters={'favorites_anime': ast.literal_eval})

In [None]:
profile_data.head()

In [None]:
profile_data.shape

In [None]:
(
    profile_data['favorites_anime']
    .apply(len)
    .value_counts(sort=False).sort_index()#.cumsum()
    .plot.bar(
        logy=True,
        rot=0,
        xlabel='# favorites',
        ylabel='frequency',
        title='Amount of favorites in user profiles',
        figsize=(10, 5)
    )
);

### Favorites ratings

In [None]:
favorites_data = profile_data.set_index('user_id')['favorites_anime']
favorites_scores = pd.merge(
    favorites_data.explode().rename('anime_id').reset_index(),
    review_data[['user_id', 'anime_id', 'score']],
    on = ['user_id', 'anime_id'],
    how = 'left'
)['score']

print(f'Fraction of favorites without the rating: {favorites_scores.isnull().mean():.0%}')

- Most of the favorites do not have ratings info.
- For the sake of evaluation in this exercise, we will make two assumptions:
  - favorites are of the highest interest to users,
  - other animes that users rate highly should be somewhat similar to their favorites.

# Data split


- To construct the test set, we use the users who have some anime in their favorites list.
- For these users we take **n_pairs** animes the user liked and **n_pairs** animes the user disliked from their anime reviews.
  - We assume the user liked an anime if they gave it a grade no lower than some *score_cutoff* value.
- The quality of predictions is then evaluated by checking, based on the predicted scores, whether the favorites are closer to the liked animes than to the disliked ones.

In [None]:
def get_test_pairs(reviews, favorites, n_pairs, score_cutoff, seed):
    '''
    Construct a dataset consisting of pairs of liked and disliked animes. The likes and dislikes
    are defined by the ratings value: everything below threshold is a dislike, the rest are likes.
        
    The function ensures that the amount of likes and dislikes is the same per each user in data.
    The users that do not contain enough likes or dislikes are discarded from the result.
    The result is to be used for evaluating the quality of recommendations by some algorithms.
    Hence, user favorites are excluded to ensure that there is no trivial solution.
    '''
    rng = np.random.default_rng(seed)
    def strict_sample_no_favs(series):
        # sample `n_pairs` elements from `series`, if not enough data - return empty list,
        # discard favorites, otherwise the evaluation on test pairs against favorites makes no sense
        above_cutoff, user_id = series.name
        allowed_items = np.setdiff1d(series.values, favorites.loc[user_id])
        return rng.choice(allowed_items, n_pairs, replace=False) if len(allowed_items)>=n_pairs else []
    
    test_pairs = (
        reviews
         # split by likes and dislikes, group by users
        .groupby([(reviews["score"] >= score_cutoff), 'user_id'])
        # sample `n_pairs` items (both likes and dislikes), disregard user favorites
        ['anime_id'].apply(strict_sample_no_favs)
         # disregard users that have not enough items
        .loc[lambda x: x.apply(len) > 0]
         # make two columns of likes and dislikes
        .unstack('score')
        # ensure each user has both likes and dislikes
        .dropna()
         # rename by rule `score >= score_cutoff`
        .rename(columns={False: 'dislikes', True: 'likes'})
    )
    return test_pairs

We will generate the training data by excluding animes from the pairs of likes and dislikes contained in the test data. 

In [None]:
def split_anime_train_test_data(reviews, favorites, anime, *, n_pairs=3, score_cutoff=5, seed=0):
    '''
    Function to construct train dataset. It deletes animes that occurred in the test set
    to prevent information leakage from test to train.
    '''
    test_data = get_test_pairs(reviews, favorites, n_pairs, score_cutoff, seed)
    test_anime_set = (
        test_data
        .melt(value_name='animes') # reshape 2 columns into a single column
        ['animes'].explode() # flatten all lists into a single long 1d array
        .unique() # get only unique values
    )
    train_data = (
        anime
        # only use known score - to be used as the prediction target,
        # also prevent test data leaks into training
        .query('score > 0 and anime_id not in @test_anime_set')
        # combine several text fields into a unified feature view
        .assign(tokens = lambda x: x[['title', 'genre', 'synopsis']].apply('; '.join, axis=1))
        # only take necessary fields
        .loc[:, ['anime_id', 'tokens', 'score']]
    )
    return train_data, test_data

In [None]:
anime_train, anime_test = split_anime_train_test_data(
    review_data, favorites_data, anime_data, score_cutoff=5
)

In [None]:
anime_train.head()

In [None]:
# to evaluate classification/regression metrics

test_animes = (anime_test
        .melt(value_name='animes') # reshape 2 columns into a single column
        ['animes'].explode() # flatten all lists into a single long 1d array
        .unique()) # get only unique values

anime_test_animes = anime_data.query('score > 0 and anime_id in @test_animes') \
        .assign(tokens = lambda x: x[['title', 'genre', 'synopsis']].apply('; '.join, axis=1)) \
        .loc[:, ['anime_id', 'tokens', 'score']]

In [None]:
anime_data.query('score > 0').shape, anime_train.shape[0] + anime_test_animes.shape[0]

In [None]:
anime_test.head()

In [None]:
anime_test_animes.head(2)

In [None]:
anime_test.shape

In [None]:
group_by_and_plot(review_data.query('user_id in @anime_test.index.values'), group_by_name=USER_COL, rating_col_name=ITEM_COL, quantile=0.99, title='Num reviews per test user');

You will need to build personalized models for these users in your homework. Do you have enough data?

In [None]:
favorites_data.head()

# Regression model

<!-- ## PURE CONTENT BASED MODELS

We will explore application of regression and classification approaches.
 -->
In a regression task (in its most straightforward form), we want to predict the average score from one or more features of an anime. Here we will use binary vectors built from the synopsis of each anime:

$$
S = W x + W_0
$$

where $x$ is the text representation, $S$ is the predicted score, $W$ is the vector of model weights, and $W_0$ is the bias.

In our case we need some numerical representation of the textual information. First we will use a binary representation (whether a word is present in the text), and then we will apply the TF-IDF representation.
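To make the binary representation concrete, here is a minimal pure-Python sketch on a toy corpus. The texts, the `binary_vector` helper and the whitespace tokenization are assumptions made for illustration only; they are not part of the dataset or of the pipeline used in this notebook.

```python
# Toy corpus; the texts are invented for illustration only.
docs = [
    "magic school magic",
    "school robot",
    "robot war robot war",
]

# Vocabulary: every distinct token, sorted to get a stable column order.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def binary_vector(doc, vocabulary):
    """Return 1 for each vocabulary word present in `doc`, else 0."""
    tokens = set(doc.split())
    return [int(word in tokens) for word in vocabulary]

binary_matrix = [binary_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in binary_matrix:
    print(row)
```

Note that a repeated word still contributes just 1: only presence matters, not the count. `CountVectorizer` with `binary=True` builds the same kind of matrix in sparse form, with its own tokenization and vocabulary filtering.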

## Task

Here we need to predict an anime's rating. For this purpose we are going to use a ***regression model***. The simplest regression model is ***linear regression***.


$$
S = W x + W_0
$$


where **S** is the predicted score, **x** is our text representation, **W** is the vector of model weights, and $W_0$ is the bias.
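As a quick sanity check of the formula, the sketch below fits $W$ and $W_0$ by least squares with NumPy on a tiny hand-made dataset. The feature matrix and the scores are hypothetical: they are generated exactly from $S = 2x_1 - x_2 + 6$ so that we know which weights to expect. This is not how the models below are trained (they use `sklearn`); it only shows what "fitting the weights" means.

```python
import numpy as np

# Hypothetical binary features for 4 items (rows) and 2 words (columns).
X = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
])
# Scores generated exactly as S = 2*x1 - 1*x2 + 6, so the fit is exact.
S = np.array([8.0, 5.0, 7.0, 6.0])

# Append a column of ones so that the bias W_0 is learned as an extra weight.
X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_with_bias, S, rcond=None)
W, W_0 = solution[:-1], solution[-1]

print(W, W_0)
```

The recovered weights are close to `[2, -1]` and the bias is close to `6`, i.e. the presence of the first word raises the predicted score and the presence of the second one lowers it.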

## Pipeline

Below you can see a simple pipeline for this task. We are going to extend it later in the course.

- ***build_cb_model*** - constructs the model we need. As a rating predictor we use linear regression. The inputs of this function are the model config, the training set built above, and the training set description

- ***generate_features*** - applies CountVectorizer or TfidfVectorizer to the items' descriptions
 
- ***cb_model_scoring*** - provides the scores for the input dataset. This function uses the parameters fitted in ***build_cb_model***

- ***cb_config*** - defines the parameters of the model

- ***transform_predict*** - unified function to generate recommendations

- ***anime_description*** - maps the names of the features we use in ***build_cb_model***/***cb_model_scoring*** to column names; used for convenience.

In [None]:
class DenseTransformer(TransformerMixin):
    """
    Convert sparse matrix to dense np array to apply standard scaler with mean.
    """

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.toarray()

In [None]:
def build_cb_model(config, trainset, trainset_description, logistic=False, binary_vectorizer=True):
    """
    Config and fit cb model
    """
    feature_matrix, word_vectorizer = generate_features(config, trainset, trainset_description, binary_vectorizer)
    if logistic:
        regressor = LogisticRegression
    elif 'alpha' in config['model']:
        regressor = Ridge
    else:
        regressor = LinearRegression
    target_column = trainset_description['feedback']
    model = regressor(**config['model']).fit(feature_matrix, trainset[target_column])
    return model, word_vectorizer

def generate_features(config, trainset, trainset_description, binary_vectorizer):
    """
    Config and fit text vectorizer
    """
    if binary_vectorizer:
        word_vectorizer = CountVectorizer(**config['vectorizer']['binary'])
    else:
        word_vectorizer = Pipeline([("tfidf", TfidfVectorizer(**config['vectorizer']['tfidf'])), 
                                    ('dense', DenseTransformer()), 
                                    ("scaler", StandardScaler())])
    features_column = trainset_description['item_features']
    feature_matrix = word_vectorizer.fit_transform(trainset[features_column])
    return feature_matrix, word_vectorizer


def transform_predict(params, tokens):
    """
    Get recommendations from either classification or regression model
    """
    model, word_vectorizer = params
    tokens_encoded = word_vectorizer.transform(tokens)
    try: # handle classification models
        predictor = model.predict_proba
    except AttributeError:
        predictor = model.predict
    scores = predictor(tokens_encoded)
    if scores.ndim > 1: # handle classification
        scores = scores[:, 1] # take class 1
    return scores

def cb_model_scoring(params, testset, testset_description):
    """
    Select necessary features and get recommendations with the fitted pipeline
    """
    tokens = testset[testset_description['item_features']]
    scores = transform_predict(params, tokens)
    return scores


## Linear regression with the binary text vectorization

In [None]:
cb_config = {
    "model": dict(),
    "vectorizer":{
        "binary": dict( # simple binary token encoder
            min_df = 5,
            max_df = 0.9,
            strip_accents='unicode',
            stop_words = 'english',
            analyzer = 'word',
            binary = True,
        ),
        "tfidf": dict( # TfIDF Vectorizer
            min_df = 5,
            max_df = 0.9,
            strip_accents='unicode',
            stop_words = 'english',
            analyzer = 'word',
            use_idf = True,
            smooth_idf = True,
            sublinear_tf = True,
            binary = False,
            norm="l2",
        ),
    }
}
# we also define a general representation of our dataset
anime_description = {
    'feedback' : "score",
    "item_features": "tokens",
}

In [None]:
reg_params = build_cb_model(cb_config, anime_train, anime_description)
reg_scores = cb_model_scoring(reg_params, anime_train, anime_description)
reg_scores_test_anime = cb_model_scoring(reg_params, anime_test_animes, anime_description)

In [None]:
reg_params[1]

In [None]:
reg_params[0]

In [None]:
len(reg_params[1].vocabulary_), anime_train.shape[0]

Any concerns?

In [None]:
len(reg_params[0].coef_), np.linalg.norm(reg_params[0].coef_)

In [None]:
reg_params[0].intercept_

#### TO DO: build a CountVectorizer + LinearRegression model without given functions, fit it with `anime_train` and get the scores for `anime_test_animes`.

In [None]:
# YOUR CODE HERE

## Simple evaluation of our model

Here we compare predicted scores to the ground truth using MAE and RMSE metrics.
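For intuition, here is a tiny worked example of both metrics on made-up numbers (the three predicted and ground-truth scores are purely illustrative):

```python
import math

predicted = [7.0, 8.0, 6.0]
ground_truth = [6.0, 8.0, 9.0]

# Per-item errors: 1.0, 0.0 and -3.0
errors = [p - g for p, g in zip(predicted, ground_truth)]

# MAE averages absolute errors; RMSE averages squared errors, then takes the root.
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"{mae=:.4f}, {rmse=:.4f}")
```

Here MAE is $4/3$ and RMSE is $\sqrt{10/3}$. RMSE is larger because squaring gives the single big miss (an error of 3) more weight than the small ones.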



In [None]:
def calc_rmse_mae(predicted_scores, gt_scores):
    rmse = np.sqrt(np.mean((predicted_scores - gt_scores)**2))
    mae = np.mean(np.abs(predicted_scores - gt_scores))
    print(f'{rmse=:.4f}, {mae=:.4f}')

#### train scores

In [None]:
calc_rmse_mae(anime_train['score'].values, reg_scores)

#### test scores

In [None]:
calc_rmse_mae(anime_test_animes['score'].values, reg_scores_test_anime)

## A little bit of analysis

Let's look at the significant features of our model

In [None]:
def top_idx(a, topk):
    # idx of top-k with the biggest scores
    parted = np.argpartition(a, -topk)[-topk:]
    # idx of top-k sorted descending by relevance 
    return parted[np.argsort(-a[parted])]

In [None]:
def significant_features(params, topn=10, reverse=False):
    reg, word_vectorizer = params
    features_weights = reg.coef_.squeeze()
    if reverse:
        features_weights = -features_weights
    top_features_idx = top_idx(features_weights, topn)
    if isinstance(word_vectorizer, Pipeline):
        word_vectorizer = word_vectorizer[0]
    try: # handle API changes in different versions of sklearn
        features = word_vectorizer.get_feature_names()
    except AttributeError:
        features = word_vectorizer.get_feature_names_out()

    feature_scores = pd.DataFrame(
        zip(np.array(features)[top_features_idx], np.array(features_weights)[top_features_idx]),
        columns = ['feature', 'weight']
    )
    return feature_scores

### Linear regression feature importance

In [None]:
# features with the most positive impact
significant_features(reg_params)

In [None]:
# features with the most negative impact
significant_features(reg_params, reverse=True)

In [None]:
the_anime_name = "Kara no Kyoukai"
the_anime_series = anime_data[anime_data['title'].str.contains(the_anime_name)].anime_id.unique()
the_anime_reviews = review_data[(review_data['anime_id'].isin(the_anime_series))].copy()
the_anime_reviews.loc[:, 'text'] = the_anime_reviews['text'].str.lower()
ryougi = the_anime_reviews[the_anime_reviews['text'].str.contains("ryougi")].shape[0]
kokutou = the_anime_reviews[the_anime_reviews['text'].str.contains("kokutou")].shape[0]
both = the_anime_reviews[(the_anime_reviews['text'].str.contains("kokutou")) 
       & (the_anime_reviews['text'].str.contains("ryougi"))].shape[0]

In [None]:
print(f"Num reviews for {the_anime_name} with {ryougi = } with {kokutou = }, with {both = }")

Also, *gunsou* (2nd most positive) and *keroro* (6th most negative) both come from the name of an anime.

#### What could be the reason? 

## Ridge regression with the binary text vectorization

In [None]:
cb_config["model"] = {"alpha": 100}

In [None]:
ridge_reg_params = build_cb_model(cb_config, anime_train, anime_description)
ridge_reg_scores = cb_model_scoring(ridge_reg_params, anime_train, anime_description)
ridge_reg_scores_test_anime = cb_model_scoring(ridge_reg_params, anime_test_animes, anime_description)

In [None]:
ridge_reg_params[1]

In [None]:
ridge_reg_params[0]

In [None]:
np.linalg.norm(ridge_reg_params[0].coef_)

In [None]:
ridge_reg_params[0].intercept_

#### train scores

In [None]:
calc_rmse_mae(anime_train['score'].values, ridge_reg_scores)

#### test scores

In [None]:
calc_rmse_mae(anime_test_animes['score'].values, ridge_reg_scores_test_anime)

### Ridge regression feature importance

In [None]:
# features with the most positive impact
significant_features(ridge_reg_params)

In [None]:
# features with the most negative impact
significant_features(ridge_reg_params, reverse=True)

#### Think: does this kind of significant-feature analysis make sense for a non-binary feature matrix?

#### scores distribution

In [None]:
bins = np.linspace(-10, 20, 100)
plt.hist(reg_scores_test_anime, bins=bins, alpha=0.5, label='linear');
plt.hist(anime_test_animes['score'].values, bins=bins, alpha=0.5, label='gt');
plt.hist(ridge_reg_scores_test_anime, bins=bins, alpha=0.5, label='ridge');

plt.title("Distribution of ground truth and predicted test scores")
plt.legend(loc='upper right')
plt.show()

**Remark** 

Do not forget to regularize your models)

During the course we will move from content-based models to models built on top of users' interactions with items. Items can also be very similar and appear in groups, leading to multicollinearity. Regularization will turn out to be one of the important success factors for collaborative filtering models.
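The effect of multicollinearity is easy to reproduce on a toy example. The sketch below is an illustration under stated assumptions: the data are invented, the `ridge_weights` helper is hypothetical, and the closed-form solution without an intercept is used instead of `sklearn`'s `Ridge`. Two features are almost identical, and the target is the first feature plus a small perturbation.

```python
import numpy as np

# Two nearly identical (collinear) features; values invented for illustration.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = x1 + np.array([0.001, -0.001, 0.001, -0.001])
X = np.column_stack([x1, x2])
# Target: the first feature plus a small perturbation.
y = x1 + np.array([0.1, -0.1, 0.1, -0.1])

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without intercept: (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_weights(X, y, alpha=0.0)    # plain least squares
w_ridge = ridge_weights(X, y, alpha=1.0)  # regularized

print("OLS:  ", w_ols, np.linalg.norm(w_ols))
print("Ridge:", w_ridge, np.linalg.norm(w_ridge))
```

Without regularization the model explains the tiny perturbation with two huge weights of opposite sign (roughly -99 and 100), while ridge spreads a moderate weight (about 0.49 each) over the two similar features.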

## EVALUATION

Imagine we don't have exact anime ratings. How can we tell whether our model is sane or not? Let's review all the information we have for users $u\in\mathcal{U}$.

For each user we have:
- predicted scores $r_i$ for the likes of user $u$ ($i\in\mathcal{I}_u^+$)
- predicted scores $r_j$ for the dislikes of user $u$ ($j\in\mathcal{I}_u^-$)
- predicted scores $r_k$ for the favorites of user $u$ ($k\in\mathcal{I}_u^f$)

By construction, the set of user favorites is disjoint from the items in the test user preferences $(\mathcal{I}_u^+ \cup \mathcal{I}_u^-)\cap\mathcal{I}_u^f=\emptyset$   (see `get_test_pairs` function).

Intuitively, the predicted scores on a user's favourite animes should be (on average) closer to the predicted scores on the user's likes rather than on dislikes. So, the evaluation is based on the following "closeness-rate" metric:
$$
CR = \frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}\frac{1}{|\mathcal{I^f_u}|}\sum_{k\in\mathcal{I}_u^f}\mathbb{I}\left(\text{dist}(k,\mathcal{I_u^+})<\text{dist}(k,\mathcal{I_u^-})\right),
$$
where $\mathbb{I}(\cdot)$ is an indicator function that returns 1 or 0 depending on whether its argument is true or not. The distances can be measured, for example, simply as
$$
\begin{aligned}
\text{dist}(k,\mathcal{I_u^+}) &= \frac{1}{|\mathcal{I_u^+}|}\sum_{i\in\mathcal{I_u^+}}(r_i - r_k)^2, \\
\text{dist}(k,\mathcal{I_u^-}) &= \frac{1}{|\mathcal{I_u^-}|}\sum_{j\in\mathcal{I_u^-}}(r_j - r_k)^2.
\end{aligned}
$$

In short, *the deviation of the predicted scores on favorite items from the predicted scores of liked items must be lower than that from the disliked items*. Note that in our setup $|\mathcal{I}_u^+|=|\mathcal{I}_u^-|=n\_pairs$.

Think of other possible functions for the distance estimation.
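Before looking at the implementation, here is the metric computed by hand for a single user. The predicted scores are hypothetical, and `mean_abs_dist` is just one possible alternative distance (mean absolute difference), included as an assumption rather than as the intended answer.

```python
import numpy as np

# Hypothetical predicted scores for a single user.
scores_pos = np.array([8.0, 7.0])  # liked items
scores_neg = np.array([3.0, 4.0])  # disliked items
scores_fav = np.array([9.0, 5.0])  # favorites

def mean_squared_dist(scores_group, scores_fav):
    """Mean squared difference between each favorite and a group of items."""
    # the outer difference has shape (n_group, n_favorites); average over the group
    return (np.subtract.outer(scores_group, scores_fav) ** 2).mean(axis=0)

def mean_abs_dist(scores_group, scores_fav):
    """An alternative distance: mean absolute difference."""
    return np.abs(np.subtract.outer(scores_group, scores_fav)).mean(axis=0)

pos_dist = mean_squared_dist(scores_pos, scores_fav)  # one value per favorite
neg_dist = mean_squared_dist(scores_neg, scores_fav)
user_cr = (pos_dist < neg_dist).mean()  # share of favorites closer to the likes

print(pos_dist, neg_dist, user_cr)
```

The first favorite (score 9) is closer to the likes (2.5 vs 30.5), the second one (score 5) is closer to the dislikes (6.5 vs 2.5), so this user contributes 0.5 to the average. $CR$ is the mean of such per-user values over all test users.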


In [None]:
def cb_model_evaluation(params, test_pairs, favorites, anime_features, distance_function):
    positive_is_closer = []
    for user_id, likes, dislikes in test_pairs[['likes', 'dislikes']].itertuples(name=None):
        pos_distance, neg_distance = distance_function(user_id, likes, dislikes, params, favorites, anime_features)
        positive_is_closer.append((pos_distance < neg_distance).mean())
    return np.mean(positive_is_closer)


def cb_distances(user_id, likes, dislikes, params, favorites, anime_features):
    """
    Calculate the distance between user's favorites and likes, 
    as well as between user's favorites and dislikes based on scores of a regression model
    """
    user_favorites = favorites[user_id]
    # (n_positives,)
    scores_pos = transform_predict(params, anime_features.loc[likes, 'tokens'])
    scores_neg = transform_predict(params, anime_features.loc[dislikes, 'tokens'])
    # (n_favorites,)
    scores_fav = transform_predict(params, anime_features.loc[user_favorites, 'tokens'])
    # (n_positives, n_favorites) -> (n_favorites), mean distance between the favorite and all users' positives
    pos_deviation = np.power(np.subtract.outer(scores_pos, scores_fav), 2).mean(axis=0)
    neg_deviation = np.power(np.subtract.outer(scores_neg, scores_fav), 2).mean(axis=0)
    return pos_deviation,neg_deviation

In [None]:
cb_model_evaluation(reg_params, anime_test, favorites_data, all_anime_data, cb_distances)

In [None]:
cb_model_evaluation(ridge_reg_params, anime_test, favorites_data, all_anime_data, cb_distances)

Is this result good or bad? Let's compare it to the random prediction.

## Random prediction baseline

In [None]:
class RandomPredictor:
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def predict(self, tokens_encoded):
        return self.rng.standard_normal(tokens_encoded.shape[0])

In [None]:
rnd_res = []
n_tries = 5
for seed in np.random.SeedSequence().generate_state(n_tries):
    rnd_params = (RandomPredictor(seed), reg_params[1])
    rnd_res.append(cb_model_evaluation(rnd_params, anime_test, favorites_data, all_anime_data, cb_distances))

print(f'Random model result: {np.mean(rnd_res):.4f}+-{np.std(rnd_res):.4f}')

## Ridge regression with the TF-IDF text representation

### TFIDF

According to Wikipedia, TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. We will use TF-IDF for this purpose.




TF-IDF constructs the representation of the text as a product of two statistics:

- ***Term frequency***, which shows how often a specific word (term) $t$ occurs in the document $d$ (a synopsis in our case), relative to the document length

$$tf(t,d) = \frac{f(t,d)}{\sum_{t' \in d}{f(t',d)}}$$


- ***Inverse document frequency***, which shows how common or rare the word (term) is across all documents $D$

$$idf(t, D) = \log{\frac{|D|}{|\{d \in D : t\in d\}|}}$$


So the resulting metric is just the product of $tf$ and $idf$. It automatically gives more weight to the words that are more important for document classification.
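The two formulas can be checked on a toy corpus with a few lines of pure Python. The documents and the helper names (`tf`, `idf`, `tfidf`) are invented for illustration. Keep in mind that `TfidfVectorizer` uses a somewhat different variant (raw or sublinear counts, smoothed IDF, and L2 normalization), so its numbers will not match this textbook version exactly.

```python
import math

# Toy tokenized "synopses"; invented for illustration only.
docs = [
    ["magic", "school", "magic"],
    ["school", "robot"],
    ["robot", "war", "robot", "war"],
]

def tf(term, doc):
    """Share of the document's tokens that are equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`).

    Assumes that `term` occurs in at least one document.
    """
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("magic", docs[0], docs))   # frequent in the document, rare in the corpus
print(tfidf("school", docs[0], docs))  # also appears in another document
```

In the first document, "magic" gets $\frac{2}{3}\log 3$ while "school" gets only $\frac{1}{3}\log\frac{3}{2}$: the word that is both frequent in the document and rare in the corpus receives the larger weight.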

### "alpha": 10

In [None]:
cb_config["model"] = {"alpha": 10}

In [None]:
%%time
tfidf_reg_params = build_cb_model(cb_config, anime_train, anime_description, binary_vectorizer=False)
tfidf_reg_scores = cb_model_scoring(tfidf_reg_params, anime_train, anime_description)
tfidf_reg_scores_test_anime = cb_model_scoring(tfidf_reg_params, anime_test_animes, anime_description)

In [None]:
tfidf_reg_params[1]

In [None]:
tfidf_reg_params[0]

In [None]:
# the pipeline ends with StandardScaler, so the centered features can take negative values
np.min(tfidf_reg_params[1].transform(anime_test_animes['tokens'])), np.max(tfidf_reg_params[1].transform(anime_test_animes['tokens']))

In [None]:
len(tfidf_reg_params[0].coef_), np.linalg.norm(tfidf_reg_params[0].coef_)

In [None]:
tfidf_reg_params[0].intercept_

#### train scores

In [None]:
calc_rmse_mae(anime_train['score'].values, tfidf_reg_scores)

#### test scores

In [None]:
calc_rmse_mae(anime_test_animes['score'].values, tfidf_reg_scores_test_anime)

#### Ridge regression feature importance with TFIDF

In [None]:
# features with the most positive impact
significant_features(tfidf_reg_params)

In [None]:
# features with the most negative impact
significant_features(tfidf_reg_params, reverse=True)

In [None]:
cb_model_evaluation(tfidf_reg_params, anime_test, favorites_data, all_anime_data, cb_distances)

### "alpha": 20000

In [None]:
cb_config["model"] = {"alpha": 20000}

In [None]:
%%time
tfidf_reg_params = build_cb_model(cb_config, anime_train, anime_description, binary_vectorizer=False)
tfidf_reg_scores = cb_model_scoring(tfidf_reg_params, anime_train, anime_description)
tfidf_reg_scores_test_anime = cb_model_scoring(tfidf_reg_params, anime_test_animes, anime_description)

In [None]:
tfidf_reg_params[1]

In [None]:
tfidf_reg_params[0]

In [None]:
np.max(tfidf_reg_params[1].transform(anime_test_animes['tokens']), axis=0)

In [None]:
len(tfidf_reg_params[0].coef_), np.linalg.norm(tfidf_reg_params[0].coef_)

In [None]:
tfidf_reg_params[0].intercept_

#### train scores

In [None]:
calc_rmse_mae(anime_train['score'].values, tfidf_reg_scores)

#### test scores

In [None]:
calc_rmse_mae(anime_test_animes['score'].values, tfidf_reg_scores_test_anime)

#### Ridge regression feature importance with TFIDF

In [None]:
# features with the most positive impact
significant_features(tfidf_reg_params)

In [None]:
# features with the most negative impact
significant_features(tfidf_reg_params, reverse=True)

In [None]:
cb_model_evaluation(tfidf_reg_params, anime_test, favorites_data, all_anime_data, cb_distances)

### Shap values for Ridge regression with TFIDF

In [None]:
anime_train.head(2)

In [None]:
cb_config["model"] = {"alpha": 10}

In [None]:
tf_idf = TfidfVectorizer(**cb_config["vectorizer"]["tfidf"])

In [None]:
train_vectorized = tf_idf.fit_transform(anime_train["tokens"]).toarray()

In [None]:
ridge_tf_idf = Ridge(**cb_config["model"])

In [None]:
ridge_tf_idf.fit(train_vectorized, anime_train["score"])

In [None]:
tf_idf.vocabulary_["music"]

In [None]:
train_preds_tf_idf = ridge_tf_idf.predict(train_vectorized)

In [None]:
# explain the model's predictions using SHAP
explainer = shap.explainers.Linear(ridge_tf_idf, train_vectorized)
shap_values = explainer(train_vectorized)

In [None]:
# we add back the feature names stripped by the TfidfVectorizer
for word,idx in tf_idf.vocabulary_.items():
    shap_values.feature_names[idx] = word

In [None]:
shap.plots.beeswarm(shap_values, max_display=20)

In [None]:
shap.plots.waterfall(shap_values[0], max_display=20)

# Classification Task

- Your task is to turn the regression model into a classification model.
- Use the `LogisticRegression` class from `sklearn` for this.
- Adapt the previously defined functions to the new approach.

- Try using more than 2 classes (i.e., not just binary).

In a classification task we predict a category (label). Here we will predict whether an anime can be recommended or not. We will do it in a simple way, for example with logistic regression:


$$
y_{pred} = \frac{1}{1+e^{-(W x + W_0)}}
$$
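As a small numeric illustration, the sketch below evaluates the logistic function $1/(1+e^{-z})$ at $z = Wx + W_0$ for a few binary inputs. The weights and the bias are hypothetical values chosen by hand, not fitted ones.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two binary features and a bias.
W, W_0 = [1.5, -2.0], 0.5

for x in ([1, 0], [0, 1], [0, 0]):
    z = sum(w * xi for w, xi in zip(W, x)) + W_0
    print(x, round(sigmoid(z), 3))
```

A positive $z$ gives a probability above 0.5, a negative $z$ gives one below 0.5, and $z = 0$ gives exactly 0.5. This is why we will threshold the predicted probabilities at 0.5 below.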

But first of all we need to label our dataset. Let's take a look at the scores

In [None]:
anime_train.query('score>0').score.hist(bins=10);

In [None]:
anime_train.query('score>0').score.mean()

The histogram shows that the mean of our average score distribution is slightly displaced.
So we can take the mean as a borderline and say that all the animes with an average score < 6.24 are lame, and all the animes with scores above this value are lit.

#### What kind of interactions data could we get and how could we transform the target to get binary targets?

In [None]:
def assign_labels(data, cutoff):
    '''Add the binary `recommend` field based on the score cutoff value'''
    labeled_df = data.assign(
        recommend = lambda x: x['score'].ge(cutoff).astype(int)
    )
    return labeled_df

### Logistic regression with binary features

In [None]:
cb_config["model"] = {"C": 0.1, 
                      "max_iter": 1000, 
                      "class_weight": "balanced"
                     }

In [None]:
score_cutoff = 6.24

gt_label = assign_labels(anime_train, score_cutoff)
gt_label["recommend"].mean()

In [None]:
logreg_params = build_cb_model(
    cb_config,
    # extend dataset with class labels
    gt_label,
     # target 0/1 instead of raw scores
    {**anime_description, **{'feedback': 'recommend'}},
    # use logistic regression instead of LR 
    logistic = True,
    binary_vectorizer = True
)

In [None]:
logreg_params[1]

In [None]:
logreg_params[0]

In [None]:
logreg_scores = cb_model_scoring(logreg_params, anime_train, anime_description)
logreg_scores_test_anime = cb_model_scoring(logreg_params, anime_test_animes, anime_description)

In [None]:
(logreg_scores_test_anime >= 0.5).astype(int).mean()

In [None]:
accuracy_score(y_true=assign_labels(anime_train, score_cutoff)['recommend'], y_pred=(logreg_scores >= 0.5).astype(int))

In [None]:
accuracy_score(y_true=assign_labels(anime_test_animes, score_cutoff)['recommend'], y_pred=(logreg_scores_test_anime >= 0.5).astype(int))

In [None]:
cb_model_evaluation(logreg_params, anime_test, favorites_data, all_anime_data, cb_distances)

#### Logistic regression feature importance

In [None]:
# features with the most positive impact
significant_features(logreg_params)

In [None]:
# features with the most negative impact
significant_features(logreg_params, reverse=True)

### Logistic regression with TF-IDF features

In [None]:
cb_config["model"] = {"C": 0.0001, 
                      "max_iter": 1000, 
                      "class_weight": "balanced"
                     }

In [None]:
logreg_params = build_cb_model(
    cb_config,
    # extend dataset with class labels
    gt_label,
    # target 0/1 instead of raw scores
    {**anime_description, **{'feedback': 'recommend'}},
    # use logistic regression instead of linear regression
    logistic = True,
    binary_vectorizer = False
)
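
The only change compared to the previous model is the feature encoding. A small sketch of the difference on made-up documents, assuming that `binary_vectorizer=False` corresponds to scikit-learn's `TfidfVectorizer` with default settings (the actual choice is made inside `generate_features`):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical toy documents with repeated tokens
toy_docs = ["action action adventure", "action romance", "romance drama drama"]

binary_features = CountVectorizer(binary=True).fit_transform(toy_docs)
tfidf_features = TfidfVectorizer().fit_transform(toy_docs)

binary_dense = binary_features.toarray()
tfidf_dense = tfidf_features.toarray()
# with the default norm='l2' every TF-IDF row has unit Euclidean length
tfidf_row_norms = np.linalg.norm(tfidf_dense, axis=1)

print(binary_dense)
print(tfidf_dense.round(2))
print(tfidf_row_norms)
```

Binary features only record presence, while TF-IDF values depend on how often a token occurs in the document and how rare it is across documents.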

In [None]:
logreg_params[1]

In [None]:
logreg_params[0]

In [None]:
logreg_scores = cb_model_scoring(logreg_params, anime_train, anime_description)
logreg_scores_test_anime = cb_model_scoring(logreg_params, anime_test_animes, anime_description)

In [None]:
cb_model_evaluation(logreg_params, anime_test, favorites_data, all_anime_data, cb_distances)

#### Logistic regression feature importance

In [None]:
# features with the most positive impact
significant_features(logreg_params)

In [None]:
# features with the most negative impact
significant_features(logreg_params, reverse=True)

# Content-based similarity models

We will build a so-called ***user profile***: a weighted vector of the user's preferences. This way, the user is represented as a vector of consolidated item features.

### User profile construction

Assuming we have a user $u$ who gave ratings $r_i$ to each anime $i\in\mathcal{I}_u$, where each anime is represented by a feature vector $a_i$, the corresponding user feature profile vector is defined as:

$$
v_{\mathcal{I}_u} = \frac{\sum_{i\in\mathcal{I}_u} r_i \, a_{i}}{\sum_{i\in\mathcal{I}_u} r_i}
$$

To provide recommendations for a specific user, we compare an anime's vector representation to the user profile vector. The similarity indicates how close this anime is to the user's preferences derived from $\mathcal{I}_u$.
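
The profile formula is just a rating-weighted average of item feature vectors. A tiny worked example in NumPy, with made-up features and ratings (all names here are hypothetical):

```python
import numpy as np

# rows: items rated by the user, columns: features of a toy vocabulary
item_features = np.array([
    [1.0, 0.0, 1.0],  # item 1
    [0.0, 1.0, 1.0],  # item 2
    [1.0, 1.0, 0.0],  # item 3
])
ratings = np.array([10.0, 5.0, 5.0])

# normalized weights r_i / sum(r_i)
weights = ratings / ratings.sum()
# weighted sum of the item vectors, shape (n_features,)
user_profile = item_features.T @ weights
print(user_profile)  # [0.75 0.5  0.75]
```

The item with rating 10 contributes half of the profile, and the other two contribute a quarter each.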

#### What are other ways to construct a user profile? What could be the problem with the formula above?

### The model

In [None]:
def build_sim_model(config, trainset, trainset_description, binary_vectorizer=True):
    _, word_vectorizer = generate_features(config, trainset, trainset_description, binary_vectorizer)
    return None, word_vectorizer

In [None]:
sim_config = {
    'vectorizer': cb_config['vectorizer'].copy()
}

In [None]:
sim_params = build_sim_model(sim_config, anime_train, anime_description)

In [None]:
sim_params

### Evaluation

For evaluation, we measure distances between the vector representations of the user's favourites and the profiles built from the user's likes and dislikes. We assume that the cosine similarity between a favourite and the likes profile is greater than the cosine similarity between that favourite and the dislikes profile. Accordingly, the distance functions used in the evaluation of the CR metric can be set as:
$$
\text{dist}(k,\mathcal{I}_u^+) = 1-\text{sim}(v_{\mathcal{I}_u^+}, v_k), \\
\text{dist}(k,\mathcal{I}_u^-) = 1-\text{sim}(v_{\mathcal{I}_u^-}, v_k),
$$
where $v_{\mathcal{I}_u^+}$ and $v_{\mathcal{I}_u^-}$ are user feature profiles (consolidated from the features of liked and disliked items, respectively) and $v_k$ is the feature vector of a favorite item $k$.

#### Is cosine similarity better than the dot product? What is the difference?
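
As a starting point for the discussion, here is a small made-up example where the two measures rank items differently. The vectors are hypothetical; the point is only that the dot product grows with vector length, while cosine similarity depends on direction alone.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

profile = np.array([1.0, 1.0, 0.0])
# same direction as the profile
short_item = np.array([1.0, 1.0, 0.0])
# larger magnitude, but only partially aligned with the profile
long_item = np.array([3.0, 0.0, 3.0])

dot_short = profile @ short_item
dot_long = profile @ long_item
cos_short = cosine_similarity(profile, short_item)
cos_long = cosine_similarity(profile, long_item)

print(dot_short, dot_long)  # 2.0 3.0
print(cos_short, cos_long)
```

By dot product the long item wins, by cosine similarity the perfectly aligned short item wins.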

In [None]:
def sim_distances(user_id, likes, dislikes, params, favorites, anime_features):
    """
    Calculate the distances between the user's favorites and likes,
    as well as between the user's favorites and dislikes,
    based on the scores of a content-based similarity model
    """
    _, word_vectorizer = params
    # get favorite items features representation
    # (n_favorites, vocabulary size)
    favorites_features = word_vectorizer.transform(anime_features.loc[favorites[user_id], 'tokens'].values)
    # build the user profile from liked items
    user_profile_pos = generate_feature_profile(likes, word_vectorizer, anime_features)
    # similarity between each favorite and the likes profile
    similarity_pos = feature_similarity(favorites_features, user_profile_pos)
    # build the user profile from disliked items and compare it to the favorites
    user_profile_neg = generate_feature_profile(dislikes, word_vectorizer, anime_features)
    similarity_neg = feature_similarity(favorites_features, user_profile_neg)
    # convert cosine similarities to distances
    return 1-similarity_pos, 1-similarity_neg
        
def generate_feature_profile(items, word_vectorizer, item_features):
    """
    Feature profile of a user
    """
    scores = item_features.loc[items, 'score'].values
    tokens = item_features.loc[items, 'tokens'].values
    # (n_items, vocabulary size)
    feature_matrix = word_vectorizer.transform(tokens)
    # (n_items,)
    weights = scores / np.sum(scores)
    # (vocabulary size, )
    return feature_matrix.T.dot(weights)

def feature_similarity(feature_matrix, feature_profile):
    """
    Calculate the similarity between the user's favorites and likes/dislikes,
    based on content-based cosine similarity
    """
    # (n_favorites, vocabulary size) @ (vocabulary size)
    similarity = feature_matrix.dot(feature_profile)
    weights = sparse_norm(feature_matrix, axis=1) * np.linalg.norm(feature_profile)
    similarity /= weights
    return similarity

In [None]:
cb_model_evaluation(sim_params, anime_test, favorites_data, all_anime_data, sim_distances)

#### TO DO: get recommendations for the user by building the user's vector from their positives and finding the closest items

In [None]:
the_user_positives  = anime_test.loc["King_Of_Light", :]["likes"]
the_user_positives

In [None]:
anime_data.query("anime_id in @the_user_positives")[["anime_id", "title", "synopsis", "genre"]]

In [None]:
all_anime_data["tokens"].head(2)

In [None]:
# anime_data['anime_id'] == all_anime_data.index

In [None]:
# Construct the user's vector
# user_vector = 

In [None]:
full_item_feature_matrix = sim_params[1].transform(all_anime_data["tokens"])
full_item_feature_matrix

In [None]:
weights = sparse_norm(full_item_feature_matrix, axis=1) * np.linalg.norm(user_vector)
weights

In [None]:
# get the scores
# scores = 

In [None]:
the_user_pred = top_idx(scores, topk=10)


pd.DataFrame({"anime_idx": the_user_pred, "score": scores[the_user_pred]}).merge(anime_data[["anime_id", "title", "synopsis", "genre"]], left_on="anime_idx", right_index=True)
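
If you get stuck, here is one possible shape of the whole procedure on a made-up catalog. This is only a sketch under several assumptions: a freshly fitted binary `CountVectorizer` instead of `sim_params[1]`, hypothetical weights for the positives, `np.argsort` instead of the `top_idx` helper, and already seen items being excluded from the result.

```python
import numpy as np
from scipy.sparse.linalg import norm as sparse_norm
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy catalog; the position in the list plays the role of anime_id
toy_catalog = [
    "action adventure shounen",  # 0
    "action mecha space",        # 1
    "romance drama school",      # 2
    "comedy school",             # 3
    "adventure fantasy action",  # 4
]
toy_positives = [0, 1]             # items the user liked
toy_scores = np.array([9.0, 7.0])  # made-up weights of those items

vectorizer = CountVectorizer(binary=True)
item_matrix = vectorizer.fit_transform(toy_catalog)  # (n_items, vocabulary size)

# user vector: weighted average of the liked items' feature vectors
weights = toy_scores / toy_scores.sum()
toy_user_vector = item_matrix[toy_positives].T.dot(weights)  # (vocabulary size,)

# cosine similarity between every catalog item and the user vector
norms = sparse_norm(item_matrix, axis=1) * np.linalg.norm(toy_user_vector)
toy_sim = item_matrix.dot(toy_user_vector) / norms

# assumption: do not recommend what the user has already seen
toy_sim[toy_positives] = -np.inf
toy_top = np.argsort(-toy_sim)[:2]
print(toy_top, toy_sim[toy_top])
```

Item 4 shares "action" and "adventure" with the liked items, so it comes first; the remaining items have no overlap with the user vector and get zero similarity.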

#### Is this evaluation by favorite genres completely fair?