# Installation with pip
Every dependency needed by the framework is downloaded and installed automatically.

In [1]:
!pip install clayrs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/swapUniba/ClayRS.git@perfect_replicability_vbpr
  Cloning https://github.com/swapUniba/ClayRS.git (to revision perfect_replicability_vbpr) to /tmp/pip-req-build-l1kgkrzo
  Running command git clone --filter=blob:none --quiet https://github.com/swapUniba/ClayRS.git /tmp/pip-req-build-l1kgkrzo
  Running command git checkout -b perfect_replicability_vbpr --track origin/perfect_replicability_vbpr
  Switched to a new branch 'perfect_replicability_vbpr'
  Branch 'perfect_replicability_vbpr' set up to track remote branch 'perfect_replicability_vbpr' from 'origin'.
  Resolved https://github.com/swapUniba/ClayRS.git to commit 9451fb279b428ce090418b8623e6b28c8d12fde6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting spacy~=3.2.1
  Down

# **! RESTART RUNTIME !**

# Correct order of logs and prints for IPython
This is necessary only in IPython environments (Colab, Jupyter, etc.), since they do not preserve the relative order of `print` and `logging` output

```python
# EXAMPLE of the issue
>>> import logging
>>> print("Should go first")
>>> logging.warning("Should go second")
WARNING:root:Should go second
Should go first
```



In [1]:
import functools
print = functools.partial(print, flush=True)
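
The effect of the rebinding can be checked outside a notebook as well. The sketch below is only an illustration: it uses a differently named wrapper (`flushing_print`, a hypothetical name) and an in-memory buffer, so that the built-in `print` is left untouched.

```python
import functools
import io

# Same technique as above, under a different name so the built-in is not shadowed
flushing_print = functools.partial(print, flush=True)

# Write to an in-memory stream to show that the wrapper still behaves like print
buffer = io.StringIO()
flushing_print("Should go first", file=buffer)

# The keyword bound by partial() is stored on the partial object itself
bound_keywords = flushing_print.keywords
```

Every call made through the wrapper flushes the stream immediately, which is what keeps printed text from being delayed behind log records.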

# Import and datasets download

The framework is made of three modules:
> 1.   Content Analyzer
> 2.   Recommender System
> 3.   Evaluation

We import every module as a library and use classes and methods by using the dot notation:

In [2]:
from clayrs import content_analyzer as ca
from clayrs import recsys as rs
from clayrs import evaluation as eva

# Usage:
# ...
# ca.Ratings()
# rs.ContentBasedRS()
# eva.EvalModel()
# ...

We use **MovieLens 100k** as the dataset, with item information expanded with data from IMDb

In [3]:
# Dataset: Movielens-100k

# download items_info
!wget https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/items_info.json

# download users_info
!wget https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/users_info.csv

# download ratings
!wget https://raw.githubusercontent.com/Silleellie/clayrs/master/datasets/ml-100k/ratings.csv

--2023-02-28 22:55:38--  https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/items_info.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2222967 (2.1M) [text/plain]
Saving to: ‘items_info.json’


2023-02-28 22:55:38 (36.6 MB/s) - ‘items_info.json’ saved [2222967/2222967]

--2023-02-28 22:55:38--  https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/users_info.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22667 (22K) [text/plain]
Saving to: ‘users_info.csv’


2023-02-28 22:55:3

### Check items file
In this example, the file containing items info is a JSON where every entry corresponds to a movie.

Each movie comes with several pieces of information, such as *genres, directors, cast, etc.*
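
Besides previewing raw lines, the file can be parsed with the standard `json` module. The sketch below is a minimal illustration that uses an abridged copy of the first record (some fields omitted) embedded as a string, so it runs without the downloaded file. Note that every value, including the year, is stored as a string.

```python
import json

# Abridged copy of the first record of items_info.json (some fields omitted)
raw = """
[
    {
        "movielens_id": "1",
        "imdb_id": "0114709",
        "title": "Toy Story",
        "genres": "Animation, Adventure, Comedy, Family, Fantasy",
        "year": "1995",
        "directors": "John Lasseter"
    }
]
"""

movies = json.loads(raw)
first = movies[0]

# Multi-valued fields are comma-separated strings and must be split by hand
genres = [genre.strip() for genre in first["genres"].split(",")]

# Numeric fields are strings too and need an explicit conversion
year = int(first["year"])
```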

In [4]:
with open("items_info.json", "r") as f:
  # print the first 25 lines: they contain only 2 entries,
  # 'Toy Story' and 'Golden Eye'
  for _ in range(25):
    print(f.readline(), end='')


[
    {
        "movielens_id": "1",
        "imdb_id": "0114709",
        "title": "Toy Story",
        "plot": "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
        "genres": "Animation, Adventure, Comedy, Family, Fantasy",
        "year": "1995",
        "rating": "8.3",
        "directors": "John Lasseter",
        "cast": "Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn, John Ratzenberger, Annie Potts, John Morris, Erik von Detten, Laurie Metcalf, R. Lee Ermey, Sarah Rayne, Penn Jillette, Jack Angel, Spencer Aste, Greg Berg, Lisa Bradley, Kendall Cunningham, Debi Derryberry, Cody Dorkin, Bill Farmer, Craig Good, Gregory Grudt, Danielle Judovits, Sam Lasseter, Brittany Levenbrown, Sherry Lynn, Scott McAfee, Mickie McGowan, Ryan O'Donohue, Jeff Pidgeon, Patrick Pinney, Phil Proctor, Jan Rabson, Joe Ranft, Andrew Stanton, Shane Sweet, Wayne Allwine, Tony Anselmo, Jonathan Benair, Anthony Burch, 

### Check users file
In this example, the file containing users info is a CSV file where the first column is the *user id*, while the other columns are side information for that user (*age, gender, occupation, zip code*)

In [5]:
with open("users_info.csv", "r") as f:

  # print the header and the first 2 entries
  for _ in range(3):
    print(f.readline(), end='')

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
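
To work with the columns by name rather than as raw lines, the standard `csv` module is enough. The following sketch embeds the three lines printed above as a string, so it does not depend on the downloaded file; the `side_info` dictionary is just an illustrative structure, not something ClayRS requires.

```python
import csv
import io

# The header and the first two entries shown above
sample = """user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
"""

# DictReader uses the header row as keys for every following row
users = list(csv.DictReader(io.StringIO(sample)))

# Map each user id to its side information (every column except the id)
side_info = {
    row["user_id"]: {key: value for key, value in row.items() if key != "user_id"}
    for row in users
}
```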


<a name="cell-id"></a>
### Check ratings
In this example, the file containing the interactions between the users and the movies is a CSV, where every interaction is a rating on the **[1, 5]** Likert scale

In [6]:
import pandas as pd

pd.read_csv('ratings.csv')

       user_id  item_id  rating  timestamp
0          196      242       3  881250949
1          186      302       3  891717742
2           22      377       1  878887116
3          244       51       2  880606923
4          166      346       1  886397596
...        ...      ...     ...        ...
99995      880      476       3  880175444
99996      716      204       5  879795543
99997      276     1090       1  874795795
99998       13      225       2  882399156


# Content Analyzer: representation of Items
In order to define the *item representation*, the following parameters should be defined:
*   ***source***: the path of the file containing items info
*   ***id***: the field that uniquely identifies an item
*   ***output_directory***: the path where serialized representations are saved



In [7]:
# Configuration of item representation 
movies_ca_config = ca.ItemAnalyzerConfig(
    source=ca.JSONFile('items_info.json'),
    id='movielens_id',
    output_directory='movies_codified/',
)

<a name="ca_id"></a>
Each item can be represented using a set of fields.
Every field can be **represented** using several techniques, such as *'tfidf'*, *'entity linking'*, *'embeddings'*, etc.

It is possible to process the content of each field using a **Natural Language Processing (NLP) pipeline**.  
It is also possible to assign a **custom id** for each generated representation, in order to allow a simpler reference in the recommendation phase. Both NLP pipeline and custom id are optional parameters.

> In the following example, we process the *'plot'* field by performing **lemmatization** and **stopwords removal** through [NLTK](https://www.nltk.org/), and we represent it in multiple ways:

1. **embedding** using the pre-trained model `glove-twitter-50`


* `Word2DocEmbedding` represents **every word** of the *'plot'* field with an embedding vector, and then combines the word vectors into a single one, here by computing their centroid (several combining techniques are available). The centroid vector is the embedding representation of the **whole field**


```python
>>> plot = "This is a very long text"

# First it will calculate the embedding of every word:
# this = [8.5623 1.2201 0.5652 ...]
# is = [2.1120 3.4578 1.2203 ...]
# a = [5.2345 1.2221 4.2356 ...]
# ...
# text = [4.2201 7.5532 1.0023 ...]

# then it will calculate the centroid (for example) of every embedding above
>>> plot_centroid_embedding = [4.1002 2.5589 3.1245 ...]
```


The pre-trained model is downloaded if it is not found locally. If a preprocessing pipeline is specified, the field is preprocessed before the embeddings are computed.

*    After the model is downloaded, the Gensim library is a bit slow in loading the model into memory, so be patient
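
The centroid step described above can be sketched in plain Python. This is not the ClayRS implementation: the vectors are made-up 3-dimensional toy values rather than real GloVe embeddings, and `centroid` is a hypothetical helper that only shows the element-wise mean.

```python
from statistics import fmean

# Toy word vectors (made-up numbers, NOT real glove-twitter-50 values)
word_vectors = {
    "long": [1.0, 2.0, 3.0],
    "text": [3.0, 4.0, 5.0],
}


def centroid(vectors):
    """Return the element-wise mean of equally sized vectors."""
    return [fmean(components) for components in zip(*vectors)]


# Tokens that survived preprocessing (e.g. after stopwords removal)
tokens = ["long", "text"]

# One vector for the whole field: the mean of its word vectors
doc_embedding = centroid([word_vectors[token] for token in tokens])
```

With these toy values `doc_embedding` is `[2.0, 3.0, 4.0]`: each component is the average of the corresponding components of the word vectors.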


In [8]:
movies_ca_config.add_single_config(
    'plot',
    
    # 1
    ca.FieldConfig(
        ca.Word2DocEmbedding(ca.Gensim('glove-twitter-50'),
                             combining_technique=ca.Centroid()),
        ca.NLTK(stopwords_removal=True, lemmatization=True),
        id='gensim'
    )
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


2. **embedding** using the pre-trained model [`SBERT`](https://www.sbert.net/)

* `Sentence2DocEmbedding` represents **every sentence** of the *'plot'* field with an embedding vector, and then computes the centroid of the sentence vectors.



In [9]:
movies_ca_config.add_single_config(
    'plot',
    
    # 2
    ca.FieldConfig(
        ca.Sentence2DocEmbedding(ca.Sbert('paraphrase-distilroberta-base-v1'),
                                 combining_technique=ca.Centroid()),
        ca.NLTK(stopwords_removal=True, lemmatization=True),
        id='sbert'
    ),
)

3. a simple **tfidf** representation using the [*Whoosh Index*](https://whoosh.readthedocs.io/en/latest/index.html)

In [10]:
movies_ca_config.add_single_config(
    'plot',

    # 3
    ca.FieldConfig(ca.WhooshTfIdf(),
                    ca.NLTK(stopwords_removal=True, lemmatization=True),
                    id="whoosh_tfidf")
)
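
For readers unfamiliar with tf-idf, the following sketch computes the textbook weighting `tf * log(N / df)` on two tiny tokenized documents. It is an assumption-laden illustration: the weighting actually applied by `WhooshTfIdf` may differ (for instance in smoothing or normalization), and the documents and the `tf_idf` helper are made up for the example.

```python
import math
from collections import Counter

# Two toy documents, already tokenized (made-up content)
docs = {
    "d1": ["toy", "cowboy", "toy"],
    "d2": ["spy", "cowboy"],
}


def tf_idf(docs):
    """Textbook tf-idf: raw term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: count * math.log(n_docs / df[term]) for term, count in tf.items()
        }
    return weights


weights = tf_idf(docs)
```

A term that appears in every document ("cowboy" here) gets weight 0, while a term that is frequent in one document and absent from the others ("toy") gets the highest weight.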

Multiple representations for the same field can be specified one at a time with the `add_single_config()` method as shown above or **all at once** thanks to the `add_multiple_config()` method:



```python
# movies_ca_config.add_multiple_config(
#     'plot',
#     [
#         ca.FieldConfig(...),

#         ca.FieldConfig(...),

#         ...
#     ]
# )
```


At the end of the configuration step, we provide the configuration to the *'Content Analyzer'* and call the `fit()` method:

*   The Content Analyzer will **represent** and **serialize** every item.



In [12]:
ca.ContentAnalyzer(config=movies_ca_config).fit()

INFO - ***********   Processing field: plot   ***********
INFO - Downloading/Loading Gensim glove-twitter-50
Processing and producing contents with Gensim glove-twitter-50:  0%|          | 0/1682 [00:00<?]
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Processing and producing contents with Gensim glove-twitter-50:  100%|██████████| 1682/1682 [00:23<00:00]
INFO - Downloading/Loading Sbert paraphrase-distilroberta-base-v1


  0%|          | 0.00/306M [00:00<?, ?B/s]

Processing and producing contents with Sbert paraphrase-distilroberta-base-v1:  100%|██████████| 1682/1682 [04:01<00:00]
INFO - Computing tf-idf with WhooshTfIdf
Serializing contents:  100%|██████████| 1682/1682 [00:10<00:00]


# [Optional] Content Analyzer: representation of Users
The *'user representation'* could be defined with the same process used for the *'item representation'*. In this case we do not need a complex representation of users, so this step is completely optional.

In this example, the ID for users is the column `user_id`.

In [13]:
# Configuration of user representation
users_ca_config = ca.UserAnalyzerConfig(
    ca.CSVFile('users_info.csv'),
    id='user_id',
    output_directory='users_codified/',
)

# Since no complex representation for users is needed, the fit() method is called immediately
ca.ContentAnalyzer(config=users_ca_config).fit()

Serializing contents:  100%|██████████| 943/943 [00:02<00:00]


# Recommender System: Random Forests classifier

The Recommender System module needs information about users, items and ratings. 

The **Ratings** class imports ratings from a source file (or from an existing DataFrame) into a custom object. **If** the source file contains users (U), items (I) and ratings (R) in this order, no additional parameters are needed; **otherwise** the mapping must be explicitly specified using:

*   **'user_id'** column,
*   **'item_id'** column,
*   **'score'** column





In [14]:
ratings = ca.Ratings(ca.CSVFile('ratings.csv'))

print(ratings)

Importing ratings:  100%|██████████| 100000/100000 [00:00<00:00]


      user_id item_id  score
0         196     242    3.0
1         186     302    3.0
2          22     377    1.0
3         244      51    2.0
4         166     346    1.0
...       ...     ...    ...
99995     880     476    3.0
99996     716     204    5.0
99997     276    1090    1.0
99998      13     225    2.0
99999      12     203    3.0

[100000 rows x 3 columns]


In [15]:
# (mapping by index) EQUIVALENT:
#
# ratings = ca.Ratings(
#     ca.CSVFile('ratings.csv'),
#     user_id_column=0,
#     item_id_column=1,
#     score_column=2
# )

In [16]:
# (mapping by column name) EQUIVALENT:

# ratings = ca.Ratings(
#     ca.CSVFile('ratings.csv'),
#     user_id_column='user_id',
#     item_id_column='item_id',
#     score_column='rating'
# )
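
The two mapping styles above (by position or by column name) can be mimicked with the standard `csv` module. The sketch below is not how `ca.Ratings` is implemented: `load_triples` is a hypothetical helper, and the sample reuses the first two rows of `ratings.csv` shown earlier, assuming the file starts with a header row.

```python
import csv
import io

# Header and first two rows of ratings.csv, as displayed earlier
sample = """user_id,item_id,rating,timestamp
196,242,3,881250949
186,302,3,891717742
"""


def load_triples(text, user_col=0, item_col=1, score_col=2):
    """Read (user, item, score) triples; columns are given by index or by name."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    def resolve(col):
        # A string is looked up in the header, an integer is used as a position
        return header.index(col) if isinstance(col, str) else col

    u, i, s = (resolve(col) for col in (user_col, item_col, score_col))
    return [(row[u], row[i], float(row[s])) for row in body]


by_index = load_triples(sample)
by_name = load_triples(sample, "user_id", "item_id", "rating")
```

Both calls yield the same triples, which is the sense in which the commented cells above are equivalent to the default call.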

The Recommender System also needs an algorithm to rank items or predict scores for users. In the following example we use the **Random Forests** classifier with the [sklearn](https://scikit-learn.org/) implementation.
There are multiple classifiers implemented (Gaussian process, Logistic Regression, etc.) all using sklearn implementation.
> In this case we change the default **number of trees** by passing a custom `n_estimators` parameter

The classifier will be trained on items *liked by the user* and it will rank *unseen items* based on a score in the range $[0, 1]$

The items liked by the user are those with a rating greater than or equal to a specific **threshold**. If the threshold is not specified, the average rating given by the user is used.
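
The threshold rule can be sketched as follows. This is an illustration, not ClayRS code: `split_by_threshold` is a hypothetical helper, the ratings are made up, and the fallback to the mean of the user's own ratings follows the code comment in the next cell.

```python
from statistics import fmean


def split_by_threshold(user_ratings, threshold=None):
    """Split a user's rated items into liked (>= threshold) and the rest."""
    if threshold is None:
        # Assumed fallback: the average rating given by this user
        threshold = fmean(user_ratings.values())
    liked = [item for item, score in user_ratings.items() if score >= threshold]
    others = [item for item, score in user_ratings.items() if score < threshold]
    return liked, others


# Made-up ratings of a single user: item_id -> score
user_ratings = {"242": 3.0, "302": 5.0, "377": 1.0}

# No threshold: the user's mean rating (3.0) is used
liked_default, others_default = split_by_threshold(user_ratings)

# Explicit threshold
liked_strict, others_strict = split_by_threshold(user_ratings, threshold=4)
```

A per-user mean adapts to users who rate generously or severely, whereas a fixed threshold applies the same cut-off to everyone.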

The Recommender System leverages the representations defined by the Content Analyzer. In the current example, we use the representations of the field 'plot'. We could use all representations created by the content analyzer or a subset of them. 
Representations can be referenced using the **external id** (if specified, see [here](#ca_id)) or the **internal id**:


```
For the field 'plot':
First representation created -> internal_id = 0
Second representation created -> internal_id = 1
...
Nth representation created -> internal_id = n-1
```

In [17]:
random_forests = rs.ClassifierRecommender(
    {'plot': ['gensim', 'sbert', 'whoosh_tfidf']},
    
    # custom parameter passed to sklearn
    rs.SkRandomForest(n_estimators=145)
)

# no threshold parameter specified, the average rating given by
# the user will be used

Before we can instantiate the recommender system, we need to split the dataset: let's perform a **HoldOut partitioning with train set size equal to 80% of the original ratings**

*   The output of the partitioning module is a pair of lists: one containing the train sets (a single one in this case), the other containing the test sets (again a single one)

In [18]:
# 0.8 is the fraction of ratings kept in the train set
train_list, test_list = rs.HoldOutPartitioning(train_set_size=0.8).split_all(ratings)

Performing HoldOutPartitioning:  100%|██████████| 943/943 [00:00<00:00]
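
To make the idea of hold-out partitioning concrete, here is a minimal stdlib sketch. The progress bar above counts 943 iterations, the number of users, which suggests the split is performed user by user; the sketch assumes exactly that, plus a seeded shuffle. Both are assumptions, and `hold_out_per_user` is a hypothetical helper rather than the ClayRS implementation. The ratings are made up.

```python
import random
from collections import defaultdict


def hold_out_per_user(ratings, train_set_size=0.8, seed=42):
    """Split (user, item, score) triples so each user keeps ~80% in train."""
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    rng = random.Random(seed)  # seeded for reproducibility
    train, test = [], []
    for rows in by_user.values():
        rows = rows[:]  # copy, so the caller's data is not shuffled in place
        rng.shuffle(rows)
        cut = int(len(rows) * train_set_size)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test


# Made-up ratings: 10 for user u1, 5 for user u2
ratings = [("u1", f"i{n}", 3.0) for n in range(10)]
ratings += [("u2", f"i{n}", 4.0) for n in range(5)]

train, test = hold_out_per_user(ratings)
```

Splitting per user guarantees that every user with enough ratings appears in both sets, which is needed to train a per-user classifier and then evaluate it.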


The Recommender System needs the following parameters: the recommendation  algorithm, the train set, and the path of the items serialized by the Content Analyzer:

*   We have only a single train set due to the partitioning technique chosen

In [19]:
train_set = train_list[0]

cbrs = rs.ContentBasedRS(random_forests, train_set, 'movies_codified/')

Now the ***cbrs*** must be fit before we can compute the rank:

*   We could do this in two separate steps, by first calling the `fit(...)` method and then the `rank(...)` method

*   Or by calling directly the `fit_rank(...)` method, which performs both in one step

Since the Random Forests algorithm is expensive to fit (the process could be sped up by limiting the depth with the `max_depth` parameter when instantiating the sklearn classifier), we use the first approach, so that an already fitted cbrs is available for the next steps

In [20]:
cbrs.fit()

INFO - Loading contents from disk...
Fitting algorithm:  100%|██████████| 943/943 [07:08<00:00]


ContentBasedRS(algorithm=ClassifierRecommender, train_set=      user_id item_id  score
0         196     845    4.0
1         196     108    4.0
2         196     382    4.0
3         196     580    2.0
4         196    1007    4.0
...       ...     ...    ...
79614     941     919    5.0
79615     941       1    5.0
79616     941     763    3.0
79617     941     294    4.0
79618     941     273    3.0

[79619 rows x 3 columns], items_directory=movies_codified/, users_directory=None)

Let's now compute the **top-3** items for the *user 8*, *user 2* and *user 1*.
 
*   We have a single test set due to the partitioning technique chosen

In [22]:
test_set = test_list[0]

rank = cbrs.rank(test_set, user_list=['8', '2', '1'], n_recs=3)

INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing rank for user 118:  100%|██████████| 3/3 [00:00<00:00]


Let's print the rank just computed

In [23]:
print(rank)

  user_id item_id     score
0       8     341  0.896552
1       8     222  0.896552
2       8     188  0.868966
3       2     298  0.951724
4       2     294  0.924138
5       2     305  0.896552
6       1      81  0.951724
7       1       1  0.931034
8       1     238  0.910345
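
Conceptually, a top-n rank is obtained by sorting the candidate items of a user by descending score and keeping the first n. The sketch below illustrates this with the three scores printed above for user 1, plus one lower-scored item (`"h1"`) that is hypothetical and added only to show the cut-off. `top_n` is an illustrative helper, not a ClayRS function.

```python
def top_n(scores, n_recs=None):
    """Sort items by descending score; keep the first n_recs (all if None)."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked if n_recs is None else ranked[:n_recs]


# Scores of user 1 from the output above, plus a hypothetical item "h1"
candidate_scores = {"h1": 0.5, "238": 0.910345, "81": 0.951724, "1": 0.931034}

top_3 = top_n(candidate_scores, n_recs=3)
full_rank = top_n(candidate_scores)
```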


Let's now compute the rank for all users in the train set, so that it can be evaluated with some state-of-the-art metrics.

In order to compute a rank for all users, you simply do not specify the *user_list* parameter.

*   We save the result in a list since the `EvalModel` class that we will use in the next step expects a list of ranks/predictions to evaluate, in case multiple splits must be evaluated

***Note:*** by default top-10 recommendations are returned for each user. In order to produce *unbounded ranking*, simply set `n_recs` parameter to `None`

In [24]:
result_list = []

result_rank = cbrs.rank(test_set, n_recs=None)

result_list.append(result_rank)

INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing rank for user 942:  100%|██████████| 943/943 [00:54<00:00]


# Evaluation module

Recommendations can be evaluated using several metrics. In the following example, we use:

*   ***NDCG***
*   ***NDCG@10***
*   ***NDCG@5***
*   ***MRR***

The Evaluation module needs the following parameters:

*   A list of computed rank/predictions (in case multiple splits must be evaluated)
*   A list of truths (in case multiple splits must be evaluated)
*   List of metrics to compute

Obviously the list of computed rank/predictions and list of truths must have the same length, and the rank/prediction in position $i$ will be compared with the truth at position $i$
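
As a reminder of what these metrics measure, the following sketch computes NDCG (optionally cut at k) and the reciprocal rank for a single user. It is a textbook formulation under stated assumptions: linear gain, a log2 discount, and a relevance threshold chosen by hand for the reciprocal rank. The ClayRS metrics may use different gain functions or defaults, and the relevance values are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    if k is not None:
        ranked_relevances, ideal = ranked_relevances[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_relevances, threshold):
    """1 / position of the first relevant item (0.0 if there is none)."""
    for pos, rel in enumerate(ranked_relevances, start=1):
        if rel >= threshold:
            return 1.0 / pos
    return 0.0


# True ratings of the items, listed in the order the system ranked them
ranked = [3.0, 5.0, 1.0]

ndcg_full = ndcg_at_k(ranked)
ndcg_ideal = ndcg_at_k([5.0, 3.0, 1.0])
rr = reciprocal_rank(ranked, threshold=4.0)
```

MRR is the mean of the reciprocal rank over all users. A perfectly ordered ranking gives an NDCG of 1.0, and any misordering lowers it.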



In [25]:
em = eva.EvalModel(
    result_list,
    test_list,
    metric_list=[
        eva.NDCG(),
        eva.NDCGAtK(k=10),
        eva.NDCGAtK(k=5),
        eva.MRR()
    ],
)

The `fit()` method returns two pandas DataFrames: the first contains the metrics aggregated for the system, while the second contains the metrics computed for each user (where possible)

In [26]:
sys_result, users_result = em.fit()

INFO - Performing evaluation on metrics chosen
Performing MRR:  100%|██████████| 4/4 [00:01<00:00]


For the DataFrame which contains system results, the results are also grouped by splits

In [27]:
sys_result

                 NDCG   NDCG@10    NDCG@5       MRR
user_id
sys - fold1  0.923597  0.848631  0.807309  0.755815
sys - mean   0.923597  0.848631  0.807309  0.755815


In [28]:
users_result

             NDCG   NDCG@10    NDCG@5
user_id
1        0.953720  0.907590  0.939843
10       0.948646  0.828650  0.821539
100      0.839224  0.740669  0.615209
101      0.932867  0.871112  0.850939
102      0.898741  0.698623  0.682369
...           ...       ...       ...
95       0.935440  0.785909  0.867832
96       0.977124  0.950314  0.932168
97       0.954598  0.853235  0.849266
98       0.922565  0.922565  0.841413
