# Installation with pip
Every dependency needed by the framework is downloaded and installed automatically.

In [4]:
!pip install clayrs==0.5.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/swapUniba/ClayRS.git@perfect_replicability_vbpr
  Cloning https://github.com/swapUniba/ClayRS.git (to revision perfect_replicability_vbpr) to /tmp/pip-req-build-cl1q0c05
  Running command git clone --filter=blob:none --quiet https://github.com/swapUniba/ClayRS.git /tmp/pip-req-build-cl1q0c05
  Running command git checkout -b perfect_replicability_vbpr --track origin/perfect_replicability_vbpr
  Switched to a new branch 'perfect_replicability_vbpr'
  Branch 'perfect_replicability_vbpr' set up to track remote branch 'perfect_replicability_vbpr' from 'origin'.
  Resolved https://github.com/swapUniba/ClayRS.git to commit 9451fb279b428ce090418b8623e6b28c8d12fde6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting matplotlib~=3.2.2
 

# **! RESTART RUNTIME !**

# Correct order log and prints for IPython
This is necessary only for IPython environments (Colab, Jupyter, etc.), since they can mix up the output order of ```print``` and ```logging```.

```python
# EXAMPLE of the issue
>>> import logging
>>> print("Should go first")
>>> logging.warning("Should go second")
WARNING:root:Should go second
Should go first
```



In [1]:
import functools
print = functools.partial(print, flush=True)

# Import and datasets download

The framework is made of three modules:
> 1.   Content Analyzer
> 2.   Recommender System
> 3.   Evaluation

We import every module as a library and use classes and methods by using the dot notation:

In [2]:
from clayrs import content_analyzer as ca
from clayrs import recsys as rs
from clayrs import evaluation as eva

# Usage:
# ...
# ca.Ratings()
# rs.ContentBasedRS()
# eva.EvalModel()
# ...

We use **MovieLens 100k** as the dataset, with item information enriched with data from IMDb

In [3]:
# Dataset: Movielens-100k

# download items_info
!wget https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/items_info.json

# download users_info
!wget https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/users_info.csv

# download ratings
!wget https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/ratings.csv

--2023-02-28 23:03:07--  https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/items_info.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2222967 (2.1M) [text/plain]
Saving to: ‘items_info.json’


2023-02-28 23:03:08 (28.2 MB/s) - ‘items_info.json’ saved [2222967/2222967]

--2023-02-28 23:03:08--  https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/users_info.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22667 (22K) [text/plain]
Saving to: ‘users_info.csv’


2023-02-28 23:03:0

### Check items file
In this example, the file containing item information is a JSON file in which every entry corresponds to a movie.

Each movie comes with several pieces of information, such as *genres, directors, cast, etc.*
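
As a minimal sketch of the structure just described, the snippet below parses a shortened inline copy of the first entry with the standard `json` module. The `sample` string and the variable names are introduced here for illustration only, and only a few of the fields of the real entry are kept. With the downloaded file, `json.load` on the opened `items_info.json` should return a list of dictionaries of the same shape.

```python
import json

# Shortened inline copy of the first entry of items_info.json
# (illustrative: the real entry has more fields, such as plot and cast)
sample = """
[
    {
        "movielens_id": "1",
        "imdb_id": "0114709",
        "title": "Toy Story",
        "genres": "Animation, Adventure, Comedy, Family, Fantasy",
        "year": "1995",
        "directors": "John Lasseter"
    }
]
"""

movies = json.loads(sample)
first_movie = movies[0]

# Multi-valued fields are stored as comma-separated strings
genres = [g.strip() for g in first_movie["genres"].split(",")]

print(first_movie["title"], genres)
```

Note that every value, including the year and the ids, is stored as a string.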

In [4]:
with open("items_info.json", "r") as f:
  # 25 lines but in these 25 lines there are only 2 entries:
  # 'Toy Story', and 'Golden Eye'
  for _ in range(25):
    print(f.readline(), end='')


[
    {
        "movielens_id": "1",
        "imdb_id": "0114709",
        "title": "Toy Story",
        "plot": "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
        "genres": "Animation, Adventure, Comedy, Family, Fantasy",
        "year": "1995",
        "rating": "8.3",
        "directors": "John Lasseter",
        "cast": "Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn, John Ratzenberger, Annie Potts, John Morris, Erik von Detten, Laurie Metcalf, R. Lee Ermey, Sarah Rayne, Penn Jillette, Jack Angel, Spencer Aste, Greg Berg, Lisa Bradley, Kendall Cunningham, Debi Derryberry, Cody Dorkin, Bill Farmer, Craig Good, Gregory Grudt, Danielle Judovits, Sam Lasseter, Brittany Levenbrown, Sherry Lynn, Scott McAfee, Mickie McGowan, Ryan O'Donohue, Jeff Pidgeon, Patrick Pinney, Phil Proctor, Jan Rabson, Joe Ranft, Andrew Stanton, Shane Sweet, Wayne Allwine, Tony Anselmo, Jonathan Benair, Anthony Burch, 

### Check users file
In this example, the file containing user information is a CSV file where the first column is the *user id*, while the other columns are side information for that user (*age, gender, occupation, zip code*).

In [5]:
with open("users_info.csv", "r") as f:

  # print the header and the first 2 entries
  for _ in range(3):
    print(f.readline(), end='')

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043


<a name="cell-id"></a>
### Check ratings
In this example, the file containing the interactions between the users and the movies is a CSV file, where every interaction is a rating on the **[1, 5]** Likert scale.

In [6]:
import pandas as pd

pd.read_csv('ratings.csv')

       user_id  item_id  rating  timestamp
0          196      242       3  881250949
1          186      302       3  891717742
2           22      377       1  878887116
3          244       51       2  880606923
4          166      346       1  886397596
...        ...      ...     ...        ...
99995      880      476       3  880175444
99996      716      204       5  879795543
99997      276     1090       1  874795795
99998       13      225       2  882399156


# Content Analyzer: representation of Items
In order to define the *item representation*, the following parameters should be defined:
*   ***source***: the path of the file containing items info
*   ***id***: the field that uniquely identifies an item
*   ***output_directory***: the path where serialized representations are saved



In [7]:
# Configuration of item representation 
movies_ca_config = ca.ItemAnalyzerConfig(
    source=ca.JSONFile('items_info.json'),
    id='movielens_id',
    output_directory='movies_codified/',
)

<a name="ca_id"></a>
Each item can be represented using a set of fields.
Every field can be **represented** using several techniques, such as *'tfidf'*, *'entity linking'*, *'embeddings'*, etc.

It is possible to process the content of each field using a **Natural Language Processing (NLP) pipeline**.  
It is also possible to assign a **custom id** for each generated representation, in order to allow a simpler reference in the recommendation phase. Both NLP pipeline and custom id are optional parameters.

> In the following example, we process the *'plot'* field by performing **lemmatization** and **stopwords removal**, and we represent it using **tfidf**.
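
To give an intuition of what a tf-idf representation contains, here is a minimal sketch of the textbook formula (term frequency times the logarithm of the inverse document frequency) on a toy corpus. The tokens and the `tfidf` helper are hypothetical. This is not what `ca.SkLearnTfIdf` computes internally: scikit-learn's vectorizer, for instance, applies idf smoothing and vector normalization by default, so the actual values will differ.

```python
import math
from collections import Counter

# Toy corpus, already lower-cased and tokenized (hypothetical plots)
docs = [
    ["cowboy", "doll", "toy"],
    ["spy", "secret", "weapon"],
    ["toy", "spaceman", "toy"],
]


def tfidf(doc, corpus):
    """Textbook tf-idf: term frequency times log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc)
        df = sum(1 for d in corpus if term in d)
        weights[term] = tf * math.log(n_docs / df)
    return weights


vectors = [tfidf(doc, docs) for doc in docs]

# 'cowboy' appears in one document only, so it weighs more than 'toy'
print(vectors[0])
```

Terms that are rare across the corpus get a higher weight than terms shared by many documents, which is what makes the representation useful for comparing plots.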

In [8]:
movies_ca_config.add_single_config(
    'plot',
    ca.FieldConfig(ca.SkLearnTfIdf(),
                   preprocessing=ca.NLTK(stopwords_removal=True, lemmatization=True),
                   id='tfidf')  # Custom id
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


At the end of the configuration step, we provide the configuration to the *'Content Analyzer'* and call the `fit()` method:

*   The Content Analyzer will **represent** and **serialize** every item.



In [9]:
ca.ContentAnalyzer(config=movies_ca_config).fit()

INFO - ***********   Processing field: plot   ***********
INFO - Computing tf-idf with SkLearnTfIdf
Serializing contents:  100%|██████████| 1682/1682 [00:07<00:00]


# [Optional] Content Analyzer: representation of Users
To define the *'user representation'*, we could follow the same process used for the *'item representation'*. In this case we do not need a complex representation of users, so this step is completely optional.

In this example, the ID for users is the column `user_id`.

In [10]:
# Configuration of user representation
users_ca_config = ca.UserAnalyzerConfig(
    ca.CSVFile('users_info.csv'),
    id='user_id',
    output_directory='users_codified/',
)

# Since no complex representation for users is needed, the fit() method is called immediately
ca.ContentAnalyzer(config=users_ca_config).fit()

Serializing contents:  100%|██████████| 943/943 [00:02<00:00]


# Recommender System: centroid vector algorithm

The Recommender System module needs information about users, items and ratings. 

The **Ratings** class allows you to import ratings from a source file (or from an existing DataFrame) into a custom object. **If** the source file contains users (U), items (I) and ratings (R) in this order, no additional parameters are needed; **otherwise** the mapping must be explicitly specified using:

*   **'user_id'** column,
*   **'item_id'** column,
*   **'score'** column
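
The idea of mapping columns either by position or by name can be sketched with the standard `csv` module. The `load_ratings` helper below is hypothetical and only mimics the behaviour described above; it is not how `ca.Ratings` is implemented. The inline rows are the first three interactions of `ratings.csv` shown earlier.

```python
import csv
import io

# First rows of ratings.csv as shown above (inline copy for illustration)
raw = """user_id,item_id,rating,timestamp
196,242,3,881250949
186,302,3,891717742
22,377,1,878887116
"""


def load_ratings(text, user_col=0, item_col=1, score_col=2):
    """Hypothetical helper: columns may be given by position or by name."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)

    def resolve(col):
        # A string is looked up in the header, an int is used as-is
        return header.index(col) if isinstance(col, str) else col

    u, i, s = resolve(user_col), resolve(item_col), resolve(score_col)
    return [(row[u], row[i], float(row[s])) for row in reader]


by_position = load_ratings(raw)
by_name = load_ratings(raw, "user_id", "item_id", "rating")

print(by_position[0])
```

Since the file stores users, items and ratings in the first three columns, the default positional mapping and the explicit name mapping give the same result.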





In [11]:
ratings = ca.Ratings(ca.CSVFile('ratings.csv'))

print(ratings)

Importing ratings:  100%|██████████| 100000/100000 [00:01<00:00]


      user_id item_id  score
0         196     242    3.0
1         186     302    3.0
2          22     377    1.0
3         244      51    2.0
4         166     346    1.0
...       ...     ...    ...
99995     880     476    3.0
99996     716     204    5.0
99997     276    1090    1.0
99998      13     225    2.0
99999      12     203    3.0

[100000 rows x 3 columns]


In [12]:
# (mapping by index) EQUIVALENT:
#
# ratings = ca.Ratings(
#     ca.CSVFile('ratings.csv'),
#     user_id_column=0,
#     item_id_column=1,
#     score_column=2
# )

In [13]:
# (mapping by column name) EQUIVALENT:

# ratings = ca.Ratings(
#     ca.CSVFile('ratings.csv'),
#     user_id_column='user_id',
#     item_id_column='item_id',
#     score_column='rating'
# )

The Recommender System also needs an algorithm to rank items or predict ratings for users. In the following example we use the **CentroidVector** algorithm:

*   It computes the centroid vector of the features of items *liked by the user*
*   It computes the similarity between the centroid vector and unrated items

The items liked by a user are those with a rating greater than or equal to a specific **threshold**. If the threshold is not specified, the average of all ratings given by the user is used.
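
The two steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the ClayRS implementation: the item vectors are hypothetical three-dimensional stand-ins for tf-idf features, and `centroid_rank` is a name introduced here.

```python
import math

# Toy item vectors (hypothetical values standing in for tf-idf features)
item_vectors = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.8, 0.2, 0.0],
    "C": [0.0, 1.0, 0.0],
    "D": [0.9, 0.1, 0.0],
    "E": [0.0, 0.9, 0.1],
}
# Ratings given by a single user; items D and E are unrated
user_ratings = {"A": 5.0, "B": 4.0, "C": 1.0}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid_rank(ratings, vectors, threshold=None):
    # Default threshold: mean of the ratings given by the user
    if threshold is None:
        threshold = sum(ratings.values()) / len(ratings)
    liked = [vectors[i] for i, r in ratings.items() if r >= threshold]
    # Centroid: component-wise mean of the liked item vectors
    centroid = [sum(col) / len(liked) for col in zip(*liked)]
    unrated = [i for i in vectors if i not in ratings]
    scored = [(i, cosine(centroid, vectors[i])) for i in unrated]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = centroid_rank(user_ratings, item_vectors)
print(ranking)
```

Here the user's mean rating is about 3.33, so A and B count as liked. Item D is close to their centroid and is ranked above item E.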

The Recommender System leverages the representations defined by the Content Analyzer. In the current example, we use the representation of the field 'plot'. Multiple representations can be used for a single field.


```python
# Example with multiple representations for a single field
{
  'plot': ['tfidf', 'word_embedding'],
  'genre': 'doc_embedding',
  ...
}
```

Representations can be referenced using the **external id** (if specified, see [here](#ca_id)) or the **internal id**:


```
For the field 'plot':
First representation created -> internal_id = 0
Second representation created -> internal_id = 1
...
Nth representation created -> internal_id = n-1
```

In [14]:
centroid_vec = rs.CentroidVector(
    {'plot': 'tfidf'},  # EQUIVALENT TO {'plot': 0}
    similarity=rs.CosineSimilarity()
)

# no threshold parameter specified, the average rating given by
# the user will be used

Before we can instantiate the recommender system, we need to split the dataset: let's perform a **KFold with 2 splits**

*   The output of the partitioning module is a pair of lists: one containing the train sets (two in this case), the other containing the test sets (also two in this case)
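
The shape of this output can be illustrated with a small k-fold sketch over the interactions of a single hypothetical user. The `k_fold` helper and the interleaved assignment of items to folds are assumptions made for this example; the source does not state how ClayRS assigns interactions to folds or whether it shuffles them.

```python
def k_fold(items, n_splits=2):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test


# Hypothetical interactions of a single user: (item_id, score)
interactions = [("242", 3.0), ("302", 3.0), ("377", 1.0), ("51", 2.0)]

train_folds, test_folds = [], []
for train, test in k_fold(interactions, n_splits=2):
    train_folds.append(train)
    test_folds.append(test)

print(len(train_folds), len(test_folds))
```

The train set at position `i` pairs with the test set at the same position, and the two never share an interaction.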





In [15]:
train_list, test_list = rs.KFoldPartitioning(n_splits=2).split_all(ratings)

Performing KFoldPartitioning:  100%|██████████| 943/943 [00:00<00:00]


The Recommender System needs the following parameters: the recommendation algorithm, the train set, and the path of the items serialized by the Content Analyzer:

*   For the moment, let's use the first train set



In [16]:
first_train = train_list[0]

cbrs = rs.ContentBasedRS(centroid_vec, first_train, 'movies_codified/')

Now the ***cbrs*** must be fit before we can compute the rank:

*   We could do this in two separate steps, by first calling the `fit(..)` method and then the `rank(...)` method 

*   Or by calling directly the `fit_rank(...)` method, which performs both in one step

We use the second approach and we compute the **top-3** items for the *user 8*, *user 2* and *user 1*.

*   The first test set produced by the split is used



In [17]:
first_test_set = test_list[0]

rank = cbrs.fit_rank(first_test_set, user_list=['8', '2', '1'], n_recs=3)

INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 118:  100%|██████████| 3/3 [00:00<00:00]


Let's print the rank just computed

In [18]:
print(rank)

  user_id item_id     score
0       8     228  0.136628
1       8      50  0.094198
2       8     511  0.088464
3       2     287  0.087853
4       2     294  0.077862
5       2     288  0.054922
6       1       2  0.132588
7       1      24  0.124579
8       1      88  0.104763


Let's now compute the rank for all users of our train set, using both train sets and both test sets obtained with the KFold technique

*   We will save the two computed ranks in a list, and we will evaluate them in the next step

To compute a rank for all users, simply omit the *user_list* parameter

***Note:*** by default, the top-10 recommendations are returned for each user. To produce an *unbounded ranking*, simply set the `n_recs` parameter to `None`

In [19]:
result_list = []

for train_set, test_set in zip(train_list, test_list):
  
  cbrs = rs.ContentBasedRS(centroid_vec, train_set, 'movies_codified/')
  rank_to_append = cbrs.fit_rank(test_set, n_recs=None)

  result_list.append(rank_to_append)

INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 942:  100%|██████████| 943/943 [00:39<00:00]
INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 942:  100%|██████████| 943/943 [00:38<00:00]


# Evaluation module

Recommendations can be evaluated using several metrics. In the following example, we use:

*   ***Precision***
*   ***Recall***
*   ***F1 - computed using macro average***
*   ***F1 - computed using micro average***
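
The difference between macro and micro averaging can be sketched with two hypothetical users. The definitions below are the common textbook ones (macro: average of the per-user F1; micro: F1 computed on counts pooled over all users). ClayRS's exact computation, including how it decides which test items count as relevant, may differ and is not covered here.

```python
def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical data: user -> (recommended items, relevant items)
users = {
    "u1": (["a", "b", "c", "d"], ["a"]),
    "u2": (["e", "f"], ["e", "f"]),
}

per_user_f1 = []
tot_hits = tot_recommended = tot_relevant = 0

for recommended, relevant in users.values():
    hits = len(set(recommended) & set(relevant))
    per_user_f1.append(f1(hits / len(recommended), hits / len(relevant)))
    # Pool the raw counts for the micro average
    tot_hits += hits
    tot_recommended += len(recommended)
    tot_relevant += len(relevant)

# Macro: every user weighs the same
macro_f1 = sum(per_user_f1) / len(per_user_f1)
# Micro: every recommendation weighs the same
micro_f1 = f1(tot_hits / tot_recommended, tot_hits / tot_relevant)

print(macro_f1, micro_f1)
```

In this toy case the macro F1 is 0.7 while the micro F1 is about 0.667, because the user with the longer recommendation list contributes more to the pooled counts.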

The Evaluation module needs the following parameters:

*   A list of computed ranks/predictions (in case multiple splits must be evaluated)
*   A list of truths (in case multiple splits must be evaluated)
*   A list of metrics to compute

The list of computed ranks/predictions and the list of truths must have the same length: the rank/prediction at position $i$ is compared with the truth at position $i$

In [20]:
em = eva.EvalModel(
    result_list,
    test_list,
    metric_list=[
        eva.Precision(),  # by default sys_average='macro'
        eva.Recall(),     # by default sys_average='macro'
        eva.FMeasure(sys_average='macro'),
        eva.FMeasure(sys_average='micro')
    ]
)

The `fit()` method returns two pandas DataFrames: the first one contains the metrics aggregated for the system, while the second contains the metrics computed for each user (where possible)

In [21]:
sys_result, users_result = em.fit()

INFO - Performing evaluation on metrics chosen
Performing F1 - micro:  100%|██████████| 4/4 [00:02<00:00]


In the DataFrame containing the system results, the results are also grouped by split

In [22]:
sys_result

             Precision - macro  Recall - macro  F1 - macro  F1 - micro
user_id
sys - fold1           0.550551             1.0    0.700224    0.701976
sys - fold2           0.553211             1.0    0.702997    0.705667
sys - mean            0.551881             1.0    0.701611    0.703822
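
The `sys - mean` row appears to be the arithmetic mean of the per-fold values. A quick check on the *Precision - macro* column, using the two fold values reported above (this is a reading of the numbers, not a statement about the library internals):

```python
# Precision - macro values of the two folds, copied from the table above
fold_precision = [0.550551, 0.553211]

mean_precision = sum(fold_precision) / len(fold_precision)

# Rounded to the six decimals shown in the table
print(round(mean_precision, 6))
```

The result agrees with the 0.551881 reported in the `sys - mean` row.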


In [23]:
users_result

         Precision - macro  Recall - macro  F1 - macro  F1 - micro
user_id
1                 0.599265             1.0    0.749260    0.749260
10                0.293478             1.0    0.451047    0.451047
100               0.628161             1.0    0.769865    0.769865
101               0.672014             1.0    0.803571    0.803571
102               0.560185             1.0    0.717819    0.717819
...                    ...             ...         ...         ...
95                0.514388             1.0    0.678970    0.678970
96                0.678571             1.0    0.806349    0.806349
97                0.539315             1.0    0.700426    0.700426
98                0.480769             1.0    0.649123    0.649123
