# Installation with pip
Every dependency needed by the framework will be downloaded and installed automatically

In [1]:
! pip install clayrs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/swapUniba/clayrs.git
  Cloning https://github.com/swapUniba/clayrs.git to /tmp/pip-req-build-5teqbonp
  Running command git clone -q https://github.com/swapUniba/clayrs.git /tmp/pip-req-build-5teqbonp
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting ekphrasis@ git+https://github.com/fucaja/ekphrasis.git#egg=ekphrasis
  Cloning https://github.com/fucaja/ekphrasis.git to /tmp/pip-install-giz662q5/ekphrasis_160d1dddf52e4b67a944f4c70aa41f1c
  Running command git clone -q https://github.com/fucaja/ekphrasis.git /tmp/pip-install-giz662q5/ekphrasis_160d1dddf52e4b67a944f4c70aa41f1c
Collecting wn~=0.0.23
  Downloading wn-0.0.23.tar.gz (31.6 MB)
     |████████████████████████████████| 31.6 MB 1.2 MB/s 
Collecting mysql~=0.0.3
  Downlo

# **! RESTART RUNTIME !**

# Correct order log and prints for IPython
This is necessary only in IPython environments (Colab, Jupyter, etc.), since they can display the output of ```print``` and ```logging``` in the wrong order

```python
# EXAMPLE of the issue
>>> import logging
>>> print("Should go first")
>>> logging.warning("Should go second")
WARNING:root:Should go second
Should go first
```



In [1]:
import functools
print = functools.partial(print, flush=True)
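
Another way to keep the two outputs in order is to send log records to the same stream that `print` writes to. The following is a minimal sketch using only the standard library; it is not part of the framework, and the logger name `order_demo` and the helper `emit_in_order` are hypothetical names chosen for illustration.

```python
import io
import logging
import sys


def emit_in_order(stream=sys.stdout):
    """Write a print line and a log record to the same stream, in call order."""
    logger = logging.getLogger("order_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False  # keep records away from the root logger's handlers
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
    logger.addHandler(handler)

    print("Should go first", file=stream, flush=True)
    logger.warning("Should go second")


# Capture the output in memory so the order can be inspected
buffer = io.StringIO()
emit_in_order(buffer)
captured_lines = buffer.getvalue().splitlines()
```

Since both writers share one stream, the lines are stored in the order in which they were emitted.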

# Import and datasets download

The framework is made of three modules:
> 1.   Content Analyzer
> 2.   Recommender System
> 3.   Evaluation

We import every module as a library and use classes and methods by using the dot notation:

In [2]:
from clayrs import content_analyzer as ca
from clayrs import recsys as rs
from clayrs import evaluation as eva

# Usage:
# ...
# ca.Ratings()
# rs.ContentBasedRS()
# eva.EvalModel()
# ...

We use **MovieLens 100k** as the dataset, with item information expanded using data from IMDb

In [3]:
# Dataset: Movielens-100k

# download items_info
!wget https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/items_info.json

# download users_info
!wget https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/users_info.csv

# download ratings
!wget https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/ratings.csv

--2022-07-01 11:20:11--  https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/items_info.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2222967 (2.1M) [text/plain]
Saving to: ‘items_info.json’


2022-07-01 11:20:12 (33.2 MB/s) - ‘items_info.json’ saved [2222967/2222967]

--2022-07-01 11:20:12--  https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/users_info.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22667 (22K) [text/plain]
Saving to: ‘users_info.csv’


2022-07-01 11:20:1

### Check items file
In this example, the file containing items info is a JSON where every entry corresponds to a movie.

For every movie there is various information, such as *genres, directors, cast, etc.*

In [4]:
with open("items_info.json", "r") as f:
  # Print the first 25 lines: they cover only the first 2 entries,
  # 'Toy Story' and 'Golden Eye'
  for _ in range(25):
    print(f.readline(), end='')


[
    {
        "movielens_id": "1",
        "imdb_id": "0114709",
        "title": "Toy Story",
        "plot": "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
        "genres": "Animation, Adventure, Comedy, Family, Fantasy",
        "year": "1995",
        "rating": "8.3",
        "directors": "John Lasseter",
        "cast": "Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn, John Ratzenberger, Annie Potts, John Morris, Erik von Detten, Laurie Metcalf, R. Lee Ermey, Sarah Rayne, Penn Jillette, Jack Angel, Spencer Aste, Greg Berg, Lisa Bradley, Kendall Cunningham, Debi Derryberry, Cody Dorkin, Bill Farmer, Craig Good, Gregory Grudt, Danielle Judovits, Sam Lasseter, Brittany Levenbrown, Sherry Lynn, Scott McAfee, Mickie McGowan, Ryan O'Donohue, Jeff Pidgeon, Patrick Pinney, Phil Proctor, Jan Rabson, Joe Ranft, Andrew Stanton, Shane Sweet, Wayne Allwine, Tony Anselmo, Jonathan Benair, Anthony Burch, 

### Check users file
In this example, the file containing users info is a CSV file where the first column is the *user id*, while the other columns are side information for that user (*age, gender, occupation, zip code*)

In [5]:
with open("users_info.csv", "r") as f:

  # print the header and the first 2 entries
  for _ in range(3):
    print(f.readline(), end='')

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
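
If you prefer to inspect the columns programmatically, the standard `csv` module is enough. This is a small sketch that parses the three lines printed above from an in-memory string (the variable names are illustrative), so it does not depend on the downloaded file.

```python
import csv
import io

# First lines of users_info.csv, copied from the output above
sample = """user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
"""

# DictReader uses the header row as keys; every value is read as a string
users = list(csv.DictReader(io.StringIO(sample)))

# Every column except the id is side information
side_info_columns = [name for name in users[0] if name != "user_id"]
```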


<a name="cell-id"></a>
### Check ratings
In this example, the file containing the interactions between the users and the movies is a CSV, where every interaction is a rating on the **[1, 5]** Likert scale

In [6]:
import pandas as pd

pd.read_csv('ratings.csv')

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156


# Content Analyzer: representation of Items and export to json
In order to define the *item representation*, the following parameters should be defined:
*   ***source***: the path of the file containing items info
*   ***id***: the field that uniquely identifies an item
*   ***output_directory***: the path where serialized representations are saved

There is also the optional parameter ***export_json***, which creates a `contents.json` file in the output directory containing the serialization of all the representations of each content.

In [7]:
# Configuration of item representation 
movies_ca_config = ca.ItemAnalyzerConfig(
    source=ca.JSONFile('items_info.json'),
    id='movielens_id',
    output_directory='movies_codified/',
    export_json=True
)

<a name="ca_id"></a>
Each item can be represented using a set of fields.
Every field can be **represented** using several techniques, such as *'tfidf'*, *'entity linking'*, *'embeddings'*, etc.

It is possible to process the content of each field using a **Natural Language Processing (NLP) pipeline**.  
It is also possible to assign a **custom id** for each generated representation, in order to allow a simpler reference in the recommendation phase. Both NLP pipeline and custom id are optional parameters.

> In the following example, we process the *'plot'* field by performing **lemmatization** and **stopwords removal** through [NLTK](https://www.nltk.org/), and we represent it using **tfidf**:

In [8]:
movies_ca_config.add_single_config(
    'plot',
    ca.FieldConfig(ca.SkLearnTfIdf(),
                   preprocessing=ca.NLTK(stopwords_removal=True, lemmatization=True)) 
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.
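
To give an intuition of what a tf-idf representation of the *'plot'* field looks like, here is a toy sketch written with the standard library only. It is not the framework's implementation: the stopword list is a tiny hand-made one (not NLTK's), lemmatization is skipped, and the smoothed idf formula with L2 normalization is an assumption modelled on common defaults.

```python
import math
from collections import Counter

STOPWORDS = {"a", "the", "is", "in", "and", "his", "of"}  # toy list, not NLTK's


def preprocess(text):
    """Lowercase, split on whitespace and drop stopwords (no lemmatization)."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


toy_plots = [
    "a cowboy doll is jealous of a spaceman toy",
    "a spy chases a stolen satellite",
    "the toy is lost in the city",
]
toy_vectors = tfidf(toy_plots)
```

In the first plot, *cowboy* gets a higher weight than *toy*, because *toy* also appears in another plot and is therefore less discriminative.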


In this example, we also add an exogenous representation to each content by extracting the 'year' field from the local source.



In [9]:
movies_ca_config.add_single_exogenous(
    ca.ExogenousConfig(ca.PropertiesFromDataset(field_name_list=['year']))
)

At the end of the configuration step, we provide the configuration to the *'Content Analyzer'* and call the `fit()` method:

*   The Content Analyzer will **represent** and **serialize** every item (and create the json file containing representations for every content).



In [10]:
content_analyzer = ca.ContentAnalyzer(config=movies_ca_config)
content_analyzer.fit()

INFO - Extracting exogenous properties from local dataset (exogenous_properties_retrieval.py:146)
INFO - ***********   Processing field: plot   *********** (content_analyzer_main.py:186)
INFO - Computing tf-idf with SkLearnTfIdf (tf_idf.py:95)
Serializing contents:  100%|██████████| 1682/1682 [00:10<00:00]


This is the JSON file which has been created:

In [11]:
with open('movies_codified/contents.json', 'r') as f:
    for _ in range(11):
      print(f.readline(), end='')

[
    {
        "content_id": "1",
        "Exo#0": "{'year': '1995'}",
        "plot#0": {
            "sparse_tfidf": "[[(0, 1024),0.1719365702112526],\n [(0, 1729),0.31336848582007987],\n [(0, 2160),0.3010691777558571],\n [(0, 2721),0.25782007590969463],\n [(0, 3752),0.2915290932564375],\n [(0, 4769),0.1484604944785959],\n [(0, 5439),0.31336848582007987],\n [(0, 5932),0.2618948840931845],\n [(0, 6417),0.3307033869191101],\n [(0, 6655),0.3307033869191101],\n [(0, 6853),0.25410006749357394],\n [(0, 6909),0.2445599829941543],\n [(0, 6941),0.31336848582007987]]",
            "pos_word_tuples": "[(5932, 'room'), (1024, 'boy'), (6941, 'toy'), (6909, 'top'), (6655, 'supplants'), (2721, 'figure'), (6417, 'spaceman'), (4769, 'new'), (3752, 'jealous'), (6853, 'threaten'), (5439, 'profoundly'), (2160, 'doll'), (1729, 'cowboy')]",
            "len_vocabulary": 7614
        }
    },
    {
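
In the excerpt above, the values of `sparse_tfidf`, `pos_word_tuples` and `Exo#0` are strings that look like Python literals. Assuming this holds for the whole file (only the excerpt was checked), they can be decoded with `ast.literal_eval`. The sketch below uses shortened copies of the strings shown above; the variable names are illustrative.

```python
import ast

# Shortened copies of the strings shown in the output above
sparse_tfidf = ("[[(0, 1024),0.1719365702112526],\n"
                " [(0, 1729),0.31336848582007987],\n"
                " [(0, 6941),0.31336848582007987]]")
pos_word_tuples = "[(1024, 'boy'), (6941, 'toy'), (1729, 'cowboy')]"
exo = "{'year': '1995'}"

# literal_eval parses Python literals without executing arbitrary code
column_to_word = dict(ast.literal_eval(pos_word_tuples))

# Each sparse entry is [(row, column), weight]: map the column back to its word
word_weights = {column_to_word[col]: weight
                for (_, col), weight in ast.literal_eval(sparse_tfidf)}

year = ast.literal_eval(exo)["year"]
```

This recovers a readable `word -> tf-idf weight` mapping for the plot of the first movie.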


# [Optional] Content Analyzer: representation of Users and export to json
In order to define the *'user representation'*, we could follow the same process used for the *'item representation'*. In this case we don't need a complex representation of users, so this step is completely optional

In this example, the ID for users is the column `user_id`.

For users too, it is possible to export the representation to a ***json file***

In [12]:
#Configuration of user representation
users_ca_config = ca.UserAnalyzerConfig(
    ca.CSVFile('users_info.csv'),
    id='user_id',
    output_directory='users_codified/',
    export_json=True
)

We also add an exogenous representation for each user by extracting the 'gender' field from the local source



In [13]:
users_ca_config.add_single_exogenous(
    ca.ExogenousConfig(ca.PropertiesFromDataset(field_name_list=['gender']))
)

ca.ContentAnalyzer(config=users_ca_config).fit()

INFO - Extracting exogenous properties from local dataset (exogenous_properties_retrieval.py:146)
Serializing contents:  100%|██████████| 943/943 [00:04<00:00]


This is the JSON file which has been created:

In [14]:
with open('users_codified/contents.json', 'r') as f:
    for _ in range(9):
      print(f.readline(), end='')

[
    {
        "content_id": "1",
        "Exo#0": "{'gender': 'M'}"
    },
    {
        "content_id": "2",
        "Exo#0": "{'gender': 'F'}"
    },


# Recommender System: centroid vector algorithm and export to csv

The Recommender System module needs information about users, items and ratings. 

The **Ratings** class allows you to import ratings from a source file (or from an existing dataframe) into a custom object. **If** the source file contains users (U), items (I) and ratings (R) in this order, no additional parameters are needed; **otherwise** the mapping must be explicitly specified using:

*   **'user_id'** column,
*   **'item_id'** column,
*   **'score'** column





In [15]:
ratings = ca.Ratings(ca.CSVFile('ratings.csv'))

print(ratings)

Importing ratings:  100%|██████████| 100000/100000 [00:00<00:00]


      user_id item_id  score
0         196     242    3.0
1         196     393    4.0
2         196     381    4.0
3         196     251    3.0
4         196     655    5.0
...       ...     ...    ...
99995     941     919    5.0
99996     941     273    3.0
99997     941       1    5.0
99998     941     294    4.0
99999     941    1007    4.0

[100000 rows x 3 columns]


In [16]:
# (mapping by index) EQUIVALENT:
#
# ratings = ca.Ratings(
#     ca.CSVFile('ratings.csv'),
#     user_id_column=0,
#     item_id_column=1,
#     score_column=2
# )

In [17]:
# (mapping by column name) EQUIVALENT:

# ratings = ca.Ratings(
#     ca.CSVFile('ratings.csv'),
#     user_id_column='user_id',
#     item_id_column='item_id',
#     score_column='rating'
# )

The Recommender System also needs an algorithm for ranking or predicting items to users. In the following example we use the **CentroidVector** algorithm:

*   It computes the centroid vector of the features of items *liked by the user*
*   It computes the similarity between the centroid vector and unrated items

The items liked by a user are those with a rating greater than or equal to a specific **threshold**. If the threshold is not specified, the average score of all items rated by the user is used.
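
The two steps and the default threshold can be sketched in plain Python as follows. This is only an illustration of the idea under simplifying assumptions (dense toy vectors, hypothetical helper names `cosine` and `centroid_rank`), not the code of the **CentroidVector** class.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid_rank(item_vectors, user_ratings, threshold=None):
    """Rank the items not yet rated by the user against the centroid of liked items."""
    if threshold is None:
        # Default assumed here: the average of the scores given by the user
        threshold = mean(user_ratings.values())
    liked = [item_vectors[i] for i, score in user_ratings.items() if score >= threshold]
    if not liked:
        return []
    # Component-wise mean of the liked item vectors
    centroid = [mean(column) for column in zip(*liked)]
    candidates = [i for i in item_vectors if i not in user_ratings]
    scored = [(i, cosine(centroid, item_vectors[i])) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


toy_item_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.8, 0.2, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [0.9, 0.1, 0.0],
    "e": [0.0, 0.9, 0.1],
}
toy_user_ratings = {"a": 5, "b": 4, "c": 1}
toy_ranking = centroid_rank(toy_item_vectors, toy_user_ratings)
```

Here the user's average score is about 3.33, so items `a` and `b` are considered liked; the unrated item `d` is closest to their centroid and ends up first, followed by `e`.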

The Recommender System leverages the representations defined by the Content Analyzer. In the current example, we use the representation of the field 'plot'. Multiple representations can be used for a single field.


```python
# Example with multiple representations for a single field
{
  'plot': ['tfidf', 'word_embedding'],
  'genre': 'doc_embedding',
  ...
}
```

Representations can be referenced using the **external id** (if specified, see [here](#ca_id)) or the **internal id**:


```
For the field 'plot':
First representation created -> internal_id = 0
Second representation created -> internal_id = 1
...
Nth representation created -> internal_id = n-1
```
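
The difference between the two kinds of id can be pictured with a small lookup helper. This is a hypothetical sketch for illustration only: the function `resolve_representation` and the data layout are assumptions, not the way the framework stores representations.

```python
def resolve_representation(representations, ref):
    """Look up a representation by internal id (int) or by custom id (str).

    `representations` is a list of (custom_id_or_None, value) pairs,
    kept in the order in which the representations were created.
    """
    if isinstance(ref, int):
        # Internal id: position in creation order
        return representations[ref][1]
    for custom_id, value in representations:
        if custom_id == ref:
            return value
    raise KeyError(ref)


# Two representations for the 'plot' field: only the first has a custom id
toy_plot_representations = [
    ("tfidf", "<tf-idf vector>"),
    (None, "<embedding vector>"),
]

by_internal_id = resolve_representation(toy_plot_representations, 0)
by_custom_id = resolve_representation(toy_plot_representations, "tfidf")
second = resolve_representation(toy_plot_representations, 1)
```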

In [18]:
centroid_vec = rs.CentroidVector(
    {'plot': 0},  # the first and only representation codified for the 'plot' field
    similarity=rs.CosineSimilarity()
)

# no threshold parameter specified: the average rating given by
# the user will be used

Before we can instantiate the recommender system, we need to split the dataset: let's perform a **KFold with 2 splits**

*   The output of the partitioning module is a pair of lists: one containing the train sets and the other containing the test sets (two of each, in this case)

In [19]:
kf = rs.KFoldPartitioning(n_splits=2)
train_list, test_list = kf.split_all(ratings)

Performing KFoldPartitioning:  100%|██████████| 943/943 [00:00<00:00]
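
The progress bar above counts 943 steps, one per user, which suggests that the interactions are partitioned user by user. Under that assumption, the shape of the output can be sketched as follows; the helper `kfold_per_user` is hypothetical, works on plain tuples and does not shuffle, so it is not a replacement for `KFoldPartitioning`.

```python
from collections import defaultdict


def kfold_per_user(rows, n_splits=2):
    """Split (user, item, score) rows into n_splits train/test pairs, user by user."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train_sets = [[] for _ in range(n_splits)]
    test_sets = [[] for _ in range(n_splits)]
    for user_rows in by_user.values():
        for fold in range(n_splits):
            for position, row in enumerate(user_rows):
                # Every n-th interaction of the user goes to the test set of this fold
                if position % n_splits == fold:
                    test_sets[fold].append(row)
                else:
                    train_sets[fold].append(row)
    return train_sets, test_sets


toy_rows = [
    ("u1", "i1", 3.0), ("u1", "i2", 4.0), ("u1", "i3", 5.0), ("u1", "i4", 2.0),
    ("u2", "i1", 1.0), ("u2", "i5", 4.0),
]
toy_train_sets, toy_test_sets = kfold_per_user(toy_rows, n_splits=2)
```

Each interaction appears in the test set of exactly one fold and in the train set of the other.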


Now the ***cbrs*** must be fit before we can compute the rank:

*   We could do this in two separate steps, by first calling the `fit(..)` method and then the `rank(...)` method 

*   Or by calling directly the `fit_rank(...)` method, which performs both in one step

We use the second approach and compute the rank for all users of our train set: we will use both train sets and both test sets obtained with the KFold technique

In order to compute a rank for all users, you simply do not specify the *user_id_list* parameter

In [20]:
result_list = []

for train_set, test_set in zip(train_list, test_list):
  
  cbrs = rs.ContentBasedRS(centroid_vec, train_set, 'movies_codified/')
  rank_to_append = cbrs.fit_rank(test_set)

  result_list.append(rank_to_append)

INFO - Don't worry if it looks stuck at first (recsys.py:469)
INFO - First iterations will stabilize the estimated remaining time (recsys.py:470)
Computing fit_rank for user 46:  100%|██████████| 943/943 [00:35<00:00]
INFO - Don't worry if it looks stuck at first (recsys.py:469)
INFO - First iterations will stabilize the estimated remaining time (recsys.py:470)
Computing fit_rank for user 46:  100%|██████████| 943/943 [00:33<00:00]


Let's export each ranking generated to a csv file

*   The `rank()` method (and the `fit_rank()` method) returns a **Rank** object, that has a useful exporting method `to_csv()`

We also save the *test set* of each split, since we will need them later in the *EvalModel* part

In [21]:
# we save the result of each split numbered
for i, rank_generated in enumerate(result_list, start=1):
  rank_generated.to_csv(file_name=f'rank_split_{i}')

# we also save the test set of each split, numbered
for i, test_set in enumerate(test_list, start=1):
  test_set.to_csv(file_name=f'truth_split_{i}')

We can import the recommendations we just exported via pandas, or any other library that reads csv files (including the framework itself):

In [22]:
import pandas as pd

rank_split_1 = pd.read_csv('rank_split_1.csv')
rank_split_2 = pd.read_csv('rank_split_2.csv')

print("Result for first split:")
print(rank_split_1)
print("----------------------------------------")
print("Result for second split:")
print(rank_split_2)

Result for first split:
       user_id  item_id     score
0          242     1137  0.026963
1          242     1355  0.021795
2          242      283  0.017970
3          242     1357  0.016461
4          242      305  0.013593
...        ...      ...       ...
50235       46      151  0.009158
50236       46      262  0.008790
50237       46      690  0.008159
50238       46      748  0.007369
50239       46     1062  0.000000

[50240 rows x 3 columns]
----------------------------------------
Result for second split:
       user_id  item_id     score
0          242     1152  0.032814
1          242     1011  0.018523
2          242      331  0.017506
3          242      268  0.015312
4          242      306  0.012405
...        ...      ...       ...
49755       46       50  0.015880
49756       46      286  0.013479
49757       46      100  0.010144
49758       46        7  0.009986
49759       46      125  0.000000

[49760 rows x 3 columns]


# Evaluation module: evaluation of external recommendations

Recommendations can be evaluated with several metrics using the **EvalModel** class of the framework. It can also easily evaluate (multiple) recommendations generated via external tools

The Evaluation module needs the following parameters:

*   A list of computed rank/predictions (in case multiple splits must be evaluated)
*   A list of truths (in case multiple splits must be evaluated)
*   List of metrics to compute

Obviously the list of computed rank/predictions and list of truths must have the same length, and the rank/prediction in position $i$ will be compared with the truth at position $i$

Let's suppose we have recommendations (and related truths) generated via other tools in a csv format. We first import them into the framework and then pass them to the EvalModel class

*   In this case we will use the recommendations generated earlier in the RecSys phase of this colab, but they are simple csv files and they could be the output of any other tool!



In [23]:
print("Importing split 1")
rank_1 = ca.Ratings(ca.CSVFile('rank_split_1.csv'))
truth_1 = ca.Ratings(ca.CSVFile('truth_split_1.csv'))

print("Importing split 2")
rank_2 = ca.Ratings(ca.CSVFile('rank_split_2.csv'))
truth_2 = ca.Ratings(ca.CSVFile('truth_split_2.csv'))

# since multiple splits, we wrap ranks and truths in lists
imported_ranks = [rank_1, rank_2]
imported_truths = [truth_1, truth_2]

Importing split 1


Importing ratings:  100%|██████████| 50240/50240 [00:00<00:00]
Importing ratings:  100%|██████████| 50240/50240 [00:00<00:00]

Importing split 2



Importing ratings:  100%|██████████| 49760/49760 [00:00<00:00]
Importing ratings:  100%|██████████| 49760/49760 [00:00<00:00]


We are ready to instantiate the EvalModel class

*   We also need to define a metric list: suppose we want to compute ***Pearson Correlation***, ***MRR*** and ***NDCG***



In [24]:
em = eva.EvalModel(
    imported_ranks,
    imported_truths,
    metric_list=[
        eva.Correlation('pearson'),
        eva.MRR(),
        eva.NDCG()
    ]
)
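
To make two of these metrics concrete, here is a plain-Python sketch of the reciprocal rank (averaged over users it gives MRR) and of NDCG for a single user. The definitions are common textbook ones and are assumptions here: linear gain, log2 discount and a relevance threshold of 4 may differ from the choices made by `eva.MRR()` and `eva.NDCG()`.

```python
import math


def reciprocal_rank(ranked_items, relevant_items):
    """1 / position of the first relevant item, 0 if none is relevant."""
    for position, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / position
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2 discount."""
    return sum(gain / math.log2(position + 1)
               for position, gain in enumerate(gains, start=1))


def ndcg(ranked_items, true_scores):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [true_scores.get(item, 0.0) for item in ranked_items]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


toy_truth = {"x": 2.0, "y": 5.0, "z": 4.0}   # true scores of one user
toy_rank = ["x", "y", "z"]                   # ranking produced by a recommender
toy_relevant = {item for item, score in toy_truth.items() if score >= 4.0}

rr = reciprocal_rank(toy_rank, toy_relevant)
ndcg_value = ndcg(toy_rank, toy_truth)
ndcg_ideal = ndcg(["y", "z", "x"], toy_truth)
```

The first relevant item is in second position, so the reciprocal rank is 0.5; the ranking is not ideal, so its NDCG is below 1, while the ideally sorted ranking scores exactly 1.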

The `fit()` method returns two pandas DataFrames: the first one contains the metrics aggregated for the system, while the second contains the metrics computed for each user (where possible)

In [25]:
sys_result, users_result = em.fit()

INFO - Performing evaluation on metrics chosen (eval_model.py:133)
Performing NDCG:  100%|██████████| 3/3 [00:03<00:00]


For the DataFrame which contains system results, the results are also grouped by splits

In [26]:
sys_result

user_id,pearson,MRR,NDCG
sys - fold1,0.049921,0.767457,0.923095
sys - fold2,0.033437,0.768762,0.922108
sys - mean,0.041679,0.76811,0.922601
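
As a quick sanity check, the `sys - mean` row printed above is consistent with a plain average of the two fold rows, up to the rounding of the displayed values. The sketch below recomputes it from the numbers in the table; that the framework aggregates exactly this way is an assumption.

```python
from statistics import mean

# Values copied from the table above
fold_results = {
    "pearson": (0.049921, 0.033437),
    "MRR": (0.767457, 0.768762),
    "NDCG": (0.923095, 0.922108),
}
reported_mean = {"pearson": 0.041679, "MRR": 0.76811, "NDCG": 0.922601}

# Average of the two folds for every metric
recomputed = {metric: mean(values) for metric, values in fold_results.items()}

# Largest difference between the recomputed and the reported means
max_gap = max(abs(recomputed[m] - reported_mean[m]) for m in reported_mean)
```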


In [27]:
users_result

user_id,pearson,NDCG
1,-0.039960,0.921503
10,0.126505,0.968621
100,-0.044637,0.889017
101,0.221615,0.914853
102,-0.005116,0.920276
...,...,...
95,0.082709,0.919818
96,0.169951,0.955528
97,0.144300,0.950852
98,0.260989,0.912445


# Other modules: export split dataset and methodology module

*    During the recommendation phase we have already seen how to split the dataset. We will now see a further example of how to split the dataset by considering only some users, and how to export the split dataset.
*    An additional parameter may be set in the recommendation phase, the `methodology` parameter: it lets you choose which items need to be predicted using a specific methodology. We'll see how to use it manually and how to export its output.

## Split dataset

In this example we split with a KFold all the users having an **average score >= 3.5**:




In [28]:
import statistics

# valid_users will contain all users whose average given score is >= 3.5
valid_users = set()

all_users = set(ratings.user_id_column)
for u in all_users:
  user_ratings = ratings.get_user_interactions(u)
  all_users_scores = [user_interaction.score for user_interaction in user_ratings]
  if statistics.mean(all_users_scores) >= 3.5:
    valid_users.add(u)


# Print the length of the two sets to check that they are different
print(f"n_valid_users = {len(valid_users)}")
print(f"n_all_users = {len(all_users)}")

n_valid_users = 575
n_all_users = 943


In [29]:
part_technique = rs.KFoldPartitioning(n_splits=3)

train_list, test_list = part_technique.split_all(ratings,
                                                 user_id_list=valid_users)

Performing KFoldPartitioning:  100%|██████████| 575/575 [00:00<00:00]


We can check the training and test set of the first split:  


In [30]:
first_train = train_list[0]
first_test = test_list[0]

print("First split - train set:\n")
print(first_train)
print("--------------------------------")
print("\nFirst split - test set:\n")
print(first_test)

First split - train set:

      user_id item_id  score
0         242    1137    5.0
1         242    1152    5.0
2         242     305    5.0
3         242     291    3.0
4         242    1357    5.0
...       ...     ...    ...
36817      46     125    4.0
36818      46     245    3.0
36819      46     332    4.0
36820      46     748    5.0
36821      46     262    5.0

[36822 rows x 3 columns]
--------------------------------

First split - test set:

      user_id item_id  score
0         242     934    5.0
1         242     361    5.0
2         242     306    5.0
3         242       1    4.0
4         242     111    4.0
...       ...     ...    ...
18680      46     100    4.0
18681      46     305    5.0
18682      46     300    3.0
18683      46      93    4.0
18684      46     286    5.0

[18685 rows x 3 columns]


Each object in the two lists is a **Ratings** object, which has a useful exporting method `to_csv()`

In [31]:
first_train.to_csv(file_name='exported_split_train')
first_test.to_csv(file_name='exported_split_test')

## Methodology module

During the recommendation phase, it is also possible to specify an optional parameter in the `rank()` or `fit_rank()` method, called ***methodology***, for choosing which items must be ranked.
For each target user **u**, the following 4 methodologies are available for defining the list of items to rank:

1.   **TestRatings** (default): the list of items to be evaluated consists of items rated by u in the test set
2.   **TestItems**: every item in the test set of every user will be predicted, except those in the training set of the target user
3.   **TrainingItems**: every item in the training set of every user will be predicted except those in the training set of the target user
4.   **AllItems**: the whole set of items, except those in the training set of the target user, will be predicted

More information in [this paper](https://repositorio.uam.es/bitstream/handle/10486/665121/precision-oriented_bellogin_recsys_2011_ps.pdf;jsessionid=85982302D4DA9FF4DD7F21E4AC4F3391?sequence=1).
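
The four methodologies can be summarized with set operations. The following sketch is a simplified reading of the definitions above, not the framework's code: the function `items_to_rank` is hypothetical, train and test sets are plain `user -> set of items` dictionaries, and "the whole set of items" is approximated with every item appearing in the train or test sets.

```python
def items_to_rank(methodology, user, train, test):
    """Return the set of items to rank for `user` under the given methodology."""
    seen_in_train = train.get(user, set())
    if methodology == "TestRatings":
        # Only the items the target user rated in the test set
        return set(test.get(user, set()))
    if methodology == "TestItems":
        pool = set().union(*test.values())
    elif methodology == "TrainingItems":
        pool = set().union(*train.values())
    elif methodology == "AllItems":
        # Assumption: the catalog is every item seen in train or test
        pool = set().union(*train.values(), *test.values())
    else:
        raise ValueError(f"Unknown methodology: {methodology}")
    # Items already in the training set of the target user are excluded
    return pool - seen_in_train


toy_train = {"u1": {1, 2}, "u2": {3}}
toy_test = {"u1": {4}, "u2": {2, 5}}

toy_candidates = {
    name: items_to_rank(name, "u1", toy_train, toy_test)
    for name in ("TestRatings", "TestItems", "TrainingItems", "AllItems")
}
```

For user `u1`, the candidate sets grow from `{4}` with **TestRatings** to `{3, 4, 5}` with **AllItems**.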


If you have multiple splits you need to iterate over your train set and test set, since the `filter_all()` method of each methodology works on a single pair of train set and test set



In [32]:
test_r = rs.TestRatingsMethodology().filter_all(first_train, first_test)

train_i = rs.TrainingItemsMethodology().filter_all(first_train, first_test)

Filtering items based on TestRatingsMethodology:  100%|██████████| 575/575 [00:00<00:00]
Filtering items based on TrainingItemsMethodology:  100%|██████████| 575/575 [00:00<00:00]


The output of the `filter_all()` method is a pandas DataFrame for the given pair of train set and test set, so it can easily be exported to .csv, .tsv, etc.

Let's check the items to predict with the test ratings methodology for the first split:

In [33]:
print(test_r)

      user_id item_id
0         242     111
1         242     934
2         242    1355
3         242       1
4         242     306
...       ...     ...
18680      46     127
18681      46       7
18682      46     288
18683      46     300
18684      46     286

[18685 rows x 2 columns]


## Report module

Via the `Report` class, you can generate several ***yml*** files containing all the parameters passed to key classes and functions during your experiment.

The mentioned class is very flexible and it is able to document various parts of the experiment, based on the module you use:

* You can document how you preprocessed items and complexly represented them via the *Content Analyzer*
* Or maybe the experimental setup of your recommendation pipeline by passing key objects of it to the Report class
* And lastly how you evaluated your recommender systems and on which metrics

The aim is to give you all the tools to ***reproduce*** the experiment also in a different experimental setup

First we import the utils package which contains the `Report` class

In [34]:
from clayrs import utils as ut

Then you can instantiate the Report class. It needs the following parameters:
* ***output_dir***: where the reports generated will be stored, by default it points to the current directory **('.')**
* ***ca_report_filename***: how to rename the report of the content analyzer (if one is generated). Default is **'ca_report'**
* ***rs_report_filename***: how to rename the report of the recsys module (if one is generated). Default is **'rs_report'**
* ***eva_report_filename***: how to rename the report of the evaluation module (if one is generated). Default is **'eva_report'**

In [35]:
# in this case we are satisfied with the default parameters
rep = ut.Report()

Then simply call the *yaml()* method which will produce the three yaml reports.


> You need to pass to it the objects instantiated and used to execute your experiment. For the recsys module and the evaluation module you first need to perform the actual experiment before you are able to obtain a report for them

All the parameters of the function are *optional* so that you decide for which module a yaml report must be produced, in case you performed a partial experiment and only used *some* of the modules offered by the framework

In [36]:
# In this case we generate a full report for the three modules used in the
# experiment in this notebook
rep.yaml(content_analyzer=content_analyzer,
         original_ratings=ratings,
         partitioning_technique=kf,
         recsys=cbrs,
         eval_model=em)

The three yml files are generated in the current directory

In [37]:
import os

assert os.path.isfile('ca_report.yml')
assert os.path.isfile('rs_report.yml')
assert os.path.isfile('eva_report.yml')